In the development of intelligent vehicles, there is a strong need for robust models that can automatically detect the parts and regions of cars, both inside and outside the vehicle. Opening doors on nearby vehicles, for example, must be identified to avoid collisions [zhu2019novel, liu2016radar], while the transition to autonomous driving shifts the focus towards the interior, necessitating a holistic understanding of the cabin for intelligent personal assistants and advanced user interactions [vogel2018emotion]. Part recognition can also be used in production environments for automatic quality control [Luckow.2016b, MurielMazzetto.2020].
Developing methods for automatic object detection in images and videos is a major goal of computer vision. Reliable approaches are particularly challenging in domains in which objects of the same class appear in a wide range of variations and environments. A generic perception system must be able to recognise parts and regions across the non-trivial automotive domain; a single car model can have more than variations [pil2004linking] due to the vast number of equipment options. Additionally, in real-world scenarios, such a system must be able to operate under different, changing, and subpar perspectives and lighting conditions, be robust against human occlusion, and cope with images of varying quality, so that it can be integrated into more complex frameworks. However, to date, literature on generic optical detection systems that meet or even consider all these criteria is lacking. Typically, investigations focus on car parts in a static, lab-like environment. Existing methods designed for fine-grained recognition of vehicles often only function in fixed lighting conditions and with images from specific viewpoints, such as frontal and rear view images [hu2015learning], or extract distinctive parts automatically, but cannot assign a label to them [simon2015neural].
A statistical system that detects regions of car exteriors in 110 images was developed by [chavez2011vision], while [Luckow.2016b] proposes a visual inspection system for the manufacturing process. The latter recognises different vehicle properties related to quality faults of four vehicle models, using AlexNet, GoogleNet, and Inception V3, in 82 000 images from production environments and 106 in-the-wild images from Twitter. The system achieved an F1 score of 87.2 % on the top five classes. Studies based on the Stanford Cars and BMW-10 datasets [krause20133d, sharif2017framework] predict the make, model and year, but not the underlying vehicle topology, while investigations using the BoxCars dataset focus on re-identification in traffic surveillance [sochor2016boxcars, wang2017orientation].
Addressing the gap in the literature, we made a broad empirical comparison of state-of-the-art computer vision methods for automotive part recognition and detection. To do so, we created and labelled more than 12 000 car part images in challenging real-life environments. Additionally, we demonstrate in-domain model transfer capabilities. The remainder of this paper is organised as follows. We provide a brief overview of the deep neural networks utilised in our experiments in Section II. In Section III, we introduce three new datasets with distinctive characteristics suitable for the tasks: Close-CAR (close), consisting of close-shot images under sub-optimal conditions; Mix-CAR (mix), a multi-label dataset covering 18 different BMW models of the last six years, each with up to a hundred equipment options; and MuSe-CAR-Part (mpart), capturing elevated human-vehicle interactions in real-life videos. Next, we explain modifications to the network architectures and chosen settings, including in- and out-of-domain transfer capabilities, for each experiment in Section IV. We discuss the quantitative and qualitative results obtained using these approaches in Section V and propose potential applications in Section VI. On the test set, our best-performing systems achieved an F1 of % (fine-tuned ResNet50) in a single label setting and the Darknet backbone resulted in a mean average precision (mAP) of % on mix and when jointly trained % on mpart. The weights of the best models of each category will be made publicly available upon acceptance.
II Deep Neural Networks for Optical Recognition and Detection
Deep neural networks, extracting high-level features across a large number of layers, form the state-of-the-art in computer vision. We briefly present a number of popular network architectures used in our experiments.
In general, we can express a visual recognition system as any deep neural network (mostly convolutional neural networks, CNNs) represented by a function $f(x) = y$, where $x$ represents an input image, which is mapped (classified) by the non-linear function $f$ to a class label $y$ of the displayed object instance.
[simonyan2014very] introduced CNNs with 16 and 19 layers, VGG16 and VGG19, respectively, which significantly pushed the benchmark on the 2014 ImageNet challenge. ResNet50 [he2016deep] employed residual connections to train networks that were much deeper than the VGG nets, leading to a further increase in performance. Inception architectures [szegedy2016rethinking] are based on the idea of using wide networks that perform multiple convolutions in parallel, instead of ever deeper networks. InceptionV3 is an evolution, adding regularisation and batch normalisation to the auxiliary classifiers and applying label smoothing. Updating the InceptionV3 network, [szegedy2017inception] introduced InceptionResNetV2, which employs residual connections in the inception modules, accelerating the training process. Its processing cost is similar to that of the non-residual InceptionV4, introduced in the same work. Inspired by InceptionV3, Xception [chollet2017xception] was developed, replacing the Inception modules with depth-wise separable convolutions, i.e., convolutions that act separately on the individual channels of the previous layer's output. Xception has a similar number of parameters to InceptionV3, but it is simpler to implement and showed an increase in performance on several benchmarks.
In contrast to these approaches, DenseNet201 [huang2017densely] is a CNN in which each layer has a feed-forward connection to every subsequent layer, not just its immediate successor. This helps the network propagate features and combat the vanishing gradient problem, and it decreases computational cost due to a reduced number of parameters. MobileNetV2 [sandler2018mobilenetv2] was developed to reduce the memory footprint, making deep neural networks more suitable for mobile applications. Its architecture builds upon residual connections between linear bottleneck layers. NASNetLarge and NASNetMobile [zoph2018learning] follow a different concept of optimisation, searching for a building block on a small dataset, followed by a transfer of stacked blocks to a larger dataset. Most of these architectures require specific image pre-processing.
In contrast to recognition, a visual object detection system is often learnt as a regression task, in which the position of an object in a given image is predicted in addition to its class. A predicted object can thus be described by the following properties: the central coordinates X and Y, the height and width of the bounding box, and the confidence of the class.
“You Only Look Once” (YOLO V3) is one of the most efficient and popular object detection frameworks [redmon2018yolov3]. Unlike similar algorithms, all classes and bounding boxes are predicted simultaneously. Learning classes in dependence of each other leads to a performance advantage and improves the contextual understanding of the image. To this end, a video frame is extracted from a video stream and divided into a grid of cells. Each grid cell has a feature vector whose size is the number of anchors multiplied by the sum of a 5-dimensional object vector and the number of classes. The so-called anchors (width-height pairs) enable the network to detect and predict multiple objects of different sizes in parallel and equally efficiently.
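The size of the per-cell feature vector described above can be made concrete with a short sketch. The grid size, the number of anchors per cell, and the helper names `cell_vector_size` and `output_shape` are illustrative assumptions rather than values taken from our configuration; only the 29 part classes correspond to the detection datasets introduced later.

```python
def cell_vector_size(num_anchors: int, num_classes: int) -> int:
    """Length of the prediction vector of a single grid cell.

    Each anchor contributes a 5-dimensional object vector
    (centre X, centre Y, height, width, confidence) plus one score per class.
    """
    return num_anchors * (5 + num_classes)


def output_shape(grid_size: int, num_anchors: int, num_classes: int) -> tuple:
    """Shape of the prediction tensor for a grid_size x grid_size grid."""
    return (grid_size, grid_size, cell_vector_size(num_anchors, num_classes))


# Assumed example values: a 13 x 13 grid, 3 anchors per cell, 29 part classes.
print(output_shape(13, 3, 29))  # (13, 13, 102)
```

Note that the anchors multiply the whole per-object vector, so adding a class increases the vector length by the number of anchors, not by one.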
For the actual prediction functionality, a so-called backbone is used, which corresponds to a neural network. Typically, two different backbones [redmon2018yolov3], derived from the same neural network blocks, are combined with the YOLO framework: Darknet and, for resource-efficient usage (e.g., mobile applications), the parameter-reduced network TinyDarknet. As almost all applications benefit from a resource-efficient implementation, we also evaluated SqueezeNet, which reduces parameters by downsampling and a 'squeeze and expand' process based on a fire module with decreased filter size and input channel number. A detailed description can be found in [iandola2016squeezenet].
III Dataset and Preparation
In this section, we introduce three separately collected and annotated real-world datasets. In each of them, vehicle parts appear in different environments and conditions, giving them unique characteristics.
Close-CAR consists of real-world, close-shot images of two kinds: 1 743 images of interior parts comprising 19 classes, and 1 066 images of exterior parts comprising 10 classes. The interior classes are A/C, A/C infotainment, A/C radio infotainment, A/C radio, armrest, console, cruise control, door inside, floor mats, glass holder, glove compartment, infotainment, radio, roof window, seat, speaker, speedometer, steering wheel, and sun visor; the exterior classes are door ex, door handle, exhaust, foglight, grills, headlight, mirror ex, taillight, tire, and trunk. Every image depicts only a single car part (or, for A/C, radio and infotainment, which are physically located side by side, a combination of them) of various car makes, types (such as SUVs or sedans) and models. Resolution and capture angle to the object vary across the images, which were taken with several hand-held devices, such as the iPhone 5 and 6. Since the photographs were taken under real-world conditions, many suffer from overexposure, underexposure, blurring, reflections from metallic surfaces and shadows. This makes the dataset challenging for generic recognition. Train(ing), devel(opment) and test sets were partitioned in a class-stratified 80 %-10 %-10 % split.
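A class-stratified 80 %-10 %-10 % partition of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to create the published partitions: the function name `stratified_split`, the fixed seed, the rounding rule, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return index lists (train, devel, test), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, devel, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_devel = round(fractions[1] * len(indices))
        train += indices[:n_train]
        devel += indices[n_train:n_train + n_devel]
        # Whatever remains of this class goes to the test partition.
        test += indices[n_train + n_devel:]
    return train, devel, test


# Toy example with two assumed classes of 50 and 30 images.
labels = ["seat"] * 50 + ["tire"] * 30
train, devel, test = stratified_split(labels)
print(len(train), len(devel), len(test))  # 64 8 8
```

Splitting per class keeps the class proportions approximately equal across the three partitions, which matters for classes with few images.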
Mix-CAR is a multi-label, multi-class real-world dataset that contains 15 003 images of cars from 18 BMW models, each with up to 100 different cars and options. The images include a large number of equipment variations (e.g., various styles of painting colours, upholstery/interior trims, wheels, seats, (head-up) display, towbar, loudspeaker and ventilation covers), which enable robust discriminative features to be learnt. The dataset depicts the cars' interior and exterior. We identified 8 113 of the images (4 724 exterior and 3 389 interior) as suitable for labelling 29 car parts and partitioned them in the same way as close. Each picture averages 3.7 bounding boxes with 1 to 15 unique labels.
MuSe-CAR-Part is a subset of the multimodal dataset MuSe-CAR, originally collected from YouTube to study multimodal sentiment analysis in-the-wild [stappen2020muse]. The 300 videos provide complex in-the-wild footage, including a range of shot sizes (close-up, medium, and long), camera motion (free, stable, unstable, zoom, and fixed), moving objects, highly varied backgrounds, and people interacting with the car they are reviewing. We selected 74 videos from 25 different channels and sampled 1 124 frames across several topic segments. In total, 29 classes were labelled according to mix, resulting in 6 146 labels, an average of 5.47 labels per frame.
IV Go-CaRD Experiments
We conducted experiments on four Tesla V100 GPUs (128 GB GPU RAM in total) using TensorFlow.
IV-A Transfer and Joint Learning
To train models on limited datasets efficiently, we used out-of-domain and inner-domain transfer learning as well as inner-domain data injection techniques for joint training. The first initialises the network weights using networks previously trained on large, general-purpose datasets and is used for the recognition systems. This technique increases the training stability of the low-level filters when used on moderate-sized datasets. For the detection systems, we used inner-domain transfer learning by utilising networks trained on the larger dataset mix and fine-tuning these models to predict mpart. Finally, we used inner-domain joint training to smooth training and improve results. Hence, instead of tuning on mpart after the training on mix is finished, we inject increasing proportions of mpart into mix during training while evaluating the improvement on mpart.
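The data injection used for joint training can be illustrated with a small sketch. It assumes that injection means adding a random share of the mpart training samples to the full mix training set; the function name `inject`, the sample identifiers, and the chosen fractions are hypothetical and only serve the illustration.

```python
import random


def inject(source, target, fraction, seed=0):
    """Build a joint training set: all of `source` plus a share of `target`.

    `fraction` is the proportion of target-domain training samples
    (0.0 = none, 1.0 = all) added to the source-domain samples.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_injected = round(fraction * len(target))
    joint = list(source) + rng.sample(list(target), n_injected)
    rng.shuffle(joint)  # mix both domains within the training order
    return joint


# Placeholder sample identifiers standing in for annotated images.
mix_train = [f"mix_{i}" for i in range(100)]
mpart_train = [f"mpart_{i}" for i in range(20)]

for fraction in (0.0, 0.5, 1.0):
    joint = inject(mix_train, mpart_train, fraction)
    print(fraction, len(joint))
```

In such a setup, the model is always evaluated on the devel and test partitions of mpart, so the effect of each injection level can be compared directly.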
IV-B Car Part Recognition
The recognition networks consist of a network base and a head. The base was initialised with weights trained on the ImageNet [deng2009imagenet] dataset of approximately 1.2 million images from 1 000 categories.
The head, corresponding to the final (top) layers of the network, was randomly initialised.
We compared a static (functions as feature extractor) and trainable base in combination with different heads:
b) a parameter-intensive (int) head, in which we add a trainable 2D convolutional layer with 2 048 filters, a kernel size of 3x3, and valid padding, on the frozen base, followed by two fully-connected dense layers (1 024 neurons with a sigmoid and 256 neurons with a ReLU activation function, respectively);
c) full has a trainable base topped with a head consisting of a 2D convolutional layer with 1 024 filters and a kernel size of 3x3, followed by a 1 024-neuron and a 256-neuron fully-connected layer with a sigmoid activation function.
We also evaluated several other combinations and layer configurations. Given very similar results, we decided to omit these for conciseness. The models were trained for up to 400 epochs with a batch size of 32, applying an Adam optimiser with a learning rate of 0.001. All experiments were executed on resized interior and exterior images (224 x 224: DenseNet201, NASNetMobile, ResNet50, VGG16; 299 x 299: InceptionResNetV2, InceptionV3, Xception; 331 x 331: NASNetLarge), reporting the F1 score separately and combined.
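For reference, the F1 score of a single-label classifier can be computed as sketched below. The averaging scheme is not specified in the text, so the unweighted (macro) average over classes used here is an assumption, as are the function names and the toy labels.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class to its F1 score."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example with two assumed classes.
y_true = ["seat", "seat", "tire", "tire"]
y_pred = ["seat", "tire", "tire", "tire"]
print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Evaluating the interior and exterior subsets separately then amounts to calling the same function on the corresponding label lists.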
IV-C Car Part Detection
The backbone networks were trained in a two-step procedure: first, to smooth the training, only the last 3 layers were trained for 50 epochs with a learning rate of 0.001; second, we unfroze all layers, re-initialised the learning rate to 0.0001, and ran up to 200 epochs. This learning rate is further reduced by a factor of 0.1 when the validation loss stagnates for 5 epochs. To increase stability, we applied gradient clipping at 0.5 to SqueezeNet. The input size was set to 416 x 416. The anchors were crafted using k-means; SqueezeNet and Darknet utilised nine anchors, while TinyDarknet utilised six.
For numerical performance comparison, we used mAP, which reflects the area under the interpolated precision-recall curve averaged across all unique recall levels and relies on the Intersection over Union (IoU) to match predictions to the ground truth. A prediction is considered a true positive if the correct class is predicted and the IoU is larger than a threshold. We used three thresholds to report the mAP under poor, moderate, and good fit.
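The IoU criterion behind the true-positive decision can be sketched as follows. Boxes are given as (centre X, centre Y, height, width), matching the object description in Section II; the function names and the example boxes are hypothetical, and the threshold values passed in the example are only illustrative.

```python
def to_corners(box):
    """Convert (centre_x, centre_y, height, width) to (x1, y1, x2, y2)."""
    cx, cy, h, w = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(box_a, box_b):
    """Intersection over Union of two centre-format boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_class, gt_box, gt_class, threshold):
    """Correct class and an IoU larger than the threshold."""
    return pred_class == gt_class and iou(pred_box, gt_box) > threshold


# Two 40 x 40 boxes shifted by 10 pixels along X: overlap 1200, union 2000.
gt = (50, 50, 40, 40)
pred = (60, 50, 40, 40)
print(iou(pred, gt))  # 0.6
print(is_true_positive(pred, "tire", gt, "tire", threshold=0.5))  # True
```

A stricter threshold turns the same prediction into a false positive, which is why the mAP is reported at several IoU levels.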
V Results and Discussion
V-A Quantitative Results
Table I depicts detailed results of car part recognition, demonstrating that models without frozen parameters (full) consistently yield the best results, except for VGG16 (outside) and InceptionResNetV2 (inside). ResNet50 achieved %, followed by Xception and DenseNet201 (combined). Both variants with a frozen network base, using the comprehensive (int) and the simpler (light) head, perform considerably worse. This indicates that fine-tuning the entire model and using a convolutional layer in the head, instead of max-pooling, is worthwhile for the task.
Figure 2 demonstrates that these three networks also have the best efficiency in terms of parameter usage compared to the other ones (indicated by the dotted line), while InceptionResNetV2 clearly underperforms. Of the two models specifically developed for mobile applications, MobileNetV2 clearly outperforms NASNetMobile and also achieves the best overall result in the detection of interior parts with an almost identical number of parameters.
In the more challenging task of detection, Darknet performs best, achieving an mAP (IoU 0.5) of % on mix test, followed by SqueezeNet and TinyDarknet (cf. Table I). Low-resource training (mpart to mpart) fails almost entirely. A purely in-domain transfer from mix to mpart, without transfer learning or injecting training data of the target domain, results in low prediction performance of up to % at an IoU level of 0.2. Using the trained mix weights for a two-step (fine-)tuning and transfer-training procedure, the results improve to up to % on test (IoU 0.2). Varying learning rates and isolated fine-tuning of only a part of the model do not yield better results. This is in contrast to the joint training approach. The results gradually improve as more data from mpart is injected into mix, until both datasets are used entirely, achieving an mAP of %. In this performance region, the IoU threshold also seems to become more relevant. If we examine the results of the two strongest models of each dataset (on test) at class level, both contain tire, headlight, and door handle among the top 5 performing classes, with the worst 5 being almost exclusively interior classes.
V-B Qualitative Results
Figure 1 depicts examples of parts incorrectly recognised by the best-performing architecture. A and B are both predicted as floor mats, probably due to patterns on the fabric below the A/C (A) and backrest (B), indicating a sensitivity to distinctive patterns. Certain classes in less common variations, such as the open glove compartments in C and D, appear to be more vulnerable to variations in lighting. Similarly, images with high contrast combined with reflections, such as the tinted sun roof of a white car (E) and the reflection of a metal exhaust in the dark (F), are more prone to confusion.
Regarding detection, we found that the models still have a high detection rate in distance shots, for example, for door handles on distant, approaching vehicles that are hardly visible to the human eye. We attribute this to the learnt context. For instance, a door handle is always located at the same position on the door, and its relative location only depends on the camera perspective. Our transfer and fine-tuning approach drastically improved the robustness when people interact with an object, especially in the case of minor occlusions due to finger pointing or gestures, such as B and C in Figure 3. With the joint usage of both datasets, the detection of objects that are gripped (A-I.: the bounding box around the sun visor is reduced in size due to the hand-grip) or obscured by human body parts (A-III.: the rear door is largely obscured and not detected) is greatly improved, although still with limitations. One limitation is that, when used on consecutive video frames, the models temporarily 'lose' objects with moving parts, such as when a door or the trunk is opened, as in the example in Figure 4. An additional distinction between classes (open, closed) and the extraction and annotation of similar frames should overcome this issue.
A reliable detection of objects inside vehicles has many potential applications. The interior becomes increasingly important with the evolution of autonomous driving and the growing number of customer functions dedicated to communication and entertainment, in addition to future use cases, such as mobile working by occupants as the car drives to its destination. Object and passenger detection forms a basic component for a model of the car cabin. One example is the monitoring of the driver for unexpected take-over scenarios (cf. Figure 3, D-II. to D-III.) in semi-autonomous driving. Another is gesture control, an intuitive form of user interaction that can be implemented by localising individual fingers and their relation to objects within the vehicle. Which parts of the car a passenger is pointing towards or interacting with is important information for an advanced intelligent vehicle assistant (cf. Figure 3, B and C). Such systems depend on context to develop an understanding of the user's intent [vogel2018emotion]. They can elevate the driving experience through smart assistance upon recognising driver distraction or stress. By learning the passenger's preferences, these models can also provide personalised user interfaces or change the interior configuration for increased comfort. Finally, the contextual input provided by computer vision algorithms and other sensor sources can help intelligent assistants anticipate the passengers' intentions and act proactively.
Applications of part detection outside the vehicle exist in autonomous driving, e.g., collision avoidance with opening parts on nearby vehicles. In case of crashes, part localisation may help reconstruct the accidents for insurance claims. Signs of wear may also be detected [Balitskii.2019], allowing timely maintenance. Automotive production can benefit from part recognition and detection as well, with applications in upcoming humanoid robot generations, process monitoring and quality control. Use cases include verifying manufacturing steps, for example, via visual inspection [Luckow.2016b], or generating models that simulate entire factories [Petschnigg.2020].
Generally, the ability to detect specific parts of vehicles may also be useful for sales and marketing, where the use of multiple modalities is a promising direction for sentiment analysis [Karas.2020]. Incorporating visual information into multimodal approaches is a valuable step towards understanding videos reviewing vehicles, a large number of which are available online [stappen2020muse].
As a potential cornerstone for deep learning tasks in the automobile domain, we made the first attempt to develop a generic, optical car part recognition and detection system for realistic conditions. To do so, we introduced three new datasets, each the biggest of its kind, containing images with varying illumination, obstructions and views from a wide range of makes and models, each with up to 100 equipment variations. On close, we achieved an F1 score of % for the recognition task and an mAP of % on mix (both test) for the detection task. These datasets allowed us to empirically evaluate cross- and in-domain transfer and joint learning concepts of various computer vision models. The architecture jointly trained on mix and mpart performed best for the detection task on the relatively small mpart, with an mAP of %. In addition, the extensive qualitative descriptions provide useful guidance for future work towards the goal of a fully generic system.
We limited our work to a fixed set of hyperparameters, but expect to obtain sporadically better results with a broader hyperparameter search or by adjusting the learning parameters step-wise by hand during training. Additionally, we plan to combine detection and recognition approaches in one model as well as explore human-object interactions with the vehicles more closely and in additional environments. Finally, no global image understanding was explicitly used in this work, but it could be considered in future efforts.
We thank the BMW Group for the provision of the data and computation resources. Thanks also to Dr Judith Dineley for editing the text.