Source code related to BoxCars publication
In this paper, we focus on fine-grained recognition of vehicles, mainly in traffic surveillance applications. We propose an approach orthogonal to recent advancements in fine-grained recognition (automatic part discovery, bilinear pooling). Also, in contrast to other methods focused on fine-grained recognition of vehicles, we do not limit ourselves to frontal/rear viewpoints but allow the vehicles to be seen from any viewpoint. Our approach is based on 3D bounding boxes built around the vehicles. The bounding box can be automatically constructed from traffic surveillance data. For scenarios where it is not possible to use the precise construction, we propose a method for estimation of the 3D bounding box. The 3D bounding box is used to normalize the image viewpoint by "unpacking" the image into a plane. We also propose to randomly alter the color of the image and to add a rectangle with random noise at a random position in the image during the training of Convolutional Neural Networks. We have collected a large fine-grained vehicle dataset, BoxCars116k, with 116k images of vehicles from various viewpoints taken by numerous surveillance cameras. We performed a number of experiments which show that our proposed method significantly improves CNN classification accuracy (the accuracy is increased by up to 12 percentage points and the error is reduced by up to 50 % compared to CNNs without the proposed modifications). We also show that our method outperforms state-of-the-art methods for fine-grained recognition.
Fine-grained recognition of vehicles is interesting, both from the application point of view (surveillance, data retrieval, etc.) and from the point of view of general fine-grained recognition research applicable in other fields. For example, Gebru et al.  proposed an estimation of demographic statistics based on fine-grained recognition of vehicles. In this article, we are presenting methodology which considerably increases the performance of multiple state-of-the-art CNN architectures in the task of fine-grained vehicle recognition. We target the traffic surveillance context, namely images of vehicles taken from an arbitrary viewpoint – we do not limit ourselves to frontal/rear viewpoints. As the images are obtained from surveillance cameras, they have challenging properties – they are often small and taken from very general viewpoints (high elevation). We also construct the training and testing sets from images from different cameras as it is common for surveillance applications that it is not known a priori under which viewpoint the camera will be observing the road.
Methods focused on the fine-grained recognition of vehicles usually have some limitations – they can be limited to frontal/rear viewpoints or use 3D CAD models of all the vehicles. Both these limitations are rather impractical for large scale deployment. There are also methods for fine-grained recognition in general which were applied on vehicles. The methods recently follow several main directions – automatic discovery of parts [2, 3], bilinear pooling [4, 5], or exploiting structure of fine-grained labels [6, 7]. Our method is not limited to any particular viewpoint and it does not require 3D models of vehicles at all.
We propose an orthogonal approach to these methods and use CNNs with a modified input to achieve better image normalization and data augmentation (therefore, our approach can be combined with other methods). We use 3D bounding boxes around vehicles to normalize the vehicle image (see Figure 4 for examples). This work is based on our previous conference paper; it pushes the performance further and, mainly, proposes a new method for building the 3D bounding box without any prior knowledge (see Figure 1). Our input modifications significantly increase the classification accuracy (by up to 12 percentage points; the classification error is reduced by up to 50 %).
The key contributions of the paper are:
Complex and thorough evaluation of our previous method .
Our novel data augmentation techniques further improve the results of the fine-grained recognition of vehicles relative both to our previous method and other state-of-the-art methods (Section III-C).
We collected more samples for the BoxCars dataset, almost doubling the dataset size (Section IV).
We will make the collected dataset and source codes for the proposed algorithm publicly available (https://medusa.fit.vutbr.cz/traffic) for future reference and comparison.
In order to provide context to the proposed method, we present a summary of existing fine-grained recognition methods (both general and focused on vehicles).
We divide the fine-grained recognition methods from recent literature into several categories as they usually share some common traits. Methods exploiting annotated model parts (e.g. [9, 10]) are not discussed in detail as it is not common in fine-grained datasets of vehicles to have the parts annotated.
Parts of classified objects may be discriminatory and provide lots of information for the fine-grained classification task. However, it is not practical to assume that the location of such parts is known a priori as it requires significantly more annotation work. Therefore, several papers [11, 12, 13, 14, 3, 2, 15] have dealt with this problem and proposed methods to automatically (during both training and test time) discover and localize such parts. The methods differ mainly in the way the discriminative parts are discovered. The features extracted from the parts are usually classified by SVMs.
Lin et al.  use only convolutional layers from the net for extraction of features which are classified by a bilinear classifier . Gao et al.  followed the path of bilinear pooling and proposed a method for Compact Bilinear Pooling getting the same accuracy as the full bilinear pooling with a significantly lower number of features.
Xie et al. proposed to use a hyper-class for data augmentation and regularization of fine-grained deep learning. Zhou et al. use CNN with Bipartite Graph Labeling to achieve better accuracy by exploiting the fine-grained annotations and coarse body type (e.g. Sedan, SUV). Lin et al. use three neural networks for simultaneous localization, alignment and classification of images. Each of these three networks does one of the three tasks and they are connected into one bigger network. Yao et al. proposed an approach which uses responses to random templates obtained from images and classifies merged representations of the response maps by SVM. Zhang et al. use pose normalization kernels and their responses warped into a feature vector. Chai et al. propose to use segmentation for fine-grained recognition to obtain the foreground parts of an image. A similar approach was also proposed by Li et al.; however, the authors use a segmentation algorithm which is optimized and fine-tuned for the purpose of fine-grained recognition. Finally, Gavves et al. propose to use object proposals to obtain the foreground mask and unsupervised alignment to improve fine-grained classification accuracy.
The goal of fine-grained recognition of vehicles is to identify the exact type of the vehicle, that is, its make, model, submodel, and model year. A recognition system focused only on vehicles (in contrast to general fine-grained classification of birds, dogs, etc.) can benefit from the fact that vehicles are rigid, have some distinguishable landmarks (e.g. license plates), and that rigorous models (e.g. 3D CAD models) can be available.
There is a multitude of papers [22, 23, 24, 25, 26, 27, 28, 29] using a common approach: they detect the license plate (as a common landmark) on the vehicle and extract features from the area around the license plate as the front/rear parts of vehicles are usually discriminative.
There were several approaches on how to deal with viewpoint variance using synthetic 3D models of vehicles. Lin et al. propose to jointly optimize 3D model fitting and fine-grained classification, Hsiao et al.  use detected contour and align the 3D model using 3D chamfer matching. Krause et al.  propose to use synthetic data to train geometry and viewpoint classifiers for the 3D model and 2D image alignment. Prokaj et al.  propose to detect SIFT features on the vehicle image and on every 3D model seen from a set of discretized viewpoints.
Gu et al.  propose extracting the center of a vehicle and roughly estimate the viewpoint from the bounding box aspect ratio. Then, they use different Active Shape Models for alignment of data taken from different viewpoints and use segmentation for background removal.
Stark et al.  propose using an extension of Deformable Parts Model (DPM)  to be able to handle multi-class recognition. The model is represented by latent linear multi-class SVM with HOG  features. The authors show that the system outperforms different methods based on Locally-constrained Linear Coding  and HOG. The recognized vehicles are used for eye-level camera calibration.
Liu et al.  use deep relative distance trained on a vehicle re-identification task and propose training the neural net with Coupled Clusters Loss instead of triplet loss. Boonsim et al.  propose a method for fine-grained recognition of vehicles at night. The authors use relative position and shape of features visible at night (e.g. lights, license plates) to identify the make&model of a vehicle, which is visible from the rear side.
Fang et al.  propose using an approach based on detected parts. The parts are obtained in an unsupervised manner as high activations in a mean response across channels of the last convolutional layer of used CNN. The authors in  introduce spatially weighted pooling of convolutional features in CNNs to extract important features from the image.
Existing methods for the fine-grained classification of vehicles usually have significant limitations. They are either limited to frontal/rear viewpoints [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] or require some knowledge about 3D models of the vehicles [39, 38, 37, 36] which can be impractical when new models of vehicles emerge.
Our proposed method does not have such limitations. The method works with arbitrary viewpoints and we require only 3D bounding boxes of vehicles. The 3D bounding boxes can either be automatically constructed from traffic video surveillance data [49, 50] or we propose a method on how to estimate the 3D bounding boxes both at training and test time from single images (see Section III-D).
There is a large number of datasets of vehicles (e.g [51, 52]) which are usable mainly for vehicle detection, pose estimation, and other tasks. However, these datasets do not contain annotations of the precise vehicles’ make and model.
When it comes to the fine-grained recognition datasets, there are some [41, 38, 36, 33] which are relatively small in number of samples or classes. Therefore, they are impractical for the training of CNN and deployment of real world traffic surveillance applications.
Yang et al.  published a large dataset CompCars. The dataset consists of a web-nature part, made of 136k of vehicles from 1 600 classes taken from different viewpoints. It also contains a surveillance-nature part with 50k frontal images of vehicles taken from surveillance cameras.
Liu et al. published the VeRi-776 dataset for the vehicle re-identification task. The dataset contains over 50k images of 776 vehicles captured by 20 cameras covering a 1.0 km² area in 24 hours. Each vehicle is captured by cameras under different viewpoints, illuminations, resolutions and occlusions. The dataset also provides various attributes, such as bounding boxes, vehicle types, and colors.
In traffic surveillance applications, vehicles usually have to be detected prior to fine-grained classification; therefore, we include a brief overview of existing methods for vehicle detection. It is possible to use standard object detectors based on convolutional neural networks [55, 56], AdaBoost, Deformable Part Models [42, 58], or the Hough transform. There were also attempts to improve vehicle detection specifically, based on geometric information, during the night, or by increasing the accuracy of localization of occluded vehicles.
In agreement with recent progress in Convolutional Neural Networks [63, 64, 65], we use CNNs for both classification and verification (determining whether a pair of vehicles has the same type). However, we propose to use several data normalization and augmentation techniques to significantly boost the classification performance (up to 50 % error reduction compared to the base net). We utilize information about 3D bounding boxes obtained from a traffic surveillance camera. Finally, in order to increase the applicability of our method to scenarios where the 3D bounding box is not known, we propose an algorithm for bounding box estimation both at training and test time.
We based our work on 3D bounding boxes proposed by  (Fig. 4) which can be automatically obtained for each vehicle seen by a surveillance camera (see Figure 2 for schematic 3D bounding box construction process or the original paper  for further details). These boxes allow us to identify the side, roof, and front (or rear) side of vehicles in addition to other information about the vehicles. We use these localized segments to normalize the image of the observed vehicles (considerably boosting the recognition performance).
The normalization is done by unpacking the image into a plane. The plane contains rectified versions of the front/rear ($F$), side ($S$), and roof ($R$). These parts are adjacent to each other (Fig. 3) and they are organized into the final matrix $U$:

$$U = \begin{pmatrix} 0 & R \\ F & S \end{pmatrix}$$
The unpacking itself is done by obtaining homography between points (Fig. 3) and perspective warping parts of the original image. The left top submatrix is filled with zeros. This unpacked version of the vehicle is used instead of the original image to feed the net. The unpacking is beneficial as it localizes parts of the vehicles, normalizes their position in the image and it does all that without the necessity of using DPM or other algorithms for part localization. Later in the text, we will refer to this normalization method as Unpack.
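The assembly of the unpacked image can be illustrated with a minimal sketch. It assumes that the three faces have already been rectified by the homography-based warping (which is omitted here), that images are plain nested lists, and that the roof sits above the side with the front/rear to the left of the side; the function name `unpack_layout` and this exact arrangement are illustrative assumptions, not the released implementation.

```python
def unpack_layout(front, side, roof):
    """Assemble rectified faces into one image.

    Assumed layout (top-left block is zero-filled):
        zeros | roof
        front | side
    """
    h_f, w_f = len(front), len(front[0])
    w_s = len(roof[0])
    # The side must share its height with the front and its width with the roof.
    if len(side) != h_f or len(side[0]) != w_s:
        raise ValueError("side must match front height and roof width")
    top = [[0] * w_f + list(roof_row) for roof_row in roof]
    bottom = [list(f_row) + list(s_row) for f_row, s_row in zip(front, side)]
    return top + bottom


# Tiny illustrative faces; the pixel values only mark which face a cell came from.
front = [[1, 1], [1, 1]]
side = [[2, 2, 2], [2, 2, 2]]
roof = [[3, 3, 3]]
unpacked = unpack_layout(front, side, roof)
```

In this toy case the result has three rows and five columns, with the two zero cells in the top-left corner.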
It is possible to infer additional information about the vehicle from the 3D bounding box, and we found that these data slightly improve the classification and verification performance. One piece of this auxiliary information is the encoded viewpoint (the direction from which the vehicle is observed). We also add a rasterized 3D bounding box as an additional input to the CNNs. Compared to our previously proposed auxiliary data fed to the net, we handle frontal and rear vehicle sides differently.
View. The viewpoint is extracted from the orientation of the 3D bounding box (Fig. 4). We encode the viewpoint as three 2D vectors $v_i$, where $i \in \{f, s, r\}$ (front/rear, side, roof), and pass them to the net. The vectors connect the center of the bounding box with the centers of the box's faces; therefore, each can be computed as $v_i = C_i - C_c$. Point $C_c$ is the center of the bounding box and can be obtained as the intersection of the box's diagonals. Points $C_i$ for $i \in \{f, s, r\}$ denote the centers of the faces, again computed as intersections of the face diagonals. In contrast to our previous approach, which did not take the direction of the vehicle into account, we also encode the vehicle direction (distinguishing vehicles going towards the camera from vehicles going away from it) in order to determine which side of the bounding box is the frontal one. The vectors are normalized to unit length; storing them with a different normalization (e.g. the front one normalized, the others in the proper ratio) did not improve the results.
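The viewpoint encoding described above can be sketched as follows. The sketch assumes that each face is given as four image points in order around the quadrilateral and that the box center is already known; the helper names (`line_intersection`, `face_center`, `viewpoint_vectors`) are hypothetical.

```python
import math


def line_intersection(p1, p2, p3, p4):
    """Intersection of the line through p1, p2 with the line through p3, p4."""
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x4, y4 = p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-12:
        raise ValueError("lines are parallel")
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    x = (a * (x3 - x4) - (x1 - x2) * b) / denom
    y = (a * (y3 - y4) - (y1 - y2) * b) / denom
    return (x, y)


def face_center(quad):
    """Center of a face as the intersection of its two diagonals."""
    return line_intersection(quad[0], quad[2], quad[1], quad[3])


def viewpoint_vectors(box_center, faces):
    """Unit 2D vectors from the box center to each face center."""
    vectors = {}
    for name, quad in faces.items():
        cx, cy = face_center(quad)
        dx, dy = cx - box_center[0], cy - box_center[1]
        norm = math.hypot(dx, dy)
        vectors[name] = (dx / norm, dy / norm)
    return vectors


# Illustrative face whose diagonals intersect at (3, 4).
faces = {"front": [(2, 3), (4, 3), (4, 5), (2, 5)]}
vectors = viewpoint_vectors((0.0, 0.0), faces)
```

For the illustrative face, the resulting unit vector points from the origin towards (3, 4).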
Rast. Another way of encoding the viewpoint and also the relative dimensions of vehicles is to rasterize the 3D bounding box and use it as an additional input to the net. The rasterization is done separately for all sides, each filled by one color. The final rasterized bounding box is then a four-channel image containing each visible face rasterized in a different channel. Formally, each point of the rasterized bounding box is set in the channel of the face whose quadrilateral, defined by the corresponding corner points in Figure 3, contains it.
Finally, the 3D rasterized bounding box is cropped by the 2D bounding box of the vehicle. For an example, see Figure 4, showing rasterized bounding boxes for different vehicles taken from different viewpoints.
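A minimal sketch of the rasterization step is given below. It assumes that each visible face is a convex quadrilateral given by four image points in order, writes one binary channel per face, and omits the final crop by the 2D bounding box; the function names are hypothetical.

```python
def inside_convex_quad(point, quad):
    """True if the point lies inside or on a convex quadrilateral."""
    px, py = point
    crosses = []
    for i in range(4):
        ax, ay = quad[i]
        bx, by = quad[(i + 1) % 4]
        # Sign of the cross product tells on which side of edge a->b the point is.
        crosses.append((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)


def rasterize_faces(width, height, faces):
    """Rasterize each face into its own channel of a height x width image."""
    image = [[[0] * len(faces) for _ in range(width)] for _ in range(height)]
    for channel, quad in enumerate(faces):
        for y in range(height):
            for x in range(width):
                # Test the pixel center against the face quadrilateral.
                if inside_convex_quad((x + 0.5, y + 0.5), quad):
                    image[y][x][channel] = 1
    return image


# Two illustrative faces covering the left and right half of a 4 x 2 image.
faces = [
    [(0, 0), (2, 0), (2, 2), (0, 2)],
    [(2, 0), (4, 0), (4, 2), (2, 2)],
]
raster = rasterize_faces(4, 2, faces)
```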
In order to increase the diversity of the training data, we propose additional data augmentation techniques. The first one (denoted as Color) deals with the fact that for fine-grained recognition of vehicles (and some other objects), their color is irrelevant. The other method (ImageDrop) deals with some potentially missing parts of the vehicle. Examples of the data augmentation are shown in Figure 5. Both these augmentation techniques are applied only with a predefined probability during training; otherwise, the images are not modified. During testing, we do not modify the images at all.
The results presented in Section V-E show that both these modifications improve the classification accuracy both in combination with other presented techniques or by themselves.
Color. In order to increase training samples color variability, we propose to randomly alternate the color of the image. The alternation is done in the HSV color space by adding the same random values to each pixel in the image (each HSV channel is processed separately).
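A minimal sketch of the Color augmentation follows. It assumes RGB pixels as floats in [0, 1], wraps the hue, and clamps saturation and value to [0, 1]; the wrapping, the clamping, and the `max_shift` range are assumptions made for illustration, since the text does not specify them.

```python
import colorsys
import random


def shift_hsv(pixels, dh, ds, dv):
    """Add the same (dh, ds, dv) offset to every pixel in HSV space."""
    shifted = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + dh) % 1.0                  # hue is cyclic (assumption)
        s = min(1.0, max(0.0, s + ds))      # clamp saturation (assumption)
        v = min(1.0, max(0.0, v + dv))      # clamp value (assumption)
        shifted.append(colorsys.hsv_to_rgb(h, s, v))
    return shifted


def random_color_augment(pixels, rng, max_shift=0.1):
    """Draw one random offset per HSV channel and apply it to all pixels."""
    dh, ds, dv = (rng.uniform(-max_shift, max_shift) for _ in range(3))
    return shift_hsv(pixels, dh, ds, dv)


# Shifting the hue of pure red by half a turn gives cyan.
cyan = shift_hsv([(1.0, 0.0, 0.0)], 0.5, 0.0, 0.0)[0]
augmented = random_color_augment([(0.2, 0.4, 0.6)] * 4, random.Random(0))
```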
ImageDrop. Inspired by Zeiler et al. , who evaluated the influence of covering a part of the input image on the probability of the ground truth class, we take this a step further and in order to deal with missing parts on the vehicles, we take a random rectangle in the image and fill it with random noise, effectively dropping any information contained in that part of the image.
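The ImageDrop step can be sketched as below, assuming a single-channel image stored as nested lists of integers and uniform noise in 0..255. The rectangle size limit `max_fraction` is a hypothetical parameter; the text does not state how the rectangle size is drawn.

```python
import random


def image_drop(image, rng, max_fraction=0.5):
    """Fill a random rectangle with random noise; returns a copy and the rectangle."""
    height, width = len(image), len(image[0])
    rect_h = rng.randint(1, max(1, int(height * max_fraction)))
    rect_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    result = [row[:] for row in image]      # leave the input untouched
    for y in range(top, top + rect_h):
        for x in range(left, left + rect_w):
            result[y][x] = rng.randint(0, 255)
    return result, (top, left, rect_h, rect_w)


# -1 marks untouched pixels so the dropped region is easy to see.
image = [[-1] * 8 for _ in range(8)]
dropped, rect = image_drop(image, random.Random(0))
```

During training, such a function would be called only with the predefined probability, and never at test time.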
As the results (Section V) show, the most important part of the proposed algorithm is Unpack followed by Color and ImageDrop. However, the 3D bounding box is required for unpacking the vehicles and we acknowledge that there may be scenarios when such information is not available. For these cases, we propose a method on how to estimate the 3D bounding box for both training and test time when only limited information is available.
As proposed by , the vehicle’s contour and vanishing points are required for the bounding box construction. Therefore, it is necessary to estimate the contour and vanishing points for the vehicle. For estimating the vehicle contour, we use Fully Convolutional Encoder-Decoder network designed by Yang et al.  for general object contour detection and masks with probabilities of vehicles contours for each image pixel. To obtain the final contour, we search for global maxima along line segments from 2D bounding box centers to edge points of the 2D bounding box (see Figure 6 for examples).
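The search for the contour along line segments can be sketched as follows. The sketch assumes the contour probability map is a nested list indexed as `prob_map[y][x]` and samples each segment at a fixed number of points; the sampling density and the function name are illustrative assumptions.

```python
def contour_from_probabilities(prob_map, center, edge_points, samples=50):
    """For each 2D bounding box edge point, pick the most probable contour pixel
    on the segment from the box center to that edge point."""
    height, width = len(prob_map), len(prob_map[0])
    cx, cy = center
    contour = []
    for ex, ey in edge_points:
        best_value, best_point = -1.0, None
        for k in range(samples + 1):
            t = k / samples
            x = int(round(cx + t * (ex - cx)))
            y = int(round(cy + t * (ey - cy)))
            inside = 0 <= x < width and 0 <= y < height
            if inside and prob_map[y][x] > best_value:
                best_value, best_point = prob_map[y][x], (x, y)
        contour.append(best_point)
    return contour


# Illustrative 5 x 5 probability map with two strong contour responses.
prob_map = [[0.0] * 5 for _ in range(5)]
prob_map[2][4] = 0.9
prob_map[0][2] = 0.8
contour = contour_from_probabilities(prob_map, (2, 2), [(4, 2), (2, 0)])
```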
We found out that the exact position of the vanishing point is not required for 3D bounding box construction, but the directions to the vanishing points are much more important. Therefore, we use regression to obtain the directions towards the vanishing points and then assume that the vanishing points are in infinity.
Following the work by Rothe et al., we formulated the regression of the direction towards the vanishing points as a classification task into bins corresponding to angles, and we used ResNet50 with three classification outputs. We found this approach more robust than a direct regression. We added three separate fully connected layers with softmax activation (one for each vanishing point) after the last average pooling in the ResNet50 (see Figure 7). Each of these layers generates probabilities for each vanishing point belonging to the specific direction bin (represented as angles). We quantized the angle space into 60 bins per vanishing point.
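The quantization of directions into bins can be sketched as follows. Only the number of bins (60) is taken from the text; the angle range of -90 to 90 degrees, the resulting equal bin width, and decoding by the center of the most probable bin are assumptions made for this illustration.

```python
NUM_BINS = 60
ANGLE_MIN, ANGLE_MAX = -90.0, 90.0            # assumed range of directions
BIN_WIDTH = (ANGLE_MAX - ANGLE_MIN) / NUM_BINS


def angle_to_bin(angle_deg):
    """Map a direction angle in degrees to its bin index (directions wrap)."""
    wrapped = (angle_deg - ANGLE_MIN) % (ANGLE_MAX - ANGLE_MIN)
    return int(wrapped // BIN_WIDTH)


def bin_to_angle(index):
    """Angle at the center of the given bin."""
    return ANGLE_MIN + (index + 0.5) * BIN_WIDTH


def decode_direction(probabilities):
    """Pick the most probable bin of one softmax output and return its angle."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return bin_to_angle(best)


# A softmax-like output that peaks at bin 30.
probs = [0.0] * NUM_BINS
probs[30] = 1.0
decoded = decode_direction(probs)
```

The classification target for a training sample would then be `angle_to_bin` of its ground truth direction, one target per vanishing point.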
As the training data for the regression, we used the BoxCars116k dataset (Section IV) with the test samples omitted. The directions to the vanishing points were obtained by the method of [49, 50]; the quality of the ground truth bounding boxes was manually verified during annotation of the dataset and imprecise samples were removed by the annotators. To construct the lines on which the vanishing points lie, we use the center of the 2D bounding box. There is a bias in the directions in the training data (some bins have a very low number of samples); however, the sparsely populated bins correspond to directions which are highly unlikely in practice, for example, the first vanishing point direction being close to horizontal.
With all this estimated information it is then possible to construct the 3D bounding box in both training and test time. It is important to note that by using this 3D bounding box estimation, it is possible to use this method outside the scope of traffic surveillance. It is only necessary to train the regressor of vanishing points directions. For the training of such a regressor, it is possible to use either the directions themselves or viewpoints on the vehicle and focal lengths of the images.
Using this estimated bounding box, it is possible to unpack the vehicle image in test time without any additional information required. This enables the usage of the method when the traffic surveillance data are not available. The results in Section V-C show that by using this estimated 3D bounding boxes, our method still significantly outperforms other convolutional neural networks without input modification.
We collected and annotated a new dataset, BoxCars116k. The dataset is focused on images taken from surveillance cameras as it is meant to be useful for traffic surveillance applications. We do not restrict the vehicles to be captured from the frontal side (Fig. 8). We used surveillance cameras mounted near streets and tracked passing vehicles. The cameras were placed at various locations around Brno, Czech Republic and recorded the passing traffic from an arbitrary (reasonable) surveillance viewpoint. Each correctly detected vehicle (by Faster-RCNN trained on the COD20k dataset) is captured in multiple images as it passes by the camera; therefore, we have more visual information about each vehicle.
The dataset is formed by two parts. The first part consists of data from BoxCars21k dataset  which were cleaned up and some imprecise annotations were then corrected (e.g. missing model years for some uncommon vehicle types).
We also collected other data from videos relevant to our previous work [49, 50, 71]. We detected all vehicles, tracked them, and for each track collected images of the respective vehicle. We downsampled the frame rate to avoid collecting multiple, almost identical images of the same vehicle.
The new dataset was annotated by multiple human annotators with an interest in vehicles and sufficient knowledge about vehicle types and models. The annotators were assigned to clean up the processed data from invalid detections and assign exact vehicle type (make, model, submodel, year) for each obtained track. While preparing the dataset for annotation, 3D bounding boxes were constructed for each detected vehicle using the method proposed by . Invalid detections were then distinguished by the annotators based on these constructed 3D bounding boxes. In the cases when all 3D bounding boxes were not constructed precisely, the whole track was invalidated.
Vehicle type annotation reliability is ensured by providing multiple annotations for each valid track. The annotation of a vehicle type is considered correct in the case of at least three identical annotations. Uncertain cases were authoritatively annotated by the authors.
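The agreement rule can be written down as a short sketch. The threshold of three identical annotations is taken from the text; the function name and the label format are hypothetical, and returning `None` stands for an uncertain case that is passed on for manual resolution.

```python
from collections import Counter


def consensus_label(annotations, min_agreement=3):
    """Return the label shared by at least `min_agreement` annotators, else None."""
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agreement else None


# Illustrative labels only; they are not taken from the dataset.
agreed = consensus_label(["make A model X", "make A model X", "make A model X", "make A model Y"])
uncertain = consensus_label(["make A model X", "make A model Y", "make A model X", "make B model Z"])
```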
The tracks in BoxCars21k dataset consist of exactly 3 images per track. In the new part of the dataset, we collect an arbitrary number of images per track (usually more than 3).
The dataset contains 27 496 vehicles (116 286 images) of 45 different makes with 693 fine-grained classes (make & model & submodel & model year) collected from 137 different cameras with a large variation of viewpoints. Detailed statistics about the dataset can be found in Figure 9 and the supplementary material. The distribution of types in the dataset is shown in Figure 9 (top right) and samples from the dataset are in Figure 8. The dataset also includes information about the 3D bounding box for each vehicle and an image with a foreground mask extracted by background subtraction [72, 73]. The dataset has been made publicly available (https://medusa.fit.vutbr.cz/traffic) for future reference and evaluation.
Compared to “web-based” datasets, the new BoxCars116k dataset contains images of vehicles relevant to traffic surveillance which have specific viewpoints (high elevation), usually small images, etc. Compared to other fine-grained surveillance datasets, our dataset provides data with a high variation of viewpoints (see Figure 9 and 3D plots in the supplementary material).
Our task is to provide a dataset for fine-grained recognition in traffic surveillance without any viewpoint constraint. Therefore, we have constructed the splits for training and evaluation in a way which reflects the fact that it is not usually known beforehand from which viewpoints the vehicles will be seen by the surveillance camera.
Thus, for the construction of the splits, we randomly selected cameras and used all tracks from these cameras for training and vehicles from the rest of the cameras for testing. In this way, we are testing the classification algorithms on images of vehicles from previously unseen cameras (viewpoints). This splits selection process implies that some of the vehicles from the test set may be taken under slightly different viewpoints from the ones that are in the training set.
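A camera-disjoint split of this kind can be sketched as follows. The sketch assumes each track is a dictionary with a `camera` key; the fraction of cameras used for training and the seed are hypothetical parameters, as the text does not state how many cameras were selected.

```python
import random


def split_by_camera(tracks, train_fraction, seed=0):
    """Assign whole cameras to training; all remaining cameras go to testing."""
    cameras = sorted({track["camera"] for track in tracks})
    rng = random.Random(seed)
    rng.shuffle(cameras)
    n_train = int(round(len(cameras) * train_fraction))
    train_cameras = set(cameras[:n_train])
    train = [t for t in tracks if t["camera"] in train_cameras]
    test = [t for t in tracks if t["camera"] not in train_cameras]
    return train, test


# Twenty illustrative tracks spread over five cameras.
tracks = [{"id": i, "camera": "cam%d" % (i % 5)} for i in range(20)]
train_tracks, test_tracks = split_by_camera(tracks, 0.6, seed=1)
```

Because the split is made per camera rather than per track, no test image comes from a camera seen during training.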
We constructed two splits. In the first one (hard), we are interested in recognizing the precise type, including the model year. In the other one (medium), we omit the difference in model years and all vehicles of the same subtype (and potentially different model years) are present in the same class. We selected only types which have at least 15 tracks in the training set and at least one track in the testing set. The hard split contains 107 fine-grained classes with 11 653 tracks (51 691 images) for training and 11 125 tracks (39 149 images) for testing. Detailed split statistics can be found in the supplementary material.
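The class selection rule (at least 15 training tracks and at least one testing track per type) can be sketched as follows, assuming one label per track; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def select_classes(train_labels, test_labels, min_train=15, min_test=1):
    """Keep only types with enough training tracks and at least one test track."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    return sorted(
        label
        for label, count in train_counts.items()
        if count >= min_train and test_counts[label] >= min_test
    )


# Type "a" qualifies; "b" has too few training tracks; "c" has no test track.
train_labels = ["a"] * 15 + ["b"] * 14 + ["c"] * 20
test_labels = ["a", "b"]
kept = select_classes(train_labels, test_labels)
```

For the medium split, the same rule would be applied after mapping each label to its type without the model year.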
We thoroughly evaluated our proposed algorithm on the BoxCars116k dataset. First, we evaluated how these methods improved classification accuracy with different nets, compared them to the state of the art, and analyzed how using approximate 3D bounding boxes influence the achieved accuracy. Then, we searched for the main source of improvements, analyzed improvements of different modifications separately, and also evaluated the usability of features from the trained nets for the task of vehicle type identity verification.
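In order to show that our modifications improve the accuracy independently of the net used, we evaluate several of them: AlexNet, VGG16, VGG19, ResNet50, ResNet101, and ResNet152, as well as VGG16 and VGG19 combined with CBL (see Table II).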
As there are several options how to use the proposed modifications of input data and add additional auxiliary data, we define several labels which we will use:
ALL – All five proposed modifications (Unpack, Color, ImageDrop, View, Rast).
IMAGE – Modifications working only on the image level (Unpack, Color, ImageDrop).
CVPR16 – Modifications as proposed in our previous CVPR paper  (Unpack, View, Rast – however, the View and Rast modifications differ from those ones used in this paper as the original modifications do not distinguish between the frontal and rear side of vehicles).
The first experiment evaluated how our modifications improve the classification accuracy of different CNNs.
All the nets were fine-tuned from models pre-trained on ImageNet for approximately 15 epochs, which was sufficient for the nets to converge. We used the same batch size (except for ResNet152, where we had to use a smaller batch size because of GPU memory limitations) and the same hyperparameters for every net (initial learning rate, weight decay, quadratic learning rate decay, loss averaged over 100 iterations). We also used standard data augmentation techniques such as horizontal flip and random shifts of the bounding box. As ResNets do not use fully connected layers, we only use IMAGE modifications for them.
For each net and modification we evaluate the accuracy improvement of the modification in percentage points and also evaluate the classification error reduction.
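The two reported quantities can be made explicit with a short sketch. The formulas follow the usual definitions (difference of accuracies, and relative decrease of the error, where the error is 100 minus the accuracy); the numbers in the example are illustrative only and are not results from the paper.

```python
def accuracy_improvement(base_acc, modified_acc):
    """Improvement in percentage points (both accuracies given in percent)."""
    return modified_acc - base_acc


def error_reduction(base_acc, modified_acc):
    """Relative reduction of the classification error, as a fraction."""
    base_error = 100.0 - base_acc
    modified_error = 100.0 - modified_acc
    return (base_error - modified_error) / base_error


# Illustrative values: going from 80 % to 90 % accuracy.
improvement = accuracy_improvement(80.0, 90.0)
reduction = error_reduction(80.0, 90.0)
```

In this illustration, an improvement of 10 percentage points halves the error, since the error drops from 20 % to 10 %.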
The summary results for both medium and hard splits are shown in Table I and the raw results are in the supplementary material. As we have correspondences between the samples in the dataset and know which samples are from the same track, we are able to use the mean probability across track samples and merge the classification for the whole track. Therefore, we always report the results in the form of single sample accuracy/whole track accuracy. As expected, the results for whole tracks are much better than for single samples. For the traffic surveillance scenario, we consider the whole track accuracy to be more important, as it is rather common to have a full track of observations of the same vehicle.
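Merging per-sample predictions into a track-level prediction by the mean probability can be sketched as follows, assuming each sample yields a list of class probabilities of equal length; the function name is hypothetical.

```python
def track_prediction(sample_probabilities):
    """Average class probabilities over all samples of a track and pick the best class."""
    num_samples = len(sample_probabilities)
    num_classes = len(sample_probabilities[0])
    mean = [
        sum(probs[c] for probs in sample_probabilities) / num_samples
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean


# Three samples of one track; the first one alone would vote for class 0.
track = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
predicted_class, mean_probs = track_prediction(track)
```

Here the track is assigned class 1 even though one of its samples is misclassified on its own, which illustrates why whole-track accuracy tends to exceed single-sample accuracy.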
There are several things which should be noted about the results. The most important one is that our modifications significantly improve classification accuracy (up to +12 percentage points) and reduce classification error (up to 50 % error reduction). Another important fact is that our new modifications push the accuracy much further compared to the original method .
The table also shows that the difference between ALL modifications and IMAGE modifications is negligible and therefore it is reasonable to only use the IMAGE modifications. This also results in CNNs which just use the Unpack modification during test time as the other image modifications (Color, ImageDrop) are used only during fine-tuning of CNNs.
Moreover, the evaluation shows that the results are almost identical for the hard and medium split; therefore, we will only report additional results on the hard split, as it is the main goal to distinguish also the model years. The names for the splits were chosen to be consistent with the original version of dataset  and the small difference between medium and hard split accuracies is caused mainly by the size of the new dataset.
|method||accuracy [%]||speed [FPS]|
|BCNN (VGG-M) ||/|
|BCNN (VGG16) ||/|
|CBL (VGG16) ||/|
|CBL (VGG19) ||/|
|PCM (AlexNet) ||/|
|PCM (VGG19) ||/|
|AlexNet + ALL (ours)||/||580|
|VGG16 + ALL (ours)||/||154|
|VGG19 + ALL (ours)||/||133|
|VGG16+CBL + ALL (ours)||/||146|
|VGG19+CBL + ALL (ours)||/||126|
|Resnet50 + IMAGE (ours)||/||151|
|Resnet101 + IMAGE (ours)||/||93|
|Resnet152 + IMAGE (ours)||/||65|
To examine the performance of our method, we also evaluated other state-of-the-art methods for fine-grained recognition. We used three different algorithms for general fine-grained recognition with published code. In each case, we first used the code to reproduce the results from the respective paper to ensure that we were using the published work correctly. All of the methods use CNNs, and the choice of network influences the accuracy; therefore, the results should be compared with the respective base CNNs.
It was impossible to evaluate methods focused only on fine-grained recognition of vehicles, as they are usually limited to frontal/rear viewpoints or require 3D models of all vehicle types. In the following text, we define a label for each evaluated state-of-the-art method and describe the details of each method separately.
BCNN. Lin et al.  proposed to use Bilinear CNN. We used VGG-M and VGG16 networks in a symmetric setup (details in the original paper), and trained the nets for 30 epochs (the nets converged around the epoch). We also used image flipping to augment the training set.
CBL. This method follows the bilinear pooling approach and reduces the number of output features of the bilinear layers. We used the Caffe implementation of the layer provided by the authors, with 8 192 features. We trained the net using the same hyper-parameters, protocol, and data augmentation as described in Section V-A.
|net||no modification||GT 3D BB||estimated 3D BB|
PCM. Simon et al. propose Part Constellation Models and use neural activations (see the paper for additional details) to obtain the parts of the model. We used AlexNet (BVLC Caffe reference version) and VGG19 as base nets for the method. We used the same hyper-parameters as the authors, except that we increased the number of fine-tuning iterations and cross-validated the parameter of the linear SVM on the training data.
The results of all comparisons can be found in Table II. As the table shows, our method significantly outperforms both standard CNNs [64, 74, 69] and methods for fine-grained recognition [4, 3, 5]. The results for fine-grained recognition methods should be compared against the same base network, since different networks yield different results. Our best accuracy is better by a large margin than that of all other variants (both standard CNNs and fine-grained methods).
To provide approximate information about processing efficiency, we measured how many images the different methods are able to process per second (referred to as FPS). The measurement was done with a GTX1080 and CUDNN whenever possible. In the case of BCNN, we report the numbers given by the authors, because we were forced to save some intermediate data to disk as we were not able to fit all the data into memory (200 GB). The results are also shown in Table II; they show that our input modification decreases the processing speed. However, the speed penalty is small and the method is still usable for real-time processing.
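As an illustration of how a throughput figure of this kind can be obtained, the following minimal sketch times a processing function over a batch of inputs. The helper `measure_fps`, the `process` callable, and the warm-up step are assumptions made for this example; the text does not describe the exact measurement protocol.

```python
import time


def measure_fps(process, images, warmup=1):
    """Return the number of images processed per second.

    `process` is any callable taking one image (hypothetical stand-in
    for a network forward pass); `warmup` calls are excluded from timing.
    """
    for img in images[:warmup]:
        process(img)  # warm-up, not timed (an assumption of this sketch)
    start = time.perf_counter()
    for img in images:
        process(img)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed


# Dummy workload standing in for a CNN forward pass.
fps = measure_fps(lambda img: sum(range(10_000)), list(range(100)))
print(f"{fps:.1f} images per second")
```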
We also evaluated how the results are influenced when the estimated 3D bounding boxes (Section III-D) are used instead of the 3D bounding boxes obtained from the surveillance data (long-time observation of video [49, 50]).
The classification results are shown in Table III; they show that the proposed modifications still significantly improve the accuracy even if only the estimated (and less accurate) 3D bounding box is used. This result is fairly important, as it enables transferring this method to different (non-surveillance) scenarios. The only additional data then required is a reliable training set of directions towards the vanishing points (or viewpoints and focal length) from the vehicles (or other rigid objects).
We were also interested in finding out the main reason why the classification accuracy is improved. We have analyzed several possibilities and found out that the most important aspect is viewpoint difference.
For every training and testing sample, we computed the viewpoint (a unit 3D vector from the center of the vehicle's 3D bounding box), and for each testing sample we found the training sample with the lowest viewpoint difference (see Figure 11). Then, we divided the testing samples into several bins based on the difference angle. For each of these bins, we computed the accuracy of the standard nets without any modifications and of the nets with the proposed modifications. The first bin contains 56 % of the test samples, the middle bins contain 22 % and 17 %, and the last bin contains 5 %. Finally, we obtained the improvement in percentage points for each modification and bin by comparing the net's performance on the data in the bin with and without the modification. The results are displayed in Figure 10.
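The binning procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the viewpoint difference is taken as the angle between two viewpoint vectors, and the bin edges in `EDGES` are placeholders rather than the ones used in the experiments.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two 3D viewpoint vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.degrees(math.acos(cos))


def nearest_viewpoint_diff(test_vp, train_vps):
    """Lowest viewpoint difference between a test sample and any training sample."""
    return min(angle_deg(test_vp, t) for t in train_vps)


def bin_index(diff, upper_edges):
    """Index of the bin; upper_edges bound every bin except the last."""
    for i, edge in enumerate(upper_edges):
        if diff < edge:
            return i
    return len(upper_edges)


def accuracy_per_bin(diffs, correct, upper_edges):
    """Accuracy in percent for each bin (None for an empty bin)."""
    hits = [0] * (len(upper_edges) + 1)
    counts = [0] * (len(upper_edges) + 1)
    for d, c in zip(diffs, correct):
        i = bin_index(d, upper_edges)
        counts[i] += 1
        hits[i] += int(c)
    return [100.0 * h / n if n else None for h, n in zip(hits, counts)]


def improvement_per_bin(acc_base, acc_mod):
    """Improvement in percentage points per bin."""
    return [None if b is None or m is None else m - b
            for b, m in zip(acc_base, acc_mod)]


# Toy data; EDGES are hypothetical bin boundaries in degrees.
EDGES = [10.0, 60.0]
train_vps = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
test_vps = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
diffs = [nearest_viewpoint_diff(t, train_vps) for t in test_vps]
base_correct = [True, False, False]  # net without modifications
mod_correct = [True, True, False]    # net with a modification
gain = improvement_per_bin(accuracy_per_bin(diffs, base_correct, EDGES),
                           accuracy_per_bin(diffs, mod_correct, EDGES))
print(gain)
```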
Several facts should be noted. The first and most important is that the Unpack modification alone significantly improves the accuracy for larger viewpoint differences (the accuracy is improved by more than 20 percentage points for the last bin). The other important fact is that the other modifications (mainly Color and ImageDrop) improve the accuracy further. This improvement is independent of the training-testing viewpoint difference.
We were also curious how different modifications by themselves help to improve the accuracy. We conducted two types of experiments which focus on different aspects of the modifications. The evaluation is not done on ResNets, as we only use IMAGE level modifications with ResNets; thus, we cannot evaluate Rast and View modifications with ResNets.
The first experiment focuses on the influence of each modification by itself. We therefore compute the accuracy improvement (in percentage points) for each modification as $[\text{base net} + \text{modification}] - [\text{base net}]$, where $[\cdot]$ stands for the accuracy of the classifier described by its contents. The results are shown in Table IV. As the table shows, the modifications that contribute the most are Color, Unpack, and ImageDrop.
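The second experiment evaluates how much a given modification contributes to the accuracy improvement when all of the modifications are used. Thus, the improvement is computed as $[\text{base net} + \text{all}] - [\text{base net} + \text{all} - \text{modification}]$, where $[\cdot]$ denotes the accuracy of the classifier described by its contents. See Table V for the results, which confirm the previous findings: Color, Unpack, and ImageDrop are again the most beneficial modifications.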
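The two experiments can be sketched as differences between accuracies of differently configured classifiers. This is a minimal illustration, assuming that the first measure compares the base net with and without a single modification and the second compares the fully modified net with the net lacking that one modification. The function names, the reduced set of two modifications, and all accuracy values are hypothetical.

```python
def standalone_gain(acc, mod):
    """Gain of one modification added alone to the base net."""
    return acc[frozenset([mod])] - acc[frozenset()]


def leave_one_out_gain(acc, mod, all_mods):
    """Gain attributed to one modification when all others are also used."""
    full = frozenset(all_mods)
    return acc[full] - acc[full - {mod}]


# Hypothetical accuracies (percent), keyed by the set of modifications used.
# Only two modifications are listed to keep the example short.
MODS = ["Unpack", "Color"]
acc = {
    frozenset(): 70.0,
    frozenset(["Unpack"]): 75.5,
    frozenset(["Color"]): 74.0,
    frozenset(["Unpack", "Color"]): 80.0,
}

for m in MODS:
    print(m, standalone_gain(acc, m), leave_one_out_gain(acc, m, MODS))
```

The two measures need not agree: a modification can help more (or less) in combination with the others than it does on its own.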
Lastly, we evaluated the quality of the features extracted from the last layer of the convolutional nets for the verification task. By verification, we mean the task of determining whether a pair of vehicle tracks share the same fine-grained type or not. In agreement with previous works in the field, we use the cosine distance between the features for the verification.
We collected 5 million random pairs of vehicle tracks from the test part of the BoxCars116k splits and evaluated the verification on these pairs. As the tracks can contain different numbers of vehicle images, we used 9 random pairs of images for each pair of tracks and took the median distance between these image pairs as the distance between the whole tracks.
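The track-level distance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the cosine distance is assumed to be one minus the cosine similarity, and the image pairs are assumed to be sampled independently with replacement, which the text does not specify.

```python
import math
import random
import statistics


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def track_distance(track_a, track_b, n_pairs=9, rng=None):
    """Median cosine distance over n_pairs random image pairs of two tracks.

    Each track is a list of per-image feature vectors. Pairs are drawn
    with replacement (an assumption of this sketch).
    """
    rng = rng or random.Random()
    dists = [cosine_distance(rng.choice(track_a), rng.choice(track_b))
             for _ in range(n_pairs)]
    return statistics.median(dists)


# Toy features: two tracks of different lengths.
track_1 = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
track_2 = [[0.0, 1.0], [0.1, 1.0]]
print(track_distance(track_1, track_2, rng=random.Random(0)))
```

A pair of tracks would then be declared to share the same fine-grained type when this distance falls below a threshold; sweeping the threshold yields a Precision-Recall curve.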
Precision-Recall curves and Average Precisions are shown in Figure 12. As the results show, our modifications significantly improve the average precision for each CNN in the given task. Moreover, as the figure shows, the method outperforms human performance (black dots in Figure 12), as reported in the previous paper.
This article presents and summarizes multiple algorithmic modifications suitable for CNN-based fine-grained recognition of vehicles. Some of the modifications were originally proposed in a conference paper, while others are results of ongoing research. We also propose a method for obtaining the 3D bounding boxes necessary for the image unpacking (which has the largest impact on performance improvement) without observing a surveillance video, working only with the individual input image. This considerably increases the application potential of the proposed methodology (and the performance for such estimated 3D boxes is only somewhat lower than when “proper” bounding boxes are used). We focused on a thorough evaluation of the methods: we coupled them with multiple state-of-the-art CNN architectures [74, 69] and measured the contribution/influence of individual modifications.
Our method significantly improves the classification accuracy (up to +12 percentage points) and reduces the classification error (up to 50 % error reduction) compared to the base CNNs. Also, our method outperforms other state-of-the-art methods [4, 3, 5] by 9 percentage points in single image accuracy and by 7 percentage points in whole track accuracy.
We collected, processed, and annotated a dataset BoxCars116k targeted to fine-grained recognition of vehicles in the surveillance domain. Contrary to a majority of existing vehicle recognition datasets, the viewpoints are greatly varying and correspond to surveillance scenarios; the existing datasets are mostly collected from web images and the vehicles are typically captured from eye-level positions. This dataset has been made publicly available for future research and evaluation.
This work was supported by The Ministry of Education, Youth and Sports of the Czech Republic from the National Programme of Sustainability (NPU II); project IT4Innovations excellence in science – LQ1602.
M. Stark, J. Krause, B. Pepik, D. Meger, J. Little, B. Schiele, and D. Koller, “Fine-grained categorization for 3D scene understanding,” in BMVC, 2012.
Z. Zivkovic, “Improved adaptive gaussian mixture model for background subtraction,” in ICPR, 2004, pp. 28–31.