BoxCars: Improving Fine-Grained Recognition of Vehicles using 3D Bounding Boxes in Traffic Surveillance

03/02/2017 ∙ by Jakub Sochor, et al.

In this paper, we focus on fine-grained recognition of vehicles mainly in traffic surveillance applications. We propose an approach orthogonal to recent advancements in fine-grained recognition (automatic part discovery, bilinear pooling). Also, in contrast to other methods focused on fine-grained recognition of vehicles, we do not limit ourselves to frontal/rear viewpoints but allow the vehicles to be seen from any viewpoint. Our approach is based on 3D bounding boxes built around the vehicles. The bounding box can be automatically constructed from traffic surveillance data. For scenarios where it is not possible to use the precise construction, we propose a method for estimation of the 3D bounding box. The 3D bounding box is used to normalize the image viewpoint by "unpacking" the image into a plane. We also propose to randomly alter the color of the image and to add a rectangle with random noise to a random position in the image during the training of Convolutional Neural Networks. We have collected a large fine-grained vehicle dataset BoxCars116k, with 116k images of vehicles from various viewpoints taken by numerous surveillance cameras. We performed a number of experiments which show that our proposed method significantly improves CNN classification accuracy (the accuracy is increased by up to 12 percentage points and the error is reduced by up to 50 % compared to CNNs without the proposed modifications). We also show that our method outperforms state-of-the-art methods for fine-grained recognition.




I Introduction

Fine-grained recognition of vehicles is interesting, both from the application point of view (surveillance, data retrieval, etc.) and from the point of view of general fine-grained recognition research applicable in other fields. For example, Gebru et al. [1] proposed an estimation of demographic statistics based on fine-grained recognition of vehicles. In this article, we present a methodology which considerably increases the performance of multiple state-of-the-art CNN architectures in the task of fine-grained vehicle recognition. We target the traffic surveillance context, namely images of vehicles taken from an arbitrary viewpoint – we do not limit ourselves to frontal/rear viewpoints. As the images are obtained from surveillance cameras, they have challenging properties – they are often small and taken from very general viewpoints (high elevation). We also construct the training and testing sets from images from different cameras, as it is common in surveillance applications that the viewpoint from which the camera will observe the road is not known a priori.

Methods focused on the fine-grained recognition of vehicles usually have some limitations – they can be limited to frontal/rear viewpoints or use 3D CAD models of all the vehicles. Both these limitations are rather impractical for large scale deployment. There are also methods for fine-grained recognition in general which were applied on vehicles. The methods recently follow several main directions – automatic discovery of parts [2, 3], bilinear pooling [4, 5], or exploiting structure of fine-grained labels [6, 7]. Our method is not limited to any particular viewpoint and it does not require 3D models of vehicles at all.

We propose an orthogonal approach to these methods and use CNNs with a modified input to achieve better image normalization and data augmentation (therefore, our approach can be combined with other methods). We use 3D bounding boxes around vehicles to normalize the vehicle image (see Figure 4 for examples). This work is based on our previous conference paper [8]; it pushes the performance further and, most importantly, proposes a new method for building the 3D bounding box without any prior knowledge (see Figure 1). Our input modifications are able to significantly increase the classification accuracy (by up to 12 percentage points; the classification error is reduced by up to 50 %).

The key contributions of the paper are:

  • Complex and thorough evaluation of our previous method [8].

  • Our novel data augmentation techniques further improve the results of the fine-grained recognition of vehicles relative both to our previous method and other state-of-the-art methods (Section III-C).

  • We remove the requirement of the previous method [8] to know the 3D bounding box by estimating the bounding box both at training and test time (Section III-D).

  • We added more samples to the BoxCars dataset, almost doubling its size (Section IV).

We will make the collected dataset and source codes for the proposed algorithm publicly available for future reference and comparison.

II Related Work

In order to provide context to the proposed method, we present a summary of existing fine-grained recognition methods (both general and focused on vehicles).

II-A General Fine-Grained Object Recognition

We divide the fine-grained recognition methods from recent literature into several categories as they usually share some common traits. Methods exploiting annotated model parts (e.g. [9, 10]) are not discussed in detail as it is not common in fine-grained datasets of vehicles to have the parts annotated.

II-A1 Automatic Part Discovery

Parts of classified objects may be discriminatory and provide lots of information for the fine-grained classification task. However, it is not practical to assume that the location of such parts is known a priori as it requires significantly more annotation work. Therefore, several papers [11, 12, 13, 14, 3, 2, 15] have dealt with this problem and proposed methods for automatically discovering and localizing such parts (during both training and test time). The methods differ mainly in the way in which the discriminative parts are discovered. The features extracted from the parts are usually classified by SVMs.

II-A2 Methods using Bilinear Pooling

Lin et al. [4] use only convolutional layers from the net for extraction of features which are classified by a bilinear classifier [16]. Gao et al. [5] followed the path of bilinear pooling and proposed a method for Compact Bilinear Pooling getting the same accuracy as the full bilinear pooling with a significantly lower number of features.

II-A3 Other Methods

Xie et al. [6] proposed to use a hyper-class for data augmentation and regularization of fine-grained deep learning. Zhou et al. [7] use CNN with Bipartite Graph Labeling to achieve better accuracy by exploiting the fine-grained annotations and coarse body type (e.g. Sedan, SUV). Lin et al. [17] use three neural networks for simultaneous localization, alignment and classification of images. Each of these three networks does one of the three tasks and they are connected into one bigger network. Yao et al. [13] proposed an approach which uses responses to random templates obtained from images and classifies merged representations of the response maps by SVM. Zhang et al. [18] use pose normalization kernels and their responses warped into a feature vector. Chai et al. [19] propose to use segmentation for fine-grained recognition to obtain the foreground parts of an image. A similar approach was also proposed by Li et al. [20]; however, the authors use a segmentation algorithm which is optimized and fine-tuned for the purpose of fine-grained recognition. Finally, Gavves et al. [21] propose to use object proposals to obtain the foreground mask and unsupervised alignment to improve fine-grained classification accuracy.

II-B Fine-Grained Recognition of Vehicles

The goal of fine-grained recognition of vehicles is to identify the exact type of the vehicle, that is its make, model, submodel, and model year. A recognition system focused only on vehicles (as opposed to general fine-grained classification of birds, dogs, etc.) can benefit from the fact that vehicles are rigid, have some distinguishable landmarks (e.g. license plates), and that rigorous models (e.g. 3D CAD models) can be available.

II-B1 Methods Limited to Frontal/Rear Images of Vehicles

There is a multitude of papers [22, 23, 24, 25, 26, 27, 28, 29] using a common approach: they detect the license plate (as a common landmark) on the vehicle and extract features from the area around the license plate as the front/rear parts of vehicles are usually discriminative.

There are also papers [30, 31, 32, 33, 34, 35] directly extracting features from frontal images of vehicles by different methods and optionally exploiting the standard structure of parts on the frontal mask of car (e.g. headlights).

II-B2 Methods based on 3D CAD Models

There were several approaches on how to deal with viewpoint variance using synthetic 3D models of vehicles. Lin et al. [36] propose to jointly optimize 3D model fitting and fine-grained classification, Hsiao et al. [37] use detected contour and align the 3D model using 3D chamfer matching. Krause et al. [38] propose to use synthetic data to train geometry and viewpoint classifiers for the 3D model and 2D image alignment. Prokaj et al. [39] propose to detect SIFT features on the vehicle image and on every 3D model seen from a set of discretized viewpoints.

II-B3 Other Methods

Gu et al. [40] propose extracting the center of a vehicle and roughly estimating the viewpoint from the bounding box aspect ratio. Then, they use different Active Shape Models for alignment of data taken from different viewpoints and use segmentation for background removal.

Stark et al. [41] propose using an extension of Deformable Parts Model (DPM) [42] to be able to handle multi-class recognition. The model is represented by latent linear multi-class SVM with HOG [43] features. The authors show that the system outperforms different methods based on Locally-constrained Linear Coding [44] and HOG. The recognized vehicles are used for eye-level camera calibration.

Liu et al. [45] use deep relative distance trained on a vehicle re-identification task and propose training the neural net with Coupled Clusters Loss instead of triplet loss. Boonsim et al. [46] propose a method for fine-grained recognition of vehicles at night. The authors use relative position and shape of features visible at night (e.g. lights, license plates) to identify the make&model of a vehicle, which is visible from the rear side.

Fang et al. [47] propose using an approach based on detected parts. The parts are obtained in an unsupervised manner as high activations in a mean response across channels of the last convolutional layer of used CNN. The authors in [48] introduce spatially weighted pooling of convolutional features in CNNs to extract important features from the image.

II-B4 Summary of Existing Methods

Existing methods for the fine-grained classification of vehicles usually have significant limitations. They are either limited to frontal/rear viewpoints [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] or require some knowledge about 3D models of the vehicles [39, 38, 37, 36] which can be impractical when new models of vehicles emerge.

Our proposed method does not have such limitations. The method works with arbitrary viewpoints and we require only 3D bounding boxes of vehicles. The 3D bounding boxes can either be automatically constructed from traffic video surveillance data [49, 50] or we propose a method on how to estimate the 3D bounding boxes both at training and test time from single images (see Section III-D).

II-C Datasets for Fine-Grained Recognition of Vehicles

There is a large number of datasets of vehicles (e.g. [51, 52]) which are usable mainly for vehicle detection, pose estimation, and other tasks. However, these datasets do not contain annotations of the precise vehicles’ make and model.

When it comes to the fine-grained recognition datasets, there are some [41, 38, 36, 33] which are relatively small in number of samples or classes. Therefore, they are impractical for the training of CNN and deployment of real world traffic surveillance applications.

Yang et al. [53] published a large dataset CompCars. The dataset consists of a web-nature part, made of 136k images of vehicles from 1 600 classes taken from different viewpoints. It also contains a surveillance-nature part with 50k frontal images of vehicles taken from surveillance cameras.

Liu et al. [54] published the dataset VeRi-776 for the vehicle re-identification task. The dataset contains over 50k images of 776 vehicles captured by 20 cameras covering a 1.0 km² area in 24 hours. Each vehicle is captured by cameras under different viewpoints, illuminations, resolutions and occlusions. The dataset also provides various attributes, such as bounding boxes, vehicle types, and colors.

II-D Vehicle Detection

In traffic surveillance applications, it is commonly necessary to detect vehicles prior to fine-grained vehicle classification; therefore, we include a brief overview of existing methods for vehicle detection. It is possible to use standard object detectors – either based on convolutional neural networks [55, 56], AdaBoost [57], Deformable Part Models [42, 58] or Hough Transform [59]. There were also attempts to improve vehicle detection specifically, based on geometric information [60], during night [61], or to increase the accuracy of localization of occluded vehicles [62].

III Proposed Methodology for Fine-Grained Recognition of Vehicles

In agreement with recent progress in Convolutional Neural Networks [63, 64, 65], we use CNNs for both classification and verification (determining whether a pair of vehicles has the same type). However, we propose to use several data normalization and augmentation techniques to significantly boost the classification performance (up to 50 % error reduction compared to the base net). We utilize information about 3D bounding boxes obtained from a traffic surveillance camera [49]. Finally, in order to increase the applicability of our method to scenarios where the 3D bounding box is not known, we propose an algorithm for bounding box estimation both at training and test time.

III-A Image Normalization by Unpacking the 3D Bounding Box

Figure 2: 3D bounding box construction process. Each set of lines with the same color intersects in one vanishing point. See the original paper for full details [49]. The image was adopted from the paper with the authors’ permission.

We based our work on 3D bounding boxes proposed by [49] (Fig. 4) which can be automatically obtained for each vehicle seen by a surveillance camera (see Figure 2 for schematic 3D bounding box construction process or the original paper [49] for further details). These boxes allow us to identify the side, roof, and front (or rear) side of vehicles in addition to other information about the vehicles. We use these localized segments to normalize the image of the observed vehicles (considerably boosting the recognition performance).

The normalization is done by unpacking the image into a plane. The plane contains rectified versions of the front/rear ($F$), side ($S$), and roof ($R$). These parts are adjacent to each other (Fig. 3) and they are organized into the final matrix $U$:

$$U = \begin{pmatrix} 0 & R \\ F & S \end{pmatrix}$$

Figure 3: 3D bounding box and its unpacked version.

The unpacking itself is done by obtaining a homography between the corner points of the bounding box (Fig. 3) and their rectified positions, and by perspective warping the corresponding parts of the original image. The top-left submatrix is filled with zeros. This unpacked version of the vehicle is fed to the net instead of the original image. The unpacking is beneficial as it localizes parts of the vehicles and normalizes their position in the image, and it does all that without the necessity of using DPM or other algorithms for part localization. Later in the text, we will refer to this normalization method as Unpack.
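The unpacking step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbour sampling, the patch sizes, and the block arrangement (zeros top-left, roof top-right, front/rear bottom-left, side bottom-right) are assumptions made for the illustration. Each face is given as four image points in the order top-left, top-right, bottom-right, bottom-left.

```python
import numpy as np

def homography(src, dst):
    """Direct linear transform: 3x3 matrix H with dst ~ H @ src for 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)  # null vector of the 8x9 system

def rectify(image, quad, height, width):
    """Warp the quadrilateral `quad` (x, y corners) to a height x width patch."""
    target = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    H = homography(target, quad)  # maps patch coordinates to image coordinates
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    mapped = H @ pts
    # Nearest-neighbour sampling, clipped to the image (an assumed simplification).
    u = np.rint(mapped[0] / mapped[2]).astype(int).clip(0, image.shape[1] - 1)
    v = np.rint(mapped[1] / mapped[2]).astype(int).clip(0, image.shape[0] - 1)
    return image[v, u].reshape(height, width, *image.shape[2:])

def unpack(image, front, side, roof, h=64, wf=48, ws=96, hr=32):
    """Assemble the unpacked image; all patch sizes are hypothetical defaults."""
    F = rectify(image, front, h, wf)
    S = rectify(image, side, h, ws)
    R = rectify(image, roof, hr, ws)
    zeros = np.zeros((hr, wf) + image.shape[2:], dtype=image.dtype)
    top = np.concatenate([zeros, R], axis=1)
    bottom = np.concatenate([F, S], axis=1)
    return np.concatenate([top, bottom], axis=0)
```

In practice a library routine for perspective warping with interpolation would replace `rectify`; the sketch only shows how the three rectified faces and the zero block fit together.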

Figure 4: Examples of data normalization and auxiliary data fed to nets. Left to right: vehicle with 2D bounding box, computed 3D bounding box, vectors encoding viewpoints on the vehicle (View), unpacked image of the vehicle (Unpack), and rasterized 3D bounding box fed to the net (Rast).

III-B Extended Input to the Neural Nets

Figure 5: Examples of proposed data augmentation techniques. The leftmost image contains the original cropped image of the vehicle and the other images contain augmented versions of the image (Top – Color, Bottom – ImageDrop).

It is possible to infer additional information about the vehicle from the 3D bounding box, and we found that these data slightly improve the classification and verification performance. One piece of this auxiliary information is the encoded viewpoint (direction from which the vehicle is observed). We also add a rasterized 3D bounding box as an additional input to the CNNs. Compared to our previously proposed auxiliary data fed to the net [8], we handle frontal and rear vehicle sides differently.

View. The viewpoint is extracted from the orientation of the 3D bounding box – Fig. 4. We encode the viewpoint as three 2D vectors $v_i$, where $i \in \{f, s, r\}$ (front/rear, side, roof), and pass them to the net. Vectors $v_i$ connect the center of the bounding box with the centers of the box’s faces. Therefore, they can be computed as $v_i = C_i - C_c$. Point $C_c$ is the center of the bounding box and it can be obtained as the intersection of the box diagonals. Points $C_i$ for $i \in \{f, s, r\}$ denote the centers of each face, again computed as intersections of face diagonals. In contrast to our previous approach [8], which did not take the direction of the vehicle into account, we also encode the information about the vehicle direction (one value for vehicles going towards the camera, another for vehicles going away from it) in order to determine which side of the bounding box is the frontal one. The vectors are normalized to have unit size; storing them with a different normalization (e.g. the front one normalized, the others in the proper ratio) did not improve the results.
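The View encoding can be sketched as follows. This is an illustrative sketch only: the function names and the way the box diagonals and face corners are passed in are assumptions, and the direction flag mentioned above is left out.

```python
import numpy as np

def line_intersection(p1, p2, p3, p4):
    """Intersection of the 2D line through p1, p2 with the line through p3, p4."""
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    d1, d2 = p2 - p1, p4 - p3
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    t = ((p3[0] - p1[0]) * d2[1] - (p3[1] - p1[1]) * d2[0]) / denom
    return p1 + t * d1

def face_center(quad):
    """Center of a face as the intersection of its two diagonals."""
    a, b, c, d = quad  # corners listed in order around the face
    return line_intersection(a, c, b, d)

def view_vectors(diagonal_a, diagonal_b, faces):
    """Unit vectors from the box center to each face center.

    diagonal_a, diagonal_b: two box diagonals, each a pair of projected corners.
    faces: dict mapping a face name to its four projected corners.
    """
    center = line_intersection(*diagonal_a, *diagonal_b)
    vectors = {}
    for name, quad in faces.items():
        v = face_center(quad) - center
        vectors[name] = v / np.linalg.norm(v)  # normalize to unit size
    return vectors
```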

Rast. Another way of encoding the viewpoint and also the relative dimensions of vehicles is to rasterize the 3D bounding box and use it as an additional input to the net. The rasterization is done separately for all sides, each filled by one color. The final rasterized bounding box is then a four-channel image containing each visible face rasterized in a different channel. Formally, each point $p$ of the rasterized bounding box is a one-hot vector over the channels: the channel of the face whose quadrilateral contains $p$ is set to 1, and all channels are 0 when $p$ lies outside every visible face. Each face quadrilateral is defined by four corner points of the bounding box in Figure 3.

Finally, the 3D rasterized bounding box is cropped by the 2D bounding box of the vehicle. For an example, see Figure 4, showing rasterized bounding boxes for different vehicles taken from different viewpoints.
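A rough sketch of such a rasterization, restricted to the 2D bounding box, is given below. The channel order (front, rear, side, roof), the function names, and the convexity test are assumptions of the sketch and are not taken from the paper.

```python
import numpy as np

def inside_convex_quad(quad, xs, ys):
    """Boolean mask of the pixels (xs, ys) lying inside a convex quadrilateral."""
    quad = np.asarray(quad, dtype=float)
    signs = []
    for i in range(4):
        (x0, y0), (x1, y1) = quad[i], quad[(i + 1) % 4]
        # Sign of the cross product tells on which side of the edge a pixel is.
        signs.append((x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0))
    signs = np.stack(signs)
    return (signs >= 0).all(axis=0) | (signs <= 0).all(axis=0)

def rasterize(faces, bbox2d):
    """Rasterize faces into one channel each, cropped to the 2D bounding box.

    faces: list with one entry per channel (a quad, or None if the face is
           not visible); the assumed order is front, rear, side, roof.
    bbox2d: (x_min, y_min, x_max, y_max) in pixels, max values exclusive.
    """
    x0, y0, x1, y1 = bbox2d
    ys, xs = np.mgrid[y0:y1, x0:x1]
    out = np.zeros((y1 - y0, x1 - x0, len(faces)), dtype=np.uint8)
    for channel, quad in enumerate(faces):
        if quad is not None:
            out[..., channel] = inside_convex_quad(quad, xs, ys)
    return out
```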

III-C Additional Training Data Augmentation

In order to increase the diversity of the training data, we propose additional data augmentation techniques. The first one (denoted as Color) deals with the fact that for fine-grained recognition of vehicles (and some other objects), their color is irrelevant. The other method (ImageDrop) deals with some potentially missing parts of the vehicle. Examples of the data augmentation are shown in Figure 5. Both these augmentation techniques are applied only with a predefined probability during training; otherwise the images are not modified. During testing, we do not modify the images at all.

The results presented in Section V-E show that both these modifications improve the classification accuracy both in combination with other presented techniques or by themselves.

Color. In order to increase the color variability of the training samples, we propose to randomly alter the color of the image. The alteration is done in the HSV color space by adding the same random values to each pixel in the image (each HSV channel is processed separately).
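A minimal sketch of the Color augmentation follows. It assumes the image has already been converted to HSV with all channels as floats in [0, 1]; the probability, the shift ranges, and the choice to wrap hue while clipping saturation and value are hypothetical and not taken from the paper.

```python
import numpy as np

def color_augment(hsv, rng, prob=0.5, max_shift=(0.1, 0.2, 0.2)):
    """Add one random offset per HSV channel to every pixel of the image.

    hsv: float array (H, W, 3) with values in [0, 1] (assumed input format).
    prob and max_shift are hypothetical parameters of this sketch.
    """
    if rng.random() >= prob:
        return hsv  # image left unmodified
    shift = rng.uniform(-1.0, 1.0, size=3) * np.asarray(max_shift)
    out = hsv + shift  # same offset for all pixels, separate per channel
    out[..., 0] %= 1.0  # hue is cyclic, so wrap it around
    out[..., 1:] = np.clip(out[..., 1:], 0.0, 1.0)  # clip saturation and value
    return out
```

At test time the function would simply not be called, in line with the images being left unmodified during testing.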

ImageDrop. Inspired by Zeiler et al. [66], who evaluated the influence of covering a part of the input image on the probability of the ground truth class, we take this a step further and in order to deal with missing parts on the vehicles, we take a random rectangle in the image and fill it with random noise, effectively dropping any information contained in that part of the image.
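The ImageDrop idea can be sketched in a few lines. The probability, the maximum rectangle size, and the uniform noise distribution are assumptions of this sketch; the text above only states that a random rectangle is filled with random noise.

```python
import numpy as np

def image_drop(image, rng, prob=0.5, max_frac=0.5):
    """Fill a random rectangle of a uint8 image with random noise.

    prob and max_frac (largest rectangle side as a fraction of the image
    side) are hypothetical parameters of this sketch.
    """
    if rng.random() >= prob:
        return image  # image left unmodified
    h, w = image.shape[:2]
    rect_h = int(rng.integers(1, max(1, int(h * max_frac)) + 1))
    rect_w = int(rng.integers(1, max(1, int(w * max_frac)) + 1))
    top = int(rng.integers(0, h - rect_h + 1))
    left = int(rng.integers(0, w - rect_w + 1))
    out = image.copy()
    noise_shape = (rect_h, rect_w) + image.shape[2:]
    out[top:top + rect_h, left:left + rect_w] = rng.integers(
        0, 256, size=noise_shape, dtype=np.uint8)
    return out
```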

III-D Estimation of 3D Bounding Box from a Single Image

As the results (Section V) show, the most important part of the proposed algorithm is Unpack followed by Color and ImageDrop. However, the 3D bounding box is required for unpacking the vehicles and we acknowledge that there may be scenarios when such information is not available. For these cases, we propose a method on how to estimate the 3D bounding box for both training and test time when only limited information is available.

As proposed by [49], the vehicle’s contour and vanishing points are required for the bounding box construction. Therefore, it is necessary to estimate the contour and vanishing points for the vehicle. For estimating the vehicle contour, we use the Fully Convolutional Encoder-Decoder network designed by Yang et al. [67] for general object contour detection, which produces masks with the probability of a vehicle contour for each image pixel. To obtain the final contour, we search for global maxima along line segments from the 2D bounding box center to edge points of the 2D bounding box (see Figure 6 for examples).
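The search along line segments might look like the following sketch. It assumes the probability mask is already cropped to the 2D bounding box; the number of samples per segment, the use of every border pixel as an edge point, and nearest-pixel sampling are choices made for the illustration.

```python
import numpy as np

def contour_from_probability(prob, samples=64):
    """Pick the most probable contour pixel on each center-to-border segment.

    prob: 2D array of contour probabilities cropped to the 2D bounding box.
    samples: number of points tested per segment (hypothetical parameter).
    """
    h, w = prob.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    # Edge points: every pixel on the border of the box, walked clockwise.
    border = ([(x, 0) for x in range(w)]
              + [(w - 1, y) for y in range(1, h)]
              + [(x, h - 1) for x in range(w - 2, -1, -1)]
              + [(0, y) for y in range(h - 2, 0, -1)])
    t = np.linspace(0.0, 1.0, samples)
    contour = []
    for ex, ey in border:
        xs = np.rint(cx + t * (ex - cx)).astype(int)
        ys = np.rint(cy + t * (ey - cy)).astype(int)
        best = int(np.argmax(prob[ys, xs]))  # global maximum on this segment
        contour.append((int(xs[best]), int(ys[best])))
    return contour
```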

We found that the exact position of the vanishing points is not required for 3D bounding box construction; the directions towards the vanishing points are much more important. Therefore, we use regression to obtain the directions towards the vanishing points and then assume that the vanishing points lie at infinity.

Figure 6: Estimation of 3D bounding box. Left to right: image with vehicle 2D bounding box, output of contour object detector [67], our constructed contour, estimated directions towards vanishing points, ground truth (green) and estimated (red) 3D bounding box.
Figure 7: Used CNN for estimation of directions towards vanishing points. The vehicle image is fed to ResNet50 with 3 separate outputs which predict probabilities for directions of vanishing points as probabilities in a quantized angle space (60 bins from $-90^\circ$ to $90^\circ$).

Following the work by Rothe et al. [68], we formulated the regression of the direction towards the vanishing points as a classification task into bins corresponding to angles and we used ResNet50 [69] with three classification outputs. We found this approach more robust than a direct regression. We added three separate fully connected layers with softmax activation (one for each vanishing point) after the last average pooling in the ResNet50 (see Figure 7). Each of these layers generates probabilities for each vanishing point belonging to the specific direction bin (represented as angles). We quantized the angle space by bins of $3^\circ$ from $-90^\circ$ to $90^\circ$ (60 bins per vanishing point in total).
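The quantization of directions into 60 bins can be sketched as below. The angle range of -90 to 90 degrees (so that each bin spans 3 degrees) is an assumption of the sketch, as is decoding a prediction by taking the center of the most probable bin; the names are hypothetical.

```python
N_BINS = 60
ANGLE_MIN, ANGLE_MAX = -90.0, 90.0  # assumed range of direction angles
BIN_WIDTH = (ANGLE_MAX - ANGLE_MIN) / N_BINS

def angle_to_bin(angle_deg):
    """Class index for a direction angle; directions repeat every 180 degrees."""
    wrapped = (angle_deg - ANGLE_MIN) % (ANGLE_MAX - ANGLE_MIN)
    return int(wrapped // BIN_WIDTH)

def bin_to_angle(index):
    """Angle at the center of a bin."""
    return ANGLE_MIN + (index + 0.5) * BIN_WIDTH

def decode(probabilities):
    """Direction estimate as the center of the most probable bin (an assumption)."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return bin_to_angle(best)
```

With three such outputs, one per vanishing point, each softmax vector would be decoded independently.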

As the training data for the regression we used the BoxCars116k dataset (Section IV) with the test samples omitted. The directions to the vanishing points were obtained by the method of [49, 50]; however, the quality of the ground truth bounding boxes was manually verified during annotation of the dataset and imprecise samples were removed by the annotators. To construct the lines on which the vanishing points lie, we use the center of the 2D bounding box. There is a bias in the directions in the training data (some bins have a very low number of samples); however, this reflects the data: it is highly unlikely that, for example, the first vanishing point direction will be close to horizontal.

With all this estimated information it is then possible to construct the 3D bounding box in both training and test time. It is important to note that by using this 3D bounding box estimation, it is possible to use this method outside the scope of traffic surveillance. It is only necessary to train the regressor of vanishing points directions. For the training of such a regressor, it is possible to use either the directions themselves or viewpoints on the vehicle and focal lengths of the images.

Using this estimated bounding box, it is possible to unpack the vehicle image at test time without any additional information required. This enables the usage of the method when traffic surveillance data are not available. The results in Section V-C show that with these estimated 3D bounding boxes, our method still significantly outperforms other convolutional neural networks without input modification.

IV BoxCars116k Dataset

Figure 8: Collage of random samples from the BoxCars116k dataset.

We collected and annotated a new dataset BoxCars116k. The dataset is focused on images taken from surveillance cameras as it is meant to be useful for traffic surveillance applications. We do not restrict the vehicles to be captured from the frontal side (Fig. 8). We used surveillance cameras mounted near streets and tracked passing vehicles. The cameras were placed at various locations around Brno, Czech Republic and recorded the passing traffic from an arbitrary (reasonable) surveillance viewpoint. Each correctly detected vehicle (by Faster-RCNN [55] trained on the COD20k dataset [70]) is captured in multiple images as it passes by the camera; therefore, we have more visual information about each vehicle.

IV-A Dataset Acquisition

The dataset is formed by two parts. The first part consists of data from BoxCars21k dataset [8] which were cleaned up and some imprecise annotations were then corrected (e.g. missing model years for some uncommon vehicle types).

We also collected other data from videos relevant to our previous work [49, 50, 71]. We detected all vehicles, tracked them and for each track collected images of the respective vehicle. We downsampled the frame rate of the videos to avoid collecting multiple, almost identical images of the same vehicle.

The new dataset was annotated by multiple human annotators with an interest in vehicles and sufficient knowledge about vehicle types and models. The annotators were assigned to clean up the processed data from invalid detections and assign exact vehicle type (make, model, submodel, year) for each obtained track. While preparing the dataset for annotation, 3D bounding boxes were constructed for each detected vehicle using the method proposed by [49]. Invalid detections were then distinguished by the annotators based on these constructed 3D bounding boxes. In the cases when all 3D bounding boxes were not constructed precisely, the whole track was invalidated.

Vehicle type annotation reliability is guaranteed by providing multiple annotations for each valid track. The annotation of a vehicle type is considered correct in the case of at least three identical annotations. Uncertain cases were authoritatively annotated by the authors.
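The agreement rule can be expressed as a short sketch. The function name and the label strings in the usage example are hypothetical; only the threshold of three identical annotations comes from the text.

```python
from collections import Counter

def consensus_label(annotations, min_agreement=3):
    """Return the label given by at least `min_agreement` annotators, else None.

    A None result marks an uncertain track that needs an authoritative decision.
    """
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agreement else None
```

For example, `consensus_label(["octavia", "octavia", "fabia", "octavia"])` yields `"octavia"`, while a track with only two matching annotations would be left for manual resolution.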

The tracks in BoxCars21k dataset consist of exactly 3 images per track. In the new part of the dataset, we collect an arbitrary number of images per track (usually more than 3).

IV-B Dataset Statistics

Figure 9: BoxCars116k dataset statistics – top left: 2D bounding box dimensions, top right: number of samples per fine-grained type, bottom left: azimuth distribution ($0^\circ$ denotes frontal viewpoint), bottom right: elevation distribution.

The dataset contains 27 496 vehicles (116 286 images) of 45 different makes with 693 fine-grained classes (make & model & submodel & model year) collected from 137 different cameras with a large variation of viewpoints. Detailed statistics about the dataset can be found in Figure 9 and the supplementary material. The distribution of types in the dataset is shown in Figure 9 (top right) and samples from the dataset are in Figure 8. The dataset also includes information about the 3D bounding box [49] for each vehicle and an image with a foreground mask extracted by background subtraction [72, 73]. The dataset has been made publicly available for future reference and evaluation.

Compared to “web-based” datasets, the new BoxCars116k dataset contains images of vehicles relevant to traffic surveillance which have specific viewpoints (high elevation), usually small images, etc. Compared to other fine-grained surveillance datasets, our dataset provides data with a high variation of viewpoints (see Figure 9 and 3D plots in the supplementary material).

IV-C Training & Test Splits

Our task is to provide a dataset for fine-grained recognition in traffic surveillance without any viewpoint constraint. Therefore, we have constructed the splits for training and evaluation in a way which reflects the fact that it is not usually known beforehand from which viewpoints the vehicles will be seen by the surveillance camera.

Thus, for the construction of the splits, we randomly selected cameras and used all tracks from these cameras for training and vehicles from the rest of the cameras for testing. In this way, we are testing the classification algorithms on images of vehicles from previously unseen cameras (viewpoints). This splits selection process implies that some of the vehicles from the test set may be taken under slightly different viewpoints from the ones that are in the training set.
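A camera-disjoint split of this kind can be sketched as follows. The track representation (dictionaries with a `camera` key), the training fraction, and the seed are assumptions made for the illustration.

```python
import random

def split_by_camera(tracks, train_fraction=0.5, seed=0):
    """Split tracks so that no camera contributes to both training and testing.

    tracks: list of dicts with at least a "camera" key (hypothetical format).
    train_fraction: share of cameras used for training (hypothetical value).
    """
    cameras = sorted({track["camera"] for track in tracks})
    rng = random.Random(seed)
    rng.shuffle(cameras)
    n_train = round(len(cameras) * train_fraction)
    train_cameras = set(cameras[:n_train])
    train = [t for t in tracks if t["camera"] in train_cameras]
    test = [t for t in tracks if t["camera"] not in train_cameras]
    return train, test
```

Splitting on cameras rather than on individual tracks is what makes the test viewpoints previously unseen.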

We constructed two splits. In the first one (hard), we are interested in recognizing the precise type, including the model year. In the other one (medium), we omit the difference in model years and all vehicles of the same subtype (and potentially different model years) are present in the same class. We selected only types which have at least 15 tracks in the training set and at least one track in the testing set. The hard split contains 107 fine-grained classes with 11 653 tracks (51 691 images) for training and 11 125 tracks (39 149 images) for testing. Detailed split statistics can be found in the supplementary material.
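The class selection and the difference between the two splits can be sketched like this. Representing a label as a tuple `(make, model, submodel, year)` and the function names are assumptions of the sketch; the thresholds of 15 training tracks and one test track come from the text.

```python
from collections import Counter

def medium_label(label):
    """Drop the model year from a (make, model, submodel, year) label."""
    return label[:3]

def select_classes(train_tracks, test_tracks, min_train=15, min_test=1):
    """Keep classes with enough tracks in both the training and the test set."""
    train_counts = Counter(track["label"] for track in train_tracks)
    test_counts = Counter(track["label"] for track in test_tracks)
    return {label for label, n in train_counts.items()
            if n >= min_train and test_counts[label] >= min_test}
```

For the medium split, the labels would first be mapped through `medium_label` so that vehicles differing only in model year fall into the same class.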

V Experiments

We thoroughly evaluated our proposed algorithm on the BoxCars116k dataset. First, we evaluated how these methods improve classification accuracy with different nets, compared them to the state of the art, and analyzed how using approximate 3D bounding boxes influences the achieved accuracy. Then, we searched for the main source of improvements, analyzed improvements of different modifications separately, and also evaluated the usability of features from the trained nets for the task of vehicle type identity verification.

In order to show that our modifications improve the accuracy independently of the nets used, we use several of them:

  • AlexNet [64]

  • VGG16, VGG19 [74]

  • ResNet50, ResNet101, ResNet152 [69]

  • CNNs with Compact Bilinear Pooling layer [5] in combination with VGG nets denoted as VGG16+CBL and VGG19+CBL.

As there are several options how to use the proposed modifications of input data and add additional auxiliary data, we define several labels which we will use:

  • ALL – All five proposed modifications (Unpack, Color, ImageDrop, View, Rast).

  • IMAGE – Modifications working only on the image level (Unpack, Color, ImageDrop).

  • CVPR16 – Modifications as proposed in our previous CVPR paper [8] (Unpack, View, Rast – however, the View and Rast modifications differ from those ones used in this paper as the original modifications do not distinguish between the frontal and rear side of vehicles).

V-A Improvements for Different CNNs

The first experiment evaluated how our modifications improve classification accuracy for different CNNs.

split    modif.   improvement [pp]            error reduction [%]
                  mean         best           mean          best
medium   ALL      7.49/6.29    11.84/10.99    26.83/34.50   36.71/50.32
medium   IMAGE    7.19/6.15    12.09/11.63    27.38/36.21   35.23/49.55
medium   CVPR16   2.99/3.18    5.22/5.65      10.86/17.71   19.76/32.25
hard     ALL      7.00/5.83    11.14/10.85    25.59/33.52   33.40/48.76
hard     IMAGE    6.74/5.81    11.02/10.53    26.12/35.95   33.04/47.33
hard     CVPR16   2.12/2.44    3.56/3.92      7.93/14.57    12.68/24.10

Table I: Summary statistics of improvements by our proposed modifications for different CNNs. The improvements over baseline CNNs are reported as single sample accuracy/track accuracy in percentage points. We also present classification error reduction in the same format. The raw numbers can be found in the supplementary material.

All the nets were fine-tuned from models pre-trained on ImageNet for approximately 15 epochs, which was sufficient for the nets to converge. We used the same batch size (except for ResNet152, where we had to use a smaller batch size because of GPU memory limitations), the same initial learning rate, the same weight decay, and quadratic learning rate decay for every net; the loss was averaged over 100 iterations. We also used standard data augmentation techniques such as horizontal flipping and randomly moving the bounding box [74]. As ResNets do not use fully connected layers, we only use the IMAGE modifications for them.

For each net and modification, we evaluate the accuracy improvement in percentage points as well as the classification error reduction.
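The two measures can be illustrated with a minimal sketch. It assumes that accuracies are given in percent and that the classification error is 100 minus the accuracy; the function names and the numbers are illustrative only and are not values from our experiments.

```python
def improvement_pp(acc_base, acc_mod):
    """Accuracy improvement in percentage points."""
    return acc_mod - acc_base


def error_reduction_pct(acc_base, acc_mod):
    """Relative reduction of the classification error in percent."""
    err_base = 100.0 - acc_base
    err_mod = 100.0 - acc_mod
    if err_base == 0:
        raise ValueError("baseline error is zero; reduction is undefined")
    return 100.0 * (err_base - err_mod) / err_base


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
base, modified = 80.0, 90.0
print(improvement_pp(base, modified))       # 10.0
print(error_reduction_pct(base, modified))  # 50.0
```

The example shows why the two numbers differ in scale: a gain of 10 percentage points on an 80 % baseline removes half of the remaining errors.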

The summary results for both the medium and hard splits are shown in Table I, and the raw results are in the supplementary material. As we have correspondences between the samples in the dataset and know which samples come from the same track, we are able to compute the mean probability across the track samples and merge the classification for the whole track. Therefore, we always report the results in the form of single sample accuracy/whole track accuracy. As expected, the results for whole tracks are much better than for single samples. For the traffic surveillance scenario, we consider the whole track accuracy to be more important, as it is rather common to have a full track of observations of the same vehicle.
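The track-level merging can be sketched as follows. This is a minimal illustration assuming that each image of a track yields a class probability vector; the function name and the sample values are hypothetical.

```python
import numpy as np


def track_prediction(sample_probs):
    """Merge per-sample class probabilities of one track.

    sample_probs: array-like of shape (n_samples, n_classes); each row is
    the probability vector produced by the net for one image of the track.
    Returns the predicted class index and the mean probability vector.
    """
    probs = np.asarray(sample_probs, dtype=float)
    mean_probs = probs.mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs


# Hypothetical track with three images and three classes.
track = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
label, mean_probs = track_prediction(track)
print(label)  # 1
```

In this example, the first image alone would be assigned to class 0, while the averaged probabilities assign the whole track to class 1.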

There are several things which should be noted about the results. The most important one is that our modifications significantly improve classification accuracy (up to +12 percentage points) and reduce classification error (up to 50 % error reduction). Another important fact is that our new modifications push the accuracy much further compared to the original method [8].

The table also shows that the difference between ALL modifications and IMAGE modifications is negligible and therefore it is reasonable to only use the IMAGE modifications. This also results in CNNs which just use the Unpack modification during test time as the other image modifications (Color, ImageDrop) are used only during fine-tuning of CNNs.

Moreover, the evaluation shows that the results are almost identical for the hard and medium splits; therefore, we only report additional results on the hard split, as the main goal is to distinguish the model years as well. The names of the splits were chosen to be consistent with the original version of the dataset [8], and the small difference between the medium and hard split accuracies is caused mainly by the size of the new dataset.

method accuracy [%] speed [FPS]
AlexNet [64] /
VGG16 [74] /
VGG19 [74] /
ResNet50 [69] /
ResNet101 [69] /
ResNet152 [69] /
BCNN (VGG-M) [4] /
BCNN (VGG16) [4] /
CBL (VGG16) [5] /
CBL (VGG19) [5] /
PCM (AlexNet) [3] /
PCM (VGG19) [3] /
AlexNet + ALL (ours) / 580
VGG16 + ALL (ours) / 154
VGG19 + ALL (ours) / 133
VGG16+CBL + ALL (ours) / 146
VGG19+CBL + ALL (ours) / 126
ResNet50 + IMAGE (ours) / 151
ResNet101 + IMAGE (ours) / 93
ResNet152 + IMAGE (ours) / 65
Table II: Comparison of different vehicle fine-grained recognition methods. Accuracy is reported as single image accuracy/whole track accuracy. Processing speed was measured on a machine with a GTX1080 and CUDNN; for BCNN, the FPS is the one reported by the authors.

V-B Comparison with the State of the Art

To examine the performance of our method, we also evaluated other state-of-the-art methods for fine-grained recognition. We used three different algorithms for general fine-grained recognition with published code. We always first used the code to reproduce the results in the respective papers to ensure that we were using the published work correctly. All of the methods use CNNs, and the net used influences the accuracy; therefore, the results should be compared with the respective base CNNs.

It was impossible to evaluate methods focused only on fine-grained recognition of vehicles, as they are usually limited to frontal/rear viewpoints or require 3D models of vehicles for all the types. In the following text, we define a label for each evaluated state-of-the-art method and describe the details of each method separately.

BCNN. Lin et al. [4] proposed to use Bilinear CNN. We used VGG-M and VGG16 networks in a symmetric setup (details in the original paper), and trained the nets for 30 epochs, by which point they had converged. We also used image flipping to augment the training set.

CBL. We modified compatible nets with Compact BiLinear Pooling proposed by [5], which followed the work of [4] and reduced the number of output features of the bilinear layers. We used the Caffe implementation of the layer provided by the authors and used 8 192 features. We trained the net using the same hyper-parameters, protocol, and data augmentation as described in Section V-A.


net no modification GT 3D BB estimated 3D BB
AlexNet / / /
VGG16 / / /
VGG19 / / /
VGG16+CBL / / /
VGG19+CBL / / /
ResNet50 / / /
ResNet101 / / /
ResNet152 / / /
Table III: Comparison of classification accuracy (percent) on the hard split with standard nets without any modifications, IMAGE modifications using 3D bounding box from surveillance data, and IMAGE modifications using estimated 3D BB (Section III-D).
Figure 10: Correlation of improvement relative to CNNs without modification with respect to train-test viewpoint difference. The x-axis contains viewpoint difference bins (in degrees), and the y-axis denotes the improvement compared to the base net in percentage points; see Section V-D for details. The graphs show that with increasing viewpoint difference, the accuracy improvement of our method increases. Only one representative of each CNN family (AlexNet, VGG, ResNet, VGG+CBL) is displayed; results for all CNNs are in the supplementary material.

PCM. Simon et al. [3] propose Part Constellation Models and use neural activations (see the paper for additional details) to get the parts of the model. We used AlexNet (BVLC Caffe reference version) and VGG19 as base nets for the method. We used the same hyper-parameters as the authors, with the exception of the number of fine-tuning iterations, which was increased, and the parameter of the linear SVM, which was cross-validated on the training data.

The results of all comparisons can be found in Table II. As the table shows, our method significantly outperforms both standard CNNs [64, 74, 69] and methods for fine-grained recognition [4, 3, 5]. The results of the fine-grained recognition methods should be compared with those of the same base network, as the methods yield different results for different networks. Our best accuracy is better by a large margin compared to all other variants (both standard CNNs and fine-grained methods).

To provide approximate information about processing efficiency, we measured how many images the different methods are able to process per second (referred to as FPS). The measurement was done with a GTX1080 and CUDNN whenever possible. In the case of BCNN, we report the numbers given by the authors, as we were forced to save some intermediate data to disk because we were not able to fit all the data into memory (200 GB). The results are also shown in Table II; they show that our input modification decreases the processing speed; however, the speed penalty is small and the method is still usable for real-time processing.

V-C Influence of Using Estimated 3D Bounding Boxes instead of the Surveillance Ones

We also evaluated how the results are influenced when the estimated 3D bounding boxes (Section III-D) are used instead of the 3D bounding boxes obtained from the surveillance data (long-time observation of video [49, 50]).

The classification results are shown in Table III; they show that the proposed modifications still significantly improve the accuracy even if only the estimated 3D bounding box (the less accurate one) is used. This result is fairly important, as it allows the method to be transferred to different (non-surveillance) scenarios. The only additional data then required is a reliable training set of directions towards the vanishing points (or viewpoints and focal length) from the vehicles (or other rigid objects).

V-D Impact of Training/Testing Viewpoint Difference

Figure 11: Examples of viewpoint difference between the training and testing sets. Each pair shows a testing sample (left) and its corresponding “nearest” training sample (right); by “nearest” we mean the sample with the lowest angle between its viewpoint and the test sample’s viewpoint.

We were also interested in finding out the main reason why the classification accuracy is improved. We have analyzed several possibilities and found out that the most important aspect is viewpoint difference.

For every training and testing sample, we computed the viewpoint (a unit 3D vector from the center of the vehicle's 3D bounding box), and for each testing sample we found the training sample with the lowest viewpoint difference (see Figure 11). Then, we divided the testing samples into several bins based on the difference angle. For each of these bins, we computed the accuracy of the standard nets without any modifications and of the nets with the proposed modifications. The first bin contains 56% of the test samples, the middle bins contain 22% and 17%, and the last bin contains 5%. Finally, we obtained the improvement in percentage points for each modification and bin by comparing the net's performance on the data in the bin with and without the modification. The results are displayed in Figure 10.
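The binning procedure can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, per-sample correctness is assumed to be available as 0/1 flags, and the bin edges in the usage example are placeholders rather than the edges used in our experiments.

```python
import numpy as np


def nearest_viewpoint_angles(train_views, test_views):
    """Angle (degrees) from each test viewpoint to its nearest training viewpoint.

    Both inputs are array-likes of shape (n, 3) holding viewpoint vectors.
    """
    train = np.asarray(train_views, dtype=float)
    test = np.asarray(test_views, dtype=float)
    # Normalize so that the dot product equals the cosine of the angle.
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    test = test / np.linalg.norm(test, axis=1, keepdims=True)
    cosines = np.clip(test @ train.T, -1.0, 1.0)
    # The largest cosine corresponds to the smallest angle.
    return np.degrees(np.arccos(cosines.max(axis=1)))


def improvement_per_bin(angles, correct_base, correct_mod, bin_edges):
    """Accuracy improvement (percentage points) inside each angle bin."""
    angles = np.asarray(angles, dtype=float)
    correct_base = np.asarray(correct_base, dtype=float)
    correct_mod = np.asarray(correct_mod, dtype=float)
    bin_ids = np.digitize(angles, bin_edges)
    result = {}
    for b in range(len(bin_edges) + 1):
        mask = bin_ids == b
        if mask.any():  # skip empty bins
            gain = correct_mod[mask].mean() - correct_base[mask].mean()
            result[b] = 100.0 * gain
    return result


# Hypothetical viewpoints and correctness flags.
train = [[1, 0, 0], [0, 1, 0]]
test = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
angles = nearest_viewpoint_angles(train, test)  # approximately 0, 45, 90
print(improvement_per_bin(angles, [1, 0, 0], [1, 1, 0], bin_edges=[10, 60]))
```

Bin index 0 collects the samples below the first edge and the last index collects the samples at or above the last edge, so the number of bins is one more than the number of edges.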

There are several facts which should be noted. The first and most important is that the Unpack modification alone significantly improves the accuracy for larger viewpoint differences (the accuracy is improved by more than 20 percentage points for the last bin). The other important fact is that the other modifications (mainly Color and ImageDrop) improve the accuracy further. This improvement is independent of the training-testing viewpoint difference.

Figure 12: Precision-Recall curves for verification of fine-grained types. Black dots represent the human performance [8]. Only one representative of each CNN family (AlexNet, VGG, ResNet, VGG+CBL) is displayed – results for all CNNs are in the supplementary material.

V-E Impact of Individual Modifications

We were also curious how different modifications by themselves help to improve the accuracy. We conducted two types of experiments which focus on different aspects of the modifications. The evaluation is not done on ResNets, as we only use IMAGE level modifications with ResNets; thus, we cannot evaluate Rast and View modifications with ResNets.

mean best
Unpack / /
View / /
Rast / /
Color / /
ImageDrop / /
Table IV: Summary of improvements for different nets and modifications computed as [base net + modification] − [base net]. The raw data can be found in the supplementary material.

The first experiment focuses on the influence of each modification by itself. Therefore, we compute the accuracy improvement (in percentage points) for each modification as [base net + modification] − [base net], where [·] stands for the accuracy of the classifier described by its contents. The results are shown in Table IV. As can be seen in the table, the modifications that contribute the most are Color, Unpack, and ImageDrop.

The second experiment evaluates how much a given modification contributes to the accuracy improvement when all of the modifications are used. Thus, the improvement is computed as [base net + all] − [base net + all − modification]. See Table V for the results, which confirm the previous findings: Color, Unpack, and ImageDrop are again the most beneficial modifications.
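The two ablation measures can be sketched as follows. The sketch assumes that the first measure is the accuracy of the net with a single modification minus the accuracy of the plain net, and that the second is the accuracy with all modifications minus the accuracy with all but the given one. The lookup table, the function names, and the accuracy values are hypothetical.

```python
ALL_MODS = frozenset({"Unpack", "View", "Rast", "Color", "ImageDrop"})


def standalone_gain(acc, net, modification):
    """Accuracy with one modification minus accuracy of the plain net."""
    return acc[(net, frozenset({modification}))] - acc[(net, frozenset())]


def leave_one_out_gain(acc, net, modification, all_mods=ALL_MODS):
    """Accuracy with all modifications minus accuracy without the given one."""
    full = frozenset(all_mods)
    return acc[(net, full)] - acc[(net, full - {modification})]


# Hypothetical accuracies (percent), keyed by (net, set of enabled modifications).
acc = {
    ("VGG16", frozenset()): 77.0,
    ("VGG16", frozenset({"Unpack"})): 81.5,
    ("VGG16", ALL_MODS): 85.0,
    ("VGG16", ALL_MODS - {"Unpack"}): 82.0,
}
print(standalone_gain(acc, "VGG16", "Unpack"))     # 4.5
print(leave_one_out_gain(acc, "VGG16", "Unpack"))  # 3.0
```

The two numbers generally differ because the modifications are not independent: part of the gain of one modification may already be covered by the others.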

V-F Vehicle Type Verification

Lastly, we evaluated the quality of features extracted from the last layer of the convolutional nets for the verification task. By verification, we mean the task of determining whether or not a pair of vehicle tracks share the same fine-grained type. In agreement with previous works in the field [63], we use the cosine distance between the features for the verification.

We collected 5 million random pairs of vehicle tracks from the test part of the BoxCars116k splits and evaluated the verification on these pairs. As the tracks can contain different numbers of vehicle images, we used 9 random pairs of images for each pair of tracks and then used the median distance between these image pairs as the distance between the whole tracks.
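The track-to-track distance can be sketched as follows. The sketch assumes that each track is given as an array of per-image feature vectors and that the image pairs are drawn with replacement; the sampling scheme and the function names are assumptions of this illustration.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def track_distance(track_a, track_b, n_pairs=9, rng=None):
    """Median cosine distance over randomly drawn image pairs of two tracks."""
    rng = np.random.default_rng() if rng is None else rng
    track_a = np.asarray(track_a, dtype=float)
    track_b = np.asarray(track_b, dtype=float)
    # Assumption: image indices are sampled independently, with replacement.
    idx_a = rng.integers(0, len(track_a), size=n_pairs)
    idx_b = rng.integers(0, len(track_b), size=n_pairs)
    dists = [cosine_distance(track_a[i], track_b[j]) for i, j in zip(idx_a, idx_b)]
    return float(np.median(dists))


# Hypothetical 2D features: the tracks point in orthogonal directions.
track_a = [[1.0, 0.0], [2.0, 0.0]]
track_b = [[0.0, 3.0], [0.0, 1.0]]
print(track_distance(track_a, track_b, rng=np.random.default_rng(0)))  # 1.0
```

Using the median rather than the mean makes the track distance less sensitive to a few outlying image pairs, for example heavily occluded frames.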

Precision-Recall curves and Average Precisions are shown in Figure 12. As the results show, our modifications significantly improve the average precision for each CNN in the given task. Moreover, as the figure shows, the method outperforms human performance (black dots in Figure 12), as reported in the previous paper [8].
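For reference, one common way of computing a precision-recall curve and the average precision from pairwise distances is sketched below. It assumes that pairs are ranked by increasing distance and that the average precision is the mean of the precision values at the ranks of the positive pairs; this is a generic definition and not necessarily the exact procedure used for Figure 12. The function name and the data are hypothetical.

```python
import numpy as np


def precision_recall_ap(distances, same_type):
    """Precision, recall, and average precision for ranked track pairs.

    distances: distance for each pair (smaller means more similar).
    same_type: True where the pair shares the same fine-grained type.
    """
    distances = np.asarray(distances, dtype=float)
    same_type = np.asarray(same_type, dtype=bool)
    if not same_type.any():
        raise ValueError("at least one positive pair is required")
    order = np.argsort(distances, kind="stable")
    hits = same_type[order]
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, len(hits) + 1)
    recall = true_positives / hits.sum()
    # Mean of the precision values at the ranks of the positive pairs.
    average_precision = float(precision[hits].mean())
    return precision, recall, average_precision


# Hypothetical distances and labels for four pairs.
p, r, ap = precision_recall_ap([0.1, 0.2, 0.3, 0.4], [True, False, True, False])
print(round(ap, 4))  # 0.8333
```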

mean best
Unpack / /
View / /
Rast / /
Color / /
ImageDrop / /
Table V: Summary of improvements for different nets and modifications computed as [base net + all] − [base net + all − modification]. The raw data can be found in the supplementary material.

VI Conclusion

This article presents and sums up multiple algorithmic modifications suitable for CNN-based fine-grained recognition of vehicles. Some of the modifications were originally proposed in a conference paper [8], while others are results of the ongoing research. We also propose a method for obtaining the 3D bounding boxes necessary for the image unpacking (which has the largest impact on performance improvement) without observing a surveillance video, but only working with the individual input image. This considerably increases the application potential of the proposed methodology (and the performance for such estimated 3D boxes is only somewhat lower than when “proper” bounding boxes are used). We focused on a thorough evaluation of the methods: we coupled them with multiple state-of-the-art CNN architectures [74, 69], and measured the contribution/influence of individual modifications.

Our method significantly improves the classification accuracy (up to +12 percentage points) and reduces the classification error (up to 50 % error reduction) compared to the base CNNs. Also, our method outperforms other state-of-the-art methods [4, 3, 5] by 9 percentage points in single image accuracy and by 7 percentage points in whole track accuracy.

We collected, processed, and annotated the BoxCars116k dataset, targeted at fine-grained recognition of vehicles in the surveillance domain. In contrast to the majority of existing vehicle recognition datasets, the viewpoints vary greatly and correspond to surveillance scenarios; the existing datasets are mostly collected from web images, and the vehicles are typically captured from eye-level positions. This dataset has been made publicly available for future research and evaluation.


This work was supported by The Ministry of Education, Youth and Sports of the Czech Republic from the National Programme of Sustainability (NPU II); project IT4Innovations excellence in science – LQ1602.


  • [1] T. Gebru, J. Krause, Y. Wang, D. Chen, J. Deng, E. L. Aiden, and L. Fei-Fei, “Using deep learning and google street view to estimate the demographic makeup of the US,” 2017.
  • [2] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [3] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in International Conference on Computer Vision (ICCV), 2015.
  • [4] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” in International Conference on Computer Vision (ICCV), 2015.
  • [5] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact bilinear pooling,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [6] S. Xie, T. Yang, X. Wang, and Y. Lin, “Hyper-class augmented and regularized deep learning for fine-grained image classification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [7] F. Zhou and Y. Lin, “Fine-grained image classification by exploring bipartite-graph labels,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [8] J. Sochor, A. Herout, and J. Havel, “Boxcars: 3D boxes as CNN input for improved fine-grained vehicle recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [9] S. Huang, Z. Xu, D. Tao, and Y. Zhang, “Part-stacked CNN for fine-grained visual categorization,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [10] L. Zhang, Y. Yang, M. Wang, R. Hong, L. Nie, and X. Li, “Detecting densely distributed graph patterns for fine-grained image categorization,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 553–565, Feb 2016.
  • [11] S. Yang, L. Bo, J. Wang, and L. G. Shapiro, “Unsupervised template learning for fine-grained object recognition,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 3122–3130.
  • [12] K. Duan, D. Parikh, D. Crandall, and K. Grauman, “Discovering localized attributes for fine-grained recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [13] B. Yao, “A codebook-free and annotation-free approach for fine-grained image categorization,” in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), ser. CVPR ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 3466–3473.
  • [14] J. Krause, T. Gebru, J. Deng, L. J. Li, and L. Fei-Fei, “Learning features and parts for fine-grained recognition,” in Pattern Recognition (ICPR), 2014 22nd International Conference on, Aug 2014, pp. 26–33.
  • [15] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter responses for fine-grained image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [16] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Bilinear classifiers for visual recognition,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 1482–1490.
  • [17] D. Lin, X. Shen, C. Lu, and J. Jia, “Deep LAC: Deep localization, alignment and classification for fine-grained recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [18] N. Zhang, R. Farrell, and T. Darrell, “Pose pooling kernels for sub-category recognition,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, June 2012, pp. 3665–3672.
  • [19] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman, “TriCoS: A tri-level class-discriminative co-segmentation method for image classification,” in European Conference on Computer Vision, 2012.
  • [20] L. Li, Y. Guo, L. Xie, X. Kong, and Q. Tian, “Fine-Grained Visual Categorization with Fine-Tuned Segmentation,” IEEE International Conference on Image Processing, 2015.
  • [21] E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars, “Local alignments for fine-grained categorization,” International Journal of Computer Vision, vol. 111, no. 2, pp. 191–212, 2015.
  • [22] V. Petrovic and T. F. Cootes, “Analysis of features for rigid structure vehicle type recognition,” in BMVC, 2004, pp. 587–596.
  • [23] L. Dlagnekov and S. Belongie, “Recognizing cars,” UCSD CSE Tech Report CS2005-0833, Tech. Rep., 2005.
  • [24] X. Clady, P. Negri, M. Milgram, and R. Poulenard, “Multi-class vehicle type recognition system,” in Proceedings of the 3rd IAPR Workshop on Artificial Neural Networks in Pattern Recognition, ser. ANNPR ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 228–239.
  • [25] G. Pearce and N. Pears, “Automatic make and model recognition from frontal images of cars,” in IEEE AVSS, 2011, pp. 373–378.
  • [26] A. Psyllos, C. Anagnostopoulos, and E. Kayafas, “Vehicle model recognition from frontal view image measurements,” Computer Standards & Interfaces, vol. 33, no. 2, pp. 142–151, 2011, XVI IMEKO TC4 Symposium and XIII International Workshop on ADC Modelling and Testing.
  • [27] S. Lee, J. Gwak, and M. Jeon, “Vehicle model recognition in video,” International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 6, no. 2, p. 175, 2013.
  • [28] B. Zhang, “Reliable classification of vehicle types based on cascade classifier ensembles,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 1, pp. 322–332, March 2013.
  • [29] D. F. Llorca, D. Colás, I. G. Daza, I. Parra, and M. A. Sotelo, “Vehicle model recognition using geometry and appearance of car emblems from rear view images,” in 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Oct 2014, pp. 3094–3099.
  • [30] B. Zhang, “Classification and identification of vehicle type and make by cortex-like image descriptor HMAX,” IJCVR, vol. 4, pp. 195–211, 2014.
  • [31] J.-W. Hsieh, L.-C. Chen, and D.-Y. Chen, “Symmetrical SURF and its applications to vehicle detection and vehicle make and model recognition,” Intelligent Transportation Systems, IEEE Transactions on, vol. 15, no. 1, pp. 6–20, Feb 2014.
  • [32] C. Hu, X. Bai, L. Qi, X. Wang, G. Xue, and L. Mei, “Learning discriminative pattern for real-time car brand recognition,” Intelligent Transportation Systems, IEEE Transactions on, vol. 16, no. 6, pp. 3170–3181, Dec 2015.
  • [33] L. Liao, R. Hu, J. Xiao, Q. Wang, J. Xiao, and J. Chen, “Exploiting effects of parts in fine-grained categorization of vehicles,” in International Conference on Image Processing (ICIP), 2015.
  • [34] R. Baran, A. Glowacz, and A. Matiolanski, “The efficient real- and non-real-time make and model recognition of cars,” Multimedia Tools and Applications, vol. 74, no. 12, pp. 4269–4288, 2015.
  • [35] H. He, Z. Shao, and J. Tan, “Recognition of car makes and models from a single traffic-camera image,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–11, 2015.
  • [36] Y.-L. Lin, V. I. Morariu, W. Hsu, and L. S. Davis, “Jointly optimizing 3D model fitting and fine-grained classification,” in ECCV, 2014.
  • [37] E. Hsiao, S. Sinha, K. Ramnath, S. Baker, L. Zitnick, and R. Szeliski, “Car make and model recognition using 3D curve alignment,” in IEEE WACV, March 2014.
  • [38] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” in ICCV Workshop 3dRR-13, 2013.
  • [39] J. Prokaj and G. Medioni, “3-D model based vehicle recognition,” in IEEE WACV, Dec 2009.
  • [40] H.-Z. Gu and S.-Y. Lee, “Car model recognition by utilizing symmetric property to overcome severe pose variation,” Machine Vision and Applications, vol. 24, no. 2, pp. 255–274, 2013.
  • [41] M. Stark, J. Krause, B. Pepik, D. Meger, J. Little, B. Schiele, and D. Koller, “Fine-grained categorization for 3D scene understanding,” in BMVC, 2012.
  • [42] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2010, pp. 2241–2248.
  • [43] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.    IEEE, 2005, pp. 886–893.
  • [44] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in CVPR, June 2010, pp. 3360–3367.
  • [45] H. Liu, Y. Tian, Y. Yang, L. Pang, and T. Huang, “Deep relative distance learning: Tell the difference between similar vehicles,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [46] N. Boonsim and S. Prakoonwit, “Car make and model recognition under limited lighting conditions at night,” Pattern Analysis and Applications, pp. 1–13, 2016.
  • [47] J. Fang, Y. Zhou, Y. Yu, and S. Du, “Fine-grained vehicle model recognition using a coarse-to-fine convolutional neural network architecture,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–11, 2016.
  • [48] Q. Hu, H. Wang, T. Li, and C. Shen, “Deep CNNs with spatially weighted pooling for fine-grained car recognition,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–10, 2017.
  • [49] M. Dubská, J. Sochor, and A. Herout, “Automatic camera calibration for traffic understanding,” in BMVC, 2014.
  • [50] M. Dubská, A. Herout, R. Juránek, and J. Sochor, “Fully automatic roadside camera calibration for traffic surveillance,” Intelligent Transportation Systems, IEEE Transactions on, vol. 16, no. 3, pp. 1162–1171, June 2015.
  • [51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015.
  • [52] K. Matzen and N. Snavely, “NYC3DCars: A dataset of 3D vehicles in geographic context,” in International Conference on Computer Vision (ICCV), 2013.
  • [53] L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [54] X. Liu, W. Liu, H. Ma, and H. Fu, “Large-scale vehicle re-identification in urban surveillance videos,” in Multimedia and Expo (ICME), 2016 IEEE International Conference on.    IEEE, 2016, pp. 1–6.
  • [55] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.
  • [56] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” CoRR, vol. abs/1506.02640, 2015.
  • [57] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, Aug 2014.
  • [58] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [59] J. Gall, A. Yao, N. Razavi, L. V. Gool, and V. Lempitsky, “Hough forests for object detection, tracking, and action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2188–2202, Nov 2011.
  • [60] L. Wang, Y. Lu, H. Wang, Y. Zheng, H. Ye, and X. Xue, “Evolving boxes for fast vehicle detection,” in 2017 IEEE International Conference on Multimedia and Expo (ICME), July 2017, pp. 1135–1140.
  • [61] G. Salvi, “An automated nighttime vehicle counting and detection system for traffic surveillance,” in 2014 International Conference on Computational Science and Computational Intelligence, vol. 1, March 2014, pp. 131–136.
  • [62] E. R. Corral-Soto and J. H. Elder, “Slot cars: 3D modelling for improved visual traffic analytics,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
  • [63] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in CVPR, 2014, pp. 1701–1708.
  • [64] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
  • [65] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference, 2014.
  • [66] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision.    Springer, 2014, pp. 818–833.
  • [67] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang, “Object contour detection with a fully convolutional encoder-decoder network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 193–202.
  • [68] R. Rothe, R. Timofte, and L. Van Gool, “Deep expectation of real and apparent age from a single image without facial landmarks,” International Journal of Computer Vision, pp. 1–14, 2016.
  • [69] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [70] R. Juránek, A. Herout, M. Dubská, and P. Zemčík, “Real-time pose estimation piggybacked on object detection,” in ICCV, 2015.
  • [71] J. Sochor, R. Juránek, J. Špaňhel, L. Maršík, A. Široký, A. Herout, and P. Zemčík, “BrnoCompSpeed: Review of traffic camera calibration and a comprehensive dataset for monocular speed measurement,” Intelligent Transportation Systems (under review), IEEE Transactions on, 2016.
  • [72] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in CVPR, vol. 2, 1999, pp. 246–252.
  • [73] Z. Zivkovic, “Improved adaptive gaussian mixture model for background subtraction,” in ICPR, 2004, pp. 28–31.
  • [74] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.