Vehicle Re-Identification Based on Complementary Features

05/09/2020 ∙ by Cunyuan Gao, et al. ∙ Zhejiang University

In this work, we present our solution to the vehicle re-identification (vehicle Re-ID) track of the AI City Challenge 2020 (AIC2020). The purpose of vehicle Re-ID is to retrieve the same vehicle as it appears across multiple cameras, which can contribute substantially to Intelligent Traffic Systems (ITS) and smart cities. Because of variations in vehicle orientation and lighting, and because of inter-class similarity, it is difficult to learn a robust and discriminative feature representation. For the vehicle Re-ID track of AIC2020, our method fuses features extracted from different networks in order to exploit the strengths of each network and obtain complementary features. For each single model, several methods such as multi-loss training, filter grafting, and semi-supervised learning are used to strengthen its representation ability as much as possible. Top performance in City-Scale Multi-Camera Vehicle Re-Identification demonstrates the advantage of our methods, and we achieved 5th place in the vehicle Re-ID track of AIC2020. The codes are available at




1 Introduction

With the development of ITS, vehicle Re-ID has attracted increasing attention in the computer vision community [27, 29, 15]. The goal of vehicle Re-ID is to find the vehicles in the gallery that have the same identity as the query vehicle. Compared with other image retrieval tasks, vehicle Re-ID is more challenging for two main reasons. First, the same vehicle may appear in different orientations in different cameras, and the front and rear of a vehicle look very different. Second, two different vehicles of the same brand and model can look identical except for very minor differences.

In the past, license plate recognition was a conventional and efficient vehicle Re-ID solution. However, the license plate may be occluded or dirty, or the video may not be clear enough for it to be read. Easily confused characters also make license plate recognition unreliable. Therefore, vehicle Re-ID based on visual features of the vehicle is essential.

At present, feature extraction and metric learning are the mainstream research directions in vehicle Re-ID. Compared with traditional hand-crafted features, Convolutional Neural Networks (CNNs) can automatically learn discriminative high-level semantic features according to task requirements, which greatly improves the performance of vehicle Re-ID. Nevertheless, occlusion, lighting and diverse viewpoints still pose many challenging problems.

In recent years, a few powerful networks [6, 9] have been proposed. These networks use residual blocks or densely connected blocks to increase network depth, so that deep networks can extract more discriminative features from the training images. Specifically for the vehicle Re-ID task, part-based methods [5] combine local part features with global features to improve performance. In this paper, our solution to the vehicle Re-ID track in AIC2020 can be summarized as follows:

  • We train multiple CNNs with several methods including multi-loss training, filter grafting, and part-based feature learning. These methods increase the representation ability of each single network.

  • We use a semi-supervised method to annotate images in the test set with fake labels, and combine these fake-labeled images with the original training set to improve the final performance.

  • Several post-processing methods are used to further improve the result.

Figure 1: Overview of our basic framework.

2 Related Works

2.1 Vehicle Re-identification

As several vehicle datasets [14, 1, 24] have been proposed, vehicle Re-ID has attracted increasing attention in recent years. Current work on vehicle Re-ID still focuses on the feature level and on how to extract more informative features, and performance has improved significantly in recent years thanks to the development of CNNs. Zhu et al. [29] and Guo et al. [4] seek better feature encoding methods. Another line of work extracts highly discriminative part-based features [25, 23, 5, 13]; these methods mainly rely on strong supervision for each key part, so the annotation cost is high and practical application is limited. Zhou et al. [27] adopt a viewpoint-aware attention model and an adversarial training architecture to infer effective multi-view features from single-view input. Zhou et al. [28] focus on the uncertainty of vehicle viewpoint in Re-ID and propose an Adversarial Bi-directional LSTM Network (ABLN). Inspired by the human recognition process, Chu et al. [2] propose a viewpoint-aware metric learning approach, which learns two metrics, for similar viewpoints and for different viewpoints, in two feature spaces.

2.2 Re-ranking

When vehicle Re-ID is regarded as a retrieval process, re-ranking is a key step to improve its performance. Re-ranking has mainly been studied in generic instance retrieval [3, 18, 10, 20]. The main advantage of many re-ranking methods is that they require no additional training samples and can be applied to any initial ranking result.

A popular re-ranking approach is k-reciprocal encoding [26], which encodes the k-reciprocal nearest neighbors of an image into a single vector. To capture similarity relationships from similar samples, a local query expansion is used to obtain more robust k-reciprocal features. The final distance, a combination of the original distance and the Jaccard distance, effectively improves Re-ID performance on multiple large-scale datasets.

3 Our Approach

3.1 Basic framework

The basic framework of our approach is shown in Fig. 1. During the training phase, training images with ID labels are fed into the backbone, which produces a 2048-d feature and a predicted ID label for each image. Two networks with the same structure are trained in parallel and perform filter grafting. We then apply two different loss functions as the optimization objective. The first is the hard triplet loss [7], which enlarges the feature distance between samples with different labels and reduces the distance between samples with the same label. The triplet loss can be defined as:

$$L_{tri} = \frac{1}{N} \sum_{a=1}^{N} \left[ m + \max_{p} \lVert f_a - f_p \rVert_2 - \min_{n} \lVert f_a - f_n \rVert_2 \right]_+$$

where $f_a$, $f_p$, $f_n$ are the features extracted from anchor, positive and negative samples respectively, $m$ is the margin, and $N$ is the total number of training images in one batch. In theory, in order to ensure the best effect of network training, we must choose the hard positive and the hard negative, that is, the farthest positive and the closest negative of each anchor within the batch.

The second is the cross-entropy loss. Given a batch of training images, we denote $y_i$ as the ground-truth ID label of the $i$-th image, $z_{i,j}$ as the ID prediction logit of the $i$-th image on the $j$-th class, and $N$ as the total number of training images in one batch. The cross-entropy loss can be defined as:

$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i,y_i}$$

where $p_{i,j} = \exp(z_{i,j}) / \sum_{k} \exp(z_{i,k})$. Finally, we combine these two losses with a balanced weight, which is defined as:

$$L = L_{ce} + \lambda L_{tri}$$

where $\lambda$ is the balanced weight of the triplet loss.
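As a rough illustration of how the two objectives can be combined, the following minimal sketch computes a batch-hard triplet term and a cross-entropy term in plain Python and adds them with a weight. The function names, the margin of 0.3, the default weight `lam=1.0`, and the choice to average over anchors are assumptions made for this example, not values taken from our training configuration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """For each anchor, use its farthest positive and closest negative in the batch.

    The margin value is an illustrative assumption.
    """
    losses = []
    for a, (fa, ya) in enumerate(zip(features, labels)):
        pos = [euclidean(fa, f)
               for i, (f, y) in enumerate(zip(features, labels))
               if y == ya and i != a]
        neg = [euclidean(fa, f) for f, y in zip(features, labels) if y != ya]
        if not pos or not neg:
            continue  # this anchor cannot form a triplet in this batch
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0


def cross_entropy_loss(logits, labels):
    """Mean negative log-softmax probability of the ground-truth class."""
    total = 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_sum_exp - z[y]
    return total / len(labels)


def combined_loss(features, logits, labels, lam=1.0, margin=0.3):
    """Cross-entropy plus a weighted triplet term (lam is a hypothetical weight)."""
    return (cross_entropy_loss(logits, labels)
            + lam * batch_hard_triplet_loss(features, labels, margin))
```

In a real training loop these terms would be computed on GPU tensors; the sketch only makes the selection of hard positives and hard negatives explicit.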

BNNeck [16] refers to the addition of a batch normalization (BN) layer after the backbone.

On the basis of this framework, we add further modules to improve single-model performance. The first is MGN [22], which splits the input sample horizontally into several parts and calculates a loss for each part individually, while the global feature is also included in the loss calculation. The second is the self-attention constraint (SAC) [11], which makes the network pay more attention to subtle details. The third is the squeeze-and-excitation (SE) block [8], another attention module that enhances the representation power of the features by performing dynamic channel-wise feature recalibration.

During the testing phase, each trained network extracts a 2048-d feature from every input image. Features from different networks are concatenated together for vehicle Re-ID.

3.2 Filter grafting

Filter grafting [17] is a learning paradigm that aims to reactivate invalid filters during training so that the representation power of the network is improved. The weights of filters that are valid in another network are grafted onto the filters that are invalid in the network itself, so that multiple networks grafted together improve one another. The main framework is shown in Fig. 2. Given a network, two training processes are started in parallel, denoted as $M_1$ and $M_2$, respectively. In each training epoch, we graft the effective filter weights of $M_1$ into the invalid filters of $M_2$. Grafting in this process occurs at the convolutional-layer level, not the filter level, which means that we graft all filter weights of a given layer in $M_1$ into the same layer of $M_2$. Conversely, the same principle is used for grafting filters from $M_2$ into $M_1$. Filter grafting does not change the network structure. The specific grafting operation is given by the following formula:

$$W_i^{M_2'} = \alpha W_i^{M_2} + (1 - \alpha) W_i^{M_1}, \quad 0 < \alpha < 1$$

After weighting the filter parameters of $M_1$ and $M_2$, the final filter parameters after grafting are obtained. We use the following formula to constrain the parameter $\alpha$:

$$\alpha = A \cdot \frac{\arctan\left(c \cdot \left(H(W_i^{M_2}) - H(W_i^{M_1})\right)\right)}{\pi} + 0.5$$

where $W_i^{M}$ denotes the weights of the $i$-th layer of network $M$, $A$ and $c$ are two hyper-parameters, and $H$ is the entropy of the variables.

Although the two networks are trained in parallel, only one of the grafted networks is used for testing.
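To make the layer-level grafting step more concrete, here is a minimal sketch of an entropy-weighted combination of two layers, following the adaptive weighting of [17] as we understand it: the layer with higher weight entropy receives the larger coefficient through an arctan mapping. The histogram-based entropy estimate with 10 bins and the values `A=0.4` and `c=500.0` are illustrative assumptions, and each layer is flattened to a plain list of floats.

```python
import math


def layer_entropy(weights, bins=10):
    """Histogram-based entropy of a flattened layer (bin count is an assumption)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for w in weights:
        k = min(int((w - lo) / (hi - lo) * bins), bins - 1)
        counts[k] += 1
    n = len(weights)
    return -sum((c / n) * math.log(c / n) for c in counts if c)


def grafting_coefficient(h_self, h_other, A=0.4, c=500.0):
    """Map the entropy gap to a coefficient in (0.5 - A/2, 0.5 + A/2)."""
    return A * math.atan(c * (h_self - h_other)) / math.pi + 0.5


def graft_layer(w_self, w_other, A=0.4, c=500.0, bins=10):
    """Return the grafted layer for the 'self' network and the coefficient used."""
    alpha = grafting_coefficient(
        layer_entropy(w_self, bins), layer_entropy(w_other, bins), A, c
    )
    grafted = [alpha * a + (1.0 - alpha) * b for a, b in zip(w_self, w_other)]
    return grafted, alpha
```

With equal entropies the coefficient is 0.5, so the two layers are simply averaged; as the entropy gap grows, the more informative layer dominates but never fully replaces the other.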

Figure 2: Filter grafting with two networks. The two networks $M_1$ and $M_2$ accept filter parameter information from each other.

3.3 Semi-supervised learning

The annotations of the test set images are not accessible, so we propose a semi-supervised learning method that annotates test images with fake labels. The test set thereby becomes a supplement to the training set. The steps of this method are as follows:

  • Train Vehicle Re-ID models using the original training set.

  • Extract features from the images in the test set using the trained Re-ID models.

  • Cluster the testing images based on their features, and assign each cluster a fake label as the ID label.

  • Merge the fake-labeled test set and the original training set to be a new training set. Then we train Re-ID models using the new training set.

The number of IDs in the test set is 333. Therefore, we use the k-means algorithm for clustering, with the number of cluster centers set to 333.
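The pseudo-labeling steps above can be sketched with a small, self-contained k-means in plain Python. This is only an illustration under stated assumptions: the function names, the random initialization from data points, the fixed iteration count, and the label offset `first_new_id` (so that fake IDs do not collide with real training IDs) are all hypothetical choices, and real features would be 2048-d vectors rather than the toy points used here.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from data points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign


def pseudo_label(test_features, num_clusters, first_new_id):
    """Turn cluster indices into fake ID labels that start at first_new_id."""
    clusters = kmeans(test_features, num_clusters)
    return [first_new_id + c for c in clusters]
```

For example, `pseudo_label(features, 333, first_new_id=1645)` would assign fake IDs starting after the existing training IDs; the merged set can then be used to retrain the Re-ID models.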

3.4 Post-processing

We apply several post-processing techniques to the outputs of the trained single models.

Center crop. The query and gallery images of the test set generally contain some redundant background. When extracting test set features, we use a center crop so that the image focuses more on the vehicle itself and the influence of the surrounding background is reduced.

Image flipping. At inference time, we feed both the vehicle image and its horizontal flip into the model, and average the two feature vectors as the final feature of the vehicle. This can mitigate the impact of viewing angle on the vehicle feature representation.
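A minimal sketch of these two test-time steps is given below, with an image represented as a list of pixel rows. The `extract` argument stands in for a trained Re-ID model and is hypothetical, as are the function names and the optional `crop_hw` parameter.

```python
def center_crop(image, out_h, out_w):
    """Keep the central out_h x out_w window of an image given as rows of pixels."""
    h, w = len(image), len(image[0])
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return [row[left:left + out_w] for row in image[top:top + out_h]]


def hflip(image):
    """Horizontally flip an image by reversing every row."""
    return [row[::-1] for row in image]


def tta_feature(image, extract, crop_hw=None):
    """Average the features of an image and its horizontal flip.

    `extract` is a placeholder for the trained feature extractor.
    """
    if crop_hw is not None:
        image = center_crop(image, *crop_hw)
    f_orig = extract(image)
    f_flip = extract(hflip(image))
    return [(a + b) / 2.0 for a, b in zip(f_orig, f_flip)]
```

Because the averaged feature is identical for an image and its mirror, the descriptor becomes less sensitive to whether a vehicle is seen from its left or right side.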

Model ensemble. Model ensembling has proven effective in detection, classification, and Re-ID tasks, and different tasks call for different ensemble methods. For the Re-ID task, we tried many ensemble methods and found that expanding the feature dimension by concatenating features works better than averaging the features of different models. Given the $N$ trained networks above, we extract $N$ feature vectors for an image $x$, where $f_i(x)$ denotes the feature vector from the $i$-th model. Each feature is normalized, $\hat{f}_i(x) = \mathrm{Norm}(f_i(x))$, where $\mathrm{Norm}(\cdot)$ represents normalization, and we then concatenate them as the final feature representation of image $x$: $f(x) = [\hat{f}_1(x), \ldots, \hat{f}_N(x)]$. Our subsequent experiments show that the concatenated feature improves performance over any single model.
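The concatenation step can be sketched in a few lines. The normalization is assumed here to be L2 normalization, which the text does not state explicitly, and the function names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def concat_features(per_model_features):
    """Normalize each model's feature, then concatenate along the feature axis."""
    fused = []
    for feature in per_model_features:
        fused.extend(l2_normalize(feature))
    return fused
```

Normalizing before concatenation keeps a model with large-magnitude activations from dominating the distance computed on the fused feature.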

Query expansion. Query expansion is also a re-ranking method that can improve retrieval recall. It is especially effective when the top-$k$ accuracy is high. After the first retrieval, the feature of the query and the features of the top-$k$ images retrieved from the gallery are averaged to form the new query feature, a second retrieval is performed, and this operation is repeated $t$ times. The parameters $k$ and $t$ need to be tuned for the official test set.
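The iteration described above might look like the following sketch. It assumes Euclidean ranking and, in each round, averages the current query feature (rather than the original one) with its top retrieved gallery features; the text leaves that detail open, and the default values are placeholders.

```python
def sq_dist(u, v):
    """Squared Euclidean distance, sufficient for ranking."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def query_expansion(query, gallery, k=5, t=1):
    """Replace the query by the mean of itself and its top-k gallery features, t times."""
    q = list(query)
    for _ in range(t):
        ranked = sorted(range(len(gallery)), key=lambda i: sq_dist(q, gallery[i]))
        top = [gallery[i] for i in ranked[:k]]
        # Mean over the current query and the retrieved features, per dimension.
        q = [sum(vals) / (len(top) + 1) for vals in zip(q, *top)]
    return q
```

If the top-ranked results contain wrong identities, the averaged query drifts toward them, which is why the method helps most when the initial top results are already accurate.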

Gallery feature merge. The official test set also provides track ID information, covering 798 vehicle tracks in the gallery, where each track consists of multiple pictures of the same vehicle. This prior information can be exploited: we average the picture features within each track and use the result as the feature of every picture in that track. One parameter needs to be tuned, namely the number of pictures averaged per track, which can also be all the pictures in the track. This post-processing method can be applied to other Re-ID tasks that provide video track information.
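A minimal sketch of this track-level averaging follows, for the case where all pictures of a track are used. The function name and the input layout (one track ID per gallery feature) are assumptions for the example.

```python
from collections import defaultdict


def merge_gallery_by_track(features, track_ids):
    """Replace every gallery feature by the mean feature of its track."""
    groups = defaultdict(list)
    for feature, track in zip(features, track_ids):
        groups[track].append(feature)
    track_mean = {
        track: [sum(x) / len(members) for x in zip(*members)]
        for track, members in groups.items()
    }
    return [track_mean[track] for track in track_ids]
```

After merging, all pictures of a track share one feature, so they move up or down the ranking together.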

Re-ranking with k-reciprocal encoding. K-reciprocal encoding [26] is an effective re-ranking method. Its basic assumption is that if a returned image ranks within the $k$ nearest neighbors of the probe, it is likely to be a true match. The idea of this method is to use query-to-test, query-to-query and test-to-test correlations so that the distributions of both the query and the test set are taken into account when ranking the results.
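The core idea can be illustrated with a deliberately simplified sketch: it builds k-reciprocal neighbor sets over a joint distance matrix of query and gallery samples and blends the Jaccard distance of those sets with the original distance. It omits the local query expansion and the soft encoding of the full method in [26]; the function names and the values `k=2` and `lam=0.3` are illustrative assumptions.

```python
def k_reciprocal_sets(dist, k):
    """R[i] holds every j such that i and j are in each other's k-nearest lists."""
    n = len(dist)
    nearest = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: dist[i][j])
        nearest.append(set(order[:k + 1]) | {i})  # k neighbors plus the sample itself
    return [{j for j in nearest[i] if i in nearest[j]} for i in range(n)]


def rerank(dist, k=2, lam=0.3):
    """Blend Jaccard distance of k-reciprocal sets with the original distance."""
    n = len(dist)
    recip = k_reciprocal_sets(dist, k)
    final = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = len(recip[i] | recip[j])
            jaccard = 1.0 - len(recip[i] & recip[j]) / union
            final[i][j] = (1.0 - lam) * jaccard + lam * dist[i][j]
    return final
```

Two samples that share most of their reciprocal neighbors get a small Jaccard distance even if their raw feature distance is moderate, which is what lets the method pull up true matches seen from a different viewpoint.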

4 Experiments

4.1 Dataset analysis

CityFlow [21] is one of the largest vehicle Re-ID datasets. It contains 56,277 bounding boxes of 666 vehicle identities, of which 36,935 bounding boxes from 333 identities form the training set, and the test set consists of 18,290 bounding boxes from the other 333 identities. The standard probe and gallery sets consist of 1,052 and 18,290 images respectively. The synthetic dataset provided by AI City Challenge 2020 is generated by VehicleX, a publicly available 3D engine. It contains 192,150 images of 1,362 vehicles in total, annotated with detailed labels. We select 50 vehicle identities with 3,462 images from CityFlow as our self-val set. All the remaining images are used as the training set, so our self-training set has a total of 225,623 images of 1,645 vehicle IDs. In addition, after semi-supervised learning, the images in the test set are assigned 330 IDs, giving a final training set of 1,975 IDs.
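The set sizes quoted above follow from simple bookkeeping, which the short check below reproduces. The constant and function names are chosen for this example only; the numbers are the ones stated in the text.

```python
CITYFLOW_TRAIN = {"images": 36_935, "ids": 333}
CITYFLOW_GALLERY_IMAGES = 18_290
CITYFLOW_QUERY_IMAGES = 1_052
SYNTHETIC = {"images": 192_150, "ids": 1_362}
SELF_VAL = {"images": 3_462, "ids": 50}
PSEUDO_LABELED_TEST_IDS = 330


def cityflow_total_boxes():
    """Training boxes plus gallery and query images."""
    return (CITYFLOW_TRAIN["images"] + CITYFLOW_GALLERY_IMAGES
            + CITYFLOW_QUERY_IMAGES)


def self_training_set():
    """Real plus synthetic data, minus the held-out self-val split."""
    images = CITYFLOW_TRAIN["images"] + SYNTHETIC["images"] - SELF_VAL["images"]
    ids = CITYFLOW_TRAIN["ids"] + SYNTHETIC["ids"] - SELF_VAL["ids"]
    return images, ids


def final_training_ids():
    """IDs available after adding the fake-labeled test clusters."""
    return self_training_set()[1] + PSEUDO_LABELED_TEST_IDS


print(cityflow_total_boxes())  # 56277
print(self_training_set())     # (225623, 1645)
print(final_training_ids())    # 1975
```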

4.2 Implementation Details

We trained several Re-ID models in total, including DenseNet161 [9], SE-ResNet152 [8] and ResNet152 with MGN [22]. For some models, the training data consists of the officially provided training set and the synthetic dataset. Because some vehicle images contain redundant background, we resize each image to 414×310 and then randomly crop it to 384×288. We also apply random erasing to make the samples harder to learn, and random horizontal flipping to increase the diversity of the training images.

For other models, we use the refined dataset open-sourced by Baidu, the first-place team of AI City Challenge 2019, which was produced with their detector. For this part of the dataset, we resize the images to 384×288, and there is no need to crop away the background.

We adopt a random sampling strategy in which each vehicle ID contributes 4 instance images to a batch. We use Stochastic Gradient Descent (SGD) to train the CNN models for a total of 1000 epochs. Strategies such as learning rate warm-up and stepwise learning rate decay are added to the training process. The initial learning rate is set to 0.03 and is decayed to $3 \times 10^{-3}$, $3 \times 10^{-4}$ and $3 \times 10^{-5}$ at the 300th, 600th and 900th epochs. We implement our models in PyTorch, and all models are trained and tested on eight Tesla P40 GPUs.
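A schedule of this kind can be sketched as a small function of the epoch index. The sketch assumes that each decay step divides the rate by 10, and it uses a linear warm-up over 10 epochs; the warm-up length and shape are not specified in the text, so `warmup_epochs` is purely an illustrative parameter.

```python
def learning_rate(epoch, base_lr=0.03, milestones=(300, 600, 900),
                  gamma=0.1, warmup_epochs=10):
    """Linear warm-up followed by step decay at the given milestone epochs."""
    if epoch < warmup_epochs:
        # Ramp up linearly so the first epochs use a small rate (assumed shape).
        return base_lr * (epoch + 1) / warmup_epochs
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```

In PyTorch the same step decay is typically obtained with a multi-step scheduler attached to the SGD optimizer, with the warm-up handled separately.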

4.3 Semi-Supervised Learning

We use the 225,623 images of 1,645 vehicle IDs to train a Re-ID model, which then extracts the features of the test set. Given that there are 333 IDs in the test set, k-means is used to cluster the test set images, and 19,342 images of 333 vehicle IDs are added to the original training set to retrain the Re-ID model. For each single model under the different training sets, the mAP on the self-val set is shown in Table 1. With the increase in the amount of training data, each of our single models improves on the self-val set. However, because the fake labels are not very accurate, the retrained models overfit the test set with some incorrect labels; in the end only DenseNet161 improves on the test set, and it is used as one of the single models in the final ensemble. Model selection experiments are conducted in the next section.

| Methods | Train Dataset | Self-val set mAP(%) |
|---|---|---|
| DenseNet161 | Train Set | 66.8 |
| DenseNet161 | Train Set + Test Set (fake label) | 72 |
| SE-ResNet152 | Train Set | 71.2 |
| SE-ResNet152 | Train Set + Test Set (fake label) | 72 |
| ResNet152 with MGN | Train Set | 66.7 |
| ResNet152 with MGN | Train Set + Test Set (fake label) | 70 |

Table 1: Comparison of mAP on the self-val set for each model before and after adding the fake-labeled test set to the training set.
Figure 3: Top-10 ranking lists produced by our vehicle Re-ID network for four query images on our self-val set of the CityFlow dataset. Images with green borders belong to the same identity as the query, and images with red borders do not.

4.4 Evaluation of Vehicle Re-Identification

Model Ensemble. We conduct experiments on different model ensemble strategies, as shown in Table 2. Average of feature merge: take the average of the 2048-dimensional features inferred by each single model. Average of distance matrix: take the average of the query-gallery distance matrices computed from the features of each single model. Concatenated feature: concatenate the 2048-d features output by each single model, expanding the dimension. Experiments show that the concatenated feature performs best.
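The three strategies differ only in where the fusion happens, as the following sketch shows. The input layout (a list over models, each holding one feature per image), the use of Euclidean distance, and L2 normalization before concatenation are assumptions for this example.

```python
import math


def l2n(v):
    """Unit-normalize a vector (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dist_matrix(queries, gallery):
    """Euclidean distance from every query feature to every gallery feature."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(q, g))) for g in gallery]
            for q in queries]


def mean_over_models(per_model):
    """Per-image element-wise mean of the features from all models."""
    return [[sum(x) / len(x) for x in zip(*feats)] for feats in zip(*per_model)]


def concat_over_models(per_model):
    """Per-image concatenation of the normalized features from all models."""
    return [[x for f in feats for x in l2n(f)] for feats in zip(*per_model)]


def ensemble_average_feature(q_models, g_models):
    return dist_matrix(mean_over_models(q_models), mean_over_models(g_models))


def ensemble_average_distance(q_models, g_models):
    mats = [dist_matrix(q, g) for q, g in zip(q_models, g_models)]
    return [[sum(vals) / len(mats) for vals in zip(*rows)] for rows in zip(*mats)]


def ensemble_concat(q_models, g_models):
    return dist_matrix(concat_over_models(q_models), concat_over_models(g_models))
```

One intuition for the gap between the strategies: averaging features from independently trained models mixes unrelated embedding spaces and can cancel discriminative directions, whereas concatenation keeps each model's evidence in its own coordinates.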

Therefore, we concatenate the normalized features of SE-ResNet152 (trained on the train set), ResNet152 with MGN (trained on the train set), DenseNet161 (trained on the train set) and DenseNet161 (trained on the train set with the fake-labeled test set) for feature distance calculation, which performs better than any single model.

When the test set is annotated with fake labels, the clustering is not refined, so the clustering results contain a large number of incorrect labels.

ResNet152 with MGN and SE-ResNet152 trained on the train set with the fake-labeled test set perform better on the self-val set than their counterparts trained without it. During training, however, these models overfit the test set with its large number of incorrect labels, so including them in the final model ensemble decreases the mAP on the test set.

DenseNet161, in contrast, does not overfit the incorrectly labeled test set, so we choose the 4 models listed in Table 2 for the final ensemble feature.

| Methods | Self-val set mAP(%) | Test set mAP(%) |
|---|---|---|
| SE-ResNet152 (TT) | 71.2 | - |
| ResNet152 with MGN (TT) | 66.7 | - |
| DenseNet161 (TT) | 66.8 | - |
| DenseNet161 (TTF) | 72 | - |
| Ensemble strategies | | |
| Average of feature merge | 72.9 | - |
| Average of distance matrix | 74.2 | - |
| Concatenated feature | 74.8 | 66.84 |

Table 2: mAP of the single models and of the ensemble model under different ensemble strategies on the validation set and test set. TT: trained on the train set; TTF: trained on the train set with the fake-labeled test set. (In this table, we do not use re-ranking or query expansion on the self-val set.)

Post-processing. We conduct the post-processing ablation experiment on the self-val set of the CityFlow dataset, with SE-ResNet152 as the single-model backbone. Table 3 shows that each post-processing method improves the result to some degree. Although query expansion brings some improvement on the self-val set, this is not reflected on the test set. Gallery feature merge brings a certain improvement on the test set. The post-processing we finally apply on the test set is shown in the testing stage of Fig. 1.

SE-ResNet152
Center Crop?
Image Flip?
Query Expansion?
Gallery Feature Merge?    -    -
Re-Ranking?
Self-val set mAP(%)    71.1    71.6    72.4    74.2    -    83.8
Table 3: Comparison of different post-processing methods on our self-val set of CityFlow datasets.

Impact of filter grafting. To verify the effectiveness of filter grafting on our vehicle Re-ID model, we ran experiments on SE-ResNet152. Grafting two models brings about a 1% improvement in mAP over the baseline. Due to limited time and GPU resources, we did not graft more models and did not try filter grafting on other backbones.

4.5 Results of Vehicle Re-Identification

Comparison with others in AIC2020: Table 4 lists the ranking and the result of our team on the vehicle Re-ID task of AI City Challenge 2020.

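| Rank | Team ID | Team Name | Score |
|---|---|---|---|
| 1 | 73 | Baidu-UTS | 0.8413 |
| 2 | 42 | ZOOZOO | 0.7810 |
| 3 | 39 | DMT | 0.7322 |
| 4 | 36 | IOSB-VeRi | 0.6899 |
| 5 | 30 | BestImage | 0.6684 |
| 6 | 44 | BeBetter | 0.6683 |
| 7 | 72 | UMDRC | 0.6668 |
| 8 | 7 | Ainnovation | 0.6561 |
| 9 | 46 | NMB | 0.6206 |
| 10 | 81 | Shahe | 0.6191 |
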
Table 4: Ranking list of City-Scale Multi-Camera Vehicle Re-Identification in AIC2020.

Visualization of the results: The qualitative results on our CityFlow self-val set are shown in Fig. 3, which visualizes the top-10 ranking lists for our query images. From the results, we can see that our model successfully retrieves most of the matching vehicle pictures from the gallery.

5 Conclusions

We participated in the vehicle re-identification task of the NVIDIA AI City Challenge 2020. Our solution is based on training multiple CNNs with several methods including multi-loss training, filter grafting, and part-based feature learning. We then use a semi-supervised method to annotate images in the test set with fake labels. In addition, several post-processing methods are used to further improve the result. Finally, we finished in 5th place in the 2020 AI City Challenge for City-Scale Multi-Camera Vehicle Re-Identification.


References

  • [1] Yan Bai, Yihang Lou, Feng Gao, Shiqi Wang, Yuwei Wu, and Ling-Yu Duan. Group-sensitive triplet embedding for vehicle reidentification. IEEE Transactions on Multimedia, 20(9):2385–2399, 2018.
  • [2] Ruihang Chu, Yifan Sun, Yadong Li, Zheng Liu, Chi Zhang, and Yichen Wei. Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 8282–8291, 2019.
  • [3] Ondrej Chum, James Philbin, Josef Sivic, Michael Isard, and Andrew Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
  • [4] Haiyun Guo, Chaoyang Zhao, Zhiwei Liu, Jinqiao Wang, and Hanqing Lu. Learning coarse-to-fine structured feature embedding for vehicle re-identification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [5] Bing He, Jia Li, Yifan Zhao, and Yonghong Tian. Part-regularized near-duplicate vehicle re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3997–4005, 2019.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [7] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [9] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [10] Herve Jegou, Hedi Harzallah, and Cordelia Schmid. A contextual dissimilarity measure for accurate and efficient image search. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  • [11] Minyue Jiang, Yuan Yuan, and Qi Wang. Self-attention learning for person re-identification. In BMVC, page 204, 2018.
  • [12] Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2167–2175, 2016.
  • [13] Xiaobin Liu, Shiliang Zhang, Qingming Huang, and Wen Gao. Ram: a region-aware deep model for vehicle re-identification. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2018.
  • [14] Xinchen Liu, Wu Liu, Huadong Ma, and Huiyuan Fu. Large-scale vehicle re-identification in urban surveillance videos. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2016.
  • [15] Yihang Lou, Yan Bai, Jun Liu, Shiqi Wang, and Ling-Yu Duan. Embedding adversarial learning for vehicle re-identification. IEEE Transactions on Image Processing, 28(8):3794–3807, 2019.
  • [16] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [17] Fanxu Meng, Hao Cheng, Ke Li, Zhixin Xu, Rongrong Ji, Xing Sun, and Guangming Lu. Filter grafting for deep neural networks. arXiv preprint arXiv:2001.05868, 2020.
  • [18] Danfeng Qin, Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In CVPR 2011, pages 777–784. IEEE, 2011.
  • [19] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [20] Xiaohui Shen, Zhe Lin, Jonathan Brandt, Shai Avidan, and Ying Wu. Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3013–3020. IEEE, 2012.
  • [21] Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, and Jenq-Neng Hwang. Cityflow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8797–8806, 2019.
  • [22] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM international conference on Multimedia, pages 274–282, 2018.
  • [23] Xiu-Shen Wei, Chen-Wei Xie, and Jianxin Wu. Mask-cnn: Localizing parts and selecting descriptors for fine-grained image recognition. arXiv preprint arXiv:1605.06878, 2016.
  • [24] Dominik Zapletal and Adam Herout. Vehicle re-identification for automatic video traffic surveillance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–31, 2016.
  • [25] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based r-cnns for fine-grained category detection. In European conference on computer vision, pages 834–849. Springer, 2014.
  • [26] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1318–1327, 2017.
  • [27] Yi Zhou and Ling Shao. Aware attentive multi-view inference for vehicle re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6489–6498, 2018.
  • [28] Yi Zhou and Ling Shao. Vehicle re-identification by adversarial bi-directional lstm network. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 653–662. IEEE, 2018.
  • [29] Jianqing Zhu, Huanqiang Zeng, Jingchang Huang, Shengcai Liao, Zhen Lei, Canhui Cai, and Lixin Zheng. Vehicle re-identification using quadruple directional deep learning features. IEEE Transactions on Intelligent Transportation Systems, 2019.