Image-based Vehicle Re-identification Model with Adaptive Attention Modules and Metadata Re-ranking

by Quang Truong, et al.

Vehicle re-identification is a challenging task due to intra-class variability and inter-class similarity across non-overlapping cameras. To tackle these problems, recently proposed methods require additional annotations to extract more features for excluding false-positive images. In this paper, we propose a model powered by adaptive attention modules that requires fewer label annotations but still outperforms previous models. We also include a re-ranking method that takes into account the importance of metadata feature embeddings. The proposed method is evaluated on the CVPR AI City Challenge 2020 dataset and achieves an mAP of 37.25%.



1 Introduction

In recent years, computer vision has advanced across its sub-fields thanks to the continuing development of Convolutional Neural Networks (CNNs). Among these sub-fields, object re-identification has gained attention lately because of several technical difficulties. The first challenge is intra-class variability: because of illumination conditions, obstacles, and occlusions, an object may appear different across non-overlapping cameras. The second challenge is inter-class similarity: two objects may look alike, such as identical twins or cars from the same manufacturing process. Unlike image classification, whose task is to classify images based on visual content, object re-identification demands a system that responds robustly to both local and global features. Local features help differentiate two objects seen from similar viewpoints. In contrast, global features help cluster images that belong to the same object, regardless of viewpoint. Re-identification systems must also generalize well to unseen features, given the many variations among objects.

Initially, most research on re-identification focused on person re-identification, and vehicle re-identification has successfully adopted those contributions despite the difference in domains [6, 11, 15, 21, 19, 14, 13, 2, 16, 23, 17, 7, 9]. However, the majority of these projects adopt classification models pre-trained on ImageNet and perform transfer learning for the vehicle re-identification task. Our proposed method builds on GLAMOR, a model designed for re-identification and proposed by Suprem et al. [19], which shows that training from scratch with a smaller dataset ( real images and synthetic images versus M images of ImageNet) does not necessarily result in poorer performance. In fact, GLAMOR outperforms the ResNet50 baseline with % mAP improvement [21]. We also propose a slight modification to k-reciprocal encoding re-ranking [30] so that it includes the metadata attributes during the re-ranking process. The remainder of the paper is structured as follows: Section 2 reviews the related work, Section 3 illustrates our proposed approach, Section 4 focuses on our experiment, and Section 5 draws a conclusion and discusses potential room for improvement in studying the re-identification problem.

2 Related Work

Re-identification has long been a challenging task in computer vision. Unlike image classification, where images are assigned to classes, re-identification aims to identify a probe image in an image gallery. While image classification achieves successful results [20, 22, 25] thanks to large popular datasets such as COCO [12] or ImageNet [3, 10], re-identification does not yet have sufficiently large datasets to train models. DukeMTMC [4] and Market-1501 [29] are datasets specifically for person re-identification, while Veri-776 [14] and VehicleID [13] are for vehicle re-identification. These datasets share a common disadvantage, which is the small number of images per identity. Intra-class variability and inter-class similarity are also common problems in re-identification due to diverse backgrounds or similar looks.

Novel approaches to overcome the above disadvantages have been proposed recently. Hermans et al. show that triplet loss [24, 18, 2] is suitable for the re-identification task since it optimizes the embedding space so that images with the same identity are closer to each other than to images with different identities [6]. Hermans et al. also propose the Batch Hard technique, which selects the hardest positive and hardest negative samples within a batch for each anchor, reducing the intra-class variability of an identity [6, 2, 11].
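The Batch Hard idea can be illustrated with a minimal sketch. It is not the implementation used in [6] or in this paper: the function names, the plain-list embedding format, and the margin value of 0.3 are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Average hinge loss over anchors, using the hardest positive
    (farthest same-ID sample) and hardest negative (closest other-ID sample).
    The margin default is a placeholder, not a value from the paper."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, other)
            for j, (other, other_label) in enumerate(zip(embeddings, labels))
            if other_label == label and j != i
        ]
        negatives = [
            euclidean(anchor, other)
            for other, other_label in zip(embeddings, labels)
            if other_label != label
        ]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet in the batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

When identities are well separated in the embedding space, every hinge term is zero and the loss vanishes; the loss only becomes positive when some same-ID sample lies farther from the anchor than a different-ID sample (up to the margin).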

Besides data mining techniques and alternative loss functions, there have been several efforts to implement new models designed for re-identification [19, 13, 6, 23, 9]. Specifically, Suprem et al. focus on using attention-based regularizers [19, 9] to extract more global and local features and ensure low sparsity of activations. Wang et al. use 20 key point locations and an attention mechanism to extract orientation-based local features, and then fuse the extracted features with global features for orientation-invariant feature embedding [23].

Re-ranking is also an important post-processing method that is worth considering in re-identification. Zhong et al. propose a re-ranking method that encodes the k-reciprocal nearest neighbors of a probe image to calculate the k-reciprocal feature [30]. The Jaccard distance is then re-calculated and combined with the original distance to get the final distance. Khorramshahi et al. use the triplet probabilistic embedding [17] proposed by Sankaranarayanan et al. to create a similarity score for the re-ranking task [9]. Huang et al. propose a metadata distance, which uses classification confidence and confusion distance. The metadata distance is then combined with the original distance to get the final distance [7].

3 Proposed Approach

3.1 System Overview

The overview of our system can be generalized into three main stages: pre-processing, deep embedding computing, and post-processing. The system is described in Figure 1. Pre-processing is necessary since the bounding boxes of the provided dataset are loosely cropped. The loosely cropped images contain unnecessary information, which hinders the performance of our model.

Figure 1: System Overview.

The deep metric embedding module is a combination of two models, GLAMOR [19] and Counter GLAMOR, that are trained on the provided dataset. The output of the module is an $N \times M$ distance matrix, where $N$ is the number of images in the query set and $M$ is the number of images in the gallery. Additional classifiers are also trained on the provided dataset to extract metadata attributes for further post-processing.

Post-processing is essential in re-identification since it removes false-positive images from the top of the ranking. Illumination conditions, vehicle poses, and various other factors affect the outputs negatively. Figure 2 shows an example of two images with close embedding distance due to similarities in brightness, pose, color, and occlusion.

Figure 2: An example of two images with close embedding distance due to similar brightness, pose, color, and occlusion.

3.2 Pre-processing

3.2.1 Detectron2

We adopt Detectron2 [27], pre-trained on the MS COCO dataset [12], to detect the vehicle in an image and then crop its bounding box out of the image. Detectron2 is a Facebook platform for object detection and segmentation that implements state-of-the-art object detection algorithms, including Mask R-CNN [5]. We perform image cropping on the training, query, and test sets, and then use the cropped images for training as well as for evaluating the models.
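The cropping step can be sketched independently of the detector. The snippet below does not call Detectron2; it assumes detections have already been converted into a hypothetical list of dictionaries with `label`, `score`, and `box` keys, and the class names and score threshold are illustrative choices rather than the paper's settings.

```python
def pick_vehicle_box(detections, vehicle_classes=("car", "truck", "bus"), min_score=0.5):
    """Return the box (x1, y1, x2, y2) of the highest-scoring vehicle detection,
    or None when no detection qualifies. The detection format is hypothetical."""
    candidates = [
        d for d in detections
        if d["label"] in vehicle_classes and d["score"] >= min_score
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["score"])["box"]


def crop(image, box):
    """Crop a row-major image (a list of pixel rows) to the box,
    clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(width, int(x2)), min(height, int(y2))
    return [row[x1:x2] for row in image[y1:y2]]
```

A reasonable fallback, when `pick_vehicle_box` returns `None`, is to keep the original loosely cropped image so that no sample is dropped.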

3.2.2 Image Labeling for Vehicle Attribute Extractor

As shown in Figure 2, the car types of the two vehicles do not match. Even though the two images have a close embedding distance, that distance is mostly driven by noise features. Therefore, vehicle metadata attributes should be extracted to eliminate the influence of undesired features such as obstacles in the background.

We adopt ResNeXt101 [28, 1] pre-trained on ImageNet [10] for rapid convergence. We train ResNeXt101 to classify color and type.

The given color labels and type labels do not reflect the training set. For example, the training set does not contain any orange cars. The number of cars per category is also unevenly distributed; there is a lack of RV or bus images. Therefore, we cluster types based on their common visual attributes. For types, we suggest six categories: small vehicle (sedan, hatchback, estate, and sports car), big vehicle (SUV and MPV), van, pickup, truck, and long car (bus and RV). For color, we exclude orange, pink, and purple.

The query set and test set, however, do contain the excluded categories. Moreover, there are different cameras in the query and test sets (the training set is collected from cameras while the query and test sets are collected from cameras), so performing prediction on the query and test sets will eventually result in incorrect classification. Therefore, we extract the features before the last fully-connected layer and calculate the Euclidean distance between the query set and the test set for the re-ranking process.
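The metadata distance described above, a Euclidean distance between features taken before the last fully-connected layer, can be sketched as follows. The function name and the plain-list feature format are assumptions; in practice the features would come from the trained color and type classifiers.

```python
import math


def pairwise_euclidean(query_feats, gallery_feats):
    """Return a matrix whose entry [i][j] is the Euclidean distance between
    query feature i and gallery feature j."""
    return [
        [
            math.sqrt(sum((q - g) ** 2 for q, g in zip(query_vec, gallery_vec)))
            for gallery_vec in gallery_feats
        ]
        for query_vec in query_feats
    ]
```

Working with feature distances rather than predicted labels avoids committing to a class that the classifier never saw during training, such as the excluded colors.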

3.3 Deep Embedding Computing

We adopt GLAMOR, an end-to-end re-identification model with a ResNet50 backbone and an attention mechanism, proposed by Suprem et al. [19]. GLAMOR introduces two modules. The Global Attention Module reduces sparsity and enhances feature extraction, while the Local Attention Module extracts unsupervised part-based features. We modify the original model slightly to increase performance: instead of the original Local Attention Module, we use the Convolutional Block Attention Module (CBAM) as our local feature extractor because CBAM focuses on two principal dimensions, spatial and channel [25, 26]. As an adaptive feature refinement module, CBAM learns where to emphasize or suppress the information passed forward to the later convolutional blocks. The detailed architecture of GLAMOR is shown in Figure 3.

Figure 3: Architecture of GLAMOR.

We also observe a loss of information at the concatenation step of the current GLAMOR implementation. Suprem et al. apply a channel-wise mask to combine global features and local features [19]. However, only half of each is fed forward to the later convolutional blocks. The sum $F$ of global features $F_G$ and local features $F_L$, where $F_G, F_L \in \mathbb{R}^{C \times H \times W}$, is calculated as follows:

$$F = M \odot F_G + (1 - M) \odot F_L \tag{1}$$

where $M \in \{0, 1\}^{C}$ is the channel-wise mask, $M_i = 1$ for each $i \le C/2$, and $M_i = 0$ otherwise. Therefore, we propose another concatenation formula to counter the loss of information in Equation (1) just by swapping the mask position:

$$F' = (1 - M) \odot F_G + M \odot F_L \tag{2}$$

The concatenation formula in Equation (2) is used for a second GLAMOR model, Counter GLAMOR. The distance matrices of the two models are then averaged to obtain the final result. The proposed method significantly increases the accuracy due to generalization and balancing effects.
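A minimal sketch of the two masking schemes and of the distance averaging is given below. It makes several assumptions that the text does not confirm: features are flattened to one value per channel, the mask keeps the first half of the channels, and the mask is applied to the global branch in the first model. All function names are hypothetical.

```python
def half_mask(num_channels):
    """Binary channel mask: 1 for the first half of the channels, 0 for the rest
    (an assumed layout)."""
    half = num_channels // 2
    return [1] * half + [0] * (num_channels - half)


def fuse(global_feats, local_feats, mask):
    """Per channel, keep the global feature where the mask is 1
    and the local feature where it is 0."""
    return [m * g + (1 - m) * l for g, l, m in zip(global_feats, local_feats, mask)]


def counter_fuse(global_feats, local_feats, mask):
    """Same fusion with the mask position swapped, so the halves discarded
    by fuse() are the ones kept here."""
    inverted = [1 - m for m in mask]
    return fuse(global_feats, local_feats, inverted)


def average_distances(dist_a, dist_b):
    """Element-wise mean of two equally shaped distance matrices."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(dist_a, dist_b)]
```

Between them, the two fused outputs cover every channel of both branches exactly once, which is the motivation for averaging the distance matrices of the two models.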

The two models are trained separately on both synthetic data and training data. Training models on synthetic dataset helps models converge faster than training on the real dataset alone. Our models converge in epochs, while a pre-trained ResNet50 baseline model converges after epochs [21].

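Our metric learning method is a combination of batch hard triplet loss [24, 18] and softmax loss with label smoothing [20]. The reason is that triplet loss is used for learning embeddings, whereas softmax loss interprets probability distributions over a list of potential outcomes. The combination loss is

$$\mathcal{L} = \alpha \, \mathcal{L}_{triplet} + \beta \, \mathcal{L}_{softmax}$$

where $\alpha$ and $\beta$ are hyperparameters that can be fine-tuned. The revised triplet loss proposed by FaceNet [18] is

$$\mathcal{L}_{triplet}(a, p, n) = \max(0,\; D_{ap} - D_{an} + m)$$

where $a, p, n$ are anchor, positive, and negative samples of a triplet, $D_{ap}$ and $D_{an}$ are the distances from an anchor sample to a positive sample and to a negative sample, and $m$ is the margin constraint. The softmax with label smoothing proposed by Szegedy et al. [20] is

$$\mathcal{L}_{softmax} = \sum_{i=1}^{N} -q_i \log(p_i), \qquad q_i = \begin{cases} 1 - \frac{N-1}{N}\varepsilon & \text{if } i = y \\ \frac{\varepsilon}{N} & \text{otherwise} \end{cases}$$

where $y$ is the ground truth ID label, $p_i$ is the predicted probability of class $i$ obtained from the ID prediction logits, $N$ is the number of IDs in the dataset, and $\varepsilon$ is a hyperparameter to reduce over-confidence of classifiers.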
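The label-smoothed softmax loss and its weighted combination with a triplet loss can be sketched as follows. This uses the common formulation in which the target class receives weight $1 - \frac{N-1}{N}\varepsilon$ and every other class receives $\frac{\varepsilon}{N}$; the function names, the default $\varepsilon = 0.1$, and the default loss weights are assumptions, not values reported in the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum for z in logits]


def label_smoothing_loss(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution."""
    n = len(logits)
    loss = 0.0
    for i, log_prob in enumerate(log_softmax(logits)):
        q = 1.0 - (n - 1) / n * epsilon if i == target else epsilon / n
        loss -= q * log_prob
    return loss


def combined_loss(triplet_loss, softmax_loss, alpha=1.0, beta=1.0):
    """Weighted sum of the two losses; alpha and beta are tunable weights."""
    return alpha * triplet_loss + beta * softmax_loss
```

With `epsilon=0` the smoothed loss reduces to the ordinary cross-entropy, which is a convenient sanity check when tuning the smoothing strength.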

3.4 Post-processing

3.4.1 Re-ranking

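We adopt the re-ranking with k-reciprocal encoding method [30] proposed by Zhong et al. and modify the formula to include the Euclidean distance embedding of metadata attributes. Given a probe image $p$ and a gallery image $g_i \in G$, where $G$ is the gallery set, the revised original distance is

$$d'(p, g_i) = d(p, g_i) + \sum_{f} \gamma_f \, d_f(p, g_i)$$

where $d(p, g_i)$ is the original distance between $p$ and $g_i$, $\gamma_f$ is the hyperparameter of feature $f$ for fine-tuning, and $d_f(p, g_i)$ is the metadata distance between $p$ and $g_i$ of feature $f$. We then generate the k-reciprocal nearest neighbor set $R(p, k)$ and re-calculate the pairwise distance between the probe image $p$ and the gallery image $g_i$ using the Jaccard distance and a more robust k-reciprocal nearest neighbor set $R^*$:

$$d_J(p, g_i) = 1 - \frac{|R^*(p, k) \cap R^*(g_i, k)|}{|R^*(p, k) \cup R^*(g_i, k)|}$$

The final distance embedding is

$$d^*(p, g_i) = (1 - \lambda) \, d_J(p, g_i) + \lambda \, d'(p, g_i)$$

where $\lambda \in [0, 1]$ weights the revised original distance against the Jaccard distance.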

Figure 4: The effects of re-ranking method.
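The re-ranking procedure can be illustrated with a simplified sketch. It assumes the metadata distances are added to the original distance as a weighted sum, works on a single square distance matrix over all images (probe and gallery together), and omits the neighbor-set expansion and weighted encoding of [30]. The function names and the default values of `k` and `lam` are hypothetical.

```python
def revise_distance(dist, meta_dists, weights):
    """Add each weighted metadata distance matrix to the original matrix."""
    n = len(dist)
    return [
        [dist[i][j] + sum(w * m[i][j] for w, m in zip(weights, meta_dists))
         for j in range(n)]
        for i in range(n)
    ]


def k_nearest(dist, i, k):
    """The k nearest other images to image i, plus i itself."""
    others = sorted((j for j in range(len(dist)) if j != i), key=lambda j: dist[i][j])
    return set(others[:k]) | {i}


def k_reciprocal(dist, i, k):
    """Neighbors of i that also count i among their own k nearest neighbors."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}


def jaccard_distance(set_a, set_b):
    """1 minus intersection-over-union of two non-empty neighbor sets."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def rerank(dist, meta_dists, weights, k=3, lam=0.3):
    """Blend the Jaccard distance of k-reciprocal sets with the revised distance."""
    revised = revise_distance(dist, meta_dists, weights)
    n = len(revised)
    neighbors = [k_reciprocal(revised, i, k) for i in range(n)]
    return [
        [(1 - lam) * jaccard_distance(neighbors[i], neighbors[j]) + lam * revised[i][j]
         for j in range(n)]
        for i in range(n)
    ]
```

Images that share the same reciprocal neighborhood get a Jaccard distance of zero, so their final distance shrinks relative to images from a different neighborhood, even when the raw embedding distances are noisy.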

3.4.2 Distance Averaging by Track

Given the test track for each test image, we calculate the average distance between a probe image and a track. Then, we replace the distance between the probe image and each image in that track with the calculated average distance. The problem becomes finding tracks that have the most similar car to the probe image instead of finding individual images. The method increases mAP since the top results will be populated with correct images from the same track for uncomplicated cases.
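Distance averaging by track can be sketched for a single probe image as follows. The function name and the representation of tracks as one ID per gallery image are assumptions made for illustration.

```python
from collections import defaultdict


def average_by_track(probe_dists, track_ids):
    """Replace each gallery distance with the mean distance of its track.

    probe_dists[j] is the distance from the probe to gallery image j,
    and track_ids[j] is the track that gallery image j belongs to."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for distance, track in zip(probe_dists, track_ids):
        totals[track] += distance
        counts[track] += 1
    return [totals[track] / counts[track] for track in track_ids]
```

After this step, all images of a track share one distance, so sorting the gallery ranks whole tracks rather than individual images.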

4 Experiment

We build our models on the vehicle re-identification baseline [8] and its provided utilities. After being cropped with Detectron2 [27], the images are resized to for training the GLAMOR models [19] and for training the ResNeXt101 model [28]. Image size may largely affect the re-identification results according to [15]; therefore, we choose as our image size because vehicle images tend to be wider than they are tall. The default image size of the pre-trained ResNeXt101 is , so we keep it so that transfer learning remains effective. The images are then augmented with flipping and cropping techniques, color jitter, color augmentation [10], and random erasing [31].

The GLAMOR models are pre-trained with the synthetic data for around epochs with an initial learning rate of , learning rate decay of for every epochs, a margin of , and the ratio between triplet loss and softmax loss. After that, we feed the transformed images above to the GLAMOR models for the re-identification task with similar parameters. The models converge quickly in around epochs thanks to the pre-trained weights.
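The learning rate schedule described above reads as a step decay. A minimal sketch follows, assuming the decay is multiplicative and applied once every fixed number of epochs; the function name is hypothetical and the caller supplies all values, since the paper's settings are not reproduced here.

```python
def step_lr(initial_lr, decay, step_size, epoch):
    """Learning rate at a given epoch under multiplicative step decay:
    the rate is multiplied by `decay` once every `step_size` epochs."""
    return initial_lr * decay ** (epoch // step_size)
```

For instance, with placeholder values `initial_lr=0.1`, `decay=0.5`, and `step_size=10`, the rate stays at 0.1 for epochs 0 to 9 and halves at epoch 10.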

We repeat the same procedure with ResNeXt101 but with pre-trained weights from ImageNet [10, 1], instead of the synthetic data. After training for epochs for the re-identification task with an initial learning rate of , learning rate decay of for every epochs, and the same margin and loss ratio, we keep those weights to train the ResNeXt101 models further to classify type and color. For the classification task, we train the models using softmax loss only with the learning rate of and learning rate decay of for every epochs.

Even though we have weights of two different models, GLAMOR and ResNeXt101, for the re-identification task, we find that GLAMOR outperforms ResNeXt101. Therefore, we decide to use only the GLAMOR models for the re-identification task. On the other hand, since ResNeXt101 is a state-of-the-art image classification model, we use it as our metadata attribute extractor.

Table 1 compares the result of our system with those of other teams. Our proposed approach achieves an mAP of 37.25% and ranks th in Track 2 of the AI City Challenge 2020. Table 2 compares our result with two different baseline results provided in [21].

Rank   Team ID   Team Name   mAP (%)
Insight Centre

Table 1: Track 2 Competition Results. Our result is highlighted in bold.

Model   Rank@1(%)   mAP (%)

Table 2: Comparison with baseline models.
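For reference, the mAP metric reported in the tables can be computed as in the sketch below. This is the textbook definition over a full ranking; the challenge's exact evaluation protocol may differ, for example by truncating the ranked list. The function names and input format are assumptions.

```python
def average_precision(ranked_ids, probe_id):
    """Mean of the precision values at each rank where a correct match appears."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == probe_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings is a list of (probe_id, ranked gallery IDs) pairs."""
    scores = [average_precision(ranked, probe) for probe, ranked in rankings]
    return sum(scores) / len(scores) if scores else 0.0
```

Rank@1, the other metric in Table 2, only checks whether the first returned image is correct, whereas mAP rewards placing every correct image near the top.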

5 Conclusion

In this paper, we introduce an attention-driven re-identification method based on GLAMOR [19]. We also incorporate metadata attribute embedding in the re-ranking process, which boosts the performance of the model. In addition, several techniques in pre-processing and post-processing are adopted to enhance the results. Below are topics that should be further studied in order to improve our system:

  • Image super-resolution for pre-processing.

  • GAN-based models in vehicle re-identification.

  • View-aware feature extraction.

  • Intensive hyperparameter tuning.


References

  • [1] R. Cadene (2019) Pretrained models for Pytorch.
  • [2] G. Chen, T. Zhang, J. Lu, and J. Zhou (2019) Deep meta metric learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9546–9555.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [4] M. Gou, S. Karanam, W. Liu, O. Camps, and R. J. Radke (2017-07) DukeMTMC4ReID: a large-scale multi-camera person re-identification dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [5] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. CoRR abs/1703.06870.
  • [6] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737.
  • [7] T. Huang, J. Cai, H. Yang, H. Hsu, and J. Hwang (2019-06) Multi-view vehicle re-identification using temporal attention model and metadata re-ranking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [8] Jakel21 (2019) Vehicle ReID baseline.
  • [9] P. Khorramshahi, N. Peri, A. Kumar, A. Shah, and R. Chellappa (2019-06) Attention driven vehicle re-identification and unsupervised anomaly detection for traffic understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017-05) ImageNet classification with deep convolutional neural networks. Commun. ACM 60 (6), pp. 84–90. ISSN 0001-0782.
  • [11] R. Kumar, E. Weill, F. Aghdasi, and P. Sriram (2019) Vehicle re-identification: an efficient baseline using triplet embedding. arXiv:1901.01015.
  • [12] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft COCO: common objects in context. arXiv:1405.0312.
  • [13] H. Liu, Y. Tian, Y. Wang, L. Pang, and T. Huang (2016) Deep relative distance learning: tell the difference between similar vehicles. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2167–2175.
  • [14] X. Liu, W. Liu, H. Ma, and H. Fu (2016) Large-scale vehicle re-identification in urban surveillance videos. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
  • [15] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. arXiv:1903.07071.
  • [16] K. Nguyen, T. Hoang, M. Tran, T. Le, N. Bui, T. Do, V. Vo-Ho, Q. Luong, M. Tran, T. Nguyen, T. Truong, V. Nguyen, and M. Do (2019-06) Vehicle re-identification with learned representation and spatial verification and abnormality detection with multi-adaptive vehicle detectors for traffic video analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [17] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa (2016-09) Triplet probabilistic embedding for face verification and clustering. In 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS). ISBN 9781467397339.
  • [18] F. Schroff, D. Kalenichenko, and J. Philbin (2015-06) FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ISBN 9781467369640.
  • [19] A. Suprem and C. Pu (2020) Looking GLAMORous: vehicle re-id in heterogeneous cameras networks with global and local attention. arXiv:2002.02256.
  • [20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2015) Rethinking the Inception architecture for computer vision. arXiv:1512.00567.
  • [21] Z. Tang, M. Naphade, M. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J. Hwang (2019) CityFlow: a city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. arXiv:1903.09254.
  • [22] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. arXiv:1704.06904.
  • [23] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang (2017) Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 379–387.
  • [24] K. Q. Weinberger and L. K. Saul (2009-06) Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10, pp. 207–244. ISSN 1532-4435.
  • [25] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. arXiv:1807.06521.
  • [26] S. Woo, J. Park, J. Lee, and I. S. Kweon (2019) Official PyTorch code for "BAM: Bottleneck Attention Module (BMVC2018)" and "CBAM: Convolutional Block Attention Module (ECCV2018)".
  • [27] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2.
  • [28] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2016) Aggregated residual transformations for deep neural networks. arXiv:1611.05431.
  • [29] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124.
  • [30] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. arXiv:1701.08398.
  • [31] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv:1708.04896.