Vehicle re-identification is a challenging task due to intra-class variability and inter-class similarity across non-overlapping cameras. To tackle these problems, recently proposed methods require additional annotation to extract more features for excluding false-positive images. In this paper, we propose a model powered by adaptive attention modules that requires fewer label annotations but still outperforms previous models. We also include a re-ranking method that takes into account the importance of metadata feature embeddings. The proposed method is evaluated on the CVPR AI City Challenge 2020 dataset and achieves an mAP of 37.25%.
In recent years, computer vision has advanced across its sub-fields thanks to the continuing development of Convolutional Neural Networks (CNNs). Among these sub-fields, object re-identification has gained attention lately because of several technical difficulties. The first challenge is intra-class variability: because of illumination conditions, obstacles, and occlusions, an object may appear different across non-overlapping cameras. The second challenge is inter-class similarity: two objects may look alike, such as identical twins or cars from the same manufacturing process. Unlike image classification, whose task is to classify images based on visual content, object re-identification demands a robust system that responds to both local and global features. Local features involve differentiating two objects with similar viewpoints. In contrast, global features involve clustering images that belong to the same object, regardless of viewpoint. Re-identification systems also need good generalization ability to deal with unseen features, given the many object variations.
Initially, most research on re-identification focused on person re-identification, and vehicle re-identification has successfully adopted those contributions despite the difference in domains [6, 11, 15, 21, 19, 14, 13, 2, 16, 23, 17, 7, 9]. However, the majority of these projects adopt ImageNet-pretrained, classification-specific models and perform transfer learning for the vehicle re-identification task. Our proposed method builds on GLAMOR, a model designed for re-identification proposed by Suprem et al., which shows that training from scratch with a smaller dataset ( real images and synthetic images versus M images of ImageNet) does not necessarily result in poorer performance. In fact, GLAMOR outperforms the ResNet50 baseline with % mAP improvement. We also propose a slight modification to k-reciprocal encoding re-ranking so that it includes the metadata attributes during the re-ranking process. The remainder of the paper is structured as follows: Section 2 reviews the related work, Section 3 illustrates our proposed approach, Section 4 focuses on our experiment, and Section 5 draws a conclusion and discusses potential room for improvement in studying the re-identification problem.
Re-identification has long been a challenging task in computer vision. Unlike image classification, where images are assigned to classes, re-identification aims to identify a probe image in an image gallery. While image classification achieves successful results [20, 22, 25] thanks to large popular datasets such as COCO or ImageNet, re-identification does not yet have sufficiently large datasets to train models. DukeMTMC and Market-1501 are datasets specifically for person re-identification, while Veri-776 and VehicleID are for vehicle re-identification. These datasets share a common disadvantage, which is the lack of images per identity. Intra-class variability and inter-class similarity are also common problems in re-identification due to diverse backgrounds or similar looks.
Novel approaches to overcome the above disadvantages have been proposed recently. Hermans et al. show that triplet loss [24, 18, 2] is suitable for the re-identification task since it optimizes the embedding space so that images with the same identity are closer to each other than to those with different identities. Hermans et al. also propose the Batch Hard technique to select the hardest negative samples within a batch, minimizing the intra-class variability of an identity [6, 2, 11].
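The Batch Hard idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: for each anchor it simply picks the farthest same-identity sample and the closest different-identity sample in the batch, then applies a margin. The function name `batch_hard_triplet_loss`, the default margin, and the toy embeddings are assumptions made for this example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean over anchors of max(d(a, hardest_pos) - d(a, hardest_neg) + margin, 0).

    Anchors without a positive or a negative in the batch are skipped.
    The default margin is an illustrative assumption.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if j != i and labels[j] == label]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != label]
        if not pos or not neg:
            continue
        # Hardest positive is the farthest one; hardest negative is the closest one.
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch: two identities with two images each (values are made up).
embeddings = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
labels = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(embeddings, labels, margin=0.3)
```

In this toy batch the two identities are already well separated, so every anchor satisfies the margin and the loss is zero.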
Besides data mining techniques and alternative loss functions, there have been several efforts to implement new models designed for re-identification [19, 13, 6, 23, 9]. Specifically, Suprem et al. focus on using attention-based regularizers [19, 9] to extract more global and local features and ensure low sparsity of activations. Wang et al. use 20 key point locations to extract orientation-based local features through an attention mechanism, and then fuse the extracted features with global features for orientation-invariant feature embedding.
Re-ranking is also an important post-processing method worth considering in re-identification. Zhong et al. propose a re-ranking method that encodes the k-reciprocal nearest neighbors of a probe image to calculate the k-reciprocal feature. The Jaccard distance is then re-calculated and combined with the original distance to get the final distance. Khorramshahi et al. use the triplet probabilistic embedding proposed by Sankaranarayanan et al. to create a similarity score for the re-ranking task. Huang et al. propose a metadata distance, which uses classification confidence and confusion distance. The metadata distance is then combined with the original distance to get the final distance.
The overview of our system can be generalized into three main stages: pre-processing, deep embedding computing, and post-processing. The system is described in Figure 1. Pre-processing is necessary since the bounding boxes of the provided dataset are loosely cropped. The loosely cropped images contain unnecessary information, which hinders the performance of our model.
The deep metric embedding module is a combination of two models, GLAMOR and Counter GLAMOR, that are trained on the provided dataset. The output of the module is a distance matrix between the images in the query set and the images in the gallery. Additional classifiers are also trained on the provided dataset to extract metadata attributes for further post-processing.
Post-processing is essential in re-identification since it removes false-positive images at the top. Illumination conditions, vehicle poses, and other various factors affect the outputs negatively. Figure 2 shows an example of two images with close embedding distance due to similarities in brightness, pose, color, and occlusion.
We adopt Detectron2 pretrained on the MS COCO dataset to detect the vehicle in each image and then crop its bounding box out of the image. Detectron2 is a Facebook platform for object detection and segmentation that implements state-of-the-art object detection algorithms, including Mask R-CNN. We perform image cropping on the training, query, and test sets, and then use the cropped images for both training and evaluating the models.
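The cropping step itself is simple once a detector has returned a bounding box. The sketch below is only an illustration and does not call Detectron2: it assumes a box is already available as `(x1, y1, x2, y2)` pixel coordinates and that the image is a non-empty list of pixel rows. The helper name `crop_to_box` is hypothetical.

```python
import math

def crop_to_box(image, box):
    """Crop a non-empty image (list of pixel rows) to a bounding box.

    box is (x1, y1, x2, y2) in pixel coordinates, possibly fractional or
    partly outside the image; it is expanded to whole pixels and clamped.
    """
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    left = max(0, math.floor(x1))
    top = max(0, math.floor(y1))
    right = min(width, math.ceil(x2))
    bottom = min(height, math.ceil(y2))
    return [row[left:right] for row in image[top:bottom]]

# Toy 4x4 "image" whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tight = crop_to_box(image, (1, 1, 3, 3))
```

Clamping matters in practice because detector boxes can extend slightly past the image border.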
As shown in Figure 2, the car types of the two images do not match. Even though the images have a close embedding distance, that distance is mostly driven by noise features. Therefore, vehicle metadata attributes should be extracted to eliminate undesired features such as obstacles in the background.
We adopt ResNeXt101 [28, 1] pre-trained on ImageNet for rapid convergence. We train ResNeXt101 to classify color and type.
The given color labels and type labels do not reflect the training set. For example, the training set does not contain any orange cars. The number of cars per category is also unevenly distributed; for instance, there are few RV or bus images. Therefore, we cluster types based on their common visual attributes. For types, we suggest six categories: small vehicle (sedan, hatchback, estate, and sports car), big vehicle (SUV and MPV), van, pickup, truck, and long car (bus and RV). For color, we exclude orange, pink, and purple.
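The grouping described above can be written down as a small lookup table. This is a sketch only: the category and color names come from the text, but the exact label strings used in the dataset, as well as the helper names `coarse_type` and `keep_color`, are assumptions.

```python
# Coarse type categories and the fine-grained types they absorb (from the text).
TYPE_GROUPS = {
    "small vehicle": ["sedan", "hatchback", "estate", "sports car"],
    "big vehicle": ["SUV", "MPV"],
    "van": ["van"],
    "pickup": ["pickup"],
    "truck": ["truck"],
    "long car": ["bus", "RV"],
}

# Colors dropped from the color classifier's label set.
EXCLUDED_COLORS = {"orange", "pink", "purple"}

# Reverse lookup: fine-grained type -> coarse category.
FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in TYPE_GROUPS.items()
                  for fine in fines}

def coarse_type(fine_type):
    """Map a fine-grained type label to its coarse category."""
    return FINE_TO_COARSE[fine_type]

def keep_color(color):
    """True if the color label is kept for training the color classifier."""
    return color not in EXCLUDED_COLORS
```

For example, `coarse_type("MPV")` returns `"big vehicle"`, and `keep_color("orange")` returns `False`.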
The query set and test set, however, do contain the excluded categories. Moreover, there are different cameras in the query and test sets (the training set is collected from cameras while the query and test sets are collected from cameras), so performing prediction on the query and test sets will eventually result in incorrect classification. Therefore, we extract the features before the last fully-connected layer and calculate the Euclidean distance between the query set and the test set for the re-ranking process.
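As a minimal sketch of this step, the function below computes the Euclidean distance between every query feature vector and every test feature vector. It assumes the features have already been extracted as plain lists of floats; the name `pairwise_euclidean` and the toy values are illustrative, not part of the original system.

```python
import math

def pairwise_euclidean(query_feats, gallery_feats):
    """Return a matrix D where D[i][j] is the Euclidean distance between
    query feature i and gallery (test) feature j."""
    return [
        [math.sqrt(sum((q - g) ** 2 for q, g in zip(qf, gf)))
         for gf in gallery_feats]
        for qf in query_feats
    ]

# Toy features (made-up values): one query image, two test images.
query_feats = [[0.0, 0.0]]
gallery_feats = [[3.0, 4.0], [0.0, 0.0]]
meta_dist = pairwise_euclidean(query_feats, gallery_feats)
```

Using distances between penultimate-layer features, rather than predicted labels, avoids committing to a class that the classifier never saw during training.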
We adopt GLAMOR, an end-to-end ResNet50-backboned re-identification model powered by an attention mechanism, proposed by Suprem et al. GLAMOR introduces two modules. The Global Attention Module reduces sparsity and enhances feature extraction, while the Local Attention Module extracts unsupervised part-based features. Unlike the original model, we have modified the model slightly to improve performance. Instead of using the original Local Attention Module, we use the Convolutional Block Attention Module (CBAM) as our local feature extractor because CBAM focuses on two principal dimensions: spatial and channel [25, 26]. As an adaptive feature refinement module, CBAM learns where to emphasize or suppress the information passed forward to the later convolutional blocks. The detailed architecture of GLAMOR is shown in Figure 3.
We also observe a loss of information in the current GLAMOR implementation at the concatenation step. Suprem et al. apply a channel-wise mask to combine global features and local features. However, only half of each is fed forward to later convolutional blocks. The sum of global features and local features , where , is calculated as follows:
where , , and for each , and . Therefore, we propose another concatenation formula to counter the loss of information in Equation (1) just by swapping the mask position:
The concatenation formula in Equation (2) is used for another GLAMOR. The distance embedding matrix of two GLAMORs is then averaged for the final result. The proposed method significantly increases the accuracy due to generalization and balancing effects.
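The idea of the two complementary masks can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: it assumes a binary channel-wise mask with half of its entries set to 1, that the original model keeps global features where the mask is 1 and local features elsewhere, and that the counter model uses the complementary mask. The names `combine` and `average_distances` are hypothetical, and each channel is represented by a single placeholder value instead of a feature map.

```python
def combine(global_feats, local_feats, mask):
    """Per channel: take the global feature where mask is 1, else the local one."""
    return [g if m == 1 else loc
            for g, loc, m in zip(global_feats, local_feats, mask)]

def average_distances(dist_a, dist_b):
    """Element-wise mean of two equally shaped distance matrices."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(dist_a, dist_b)]

# Toy per-channel features (placeholders; real features are activation maps).
global_feats = ["g0", "g1", "g2", "g3"]
local_feats = ["l0", "l1", "l2", "l3"]
mask = [1, 1, 0, 0]                    # assumed: half of the channels set to 1
counter_mask = [1 - m for m in mask]   # swapped mask position

fused = combine(global_feats, local_feats, mask)
counter_fused = combine(global_feats, local_feats, counter_mask)

# Each model yields its own query-gallery distance matrix; the two are averaged.
final = average_distances([[1.0, 3.0]], [[3.0, 5.0]])
```

Under these assumptions, every global and local channel is used by exactly one of the two models, which is the sense in which the second model counters the information dropped by the first.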
The two models are trained separately on both synthetic data and training data. Training the models on the synthetic dataset helps them converge faster than training on the real dataset alone. Our models converge in epochs, while a pre-trained ResNet50 baseline model converges after epochs.
We combine triplet loss with softmax loss because triplet loss is used for learning embeddings, whereas softmax loss interprets probability distributions over a list of potential outcomes. The combined loss is
are hyperparameters that can be fine-tuned. The revised triplet loss proposed by FaceNet is
where are anchor, positive, and negative samples of a triplet, and are the distance from an anchor sample to a positive sample and to a negative sample, and is the margin constraint. The softmax with label smoothing proposed by Szegedy et al.  is
where is the ground truth ID label, is the ID prediction logits of class, is the number of IDs in the dataset, and is a hyperparameter to reduce over-confidence of classifiers.
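The loss terms described above can be sketched in plain Python. This is a hedged illustration, not the authors' code: it assumes the standard hinge form of the triplet loss, max(d_ap - d_an + margin, 0), the common label-smoothing formulation in which the true class receives target 1 - (N - 1) / N * epsilon and every other class receives epsilon / N, and a weighted sum of the two terms. The weight names `w_triplet` and `w_softmax` and all default values are assumptions.

```python
import math

def triplet_loss(d_ap, d_an, margin):
    """Hinge triplet loss on anchor-positive and anchor-negative distances."""
    return max(d_ap - d_an + margin, 0.0)

def smoothed_softmax_loss(logits, target, epsilon=0.1):
    """Cross-entropy against label-smoothed targets (assumed formulation)."""
    n = len(logits)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    loss = 0.0
    for i, z in enumerate(logits):
        q = 1 - (n - 1) / n * epsilon if i == target else epsilon / n
        loss -= q * (z - log_norm)
    return loss

def combined_loss(d_ap, d_an, logits, target,
                  margin=0.3, w_triplet=1.0, w_softmax=1.0, epsilon=0.1):
    """Weighted sum of the triplet and label-smoothed softmax terms."""
    return (w_triplet * triplet_loss(d_ap, d_an, margin)
            + w_softmax * smoothed_softmax_loss(logits, target, epsilon))

# Toy example (made-up numbers): a violated triplet and uniform logits.
total = combined_loss(d_ap=2.0, d_an=1.0, logits=[0.0, 0.0, 0.0, 0.0], target=2)
```

With uniform logits the softmax term equals log(N) for any epsilon, because the smoothed targets always sum to one.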
We adopt the re-ranking with k-reciprocal encoding method proposed by Zhong et al. and modify the formula to include the Euclidean distance embedding of metadata attributes. Given a probe image and a gallery image where is the gallery set, the revised original distance matrix is
where is the original distance between and , is the hyperparameter of feature for fine-tuning, and is the metadata distance between and of feature . We then generate the k-reciprocal nearest neighbor set and re-calculate the pairwise distance between the probe image and the gallery image using Jaccard distance and a more robust k-reciprocal nearest neighbor set :
The final distance embedding is
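As a rough illustration of the re-ranking steps described above, the sketch below works on one square distance matrix covering all images (probe and gallery together). It is a simplification under stated assumptions, not the authors' formulation: the metadata distances are assumed to be added to the original distance as a weighted sum, the k-reciprocal sets are used without the neighbor-set expansion or weighting of the full method, and the final distance is assumed to be the blend (1 - lam) * Jaccard + lam * revised distance. All function names and the toy data are hypothetical.

```python
def add_metadata(dist, meta_dists, weights):
    """Revised distance: dist + sum over features of weight * metadata distance."""
    return [[d + sum(w * md[i][j] for w, md in zip(weights, meta_dists))
             for j, d in enumerate(row)]
            for i, row in enumerate(dist)]

def k_nearest(dist, i, k):
    """Indices of the k items closest to item i (i itself included)."""
    order = sorted(range(len(dist)), key=lambda j: dist[i][j])
    return set(order[:k])

def k_reciprocal(dist, i, k):
    """Neighbors j of i such that i is also among the k nearest of j."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}

def jaccard_distance(dist, i, j, k):
    """One minus the overlap ratio of the two k-reciprocal neighbor sets."""
    ri, rj = k_reciprocal(dist, i, k), k_reciprocal(dist, j, k)
    union = ri | rj
    return 1.0 - len(ri & rj) / len(union) if union else 1.0

def final_distance(dist, i, j, k, lam):
    """Blend of Jaccard distance and the (revised) original distance."""
    return (1 - lam) * jaccard_distance(dist, i, j, k) + lam * dist[i][j]

# Toy data: four images on a line, forming two tight pairs (made-up values).
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in positions] for a in positions]
same_pair = final_distance(dist, 0, 1, k=2, lam=0.3)
cross_pair = final_distance(dist, 0, 2, k=2, lam=0.3)
```

In the toy data, images 0 and 1 share the same k-reciprocal set, so their Jaccard distance is zero and the final distance shrinks, while images from different pairs are pushed further apart.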
Given the test track for each test image, we calculate the average distance between a probe image and a track. Then, we replace the distance between the probe image and each image in that track with the calculated average distance. The problem becomes finding tracks that have the most similar car to the probe image instead of finding individual images. The method increases mAP since the top results will be populated with correct images from the same track for uncomplicated cases.
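The track-level averaging can be sketched in a few lines. The example assumes the distances from one probe image to all gallery images are given as a list, and that tracks are given as lists of gallery indices; the name `track_average` and the toy numbers are illustrative.

```python
def track_average(dist_row, tracks):
    """Replace each gallery distance with the mean distance over its track.

    dist_row: distances from one probe image to every gallery image.
    tracks: lists of gallery indices; images in no track keep their distance.
    """
    out = list(dist_row)
    for track in tracks:
        if not track:
            continue
        mean = sum(dist_row[i] for i in track) / len(track)
        for i in track:
            out[i] = mean
    return out

# Toy example: images 0 and 1 form one track, image 2 its own track,
# and image 3 belongs to no track.
averaged = track_average([1.0, 3.0, 10.0, 2.0], [[0, 1], [2]])
```

After averaging, all images of a track share one distance, so they move up or down the ranking together.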
Based on , we have enough resources for building our models with the provided utilities. After being cropped with Detectron2, the images are resized to for training the GLAMOR models and for training the ResNeXt101 model. Image size may largely affect the re-identification results according to ; therefore, we choose as our image size because vehicle images tend to be wider than they are tall. The default image size of the pre-trained ResNeXt101 is , so we keep it to make transfer learning efficient. The images are then augmented with flipping and cropping techniques, color jitter, color augmentation, and random erasing.
The GLAMOR models are pre-trained with the synthetic data for around epochs with an initial learning rate of , learning rate decay of for every epochs, a margin of , and the ratio between triplet loss and softmax loss. After that, we feed the transformed images above to the GLAMOR models for the re-identification task with similar parameters. The models converge quickly in around epochs thanks to the pre-trained weights.
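A step-wise learning rate schedule of this kind can be written as a one-line function. The sketch assumes the decay is a multiplicative factor applied once every fixed number of epochs; the function name `step_lr` and the numbers in the example are placeholders, not the values used in the experiments.

```python
def step_lr(initial_lr, decay, step_size, epoch):
    """Learning rate at a given epoch under multiplicative step decay."""
    return initial_lr * decay ** (epoch // step_size)

# Placeholder values for illustration only.
schedule = [step_lr(0.1, 0.5, 10, epoch) for epoch in (0, 9, 10, 25)]
```

With these placeholder values the rate stays at its initial value for the first ten epochs, halves at epoch 10, and has been halved twice by epoch 25.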
We repeat the same procedure with ResNeXt101 but with pre-trained weights from ImageNet [10, 1] instead of the synthetic data. After training for epochs for the re-identification task with an initial learning rate of , learning rate decay of for every epochs, and the same margin and loss ratio, we keep those weights to train the ResNeXt101 models further to classify type and color. For the classification task, we train the models using softmax loss only with the learning rate of and learning rate decay of for every epochs.
Even though we have weights of two different models, GLAMOR and ResNeXt101, for the re-identification task, we find that GLAMOR outperforms ResNeXt101. Therefore, we decide to use only GLAMOR models for the re-identification task. On the other hand, since ResNeXt101 is a state-of-the-art image classification model, we use it as our metadata attribute extractor.
Table 1 compares the result of our system with those of other teams. Our proposed approach achieves mAP of % and ranks th in Track of the AI City Challenge 2020. Table 2 compares our result with two different baseline results provided in .
| Rank | Team ID | Team Name | mAP (%) |
In this paper, we introduce an attention-driven re-identification method based on GLAMOR . We also incorporate metadata attribute embedding in the re-ranking process, which boosts the performance of the model. In addition, several techniques in pre-processing and post-processing are adopted to enhance the results. Below are topics that should be further studied in order to improve our system:
Image super-resolution for pre-processing.
GAN-based models in vehicle re-identification.
View-aware feature extraction.
Intensive hyperparameter tuning.
Pretrained models for Pytorch. https://github.com/Cadene/pretrained-models.pytorch
2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Multi-view vehicle re-identification using temporal attention model and metadata re-ranking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
Attention driven vehicle re-identification and unsupervised anomaly detection for traffic understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
FaceNet: a unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).