Implementation of Vehicle Re-Identification Based on Complementary Features for 2020 AICity Challenge Track2
In this work, we present our solution to the vehicle re-identification (vehicle Re-ID) track of the AI City Challenge 2020 (AIC2020). The purpose of vehicle Re-ID is to retrieve the same vehicle as it appears across multiple cameras, which can contribute substantially to Intelligent Traffic Systems (ITS) and smart cities. Because of variations in vehicle orientation and lighting and the high inter-class similarity, it is difficult to learn a robust and discriminative feature representation. For the vehicle Re-ID track in AIC2020, our method fuses features extracted from different networks in order to exploit the strengths of each network and obtain complementary features. For each single model, several methods such as multi-loss, filter grafting and semi-supervised learning are used to improve the representation ability as much as possible. Top performance in City-Scale Multi-Camera Vehicle Re-Identification demonstrates the advantage of our methods, and we achieved 5th place in the vehicle Re-ID track of AIC2020. The code is available at https://github.com/gggcy/AIC2020_ReID.
With the development of ITS, vehicle Re-ID has attracted increasing attention in the computer vision community [27, 29, 15]. The goal of vehicle Re-ID is to find the vehicles in the gallery that have the same identity as the query vehicle. Compared with other image retrieval tasks, vehicle Re-ID is more challenging for two main reasons. First, the same vehicle may appear in different orientations in different cameras, and the front and rear of a vehicle look very different. Second, two different vehicles of the same brand and series have nearly identical appearance apart from very minor differences.
In the past, license plate recognition was a conventional and efficient vehicle Re-ID solution. However, the license plate may be occluded or dirty, or the video may not be clear enough for the plate to be read. In addition, easily confused characters make license plate recognition unreliable. Therefore, vehicle Re-ID based on the visual features of the vehicle is essential.
At present, feature extraction and metric learning are the mainstream research directions in the field of vehicle Re-ID. Compared with traditional hand-crafted features, Convolutional Neural Networks (CNNs) can automatically learn discriminative high-level semantic features according to task requirements, which greatly improves the performance of vehicle Re-ID. However, occlusion, lighting and diverse viewpoints still pose many challenges.
In recent years, a few powerful networks [6, 9] have been proposed. These networks use residual blocks or densely connected blocks to increase network depth so that deep networks can extract more discriminative features from the training images. In particular, for the vehicle Re-ID task, part-based methods combine local part features with the global feature to improve performance. In this paper, our solution to the vehicle Re-ID track in AIC2020 can be summarized as follows:
We train multiple CNNs with several methods, including multi-loss, filter grafting and part-based feature learning. These methods considerably increase the representation ability of each single network.
We use a semi-supervised method to annotate images in the test set with fake labels, and combine these fake-labeled images with the original training set to improve the final performance.
Several post-processing methods are used to further improve the results.
As several vehicle datasets [14, 1, 24] have been proposed, vehicle Re-ID has attracted increasing attention in recent years. Current work on vehicle Re-ID still focuses on the feature level and on how to extract more informative features, and vehicle Re-ID has improved significantly in recent years thanks to the development of CNNs. Zhu et al. and Guo et al. seek better feature encoding methods. Another line of work extracts highly discriminative features from vehicle parts [25, 23, 5, 13]; these methods mainly rely on strong supervision for each key part. However, the cost of annotation is high and the practical application is limited. Zhou et al. adopt a viewpoint-aware attention model and an adversarial training architecture to implement effective multi-view feature inference from a single-view input. Zhou et al. focus on the uncertainty of vehicle viewpoint in Re-ID and propose an Adversarial Bi-directional LSTM Network (ABLN). Inspired by the human recognition process, Chu et al. propose a novel viewpoint-aware metric learning approach, which learns two metrics, one for similar viewpoints and one for different viewpoints, in two feature spaces.
When vehicle Re-ID is regarded as a retrieval process, re-ranking is a key step to improve its performance. Re-ranking has mainly been studied in generic instance retrieval [3, 18, 10, 20]. The main advantage of many re-ranking methods is that they can be implemented without additional training samples and can be applied to any initial ranking result.
A popular re-ranking approach is k-reciprocal encoding. It encodes the k-reciprocal nearest neighbors of an image into a single vector. To capture relationships among similar samples, a local query expansion is used to obtain more robust k-reciprocal features. The final distance, a combination of the original distance and the Jaccard distance, effectively improves Re-ID performance on multiple large-scale datasets.
The basic framework of our approach is shown in Fig. 1. During the training phase, training images with ID labels are fed into the backbone, which generates a 2048-d feature and a predicted ID label for each image. Two networks with the same structure are trained in parallel and perform filter grafting. We then apply two different loss functions as the optimization objective. The first is the hard triplet loss, which enlarges the feature distance between samples with different labels and reduces the distance between samples with the same label. The triplet loss can be defined as:
$$L_{tri} = \frac{1}{N} \sum_{i=1}^{N} \left[ \| f_a^{i} - f_p^{i} \|_2^2 - \| f_a^{i} - f_n^{i} \|_2^2 + m \right]_+$$
where $f_a$, $f_p$, $f_n$ are the features extracted from the anchor, positive and negative samples respectively, $m$ is the margin, $[\cdot]_+ = \max(\cdot, 0)$, and $N$ is the total number of training images in one batch. To make network training as effective as possible, we choose the hard positive and the hard negative for each anchor, that is, the positive sample farthest from the anchor and the negative sample closest to it within the batch.
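The second is the cross entropy loss. Given a batch of training images, we denote $y_i$ as the ground truth ID label of the $i$-th image, $z_{i,j}$ as the ID prediction logit of the $i$-th image on the $j$-th class, and $N$ as the total number of training images in one batch. The cross entropy loss can be defined as:
$$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i}$$
where $p_{i,j} = \exp(z_{i,j}) / \sum_{k} \exp(z_{i,k})$. Finally, we combine these two losses with a balanced weight, which is defined as:
$$L = L_{ce} + \lambda L_{tri}$$
where $\lambda$ is the balanced weight of the triplet loss.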
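The two losses and their weighted sum can be illustrated with a small NumPy sketch. It is not the training code of this work: the margin of 0.3, the default weight of 1.0, the use of squared Euclidean distance and averaging over the batch are all assumptions made for the example.

```python
import numpy as np

def pairwise_sq_dist(x):
    """Squared Euclidean distance between every pair of rows of x."""
    sq = np.sum(x * x, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Batch mean of [d(a, hardest pos) - d(a, hardest neg) + margin]_+ ."""
    d = pairwise_sq_dist(feats)
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, d, -np.inf).max(axis=1)  # farthest same-ID sample
    hardest_neg = np.where(~same, d, np.inf).min(axis=1)  # closest other-ID sample
    return float(np.mean(np.maximum(hardest_pos - hardest_neg + margin, 0.0)))

def cross_entropy_loss(logits, labels):
    """Softmax cross entropy averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-np.mean(log_p[np.arange(len(labels)), labels]))

def total_loss(feats, logits, labels, lam=1.0, margin=0.3):
    """Cross entropy plus lam times the triplet loss (lam is an assumed value)."""
    return cross_entropy_loss(logits, labels) + lam * batch_hard_triplet_loss(
        feats, labels, margin
    )
```

With 4 instances per vehicle ID in a batch, every anchor has at least three other positives to mine from, which is what makes the in-batch hard mining meaningful.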
On the basis of this framework, we add further modules to improve single-model performance. The first is MGN, which splits the input sample horizontally into several parts and calculates each part's loss individually, while the global feature is also included in the loss calculation. The second is the self-attention constraint (SAC), which makes the network pay more attention to subtle details. The third is the squeeze-and-excitation (SE) block, another attention module that enhances the representation power of the features by performing dynamic channel-wise feature re-calibration.
During the testing phase, the trained network extracts a 2048-d feature from each input image. Features from different networks are concatenated for vehicle Re-ID.
Filter grafting is a learning paradigm that aims to reactivate invalid filters during training so that the representation power can be improved. The weights of filters that are valid in other networks are grafted onto the filters that are invalid in the network itself, so that multiple networks grafted together promote one another. The main framework is shown in Fig. 2. Given a network, two training processes are started in parallel, denoted as $M_1$ and $M_2$, respectively. At each training epoch, we graft the effective filter weights of $M_1$ into the invalid filters of $M_2$. The grafting in this process occurs at the convolution layer level, not the filter level, which means that we graft all filter weights in a certain layer of $M_1$ into the same layer of $M_2$. Conversely, the same principle is used for grafting the filters of $M_2$ into $M_1$. Filter grafting does not change the network structure. The grafting operation is given by the following formula:
$$W_i^{M_2'} = \alpha W_i^{M_2} + (1 - \alpha) W_i^{M_1}$$
where $W_i^{M}$ denotes the filter weights of the $i$-th layer of network $M$. After weighting the filter parameters of $M_1$ and $M_2$, the final filter parameters after grafting are obtained. We use the following formula to constrain the coefficient $\alpha$:
$$\alpha = A \cdot \arctan\left( c \cdot \left( H(W_i^{M_2}) - H(W_i^{M_1}) \right) \right) + 0.5$$
where $A$ and $c$ are two hyper-parameters and $H(\cdot)$ is the entropy of the variables.
Although the two networks are trained in parallel, only one of the grafted networks is used for testing.
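As a rough illustration of layer-level grafting, the sketch below blends one layer of a network with the same layer of its partner, giving more weight to whichever layer has higher entropy. The histogram-based entropy estimate, the bin count and the values `A=0.3` and `c=10.0` are assumptions chosen for the example, not settings reported in this work.

```python
import numpy as np

def layer_entropy(weights, n_bins=10):
    """Shannon entropy of a layer's weights, estimated from a histogram."""
    counts, _ = np.histogram(weights.ravel(), bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def graft_coefficient(w_self, w_other, A=0.3, c=10.0):
    """alpha = A * arctan(c * (H(w_self) - H(w_other))) + 0.5."""
    diff = layer_entropy(w_self) - layer_entropy(w_other)
    return float(A * np.arctan(c * diff) + 0.5)

def graft_layer(w_self, w_other, A=0.3, c=10.0):
    """Weighted sum of the two layers; the more informative layer dominates."""
    alpha = graft_coefficient(w_self, w_other, A, c)
    return alpha * w_self + (1.0 - alpha) * w_other
```

A layer whose weights have collapsed to a constant has zero entropy, so its coefficient falls below 0.5 and the grafted result is dominated by the partner network's weights.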
The annotations of the test set images are not accessible, so we propose a semi-supervised learning method that annotates test images with fake labels. The test set thus becomes a supplement to the training set. The steps of this method are as follows:
Train vehicle Re-ID models using the original training set.
Extract features from the images in the test set using the trained Re-ID models.
Cluster the test images based on their features, and assign each cluster a fake label as the ID label.
Merge the fake-labeled test set and the original training set into a new training set. Then train Re-ID models using the new training set.
The number of IDs in the test set is 333. We therefore use the k-means algorithm for clustering, with the number of cluster centers set to 333.
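The clustering and labeling steps above can be sketched as follows. This is a toy k-means written in NumPy for illustration only; the farthest-first initialisation, the iteration count and the helper names are assumptions, and a library implementation would normally be used for 333 clusters over 2048-d features.

```python
import numpy as np

def sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    d = (a * a).sum(axis=1)[:, None] - 2.0 * a @ b.T + (b * b).sum(axis=1)[None, :]
    return np.maximum(d, 0.0)

def farthest_first_init(feats, k):
    """Deterministic initialisation: repeatedly pick the point farthest from the chosen centres."""
    chosen = [0]
    d = sq_dists(feats, feats[:1])[:, 0]
    for _ in range(1, k):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, sq_dists(feats, feats[nxt:nxt + 1])[:, 0])
    return feats[chosen].copy()

def kmeans_assign(feats, k, n_iter=20):
    """Plain Lloyd iterations; returns the cluster index of every feature."""
    centers = farthest_first_init(feats, k)
    for _ in range(n_iter):
        assign = sq_dists(feats, centers).argmin(axis=1)
        for j in range(k):
            members = feats[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return sq_dists(feats, centers).argmin(axis=1)

def fake_labels(test_feats, n_clusters, n_train_ids):
    """Offset cluster indices so fake IDs never collide with real training IDs."""
    return kmeans_assign(test_feats, n_clusters) + n_train_ids
```

Offsetting by the number of training IDs lets the fake-labeled images be appended to the training set without relabeling the original identities.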
For the training results of multiple single models, we utilize several post-processing techniques.
Center crop. For the test set query and gallery, there is generally a bit of background redundancy. When we extract the test set features, we use center crop to make the image more focused on the vehicle itself and reduce the impact of the surrounding background.
Image flipping. During inference, we feed both the vehicle image and its horizontal flip into the model, and average the two features as the final feature of the vehicle. This mitigates the impact of viewing angle on the vehicle feature representation.
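Center crop and flip averaging are simple enough to show in a few lines. In this sketch `extract` is a hypothetical callable standing in for a trained Re-ID network, and images are assumed to be NumPy arrays in height-by-width (or height-by-width-by-channel) layout.

```python
import numpy as np

def center_crop(img, crop_h, crop_w):
    """Keep the central crop_h x crop_w window of the image."""
    h, w = img.shape[:2]
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return img[top:top + crop_h, left:left + crop_w]

def flip_averaged_feature(img, extract):
    """Average the features of the image and of its horizontal mirror."""
    return 0.5 * (extract(img) + extract(img[:, ::-1]))
```

By construction, an image and its mirror receive the same averaged feature, which is why the result is less sensitive to left/right viewpoint changes.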
Model ensemble. Model ensembling has proven effective in detection, classification and Re-ID tasks, and different tasks call for different ensemble methods. For the Re-ID task, we tried many ensemble methods and found that expanding the feature dimension by concatenating features works better than averaging the features of different models. Given the $K$ trained networks above, we extract one feature vector per network for an image $x$, where $f_k(x)$ denotes the feature vector from the $k$-th model. With $\mathrm{Norm}(\cdot)$ denoting normalization, we concatenate the normalized features as the final feature representation of image $x$: $f(x) = [\mathrm{Norm}(f_1(x)), \ldots, \mathrm{Norm}(f_K(x))]$. Our subsequent experiments show that concatenating the vehicle features extracted by each model improves performance compared with any single model.
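A minimal sketch of the concatenation step is given below. The text does not state which normalization is used, so L2 normalization is assumed here; the function names are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale every row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def concat_ensemble(per_model_feats):
    """Normalize each model's (n_images, dim) features, then concatenate along dim."""
    return np.concatenate([l2_normalize(f) for f in per_model_feats], axis=1)
```

Normalizing before concatenation keeps a model with large-magnitude features from dominating the distance computed on the fused vector.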
Query expansion. Query expansion is also a re-ranking method that can improve retrieval recall. The benefit is most pronounced when the accuracy of the top-$k$ results is high. After the first retrieval, the feature of the query and the features of the top-$k$ images retrieved from the gallery are averaged to form the new query feature, a second retrieval is performed, and this operation is repeated $t$ times. The parameters $k$ and $t$ need to be tuned for the official test set.
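The procedure can be sketched as below. The defaults `top_k=5` and `n_rounds=1` are placeholders rather than tuned values, and squared Euclidean distance is assumed for the retrieval step.

```python
import numpy as np

def sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    d = (a * a).sum(axis=1)[:, None] - 2.0 * a @ b.T + (b * b).sum(axis=1)[None, :]
    return np.maximum(d, 0.0)

def query_expansion(query, gallery, top_k=5, n_rounds=1):
    """Replace each query by the mean of itself and its top_k gallery matches."""
    q = query.astype(float)
    for _ in range(n_rounds):
        top = np.argsort(sq_dists(q, gallery), axis=1)[:, :top_k]
        q = (q + gallery[top].sum(axis=1)) / (top_k + 1)
    return q
```

If the top matches are wrong, the expanded query drifts toward the wrong identity, which is why the method helps mainly when top-$k$ accuracy is already high.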
Gallery feature merge. The official test set also provides track ID information covering the gallery's 798 vehicle tracks, where each track consists of multiple pictures of the same vehicle. This prior information can be exploited. Our post-processing method averages the picture features of each track and uses the average as the feature of every picture in that track. There is also a parameter $n$ that needs to be tuned, namely how many pictures of a track are averaged; $n$ can also be set to all the pictures in the track. This post-processing method can be applied to other Re-ID tasks that provide video track information.
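A minimal sketch of the track-level averaging, for the case where all pictures of a track are used, might look as follows. The array layout (one row per gallery image, with a parallel array of track IDs) is an assumption.

```python
import numpy as np

def merge_by_track(gallery, track_ids):
    """Replace every gallery feature by the mean feature of its track."""
    merged = np.empty_like(gallery, dtype=float)
    for t in np.unique(track_ids):
        mask = track_ids == t
        merged[mask] = gallery[mask].mean(axis=0)
    return merged
```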
Re-ranking with k-reciprocal encoding. K-reciprocal encoding is an effective re-ranking method. Its basic assumption is that if a returned image is ranked within the $k$-nearest neighbors of the probe, and the probe is in turn among the $k$-nearest neighbors of that image, then it is likely to be a true match. The idea of this method is to use query-to-test, query-to-query and test-to-test correlations so that the distributions of both the query and test sets are taken into consideration when ranking the results.
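The core of the idea can be shown with a heavily simplified sketch: build k-reciprocal neighbor sets, measure their Jaccard distance, and blend it with the original distance. It omits the local query expansion and soft weighting of the full method, and `k` and `lam` are placeholder values. `dist` is assumed to be the square distance matrix over query and gallery images stacked together.

```python
import numpy as np

def k_reciprocal_sets(dist, k):
    """For each sample i, the samples j with i in kNN(j) and j in kNN(i)."""
    knn = np.argsort(dist, axis=1)[:, :k + 1]  # +1 because a sample is its own neighbor
    sets = []
    for i in range(len(dist)):
        sets.append({int(j) for j in knn[i] if i in knn[j]})
    return sets

def jaccard_distance(sets):
    """1 - |A & B| / |A | B| for every pair of neighbor sets."""
    n = len(sets)
    dj = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            union = len(sets[i] | sets[j])
            if union:
                dj[i, j] = 1.0 - len(sets[i] & sets[j]) / union
    return dj

def rerank(dist, k=2, lam=0.3):
    """Blend the Jaccard distance with the original distance."""
    return (1.0 - lam) * jaccard_distance(k_reciprocal_sets(dist, k)) + lam * dist
```

Because only the initial distance matrix is needed, this step requires no additional training and can be applied on top of any feature extractor.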
CityFlow is one of the largest vehicle Re-ID datasets. The dataset contains 56,277 bounding boxes of 666 vehicle identities, of which 36,935 boxes from 333 identities form the training set, and the test set consists of 18,290 bounding boxes from the other 333 identities. The standard probe and gallery sets consist of 1,052 and 18,290 images respectively. The synthetic dataset provided by AI City Challenge 2020 is generated by VehicleX, a publicly available 3D engine. In total, 192,150 images of 1,362 vehicles are annotated with detailed labels. We select 50 vehicle identities with 3,462 images from CityFlow as our self-val set, and all the remaining images are used as the training set. Our self-training set for the vehicle Re-ID model therefore has a total of 225,623 images of 1,645 vehicle IDs. In addition, after semi-supervised learning, the images in the test set are assigned 330 IDs, so we finally obtain a training set of 1,975 IDs.
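The split sizes quoted above follow from simple bookkeeping, which the short check below reproduces. The variable names are illustrative; all numbers are taken from the text.

```python
# Real (CityFlow) and synthetic (VehicleX) training data.
cityflow_train_images, cityflow_train_ids = 36_935, 333
synthetic_images, synthetic_ids = 192_150, 1_362

# Held-out self-val split taken from the CityFlow training data.
val_images, val_ids = 3_462, 50

train_images = cityflow_train_images + synthetic_images - val_images
train_ids = cityflow_train_ids + synthetic_ids - val_ids

# Fake-labeled test images add further IDs after clustering.
fake_ids = 330
final_ids = train_ids + fake_ids

# Query plus gallery images together make up the test images.
test_images = 1_052 + 18_290
```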
In total, we trained several Re-ID models, including DenseNet161, SE-ResNet152 and ResNet152 with MGN. For some models, our training dataset is the officially provided training set plus the synthetic dataset. Because some vehicle images contain redundant background, we resize each image to 414×310 and then randomly crop it to 384×288. At the same time, we apply random erasing to make the samples harder to learn and horizontally flip the images to increase the diversity of the training images.
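The crop, flip and erasing steps can be sketched in NumPy as below. The sketch assumes the image has already been resized, that the sizes in the text are given as width×height, and that the erasing probability, area range and square erased region are free choices; none of these details are specified in the text.

```python
import numpy as np

def random_crop(img, out_h, out_w, rng):
    """Cut a random out_h x out_w window from the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w]

def random_flip(img, rng, p=0.5):
    """Mirror the image horizontally with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def random_erase(img, rng, p=0.5, area_frac=(0.02, 0.2)):
    """With probability p, overwrite a random roughly square patch with noise."""
    if rng.random() >= p:
        return img
    out = img.copy()
    h, w = img.shape[:2]
    area = rng.uniform(*area_frac) * h * w
    eh = int(min(h, max(1, round(np.sqrt(area)))))
    ew = int(min(w, max(1, round(area / eh))))
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out[top:top + eh, left:left + ew] = rng.random((eh, ew) + img.shape[2:])
    return out

def augment(img, rng, out_h=288, out_w=384):
    """Random crop, then random flip, then random erasing."""
    return random_erase(random_flip(random_crop(img, out_h, out_w, rng), rng), rng)
```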
For other models, we use the refined dataset open-sourced by Baidu, the first-place team of AI City Challenge 2019, which was re-cropped with their detector. For this part of the dataset, we resize the images to 384×288 and do not need to crop away the background.
We adopt a random sampling strategy in which each vehicle ID contributes 4 instance images per batch. We use Stochastic Gradient Descent (SGD) to train the CNN models for a total of 1000 epochs. Strategies such as learning rate warm-up and step-wise learning rate decay are added to the training process. The initial learning rate is set to 0.03 and is decayed to $3 \times 10^{-3}$, $3 \times 10^{-4}$ and $3 \times 10^{-5}$ at the 300th, 600th and 900th epochs. We implement our models in PyTorch, and all models are trained and tested on eight Tesla P40 GPUs.
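The schedule can be written as a small function. This sketch assumes each decay step divides the rate by ten, that epochs are counted from zero, and that warm-up is linear over 10 epochs; the warm-up length and shape are not given in the text and are hypothetical.

```python
def learning_rate(epoch, base_lr=0.03, milestones=(300, 600, 900),
                  gamma=0.1, warmup_epochs=10):
    """Linear warm-up followed by step decay at the given milestones."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays
```

In PyTorch the step-decay part corresponds to what `torch.optim.lr_scheduler.MultiStepLR` provides, with the warm-up handled separately.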
We use the 225,623 images of 1645 vehicle IDs to train a Re-ID model that extracts the features of the test set. Given that there are 333 IDs in the test set, k-means is used to cluster the test set images, and 19,342 images of 333 vehicle IDs are added to the original training set to retrain the Re-ID model. For each single model under different training sets, the mAP on the self-val set is shown in Table 4.3. With the increase in the amount of training data, each of our single models improves significantly on the self-val set. However, because the fake labels are not very accurate, the trained models overfit the test set with certain incorrect labels. In the end only DenseNet161 improves on the test set, so it is the only model trained with fake labels that is included in the final ensemble. Model selection experiments are described in the next section.
Model ensemble. We conduct experiments on different model ensemble strategies, as shown in Table 4.4. Average of features: take the average of the 2048-dimensional features inferred by each single model. Average of distance matrices: take the average of the query-gallery distance matrices computed from the features of each single model. Concatenated feature: concatenate the 2048-d features output by each single model, expanding the dimension. Experiments show that the concatenated feature performs best.
Therefore, we concatenate the normalized features of SE-ResNet152 (trained on the train set), ResNet152 with MGN (trained on the train set), DenseNet161 (trained on the train set) and DenseNet161 (trained on the train set plus the fake-labeled test set) for feature distance calculation, which performs better than any single model.
When the test set is labeled with fake labels, the clustering results are not refined, so they contain a large number of incorrect labels.
Although ResNet152 with MGN and SE-ResNet152 trained with the fake-labeled test set perform better on the self-val set than their counterparts trained without it, these models overfit the incorrect test-set labels during training. Therefore, including them in the final model ensemble decreases mAP on the test set.
However, DenseNet161 does not overfit the incorrect test-set labels, so we choose the 4 models listed in Table 4.4 for the final ensemble feature.
Post-processing. We conduct the post-processing ablation experiment on the self-val set of the CityFlow dataset, with SE-ResNet152 as the single-model backbone. Table 4.4 shows that each post-processing method improves the result to some degree. Although query expansion brings some improvement on the self-val set, this is not reflected on the test set, whereas gallery feature merge does improve the result on the test set. The post-processing finally applied on the test set is shown in the testing stage of Fig. 1.
Impact of filter grafting. To verify the effectiveness of filter grafting on our vehicle Re-ID model, we ran experiments on SE-ResNet152. Grafting two models brings about a 1% improvement in mAP over the baseline. Due to limited time and GPU resources, we did not graft more than two models or try filter grafting on other backbones.
Comparison with others on AIC2020: Table 4.5 lists the rank and results of our team on the vehicle Re-ID task of AI City Challenge 2020.
Visualization of the results: The qualitative results on our CityFlow self-val set are shown in Fig. 3, which visualizes the top-10 ranking lists for our query images. The results show that our model successfully retrieves most of the matching vehicle pictures from the gallery.
We participated in the vehicle re-identification task of the NVIDIA AI City Challenge 2020. Our solution for vehicle re-identification is based on training multiple CNNs with several methods, including multi-loss, filter grafting and part-based feature learning. We then use a semi-supervised method to annotate images in the test set with fake labels. In addition, several post-processing methods are used to further improve the results. Finally, we finished in 5th place in the 2020 AI City Challenge for City-Scale Multi-Camera Vehicle Re-Identification.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3997–4005, 2019.
Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
Vehicle re-identification using quadruple directional deep learning features. IEEE Transactions on Intelligent Transportation Systems, 2019.