Vehicle re-identification (Re-ID) aims to identify the same vehicle across images captured by different cameras. It is an essential technology for analyzing and predicting traffic flow in a smart city, and it generally relies on visual appearance-based methods. However, vehicle Re-ID is challenging for two reasons. First, varying lighting and complex environments make appearance-based vehicle Re-ID difficult, and a vehicle captured by different cameras can vary greatly in appearance. Second, different vehicles of the same category can look very similar to each other.
Deep learning methods [25, 10, 18] have been used to tackle this complex vehicle Re-ID task and have achieved significant progress. They extract features using deep networks and distinguish vehicles by comparing the distances between their features. However, a drawback of deep learning is that it requires a large amount of data to improve performance. The reported result shows that the more training data a model has, the better it performs. Data collected in the wild require a heavy annotation workload, so many studies have attempted to use inexpensive synthetic data in place of real data. This line of research is called domain adaptation.
First, we adopt the adversarial domain adaptation approach, in which a neural network learns features that are as discriminative as possible for the main classification task on the real data while, at the same time, learning features that are indistinguishable between real and synthetic data. To implement this idea, we introduce a domain discrimination layer and an associated cross-entropy loss so that the network is trained to be indiscriminative between the two domains. Second, to exploit the labels specific to synthetic data, such as color, type, and orientation, we adopt semi-supervised learning methods. Since these labels exist only in synthetic data, a semi-supervised learning approach that can handle unlabeled data is applicable to improve the performance. In training, classification losses for the exclusive labels are selectively applied depending on the data domain. Our model, trained on the real and synthetic data of the AI City Challenge using the domain adaptation and semi-supervised learning approaches, was 12.87% better than the baseline model trained with only real data. In this work, we propose a novel framework named StRDAN, which stands for Synthetic-to-Real Domain Adaptation Network. Our major contribution is threefold:
StRDAN is trained with inexpensive large-scale synthetic data as well as real data to improve the performance.
A new training approach for StRDAN is proposed, which combines domain adaptation and semi-supervised learning methods with their corresponding losses.
2 Related Work
In this section, we review prior work from two aspects: vehicle Re-ID and domain adaptation with synthetic data.
Vehicle Re-ID: Vehicle Re-ID methods generally have two characteristics: contrastive loss and spatio-temporal features. First, in terms of contrastive loss, prior works [14, 15, 16] proposed methods that use contrastive loss in the form of a Siamese network, triplet loss, and metric learning. Liu et al. also introduced the VeRi dataset as the first large-scale vehicle Re-ID benchmark. Second, spatio-temporal features are key to performance improvement, and the vehicle Re-ID task has made great progress using them. Tan et al. use spatial-temporal features for multi-camera vehicle tracking and vehicle Re-ID, and their method proved its effectiveness by winning the AI City Challenge 2019. Shen et al. proposed a two-stage framework that combines visual appearance matching with an LSTM-based path inference mechanism.
Domain Adaptation with Synthetic Data: To overcome the lack of data, Zhou et al. [37, 38] proposed a method that improves Re-ID performance by augmenting vehicle images from various viewpoints with Generative Adversarial Networks (GANs). Other methods deal with inconsistency in the distributions of different data sources. When a well-trained model is deployed directly on a new dataset, its performance drops significantly due to the differences among datasets, known as domain bias. Peng et al. proposed a domain adaptation framework to address this problem, which contains an image-to-image translation network and an attention-based feature learning network. The VehicleX simulator can be used to leverage synthetic data, and domain randomization can be used to overcome the reality gap [28, 29]. Liu et al. also proposed a domain adaptation method; however, they only considered real-to-real domain adaptation. A recent vehicle Re-ID method, PAMTRI, uses synthetic data to improve performance and has an architecture similar to ours. Compared to PAMTRI, which requires additional effort to obtain vehicle pose and labels for real data, our StRDAN uses domain adaptation to utilize synthetic data and adopts semi-supervised learning that needs no extra annotation workload. Our method is simple and easy to train.
3 Synthetic-to-Real Domain Adaptation Network (StRDAN)
In this work, we developed a neural network using the real and synthetic vehicle datasets provided for Track 2 of the 2020 AI City Challenge. The real dataset is the CityFlow-reID dataset, which is a subset of CityFlow made available for the Track 2 challenge and consists of 56,277 images of 666 unique vehicles collected from 40 cameras. Here, 36,935 images from 333 vehicle identities are provided for training, while 18,290 images from the other 333 identities are given for testing. The remaining 1,052 images from the same identities as the test set are provided as query data.
The synthetic vehicle dataset consists of 192,150 images from 1,362 distinct vehicles created using a large-scale synthetic dataset generator called VehicleX  to form an augmented training set. The synthetic dataset has not only the vehicle ID but also additional information such as the color, type, and orientation of an object, whereas the real dataset has only the vehicle ID. Here, vehicles are distinguished into 12 colors and 11 types. The orientation is represented by a rotation angle on the horizontal plane in the range of [0, 360).
We also trained and evaluated our model using the VeRi real dataset and the AI City Challenge synthetic data to examine the validity and robustness of our approach. The VeRi dataset contains over 50,000 images of 776 vehicles captured by 20 cameras. The training set contains 37,781 images of 576 vehicles, while the testing set contains 11,579 images of 200 vehicles. VeRi has additional color and type labels, which we do not use.
3.2 Overall Architecture
The overall architecture of the proposed synthetic-to-real domain adaptation network (StRDAN) is shown in Figure 2. The model consists of a backbone network for feature extraction and multiple fully connected (FC) softmax layers for classification. Input images are sampled in batches, in equal numbers from the real and synthetic datasets. For a mini-batch, n different vehicle identities are chosen from each of the real and synthetic datasets, then m samples are randomly selected from the images of each chosen identity. Therefore, each batch contains 2 × n × m images.
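The balanced sampling described above can be sketched in a few lines. This is a minimal illustration, not the authors' data loader: the function name `sample_batch`, the dictionary layout (vehicle ID mapped to a list of image names), and the fallback to sampling with replacement are all assumptions made for the example.

```python
import random

def sample_batch(real, synth, n, m, rng=random):
    """Return a list of (domain, vehicle_id, image) tuples.

    `real` and `synth` map vehicle_id -> list of image names
    (a hypothetical layout chosen for this sketch).
    """
    batch = []
    for domain, dataset in (("real", real), ("synthetic", synth)):
        # Choose n distinct identities from this domain.
        for vid in rng.sample(sorted(dataset), n):
            images = dataset[vid]
            # Draw m images; fall back to sampling with replacement when
            # an identity has fewer than m images (an assumption).
            if len(images) >= m:
                picks = rng.sample(images, m)
            else:
                picks = [rng.choice(images) for _ in range(m)]
            batch.extend((domain, vid, img) for img in picks)
    return batch

# Toy data: 3 identities per domain, 5 images each.
real = {f"r{i}": [f"r{i}_{k}.jpg" for k in range(5)] for i in range(3)}
synth = {f"s{i}": [f"s{i}_{k}.jpg" for k in range(5)] for i in range(3)}
batch = sample_batch(real, synth, n=2, m=4)
print(len(batch))  # 2 * n * m = 16
```

With n = 2 and m = 4 the batch holds 16 images, half real and half synthetic, covering four identities in total.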
The backbone network extracts a highly abstracted feature vector (dim = 2048) from an input image. Conceptually, any convolutional neural network designed for image classification can be used as a backbone network. In prior work, various CNNs, such as VGG-CNN-M-1024, MobileNet, and ResNet, have been adopted as a backbone for vehicle Re-ID. In the proposed StRDAN, ResNet-50 is selected as the backbone network. The feature map extracted by the backbone network is flattened and fed into several FC softmax layers that classify vehicle ID, domain (real or synthetic), color, type, and orientation. The outputs are fed into five cross-entropy loss functions and one triplet loss function. Our model is trained in an end-to-end manner by updating the parameters in the network to reduce the total loss, which is a combination of the cross-entropy losses and the triplet loss.
3.3 Key Features
Adversarial Domain Adaptation. An annotated dataset is essential for supervised learning of a deep neural network. However, collecting and manually annotating a large amount of data is a time-consuming and expensive task. To overcome this problem, an approach that generates automatically labeled data using a graphic simulator has been introduced. In the AI City Challenge, a synthetic vehicle dataset created using VehicleX is provided to overcome the lack of real data. However, the synthetic data has a similar but different distribution compared to the real data. Therefore, it is necessary to train a neural network to be predictive of the classification task, but uninformative as to the domain of the input.
We adopted the adversarial domain adaptation approach, in which a neural network learns features that are as discriminative as possible for the main classification task on the real domain and, at the same time, as indistinguishable as possible between the real and synthetic domains. To implement this idea, we introduced a domain discrimination layer and its associated cross-entropy loss so that the network is trained to be indiscriminative between the two domains. Also, to make the network more discriminative for vehicle identities and shape signatures, we introduced not only a vehicle ID classification layer and its associated cross-entropy loss but also a triplet loss.
Semi-supervised Learning. Unlike the real data, the synthetic data has labels such as vehicle type, color, and orientation. We use these labels in a multi-task learning setting to improve the generalization performance of all the tasks. In this case, various semi-supervised learning approaches can be introduced to improve learning accuracy, because semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. Zhai et al. create artificial labels for both unlabeled and labeled data and utilize them in training. Their approach inspired us to use joint and disjoint labels between real and synthetic data to improve performance. Here, the joint labels attached to both real and synthetic data are vehicle ID and domain (real or synthetic), while the disjoint labels attached to only synthetic data are vehicle type, color, and orientation. As shown in Figure 2, the losses are also classified into joint and disjoint losses, which are associated with joint and disjoint labels, respectively. The triplet loss is classified as a joint loss because the vehicle ID is used to divide batch images into anchor, positive, and negative images. The semi-supervised learning approach we consider in this paper has a learning objective of the following form:
$$\min_{\theta} \; L_{joint}(\theta) + L_{disjoint}(\theta)$$
where $L_{joint}$ is the joint loss defined in both the real and synthetic domains, $L_{disjoint}$ is the disjoint loss defined in the synthetic domain, and $\theta$ denotes the parameters of the network. In the next section, we describe the losses in more detail.
4 Loss Function
4.1 Joint Losses
Vehicle ID. A cross-entropy loss following the softmax function is the most common loss in image classification. The cross-entropy loss of the vehicle ID classifier, $L_{id}$, is represented as follows:
$$L_{id} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log \hat{y}_{i,j}$$
where $N$ denotes the number of images in a mini-batch, $C$ represents the number of classes, $y_{i,j}$ is the $j$-th element of a one-hot encoded vector for the ground truth of the $i$-th sample in a mini-batch, and $\hat{y}_{i,j}$ corresponds to the $j$-th element of the output of the softmax FC layer for the $i$-th image.
Domain. We adopted the adversarial domain adaptation approach. In this work, the domains are real and synthetic. A softmax FC layer for domain discrimination is added to the backbone network. The loss that trains the network to be indiscriminative between the two domains is defined as follows:
$$L_{domain} = \frac{1}{N} \sum_{i=1}^{N} \left[ d_i \log \hat{d}_i + (1 - d_i) \log (1 - \hat{d}_i) \right]$$
where $N$ is the number of images in a mini-batch, $d_i$ is the domain label of the $i$-th image, and $\hat{d}_i$ is the corresponding output of the domain discrimination layer.
The domain discrimination loss is defined as the negative of the binary cross-entropy loss. Since the cross-entropy loss trains the network to discriminate between the two domains, its negative makes the two domains harder for the model to distinguish. If a vehicle captured by a camera is drawn by a graphic simulator in the same orientation, the features extracted from the synthetic image would be similar to those from the real image because the domain-dependent features are suppressed. The negative cross-entropy loss function is implemented by the gradient reversal layer.
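To make the effect of gradient reversal concrete, the following toy sketch works through one backward pass by hand. It is only an illustration under stated assumptions: the scalar "backbone" weight `w`, the scalar domain-classifier weight `v`, the reversal strength `lam`, and the label convention (1 for real, 0 for synthetic) are all hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def domain_step(w, v, x, d, lam=1.0):
    """One manual forward/backward pass through a toy scalar model.

    feature f = w * x                      (stand-in for the backbone)
    domain probability p = sigmoid(v * f)  (stand-in for the domain layer)
    d is the domain label: 1 for real, 0 for synthetic (arbitrary choice).
    """
    f = w * x
    p = sigmoid(v * f)
    bce = -(d * math.log(p) + (1 - d) * math.log(1 - p))
    grad_v = (p - d) * f          # domain classifier still minimizes the BCE
    grad_f = (p - d) * v          # gradient arriving at the feature
    grad_w = -lam * grad_f * x    # reversal: sign flipped before the backbone
    return bce, grad_v, grad_w

w, v, x, d, lr = 0.5, 1.0, 2.0, 1, 0.1
loss_before, grad_v, grad_w = domain_step(w, v, x, d)
w_new = w - lr * grad_w           # backbone descends the reversed gradient
loss_after, _, _ = domain_step(w_new, v, x, d)
print(loss_before < loss_after)   # True
```

The domain classifier weight would still be updated to reduce the cross-entropy, while the backbone update increases it, which is the adversarial behavior the negative loss is meant to produce.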
Triplet Loss. In a mini-batch that contains several identities and several images for each identity, each image (anchor) has images of the same identity (positives) and images of different identities (negatives). The triplet loss aims at pulling each anchor-positive pair together while pushing each anchor-negative pair apart by a margin. That is, this loss forces the network to minimize the distance between features from images of the same class and, at the same time, to maximize the distance between features from images of different classes. The triplet loss is computed on the feature vectors predicted for the images of each identity group, and the margin controls the difference between positive-pair and negative-pair distances, which helps make each cluster denser.
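As an illustration of how such a loss can be computed over a mini-batch, here is a minimal sketch. It assumes the batch-hard mining variant (hardest positive and hardest negative per anchor), Euclidean distance, and a margin of 0.3; the text does not state which mining strategy, distance, or margin value the authors use, so all three are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(features, ids, margin=0.3):
    """Mean hinge loss over anchors, using the hardest positive and negative.

    Batch-hard mining and margin=0.3 are assumptions for this sketch.
    """
    losses = []
    for a, (fa, ida) in enumerate(zip(features, ids)):
        # Distances to other images of the same identity (positives).
        pos = [euclidean(fa, fb)
               for b, (fb, idb) in enumerate(zip(features, ids))
               if idb == ida and b != a]
        # Distances to images of different identities (negatives).
        neg = [euclidean(fa, fb)
               for fb, idb in zip(features, ids) if idb != ida]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0

# Well-separated identities give zero loss; interleaved ones are penalized.
separated = batch_hard_triplet_loss([(0, 0), (0, 1), (5, 0), (5, 1)], [1, 1, 2, 2])
interleaved = batch_hard_triplet_loss([(0, 0), (2, 0), (1, 0), (3, 0)], [1, 1, 2, 2])
print(separated, interleaved)
```

In the interleaved case every anchor has its hardest positive at distance 2 and its hardest negative at distance 1, so each contributes margin + 1 to the loss.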
4.2 Disjoint Losses
Color, Type, and Orientation. The softmax cross-entropy loss is applied to these three targets. In terms of data type, orientation is continuous and of ratio type, whereas color and type are categorical and nominal. Therefore, it is natural to use regression to predict orientation. However, orientation estimation is one of the toughest problems for regression due to the wide range of the regression target. In our experiments, the optimization did not converge for regression. Therefore, we convert the orientation regression into a direct classification over discrete bins with a softmax cross-entropy loss, as in prior work. We divide the 360-degree orientation space into six bins of 60 degrees each. The cross-entropy losses for color, type, and orientation are applied only to the synthetic images and are set to zero for the real images. The loss function can be presented as follows:
$$L_{t} = -\frac{1}{N} \sum_{i=1}^{N} m^{t}_{i} \sum_{j=1}^{C_t} y^{t}_{i,j} \log \hat{y}^{t}_{i,j}$$
where $t$ is one of color, type, and orientation, $N$ is the number of images in a mini-batch, $C_t$ is the number of classes for $t$, $y^{t}_{i,j}$ and $\hat{y}^{t}_{i,j}$ are the one-hot ground truth and the softmax output for the $i$-th image, and $m^{t}_{i}$ is a mask value that is set to 1 if the $i$-th image in a mini-batch has a label for $t$, and 0 otherwise.
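The two ideas in this subsection, binning the orientation angle and masking the loss for real images, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and averaging the masked loss over the whole mini-batch (rather than over synthetic images only) is an assumption.

```python
import math

NUM_BINS = 6  # six bins of 60 degrees, as described in the text

def orientation_bin(angle_deg):
    """Map an angle in degrees to a bin index in [0, NUM_BINS)."""
    index = int((angle_deg % 360.0) // (360.0 / NUM_BINS))
    return min(index, NUM_BINS - 1)  # guard against float rounding at 360

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy(batch_logits, labels, is_synthetic):
    """Cross-entropy that counts only synthetic samples.

    labels[i] is ignored when is_synthetic[i] is False. Dividing by the
    full batch size is an assumption of this sketch.
    """
    total = 0.0
    for logits, label, synth in zip(batch_logits, labels, is_synthetic):
        if synth:  # mask value 1 for synthetic images, 0 for real ones
            total += -math.log(softmax(logits)[label])
    return total / len(batch_logits)

# One synthetic image at 125 degrees and one real image without a label.
logits = [[0.0] * NUM_BINS, [0.0] * NUM_BINS]
labels = [orientation_bin(125.0), 0]
loss = masked_cross_entropy(logits, labels, is_synthetic=[True, False])
print(labels[0], loss)
```

With uniform logits the synthetic image contributes log 6 and the real image contributes nothing, so the batch loss is log(6) / 2.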
5.1 Evaluation Metric
To evaluate the performance of each model, we used the official evaluation metric for the AI City Challenge, which is the rank-K mean Average Precision (mAP) that measures the mean of average precision for each query considering only the top K matches. K is chosen to be 100. The average precision is computed for each query image by calculating the area under the Precision-Recall curve, and then the mean of the average precision over all the queries is computed.
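A minimal sketch of the metric is given below. It approximates the area under the Precision-Recall curve by averaging the precision at each correct match within the top K results; the normalization by the total number of relevant gallery images is an assumption, and the official challenge evaluation code may differ in such details. The function names and the input layout are hypothetical.

```python
def average_precision_at_k(ranked_ids, query_id, num_relevant, k=100):
    """Average precision of one query over its top-k ranked gallery IDs."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids[:k], start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant if num_relevant else 0.0

def rank_k_map(queries, k=100):
    """Mean of the per-query average precision.

    `queries` is a list of (query_id, ranked_gallery_ids, num_relevant).
    """
    aps = [average_precision_at_k(ranked, qid, n, k) for qid, ranked, n in queries]
    return sum(aps) / len(aps)

# Query "a" has two relevant gallery images, retrieved at ranks 1 and 3.
ap = average_precision_at_k(["a", "b", "a"], "a", num_relevant=2)
print(round(ap, 4))  # 0.8333
```

Here the precision is 1/1 at the first hit and 2/3 at the second, so the average precision is (1 + 2/3) / 2.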
Our backbone network, ResNet-50, is initialized with weights pre-trained on ImageNet to accelerate the training process. We train the model end-to-end with an AMSGrad optimizer for 60 epochs. The initial learning rate is set to 0.0003 and is decayed by a factor of 0.1 after 20 and 40 epochs. The weight decay factor for L2 regularization is set to 0.0005, and the batch size is 64. For each mini-batch, two different vehicle IDs are selected from each of the real and synthetic datasets, and four images are sampled for each ID. Therefore, a total of 16 different images with four different IDs are sampled from the real and synthetic datasets. An input image is resized to (128, 256). We adopt horizontal flipping and random erasing augmentations. In post-processing, we use the re-ranking algorithm proposed by Zhong et al., which re-orders the results using a combination of the Jaccard distance and the original distance between features.
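The step schedule described above can be written as a small function. This sketch assumes that epochs are counted from zero and that the decay takes effect at the start of epochs 20 and 40; the function name and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.0003, milestones=(20, 40), gamma=0.1):
    """Step decay: multiply the rate by gamma at each milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops

for epoch in (0, 19, 20, 39, 40, 59):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is 3e-4 for epochs 0 to 19, about 3e-5 for epochs 20 to 39, and about 3e-6 for the final 20 epochs.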
|StRDAN (R, baseline)||73.0|
|StRDAN (R+S, best)||76.1|
5.3 Results and Discussion
We trained and evaluated our models using the CityFlow-reID real dataset and the VeRi real dataset together with the synthetic data generated by VehicleX. The evaluation results of the models trained using the selected disjoint losses are shown in Table 1 and Table 2.
Performance on AI City Dataset. The baseline is Case 1, in which the neural network is composed of the backbone network and the vehicle ID classifier. The baseline is trained with the real dataset using the vehicle ID cross-entropy and triplet losses. As shown in Table 1, compared with the baseline, the domain adaptation and semi-supervised learning approaches introduced in this study significantly improve the model performance, by at least 8.48% (in Case 8) and up to 12.87% (in Case 4). Interestingly, the model shows the best performance in Case 4, where only the vehicle type is considered among the three labels of the synthetic images. In contrast, in Case 8, where all three labels are considered, the model shows the lowest performance among the proposed configurations.
Performance on VeRi Dataset. Table 2 also shows that the domain adaptation and semi-supervised learning approaches with the synthetic dataset and additional losses contribute to performance improvement. The performance is improved by up to 3.1% in Case 2 and by at least 1.2% in Case 3. Unlike the cases with the AI City dataset, the case with only the orientation label shows the best performance. However, with this configuration the model does not converge on the AI City dataset. In terms of performance, the models trained with the VeRi data are much better than those trained with the AI City data. In Table 3, we compare our StRDAN with other methods. Except for PAMTRI and StRDAN (R+S), all the models have been trained using only the VeRi real dataset without synthetic data. The table shows that our model outperforms the other methods in the table.
Domain Adaptation and Semi-supervised Learning. The experimental results indicate that the domain adaptation and semi-supervised learning approaches contribute to extracting more important semantic features for vehicle Re-ID. However, several unexpected phenomena require further research. First, a model trained with only one of the three disjoint losses performs best. Second, the more disjoint losses are included, the lower the performance. Third, which disjoint loss gives the best performance depends on the real dataset.
In this paper, we propose an approach using domain adaptation and semi-supervised learning to fully utilize synthetic data. Based on the experimental results, we found that increasing the training data via domain adaptation improves performance. We also explored the labels that only synthetic data has and discovered that using these labels with semi-supervised learning helps the model extract more semantic features.
As future work, the following issues need to be addressed.
Synergy between the disjoint losses and the dependency of the disjoint losses on the real dataset, which are discussed in the previous section.
Effect of the realism of synthetic data. The image data synthesized by VehicleX is easily distinguishable from real image data and far from realistic. More realistic synthetic data obtained from driving simulation software may improve the performance further.
Prediction of orientation. We convert orientation regression to a direct classification into six discrete bins. However, as we have not tried other bin counts, it is necessary to investigate the optimal number of bins. Since orientation is one of the key features for identifying vehicles captured from various camera angles, a proper representation of orientation can boost performance.
References
[1] (2014) Domain-adversarial neural networks. ArXiv abs/1412.4446.
[2] (2018) Group-sensitive triplet embedding for vehicle reidentification. IEEE Transactions on Multimedia 20, pp. 2385–2399.
[3] (2014) Return of the devil in the details: delving deep into convolutional nets. ArXiv abs/1405.3531.
[4] Unsupervised domain adaptation by backpropagation. ArXiv abs/1409.7495.
[5] (2016) Domain-adversarial training of neural networks. ArXiv abs/1505.07818.
[6] (2018) Unsupervised representation learning by predicting image rotations. ArXiv abs/1803.07728.
[7] (2016) Identity mappings in deep residual networks. ArXiv abs/1603.05027.
[8] (2017) In defense of the triplet loss for person re-identification. ArXiv abs/1703.07737.
[9] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861.
[10] Multi-view vehicle re-identification using temporal attention model and metadata re-ranking. In CVPR Workshops.
[11] (2019) Vehicle re-identification: an efficient baseline using triplet embedding. 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–9.
[12] (2019) Supervised joint domain learning for vehicle re-identification. In Proc. CVPR Workshops, pp. 45–52.
[13] (2016) Deep relative distance learning: tell the difference between similar vehicles. pp. 2167–2175.
[14] (2016) Large-scale vehicle re-identification in urban surveillance videos. 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
[15] (2016) A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In European Conference on Computer Vision, pp. 869–884.
[16] (2017) PROVID: progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Transactions on Multimedia 20 (3), pp. 645–658.
[17] (2018) PROVID: progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Transactions on Multimedia 20, pp. 645–658.
[18] (2019) Vehicle re-identification with location and time stamps. In CVPR Workshops.
[19] (2019) The 2019 AI City Challenge. In CVPR Workshops.
[20] (2019) Cross domain knowledge transfer for unsupervised vehicle re-identification. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 453–458.
[21] (2019) On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
[22] (2017) Learning deep neural networks for vehicle Re-ID with visual-spatio-temporal path proposals. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1900–1909.
[23] (2017) Learning deep neural networks for vehicle Re-ID with visual-spatio-temporal path proposals. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1918–1927.
[24] (2019) Multi-camera vehicle tracking and re-identification based on visual and spatial-temporal features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 275–284.
[25] (2019) Multi-camera vehicle tracking and re-identification based on visual and spatial-temporal features. In CVPR Workshops.
[26] (2019) PAMTRI: pose-aware multi-task learning for vehicle re-identification using highly randomized synthetic data. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 211–220.
[27] (2019) CityFlow: a city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8789–8798.
[28] (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30.
[29] (2018) Training deep networks with synthetic data: bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977.
[30] (2017) Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 379–387.
[31] (2019) Simulating content consistent vehicle datasets with attribute descent. arXiv:1912.08855.
[32] (2019) S4L: self-supervised semi-supervised learning. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1476–1485.
[33] (2017) A survey on multi-task learning. ArXiv abs/1707.08114.
[34] (2019) VehicleNet: learning robust feature representation for vehicle re-identification. In CVPR Workshops.
[35] (2017) Re-ranking person re-identification with k-reciprocal encoding. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3652–3661.
[36] (2018) On the continuity of rotation representations in neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5738–5746.
[37] (2018) Aware attentive multi-view inference for vehicle re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6489–6498.
[38] (2018) Vehicle re-identification by adversarial bi-directional LSTM network. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 653–662.
[39] (2018) Vehicle re-identification by adversarial bi-directional LSTM network. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 653–662.
[40] (2018) Viewpoint-aware attentive multi-view inference for vehicle re-identification. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6489–6498.