Video surveillance for traffic control and security plays a significant role in current public transportation systems. The vehicle reID task suffers from intensive variations in appearance and background, largely because the images in different datasets are captured by different cameras. Datasets usually differ from each other in lighting, viewpoint, background and even resolution. As shown in Fig.1, images in VeRi-776 are brighter and cover more viewpoints than images in VehicleID, while images in VehicleID have higher resolution than images in VeRi-776. Moreover, no single dataset can cover all cases in real scenarios, so each dataset forms its own unique style, which causes a domain bias among datasets. For reID, most existing works follow the supervised learning paradigm, which first trains the reID model on images from the target dataset to adapt to its style. However, when such a well-trained reID model is tested on another dataset without fine-tuning, a severe performance drop is always observed due to the domain bias.
There are few studies on cross-domain adaptation for vehicle reID, and only a few methods exploit unlabeled target data for unsupervised person reID modelling. However, some of them need extra information about the source domain during training, such as attribute labels and spatio-temporal labels, which do not exist in some datasets. Only a handful of methods exploit unsupervised learning [10, 11] without any labels, for instance, SPGAN and PTGAN. SPGAN is designed for person reID; it integrates a SiaNet with CycleGAN and does not need any additional labels during training. However, though SPGAN is effective on the person transfer task, it causes deformation and color distortion in the vehicle transfer task in our experiments. PTGAN is composed of PSPNet and CycleGAN to learn the style of the target domain while maintaining the identity information of the source domain. To keep the identity information, PSPNet is first utilized to segment the person images. PSPNet requires a pre-trained segmentation model, which increases the complexity of the training stage.
To address these problems, this paper proposes VTGAN, an end-to-end image-to-image translation network for vehicle datasets. To preserve the identity information of images from the source domain and learn the style of images from the target domain, every generator in VTGAN is composed of a content encoder, a style encoder and a decoder. An attention model is proposed in the content encoder to preserve the identity information of the source domain, and the style encoder is designed to learn the style of the target domain with the style loss. Furthermore, VTGAN needs neither labels nor paired images during the translation procedure, which is closer to the real scenario. To better adapt to the (unlabeled) target domain, ATTNet is designed for vehicle reID with the generated images obtained from the translation stage. Through the proposed attention structure, it focuses on the foreground and neglects the background information of the input image as much as possible during training, which gives it better generalization ability. Our contributions can be summarized in two aspects:
1) We propose VTGAN to generate images that have the style of the target domain and preserve the identity information of the source domain. It is an efficient unsupervised learning model that works by transferring content and style between different domains separately.
2) ATTNet is presented to train on the generated images. It is based on an attention structure and extracts more distinctive cues while suppressing background for the vehicle reID task.
Our ultimate goal is to make a vehicle reID model perform well in an unknown target domain that is not labeled. Hence, we introduce a two-step vehicle reID method based on the Generative Adversarial Network (GAN). The first step transfers the style between the source domain and the target domain. In this step, VTGAN is proposed to generate images that have the style of the target domain while preserving the identity information of the source domain. After generating the style-transferred images, in the second step, we design a multi-task network with an attention structure to obtain more discriminative features for vehicle reID.
VTGAN is designed to transfer style between the source domain and the target domain while preserving the identity information of images from the source domain. As illustrated in Fig.2, VTGAN consists of two generators and two domain discriminators, one for each domain. Each generator contains three components: a content encoder, a style encoder and a decoder. The content encoder is designed to preserve the identity information of source-domain images through the proposed attention model, which extracts the foreground while suppressing the background. To learn the style of the target domain, the style encoder with the gram loss is added to the translation network. Finally, the decoder combines the outputs of the content encoder and the style encoder to generate the translated image.
2.2.1 Content Encoder
In order to keep the identity information from the source domain, the attention model in the content encoder is designed to assign higher visual attention scores to the region of interest while suppressing the background.
As shown in Fig.3, we denote the input feature map of the attention model as $F$. In this work, a simple feature fusion structure is utilized to generate $F$: we fuse the output of every ResBlock, which can be formulated as $F = [F_1, F_2, \dots, F_n]$, where $F_k$ is the feature map generated by the $k$-th ResBlock and $[\cdot]$ denotes the concatenation operation. For the feature vector $f_{i,j}$ of the feature map $F$ at spatial location $(i,j)$, we calculate its corresponding attention mask by

$M_{i,j} = \sigma(W f_{i,j})$  (1)

where the Fully Connected (FC) layer with weights $W$ learns a mapping function in the attention module, and the final attention mask $M$ is a probability map obtained using a sigmoid layer $\sigma(\cdot)$. The scores represent the probability of foreground at each location of the input image.

After the attention model, a mask is thus generated that has high scores for the foreground and low scores for the background. Hence, the attended feature map is computed by the element-wise product of the attention mask and the input feature map, described as follows:

$\tilde{F}_{i,j} = M_{i,j} \odot F_{i,j}$  (2)

where $(i,j)$ is a spatial location in the mask $M$ or the feature map $F$, and $\odot$ denotes the element-wise product.
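The mask and attended-feature computation above can be sketched with NumPy. The toy shapes, the single shared weight matrix standing in for the FC layer, and the function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_mask(feat, w):
    """Score each spatial location: a shared FC layer (modeled here as a
    single linear map w, an assumption) followed by a sigmoid turns every
    C-dim feature vector into a foreground probability in (0, 1)."""
    h, wd, c = feat.shape
    scores = feat.reshape(-1, c) @ w           # (H*W, 1) linear responses
    return sigmoid(scores).reshape(h, wd, 1)   # per-location probability mask

def attend(feat, mask):
    """Element-wise product of the mask and the input feature map."""
    return feat * mask                         # broadcasts over channels

rng = np.random.default_rng(0)
fused = rng.standard_normal((4, 4, 8))  # toy fused ResBlock feature map F
w = rng.standard_normal((8, 1))
mask = attention_mask(fused, w)
attended = attend(fused, mask)
```

Locations the linear layer scores highly keep their features nearly intact, while low-scoring (background) locations are attenuated toward zero.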
2.2.2 Style Encoder
Besides the content branch, there is a branch that learns the style of the target domain. Different from the content encoders, the style encoders in this branch do not contain the attention model. To learn the style of the target domain, the style encoder is trained with the gram loss so that the output style features have a distribution similar to that of the target domain. The gram loss can be formulated as follows:

$L_{gram} = \frac{1}{4N^2M^2} \left( \lVert G_{s \to t} - G_t \rVert_F^2 + \lVert G_{t \to s} - G_s \rVert_F^2 \right)$  (3)

where $N$ is the number of feature maps, $M$ is calculated by $W \times H$ with $W$ and $H$ the width and height of the images, and $G_{s \to t}$, $G_t$, $G_{t \to s}$ and $G_s$ are the gram matrices of the style features of the translated and real images in the target and source domains, respectively.
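One direction of the gram loss can be sketched as follows; the feature shapes are toy values and the $1/(4N^2M^2)$ normalization is an assumption consistent with the $N$ and $M$ defined above:

```python
import numpy as np

def gram(feat):
    """Gram matrix of a feature tensor with N maps of M = W*H values each."""
    n = feat.shape[0]
    flat = feat.reshape(n, -1)      # (N, M): each row is one flattened map
    return flat @ flat.T            # (N, N) pairwise inner products

def gram_loss(feat_gen, feat_tgt):
    """Squared gram-matrix distance with 1/(4 N^2 M^2) normalization."""
    n, m = feat_gen.shape[0], feat_gen[0].size
    diff = gram(feat_gen) - gram(feat_tgt)
    return np.sum(diff ** 2) / (4.0 * n ** 2 * m ** 2)

rng = np.random.default_rng(0)
f_gen = rng.standard_normal((8, 5, 5))   # N = 8 feature maps, M = 25
f_tgt = rng.standard_normal((8, 5, 5))
loss = gram_loss(f_gen, f_tgt)
```

Because the gram matrix discards spatial layout and keeps only channel correlations, matching it pushes the generated features toward the target domain's style rather than its content.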
2.2.3 Decoder Network
The decoder network is composed of two deconvolution layers and a convolutional layer that outputs the generated images. The input of the decoder network is the combination of the content features $f_c$ and the style features $f_s$, produced by the content encoder and the style encoder, respectively. In this paper, we employ a concatenation layer to fuse $f_c$ and $f_s$, together with a global skip connection structure that makes training faster and the resulting model generalize better, which can be expressed as:

$f_{fuse} = [f_c, f_s] + f_2$  (4)

where $[\cdot]$ represents the concatenation layer, $f_2$ is the feature map generated by the second stride convolution block, and $f_c$ and $f_s$ are the outputs of the content encoder and the style encoder, respectively.
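The fusion at the decoder input can be sketched as below; the channel counts are toy values, and the assumption that the skip feature's channel count matches the concatenated map is illustrative:

```python
import numpy as np

def fuse(f_content, f_style, f_skip):
    """Concatenate content and style features along the channel axis,
    then add the global skip feature element-wise (its channel count is
    assumed to equal that of the concatenated map)."""
    fused = np.concatenate([f_content, f_style], axis=-1)
    return fused + f_skip

rng = np.random.default_rng(0)
f_c = rng.standard_normal((16, 16, 64))   # content encoder output (toy size)
f_s = rng.standard_normal((16, 16, 64))   # style encoder output
f_2 = rng.standard_normal((16, 16, 128))  # second stride-conv block feature
out = fuse(f_c, f_s, f_2)
```

The additive skip gives the deconvolution layers a direct path to early encoder features, which is what speeds up training in this design.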
2.2.4 Loss function
We formulate the loss function in VTGAN as a combination of the adversarial loss, the content loss and the style loss:

$L = L_{adv} + \lambda_1 L_{content} + \lambda_2 L_{gram}$  (5)

where $\lambda_1$ and $\lambda_2$ control the relative importance of the three objectives. The style loss is calculated by Eq.(3). As the content loss, VTGAN utilizes the target domain identity constraint as an auxiliary for image-to-image translation. This constraint was introduced in prior work on unsupervised cross-domain image generation to regularize the generator to act as an identity mapping on samples from the target domain, written as:

$L_{content} = \mathbb{E}_{t \sim p_T} \left[ \lVert G_{S \to T}(t) - t \rVert_1 \right]$  (6)
The adversarial loss $L_{adv}$ consists of three parts: two adversarial losses and a cycle consistency loss. VTGAN applies adversarial losses to both mapping functions. For the generator $G_{S \to T}$ and its discriminator $D_T$, the objective can be expressed as:

$L_{GAN}(G_{S \to T}, D_T) = \mathbb{E}_{t \sim p_T}[\log D_T(t)] + \mathbb{E}_{s \sim p_S}[\log(1 - D_T(G_{S \to T}(s)))]$  (7)

where $S$ and $T$ represent the source domain and the target domain, and $p_S$ and $p_T$ denote the sample distributions in the source and target domains, respectively. The objective of the generator $G_{T \to S}$ and its discriminator $D_S$ is built in the same way. Besides, VTGAN requires $G_{T \to S}(G_{S \to T}(s)) \approx s$ and $G_{S \to T}(G_{T \to S}(t)) \approx t$ when it learns the mappings $S \to T$ and $T \to S$. So the cycle consistency loss, which makes the network more stable, is employed in VTGAN and defined as:

$L_{cyc} = \mathbb{E}_{s \sim p_S}\left[\lVert G_{T \to S}(G_{S \to T}(s)) - s \rVert_1\right] + \mathbb{E}_{t \sim p_T}\left[\lVert G_{S \to T}(G_{T \to S}(t)) - t \rVert_1\right]$  (8)
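The cycle consistency and identity terms can be sketched with plain L1 penalties; the generators here are arbitrary callables standing in for the networks, and the loss weights are omitted:

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two arrays."""
    return float(np.mean(np.abs(a - b)))

def cycle_loss(x_s, x_t, g_st, g_ts):
    """Reconstruction error after a round trip through both generators."""
    return l1(g_ts(g_st(x_s)), x_s) + l1(g_st(g_ts(x_t)), x_t)

def identity_loss(x_t, g_st):
    """Target-domain identity constraint: the generator applied to a
    target-domain sample should change it as little as possible."""
    return l1(g_st(x_t), x_t)

# Sanity check with identity generators: both losses vanish.
rng = np.random.default_rng(0)
x_s = rng.standard_normal((8, 8, 3))
x_t = rng.standard_normal((8, 8, 3))
ident = lambda x: x
```

With real networks, these terms constrain the two mappings jointly, which is what keeps unpaired training stable.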
Diverse backgrounds are a major factor in the cross-domain problem. To make the reID model adapt to the target domain, it is better to focus on the vehicle and neglect the background when training the feature learning model. Hence, a two-stream reID network with an attention structure is designed in this paper.
As shown in Fig.4, the input images are obtained from the image generation module and are divided into positive and negative sample pairs. In each branch, the input image is fed into five ResNet blocks to output feature maps, which are then passed into a Global Average Pooling (GAP) layer to obtain the feature map $f$. $f$ is utilized to generate the mask through the proposed attention structure: given the feature map $f$, its attention map $M$ is computed by a convolution operation over $f$. After obtaining the attention map $M$, the attended feature map is calculated by $f \odot M$, where $\odot$ is the element-wise product. The attended feature map is then fed into the subsequent structure. A shortcut connection architecture is introduced to embed the input of the attention network directly into its output with an element-wise sum layer, described as $f' = f + f \odot M$. In this way, both the original feature map and the attended feature map are combined to form the features used as input to the subsequent structure. After two FC layers, we obtain the feature $f_{fc}$. Finally, a skip connection structure integrates $f'$ and $f_{fc}$ through the concatenation layer to obtain more discriminative features for the identification task and the verification task, described as $f_{final} = [f', f_{fc}]$.
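The attention branch with its shortcut can be sketched as below; the 1x1 convolution is modeled as a per-location linear map, and the sigmoid squashing and toy shapes are assumptions carried over from the content-encoder attention rather than details from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attnet_block(feat, w_1x1):
    """Attention map from a 1x1 convolution (a per-location linear map
    over channels here), attended map feat * mask, then the shortcut sum
    feat + feat * mask so original and attended features are combined."""
    h, w, c = feat.shape
    mask = sigmoid(feat.reshape(-1, c) @ w_1x1).reshape(h, w, 1)
    return feat + feat * mask

rng = np.random.default_rng(0)
feat = rng.standard_normal((1, 1, 512))   # pooled feature as a 1x1 map (toy)
w_1x1 = rng.standard_normal((512, 1))
out = attnet_block(feat, w_1x1)
```

Because the output is $f \cdot (1 + M)$ with $M \in (0, 1)$, the shortcut guarantees that no feature is suppressed below its original magnitude, only amplified where attention is high.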
VeRi-776 is a large-scale urban surveillance vehicle dataset for reID. It contains over 50,000 images of 776 vehicles with identity annotations, camera geo-locations, image timestamps, and vehicle type and color information. In this paper, 37,781 images of 576 vehicles are employed as the training set. VehicleID is a real-world surveillance dataset that contains 221,763 images of 26,267 vehicles in total. From the original testing data, four subsets containing 800, 1,600, 2,400 and 3,200 vehicles are extracted for vehicle search at different scales.
3.2 Implementation Details
For VTGAN, we train the model in TensorFlow with a learning rate of 0.0002. Note that we do not utilize any label annotation during the learning procedure. The mini-batch size of the proposed method is 16 and the number of epochs is set to 6. During testing, we employ one generator for the VeRi-776 to VehicleID translation and the other generator for the VehicleID to VeRi-776 translation. The translated images are utilized for training the reID models. For ATTNet, we implement the proposed vehicle reID model in the MatConvNet framework. We utilize stochastic gradient descent with momentum during the training procedure. The batch size is set to 16. The learning rate is set to 0.1 for the first 50 epochs and to 0.01 for the last 5.
3.3.1 Comparison Methods
There are very few methods addressing cross-domain vehicle reID, so in this paper we only discuss several methods and test them on VeRi-776 and VehicleID. Direct Transfer means directly applying the model trained on images from the source domain to the target domain. CycleGAN, SPGAN and VTGAN are employed to translate images from the source domain to the target domain, and the generated images are then used to train the reID model. Baseline denotes the compared reID training network, and ATTNet is our proposed network.
3.3.2 Comparison of generated images
To demonstrate the effectiveness of the proposed style transfer model, VehicleID and VeRi-776 are utilized to train VTGAN, with CycleGAN and SPGAN as compared methods. Fig.5 shows the comparison results, where the source domain is VeRi-776 and the target domain is VehicleID. For each group, the first row contains the original images from VeRi-776, the second and third rows are generated by CycleGAN and SPGAN, respectively, and the last row is generated by the proposed VTGAN.
From Fig.5 we can see that most images generated by CycleGAN are seriously distorted when transferring images from VeRi-776 to VehicleID. Although SPGAN works better than CycleGAN, its generated images also show evident deformation. For VTGAN, in contrast, not only are the vehicle color and type information completely preserved, but the style of the target dataset is also learned: images generated by VTGAN have higher resolution and become darker, matching the style of VehicleID.
3.3.3 The impact of Image-Image Translation
First, we utilize CycleGAN to translate labeled images from the source domain to the target domain, and then train the baseline reID model on the translated images in a supervised way. As shown in Table 1, when the baseline method is trained on the VeRi-776 training set and tested on the different VehicleID testing sets, rank-1 accuracy improves from 35% to 39.39%, 30.42% to 32.97%, 27.28% to 28.44% and 25.41% to 26.38%, respectively. Through such an image-level domain adaptation method, effective domain adaptation baselines can be learned. From Fig.5 we can see that, although some of the images generated by CycleGAN are distorted, the performance of the reID model trained on the generated images still improves. This illustrates that image-to-image translation methods learn important style information from the target domain, which narrows the domain gap to a certain degree.
3.3.4 The impact of VTGAN
To verify the effectiveness of the proposed VTGAN, we conduct several experiments in which the training sets are images generated by different image translation methods. As shown in Table 1, on VehicleID, compared with CycleGAN + Baseline, the gains of VTGAN + Baseline are 5.05%, 6%, 6.66% and 5.79% in rank-1 on the different test sets, respectively. Although SPGAN performs better than CycleGAN in the image-to-image translation stage, it also causes deformation and color distortion in real scenarios (see Fig.5). Compared with SPGAN + Baseline, VTGAN + Baseline achieves 1.57%, 1.51%, 1.56% and 1.72% improvements in rank-1 on the different test-set sizes of VehicleID, respectively. All of these results demonstrate that the structure of VTGAN is more stable and generates samples better suited for training in the target domain.
3.3.5 The impact of ATTNet
To further improve reID performance on the target dataset, we propose ATTNet. Fig.6 shows the CMC results of different methods on VehicleID. As shown in Fig.6, methods using ATTNet outperform those using the baseline reID model, on both original and generated images. For instance, from Table 1 we can see that, compared with Direct Transfer + Baseline, Direct Transfer + ATTNet achieves 8.26%, 9.05%, 8.67% and 7.99% improvements in rank-1 on the different test sets when the model is trained on VeRi-776 and tested on VehicleID. Moreover, compared with the baseline, the reID model using ATTNet shows a significant improvement for every image translation method. This demonstrates that the reID model trained with the proposed ATTNet adapts better to the cross-domain task than the baseline method.
Table 1. Results (mAP / Rank-1 / Rank-5, %) on the four VehicleID test sets.

| Methods | Test size = 800 | Test size = 1600 | Test size = 2400 | Test size = 3200 |
| --- | --- | --- | --- | --- |
| Direct Transfer + Baseline | 40.05 / 35.00 / 56.68 | 34.90 / 30.42 / 48.85 | 31.65 / 27.28 / 44.49 | 29.57 / 25.41 / 42.11 |
| CycleGAN + Baseline | 44.24 / 39.39 / 60.10 | 37.68 / 32.97 / 53.16 | 33.17 / 28.44 / 47.92 | 30.73 / 26.38 / 43.84 |
| SPGAN + Baseline | 48.27 / 42.87 / 66.55 | 42.51 / 37.46 / 58.97 | 38.41 / 33.54 / 53.68 | 35.04 / 30.45 / 49.13 |
| VTGAN + Baseline | 49.53 / 44.44 / 66.74 | 43.90 / 38.97 / 59.93 | 40.07 / 35.10 / 56.29 | 36.86 / 32.17 / 51.63 |
| Direct Transfer + ATTNet | 47.97 / 43.26 / 62.93 | 43.94 / 39.47 / 58.51 | 40.42 / 35.95 / 54.34 | 37.60 / 33.40 / 50.55 |
| CycleGAN + ATTNet | 46.96 / 42.68 / 60.72 | 43.27 / 38.88 / 57.44 | 39.39 / 35.09 / 53.05 | 37.05 / 33.07 / 49.38 |
| SPGAN + ATTNet | 52.72 / 48.25 / 67.20 | 48.01 / 43.44 / 63.04 | 44.17 / 39.51 / 59.05 | 41.05 / 36.75 / 54.63 |
| VTGAN + ATTNet | 54.01 / 49.48 / 68.66 | 49.72 / 45.18 / 63.99 | 45.18 / 40.71 / 59.02 | 42.94 / 38.72 / 55.87 |
In this paper, we propose a GAN-based vehicle reID framework that includes VTGAN and ATTNet for domain adaptation. VTGAN is designed to generate vehicle images that preserve the identity information of the source domain while learning the style of the target domain. ATTNet is proposed to train the reID model with the generated images. The results show that both VTGAN and ATTNet achieve good performance. Moreover, existing datasets usually contain only a few viewpoints of each vehicle, which is another limitation for reID in a new domain. Hence, in future studies we will focus on using GANs to generate vehicle images from various viewpoints to expand the datasets and improve the performance of the reID model.
- Lin Wu, Yang Wang, Ling Shao, and Meng Wang, “3-D PersonVLAD: Learning deep global representations for video-based person re-identification,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
-  Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang, “Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1393–1404, 2017.
-  Lin Wu, Yang Wang, Junbin Gao, and Xue Li, “Where-and-when to look: Deep siamese attention networks for video-based person re-identification,” IEEE Transactions on Multimedia, 2018.
-  Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, Qing Zhang, and Xiaodi Huang, “Robust subspace clustering for multi-view data by exploiting correlation consensus,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3939–3949, 2015.
-  Lin Wu, Yang Wang, Xue Li, and Junbin Gao, “Deep attention-based spatially recursive networks for fine-grained visual recognition,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1791–1802, 2019.
- W. Deng, L. Zheng, and G. Kang, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 994–1003, 2018.
- J. Wang, X. Zhu, and S. Gong, “Transferable joint attribute-identity deep learning for unsupervised person re-identification,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2275–2284, 2018.
- L. Wei, S. Zhang, and W. Gao, “Person transfer GAN to bridge domain gap for person re-identification,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 79–88, 2018.
-  Lin Wu, Yang Wang, and Ling Shao, “Cycle-consistent deep generative hashing for cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1602–1612, 2019.
- Yang Wang, Wenjie Zhang, Lin Wu, Xuemin Lin, Meng Fang, and Shirui Pan, “Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering,” International Joint Conference on Artificial Intelligence, pp. 2153–2159, 2016.
-  Yang Wang, Lin Wu, Xuemin Lin, and Junbin Gao, “Multiview spectral clustering via structured low-rank matrix factorization,” IEEE transactions on neural networks and learning systems, vol. 29, no. 10, pp. 4833–4843, 2018.
- T. Kim, M. Cha, and H. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” Proceedings of the 34th International Conference on Machine Learning, pp. 1857–1865, 2017.
- H. Zhao, J. Shi, and X. Qi, “Pyramid scene parsing network,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890, 2017.
-  Y. Taigman, A. Polyak, and L Wolf, “Unsupervised cross-domain image generation,” arXiv preprint arXiv:1611.02200, 2016.
- K. He, X. Zhang, and S. Ren, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
-  X. Liu, W. Liu, and T. Mei, “Provid: Progressive and multimodal vehicle reidentification for large-scale urban surveillance,” IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 645–658, 2018.
-  H. Liu, Y. Tian, and Y. Yang, “Deep relative distance learning: Tell the difference between similar vehicles,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2167–2175, 2016.
- M. Abadi, P. Barham, and J. Chen, “TensorFlow: A system for large-scale machine learning,” OSDI, vol. 16, pp. 265–283, 2016.
- A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” Proceedings of the 23rd ACM International Conference on Multimedia, pp. 689–692, 2015.
-  Z. Zheng, L. Zheng, and Y. Yang, “A discriminatively learned cnn embedding for person reidentification,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 1, 2017.