Cross Domain Knowledge Transfer for Unsupervised Vehicle Re-identification

03/19/2019 · by Jinjia Peng, et al.

Vehicle re-identification (reID) aims to identify a target vehicle across different cameras with non-overlapping views. When a well-trained model is deployed directly to a new dataset, there is a severe performance drop because of the differences among datasets, known as domain bias. To address this problem, this paper proposes a domain adaptation framework which contains an image-to-image translation network, the vehicle transfer generative adversarial network (VTGAN), and an attention-based feature learning network (ATTNet). VTGAN makes images from the source domain (well-labeled) take on the style of the target domain (unlabeled) while preserving the identity information of the source domain. To further improve domain adaptation under varied backgrounds, ATTNet trains on the generated images with an attention structure for vehicle reID. Comprehensive experimental results clearly demonstrate that our method achieves excellent performance on the VehicleID dataset.


1 Introduction

Video surveillance for traffic control and security plays a significant role in current public transportation systems. The vehicle reID task faces intensive variations in appearance and background. Images captured by different cameras in different datasets are a primary cause of such variations. Datasets usually differ from each other in lighting, viewpoint, background, and even resolution. As shown in Fig. 1, images in VeRi-776 are brighter and cover more viewpoints than images in VehicleID, while images in VehicleID have higher resolution than images in VeRi-776. Moreover, no single dataset can cover all cases in real scenarios, so each dataset forms its own unique style, which causes domain bias among datasets. For reID [1], most existing works follow the supervised learning paradigm, which first trains the reID model on images from the target dataset to adapt to its style [2][3][4][5]. However, when the well-trained reID model is tested on another dataset without fine-tuning, there is always a severe performance drop due to the domain bias.

Figure 1: Illustration of the dataset bias between VeRi-776 and VehicleID. The two datasets present different styles, e.g., distinct lighting, backgrounds, viewpoints, and resolutions.

There are few studies on cross-domain adaptation for vehicle reID, and only a few methods exploit unlabeled target data for unsupervised person reID modelling [6][7][8]. However, some of them [9] need extra information about the source domain during training, such as attribute labels and spatio-temporal labels, which do not exist in some datasets. Only a handful of methods exploit unsupervised learning [10, 11] without any labels, for instance, SPGAN [6] and PTGAN [8]. SPGAN, designed for person reID, integrates a SiaNet with CycleGAN [12] and does not need any additional labels during training. However, though SPGAN is effective on the person transfer task, it causes deformation and color distortion in the vehicle transfer task in our experiments. PTGAN combines PSPNet [13] and CycleGAN to learn the style of the target domain while maintaining the identity information of the source domain. To keep the identity information, PSPNet is first used to segment the person images; this requires a pre-trained segmentation model, which increases the complexity of the training stage.

To address these problems, this paper proposes an end-to-end image-to-image translation network for vehicle datasets, named VTGAN. To preserve the identity information of images from the source domain and learn the style of images from the target domain, each generator in VTGAN is composed of a content encoder, a style encoder, and a decoder. An attention model is proposed in the content encoder to preserve the identity information of the source domain, and the style encoder is designed to learn the style of the target domain with a style loss. Furthermore, VTGAN does not need any labels or paired images during the translation procedure, which is closer to the real scenario. To better adapt to the target domain (unlabeled), ATTNet is designed for vehicle reID using the generated images obtained from the translation stage. Through the proposed attention structure, it generalizes better by focusing on the foreground and neglecting the background information of the input image as much as possible during training. In summary, our contributions are two-fold:

1) We propose VTGAN to generate images which have the style of the target domain while preserving the identity information of the source domain. It is an efficient unsupervised learning model that transfers content and style between different domains separately.

2) ATTNet is presented to train on the generated images; it is based on an attention structure and extracts more distinctive cues while suppressing background for the vehicle reID task.

2 Method

2.1 Overview

Our ultimate goal is to learn a vehicle reID model for an unknown target domain in which no labels are available. Hence, we introduce a two-step vehicle reID method based on the Generative Adversarial Network (GAN). The first step transfers the style between the source domain and the target domain; in this step, VTGAN is proposed to generate images which have the style of the target domain and preserve the identity information of the source domain. After generating the style-transferred images, in the second step, we design a multi-task network with an attention structure to obtain more discriminative features for vehicle reID.
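The following minimal sketch lays out this two-step procedure in Python. The function and method names (`adapt_and_train`, `translate_source_to_target`, `attnet_trainer`) are placeholders standing in for the components detailed in Sections 2.2 and 2.3, not an actual API.

```python
def adapt_and_train(source_images, source_labels, vtgan, attnet_trainer):
    """Two-step cross-domain reID sketch (placeholder components).

    `vtgan` is assumed to be a translator already trained on unlabeled images
    from both domains; `attnet_trainer` stands in for the supervised reID
    training described in Section 2.3.
    """
    # Step 1: translate labeled source images into the target-domain style.
    translated = [vtgan.translate_source_to_target(img) for img in source_images]
    # Step 2: train the reID model on the translated images, which keep their
    # source-domain identity labels but carry target-domain style.
    return attnet_trainer(translated, source_labels)
```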

2.2 VTGAN

Figure 2: The structure of VTGAN. VTGAN contains two mapping functions, $G_{S \to T}$ and $G_{T \to S}$, and the associated adversarial discriminators $D_T$ and $D_S$. $\mathcal{L}_{cyc}$ and $\mathcal{L}_{gram}$ represent the cycle consistency loss and the gram loss, which are employed to further regularize the mappings (best viewed in color).

VTGAN is designed to transfer the style between the source domain and the target domain while preserving the identity information of images from the source domain. As illustrated in Fig. 2, VTGAN consists of the generators $G_{S \to T}$, $G_{T \to S}$ and the domain discriminators $D_T$, $D_S$ for the two domains. Each generator contains three components: a content encoder, a style encoder, and a decoder. The content encoder is designed to preserve the identity information of images from the source domain through the proposed attention model, which extracts the foreground while suppressing the background. To learn the style of the target domain, the style encoder with the gram loss is added to the translation network. Finally, the decoder combines the outputs of the content encoder and the style encoder to generate the translated image.

2.2.1 Content Encoder

To keep the identity information from the source domain, the attention model in the content encoder is designed to assign higher visual attention scores to the region of interest while suppressing the background.

Figure 3: Illustration of the proposed attention structure.

As shown in Fig. 3, we denote the input feature map of the attention model as $F$. In this work, a simple feature fusion structure is utilized to generate $F$: we fuse the outputs of the ResBlocks, which can be formulated as $F = [f_1, f_2, \ldots, f_n]$, where $f_i$ is the feature map generated by the $i$-th ResBlock and $[\cdot]$ denotes the concatenation operation. For the feature vector $F_{i,j}$ of the feature map $F$ at the spatial location $(i, j)$, we can calculate its corresponding attention mask by

$$M_{i,j} = \sigma\big(\phi(F_{i,j}; W)\big), \tag{1}$$

where $\phi$ is the fully connected (FC) layer that learns a mapping function in the attention module, $W$ denotes the weights of the FC layer, and $\sigma$ is a sigmoid layer. The final attention mask $M$ is a probability map obtained using the sigmoid layer; its scores represent the probability of foreground in the input image.

After the attention model, a mask is generated which has high scores for the foreground and low scores for the background. Hence, the attended feature map is computed as the element-wise product of the attention mask and the input feature map:

$$\tilde{F}_{i,j} = M_{i,j} \odot F_{i,j}, \tag{2}$$

where $(i, j)$ is a spatial location in the mask $M$ or the feature map $F$, and $\odot$ denotes the element-wise product.
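To make Eqs. (1)-(2) concrete, here is a minimal PyTorch sketch. The module name and channel size are illustrative, and the per-location FC layer is realized as a 1×1 convolution (one possible implementation that shares the same weights across spatial positions); this is a sketch, not the authors' code.

```python
import torch
import torch.nn as nn

class ContentAttention(nn.Module):
    """Illustrative attention mask for the content encoder (Eqs. (1)-(2))."""

    def __init__(self, in_channels: int):
        super().__init__()
        # A 1x1 convolution applies the same FC weights at every location (i, j).
        self.fc = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: fused ResBlock outputs, shape (B, C, H, W)
        mask = self.sigmoid(self.fc(feat))   # Eq. (1): probability map, shape (B, 1, H, W)
        return mask * feat                   # Eq. (2): element-wise product, broadcast over channels


if __name__ == "__main__":
    f = torch.randn(2, 256, 32, 32)          # toy fused feature map
    attended = ContentAttention(256)(f)
    print(attended.shape)                     # torch.Size([2, 256, 32, 32])
```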

2.2.2 Style Encoder

Besides the content branch, there is a branch to learn the style of the target domain. Different from the content encoders, the style encoders in this branch do not contain the attention model. To learn the style of the target domain, the style encoder is designed with the gram loss so that the output style features have a distribution similar to that of the target domain. The gram loss can be formulated as follows:

$$\mathcal{L}_{gram} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \Big[ \big(G^{S \to T}_{ij} - G^{T}_{ij}\big)^2 + \big(G^{T \to S}_{ij} - G^{S}_{ij}\big)^2 \Big], \tag{3}$$

where $N_l$ is the number of feature maps, $M_l$ is calculated by $M_l = W \times H$, and $W$ and $H$ represent the width and height of the images. $G^{S \to T}$, $G^{T \to S}$, $G^{T}$, and $G^{S}$ are the gram matrices of the output features of the two translated images and of the target and source images, respectively.
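A short PyTorch sketch of the gram computation assumed above; the batch handling and the $1/(4 N_l^2 M_l^2)$ normalization follow the classic style-loss formulation, while the choice of which feature layer to feed in is left open.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map of shape (B, N, H, W); N feature maps, M = H*W."""
    b, n, h, w = feat.shape
    f = feat.view(b, n, h * w)               # (B, N, M)
    return torch.bmm(f, f.transpose(1, 2))   # (B, N, N)

def gram_loss(feat_fake: torch.Tensor, feat_real: torch.Tensor) -> torch.Tensor:
    """Style discrepancy between generated-image and real-domain features (one term of Eq. (3))."""
    b, n, h, w = feat_fake.shape
    m = h * w
    g_fake, g_real = gram_matrix(feat_fake), gram_matrix(feat_real)
    return ((g_fake - g_real) ** 2).sum() / (4.0 * n ** 2 * m ** 2 * b)
```

In the two-direction setting above, the total style term would sum `gram_loss` for the S→T output against target-domain features and for the T→S output against source-domain features.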

2.2.3 Decoder Network

The decoder network is composed of two deconvolution layers and a convolutional layer to output the generated images. Its input is the combination of $F_c$ and $F_s$, which represent the content features and the style features, respectively. In this paper, we employ a concatenation layer to fuse $F_c$ and $F_s$, together with a global skip connection structure that makes training faster and the resulting model generalize better, which can be expressed as:

$$F_d = C(F_c, F_s) + F_2, \tag{4}$$

where $C(\cdot, \cdot)$ represents the concatenation layer, $F_2$ represents the feature map generated by the second strided convolution block, and $F_c$ and $F_s$ are the outputs of the content encoder and the style encoder, respectively.
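Below is a hedged PyTorch sketch of the fusion in Eq. (4). The 1×1 projection that matches the concatenated channels to the skip feature, as well as the layer widths, are assumptions added so the element-wise sum is well defined; they are not taken from the paper.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Illustrative decoder: fuse content/style features, add a global skip, upsample to an image."""

    def __init__(self, content_ch: int, style_ch: int, skip_ch: int):
        super().__init__()
        # Project the concatenated features to the skip-connection width (assumption).
        self.fuse = nn.Conv2d(content_ch + style_ch, skip_ch, kernel_size=1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(skip_ch, 64, kernel_size=4, stride=2, padding=1),  # deconv 1
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),       # deconv 2
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=7, padding=3),                           # output conv
            nn.Tanh(),
        )

    def forward(self, f_content, f_style, f_skip):
        fused = self.fuse(torch.cat([f_content, f_style], dim=1))  # concatenation layer C(.,.)
        return self.up(fused + f_skip)                             # global skip connection, Eq. (4)
```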

2.2.4 Loss function

We formulate the loss function of VTGAN as a combination of an adversarial loss, a content loss, and a style loss:

$$\mathcal{L}_{VTGAN} = \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{content} + \lambda_2 \mathcal{L}_{style}, \tag{5}$$

where $\lambda_1$ and $\lambda_2$ control the relative importance of the three objectives. The style loss is calculated by Eq. (3). VTGAN also utilizes the target domain identity constraint as an auxiliary for image-to-image translation. The target domain identity constraint was introduced by [14] to regularize the generator to be an identity mapping on samples from the target domain, written as:

$$\mathcal{L}_{ide} = \mathbb{E}_{x_T \sim p_T}\big[\lVert G_{S \to T}(x_T) - x_T \rVert_1\big] + \mathbb{E}_{x_S \sim p_S}\big[\lVert G_{T \to S}(x_S) - x_S \rVert_1\big]. \tag{6}$$

The adversarial term $\mathcal{L}_{adv}$ consists of three parts: two adversarial losses and a cycle consistency loss. VTGAN applies adversarial losses to both mapping functions. For the generator $G_{S \to T}$ and its discriminator $D_T$, the objective can be expressed as:

$$\mathcal{L}_{GAN}(G_{S \to T}, D_T) = \mathbb{E}_{x_T \sim p_T}\big[\log D_T(x_T)\big] + \mathbb{E}_{x_S \sim p_S}\big[\log\big(1 - D_T(G_{S \to T}(x_S))\big)\big], \tag{7}$$

where $S$ and $T$ represent the source domain and the target domain, and $p_S$ and $p_T$ denote the sample distributions in the source and target domain, respectively. The objective of the generator $G_{T \to S}$ and its discriminator $D_S$ can be built in the same way. Besides, VTGAN requires $G_{T \to S}(G_{S \to T}(x_S)) \approx x_S$ and $G_{S \to T}(G_{T \to S}(x_T)) \approx x_T$ when it learns the mappings $G_{S \to T}$ and $G_{T \to S}$. So a cycle consistency loss is employed in VTGAN, which makes the network more stable and can be defined as:

$$\mathcal{L}_{cyc} = \mathbb{E}_{x_S \sim p_S}\big[\lVert G_{T \to S}(G_{S \to T}(x_S)) - x_S \rVert_1\big] + \mathbb{E}_{x_T \sim p_T}\big[\lVert G_{S \to T}(G_{T \to S}(x_T)) - x_T \rVert_1\big]. \tag{8}$$
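The sketch below assembles Eqs. (5)-(8) into loss helpers, written in PyTorch as an illustration. The L1 norms, the non-saturating generator objective, and the default weights `lambda1`/`lambda2` are assumptions; the paper's exact weighting is not given here.

```python
import torch
import torch.nn.functional as F

def adversarial_g_loss(d_fake_logits: torch.Tensor) -> torch.Tensor:
    """Generator side of Eq. (7): try to make the discriminator score translated images as real."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

def cycle_loss(x_s, x_s_rec, x_t, x_t_rec) -> torch.Tensor:
    """Eq. (8): images reconstructed after a round trip should match the originals."""
    return F.l1_loss(x_s_rec, x_s) + F.l1_loss(x_t_rec, x_t)

def identity_loss(x_t, g_s2t_of_t, x_s, g_t2s_of_s) -> torch.Tensor:
    """Eq. (6): each generator should act as an identity on images already in its output domain."""
    return F.l1_loss(g_s2t_of_t, x_t) + F.l1_loss(g_t2s_of_s, x_s)

def vtgan_generator_loss(adv, content, style, lambda1=10.0, lambda2=1.0) -> torch.Tensor:
    """Eq. (5): weighted combination. `adv` bundles both GAN directions plus the cycle term;
    lambda values here are placeholders, not the paper's settings."""
    return adv + lambda1 * content + lambda2 * style
```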

2.3 ATTNet

The diversity of backgrounds is a major factor in the cross-domain problem. To make the reID model adapt to the target domain, it is better to focus on the vehicle itself and neglect the background when training the feature learning model. Hence, a two-stream reID network with an attention structure is designed in this paper.

Figure 4: The structure of ATTNet.

As shown in Fig. 4, the input images are obtained from the image generation module and are divided into positive and negative sample pairs. In each branch, the input image is fed into five ResNet blocks [15] to output convolutional feature maps, which are then passed into a Global Average Pooling (GAP) layer to obtain the feature map $F$. $F$ is used to generate the mask through the proposed attention structure: given the feature map $F$, its attention map is computed as $M = \sigma(W \ast F)$, where $\ast$ denotes the convolution operator and $\sigma$ is a sigmoid. After obtaining the attention map $M$, the attended feature map is calculated by $M \odot F$, where $\odot$ denotes the element-wise product, and is fed into the subsequent structure. A shortcut connection architecture is introduced to embed the input of the attention network directly into its output with an element-wise sum layer, which can be described as $F' = F + M \odot F$. In this way, both the original feature map and the attended feature map are combined and used as input to the subsequent structure. After two FC layers, we obtain the feature $f$. Finally, a skip connection structure integrates $F'$ and $f$ via the concatenation layer to obtain more discriminative features for the identification and verification tasks, which can be described as $f_{final} = [f, F']$.
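As a rough illustration of one ATTNet branch (the authors used MatConvNet; this PyTorch sketch with a ResNet-50 backbone, 2048-channel features, and a 512-d embedding is an assumption, not their code):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ATTNetBranch(nn.Module):
    """Illustrative single branch: backbone -> GAP -> mask (conv + sigmoid)
    -> element-wise product -> shortcut sum -> two FC layers -> concatenation."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        base = resnet50()                                        # random init; weights not relevant here
        self.backbone = nn.Sequential(*list(base.children())[:-2])  # convolutional feature maps
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.att = nn.Sequential(nn.Conv2d(2048, 2048, kernel_size=1), nn.Sigmoid())
        self.fc = nn.Sequential(
            nn.Linear(2048, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.gap(self.backbone(x))           # pooled feature map F, shape (B, 2048, 1, 1)
        mask = self.att(feat)                       # attention map M
        shortcut = feat + mask * feat               # F' = F + M ⊙ F
        f = self.fc(shortcut.flatten(1))            # two FC layers -> feature f
        return torch.cat([f, shortcut.flatten(1)], dim=1)  # final concatenated descriptor
```

In the two-stream setup of Fig. 4, positive and negative image pairs would pass through two weight-sharing copies of this branch to feed the identification and verification objectives.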

3 Experiments

3.1 Datasets

VeRi-776 [16] is a large-scale urban surveillance vehicle dataset for reID. It contains over 50,000 images of 776 vehicles with identity annotations, camera geo-locations, image timestamps, and vehicle type and color information. In this paper, 37,781 images of 576 vehicles are employed as the training set. VehicleID [17] is a real-world surveillance dataset which contains 26,267 vehicles and 221,763 images in total. From the original testing data, four subsets containing 800, 1,600, 2,400, and 3,200 vehicles are extracted for vehicle search at different scales.

3.2 Implementation Details

For VTGAN, we train the model in TensorFlow [18] with a learning rate of 0.0002. Note that we do not utilize any label annotation during the learning procedure. The mini-batch size is 16 and the number of epochs is set to 6. During testing, we employ the generator $G_{S \to T}$ for the VeRi-776→VehicleID translation and the generator $G_{T \to S}$ for the VehicleID→VeRi-776 translation, and the translated images are used for training the reID models. For ATTNet, we implement the proposed vehicle reID model in the MatConvNet [19] framework. We use stochastic gradient descent with momentum during the training procedure. The batch size is set to 16. The learning rate of the first 50 epochs is set to 0.1, and that of the last 5 to 0.01.
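For concreteness, here is a hedged PyTorch rendering of the ATTNet optimization schedule stated above (the original used MatConvNet); the momentum value is a common default since the exact number is not preserved in this text, and `model` is a placeholder.

```python
import torch

def build_reid_optimizer(model: torch.nn.Module):
    # Momentum value is a common default; the exact value used in the paper
    # is not preserved in this text.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # Learning rate 0.1 for the first 50 epochs, then dropped to 0.01
    # for the remaining 5 epochs (batch size 16 in the data loader).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)
    return optimizer, scheduler
```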

3.3 Evaluation

3.3.1 Comparison Methods

There are very few methods addressing cross-domain vehicle reID, so in this paper we discuss several methods and test them on VeRi-776 and VehicleID. Direct Transfer means directly applying the model trained on images from the source domain to the target domain. CycleGAN [12], SPGAN [6], and VTGAN are employed to translate images from the source domain to the target domain, and the generated images are then used to train the reID model. Baseline [20] denotes the compared reID training network, and ATTNet is our proposed network.

3.3.2 Comparison of generated images

To demonstrate the effectiveness of the proposed style transfer model, VehicleID and VeRi-776 are used to train VTGAN, with CycleGAN and SPGAN as compared methods. Fig. 5 shows the comparison results, where the source domain is VeRi-776 and the target domain is VehicleID. For each group, the first row shows the original images from VeRi-776, the second and third rows show images generated by CycleGAN and SPGAN, respectively, and the last row shows images generated by the proposed VTGAN.

Figure 5: The effect of the generated images. The first row shows the original images. The images generated by CycleGAN and SPGAN lie in the second and third rows, respectively. The last row shows the images generated by VTGAN.

From Fig. 5, we can see that most images generated by CycleGAN are seriously distorted when transferring images from VeRi-776 to VehicleID. Although SPGAN works better than CycleGAN, its generated images still show evident deformation. For VTGAN, not only are the vehicle color and type information completely preserved, but the style of the target dataset is also learned. As shown in Fig. 5, the images generated by VTGAN have higher resolution and become darker, characteristics learned from VehicleID.

3.3.3 The impact of Image-to-Image Translation

Firstly, we utilize CycleGAN to translate labeled images from the source domain to the target domain and then train the baseline reID model with the translated images in a supervised way. As shown in Table 1, when trained on the VeRi-776 training set using the baseline method and tested on the different VehicleID testing sets, rank-1 accuracy improves from 35% to 39.39%, 30.42% to 32.97%, 27.28% to 28.44%, and 25.41% to 26.38%, respectively. Through such an image-level domain adaptation method, effective domain adaptation baselines can be learned. From Fig. 5, we can see that although some of the images generated by CycleGAN are distorted, the performance of the reID model trained on the generated images is still improved. This illustrates that image-to-image translation methods learn important style information from the target domain, which narrows the domain gap to a certain degree.

3.3.4 The impact of VTGAN

To verify the effectiveness of the proposed VTGAN, we conduct several experiments whose training sets are images generated by different image translation methods. As shown in Table 1, on VehicleID, compared with CycleGAN + Baseline, the gains of VTGAN + Baseline are 5.05%, 6%, 6.66%, and 5.79% in rank-1 on the different test sets, respectively. Though SPGAN performs better than CycleGAN at the image-to-image translation stage, it still causes deformation and color distortion in real scenarios (see Fig. 5). Hence, compared with SPGAN + Baseline, VTGAN + Baseline achieves 1.57%, 1.51%, 1.56%, and 1.72% improvements in rank-1 on the different test set sizes of VehicleID, respectively. All of these results demonstrate that the structure of VTGAN is more stable and generates samples better suited for training in the target domain.

3.3.5 The impact of ATTNet

To further improve reID performance on the target dataset, we propose ATTNet. Fig. 6 shows the CMC results of different methods on VehicleID. As shown in Fig. 6, methods using ATTNet outperform those using the baseline reID model, on both the original and the generated images. For instance, from Table 1, compared with Direct Transfer + Baseline, Direct Transfer + ATTNet achieves 8.26%, 9.05%, 8.67%, and 7.99% improvements in rank-1 on the different test sets when the model is trained on VeRi-776 and tested on VehicleID. Moreover, for every image translation method, the reID model using ATTNet shows a significant improvement over the baseline. This demonstrates that the reID model trained with the proposed ATTNet adapts better to the cross-domain task than the baseline method.

Methods | Test size = 800 | Test size = 1600 | Test size = 2400 | Test size = 3200
(mAP / Rank1 / Rank5, %) | | | |
Direct Transfer + Baseline | 40.05 / 35.00 / 56.68 | 34.90 / 30.42 / 48.85 | 31.65 / 27.28 / 44.49 | 29.57 / 25.41 / 42.11
CycleGAN + Baseline | 44.24 / 39.39 / 60.10 | 37.68 / 32.97 / 53.16 | 33.17 / 28.44 / 47.92 | 30.73 / 26.38 / 43.84
SPGAN + Baseline | 48.27 / 42.87 / 66.55 | 42.51 / 37.46 / 58.97 | 38.41 / 33.54 / 53.68 | 35.04 / 30.45 / 49.13
VTGAN + Baseline | 49.53 / 44.44 / 66.74 | 43.90 / 38.97 / 59.93 | 40.07 / 35.10 / 56.29 | 36.86 / 32.17 / 51.63
Direct Transfer + ATTNet | 47.97 / 43.26 / 62.93 | 43.94 / 39.47 / 58.51 | 40.42 / 35.95 / 54.34 | 37.60 / 33.40 / 50.55
CycleGAN + ATTNet | 46.96 / 42.68 / 60.72 | 43.27 / 38.88 / 57.44 | 39.39 / 35.09 / 53.05 | 37.05 / 33.07 / 49.38
SPGAN + ATTNet | 52.72 / 48.25 / 67.20 | 48.01 / 43.44 / 63.04 | 44.17 / 39.51 / 59.05 | 41.05 / 36.75 / 54.63
VTGAN + ATTNet | 54.01 / 49.48 / 68.66 | 49.72 / 45.18 / 63.99 | 45.18 / 40.71 / 59.02 | 42.94 / 38.72 / 55.87
Table 1: Comparison of various domain adaptation methods with the Baseline model and the ATTNet reID model on VehicleID. Each cell reports mAP / Rank-1 / Rank-5 (%).
Figure 6: The CMC curves of different methods on VehicleID. (a) The result tested on the set with 800 vehicles. (b) The result tested on the set with 1600 vehicles. (c) The result tested on the set with 2400 vehicles. (d) The result tested on the set with 3200 vehicles.

4 Conclusion

In this paper, we propose a GAN-based vehicle reID framework which includes VTGAN and ATTNet for domain adaptation. VTGAN is designed to generate vehicle images that preserve the identity information of the source domain and learn the style of the target domain. ATTNet is proposed to train the reID model with the generated images. The results show that both VTGAN and ATTNet achieve good performance. Moreover, existing datasets usually contain only a limited set of viewpoints of vehicle images, which is also a limitation for the reID task in a new domain. Hence, in future work, we will focus on using GANs to generate vehicle images from various viewpoints to expand the dataset and further improve reID performance.

References

  • [1] L. Wu, Y. Wang, L. Shao, and M. Wang, "3-D PersonVLAD: Learning deep global representations for video-based person reidentification," IEEE Transactions on Neural Networks and Learning Systems, 2019.
  • [2] Y. Wang, X. Lin, L. Wu, and W. Zhang, "Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval," IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1393–1404, 2017.
  • [3] L. Wu, Y. Wang, J. Gao, and X. Li, "Where-and-when to look: Deep siamese attention networks for video-based person re-identification," IEEE Transactions on Multimedia, 2018.
  • [4] Y. Wang, X. Lin, L. Wu, W. Zhang, Q. Zhang, and X. Huang, "Robust subspace clustering for multi-view data by exploiting correlation consensus," IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3939–3949, 2015.
  • [5] L. Wu, Y. Wang, X. Li, and J. Gao, "Deep attention-based spatially recursive networks for fine-grained visual recognition," IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1791–1802, 2019.
  • [6] W. Deng, L. Zheng, and G. Kang, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 994–1003, 2018.
  • [7] J. Wang, X. Zhu, and S. Gong, "Transferable joint attribute-identity deep learning for unsupervised person re-identification," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2275–2284, 2018.
  • [8] L. Wei, S. Zhang, and W. Gao, "Person transfer GAN to bridge domain gap for person re-identification," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 79–88, 2018.
  • [9] L. Wu, Y. Wang, and L. Shao, "Cycle-consistent deep generative hashing for cross-modal retrieval," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1602–1612, 2019.
  • [10] Y. Wang, W. Zhang, L. Wu, X. Lin, M. Fang, and S. Pan, "Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering," International Joint Conference on Artificial Intelligence, pp. 2153–2159, 2016.
  • [11] Y. Wang, L. Wu, X. Lin, and J. Gao, "Multiview spectral clustering via structured low-rank matrix factorization," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4833–4843, 2018.
  • [12] T. Kim, M. Cha, and H. Kim, "Learning to discover cross-domain relations with generative adversarial networks," Proceedings of the 34th International Conference on Machine Learning, pp. 1857–1865, 2017.
  • [13] H. Zhao, J. Shi, and X. Qi, "Pyramid scene parsing network," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890, 2017.
  • [14] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," arXiv preprint arXiv:1611.02200, 2016.
  • [15] K. He, X. Zhang, and S. Ren, "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • [16] X. Liu, W. Liu, and T. Mei, "PROVID: Progressive and multimodal vehicle reidentification for large-scale urban surveillance," IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 645–658, 2018.
  • [17] H. Liu, Y. Tian, and Y. Yang, "Deep relative distance learning: Tell the difference between similar vehicles," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2167–2175, 2016.
  • [18] M. Abadi, P. Barham, and J. Chen, "TensorFlow: A system for large-scale machine learning," OSDI, vol. 16, pp. 265–283, 2016.
  • [19] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," Proceedings of the 23rd ACM International Conference on Multimedia, pp. 689–692, 2015.
  • [20] Z. Zheng, L. Zheng, and Y. Yang, "A discriminatively learned CNN embedding for person reidentification," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 1, 2017.