Pedestrian re-identification (Re-ID) aims to match specific person identities across multiple cameras. As more and more surveillance cameras are deployed in cities, pedestrian Re-ID can play an indispensable role in modern security systems. In recent years, deep learning methods have made significant progress on the pedestrian Re-ID task [6, 7, 11]. However, pedestrian Re-ID still faces many challenges, one of which is the sheer volume of data. As pedestrian Re-ID is an open-set problem, it is impossible to manually label all the pedestrian images that surveillance cameras produce every day. For this reason, domain adaptation has attracted much attention in pedestrian Re-ID [3, 13, 4, 1, 9].
Compared with traditional domain adaptation tasks, domain adaptive pedestrian Re-ID is much harder, as the source domain and target domain share no identities. Moreover, in VisDA-2020, a synthetic dataset is provided as the labeled source domain and a real-world dataset is adopted as the target domain, so a large domain gap exists between them. In this work, we analyzed the biases that different datasets and different cameras introduce into the domain adaptive pedestrian Re-ID task, and proposed a domain adaptation framework to address them.
2.1 Data Generation
The domain adaptive pedestrian Re-ID task faces two main biases that disturb the discriminative ability of the model.
The first is the inter-domain gap between different datasets. In the challenge, the source domain contains synthetic pedestrian images while the target domain consists of real ones. Because of the large appearance difference between the two domains, a model trained on the source domain performs poorly when tested directly on the target domain. To bridge this domain gap, generative adversarial networks (GANs) are commonly adopted to transfer source domain images to the target domain. In the challenge, an SPGAN-transferred dataset is provided. Through this transfer, realistic texture from the target domain is added to the labeled synthetic source images. Therefore, supervised learning on the SPGAN-transferred dataset yields better discriminative ability on the target domain.
The second is the intra-domain bias introduced by different cameras, which reflects differences in orientation, illumination, occlusion, resolution and many other conditions. In this work, we used StarGAN to produce images in different camera styles [17, 16]. With a large number of additional images, the model generalizes better across images captured by different cameras. This data is referred to as CamStyle data in the following sections.
2.2 Baseline Model
As the source domain provides labeled images while the target domain does not, we train the baseline model in a supervised manner on the source domain. Both SPGAN-transferred images and CamStyle images are used in training. Label-smoothed cross-entropy loss is adopted for classification, and soft-margin triplet loss is adopted for better clustering performance.
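The two losses named above can be sketched in plain Python for a single sample and a single triplet. This is a minimal illustration, not the training code used in the challenge: the function names, the smoothing factor `epsilon=0.1`, and the particular smoothing variant (spreading `epsilon` uniformly over all classes) are assumptions.

```python
import math


def label_smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross entropy of one sample against a smoothed one-hot target.

    epsilon is an assumed smoothing factor; the text does not state it.
    """
    num_classes = len(logits)
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_sum for z in logits]
    loss = 0.0
    for k, log_p in enumerate(log_probs):
        # Every class gets epsilon / K; the true class gets 1 - epsilon on top.
        q = epsilon / num_classes + ((1.0 - epsilon) if k == target else 0.0)
        loss -= q * log_p
    return loss


def soft_margin_triplet_loss(dist_ap, dist_an):
    """Soft-margin triplet loss log(1 + exp(d_ap - d_an)) for one triplet."""
    x = dist_ap - dist_an
    # Stable softplus: max(x, 0) + log(1 + exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Here `dist_ap` and `dist_an` stand for the anchor-positive and anchor-negative feature distances. The soft-margin form replaces the fixed margin of the standard triplet loss with a smooth penalty, so the loss keeps shrinking as the negative moves further away than the positive.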
2.3 Domain Adaptation
To better eliminate the inter-domain gap, we introduced a domain adaptation framework that trains the model simultaneously on the source domain and the target domain. We designed a model with a backbone and a separate classifier for each domain. With this structure, the network can fully use the extracted features to identify classes from each domain and narrow the feature distribution gap between the two domains. For the source domain, the training procedure is the same as in the baseline training stage. For the target domain, an additional clustering operation is executed to produce pseudo labels. As the exact number of classes in the target domain is unknown, DBSCAN is adopted as the clustering algorithm. The model pipeline is shown in Figure 1, where blue lines indicate source domain data and orange lines indicate target domain data. The dashed lines represent the flow of labels. For the source domain, the labels come from the dataset, while for the target domain, the labels come from clustering. The embedded features extracted by the backbone network are then passed through the corresponding classifiers to obtain their classification scores.
The domain adaptation training process is separated into two stages with different pseudo label generation strategies. In the first stage, we selected the classes with the most clustered samples as the target training set and discarded the rest. As the model becomes better at discriminating different identities, the outliers can be regarded as classes with few samples. So in the second stage, we added further classes, each with one sample, to the target training set. When calculating the triplet loss, these classes contribute only as negative samples. Through this two-stage pseudo label generation strategy, the model can continuously improve its performance.
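The two-stage pseudo label generation can be illustrated with a small, self-contained sketch. It is only an interpretation under stated assumptions: `dbscan` is a minimal textbook implementation rather than the library used in the challenge, `build_pseudo_labels`, `top_k` and `num_singletons` are hypothetical names, and treating DBSCAN outliers as the one-sample classes of the second stage is an assumed reading of the text.

```python
import math
from collections import Counter


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, -1 for outliers."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional outlier, may become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
        cluster += 1
    return labels


def build_pseudo_labels(labels, top_k, num_singletons=0):
    """Stage 1: keep the top_k largest clusters (num_singletons=0).

    Stage 2: additionally add up to num_singletons outliers, each as its
    own one-sample class. Returns {sample_index: pseudo_label}.
    """
    sizes = Counter(label for label in labels if label != -1)
    kept = [c for c, _ in sizes.most_common(top_k)]
    remap = {c: new_id for new_id, c in enumerate(kept)}
    pseudo = {i: remap[l] for i, l in enumerate(labels) if l in remap}
    next_id = len(kept)
    outliers = [i for i, l in enumerate(labels) if l == -1][:num_singletons]
    for i in outliers:
        pseudo[i] = next_id
        next_id += 1
    return pseudo
```

A one-sample class has no positive partner, which is why such samples can only appear as negatives when triplets are formed.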
The clustering process is re-executed at a fixed interval of epochs. After pseudo label generation, source domain data and target domain data are each sampled at a certain rate to form a mini-batch. Within each mini-batch, original data and CamStyle data are also sampled at a fixed proportion. That is, every mini-batch is composed of both source domain data and target domain data, and contains both original data and CamStyle data.
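The mini-batch composition described above can be sketched as follows. The function name, the batch size and both ratios are hypothetical placeholders, since the text does not give the actual sampling rates; identity-balanced sampling, which triplet losses usually rely on, is left out for brevity.

```python
import random


def sample_mini_batch(source, target, source_cam, target_cam,
                      batch_size=8, target_ratio=0.5, camstyle_ratio=0.25,
                      rng=None):
    """Draw one mini-batch mixing both domains and both data kinds.

    All numeric defaults are illustrative assumptions, not the paper's values.
    Returns a list of (domain, kind, item) tuples.
    """
    rng = rng or random.Random()
    n_target = round(batch_size * target_ratio)
    n_source = batch_size - n_target
    batch = []
    for original, camstyle, n, domain in (
        (source, source_cam, n_source, "source"),
        (target, target_cam, n_target, "target"),
    ):
        # Within each domain, a fixed share of the slots goes to CamStyle data.
        n_cam = round(n * camstyle_ratio)
        batch += [(domain, "original", x) for x in rng.sample(original, n - n_cam)]
        batch += [(domain, "camstyle", x) for x in rng.sample(camstyle, n_cam)]
    return batch
```

With the defaults above, a batch of 8 holds 4 source and 4 target items, one of which per domain is a CamStyle image.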
2.4 Post Processing
After the features are extracted for the testing set, we adopted several post-processing methods to further improve model performance. They mainly target the camera bias, which largely influences the discriminative ability of the model. First, the mean of the features under the same camera is calculated and subtracted from each feature. Then, for each sample, the feature is updated with its closest neighbors. Since no camera labels are provided in the testing set, we trained an additional camera model to predict the camera label of each image. Besides, inspired by prior work, the features extracted by the camera model are used to calculate a camera distance matrix, which is subtracted from the original feature distance matrix at a certain rate. Additionally, we built a topology map from the camera labels given in the validation set, representing the probability of a pedestrian showing up under a certain camera. Images under cameras with larger probability are assigned larger distance weights. Traditional re-ranking is also adopted to update the distance matrix.
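Two of the camera-bias steps above, the per-camera mean subtraction and the subtraction of a scaled camera distance matrix, can be sketched in a few lines. The function names and the `rate` value are assumptions for illustration; the text only says the camera distances are subtracted "at a certain rate".

```python
def subtract_camera_mean(features, cam_ids):
    """Subtract the mean feature of each camera from that camera's features."""
    sums, counts = {}, {}
    for feat, cam in zip(features, cam_ids):
        acc = sums.setdefault(cam, [0.0] * len(feat))
        for d, value in enumerate(feat):
            acc[d] += value
        counts[cam] = counts.get(cam, 0) + 1
    return [
        [value - sums[cam][d] / counts[cam] for d, value in enumerate(feat)]
        for feat, cam in zip(features, cam_ids)
    ]


def debias_distance(reid_dist, cam_dist, rate=0.1):
    """Return reid_dist - rate * cam_dist, element by element.

    rate is a hypothetical weight; the actual value is not given in the text.
    """
    return [
        [r - rate * c for r, c in zip(r_row, c_row)]
        for r_row, c_row in zip(reid_dist, cam_dist)
    ]
```

For example, `subtract_camera_mean([[1.0, 2.0], [3.0, 4.0]], [0, 0])` centers both features around the mean `[2.0, 3.0]` of camera 0. Since the test set has no camera labels, `cam_ids` would come from the predictions of the camera model.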
3 Experimental Results
3.1 Implementation Details
At the baseline model training stage, a 700-class classifier performs classification for the source domain, while at the domain adaptation stage, we designed two classifiers whose dimensions correspond to the source domain and the target domain. We trained the models on different backbones pretrained on ImageNet, among which ResNet50-ibn-a, ResNet50-ibn-b, ResNet101-ibn-a and HRNetv2-w18 performed better when tested on the target domain. For data augmentation, we used random horizontal flip, padding and erasing.
We used the SGD optimizer. A warm-up strategy is adopted during the first epochs of training, and the learning rate is then decayed at two later epochs.
For standard model training, the images are resized to a fixed resolution. We also fine-tuned multiple models at a larger image size, starting from the standard models, to compare features at more scales.
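A warm-up schedule followed by step decay can be written as a small function of the epoch index. Every number below (base learning rate, warm-up length, decay epochs, decay factor) is a hypothetical placeholder chosen for illustration, because the actual values are not available in this text.

```python
def learning_rate(epoch, base_lr=0.01, warmup_epochs=10, milestones=(40, 70),
                  gamma=0.1, warmup_start=0.01):
    """Learning rate at a given epoch: linear warm-up, then step decay.

    All defaults are illustrative assumptions, not the paper's settings.
    """
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start * base_lr up to base_lr.
        alpha = epoch / warmup_epochs
        return base_lr * (warmup_start * (1 - alpha) + alpha)
    # Multiply by gamma once for every milestone already reached.
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays
```

The warm-up avoids large, unstable updates while the freshly initialized classifier layers are still far from the pretrained backbone, and the two decay steps let the model settle once the loss plateaus.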
For the camera models, we used ResNet101, ResNet152, ResNet101-ibn-a and HRNetv2-w18. The camera distance matrix used in the post-processing step is the mean of the camera distance matrices from all camera models. Finally, we calculated a weighted sum of the Re-ID distance matrices and applied the post-processing methods to obtain the final score. Further details can be found in the Reproduce Instructions.
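The fusion of distance matrices described above amounts to an element-wise mean (for the camera models) and an element-wise weighted sum (for the Re-ID models). The sketch below shows both on nested lists; the function names are hypothetical and the weights are left to the caller, since the text does not report them.

```python
def mean_matrix(matrices):
    """Element-wise mean of equally shaped matrices (lists of rows)."""
    n = len(matrices)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*matrices)]


def weighted_sum(matrices, weights):
    """Element-wise weighted sum of equally shaped matrices."""
    return [
        [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*rows)]
        for rows in zip(*matrices)
    ]
```

Fusing at the distance level, rather than concatenating features, lets models with different feature dimensions and input sizes be combined without any retraining.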
3.2 Ablation Study
3.2.1 Effectiveness of Individual Components
| + Domain Adaptation | 44.9 | 75.3 | 86.7 | 91.0 |
| + Post Processing | 70.9 | 86.5 | 92.8 | 94.4 |
We compare the evaluation results on the validation set to demonstrate the effectiveness of each component of our model; the experimental results are summarized in Table 1. In the table, "Direct Transfer" means testing on the target domain validation set with a model trained on the original source domain. Directly applying the model trained on the source domain to the target domain performs poorly, with 16.2% mAP and 32.4% Rank-1. Introducing SPGAN-generated data narrows part of the domain gap, and adding extra CamStyle data boosts the performance by a large margin. This shows that although diminishing the domain gap is effective, the camera bias is also an important issue.
The domain adaptation process further reduces the inter-domain gap, with about 15% growth in mAP and Rank-1. During the fine-tuning stage, where more samples are used, there is an additional 4% increase. The post-processing methods also play a significant role in model performance. The experiment is run on the ResNet50-ibn-a backbone, and the results are similar on the other backbones. Note that the results are obtained offline, so they may differ under other evaluation systems.
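For readers less familiar with the metrics, the sketch below computes Rank-k and average precision (AP) for a single query from its distances to the gallery; mAP is the mean of AP over all queries. It is a simplified illustration with a hypothetical function name and does not reproduce the challenge's evaluation protocol, which may, for instance, exclude gallery images from the query's own camera.

```python
def rank_k_and_ap(distances, gallery_ids, query_id, k=1):
    """Return (Rank-k hit, average precision) for one query.

    distances[i] is the distance from the query to gallery item i.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    matches = [gallery_ids[i] == query_id for i in order]
    # Rank-k: is there a correct match among the k nearest gallery items?
    rank_k = 1.0 if any(matches[:k]) else 0.0
    hits, precision_sum = 0, 0.0
    for position, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / position  # precision at this recall point
    ap = precision_sum / hits if hits else 0.0
    return rank_k, ap
```

Rank-1 only looks at the single nearest gallery image, while AP rewards ranking all images of the same identity early, which is why the two metrics can move by different amounts.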
3.2.2 Visualization of Generated Data
We selected some of the generated data used in training to show the influence of the data more intuitively. From Figure 2, it can be observed that SPGAN adds a large amount of texture to the images. Adding the CamStyle data further improves the diversity of the data, simulating a variety of resolutions and illumination conditions.
3.2.3 Effectiveness of Model Ensemble
With the above generated data and training stages, our best model (ResNet50-ibn-a) reaches about 71.1 mAP and 79.8 Rank-1. We also trained models with ResNet50-ibn-b, ResNet101-ibn-a and HRNetv2-w18 backbones at different image sizes. Finally, we integrated all models and adjusted some post-processing parameters to gain boosts of more than 5.5% in mAP and 4.7% in Rank-1 over the mean performance of all backbones. Note that our final Rank-1 is the highest among the competitors.
We have presented our framework for the domain adaptive pedestrian Re-ID challenge. It focuses mainly on the domain gap and the camera bias, both of which influence the discriminative ability of models in this task. The top performance achieved during the challenge demonstrates the effectiveness of the proposed methods, which can be further analyzed and better utilized in future work.
References
[1] Instance-guided context rendering for cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 232–242.
[2] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[3] (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 994–1003.
[4] (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6112–6121.
[5] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[6] A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia, pp. 1–1.
[7] (2019) Bag of tricks and a strong baseline for deep person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[8] (2018) Two at once: enhancing learning and generalization capacities via IBN-Net. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 464–479.
[9] (2020) Unsupervised domain adaptive re-identification: theory and practice. Pattern Recognition 102, pp. 107173.
[10] (2019) Dissecting person re-identification from the viewpoint of viewpoint. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 608–617.
[11] (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 480–496.
[12] (2020) Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[13] (2019) Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8222–8231.
[14] (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327.
[15] (2020) Random erasing data augmentation. In AAAI, pp. 13001–13008.
[16] (2018) Generalizing a person retrieval model hetero- and homogeneously. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–188.
[17] (2018) Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166.
[18] (2020) VOC-ReID: vehicle re-identification based on vehicle-orientation-camera. In Proc. CVPR Workshops.