Progressive Transfer Learning for Person Re-identification

08/07/2019 ∙ by Zhengxu Yu, et al. ∙ Zhejiang University

Model fine-tuning is a widely used transfer learning approach in person re-identification (ReID) applications, which fine-tunes a pre-trained feature extraction model on the target scenario instead of training a model from scratch. It is challenging due to the significant variations inside the target scenario, e.g., different camera viewpoints, illumination changes, and occlusion. These variations result in a gap between the distribution of each mini-batch and the distribution of the whole dataset when using mini-batch training. In this paper, we study model fine-tuning from the perspective of the aggregation and utilization of the global information of the dataset when using mini-batch training. Specifically, we introduce a novel network structure called the Batch-related Convolutional Cell (BConv-Cell), which progressively collects the global information of the dataset into a latent state and uses this latent state to rectify the extracted feature. Based on the BConv-Cells, we further propose the Progressive Transfer Learning (PTL) method to facilitate the model fine-tuning process by jointly training the BConv-Cells and the pre-trained ReID model. Empirical experiments show that our proposal can greatly improve the performance of the ReID model on the MSMT17, Market-1501, CUHK03 and DukeMTMC-reID datasets. The code will be released later at <https://github.com/ZJULearning/PTL>


1 Introduction

Person re-identification (ReID) aims to re-identify the same person in different images captured by different cameras or at different times. Due to its wide applications in surveillance and security, person ReID has attracted much interest from both academia and industry in recent years.

With the development of deep learning methods and newly emerged person ReID datasets, the performance of person ReID has been boosted significantly in recent years. However, several open problems remain. First, training a feature extraction model from scratch needs a large volume of annotated data, but annotated data are hard to acquire in person ReID tasks due to poor image quality and the privacy concerns of pedestrians. Hence, making use of existing datasets to help train the feature extractor has attracted great attention in the community. Second, the significant variations between different scenarios and within the same scenario make the person ReID task challenging. A noticeable performance degradation often occurs if we directly apply a pre-trained model to the target dataset without fine-tuning it on the target scenario.

Most recently proposed works [1, 12] have focused on mitigating the impact of variations between different datasets. Most of these works focus on transferring the image styles of the target domain and the source domain to a common style using models based on Generative Adversarial Networks (GANs). However, imperfect style-transfer models can introduce noise and potentially change the data distribution of the whole dataset. Meanwhile, the person ID in the generated images is not guaranteed to be the same as in the real images.

As for variations inside the dataset, which we focus on in this work, they have received less attention in recently proposed works. The distribution difference between each mini-batch and the entire dataset caused by internal variations has a significant influence on the model fine-tuning process. This difference leads to a deviation in gradient estimation and thus hampers the effect of model fine-tuning. The most straightforward approach to mitigate this problem is to increase the batch size. However, Keskar et al. [9] and our experiments reveal that a large-batch setting tends to converge to sharp minimizers, which further leads to poorer performance.

Moreover, most state-of-the-art deep learning methods in person ReID tasks use an off-the-shelf network, such as DenseNet [8] or ResNet [6], as the backbone network. However, deep CNNs are difficult to initialize and optimize with limited training data. Therefore, model fine-tuning is widely used to mitigate the shortage of annotated training data in person ReID tasks, which makes the study of how to mitigate the impact of internal variations even more critical. For instance, most of the off-the-shelf models used in ReID tasks are pre-trained on a relatively larger dataset such as ImageNet [13] and then fine-tuned on the target dataset.

In this paper, we study how to mitigate the impact of internal variations from the viewpoint of aggregating and utilizing the global information of the dataset. First, we propose a novel CNN building block, which we call the Batch-related Convolutional Cell (BConv-Cell). The BConv-Cell progressively aggregates the global information of the dataset into a latent state in a batch-wise manner. The latent state aggregated in previous batches is used to mitigate the impact of internal variations in the following batches. Based on the BConv-Cells, we further propose the Progressive Transfer Learning (PTL) method to fine-tune the pre-trained model by integrating it with the BConv-Cells. We conduct extensive experiments on the MSMT17 [17], Market-1501 [22], CUHK03 [10] and DukeMTMC-reID [23] datasets to show that our proposal can effectively improve ReID performance.

We summarize the contributions of this work as follows:

  1. We propose a novel network structure called the Batch-related Convolutional Cell (BConv-Cell). In mini-batch training, the BConv-Cells can progressively aggregate the global information of the dataset and then use this information to help optimize the model in subsequent batches.

  2. Based on the BConv-Cells, we propose the Progressive Transfer Learning (PTL) method to fine-tune a pre-trained model on the target scenario by integrating it with the BConv-Cells.

  3. The experimental results show that models fine-tuned using our proposal achieve state-of-the-art performance on four widely used person ReID datasets.

Figure 1: Sketch of the PTL network. The black dotted box indicates the backbone network. $x_t^l$ and $o_t^l$ are the outputs of the $l$-th Conv-block of the backbone and the related BConv-Cell respectively, $X_t$ denotes the input image, $t$ indicates the $t$-th input batch, and $H_{t-1}^l$ is the latent state of the $l$-th BConv-Cell after the last batch. The red dotted box denotes the block pair formed by a Conv-block and a BConv-Cell. In each block pair, $x_t^l$ and $o_t^{l-1}$ are concatenated and fed into a 1x1 Conv-layer before being fed into the BConv-Cell. The output of the last BConv-Cell and that of the backbone network are concatenated before being fed into the final 1x1 Conv-layer. The latent state of each BConv-Cell is stored after every batch and fed back to the same BConv-Cell when the next batch arrives.

2 Batch-related Convolutional Cell

The BConv-Cell is based on the straightforward idea of making use of the global information of the dataset to mitigate the adverse influence caused by internal variations.
The BConv-Cell is inspired by the Conv-LSTM [19]. However, there are several fundamental differences between BConv-Cells and Conv-LSTMs. First, there is no notion of time and no explicit temporal connection between inputs in the BConv-Cells. Moreover, the BConv-Cells are not designed to handle sequential inputs but single segmented images. Second, the BConv-Cells have a different architecture from the Conv-LSTMs: the BConv-Cells only maintain a latent state that contains the aggregated global information, whereas the Conv-LSTMs keep both a hidden state and a cell state. Furthermore, the BConv-Cells are not designed to conduct prediction.
By using this memory mechanism, the BConv-Cells can progressively collect global information and use it to facilitate the parameter optimization process during fine-tuning. Moreover, unlike other LSTM-based methods such as meta-learners, the output of the BConv-Cells can be directly used as the extracted feature. Meanwhile, a BConv-Cell is essentially a stack of Conv-layers, so it can be used as a building block of a multi-layer feature extraction network.

The key equations of the BConv-Cell are shown as follows:

$$
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
H_t &= f_t \circ H_{t-1} + i_t \circ \tanh\!\left(W_{xh} * X_t + W_{hh} * H_{t-1} + b_h\right) \\
O_t &= o_t \circ \tanh\!\left(H_t\right)
\end{aligned}
\tag{1}
$$

where $*$ denotes the convolution operator, $\circ$ denotes the Hadamard product, $\sigma$ denotes the sigmoid function, and $X_t$ is the input of the BConv-Cell in the $t$-th batch. $i_t$, $f_t$ and $o_t$ are the outputs of the input gate, forget gate and output gate respectively, $H_{t-1}$ is the latent state reserved after the $(t-1)$-th batch, $W$ denotes the weights of the corresponding convolutional layers in the BConv-Cell, and $O_t$ is the output of the BConv-Cell. The inputs $X_t$, latent states $H_t$ and gate outputs are all 3-dimensional tensors.

As shown in Eq. 1, the output $O_t$ is determined by the latent state $H_t$ and the input $X_t$. The latent state $H_t$ is determined by the input $X_t$ and the previous latent state $H_{t-1}$. From the fourth formula of Eq. 1, we can notice that $H_t$ maintains part of the information of all the historical input batches. The iteration formula of the latent state is:

$$
H_t = g\!\left(X_t, H_{t-1}\right) = g\!\left(X_t, g\!\left(X_{t-1}, \dots g\!\left(X_1, H_0\right)\dots\right)\right)
\tag{2}
$$

where $g$ is the simplified notation of the composition of functions in Eq. 1 that updates the latent state.
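
To make the mechanism concrete, below is a minimal PyTorch sketch of a BConv-Cell consistent with the reconstructed Eq. 1. It is not the authors' released implementation; the kernel size, the tanh candidate branch, and the choice to detach the stored state between batches are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class BConvCell(nn.Module):
    """Sketch of a Batch-related Convolutional Cell (Eq. 1).

    The cell keeps a single latent state H that persists across batches
    and is combined with the current input through convolutional gates.
    """

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution producing the input, forget and output gates
        # plus the candidate update, split along the channel axis.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)
        self.hidden_channels = hidden_channels
        self.H = None  # latent state aggregated over previous batches

    def reset_state(self):
        # Called at the start of every epoch (and at test time).
        self.H = None

    def forward(self, x):
        if self.H is None or self.H.shape[0] != x.shape[0]:
            self.H = x.new_zeros(x.shape[0], self.hidden_channels,
                                 x.shape[2], x.shape[3])
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, self.H], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        # Fourth formula of Eq. 1: the latent state keeps part of the
        # information of all historical input batches.
        H_new = f * self.H + i * torch.tanh(g)
        # Detach so the state carries information, not gradients,
        # across batches (an assumption of this sketch).
        self.H = H_new.detach()
        return o * torch.tanh(H_new)
```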

3 Progressive Transfer Learning Network

3.1 Progressive Transfer Learning

Given an off-the-shelf CNN as the backbone network, we pair up the BConv-Cells with its Conv-blocks to form a new network, which we call the Progressive Transfer Learning (PTL) network. A sketch of the PTL network is shown in Figure 1. The red dotted box denotes a building block of the PTL network, which is formed by a BConv-Cell, a 1x1 Conv-layer and a Conv-block of the backbone network. Formally, we define this building block as:

$$
\begin{aligned}
x_t^{l} &= F^{l}\!\left(x_t^{l-1}\right) \\
z_t^{l} &= \phi^{l}\!\left(\left[\, x_t^{l},\; o_t^{l-1} \,\right]\right) \\
o_t^{l},\; H_t^{l} &= G^{l}\!\left(z_t^{l},\; H_{t-1}^{l}\right)
\end{aligned}
\tag{3}
$$

where $X_t$ indicates the input image of the $t$-th batch (with $x_t^0 = X_t$), $x_t^l$ is the output of the $l$-th Conv-block in the $t$-th batch, and $o_t^l$ is the output of the $l$-th BConv-Cell. Eq. 3 only contains the second and the third equations when $l = 0$, i.e., the first BConv-Cell operates on the input image directly. The functions $F$ and $G$ represent the mapping functions learned by the Conv-block and the BConv-Cell respectively, $\phi$ is the 1x1 Conv-layer shown in Figure 1, and $H_{t-1}^l$ is the latent state of the $l$-th BConv-Cell after the $(t-1)$-th batch. The structure of the Conv-block is flexible and can be replaced by the Conv-block of many deep CNNs such as DenseNet or ResNet.

As shown in Eq. 3, each BConv-Cell learns a mapping function from the input to the feature space while collecting global information and updating its latent state. We can notice from Eq. 3 that the discriminative knowledge of the past batches is progressively aggregated into the latent state.
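
As an illustration, the following is a minimal PyTorch sketch of one block pair of Eq. 3, reusing the BConvCell sketch above. The channel arguments and the spatial-size alignment of the previous cell output are assumptions of this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PTLBlock(nn.Module):
    """Sketch of one Conv-block / BConv-Cell pair (Eq. 3).

    `conv_block` is a block taken from the backbone (e.g. a DenseNet
    dense block); channel numbers are illustrative assumptions.
    """

    def __init__(self, conv_block, block_out_ch, prev_cell_ch, cell_ch):
        super().__init__()
        self.conv_block = conv_block                      # F^l
        self.fuse = nn.Conv2d(block_out_ch + prev_cell_ch,
                              cell_ch, kernel_size=1)     # 1x1 Conv phi^l
        self.bconv_cell = BConvCell(cell_ch, cell_ch)     # G^l

    def forward(self, x_prev, o_prev):
        x = self.conv_block(x_prev)                       # x_t^l = F^l(x_t^{l-1})
        # Match the spatial size of the previous cell output if needed
        # (the backbone down-samples between blocks).
        if o_prev.shape[-2:] != x.shape[-2:]:
            o_prev = nn.functional.adaptive_avg_pool2d(o_prev, x.shape[-2:])
        z = self.fuse(torch.cat([x, o_prev], dim=1))      # z_t^l
        o = self.bconv_cell(z)                            # o_t^l; latent state updated inside
        return x, o
```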

3.2 Network Architecture

We have tested the PTL method with several different backbone network structures, including DenseNets and ResNets. Here we use DenseNet-161 as the backbone network to describe the construction of the PTL network.

The DenseNet-161 consists of five Conv-blocks; we use four BConv-Cells to pair up with the top four Conv-blocks, as shown in Figure 1. At the top of the network, we use a BConv-Cell to capture the low-level features of the input image, as shown on the left of Figure 1. At the bottom of the network, the output of the last BConv-Cell is concatenated with the output of the last Conv-block and then fed into a 1x1 Conv-layer to obtain the feature vector. During training, the feature vector is fed into a classifier that contains three fully connected layers. For simplicity, the classifier is not shown in Figure 1. During evaluation, we directly use the feature vector to conduct image retrieval.

As we can see in Figure 1, feature maps propagate along two dimensions in the PTL network. The first is the batch iteration: the BConv-Cells evolve their latent states with each input batch and transmit them to the next batch. The second is the depth of the network, in which feature maps propagate from the first layer to the last layer.

During testing, we set all the latent states to zero. To simulate the test condition, all the latent states are also set to zero at the beginning of each epoch during training. This setting ensures that historical knowledge is progressively collected and aggregated only once in each epoch.
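
This reset behaviour can be sketched as follows, reusing the BConvCell sketch above. The optimizer, loss, and the assumption that the model returns a (logits, feature) pair are placeholders of this sketch, not the authors' code.

```python
import torch

def train_ptl(model, loader, optimizer, criterion, epochs=50, device="cuda"):
    """Sketch of the fine-tuning loop with per-epoch latent-state resets."""
    model.to(device).train()
    for epoch in range(epochs):
        # Reset every BConv-Cell so global information is aggregated
        # exactly once per epoch, matching the test-time condition.
        for module in model.modules():
            if isinstance(module, BConvCell):
                module.reset_state()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits, _ = model(images)   # assumed to return (logits, feature)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```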

As mentioned above, the backbone in Figure 1 can be replaced by most commonly used feature extraction networks. In this work, we use ResNet-50, DenseNet-161 and MGN [16] as backbone networks.

3.3 Parameter Optimization Procedure

Our proposal facilitates parameter optimization by using the BConv-Cells to cooperate with the backbone network, and it does not limit the selection of the optimization method. Hence, the combined model can still be optimized with commonly used optimizers such as SGD and SGD with momentum (SGD-M).

We argue that the PTL method can make up for two shortcomings of the SGD-M optimizer. First, in SGD-M, the historical gradients are aggregated as a rough linear sum with manually pre-defined weights, which makes the aggregation inflexible and sub-optimal. Second, the loss after each batch is determined only by the current input batch, which carries a strong sample bias and leads to performance oscillation during training.

By using the PTL method, the historical gradient aggregation is replaced by recursively calculating the gradient of a composition function with learnable weights. Moreover, the sample bias of the current batch can be mitigated by using the historical knowledge carried by the learned latent states $H$.

4 Student-Teacher Distillation Method

Figure 2: The implementation of STD method. The teacher model is set to evaluation mode during the whole process.

Compared with the backbone network, the number of parameters of the PTL network inevitably increases. To compare fairly with the baselines, we introduce an improved model distillation method called the Student-Teacher Distillation (STD) method to fine-tune a backbone model (e.g., ResNet-50) using the fine-tuned PTL model as the teacher. The STD method is not essential for the PTL method in practical applications.

As a prerequisite, we assume that we have obtained a fine-tuned model by using our proposed PTL method. We then introduce a new objective function for model distillation. The objective function consists of two parts. The first is a cross-entropy loss between the output predictions of the student model and the ground truth. The second is an L1 loss between the output feature vector of the student model and that of the teacher model. The new objective function is given by:

$$
L = (1 - \lambda)\, L_{CE}\!\left(p_s, y\right) + \lambda\, \lVert f_s - f_t \rVert_1
\tag{4}
$$

where $\lambda$ is a hyper-parameter that adjusts the ratio of the cross-entropy loss and the L1 loss, $p_s$ denotes the output predictions of the student model, $y$ is the ground truth, and $f_s$ and $f_t$ are the output feature vectors of the student and teacher models respectively. This new objective function combines the supervision information with the merit of the PTL network for extracting discriminative features.

The implementation of the STD method is shown in Figure 2. We set the teacher model to evaluation mode during the whole process. The input image is fed into the teacher and student models at the same time, after which the parameters of the student network are updated according to the proposed objective function in Eq. 4. After training, the teacher model can be discarded.
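
A minimal sketch of one STD training step is given below, assuming the combined loss is the convex combination of the two terms reconstructed in Eq. 4 and that the teacher returns a feature vector while the student returns a (logits, feature) pair; these interfaces are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def std_loss(student_logits, student_feat, teacher_feat, labels, lam=0.8):
    # Eq. 4 (as reconstructed above): cross-entropy w.r.t. the ground truth
    # plus an L1 loss between student and teacher feature vectors.
    ce = F.cross_entropy(student_logits, labels)
    l1 = F.l1_loss(student_feat, teacher_feat)
    return (1.0 - lam) * ce + lam * l1

def std_step(student, teacher, optimizer, images, labels, lam=0.8):
    teacher.eval()                        # teacher stays in evaluation mode
    with torch.no_grad():
        teacher_feat = teacher(images)    # assumed to return the feature vector
    student_logits, student_feat = student(images)
    loss = std_loss(student_logits, student_feat, teacher_feat, labels, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```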

5 Experiments

We first carry out model fine-tuning experiments with our proposal on four widely used ReID datasets and compare it with both state-of-the-art ReID methods and several transfer learning methods. We then conduct model transfer experiments among multiple datasets to evaluate the performance of the PTL method when handling multi-step transfer.

5.1 Dataset

We selected four widely used ReID datasets to evaluate our proposal: Market-1501, DukeMTMC-reID, MSMT17 and CUHK03.

Market-1501.

The Market-1501 dataset contains 32,668 annotated bounding boxes of 1,501 identities.

DukeMTMC-reID.

The DukeMTMC-reID dataset contains 1,404 identities. 702 IDs are selected as the training set and the remaining 702 IDs as the testing set.

MSMT17.

The raw videos of the MSMT17 dataset were recorded on 4 days with different weather conditions within a month, using 12 outdoor cameras and 3 indoor cameras. The MSMT17 dataset contains 126,441 bounding boxes of 4,101 identities. We followed the same dataset split as Wei et al. [17], and we also used the evaluation code provided by them (https://github.com/JoinWei-PKU/MSMT17_Evaluation).

CUHK03.

The CUHK03 dataset consists of 14,097 images of 1,467 persons from 6 cameras. Two types of annotations are provided in this dataset: manually labeled pedestrian bounding boxes and DPM-detected bounding boxes. We followed the same dataset split as used in [16]. For all experiments on Market-1501, DukeMTMC-reID and CUHK03, we used the evaluation code provided in Open-ReID (https://github.com/Cysu/open-reid).

Market-Duke.

We combine the training sets of Market-1501 and DukeMTMC-reID to form a new dataset called the Market-Duke dataset. We further use this dataset to train models to compare one-step model fine-tuning with multi-step model fine-tuning. The validation, query and gallery sets of the two original datasets are discarded.

5.2 Experiment Setting

We select the DenseNet-161 model and the ResNet-50 model, both pre-trained on the ImageNet dataset, as backbone models. As a state-of-the-art model in ReID tasks, we select the MGN [16] model, which also uses ResNet-50 as its backbone network. We modify these backbone networks using our proposed PTL method and name the resulting models DenseNet-161+PTL, ResNet-50+PTL and MGN+PTL respectively. We then use the STD method to train a DenseNet-161 model (DenseNet-161+PTL+STD) using the DenseNet-161+PTL as the teacher model.

All images are reshaped to 256x128 (height x width) before being fed into the network, except for the experiments with MGN and MGN+PTL, which use an image size of 384x128. We take the output of the 1x1 Conv-layer as the discriminative feature. The initial learning rate is set to 0.01 and is decayed by a factor of ten every ten epochs. Models are fine-tuned for 50 epochs. Unless otherwise stated, in all of our experiments, we use SGD-M as the optimizer. The hyper-parameter $\lambda$ is set to 0.8 empirically in the following experiments. For experiments involving MGN and MGN+PTL, we follow the experiment settings in [16]. Meanwhile, we use the single-query setting in all experiments.
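
For illustration, this schedule can be written as follows in PyTorch, under the assumption that "decayed by a factor of ten every ten epochs" means multiplying the learning rate by 0.1 every 10 epochs; the backbone below is only a placeholder for the PTL network.

```python
import torch
import torchvision

# Placeholder backbone; the paper's actual model is the PTL network.
model = torchvision.models.densenet161(pretrained=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):
    # ... run one fine-tuning epoch over the target dataset here ...
    scheduler.step()
```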

To the best of our knowledge, the official implementation of the MGN has not been released. Hence, we use the reproduction published at https://github.com/GNAYUOHZ/ReID-MGN. The results obtained with this reproduction code are noted as MGN (reproduced) in all the tables. Although the MGN model uses ResNet-50 as its backbone network, it has far more parameters (66m) than the ResNet-50 model (28m). Due to GPU resource limitations, we have not conducted experiments with MGN+PTL+STD.

We use the cross-entropy loss in all of the fine-tuning processes in our experiments, except for MGN+PTL. The MGN uses a combined loss function consisting of a cross-entropy loss and a triplet loss; we use the same loss function in all MGN+PTL experiments.

5.3 MSMT17

Method #Param. mAP CMC-1
GoogLeNet [17] - 23.00 47.60
PDC [17] - 29.70 58.00
GLAD [17] - 34.00 61.40
ResNet-50 28m 28.63 59.77
ResNet-50+PTL 35m 32.58 62.76
DenseNet-161 32m 38.60 70.80
DenseNet-161+PTL 42m 42.25 72.65
DenseNet-161+PTL+STD 32m 41.38 73.12
Table 1: Results on the MSMT17 dataset. #Param. indicates the number of parameters; m indicates million.

We first evaluate the PTL method on the MSMT17 dataset. The detailed evaluation statistics are shown in Table 1. We can see a significant performance improvement from using the PTL method. DenseNet-161+PTL+STD outperforms not only the backbone model but also all of the baseline methods. We can also notice that the DenseNet-161+PTL+STD model achieves a higher CMC-1 score than the DenseNet-161+PTL model. We attribute this to the combined loss function of the STD method: by combining the cross-entropy loss with the L1 loss, the student model can learn the discriminative knowledge from the teacher model while imposing restrictions on the learned knowledge.

5.4 Market-1501

Method mAP CMC-1
DML [21] 70.51 89.34
HA-CNN [11] 75.70 91.20
PCB+RPP [14] 81.60 93.80
MGN [16] 86.90 95.70
DenseNet-161* 69.90 88.30
DenseNet-161 76.40 91.70
DenseNet-161+PTL 77.50 92.50
DenseNet-161+PTL+STD 77.50 92.20
MGN (reproduced) 85.80 94.60
MGN+PTL 87.34 94.83
Table 2: Results on the Market-1501 dataset. DenseNet-161* used a batch size of 90, other experiments involving DenseNet-161 used a batch size of 32. MGN (reproduced) is our reproduction of the MGN [16].

We use DenseNet-161 and MGN as backbone models to evaluate the performance of the PTL method on the Market-1501 dataset. We select several state-of-the-art person ReID methods as baselines. Among these methods, DML [21] is also pre-trained on ImageNet and transferred to Market-1501.

The results are summarized in Table 2. We can notice that by using the PTL method and the STD method, a simple DenseNet-161 model can outperform the state-of-the-art transfer learning based person ReID methods on the Market-1501 dataset. Meanwhile, the MGN+PTL outperforms all the state-of-the-art methods.

Moreover, we can notice that using a large batch size (DenseNet-161*) is not an effective way to narrow the gap between the distribution of each mini-batch and the distribution of the dataset. On the contrary, a large batch size can result in poorer performance.

5.5 CUHK03

We then conduct model fine-tuning experiments on the CUHK03 dataset. We compare the performance of MGN+PTL with several state-of-the-art methods. The results are summarized in Table 3. We can notice that by using the PTL method, the ReID performance of the MGN model improves substantially and outperforms all the state-of-the-art methods.

Methods Detected Labelled
mAP CMC-1 mAP CMC-1
HA-CNN [11] 38.60 41.70 41.00 44.40
PCB [14] 54.20 61.30 - -
PCB+RPP [14] 57.50 63.70 - -
MGN [16] 66.00 66.80 67.40 68.00
MGN (reproduced) 69.41 71.64 72.96 74.07
MGN+PTL 74.22 76.14 77.31 79.79
Table 3: Results on the CUHK03 dataset.

5.6 DukeMTMC-reID

Method mAP CMC-1
HA-CNN [11] 63.80 80.50
PCB [14] 69.20 83.30
MGN [16] 78.40 88.70
MGN (reproduced) 77.07 87.70
MGN+PTL 79.16 89.36
Table 4: Results on the DukeMTMC-reID dataset.

We then conduct experiments on the DukeMTMC-reID dataset. As baselines, we compare the performance with several state-of-the-art methods, including HA-CNN, PCB and MGN. The results are shown in Table 4; we can notice that by using our method, the MGN+PTL model outperforms all state-of-the-art methods.

5.7 Transfer among Multiple Datasets

In real-world applications, a ReID model needs to be transferred among a series of datasets to take advantage of all available data. Therefore, we conduct multi-dataset transfer experiments to evaluate the performance of our proposal when dealing with model fine-tuning across multiple datasets. Similar to the experiment on the MSMT17 dataset, we also use the STD method to train a DenseNet-161 model to compare fairly with the baselines.

The detailed results are shown in Table 5, from which we can see that the PTL method achieves better performance than the baseline models. We also notice that two-step transfer achieves better performance than one-step transfer. For instance, fine-tuning in the order 'Duke to Market to MSMT17' outperforms 'Market-Duke to MSMT17'. We argue that this is because the style variation within the merged Market-Duke dataset is substantially richer than that of either the Market-1501 or DukeMTMC-reID dataset.

Moreover, we can see that the order of fine-tuning can influence the performance of the final model. The model fine-tuned in the order 'Duke to Market to MSMT17' achieves the highest score in both mAP and CMC-1.

Method Fine-tuning list mAP CMC-1

DenseNet-161 Market-Duke to MSMT17 41.12 72.49
DenseNet-161+PTL+STD Market-Duke to MSMT17 42.53 74.11
DenseNet-161 Market to Duke to MSMT17 41.22 72.78
DenseNet-161+PTL+STD Market to Duke to MSMT17 42.34 73.60
DenseNet-161 Duke to Market to MSMT17 41.80 73.00
DenseNet-161+PTL+STD Duke to Market to MSMT17 42.73 74.31
DenseNet-161+PTL Duke to Market 76.00 91.30
DenseNet-161+PTL+STD Duke to Market 75.50 91.10
DenseNet-161+PTL Duke to MSMT17 to Market 77.90 91.60
DenseNet-161+PTL+STD Duke to MSMT17 to Market 77.40 91.60
Table 5: Results of transfer among multiple datasets. The fine-tuning list indicates the order of fine-tuning, e.g., 'Duke to Market' means the model is fine-tuned on DukeMTMC-reID before being fine-tuned on Market-1501. Market and Duke denote the Market-1501 dataset and the DukeMTMC-reID dataset respectively.

5.8 Evaluate STD Method on MSMT17

In this subsection, we evaluate the STD method on the MSMT17 dataset. We conduct a series of comparative experiments by adjusting the ratio between the cross-entropy loss and the L1 loss.

We use the DenseNet-161+PTL model transferred from Market-1501 to MSMT17 (cf. Table 5) as the teacher model. The student model is a DenseNet-161 model that has been transferred from ImageNet to Market-1501 using an SGD-M optimizer.

Method #Param. λ mAP CMC-1
DenseNet-161+PTL 42m - 42.45 72.48
DenseNet-161+PTL+STD 32m 0.00 38.60 70.80
DenseNet-161+PTL+STD 32m 0.30 41.26 72.52
DenseNet-161+PTL+STD 32m 0.50 42.27 73.49
DenseNet-161+PTL+STD 32m 0.80 42.51 73.37
DenseNet-161+PTL+STD 32m 1.00 41.66 72.32
Table 6: Results of the STD method on the MSMT17 dataset. λ denotes the hyper-parameter in Eq. 4, i.e., the proportion of the L1 loss in the combined loss function. λ = 0 means using an SGD-M optimizer to fine-tune a DenseNet-161 model on MSMT17 without the STD method.

The detailed results are shown in Table 6. From this table, we can see that by using the STD method, the performance of the DenseNet-161 model is improved significantly. Meanwhile, we can see that the score of DenseNet-161+PTL+STD increases as λ increases. However, when λ is larger than 0.8, the score no longer increases. We argue that this is because the cross-entropy loss in the combined loss function is essential.

6 Related Works

6.1 Transfer Learning Methods

Many transfer learning methods have been proposed recently. Zhong et al. [24] proposed a domain adaptation approach which transfers images from one camera to the style of another camera. Fan et al. [2] proposed an unsupervised fine-tuning approach which used an IDE model trained on DukeMTMC-reID as a starting point and fine-tuned it on the target dataset. Different from these approaches, our method is based on model fine-tuning, which is more flexible and easier to conduct.

As for optimization methods used in transfer learning, training a meta-learner to learn how to update the parameters of the backbone model has attracted a lot of attention recently [5, 3]. In these approaches, parameters are updated using a learned update algorithm. For instance, Finn et al. [3] proposed the meta-learning method MAML for fast adaptation of network parameters. Our proposal is distinct from these approaches in several aspects. First, the goal of these meta-learning works is to find a better parameter optimization route which can efficiently optimize the model parameters. In contrast, the PTL network is designed to mitigate the distribution difference between each mini-batch and the whole dataset. Meanwhile, the BConv-Cells can directly participate in feature extraction.

6.2 Person Re-identification Networks

With the prosperity of deep learning, using deep networks as feature extractors has become common practice in person ReID tasks. Many deep learning based person ReID methods [15, 20, 10] have been proposed. As for transfer learning based deep person ReID methods, Geng et al. [4] proposed a deep transfer learning model to address the data sparsity problem.

6.3 Knowledge Distillation Methods

Our proposed STD method is a special case of knowledge distillation [7]. More generally, it can be seen as a special case of learning with privileged information. Using distillation for model compression is mentioned by Hinton et al. [7]. Wu et al. [18] used the distillation method to improve the accuracy of binary neural networks on ImageNet.

7 Conclusion

In this paper, we propose the Batch-related Convolutional Cell (BConv-Cell) to mitigate the bias of each mini-batch caused by internal variations. The BConv-Cells can progressively collect the global information of the dataset during training while participating in the feature extraction. This global information is used to mitigate the bias of each mini-batch in subsequent iterations. Based on the BConv-Cells, we propose the Progressive Transfer Learning (PTL) method to fine-tune a pre-trained model on the target dataset. Extensive experiments show that our method can improve the performance of the backbone network significantly and achieves state-of-the-art performance on four datasets: Market-1501, MSMT17, CUHK03, and DukeMTMC-reID.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61751307), in part by the National Youth Top-notch Talent Support Program, and in part by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies.

References

  • [1] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 994–1003.
  • [2] H. Fan, L. Zheng, C. Yan, and Y. Yang (2018) Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 83.
  • [3] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135.
  • [4] M. Geng, Y. Wang, T. Xiang, and Y. Tian (2016) Deep transfer learning for person re-identification. arXiv preprint arXiv:1611.05244.
  • [5] D. Ha, A. M. Dai, and Q. V. Le (2016) HyperNetworks. CoRR abs/1609.09106.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [7] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR.
  • [9] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: generalization gap and sharp minima. CoRR abs/1609.04836.
  • [10] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 152–159.
  • [11] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In CVPR.
  • [12] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 99–108.
  • [13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [14] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 480–496.
  • [15] R. R. Varior, M. Haloi, and G. Wang (2016) Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision, pp. 791–808.
  • [16] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference, pp. 274–282.
  • [17] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [18] X. Wu (2016) High performance binarized neural networks trained on the ImageNet classification task. CoRR abs/1604.03058.
  • [19] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810.
  • [20] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun (2017) AlignedReID: surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184.
  • [21] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328.
  • [22] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124.
  • [23] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision.
  • [24] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang (2018) Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166.