Person re-identification (ReID) aims at matching persons of the same identity across multiple camera views. Recent works in ReID mainly focus on three settings, i.e., fully-supervised [zhang2020relation, zheng2019joint, zhou2019omni], fully-unsupervised [lin2019bottom, lin2020unsupervised, wang2020unsupervised] and unsupervised domain adaptive [fu2019self, zhai2020multiple, zhong2019invariance] ReID. Despite their good performance on a seen domain (i.e., a domain with training data), most of them suffer from a drastic performance decline on unseen domains. In real-world applications, ReID systems will inevitably search for persons in new scenes. Therefore, it is necessary to learn a model that generalizes well to unseen domains.
To meet this goal, domain generalization (DG) is a promising solution that aims to learn generalizable models with one or several labeled source domains. As shown in Fig. 1, compared to other settings, DG does not require access to target domains. Generally, DG can be divided into two categories, single-source DG [jin2020style, Liao2020QAConv, zhou2019learning] and multi-source DG [kumar2019fairest, song2019generalizable], according to the number of source domains. Recent works mainly focus on single-source DG, where only one labeled source domain is available. However, a single domain provides limited training samples and scene information, restricting the improvement of single-source DG methods. In contrast, multi-source DG utilizes multiple datasets of different distributions, providing more training data that contain numerous variations and environmental factors. However, due to the strong fitting capability of deep networks, directly aggregating all source domains might lead the model to overfit to the domain bias, hampering its generalization ability. Although we can sample balanced training data from all source domains during training to reduce the impact of domain bias, this issue still remains.
In this paper, we study multi-source DG and aim to enforce the model to learn discriminative features without domain bias, so that the model can generalize to unseen domains. To achieve this goal, this paper introduces a meta-learning strategy for multi-source DG, which simulates the train-test process of DG during model optimization. In our method, we dynamically divide the source domains into meta-train and meta-test sets at each iteration. The meta-train set is regarded as source data, and the meta-test set is regarded as “unseen” data. During training, we encourage the loss of meta-train samples to optimize the model towards a direction that can simultaneously improve the accuracy of meta-test samples. Nevertheless, meta-learning causes a problem for the traditional parametric identification loss: unstable optimization. On the one hand, ReID datasets contain numerous IDs, so the number of classifier parameters will surge when multiple domains are used for training. On the other hand, the unified optimization of classifiers is unstable due to the asynchronous update by the high-order gradients of the meta-test. Consequently, we propose a memory-based identification loss, which uses a non-parametric memory to take full advantage of meta-learning while avoiding unstable optimization. We also introduce a meta batch normalization layer (MetaBN), which mixes meta-train knowledge with meta-test features to simulate the feature variations in different domains. Our full method is called Memory-based Multi-Source Meta-Learning (M³L). Experiments on four large-scale ReID datasets demonstrate the effectiveness of M³L when testing on unseen domains and show that it achieves state-of-the-art results.
Our contributions are summarized as follows:
We propose a Multi-Source Meta-Learning framework for multi-source DG, which can simulate the train-test process of DG during training. Our method enables the model to learn domain-invariant representations and thus improves the generalization ability.
We equip our framework with a memory-based module, which implements the identification loss in a non-parametric way and prevents the unstable optimization that a traditional parametric classifier causes during meta-optimization.
We present MetaBN to generate diverse meta-test features, which can be directly injected into our meta-learning framework and obtain further improvement.
2 Related Work
Recently, supervised learning approaches [chen2019abd, suh2018part, tay2019aanet, wang2018learning, zhang2020relation, zheng2019joint, zhou2019omni] have achieved significant performance in person re-identification (ReID), relying on labeled training data. Considering the difficulty and complexity of annotation, unsupervised learning (USL) [fan2018unsupervised, lin2019bottom, lin2020unsupervised, wang2020unsupervised] and unsupervised domain adaptation (UDA) [chen2019instance, fu2019self, zhai2020ad, zhai2020multiple, zhong2018generalizing, zhong2019invariance, zou2020joint] methods have been proposed. UDA aims to utilize labeled source data and unlabeled target data to improve the model performance on the target domain. UDA methods mainly focus on generating pseudo-labels on target data [fu2019self, zhai2020multiple, zhong2019invariance] or transferring source images to the styles of the target domain to provide extra supervision during adaptation [chen2019instance, zhong2018generalizing, zou2020joint]. USL approaches learn discriminative features only from unlabeled target data; the mainstream approach [fan2018unsupervised, lin2019bottom] is to train models with pseudo-labels obtained by clustering.
Domain Generalization. Although USL and UDA ReID methods show good performance, they still need to collect a large amount of target data for training models. In contrast, domain generalization (DG) has no access to any target domain data. With careful design, DG methods [2020EccvDMG, jin2020style, Li2018MLDG] can improve the model performance on unseen domains. Most existing DG methods focus on closed-set tasks [2020EccvDMG, khosla2012undoing, Li2018MLDG, muandet2013domain, qiao2020learning], assuming that the target data have the same label space as the source data. Lately, several works [jin2020style, Liao2020QAConv, song2019generalizable, zhou2019learning] were introduced to learn generalizable models for person ReID. SNR [jin2020style] disentangles identity-relevant and identity-irrelevant features and reconstructs more generalizable features. Liao et al. [Liao2020QAConv] propose QAConv for calculating the similarity between samples, which can effectively improve ReID accuracy on unseen data but is inefficient during testing. DIMN [song2019generalizable] proposes a mapping subnet to match persons within a mini-batch and trains the model with data from one domain at each iteration. Song et al. [song2019generalizable] claim that DIMN uses meta-learning in the training stage. However, DIMN optimizes the model with the common training strategy, which is completely different from our meta-learning strategy.
Meta Learning. The concept of meta-learning [thrun1998learning] is learning to learn, and was initially proposed in the machine learning community. Recently, meta-learning has been applied to various deep-learning applications, including model optimization [andrychowicz2016learning, li2016learning], few-shot learning [finn2017model, snell2017prototypical, sun2019meta, vinyals2016matching] and domain generalization [balaji2018metareg, guo2020learning, Li2018MLDG, Li_2019_ICCV, li2019feature]. MAML [finn2017model] and its variant Reptile [nichol2018first] are proposed to learn a good initialization for quickly adapting a model to a new task. Li et al. [Li2018MLDG] first extend MAML [finn2017model] to closed-set DG. Later, meta-learning was applied to closed-set DG [balaji2018metareg, Li_2019_ICCV, li2019feature] and open-set DG [guo2020learning]. In this paper, we propose a memory-based meta-learning approach, which is tailor-made for multi-source DG in ReID.
For multi-source domain generalization (DG) in person ReID, we are provided with $N_S$ source domains $\mathcal{D}_S = \{\mathcal{D}_S^1, \dots, \mathcal{D}_S^{N_S}\}$ in the training stage. The label spaces of the source domains are disjoint. The goal is to train a generalizable model with the source data. In the testing stage, the model is evaluated directly on a given unseen domain $\mathcal{D}_T$.
This paper designs a Memory-based Multi-source Meta-Learning (M³L) framework for multi-source domain generalization (DG) in the person ReID task. In our framework, we introduce a meta-learning strategy, which simulates the train-test process of DG during model optimization. Specifically, we dynamically split the source domains into meta-train and meta-test at each iteration. During training, we first copy the original model and update it with the loss from meta-train data. Then we use the updated model to compute the meta-test loss. The memory-based identification loss and triplet loss are adopted for effective meta-learning. We also inject a meta batch normalization layer (MetaBN) into the network, which diversifies the meta-test features with meta-train distributions to further strengthen the effect of meta-learning. Finally, the combination of the meta-train and meta-test losses is used to update the original model towards a generalizable direction that performs well on both meta-train and meta-test domains.
3.2 Meta-Learning for Multi-Source DG
We adopt the concept of “learning to learn” to simulate the train-test process of domain generalization during model optimization. At each training iteration, we randomly divide the source domains, $N_S$ in total, into $N_S - 1$ meta-train domains and one remaining meta-test domain. The process of computing the meta-learning loss includes the meta-train and the meta-test stages.
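The random split can be illustrated with a minimal sketch. The function name `split_meta_domains` and the domain labels are hypothetical, and the sketch only shows the hold-one-out sampling, not the authors' data pipeline.

```python
import random

def split_meta_domains(domains, rng):
    """Hold out one randomly chosen source domain as meta-test; the rest are meta-train."""
    if len(domains) < 2:
        raise ValueError("need at least two source domains")
    held_out = rng.randrange(len(domains))
    meta_train = [d for i, d in enumerate(domains) if i != held_out]
    return meta_train, domains[held_out]

rng = random.Random(0)      # fixed seed only to make the sketch reproducible
sources = ["D", "C", "MS"]  # hypothetical labels for three source domains
for iteration in range(3):
    meta_train, meta_test = split_meta_domains(sources, rng)
    print(iteration, meta_train, meta_test)
```

Because the split is redrawn at every iteration, each source domain plays the role of the “unseen” domain many times over the course of training.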
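In the meta-train stage, we calculate the meta-train loss $\mathcal{L}_{mtr}$ on the meta-train samples to optimize the model. In the meta-test stage, the optimized model is used to calculate the meta-test loss $\mathcal{L}_{mte}$ with the meta-test samples. Finally, the network is optimized by the combination of the meta-train and meta-test losses, i.e.,
$$\arg\min_{\Theta} \; \mathcal{L}_{mtr}(\Theta) + \mathcal{L}_{mte}(\Theta'),$$
where $\Theta$ denotes the parameters of the network, and $\Theta'$ denotes the parameters of the model optimized by $\mathcal{L}_{mtr}$. Note that $\mathcal{L}_{mte}$ is only used to update $\Theta$; since $\Theta'$ is itself a function of $\Theta$, the derivative of $\mathcal{L}_{mte}(\Theta')$ with respect to $\Theta$ involves high-order gradients.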
Remark. In the proposed meta-learning objective, the meta-test loss encourages the loss of meta-train samples to optimize the model towards a direction that can improve the accuracy of meta-test samples. By iteratively enforcing the generalization process from meta-train domains to meta-test domain, the model can avoid overfitting to domain bias and can learn domain-invariant representations that generalize well on unseen domains.
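To make the role of the high-order gradient concrete, the following toy sketch differentiates a meta-objective of this form for a single scalar parameter. The quadratic losses, the step sizes and all names are assumptions chosen for illustration; they are not the losses used in the paper.

```python
def inner_step(theta, a, alpha):
    """One gradient step on the toy meta-train loss L_mtr(t) = (t - a)^2."""
    return theta - alpha * 2.0 * (theta - a)

def objective(theta, a, b, alpha):
    """L_mtr(theta) + L_mte(theta'), with the toy meta-test loss L_mte(t) = (t - b)^2."""
    theta_prime = inner_step(theta, a, alpha)
    return (theta - a) ** 2 + (theta_prime - b) ** 2

def meta_gradient(theta, a, b, alpha):
    """Exact gradient of the objective with respect to theta (chain rule through the inner step)."""
    theta_prime = inner_step(theta, a, alpha)
    grad_mtr = 2.0 * (theta - a)
    grad_mte = 2.0 * (theta_prime - b)   # dL_mte / dtheta'
    dprime_dtheta = 1.0 - 2.0 * alpha    # dtheta' / dtheta, the second-order factor
    return grad_mtr + grad_mte * dprime_dtheta

a, b, alpha, outer_lr = 1.0, -2.0, 0.1, 0.1  # hypothetical toy values
theta = 0.3
for _ in range(200):
    theta -= outer_lr * meta_gradient(theta, a, b, alpha)
print(theta, meta_gradient(theta, a, b, alpha))
```

The factor `dprime_dtheta` is what distinguishes this update from simply summing two independent gradients: the meta-test loss reaches the original parameter only through the inner update.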
3.3 Memory-based Identification Loss
Identification loss can effectively learn discriminative person representations in a classification manner. Commonly, a fully-connected (FC) layer is adopted as the classifier to produce the probabilities that are used for computing the cross-entropy loss. Although existing works [balaji2018metareg, han2018coteaching, sun2019meta] show the effectiveness of meta-learning in the classification task, the parametric classifier is inadequate in the context of ReID. This is because ReID is an open-set task, where different domains contain completely different identities and the number of identities in each domain is commonly large. In multi-source DG of ReID, there are two choices of parametric classifier, one global FC classifier or parallel FC classifiers with one per domain, both of which lead to problems during meta-learning.
For the global FC classifier (Fig. 3(a)), the dimension of the FC layer is the sum of all source identities. Different from closed-set tasks [balaji2018metareg, 2020EccvDMG], the global FC classifier contains a large number of parameters when trained with multiple person ReID datasets. This will lead to unstable optimization during the meta-learning. As for parallel FC classifiers in Fig. 3(b), although we can alleviate the parameter burden by only identifying persons within their own domain classifier, the number of parameters for all classifiers is still large. Moreover, during the meta-learning, the classifier of the meta-test domain is only updated by high-order gradients, which is asynchronous with the feature encoder. This optimization process is unequal and unstable, leading to an incomplete usage of meta-learning.
Taking all the above into consideration, inspired by [memory], we propose a memory-based identification loss for multi-source DG, which is non-parametric and suitable for both meta-learning and person ReID. As shown in Fig. 3(c), we maintain a feature memory for each domain, which contains the centroid of each identity. The similarities between features and memory centroids are used to compute the identification loss. The memory-based identification loss has two advantages for our meta-learning framework. First, the memory is a non-parametric classifier, which avoids the unstable optimization caused by a large number of parameters. Second, the asynchronous update between the feature encoder and the memory has only a slight influence on model training. This is because the memory is updated smoothly with momentum instead of being updated by an optimizer. Thus, the memory is insensitive to the changes of the feature encoder caused by the last few training iterations. In Sec. 4.4, we show that our meta-learning framework gains more improvement with the memory-based identification loss than with the FC-based identification loss. Next, we introduce the memory-based identification loss in detail.
Memory Initialization. We maintain an individual memory for each source domain. For a source domain $\mathcal{D}_S^i$ with $n_i$ identities, the memory $\mathcal{M}^i$ has $n_i$ slots, where each slot saves the feature centroid of the corresponding identity. In initialization, we use the model to extract features for all samples of $\mathcal{D}_S^i$. Then, we initialize the centroid of each identity with the average of the features of the corresponding identity. For simplicity, we omit the superscript of the domain index and introduce the memory updating and memory-based identification loss for one domain.
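Memory Updating. At each training iteration, we update the memory with the features in the current mini-batch. A centroid in the memory is updated through,
$$\mathcal{M}[k] \leftarrow m \cdot \mathcal{M}[k] + (1 - m) \cdot \frac{1}{|\mathcal{B}_k|} \sum_{x_i \in \mathcal{B}_k} f(x_i),$$
where $\mathcal{M}[k]$ is the centroid of the $k$-th identity, $f(x_i)$ is the feature of sample $x_i$, $\mathcal{B}_k$ denotes the samples belonging to the $k$-th identity and $|\mathcal{B}_k|$ denotes the number of samples for the $k$-th identity in the current mini-batch. $m \in [0, 1]$ controls the updating rate.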
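The initialization and momentum update of the memory can be sketched in plain Python as follows. The helper names `init_memory` and `update_memory` are hypothetical, features are plain lists of floats, and the sketch assumes the momentum weights the old centroid (so a small momentum moves the centroid quickly towards the batch mean).

```python
def init_memory(features, labels, num_ids):
    """One centroid per identity: the mean of that identity's features."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_ids)]
    counts = [0] * num_ids
    for feat, label in zip(features, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += feat[d]
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

def update_memory(memory, batch_features, batch_labels, m):
    """Momentum update: M[k] <- m * M[k] + (1 - m) * mean of the batch features of identity k."""
    by_id = {}
    for feat, label in zip(batch_features, batch_labels):
        by_id.setdefault(label, []).append(feat)
    for k, feats in by_id.items():
        dim = len(feats[0])
        mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        memory[k] = [m * old + (1.0 - m) * new for old, new in zip(memory[k], mean)]
    return memory

memory = init_memory([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1], num_ids=2)
memory = update_memory(memory, [[4.0, 0.0]], [0], m=0.2)
print(memory)
```

Identities that do not appear in the current mini-batch keep their centroids unchanged, which is why the memory drifts smoothly rather than following every encoder update.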
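Memory-based identification loss. Given an embedding feature $f(x_i)$ from the forward propagation, we calculate the similarities between $f(x_i)$ and each centroid in the memory. The memory-based identification loss aims to classify $f(x_i)$ into its own identity, which is calculated by:
$$\mathcal{L}_M = -\log \frac{\exp\left(\mathcal{M}[y_i]^\top f(x_i) / \tau\right)}{\sum_{k=1}^{n} \exp\left(\mathcal{M}[k]^\top f(x_i) / \tau\right)},$$
where $y_i$ is the identity of $x_i$, $n$ is the number of identities in the memory, and $\tau$ is the temperature factor that controls the scale of the distribution.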
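A minimal sketch of such a temperature-scaled, non-parametric classification loss is given below. It assumes inner-product similarity and a softmax over all centroids of one domain; the function name `memory_id_loss` and the toy values are hypothetical.

```python
import math

def memory_id_loss(feature, memory, label, tau):
    """Cross-entropy over temperature-scaled similarities between a feature and all centroids."""
    logits = [sum(c * f for c, f in zip(centroid, feature)) / tau for centroid in memory]
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[label]

memory = [[1.0, 0.0], [0.0, 1.0]]  # two toy centroids
feature = [1.0, 0.0]               # aligned with the first centroid
loss_correct = memory_id_loss(feature, memory, label=0, tau=0.05)
loss_wrong = memory_id_loss(feature, memory, label=1, tau=0.05)
print(loss_correct, loss_wrong)
```

A small temperature sharpens the distribution, so a feature that is aligned with its own centroid incurs almost no loss, while a mismatch is penalized heavily.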
Triplet loss. We also use triplet loss [triplet] to train the model, which is formulated as,
$$\mathcal{L}_{Tri} = [d_p - d_n + \delta]_+,$$
where $d_p$ is the Euclidean distance between an anchor feature and a hard positive feature, and $d_n$ is the Euclidean distance between an anchor feature and a hard negative feature. $\delta$ is the margin of the triplet loss and $[\cdot]_+$ refers to $\max(\cdot, 0)$.
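The triplet loss can be sketched as follows, assuming the hard positive and hard negative are mined within the mini-batch (the farthest same-identity sample and the closest different-identity sample for each anchor). This mining rule and the function names are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(features, labels, margin):
    """Mean over anchors of max(d_p - d_n + margin, 0) with in-batch hard mining."""
    losses = []
    for i, (anchor, anchor_id) in enumerate(zip(features, labels)):
        pos = [euclidean(anchor, f) for j, (f, y) in enumerate(zip(features, labels))
               if y == anchor_id and j != i]
        neg = [euclidean(anchor, f) for f, y in zip(features, labels) if y != anchor_id]
        if not pos or not neg:
            continue  # this anchor has no valid triplet in the batch
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

features = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
labels = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(features, labels, margin=0.3)
print(loss)
```

In this toy batch every anchor has its hardest positive at distance 2 and its hardest negative at distance 1, so each anchor contributes the same positive loss.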
3.4 MetaBN
In our meta-learning strategy, the meta-test loss is important for learning generalizable representations, since the meta-test plays the role of the “unseen” domain. Intuitively, if the meta-test examples are sampled from more diverse distributions, the model will be optimized to be more robust to variations and thus be more generalizable to unseen domains. To achieve this goal, we introduce MetaBN to generate more diverse meta-test features at the feature-level. As shown in Fig. 2, we replace the last batch normalization layer (BN) [bn] in the network with MetaBN. During training, MetaBN utilizes the domain information from meta-train domains to inject domain-specific information into meta-test features. This process can diversify meta-test features, enabling the model to simulate more feature variations. The operation of MetaBN is illustrated in Fig. 4.
In the meta-train stage, for the $i$-th meta-train domain, MetaBN normalizes the meta-train features as the traditional BN does, and saves the mini-batch mean $\mu_i$ and mini-batch variance $\sigma_i^2$, which are used in the following meta-test stage.
In the meta-test stage, MetaBN uses the saved means and variances to form one Gaussian distribution per meta-train domain. Note that the generated distribution mainly reflects high-level domain information instead of specific identity information. This is because each saved mean and variance is calculated over samples belonging to dozens of identities. Considering this factor, we sample features from these distributions and inject these domain-specific features into meta-test features.
Specifically, for the $i$-th distribution, we sample one feature $z_j^i$ for each meta-test feature:
$$z_j^i \sim \mathcal{N}(\mu_i, \sigma_i^2),$$
where $\mathcal{N}$ denotes the Gaussian distribution. By doing so, we obtain $B$ (the batch size of meta-test features) sampled features, which are mixed with the original meta-test features to generate new features $F_T^i$,
$$F_T^i = \lambda F_T + (1 - \lambda) Z^i,$$
where $F_T$ denotes the original meta-test features and $Z^i = [z_1^i, \dots, z_B^i]$ denotes the $B$ features sampled from the $i$-th Gaussian distribution. $\lambda$ is the mixing coefficient, which is sampled from a Beta distribution.
Finally, the mixed features are normalized by batch normalization,
$$f_T^i = \gamma \, \frac{F_T^i - \mu_T^i}{\sqrt{(\sigma_T^i)^2 + \epsilon}} + \beta,$$
where $\mu_T^i$ and $(\sigma_T^i)^2$ denote the mini-batch mean and variance of $F_T^i$, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ denote the learnable parameters that scale and shift the normalized value.
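The three MetaBN steps in the meta-test stage (sampling, mixing, normalizing) can be sketched for one saved meta-train distribution as below. The function name `meta_bn_mix`, the Beta parameters `beta_a` and `beta_b`, and the fixed `scale` and `shift` values are assumptions; in a real layer the scale and shift would be learnable and the statistics would be tensors rather than Python lists.

```python
import math
import random

def meta_bn_mix(meta_test_feats, saved_mean, saved_var, rng,
                beta_a=1.0, beta_b=1.0, eps=1e-5, scale=1.0, shift=0.0):
    """Mix meta-test features with samples from a saved meta-train Gaussian, then batch-normalize."""
    batch, dim = len(meta_test_feats), len(meta_test_feats[0])
    lam = rng.betavariate(beta_a, beta_b)  # mixing coefficient (assumed Beta parameters)
    mixed = []
    for feat in meta_test_feats:
        # one sampled feature per meta-test feature, drawn channel-wise
        z = [rng.gauss(saved_mean[d], math.sqrt(saved_var[d])) for d in range(dim)]
        mixed.append([lam * feat[d] + (1.0 - lam) * z[d] for d in range(dim)])
    out = [[0.0] * dim for _ in range(batch)]
    for d in range(dim):
        column = [row[d] for row in mixed]
        mu = sum(column) / batch
        var = sum((v - mu) ** 2 for v in column) / batch
        for i in range(batch):
            out[i][d] = scale * (mixed[i][d] - mu) / math.sqrt(var + eps) + shift
    return out

rng = random.Random(0)
meta_test = [[0.1, 1.0], [0.5, -1.0], [0.9, 0.3], [0.2, 0.8]]
normalized = meta_bn_mix(meta_test, saved_mean=[0.0, 2.0], saved_var=[0.25, 1.0], rng=rng)
print(normalized)
```

Repeating the call once per saved meta-train distribution yields one set of mixed features per meta-train domain for every meta-test sample.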
3.5 Training Procedure of M³L
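During training, the $N_S$ source domains are separated into $N_S - 1$ meta-train domains and one meta-test domain at each iteration. The model is optimized by the losses calculated in the meta-train and meta-test stages.
Meta-train. For each meta-train domain, the meta-train loss combines the triplet loss and the memory-based identification loss,
$$\mathcal{L}_{mtr}^i = \mathcal{L}_{Tri}(X_S^i; \Theta) + \mathcal{L}_M(X_S^i, \mathcal{M}_S^i; \Theta),$$
where $\Theta$ denotes the parameters of the network. $X_S^i$ and $\mathcal{M}_S^i$ denote the training samples and memory of the $i$-th meta-train domain, respectively.
The total loss for meta-train is averaged over the $N_S - 1$ meta-train domains, formulated as,
$$\mathcal{L}_{mtr} = \frac{1}{N_S - 1} \sum_{i=1}^{N_S - 1} \mathcal{L}_{mtr}^i.$$
Meta-test. In the meta-test stage, the meta-test domain is evaluated with the new parameters $\Theta'$, which are obtained by optimizing $\Theta$ with $\mathcal{L}_{mtr}$. With the MetaBN proposed in Sec. 3.4, we can obtain $N_S - 1$ mixed features for each meta-test sample. The average memory-based identification loss over these features is considered as the meta-test memory-based identification loss. The meta-test loss is:
$$\mathcal{L}_{mte} = \mathcal{L}_{Tri}(X_T; \Theta') + \frac{1}{N_S - 1} \sum_{k=1}^{N_S - 1} \mathcal{L}_M(f_T^k, \mathcal{M}_T; \Theta'),$$
where $X_T$ denotes the meta-test samples, $\mathcal{M}_T$ denotes the memory of the meta-test domain, and $f_T^k$ denotes the $k$-th mixed features generated by the MetaBN.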
4.1 Benchmarks and Evaluation Metrics
We conduct experiments on four large-scale person re-identification benchmarks: Market-1501 [market1501], DukeMTMC-reID [dukemtmc2, dukemtmc], CUHK03 [cuhk03, rerank] and MSMT17 [msmt17]. To study multi-source DG, we divide these four datasets into two parts: three domains as source domains for training and the remaining one as the target domain for testing. For CUHK03, we adopt the detected subset of CUHK-NP [rerank] for evaluation. The statistics of these four benchmarks are shown in Table 1. As shown in Fig. 5, the distributions vary across domains: (1) MSMT17 is the largest dataset and contains images in a variety of situations; (2) DukeMTMC-reID and Market-1501 are closely related to MSMT17 and to each other; (3) CUHK03 has a relatively more distinct distribution compared with the other three. For simplicity, we denote Market-1501, DukeMTMC-reID, CUHK03, and MSMT17 as M, D, C, and MS in tables.
Table 1: Statistics of the person ReID benchmarks (columns: benchmark, # IDs, # images, # cameras).
The cumulative matching characteristic (CMC) at Rank-1 and mean average precision (mAP) are used to evaluate performance on the target testing set.
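For reference, Rank-1 and mAP can be computed from a query-gallery distance matrix as in the following simplified sketch. It omits the camera-based filtering of gallery entries that standard ReID protocols apply, and the function name `rank1_and_map` and the toy distances are hypothetical.

```python
def rank1_and_map(dist, query_ids, gallery_ids):
    """dist[q][g] is the distance between query q and gallery item g (smaller is more similar)."""
    rank1_hits, average_precisions = 0, []
    for q, row in enumerate(dist):
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue  # a query without any true match is skipped
        rank1_hits += int(matches[0])
        hits, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    valid = len(average_precisions)
    if valid == 0:
        return 0.0, 0.0
    return rank1_hits / valid, sum(average_precisions) / valid

dist = [[0.1, 0.5, 0.3],
        [0.2, 0.4, 0.9]]
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
print(rank1, mean_ap)
```

In the toy example the first query retrieves both of its matches first, while the second query finds its only match at rank 2, so Rank-1 is 0.5 and mAP is 0.75.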
4.2 Implementation Details
We implement our method with two common backbones, i.e., ResNet-50 [resnet] and IBN-Net50 [ibn]. Images are resized to 256×128 and the training batch size is set to 64. We use random flipping and random cropping for data augmentation. For the memory, the momentum coefficient is set to 0.2 and the temperature factor is set to 0.05. The margin of the triplet loss is 0.3. To optimize the model, we use the Adam optimizer with a weight decay of 0.0005. The learning rates of the inner loop and the outer loop increase linearly during the first 10 epochs and are then decayed by a factor of 0.1 at the 30th and 50th epochs. The total training stage takes 60 epochs.
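The schedule described above (linear warm-up for 10 epochs, then step decay at epochs 30 and 50) can be sketched as follows. The start and peak learning rates in the example are placeholder values, not the values used in the experiments, and epochs are assumed to be counted from 0.

```python
def learning_rate(epoch, base_lr, warmup_start_lr,
                  warmup_epochs=10, milestones=(30, 50), decay=0.1):
    """Linear warm-up to base_lr, followed by multiplicative decay at each milestone."""
    if epoch < warmup_epochs:
        fraction = epoch / warmup_epochs
        return warmup_start_lr + fraction * (base_lr - warmup_start_lr)
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= decay
    return lr

# Placeholder values chosen only for illustration.
for epoch in (0, 5, 10, 29, 30, 50, 59):
    print(epoch, learning_rate(epoch, base_lr=1e-3, warmup_start_lr=1e-4))
```

The same function can serve both the inner-loop and outer-loop learning rates, since the text describes an identical schedule for the two.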
Baseline. For the baseline, we directly train the model with the memory-based identification loss and triplet loss using the data of all the source domains. That is, the baseline does not apply the meta-learning strategy and MetaBN.
4.3 Comparison with State-of-the-Art methods
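Table 2 (rows for the reimplemented QAConv): each row lists the source domains, their total numbers of IDs and images, and the results on the held-out target domain.
|Method||Sources||# IDs||# images||Target||mAP||Rank-1|
|QAConv [Liao2020QAConv] *||MS+D+C||3,110||75,406||M||35.6||65.7|
|QAConv [Liao2020QAConv] *||MS+M+C||3,159||71,820||D||47.1||66.1|
|QAConv [Liao2020QAConv] *||MS+D+M||2,494||62,079||C||21.0||23.5|
|QAConv [Liao2020QAConv] *||D+M+C||2,820||55,748||MS||7.5||24.3|
* We reimplement this work based on the authors’ code on GitHub with the same source datasets as ours.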
Since there is no multi-source DG method evaluated on large-scale datasets, we compare our method with state-of-the-art single-source DG methods, including OSNet-IBN [zhou2019omni], OSNet-AIN [zhou2019learning], SNR [jin2020style] and QAConv [Liao2020QAConv]. SNR [jin2020style] and QAConv [Liao2020QAConv] use ResNet-50 as the backbone. OSNet-IBN [zhou2019omni] and OSNet-AIN [zhou2019learning] use their self-designed networks, which have better performance than ResNet-50. When testing on Market-1501, DukeMTMC-reID, and CUHK03, the existing single-source DG methods utilize MSMT17 as the source domain for model training. They combine the train set and test set of MSMT17, which is denoted as Combined MSMT17 (Com-MS) in this paper. To verify that the effectiveness of our method comes from multi-source meta-learning instead of training with more IDs and images, we only use the training sets of the source domains for model training. For example, when using Market-1501 as the target domain, we train the model with the train sets of DukeMTMC-reID, CUHK03, and MSMT17, including 3,110 IDs and 75,406 images. The numbers of IDs and images are smaller than those of Combined MSMT17 (3,110 IDs vs. 4,101 IDs, and 75,406 images vs. 126,441 images). To conduct a fair comparison, we also reimplement the recently published QAConv [Liao2020QAConv] with the same training data as ours, based on the source code (https://github.com/ShengcaiLiao/QAConv). Comparison results are reported in Table 2.
Results on Market-1501 and DukeMTMC-reID. From Table 2, we can make the following observations. First, when using Combined MSMT17 as the source data, OSNet-AIN [zhou2019learning] and QAConv [Liao2020QAConv] achieve the best results on both Market-1501 and DukeMTMC-reID. Second, compared to single-source DG methods that use more training data (Combined MSMT17), our M³L outperforms them by a large margin on Market-1501 and achieves comparable results with them on DukeMTMC-reID. Specifically, when testing on Market-1501, with the same backbone, our M³L surpasses SNR [jin2020style] by 6.7% in mAP and 4.4% in Rank-1 accuracy. Third, when training with multiple source domains, with the same backbone, our M³L produces significantly higher results than QAConv. Specifically, our M³L is higher than QAConv by 12.5% in mAP for Market-1501 and by 3.4% in mAP for DukeMTMC-reID. This demonstrates the superiority of our method over the method that considers all the source domains as one domain. Fourth, when using IBN-Net50 as the backbone, our M³L can achieve better mAP than when using ResNet-50.
Results on CUHK03 and MSMT17. There is only one method (QAConv [Liao2020QAConv]) evaluated on CUHK03 and MSMT17. When testing on MSMT17, QAConv [Liao2020QAConv] uses DukeMTMC-reID as the source data. Clearly, our M³L achieves higher results than QAConv [Liao2020QAConv] on both datasets, no matter how many source domains QAConv is trained with. We also find that both our M³L and QAConv produce poor results on CUHK03 and MSMT17, indicating that there is still much room for improving generalizable models in DG.
4.4 Ablation Studies
Effectiveness of Meta-Learning. To investigate the effectiveness of the proposed meta-learning strategy, we conduct ablation studies in Table 3. Clearly, the model trained with the proposed meta-learning strategy consistently improves the results with different backbones. Specifically, with ResNet-50, adding meta-learning optimization increases the baseline by 5.3% in Rank-1 accuracy on Market-1501 and by 3.7% in Rank-1 accuracy on CUHK03. With IBN-Net50 backbone, meta-learning strategy gains 5.4% and 2.8% improvement in mAP on Market-1501 and CUHK03, respectively. This demonstrates that by simulating the train-test process during training, the meta-learning strategy helps the model to learn domain-invariant representations that can perform well on unseen domains.
Effectiveness of MetaBN. As shown in Table 3, plugging MetaBN into the meta-learning-based model further improves the generalization ability. For ResNet-50 backbone, MetaBN improves the meta-optimized model by 1.3% and 1.6% in Rank-1 accuracy on Market-1501 and CUHK03. For IBN-Net50, we can observe similar improvements. The results validate that diversifying meta-test features by MetaBN is able to help the model to learn more generalizable representations for unseen domains.
Loss function components. We conduct experiments to evaluate the impact of the memory-based identification loss and triplet loss. Results in Table 4 show that the memory-based identification loss is the predominant supervision for training a generalizable model and additionally adding the triplet loss can slightly improve the performance.
Comparison of different classifiers. In Table 5, we compare different types of identification classifiers. We have the following observations. First, compared with the two parametric classifiers, our proposed non-parametric classifier gains higher improvement with the meta-learning strategy. Second, when directly training with multi-source data without the meta-learning, the model trained with memory-based identification loss achieves higher results. These two observations demonstrate that the proposed memory-based identification loss is suitable for multi-source DG and our meta-learning strategy.
|✓||39.7 (2.0)||68.3 (1.3)||21.2 (0.0)||21.9 (1.0)|
|✓||40.9 (3.2)||69.3 (2.3)||23.9 (2.7)||24.3 (3.4)|
|✓||48.1 (7.0)||74.5 (6.6)||29.9 (4.2)||30.7 (5.3)|
Effectiveness of Multi-Source. Table 6 shows the comparison between two-source DG and multi-source DG. Despite bringing more domain bias, training with more source domains consistently produces higher results when testing on an unseen domain. This demonstrates the significance of studying multi-source DG.
To better understand the advantage of the meta-learning strategy, we visualize the distributions of the inference features of the baseline and M³L in Fig. 6. Both the baseline and M³L are trained with DukeMTMC-reID, CUHK03, and MSMT17, and the inference features are extracted from 7 persons in the Market-1501 testing set. We use t-SNE [tsne] to reduce the features into a 2-D space. Different colors denote different identities. As shown in Fig. 6, compared with the baseline, our M³L makes the features of the same identity more compact and the features of different identities more separable. This suggests that the proposed M³L leads the model to learn more generalizable representations that can perform well on unseen domains.
In this paper, we propose a Memory-based Multi-source Meta-Learning (M³L) framework for multi-source domain generalization (DG) in person ReID. The proposed meta-learning strategy enables the model to simulate the train-test process of DG during training, which can efficiently improve the generalization ability of the model on unseen domains. Besides, we introduce a memory-based module and MetaBN to take full advantage of meta-learning and obtain further improvement. Extensive experiments demonstrate the effectiveness of our framework for training a generalizable ReID model. Our method achieves state-of-the-art generalization results on four large-scale benchmarks.