Multiple Expert Brainstorming for Domain Adaptive Person Re-identification

07/03/2020 ∙ Yunpeng Zhai et al. ∙ Xiamen University, Nanyang Technological University, Peking University

Often the best performing deep neural models are ensembles of multiple base-level networks; nevertheless, ensemble learning for domain adaptive person re-ID remains unexplored. In this paper, we propose a multiple expert brainstorming network (MEB-Net) for domain adaptive person re-ID, opening up a promising direction for model ensembling under unsupervised conditions. MEB-Net adopts a mutual learning strategy, where multiple networks with different architectures are pre-trained within a source domain as expert models equipped with specific features and knowledge, and the adaptation is then accomplished through brainstorming (mutual learning) among the expert models. MEB-Net accommodates the heterogeneity of experts learned with different architectures and enhances the discrimination capability of the adapted re-ID model by introducing a regularization scheme over the authority of experts. Extensive experiments on large-scale datasets (Market-1501 and DukeMTMC-reID) demonstrate the superior performance of MEB-Net over the state-of-the-art.


1 Introduction

Person re-identification (re-ID) aims to match persons in an image gallery collected from non-overlapping camera networks. It has attracted increasing interest from the computer vision community thanks to its wide applications in security and surveillance. Though supervised re-ID methods have achieved decent results, they often suffer catastrophic performance drops when applied to new domains. Domain adaptive person re-ID that generalizes well across domains remains an open research challenge.

Unsupervised domain adaptation (UDA) in re-ID has been studied extensively in recent years. Most existing works on UDA can be roughly categorized into three classes. The first class attempts to align feature distributions between source and target domains [32], aiming to minimize the inter-domain gap for optimal adaptation. The second class addresses the domain gap by employing generative adversarial networks (GANs) as style transformers that convert sample images from the source domain to the target domain while preserving the person identity information as much as possible [19], [5], [33], [21]. To leverage the sample distribution in target domains, the third class adopts self-supervised learning and employs clustering to iteratively predict pseudo-labels of target-domain samples for fine-tuning re-ID models [7], [34], [27], [8].

On the other hand, the best performance of deep learning is often achieved by ensemble models, which integrate multiple sub-networks as well as their discrimination capabilities. However, ensemble learning with respect to domain adaptive re-ID remains unexplored. How to leverage the specific features and knowledge of multiple networks and optimally adapt them to an unlabelled target domain remains to be elaborated.

In this paper, we present a multiple expert brainstorming network (MEB-Net), which learns and adapts multiple networks with different architectures for optimal re-ID in an unlabelled target domain. MEB-Net conducts iterative training in which pseudo-label clustering and model feature learning are executed alternately. For feature learning, MEB-Net adopts a mutual learning strategy in which networks with different architectures are pre-trained in a source domain as expert models equipped with specific features and knowledge. The adaptation is accomplished through brainstorming-based mutual learning among the multiple expert models. To accommodate the heterogeneity of experts learned with different architectures, a regularization scheme is introduced to modulate the experts' authority according to their feature distributions in the target domain, which further enhances the discrimination capability of the re-ID model.

The contributions of this paper are summarized as follows.

  • We propose a novel multiple expert brainstorming network (MEB-Net) based on mutual learning among expert models, each of which is equipped with the knowledge specific to its architecture.

  • We design an authority regularization to accommodate the heterogeneity of experts learned with different architectures, modulating the authority of experts and enhancing the discrimination capability of re-ID models.

  • Our MEB-Net approach achieves significant performance gain over the state-of-the-art on commonly used datasets: Market-1501 and DukeMTMC-reID.

2 Related Works

2.1 Unsupervised Domain Adaptive Re-ID

Unsupervised domain adaptation (UDA) for person re-ID defines a learning problem in which source domains are fully labeled while sample labels in target domains are entirely unknown. Methods have been extensively explored in recent years, following three typical approaches.

Feature distribution alignment. In [18], Lin et al. proposed minimizing the distribution variation of the source’s and the target’s mid-level features based on Maximum Mean Discrepancy (MMD) distance. Wang et al. [32] utilized additional attribute annotations to align feature distributions of source and target domains in a common space.

Image-style transformation. GAN-based methods have been extensively explored for domain adaptive person re-ID [21], [44], [33], [5], [19]. HHL [44] simultaneously enforced cameras invariance and domain connectedness to improve the generalization ability of models on the target set. PTGAN [33], SPGAN [5], ATNet [19] and PDA-Net [15] transferred images with identity labels from source into target domains to learn discriminative models.

Self-supervised learning. Recently, the problem of how to leverage the large number of unlabeled samples in target domains has attracted increasing attention [7], [36], [20], [34], [35], [45], [39]. Clustering [7], [41], [38] and graph matching [36] methods have been explored to predict pseudo-labels in target domains for discriminative model learning. Reciprocal search [20] and exemplar-invariance approaches [35] were proposed to refine pseudo-labels while taking camera invariance into account. SSG [8] utilized both global and local features of persons to build multiple clusters, which are then assigned pseudo-labels to supervise the model training.

However, existing works have barely explored domain adaptive person re-ID with model ensemble methods, which have achieved impressive performance on many other tasks.

2.2 Knowledge Transfer

Distilling knowledge from well-trained neural networks and transferring it to another model/network has been widely studied in recent years [11], [3], [16], [37], [2], [1]. The typical approach to knowledge transfer is teacher-student learning, which uses the soft output distribution of a teacher network to supervise a student network, so that student models learn discrimination ability from teacher models.

The mean-teacher model [30] averaged model weights at different training iterations to create supervision for unlabeled samples. Deep mutual learning [40] adopted a pool of student models trained with supervision from each other. Mutual mean teaching [9] designed a symmetrical framework with hard pseudo-labels as well as refined soft labels for unsupervised domain adaptive re-ID. However, existing teacher-student methods mostly adopt a symmetrical framework, which largely neglects the different confidences of teacher networks when they are heterogeneous.

2.3 Model Ensemble

There is a considerable body of previous work on ensembles of neural networks. One typical approach [28], [31], [13], [26] creates a series of networks with shared weights during training and then implicitly ensembles them at test time. Another approach [25] focuses on label refinement by well-trained networks for training a new model with higher discrimination capability. However, these methods cannot be directly used for unsupervised domain adaptive re-ID, where the training set and the testing set have non-overlapping label spaces.

Figure 1: Overview of proposed multiple expert brainstorming network (MEB-Net). Multiple expert networks with different architectures are first pre-trained in the source domain and then adapted to the target domain through brainstorming.

3 The Proposed Approach

Figure 2: Flowchart of the proposed expert brainstorming in MEB-Net, which consists of two components: feature extraction and mutual learning. In mutual learning, multiple expert networks collaboratively learn from each other through their predictions and the pseudo-labels, and improve themselves for the target domain in an unsupervised manner. More details are described in Sec. 3.4.

We study the problem of unsupervised domain adaptive re-ID using model ensemble methods from a source domain to a target domain. The labelled source-domain dataset is denoted as $\mathcal{D}_s = \{(x_{s,i}, y_{s,i})\}_{i=1}^{N_s}$, which has $N_s$ sample images with $M_s$ unique identities. $X_s$ and $Y_s$ denote the sample images and the person identities, where each sample $x_{s,i}$ in $X_s$ is associated with a person identity $y_{s,i}$ in $Y_s$. The $N_t$ sample images $X_t$ in the target-domain dataset $\mathcal{D}_t$ have no identities available. We aim to leverage the labelled sample images in $\mathcal{D}_s$ and the unlabelled sample images in $\mathcal{D}_t$ to learn a transferred re-ID model for the target domain.

3.1 Overview

MEB-Net adopts a two-stage training scheme: supervised learning in the source domain (Fig. 1a) and unsupervised adaptation in the target domain (Fig. 1b). In the initialization phase, multiple initial expert models with different network architectures are pre-trained on the source dataset in a supervised manner. Afterwards, the trained experts are adapted to the target domain by iteratively brainstorming with each other on the unlabelled samples of the target dataset. Specifically, in each epoch, a clustering algorithm is employed on target samples to predict pseudo-labels, which are then utilized to fine-tune the expert networks by mutual learning. In addition, authority regularization is employed to modulate the authority of expert networks according to their discrimination capability during training. In this way, the knowledge from multiple networks is fused, enhanced, and transferred to the target domain, as described in Algorithm 1.

3.2 Learning in Source Domains

The proposed MEB-Net aims to transfer the knowledge of multiple networks from a labelled source domain to an unlabelled target domain. For each architecture, a deep neural network (DNN) model parameterized with $\theta$ (a pre-trained expert) is first trained in a supervised manner. The network transforms each sample image $x_{s,i}$ into a feature representation $f(x_{s,i}|\theta)$, and outputs a predicted probability $p_j(x_{s,i}|\theta)$ of image $x_{s,i}$ belonging to the identity $j$. The cross entropy loss with label smoothing is defined as

$\mathcal{L}_{id}(\theta) = -\frac{1}{N_s}\sum_{i=1}^{N_s}\sum_{j=1}^{M_s} q_j \log p_j(x_{s,i}|\theta)$   (1)

where $q_j = 1-\varepsilon+\frac{\varepsilon}{M_s}$ if $j = y_{s,i}$, otherwise $q_j = \frac{\varepsilon}{M_s}$. $\varepsilon$ is a small constant, which is set as 0.1. The softmax triplet loss is also defined as

$\mathcal{L}_{tri}(\theta) = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log \frac{e^{\|f(x_{s,i}|\theta)-f(x_{s,i^-}|\theta)\|}}{e^{\|f(x_{s,i}|\theta)-f(x_{s,i^+}|\theta)\|} + e^{\|f(x_{s,i}|\theta)-f(x_{s,i^-}|\theta)\|}}$   (2)

where $x_{s,i^+}$ denotes the hardest positive sample of the anchor $x_{s,i}$, and $x_{s,i^-}$ denotes the hardest negative sample. $\|\cdot\|$ denotes the $L_2$ distance. The overall loss is therefore calculated as

$\mathcal{L}(\theta) = \mathcal{L}_{id}(\theta) + \mathcal{L}_{tri}(\theta)$   (3)

With $K$ network architectures, the supervised learning thus produces $K$ pre-trained re-ID models, each of which acts as an expert for brainstorming.
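As an illustration, the two losses admit a compact implementation. The following is a minimal PyTorch sketch under our own naming, assuming the hardest positive and negative features have already been mined within each mini-batch (the mining step itself is omitted):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.1):
    """Eq. 1: cross entropy with label smoothing. The true identity gets
    probability 1 - eps + eps/M; every other identity gets eps/M."""
    log_probs = F.log_softmax(logits, dim=1)
    n_classes = logits.size(1)
    q = torch.full_like(log_probs, eps / n_classes)
    q.scatter_(1, targets.unsqueeze(1), 1.0 - eps + eps / n_classes)
    return -(q * log_probs).sum(dim=1).mean()

def softmax_triplet_loss(anchor, hardest_pos, hardest_neg):
    """Eq. 2: -log(e^{d_neg} / (e^{d_pos} + e^{d_neg})) over L2 distances,
    which simplifies to softplus(d_pos - d_neg)."""
    d_pos = (anchor - hardest_pos).norm(p=2, dim=1)
    d_neg = (anchor - hardest_neg).norm(p=2, dim=1)
    return F.softplus(d_pos - d_neg).mean()

# Eq. 3: overall supervised loss for one expert.
# loss = smoothed_cross_entropy(logits, ids) + softmax_triplet_loss(f, f_p, f_n)
```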

Input: Source domain dataset $\mathcal{D}_s$, target domain dataset $\mathcal{D}_t$.
Input: $K$ network architectures $\{A_k\}_{k=1}^{K}$.
Output: Expert model parameters $\{\theta_k\}_{k=1}^{K}$.

1:  Initialize pre-trained weights $\theta_k$ of the model with each architecture $A_k$.
2:  for each epoch do
3:     Extract average features on $\mathcal{D}_t$: $f(x_{t,i}) = \frac{1}{K}\sum_{k=1}^{K} f(x_{t,i}|\Theta_k)$.
4:     Generate pseudo-labels $\tilde{y}_{t,i}$ of $\mathcal{D}_t$ by clustering samples using $f(x_{t,i})$.
5:     Evaluate the authority $w_k$ of each expert model by inter-/intra-cluster scatter.
6:     for each iteration $T$, mini-batch $\mathcal{B} \subset \mathcal{D}_t$ do
7:         Calculate soft labels from each temporally average model with $\Theta_k$: $P_j(x_{t,i}|\Theta_k)$, $f(x_{t,i}|\Theta_k)$, $x_{t,i} \in \mathcal{B}$.
8:         Calculate outputs of each current model with $\theta_k$: $p_j(x_{t,i}|\theta_k)$, $f(x_{t,i}|\theta_k)$, $x_{t,i} \in \mathcal{B}$.
9:         Update parameters $\{\theta_k\}$ by optimizing Eq. 14 with authorities $\{w_k\}$.
10:        Update temporally average model weights $\Theta_k$ following Eq. 4.
11:     end for
12:  end for
13:  Return expert model parameters $\{\theta_k\}_{k=1}^{K}$.
Algorithm 1 Multiple Expert Brainstorming Network

3.3 Clustering in the Target Domain

In the target domain, MEB-Net consists of a clustering-based pseudo-label generation procedure and a feature learning procedure, which are mutually enforced. Each epoch consists of three steps: (1) For sample images in the target domain, each expert model extracts convolutional features and determines the ensemble features by averaging features extracted by multiple expert models

; (2) A mini-batch k-means clustering is performed on

to classify all target-domain samples into

different clusters; (3) The produced cluster IDs are used as pseudo-labels for the training samples . The steps 3 and 4 in Algorithm 1 summarize this clustering process.
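A hedged sketch of this step, assuming each expert's features for the whole target set are available as an (N, D) array; the function name and the scikit-learn call are our choices, not prescribed by the paper:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def generate_pseudo_labels(expert_features, n_clusters=500):
    """Average the K experts' features into ensemble features (step 1),
    run mini-batch k-means on them (step 2), and return the cluster IDs
    as pseudo-labels (step 3). expert_features: list of K (N, D) arrays."""
    ensemble = np.mean(np.stack(expert_features, axis=0), axis=0)
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=1024)
    return km.fit_predict(ensemble)
```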

3.4 Expert Brainstorming

With multiple expert models of different architectures, each having absorbed rich knowledge from the source domain, MEB-Net organizes them to collaboratively learn from each other and improve themselves for the target domain in an unsupervised mutual learning manner (Fig. 2).

In each training iteration, the same batch of images in the target domain is first fed to all the $K$ expert models parameterized by $\{\theta_k\}$, to predict the classification confidences $p_j(x_{t,i}|\theta_k)$ and feature representations $f(x_{t,i}|\theta_k)$. To transfer knowledge from one expert to the others, the class predictions of each expert can serve as soft class labels for training the other experts. However, directly using the current predictions as soft labels to train each model decreases the independence of the expert models' outputs, which might result in error amplification. To avoid this, MEB-Net leverages the temporally average model of each expert, which preserves more original knowledge, to generate reliable soft pseudo-labels for supervising the other experts. The parameters of the temporally average model of expert $k$ at iteration $T$ are denoted as $\Theta_k^{(T)}$, which is updated as

$\Theta_k^{(T)} = \alpha\,\Theta_k^{(T-1)} + (1-\alpha)\,\theta_k^{(T)}$   (4)

where $\alpha$ is the scale factor, and the initial temporal average parameters are $\Theta_k^{(0)} = \theta_k^{(0)}$. Utilizing this temporal average model of expert $k$, the probability of each identity $j$ is predicted as $P_j(x_{t,i}|\Theta_k)$, and the feature representation is calculated as $f(x_{t,i}|\Theta_k)$.
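A minimal sketch of Eq. 4, assuming the temporally averaged model is stored as a separate copy of the expert whose parameters are updated in place after each iteration:

```python
import torch

@torch.no_grad()
def update_temporal_average(avg_model, model, alpha=0.999):
    """Eq. 4: Theta^(T) = alpha * Theta^(T-1) + (1 - alpha) * theta^(T).
    The averaged model is initialized as a copy of the expert,
    so that Theta^(0) = theta^(0)."""
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(alpha).add_(p, alpha=1.0 - alpha)
```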

Mutual identity loss. For each expert model $\theta_k$, the mutual identity loss learned from a certain expert $e$ is defined as the cross entropy between the class prediction of expert $k$ and that of the temporal average model of expert $e$:

$\mathcal{L}_{mid}^{k\leftarrow e} = -\frac{1}{N_t}\sum_{i=1}^{N_t}\sum_{j} P_j(x_{t,i}|\Theta_e)\,\log p_j(x_{t,i}|\theta_k)$   (5)

The mutual identity loss for expert $k$ is set as the average of the above losses learned from all other experts:

$\mathcal{L}_{mid}^{k} = \frac{1}{K-1}\sum_{e\neq k} \mathcal{L}_{mid}^{k\leftarrow e}$   (6)

Mutual triplet loss. For each expert model $\theta_k$, the mutual triplet loss learned from a certain expert $e$ is also defined as a binary cross entropy:

$\mathcal{L}_{mtri}^{k\leftarrow e} = -\frac{1}{N_t}\sum_{i=1}^{N_t}\Big[\mathcal{T}_i(\Theta_e)\log \mathcal{T}_i(\theta_k) + \big(1-\mathcal{T}_i(\Theta_e)\big)\log\big(1-\mathcal{T}_i(\theta_k)\big)\Big]$   (7)

where $\mathcal{T}_i(\theta)$ denotes the softmax of the feature distance between negative sample pairs:

$\mathcal{T}_i(\theta) = \frac{e^{\|f(x_{t,i}|\theta)-f(x_{t,i^-}|\theta)\|}}{e^{\|f(x_{t,i}|\theta)-f(x_{t,i^+}|\theta)\|} + e^{\|f(x_{t,i}|\theta)-f(x_{t,i^-}|\theta)\|}}$   (8)

where $x_{t,i^+}$ denotes the hardest positive sample of the anchor $x_{t,i}$ according to the pseudo-labels $\tilde{y}_{t,i}$, and $x_{t,i^-}$ denotes the hardest negative sample. $\|\cdot\|$ denotes the $L_2$ distance. The mutual triplet loss for expert $k$ is calculated as the average of the above triplet losses learned from all other experts:

$\mathcal{L}_{mtri}^{k} = \frac{1}{K-1}\sum_{e\neq k} \mathcal{L}_{mtri}^{k\leftarrow e}$   (9)
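Both mutual losses reduce to a few lines in PyTorch. A sketch under our own naming, where teacher outputs come from the temporally averaged models and are detached so gradients only flow into the current expert:

```python
import torch
import torch.nn.functional as F

def mutual_identity_loss(logits_k, teacher_logits):
    """Eqs. 5-6: soft cross entropy between expert k's class prediction and
    the soft labels of each other expert's temporally averaged model."""
    log_p = F.log_softmax(logits_k, dim=1)
    losses = [-(F.softmax(t.detach(), dim=1) * log_p).sum(dim=1).mean()
              for t in teacher_logits]
    return sum(losses) / len(losses)

def triplet_softmax(anchor, hardest_pos, hardest_neg):
    """Eq. 8: e^{d_neg} / (e^{d_pos} + e^{d_neg}) = sigmoid(d_neg - d_pos),
    clamped away from 0/1 for numerical stability."""
    d_pos = (anchor - hardest_pos).norm(p=2, dim=1)
    d_neg = (anchor - hardest_neg).norm(p=2, dim=1)
    return torch.sigmoid(d_neg - d_pos).clamp(1e-6, 1.0 - 1e-6)

def mutual_triplet_loss(t_k, teacher_ts):
    """Eqs. 7 and 9: binary cross entropy between expert k's softmax-triplet
    term and each teacher's, averaged over the K-1 other experts."""
    losses = [F.binary_cross_entropy(t_k, t.detach()) for t in teacher_ts]
    return sum(losses) / len(losses)
```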

Voting loss. In order to learn stable and discriminative knowledge from the pseudo-labels obtained by clustering, as described in Sec. 3.3, we introduce a voting loss that consists of an identity loss and a triplet loss. For each expert model $\theta_k$, the identity loss is defined as the cross entropy with label smoothing:

$\mathcal{L}_{id}^{k} = -\frac{1}{N_t}\sum_{i=1}^{N_t}\sum_{j=1}^{M_t} q_j \log p_j(x_{t,i}|\theta_k)$   (10)

where $q_j = 1-\varepsilon+\frac{\varepsilon}{M_t}$ if $j = \tilde{y}_{t,i}$, otherwise $q_j = \frac{\varepsilon}{M_t}$; $\varepsilon$ is a small constant. The softmax triplet loss is defined as:

$\mathcal{L}_{tri}^{k} = -\frac{1}{N_t}\sum_{i=1}^{N_t} \log \frac{e^{\|f(x_{t,i}|\theta_k)-f(x_{t,i^-}|\theta_k)\|}}{e^{\|f(x_{t,i}|\theta_k)-f(x_{t,i^+}|\theta_k)\|} + e^{\|f(x_{t,i}|\theta_k)-f(x_{t,i^-}|\theta_k)\|}}$   (11)

where $x_{t,i^+}$ denotes the hardest positive sample of the anchor $x_{t,i}$, $x_{t,i^-}$ denotes the hardest negative sample, and $\|\cdot\|$ denotes the $L_2$ distance. The voting loss is defined by summing the identity loss and the triplet loss:

$\mathcal{L}_{vot}^{k} = \mathcal{L}_{id}^{k} + \mathcal{L}_{tri}^{k}$   (12)

Overall loss. For each expert model $\theta_k$, the individual brainstorming loss is defined as the combination

$\mathcal{L}_{brain}^{k} = \mathcal{L}_{vot}^{k} + \lambda^{id}\,\mathcal{L}_{mid}^{k} + \lambda^{tri}\,\mathcal{L}_{mtri}^{k}$   (13)

where $\lambda^{id}$ and $\lambda^{tri}$ are balancing weights. The overall loss is defined as the sum of the individual brainstorming losses over all experts:

$\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_{brain}^{k}$   (14)
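For concreteness, the combination of Eqs. 13-14 in code; the lambda weights are our assumed balancing coefficients, not values specified here:

```python
def brainstorming_loss(l_vot, l_mid, l_mtri, lam_id=1.0, lam_tri=1.0):
    """Eq. 13: per-expert brainstorming loss (lambda values assumed)."""
    return l_vot + lam_id * l_mid + lam_tri * l_mtri

# Eq. 14: overall loss, summed over the K experts.
# total = sum(brainstorming_loss(v, m, t) for (v, m, t) in per_expert_losses)
```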

3.5 Authority Regularization

Figure 3: Illustration of our proposed authority regularization. It modulates the authority of different experts according to the inter-/intra-cluster scatter of each single expert. A larger scatter means better discrimination capability.

Expert networks with different architectures are equipped with different knowledge, and thus have different degrees of discrimination capability in the target domain. To accommodate the heterogeneity of experts, we propose an authority regularization (AR) scheme, which modulates the authority of different experts according to the inter-/intra-cluster scatter of each single expert (Fig. 3). Specifically, for each expert $k$ we extract sample features $f(x_{t,i}|\Theta_k)$ and cluster all the training samples in the target domain into $M_t$ groups $\{\mathcal{C}_j\}_{j=1}^{M_t}$. The intra-cluster scatter of the cluster $\mathcal{C}_j$ is defined as

$\mathcal{S}_{intra}^{j} = \sum_{x_{t,i}\in\mathcal{C}_j} \|f(x_{t,i}|\Theta_k) - c_j\|^2$   (15)

where $c_j$ is the average feature of the cluster $\mathcal{C}_j$ (with $n_j$ samples). The inter-cluster scatter is defined as

$\mathcal{S}_{inter} = \sum_{j=1}^{M_t} n_j\,\|c_j - c\|^2$   (16)

where $c$ is the average feature of all training samples in the target domain. To evaluate the discrimination of each expert model in the unlabeled target domain, the inter-/intra-cluster scatter is defined as

$\mathcal{J} = \frac{\mathcal{S}_{inter}}{\sum_{j=1}^{M_t}\mathcal{S}_{intra}^{j}}$   (17)

$\mathcal{J}$ gets larger when the inter-cluster scatter is larger or the intra-cluster scatter is smaller, and a larger $\mathcal{J}$ means better discrimination capability. Before feature learning in each epoch, we calculate the scatter for each expert as $\mathcal{J}_k$, and define the expert authority as the mean normalization of $\mathcal{J}_k$:

$w_k = \frac{K\,\mathcal{J}_k}{\sum_{e=1}^{K}\mathcal{J}_e}$   (18)
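The scatter statistics and authorities can be computed directly from each expert's features and cluster assignments. A NumPy sketch under our own naming:

```python
import numpy as np

def expert_authorities(features_per_expert, labels_per_expert):
    """Eqs. 15-18: per-expert inter-/intra-cluster scatter J, then
    mean-normalized authorities w_k = K * J_k / sum_e J_e.
    features_per_expert[k]: (N, D) features extracted by expert k;
    labels_per_expert[k]: cluster assignments from expert k's features."""
    scatters = []
    for feats, labels in zip(features_per_expert, labels_per_expert):
        global_mean = feats.mean(axis=0)
        s_intra, s_inter = 0.0, 0.0
        for c in np.unique(labels):
            members = feats[labels == c]
            centroid = members.mean(axis=0)
            s_intra += ((members - centroid) ** 2).sum()   # Eq. 15, summed over clusters
            s_inter += len(members) * ((centroid - global_mean) ** 2).sum()  # Eq. 16
        scatters.append(s_inter / s_intra)                  # Eq. 17
    scatters = np.asarray(scatters)
    return len(scatters) * scatters / scatters.sum()        # Eq. 18
```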

We re-define the mutual identity loss in Eq. 6 and the mutual triplet loss in Eq. 9 as the weighted sums of $\mathcal{L}_{mid}^{k\leftarrow e}$ and $\mathcal{L}_{mtri}^{k\leftarrow e}$ over the other experts:

$\mathcal{L}_{mid}^{k} = \frac{1}{K-1}\sum_{e\neq k} w_e\,\mathcal{L}_{mid}^{k\leftarrow e}$   (19)

and

$\mathcal{L}_{mtri}^{k} = \frac{1}{K-1}\sum_{e\neq k} w_e\,\mathcal{L}_{mtri}^{k\leftarrow e}$   (20)

With the regularization scheme, MEB-Net modulates the authority of experts to facilitate discrimination in the target domain.

4 Experiments

4.1 Datasets and Evaluation Metrics

We evaluate the proposed method on Market-1501 [42] and DukeMTMC-reID [24], [43].

Market-1501: This dataset contains 32,668 images of 1,501 identities from 6 disjoint surveillance cameras. Of the 32,668 person images, 12,936 images from 751 identities form a training set, 19,732 images from 750 identities (plus a number of distractors) form a gallery set, and 3,368 images from 750 identities form a query set.

DukeMTMC-reID: This dataset is a subset of DukeMTMC. It consists of 16,522 training images, 2,228 query images, and 17,661 gallery images of 1,812 identities captured by 8 cameras. Of the 1,812 identities, 1,404 appear in at least two cameras and the remaining 408 (considered as distractors) appear in a single camera.

Evaluation Metrics:

For each evaluation, we use one dataset as the target domain and the other one as the source domain. Cumulative Matching Characteristic (CMC) curve and mean average precision (mAP) are used as the evaluation metrics.

4.2 Implementation Details

MEB-Net is trained in two stages: pre-training in source domains and adaptation in target domains.

Stage 1: Pre-training in source domains. We first pre-train three initial expert models on the source dataset in a supervised manner as described in Sec. 3.2. Specifically, we adopt three architectures, DenseNet-121 [12], ResNet-50 [10] and Inception-v3 [29], as the backbone networks of the three experts, respectively, and initialize them with parameters pre-trained on ImageNet [4]. Zero padding is employed on the final features to obtain representations of the same 2048 dimensions for all networks. During training, input images are resized to a uniform resolution, and traditional image augmentation is performed via random flipping and random erasing. A mini-batch of size 64 is sampled with 16 randomly selected identities and 4 randomly sampled images per identity for computing the hard-batch triplet loss. We use Adam [14] with weight decay 0.0005 to optimize the parameters. The initial learning rate is set to 0.00035 and is decreased to 1/10 of its previous value at the 40th and 70th epochs, over 80 epochs in total.
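The stated schedule corresponds to a standard Adam plus multi-step decay setup; a sketch assuming `params` holds one expert's parameters:

```python
import torch

# Adam with weight decay 0.0005; lr 0.00035, decayed by 10x at epochs 40 and 70.
optimizer = torch.optim.Adam(params, lr=3.5e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 70], gamma=0.1)

for epoch in range(80):
    ...  # one epoch of supervised training with Eq. 3
    scheduler.step()
```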

Stage 2: Adaptation in target domains. For unsupervised adaptation on target datasets, we follow the same data augmentation strategy and triplet loss setting. The temporal ensemble momentum $\alpha$ in Eq. 4 is set to 0.999. The learning rate is fixed to 0.00035 for all 40 epochs. In each epoch, we conduct mini-batch k-means clustering with the number of groups set to 500 for all target datasets. Each epoch consists of 800 training iterations. During testing, we use only one expert network for feature representation.

4.3 Comparison with State-of-the-Arts

Methods Market-1501 DukeMTMC-reID
mAP R-1 R-5 R-10 mAP R-1 R-5 R-10
LOMO[17] 8.0 27.2 41.6 49.1 4.8 12.3 21.3 26.6
BoW[42] 14.8 35.8 52.4 60.3 8.3 17.1 28.8 34.9
UMDL[22] 12.4 34.5 52.6 59.6 7.3 18.5 31.4 37.6
MMFA[18] 27.4 56.7 75.0 81.8 24.7 45.3 59.8 66.3
TJ-AIDL[32] 26.5 58.2 74.8 81.1 23.0 44.3 59.6 65.0
UCDA-CCE[23] 30.9 60.4 - - 31.0 47.7 - -
ATNet[19] 25.6 55.7 73.2 79.4 24.9 45.1 59.5 64.2
SPGAN+LMP[5] 26.7 57.7 75.8 82.4 26.2 46.4 62.3 68.0
CamStyle[46] 27.4 58.8 78.2 84.3 25.1 48.4 62.5 68.9
HHL[44] 31.4 62.2 78.8 84.0 27.2 46.9 61.0 66.7
ECN[45] 43.0 75.1 87.6 91.6 40.4 63.3 75.8 80.4
PDA-Net[15] 47.6 75.2 86.3 90.2 45.1 63.2 77.0 82.5
PUL[6] 20.5 45.5 60.7 66.7 16.4 30.0 43.4 48.5
UDAP[27] 53.7 75.8 89.5 93.2 49.0 68.4 80.1 83.5
PCB-PAST[39] 54.6 78.4 - - 54.3 72.4 - -
SSG[8] 58.3 80.0 90.0 92.4 53.4 73.0 80.6 83.2
MMT-500[9] 71.2 87.7 94.9 96.9 63.1 76.8 88.0 92.2
MEB-Net(Ours) 76.0 89.9 96.0 97.5 66.1 79.6 88.3 92.2
Table 1: Comparison with state-of-the-art methods for adaptation to Market-1501 and to DukeMTMC-reID. The top three results are highlighted in bold, italic, and underlined fonts, respectively.

We compare MEB-Net with state-of-the-art methods including hand-crafted feature approaches (LOMO[17], BoW[42], UMDL[22]), feature alignment based methods (MMFA[18], TJ-AIDL[32], UCDA-CCE[23]), GAN-based methods (SPGAN[5], ATNet[19], CamStyle[46], HHL[44], ECN[45] and PDA-Net[15]), and pseudo-label prediction based methods (PUL[6], UDAP[27], PCB-PAST[39], SSG[8], MMT[9]). Table 1 shows person re-ID performance when adapting from Market-1501 to DukeMTMC-reID and vice versa.

Hand-crafted feature approaches. As Table 1 shows, MEB-Net outperforms hand-crafted feature approaches including LOMO, BoW and UMDL by large margins, as deep networks can learn more discriminative representations than hand-crafted features.

Feature alignment approaches. MEB-Net significantly exceeds the feature alignment based unsupervised re-ID models. The reason is that it explores and utilizes the similarity between unlabelled samples in the target domain in a more effective manner through brainstorming.

GAN-based approaches. The performance of these approaches is diverse. In particular, ECN performs better than most methods using GANs because it enforces camera invariance as well as latent sample relations. However, MEB-Net achieves higher performance than GAN-based methods without generating new images, which indicates a more efficient use of the unlabelled samples.

Pseudo-label based approaches. This line of approaches performs clearly better than the other approaches in most cases, as they make full use of the unlabelled target samples by assigning pseudo-labels to them according to sample feature similarities. For a fair comparison, we report MMT-500 with a cluster number of 500, the same as the proposed MEB-Net. As Table 1 shows, MEB-Net achieves an mAP of 76.0% and a rank-1 accuracy of 89.9% for the DukeMTMC-reID→Market-1501 transfer, outperforming the state-of-the-art (MMT-500) by 4.8% and 2.2%, respectively. For the Market-1501→DukeMTMC-reID transfer, MEB-Net obtains an mAP of 66.1% and a rank-1 accuracy of 79.6%, outperforming the state-of-the-art by 3.0% and 2.8%, respectively.

4.4 Ablation Studies

Methods Market-1501 DukeMTMC-reID
mAP R-1 R-5 R-10 mAP R-1 R-5 R-10
Supervised Models 82.5 93.7 98.1 98.5 67.1 82.1 90.0 92.1
Direct Transfer 31.5 60.6 75.7 80.8 29.7 46.5 61.8 67.7
Baseline (Only $\mathcal{L}_{vot}$) 69.5 86.8 94.9 96.6 60.6 75.0 85.5 89.4
MEB-Net w/o $\Theta$ 70.7 87.1 94.8 96.7 58.3 72.6 83.6 88.5
MEB-Net w/o $\mathcal{L}_{mid}$ 70.2 87.9 94.8 96.6 60.4 75.0 86.1 89.3
MEB-Net w/o $\mathcal{L}_{mtri}$ 74.9 88.4 95.8 97.7 63.0 76.6 87.3 90.8
MEB-Net w/o AR 75.5 89.3 95.9 97.4 65.4 77.9 88.9 91.9
MEB-Net 76.0 89.9 96.0 97.5 66.1 79.6 88.3 92.2
Table 2: Ablation studies of MEB-Net. Supervised Models: re-ID models trained using the labelled training images of the target domain. Direct Transfer: re-ID models trained using the labelled training images of the source domain. $\mathcal{L}_{vot}$ (Eq. 12), the temporally average models $\Theta$ (Eq. 4), $\mathcal{L}_{mid}$ (Eq. 6) and $\mathcal{L}_{mtri}$ (Eq. 9) are described in Sec. 3.4. AR: authority regularization as described in Sec. 3.5.

Detailed ablation studies are performed to evaluate the components of MEB-Net as shown in Table 2.

Supervised models vs. Direct transfer. We first derive the upper and lower performance bounds for the ablation studies from the supervised models and the direct transfer models, as shown in Table 2. Specifically, the supervised models are trained using labelled target-domain training images and evaluated over the target-domain test images. The direct transfer models are trained using the labelled source-domain images and evaluated over the target-domain test images. We evaluate all three architectures and report the best results in Table 2. Huge performance gaps can be observed between the Direct Transfer models and the Supervised models, due to domain shift. Take Market-1501 as an example: the mAP of the supervised model reaches 82.5% but sharply drops to 31.5% for the directly transferred model trained on DukeMTMC-reID.

Voting loss: To investigate the effectiveness of the proposed MEB-Net, we create baseline ensemble models that use only the voting loss. Specifically, pseudo-labels are predicted by averaging the features output by all expert networks, and then used to supervise the training of each expert network individually by optimizing the voting loss. As Table 2 shows, the Baseline model outperforms the Direct Transfer model by a large margin: the mAP improves from 31.5% to 69.5% and from 29.7% to 60.6% when evaluated on Market-1501 and DukeMTMC-reID, respectively. This shows that the voting loss effectively makes use of the ensemble of models to predict more accurate pseudo-labels and fine-tune each network. At the same time, large performance gaps remain between the Baseline models and the Supervised models, e.g., a drop of 13% in mAP when transferring from DukeMTMC-reID to Market-1501.

Temporally Average Networks: We verify the effectiveness of the temporally average models in MEB-Net. The model without them is denoted as "MEB-Net w/o $\Theta$". For this experiment, we directly use the predictions of the current networks parameterized by $\theta_k$, instead of the temporally average networks with parameters $\Theta_k$, as soft labels. As Table 2 shows, distinct drops of 5.3% mAP and 2.8% rank-1 accuracy are observed when transferring from DukeMTMC-reID to Market-1501. Similarly, decreases of 7.8% mAP and 7.0% rank-1 accuracy are observed on DukeMTMC-reID. Without the temporally average models, the networks tend to degenerate and become homogeneous, which substantially decreases the learning capability.

Effectiveness of mutual learning: We evaluate the mutual learning component in Sec. 3.4 from two aspects: the mutual identity loss and the mutual triplet loss. Removing the former is denoted as "MEB-Net w/o $\mathcal{L}_{mid}$". Results show that the mAP drops from 76.0% to 70.2% on Market-1501 and from 66.1% to 60.4% on DukeMTMC-reID. Similar drops are observed when removing the mutual triplet loss, denoted as "MEB-Net w/o $\mathcal{L}_{mtri}$": the mAP drops to 74.9% and 63.0% for the DukeMTMC-reID→Market-1501 transfer and vice versa, respectively. The effectiveness of mutual learning, including both mutual losses, can largely be attributed to its enhancing the discrimination capability of all expert networks.

Authority Regularization: We verify the effectiveness of the proposed authority regularization (Sec. 3.5) of MEB-Net. Specifically, we remove the authority regularization and set the authorities $w_e$ (in Eq. 19 and Eq. 20) equally for all expert models. This model is denoted as "MEB-Net w/o AR", and its results are shown in Table 2. Experiments without authority regularization show distinct drops on both Market-1501 and DukeMTMC-reID, which indicates that equivalent brainstorming among experts hinders feature discrimination, because an unprofessional expert may provide erroneous supervision. Specifically, the rank-1 accuracy drops to 89.3% when transferring from DukeMTMC-reID to Market-1501, and to 77.9% when transferring from Market-1501 to DukeMTMC-reID. Improvements on both settings demonstrate the effectiveness of the proposed authority regularization module.

4.5 Discussion

Architectures Supervised Dire. tran. Sing. tran. Base. ens. MEB-Net
DenseNet-121 80.0 30.8 57.8 69.5 76.0
ResNet-50 82.5 31.5 62.4 65.6 72.2
Inception-v3 68.3 28.5 51.5 62.3 71.3
Table 3: mAP accuracy (%) of networks with different architectures for the DukeMTMC-reID→Market-1501 transfer. Supervised: supervised learning with labelled samples; Dire. tran.: direct transfer; Sing. tran.: single-model transfer learning; Base. ens. (baseline ensemble) and MEB-Net are conducted among all three networks.

Comparison with Baseline Ensemble. Considering that ensemble models usually achieve superior performance over a single model, we compare the mAP of our approach with other baseline methods, including single-model transfer and baseline model ensemble. All experiments are conducted with the same clustering algorithm. The results in Table 3 show that single-model transfer learning improves the mAP beyond direct transfer, which can be attributed to fine-tuning with pseudo-label prediction. The baseline model ensemble uses all networks to extract average features of unlabelled samples for pseudo-label prediction, but without mutual learning among them during adaptation in the target domain; its improvement over single-model transfer comes from more accurate pseudo-labels. However, MEB-Net performs significantly better than all compared methods, which validates that MEB-Net provides a more effective ensemble method for domain adaptive person re-ID.

Figure 4: Evaluation with different numbers of epochs. The performance of all networks ascends to a stable value after 20 epochs.

Number of Epochs. We evaluate the mAP of MEB-Net after each epoch. As shown in Fig. 4, the models become stronger as the iterative clustering proceeds. The performance improves in early epochs and finally converges after 20 epochs for both datasets.

5 Conclusion

This paper proposed a multiple expert brainstorming network (MEB-Net) for domain adaptive person re-ID. MEB-Net adopts a mutual learning strategy, where networks of different architectures are pre-trained to initialize several expert models, and the adaptation is accomplished through brainstorming (mutual learning) among the expert models. Furthermore, an authority regularization scheme was introduced to tackle the heterogeneity of experts. Experiments demonstrated the effectiveness of MEB-Net in improving the discrimination ability of re-ID models. Our approach efficiently assembles the discrimination capability of multiple networks while requiring only a single network at inference time.

References

  • [1] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton (2018) Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235. Cited by: §2.2.
  • [2] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi (2018) Label refinery: improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641. Cited by: §2.2.
  • [3] T. Chen, I. Goodfellow, and J. Shlens (2015) Net2net: accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641. Cited by: §2.2.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In IEEE CVPR, Cited by: §4.2.
  • [5] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In IEEE CVPR, Cited by: §1, §2.1, §4.3, Table 1.
  • [6] H. Fan, L. Zheng, C. Yan, and Y. Yang (2018) Unsupervised person re-identification: clustering and fine-tuning. TOMCCAP 14 (4), pp. 83:1–83:18. Cited by: §4.3, Table 1.
  • [7] H. Fan, L. Zheng, and Y. Yang (2017) Unsupervised person re-identification: clustering and fine-tuning. CoRR abs/1705.10444. Cited by: §1, §2.1.
  • [8] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6112–6121. Cited by: §1, §2.1, §4.3, Table 1.
  • [9] Y. Ge, D. Chen, and H. Li (2020) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv preprint arXiv:2001.01526. Cited by: §2.2, §4.3, Table 1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.2.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 4700–4708. Cited by: §4.2.
  • [13] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In European conference on computer vision (ECCV), pp. 646–661. Cited by: §2.3.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [15] Y. Li, C. Lin, Y. Lin, and Y. F. Wang (2019) Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 7919–7929. Cited by: §2.1, §4.3, Table 1.
  • [16] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li (2017) Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1910–1918. Cited by: §2.2.
  • [17] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015-06) Person re-identification by local maximal occurrence representation and metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3, Table 1.
  • [18] S. Lin, H. Li, C. Li, and A. C. Kot (2018) Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. In BMVC, Cited by: §2.1, §4.3, Table 1.
  • [19] J. Liu, Z. Zha, D. Chen, R. Hong, and M. Wang (2019) Adaptive transfer network for cross-domain person re-identification. In IEEE CVPR, Cited by: §1, §2.1, §4.3, Table 1.
  • [20] Z. Liu, D. Wang, and H. Lu (2017) Stepwise metric promotion for unsupervised video person re-identification. In IEEE ICCV, pp. 2448–2457. Cited by: §2.1.
  • [21] J. Lv and X. Wang (2018) Cross-dataset person re-identification using similarity preserved generative adversarial networks. In KSEM, W. Liu, F. Giunchiglia, and B. Yang (Eds.), pp. 171–183. Cited by: §1, §2.1.
  • [22] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian (2016-06) Unsupervised cross-dataset transfer learning for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3, Table 1.
  • [23] L. Qi, L. Wang, J. Huo, L. Zhou, Y. Shi, and Y. Gao (2019) A novel unsupervised camera-aware domain adaptation framework for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8080–8089. Cited by: §4.3, Table 1.
  • [24] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In IEEE ECCV Workshops, Cited by: §4.1.
  • [25] Z. Shen, Z. He, and X. Xue (2019) MEAL: multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4886–4893. Cited by: §2.3.
  • [26] S. Singh, D. Hoiem, and D. Forsyth (2016) Swapout: learning an ensemble of deep architectures. In Advances in neural information processing systems, pp. 28–36. Cited by: §2.3.
  • [27] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang (2018) Unsupervised domain adaptive re-identification: theory and practice. CoRR abs/1807.11334. Cited by: §1, §4.3, Table 1.
  • [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2.3.
  • [29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 2818–2826. Cited by: §4.2.
  • [30] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §2.2.
  • [31] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §2.3.
  • [32] J. Wang, X. Zhu, S. Gong, and W. Li (2018) Transferable joint attribute-identity deep learning for unsupervised person re-identification. In IEEE CVPR, Cited by: §1, §2.1, §4.3, Table 1.
  • [33] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer gan to bridge domain gap for person re-identification. In IEEE CVPR, Cited by: §1, §2.1.
  • [34] J. Wu, S. Liao, Z. Lei, X. Wang, Y. Yang, and S. Z. Li (2019) Clustering and dynamic sampling based unsupervised domain adaptation for person re-identification. In IEEE ICME, pp. 886–891. Cited by: §1, §2.1.
  • [35] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Exploit the unknown gradually: one-shot video-based person re-identification by stepwise learning. In IEEE CVPR, Cited by: §2.1.
  • [36] M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen (2017) Dynamic label graph matching for unsupervised video re-identification. In IEEE ICCV, pp. 5152–5160. Cited by: §2.1.
  • [37] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4133–4141. Cited by: §2.2.
  • [38] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian (2020) AD-cluster: augmented discriminative clustering for domain adaptive person re-identification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [39] X. Zhang, J. Cao, C. Shen, and M. You (2019) Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8222–8231. Cited by: §2.1, §4.3, Table 1.
  • [40] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4320–4328. Cited by: §2.2.
  • [41] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) MARS: A video benchmark for large-scale person re-identification. In ECCV, pp. 868–884. Cited by: §2.1.
  • [42] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015-12) Scalable person re-identification: a benchmark. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §4.1, §4.3, Table 1.
  • [43] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In IEEE ICCV, Cited by: §4.1.
  • [44] Z. Zhong, L. Zheng, S. Li, and Y. Yang (2018) Generalizing a person retrieval model hetero- and homogeneously. In ECCV, pp. 176–192. Cited by: §2.1, §4.3, Table 1.
  • [45] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In IEEE CVPR, Cited by: §2.1, §4.3, Table 1.
  • [46] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang (2019) CamStyle: A novel data augmentation method for person re-identification. IEEE TIP 28 (3), pp. 1176–1190. Cited by: §4.3, Table 1.