ReID is a branch of image retrieval that aims to find images of a query pedestrian in a gallery set. Although ReID has made dramatic progress with the development of CNNs, it remains a challenging problem because of the complexity of pedestrian appearance. The feature representations obtained by a CNN extractor cannot represent the input pedestrian exactly, because different pedestrians are highly similar to one another in the classification task.
Related studies have addressed this problem. For the pedestrian similarity issue, [Zheng2016, Yao2017] adopted a deep convolutional network and body parts to learn discriminative representations, relying on a powerful CNN to extract the features. [Zheng2016],[Varior2016],[Ding2015] utilized metric losses to train a deep model for discriminative representations. However, these works concentrate on the encoder output and the ground truth, neglecting the correlation between the input and the encoder output. Specifically, for different images of the same pedestrian, these methods extract the features common to the pedestrian and ignore the unique characteristics and details of each image, which may damage the generalization of the model. For example, in Fig. 1, (a), (b) and (c) show three different pedestrians, each with six different images. The common features of a pedestrian cannot describe these details clearly, which hinders the capability of the feature representations.
In this paper, we focus on the correlation between the input and the encoder output to enhance the feature representation of each image. We design a Deep InfoMax (DIM) network to maximize the mutual information (MI) between the input pedestrian and the encoder output. Based on the DIM network, we propose an end-to-end Global-DIM network to capture discriminative feature representations. Furthermore, inspired by the success of the part-based ReID strategy in [Sun2017], we implement an end-to-end Local-DIM network to maximize the MI between the parts of pedestrians.
Besides, the feature representations are also limited by the dataset. Because the target dataset in real surveillance video has a different feature distribution and is not labeled, a model trained on the source dataset is difficult to apply to it. To address this problem, various unsupervised domain adaptation methods try to align the distributions of the source and target datasets [Fan2017],[Deng2017],[Zhong2017a]. However, these methods usually rely on auxiliary information such as camera labels and are time-consuming in the training phase. Our DIM network provides a feasible solution for unsupervised ReID, as it needs no auxiliary cues and is easy to train. We propose a Transfer Deep InfoMax (TF-DIM) model to extract the specific characteristics of pedestrians in the target dataset for unsupervised ReID.
Our contributions can be summarized in four aspects: 1) We propose a DIM network to maximize the MI between the input image and the encoder output, thereby boosting the capability of the feature representations. 2) Based on the IDE and PCB baselines, we build Global-DIM and Local-DIM networks to verify the effectiveness of the DIM network in ReID. 3) Based on the label-free DIM loss, we construct a TF-DIM model to obtain well-generalized feature representations for unsupervised ReID. 4) Using the theory of MI, we propose an end-to-end high-quality ReID model for both supervised and unsupervised ReID, which does not need any extra knowledge cues.
II Related Works
II-A Supervised-learned ReID
To deal with the problem of pedestrian similarity, a wide range of methods have been proposed to obtain discriminative feature representations. They can be divided into feature learning and metric learning methods. Among feature learning methods, one idea is to exploit the structure of the pedestrian by parsing it into part features. Wei et al. [Wei2017a] proposed a Global-Local-Alignment Descriptor (GLAD) to partition the pedestrian. Su et al. [Su2017] utilized pedestrian part cues to boost the feature representations. Additionally, the attention mechanism [Xu2015] is widely employed in deep ReID. Liu et al. [Liu2017] and Zhao et al. [Zhao2017] enhanced the feature representations with an attention mechanism that lets the network choose the important parts. [Sun2017] proposed an end-to-end attention ReID network for robust feature representations. Among metric learning methods, [Varior2016] computed a contrastive loss to distinguish similar pedestrians, and [Ding2015] applied a triplet loss for better feature representations. However, these methods neglect the correlation between the input and the encoder output, which may damage the generalization of the model. In this paper, we introduce the MI theory [Hjelm2018] to improve the capacity of the feature representations.
II-B Unsupervised-learned ReID
Large unlabeled target datasets in real surveillance scenarios are difficult to recognize with a model pre-trained on a labeled source dataset. Xiang et al. [Kodirov, Yu2017, Fan2017] assumed that the dataset has clustering properties; these methods belong to pseudo-label learning. [Deng2017, Zhong2017a] adopted CycleGAN [Zhu2017] to align the distributions of the source and target domains. Nevertheless, these methods usually rely on prior knowledge cues such as camera information and are not easy to train. Our proposed TF-DIM model demands no extra knowledge and can be used as a supplement to other methods.
II-C Mutual-information Estimation
MI has long been studied in the domain of feature learning. Linsker and Bell [Linsker, Bell] applied the information maximization principle to neural networks. Ji et al. [Ji2018] maximized the MI between class pairs for unsupervised clustering and segmentation. Contrastive Predictive Coding (CPC) [Oord2018] learned useful representations from high-dimensional space by MI estimation. Mutual Information Neural Estimator (MINE) [Belghazi2018] aimed at supervised classification problems by maximizing the MI. [Hjelm2018] replaced the KL-based divergence with the Jensen-Shannon divergence (JSD), as the latter is bounded and leads to better results. Inspired by [Hjelm2018], we explore some possibilities for the ReID problem.
III Deep InfoMax Network
Traditional works pay more attention to the encoder features and the ground-truth labels, so the feature representations tend to preserve what is common within each category. However, pedestrians in the ReID task have much in common with one another. In other words, the inter-class distance between pedestrians is not large enough and the intra-class distance is not small enough.
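The discriminative feature representations in the ReID task should be able to identify a pedestrian from the entire dataset, that is, to extract the unique information from each image. For this purpose, we maximize the MI between the input pedestrians $X=\{x_i\}_{i=1}^{N}$ and the encoder outputs $Z=\{z_i\}_{i=1}^{N}$, where $N$ is the number of images.
The MI between $X$ and $Z$ is defined as:
$$I(X;Z)=\iint p(z|x)\,\tilde{p}(x)\,\log\frac{p(z|x)}{p(z)}\,dx\,dz, \tag{1}$$
where $\tilde{p}(x)$ is the distribution of the raw dataset $X$, and $p(z)=\int p(z|x)\,\tilde{p}(x)\,dx$ is the distribution of $Z$.
We transform Eq.1 so as to compute the maximization of MI:
$$I(X;Z)=\iint p(z|x)\,\tilde{p}(x)\,\log\frac{p(z|x)\,\tilde{p}(x)}{p(z)\,\tilde{p}(x)}\,dx\,dz=KL\big(p(z|x)\,\tilde{p}(x)\,\big\|\,p(z)\,\tilde{p}(x)\big), \tag{4}$$
where $KL$ denotes the Kullback-Leibler divergence [NIPS2016_6066]. Eq.4 means that maximizing the MI enlarges the distance between $p(z|x)\,\tilde{p}(x)$ and $p(z)\,\tilde{p}(x)$. However, the KL divergence has no upper bound, so maximizing it can drive the result to infinity. Therefore, we introduce the bounded Jensen-Shannon divergence [Fuglede2004] instead:
$$JS(P,Q)=\frac{1}{2}KL\Big(P\,\Big\|\,\frac{P+Q}{2}\Big)+\frac{1}{2}KL\Big(Q\,\Big\|\,\frac{P+Q}{2}\Big)\le\log 2,$$
where $P=p(z|x)\,\tilde{p}(x)$, $Q=p(z)\,\tilde{p}(x)$, and the upper bound $\log 2$ is a constant value.
In [Nowozin2016], $JS(P,Q)$ can be concluded as
$$JS(P,Q)=\frac{1}{2}\max_{T}\Big(\mathbb{E}_{(x,z)\sim P}\big[\log\sigma(T(x,z))\big]+\mathbb{E}_{(x,z)\sim Q}\big[\log\big(1-\sigma(T(x,z))\big)\big]\Big)+\log 2, \tag{8}$$
where $\sigma$ is the sigmoid function and $\sigma(T(x,z))$ is the output of the discriminator in the DIM network. The discriminator has four FC layers: the first three are each followed by a ReLU activation, and the last is followed by a sigmoid activation.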
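As a minimal numerical sketch of the argument above (an illustration on small discrete distributions, not part of the method itself), the following Python snippet computes the MI of a joint table as the KL divergence between the joint and the product of its marginals, and shows that the KL divergence can be infinite while the JSD never exceeds $\log 2$. The function names and the toy tables are assumptions made for this example.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # unbounded when q has no mass where p does
            total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence (natural log); bounded above by log 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mutual_information(joint):
    """MI of a joint table joint[i][j] = p(x_i, z_j), computed as
    KL(joint || product of marginals)."""
    px = [sum(row) for row in joint]
    pz = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in pz]
    return kl(flat_joint, flat_product)

# Toy case 1: z is fully determined by x, so the MI equals log 2.
dependent = [[0.5, 0.0], [0.0, 0.5]]
# Toy case 2: z is independent of x, so the MI is zero.
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
# Two distributions with disjoint support: KL is infinite, JSD is log 2.
kl_disjoint = kl([1.0, 0.0], [0.0, 1.0])
jsd_disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

The disjoint-support case is the extreme one: the KL divergence diverges, whereas the JSD saturates at its constant upper bound, which is the property that motivates the replacement.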
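The DIM network needs positive and negative samples as input. There are two available sampling strategies. The first is to choose $(x,z)$ and $(\tilde{x},z)$ as the positive and negative sample pairs, where $z$ is the encoder output of $x$ and $\tilde{x}$ denotes the images randomly shuffled within a batch [Hjelm2018]. The second is label-based sampling, which uses the ground-truth identity labels to form the pairs. Although the first strategy may give wrong negative samples to the DIM network (a shuffled image can share the identity of the original one), the experiments in Section VIII-B show that random sampling is competitive with the label-based strategy.
In this paper, we choose the random sampling strategy for the DIM network. According to Eq.8, the DIM network loss is rewritten as
$$L_{DIM}=-\Big(\mathbb{E}_{(x,z)}\big[\log\sigma(T(x,z))\big]+\mathbb{E}_{(\tilde{x},z)}\big[\log\big(1-\sigma(T(\tilde{x},z))\big)\big]\Big),$$
where $\sigma(T(\cdot,\cdot))$ is the sigmoid output of the DIM network.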
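The random sampling strategy and the resulting loss can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: `score_fn` stands in for the four-layer DIM network before its final sigmoid, inputs and codes are arbitrary Python objects, and the names `shuffle_batch` and `dim_loss` are hypothetical.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def shuffle_batch(batch, rng):
    """Return a randomly permuted copy of the batch (the negative inputs)."""
    shuffled = list(batch)
    rng.shuffle(shuffled)
    return shuffled

def dim_loss(inputs, codes, score_fn, rng):
    """Negated JSD-style objective averaged over one batch.

    Positive pairs are (x_i, z_i); negative pairs are (x~_i, z_i), where x~
    is the same batch in shuffled order. score_fn(x, z) returns a raw score.
    """
    negatives = shuffle_batch(inputs, rng)
    eps = 1e-12  # guards against log(0)
    pos = sum(math.log(sigmoid(score_fn(x, z)) + eps)
              for x, z in zip(inputs, codes))
    neg = sum(math.log(1.0 - sigmoid(score_fn(x, z)) + eps)
              for x, z in zip(negatives, codes))
    return -(pos + neg) / len(inputs)

# Toy batch: the "codes" are just the inputs themselves (assumption).
batch = [0.0, 1.0, 2.0, 3.0]
# A score function that ignores its arguments cannot tell pairs apart,
# so the loss sits at about 2 * log 2.
uninformative_loss = dim_loss(batch, batch, lambda x, z: 0.0, random.Random(0))
```

Because the negatives come from a plain shuffle, a negative pair may occasionally contain two images of the same identity; the sketch makes no attempt to filter such pairs, which mirrors the label-free nature of the strategy.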
IV Global DIM Network
Our approach introduces the Deep InfoMax loss to enhance the capability of the feature representations in the ReID task. Based on the Deep InfoMax network described above, we propose our Global DIM network. The whole framework is shown in Fig.3 and consists of two parts: the encoder and DIM networks.
The encoder network is improved on the IDE baseline. The baseline uses ResNet-50 [He2016] as its backbone, finetuned from a model pre-trained on the ImageNet challenge [Socher2009]. Following the strategy in [Zhong], we discard the original final 1000-dim fully connected (FC) layer and add two FC layers: the output of the first layer is 512-dim, followed by Batch Normalization [Ioffe2015], LeakyReLU, and a Dropout layer [Srivastava2014]. The output of the second FC layer is ID-dim, where ID is the number of identities in the dataset.
The encoder network based on the IDE baseline is an effective supervised ReID model. The cross-entropy loss employed in the encoder network is
$$L_{CE}=-\frac{1}{N}\sum_{i=1}^{N}\log p(y_i|x_i),$$
where $N$ is the number of images, $y_i$ is the $i$-th label in a batch, and $p(y_i|x_i)$ is the predicted probability of that label for the $i$-th image.
Based on the feature representations obtained from the encoder network, our Global DIM network adds the DIM network to account for the correlation between the input and the feature representations. The full objective loss is:
$$L_{Global}=L_{CE}+\alpha L_{DIM},$$
where $\alpha$ is a hyper-parameter that denotes the weight of the DIM loss in the whole objective loss.
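A minimal sketch of how the two terms combine is given below, assuming the identity classifier outputs raw logits and that the DIM loss has already been computed as a scalar. The weight name `alpha` and its default of 0.02 (the value reported as best for the Global-DIM network in the experiments) are choices made for this illustration, as are the function names.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true identity over a batch."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= math.log(softmax(logits)[label])
    return total / len(labels)

def global_dim_objective(ce_loss, dim_loss, alpha=0.02):
    """Identity loss plus the alpha-weighted DIM loss."""
    return ce_loss + alpha * dim_loss

# Toy batch with two identities and uniform logits: the CE is log 2.
toy_ce = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
toy_total = global_dim_objective(toy_ce, dim_loss=2.0 * math.log(2.0))
```

With a small weight such as 0.02, the DIM term acts as a regularizer on top of the identity loss rather than as the dominant training signal.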
V Local DIM Network
Inspired by the finding in [Hjelm2018] that the DIM method is effective on local features, we propose our Local DIM approach based on local features. The framework of the Local-DIM network is shown in Fig. 4 and consists of two parts: the part encoder and DIM networks.
Our part encoder network is improved on the PCB model. The encoder network uses ResNet-50 [He2016] as its backbone, finetuned from a model pre-trained on the ImageNet challenge [Socher2009]. After the global average pooling, the 3D tensor block is partitioned into six branches, which represent the six parts of the pedestrian. Then, each branch is followed by a classifier block: the output of the first FC layer is 512-dim, followed in sequence by Batch Normalization [Ioffe2015], LeakyReLU, and a Dropout layer [Srivastava2014]. The output dimension of the second FC layer equals the number of identities in the dataset.
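The six-part partition can be illustrated with nested Python lists. This is a sketch under assumptions: the feature map is taken to be 24 x 12 spatially (what a stride-16 backbone would give for a 384x192 input), the stripes are horizontal and of equal height, and each stripe is average-pooled into one vector, as is common in part-based models. The function name is hypothetical.

```python
def partition_into_parts(feature_map, num_parts=6):
    """Split an H x W x C feature map (nested lists) into horizontal stripes
    and average-pool each stripe into one C-dimensional part feature."""
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts")
    rows_per_part = height // num_parts
    channels = len(feature_map[0][0])
    parts = []
    for p in range(num_parts):
        stripe = feature_map[p * rows_per_part:(p + 1) * rows_per_part]
        cells = [cell for row in stripe for cell in row]
        parts.append([sum(c[k] for c in cells) / len(cells)
                      for k in range(channels)])
    return parts

# Toy 24 x 12 x 2 map: channel 0 stores the row index, channel 1 is constant.
toy_map = [[[float(r), 1.0] for _ in range(12)] for r in range(24)]
parts = partition_into_parts(toy_map)
# parts[0] averages rows 0..3 and parts[5] averages rows 20..23.
```

Each of the six resulting vectors would then feed its own classifier block of the kind described above.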
The part encoder model is an effective supervised method for identifying pedestrians in a given dataset. In the part encoder model, the pedestrian is partitioned into six parts to extract the important part features. Then, a cross-entropy loss is applied to each part feature to ensure effective feature representations.
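The loss of the part encoder network is
$$L_{part}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{P}\log p(y_i|f_i^{j}),$$
where $N$ is the number of images, $P$ is the number of branches, $f_i^{j}$ denotes the $j$-th part feature of the $i$-th image, $p(\cdot|f_i^{j})$ is the probability distribution of $f_i^{j}$, and $y_i$ is the label corresponding to the $i$-th image.
In summary, we propose an end-to-end Local-DIM model based on the part encoder and DIM networks. The full objective loss is:
$$L_{Local}=L_{part}+\beta L_{DIM},$$
where $\beta$ is a hyper-parameter that denotes the weight of the DIM loss in the whole objective loss.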
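The part-level loss differs from the global one only in that the cross-entropy is accumulated over all branches of each image. A minimal sketch follows, assuming each branch emits its own identity logits and that the per-part terms are summed before averaging over the batch (the normalization is an assumption). The weight name `beta` and its default of 0.01 follow the value reported as best for the Local-DIM network in the experiments; the function names are hypothetical.

```python
import math

def log_softmax_at(logits, label):
    """Log-probability assigned to `label` by a softmax over `logits`."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[label] - log_norm

def part_cross_entropy(part_logits, labels):
    """part_logits[i][j] holds the identity logits of part j of image i.
    Returns the batch mean of the cross-entropy summed over the parts."""
    total = 0.0
    for image_parts, label in zip(part_logits, labels):
        for logits in image_parts:
            total -= log_softmax_at(logits, label)
    return total / len(labels)

def local_dim_objective(part_ce, dim_loss, beta=0.01):
    """Part-level identity loss plus the beta-weighted DIM loss."""
    return part_ce + beta * dim_loss

# One image, six parts, two identities, uniform logits: six times log 2.
toy_part_ce = part_cross_entropy([[[0.0, 0.0]] * 6], [1])
```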
VI Unsupervised Transfer Learning with DIM
Because the DIM network needs no extra label information, our approach also provides a solution to the cross-dataset unsupervised ReID problem. The cross-dataset unsupervised method is called TF-DIM. TF-DIM transfers the feature representations learned on a labeled source dataset and finetunes them on the unlabeled target dataset. The concrete steps are:
step(1): Supervised part-based DIM learning on the labeled source dataset. As discussed above, we train the Global-DIM and Local-DIM networks to learn an encoder network with discriminative feature representations.
step(2): Transfer the model to the unlabeled target dataset. In this step, we load the weights of the encoder network for use on the unlabeled target dataset. The DIM network is initialized to adapt to the unlabeled target dataset.
step(3): Finetune on the unlabeled target dataset. Starting from the pretrained model, we finetune on the unlabeled dataset with the DIM loss.
Training for Global-DIM network. The Global-DIM network involves the global encoder and DIM networks. The global encoder generates the global feature representations, and the DIM network maximizes the MI between the feature representations and the input pedestrians. Because the input pedestrians and the feature representations mismatch in size, the input pedestrians $x$ and $\tilde{x}$ are in practice reduced to encoder features $f_x$ and $f_{\tilde{x}}$ through the global encoder network, and $z$ is the output of the global encoder network corresponding to the input $x$. The Global-DIM network is optimized with the SGD optimizer for 60 epochs. The learning rate is 0.3 and is divided by 10 after 40 epochs. The input images are resized to 256x128.
Training for Local-DIM network. The Local-DIM network consists of two parts: the local encoder and six DIM networks. The encoder network takes the input images and outputs six part features. The six part features correspond to the six DIM networks, and the sampling strategy in each DIM network is the same as in the Global-DIM method. The weights of the six DIM networks are shared in order to learn the correlations between the six parts. The input pedestrians are resized to 384x192. The whole Local-DIM network is trained with the SGD optimizer for 60 epochs. The learning rate is set to 0.02 and is divided by 10 after 40 epochs.
Training for TF-DIM method. The TF-DIM method is built on the Local-DIM framework. The Local-DIM model trained on the source dataset serves as the pre-trained model for transfer learning on the target dataset. The DIM network dominates the cross-dataset TF-DIM approach because its DIM loss requires no labels. We finetune on the target dataset with the SGD optimizer for 60 epochs. The learning rate is 0.00005 and is divided by 10 after 40 epochs.
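The three training configurations share the same single-step schedule, which can be written down as a small helper. This is a sketch under one assumption that the text leaves open: epochs are counted from 0, and the decay takes effect from epoch 40 onward. The names `step_lr` and `BASE_LR` are hypothetical.

```python
def step_lr(base_lr, epoch, decay_epoch=40, factor=10.0):
    """Learning rate at a 0-indexed epoch under a single step decay."""
    return base_lr / factor if epoch >= decay_epoch else base_lr

# Base learning rates quoted for the three settings.
BASE_LR = {"Global-DIM": 0.3, "Local-DIM": 0.02, "TF-DIM": 0.00005}

# Learning rate of each setting for every one of the 60 epochs.
schedules = {name: [step_lr(lr, epoch) for epoch in range(60)]
             for name, lr in BASE_LR.items()}
```

Under this reading, each model trains for 40 epochs at its base rate and for the remaining 20 epochs at one tenth of it.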
Our approach is evaluated on three ReID datasets: Market-1501 [Zheng], DukeMTMC-reID [Zheng2017, Ristani2016] and CUHK03 [Li].
Market-1501 contains 32668 annotated images of 1501 identities from six overlapped cameras. There are 12936 images of 751 identities for training, 19732 images of 750 identities for testing. The query set collects 3368 images of 750 identities to search in the gallery set.
DukeMTMC-reID consists of 36411 bounding boxes of 1404 identities. There are 16522 images of 702 identities for training and 17661 images of the other 702 identities for testing. The 2228 query images are picked from the gallery set.
CUHK03 has 14096 images of 1467 identities from two cameras. The dataset provides both manually labeled bounding boxes and DPM-detected bounding boxes. In our experiments, we use the detected bounding boxes, as they are closer to practical application. We follow the new protocol [Zhong2017] to evaluate our approach.
Important Parameters. We evaluate the DIM loss weights $\alpha$ (Global-DIM) and $\beta$ (Local-DIM) to demonstrate the effectiveness of the DIM loss. The results are shown in Fig.5 and 6. Our approach clearly improves over the IDE and PCB baselines respectively, which supports the validity of the DIM loss. In the Global-DIM network, the mAP and Rank-1 achieve the best performance when $\alpha$ equals 0.02. For the Local-DIM network, the optimal $\beta$ is 0.01.
Comparison between the random and label sampling. Label sampling uses the ground-truth identity labels to form the input pairs of the DIM network. The performance of the random and label sampling strategies is shown in Fig.7 and 8. To some extent, random sampling is superior to label sampling. This is because, when a few images are sampled from the dataset, the probability of choosing images that belong to the same identity is relatively low. Also, the random sampling strategy has an advantage in unsupervised learning, where no labels are available. Therefore, we choose the random sampling strategy for the DIM network.
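The claim that same-identity pairs are rare under random sampling can be checked with a back-of-the-envelope estimate. The sketch below assumes, for simplicity, that every identity has the same number of images and that the negative partner is drawn uniformly from all other images; real datasets are unbalanced, so the figure is only indicative. It uses the Market-1501 training split quoted above (12936 images of 751 identities); the function name is hypothetical.

```python
def same_identity_probability(num_images, num_identities):
    """Chance that a uniformly drawn *other* image shares the identity of a
    given image, assuming all identities have equally many images."""
    per_identity = num_images / num_identities
    return (per_identity - 1.0) / (num_images - 1.0)

# Market-1501 training split: roughly 17 images per identity.
p_false_negative = same_identity_probability(12936, 751)
# About 0.00125, i.e. roughly one wrong negative pair in 800.
```

Under these assumptions only about 0.1% of the randomly formed negative pairs are actually positives, which is consistent with random sampling performing on par with label sampling.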
Comparison between shared and non-shared parameters of the six DIM networks in the Local-DIM method. There are six DIM networks in our Local-DIM model. Whether the six DIM networks share parameters is a nontrivial choice, because it affects the training time. In Fig. 9, we compare the results with shared and non-shared parameters in the six DIM networks. The empirical results show that the parameter sharing strategy is superior to the non-sharing one. A possible explanation is that parameter sharing correlates the six DIM networks, which yields more discriminative part features.
Comparison between the IDE baseline and Global-DIM model. The experiments comparing the IDE and Global-DIM models are reported in Table I. The Global-DIM model gains +0.7% and +1.9% in rank-1 accuracy and mAP on Market-1501 respectively. It also gains +3.6% and +5.6% on DukeMTMC-reID. When tested on CUHK03, the Global-DIM improves rank-1 accuracy and mAP by +6.2% and +5.9%. These experiments support the validity of the Global-DIM model. Compared with the gains on the DukeMTMC-reID and CUHK03 datasets, the improvements on the Market-1501 dataset are somewhat smaller. This may be because there is less pedestrian variance in Market-1501 than in DukeMTMC-reID and CUHK03, which is consistent with our motivation.
Comparison between the PCB baseline and Local-DIM model. The performance of the PCB and Local-DIM models is reported in Table II. On Market-1501, the Local-DIM model gains +0.4% and +1.6% in rank-1 accuracy and mAP. The performance also increases by +4.3% and +3.7% respectively on DukeMTMC-reID. When evaluated on the CUHK03 dataset, the results increase by +2.7% and +3.4%. The experiments in Tables I and II demonstrate the generality of our DIM network for both global and local feature representations.
Comparison between the Direct Transfer and TF-DIM model based on Global-DIM network. The Direct Transfer method means that the target dataset is tested directly on the model trained on the source dataset. The encoder network is based on the Global-DIM network trained on the source dataset. Table III reports the comparison between the Direct Transfer and the TF-DIM model. When DukeMTMC-reID is the source dataset and Market-1501 is the target dataset, our TF-DIM method gains +0.8% and +0.6% in rank-1 accuracy and mAP. When the model is tested on the DukeMTMC-reID dataset, our approach gains +2.7% and +2.2% in rank-1 accuracy and mAP.
Comparison between the Direct Transfer and TF-DIM model based on Local-DIM network. Table IV shows the comparison between the TF-DIM model and the Direct Transfer method. The encoder network is based on the Local-DIM network trained on the source dataset. When DukeMTMC-reID is the source dataset and Market-1501 is the target dataset, our TF-DIM method gains +1.7% and +1.2% in rank-1 accuracy and mAP. The improvements are +2.6% and +1.7% in rank-1 accuracy and mAP when Market-1501 is the source dataset.
VIII-C Comparison with the state-of-the-art methods
Comparison between our approach and the state-of-the-art models. We compare our approach with the state-of-the-art ReID methods. The results are shown in Tables V and VI. Compared with the IDE and PCB baselines, our approach achieves considerable improvements in rank-1 accuracy and mAP. On the Market-1501 dataset, the Local-DIM model gains +0.4% and +1.6% in rank-1 accuracy and mAP. The performance also increases by +4.3% and +3.7% respectively on DukeMTMC-reID. When evaluated on the CUHK03 dataset, the results increase by +2.7% and +3.4%. All the experiments are evaluated under the single-query setting without re-ranking. Our approach achieves state-of-the-art results on these three datasets.
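For reference, the two reported metrics can be computed from ranked retrieval results as sketched below. This is a simplified illustration: each query is represented by a list of booleans marking which ranked gallery images share its identity, and the filtering of same-camera or junk gallery images that standard ReID protocols apply is omitted. The function names and the toy rankings are assumptions.

```python
def average_precision(matches):
    """AP for one query; matches[k] is True when the gallery image at
    rank k + 1 shares the query identity."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def evaluate(ranked_matches):
    """Return (rank-1 accuracy, mAP) over a list of per-query match lists."""
    n = len(ranked_matches)
    rank1 = sum(1 for m in ranked_matches if m and m[0]) / n
    mean_ap = sum(average_precision(m) for m in ranked_matches) / n
    return rank1, mean_ap

# Two toy queries: the first is matched at ranks 1 and 3, the second at rank 2.
toy_rankings = [[True, False, True], [False, True]]
rank1, mean_ap = evaluate(toy_rankings)
# rank1 is 0.5; the APs are 5/6 and 1/2, so mean_ap is 2/3.
```

Rank-1 only looks at the top result of each query, whereas mAP rewards placing all correct gallery images early, which is why the two metrics can move by different amounts in the tables.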
Comparison between the TF-DIM model and the state-of-the-art unsupervised domain adaptation methods. We compare our TF-DIM model with the state-of-the-art unsupervised ReID methods. The results are shown in Table VII. DukeMTMC-reID/Market-1501 is the source dataset and Market-1501/DukeMTMC-reID is the target dataset. The compared methods are divided into three categories: hand-crafted, unsupervised, and unsupervised domain adaptation methods. The Direct Transfer results are obtained by testing our Global-DIM and Local-DIM models directly on the target dataset. The results show that our TF-DIM model achieves competitive performance compared with related state-of-the-art methods.
Although our TF-DIM method shows promising performance in Table VII, its rank-1 accuracy and mAP are not yet convincing. The key to unsupervised domain adaptation in ReID is to bring the distributions of the source and target datasets together, as in [Deng2017],[Wei2017]. Our TF-DIM method instead uses a DIM loss to enhance the feature representations, which can complement such unsupervised ReID methods. We want to emphasize that although our TF-DIM model has good universality, the unsupervised ReID task needs to be further optimized in conjunction with the datasets.
|Direct Transfer (Global-DIM)||46.4||20.1||32.8||17.3|
|Direct Transfer (Local-DIM)||56.4||27.2||40.1||22.8|
In this paper, we propose a DIM network that maximizes the MI between the input image and the encoder output to boost the capability of the feature representations in ReID. Taking advantage of the MI theory, the DIM network enhances the representative features of the input pedestrians. The Global-DIM and Local-DIM networks, based on the IDE and PCB baselines, achieve state-of-the-art results, demonstrating the effectiveness of the DIM network in ReID. Furthermore, because the DIM network does not need extra labels, we propose a TF-DIM network that shows promising performance in unsupervised ReID.
This work is supported by National Natural Science Foundation of China under Grant U1613214 and National Key R&D Program of China under Grant 2017YFC0821402.