Person re-identification aims to re-identify a query person across multiple non-overlapping cameras. The task is challenging because pedestrian images from different camera views suffer from large variations in pose, lighting and background. Many earlier works solve the re-identification problem by dividing it into two separate stages: feature extraction [13, 14, 36, 32, 9, 11, 16, 31] and metric learning [18, 34, 35, 3, 10, 13, 14]. A large number of hand-crafted features have been designed to make pedestrian representations robust to pose, viewpoint and illumination changes. Once features are extracted, metric learning is applied to learn a metric under which images of the same person are close while those of different pedestrians are far apart.
Recently, deep convolutional neural networks (CNNs) have been widely adopted for person re-identification due to their advantages in feature learning. Different from previous works, CNNs learn features and metrics jointly from data in an end-to-end manner: an embedding is learned so that similarities between images can be measured by Euclidean distance. The loss function of a CNN plays an important role in its performance. Verification loss [1, 4, 5, 12, 19, 23, 24, 25, 26, 33] is popular among CNNs for person re-identification thanks to its simple motivation: reducing the variance between intra-class embeddings while increasing the distinction between inter-class ones. However, verification loss takes image pairs or triplets for training, and the number of such pairs or triplets grows rapidly with the number of classes. When there are numerous person identities, verification loss tends to show slow convergence and unstable performance. Identification loss is usually used for classification, and it has also been applied to person re-identification [29, 38, 30] due to its simplicity and discriminative ability. Though identification loss separates inter-class embeddings efficiently, it does not explicitly reduce the intra-class variance. Thus, its performance may be limited, since the embeddings of the same person can be far apart on test data due to viewpoint, pose and background variations. To absorb the merits of the above two losses, recent works [39, 6] combine them (see Fig. 1a) and have achieved promising results. However, the inefficiency issue of verification loss remains despite the performance improvement.
In addition to the inefficiency problem, existing deep embedding models pay little attention to the importance of each embedding dimension. They simply accumulate the squared difference of each dimension (Euclidean distance) to measure the distance between embeddings [4, 5, 6, 24, 25, 29, 39]. In other words, each dimension of an embedding contributes equally to the total distance. Imagine that a matched embedding and a non-matched embedding have the same Euclidean distance to the embedding of the query image (Fig. 1a); the Euclidean distance then cannot distinguish the matched one from the other. If a model learns to measure the importance of each dimension and reweights the embeddings so that important dimensions are emphasized while unimportant ones are suppressed (Fig. 1b), this problem can be alleviated. Unfortunately, few works have considered the importance of different embedding dimensions.
To overcome the above two shortcomings, this paper proposes a new CNN model for person re-identification. Specifically, we train the CNN with identification loss and center loss, which does not require image pairs or triplets as input. Center loss pulls images toward the corresponding class center so that the intra-class variance is reduced. It functions similarly to verification loss, but the learning process is more efficient. Meanwhile, we design a new feature reweighting (FRW) layer that adaptively learns the importance of each dimension. The FRW layer is placed after the embedding layer and performs an element-wise multiplication on its input. By doing so, the model gains the freedom to explicitly adjust the scales of the learned embeddings, so that less important features can be squeezed to avoid overfitting. Fig. 1b shows the structure of our proposed CNN. The contributions of this paper can be summarized as follows:
We employ identification loss with center loss to train a deep CNN model without constructing image pairs or triplets as input, thus improving the training efficiency.
We design a new FRW layer to explicitly emphasize the importance of each embedding dimension, leading to an improved embedding to boost the performance.
2 Related Work
Existing works on learning effective embeddings can be classified into two categories: 1) improving the deep CNN structure to learn discriminative embeddings; 2) designing better loss functions for deep CNN training.
CNN structure: To improve CNN embeddings, Li et al. propose to jointly handle various variations with a filter pairing component. Yi et al. design a Siamese CNN that processes divided images and computes a merged similarity score between them. Ahmed et al. propose a cross-input neighborhood difference layer to capture the local relationship between two images, together with a patch summary layer to summarize the features learned by the previous layers. Wang et al. propose a joint framework of single-image and cross-image representations to obtain a merged result. Xiao et al. propose a CNN that learns features from multiple domains. Cheng et al. design a multi-channel parts-based CNN to learn both global and local features. Varior et al. propose a gating function for CNNs to emphasize fine common local patterns, and also use Long Short-Term Memory to exploit contextual information when learning features. Shi et al. propose a moderate positive sample mining method to learn a variation-insensitive feature. Sun et al. propose adding an Eigenlayer before the last fully connected (FC) layer to learn decorrelated weight vectors. Although existing CNN structures have achieved promising results, they still suffer from the training inefficiency caused by verification loss.
Loss function: Binary identification loss, contrastive loss and triplet loss are three main types of verification loss. CNNs with binary identification loss have been used by [12, 1]. They output a binary prediction, indicating whether the two images belong to the same identity or not. Many other deep models learn an embedding for each image, and compute the similarities between embeddings based on the Euclidean distance. The works [24, 25] use contrastive loss to train a CNN, which requires a pair of image samples for training. The methods [27, 4, 26, 5, 8] employ triplet loss or its variations with CNN, which requires image triplets during training. For simplicity, the approaches [29, 30, 38]
apply identification loss to the person re-identification task, since it learns discriminative features efficiently. The combination of identification loss and verification loss has been found effective for face recognition, and it also gives excellent performance on person re-identification [39, 6]. Recently, Wen et al. propose center loss to reduce the intra-class variance in face recognition, without constructing image pairs or triplets during training. However, for person re-identification, the mainstream way to handle intra-class variance is still verification loss.
3 Our CNN Model
3.1 The Overall Architecture
Our proposed CNN model is a single CNN (different from the previous Siamese CNNs) that consists of nine convolutional layers, four max pooling layers, one FC layer, one FRW layer, and finally a softmax classifier. Fig. 3 gives a detailed illustration of the model. All the convolutional layers use 3×3 filters, and the max pooling layers use 2×2 filters with stride 2. Batch normalization is applied after each convolutional or FC layer to speed up training. A leaky rectified linear unit (LReLU) is then used after these layers as the non-linear activation function. After the FRW layer, we obtain a 512D embedding trained with identification loss and center loss.
3.2 Identification Loss and Center Loss
Identification loss aims to enlarge the inter-class distinction and is usually used for multi-class classification tasks. It can be formulated as follows:
$$L_I = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i + b_j}} \quad (1)$$

where $m$ is the batch size, $x_i$ is the $i$-th embedding of the batch, and $y_i$ is the class label of the current input. $W_j$ is the $j$-th column of the FC weights $W$, $b_j$ is the $j$-th item of the bias term $b$, and $n$ is the number of categories.
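For concreteness, the identification loss (a standard softmax cross-entropy over a batch of embeddings) can be sketched in NumPy. This is an illustrative sketch, not the paper's TensorFlow implementation; the function name and argument layout are our own.

```python
import numpy as np

def identification_loss(X, y, W, b):
    """Softmax cross-entropy over a batch of embeddings.

    X: (m, d) batch of embeddings, y: (m,) integer class labels,
    W: (d, n) FC weights, b: (n,) bias, with n categories.
    """
    logits = X @ W + b                           # (m, n) class scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    m = X.shape[0]
    return -log_probs[np.arange(m), y].mean()
```

With all-zero weights every class is equally likely, so the loss reduces to log(n).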
Center loss is proposed by Wen et al. to reduce the intra-class variance for face recognition. It maintains a center for each class, and keeps pulling each image embedding toward its corresponding center, so that the variations between image embeddings and their centers are small. It can be formulated as follows:
$$L_C = \frac{1}{2}\sum_{i=1}^{m}\left\|x_i - c_{y_i}\right\|_2^2 \quad (2)$$

where $c_{y_i}$ is the corresponding center of the embedding $x_i$. Unlike other CNN parameters, the class centers are updated by the following rule rather than by standard backpropagation:

$$c_j^{t+1} = c_j^{t} - \alpha\cdot\Delta c_j^{t}, \qquad \Delta c_j = \frac{\sum_{i=1}^{m}\delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m}\delta(y_i = j)} \quad (3)$$

where $\alpha$ is the learning rate of the centers, ranging from $0$ to $1$, and $\delta(\text{condition}) = 1$ if the condition is satisfied, otherwise $\delta(\text{condition}) = 0$.
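A minimal NumPy sketch of the center loss and the center update rule described above (the function names and the toy batch are ours; in the paper these operate inside CNN training):

```python
import numpy as np

def center_loss(X, y, centers):
    """Half the summed squared distance between each embedding
    and the center of its class."""
    diff = X - centers[y]
    return 0.5 * np.sum(diff ** 2)

def update_centers(X, y, centers, alpha=0.5):
    """Move each class center toward the batch members of that class,
    damped by 1 / (1 + count); alpha is the center learning rate."""
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        mask = (y == j)
        cnt = mask.sum()
        delta = (cnt * centers[j] - X[mask].sum(axis=0)) / (1.0 + cnt)
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```

Classes absent from the batch have a count of zero, so their centers stay put.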
During training, the two losses are optimized jointly:

$$L = L_I + \lambda L_C \quad (4)$$

where $\lambda$ is a scalar balancing the two loss functions. As Eq. (4) shows, the loss function of our model only involves batches of single image samples, which improves training efficiency over the existing deep person re-identification models.
3.3 FRW Layer
Existing works implicitly assume that every embedding dimension is equally important, ignoring the differences between dimensions. Here we argue that a CNN should have the freedom to learn such differences. In this work, a new FRW layer is proposed to reweight the learned embedding of a CNN. More specifically, the FRW layer performs an element-wise product of an embedding and the FRW weights, formulated as follows:
$$\hat{x} = w \odot x \quad (5)$$

where $x$ is a learned embedding, $w$ is the weight vector of the FRW layer, and $\odot$ denotes the element-wise product. Intuitively, the FRW layer enlarges certain dimensions of the embedding while shrinking others, strengthening the more essential features so that the similarities between embeddings are reflected more accurately by Euclidean distance. For example, the central area of a pedestrian image can be more important than the border areas. For stability, an additional constraint on the weights of the FRW layer is imposed:
$$L_w = \gamma\left(\|w\|_2^2 - C\right)^2 \quad (6)$$

where $\gamma$ controls the importance of the constraint, and $C$ is a constant that constrains the norm of the weight vector.
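The FRW layer itself is just an element-wise product; a NumPy sketch follows. The exact form of the norm constraint is not fully recoverable from the text, so the squared-deviation penalty below (and the coefficient name gamma) is an assumption for illustration.

```python
import numpy as np

def frw(x, w):
    """FRW layer: element-wise reweighting of an embedding."""
    return w * x

def frw_norm_penalty(w, C=200.0, gamma=1e-3):
    """Assumed penalty keeping ||w||^2 close to the constant C;
    the paper's exact constraint form may differ."""
    return gamma * (np.sum(w ** 2) - C) ** 2
```

The penalty vanishes exactly when the squared norm of the FRW weights equals C.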
From another perspective, the FRW layer can be seen as a part factored out of the softmax classifier. In deep embedding models, the weights of the softmax classifier are usually discarded after training, because they are trained specifically on the training classes and are useless for the different testing classes. Nevertheless, the trained softmax weights do contain general knowledge that is independent of specific classes. We can treat the trained softmax weights as two parts: one containing knowledge specific to each training class, and the other capturing general knowledge applicable to all classes. Accordingly, the softmax classifier and the FRW layer in our model handle these two kinds of knowledge respectively. Formally, the standard softmax weights can be decomposed as follows:
$$W_j = w \odot \hat{W}_j \quad (7)$$

where $W_j$ is the $j$-th column of a standard softmax classifier weight, $\hat{W}_j$ is the $j$-th column of the softmax classifier weight in our model, and $w$ is the weight of our FRW layer. From Eq. (7), we can see that the FRW layer followed by the softmax classifier in our model is equivalent to the standard softmax classifier. By separating an FRW layer from the standard softmax classifier, the learned general knowledge about feature importance can be merged into the embeddings, and is thus applicable in the testing phase.
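The decomposition can be checked numerically: reweighting the embedding and then applying our softmax weights yields exactly the same logits as a standard softmax whose columns absorb the FRW weights. A small NumPy demonstration (all tensors random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5
x = rng.normal(size=d)           # embedding before the FRW layer
w = rng.normal(size=d)           # FRW weights
W_hat = rng.normal(size=(d, n))  # softmax weights of our model

# A standard softmax weight absorbs the FRW weights column-wise.
W_std = w[:, None] * W_hat

# FRW layer + our softmax produces the same logits as the standard softmax.
logits_frw = (w * x) @ W_hat
logits_std = x @ W_std
assert np.allclose(logits_frw, logits_std)
```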
There are two types of training paradigms in this paper: 1) For the relatively large dataset (CUHK03), we simply train the model on its training set using stochastic gradient descent with mini-batches; 2) For the small datasets (CUHK01, VIPeR), we adopt a deep transfer learning method similar to that of Geng et al. We pretrain the model on large person re-identification datasets (CUHK03 + Market1501), then fine-tune it on the training set of the small dataset. Note that a two-step fine-tuning strategy is used to conduct a more effective transfer. After pretraining, the weights of the softmax classifier cannot be reused in the fine-tuning stage because the two datasets have different classes. Therefore, the softmax classifier weights are replaced by randomly initialized ones with $n_s$ output nodes, where $n_s$ is the number of classes of the small dataset. We then perform the first fine-tuning step by freezing all other parameters and training only the newly added weights until the classifier converges. After that, we fine-tune all the parameters together as the second step. The two-step tuning prevents the newly added weights from backpropagating harmful gradients to the pretrained weights of the previous layers.
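The two-step schedule above can be sketched with a toy parameter container; the class and method names below are hypothetical stand-ins, not the paper's TensorFlow code.

```python
class TinyModel:
    """Toy stand-in with a pretrained 'backbone' and a fresh 'classifier'."""
    def __init__(self):
        self.params = {"backbone": 1.0, "classifier": 0.0}
        self.trainable = {"backbone": True, "classifier": True}

    def freeze_all_except(self, name):
        for k in self.trainable:
            self.trainable[k] = (k == name)

    def unfreeze_all(self):
        for k in self.trainable:
            self.trainable[k] = True

def train_step(model, grads, lr=0.1):
    """Only trainable parameters receive gradient updates."""
    for k, g in grads.items():
        if model.trainable[k]:
            model.params[k] -= lr * g

m = TinyModel()
# Step 1: freeze everything except the newly added classifier weights.
m.freeze_all_except("classifier")
train_step(m, {"backbone": 1.0, "classifier": 1.0})  # backbone untouched
# Step 2: unfreeze all parameters and fine-tune jointly.
m.unfreeze_all()
train_step(m, {"backbone": 1.0, "classifier": 1.0})
```

After step 1 the backbone is unchanged; only in step 2 do gradients reach the pretrained weights.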
Testing is simple and efficient in the deep embedding scenario. We feed all the testing images to the CNN model to obtain an embedding for each of them, and normalize each embedding to a unit vector. Finally, we compute the Euclidean distance between all pairs from the two camera views to measure the cumulative match curve (CMC).
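The testing protocol above (normalize, rank by Euclidean distance, accumulate match ranks) can be sketched as a single-shot CMC computation in NumPy; the function name and signature are ours.

```python
import numpy as np

def cmc(query_emb, query_ids, gallery_emb, gallery_ids, top_k=10):
    """Single-shot CMC: L2-normalize embeddings, rank the gallery by
    Euclidean distance, and record the rank of the first correct match."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    hits = np.zeros(top_k)
    for qi, qid in zip(q, query_ids):
        order = np.argsort(np.linalg.norm(g - qi, axis=1))
        rank = int(np.where(gallery_ids[order] == qid)[0][0])
        if rank < top_k:
            hits[rank:] += 1  # a match at rank r counts for all ranks >= r
    return hits / len(query_ids)
```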
4 Experiments

4.1 Datasets

CUHK03: CUHK03 consists of 13164 images of 1360 identities. It provides two settings, one annotated by humans (labeled) and the other by detectors (detected). We adopt the latter setting since it is closer to practical scenarios. Following the standard protocol, we perform 20 random splits, in which 1160 identities are used for training and 100 identities for testing. The evaluation is in single shot.
CUHK01: CUHK01 contains 971 identities from two camera views, and each identity has two images per view. Following the common setting, we randomly select one image per identity per view for both training and testing. Then 485 identities are randomly selected for training, and the remaining 486 are for testing. The evaluation is based on 10 random splits, in single shot.
VIPeR: VIPeR  contains 632 identities with two camera views. Each identity from each view has one image. Half of the identities are used for training, and the other half are for testing. The evaluation is also based on 10 random splits, in single shot.
4.2 Data Preparation
To reduce overfitting, we conduct data augmentation on each dataset. Each training image is augmented by 2D random translation, as in [1, 12]: we sample three translated images for each training image, as well as a horizontal reflection. Each image is then resized to a fixed size, and the mean image of each training set is subtracted.
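A minimal sketch of the augmentation step, assuming a small translation range (the actual shift parameters are not given in the text). Note that np.roll wraps at the borders, whereas a real pipeline would pad and crop.

```python
import numpy as np

def augment(img, max_shift=5, rng=None):
    """Return one randomly translated copy and one horizontal reflection
    of an (H, W) or (H, W, C) image array."""
    rng = rng if rng is not None else np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(img, (int(dy), int(dx)), axis=(0, 1))
    flipped = img[:, ::-1]
    return shifted, flipped
```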
| Method | Rank 1 | Rank 5 | Rank 10 |
|---|---|---|---|
| Siamese LSTM | 57.3 | 80.1 | 88.3 |
| Gated S-CNN | 61.8 | 80.9 | 88.3 |
| Joint Learning | 52.2 | - | - |
| CNN Embedding | 66.1 | 90.1 | 95.5 |
| Deep Transfer * | 84.1 | - | - |
* Deep Transfer uses ImageNet for pretraining, while CNN-IV is our implementation of the same loss combination without ImageNet pretraining.
4.3 Models for Comparison
We compare our proposed model with a number of existing methods, including state-of-the-art ones. For a systematic comparison, we also implement several baseline models. We name the proposed model (Fig. 3) CNN-FRW-IC. We implement a version without the FRW layer, named CNN-IC, to check the effectiveness of the FRW layer, and a version where the FRW layer is replaced by an FC layer, named CNN-FC-IC; the extra FC layer tests whether simply increasing the depth of the network improves accuracy. There is also a version that uses only identification loss without the FRW layer, named CNN-I.
To our knowledge, the best accuracy among the existing deep embedding models is achieved by combining identification loss and verification loss (binary identification loss). We therefore implement a Siamese CNN with these two losses under our framework, without the FRW layer, named CNN-IV.
4.4 Training Settings
Our models are implemented in TensorFlow. We use the Adam optimizer, with exponential decay rates of 0.9 and 0.999 for the 1st and 2nd moment estimates, respectively. The number of training iterations is 25k. The initial learning rate is 0.001, multiplied by 0.1 after 22k iterations. The batch size is 100 and the weight decay is 0.001. For the center loss, the updating rate of the centers and the balance coefficient are fixed throughout training; the balance coefficient of the FRW layer constraint is likewise fixed, and the constant C is set to 200.
| Method | Rank 1 | Rank 5 | Rank 10 |
|---|---|---|---|
| ℓ1 GL | 41.5 | - | - |
| Siamese LSTM | 42.4 | 68.7 | 79.4 |
| Metric Ensemble | 45.9 | 77.5 | 88.9 |
| Gated S-CNN | 37.8 | 66.9 | 77.4 |
| Joint Learning | 35.8 | - | - |
| Deep Transfer * | 56.3 | - | - |
| Method | Rank 1 | Rank 5 | Rank 10 |
|---|---|---|---|
| ℓ1 GL | 50.1 | - | - |
| Deep Transfer * | 77.0 | - | - |
4.5 Results on CUHK03
From Table 1, we can see that the model with only identification loss obtains the worst accuracy among the baseline models. The accuracy of our implementation of identification loss plus verification loss is slightly worse than the originally reported one, as the original uses extra ImageNet data for pretraining. Identification loss with center loss obtains the same accuracy as identification loss with verification loss, which verifies the effectiveness of center loss. With the newly designed FRW layer, the performance is further improved. In contrast, the performance drops slightly when a naive FC layer is added, which indicates that simply adding more layers does not bring any improvement. Among the models that do not use extra training data, our proposed model CNN-FRW-IC achieves the best rank 1, rank 5 and rank 10 accuracy on CUHK03 (detected).
4.6 Results on VIPeR and CUHK01
The results on VIPeR and CUHK01 are shown in Tables 2 and 3, respectively. Similar to CUHK03, the results of our implementation of identification and verification loss are not as good as the originally reported ones due to the lack of ImageNet pretraining. CNN-IC reaches higher performance than CNN-IV on VIPeR but lower performance on CUHK01. The model with the FRW layer improves accuracy on both datasets and outperforms most of the existing models. Similarly, adding an FC layer reduces the accuracy.
4.7 Comparison on Convergence Speed
Intuitively, center loss is more efficient to train than verification loss, since it takes batches of single image samples as input instead of person pairs or triplets. We conduct a comparative experiment on the two types of losses to see how their convergence speeds differ in practice. We reduce the training iterations to 5k, with the learning rate decayed by 0.1 at 4k iterations; the other settings remain the same as before. From Table 4, we see that the center-loss model performs better than the verification model. It is worth noting that the center-loss model trained for 5k iterations is slightly better than the one trained for 25k iterations, indicating that it has already converged within 5k iterations. On the contrary, the accuracy of the verification model drops when the number of training iterations is reduced. Therefore, center loss does converge faster than verification loss. More importantly, on larger person re-identification datasets, the efficiency gap between the two losses will be even more significant.
In this paper, we have proposed a novel CNN architecture for person re-identification. The proposed architecture utilizes identification loss and center loss to jointly balance the intra- and inter-class distances. By using center loss, our model is more efficient than one using verification loss. Our model also contains a new FRW layer that learns to reweight each dimension of the learned embedding, giving the network more freedom to distribute importance across dimensions. Based on the experimental results on CUHK03, CUHK01 and VIPeR, our proposed CNN outperforms the state-of-the-art competitors in most cases.
This work was supported by the National Key Research and Development Plan (Grant No.2016YFC0801003), and the Chinese National Natural Science Foundation Projects #61672521, #61473291, #61572501, #61502491, #61572536.
-  E. Ahmed, M. Jones, and T. Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015.
-  I. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis. Looking beyond appearances: Synthetic training data for deep cnns in re-identification. arXiv preprint arXiv:1701.03153, 2017.
-  D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In CVPR, 2016.
-  D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In CVPR, 2016.
-  S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015.
-  M. Geng, Y. Wang, T. Xiang, and Y. Tian. Deep transfer learning for person re-identification. arXiv preprint arXiv:1611.05244, 2016.
-  D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In IEEE International Workshop on PETS, 2007.
-  A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
-  S. Khamis, C.-H. Kuo, V. Singh, V. Shet, and L. Davis. Joint learning for attribute-consistent person re-identification. In ECCV, 2014.
-  E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Person re-identification by unsupervised ℓ1 graph learning. In ECCV, 2016.
-  W. Li and X. Wang. Locally aligned feature transforms across views. In CVPR, 2013.
-  W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
-  S. Liao, Y. Hu, X. Zhu, and S. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
-  S. Liao and S. Li. Efficient psd constrained asymmetric metric learning for person re-identification. In ICCV, 2015.
-  H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, 2016.
-  B. Ma, Y. Su, and F. Jurie. A novel image representation for person re-identification and face verification. In BMVC, 2012.
-  N. Martinel, A. Das, C. Micheloni, and A. Roy-Chowdhury. Temporal model adaptation for person re-identification. In ECCV, 2016.
-  S. Paisitkriangkrai, C. Shen, and A. Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
-  H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Li. Embedding deep metric for person re-identification: A study against large variations. In ECCV, 2016.
-  H. Shi, X. Zhu, S. Liao, Z. Lei, Y. Yang, and S. Li. Constrained deep metric learning for person re-identification. CoRR, 2015.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
-  Y. Sun, L. Zheng, W. Deng, and S. Wang. SVDNet for pedestrian retrieval. arXiv preprint arXiv:1703.05693, 2017.
-  E. Ustinova, Y. Ganin, and V. Lempitsky. Multiregion bilinear convolutional neural networks for person re-identification. CoRR, 2015.
-  R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, 2016.
-  R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In ECCV, 2016.
-  F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, 2016.
-  J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
-  T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
-  T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. End-to-end deep learning for person search. arXiv preprint arXiv:1604.01850, 2016.
-  F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, 2014.
-  Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Li. Salient color names for person re-identification. In ECCV, 2014.
-  D. Yi, Z. Lei, S. Liao, and S. Li. Deep metric learning for person re-identification. In ICPR, 2014.
-  L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
-  Y. Zhang, B. Li, H. Lu, A. Irie, and X. Ruan. Sample-specific svm learning for person re-identification. In CVPR, 2016.
-  R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
-  L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. arXiv preprint arXiv:1604.02531, 2016.
-  Z. Zheng, L. Zheng, and Y. Yang. A discriminatively learned cnn embedding for person re-identification. arXiv preprint arXiv:1611.05666, 2016.
-  Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.