1 Introduction
Person re-identification aims to re-identify a query person across multiple non-overlapping cameras. The task is challenging since pedestrian images from different camera views suffer from large variations in pose, lighting and background. Many earlier works solve the re-identification problem by dividing it into two separate stages: feature extraction [13, 14, 36, 32, 9, 11, 16, 31] and metric learning [18, 34, 35, 3, 10, 13, 14]. A large number of handcrafted features have been designed to make pedestrian representations robust to pose, viewpoint and illumination changes. Once features are extracted, metric learning is applied to learn a metric under which images of the same person are close while those of different pedestrians are far away from each other.

In recent years, convolutional neural networks (CNNs) have achieved promising results on person re-identification [1, 4, 5, 6, 12, 15, 19, 20, 23, 24, 25, 26, 29, 33, 39, 2, 40]
due to their advantages in feature learning. Different from previous works, CNNs learn features and metrics jointly from data in an end-to-end manner: an embedding is learned so that the similarities between images can be measured by Euclidean distance. The loss function of a CNN plays an important role in its performance. Verification loss
[1, 4, 5, 12, 19, 23, 24, 25, 26, 33] is popular among CNNs for person re-identification, benefiting from its simple motivation: reducing the variance between intra-class embeddings while increasing the distinction between inter-class ones. However, verification loss takes image pairs or triplets for training, whose number grows rapidly with the number of classes. When there are numerous person identities, verification loss is likely to show slow convergence and unstable performance. Identification loss is usually used for classification tasks, and it has also been applied to person re-identification [29, 38, 30] due to its simplicity and discriminative ability. Although identification loss separates inter-class embeddings efficiently, it does not explicitly reduce the intra-class variance. Thus, the performance may be limited since the embeddings of the same person can still be far apart on test data due to viewpoint, pose and background variations. To absorb the merits of the above two losses, recent works [39, 6] tend to combine them (see Fig. 1(a)), and have achieved promising results. However, the inefficiency issue of verification loss remains despite the performance improvement.

In addition to the inefficiency problem, the existing deep embedding models pay little attention to the importance of each embedding dimension. They simply accumulate the squared difference of each dimension (Euclidean distance) to measure the distance between embeddings [4, 5, 6, 24, 25, 29, 39]. In other words, each dimension of an embedding contributes equally to the total distance. Imagine that a matched embedding and a non-matched embedding have the same distance to the embedding of the query image (Fig. 1(a)); the Euclidean distance alone cannot distinguish the matched one from the two. If a model learns to measure the importance of each dimension, and then reweights the embeddings so that important dimensions are emphasized while unimportant ones are suppressed (Fig. 1(b)), this problem can be alleviated. Unfortunately, few works have considered the importance of different embedding dimensions.
To overcome the above two shortcomings, this paper proposes a new CNN model for person re-identification. Specifically, we employ identification loss together with center loss to train the CNN, which does not require image pairs or triplets as input. Center loss [28] pulls each image embedding toward its corresponding class center so that the intra-class variance is reduced. It functions similarly to verification loss, but the learning process is more efficient. Meanwhile, we design a new feature reweighting (FRW) layer that adaptively learns the importance of each dimension. The FRW layer is placed after the embedding layer and performs element-wise multiplication on its input. By doing so, the model gains the freedom to explicitly adjust the scales of the learned embeddings so that less important features can be squeezed to avoid overfitting. Fig. 1(b) shows the structure of our proposed CNN. The contributions of this paper can be summarized as follows:

We employ identification loss with center loss to train a deep CNN model without constructing image pairs or triplets as input, thus improving the training efficiency.

We design a new FRW layer to explicitly emphasize the importance of each embedding dimension, leading to an improved embedding to boost the performance.
2 Related Work
To learn effective embeddings, existing works can be classified into two categories: 1) improving the deep CNN structure to learn discriminative embeddings; 2) designing better loss functions for deep CNN training.
CNN structure: To improve CNN embeddings, Li et al. [12] propose to jointly handle various variations with a filter pairing component. Yi et al. [33] design a Siamese CNN that handles divided images to compute a merged similarity score between images. Ahmed et al. [1] propose a cross-input neighborhood difference layer to capture local relationships between two images, as well as a patch summary layer to summarize the features learned from the previous layers. Wang et al. [26] propose a joint framework of single-image representation and cross-image representation to get a merged result. Xiao et al. [29] propose a CNN that learns features from multiple domains. Cheng et al. [4] design a multi-channel parts-based CNN to learn both global and local features. Varior et al. [24] propose a gating function for CNNs to emphasize fine common local patterns. Varior et al. [25] use Long Short-Term Memory to emphasize contextual information for learning features. Shi et al. [19] propose a moderate positive sample mining method to learn a variation-insensitive feature. Sun et al. [22] propose to add an Eigenlayer before the last fully connected (FC) layer to learn decorrelated weight vectors. Although existing CNN structures have achieved promising results, they still suffer from the learning inefficiency problem caused by verification loss.

Loss function: Binary identification loss, contrastive loss and triplet loss are three main types of verification loss. CNNs with binary identification loss have been used by [12, 1]. They output a binary prediction indicating whether two images belong to the same identity or not. Many other deep models learn an embedding for each image, and compute the similarities between embeddings based on Euclidean distance. The works [24, 25] use contrastive loss to train a CNN, which requires image pairs for training. The methods [27, 4, 26, 5, 8] employ triplet loss or its variations, which require image triplets during training. Owing to its simplicity, the approaches [29, 30, 38] apply identification loss to the person re-identification task, since it learns discriminative features efficiently. The combination of identification loss and verification loss has been found effective on face recognition [21], and it also gives excellent performance on person re-identification [39, 6]. Recently, the work [28] proposes center loss to reduce the intra-class variance on the face recognition task, without constructing image pairs or triplets during training. However, for person re-identification, the mainstream loss to handle the intra-class variance is still verification loss.

3 Our CNN Model
3.1 The Overall Architecture
Our proposed CNN model is a single CNN (different from the previous Siamese CNNs) that consists of nine convolutional layers, four max pooling layers, one FC layer, one FRW layer, and finally a softmax classifier. Fig. 3 gives a detailed illustration of the model. All the convolutional layers use 3×3 filters with stride 1 and zero padding. The max pooling layers all have 2×2 filters with stride 2. Batch normalization is applied after each convolutional or FC layer to speed up training, followed by the leaky rectified linear unit (LReLU) as the nonlinear activation function. After the FRW layer, we get a 512-D embedding supervised by identification loss and center loss.
3.2 Identification Loss and Center Loss
Identification loss aims to enlarge the inter-class distinction and is usually used for multi-class classification tasks. It can be formulated as follows:
L_I = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i + b_j}}    (1)
where m is the batch size, x_i is the i-th embedding of the batch, and y_i is the class label of the current input. W_j is the j-th column of the FC weights W, b_j is the j-th item of the bias term b, and n is the number of categories.
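As a concrete sketch, the identification loss of Eq. (1) is ordinary softmax cross-entropy over identity labels. The NumPy version below (rather than the paper's TensorFlow implementation) is illustrative; variable names follow the symbols above:

```python
import numpy as np

def identification_loss(X, y, W, b):
    """Softmax cross-entropy (Eq. 1) over a batch of embeddings.

    X: (m, d) batch of embeddings, y: (m,) integer identity labels,
    W: (d, n) FC weights, b: (n,) bias, n = number of identities.
    """
    logits = X @ W + b                                   # (m, n) class scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```

With all-zero weights the predicted distribution is uniform, so the loss equals log(n), a handy sanity check when implementing the layer.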
Center loss is proposed by Wen et al. [28] to reduce the intra-class variance for face recognition. It maintains a center point for each class, and keeps pushing each image embedding toward its corresponding center so that the variations between image embeddings and their centers are small. It can be formulated as follows:
L_C = \frac{1}{2}\sum_{i=1}^{m}\left\|x_i - c_{y_i}\right\|_2^2    (2)
where c_{y_i} is the corresponding center of the embedding x_i. Specifically, unlike the other CNN parameters, the class centers c_j are updated by the following rule instead of by backpropagation:
\Delta c_j = \frac{\sum_{i=1}^{m}\delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m}\delta(y_i = j)}, \qquad c_j^{t+1} = c_j^{t} - \alpha\,\Delta c_j^{t}    (3)
where \alpha is the learning rate of the centers, ranging from 0 to 1, and \delta(condition) = 1 if the condition is satisfied, otherwise \delta(condition) = 0.
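The two formulas above can be sketched directly in NumPy; this is an illustrative sketch of Eqs. (2) and (3), not the paper's TensorFlow code:

```python
import numpy as np

def center_loss(X, y, centers):
    # Eq. (2): half the squared distance of each embedding to its class center
    return 0.5 * np.sum((X - centers[y]) ** 2)

def update_centers(X, y, centers, alpha=0.5):
    # Eq. (3): move each center toward the mean of this batch's class samples;
    # the +1 in the denominator keeps classes absent from the batch stable
    # (their delta is zero, so their centers do not move).
    new_centers = centers.copy()
    for j in range(len(centers)):
        mask = (y == j)
        delta = (centers[j] - X[mask]).sum(axis=0) / (1.0 + mask.sum())
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```

Note that each update only needs the current batch and the center table, which is why no pair or triplet construction is required.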
During training, the two losses are jointly optimized as:
L = L_I + \lambda L_C    (4)
where \lambda is a scalar balancing the two loss functions. As we can see from Eq. (4), the loss function of our model only involves batches of single image samples, which improves training efficiency over the existing deep person re-identification models.
3.3 FRW Layer
In existing works, the importance of each embedding dimension has always been assumed to be equal, ignoring the differences between dimensions. Here we argue that a CNN should have the freedom to learn such differences. In this work, a new FRW layer is proposed to reweight the learned embedding of a CNN. More specifically, the FRW layer performs an element-wise product of an embedding and the FRW weights, formulated as follows:
\hat{x} = w \odot x    (5)
where x is a learned embedding, w denotes the weights of the FRW layer, and \odot denotes the element-wise product. Intuitively, the FRW layer enlarges certain dimensions of the embedding while shrinking others, strengthening the more essential features so that the similarities between embeddings can be reflected more accurately by Euclidean distance. For example, the central area of a pedestrian image can be more important than the border areas. For stability, an additional constraint on the weights of the FRW layer is imposed:
L_R = \eta\left(\|w\|_2^2 - \epsilon\right)^2    (6)
where \eta controls the importance of the constraint, and \epsilon is a constant constraining the norm of the weight vector.
From another perspective, the FRW layer can be seen as a part separated from the softmax classifier. In deep embedding models, the weights of the softmax classifier are usually discarded after training because they are trained specifically on the training classes and are useless for the different testing classes. Nevertheless, the trained softmax weights do contain general knowledge that is independent of the classes. We can treat the trained softmax weights as two parts: one containing knowledge specific to each training class, and the other learning general knowledge applicable to all classes. Accordingly, the softmax classifier and the FRW layer in our model handle these two kinds of knowledge respectively. Formally, the standard softmax weights can be decomposed as follows:
W'_j = w \odot W_j    (7)
where W'_j is the j-th column of a standard softmax classifier weight, W_j is the j-th column of the softmax classifier weight from our model, and w is the weight of our FRW layer. From Eq. (7), we can see that the FRW layer plus the softmax classifier in our model is equivalent to the standard softmax classifier. By separating an FRW layer from the standard softmax classifier, the learned general knowledge about feature importance can be merged into the embeddings, and is thus applicable in the testing phase.
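The decomposition in Eq. (7) is easy to verify numerically: scaling the embedding element-wise before the softmax classifier produces the same logits as absorbing the FRW weights into the classifier columns. A sketch with arbitrary (made-up) dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                        # embedding size and class count (illustrative)
x = rng.normal(size=d)             # learned embedding x
w = rng.uniform(0.5, 1.5, size=d)  # FRW weights w, one scale per dimension
W = rng.normal(size=(d, n))        # softmax weights of our model

x_hat = w * x                      # Eq. (5): FRW output
logits_frw = x_hat @ W             # FRW layer followed by softmax classifier
logits_std = x @ (w[:, None] * W)  # standard classifier with columns w ⊙ W_j
assert np.allclose(logits_frw, logits_std)
```

Because the two are algebraically identical at training time, the only practical difference is where the scaling lives: in our model it stays with the embedding, so it survives after the classifier is discarded.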
3.4 Training
There are two types of training paradigms in this paper: 1) For the relatively large dataset (CUHK03 [12]), we simply train the model on its training set using stochastic gradient descent with mini-batches; 2) For small datasets (CUHK01 [11], VIPeR [7]), we adopt a deep transfer learning method similar to [6]. We pretrain the model on large person re-identification datasets (CUHK03 [12] + Market-1501 [37]), then fine-tune it on the corresponding training set of the small dataset. Note that the two-step fine-tuning strategy from [6] is used in this paper to conduct a more effective transfer. After pretraining, the weights of the softmax classifier cannot be reused in the fine-tuning stage because the two datasets have different classes. Therefore, the softmax classifier weights are replaced by randomly initialized ones with n' nodes, where n' is the number of classes of the small dataset. Then, in the first fine-tuning step, we freeze the other parameters and only train the newly added weights until the classifier converges. After that, we fine-tune all the parameters together as the second step. The reason for the two-step tuning is to prevent the newly added weights from backpropagating harmful gradients to the pretrained weights of the previous layers.

3.5 Testing
Testing is simple and efficient in the deep embedding scenario. We feed all the testing images to the CNN model to get an embedding for each of them. Then we normalize each embedding to a unit vector. Finally, we compute the Euclidean distance between all pairs from the two camera views to measure the cumulative match curve (CMC).
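The steps above (normalize, rank by Euclidean distance, accumulate match ranks) can be sketched as a small NumPy routine; the function name and signature are illustrative, not from the paper:

```python
import numpy as np

def cmc(query, gallery, q_ids, g_ids, max_rank=10):
    """Single-shot CMC: L2-normalize embeddings, rank the gallery by
    Euclidean distance, and record the rank of the true match."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    dists = np.linalg.norm(q[:, None, :] - g[None, :, :], axis=2)
    hits = np.zeros(max_rank)
    for i, row in enumerate(dists):
        order = np.argsort(row)
        rank = int(np.where(g_ids[order] == q_ids[i])[0][0])
        if rank < max_rank:
            hits[rank:] += 1  # a hit at rank r counts toward all ranks >= r
    return hits / len(q_ids)
```

Rank-1, rank-5 and rank-10 accuracies as reported in the tables below are just entries of this curve.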
4 Experiments
4.1 Datasets
CUHK03: CUHK03 [12] consists of 13,164 images of 1,360 identities. It provides two settings, one annotated by humans (labeled) and the other by a pedestrian detector (detected). We adopt the latter setting since it is closer to practical scenarios. Following the protocol in [12], we perform 20 random splits, wherein 1,160 identities are used for training and 100 identities for testing. The evaluation is in single-shot mode.
CUHK01: CUHK01 [11] contains 971 identities with two camera views, and each identity has two images per view. Following the setting in [6], we randomly select one image per identity per view for both training and testing. Then 485 identities are randomly selected for training, and the remaining 486 for testing. The evaluation is based on 10 random splits, in single-shot mode.
VIPeR: VIPeR [7] contains 632 identities with two camera views, each identity having one image per view. Half of the identities are used for training, and the other half for testing. The evaluation is also based on 10 random splits, in single-shot mode.
4.2 Data Preparation
To reduce overfitting, we conduct data augmentation on each dataset. Each training image is augmented by 2D random translation as in [1, 12]: we sample three translated images for each training image, as well as a horizontal reflection. Each image is resized to a fixed size. The mean image of the corresponding training set is subtracted from each image.
Table 1: Results on CUHK03 (detected).

Method  Rank 1  Rank 5  Rank 10
XQDA [13]  46.3  78.9  88.6
MLAPG [14]  51.2  –  –
DNS [34]  54.7  84.8  94.8
LSSCDL [35]  51.2  –  –
Siamese LSTM [25]  57.3  80.1  88.3
IDLA [1]  45.0  76.0  83.5
Gated S-CNN [24]  61.8  80.9  88.3
EDM [19]  52.0  –  –
Joint Learning [26]  52.2  –  –
CAN [15]  63.1  82.9  88.2
CNN Embedding [39]  66.1  90.1  95.5
Deep Transfer [6]*  84.1  –  –
CNN-I  75.0  92.1  95.9
CNN-IV  80.2  94.9  97.3
CNN-IC  80.2  96.1  97.9
CNN-FC-IC  79.8  95.6  97.6
CNN-FRW-IC  82.1  96.2  98.2
* Deep Transfer [6] uses ImageNet for pretraining, while CNN-IV is our implementation of [6] without ImageNet pretraining.

4.3 Models for Comparison
We compare our proposed model with a number of existing methods, including state-of-the-art ones. In order to have a systematic comparison, we also implement several baseline models. We name the proposed model (Fig. 3) CNN-FRW-IC. We implement a version without the FRW layer to check its effectiveness, named CNN-IC. We also have one where the FRW layer is replaced by an FC layer, named CNN-FC-IC; we add the extra FC layer to check whether simply increasing the depth of the network improves accuracy. There is also a version that only uses identification loss without the FRW layer, named CNN-I.
To our knowledge, [6] gives the best accuracy among the existing deep embedding models, using identification loss and verification loss (binary identification loss) as the loss function. We therefore implement a Siamese CNN with the two losses under our framework without the FRW layer, named CNN-IV.
4.4 Training Settings
Our models are implemented in TensorFlow. We use the Adam optimizer to update the parameters, where the exponential decay rates for the 1st and 2nd moment estimates are 0.9 and 0.999, respectively. The number of training iterations is 25k. The initial learning rate is 0.001, decayed by 0.1 after 22k iterations. The batch size is set to 100. The weight decay is 0.001. As for the center loss, we fix the updating rate \alpha of the centers and its balance coefficient \lambda; the balance coefficient \eta of the FRW layer is likewise fixed, and the constant \epsilon is set to 200.

Table 2: Results on VIPeR.

Method  Rank 1  Rank 5  Rank 10

SCSP [3]  53.5  82.6  91.5
LSSCDL [35]  42.7  84.3  91.9
TMA [17]  43.8  –  83.9
ℓ1-GL [10]  41.5  –  –
Siamese LSTM [25]  42.4  68.7  79.4
Metric Ensemble [18]  45.9  77.5  88.9
DNS [34]  51.2  82.1  90.5
IDLA [1]  34.8  63.6  75.6
DGD [29]  38.6  –  –
MCP-CNN [4]  47.8  74.7  84.8
Gated S-CNN [24]  37.8  66.9  77.4
EDM [19]  40.9  –  –
Joint Learning [26]  35.8  –  –
Deep Transfer [6]*  56.3  –  –
CNN-I  39.1  61.3  70.5
CNN-IV  47.2  72.6  82.3
CNN-IC  49.3  77.3  87.0
CNN-FC-IC  46.6  74.4  84.3
CNN-FRW-IC  50.4  77.6  85.8
Table 3: Results on CUHK01.

Method  Rank 1  Rank 5  Rank 10
ℓ1-GL [10]  50.1  –  –
DNS [34]  69.1  86.9  91.8
IDLA [1]  47.5  71.6  80.3
DGD [29]  66.6  –  –
MCP-CNN [4]  53.7  84.3  91.0
Deep Transfer [6]*  77.0  –  –
CNN-I  63.4  84.4  90.5
CNN-IV  74.4  91.3  95.0
CNN-IC  70.1  90.5  94.8
CNN-FC-IC  66.1  88.2  93.0
CNN-FRW-IC  70.5  90.0  94.8
4.5 Results on CUHK03
From Table 1, we can see that the model with only identification loss (CNN-I) gets the worst accuracy among the baseline models. The accuracy of our implementation of identification loss plus verification loss is slightly worse than the one reported in [6], since they use extra ImageNet data for pretraining. Identification loss with center loss achieves the same accuracy as identification loss with verification loss, which verifies the effectiveness of center loss. With the newly designed FRW layer, the performance is further improved. In contrast, the performance drops slightly when a naive FC layer is added, which indicates that simply adding more layers does not bring any improvement. Among the models that do not use extra training data, our proposed CNN-FRW-IC achieves the best rank 1, rank 5 and rank 10 accuracy on CUHK03 (detected).
Table 4: Results on CUHK03 (detected) with 5k training iterations.

Method  Rank 1  Rank 5  Rank 10
CNN-IV  77.0  93.1  96.6
CNN-IC  80.7  95.8  97.7
4.6 Results on VIPeR and CUHK01
The results on VIPeR and CUHK01 are shown in Tables 2 and 3, respectively. Similar to CUHK03, the results of our implementation of identification and verification loss are not as good as [6] due to the lack of ImageNet pretraining. CNN-IC reaches a higher performance than CNN-IV on VIPeR but a worse one on CUHK01. The model with the FRW layer shows improved accuracy on both datasets, and it outperforms most of the existing models. Similarly, adding an FC layer reduces the accuracy.
4.7 Comparison on Convergence Speed
Intuitively, center loss is more efficient to train than verification loss since it takes batches of single image samples as input instead of person pairs or triplets. We conduct a comparative experiment on the two types of losses to see how their convergence speed differs in practice. We reduce the training iterations to 5k, with the learning rate decayed by 0.1 at 4k iterations; the other settings remain the same as before. From Table 4, we see that the model trained with center loss performs better than the verification model. It is worth noting that the accuracy of the center loss model with 5k training iterations is slightly better than with 25k iterations, indicating that it has already converged within 5k iterations. On the contrary, the accuracy of the verification model drops when the number of training iterations is reduced. Therefore, center loss does converge faster than verification loss. More importantly, when a larger person re-identification dataset is used, the efficiency gap between the two losses will be even more significant.
5 Conclusion
In this paper, we have proposed a novel CNN architecture for person re-identification. The proposed architecture utilizes identification loss and center loss jointly to balance the intra-/inter-class distances. By using center loss, our model becomes more efficient than one trained with verification loss. Our model also contains a new FRW layer that learns to reweight each dimension of the learned embedding, giving the network more freedom to distribute importance across dimensions. Based on the experimental results on CUHK03, CUHK01 and VIPeR, our proposed CNN outperforms the state-of-the-art competitors in most cases.
Acknowledgements
This work was supported by the National Key Research and Development Plan (Grant No. 2016YFC0801003) and the Chinese National Natural Science Foundation Projects #61672521, #61473291, #61572501, #61502491, #61572536.
References

[1] E. Ahmed, M. Jones, and T. Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015.
[2] I. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis. Looking beyond appearances: Synthetic training data for deep CNNs in re-identification. arXiv preprint arXiv:1701.03153, 2017.
[3] D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In CVPR, 2016.
[4] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 2016.
[5] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015.
[6] M. Geng, Y. Wang, T. Xiang, and Y. Tian. Deep transfer learning for person re-identification. arXiv preprint arXiv:1611.05244, 2016.
[7] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In IEEE International Workshop on PETS, 2007.
[8] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[9] S. Khamis, C.-H. Kuo, V. Singh, V. Shet, and L. Davis. Joint learning for attribute-consistent person re-identification. In ECCV, 2014.
[10] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Person re-identification by unsupervised ℓ1 graph learning. In ECCV, 2016.
[11] W. Li and X. Wang. Locally aligned feature transforms across views. In CVPR, 2013.
[12] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
[13] S. Liao, Y. Hu, X. Zhu, and S. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
[14] S. Liao and S. Li. Efficient PSD constrained asymmetric metric learning for person re-identification. In ICCV, 2015.
[15] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. In IEEE Transactions on Image Processing, 2016.
[16] B. Ma, Y. Su, and F. Jurie. A novel image representation for person re-identification and face verification. In BMVC, 2012.
[17] N. Martinel, A. Das, C. Micheloni, and A. Roy-Chowdhury. Temporal model adaptation for person re-identification. In ECCV, 2016.
[18] S. Paisitkriangkrai, C. Shen, and A. Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
[19] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Li. Embedding deep metric for person re-identification: A study against large variations. In ECCV, 2016.
[20] H. Shi, X. Zhu, S. Liao, Z. Lei, Y. Yang, and S. Li. Constrained deep metric learning for person re-identification. CoRR, 2015.
[21] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[22] Y. Sun, L. Zheng, W. Deng, and S. Wang. SVDNet for pedestrian retrieval. arXiv preprint arXiv:1703.05693, 2017.
[23] E. Ustinova, Y. Ganin, and V. Lempitsky. Multi-region bilinear convolutional neural networks for person re-identification. CoRR, 2015.
[24] R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, 2016.
[25] R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In ECCV, 2016.
[26] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, 2016.

[27] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[28] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[29] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
[30] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. End-to-end deep learning for person search. arXiv preprint arXiv:1604.01850, 2016.
[31] F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, 2014.
[32] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Li. Salient color names for person re-identification. In ECCV, 2014.
[33] D. Yi, Z. Lei, S. Liao, and S. Li. Deep metric learning for person re-identification. In ICPR, 2014.
[34] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
[35] Y. Zhang, B. Li, H. Lu, A. Irie, and X. Ruan. Sample-specific SVM learning for person re-identification. In CVPR, 2016.
[36] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.
[37] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
[38] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. arXiv preprint arXiv:1604.02531, 2016.
[39] Z. Zheng, L. Zheng, and Y. Yang. A discriminatively learned CNN embedding for person re-identification. arXiv preprint arXiv:1611.05666, 2016.
[40] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.