With the growing popularity of smartphones and digital cameras, people today take more photos to record daily life and stories. This trend creates a pressing demand for smart tools that recognize the same person (known as the query) across different times and places, among thousands of images from personal data, social media or the Internet.
Previous work [Anguelov et al.2007, Oh et al.2016, Oh et al.2015, Li et al.2016c, Zhang et al.2015, Li et al.2016a] has demonstrated that person recognition in such unconstrained settings remains a challenging problem due to many factors, such as non-frontal faces, varying light and illumination, the variability in appearance, texture of identities, etc.
The recently proposed PIPA dataset [Zhang et al.2015] contains thousands of images with complicated scenarios and similar appearance among persons. The illumination, scale and context of the data vary a lot, and many instances have partial or even no faces. Figure 1 shows some samples from both the training and test sets. Previous work [Zhang et al.2015, Oh et al.2015, Li et al.2016a] resorts to identifying the same person in a multi-cue, multi-level manner, where the training set is used only for extracting features and the follow-up classifier (SVM or neural network) is trained on the test_0 set (we denote test_0 as the reference set throughout the paper). The recognition system is evaluated on the test_1 set. We argue that such a practice is infeasible and ad hoc in realistic applications, since the second training on test_0 is auxiliary and needs re-training whenever new samples are added. Instead, we aim at providing a set of robust and well-generalized feature representations, trained directly on the training set, and at identifying the person by measuring the feature similarity between the two splits of the test set. There is no need to train on test_0, and the system still performs well even if new data comes in.
As shown at the bottom of Figure 1, the woman in the green box wears a scarf similar to that of the person in the light green box. Her face is shown partially or does not appear in some cases. To obtain robust feature representations, we train several deep models for different regions and combine the similarity scores of features from different regions to predict one’s identity. Our key observation is that during training, the cross-entropy loss does not guarantee the similarity among samples within a category. It magnifies the difference across classes and ignores the feature similarity within the same class. To this end, we propose a congenerous cosine loss, namely COCO (code available at https://github.com/sciencefans/coco_loss), to enlarge the inter-class distinction as well as narrow down the inner-class variation. It is achieved by measuring the cosine distance between a sample and its cluster centroid in a cooperative manner. Moreover, we also align each region patch to a pre-defined base location to bring samples within a category closer in the feature space. Such an alignment strategy could also make the network less prone to overfitting.
Figure 2 illustrates the training pipeline of our proposed algorithm at a glance. Each instance in the image is annotated with a ground truth head, and we train a face detector and a human body detector, both using the RPN framework [Ren et al.2015], to detect these two regions. Then a human pose estimator [Wei et al.2016] is applied to detect key parts of the person in order to localize the upper body region. After cropping the four region patches, we conduct an affine transformation to align patches from different training samples to a ‘base’ location. Four deep models are trained separately on the PIPA training set using the COCO loss to obtain a set of robust features. To sum up, the contributions of this work are as follows:
- Propose a congenerous cosine loss to directly optimize the cosine distance among samples within and across categories. It is achieved in a cheap softmax manner with normalized inputs and less complexity.
- Design a person recognition pipeline that leverages several body regions to obtain a discriminative feature representation, without the necessity of conducting a second training on the test set.
- Align region patches to the base location via affine transformation to reduce variation among samples, making the network less prone to overfitting.
2 Related Work
Person recognition in photo albums [Anguelov et al.2007, Oh et al.2016, Oh et al.2015, Li et al.2016c, Zhang et al.2015, Li et al.2016a] aims at recognizing the identity of people in daily life photos, where the scenarios can be complex with cluttered background. [Anguelov et al.2007] first address the problem by proposing a Markov random field framework to combine all contextual cues to recognize the identity of persons. Recently, [Zhang et al.2015] introduce a large-scale dataset for this task. They accumulate the cues of a poselet-level person recognizer trained by a deep model to compensate for pose variations. In [Oh et al.2015], a detailed analysis of different cues is conducted and three additional test splits are proposed for evaluation. [Li et al.2016c] embed scene and relation contexts in an LSTM and formulate person recognition as a sequence prediction task.
Person re-identification aims to match pedestrian images from various perspectives in cameras for a typical time period and has led to many important applications in video [Li et al.2014, Yi et al.2014, Zhao et al.2014, Tapaswi et al.2012, Prosser et al.2010]. Existing work employs metric learning and mid-level feature learning to address this problem. [Li et al.2014] propose a deep network using pairs of people to encode photometric transformation. [Yi et al.2014] incorporate a Siamese deep network to learn the similarity metric between pairs of images. The main difference between person recognition and re-identification resides in the data logistics. The former is to identify the same person across places and time. In most cases, the appearance of an identity varies a lot across occasions. The latter is to detect a person in a consecutive video, meaning that the appearance and background do not vary much over time.
Deep neural networks [Krizhevsky et al.2012, LeCun et al.1998] have dramatically advanced the computer vision community in recent years, with large performance boosts in numerous tasks, such as image classification [He et al.2016, Li et al.2016b], object detection [Girshick2015, Li et al.2017], object tracking [Chi et al.2017], etc. The essence behind the success of deep learning resides in both its superior expressive power for non-linear complexity in high-dimensional space [Hinton and Salakhutdinov2006] and large-scale datasets [Deng et al.2009, Guo et al.2016], from which deep networks can learn complicated patterns and representative features to the full extent.
The proposed COCO loss is trained for each body region to obtain robust features. Section 3.1 depicts the detection of each region given an image; for inference in Section 3.4, we extract the features from corresponding regions on both test_1 and test_0, merge the similarity scores from different regions and make the final prediction on one’s identity.
3.1 Region detection
Four regions, namely face, head, whole body and upper body, are utilized to train the features. We first describe the detection of these regions.
Face. We pre-train a face detector in the spirit of a region proposal network (RPN), following Faster R-CNN [Ren et al.2015]. The data come from the Internet and the number of images is roughly 300,000. The network structure is a shallow version of the ResNet model [He et al.2016], where we remove the layers after res_3b and add two loss heads (classification and regression). We then finetune the face model on the PIPA training set for the COCO loss. The face detector identifies keypoints of the face (eye, brow, mouth, etc.) and we align the detected face patch to a ‘base’ shape via translation, rotation and scaling. Let $p$ and $q$ denote the keypoints detected by the face model and the aligned results, respectively. We define $\mathcal{P}$ and $\mathcal{Q}$ as two affine spaces; an affine transformation $\mathcal{A}: \mathcal{P} \to \mathcal{Q}$ is then defined as:
$$p \mapsto q = A p + b, \tag{1}$$
where $A$ is a linear transformation matrix in $\mathcal{P}$ and $b$ is the bias in $\mathcal{Q}$. Such an alignment scheme ensures that samples both within and across categories do not have large variance: if the network is learned without alignment, it has to distinguish more patterns, e.g., different rotations among persons, making it more prone to overfitting; if the network is equipped with alignment, it can focus more on differentiating features of different identities regardless of rotation, viewpoint, translation, etc.
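As an illustration of the alignment step, the following minimal sketch estimates an affine map from detected keypoints to base keypoints by least squares. It is not the paper's implementation: the function names `fit_affine` and `apply_affine`, the keypoint coordinates, and the choice of a least-squares fit are all assumptions made for the example.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map (A, b) such that dst ~= src @ A.T + b.

    src, dst: arrays of shape (P, 2) holding P keypoints each.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous coordinates: a column of ones lets the bias be fitted jointly.
    X = np.hstack([src, np.ones((src.shape[0], 1))])   # (P, 3)
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)    # (3, 2)
    A = params[:2].T   # (2, 2) linear part
    b = params[2]      # (2,) bias
    return A, b

def apply_affine(points, A, b):
    """Map (P, 2) points through q = A p + b."""
    return np.asarray(points, dtype=float) @ A.T + b

# Hypothetical 'base' keypoints and a rotated, scaled, shifted detection of them.
base = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0], [50.0, 80.0]])
theta = np.deg2rad(20.0)
distort = 1.3 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
detected = base @ distort.T + np.array([12.0, -5.0])

A, b = fit_affine(detected, base)
aligned = apply_affine(detected, A, b)   # lands back on the base keypoints
```

In practice the same `(A, b)` would be used to warp the image patch itself, so that every sample of a region is presented to the network in a comparable pose.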
Head, whole body and upper body. The head region is given as the ground truth for each person. To detect the whole body, we also pre-train a detector in the RPN framework. The model is trained on the large-scale human celebrity dataset [Guo et al.2016], where we use the first 87021 identities in 4638717 images. The network structure is an inception model [Ioffe and Szegedy2015] with the final pooling layer replaced by a fully connected layer. To determine the upper body region, we conduct human pose estimation [Wei et al.2016] to identify keypoints of the body, and the upper part is thereby located by these points. The head, whole body and upper body models, which are used for COCO loss training, are finetuned on the PIPA training set from the pretrained inception model, following a procedure of patch alignment similar to the one stated previously for the face region. The aligned patches of the four regions are shown in Figure 2(c).
3.2 Congenerous cosine loss for training
The intuition behind designing a COCO loss is that we directly compare and optimize the cosine distance (similarity) between two features. Let $f^{(i,r)} \in \mathbb{R}^{D}$ denote the feature vector of the $i$-th sample from region $r$, where $D$ is the feature dimension. For brevity, we drop the superscript $r$ since each region model undergoes the same COCO training. We first define the cosine similarity of two features from a mini-batch $\mathcal{B}$ as:
$$C(f^{(i)}, f^{(j)}) = \frac{f^{(i)\top} f^{(j)}}{\|f^{(i)}\| \, \|f^{(j)}\|}. \tag{2}$$
The cosine similarity measures how close two samples are in the feature space. A natural intuition for a desirable loss is to increase the similarity of samples within a category and enlarge the centroid distance of samples across classes. Let $l_i, l_j \in \{1, \dots, K\}$ be the labels of samples $i, j$, where $K$ is the total number of categories; we have the following loss to maximize:
$$\mathcal{L}^{naive} = \sum_{i,j \in \mathcal{B}} \frac{\delta(l_i, l_j)\, C(f^{(i)}, f^{(j)})}{(1 - \delta(l_i, l_j))\, C(f^{(i)}, f^{(j)}) + \epsilon}, \tag{3}$$
where $\delta(\cdot,\cdot)$ is an indicator function and $\epsilon$ is a trivial number for computation stability. Such a design is reasonable in theory and yet suffers from computational inefficiency. Since the complexity of the loss above is $\mathcal{O}(M^2)$ for a mini-batch of $M$ samples, the cost grows quadratically with the batch size. Moreover, the network suffers from unstable parameter updates and is hard to converge if we directly compute the loss from two arbitrary samples in a mini-batch.
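To make the pairwise formulation concrete, here is a small sketch that computes the cosine similarity between every pair of features in a mini-batch and counts the pairs involved. The helper name `pairwise_cosine`, the random features and the labels are hypothetical and serve only to show why the pairwise cost grows quadratically.

```python
import numpy as np

def pairwise_cosine(features, eps=1e-12):
    """Cosine similarity between every pair of rows in `features` (M, D)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)   # rows scaled to unit length
    return unit @ unit.T                       # (M, M) similarity matrix

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))             # M = 6 samples, D = 4 (toy values)
labels = np.array([0, 0, 1, 1, 2, 2])

sim = pairwise_cosine(features)
same_class = labels[:, None] == labels[None, :]   # indicator of equal labels
off_diagonal = ~np.eye(len(labels), dtype=bool)

mean_intra = sim[same_class & off_diagonal].mean()  # similarity within classes
mean_inter = sim[~same_class].mean()                # similarity across classes

# Number of distinct pairs M (M - 1) / 2, i.e. quadratic in the batch size.
num_pairs = len(labels) * (len(labels) - 1) // 2
```

A loss built on `sim` has to touch every entry of the M-by-M matrix, which is the inefficiency that motivates comparing each sample against class centroids instead.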
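Inspired by the center loss [Wen et al.2016], we define the centroid of class $k$ as the average of features over a mini-batch $\mathcal{B}$:
$$c_k = \frac{\sum_{i \in \mathcal{B}} \delta(l_i, k)\, f^{(i)}}{\sum_{i \in \mathcal{B}} \delta(l_i, k) + \epsilon}, \tag{4}$$
where $\delta(\cdot,\cdot)$ is the indicator function, $l_i$ is the label of sample $i$ and $\epsilon$ is a trivial number for computation stability. Incorporating the spirit of Eqn. 3 with the class centroid, we have the following output of sample $i$ to maximize:
$$p_{l_i}^{(i)} = \frac{\exp C(f^{(i)}, c_{l_i})}{\sum_{k \neq l_i} \exp C(f^{(i)}, c_k)}, \tag{5}$$
where $C(\cdot,\cdot)$ denotes the cosine similarity. The direct intuition behind Eqn. 5 is to measure the distance of one sample against other samples by way of a class centroid, instead of a direct pairwise comparison as in Eqn. 3. The numerator ensures sample $i$ is close enough to its own class $l_i$, and the denominator enforces a minimal distance against samples in other classes. The exponential operation transfers the cosine similarity to a normalized probability output, ranging from 0 to 1.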
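The centroid computation can be sketched as below: features are averaged per class over the mini-batch, and each sample is then compared with every centroid by cosine similarity. The names `class_centroids` and `cosine_to_centroids` and the toy two-dimensional features are assumptions for illustration, not taken from the released code.

```python
import numpy as np

def class_centroids(features, labels, num_classes, eps=1e-8):
    """Mean feature per class over a mini-batch; eps guards empty classes."""
    one_hot = np.eye(num_classes)[labels]        # (M, K) class indicator
    sums = one_hot.T @ features                  # (K, D) per-class feature sums
    counts = one_hot.sum(axis=0)[:, None]        # (K, 1) samples per class
    return sums / (counts + eps)

def cosine_to_centroids(features, centroids, eps=1e-12):
    """Cosine similarity of every sample (rows) to every centroid (columns)."""
    f = features / np.maximum(np.linalg.norm(features, axis=1, keepdims=True), eps)
    c = centroids / np.maximum(np.linalg.norm(centroids, axis=1, keepdims=True), eps)
    return f @ c.T                               # (M, K)

# Toy mini-batch: two samples per class, two classes (hypothetical values).
features = np.array([[1.0, 0.1], [0.9, -0.1], [0.0, 1.0], [0.2, 0.9]])
labels = np.array([0, 0, 1, 1])

centroids = class_centroids(features, labels, num_classes=2)
cos = cosine_to_centroids(features, centroids)
nearest = cos.argmax(axis=1)                     # closest centroid per sample
```

Comparing M samples against K centroids costs M times K similarity evaluations rather than the M squared needed for all sample pairs.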
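To this end, we propose the congenerous cosine (COCO) loss, which increases similarity within classes and enlarges variation across categories in a cooperative way:
$$\mathcal{L}^{COCO} = -\sum_{i \in \mathcal{B}} \sum_{k} t_k^{(i)} \log p_k^{(i)} = -\sum_{i \in \mathcal{B}} \log p_{l_i}^{(i)}, \tag{6}$$
where $k$ indexes along the class dimension in $\mathbb{R}^{K}$ and $t_k^{(i)} \in \{0, 1\}$ is the binary mapping of sample $i$ based on its label $l_i$. In practice, the COCO loss can be implemented in a neat way via the softmax operation. For Eqn. 5, if we constrain the feature and centroid to be normalized (i.e., $\hat{f} = f / \|f\|$, $\hat{c} = c / \|c\|$) and loosen the summation in the denominator to include $k = l_i$, the probability output of sample $i$ becomes:
$$p_k^{(i)} = \frac{\exp(\hat{c}_k^{\top} \hat{f}^{(i)})}{\sum_{m} \exp(\hat{c}_m^{\top} \hat{f}^{(i)})} = \mathrm{softmax}(z_k^{(i)}), \tag{7}$$
where $z_k^{(i)} = \hat{c}_k^{\top} \hat{f}^{(i)}$ is the input to softmax. $\hat{c}_k$ can be seen as the weights in the classification layer with the bias term being zero. The advantages of the COCO loss in Eqn. 6 and 7 over the naive version in Eqn. 3 are twofold: it reduces the complexity of computation, and it can be achieved via the softmax with normalized inputs in terms of cosine distance.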
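The softmax view of the loss can be sketched in a few lines of NumPy: both features and centroids are L2-normalized, their inner products serve as logits, and a standard cross-entropy is taken over the classes. This is a minimal forward-pass illustration under the assumption that the loss is summed over the mini-batch; the function names and toy values are hypothetical, and the released code may differ in detail.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def coco_loss(features, centroids, labels):
    """Softmax cross-entropy over cosine similarities to class centroids.

    features: (M, D), centroids: (K, D), labels: (M,) integer class ids.
    Returns the loss summed over the batch and the (M, K) probabilities.
    """
    z = l2_normalize(features) @ l2_normalize(centroids).T   # cosine logits in [-1, 1]
    z = z - z.max(axis=1, keepdims=True)                     # shift for numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -log_prob[np.arange(len(labels)), labels].sum()
    return loss, np.exp(log_prob)

# Toy example with two classes (hypothetical values).
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
features = np.array([[2.0, 0.2], [0.1, 3.0], [1.5, -0.3]])
labels = np.array([0, 1, 0])

loss, probs = coco_loss(features, centroids, labels)
loss_wrong, _ = coco_loss(features, centroids, 1 - labels)       # flipped labels
loss_scaled, _ = coco_loss(10.0 * features, centroids, labels)   # rescaled features
```

Because the inputs are normalized, rescaling a feature leaves the loss unchanged (`loss_scaled` equals `loss`), so only the direction of the feature relative to the centroids matters.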
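The derivative of the loss w.r.t. the input feature $f^{(i)}$, written in an element-wise form and dropping the sample index for brevity, is as follows:
$$\frac{\partial \mathcal{L}}{\partial f_d} = \sum_{d'} g_{d'} \frac{\partial \hat{f}_{d'}}{\partial f_d} = \frac{g_d - (g^{\top} \hat{f})\, \hat{f}_d}{\|f\|}, \tag{8}$$
where $g_d = \partial \mathcal{L} / \partial \hat{f}_d = \sum_k (p_k - t_k)\, \hat{c}_{kd}$ is the top gradient w.r.t. the normalized feature $\hat{f} = f / \|f\|$. We can derive the gradient w.r.t. the centroid $c_k$ in a similar manner. Note that both the features and the cluster centroids are trained end-to-end. The features are initialized from the pretrained models and the initial value of $c_k$ is thereby obtained via Eqn. 4.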
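The chain rule through the normalization step can be checked numerically. The sketch below back-propagates an arbitrary top gradient through f / ||f|| and compares the result with central finite differences; the linear toy loss with weights `w` and the helper names are assumptions chosen only to keep the check short.

```python
import numpy as np

def normalize(f):
    """Return f scaled to unit Euclidean length."""
    return f / np.linalg.norm(f)

def grad_through_normalization(f, top_grad):
    """Back-propagate `top_grad` (gradient w.r.t. f / ||f||) to f itself."""
    f_hat = normalize(f)
    return (top_grad - f_hat * (top_grad @ f_hat)) / np.linalg.norm(f)

rng = np.random.default_rng(1)
f = rng.normal(size=5)
w = rng.normal(size=5)   # hypothetical toy loss: loss(f) = w . normalize(f)

# For this toy loss the gradient w.r.t. the normalized feature is simply w.
analytic = grad_through_normalization(f, top_grad=w)

# Central finite differences of the same toy loss.
h = 1e-6
numeric = np.zeros_like(f)
for d in range(f.size):
    step = np.zeros_like(f)
    step[d] = h
    numeric[d] = (w @ normalize(f + step) - w @ normalize(f - step)) / (2 * h)
```

The analytic gradient is orthogonal to `f`, which reflects that moving a feature along its own direction does not change its normalized version.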
3.3 Relationship of COCO with counterparts
COCO loss is formulated as a metric learning approach in the feature space, using the cluster centroid in the cosine distance as the metric to both enlarge inter-class variation and narrow down inner-class distinction. It can be achieved via a softmax operation under several constraints. Figure 3 shows the visualization of feature clusters under different loss schemes. The softmax loss only enforces samples across categories to be far away while ignoring the similarity within one class (3(c)). In COCO loss, we replace the weights in the classification layer before softmax with a clearly defined and learnable cluster centroid (3(d)). The center loss [Wen et al.2016] is similar in some way to ours. However, it needs an external memory to store the centers of classes and thus its computation is twice that of ours (3(a-b)). [Liu et al.2016b] proposed a generalized large-margin softmax loss, which also learns discriminative features by imposing intra-class compactness and inter-class separability. The coupled cluster loss [Liu et al.2016a] is a further adaptation of the triplet loss [Wang and Gupta2015], where in one mini-batch the positives of one class are pushed as far as possible from the negatives of other classes. It optimizes the inter-class distance in some sense but fails to differentiate among the negatives.
|Test split||Face (n-a)||Face (ali)||Head (n-a)||Head (ali)||Head [Li et al.2016c]||Upper body (n-a)||Upper body (ali)||Upper body [Li et al.2016c]||Whole body (n-a)||Whole body (ali)|
3.4 Inference
At the testing stage, we measure the similarity of features between the two test splits to recognize the identity of each instance in test_1 based on the labels in test_0. The similarity between two patches $i$ and $j$ in test_1 and test_0 is denoted by $s_{ij}^{(r)}$, where $r$ indicates a specific region model. A key problem is how to merge the similarity scores from different regions. We first normalize the preliminary result so that scores across different regions are comparable:
$$\hat{s}_{ij}^{(r)} = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 s_{ij}^{(r)})\right)}, \tag{9}$$
where $\beta_0, \beta_1$ are parameters of the logistic regression. The final score $S_{ij}$ is a weighted mean of the normalized scores of each region: $S_{ij} = \sum_{r=1}^{R} \lambda_r \hat{s}_{ij}^{(r)}$, where $R$ is the total number of regions and $\lambda_r$ is the weight of each region’s score. The identity of patch $i$ in test_1 is decided by the label corresponding to the maximum score in the reference set: $\arg\max_{j} S_{ij}$. Such a scheme guarantees that when new training data are added into test_0, there is no need to train a second model or SVM on the reference set, which is quite distinct from previous work. The test parameters $\lambda_r$ and $\beta_0, \beta_1$ are determined by a validation set on PIPA.
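The score fusion at inference can be sketched as follows: each region's similarity matrix is passed through a logistic function, the normalized scores are combined by a weighted sum, and every query takes the label of its best-scoring reference sample. All numeric values here (similarities, logistic parameters, region weights, labels) are hypothetical placeholders; in the paper they are chosen on a validation set.

```python
import numpy as np

def logistic_normalize(s, beta0, beta1):
    """Map raw similarities to (0, 1) so that regions become comparable."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * s)))

def predict_identity(region_scores, betas, weights, reference_labels):
    """Fuse per-region similarities and pick the best reference per query.

    region_scores: list of (N_query, N_ref) similarity matrices, one per region.
    betas: list of (beta0, beta1) pairs; weights: one weight per region.
    """
    fused = np.zeros_like(region_scores[0], dtype=float)
    for s, (beta0, beta1), weight in zip(region_scores, betas, weights):
        fused += weight * logistic_normalize(s, beta0, beta1)
    best = fused.argmax(axis=1)            # index of best reference sample
    return reference_labels[best], fused

# Two regions, two query patches (test_1), three reference patches (test_0).
face_scores = np.array([[0.9, 0.1, 0.2], [0.2, 0.3, 0.8]])
body_scores = np.array([[0.4, 0.5, 0.1], [0.1, 0.2, 0.7]])
reference_labels = np.array([7, 3, 5])     # hypothetical identity ids

predicted, fused = predict_identity(
    [face_scores, body_scores],
    betas=[(0.0, 5.0), (0.0, 5.0)],
    weights=[0.7, 0.3],
    reference_labels=reference_labels,
)
```

Adding a new reference sample only appends a column to each similarity matrix, so no classifier has to be retrained on the reference set.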
|Method||Face||Head||Upper body||Whole body||original||album||time||day|
|[Li et al.2016c]|| ||✓||✓|| ||84.93||78.25||66.43||43.73|
Dataset and evaluation metric. The People In Photo Albums (PIPA) dataset [Zhang et al.2015] is adopted for evaluation. The PIPA dataset is divided into train, validation, test and leftover sets, where the head of each instance is annotated in all sets. The test set is split into two subsets, namely test_0 and test_1, with roughly the same number of instances. Such a division is called the original split. In [Oh et al.2015, Li et al.2016c, Zhang et al.2015, Li et al.2016a], the training set is only used for learning feature representations; the recognition system is trained on test_0 and evaluated on test_1. In this work, as mentioned previously, we take full advantage of the training set and remove the second training on test_0. Moreover, [Oh et al.2015] introduced three more challenging splits, namely album, time and day. Each case emphasizes a different temporal distance (different albums, events, days, etc.) between the two subsets. The evaluation metric is the averaged classification accuracy over all instances on test_1.
For the face model, the initial learning rate is set to 0.001 and decreased by 10% after 20 epochs. For the other three models, the initial learning rate is set to 0.005 and decreased by 20% after 10 epochs. The weight decay and momentum are 0.005 and 0.9 across models. We use stochastic gradient descent with the Adam optimizer [Kingma and Ba2015]. Note that during training, each patch on region $r$ of an identity is cropped from the input, where the image for a specific person is resized such that the longer dimension of the head is fixed at 224 across patches, ensuring that the scale of different body regions for each instance is the same. Moreover, for the whole body model we simply use the similarity transformation (scale, translation, rotation) instead of the affine operation for better performance.
4.2 Component analysis
Feature alignment in different regions. Table 1 reports the performance of using feature alignment and different body regions, where several remarks could be observed. First, the alignment case in each region performs better by a large margin than the non-alignment case, which verifies the motivation of patch alignment to alleviate inner-class variance stated in Section 3.1. Second, for the alignment case, the most representative features to identify a person reside in the region of face, followed by head, upper body and whole body at last. Such a clue is not that obvious for the non-alignment case. Third, we notice that for the whole body region, accuracy in the non-alignment case is higher than that of the alignment case in time and day. This is probably due to the improper definition of base points on these two sets.
Congenerous cosine loss. Figure 5 shows the histogram of the cosine distance among positive pairs (i.e., same identity in test_0 and test_1) and negative pairs. We can see that in the COCO trained model, the discrepancy between inter-class (blue) and inner-class (green) samples in the test set is well magnified; whereas in the softmax trained case, such a distinction is not obvious. This verifies the effectiveness of our COCO loss.
Similarity integration from regions. Table 2 depicts the ablation study on the combination of merging the similarity score from different body regions during inference. Generally speaking, taking all regions into consideration could result in the best accuracy of 92.78 on the original set. It is observed that the performance is still fairly good on day and time if the two scores of face and upper body alone are merged.
[Li et al.2016c] also employs a multi-region processing step and we include the comparison in Table 1 and 2. On some splits our model is superior (e.g., head region, 82 vs 81 on original, 44 vs 42 on day); whereas on other splits ours is inferior (e.g., upper body region, 69 vs 70 on album, 57 vs 58 on time). This is probably due to distribution imbalance among splits: the upper body region differs greatly in appearance with some instances absent of this region, making COCO hard to learn features. However, under the score integration scheme, the final merged prediction can complement features learned among different regions and achieves better performance against [Li et al.2016c].
4.3 Comparison to state-of-the-arts
We can see from Table 3 that our recognition system outperforms the previous state-of-the-art methods, PIPER [Zhang et al.2015], RNN [Li et al.2016c] and Naeil [Oh et al.2015], on all four test splits. Figure 4 visualizes a few examples of the instances predicted by our model, where complex scenes with non-frontal faces and body occlusion are handled properly in most scenarios. Failure cases are probably due to the almost identical appearance configuration in these scenarios.
In this work, we propose a person recognition method to identify the same person, where four models for different body regions are trained. Region patches are further aligned via affine transformation to make the model less prone to overfitting. Moreover, the training procedure employs a COCO loss to reduce the inner-class variance as well as enlarge the inter-class variation. Our pipeline requires only one-time training of the model; we utilize the similarity between test_0 and test_1 to determine the person’s identity during inference. Experiments show that the proposed method outperforms other state-of-the-art methods on the PIPA dataset.
We would like to thank reviewers for helpful comments; Hasnat, Weiyang and Feng for method discussion and paper clarification in the initial release. H. Li is financially supported by the Hong Kong Ph.D Fellowship Scheme.
- [Anguelov et al.2007] Dragomir Anguelov, Kuang-chih Lee, Salih Burak Gokturk, and Baris Sumengen. Contextual identity recognition in personal photo albums. In CVPR, 2007.
- [Chi et al.2017] Zhizhen Chi, Hongyang Li, Huchuan Lu, and Ming-Hsuan Yang. Dual deep network for visual tracking. IEEE Trans. on Image Processing, 2017.
- [Deng et al.2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- [Girshick2015] Ross Girshick. Fast R-CNN. In ICCV, 2015.
- [Guo et al.2016] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. arXiv preprint:1607.08221, 2016.
- [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [Hinton and Salakhutdinov2006] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313, 2006.
- [Ioffe and Szegedy2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
- [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
- [Li et al.2014] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
- [Li et al.2016a] Haoxiang Li, Jonathan Brandt, Zhe Lin, Xiaohui Shen, and Gang Hua. A multi-level contextual model for person recognition in photo albums. In CVPR, 2016.
- [Li et al.2016b] Hongyang Li, Wanli Ouyang, and Xiaogang Wang. Multi-bias non-linear activation in deep neural networks. In ICML, 2016.
- [Li et al.2016c] Yao Li, Guosheng Lin, Bohan Zhuang, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Sequential person recognition in photo albums with a recurrent network. arXiv preprint:1611.09967, 2016.
- [Li et al.2017] Hongyang Li, Yu Liu, Wanli Ouyang, and Xiaogang Wang. Zoom out-and-in network with recursive training for object proposal. arXiv preprint:1702.05711, 2017.
- [Liu et al.2016a] Hongye Liu, Yonghong Tian, Yaowei Wang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In CVPR, 2016.
- [Liu et al.2016b] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
- [Oh et al.2015] Seong Joon Oh, Rodrigo Benenson, Mario Fritz, and Bernt Schiele. Person recognition in personal photo collections. In ICCV, 2015.
- [Oh et al.2016] Seong Joon Oh, Rodrigo Benenson, Mario Fritz, and Bernt Schiele. Faceless person recognition; privacy implications in social media. In ECCV, 2016.
- [Prosser et al.2010] Bryan Prosser, Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Person re-identification by support vector ranking. In BMVC, 2010.
- [Ren et al.2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
- [Tapaswi et al.2012] Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. “Knock! Knock! Who is it?” Probabilistic Person Identification in TV Series. In CVPR, 2012.
- [Wang and Gupta2015] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
- [Wei et al.2016] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
- [Wen et al.2016] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
- [Yi et al.2014] Dong Yi, Zhen Lei, and Stan Z. Li. Deep metric learning for practical person re-identification. arXiv preprint:1407.4979, 2014.
- [Zhang et al.2015] Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, and Lubomir Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. In CVPR, 2015.
- [Zhao et al.2014] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, 2014.