Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification

12/03/2019 · by Zhihui Zhu, et al. · IEEE · Tencent · Sun Yat-sen University

Although supervised person re-identification (Re-ID) has made great progress recently, Re-ID remains a major visual challenge due to viewpoint variation. Most existing viewpoint-based person Re-ID methods project images from each viewpoint into separate and unrelated sub-feature spaces. They only model the identity-level distribution inside an individual viewpoint but ignore the underlying relationship between different viewpoints. To address this problem, we propose a novel approach, called Viewpoint-Aware Loss with Angular Regularization (VA-reID). Instead of one subspace per viewpoint, our method projects the features from different viewpoints onto a unified hypersphere and effectively models the feature distribution at both the identity level and the viewpoint level. In addition, rather than modeling different viewpoints as hard labels used for conventional viewpoint classification, we introduce viewpoint-aware adaptive label smoothing regularization (VALSR), which assigns adaptive soft labels to the feature representation. VALSR effectively resolves the ambiguity of viewpoint cluster label assignment. Extensive experiments on the Market-1501 and DukeMTMC-reID datasets demonstrate that our method outperforms the state-of-the-art supervised Re-ID methods.


1 Introduction

Person re-identification (Re-ID), which aims to recognize pedestrians across non-overlapping camera views, is an important and challenging problem in visual surveillance analysis and has drawn increasing research attention [27, 24, 20]. While Re-ID has developed considerably in recent years, existing supervised person re-identification still faces major visual appearance challenges, such as changes in viewpoint or pose, low resolution, and illumination.

Among these challenges, in this work we focus on the problem of viewpoint variation, which is one of the most important and difficult challenges in Re-ID research and practical application. In practice, due to viewpoint variation, images of the same identity taken from different viewpoints usually exhibit massive differences in visual appearance; images of different identities from the same viewpoint may even look more similar than images of the same identity from different viewpoints. Some examples are shown in Figure 1 (a)(b). This problem greatly limits the practical application of Re-ID.

Figure 1: Comparisons of different feature learning methods. Our VA-reID method learns a unified space using soft labels instead of hard labels. Images with a thick purple border in the figure belong to ambiguous viewpoint categories.

A key problem in tackling viewpoint variation is to learn discriminative feature representations for body images with different viewpoints. However, existing viewpoint-based feature learning methods [2, 4, 14, 17] have two main inadequacies. 1) They treat viewpoint learning and identity discrimination as two separate processes; in such a case, there is no principled way to learn optimal identity classification under various viewpoint variations. 2) They cast the viewpoints of persons as hard labels, while in reality the viewpoint of a person is ambiguous. As shown in Figure 1(c), these methods learn separate features for different viewpoints. For example, DVAML [2] learns two different feature subspaces for image pairs with similar and dissimilar viewpoints, while OSCNN [4] and PSE [17] learn a linear combination of features from different viewpoints. Projecting the features into separate and unrelated subspaces models only the identity-level distribution within each viewpoint and may ignore the underlying relationship among different viewpoints. Thus, the relationship between features from separate viewpoint subspaces cannot be learned directly, compromising the model's ability to match images of a person across viewpoints.

To solve this problem, we propose a novel angular-based feature learning method that projects all features into a unified subspace and directly models the distribution of features from different viewpoints. As shown in Figure 1 (d)(f), the feature distribution is modeled at both the identity level and the viewpoint level. At the identity level, different identities are pushed away from each other to form identity-level clusters. At the viewpoint level, the features in each identity cluster further form three viewpoint-level clusters (front, side, back), and a novel center regularization pulls the centers of these clusters closer to each other, since they belong to the same identity and share visual similarity.

In addition, we further consider the problem of viewpoint cluster label assignment. While conventionally each image is assigned a hard viewpoint cluster label, we find that the viewpoint of some image samples is indeed ambiguous, as shown in Figure 1(e), and that a hard viewpoint cluster label assignment may mislead the learning. Therefore, we propose to relax the hard assignment and instead perform a soft label assignment, as shown in Figure 1 (f). We propose a novel viewpoint-aware regularization method, called viewpoint-aware adaptive label smoothing regularization (VALSR). VALSR replaces the common one-hot hard label with an adaptive soft label that changes according to the similarity to the classification centers. Notice that we use the viewpoint label only for training.

In summary, this work develops a joint learning model for identity and viewpoint discrimination, in which we introduce soft multi-labels to model the viewpoint-aware feature distribution, overcoming the ambiguity of viewpoint labelling and greatly alleviating the effect of viewpoint variations. The main contributions include:

  • We put forward the idea of modeling the viewpoint distribution and the identity distribution jointly rather than separately. For this purpose, we propose a novel solution called Viewpoint-Aware Loss with Angular Regularization, which effectively models the distribution at both the identity level and the viewpoint level; in particular, we impose a center regularization to connect identity and viewpoint discrimination. The experimental results demonstrate that our method substantially outperforms related methods.

  • To overcome the ambiguity of viewpoint labelling, we develop viewpoint-aware adaptive label smoothing, which allows a smooth transition between features of different discrete viewpoints by assigning adaptive soft viewpoint labels. The soft label adapts dynamically according to the prediction probability, making the model more robust to noisy data and over-fitting.

2 Related Work

Viewpoint-Aware Person Re-identification. Person Re-ID aims to recognize pedestrians across non-overlapping camera views. A recent survey [8] provides a detailed review and outlook of person Re-ID. Existing studies on viewpoint variation mainly fall into four categories: pose-based methods [18, 17], segmentation-based methods [7], generation-based methods [28, 14, 19] and viewpoint-based methods [2]. Pose-based and segmentation-based methods are common practice for this problem. Pose-based methods usually exploit pose information to attend to human body parts, while segmentation-based methods utilize human parsing information to obtain the positions of body parts. They can then align the body parts or extract local features from them. Generation-based methods usually use a generative model to generate images [14] or features of other viewpoints [28]. Recently, PersonX [19] utilizes a large-scale synthetic data engine to generate pedestrian images with arbitrary rotation angles and analyses in detail the important impact of viewpoints on Re-ID. Viewpoint-based methods use hard viewpoint labels of images directly to help feature learning. DVAML [2] tries to learn two different feature subspaces for image pairs with similar and dissimilar viewpoints but obtains little improvement.

Figure 2: Overview of the proposed VA-reID method. Features are extracted by a backbone network. The proposed viewpoint-aware angular loss projects features onto a hypersphere to form identity-level clusters (light green and blue circles) and viewpoint-level clusters (dark green and brown circles). Furthermore, our adaptive label smoothing regularization eliminates the hard margin between clusters by introducing adaptive soft labels.

Loss Function for Identification. Metric learning aims to learn a metric space in which samples from different classes are far apart while samples from the same class are compact. Popular loss functions include the softmax-based loss, the contrastive loss and the triplet loss. In the face recognition field, a number of representative softmax-based methods [16, 11, 15, 5] have been proposed. These improvements are mainly concentrated on two aspects: normalization and margin. The former is applied to features and to the weights of the fully connected layer. As early as [16], the advantages of weight normalization, such as reducing computational complexity and making the network converge faster, were demonstrated. Crystal loss [15] proposes feature normalization and verifies its effectiveness. Feature normalization helps increase angular discrimination in the feature space, and combining feature normalization with weight normalization achieves better results. Recent softmax-based work mainly focuses on the latter aspect. The concept of angular margin was first introduced to the metric space in L-Softmax [12]. Later works such as SphereFace [11] and ArcFace [5] make further improvements to obtain better performance.

3 Methodology

3.1 Problem Formulation and Overview

Given a query image, the target of Re-ID is to obtain a ranking list of images from the gallery set across non-overlapping camera views. Define an image sample as $(x_i, y_i, v_i)$, where $x_i$ denotes the feature extracted by a deep model for the $i$-th image, $y_i$ is the identity label and $v_i$ is the viewpoint label. Notice that each image has only one viewpoint label $v_i$. Given a training set with $N$ images, the deep feature $x_i$ is a global Re-ID feature extracted by a CNN backbone (e.g., ResNet, DenseNet), denoted as $x_i = f(I_i; \theta)$, where $\theta$ denotes the parameters of the CNN.

Now we briefly introduce the pipeline of the proposed Viewpoint-Aware ReID method (VA-reID). As shown in Figure 2, a CNN backbone extracts a global feature $x_i$ for each image $I_i$. We introduce two types of losses: the identity angular loss $\mathcal{L}_{id}$ and the viewpoint-aware angular loss $\mathcal{L}_{va}$. Integrating the two angular losses into a unified learning framework builds a two-level distribution of the features on a unified hypersphere, consisting of the identity-level distribution and the viewpoint-level distribution. At the identity level, features with the same identity are assembled into identity clusters by $\mathcal{L}_{id}$ (i.e., the large light green and light blue circles in Figure 2). At the viewpoint level, within each identity cluster, we pull the features of the same viewpoint close to form viewpoint clusters (i.e., the small dark green and brown circles inside the large circles) by the angular loss $\mathcal{L}_{va}$ and the center regularization term $\mathcal{L}_{c}$. As a result, the overall loss of our proposed method is:

$$\mathcal{L} = \mathcal{L}_{id} + \mathcal{L}_{va} + \mathcal{L}_{c} \qquad (1)$$

Furthermore, as shown in the bottom-right rectangle in Figure 2, at the viewpoint level, a novel viewpoint-aware adaptive LSR term is used in $\mathcal{L}_{va}$ to eliminate the hard margin between viewpoint clusters. At the identity level, a mild adaptive label smoothing regularization is explored in $\mathcal{L}_{id}$ to effectively increase the generalization ability of our model.
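To make the pipeline concrete, the following minimal PyTorch sketch shows how the three terms of Eq. (1) could be composed in one module. The sub-losses here are deliberate stand-ins (plain cross-entropy and an L2 pull between centers); the actual angular forms and the soft labels are defined in Section 3.2, and all class and variable names are ours for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class VAReIDLossSketch(nn.Module):
    """Illustrative composition of the three loss terms in Eq. (1)."""

    def __init__(self, feat_dim, num_ids, num_views=3):
        super().__init__()
        self.id_centers = nn.Parameter(torch.randn(num_ids, feat_dim))               # W_j
        self.view_centers = nn.Parameter(torch.randn(num_ids, num_views, feat_dim))  # W_{j,v}
        self.num_views = num_views
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, id_labels, view_labels):
        # Identity-level scores and loss (stand-in for the angular L_id).
        id_logits = feats @ self.id_centers.t()
        loss_id = self.ce(id_logits, id_labels)
        # Viewpoint-level scores over C*V subclasses (stand-in for the angular L_va).
        va_logits = feats @ self.view_centers.view(-1, feats.size(1)).t()
        va_labels = id_labels * self.num_views + view_labels
        loss_va = self.ce(va_logits, va_labels)
        # Center regularization: pull each viewpoint center toward its identity center.
        loss_c = (self.view_centers - self.id_centers.unsqueeze(1)).pow(2).sum(-1).mean()
        return loss_id + loss_va + loss_c   # overall objective, Eq. (1)
```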

3.2 VA-reID for Person Re-Identification

Softmax loss is widely used for Re-ID feature learning. Let $C$ denote the number of identities, $x_i$ denote the deep Re-ID feature of an image, and $W_j$ denote the $j$-th column of the weight matrix $W$. The prediction probability that $x_i$ belongs to identity $j$ is:

$$p_j(x_i) = \frac{e^{W_j^{T} x_i + b_j}}{\sum_{k=1}^{C} e^{W_k^{T} x_i + b_k}} \qquad (2)$$

We remove the bias and normalize the feature and the weights as advised in the ArcFace loss [5]. Generally, weight normalization helps alleviate class imbalance, and feature normalization is instrumental to the generalization of the metric space. Therefore, all the features can be projected onto a hypersphere with the same length, and the probability of the deep feature $x_i$ belonging to identity $j$ is equivalent to the cosine distance between $x_i$ and $W_j$:

$$p_j(x_i) = \frac{e^{s\,\cos(\theta_{j} + m)}}{e^{s\,\cos(\theta_{j} + m)} + \sum_{k \neq j} e^{s\,\cos\theta_{k}}} \qquad (3)$$

where $\theta_{j}$ denotes the angle between the feature vector $x_i$ and the $j$-th column of $W$: $\cos\theta_{j} = W_j^{T} x_i / (\|W_j\|\,\|x_i\|)$. $m$ is a margin that is used to improve the discriminative ability of the classification, and $s$ is a scale factor used to promote convergence.
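For reference, the sketch below implements the normalized, margin-based identity probability of Eqs. (2)-(3) in PyTorch, following the ArcFace-style formulation cited above. The scale and margin values, as well as the class and variable names, are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularIdentityLoss(nn.Module):
    """Sketch of the identity angular loss built on Eq. (3) (ArcFace-style [5])."""

    def __init__(self, feat_dim, num_ids, scale=30.0, margin=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, feat_dim))  # one center W_j per identity
        self.scale, self.margin = scale, margin

    def forward(self, feats, labels):
        # Normalize features and weights so the logits are cosines of the angles theta_j.
        cos = F.linear(F.normalize(feats), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin m only on the ground-truth identity, then rescale by s.
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).float()
        logits = self.scale * torch.cos(theta + self.margin * one_hot)
        # Cross-entropy over the Eq. (3) probabilities.
        return F.cross_entropy(logits, labels)
```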

Based on previous experience [11, 5], the angular loss helps to model a more discriminative distribution of identities. Analogously, we extend the identity-level angular loss to the proposed viewpoint-aware loss (VA-reID). From Eq. (3), we observe that the $j$-th column of the weight matrix $W$ can be viewed as the center of identity $j$. To obtain a higher probability of image $x_i$ belonging to identity $j$, we need to pull the feature vector $x_i$ closer to the center $W_j$. To model the distribution of different viewpoints, each identity class is further divided into $V$ subclasses corresponding to the $V$ viewpoints (i.e., front, side and back; $V$ denotes the number of viewpoints and is set to 3 in this paper). We denote the viewpoint centers as $W_{j,v}$. As a result, given a deep feature $x_i$, we model the probability that $x_i$ belongs to identity $j$ and viewpoint $v$ as follows:

$$p_{j,v}(x_i) = \frac{e^{s\,\cos(\theta_{j,v} + m)}}{e^{s\,\cos(\theta_{j,v} + m)} + \sum_{(k,u) \neq (j,v)} e^{s\,\cos\theta_{k,u}}} \qquad (4)$$

where $\theta_{j,v}$ denotes the angle between the feature $x_i$ and the center $W_{j,v}$ of the $j$-th identity and the $v$-th viewpoint: $\cos\theta_{j,v} = W_{j,v}^{T} x_i / (\|W_{j,v}\|\,\|x_i\|)$.
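The subclass probability of Eq. (4) amounts to enlarging the classifier from $C$ identity centers to $C \times V$ identity-viewpoint centers. A minimal sketch is given below; the flattened indexing $j \cdot V + v$ and the hyperparameter values are our own choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewpointAwareLogits(nn.Module):
    """Cosine logits over the C*V identity-viewpoint subclasses of Eq. (4) (sketch)."""

    def __init__(self, feat_dim, num_ids, num_views=3, scale=30.0, margin=0.35):
        super().__init__()
        # One center W_{j,v} per (identity j, viewpoint v) pair, stored row-wise.
        self.centers = nn.Parameter(torch.randn(num_ids * num_views, feat_dim))
        self.num_views, self.scale, self.margin = num_views, scale, margin

    def forward(self, feats, id_labels, view_labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.centers)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Ground-truth subclass (j, v) flattened to index j * V + v.
        target = id_labels * self.num_views + view_labels
        one_hot = F.one_hot(target, num_classes=self.centers.size(0)).float()
        logits = self.scale * torch.cos(theta + self.margin * one_hot)
        return logits, target  # combined downstream with the soft labels of Eq. (12)
```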

Given a training sample $x_i$ with identity label $y_i$ and viewpoint label $v_i$, the identity classification loss and the viewpoint-aware loss are:

$$\mathcal{L}_{id} = -\sum_{j=1}^{C} q_j \log p_j(x_i) \qquad (5)$$
$$\mathcal{L}_{va} = -\sum_{j=1}^{C}\sum_{v=1}^{V} q_{j,v} \log p_{j,v}(x_i) \qquad (6)$$

where $q_j$ and $q_{j,v}$ are the classification labels. Traditional methods use hard labels for feature learning, e.g.,

$$q_j = \begin{cases} 1, & j = y_i \\ 0, & j \neq y_i \end{cases} \qquad (7)$$
$$q_{j,v} = \begin{cases} 1, & j = y_i \ \text{and}\ v = v_i \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

which ignore the ambiguity of viewpoint labelling. To overcome this problem, we introduce soft label learning. Furthermore, to maintain the visual similarity between features of the same person across different viewpoints, we propose the center regularization term $\mathcal{L}_{c}$ to connect identity and viewpoint discrimination. It helps to pull the viewpoint centers closer to the corresponding identity center:

$$\mathcal{L}_{c} = \sum_{j=1}^{C}\sum_{v=1}^{V} \left(1 - \cos\langle W_{j,v},\, W_j \rangle\right) \qquad (9)$$
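The sketch below implements the cosine-based center pull of Eq. (9) as reconstructed above. Since the original formula is not fully recoverable from the text, both the equation and this code should be read as one plausible hypersphere-friendly realization rather than the authors' exact form; function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def center_regularization(view_centers, id_centers):
    """One plausible form of the center regularization L_c of Eq. (9) (sketch).

    view_centers: (C, V, d) viewpoint centers W_{j,v}; id_centers: (C, d) identity
    centers W_j. Both are normalized onto the unit hypersphere, and each viewpoint
    center is pulled toward its identity center by minimizing 1 - cosine similarity.
    """
    vc = F.normalize(view_centers, dim=-1)
    ic = F.normalize(id_centers, dim=-1).unsqueeze(1)   # (C, 1, d) for broadcasting
    cos = (vc * ic).sum(dim=-1)                         # (C, V) cosine similarities
    return (1.0 - cos).mean()
```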
Figure 3: (a) Illustration of the adaptive soft label for identity-level learning. (b) Illustration of the viewpoint-aware adaptive soft label for viewpoint-level learning.

- Adaptive Identity Label Learning. In this section, we introduce a soft identity label to replace the conventional hard identity label. The soft label adapts dynamically according to the prediction probability, making the model more robust to noisy data and over-fitting.

Assumption 1. The network tends to prioritize learning simple patterns of the real data first and the noise later [1].

As mentioned above, in the cross-entropy loss the one-hot encoding is used as the ground-truth probability distribution, so the model tends to maximize the expected log-likelihood of a single label, which may result in over-fitting and harm the generalization ability of the model. Label smoothing regularization (LSR) [21] was proposed to address this problem. The probability distribution of LSR is

$$q_j = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{C}, & j = y_i \\[4pt] \dfrac{\varepsilon}{C}, & j \neq y_i \end{cases} \qquad (10)$$

where $\varepsilon$ is a manually set value. LSR replaces the one-hot hard label with a soft label by introducing a small manual parameter to adjust the probability distribution. This encourages the model to lower its confidence on the ground-truth category and assign some probability to the other categories. However, notice that $\varepsilon$ is a fixed value, which results in the same expected probability of the ground-truth category for every input sample, and likewise for the other categories. In fact, input samples have different logits, and applying the same expected probability to all of them is unsuitable. Consider the case where there is noise in the training set: an image with true identity label $y_i$ is wrongly annotated. According to the assumption [1], the network tends to learn positive samples first and noise samples later, and noise samples usually have smaller logits on the labelled category than positive samples do.

$$\varepsilon_{ada} = \alpha \left(1 - p_{y_i}(x_i)\right) \qquad (11)$$

We develop a new smoothing parameter $\varepsilon_{ada}$, which is related to the prediction probability of the network, where $\alpha$ is a multiplicative scaling coefficient. When the output probability is large, $\varepsilon_{ada}$ is small and the confidence on the ground-truth category is large. In contrast, when the output probability is small, $\varepsilon_{ada}$ is large and more confidence is assigned to the other categories. The adaptive label smoothing regularization is therefore more robust to noisy data and over-fitting. An illustration of the adaptive soft identity label is given in Figure 3(a). We apply the adaptive LSR (ALSR) at the identity level by substituting Eq. (11) into the soft label used in $\mathcal{L}_{id}$.
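The following sketch builds the adaptive soft identity labels described above, consistent with the reconstruction of Eqs. (10)-(11). The exact parameterization (here, eps = alpha * (1 - p_gt)) is our reading of the text and should be treated as an assumption; names are illustrative.

```python
import torch

def adaptive_soft_labels(logits, labels, alpha=0.1):
    """Sketch of the adaptive label smoothing (ALSR) soft labels, Eqs. (10)-(11).

    The smoothing amount shrinks when the network is already confident on the
    ground-truth class: eps = alpha * (1 - p_gt), computed per sample (assumed form).
    """
    num_classes = logits.size(1)
    probs = torch.softmax(logits.detach(), dim=1)            # current predictions, no gradient
    p_gt = probs.gather(1, labels.unsqueeze(1))               # (B, 1) ground-truth probability
    eps = alpha * (1.0 - p_gt)                                # adaptive smoothing per sample
    soft = (eps / num_classes) * torch.ones_like(probs)       # eps/C on every class, as in Eq. (10)
    soft.scatter_add_(1, labels.unsqueeze(1), 1.0 - eps)      # plus 1 - eps on the ground truth
    return soft  # use with a soft cross-entropy: -(soft * log_softmax(logits)).sum(1).mean()
```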

- Viewpoint-Aware Adaptive Label Learning.

Assumption 2. The viewpoint of a person is in reality a continuous value rather than a hard, discrete one.

Based on this assumption, we extend the ALSR to a viewpoint-aware angular loss, i.e., viewpoint-aware ALSR (VALSR). As mentioned earlier, our proposed viewpoint-aware angular loss models two levels of distribution, i.e., the identity-level distribution (identity clusters) and the viewpoint-level distribution (viewpoint clusters). We split every identity label into three sub-categories according to the viewpoints (front, side, back), and thus each image is classified into one of $C \times V$ viewpoint-aware categories. We argue that, when assigning the soft label for the viewpoint-aware angular loss, the degree of regularization should vary according to the cluster that the label belongs to. If a viewpoint-aware soft label lies in the same identity cluster as the ground-truth label, a stronger relaxation (higher probability) should be assigned, as shown in Figure 3(b), because images in this cluster have a strong correlation and visual similarity with the ground truth. On the contrary, as also shown in Figure 3(b), if the soft label lies in a different identity cluster from the ground truth, a weaker relaxation should be assigned. As a result, given a training sample with identity label $y_i$ and viewpoint label $v_i$, the soft label for the viewpoint-aware ALSR is

$$q_{j,v} = \begin{cases} 1 - \varepsilon_{1} - \varepsilon_{2}, & j = y_i,\ v = v_i \\[2pt] \dfrac{\varepsilon_{1}}{V - 1}, & j = y_i,\ v \neq v_i \\[2pt] \dfrac{\varepsilon_{2}}{(C - 1)\,V}, & j \neq y_i \end{cases} \qquad (12)$$

where $\varepsilon_{1}$ controls the relaxation toward the other viewpoints of the same identity and $\varepsilon_{2}$ controls the relaxation toward other identities. We apply the viewpoint-aware adaptive LSR into the angular loss by substituting Eq. (12) into $\mathcal{L}_{va}$.
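A hedged sketch of constructing the VALSR soft labels over the $C \times V$ subclasses follows. The three-way split (ground-truth subclass, other viewpoints of the same identity, other identities) follows the prose and the reconstruction of Eq. (12); the exact weights eps1 and eps2, and their adaptive variants, are illustrative placeholders rather than the released formula.

```python
import torch

def viewpoint_aware_soft_labels(id_labels, view_labels, num_ids, num_views=3,
                                eps1=0.2, eps2=0.1):
    """Sketch of the VALSR soft label over C*V subclasses (Eq. (12) as reconstructed).

    The ground-truth subclass keeps most of the mass, the other viewpoints of the
    same identity receive the stronger relaxation eps1, and subclasses of other
    identities share the weaker relaxation eps2. Values of eps1/eps2 are illustrative.
    """
    batch = id_labels.size(0)
    num_sub = num_ids * num_views
    soft = torch.full((batch, num_sub), eps2 / ((num_ids - 1) * num_views))  # other identities
    for b in range(batch):
        j, v = id_labels[b].item(), view_labels[b].item()
        same_id = torch.arange(j * num_views, (j + 1) * num_views)
        soft[b, same_id] = eps1 / (num_views - 1)                  # same identity, other viewpoints
        soft[b, j * num_views + v] = 1.0 - eps1 - eps2             # ground-truth (identity, viewpoint)
    return soft
```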

3.3 Joint Global and Local Features

Furthermore, we expect to build a model that extracts both global and local features. The VA-reID method extracts features very well, especially for images with various viewpoints. Interestingly, we observe that the viewpoint label is more suitable for the whole body than for body parts, because body parts (e.g., the lower body or legs) look very similar across different viewpoints. Thus, we mainly apply our VA-reID method to the global feature.

In order to extract local features effectively, we adopt a classical multi-stripe pyramid structure, similar to [20, 25], for the local branches, as shown in Figure 4. Combining the global and local features effectively boosts the performance of the Re-ID model.
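For concreteness, the sketch below shows the kind of multi-stripe pooling commonly used for such a local branch [20, 25]: the backbone feature map is split into horizontal stripes and each stripe is pooled into its own descriptor. The stripe count and pooling choice are assumptions, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def global_and_stripe_features(feat_map, num_stripes=6):
    """Sketch of joint global/local feature extraction with horizontal stripes.

    feat_map: (B, C, H, W) backbone output. The global descriptor feeds the
    VA-reID branch, while each stripe descriptor feeds a per-stripe classifier
    (not shown), in the spirit of PCB/Pyramid-style local branches [20, 25].
    """
    global_feat = F.adaptive_avg_pool2d(feat_map, 1).flatten(1)              # (B, C)
    stripes = feat_map.chunk(num_stripes, dim=2)                             # split along height
    local_feats = [F.adaptive_avg_pool2d(s, 1).flatten(1) for s in stripes]  # (B, C) each
    return global_feat, local_feats
```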

Figure 4: Combining global and local features. The VA-reID method is used to extract global features, while the multi-stripe structure is used for local features. Xent: cross entropy.

4 Experiments

4.1 Datasets and Evaluation Metrics.

We annotate the viewpoint labels of two widely used benchmarks, Market-1501 and DukeMTMC-reID (the annotations are available at https://github.com/zzhsysu/VA-ReID). Viewpoints are divided into three categories: front, side, back. We evaluate our model on the two datasets. Notice that we use the viewpoint label only for training; during the test stage, we do not use any viewpoint label.

Market-1501 dataset contains 32,668 person images of 1,501 identities captured by six cameras. The training set is composed of 12,936 images of 751 identities, while the testing data is composed of the remaining images of 750 identities. In addition, 2,793 distractors are included in the testing data.

DukeMTMC-reID dataset contains 36,411 person images of 1,404 identities captured by eight cameras. They are randomly divided, with 702 identities as the training set and the remaining 702 identities as the testing set. In the testing set, for each identity in each camera, one image is picked for the query set while the rest remain in the gallery set.

Evaluation Metrics.

Two widely used evaluation metrics, mean average precision (mAP) and matching accuracy (Rank-1/Rank-5), are adopted in our experiments.

Category Method Market-1501 DukeMTMC-reID
mAP Rank-1 Rank-5 mAP Rank-1 Rank-5
stripe based PCB [20] 77.4 92.3 97.2 66.1 81.7 89.7
PCB+RPP [20] 81.6 93.8 97.5 69.2 83.3 90.5
MGN [22] 86.9 95.7 - 78.4 88.7 -
attention based HA-CNN [10] 75.7 91.2 - 63.8 80.5 -
ABD-Net [3] 88.28 95.60 - 78.59 89.00 -
human parsing SPReID [7] 83.36 93.68 97.57 73.34 85.95 92.95
DSA-reID [23] 87.6 95.7 98.4 74.3 86.2
metric learning Pyramid [25] 88.2 95.7 98.4 79.0 89.0 -
SRB(ResNet50) [13] 85.9 94.5 - 76.4 86.4 -
SRB(SeResNext101) [13] 88.0 95.0 - 79.0 88.4 -
HPM [6] 82.7 94.2 97.5 74.3 86.6 -
pose/view related OSCNN [4] 73.5 83.9 - - - -
PDC [18] 63.41 84.14 - - - -
PN-GAN [14] 72.58 89.43 - 53.20 73.58 -
PIE [26] 69.25 87.33 95.56 64.09 80.84 88.30
PGR [9] 77.21 93.87 97.74 65.98 83.63 91.66
This work Ours 91.70 96.23 98.69 84.51 91.61 96.23
Ours+reranking 95.43 96.79 98.31 91.82 93.85 96.50
Table 1: Performance (%) comparisons to the state-of-the-art results on Market-1501 and DukeMTMC-reID. Our proposed VA-reID model outperforms the state-of-the-art methods.

4.2 Implementation Details.

We resize images to a fixed size, as in many Re-ID systems. In the training stage, we set the batch size to 64 by sampling 16 identities and 4 images per identity. The SeResNeXt model pretrained on ImageNet is used as the backbone network. Common data augmentation strategies, including horizontal flipping, random cropping, padding and random erasing (with a probability of 0.5), are used. We adopt the Adam optimizer with weight decay to train our model. The total number of epochs is 200, with learning-rate milestones at fixed epochs. The learning rate is decayed by a factor of 0.1 when the epoch reaches a milestone. At the beginning, we warm up the model for 10 epochs, during which the learning rate grows linearly to its base value. The two smoothing parameters in the loss function are set to 0.1 and 0.2 (see the ablation study).
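A minimal sketch of the optimizer and learning-rate schedule described above (Adam, 10-epoch linear warmup, step decay by 0.1 at the milestones) is given below. The base learning rate, weight decay and milestone epochs are not recoverable from the text, so the values here are placeholders.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=3.5e-4, weight_decay=5e-4,
                                  warmup_epochs=10, milestones=(80, 140)):
    """Sketch of the training schedule: Adam + linear warmup + step decay."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                          # linear warmup over the first epochs
            return (epoch + 1) / warmup_epochs
        return 0.1 ** sum(epoch >= m for m in milestones)  # decay by 0.1 at each milestone

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```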

4.3 Comparison to the State-of-the-art.

We compare our proposed VA-reID model with state-of-the-art deep learning based methods, including: 1) the stripe-based methods PCB and MGN; 2) the metric learning related methods SRB, Pyramid and HPM; 3) the human semantic parsing based methods SPReID and DSA-reID; 4) the attention-based methods HA-CNN and ABD-Net; 5) the pose/view related methods OSCNN, PDC, PN-GAN, PIE and PGR. The results in Table 1 show that our model achieves state-of-the-art performance.

- Comparison to the Pose/View Related Methods. Our model outperforms the pose/view related methods. Without reranking, our model improves over the best pose/view related method, PGR, by 14.49%/2.36% in mAP/Rank-1 on Market-1501 and by 18.53%/7.98% in mAP/Rank-1 on DukeMTMC-reID.

- Comparison to the Metric Learning Related Methods. Our model outperforms the metric learning related methods. Without reranking, our model improves over the second best method, SRB, by 3.70%/1.23% in mAP/Rank-1 on Market-1501 and by 5.51%/3.21% in mAP/Rank-1 on DukeMTMC-reID.

- Comparison to Other Methods. Our model outperforms the stripe-based, human semantic parsing based and attention-based methods. Compared to the recent state-of-the-art method ABD-Net, our model achieves an improvement of 3.42%/0.63% in mAP/Rank-1 on Market-1501 and 5.92%/2.61% in mAP/Rank-1 on DukeMTMC-reID without reranking.

4.4 Ablation Study.

We perform a comprehensive ablation study to demonstrate: 1) the effectiveness of the adaptive label smoothing; 2) the effectiveness of the viewpoint-aware adaptive label smoothing; 3) the effectiveness of the center regularization; 4) the effectiveness of combining global and local features. Notice that the viewpoint-aware loss function is used only for the global branch. The loss of VA-reID is $\mathcal{L}_{alsr} + \mathcal{L}_{valsr} + \mathcal{L}_{c}$, where $\mathcal{L}_{alsr}$ is the identity loss with adaptive label smoothing, $\mathcal{L}_{valsr}$ is the viewpoint-aware loss with viewpoint-aware adaptive label smoothing, and $\mathcal{L}_{c}$ is the center regularization. We use the model trained only with $\mathcal{L}_{alsr}$ as the baseline and set the two smoothing parameters to 0.1 and 0.2. The performance (%) comparisons of the different modules on the Market-1501 and DukeMTMC-reID datasets are shown in Table 2.

- Effectiveness of Adaptive Label Smoothing. Comparing the results of the cross-entropy loss (Xent) and the label smoothing loss (LSR), we observe that label smoothing performs better. We use adaptive label smoothing (ALSR) as the basic loss, and it improves over label smoothing by 0.25%/0.35% in mAP/Rank-1 on Market-1501 and by 0.52%/0.45% in mAP/Rank-1 on DukeMTMC-reID. This is because ALSR replaces the one-hot hard label with an adaptive soft label for identity classification; the adaptive soft label helps to learn discriminative features while suppressing the negative impact of noise. This comparison demonstrates the effectiveness of the adaptive label smoothing.

- Effectiveness of Viewpoint-Aware Adaptive Label Smoothing. Compared to the baseline model, combining the viewpoint-aware adaptive label smoothing (VALSR) with the adaptive label smoothing (ALSR) achieves an improvement of 1.70%/0.67% in mAP/Rank-1 on Market-1501 and 0.87%/0.99% in mAP/Rank-1 on DukeMTMC-reID. This is because VALSR uses the viewpoint-aware adaptive soft label for identity-viewpoint classification. For each identity, the viewpoint-aware adaptive soft label helps to learn compact viewpoint-related feature embeddings. This comparison demonstrates the effectiveness of the viewpoint-aware adaptive label smoothing.

Method Market-1501 DukeMTMC-reID
mAP Rank-1 mAP Rank-1
Xent 86.30 94.31 76.70 86.94
LSR 86.72 94.35 77.47 87.43
$\mathcal{L}_{alsr}$ (baseline) 86.97 94.7 77.99 87.39
$\mathcal{L}_{alsr}$+$\mathcal{L}_{valsr}$ 88.67 95.37 78.86 88.38
$\mathcal{L}_{alsr}$+$\mathcal{L}_{c}$ 88.25 95.25 78.25 87.84
VA-reID 89.97 95.87 81.48 91.11
VA-reID+RR 95.09 96.32 90.66 92.46
VA-reID+local 91.70 96.23 84.51 91.61
VA-reID+local+RR 95.43 96.79 91.82 93.85
Table 2: Performance (%) comparisons of different modules on Market-1501 and DukeMTMC-reID datasets. RR: using reranking. Xent: cross entropy loss.

- Effectiveness of Center Regularization. Comparing the baseline model with the baseline plus $\mathcal{L}_{c}$, we observe that adding the center regularization to the baseline brings a slight improvement. With the full VA-reID loss, we further observe a significant improvement of 3.00%/1.17% in mAP/Rank-1 on Market-1501 and 3.45%/3.27% in mAP/Rank-1 on DukeMTMC-reID. This is because the center regularization and the viewpoint-aware adaptive soft label complement each other: for each identity, the viewpoint-aware adaptive soft label generates viewpoint feature clusters, while the center regularization pulls the centers of these clusters closer together. This comparison demonstrates the effectiveness of the center regularization.

- Effectiveness of Combining Global and Local Features. Without reranking, combining VA-reID with the local features brings a significant improvement of 1.73%/0.36% in mAP/Rank-1 on Market-1501 and 3.03%/0.50% in mAP/Rank-1 on DukeMTMC-reID over the VA-reID model alone. VA-reID uses the viewpoint-aware loss only for learning discriminative global features, while the local branch adopts the multi-stripe structure to learn fine-grained features. Adding local features thus improves performance further. This comparison demonstrates the effectiveness of combining global and local features.

4.5 Further Evaluations.

The following evaluations and analyses further verify the effectiveness of our method: (1) the influence of noisy viewpoint labels; (2) visualization of the influence of viewpoint variation on retrieval; (3) the effect of the hyperparameters; (4) viewpoint-based retrieval and complexity analysis (see the supplementary material, available at https://github.com/zzhsysu/VA-ReID).

- Influence of the Noisy Viewpoint Label. We examine the influence of noise in the viewpoint labels on the Re-ID performance. Table 3 shows three different ways of obtaining viewpoint labels and the corresponding performance:

1) VA-reID with P uses viewpoint labels predicted by a simple ResNet-50 classifier trained on 600 images from an extra viewpoint dataset. 2) VA-reID with PC uses viewpoint labels obtained by clustering pose information generated by OpenPose. 3) VA-reID with GT directly uses the ground-truth labels annotated by humans. The accuracy of the viewpoint classifier is 78.7% and the accuracy of the pose clustering is 44.57%.

From Table 3, we observe that the noise in the viewpoint labels has very little influence on our method. VA-reID with P achieves performance very similar to VA-reID with GT. Moreover, even with a viewpoint prediction accuracy of only 44.57%, the performance of VA-reID with PC does not drop much, and our method still achieves state-of-the-art performance on DukeMTMC-reID. These experiments verify that our proposed VA-reID method is robust to viewpoint label noise, and a simple classifier or clustering method is good enough to obtain the viewpoint information.

Method DukeMTMC-reID
mAP Rank-1 Rank-5
PN-GAN [14] 53.20 73.58 -
VA-reID with P 81.05 90.75 95.65
VA-reID with PC 80.69 90.35 95.83
VA-reID with GT 81.48 91.11 95.38
Table 3: Performance (%) comparisons for different generation methods of the viewpoint label. P: using predictive viewpoint labels. PC: using pose-based clustering viewpoint labels. GT: using ground-truth viewpoint labels.

- Visualization of Results. Figure 5 shows examples of retrieval results from the baseline method and from VA-reID. We observe that, in the case of large viewpoint variation between query and gallery, our method has much higher retrieval precision than the baseline. Taking the result in the first row as an example: for a query image with a front viewpoint, the baseline method correctly retrieves only two images with the same front viewpoint, while VA-reID successfully retrieves images from all three viewpoints. The results in the other rows further demonstrate the excellent performance on cross-viewpoint image retrieval.

Figure 5: Visual results of the baseline method and the VA-reID method. The red box represents the wrong result while the green box represents the correct result.

- Hyperparameter Analysis. Figure 6 shows how the two smoothing parameters affect the performance on the Market-1501 dataset. The performance of our method is stable over a wide range of both parameters, which further demonstrates the robustness of the adaptive label smoothing method.

Figure 6: Performance of the VA-reID method on Market-1501 with different hyperparameters. In (a), one smoothing parameter is fixed while the other varies; in (b), the roles are swapped.

5 Conclusion

This study proposes a novel method to address the viewpoint variation problem in Re-ID. Overall, we make two contributions to the community. Firstly, we propose the viewpoint-aware angular loss to learn viewpoint-aware feature embeddings on a unified hypersphere, which effectively models the feature distribution at both the identity level and the viewpoint level. Secondly, we propose a novel viewpoint-aware adaptive label smoothing method that relaxes the hard label assignment with adaptive soft labels. Experiments show the effectiveness of our method.

6 Acknowledgement

This work was supported partially by the National Key Research and Development Program of China (2016YFB1001002), NSFC(61522115,U1811461), Guangdong Province Science and Technology Innovation Leading Talents (2016TX03X157), and Guangzhou Research Project (201902010037).

References

  • [1] D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. C. Courville, Y. Bengio, and S. Lacoste-Julien (2017) A closer look at memorization in deep networks. ArXiv abs/1706.05394. Cited by: §3.2, §3.2.
  • [2] P. Chen, X. Xu, and C. Deng (2018-07) Deep view-aware metric learning for person re-identification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 620–626. Cited by: §1, §2.
  • [3] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang (2019) ABD-net: attentive but diverse person re-identification. arXiv preprint arXiv:1908.01114. Cited by: Table 1.
  • [4] Y. Chen, S. Duffner, A. Stoian, J. Dufour, and A. Baskurt (2018) Person re-identification with a body orientation-specific convolutional neural network. In International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 26–37. Cited by: §1, Table 1.
  • [5] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019-06) ArcFace: additive angular margin loss for deep face recognition. In CVPR, Cited by: §2, §3.2, §3.2.
  • [6] Y. Fu, Y. Wei, Y. Zhou, H. Shi, G. Huang, X. Wang, Z. Yao, and T. Huang (2019) Horizontal pyramid matching for person re-identification. In AAAI, Vol. 33, pp. 8295–8302. Cited by: Table 1.
  • [7] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah (2018-06) Human semantic parsing for person re-identification. In CVPR, Cited by: §2, Table 1.
  • [8] Q. Leng, M. Ye, and Q. Tian (2019) A survey of open-world person re-identification. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.
  • [9] J. Li, S. Zhang, Q. Tian, M. Wang, and W. Gao (2019) Pose-guided representation learning for person re-identification. IEEE transactions on pattern analysis and machine intelligence. Cited by: Table 1.
  • [10] W. Li, X. Zhu, and S. Gong (2018-06) Harmonious attention network for person re-identification. In CVPR, Cited by: Table 1.
  • [11] W. Liu, Y. Wen, Z. Yu, M. M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. CVPR, pp. 6738–6746. Cited by: §2, §3.2.
  • [12] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, Cited by: §2.
  • [13] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019-06) Bag of tricks and a strong baseline for deep person re-identification. In CVPRW, Cited by: Table 1.
  • [14] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y. Jiang, and X. Xue (2018-09) Pose-normalized image generation for person re-identification. In ECCV, Cited by: §1, §2, Table 1, Table 3.
  • [15] R. Ranjan, A. Bansal, H. Xu, S. Sankaranarayanan, J. Chen, C. D. Castillo, and R. Chellappa (2018) Crystal loss and quality pooling for unconstrained face verification and recognition. ArXiv abs/1804.01159. Cited by: §2.
  • [16] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In NIPS, Cited by: §2.
  • [17] M. Saquib Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen (2018-06) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, Cited by: §1, §2.
  • [18] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017-10) Pose-driven deep convolutional model for person re-identification. In CVPR, Cited by: §2, Table 1.
  • [19] X. Sun and L. Zheng (2019-06) Dissecting person re-identification from the viewpoint of viewpoint. In CVPR, Cited by: §2.
  • [20] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018-09) Beyond part models: person retrieval with refined part pooling. In ECCV, Cited by: §1, §3.3, Table 1.
  • [21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016-06) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §3.2.
  • [22] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. In ACM MM, pp. 274–282. Cited by: Table 1.
  • [23] Z. Zhang, C. Lan, W. Zeng, and Z. Chen (2019-06) Densely semantically aligned person re-identification. In CVPR, Cited by: Table 1.
  • [24] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017-10) Deeply-learned part-aligned representations for person re-identification. In The IEEE International Conference on Computer Vision, Cited by: §1.
  • [25] F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji (2019-06) Pyramidal person re-identification via multi-loss dynamic training. In CVPR, Cited by: §3.3, Table 1.
  • [26] L. Zheng, Y. Huang, H. Lu, and Y. Yang (2019) Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing. Cited by: Table 1.
  • [27] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [28] Y. Zhou and L. Shao (2018-06) Viewpoint-aware attentive multi-view inference for vehicle re-identification. In CVPR, Cited by: §2.