Discovering Underlying Person Structure Pattern with Relative Local Distance for Person Re-identification

01/29/2019 · Guangcong Wang, et al.

Modeling the underlying person structure for person re-identification (re-ID) is difficult due to diverse deformable poses, changeable camera views and imperfect person detectors. How to exploit underlying person structure information without extra annotations to improve the performance of person re-ID remains largely unexplored. To address this problem, we propose a novel Relative Local Distance (RLD) method that integrates a relative local distance constraint into convolutional neural networks (CNNs) in an end-to-end way. It is the first time that a relative local distance constraint is proposed to guide global feature representation learning. Specifically, a relative local distance matrix is computed from the feature maps and then regarded as a regularizer that guides the CNN to learn a structure-aware feature representation. With the discovered underlying person structure, the RLD method builds a bridge between the global and local feature representations and thus improves the capacity of the feature representation for person re-ID. Furthermore, RLD also significantly accelerates deep network training compared with conventional methods. The experimental results show the effectiveness of RLD on the CUHK03, Market-1501, and DukeMTMC-reID datasets. Code is available at <https://github.com/Wanggcong/RLD_codes>.


1 Introduction

Person re-identification (re-ID) aims to match pedestrian images across non-overlapping cameras given a probe image. Currently, most person re-ID models achieve strong performance by using deep learning for feature representation. However, it is still an extremely challenging task because a person's appearance often undergoes large intra-class variations, e.g., significant changes in illumination, background clutter, pose, viewpoint, and occlusion. Among these variations, the deformable pose is one of the most challenging problems because feature representations of person images are very sensitive to complex deformable poses (Figure 1 (a)). How to exploit person structure information to improve the performance of person re-ID remains an open problem.


Figure 1: Motivation. (a) Examples of deformable person structure. Each column denotes one person ID. The three image pairs show that the person structure may be affected by pose/view changes, partial body detections, and loose bounding box detections. (b) The proposed relative local distance matrix (computed between pair-wise patches, denoted as green arrows in (a)). Even though the person structure is deformable, the relative local distances of person patches are quite robust, i.e., a person image consists of head, upper body, lower body and feet from top to bottom. The relative local distances are learnable and can discover the underlying person structure, which builds a bridge between the global and local features and thus improves the performance of re-ID (Best viewed in color).

Recent research reveals that many pose-driven methods [35, 24, 22, 20, 30], which exploit human pose annotations from extra datasets, are greatly beneficial to improving the accuracy of person re-ID. These methods first train good pose estimators on human pose datasets and then predict pose keypoints on re-ID datasets. With the predicted pose keypoints, they can jointly learn both global full-body and discriminative local body-part features for re-ID systems. However, these methods need huge amounts of expensive human pose annotations, which are limited in real-world scenarios.

Motivated by this point, a question arises: can we directly model the person structure without any extra annotation? Obstacles to answering this question are the large pose deformations and the complex view variations. Existing re-ID methods that may offer a partial solution to this problem can be generally categorized into two groups.

In the first group, researchers attempt to exploit person structure information by simply assuming that person images are well aligned. Under this strong assumption, they directly divide person images into horizontal stripes or grids [7, 32, 39] for local feature representations, which are further concatenated into one representation for global metric learning. However, such an assumption does not hold in many real-world scenarios due to diverse poses, different camera views, and inevitable detection errors, as shown in Figure 1 (a). In these scenarios, simply extracting features from stripes or grids leads to a dramatic reduction in performance.

In the second group, instead of relying on the strong assumption of person part alignment, some studies [15, 2, 23] seek an effective approach to automatically align person body parts for local metric learning. For each local patch of a probe image, they find the most similar patch in a gallery image, and the similarity between two images depends on all of the corresponding patch pairs. However, the feature representation of local matching is dynamic and thus incurs expensive computation when the gallery set is large: with $m$ probe images and $n$ gallery images, patch matching methods need $O(mn)$ on-line feature extraction operations, while conventional matching methods need only $O(m+n)$ operations (the $n$ gallery operations can be performed off-line). Besides, it is hard to integrate local matching into CNNs because the underlying assignment problem (e.g., the Hungarian algorithm) is non-differentiable and thus cannot be optimized end-to-end with gradient descent. These two drawbacks limit their applications in real-world scenarios.

Different from these two kinds of re-ID methods, we propose a novel Relative Local Distance (RLD) method that integrates a person structure constraint into CNNs in an end-to-end way. Specifically, RLD extends conventional classification networks by adding a small branch that exploits person structure patterns. To achieve this, a relative local distance matrix is computed between different column feature vectors of the feature maps and used to represent the structure of a person. After several fully connected layers, a classification loss is used to learn a structure-aware feature representation for each person identity. With the discovered underlying person structure, RLD builds a bridge between the global and local features and thus improves the capacity of the feature representation for person re-ID. In addition, RLD significantly accelerates deep network training while adding only a small overhead to conventional re-ID systems. Note that RLD differs substantially from the second-group methods because the local distances in RLD are computed between different patches within one image, whereas the second-group methods perform patch matching between image pairs. As far as we know, this is the first time the relative local distance is proposed to exploit object structure for facilitating feature learning.

Compared with conventional re-ID methods, the proposed RLD method has four distinguishing features. First, RLD improves the performance of person re-ID without any pose annotation. Second, the person structure branch adds only a small overhead, e.g., 0.8% of the entire network's parameters. Third, RLD significantly accelerates deep network training. Fourth, the person structure in RLD is learnable and requires no strong alignment assumption, so it can also be easily generalized to other object recognition problems.

Overall, this paper makes three main contributions.

  • First, it is the first time the relative local distance (RLD) is proposed to exploit the underlying person structure and facilitate feature learning without any pose annotation for person re-ID systems.

  • Second, with the joint training of the person identity loss and person structure loss, RLD significantly accelerates deep network training compared with conventional methods and improves the performance of person re-ID with a very small overhead.

  • Third, the experimental results show the effectiveness of RLD on the CUHK03, Market-1501, and DukeMTMC-reID datasets.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces our relative local distance model. The experimental comparisons, ablation studies, and analyses are presented in Section 4. Section 5 concludes the paper.

2 Related Work

Person re-identification (re-ID) [39, 38, 33, 17] has been studied extensively in the past ten years. Currently, deep models, which adopt the identity loss [37, 27, 10], verification loss [15, 1, 5, 9], triplet loss [8, 28, 12, 7] or other metric functions [18, 29, 4], are successfully employed in person re-ID and achieve promising results due to the remarkable representation ability of CNNs.

Recently, researchers [35, 24, 22, 20, 30] have attempted to exploit person structure information to learn local discriminative features. To this end, they train human pose estimators on pose-annotated datasets and use these estimators to help the feature learning of person re-ID. For example, a spindle network [35] uses human landmark annotations for body joint localization and body region generation, which further guide multi-stage feature decomposition and tree-structured competitive feature fusion. A pose-driven CNN model [24] leverages human pose information and transforms a global body image into an image containing normalized part regions for feature learning. In addition, Sarfraz et al. [22] incorporate both fine and coarse pose information of the person to learn a discriminative embedding by explicitly including this information in the learning process of a re-ID model. However, these methods depend on huge amounts of expensive human pose annotations.

Instead of using pose-annotated datasets to train pose estimators, many approaches directly split person images into stripes or grids [7, 32, 39] for local feature representations, which can be concatenated into one feature vector for further global metric learning. In [32], person images are cropped into three overlapping stripes that are used to train three independent networks, which are then fused at the score level for metric learning. Similarly, [7] splits a person image into three stripes and jointly learns both global full-body and local body-part features to improve the performance of person re-ID. These models assume that person bodies are well aligned, which may fail in real-world scenarios.

Differently, some studies [23, 2, 15] attempt to conduct local metric learning to align the person structure. For example, Shen et al. [23] introduced a boosting-based approach to learn a correspondence structure that indicates the patch-wise matching probabilities between images from a target camera pair. In [2], a patch-based deformable model is proposed to combine an appearance term with a deformation cost that controls the relative placement of patches. Li et al. [15] proposed a filter pairing neural network to handle misalignment and geometric transforms by integrating a patch matching layer into the network. Our RLD method differs from these patch-based matching methods in that RLD focuses on self-structure exploration, whereas patch-based matching methods focus on matching between image pairs. Patch-based matching methods are not off-line algorithms and cannot extract gallery features in advance; when the gallery set is very large, extracting deep features for person retrieval becomes prohibitively time-consuming.

3 Method


Figure 2: Overview of the relative local distance model. RLD uses ResNet50 as the convolutional backbone architecture, containing 4 building blocks. After the backbone, there are two branches, i.e., a global identity branch and a person structure branch. The global identity branch consists of a 512-dim FC layer, a Batch Normalization layer, a ReLU layer, a dropout layer, a $C$-dim FC layer ($C$ denotes the number of person identities) and a Cross Entropy layer. The person structure branch contains similar layers; the only difference is that its first FC layer is 256-dimensional. GP denotes a global average pooling operation and HP denotes a horizontal pooling operation (Best viewed in color).

The proposed Relative Local Distance (RLD) method aims to exploit the underlying spatial structure of an object. Currently, a common practice in state-of-the-art Convolutional Neural Networks (CNNs) is to add a global average pooling (GAP) layer at the top of the network to avoid over-fitting. However, this strategy partially drops the spatial structure information. In Section 3.1, we explain why GAP partially drops structure information. To compensate for this, we introduce a novel relative local distance to describe the person structure in Sections 3.2 and 3.3. In Sections 3.4 and 3.5, the proposed network architecture and implementation details are discussed, respectively.

3.1 Drawbacks of Global Average Pooling.

Let $I$ represent an RGB image of size $H \times W \times 3$ and $f(\cdot)$ represent a convolutional neural network. The output feature representation $X$ of size $h \times w \times c$ is computed by $X = f(I)$. We define a column feature vector as the $c$-dimensional vector of $X$ at one spatial location, so $X$ contains $h \times w$ column feature vectors. After $X$, several fully connected layers are directly used in some well-known network architectures, e.g., AlexNet and VGG. To avoid over-fitting, the NIN network [19] proposed to replace the traditional fully connected layers with global average pooling. After that, the identity (classification) loss is computed by

$L_{id} = -\sum_{i=1}^{C} y_i \log p_i$    (1)

where $y_i$ is the $i$-th dimension value of the one-hot label $y$, and $p_i$ represents the predicted probability of the $i$-th person identity. However, directly pooling the feature maps into a single $c$-dimensional feature vector may discard some discriminative local features and lead to confusion of the object structure, especially for fine-grained recognition tasks like person re-ID. For example, consider two persons with a similar appearance whose only subtle difference is the type of shoes, as shown in Figure 2. Without person structure constraints on $X$, GAP may average away such subtle differences between person body parts. As empirical evidence of the information loss in GAP, our observation is two-fold.
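For concreteness, the following PyTorch sketch illustrates GAP followed by the identity loss of Eq. 1; the tensor shapes, identity count, and classifier layer are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (not the authors' released code) of GAP followed by the identity loss of Eq. 1.
feat_maps = torch.randn(32, 2048, 8, 4)      # X: batch x c x h x w from a CNN backbone (assumed shape)
num_ids = 751                                # C, the number of person identities (assumed)

gap_feat = F.adaptive_avg_pool2d(feat_maps, 1).flatten(1)  # batch x c; spatial layout is averaged away
classifier = nn.Linear(2048, num_ids)
logits = classifier(gap_feat)

labels = torch.randint(0, num_ids, (32,))
identity_loss = F.cross_entropy(logits, labels)            # L_id = -sum_i y_i log p_i
```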

Observation 1: spatial information of $X$ is available. We compute a spatial feature map of size $h \times w$ by averaging over the $c$ channels and use this spatial map as the feature representation for person retrieval. Without the help of column feature vectors, we obtain 1.9% rank-1 accuracy on the CUHK03 dataset (compared with 0% for random ranking). This implies that spatial feature maps contain discriminative information that can be further exploited. The $c$-dimensional column feature vectors cannot fully encode all of the spatial information, and directly using GAP to average-pool all $h \times w$ column feature vectors leads to spatial information loss.

Observation 2: column feature vectors are diverse. We compute relative local distances between different column feature vectors of $X$ using the GAP-trained model and observe that neighboring column feature vectors are similar while distant column feature vectors are very different. This demonstrates that $X$ contains different kinds of discriminative feature representations for different image regions, so simply applying average pooling is not the best choice when we want to capture fine-grained discriminative regions.
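Both observations can be checked with a few lines of PyTorch; the feature-map shape below is an assumption for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative check of the two observations (shapes are assumptions).
X = torch.randn(2048, 8, 4)                  # feature maps of one image: c x h x w

# Observation 1: averaging over channels keeps a spatial map that still carries location information.
spatial_map = X.mean(dim=0)                  # h x w map, usable as a (weak) retrieval descriptor

# Observation 2: column feature vectors at different locations are diverse.
cols = F.normalize(X.flatten(1).t(), dim=1)  # (h*w) x c unit-norm column feature vectors
sim = cols @ cols.t()                        # pairwise cosine similarities between locations
print(sim[0, 1].item(), sim[0, -1].item())   # neighbors tend to be more similar than distant pairs
```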

From these two observations, we believe that although GAP avoids the over-fitting problem to some extent, it also partially drops spatial information. Considering the drawbacks of GAP, we are encouraged to develop an effective method to capture discriminative spatial information, e.g., the person structure information.

3.2 Relative Local Distance

Although there are diverse deformable person structures, an underlying object skeleton exists. For example, a person image consists of head, upper body, lower body and foot from top to bottom. To address the structure-confusion problem of GAP, an intuitive approach is to introduce a structure constraint for person re-ID. In this paper, we propose a learnable person structure by introducing a relative local distance without using pose annotations. Inspired by Observation 2, we model an underlying person structure using relative local distance comparison.

Given the output feature maps $X$ of size $h \times w \times c$, we first use horizontal pooling to deal with viewpoint changes and obtain vertical feature maps of size $h \times 1 \times c$. This is reasonable because different camera views may lead to dramatic horizontal changes of the person structure [17]. We then explore robust vertical person structures to guide the global feature representation learning. Specifically, we split the pooled maps into $h$ column feature vectors with $c$ dimensions, denoted as $x_i$, where $i = 1, \dots, h$. We define the relative local distance between two column feature vectors by computing the cosine distance

$d_{ij} = \dfrac{x_i^{\top} x_j}{\|x_i\|_2 \, \|x_j\|_2}$    (2)

where the relative local distance $d_{ij}$ builds a relationship between the two local features $x_i$ and $x_j$. To simplify computation, we re-write Eq. 2 in matrix form

$D = \tilde{X}^{\top} \tilde{X}$    (3)

where $\tilde{X} = [x_1/\|x_1\|_2, \dots, x_h/\|x_h\|_2]$ stacks the normalized column feature vectors and $D \in \mathbb{R}^{h \times h}$ is the relative distance matrix, describing the similarities between different person body parts.
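A compact PyTorch sketch of Eqs. 2-3 is given below; the function name and the backbone output shape are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def relative_local_distance(feat_maps: torch.Tensor) -> torch.Tensor:
    """Compute the h x h relative local distance matrix of Eqs. 2-3.

    feat_maps: batch x c x h x w backbone output. A sketch assuming the cosine
    form of Eq. 2; names are illustrative, not from the released code.
    """
    pooled = feat_maps.mean(dim=3)                      # HP: horizontal pooling -> batch x c x h
    cols = F.normalize(pooled, dim=1)                   # normalize every column feature vector x_i
    return torch.einsum('bci,bcj->bij', cols, cols)     # D[i, j] = cos(x_i, x_j), batch x h x h

D = relative_local_distance(torch.randn(32, 2048, 8, 4))   # e.g., 32 x 8 x 8
```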

3.3 Learnable Person Structure

Mathematically, the relative distance matrix $D$ is actually a complete structured probabilistic graph in which we can model all kinds of interactions between the random variables. Suppose a body part $x_i$ is a random variable (node) and $d_{ij}$ is the interaction (edge) between the $i$-th and $j$-th random variables. A graph $G_k$ denotes a distribution over the $h$-dimensional space, describing the dependencies of body parts for the $k$-th person. We use $G_1, G_2, \dots, G_C$ graphs to describe $C$ person identities. As a simple example, suppose one person wears a white shirt and blue pants while another person wears a white shirt and black pants. Let $h = 2$; we obtain

$D_1 = \begin{bmatrix} 1 & d_1 \\ d_1 & 1 \end{bmatrix}, \quad D_2 = \begin{bmatrix} 1 & d_2 \\ d_2 & 1 \end{bmatrix}$    (4)

where $d_1$ is the similarity score between the white shirt and the blue pants, and $d_2$ is the similarity score between the white shirt and the black pants. In this way, we can distinguish different person appearances by using relative local distance matrices, which differ a lot from absolute appearance feature representations.

$D$ is robust to diverse deformable person structures because person bodies are structured in a consistent order along the vertical axis. Affected by camera views, deformable poses and imperfect detectors, the same body parts of a person collected from different camera views can differ a lot in location, size and shape. To this end, we reshape the $h \times h$ matrix $D$ into a feature vector and add two fully connected layers and a person structure loss to make the person structure learnable. By doing this, we can learn a local structure-aware feature representation that helps the global feature representation. The person structure loss in our network is the softmax loss,

$L_{s} = -\sum_{i=1}^{C} y_i \log q_i$    (5)

where $y_i$ is the $i$-th dimension value of the one-hot label $y$, and $q_i$ represents the predicted probability of the $i$-th person structure.

Finally, according to Eq. 1 and Eq. 5, we define a structure-aware identity loss by

$L = L_{id} + \lambda L_{s}$    (6)

where $\lambda$ controls the relative importance of the identity loss and the person structure loss.
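A minimal PyTorch sketch of the structure branch and the combined loss of Eq. 6 follows; the layer sizes match Section 3.4, while the batch size and identity count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the person structure branch and the combined loss of Eq. 6 (not the released code).
h, num_ids, lam = 8, 751, 0.2                     # feature-map height, identity count, lambda (assumed h, C)

structure_branch = nn.Sequential(
    nn.Linear(h * h, 256),                        # reshaped D -> 256-dim FC
    nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(),
    nn.Linear(256, num_ids),                      # C-dim FC for the person structure loss
)

D = torch.randn(32, h, h)                         # relative local distance matrices
identity_logits = torch.randn(32, num_ids)        # output of the global identity branch
labels = torch.randint(0, num_ids, (32,))

L_s = F.cross_entropy(structure_branch(D.flatten(1)), labels)   # Eq. 5
L_id = F.cross_entropy(identity_logits, labels)                 # Eq. 1
loss = L_id + lam * L_s                                         # Eq. 6 with lambda = 0.2
```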

3.4 Network Architecture

RLD is conceptually simple. Conventional classification networks have a single output for person identity prediction. RLD extends this pipeline and adds a branch with a small overhead that makes the network aware of the underlying person structure, so that it learns a structure-aware feature representation. An overview of the RLD framework is shown in Figure 2.

To demonstrate the generality of RLD, we use different architectures: ResNet18, ResNet34, ResNet50, and ResNet101. To simplify the discussion, we take ResNet50 as an example; the other architectures are discussed in Section 4. As shown in Figure 2, RLD uses ResNet50 as the convolutional backbone architecture, containing 4 building blocks. After the backbone, two branches are used, i.e., a global identity branch and a person structure branch. The global identity branch consists of a 512-dim FC layer, a Batch Normalization layer, a ReLU layer, a dropout layer, a $C$-dim FC layer ($C$ denotes the number of person identities) and a Cross Entropy layer. The person structure branch contains similar layers; the only difference is that the dimension of its first fully connected layer is 256.
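Putting the pieces together, the following module sketches the two-branch architecture of Figure 2 under the stated layer sizes; the backbone slicing and the assumed feature-map height are our guesses, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class RLDNet(nn.Module):
    """Minimal two-branch sketch of Figure 2; hidden sizes follow the text,
    everything else (e.g., feature-map height h) is an assumption tied to the input size."""

    def __init__(self, num_ids: int, h: int = 8):
        super().__init__()
        backbone = resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # 4 residual blocks, no GAP/FC
        self.id_branch = nn.Sequential(
            nn.Linear(2048, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(),
            nn.Linear(512, num_ids))
        self.struct_branch = nn.Sequential(
            nn.Linear(h * h, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(),
            nn.Linear(256, num_ids))

    def forward(self, images: torch.Tensor):
        X = self.backbone(images)                              # batch x 2048 x h x w
        gap = F.adaptive_avg_pool2d(X, 1).flatten(1)           # GP: global identity feature
        cols = F.normalize(X.mean(dim=3), dim=1)               # HP + normalization
        D = torch.einsum('bci,bcj->bij', cols, cols)           # relative local distance matrix
        return self.id_branch(gap), self.struct_branch(D.flatten(1))
```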

3.5 Implementation Details

The training images are augmented with horizontal flip and normalization, resized, and then cropped at the center with a small random perturbation. We use SGD with a mini-batch size of 32 and train RLD for 60 epochs. The learning rate starts from 0.1 and is decayed to 0.01 after 40 epochs. The backbone model is pre-trained on ImageNet, and the learning rate for all pre-trained layers is set to 0.1 times the base learning rate. We set $\lambda$ to 0.2. During testing, images are resized, and we still extract global pooling features for the feature representation.
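An optimizer sketch for these settings is shown below. The momentum and weight-decay values, the exact parameter-group split, and the training loop are assumptions; the learning rates, schedule, epochs and batch size follow the text, and `RLDNet` refers to the sketch in Section 3.4.

```python
import torch

# Training configuration sketch for Section 3.5 (hedged: several values are assumptions).
model = RLDNet(num_ids=751)
base_lr = 0.1
optimizer = torch.optim.SGD(
    [{'params': model.backbone.parameters(), 'lr': 0.1 * base_lr},   # pre-trained layers at 0.1x base lr
     {'params': list(model.id_branch.parameters())
                + list(model.struct_branch.parameters()), 'lr': base_lr}],
    momentum=0.9, weight_decay=5e-4)                                  # momentum/weight decay assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)  # 0.1 -> 0.01 after epoch 40

# for epoch in range(60):                     # 60 epochs, mini-batch size 32, lambda = 0.2 in Eq. 6
#     train_one_epoch(model, optimizer)       # hypothetical training loop
#     scheduler.step()
```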

4 Experiment

In this section, we evaluate our RLD method on three large-scale person re-ID benchmark datasets, i.e., CUHK03, Market-1501 and DukeMTMC-reID, and present ablation studies to reveal the importance of each main component of our method. We then compare the RLD model with state-of-the-art methods. We use CMC and mAP for evaluation. Note that we use the same settings, including all hyper-parameters, when comparing with the baseline, and the baseline follows current works; that is, the baseline simply sets $\lambda$ to 0 in Eq. 6. To eliminate bias, we repeat each experiment 4 times and report the average performance.

Datasets. The CUHK03 dataset contains 13,164 images of 1,467 identities. Each identity is observed by 2 cameras. It offers both hand-labeled and DPM-detected bounding boxes, and we use the latter in this paper. We adopt the new training/testing protocol proposed in [42].

The Market-1501 dataset was collected at Tsinghua University with six cameras, and overlap exists among different cameras. Overall, this dataset contains 32,668 annotated bounding boxes of 1,501 identities. Among them, 12,936 images from 751 identities are used for training, and 19,732 images from 750 identities plus distractors are used as the gallery. As for the query set, 3,368 hand-drawn bounding boxes from 750 identities are adopted. Each annotated identity is present in at least two cameras.

The DukeMTMC-reID dataset has 8 cameras. There are 1,404 identities appearing in more than two cameras and 408 identities (distractor IDs) who appear in only one camera. Specifically, 702 IDs are selected as the training set and the remaining 702 IDs are used as the testing set. In the testing set, one query image is picked for each ID in each camera and the remaining images are put in the gallery. In this way, there are 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images (702 IDs + 408 distractor IDs).

4.1 Evaluation and Model Analysis

To provide more insights into the performance of our approach, we conduct ablation studies by isolating each main component of our method.

Figure 3: Comparisons of convergence on (a) CUHK03, (b) Market-1501, and (c) DukeMTMC-reID.

Effect of the person structure branch. To show the benefit of the person structure branch, we conduct an ablation study by isolating this constraint. We do this by removing the person structure branch and thus the framework is degraded as a baseline. In this experiment, we use ResNet-50 as the backbone architecture. As shown in Table 1, compared with the baseline, it is observed that the improvement in mAP is +5.5%, +3.3%, and +2.8% on CUHK03, Market-1501 and DukeMTMC-reID, respectively. The improvement in rank-1 accuracy is +5.7%, +1.8%, and +1.8%, respectively. Therefore, we conclude that the person structure branch works.

Methods     | CUHK03              | Market-1501         | DukeMTMC-reID
            | r1   r5   r10  mAP  | r1   r5   r10  mAP  | r1   r5   r10  mAP
baseline    | 46.8 67.5 76.5 43.0 | 86.6 94.8 96.6 68.0 | 75.9 87.3 90.7 56.9
RLD (ours)  | 52.5 72.3 79.8 48.5 | 88.4 95.2 96.9 71.3 | 77.7 88.4 91.4 59.7
Table 1: Effect of the person structure branch.

Comparison of convergence. To demonstrate that the person structure constraint accelerates deep network training, we compare the RLD framework with the baseline that does not contain the person structure branch, using ResNet-50 as the backbone architecture. The comparison of identity losses is shown in Figure 3. Our RLD obtains faster convergence than the baseline on all three datasets. Note that the learning rate is decayed to 0.01 after 40 epochs, after which both identity losses are close to 0. We conclude that the person structure constraint guides the global feature representation learning.

Effect of network architectures. To show the effect of our RLD method, we instantiate RLD with different network architectures, i.e., ResNet-18, ResNet-34, ResNet-50 and ResNet-101, as shown in Table 2. We conduct four experiments on CUHK03. Compared with the baseline, we can see that our RLD method obtains 4.9%, 3.3%, 5.7% and 3.5% improvement in rank-1 accuracy on ResNet-18, ResNet-34, ResNet-50 and ResNet-101, respectively. The improvement in mAP is 5.2%, 4.0%, 5.5% and 3.3%, respectively. Therefore, we conclude that our RLD branch can be generalized to different network architectures.

Methods     | ResNet-18      | ResNet-34      | ResNet-50      | ResNet-101
            | r1   r5   mAP  | r1   r5   mAP  | r1   r5   mAP  | r1   r5   mAP
baseline    | 39.9 61.2 36.3 | 46.0 66.4 41.9 | 46.8 67.5 43.0 | 50.0 71.3 46.3
RLD (ours)  | 44.8 66.0 41.5 | 49.3 68.8 45.9 | 52.5 72.3 48.5 | 53.5 72.6 49.6
Table 2: Effect of network architectures.

Parameters of the person structure branch. One may wonder whether the better performance should be attributed to the increase in model parameters. To investigate this point, we compute the number of parameters of the different components, using ResNet-50 as the backbone architecture. As shown in Table 3, the person structure branch is a small overhead, i.e., 0.8% of the entire model parameters. We conclude that the improvement of RLD comes from the person structure constraint rather than from the increase in model parameters.

        | baseline | person structure branch
#params | 25.0M    | 0.2M (0.8% of the total)
Table 3: Parameters of the person structure branch.

Influence of parameter $\lambda$. To investigate the impact of the important parameter $\lambda$ in Eq. 6, we conduct a sensitivity analysis on CUHK03, using ResNet-50 as the backbone architecture. As shown in Figure 4, when $\lambda$ is in the range of 0.2-0.3, our model nearly keeps its best performance.


Figure 4: Influence of parameter $\lambda$ (Best viewed in color).

Comparison of visualization. To show that our RLD method can capture the person structure information, we randomly select 5 images and visualize the RLD matrices with/without using the person structure branch for training, as shown in Figure 5. Without the person structure constraint, the strong dependencies of the RLD matrix elements are scattered; with it, the strong dependencies are close to the diagonal. This verifies the fact that local dependencies are often stronger than remote ones for the person structure. We conclude that RLD can guide CNNs to learn a structure-aware feature representation. Please refer to the supplementary materials for more examples.


Figure 5: Comparison of visualization. Each column shows an image and its corresponding outputs with/without using the person structure branch for training. (Best viewed in color).

4.2 Comparisons to the State-of-the-Art

In this section, we compare our RLD with state-of-the-art methods.

Methods            | R1   | R5   | R10  | mAP
BoW+kissme [38]    | 44.4 | 63.9 | 72.2 | 20.8
WARCA [13]         | 45.2 | 68.1 | 76.0 | -
KLFDA [14]         | 46.5 | 71.1 | 79.9 | -
SOMAnet [3]        | 73.9 | -    | -    | 47.9
SVDNet [25]        | 82.3 | 92.3 | 95.2 | 62.1
PAN [40]           | 82.8 | -    | -    | 63.4
Transfer [11]      | 83.7 | -    | -    | 65.5
Triplet Loss [12]  | 84.9 | 94.2 | -    | 69.1
DML [34]           | 87.7 | -    | -    | 68.8
MultiRegion [26]   | 66.4 | 85.0 | 90.2 | 41.2
HydraPlus [21]     | 76.9 | 91.3 | 94.5 | -
PAR [36]           | 81.0 | 92.0 | 94.7 | -
MultiLoss [16]     | 83.9 | -    | -    | 64.4
PDC* [24]          | 84.4 | 92.7 | 94.9 | 63.4
PartLoss [31]      | 88.2 | -    | -    | 69.3
RLD (ours)         | 88.4 | 95.2 | 96.9 | 71.3
RLD+Era [43]       | 90.2 | 96.4 | 97.8 | 76.1
Table 4: Comparison of the proposed method with the state of the art on Market-1501. The compared methods are categorized into 3 groups. Group 1: hand-crafted methods. Group 2: deep learning methods employing global features. Group 3: deep learning methods employing part features. * denotes methods requiring extra part labels. Our method is denoted by RLD.
Methods           | CUHK03 rank-1 | CUHK03 mAP | Duke rank-1 | Duke mAP
BoW+kissme [38]   | 6.4           | 6.4        | 25.1        | 12.2
LOMO+XQDA [17]    | 12.8          | 11.5       | 30.8        | 17.0
GAN [41]          | -             | -          | 67.7        | 47.1
PAN [40]          | 36.3          | 34.0       | 71.6        | 51.5
SVDNet [25]       | 41.5          | 37.3       | 76.7        | 56.8
MultiScale [6]    | 40.7          | 37.0       | 79.2        | 60.6
TriNet+Era [43]   | 55.5          | 50.7       | 73.0        | 56.6
SVDNet+Era [43]   | 48.7          | 43.5       | 79.3        | 62.4
RLD (ours)        | 52.5          | 48.5       | 77.7        | 59.7
RLD + Era [43]    | 56.2          | 52.2       | 79.5        | 63.4
Table 5: Comparison with prior art on CUHK03 and DukeMTMC-reID. Rank-1 accuracy (%) and mAP (%) are shown.

Evaluations on Market-1501. We compare the proposed RLD and RLD+Era with fifteen existing state-of-the-art methods, which can be grouped into three categories, i.e., handcrafted feature methods, deep learning methods with global features, and deep learning methods with part features. Among them, PDC requires extra part labeling to align parts. Differently, our RLD can learn a person structure-aware feature representation without using extra annotation. Comparisons on Market-1501 are shown in Table 4.

Evaluations on CUHK03 and DukeMTMC-reID. We compare our RLD and RLD+Era methods with eight state-of-the-art methods on the CUHK03 and DukeMTMC-reID (Duke. for short) datasets, respectively. As shown in Table 5, it is encouraging to see that our approach significantly outperforms the competing methods. For example, RLD+Era obtains good performance on both CUHK03 (56.2% rank-1 and 52.2% mAP) and DukeMTMC-reID (79.5% rank-1 and 63.4% mAP), while SVDNet+Era and TriNet+Era each obtain good performance on only one of the two datasets.

Remarks. The RLD method is easy to implement. Without bells and whistles, our RLD model achieves results competitive with the state of the art. While outside the scope of this work, we expect many complementary techniques (e.g., modifying the last convolutional kernel, the image size, multi-loss training, or re-ranking) to be applicable to our method.

5 Conclusion

In this paper, we propose a novel Relative Local Distance (RLD) method to model the underlying person structure for person re-identification (re-ID). To this end, a relative local distance matrix is introduced to learn a structure-aware feature representation. With the discovered underlying person structure, the RLD method builds a bridge between the global and local features and thus improves the capacity of the feature representation for person re-ID. In summary, our contribution is three-fold. First, it is the first time the relative local distance is proposed to exploit the underlying person structure without using any pose annotation for person re-ID. Second, with the joint training of the person identity loss and person structure loss, RLD significantly accelerates deep network training and improves the performance of person re-ID. Third, the experimental results show the effectiveness of RLD on three benchmark datasets.

We intend to extend this work in two directions. First, we intend to generalize the relative local distance model to general object recognition in an end-to-end way. We also hope this insight can be further developed in the fine-grained classification problem. Second, we intend to study how to further exploit discriminative features from the last convolutional layers instead of using global average pooling.

References