
Deep-Person: Learning Discriminative Deep Features for Person Re-Identification

Recently, many methods of person re-identification (Re-ID) rely on part-based feature representations to learn a discriminative pedestrian descriptor. However, the spatial context between these parts is ignored, since an independent extractor is applied to each separate part. In this paper, we propose to apply Long Short-Term Memory (LSTM) in an end-to-end way to model the pedestrian, seen as a sequence of body parts from head to foot. Integrating the contextual information strengthens the discriminative ability of the local representation. We also leverage the complementary information between local and global features. Furthermore, we integrate both the identification task and the ranking task in one network, where a discriminative embedding and a similarity measurement are learned concurrently. This results in a novel three-branch framework named Deep-Person, which learns highly discriminative features for person Re-ID. Experimental results demonstrate that Deep-Person outperforms the state-of-the-art methods by a large margin on three challenging datasets: Market-1501, CUHK03, and DukeMTMC-reID. Specifically, combined with a re-ranking approach, we achieve a 90.84% mAP on Market-1501 under the single query setting.


1 Introduction

Person re-identification (Re-ID) refers to the task of matching a specific person across multiple non-overlapping cameras. It has been receiving increasing attention in the computer vision community thanks to its various surveillance applications. Despite decades of study, person Re-ID remains very challenging due to inaccurate person bounding box detection and large variations in illumination, pose, background clutter, occlusion, and ambiguity in visual appearance. Discriminative features focusing mainly on the full person are indispensable to cope with these challenges.

Most early works in person Re-ID focus either on discriminative hand-crafted feature representations or on robust distance metrics for similarity measurement. Benefiting from the development of deep learning and increasingly large-scale datasets zheng2017unlabeled ; zheng2015scalable ; li2014deepreid, recent person Re-ID methods combine feature extraction and distance metric learning into an end-to-end deep convolutional neural network (CNN). Nevertheless, most recent CNN-based methods endeavor either to design a better feature representation or to develop a more robust feature learning, but rarely both aspects together.

Figure 1: Visualization of CNN-learned feature maps with a traditional global-based method (a) and our proposed Deep-Person (b). The heat map reveals the regions the network focuses on to compute features for person Re-ID. Deep-Person tends to focus mainly on the whole body, with accurate and detailed features aligned to the person. Conversely, without the context between parts, the traditional method suffers from occlusion, blurring, and background clutter.

The CNN-based methods focusing on better feature representations can be roughly divided into three categories: 1) Global full-body representation, which is adopted in many methods li2014deepreid ; chen2017multi. Global average pooling is widely used for such global feature extraction, which decreases the granularity of features, thus resulting in missing local details (see Fig. 1 (a)); 2) Local body-part representation, which has been exploited in many works with various part partitions. A straightforward partition into predefined rigid body parts is used in many works li2017JLML ; ustinova2017multi ; shi2016embedding ; ahmed2015improved. This may make the learned features focus on some person details. Yet, due to pose variations, imperfect pedestrian detectors, and occlusion, such a trivial partition fails to correctly learn features aligned to the full person, leading to part-based features that are far from robust. Some recent works endeavor to develop better body partitions with sophisticated methods yao2017deep ; li2017learning ; wu2018what or by using extra pose annotations zhao2017spindle ; su2017pose. Although these part-based methods can enrich the generated features to better describe some person details, they all ignore the contextual information between the body parts, still failing to align well to the full person and suffering from occlusion, blurring, and background noise. In varior2016siamese, the authors propose to first convert the original person image into sequential LOMO and Color Names features, then rely on a Recurrent Neural Network (RNN) to model the spatial context. Yet, the separation of sequence feature extraction and spatial context modeling hinders end-to-end training and optimization, resulting in degraded performance; 3) Combination of global and local representations cheng2016person ; li2017learning ; su2017pose ; li2017JLML ; wu2016enhanced, which concatenates the global and part-based features as the final feature representation. This combined representation usually requires more computation and extra space in the test phase due to the extra branch compared to a single-branch model, yielding slower runtime in practice.

The methods dedicated to robust feature learning usually consider the person Re-ID problem as either a classification task or a ranking task. Thanks to the recently increasing large-scale Re-ID datasets, person Re-ID is regarded as a multi-class person identification task in many works li2014deepreid ; zheng2016wild ; zheng2016person. The obtained ID-discriminative Embedding (IDE) given by the penultimate fully connected layer has shown great potential for person Re-ID. Yet, the training objective of the identification task is not totally consistent with the testing procedure. The learned IDE may be optimal for the identification model on the training set, but not for describing unseen person images at test stage. Furthermore, the identification task does not explicitly learn the similarity measurement required for retrieval during testing. On the other hand, the ranking task aims to make the distance between positive pairs closer than that of negative pairs by a given margin. Therefore, a similarity measurement is explicitly learned. Yet, the identity information of the annotated Re-ID data is not fully utilized. Recently, some works chen2017multi ; wang2016joint ; liu2017end leverage both classification and ranking tasks with a triplet loss to learn more discriminative features for person Re-ID.

In this paper, we propose to model the pedestrian as a sequence of body parts from head to foot, and to learn all part features together with the spatial contextual information rather than using an independent branch for each separate part. For that, we apply Long Short-Term Memory (LSTM) hochreiter1997long in an end-to-end way, enhancing the discriminative capacity of the local features, which align better to the full person thanks to the prior knowledge of body structure. We also adopt a global full-body representation in the proposed Deep-Person model. We feed the global and part-based features into two separate network branches for identification tasks. Different from the classical combination of global and local representations cheng2016person ; li2017learning ; su2017pose ; li2017JLML ; wu2016enhanced, which concatenates global and part-based features as the final feature representation for person Re-ID, we further add a ranking-task branch using the triplet loss to explicitly learn the similarity measurement. More specifically, we use the globally average pooled backbone feature (shown in Fig. 2) for similarity estimation, which is also taken as the final pedestrian descriptor. Such a three-branch Deep-Person model learns highly discriminative features for person Re-ID.

The main contributions of this paper are threefold: 1) We propose to regard the pedestrian as a sequence of body parts from head to foot, and apply LSTM in an end-to-end fashion to take into account the contextual information between body parts, enhancing the discriminative capacity of local features that align better to the full person; 2) We develop a novel three-branch framework which leverages two kinds of complementary advantages: local body-part and global full-body feature representations, as well as the identification task and the ranking task for better feature learning. The proposed Deep-Person yields highly discriminative features for person Re-ID; 3) In the test phase, the proposed Deep-Person only performs a forward pass of the backbone network followed by a global average pooling. Consequently, compared to a single-branch model, our model requires no additional runtime or space during testing, while still outperforming the state-of-the-art methods by large margins on three popular Re-ID datasets.

2 Related Work

Some methods consider person Re-ID as a special case of the image retrieval problem, i.e., given a probe image, the framework ranks all gallery images based on their distances to the probe in the projected space, then returns the top most similar images. In this sense, they tend to focus on robust distance metrics, such as wang2018equidistance ; zhao2017multiple ; liu2018m3l ; ZHAO201879 ; CHENG2018 ; ZHAO201890. Yet in this paper, we concentrate on a high-quality pedestrian descriptor through a better feature representation and more robust feature learning. We thus focus on two types of closely related deep learning methods for person Re-ID: those relying on part-based representations and those focusing on multi-loss learning. For a complete review of person Re-ID methods, the interested reader is referred to zheng2016person. One novelty of the proposed Deep-Person lies in the use of LSTM to model the pedestrian seen as a sequence of body parts from head to foot. We also briefly review some related work using LSTM for sequence modeling.

Part-based person Re-ID approaches

Many methods use part-based representations to learn discriminative features for person Re-ID. According to the part partition strategy, the part-based approaches can be roughly divided into two categories: 1) Rigid body part partition, which has been widely adopted in many methods cheng2016person ; li2017JLML ; ustinova2017multi ; shi2016embedding ; wu2016enhanced ; chen2016similarity ; varior2016siamese, where the authors use predefined rigid grids as local parts. Each part is fed into an individual branch, and all the individual part features are then concatenated together as the final part-based representation; 2) Flexible body part partition, which localizes more appropriate parts. For example, Yao et al. yao2017deep use an unsupervised method to generate a set of part boxes, and then employ RoI pooling to produce part features. Spatial Transformer Networks (STN) jaderberg2015spatial with novel spatial constraints are applied to localize deformable person parts in li2017learning. With extra pose annotations, Zhao et al. zhao2017spindle utilize the learned body joints to obtain subregion bounding boxes. Su et al. su2017pose further extend zhao2017spindle by normalizing the pose-based parts into a fixed size and orientation, and by introducing a Pose Transformation Network (PTN) to eliminate pose variations.

Figure 2: Illustration of the Deep-Person architecture. Given a triplet of images, each image is fed into a part-based identification branch and a global-based identification branch. Meanwhile, a distance ranking branch is applied on the globally pooled backbone feature using the triplet loss function. Note that each image of the triplet is fed into the same backbone network; the duplication of the backbone is only for visualization purposes. The two colors of Global Average Pooling denote the same operation; different colors are only used to distinguish different branches.

Although rigid body parts are simple to implement, due to inaccurate pedestrian detection and occlusion, such a trivial partition is not beneficial for learning discriminative features that should be aligned to the full person. The representation based on flexible body parts improves the full-person alignment to a certain extent by localizing appropriate parts. Yet, such methods usually have a much more complex pipeline or require extra prior knowledge (e.g., human pose). Furthermore, the existing methods based on either rigid or flexible body parts all ignore the relationship between the body parts. However, the pedestrian can always be decomposed into a sequence of body parts from head to foot. Based on this simple yet important property, we propose to learn all the parts together in a sequential manner rather than discarding the spatial context with independent part feature extractors. For that, we naturally apply LSTM for the sequence-level modeling. LSTM is also used in varior2016siamese to model the spatial contextual information between body parts. However, the authors first divide the input image into rigid parts and extract hand-crafted features for each individual part, then use LSTM to model the spatial relationship in a separate step. The proposed Deep-Person jointly integrates deep feature extraction and sequence modeling in an end-to-end fashion, leading to more discriminative features focusing mainly on the full body for person Re-ID.

Approaches based on joint multi-loss learning

Recently, Zheng et al. zheng2016person argue that person Re-ID lies in between the instance retrieval task and the image classification task. From the first point of view, person Re-ID is regarded as a ranking task, where a ranking loss is adopted for feature learning. In cheng2016person, a new term is added to the original triplet loss to further pull the instances of the same person closer. Hermans et al. hermans2017defense introduce a variant of the standard triplet loss using hard mining within the batch. From the point of view of the classification task, the person Re-ID problem is usually solved with a Softmax loss. There are two ways to perform person Re-ID as a classification task. The first one is known as the verification network li2014deepreid, which takes a pair of images as input and determines whether they belong to the same identity by a binary classification network. The second one is called the identification network zheng2016person, namely a multi-class recognition network, where each individual is regarded as an independent category.

The classification task and the ranking task are complementary to each other. Some approaches optimize the network simultaneously with both types of loss functions, leveraging the complementary advantages of these two tasks. For example, in chen2017multi ; wang2016joint, the triplet loss and verification loss are trained together. The identification loss and verification loss are simultaneously optimized in geng2016deep ; qian2017multi. The combination of triplet loss and identification loss is adopted in liu2017end to optimize a comparative attention network. The proposed Deep-Person also relies on the triplet loss and identification loss. Different from liu2017end, an identification loss on the novel part-based representation is adopted in addition to the identification loss on the global representation, further boosting the discriminative ability of the learned features for person Re-ID.

Sequence modeling using LSTM

LSTM has been widely used in many sequence-based problems, such as image captioning xu2015show, machine translation sutskever2014sequence, speech recognition graves2013speech, and text recognition shi2017end. LSTM has also shown high potential in image classification Wang2016unified and object detection bell2016inside, where it models the spatial dependencies and captures richer contextual information. Unlike these typical CNN-RNN frameworks that use an LSTM to model the context and learn a better representation directly, the LSTM in our network is only used to describe the person structure as a sequence. The backbone CNN features (implicitly integrating both global and part information) are our final representation for person Re-ID, influenced by the LSTM branch through the backward pass. In this sense, the LSTM branch in our network can be considered as a special “loss function” mainly designed for parts, complementary to the two other loss functions: the Softmax and triplet losses. To our knowledge, this usage of LSTM has not appeared in previous CNN-RNN approaches.

3 Architecture of Deep-Person

3.1 Overview of Deep-Person

The proposed Deep-Person model focuses on both feature representation and feature learning. It is built upon two kinds of complementary designs detailed in the following: 1) Global representation and part-based local representation; 2) Softmax-based identification branch and ranking branch with the triplet loss.

Recent advances in person re-identification rely on deep learning to learn discriminative features from detected pedestrians. As pointed out in li2017learning, the learned representation of the full body focuses more on global information such as shape. However, in some cases, only certain body parts such as the head, upper body, or lower body are discriminative for person re-identification cheng2016person ; shi2016embedding ; varior2016siamese. In this sense, the part-based local representation of the detected pedestrian is complementary to the global representation. We propose to use an LSTM-based RNN to naturally model the spatial dependency between the parts of the pedestrian body, seen as a sequence of body parts from head to foot. Combining the complementary global representation and the LSTM-based local representation is expected to strengthen the discriminative ability of the learned features for person re-identification.

A common component of most recent deep models for person Re-ID is the Softmax-based identification branch that distinguishes different IDs based on the learned deep features. Yet, the training objective of the identification branch is not totally consistent with the goal of person Re-ID, which is to pair each probe with one of the gallery images. This is because the identification branch does not explicitly learn the similarity measurement required for person Re-ID. As advocated in liu2017end, a distance ranking branch with the triplet loss helps to learn a similarity measure between a pair of images. In this sense, the identification branch and the distance ranking branch with the triplet loss constitute another complementarity.

Our proposed Deep-Person model leverages the above two kinds of complementarities. The overall architecture is depicted in Fig. 2. It is composed of two main components: (1) a backbone network that learns a shared low-level feature map F; (2) a multi-branch network that learns a highly discriminative pedestrian representation thanks to three complementary branches: the part-based identification branch, the global-based identification branch, and the distance ranking branch using the triplet loss. A joint learning strategy is adopted to simultaneously optimize each branch's feature representation and discover correlated complementary information.

Figure 3: Each vector in the feature sequence describes the region of the corresponding receptive field in the raw image.

3.2 Part-Sequence Learning Using LSTM

It has been shown by many methods that part-based representations learn features focusing on some person details, which is useful for person Re-ID. Most part-based methods roughly decompose the extracted pedestrian into predefined rigid body parts, approximately corresponding to head, shoulder, upper body, upper leg, and lower leg, respectively. Each segmented part is then fed into an individual branch to learn the corresponding local feature. This may give interesting results in some cases. Yet, processing each part individually ignores the spatial dependencies between different parts, which are useful for learning discriminative and robust features focusing mainly on the whole body for person Re-ID.

We notice that a pedestrian in an image can be decomposed into a sequence of body parts from head to foot. Even though each part does not always lie at the same position in different images, all the pedestrian parts can be modeled in a sequential manner thanks to the prior knowledge of body structure. This sequential representation of the pedestrian naturally motivates us to resort to an RNN with LSTM cells, detailed in Fig. 4. The recurrent connections between the LSTM units enable the RNN to yield features based on the historical inputs; therefore, the learned features at any point are refined by the spatial contexts. Moreover, benefiting from its internal gating mechanisms, LSTM can control the information flow from the current state to the next state. Consequently, the LSTM cell can propagate relevant contexts while filtering out irrelevant parts. Based on the above insights, we propose to adopt LSTM to model the sequence of body parts for person Re-ID.

More specifically, to capture the spatial contexts, we extract a sequence of feature vectors directly from the shared low-level feature map F without any explicit segmentation. As depicted in Fig. 2, each row of F undergoes an average pooling over its width, which results in a corresponding feature vector of length C, the number of channels. As illustrated in Fig. 3, each feature vector describes a rectangular region in the raw image given by the corresponding receptive field. A two-layer Bidirectional LSTM (BLSTM) is then built upon this feature sequence. Thanks to the context modeling of the LSTM, each resulting feature vector (of length 2d, where d is the number of hidden units per direction) may better describe its associated part. Finally, all the resulting feature vectors characterizing the underlying local parts are concatenated together as the final part-based person representation. This part-based feature is learned via a Softmax layer with N output neurons, where N denotes the number of pedestrian identities.
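Since the implementation is built on PyTorch (cf. Sec. 4.1), a minimal sketch of this part branch is given below. The sizes follow the values quoted in Sec. 4.1 (C = 2048, T = 8 rows, d = 256; num_ids = 751 matches Market-1501's training identities); the per-part embedding size part_dim and the single shared classifier layout are illustrative assumptions rather than the authors' released code.

```python
import torch.nn as nn


class PartSequenceBranch(nn.Module):
    """Sketch: slice the backbone feature map into horizontal rows, pool each
    row into one vector, and model the head-to-foot sequence with a BLSTM."""

    def __init__(self, channels=2048, seq_len=8, hidden=256,
                 part_dim=256, num_ids=751):
        super().__init__()
        self.blstm = nn.LSTM(channels, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Fully-connected layer attached after each hidden state (cf. Fig. 4).
        self.fc = nn.Linear(2 * hidden, part_dim)
        # Softmax classifier over the concatenated part features.
        self.classifier = nn.Linear(seq_len * part_dim, num_ids)

    def forward(self, feat_map):        # feat_map F: (B, C, T, W)
        seq = feat_map.mean(dim=3)      # average pool each row: (B, C, T)
        seq = seq.permute(0, 2, 1)      # head-to-foot sequence: (B, T, C)
        ctx, _ = self.blstm(seq)        # context-aware vectors: (B, T, 2*hidden)
        parts = self.fc(ctx)            # per-part features: (B, T, part_dim)
        return self.classifier(parts.flatten(1))  # ID logits for the Softmax loss
```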

Figure 4: The structure of the LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate, and the forget gate. At time step t, the LSTM takes the t-th slice feature as well as the previous hidden state as inputs, and predicts a feature vector (generated by attaching a fully-connected layer after the hidden state).

3.3 Global Representation Learning

Part-based features focus more on discriminative pedestrian details. Yet it is difficult to distinguish two different identities with very similar visual details (e.g., wearing the same clothes) using the part-based representation alone. In this case, shape information is required to distinguish them. Indeed, the global feature is complementary to the part-based local feature and contains more high-level semantics, such as shape. Similar to many deep neural networks for person Re-ID, we straightforwardly extract a global representation by inserting a global average pooling and a fully-connected layer after the shared low-level feature map F. A Softmax layer with N output neurons is then appended for global feature learning.
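A matching sketch of the global branch, under the same assumptions as above (the embedding size of the fully-connected layer is our placeholder):

```python
import torch.nn as nn


class GlobalBranch(nn.Module):
    """Sketch: global average pooling over the shared feature map F,
    a fully-connected embedding, then an N-way Softmax classifier."""

    def __init__(self, channels=2048, embed_dim=256, num_ids=751):
        super().__init__()
        self.fc = nn.Linear(channels, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, feat_map):            # feat_map F: (B, C, H, W)
        g = feat_map.mean(dim=(2, 3))       # global average pooling: (B, C)
        return self.classifier(self.fc(g))  # ID logits for the Softmax loss
```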

3.4 Deep Metric Learning with Triplet Loss

The part-based and global representation learning described in Sec. 3.2 and 3.3 does not explicitly learn a similarity measure, which is required for person Re-ID at test time. We therefore propose a third branch in the Deep-Person model that is responsible for distance ranking. For that, we apply another independent global average pooling to the shared low-level feature map F, which results in a feature in the metric space for similarity estimation. This feature is also adopted as the final pedestrian descriptor for person Re-ID.

More specifically, we adopt the improved triplet loss for deep metric learning proposed in hermans2017defense. The main idea lies in forming so-called PK batches by randomly sampling P classes (person identities), and then randomly sampling K images of each class (person), thus resulting in a batch of P×K images. Given such a mini-batch, we can constitute a triplet $(x_a, x_p, x_n)$ for each selected anchor image $x_a$, where $x_p$ (resp. $x_n$) denotes a positive (resp. negative) sample in the mini-batch having the same (resp. a different) person ID as $x_a$. Since a hard triplet mining strategy is crucial for learning with the triplet loss, only the hardest positive and hardest negative samples in the mini-batch are selected for each anchor to form the triplets for loss computation:

$$L_{triplet} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p=1,\dots,K} D\big(f(x_a^i), f(x_p^i)\big) - \min_{\substack{j=1,\dots,P,\ n=1,\dots,K \\ j \neq i}} D\big(f(x_a^i), f(x_n^j)\big) \Big]_+ \qquad (1)$$

where $[\cdot]_+ = \max(0, \cdot)$, $m$ is a margin enforced between positive and negative pairs, and $D(\cdot, \cdot)$ is the distance function between two feature vectors. In this paper, we use the Euclidean distance as the distance metric. This objective encourages the features of positive pairs to be closer in the learned feature space than those of negative pairs by the predefined margin $m$.
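The batch-hard mining of Eq. (1) can be written compactly on a P×K mini-batch. The sketch below averages over anchors instead of summing (a common normalization choice) and assumes the features come from the global average pooling of the ranking branch:

```python
import torch
import torch.nn.functional as F


def batch_hard_triplet_loss(feats, labels, margin=0.5):
    """Eq. (1) with batch-hard mining (hermans2017defense): for each anchor,
    pick its farthest positive and closest negative within the mini-batch."""
    dist = torch.cdist(feats, feats)                  # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Hardest positive: largest distance among samples sharing the identity.
    d_pos = dist.masked_fill(~same_id, float('-inf')).max(dim=1).values
    # Hardest negative: smallest distance among samples of other identities.
    d_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values
    return F.relu(margin + d_pos - d_neg).mean()      # [.]_+ hinge, averaged
```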

3.5 Model Training

As discussed in Sec. 3.1, the different branches have complementary strengths for learning a discriminative pedestrian descriptor. To leverage these complementarities, we jointly train the whole network to predict person identity for both the part-based and global feature learning while also satisfying the triplet objective. There are two identification subnets built with the Softmax loss for the multi-class person identification task, defined as:

$$L_{softmax} = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log \frac{e^{W_{y_i}^{T} f_i + b_{y_i}}}{\sum_{k=1}^{N} e^{W_k^{T} f_i + b_k}} \qquad (2)$$

where $f_i$ is the classification feature of the $i$-th sample, $y_i$ is the identity of the $i$-th sample, $N_s$ is the number of samples, and $W_k$ and $b_k$ are respectively the weight and bias of the classifier for the $k$-th identity. To simplify the notation and without ambiguity, we write $L_{part}$ (resp. $L_{global}$) for the corresponding part-based (resp. global-based) classification loss. The final loss is then given by:

$$L = \lambda_1 L_{part} + \lambda_2 L_{global} + \lambda_3 L_{triplet} \qquad (3)$$

where $\lambda_i$ ($i \in \{1, 2, 3\}$) denotes the loss weight of each branch.
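Putting Eq. (2) and Eq. (3) together, the joint objective can be sketched as follows, reusing the batch_hard_triplet_loss helper from Sec. 3.4; F.cross_entropy realizes the Softmax loss of Eq. (2), and all three weights are set to 1 in our experiments (Sec. 4.1):

```python
import torch.nn.functional as F


def deep_person_loss(part_logits, global_logits, pooled_feats, labels,
                     w_part=1.0, w_global=1.0, w_rank=1.0):
    """Weighted sum of Eq. (3): two Softmax identification losses plus the
    batch-hard triplet loss on the pooled backbone feature."""
    l_part = F.cross_entropy(part_logits, labels)      # Eq. (2), part branch
    l_global = F.cross_entropy(global_logits, labels)  # Eq. (2), global branch
    l_rank = batch_hard_triplet_loss(pooled_feats, labels, margin=0.5)
    return w_part * l_part + w_global * l_global + w_rank * l_rank
```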

4 Experiments

4.1 Implementation Details

The proposed Deep-Person model is built on the PyTorch framework. The backbone network is the ResNet-50 he2016deep model pre-trained on ImageNet, where the global average pooling and fully connected layers are discarded. The parameter C, the number of channels of F (see Fig. 2), is set to 2048, and d, the number of hidden units of the BLSTM, is set to 256. Depending on the height of the input images and the backbone network, the length T of the part sequence is set to 8 in this paper.
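For reference, stripping the classifier head from a torchvision ResNet-50 can be sketched as below; with 256×128 inputs the remaining trunk outputs the shared map F of size 2048×8×4, which matches C = 2048 and T = 8:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ImageNet-pretrained ResNet-50 with its global average pooling and
# fully-connected layers discarded, keeping only the convolutional trunk.
resnet = resnet50(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2])

feat_map = backbone(torch.randn(1, 3, 256, 128))  # F: (1, 2048, 8, 4)
```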

We follow common data augmentation strategies with different scales and aspect ratios to train the Deep-Person model. Concretely, all training images are first resized to 256×128. Then, we randomly crop each resized image with a scale in the interval [0.64, 1.0] and an aspect ratio in [2, 3]. Finally, the cropped patch is resized again to 256×128. Random horizontal flipping with a probability of 0.5 is also applied. During the testing phase, images are simply resized to 256×128. The globally pooled feature, input of the distance ranking branch, is used as the pedestrian descriptor for retrieval. Before feeding an input image to the network, we follow the image normalization procedure of the original ResNet-50 model trained on ImageNet, subtracting the mean value and then dividing by the standard deviation.
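Assuming standard torchvision transforms, this pipeline can be sketched as follows. Note that reading the paper's aspect ratio [2, 3] as height/width (persons are taller than wide) is our assumption; torchvision's ratio argument is width/height, hence the inverted interval:

```python
from torchvision import transforms

# ImageNet statistics of the pre-trained ResNet-50.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_transform = transforms.Compose([
    transforms.Resize((256, 128)),
    # Crop with scale in [0.64, 1.0] and height/width ratio in [2, 3],
    # then resize back to 256x128.
    transforms.RandomResizedCrop((256, 128), scale=(0.64, 1.0),
                                 ratio=(1 / 3, 1 / 2)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    normalize,
])

test_transform = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.ToTensor(),
    normalize,
])
```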

As described in Sec. 3.4, P×K samples are used to form a batch, where P and K denote respectively the number of sampled classes and the number of instances of each class. To train our Deep-Person model, we redefine the notion of “epoch” such that the trained batches of one “epoch” cover approximately all the identities. Jointly training such a network is not trivial at the beginning of the training stage; we apply gradient clipping to avoid gradient explosion. Adam with the default hyper-parameters ($\epsilon = 10^{-8}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) is used to minimize the loss function given in Eq. (3), where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are all set to 1. The margin $m$ in Eq. (1) is set to 0.5. Following the common learning rate decay schedule of hermans2017defense, we adopt its initial learning rate and apply the decay schedule at epoch 100. The total number of training epochs for all conducted experiments is set to 150. The mini-batch size is set to 128 for all experiments. Following the different average numbers of samples per identity in the different datasets detailed in Sec. 4.2, K is set respectively to 8, 4, and 8 for Market-1501, CUHK03, and DukeMTMC-reID.
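The P×K batch construction can be sketched with a custom sampler. This class is an illustration of the sampling scheme described above (e.g. P = 16, K = 8 gives the batch size of 128 used on Market-1501), not the authors' code; it would be used with a DataLoader whose batch_size equals p * k:

```python
import random
from collections import defaultdict

from torch.utils.data import Sampler


class PKSampler(Sampler):
    """Yield indices so that each batch holds P identities with K images each;
    one pass over roughly all identities forms the redefined "epoch"."""

    def __init__(self, labels, p=16, k=8):
        self.index_by_id = defaultdict(list)
        for idx, pid in enumerate(labels):   # labels: person ID per image index
            self.index_by_id[pid].append(idx)
        self.ids = list(self.index_by_id)
        self.p, self.k = p, k
        self.num_batches = max(1, len(self.ids) // p)

    def __iter__(self):
        for _ in range(self.num_batches):
            for pid in random.sample(self.ids, self.p):
                pool = self.index_by_id[pid]
                # Sample with replacement when an identity has fewer than K images.
                yield from (random.sample(pool, self.k) if len(pool) >= self.k
                            else random.choices(pool, k=self.k))

    def __len__(self):
        return self.num_batches * self.p * self.k
```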

4.2 Datasets and Evaluation Protocol

Datasets

We evaluate our proposed method, Deep-Person, on three widely used large-scale datasets: Market-1501 zheng2015scalable, CUHK03 li2014deepreid, and DukeMTMC-reID zheng2017unlabeled ; ristani2016performance. A brief description of each is given as follows:

Market-1501: This dataset consists of 32,668 images of 1,501 identities captured by six cameras in front of a supermarket. The provided pedestrian bounding boxes are given by the Deformable Part Model (DPM) felzenszwalb2008discriminatively and further validated by manual annotation. The dataset is divided into a training set consisting of 12,936 images of 751 identities and a testing set containing the other 19,732 images of 750 identities. For each identity in the testing set, only one image from each camera is selected as a query, resulting in 3,368 query images in total.

CUHK03: This dataset contains 14,097 images of 1,467 identities shot by six cameras on the CUHK campus. It provides two types of annotations: manually labeled pedestrian bounding boxes and automatic detections given by the DPM detector. We conduct experiments on both types of annotations, named the labeled and detected datasets, respectively. This dataset also offers 20 random splits, each of which selects 100 test identities and uses the other 1,367 persons for training. The average performance over all splits is reported for evaluation on this dataset.

DukeMTMC-reID: This dataset is a subset of DukeMTMC ristani2016performance for image-based re-identification, in the format of the Market-1501 dataset. It is composed of 36,411 images of 1,812 different identities taken by 8 high-resolution cameras, where 1,404 identities appear in more than two cameras and the other 408 identities are regarded as distractor IDs. Among the 1,404 identities, 16,522 images of 702 identities are used for training; the other 702 identities are divided into 2,228 query images and 17,661 gallery images.

Evaluation Protocol

We follow the standard evaluation protocols. Concretely, the matching rates at rank-1, rank-5, and rank-10 are reported for CUHK03. The cumulative matching characteristic (CMC) at rank-1 and the mean average precision (mAP) are used for performance evaluation on Market-1501 and DukeMTMC-reID. Following most related works, the evaluation on CUHK03 and DukeMTMC-reID is performed under the single query setting; both single and multiple query settings are used for Market-1501.
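For completeness, a minimal sketch of the per-query CMC and average precision computation under the single query setting is given below; the standard junk/same-camera filtering of each benchmark is omitted for brevity, and dist_row, q_pid, and g_pids are hypothetical inputs:

```python
import numpy as np


def rank_one_query(dist_row, q_pid, g_pids):
    """dist_row: distances from one probe to all gallery images;
    q_pid: probe identity; g_pids: gallery identities (np.ndarray)."""
    order = np.argsort(dist_row)              # rank the gallery by distance
    matches = (g_pids[order] == q_pid).astype(np.float32)
    cmc = np.cumsum(matches).clip(max=1)      # cmc[r-1] = 1 iff a hit in top r
    hits = np.flatnonzero(matches)
    # Average precision: mean precision at each correct match position.
    ap = float(np.mean((np.arange(len(hits)) + 1) / (hits + 1))) if len(hits) else 0.0
    return cmc, ap

# mAP averages `ap` over all queries; the CMC curve averages `cmc`.
```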

4.3 Comparison with Related Methods

We compare the proposed method, Deep-Person, with the state-of-the-art approaches on Market-1501, CUHK03, and DukeMTMC-reID datasets. The proposed Deep-Person consistently outperforms the state-of-the-art methods on all three datasets. The details are given as follows:

Methods rank1 mAP
BoW+KISSME zheng2015scalable 25.13 12.17
LOMO+XQDA liao2015person 30.75 17.04
LSRO zheng2017unlabeled 67.68 47.13
AttIDNet lin2017improving 70.69 51.88
PAN zheng2017pedestrian 71.59 51.51
ACRN schumann2017person 72.58 51.96
SVDNet sun2017svdnet 76.70 56.80
DPFL chen2017person 79.20 60.60
Deep-Person (Ours) 80.90 64.80
Table 1: Quantitative comparison with state-of-the-art methods on DukeMTMC-reID dataset.
Methods Single Query Multiple Query
rank1 mAP rank1 mAP
OL-MANS zhou2017efficient 60.67 - 66.80 -
DNS zhang2016learning 61.02 35.68 71.56 46.03
Gated S-CNN varior2016gated 65.88 39.55 76.04 48.45
CRAFT-MFA chen2017personPAMI 68.70 42.30 77.70 50.30
P2S zhou2017point 70.72 44.27 85.78 55.73
CADL lin2017consistent 73.84 47.11 80.85 55.58
Spindle zhao2017spindle 76.90 - - -
MSCAN li2017learning 80.31 57.53 86.79 66.70
SVDNet sun2017svdnet 82.30 62.10 - -
Part-Aligned zhao2017deeply 81.00 63.40 - -
PDC su2017pose 84.14 63.41 - -
JLML li2017JLML 85.10 65.50 89.70 74.50
LSRO zheng2017unlabeled 83.97 66.07 88.42 76.10
SSM bai2017scalable 82.21 68.80 88.18 76.18
TriNet hermans2017defense 84.92 69.14 90.53 76.42
DPFL chen2017person 88.60 72.60 92.20 80.40
Deep-Person (Ours) 92.31 79.58 94.48 85.09
Table 2: Comparison with state-of-the-art results on Market-1501.
Methods Labeled Detected
r1 r5 r10 r1 r5 r10
EDM shi2016embedding 61.3 88.9 96.4 52.1 82.9 91.8
OL-MANS zhou2017efficient 61.7 88.4 95.2 62.7 87.6 93.8
DNS zhang2016learning 62.5 90.0 94.8 54.7 84.7 94.8
Gated S-CNN varior2016gated - - - 68.1 88.1 94.6
GOG matsukawa2016hierarchical 67.3 91.0 96.0 65.5 88.4 93.7
DictRW cheng2017DictRW 71.1 91.7 94.7 - - -
MSCAN li2017learning 74.2 94.3 97.5 68.0 91.0 95.4
MTDnet chen2017multi 74.7 96.0 97.5 - - -
Quadruplet chen2017beyond 75.5 95.2 99.2 - - -
SSM bai2017scalable 76.6 94.6 98.0 72.7 92.4 96.1
MuDeep qian2017multi 76.9 96.1 98.4 75.6 94.4 97.5
SVDNet sun2017svdnet - - - 81.8 95.2 97.2
CRAFT-MFA chen2017personPAMI - - - 84.3 97.1 98.3
LSRO zheng2017unlabeled - - - 84.6 97.6 98.9
JLML li2017JLML 83.2 98.0 99.4 80.6 96.9 98.7
Part-Aligned zhao2017deeply 85.4 97.6 99.4 81.6 97.3 98.4
Spindle zhao2017spindle 88.5 97.8 98.6 - - -
PDC su2017pose 88.7 98.6 99.2 78.3 94.8 97.2
Deep-Person (Ours) 91.5 99.0 99.5 89.4 98.2 99.1
Table 3: Evaluation on CUHK03 in terms of rank-1 (r1), rank-5 (r5), and rank-10 (r10) matching rate, using manually labeled pedestrian bounding boxes and automatic detections by DPM.

Evaluation on DukeMTMC-reID

The comparison with state-of-the-art methods on the DukeMTMC-reID dataset is given in Table 1. Our Deep-Person outperforms all the state-of-the-art approaches, achieving an improvement of 1.7% in rank-1 accuracy and 4.2% in mAP. It is worth noting that the previous state-of-the-art DPFL chen2017person takes multi-scale person images as input; equipped with such a multi-scale design, the performance of Deep-Person could be expected to improve further.

Evaluation on Market-1501

As depicted in Table 2, the proposed Deep-Person achieves compelling results on the Market-1501 dataset. Concretely, our Deep-Person improves the state-of-the-art results by about 7% in mAP and 3.7% in rank-1 matching rate under the single query mode. When combined with an effective re-ranking approach zhong2017re, the performance is further boosted, reaching 90.84% mAP. To the best of our knowledge, this is the first time that an mAP higher than 90% is achieved on the Market-1501 dataset under the single query setting. We observe similar improvements under the multiple query setting on this dataset, with gains of 4.7% in mAP and 2.3% in rank-1.

Evaluation on CUHK03

The evaluation of Deep-Person on the CUHK03 dataset in terms of rank-1, rank-5, and rank-10 matching rates is given in Table 3. Using the manually annotated pedestrian bounding boxes, our Deep-Person yields a 2.8% rank-1 accuracy improvement. An improvement of 4.8% rank-1 accuracy is achieved when using the automatic DPM detector to extract pedestrians. The latter setting is coherent with practical applications, which demonstrates the potential and robustness of Deep-Person in practice. Deep-Person also outperforms the state-of-the-art methods in both rank-5 and rank-10 matching rates.

Feature type Single Query
rank1 mAP
Global branch alone 85.33 64.71
Part branch w/o LSTM 82.39 60.75
Global + Part (w/o LSTM) 86.49 66.74
Global + Part (with LSTM) 87.74 69.82
Table 4: Effectiveness of the complementary advantage between global features and the novel LSTM-based part features.

4.4 Ablation Study

We further evaluate several variants of Deep-Person to verify the effectiveness of each individual component. Without loss of generality, the ablation study is performed on Market-1501 dataset under single query setting, using the same settings as described in Sec. 4.1.

Effectiveness of LSTM-based Parts

The proposed Deep-Person leverages the complementary information between the global representation and a novel LSTM-based part representation. We evaluate the contribution of this complementary advantage and the effect of the LSTM-based part representation. For that, we discard the ranking branch of Deep-Person. As depicted in Table 4, the combination of the global and part-based branches outperforms each individual branch alone. Furthermore, the adopted LSTM achieves a 3.1% performance gain in mAP. Note that for a fair comparison between the combined variant without LSTM and the one with LSTM, a fully-connected layer with 256 neurons is attached after each part slice in the former, so that the two models have approximately the same number of parameters.

Figure 5: Visualization of feature maps extracted from three variants of Deep-Person. From left to right in (a-d): raw image, the feature map of the global branch alone, of the global and part branches without LSTM, and of the global and part branches with LSTM, respectively.
Figure 6: Comparison of the gallery match ranks of each probe image under occlusion, blurring, background clutter, and imperfect detection. Each probe image has multiple matches in the gallery; smaller numbers mean better ranking performance.

To get some insight into the improvements, we analyze the learned feature maps of three variants of Deep-Person: the global branch alone, the global and part branches without LSTM, and the global and part branches with LSTM. Some illustrations are given in Fig. 5. Using the global branch alone only captures coarse areas with few part details. However, the missed part details, such as the upper body or a backpack, are potentially meaningful for identifying the person. Combining with the part-based representation without LSTM enriches some fine details. Adopting LSTM in the combination of the global and part-based branches focuses more on the pedestrian with fine details, ignoring irrelevant regions (e.g., the top corners in Fig. 5 (d) and the motorbike in Fig. 5 (b)). As a result, the features learned with LSTM align better to the full person and thus are more complete and robust for person Re-ID. These analyses are consistent with the gallery match rank test illustrated in Fig. 6. From this figure, one can see that the variant with LSTM achieves better matching ranks than the other two variants in most cases. This implies that the LSTM-based part representation combined with the global descriptor is capable of mitigating occlusion, blurring, background clutter, and imperfect detection.

Category Methods Single Query
rank1 mAP
Rigid Partition SCSP chen2016similarity 51.90 26.35
Rigid Partition LSTM-Siamese varior2016siamese 61.60 35.31
Rigid Partition MR B-CNN ustinova2017multi 66.36 41.17
Rigid Partition JLML li2017JLML 85.10 65.50
Flexible Partition Spindle zhao2017spindle 76.90 -
Flexible Partition MSCAN li2017learning 80.31 57.53
Flexible Partition Part-Aligned zhao2017deeply 81.00 63.40
Flexible Partition PDC su2017pose 84.14 63.41
Global + Part (Ours) 87.74 69.82
Table 5: Comparison with state-of-the-art part-based models on the Market-1501 dataset.

We also compare the variant of Deep-Person using the global and LSTM-based part branches with some state-of-the-art part-based models. The comparison is depicted in Table 5. Thanks to the use of LSTM for modeling the spatial context, improvements of 6.41% in mAP and 3.6% in rank-1 are achieved over the best part-based method. It is worth noting that Spindle zhao2017spindle and PDC su2017pose use extra pose annotations for body-part detection. Compared to varior2016siamese, which also uses LSTM to model the spatial relationship between body parts, using LSTM in an end-to-end way in Deep-Person achieves a significant improvement.

Loss type Single Query
rank1 mAP
Identification loss 85.33 64.71
Triplet loss 81.29 63.51
Identification + Triplet loss 88.45 72.96
Table 6: Effectiveness of multi-loss learning on the variant of Deep-Person discarding the part-based branch.
Feature type Single Query
rank1 mAP
Concatenated branch features 84.65 67.00
Pooled backbone feature 87.74 69.82
Table 7: Experimental results using different fused features as the final pedestrian descriptor on the two-branch (global + part) model.

Effectiveness of Multi-loss

Exploiting the complementary advantage of the identification and ranking losses is another important aspect of our Deep-Person. To simplify the evaluation of this complementary advantage, we discard the part-based branch. As shown in Table 6, combining the identification loss and the ranking loss (i.e., the triplet loss) achieves 88.45% rank-1 accuracy and 72.96% mAP, which performs much better than using either of them alone. This reveals the complementarity of the identification and ranking information for learning discriminative features for person Re-ID, as well as the effectiveness of the joint optimization.

Choice of final pedestrian descriptor

As described in Sec. 3, Deep-Person uses the globally average pooled backbone feature as the final pedestrian descriptor. Complementary information between the global and part-based representations is implicitly integrated into this feature thanks to the combination of the back-propagated error differentials of the global and part-based branches during training. Following recent works cheng2016person ; li2017learning ; su2017pose ; li2017JLML ; wu2016enhanced using global and part-based branches, concatenating the penultimate fully connected layers of these two branches seems to be a reasonable alternative. We use the two-branch (global + part) variant of Deep-Person to evaluate these two choices. As shown in Table 7, the implicitly fused backbone feature significantly outperforms the directly concatenated feature. One possible reason is that the identification loss makes the features near the classification layer focus more on the differences between training identities. Such features may be discriminative for the identities in the training images, but not for unseen identities at test time. The backbone feature, on the other hand, may be more robust and generalize better to unseen test categories. This comparison motivates us to append the ranking branch with the triplet loss after the globally pooled backbone feature, which is taken as the final pedestrian descriptor in Deep-Person.

5 Conclusion

In this paper, we introduce a novel three-branch framework named Deep-Person to learn highly discriminative deep features for person Re-ID. Different from most existing methods which either focus on feature representation or feature learning alone, complementary advantages on both aspects are considered in Deep-Person. Concretely, local body-part and global full-body features are jointly employed. The identification loss and ranking loss are applied to simultaneously learn an ID-discriminative embedding and a similarity measurement.

In addition, in contrast to the existing part-based methods, which usually discard the spatial context of the body structure, we propose to regard the pedestrian as a sequence of body parts from head to foot, and apply LSTM in an end-to-end fashion to take into account the contextual information between body parts, enhancing the discriminative capacity of local features that align better to the full person. Such sequential modeling provides a useful heuristic for person representation. Furthermore, to the best of our knowledge, this work is the first to use LSTM for person Re-ID in an end-to-end way.

Extensive evaluations on three popular and challenging datasets demonstrate the superiority of the proposed Deep-Person over the state-of-the-art methods. In the future, we would like to adopt an attention mechanism to automatically select more discriminative body parts rather than simply slicing the backbone feature along the vertical direction.

Acknowledgment

This work was supported by the National Key R&D Program of China (No. 2018YFB1004600), NSFC 61573160, and NSFC 61703171. Dr. Xiang Bai was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team.

References


  • (1) Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by gan improve the person re-identification baseline in vitro, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2017, pp. 3774–3782.
  • (2) L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2015, pp. 1116–1124.
  • (3) W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: Deep filter pairing neural network for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2014, pp. 152–159.
  • (4) W. Chen, X. Chen, J. Zhang, K. Huang, A multi-task deep network for person re-identification, in: Proc. of the AAAI Conf. on Artificial Intelligence, 2017, pp. 3988–3994.

  • (5) W. Li, X. Zhu, S. Gong, Person re-identification by deep joint learning of multi-loss classification, in: Proc. of Intl. Joint Conf. on Artificial Intelligence, 2017, pp. 2194–2200.
  • (6) E. Ustinova, Y. Ganin, V. Lempitsky, Multi-region bilinear convolutional neural networks for person re-identification, in: AVSS, 2017, pp. 1–6.
  • (7) H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, S. Z. Li, Embedding deep metric for person re-identification: A study against large variations, in: Proc. of European Conference on Computer Vision, 2016, pp. 732–748.
  • (8) E. Ahmed, M. Jones, T. K. Marks, An improved deep learning architecture for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2015, pp. 3908–3916.
  • (9) H. Yao, S. Zhang, Y. Zhang, J. Li, Q. Tian, Deep representation learning with part loss for person re-identification, CoRR abs/1707.00798. arXiv:1707.00798.
  • (10) D. Li, X. Chen, Z. Zhang, K. Huang, Learning deep context-aware features over body and latent parts for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 7398–7407.
  • (11) L. Wu, Y. Wang, X. Li, J. Gao, What-and-where to match: Deep spatially multiplicative integration networks for person re-identification, Pattern Recognition 76 (2018) 727–738.
  • (12) H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, X. Tang, Spindle net: Person re-identification with human body region guided feature decomposition and fusion, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 907–915.
  • (13) C. Su, J. Li, S. Zhang, J. Xing, W. Gao, Q. Tian, Pose-driven deep convolutional model for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2017, pp. 3980–3989.
  • (14) R. R. Varior, B. Shuai, J. Lu, D. Xu, G. Wang, A siamese long short-term memory architecture for human re-identification, in: Proc. of European Conference on Computer Vision, 2016, pp. 135–153.
  • (15) D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Person re-identification by multi-channel parts-based cnn with improved triplet loss function, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 1335–1344.
  • (16) S. Wu, Y.-C. Chen, X. Li, A.-C. Wu, J.-J. You, W.-S. Zheng, An enhanced deep feature representation for person re-identification, in: Proc. of IEEE Winter Conf. on Applications of Computer Vision, 2016, pp. 1–8.
  • (17) L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian, Person re-identification in the wild, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 3346–3355.
  • (18) L. Zheng, Y. Yang, A. G. Hauptmann, Person re-identification: Past, present and future, CoRR abs/1610.02984. arXiv:1610.02984.
  • (19) F. Wang, W. Zuo, L. Lin, D. Zhang, L. Zhang, Joint learning of single-image and cross-image representations for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 1288–1296.
  • (20) H. Liu, J. Feng, M. Qi, J. Jiang, S. Yan, End-to-end comparative attention networks for person re-identification, IEEE Trans. on Image Processing 26 (7) (2017) 3492–3506.
  • (21) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
  • (22) J. Wang, Z. Wang, C. Liang, C. Gao, N. Sang, Equidistance constrained metric learning for person re-identification, Pattern Recognition 74 (2018) 38–51.
  • (23) C. Zhao, X. Wang, W. K. Wong, W. Zheng, J. Yang, D. Miao, Multiple metric learning based on bar-shape descriptor for person re-identification, Pattern Recognition 71 (2017) 218–234.
  • (24) X. Liu, X. Ma, J. Wang, H. Wang, M3l: Multi-modality mining for metric learning in person re-identification, Pattern Recognition 76 (2018) 650–661.
  • (25) C. Zhao, X. Wang, D. Miao, H. Wang, W. Zheng, Y. Xu, D. Zhang, Maximal granularity structure and generalized multi-view discriminant analysis for person re-identification, Pattern Recognition 79 (2018) 79–96.
  • (26) D. Cheng, Y. Gong, X. Chang, W. Shi, A. Hauptmann, N. Zheng, Deep feature learning via structured graph laplacian embedding for person re-identification, Pattern Recognition.
  • (27) Z. Zhao, B. Zhao, F. Su, Person re-identification via integrating patch-based metric learning and local salience learning, Pattern Recognition 75 (2018) 90–98.
  • (28) D. Chen, Z. Yuan, B. Chen, N. Zheng, Similarity learning with spatial constraints for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 1268–1277.
  • (29) M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Proc. of Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
  • (30) A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, CoRR abs/1703.07737. arXiv:1703.07737.
  • (31) M. Geng, Y. Wang, T. Xiang, Y. Tian, Deep transfer learning for person re-identification, CoRR abs/1611.05244. arXiv:1611.05244.
  • (32) X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, X. Xue, Multi-scale deep learning architectures for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2017, pp. 5409–5418.
  • (33) K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: Proc. of Intl. Conf. on Machine Learning, 2015, pp. 2048–2057.

  • (34) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Proc. of Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
  • (35) A. Graves, A. Mohamed, G. E. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
  • (36) B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. 39 (11) (2017) 2298–2304.
  • (37) J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, W. Xu, Cnn-rnn: A unified framework for multi-label image classification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 2285–2294.
  • (38) S. Bell, C. Lawrence Zitnick, K. Bala, R. Girshick, Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
  • (39) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • (40) E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, C. Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: ECCV Workshop on Benchmarking Multi-Target Tracking, 2016, pp. 17–35.
  • (41) P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • (42) S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
  • (43) Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Y. Yang, Improving person re-identification by attribute and identity learning, CoRR abs/1703.07220. arXiv:1703.07220.
  • (44) Z. Zheng, L. Zheng, Y. Yang, Pedestrian alignment network for large-scale person re-identification, CoRR abs/1707.00408. arXiv:1707.00408.
  • (45) A. Schumann, R. Stiefelhagen, Person re-identification by deep learning attribute-complementary information, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1435–1443.
  • (46) Y. Sun, L. Zheng, W. Deng, S. Wang, Svdnet for pedestrian retrieval, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2017, pp. 3820–3828.
  • (47) Y. Chen, X. Zhu, S. Gong, Person re-identification by deep learning multi-scale representations, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2017, pp. 2590–2600.
  • (48) J. Zhou, P. Yu, W. Tang, Y. Wu, Efficient online local metric adaptation via negative samples for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 2439–2447.
  • (49) L. Zhang, T. Xiang, S. Gong, Learning a discriminative null space for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 1239–1248.
  • (50) R. R. Varior, M. Haloi, G. Wang, Gated siamese convolutional neural network architecture for human re-identification, in: Proc. of European Conference on Computer Vision, 2016, pp. 791–808.
  • (51) Y.-C. Chen, X. Zhu, W.-S. Zheng, J.-H. Lai, Person re-identification by camera correlation aware feature augmentation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2) (2018) 392–408.
  • (52) S. Zhou, J. Wang, J. Wang, Y. Gong, N. Zheng, Point to set similarity based deep feature learning for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 5028–5037.
  • (53) J. Lin, L. Ren, J. Lu, J. Feng, J. Zhou, Consistent-aware deep learning for person re-identification in a camera network, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 3396–3405.
  • (54) L. Zhao, X. Li, Y. Zhuang, J. Wang, Deeply-learned part-aligned representations for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision, 2017, pp. 3239–3248.
  • (55) S. Bai, X. Bai, Q. Tian, Scalable person re-identification on supervised smoothed manifold, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 3356–3365.
  • (56) T. Matsukawa, T. Okabe, E. Suzuki, Y. Sato, Hierarchical gaussian descriptor for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2016, pp. 1363–1372.
  • (57) D. Cheng, X. Chang, L. Liu, A. G. Hauptmann, Y. Gong, N. Zheng, Discriminative dictionary learning with ranking metric embedded for person re-identification, in: Proc. of Intl. Joint Conf. on Artificial Intelligence, 2017, pp. 964–970.
  • (58) W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: A deep quadruplet network for person re-identification, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 1320–1329.
  • (59) Z. Zhong, L. Zheng, D. Cao, S. Li, Re-ranking person re-identification with k-reciprocal encoding, in: Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2017, pp. 3652–3661.