Person re-identification with fusion of hand-crafted and deep pose-based body region features

03/27/2018 ∙ by Jubin Johnson, et al.

Person re-identification (re-ID) aims to accurately retrieve a person from a large-scale database of images captured across multiple cameras. Existing works learn deep representations using a large training subset of unique persons. However, identifying unseen persons is critical for a good re-ID algorithm. Moreover, misalignment between person crops due to detection errors or pose variations leads to poor feature matching. In this work, we present a fusion of handcrafted features and a deep feature representation learned from multiple body parts to complement the global body features, achieving high performance on unseen test images. Pose information is used to detect body regions that are passed through Convolutional Neural Networks (CNN) to guide feature learning. Finally, a metric learning step enables robust distance matching in a discriminative subspace. Experimental results on four popular re-ID benchmark datasets, namely VIPeR, DukeMTMC-reID, Market-1501 and CUHK03, show that the proposed method achieves state-of-the-art performance in image-based person re-identification.


1 Introduction

Person re-ID is an important task in video surveillance and has various applications in security and law-enforcement. The main goal in re-ID is to retrieve all instances of a probe person from a large gallery set. In the instance of a security breach or event, law enforcement can use a re-ID system to automatically identify and track persons-of-interest, which would save hundreds of man-hours.

Apart from the obvious challenges posed by viewpoint variations and occlusions, re-ID can be regarded as a zero-shot learning problem where the training and testing classes are different. It is therefore crucial to learn discriminative representations for unseen persons. Existing approaches handle this challenge by either learning discriminative subspaces/metrics [43, 19, 56, 21, 3, 26, 6, 45, 27], or generating discriminative features [25, 41, 10, 50, 53]. The success of deep learning in computer vision has led to the emergence of works that learn discriminative features using CNNs [18, 36, 39, 59, 55, 57] and through metric learning losses such as the contrastive loss [35], triplet loss [22, 15], and more recently the quadruplet loss [7].

Figure 1: Common challenges in image-based re-ID. (a-b) Scale variations, (c-d) detection errors, (e-f) pose misalignments

A majority of these approaches learn a global feature representation for the person crop, without considering the spatial information contained in the image. Previous works [42] have shown that the representations learned by a deep classification model on the global image focus mainly on one body region: the upper body. The drawbacks of such global representations are illustrated in Fig. 1. Since person crops for large-scale datasets like Market-1501 and CUHK03 are generated from video frames using detectors such as DPM [11], inaccurate detection boxes can impact feature learning, e.g., Fig. 1 (a-d). The top halves of the images in Fig. 1(a-b) correspond to the head-shoulder region and the background, respectively; directly comparing feature maps learned from the global image would therefore degrade the similarity score. Pose changes and non-rigid body deformations also make metric learning difficult, e.g., Fig. 1 (e-f). Moreover, occluded parts of the human body may introduce irrelevant context into the learned feature, and it is non-trivial to emphasize local differences in a global feature, especially when we have to distinguish two people with very similar appearances. To explicitly overcome these drawbacks, recent studies have paid attention to part-based, local feature learning. Some works [36, 42] divide the whole body into a few fixed parts, without considering the alignment between parts. However, such a scheme still suffers from inaccurate detection boxes, pose variations, and occlusion.

In the proposed re-ID framework, pose information is utilized to guide the feature learning process for various body parts. Pose has recently been used to guide local learning for better alignment of features [48, 32, 51]. Our work is inspired by [48], where multiple body regions are used to guide the feature learning, followed by a tree-based feature fusion step to effectively combine the different local features. First, we show that such a detailed level of granularity in the body parts is not always beneficial to the learning process; a simple concatenation of the features learned from three body regions (head, body and leg), along with the global features, generalizes well across datasets captured from different domains. Second, we adopt a disjoint framework which not only utilizes the powerful feature extraction capability of CNNs but also brings into play the complementary strengths of handcrafted features and an advanced metric learning method.

2 Related Work

We review work closely related to our method. An extensive survey of person re-ID can be found in [54, 16]. Most re-ID pipelines are composed of two main steps: feature learning and metric learning. For feature learning, several effective approaches have been proposed, for example, the Ensemble of Local Features (ELF) [13] and the Local Maximal Occurrence (LOMO) representation [19]. These handcrafted features have made impressive improvements in robust feature representation and advanced person re-ID research. The promising performance of CNNs on ImageNet classification indicates that a classification network extracts discriminative image features. Therefore, several works [40, 55, 39] fine-tuned classification networks on target datasets as feature extractors for person re-ID. For example, Xiao et al. [40] propose a novel dropout strategy to train a classification model with multiple datasets jointly. Wu et al. [39] combine hand-crafted histogram features and deep features to fine-tune the classification network. Besides classification networks, siamese and triplet networks are two other popular architectures for person re-ID. The siamese network takes a pair of images as input and is trained to verify the similarity between the two images [1, 57, 36, 31]. Ahmed et al. [1] and Zheng et al. [57] employ the siamese network to infer the description and a corresponding similarity metric simultaneously. Shi et al. [31] replace the Euclidean distance with the Mahalanobis distance in the siamese network. Varior et al. [36] combine an LSTM with the siamese network for person re-ID. Some other works [22, 15] employ triplet networks to learn the representation for person re-ID. Recently, many works generate deep representations from local parts [32, 48, 17, 46]. Li et al. [17] employ a Spatial Transformer Network (STN) for part localization and propose a Multi-Scale Context-Aware Network to infer representations on the generated local parts.

Figure 2: Framework of the proposed method

Zhang et al. [46] align the local features with the global features in a mutual learning framework. Pose Invariant Embedding (PIE) [51] aligns pedestrians to a standard pose to reduce the impact of pose variation. The Global-Local Alignment Descriptor (GLAD) [37] does not directly align pedestrians, but rather detects key pose points and extracts local features from the corresponding regions. [47] presents a deep mutual learning strategy where an ensemble of students learn collaboratively and teach each other throughout the training process. DarkRank [8] introduces a new type of knowledge, cross-sample similarity, for model compression and acceleration, achieving state-of-the-art performance. These methods use mutual learning in classification.

Our work is inspired by the part-based approach introduced by Zhao et al. [48]. They first extract human body parts using fourteen body joints, then fuse the features extracted on the body parts with a feature fusion network that is trained end-to-end. We argue that the feature fusion network and the micro body regions actually hamper performance on test images. Experimental evaluations validating this claim are provided in Section 4.

Figure 3: Illustration of the body part-based deep feature learning

3 Proposed Method

The proposed re-ID framework involves two stages for learning the feature representation of an input person image. The overall framework is illustrated in Fig. 2. First, the three body regions, namely the head, upper body and lower body, are predicted using a pose prediction network. These body regions then guide the network in learning region-specific deep features through an ROI pooling layer, as illustrated in Fig. 3. The handcrafted features are extracted and combined with the deep features in the next step. Finally, metric learning is used to generate a discriminative subspace for accurate comparison between the probe and gallery features.

3.1 Body region prediction

Similar to Zhao et al. [48], human body region information is extracted using a pose prediction network (PPN), which is used to guide the feature learning process for each sub-region. A sequential framework inspired by the Convolutional Pose Machines (CPM) [38] generates body joint response maps in a coarse-to-fine manner. In each stage, a convolutional network extracts image features and combines them with the response maps from the previous stage to produce increasingly refined estimates of the body joint locations. Given an input image, the PPN predicts fourteen response maps. The body joint locations are then obtained by max pooling each of the response maps, followed by grouping them to generate three rectangular region proposals representing the head, the upper-body and the lower-body regions, respectively. A few examples are shown in Fig. 4. More details can be found in [48], with the exception that we do not mine out micro body regions.
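The sketch below illustrates this grouping step: peak locations are read from the fourteen response maps and enclosed in three rectangles. The joint indices and the padded bounding-box heuristic are assumptions for illustration, not the exact grouping rule of the PPN.

```python
import numpy as np

# Hypothetical ordering of the 14 predicted joints (assumption, not the PPN's actual order).
HEAD_JOINTS  = [0, 1]                  # e.g. head top, neck
UPPER_JOINTS = [2, 3, 4, 5, 6, 7, 8]   # e.g. shoulders, elbows, wrists, hip centre
LOWER_JOINTS = [8, 9, 10, 11, 12, 13]  # e.g. hips, knees, ankles

def joints_from_response_maps(maps):
    """maps: (14, H, W) response maps -> (14, 2) array of (x, y) joint locations."""
    joints = []
    for m in maps:
        y, x = np.unravel_index(np.argmax(m), m.shape)  # peak of each response map
        joints.append((x, y))
    return np.asarray(joints, dtype=np.float32)

def region_box(joints, idx, margin=8):
    """Enclosing box (x1, y1, x2, y2) of the selected joints, padded by a margin."""
    pts = joints[idx]
    x1, y1 = pts.min(axis=0) - margin
    x2, y2 = pts.max(axis=0) + margin
    return np.array([x1, y1, x2, y2], dtype=np.float32)

def body_regions(maps):
    """Group the 14 joints into head, upper-body and lower-body rectangles."""
    joints = joints_from_response_maps(maps)
    return {
        "head":  region_box(joints, HEAD_JOINTS),
        "upper": region_box(joints, UPPER_JOINTS),
        "lower": region_box(joints, LOWER_JOINTS),
    }
```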

3.2 Multi-part representation learning for re-ID

The body parts generated in the previous step guide the learning of the multi-part representation. The general flowchart for training the network is shown in Fig. 3. The network takes the person image together with the three body regions as input and computes one global feature and three part features. Details of the network structure are presented in [48]. It essentially contains three inception modules and an ROI pooling stage for pooling out the body region features. Each feature is 256-dimensional, and the four features are concatenated to obtain the final feature vector. During testing, this multi-part feature is extracted and used to distinguish different persons.
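A minimal PyTorch sketch of such region pooling and concatenation is given below. It is not the authors' exact network: the backbone feature map, channel sizes, pooling size and per-stream embedding layers are assumptions used only to show how ROI pooling of the three body regions and the global crop can be concatenated into one descriptor.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class MultiPartHead(nn.Module):
    """Pools a global feature and three body-region features from a backbone
    feature map and concatenates them into one descriptor (4 x 256 = 1024-D)."""

    def __init__(self, in_channels=512, feat_dim=256, pool_size=4):
        super().__init__()
        self.pool_size = pool_size
        # One small embedding branch per stream: global, head, upper body, lower body.
        self.embeds = nn.ModuleList([
            nn.Linear(in_channels * pool_size * pool_size, feat_dim) for _ in range(4)
        ])

    def forward(self, fmap, region_boxes, spatial_scale):
        """fmap: (N, C, H, W) backbone features; region_boxes: list of three
        (N, 5) tensors [batch_idx, x1, y1, x2, y2] in image coordinates."""
        n, c, h, w = fmap.shape
        # Global stream: a box covering the whole image for every sample.
        full = torch.tensor([[i, 0, 0, w / spatial_scale, h / spatial_scale]
                             for i in range(n)], dtype=fmap.dtype, device=fmap.device)
        streams = [full] + list(region_boxes)
        feats = []
        for boxes, embed in zip(streams, self.embeds):
            pooled = roi_pool(fmap, boxes, output_size=self.pool_size,
                              spatial_scale=spatial_scale)   # (N, C, p, p)
            feats.append(embed(pooled.flatten(1)))            # (N, 256)
        return torch.cat(feats, dim=1)                        # (N, 1024)
```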

Figure 4: Examples of person images and the corresponding body regions extracted from them.

3.3 Fusion of handcrafted features and metric learning

In addition to the deep features, low-level cues such as color, texture and illumination are crucial for describing images of persons. In addition to an HSV color histogram, the Scale Invariant Local Ternary Pattern (SILTP) descriptor is utilized for an illumination-invariant texture description. In order to extract spatial information, sliding windows (of size 10×10) are applied to describe local details of a person image. Adopting a pyramid representation, within each subwindow we extract two scales of SILTP histograms and a joint HSV histogram. To address viewpoint changes, we check all subwindows at the same horizontal location and maximize the local occurrence of each pattern (i.e., the same histogram bin) among those subwindows, yielding the Local Maximal Occurrence (LOMO) features. For a 128×48 image, all the computed local maximal occurrences are concatenated into a final descriptor with 26,960 dimensions. Finally, a log transform is applied to suppress large bin values, and both the HSV and SILTP features are normalized. For metric learning, we adopt the widely used Cross-view Quadratic Discriminant Analysis (XQDA) [19] to learn the subspace.
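The horizontal max-occurrence pooling at the heart of LOMO can be sketched as follows, assuming the per-subwindow histograms have already been computed; the bin counts and window grid in this toy example are illustrative and do not reproduce the exact LOMO configuration.

```python
import numpy as np

def local_max_occurrence(hists):
    """hists: (rows, cols, bins) array of per-subwindow histograms laid out on a
    grid of sliding windows. For each row of subwindows (same horizontal band),
    keep the maximum occurrence of every histogram bin across the columns,
    which makes the descriptor robust to viewpoint changes along that band."""
    pooled = hists.max(axis=1)                 # (rows, bins): max over subwindows in a band
    descriptor = pooled.reshape(-1)            # concatenate all horizontal bands
    descriptor = np.log(descriptor + 1.0)      # log transform to suppress large bin values
    norm = np.linalg.norm(descriptor) + 1e-12
    return descriptor / norm                   # L2 normalization

# Toy usage: 24 horizontal bands, 14 subwindows per band, 8x8x8 joint HSV bins.
hists = np.random.rand(24, 14, 512)
feat = local_max_occurrence(hists)
print(feat.shape)                              # (12288,)
```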

4 Experiments

4.1 Datasets

The proposed framework is evaluated on the most widely used datasets in the past year for image-based re-ID namely, VIPeR [12], CUHK-03 [18], Market-1501 [52] and DukeMTMC-reID [29, 59].

VIPeR is one of the earliest and widely used datasets for person re-ID. It contains 632 unique identities captured across two cameras, with each identity having one image per camera. The dataset is split randomly into equal halves for training and testing.

CUHK-03 consists of 14,097 cropped images of 1,467 identities. For each identity, images are captured from two cameras, with about 5 images per view. The cropped images are produced in two ways, i.e., by human annotation and by DPM detection [11]. Our evaluation is based on the combined dataset. We use the standard experimental setting, selecting 1,367 identities for training and the remaining 100 for testing.

Market-1501 contains 32,668 images from 1,501 identities, and each image is annotated with a bounding box detected by DPM. Each identity is captured by at most six cameras. We use the standard training, testing, and query split provided by the authors in [52].

DukeMTMC-reID is a subset of the DukeMTMC [29] for image-based re-identification, in the format of the Market-1501 dataset. It contains 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images. We report mAP and Rank-1 precision on Market-1501 and DukeMTMC-reID datasets, while rank-1, -5, and -10 accuracies are reported on the other 2 datasets. Similar to [40, 48, 42], we learn a single model using the training splits of all the four datasets. The statistics of the training and testing splits are shown in Table 1.

Dataset #ID #Train/Val img #Probe/Gallery ID #Probe/Gallery img
VIPeR 632 506/126 316/316 316/316
CUHK-03 1467 21012/5252 100/100 100/100
Market-1501 1501 10348/2588 750/750+junk 3368/19732
DukeMTMC-reID 1404 13218/3304 702/702 2228/17661
Table 1: Details of the datasets and the corresponding training/testing splits.
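Before turning to the comparisons, the sketch below shows one way the Rank-k and mAP measures reported in this section might be computed from a ranked gallery. It assumes a single-query protocol with binary relevance labels and is purely illustrative; the official benchmark evaluation code additionally handles junk and distractor images.

```python
import numpy as np

def cmc_and_map(ranked_matches, ks=(1, 5, 10)):
    """ranked_matches: list of 1-D binary arrays, one per query, where entry i is 1
    if the i-th ranked gallery image shares the query identity. Returns Rank-k
    accuracies and mean average precision (mAP)."""
    cmc_hits = {k: 0 for k in ks}
    aps = []
    for matches in ranked_matches:
        matches = np.asarray(matches)
        hit_positions = np.flatnonzero(matches) + 1   # 1-indexed ranks of correct matches
        if hit_positions.size == 0:
            continue                                  # no correct match in the gallery
        for k in ks:
            cmc_hits[k] += int(hit_positions[0] <= k)
        # Average precision over all correct gallery images for this query.
        precisions = np.arange(1, len(hit_positions) + 1) / hit_positions
        aps.append(precisions.mean())
    n = len(ranked_matches)
    return {f"Rank-{k}": cmc_hits[k] / n for k in ks}, float(np.mean(aps))

# Toy usage: two queries with their ranked match indicators.
ranks, m_ap = cmc_and_map([[0, 1, 0, 1], [1, 0, 0, 0]])
print(ranks, m_ap)   # Rank-1 = 0.5, mAP = 0.75
```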
Method Rank-1 mAP
LOMO+XQDA [19] 43.79 22.22
k-reciprocal [60] 77.11 63.63
PIE [51] 79.33 55.95
Deep Context-aware [17] 80.31 57.53
Deep Part-Aligned [49] 81.0 63.4
Scalable SSM [3] 82.21 68.80
SVDNet [33] 82.3 62.1
Pose-driven [32] 84.14 63.41
APR [20] 84.29 64.67
In Defense Triplet [15] 84.92 69.14
Deep Mutual Learning [47] 87.73 68.83
REDA [61] 87.08 71.31
SpindleNet [48] 84.5 65.1
Darkrank [8] 89.8 74.3
GLAD [37] 89.9 73.9
AlignedReID [46] 91.8 79.3
Ours 91.9 78.39
Ours + LOMO +XQDA 92.09 79.11
Table 2: Comparisons on Market1501 in single query setting
Method Rank-1 mAP
LOMO+XQDA [19] 30.75 17.04
IDE [54] 65.22 44.99
GAN [59] 67.68 47.13
SpindleNet [48] 68.9 46.2
Discriminatively [57] 68.9 49.3
APR [20] 70.69 51.88
PAN [58] 71.59 51.51
SVDNet [33] 76.7 56.8
DPFL [9] 79.2 60.6
PSE [30] 79.8 62.0
REDA [61] 79.31 62.44
Mid-level [44] 80.43 63.88
Deep-Person [4] 80.90 64.80
PCB [34] 83.3 69.2
GP-reID [2] 85.2 72.8
Ours 79.75 63.6
Ours + LOMO +XQDA 80.36 64.8
Table 3: Comparisons on DukeMTMC-reID in single query setting

4.2 Comparisons with other methods

We compare the performance of our proposed method using the standard evaluation protocols for the four datasets. The results are reported in Tables 2-5. In the tables, Ours represents our results using cosine similarity for matching, while Ours+LOMO+XQDA represents the full framework with the addition of handcrafted features and metric learning.
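For the Ours setting, ranking a gallery against a probe with cosine similarity can be sketched as follows; this is an illustrative snippet, not the evaluation code used for the reported numbers.

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Rank gallery entries by cosine similarity to a probe descriptor.
    probe_feat: (D,), gallery_feats: (N, D). Returns gallery indices, best first."""
    p = probe_feat / (np.linalg.norm(probe_feat) + 1e-12)
    g = gallery_feats / (np.linalg.norm(gallery_feats, axis=1, keepdims=True) + 1e-12)
    sims = g @ p                      # cosine similarity against every gallery image
    return np.argsort(-sims)          # indices sorted by descending similarity

# Toy usage with concatenated 1024-D multi-part descriptors.
probe = np.random.rand(1024)
gallery = np.random.rand(5000, 1024)
order = rank_gallery(probe, gallery)
print(order[:10])                     # indices of the top-10 matches
```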

On Market-1501, the proposed method achieves state-of-the-art performance with a Rank-1 accuracy of 92.09%, as shown in Table 2. Our mAP of 79.1% is only slightly lower than the 79.3% of AlignedReID [46]. The influence of metric learning is not significant for datasets with large gallery sets; nevertheless, its inclusion does bring a slight increase in performance.

Table 3 shows the performance on DukeMTMC-reID, where we obtain accuracies of 80.36% and 64.8% on the Rank-1 and mAP measures, respectively. Although some recent methods [2, 4, 34] perform better than the proposed method on this dataset, they have lower scores on the other three datasets. This shows that our method is able to generalize well across datasets captured from different domains.

VIPeR and CUHK-03 follow the single-shot evaluation strategy, where only one query and one gallery image are selected for each identity. Hence, we use the CMC ranking evaluation metric to compare with other methods. The comparisons on CUHK-03 are summarized in Table 4. The proposed method obtains accuracies of 92.7%, 99.1% and 99.6% on the Rank-1, 5 and 10 measures, which is the state of the art. We even outperform methods such as [48, 23] which use multiple images of the same identity in the gallery.

On the VIPeR dataset, we achieve 63.3% Rank-1 accuracy, which is around 7% higher than the current state of the art. As can be seen from Table 5, for small-scale datasets, a big factor in this improvement is the metric learning step.

Method Rank-1 Rank-5 Rank-10
LOMO+XQDA [19] 44.6 - -
DNS [45] 62.6 90.0 94.8
Gated SCNN [35] 61.8 88.1 92.6
k-reciprocal [60] 64.0 - -
Scalable SSM [3] 76.6 94.6 98.0
PL-Net [42] 82.8 96.6 98.6
Deep Part-Aligned [49] 85.4 97.6 99.4
GAN [59] 84.6 97.6 98.9
Discriminatively [57] 83.4 97.1 98.7
In Defense Triplet [15] 75.5 95.2 99.2
HydraPlus-Net [23] 91.8 98.4 99.1
Pose-driven [32] 88.7 98.6 99.2
SpindleNet [48] 88.5 97.8 98.6
Darkrank [8] 89.7 98.4 99.2
GLAD [37] 85.0 97.9 99.1
AlignedReID [46] 92.4 98.9 99.5
Ours 89.4 98.4 99.1
Ours + LOMO +XQDA 92.7 99.1 99.6
Table 4: Comparisons on CUHK03 dataset
Method Rank-1 Rank-5 Rank-10
Deepreid [18] 19.9 49.3 64.7
LOMO+XQDA [19] 40.0 - 80.5
DNS [45] 41.0 69.8 81.6
Gated SCNN [35] 37.8 66.9 77.4
TMA [24] 48.2 87.7 93.5
Scalable SSM [3] 53.7 91.5 96.1
Deep Part-Aligned [49] 48.7 74.7 85.1
MuSDeep [28] 43.0 74.4 85.8
Pose-driven [32] 51.3 74.1 84.2
SpindleNet [48] 53.8 74.1 83.2
GLAD [37] 54.8 74.5 83.5
PL-Net [42] 56.7 82.6 90.0
SCSP [5] 53.5 91.5 96.7
HydraPlus-Net [23] 56.6 78.8 87.0
Ours 45.6 66.1 74.7
Ours + LOMO +XQDA 63.3 83.5 91.1
Table 5: Comparisons on VIPeR dataset
Market-1501 Rank-1 mAP
Global 86.4 67.6
Head 36.7 16.8
Body 62.9 38.3
Leg 60.4 36.2
Concat(Global + 7 body regions) 85.2 64.56
SpindleNet FFN (Global + 7 body regions) 84.5 65.1
Concat(Global + 3 body regions) 88.3 69.5
Table 6: Investigation on the effect of each body part towards the final representation performance on Market-1501 dataset.
Backbone architecture  DukeMTMC-reID (Rank-1, mAP)  Market-1501 (Rank-1, mAP)
Inception 79.75 63.6 91.9 78.39
Resnet-50 77.69 61.13 91.56 75.63
Table 7: Comparison of architecture in performance for DukeMTMC-reID and Market-1501

4.3 Comparisons with feature fusion of SpindleNet

SpindleNet [48] uses a tree fusion structure to combine the features obtained from the multiple body regions and showed that its fusion strategy, based on feature competition and fine-tuning, obtained the best performance. In our experiments, we observed that the addition of finer (micro) body regions deteriorated the performance of the overall framework. Table 6 shows the experimental results validating this on the Market-1501 dataset. We trained SpindleNet and our network on the training split of this dataset. The addition of the micro body regions is shown to bring down the performance of both the feature fusion network and the linear concatenation. It is to be noted that the global features alone perform better than the FFN of SpindleNet. The macro body regions (head, body and leg) provide complementary information that is not well represented by the global features, improving performance by about 2% on both mAP and Rank-1.


Figure 5: Qualitative evaluation on DukeMTMC-reID (first three rows) and Market1501 datasets. The first column is the probe image, followed by the top 10 results from the gallery.


Figure 6: Qualitative evaluation on the VIPeR (first three rows) and CUHK-03 datasets. The single-shot setting with only one correct gallery image is more challenging for person re-ID. Red boxes indicate that the identity is the same as that of the query.

4.4 Comparison of backbone CNN architecture

The base network for deep representation learning contains convolutional layers followed by Inception modules. Table 7 shows the effect of using the ResNet-50 architecture [14] for our task of multi-body representation. ResNet-50 is deeper and takes a larger input image size of 224×224 for learning the feature representations. However, this does not translate into a better representation for the unseen test images, as both the Rank-1 and mAP measures are consistently lower than those of the Inception backbone on both datasets under consideration.

4.5 Qualitative comparison

Fig. 5 and 6 show the retrieval results obtained using queries from all the datasets. For Market-1501 and DukeMTMC-reID, the gallery contains multiple images of the query identity captured from cameras different from that of the query. As can be seen from Fig. 5, our method is able to retrieve most of the instances belonging to the probe identity, despite large variations in appearance and pose. The single-shot setting followed by VIPeR and CUHK-03, where only a single instance of the query identity is present in the gallery, is more difficult for person re-ID. Although the person is partially occluded (row 5 in Fig. 6) or the appearance changes drastically (last row in Fig. 6), our method is consistently able to retrieve the correct person.

5 Conclusion

In this paper, we have shown that deep classification models trained on the global image tend to learn poor representations for unseen test images. We have also shown that fusing handcrafted features with a deep feature representation learned from multiple body parts, which complements the global body features, achieves high performance on such zero-shot learning problems. Experimental evaluations on four benchmark datasets for person re-ID show that the proposed method performs among the state of the art.

References