High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification

03/18/2020 · Guan'an Wang, et al.

Occluded person re-identification (ReID) aims to match occluded person images to holistic ones across disjoint cameras. In this paper, we propose a novel framework that learns high-order relation and topology information for discriminative features and robust alignment. First, we use a CNN backbone and a key-points estimation model to extract semantic local features. Even so, occluded images still suffer from occlusion and outliers. We then view the local features of an image as nodes of a graph and propose an adaptive direction graph convolutional (ADGC) layer to pass relation information between nodes. The proposed ADGC layer can automatically suppress the message passing of meaningless features by dynamically learning the direction and degree of linkage. To align two groups of local features from two images, we view alignment as a graph matching problem and propose a cross-graph embedded-alignment (CGEA) layer to jointly learn and embed topology information into local features and directly predict a similarity score. The proposed CGEA layer not only takes full advantage of the alignment learned by graph matching but also replaces sensitive one-to-one matching with a robust soft one. Finally, extensive experiments on occluded, partial, and holistic ReID tasks show the effectiveness of our method. Specifically, our framework significantly outperforms the state-of-the-art by 6.5% mAP on the Occluded-Duke dataset.


1 Introduction

Figure 1: Illustration of high-order relation and topology information. (a) In occluded ReID, key-points suffer from occlusions (1⃝2⃝) and outliers (3⃝). (b) The vanilla method relies on one-order key-point information in all three stages, which is not robust. (c) Our method learns features via a graph to model relation information, and views alignment as a graph matching problem to model topology information by learning both node-to-node and edge-to-edge correspondence.

Person re-identification (ReID) [6, 40] aims to match images of a person across disjoint cameras, and is widely used in video surveillance, security, and smart cities. Recently, various methods [24, 36, 18, 41, 16, 19, 40, 11, 34] have been proposed for person ReID. However, most of them focus on holistic images while neglecting occluded ones, which are more practical and challenging. As shown in Figure 1(a), persons can easily be occluded by obstacles (e.g. baggage, counters, crowds, cars, trees) or walk out of the camera's field of view, leading to occluded images. Thus, it is necessary to match persons under occluded observation, which is known as the occluded person ReID problem [45, 25].

Compared with matching holistic images, occluded ReID is more challenging for the following reasons [42, 45]: (1) With occluded regions, an image contains less discriminative information and is more likely to match the wrong person. (2) Part-based features have proved efficient [34] via part-to-part matching, but they require strict person alignment in advance and thus cannot work well under severe occlusion. Recently, many occluded/partial person ReID methods [45, 46, 25, 10, 8, 33, 22] have been proposed, but most of them consider only one-order information for feature learning and alignment. For example, pre-defined regions [34], poses [25], or human parsing [10] are used for feature learning and alignment. We argue that besides one-order information, high-order information should be introduced and may work better for occluded ReID.

In Figure 1(a), we can see that key-point information suffers from occlusion (1⃝2⃝) and outliers (3⃝). For example, key-points 1⃝ and 2⃝ are occluded, leading to meaningless features, and key-point 3⃝ is an outlier, leading to misalignment. A common solution, shown in Figure 1(b), extracts local features of key-point regions and assumes that all key-points are accurate and the local features well aligned. In this solution, all three stages rely on one-order key-point information, which is not very robust. In this paper, as shown in Figure 1(c), we propose a novel framework for both discriminative features and robust alignment. In the feature learning stage, we view the local features of an image as nodes of a graph to learn relation information. By passing messages in the graph, the meaningless features caused by occluded key-points can be improved by their meaningful neighbors. In the alignment stage, we use a graph matching algorithm [37] to learn robust alignment. Besides aligning with node-to-node correspondence, it models extra edge-to-edge correspondence. We then embed the alignment information into features by constructing a cross-image graph, where node messages of one image can be passed to nodes of the other image. Thus, the features of outlier key-points can be repaired by their corresponding features in the other image. Finally, instead of computing similarity with a predefined distance, we use a network to learn similarity, supervised by a verification loss.

Specifically, we propose a novel framework that jointly models high-order relation and human-topology information for occluded person re-identification. As shown in Figure 2, our framework includes three modules: a one-order semantic module (S), a high-order relation module (R), and a high-order human-topology module (T). (1) In the semantic module, we utilize a CNN backbone to learn feature maps and a human key-points estimation model to learn key-points, from which we extract semantic features of the corresponding key-point regions. (2) In the relation module, we view the learned semantic features of an image as nodes of a graph and propose an adaptive-direction graph convolutional (ADGC) layer to learn and pass messages along edges. The ADGC layer can automatically decide the direction and degree of every edge, so it promotes the message passing of semantic features and suppresses that of meaningless and noisy ones. The resulting nodes contain both semantic and relation information. (3) In the topology module, we propose a cross-graph embedded-alignment (CGEA) layer. It takes two graphs as inputs, learns the correspondence of nodes across the two graphs with a graph-matching strategy, and passes messages by viewing the learned correspondence as an adjacency matrix. Thus, related features can be enhanced, and alignment information can be embedded into the features. Finally, to avoid hard one-to-one alignment, we predict the similarity of the two graphs by mapping them to a logit supervised by a verification loss.

The main contributions of this paper are summarized as follows: (1) A novel framework jointly modeling high-order relation and human-topology information is proposed to learn well and robustly aligned features for occluded ReID. To the best of our knowledge, this is the first work that introduces such high-order information to occluded ReID. (2) An adaptive directed graph convolutional (ADGC) layer is proposed to dynamically learn the directed linkage of the graph, which promotes message passing of semantic regions and suppresses that of meaningless regions such as occlusions or outliers. With it, we can better model relation information for occluded ReID. (3) A cross-graph embedded-alignment (CGEA) layer conjugated with a verification loss is proposed to learn feature alignment and predict a similarity score. Together they avoid sensitive hard one-to-one matching and perform a robust soft one. (4) Extensive experimental results on occluded, partial, and holistic ReID datasets demonstrate that the proposed model performs favorably against state-of-the-art methods. In particular, on the Occluded-Duke dataset, our method significantly outperforms the state-of-the-art by at least 3.7% Rank-1 and 6.5% mAP.

Figure 2: Illustration of our proposed framework. It consists of a one-order semantic module (S), a high-order relation module (R), and a high-order topology module (T). The semantic module learns local features of key-point regions. In the relation module, we view the local features of an image as nodes of a graph and propose an adaptive direction graph convolutional (ADGC) layer to pass relation information between nodes. In the topology module, we view alignment as a graph matching problem and propose a cross-graph embedded-alignment (CGEA) layer to jointly learn and embed topology information into local features and directly predict similarity scores.

2 Related Works

Person Re-Identification. Person re-identification addresses the problem of matching pedestrian images across disjoint cameras [6]. The key challenges lie in the large intra-class and small inter-class variation caused by different views, poses, illuminations, and occlusions. Existing methods can be grouped into hand-crafted descriptors [24, 36, 18], metric learning methods [41, 16, 19], and deep learning algorithms [40, 11, 34]. All of these methods focus on matching holistic person images and cannot perform well on occluded images, which limits their applicability in practical surveillance scenarios.

Occluded Person Re-identification. Given occluded probe images, occluded person re-identification [45] aims to find the same person with full-body appearance across disjoint cameras. This task is more challenging due to incomplete information and spatial misalignment. Zhuo et al. [45] use an occluded/non-occluded binary classification (OBC) loss to distinguish occluded images from holistic ones. In their follow-up work, a saliency map is predicted to highlight discriminative parts, and a teacher-student learning scheme further improves the learned features. Miao et al. [25] propose a pose-guided feature alignment method to match the local patches of probe and gallery images based on human semantic key-points, using a pre-defined threshold on key-point confidence to decide whether a part is occluded. Fan et al. [3] use a spatial-channel parallelism network (SCPNet) to encode part features into specific channels and fuse holistic and part features to obtain discriminative features. Luo et al. [22] use a spatial transformer module to transform a holistic image to align with a partial one, then compute the distance of the aligned pair. Besides, several efforts have been devoted to spatial alignment in partial ReID tasks.

Partial Person Re-Identification. Alongside occluded images, partial ones often occur due to imperfect detection and outliers of camera views. Like occluded person ReID, partial person ReID [42] aims to match partial probe images to holistic gallery images. Zheng et al. [42] propose a global-to-local matching model to capture spatial layout information. He et al. [8] reconstruct the feature map of a partial query from the holistic pedestrian, and further improve it with a foreground-background mask to avoid the influence of background clutter in [10]. Sun et al. propose a Visibility-aware Part Model (VPM) in [33], which learns to perceive the visibility of regions through self-supervision.

Different from existing occluded and partial ReID methods, which use only one-order information for feature learning and alignment, we exploit high-order relation and human-topology information for both, and thus achieve better performance.

Deep Graph Matching. Graph matching refers to establishing node correspondence between two or among multiple graphs. It incorporates both unary similarities between nodes and pairwise similarities between edges from separate graphs to define a matching such that the similarity between the matched graphs is maximized. By encoding high-order geometric information, graph matching can be more robust to noise and outliers [37, 35]. Recently, Zanfir et al. [37] combined deep learning with graph matching by decomposing it into six steps: deep feature extraction, affinity matrix construction, power iteration, bi-stochastic normalization, voting, and loss computation. Wang et al. [35] propose a permutation loss over node correspondence to capture the combinatorial nature of graph matching.

3 The Proposed Method

This section introduces our proposed framework, which includes a one-order semantic module (S) to extract semantic features of human key-point regions, a high-order relation module (R) to model the relation information among different semantic local features, and a high-order human-topology module (T) to learn robust alignment and predict the similarity between two images. The three modules are jointly trained end-to-end. An overview of the proposed method is shown in Figure 2.

Semantic Feature Extraction.

The goal of this module is to extract one-order semantic features of key-point regions, inspired by two observations. First, part-based features have been shown to be efficient for person ReID [34]. Second, accurate alignment of local features is necessary in occluded/partial ReID [8, 33, 10]. Following these ideas and recent developments in person ReID [40, 34, 23, 4] and human key-points prediction [2, 32], we utilize a CNN backbone to extract local features of different key-points. Note that although human key-points prediction has achieved high accuracy, it still performs unsatisfactorily on occluded/partial images [17], leading to inaccurate key-point positions and confidences. Thus, the relation and human-topology information introduced below is needed, as discussed in the following sections.

Specifically, given a pedestrian image $x$, we obtain its feature map $F$ through the CNN model and its key-point heat maps $\{H_k\}_{k=1}^{K}$ through the key-points model. Through an outer product ($\otimes$) and a global average pooling operation ($g$), we obtain a group of semantic local features of key-point regions $V_S = \{v_S^k\}_{k=1}^{K}$ and a global feature $v_g$. The procedure is formulated in Eq. (1), where $K$ is the number of key-points and $C$ is the channel number of each feature.

$$v_S^k = g(F \otimes H_k), \quad k = 1, \dots, K, \qquad v_g = g(F) \tag{1}$$
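As a concrete reference, Eq. (1) can be implemented in a few lines of PyTorch (the function name is ours, and we assume the heat maps are already resized to the feature-map resolution):

```python
import torch

def extract_semantic_features(feat_map, heatmaps):
    """Eq. (1): weight the feature map by each key-point heat map
    (the outer product), then global-average-pool to a local feature."""
    # feat_map: (B, C, H, W); heatmaps: (B, K, H, W)
    weighted = feat_map.unsqueeze(1) * heatmaps.unsqueeze(2)  # (B, K, C, H, W)
    v_local = weighted.flatten(3).mean(dim=3)                 # GAP -> (B, K, C)
    v_global = feat_map.flatten(2).mean(dim=2)                # (B, C)
    return v_local, v_global

# toy check with K = 13 key-points
fm, hm = torch.randn(2, 256, 16, 8), torch.rand(2, 13, 16, 8)
v_s, v_g = extract_semantic_features(fm, hm)
print(v_s.shape, v_g.shape)  # torch.Size([2, 13, 256]) torch.Size([2, 256])
```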

Training Loss. Following [40, 11], we adopt classification and triplet losses as our training targets, as in Eq. (2). Here, $c_k$ is the key-point confidence ($c_0 = 1$ for the global feature $v_S^0 = v_g$), $p(y \mid v_S^k)$ is the probability of feature $v_S^k$ belonging to its ground-truth identity $y$ as predicted by a classifier, $\alpha$ is a margin, $d(v_{S,a}^k, v_{S,p}^k)$ is the distance between a positive pair from the same identity, and $(v_{S,a}^k, v_{S,n}^k)$ is a pair from different identities. The classifiers for different local features are not shared.

$$\mathcal{L}_S = \sum_{k=0}^{K} c_k \Big( -\log p\big(y \mid v_S^k\big) + \big[\alpha + d(v_{S,a}^k, v_{S,p}^k) - d(v_{S,a}^k, v_{S,n}^k)\big]_+ \Big) \tag{2}$$
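For illustration, a confidence-weighted loss for a single feature stream could look as follows (a sketch under our reading of Eq. (2); the batch-hard triplet mining follows common ReID practice [11] rather than a detail specified here):

```python
import torch
import torch.nn.functional as F

def weighted_reid_loss(logits, feats, labels, conf, margin=0.3):
    """Confidence-weighted classification + batch-hard triplet loss for
    one feature stream, in the spirit of Eq. (2)."""
    # logits: (B, num_ids); feats: (B, C); labels, conf: (B,)
    cls = F.cross_entropy(logits, labels, reduction="none")
    dist = torch.cdist(feats, feats)                              # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = (dist * same.float()).max(dim=1).values                 # hardest positive
    neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # hardest negative
    tri = F.relu(margin + pos - neg)
    return (conf * (cls + tri)).mean()

# toy usage
B, C, n_ids = 8, 256, 100
loss = weighted_reid_loss(torch.randn(B, n_ids), torch.randn(B, C),
                          torch.randint(0, n_ids, (B,)), torch.rand(B))
```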

3.1 High-Order Relation Learning

Although we have one-order semantic information of different key-point regions, occluded ReID remains challenging due to incomplete pedestrian images, so it is necessary to exploit more discriminative features. We turn to graph convolutional network (GCN) methods [1] to model high-order relation information. In a GCN, the semantic features of different key-point regions are viewed as nodes. By passing messages among nodes, not only the one-order semantic information (node features) but also the high-order relation information (edge features) can be jointly considered.

However, occluded ReID poses a further challenge: features of occluded regions are often meaningless or even noisy. Passing those features through a graph introduces more noise and harms occluded ReID. Hence, we propose a novel adaptive-direction graph convolutional (ADGC) layer that dynamically learns the direction and degree of message passing. With it, we can automatically suppress the message passing of meaningless features and promote that of semantic features.

Figure 3: Illustration of the proposed adaptive directed graph convolutional (ADGC) layer. $A$ is a pre-defined adjacency matrix; $\ominus$ denotes element-wise subtraction, $\oplus$ element-wise addition, $\odot$ element-wise multiplication, and $\times$ matrix multiplication; FC is a fully-connected layer and $(\cdot)^\top$ is the transpose.

Adaptive Directed Graph Convolutional Layer. A simple graph convolutional layer [15] has two inputs, an adjacency matrix $A$ of the graph and the features $X$ of all nodes; its output is

$$\hat{X} = \tilde{A} X W,$$

where $\tilde{A}$ is the normalized version of $A$ and $W$ denotes the learnable parameters.
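For concreteness, a minimal implementation of this simple layer, with the symmetric normalization of [15], might look like:

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """X_hat = A_tilde X W, with A_tilde = D^-1/2 (A + I) D^-1/2."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.W = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, X, A):
        # X: (B, K, C) node features; A: (K, K) adjacency matrix
        A = A + torch.eye(A.shape[0], device=A.device)  # add self-loops
        d = A.sum(dim=1).clamp(min=1e-6).rsqrt()        # D^-1/2
        A_tilde = d.unsqueeze(1) * A * d.unsqueeze(0)
        return A_tilde @ self.W(X)                      # message passing
```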

We improve the simple graph convolutional layer by adaptively learning the adjacency matrix (the linkage of nodes) based on the input features. We assume that, given two local features, the meaningful one is more similar to the global feature than the meaningless one. Therefore, we propose an adaptive directed graph convolutional (ADGC) layer, whose inputs are a global feature $v_g$, $K$ local features $V$, and a pre-defined graph with adjacency matrix $A$. We use the differences between the local features and the global feature to dynamically update the edge weights of all nodes in the graph, yielding a dynamic adjacency matrix $\hat{A}_d$. A simple graph convolution is then performed with $\hat{A}_d$ and $V$. To stabilize training, we fuse the input local features into the output of our ADGC layer via a residual connection, as in ResNet [7]. Details are shown in Figure 3. The ADGC layer can be formulated as in Eq. (3), where $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are two unshared fully-connected layers.

$$\hat{A}_d = \mathrm{norm}\big(A \odot \sigma(\mathrm{FC}_1(V - v_g))\big), \qquad V_{out} = V + \hat{A}_d\,\mathrm{FC}_2(V) \tag{3}$$
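A minimal PyTorch sketch of one consistent reading of this layer is given below; the class name and the sigmoid-based edge scoring are our assumptions, as the exact scoring function is given in Figure 3 rather than the text:

```python
import torch
import torch.nn as nn

class ADGCLayer(nn.Module):
    """Adaptive-direction graph convolution: edge weights are re-scored
    from the difference between each local feature and the global one,
    so suspected-occluded nodes send weaker messages."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, 1)                # scores each node
        self.fc2 = nn.Linear(dim, dim, bias=False)  # transforms messages

    def forward(self, V, v_g, A):
        # V: (B, K, C) local features; v_g: (B, C) global; A: (K, K) graph
        s = torch.sigmoid(self.fc1(V - v_g.unsqueeze(1)))  # (B, K, 1)
        A_dyn = A.unsqueeze(0) * s.transpose(1, 2)         # weight each sender
        A_dyn = A_dyn / A_dyn.sum(dim=2, keepdim=True).clamp(min=1e-6)
        return V + A_dyn @ self.fc2(V)                     # residual fusion
```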

Finally, we implement our high-order relation module (R) as a cascade of ADGC layers. Given an image $x$, we obtain its semantic features $V_S$ via Eq. (1); its relation features $V_R$ are then computed as:

$$V_R = \mathrm{ADGC}_n\big(\cdots\,\mathrm{ADGC}_1(V_S)\big) \tag{4}$$

Loss and Similarity. We use classification and triplet losses as our targets, as in Eq. (5), where the definitions of $p$, $d$, and $\alpha$ can be found in Eq. (2) and $c_k$ is the key-point confidence.

$$\mathcal{L}_R = \sum_{k=0}^{K} c_k \Big( -\log p\big(y \mid v_R^k\big) + \big[\alpha + d(v_{R,a}^k, v_{R,p}^k) - d(v_{R,a}^k, v_{R,n}^k)\big]_+ \Big) \tag{5}$$

Given two images $x^1$ and $x^2$, we obtain their relation features $V_R^1$ and $V_R^2$ via Eq. (4), and calculate their similarity with the cosine distance as in Eq. (6).

$$s_R(x^1, x^2) = \frac{1}{K} \sum_{k=1}^{K} \cos\big(v_R^{1,k},\, v_R^{2,k}\big) \tag{6}$$

3.2 High-Order Human-Topology Learning

Part-based features have proved very efficient for person ReID [34, 33]. A simple alignment strategy directly matches features of the same key-points. However, this one-order alignment cannot handle bad cases such as outliers, especially under heavy occlusion [17]. Graph matching [37, 35] naturally takes high-order human-topology information into consideration, but it learns only one-to-one correspondence. Such hard alignment is still sensitive to outliers and hurts performance. In this module, we propose a novel cross-graph embedded-alignment layer, which makes full use of the human-topology information learned by the graph matching algorithm while avoiding sensitive one-to-one alignment.

Review of Graph Matching. Given two graphs $G^1$ and $G^2$ from images $x^1$ and $x^2$, the goal of graph matching is to learn a matching between $G^1$ and $G^2$. Let $u$ be an indicator vector such that $u_{ia}$ is the matching degree between node $i$ of $G^1$ and node $a$ of $G^2$. A square symmetric positive matrix $M$ is built such that $M_{ia,jb}$ measures how well every pair of nodes $(i, j)$ in $G^1$ matches a pair $(a, b)$ in $G^2$. For pairs that do not form edges, the corresponding entries in the matrix are set to 0. The diagonal entries contain node-to-node scores, whereas the off-diagonal entries contain edge-to-edge scores. The optimal matching can thus be formulated as:

$$u^* = \operatorname*{arg\,max}_{u}\ u^\top M u, \quad \text{s.t.}\ \|u\|_2 = 1 \tag{7}$$

Following [37], we parameterize the matrix $M$ in terms of unary and pairwise point features. The optimization is carried out by power-iteration and bi-stochastic operations, so we can optimize $M$ in our deep-learning framework with stochastic gradient descent. Due to space limitations we do not show further details of graph matching; please refer to [35, 37].
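To make these two steps concrete, here is a small differentiable sketch following the spectral relaxation of [37] (function names are ours):

```python
import torch

def power_iteration(M, n1, n2, iters=20):
    """Leading eigenvector of the affinity matrix M ((n1*n2) x (n1*n2)),
    i.e. the relaxed solution of Eq. (7), reshaped to a soft matching."""
    u = torch.full((M.shape[0],), 1.0 / M.shape[0] ** 0.5)
    for _ in range(iters):
        u = M @ u
        u = u / u.norm().clamp(min=1e-12)
    return u.view(n1, n2)

def bistochastic(S, iters=10):
    """Sinkhorn-style normalization so rows and columns each sum to 1."""
    S = S.clamp(min=1e-12)
    for _ in range(iters):
        S = S / S.sum(dim=1, keepdim=True)
        S = S / S.sum(dim=0, keepdim=True)
    return S

# toy usage: 13 key-points per image
M = torch.rand(13 * 13, 13 * 13)
M = (M + M.t()) / 2               # symmetric, nonnegative affinities
U = bistochastic(power_iteration(M, 13, 13))
```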

Figure 4: Illustration of the cross-graph embedded-alignment (CGEA) layer. Here, $\times$ is matrix multiplication, FC-ReLU denotes a fully-connected layer followed by a rectified linear unit, GM denotes the graph matching operation, and $U$ is the learned affinity matrix. Please refer to the text for more details.

Cross-Graph Embedded-Alignment Layer with Similarity Prediction. We propose a novel cross-graph embedded-alignment (CGEA) layer that both considers the high-order human-topology information learned by graph matching and avoids sensitive one-to-one alignment. The proposed CGEA layer takes two graphs from two images as inputs and outputs embedded features that contain both the semantic features and the human-topology-guided aligned features.

The structure of our proposed CGEA layer is shown in Figure 4. It takes two groups of features as input and outputs two groups of features. First, given two groups of nodes $V^1$ and $V^2$, we embed them into a hidden space with a fully-connected layer and a ReLU layer, obtaining two groups of hidden features $H^1$ and $H^2$. Second, we perform graph matching between $H^1$ and $H^2$ via Eq. (7) and obtain an affinity matrix $U$, where $U_{ij}$ is the correspondence between node $i$ of $V^1$ and node $j$ of $V^2$. Finally, the output can be formulated as in Eq. (8), where $[\cdot \,\|\, \cdot]$ denotes concatenation along the channel dimension and FC is a fully-connected layer.

$$\hat{V}^1 = \mathrm{FC}\big(\big[V^1 \,\|\, U H^2\big]\big), \qquad \hat{V}^2 = \mathrm{FC}\big(\big[V^2 \,\|\, U^\top H^1\big]\big) \tag{8}$$
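A compact sketch of such a layer under our notation is shown below; for brevity the affinity $U$ is passed in as an argument, whereas the paper computes it inside the layer by graph matching on the hidden features:

```python
import torch
import torch.nn as nn

class CGEALayer(nn.Module):
    """Cross-graph embedded alignment: treat the matching affinity U as a
    cross-image adjacency and fuse each node with its aligned message."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, V1, V2, U):
        # V1, V2: (K, C) node features; U: (K, K), U[i, j] = correspondence
        # between node i of image 1 and node j of image 2
        H1, H2 = self.embed(V1), self.embed(V2)
        out1 = self.fuse(torch.cat([V1, U @ H2], dim=-1))
        out2 = self.fuse(torch.cat([V2, U.t() @ H1], dim=-1))
        return out1, out2
```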

We implement our high-order topology module (T) with a cascade of CGEA layers and a similarity prediction layer. Given a pair of images $(x^1, x^2)$, we obtain their relation features $(V_R^1, V_R^2)$ via Eq. (4), and then their topology features $(V_T^1, V_T^2)$ via Eq. (9). After obtaining the topology feature pair, we compute their similarity using Eq. (10), where $|\cdot|$ is the element-wise absolute value, FC is a fully-connected layer mapping from $\mathbb{R}^C$ to $\mathbb{R}$, and $\sigma$ is the sigmoid activation function.

$$(V_T^1, V_T^2) = \mathrm{CGEA}_n\big(\cdots\,\mathrm{CGEA}_1(V_R^1, V_R^2)\big) \tag{9}$$
$$s_T(x^1, x^2) = \sigma\big(\mathrm{FC}\big(\lvert v_T^1 - v_T^2 \rvert\big)\big) \tag{10}$$

Verification Loss. The loss of our high-order human-topology module is formulated in Eq. (11), where $y$ is the ground-truth label: $y = 1$ if $(x^1, x^2)$ are from the same person and $y = 0$ otherwise.

$$\mathcal{L}_V = -\,y \log s_T(x^1, x^2) - (1 - y)\log\big(1 - s_T(x^1, x^2)\big) \tag{11}$$
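A minimal sketch of Eqs. (10) and (11) (class and variable names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerificationHead(nn.Module):
    """s_T = sigmoid(FC(|f1 - f2|)), trained with binary cross-entropy."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, f1, f2):
        return torch.sigmoid(self.fc((f1 - f2).abs())).squeeze(-1)

head = VerificationHead(256)
f1, f2 = torch.randn(8, 256), torch.randn(8, 256)
y = torch.randint(0, 2, (8,)).float()            # 1 = same identity
loss = F.binary_cross_entropy(head(f1, f2), y)   # Eq. (11)
```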

4 Train and Inference

During the training stage, the overall objective of our framework is formulated in Eq. (12), where the $\lambda$'s are the weights of the corresponding terms. We train the framework end-to-end by minimizing $\mathcal{L}$.

$$\mathcal{L} = \lambda_S \mathcal{L}_S + \lambda_R \mathcal{L}_R + \lambda_T \mathcal{L}_V \tag{12}$$

For the similarity, given a pair of images $(x^1, x^2)$, we obtain their relation-based similarity $s_R$ from Eq. (6) and their topology-based similarity $s_T$ from Eq. (10). The final similarity is calculated by combining the two kinds of similarities, as in Eq. (13), where $\gamma_R$ and $\gamma_T$ weight the two terms.

$$s(x^1, x^2) = \gamma_R\, s_R(x^1, x^2) + \gamma_T\, s_T(x^1, x^2) \tag{13}$$

At inference time, given a query image, we first compute its relation-based similarity $s_R$ with all gallery images and take its top-$k$ nearest neighbors. We then compute the final similarity in Eq. (13) to refine the ranking of this top-$k$ set.
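This two-stage procedure can be sketched as follows, where the combination weights and the `sim_topology` callable wrapping the topology module are placeholders of ours:

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, gallery_feats, sim_topology, k=100, g_r=0.5, g_t=0.5):
    """Rank by the cheap relation similarity s_R (Eq. 6), then re-score the
    top-k candidates with the pairwise topology similarity s_T (Eq. 10)
    and combine as in Eq. (13)."""
    q = F.normalize(query_feat, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    s_r = g @ q                                    # cosine similarities, (N,)
    s = g_r * s_r
    for i in s_r.topk(min(k, len(s_r))).indices.tolist():
        s[i] = g_r * s_r[i] + g_t * sim_topology(query_feat, gallery_feats[i])
    return s.argsort(descending=True)              # refined ranking
```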

5 Experiments

Dataset | Train (ID/Image) | Gallery (ID/Image) | Query (ID/Image)
Market-1501 | 751/12,936 | 750/19,732 | 750/3,368
DukeMTMC-reID | 702/16,522 | 1,110/17,661 | 702/2,228
Occluded-Duke | 702/15,618 | 1,110/17,661 | 519/2,210
Occluded-ReID | - | 200/1,000 | 200/1,000
Partial-REID | - | 60/300 | 60/300
Partial-iLIDS | - | 119/119 | 119/119
Table 1: Dataset details. We extensively evaluate our proposed method on 6 public datasets: 2 holistic, 2 occluded, and 2 partial.
Methods | Occluded-Duke Rank-1 | Occluded-Duke mAP | Occluded-REID Rank-1 | Occluded-REID mAP
Part-Aligned [38] | 28.8 | 20.2 | - | -
PCB [34] | 42.6 | 33.7 | 41.3 | 38.9
Part Bilinear [31] | 36.9 | - | - | -
FD-GAN [5] | 40.8 | - | - | -
AMC+SWM [42] | - | - | 31.2 | 27.3
DSR [8] | 40.8 | 30.4 | 72.8 | 62.8
SFR [9] | 42.3 | 32.0 | - | -
Ad-Occluded [12] | 44.5 | 32.2 | - | -
TCSDO [46] | - | - | 73.7 | 77.9
FPR [10] | - | - | 78.3 | 68.0
PGFA [25] | 51.4 | 37.3 | - | -
HONet (Ours) | 55.1 | 43.8 | 80.3 | 70.2
Table 2: Comparison with state-of-the-art methods on two occluded datasets, i.e. Occluded-Duke [25] and Occluded-REID [45].

5.1 Implementation Details

Model Architectures. For the CNN backbone, as in [40], we utilize ResNet50 [7], removing its global average pooling (GAP) layer and fully-connected layer. To acquire high-resolution features and more structural information, following [34, 4], we set the stride of the last residual module's convolution to 1 and obtain 16x down-sampled feature maps. For the classifiers, following [23], we use a batch normalization layer [13] and a fully-connected layer followed by a softmax function. For the human key-points model, we use HR-Net [32] pre-trained on the COCO dataset [20], a state-of-the-art key-points model. The model predicts 17 key-points; we fuse all key-points in the head region and obtain 13 final key-points, covering the head, shoulders, elbows, wrists, hips, knees, and ankles.
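For illustration, the head-region fusion could be implemented as below; the COCO key-point ordering is standard, while the max-fusion operator is our assumption:

```python
import torch

# COCO key-point order: 0 nose, 1-2 eyes, 3-4 ears, 5-6 shoulders,
# 7-8 elbows, 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles
def fuse_head_keypoints(heatmaps):
    """Merge the 5 head heat maps (nose, eyes, ears) into one channel,
    reducing 17 COCO key-points to 13."""
    head = heatmaps[:, :5].max(dim=1, keepdim=True).values  # (B, 1, H, W)
    return torch.cat([head, heatmaps[:, 5:]], dim=1)        # (B, 13, H, W)
```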

Training Details. We implement our framework in PyTorch. Images are resized to 256x128 and augmented with random horizontal flipping, 10-pixel padding with random cropping, and random erasing [44]. When testing on occluded/partial datasets, we use extra color-jitter augmentation to reduce domain variance. The batch size is set to 64 with 4 images per person. During training, all three modules are jointly optimized end-to-end for 120 epochs with an initial learning rate of 3.5e-4, decayed by a factor of 0.1 at epochs 30 and 70.
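The optimization schedule can be reproduced with a standard PyTorch setup such as the following; the choice of Adam is our assumption, as the text specifies only the learning-rate schedule:

```python
import torch

model = torch.nn.Linear(2048, 751)  # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 70], gamma=0.1)  # x0.1 decay at epochs 30, 70

for epoch in range(120):
    # ... one full training epoch: forward pass, loss of Eq. (12),
    # backward pass, and optimizer.step() for every batch ...
    scheduler.step()
```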

Evaluation Metrics. We use standard metrics, as in most person ReID literature, namely Cumulative Matching Characteristic (CMC) curves and mean average precision (mAP), to evaluate the quality of different person re-identification models. All experiments are performed in the single-query setting.

5.2 Experimental Results

Results on Occluded Datasets. We evaluate our proposed framework on two occluded datasets, Occluded-Duke [25] and Occluded-ReID [45]. Occluded-Duke is derived from DukeMTMC-reID by keeping occluded images and filtering out some overlapping ones. It contains 15,618 training images, 17,661 gallery images, and 2,210 occluded query images. Occluded-ReID was captured by mobile cameras and consists of 2,000 images of 200 occluded persons. Each identity has five full-body images and five occluded images with different types of severe occlusion.

Four kinds of methods are compared: vanilla holistic ReID methods [38, 34], holistic ReID methods with key-points information [31, 5], partial ReID methods [42, 8, 9], and occluded ReID methods [12, 46, 10, 25]. The experimental results are shown in Table 2. As we can see, there is no significant gap between vanilla holistic ReID methods and holistic methods with key-points information. For example, PCB [34] and FD-GAN [5] achieve 42.6% and 40.8% Rank-1 scores on the Occluded-Duke dataset, showing that simply using key-points information may not significantly benefit the occluded ReID task. Partial and occluded ReID methods both achieve clear improvements on the occluded datasets; for example, DSR [8] obtains a 72.8% and FPR [10] a 78.3% Rank-1 score on Occluded-REID. This shows that the occluded and partial ReID tasks share similar difficulties, i.e. learning discriminative features and feature alignment. Finally, our proposed framework achieves the best performance on Occluded-Duke and Occluded-REID, with Rank-1 scores of 55.1% and 80.3%, demonstrating its effectiveness.

Results on Partial Datasets. Alongside occluded images, partial ones often occur due to imperfect detection, outliers of camera views, and so on. To further evaluate our proposed framework, Table 3 reports results on two partial datasets, Partial-REID [42] and Partial-iLIDS [8]. Partial-REID includes 600 images of 60 people, with five full-body and five partial images per person, and is used only for testing. Partial-iLIDS is based on the iLIDS [8] dataset and contains a total of 238 images of 119 people captured by multiple non-overlapping cameras in an airport, with occluded regions manually cropped. Following [33, 10, 46], because the two partial datasets are too small for training, we use Market-1501 as the training set and the two partial datasets as test sets. Our proposed framework significantly outperforms the other methods, by at least 2.6% and 3.5% Rank-1 on the two datasets.

Methods | Partial-REID Rank-1 | Partial-REID Rank-3 | Partial-iLIDS Rank-1 | Partial-iLIDS Rank-3
DSR [8] | 50.7 | 70.0 | 58.8 | 67.2
SFR [9] | 56.9 | 78.5 | 63.9 | 74.8
VPM [33] | 67.7 | 81.9 | 65.5 | 74.8
PGFA [25] | 68.0 | 80.0 | 69.1 | 80.9
AFPB [45] | 78.5 | - | - | -
FPR [10] | 81.0 | - | 68.1 | -
TCSDO [46] | 82.7 | - | - | -
HONet (Ours) | 85.3 | 91.0 | 72.6 | 86.4
Table 3: Comparison with state-of-the-art methods on two partial datasets, i.e. Partial-REID [42] and Partial-iLIDS [8]. Our method achieves the best performance on both partial datasets.

Results on Holistic Datasets. Although recent occluded/partial ReID methods have achieved improvements on occluded/partial datasets, they often fail to reach satisfying performance on holistic datasets, due to the noise introduced during feature learning and alignment. In this part, we show that our proposed framework also achieves satisfying performance on the holistic ReID datasets Market-1501 and DukeMTMC-reID. Market-1501 [39] contains 1,501 identities observed from 6 camera viewpoints, with 12,936 training images and 19,732 gallery images; it contains few occluded or partial person images. DukeMTMC-reID [27, 43] contains 1,404 identities, 16,522 training images, 2,228 queries, and 17,661 gallery images.

Methods | Market-1501 Rank-1 | Market-1501 mAP | DukeMTMC Rank-1 | DukeMTMC mAP
PCB [34] | 92.3 | 77.4 | 81.8 | 66.1
VPM [33] | 93.0 | 80.8 | 83.6 | 72.6
BOT [23] | 94.1 | 85.7 | 86.4 | 76.4
SPReID [14] | 92.5 | 81.3 | - | -
MGCAM [29] | 83.8 | 74.3 | 46.7 | 46.0
MaskReID [26] | 90.0 | 75.3 | - | -
FPR [10] | 95.4 | 86.6 | 88.6 | 78.4
PDC [30] | 84.2 | 63.4 | - | -
Pose-transfer [21] | 87.7 | 68.9 | 30.1 | 28.2
PSE [28] | 87.7 | 69.0 | 27.3 | 30.2
PGFA [25] | 91.2 | 76.8 | 82.6 | 65.5
HONet (Ours) | 94.2 | 84.9 | 86.9 | 75.6
Table 4: Comparison with state-of-the-art methods on two holistic datasets, Market-1501 [39] and DukeMTMC-reID [27, 43]. Our method achieves comparable performance on holistic ReID.

Specifically, we conduct experiments on the two common holistic ReID datasets Market-1501 [39] and DukeMTMC-reID [27, 43], and compare with 3 vanilla ReID methods [34, 33, 23], 4 ReID methods with human-parsing information [14, 29, 26, 10], and 4 holistic ReID methods with key-points information [30, 21, 28, 25]. The experimental results are shown in Table 4. The 3 vanilla holistic ReID methods obtain very competitive performance; for example, BOT [23] achieves 94.1% and 86.4% Rank-1 scores on the two datasets. In contrast, holistic ReID methods that use external cues such as human parsing and key-points often perform worse. For example, SPReID [14] uses human-parsing information and achieves only a 92.5% Rank-1 score on Market-1501, and PGFA [25] uses key-points information and reaches only an 82.6% Rank-1 score on DukeMTMC-reID. This shows that simply using external cues may not bring improvement on holistic ReID datasets: most images in holistic datasets are well detected, and vanilla holistic ReID methods are powerful enough to learn discriminative features. Our framework instead uses an adaptive direction graph convolutional (ADGC) layer to suppress noisy features and a cross-graph embedded-alignment (CGEA) layer to avoid hard one-to-one alignment. With these layers, it achieves comparable performance on the two holistic datasets, with 94.2% and 86.9% Rank-1 scores on Market-1501 and DukeMTMC-reID, respectively.

5.3 Model Analysis

Index | S | R | T | Rank-1 | mAP
1 | | | | 49.9 | 39.5
2 | ✓ | | | 52.4 | 42.8
3 | ✓ | ✓ | | 53.9 | 43.2
4 | ✓ | ✓ | ✓ | 55.1 | 43.8
Table 5: Analysis of the one-order semantic module (S), high-order relation module (R), and high-order human-topology module (T). The experimental results show the effectiveness of all three proposed modules.

Analysis of Proposed Modules. In this part, we analyze our proposed one-order semantic module (S), high-order relation module (R), and high-order human-topology module (T). The experimental results are shown in Table 5. First, in index-1, we remove all three modules, degrading our framework to an IDE model [40] where only a global feature is available; its performance is unsatisfying, at only a 49.9% Rank-1 score. Second, in index-2, adding one-order semantic information improves performance by 2.5%, up to a 52.4% Rank-1 score, showing that semantic information from key-points is useful for learning and aligning features. Third, in index-3, adding high-order relation information further improves performance to 53.9%, demonstrating the effectiveness of the relation module. Finally, in index-4, our full framework achieves the best accuracy, a 55.1% Rank-1 score, showing the effectiveness of the topology module.

Analysis of Proposed Layers. In this part, we further analyze the normalization of key-point confidences (NORM), the adaptive direction graph convolutional (ADGC) layer, and the cross-graph embedded-alignment (CGEA) layer, which are the key components of the semantic module (S), relation module (R), and topology module (T), respectively. Specifically, when removing NORM, we directly use the original confidence scores. When removing ADGC, we replace the dynamic adjacency matrix in Eq. (3) with a fixed adjacency matrix linked like a human topology; the relation module then degrades to a vanilla GCN, which cannot suppress noisy information. When removing CGEA, we replace the learned affinity matrix in Eq. (8) with a fully-connected one, i.e. every node of graph 1 is connected to all nodes of graph 2; the topology module then contains no high-order human-topology information for feature alignment and degrades to a vanilla verification module. The experimental results are shown in Table 6. Removing NORM, ADGC, or CGEA drops the Rank-1 score by 2.6%, 1.4%, and 0.7%, respectively, which demonstrates the effectiveness of all three proposed components.

NORM | ADGC | CGEA | Rank-1 | mAP
 | ✓ | ✓ | 52.5 | 40.4
✓ | | ✓ | 53.7 | 42.2
✓ | ✓ | | 54.4 | 43.5
✓ | ✓ | ✓ | 55.1 | 43.8
Table 6: Analysis of the normalization of key-point confidences (NORM), the adaptive direction graph convolutional (ADGC) layer, and the cross-graph embedded-alignment (CGEA) layer. The experimental results show the effectiveness of our proposed layers.

Analysis of Parameters. We evaluate the effect of the parameters $\gamma_R$ and $\gamma_T$ in Eq. (13). The results are shown in Figure 5; when analyzing one parameter, the other is fixed at its optimal value. Clearly, for a wide range of $\gamma_R$ and $\gamma_T$, our model stably outperforms the baseline model, showing that the proposed framework is robust to the choice of weights. Note that the performance here differs slightly from Table 2 because the results here are averaged over 10 runs for a fair comparison.

Figure 5: Analysis of the parameters $\gamma_R$ and $\gamma_T$ in Eq. (13). When analyzing one of them, the other is fixed at its optimal value. The experimental results show that our model is robust to different parameter settings.

6 Conclusion

In this paper, we propose a novel framework to learn high-order relation information for discriminative features and topology information for robust alignment. For learning relation information, we formulate local features of an image as nodes of a graph and propose an adaptive-direction graph convolutional (ADGC) layer to promote the message passing of semantic features and suppress that of meaningless and noisy ones. For learning topology information, we propose a cross-graph embedded-alignment (CGEA) layer conjugated with a verification loss, which can avoid sensitive hard one-to-one alignment and perform a robust soft alignment. Finally, extensive experiments on occluded, partial and holistic datasets demonstrate the effectiveness of our proposed framework.

Acknowledgments

This research was supported by National Key R&D Program of China (No. 2017YFA0700800).

References

  • [1] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, and M. Malinowski (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  • [2] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008.
  • [3] X. Fan, H. Luo, X. Zhang, L. He, C. Zhang, and W. Jiang (2018) SCPNet: spatial-channel parallelism network for joint holistic and partial person re-identification. In Asian Conference on Computer Vision, pp. 19–34.
  • [4] Y. Fu, Y. Wei, Y. Zhou, H. Shi, G. Huang, X. Wang, Z. Yao, and T. Huang (2019) Horizontal pyramid matching for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8295–8302.
  • [5] Y. Ge, Z. Li, H. Zhao, G. Yin, S. Yi, X. Wang, et al. (2018) FD-GAN: pose-guided feature distilling GAN for robust person re-identification. In Advances in Neural Information Processing Systems, pp. 1222–1233.
  • [6] S. Gong, M. Cristani, S. Yan, and C. C. Loy (2014) Person re-identification. Springer.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [8] L. He, J. Liang, H. Li, and Z. Sun (2018) Deep spatial feature reconstruction for partial person re-identification: alignment-free approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7073–7082.
  • [9] L. He, Z. Sun, Y. Zhu, and Y. Wang (2018) Recognizing partial biometric patterns. arXiv preprint arXiv:1810.07399.
  • [10] L. He, Y. Wang, W. Liu, X. Liao, H. Zhao, Z. Sun, and J. Feng (2019) Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. arXiv preprint.
  • [11] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • [12] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang (2018) Adversarially occluded samples for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5098–5107.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
  • [14] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah (2018) Human semantic parsing for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1062–1071.
  • [15] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • [16] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2288–2295.
  • [17] J. Li, C. Wang, H. Zhu, Y. Mao, H. Fang, and C. Lu (2018) CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10863–10872.
  • [18] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2197–2206.
  • [19] S. Liao and S. Z. Li (2015) Efficient PSD constrained asymmetric metric learning for person re-identification. In IEEE International Conference on Computer Vision (ICCV), pp. 3685–3693.
  • [20] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755.
  • [21] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu (2018) Pose transferrable person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4099–4108.
  • [22] H. Luo, X. Fan, C. Zhang, and W. Jiang (2019) STNReID: deep convolutional networks with pairwise spatial transformer networks for partial person re-identification. arXiv preprint arXiv:1903.07072.
  • [23] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
  • [24] B. Ma, Y. Su, and F. Jurie (2014) Covariance descriptor based on bio-inspired features for person re-identification and face verification. Image and Vision Computing 32(6-7), pp. 379–390.
  • [25] J. Miao, Y. Wu, P. Liu, Y. Ding, and Y. Yang (2019) Pose-guided feature alignment for occluded person re-identification. In IEEE International Conference on Computer Vision (ICCV).
  • [26] L. Qi, J. Huo, L. Wang, Y. Shi, and Y. Gao (2018) MaskReID: a mask based deep ranking neural network for person re-identification. arXiv preprint arXiv:1804.03864.
  • [27] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision (ECCV), pp. 17–35.
  • [28] M. Saquib Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen (2018) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 420–429.
  • [29] C. Song, Y. Huang, W. Ouyang, and L. Wang (2018) Mask-guided contrastive attention model for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1179–1188.
  • [30] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017) Pose-driven deep convolutional model for person re-identification. In IEEE International Conference on Computer Vision (ICCV), pp. 3980–3989.
  • [31] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee (2018) Part-aligned bilinear representations for person re-identification. In European Conference on Computer Vision (ECCV), pp. 418–437.
  • [32] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [33] Y. Sun, Q. Xu, Y. Li, C. Zhang, Y. Li, S. Wang, and J. Sun (2019) Perceive where to focus: learning visibility-aware part-level features for partial person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 393–402.
  • [34] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In European Conference on Computer Vision (ECCV), pp. 480–496.
  • [35] R. Wang, J. Yan, and X. Yang (2019) Learning combinatorial embedding networks for deep graph matching. arXiv preprint arXiv:1904.00597.
  • [36] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li (2014) Salient color names for person re-identification. In European Conference on Computer Vision (ECCV), pp. 536–551.
  • [37] A. Zanfir and C. Sminchisescu (2018) Deep learning of graph matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2684–2693.
  • [38] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017) Deeply-learned part-aligned representations for person re-identification. In IEEE International Conference on Computer Vision (ICCV), pp. 3239–3248.
  • [39] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124.
  • [40] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984.
  • [41] W. Zheng, S. Gong, and T. Xiang (2013) Reidentification by relative distance comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(3), pp. 653–668.
  • [42] W. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong (2015) Partial person re-identification. In IEEE International Conference on Computer Vision (ICCV), pp. 4678–4686.
  • [43] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717.
  • [44] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896.
  • [45] J. Zhuo, Z. Chen, J. Lai, and G. Wang (2018) Occluded person re-identification. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
  • [46] J. Zhuo, J. Lai, and P. Chen (2019) A novel teacher-student learning framework for occluded person re-identification. arXiv preprint arXiv:1907.03253.