A Novel Teacher-Student Learning Framework For Occluded Person Re-Identification

07/07/2019
by Jiaxuan Zhuo, et al., Sun Yat-sen University

Person re-identification (re-id) has made great progress in recent years, but occlusion remains a challenging problem that significantly degrades identification performance. In this paper, we design a teacher-student learning framework to learn an occlusion-robust model across the full-body person domain and the occluded person domain. Notably, the teacher network uses only large-scale full-body person data to simulate the learning process of occluded person re-id. Based on the teacher network, the student network then trains a better model using the scarce real-world occluded person data. To transfer more knowledge from the teacher network to the student network, we equip the proposed framework with a co-saliency network and a cross-domain simulator. The co-saliency network extracts the backbone features, on which two separated collaborative branches are built: a classification branch for identity recognition, and a co-saliency branch that guides the network to highlight meaningful parts without any manual annotation. The cross-domain simulator generates artificial occlusions on full-body person data with a growing probability, so that the teacher network can train a cross-domain model by observing more and more occluded cases. Experiments on four occluded person re-id benchmarks show that our method outperforms state-of-the-art methods.


1. Introduction

With the explosive growth of video surveillance systems, person re-identification (re-id) is an increasingly significant task for searching for specific persons in surveillance, e.g., criminals, children and other missing persons, and has been widely researched and developed. However, in real-world applications of person re-id, the target pedestrian is often partially occluded by dynamic or static obstacles, such as other persons, cars, signposts and pillars, especially in crowded places. In the case of occlusions, existing person re-id algorithms no longer perform well because they assume that all detected targets are full-body persons.

Figure 1. Scenario of occluded person re-id and samples from occluded person re-id datasets. Given a person with occlusions detected by one camera, the occluded person re-id task aims to search for the same identity in the pedestrian database or across other non-overlapping cameras.

Occluded person re-id (Zheng et al., 2016b; Zhuo et al., 2018; He et al., 2018; Fan et al., 2018) is the task of re-identifying a person given a detected target with occlusions (Figure 1). Different from the full-body person domain, the occluded person domain contains detected target persons covered with various partial occlusions, which makes correct matching hard. The challenges of occluded person re-id are as follows. First, following the ordinary practice in person re-id, features extracted by traditional filters or convolutional neural networks (CNNs) over whole images are easily corrupted by the occluded regions. As shown in Figure 2(a), this interference from occlusions is likely to cause misleading matches. Second, some works (Zhou and Yuan, 2018; Zhang et al., 2018; Zhu et al., 2018) suggested using pedestrian detection to obtain tight bounding boxes of person body parts so as to extract effective features. However, occlusions have unknown spatial positions, sizes, shapes and colors in different cases, and may even be incomplete objects, so it is unlikely that a detector can be trained to accurately separate pedestrians from occlusions. Furthermore, even if a bounding box covering only person parts can be detected, as in (Zheng et al., 2016b), some parts of the person body may still be excluded or covered by residual occlusions, because the boundary between the target person and occlusions is rarely a straight edge. In short, it is difficult to delicately eliminate occlusions in this task. Last but not least, one of the most serious drawbacks is that there are insufficient occluded person data to learn a robust model for occluded person re-id, since most existing public large-scale datasets belong to the full-body person domain; this is another problem that needs to be solved.

Figure 2. Motivation of our proposal. The histogram represents the similarity between queries and galleries. Red boxes mark the same identity, while green boxes mark different identities. The saliency maps beside the images show the regions on which the network focuses. (a) Extracting features over the whole image easily causes mismatching. (b) Extracting features focused on the essential parts is more likely to produce a correct match.

In order to address occluded person re-id, we consider training a model which pays more attention to person body parts rather than the whole image. Based on the above consideration, we require the model to capture the essential clues mainly from target persons and ignore other useless information from occlusions or backgrounds. If the essential parts play a major role in features, incorrect matching could be reduced, as shown in Figure 2(b). But it is difficult to train such a robust model without adequate occluded person data. Therefore, we consider taking advantage of large-scale full-body person data to improve the learning of occluded person data.

Considering the above issues, we propose a teacher-student learning framework consisting of a ”teacher” stage and a ”student” stage. In the ”teacher” stage, we only make use of large-scale full-body person data to simulate the occluded person re-id task. The ”teacher” provides the ”student” with a basic scheme for addressing the real-world occluded person re-id problem. We design a co-saliency network as the teacher network, together with a cross-domain simulator. The co-saliency network extracts the backbone features, followed by two branches, a classification branch and a co-saliency branch. Both branches feed back to the shared backbone, though they perform different tasks: 1) the classification branch acts as an identity classifier, 2) while the co-saliency branch aims to separate the pixel-wise predominant regions, i.e., person body parts, from the rest of the image. The former enables the network to identify different persons and the latter helps the network capture essential clues by highlighting salient person body parts. Notably, when training the co-saliency branch, the ground truth comes from masks predicted by an existing salient object detector rather than from manual annotation. To offer the teacher network simulated occluded material, the cross-domain simulator transforms full-body person data into simulated occluded person data by generating various artificial occlusions over full-body person images with a growing probability during training. As the iterations increase, the ”teacher” learns a more occlusion-robust model by observing more and more simulated occluded person images. The ”student” then inherits a basic model from the ”teacher” and trains it on real-world occluded person data. Besides, the co-saliency branch of the ”teacher” is used to predict the masks of occluded person images, which offers better ground truth to the ”student”.

Consequently, our framework combines the simulation teaching of the ”teacher” with the practical learning of the ”student” to reach better performance in occluded person re-id. Its advantages are as follows: 1) with the help of teacher-student learning, we break through the restriction of inadequate occluded person data; 2) the ”teacher” provides an effective network that pays more attention to person body parts, from which the ”student” obtains a better model to deal with the harmful interference of occlusions.

In summary, this paper makes three main contributions.

  • We develop a novel two-stage teacher-student learning framework that addresses the challenges of occluded person re-id by building a bridge from the full-body person domain to the occluded person domain.

  • We propose a co-saliency network together with a cross-domain simulator, which trains an occlusion-robust model paying attention to person body parts.

  • Furthermore, the co-saliency branch achieves better performance for occluded person detection than other salient object detectors, and our proposal shows superiority over state-of-the-art occluded person re-id methods.

Figure 3. Overview of our proposed framework. The teacher network learns a basic occlusion-robust model only using large-scale full-body person data, which adopts a co-saliency network with a cross-domain simulator to simulate the occluded person re-id task. The student network then practices on real-world occluded person data based on the teacher network.

2. Related Work

Person re-identification. Recently, person re-id has made great progress. It mainly consists of two components, a feature extractor for describing images (Li et al., 2017; Xiao et al., 2016; Cheng et al., 2016; Sun et al., 2018; Yu et al., 2017; Liu et al., 2017; Chang et al., 2018; Kalayeh et al., 2018) and a similarity metric for distance learning (Cheng et al., 2016; Hermans et al., 2017; Chen et al., 2017; Varior et al., 2016; He et al., 2018). Previous works on feature extractors fall into two major groups, local ones (Liao et al., 2015; Li et al., 2017) and global ones (Xiao et al., 2016). For example, Sun et al. (Sun et al., 2018) proposed the Part-based Convolutional Baseline (PCB), a network that outputs a convolutional descriptor consisting of several part-level features. Yu et al. (Yu et al., 2017) argued that both high-level and mid-level features are relevant for Cross-Domain Instance Matching (CDIM). To extract more discriminative features, Chang et al. (Chang et al., 2018) proposed the Multi-Level Factorisation Net (MLFN), a novel network architecture that factorises the visual appearance of a person into latent discriminative factors at multiple semantic levels without manual annotation. Meanwhile, works on similarity metrics have also grown rapidly in recent years. Varior et al. (Varior et al., 2016) proposed a gated Siamese Convolutional Neural Network (gated S-CNN) that distinguishes positive pairs from hard negative pairs by finer local patterns. Cheng et al. (Cheng et al., 2016) presented a novel Multi-Channel Parts-based Convolutional Neural Network model with an improved triplet loss function, which pulls instances of the same person closer and pushes instances of different persons farther apart in the learned feature space. Different from most person re-id methods, which train on full-body person datasets, approaches to the occluded person re-id problem have progressed slowly.

Occluded person re-id. As person re-id is applied in real-world scenarios more frequently, some practical challenges have gained wide attention, one of which is occlusion. However, only a few methods (Zheng et al., 2016b; He et al., 2018; Fan et al., 2018) consider how to cope with the occluded person re-id problem. Zheng et al. (Zheng et al., 2016b) proposed a local patch-level matching model, the Ambiguity-sensitive Matching Classifier (AMC), together with a global-to-local matching model, Sliding Window Matching (SWM); this was the earliest work on the occluded person re-id problem. However, the computation cost of AMC-SWM was high because it extracted multi-patch features from a large number of grid patches. More recently, He et al. (He et al., 2018) introduced Deep Spatial feature Reconstruction (DSR), which flexibly computes the similarity of feature blocks of different sizes. Fan et al. (Fan et al., 2018) then proposed an unsupervised method, the Spatial-Channel Parallelism Network (SCPNet), which effectively uses local features to leverage global features.

As mentioned above, most previous works on occluded person re-id focus on extracting both local-patch and global-image features so as to acquire more information. However, the computation cost and the misalignment problem of local patches reduce the efficiency and effectiveness of the task. Different from utilizing local patches or features, our method captures the essential clues from the whole image by paying more attention to person body parts, and a teacher-student learning framework is designed to achieve better robustness against occlusions in practice.

3. Proposed Method

In this work, we present a teacher-student learning framework for occluded person re-id, a two-stage training procedure that moves from the full-body person domain to the occluded person domain. In this section we detail the implementation of the proposed framework.

3.1. Overview

We focus on the occluded person re-id problem, which suffers from limited occluded training data and unreliable features caused by occlusions, and address it through a teacher-student learning framework (Figure 3). This framework contains two stages, the ”teacher” stage for simulation teaching and the ”student” stage for practical learning, introduced in the following.

In the ”teacher” stage, we make use of large-scale full-body person data to simulate the learning process of occluded person re-id, which we call simulation teaching. To this end, we design a co-saliency network together with a cross-domain simulator. On the one hand, the co-saliency network extracts the backbone features and develops two separated branches, one of which is the classification branch for identity recognition while the other is the co-saliency branch for salient person detection. The ground truth of the co-saliency branch comes from an existing salient object detector. The two branches constrain the backbone feature extractor to focus on person body parts in order to extract occlusion-robust features. On the other hand, the cross-domain simulator transforms full-body person data into simulated occluded person data with a growing probability during training. Benefiting from the cross-domain simulator, the teacher network can extract discriminative and occlusion-robust features by training on full-body person data and simulated occluded person data jointly. After a period of learning, the teacher network obtains a basic occlusion-robust model by observing various cases of simulated occlusions.

Relying on the ”teacher” stage, the ”student” carries the basic model forward and trains it on real-world occluded person data, which we call practical learning. It is worth mentioning that the co-saliency branch improves the performance of salient occluded person detection; we therefore use the co-saliency branch from the ”teacher” stage, instead of the initial salient object detector, to generate the ground truth for the ”student” stage.
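As a rough sketch of this hand-off (the `teacher.co_saliency` interface and the loader format are our assumptions, not the paper's actual API), the pseudo ground truth for the ”student” stage could be produced by running the frozen teacher over the real-world occluded images:

```python
import torch

@torch.no_grad()
def generate_student_ground_truth(teacher, occluded_loader):
    # The frozen "teacher" predicts saliency masks for real-world occluded
    # images; these predictions replace the initial salient object detector
    # as the mask ground truth that supervises the "student" stage.
    teacher.eval()
    masks = []
    for images, _ in occluded_loader:
        masks.append(torch.sigmoid(teacher.co_saliency(images)))
    return torch.cat(masks, dim=0)
```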

In general, our framework implements occlusion-robust learning from the full-body person domain to the occluded person domain. Experiments in Section 4 show the effectiveness of our framework in both the supervised and the unsupervised occluded person re-id tasks.

3.2. Co-saliency Network

As pointed out in Section 1, the main drawback of occluded person re-id arises from unreliable features of occlusions, which easily cause misleading matching (Figure 2(a)). To avoid the undesirable effects of occlusions, we extract occlusion-robust features by paying attention to person body parts, and to this end we propose a co-saliency network.

The co-saliency network consists of one backbone and two separated collaborative branches. The backbone first acts as a feature extractor that exploits rich features through several convolution layers. All convolution layers in the backbone are divided into five convolution blocks, in each of which the resolution of the feature maps is half that of the preceding block. Two branches, namely the classification branch and the co-saliency branch, are then constructed on the shared backbone to guide the feature extraction. The classification branch is formed by a fully-connected layer and an identity classification loss function, which forces the backbone to learn discriminative features that minimize the classification error. The other branch, the co-saliency branch, is composed of four co-saliency (CS) blocks and a co-saliency loss (CS loss) function, and serves an image-to-image translation task, namely salient person detection. Each CS block uses a deconvolution layer with bilinear interpolation to upsample the preceding feature maps by a factor of two, and in parallel stacks two convolution layers to transform the intermediate feature maps from the backbone blocks into maps with a uniform number of channels. After a pixel-wise summation fuses these two outputs, two convolution layers convert the fused feature maps into compatible contextual information. Lastly, the CS loss function measures the pixel-level classification error between the predicted saliency masks and the ground truth. The co-saliency branch achieves finer saliency detection by combining multi-level features from deep and shallow side-output layers, which steers the backbone features toward salient person body regions.
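To make the CS block concrete, the following is a minimal PyTorch sketch of one block. The channel widths, kernel sizes and ReLU activations are our assumptions; the paper specifies only the 2x deconvolution, the two parallel stacked convolutions, the pixel-wise summation, and the two fusing convolutions.

```python
import torch.nn as nn

class CSBlock(nn.Module):
    """One co-saliency (CS) block of the co-saliency branch (a sketch)."""

    def __init__(self, deep_ch, lateral_ch, out_ch=64):
        super().__init__()
        # Deconvolution that upsamples the preceding feature maps by 2x
        # (the paper pairs it with bilinear interpolation; bilinear
        # initialization of these weights is omitted here for brevity).
        self.up = nn.ConvTranspose2d(deep_ch, out_ch, kernel_size=4, stride=2, padding=1)
        # Two parallel stacked convolutions that project the backbone's
        # intermediate feature maps to a uniform channel number.
        self.lateral = nn.Sequential(
            nn.Conv2d(lateral_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Two convolutions that turn the fused maps into contextual features.
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, deep, shallow):
        # Pixel-wise summation fuses the upsampled deep features with the
        # transformed lateral features from the corresponding backbone block.
        return self.fuse(self.up(deep) + self.lateral(shallow))
```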

In the co-saliency network, the backbone and the classification branch act as the feature extractor $F$ and the identity classifier $C_{id}$, respectively. Suppose that we have $D$ domains, each of which has $N_d$ images of $K_d$ identities. Let $x^d_i$ denote the $i$-th image of the $d$-th domain, while $y^d_i$ and $s^d_i$ are respectively its identity label and its saliency ground truth, so that all samples in domain $d$ are expressed as $X^d = \{(x^d_i, y^d_i, s^d_i)\}_{i=1}^{N_d}$. Thus, $F$ and $C_{id}$ can be optimized by the formula

$\mathcal{L}_{id} = \frac{1}{N_d} \sum_{i=1}^{N_d} \ell\big(C_{id}(F(x^d_i)),\ y^d_i\big),$   (1)

where $\ell(\cdot,\cdot)$ is the softmax loss function, i.e., the cross-entropy between the predicted result and the ground truth. The co-saliency branch $S$ is set as a salient person detector to localize the most conspicuous regions, and can be optimized by the formula

$\mathcal{L}_{cs} = \frac{1}{N_d} \sum_{i=1}^{N_d} \sum_{p} \ell\big(S(F(x^d_i))(p),\ s^d_i(p)\big),$   (2)

where $p$ is the position of a pixel in an image. Combining the backbone with the two branches, the loss function of the co-saliency network can be represented as

$\mathcal{L} = \mathcal{L}_{id} + \lambda\, \mathcal{L}_{cs},$   (3)

where $\mathcal{L}_{id}$ and $\mathcal{L}_{cs}$ represent the loss functions of the classification branch and the co-saliency branch respectively, and $\lambda$ is a hyper-parameter that balances the two separated branches. Since the classification branch plays the major role in occluded person re-id, it is reasonable to set $\lambda < 1$ in the loss function.

It can be seen that the classification branch and the co-saliency branch establish an interacting relationship by sharing the backbone features from $F$. Furthermore, the two branches promote each other because of their consistent goal of obtaining reliable features by focusing on person body parts. They help each other forward, like two collaborators on the same project. The co-saliency branch brings an impressive improvement to the classification branch by encoding the location of person body parts into the shared features; meanwhile, the classification branch transmits the meaningful semantic information of the pedestrians to the co-saliency branch through the same shared features. Our experiments confirm that this beneficial interaction between the two branches enhances the occlusion robustness of the features.
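Under the notation of Eqs. (1)-(3), a minimal sketch of the co-saliency network's loss might look as follows. Reading the pixel-level classification of Eq. (2) as per-pixel binary cross-entropy is our assumption, and lam = 0.8 follows Section 4.2:

```python
import torch.nn.functional as F

def co_saliency_network_loss(id_logits, id_labels, mask_logits, gt_masks, lam=0.8):
    # Eq. (1): softmax cross-entropy over identities.
    l_id = F.cross_entropy(id_logits, id_labels)
    # Eq. (2): pixel-level classification of the predicted saliency masks,
    # written here as binary cross-entropy (our reading of the paper).
    l_cs = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    # Eq. (3): lambda < 1 keeps the classification branch dominant.
    return l_id + lam * l_cs
```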

3.3. Cross-domain Simulator

To mitigate the limitation of inadequate occluded person data, we develop a teacher-student learning framework, which employs large-scale full-body person data to simulate the training on occluded person data. Nevertheless, performance degrades significantly when training is conducted directly across different domains, from the full-body person domain to the occluded person domain; the key is to narrow the difference between these two domains. Therefore, we design a cross-domain simulator, which constructs a gradual domain migration from full-body person data to simulated occluded person data. The cross-domain simulator works at the data-loading stage of each iteration in the co-saliency network. It keeps the identity labels unchanged and, with a preset probability, selects some of the full-body person images together with their saliency ground truth and covers them with various artificial occlusions. Specifically, this probability grows as the iterations increase, so that simulated occluded person data become more involved in training. Moreover, each selected sample is marked with a new label, the occluded/non-occluded binary classification (OBC) label, set to the occluded value, while the rest are marked with the non-occluded value. The OBC labels feed an occluded/non-occluded binary classification loss function, so we can integrate the identity classification loss and the OBC loss into the classification branch. The training procedure with the cross-domain simulator is shown in Algorithm 1.

The OBC classifier $C_{obc}$ aims to determine whether a sample comes from the full-body person domain or the occluded person domain, and can be optimized by the formula

$\mathcal{L}_{obc} = \frac{1}{N_d} \sum_{i=1}^{N_d} \ell\big(C_{obc}(F(x^d_i)),\ o^d_i\big),$   (4)

where $o^d_i \in \{0, 1\}$ is the OBC label; 0 and 1 denote the non-occluded and the occluded one, respectively.

Inspired by the cross-domain simulator, we combine the identity classification loss and the OBC loss into a multi-task loss in the classification branch, as in (Zhuo et al., 2018). The loss function of the classification branch is given by

$\mathcal{L}_{cls} = \mu\, \mathcal{L}_{id} + (1 - \mu)\, \mathcal{L}_{obc},$   (5)

where $\mathcal{L}_{obc}$ represents the OBC loss function and $\mu$ is a hyper-parameter that balances the proportion of the two classifiers in the classification branch; it is always set to more than 0.5. Thereby, we get the final loss function as

$\mathcal{L} = \sum_{d=1}^{D} \big( \mathcal{L}^d_{cls} + \lambda\, \mathcal{L}^d_{cs} \big),$   (6)

where $D$ is 2, because both the full-body person domain and the occluded person domain are used in our framework.
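Putting Eqs. (4)-(6) together, a batch-level sketch of the final multi-task loss could read as below. Treating the per-domain sum in Eq. (6) as a batch that mixes full-body and simulated occluded samples is our simplification; mu = lam = 0.8 follows Section 4.2.

```python
import torch.nn.functional as F

def full_loss(id_logits, id_labels, obc_logits, obc_labels,
              mask_logits, gt_masks, mu=0.8, lam=0.8):
    # Eq. (1) and Eq. (4): identity and occluded/non-occluded cross-entropy.
    l_id = F.cross_entropy(id_logits, id_labels)
    l_obc = F.cross_entropy(obc_logits, obc_labels)  # 2-way OBC classifier
    # Eq. (5): mu > 0.5 keeps the identity classifier dominant.
    l_cls = mu * l_id + (1 - mu) * l_obc
    # Eq. (2) read as pixel-wise binary cross-entropy.
    l_cs = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    # Eq. (6) for one mixed-domain batch.
    return l_cls + lam * l_cs
```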

In general, there are three advantages to using the cross-domain simulator: 1) The network can simultaneously optimise the discriminative capability of the features by training on full-body person data and their occlusion-robust capability by training on simulated occluded person data. Since full-body person data are more likely to be transformed into simulated occluded person data as iterations increase, the network gradually strengthens its robustness against occlusions by observing more simulated occluded cases. 2) The OBC loss can be incorporated into the network, encoding into the framework the prior information of whether a person is occluded or not. 3) The corresponding treatment of the saliency ground truth produces a variety of training pairs, acting as data augmentation, so that the co-saliency branch improves its capacity for detecting salient occluded persons.

Algorithm 1: Training with the cross-domain simulator
Input: Full-body person images $X = \{x_i\}_{i=1}^{N}$ ($N$ images),
   identity labels $Y = \{y_i\}$ and saliency masks $S = \{s_i\}$

   Max epoch of training $T$

   The growing probability $p$ of occlusions
Output: The occlusion-robust basic model
Initial: $t = 0$, $p = 0$
1:  while $t < T$ do:
2:   if $x_i$ is among the $\lfloor p \cdot N \rfloor$ images selected randomly:
3:    put $x_i$ into the cross-domain simulator
4:     paste a background patch onto $x_i$ at a random position
5:     paste a black patch onto $s_i$ at the same position
6:     OBC label $o_i \leftarrow 1$ (occluded person)
7:     identity label $y_i$ remains unchanged
8:   else:
9:    OBC label $o_i \leftarrow 0$ (non-occluded person)
10:   $x_i$ and $s_i$ remain unchanged
11:  train the model combining $\mathcal{L}_{cls}$ and $\mathcal{L}_{cs}$
12:  $t \leftarrow t + 1$
13:  increase $p$
14: end while
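As a concrete illustration, here is a minimal Python sketch of one simulator step (lines 2-10 of Algorithm 1). The patch-size range and the use of random noise as a stand-in for a background patch are our assumptions; the paper specifies only that a background patch is pasted onto the image and a black patch onto the mask at the same position.

```python
import random
import torch

def cross_domain_simulate(img, mask, p):
    """Apply the cross-domain simulator to one (image, mask) pair with
    occlusion probability p; returns the pair plus its OBC label."""
    if random.random() < p:
        _, h, w = img.shape                      # img: (C, H, W) tensor in [0, 1]
        ph = random.randint(h // 6, h // 2)      # occlusion height (assumed range)
        pw = random.randint(w // 6, w // 2)      # occlusion width (assumed range)
        top, left = random.randint(0, h - ph), random.randint(0, w - pw)
        img, mask = img.clone(), mask.clone()
        # Random values stand in for a pasted background patch.
        img[:, top:top + ph, left:left + pw] = torch.rand(img.size(0), ph, pw)
        # Occluded pixels become non-salient: a black patch on the mask.
        mask[:, top:top + ph, left:left + pw] = 0.0
        return img, mask, 1                      # OBC label 1: occluded
    return img, mask, 0                          # OBC label 0: non-occluded
```

The growing schedule itself can be as simple as $p = p_{max} \cdot t / T$, so that early epochs see mostly full-body images and later epochs see mostly simulated occlusions.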

4. Experiments

4.1. Datasets

We conduct our experiments on four occluded person re-id datasets (Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ) and one large-scale full-body person dataset (MARS), introduced as follows.

Occluded-REID (Zhuo et al., 2018) is an occluded person dataset captured by mobile cameras with different viewpoints and backgrounds. There are two folders, the occluded one and the whole one, including 2,000 images of 200 identities. Each identity has 5 full-body person images and 5 occluded person images with different types of occlusions.

Partial-REID (Zheng et al., 2016b) is the first partial person re-id dataset including 900 images of 60 persons. Each person has 5 full-body person images, 5 partial person images and 5 occluded person images with various occlusions. All the images were collected at a campus.

P-DukeMTMC-reID is a subset of DukeMTMC (Ristani et al., 2016), captured by multiple cameras recording outdoors on a campus. There are 24,143 images of 1,299 identities, and each identity has both full-body person images and occluded person images.

P-ETHZ is a subset of the pedestrian dataset ETHZ (Ess et al., 2008). This dataset has considerable illumination variance, scale variance and occlusion. Following (Zhuo et al., 2018), we use 3,897 images of 85 identities with both full-body person images and occluded person images.

MARS (Zheng et al., 2016a), an extension of Market-1501 (Zheng et al., 2015a), is the first large-scale video-based person re-id dataset. It consists of 1,191,003 bounding boxes of 1,261 different pedestrians from 20,478 video sequences captured by 6 cameras.

Dataset Occluded-REID Partial-REID P-DukeMTMC P-ETHZ
Method r=1 r=2 r=5 mAP r=1 r=2 r=5 mAP r=1 r=2 r=5 mAP r=1 r=2 r=5 mAP
w/o-S w/o-T 3.60 6.60 14.60 7.15 10.67 16.67 30.67 16.48 1.10 1.72 3.15 2.02 25.48 37.86 52.86 32.49
T:S 11.30 16.40 26.10 15.73 16.00 27.67 43.33 23.35 2.15 3.41 5.62 3.41 26.19 33.80 44.29 31.17
T:C 46.75 57.00 66.75 51.57 58.33 67.47 76.67 63.05 15.47 20.69 28.19 18.90 27.38 40.48 48.81 33.14
T:C+S 49.75 60.25 70.50 54.66 62.50 69.17 85.00 66.98 17.76 23.09 31.36 21.28 28.10 37.62 55.48 34.59
T:C+S+D 53.00 61.50 70.25 57.24 67.50 80.00 86.67 72.05 18.35 24.13 31.99 22.00 31.90 41.43 53.81 37.41
T:C+S+D+O(Ours) 55.00 64.50 77.25 59.84 69.17 76.67 85.83 73.11 18.80 24.23 32.21 22.37 33.33 40.48 45.23 37.44
w/-S w/o-T 24.20 32.80 48.40 30.22 25.67 36.67 56.67 32.98 36.15 43.79 54.70 40.67 30.95 41.43 53.33 36.83
T:S 63.79 75.60 86.00 68.85 60.00 72.67 88.33 65.91 40.44 49.54 59.93 45.14 45.39 58.10 71.59 51.74
T:C 62.70 72.30 83.30 67.34 66.00 78.00 87.99 71.16 42.58 51.72 63.75 42.94 48.33 61.67 79.05 55.16
T:C+S 68.50 78.79 86.90 72.77 72.67 81.99 90.67 76.70 47.27 55.36 65.20 51.49 54.28 67.62 75.24 59.91
T:C+S+D 72.40 83.00 90.80 76.59 80.67 90.67 94.99 84.06 50.11 58.83 68.40 54.36 60.48 74.28 83.81 65.90
T:C+S+D+O(Ours) 73.67 84.40 92.87 77.89 82.67 91.33 97.00 85.87 51.42 58.52 69.72 55.60 62.86 75.71 85.24 68.05
Table 1. Evaluation of key components in the proposed framework. ”T” and ”S” in ”w/o-T”, ”w/o-S” and ”w/-S” refer to the ”teacher” stage and the ”student” stage, while the letters after ”T:” denote the key components of the teacher network: ”S”: the co-saliency branch, ”C”: the classification branch, ”D”: the cross-domain simulator, ”O”: the OBC loss. The top three results are highlighted in red, blue and green, respectively.

4.2. Implementation Details

Model. We choose ResNet-50 (He et al., 2016) as our feature backbone, and our baseline is this backbone with a classification branch. Euclidean distance is used as the similarity metric.

Optimization. We implement our proposed method in PyTorch (Paszke et al., 2017). The network is trained with the adaptive moment estimation optimizer, Adam (Kingma and Ba, 2014). The learning rate of the backbone is set to $1\times10^{-4}$ and that of the two branches to $2\times10^{-4}$. The hyper-parameters $\lambda$ and $\mu$ are both set to 0.8. We train the model on a single GPU with batch size 8 for 50K iterations.
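A sketch of this optimization setup follows; the split into three modules and the exact rates 1e-4 / 2e-4 are our reading of the garbled values in the source text:

```python
import torch

def build_optimizer(backbone, cls_branch, cs_branch):
    # Parameter groups mirror Section 4.2: a smaller learning rate for the
    # pretrained ResNet-50 backbone than for the two freshly initialized
    # branches (assumed values, see lead-in).
    return torch.optim.Adam([
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": cls_branch.parameters(), "lr": 2e-4},
        {"params": cs_branch.parameters(), "lr": 2e-4},
    ])
```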

Data arrangement. We take occluded person images as the probes and full-body person images as the galleries. The teacher network is trained only on the full-body person dataset, MARS. In the supervised experiments, the student network is trained on half of an occluded person dataset and tested on the rest; in the unsupervised experiments, the whole occluded person dataset is used for testing without any training in the ”student” stage. The input images are resized to 240×240 and randomly cropped to 224×224 for training. For a fair comparison, all contrast experiments use the same configuration.
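The corresponding training-time preprocessing could be written as follows; the normalization statistics (ImageNet) are an assumption, since the text specifies only the resize and crop sizes:

```python
import torchvision.transforms as T

# Resize to 240x240, then take a random 224x224 crop (Section 4.2).
train_transform = T.Compose([
    T.Resize((240, 240)),
    T.RandomCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```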

Evaluation. Methods are evaluated mainly with the Cumulative Match Characteristic (CMC) (Gray et al., 2007) and mean Average Precision (mAP) (Zheng et al., 2015b), as is standard in person re-id. Besides, we use Precision, Recall and the F-measure score ($\beta^2 = 0.3$) (Achanta et al., 2009) to evaluate the effectiveness of the co-saliency branch.
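For reference, here is a compact sketch of single-shot CMC and mAP computation over a query-gallery distance matrix; protocol details such as camera filtering are omitted, and every query is assumed to have at least one gallery match:

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, topk=(1, 2, 5)):
    """dist: (num_query, num_gallery) distances; q_ids, g_ids: identity labels."""
    hits_at = np.zeros(max(topk))
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                         # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(float)  # 1 where identity matches
        first = int(np.flatnonzero(matches)[0])             # rank of first correct match
        if first < len(hits_at):
            hits_at[first:] += 1                            # CMC counts hits up to each rank
        cum_hits = np.cumsum(matches)
        precision = cum_hits / (np.arange(len(matches)) + 1)
        aps.append(float((precision * matches).sum() / matches.sum()))
    cmc = hits_at / dist.shape[0]
    return {f"rank-{k}": cmc[k - 1] for k in topk}, float(np.mean(aps))
```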

4.3. Ablation Analysis

Comparisons of key components. To validate the effectiveness of the key components in our framework, we compare different configurations on four occluded person re-id datasets, Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ, as displayed in Table 1 and Figure 7. The upper part of Table 1 is the case without the ”student” stage and the lower part is the case with the ”student” stage, which denote the unsupervised and the supervised experiments, respectively. As shown in Table 1, performance is poor without the simulation teaching of the ”teacher” (w/o-T) or with the wrong teaching task (T:S), and teacher networks equipped with different components lead to different performances. In both the unsupervised and the supervised experiments, the teacher network with both the classification branch and the co-saliency branch (T:C+S) performs better than the one with the classifier only (T:C), which indicates that the network improves with the help of the co-saliency branch. In addition, comparing the row T:C+S+D with the row T:C+S illustrates the effectiveness of the cross-domain simulator: after using it, performance improves by a large margin (relative rank-1 improvements of 6.53%, 8.00%, 3.32% and 13.5% without the ”student” stage, and of 5.69%, 11.00%, 6.00% and 11.42% with it). Finally, comparing the row T:C+S+D+O (Ours) with the row T:C+S+D shows that the OBC loss brings a further small gain. In summary, the simulation teaching of the ”teacher” is of great importance, and each key component of our method improves the performance of our framework by degrees.

Figure 4. Visualization of saliency maps for real-world occluded person samples.

Besides, we visualize the saliency maps, generated by average-pooling all feature maps of the last convolution layer in the backbone, for real-world occluded person images. As shown in Figure 4, the samples include different kinds of occlusions, e.g., dynamic obstacles (a. other person; b. mobile vehicle), environmental vegetation (c. surrounding tree; e., f. brushwood) and static obstacles (d., h. buildings; g. appliance). It is obvious that the saliency maps of T:C often focus on large regions beyond the target person's body parts, or even on wrong regions. By contrast, the saliency maps of T:C+S and T:C+S+D pay more attention to the most essential regions, which indicates that the key components effectively boost the performance of our framework. Finally, the visualization combining all the key components (T:C+S+D+O) achieves the best result.

Effectiveness of co-saliency branch. In the ”student” stage, we use the co-saliency branch from the ”teacher” stage as a salient person detector to obtain the saliency ground truth of occluded persons. To evaluate the performance of salient occluded person detection, we compare our co-saliency branch with the salient object detector used in the initial stage and with three state-of-the-art salient object detection methods, DSS (Hou et al., 2017), NLDF (Luo et al., 2017) and PiCA (Liu et al., 2018). As shown in Figure 5, with our co-saliency branch (”T:C+S” and ”T:C+S+D”) only the person body parts are salient and the boundaries between persons and occlusions are clearer, which is superior to the contrast methods. For quantitative analysis, the precision, recall and F-measure scores are listed in Table 2. Our proposal leads in most metrics, which proves that the proposed network also improves salient occluded person detection.

Dataset Occluded-REID Partial-REID
Method Precision Recall F-score Precision Recall F-score
DSS (Hou et al., 2017) 0.76 0.30 0.56 0.72 0.33 0.57
NLDF (Luo et al., 2017) 0.79 0.52 0.71 0.73 0.28 0.53
PiCA (Liu et al., 2018) 0.78 0.25 0.52 0.71 0.28 0.52
T:detector 0.80 0.78 0.80 0.79 0.46 0.68
T:C+S 0.81 0.81 0.81 0.80 0.42 0.66
T:C+S+D 0.83 0.80 0.82 0.81 0.42 0.67
T:C+S+D+O 0.82 0.82 0.82 0.81 0.44 0.68
Table 2. Quantitative comparisons between our co-saliency branch and other salient object detectors.
Figure 5. Samples of existing salient object detectors and our co-saliency branch.
(a) Occluded-REID
(b) Partial-REID
Figure 6. Performance comparisons of the cross-domain simulators with 0, 1 and growing probabilities, respectively. Bar charts denote rank-1 and line charts denote mAP.

Effectiveness of cross-domain simulator. The cross-domain simulator transforms full-body person data into simulated occluded person data with a growing probability during training. To further demonstrate that the growing probability is advantageous, we compare cross-domain simulators with constant probabilities of 0 and 1 and with the growing probability on Occluded-REID and Partial-REID. A constant probability of 0 means all training data are full-body person data without any transformation, while 1 means all data are simulated occluded person data. As can be seen in Table 3 and Figure 6, the cross-domain simulator with the growing probability performs better than those with 0 and 1 in both the unsupervised and the supervised cases. This is reasonable because the growing probability enables the network to learn discriminative and occlusion-robust features via a gradual transformation from full-body person data to simulated occluded person data.

Dataset Occluded-REID Partial-REID
Probability r=1 r=5 mAP r=1 r=5 mAP
w/o-S 0 49.75 70.50 54.66 62.50 85.00 66.98
1 51.00 72.00 55.98 64.99 85.83 69.36
growing 53.00 70.25 57.24 67.50 86.67 72.05
w/-S 0 68.50 86.90 72.77 72.67 90.67 76.70
1 70.00 90.60 74.42 78.67 94.33 82.08
growing 72.40 90.80 76.59 80.67 94.99 84.06
Table 3. Comparisons of the cross-domain simulators with 0, 1 and growing probabilities on rank-1/5 and mAP.
(a) Occluded-REID
(b) Partial-REID
(c) P-DukeMTMC-reID
(d) P-ETHZ
Figure 7. CMC curves of key components. Dashed lines represent performances without the ”student” stage (unsupervised, the left column of the legend) while solid lines represent those with the ”student” stage (supervised, the right column of the legend).
(a) Occluded-REID
(b) Partial-REID
(c) P-DukeMTMC-reID
(d) P-ETHZ
Figure 8. Comparisons with state-of-the-art on CMC curve.

4.4. Comparisons with the State-of-the-art

In this section, we compare our method with mainstream occluded person re-id works and other state-of-the-art person re-id methods on four occluded person re-id datasets, Partial-REID, Occluded-REID, P-DukeMTMC-reID and P-ETHZ. First, we evaluate our proposal in the unsupervised setting against recent occluded person re-id works. As listed in Table 4, our method (Ours) achieves 69.2% rank-1 accuracy on Partial-REID, which is the best performance to date. We also compare our methods (Baseline, Baseline+S, Baseline+S+D, Ours) with three traditional methods (the first group in Table 5) and seven state-of-the-art deep learning methods (the second group in Table 5), as shown in Table 5 and Figure 8. It is evident that our methods and the other deep learning methods are superior to the traditional methods. Besides, our method takes first or second place in almost all categories, showing better performance than the other state-of-the-art methods. On P-DukeMTMC-reID our method does not perform best, because P-DukeMTMC-reID is a large-scale occluded person dataset while our method is more effective on small-scale occluded person datasets. In general, our method shows great superiority on the occluded person re-id problem.

Dataset Type Partial-REID
Method r = 1 r = 5 r = 10
SWM (Zheng et al., 2016b) supervised 24.4 52.3 61.3
AMC (Zheng et al., 2016b) supervised 33.3 52.0 62.0
AMC+SWM (Zheng et al., 2016b) supervised 36.0 60.0 70.7
DSR(Multi-scale) (He et al., 2018) unsupervised 43.0 75.0 76.7
AFPB (Zhuo et al., 2018) unsupervised 51.7 79.2 86.7
SCPNet-baseline (Fan et al., 2018) unsupervised 60.0 78.3 83.7
SCPNet-a (Fan et al., 2018) unsupervised 68.3 80.7 88.3
Ours unsupervised 69.2 85.8 93.3
Table 4. Comparisons with mainstream occluded person re-id methods on Partial-REID.
Dataset Occluded-REID Partial-REID P-DukeMTMC P-ETHZ
Methods r=1 r=5 r=1 r=5 r=1 r=5 r=1 r=5
XQDA (Liao et al., 2015) 36.71 65.11 33.14 66.18 15.93 27.50 44.98 70.88
GOG (Matsukawa et al., 2016) 40.50 63.16 41.92 74.00 17.10 29.27 49.17 79.29
NullSpace (Zhang et al., 2016) 46.47 75.36 37.73 72.12 35.17 53.65 40.16 71.53
DGD (Xiao et al., 2016) 41.43 65.74 56.83 77.70 41.53 60.09 51.23 81.01
SVDNet (Sun et al., 2017) 63.13 85.13 56.05 87.06 43.47 63.41 52.21 78.95
REDA (Zhong et al., 2017) 65.79 87.88 76.19 94.57 45.18 62.88 54.43 79.09
ResNet-mid (Yu et al., 2017) 70.80 88.90 66.00 88.33 54.89 75.24 54.76 71.43
PCB (Sun et al., 2018) 66.60 89.19 69.99 93.67 51.42 68.77 45.24 69.05
AFPB (Zhuo et al., 2018) 68.14 88.29 78.52 94.87 46.15 63.47 58.15 84.61
MLFN (Chang et al., 2018) 64.70 87.70 64.33 87.33 50.95 70.34 57.14 83.33
Baseline (He et al., 2016) 62.70 83.30 66.00 87.99 42.58 63.75 48.33 79.05
Baseline+S 68.50 86.90 72.67 90.67 47.27 65.20 54.28 75.24
Baseline+S+D 72.40 90.80 80.67 94.99 50.11 68.40 60.48 83.81
Ours 73.67 92.87 82.67 97.00 51.42 69.72 62.86 85.24

Table 5. Comparisons with state-of-the-art on rank-1/5.

5. Conclusions

In this work, we propose a teacher-student learning framework for occluded person re-identification. To address the limitation of inadequate occluded person data, the teacher network makes use of large-scale full-body person data to simulate the occluded person re-id task. Supported by the co-saliency network and the cross-domain simulator, the teacher network trains a basic model for the student network. The student network then trains a more occlusion-robust model on real-world occluded person data. Experimental results on four public occluded person re-id datasets demonstrate the effectiveness and superiority of our framework.

References

  • Achanta et al. (2009) Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. 2009. Frequency-tuned salient region detection. In CVPR. 1597–1604.
  • Chang et al. (2018) Xiaobin Chang, Timothy M Hospedales, and Tao Xiang. 2018. Multi-level factorisation net for person re-identification. In CVPR. 2109–2118.
  • Chen et al. (2017) Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. 2017. Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR. 403–412.
  • Cheng et al. (2016) De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. 2016. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In CVPR. 1335–1344.
  • Ess et al. (2008) Andreas Ess, Bastian Leibe, Konrad Schindler, and Luc Van Gool. 2008. A mobile vision system for robust multi-person tracking. In CVPR. 1–8.
  • Fan et al. (2018) Xing Fan, Hao Luo, Xuan Zhang, Lingxiao He, Chi Zhang, and Wei Jiang. 2018. SCPNet: Spatial-Channel Parallelism Network for Joint Holistic and Partial Person Re-Identification. arXiv preprint arXiv:1810.06996 (2018).
  • Gray et al. (2007) Douglas Gray, Shane Brennan, and Hai Tao. 2007. Evaluating appearance models for recognition, reacquisition, and tracking. In PETS, Vol. 3. Citeseer, 1–7.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • He et al. (2018) Lingxiao He, Jian Liang, Haiqing Li, and Zhenan Sun. 2018. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In CVPR. 7073–7082.
  • Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
  • Hou et al. (2017) Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip HS Torr. 2017. Deeply supervised salient object detection with short connections. In CVPR. 3203–3212.
  • Kalayeh et al. (2018) Mahdi M Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E Kamasak, and Mubarak Shah. 2018. Human semantic parsing for person re-identification. In CVPR. 1062–1071.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Li et al. (2017) Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. 2017. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR. 384–393.
  • Liao et al. (2015) Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. 2015. Person re-identification by local maximal occurrence representation and metric learning. In CVPR. 2197–2206.
  • Liu et al. (2018) Nian Liu, Junwei Han, and Ming-Hsuan Yang. 2018. Picanet: Learning pixel-wise contextual attention for saliency detection. In CVPR. 3089–3098.
  • Liu et al. (2017) Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan, and Xiaogang Wang. 2017. Hydraplus-net: Attentive deep features for pedestrian analysis. In ICCV. 350–359.
  • Luo et al. (2017) Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. 2017. Non-local deep features for salient object detection. In CVPR. 6609–6617.
  • Matsukawa et al. (2016) Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. 2016. Hierarchical gaussian descriptor for person re-identification. In CVPR. 1363–1372.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
  • Ristani et al. (2016) Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV. Springer, 17–35.
  • Sun et al. (2017) Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. 2017. SVDNet for pedestrian retrieval. In ICCV. 3800–3808.
  • Sun et al. (2018) Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV. 480–496.
  • Varior et al. (2016) Rahul Rama Varior, Mrinal Haloi, and Gang Wang. 2016. Gated siamese convolutional neural network architecture for human re-identification. In ECCV. 791–808.
  • Xiao et al. (2016) Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. 2016. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR. 1249–1258.
  • Yu et al. (2017) Qian Yu, Xiaobin Chang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. 2017. The Devil is in the Middle: Exploiting Mid-level Representations for Cross-Domain Instance Matching. arXiv preprint arXiv:1711.08106 (2017).
  • Zhang et al. (2016) Li Zhang, Tao Xiang, and Shaogang Gong. 2016. Learning a discriminative null space for person re-identification. In CVPR. 1239–1248.
  • Zhang et al. (2018) Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. 2018. Occlusion-aware R-CNN: detecting pedestrians in a crowd. In ECCV. 637–653.
  • Zheng et al. (2016a) Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. 2016a. MARS: A Video Benchmark for Large-Scale Person Re-identification. In ECCV. Springer.
  • Zheng et al. (2015a) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015a. Scalable person re-identification: A benchmark. In ICCV. 1116–1124.
  • Zheng et al. (2015b) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015b. Scalable person re-identification: A benchmark. In ICCV. 1116–1124.
  • Zheng et al. (2016b) Wei-Shi Zheng, Xiang Li, Tao Xiang, Shengcai Liao, Jianhuang Lai, and Shaogang Gong. 2016b. Partial Person Re-Identification. In ICCV.
  • Zhong et al. (2017) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2017. Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017).
  • Zhou and Yuan (2018) Chunluan Zhou and Junsong Yuan. 2018. Bi-box Regression for Pedestrian Detection and Occlusion Estimation. In ECCV. 135–151.
  • Zhu et al. (2018) Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. 2018. Distractor-aware siamese networks for visual object tracking. In ECCV. 101–117.
  • Zhuo et al. (2018) Jiaxuan Zhuo, Zeyu Chen, Jianhuang Lai, and Guangcong Wang. 2018. Occluded Person Re-identification. In ICME.