Dissecting Person Re-identification from the Viewpoint of Viewpoint

12/05/2018 ∙ by Xiaoxiao Sun, et al. ∙ Australian National University

In person re-identification (re-ID), we usually attribute the challenges of this task to variations in visual factors such as viewpoint, pose, illumination and background. Despite acknowledging these factors to be influential, quantitative studies on how they affect a re-ID system are still lacking. To gain insights in this scientific campaign, this paper makes an early attempt in studying a particular factor, viewpoint. We narrow the viewpoint problem down to the pedestrian rotation angle to obtain focused conclusions. In this regard, this paper makes two contributions to the community. First, we introduce a large-scale synthetic data engine, PersonX. Composed of hand-crafted 3D person models, the salient characteristic of this engine is that it is controllable: we are able to synthesize pedestrians by setting the visual variables to arbitrary values. Second, on the 3D data engine, we quantitatively analyze the influence of the pedestrian rotation angle on re-ID accuracy. Comprehensively, the person rotation angles are precisely customized from 0° to 360°, allowing us to investigate their effect on the training, query, and gallery sets. Extensive experiments help us gain a deeper understanding of the fundamental problems in person re-ID. Our research also provides beneficial insights for dataset building and future practical usage, e.g., a person captured from a side view makes a better query.


1 Introduction

Viewpoint, pose, illumination, background and resolution are a few visual factors that are generally considered important influences on a person re-identification (re-ID) system. In the community, major endeavors are devoted to algorithm design that mitigates their impact on the system and thus improves recognition accuracy. As a result, despite these factors being qualitatively acknowledged as influential, it remains largely unknown how they affect performance quantitatively.

Figure 1: Definition of viewpoint. (a) Illustration of viewpoint from a bird's-eye view. In this paper, viewpoint is defined as the rotation angle of the person relative to the 0° reference. The field of view (FoV) is the observable extent of the camera. (b) Examples of different viewpoints.

In this paper, we study one of the most important factors, i.e., viewpoint. Here, we denote viewpoint as the pedestrian rotation angle (Fig. 1). In what follows, we use viewpoint to mean the pedestrian rotation angle unless specified otherwise. Since different views of a person contain different details, the viewpoint of a person influences the underlying visual data of an image, and thus the learning algorithm. Therefore, we aim to investigate the exact influence of viewpoint on the system. This study will benefit the community in two aspects. (1) The conclusions of this research can guide the effective construction of training sets, for example, by revealing which angles are more important for learning a model that identifies pedestrians. (2) It offers advice for designing query and gallery sets. By discovering which viewpoints are effective for re-ID accuracy, our research can potentially benefit the practical usage of re-ID systems.

In our attempt to reveal the influence of viewpoint, the most notable obstacle is the lack of data. On one hand, existing datasets have an imbalanced distribution of visual factors. In the case of pedestrian viewpoint, for example, some angles might have only a few or even zero images. In another example, if illumination is our subject, a dataset might cover only a specific illumination condition. Therefore, the imbalanced data distribution does not support a comprehensive study of a visual factor. On the other hand, existing datasets are fixed / static. This inflexibility prevents us from exploring how the impact of viewpoint relates to other visual factors. For example, the impact of viewpoint could be conditioned on the background, because the background also affects feature learning. To fully understand the role of viewpoint, we need to test its influence while changing the environment, i.e., making it harder or easier. In brief, without a balanced and flexible dataset, we cannot make an objective and comprehensive judgment of a visual factor's significance to the system.

Accordingly, this paper makes two major contributions to the community. First, we build a large-scale synthetic data engine named PersonX for controllable data collection. PersonX contains 1,266 hand-designed identities and a set of editable visual variables, so as to simulate pedestrians in surveillance videos. In the first place, we show that existing re-ID approaches obtain reasonable accuracy on PersonX, validating that its underlying data distribution is similarly indicative to that of real-world datasets. More importantly, as the name implies, the most important characteristic of the PersonX engine is that it is controllable. In this engine, persons can take arbitrary poses and viewpoints, and the environment can also be controlled w.r.t. illumination, background, etc. The persons can move by running, walking, etc., under the controlled camera view and scenery. We are able to obtain exact person bounding boxes without external detection tools and thus avoid the influence of detection errors on our study. Therefore, PersonX is indicative, flexible and extendable. It will support future research not only in algorithm design, but also in scientifically understanding how various factors affect the re-ID system.

Second, we dissect the person re-ID system by quantitatively understanding the role of pedestrian viewpoint. Three questions are considered. (1) Given a fixed testing set, how does the viewpoint distribution of the training set influence the system? (2) Given a fixed training set, how does the re-ID accuracy change under different viewpoint distributions of the testing set? (3) Given a fixed training set, how does the query viewpoint influence the retrieval? To answer these questions, we perform rigorous quantification on pedestrian images w.r.t. controlled viewpoints. In our experiments, we customize the viewpoints of persons in the PersonX engine from 0° to 360°. Control groups and experimental groups are defined to quantify the impact of viewpoint, so as to obtain scientific insights into the topic. Moreover, we perform an empirical study on the real-world Market-1203 dataset, in which the viewpoints of persons are labeled. We find that the empirical results are consistent with our findings on the synthetic data.

2 Related Work

We first review re-ID methods that improve robustness against variations in pose, background, resolution and viewpoint. We then review methods based on synthetic data.

Against pose variance. To reduce the influence of pose variations, some works [7, 29, 6, 21, 19] propose to learn pose-invariant representations of pedestrians. For example, Farenzena et al. [7] exploit two axes that depend on the pose of the body to learn a pose-invariant feature description. Cho et al. [6] use the estimation of target poses to conduct multi-shot matching and reduce the influence of diverse poses. Zheng et al. [29] design a PoseBox Fusion Network to learn a pose-invariant pedestrian descriptor based on PoseBox and a joint learning strategy.

Against background variance. Several works address reducing the influence of the background [4, 26, 5, 20, 23]. For instance, to suppress background interference, Chen et al. [5] learn from the foreground person and the original image individually. In contrast, Song et al. [20] argue that the background can provide useful information, so they first segment the body and then learn representations from the body and background regions of the image, respectively.

Against resolution variance. Resolution affects the level of detail in the image content, and thus the discriminative power of features learned from images of different resolutions. To address this problem, Jing et al. [11] propose a mapping function between high-resolution gallery images and low-resolution probe images to reduce the influence of noise. Li et al. [15] jointly learn from multiple scales to obtain a shared subspace across different scales. Wang et al. [24] fuse embeddings from lower layers of the network, which retain higher resolution, with embeddings from higher layers to learn resolution-invariant features.

Against viewpoint variance. Research on reducing the influence of viewpoint variations [9, 25, 2, 12, 27] also focuses on learning invariant features. For example, Gray et al. [9] design an ensemble of localized features to reduce the influence of viewpoint changes on the person representation. Karanam et al. [12] learn a single viewpoint-invariant dictionary to represent both gallery and probe images, training the dictionary by enforcing explicit constraints on the associated sparse representations of the feature vectors.

Figure 2: Illustration of the PersonX dataset. (a): Image backgrounds. In each scene, a person can move in any direction, thus generating any viewpoint relative to the camera. (1)-(3) are pure-color backgrounds and (4)-(6) use real scenes as the background. (b): Sample pedestrian bounding boxes in background (4), showing various persons wearing various clothes.

Learning from synthetic data. Modern computer vision research often requires expensive data acquisition and accurate manual labeling. To address this problem, some works use 3D synthetic data to assist computer vision tasks such as semantic segmentation [18], object tracking [8] and traffic vision research [14]. For person re-ID, SOMAset [3] is a synthetic dataset containing 50 characters and 11 types of outfit. Barbosa et al. [3] build SOMAset to train the SOMAnet network and find that SOMAnet can identify people even if they change apparel between cameras. Bak et al. [1] also create a synthetic dataset, SyRI, including 100 characters, to address the limited variety of lighting conditions, and use domain adaptation techniques to assist model training on new datasets. In this paper, we exploit the controllability of synthetic data to analyze how visual factors influence the identification system.

3 PersonX: A Flexible Person Engine

3.1 Description

Software. The PersonX engine is built on Unity [16]. The PersonX data engine, including pedestrian models, scene assets, project and script files, etc., will be released at https://github.com/sxzrt/Dissecting-Person-Re-identification-from-the-Viewpoint-of-Viewpoint.git. We create a 3D controllable world containing 1,266 well-designed persons. Being a configurable system, it can satisfy various data requirements. In PersonX, the characters and objects are realistic, because their textures and materials are mapped and generated from the real world by scanning real people and objects. All visual variables are designed to be editable, e.g., illumination, scenery and background. Therefore, PersonX is highly flexible and extendable.

Identities. PersonX has 1,266 hand-crafted identities, including females and males. To ensure diversity, we hand-craft the human models with different skin colors, ages, body forms (height and weight), hairstyles, etc. The clothes of these identities include jeans, pants, shorts, slacks, skirts, T-shirts, dress shirts, maxiskirts, etc., and some identities have a backpack, shoulder bag, glasses or hat. The materials of the clothes (color and texture) are mapped from images of real clothes. The motion of these characters may be walking, running, idling (standing), having a dialogue, etc. Therefore, the characters in PersonX are similar to real people. Figure 2 (b) presents examples of identities of various ages, clothes, body shapes and poses.

Illumination. Illumination can be directional light (sunlight), point light, spotlight, area light, etc. Parameters like color and intensity can be modified for each illumination type. By editing the values of these terms, various kinds of illumination environment can be created.

Camera and scene. Cameras in PersonX are subject to different configurations of resolution, projection, focal length, and height. Scenery can be manually selected or designed.
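To make "controllable" concrete, the sketch below groups the editable variables described in this section into a single configuration object. This is an illustrative Python sketch only; the class and field names (SceneConfig, rotation_deg, the example focal length and mounting height, etc.) are ours and do not correspond to the actual Unity scripts of the engine.

```python
from dataclasses import dataclass, field

# Illustrative grouping of the editable variables; names are ours, not the
# engine's actual Unity API.

@dataclass
class LightConfig:
    kind: str = "directional"          # directional (sun), point, spot, area
    color: tuple = (1.0, 1.0, 1.0)     # RGB
    intensity: float = 1.0

@dataclass
class CameraConfig:
    resolution: tuple = (1024, 768)
    focal_length_mm: float = 35.0      # illustrative value
    height_m: float = 2.5              # illustrative mounting height
    projection: str = "perspective"

@dataclass
class SceneConfig:
    background_id: int = 4             # one of the six backgrounds in Fig. 2
    light: LightConfig = field(default_factory=LightConfig)
    camera: CameraConfig = field(default_factory=CameraConfig)

@dataclass
class PersonConfig:
    identity_id: int = 1               # 1..1266
    rotation_deg: int = 0              # viewpoint, 0..350 in steps of 10
    motion: str = "walking"            # walking, running, idling, dialogue

def render_bounding_box(scene: SceneConfig, person: PersonConfig):
    """Placeholder for the engine call that renders one pedestrian bounding box."""
    raise NotImplementedError("stands in for the actual Unity-side rendering")
```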

Figure 3: Sample images of the same person with different viewpoints. Images are sampled at an interval of 20 degrees. Left (330°-50°), front (60°-140°), right (150°-230°) and back (240°-320°) represent the four main orientations.

3.2 Subsets of PersonX

This paper attempts to understand person re-ID from the factor of viewpoint. To this end, from the PersonX engine, we construct the PersonX dataset composed of discrete viewpoints of each identity. For simplicity, our system has two camera views, i.e., two backgrounds. The details of PersonX are described below.

Background. We configure six different backgrounds and in each experiment use two or three of them. Within a camera's field of view, persons can move freely in any direction, so they can exhibit any viewpoint to the camera. Figure 2 illustrates the six backgrounds used in our experiments. Specifically, backgrounds 4, 5 and 6 depict three different street scenes; among them, backgrounds 4 and 5 share the same illumination and ground color, whereas background 6 is a shadowed region with gray ground. Meanwhile, backgrounds 1, 2 and 3 are pure-color backgrounds used for fair comparison. Because we simplify our system to two cameras, we use several combinations of these six backgrounds to provide different re-ID environments. One subset contains two cameras facing the light-colored backgrounds 1 and 2. For comparison, a second subset uses cameras facing backgrounds 1 and 3, where the difference between the two backgrounds is more apparent. On the other hand, a subset with cameras facing backgrounds 4 and 5 is used to study viewpoint with little interference from other visual factors, and a subset with cameras facing backgrounds 4 and 6 provides a harder setting. All cameras have a high resolution of 1024×768.

Viewpoint. Figure 3 presents some image examples under various viewpoints. During the rotation, people walk naturally in the scene, and the viewpoints are captured at different frames. The viewpoint of a person is sampled every 10° from 0° to 350° (36 different angles in total). Each angle has one image, so each person has 36 images under each camera. For the full PersonX dataset, we therefore have 36 (angles) × 1,266 (identities) × 6 (cameras) = 273,456 images in the oracle setting. For one pedestrian, the 36 images can be divided into 4 groups representing the left, front, right and back sides of the person.
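The sampling scheme is simple to reproduce. The following sketch (ours; the orientation_of helper is an assumed name) enumerates the 36 sampled angles, maps them to the four orientations of Fig. 3, and checks the reported image count.

```python
from collections import Counter

# Viewpoints are sampled every 10 degrees: 0, 10, ..., 350.
ANGLES = list(range(0, 360, 10))
assert len(ANGLES) == 36

def orientation_of(angle: int) -> str:
    """Map a rotation angle to one of the four coarse orientations of Fig. 3."""
    if angle >= 330 or angle <= 50:
        return "left"
    if 60 <= angle <= 140:
        return "front"
    if 150 <= angle <= 230:
        return "right"
    return "back"                      # 240-320

# Each orientation covers 9 of the 36 sampled angles.
assert Counter(orientation_of(a) for a in ANGLES) == Counter(
    {"left": 9, "front": 9, "right": 9, "back": 9})

# Full PersonX: 36 angles x 1,266 identities x 6 cameras.
assert 36 * 1266 * 6 == 273_456
```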

A comparison of our datasets with some existing re-ID datasets is presented in Table 1. Compared with the two existing synthetic datasets, SyRI and SOMAset, PersonX has many more identities and more appropriately defined camera views. Furthermore, the data generation mechanisms of these two datasets cannot support the study of viewpoint. For example, SyRI is designed purely for domain adaptation and does not clearly define multiple cameras, so it cannot generate quantitatively controlled viewpoint data. Although SOMAset contains 250 cameras, these cameras uniformly span a hemisphere centered on the person, who remains stationary. This setup does not conform to practical conditions, i.e., pedestrians moving within the field of view of a camera, and it introduces changes of background. Different from SOMAset, we fix the position of the camera and let the person move within the camera's FoV, which is consistent with surveillance systems in practice.

 

| Datasets | #Identity | #Box | #Cam. | View |
| Real data: Market-1501 [30] | 1,501 | 32,668 | 6 | N |
| Real data: Market-1203 [30] | 1,203 | 8,569 | 2 | Y |
| Real data: MARS [28] | 1,261 | 1,191,003 | 6 | N |
| Real data: CUHK03 [13] | 1,467 | 14,096 | 2 | N |
| Real data: Duke [17] | 1,404 | 36,411 | 8 | N |
| Synthetic data: SOMAset [3] | 50 | 100,000 | 250 | N |
| Synthetic data: SyRI [1] | 100 | 1,680,000 | – | N |
| Synthetic data: PersonX (full) | 1,266 | 273,456 | 6 | Y |
| Synthetic data: PersonX (3-camera subset) | 1,266 | 136,728 | 3 | Y |
| Synthetic data: PersonX (2-camera subset) | 1,266 | 91,152 | 2 | Y |
| Synthetic data: PersonX (2-camera subset) | 1,266 | 91,152 | 2 | Y |

 

Table 1: Comparison of real and synthetic re-ID datasets. “View” denotes whether the dataset has viewpoint labels.

4 Benchmarking and Dataset Validation

Through benchmarking, we aim to validate that PersonX is as indicative as real-world datasets, and as such, conclusions derived from this dataset can be of value to practice.

4.1 Methods and Dataset Settings

We use IDE+ [32], the triplet loss [10] and PCB [22] to validate the datasets. The baseline IDE+ is implemented with ResNet-50. During training, the batch size is set to 64 and the model is trained for 50 epochs with an initial learning rate of 0.1, which decays to 0.01 after 40 epochs. The model parameters are initialized from a model pre-trained on ImageNet. For the triplet loss, the number of identities per batch is set to 32 and the number of images per identity to 4, so the batch size is 32 × 4 = 128. The learning rate decays after 150 epochs (the model is trained for 300 epochs). PCB uses its original setup [22].
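For quick reference, the training settings above can be summarized as the following configuration sketch. The dictionary layout and key names are ours; the initial learning rate of the triplet-loss baseline is not stated here, so it is left unspecified.

```python
# Training settings of Sec. 4.1, written out as plain dictionaries (keys are ours).
IDE_PLUS = {
    "backbone": "ResNet-50, ImageNet pre-trained",
    "batch_size": 64,
    "epochs": 50,
    "initial_lr": 0.1,
    "lr_schedule": {"decay_at_epoch": 40, "new_lr": 0.01},
}

TRIPLET = {
    "identities_per_batch": 32,
    "images_per_identity": 4,
    "batch_size": 32 * 4,              # = 128
    "epochs": 300,
    "lr_decay_start_epoch": 150,
    # The initial learning rate is not stated above; see [10] for the baseline's value.
}

# PCB keeps the original setup of [22].
```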

As mentioned in Section 3, PersonX contains three pure-color backgrounds and three scene backgrounds. To evaluate different situations, we use two combinations of the scenes: the subset with backgrounds 4 and 5, and the subset with backgrounds 4 and 6. Because backgrounds 4 and 5 are similar while backgrounds 4 and 6 are dissimilar, these two subsets represent the easy setting and the hard setting, respectively. The subsets with pure-color backgrounds are constructed analogously. Meanwhile, we lower the camera resolution from 1024×768 to 512×242 to test the sensitivity of the PersonX engine to added interference, and call the result the PersonX-lr (low resolution) dataset. We randomly sample 410 identities for training and 856 identities for testing. For each camera, each identity has 36 images, and one image per camera is chosen as the query image during testing. Accordingly, the three-camera subset contains 44,280 (410×36×3) training and 92,448 (856×36×3) testing images, and the two-camera subsets contain 29,520 (410×36×2) training and 61,632 (856×36×2) testing images each.
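The split described above can be sketched as follows. The data layout (a mapping from identity and camera to 36 images) and the choice to keep non-query images only in the gallery are our assumptions for illustration; the sanity checks merely reproduce the reported image counts.

```python
import random

# Assumed layout: test_images maps (identity, camera) -> list of 36 images.
def make_query_gallery(test_images: dict):
    """Pick one query image per identity and camera; the rest form the gallery.
    Whether queries also stay in the gallery is not specified above; here they do not."""
    query, gallery = [], []
    for (pid, cam), imgs in test_images.items():
        q = random.choice(imgs)
        query.append(q)
        gallery.extend(img for img in imgs if img is not q)
    return query, gallery

# Sanity checks of the reported split sizes (410 train / 856 test identities,
# 36 images per identity per camera).
assert 410 * 36 * 3 == 44_280 and 856 * 36 * 3 == 92_448   # three-camera subset
assert 410 * 36 * 2 == 29_520 and 856 * 36 * 2 == 61_632   # two-camera subsets
```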

 

| # | Dataset | IDE+ (mAP / R1) | Triplet Loss (mAP / R1) | PCB (mAP / R1) |
| 1 | Market-1501 | 67.3 / 86.5 | 68.1 / 85.6 | 77.4 / 92.3 |
| 2 | Market-1203 | 66.7 / 71.0 | 72.8 / 75.6 | 77.8 / 81.4 |
| 3 | DukeMTMC-reID | 55.1 / 74.2 | 57.9 / 76.3 | 66.1 / 81.7 |
| 4 | PersonX | 94.5 / 99.2 | 95.3 / 98.6 | 97.9 / 99.6 |
| 5 | PersonX | 94.8 / 99.5 | 95.6 / 99.0 | 97.8 / 99.6 |
| 6 | PersonX | 94.6 / 99.1 | 94.6 / 98.6 | 97.6 / 99.4 |
| 7 | PersonX | 92.7 / 98.5 | 94.3 / 98.5 | 97.8 / 99.7 |
| 8 | PersonX | 94.0 / 99.4 | 95.0 / 99.1 | 97.7 / 99.8 |
| 9 | PersonX | 91.5 / 96.3 | 92.8 / 97.0 | 96.8 / 99.2 |
| 10 | PersonX-lr | 87.9 / 96.7 | 89.8 / 95.6 | 94.9 / 98.4 |
| 11 | PersonX-lr | 90.2 / 98.4 | 90.9 / 97.5 | 95.1 / 98.9 |
| 12 | PersonX-lr | 85.9 / 94.3 | 86.2 / 93.2 | 92.6 / 96.6 |

 

Table 2: Benchmarking and validation of the PersonX dataset. We use mAP and rank-1 (R1) accuracy for measurement. "lr" denotes low-resolution images (512×242) relative to the initial resolution of 1024×768. This table validates the eligibility, purity and sensitivity of PersonX as a flexible re-ID engine.

4.2 System Validation

We apply the three methods on real-world and synthetic datasets. For real-world datasets, we follow the standard protocols in the respective papers [30, 13, 17, 31] and evaluate on the Market-1203 dataset. Results are reported in Table 2. Based on the results, we observe three characteristics of PersonX as follows.

1) Eligibility. PersonX can reflect the relative performance of algorithms just like the real-world datasets. The real-world experiments tell us that PCB has the best accuracy and that the performance of IDE+ and the triplet loss is close, i.e., PCB > triplet ≈ IDE+, which is consistent with findings in [22]. On synthetic data, the trend is similar: PCB is usually 2%-3% higher than IDE+ and the triplet loss.

2) Purity. The synthetic subsets (e.g., those with pure-color backgrounds) have a high "purity" and serve as oracle datasets for the viewpoint study. These datasets are high-resolution, have normal and consistent sunlight, relatively consistent backgrounds, and uniformly distributed viewpoints, so the performance of the three methods is relatively high. In other words, these datasets are nearly free of environmental influence.

3) Sensitivity. The synthetic engine responds sensitively to changes in the environment, so it is suitable for studying the effect of visual factors. For example, datasets #8 and #9 show the re-ID results under the easy and hard settings, respectively. We find that the rank-1 accuracy and mAP of the three methods are consistently (1%-3%) lower under the hard setting. The same phenomenon can be observed for datasets #5 and #9, where the only difference between the two settings is the background color. Furthermore, datasets #10, #11 and #12 are edited from datasets #7, #8 and #9, respectively, by lowering the image resolution. We observe that the re-ID accuracy on datasets #10, #11 and #12 is significantly lower than on datasets #7, #8 and #9. Therefore, the designed synthetic system reacts sensitively to environment changes.

The above discussions indicate that PersonX is eligible for evaluating person re-ID, has strictly controlled environment variables, and is sensitive to environmental changes (which is good because it resembles the real world). We believe the PersonX system will provide a useful tool for the community and encourage further development of robust algorithms and deep research.

Figure 4: Illustration of control group* (CG*) and experimental group (EG) when removing 1/2 of the viewpoints. CG* means the training set contains 1/2 of the viewpoints: 18 viewpoints (white), randomly chosen from 0°-350°, are knocked out. EG denotes that 18 continuous viewpoints (white) are deleted.

 

| Data | Group | Viewpoint / Direction | Img# | PersonX (mAP / R1) | PersonX (mAP / R1) | PersonX (mAP / R1) | PersonX (mAP / R1) | Market-1203 Img# | Market-1203 (mAP / R1) |
| 1 | Oracle | all 36 viewpoints | 36 | 97.8 / 99.6 | 97.6 / 99.4 | 97.7 / 99.8 | 96.8 / 99.2 | 1-8 | 77.8 / 81.4 |
| 3/4 | CG | – | 27 | 97.8 / 99.6 | 97.2 / 99.5 | 97.6 / 99.9 | 96.8 / 98.8 | 1-6 | 72.8 / 75.9 |
| 3/4 | CG* | reserve 27 angles | 27 | 97.7 / 99.7 | 97.2 / 99.5 | 97.4 / 99.9 | 96.4 / 98.8 | 1-6* | 69.0 / 73.3 |
| 3/4 | EG | delete left | 27 | 97.4 / 99.6 | 97.1 / 99.4 | 96.9 / 99.9 | 96.2 / 98.6 | 1-6* | 66.6 / 69.4 |
| 3/4 | EG | delete front | 27 | 97.2 / 99.6 | 96.7 / 99.5 | 96.6 / 99.9 | 96.1 / 98.9 | 1-6* | 62.1 / 65.4 |
| 3/4 | EG | delete right | 27 | 97.5 / 99.5 | 97.3 / 99.2 | 97.3 / 99.9 | 96.4 / 98.8 | 1-6* | 64.0 / 97.8 |
| 3/4 | EG | delete back | 27 | 97.1 / 99.5 | 96.8 / 99.4 | 96.7 / 100 | 96.1 / 98.6 | 1-6* | 59.0 / 62.6 |
| 2/4 | CG | – | 18 | 97.6 / 99.6 | 97.0 / 99.4 | 97.0 / 99.9 | 96.3 / 98.7 | 1-4 | 50.2 / 53.9 |
| 2/4 | CG* | reserve 18 angles | 18 | 97.3 / 99.6 | 96.8 / 99.3 | 96.8 / 99.8 | 96.1 / 98.8 | 1-4* | 51.3 / 55.1 |
| 2/4 | EG | left+right | 18 | 96.2 / 99.5 | 95.7 / 99.4 | 95.5 / 99.9 | 95.3 / 98.5 | 1-4* | 44.4 / 47.9 |
| 2/4 | EG | front+back | 18 | 95.4 / 99.5 | 94.5 / 99.2 | 94.8 / 99.6 | 94.2 / 98.1 | 1-4* | 42.2 / 45.2 |
| 2/4 | EG | front+right (left) | 18 | 96.4 / 99.5 | 96.0 / 99.2 | 95.9 / 99.8 | 95.1 / 98.2 | 1-4* | 41.0 / 45.4 |
| 2/4 | EG | back+right (left) | 18 | 96.4 / 99.6 | 95.8 / 99.3 | 95.4 / 99.8 | 95.0 / 98.4 | 1-4* | 45.0 / 45.7 |
| 1/4 | CG | – | 9 | 97.1 / 99.6 | 96.3 / 99.4 | 96.5 / 99.6 | 95.3 / 98.1 | 1-2 | – |
| 1/4 | CG* | reserve 9 angles | 9 | 95.7 / 99.5 | 95.1 / 99.2 | 95.1 / 99.6 | 94.3 / 98.1 | 1-2* | – |
| 1/4 | EG | left | 9 | 94.7 / 99.6 | 93.7 / 99.1 | 93.9 / 99.6 | 93.2 / 98.1 | 1-2* | – |
| 1/4 | EG | front | 9 | 87.7 / 99.4 | 86.5 / 98.8 | 87.0 / 99.5 | 86.6 / 97.4 | 1-2* | – |
| 1/4 | EG | right | 9 | 94.3 / 99.5 | 93.6 / 95.8 | 93.7 / 99.7 | 93.1 / 97.9 | 1-2* | – |
| 1/4 | EG | back | 9 | 88.1 / 99.4 | 85.9 / 98.6 | 87.8 / 99.3 | 87.4 / 96.3 | 1-2* | – |

 

Table 3: Performance using different viewpoints in the training set. "CG" and "EG" denote the control group and experimental group, respectively. Row #1, Oracle, means all 36 viewpoints are contained in the training data. 3/4, 2/4 and 1/4 denote the ratio of used training data as a fraction of the Oracle training set. "3/4 CG" means 3/4 of the training images of each identity are randomly selected, so the training data still contains all 36 viewpoints. "3/4 CG*" means the training set contains all the identities but only 27 viewpoints. "3/4 EG" means deleting either the left, front, right or back viewpoints (9 continuous viewpoints) for each identity. "Viewpoint / Direction" shows the viewpoints contained in or removed from the training set. "Img#" is the number of training images for each identity.

5 Evaluation of Viewpoint

This section evaluates the impact of viewpoint on person re-ID. Our experiment is based on PCB [22], a state-of-the-art approach. We note that other standard re-ID baselines (e.g., IDE+) can draw similar conclusions. Three questions will be investigated. How does the viewpoint in 1) the training set, 2) the query set, and 3) the gallery set affect re-ID?

5.1 Viewpoint in Training Set

Q: How does the viewpoint in the training set affect learning? Initially, the datasets are in the oracle state, containing all viewpoints of the training and testing identities, i.e., 36 images per identity under each camera. We then selectively remove some viewpoints from the training set to analyze the influence of missing viewpoints. For comparison, we also randomly remove the same amount of training data per identity as a benchmark. Specifically, the settings for removing viewpoints from the training set are as follows. (1) Control group (CG): randomly selecting 3/4, 2/4 or 1/4 of the images of each identity. This experiment is the benchmark for reducing training data. (2) Control group* (CG*): randomly knocking out 1/4, 2/4 or 3/4 of the 36 viewpoints, the same viewpoints for all identities. Figure 4 illustrates the two settings of removing viewpoints. (3) Experimental group (EG): selectively deleting continuous viewpoints (1/4, 2/4 or 3/4 of the 36 viewpoints) of each identity during training. In this group, the selected continuous viewpoints correspond to the left, front, right and back sides of the person shown in Fig. 3. Note that all the above settings use the same number of training images, and the viewpoints of the testing set are uniformly distributed. CG* results are averaged over 5 repetitions of the experiment.
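For clarity, the three protocols can be sketched as follows. The function names and the per-identity data layout (a mapping from angle to images) are ours; the logic mirrors the definitions of CG, CG* and EG above.

```python
import random

ANGLES = list(range(0, 360, 10))                  # the 36 viewpoints
SIDES = {                                         # orientations of Fig. 3
    "left":  [a for a in ANGLES if a >= 330 or a <= 50],
    "front": list(range(60, 150, 10)),
    "right": list(range(150, 240, 10)),
    "back":  list(range(240, 330, 10)),
}

def cg(images_by_angle: dict, keep_ratio: float) -> list:
    """CG: randomly keep a fraction of this identity's images (any viewpoint may survive)."""
    imgs = [img for a in ANGLES for img in images_by_angle[a]]
    return random.sample(imgs, round(keep_ratio * len(imgs)))

def cg_star(images_by_angle: dict, kept_angles: list) -> list:
    """CG*: keep only a random subset of the 36 angles; kept_angles is sampled once
    (e.g., random.sample(ANGLES, 18) for the 2/4 setting) and shared by all identities."""
    return [img for a in kept_angles for img in images_by_angle[a]]

def eg(images_by_angle: dict, delete_sides: list) -> list:
    """EG: delete whole orientations, i.e., blocks of 9 consecutive viewpoints."""
    removed = {a for s in delete_sides for a in SIDES[s]}
    return [img for a in ANGLES if a not in removed for img in images_by_angle[a]]
```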

Table 3 shows the experimental results of training on 3/4, 2/4 and 1/4 of the training set. Compared with Oracle, the mAP and R1 of CG are similar or only slightly reduced: CG still covers all kinds of viewpoints, so the model can learn enough information from the various viewpoints. After deleting some specific viewpoints in CG*, the decline becomes noticeable compared with CG; e.g., the mAP of 1/4 CG* falls by 1.0%-1.4% relative to CG. By contrast, the results drop by a large margin on all four datasets in EG (e.g., a decrease of 1%-8% for 1/4 EG on PersonX). This reveals that continuous viewpoints, such as the left or back side, contain different details of the body, so the lack of one side reduces the information the model can learn and thus the identification ability of the system. Furthermore, the discrepancies gradually increase as the training data decreases. A detailed comparison of CG, CG* and EG on two PersonX subsets is shown in Fig. 5, with results from models trained on different numbers of viewpoints. Evidently, the drop in mAP for EG is more noticeable than for CG and CG*, so knocking out continuous viewpoints harms training more than deleting the same amount of data randomly or deleting discontinuous viewpoints.

Figure 5: Comparison of results from the experimental and control groups on two PersonX subsets, shown in (a) and (b). The horizontal axis is the ratio of training data and the vertical axis is the mAP.

On the other hand, the downtrend of "delete front" and "delete back" is more apparent than that of "delete left" and "delete right" (by 0.2%-0.5% mAP) in 3/4 EG. There are two possible reasons. (1) Flipping the left viewpoint often yields the right viewpoint, so deleting left or right incurs a smaller loss than removing front or back. (2) Because the front and back sides contain more information than the left or right sides, missing them reduces the amount of information the model can learn more than missing the other two sides. When training on 1/2 of the 36 viewpoints, the results of "front+back" in Table 3 decrease by 0.8%-1.5% in mAP compared with training on the other viewpoint combinations. A probable explanation is that models trained on left or right viewpoints learn general information about the pedestrian well, such as color and outfit (e.g., long or short sleeves, pants, shorts), whereas models trained on front or back viewpoints tend to capture more detailed information, such as clothing prints or the face. Therefore, when the query image shows a left or right viewpoint that does not contain much detailed information, the model trained on "front+back" viewpoints does not work well. The results of 1/4 EG support this explanation: the mAP of the models trained on the left or right viewpoints is almost 6%-9% higher than that of the models trained on the front or back viewpoints.
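The left/right symmetry argument can be made precise under the angle convention used in Fig. 3 and Fig. 7 (0° = left, 90° = front, 180° = right, 270° = back): horizontally flipping an image mirrors the viewpoint about the front-back axis. A minimal sketch, assuming this convention:

```python
def flipped_viewpoint(angle_deg: int) -> int:
    """Viewpoint of the horizontally flipped image, assuming 0 = left, 90 = front,
    180 = right, 270 = back (Fig. 3 / Fig. 7)."""
    return (180 - angle_deg) % 360

assert flipped_viewpoint(0) == 180     # left  -> right
assert flipped_viewpoint(180) == 0     # right -> left
assert flipped_viewpoint(90) == 90     # front -> front
assert flipped_viewpoint(270) == 270   # back  -> back
```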

To examine the detailed performance of these four models (trained on the left, right, front and back viewpoints, respectively), Fig. 6 shows their mAP when all query images are set to a single viewpoint.

Figure 6: Evaluation of the models trained on the left, front, right and back viewpoints, respectively, when all queries share the same viewpoint, varied from 0° to 350°. Here, the viewpoint is uniformly distributed in the gallery.

Obviously, for any query viewpoint, the models trained on the left / right viewpoints perform favorably against the models trained on the front / back viewpoints. Meanwhile, the same experiment is conducted on the real-world Market-1203 dataset, and the decreasing tendency across Oracle, CG, CG* and EG is consistent with that observed on the synthetic datasets.

Subsection conclusions: Missing viewpoints compromise training. Missing continuous viewpoints is more detrimental than missing randomly distributed viewpoints. When only limited training viewpoints are available, models train better when the left / right viewpoints are in the training set than when the front / back viewpoints are.

Figure 7: Investigation into query viewpoint. The query viewpoint of every identity is set to left (0°), right (180°), front (90°) and back (270°) in (a)-(d), respectively. In each figure, the horizontal axis represents setting the gallery viewpoint to each value from 0° to 350° separately. The results are obtained on PersonX with the 4 models of "1/4 EG" in Table 3. "Mean" is the average over the 27 marked points in each figure (removing the double positive effect of the right or left viewpoint).

 

| # | Training data | 1 vs all: Oracle (mAP / R1) | 1 vs 3: CG (mAP / R1) | 1 vs 3: EG (mAP / R1) | 1 vs 3: -EG (mAP / R1) | 1 vs 9: CG (mAP / R1) | 1 vs 9: EG (mAP / R1) | 1 vs 9: -EG (mAP / R1) |
| 1 | PersonX | 97.7 / 99.8 | 97.6 / 99.8 | 97.4 (↓0.3) / 99.4 (↓0.4) | 97.7 / 99.8 | 97.4 / 99.8 | 96.6 (↓1.1) / 98.3 (↓1.5) | 97.8 / 99.8 |
| 2 | PersonX 3/4 | 97.6 / 99.9 | 97.5 / 99.9 | 97.2 (↓0.4) / 98.9 (↓1.0) | 97.6 / 99.9 | 97.3 / 99.8 | 96.7 (↓0.9) / 98.4 (↓1.5) | 97.7 / 99.9 |
| 3 | PersonX 2/4 | 97.0 / 99.9 | 96.9 / 99.9 | 96.5 (↓0.5) / 98.9 (↓1.0) | 97.0 / 99.9 | 96.9 / 99.9 | 95.5 (↓1.5) / 97.4 (↓2.5) | 97.1 / 99.9 |
| 4 | PersonX 1/4 | 96.5 / 99.6 | 96.3 / 99.6 | 95.9 (↓0.6) / 98.3 (↓1.3) | 96.4 / 99.6 | 96.0 / 99.4 | 94.8 (↓1.7) / 96.6 (↓3.0) | 96.4 / 99.6 |
| 5 | PersonX | 96.8 / 99.2 | 96.7 / 99.1 | 96.4 (↓0.4) / 98.2 (↓1.0) | 96.8 / 99.2 | 96.4 / 99.0 | 95.5 (↓1.3) / 97.1 (↓2.1) | 96.8 / 99.2 |
| 6 | PersonX 3/4 | 96.8 / 98.8 | 96.7 / 98.8 | 96.4 (↓0.4) / 98.1 (↓0.7) | 96.8 / 98.8 | 96.4 / 98.7 | 95.5 (↓1.3) / 96.7 (↓2.1) | 96.8 / 98.8 |
| 7 | PersonX 2/4 | 96.3 / 98.7 | 96.2 / 98.7 | 95.8 (↓0.5) / 97.1 (↓1.6) | 96.3 / 98.7 | 95.8 / 98.5 | 94.8 (↓1.5) / 95.8 (↓2.9) | 96.3 / 98.7 |
| 8 | PersonX 1/4 | 95.3 / 98.1 | 95.1 / 98.1 | 94.6 (↓0.7) / 96.6 (↓1.5) | 95.1 / 98.1 | 95.2 / 98.1 | 93.4 (↓1.9) / 94.9 (↓3.2) | 94.7 / 97.8 |
| 9 | PersonX-lr | 92.6 / 96.6 | 92.4 / 96.4 | 91.8 (↓0.8) / 95.9 (↓0.7) | 92.6 / 96.6 | 91.7 / 96.1 | 90.2 (↓2.4) / 94.3 (↓2.3) | 92.7 / 96.6 |
| 10 | PersonX-lr 3/4 | 92.5 / 96.0 | 92.2 / 95.7 | 91.7 (↓0.8) / 95.3 (↓0.7) | 92.4 / 96.0 | 91.6 / 95.9 | 90.0 (↓2.5) / 94.0 (↓2.0) | 92.5 / 96.0 |
| 11 | PersonX-lr 2/4 | 91.7 / 95.7 | 91.4 / 95.6 | 90.8 (↓0.9) / 94.6 (↓1.1) | 91.6 / 95.6 | 90.8 / 95.4 | 89.0 (↓2.7) / 93.4 (↓2.3) | 91.7 / 95.6 |
| 12 | PersonX-lr 1/4 | 90.6 / 95.6 | 90.2 / 95.4 | 89.6 (↓1.0) / 94.3 (↓1.3) | 90.5 / 95.6 | 89.5 / 95.0 | 87.5 (↓3.7) / 92.7 (↓2.9) | 90.6 / 95.6 |
| 13 | Market-1203 (real) | 77.8 / 81.4 | 77.2 / 79.4 | 74.7 (↓3.1) / 74.4 (↓7.0) | 77.2 / 79.4 | 79.3 / 79.7 | 74.1 (↓3.7) / 71.0 (↓10.4) | 83.1 / 83.0 |
| 14 | Market-1203 3/4 (real) | 72.8 / 75.9 | 71.6 / 73.0 | 68.8 (↓4.0) / 67.9 (↓8.0) | 76.3 / 76.4 | 71.5 / 68.6 | 68.5 (↓4.3) / 64.9 (↓11.0) | 80.3 / 79.3 |
| 15 | Market-1203 2/4 (real) | 50.2 / 53.9 | 48.8 / 50.0 | 44.7 (↓5.5) / 43.5 (↓10.4) | 56.3 / 55.7 | 48.7 / 42.7 | 44.5 (↓5.7) / 41.1 (↓12.8) | 62.3 / 60.1 |

 

Table 4: Evaluation of viewpoint distribution in the gallery during testing, where θ_q and θ_g denote the probe and gallery viewpoints. "1 vs all" means the gallery contains all 36 viewpoints of each identity. For EG, "1 vs 3" means the 3 true-match gallery images whose viewpoints are closest to the probe (|θ_g − θ_q| ≤ 10°) are defined as junk during testing; "1 vs 9" takes the 9 true-match gallery images with |θ_g − θ_q| ≤ 40° as junk. For -EG, the same number of true-match gallery images whose viewpoints are most different from the probe are defined as junk. CG represents randomly selecting the same number of images as junk. For the real dataset Market-1203 (rows #13-#15), the "1 vs 9" columns correspond to "1 vs 5". (↓) denotes the drop relative to the result in the Oracle column.

5.2 Viewpoint in Query Set

Q: How does the viewpoint in the query set affect retrieval? We quantify how the query viewpoint influences the system. Based on the trained models above, we modify the evaluation rules to examine the effect of viewpoint during testing.

Specifically, the viewpoint of the probe images is set to the left (0°), front (90°), right (180°) and back (270°) viewpoint of the person, respectively. During retrieval, the true match is set to the image that contains the same person with a viewpoint ranging from 0° to 350°, separately. Figure 7 shows the results of using these query and gallery images to test the four models trained on the left, right, front and back orientations (see Fig. 3 for the definition of the four orientations), respectively. A consistent observation across the four settings is that a query obtains the highest accuracy when the viewpoint of the true match is similar to the query viewpoint; for example, the maximum rank-1 values in (a)-(d) correspond to the gallery viewpoint that equals the query viewpoint. Meanwhile, queries of the left and right viewpoints ((a) and (b)) achieve a higher mean rank-1 than queries of the front or back viewpoint ((c) and (d)); e.g., the mean R1 of 80.0 in (b), for the right-viewpoint query, is higher than the mean of 74.3 in (c). This is because a left-viewpoint query performs well on both left and right gallery viewpoints, whereas front or back queries do not have this property. To remove this positive bias, the R1 values of the 9 gallery viewpoints on the right (or left) side are excluded when calculating the mean for a fair comparison; even so, the left and right query viewpoints still yield high retrieval accuracy.

Subsection conclusions: Query viewpoints of left / right generally yield higher re-ID accuracy than front / back viewpoints.

5.3 Viewpoint in Gallery

Q: How does the viewpoint distribution in the gallery affect retrieval? Finally, we study how the gallery viewpoint distribution affects re-ID accuracy. Specifically, gallery images that have viewpoints similar to the probe are set as junk during testing. We denote the viewpoints of the probe and a gallery image as θ_q and θ_g, respectively. The experimental groups (EG) are as follows: 1) true-match gallery images with |θ_g − θ_q| ≤ 10° are set as junk (i.e., 3 images); 2) true-match gallery images with |θ_g − θ_q| ≤ 40° are set as junk (i.e., 9 images). For comparison, we also define -EG, which junks the same number of true-match gallery images whose viewpoints are most different from the probe, because these images can be considered to have a large distance to the query image. The corresponding control groups (CG) randomly set the same number of images as junk. Since Market-1203 only contains 8 kinds of angles, "3" and "5" mean junking the images with viewpoints similar to the probe viewpoint; e.g., "3" refers to removing the viewpoint identical to the probe viewpoint together with the previous and next viewpoints.
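For the synthetic data, the junk-selection rules can be sketched as follows. We assume here that the true-match gallery entries are given as records with an "angle" field; the thresholds follow from the 10° sampling step (±10° covers 3 images, ±40° covers 9).

```python
def angular_diff(a: int, b: int) -> int:
    """Smallest absolute difference between two viewpoints on the 0-350 degree circle."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def junk_eg(probe_angle: int, true_matches: list, threshold: int) -> list:
    """EG: junk true-match gallery images whose viewpoints lie within `threshold`
    degrees of the probe (threshold=10 -> 3 images, threshold=40 -> 9 images)."""
    return [g for g in true_matches
            if angular_diff(g["angle"], probe_angle) <= threshold]

def junk_neg_eg(probe_angle: int, true_matches: list, threshold: int) -> list:
    """-EG: junk the same number of true matches whose viewpoints differ most from
    the probe, i.e., those closest to the opposite viewpoint."""
    return [g for g in true_matches
            if angular_diff(g["angle"], probe_angle) >= 180 - threshold]
```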

The experimental results in Table 4 show that randomly setting 3 or 9 gallery images of each person as junk, i.e., the control group (CG), has a negligible effect on the results. Meanwhile, the results remain approximately unchanged in -EG when junking 3 images, and most results even rise slightly in the "1 vs 9" setting, due to the removal of hard-to-retrieve gallery images and the narrowing of the database. For example, in row #1 of Table 4, compared with the Oracle results of 97.7 (mAP) and 99.8 (R1), the CG results in "1 vs 9", 97.4 (mAP) and 99.8 (R1), change only marginally, while the -EG mAP rises by 0.1 to 97.8. On the contrary, the results of the experimental group (EG) show obvious decreases, especially in R1. For instance, in row #1 ("1 vs 9"), the EG results are 96.6 (mAP) and 98.3 (R1), i.e., decreases of 1.1 in mAP and 1.5 in R1. Therefore, even a well-learned model is affected when the viewpoints of the gallery images are not similar to that of the query image. Consistent changes can also be observed on the real-world Market-1203 dataset: in row #13, mAP decreases by 3.1-3.7 and R1 by 7.0-10.4. Meanwhile, R1 changes more markedly than mAP. One possible reason is that, after junking the true matches with similar viewpoints, false matches with similar viewpoints are ranked ahead of the remaining images of the same person. On the other hand, the reduction in mAP becomes more pronounced as the environment becomes harder. For example, the decline on the PersonX-lr dataset (row #9) is approximately twice as large as that on the corresponding PersonX dataset (row #5).

Subsection conclusions: True matches whose viewpoints are dissimilar to the query are sometimes harder to retrieve than false matches with a similar viewpoint. This problem becomes more severe when the environment is less ideal, e.g., under a complex background, extreme illumination, or low resolution.

6 Conclusion

This paper makes two contributions to the community. First, we build a synthetic data engine, PersonX, that can generate images under controllable cameras and environments. The data generated by PersonX are shown to be as indicative as real-world datasets. Second, based on PersonX, we conduct comprehensive experiments to quantitatively assess the influence of pedestrian viewpoint on person re-ID accuracy. Interesting and constructive insights are derived, e.g., it is best to use a query image capturing a side view (left or right) of the person. In the future, visual factors such as illumination and background will be studied with the proposed engine.

References

  • [1] S. Bak, P. Carr, and J.-F. Lalonde. Domain adaptation through synthesis for unsupervised person re-identification. arXiv preprint arXiv:1804.10094, 2018.
  • [2] S. Bak, S. Zaidenberg, B. Boulay, and F. Bremond. Improving person re-identification by viewpoint cues. In AVSS, pages 175–180, 2014.
  • [3] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis. Looking beyond appearances: Synthetic training data for deep cnns in re-identification. Computer Vision and Image Understanding, 167:50–62, 2018.
  • [4] L. Bazzani, M. Cristani, A. Perina, and V. Murino. Multiple-shot person re-identification by chromatic and epitomic analyses. Pattern Recognition Letters, 33(7):898–903, 2012.
  • [5] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai. Person search via a mask-guided two-stream cnn model. In ECCV, 2018.
  • [6] Y.-J. Cho and K.-J. Yoon. Improving person re-identification via pose-aware multi-shot matching. In CVPR, pages 1354–1362, 2016.
  • [7] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, 2010.
  • [8] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
  • [9] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, 2008.
  • [10] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [11] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, and B. Xu. Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning. In CVPR, 2015.
  • [12] S. Karanam, Y. Li, and R. J. Radke. Person re-identification with discriminatively trained viewpoint invariant dictionaries. In ICCV, 2015.
  • [13] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
  • [14] X. Li, K. Wang, Y. Tian, L. Yan, F. Deng, and F.-Y. Wang. The paralleleye dataset: A large collection of virtual images for traffic vision research. IEEE Transactions on Intelligent Transportation Systems, (99):1–13, 2018.
  • [15] X. Li, W.-S. Zheng, X. Wang, T. Xiang, and S. Gong. Multi-scale learning for low-resolution person re-identification. In ICCV, 2015.
  • [16] J. Riccitiello. John riccitiello sets out to identify the engine of growth for unity technologies (interview). VentureBeat. Interview with Dean Takahashi. Retrieved January, 18, 2015.
  • [17] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pages 17–35, 2016.
  • [18] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR, 2018.
  • [19] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, 2018.
  • [20] C. Song, Y. Huang, W. Ouyang, and L. Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, 2018.
  • [21] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, 2017.
  • [22] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018.
  • [23] M. Tian, S. Yi, H. Li, S. Li, X. Zhang, J. Shi, J. Yan, and X. Wang. Eliminating background-bias for robust person re-identification. In CVPR, 2018.
  • [24] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger. Resource aware person re-identification across multiple resolutions. In CVPR, 2018.
  • [25] Z. Wu, Y. Li, and R. J. Radke. Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5):1095–1108, 2015.
  • [26] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In CVPR, 2017.
  • [27] K. Zheng, X. Fan, Y. Lin, H. Guo, H. Yu, D. Guo, and S. Wang. Learning view-invariant features for person identification in temporally synchronized videos taken by wearable cameras. In ICCV, 2017.
  • [28] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV, 2016.
  • [29] L. Zheng, Y. Huang, H. Lu, and Y. Yang. Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732, 2017.
  • [30] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • [31] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV, 2017.
  • [32] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang. Camera style adaptation for person re-identification. In CVPR, 2018.