A baseline for clothes-change Person ReID task.
Person re-identification (Reid) is now an active research topic for AI-based video surveillance applications such as specific person search, but the practical issue that the target person(s) may change clothes (the clothes inconsistency problem) has long been overlooked. For the first time, this paper systematically studies this problem. We first overcome the lack of a suitable dataset by collecting a small yet representative real dataset for testing, while building a large realistic synthetic dataset for training and deeper studies. Facilitated by our new datasets, we are able to conduct various new experiments to study the influence of clothes inconsistency. We find that changing clothes makes Reid a much harder problem: it complicates the learning of effective representations and challenges the ability of previous Reid models to generalize to persons with unseen (new) clothes. Representative existing Reid models are adopted to show informative results in this challenging setting, and we also provide some preliminary efforts on improving the robustness of existing models to clothes inconsistency in the data. We believe that this study can be inspiring and helpful for encouraging more research in this direction.
Automatically searching for a specific person (e.g., a suspect) across multiple video surveillance cameras at different places, a popular imagination in AI-based science fiction movies and TV dramas, is not only extremely valuable and impactful for public safety and security (e.g., finding the Boston Marathon bombing suspect in an efficient way), but also technically highly challenging. It is about fine-grained search over numerous similar-looking candidates while at the same time having to tolerate significant appearance changes of the same targets.
The identity of a person is usually determined by his/her biological traits/characteristics, such as facial appearance, age, body shape and so on, rather than by outward appearance such as hairstyle, bags, shoes, and clothes. In contrast, the widely used Reid benchmarks (Market1501, CUHK03, DukeMTMC-reid) do not really contain significant appearance variance of the same person, which can easily mislead Reid models into learning appearance features (e.g., clothes) rather than the desired robust identity-sensitive features related to biological traits. This is particularly true because most successful DNN-based Reid models learn from data, and the clothes consistency within each identity makes the models rely heavily on the visual appearance of clothes, which usually cover the largest part of a human body. Thus, these models can easily fail when people wear clothes similar to those of others or change their own clothes.
Though people do change clothes frequently, especially across different days, and crime suspects may even do so intentionally on the same day to hide more effectively, the clothes inconsistency problem has long been overlooked in the research community. There are mainly two reasons for that. Firstly, a traditional assumption about Reid is that it only covers a short period of a few minutes or hours, so that clothes stay consistent for each identity. However, even in such a short period, one can still put on or take off a coat, pick up a bag, or change into another suit, let alone over long periods such as several days. This assumption greatly limits the generalization ability of Reid models towards real-world applications, such as finding crime suspects. To this end, it is imperative to investigate Reid models in settings that allow clothes inconsistency, as shown in Fig. 1. Secondly, it is very expensive to create a large-scale person Reid dataset with a significant amount of same-identity clothes inconsistency in real scenarios. Capturing data across days may not be so difficult, but associating the same identities across clothes changes in such large amounts of data to obtain ground-truth labels is far more challenging for human annotators.
Despite these difficulties, in this paper we provide benchmark datasets and systematically study the influence of the clothes inconsistency problem on person Reid, which is the first such study as far as we know. In greater detail, two benchmark datasets are built and will be released together with the publication of this paper. One is a 3-day, 4-camera dataset capturing real persons and scenarios around the entrance of a big office building (as shown in the upper row of Fig. 2). Due to privacy concerns, we have so far only been able to capture data of 28 volunteers. Such difficulties motivated us to build another dataset, a synthetic dataset called the Virtually Changing-Clothes (VC-Clothes) dataset, using 3D human models with the help of the engine of the video game GTA5 (samples shown in Fig. 3). Thanks to the encouraging breakthroughs in computer graphics and the prevalence of virtual reality games, we could obtain high-quality realistic data covering 512 identities and 19,060 images in 4 different scenes, with significant clothes changes.
Intuitively, one can imagine that the clothes inconsistency problem makes Reid harder; however, this has never been validated or discussed experimentally in previous studies. In Sec. 4, we show that clothes changes not only significantly influence representation learning, but also challenge the generalization ability of Reid models in re-identifying persons wearing unseen clothes. Moreover, we present some preliminary efforts on how to improve performance by learning more identity-sensitive and clothes-insensitive features, which can serve as baselines for future deeper studies.
Contributions. (1) We systematically study the influence of the same-identity clothes inconsistency issue on Reid, which has historically been overlooked. (2) We build two new benchmark datasets, one real and one synthetic, with significant clothes changes, to support this research. (3) We provide preliminary solutions for handling the clothes inconsistency issue, which can serve as baselines to inspire further research.
There are few studies of the clothes inconsistency problem in person Reid. The biggest challenge is the lack of sufficient, applicable and well-annotated data with clothes inconsistency. RGB-D is the first Reid dataset with clothes inconsistency, proposed to solve this problem using additional depth information. However, such a solution is only applicable in indoor environments and the dataset is not sufficient to train a deep learning model (with only 79 identities and 1-2 sets of clothes per identity). iQIYI-VID is a large-scale multi-modal dataset collected from 600K video clips of 5,000 celebrities of various types. Although this dataset demonstrates great advantages in terms of size and variation in human poses, face quality, clothes, makeup, etc., its images come from online videos instead of surveillance cameras. Another possible way is to automatically generate images with clothes differences using a GAN, e.g. DG-GAN. Its encoders decompose each pedestrian image into two latent spaces, an appearance space and a structure space, so one can change the color of clothes by swapping the latent codes of different persons. However, this work did not dig into the importance of clothes-independent features, and appearance information contributes most of its performance improvement. It can only change the color of clothes, without altering human poses or illumination at the same time. Different from all the above methods, we propose a synthetic dataset which can best simulate realistic scenarios.
Synthetic data is becoming increasingly popular for training deep learning models, because of the difficulty of building an ideal experimental environment and the huge amount of human labor that real data requires. In comparison, synthetic data can be generated on demand by an automated process, and it has therefore promoted the development of many computer vision tasks such as semantic segmentation [18, 21], object detection [19, 9], and pose estimation [1, 5, 6]. In Reid, only a few works [2, 3, 24] utilize synthetic data. [2, 24] are designed to investigate illumination variation or viewpoint diversity in Reid instead of clothing changes. Although SOMAset considers clothing changes, it does not discuss the influence of clothes on Reid in depth. Our dataset contains 512 identities with 1 ~ 3 different suits of clothes for each individual, so it is well suited to research on the clothes inconsistency task in Reid.
To the best of our knowledge, RGB-D is the only publicly available Reid benchmark for studying the same-identity clothes inconsistency issue. However, it is not sufficient for an in-depth study, as it contains only 79 identities, only a few of which have a second suit of clothes, and it was captured in an indoor laboratory environment with simple backgrounds. Therefore, we decided to build new benchmark datasets. However, it is difficult to build a large-scale dataset of real surveillance data with significant clothes changes. Therefore, we built a small one with 28 student volunteers for preliminary research only, and put more effort into building a much larger dataset of realistic synthetic data generated by a powerful 3D video game engine. Some important facts about these two datasets are given in Table 1, in comparison with the RGB-D dataset. In this paper, our initial studies show that these two datasets are sufficient for supporting much valuable research on the historically overlooked clothes inconsistency issue. Firstly, we give a brief introduction to the two datasets, especially the synthetic one, which is unique and well suited to developing new robust Reid models. We also demonstrate its potential to serve as a new person Reid benchmark for research purposes.
We name our real-scenario dataset “Real28”. As can be seen in Fig. 2, it was collected on 3 different days (with different clothing) by 4 cameras, covers 28 different identities, and consists of 4,324 images in total, with 2 indoor scenes and 2 outdoor scenes. As shown in Table 1, even though Real28 contains fewer identities, it has more clothes and environment changes and a significantly larger size than RGB-D. Since the size is still not big enough for training deep learning models, we suggest using this dataset only for performance testing and training the models on some other dataset, such as the synthetic dataset VC-Clothes.
|Detector||hand||DPM||DPM, hand||Mask RCNN||hand||hand|
The images are rendered by Grand Theft Auto V (GTA5, https://www.rockstargames.com/), a famous action-adventure game with HD and realistic game graphics. Critically, it allows very convenient configuration of the clothes of each avatar/character and supports user-defined environmental parameters, e.g., illumination, viewpoint and background. Prior work has even built a pedestrian tracking and pose estimation dataset on top of it. GTA5 has a very large environment map, including thousands of realistic buildings, streets and spots. We select 4 different scenes, 1 indoor scene and 3 outdoor scenes, as shown in Fig. 3, i.e., gate, street, natural scene and parking lot. To make the images more realistic, we introduce additional illumination variation by varying the lighting not only across scenes, but also within the same scene at different times. In general, outdoor scenes have more lighting than indoor ones.
In the process of image generation, each person walks along a scheduled route, and the cameras are placed at fixed locations. We use different camera views (front-left, front-right, and right-rear), so the faces may not be easily observed. However, each person still has very distinctive biological traits, as demonstrated in Fig. 4. Thus, it is imperative to identify these persons by features such as face, gender, age, body figure and hairstyle, rather than by the clothes. Note that we only change the clothes of the persons, while keeping their identities unchanged. In our VC-Clothes dataset, each identity has 1 ~ 3 suits across the four cameras and keeps the same clothes within any single camera. Furthermore, to facilitate experiments, we make all the identities wear the same clothes on Cam2 and Cam3.
The main challenge of this dataset comes from the differing visual appearance of the same identity in different clothes. Other challenges include different persons with similar clothes, various occlusions, illumination changes under different environments, and pose variance (see Fig. 4).
VC-Clothes has 512 identities, 4 scenes (cameras), on average 9 images per scene for each identity, and a total of 19,060 images. Mask RCNN is employed to detect the persons by automatically producing bounding boxes. The detailed statistics of VC-Clothes are summarized in Table 2, in comparison with other widely used Reid datasets, e.g. DukeMTMC-reid, Market1501, CUHK03, CUHK01 and VIPeR. More statistical analysis of this dataset can be seen in Fig. 5.
We split the dataset equally by identities, with 256 identities in the training set and the other 256 in the testing set. In the test data, we randomly choose 4 images per person from each camera as queries, while the other images serve as gallery images. Thus, we have altogether 9,449 images for training, 1,020 query images and 8,591 gallery images.
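The identity-disjoint split described above can be sketched as follows. This is a minimal illustration rather than the authors' released tooling: the `(image_path, person_id, camera_id)` tuple layout, the function name `split_by_identity`, and the random seed are assumptions, and the number of queries drawn per identity and camera is left as a parameter.

```python
import random
from collections import defaultdict


def split_by_identity(samples, n_train_ids=256, queries_per_group=4, seed=0):
    """Split samples into train / query / gallery with disjoint identities.

    samples: list of (image_path, person_id, camera_id) tuples
             (hypothetical layout, not the dataset's official format).
    """
    rng = random.Random(seed)
    ids = sorted({pid for _, pid, _ in samples})
    rng.shuffle(ids)
    train_ids = set(ids[:n_train_ids])

    train = [s for s in samples if s[1] in train_ids]
    test = [s for s in samples if s[1] not in train_ids]

    # Group test images by (identity, camera) and draw queries from each group.
    groups = defaultdict(list)
    for s in test:
        groups[(s[1], s[2])].append(s)

    query, gallery = [], []
    for key in sorted(groups):
        images = list(groups[key])
        rng.shuffle(images)
        query.extend(images[:queries_per_group])
        gallery.extend(images[queries_per_group:])
    return train, query, gallery
```

Splitting by identity rather than by image keeps every test person unseen during training, which is what makes the later unseen-clothes experiments meaningful.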
To establish the VC-Clothes dataset as a qualified benchmark, we compare the results of some representative Reid methods on VC-Clothes with their results on a widely accepted benchmark, CUHK03.
Problem Definition. We still follow the classical definition of the person Reid task. Given a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ images, where $x_i$ and $y_i$ are the image data and its corresponding identity label, the goal of Reid is to learn an effective feature mapping $\phi(x)$ for the input image $x$, such that images of the same identity have smaller distances in the feature space than those of different identities. In VC-Clothes, we aim to report both the clothes consistency results and the clothes inconsistency results at the same time.
We use two main evaluation metrics: the mean Average Precision (mAP) and the accuracy at a few indicative ranks, e.g., Rank-1 (abbreviated R@1), Rank-5 (R@5), and Rank-10 (R@10). For each query, the average precision is the mean of the precision values measured at the rank of each correctly retrieved gallery image; mAP is the mean of this value over all queries. The Rank-i accuracy is the fraction of queries whose first i retrieved results contain at least one correct image.
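To make the two metrics concrete, the sketch below computes mAP and Rank-k accuracy from per-query match lists. It assumes each query's gallery has already been sorted by ascending feature distance and converted to booleans (True where the gallery image shares the query identity); the function names are illustrative, and the same-camera filtering used by many Reid evaluation protocols is omitted for brevity.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches is a list of booleans in ranked order."""
    n_relevant = sum(ranked_matches)
    if n_relevant == 0:
        return None  # query has no correct gallery image; skipped in mAP
    hits, precision_sum = 0, 0.0
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / position  # precision at this recall point
    return precision_sum / n_relevant


def evaluate(all_ranked_matches, ranks=(1, 5, 10)):
    """Return (mAP, {k: Rank-k accuracy}) over a list of queries."""
    aps = [ap for ap in map(average_precision, all_ranked_matches) if ap is not None]
    mean_ap = sum(aps) / len(aps)
    n_queries = len(all_ranked_matches)
    rank_acc = {
        k: sum(any(matches[:k]) for matches in all_ranked_matches) / n_queries
        for k in ranks
    }
    return mean_ap, rank_acc
```

For example, a query whose correct images appear at positions 2 and 4 has an average precision of (1/2 + 2/4) / 2 = 0.5 and counts as a miss at Rank-1 but a hit at Rank-5.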
|Methods||Same Clothes||Change Clothes|
|(1 suit/person, Cam2&Cam3)||(2 suits/person, Cam3&Cam4)|
|Methods||Query in seen suits (Cam2)||Query inunseen suits (Cam1)|
|Gallery (Cam3&Cam4)||Gallery (Cam3&Cam4)|
Competitors. We compare several hand-crafted methods, including LOMO+XQDA and GOG+XQDA, and several DNN methods: MDLA, PCB and Part-aligned. We use the original setup of these 5 state-of-the-art methods in our experiments. In addition, we also provide a baseline model using the widely adopted ResNet50 network pre-trained on the ImageNet dataset. During training, the batch size is set to 128 and the model is trained for 200 epochs with an initial learning rate of 0.0001, which decays by a factor of 0.01 every 50 epochs.
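The baseline's learning-rate schedule can be written as a simple step function. This is a sketch under one assumption: "decay by a factor of 0.01" is read here as multiplying the learning rate by 0.01 at each 50-epoch boundary. The function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.01, step=50):
    """Step decay: multiply base_lr by gamma once per completed `step` epochs.

    Assumes "decay by a factor of 0.01" means lr <- lr * 0.01.
    """
    return base_lr * gamma ** (epoch // step)


# Learning rate at the start of each 50-epoch stage of a 200-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 50, 100, 150)}
```

In a PyTorch training loop, the same behaviour would typically be obtained with a step scheduler configured with a step size of 50 and a gamma of 0.01.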
Benchmark results. As shown in Table 3, the performance trend of different methods on the VC-Clothes dataset is similar to that on CUHK03, suggesting that VC-Clothes can reflect real-world performance. It also allows different sub-settings to be designed, so that we can systematically study the influence of the clothes inconsistency problem. All of the above factors make VC-Clothes a reasonable benchmark for person Reid model evaluation.
When designing experimental settings, we mainly consider whether people change their clothes or not. We therefore have all identities keep their clothes across two cameras (Cam2 and Cam3) and change their clothes across any other camera pair. Such a setting enables the study of how clothes inconsistency influences the performance of Reid.
We first conduct a typical experiment to assess the performance of Reid in two indicative settings, Same Clothes (Cam2 and Cam3) vs. Change Clothes (Cam3 and Cam4), using all the representative state-of-the-art methods. As can be seen in Table 4, clothes inconsistency causes a performance drop of about 30% to 50% for DNN-based methods, and an even larger one for hand-crafted methods, in terms of mAP and Rank-1 accuracy. Although camera diversity may also influence the results to some extent, the primary cause of the dramatic performance decrease should be the change of clothes.
This indicates that clothes inconsistency is essentially a very subtle problem, which unfortunately has been overlooked in previous works. This is easy to understand, as clothes usually cover most of the body, which makes it hard to extract identity-sensitive but clothes-independent appearance features. Note that the performance drop of the Part-aligned method is the smallest among all the compared methods, indicating that finer and more detailed modeling of appearance is needed for handling clothing changes.
In Fig. 6, we give a visualization of the appearance maps and part maps of the Part-aligned model under the “Same Clothes” (Cam2 and Cam3) and “Change Clothes” (Cam1 and Cam2) settings (see Tab. 4). For a given input image (left), the global appearance (center) and body parts (right) are represented using different colors. The same color implies similar features, and brighter locations denote more significant body parts.
In the “Same Clothes” setting, where there is no clothes inconsistency, the appearance maps of body parts are much brighter than those in the “Change Clothes” setting and the colors are more scattered. Obvious attention on faces and feet can be seen in the part map of the “Change Clothes” setting, which means that learning identity-sensitive and clothes-independent representations requires exploring the details of smaller areas instead of the large body parts.
To evaluate the Reid models’ ability to handle clothes inconsistency of the same identities, training and testing on data from the same camera pair (as shown in Tab. 4) is not enough, because the learned models may overfit to specific camera viewpoints. Therefore, we propose to test by querying with data from another, new camera.
In the VC-Clothes dataset, each camera corresponds to one type of suit for each identity. We can train models on Cam3 and Cam4, and test on two different cameras: Cam2 and Cam1. The two cameras represent two different settings. Choosing Cam2 means querying with human instances in seen suits, as the clothes of each person are the same for Cam2 and Cam3. In contrast, Cam1 records a new, unseen suit for each identity. The models are thus trained in the same setting but tested in two different settings, making it possible to evaluate not only the performance on querying with a new camera, but also the generalization ability on handling unseen clothes, which is believed to be more important when clothing changes do happen.
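The two test settings differ only in which camera supplies the queries, which a small camera filter makes explicit. The sketch below assumes samples are `(image_path, person_id, camera_id)` tuples and takes the gallery from Cam3 and Cam4; the constant and function names are hypothetical.

```python
TRAIN_CAMERAS = {3, 4}    # two suits per person are seen during training
GALLERY_CAMERAS = {3, 4}  # gallery drawn from the same cameras at test time
QUERY_CAMERAS = {
    "seen_suits": {2},    # Cam2 shares each person's clothes with Cam3
    "unseen_suits": {1},  # Cam1 shows a suit never seen in training
}


def select_by_camera(samples, cameras):
    """Keep only samples captured by one of the given cameras."""
    return [s for s in samples if s[2] in cameras]


def build_protocol(test_samples, setting):
    """Return (query, gallery) lists for 'seen_suits' or 'unseen_suits'."""
    query = select_by_camera(test_samples, QUERY_CAMERAS[setting])
    gallery = select_by_camera(test_samples, GALLERY_CAMERAS)
    return query, gallery
```

Because the gallery and the trained model are identical in both settings, any gap between the two results can be attributed to the query suits rather than to the new viewpoint alone.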
The experimental results are presented in Table 5. Since DNN-based methods perform much better than hand-crafted ones, we only list the results of DNN-based methods. As the results show, querying with a new camera is not a big issue when people keep their clothes unchanged, but wearing unseen suits does bring significant extra difficulty and limits the generalization ability of the state-of-the-art methods.
|Pre-trained||All clothes||Same clothes|
To demonstrate the advantage of models trained on a dataset with clothes inconsistency, we test the generalization ability of models pre-trained on the “All Clothes” setting (images from all cameras) and the “Same Clothes” setting (images from only Cam2 and Cam3), and build a confusion matrix of the two models tested on the “All Clothes” setting and the “Same Clothes” setting, respectively. Specifically, the “Same Clothes” setting imitates a general person Reid dataset, while the “All Clothes” setting is the full VC-Clothes dataset.
As shown in Table 6, the performance drop of the model pre-trained on the “All Clothes” setting and tested on the “Same Clothes” setting is negligible, only about 1% on VC-Clothes (in terms of R@1 and mAP, while R@10 differs even less), compared with the model both pre-trained and tested on the “Same Clothes” setting. However, a dramatic decrease in performance appears when the model pre-trained on “Same Clothes” is tested on the “All Clothes” setting (Rank-1 and mAP drop by 27.7% and 39.8% respectively). In other words, models trained on a dataset with clothes inconsistency perform well in both the clothes consistency setting and the clothes inconsistency setting, while models trained only on a dataset with clothes consistency do not generalize well to the clothes inconsistency setting.
|Methods||Train: 2 suits/person (Cam3&Cam4)||Train: add 1 unseen suit/person (add Cam1)||Train: add 1 seen suit/person (add Cam2)|
|Test: 2 suits/person (Cam3&Cam4)||Test: 2 suits/person (Cam3&Cam4)||Test: 2 suits/person (Cam3&Cam4)|
In this section, we discuss some preliminary solutions for improving robustness against clothes inconsistency. In particular, we present several straightforward attempts, and we hope that our experimental results will motivate and inspire further studies of this important issue.
One common strategy to improve the performance of DNN-based models is providing more training data. We treat the models trained on Cam3 and Cam4 as the basic settings, and add data from one more camera to enlarge the training set. By choosing either Cam1 or Cam2 as the additional camera, we can get two different settings. Adding Cam1 means adding one more unseen suit/person, while adding Cam2 means adding a seen suit/person, as Cam2 and Cam3 share the same clothes for each person.
The experimental results of these three different settings are presented in Table 7. Adding 1 unseen suit/person significantly boosts the performance of all methods, while the opposite effect can be observed when adding 1 seen suit/person. This may indicate that adding unseen suits pushes the DNN-based models to generate more clothes-independent representations, which are more effective for Reid with clothing changes. Therefore, adding more training data proves effective when it introduces new clothes.
Overview of the fusion model. In order to improve the performance of current Reid models on VC-Clothes, we propose a 3-stream Appearance, Part and Face Extractor Network (3APF). As summarized in Fig. 7, it is composed of two main components: a holistic feature extractor (Reid part) and a local face feature extractor (face part).
Holistic feature extractor. We directly utilize the Part-aligned network  as the holistic feature extractor following the original setting. In particular, this network learns the appearance and body part branches, which are further combined by a bi-linear pooling. Thus, this network computes the features as in Eq. 1.
Local face feature extractor. We utilize Pyramidbox for face detection; only 80% of the person images have detectable faces, because many faces are non-frontal. The confidence threshold for the face bounding boxes is set to 0.8. Further, we expand the detected face bounding boxes by 15 pixels in each of the four directions (up, down, left and right) to cover more of the facial area. All face images are then resized to the same size. For images with undetectable faces, we use only the feature from the holistic feature extractor for identification.
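The face-box post-processing described above (confidence filtering followed by a 15-pixel expansion) can be sketched without any detector dependency. The `(x1, y1, x2, y2)` box layout, the function names, and the clipping of expanded boxes to the image bounds are assumptions made for illustration.

```python
def keep_confident_faces(detections, threshold=0.8):
    """Keep boxes whose detector confidence reaches the threshold.

    detections: list of ((x1, y1, x2, y2), score) pairs (assumed layout).
    """
    return [box for box, score in detections if score >= threshold]


def expand_face_box(box, image_width, image_height, margin=15):
    """Grow a box by `margin` pixels on every side.

    Clipping to the image bounds is an assumption; the text only states
    the 15-pixel expansion.
    """
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(image_width, x2 + margin),
        min(image_height, y2 + margin),
    )
```

An image for which `keep_confident_faces` returns an empty list corresponds to the undetectable-face case, where only the holistic feature is used.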
We use the widely adopted ResNet50 network pre-trained on the MS-1M dataset to extract face features $f_{face} = \mathrm{ResNet50}(D(x))$, where $D$ is a face detector. The final feature vector is the weighted mean of the two features as in Eq. 2, $f = \lambda f_{reid} + (1 - \lambda) f_{face}$, where $\lambda$ is the hyper-parameter that determines the proportion of each part.
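A minimal sketch of the weighted-mean fusion follows. It assumes the holistic and face features are plain lists of equal length, which the text does not state, and that the weight applies to the holistic (Reid) feature; the default of 0.95 is the value reported in Table 8. The function name is illustrative.

```python
def fuse_features(reid_feature, face_feature, weight=0.95):
    """Weighted mean of holistic and face features.

    weight: share given to the holistic (Reid) feature; the face feature
            receives 1 - weight.
    Falls back to the holistic feature when no face was detected.
    Equal feature lengths are an assumption of this sketch.
    """
    if face_feature is None:
        return list(reid_feature)
    if len(reid_feature) != len(face_feature):
        raise ValueError("features must have the same length to be averaged")
    return [
        weight * r + (1.0 - weight) * f
        for r, f in zip(reid_feature, face_feature)
    ]
```

If the two extractors produce features of different sizes, the same weighting could instead be applied to the two query-gallery distances, which avoids the dimensionality constraint.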
Improvement brought by faces. The visualization in Fig. 6 demonstrates the significance of the face when clothes inconsistency exists. To describe the influence of the face relative to the Reid part in a quantitative way, we use the 3APF model to control the proportion of the face part and the Reid part by adjusting the weight. The larger the weight, the more important the Reid part is in our model. As can be seen in Table 8, when we set the weight to a certain value (0.95), a significant improvement in mAP (82.1%) can be observed over both the single face extractor and the single holistic extractor.
|Pre-trained||Different clothes||All Clothes||All clothes|
To evaluate whether our models pre-trained on VC-Clothes (a synthetic dataset with clothes inconsistency) are useful in real-world applications, and to show their advantages over models pre-trained on Market1501 (a real dataset without clothes inconsistency), we adopt 2 different ways of transferring the pre-trained models to real datasets. One is directly applying the 3APF models trained on VC-Clothes or Market1501 to Real28 or RGB-D. The other is using fine-tuning to learn better representations of real data with clothes inconsistency. “VC-Clothes+FT” means pre-trained on VC-Clothes and fine-tuned on Market1501, while “Market1501+FT” denotes the opposite.
As can be seen in Table 9, directly applying the model trained on VC-Clothes does not yield good results, but we can get much better results (VC-Clothes+FT) by simply fine-tuning it on a third, real dataset. Moreover, our synthetic data can also help a model pre-trained on a real dataset perform better on tasks with clothes inconsistency (Market1501+FT). This phenomenon is especially pronounced on “Real28”, where more clothes changes exist.
In this paper, we studied the clothes inconsistency problem in person Reid. Such a setting is much more realistic and helpful for tackling the long-term person Reid task. Because of the huge difficulty of collecting a large-scale dataset in real scenes, we collected only a small one for testing. As an alternative, we proposed to use the GTA5 game engine to render a large-scale synthetic dataset, “VC-Clothes”, with the desired properties. With these two benchmark datasets, we conducted a series of pilot studies, which help us better understand the influence and importance of this new problem. Finally, we also investigated some straightforward solutions for improving the performance on VC-Clothes and ways to transfer the model to a real test dataset. We hope this work will motivate and inspire further studies of this important issue.
ImageNet classification with deep convolutional neural networks. In NeurIPS.