Learning Shape Representations for Clothing Variations in Person Re-Identification

03/16/2020 ∙ by Yu-Jhe Li, et al. ∙ 10

Person re-identification (re-ID) aims to recognize instances of the same person contained in multiple images taken across different cameras. Existing methods for re-ID tend to rely heavily on the assumption that both query and gallery images of the same person have the same clothing. Unfortunately, this assumption may not hold for datasets captured over long periods of time (e.g., weeks, months or years). To tackle the re-ID problem in the context of clothing changes, we propose a novel representation learning model which is able to generate a body shape feature representation without being affected by clothing color or patterns. We call our model the Color Agnostic Shape Extraction Network (CASE-Net). CASE-Net learns a representation of identity that depends only on body shape via adversarial learning and feature disentanglement. Due to the lack of large-scale re-ID datasets which contain clothing changes for the same person, we propose two synthetic datasets for evaluation. We create a rendered dataset, SMPL-reID, with different clothing patterns and a synthesized dataset, Div-Market, with different clothing colors to simulate two types of clothing changes. The quantitative and qualitative results across 5 datasets (SMPL-reID, Div-Market, two benchmark re-ID datasets, and a cross-modality re-ID dataset) confirm the robustness and superiority of our approach against several state-of-the-art approaches.




1 Related Works

Person Re-ID.

Person re-ID has been widely studied in the literature. Existing methods typically focus on tackling the challenges of matching images with viewpoint and pose variations, or those with background clutter or occlusion [18, 5, 22, 13, 30, 2, 17, 23, 40, 31, 3, 29, 39, 35, 4, 20]. For example, Liu et al. [23] develop a pose-transferable deep learning framework based on GAN [9] to handle image pose variations. Chen et al. [3] integrate conditional random fields (CRF) and deep neural networks with multi-scale similarity metrics. Several attention-based methods [30, 17, 31] are further proposed to focus on learning discriminative image features to mitigate the effect of background clutter. While promising results have been observed, the above approaches cannot easily be applied to the clothing dependence problem because they lack the ability to suppress the visual differences across clothes and clothing colors.

Cross-modality Re-ID.

Several cross-modality methods [42, 45, 46, 6, 38, 10] have been proposed to address color variations. Varior et al. [36] learn color patterns from pixels sampled from images across camera views, addressing the effect of different illuminations on the perceived color of subjects. Wu et al. [42] built the first cross-modality RGB-IR benchmark dataset, named SYSU-MM01. They also analyze three different network structures and propose deep zero-padding for evolving domain-specific structure automatically in a one-stream network optimized for RGB-IR re-ID tasks. The works in [45, 46] propose modality-specific and modality-shared metric losses and a new bi-directional dual-constrained top-ranking loss for RGB-Thermal person re-identification. The work in [6] introduces a cross-modality generative adversarial network (cmGAN) to reduce the distribution divergence of RGB and IR features. Recently, [38] also proposes joint pixel alignment and feature alignment to reduce the cross-modality variations. Yet, even though these methods successfully learn sensor-invariant features across modalities, they still cannot be used to address the issue of clothing dependence within a single modality.

Disentanglement Re-ID.

Recently, a number of models have been proposed to better represent specific disentangled features during re-ID [32, 49, 48, 47, 16, 44, 41]. Ma et al. [25] generate person images by disentangling the input into foreground, background and pose with a complex multi-branch model which is not end-to-end trainable. Ge et al. [8] and Li et al. [19] learn pose-invariant features with guided image information. Zheng et al. [52] propose a joint learning framework named DG-Net that couples re-ID learning and data generation end-to-end. Their model involves a generative module that separately encodes each person into an appearance code and a structure code, which leads to improved re-ID performance. However, their appearance encoder, which is the one used to perform re-ID, is still dominated by the clothing-color features of the input images. Based on the above observations, we choose to learn clothing-color invariant features using a novel and unified model. By disentangling the body shape representation, re-ID can be successfully performed in the scenario of clothing change even if no ground-truth images containing clothing changes are available as training data.

Figure 4: Overview of the proposed Color Agnostic Shape Extraction Network (CASE-Net). The shape encoder $E_S$ encodes input images across different color domains/datasets ($x^{rgb}$ and $x^{gray}$) and produces color-invariant features ($f_s^{rgb}$ and $f_s^{gray}$). The color encoder $E_C$ encodes the RGB images ($x'^{rgb}$) and produces the color-related feature $f'_c$. The feature discriminator $D_F$ then determines whether the input color-invariant features ($f_s^{rgb}$ and $f_s^{gray}$) are from the same distribution. Finally, the generator $G$ jointly takes the color-invariant feature ($f_s^{gray}$) derived from the gray-scaled image and the color-related feature ($f'_c$) from the RGB input, producing the synthesized RGB output image while being jointly trained with an additional image discriminator ($D_I$).

2 CASE-Net

For the sake of completeness, we first define the notation used in this paper. In the training stage, we have access to a set of $N$ RGB images $X_{rgb} = \{x_i^{rgb}\}_{i=1}^{N}$ and its corresponding label set $Y_{rgb} = \{y_i^{rgb}\}_{i=1}^{N}$, where $x_i^{rgb} \in \mathbb{R}^{H \times W \times 3}$ and $y_i^{rgb}$ are the $i$-th RGB image and its label, respectively. To allow our model to handle images of different color variations, we generate a gray-scaled image set $X_{gray} = \{x_i^{gray}\}_{i=1}^{N}$ by multiplying each image from $X_{rgb}$ with RGB channel summation factors, followed by duplicating the single channel back to the original image size (i.e., $x_i^{gray} \in \mathbb{R}^{H \times W \times 3}$). Naturally, the label set $Y_{gray}$ for $X_{gray}$ is identical to $Y_{rgb}$. In order to achieve body-shape distilling via image generation, we also sample another set of RGB images $X'_{rgb} = \{x_i'^{rgb}\}_{i=1}^{N}$, whose label set $Y'_{rgb}$ is the same as $Y_{rgb}$ but whose images differ in pose and viewpoint.
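The gray-scale conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text only mentions "RGB channel summation factors", so the ITU-R BT.601 luma weights used here are an assumption, and the function name `to_three_channel_gray` is hypothetical.

```python
# Assumed channel weights (ITU-R BT.601 luma); the paper does not state
# which summation factors it uses.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def to_three_channel_gray(image):
    """Convert an H x W grid of (r, g, b) pixels to 3-channel gray.

    Each pixel is reduced to one weighted sum, and that value is then
    duplicated over three channels so the output keeps the input shape.
    """
    gray = []
    for row in image:
        gray_row = []
        for r, g, b in row:
            v = LUMA_WEIGHTS[0] * r + LUMA_WEIGHTS[1] * g + LUMA_WEIGHTS[2] * b
            gray_row.append((v, v, v))  # duplicate the single channel
        gray.append(gray_row)
    return gray


# Usage: a 1 x 2 image with one pure-red and one pure-green pixel.
example = to_three_channel_gray([[(255, 0, 0), (0, 255, 0)]])
```

Keeping three identical channels means the same encoder can consume RGB and gray-scaled inputs without any architectural change.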

As depicted in Figure 4, CASE-Net consists of five components: (1) the shape encoder $E_S$, (2) the color encoder $E_C$, (3) the feature discriminator $D_F$, (4) the image generator $G$, and (5) the image discriminator $D_I$. We now describe how these components work together to learn a body shape feature that can be used for re-ID in domains that do not use color. Training CASE-Net yields both a shape encoding and a color encoding of an image of a person. However, we are primarily interested in the body shape feature, since it can be reused for cross-domain (non-color-dependent) re-ID tasks.

2.1 Clothing color adaptation in re-ID

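Shape encoder ($E_S$).

To utilize the label information of the training data for person re-ID, we employ a classification loss on the output feature vectors of the shape encoder ($f_s^{rgb}$ and $f_s^{gray}$). With the person identity as ground-truth information, we compute the negative log-likelihood between the predicted label $\tilde{y} \in \mathbb{R}^{K}$ and the ground-truth one-hot vector $\hat{y} \in \{0, 1\}^{K}$, and define the identity loss $\mathcal{L}_{id}$ as

$$\mathcal{L}_{id} = -\mathbb{E}\left[\sum_{k=1}^{K} \hat{y}_k \log \tilde{y}_k\right], \tag{1}$$

where the expectation is taken over both the RGB and the gray-scaled training images and $K$ is the number of identities (classes).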
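As a concrete illustration of the identity loss, the sketch below computes the mean negative log-likelihood of the true identity from raw classifier scores. It assumes a softmax classifier on top of the shape feature, which the text does not spell out; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identity_loss(batch_logits, batch_labels):
    """Mean negative log-likelihood of the ground-truth identity.

    batch_logits: list of K-length score lists (one per image).
    batch_labels: list of integer identity indices in [0, K).
    With a one-hot target, the sum over classes reduces to the
    log-probability of the true class.
    """
    total = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        total += -math.log(probs[label])
    return total / len(batch_labels)


# Usage: with K = 4 and uniform scores the loss equals log(4).
uniform_loss = identity_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```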

To further enhance the discriminative property, we impose a triplet loss $\mathcal{L}_{tri}$ on the feature vector $f_s$, which maximizes the inter-class discrepancy while minimizing the intra-class distance. More specifically, for each input image $x$, we sample a positive image $x_{pos}$ with the same identity label and a negative image $x_{neg}$ with a different identity label to form a triplet. The distances between $x$ and $x_{pos}$/$x_{neg}$ are then computed as

$$d_{pos} = \| f_{x} - f_{x_{pos}} \|_2, \tag{2}$$

$$d_{neg} = \| f_{x} - f_{x_{neg}} \|_2, \tag{3}$$

where $f_{x}$, $f_{x_{pos}}$, and $f_{x_{neg}}$ represent the feature vectors of images $x$, $x_{pos}$, and $x_{neg}$, respectively. With the above definitions, the triplet loss is defined as

$$\mathcal{L}_{tri} = \mathbb{E}\left[\max(0,\; m + d_{pos} - d_{neg})\right], \tag{4}$$

where $m > 0$ is the margin that defines the required gap between the distance of the positive image pair $d_{pos}$ and the distance of the negative image pair $d_{neg}$.
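The triplet loss for a single (anchor, positive, negative) tuple can be sketched as below. This is the standard hinge-style formulation with Euclidean distances, which matches the description above; the margin values in the usage lines are arbitrary examples, not the paper's setting.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge on the gap between positive and negative distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# Usage: d_pos = 5 and d_neg = 10, so the gap is 5.
satisfied = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=2.0)
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=7.0)
```

A margin smaller than the existing gap yields zero loss, so only triplets that violate the margin contribute gradients.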

Feature discriminator ($D_F$).

Next, since our goal is to derive body-shape representations which do not depend on clothing color, we first learn a color-invariant representation by encouraging the shape encoder $E_S$ to generate similar feature distributions when observing both $X_{rgb}$ and $X_{gray}$. To achieve this, we adopt an adversarial learning strategy and deploy a feature discriminator $D_F$ in the latent feature space. This discriminator takes the feature vectors $f_s^{rgb}$ and $f_s^{gray}$ as inputs and determines whether each input feature vector comes from $X_{rgb}$ or $X_{gray}$. To be more precise, we define the feature-level adversarial loss as

$$\mathcal{L}_{adv}^{D_F} = \mathbb{E}_{x^{rgb} \sim X_{rgb}}\left[\log D_F(f_s^{rgb})\right] + \mathbb{E}_{x^{gray} \sim X_{gray}}\left[\log\left(1 - D_F(f_s^{gray})\right)\right], \tag{5}$$

where $f_s^{rgb} = E_S(x^{rgb})$ and $f_s^{gray} = E_S(x^{gray})$ denote the encoded RGB and gray-scaled image features, respectively. (For simplicity, we omit the subscript $i$, denote RGB and gray-scaled images as $x^{rgb}$ and $x^{gray}$, and represent their corresponding labels as $y^{rgb}$ and $y^{gray}$.) With the loss $\mathcal{L}_{adv}^{D_F}$, the feature discriminator $D_F$ learns to distinguish the features from the two distributions while the shape encoder $E_S$ aligns the feature distributions across color variations, so that color-invariant representations for clothing are learned in an adversarial manner.
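To make the adversarial objective concrete, the sketch below evaluates a standard GAN-style log-likelihood objective from the discriminator's output probabilities. It assumes the usual formulation (the discriminator maximizes this value, the encoder works against it); the function name and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
import math


def feature_adversarial_objective(p_rgb, p_gray, eps=1e-12):
    """GAN-style objective on discriminator outputs.

    p_rgb:  discriminator probabilities for features of RGB images.
    p_gray: discriminator probabilities for features of gray images.
    Returns mean log D(f_rgb) + mean log(1 - D(f_gray)). The
    discriminator tries to increase this value; the encoder tries to
    decrease it by making the two feature sets indistinguishable.
    """
    term_rgb = sum(math.log(p + eps) for p in p_rgb) / len(p_rgb)
    term_gray = sum(math.log(1.0 - p + eps) for p in p_gray) / len(p_gray)
    return term_rgb + term_gray


# Usage: a discriminator at chance level versus a confident one.
at_chance = feature_adversarial_objective([0.5, 0.5], [0.5, 0.5])
confident = feature_adversarial_objective([0.99], [0.01])
```

When the encoder has fully aligned the two distributions, the best the discriminator can do is output 0.5 everywhere, which gives the chance-level value of 2 log(0.5).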

2.2 Pose guidance for body shape disentanglement

Color encoder ($E_C$).

To ensure that the derived feature is body-shape related in clothing-color changing tasks, we need to perform additional body-shape disentanglement during the learning of CASE-Net. That is, the color encoder $E_C$ in Fig. 4 encodes the inputs from the RGB image set $X'_{rgb}$ into color-related features $f'_c$. As a result, both gray-scaled body-shape features and color features are produced in the latent space. Inspired by DG-Net [52], which uses gray-scaled images to achieve body-shape disentanglement across pose variations, we similarly enforce our generator $G$ to produce person images conditioned on the encoded color feature coming from a different pose. To be precise, the generator takes the concatenated shape and color feature pair $(f_s^{gray}, f'_c)$ and outputs the corresponding image $\hat{x}^{rgb}$.

Image generator ($G$).

Since we have ground-truth labels (i.e., image pair correspondences) from the training data, we can perform an image recovery task. Given two images $x^{rgb}$ and $x'^{rgb}$ of the same person but with different poses, we expect them to share the same body-shape feature $f_s^{gray}$. Given the desirable feature pair $(f_s^{gray}, f'_c)$, we then enforce $G$ to output the image $x'^{rgb}$ using the body-shape feature $f_s^{gray}$ which is originally associated with $x^{rgb}$. This is referred to as pose-guided image recovery.

With the above discussion, the image reconstruction loss $\mathcal{L}_{rec}$ can be calculated as

$$\mathcal{L}_{rec} = \mathbb{E}\left[\left\| \hat{x}^{rgb} - x'^{rgb} \right\|_1\right], \tag{6}$$

where $\hat{x}^{rgb}$ denotes $G(f_s^{gray}, f'_c)$. Note that we adopt the L1 norm in the above reconstruction loss as it preserves image sharpness [12].
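A reconstruction loss of this kind reduces to a sum of absolute pixel differences between the synthesized and the target image. The sketch below operates on flattened pixel lists and assumes the L1 norm; whether the authors sum or average over pixels is not stated, so both variants are shown under hypothetical names.

```python
def l1_reconstruction_loss(synthesized, target):
    """Sum of absolute differences between two flattened images."""
    if len(synthesized) != len(target):
        raise ValueError("images must have the same number of values")
    return sum(abs(s - t) for s, t in zip(synthesized, target))


def mean_l1_reconstruction_loss(synthesized, target):
    """Size-normalized variant: average absolute difference per value."""
    return l1_reconstruction_loss(synthesized, target) / len(target)


# Usage: differences are 0, 2 and 2, so the L1 loss is 4.
loss = l1_reconstruction_loss([1.0, 2.0, 3.0], [1.0, 0.0, 5.0])
```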

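Image discriminator ($D_I$).

To further enforce perceptual content recovery, we produce perceptually realistic outputs by having the image discriminator $D_I$ discriminate between the real images $x'^{rgb}$ and the synthesized ones $\hat{x}^{rgb}$. To this end, we have both a reconstruction loss and a perceptual discriminator loss for image recovery. The image perceptual discriminator loss $\mathcal{L}_{adv}^{D_I}$ is defined as

$$\mathcal{L}_{adv}^{D_I} = \mathbb{E}\left[\log D_I(x'^{rgb})\right] + \mathbb{E}\left[\log\left(1 - D_I(\hat{x}^{rgb})\right)\right]. \tag{7}$$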
To perform person re-ID in the testing phase, our network encodes the query image with the shape encoder $E_S$ to derive the body shape feature $f_s$, which is then matched against the gallery features via nearest neighbor search (in Euclidean distance).
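The test-time matching step can be sketched as a plain nearest-neighbor ranking. The example assumes that query and gallery features have already been extracted by the shape encoder; the function name `rank_gallery` is hypothetical.

```python
import math


def rank_gallery(query_feature, gallery_features):
    """Return gallery indices sorted by increasing Euclidean distance."""
    distances = [
        math.sqrt(sum((q - g) ** 2 for q, g in zip(query_feature, feat)))
        for feat in gallery_features
    ]
    return sorted(range(len(gallery_features)), key=lambda i: distances[i])


# Usage: the gallery entry at index 1 is closest to the query.
ranking = rank_gallery([0.0, 0.0], [[5.0, 5.0], [1.0, 0.0], [2.0, 2.0]])
```

The first index in the returned list is the rank-1 match; the full ordering is what ranking-based metrics are computed from.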

It is important to note that the goal of CASE-Net is to perform re-ID in clothing-changing scenarios without observing ground-truth training data that contain clothing changes. By introducing the aforementioned network modules, CASE-Net becomes capable of performing re-ID in environments with clothing changes. More precisely, the joint training of the encoders, the generator, and the feature discriminator allows our model to learn a body-structural representation. The pseudo code for training CASE-Net with the above losses is summarized in Algorithm 1, which involves two hyper-parameters.

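Algorithm 1: Learning of CASE-Net
Data: Image sets: $X_{rgb}$, $X_{gray}$, $X'_{rgb}$; Label sets: $Y_{rgb}$, $Y'_{rgb}$
Result: Configurations of CASE-Net
1   initialize $\theta_{E_S}$, $\theta_{E_C}$, $\theta_{D_F}$, $\theta_{G}$, $\theta_{D_I}$
2   for number of training iterations do
3       sample $x^{rgb}$, $x^{gray}$, $x'^{rgb}$, $y^{rgb}$, $y'^{rgb}$ from $X_{rgb}$, $X_{gray}$, $X'_{rgb}$, $Y_{rgb}$, $Y'_{rgb}$
4       obtain $f_s^{rgb}$, $f_s^{gray}$, $f'_c$ by $E_S(x^{rgb})$, $E_S(x^{gray})$, $E_C(x'^{rgb})$
5       calculate $\mathcal{L}_{id}$, $\mathcal{L}_{tri}$ by (1), (4)
7       obtain $\hat{x}^{rgb}$ by $G(f_s^{gray}, f'_c)$
8       calculate $\mathcal{L}_{adv}^{D_F}$, $\mathcal{L}_{rec}$, $\mathcal{L}_{adv}^{D_I}$ by (5), (6), (7)
9       for iterations of updating the generator do
13      for iterations of updating the discriminator do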
| Method | SMPL-reID R1 | SMPL-reID R5 | SMPL-reID R10 | SMPL-reID mAP | Div-Market R1 | Div-Market R5 | Div-Market R10 | Div-Market mAP |
|---|---|---|---|---|---|---|---|---|
| Verif-Identif [54] (TOMM’17) | 19.0 | 35.6 | 43.9 | 4.2 | 9.2 | 23.9 | 34.6 | 1.0 |
| SVDNet [34] (ICCV’17) | 20.7 | 46.0 | 59.4 | 5.3 | 9.8 | 25.1 | 35.5 | 1.3 |
| FD-GAN [8] (NIPS’18) | 21.2 | 46.5 | 59.9 | 5.1 | 14.3 | 26.4 | 36.5 | 1.6 |
| Part-aligned [33] (ECCV’18) | 23.7 | 47.3 | 60.6 | 5.5 | 14.9 | 27.4 | 36.1 | 1.8 |
| PCB [35] (ECCV’18) | 25.5 | 48.9 | 61.9 | 5.9 | 15.7 | 27.0 | 39.5 | 1.7 |
| DG-Net [52] (CVPR’19) | 27.2 | 51.3 | 63.3 | 6.2 | 19.7 | 30.1 | 47.5 | 2.2 |
| cmGAN* [6] (IJCAI’18) | 29.6 | 55.5 | 65.3 | 8.2 | 23.7 | 37.0 | 50.5 | 2.9 |
| AlignGAN* [38] (ICCV’19) | 41.8 | 63.3 | 72.2 | 12.6 | 26.0 | 45.3 | 61.0 | 3.4 |
| Ours | 62.0 | 77.8 | 81.5 | 28.1 | 56.2 | 71.5 | 79.2 | 13.5 |

Table 1: Quantitative results of person re-ID on the SMPL-reID and Div-Market datasets. Note that all the reported results are reproduced using released codes available online. * indicates replacement of IR images with gray-scaled ones during training.
| Method | RGB/RGB R1 | RGB/RGB R5 | RGB/RGB R10 | RGB/RGB mAP | Gray/RGB R1 | Gray/RGB R5 | Gray/RGB R10 | Gray/RGB mAP | RGB/Gray R1 | RGB/Gray R5 | RGB/Gray R10 | RGB/Gray mAP | Gray/Gray R1 | Gray/Gray R5 | Gray/Gray R10 | Gray/Gray mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Verif-Identif [54] (TOMM’17) | 79.5 | 86.0 | 90.3 | 61.5 | 10.2 | 15.4 | 21.1 | 7.8 | 19.5 | 35.6 | 43.9 | 10.9 | 42.5 | 61.3 | 74.2 | 20.6 |
| SVDNet [34] (ICCV’17) | 82.2 | 92.3 | 93.9 | 62.4 | 10.1 | 13.2 | 22.5 | 8.9 | 18.9 | 36.5 | 45.4 | 11.0 | 42.0 | 62.7 | 72.1 | 21.1 |
| FD-GAN [8] (NIPS’18) | 90.5 | 96.0 | 97.7 | 77.9 | 12.4 | 19.6 | 23.8 | 10.1 | 30.5 | 50.1 | 59.6 | 18.4 | 49.7 | 69.8 | 76.2 | 23.2 |
| Part-aligned [33] (ECCV’18) | 93.8 | 97.7 | 98.3 | 79.9 | 14.1 | 22.5 | 27.9 | 11.6 | 36.6 | 58.7 | 67.4 | 20.0 | 51.3 | 73.4 | 80.4 | 26.5 |
| PCB [35] (ECCV’18) | 93.2 | 97.3 | 98.2 | 81.7 | 13.6 | 22.4 | 27.4 | 10.6 | 35.5 | 56.2 | 65.1 | 19.3 | 50.2 | 72.9 | 80.1 | 26.2 |
| DG-Net [52] (CVPR’19) | 94.4 | 98.4 | 98.9 | 85.2 | 15.1 | 23.6 | 29.4 | 12.1 | 37.7 | 59.8 | 68.5 | 22.9 | 52.9 | 73.8 | 81.5 | 27.5 |
| cmGAN* [6] (IJCAI’18) | 82.1 | 92.5 | 94.1 | 61.8 | 67.2 | 83.5 | 88.6 | 46.3 | 70.4 | 86.8 | 91.5 | 46.5 | 70.8 | 86.2 | 90.1 | 46.7 |
| AlignGAN* [38] (ICCV’19) | 89.3 | 95.4 | 97.2 | 74.3 | 77.2 | 89.6 | 94.7 | 57.0 | 79.4 | 90.5 | 92.1 | 55.1 | 79.8 | 91.8 | 94.0 | 57.1 |
| Ours | 94.6 | 98.9 | 99.1 | 85.7 | 80.4 | 93.0 | 95.9 | 60.3 | 81.4 | 93.9 | 97.4 | 60.8 | 81.6 | 93.7 | 95.7 | 60.5 |

Table 2: Quantitative results of person re-ID on the Market-1501 dataset. Column groups are labeled Query/Gallery: RGB/RGB is the standard re-ID evaluation, and the other three groups form the extended re-ID setting. Note that all the reported results are reproduced using released codes available online. * indicates replacement of IR images with gray-scaled ones during training.

3 Experiments

3.1 Datasets

To evaluate our proposed method, we conduct experiments on our two synthesized datasets, SMPL-reID and Div-Market, and on two benchmark re-ID datasets, Market-1501 [50] and DukeMTMC-reID [53, 28], which are commonly used in recent re-ID work. We additionally conduct experiments on one cross-modality dataset, SYSU-MM01 [42], to assess the generalization of our model when it learns the body shape representation.


SMPL-reID is our synthetic dataset that simulates clothing changes for person re-ID. We render each identity across viewpoints (different shooting angles from a top view) and walking poses using SMPL [24]; details of the SMPL model can be found in [24]. Each identity is rendered using [14] and paired with a selected background image. For the shape parameters, we sample from the "walking" class of the AMASS dataset [26]. The identities are rendered from 6 different viewpoints (see examples in Fig. 2). The training identities contain no clothing changes, while the remaining identities, which form the testing set, do contain clothing changes.


Div-Market is our small synthesized dataset derived from Market-1501. We use a generative model similar to [52] to change the clothing color in the images of each identity. It contains a total of 24,732 images of 200 identities, each with hundreds of images, and it is used only for testing.


Market-1501 [50] is composed of 32,668 labeled images of 1,501 identities collected from 6 camera views. The dataset is split into two fixed, non-overlapping parts: 12,936 images from 751 identities for training and 19,732 images from 750 identities for testing. In testing, 3,368 query images from 750 identities are used to retrieve the matching persons in the gallery.


DukeMTMC-reID [53, 28] is also a large-scale re-ID dataset. It is collected from 8 cameras and contains 36,411 labeled images belonging to 1,404 identities. It consists of 16,522 training images from 702 identities, 2,228 query images from the other 702 identities, and 17,661 gallery images.


The SYSU-MM01 [42] dataset is the first benchmark for cross-modality (RGB-IR) re-ID. It is captured by 6 cameras, two IR cameras and four RGB ones, and contains 491 persons with a total of 287,628 RGB images and 15,792 IR images. The training set consists of 32,451 images in total (19,659 RGB images and 12,792 IR images) from 395 identities, and the test set contains 96 identities.

3.2 Implementation Details

We implement our model using PyTorch. Following Section 2, we use a ResNet pre-trained on ImageNet as the backbone of the shape encoder $E_S$ and the color encoder $E_C$. Given an input image (all images are resized to a common width, height, and channel size), $E_S$ encodes the input into the shape feature $f_s$. The generator is built from convolution-residual blocks similar to those proposed by Miyato et al. [27]. The image discriminator $D_I$ employs a ResNet as its backbone, while the shared feature discriminator $D_F$ is composed of convolution blocks. All five components are randomly initialized. The margin $m$ of the triplet loss and the two loss-weighting hyper-parameters are set to fixed values. The performance of our method could possibly be further improved by applying pre/post-processing methods, attention mechanisms, or re-ranking techniques. However, no such techniques are used in any of our experiments.

3.3 Evaluation Settings and Protocol.

For our rendered SMPL-reID, we train the model on the training set and evaluate it on the testing set. For our synthesized testing set Div-Market, we evaluate models trained only on Market-1501 on this clothing-color changing dataset. For Market-1501, we augment the testing set by converting the RGB images into gray-scaled ones. That is, in addition to the standard evaluation setting where both the probe (query) and the gallery are RGB, we conduct extended experiments with Gray/RGB, RGB/Gray, and Gray/Gray as probe/gallery sets to evaluate the generalization of current re-ID models. For SYSU-MM01, there are two test modes, i.e., all-search mode and indoor-search mode. For the all-search mode, all testing images are used. For the indoor-search mode, only indoor images from the 1st, 2nd, 3rd, and 6th cameras are used. The single-shot and multi-shot settings are adopted in both modes. Both modes use IR images as the probe set and RGB images as the gallery set.

We employ the standard metrics used in most person re-ID literature, namely the cumulative matching characteristic (CMC) curve, which gives the ranking accuracy, and the mean Average Precision (mAP). We report rank-k accuracy (including rank-1) and mAP on all datasets.
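For reference, rank-k accuracy and mAP can be computed from per-query rankings as sketched below. This is a simplified illustration: it assumes each query comes with the identity labels of the gallery images already sorted by distance, and it omits the same-camera filtering that standard benchmark protocols apply. All function names are hypothetical.

```python
def rank_k_hit(ranked_labels, query_label, k):
    """1.0 if a correct identity appears among the top-k results."""
    return 1.0 if query_label in ranked_labels[:k] else 0.0


def average_precision(ranked_labels, query_label):
    """Average of the precision values at each correct match."""
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(rankings, query_labels, ks=(1, 5, 10)):
    """Return ({k: rank-k accuracy}, mAP) averaged over all queries."""
    n = len(query_labels)
    cmc = {
        k: sum(rank_k_hit(r, q, k) for r, q in zip(rankings, query_labels)) / n
        for k in ks
    }
    mean_ap = sum(
        average_precision(r, q) for r, q in zip(rankings, query_labels)
    ) / n
    return cmc, mean_ap


# Usage: two queries, each with a gallery ranking of three labels.
cmc, mean_ap = evaluate([["b", "a", "a"], ["c", "d", "c"]], ["a", "c"])
```

Rank-k accuracy only asks whether any correct match appears in the top k, whereas mAP rewards placing all correct matches early, which is why the two can diverge strongly.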

| Method | RGB/RGB R1 | RGB/RGB R5 | RGB/RGB R10 | RGB/RGB mAP | Gray/RGB R1 | Gray/RGB R5 | Gray/RGB R10 | Gray/RGB mAP | RGB/Gray R1 | RGB/Gray R5 | RGB/Gray R10 | RGB/Gray mAP | Gray/Gray R1 | Gray/Gray R5 | Gray/Gray R10 | Gray/Gray mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Verif-Identif [54] (TOMM’17) | 68.7 | 81.5 | 84.2 | 49.8 | 8.9 | 15.4 | 20.3 | 6.2 | 16.1 | 33.9 | 43.5 | 8.1 | 35.4 | 52.3 | 60.0 | 17.4 |
| SVDNet [34] (ICCV’17) | 76.5 | 87.1 | 90.4 | 57.0 | 9.1 | 16.3 | 20.5 | 6.9 | 16.5 | 36.4 | 45.8 | 9.6 | 37.0 | 52.8 | 60.9 | 17.5 |
| FD-GAN [8] (NIPS’18) | 80.8 | 89.8 | 92.7 | 63.3 | 9.4 | 17.1 | 22.1 | 7.8 | 19.5 | 36.9 | 46.2 | 11.0 | 35.1 | 53.2 | 61.8 | 18.1 |
| Part-aligned [33] (ECCV’18) | 83.5 | 92.0 | 93.9 | 69.2 | 10.5 | 16.9 | 20.6 | 7.4 | 20.1 | 38.0 | 46.3 | 10.9 | 34.7 | 52.4 | 61.5 | 18.8 |
| PCB [35] (ECCV’18) | 82.9 | 91.1 | 93.6 | 67.1 | 9.1 | 17.4 | 21.4 | 6.6 | 19.4 | 37.7 | 46.4 | 10.8 | 34.2 | 52.6 | 60.9 | 18.4 |
| DG-Net [52] (CVPR’19) | 86.3 | 93.2 | 95.5 | 75.1 | 12.5 | 18.6 | 23.9 | 8.1 | 21.5 | 38.3 | 47.2 | 12.5 | 38.6 | 55.1 | 64.0 | 21.5 |
| cmGAN* [6] (IJCAI’18) | 74.1 | 86.2 | 88.5 | 54.1 | 58.8 | 76.1 | 79.3 | 38.2 | 60.5 | 77.0 | 83.6 | 39.0 | 60.6 | 76.9 | 81.7 | 39.0 |
| AlignGAN* [38] (ICCV’19) | 80.1 | 87.6 | 90.5 | 58.1 | 63.8 | 79.4 | 82.8 | 47.1 | 62.5 | 80.3 | 84.8 | 44.1 | 63.8 | 80.1 | 84.2 | 42.5 |
| Ours | 86.4 | 93.7 | 95.6 | 75.4 | 65.8 | 80.7 | 85.5 | 50.5 | 66.9 | 83.9 | 87.9 | 50.2 | 65.3 | 84.8 | 87.3 | 49.5 |

Table 3: Quantitative results of person re-ID on the DukeMTMC-reID dataset. Column groups are labeled Query/Gallery: RGB/RGB is the standard re-ID evaluation, and the other three groups form the extended re-ID setting. Note that all the reported results are reproduced using released codes available online. * indicates replacement of IR images with gray-scaled ones during training.

3.4 Comparisons with State-of-the-art Approaches

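| Method | All/Single R1 | All/Single R10 | All/Single R20 | All/Single mAP | All/Multi R1 | All/Multi R10 | All/Multi R20 | All/Multi mAP | Indoor/Single R1 | Indoor/Single R10 | Indoor/Single R20 | Indoor/Single mAP | Indoor/Multi R1 | Indoor/Multi R10 | Indoor/Multi R20 | Indoor/Multi mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HOG [7] (CVPR’05) | 2.8 | 18.3 | 32.0 | 4.2 | 3.8 | 22.8 | 37.6 | 2.2 | 3.2 | 24.7 | 44.5 | 7.3 | 4.8 | 29.1 | 49.4 | 3.5 |
| LOMO [21] (CVPR’15) | 3.6 | 23.2 | 37.3 | 4.5 | 4.7 | 28.3 | 43.1 | 2.3 | 5.8 | 34.4 | 54.9 | 10.2 | 7.4 | 40.4 | 60.4 | 5.6 |
| Two Stream Net [42] (ICCV’17) | 11.7 | 48.0 | 65.5 | 12.9 | 16.4 | 58.4 | 74.5 | 8.0 | 15.6 | 61.2 | 81.1 | 21.5 | 22.5 | 72.3 | 88.7 | 14.0 |
| One Stream Net [42] (ICCV’17) | 12.1 | 49.7 | 66.8 | 13.7 | 16.3 | 58.2 | 75.1 | 8.6 | 17.0 | 63.6 | 82.1 | 23.0 | 22.7 | 71.8 | 87.9 | 15.1 |
| Zero Padding [42] (ICCV’17) | 14.8 | 52.2 | 71.4 | 16.0 | 19.2 | 61.4 | 78.5 | 10.9 | 20.6 | 68.4 | 85.8 | 27.0 | 24.5 | 75.9 | 91.4 | 18.7 |
| cmGAN [6] (IJCAI’18) | 27.0 | 67.5 | 80.6 | 27.8 | 31.5 | 72.7 | 85.0 | 22.3 | 31.7 | 77.2 | 89.2 | 42.2 | 37.0 | 80.9 | 92.3 | 32.8 |
| AlignGAN [38] (ICCV’19) | 42.4 | 85.0 | 93.7 | 40.7 | 51.5 | 89.4 | 95.7 | 33.9 | 45.9 | 87.6 | 94.4 | 54.3 | 57.1 | 92.1 | 97.4 | 45.3 |
| Ours | 42.9 | 85.7 | 94.0 | 41.5 | 52.2 | 90.3 | 96.1 | 34.5 | 44.1 | 87.3 | 93.7 | 53.2 | 55.0 | 90.6 | 96.8 | 43.4 |

Table 4: Quantitative results of person re-ID on the cross-modality SYSU-MM01 dataset. Column groups are labeled search mode/shot setting (All = all-search, Indoor = indoor-search; Single = single-shot, Multi = multi-shot). To evaluate the generalization of our model in addressing cross-modality re-ID, we compare with existing models working on the IR-RGB scenario. * indicates the results are reproduced using the released codes. Bold and underlined numbers indicate top two results, respectively.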
| Method | Rank 1 | Rank 5 | Rank 10 | mAP |
|---|---|---|---|---|
| Ours (full model) | 56.2 | 61.5 | 69.2 | 13.5 |
| Ours w/o | 55.7 | 60.4 | 66.8 | 10.1 |
| Ours w/o | 54.0 | 58.1 | 66.3 | 8.7 |
| Ours w/o | 50.5 | 57.6 | 65.5 | 8.2 |
| Ours w/o | 49.8 | 51.5 | 64.1 | 7.3 |
| Ours w/o | 46.5 | 50.1 | 61.5 | 6.9 |

Table 5: Ablation study of the loss functions on the Div-Market dataset. Each row other than the full model corresponds to the model with exactly one loss excluded.

Figure 5: Visualization of structure feature vectors on Div-Market via t-SNE. Panels: (a) DG-Net [52] w.r.t. identity; (b) DG-Net [52] w.r.t. clothing; (c) Ours w.r.t. identity; (d) Ours w.r.t. clothing. In (a) and (c), each identity is shown in a unique color. In (b) and (d), five different appearances (clothing colors) are considered, and images with the same clothing are shown in the same color.


To simulate the real-world clothing-changing environment, we conduct re-ID experiments on our SMPL-reID dataset and compare against six state-of-the-art re-ID approaches and two cross-modality re-ID models. As shown on the left side of Table 1, our proposed CASE-Net outperforms all the compared methods by a large margin (62.0 Rank-1 versus 41.8 for the strongest baseline, AlignGAN). In addition, two phenomena can be observed. First, all the standard re-ID approaches perform poorly (Rank-1 no higher than 27.2), which indicates that they all suffer from the clothes/clothing-color mismatch problem. Second, although the two cross-modality methods show some improvement, their models cannot handle clothing-color changes within a single modality either.


For our synthesized Div-Market, we also compare our proposed method with the six standard re-ID approaches and the two cross-modality re-ID models. The results are reported on the right side of Table 1. The same phenomena are observed as on SMPL-reID.


We compare our proposed method with six current standard re-ID approaches and two cross-modality re-ID models whose codes are available online, and report the results in one standard and three extended settings on Market-1501. The standard approaches include Verif-Identif [54], SVDNet [34], Part-aligned [33], FD-GAN [8], PCB [35], and DG-Net [52], while the cross-modality models are cmGAN [6] and AlignGAN [38]. All the results are reported in Table 2, from which two observations can be made. First, in the standard setting the strongest standard re-ID methods outperform the two cross-modality approaches by a clear margin, but they suffer severe performance drops in the extended evaluation, which shows their vulnerability to color variations and their weak generalization when they are trained to overfit the clothing color. Second, our proposed CASE-Net outperforms all the methods in every setting, which demonstrates its ability to derive a body shape representation.


For the DukeMTMC-reID dataset, we also compare our proposed method with the six standard re-ID approaches and the two cross-modality re-ID models whose codes are available online, and report the results in one standard and three extended settings in Table 3. The same phenomena are observed as on Market-1501.


To assess the generalization of our CASE-Net to cross-modality person re-ID, we also conduct additional experiments on the SYSU-MM01 dataset. We compare our proposed CASE-Net with two hand-crafted features (HOG [7], LOMO [21]) and three cross-modality approaches (the SYSU model [42], cmGAN [6], and AlignGAN [38]). The results are reported in Table 4. Our method achieves results comparable to AlignGAN in the cross-modality re-ID setting: slightly better in the all-search mode and slightly worse in the indoor-search mode. It is worth repeating that our proposed CASE-Net, which is developed for clothing-color changes in re-ID, generalizes well to cross-modality re-ID.

3.5 Ablation Studies

Loss functions.

To further analyze the importance of each introduced loss function, we conduct an ablation study, shown in Table 5. Firstly, the image reconstruction loss is shown to be vital to our CASE-Net, since we observe performance drops on Div-Market when this loss is excluded. Without it, there is no explicit supervision to guide CASE-Net to generate human-perceivable images with body shape disentanglement, and the resulting model suffers from image-level information loss. Secondly, without the feature adversarial loss, our model is not able to perform feature-level color adaptation, which prevents it from learning a clothing-color invariant representation and results in a drop in re-ID performance. Thirdly, when either the identity loss or the triplet loss is turned off, our model is no longer supervised by both re-ID losses and its performance degrades, indicating that the joint use of the two streams of supervision achieves the best results. Lastly, the image adversarial loss is introduced to our CASE-Net to mitigate the perceptual image-level information loss.


We now visualize the feature vectors on our Div-Market in Figure 5 via t-SNE. It is worth repeating that in our synthesized Div-Market, the same identity can wear different clothing while several identities can wear the same clothing. In the figure, we select several person identities, each of which is indicated by a color. From Fig. 5(a) and Fig. 5(c), we observe that our projected feature vectors are well separated compared with those of DG-Net [52], which suggests that our model exhibits sufficient re-ID ability. On the other hand, in Fig. 5(b) and Fig. 5(d), we assign one color to each clothing style. It can be observed that our projected feature vectors of the same identity but different clothing are well clustered, while those of DG-Net [52] are not.

4 Conclusions

In this paper, we have presented a challenging yet significant person re-identification task which has long been ignored. We collect two re-ID datasets (SMPL-reID and Div-Market) for simulating real-world scenarios, which contain changes in clothes or clothing color. To address clothing changes in re-ID, we presented a novel Color Agnostic Shape Extraction Network (CASE-Net), which learns a body shape representation without training or fine-tuning on data containing clothing changes. By advancing adversarial learning and body shape disentanglement, our model achieves satisfactory performance on the collected datasets (SMPL-reID and Div-Market) and two re-ID benchmarks. Qualitative results also confirm that our model is capable of learning a body shape representation which is clothing-color invariant. Furthermore, the experimental results on one cross-modality dataset also demonstrate the generalization of our model to cross-modality re-ID.