
Clothes-Changing Person Re-identification with RGB Modality Only

The key to addressing clothes-changing person re-identification (re-id) is to extract clothes-irrelevant features, e.g., face, hairstyle, body shape, and gait. Most current works mainly focus on modeling body shape from multi-modality information (e.g., silhouettes and sketches), but do not make full use of the clothes-irrelevant information in the original RGB images. In this paper, we propose a Clothes-based Adversarial Loss (CAL) to mine clothes-irrelevant features from the original RGB images by penalizing the predictive power of the re-id model w.r.t. clothes. Extensive experiments demonstrate that, using RGB images only, CAL outperforms all state-of-the-art methods on widely used clothes-changing person re-id benchmarks. Besides, compared with images, videos contain richer appearance and additional temporal information, which can be used to model proper spatiotemporal patterns to assist clothes-changing re-id. Since there is no publicly available clothes-changing video re-id dataset, we contribute a new dataset named CCVID and show that there is much room for improvement in modeling spatiotemporal information. The code and new dataset are available at: https://github.com/guxinqian/Simple-CCReID.


1 Introduction

Person re-identification (re-id) [Zheng2015Scalable, gu2019TKP, Hou2021FC] aims to search for a target person in surveillance videos across different locations and times. Most existing works [Sun2018Beyond, Gu2020AP3D, Hou2020TCL] assume that pedestrians do not change their clothes within a short period of time. However, if we want to re-identify a pedestrian over a long period of time, the clothes-changing problem cannot be avoided. Besides, the clothes-changing problem also exists in some short-term real-world scenarios, e.g., criminal suspects usually change their clothes to avoid being identified and tracked. Due to its crucial role in intelligent surveillance systems, clothes-changing person re-id [Yang2019PRCC, Fan2020Radio] has attracted increasing attention in recent years.


Figure 1: The visualization of (a) two original images, (b) the feature maps learned with identification loss only, and (c) the feature maps learned with identification loss and the proposed CAL. Note that all training settings of (b) and (c) are consistent except for the loss functions. (b) highlights only the face as a clothes-irrelevant feature, while (c) highlights more clothes-irrelevant features, e.g., face, hairstyle, and body shape. (Since different samples of the same person in the training set mostly wear the same shoes, shoes are also highlighted.)

Humans can distinguish their acquaintances even if these acquaintances wear clothes they have never seen before. The reason is that the human brain can decouple and utilize clothes-irrelevant features, e.g., face, hairstyle, body shape, and gait. To avoid the interference of clothes, some clothes-changing re-id methods [Hong2021Finegrained, Chen2021Learning3D] and gait recognition methods [Zhang2019Gait, Chao2019Gaitset] model body shape and gait from multi-modality inputs (e.g., skeletons [Qian2020LTCC], silhouettes [Chao2019Gaitset], radio signals [Fan2020Radio], contour sketches [Yang2019PRCC], and 3D shape [Chen2021Learning3D]) or by disentangled representation learning [Zhang2019Gait]. However, multi-modality-based methods need additional models or equipment to capture multi-modality information, and learning disentangled representations is usually time-consuming.

Actually, the original RGB modality contains rich clothes-irrelevant information which is largely underutilized by the current methods. As for some clothes-changing re-id methods [Qian2020LTCC, Chen2021Learning3D], although they use a strong backbone (i.e. ResNet [He2016Deep]) to extract features from the original images, without a properly designed loss function, the learned feature map only focuses on some simple clothes-irrelevant information, e.g., face (see Fig. 1 (b)), while other crucial clothes-irrelevant information is omitted. As for most gait recognition methods [Chao2019Gaitset, Han2006Individual], they usually discard the original input videos and resort to other modality inputs, e.g., silhouettes.

To better mine the clothes-irrelevant information in the RGB modality, in this paper, we propose a Clothes-based Adversarial Loss (CAL). Specifically, we add a clothes classifier after the backbone of the re-id model and define CAL as a multi-positive-class classification loss, where all clothes classes belonging to the same identity are mutually positive classes. To the best of our knowledge, this is the first work that uses multi-positive-class classification to formulate multi-class adversarial learning. During training, minimizing CAL can force the backbone of the re-id model to learn clothes-irrelevant features by penalizing the predictive power of the re-id model w.r.t. different clothes of the same identity. With backpropagation, the learned feature map can highlight more clothes-irrelevant features, e.g., hairstyle and body shape, compared with the feature map trained only with identification loss (see Fig. 1 (c)). Extensive experiments on widely used clothes-changing re-id benchmarks demonstrate that, using RGB images only, CAL outperforms all state-of-the-art methods.

Most current clothes-changing person re-id works [Yang2019PRCC, Yu2020COCAS, Qian2020LTCC] mainly focus on the image-based setting, where both query and gallery samples are images. However, in many real-world re-id scenarios, the query and gallery sets usually consist of many videos. Compared with images, videos contain richer appearance information and additional temporal information, so it is more promising to learn proper spatiotemporal patterns, e.g., gait, from videos, which may be helpful for clothes-changing re-id. Since there is no publicly available clothes-changing video re-id dataset, we reconstruct a new Clothes-Changing Video person re-ID (CCVID) dataset from the raw data of a gait recognition dataset (i.e., FVG [Zhang2019Gait]) and provide fine-grained clothes labels. Extensive evaluations of state-of-the-art methods show that utilizing the richer appearance information and additional temporal information can significantly boost the performance of clothes-changing person re-id. We hope CCVID can inspire more clothes-changing video person re-id studies in the future.

2 Related Work

Clothes-changing person re-identification. The core problem in clothes-changing re-id is extracting clothes-irrelevant features. To this end, [Zhang2019Gait, DGNet] attempt to use disentangled representation learning to decouple appearance and structural information from RGB images and consider the structural information as clothes-irrelevant features. In contrast, other researchers attempt to use multi-modality information (e.g., skeletons [Qian2020LTCC], silhouettes [Jin2021Cloth, Hong2021Finegrained], radio signals [Fan2020Radio], contour sketches [Yang2019PRCC], or 3D shape [Chen2021Learning3D]) to model body shape and extract clothes-irrelevant features. However, training with disentangled representation learning is time-consuming, and multi-modality-based methods need additional models or equipment to extract multi-modality information. Besides, these methods do not fully excavate and utilize the clothes-irrelevant features in the original images. In this paper, we propose a simple adversarial loss to decouple clothes-irrelevant features from the RGB modality.

Most current works [Hong2021Finegrained, Chen2021Learning3D, Yang2019PRCC, Qian2020LTCC, huang2019beyond, huang2019celebrities] mainly focus on image-based settings and only a few works [Fan2020Radio, Zhang2021Learning, Gou2016Person] focus on video-based settings. Besides, there are some publicly available clothes-changing person re-id datasets, e.g., PRCC [Yang2019PRCC], LTCC [Qian2020LTCC], and Celeb-reID [huang2019celebrities, huang2019beyond], but all of them are image-based datasets. In this paper, we contribute a large-scale clothes-changing video person re-id dataset and show that there exists much room to improve the spatiotemporal modeling for clothes-changing person re-id.

Video person re-identification. Most existing video person re-id methods [Wang2014Person, Gu2020AP3D, Hou2020TCL, Hou2019vrstc, Hou2020IAUNet] focus on the clothes-consistent setting. Some research [Chen2020AFA] demonstrates that appearance features play a more important role than motion features in this setting. Although it is straightforward to distinguish a person through the appearance of his/her clothes in the clothes-consistent setting, such deep re-id models usually overfit to clothes-relevant features and have limited application scenarios. In contrast, this paper focuses on the clothes-changing setting and proposes a solution based on learning clothes-irrelevant features.

Gait recognition. Gait recognition methods [Zhang2019Gait, Chao2019Gaitset] attempt to learn gait features from the video samples of persons. To avoid the interference of clothes, they usually discard the original RGB frames and model gait on skeletons [Yang2016Learning, Ariyanto2012Marionette], silhouettes [Chao2019Gaitset, Han2006Individual], or disentangled representations [Zhang2019Gait]. As for clothes-changing person re-id, all kinds of clothes-irrelevant features besides gait are useful. Therefore, in this paper, we attempt to mine more clothes-irrelevant features from the original RGB modality.

Adversarial learning. Adversarial learning was first proposed in GANs [Goodfellow2014GAN] to force the generative model to generate realistic images. In recent years, it has been used in various tasks, e.g., domain adaptation [Tzeng2017AdversarialDA, Long2018ConditionalDA], knowledge distillation [shen2019MEAL], and representation learning [Wang2019LearningRobust]. Specifically, in [Wang2019LearningRobust], Wang et al. propose PAR, which forces the model to focus on discriminative global information by penalizing the predictive power of local features. Inspired by PAR, this paper proposes a Clothes-based Adversarial Loss to decouple clothes-irrelevant features. Although both PAR and the proposed method belong to multi-class adversarial learning, their motivation and formulation are different; we discuss the differences in detail in Sec. 3.3.


Figure 2: The framework of the proposed method. In each iteration, we first optimize the clothes classifier by minimizing the clothes classification loss $\mathcal{L}_{C}$. Then, we fix the parameters of the clothes classifier and minimize the identification loss $\mathcal{L}_{ID}$ and the proposed $\mathcal{L}_{CAL}$ to force the backbone to learn clothes-irrelevant features.

3 Method

3.1 Framework and Notation

The framework of our method is shown in Fig. 2. In the framework, $f(\cdot;\theta_f)$ denotes the backbone with parameters $\theta_f$ and $\phi_{ID}(\cdot;\theta_{ID})$ denotes the identity classifier with parameters $\theta_{ID}$. Given a sample $x$, its identity label is denoted as $y$, and its clothes label is denoted as $y^{c}$. Note that we define the clothes class as a fine-grained identity class: all samples of the same identity are divided into different clothes classes belonging to this identity according to their clothes, and the number of clothes classes is the sum of the numbers of suits over all persons. The annotation of such clothes labels is easy, since they only need to be labeled among the samples of the same person, and different persons do not share the same clothes label even if they wear the same clothes. Given a sample $x$ with identity label $y$, existing re-id methods [Sun2018Beyond, Hou2019Interaction] define the identification loss $\mathcal{L}_{ID}$ as the cross entropy between the predicted identity $\phi_{ID}(f(x))$ and the identity label $y$, and train the re-id model by minimizing $\mathcal{L}_{ID}$.
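As a concrete illustration of this labeling scheme, the following minimal sketch indexes clothes classes globally from per-person suit annotations. It is an illustrative example rather than the released Simple-CCReID tooling; the tuple layout of `samples` is an assumption.

```python
# Minimal sketch: turn per-person suit annotations into global clothes-class
# indices. Each (identity, suit) pair gets its own class, so two people wearing
# identical clothes still receive different clothes labels.
from collections import OrderedDict

def build_clothes_classes(samples):
    """samples: list of (img_path, identity_id, suit_id_within_identity)."""
    mapping = OrderedDict()            # (identity, suit) -> global clothes class
    labeled = []
    for path, pid, suit in samples:
        key = (pid, suit)
        if key not in mapping:
            mapping[key] = len(mapping)
        labeled.append((path, pid, mapping[key]))
    num_clothes_classes = len(mapping)  # sum of suits over all identities
    return labeled, num_clothes_classes

samples = [("a.jpg", 0, 0), ("b.jpg", 0, 1), ("c.jpg", 1, 0)]
labeled, n_clothes = build_clothes_classes(samples)
print(labeled, n_clothes)               # clothes classes 0, 1, 2; n_clothes = 3
```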

As shown in Fig. 2, in addition to the identity classifier and the widely used identification loss $\mathcal{L}_{ID}$, a clothes classification loss $\mathcal{L}_{C}$ is used to train an additional clothes classifier, and the proposed Clothes-based Adversarial Loss (CAL), denoted $\mathcal{L}_{CAL}$, is used to force the backbone to decouple clothes-irrelevant features. We introduce CAL in detail in the next subsection.

3.2 Clothes-based Adversarial Loss

Existing clothes-changing person re-id [Yang2019PRCC, Qian2020LTCC] and gait recognition methods [Yang2016Learning, Chao2019Gaitset] do not make full use of the clothes-irrelevant information in the RGB modality. In this paper, we propose CAL to force the backbone of the re-id model to mine clothes-irrelevant information by penalizing the predictive power of the re-id model w.r.t. clothes. To this end, we add a new clothes classifier with parameters $\theta_{C}$ after the backbone. Each iteration in the training stage contains the following two-step optimization.

Training the clothes classifier. In the first step, we optimize the clothes classifier by minimizing the clothes classification loss $\mathcal{L}_{C}$ (the cross entropy between the predicted clothes class and the clothes label $y^{c}$). This process can be formulated as:

$$\min_{\theta_{C}} \; \mathcal{L}_{C}. \tag{1}$$

When we denote the feature $f(x_i)$ after $\ell_2$-normalization as $v_i$ and the weight of the $j$-th clothes class in the clothes classifier after $\ell_2$-normalization as $w_j$, $\mathcal{L}_{C}$ can be expressed as:

$$\mathcal{L}_{C} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(v_i^{\top} w_{y_i^{c}}/\tau)}{\sum_{j=1}^{N_C} \exp(v_i^{\top} w_j/\tau)}, \tag{2}$$

where $B$ is the batch size, $N_C$ is the number of clothes classes in the training set, and $\tau$ is a temperature parameter.
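For concreteness, the following PyTorch sketch implements Eq. (2). It is a minimal illustration rather than the released Simple-CCReID code; the temperature value in the usage example is an arbitrary placeholder.

```python
# Hedged sketch of Eq. (2): cosine-similarity (L2-normalized) cross entropy over
# clothes classes with temperature tau; the classifier weight matrix acts as one
# proxy vector per clothes class.
import torch
import torch.nn.functional as F

def clothes_classification_loss(features, clothes_labels, clothes_proxies, tau=0.1):
    # features: (B, D) backbone outputs; clothes_proxies: (Nc, D) classifier weights
    v = F.normalize(features, dim=1)          # l2-normalized features
    w = F.normalize(clothes_proxies, dim=1)   # l2-normalized class proxies
    logits = v @ w.t() / tau                  # (B, Nc) scaled cosine similarities
    return F.cross_entropy(logits, clothes_labels)

# toy usage (dimensions and tau are placeholders)
feats = torch.randn(4, 2048)
proxies = torch.randn(10, 2048)
labels = torch.tensor([0, 3, 3, 7])
print(clothes_classification_loss(feats, labels, proxies, tau=0.1))
```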

Learning clothes-irrelevant features. In the second step, we fix the parameters of the clothes classifier and force the backbone to learn clothes-irrelevant features. To this end, we should penalize the predictive power of the re-id model w.r.t. clothes. A naive idea is to define $\mathcal{L}_{CAL}$ as the opposite of $\mathcal{L}_{C}$ (i.e., $\mathcal{L}_{CAL} = -\mathcal{L}_{C}$) following [Wang2019LearningRobust], such that the trained clothes classifier cannot distinguish any clothes in the training set. In this way, we obtain a widely used min-max optimization problem. However, since the clothes class is defined as a fine-grained identity class, penalizing the predictive power of the re-id model w.r.t. all kinds of clothes also reduces its predictive power w.r.t. identity, which is harmful to re-id (we demonstrate this in Sec. 5.4). What we want instead is to make the trained clothes classifier unable to distinguish samples with the same identity but different clothes. So $\mathcal{L}_{CAL}$ should be a multi-positive-class classification loss, where all clothes classes belonging to the same identity are mutually positive classes. For example, given a sample $x_i$, all clothes classes belonging to its identity are defined as its positive clothes classes. Therefore, $\mathcal{L}_{CAL}$ can be formulated as:

$$\mathcal{L}_{CAL} = -\frac{1}{B}\sum_{i=1}^{B} \sum_{c \in S^{+}(y_i)} q(c) \log \frac{\exp(v_i^{\top} w_c/\tau)}{\sum_{j=1}^{N_C} \exp(v_i^{\top} w_j/\tau)}, \tag{3}$$

$$q(c) = \frac{1}{K}, \quad \forall\, c \in S^{+}(y_i), \tag{4}$$

where $S^{+}(y_i)$ ($S^{-}(y_i)$) is the set of clothes classes with the same identity as (different identities from) $x_i$, $K$ is the number of classes in $S^{+}(y_i)$, and $q(c)$ is the weight of the cross entropy loss for the $c$-th clothes class. With Eq. (4), the positive class with the same clothes as $x_i$ (i.e., $y_i^{c}$) and the positive classes with different clothes (i.e., $S^{+}(y_i)\setminus\{y_i^{c}\}$) have equal weight, i.e., $1/K$.

In a long-term person re-id system, clothes-consistent re-id and clothes-changing re-id are equally important. When we maximize the dot product between $v_i$ and the proxy of a positive class with different clothes, the accuracy of clothes-changing re-id can be improved, but the accuracy of clothes-consistent re-id may drop. To improve the clothes-changing re-id ability of the model without heavily reducing the clothes-consistent re-id accuracy, Eq. (4) can be replaced by:

$$q(c) = \begin{cases} 1-\epsilon, & c = y_i^{c}, \\ \dfrac{\epsilon}{K-1}, & c \in S^{+}(y_i)\setminus\{y_i^{c}\}, \end{cases} \tag{5}$$

where $\epsilon$ is a hyper-parameter. When $\epsilon = (K-1)/K$, Eq. (5) is equivalent to Eq. (4); with a smaller $\epsilon$, the positive class with the same clothes has a larger weight than the positive classes with different clothes.
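The PyTorch sketch below illustrates Eqs. (3)-(5). The weighting scheme follows the description above, but the variable names, the `clothes_to_identity` mapping, and the exact parameterization are assumptions for illustration rather than the released implementation.

```python
# Hedged sketch of CAL (Eqs. (3)-(5)): a multi-positive-class cross entropy where
# every clothes class of the sample's identity is positive; the ground-truth suit
# gets weight (1 - eps) and the remaining positive suits share eps.
import torch
import torch.nn.functional as F

def clothes_adversarial_loss(features, clothes_labels, identity_labels,
                             clothes_proxies, clothes_to_identity,
                             tau=0.1, eps=0.1):
    """features: (B, D); clothes_proxies: (Nc, D) fixed clothes-classifier weights;
    clothes_to_identity: (Nc,) identity that owns each clothes class."""
    v = F.normalize(features, dim=1)
    w = F.normalize(clothes_proxies, dim=1)
    log_prob = F.log_softmax(v @ w.t() / tau, dim=1)                 # (B, Nc)

    # positive classes: all clothes classes owned by the sample's identity
    positive = clothes_to_identity.unsqueeze(0) == identity_labels.unsqueeze(1)
    num_pos = positive.sum(dim=1, keepdim=True).clamp(min=2)         # K per sample
    q = torch.zeros_like(log_prob)
    q[positive] = (eps / (num_pos - 1).float()).expand_as(q)[positive]
    q.scatter_(1, clothes_labels.unsqueeze(1), 1.0 - eps)            # ground-truth suit
    return -(q * log_prob).sum(dim=1).mean()

# toy usage: 6 clothes classes owned by identities [0, 0, 1, 1, 2, 2]
clothes_to_identity = torch.tensor([0, 0, 1, 1, 2, 2])
feats, proxies = torch.randn(4, 2048), torch.randn(6, 2048)
clothes, ids = torch.tensor([0, 1, 2, 4]), torch.tensor([0, 0, 1, 2])
print(clothes_adversarial_loss(feats, clothes, ids, proxies, clothes_to_identity))
```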

While optimizing CAL, the identity classifier is also optimized. Therefore, the optimization process of the second step is:

$$\min_{\theta_f,\, \theta_{ID}} \; \mathcal{L}_{ID} + \mathcal{L}_{CAL}. \tag{6}$$

Note that $\mathcal{L}_{ID}$ and $\mathcal{L}_{CAL}$ have some affinity in learning clothes-irrelevant features. When we only use $\mathcal{L}_{ID}$ for training, the model tends to learn easy samples (with the same clothes) in the early stage of optimization and then gradually learns to distinguish hard samples (with the same identity but different clothes). This is consistent with curriculum learning [Bengio2009Curriculum]. The objective of $\mathcal{L}_{CAL}$ is to pull the features with the same identity closer, which is similar to $\mathcal{L}_{ID}$. Even so, we do not discard $\mathcal{L}_{ID}$ in Eq. (6): only minimizing $\mathcal{L}_{CAL}$ and forcing the model to distinguish hard samples in the early stage of optimization may lead to a local optimum. Instead, we add $\mathcal{L}_{CAL}$ to the training objective after the first reduction of the learning rate in our experiments.
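Putting the two steps together, a training iteration could look like the hypothetical loop below. `clothes_loss_fn` and `cal_fn` are placeholders for Eq. (2) and Eqs. (3)-(5) (with `cal_fn` assumed to close over the clothes-to-identity mapping), and the epoch threshold mirrors the schedule described above; none of the names come from the released code.

```python
# Hedged sketch of the per-iteration two-step optimization with CAL added only
# after the first learning-rate reduction.
import torch.nn.functional as F

def train_one_epoch(epoch, loader, backbone, id_head, clothes_head,
                    opt_clothes, opt_reid, clothes_loss_fn, cal_fn,
                    start_cal_epoch=25):
    for imgs, pids, clothes_ids in loader:
        feats = backbone(imgs)                                   # (B, D)

        # Step 1: update the clothes classifier only (backbone detached).
        loss_c = clothes_loss_fn(feats.detach(), clothes_ids, clothes_head.weight)
        opt_clothes.zero_grad()
        loss_c.backward()
        opt_clothes.step()

        # Step 2: update backbone + identity classifier with the clothes
        # classifier frozen (its weights are detached here).
        loss = F.cross_entropy(id_head(feats), pids)
        if epoch >= start_cal_epoch:                             # after first LR drop
            loss = loss + cal_fn(feats, clothes_ids, pids,
                                 clothes_head.weight.detach())
        opt_reid.zero_grad()
        loss.backward()
        opt_reid.step()
```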

3.3 Discussion

Relations between CAL and PAR. The idea of CAL is inspired by PAR [Wang2019LearningRobust]. Both CAL and PAR belong to multi-class adversarial learning methods, but the motivation and formulation of the two methods are different. PAR defines the multi-class adversarial loss as a negative cross entropy loss and forces the model to focus on discriminative global information by penalizing the predictive power of local features w.r.t. all classes. In this paper, however, since we define the clothes class as a fine-grained identity class, using the same formulation as PAR, i.e., a negative cross entropy loss, to penalize the predictive power of the re-id model w.r.t. all kinds of clothes would also reduce its predictive power w.r.t. identity, which is contrary to our goal. Hence, we define CAL as a multi-positive-class classification loss, where all clothes classes belonging to the same identity are mutually positive classes. In other words, we only want to make the trained clothes classifier unable to distinguish samples with the same identity but different clothes. To the best of our knowledge, this is the first work that uses multi-positive-class classification to formulate multi-class adversarial learning, and it is our main technical contribution.

Differences between CAL and label smoothing regularization. To get a trade-off between clothes-changing and clothes-consistent re-id accuracy, we set different weights for different positive classes in Eq. (5), but the weights of negative classes are still 0. In contrast, label smoothing regularization [Inceptionv2] sets the weights of negative classes to small non-zero values to avoid overfitting.

datasets #identities #sequences #bboxes clothes-changing?
PRID 200 400 40,033 no
iLIDS-VID 300 600 42,460 no
MARS 1,261 19,608 1,191,003 no
LS-VID 3,772 14,943 2,982,685 no
Real28 28 - 4,324 yes
VC-Clothes 512 - 19,060 yes
LTCC 152 - 17,119 yes
PRCC 221 - 33,698 yes
Celeb-reID 1,052 - 34,186 yes
DeepChange 1,082 - 171,352 yes
LaST 10,860 - 224,721 yes
CCVID 226 2,856 347,833 yes
Table 1: The statistics of our CCVID dataset and other video person re-id and clothes-changing person re-id datasets.

Figure 3: Two different video samples of the same identity on MARS, LS-VID, and CCVID datasets respectively. Only CCVID involves clothes changes.

4 CCVID Dataset

As shown in Tab. 1 and Fig. 3, all existing publicly available video person re-id datasets (i.e. PRID [Hirzer2011PRID], iLIDS-VID [Wang2014Person], MARS [Zheng2016MARS], and LS-VID [Li2019GLTR]) do not involve clothes changes. Besides, existing publicly available clothes-changing person re-id datasets (i.e. Real28&VC-Clothes [Real28], LTCC [Qian2020LTCC], PRCC [Yang2019PRCC], Celeb-reID [huang2019beyond], DeepChange [DeepChange], and LaST [LaST]) only contain still images and do not involve sequence data. However, as analyzed in the introduction, clothes-changing video re-id is closer to real-world re-id scenarios, and the abundant appearance information and additional temporal information in the video samples are helpful for clothes-changing re-id.

To provide a publicly available benchmark, we construct a Clothes-Changing Video person re-ID (CCVID) dataset from the raw data of a gait recognition dataset, i.e., FVG [Zhang2019Gait] (the raw data are downloaded from https://github.com/ziyuanzhangtony/GaitNet-CVPR2019; the data collection was approved by the persons recorded, and using this dataset requires accepting the terms and conditions of the CC BY-NC-SA 4.0 license). The FVG dataset contains 2,856 sequences from 226 identities, and each identity has 2~5 suits of clothes. In the original FVG dataset, 1,620 sequences from 135 identities were collected in 2017 and 948 sequences from the other 79 identities were collected in 2018; 12 persons have sequences collected in both 2017 and 2018. Since gait recognition methods usually use masked images while re-id methods use detected person images, we reconstruct this dataset by performing detection [He2017MaskRCNN] on the raw data. Since most frames of FVG contain only one person, we only keep the detection with the highest score for each frame, and no tracking algorithm is required. The reconstructed CCVID dataset contains 347,833 bounding boxes. The length of each sequence ranges from 27 to 410 frames, with an average length of 122. Besides, we provide fine-grained clothes labels covering tops, bottoms, shoes, carrying status, and accessories. For the convenience of evaluation, we re-divide the training and test sets to suit clothes-changing re-id: 75 identities are reserved for training, and the remaining 151 identities are used for testing. In the test set, 834 sequences are used as the query set, and the other 1,074 sequences form the gallery set.
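A minimal sketch of this per-frame detection step is shown below, using an off-the-shelf torchvision Mask R-CNN (torchvision >= 0.13) as a stand-in for the detector in [He2017MaskRCNN]; only the highest-scoring person box is kept per frame, so no tracking is needed. The function and variable names are placeholders.

```python
# Hedged sketch: crop the highest-scoring person from each raw frame.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
PERSON = 1  # COCO category id for 'person'

@torch.no_grad()
def crop_person(frame):
    """frame: a PIL.Image; returns the crop of the highest-scoring person, or None."""
    out = detector([to_tensor(frame)])[0]
    keep = out["labels"] == PERSON
    if not keep.any():
        return None
    best = out["scores"][keep].argmax()
    x1, y1, x2, y2 = out["boxes"][keep][best].round().int().tolist()
    return frame.crop((x1, y1, x2, y2))
```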

In Sec. 5.5, we will make a fair comparison between image-based setting and video-based setting on CCVID. Also, we will reproduce some state-of-the-art video person re-id and gait recognition methods on CCVID and compare their performance.

5 Experiments

5.1 Datasets and Evaluation Protocol

We mainly evaluate the proposed method on CCVID and two widely used clothes-changing image person re-id datasets (i.e., PRCC [Yang2019PRCC] and LTCC [Qian2020LTCC]). The results on VC-Clothes [Real28], LaST [LaST], and DeepChange [DeepChange] are shown in the supplementary materials. Top-1 accuracy and mAP are used as evaluation metrics, and three test settings are defined as follows: (i) general setting (both clothes-changing and clothes-consistent ground truth samples are used to calculate accuracy), (ii) clothes-changing setting (abbreviated as CC; only clothes-changing ground truth samples are used to calculate accuracy), and (iii) same-clothes setting (abbreviated as SC, also called the clothes-consistent setting; only clothes-consistent ground truth samples are used to calculate accuracy). For CCVID and LTCC, we report the accuracy for both general re-id and clothes-changing re-id. For PRCC, following [Yang2019PRCC], the accuracies in the same-clothes setting and the clothes-changing setting are reported.
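The three settings differ only in which gallery ground truths are considered valid for each query. The sketch below shows one way such a validity mask can be built from identity, clothes, and camera labels; the exact filtering in the released evaluation code may differ.

```python
# Hedged sketch: build a per-query gallery validity mask for the three settings.
import numpy as np

def gallery_mask(q_pid, q_clothes, q_cam, g_pids, g_clothes, g_cams, setting="general"):
    same_cam = g_cams == q_cam
    valid = ~((g_pids == q_pid) & same_cam)      # usual same-camera filtering
    if setting == "cc":                          # clothes-changing: drop same-clothes matches
        valid &= ~((g_pids == q_pid) & (g_clothes == q_clothes))
    elif setting == "sc":                        # same-clothes: drop clothes-changing matches
        valid &= ~((g_pids == q_pid) & (g_clothes != q_clothes))
    return valid                                 # boolean mask over the gallery

# toy usage
g_pids = np.array([1, 1, 2]); g_clothes = np.array([0, 1, 0]); g_cams = np.array([0, 1, 1])
print(gallery_mask(1, 0, 0, g_pids, g_clothes, g_cams, setting="cc"))
```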

5.2 Implementation Details

We use ResNet-50 [He2016Deep] as the backbone of the re-id model. To enrich the granularity of the feature map, the last downsampling of ResNet-50 is removed. For the image-based datasets (i.e., LTCC and PRCC), following [Huang2021Clothing], we use global average pooling and global max pooling to aggregate the output feature map of the backbone, then concatenate the two pooled vectors and use BatchNorm [Ioffe2015BN] to normalize the image feature. Following [Qian2020LTCC], the input images are resized to 384×192. Random horizontal flipping, random cropping, and random erasing [zhong2020random] are used for data augmentation. The batch size is set to 64, and each batch contains 8 persons with 8 images per person. The model is trained with Adam [Kingma2014Adam] for 60 epochs, and $\mathcal{L}_{CAL}$ is added to the training objective after the 25th epoch. The learning rate is divided by 10 after every 20 epochs. The temperature $\tau$ in Eq. (3) and $\epsilon$ in Eq. (5) are set by grid search on LTCC ($\epsilon$ = 0.1), and the chosen values are directly used for the other datasets without tuning.
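A minimal sketch of the image feature head described above is given below; layer and class names are placeholders rather than the released module.

```python
# Hedged sketch: GAP + GMP over the backbone feature map, concatenation, and
# BatchNorm on the resulting vector.
import torch
import torch.nn as nn

class ImageHead(nn.Module):
    def __init__(self, in_channels=2048):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)
        self.bn = nn.BatchNorm1d(2 * in_channels)

    def forward(self, fmap):                      # fmap: (B, C, H, W)
        f = torch.cat([self.gap(fmap), self.gmp(fmap)], dim=1).flatten(1)
        return self.bn(f)                         # (B, 2C) image feature

feat = ImageHead()(torch.randn(4, 2048, 24, 12))
print(feat.shape)                                 # torch.Size([4, 4096])
```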

For the video-based dataset, i.e., CCVID, following [Gu2020AP3D], we use spatial max pooling and temporal average pooling to aggregate the output feature map of the backbone and then use BatchNorm [Ioffe2015BN] to normalize the video feature. Since the frame lengths of different videos vary, the inputs during training should have equal length, and each frame should ideally be sampled with equal probability. Hence, for each original video, we randomly sample 8 frames with a stride of 4 to form a video clip. Input frames are resized and only horizontal flipping is used for data augmentation, following [Gu2020AP3D]. Due to the limit of GPU memory, the batch size is set to 32, and each batch contains 8 persons with 4 video clips per person. The model is trained with Adam [Kingma2014Adam] for 150 epochs, and $\mathcal{L}_{CAL}$ is added to the training objective after the 50th epoch. The learning rate is divided by 10 after every 40 epochs. In the test stage, each video is divided into a series of 8-frame clips with a stride of 4, and the averaged feature of these clips is used as the representation of the original video.
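The sketch below illustrates this clip sampling and test-time feature averaging. The handling of videos shorter than the sampling span and the input layout expected by `extract_fn` are assumptions for illustration.

```python
# Hedged sketch: 8-frame clips with temporal stride 4 for training and testing,
# plus test-time averaging of clip features into a single video feature.
import random
import torch

def sample_train_clip(num_frames, clip_len=8, stride=4):
    """Randomly pick clip_len frame indices with the given temporal stride."""
    span = clip_len * stride
    if num_frames >= span:
        start = random.randint(0, num_frames - span)
        return [start + i * stride for i in range(clip_len)]
    return [(i * stride) % num_frames for i in range(clip_len)]   # loop short videos

def test_clips(num_frames, clip_len=8, stride=4):
    """Split a video into consecutive clips for testing."""
    span = clip_len * stride
    starts = range(0, max(num_frames - span + 1, 1), span)
    return [[min(s + i * stride, num_frames - 1) for i in range(clip_len)]
            for s in starts]

def video_feature(extract_fn, frames):
    """frames: (T, C, H, W) tensor; extract_fn maps a (1, clip_len, C, H, W) clip to a feature."""
    feats = [extract_fn(frames[idx].unsqueeze(0)) for idx in test_clips(len(frames))]
    return torch.stack(feats).mean(dim=0)         # averaged clip features = video feature
```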

5.3 Comparison with State-of-the-art Methods

method modality | LTCC: general (top-1 mAP), CC (top-1 mAP) | PRCC: SC (top-1 mAP), CC (top-1 mAP)
HACNN [Li2018HACNN] RGB 60.2 26.7 21.6 9.3 82.5 - 21.8 -
PCB [Sun2018Beyond] RGB 65.1 30.6 23.5 10.0 99.8 97.0 41.8 38.7
IANet [Hou2019Interaction] RGB 63.7 31.0 25.0 12.6 99.4 98.3 46.3 45.9
SPT+ASE [Yang2019PRCC] sketch - - - - 64.2 - 34.4 -
GI-ReID [Jin2021Cloth] RGB+sil. 63.2 29.4 23.7 10.4 80.0 - 33.3 -
CESD [Qian2020LTCC] RGB+pose 71.4 34.3 26.2 12.4 - - - -
RCSANet [Huang2021Clothing] RGB - - - - 100 97.2 50.2 48.6
3DSL [Chen2021Learning3D] RGB+pose+sil.+3D - - 31.2 14.8 - - 51.3 -
FSAM [Hong2021Finegrained] RGB+pose+sil. 73.2 35.4 38.5 16.2 98.8 - 54.5 -
CAL RGB 74.2 40.8 40.1 18.0 100 99.8 55.2 55.8
Table 2: Comparison with state-of-the-art methods on LTCC and PRCC. ‘sketch’, ‘sil.’, ‘pose’, and ‘3D’ represent the contour sketches, silhouettes, human poses, and 3D shape information, respectively.
method | CCVID: general (top-1 mAP), CC (top-1 mAP) | LTCC: general (top-1 mAP), CC (top-1 mAP) | PRCC: SC (top-1 mAP), CC (top-1 mAP)
baseline 78.3 75.4 77.3 73.9 65.5 29.4 28.1 11.0 99.8 97.9 45.6 43.3
w/ clothes classifier 58.8 55.8 46.2 45.6 62.3 31.0 21.9 10.9 99.5 99.5 33.1 37.4
CAL 82.6 81.3 81.7 79.6 74.2 40.8 40.1 18.0 100 99.8 55.2 55.8
CAL ($\mathcal{L}_{CAL} = -\mathcal{L}_{C}$) 52.8 53.0 50.0 49.2 21.5 3.1 9.2 2.3 89.6 67.7 19.3 13.1
Triplet Loss [Hermans2017In] 81.5 78.1 81.1 77.0 71.8 37.5 34.7 16.6 100 99.8 48.6 49.7
Table 3: The ablation studies of CAL on CCVID, LTCC, and PRCC.

We compare the proposed CAL with three traditional re-id methods (i.e., HACNN [Li2018HACNN], PCB [Sun2018Beyond], and IANet [Hou2019Interaction]) and six clothes-changing re-id methods (i.e., SPT+ASE [Yang2019PRCC], GI-ReID [Jin2021Cloth], CESD [Qian2020LTCC], RCSANet [Huang2021Clothing], 3DSL [Chen2021Learning3D], and FSAM [Hong2021Finegrained]) on LTCC and PRCC in Tab. 2. Note that these clothes-changing re-id methods use information from different modalities to avoid the interference of clothes. In particular, 3DSL and FSAM integrate at least three modalities, and their computational cost is at least four times that of CAL. Besides, RCSANet uses additional clothes-consistent re-id data to enhance the performance in the same-clothes setting. Nevertheless, using RGB images only and without additional data, the proposed CAL consistently outperforms all these methods on both datasets, which demonstrates the effectiveness of CAL.

Limitation. Although CAL achieves state-of-the-art performance without additional modalities or data, it needs clothes labels for adversarial learning. Fortunately, such clothes labels only need to be annotated among the samples of the same person, which is easier than annotating identities. When clothes labels are unavailable in practice, the collection date can be used as a pseudo clothes label to train CAL, since samples of the same person captured on different days have a high probability of showing different clothes. We attempt this strategy on the DeepChange dataset and demonstrate its effectiveness; the results are shown in the supplementary materials. Besides, we will also try to use clustering algorithms to obtain pseudo clothes labels in the future.

5.4 Ablation Studies

The effectiveness of CAL. To verify the effectiveness of the proposed CAL, we reproduce a baseline method that only uses the identification loss for training, with all other settings kept consistent with CAL. As shown in Tab. 3, when we add a clothes classifier after the backbone of the baseline and retrain the backbone by minimizing the clothes classification loss $\mathcal{L}_{C}$ and the identification loss $\mathcal{L}_{ID}$, the re-id accuracy in the same-clothes setting is superior to the baseline, but the performance in the clothes-changing setting is lower. These results are reasonable, since minimizing $\mathcal{L}_{C}$ forces the backbone to learn clothes-relevant features. When $\mathcal{L}_{C}$ is used only to train the clothes classifier and $\mathcal{L}_{CAL}$ is then used to train the backbone, CAL surpasses the baseline by a large margin in both the general and clothes-changing settings. One possible explanation is that, with the help of CAL, the backbone is forced to learn clothes-irrelevant features and is thus more robust against clothes changes.

Comparison between different formulations. As explained in Sec. 3.2, if we follow PAR [Wang2019LearningRobust] and define $\mathcal{L}_{CAL} = -\mathcal{L}_{C}$, minimizing $\mathcal{L}_{CAL}$ penalizes the predictive power of the re-id model w.r.t. all kinds of clothes in the training set. Since the clothes label is defined as a fine-grained identity label, this also reduces the predictive power of the re-id model w.r.t. identity, which is harmful to re-id. To verify this, we compare CAL with CAL ($\mathcal{L}_{CAL} = -\mathcal{L}_{C}$) in Tab. 3. The accuracy of CAL ($\mathcal{L}_{CAL} = -\mathcal{L}_{C}$) is much lower than that of CAL and even lower than the baseline in the general, clothes-changing, and same-clothes settings.

Comparison with Triplet loss. We also compare CAL with the widely used metric learning loss, i.e., Triplet loss [Hermans2017In]. As shown in Tab. 3, Triplet loss is superior to the baseline, but CAL outperforms Triplet loss significantly, especially in the clothes-changing setting. A likely reason is that Triplet loss can only mine hard cases within a mini-batch, while CAL uses a clothes classifier that stores the proxies of all clothes classes in the training set and can therefore mine clothes-irrelevant features at a global scope.

method Market-1501 MSMT17
top-1 mAP top-1 mAP
PCB [Sun2018Beyond] 93.8 81.6 68.2 40.4
IANet [Hou2019Interaction] 94.4 83.1 75.5 46.8
OSNet [OSNet] 94.8 84.9 78.7 52.9
JDGL [DGNet] 94.8 86.0 77.2 52.3
CircleLoss [Circleloss] 94.2 84.9 76.3 50.2
baseline 92.2 78.7 67.8 43.5
CAL 94.5 87.3 79.7 57.0
baseline (w/ triplet) 94.5 86.6 78.9 57.0
CAL (w/ triplet) 94.7 87.5 79.7 57.3
Table 4: Comparison on standard datasets without clothes-changing.
Figure 4: The top-1 accuracy of CAL with different $\epsilon$ and $\tau$ on LTCC. Note that the abscissa of the second subfigure is $1/\tau$.

Results on standard person re-id benchmarks.

When the test benchmark does not involve clothes changes and each identity has only one suit of clothes, the clothes classifier in our method becomes an identity classifier and the proposed CAL degenerates into a cosine-similarity-based cross entropy loss. We evaluate CAL on two standard re-id datasets, i.e., Market-1501 [Zheng2015Scalable] and MSMT17 [MSMT17], and compare the results with the baseline and some state-of-the-art methods in Tab. 4. CAL outperforms the baseline, which only uses the original cross entropy loss as supervision. When both methods are combined with triplet loss [Hermans2017In], CAL still outperforms the baseline. Besides, CAL achieves performance comparable to state-of-the-art methods on these two benchmarks.

The influence of $\epsilon$ in CAL. By varying $\epsilon$ in Eq. (5), Fig. 4 shows the top-1 accuracy of CAL in the same-clothes setting and the clothes-changing setting on LTCC. When $\epsilon$ is set to 0, CAL degenerates into a clothes classification loss and constrains the backbone to learn clothes-relevant features, so it achieves the lowest top-1 accuracy in the clothes-changing setting. With the increase of $\epsilon$, the weight of the positive class with the same clothes as $x_i$ in Eq. (3) decreases gradually, so the top-1 accuracy in the same-clothes setting generally decreases. As for the accuracy in the clothes-changing setting, it increases rapidly, then starts to oscillate, and eventually tends to overfit. To get a trade-off between the clothes-changing re-id accuracy and the traditional re-id accuracy in the same-clothes setting, we set $\epsilon$ to 0.1 for all other experiments.

The influence of the temperature parameter $\tau$. In general, the optimal temperature is related to the number of classes in the training set. We show the experimental results with varying $\tau$ on LTCC in Fig. 4; the best performance is achieved at a moderate value of $\tau$.

5.5 Further Analyses on CCVID

method general CC
top-1 mAP top-1 mAP
baseline(F) 65.0 59.4 63.1 56.3
baseline 78.3 75.4 77.3 73.9
I3D [Carreira2017I3D] 79.7 76.9 78.5 75.3
+CAL +1.5 +3.0 +2.3 +3.3
Non-Local [Wang2018Non] 80.7 78.0 79.3 76.2
+CAL +2.9 +3.4 +3.7 +4.0
TCLNet [Hou2020TCL] 81.4 77.9 80.7 75.9
+CAL +1.3 +3.0 +1.4 +3.7
AP3D [Gu2020AP3D] 80.9 79.2 80.1 77.7
+CAL +3.2 +2.0 +3.5 +2.3
GaitNet [Zhang2019Gait] 62.6 56.5 57.7 49.0
GaitSet [Chao2019Gaitset] 81.9 73.2 71.0 62.1
CAL 82.6 81.3 81.7 79.6
Table 5: Comparison with state-of-the-arts on CCVID. ‘F’ means only using the first frame for testing.
Figure 5: The visualization of feature maps on LTCC, PRCC, and CCVID. In each triplet, the first column presents the original image/video frames. The second and third columns present the feature maps of the baseline method and CAL, respectively.

Image-based setting vs. video-based setting. To make a fair comparison between clothes-changing image re-id and clothes-changing video re-id, we reproduce a baseline that uses all frames for training but only the first frame for testing (baseline(F) in Tab. 5) on CCVID. In the meantime, we also reproduce two classic temporal information modeling methods, i.e., I3D [Carreira2017I3D] and Non-Local [Wang2018Non], and two temporal information modeling methods specially designed for video re-id, i.e., TCLNet [Hou2020TCL] and AP3D [Gu2020AP3D], on CCVID. Note that TCLNet is reproduced with the source code provided by the original paper, while the implementation of I3D, Non-Local, and AP3D is based on the source code of [Gu2020AP3D]. As shown in Tab. 5, compared with baseline(F), the baseline that uses all frames for testing achieves a significant performance improvement (more than 13%). Compared with this baseline, the temporal information modeling methods achieve further improvement. These comparisons show the superiority of the video-based setting and suggest that there is much room for improvement in spatiotemporal modeling for clothes-changing re-id.

Gait recognition vs. clothes-changing video person re-id. We also compare these temporal information modeling methods with two state-of-the-art gait recognition methods (i.e., GaitNet [Zhang2019Gait] and GaitSet [Chao2019Gaitset]) on CCVID. Note that GaitNet and GaitSet model gait from disentangled representations and silhouettes, respectively. All these methods are reproduced with the source code provided by their papers. As shown in Tab. 5, except for the top-1 accuracy of GaitSet in the general setting, the four temporal information modeling methods show a great advantage over the two gait recognition methods. One possible explanation is that, compared with silhouettes and disentangled representations, the original RGB modality provides more clothes-irrelevant information, which is helpful for clothes-changing re-id. Besides, the proposed CAL outperforms all these methods, and combining the temporal information modeling methods with CAL brings further improvement. This comparison demonstrates the effectiveness and generality of the proposed method.

5.6 Visualization

We visualize more feature maps of the baseline method and CAL in Fig. 5. On the two image-based datasets (i.e., PRCC and LTCC), the feature maps of the baseline method mainly focus on the face, shoes, and shoulders. That is, they focus on both clothes-relevant and clothes-irrelevant features that are beneficial to re-identification: (1) as different samples of the same person in the training set mostly wear the same shoes, shoes are highlighted as key clothes-relevant features; (2) the face and the contour of the shoulders are highlighted, which are among the easy-to-learn clothes-irrelevant features. With the help of CAL, the feature maps highlight more clothes-irrelevant information, e.g., hairstyle and body shape. As for the video-based dataset, i.e., CCVID, the feature maps of the baseline method mainly focus on the body regions. With the constraint of CAL, the learned feature maps also highlight the head regions and describe pose and body shape more clearly. Therefore, CAL achieves higher clothes-changing re-id accuracy.

Discussion. A possible concern is whether the high responses of CAL on the body regions are mainly caused by the texture and color of clothes or by body shape. To examine this, we remove the top 2/7 (head region) and bottom 1/7 (foot region) of the testing images and keep only the body regions as a quantitative experiment. The results of the baseline method and CAL on PRCC are shown in Tab. 6. CAL is slightly inferior to the baseline in the same-clothes setting but still outperforms the baseline significantly when only body regions are kept. Hence, we argue that the high responses of CAL on the body regions are mainly due to clothes-irrelevant features (body shape) rather than clothes-relevant features (the color and texture of clothes).
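For reference, the body-region crop used in this ablation can be implemented as in the sketch below; the exact rounding used in the original experiment is an assumption.

```python
# Hedged sketch: drop the top 2/7 (head) and bottom 1/7 (feet) of a test image,
# keeping only the body region.
from PIL import Image

def keep_body_region(img: Image.Image) -> Image.Image:
    w, h = img.size
    top = round(2 * h / 7)        # remove head region
    bottom = h - round(h / 7)     # remove foot region
    return img.crop((0, top, w, bottom))
```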

method SC CC
top-1 mAP top-1 mAP
baseline 96.7 92.6 18.7 20.4
CAL 95.9 92.4 24.5 25.4
Table 6: The results with only body regions as inputs on PRCC.

6 Conclusion

In this paper, we propose the Clothes-based Adversarial Loss (CAL) for clothes-changing person re-id. During training, CAL forces the backbone of the re-id model to learn clothes-irrelevant features by penalizing its predictive power w.r.t. clothes. As a result, the learned backbone can better mine the clothes-irrelevant information from the original RGB modality and is more robust against clothes changes. Extensive experiments on the newly constructed CCVID and other related datasets demonstrate that CAL consistently improves over the baseline method by a large margin. Using RGB images only, it outperforms all state-of-the-art methods on these datasets. We hope CAL will become a commonly used loss in future clothes-changing person re-id methods.

Broader impacts. The proposed method can be plugged into existing person re-id methods and boosts the performance of clothes-changing re-id without additional data or multi-modality inputs. It makes long-term person re-id technology more practicable in intelligent monitoring systems and may inspire more valuable and innovative studies in the future. The potential negative impact is that surveillance data and person re-id datasets may cause privacy breaches. Hence, the persons appearing in such data should be informed during collection, and the utilization of these data should be regulated.

Acknowledgment. This work is supported by National Key R&D Program of China (No. 2017YFA0700800), and National Natural Science Foundation of China (NSFC): 61876171 and 61976203.

Appendix A Convergence Analysis


Figure 6: The statistics of the average probability for three types of clothes classes on LTCC at the last training epoch. Best viewed in color.

To facilitate the convergence analysis, we denote the classification probabilities of the true positive clothes class (the ground-truth clothes class of $x$), the pseudo positive clothes classes (with the same identity as $x$ but different clothes), and the negative clothes classes (with different identities from $x$) as $p_{tp}$, $p_{pp}$, and $p_{n}$, respectively. In the first step of optimization, minimizing $\mathcal{L}_{C}$ maximizes $p_{tp}$ and minimizes $p_{pp}$ and $p_{n}$. In the second step, minimizing $\mathcal{L}_{CAL}$ maximizes $p_{tp}$ and $p_{pp}$ and minimizes $p_{n}$. So the probability distribution at the convergence state should satisfy $p_{tp} > p_{pp} > p_{n}$. To verify this, we count the average probability of these three types of clothes classes over all samples on LTCC at the last training epoch. As shown in Fig. 6, $p_{tp}$ of most samples converges to 0.6~1, $p_{pp}$ converges to 1e-4~0.1, and $p_{n}$ converges to 1e-5~1e-2. The convergence results are consistent with our analysis.
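A small sketch of how these statistics can be computed from the clothes-classifier logits is given below (batch-averaged here for brevity, whereas the figure reports per-sample statistics); the function and tensor names are placeholders.

```python
# Hedged sketch: average probability assigned to the true positive, pseudo
# positive, and negative clothes classes of a batch.
import torch
import torch.nn.functional as F

def class_type_probs(logits, clothes_labels, identity_labels, clothes_to_identity):
    """logits: (B, Nc) clothes-classifier outputs; clothes_to_identity: (Nc,)."""
    prob = F.softmax(logits, dim=1)
    B, Nc = prob.shape
    true_pos = torch.zeros(B, Nc, dtype=torch.bool)
    true_pos[torch.arange(B), clothes_labels] = True
    same_id = clothes_to_identity.unsqueeze(0) == identity_labels.unsqueeze(1)
    pseudo_pos = same_id & ~true_pos            # same identity, different clothes
    neg = ~same_id                              # other identities
    # note: the pseudo-positive mean is NaN if no identity in the batch has >1 suit
    return prob[true_pos].mean(), prob[pseudo_pos].mean(), prob[neg].mean()
```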

Appendix B Experiments on VC-Clothes

VC-Clothes [Real28] is a virtual dataset synthesized with GTA5. It contains 19,060 images of 512 identities from 4 cameras (scenes). Each identity has 1~3 suits of clothes, and all samples of each identity captured by cameras 2 and 3 wear the same clothes. Hence, most current works [Real28, Hong2021Finegrained] report the results on the subset from cameras 2&3 as the accuracy in the same-clothes (SC) setting and the results on the subset from cameras 3&4 as the accuracy in the clothes-changing (CC) setting. To make a fair comparison, we follow the settings in these works.

We compare the proposed CAL with two single-modality (RGB) re-id methods (i.e., MDLA [Qian2017MS] and PCB [Sun2018Beyond]) and three multi-modality re-id methods (i.e., Part-aligned [Suh2018Part], FSAM [Hong2021Finegrained], and 3DSL [Chen2021Learning3D]) on VC-Clothes in Tab. 7. Using RGB images only, the proposed CAL outperforms the baseline and all these state-of-the-art methods in the general, same-clothes, and clothes-changing settings, which demonstrates the effectiveness of CAL. Since these state-of-the-art methods do not report the accuracy in the same-clothes and clothes-changing settings on the whole dataset from all cameras, we only compare our method with the baseline in these settings in Tab. 8. The results show that CAL outperforms the baseline, especially in the clothes-changing setting.

method general SC CC
(all cams) (cam2&cam3) (cam3&cam4)
top-1 mAP top-1 mAP top-1 mAP
MDLA [Qian2017MS] 88.9 76.8 94.3 93.9 59.2 60.8
PCB [Sun2018Beyond] 87.7 74.6 94.7 94.3 62.0 62.2
Part-aligned [Suh2018Part] 90.5 79.7 93.9 93.4 69.4 67.3
FSAM [Hong2021Finegrained] - - 94.7 94.8 78.6 78.9
3DSL [Chen2021Learning3D] - - - - 79.9 81.2
baseline 88.3 79.2 94.1 94.3 67.3 67.9
CAL 92.9 87.2 95.1 95.3 81.4 81.7
Table 7: Comparison with state-of-the-arts on VC-Clothes.
method SC CC
(all cams) (all cams)
top-1 mAP top-1 mAP
baseline 94.5 93.9 74.2 66.5
CAL 96.0 95.7 85.8 79.8
Table 8: The results in the same-clothes and clothes-changing settings for the data from all cameras on VC-Clothes.
method LaST DeepChange
top-1 mAP top-1 mAP
OSNet [OSNet] 63.8 20.9 39.7 10.3
ReIDCaps [huang2019beyond] - - 39.5 11.3
BoT [BoT] 68.3 25.3 47.5 13.0
mAPLoss [LaST] 69.9 27.6 - -
baseline 69.4 25.6 50.6 15.9
CAL 73.7 28.8 54.0 19.0
Table 9: Comparison with state-of-the-art methods on LaST and DeepChange.

Appendix C Experiments on LaST and DeepChange

LaST [LaST] and DeepChange [DeepChange] are two large-scale long-term person re-id datasets. LaST provides clothes labels for the training set, while DeepChange does not provide any clothes labels. We notice that the collection date of each image in DeepChange is given. Since samples of the same person captured on different days have a high probability of showing different clothes, we use the collection date as a pseudo clothes label to train CAL. To make a fair comparison, we set the batch size to 64 following [LaST], and each batch contains 16 persons with 4 images per person. On LaST, triplet loss [Hermans2017In] is used as the metric learning loss for both the baseline and CAL (removing it causes severe performance degradation). Finally, we report the performance in the general setting to compare with state-of-the-art methods. In particular, on DeepChange, following [DeepChange], true matches are allowed to come from the same camera as the query but different tracklets.

The comparisons with the baseline and state-of-the-art methods on LaST and DeepChange are shown in Tab. 9. The proposed CAL outperforms the baseline and these state-of-the-art methods significantly. In particular, on DeepChange, CAL still works well when the collection date is used as the pseudo clothes label, which demonstrates that the proposed method does not rely on accurate clothes annotations.

References