
Cross-Modality Paired-Images Generation for RGB-Infrared Person Re-Identification

02/10/2020
by   Guan'an Wang, et al.
USTC

RGB-Infrared (IR) person re-identification is very challenging due to the large cross-modality variations between RGB and IR images. The key solution is to learn aligned features to bridge the RGB and IR modalities. However, due to the lack of correspondence labels between every pair of RGB and IR images, most methods try to alleviate the variations with set-level alignment by reducing the distance between the entire RGB and IR sets. Such set-level alignment, however, may lead to misalignment of some instances, which limits the performance of RGB-IR Re-ID. Different from existing methods, in this paper we propose to generate cross-modality paired-images and perform both global set-level and fine-grained instance-level alignment. Our proposed method enjoys several merits. First, it performs set-level alignment by disentangling modality-specific and modality-invariant features; compared with conventional methods, ours can explicitly remove the modality-specific features, so the modality variation can be better reduced. Second, given cross-modality unpaired-images of a person, our method can generate cross-modality paired-images by decoding from exchanged features. With them, we can directly perform instance-level alignment by minimizing the distance of every pair of images. Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favourably against state-of-the-art methods. In particular, on the SYSU-MM01 dataset, our model achieves a gain of 9.2% in Rank-1 accuracy. Code is available at https://github.com/wangguanan/JSIA-ReID.


Introduction

Figure 1: Illustration of set-level and instance-level alignment (please view in color). (a) There is a significant gap between the RGB and IR sets. (b) Existing methods perform set-level alignment by minimizing distances between the two sets, which may lead to misalignment of some instances. (c) Our method first generates cross-modality paired-images. (d) Then, instance-level alignment is performed by minimizing distances between each pair of images.
Figure 2: (a) In the edge-to-photo translation task, we have cross-modality paired-images. By minimizing their distances in a feature space, we can easily reduce the cross-modality gap. (b) In the RGB-IR Re-ID task, we have only unpaired-images. The appearance variation caused by the cross-modality gap makes the task more challenging. (c) Our method can reliably generate images paired with the given ones, which helps us improve RGB-IR Re-ID. (d,e) Vanilla image translation models such as CycleGAN [31] and StarGAN [3] fail to deal with this issue.

Person Re-Identification (Re-ID) [6, 25] is widely used in applications such as video surveillance, security and smart city. Given a query image of a person, Re-ID aims to find images of that person across disjoint cameras. It is very challenging due to the large intra-class and small inter-class variations caused by different poses, illuminations, views, and occlusions. Most existing Re-ID methods focus on visible cameras and RGB images, and formulate person Re-ID as a single-modality (RGB-RGB) matching problem.

However, visible cameras struggle to capture valid appearance information in poorly illuminated environments (e.g., at night), which limits the applicability of person Re-ID in practice. Fortunately, most surveillance cameras can automatically switch from visible (RGB) to near-infrared (IR) mode, which enables them to work at night. Thus, it is necessary to study RGB-IR Re-ID for real-world scenarios, which is a cross-modality matching problem. Compared with RGB-RGB single-modality matching, RGB-IR cross-modality matching is more difficult due to the large variation between the two modalities. As shown in Figure 2(b), RGB and IR images are intrinsically distinct and heterogeneous, and cover different wavelength ranges: RGB images have three channels containing the color information of visible light, while IR images have one channel containing the intensity information of invisible (near-infrared) light.

The key solution is to learn aligned features to bridge the two modalities. However, due to the lack of correspondence labels between every pair of images in different modalities (unlike the case in Figure 2(a)), existing RGB-IR Re-ID methods [22, 23, 24, 4, 8] try to reduce the marginal distribution divergence between the RGB and IR modalities, but cannot deal with their joint distribution. That is to say, as shown in Figure 1(b), they only focus on the global set-level alignment between the entire RGB and IR sets while neglecting the fine-grained instance-level alignment between every two images. This may lead to misalignment of some instances when performing the global alignment [2]. Although this issue could be alleviated by using label information, in the Re-ID task the labels of the training and test sets are disjoint, so simply fitting the training labels may not generalize well to unseen test identities.

Different from the existing approaches, a heuristic solution is to use cross-modality paired-images as in Figure 2(a). With paired images, we can directly reduce the instance-level gap by minimizing the distance between every pair of images in a feature space. However, as shown in Figure 2(b), all images are unpaired in the RGB-IR Re-ID task, because the two kinds of images are captured at different times: RGB images are captured in the daytime while IR ones are captured at night. We could also translate images from one modality to the other with image translation models such as CycleGAN [31] and StarGAN [3]. But these models can only learn one-to-one mappings, while the mapping from IR to RGB images is one-to-many. For example, gray in IR mode can be blue, yellow, or even red in RGB mode. Under this situation, CycleGAN and StarGAN often generate noisy images that cannot be used in the following Re-ID task. As shown in Figure 2(d,e), the images generated by CycleGAN and StarGAN are unsatisfying.

To solve the above problems, in this paper we propose a novel Joint Set-level and Instance-level Alignment Re-ID (JSIA-ReID) framework, which enjoys several merits. First, our method performs set-level alignment by disentangling modality-specific and modality-invariant features. Compared with encoding images with only one encoder, ours can explicitly remove the modality-specific features and significantly reduce the modality-gap. Second, given cross-modality unpaired-images of a person, our method can generate cross-modality paired-images. With them, we can directly perform instance-level alignment by minimizing the distances between the two images in a feature space. The instance-level alignment further reduces the modality-gap and avoids misalignment of instances.

Specifically, as shown in Figure 3, our proposed method consists of a generation module, which generates cross-modality paired-images, and a feature alignment module, which learns both set-level and instance-level aligned features. The generation module includes three encoders and two decoders. The three encoders disentangle an RGB (IR) image into modality-invariant and RGB (IR) modality-specific features. Then, the RGB (IR) decoder takes a modality-invariant feature from an IR (RGB) image and a modality-specific feature from an RGB (IR) image as input. By decoding from these exchanged features, we can generate cross-modality paired-images as in Figure 2(c). In the feature alignment module, we first utilize an encoder whose weights are shared with the modality-invariant encoder; it maps images from different modalities into a shared feature space, so the set-level modality-gap can be significantly reduced. Then, we further introduce an encoder to refine the features and reduce the instance-level modality-gap by minimizing the distance between the feature maps of every pair of cross-modality paired-images. Finally, by jointly training the generation module and the feature alignment module with the re-id loss, we can learn both modality-aligned and identity-discriminative features.

The major contributions of this work can be summarized as follows. (1) We propose a novel method to generate cross-modality paired-images by disentangling features and decoding from exchanged features. To the best of our knowledge, it is the first work to generate cross-modality paired-images for the RGB-IR Re-ID task. (2) Our method can simultaneously and effectively reduce both set-level and instance-level modality-variation. (3) Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favourably against state-of-the-art methods.

Related Works

RGB-RGB Person Re-Identification. RGB-RGB person re-identification addresses the problem of matching pedestrian RGB images across disjoint visible cameras [6]. Recently, many deep Re-ID methods [25, 10, 18] have been proposed. Zheng et al. [25] learn identity-discriminative features by fine-tuning a pre-trained CNN to minimize a classification loss. In [10], Hermans et al. show that a variant of the triplet loss outperforms most other published methods by a large margin. Most existing methods focus on the RGB-RGB Re-ID task and cannot perform well on the RGB-IR Re-ID task, which limits their applicability in practical surveillance scenarios.

RGB-IR Person Re-Identification. RGB-IR person re-identification attempts to match RGB and IR images of a person under disjoint cameras. Besides the difficulties of RGB-RGB Re-ID, RGB-IR Re-ID faces a new challenge due to the cross-modality variation between RGB and IR images. In [22], Wu et al. collect a cross-modality RGB-IR Re-ID dataset, SYSU-MM01, and explore three different network structures with zero-padding to automatically evolve domain-specific nodes in the network. Ye et al. utilize a dual-path network with a bi-directional dual-constrained top-ranking loss [23] and modality-specific and modality-shared metrics [24]. In [4], Dai et al. introduce a cross-modality generative adversarial network (cmGAN) to reduce the distribution divergence of RGB and IR features. Hao et al. [8] achieve visible-thermal person re-identification via a hyper-sphere manifold embedding model. The works in [19] and [21] reduce the modality-gap in both the image and feature domains. Most of the above methods mainly focus on global set-level alignment between the entire RGB and IR sets, which may lead to misalignment of some instances. Different from them, our proposed method performs both global set-level and fine-grained instance-level alignment, and achieves better performance.

Person Re-Identification with GAN. Recently, many methods attempt to utilize GANs to generate training samples for improving Re-ID. Zheng et al. [27] use a GAN model to generate unlabeled images as data augmentation. Zhong et al. [30, 28, 29] translate images to different camera styles with CycleGAN [31], and then use both real and generated images to reduce inter-camera variation. Ma et al. [13] use a cGAN to generate pedestrian images with different poses to learn features free of the influence of pose variation. Zheng et al. [26] propose a joint learning framework that couples re-id learning and image generation end-to-end in a unified network. All these methods focus on single-modality RGB Re-ID and cannot deal with cross-modality RGB-IR Re-ID. Different from them, our method can generate cross-modality paired-images and learn both set-level and instance-level aligned features.

Image Translation. Generative Adversarial Networks (GANs) [7] learn data distributions in a self-supervised way via adversarial training and have been widely used in image translation. Pix2Pix [11] solves image translation by utilizing a conditional generative adversarial network and a reconstruction loss supervised by paired data. CycleGAN [31] and StarGAN [3] learn image translation from unpaired data using a cycle-consistency loss. These methods only learn one-to-one mappings among different modalities and cannot be used in RGB-IR Re-ID, where the mapping from IR to RGB is one-to-many. Different from them, our method first disentangles images into modality-invariant and modality-specific features, and then generates cross-modality paired-images by decoding from exchanged features.

The Proposed Method

Figure 3: Our proposed framework consists of a cross-modality paired-images generation module and a feature alignment module. The generation module first disentangles images into modality-specific and modality-invariant features, and then decodes from the exchanged features. The feature alignment module first uses the modality-invariant encoder to perform set-level alignment, and then further performs instance-level alignment by minimizing the distance between each pair of images. Finally, by training the two modules with the re-id loss, we can learn both modality-aligned and identity-discriminative features.

Our method includes a generation module to generate cross-modality paired-images and a feature alignment module to learn both global set-level and fine-grained instance-level aligned features. Finally, by training the two modules with re-id loss, we can learn both modality-aligned and identity-discriminative features.

Cross-Modality Paired-Images Generation Module

As shown in Figure 2(b), in the RGB-IR Re-ID task, the training images from the two modalities are unpaired, which makes it more difficult to reduce the gap between the RGB and IR modalities. To solve this problem, we propose to generate paired-images by disentangling features and decoding from exchanged features. We suppose that images can be decomposed into modality-invariant and modality-specific features. Here, the former includes content information such as pose, gender, clothing category and carried items, etc. Conversely, the latter contains style information such as clothing/shoe colors, texture, etc. Thus, given unpaired-images, by disentangling and exchanging their style information, we can generate paired-images, where the two images have the same content information such as pose and view but different style information such as clothing colors.

Features Disentanglement. We disentangle features with three encoders: the modality-invariant encoder, which learns content information from both modalities; the RGB modality-specific encoder, which learns RGB style information; and the IR modality-specific encoder, which learns IR style information. Given RGB images and IR images, their modality-specific features are obtained as in Eq. (2), and their modality-invariant features are obtained as in Eq. (1).

(1)
(2)

Paired-Images Generation. We generate paired-images using two decoders: an RGB decoder that generates RGB images and an IR decoder that generates IR images. After obtaining the disentangled features in Eq. (1) and Eq. (2), we can generate paired-images by exchanging their style information. Specifically, to generate RGB images paired with real IR images, we use the content features from the real IR images and the style features from the real RGB images. By doing so, the generated images contain content information from the IR images and style information from the RGB images. Similarly, we can also generate fake IR images paired with real RGB images. Note that, to ensure the generated images have the same identities as their original ones, we only exchange features within the same person. This process is formulated in Eq. (3).

(3)
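Since the original equations did not survive extraction in this copy, the following is a minimal PyTorch-style sketch of the disentangle-and-exchange idea described above. All class names, architectures, and tensor shapes (ContentEncoder, StyleEncoder, Decoder, x_rgb, x_ir, 256x128 inputs) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Modality-invariant (content) encoder shared by RGB and IR images."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 4, 2, 1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)                      # content feature map

class StyleEncoder(nn.Module):
    """Modality-specific (style) encoder, one per modality."""
    def __init__(self, in_ch=3, style_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, style_dim)
    def forward(self, x):
        return self.fc(self.conv(x).flatten(1)) # style vector

class Decoder(nn.Module):
    """Decoder for one modality: content map + style vector -> image."""
    def __init__(self, dim=64, style_dim=8, out_ch=3):
        super().__init__()
        self.style_to_bias = nn.Linear(style_dim, dim)   # crude stand-in for AdaIN
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, dim, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(dim, out_ch, 3, 1, 1), nn.Tanh(),
        )
    def forward(self, content, style):
        bias = self.style_to_bias(style).unsqueeze(-1).unsqueeze(-1)
        return self.up(content + bias)

# Encoders/decoders (IR images replicated to 3 channels here for simplicity).
E_content = ContentEncoder()
E_rgb, E_ir = StyleEncoder(), StyleEncoder()
G_rgb, G_ir = Decoder(), Decoder()

x_rgb = torch.randn(4, 3, 256, 128)   # real RGB images of some identities
x_ir  = torch.randn(4, 3, 256, 128)   # real IR images of the *same* identities

# Eq. (1)/(2): disentangle content (modality-invariant) and style (modality-specific).
c_rgb, c_ir = E_content(x_rgb), E_content(x_ir)
s_rgb, s_ir = E_rgb(x_rgb), E_ir(x_ir)

# Eq. (3): exchange styles intra-person to obtain cross-modality paired images.
fake_rgb = G_rgb(c_ir, s_rgb)   # RGB image paired with x_ir (same pose/content)
fake_ir  = G_ir(c_rgb, s_ir)    # IR image paired with x_rgb
```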

Reconstruction Loss. A simple supervision is to force the disentangled features to reconstruct their original images. Thus, we formulate the reconstruction loss as below, using the L1 distance.

(4)
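The formula itself is missing from this copy. A plausible form of this loss, written with assumed notation (E_c for the modality-invariant encoder, E_v/E_r for the RGB/IR modality-specific encoders, G_v/G_r for the RGB/IR decoders, and x_v/x_r for RGB/IR images), is:

```latex
\mathcal{L}_{rec} =
  \mathbb{E}_{x_v}\big[\,\big\lVert G_v\big(E_c(x_v),\, E_v(x_v)\big) - x_v \big\rVert_1\big]
+ \mathbb{E}_{x_r}\big[\,\big\lVert G_r\big(E_c(x_r),\, E_r(x_r)\big) - x_r \big\rVert_1\big]
```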

Cycle-Consistency Loss. The reconstruction loss in Eq. (4) cannot supervise the cross-modality paired-image generation, and the generated images may not contain the expected content and style information. For example, when translating IR images to their RGB versions via Eq. (3), the translated images may not keep the poses (content information) of the IR images, or may not have the right clothing colors (style information) of the corresponding RGB images. This is not what we want and would harm the feature alignment module. Inspired by CycleGAN [31], we introduce a cycle-consistency loss to guarantee that the generated images can be translated back to their original versions. By doing so, the cycle-consistency loss further limits the space of the generated samples. The cycle-consistency loss can be formulated as below:

(5)

where the cycle-reconstructed images are obtained as in Eq. (6).

(6)
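With the same assumed notation as above, one plausible reading of Eqs. (5)-(6) is that the cross-modality fakes of Eq. (3) are translated back using the original style codes and pulled toward the original images with an L1 penalty; the exact composition may differ in the authors' formulation.

```latex
\hat{x}_v = G_v\big(E_c(x_r),\, E_v(x_v)\big), \qquad
\hat{x}_r = G_r\big(E_c(x_v),\, E_r(x_r)\big)
\quad \text{(cross-modality fakes, cf. Eq. (3))}

x_v^{cyc} = G_v\big(E_c(\hat{x}_r),\, E_v(x_v)\big), \qquad
x_r^{cyc} = G_r\big(E_c(\hat{x}_v),\, E_r(x_r)\big)
\quad \text{(cycle reconstructions, cf. Eq. (6))}

\mathcal{L}_{cyc} = \mathbb{E}\big[\lVert x_v^{cyc} - x_v \rVert_1\big]
                  + \mathbb{E}\big[\lVert x_r^{cyc} - x_r \rVert_1\big]
```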

GAN Loss. The reconstruction and cycle-consistency losses alone tend to produce blurry images. To make the generated images more realistic, we apply an adversarial loss [7] on both modalities, which has been proven effective in image generation tasks [11]. Specifically, we introduce two discriminators to distinguish real images from generated ones in the RGB and IR modalities, respectively. In contrast, the encoders and decoders aim to make the generated images indistinguishable from real ones. The GAN loss can be formulated as below:

(7)
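A plausible form of this adversarial term, again with assumed notation (D_v and D_r denote the RGB and IR discriminators, and hatted variables the generated images), is the standard two-player objective below; note that the implementation details later state that LSGAN is used in practice, which replaces the log terms with least-squares ones.

```latex
\begin{aligned}
\mathcal{L}_{gan} ={}& \mathbb{E}_{x_v}\big[\log D_v(x_v)\big]
  + \mathbb{E}\big[\log\big(1 - D_v(\hat{x}_v)\big)\big] \\
{}+{}& \mathbb{E}_{x_r}\big[\log D_r(x_r)\big]
  + \mathbb{E}\big[\log\big(1 - D_r(\hat{x}_r)\big)\big]
\end{aligned}
```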

Feature Alignment Module

Set-Level Feature Alignment. To reduce the modality-gap, most methods attempt to learn a shared feature space for different modalities by using a dual path [23, 24] or a GAN loss [4]. However, those methods do not explicitly remove the modality-specific information, which may be encoded into the shared feature space and harm the performance [1]. In our method, we utilize a set-level encoder to learn set-level aligned features, whose weights are shared with the modality-invariant encoder. As described above, in the cross-modality paired-images generation module, the modality-invariant encoder is trained to explicitly remove modality-specific features. Thus, given images from either modality, we can learn their set-level aligned features.

Instance-Level Feature Alignment. Even so, as discussed in the introduction, only performing global set-level alignment between the entire RGB and IR sets may lead to misalignment of some instances. To overcome this problem, we propose to perform instance-level alignment by using the cross-modality paired-images generated by the generation module. Specifically, we first utilize an instance-level encoder to map the set-level aligned features to a new feature space. Then, in this feature space, we align every two cross-modality paired-images by minimizing their Kullback-Leibler divergence. Thus, the loss of the instance-level feature alignment can be formulated as in Eq. (8).

(8)

where the predicted probabilities over all identities are computed from the features of the two paired images in this feature space by a classifier implemented with a fully-connected layer.
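As a concrete illustration of this alignment loss, here is a minimal PyTorch sketch (assumed names, shapes, and dimensions, not the authors' code): the predictions of a fully-connected identity classifier on a real image and on its generated cross-modality counterpart are pulled together with a KL term.

```python
import torch
import torch.nn.functional as F

num_ids, feat_dim = 395, 2048                       # assumed identity count / feature size
classifier = torch.nn.Linear(feat_dim, num_ids)     # FC classifier over identities

f_real = torch.randn(8, feat_dim)   # features of real images (e.g. real IR)
f_fake = torch.randn(8, feat_dim)   # features of their generated cross-modality counterparts

p_real = F.softmax(classifier(f_real), dim=1)         # predicted identity distributions
log_p_fake = F.log_softmax(classifier(f_fake), dim=1)

# KL(p_real || p_fake), averaged over the batch; a symmetric version (adding the
# reverse direction) is an equally plausible reading of Eq. (8).
loss_instance = F.kl_div(log_p_fake, p_real, reduction='batchmean')
```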

Identity-Discriminative Feature Learning. To overcome the intra-modality variation, following [25, 10], we average-pool the feature maps in the instance-level aligned space into corresponding feature vectors. Given real images, we optimize their feature vectors with a classification loss and a triplet loss.

(9)
(10)

where the classification loss uses the probability, predicted by the classifier, that an input feature vector belongs to its ground-truth identity; the triplet loss uses a positive pair of feature vectors belonging to the same person and a negative pair of feature vectors belonging to different persons, separated by a margin parameter.
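For reference, a hedged PyTorch sketch of these two identity losses is shown below; the margin value, distance metric, and the way triplets are selected are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

num_ids, feat_dim, margin = 395, 2048, 0.3          # assumed values
classifier = torch.nn.Linear(feat_dim, num_ids)

feats  = torch.randn(32, feat_dim)                  # pooled feature vectors of real images
labels = torch.randint(0, num_ids, (32,))           # identity labels

# Classification loss: softmax cross-entropy on the classifier's predictions.
loss_cls = F.cross_entropy(classifier(feats), labels)

# Triplet loss: pull a positive pair (same identity) closer than a negative pair
# (different identities) by at least `margin`. Anchors/positives/negatives are
# assumed to be pre-selected index tensors here.
anchor_idx, pos_idx, neg_idx = torch.tensor([0]), torch.tensor([1]), torch.tensor([2])
d_ap = F.pairwise_distance(feats[anchor_idx], feats[pos_idx])
d_an = F.pairwise_distance(feats[anchor_idx], feats[neg_idx])
loss_tri = F.relu(d_ap - d_an + margin).mean()

loss_id = loss_cls + loss_tri
```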

Overall Objective Function and Test

Thus, the overall objective function of our method can be formulated as below:

(11)

where each term is scaled by a corresponding weight. Following [31], some of the weights adopt their settings; one weight is set to 1 empirically, and another is decided by grid search.
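Since the symbols and weight values are missing from this copy, the overall objective presumably combines the identity losses with the generation and alignment losses in a weighted sum of roughly the following form (the lambda names are assumptions):

```latex
\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{tri}
            + \lambda_{rec}\,\mathcal{L}_{rec}
            + \lambda_{cyc}\,\mathcal{L}_{cyc}
            + \lambda_{gan}\,\mathcal{L}_{gan}
            + \lambda_{ins}\,\mathcal{L}_{ins}
```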

During the test stage, only the feature alignment module is used. Given images, we use the set-level encoder and the instance-level encoder to extract features. Finally, matching is conducted by computing the cosine similarities between the feature vectors of the probe images and those of the gallery images.
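A small PyTorch sketch of this test-time pipeline, with assumed function and variable names (extract_features, rank_gallery, etc.), might look as follows.

```python
import torch
import torch.nn.functional as F

def extract_features(images, set_encoder, instance_encoder):
    """Chain the set-level and instance-level encoders, then globally average-pool."""
    feat_maps = instance_encoder(set_encoder(images))
    return F.adaptive_avg_pool2d(feat_maps, 1).flatten(1)

def rank_gallery(probe_feats, gallery_feats):
    """Rank gallery images for each probe by cosine similarity."""
    probe_feats = F.normalize(probe_feats, dim=1)
    gallery_feats = F.normalize(gallery_feats, dim=1)
    sim = probe_feats @ gallery_feats.t()            # cosine similarity matrix
    return sim.argsort(dim=1, descending=True)       # ranked gallery indices per probe

# Example with random stand-in features:
probe, gallery = torch.randn(5, 2048), torch.randn(100, 2048)
ranking = rank_gallery(probe, gallery)               # shape: (5, 100)
```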

Experiment

| Methods | All-Search, Single-Shot (R1 / R10 / R20 / mAP) | All-Search, Multi-Shot (R1 / R10 / R20 / mAP) | Indoor-Search, Single-Shot (R1 / R10 / R20 / mAP) | Indoor-Search, Multi-Shot (R1 / R10 / R20 / mAP) |
| HOG | 2.76 / 18.3 / 32.0 / 4.24 | 3.82 / 22.8 / 37.7 / 2.16 | 3.22 / 24.7 / 44.6 / 7.25 | 4.75 / 29.1 / 49.4 / 3.51 |
| LOMO | 3.64 / 23.2 / 37.3 / 4.53 | 4.70 / 28.3 / 43.1 / 2.28 | 5.75 / 34.4 / 54.9 / 10.2 | 7.36 / 40.4 / 60.4 / 5.64 |
| Two-Stream | 11.7 / 48.0 / 65.5 / 12.9 | 16.4 / 58.4 / 74.5 / 8.03 | 15.6 / 61.2 / 81.1 / 21.5 | 22.5 / 72.3 / 88.7 / 14.0 |
| One-Stream | 12.1 / 49.7 / 66.8 / 13.7 | 16.3 / 58.2 / 75.1 / 8.59 | 17.0 / 63.6 / 82.1 / 23.0 | 22.7 / 71.8 / 87.9 / 15.1 |
| Zero-Padding | 14.8 / 52.2 / 71.4 / 16.0 | 19.2 / 61.4 / 78.5 / 10.9 | 20.6 / 68.4 / 85.8 / 27.0 | 24.5 / 75.9 / 91.4 / 18.7 |
| BCTR | 16.2 / 54.9 / 71.5 / 19.2 | - | - | - |
| BDTR | 17.1 / 55.5 / 72.0 / 19.7 | - | - | - |
| D-HSME | 20.7 / 62.8 / 78.0 / 23.2 | - | - | - |
| cmGAN | 27.0 / 67.5 / 80.6 / 27.8 | 31.5 / 72.7 / 85.0 / 22.3 | 31.7 / 77.2 / 89.2 / 42.2 | 37.0 / 80.9 / 92.3 / 32.8 |
| DRL | 28.9 / 70.6 / 82.4 / 29.2 | - | - | - |
| Ours | 38.1 / 80.7 / 89.9 / 36.9 | 45.1 / 85.7 / 93.8 / 29.5 | 43.8 / 86.2 / 94.2 / 52.9 | 52.7 / 91.1 / 96.4 / 42.7 |
Table 1: Comparison with the state-of-the-arts on SYSU-MM01 dataset. The R1, R10, R20 denote Rank-1, Rank-10 and Rank-20 accuracies (%), respectively. The mAP denotes mean average precision score (%).

Dataset and Evaluation Protocol

Dataset. We evaluate our model on two standard benchmarks, SYSU-MM01 and RegDB. (1) SYSU-MM01 [22] is a popular RGB-IR Re-ID dataset, which includes 491 identities captured by 4 RGB cameras and 2 IR cameras. The training set contains 19,659 RGB images and 12,792 IR images of 395 persons, and the test set contains 96 persons. Following [22], there are two test modes, i.e. the all-search mode and the indoor-search mode. For the all-search mode, all images are used; for the indoor-search mode, only the images from the indoor cameras are used. For both modes, the single-shot and multi-shot settings are adopted, where 1 or 10 images of a person are randomly selected to form the gallery set. Both modes use IR images as the probe set and RGB images as the gallery set. (2) RegDB [15] contains 412 persons, where each person has 10 images from a visible camera and 10 images from a thermal camera.

Evaluation Protocols. The Cumulative Matching Characteristic (CMC) and mean average precision (mAP) are used as evaluation metrics. Following [22], the results on SYSU-MM01 are evaluated with the official code and averaged over 10 repeated random splits of the gallery and probe sets. Following [23, 24], the results on RegDB are averaged over 10 repeated random splits of the training and testing sets.
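For readers unfamiliar with these metrics, the sketch below computes a single-query CMC match vector and average precision in NumPy; it is a simplified illustration and omits the camera-filtering rules of the official SYSU-MM01 protocol.

```python
import numpy as np

def evaluate_single_query(sim, q_label, g_labels):
    """CMC match vector and average precision for one probe, given its
    similarity scores `sim` to all gallery samples."""
    order = np.argsort(-sim)                          # gallery sorted by similarity
    matches = (g_labels[order] == q_label).astype(np.int32)
    if matches.sum() == 0:                            # no correct match in gallery
        return None, None

    # CMC: 1 at every rank from the first correct match onward.
    first_hit = np.argmax(matches)
    cmc = np.zeros_like(matches)
    cmc[first_hit:] = 1

    # AP: mean of precision values at each correct-match position.
    hit_positions = np.where(matches == 1)[0]
    precisions = (np.arange(len(hit_positions)) + 1) / (hit_positions + 1)
    return cmc, precisions.mean()

# Toy example: one probe of identity 3 against a gallery of 6 images.
sim = np.array([0.2, 0.9, 0.1, 0.8, 0.3, 0.7])
g_labels = np.array([1, 3, 2, 5, 3, 4])
cmc, ap = evaluate_single_query(sim, q_label=3, g_labels=g_labels)
print(cmc[:5], ap)   # Rank-k accuracy and mAP come from averaging over all probes
```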

Implementation Details

In the generation module, following [16], we construct our modality-specific encoders with 2 strided convolutional layers followed by a global average pooling layer and a fully-connected layer. For the decoders, following [20], we use 4 residual blocks with Adaptive Instance Normalization (AdaIN) and 2 upsampling convolutional layers, where the parameters of AdaIN are dynamically generated from the modality-specific features. For the GAN loss, we train the discriminators with the LSGAN objective [14] to stabilize training.

In the feature alignment module, for a fair comparison, we adopt ResNet-50 [9] pre-trained on ImageNet [17] as our CNN backbone. Specifically, we use the first two layers of ResNet-50 as our set-level encoder, and the remaining layers as our instance-level encoder. For the classification loss, the classifier takes the feature vectors as inputs, followed by batch normalization, a fully-connected layer and a softmax layer to predict the inputs' labels.

We implement our model with the open-source deep learning framework PyTorch. The training images are resized and augmented with horizontal flipping. The batch size is set to 128 (16 persons, each with 4 RGB images and 4 IR images). We optimize our framework using Adam with a learning rate of 0.0002. The generation module is first pre-trained for 100 epochs; then the overall framework is jointly optimized for 50 epochs, with the learning rate decayed at epoch 30.
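The identity-balanced, cross-modality batch construction and optimizer setup described above could be sketched as follows; the data structures and function names are assumptions, and the Adam betas (missing from this copy) are left at PyTorch defaults as a placeholder.

```python
import random
import torch

def sample_batch(rgb_by_id, ir_by_id, num_ids=16, per_modality=4):
    """Sample 16 identities, each with 4 RGB and 4 IR image indices (128 images total).
    rgb_by_id / ir_by_id: dict mapping identity -> list of image indices."""
    ids = random.sample(list(rgb_by_id.keys()), num_ids)
    rgb_batch, ir_batch, labels = [], [], []
    for pid in ids:
        rgb_batch += random.choices(rgb_by_id[pid], k=per_modality)  # with replacement
        ir_batch  += random.choices(ir_by_id[pid], k=per_modality)
        labels    += [pid] * per_modality
    return rgb_batch, ir_batch, labels

# Adam with the stated learning rate; betas use PyTorch defaults here.
model = torch.nn.Linear(10, 10)          # stand-in for the full framework
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
```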

Results on SYSU-MM01 Datasets

We compare our model with 10 methods including hand-crafted features (HOG [5], LOMO [12]), feature learning with the classification loss (One-Stream, Two-Stream, Zero-Padding) [22], feature learning with both classification and ranking losses (BCTR, BDTR) [23], metric learning (D-HSME [8]), and reducing distribution divergence of features (cmGAN [4], DRL [21]). The experimental results are shown in Table 1.

Firstly, LOMO achieves only 3.64% Rank-1 and 4.53% mAP, which shows that hand-crafted features do not generalize to the RGB-IR Re-ID task. Secondly, One-Stream, Two-Stream and Zero-Padding significantly outperform the hand-crafted features by at least 8% Rank-1 and 8.3% mAP, which verifies that the classification loss contributes to learning identity-discriminative features. Thirdly, BCTR and BDTR further improve Zero-Padding by 1.4% Rank-1 and 3.2% mAP, showing that the ranking and classification losses are complementary. Additionally, D-HSME outperforms BDTR by 3.6% Rank-1 and 3.5% mAP, which demonstrates the effectiveness of metric learning. DRL outperforms D-HSME by 8.1% Rank-1 and 6.0% mAP, implying the effectiveness of adversarial training. Finally, our method outperforms the state-of-the-art method by 9.2% Rank-1 and 7.7% mAP, showing the effectiveness of our model for the RGB-IR Re-ID task.

Results on RegDB Dataset

| Methods | thermal2visible (Rank-1 / mAP) | visible2thermal (Rank-1 / mAP) |
| Zero-Padding | 16.7 / 17.9 | 17.8 / 31.9 |
| TONE | 21.7 / 22.3 | 24.4 / 20.1 |
| BCTR | - | 32.7 / 31.0 |
| BDTR | 32.8 / 31.2 | 33.5 / 31.9 |
| DRL | 43.4 / 44.1 | 43.4 / 44.1 |
| Ours | 48.1 / 48.9 | 48.5 / 49.3 |
Table 2: Comparison with state-of-the-arts on the RegDB dataset under different query settings.

We evaluate our model on the RegDB dataset and compare it with Zero-Padding [22], TONE [24], BCTR [23], BDTR [24] and DRL [21]. We adopt the visible2thermal and thermal2visible modes, where visible2thermal means that visible images form the query set and thermal images form the gallery set, and vice versa. As shown in Table 2, our model significantly outperforms the state-of-the-art by 4.7% and 5.1% Rank-1 in the thermal2visible and visible2thermal modes, respectively. Overall, the results verify the effectiveness of our model.

Model Analysis

| index | SL | IL | R1 | R10 | R20 | mAP |
| 1 | ✗ | ✗ | 32.1 | 75.7 | 87.0 | 31.9 |
| 2 | ✓ | ✗ | 35.1 | 78.6 | 88.2 | 33.8 |
| 3 | ✗ | ✓ | 36.0 | 79.8 | 89.0 | 35.5 |
| 4 | ✓ | ✓ | 38.1 | 80.7 | 89.9 | 36.9 |
| 5 | ✓ (w/o disentanglement) | ✓ | 36.8 | 80.2 | 89.4 | 36.0 |
Table 3: Analysis of set-level (SL) and instance-level (IL) alignment. Please see text for more details.

Ablation Study. To further analyze the effectiveness of the set-level alignment and the instance-level alignment, we evaluate our method under four different settings, i.e. with or without set-level (SL) and instance-level (IL) alignment. Specifically, when removing the set-level alignment, we use a separate set-level encoder, i.e. we do not share its weights with the modality-invariant encoder. When removing the instance-level alignment, we set the weight of the instance-level loss to zero. Moreover, to analyze whether the feature disentanglement contributes to set-level alignment, we remove the disentanglement strategy by using a separate set-level encoder and training it with a GAN loss as in [4].

As shown in Table 3, when removing both SL and IL (index-1), our method only achieves a 32.1% Rank-1 score. By adding SL (index-2) or IL (index-3), the performance is improved to 35.1% and 36.0% Rank-1, respectively, which demonstrates the effectiveness of both SL and IL. When using both SL and IL (index-4), our method achieves its best performance of 38.1% Rank-1, which demonstrates that SL and IL are complementary to each other. Finally, when removing the disentanglement from the set-level alignment (index-5), the Rank-1 score drops by 1.3%. This illustrates that the disentanglement strategy is helpful for learning set-level alignment.

Figure 4: Distribution of cross-modality similarities of intra-person and inter-person pairs. The instance-level alignment (IL) can enhance intra-person similarity while keeping inter-person similarity unchanged, which improves performance. Note that w/ means with and w/o means without. Please see text for more details.

To better understand set-level alignment (SL) and instance-level alignment (IL), we visualize the distributions of intra-person similarity and inter-person similarity under different variants. The similarity is calculated with the cosine distance. Firstly, comparing Figure 4(a) and Figure 4(b), we find that even without SL and IL the model can easily fit the training set, while it fails to generalize to the test set: as shown in Figure 4(b), the two kinds of similarities overlap heavily. This shows that the cross-modality variation cannot be well reduced by simply fitting the identity information of the training set. Secondly, in Figure 4(c), although the intra-person similarity becomes more concentrated, the inter-person similarity also becomes larger. This shows that SL introduces some misalignment of instances, which may harm the performance. Finally, in Figure 4(d) we can see that IL boosts the intra-person similarity while keeping the inter-person similarity unchanged, which illustrates that IL explicitly reduces the instance-level misalignment. In summary, the experimental results and analysis above show the importance and effectiveness of instance-level alignment.

Figure 5: Rank-1 and mAP scores with different weight values on SYSU-MM01 under the single-shot & all-search mode.

Parameters Analysis. We evaluate the effect of the loss weights. As shown in Figure 5, we vary the weight on the SYSU-MM01 dataset under the single-shot & all-search mode and find that, with different values, our method stably achieves a significant improvement. The experimental results show that our method is robust to different weights.

Visualization of Images

Figure 6: Comparison among images generated by our method, CycleGAN [31] and StarGAN [3]. Ours stably generates images paired with the given real ones, while CycleGAN and StarGAN fail.

In this part, we display the generated cross-modality paired-images from our method, CycleGAN [31] and StarGAN [3]. From Figure 6(a), we can see that the images of a person in the two modalities are significantly different; even humans cannot easily identify them. In Figure 6(b), our method stably generates fake images when given cross-modality unpaired-images of a person. For example, for person A, ours translates her IR images to RGB versions with the right colors (yellow upper clothes and black bottom clothes). However, in Figure 6(c) and Figure 6(d), CycleGAN and StarGAN cannot learn the right colors or even the right poses. For example, person B should have blue upper clothing, but the images generated by CycleGAN and StarGAN are red and black, respectively. Such unsatisfying images cannot be used to learn instance-level aligned features.

Conclusion

In this paper, we propose a novel Joint Set-level and Instance-level Alignment Re-ID (JSIA-ReID) framework. On the one hand, our model performs set-level alignment by disentangling modality-specific and modality-invariant features. Compared with vanilla methods, ours can explicitly remove the modality-specific information and significantly reduce the modality-gap. On the other hand, given cross-modality unpaired images, we can generate cross-modality paired-images by exchanging their features. With the paired-images, instance-level variations can be reduced by minimizing the distances between every pair of images. Finally, together with the re-id loss, our model can learn both modality-aligned and identity-discriminative features. Experimental results on two datasets show the effectiveness of our proposed method.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61720106012, 61533016 and 61806203, the Strategic Priority Research Program of Chinese Academy of Science under Grant XDBS01000000, and the Beijing Natural Science Foundation under Grant L172050.

References

  • [1] W. Chang, H. Wang, W. Peng, and W. Chiu (2019) All about structure: adapting structural information across domains for boosting semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1900–1909.
  • [2] Q. Chen, Y. Liu, Z. Wang, I. J. Wassell, and K. Chetty (2018) Re-weighted adversarial adaptation network for unsupervised domain adaptation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7976–7985.
  • [3] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
  • [4] P. Dai, R. Ji, H. Wang, Q. Wu, and Y. Huang (2018) Cross-modality person re-identification with generative adversarial training. In IJCAI 2018: 27th International Joint Conference on Artificial Intelligence, pp. 677–683.
  • [5] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), Vol. 1, pp. 886–893.
  • [6] S. Gong, M. Cristani, S. Yan, and C. C. Loy (2014) Person re-identification.
  • [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
  • [8] Y. Hao, N. Wang, J. Li, and X. Gao (2019) HSME: hypersphere manifold embedding for visible thermal person re-identification. In AAAI-19 AAAI Conference on Artificial Intelligence.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [10] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • [11] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976.
  • [12] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2197–2206.
  • [13] L. Ma, Q. Sun, S. Georgoulis, L. V. Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 99–108.
  • [14] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley (2016) Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076.
  • [15] D. T. Nguyen, H. G. Hong, K. Kim, and K. R. Park (2017) Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 17 (3), pp. 605.
  • [16] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations.
  • [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [18] G. Wang, Y. Yang, J. Cheng, J. Wang, and Z. Hou (2019) Color-sensitive person re-identification. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 933–939.
  • [19] G. Wang, T. Zhang, J. Cheng, S. Liu, Y. Yang, and Z. Hou (2019) RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In The IEEE International Conference on Computer Vision (ICCV).
  • [20] H. Wang, X. Liang, H. Zhang, D. Yeung, and E. P. Xing (2017) ZM-Net: real-time zero-shot image manipulation network. arXiv preprint arXiv:1703.07255.
  • [21] Z. Wang, Z. Wang, Y. Zheng, Y. Chuang, and S. Satoh (2019) Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 618–626.
  • [22] A. Wu, W. Zheng, H. Yu, S. Gong, and J. Lai (2017) RGB-infrared cross-modality person re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5390–5399.
  • [23] M. Ye, X. Lan, J. Li, and P. C. Yuen (2018) Hierarchical discriminative learning for visible thermal person re-identification. In AAAI-18 AAAI Conference on Artificial Intelligence, pp. 7501–7508.
  • [24] M. Ye, Z. Wang, X. Lan, and P. C. Yuen (2018) Visible thermal person re-identification via dual-constrained top-ranking. In IJCAI 2018: 27th International Joint Conference on Artificial Intelligence, pp. 1092–1099.
  • [25] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984.
  • [26] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz (2019) Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2138–2147.
  • [27] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717.
  • [28] Z. Zhong, L. Zheng, S. Li, and Y. Yang (2018) Generalizing a person retrieval model hetero- and homogeneously. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 176–192.
  • [29] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification.
  • [30] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang (2018) Camera style adaptation for person re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5157–5166.
  • [31] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251.