RGB-infrared cross-modality person re-identification (ReID) aims to match images of the same identity across two modalities, the visible (RGB) modality and the infrared (IR) modality. With the widespread use of near-infrared cameras, it has attracted increasing attention and has great application value in nighttime surveillance. Because the two modalities differ greatly (e.g., in color and texture), reducing the modality discrepancy of the same identity is essential for solving the cross-modality person ReID task.
Existing methods mainly follow two strategies. The first strategy uses a two-stream architecture to eliminate the discrepancy between the two modalities. Modality-sharable features are extracted via the weight-specific network, and then discriminative features for matching are embedded into the same feature space via the weight-shared network (Ye et al., 2018a, b, 2020a; Hao et al., 2019; Liu et al., 2020). However, these methods mainly focus on feature learning in the final common feature space while ignoring the single-modality feature spaces. Little work directly enhances modality-sharable features in each single-modality feature space. Moreover, although these methods narrow the distance of the same identity between the two modalities in the feature space, they ignore the feature distance among different identities, which makes different identities difficult to distinguish. Little work efficiently constrains the overall feature distribution. The second strategy extracts modality-specific features and transforms them from one modality to another with a generator or a decoder (Dai et al., 2018; Wang et al., 2019a, b; Choi et al., 2020; Wang et al., 2020a). However, in the RGB-IR person ReID task, objects with the same color in the RGB image have different appearances in the IR image. The lack of a stable mapping relationship and unavoidable noise make image generation unreliable. Besides, methods of this kind usually converge very slowly.
In light of the above observations, we design a multi-feature space joint optimization (MSO) network that mines more modality-sharable features in each single-modality feature space and learns discriminative features without modality transfer. For the RGB-IR cross-modality task, information such as color can widen the modality gap, while the outline information of the body and hair is an important cue because it does not change between modalities (i.e., it is modality-invariant). Some works focus on eliminating the negative influence of color information. In our work, considering that edge features can effectively describe the outline information, we design an edge features enhancement module. We introduce this module into the weight-specific network layer, which improves modality-sharable feature extraction in each single-modality feature space. Moreover, we analyze five edge fusion strategies in Section 4.4 and select the perceptual edge features (PEF) loss as the final enhancement method. Our PEF loss lets features focus on more edge information while retaining other useful information through perceptual losses (Johnson et al., 2016). After the weight-shared network layer, we introduce a novel cross-modality contrastive-center (CMCC) loss into the modality-joint constraints to increase the difference between cross-modality distance and class distance. Thus, in the common feature space, features from different identities are more distinguishable while features of the same identity remain compact. With these two modules, the MSO model simultaneously optimizes the RGB and IR feature spaces and the common space in an end-to-end manner, as shown in Figure 1.
Through comparison with state-of-the-art methods, the proposed model shows excellent performance. In conclusion, the major contributions of our paper can be summarized as follows:
We propose a novel multi-feature space joint optimization network that optimizes both the single-modality feature spaces and the common feature space for effective RGB-IR cross-modality person ReID. Extensive experiments show that the proposed model significantly outperforms state-of-the-art methods.
The perceptual edge features loss is proposed as an edge features enhancement module to preserve the edge information of each modality. To our knowledge, this is the first work that proposes an explicit optimization in the single-modality feature space for the cross-modality person ReID task.
The cross-modality contrastive-center loss is introduced into the modality-joint constraints to learn a more suitable distribution in feature space, which has compact intra-class distribution and sparse inter-class distribution.
2. Related Work
2.1. RGB-IR Cross-Modality Person ReID
RGB-IR cross-modality person ReID aims to match person images captured by visible cameras and infrared cameras. Compared with RGB-RGB single-modality person ReID, which only deals with visible images, the challenge of this task is to bridge the huge gap between two different modalities. Existing work addresses the problem from different perspectives. Some early work focused on resolving the channel mismatch between RGB and IR images, because RGB images have three channels (i.e., red, green, and blue) whereas IR images have only one. Wu et al. (Wu et al., 2017) proposed a deep zero-padding method to obtain a uniform number of channels. Kang et al. (Kang et al., 2019) transformed RGB images into gray images and then proposed three different ways to combine gray images and IR images as a single input. However, these hand-designed methods did not work well. Researchers then repeated the single channel of IR images three times as input and paid more attention to discriminative feature learning.
Nowadays, most methods use a two-stream architecture to eliminate the discrepancy between the two modalities. These methods learn modality-specific features through shallow network layers that do not share weights and extract modality-sharable features through deep layers with shared weights. Ye et al. (Ye et al., 2018a) learned multi-modality sharable feature representations with a two-stream CNN network, while Liu et al. (Liu et al., 2020) fused mid-level features from middle layers with the final feature of the backbone. The MAC method (Ye et al., 2019) was proposed on top of a two-stream network to capture modality-specific information. However, these methods only focus on feature learning in the final common feature space. Our model optimizes both the single-modality feature spaces and the common feature space with an edge features enhancement module and the modality-joint constraints.
Other researchers use Generative Adversarial Networks (GANs) (Goodfellow et al., 2020; Mirza and Osindero, 2014; Arjovsky et al., 2017; Deng et al., 2018; Isola et al., 2016; Zhu et al., 2017) to transform one modality into another. The Hi-CMD model extracts pose-invariant and illumination-invariant features (Choi et al., 2020). Dai et al. (Dai et al., 2018) generated modality-sharable representations through a minimax game between a generator and a discriminator. A dual-level discrepancy reduction learning scheme was proposed to project inputs from image space to feature space (Wang et al., 2019b). Wang et al. (Wang et al., 2019a) generated fake IR images, which are compared with real IR images, through joint pixel alignment and feature alignment. Cross-modality paired images were also generated with an instance-level alignment model to reduce the gap between modalities (Zhang et al., 2020). However, GAN-based models introduce unavoidable noise, which makes fake image generation unreliable.
Considering that correlations exist between the two modalities, Zhang et al. (Zhang et al., 2019) obtained contrastive features with their proposed contrastive correlation. Feng et al. (Feng et al., 2020) imposed a cross-modality Euclidean constraint and identity loss to generate modality-invariant features. Lu et al. (Lu et al., 2020) decoupled features through a sharable-specific feature transfer network and then fused them with a GCN (Kipf and Welling, 2017). Kansal et al. (Kansal et al., 2020) learned spectrum-related information and obtained a spectrum-disentangled representation. Wei et al. (Wei et al., 2020) proposed an attention-based approach with multiple feature fusion. Sharable-specific feature transfer and fusion methods of this kind need additional modality information during testing to ensure performance. Li et al. (Li et al., 2020) introduced a third X modality into the feature space. Ye et al. (Ye et al., 2021b) generated a third auxiliary grayscale modality from the homogeneous visible images. Ling et al. (Ling et al., 2020) used class-aware modality mix to generate mixed samples that reduce the modality gap at the pixel level. Considering the similarities among gallery samples of the RGB modality, Jia et al. (Jia et al., 2020) proposed the similarity inference metric as a re-ranking method and thus obtained relatively high performance in the multi-shot setting.
2.2. Loss in Cross-Modality Person ReID
The most commonly used loss for cross-modality person ReID is the identity loss (Zheng et al., 2017), which treats ReID as an identity classification problem during training (Zheng et al., 2021; Bao et al., 2021; Gan et al., 2016). Contrastive loss (Varior et al., 2016) and triplet loss (Hermans et al., 2017; Ye et al., 2021a; Yuan et al., 2020; He et al., 2018) focus on the distance between image pairs. Some loss functions are designed for specific tasks (Ye et al., 2018b, 2020a; Gan et al., 2016; Wang et al., 2020b; Gan et al., 2019). Hao et al. (Hao et al., 2019) used hypersphere manifold embedding with sphere loss (Liu et al., 2017; Fan et al., 2019; Wang et al., 2017) for metric learning. Similar to triplet loss, HP loss (Zhao et al., 2019) minimizes the distance between the cross-modality positive pair and maximizes the distance between the cross-modality negative pair. Hetero-center (HC) loss (Zhu et al., 2020) improves the intra-class cross-modality similarity without considering the inter-class cross-modality similarity. Ling et al. (Ling et al., 2020) optimized the network with identity classification loss, KL-divergence loss, and center-guided metric learning loss. Results reported for these losses suggest that performance is best when individual samples and whole sample sets are constrained simultaneously. Inspired by these observations, we introduce a novel cross-modality contrastive-center loss that, together with identity loss and weighted regularization triplet loss (Ye et al., 2020b), forms the modality-joint constraints for RGB-IR cross-modality person ReID. With the cross-modality contrastive-center loss, features from different identities are more distinguishable while features of the same identity remain compact.
3. The Proposed Method
In this section, we introduce the proposed multi-feature space joint optimization (MSO) network in detail. First, we describe the network structure in subsection 3.1 and explain the feature extraction process and the location of the two proposed modules. Then we introduce the two proposed losses, the perceptual edge features (PEF) loss (subsection 3.2) and the cross-modality contrastive-center (CMCC) loss (subsection 3.3), which optimize the single-modality feature spaces and the common feature space respectively. Finally, in subsection 3.4, we present and explain the complete loss function used during training.
3.1. Overall Model Structure
Our method mainly focuses on loss-based multi-feature space joint optimization. On the model side, we adopt the structural design of a previous method, the two-stream AGW ReID model (Ye et al., 2020b). The two-stream structure is integrated into our MSO network with ResNet-50 as the backbone. As illustrated in Figure 2a, we divide the proposed network into three parts: single space feature extraction, common space feature embedding, and modality-joint constraints.
Let , denote the RGB modality and IR modality respectively. Let denotes the RGB image set and the IR image set, where , and are the height and the width of images. Each IR image also contains three channels by repeating its single channel three times as input. Suppose there are images in one batch during training. represents the th image in an input batch, where . During training, each or flows into its corresponding branch in the part of single space feature extraction (Figure 2a). Let and denote the unshared layer0 of ResNet-50, the shallow layer features of each modality is extracted and defined as follows:
denotes the modality-specific features, where . Our proposed perceptual edge features (PEF) loss is designed to enhance the extracted features in single space.
After that, the modality-specific features of both modalities are sent to the common space feature embedding part, where we use layer1 to layer4 of ResNet-50 to obtain modality-sharable features. Unlike layer0, which is applied separately to each modality, the feature embedding uses a shared structure to extract modality-sharable features. Following (Ye et al., 2020b), non-local attention blocks (Wang et al., 2018) are inserted at the same positions. The resulting feature maps then pass through a generalized-mean (GeM) pooling layer (Ye et al., 2020b), which yields a 2048-dimensional feature vector for each input image. In the modality-joint constraints, these 2048-dimensional vectors pass through a BN layer (Ioffe and Szegedy, 2015), an FC layer, and a Softmax operation in turn. The identity (ID) loss (Zheng et al., 2017), weighted regularization triplet (WRT) loss (Ye et al., 2020b), and our cross-modality contrastive-center (CMCC) loss are calculated on these vectors. During testing, we send query and gallery images into their corresponding network branches. Following (Ye et al., 2020b), we use the features after the BN layer to calculate similarities between query and gallery.
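As a rough illustration of the pooling step, generalized-mean pooling in its commonly used form raises each activation to a power p, averages over the spatial positions, and takes the p-th root. The NumPy sketch below is only that generic form; the function name `gem_pool`, the default `p = 3`, and the `eps` clamp are illustrative assumptions and are not taken from this paper.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the spatial axes of a (C, H, W) array.

    p and eps are assumed values chosen for illustration only.
    """
    # Clamp to a small positive value so the fractional power is well defined.
    clamped = np.clip(feature_map, eps, None)
    # Power mean over H and W: p = 1 gives average pooling,
    # and large p approaches max pooling.
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)
```

With p = 1 the result equals plain average pooling, which makes the sketch easy to sanity-check against `feature_map.mean(axis=(1, 2))` for positive inputs.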
3.2. Perceptual Edge Features Loss
Considering that modality-specific information harms the RGB-IR person ReID task and is not easy to eliminate, we look for prior information that can bring the RGB and IR modality spaces closer together. Because outline information is modality-invariant and can be described by edge features, we use edge information as a self-supervised learning guide for feature enhancement. We propose the perceptual edge features (PEF) loss to enhance the modality-sharable features and introduce it into single space feature extraction, as shown in Figure 2.
PEF loss constrains the features of each modality obtained by the unshared layer0 of ResNet-50. Specifically, we obtain edge features with the Sobel convolution module to enhance the modality-sharable features, as shown in Figure 2b. As shown in Figure 3, the Sobel convolution module uses four classical Sobel operators (Sobel and Feldman, 1973) as convolution kernels. Given an input image, it outputs four channels of edge features, one per direction. The final edge features are obtained by adding these four channels together. We measure the perceptual differences between shallow features and edge features for each image. As shown in Figure 2b, the perceptual losses are computed by block1 to block4 of the VGG-16 network (Simonyan and Zisserman, 2015). Moreover, the VGG-16 network is pretrained on the ImageNet dataset and its weights are kept frozen during the training of our proposed MSO model. Let
represent the loss network as shown in the Figure2b, represents the feature maps of input with shape , which is obtained by th block of the network. Let and represent the edge features extracted by sobel convolution module. can be formulated by:
When inputting and into the loss network, is calculated by mean square error (MSE) loss function:
During training, as the PEF loss converges, the feature maps of the unshared layer0 are encouraged to resemble the edge information at the perceptual level. The outline information is thereby directly enhanced, so the trained network can extract more modality-sharable features in the feature space of each modality.
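The Sobel convolution module described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the four kernels are the usual horizontal, vertical, and two diagonal Sobel operators, the input is a single-channel image, borders use replicate padding, and the four responses are summed as signed values. The paper does not spell out these details here, and the names `SOBEL_KERNELS`, `correlate2d_same`, and `sobel_edges` are hypothetical. The frozen VGG-16 perceptual comparison is not covered.

```python
import numpy as np

# Four 3x3 Sobel operators (assumed: horizontal, vertical, two diagonals).
SOBEL_KERNELS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], dtype=float),
    np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], dtype=float),
]

def correlate2d_same(img, kernel):
    """Cross-correlate a 2D image with a small kernel, keeping the input size."""
    kh, kw = kernel.shape
    # Replicate-pad the border (an assumption) so flat regions give zero response.
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def sobel_edges(gray):
    """Return the four directional responses and their sum for a (H, W) image."""
    responses = np.stack([correlate2d_same(gray, k) for k in SOBEL_KERNELS])
    return responses, responses.sum(axis=0)
```

Summing absolute responses instead of signed ones would be an equally plausible reading of "adding these four channels together"; the signed sum is used here only because it follows the wording most literally.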
3.3. Cross-Modality Contrastive-Center Loss
Although the difficulty of cross-modality tasks lies in narrowing the distance between features of different modalities, we should also pay special attention to the separation between features of different identities; otherwise, recognition accuracy suffers. Inspired by the contrastive loss (Varior et al., 2016), we introduce a novel cross-modality contrastive-center (CMCC) loss into the modality-joint constraints. By using centers, CMCC loss avoids the hard sample mining mechanism. Unlike the normal contrastive-center loss, our CMCC loss only constrains the distance among different centers and further optimizes the feature distribution in the common feature space. It is easy to calculate and significantly improves performance as a supplement to other losses.
Suppose there are person identities, and each person has RGB images and IR images in a batch. Let represent the center of different modalities of person , is the ID label of person, where , n is the number of person identities in a training batch. For each input , we obtain feature by BN layer and -norm after feature embedding. can be formulated as:
For each identity in a training batch, the feature center is the mean value of and cause the number of RGB and IR images are the same. is the Euclidean distance between the centers of different modalities of ( and ). represents all the Euclidean distances between the and each feature center of other identities in the batch. As shown in Figure 2c, for any identity, should be far less than each . The contrastive differences between and can be calculated as the distance of and ( represent the minimum of all ). In other words, should be far less than . In each batch, the can be formulated by:
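By optimizing the CMCC loss, we decrease the distance between the two modality centers of each identity while simultaneously increasing its minimum distance to the centers of other identities. Thus, the CMCC loss makes images from different identities more distinguishable while keeping features of the same identity compact.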
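The two distances that the CMCC loss contrasts can be computed as in the NumPy sketch below. It follows the description above: features are L2-normalized, each identity gets an RGB center and an IR center, the identity center is their mean, and the intra-identity distance is compared against the minimum distance to any other identity center. The final `cmcc_hinge` function is only an assumed margin-based stand-in, since the exact formula is not reproduced in this text; the function names and the `margin` value are hypothetical.

```python
import numpy as np

def cmcc_distances(feats, labels, is_rgb):
    """Per-identity cross-modality center distance and min inter-identity distance.

    feats: (N, D) features, labels: (N,) identity ids, is_rgb: (N,) bool mask.
    Assumes every identity has at least one RGB and one IR sample in the batch.
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize
    ids = np.unique(labels)
    rgb_c = np.stack([feats[(labels == i) & is_rgb].mean(axis=0) for i in ids])
    ir_c = np.stack([feats[(labels == i) & ~is_rgb].mean(axis=0) for i in ids])
    centers = 0.5 * (rgb_c + ir_c)  # equal RGB/IR counts per identity
    d_intra = np.linalg.norm(rgb_c - ir_c, axis=1)
    pairwise = np.linalg.norm(centers[:, None] - centers[None], axis=2)
    np.fill_diagonal(pairwise, np.inf)  # ignore distance to itself
    d_inter_min = pairwise.min(axis=1)
    return d_intra, d_inter_min

def cmcc_hinge(d_intra, d_inter_min, margin=0.5):
    """ASSUMED illustrative form: penalize identities whose cross-modality
    center distance is not smaller than the nearest other center by `margin`."""
    return float(np.mean(np.maximum(0.0, margin + d_intra - d_inter_min)))
```

Because only centers enter the computation, no hard sample mining is needed, which matches the motivation given for the loss.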
3.4. Cross-Modality Feature Learning
In the modality-joint constraints, the Identity (ID) loss and weighted regularization triplet (WRT) loss are combined with CMCC loss to learn more discriminative modality-sharable feature. We calculate ID loss, WRT loss with the vectors of different positions as shown in Figure 2a. After Softmax operation, we get to represent the classification outputs of the image , and each corresponds to an one-hot label list . We define and as follows:
Where and N is the number of person identities. ID loss () can be expressed as
Where represents a triple sample in training batch, containing one anchor sample , one positive sample with the same identity, and one negative sample from a different identity. and are formulated by Equation 9 and is the Euclidean distance between the feature vectors. is the set of all distances between every positive pair and is the negative set.
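For orientation, the weighted regularization triplet loss of (Ye et al., 2020b) is usually written as a softplus of the difference between a softmax-weighted sum of positive distances and a softmin-weighted sum of negative distances. The NumPy sketch below implements that commonly cited form; since the equation itself is not reproduced in this text, the exact weighting should be treated as an assumption, and excluding the anchor from its own positive set is a choice made here for clarity.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def wrt_loss(feats, labels):
    """Weighted regularization triplet loss over a batch (assumed form).

    feats: (N, D) feature vectors, labels: (N,) identity ids.
    """
    dist = np.linalg.norm(feats[:, None] - feats[None], axis=2)  # Euclidean
    losses = []
    for i in range(len(labels)):
        same = labels == labels[i]
        pos = same.copy()
        pos[i] = False            # exclude the anchor itself
        neg = ~same
        if not pos.any() or not neg.any():
            continue
        d_pos, d_neg = dist[i, pos], dist[i, neg]
        # Far positives and near negatives receive the largest weights.
        weighted_pos = np.sum(_softmax(d_pos) * d_pos)
        weighted_neg = np.sum(_softmax(-d_neg) * d_neg)
        losses.append(np.log1p(np.exp(weighted_pos - weighted_neg)))
    return float(np.mean(losses)) if losses else 0.0
```

The loss is close to zero when every anchor is much nearer to its positives than to its negatives, and grows roughly linearly with the violation otherwise.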
The total loss () consists of PEF loss (), ID loss (), WRT loss (), and CMCC loss ():
4.1. Datasets and Settings
The proposed method is evaluated on the public datasets SYSU-MM01 (Wu et al., 2017) and RegDB (Nguyen et al., 2017). SYSU-MM01 is the largest RGB-IR ReID dataset used by mainstream methods for this task. It contains 491 identities, with 29,033 RGB images and 15,712 IR images taken by 4 RGB cameras and 2 IR cameras in indoor and outdoor environments. The training set contains 395 identities, including 22,258 RGB images and 11,909 IR images, while the test set has 96 identities with 3,803 IR images for the query set. For the gallery, 301 or 3,010 (single-shot or multi-shot) RGB images are randomly selected. Following (Wu et al., 2017), there are two evaluation modes for RGB-IR ReID: all-search with all images and indoor-search with only indoor images. The RegDB dataset consists of 412 identities with 10 RGB images and 10 infrared images each, collected by a pair of aligned far-infrared and visible cameras. Following the partition strategy of previous methods (Wang et al., 2019a; Dai et al., 2018; Hao et al., 2019), this dataset is divided equally into two halves for training and testing. That is, the training set and the test set each have 2,060 RGB and 2,060 infrared images. Testing has two evaluation modes. In the Visible to Thermal mode, all RGB images in the test set are used as queries and all infrared images as the gallery; in the Thermal to Visible mode, the roles are reversed.
|Methods||Venue||All-search single-shot R1||R10||mAP||All-search multi-shot R1||R10||mAP||Indoor-search single-shot R1||R10||mAP||Indoor-search multi-shot R1||R10||mAP|
|Two-Stream(Wu et al., 2017)||ICCV 17||11.65||47.99||12.85||16.33||58.35||8.03||15.60||61.18||21.49||22.49||72.22||13.92|
|One-Stream(Wu et al., 2017)||ICCV 17||12.04||49.68||13.67||16.26||58.14||8.59||16.94||63.55||22.95||22.62||71.74||15.04|
|Zero-Padding(Wu et al., 2017)||ICCV 17||14.80||54.12||15.95||19.13||61.40||10.89||20.58||68.38||26.92||24.43||75.86||18.64|
|BDTR(Ye et al., 2018b)||IJCAI 18||17.01||55.43||19.66||-||-||-||-||-||-||-||-||-|
|HSME(Hao et al., 2019)||AAAI 19||18.03||58.31||19.98||-||-||-||-||-||-||-||-||-|
|D-HSME(Hao et al., 2019)||AAAI 19||20.68||62.74||23.12||-||-||-||-||-||-||-||-||-|
|SDL(Kansal et al., 2020)||TCSVT 20||28.12||70.23||29.01||-||-||-||32.56||80.45||39.56||-||-||-|
|DGD+MSR(Feng et al., 2020)||TIP 19||37.35||83.40||38.11||43.86||86.94||30.48||39.64||89.29||50.88||46.56||93.57||40.08|
|EDFL(Liu et al., 2020)||NeuroC 20||36.94||84.52||40.77||-||-||-||-||-||-||-||-||-|
|HPILN(Zhao et al., 2019)||IET-IPR 19||41.36||84.78||42.95||47.56||88.13||36.08||45.77||91.82||56.52||53.05||93.71||47.48|
|AGW(Ye et al., 2020b)||TPAMI 21||47.50||-||47.65||50.87||-||40.03||54.17||-||62.97||-||-||-|
|TSLFN+HC(Zhu et al., 2020)||NeuroC 20||56.96||91.50||54.95||62.09||93.74||48.02||59.74||92.07||64.91||69.76||95.85||57.81|
|cmGAN(Dai et al., 2018)||IJCAI18||26.97||67.51||27.80||31.49||72.74||22.27||31.63||77.23||42.19||37.00||80.94||32.76|
|RL(Wang et al., 2019b)||CVPR 19||28.90||70.60||29.20||-||-||-||-||-||-||-||-||-|
|Hi-CMD(Choi et al., 2020)||CVPR 20||34.94||77.58||35.94||-||-||-||-||-||-||-||-||-|
|JSIA(Wang et al., 2020a)||AAAI 20||38.10||80.70||36.90||45.10||85.70||29.50||43.80||86.20||52.90||52.70||91.10||42.70|
|AlignGAN(Wang et al., 2019a)||ICCV 19||42.40||85.00||40.70||51.50||89.40||33.90||45.90||87.60||54.30||57.10||92.70||45.30|
|tsGAN(Zhang et al., 2020)||arXiv 20||49.80||87.30||47.40||56.10||90.20||38.50||50.40||90.80||63.10||59.30||91.20||50.20|
|X-Modality(Li et al., 2020)||AAAI 20||49.92||89.79||50.73||-||-||-||-||-||-||-||-||-|
|CMM+CML(Ling et al., 2020)||ACMMM 20||51.80||92.72||51.21||56.27||94.08||43.39||54.98||94.38||63.70||60.42||96.88||53.52|
|HAT(Ye et al., 2021b)||TIFS 20||55.29||92.14||53.89||-||-||-||62.10||95.75||69.37||-||-||-|
Our experiments use the standard Cumulative Matching Characteristics (CMC) curve and the mean average precision (mAP) as evaluation metrics. CMC is reported as rank-k, the rate at which a correct match appears within the k nearest neighbors, with k set to 1, 10, and 20 in our results. In the ablation experiments, following (Ye et al., 2020b), we also compute the mean inverse negative penalty (mINP). When testing on the SYSU-MM01 dataset, we evaluate methods on the specified test set with a randomly generated gallery, repeat this 10 times, and take the average to measure model performance (Wu et al., 2017). For the RegDB dataset, we adopt the evaluation protocol described above and run experiments in both the Visible to Thermal and Thermal to Visible modes to fully verify the effectiveness of our proposed method.
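The three metrics can be computed from a query-by-gallery distance matrix as in the NumPy sketch below. It is a simplified sketch, not the official evaluation code: it ignores the camera-based filtering used by the SYSU-MM01 protocol and the 10-trial averaging, and the function name `evaluate` is hypothetical. mINP follows the definition in (Ye et al., 2020b), the number of correct matches divided by the rank of the hardest correct match, averaged over queries.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, ks=(1, 10, 20)):
    """Return (CMC rank-k dict, mAP, mINP) for a (num_query, num_gallery) matrix."""
    hits = {k: 0 for k in ks}
    aps, inps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i], kind="stable")   # nearest gallery first
        matches = g_ids[order] == q_ids[i]
        if not matches.any():
            continue                                  # query has no true match
        ranks = np.flatnonzero(matches) + 1           # 1-based ranks of matches
        for k in ks:
            hits[k] += int(ranks[0] <= k)             # first match within top k
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hits.mean())          # average precision
        inps.append(len(ranks) / ranks[-1])           # inverse negative penalty
    n_valid = len(aps)
    cmc = {k: hits[k] / n_valid for k in ks}
    return cmc, float(np.mean(aps)), float(np.mean(inps))
```

For example, a query whose only correct match is ranked third contributes 0 to rank-1, 1 to rank-10, and 1/3 to both mAP and mINP.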
The structure of the model has been explained in detail above. The backbone network, ResNet-50, is initialized with parameters pretrained on ImageNet. Following (Ye et al., 2020b), the stride of the last convolutional layer in ResNet-50 is changed to 1. All input images are resized to a uniform size. During training, we use simple data augmentation strategies, namely random crop and horizontal flip. Each mini-batch contains 64 images of 8 identities, that is, 4 RGB images and 4 infrared images selected randomly for each identity. The whole model is optimized with Adam for 100 epochs with an initial learning rate of 0.0005, and the learning rate is decayed by a factor of 0.1 at epochs 20, 25, and 35. At test time, we use cosine distance to measure the difference between the extracted query and gallery features and generate the retrieval results.
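The step schedule described above can be written as a small function. This is a sketch under assumptions: epochs are counted from 0 and the decay takes effect at the start of each milestone epoch, neither of which is stated in the text; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=5e-4, milestones=(20, 25, 35), gamma=0.1):
    """Step-decayed learning rate: multiply by gamma at each milestone reached.

    Assumes 0-based epochs and decay applied from the milestone epoch onward.
    """
    num_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** num_decays
```

Under these assumptions the rate is 5e-4 for epochs 0 to 19, 5e-5 for epochs 20 to 24, 5e-6 for epochs 25 to 34, and 5e-7 from epoch 35 until training ends.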
|Methods||Venue||Visible to Thermal R1||R10||mAP||Thermal to Visible R1||R10||mAP|
|Zero-Padding(Wu et al., 2017)||ICCV 17||17.8||34.2||18.9||16.6||34.7||17.8|
|BDTR(Ye et al., 2018b)||IJCAI 18||33.5||58.4||31.8||32.7||58.0||31.1|
|HSME(Hao et al., 2019)||AAAI 19||41.3||65.2||38.8||40.7||65.4||37.5|
|D-HSME(Hao et al., 2019)||AAAI 19||50.9||73.4||47.0||50.2||72.4||46.2|
|SDL(Kansal et al., 2020)||TCSVT 20||26.5||51.3||23.6||25.7||50.2||22.9|
|DGD+MSR(Feng et al., 2020)||TIP 19||48.4||70.3||48.7||-||-||-|
|EDFL(Liu et al., 2020)||NeuroC 20||52.6||72.1||53.0||-||-||-|
|AGW(Ye et al., 2020b)||TPAMI 21||70.0||-||66.4||-||-||-|
|cmGAN(Dai et al., 2018)||IJCAI18||47.9||47.9||12.8||16.3||47.9||65.5|
|RL(Wang et al., 2019b)||CVPR 19||43.4||66.1||44.1||-||-||-|
|Hi-CMD(Choi et al., 2020)||CVPR 20||70.9||86.4||66.0||-||-||-|
|JSIA(Wang et al., 2020a)||AAAI 20||48.5||-||49.3||48.1||-||48.9|
|AlignGAN(Wang et al., 2019a)||ICCV 19||57.9||-||53.6||56.3||-||53.4|
|cm-SSFT(Lu et al., 2020)||CVPR 20||65.4||-||65.6||63.8||-||64.2|
|X-Modality(Li et al., 2020)||AAAI 20||-||-||-||62.2||83.1||60.2|
|CMM+CML(Ling et al., 2020)||ACMMM 20||-||-||-||59.8||80.4||60.9|
|HAT(Ye et al., 2021b)||TIFS 20||71.8||87.2||67.6||70.0||86.5||66.3|
4.2. Comparison with State-of-the-art Methods
In this subsection, we compare our proposed method with several cross-modality person ReID methods that include the following categories: 1) With different structures and loss functions, Two-Stream, One-Stream, Zero-Padding (Wu et al., 2017), HSME, D-HSME (Hao et al., 2019), BDTR, SDL (Kansal et al., 2020), DGD+MSR (Feng et al., 2020), EDFL (Liu et al., 2020), HPILN (Zhao et al., 2019), AGW (Ye et al., 2020b), cm-SSFT (Lu et al., 2020), and TSLFN+HC (Zhu et al., 2020) learned modality-invariant feature representation; 2) With the ideas of GAN, cmGAN (Dai et al., 2018), RL (Wang et al., 2019b), Hi-CMD (Choi et al., 2020), JSIA (Wang et al., 2020a), AlignGAN (Wang et al., 2019a), and tsGAN (Zhang et al., 2020) generated cross-modality images or features; 3) X-Modality (Li et al., 2020), CMM+CML (Ling et al., 2020), and HAT (Ye et al., 2021b) introduced a third modality to feature space as a three-modality learning problem.
The experimental results of these state-of-the-art methods on the SYSU-MM01 and RegDB datasets are shown in Tables 1 and 2. R1, R10, and R20 denote rank-1, rank-10, and rank-20 accuracies (%), respectively, and mAP denotes the mean average precision score (%). As shown in Table 1, our proposed model performs strongly. Compared with TSLFN+HC (Zhu et al., 2020), which achieved the previous best performance using local features, our model is 1.74% higher on rank-1 and 1.47% higher on mAP using only global features in the single-shot setting of all-search mode. In the single-shot setting of indoor-search mode, our model achieves a rank-1 accuracy of 63.09% and an mAP of 70.31%, which are higher than TSLFN+HC by 3.35% and 5.40%, respectively. Compared with other methods that also use only global features, the proposed method is better by a large margin. In the single-shot setting of all-search mode, our model is 11.20% and 8.77% higher on rank-1 and mAP, respectively, than the strong AGW model, which has the same backbone.
As shown in Table 2, our method also achieves higher accuracy on the RegDB dataset. In the Visible to Thermal mode, the proposed model achieves a rank-1 accuracy of 73.6% and a rank-10 accuracy of 88.6%, which are higher than those of other methods. Compared with the latest HAT model (Ye et al., 2021b), our method achieves similar mAP. In the Thermal to Visible mode, the proposed model surpasses HAT by 4.6% on rank-1 and 1.2% on mAP, and surpasses CMM+CML (Ling et al., 2020) by 14.8% on rank-1 and 6.6% on mAP.
4.3. Ablation Study
In this subsection, we design ablation experiments to test the effectiveness of PEF loss and CMCC loss. All ablation experiments are performed on the SYSU-MM01 dataset, using the single-shot setting of all-search mode (the more difficult mode). Specifically, "B" represents the baseline model, which uses the proposed network without PEF loss and CMCC loss.
|Methods||R1||R10||R20||mAP||mINP|
|B+expAT(Ye et al., 2021a)||52.01||89.16||95.40||49.78||35.36|
|B+TC(He et al., 2018)||53.37||89.73||95.68||51.47||37.49|
|B+HC(Zhu et al., 2020) ( = 0.1)||53.84||88.89||94.87||48.68||31.13|
Influence of PEF Loss:
The results of the ablation experiments for PEF loss are shown in Table 3. Compared with the baseline model (B), introducing PEF loss (B+PEF) improves rank-1 accuracy and mAP by 3.42% and 3.00% respectively, which demonstrates the effectiveness of PEF loss. Additionally, we visualize the shallow feature maps of the B model and the B+PEF model. Specifically, we take one RGB image and one IR image of the same identity as input and extract their shallow feature maps from both models. Figure 4 visualizes the RGB and IR feature maps extracted from the selected images: Figure 4a and Figure 4b show the feature maps of the B model, and Figure 4c and Figure 4d show those of the B+PEF model. Black feature maps are inactive, with all-zero values, and are useless for learning modality-sharable features. As shown in Figure 4, regardless of the input modality, more useful non-zero feature maps are extracted after introducing the PEF loss, and the edge information in the feature maps is clearly enhanced. Besides, mAP decreases when PEF loss is removed from the overall MSO model (B+PEF+CMCC), which also demonstrates the effectiveness of PEF loss.
Influence of CMCC Loss:
As shown in Table 3, the model with CMCC loss (B+CMCC) achieves a rank-1 accuracy of 56.63% and an mAP of 54.93%, which are higher than the baseline (B) by 7.47% and 7.95%, respectively. Besides, we implement expAT loss (Ye et al., 2021a) and triplet center loss (TC) (He et al., 2018) on the baseline. Compared with B+expAT and B+TC, adding CMCC loss gives the baseline a markedly larger performance boost. We also introduce hetero-center loss (HC) (Zhu et al., 2020) into the baseline (B+HC). The B+HC results in Table 3 are the best obtained across several experiments with different values of the HC loss hyperparameter (Zhu et al., 2020). The advantage of CMCC loss is evident: it outperforms HC loss by 2.79% in rank-1 accuracy and 6.25% in mAP. In addition, we visualize the distribution of features learned by the baseline model, the baseline with HC loss, and the baseline with our CMCC loss. As shown in Figure 5, the feature distribution of the baseline shows low discrimination, and the distribution of B+HC shows a small margin between different modalities of the same identity but high similarity between different identities. Compared with B and B+HC, our CMCC loss effectively separates different identities, as shown in Figure 5c.
4.4. Edge Fusion Strategy Analysis
In this subsection, we try four other edge fusion strategies as the edge features enhancement module. Three of them are used to enhance the modality-sharable features in each single-modality space: 1) Directly Add Fusion: take the feature maps after layer0 and the edge features extracted by the Sobel convolution module, and directly add the edge features to each feature map. 2) Weighted Add Fusion: add the edge features to each feature map with different learnable weights. 3) Concat Fusion: take the 64-channel feature maps after layer0 and concatenate them with the 1-channel edge features, then map the result back to 64 channels through a 1×1 convolutional layer. The experimental results in the single-shot setting of all-search mode on SYSU-MM01 are shown in Table 4. Compared with these three edge fusion strategies, PEF loss shows the best performance in enhancing modality-sharable feature learning for the cross-modality person ReID task.
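The three single-space fusion strategies can be sketched on plain arrays as follows. This is a minimal NumPy sketch, assuming a (64, H, W) layer0 feature map and a single-channel (H, W) edge map of the same spatial size; the per-channel scalar form of the learnable weights and all function names are illustrative assumptions, and the weights are passed in as arguments rather than learned.

```python
import numpy as np

def direct_add_fusion(feats, edge):
    """Add the (H, W) edge map to every channel of the (C, H, W) features."""
    return feats + edge[None, :, :]

def weighted_add_fusion(feats, edge, channel_weights):
    """Add the edge map to each channel, scaled by a per-channel weight (C,)."""
    return feats + channel_weights[:, None, None] * edge[None, :, :]

def concat_fusion(feats, edge, conv_weight):
    """Concatenate the edge map as an extra channel, then apply a 1x1
    convolution with weight (C, C + 1) to map back to C channels."""
    stacked = np.concatenate([feats, edge[None, :, :]], axis=0)  # (C + 1, H, W)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", conv_weight, stacked)
```

All three variants change the features themselves, whereas the PEF loss leaves the forward path untouched and only adds a training-time constraint.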
To show the advantage of placing the edge features enhancement module in each single-modality space, we also design a fourth method: 4) Classic Feature Fusion: Treat the edge features extracted by the Sobel convolution module as a third modality and feed them into the convolutional neural network together with the RGB and IR images. Then concatenate the image features with their edge features after the CNN and fuse them with FC layers. In this way, the edge information is enhanced in the common feature space. As shown in Table 4, this classic feature fusion method performs relatively poorly, which supports the rationality and effectiveness of optimization in the single-modality space.
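The late-fusion step of the classic feature fusion method can be sketched as follows. This is only an illustration under stated assumptions: the text says that FC layers fuse the concatenated features but does not give their number, width, or activation, so the two-layer design, the ReLU, the embedding size, and all names below are hypothetical.

```python
import numpy as np


def fc(x, weight, bias):
    """Fully connected layer: x (N, D_in), weight (D_in, D_out), bias (D_out,)."""
    return x @ weight + bias


def classic_feature_fusion(img_feat, edge_feat, w1, b1, w2, b2):
    """Concatenate the CNN embedding of an image with that of its edge map,
    then fuse them with two FC layers (depth and ReLU are assumptions)."""
    joint = np.concatenate([img_feat, edge_feat], axis=1)  # (N, 2D)
    hidden = np.maximum(fc(joint, w1, b1), 0.0)            # ReLU
    return fc(hidden, w2, b2)                               # (N, D)


# Toy embeddings standing in for the CNN outputs.
rng = np.random.default_rng(0)
n, d = 4, 16
img_feat = rng.standard_normal((n, d))
edge_feat = rng.standard_normal((n, d))
w1 = rng.standard_normal((2 * d, d)) * 0.1
b1 = np.zeros(d)
w2 = rng.standard_normal((d, d)) * 0.1
b2 = np.zeros(d)

fused = classic_feature_fusion(img_feat, edge_feat, w1, b1, w2, b2)
print(fused.shape)
```

Unlike the three strategies that modify the layer0 feature maps, this variant only combines edge and image information after the backbone, that is, in the common feature space.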
Table 4. Edge fusion strategies under the single-shot setting of all-search mode on SYSU-MM01.
|Directly Add Fusion|55.56|90.00|95.71|53.67|40.03|
|Weighted Add Fusion|56.77|91.63|96.88|54.47|40.06|
|PEF Loss Fusion|58.70|92.06|97.20|56.42|42.04|
|Classic Feature Fusion|50.33|88.04|94.82|49.51|35.84|
5. Conclusion
In this paper, we present the multi-feature space joint optimization (MSO) network for RGB-IR cross-modality person ReID, which learns modality-sharable features in both the single-modality space and the common space. Based on the observation that edge information is modality-invariant, we propose an edge features enhancement module to enhance the modality-sharable features in each single-modality space. The perceptual edge features (PEF) loss is applied in the feature space of each modality, and the cross-modality contrastive-center (CMCC) loss is applied in the common feature space; together they enhance the modality-sharable features and yield a more discriminative feature distribution. The edge fusion strategy analysis shows that PEF loss outperforms the other fusion strategies. Experiments show that the proposed method significantly outperforms state-of-the-art methods on both the SYSU-MM01 and RegDB datasets. Code will be made available. We believe that the new method will provide innovative solutions for future cross-modality ReID research.
Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 61972030) and the Grapevine Scholar Plan of JD AI Research.
References
- Wasserstein GAN. CoRR abs/1701.07875. Cited by: §2.1.
- Pose-guided tracking-by-detection: robust multi-person pose tracking. IEEE Trans. Multim. 23, pp. 161–175. Cited by: §2.2.
- Hi-CMD: hierarchical cross-modality disentanglement for visible-infrared person re-identification. In CVPR, pp. 10254–10263. Cited by: §1, §2.1, §4.2, Table 1, Table 2.
- Cross-modality person re-identification with generative adversarial training. In IJCAI 2018, pp. 677–683. Cited by: §1, §2.1, §4.1, §4.2, Table 1, Table 2.
- Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR 2018, pp. 994–1003. Cited by: §2.1.
- SphereReID: deep hypersphere manifold embedding for person re-identification. J. Vis. Commun. Image Represent. 60, pp. 51–58. Cited by: §2.2.
- Learning modality-specific representations for visible-infrared person re-identification. IEEE Trans. Image Process. 29, pp. 579–590. Cited by: §2.1, §4.2, Table 1, Table 2.
- Learning attributes equals multi-source domain generalization. In CVPR 2016, pp. 87–97. Cited by: §2.2.
- Self-supervised moving vehicle tracking with stereo sound. In ICCV 2019, pp. 7052–7061. Cited by: §2.2.
- Generative adversarial networks. Commun. ACM 63 (11), pp. 139–144. Cited by: §2.1.
- HSME: hypersphere manifold embedding for visible thermal person re-identification. In AAAI 2019, pp. 8385–8392. Cited by: §1, §2.2, §4.1, §4.2, Table 1, Table 2.
- Triplet-center loss for multi-view 3D object retrieval. In CVPR 2018, pp. 1945–1954. Cited by: §2.2, §4.3, Table 3.
- In defense of the triplet loss for person re-identification. CoRR abs/1703.07737. Cited by: §2.2.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML 2015, Vol. 37, pp. 448–456. Cited by: §3.1.
- Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004. Cited by: §2.1.
- A similarity inference metric for RGB-infrared cross-modality person re-identification. In IJCAI 2020, pp. 1026–1032. Cited by: §2.1.
- Perceptual losses for real-time style transfer and super-resolution. In ECCV (2) 2016, Vol. 9906, pp. 694–711. Cited by: §1.
- Person re-identification between visible and thermal camera images based on deep residual CNN using single input. IEEE Access 7, pp. 57972–57984. Cited by: §2.1.
- SDL: spectrum-disentangled representation learning for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol. 30 (10), pp. 3422–3432. Cited by: §2.1, §4.2, Table 1, Table 2.
- Semi-supervised classification with graph convolutional networks. In ICLR (Poster) 2017, Cited by: §2.1.
- Infrared-visible cross-modal person re-identification with an X modality. In AAAI 2020, pp. 4610–4617. Cited by: §2.1, §4.2, Table 1, Table 2.
- Class-aware modality mix and center-guided metric learning for visible-thermal person re-identification. In ACM Multimedia 2020, pp. 889–897. Cited by: §2.1, §2.2, §4.2, §4.2, Table 1, Table 2.
- Enhancing the discriminative feature learning for visible-thermal cross-modality person re-identification. Neurocomputing 398, pp. 11–19. Cited by: §1, §2.1, §4.2, Table 1, Table 2.
- SphereFace: deep hypersphere embedding for face recognition. In CVPR 2017, pp. 6738–6746. Cited by: §2.2.
- Cross-modality person re-identification with shared-specific feature transfer. In CVPR 2020, pp. 13376–13386. Cited by: §2.1, §4.2, Table 2.
- Bag of tricks and a strong baseline for deep person re-identification. In CVPR Workshops 2019, pp. 1487–1495. Cited by: §3.4.
- Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: §2.1.
- Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 17 (3), pp. 605. Cited by: §4.1.
- Very deep convolutional networks for large-scale image recognition. In ICLR 2015, Cited by: §3.2.
- A 3×3 isotropic gradient operator for image processing. Pattern Classification and Scene Analysis, pp. 271–272. Cited by: §3.2.
- Gated Siamese convolutional neural network architecture for human re-identification. In ECCV (8) 2016, Vol. 9912, pp. 791–808. Cited by: §2.2, §3.3.
- NormFace: L2 hypersphere embedding for face verification. In ACM Multimedia 2017, pp. 1041–1049. Cited by: §2.2.
- RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In ICCV 2019, pp. 3622–3631. Cited by: §1, §2.1, §4.1, §4.2, Table 1, Table 2.
- Cross-modality paired-images generation for RGB-infrared person re-identification. In AAAI 2020, pp. 12144–12151. Cited by: §1, §4.2, Table 1, Table 2.
- MetaSearch: incremental product search via deep meta-learning. IEEE Trans. Image Process. 29, pp. 7549–7564. Cited by: §2.2.
- Non-local neural networks. In CVPR 2018, pp. 7794–7803. Cited by: §3.1.
- Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In CVPR 2019, pp. 618–626. Cited by: §1, §2.1, §4.2, Table 1, Table 2.
- Co-attentive lifting for infrared-visible person re-identification. In ACM Multimedia 2020, pp. 1028–1037. Cited by: §2.1.
- RGB-infrared cross-modality person re-identification. In ICCV 2017, pp. 5390–5399. Cited by: §2.1, §4.1, §4.1, §4.2, Table 1, Table 2.
- Bi-directional exponential angular triplet loss for RGB-infrared person re-identification. IEEE Trans. Image Process. 30, pp. 1583–1595. Cited by: §2.2, §4.3, Table 3.
- Modality-aware collaborative learning for visible thermal person re-identification. In ACM Multimedia 2019, pp. 347–355. Cited by: §2.1.
- Hierarchical discriminative learning for visible thermal person re-identification. In AAAI 2018, pp. 7501–7508. Cited by: §1, §2.1.
- Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Trans. Inf. Forensics Secur. 15, pp. 407–419. Cited by: §1, §2.2.
- Deep learning for person re-identification: a survey and outlook. CoRR abs/2001.04193. Cited by: §2.2, §3.1, §3.1, §3.4, §4.1, §4.1, §4.2, Table 1, Table 2.
- Visible-infrared person re-identification via homogeneous augmented tri-modal learning. IEEE Trans. Inf. Forensics Secur. 16, pp. 728–739. Cited by: §2.1, §4.2, §4.2, Table 1, Table 2.
- Visible thermal person re-identification via dual-constrained top-ranking. In IJCAI 2018, pp. 1092–1099. Cited by: §1, §2.2, Table 1, Table 2.
- In defense of the triplet loss again: learning robust person re-identification with fast approximated triplet loss and label distillation. In CVPR Workshops 2020, pp. 1454–1463. Cited by: §2.2.
- Attend to the difference: cross-modality person re-identification via contrastive correlation. CoRR abs/1910.11656. Cited by: §2.1.
- RGB-IR cross-modality person ReID based on teacher-student GAN model. CoRR abs/2007.07452. Cited by: §2.1, §4.2, Table 1.
- HPILN: a feature learning framework for cross-modality person re-identification. IET Image Process. 13 (14), pp. 2897–2904. Cited by: §2.2, §4.2, Table 1.
- Group-aware label transfer for domain adaptive person re-identification. CoRR abs/2103.12366. Cited by: §2.2.
- Person re-identification in the wild. In CVPR 2017, pp. 3346–3355. Cited by: §2.2, §3.1.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV 2017, pp. 2242–2251. Cited by: §2.1.
- Hetero-center loss for cross-modality person re-identification. Neurocomputing 386, pp. 97–109. Cited by: §2.2, §4.2, §4.2, §4.3, Table 1, Table 3.