Person re-identification (re-ID) is an important task in video surveillance. Given a query image of a person, re-ID aims to match persons in an image gallery collected from non-overlapping camera networks.
To leverage the unique features of sensors of different modalities, cross-modality re-ID has been attracting increasing interest in recent years for more robust identification, e.g. by using infrared images as query and RGB images as gallery. On the other hand, cross-modality re-ID remains a challenging task due to the large discrepancy across image modalities in terms of distinct illumination, heterogeneous features, etc.
Two typical approaches have been explored to address the cross-modality re-ID challenges. The first approach attempts to align the feature distributions of images of different modalities to reduce the cross-modality discrepancy. The other approach utilizes a generative adversarial network (GAN) as a modality transformer to convert person images from one modality to another while preserving the identity information as much as possible. Both approaches focus on reducing the cross-modality discrepancy or learning an ID-preserving mapping across modalities, whereas the discriminative similarity among gallery samples of the same modality is largely neglected. The lack of this useful information has become a major reason for the low performance of cross-modality person re-ID.
In this paper, we propose an innovative similarity inference metric (SIM) for cross-modality person re-ID. SIM aims to infer cross-modality sample similarities by exploiting reliable intra-modality sample similarities, as illustrated in Fig. 1. Instead of matching persons directly with the query-gallery similarity matrix as most existing methods do, we introduce similarity graph reasoning (SGR) and mutual nearest-neighbor reasoning (MNNR), which discover intra-modality sample similarities and circumvent the cross-modality discrepancy successively. Specifically, the two types of reasoning use intra-modality similarities, in terms of graph shortest paths and nearest-neighbor overlap, to match hard positive samples that are dissimilar from the query but similar to the predicted matches of the query. SIM improves cross-modality re-ID performance significantly and consistently, as detailed in the experimental section.
The main contributions of this work can be summarized in three aspects. First, it proposes a similarity inference metric that successively improves cross-modality similarities by utilizing the discriminative intra-modality sample similarities. Second, it designs novel similarity graph reasoning and mutual nearest-neighbor reasoning that can be applied to different cross-modality person re-ID metrics with little extra training. Third, it achieves significant performance improvement over the state of the art on two widely used cross-modality re-ID datasets: SYSU-MM01 and RegDB.
2 Related Works
2.1 Single-modality Person Re-ID.
Conventional re-ID research mainly focuses on the challenge of appearance variations in a single RGB modality, including illumination conditions, viewpoint variations, misalignment, etc. Existing methods can be broadly classified into two categories. Methods in the first category learn similarity metrics that predict whether two images contain the same person. Methods in the second category learn a discriminative feature representation, upon which efficient L2 or cosine distances can be applied. However, most existing methods were developed for single-modality re-ID and cannot tackle cross-modality re-ID well due to the large discrepancy across modalities.
2.2 Cross-modality Person Re-ID.
For RGB-Infrared cross-modality re-ID, the discrepancies come not just from appearance variations but also from heterogeneous images of different modalities. Two typical approaches have been explored to reduce the cross-modality discrepancies. The first approach attempts to align the feature distributions of images of different modalities. Examples include exploring three different network structures with deep zero-padding for evolving domain-specific nodes, jointly optimizing modality-specific and modality-shared metrics to learn multi-modality representations, a dual-path network with a bi-directional dual-constrained top-ranking loss that learns common representations, a cross-modality generative adversarial network (cmGAN) that learns discriminative representations from different modalities, and a hyper-sphere manifold embedding model. The second approach instead uses a generative adversarial network (GAN) as a modality transformer to convert person images from one modality to another while preserving the identity information as much as possible.
Though these methods reduce the modality discrepancies, the useful discriminative similarity among gallery samples of the same modality is largely neglected. Our similarity inference metric captures this intra-gallery sample similarity, which improves cross-modality re-ID significantly; more details are discussed in Sec. 3.
2.3 Re-ranking for Person Re-ID.
Re-ranking methods have been widely studied to improve conventional person re-ID. After an initial ranking list is obtained, re-ranking aims to raise the rank of relevant images in an automatic and unsupervised manner. Various re-ranking methods have been reported recently. For example, one line of work learns an unsupervised re-ranking model that removes visual ambiguities by analyzing the content and context information in the initial ranking list. Another tackles the re-ranking problem by exploiting common nearest neighbors. To address the false-match issue of k-nearest neighbors, k-reciprocal neighbors have been used together with an encoding method that revises the initial rank list by calculating the feature distance and the Jaccard distance of samples.
Most existing re-ranking methods are designed for single-modality re-ID and do not work well in the cross-modality re-ID task. The major problem is that they cannot re-rank samples of different modalities, whose similarity metrics differ from those of samples of a single modality. We tackle this problem by combining cross-modality k-nearest neighbors and intra-modality k-reciprocal neighbors, which improves the re-ID performance significantly; more details are described in Sec. 3.3.
3 The Proposed Approach
Given an infrared query image of a person and a gallery set of RGB images, cross-modality re-ID ranks the gallery images according to their similarities to the query. Existing methods usually derive the similarity metric by directly comparing features of cross-modality samples, which often yields low precision due to the gap and bias across image modalities. The proposed Similarity Inference Metric (SIM) aims to improve the cross-modality similarity metric by exploiting the discriminative intra-modality similarity among gallery samples. It consists of feature representation, similarity graph reasoning, and mutual nearest-neighbor reasoning, as illustrated in Fig. 2; details are described in the ensuing subsections.
3.1 Feature Representation
A weight-sharing two-stream CNN structure is designed to learn features and image representation from infrared and RGB images as illustrated in Fig. 2a. The CNN models are trained by optimizing cross entropy loss and triplet loss with infrared and RGB training samples together.
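The two training objectives named above can be illustrated with a minimal pure-Python sketch. It only shows the shape of the losses on toy vectors; the margin value, the function names and the use of an L2 distance inside the triplet term are assumptions made for illustration, not settings reported in this paper.

```python
import math


def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss; `margin` is an assumed value."""
    return max(0.0, margin + l2(anchor, positive) - l2(anchor, negative))


def cross_entropy(logits, label):
    """Softmax cross-entropy of one sample for identity classification."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Toy cross-modality triplet: an IR anchor with an RGB positive and negative.
ir_anchor = [0.0, 0.0]
rgb_positive = [0.1, 0.0]
rgb_negative = [1.0, 0.0]
total_loss = cross_entropy([2.0, 0.5, -1.0], 0) + triplet_loss(
    ir_anchor, rgb_positive, rgb_negative
)
```

In this toy case the negative is already farther from the anchor than the positive by more than the margin, so the triplet term is zero and only the classification term contributes.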
In the inference phase, each infrared query image in the query set and each RGB gallery image in the gallery set are fed to the trained model to extract their respective features. A query-gallery similarity matrix can then be obtained by computing the distances between all query images and all gallery images, where each matrix element denotes the distance between one query image and one gallery image. Similarly, a gallery-gallery similarity matrix can be obtained for all image pairs in the gallery set. Because the gallery images carry abundant optical information and have little modality gap among themselves, the gallery-gallery similarity matrix is much more discriminative than the query-gallery similarity matrix, as intuitively shown in Table 1.
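As a rough illustration of the two matrices described above, the following sketch builds a query-gallery and a gallery-gallery distance matrix from toy feature vectors. The L2 distance, the names `distance_matrix`, `d_qg` and `d_gg`, and the feature values are assumptions for illustration only.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows, cols):
    """Pairwise distances: entry [i][j] compares rows[i] with cols[j]."""
    return [[l2_distance(r, c) for c in cols] for r in rows]


# Toy features standing in for the CNN outputs.
query_feats = [[0.0, 1.0], [1.0, 0.0]]                  # IR queries
gallery_feats = [[0.0, 0.9], [0.9, 0.1], [0.5, 0.5]]    # RGB gallery

d_qg = distance_matrix(query_feats, gallery_feats)      # cross-modality, 2 x 3
d_gg = distance_matrix(gallery_feats, gallery_feats)    # intra-modality, 3 x 3
```

The gallery-gallery matrix is symmetric with a zero diagonal, which is what later allows gallery images to act as intermediate nodes between a query and a hard positive.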
3.2 Similarity Graph Reasoning
We propose similarity graph reasoning to circumvent the cross-modality discrepancy by leveraging the intra-modality similarity. The idea is that, for a query image and a gallery image similar to it, other gallery images that are similar to that gallery image should also be similar to the query (via that gallery image) even though they may have large direct distances from the query, as illustrated in Fig. 2b. With the query-gallery and gallery-gallery matrices described in the previous subsection, we define a similarity graph on the whole test set including all query and gallery images, where each node represents an image sample and each edge represents the similarity between its two connected nodes. We initialize the cross-modality edges (query-gallery) with the query-gallery matrix and the intra-modality edges (gallery-gallery) with the gallery-gallery matrix, where a scale factor adjusts the ratio between the two distance spaces.
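Given a query image and a gallery image, their distance from the perspective of similarity graph reasoning is defined as the length of the shortest path between the query node and the gallery node on the graph. To be specific, consider the set of all possible paths from the query node to the gallery node, where each path visits a sequence of intermediate nodes. The length of a path is the sum of the weights of its edges, and the SGR distance is the minimum length over this set.
Because the metric used in the gallery satisfies the triangle inequality, a detour through several gallery nodes is never shorter than the direct gallery-gallery edge between its endpoints.
Thus, the query-gallery distance can be simplified to a minimum over paths that go from the query to a single intermediate gallery node and then directly to the target gallery node.
Further, we use the mean of the first several shortest paths instead of the single shortest one for more stable cross-modality distance evaluation.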
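The derivation in this subsection can be written compactly as follows. This is a sketch in notation introduced here for illustration, not necessarily the paper's own symbols: $w(u, v)$ is the weight of the edge between nodes $u$ and $v$ (already including the scale factor), $\mathcal{P}(q, g_j)$ is the set of paths from query $q$ to gallery image $g_j$, and $k$ is the number of shortest paths that are averaged. It also assumes that the intermediate nodes of a path are gallery nodes.

```latex
% Shortest-path distance on the similarity graph
d_{\mathrm{SGR}}(q, g_j) = \min_{p \in \mathcal{P}(q, g_j)} \sum_{(u, v) \in p} w(u, v)

% Triangle inequality among gallery nodes
w(g_a, g_c) \le w(g_a, g_b) + w(g_b, g_c)

% Hence one intermediate gallery node suffices
% (m = j recovers the direct edge, since w(g_j, g_j) = 0)
d_{\mathrm{SGR}}(q, g_j) = \min_{m} \bigl[ w(q, g_m) + w(g_m, g_j) \bigr]

% Mean of the k shortest one-hop paths, where \ell_{(t)}(q, g_j) is the
% t-th smallest value of w(q, g_m) + w(g_m, g_j) over m
\bar{d}_{\mathrm{SGR}}(q, g_j) = \frac{1}{k} \sum_{t=1}^{k} \ell_{(t)}(q, g_j)
```

The third line follows from the second because any multi-hop gallery segment from $g_m$ to $g_j$ is at least as long as the direct edge $w(g_m, g_j)$.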
In practice, to reduce the computational complexity, all edges between gallery pairs are deleted except those between each sample and its k-nearest neighbors in the gallery set.
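A minimal sketch of similarity graph reasoning for a single query is given below, under several assumptions that are not taken from the paper: the scale factor `alpha` multiplies the gallery-gallery edges, the same kind of count `k` selects how many shortest one-hop paths are averaged, and a gallery edge survives pruning if either endpoint lists the other among its nearest neighbors. All names are hypothetical.

```python
def nearest_neighbors(dist_row, count, exclude):
    """Indices of the `count` smallest distances in a row, skipping `exclude`."""
    order = sorted(
        (i for i in range(len(dist_row)) if i != exclude),
        key=lambda i: dist_row[i],
    )
    return order[:count]


def sgr_distances(d_qg_row, d_gg, alpha=1.0, k=1, num_neighbors=None):
    """SGR distance from one query to every gallery image.

    d_qg_row[m] : cross-modality distance from the query to gallery image m
    d_gg[m][j]  : intra-modality distance between gallery images m and j
    alpha       : scale factor, assumed here to weight gallery-gallery edges
    k           : number of shortest one-hop paths averaged
    num_neighbors : if given, keep only gallery edges to nearest neighbors
    """
    n = len(d_qg_row)
    if num_neighbors is None:
        keep = [set(range(n)) for _ in range(n)]
    else:
        keep = [
            set(nearest_neighbors(d_gg[m], num_neighbors, exclude=m)) | {m}
            for m in range(n)
        ]
    result = []
    for j in range(n):
        # Path query -> m -> j; m == j is the direct query-gallery edge.
        lengths = sorted(
            d_qg_row[m] + alpha * d_gg[m][j]
            for m in range(n)
            if j in keep[m] or m in keep[j]
        )
        top = lengths[:k]
        result.append(sum(top) / len(top))
    return result


# Gallery image 1 is a hard positive: far from the query, close to image 0.
d_qg_row = [0.2, 0.9, 1.0]
d_gg = [[0.0, 0.1, 1.0], [0.1, 0.0, 1.0], [1.0, 1.0, 0.0]]
sgr = sgr_distances(d_qg_row, d_gg, alpha=1.0, k=1)
```

In this toy case the hard positive drops from a direct distance of 0.9 to roughly 0.3 via gallery image 0, while the unrelated gallery image 2 keeps its distance of 1.0.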
3.3 Mutual Nearest-Neighbor Reasoning
We propose mutual nearest-neighbor reasoning (MNNR) under the hypothesis that a query image and a gallery image are more likely to be a true match if they share mutual k-nearest neighbors in the gallery set. Neighbor information has been explored in re-ranking-based re-ID, e.g. by k-reciprocal encoding. However, it was mainly used for single-modality re-ID and does not work well in cross-modality re-ID, where the similarity metrics of query-gallery and gallery-gallery pairs are discrepant. For example, the cross-modality query-gallery distance and the intra-modality gallery-gallery distance are often at different scales and cannot be ranked together when handling test samples of different modalities.
MNNR employs a series of asymmetric strategies to handle the cross-modality discrepancy, as shown in Fig. 2c. First, it uses the gallery set as the search space without including the query. For an IR query, it ranks the gallery images by their query-gallery similarities and obtains the query's cross-modality k-nearest neighbors.
For an RGB gallery image, it ranks the gallery images by their gallery-gallery similarities and obtains its intra-modality k-reciprocal nearest neighbors. The mutual nearest neighbors of the query and the gallery image can thus be defined as the overlap between these two neighbor sets. Intuitively, more mutual nearest neighbors mean higher similarity, so the MNNR distance is defined to decrease with the number of candidates in the overlap set.
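The neighbor-overlap idea can be sketched as follows. Two choices here are assumptions made for illustration rather than the paper's definition: the overlap is turned into a distance as one minus the overlap size divided by `k`, and a gallery image counts as its own nearest neighbor (its self-distance is zero). The function names are hypothetical.

```python
def k_nearest(dist_row, k):
    """Set of indices of the k smallest distances in a row."""
    order = sorted(range(len(dist_row)), key=lambda i: dist_row[i])
    return set(order[:k])


def k_reciprocal(d_gg, j, k):
    """Gallery images i with i in kNN(j) and j in kNN(i), within the gallery."""
    forward = k_nearest(d_gg[j], k)
    return {i for i in forward if j in k_nearest(d_gg[i], k)}


def mnnr_distances(d_qg_row, d_gg, k=2):
    """Assumed MNNR distance from one query to every gallery image."""
    query_neighbors = k_nearest(d_qg_row, k)      # cross-modality k-NN
    distances = []
    for j in range(len(d_qg_row)):
        overlap = query_neighbors & k_reciprocal(d_gg, j, k)
        distances.append(1.0 - len(overlap) / k)  # more overlap, smaller distance
    return distances


d_qg_row = [0.2, 0.9, 1.0]
d_gg = [[0.0, 0.1, 1.0], [0.1, 0.0, 1.0], [1.0, 1.0, 0.0]]
mnnr = mnnr_distances(d_qg_row, d_gg, k=2)
```

Note the asymmetry: the query's neighbors are ranked with cross-modality distances only, while the gallery image's reciprocal neighbors are ranked with intra-modality distances only, so the two scales are never mixed in one ranking.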
3.4 Similarity Inference Metric
The proposed Similarity Inference Metric can thus be derived by combining the similarity graph reasoning and mutual nearest-neighbor reasoning. It jointly aggregates the SGR distance and the MNNR distance into the final distance, with a penalty factor controlling the contribution of the MNNR term. When the penalty factor removes the MNNR term, only the similarity graph reasoning is considered. Algorithm 1 provides the detailed description of our proposed similarity inference metric.
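One simple way to aggregate the two distances is sketched below. The additive form and the default value of `penalty` are assumptions for illustration; the paper's exact aggregation may differ. The only property relied on from the text is that switching off the MNNR term leaves the SGR distance alone.

```python
def sim_distances(d_sgr, d_mnnr, penalty=0.5):
    """Assumed aggregation: SGR distance plus a penalized MNNR distance."""
    return [s + penalty * m for s, m in zip(d_sgr, d_mnnr)]


def rank_gallery(distances):
    """Gallery indices sorted from the most to the least similar."""
    return sorted(range(len(distances)), key=lambda j: distances[j])


# Toy distances for three gallery images.
d_sgr = [0.5, 0.7, 0.6]
d_mnnr = [0.0, 0.0, 1.0]

ranking_sgr_only = rank_gallery(sim_distances(d_sgr, d_mnnr, penalty=0.0))
ranking_sim = rank_gallery(sim_distances(d_sgr, d_mnnr, penalty=0.5))
```

Here gallery image 2 ranks second under SGR alone, but it shares no mutual neighbors with the query, so the penalized MNNR term pushes it behind gallery image 1.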
3.5 Complexity Analysis
Most of the computation lies in the pairwise distance calculation and the distance ranking for all gallery pairs. Given a new query, SIM only computes the distances between the query and the gallery, ranks all path distances for SGR, computes the k-nearest neighbors for MNNR, and ranks the final distances.
4 Experiments
4.1 Datasets and Settings
The proposed SIM is evaluated over two public datasets RegDB and SYSU-MM01. The standard Cumulative Matching Characteristic (CMC) curve and mean average precision (mAP) are adopted as the evaluation metrics. Different from the traditional single-modality re-ID, the evaluations here use IR images as probe set and RGB images as gallery set for both datasets.
RegDB is collected by using dual cameras (with optical and thermal sensors). It contains images of 412 persons, with 10 RGB images and 10 IR images collected for each person. Following the evaluation protocol in (Ye et al., 2018), this dataset is randomly split into two halves, one for training and the other for testing. The evaluation procedure is repeated for 10 trials to achieve statistically stable results.
SYSU-MM01 is a large-scale RGB-IR re-ID dataset which contains images of 491 identities captured using six disjoint surveillance cameras (four RGB cameras and two IR cameras). The training set contains 19,659 RGB images and 12,792 IR images of 395 persons and the test set contains images of 96 persons. Following prior work, we adopt the multi-shot all-search evaluation protocol, where 10 images of each person are randomly selected to form the gallery set and the evaluation is repeated 10 times.
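The two evaluation metrics adopted in this subsection can be sketched for a single query as follows. This is a simplified illustration with hypothetical function names; dataset-specific details of the official protocols (such as gallery sampling and repeated trials) are omitted.

```python
def average_precision(ranked_ids, query_id):
    """Average precision of one ranked gallery list for one query identity."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct match
    return sum(precisions) / len(precisions) if precisions else 0.0


def cmc_hit(ranked_ids, query_id, top_k=1):
    """1.0 if a correct match appears within the first top_k results."""
    return 1.0 if query_id in ranked_ids[:top_k] else 0.0


def mean(values):
    return sum(values) / len(values)


# Two toy queries with the identities of their ranked gallery lists.
rankings = [("a", ["b", "a", "c", "a"]), ("b", ["b", "c", "a"])]
mean_ap = mean([average_precision(r, q) for q, r in rankings])
rank1 = mean([cmc_hit(r, q, top_k=1) for q, r in rankings])
```

mAP is the mean of the per-query average precision, and the rank-1 accuracy is the first point of the CMC curve.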
Implementation details. We adopt ResNet-50 as the backbone network and initialize it with parameters pre-trained on ImageNet. During training, the input images are uniformly resized, and traditional image augmentation is performed via random flipping and random erasing. In addition, we use the Adam optimizer to train the model. The whole training process consists of 200 epochs.
4.2 Comparison with State-of-the-Arts
The proposed SIM is compared with a number of cross-modality person re-ID methods that can be broadly classified into three categories: 1) LOMO, HOG and GSM that use hand-crafted features; 2) One-stream, Two-stream, Zero-Padding, TONE, BDTR and cmGAN that focus on feature distribution alignment; and 3) D2RL, PIG and AlignGAN that use GANs to transfer image styles. Table 2 and Table 3 show the experimental results over the datasets RegDB and SYSU-MM01, respectively, where Visible2Thermal means using RGB images as query and IR images as gallery, and Thermal2Visible means the opposite.
As the two tables show, methods in the first category do not perform well due to the constraints of hand-crafted features. Methods in the second category learn modality-invariant features by suppressing feature distribution gaps across modalities, and achieve clearly better re-ID performance. GAN-based methods reduce the modality discrepancy at the image level, which further improves the re-ID performance.
Our proposed Similarity Inference Metric outperforms all the competing methods significantly. As Table 2 shows, it outperforms the state of the art (AlignGAN) by 24.9 percentage points in mAP (78.3% vs 53.4%) and 18.94 percentage points in rank-1 accuracy (75.24% vs 56.3%) for Thermal2Visible. A similar improvement is obtained for Visible2Thermal. On SYSU-MM01, SIM obtains an mAP of 60.88% and a rank-1 accuracy of 56.93% as shown in Table 3, outperforming the state of the art (AlignGAN) by 26.98 and 5.43 percentage points, respectively.
4.3 Ablation Studies
Extensive ablation studies have been performed to evaluate each component of the proposed SIM. As Table 4 shows, four settings are compared: 1) Baseline, which uses the traditional L2 distance to measure feature similarity; 2) SGR only, which adds the Similarity Graph Reasoning (Section 3.2) to the Baseline; 3) MNNR only, which adds the Mutual Nearest-Neighbor Reasoning (Section 3.3) to the Baseline; and 4) SIM, which incorporates both SGR and MNNR. The Baseline does not perform well due to the large discrepancy across image modalities.
In addition, either SGR only or MNNR only improves the re-ID performance greatly. Specifically, SGR only achieves an mAP of 59.17% and a rank-1 accuracy of 56.62%, which are higher than the Baseline by 19.27 and 2.69 percentage points, respectively. These results show that SGR improves the sample similarity greatly by exploiting the discriminative within-gallery similarities. Similarly, MNNR only improves the mAP by 14.49 percentage points and the rank-1 accuracy by 2.37 percentage points as compared with the Baseline. The effectiveness of MNNR can be largely attributed to the use of the overlap of k-nearest neighbor sets in the gallery between image pairs.
Further, SIM with both SGR and MNNR outperforms either SGR only or MNNR only, demonstrating the complementarity of the two proposed reasoning mechanisms. It achieves an mAP of 60.88% and a rank-1 accuracy of 56.93%, which are higher than the Baseline by 20.98 and 3.00 percentage points, respectively. This shows that the proposed SIM enhances the cross-modality sample similarities effectively. The contribution of SIM can also be observed in the ranking lists illustrated in Fig. 3, where SIM raises the similarities of true matches that are ranked low by the Baseline.
4.4 Parameters Analysis
The proposed SIM involves three key parameters including scale factor , limit factor and penalty factor . The three parameters are studied by setting them to different values and checking the corresponding re-ID performance as shown in Fig. 4. As the top-left graph shows, should be small so as to exploit the within-gallery sample similarity sufficiently ( and fixed at and ). This can also be observed for and . For example, when is set at a small value 1, the false positive matching increases clearly as it lowers the fault tolerance for the first matching. On the contrary, the re-ID performance is degraded due to its weak discrimination when is set at 13. Experiments show that SIM performs optimally when , and .
4.5 Generalization Analysis
The proposed SIM is a generic metric that can work with different existing re-ID methods. We study this property by applying SIM to AlignGAN and AGW, both of which use the traditional L2 distance metric in feature similarity evaluation. As Table 5 shows, either SGR or MNNR improves the re-ID performance by large margins when it is incorporated into the AlignGAN and AGW methods. In addition, further improvements are observed when both SGR and MNNR are incorporated. These results are well aligned with the experimental results observed in the ablation studies.
Specifically, AlignGAN* + SGR achieves an mAP of 51.30% and a rank-1 accuracy of 52.56%, which are higher than AlignGAN by 16.74 and 4.64 percentage points, respectively. AlignGAN* + MNNR achieves an mAP of 50.26% and a rank-1 accuracy of 52.33%, which also outperform AlignGAN significantly. A similar improvement can be observed for AGW. All these experimental results demonstrate that the proposed Similarity Inference Metric (with SGR and MNNR) generalizes across different cross-modality re-ID metrics with significant and consistent performance improvements but little extra training.
|Method|mAP (%)|Rank-1 (%)|
|---|---|---|
|AlignGAN* + SGR|51.30|52.56|
|AlignGAN* + MNNR|50.26|52.33|
|AlignGAN* + SIM (SGR+MNNR)|54.45|52.70|
|AGW + SGR|55.89|52.70|
|AGW + MNNR|51.40|52.93|
|AGW + SIM (SGR+MNNR)|57.47|53.75|
5 Conclusion
This paper presents an innovative Similarity Inference Metric (SIM) for RGB-Infrared person re-identification. We introduce similarity graph reasoning and mutual nearest-neighbor reasoning to infer inter-modality sample similarities by exploiting reliable intra-modality sample similarity. The two types of reasoning can generalize over different cross-modality person re-ID metrics with significant performance improvements but little extra training. Experiments demonstrate the effectiveness as well as the generalization of our method for improving re-ID performance. We expect that the proposed SIM will inspire new insights for better cross-modality person re-ID in the near future.
Acknowledgments
This work was supported in part by National Natural Science Foundation of China (61902009) and Shenzhen Research Project (201806080921419290).
References
- (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, pp. 403–412.
- (2018) Cross-modality person re-identification with generative adversarial training. In IJCAI, pp. 677–683.
- (2005) Histograms of oriented gradients for human detection. In CVPR.
- (2015) Person re-identification ranking optimisation by discriminant context information analysis. In ICCV, pp. 1305–1313.
- (2014) Person re-identification using kernel-based metric learning methods. In ECCV.
- (2019) HSME: hypersphere manifold embedding for visible thermal person re-identification. In AAAI.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
- ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105.
- (2018) Harmonious attention network for person re-identification. In CVPR, pp. 2285–2294.
- (2015) Person re-identification by local maximal occurrence representation and metric learning. In CVPR, pp. 2197–2206.
- (2016) Cross-domain visual matching via generalized similarity measure and feature learning. TPAMI 39 (6), pp. 1089–1102.
- (2017) Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 17 (3), pp. 605.
- (2019) RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In ICCV.
- (2020) Cross-modality paired-images generation for RGB-infrared person re-identification. In AAAI.
- (2019) Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In CVPR, pp. 618–626.
- (2017) RGB-infrared cross-modality person re-identification. In ICCV, pp. 5380–5389.
- (2019) Attention driven person re-identification. Pattern Recognit. 86, pp. 143–155.
- (2018) Hierarchical discriminative learning for visible thermal person re-identification. In AAAI.
- (2016) Person reidentification via ranking aggregation of similarity pulling and dissimilarity pushing. IEEE Transactions on Multimedia 18 (12), pp. 2553–2566.
- (2020) Deep learning for person re-identification: a survey and outlook. arXiv preprint arXiv:2001.04193.
- (2018) Visible thermal person re-identification via dual-constrained top-ranking. In IJCAI.
- (2020) AD-cluster: augmented discriminative clustering for domain adaptive person re-identification. In CVPR.
- (2017) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, pp. 1077–1085.
- (2013) Learning locally-adaptive decision functions for person verification. In CVPR.
- (2011) Person re-identification by probabilistic relative distance comparison. In CVPR.
- (2017) Re-ranking person re-identification with k-reciprocal encoding. In CVPR.
- (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896.