Multi-spectral Vehicle Re-identification with Cross-directional Consistency Network and a High-quality Benchmark

by   Aihua Zheng, et al.

To tackle the challenge of vehicle re-identification (Re-ID) in complex lighting environments and diverse scenes, multi-spectral sources such as visible and infrared information are taken into consideration because of their strong complementary advantages. However, multi-spectral vehicle Re-ID suffers from the cross-modality discrepancy caused by the heterogeneous properties of different modalities, as well as from the diverse appearance of each identity across different views. Meanwhile, diverse environmental interference leads to a heavy sample distributional discrepancy in each modality. In this work, we propose a novel cross-directional consistency network to simultaneously overcome the discrepancies from both the modality and sample aspects. In particular, we design a new cross-directional center loss that pulls the modality centers of each identity close to mitigate the cross-modality discrepancy, and pulls the sample centers of each identity close to alleviate the sample discrepancy. Such a strategy can generate discriminative multi-spectral feature representations for vehicle Re-ID. In addition, we design an adaptive layer normalization unit that dynamically adjusts individual feature distributions to handle the distributional discrepancy of intra-modality features for robust learning. To provide a comprehensive evaluation platform, we create a high-quality RGB-NIR-TIR multi-spectral vehicle Re-ID benchmark (MSVR310), including 310 different vehicles from a broad range of viewpoints, time spans and environmental complexities. Comprehensive experiments on both the created and public datasets demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.



I Introduction

Fig. 1: Impact of the cross-directional center loss on the distribution of multi-spectral features. (a) The original distribution. (b) Impact of the modality center loss, which pulls the modality centers of each identity close. (c) Impact of the sample center loss, which pulls the sample centers of each identity close. (d) Feature distribution driven by the cross-directional center loss, which includes both the modality center loss and the sample center loss.
Fig. 2: Framework of the proposed Cross-directional Consistency Network (CCNet). A multi-stream network handles the RGB, NIR and TIR data separately, with an adaptive layer normalization unit (ALNU) embedded in the middle of each branch. Each branch is an independent ResNet50 that is split into two parts between its layer2 and layer3. The cross-directional center (CdC) loss is then used to mine the potential intra-class relations at the sample and modality levels.

Vehicle re-identification (Re-ID) aims to retrieve, from a cross-camera gallery, the images that share the identity of a given vehicle image. Owing to its wide range of real-life applications such as video surveillance, smart cities and intelligent transportation, vehicle Re-ID has attracted growing attention and is developing rapidly with the emergence of comprehensive studies [Chu2019VehicleRW, Lou2019EmbeddingAL, Tang2019CityFlowAC] and public large-scale datasets [Liu2016LargescaleVR, Liu_2016_CVPR, Lou2019VERIWildAL, guo2018learning]. However, most existing studies focus only on visible images, which suffer from imaging weaknesses in complex lighting environments and extreme weather, and thus cannot satisfy the demand of all-day and all-weather real-life surveillance. Since visible (RGB), near infrared (NIR) and thermal infrared (TIR) sources have strongly complementary advantages in adverse lighting conditions and environments, RGB-NIR-TIR multi-spectral vision tasks, such as tracking [Lu2021RGBTTV, Li2016LearningCS, Tu2022M5LMM], person Re-ID [Zheng2021RobustMP] and saliency detection [Tu2021MultiInteractiveDF], attract strong research interest in both the machine learning and computer vision communities. Recently, Li et al. [li2020multi] launched the multi-spectral vehicle Re-ID task. They propose a baseline method, the Heterogeneity-Collaboration Aware Multi-Stream Convolutional Network (HAMNet), which utilizes multi-spectral features with class-aware weight fusion. They also provide two benchmark datasets, RGBN300 and RGBNT100, to the multi-spectral vehicle Re-ID community. Note that, different from traditional vehicle Re-ID datasets which treat a single image as a sample, these two multi-spectral datasets treat an image pair (RGB-NIR in RGBN300) or an image triplet (RGB-NIR-TIR in RGBNT100) as a sample. To avoid confusion, we use the term sample in the rest of this paper with this multi-modal meaning, as opposed to a conventional single-modality image. Despite this pioneering contribution, three major issues remain to be addressed in multi-spectral vehicle Re-ID.

First, the sample discrepancy caused by diverse imaging conditions and the modality discrepancy caused by the heterogeneous modality gap restrict the learning of intra-class compactness. We propose a cross-directional center loss, composed of a sample center loss and a modality center loss, to address the sample and modality discrepancies from the multi-modality aspect. On the one hand, it is hard to distinguish identities from a single spectrum under complex environmental interference, while the modality gap significantly disturbs the direct utilization of different modalities. To reduce the heterogeneous gap while taking advantage of the consistent information among modalities, we introduce a modality center loss that pulls close the centers of images with the same ID from different modalities in a mini-batch, as shown in Fig. 1 (b). In this way, we can intuitively enforce modality consistency and reduce the disturbance caused by an image from a certain modality. On the other hand, although sample relations have been widely studied in RGB and cross-modality retrieval tasks through triplet losses [Chu2019VehicleRW, Hermans2017InDO], center losses [2016A] and cross-modality constraints [Ye2018VisibleTP, ye2019bi], these are not suitable for complementary and heterogeneous multi-modality images. Heavy environmental interference caused by illumination challenges is ubiquitous in multi-modality data. In this case, an image from a certain modality may be unreliable when it suffers from extreme environmental interference, and can easily introduce abnormal relations into a pair-wise metric process. Therefore, to learn more robust features from the complementary multi-modality images, we propose a sample center loss that pulls close the centers of the triplet (RGB-NIR-TIR) samples with the same identity in a mini-batch, as shown in Fig. 1 (c). Jointly optimizing the sample center loss and the modality center loss in a cross-directional fashion (as shown in Fig. 2) in a unified deep learning framework simultaneously reduces both the intra-class sample discrepancy and the cross-modality heterogeneity, as shown in Fig. 1 (d).


Fig. 3: Illustration of the distribution of multi-modality middle features. Each point corresponds to an image in our dataset; the red points share the same identity and scatter widely. Images with bounding boxes of the same color are the three modalities of the same sample.

Second, due to diverse environmental interference, features from a single modality suffer from heavy distributional variation, as shown in Fig. 3. This increases the difficulty of robust feature learning for CNNs and further impacts the learning of intra-class identity consistency. To reduce the disturbance of intra-modality distributional variation, we design a simple but effective module called the adaptive layer normalization unit (ALNU). Different from existing normalization operations like BN [2015Batch], IN [ulyanov2016instance] and GN [Wu2018GroupN], ALNU treats each input feature as a whole and preserves the original information without changing the relations across channels when adjusting the distribution. Compared with traditional layer normalization (LN) [ba2016layer], which also does not change the relations across channels, our ALNU adaptively learns the gain and bias factors from the original inputs by introducing extra convolution and pooling layers, and is thus more flexible. Specifically, we integrate ALNU into all branches of our network to improve the discriminative ability of multi-spectral target representations and thus further boost the performance of multi-spectral vehicle Re-ID.

Fig. 4: Illustration of the comparison of test protocols between MSVR310 and RGBNT100.

Third, the existing multi-spectral vehicle Re-ID datasets, RGBN300 and RGBNT100 [li2020multi], are limited in diversity. To provide a more comprehensive evaluation platform for multi-spectral vehicle Re-ID, we create a high-quality image benchmark dataset named MSVR310. Compared with the RGBNT100 dataset, our MSVR310 has the following two benefits.

Longer time span. MSVR310 is collected across a relatively long time span (over 40 days). Benefiting from the long time span, the data collected in MSVR310 cover various environmental conditions such as different illuminations, occlusions and weather, which effectively increases the diversity of our dataset. Furthermore, we annotate the samples with time labels according to their collection order. These labels are used to improve the experimental evaluation of multi-spectral vehicle Re-ID.

More reasonable protocol. Although most existing protocols forbid matching samples from the same camera, as in Market1501 [Zheng2015ScalablePR] and VeRi-776 [Liu2016LargescaleVR], or from the same viewpoint, as in RGBNT100 and RGBN300 [li2020multi], to avoid easy matching, this is not realistic enough, since the same vehicle may appear in the same camera or with the same viewpoint across different time spans. Therefore, we propose to prevent the easy matching caused by similar identity-unrelated information, such as environments and noise, with a more reasonable label, the time span, instead of the camera or viewpoint, as the new protocol. Fig. 4 shows the easy matching problem in RGBNT100: within the same time span, even with different viewpoints, the vehicles with the same identity and time label can be easily distinguished from others because of the high similarity of their image content.

In summary, we propose an end-to-end Cross-directional Consistency Network (CCNet) to simultaneously overcome the modality and sample discrepancies. We also propose a new multi-spectral vehicle Re-ID dataset, MSVR310, with diverse illumination interference, rich view variation and a more reasonable protocol. The contributions of this paper can be summarized as follows.

  • We propose a novel cross-directional consistency network based on the cross-directional center loss to simultaneously address the problems of cross-modality discrepancy caused by heterogeneous properties of different modalities and intra-class appearance discrepancy caused by different views and adverse lighting conditions in multi-spectral vehicle Re-ID.

  • We propose an adaptive layer normalization unit to dynamically adjust the intra-modality feature distribution. We integrate the unit into all branches of our network to help reduce the disturbance of intra-modality distributional variation.

  • We create a high-quality benchmark dataset MSVR310, including 310 different vehicles from a broad range of viewpoints, time spans and environmental complexities. The benchmark will provide a comprehensive evaluation platform to promote the research and development of multi-spectral vehicle Re-ID.

  • Comprehensive experiments on our dataset MSVR310 and the public dataset RGBNT100 validate the superior performance of our approach against several state-of-the-art multi-spectral vehicle Re-ID methods. We also conduct a random modality-missing experiment to prove the robustness of CCNet in facing the issue of missing modalities.

II Related Work

We briefly review the related works in vehicle Re-ID, cross-modality person Re-ID and multi-modality person Re-ID.

II-A Vehicle Re-ID

In the last few years, vehicle Re-ID has gained growing attention with the rapid development of the Re-ID task. Liu et al. [Liu2016LargescaleVR] propose a dataset called VeRi-776 together with a coarse-to-fine progressive search framework that uses multiple cues such as license plates and spatio-temporal labels. Liu et al. [Liu_2016_CVPR] release another large-scale vehicle Re-ID dataset, VehicleID, and build a distance-related method. Some works [Wang2017OrientationIF, Shen2017LearningDN] introduce spatio-temporal information to provide a stricter constraint in addition to the usual visual features. VANet [Chu2019VehicleRW] proposes a metric loss function that treats vehicle image pairs with the same or different viewpoints differently to acquire a better distance measure. He et al. [He2019PartRegularizedNV] enhance discriminative feature representation by introducing detection methods. Khorramshahi et al. [Khorramshahi2019ADM] introduce key-point information to apply adaptive attention to vehicle Re-ID. Semantic segmentation [Meng2020ParsingBasedVE] is utilized to split features into parts that correspond to vehicle regions, followed by a part-aligned metric that measures the distance of image pairs more precisely. Recently, larger and more challenging vehicle datasets have been released, such as VERI-Wild [Lou2019VERIWildAL] and CityFlow [Tang2019CityFlowAC]. Besides real data, a synthetic dataset [Yao2020SimulatingCC] constructed with a graphics engine provides arbitrary environments for learning. However, all the methods mentioned above use only the single RGB modality, which can hardly satisfy the demand for all-day, all-weather monitoring over long periods.

II-B Cross-Modality Person Re-ID

To handle the illumination limitations of RGB-based person Re-ID, Wu et al. [Wu2017RGBInfraredCP] propose the first RGB-Infrared cross-modality benchmark, SYSU-MM01, and a deep zero-padding network. RegDB [Nguyen2017PersonRS] is another widely used cross-modality dataset, with paired visible and thermal images collected by a dual-camera system. Ye et al. [Ye2018VisibleTP] suggest a two-stream network with a triplet loss to constrain the similarity of cross-modality images. An effective loss [2020Hetero] is designed to supervise the network to learn modality-invariant features by constraining the intra-class center distance between modalities. Ye et al. [ye2019bi] propose a bi-directional center-constrained loss to handle cross-modality and intra-modality variations simultaneously. Wang et al. [Wang2019RGBInfraredCP] introduce a generative model that translates images to the opposite modality to acquire pixel-level alignment, and apply a feature-level constraint with a joint discriminator to push the network to produce discriminative features. Li et al. [2020Infrared] introduce an auxiliary intermediate modality to reduce the gap between modalities. Lu et al. [Lu2020CrossModalityPR] propose a cross-modality shared-specific feature transfer algorithm to explore both modality-shared and modality-specific information. However, due to the lack of real aligned image pairs across modalities, the heterogeneity issue in cross-modality person Re-ID remains a key challenge.

II-C Multi-Modality Person Re-ID

Similar to infrared images, depth images are not affected by lighting variation and can reflect the shape and distance information of targets. Barbosa et al. [Barbosa2012ReidentificationWR] first propose RGB-D person Re-ID with a corresponding dataset named PAVIS. Møgelmose et al. [Mgelmose2013TrimodalPR] combine RGB, depth and thermal data in a joint classifier, which is the first work to utilize RGB, depth and thermal sources together in person Re-ID. Munaro et al. [Munaro20143DRO] collect an RGB-D dataset named BIWI with 50 identities and multiple data sources. Wu et al. [Wu2017RobustDP] utilize depth data to provide more invariant body shape and skeleton information to overcome changes of illumination and color. A cross-modality distillation network [Hafner2018ACD] has been proposed to transfer supervision, such as similar structural features, between modalities and to build a discriminative mapping to a common feature space. However, depth information is difficult to utilize in outdoor open environments, which seriously limits its application in this task.

To provide a robust solution for overcoming environmental interference, Li et al. [li2020multi] first launch the multi-spectral vehicle Re-ID datasets RGBN300 (visible and near infrared) and RGBNT100 (visible, near infrared and thermal infrared), and propose a baseline method, HAMNet, to effectively fuse the multi-modality information. Zheng et al. [zheng2021robust] release a new multi-spectral person Re-ID dataset, RGBNT201, and a progressive fusion network for multi-modality fusion. Although these two works launch the RGB-NIR-TIR multi-spectral Re-ID task and provide benchmark datasets and baseline methods for vehicle and person Re-ID respectively, how to effectively fuse the complementary but heterogeneous information is still a big challenge.

III Cross-directional Consistency Network

To utilize the consistency and mitigate the discrepancy in multi-spectral data, we propose a robust method with a cross-directional center loss and an adaptive layer normalization unit for multi-spectral vehicle Re-ID, referred to as the Cross-directional Consistency Network (CCNet) in this paper.

As shown in Fig. 2, CCNet is a multi-branch structure with three equivalent branches that extract specific features for each spectrum. Given a sample with multiple modalities, we send the image from each spectrum into its corresponding branch, and the branches do not share parameters. In each branch, an individual ALNU (adaptive layer normalization unit) module is integrated in the middle to adjust the feature distribution. For the input images in a training mini-batch, their features are divided into groups according to identity. The cross-directional center loss is then introduced to mitigate the intra-class appearance discrepancy and the cross-modality discrepancy simultaneously. Each branch also makes a prediction supervised by the cross entropy loss to learn identity-related features.

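In this work, we use $\mathcal{D} = \{\mathcal{X}_i\}_{i=1}^{N}$ to denote the whole dataset, where $N$ is the number of identities. $\mathcal{X}_i = \{X_{i,k}\}_{k=1}^{K_i}$ denotes the sample set belonging to vehicle $i$, where $K_i$ is the number of samples of vehicle $i$. $X_{i,k} = \{x_{i,k}^{m}\}_{m=1}^{M}$ denotes the image set of sample $k$ from $\mathcal{X}_i$, and $x_{i,k}^{m}$ is the single image from modality $m$ in sample $X_{i,k}$. In this work, $M$ is 3, and we can simply denote a sample in triplet form as $(x^{R}, x^{N}, x^{T})$ to represent the images from the RGB, NIR and TIR modalities respectively. We use $\phi_m^{1}$ and $\phi_m^{2}$ to denote the two parts of the branch for modality $m$ in CCNet. Then, the forward process for an image $x^{m}$ can be formulated as:

$$ f^{m} = \phi_m^{2}\big(\mathrm{ALNU}_m\big(\phi_m^{1}(x^{m})\big)\big), $$

where $f^{m}$ denotes the corresponding feature for the image $x^{m}$. The final representation of a sample is the concatenation of its corresponding feature triplet $(f^{R}, f^{N}, f^{T})$.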

III-A Adaptive Layer Normalization Unit

Besides diverse environmental interference and large appearance gaps, multi-modality Re-ID also suffers from complex feature distributions. As shown in Fig. 3, the mean values and standard deviations of intra-modality features are distributed over a wide range, even for images with the same identity from the same modality, which further influences the learning of intra-class identity consistency. The ALNU module mitigates the disturbance caused by this heavy distributional variation by normalizing each input feature and adjusting its distribution dynamically. On the one hand, this operation reduces the distributional discrepancy of intra-modality features and helps to extract more robust CNN features. On the other hand, it is hard to evaluate similarity accurately for intra-modality images with a large distribution gap, regardless of identity, so mitigating this discrepancy improves the validity of the final similarity comparison of intra-modality image pairs in multi-spectral vehicle Re-ID.

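Given an input image $x$ from modality $m$, we acquire its middle feature before sending it into ALNU as:

$$ z = \phi_m^{1}(x), $$

where $\phi_m^{1}$ is the first part of the branch for modality $m$. $z$ is a 3-D tensor with the shape $C \times H \times W$, indexed by $c \in [1, C]$, $h \in [1, H]$, $w \in [1, W]$. We can easily obtain its mean value and standard deviation as:

$$ \mu = \frac{1}{CHW} \sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} z_{c,h,w}, \qquad \sigma = \sqrt{\frac{1}{CHW} \sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(z_{c,h,w} - \mu\right)^{2}}. $$

Then, we calculate a normalized feature:

$$ \hat{z} = \frac{z - \mu}{\sigma + \epsilon}, $$

where $\epsilon$ is a small value to avoid division by zero. Each ALNU module contains two adaptive learning blocks ($\mathcal{A}_{\gamma}$ and $\mathcal{A}_{\beta}$), each of which stacks two convolutional layers, two parallel pooling layers, another convolutional layer and an activation function. ALNU dynamically acquires two extra scalars with the two adaptive learning blocks according to the original input, to further adjust the distribution of the normalized feature $\hat{z}$. This process can be formulated as:

$$ \tilde{z} = \gamma \, \hat{z} + \beta, $$

where $\gamma = \mathcal{A}_{\gamma}(z)$, $\beta = \mathcal{A}_{\beta}(z)$, and $\tilde{z}$ is the final output of ALNU.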
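The normalization step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the feature is a nested list of shape C x H x W, and the two adaptive learning blocks (convolution and pooling layers in the paper) are replaced by the hypothetical stand-in functions `toy_gain` and `toy_bias`, which merely return one scalar each from the input.

```python
import math

def flatten(feature):
    """Flatten a C x H x W nested list into one list of floats."""
    return [v for channel in feature for row in channel for v in row]

def layer_statistics(feature):
    """Mean and standard deviation over all C*H*W entries of one feature."""
    values = flatten(feature)
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def alnu(feature, gain_fn, bias_fn, eps=1e-5):
    """Normalize one feature as a whole, then rescale it with two
    input-dependent scalars (gain and bias)."""
    mean, std = layer_statistics(feature)
    gain = gain_fn(feature)  # scalar; stands in for a learned block
    bias = bias_fn(feature)  # scalar; stands in for a learned block
    return [[[gain * (v - mean) / (std + eps) + bias for v in row]
             for row in channel] for channel in feature]

# Hypothetical stand-ins for the two adaptive learning blocks (assumption).
def toy_gain(feature):
    return 1.0 + 0.1 * math.tanh(max(flatten(feature)))

def toy_bias(feature):
    values = flatten(feature)
    return 0.1 * math.tanh(sum(values) / len(values))

feature = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # C=2, H=2, W=2
output = alnu(feature, toy_gain, toy_bias)
```

Because the same scalar gain and bias are applied to every entry, the ordering of values across channels is unchanged; only the overall mean and spread of the feature move.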

Compared with conventional normalization operations like BN [2015Batch] and IN [ulyanov2016instance], which adjust the original feature distribution at the channel level, the ALNU module works on individual features without breaking the relations among their channels, and thus avoids conspicuous changes to the original feature distribution. Compared with LN [ba2016layer], which enforces features to follow the same mean value and variance in evaluation, our ALNU learns the gain and bias factors from the original input features to adaptively adjust the distribution. Different from conventional normalization operations like BN, LN and GN [Wu2018GroupN], which help models learn more easily and quickly, ALNU mainly focuses on the intra-modality distributional variation of features, which is unrelated to their identity and increases the difficulty of robust feature learning. On the one hand, ALNU adaptively modifies the distribution of features within a modality and reduces the discrepancy caused by environmental interference, which further mitigates the disturbance to the learning of identity-related information. On the other hand, ALNU adaptively learns the gain and bias factors for each feature to achieve more flexible adjustment instead of enforcing all features to follow an identical mean value and variance.

III-B Cross-directional Center Loss

Compared with single-spectral data, multi-spectral data carry more information but also bring more challenges to vehicle Re-ID. The challenges can be summarized from two aspects: the sample discrepancy and the modality discrepancy. For the sample discrepancy, a suitable sample representation that fits the form of multi-modality data is necessary. Meanwhile, the ubiquitous bad cases from a certain modality in multi-modality data have to be taken into consideration. For the modality discrepancy, the heterogeneous gap among modalities prevents the direct utilization of multi-modality data. We propose the cross-directional center loss to handle the above discrepancies and to mine a better identity embedding in multi-spectral vehicle Re-ID.

In the training process, we randomly select $P$ identities with $K$ samples each in every mini-batch, which forms $P \times K \times M$ images in total, where $M$ is the number of modalities. Let $\{f_{k}^{m}\}$, with $k = 1, \dots, K$ and $m = 1, \dots, M$, denote the final features belonging to one identity in a training mini-batch. The geometric sample center of sample $k$ averages its features over the modalities:

$$ c_{k}^{s} = \frac{1}{M} \sum_{m=1}^{M} f_{k}^{m}. $$

To overcome the sample discrepancy in the multi-modality case, we propose to pull the intra-class sample centers as close as possible: the sample center loss penalizes the distances among the sample centers of the same identity.

Similarly, the geometric modality center of modality $m$ averages its features over the samples:

$$ c^{m} = \frac{1}{K} \sum_{k=1}^{K} f_{k}^{m}. $$

In the same manner, to overcome the modality discrepancy, we propose to pull the intra-class modality centers as close as possible: the modality center loss penalizes the distances among the modality centers of the same identity.

The cross-directional center loss then combines the sample center loss and the modality center loss. A more intuitive demonstration is shown in Fig. 2.

The gradient of the cross-directional center loss with respect to a feature $f_{k}^{m}$ can be derived in closed form (since the loss only concerns intra-class relations, the identity index is omitted). The final optimizing strength with respect to $f_{k}^{m}$ is linearly dependent on its corresponding sample center $c_{k}^{s}$, its modality center $c^{m}$ and the global identity center. Intra-class features within the same sample (modality) share the same gradient along the sample (modality) direction. Besides, the gradient with respect to $f_{k}^{m}$ is not directly related to $f_{k}^{m}$ itself, so it is less sensitive when $f_{k}^{m}$ corresponds to a bad case in a certain modality.
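To make the two directions concrete, the following plain-Python sketch computes the sample centers and modality centers for the features of one identity and a loss on each set of centers. The exact penalty is an assumption made for illustration: here each loss is the mean squared distance of the centers to their own mean (which equals the global identity center), and the `balance` weight is placed on the modality term; the function names are hypothetical as well.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cross_directional_centers(features):
    """features[k][m] is the feature of sample k in modality m (one identity)."""
    num_samples, num_modalities = len(features), len(features[0])
    # Sample direction: average each sample over its modalities.
    sample_centers = [mean_vector(features[k]) for k in range(num_samples)]
    # Modality direction: average each modality over the samples.
    modality_centers = [
        mean_vector([features[k][m] for k in range(num_samples)])
        for m in range(num_modalities)
    ]
    return sample_centers, modality_centers

def center_spread(centers):
    """Assumed penalty: mean squared distance of the centers to their mean."""
    anchor = mean_vector(centers)
    return sum(squared_distance(c, anchor) for c in centers) / len(centers)

def cdc_loss(features, balance=1.0):
    sample_centers, modality_centers = cross_directional_centers(features)
    loss_sc = center_spread(sample_centers)
    loss_mc = center_spread(modality_centers)
    return loss_sc + balance * loss_mc, loss_sc, loss_mc

# Two samples of one identity, three modalities (RGB, NIR, TIR), 2-D features.
features = [
    [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]],
    [[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]],
]
total, loss_sc, loss_mc = cdc_loss(features)
```

In this toy example the samples differ only along the second coordinate and the modalities only along the first, so each loss term isolates one direction of discrepancy.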

In this work, the number of modalities and the number of samples per identity in a mini-batch are set to 3 and 4 respectively. As the gradient derivation shows, the final gradient factors along the sample and modality directions are therefore different. Thus, we introduce a hyper-parameter to balance their strengths in the final formulation of the cross-directional center loss.

The cross-directional center loss focuses on optimizing the intra-class relation along the sample and modality directions. To enhance discriminative inter-class learning, we further introduce the cross entropy loss. The total loss combines the cross entropy loss and the cross-directional center loss, where a second hyper-parameter balances the importance of the two components. In our experiments, both hyper-parameters are set according to the hyper-parameter analysis in Section V-H.

IV MSVR310 Benchmark

In this work, we release a new dataset called MSVR310 for multi-spectral vehicle Re-ID.

IV-A Imaging Platform

In MSVR310, three spectral modalities, RGB, NIR and TIR, are captured for each sample. The RGB images are captured by two devices: a 360 D866 camera for daytime and a Mi8 mobile phone camera for nighttime. All the NIR images are captured by the 360 D866 camera, which can be switched to the near infrared mode. The TIR images are captured by a FLIR SC620, which contains a thermal infrared camera.

Each sample in our dataset is a triplet of three images from RGB, NIR and TIR respectively. We manually select the bounding boxes of the targets in the original captured images.

IV-B Data Setting and Statistics

Our dataset contains 2087 samples from 310 vehicles, and each sample is a triplet, which results in 6261 images in total. The number of samples per vehicle varies from 2 to 20. We randomly select 155 vehicles with 1032 samples as the training set, and use the remaining 155 vehicles with 1055 samples as the gallery set. We randomly select 52 vehicles with 591 samples from the gallery set as the query set. Each query identity has been captured at least twice with different time labels to support cross-time matching. The data distribution is shown in Fig. 5.
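As a quick consistency check of the figures above, the split sizes can be added up in a few lines. The dictionary layout and variable names are only illustrative; the numbers are taken from the text.

```python
MODALITIES = 3  # RGB, NIR, TIR: every sample is a triplet

# Training and gallery sets are disjoint; the query set is drawn from the gallery.
split = {
    "train": {"vehicles": 155, "samples": 1032},
    "gallery": {"vehicles": 155, "samples": 1055},
}

total_vehicles = sum(part["vehicles"] for part in split.values())
total_samples = sum(part["samples"] for part in split.values())
total_images = total_samples * MODALITIES

print(total_vehicles, total_samples, total_images)  # 310 2087 6261
```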

We annotate the data with time labels according to their collection order. Fig. 6 shows the distribution of the capture times, and Fig. 7 shows example images of four vehicles in MSVR310 across time labels. Each vehicle appears in various conditions with complex interference such as strong illumination, reflection, shadow and color distortion. Thus, bad cases in a certain modality are ubiquitous and the intra-class appearance discrepancy is very significant in MSVR310. Illumination disturbance of this degree is quite rare in existing works [Lou2019VERIWildAL, Liu2016LargescaleVR, Liu_2016_CVPR, li2020multi]. However, these disturbances manifest differently in different modalities, and the data across modalities are complementary against interference, which calls for better utilization of multi-spectral data.

Fig. 5: Distribution for number of identities across sample sizes.
Fig. 6: Distribution of samples and identities across the number of time labels in MSVR310.
Fig. 7: Illustration of four sample data in MSVR310. Images in box with the same color indicate the multi-modality samples of the same identity with different time labels.

IV-C Difference from Previous Work

Benchmark   IDs   Videos   Modality   Views   Time Labels
RGBN300     300   4100     R+N        8       -
RGBNT100    100   2070     R+N+T      8       -
MSVR310     310   6261     R+N+T      8       28
TABLE I: Comparison of RGBN300, RGBNT100 and MSVR310, where '-' denotes 'not available'.

Li et al. [li2020multi] first propose two benchmark multi-spectral vehicle Re-ID datasets, RGBN300 and RGBNT100, as shown in Table I. First, although RGBN300 and RGBNT100 contain many more images than MSVR310, RGBNT100 is actually collected from 2070 short videos (690 videos for each modality), which leads to many similar frames. We construct MSVR310 in various environments, with large changes of illumination, occlusion and weather, by capturing high-quality images instead of videos. Second, MSVR310 is collected across long time spans, which leads to a rich collection of environments and vehicles. These factors significantly increase the diversity and difficulty of our dataset. Third, although matching between samples with the same identity and the same viewpoint is not allowed in RGBN300 and RGBNT100, the environmental similarity among samples still tends to produce easy matches. Instead, MSVR310 introduces time labels to avoid easy matching. Matching between samples with the same identity and the same time label is forbidden in MSVR310, as shown in Fig. 4. This protocol effectively handles the easy matching problem and provides a more reliable evaluation.
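The time-label protocol can be illustrated with a small filter over the gallery. This is a minimal sketch, assuming each sample carries an identity and a time label; the `Sample` record and its field names are hypothetical and do not reflect the dataset's actual annotation format.

```python
from collections import namedtuple

# Hypothetical record type for one multi-spectral sample (assumption).
Sample = namedtuple("Sample", ["sample_id", "identity", "time_label"])

def valid_gallery(query, gallery):
    """Drop gallery samples that share both the identity and the time
    label of the query, so that such easy matches are never scored."""
    return [
        g for g in gallery
        if not (g.identity == query.identity and g.time_label == query.time_label)
    ]

query = Sample("q0", identity=7, time_label=3)
gallery = [
    Sample("g0", identity=7, time_label=3),   # same vehicle, same time: excluded
    Sample("g1", identity=7, time_label=12),  # same vehicle, other time: kept
    Sample("g2", identity=9, time_label=3),   # other vehicle, same time: kept
]
kept = [g.sample_id for g in valid_gallery(query, gallery)]
print(kept)  # ['g1', 'g2']
```

Distractors captured at the same time stay in the gallery; only the same-identity, same-time samples are removed, which mirrors how camera-based protocols remove same-identity, same-camera samples.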

V Experiments

V-A Datasets and Evaluation Metrics

To evaluate the effectiveness of the proposed CCNet on our multi-spectral vehicle Re-ID dataset and on the public dataset, we provide comprehensive experimental results in this section. There is only one public RGB-NIR-TIR image dataset, RGBNT100 [li2020multi], for the evaluation of multi-spectral vehicle Re-ID methods. Therefore, we conduct the experiments on MSVR310 and RGBNT100, following their own evaluation protocols.

For a fair experimental evaluation, we follow the commonly used Cumulative Matching Characteristic (CMC) curve and the mean Average Precision (mAP). The CMC score reflects the retrieval precision, and the Rank-1, Rank-5 and Rank-10 scores are reported in our experiments. mAP is the mean over all queries of the average precision (the area under the Precision-Recall curve), which reflects recall and precision comprehensively.
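For readers unfamiliar with these metrics, the following sketch computes Rank-k and mAP from ranked match lists. It assumes that the gallery of each query has already been sorted by decreasing similarity and reduced to a list of booleans marking correct identities, and it uses the usual discrete form of average precision (the mean of the precision values at each correct match) as an approximation of the area under the Precision-Recall curve. The function names are illustrative.

```python
def average_precision(ranked_matches):
    """ranked_matches[r] is True if the gallery item at rank r+1 is correct."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0

def rank_k_hit(ranked_matches, k):
    """True if at least one correct match appears within the top k ranks."""
    return any(ranked_matches[:k])

def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    n = len(all_ranked_matches)
    m_ap = sum(average_precision(r) for r in all_ranked_matches) / n
    cmc = {k: sum(rank_k_hit(r, k) for r in all_ranked_matches) / n for k in ks}
    return m_ap, cmc

# Two toy queries over a gallery of four items each.
ranked = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = 1/2
]
m_ap, cmc = evaluate(ranked)
```

Here the first query contributes an average precision of 5/6 and the second 1/2, and only the first query has a correct match at rank 1.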

V-B Implementation Details

We use the strong baseline BoT [Luo2020ASB], which is modified from ResNet50 [he2016deep] pretrained on ImageNet, as our backbone. The implementation platform is PyTorch 1.0.1 with one NVIDIA GTX 1080Ti GPU. We use the Adam [kingma2014adam] optimizer to optimize our network, and the learning rate is decayed at the 300th and 550th epochs of 1200 epochs in total. In the training process, the input images are resized to a fixed resolution, and data augmentation methods including random cropping, horizontal flipping and random erasing are used. In each training mini-batch, we randomly select 8 identities, each of which provides 4 samples (12 images). In evaluation, unless otherwise stated, we concatenate the features extracted after the BNNeck [Luo2020ASB] from the three parallel branches as the final representation of a sample.
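The mini-batch construction described above (8 identities with 4 samples each, hence 32 triplet samples or 96 images) can be sketched as an identity-balanced sampler. This is an illustrative sketch rather than the authors' implementation: the function name is hypothetical, and identities with fewer than 4 samples are simply skipped here, which is an assumption since the text does not say how they are handled.

```python
import random

def sample_batch(samples_by_identity, num_identities=8, samples_per_identity=4, rng=None):
    """Draw an identity-balanced mini-batch of (identity, sample) pairs."""
    if rng is None:
        rng = random.Random()
    # Assumption: identities with too few samples are skipped in this sketch.
    eligible = [identity for identity, samples in samples_by_identity.items()
                if len(samples) >= samples_per_identity]
    chosen = rng.sample(eligible, num_identities)
    return [(identity, sample)
            for identity in chosen
            for sample in rng.sample(samples_by_identity[identity], samples_per_identity)]

# Toy data: ten vehicles with five triplet samples each, one with only two.
toy = {f"vehicle_{i}": [f"vehicle_{i}_sample_{k}" for k in range(5)] for i in range(10)}
toy["vehicle_rare"] = ["vehicle_rare_sample_0", "vehicle_rare_sample_1"]

batch = sample_batch(toy, rng=random.Random(0))
num_images = len(batch) * 3  # each sample is an RGB-NIR-TIR triplet
```

With this layout, every identity in the batch contributes the same number of samples, which is what the per-identity sample and modality centers require.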

V-C Evaluation on MSVR310 Dataset

Network  | Test Feature | mAP  | Rank-1 | Rank-5 | Rank-10
ResNet50 | R            | 20.0 | 29.9   | 49.9   | 61.6
ResNet50 | N            | 17.8 | 28.9   | 51.3   | 62.8
ResNet50 | T            | 11.9 | 23.2   | 37.4   | 46.4
ResNet50 | R + N        | 23.6 | 36.7   | 57.0   | 66.2
ResNet50 | R + T        | 22.6 | 35.4   | 54.7   | 63.5
ResNet50 | N + T        | 21.4 | 37.2   | 56.3   | 64.3
ResNet50 | R + N + T    | 25.6 | 39.4   | 58.5   | 67.9
CCNet    | R            | 30.7 | 49.4   | 65.5   | 73.3
CCNet    | N            | 26.3 | 45.5   | 67.3   | 73.1
CCNet    | T            | 19.6 | 35.7   | 53.5   | 61.9
CCNet    | R + N        | 34.0 | 53.6   | 70.2   | 76.3
CCNet    | R + T        | 34.6 | 52.8   | 68.7   | 75.5
CCNet    | N + T        | 31.4 | 51.6   | 68.9   | 76.6
CCNet    | R + N + T    | 36.4 | 55.2   | 72.4   | 79.7
TABLE II: Experimental Comparison of the Effectiveness of Modalities between ResNet50 and CCNet on MSVR310 (in %). In the Test Feature Column, R, N and T Denote Features from the Corresponding Spectrum (Branch), while '+' Denotes Feature Concatenation.

Models                                 | MSVR310: mAP | Rank-1 | Rank-5 | Rank-10 | RGBNT100: mAP | Rank-1 | Rank-5 | Rank-10
MobileNetV2 [Sandler2018MobileNetV2IR] | 22.5         | 37.6   | 53.6   | 64.1    | 57.1          | 82.3   | 85.7   | 87.8
MobileNetV2 + CCNet                    | 24.0         | 43.5   | 59.2   | 70.1    | 66.5          | 90.6   | 92.2   | 92.9
SENet [Hu2020SqueezeandExcitationN]    | 22.7         | 40.9   | 60.6   | 69.9    | 64.8          | 90.3   | 91.0   | 91.8
SENet + CCNet                          | 29.5         | 47.0   | 67.7   | 73.8    | 78.1          | 94.1   | 94.8   | 95.2
InceptionV3 [Szegedy2016RethinkingTI]  | 23.1         | 43.7   | 59.7   | 68.5    | 53.9          | 81.2   | 84.1   | 86.1
InceptionV3 + CCNet                    | 28.0         | 49.4   | 64.0   | 72.3    | 63.5          | 90.7   | 92.4   | 93.5
TABLE III: Experimental Results of the Generality of Our Method with Different Backbones on MSVR310 and RGBNT100 (in %).

Models                          | Reference   | MSVR310: mAP | Rank-1 | Rank-5 | Rank-10 | RGBNT100: mAP | Rank-1 | Rank-5 | Rank-10
DMML [chen2019deep]             | ICCV 2019   | 19.1         | 31.1   | 48.7   | 57.2    | 58.5          | 82.0   | 85.1   | 86.2
Circle Loss [Sun2020CircleLA]   | CVPR 2020   | 22.7         | 34.2   | 52.1   | 57.2    | 59.4          | 81.7   | 83.7   | 85.2
PCB [Sun2018BeyondPM]           | ECCV 2018   | 23.2         | 42.9   | 58.0   | 64.6    | 57.2          | 83.5   | 85.6   | 87.9
MGN [Wang2018LearningDF]        | ACM MM 2018 | 26.2         | 44.3   | 59.0   | 66.8    | 58.1          | 83.1   | 85.6   | 88.0
Strong Baseline [Huynh2021ASB]  | CVPRW 2021  | 23.5         | 38.4   | 56.8   | 64.8    | 78.0          | 95.1   | 95.8   | 96.4
HRCN [Zhao_2021_ICCV]           | ICCV 2021   | 23.4         | 44.2   | 66.0   | 73.9    | 67.1          | 91.8   | 93.1   | 93.8
OSNet [Zhou2019OmniScaleFL]     | ICCV 2019   | 28.7         | 44.8   | 66.2   | 73.1    | 75.0          | 95.6   | 97.0   | 97.4
AGW [pami21reidsurvey]          | T-PAMI 2021 | 28.9         | 46.9   | 64.3   | 72.3    | 73.1          | 92.7   | 94.3   | 94.9
TransReID [He2021TransReIDTO]   | ICCV 2021   | 18.4         | 29.6   | 47.9   | 57.5    | 75.6          | 92.9   | 93.9   | 94.6
PFNet [zheng2021robust]         | AAAI 2021   | 23.5         | 37.4   | 57.0   | 67.3    | 68.1          | 94.1   | 95.3   | 96.0
HAMNet [li2020multi]            | AAAI 2020   | 27.1         | 42.3   | 61.6   | 69.5    | 74.5          | 93.3   | 94.3   | 95.2
CCNet                           | Ours        | 36.4         | 55.2   | 72.4   | 79.7    | 77.2          | 96.3   | 97.2   | 97.7
TABLE IV: Comparison to State-of-the-art Re-ID Methods on MSVR310 and RGBNT100 (in %). The Best Scores Are Highlighted in Red, the Second Best Scores Are Highlighted in Green, and the Third Best Scores Are Highlighted in Blue.

We first compare our CCNet with ResNet50 on the MSVR310 dataset, as reported in Table II. For fairness, we use the same implementation of ResNet50 from BoT [Luo2020ASB], which is also the backbone of CCNet. Specifically, the ResNet50 results are obtained by a multi-branch network constructed from three separate ResNet50 networks, in which each branch handles data from one modality. These branches are independent and do not interact with each other, whereas CCNet jointly exploits the multi-spectral data in the training phase and thus achieves much better performance. R, N and T in Table II denote the spectrum whose features are used for distance computation in the test phase. Note that all three modalities are used during the training phase.

From Table II we can observe the following. i) None of the single spectra achieves satisfactory performance, due to the complex lighting environments in the MSVR310 dataset. In general, RGB and NIR provide comparably reliable appearance information and thus lead to much better performance than TIR. ii) The two-spectrum settings significantly improve all the metrics over the single-spectrum ones, and the three-spectrum setting further boosts the performance of both ResNet50 and CCNet. This strongly demonstrates the effectiveness of the introduced multi-spectral data. iii) CCNet outperforms ResNet50 by a large margin even though their network structures differ only slightly. This strongly supports the soundness of our discrepancy-mitigating design and the effectiveness of the proposed CdC loss and ALNU module.

V-D Evaluation on Different Backbones

To validate the generality of our method, we integrate our CCNet into three backbones including MobileNetV2 [Sandler2018MobileNetV2IR], SENet [Hu2020SqueezeandExcitationN] and InceptionV3 [Szegedy2016RethinkingTI], as shown in Table III. All three backbones achieve significant improvements after integrating our framework, which indicates the generality of our method.

V-E Comparison with State-of-the-art Methods

To validate the effectiveness of our method, we extend nine state-of-the-art single-modality Re-ID methods, namely DMML [chen2019deep], Circle loss [Sun2020CircleLA], PCB [Sun2018BeyondPM], MGN [Wang2018LearningDF], Strong Baseline [Huynh2021ASB], HRCN [Zhao_2021_ICCV], OSNet [Zhou2019OmniScaleFL], AGW [pami21reidsurvey] and TransReID [He2021TransReIDTO], to multi-modality versions for comparison. Specifically, we train each single-modality method on each spectrum separately and then concatenate the final features from the modalities of the same sample as the final representation. In addition, we compare our CCNet with the multi-spectral vehicle Re-ID method HAMNet [li2020multi] and the multi-spectral person Re-ID method PFNet [zheng2021robust]. The experimental comparison of these methods is shown in Table IV.
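The feature-concatenation step used to extend single-modality methods can be sketched as below. The Euclidean distance and the function names are assumptions chosen for illustration; the distance metric is not specified in this section.

```python
import math
from typing import List, Sequence


def concat_features(rgb: Sequence[float], nir: Sequence[float],
                    tir: Sequence[float]) -> List[float]:
    """Final representation of a sample: per-spectrum features placed side by side."""
    return list(rgb) + list(nir) + list(tir)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Gallery indices sorted from the closest to the farthest sample."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))


query = concat_features([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
gallery = [
    concat_features([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]),  # far from the query
    concat_features([1.0, 0.0], [0.0, 1.0], [1.0, 0.9]),  # close to the query
]
ranking = rank_gallery(query, gallery)
print(ranking)
```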

First, all the methods perform much worse on MSVR310 than on RGBNT100, which is caused by the greater difficulty of the proposed MSVR310 dataset and by our evaluation protocol, which filters out the easy matches arising from positive samples with the same time label. The proposed CCNet outperforms almost all the compared methods on both RGB-NIR-TIR vehicle Re-ID benchmarks, with a particularly large margin on MSVR310, which demonstrates the effectiveness of our method.

Second, as the first baseline method for multi-spectral vehicle Re-ID, HAMNet [li2020multi] presents a simple network structure with considerable performance on three benchmark datasets, which proves its effectiveness in multi-spectral feature learning. PFNet [zheng2021robust] is the first work on multi-spectral person Re-ID, but its local feature separation seems to be more suitable for person data than for vehicle data.

V-F Ablation Study and Visualization

Models MSVR310
mAP Rank-1 Rank-5 Rank-10
baseline 25.6 39.4 58.5 67.9
29.4 47.2 66.0 74.3
27.4 41.6 61.8 69.0
31.4 48.6 65.1 73.6
33.7 51.8 68.2 76.0
36.4 55.2 72.4 79.7
TABLE V: Ablation Study on MSVR310 (in %).

To verify the contributions of the proposed components in our model, we conduct an ablation study of several variants of CCNet on the MSVR310 dataset, as reported in Table V. Note that the two terms of the CdC loss and the adaptive layer normalization unit (ALNU) all improve on our baseline, which demonstrates the contributions of the corresponding modules.

Methods mAP Rank-1 Rank-5 Rank-10
baseline 25.6 39.4 58.5 67.9
+ IN [ulyanov2016instance] 26.8 42.3 61.9 70.6
+ LN [ba2016layer] 28.8 45.9 66.3 72.3
+ ALNU 29.4 47.2 66.0 74.3
25.8 42.0 60.1 66.8
30.5 48.7 64.8 72.1
33.7 51.8 68.2 76.0
TABLE VI: Experimental Comparison with Different Normalizations and Losses on MSVR310 (in %).

We verify the contribution of our ALNU module by comparing it with two conventional normalization operations, instance normalization (IN) [ulyanov2016instance] and layer normalization (LN) [ba2016layer], as shown in Table VI. IN is widely used in image style transfer and normalizes instance features directly at the channel level. LN and ALNU both treat each feature as a whole for normalization; however, LN strictly enforces all features to follow the same mean and variance, while our ALNU dynamically learns the gain and bias factors, which is more reasonable for complex data.

We also verify the contribution of our CdC loss by comparing it with two widely used center-type losses, Center loss [2016A] and HC loss [2020Hetero], as shown in Table VI. Both HC loss and Center loss are implemented on ResNet50 with the same settings as our baseline. We implement Center loss to pull features of the same identity close regardless of modality, and HC loss to reduce the modality gap within each identity. However, Center loss is not good at handling the ubiquitous bad cases from a certain modality, while HC loss ignores the discrepancy among intra-class samples in multi-modality situations. Both Center loss and HC loss are outperformed by our CdC loss, which simultaneously constrains intra-class relations from both the modality and sample aspects. This demonstrates the validity and robustness of our CdC loss in the multi-spectral vehicle Re-ID task.
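The difference between the two compared center-type losses can be made concrete with a small sketch for a single identity. This is a simplification under stated assumptions: centers are plain batch means rather than learned parameters, the HC-style term is extended to three modalities by summing pairwise center distances, and the function names are illustrative. It is not the CdC loss and not the exact formulation of either cited loss.

```python
from itertools import combinations
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_style_loss(feats: Dict[str, List[Vector]]) -> float:
    """Mean squared distance of all features to the identity center, ignoring modality."""
    all_feats = [f for group in feats.values() for f in group]
    center = mean(all_feats)
    return sum(sq_dist(f, center) for f in all_feats) / len(all_feats)


def hetero_center_style_loss(feats: Dict[str, List[Vector]]) -> float:
    """Sum of squared distances between the per-modality centers of one identity."""
    centers = [mean(group) for group in feats.values()]
    return sum(sq_dist(a, b) for a, b in combinations(centers, 2))


# Two samples of one identity whose features differ, but identically in every modality.
feats = {
    "RGB": [[0.0, 0.0], [2.0, 0.0]],
    "NIR": [[0.0, 0.0], [2.0, 0.0]],
    "TIR": [[0.0, 0.0], [2.0, 0.0]],
}
center_value = center_style_loss(feats)
hetero_value = hetero_center_style_loss(feats)
print(center_value, hetero_value)
```

In this toy case the modality centers coincide, so the HC-style term is zero even though the two samples are far apart, which mirrors the remark that HC loss ignores the discrepancy among intra-class samples.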

Fig. 8: t-SNE [maaten2008visualizing] illustration of the feature distributions extracted by CCNet trained (a) without and (b) with the CdC loss.

Fig. 8 compares the feature distributions of the network trained with and without the CdC loss. When training without the CdC loss, features from different modalities are mixed and hard to separate by identity label. After introducing the CdC loss, features from different modalities with the same identity are constrained to be more consistent at both the sample and spectral levels, and images from the same modality with different identities are better distinguished.

Fig. 9: Illustration of the multi-modality feature distribution after introducing ALNU. Each point in the figure corresponds to an image in our dataset, and the red points belong to the same identity. Compared with Fig. 3, we can observe that the distributional discrepancy is significantly mitigated.

Fig. 9 shows the distribution of the multi-modality features after introducing ALNU. Compared with Fig. 3, ALNU pushes the features towards similar mean values and standard deviations, which reduces the distributional variation.

V-G Evaluation on Random Modality Missing

Fig. 10: Performance of different methods under different ratios of samples with partially missing modalities on MSVR310.

To verify the generality of the proposed method and dataset in diverse real scenarios, we further evaluate CCNet in handling the missing modality issue.

Specifically, we set a certain ratio of the samples in the test set to have missing modalities. The ratio indicates the probability that a sample has partially missing modalities, with one or two modalities missing in equal proportion. To overcome the feature misalignment caused by missing modalities, we use the geometric center of the existing modality or modalities as the final representation of the sample.
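One possible way to simulate this setting is sketched below. The function name, the RGB/NIR/TIR mask order and the uniform choice of which modalities to drop are assumptions for illustration; the text only states the missing ratio and the equal proportion of one and two missing modalities.

```python
import random
from typing import Tuple


def sample_mask(missing_ratio: float, rng: random.Random) -> Tuple[int, int, int]:
    """Binary (RGB, NIR, TIR) mask: 1 means the modality is available."""
    mask = [1, 1, 1]
    if rng.random() < missing_ratio:
        num_missing = rng.choice([1, 2])          # one or two, in equal proportion
        for idx in rng.sample(range(3), num_missing):
            mask[idx] = 0                         # assumed: dropped modalities picked uniformly
    return (mask[0], mask[1], mask[2])


rng = random.Random(0)
masks = [sample_mask(0.5, rng) for _ in range(1000)]
observed_ratio = sum(1 for m in masks if sum(m) < 3) / len(masks)
print(observed_ratio)
```

At least one modality is always kept, so every test sample can still be represented.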

In the normal case without missing modalities, CCNet extracts a final representation for each sample of each identity, which is a triplet of the corresponding modality features. To handle the modality-missing case, we generate a binary triplet mask for each sample to indicate whether each modality is available or missing. The geometric center of a sample is then the mask-weighted sum of its modality features divided by the number of available modalities, and it serves as the final representation of the sample.
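The geometric center of the available modalities amounts to a masked mean, which can be sketched as follows. The function name and the list-based feature layout are assumptions for illustration rather than the notation of the paper.

```python
from typing import List, Sequence


def geometric_center(features: Sequence[Sequence[float]],
                     mask: Sequence[int]) -> List[float]:
    """Average the modality features whose mask entry is 1."""
    present = [f for f, m in zip(features, mask) if m]
    if not present:
        raise ValueError("at least one modality must be available")
    return [sum(col) / len(present) for col in zip(*present)]


# (RGB, NIR, TIR) features of one sample, two dimensions each.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full = geometric_center(features, (1, 1, 1))      # all modalities available
no_tir = geometric_center(features, (1, 1, 0))    # TIR missing
only_tir = geometric_center(features, (0, 0, 1))  # RGB and NIR missing
print(full, no_tir, only_tir)
```

Because the result always has the dimensionality of a single modality feature, samples with different missing patterns remain directly comparable.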

We evaluate the stability of our method in handling missing modalities by comparing it with the representative multi-modality Re-ID method HAMNet [li2020multi] and the state-of-the-art single-modality Re-ID method OSNet [Zhou2019OmniScaleFL]. All results are the mean of 10 random trials. Fig. 10 shows the performance against the ratio of samples with partially missing modalities. Generally speaking, CCNet consistently outperforms both HAMNet and OSNet by a large margin. Even when all the samples have missing modalities (a ratio of 100%), CCNet still achieves competitive performance, comparable with the results of HAMNet and OSNet at low missing ratios. This verifies the stability of our method in handling missing modalities. Meanwhile, all the metrics drop as the missing ratio increases, which indicates the importance of the complementary information in the multi-modality resources. As a state-of-the-art single-modality Re-ID method, OSNet degrades much faster than the two multi-modality methods HAMNet and CCNet, which indicates the advantage of fusing multi-modality information when handling missing modalities.

V-H Hyper-parameter Analysis

Weight of CdC loss, Eq. (16) | mAP  | Rank-1 | Rank-5 | Rank-10
0.1                          | 31.3 | 47.9   | 66.5   | 72.6
0.2                          | 33.3 | 50.6   | 67.2   | 73.8
0.3                          | 33.7 | 51.8   | 68.2   | 76.0
0.4                          | 33.4 | 50.9   | 67.0   | 74.6
0.5                          | 33.0 | 50.6   | 67.7   | 74.3
0.6                          | 32.8 | 50.4   | 66.8   | 73.3
0.7                          | 32.7 | 50.1   | 66.2   | 73.8
0.8                          | 32.3 | 49.7   | 66.5   | 72.9
0.9                          | 31.9 | 49.2   | 65.7   | 72.6
1.0                          | 31.3 | 48.4   | 65.8   | 72.8

Balance factor, Eq. (15)     | mAP  | Rank-1 | Rank-5 | Rank-10
0.1                          | 32.6 | 48.6   | 66.2   | 74.0
0.2                          | 33.5 | 51.6   | 67.2   | 76.0
0.3                          | 33.1 | 50.4   | 66.9   | 75.5
0.4                          | 33.4 | 50.1   | 67.1   | 76.4
0.5                          | 33.1 | 50.1   | 68.9   | 77.0
0.6                          | 33.7 | 51.8   | 68.2   | 76.0
0.7                          | 33.8 | 51.6   | 67.9   | 75.8
0.8                          | 33.3 | 51.0   | 67.2   | 74.7
0.9                          | 33.1 | 51.0   | 68.2   | 75.0
1.0                          | 33.1 | 50.4   | 67.7   | 76.5
TABLE VII: Hyper-parameter Analysis on MSVR310 (in %).

There are two hyper-parameters in our method: the weight in Eq. (16), which controls the importance of the CdC loss in the total loss, and the balance factor in Eq. (15), which balances the strength of the gradient along the sample and modality directions in the CdC loss. A large weight may affect the inter-class discrimination ability provided by the rest of the total loss, and a large balance factor may break the balance between the two directions. Therefore, we vary both hyper-parameters between 0.1 and 1.0 for the analysis. The results for the different values are reported in Table VII. Our method achieves its best performance when the weight is set to 0.3, while it is not sensitive to the balance factor. We fix the weight and the balance factor to 0.3 and 0.6, respectively, for the best performance of our method.

VI Conclusion

In this work, we propose a novel end-to-end trained convolutional network named CCNet for robust multi-spectral vehicle Re-ID. CCNet contains a novel cross-directional center (CdC) loss to simultaneously overcome the problems of cross-modality discrepancy and intra-class individual discrepancy. Meanwhile, a simple yet effective module named adaptive layer normalization unit is designed to embed in CCNet to mitigate the distributional variation of intra-class features for robust feature learning. Furthermore, we create a high-quality benchmark dataset MSVR310 with diverse conditions and reasonable evaluation protocol. Comprehensive experiments on our benchmark dataset MSVR310 and the public RGB-NIR-TIR dataset RGBNT100 validate the superior performance of our CCNet and the research value of the proposed benchmark dataset.