Person Re-identification (Re-ID) aims at identifying the same person across multiple non-overlapping camera views [1, 2]. It has been widely investigated because of its great importance for video surveillance [3, 4]. Existing Re-ID methods mainly focus on the RGB-RGB matching problem [7, 8], where all images are captured under good visible-light conditions. However, in low-illumination environments, visible cameras cannot provide enough discriminative characteristics of a person, as shown in Fig. 1 (a). To address this issue, Visible-Infrared Person Re-Identification (VI-ReID) has been introduced for person matching, which is imperative for practical surveillance applications.
As shown in Fig. 1 (b), given a target infrared (RGB) image, the main goal of VI-ReID is to retrieve the corresponding RGB (infrared) images from a gallery set captured by cameras of the other spectrum [13, 5, 11, 12, 6]. To the best of our knowledge, very few works have paid attention to visible-infrared person re-identification, mainly for two reasons. First, VI-ReID suffers from the large modality discrepancy between the imaging processes of different spectrum cameras. Second, the appearance discrepancy of a person caused by distinct viewpoints, pose variations and scale changes results in large intra-class variations. As a result, it is difficult for a neural network to find a shared feature space in which the information from different modalities can be treated equally.
To deal with the above issues, some pioneering works have been proposed. Wu et al. introduced a deep zero-padding method to extract unified cross-modality feature vectors by adaptively handling the modality input. Concurrently, Ye et al. proposed an effective two-stream framework for feature learning and metric learning. A follow-up work improved this idea and introduced a dual-constrained top-ranking loss to learn discriminative feature representations. However, these methods only consider the modality discrepancy of global coarse-grained features and ignore the variation of pixel-level fine-grained features. More recently, some GAN-based methods [10, 14, 16, 15] have been proposed to generate cross-modality images to fill the gap between the two modalities. However, training GAN-based models is unstable, and the key local structure information of images is easily destroyed by unsupervised generation.
In view of the analysis above, we formulate a novel spectrum-aware feature augmentation network (SFANet), which alleviates the large cross-modality discrepancy at the image level and provides strong feature-level constraints to derive discriminative features that distinguish images of different classes. Our main idea is illustrated in Fig. 2. Different from the approaches [14, 16, 10] that use GANs to generate modality-consistent images, we carry out cross-spectrum feature augmentation by generating spectrum-aware grayscale images, which transforms the original RGB features into a new homogeneously augmented space. The generated grayscale images are effective for model training because they approximate the infrared style while preserving the complete structure information of the visible images. To further alleviate the remaining cross-modality discrepancy between the grayscale spectrum and the infrared modality, we adopt a sharable dual-path information-preserving structure that shares some parameters of the residual blocks and operates on 3D-shaped feature tensors to preserve the spatial structure information of the images. This parameter-sharing strategy was introduced in prior work and serves as a strong and effective baseline backbone.
Additionally, we investigate cross-modality sharable identity-specific information and explore a suitable metric that minimizes the ambiguity between classes while maximizing the cross-modality similarity among instances, with the aim of making person features modality-invariant and thereby facilitating the cross-modality adaptation process of SFANet. More concretely, we propose a strong bi-directional tri-constrained top-push ranking (BTTR) loss with separate cross-modality, intra-modality and inter-modality training strategies to guide the training process. Compared with the BDTR loss introduced in prior work, the newly designed inter-modality regularizer explicitly considers the distance relationship between cross-modality and intra-modality pairs, addressing the difficulty of learning a discriminative feature embedding. Furthermore, a dual-linear with batch normalization ID embedding method is employed to improve the robustness against modality variations at the classifier level. An in-depth analysis of SFANet is presented in the section “Discussion”.
The main contributions are fourfold:
We propose a spectrum-aware feature augmentation network for visible-infrared person Re-ID. To the best of our knowledge, this is the first attempt to exploit grayscale-spectrum images to fully replace conventional RGB images for cross-modality feature learning.
We develop a sharable dual-path information-preserving network in which the two shallow convolutional layers extract modality-specific features and the remaining layers share parameters for feature embedding. Compared with the first parameter-sharing work for VI-ReID, it achieves better performance.
We propose the Bi-directional Tri-constrained Top-Push Ranking (BTTR) loss to constrain the relative distances of different classes both within the same modality and across modalities, further improving the performance of the network.
To improve classification accuracy and help stabilize the magnitudes of the embedding vectors, we design a novel classifier structure, named “Dual-Linear with Batch Normalization ID Embedding”, which brings a large performance improvement when combined with the proposed BTTR loss function.
II Related Work
In this section, we review the literature from the following three main research directions.
Visible Camera-Based Re-ID Methods. Person re-identification (Re-ID) mainly focuses on the challenge of appearance changes caused by different camera views, pose variations, partial occlusion and illumination changes [17, 18, 19, 20]. Traditional visible camera-based Re-ID methods consist of two fundamental components: feature extraction [40, 41, 9, 21] and metric learning [42, 43, 22, 39]. Feature extraction works mainly focus on extracting more robust and discriminative feature representations. For example, Zhang et al. proposed a visual control flow language plus (VCFL+) model to improve the fusion process at the feature map level with the attention mechanism. Metric learning works aim at making a pair of true matches have a smaller distance than a pair of wrong matches. Liao et al. proposed a logistic metric learning algorithm which adopts an asymmetric sample weighting strategy to enhance Re-ID training. All these Re-ID models are effective in addressing the problem of appearance changes and have achieved human-level retrieval performance. However, they are designed only for a single visible modality and do not perform well on the cross-modality Re-ID task.
Cross-Modality Re-ID Methods. Cross-modality person Re-ID aims at matching visible and infrared images of a person across modalities, i.e., visible-to-infrared or infrared-to-visible [5, 12, 6, 11, 23, 33]. Besides the appearance discrepancy found in single-modality Re-ID, VI-ReID confronts the new challenge of a large modality discrepancy between visible and infrared images. More related to our work, several pioneering works have explored the issue of VI-ReID. Wu et al. collected the first cross-modality dataset, named SYSU-MM01, and proposed a zero-padding strategy with a one-stream network to adaptively evolve domain-specific nodes for visible-infrared matching. Later on, Ye et al. proposed a hierarchical learning method to jointly optimize the modality-specific and modality-shared metrics. A follow-up work improved this idea and introduced a two-stream network with a dual-constrained top-ranking loss to simultaneously handle the cross-modality and intra-modality variations. Besides, Dai et al. used a generative adversarial training strategy to learn discriminative representations from different modalities. In addition, some methods [23, 33] have been proposed to improve the performance of feature learning by utilizing modality-specific classifiers. However, all these methods focus on filling the gap between visible and infrared inputs by means of feature-level constraints, ignoring the large cross-modality variation brought by the different spectra and their dramatic pixel-level imbalance.
Image Generation for Person VI-ReID. Recently developed generative adversarial techniques provide a powerful tool for image translation. The works most relevant to ours are the following five image-based methods [25, 14, 15, 16, 10]. An XIV cross-modality learning method was proposed, which exploits a lightweight network to generate an auxiliary X modality in a self-supervised manner to bridge the modality gap between visible and infrared images. DRL used a generative sub-network to translate cross-modality images from given visible or infrared images and handled the feature-level discrepancy through feature embedding in a unified space. AlignGAN is the first work to adopt pixel-level and feature-level alignment strategies in a unified framework, which not only reduces the cross-modality and intra-modality discrepancies, but also learns identity-consistent features. Meanwhile, some other works [15, 16] attempt to use GANs to generate more realistic cross-modality images to eliminate the large modality discrepancy. All these methods achieve superior performance, but training a generator and a discriminator consumes substantial computing resources, and balancing the generator and the discriminator is a difficult problem. In comparison, our generation, which uses a linear accumulation of the three RGB channels, is easier to implement, and the generated images have the same underlying appearance as the original images.
III Our Approach
In this section, we first describe the grayscale-spectrum image feature augmentation process, which addresses the large modality discrepancy with a style simulation approach. Then, we introduce our sharable dual-path information-preserving network for discriminative feature extraction, which addresses the feature-level discrepancy with a partially shared structure. Finally, a novel batch-normalized modality-shared classifier and the homogeneous modality top-push ranking losses are introduced. The overall framework is shown in Fig. 3.
III-A Grayscale-spectrum Image Generation
This subsection presents our grayscale-spectrum feature augmentation strategy, which aims to approximate the style of single-channel infrared images in order to reduce the large modality discrepancy while retaining the complete feature information of the person. Given an RGB image, we first extract each pixel value from the corresponding color channel (red, green, blue) and calculate the grayscale value with a transform function. Then, the grayscale pixel value is assigned to the corresponding position of the grayscale modality image, and all the pixels are traversed once to complete the transformation.
For an RGB image $x$ with red channel $R(x)$, green channel $G(x)$ and blue channel $B(x)$, and a cross-spectrum image generation function $f$, the generation of the corresponding grayscale image $\tilde{x}$ can be denoted as follows:
$$\tilde{x} = f(x) = \alpha_1 R(x) + \alpha_2 G(x) + \alpha_3 B(x),$$
where the values of $\alpha_1$, $\alpha_2$ and $\alpha_3$ are 0.299, 0.587 and 0.114, respectively. Some examples are shown in Fig. 4. We can observe that the generated cross-spectrum images have similar appearances to those captured by an infrared camera. In addition, during the generation process, we do not introduce any agnostic noise that destroys the initial semantic information of the RGB images. This indicates that our generation method can approximate the image style of infrared images well while maintaining the structure information to the greatest extent.
In this work, we utilize the grayscale images to replace visible images for model training. Given visible inputs, our generation method creates grayscale images through a point-to-point transformation. However, these generated images have only a single channel, while the infrared images have three channels. Therefore, we employ a channel expansion strategy that extends the single-channel images to three channels via a simple copy operation, so that all input images have three channels and can be fed into a deep model.
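The pixel-wise conversion and the channel expansion described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the nested-list image layout and the function names `rgb_to_gray` and `expand_to_three_channels` are assumptions made for the example, while the three weights are the ones quoted in the text.

```python
# Weights quoted in the text for the red, green and blue channels.
WEIGHTS = (0.299, 0.587, 0.114)


def rgb_to_gray(image):
    """Convert an H x W image of (r, g, b) tuples to an H x W grid of gray values.

    The nested-list layout is an assumption made for this sketch.
    """
    w_r, w_g, w_b = WEIGHTS
    return [[w_r * r + w_g * g + w_b * b for (r, g, b) in row] for row in image]


def expand_to_three_channels(gray):
    """Copy the single gray channel three times so the result has three channels."""
    return [[(v, v, v) for v in row] for row in gray]


# A tiny 2 x 2 example image: red, green, blue and white pixels.
image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray = rgb_to_gray(image)
gray3 = expand_to_three_channels(gray)
```

Because every pixel is mapped independently, the spatial structure of the visible image is left untouched, which is the property the text relies on.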
III-B Sharable Dual-path Information-preserving Network
The two-stream network is a conventional way to extract features in the VI-ReID task. However, the feature maps learned by a fully two-stream network are inconsistent across modalities due to the distinct data distributions: such a network focuses on alleviating the data gap between cameras at the cost of discriminability, and ignores the inherent correlation between images of the same person from different cameras. Recently, the parameter-sharing problem has been explored to analyze how the number of shared parameters affects cross-modality feature learning. In this paper, we use a similar idea to refine both modality-sharable and modality-specific cues with a parameter-sharing strategy, and introduce a sharable dual-path information-preserving network to extract discriminative common features of the input images. The network contains two paths, a grayscale path and an infrared path, which are designed to share similar network structures. It mainly consists of two steps: modality-specific feature extraction and modality-shared feature embedding. The former focuses on capturing modality-specific low-level feature patterns, and the latter aims to learn common features of the grayscale and infrared modalities.
As shown in Fig. 3, the generated three-channel grayscale images and the infrared images are fed into the sharable dual-path information-preserving network. We use ResNet50 as our backbone network to extract feature maps of the input images. The two shallow convolutional layers (Res-layer 0 and Res-layer 1), which do not share parameters, serve as the feature extraction part and capture modality-specific low-level feature patterns. After that, the parameters of the convolutional layers on top of the feature extraction part are shared, projecting the modality-specific inputs into a common space to learn modality-sharable high-level feature representations.
It is worth mentioning that, compared with the two-stream network proposed in prior work, which utilizes five pre-trained convolutional layers as independent feature extractors for fine-tuning, our improvement mainly lies in the partially shared network structure. This design processes 3D-shaped feature maps instead of the 1D-shaped feature vectors computed by a shared fully-connected layer, which loses rich spatial structure information of the person. To clarify, given an input tensor , the proposed network outputs a 3D common feature map , denoted by:
where represents the modality of the visible or infrared images and is the parameter of the feature extractor . and means features with height and width . In this manner, our feature extraction network preserves the spatial structure information in the common space and keeps cross-view consistency.
During the training process, each mini-batch contains images from both modalities. Our sharable dual-path information-preserving network then extracts 2048-dimensional features from the mini-batch for further feature embedding. During testing, we use the 2048-dimensional features extracted from the pooling layer as the final Re-ID features for images of both modalities.
III-C Dual-Linear with Batch Normalization ID Embedding
The identity embedding (IDE) network is a basic baseline in the person Re-ID task. The last layer of the IDE network, which outputs the ID prediction logits of images, is a fully-connected layer whose number of output units equals the number of person identities. Such a sharable single-linear classifier is widely used in visible-infrared person Re-ID, but this structure causes a severe problem: the shared network is hard to converge with cross-modality ID embedding due to vanishing gradients, resulting in less discriminative cross-modality feature representations. The main reason is that the feature maps from the global pooling are fed directly into a classifier with a single fully-connected layer, causing a loss of detail and internal covariate shift.
To address the above issue, we augment the classifier network by proposing the Dual-Linear with Batch Normalization ID Embedding (DL-IDE) method to further reduce the cross-modality discrepancy in the VI-ReID task. The illustration is shown in Fig. 5. Different from other sharable classifiers, we add a new linear (fully-connected) layer after the pooling layer and a batch normalization layer before the last fully-connected layer of the cross-modality ID embedding. The reason for using two linear layers rather than a single one is that a single-layer model fails to learn an identity-discriminative classifier for two different modalities, which might be because the correlation of the weight vectors in the last FC layer is determined by the training sample distribution. When feature vectors of different modalities with the same identity are used for shared ID embedding, a single-linear classifier has to fit the average of multiple patterns, resulting in less discriminative cross-modality feature representations in the backward propagation learning process. Therefore, in order to make the modality distributions more distinguishable, we utilize a new FC layer to model the subspace projection for modality alignment.
To clarify, we explain the limitation of the single-linear classifier from a theoretical point of view. Generally, the last linear layer (without bias term) calculates the dot product between the feature vector and the weight vector of each class. For each feature vector $x$ and class weight vector $w_i$, the output value is $w_i^{\top} x = \|w_i\| \, \|x\| \cos\theta_i$, where $\theta_i$ is the angle between $w_i$ and $x$. Note that a good classification result mainly depends on the angle between the two vectors and the modulus of the weight vector $w_i$. However, for the neck of the standard classifier, the moduli of the weight vectors should be similar to each other, leading to unsatisfactory performance on some particular classes from different modalities. By contrast, we add a new linear layer (with bias term) to project the feature vectors from different modalities into a more discriminative common feature space. Then, we utilize the last linear layer (without bias term) to produce the classification result. A fully-connected layer with a strong prior bias helps the identity loss in this way.
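The dependence of a bias-free logit on both the angle and the weight modulus can be checked numerically. The sketch below is only an illustration of that decomposition, with hypothetical helper names (`logit`, `norm`, `cosine`) and made-up two-dimensional vectors; it is not part of the proposed classifier.

```python
import math


def logit(w, x):
    """Output of a bias-free linear layer for one class: the dot product w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def norm(v):
    """Euclidean modulus of a vector."""
    return math.sqrt(sum(a * a for a in v))


def cosine(w, x):
    """Cosine of the angle between w and x."""
    return logit(w, x) / (norm(w) * norm(x))


x = [3.0, 4.0]     # a feature vector (illustrative values)
w_a = [1.0, 0.0]   # class weight vector
w_b = [2.0, 0.0]   # same direction as w_a, twice the modulus

logit_a = logit(w_a, x)
logit_b = logit(w_b, x)
```

Both weight vectors make the same angle with `x`, yet `logit_b` is twice `logit_a`, so a class can win purely through a larger weight modulus. This is the sensitivity that the paragraph above attributes to the single-linear classifier.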
Common Space Batch Normalization. In most conventional Re-ID models, the feature vectors from the pooling layer are directly used for identity classification, resulting in poor generalization and slow convergence. Inspired by prior work, we propose to apply a weight-sharing 1-D BN layer on the common feature embedding space to stabilize the magnitudes of the embedding vectors and balance the different losses. For the feature embedding vector $x$, the normalized vector $\hat{x}$ is calculated as:
$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}},$$
where $\mathrm{E}[x]$ denotes the expectation and $\mathrm{Var}[x]$ denotes the statistical variance of $x$.
The trainable parameters $\gamma$ and $\beta$ are the scaling and shift parameters, which ensure that batch normalization can degrade to the identity transformation. Note that is dislodged from the common space batch normalization operator, the final embedding vector can be obtained by:
With the extra scaling and shift parameters in the batch normalization layer, the input of the last fully-connected layer is transformed into a form with zero mean and unit variance, which helps recalibrate the channels of the embedding vectors.
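As a rough illustration of what a 1-D batch normalization layer computes on a mini-batch of embedding vectors, the following plain-Python sketch normalizes each feature dimension with the batch mean and the biased batch variance. The function name `batch_norm_1d`, the small constant `eps` and the default values of `gamma` and `beta` are assumptions for this example (they follow the usual formulation of batch normalization) and are not taken from the paper.

```python
import math


def batch_norm_1d(batch, gamma=None, beta=None, eps=1e-5):
    """Normalize each feature dimension of a batch (list of equal-length vectors).

    gamma and beta are the per-dimension scale and shift; by default they are
    1 and 0, so the output is simply the normalized vector.
    """
    n, dim = len(batch), len(batch[0])
    gamma = gamma if gamma is not None else [1.0] * dim
    beta = beta if beta is not None else [0.0] * dim
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
        for i in range(n):
            out[i][d] = gamma[d] * (column[i] - mean) / math.sqrt(var + eps) + beta[d]
    return out


# Two feature dimensions with very different magnitudes.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_norm_1d(batch)
```

After normalization both dimensions have approximately zero mean and unit variance regardless of their original scale, which is the magnitude-stabilizing effect the text refers to.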
Cross-Modality Sharable Identity Loss. Define the sets of training images by and with identity labels , where represents the grayscale training samples () and denotes the thermal (infrared) training samples (). Note that and are the numbers of grayscale and infrared images in the training set, respectively. In our learning process, we randomly select grayscale and infrared images to construct each batch. Given a grayscale sample with identity label , we use a softmax function to calculate its probability, formulated as:
where represents the feature fed into the dual-linear with batch normalization ID embedding module. is the weight matrix of the last fully-connected layer in our classifier for the th identity. is the total number of identities.
Similarly, for an infrared sample with identity label , we have:
With the calculated probabilities, the cross-entropy loss is used to optimize the dual-linear with batch normalization ID embedding, denoted by:
where denote the features extracted from grayscale and infrared images and represents the number of grayscale (infrared) samples in each training batch.
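The softmax probability and the cross-entropy identity loss over the two modalities can be sketched as follows. This is a minimal sketch under stated assumptions: the logits are taken as given (in the text they come from the last fully-connected layer), the two modality terms are simply added, and the names `softmax`, `identity_loss` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identity_loss(logits_batch, labels):
    """Mean cross-entropy between softmax probabilities and identity labels."""
    losses = [-math.log(softmax(logits)[y]) for logits, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


# Toy logits for a batch of two grayscale and two infrared samples, three identities.
gray_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
gray_labels = [0, 1]
ir_logits = [[1.8, 0.4, 0.2], [0.1, 0.3, 2.2]]
ir_labels = [0, 2]

# Assumption of this sketch: the shared classifier is supervised by the sum
# of the grayscale term and the infrared term.
loss_id = identity_loss(gray_logits, gray_labels) + identity_loss(ir_logits, ir_labels)
```

Because the same classifier weights produce the logits for both modalities, minimizing this sum pushes grayscale and infrared features of one identity toward the same class direction.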
III-D Learning with Bi-directional Tri-constrained Top-Push Ranking Loss
We propose a novel bi-directional tri-constrained top-push ranking loss to guide the feature learning process. It contains three optimization objectives, namely the cross-modality, intra-modality and inter-modality regularization constraints. Our solution is an improved version of prior work that introduced the bi-directional dual-constrained top-ranking loss, which includes cross-modality and intra-modality regularizers. In what follows, we first revisit the bi-directional ranking loss and then derive our proposed bi-directional tri-constrained top-push ranking loss.
III-D1 Revisit Bi-directional Ranking Loss
The bi-directional ranking loss is designed for the cross-modality Re-ID task and considers two kinds of relationships: visible to infrared (the anchor is sampled from the RGB domain, and the positive and negative samples are drawn from the infrared domain) and infrared to visible (the anchor is sampled from the infrared domain, and the positive and negative samples are drawn from the RGB domain). Given a mini-batch that contains visible images and infrared images, the training samples from the two modalities can be defined as and with corresponding labels and . The bi-directional ranking loss, which includes a visible-anchor-based term and an infrared-anchor-based term, can be denoted by:
where is a margin parameter. The subscripts and mean the same label, while and are different labels.
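A bi-directional ranking loss of the kind revisited above can be sketched as a sum of margin-based triplet terms in both directions. The exact formula is not reproduced here, so this is a sketch under assumptions: every valid (anchor, positive, negative) triplet is summed without hard mining, the distance is Euclidean, and the names `dist`, `directional_ranking_loss` and `bidirectional_ranking_loss` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def directional_ranking_loss(anchors, anchor_ids, others, other_ids, margin):
    """Sum of hinge terms over all triplets whose anchor comes from one modality
    and whose positive and negative samples come from the other modality."""
    total = 0.0
    for a, ya in zip(anchors, anchor_ids):
        for p, yp in zip(others, other_ids):
            if yp != ya:
                continue  # positive must share the anchor's identity
            for n, yn in zip(others, other_ids):
                if yn == ya:
                    continue  # negative must have a different identity
                total += max(margin + dist(a, p) - dist(a, n), 0.0)
    return total


def bidirectional_ranking_loss(vis, vis_ids, ir, ir_ids, margin=0.5):
    """Visible-anchor term plus infrared-anchor term."""
    return (directional_ranking_loss(vis, vis_ids, ir, ir_ids, margin)
            + directional_ranking_loss(ir, ir_ids, vis, vis_ids, margin))


vis = [[0.0, 0.0], [10.0, 0.0]]
ir = [[0.0, 1.0], [10.0, 1.0]]
loss_aligned = bidirectional_ranking_loss(vis, [0, 1], ir, [0, 1])
loss_confused = bidirectional_ranking_loss(vis, [0, 1], ir, [1, 0])
```

When cross-modality positives are already closer than negatives by more than the margin, every hinge term vanishes; swapping the infrared labels makes the loss positive, which is the signal that drives the embedding.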
III-D2 Bi-directional Tri-constrained Top-Push Ranking Loss
To adapt to the cross-modality and intra-modality feature discrepancies, we improve the metric learning method by proposing a strong bi-directional tri-constrained top-push ranking loss for the cross-modality Re-ID task. We employ a three-fold triplet loss to optimize the network, i.e., the cross-modality top-push ranking loss, the intra-modality top-push ranking loss and the inter-modality top-push ranking loss. Fig. 6 exhibits the relationship among heterogeneous images (the grayscale modality and the infrared modality in this work).
Cross-Modality Top-Push Ranking Loss. Considering the grayscale-infrared cross-modality matching protocol during testing, a cross-modality top-push ranking loss is first designed to enhance the feature discriminability with positive grayscale-infrared pairs and negative infrared-grayscale pairs. We use the formula to represent the Euclidean distance between two samples and , where the superscript means the grayscale modality, means the thermal (infrared) modality and the subscripts represent the image index. The cross-modality top-push ranking loss can be written as:
where represents the function that guarantees the non-negativity constraint, is a pre-defined margin, and means the number of infrared or grayscale inputs in each training batch.
Intra-Modality Top-Push Ranking Loss. VI-ReID also suffers from intra-class variations caused by distinct viewpoints, pose variations and scale changes. To address this problem, another top-push ranking loss is employed for each modality to reduce the feature distances between images of the same person and enlarge the distances between images of different people. Based on the cross-modality constrained loss, the intra-modality top-push ranking loss for grayscale and infrared modalities is represented by:
The intra-modality top-push ranking loss ensures that the closest intra-modality negative sample should be far from the farthest intra-modality positive sample.
Inter-Modality Top-Push Ranking Loss. To further minimize the cross-modality variations, we add a novel regularizer to the bi-directional tri-constrained top-push ranking loss. It explicitly considers the distance relationship between cross-modality and intra-modality pairs and ensures that the learnt feature is modality-invariant for metric learning, which is formulated by:
For a grayscale anchor sample , note that we choose the farthest grayscale-infrared positive pair and the closest grayscale-grayscale negative pair to formulate a mined informative triplet . This fully utilizes the cross-modality triplet-wise relationship and improves the robustness against modality variations.
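The three constraints differ only in where the hardest positive and the hardest negative are drawn from, which the following sketch makes explicit. It is an illustration under assumptions rather than the authors' formulation: each anchor uses its farthest positive and closest negative, the infrared-anchor direction of the inter-modality term is assumed to mirror the grayscale one described above, a single hypothetical `margin` is used for all terms, and every identity is assumed to appear in both modalities of the batch.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def top_push(anchors, anchor_ids, pos_pool, pos_ids, neg_pool, neg_ids, margin):
    """Mean hinge loss using, per anchor, the farthest positive in pos_pool and
    the closest negative in neg_pool (hard mining)."""
    total = 0.0
    for a, ya in zip(anchors, anchor_ids):
        hardest_pos = max(dist(a, p) for p, yp in zip(pos_pool, pos_ids) if yp == ya)
        hardest_neg = min(dist(a, n) for n, yn in zip(neg_pool, neg_ids) if yn != ya)
        total += max(margin + hardest_pos - hardest_neg, 0.0)
    return total / len(anchors)


def bttr_terms(gray, g_ids, ir, t_ids, margin=0.3):
    """Return the (cross, intra, inter) terms, each summed over both directions."""
    # Cross-modality: positives and negatives both come from the other modality.
    cross = (top_push(gray, g_ids, ir, t_ids, ir, t_ids, margin)
             + top_push(ir, t_ids, gray, g_ids, gray, g_ids, margin))
    # Intra-modality: positives and negatives both come from the anchor's modality.
    intra = (top_push(gray, g_ids, gray, g_ids, gray, g_ids, margin)
             + top_push(ir, t_ids, ir, t_ids, ir, t_ids, margin))
    # Inter-modality: positives from the other modality, negatives from the anchor's own.
    inter = (top_push(gray, g_ids, ir, t_ids, gray, g_ids, margin)
             + top_push(ir, t_ids, gray, g_ids, ir, t_ids, margin))
    return cross, intra, inter


gray = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
ir = [[1.0, 0.0], [1.0, 2.0], [11.0, 0.0], [11.0, 2.0]]
separated = bttr_terms(gray, [0, 0, 1, 1], ir, [0, 0, 1, 1])
confused = bttr_terms(gray, [0, 0, 1, 1], ir, [1, 1, 0, 0])
```

In the `confused` case the infrared labels are swapped, so the cross-modality and inter-modality terms become positive while the intra-modality term stays at zero, showing that the inter-modality constraint reacts to cross-modality misalignment that the intra-modality constraint cannot see.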
Discussion of Dual-Constrained Top-Ranking and Tri-Constrained Top-Push Ranking. The triplet constraint is usually utilised in the classical ranking scheme. The dual-constrained top-ranking constraints contain a cross-modality triplet regularizer and an intra-modality triplet regularizer, which connect embedding vectors from different domains and reduce the discrepancy between different modalities. However, in this situation, inter-class feature differences are still ambiguous in the cross-modality setting.
In comparison, the proposed tri-constrained top-push ranking loss adds an inter-modality regularizer that forces the intra-class difference to be smaller than the inter-class difference regardless of which modality a sample comes from. This modification has two major advantages: (a) it further mitigates the data biases across modalities in the common space, and (b) it keeps the powerful feature representation ability as well as the discriminative ability of the top-push ranking constraint.
Overall Embedding Loss. Finally, the optimization of the proposed grayscale-infrared feature learning framework is defined by the combination of cross-modality sharable identity loss () and bi-directional tri-constrained top-push ranking loss (, , ):
where and are the predefined tradeoff parameters to balance the different losses.
The identity loss () optimizes the network parameters with identity supervision, which encourages identity-invariant feature representations. The bi-directional tri-constrained top-push ranking loss (BTTR) provides strong supervision to optimize the bi-directional relationship among different person images across the two modalities.
In this section, we evaluate the effectiveness of our proposed approach, including a description of the experimental settings, an ablation study of each component and a comparison with state-of-the-art methods. Experimental results are reported to answer two questions: 1) how do the components contribute to the performance? 2) what is the effect of our proposed framework when compared with peer state-of-the-art methods?
IV-A Experimental Settings
SYSU-MM01. This dataset was captured by six disjoint cameras (four general RGB and two near-infrared) in both indoor and outdoor environments and contains a total of 491 identities with 287,628 visible images and 15,792 infrared images. The training set contains images of 395 persons with 22,258 visible images and 11,909 infrared images, and the testing set contains 96 persons. Following Wu et al., we employ two test modes in our evaluation protocol: the all-search mode with all images and the indoor-search mode with only indoor images from cameras 1, 2, 3, and 6.
RegDB. This dataset was collected by dual camera systems (with optical and thermal sensors) and contains a total of 412 identities. Each identity has 10 visible images and 10 infrared images. We follow the existing evaluation protocol and randomly divide the dataset into two halves, one half for training and the other half for testing. The training subset consists of 2,060 visible images and 2,060 infrared images, and the same applies to the testing set. We evaluate the performance with the query setting of visible (query) to thermal (gallery).
Evaluation Protocol. In the evaluation stage, we utilize the Cumulative Matching Characteristics (CMC) curve and the mean Average Precision (mAP) as criteria to evaluate our proposed method. CMC is adopted to report rank-n accuracy, which denotes the probability that a correct match of the query appears among the top n entries of the ranked list. The mAP metric is the mean, over all queries, of the average precision of the ranked list, and thus reflects the overall ranking accuracy. Note that there is a slight difference from the single-modality Re-ID problem: in VI-ReID, images of one modality act as the query set while images from the other modality act as the gallery.
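Both criteria can be computed from a query-by-gallery distance matrix, as in the minimal sketch below. It is a simplified illustration: the function names `average_precision` and `evaluate` and the toy distances are hypothetical, and dataset-specific rules of the real protocols (for example camera-based filtering of gallery images) are deliberately left out.

```python
def average_precision(ranked_matches):
    """Average precision for one query, given booleans ordered by ascending distance."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(dist_matrix, query_ids, gallery_ids, ranks=(1, 5)):
    """Return ({k: rank-k accuracy}, mAP) for a query x gallery distance matrix."""
    cmc_hits = {k: 0 for k in ranks}
    aps = []
    for q, row in enumerate(dist_matrix):
        order = sorted(range(len(row)), key=lambda j: row[j])
        matches = [gallery_ids[j] == query_ids[q] for j in order]
        if True in matches:
            first_hit = matches.index(True) + 1  # 1-based rank of first correct match
            for k in ranks:
                if first_hit <= k:
                    cmc_hits[k] += 1
        aps.append(average_precision(matches))
    n = len(dist_matrix)
    return {k: cmc_hits[k] / n for k in ranks}, sum(aps) / n


# Two queries (one modality) against three gallery images (the other modality).
dist_matrix = [[0.1, 0.5, 0.3],
               [0.2, 0.9, 0.4]]
query_ids = [1, 2]
gallery_ids = [1, 2, 1]
cmc, mean_ap = evaluate(dist_matrix, query_ids, gallery_ids)
```

In this toy case the first query is ranked perfectly while the second finds its only match at rank 3, so rank-1 accuracy is 0.5 and the mAP is the mean of 1 and 1/3.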
|Modes||Rank-1 (All Search)||Rank-n||Rank-n||Rank-n||mAP (All Search)||Rank-1 (Indoor Search)||Rank-n||Rank-n||Rank-n||mAP (Indoor Search)|
|SFANet (RGB) with Baseline||38.81||70.02||80.65||89.32||35.30||44.21||76.19||87.13||93.93||50.20|
|SFANet (RGB) + DL-IDE||48.96||75.89||83.99||92.96||48.23||53.59||81.08||90.18||96.60||59.78|
|SFANet (RGB) + BTTR||45.96||72.89||81.99||89.96||46.23||51.59||79.08||89.18||96.60||57.78|
|SFANet (RGB) + DL-IDE + BTTR||52.56||77.54||85.77||92.11||50.93||54.17||82.17||90.17||94.66||61.84|
|SFANet (Gray) with Baseline||43.70||69.50||79.15||87.06||40.36||50.05||81.07||90.81||96.24||56.66|
|SFANet (Gray) + DL-IDE||60.45||84.01||91.80||95.16||53.87||66.57||89.15||93.47||98.35||75.95|
|SFANet (Gray) + BTTR||58.35||82.83||90.24||94.85||53.00||64.80||88.69||94.67||98.07||75.16|
|SFANet (Gray) + DL-IDE + BTTR||65.74||87.93||92.98||97.05||60.83||71.60||91.08||96.60||99.45||80.05|
We implement our proposed method with the PyTorch framework and use an NVIDIA TITAN RTX graphics card for acceleration. Following most existing person Re-ID works, ResNet-50 is used as the backbone for feature extraction. All input images are first resized to 288 × 144. We initialize the convolutional blocks with weights pre-trained on ImageNet. The dimensions of the last classification layer are 395 for SYSU-MM01 and 206 for RegDB, respectively. The training samples are augmented with two methods, random horizontal flipping and random cropping. The total number of training epochs is 80, and the batch size is set to 8. We start training with a learning rate of 0.01 and linearly increase it to 0.1 over the first 10 epochs, then keep this value until epoch 20. In the following 60 epochs, the learning rate is set to 0.01 for the first 30 epochs and 0.001 for the remaining 30 epochs. We adopt the SGD optimizer with a weight decay of 5and a momentum of to update the parameters of the network. We set the margin parameter to 0.5 in Eq. 10, in Eq. 9 to 0.1 and in Eq. 11 to 0.3. The tradeoff parameter is set to 0.1 and is set to 0.5. In testing, the feature distance is calculated with the Euclidean metric.
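The learning-rate schedule described above can be written as a small function of the epoch index. This is a sketch under assumptions: epochs are indexed from 0, the warm-up is applied once per epoch rather than per iteration, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch):
    """Learning rate for a 0-indexed epoch in an 80-epoch run.

    Epochs 0-9:   linear warm-up from 0.01 towards 0.1
    Epochs 10-19: constant 0.1
    Epochs 20-49: constant 0.01
    Epochs 50-79: constant 0.001
    """
    if epoch < 10:
        return 0.01 + (0.1 - 0.01) * epoch / 10
    if epoch < 20:
        return 0.1
    if epoch < 50:
        return 0.01
    return 0.001


schedule = [learning_rate(e) for e in range(80)]
```

The resulting list can be inspected or plotted to confirm the warm-up, plateau and two decay stages before wiring the same rule into an optimizer.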
IV-B Ablation Study
In this subsection, we evaluate the effectiveness of each component of our proposed SFANet. We select the all-search and indoor-search modes of SYSU-MM01 for the ablation study. Specifically, ‘SFANet (RGB) with Baseline’ denotes the baseline results obtained by feeding the original visible and infrared images into the network, which is trained with the standard identity loss and triplet loss. ‘SFANet (Gray) with Baseline’ denotes the results obtained with the converted grayscale images and the 3-channel infrared images as network inputs, with all other settings the same as in the ‘RGB’ case. ‘DL-IDE’ denotes the results obtained with the dual-linear with batch normalization ID embedding method. ‘BTTR’ denotes the bi-directional tri-constrained top-push ranking loss with hard mining.
|Method||All Search Rank-1||All Search mAP||Indoor Search Rank-1||Indoor Search mAP|
|Baseline + BN||54.96||51.95||59.74||64.91|
|Baseline + Dual-Linear||46.14||42.08||54.23||59.01|
|Baseline + Dual-Linear + BN||60.45||53.87||66.57||75.95|
Effectiveness of Grayscale-Spectrum Images. Our solution is based on a grayscale-infrared feature learning method. To show the effect of this strategy, we compare it with three GAN-based solutions in Table I. Specifically, ‘RGB-GAN’ means exploiting a GAN to generate RGB images from the infrared modality. ‘GAN-Infrared’ denotes generating infrared images from the RGB modality. ‘GAN-GAN’ represents cross-modality paired-image generation. For experimental fairness, our grayscale-infrared learning method is conducted on a one-stream network under the supervision of a standard identity loss and a triplet loss, which is similar to the other three methods. From Table II, we can observe that our grayscale-spectrum feature augmentation strategy achieves a significant performance improvement over the GAN-based methods in both the all-search and indoor-search modes.
In addition, to fully present the effectiveness of the VI-ReID solution based on spectrum-aware feature augmentation, we also conduct an ablation experiment to see how the performance changes if we use the conventional RGB-infrared learning strategy. The results are reported in Table II. As we can see in Table II, without grayscale-spectrum image augmentation, the result in the first row is inferior, with a rank-1 accuracy of 38.81 and an mAP score of 35.30. In comparison, our grayscale augmentation method presented in the fifth row achieves a rank-1 accuracy of 43.70 and an mAP score of 40.36, which improves the performance by a large margin.
Effectiveness of Dual-Linear with Batch Normalization ID Embedding Method. To better understand the properties of the dual-linear with batch normalization ID embedding method, we study the effectiveness of each of its components. Here, ‘Baseline’ means that only one fully-connected layer is used for classification and no triplet loss is used for optimization, ‘BN’ denotes the batch normalization operation, and ‘Dual-Linear’ means that the classifier contains two fully-connected layers. As shown in Table III, applying batch normalization to the baseline yields competitive performance, improving the baseline by 9 to 11 points. This is because ‘BN’ reshapes the distribution of the embedding vectors and smooths the optimization landscape. Adding a new fully-connected layer, which projects the Re-ID feature vectors into a more discriminative classification space, increases rank-1 and mAP by 3 to 4 points. It is worth pointing out that combining the ‘BN’ and ‘Dual-Linear’ strategies further improves the performance to 60.45 (rank-1) and 53.87 (mAP).
Effectiveness of Bi-directional Tri-constrained Top-Push Ranking Loss. We also design a series of experiments based on our proposed dual-path network to test the effectiveness of BTTR. Since our bi-directional tri-constrained top-push ranking loss contains three parts, we test the influence of each component separately. Note that we compare the performance obtained with the combined optimized regularization terms against the cases in which each regularizer is used individually. The results are reported in Table IV. From the table, we can observe that the cross-modality term improves the performance of the model by 2 to 3 points, which verifies that the cross-modality constraint can effectively handle the variations between the two modalities. With the intra-modality or the inter-modality term, the performance improves slightly, by 1 to 2 points. This observation demonstrates that the intra-modality and inter-modality top-push ranking losses provide relatively weaker constraints for the cross-modality metric, but can still improve feature learning consistently.
Another observation is that the proposed BTTR achieves a significant improvement over BDTR, since the newly introduced inter-modality regularizer considers the distance relationship between cross-modality and intra-modality samples. Furthermore, it is worth pointing out that when we combine the three optimized regularization terms (BTTR), rank-1 accuracy and mAP increase from 60.45 and 53.87 to 65.74 (+5.29) and 60.83 (+6.96), which improves the performance by a large margin.
Hyperparameter analysis. 1) Weighting Parameters: We analyze the contributions of the two weighting parameters on the SYSU-MM01 dataset. Specifically, the first controls the contribution of the intra-modality regularization and the second controls the contribution of the inter-modality regularization. We vary their values from 0 to 1 and evaluate the resulting performance on SYSU-MM01. From Fig. 8, we can observe that the Re-ID performance rises as both parameters increase and that the best performance is achieved with values of 0.1 and 0.5, respectively. Empirically, to balance each part of the objective, we set the intra-modality weight to 0.1 and the inter-modality weight to 0.5 when using the BTTR loss.
2) Margin Parameters: We also evaluate the three margin parameters of the corresponding loss terms. Fig. 8 shows the performance changes under two evaluation metrics (rank-1 and mAP) on the SYSU-MM01 dataset. Following the literature, we empirically assign a small value of 0.1 to the first margin and set the second to 0.5 in our method. The results shown in Fig. 7(a-b) also verify the rationality of this setting. In this situation, from Fig. 7(c), we can observe that the performance fluctuates with the value of the third margin and that the model achieves peak performance when it is equal to 0.9. This is because the associated term considers the distance relationship between cross-modality and intra-modality samples. When we select a positive sample from the other modality and a negative sample from the same modality, the distances of the positive and negative sample pairs usually become very close. Therefore, a relatively large value for this margin is beneficial for fast convergence of the network. In addition, this experiment also demonstrates that a suitable margin parameter is significant for the cross-modality person re-identification task.
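The sample selection described above (a positive from the other modality, a negative from the anchor's own modality) can be illustrated with a generic hinge term with hard mining. This is only a sketch under stated assumptions: the exact equations of the loss are not reproduced in this section, so the hinge form, the hardest-sample selection rule, and the names `euclidean` and `inter_modality_term` are illustrative rather than the authors' definition.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inter_modality_term(anchor, anchor_id, same_modality, other_modality, margin):
    """Hinge term for one anchor (illustrative form, not the paper's equation).

    same_modality / other_modality: lists of (feature, identity) pairs.
    Positives are taken from the other modality, negatives from the
    anchor's own modality, as described in the text.
    """
    pos = [euclidean(anchor, f) for f, pid in other_modality if pid == anchor_id]
    neg = [euclidean(anchor, f) for f, pid in same_modality if pid != anchor_id]
    if not pos or not neg:
        return 0.0  # no valid triplet for this anchor
    # Hard mining: farthest positive against nearest negative.
    return max(0.0, margin + max(pos) - min(neg))
```

When the two distances are close, as the text notes is common for this pairing, the hinge stays active for any positive margin, and a larger margin produces a larger penalty.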
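|Method||Venue||All Search rank-1||All Search rank-10||All Search rank-20||All Search mAP||Indoor Search rank-1||Indoor Search rank-10||Indoor Search rank-20||Indoor Search mAP|
|MAC||ACM MM2019||33.26||79.04||90.09||36.22||36.43||62.36||71.63||37.03|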
IV-C Comparison with State-of-the-Arts
We compare the performance of our proposed approach (SFANet) with other state-of-the-art VI-ReID methods on SYSU-MM01 and RegDB in Table V and Table VI, respectively. A variety of existing VI-ReID models are included for comparison, including the up-to-date approaches X-Modal, eDBTR, MSR, CosiGAN, MACE, DDAG, and cm-SSFT, all published in 2020. In addition, some classic learning-based VI-ReID methods are also used for comparison, such as Zero-Pad, HSME, cmGAN, DRL, HPILN, LZM and AlignG.
1) Comparisons on SYSU-MM01 Dataset: We report the comparison with state-of-the-art approaches on the SYSU-MM01 dataset in Table V. Our proposed method achieves strong recognition accuracy in the cross-modality matching task: a rank-1 accuracy of 65.74, which is 10.99 absolute points higher than the recent DDAG algorithm. We also achieve an mAP of 60.83, outperforming most state-of-the-art methods but falling short of the cross-modality shared-specific feature transfer (cm-SSFT) method. This is because cm-SSFT utilizes both shared and specific information for each sample, whereas our feature learning method with its parameter sharing strategy abandons some modality-specific cues in order to obtain more complete spatial structure information of the person in the feature extraction stage, which costs a little performance. Compared to GAN-based methods such as AlignG and DRL, our method achieves much higher accuracy under both the CMC and mAP evaluation protocols and does not need to train an extra GAN model at high computational cost. In addition, unlike HCML, which jointly optimizes the modality-specific and modality-shared metrics with human intervention, our proposed method can be trained end-to-end without interruption.
2) Comparisons on RegDB Dataset: We also report the comparison results on the RegDB dataset in Table VI. Our method achieves outstanding recognition performance in both query settings. We outperform the best rank-1 accuracy by 6.25 and the best mAP by 2.27 (CDP) in the thermal-to-visible setting. Moreover, we also improve on the performance of the DDAG algorithm in the visible-to-thermal setting, from 69.34 to 76.31 in rank-1 accuracy and from 63.46 to 68.00 in mAP. This experiment demonstrates that our proposed method can alleviate the impact of severe modality discrepancy and learn discriminative feature representations by utilizing the grayscale-spectrum augmentation strategy.
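For readers less familiar with the two reported metrics, the sketch below shows how a rank-k hit (one query's contribution to the CMC curve) and average precision (one query's contribution to mAP) can be computed from a Euclidean ranking. It is a simplified illustration: the camera-based filtering rules of the SYSU-MM01 and RegDB protocols are omitted, and the helper names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranked_identities(query, gallery, gallery_ids):
    """Gallery identities sorted by increasing distance to the query."""
    order = sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))
    return [gallery_ids[i] for i in order]


def rank_k_hit(ranked_ids, query_id, k):
    """1.0 if a correct match appears within the top k results, else 0.0."""
    return 1.0 if query_id in ranked_ids[:k] else 0.0


def average_precision(ranked_ids, query_id):
    """Mean of the precision values at the position of each correct match."""
    hits, precisions = 0, []
    for position, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / hits if hits else 0.0
```

Averaging `rank_k_hit` over all queries gives the rank-k accuracy, and averaging `average_precision` over all queries gives mAP.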
V-A Analysis of feature augmentation strategy via grayscale-spectrum images.
For the VI-ReID problem, previous methods adopt an RGB-infrared feature learning strategy to eliminate the appearance and modality discrepancies. However, it is difficult to match the same person across modalities in a unified space, because visible and infrared images have an apparent cross-modality discrepancy and mapping an RGB image to another heterogeneous modality may cause loss of channel information.
Some GAN-based methods try to generate a modality-invariant representation to eliminate the modality discrepancy at the pixel level. However, the generated images are not of sufficient quality to fill the gap between the synthesized scenarios and the real target ones. In Fig. 9, we visualize the results of using CycleGAN to synthesize fake infrared images. It can be observed that the synthetic images suffer from three main issues: image indistinctness, distorted body structure and additional noise. If we directly use these low-quality generated images to train a Re-ID model, a new gap between the original data and the synthetic data is introduced into the learning process. In other words, although GAN-based methods reduce the modality discrepancy at the pixel level, they also enlarge the intra-person discrepancy, which increases the difficulty of feature similarity matching.
In comparison, the proposed method, which exploits grayscale-spectrum images to replace RGB images for feature learning, is an effective solution to this issue. It has several merits. First, grayscale images preserve well the key structural information, complete textures and semantic cues of visible images. These are critical for improving VI-ReID performance, since infrared images do not contain any visible information. Second, the grayscale modality approximates the style of infrared images to the greatest extent. As a result, we can regard the subtle modality discrepancy between grayscale and infrared images as part of the appearance discrepancy, which can be eliminated using only feature-level constraints. From Table II, it can be clearly seen that our grayscale-infrared learning method brings a large performance improvement over the RGB-infrared training strategy. This experiment also provides a useful suggestion for future VI-ReID research.
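As an illustration of the grayscale replacement discussed above, the sketch below converts an RGB image to a grayscale image replicated over three channels. The luma weights (0.299, 0.587, 0.114, the common ITU-R BT.601 choice) and the three-channel replication are assumptions made for this example; the text does not state which conversion is used.

```python
def to_three_channel_gray(pixel):
    """Map one (R, G, B) pixel to a grayscale value repeated on three channels.

    Assumption: BT.601 luma weights; the source does not specify them.
    """
    r, g, b = pixel
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)


def convert_image(image):
    """Convert an image given as rows of (R, G, B) tuples with values in 0..255."""
    return [[to_three_channel_gray(p) for p in row] for row in image]
```

Because the output keeps three channels, it can be fed to a backbone pre-trained on RGB images without changing the first convolution.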
|Method||All Search rank-1||All Search mAP||Indoor Search rank-1||Indoor Search mAP|
|Two-stream (None shared)||45.23||42.43||52.53||58.68|
|One-stream (Fully shared)||62.66||56.22||66.50||75.30|
|MSTN (Four-layers shared)||63.17||56.29||67.62||76.11|
|Ours (Three-layers shared)||65.74||60.83||71.60||80.05|
V-B Analysis of the sharable dual-path information-preserving architecture.
In the VI-ReID community, almost all algorithms are built on two types of network architecture, namely the one-stream network and the two-stream network. In what follows, we describe the characteristics of these architectures and then analyse the improvement brought by our proposed sharable dual-path information-preserving architecture.
In the widely used two-stream structure, all network parameters in the convolution blocks are optimized separately to capture modality-specific information, and a fully connected layer then projects these modality-specific feature vectors into a common space. However, in our opinion, this design depends too heavily on high-level sharable features in the final embedding layers and ignores the person spatial structure information, which is crucial for describing a person. The one-stream network, in which the shared structures are completely identical, may implicitly learn modality-specific and shared information, but it fails to capture the low-level visual patterns of each modality and therefore misses some critical cues for similarity measurement. In comparison, our sharable dual-path information-preserving structure adopts two convolutional layers to learn modality-specific feature maps, while the remaining three convolutional layers are shared between the two modalities to capture modality-shared information for feature embedding. This design effectively learns 3D-shaped middle-level sharable features while preserving person spatial structure information, which helps the model learn discriminative feature representations for cross-modality matching.
We conduct two groups of VI-ReID experiments that apply our method to the standard two-stream and one-stream networks. In addition, we also use the classic parameter sharing work MSTN for comparison. The results are shown in Table VII. We observe that the sharable dual-path information-preserving structure is superior to the two-stream baseline network. This is because the partially shared convolutional blocks simultaneously capture specific and shared feature representations rather than optimizing them independently. Moreover, thanks to the parameter sharing parts, the person spatial structure information is preserved in the common space. Another observation is that the one-stream and MSTN backbones also consistently improve the performance. However, the improvement is not as significant as that of the sharable dual-path information-preserving architecture, which improves on the second-best MSTN network by 2.57 in rank-1 accuracy and 4.54 in mAP. This experiment demonstrates that a modality-shared feature embedding network with independent parameters in two blocks is more suitable than the other choices for cross-modality modelling.
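The layer-sharing layout described above (two modality-specific stages followed by three shared stages) can be pictured with a toy model. This is purely a structural sketch: `Stage` is a hypothetical stand-in that applies a scalar affine map instead of a convolutional block, and all numeric values are arbitrary.

```python
class Stage:
    """Toy stand-in for one convolutional block (scalar affine map)."""

    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def __call__(self, x):
        return [self.scale * v + self.shift for v in x]


class DualPathNet:
    """Two modality-specific stages per input path, then three shared stages."""

    def __init__(self):
        # Separate parameters for each modality in the first two stages.
        self.specific = {
            "visible": [Stage(1.0, 0.1), Stage(0.9, 0.0)],
            "infrared": [Stage(1.2, -0.1), Stage(1.1, 0.0)],
        }
        # The same three stage objects are reused by both paths.
        self.shared = [Stage(0.5, 0.0), Stage(2.0, 1.0), Stage(1.0, -0.5)]

    def forward(self, x, modality):
        for stage in self.specific[modality] + self.shared:
            x = stage(x)
        return x
```

A fully shared one-stream network would correspond to an empty `specific` list, and a two-stream network with no sharing to an empty `shared` list before the embedding layer.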
V-C Analysis of the Batch Normalization Operation.
Previous work demonstrated that combining the identity loss and the triplet loss can make the network learn better feature representations for the Re-ID task. However, we find that the targets of these two losses are often inconsistent in the embedding space. That is, the gradient directions of the two tasks may be inconsistent during iterations. The main reason is that the identity loss constructs several hyperplanes to separate the embedding space into different subspaces, so cosine distance rather than Euclidean distance is usually the more suitable measure for a feature optimized with an identity loss. On the contrary, the triplet loss, which is based on a distance optimization target, is computed with Euclidean distance to enhance intra-class compactness and inter-class separability in Euclidean space. If we use both losses to optimize a feature space simultaneously, one loss decreases steadily while the other first increases and then decreases at some stage.
Previous literature points out that the BN layer can alleviate overfitting and boost the performance of the IDE baseline. In our opinion, the BN layer can also smooth the feature distribution in the embedding space. Let the feature before the BN layer be used to calculate the triplet loss and the feature after the BN layer be used to calculate the identity loss; the batch normalization layer, with its extra scaling and shift parameters, helps recalibrate the channels of these embedding vectors. As a result, during back-propagation, the feature before the BN layer not only keeps a compact distribution but also acquires ID knowledge from the ID loss, while the feature after batch normalization has clear decision surfaces due to the weaker influence of the triplet loss. We report the results with and without the BN layer in Fig. 10. It can be seen that adding a BN layer clearly suppresses the inconsistency, smooths the loss curve and achieves better optimization performance.
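The role of the BN layer described above can be sketched in a few lines: the same embedding is used twice, once before normalization and once after. The assignment of the pre-BN feature to the ranking loss and the post-BN feature to the identity classifier is an assumption of this example, as are the fixed scale and shift values and the helper names; in training, the scale and shift would be learned.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each feature dimension over the batch, then scale and shift."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][d] = gamma * (column[i] - mean) / math.sqrt(var + eps) + beta
    return out


def split_features(batch):
    """Return (pre-BN feature, post-BN feature) for a batch of embeddings."""
    feat_ranking = batch              # assumed input of the Euclidean ranking loss
    feat_identity = batch_norm(batch)  # assumed input of the identity classifier
    return feat_ranking, feat_identity
```

After normalization every dimension has zero mean and roughly unit variance over the batch, so dimensions with very different scales contribute comparably to the classifier.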
|Method||All Search rank-1||All Search mAP||Indoor Search rank-1||Indoor Search mAP|
|Triplet (Weighted)||59.87||53.16||65.74||74.41|
|Soft Margin||61.23||55.17||68.12||76.82|
V-D Comparison with Other Triplet Variants.
We conduct experiments to compare the bi-directional tri-constrained top-push ranking loss with different variants of triplet training. To keep this exploration tractable, we compare with other representative triplet variants, namely the soft margin, weighted regularized triplet and lifted embedding losses. The baseline is grayscale-infrared learning with only the identity loss. As shown in Table VIII, consistent improvements are obtained by integrating any of the triplet re-formulations, each of which provides relative distance optimization. Our BTTR loss with the hard mining strategy achieves the highest performance because it explicitly optimizes hard samples for cross-modality and intra-modality relationships in different views, which ensures the discriminability of the learnt feature representation.
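To make the difference between a fixed-margin triplet and the soft-margin variant mentioned above concrete, the sketch below evaluates both on a pair of anchor-positive and anchor-negative distances. These are the commonly used textbook forms, given here as an assumption; the exact formulations compared in Table VIII are not spelled out in this section.

```python
import math


def hard_margin_triplet(d_ap, d_an, margin):
    """Hinge form: zero once the negative is farther than the positive by `margin`."""
    return max(0.0, margin + d_ap - d_an)


def soft_margin_triplet(d_ap, d_an):
    """Softplus form: smooth, margin-free, and never exactly zero."""
    return math.log1p(math.exp(d_ap - d_an))
```

The hinge form stops producing gradient once the margin is satisfied, whereas the softplus form keeps pulling positives closer with an exponentially decaying penalty.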
V-E Batch sampling strategy
In this paper, our batch sampling method is based on a strategy that randomly selects person identities and their corresponding images, with some modifications for optimizing the bi-directional tri-constrained top-push ranking (BTTR) loss. Specifically, at each training iteration, as many person identities are randomly selected as the batch size. Since our network contains two input streams that come from different modalities, we randomly sample one visible image and one infrared image for each selected identity to form the mini-batch. In this manner, twice as many images as the batch size are obtained for network training. With this random sampling strategy, each image is paired with randomly chosen, non-overlapping identities, so that over the course of training all possible combinations can be traversed in pursuit of the global optimum.
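The sampling procedure described above can be sketched as follows. The data layout (one dictionary per modality, mapping an identity to its image paths) and the function name `sample_batch` are assumptions made for illustration.

```python
import random


def sample_batch(visible, infrared, num_ids, rng=random):
    """Draw `num_ids` identities, with one visible and one infrared image each.

    visible / infrared: dict mapping identity -> list of image paths
    (hypothetical layout). Returns 2 * num_ids (path, identity, modality) tuples.
    """
    # Only identities that have images in both modalities can be paired.
    candidates = sorted(set(visible) & set(infrared))
    batch = []
    for pid in rng.sample(candidates, num_ids):
        batch.append((rng.choice(visible[pid]), pid, "visible"))
        batch.append((rng.choice(infrared[pid]), pid, "infrared"))
    return batch
```

With the batch size of 8 reported in the implementation details, `sample_batch(visible, infrared, 8)` would return 16 images, 8 per modality.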
V-F Visualization of results
We visualize the top ten ranking results on the SYSU-MM01 test set for both the infrared-to-visible and visible-to-infrared query settings. For each query setting, five query samples are randomly selected and their corresponding top ten retrieved cross-modality results are shown in Fig. 11. The results show that our method is robust to modality and appearance discrepancies. The main reason is that the grayscale modality augmentation strategy preserves rich structural cues (e.g., bags or stripes) and conspicuous parts (e.g., logos). Even though there are some wrongly retrieved examples in the ranking list, our method can still retrieve the correct top-ranked images.
In this paper, we have presented a deep learning framework called SFANet for visible-infrared person re-identification. SFANet is built on the grayscale-spectrum feature augmentation strategy, which is capable of preserving the complete semantic information of RGB images while alleviating the large modality discrepancy in the image space. Beyond the common two-stream feature extraction networks, we further extend SFANet with a parameter-sharing structure to capture the discriminative 3D-shaped spatial structure information of persons. In addition, we develop feature-level constraints by proposing a dual-linear with batch normalization ID embedding method and a bi-directional tri-constrained top-push ranking (BTTR) loss, which handle the modality difference at the classifier level and provide an effective metric for minimizing the ambiguity between classes. Extensive experimental results on two standard benchmarks demonstrate that the proposed framework is robust enough to learn modality-invariant feature representations and outperforms state-of-the-art methods.
This work is supported by the Nature Science Foundation of China (No. 61762023).
G. Watson and A. Bhalerao, "Person re-identification combining deep features and attribute detection," MTAP, vol. 79, no. 9-10, pp. 6463-6481, 2020.
W. Zhang, X. He, W. Lu, H. Qiao and Y. Li, "Feature Aggregation With Reinforcement Learning for Video-Based Person Re-Identification," IEEE TNNLS, vol. 30, no. 12, pp. 3847-3852, 2019.