Memory Regulation and Alignment toward Generalizer RGB-Infrared Person Re-identification
The domain shift, arising from the non-negligible modality gap and the non-overlapping identity classes between training and test sets, is a major issue in RGB-Infrared person re-identification. A key to tackling this inherent issue is to enforce similar data distributions across the two domains. However, RGB-IR ReID always demands discriminative features, leading to over-reliance on features of seen classes, e.g., via attention-based feature alignment or metric learning. Therefore, predicting the unseen query category from predefined training classes may not be accurate and leads to a sub-optimal adversarial gradient. In this paper, we uncover this issue in a more explainable way and propose a novel multi-granularity memory regulation and alignment module (MG-MRA) to solve it. By explicitly incorporating a latent variable attribute, from fine-grained to coarse semantic granularity, into intermediate features, our method alleviates the over-confidence of the model about discriminative features of seen classes. Moreover, instead of matching discriminative features by traversing nearest neighbors, sparse attributes, i.e., global structural patterns, are recollected with respect to features and assigned to measure pair-wise image similarity in hashing. Extensive experiments on RegDB <cit.> and SYSU-MM01 <cit.> show the superiority of the proposed method, which outperforms existing state-of-the-art methods. Our code is available at https://github.com/Chenfeng1271/MGMRA.
RGB-Infrared person re-identification (RGB-IR ReID), regarded as a cross-modality image retrieval task, aims to associate a person of interest across different scenes or disjoint camera views. As a vital support of surveillance systems, deep learning based methods have become the dominant approach and achieved promising performance in recent years. However, the open-set (zero-shot) and cross-modality nature of RGB-IR ReID still challenges algorithm generalization when searching for a person in the real world.
Current RGB-IR ReID methods [5, 6, 7] mainly focus on mitigating the modality discrepancy by learning attention-based visual similarity with objective functions, e.g., cross-entropy loss for classification and triplet loss for clustering. This kind of classification-based retrieval model exploits category recognition as a proxy task to learn the projection function from data to feature space, and is thereby drawn toward the discriminative regions of seen categories in the training set [9, 10]. Therefore, this paradigm is tailored for searching objects of the same categories, rather than open-set retrieval like ReID.
Some early attempts [11, 12] already redirected the attribution decision of classification-based retrieval models toward generalizer and adaptable ReID. For example, SBSGAN suppresses the background, which behaves as noise in cross-camera image matching, so that the model attends more closely to the foreground person content. SSG harnesses unlabeled samples to exploit potential cross-domain similarity. We argue, however, that using an extra GAN or extra data is suboptimal. Therefore, we look back at the details of classification-based ranking retrieval.
We simply denote the attribution decision of RGB-IR ReID as $p(y|x)$ and the attribution similarity as $s(f_i, f_j)$, where $f = \phi(x)$ is the intermediate feature extracted from image $x$. The pre-GAP (global average pooling) operation distills informative semantic features into fragile logit values. Moreover, the distilled semantic features are subsequently used to match their discriminative regions after alignment. As shown in Figure 1 (a), feature matching is simplified to traversing nearest neighbors via a similarity metric. This has been a popular choice as an attribution method with many follow-up variants [14, 15], but it still lacks interpretability for domain shift in the ReID case. What activates this inductive bias of over-relying on discriminative feature maps, and how can the strong focus on seen classes during the training phase be alleviated?
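The classification-based pipeline criticized above can be sketched in a few lines. This is an illustrative toy, not the paper's code (function names and data are our own): features are pooled by GAP into a single vector, and retrieval traverses the gallery for the nearest neighbor under a similarity metric.

```python
import math

def gap(feature_map):
    """Global average pooling: average a list of d-dim location vectors."""
    d = len(feature_map[0])
    return [sum(loc[k] for loc in feature_map) / len(feature_map) for k in range(d)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def rank_gallery(query_map, gallery_maps):
    """Traverse the gallery for nearest neighbors of the pooled query."""
    q = gap(query_map)
    sims = [cosine(q, gap(g)) for g in gallery_maps]
    return sorted(range(len(sims)), key=lambda i: -sims[i])
```

Whichever high-variance locations dominate the pooled vector decide the ranking, which is why discriminative regions of seen classes end up over-weighted.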
In this paper, we build a probabilistic graphical model to reconsider alleviating domain shift in RGB-IR ReID, involving a location-wise recognition cue $z$, a category label $y$, and a person image $x$. As shown in Figure 1 (b), $p(y|x)$ can be reformulated by marginalizing over $z$. Through expectation-maximization (EM) with the negative log-likelihood (NLL), the semantics learned from the training set cannot be transferred during testing (please see the following preliminary for details). First, the model of interest, $p(y|z,x)$, guides the joint likelihood but updates much faster than the cue distribution (by an exponential magnitude), resulting in heavy bias toward seen classes. Second, the semantic over-abstraction, especially by GAP, makes distinct semantic embeddings collapse into one semantic point. Therefore, only the discriminative cues largely contribute to the final prediction, and this collapsed behavior is not reusable on the test set.
We thus introduce a novel regulation module, the Multi-granularity Memory Regulation and Alignment (MG-MRA) module, to solve this issue. As shown in Figure 1 (c), we additionally model a learnable latent attribute variable $m$ to remember representative structural prototype patterns in a global scope that covers diverse identity samples universally. Since $m$ is semi-independent of any specific sample, we attempt to reduce the model's over-confidence in seen-class features by multiplying this memory probability into the prediction, and to increase its confidence on unseen classes by excluding it. Moreover, to avoid GAP-like semantic abstraction, MG-MRA conveys extra domain-level low-frequency information, learned from previously seen samples, into the following joint decision. In detail, we recollect features into predefined coarse-to-fine prototype indexes by reading the memory for further similarity measurement. Different from searching for relevant discriminative regions across two images, as depicted in Figure 1 (a), this prototype alignment is light-weight and similar to multi-stage hashing, where we adopt general structural patterns as flexible prototype index codes. Note that our work is a pre-hoc self-explainable method for domain shift, based on an interpretable probabilistic treatment of the existing mainstream paradigm. Our contributions can be listed as follows:
1) We analyze the lack of interpretability for domain shift in RGB-IR ReID, and then propose a multi-granularity memory regulation and alignment (MG-MRA) module to solve this issue. The proposed MG-MRA is plug-and-play and more effective than previous GAN-based or extra-data-based methods.
2) The learned coarse-to-fine prototypes consistently provide domain-level semantic templates at various granularities, meeting the requirement of multi-level semantic alignment.
Domain Shift in ReID: The domain discrepancy of ReID mainly comes from the intra-set gap (e.g., modality or cross-camera variance) and the inter-set gap (e.g., category variance). One intuitive solution is to enforce similar data distributions of the two domains in a latent space via zero-shot learning (ZSL) [17, 18, 19]. A line of works [7, 3]
focus on leveraging generative adversarial networks (GANs) to transfer data from one domain to another. PTGAN proposes a person transfer scheme to take advantage of existing labeled data from different datasets. cmGAN reduces the distribution divergence of RGB and IR features by explicitly transferring RGB images to IR. However, such unpaired GAN-based image synthesis is always costly and hard to optimize. Another line, based on metric learning, e.g., the triplet loss, pushes samples of different identities apart and pulls those of the same identity close. A label distillation strategy has also been proposed to guide the model to focus on confident samples. Circle loss intentionally re-weights each similarity to highlight the less-optimized similarity scores. However, these works mainly concentrate on solving the intra-set domain shift, ignoring the intrinsic inter-set discrepancy. In this work, we aim to propose a plug-and-play method that handles both the intra-set and inter-set domain discrepancy simultaneously, without involving additional GANs or data.
Feature Regulation on ReID
: A line of multi-task works delves into improving ReID performance with extra annotations or expertise from general computer vision tasks. EANet and PDC leverage part segmentation and pose estimation to accurately align discriminative cues. Lin et al. explicitly investigated additional pedestrian attributes to assist ReID, where the attribute recognition logits are reweighted and combined with ReID features for the final prediction. These multi-task based methods provide non-overlapping supervision beyond current ReID, which we regard as task regulation to avoid over-emphasis on ReID itself. Another interesting phenomenon lacking explanation is that the PCB module provides significant improvement over soft-attention modules, such as non-local attention, even though its hand-crafted striped division seems to care less about salient regions than an attention mask does. For example, FBP-AL introduces a PCB-like flexible body partition with multi-head soft-attention masks, yet its performance is still much lower. In this paper, we propose a novel flexible prototype memory regulation which is free from extra annotation; we then interpret it in a pre-hoc manner and examine it through post-hoc experiments.
In this section, we interpret the classification-based retrieval model from both implementation and probabilistic graph perspectives.
Given an input image and identity label from the training set, we simply denote the classification prediction (since the fully-connected layer that follows is equivalent to a convolution kernel, we follow  to ignore it) and the rank similarity of two images in training as:
where $\phi$ is the feature extractor and $h$, $w$ are the feature spatial resolutions. Under the negative log-likelihood, i.e., cross-entropy loss, high-variance semantic features are filtered out by pre-GAP, with respect to the maximal and/or minimal values, to dominate the prediction. However, this formula cannot explicitly illustrate the recognition cues and the heavy bias toward seen classes. We therefore build a probabilistic inference with a latent recognition cue $z$, as shown in Figure 1 (b). We refer to $z$ as the location index of the cue for recognizing image $x$ as class $y$. Our aim is to let the model explicitly base its prediction on the features at location $z$, and later use the distribution of possible cue locations to infer unseen images. For generality, we factorize the directed graph as:
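A plausible reconstruction of the elided factorization, using $x$ for the image, $y$ for the label, and $z$ for the latent cue location (symbols assumed from the surrounding discussion):

```latex
p(y \mid x) \;=\; \sum_{z} p(y \mid z, x)\, p(z \mid x)
```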
Because $z$ is unobserved, explicitly finding maximum-likelihood estimates of the model parameters in such a non-convex problem is hard. Thus, the estimate of $p(y|x)$ is fitted to maximize the log-likelihood via the expectation-maximization (EM) algorithm.
According to Jensen's inequality, for any distribution $q(z)$, Equation 3 gives a lower bound on the log-likelihood. Here we specifically separate the parameters for $p(z|x)$ from those of the whole model, as this term is always modeled by an attention module. Recall that $p(z|x,y)$ is a domain-variant posterior distribution with respect to the training set; it is therefore not applicable to test images.
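Under the same assumed notation, the Jensen lower bound referenced as Equation 3 plausibly takes the standard EM form: for any distribution $q(z)$,

```latex
\log p(y \mid x)
  = \log \sum_{z} q(z)\, \frac{p(y, z \mid x)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(y, z \mid x)}{q(z)},
```

with equality when $q(z) = p(z \mid x, y)$, which is exactly the domain-variant posterior noted above.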
Besides, during training, EM repeatedly constructs a lower bound on the log-likelihood (E-step) and then optimizes that lower bound (M-step), so EM always monotonically improves the log-likelihood. In detail, the first term signifies the model of interest, while the latter often refers to a slowly updated parameter used for generating its pseudo-targets. The acceleration of heavy bias toward seen classes during training comes from 1) the unbalanced gradients on the model of interest and the slowly updated cue distribution, and 2) the category-guided expectation that maximizes the likelihood of seen classes.
Moreover, turning back to feature alignment and matching, i.e., Equation 2 and Figure 1 (a), we note that feature alignment is limited to matching discriminative features in two images. The ideal case is to match locally consistent features, e.g., the head in an RGB image with the head in an IR image. However, due to pose, occlusion, etc., the discriminative features of two images may not be semantically identical. Thus, it is suboptimal to find the most similar pairs of local discriminative regions by traversing the nearest neighbors in the embedding space.
We propose to mitigate domain shift in RGB-IR ReID by regulating the strong focus on discriminative features of seen classes through learning and memorizing prototypes. As depicted in Figure 2, our framework mainly consists of three components: (1) a two-stream backbone for feature extraction and feature embedding, (2) the Multi-granularity Memory Regulation and Alignment module (MG-MRA) cooperating with PCB, and (3) the losses, including memory supervision and ReID supervision. Finally, we revisit the preliminary to explain how MG-MRA regulates the training process toward a generalizer status.
We adopt ResNet50 pretrained on ImageNet as our two-stream backbone, where the first two stages are parameter-independent to handle the two heterogeneous modalities, and the latter three stages act as a parameter-shared feature embedding that maps images into a modality-shared common space. Given RGB and IR images, the outputs of the feature embedding are applied as queries to retrieve the corresponding prototype patterns as an alternative prototype hashing index. For inference, we only use the output of the main branch for prediction.
The single-granularity memory module (SG-MRA) contains prototypes recorded in a matrix with a fixed feature dimension. An attention-based addressing operator for accessing the memory, i.e., a memory reader, is then used to assign each image to sparse prototypes:
where $f_i$ and $m_j$ are the feature slice from the input and the prototype slice from the prototype matrix $M$, respectively, and $w_{ij}$ is the normalized weight measuring the cosine similarity between them. Thus, the prototype assigned from feature $f$ can be calculated as:
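In code, the reading operation can be sketched as below — a minimal illustration assuming softmax-normalized cosine similarities (the function names and the unit softmax temperature are our assumptions, not the paper's implementation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def read_memory(f, prototypes):
    """Memory reader: assign feature f to sparse prototype slots.

    Weights are the softmax over cosine similarities between f and each
    prototype slot; the returned vector is the weighted sum of slots.
    """
    sims = [cosine(f, m) for m in prototypes]
    mx = max(sims)
    exps = [math.exp(s - mx) for s in sims]
    total = sum(exps)
    w = [e / total for e in exps]
    d = len(f)
    return [sum(w[j] * prototypes[j][k] for j in range(len(prototypes)))
            for k in range(d)]
```

The output is a recollection of $f$ from general prototype patterns rather than $f$ itself, which is what makes the read result usable as a hashing-like index later on.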
Based on the single-granularity memory module, we build a multi-granularity memory module (MG-MRA), as shown in Figure 2. MG-MRA consists of hierarchical semantic prototypes, i.e., part-instance-semantic, to avoid over-abstraction. The instance and semantic prototypes are summarized from the prototypes at the previous, lower level. Therefore, although the memory slots of different prototypes span various semantic granularities, they are shared to represent universal concepts of all samples. Specifically, we define the prototype matrix with a shape determined by the predefined per-prototype numbers for the part, instance, and semantic levels and by the category number. Before summarizing the semantic prototypes, each part and instance prototype is duplicated for the two modalities. Therefore, for the intra-modality gap, we keep the lower-level representative patterns of each individual modality in the part and instance prototypes, and then align them jointly at the semantic level. As shown in Figure 2, each higher-level prototype item can be obtained by summing over the range of its lower-level counterpart. For example, a row of the instance prototype sub-matrix can be seen as a weighted combination of the corresponding segment of the part prototypes:
where the weight scalars are learned as a convex combination to locate the embedding center. Similarly, we can obtain the semantic prototypes, and then, following Equation 6, MG-MRA can be represented as:
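The hierarchical summarization can be illustrated as below, a sketch under our own assumptions (softmax-normalized learnable weights over a contiguous segment of lower-level rows; names hypothetical):

```python
import math

def summarize_segment(lower_rows, logits, start, end):
    """One higher-level prototype row as a convex combination of the
    lower-level rows in [start, end); weights are softmax(logits)."""
    seg = lower_rows[start:end]
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    w = [e / total for e in exps]
    d = len(seg[0])
    return [sum(w[j] * seg[j][k] for j in range(len(seg))) for k in range(d)]
```

With equal logits the row is simply the segment mean; training moves the weights so each higher-level slot drifts toward an embedding center.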
In implementation, MG-MRA is applied as an auxiliary branch that is only used to regulate the training process. We adopt the PCB module to achieve basic state-of-the-art performance, where each striped feature also retrieves its corresponding prototype from our memory module.
From Equation 6, the final output is represented by a simple weighted combination, which indicates how the general prototype patterns appear in this image rather than extracting salient features. Thus, we believe this memory module works like hashing for alignment, but is more flexible: the values are continuous instead of binary, and the index is learned from the whole domain instead of being predefined.
We adopt all default loss objectives, i.e., the hetero-center triplet loss and the identity loss, used in HCT without changing any hyper-parameters or structures. Therefore, we mainly introduce our memory loss functions below.
Essentially, we expect the large prototype matrix to be informative enough to record diverse representative patterns. However, directly forcing this would cause chaos in the semantic grouping. Considering the matrix-decomposition-like hierarchical summarization, it can be achieved by constraining the reading weights in two stages: (1) the weights of all striped features should be similar, to ensure each center carries higher-level meaning; (2) the weights within a mini-batch should be distinct, to ensure the prototype index achieves semantic marginalization.
Specifically, we use the Maximum Mean Discrepancy (MMD) with an MSE loss to measure part-level consistency. For the remaining items, we first randomly split them into two halves to avoid setup bias, and then follow Equation 10 to achieve a consistent instance prototype representation of each part.
Besides, for semantic marginalization, we adopt the more flexible triplet loss, which pulls the embeddings of samples sharing the same ID to be close and pushes those of different IDs to be far apart:
where the triplet consists of the anchor, positive, and negative samples respectively, which can be arbitrary samples in the mini-batch. Finally, the whole objective can be assembled as:
where the $\lambda$ coefficients are predefined trade-off parameters.
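For concreteness, the classical triplet term used above for semantic marginalization can be sketched as follows, with Euclidean distance and the 0.3 margin from the implementation details (a toy illustration, not the training code):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Classical triplet: pull same-ID embeddings together and push
    different-ID embeddings at least `margin` farther away."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive is, so well-separated triplets stop contributing gradient.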
Now we build the probabilistic graphical model of our method and then revisit the preliminary to examine its inherent properties in alleviating over-confidence in discriminative features of seen classes.
Our MG-MRA aims to retrieve memory prototypes $m$, which are independent of the attention-like module, i.e., finding discriminative cues $z$. Therefore, as shown in Figure 1 (c), the prediction in training can be reformulated as:
Similar to Equation 4, the estimation can be represented as:
where the additional term denotes the model parameters of MG-MRA. Three observations show that MG-MRA focuses on regulation. (1) Compared to Equation 4, the model of interest in the expectation step is adjusted by the memory term, and the pseudo-target generation of the maximization step is dissuaded by multiplying in the memory probability. (2) Moreover, the introduced operator also adjusts the gradient balance between the E and M steps. (3) Different from the discriminative cues $z$, which depend on each specific sample, the memory prototype $m$ is inert and semi-independent of any specific sample, memorizing general patterns from the whole domain. Finally, during inference, we remove $m$ and recover the model's confidence in discriminative features, which is identical to the inference of the classical paradigm and requires no extra computation:
Datasets and evaluation metrics: Two benchmark datasets, i.e., RegDB and SYSU-MM01, are used to evaluate the effectiveness of our method. RegDB contains 412 identities, each of which includes ten visible and ten far-infrared images. Following , we randomly split it into equal training and test sets, each with 2,060 visible images and 2,060 infrared images. SYSU-MM01 is a larger-scale dataset dedicated to RGB-IR ReID, containing in total 30,071 visible images and 15,792 infrared images of 491 identities. The training set has 22,258 visible images and 11,909 infrared images of 395 identities, and the query set involves 3,803 infrared images of 96 identities. We use the Cumulative Matching Characteristics (CMC) curve and the mean Average Precision (mAP) as our standard evaluation metrics.
Implementation details: Our basic framework is largely modified from HCT, where MG-MRA/SG-MRA acts as a plug-in module applied to HCT. Similarly, this manner is applicable to other frameworks with GAP, as we show in the ablation study. We follow HCT in adopting ResNet50 pretrained on ImageNet as our backbone, whose stride in the last convolutional block is changed from 2 to 1. Each intermediate feature is then split into 6 stripes along the height axis. We adopt the stochastic gradient descent (SGD) optimizer with momentum 0.9 and learning rate 0.1. For the triplet losses, we set the relaxation margin to 0.3. For the sampling strategy, we set the two sampling parameters to 8 and 4 for the RegDB dataset, and 6 and 8 for the SYSU-MM01 dataset, respectively. The per-prototype numbers for the part, instance, and semantic levels are predefined (see the ablation in Table 4). The trade-off constants are set to 0.05, 0.1, 0.1, and 1 respectively. The model is trained on a single NVIDIA P100 GPU with PyTorch.
We compare our method with other state-of-the-art (SOTA) methods on RegDB and SYSU-MM01, including AlignGAN, CMSP, Xmodel, GLMC, MPANet, etc. As shown in Table 1, our method achieves the best performance across most evaluation metrics in both the visible2infrared and infrared2visible settings. Moreover, our method outperforms existing methods by a large margin without bells and whistles: 2.75%/6.76% Rank1/mAP improvement on visible2infrared and 2.10%/6.13% Rank1/mAP improvement on infrared2visible over the previous best SOTA, GLMC. Compared with HCT, the basic framework we build on, our method further boosts Rank1/mAP by 3.44%/4.90% and 3.92%/5.73% on the two settings respectively.
The comparison results on SYSU-MM01 are shown in Table 2. The compared SOTAs are selected identically to Table 1 on RegDB for fairness. The proposed MG-MRA outperforms MPANet, i.e., the previous SOTA, with promising improvements of 1.92%/0.7% and 5.28%/1.96% Rank1/mAP on All search and Indoor search. Moreover, our method provides a promising performance margin over HCT, finally obtaining 72.50%/82.02% Rank1 for All search and Indoor search respectively.
SG-MRA & MG-MRA: As shown in Table 3, we plug our SG-MRA and MG-MRA into one GAN-based method (AlignGAN), one baseline model (AGW), one attention-based alignment model (DDAG), and one SOTA model (HCT) to verify the generalization ability of our memory regulation module. Compared with the original methods, SG-MRA and MG-MRA provide a 2%-5% Rank1 improvement. Moreover, the semantic hierarchy in MG-MRA explicitly eases the semantic abstraction by memorizing part-level and instance-level structural patterns. Therefore, MG-MRA can further boost the performance of different models by meeting their different demands of semantic maintenance.
Per-prototype number: As shown in Table 4, we adjust the per-prototype numbers of the part, instance, and semantic matrices to evaluate the sensitivity of MG-MRA to the matrix's descriptive ability. Even when we increase the matrix size by ten times, the performance of our MG-MRA changes only trivially (within 1% in Rank1 and mAP). Therefore, considering both SG-MRA and MG-MRA, we believe the hierarchical semantic summarization matters more than simply enhancing the prototype descriptive ability.
Robustness without re-tuning hyper-parameters: The training of RGB-IR ReID models is sometimes tricky, but our MG-MRA robustly provides consistent improvement for settings within a reasonable range, which means MG-MRA is a user-friendly plug-in that does not require re-tuning well-posed hyper-parameters. For example, all experiments reported in this paper use the default settings of the basic structures. Moreover, as shown in Figure 3, we evaluate the potential of MG-MRA under different parameters, i.e., batch size and learning rate. The HCT baseline model is quite sensitive to hyper-parameters, but our MG-MRA still boosts accuracy by 2-3% in most cases.
Regulation visualization: We visualize the attention heat maps of training, query, and gallery images in Figure 4, and the t-SNE distributions of query and gallery images in Figure 5. As shown in the upper and bottom parts of Figure 4, our MG-MRA (third row) alleviates the overwhelming attention on training images and, correspondingly, attends to more discriminative regions when searching for persons of interest. Besides, we find that the original distribution of AGW in Figure 5 is located more randomly and compactly, which is harmful for ranking, while using our MG-MRA achieves a well-spaced distribution, matching the intuition of the triplet loss. We also discuss the role of MG-MRA and suggestions for implementing it on other frameworks in the Appendix.
1. Why is MG-MRA a regulation module, different from other memory variants?
As analyzed in the theoretical properties, our MG-MRA is designed for open-set RGB-IR ReID to divert the attention of the training process from recognition cues to general structural patterns. Other variants, e.g., MemAE, applied to unsupervised generation tasks, mainly record closed-set patterns to help the downstream task. Empirically, we notice that adding MG-MRA and the corresponding losses slows the decrease of the ID loss and the increase of training accuracy, but improves evaluation accuracy. We also illustrate the regulation effect in Figures 4 and 5. These findings are consistent with the behavior of a general regularization term.
2. What is the difference between MG-MRA and other regulation methods?
Our MG-MRA is a kind of feature regulation that requires no additional annotation or multi-task cooperation. Common parameter regularization terms cannot directly act on a latent variable such as the recognition cue $z$. Moreover, we regard methods using multi-task learning and annotation as external task regulation; such external cues semi-overlap with the recognition cues and thereby enhance category invariance.
For PCB-based methods, we believe this kind of hard attention acts similarly to regulation: the hand-crafted stripe division forces the model to look at specific areas without concession. Its improvements prove it powerful; however, it still cannot flexibly adjust features.
3. What is the implementation limit for applying MG-MRA to other methods?
Our MG-MRA is not suitable for methods containing pre-global-max-pooling (GMP) or its variants, as shown in Equations 1 and 2. We notice that this setting results in optimization collapse, e.g., the loss does not decrease and performance stays within 1% accuracy. It is an interesting phenomenon that max pooling commonly outperforms average pooling (in DGTL, solely replacing GAP with GMP brings an extra 10% improvement on RegDB), yet it is unsuitable in this case.
Another interesting phenomenon is that the triplet loss must be the classical version; exchanging it with semi-hard or hard-mining variants does not work in our framework.
In this paper, we analyze, in a probabilistically explainable way, the over-confidence in seen-class features of the mainstream paradigm, which leads to domain shift in RGB-IR ReID. We then propose a multi-granularity memory regulation and alignment module (MG-MRA) to ease this tendency during training. The proposed MG-MRA is an effective, plug-and-play method for generalizer RGB-IR ReID with pre-hoc self-explanation. Experimental results on RegDB and SYSU-MM01 amply demonstrate that our MG-MRA outperforms previous state-of-the-art methods by a large margin. Besides, the ablation study and discussion illustrate the optimization essence of the proposed method and provide suggestions for adopting our regulation module in other frameworks.