Pose Agnostic Cross-spectral Hallucination via Disentangling Independent Factors

09/10/2019 ∙ by Boyan Duan, et al. ∙ HUAWEI Technologies Co., Ltd. NetEase, Inc

The cross-sensor gap is one of the challenges that has attracted much research interest in Heterogeneous Face Recognition (HFR). Although recent methods have attempted to fill the gap with deep generative networks, most of them suffer from the inevitable misalignment between different face modalities. This misalignment results primarily not from the imaging sensors but from geometric variations (e.g., pose and expression) of faces, which are independent of the spectrum. Rather than building a monolithic but complex structure, this paper proposes a Pose Agnostic Cross-spectral Hallucination (PACH) approach to disentangle the independent factors and deal with them in individual stages. In the first stage, an Unsupervised Face Alignment (UFA) network is designed to align the near-infrared (NIR) and visible (VIS) images in a generative way, where 3D information is effectively utilized as the pose guidance. The task of the second stage thus becomes spectrum transformation with paired data. We develop a Texture Prior Synthesis (TPS) network to accomplish complexion control and consequently generate more realistic VIS images than existing methods. Experiments on three challenging NIR-VIS datasets verify the effectiveness of our approach in producing visually appealing images and achieving state-of-the-art performance in cross-spectral HFR.





Figure 1: Synthesized results of PACH. There are distinct pose deviations between the NIR images (1st row) and the VIS images (3rd row). PACH disentangles the independent factors in cross-spectral hallucination and produces realistic VIS images from NIR inputs.

In real-world systems, cameras contain multiple kinds of imaging sensors. For example, near-infrared (NIR) sensors work well in low lighting conditions and are widely used in night vision devices and surveillance cameras. Nevertheless, visible (VIS) images are much easier to capture, which makes them the most common type. Different sensors result in variations in face appearance, which makes it very challenging to precisely match face images across light spectra. Face recognition with NIR images is an important task in computer vision. However, in most face recognition scenarios, the only available face templates are VIS images. Moreover, compared with VIS face datasets, large-scale NIR face datasets for training complex models are lacking. Therefore, it is important to utilize both NIR and VIS images effectively and boost cross-spectral Heterogeneous Face Recognition (HFR).

To solve this problem, much effort has been made in the past decades. Existing methods can be classified into three categories. The first learns domain-invariant features of faces in different domains, e.g., [23]. The second projects NIR images and VIS images into a common subspace, e.g., [32]. Face generation (or hallucination) has risen as another popular trend, especially in recent years. It converts an NIR image to a VIS image while keeping the identity of the face, and then applies recognition models to the generated VIS images. As an advantage, these methods can directly exploit existing face recognition approaches trained on VIS images in the last step.

Figure 2: The schematic diagram of the Pose Agnostic Cross-spectral Hallucination (PACH). There are two stages in PACH, each having an individual duty. Unsupervised Face Alignment (UFA) in the 1st stage learns to rotate the input face according to the given 3D guidance. Texture Prior Synthesis (TPS) in the 2nd stage translates an NIR image to a VIS image based on the given texture.

However, generation-based methods still face challenges. The main one is the misalignment issue: each pair of an NIR image and a VIS image of the same person in the training set is not exactly aligned. The reason is that NIR and VIS images are usually captured in different scenarios, with different imaging distances or environments. We show some sample NIR-VIS pairs in Figure 1 together with our generated VIS results. Most existing methods require aligned (or paired) data to train a decent model. When confronted with unaligned data, they tend to produce unsatisfying images, and the resolution of the synthesized images is usually low. Although [33] proposed to address the misalignment issue by learning attention from warped images, their generated results generally have similar skin color, which fails to reflect the variation found in reality and lacks personal textures. Moreover, their network is quite complex and requires complicated data pre-processing.

In this paper, we propose a simple but effective solution to the misalignment problem in cross-spectral HFR, namely Pose Agnostic Cross-spectral Hallucination (PACH). The schematic diagram is presented in Figure 2. During hallucination, pose adjustment and spectrum transformation are independent of each other. Instead of dealing with the blended factors together, PACH disentangles them and settles each in an individual stage. In the first stage, we design an Unsupervised Face Alignment (UFA) network to adjust the face pose according to a 3D guidance. UFA is trained following an unsupervised principle of reconstructing the input image. Inspired by [13], UFA can naturally separate the style (or identity) and the content (or pose) of an NIR image. In the second stage, the trained UFA is kept fixed. We use another 3D pose input (from a VIS image) to guide the pose adjustment of the NIR image. In this way, UFA generates an NIR image that is aligned with the VIS image. The aligned data produced by UFA simplify the task of the second stage, which becomes spectrum transformation with paired data. To address the face texture problem, we develop a Texture Prior Synthesis (TPS) network that is able to control skin color and produces realistic results with personal textures. We train our network on the CASIA NIR-VIS 2.0 dataset and evaluate it on three datasets: CASIA NIR-VIS 2.0, OULU-CASIA NIR-VIS and BUAA-VISNIR. Experimental results show that our method generates high-quality images and improves HFR performance.

In summary, there are three main contributions in this work:

  1. This paper proposes a novel solution to data misalignment in cross-spectral HFR, namely Pose Agnostic Cross-spectral Hallucination (PACH). Since pose and spectrum are two independent factors, we propose to disentangle them and settle them separately in different stages with relatively simple networks.

  2. There are two stages in PACH, each focusing on one factor. In the first stage, we design an Unsupervised Face Alignment (UFA) network to adjust face poses according to the 3D guidance and produce aligned NIR-VIS data. The second stage contains a Texture Prior Synthesis (TPS) network that accomplishes complexion control and produces realistic VIS images for HFR.

  3. Experiments on three NIR-VIS datasets are conducted and PACH achieves state-of-the-art performance in both visualization and recognition.

Related Work

NIR-VIS face recognition is an important problem that has been studied in recent years. Existing methods can be classified into three categories: image synthesis, common subspace learning and domain-invariant feature representation.

Image synthesis methods generate VIS images from NIR images so that the generated VIS images can be compared with the VIS image templates. Image synthesis was first used in a similar task: [29] used it to solve the sketch-photo recognition problem. [17] learned a mapping function between the NIR and VIS domains with a dictionary-based approach. Recent methods have applied deep learning to the image synthesis process. One approach used a Convolutional Neural Network (CNN) to synthesize VIS images from NIR images in patches and then applied a low-rank embedding to further improve the result. Generative Adversarial Networks (GANs) [7] are also used in the field: [28] proposed a framework based on Cycle-GAN for image hallucination in NIR-VIS recognition. In addition, [4] proposed a dual generation method that generates paired NIR-VIS images from noise to reduce the domain gap of HFR.

Feature representation methods aim to learn features that are robust and invariant across the NIR and VIS domains. Traditional methods are based on hand-crafted local features. [22] applied Difference-of-Gaussian (DoG) filtering and Multi-scale Block Local Binary Patterns (MB-LBP) to obtain the feature representation. [5] used the Local Radon Binary Pattern (LRBP), a feature that is robust across modalities, to tackle the similar task of sketch-VIS recognition. [6] encoded face images with a common encoding model (a feature descriptor) and used a discriminant matching method to match images in different domains.

Subspace learning methods learn to project the NIR and VIS domains into a common subspace such that the projections of images of the same person from the two domains are close in that subspace. [32] applied Canonical Correlation Analysis (CCA) learning in a Linear Discriminant Analysis (LDA) subspace for the task. [27] used Partial Least Squares (PLS) to map heterogeneous face images from different modalities into a common subspace. [12] proposed regularized discriminative spectral regression to match heterogeneous face images in a subspace. [18] used the Multi-view Discriminant Analysis (MvDA) approach to learn a discriminant common subspace.


The goal of our method is to translate an NIR image to a VIS one, which is expected to improve the performance of heterogeneous face recognition. However, on the one hand, the paired NIR and VIS images in heterogeneous face datasets, such as the CASIA NIR-VIS dataset, are not aligned. There are inevitable differences in the pose and expression of the paired NIR-VIS images, and these discrepancies make it hard to synthesize VIS images from the paired NIR images. On the other hand, the diverse complexions of the VIS images in the CASIA NIR-VIS dataset also pose challenges for photo-realistic face synthesis. To tackle these problems, we explicitly divide the cross-spectral face hallucination task into two independent stages: unsupervised face alignment and texture prior cross-spectral synthesis. The first stage aligns the pose and expression of the NIR images with the corresponding VIS images, which gives us aligned NIR-VIS pairs for training. The second stage adopts a texture prior to help synthesize VIS images in complex scenes. The details of the two stages are given in the following subsections.

Figure 3: Examples of face alignment on the CASIA NIR-VIS dataset. There are large differences in the pose and expression of the paired NIR (the first row) and VIS (the third row) images. The aligned NIR images (the second row) obtained by our face alignment method have the same pose and expression as the VIS images.

Unsupervised Face Alignment (UFA)

Inspired by the recently proposed StyleGAN [19], which utilizes Adaptive Instance Normalization (AdaIN) [13] to control image styles, we propose an unsupervised face alignment method that uses AdaIN to disentangle facial shape and identity. AdaIN is defined as:

$$\mathrm{AdaIN}(z, \gamma, \beta) = \gamma \left( \frac{z - \mu(z)}{\sigma(z)} \right) + \beta, \tag{1}$$

where $z$ is the feature of the ‘content’ image [14], $\mu(z)$ and $\sigma(z)$ denote the channel-wise mean and standard deviation of $z$, and $\gamma$ and $\beta$ are affine parameters learned by a network. By controlling $\gamma$ and $\beta$, the image style can be switched [13]. In our face alignment task, the identity texture corresponds to the ‘style’ in [13, 19, 14], and the facial shape corresponds to the ‘content’ in [14]. Our unsupervised face alignment method is mainly based on this analysis.
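The AdaIN operation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the names `z`, `gamma` and `beta`, the `(C, H, W)` layout and the small `eps` stabilizer are assumptions, and in the actual network the affine parameters would be predicted by the style encoder rather than fixed by hand.

```python
import numpy as np

def adain(z, gamma, beta, eps=1e-5):
    """Adaptive instance normalization for one feature map of shape (C, H, W).

    gamma and beta are per-channel affine parameters of shape (C,).
    """
    mu = z.mean(axis=(1, 2), keepdims=True)     # channel-wise mean
    sigma = z.std(axis=(1, 2), keepdims=True)   # channel-wise standard deviation
    normalized = (z - mu) / (sigma + eps)       # eps avoids division by zero (assumed)
    return gamma[:, None, None] * normalized + beta[:, None, None]

# Toy usage with hand-picked affine parameters (illustrative values only).
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 8))
gamma = np.array([1.0, 2.0, 0.5, 3.0])
beta = np.array([0.0, 1.0, -1.0, 2.0])
out = adain(z, gamma, beta)
```

After the call, each output channel has mean close to the corresponding `beta` and standard deviation close to `gamma`, which is what lets the style parameters override the statistics of the content features.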

As shown in Fig. 2, our face alignment generator consists of a style encoder, a content encoder, several AdaIN residual blocks and a decoder. The style encoder extracts the style features of the input NIR image, which are independent of the facial shape information. These style features are in fact the affine parameters of AdaIN in Eq. 1, which determine the identity of the synthesized image. The content encoder is a facial shape extractor. Its input is the UV map (3D model) of the input NIR image, which gives the detailed facial shape. The shape features are likewise independent of the style features. From Fig. 2 we can see that the pose and expression information of the input face is entirely contained in the UV map. The residual blocks with adaptive instance normalization fuse the style features and the shape features, and their affine parameters are learned by the style encoder. The decoder maps the fused features back to image space and outputs the reconstructed NIR image. The loss functions of this stage are introduced below.

Reconstruction Loss.

As mentioned above, our face alignment method is unsupervised in the sense that we only reconstruct the input image without any other supervision. The output image is required to stay consistent with the input image, which is enforced by a norm-based loss between the two images.


Identity Preserving Loss.

The reconstructed NIR image should be consistent with the ground truth not only in image space but also in the semantic feature space. Inspired by [10, 3], we introduce an identity preserving network to extract the identity features of the reconstructed image and of the input image. This network is a pre-trained LightCNN-29 [30], and the identity features are the output of its last fully connected layer. The distance between the two feature vectors is then constrained by a norm-based loss.


Adversarial Loss.

In order to improve the visual quality of the reconstructed images, we add a discriminator that performs adversarial learning [7] against the face alignment generator.


Overall Loss.

The overall loss in the unsupervised face alignment stage is the weighted sum of the above losses, with trade-off parameters balancing the terms. The face alignment generator (including the style encoder, the content encoder, the AdaIN residual blocks and the decoder) and the discriminator are trained alternately to play the min-max game [7].
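To make the structure of this objective concrete, the sketch below combines the three loss terms of the face alignment stage in NumPy. Several details are assumptions made for illustration and are not stated in the text: the reconstruction loss is taken as a mean absolute (L1) error, the identity loss as a squared L2 distance between feature vectors, the adversarial term as the generator side of the standard min-max objective of [7], and the weights `w_rec`, `w_ip` and `w_adv` are hypothetical placeholders for the trade-off parameters.

```python
import numpy as np

def reconstruction_loss(output, target):
    # Assumed L1 form: mean absolute pixel difference.
    return np.abs(output - target).mean()

def identity_loss(feat_output, feat_target):
    # Assumed squared L2 distance between identity feature vectors.
    return np.sum((feat_output - feat_target) ** 2)

def adversarial_value(d_real, d_fake, eps=1e-8):
    # Min-max value: the discriminator maximizes it, the generator minimizes it.
    return np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean()

def ufa_generator_loss(output, target, feat_output, feat_target, d_fake,
                       w_rec=1.0, w_ip=1.0, w_adv=1.0, eps=1e-8):
    # Only the second term of the min-max value depends on the generator.
    adv = np.log(1.0 - d_fake + eps).mean()
    return (w_rec * reconstruction_loss(output, target)
            + w_ip * identity_loss(feat_output, feat_target)
            + w_adv * adv)

# Toy check: a perfect reconstruction leaves only the adversarial term.
x = np.ones((1, 4, 4))
loss_perfect = ufa_generator_loss(x, x, np.ones(8), np.ones(8), np.array([0.5]))
```

In alternating training, one step would update the discriminator to increase `adversarial_value` and the next would update the generator to decrease `ufa_generator_loss`.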

Texture Prior Synthesis (TPS)

After training the face alignment generator, we align the input NIR image with the target VIS image by changing the UV map, that is, by replacing the UV map of the input NIR image with the UV map of the target VIS image. In this way, we obtain aligned NIR-VIS training pairs, each consisting of an aligned NIR image and its target VIS image, as shown in Fig. 3.

The complexions of all NIR images in the CASIA NIR-VIS dataset are uniform, while the complexions of the VIS images are diverse. This diversity degrades the performance of traditional image-to-image translation methods [15, 34], because NIR-to-VIS translation is essentially a ‘one-to-many’ problem, i.e., one NIR complexion maps to multiple VIS complexions, whereas traditional image-to-image translation methods are only applicable to ‘one-to-one’ problems.

Based on the above analysis, we propose a texture prior cross-spectral synthesis method. We adopt a prior texture to guide the NIR-to-VIS translation process. Specifically, this prior texture is cropped from the target VIS image and indicates the complexion of that image. The prior texture is then concatenated with the input NIR image to form a new input that is fed to the generator. The cross-spectral synthesis generator thus produces the synthesized VIS image from the aligned NIR image and the texture cropped from the target VIS image. The losses involved in this stage are listed as follows.
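The construction of the generator input can be sketched as follows. This is a minimal illustration under stated assumptions: the image sizes, the crop location, the single-channel NIR layout, nearest-neighbour resizing and resizing the patch to the full image size before channel-wise concatenation are all hypothetical choices, and the helper names `crop_and_resize` and `build_generator_input` are introduced here only for the example.

```python
import numpy as np

def crop_and_resize(image, top, left, size, out_size):
    """Crop a square patch from a (C, H, W) image and resize it
    to (C, out_size, out_size) with nearest-neighbour sampling."""
    patch = image[:, top:top + size, left:left + size]
    idx = np.arange(out_size) * size // out_size   # source index for each output pixel
    return patch[:, idx][:, :, idx]

def build_generator_input(nir, vis, top, left, size):
    """Concatenate the aligned NIR image with the texture prior along channels.
    Assumes square images so one output size serves both spatial axes."""
    texture = crop_and_resize(vis, top, left, size, out_size=nir.shape[1])
    return np.concatenate([nir, texture], axis=0)

# Hypothetical shapes: 1-channel NIR and 3-channel VIS at 128 x 128.
nir = np.zeros((1, 128, 128))
vis = np.random.default_rng(0).random((3, 128, 128))
gen_input = build_generator_input(nir, vis, top=70, left=30, size=16)
```

Because the texture enters as extra input channels, swapping in a patch from a different VIS image at test time changes the complexion cue without touching the NIR content.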

Pixel Loss.

Since the aligned NIR image has the same facial shape as the target VIS image, we can use direct pixel-wise supervision to train the NIR-to-VIS translation network. The pixel loss with the texture prior measures the pixel-wise difference between the synthesized VIS image and the target VIS image.

Figure 4: The results of changing prior textures. The top left is the input NIR image. The remaining VIS images are all synthesized under different prior textures. For each synthesized VIS image, the corresponding prior texture is shown in the lower right corner.

Total Variation Regularization.

The synthesized images are prone to artifacts [26] during training, which affects the visual quality. To reduce these artifacts, following [10], we also introduce a total variation regularization loss [16], which penalizes differences between neighboring pixels over the width and height of the generated image.
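A total variation regularizer of this kind can be sketched as below. The anisotropic absolute-difference form and the lack of normalization are assumptions made for illustration; the exact formulation used in the experiments is not recoverable from the text.

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation of an image with shape (C, H, W)."""
    dh = np.abs(img[:, 1:, :] - img[:, :-1, :]).sum()   # differences along the height
    dw = np.abs(img[:, :, 1:] - img[:, :, :-1]).sum()   # differences along the width
    return dh + dw

# A flat image has zero total variation; a checkerboard is maximally "noisy".
flat = np.full((3, 4, 4), 0.5)
checker = (np.indices((4, 4)).sum(axis=0) % 2)[None].astype(float)
tv_flat = total_variation(flat)
tv_checker = total_variation(checker)
```

Minimizing this term pushes the generator toward piecewise-smooth outputs, which is why it is typically given a very small weight relative to the pixel loss.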

In addition, we also use an identity preserving loss and an adversarial loss in this stage. Both take the same form as the corresponding losses of the face alignment stage, except that they are applied to the synthesized VIS image. The discriminator of the adversarial loss is retrained from random initialization in this stage.

Overall Loss.

The overall loss of the texture prior synthesis stage is the sum of the above losses, weighted by trade-off parameters.
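One plausible way to write the objective of this stage is given below. The notation is assumed for illustration only: $G$ denotes the cross-spectral synthesis generator, $\tilde{I}_{N}$ the aligned NIR image, $T$ the prior texture and $I_{V}$ the target VIS image. The choice of an $L_1$ norm for the pixel loss and the assignment of the three weights to the regularization, identity and adversarial terms are likewise assumptions.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{pix}} &= \left\| G(\tilde{I}_{N}, T) - I_{V} \right\|_{1}, \\
\mathcal{L}_{\mathrm{TPS}} &= \mathcal{L}_{\mathrm{pix}}
  + \lambda_{\mathrm{tv}} \, \mathcal{L}_{\mathrm{tv}}
  + \lambda_{\mathrm{ip}} \, \mathcal{L}_{\mathrm{ip}}
  + \lambda_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{adv}}.
\end{aligned}
```

Under this reading, the pixel loss acts as the reference term and the three weights scale the remaining terms to comparable magnitudes.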

Figure 5: The comparison results of different methods on the CASIA NIR-VIS dataset.


In this section, we evaluate our proposed approach against state-of-the-art methods on three widely employed NIR-VIS face datasets, including the CASIA NIR-VIS dataset [21], the Oulu-CASIA NIR-VIS dataset [2] and the BUAA-VisNir face dataset [11]. We first introduce these three datasets and the testing protocols. Then, experimental details are given. Finally, qualitative and quantitative experimental results are reported to demonstrate the effectiveness of our approach.

Datasets and Protocols

The CASIA NIR-VIS dataset [21] is a publicly available and challenging NIR-VIS heterogeneous face dataset with the largest number of images. It consists of 725 subjects; the number of VIS images per subject ranges from 1 to 22, and the number of NIR images per subject ranges from 5 to 50. The images in this dataset contain diverse variations, such as different expressions, poses, backgrounds and lighting conditions. The NIR and VIS images of each subject are not aligned. We follow the protocol of [31] to split the training and testing sets, which defines 10-fold experimental settings. For each setting, about 2,500 VIS images and 6,100 NIR images from about 360 subjects are used as the training set. The probe set is composed of over 6,000 NIR images from 358 subjects. The gallery set contains 358 VIS images from the same subjects. Note that we also follow the generation protocol of [33], that is, the qualitative and quantitative generation results are all obtained from the first fold. The Rank-1 accuracy, verification rate (VR) at false accept rate (FAR) = 1% and VR@FAR = 0.1% are reported for comparison.

The BUAA-VisNir face dataset [11] has images from 150 subjects with 40 images per subject (containing 13 VIS-NIR pairs and 14 VIS images). A total of 900 images from 50 subjects are chosen as the training set, and the remaining 100 subjects form the testing set. Following [33], we train our model on the first fold of the CASIA NIR-VIS dataset and directly test it on the testing set of the BUAA-VisNir face dataset. The Rank-1 accuracy, VR@FAR = 1% and VR@FAR = 0.1% are reported.

The Oulu-CASIA NIR-VIS dataset [2] consists of 80 identities with 6 different expressions. Following the protocol of [31], 20 identities are selected as the training set, and another 20 identities are selected as the testing set. Each identity contains 48 NIR images and 48 VIS images. For the testing set, all the NIR images are used as the probe set and all the VIS images form the gallery set. Similarly, we train our model on the CASIA NIR-VIS dataset and test it on the Oulu-CASIA NIR-VIS dataset. The Rank-1 accuracy, VR@FAR = 1% and VR@FAR = 0.1% are reported.
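All three protocols report Rank-1 accuracy and VR@FAR. The sketch below shows how these two metrics are commonly computed from a probe-by-gallery similarity matrix. The function names, the use of a quantile of the impostor scores to pick the threshold, and the toy scores are assumptions for illustration, not the evaluation code used in the experiments.

```python
import numpy as np

def rank1_accuracy(sim, probe_ids, gallery_ids):
    """Fraction of probes whose most similar gallery image has the same identity."""
    best = gallery_ids[np.argmax(sim, axis=1)]
    return float(np.mean(best == probe_ids))

def vr_at_far(sim, probe_ids, gallery_ids, far):
    """Verification rate at a given false accept rate."""
    same = probe_ids[:, None] == gallery_ids[None, :]
    genuine, impostor = sim[same], sim[~same]
    # Threshold chosen so that roughly a fraction `far` of impostor scores exceed it.
    threshold = np.quantile(impostor, 1.0 - far)
    return float(np.mean(genuine > threshold))

# Toy example: 3 NIR probes matched against 3 VIS gallery images.
probe_ids = np.array([0, 1, 2])
gallery_ids = np.array([0, 1, 2])
sim = np.array([[0.90, 0.10, 0.20],
                [0.30, 0.80, 0.10],
                [0.20, 0.95, 0.60]])
rank1 = rank1_accuracy(sim, probe_ids, gallery_ids)
vr = vr_at_far(sim, probe_ids, gallery_ids, far=0.1)
```

In the toy example the third probe is most similar to the wrong gallery image, so it counts against Rank-1, and its genuine score also falls below the verification threshold.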

Experimental Details

All images in the heterogeneous datasets are aligned and center cropped to a fixed resolution. In addition, we also align and crop higher-resolution images on the CASIA NIR-VIS dataset for high-resolution face synthesis. In the training stage, a patch is cropped from the cheek region of the VIS image and then resized to serve as the prior texture. In the testing stage, the prior textures are given randomly. We use the method of [1] to calculate the UV map, which can adequately represent facial shapes, as shown in Fig. 2. LightCNN-29 [30], pre-trained on the MS-Celeb-1M dataset [8], is employed as the feature extractor. It is also used as the basic recognition network. Adam is used as the optimizer, and the learning rate is fixed to 2e-4. The trade-off parameters in Eq. 5 and Eq. 8 are all set to balance the magnitudes of the individual loss terms, with the last parameter in Eq. 8 set to 1e-5.

Method         Rank-1    VR@FAR=1%    VR@FAR=0.1%
LightCNN-29    96.84     99.10        94.68
Pixel2Pixel    22.13     39.22        14.45
CycleGAN       87.23     93.92        79.41
Ours           99.00     99.61        98.51
Table 1: Comparisons on the first fold of the CASIA NIR-VIS dataset.

Method         Rank-1         VR@FAR=1%      VR@FAR=0.1%
VGG            62.1 ± 1.88    71.0 ± 1.25    39.7 ± 2.85
TRIVET         95.7 ± 0.52    98.1 ± 0.31    91.0 ± 1.26
LightCNN-29    96.7 ± 0.23    98.5 ± 0.64    94.8 ± 0.43
IDR            97.3 ± 0.43    98.9 ± 0.29    95.7 ± 0.73
ADFL           98.2 ± 0.34    99.1 ± 0.15    97.2 ± 0.48
PCFH           98.8 ± 0.26    99.6 ± 0.08    97.7 ± 0.26
Ours           98.9 ± 0.19    99.6 ± 0.10    98.3 ± 0.21
Table 2: Comparisons on the 10 folds of the CASIA NIR-VIS dataset (mean ± standard deviation).

Method         Rank-1    VR@FAR=1%    VR@FAR=0.1%
KDSR           83.0      86.8         69.5
TRIVET         93.9      93.0         80.9
IDR            94.3      93.4         84.7
ADFL           95.2      95.3         88.0
LightCNN-29    93.5      95.4         86.7
PCFH           98.4      97.9         92.4
Ours           98.6      98.0         93.5
Table 3: Comparisons on the BUAA-VisNir face dataset.

Method         Rank-1    VR@FAR=1%    VR@FAR=0.1%
KDSR           66.9      56.1         31.9
TRIVET         92.2      67.9         33.6
IDR            94.3      73.4         46.2
ADFL           95.5      83.0         60.7
LightCNN-29    96.9      93.7         79.4
PCFH           100       97.7         86.6
Ours           100       97.9         88.2
Table 4: Comparisons on the Oulu-CASIA NIR-VIS dataset.


Results on the CASIA NIR-VIS dataset.

We first compare the qualitative results of our method with other generation methods, including Pixel2Pixel [15], CycleGAN [34], ADFL [28] and PCFH [33], on the first fold of the CASIA NIR-VIS dataset. The visual comparisons are presented in Fig. 5. All generated images of the compared methods are taken from [33].

For Pixel2Pixel and CycleGAN, the pose and expression of the generated VIS images are not fully consistent with the input NIR images. For example, the mouth shape of the third VIS image synthesized by CycleGAN differs from that of the input NIR image, and the face in the first VIS image synthesized by Pixel2Pixel is smaller than in the input NIR image. We argue that the reason behind these phenomena is the unaligned paired training data with large pose and expression variations. Such data make it hard to obtain direct pixel-to-pixel supervision, for Pixel2Pixel and CycleGAN alike. As a result, the generated VIS images are unsatisfactory. ADFL is mainly based on CycleGAN and therefore shows similar visual problems. PCFH uses a complex attention warping scheme to alleviate the misalignment problem and thus gets better results. However, its generated images still differ considerably from the real ones, mainly in complexion. Our method produces visibly better results than the other generation methods: the generated VIS images not only maintain the facial pose and expression of the input NIR images, but also have more realistic textures. The pose and expression consistency of our method is due to the proposed face alignment stage, which creates aligned paired training data. As shown in Fig. 3, the aligned NIR images have the same pose and expression as the VIS ones. Thanks to the texture prior synthesis stage, the images generated by our method are more realistic than those of other methods, which is further studied in the ablation study.

In Table 1, following [33], we report the quantitative comparison with Pixel2Pixel, CycleGAN and the baseline LightCNN-29 on the first fold of the CASIA NIR-VIS dataset. Our method performs better than the baseline LightCNN-29: Rank-1, VR@FAR=1% and VR@FAR=0.1% are improved from 96.84 to 99.00, from 99.10 to 99.61 and from 94.68 to 98.51, respectively. This shows that our method can indeed improve recognition performance by translating NIR images to VIS images, a translation that is expected to reduce the domain gap between NIR and VIS images. In contrast, other generation methods such as Pixel2Pixel and CycleGAN result in worse recognition performance than the baseline LightCNN-29. This degradation may be caused by their poor image quality, as shown in Fig. 5.

In addition, we conduct experiments on more folds of the CASIA NIR-VIS dataset; the results are tabulated in Table 2. Besides LightCNN-29, the compared methods include VGG [25], TRIVET [24], IDR [9], ADFL [28] and PCFH [33]. Our method matches or exceeds the best results on Rank-1, VR@FAR=1% and VR@FAR=0.1%. In particular, VR@FAR=0.1% is improved from 97.7, achieved by the previous state of the art (PCFH), to 98.3.

Results on the BUAA-VisNir face dataset.

We further compare our proposed method with LightCNN-29, KDSR [12], VGG, TRIVET, IDR, ADFL and PCFH on the BUAA-VisNir face dataset. The results are shown in Table 3. Among them, LightCNN-29 and PCFH are two crucial benchmarks. The results of LightCNN-29 are obtained by matching the plain NIR and VIS images in the dataset, so the fact that our approach outperforms it demonstrates the value of synthesizing photo-realistic VIS images from NIR images. PCFH, in turn, improved the metrics significantly compared with previous methods. Our method obtains the best recognition performance on the BUAA-VisNir face dataset. Compared with the state-of-the-art method PCFH, it gains 1.1 points on VR@FAR=0.1% (from 92.4 to 93.5). This improvement over PCFH is promising, since it suggests the importance of realistic textures.

Figure 6: The synthesis results of our method and its two variants on the CASIA NIR-VIS dataset.

Results on the Oulu-CASIA NIR-VIS dataset.

The comparison results with KDSR, TRIVET, IDR, ADFL, LightCNN-29 and PCFH on the Oulu-CASIA NIR-VIS dataset are tabulated in Table 4. Our method matches or outperforms the state-of-the-art PCFH on every metric. For instance, VR@FAR=0.1% is improved from 86.6 to 88.2. Since PCFH already achieves very strong results in Rank-1 and VR@FAR=1%, the remaining headroom on these two metrics is small. The experimental results demonstrate that our method reduces the domain discrepancy between NIR and VIS images more effectively.

Ablation Study

PACH disentangles the pose factor and the spectrum information in cross-spectral hallucination and accomplishes the transformation from NIR images to VIS images in two stages. The first stage contains a UFA network that adjusts the face pose in NIR images and generates aligned data for the second stage. The second stage contains a TPS network that focuses on the NIR-VIS transformation and integrates texture priors. In this subsection, we study the roles of the two main components (UFA and TPS) of our method and report both qualitative and quantitative results.

Fig. 6 presents the visual comparisons between our method and its two variants. Our full method has the best visual results. Without pose alignment (UFA), the generated images are blurred, especially around the facial edges. For example, the cheek of the first generated VIS image is not consistent with the input NIR image. This may be caused by the unaligned paired data, which leads to a lack of pixel-to-pixel supervision in the training process. Without the prior texture (TPS), the generated images look unrealistic. The complexions of subjects in the CASIA NIR-VIS dataset vary widely, which poses challenges for image translation. Our prior texture provides a complexion simulation mechanism in the training process, which helps to synthesize realistic facial texture. Moreover, Fig. 4 shows the generated results under different prior textures. The complexions of the generated results change with the color of the prior texture, which demonstrates the controllability of the complexion.

Method     Rank-1    VR@FAR=1%    VR@FAR=0.1%
w/o UFA    35.76     43.53        21.36
w/o TPS    86.56     90.64        81.67
Ours       99.00     99.61        98.51
Table 5: Ablation study on the CASIA NIR-VIS dataset.

Table 5 tabulates the quantitative recognition results of the different variants. The recognition performance decreases if either component is removed, suggesting that each component of our method is useful. In particular, the Rank-1 accuracy drops significantly, from 99.00 to 35.76, when the pose alignment module is removed. This demonstrates the essential role of pose alignment for effective image translation.


To deal with the misalignment problem in cross-spectral Heterogeneous Face Recognition (HFR), this paper proposed to disentangle the pose factor and the spectrum information and settle them in individual stages. The first stage focused on the pose factor. We designed an Unsupervised Face Alignment (UFA) network to make the pose in a near-infrared (NIR) image similar to the pose in the corresponding visible (VIS) image, which gave us aligned data to train a superior generator for transforming NIR images to VIS images. The second stage was in charge of this transformation. To improve the realism of the generated results, we developed a Texture Prior Synthesis (TPS) network that produces VIS images with different complexions, which was shown to improve cross-spectral HFR performance. We conducted experiments on three NIR-VIS databases and achieved state-of-the-art results in both visual quality and quantitative comparisons.