A NIR-to-VIS face recognition via part adaptive and relation attention module

02/01/2021 ∙ by Rushuang Xu, et al. ∙ Yonsei University

In the face recognition application scenario, we need to process facial images captured under various conditions, such as at night by near-infrared (NIR) surveillance cameras. The illumination difference between NIR and visible-light (VIS) images causes a domain gap between facial images, and variations in pose and emotion make facial matching even more difficult. Heterogeneous face recognition (HFR) suffers from this domain discrepancy, and many studies have focused on extracting domain-invariant features, such as facial part relational information. However, when pose variation occurs, the facial component positions change, and a different part relation is extracted. In this paper, we propose a part relation attention module that crops facial parts obtained through a semantic mask and performs relational modeling using each of these representative features. Furthermore, we suggest a component adaptive triplet loss function that uses adaptive weights for each part to reduce the intra-class discrepancy regardless of domain and pose. Finally, our method exhibits a performance improvement on the CASIA NIR-VIS 2.0 database and achieves superior results on the BUAA-VisNir database with large pose and emotion variations.




1 Introduction

Face recognition is a very common technique in our lives. It is not only widespread in daily applications, such as company access control systems, school attendance systems, and mobile device unlocking, but it also plays an important role in assisting public security authorities in handling cases, for example, by comparing suspect images captured by surveillance cameras. The near-infrared (NIR) cameras used in surveillance systems can capture more useful information at night or in low-light conditions than visible-light (VIS) cameras. The task of matching images between the NIR domain and the VIS domain is called heterogeneous face recognition (HFR). HFR faces a domain gap issue because NIR images lose considerable spectral information compared to VIS images. Fig. 1 shows that HFR databases face both the domain gap and variations in pose and emotion. A method that reduces all of these discrepancies is crucial.

Two categories of approaches to HFR have emerged in recent years. One is based on image synthesis methods [18, 17, 10, 2], in which images are translated from one domain (NIR) to the other (VIS) using an image synthesis network and matched in a unified domain. However, the quality of the generated images is greatly influenced by the number of training images, which affects recognition performance. Existing HFR databases suffer from insufficient data, which is detrimental to image synthesis methods. The other category is based on learning domain-invariant features. For example, He et al. [8] minimized the Wasserstein distance [1] between features from the two domains, and Liu et al. [16] used a triplet loss to reduce the domain gap. These methods did not specifically consider issues caused by variations in pose and emotion but simply reduced the overall distance between the two embedding features.

Figure 1: Pose and emotion variation examples of the HFR task from the (a) BUAA-VisNir [9], (b) CASIA NIR-VIS 2.0 [11], and (c) TUFTS [14] face databases.
Figure 2: Framework of the proposed model. It extracts part-representative vectors from the backbone and embeds them into the 512-dim final embedding vector with the PRAM. The component adaptive triplet loss and the softmax loss are calculated for training.

We propose a part relation attention module (PRAM) that extracts component relationships based on a facial semantic mask, regardless of the domain, to learn features robust to domain, pose, and emotion variations. The relationships of the facial components are important for representing domain-invariant identity information.

In addition, we suggest a component adaptive triplet loss function that accounts for pose and emotion variations by assigning adaptive weights according to the visible parts of the face. Compared with existing HFR methods, our approach uses relational information among the facial components separated by masking information. Furthermore, the network is trained by considering part features robust to pose, emotion, and domain.

In this paper, our main contributions are as follows:

  • Uncovering the relationships between facial components (eyes, mouth, and nose) and the entire face, where the proposed PRAM first separates these components and then extracts their relations;

  • Proposing a component adaptive triplet loss function based on a semantic mask that enables the network to select information more effectively, addressing the difficult issues caused by variations in pose and emotion;

2 Proposed Method

The overall framework is illustrated in Fig. 2. After cropping the facial image into four part inputs, the full-image and partial-image features are extracted and passed through the PRAM layer. For training, the softmax loss is calculated with the embedding feature obtained from the PRAM, and the component adaptive triplet loss is calculated with the part-representative features extracted from the backbone. The two terms are summed to form the total loss. At test time, the cosine similarity between the embedding features of each VIS gallery image and the NIR probe image is calculated for recognition.
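As a concrete illustration of the test-time matching step, the sketch below scores a probe embedding against a gallery with cosine similarity. It is a minimal pure-Python sketch; `rank1_match` and the toy embeddings are hypothetical names for illustration, not the paper's code.

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = <u, v> / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank1_match(probe, gallery):
    # gallery: identity -> embedding; returns the identity whose VIS
    # gallery embedding is most similar to the NIR probe embedding.
    return max(gallery, key=lambda ident: cosine_similarity(probe, gallery[ident]))
```

In practice the embeddings would be the 512-dim PRAM outputs; the Rank-1 protocol in Section 3 corresponds to checking whether `rank1_match` returns the probe's true identity.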

2.1 Part Relation Attention Module (PRAM)

In HFR, it is important to learn features that are irrelevant to the domain, such as relational information. Several face recognition studies have improved recognition performance by learning such relations. Chowdhury et al. [5] used a bilinear CNN [12] that multiplies the convolutional output feature maps from two-stream CNNs. Cho et al. [3] concatenated the feature vectors of the last feature map pair-wisely to extract the relations between two different facial parts. Both treated each vector of the feature map as representing a certain part of a facial component, such as the lips or nose. However, because the receptive field of a CNN is very large, each vector can cover almost the whole input image. Therefore, these spatially correlated feature vectors have difficulty representing each component separately. In addition, for images with large pose variations, the feature vectors at the same spatial location may not contain the same facial component. For example, Fig. 1 presents VIS images with a frontal view of the face whose corresponding NIR images show a side view; the parts captured at the same location in these images contain different components. Therefore, we design the module to capture the features of each facial part separately and model their relational information.
As presented in Fig. 2, the PRAM comprises four steps. First, a facial image is cropped into four parts, namely the left eye, right eye, nose, and mouth, according to the mask extracted in advance (see Section 3.1). Unlike landmark detection, the facial components of the NIR domain are well distinguished by the segmentation network, so we use a segmentation mask for part localization. The four partial images and the overall facial image (five images in total) are separately input into the backbone LightCNN-9 [19]. The first few layers are frozen during training, and only layers 8 and 9 are fine-tuned, where the five MFM FC layers [19] (in Fig. 2) are tuned independently. In this step, the representative features of each part are extracted precisely. The MFM FC layer is a special maxout operation that uses a competitive relationship to obtain generalizability, which benefits learning across different data distributions. In the second step, we extract the orderless pairwise combinations and arrange them in a fixed order to represent the relationship between two parts. In the third step, all combinations are input into a shared FC layer to guarantee that the network learns the same functional relationship between any two representative features. From this computation, the relationships between certain regions are obtained with a uniform standard. In the last step, we propose a learnable weight to capture the strength of each relation. The 512-dim final embedding vector is computed as the weighted sum of these relational features.
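The four steps above can be sketched in plain Python. This is an illustrative simplification under stated assumptions: `shared_fc` stands in for the shared FC layer, the learnable relation weights are passed in as plain scalars, and vectors are plain lists rather than tensors.

```python
from itertools import combinations

def pram_embed(part_vecs, shared_fc, weights):
    """Sketch of the PRAM relational steps (hypothetical simplification).

    part_vecs : list of part-representative vectors (full face + parts)
    shared_fc : function mapping a concatenated pair to a relational feature
    weights   : one learnable scalar per pair, the strength of that relation
    """
    # Step 2: orderless pairwise combinations, arranged in a fixed order.
    pairs = list(combinations(range(len(part_vecs)), 2))
    # Step 3: the shared FC is applied to every concatenated pair so the
    # same functional relationship is learned for all part pairs.
    relations = [shared_fc(part_vecs[i] + part_vecs[j]) for i, j in pairs]
    # Step 4: weighted sum of the relational features -> final embedding.
    dim = len(relations[0])
    return [sum(w * r[k] for w, r in zip(weights, relations)) for k in range(dim)]
```

With five inputs (full face plus four parts) there are ten pairs, so ten relation weights; in the actual model the output would be the 512-dim embedding.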

Figure 3: Component adaptive triplet loss structure. The loss of each component is multiplied by a weight obtained from the masking region.

2.2 Component Adaptive Triplet Loss Function

The triplet loss [16] was proposed to learn better-optimized embedding features in the latent space by pulling the features of the anchor and positive examples together and pushing the features of the anchor and negative examples apart. The positive example is an image with the same identity as the anchor, whereas the negative example has a different identity. To strictly narrow the intra-class distance between the different domains of the same person, we sample the anchor and positive from different domains and the negative from the same domain as the anchor. In addition, an adaptive weight is assigned to the loss of each part-representative vector obtained from the PRAM, considering the different deviations of each component feature caused by variations in pose and emotion.

As presented in Fig. 3, we use five loss terms, one for the original image and one for each of its four components. The database contains examples in which some facial parts are obscured by a large pose or emotion variation. To prevent such challenging examples from introducing bias into the network training, we propose a loss function that assigns weights adaptively to the five terms based on the matching region between the masks of two examples.


The extracted masks of the anchor and positive examples are denoted as $M_a$ and $M_p$. They are both binary images in which the background is set to 0 and the object region to 1. The adaptive weight $w$ is calculated using the intersection over union (IoU),

$$w = \frac{|M_a \cap M_p|}{|M_a \cup M_p|}, \qquad (1)$$

where the numerator is the area of the overlap and the denominator is the area of the union.
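Following the IoU description above, the adaptive weight can be sketched as below, assuming flattened binary masks; the helper name is ours, not the paper's.

```python
def iou_weight(mask_a, mask_p):
    """Adaptive weight from anchor/positive component masks (sketch).

    mask_a, mask_p: equal-length sequences of 0/1 (flattened binary masks).
    Returns |intersection| / |union|, the matching region between the masks.
    """
    inter = sum(1 for a, p in zip(mask_a, mask_p) if a and p)
    union = sum(1 for a, p in zip(mask_a, mask_p) if a or p)
    # If a component is invisible in both examples, its term gets zero weight.
    return inter / union if union else 0.0
```

A fully overlapping component yields weight 1.0, while a component visible in only one example yields 0.0, so obscured parts contribute little to the loss.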


As the loss value in Eq. 2, we use the triplet loss with a conditional margin proposed by [4],

$$L_{cat} = \sum_{i=1}^{5} w_i \, \ell_{tri}\!\left(f_a^i, f_p^i, f_n^i\right), \qquad (2)$$

where $\ell_{tri}$ denotes the conditional-margin triplet term, in which $\cos(\cdot,\cdot)$ represents the cosine similarity and the conditional margin is set to 0.55. Here, $f^i$ indicates the feature vector of each component, the subscripts $a$, $p$, and $n$ refer to the anchor, positive, and negative examples, respectively, and $w_i$ is the adaptive IoU weight defined above. The final loss function (Eq. 3) consists of the softmax classification loss with a scaling factor (following [15]) and the component adaptive triplet loss:

$$L = L_{softmax} + L_{cat}. \qquad (3)$$
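A minimal sketch of the weighted, per-component loss follows. As an assumption, the exact conditional-margin formulation of [4] is not reproduced; we substitute a plain hinge on cosine similarities with the stated margin of 0.55, which is a simplification.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def conditional_triplet(f_a, f_p, f_n, margin=0.55):
    # Hinge: penalize unless cos(anchor, positive) exceeds
    # cos(anchor, negative) by at least the margin (simplified form).
    return max(0.0, cosine_similarity(f_a, f_n) - cosine_similarity(f_a, f_p) + margin)

def component_adaptive_triplet(feats_a, feats_p, feats_n, weights):
    # Per-component triplet terms (full face + 4 parts), each scaled by
    # its mask-IoU weight, then summed.
    return sum(w * conditional_triplet(a, p, n)
               for w, a, p, n in zip(weights, feats_a, feats_p, feats_n))
```

Because obscured components receive near-zero IoU weights, their (unreliable) triplet terms are largely suppressed in the sum.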

Figure 4: Results of the semantic segmentation network BiSeNet [21] on the TUFTS face database in the (a) VIS and (b) NIR domains, used for extracting the mask in our proposed model.
Models                                     CASIA NIR-VIS 2.0 [11]                          BUAA-VisNir [9]
                                           Rank-1 Acc.(%)  VR@FAR=1%(%)  VR@FAR=0.1%(%)    Rank-1 Acc.(%)  VR@FAR=1%(%)
Fine-tuned                                 96.06           95.19         94.06             95.77           95.88
+ PRAM                                     97.55           95.87         95.06             96.55           95.44
+ PRAM + conditional triplet loss          98.21           97.07         96.33             98.88           97.00
+ PRAM + component adaptive triplet loss   98.53           98.00         97.49             99.44           98.44
Table 1: Proposed model results on the CASIA NIR-VIS 2.0 and BUAA-VisNir databases.

3 Experiments

We use the CASIA NIR-VIS 2.0 [11], BUAA-VisNir [9], and TUFTS [14] HFR databases for the experiments. The CASIA NIR-VIS 2.0 database consists of 725 identities, with 8749 images from 357 identities in the training set. The testing set contains 358 identities; the gallery set has one VIS image per identity, and the probe set has 6208 NIR images. The BUAA-VisNir database consists of 150 identities, with 900 images from 50 identities for the training set and 900 images from 100 identities for the testing probe set. For evaluation, the gallery set consists of one VIS image per person. For the TUFTS database, which has the largest pose variation, only a visualization experiment is conducted, since it has no evaluation protocol and comprises only 100 identities.

(a) BUAA-VisNir
(b) CASIA 2.0
Figure 5: Samples successfully recognized by our proposed model but not by the baseline. In each subfigure, the left side is the VIS image and the right side is the NIR image of the same identity.

3.1 Implementations

The masks of facial components are extracted using the real-time semantic segmentation network BiSeNet [21], which is pretrained on the CelebAMask-HQ database (Fig. 4). We crop all images to 144x144 and randomly crop them to 128x128 during training. The partial images are cropped around the center point of the rectangular bounding box that contains each mask region. We use LightCNN-9, pretrained on the MS-Celeb-1M database [6], as the baseline. We use the stochastic gradient descent optimizer with a learning rate of and a weight decay of . The batch size was set to 16.
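The part-cropping step described above could be sketched as follows. This is a hypothetical helper; the window size and mask format are illustrative, not the paper's actual preprocessing code.

```python
def part_crop_box(mask, crop=32):
    """Hypothetical helper: square crop window centered on a component mask.

    mask: 2-D list of 0/1 for one facial component.
    Returns (top, left, bottom, right) of a crop x crop window centered on
    the center point of the rectangular bounding box containing the mask.
    """
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    # Center of the bounding box that contains the mask region.
    cy = (rows[0] + rows[-1]) // 2
    cx = (cols[0] + cols[-1]) // 2
    half = crop // 2
    return (cy - half, cx - half, cy + half, cx + half)
```

In a full pipeline this would be run once per component mask (left eye, right eye, nose, mouth), with clamping at image borders, to produce the four partial inputs.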

3.2 Ablation Studies and Analysis

In Table 1, the Rank-1 accuracies of the baseline on the CASIA NIR-VIS 2.0 and BUAA-VisNir databases are 96.06% and 95.77%, respectively. The performance is significantly improved, by 1.49% and 0.78%, with the proposed PRAM on the two databases. After training with our proposed component adaptive triplet loss, the Rank-1 accuracy boosts to 98.53% and 99.44%. Both results are better than those obtained using the conditional triplet loss.

As displayed in Fig. 5, we visualize some samples that are successfully recognized by our proposed model but not by the baseline. Under the domain discrepancy and the pose and emotion variations, the baseline fails to recognize the identities in Fig. 5. These samples reveal that our model effectively recognizes images with large variations in emotion and pose.

3.3 Comparison with Deep Learning Methods

We compare our method with other deep learning methods, including TRIVET [13], IDR [7], ADFL [17], CDL [20], WCNN [8], RM [3], and RGM [4]. In Table 2, our PRAM with the addition of the conditional triplet loss performs better than RM, which concatenates the feature vectors pair-wisely. In Table 3, our approach exhibits the best performance on the BUAA-VisNir database, which has a large variance in emotion and pose. Our method is slightly lower than WCNN on the CASIA NIR-VIS 2.0 database but still demonstrates competitive performance, and it is 2.04% and 4.24% higher than WCNN and ADFL, respectively, on the BUAA-VisNir database.

Models        CASIA NIR-VIS 2.0 [11]
              Rank-1 Acc.(%)  VR@FAR=0.1%(%)
TRIVET [13]   95.70           78.00
IDR [7]       97.33           95.73
ADFL [17]     98.15           97.21
CDL [20]      98.62           98.32
WCNN [8]      98.70           98.40
RM [3]        94.73           94.31
RGM [4]       97.20           95.79
Ours          98.53           97.49
Table 2: Comparison with other methods on the CASIA NIR-VIS 2.0 database.
Models        BUAA-VisNir [9]
              Rank-1 Acc.(%)  VR@FAR=1%(%)
TRIVET [13]   93.90           80.90
IDR [7]       94.30           84.70
ADFL [17]     95.20           95.30
CDL [20]      96.90           95.90
WCNN [8]      97.40           96.00
RGM [4]       97.56           98.10
Ours          99.44           98.44
Table 3: Comparison with other methods on the BUAA-VisNir database.

4 Conclusion

In this paper, we propose a model that employs a facial semantic segmentation mask for locating and cropping facial components and learns domain-invariant features between facial parts through a relational attention structure, the PRAM. Furthermore, a component adaptive triplet loss function enables efficient learning under large discrepancies in facial parts. We obtained satisfactory performance on HFR databases.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1.
  • [2] H. B. Bae, T. Jeon, Y. Lee, S. Jang, and S. Lee (2020) Non-visual to visual translation for cross-domain face recognition. IEEE Access 8, pp. 50452–50464. Cited by: §1.
  • [3] M. Cho, T. Chung, T. Kim, and S. Lee (2019) NIR-to-vis face recognition via embedding relations and coordinates of the pairwise features. In 2019 International Conference on Biometrics (ICB), pp. 1–8. Cited by: §2.1, §3.3, Table 2.
  • [4] M. Cho, T. Kim, I. Kim, K. Lee, and S. Lee (2020) Relational deep feature learning for heterogeneous face recognition. IEEE Transactions on Information Forensics and Security 16, pp. 376–388. Cited by: §2.2, §3.3, Table 2, Table 3.
  • [5] A. R. Chowdhury, T. Lin, S. Maji, and E. Learned-Miller (2016) One-to-many face recognition with bilinear cnns. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Cited by: §2.1.
  • [6] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European conference on computer vision, pp. 87–102. Cited by: §3.1.
  • [7] R. He, X. Wu, Z. Sun, and T. Tan (2017) Learning invariant deep representation for nir-vis face recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §3.3, Table 2, Table 3.
  • [8] R. He, X. Wu, Z. Sun, and T. Tan (2018) Wasserstein cnn: learning invariant features for nir-vis face recognition. IEEE transactions on pattern analysis and machine intelligence 41 (7), pp. 1761–1773. Cited by: §1, §3.3, Table 2, Table 3.
  • [9] D. Huang, J. Sun, and Y. Wang (2012) The buaa-visnir face database instructions. School Comput. Sci. Eng., Beihang Univ., Beijing, China, Tech. Rep. IRIP-TR-12-FR-001. Cited by: A NIR-to-VIS face recognition via part adaptive and relation attention module, Figure 1, Table 1, Table 3, §3.
  • [10] J. Lezama, Q. Qiu, and G. Sapiro (2017) Not afraid of the dark: nir-vis face recognition via cross-spectral hallucination and low-rank embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6628–6637. Cited by: §1.
  • [11] S. Li, D. Yi, Z. Lei, and S. Liao (2013) The casia nir-vis 2.0 face database. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 348–353. Cited by: A NIR-to-VIS face recognition via part adaptive and relation attention module, Figure 1, Table 1, Table 2, §3.
  • [12] T. Lin, A. RoyChowdhury, and S. Maji (2015) Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision, pp. 1449–1457. Cited by: §2.1.
  • [13] X. Liu, L. Song, X. Wu, and T. Tan (2016) Transferring deep representation for nir-vis heterogeneous face recognition. In 2016 International Conference on Biometrics (ICB), pp. 1–8. Cited by: §3.3, Table 2, Table 3.
  • [14] K. Panetta, Q. Wan, S. Agaian, S. Rajeev, S. Kamath, R. Rajendran, S. Rao, A. Kaszowska, H. Taylor, A. Samani, et al. (2018) A comprehensive database for benchmarking imaging systems. IEEE transactions on pattern analysis and machine intelligence. Cited by: Figure 1, §3.
  • [15] R. Ranjan, C. D. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507. Cited by: §2.2.
  • [16] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2.2.
  • [17] L. Song, M. Zhang, X. Wu, and R. He (2018) Adversarial discriminative heterogeneous face recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §3.3, Table 2, Table 3.
  • [18] F. Wu, W. You, J. S. Smith, W. Lu, and B. Zhang (2019) Image-image translation to enhance near infrared face recognition. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 3442–3446. Cited by: §1.
  • [19] X. Wu, R. He, Z. Sun, and T. Tan (2018) A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2884–2896. Cited by: §2.1.
  • [20] X. Wu, L. Song, R. He, and T. Tan (2018) Coupled deep learning for heterogeneous face recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §3.3, Table 2, Table 3.
  • [21] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Bisenet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 325–341. Cited by: Figure 4, §3.1.