Face recognition, a task that aims to match facial images of the same person, has developed rapidly with the advent of deep learning. The features extracted through the multiple hidden layers of deep convolutional neural networks (DCNNs) contain representative information that is used to distinguish an individual. However, when recognizing a face via representative features, variations such as pose, illumination, or facial expression create difficulties [40, 51, 58]. Unlike general face recognition within the visible spectrum, Heterogeneous Face Recognition (HFR) aims to match faces across different domains such as VIS (visible light), NIR (near-infrared), or the sketch domain [12, 3]. Face recognition over different domains is important, since NIR images acquired with infrared cameras contain more useful information when visible light is lacking, while sketch-to-photo matching is important in law enforcement for rapidly identifying suspects. As such, HFR has practical applications in biometric security control and in surveillance cameras under low-light scenarios.
HFR has several challenging issues, the biggest of which is the large gap between data domains. When HFR is performed with a general face recognition network trained on VIS face images, accuracy is significantly reduced, because the distributions of VIS and non-VIS data differ greatly. Therefore, we need to reduce the domain gap through either learning domain-invariant features or common space projection methods. Another issue is the lack of HFR databases. Deep learning-based face recognition networks are usually trained with large-scale visual databases such as MS-Celeb-1M, which consists of 10 million images with 85 thousand identities, or MegaFace, with 4.7 million images. By comparison, the typical HFR database has a small number of images and subjects, which causes overfitting in a deep network and makes learning a general feature difficult. Therefore, most HFR networks are fine-tuned from a feature extractor pre-trained on a large visual database.
To solve these problems, several works [3, 43, 59] use image synthesis methods to transform input images from the non-VIS domain to the VIS domain and recognize faces in the same domain feature space. Although this approach may appear to create a similar domain through the data transform, it is difficult to generate good-quality transformed images with a small amount of data; this greatly impacts performance, and the approach does not reduce the gap between domain properties. Other studies [13, 55, 27] train the network to learn NIR-VIS-invariant features by using the Wasserstein distance, variational formulation, or a triplet loss function. These domain-invariant approaches force the network to reduce the domain gap implicitly, which makes learning and designing a network challenging. Therefore, we propose a graph-structured module that explicitly reduces the fundamental differences in heterogeneous domain characteristics and focuses on feature relations rather than domain-invariant features themselves.
For many computer vision tasks, the relational information within the image or video is important, in the same way that human visual processing can easily perform recognition by capturing relations. Since such relations are high-level and independent of information on texture, scale, and so on, relational information is suitable for reducing the gap between domains in HFR. With our proposed Relational Graph Module (RGM), each component of the face is embedded as a node vector and edges are computed by modeling the relationships among nodes. Through graph propagation with the generated nodes and edges, we create relational node vectors containing the overall relationships among nodes, and we perform node-wise recalibration using the correlation information for these nodes with the Node Attention Unit (NAU). Also, we suggest a conditional-margin loss function (-Softmax) to learn with an efficient margin between classes when data from two different domains is projected into one latent space. We demonstrate experimentally that this loss function is effective not only for HFR but also for large-scale visual face tasks such as LFW.
In this paper, our main contributions are as follows:
We propose a graph-structured module, RGM, to reduce the fundamental domain gap by modeling face components as node vectors and relational information as edges. We also perform a recalibration by considering global node correlation via our NAU.
In order to project features from different domains into common latent space efficiently, we suggest a conditional-margin loss function -Softmax that uses the inter-class margin adaptively. In addition to accomplishing the HFR task, this demonstrates a performance improvement for large-scale visual face databases with large numbers of classes and large amounts of intra-class variation.
The proposed module can overcome the limited size of available HFR databases by plugging into a general feature extractor; we experimentally demonstrate superior performance for three different networks and three HFR databases.
The organization of the paper is as follows. We introduce three different approaches to HFR and relation capturing in Section II. Then, in Section III, we begin by presenting a preliminary version of this work, our Relation Module (RM), and explain our proposed RGM, NAU, and -Softmax approaches. Next, the experimental results for our proposed and other methods and related discussions are provided in Section IV. Finally, in Section V, we present our conclusions.
II Related Works
As stated in Section I, the challenge of the HFR task is to match identities using conventional face recognition networks despite domain differences such as texture or style. The examples in Figure 1 of NIR-to-VIS and Sketch-to-Photo databases illustrate the gaps between the domains, which can depend on variations in illumination or on the artist’s sketch style. Methods for reducing domain discrepancy are therefore being studied, and these can be broadly divided into projection to common space based methods, image synthesis based methods, and domain-invariant feature based methods. This section summarizes preceding HFR studies and then introduces methods of capturing the relational information within the image that can reduce the fundamental domain difference.
II-A Heterogeneous Face Recognition
II-A1 Projection to Common Space Based Method
Projection based methods involve learning to project features from two different domains into a discriminative common latent space where images with the same identity are close regardless of their domain. Lin and Tang proposed a Common Discriminant Feature Extraction (CDFE) algorithm, in which features from two different domains are simultaneously projected into a learned common space to solve the inter-modality problem. With empirical separability and a local consistency regularization objective function, the model learns a compact intra-class space and avoids overfitting. Yi and Liao suggested matching each partial patch of face images by extracting points, edges, or contours that are similar between domains. Lei et al. designed a Coupled Spectral Regression (CSR) method for finding different projective vectors by representing relationships between images and their embeddings. In contrast to this coupled method, which learns each domain's representative vectors separately, Lei et al. proposed a technique for learning the projection from both domains. Since target data neighbors should correspond to source data neighbors, Shao et al. matched projected target and source data by using a reconstruction coefficient matrix, Z, in the common space.
With deep neural networks (DNNs) showing great improvement in face recognition performance, Sarfraz and Stiefelhagen used a deep perceptual mapping (DPM) method in which DNNs learn the projection of visual and thermal features together. Reale et al. used coupled NIR and VIS CNNs, initializing them with a pre-trained face recognition network to extract global features. Wu et al. proposed a Coupled Deep Learning (CDL) method with relevance constraints as a regularizer and a cross-modal ranking objective function. However, these methods are difficult to train because they require extraction of domain-specific features from a small database.
II-A2 Image Synthesis Based Method
Image synthesis based methods transform face images from one domain into the other so as to perform recognition in the same modality. Liu et al. proposed a pseudo-sketch synthesis method which divides a photo image into a fixed number of patches and reconstructs each patch as a corresponding sketch patch. This patch-based strategy preserves local geometry while transforming the photo image into a sketch-style image. Wang et al. proposed a multi-scale Markov network, conducting belief propagation to transform multi-scale patches. Recently, with the widespread development of generative adversarial networks (GANs), many studies have focused on generating visual face images from non-visual ones. Song et al. transformed NIR face images to VIS face images by pixel space adversarial learning using CycleGAN and feature space adversarial learning with a variance discrepancy loss function.
These methods of transforming an image from one domain to another can be effective for visually similar domains, but do not fundamentally address the modality discrepancy that the data exhibits. In addition, because HFR data is scarce and unpaired, GAN-based methods struggle to create good-quality images, which affects performance.
II-A3 Domain-invariant Feature Based Method
Another approach is to use a feature extractor to reduce domain discrepancies and enable learning of domain-invariant features. Since the NIR-to-VIS face recognition task is heavily influenced by the light source in each image, Liu et al. used differential-based band-pass image filters, relying only on variation patterns of skin properties. Liu et al. also proposed a TRIVET loss function which applies Triplet loss to cross-domain sampling to reduce the domain gap. The Wasserstein CNN is divided into an NIR-VIS shared layer and a specific layer. The shared layer is designed to learn domain-invariant features by minimizing the Wasserstein distance between data from different domains. He et al. also used a division approach, using two orthogonal spaces to represent invariant identity and light source information.
Several studies used relational representations that allow heterogeneous data to be projected into a common space. Klare and Jain proposed a random prototype subspace framework to define prototype representations and learn a subspace projection matrix with kernel similarities among face patches. With their G-HFR method, Peng et al. employed Markov networks to extract graphical representations. This method finds the nearest patches of a patch in a probe or gallery image from the representation dataset and linearly combines them to obtain graphical representations. Since the performance of this approach relies heavily on the number of nearest patches selected, Peng et al. proposed an adaptive sparse graphical representation method which considers all possible numbers of related image patches. These methods found relations between representation dataset image patches in randomly selected pairs. Unlike these methods, our proposed RGM extracts domain-invariant features by considering global spatial pair-wise relations. We are the first to apply a deep learning based relation approach to the HFR task with a graph-structured module.
II-B Relation Capturing
In many computer vision tasks such as image classification and video recognition, it is important to understand the relationships within images or videos. However, simply stacking multiple neural network layers often fails to identify long-range relationships as human visual systems do. With CNNs bringing significant improvements in computer vision, many studies are underway to extract relational information using the local connectivity and multi-layer structure of CNNs.
Lin et al. captured local pair-wise feature interaction to solve the fine-grained categorization task, in which visual differences are small between classes and can easily be overwhelmed by other factors such as viewpoint or pose. The features from two CNN streams, termed Bilinear CNN or B-CNN, are multiplied using an outer product to capture partial feature correlation. Since the face recognition task can be seen as a subarea of fine-grained recognition, Chowdhury et al. applied this bilinear model to face recognition tasks with a symmetric B-CNN (see Figure 2(a)). Chen et al. captured relational information with a double attention block consisting of bilinear pooling attention and feature distribution attention (Figure 2(b)). The Non-local block was proposed to compute a weighted sum of all features at each position, showing strong performance in video recognition tasks (Figure 2(c)). To solve the Visual Q&A task, which requires high-level relational reasoning, Santoro et al. introduced a Relation Network that captures all potential relations for object pairs (Figure 2(d)).
Recently, graph-based methods have proven effective in relation capturing. While traditional graph analysis has usually relied on hand-crafted features, graph neural networks (GNNs) can learn node and edge updates by propagating information through each layer's weights. Kipf and Welling proposed a spectral method using a Graph Convolutional Network (GCN), which takes graph-structured data as input and uses multiple hidden layers to learn graph structures and node features. Wang and Gupta applied GCNs in action recognition to understand appearance and temporal functional relationships, while Chen et al. proposed a GloRe module that projects coordinate space features into an interaction space to extract relation-aware features, boosting the performance of 2D and 3D CNNs on semantic segmentation, image recognition, and so on. As such, graph-structured networks are effective for computer vision tasks where relational information is important. In particular, since the HFR task involves small differences between classes and large within-class discrepancies, relational information for faces plays an important role in representing each identity. Compared to attentional modules [23, 4, 49], graph-based modules better capture relations, thereby reducing the fundamental domain gap in HFR. We also compare existing attentional module-based approaches to our graph-based module in Section IV-A4.
III Proposed Method
In this section, we introduce our Relational Graph Module (RGM), designed to model relationships of face components as domain-invariant features. We first present the preliminary version of this work, the Relation Module (RM), and an overview of the RGM framework. We then describe our Node Attention Unit, which helps to focus on global node correlation. Last, we introduce -Softmax, our loss function with conditional margin. RGM is an add-on module which can plug into any pre-trained face feature extractor. We experimentally quantify the performance improvement RGM yields in three different networks and over three different heterogeneous databases in Section IV.
III-A Relation Module
When an NIR or VIS face image is input to a single face recognition network pre-trained with large-scale visual face images, the network cannot perform well because of the domain discrepancy. In addition, HFR databases are mostly unpaired and consist of much smaller numbers of images than large-scale databases such as MS-Celeb-1M and CASIA-WebFace, so it is difficult to fine-tune the pre-trained deep networks. To solve this problem, the RM concentrates on relationships between pairs of face components that do not depend on domain information (see Figure 2(d)).
Here, is the relation extracting function with shared learnable weight and is the input feature vector. The RM is plugged in at the end of the convolutional layers and takes as input the feature map output by the last convolutional layer. This feature map’s channel-wise vector represents the face component, depending on the CNN’s local connectivity characteristics. From these feature vectors, each of which represents one component of the face, the RM extracts the relationships of all feature vector pairs. Since this pair-wise relationship is independent of ordering, the total number of combinations is and an -dimensional relation vector is extracted from each pair. This relation vector represents the relationship between two parts of the face, such as the lips-to-nose or eye-to-eye relationship within a face. These computed relation vectors are reshaped and embedded into one embedding vector with a fully connected layer.
This process does not need to define actual relationships explicitly; it simply looks at all combinations of patches and infers the relationships implicitly. Simply adding the RM can reduce intra-class variation and enlarge the inter-class space by using domain-invariant relational information.
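As an illustration, the pair-wise extraction described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in our experiments: the function name `relation_module`, the weight names `w_pair` and `w_embed`, the ReLU non-linearity, and all dimensions are hypothetical. Only the overall flow (enumerate unordered pairs, apply one shared layer to each concatenated pair, flatten, embed with a fully connected layer) follows the text.

```python
import itertools
import numpy as np

def relation_module(feature_map, w_pair, w_embed):
    """feature_map: (C, H, W) output of the last conv layer. Returns a (D,) embedding."""
    c, h, w = feature_map.shape
    # Each spatial position gives one C-dimensional face-component vector.
    components = feature_map.reshape(c, h * w).T            # (N, C)
    n = components.shape[0]
    # Order does not matter, so only the N * (N - 1) / 2 unordered pairs are used.
    pairs = list(itertools.combinations(range(n), 2))
    left = components[[i for i, _ in pairs]]
    right = components[[j for _, j in pairs]]
    pair_inputs = np.concatenate([left, right], axis=1)     # (P, 2C)
    # Shared weights: the same matrix is applied to every pair (ReLU is an assumption).
    relations = np.maximum(pair_inputs @ w_pair, 0.0)        # (P, R)
    # Flatten all relation vectors and embed them with one fully connected layer.
    return relations.reshape(-1) @ w_embed                   # (D,)

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
C, H, W, R, D = 8, 3, 3, 4, 16
fmap = rng.standard_normal((C, H, W))
n_pairs = (H * W) * (H * W - 1) // 2
w_pair = rng.standard_normal((2 * C, R)) * 0.1
w_embed = rng.standard_normal((n_pairs * R, D)) * 0.1
embedding = relation_module(fmap, w_pair, w_embed)
```

Note that the input size of the final layer grows with the number of pairs, that is, quadratically in the number of components.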
III-B Relational Graph Module
As mentioned, the HFR task suffers from a problem of insufficient data and the difficulty of extracting features that reduce the domain gap. Since we confirmed with the RM that relational information in face images contains domain-invariant information, we propose our RGM for more efficient facial relationship modeling. Because RM considers every pair-wise combination and embeds all of them into -dimensional vector, it presents a computational complexity issue with an attendant overfitting risk when training on a small HFR database. Therefore, we propose a method of relation exploration through our graph-structured RGM, which consists of a node vector containing the face component information and edges that capture relationship information between node vectors.
III-B1 Node Embedding
Figure 3 shows the overall RGM framework. We first treat the spatial feature vectors extracted through the CNN as initial graph node vectors with dimension . Then we embed the node vectors into -dimensional vectors using a transform matrix . We experiment with the optimal value of this embedding dimension in Section IV-A2.
III-B2 Relation Propagation Based on Directed Relation Extraction
The feature vectors of the face image passing through the convolution layer represent each face component (e.g., eyes, lips, and chin). In the RM, the feature vectors are simply pair-wise concatenated to extract the relation through the shared fully connected layer. In RGM, after node embedding, we extract the directed edges of each node. Because the components that represent the face are the same for each class, we generate a fixed number of component nodes rather than selecting nodes (64 nodes are used in this paper). In Equation 2, the edge yielding the relationship between two node vectors and is obtained through the edge function .
Edge is a scalar value and is calculated as the weighted sum between node vector elements, where the weight is a parameter obtained through learning. The edge value of each node vector is obtained via the edge function and has a value in the range [0, 1] through the Sigmoid function. Then, as shown in Equation 3, each node vector propagates in inter-dependency with all other node vectors through the edges to become a propagated node vector . Each face component has different relations for each identity, and updating the nodes with these relations can reduce domain information and concentrate on component relational information.
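The edge computation and propagation step can be sketched as follows. This is a hedged illustration only: the bilinear form used for the weighted sum, the aggregation `edges @ nodes`, and the names `propagate` and `w_edge` are assumptions introduced here, and Equations 2 and 3 remain the authoritative definitions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate(nodes, w_edge):
    """nodes: (N, D) embedded node vectors; w_edge: (D, D) learned edge weights."""
    # One scalar score per ordered pair (i, j): a weighted sum over products of
    # node elements. The bilinear form is an assumption made for this sketch.
    scores = nodes @ w_edge @ nodes.T          # (N, N)
    # Sigmoid maps every edge independently into (0, 1).
    edges = sigmoid(scores)
    # Each propagated node aggregates all nodes, weighted by its own edges.
    propagated = edges @ nodes                 # (N, D)
    return edges, propagated

rng = np.random.default_rng(0)
nodes = rng.standard_normal((64, 16))          # 64 nodes, hypothetical dimension 16
w_edge = rng.standard_normal((16, 16)) * 0.1
edges, propagated = propagate(nodes, w_edge)
```

Because `w_edge` is not constrained to be symmetric, the edge from node i to node j generally differs from the edge from j to i, which is what makes the relations directed in this sketch.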
As a point of comparison, Graph Attention Networks (GAT) adopt a self-attention mechanism with a learnable parameter and update nodes within a fixed adjacency matrix, where is a LeakyReLU activation function and is a vector of learnable parameters. In the RGM, the adjacency matrix and are learned simultaneously; also, the RGM uses a Sigmoid activation function, which evaluates each value separately and thus allows relations to take independent values. Since the relation between two nodes is independent of the relations of other nodes, Sigmoid activation is more appropriate than Softmax, which treats all values as interrelated and normalizes them to sum to 1. We experiment with this activation issue in Section IV-E1.
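The difference between the two activations can be seen on a small, made-up score matrix (the values below are illustrative only and do not come from our experiments):

```python
import numpy as np

# Hypothetical raw edge scores for two nodes, each related to three others.
scores = np.array([[2.0, 2.0, -1.0],
                   [0.5, 3.0, 3.0]])

# Sigmoid: every relation is scored on its own.
sigmoid_edges = 1.0 / (1.0 + np.exp(-scores))

# Softmax: relations in a row compete, because each row must sum to 1.
shifted = scores - scores.max(axis=1, keepdims=True)
softmax_edges = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print("row sums under softmax:", softmax_edges.sum(axis=1))
print("two equally strong relations, sigmoid:", sigmoid_edges[0, :2])
print("two equally strong relations, softmax:", softmax_edges[0, :2])
```

With Softmax, two equally strong relations in the same row must share the probability mass, so neither can exceed 0.5; with Sigmoid, both can be close to 1 at the same time.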
III-B3 Node Re-Embedding
After propagation, we apply the NAU as an activation function (see Figure 3). The NAU serves as a node-adaptive activation function, as will be described in detail below. After we recalibrate nodes through the NAU, we use the weight matrix to re-embed the node vectors into the original input dimension and perform residual summation. After summation, we concatenate all node vectors and embed them through the fully connected layer into the final representative embedding vector used for comparison.
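Putting the steps of this subsection together, a forward pass through the module can be sketched as below. This is a simplified sketch under several assumptions: the names `rgm_forward`, `w_in`, `w_edge`, `w_out`, and `w_fc` are hypothetical, the bilinear edge score is assumed, and a plain `tanh` stands in for the NAU so that the example remains self-contained.

```python
import numpy as np

def rgm_forward(feature_map, w_in, w_edge, w_out, w_fc, activation=np.tanh):
    """feature_map: (C, H, W). Returns the final embedding vector."""
    c, h, w = feature_map.shape
    nodes = feature_map.reshape(c, h * w).T               # (N, C) initial node vectors
    embedded = nodes @ w_in                               # (N, D) node embedding
    scores = embedded @ w_edge @ embedded.T               # assumed bilinear edge score
    edges = 1.0 / (1.0 + np.exp(-scores))                 # sigmoid edges
    propagated = edges @ embedded                         # relation propagation
    activated = activation(propagated)                    # stand-in for the NAU
    restored = activated @ w_out                          # (N, C) re-embedding
    out_nodes = nodes + restored                          # residual summation
    return out_nodes.reshape(-1) @ w_fc                   # concatenate nodes, then FC

# Hypothetical sizes: an 8 x 8 feature map gives the 64 nodes used in this paper.
rng = np.random.default_rng(0)
C, H, W, D, E = 32, 8, 8, 16, 64
fmap = rng.standard_normal((C, H, W))
w_in = rng.standard_normal((C, D)) * 0.1
w_edge = rng.standard_normal((D, D)) * 0.1
w_out = rng.standard_normal((D, C)) * 0.1
w_fc = rng.standard_normal((H * W * C, E)) * 0.01
embedding = rgm_forward(fmap, w_in, w_edge, w_out, w_fc)
```

The residual summation means that when the re-embedding weights are zero, the module reduces to a fully connected layer on the original features, which is one reason an add-on module of this kind can be attached to a pre-trained extractor without destroying its features at initialization.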
III-C Node Attention Unit
In SENet, after the convolutional layers compute a spatial and channel-wise combination in the receptive field, the feature is recalibrated to boost the network’s representational power. Inspired by this approach, each node vector, which contains relational information after being updated through the graph, is recalibrated through the NAU by considering inter-node correlation.
In Equation 4, each node squeezes information through global average pooling to vector . The node-squeezed vector is then aggregated through weights to a vector that is scaled node-wise to each node vector, yielding a recalibration effect according to the global importance of nodes and focusing attention on the characteristic aspect of the identity. In contrast to SENet, which squeezes the spatial dimensions and recalibrates channels, we squeeze channels and recalibrate nodes. In contrast to the CBAM spatial attention block, since our nodes are not spatially correlated after propagation has forced each node to contain global relations, we do not perform convolution-based squeezing but instead squeeze point-wise.
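A minimal sketch of this squeeze-and-recalibrate step is given below. It follows the SENet pattern with the roles of channels and nodes swapped, as described above; the two-layer bottleneck, the ReLU and Sigmoid choices, and the names `node_attention_unit`, `w_down`, and `w_up` are assumptions made for illustration, and Equation 4 remains the reference definition.

```python
import numpy as np

def node_attention_unit(nodes, w_down, w_up):
    """nodes: (N, D). Returns the recalibrated nodes and one gate per node."""
    # Squeeze: average each node over its channels (point-wise, no convolution).
    squeezed = nodes.mean(axis=1)                        # (N,)
    # Excite: bottleneck over the node axis (reduction ratio r), then expand.
    hidden = np.maximum(squeezed @ w_down, 0.0)          # (N // r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))       # (N,), each in (0, 1)
    # Recalibrate: scale every node vector by its own gate.
    return nodes * gates[:, None], gates

rng = np.random.default_rng(0)
N, D, r = 64, 16, 2                                      # hypothetical D; r = 2
nodes = rng.standard_normal((N, D))
w_down = rng.standard_normal((N, N // r)) * 0.1
w_up = rng.standard_normal((N // r, N)) * 0.1
recalibrated, gates = node_attention_unit(nodes, w_down, w_up)
```

Each node receives a single scalar gate, so the relative values of the channels inside a node are preserved while the node as a whole is emphasized or suppressed.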
III-D Conditional Margin Loss: -Softmax
The node passing through the RGM becomes the -dimensional embedding vector through the fully connected layer. In the training phase, the embedding vector goes through the Softmax layer, and the network is optimized with the Cross Entropy loss function. When testing, the class is predicted by computing the cosine similarity between the embedding vectors of the gallery and the probe images. Equation 5 defines a Cross Entropy loss function in which denotes batch size, is the number of classes and is the embedding vector of the (-th class) training sample.
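The test-time matching described above can be sketched as follows. The function names and the toy two-dimensional embeddings are hypothetical and serve only to illustrate the procedure: each probe embedding is compared with every gallery embedding by cosine similarity, and the identity of the most similar gallery image is returned.

```python
import numpy as np

def cosine_similarity_matrix(probe, gallery):
    """probe: (P, D), gallery: (G, D). Returns a (P, G) cosine similarity matrix."""
    probe = probe / np.linalg.norm(probe, axis=1, keepdims=True)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return probe @ gallery.T

def predict_identity(probe, gallery, gallery_labels):
    sims = cosine_similarity_matrix(probe, gallery)
    # The predicted class is the label of the most similar gallery embedding.
    return gallery_labels[np.argmax(sims, axis=1)]

# Toy embeddings (illustrative values only).
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
gallery_labels = np.array([10, 20, 30])
probe = np.array([[0.9, 0.1], [-2.0, 0.3]])
predicted = predict_identity(probe, gallery, gallery_labels)
```

Because the vectors are normalized first, the scale of an embedding (for example, the second probe above) does not influence the match, only its direction does.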
In prior work, a triplet loss function with conditional margin was proposed that applies an adaptive margin between classes to reduce intra-class discrepancies. For this loss function, defined in Equation 6, an anchor is sampled while the positive and negative are sampled from different domains, and their cosine similarities (CS) are calculated. In this case, the positive and negative similarities and are the intra-class and inter-class similarity respectively, and the loss value is calculated from these ratios’ margin . This margin considers the distributions of and and can be written as Equation 7, which takes into account not only the intercept value but also the slope , meaning that every margin is computed adaptively.
Since this loss function utilizes triplet loss, the anchor and the positive and negative sampling play an important role in learning; therefore online sampling should be done within a mini-batch and semi-hard example learning is required. This increases the training time, in addition to which the HFR database has only a small number of images and identities, which makes sampling difficult.
To avoid sampling, we suggest using -Softmax as a loss function, since it applies the margin in the Softmax layer adaptively according to inter-class similarity. First, we normalize the fully connected layer and embedding vector ; normalized vectors are re-scaled to scale , following prior work. Per Equation 8, the product of these two vectors gives the angle between the two vectors, which defines cosine similarity. Therefore, the conditional margin in Equation 7 can be written as Equation 8 and the Cross Entropy loss function then transforms as Equation 9.
Here and indicate the slope and intercept values respectively, as in Equation 7. Figure 5 shows the margin according to each cosine similarity in two classes. Compared to CosFace and ArcFace, our proposed conditional margin is determined adaptively by considering the similarity between classes. When this similarity is small, sufficient inter-class space is guaranteed, so that we do not need to set a high margin. Conversely, when the similarity between classes is high, we have a hard class example, so the margin should be increased to give a stricter criterion. In this way, the margin can be adaptively determined according to the similarity value, which gives a hard sampling effect by concentrating on the hard sample (see Figure 5(c)). When we decrease , the margin at the large similarity region increases since the slope is gentle; when we increase , the margin at the small similarity region increases. To prevent a negative margin, we use a constraint such as . By contrast, CosFace gives a constant margin for cosine similarity using , and ArcFace uses a constant margin in the class angular domain with . When we convert the ArcFace class angular domain to , as shown in Figure 5(b), the margin varies depending on the similarity, but a large margin occurs only when the similarity is near the midpoint. In contrast, our margins are given conditionally so that heterogeneous data with large intra-class discrepancy can be more efficiently trained during common space learning. In addition, our -Softmax shows performance improvement on large-scale visual databases with many identities, as discussed in Section IV-D below.
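To make the idea of a similarity-dependent margin concrete, the following sketch contrasts it with a constant margin. It is an illustration under explicit assumptions and not the exact loss of Equations 8 and 9: the margin is taken to be linear in the cosine similarity to the closest wrong class, it is subtracted from the target cosine in the CosFace manner, and the slope and intercept values are hypothetical. Only the scale of 24, the linear slope-and-intercept form, and the non-negativity constraint come from the text.

```python
import numpy as np

def conditional_margin(similarity, slope, intercept):
    # Linear in the inter-class similarity, clamped so it is never negative.
    return np.maximum(slope * similarity + intercept, 0.0)

def margin_softmax_loss(cosines, labels, scale, slope, intercept):
    """cosines: (B, K) cosine between each embedding and each class weight."""
    rows = np.arange(cosines.shape[0])
    negatives = cosines.copy()
    negatives[rows, labels] = -np.inf
    hardest = negatives.max(axis=1)                 # closest wrong class (assumption)
    margins = conditional_margin(hardest, slope, intercept)
    logits = scale * cosines                        # new array; cosines is untouched
    logits[rows, labels] = scale * (cosines[rows, labels] - margins)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean(), margins

# Two samples of class 0: the first has a very similar wrong class, the second does not.
cosines = np.array([[0.8, 0.7, 0.1],
                    [0.8, 0.1, 0.0]])
labels = np.array([0, 0])
loss, margins = margin_softmax_loss(cosines, labels, scale=24.0, slope=0.5, intercept=0.1)
plain_loss, _ = margin_softmax_loss(cosines, labels, scale=24.0, slope=0.0, intercept=0.0)
```

In this sketch the first sample, whose closest wrong class is much more similar, receives the larger margin, which mirrors the hard-sample emphasis discussed above; setting the slope to zero recovers a constant margin.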
IV Experimental Results
In this section, we report and analyze the results of experimentally applying our proposed method to HFR databases. NIR-to-VIS and Sketch-to-Photo ablation studies were performed on three HFR databases, namely CASIA NIR-VIS 2.0, IIIT-D Sketch and BUAA-VisNir, and our approach was compared with other methods. In addition, our proposed RGM, an add-on module, is applied to three networks with different numbers of layers, showing improved performance in all cases. We also compare the RGM with other attentional modules and experimentally vary the node embedding vector dimension. Furthermore, we compare the performance of our -Softmax loss function with other angular margin losses and investigate its performance on the visual face databases CASIA-WebFace and LFW. Finally, we analyze and discuss the NAU’s activation function and visualize the relational information extracted by our proposed module.
Our three baseline networks are LightCNN-9, LightCNN-29, and ResNet18, consisting of 9, 29 and 18 convolutional layers respectively. These three baseline networks are pre-trained on MS-Celeb-1M, a large-scale visual face database. For fine-tuning, the pre-trained feature extractor is frozen and only the HFR database, comprising non-VIS and VIS faces, is used as training data. We use a batch size of 128 or 64, and the learning rate starts at 0.001 or 0.01 (for the IIIT-D Sketch dataset); to avoid overfitting, the dropout rate is set to 0.7 at the fully connected layer. The RGM is plugged in after the last convolutional layer and uses 64 () node vectors. In the NAU, the channel reduction ratio is 2; in -Softmax loss function , , and the normalized scale value = 24 are used.
IV-A CASIA NIR-VIS 2.0
The CASIA NIR-VIS 2.0 database, one of the largest HFR databases, is composed of NIR and VIS face images. It contains 725 subjects, imaged by VIS and NIR cameras in four recording sessions. There are between 1 and 22 VIS images and between 5 and 50 NIR images per identity. The images contain variations of resolution, lighting conditions, pose, and age as well as expression, eyeglasses, and distance, all of which make recognition more challenging. There are two protocols for performance evaluation in this database; we followed View 2, in which there are ten-fold sub-experiments, each with a training list, a gallery list, and a probe list for testing. The training subjects and the corresponding testing subjects do not overlap, and the numbers of subjects are virtually identical. For evaluation, the gallery set comprises one VIS image per subject while the probe set contains several NIR images per subject. The prediction score is computed from the similarity matrix over the whole gallery set, and the identification accuracy and verification rate are recorded. All parameters are fixed during the ten-fold sub-experiments; we crop each image to size and randomly crop to for LightCNN and for ResNet.
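For reference, the two reported metrics can be computed from a probe-by-gallery similarity matrix roughly as follows. This is a generic sketch of rank-1 identification accuracy and verification rate at a fixed false accept rate (VR@FAR), not the official evaluation script of the database; the function names, the thresholding convention, and the toy values are assumptions.

```python
import numpy as np

def rank1_accuracy(sims, probe_labels, gallery_labels):
    """Fraction of probes whose most similar gallery image has the same identity."""
    best = gallery_labels[np.argmax(sims, axis=1)]
    return float(np.mean(best == probe_labels))

def verification_rate(sims, probe_labels, gallery_labels, far):
    """Fraction of genuine pairs accepted at a threshold set by the impostor scores."""
    genuine_mask = probe_labels[:, None] == gallery_labels[None, :]
    genuine = sims[genuine_mask]
    impostor = np.sort(sims[~genuine_mask])[::-1]          # descending order
    # Allow at most floor(far * number of impostor pairs) false accepts.
    allowed = int(np.floor(far * impostor.size))
    threshold = impostor[allowed] if allowed < impostor.size else -np.inf
    return float(np.mean(genuine > threshold))

# Toy similarity matrix: 3 probes (rows) against 3 gallery images (columns).
sims = np.array([[0.9, 0.2, 0.1],
                 [0.3, 0.8, 0.4],
                 [0.5, 0.6, 0.2]])
probe_labels = np.array([0, 1, 2])
gallery_labels = np.array([0, 1, 2])
acc = rank1_accuracy(sims, probe_labels, gallery_labels)
vr_strict = verification_rate(sims, probe_labels, gallery_labels, far=0.0)
vr_loose = verification_rate(sims, probe_labels, gallery_labels, far=1.0)
```

In the ten-fold protocol, these quantities would be computed once per fold and then summarized as a mean and standard deviation.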
IV-A2 Ablation Studies
We first experiment with different node vector dimensions to find the appropriate dimension for HFR. The experiment is conducted on LightCNN-9, in which there are 128 channels in the last convolutional layer. Figure 6 shows the results of training with the RGM node vector dimension and . The identification accuracy and verification rate improve as the dimension increases and then drop off when it becomes too large. We use a dimension of 128 in LightCNN and 256 in ResNet18, whose channel size at the last convolutional layer is .
Table I shows ablation studies with the three baseline networks on the CASIA NIR-VIS 2.0 database. In the table, the fine-tuning rows show the results of training only the fully connected layers while freezing the pre-trained feature extractor. For each network we attach our proposed RGM module, then experiment with NAU, and finally show the results of training with -Softmax. The ResNet18 fine-tuned performance is 88.37% in rank-1 accuracy. When the RGM extracts domain-invariant features focused on relational information, the performance improves to 96.33%. When the network trains with NAU and -Softmax, it shows additional performance improvements of 0.34% and 0.77% respectively. Similarly, performance on the LightCNNs improves by 4.82% and 1.65% over the fine-tuned accuracy.
| TRIVET | 95.7 ± 0.5 | 91.0 ± 1.3 |
| ADFL | 98.2 ± 0.3 | 97.2 ± 0.3 |
| CDL | 98.6 ± 0.2 | 98.3 ± 0.1 |
IV-A3 Comparison with Other Methods
In Table II, we compare our method with other deep learning-based HFR methods, namely HFR-CNN, TRIVET, ADFL, CDL, WCNN, RM and DSU. The RM method, which extracts features by pair-wise relation embedding, performs better than other deep learning methods. Our method shows a 0.38% performance improvement over RM and also yields results comparable with other domain-invariant based methods.
IV-A4 Comparison with Attentional Modules
In Table III, we compare RGM with the attentional modules depicted earlier in Figure 2. We use LightCNN-9 as the baseline model and train under the same conditions, with the last feature map passed through each module and trained with cross entropy loss. We find that Double Attention and B-CNN perform worse than the fine-tuned model, followed by Non-Local. While these methods focus on attention, RM, which extracts the relation of every paired component, shows better performance than the fine-tuned model. The graph-structured modules GloRe and RGM show higher performance than the other methods, with RGM reaching an accuracy of 97.2%.
IV-B IIIT-D Sketch
The IIIT-D Sketch database is designed for the sketch-to-photo face recognition task. We use the Viewed Sketch Database, which comprises 238 subjects. Each subject has one image pair, a sketch and a VIS photo face image. The sketches are drawn by a professional artist based on the VIS face images, of which 67 were collected from the FG-NET aging database, 99 from Labeled Faces in the Wild (LFW), and 72 from the IIIT-D student and staff database. Since there are only a small number of images for training, we train on the CUHK Face Sketch FERET Database (CUFSF) and evaluate on the IIIT-D Sketch database, following the same protocol as in prior work. The CUFSF database includes 1,194 subjects from the FERET database, with a single sketch and photo image pair per subject. For testing, we use VIS photo images as the gallery set and sketch images as the probe set.
|Rank-1 Acc(%)||VR@FAR=1%(%)||VR@FAR=0.1%(%)||Rank-1 Acc(%)||VR@FAR=1%(%)|
IV-B2 Ablation Studies
Table IV presents ablation studies for the IIIT-D Sketch database, training with RGM, NAU, and C-Softmax as before. As with the results on the CASIA NIR-VIS 2.0 database, our approach improves further with the addition of RGM, NAU and C-Softmax loss. In the table, however, when LightCNN-9 is the baseline, training with the original Softmax and cross-entropy loss performs 0.43% better than training with the C-Softmax loss. This is because CUFSF and IIIT-D contain fewer images than the CASIA NIR-VIS database, so it is difficult to learn sufficiently with C-Softmax loss and the margin values need to be adjusted. Nevertheless, the rest of the baseline models show improved performance, at 15.75% and 16.17% better than the fine-tuned models, respectively.
IV-B3 Comparison with Other Methods
As shown in Table V, we compare our approach with SIFT, MCWLD, VGG, CenterLoss and CDL. The sketch HFR database consists of drawings by an artist rather than photos, which makes deep learning-based training difficult. Nevertheless, our method achieves a rank-1 accuracy of 88.94%, the best performance among both deep learning and hand-crafted methods.
IV-B4 Comparison with Attentional Modules
We also apply the attention and graph modules to LightCNN-9 for the sketch-to-photo HFR task (Table VI). As with the NIR database, the B-CNN and Double Attention modules perform poorly, which suggests that self-attention by simply multiplying feature vectors does little to reduce the sketch domain discrepancy. The Non-Local and GloRe modules perform similarly to each other and not much differently from the fine-tuned model. On the Sketch database, which has a large domain difference and a small number of images, the RGM improves performance by a larger margin of 10.22%. Because the RGM extracts relational information with few parameters, it avoids overfitting and outperforms the other modules even with a small database.
The BUAA-VisNir database is composed of NIR and VIS face images of 150 subjects. Each subject has nine NIR and nine VIS images: one frontal view, four other views, and four expressions (happiness, anger, disgust and amazement). The VIS images are captured under varying illumination directions. The NIR and VIS images are paired and captured simultaneously. The training set comprises 50 subjects with 900 images. For testing, 100 subjects with one VIS image each make up the gallery set, with 900 NIR images in the probe set.
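The tables also report the verification rate at a fixed false accept rate (VR@FAR). As a rough illustration of how such a figure can be obtained from similarity scores, the following Python sketch picks the threshold from the impostor scores and then counts the accepted genuine pairs. The thresholding convention (strictly greater than the highest impostor score that must be rejected), the function name, and the toy scores are assumptions; the evaluation code used for the experiments may differ. Under the test protocol above, comparing every probe with every gallery image would give 900 × 100 = 90,000 pairs, of which 900 are genuine if every probe subject appears in the gallery.

```python
def vr_at_far(genuine_scores, impostor_scores, far):
    """Verification rate at a target false accept rate.

    genuine_scores:  similarities of same-identity (gallery, probe) pairs.
    impostor_scores: similarities of different-identity pairs.
    far:             target false accept rate, e.g. 0.01 for FAR=1%.
    """
    ranked = sorted(impostor_scores, reverse=True)
    # At most floor(far * N) impostor pairs may be accepted.
    allowed = int(far * len(ranked))
    # The threshold is the highest impostor score that must still be rejected;
    # a pair is accepted only if its score is strictly above this value.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores)

# Toy scores (hypothetical values, for illustration only).
genuine = [0.90, 0.60, 0.25, 0.40]
impostor = [0.10, 0.30, 0.55, 0.20]
print(vr_at_far(genuine, impostor, far=0.25))
print(vr_at_far(genuine, impostor, far=0.0))
```

A stricter FAR raises the threshold, so the verification rate can only stay the same or decrease as the FAR target is tightened.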
IV-C2 Ablation Studies
Performance with LightCNN-9 and LightCNN-29 improves incrementally over the baseline as the RGM module, NAU, and the conditional margin loss C-Softmax are added. Notably, the BUAA training set, at 50 subjects, is smaller than that of the other databases. Applying the C-Softmax loss, which considers inter-class similarity and adjusts the margin adaptively, therefore helps with efficient training on a small number of classes. This loss boosts performance by 2.45% and 0.11%, respectively. On the other hand, when NAU is added to the ResNet18 baseline, performance decreases because it becomes more difficult to learn the global node correlation with fewer training subjects. Overall, our approach improves on fine-tuning by 2.78%, 1.55%, and 2.23% for the three baselines.
IV-C3 Comparison with Other Methods
Table VII compares our method with three other types of methods (projection-based, synthesis-based and domain-invariant feature-based), namely H2(LBP3), TRIVLET, ADFL, CDL and WCNN. Our method outperforms other domain-invariant feature methods such as WCNN and TRIVLET, which focus on the features themselves rather than on relationships.
|Loss||CASIA NIR-VIS 2.0||CASIA
IV-D Conditional Margin Loss: C-Softmax
IV-D1 CASIA NIR-VIS 2.0
We compare our conditional margin loss (C-Softmax) to other angular margin losses, namely SphereFace, CosFace and ArcFace, using LightCNN-9 as the baseline and training under the same conditions (we choose a scale factor of 24); the margin for each loss follows the corresponding study. In Table VIII, on the CASIA NIR-VIS 2.0 database, Softmax with cross-entropy loss has 97% rank-1 accuracy; SphereFace and CosFace are lower than Softmax, with 87.06% and 97.17%, respectively. ArcFace and our C-Softmax, on the other hand, outperform the other losses at 97.97% and 98.03%, respectively, because they apply different margins depending on the class cosine similarity, as shown in Figure 5. ArcFace reduces the margin when the class cosine similarity is large or small and increases it near the midpoint (Figure 5(b)), while C-Softmax increases the margin at larger class similarity values (Figure 5(c)). This helps to handle classes with domain discrepancy because it effectively adjusts the inter-class margin.
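To make the difference between the compared margins concrete, the following Python sketch evaluates the target-class logit of SphereFace, CosFace and ArcFace as a function of the class cosine similarity, using the formulations from the respective papers. The margin values (m = 4, 0.35 and 0.5) are the commonly used defaults and are assumptions here, since the text only states that each margin follows its original study; the helper `effective_margin` is likewise hypothetical. C-Softmax is not included because its definition is given in Section III-D and is not reproduced in this section.

```python
import math

SCALE = 24  # scale factor stated in the text

def sphereface_logit(cos_theta, m=4):
    """SphereFace: psi(theta) = (-1)^k * cos(m * theta) - 2k,
    for theta in [k*pi/m, (k+1)*pi/m]. m=4 is an assumed default."""
    theta = math.acos(cos_theta)
    k = min(int(m * theta / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def cosface_logit(cos_theta, m=0.35):
    """CosFace: cos(theta) - m. m=0.35 is an assumed default."""
    return cos_theta - m

def arcface_logit(cos_theta, m=0.5):
    """ArcFace: cos(theta + m). m=0.5 is an assumed default.
    The usual safeguard for theta + m > pi is omitted in this sketch."""
    return math.cos(math.acos(cos_theta) + m)

def effective_margin(logit_fn, cos_theta):
    """Reduction of the target-class cosine caused by the margin."""
    return cos_theta - logit_fn(cos_theta)

# Columns: cosine, CosFace reduction, ArcFace reduction, scaled ArcFace logit.
for c in (-0.9, 0.0, 0.9):
    print(c,
          round(effective_margin(cosface_logit, c), 3),
          round(effective_margin(arcface_logit, c), 3),
          round(SCALE * arcface_logit(c), 3))
```

In this sketch the CosFace reduction is constant, while the ArcFace reduction is largest near the middle of the cosine range and shrinks toward both ends, which is consistent with the behaviour described above for ArcFace.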
We also run an experiment on LFW, a large-scale visual face database. For this purpose, we use CASIA WebFace, consisting of 10,575 subjects, as the training dataset and evaluate on LFW. ResNet101 is used as the baseline network, and all conditions are the same across losses, including a batch size of 128 and a gradually decaying learning rate of 0.1. Table VIII also shows experimental results for a small CASIA WebFace database with 5,287 subjects, half the size of CASIA WebFace. The results from the two databases show that SphereFace has the lowest verification rate, followed by the Softmax loss. CosFace reaches 98.94% and 98.19%, and ArcFace 98.71% and 98.2%, on CASIA WebFace and small CASIA WebFace, respectively. Compared to the other losses, C-Softmax achieves the best performance, with improvements of 0.19% and 0.42% over Softmax on CASIA WebFace and small CASIA WebFace. Table VIII shows that C-Softmax improves learning effectiveness when datasets are difficult to train on because of the small number of classes.
IV-E1 RGM with Sigmoid Activation Function
As mentioned in Section III-B2, when obtaining the directed relations between nodes, all edge values are passed through an activation function. Here we use a Sigmoid activation function instead of Softmax because the relation information for each node is independent of, and should not be influenced by, other nodes’ relations. Unlike Sigmoid, which is applied to each value independently, Softmax normalizes over all values and therefore couples them. Table IX shows the results of experiments in which the activation function of RGM is varied. We use LightCNN-9 as the baseline with the CASIA NIR-VIS and IIIT-D Sketch databases. For these databases, using Sigmoid increases the rank-1 accuracy by 0.08% and 0.85%, respectively, compared to Softmax.
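The difference between the two activations can be seen in a small Python sketch. The edge values below are hypothetical raw scores for the relations of a single node, not values from the RGM; the point is only that changing one edge leaves the Sigmoid outputs of the other edges untouched, whereas it changes every Softmax output.

```python
import math

def sigmoid(values):
    """Element-wise logistic function: each output depends only on its own input."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    """Softmax over all values: outputs are normalized to sum to one."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

edges = [2.0, 0.5, -1.0]    # hypothetical raw edge values for one node
changed = [2.0, 0.5, 3.0]   # only the third edge value is altered

print(sigmoid(edges), sigmoid(changed))
print(softmax(edges), softmax(changed))
```

With Sigmoid the first two relation values are identical in both cases, while with Softmax the first relation value drops once the third edge grows, even though the first edge itself did not change.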
IV-E2 Visualization of Relations
We visualize node relations using the learned parameter in Equation 2, which extracts the directed relations of nodes in the RGM. Figure 7 shows NIR, Sketch, and VIS input faces from a test set, with the face components representing the nodes. We visualize the relation values between the reference node (marked in red) and the other face components; the higher the relation value, the darker the green. In other words, we can see relations between nodes corresponding to face parts. The first row of Figure 7 shows VIS and NIR pairs, while the second row shows VIS and Sketch pairs. Figure 7(a) shows the relationships between the nose node and other nodes; this subject has strong relationships between both eyes and the left jaw. In Figure 7(c), the eyebrows and the mouth region are strongly related to the reference node. As in the NIR-to-VIS case, both the VIS and sketch images in Figure 7(d) show strong relations between the left eye and the face shape. Similar results are found in Figure 7(b), Figure 7(e) and Figure 7(f). These relationships are obtained by passing the gallery VIS image and the probe NIR or Sketch image separately through the RGM. Faces with the same identity show similar relationships, which indicates that the relationships obtained are domain-invariant. Additional visualization results are presented in the Supplementary Material.
IV-E3 Visualization of Node Attention Unit
Nodes whose relational information has been propagated through the RGM are recalibrated node-wise by the NAU. Figure 8 shows the scale values computed by the NAU for node-wise recalibration (Equation 4). Each column corresponds to a subject; the first row is the gallery set and the second and third rows are probe sets. Comparing the gallery and probe sets, we can see that similar nodes are emphasized for each subject. In other words, the nodes of higher importance among the relation-propagated nodes differ between subjects but are similar within each subject. The NAU increases representative power by focusing on these more informative nodes. The scales of the probe images of a subject differ slightly from one another, but viewed globally they follow a similar pattern for each subject.
The Relational Graph Module (RGM) extracts representative relational information for each identity by embedding each face component into a node vector and modeling the relationships among them. This graph-structured module addressed the discrepancy between HFR domains using a structured approach based on extracting relations. Moreover, the RGM overcame the lack of adequate HFR databases by plugging into a pre-trained face feature extractor and fine-tuning it. In addition, we performed node-wise recalibration through our Node Attention Unit (NAU) to focus on globally informative nodes among the propagated node vectors. Furthermore, our novel C-Softmax loss helped to learn a common projection space adaptively by applying a higher margin as the class similarity increases.
We applied the RGM module to several pre-trained networks and observed performance improvements on NIR-to-VIS and Sketch-to-VIS tasks. In the ablation studies, each proposed component demonstrated its contribution through improved performance, and C-Softmax improved performance not only in the HFR task but also on large-scale visual face databases. Furthermore, the visualization of relational information in VIS, NIR, and sketch images showed that relationships within the face are similar for each subject, revealing representative domain-invariant features. Finally, our proposed approach showed better performance on the CASIA NIR-VIS 2.0, IIIT-D Sketch, and BUAA-VisNir databases.
This research was supported by the Multi-Ministry Collaborative R&D Program (R&D program for complex cognitive technology) through the National Research Foundation of Korea (NRF) funded by MSIT, MOTIE, and KNPA (NRF-2018M3E3A1057289).
- (2012) Memetic approach for matching sketches with digital face images. Technical report.
- (2012) Memetically optimized mcwld for matching sketches with digital face images. IEEE Transactions on Information Forensics and Security 7 (5), pp. 1522–1535.
- (2019) A multi-scale conditional generative adversarial network for face sketch synthesis. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 3876–3880.
- (2018) A^2-nets: double attention networks. In Advances in Neural Information Processing Systems, pp. 352–361.
- (2019) Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 433–442.
- (2016) One-to-many face recognition with bilinear cnns. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
- (2018) Heterogeneous face recognition using domain specific units. IEEE Transactions on Information Forensics and Security 14 (7), pp. 1803–1816.
- (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
- (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
- (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- (2017) Learning invariant deep representation for nir-vis face recognition. In Thirty-First AAAI Conference on Artificial Intelligence.
- (2018) Wasserstein cnn: learning invariant features for nir-vis face recognition. IEEE transactions on pattern analysis and machine intelligence 41 (7), pp. 1761–1773.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
- (2012) The buaa-visnir face database instructions. School Comput. Sci. Eng., Beihang Univ., Beijing, China, Tech. Rep. IRIP-TR-12-FR-001.
- (2016) The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882.
- (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- (2012) Heterogeneous face recognition using kernel prototype similarities. IEEE transactions on pattern analysis and machine intelligence 35 (6), pp. 1410–1422.
- (2009) Coupled spectral regression for matching heterogeneous faces. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1123–1128.
- (2012) Coupled discriminant analysis for heterogeneous face recognition. IEEE Transactions on Information Forensics and Security 7 (6), pp. 1707–1716.
- (2013) The casia nir-vis 2.0 face database. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 348–353.
- (2006) Inter-modality face recognition. In European conference on computer vision, pp. 13–26.
- (2015) Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision, pp. 1449–1457.
- (2005) A nonlinear approach for face sketch synthesis and recognition. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 1005–1010.
- (2012) Heterogeneous face image matching using multi-scale features. In 2012 5th IAPR International Conference on Biometrics (ICB), pp. 79–84.
- (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220.
- (2016) Transferring deep representation for nir-vis heterogeneous face recognition. In 2016 International Conference on Biometrics (ICB), pp. 1–8.
- (2019) NIR-to-vis face recognition via embedding relations and coordinates of the pairwise features. In 2019 international conference on biometrics (ICB).
- (2016) A survey on heterogeneous face recognition: sketch, infra-red, 3d and low-resolution. Image and Vision Computing 56, pp. 28–48.
- (2015) Deep face recognition. In bmvc, Vol. 1, pp. 6.
- (2016) Graphical representation for heterogeneous face recognition. IEEE transactions on pattern analysis and machine intelligence 39 (2), pp. 301–312.
- (2019) Sparse graphical representation based discriminant analysis for heterogeneous face recognition. Signal Processing 156, pp. 46–61.
- (2000) The feret evaluation methodology for face-recognition algorithms. IEEE Transactions on pattern analysis and machine intelligence 22 (10), pp. 1090–1104.
- (2017) L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507.
- (2016) Seeing the forest from the trees: a holistic approach to near-infrared heterogeneous face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 54–62.
- (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976.
- (2015) Deep perceptual mapping for thermal to visible face recognition. arXiv preprint arXiv:1507.02879.
- (2016) Heterogeneous face recognition with cnns. In European conference on computer vision, pp. 483–491.
- (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
- (2016) Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
- (2016) Cross-modality feature learning through generic hierarchical hyperlingual-words. IEEE transactions on neural networks and learning systems 28 (2), pp. 451–463.
- (2014) Generalized transfer subspace learning through low-rank constraint. International Journal of Computer Vision 109 (1-2), pp. 74–93.
- (2018) Adversarial discriminative heterogeneous face recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
- (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
- (2018) Deep face recognition: a survey. arXiv preprint arXiv:1804.06655.
- (2008) Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (11), pp. 1955–1967.
- (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
- (2018) Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 399–417.
- (2018) Face aging with identity-preserved conditional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7939–7947.
- (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515.
- (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- (2018) A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2884–2896.
- (2018) Coupled deep learning for heterogeneous face recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923.
- (2009) Partial face matching between near infrared and visual images in mbgc portal challenge. In International Conference on Biometrics, pp. 733–742.
- (2020) Low resolution face recognition using a two-branch deep convolutional neural network architecture. Expert Systems with Applications 139, pp. 112854.
- (2018) Tv-gan: generative adversarial network based thermal to visible face recognition. In 2018 international conference on biometrics (ICB), pp. 174–181.
- (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.