Facial recognition (FR) technology has improved significantly over the past decade, owing both to recent advances in deep learning methods and to its utility and widespread application in authentication processes on mobile phones and other electronic devices. Following the introduction of AlexNet in 2012, most FR tasks, such as face verification (one-to-one) and face identification (one-to-many), have employed a deep neural network approach with a CNN as the backbone. Most of these advances make use of 2D RGB (red, green, and blue channel) images, which are plentiful and readily available, thereby facilitating the training of even deeper and more effective neural networks.
While CNNs have proven successful with RGB images, recognition tasks using RGB-D images, which comprise co-registered color and range (depth) information, have been less explored. The advent of inexpensive depth sensors such as the Microsoft Kinect and the Intel RealSense has reduced the cost of acquiring RGB-D images. The additional information in RGB-D images can improve the effectiveness of FR algorithms, making them more accurate and robust to variations in pose and illumination [10, 14]. Traditional RGB-D approaches to FR used hand-crafted descriptors for the RGB and depth modalities to perform classification, but these engineered features may not generalize well to all datasets.
Deep learning approaches for RGB-D facial recognition generally apply a multimodal learning strategy, such as feature-level or score-level fusion, following CNN feature extraction. Nonetheless, different parts of the embeddings and different input modalities may contain varying amounts of identity-related information, which most current fusion strategies fail to exploit, because they assign equal importance to the different modalities and to the different parts of the learned embedding.
Accordingly, we propose here a novel method to effectively fuse the RGB and depth modalities using a two-level attention mechanism. The first level, called feature-map attention, selectively focuses on the fused feature maps produced by the convolution layers, while the second level, called spatial attention, selectively focuses on the spatial convolutional information within those feature maps. The attention-refined features are then passed to fully connected layers for classification.
Our main contributions are as follows:
We introduce a novel multimodal fusion mechanism using attention to selectively learn useful information from both RGB and depth modalities;
Our proposed method outperforms the state-of-the-art on two public datasets.
2 Related Work
RGB-D databases are generally collected in constrained lab environments, and as a result only a few prominent RGB-D face datasets have been compiled by the community. Hg et al. collected the VAP dataset and proposed a face detection algorithm based on curvature analysis. Min et al. developed the EURECOM RGB-D face database and reported baseline results on it using a combination of Principal Component Analysis (PCA), Local Binary Patterns (LBP), and Scale Invariant Feature Transform (SIFT) features. Zhang et al. [27] developed the BUAA Lock3DFace RGB-D face database and provided baseline results for the depth modality using the Iterative Closest Point (ICP) algorithm. CurtinFaces is another notable RGB-D dataset, collected by Li et al., which contains 5000 images for both modalities recorded by Kinect sensors. Sepas-Moghaddam et al. developed the IST-EURECOM light field face dataset containing RGB multi-view information, from which depth information was extracted. The original papers introducing new RGB-D datasets normally included baseline results on the proposed dataset, providing insight into the achievable performance level. Beyond these results, the datasets have also been evaluated by a number of RGB-D face recognition methods, which are reviewed in this section.
2.1 Traditional RGB-D approaches
Goswami et al. [6, 7] used different types of features, such as Visual Saliency Maps (VSM) from RGB data and entropy maps from depth data, which were fused with HOG features of image patches and fed to a classifier. In later work, they improved the feature set with RGB-D Image Saliency and Entropy maps (RISE) and Attributes from Depth Maps (ADM). Li et al. used various pre-processing methods to exploit face symmetry at the 3D point-cloud level and obtain a frontal view, shape, and texture irrespective of the pose. They combined pose correction with a Discriminant Color Space (DCS) transformation to improve the accuracy of their approach. Hayat et al. represented the images with a covariance-matrix representation on the Riemannian manifold and used an SVM classifier with score-level fusion of the depth and RGB scores to classify identities.
2.2 Deep learning approaches
Socher et al. proposed a Convolutional Recursive Neural Network (CRNN) for RGB-D object recognition. In this network, two CNNs were trained separately on the RGB and depth images, and the learned embeddings were fed into two RNNs to obtain compositional features and part interactions. Borghi et al. trained a Siamese CNN on RGB and depth images for a face verification task. Chowdhury et al. built upon the work of Goswami et al. [6, 7] with an approach called learning-based reconstruction. They used autoencoders to learn a mapping between the RGB and depth images, and used the images reconstructed through this mapping for identification. Zhang et al. addressed feature fusion using deep learning techniques, focusing on jointly learning a CNN embedding that effectively fuses the common and complementary information offered by the two modalities.
3 Proposed Method

We aim to develop an accurate FR method that is more robust to variations in ambient illumination and face pose. To this end, we present a multimodal recognition method using both the RGB and depth modalities contained in Kinect images. We fuse the two modalities using two attention mechanisms, as depicted in Fig. 1.
3.1 Preprocessing and Image Augmentation
The first preprocessing step is to determine two depth values that respectively represent the near and far clipping planes of the scene. These clipping planes are used to filter out content that is too near or too far from the camera, while retaining the face information. Following this preprocessing, both the RGB and the preprocessed depth images are passed through the dlib CNN face detector network, which returns cropped images containing only the face.
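The clipping-plane filtering described above can be sketched as follows; the plane distances and the toy depth map are illustrative assumptions, not values from the paper:

```python
import numpy as np

def clip_depth(depth, near, far):
    """Zero out depth values outside the [near, far] clipping planes,
    keeping only content at roughly the distance of the face."""
    mask = (depth >= near) & (depth <= far)
    return np.where(mask, depth, 0)

# Illustrative example: a 4x4 depth map in millimetres (hypothetical values)
depth = np.array([[300, 800, 900, 2500],
                  [820, 850, 870, 3000],
                  [810, 860, 880, 2600],
                  [200, 840, 890, 2700]])
filtered = clip_depth(depth, near=500, far=1500)
# Pixels nearer than 500 mm or farther than 1500 mm are suppressed,
# leaving only the band of depths where the face is expected to lie.
```

The cropped face region is then extracted from both modalities by the dlib CNN face detector, which operates on the filtered images.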
To combat the small size of RGB-D databases and to make the model more robust and less sensitive to small changes in the images, we apply image augmentation to the dataset during training. For augmentation, we use geometric transformations, including image rotation (-30° to 30°), shear (-16° to 16°), perspective transformation (50% to 150%), and horizontal mirror reflection.
3.2 Network Architecture
Our network consists of a Siamese convolutional unit for the RGB and depth modalities, following the architecture of the VGG network. We utilize the convolutional feature-extraction part of the VGG network, which has already been trained on over 3.3 million images from the VGGFace2 dataset. This speeds up the training process by transferring the network's already-learnt ability to identify facial features, and further compensates for the relatively small size of the RGB-D datasets. Both the depth and RGB images are passed through the convolutional part of the network to obtain feature embeddings, in the form of convolutional tensors, for both modalities from the 13th layer of the VGG-16 network.
Following convolutional feature extraction, the next part of the network is the attention mechanism, which focuses the network on the salient parts of the feature embedding. The attention mechanism can be broken down into two different units, named feature-map attention and spatial attention.
3.2.1 Feature-Map attention
Feature-map attention is inspired by the channel attention used in prior work. The aim of this mechanism is to train the network to focus on those feature maps in the embedding that contribute most to recognition [4, 24, 25, 18]. For this attention mechanism, we concatenate the features extracted by the CNN from the RGB and depth modalities, as shown in Fig. 1. Promising prior results suggest using both average and max pooling of the spatial features simultaneously, as together they gather important clues about distinctive object features and allow finer feature-map attention to be inferred. Following this pooling, the resulting tensors are passed through a multi-layer perceptron and a sigmoid activation, returning weights that lie between 0 and 1.
Let $F_{rgb}$ and $F_{d}$ be the feature embeddings extracted from the CNN network. The concatenated embedding $F$ is then given by:

$$F = F_{rgb} \oplus F_{d}, \quad (1)$$

where $\oplus$ refers to the concatenation operation. Performing average pooling and max pooling on these concatenated feature embeddings results in $F_{avg}$ and $F_{max}$. The feature-map attention weights $M_f$ are then calculated as:

$$h = W_1 (W_0 F_{avg} + b_0) + b_1 + W_1 (W_0 F_{max} + b_0) + b_1, \quad (2)$$

$$M_f = \sigma(h), \quad (3)$$

where $h$ is the output of the multi-layer perceptron; $W_0$ and $W_1$ are the weights learnt by the multi-layer perceptron; and $b_0$ and $b_1$ are the respective biases. This output is normalized to $[0, 1]$ with the sigmoid function $\sigma$ in Eq. 3. The feature-map attention refined features $F'$ are then calculated as:

$$F' = M_f \otimes F, \quad (4)$$

where $\otimes$ denotes element-wise multiplication of each feature map with its attention weight.
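As an illustration of the feature-map attention computation, here is a minimal NumPy sketch; the ReLU inside the bottleneck MLP, the bottleneck size, and the toy tensor shapes are our assumptions rather than details stated in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_map_attention(F, W0, b0, W1, b1):
    """Feature-map (channel) attention over a C x H x W embedding F.
    Average- and max-pool spatially, pass both vectors through a shared
    two-layer MLP, sum, and squash with a sigmoid to obtain one weight
    per feature map; finally rescale F by those weights."""
    f_avg = F.mean(axis=(1, 2))                  # (C,)
    f_max = F.max(axis=(1, 2))                   # (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v + b0, 0) + b1  # ReLU bottleneck (assumed)
    M_f = sigmoid(mlp(f_avg) + mlp(f_max))       # (C,) weights in (0, 1)
    return F * M_f[:, None, None]                # broadcast over H and W

# Tiny illustrative example: C=4 feature maps of size 2x2,
# with a bottleneck of size 2 and random weights
rng = np.random.default_rng(0)
F = rng.standard_normal((4, 2, 2))
W0, b0 = rng.standard_normal((2, 4)), np.zeros(2)
W1, b1 = rng.standard_normal((4, 2)), np.zeros(4)
F_refined = feature_map_attention(F, W0, b0, W1, b1)
```

Because the sigmoid keeps every weight strictly between 0 and 1, each refined feature map is a down-weighted copy of the original, with the MLP deciding which maps are attenuated least.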
3.2.2 Spatial attention
After refining the features through feature-map attention, the network next focuses attention on the spatial parts of the embedding. This module helps the network identify the most salient locations in the embedding and focus its attention on them. Similar to feature-map attention, to extract the most salient information from the feature embedding, we use average and max pooling, here applied along the feature-map axis. To obtain the attention weights, we pass the pooled maps to a convolution layer that compresses them into a single weight map, followed by a sigmoid activation. Performing average pooling and max pooling on the refined feature embedding $F'$ results in $F'_{avg}$ and $F'_{max}$. The spatial attention weights $M_s$ are then calculated as:

$$M_s = \sigma\left(f\left([F'_{avg}; F'_{max}]\right)\right), \quad (5)$$

where $f$ denotes the convolution layer and $[\cdot;\cdot]$ denotes concatenation. The final refined features following the attention module are then given by:

$$F'' = M_s \otimes F'. \quad (6)$$
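A minimal NumPy sketch of the spatial attention computation; the 3×3 kernel size, the 'same' zero-padding, and the toy shapes are illustrative assumptions, not values specified in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, kernel):
    """Spatial attention over a C x H x W embedding F.
    Average- and max-pool along the feature-map (channel) axis, stack the
    two resulting maps, convolve them down to a single H x W weight map,
    apply a sigmoid, and rescale F location-by-location."""
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2                                         # 'same' padding (assumed)
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    H, W = F.shape[1:]
    M_s = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Naive 2-channel 'same' convolution at location (i, j)
            M_s[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    M_s = sigmoid(M_s)                                   # (H, W) weights in (0, 1)
    return F * M_s[None, :, :]                           # broadcast over channels

# Tiny illustrative example: 4 feature maps of size 5x5
rng = np.random.default_rng(1)
F = rng.standard_normal((4, 5, 5))
kernel = rng.standard_normal((2, 3, 3))   # kernel size is an assumption
F_refined = spatial_attention(F, kernel)
```

The naive loop stands in for the convolution layer of Eq. 5; a framework implementation would use a learned single-output-channel convolution instead.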
After fusing these features, a classifier is applied to map the attention-refined features to identities. We use 4 fully connected layers as the classifier, with the first 3 layers each followed by batch normalization and dropout to regularize the output. The final layer is a dense layer with the number of hidden units equal to the number of classes.
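A minimal inference-time sketch of this classifier head in NumPy; dropout is omitted at inference, and the input dimension of 1024 and the 106-class output (matching the number of IIIT-D subjects) are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch normalization using stored statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def classifier_head(x, hidden, out):
    """Forward pass of a 4-layer fully connected classifier:
    three FC + batch-norm (+ ReLU) blocks, then a final dense layer
    with one unit per class.  Dropout is active only during training."""
    for W, b, bn in hidden:
        x = relu(batch_norm(W @ x + b, *bn))
    W_out, b_out = out
    return W_out @ x + b_out          # class logits

# Layer sizes 2048, 1024, 512 follow the text; the 1024-d input is assumed
rng = np.random.default_rng(2)
dims = [1024, 2048, 1024, 512]
hidden = []
for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = rng.standard_normal((d_out, d_in)) * 0.01
    b = np.zeros(d_out)
    bn = (np.zeros(d_out), np.ones(d_out), np.ones(d_out), np.zeros(d_out))
    hidden.append((W, b, bn))
num_classes = 106                     # e.g. the 106 subjects of IIIT-D
W_out = rng.standard_normal((num_classes, dims[-1])) * 0.01
b_out = np.zeros(num_classes)

x = rng.standard_normal(dims[0])
logits = classifier_head(x, hidden, (W_out, b_out))
```

The returned logits would normally be passed through a softmax and a cross-entropy loss during training.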
4.1 Datasets

IIIT-D RGB-D: This dataset was compiled by Goswami et al. in 2012 at IIIT-Delhi. It contains images of 106 subjects captured using a Microsoft Kinect, with between 11 and 254 images per subject. The dataset comes with a pre-defined protocol using a five-fold cross-validation strategy, to which we strictly adhered in our experiments.
CurtinFaces RGB-D: The CurtinFaces dataset was made publicly available by Curtin University in 2011. It contains over 5000 images of 52 subjects in both RGB and depth modalities, captured with a Microsoft Kinect. For each subject, the first 3 images are the frontal, right, and left poses. Of the remaining images, 49 comprise 7 different poses recorded with 7 different expressions, and 35 capture 5 different illumination variations acquired with 7 different expressions. The dataset also contains images with sunglasses and hand occlusions. We follow the test protocol described in the original paper.
4.2 Experimental Setup

The convolution layers were initialized with weights pre-trained on the VGGFace2 dataset. We used the Adam optimizer with a learning rate of 0.01 and a decay rate of 0.9. A dropout rate of 0.5, classifier layer sizes of 2048, 1024, and 512, and a batch size of 20 were determined by grid search. The input to the network was the synchronized RGB and depth images, after image augmentation as described in Sec. 3.1. The two attention modules described in Sec. 3.2 were applied to the concatenated features from the convolution layers, and the attention-refined features were fed to the classifier network containing 3 fully connected layers with 2048, 1024, and 512 nodes, respectively. The fourth and final fully connected layer comprised one node per dataset class.
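As a reference for the optimizer configuration above, a single Adam update with the stated learning rate of 0.01 can be sketched as follows; we use the stated 0.9 decay as β₁, while β₂ and ε are Adam's usual defaults and are our assumption:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector w given its gradient.
    m and v are the running first and second moment estimates;
    t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustrative first step on a toy 2-parameter problem
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, grad=np.array([0.5, -0.5]), m=m, v=v, t=1)
# On the very first step, bias correction makes the update magnitude
# approximately lr in the direction opposite each gradient's sign.
```

In practice the framework's built-in Adam implementation would be used; this sketch only makes the stated hyperparameters concrete.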
Table 1: Rank-1 identification accuracy on the IIIT-D RGB-D dataset.

| Method | Accuracy |
| --- | --- |
| Goswami et al. | 91.6% |
| Goswami et al. | 95.3% |
| Zhang et al. | 98.6% |
| Chowdhury et al. | 98.7% |
| Ours (CNN Att.) | 99.4% |
4.3 Performance and Comparison
To validate our results, we compared the performance of our proposed method with benchmark results on the CurtinFaces and IIIT-D RGB-D datasets. The results for the IIIT-D RGB-D dataset are shown in Table 1. Our proposed method increases the rank-1 identification accuracy to 99.38%, outperforming the state-of-the-art results of Chowdhury et al., who achieved a classification accuracy of 98.7% using depth-rich features acquired from an autoencoder, and of Zhang et al., who achieved 98.6% using complementary feature learning.
To further verify our results, we tested our multimodal attention network on the CurtinFaces RGB-D dataset, with results tabulated in Table 2. Our proposed model outperforms the state-of-the-art results on both test sets, achieving 97.53% accuracy on the pose-expression variation test set and 98.88% accuracy on the illumination-expression variation test set.
4.4 Ablation Experiments
We employ our two attention modules on top of the convolution layers for effective fusion of the two modalities. To demonstrate the effectiveness of the two modules, we conducted the experiments described below, with results compiled in Table 4. It is evident from the results that the attention modules aided the effective fusion of the two modalities and improved overall performance. We also employed the attention modules separately to observe the improvement over the unaltered VGG network. The results of this experiment for the IIIT-D dataset are shown in Table 3. It can be seen that employing feature-map attention alone improves accuracy by 1.0%, whereas employing spatial attention alone improves it by 0.8%. The best performance was achieved by employing both attention mechanisms, yielding a performance improvement of 4.0% over the base case.
Table 3: Ablation results on the IIIT-D dataset (rank-1 accuracy).

| Configuration | Accuracy |
| --- | --- |
| VGG-Face (RGB + Depth) | 95.4% |
| With Feature-Map Attention | 96.4% |
| With Spatial Attention | 96.2% |
| Both Attention Modules | 99.4% |
Table 4: Ablation results on the CurtinFaces dataset (rank-1 accuracy).

| Configuration | Pose-Expression | Illumination-Expression | Average |
| --- | --- | --- | --- |
| VGG-Face (RGB + Depth) | 92.6% | 94.2% | 93.4% |
| With Feature-Map Attention | 96.4% | 98.2% | 97.3% |
| With Spatial Attention | 96.7% | 97.3% | 97.0% |
| Both Attention Modules | 97.5% | 98.9% | 98.2% |
In this paper, we present an attention-aware network that fuses the RGB and depth modalities of RGB-D images for face recognition. Through our evaluations, we validate that our attention-aware fusion offers more accurate results than the state-of-the-art on the IIIT-D and CurtinFaces datasets. To further increase accuracy, a more advanced CNN backbone could be considered, along with more sophisticated data augmentation to address the limited size of RGB-D datasets.
-  (2018) Face verification from depth using privileged information. In British Machine Vision Conference, pp. 303. Cited by: §2.2.
-  (2017) POSEidon: face-from-depth for driver pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4661–4670. Cited by: §2.2.
-  (2018) VGGFace2: a dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, Cited by: §3.2, §4.2.
-  (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667. Cited by: §3.2.1.
-  (2016) RGB-D face recognition via learning-based reconstruction. In International Conference on Biometrics Theory, Applications and Systems, pp. 1–7. Cited by: §2.2, §4.3, Table 1.
-  (2013) On RGB-D face recognition using Kinect. In International Conference on Biometrics: Theory, Applications and Systems, pp. 1–6. Cited by: §2.1, §2.2, §2, §4.1, Table 1.
-  (2014) RGB-D face recognition with texture and attribute features. IEEE Transactions on Information Forensics and Security 9 (10), pp. 1629–1640. Cited by: §2.1, §2.2, §2, Table 1.
-  (2016) An RGB–D based image set classification for robust face recognition from kinect data. Neurocomputing 171, pp. 889–900. Cited by: §2.1, Table 2.
-  (2012) An RGB-D database using microsoft’s Kinect for windows for face detection. In International Conference on Signal Image Technology and Internet Based Systems, pp. 42–46. Cited by: §2.
-  (2020-01) Robust RGB-D face recognition using attribute-aware loss. IEEE Transactions on Pattern Analysis and Machine Intelligence in press (), pp. 1–15. Cited by: §1.
-  (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10, pp. 1755–1758. Cited by: §3.1.
-  (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
-  (2016) Accurate and robust face recognition from RGB-D images with a deep learning approach. In British Machine Vision Conference, Cited by: §1.
-  (2013-01) Using Kinect for face recognition under varying poses, expressions, illumination and disguise. In IEEE Workshop on Applications of Computer Vision, Cited by: §1, §2.1, §2, §4.1, Table 2.
-  (2016) Face recognition based on Kinect. Pattern Analysis and Applications 19 (4), pp. 977–987. Cited by: Table 2.
-  (2014) KinectFaceDB: a Kinect database for face recognition. IEEE Transactions on Systems, Man, and Cybernetics: Systems 44 (11), pp. 1534–1548. Cited by: §2.
-  (2017-04) The IST-EURECOM light field face database. In International Workshop on Biometrics and Forensics, Cited by: §2.
-  (2020) Facial emotion recognition using light field images with deep attention-based bidirectional LSTM. In IEEE conference on Acoustics, Speech, and Signal Processing, Cited by: §3.2.1.
-  (2020-03) Face recognition: a novel multi-level taxonomy based survey. IET Biometrics 9 (2), pp. 1–12. Cited by: §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.
-  (2012) Convolutional-recursive deep learning for 3D object classification. In Advances in neural information processing systems, pp. 656–664. Cited by: §2.2.
-  (2019) Deep face recognition: a survey. arXiv preprint arXiv:1804.06655. Cited by: §1, §1.
-  (2018) CBAM: convolutional block attention module. In European Conference on Computer Vision, pp. 3–19. Cited by: §3.2.1, §3.2.2.
-  (2019) Classification of hand movements from EEG using a deep attention-based LSTM network. IEEE Sensors Journal. Cited by: §3.2.1.
-  (2019) Capsule attention for multimodal EEG and EOG spatiotemporal representation learning with application to driver vigilance estimation. arXiv preprint arXiv:1912.07812. Cited by: §3.2.1.
-  (2018) RGB-D face recognition via deep complementary and common feature learning. In IEEE International Conference on Automatic Face & Gesture Recognition, pp. 8–15. Cited by: §2.2.
-  (2016) Lock3DFace: a large-scale database of low-cost Kinect 3D faces. In International Conference on Biometrics, pp. 1–8. Cited by: §2, §4.3, Table 1.
-  (2012) Microsoft Kinect sensor and its effect. IEEE MultiMedia 19 (2), pp. 4–10. Cited by: §1.