Attention-Based Face AntiSpoofing of RGB Images, using a Minimal End-2-End Neural Network

12/18/2019 ∙ by Ali Ghofrani, et al.

Face anti-spoofing aims at distinguishing real faces from fake ones, and has attracted considerable attention in security-sensitive applications, liveness detection, fingerprinting, and so on. In this paper, we address the anti-spoofing problem by proposing two end-to-end systems based on convolutional neural networks. The first model is based on the EfficientNet B0 network, whose final dense layers have been modified. The second is a very light model derived from MobileNet V2, which has been contracted, modified, and retrained efficiently on data created from the Rose-Youtu dataset for this purpose. The experiments show that both proposed architectures achieve remarkable results in discriminating real from fake face images. They also show that the heavy-weight model can be employed efficiently in server-side implementations, whereas the low-weight model can be easily deployed on hand-held devices, and that both perform very well using merely RGB input images.




I Introduction

Face anti-spoofing has always been a key challenge for face verification and recognition systems. Conventional face anti-spoofing systems used eigenfaces [27], HoG (Histogram of Oriented Gradients) [1], or LBP (Local Binary Pattern) features [15] to perform the task, whereas recent systems mostly rely on deep neural features such as DeepFace [13], FaceNet [18], and OpenFace [2].

The emerging and pervasive use of IR and depth sensors has recently made it much simpler to distinguish real from fake images through image processing techniques. However, using these sensors is not always feasible, and most previously distributed products are not equipped with these technologies.

In this paper, we address the problem of face anti-spoofing using only the RGB frames of conventional cameras, without any auxiliary data, which is the most challenging setting in this area.

To cope with the facial liveness detection challenge, several datasets with their own specific properties have been released so far. NUAA imposter [24], which is publicly available, contains 7509 fake and 5105 real images; however, this volume of data is insufficient and limiting, considering the scale of data generally required for deep learning. Casia-Surf, another dataset recently published at CVPR 2019 [14], contains three types of data (RGB, IR, and depth samples) and is useful for multi-modal systems. This dataset contains 9608 training, as well as validation data samples.

The historically influential works in the anti-spoofing area comprise four major approaches. The first is the family of texture-based methods, which incorporate hand-crafted features such as HoG and LBP, followed by traditional classifiers such as SVM, to perform the task [3, 9]. The temporal-based methods, on the other hand, either use facial motion patterns (e.g., eye blinking) or exploit the relative movement between the face and the background, and employ methods such as optical flow to track the movement of the face in order to discriminate real faces from fake ones [21]. Some 3D structure-based methods have also been developed, which either extract depth information from 2D images or analyze the 3D shape information recorded with 3D sensors and then compare the 3D model of the input sample with that of a genuine face [30]. This approach, however, requires specific 3D devices which are not easily available and can be costly. Finally, the rPPG (Remote Photoplethysmography) methods extract the pulse signal from facial videos without any skin contact [10, 22]. Nevertheless, all these systems are highly vulnerable to fake face attacks and masks, and may not cope with these attacks without the assistance of auxiliary data such as depth information or IR [12]. In recent years, deep learning-based methods have been pervasively used for many detection and recognition tasks, including anti-spoofing [4, 5].

In this paper, we address the anti-spoofing and liveness detection problem using an end-2-end system. The novelty of this work is multi-fold: 1) No hand-crafted features (e.g., HoG and LBP) are utilized, 2) The proposed system requires no auxiliary data (e.g., depth, IR) and relies solely on the input RGB images, 3) Although a deep neural network structure is employed, the entire system is light, portable, and able to be deployed on hand-held devices and mobile phones with normal commercial processors, 4) The proposed system can deal with all types of fake images, such as depth-wise masks and high-resolution display replaying, 5) The proposed system copes effectively with the challenging issues that each of the aforementioned properties raises.

The outline of this paper is as follows. The next section explains the background theory underlying the proposed system. Then, the experimental results and their analysis are presented. The conclusion closes the paper and is followed by the references cited in it.

II Related Works

The problem of face anti-spoofing has been nearly solved for flagship devices. For example, the Face ID service on the iPhone X is able to create a 3D mesh of the face using a dot projector, flood illumination, and IR sensors, along with dedicated neural network hardware (the Neural Engine). Other brands use similar mechanisms to cope with the anti-spoofing problem.

On the other hand, there are a number of mid-range mobile handsets and previously sold devices which lack these sensors and processing units. Furthermore, many verification tasks are performed using laptop webcams, which lack these mechanisms entirely. These issues motivate the work on a model which can address the problem using RGB images only.

Fig. 1: Data preparation flow graph, in order to gather the real and fake images from the dataset.
Fig. 2: (Top) the real image samples, (Bottom) The fake image samples corresponding to the top images (from left to right): Mask with mouth and eyes cropped out, a paper mask without cropping, a paper mask with upper part cut in the middle, a paper print mask, and video replaying.

III The Proposed Framework

The problem of face anti-spoofing can be cast as a binary classification problem, which attempts to discriminate between real and fake images. However, the fake samples normally outnumber the real ones, due to the many types of attacks and the variations of fake images within each type which could be presented to the system. Hence, the system is likely to be exposed to imbalanced training data.

III-A Dataset Preparation Task

In order to gather the required data for anti-spoofing purposes, some issues which hinder clean data preparation have to be addressed. For instance, a person passing by in the background, or portraits in the background, could easily leak into the data if no preprocessing is performed. Such outliers have to be discarded. The functional flow graph of the proposed system for appropriate and reliable data preparation is depicted in figure 1.


Various datasets have been generated to handle the liveness detection problem. NUAA imposter [24], which is publicly available, contains only 7509 fake and 5105 real images, and this volume of data is not sufficient for deep network-based applications. Casia Surf [29] is a recently released dataset which contains RGB, IR, and depth data samples, and is appropriate for multi-modal systems. It contains 9608 training, as well as validation data samples. This amount of data does not suffice for training deep models either. The dataset used in this paper is ROSE-Youtu [9], which contains real videos and their corresponding fake videos. Since the data is provided as videos, the samples have to be extracted as images first. Some samples of images taken from this dataset are depicted in figure 2.

In order to extract the images from the ROSE-Youtu dataset, we employed the MTCNN network [28] for face detection. To do so, we loaded every single video in the dataset and split it into frames. Within each frame, face detection has been performed using MTCNN. Then, the face region has been cropped out and put in its associated class of data [7].
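The cropping step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the detector is passed in as a hypothetical callable returning boxes in an assumed `(x, y, width, height)` format, and frames are plain nested lists so that the example runs without any image or deep learning library.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x, y, width, height): an assumed box convention
Frame = List[List[int]]           # rows of pixel values, a stand-in for an image array


def crop_box(frame: Frame, box: Box) -> Frame:
    """Crop `box` out of `frame`, clamping the box to the frame borders."""
    height, width = len(frame), len(frame[0])
    x, y, w, h = box
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [row[x0:x1] for row in frame[y0:y1]]


def extract_faces(frames: Sequence[Frame],
                  detect: Callable[[Frame], List[Box]]) -> List[Frame]:
    """Run the (hypothetical) detector on every frame and collect the face crops."""
    crops = []
    for frame in frames:
        for box in detect(frame):
            crop = crop_box(frame, box)
            if crop and crop[0]:  # skip boxes that fall entirely outside the frame
                crops.append(crop)
    return crops


# Toy 4x6 "frame" and a stub detector that always reports a single box.
toy_frame = [[r * 10 + c for c in range(6)] for r in range(4)]


def stub_detector(frame: Frame) -> List[Box]:
    return [(4, 1, 5, 2)]         # deliberately overflows the right border


faces = extract_faces([toy_frame], stub_detector)
```

In a real pipeline the stub would be replaced by an actual face detector and each crop would be written to the folder of its class (real or fake).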

Fig. 3: Inappropriate samples (left to right): First and second images: Fake samples with real background images being moved; Third up to fifth images: Fake samples with portrayed images in the background.

Thus, we have prepared a set of data samples without any augmentation. During this preparation, we noticed some real videos in which people pass by in the background, which causes the real video to be classified as a fake one. In addition, there are videos which include portraits in the background, which also causes the real video to be classified as fake. Samples of these cases are shown in figure 3.

In order to make sure that the model is robust against a change of person, the set of images related to one particular person has been extracted from the data (which contains samples from multiple persons) and is used as the test data, for both the real and fake images. The rest of the dataset has been divided into training and validation subsets.
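A subject-disjoint split of this kind can be sketched as below. The sample tuple layout, the function name, and the validation fraction of 0.2 are assumptions made for illustration; the ratio actually used in the paper is not recoverable from the text.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str, int]  # (image_path, person_id, label); label 1 = real, 0 = fake


def subject_disjoint_split(samples: List[Sample], test_person: str,
                           val_fraction: float = 0.2,  # assumed value
                           seed: int = 0) -> Dict[str, List[Sample]]:
    """Hold out every image of `test_person`, then split the rest into train/validation."""
    test = [s for s in samples if s[1] == test_person]
    rest = [s for s in samples if s[1] != test_person]
    random.Random(seed).shuffle(rest)       # reproducible shuffle
    n_val = int(len(rest) * val_fraction)
    return {"train": rest[n_val:], "val": rest[:n_val], "test": test}


# Toy data: 5 persons, 4 images each, alternating real/fake labels.
toy_samples = [(f"p{p}_img{i}.png", f"p{p}", i % 2)
               for p in range(1, 6) for i in range(4)]
split = subject_disjoint_split(toy_samples, test_person="p3")
```

Because the held-out person never appears in the training or validation subsets, the test score reflects generalization to an unseen identity rather than memorization of faces.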

III-B The Proposed Architecture

After gathering and cleaning the data, a state-of-the-art EfficientNet architecture [23] has been employed to perform the binary classification (i.e., real or fake images), as depicted in figure 4.

Fig. 4: Anti-spoofing architecture based on EfficientNet B0. Transfer learning has been involved for the weights of the network.

This network uses the B0 model, pretrained on the ImageNet dataset, only as the initialization. All layers are trainable, and a stack of fully connected (FC) layers has been added, with the swish [16] activation function in the first two FC layers and the softmax and tanh activation functions for the final layers. Moreover, dropconnect [25] and batch normalization [8] have been applied between each two layers to avoid overfitting. The optimization method employed is the Rectified-Adam [11] minimization algorithm.

Since the ongoing binary classification is performed over imbalanced data, as mentioned earlier, monitoring the accuracy of the network is not reasonable. The evaluations are therefore presented using the F1-score, which combines precision and recall as

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Furthermore, due to the classification nature of the problem the binary cross entropy has been chosen as the loss function.
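The F1-score and the binary cross-entropy loss mentioned above can be sketched in a few lines. This is an illustrative implementation using the standard definitions; treating "real" as the positive class (label 1) and the toy 9:1 class ratio are assumptions of the example, not values from the paper.

```python
import math
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Imbalanced toy set: 9 fake (0) and 1 real (1) sample.
y_true = [0] * 9 + [1]
always_fake = [0] * 10  # a degenerate classifier that never predicts "real"
accuracy = sum(t == p for t, p in zip(y_true, always_fake)) / len(y_true)
_, _, f1_always_fake = precision_recall_f1(y_true, always_fake)
```

On this toy set the degenerate classifier reaches an accuracy of 0.9 while its F1-score is 0, which is why accuracy alone is not a reasonable metric under class imbalance.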

During the training the following results, as in table I, have been achieved:

Split        Loss     F1-score
Train        0.0204   99.37
Validation   0.0137   1.0
TABLE I: Training and validation loss and F1-score for the architecture of figure 4

Because dropconnect is applied only during training, the training loss is higher than the validation loss (Table I), as depicted in figure 5.
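To illustrate why a regularizer such as dropconnect makes the network behave differently in training and evaluation mode, the following sketch applies it to a single linear layer. It is a simplified illustration: it uses the common inverted-scaling convention (kept weights are divided by the keep probability) rather than the inference procedure of the original dropconnect paper [25], and the function name and signature are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def dropconnect_linear(x: List[float], weights: Matrix, drop_prob: float,
                       training: bool, rng: random.Random) -> List[float]:
    """Linear layer y = W x in which individual weights are dropped while training.

    weights[i][j] connects input j to output i. In evaluation mode the full,
    unperturbed weight matrix is used.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    outputs = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if training:
                # Keep the weight with probability keep_prob and rescale it,
                # otherwise zero out this single connection.
                w = w / keep_prob if rng.random() < keep_prob else 0.0
            acc += w * xi
        outputs.append(acc)
    return outputs


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
eval_out = dropconnect_linear(x, W, drop_prob=0.5, training=False, rng=random.Random(0))
train_out = dropconnect_linear(x, W, drop_prob=0.5, training=True, rng=random.Random(0))
```

The evaluation output is always `W x`, whereas the training output depends on the random mask, so losses measured in the two modes are not directly comparable.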

Fig. 5: Results of the F1-score, and loss of the training and validation for the architecture being proposed in figure 4.

The optimum parameters have been obtained after epochs, and the model needs MBytes to be saved. In addition, the confusion matrix for the unseen test data is shown in figure 6.


Fig. 6: The confusion matrix for the EfficientNet test data.

As depicted in figure 6, the proposed model performs quite well; however, due to its high number of parameters and its use of the swish function, it is not well suited for client-side implementations. Instead, this architecture is well qualified for server-side implementations.

III-C Low Weight Model-Client Side

Another prevalent architecture which could be incorporated to perform the task is MobileNet V2 [17], which uses separable convolutions composed of depth-wise and point-wise convolutions (as in Xception [6]). In this work, we have used a minimal structure of such a model, which uses the same separable convolution logic with fewer parameters. A visualization of the final layer of MobileNet V2 trained on the ImageNet dataset is depicted in figure 7. In order to visualize the layer, the softmax has been omitted and the output has been passed through a linear activation.
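The parameter saving of a depth-wise separable convolution can be checked with a short calculation. The sketch below counts weights only (biases and normalization parameters are ignored), and the 3x3 kernel with 32 input and 64 output channels is an arbitrary example, not a layer taken from the paper.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise separable convolution."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes the channels
    return depthwise + pointwise


standard = standard_conv_params(3, 32, 64)
separable = separable_conv_params(3, 32, 64)
reduction = standard / separable
```

For this example the standard layer has 18432 weights and the separable one 2336, a reduction of roughly a factor of eight.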

Fig. 7: Visualized final layer of the MobileNet. (Top, left to right): Goldfish, White shark, Stingray; (Bottom, left to right): Hen, Ostrich, Brambling
Fig. 8: The primary level kernels for MobileNet V2, pretrained on imageNet.

In addition to the previously shown output of the fully connected layer, we take a look at the primary convolution layer and some middle layers as well. These are depicted in figures 8 and 9.

Fig. 9: (Top to Bottom Rows): The low level kernels for MobileNet V2, The mid-level kernels for MobileNet V2, The high-level kernels for MobileNet V2, all of them are pretrained on imageNet dataset.

These figures indicate that the network has reached a high perception level for the ImageNet dataset with its numerous classes. However, in our work we are interested in only two classes, namely real and fake face images. Therefore, the basic network can be simplified with respect to the filters used in the convolution layers, and the number of parameters can be drastically decreased. In our proposed architecture, the number of filters in each layer is a fraction of the original one, and the input size has been reduced to the minimum one. These changes reduce both the model volume and the number of parameters. By applying the deployment model conversion techniques provided by TensorFlow (e.g., TF-Lite conversion and quantization), an even more compact model volume can be achieved.


Fig. 10: Low-weight anti-spoofing architecture based on MobileNet V2. A contraction has been applied to the original network, and the remnant weights are used as the initializers.

In our implementation, we used these techniques. Using a GTX 1080 graphics card, we could increase the batch size up to 718 samples, and group normalization [26] has been employed to handle the large batch. The evaluation results of training and testing the proposed light model are depicted in figure 11.
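Group normalization [26] computes its statistics per sample over groups of channels, so its behaviour does not depend on the batch size. The sketch below illustrates this for one sample; it omits the learnable scale and shift parameters, and the data layout (one flattened list of activations per channel) is an assumption chosen to keep the example dependency-free.

```python
import math
from typing import List


def group_norm(channels: List[List[float]], num_groups: int,
               eps: float = 1e-5) -> List[List[float]]:
    """Normalize the activations of ONE sample group by group.

    channels[c] holds the flattened activations of channel c. Every group of
    consecutive channels is normalized with its own mean and variance.
    """
    if len(channels) % num_groups:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(channels) // num_groups
    normalized = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.extend([[(v - mean) * scale for v in channel] for channel in group])
    return normalized


# Four channels with two activations each, normalized in two groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normed = group_norm(sample, num_groups=2)
```

Each group ends up with approximately zero mean and unit variance regardless of how many other samples are in the batch.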

Fig. 11: The results of the low-weight proposed architecture of figure 10 for training and validation data.

The achieved results are for epochs for each metric.

Split   Loss     Accuracy   Precision   Recall   F1
Train   0.0395   1.0        1.0         1.0      100
Valid   0.1654   96.29      97.81       100      95.69
TABLE II: Training and validation loss, accuracy, precision, recall, and F1 score of the proposed low-weight architecture.
Fig. 12: The confusion matrix of the proposed low-weight architecture on the test data.

By observing the visualized layers of the network, which has been trained on the ImageNet dataset with its many classes, we can deduce that for our binary classification problem the low-level kernels of the initial layers are not expected to produce strongly discriminative features, as opposed to the higher-level layers, and can therefore be reduced. Thus, we removed half of the filters from the initial convolution layer, and a percentage of the rest. Figure 8 depicts the first-layer kernels of the MobileNet V2 network. This many kernels would not be very informative for a binary classification purpose. Thus, a network width controller coefficient has been used in our experiments to achieve an optimum filter width within the MobileNet V2 network. Hence, what we have performed in our work is not exactly transfer learning: we have contracted the pretrained MobileNet V2 and used the remaining weights of the contracted network as the initial weights, as depicted in figure 9, followed by our customized MLP stack of dense layers with group normalization due to the huge batch size, as depicted in figure 10.
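The effect of a width controller coefficient on the filter counts can be sketched as follows. The rounding rule mirrors the "make divisible" helper commonly used in MobileNet reference implementations, the base list holds the stage output channels of the original MobileNet V2, and the coefficient of 0.5 is an assumed value for illustration, since the coefficient used in the paper is not recoverable from the text.

```python
from typing import List


def make_divisible(value: float, divisor: int = 8) -> int:
    """Round to the nearest multiple of `divisor` without dropping more than 10%."""
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scale_filters(filters: List[int], alpha: float) -> List[int]:
    """Apply the width coefficient `alpha` to every layer's filter count."""
    return [make_divisible(f * alpha) for f in filters]


# Stage output channels of the original MobileNet V2 (stem followed by bottleneck stages).
BASE_FILTERS = [32, 16, 24, 32, 64, 96, 160, 320]
ALPHA = 0.5  # assumed width coefficient, for illustration only
contracted = scale_filters(BASE_FILTERS, ALPHA)
```

With these assumptions the first convolution shrinks from 32 to 16 filters, which matches the statement that half of the filters of the initial layer were removed; the remaining stages shrink accordingly.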

Fig. 13: (Left) The visualized dense layer of the proposed low-weight model; (the matrix of images) kernels of the highest layer of the proposed architecture of figure 10

IV Experiments and Analytics

The confusion matrix of the low-weight anti-spoofing network over the ROSE-Youtu dataset is depicted in figure 12. The imbalanced nature of the data has impacted the real image detection outcome, compared to the EfficientNet B0 model, explained in the previous section.

As depicted in figure 11, the training process is even faster than for the original MobileNet model, since the number of parameters is dramatically decreased. However, the validation curve clearly shows a bias with respect to the training curve. The reason is likely the tremendous reduction in the number of parameters, which pushes the network toward underfitting. As in figure 7, the final layer of our proposed network, which performs the binary classification task, is visualized in figure 13 after the training phase is completed.

The results of the proposed low-weight architecture, as depicted in figure 11 and table II, verify its suitability for client-side use.

As depicted in figure 14, the gradCAM [19] attention visualization for the upper-mask image focuses on the eyes, which have an unusual depth. For the full-mask image, both the eyes and the mouth have grabbed the attention, and for the replay image and the photo image the attention distribution over the face has become scattered almost randomly. For the real face, however, the attention is mostly on the chin and is distributed regularly.
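The core of the gradCAM computation [19] can be sketched without a deep learning framework, assuming that the feature maps of the chosen convolution layer and the gradients of the class score with respect to them are already available. The function below only shows how these two quantities are combined into a heat map; obtaining them from a trained network is outside the scope of this sketch, and the toy inputs are made up.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Combine K feature maps into one class-specific heat map.

    activations[k] is the k-th feature map of the chosen convolution layer and
    gradients[k] is the gradient of the class score with respect to that map.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel importance: global average of each gradient map.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only the regions that support the class.
    return [[max(0.0, v) for v in row] for row in cam]


# Two toy 2x2 feature maps: the first supports the class, the second opposes it.
toy_activations = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat_map = grad_cam(toy_activations, toy_gradients)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image to obtain visualizations like those discussed above.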

Fig. 14: (Top-2-down rows): Test data (left to right: Upper mask, full mask, replay, Photo, real images); Saliency features of the test data [20]; gradCAM attention visualization graph over the test data samples.

V Conclusion

Two end-2-end attention-based face anti-spoofing models have been proposed in this paper, one for server-side and the other for client-side implementations, both of which use merely the RGB images of the camera. These models require no auxiliary data (e.g., depth, IR) and perform remarkably well on the real and fake discrimination task. The proposed model based on EfficientNet B0 has performed very well on the dataset, which enables it to be used in flagship mobile devices containing NPUs (dedicated Neural Processing Units), or on the server side. The proposed low-weight architecture requires very few parameters and little storage, which enables it to be used efficiently in mobile handsets. Various attacks have been tested, and both the heavy-weight and the low-weight architectures have performed quite well on the fake data inputs, which verifies the robustness of the proposed models.


  • [1] A. Albiol, D. Monzo, A. Martin, J. Sastre, and A. Albiol (2008) Face recognition using hog–ebgm. Pattern Recognition Letters 29 (10), pp. 1537–1543. Cited by: §I.
  • [2] T. Baltrušaitis, P. Robinson, and L. Morency (2016) Openface: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10. Cited by: §I.
  • [3] Z. Boulkenafet, J. Komulainen, and A. Hadid (2016) Face spoofing detection using colour texture analysis. IEEE Transactions on Information Forensics and Security 11 (8), pp. 1818–1830. Cited by: §I.
  • [4] F. Chen, C. Wen, K. Xie, F. Wen, G. Sheng, and X. Tang (2019) Face liveness detection: fusing colour texture feature and deep feature. IET Biometrics 8 (6), pp. 369–377. Cited by: §I.
  • [5] H. Chen, G. Hu, Z. Lei, Y. Chen, N. M. Robertson, and S. Z. Li (2019) Attention-based two-stream convolutional networks for face spoofing detection. IEEE Transactions on Information Forensics and Security 15, pp. 578–593. Cited by: §I.
  • [6] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: §III-C.
  • [7] A. Ghofrani, R. M. Toroghi, and S. Ghanbari Realtime face-detection and emotion recognition using mtcnn and minishufflenet v2. In 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI), pp. 817–821. Cited by: §III-A.
  • [8] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §III-B.
  • [9] L. Li, X. Feng, Z. Xia, X. Jiang, and A. Hadid (2018) Face spoofing detection with local binary pattern network. Journal of visual communication and image representation 54, pp. 182–192. Cited by: §I, §III-A.
  • [10] B. Lin, X. Li, Z. Yu, and G. Zhao (2019) Face liveness detection by rppg features and contextual patch-based cnn. In Proceedings of the 2019 3rd International Conference on Biometric Engineering and Applications, pp. 61–68. Cited by: §I.
  • [11] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2019) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §III-B.
  • [12] Y. Liu, A. Jourabloo, and X. Liu (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 389–398. Cited by: §I.
  • [13] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. (2015) Deep face recognition.. In bmvc, Vol. 1, pp. 6. Cited by: §I.
  • [14] A. Parkin and O. Grinchuk (2019-06) Recognizing multi-modal face spoofing with face recognition networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §I.
  • [15] M. A. Rahim, M. S. Azam, N. Hossain, and M. R. Islam (2013) Face recognition using local binary patterns (lbp). Global Journal of Computer Science and Technology. Cited by: §I.
  • [16] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 7. Cited by: §III-B.
  • [17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §III-C.
  • [18] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §I.
  • [19] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §IV.
  • [20] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: Fig. 14.
  • [21] A. Singh, P. Joshi, and G. Nandi (2014-07) Face recognition with liveness detection using eye and mouth movement. Cited by: §I.
  • [22] L. Song and C. Liu (2018) Face liveness detection based on joint analysis of rgb and near-infrared image of faces. Electronic Imaging 2018 (10), pp. 373–1. Cited by: §I.
  • [23] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §III-B.
  • [24] X. Tan, Y. Li, J. Liu, and L. Jiang (2010) Face liveness detection from a single image with sparse low rank bilinear discriminative model. In European Conference on Computer Vision, pp. 504–517. Cited by: §I, §III-A.
  • [25] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §III-B.
  • [26] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §III-C.
  • [27] J. Zhang, Y. Yan, and M. Lades (1997) Face recognition: eigenface, elastic matching, and neural nets. Proceedings of the IEEE 85 (9), pp. 1423–1435. Cited by: §I.
  • [28] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §III-A.
  • [29] S. Zhang, A. Liu, J. Wan, Y. Liang, G. Guo, S. Escalera, H. J. Escalante, and S. Z. Li (2019) CASIA-surf: a large-scale multi-modal benchmark for face anti-spoofing. arXiv preprint arXiv:1908.10654. Cited by: §III-A.
  • [30] J. Zhou, C. Ge, J. Yang, H. Yao, X. Qiao, and P. Deng (2019-Sep.) Research and application of face anti-spoofing based on depth camera. In 2019 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI), pp. 225–229. Cited by: §I.