Face anti-spoofing has always been a key challenge for face verification and recognition systems. Conventional face anti-spoofing systems used eigenfaces, HoG (Histogram of Oriented Gradients), or LBP (Local Binary Pattern) features to perform the task, whereas recent systems mostly rely on deep neural features such as DeepFace, FaceNet, and OpenFace.
The increasingly pervasive use of IR and depth sensors has greatly simplified the discrimination of real and fake images through image processing techniques. However, using these sensors is not always feasible, and moreover, the previously distributed products of most companies are not equipped with these technologies.
In this paper, we address the problem of face anti-spoofing using only the RGB frames of traditional cameras, without any auxiliary data, which is the most challenging setting in this area.
To cope with the facial liveness detection challenge, several datasets with their own specific properties have been released so far. NUAA imposter, which is publicly available, contains 7509 fake and 5105 real images; however, this volume of data is insufficient and limiting, considering the scale of data generally required for deep learning. CASIA-SURF is another dataset, recently published in CVPR 2019, which contains all three types of data (RGB, IR, and depth samples) and is useful for multi-modal systems. This dataset contains 9608 training, as well as validation, data samples.
The historically influential works in the anti-spoofing area follow four major approaches. The first is the texture-based methods, which combine hand-crafted features such as HoG and LBP with traditional classifiers such as SVM to perform the task [3, 9]. The temporal-based methods, on the other hand, either use facial motion patterns (e.g., eye blinking) or exploit the relative movement between the face and the background, employing methods such as optical flow to track the movement of the face in order to discriminate real faces from fake ones. Some 3D structure-based methods have also been developed, which either extract depth information from 2D images or analyze the 3D shape information recorded with 3D sensors, and then compare the 3D model of the input sample with that of a genuine face. This approach, however, requires specific 3D devices, which are not easily available and tend to be costly. Finally, the rPPG (remote photoplethysmography) methods extract the pulse signal from facial videos without any skin contact [10, 22]. Nevertheless, all these systems are highly vulnerable to fake face attacks and masks, and may not cope with these attacks without the assistance of auxiliary data such as depth or IR information. In recent years, deep learning-based methods have been pervasively used for many detection and recognition tasks, including anti-spoofing [4, 5].
In this paper, we address the anti-spoofing and liveness detection problem using an end-to-end system. The novelty of this work is multi-fold: 1) no hand-crafted features (e.g., HoG and LBP) are utilized; 2) the proposed system requires no auxiliary data (e.g., depth, IR) and relies solely on the input RGB images; 3) although a deep neural network structure is employed, the entire system is light, portable, and deployable on hand-held devices and mobile phones with ordinary commercial processors; 4) the proposed system can deal with all types of fake images, such as depth-wise masks and high-resolution display replays; 5) each of the aforementioned achievements raises some challenging issues, and the proposed system is able to cope with these issues effectively.
The outline of this paper is as follows. The next section explains the background underlying the proposed system. Then, the experimental results and their analysis are presented. The conclusion closes the paper and is followed by the cited references.
II Related Works
The problem of face anti-spoofing has been nearly solved for flagship devices. For example, the Face ID service on the iPhone X creates a 3D mesh of the face using a dot projector, flood illumination, and IR sensors, along with dedicated neural network hardware (the Neural Engine). Other brands use similar mechanisms to cope with the anti-spoofing problem.
On the other hand, there are a number of mid-range mobile handsets and previously sold devices which lack these sensors and processing units. Furthermore, many verification tasks are performed using laptop webcams, which lack these mechanisms entirely. These issues motivate the work on a model which can address the problem using RGB images only.
III The Proposed Framework
The problem of face anti-spoofing can be cast as a binary classification problem, which attempts to discriminate between real and fake images. However, fake samples normally outnumber real ones, owing to the large number of attack types and the variations of fake images within each type that could be presented to the system. Hence, the system is likely to be trained on imbalanced data.
III-A Dataset Preparation Task
Several issues hinder the preparation of clean data for anti-spoofing purposes. For instance, a person passing by in the background, or portraits in the background, can easily leak into the data if no preprocessing is performed. Such outliers have to be discarded. The functional flow graph of the proposed system for appropriate and reliable data preparation is depicted in figure 1.
Various datasets have been created to handle the liveness detection problem. NUAA imposter, which is publicly available, contains only 7509 fake and 5105 real images, and this volume of data is not sufficient for deep network-based applications. CASIA-SURF is a recently released dataset which contains RGB, IR, and depth data samples and is appropriate for multi-modal systems. It contains 9608 training, as well as validation, data samples. This amount of data does not suffice for training deep models either. The dataset used in this paper is ROSE-Youtu, which contains real videos and their corresponding fake videos. The data samples are therefore not provided as extracted images. Some sample images taken from this dataset are depicted in figure 2.
In order to extract the images from the ROSE-Youtu dataset, we employed the MTCNN network for face detection. To do so, we loaded every single video from the videos in the dataset and split it into frames. Within each frame, face detection has been performed using MTCNN. Then, the face region has been cropped out and placed in its associated class of data.
Thus, we have prepared a set of data samples without any augmentation. During this preparation, we noticed some real videos in which people pass by in the background, which causes the real video to be classified as a fake one. In addition, some videos include portraits in the background, which also causes the real video to be classified as fake. Samples of these cases are shown in figure 3.
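One way to discard such outlier frames can be sketched as follows. This is a minimal illustration under stated assumptions, not the filtering rule used in this work: it assumes that each detection is a dict with a `box` entry `[x, y, width, height]` and a `confidence` entry (a hypothetical format), and it assumes that any frame with zero or more than one confident face is simply dropped. The function name and the `min_confidence` threshold are hypothetical.

```python
def select_single_face(detections, frame_width, frame_height, min_confidence=0.9):
    """Return a clamped (x1, y1, x2, y2) crop box, or None to discard the frame.

    Assumption: `detections` is a list of dicts with a 'box' entry
    [x, y, width, height] and a 'confidence' entry (hypothetical format).
    """
    # Keep only detections the detector is reasonably sure about.
    confident = [d for d in detections if d["confidence"] >= min_confidence]

    # Frames with no face, or with extra faces (passers-by, portraits),
    # are treated as outliers and dropped.
    if len(confident) != 1:
        return None

    x, y, w, h = confident[0]["box"]

    # Clamp the box to the frame, since a detector may return
    # coordinates slightly outside the image.
    x1, y1 = max(0, x), max(0, y)
    x2, y2 = min(frame_width, x + w), min(frame_height, y + h)
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)
```

The returned box can then be used to crop the face region from the frame before the crop is stored under its class.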
In order to make sure that the model is robust to a change of subject, the set of images belonging to one particular person has been held out of the data (which contains the samples from persons) and is used as the test data, for both the real and fake images. The rest of the dataset has been divided into and for training and validation purposes, respectively.
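The subject-wise hold-out described above can be sketched as follows. This is only an illustration: the sample layout `(subject_id, label, path)`, the function name, and the `val_fraction` value are assumptions, since the actual split ratio is not recoverable from the text.

```python
import random


def split_by_subject(samples, test_subject, val_fraction=0.2, seed=0):
    """Hold out one subject for testing and split the rest into train/val.

    Assumption: `samples` is a list of (subject_id, label, path) tuples.
    `val_fraction` is a hypothetical value, not the ratio used in the paper.
    """
    # All real and fake images of the held-out person form the test set.
    test = [s for s in samples if s[0] == test_subject]
    rest = [s for s in samples if s[0] != test_subject]

    # Shuffle the remaining samples reproducibly before splitting.
    rng = random.Random(seed)
    rng.shuffle(rest)

    n_val = int(round(len(rest) * val_fraction))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

Holding out a whole subject, instead of random frames, keeps near-duplicate frames of the same person from appearing in both the training and the test data.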
III-B The Proposed Architecture
After gathering and cleaning the data, the state-of-the-art EfficientNet architecture has been employed to perform the binary classification (i.e., real or fake images), as depicted in figure 4.
This network uses the B0 model, pretrained on the ImageNet dataset, only as the initialization. All layers are trainable, and a stack of fully connected (FC) layers ( neurons in each layer) is used, with the swish activation function in the first two FC layers and the softmax and tanh activation functions for the final layers. Moreover, dropconnect and batch normalization are applied between every two layers to avoid overfitting. The total number of parameters in the model is , and the optimization method employed is the Rectified-Adam minimization algorithm.
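For reference, the swish activation mentioned above is commonly defined as x · sigmoid(βx). The following is a minimal scalar sketch of that definition, not the framework implementation used in this work; the function names and the default `beta=1.0` are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x). beta=1.0 is an assumed default."""
    return x * sigmoid(beta * x)
```

Unlike ReLU, swish is smooth and lets small negative values pass through, at the cost of evaluating an exponential per activation.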
Since this binary classification is performed over imbalanced data, as mentioned earlier, monitoring the accuracy of the network alone is not reasonable. The evaluations are therefore presented using the F1-score, which combines precision and recall as their harmonic mean:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
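The F1-score, the harmonic mean of precision and recall, can be computed directly from confusion-matrix counts. The sketch below is illustrative; the function name and the toy counts are assumptions and are not results from this work.

```python
def f1_score(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy imbalanced case (hypothetical numbers): 90 fake and 10 real samples,
# and a degenerate classifier that labels everything as fake.
# Taking "real" as the positive class: tp = 0, fp = 0, fn = 10.
accuracy_all_fake = 90 / 100
f1_all_fake = f1_score(tp=0, fp=0, fn=10)
```

In this toy case the accuracy is 0.9 even though no real face is ever recognized, whereas the F1-score for the real class is 0, which is why accuracy alone is misleading on imbalanced data.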
Furthermore, due to the classification nature of the problem the binary cross entropy has been chosen as the loss function.
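As a reminder of what this loss measures, here is a minimal sketch of the mean binary cross entropy over a batch. The function name and the clipping constant `eps` are assumptions; deep learning frameworks provide their own implementations.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

The loss is small when the predicted probability agrees with the label and grows rapidly for confident wrong predictions.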
During the training the following results, as in table I, have been achieved:
Using dropconnect has caused the training loss to be less than the validation loss, as depicted in figure 5.
The optimum parameters have been obtained after epochs, and the model needs MBytes to be saved. In addition, the confusion matrix for the unseen test data is shown in figure 6.
As depicted in figure 6, the proposed model performs quite well; however, due to the high number of parameters and the use of the swish function, it is not well suited for client-side implementations. This architecture is instead well qualified for server-side implementations.
III-C Low Weight Model-Client Side
Another prevalent architecture which can be used to perform the task is MobileNet V2, which uses the separable CNN logic with depth-wise and point-wise convolutions (i.e., Xception). In this work, we have used a minimal structure of such a model, which uses the separable CNN logic with fewer parameters. A visualization of the final layer of the MobileNet V2 trained on the ImageNet dataset is depicted in figure 7. In order to visualize the layer, the softmax has been omitted and the output has been activated using a linear activation.
In addition to the previously shown output of the fully connected layer, we also take a look at the primary convolution layer and some middle layers. These visualizations are depicted in figure 10.
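To illustrate why the depth-wise plus point-wise factorization reduces the parameter count, the following sketch compares the weight counts of a standard convolution and a depth-wise separable one. Biases are ignored, and the layer sizes are hypothetical examples, not the configuration used in this work.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depth-wise (kernel x kernel per input channel)
    followed by a point-wise (1 x 1) convolution (bias ignored)."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)     # 18432 weights
separable = separable_conv_params(3, 32, 64)   # 288 + 2048 = 2336 weights
```

For this example layer the separable form needs roughly one eighth of the weights of the standard convolution.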
Inspecting these visualizations indicates that the network has reached a high level of perception for the ImageNet dataset with its numerous classes. However, in our work we are interested in only two classes, namely real and fake face images. Therefore, the basic network can be simplified with respect to the filters used in the convolution layers, and the number of parameters can be drastically decreased. In our proposed architecture, the number of filters in each layer is of the original one, and the input size has been reduced to the minimum one, which is . By applying these changes, the model volume and the number of parameters have been changed from MBytes with million parameters into MBytes with parameters, respectively. By applying the deployment model conversion techniques offered by TensorFlow (e.g., TF-Lite conversion and quantization), an even more compact model volume can be achieved, with KBytes.
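The size reduction obtained from quantization comes from storing weights as 8-bit integers instead of 32-bit floats. The sketch below shows generic affine int8 quantization in plain Python to convey the idea only. It is not the TF-Lite converter or its API, and all function names and the example weights are assumptions.

```python
def quantization_params(weights, qmin=-128, qmax=127):
    """Compute a scale and zero point mapping the weight range onto [qmin, qmax]."""
    # Include 0.0 in the range so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all weights are zero; any scale works
    zero_point = int(round(qmin - w_min / scale))
    zero_point = min(qmax, max(qmin, zero_point))
    return scale, zero_point


def quantize(weights, scale, zero_point, qmin=-128, qmax=127):
    """Map float weights to clamped 8-bit integers."""
    return [min(qmax, max(qmin, int(round(w / scale)) + zero_point)) for w in weights]


def dequantize(q_weights, scale, zero_point):
    """Recover approximate float weights from their integer codes."""
    return [scale * (q - zero_point) for q in q_weights]


# Hypothetical weights, used only to show the round trip.
example_weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
example_scale, example_zp = quantization_params(example_weights)
example_q = quantize(example_weights, example_scale, example_zp)
example_restored = dequantize(example_q, example_scale, example_zp)
```

Each weight is recovered to within about one quantization step, while the storage per weight drops from four bytes to one.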
In our implementation, we used these techniques, and using a GTX 1080 graphics card with GBytes of RAM, we could increase the batch size up to 718 samples; to handle the large batch, group normalization has been employed. The evaluation results of training and testing the proposed light model are depicted in figure 11.
The achieved results are for epochs for each metric.
By observing the visualized layers of the network, which has been trained on the ImageNet dataset containing classes, we can deduce by induction that, for our binary classification problem, the low-level kernels of the initial layers are not expected to produce strongly discriminative features, as opposed to the higher-level layers, and therefore they can be reduced. Thus, we removed half of the filters from the initial convolution layer, and a percentage of the rest. Figure 8 depicts the first-layer kernels of the MobileNet V2 network. Evidently, this many kernels would not be very informative for binary classification. Thus, the network width controller coefficient of has been used in our experiments to achieve an optimum filter width within the MobileNet V2 network. Hence, what we have performed in our work is not exactly transfer learning. We contracted the pretrained MobileNet V2 and then used the initial weights of the contracted network, as depicted in figure 9, followed by our customized MLP stack ( dense layers) with group normalization due to the huge batch size, as depicted in figure 10.
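The effect of a width controller coefficient on the per-layer filter counts can be sketched as follows. The rounding helper follows the divisible-by-8 convention commonly used in MobileNet reference implementations. The coefficient `alpha=0.5` and the base widths listed here are illustrative assumptions, not the values used in this work.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round a channel count to the nearest multiple of `divisor`,
    never dropping more than 10% below the requested value."""
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if new_value < 0.9 * value:
        new_value += divisor
    return new_value


def scale_widths(base_widths, alpha):
    """Apply the width coefficient `alpha` to every layer's filter count."""
    return [make_divisible(w * alpha) for w in base_widths]


# Illustrative base filter counts (assumed) and a hypothetical alpha.
base_widths = [32, 16, 24, 32, 64, 96, 160, 320]
scaled_widths = scale_widths(base_widths, alpha=0.5)
```

Since the weights of a point-wise convolution scale with the product of input and output channels, halving the widths cuts those weights to roughly a quarter.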
IV Experiments and Analytics
The confusion matrix of the low-weight anti-spoofing network over the ROSE-Youtu dataset is depicted in figure 12. Compared to the EfficientNet B0 model explained in the previous section, the imbalanced nature of the data has affected the detection of real images.
As depicted in figure 11, the training process is even faster than for the original MobileNet model, since the number of parameters is dramatically decreased. However, the validation curve clearly shows a bias with respect to the training curve. The likely reason is the tremendous reduction in the number of parameters, which pushes the network toward underfitting. As in figure 7, the final layer of our proposed network, which performs the binary classification task, is visualized in figure 13 after the training phase is completed.
As depicted in figure 14, the Grad-CAM attention visualization for the Up-mask image focuses on the eyes, which have an unusual depth. For the full-mask image, both the eyes and the mouth attract the attention, and for the replay image and the photo image the attention is scattered almost randomly over the face. For the real face, however, the attention is mostly on the chin and is distributed regularly.
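For readers unfamiliar with Grad-CAM, the core computation behind such attention maps can be sketched in plain Python. The sketch assumes that the feature maps of a convolution layer and the gradients of the class score with respect to them are already available as nested lists of shape [channels][height][width]. The function name and the toy inputs are assumptions, and this is not the visualization code used to produce figure 14.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from feature maps and their gradients.

    Both arguments are nested lists of shape [K][H][W] (assumed layout).
    """
    num_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients of each feature map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of the feature maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalize to [0, 1] for display, if the map is not all zeros.
    peak = max(max(row) for row in cam)
    if peak > 0.0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy 2-channel, 2x2 example (hypothetical values).
toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
toy_cam = grad_cam(toy_activations, toy_gradients)
```

In the toy example, only the location supported by the positively weighted channel stays active; the location belonging to the negatively weighted channel is removed by the ReLU.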
Two end-to-end attention-based face anti-spoofing models have been proposed in this paper, one for server-side and the other for client-side implementations, both of which use only the RGB images of the camera. These models require no auxiliary data (e.g., depth, IR) and perform remarkably well on the real and fake discrimination task. The proposed model based on EfficientNet B0 has performed perfectly well on the dataset, which enables it to be used in flagship mobile devices containing NPUs (dedicated Neural Processing Units) or on the server side. The proposed low-weight architecture requires very few parameters and a small model volume, which enables it to be used efficiently in mobile handsets. Various attacks have been tested, and both the heavy-weight and the low-weight architectures have performed quite well on the fake data inputs, which verifies the robustness of the proposed models.
-  (2008) Face recognition using hog–ebgm. Pattern Recognition Letters 29 (10), pp. 1537–1543. Cited by: §I.
- Openface: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10. Cited by: §I.
-  (2016) Face spoofing detection using colour texture analysis. IEEE Transactions on Information Forensics and Security 11 (8), pp. 1818–1830. Cited by: §I.
- Face liveness detection: fusing colour texture feature and deep feature. IET Biometrics 8 (6), pp. 369–377. Cited by: §I.
-  (2019) Attention-based two-stream convolutional networks for face spoofing detection. IEEE Transactions on Information Forensics and Security 15, pp. 578–593. Cited by: §I.
-  (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: §III-C.
-  Realtime face-detection and emotion recognition using mtcnn and minishufflenet v2. In 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI), pp. 817–821. Cited by: §III-A.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §III-B.
-  (2018) Face spoofing detection with local binary pattern network. Journal of visual communication and image representation 54, pp. 182–192. Cited by: §I, §III-A.
-  (2019) Face liveness detection by rppg features and contextual patch-based cnn. In Proceedings of the 2019 3rd International Conference on Biometric Engineering and Applications, pp. 61–68. Cited by: §I.
- On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §III-B.
-  (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 389–398. Cited by: §I.
-  (2015) Deep face recognition.. In bmvc, Vol. 1, pp. 6. Cited by: §I.
-  (2019-06) Recognizing multi-modal face spoofing with face recognition networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §I.
-  (2013) Face recognition using local binary patterns (lbp). Global Journal of Computer Science and Technology. Cited by: §I.
-  (2017) Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 7. Cited by: §III-B.
-  (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §III-C.
-  (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §I.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §IV.
-  (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: Fig. 14.
-  (2014-07) Face recognition with liveness detection using eye and mouth movement. Cited by: §I.
-  (2018) Face liveness detection based on joint analysis of rgb and near-infrared image of faces. Electronic Imaging 2018 (10), pp. 373–1. Cited by: §I.
-  (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §III-B.
-  (2010) Face liveness detection from a single image with sparse low rank bilinear discriminative model. In European Conference on Computer Vision, pp. 504–517. Cited by: §I, §III-A.
- Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §III-B.
-  (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §III-C.
-  (1997) Face recognition: eigenface, elastic matching, and neural nets. Proceedings of the IEEE 85 (9), pp. 1423–1435. Cited by: §I.
-  (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §III-A.
-  (2019) CASIA-surf: a large-scale multi-modal benchmark for face anti-spoofing. arXiv preprint arXiv:1908.10654. Cited by: §III-A.
-  (2019-Sep.) Research and application of face anti-spoofing based on depth camera. In 2019 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI), pp. 225–229. Cited by: §I.