Face recognition techniques have been increasingly deployed in everyday scenarios for authentication purposes, such as mobile device unlocking and door access control. Compared with other biometric information, using faces for authentication is more user-friendly because face verification is non-intrusive and face images can be easily captured with mobile phone cameras. However, it has been widely recognized that state-of-the-art face recognition systems are still vulnerable to spoofing attacks. Attackers can easily hack a face recognition system by presenting a spoofing face of a client to the system’s camera, where a spoofing face could be a face mask or a face image shown on a printed photo or a digital display. Therefore, reliable Face Anti-Spoofing (FAS) techniques are highly desired and essential for developing secure face recognition systems.
The past few years have witnessed much progress on the FAS problem. Traditionally, various techniques have been proposed to extract handcrafted features with image descriptors as representations, in either the spatial or the Fourier domain [FAS-CoALBP-AIVT-2012P, DB-IDIAP-RA, FAS-ColorTexture-TIFS-2016, DoG-ECCV-2010, FAS-LPQ-TIFS-2015]. These features are usually used to train a Support Vector Machine (SVM) to classify genuine and spoofing examples. However, these features are insufficiently discriminative because the descriptors (e.g., Local Binary Pattern) were not originally designed for the FAS problem.
Recently, deep-learning-based methods, which aim to learn discriminative representations in an end-to-end manner, have been shown to be more effective countermeasures against spoofing attacks than the traditional methods. Yang et al. [FAS-CNN-ComputerScience-2014] are the first to introduce the Convolutional Neural Network (CNN) to the FAS task. They train an AlexNet-based model [CV-AlexNet-CVPR-2012], extract features from the model’s last layer, and learn an SVM with binary labels (“genuine” or “spoofing”) for classification. Besides using binary labels, Liu et al. [FAS-Auxiliary-CVPR-2018] seek auxiliary supervision signals. They use auxiliary techniques to extract pseudo depth maps and remote photoplethysmography (rPPG) signals from the RGB images as supervision to boost training. It is also reported that the Recurrent Neural Network (RNN) can be used to exploit temporal information from sequential frames for face anti-spoofing [FAS-LSTMCNN-ICASSP-2018, FAS-CVPR2019-STASN]. However, one limitation of the aforementioned techniques is that the learned feature representations may overfit to the properties of a particular database. For example, depth information can benefit face anti-spoofing when the suspicious input is in a 2D format (e.g., printed photo, screen display), but it is likely to fail against mask attacks, which carry 3D information (e.g., Fig. 1(a)). To learn more spoofing-discriminative representations and alleviate the overfitting effect, we propose a novel two-branch framework based on CNN and RNN.
This work is motivated by 1) the observation that spoofing clues can appear in various ways and 2) how human beings act to predict whether a presented face example is genuine or spoofing. Regarding the first motivation, we show motivating examples in Fig. 1, which indicate that spoofing clues can be diverse. In some cases, spoofing clues are visually salient, such as the boundaries of a printed photo, the bezel of a digital display [patel2016secure], and reflections, which are respectively shown in Fig. 1(a), Fig. 1(b) and Fig. 1(c). These clues can be easily spotted and used by human beings to assess the examples as “spoofing” even without further careful observation. However, in most cases, human beings may make predictions with less certainty, as the aforementioned clues may be inconspicuous if an attacker carefully launches the spoofing attack. For instance, no paper boundary, bezel, or reflection appears in Fig. 1(d) and Fig. 1(e). Moreover, the visual quality of Fig. 1(d) and Fig. 1(e) is better than that of Fig. 1(a), Fig. 1(b) and Fig. 1(c). In other words, Fig. 1(d) and Fig. 1(e) look more similar to genuine faces, and thus human beings may not tell the difference with only a glance. Thus, to mine more information to confirm their assessment, human beings would carefully delve into local sub-patches to explore fine-scale and subtle spoofing clues. Fig. 2 portrays such behavior, which reveals our second motivation.
Under these two motivations, we propose a two-branch framework, DRL-FAS, that jointly exploits global and local features based on CNN, RNN, and deep reinforcement learning (DRL), for the face anti-spoofing (FAS) problem. Fig. 3 elaborates on this framework, which corresponds to Fig. 2. Firstly, we treat human beings’ glance at an example as the procedure of extracting global features. As such, we train a CNN to learn global information through the entire frames from video data. Then, we treat the following closer observations at suspicious sub-patches as the procedure of extracting local features. To model such observation behavior, we leverage reinforcement learning to learn a policy model that predicts locations of suspicious sub-patches and learn local information there with RNN. Finally, since human beings can benefit from both global and local information for better prediction, the extracted global and local features are fused for classification.
The contributions of this work are three-fold:
We propose a novel framework based on CNN and RNN for the FAS problem. Our framework aims to extract and fuse global and local features. While many previous works use RNN to leverage temporal information from video frames, we take advantage of RNN to memorize information from all “observations” of sub-patches and thereby gradually reinforce the extracted local features.
To explore spoofing-specific local information, we leverage the advantage of reinforcement learning to discover suspicious areas where discriminative local features can be extracted. To the best of our knowledge, in the field of FAS, this is the first attempt to introduce reinforcement learning for the optimization.
We conduct extensive experiments using six benchmark databases to evaluate our method. As shown in Section IV-C, our method can perform better than the schemes that either use global or local features. Moreover, our proposed method can generally achieve state-of-the-art performance compared with other methods.
II Related Works
II-A Traditional Face Anti-Spoofing
Most of the traditional FAS techniques focus on designing handcrafted features and learning classifiers with methods such as SVM. Texture-based methods are based on the assumption that genuine faces and spoofing faces differ in texture due to the inherent nature of different materials. In the Fourier spectrum, Tan et al. [DoG-ECCV-2010] propose to use Difference-of-Gaussian (DoG) features to describe the frequency disturbance caused by the recapture. Besides, Gragnaniello et al. [FAS-LPQ-TIFS-2015] propose to use Local Phase Quantization (LPQ) to analyze texture distortion through the phase of images. Also, texture descriptors such as Local Binary Pattern (LBP) and Scale-Invariant Feature Transform (SIFT) are used in the spatial domain to extract features that represent such disparities [FAS-MicroTexture-IJCB-2011, FAS-CoALBP-AIVT-2012P, CDD-ICB-2013, FAS-FisherVector-TIFS-2017, FAS-ColorTexture-TIFS-2016]. In addition, due to the distortion introduced during the recapture process, spoofing examples usually have lower visual quality than genuine ones. Motivated by this observation, the FAS community has also proposed to detect spoofing attack examples by assessing the input image quality [IQA-ICPR-2014, FAS-IDA-TIFS-2015, LI-QUALITY]. Apart from analyzing a single image, methods based on sequential video frames have also been proposed to utilize temporal information, such as eye blinking, lip movements [Motion-IJCB-2011, MotionLBP-ICB-2013, LBP-TOP-EJIVP-2014], and the motion blurring effect [TIFS-2019-MotionBlure]. Although such methods can be effective against photo attacks, they cannot counter video replay attacks, where such movement information can exist in the replayed video.
II-B Deep-Learning-Based Face Anti-Spoofing
Recently, deep learning has dominated the computer vision community, and the FAS field is no exception. Yang et al. [FAS-CNN-ComputerScience-2014] are the first to apply CNN to the FAS problem. The authors train a deep model based on the AlexNet [CV-AlexNet-CVPR-2012] architecture and extract features from the model’s last layer to train an SVM classifier. However, this is just a straightforward application of AlexNet, and the improvement over handcrafted features is limited. After that, more CNN-based methods are proposed [BottleCNN-JVCIR-2016, FAS-MSR-TIFS-2019, FAS-MultichannelCNN-TIFS-2020]. Moreover, RNN (e.g., Long Short-Term Memory networks [ML-LSTM-NC-1997] and Gated Recurrent Units [ML-GRU-ArXiv-2014]) can also be used for the FAS problem by leveraging temporal information from sequential video frames [FAS-LSTMCNN-ACPR-2015, FAS-LSTMCNN-ICASSP-2018]. So far, the aforementioned methods merely use binary labels (“genuine” and “spoofing”) for training. Other than that, the methods in [FAS-Depth&Patch-IJCB-2018, FAS-Auxiliary-CVPR-2018, FAS-DeSpoofing-ECCV-2018] utilize extra information, such as depth, for auxiliary supervision. For example, Atoum et al. [FAS-Depth&Patch-IJCB-2018] introduce auxiliary depth information for spoofing detection. They hold the idea that 2D spoofing examples of printed papers and screens are flat and thus lack 3D information. As such, they train a CNN-based depth estimator whose outputs for genuine and spoofing faces are 3D depth maps and flat maps, respectively. Although performance is shown to improve with depth as auxiliary supervision, such depth-based methods may not work effectively when a paper mask attack that carries depth information is launched.
II-C Cross-Domain Face Anti-Spoofing
The variety of capturing settings, such as different cameras, environment illuminations, and presentation mediums, can lead to the domain shift problem [FAS-UnsupervisedDA-TIFS-2018]. To be specific, a model trained with data collected under one setting may not generalize to other settings. This problem deters models from being deployed in practical scenarios. To make models generalize better and overcome the domain shift problem, transfer-learning-based algorithms built on either domain adaptation/generalization or zero/few-shot learning have also been proposed [FAS-UnsupervisedDA-TIFS-2018, FAS-3DCNN-TIFS-2018, FAS-CVPR-SHAO-M, FAS-ZeroShot-CVPR2019, LI-DISTILLATION, LIZHI]. However, how to design a sophisticated transfer learning algorithm for FAS that considers all possible capturing settings is still an open problem.
In this work, we propose a two-branch framework based on CNN and RNN, inspired by how human beings observe and explore spoofing clues. The overview of the framework is shown in Fig. 3. The backbone network first embeds the RGB image into a feature map. This feature map is then forwarded to the subsequent Branch 1 and Branch 2 for the extraction of global and local discriminative features, respectively. Finally, these features are fused for classification in the final fully-connected (FC) layer. The details of Branch 2, which comprises one convolutional layer and three fully-connected layers, are shown in Fig. 4.
In Section III-A, we describe how the backbone and Branch 1 work cooperatively to extract global features. Subsequently, in Section III-B, we illustrate the extraction of local features from suspicious sub-patches sequentially. Afterward, the reinforcement learning leveraged to predict the locations of those sub-patches will be elaborated in Section III-C. Finally, in Section III-D, we present the optimization process.
III-A Global Feature Extraction
The global feature extraction aims to exploit global discriminative information (e.g., paper boundaries, bezels, salient reflection patterns). In our framework, global features are extracted in Branch 1, a sequence of convolutional blocks. Branch 1 processes the feature map from the backbone to extract global features, into which all elements of the feature map are encoded.
For implementation, we construct the backbone and Branch 1 based on the ResNet18 [CV-He-2016-ResNet] architecture such that we can fairly compare our method with the recent ResNet-based methods (e.g., [FAS-MSR-TIFS-2019]). In particular, we adopt the first convolutional layer and four subsequent residual blocks of ResNet18 as our backbone network. The remaining convolutional residual blocks and the Global Average Pooling (GAP) layer [ML-GAP-ICLR-2014] constitute Branch 1. The details of these modules are provided in Table I.
III-B Local Feature Extraction
The local features are expected to exploit discriminative information from local patches. The local feature extraction consists of a fixed number of observation steps. We first elaborate on the procedure at a single step. At the beginning of the step, a location, given by its horizontal and vertical coordinates, is produced. The location prediction is learned via reinforcement learning, which will be introduced in the next sub-section. Then, a square sub-patch of a given patch size, centered at the predicted location, is cropped from the feature map produced by the backbone. Next, the sub-patch together with the location information is encoded into an intermediate feature. As such, the intermediate feature contains information from the observation at that step.
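The cropping step described above can be sketched as follows. This is a minimal illustration on a single-channel feature map stored as nested lists; the function name `crop_patch` and the choice to clamp the window at the borders are assumptions made for this sketch, since the text does not say how locations near the border are handled.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]


def crop_patch(feature_map: FeatureMap, center: Tuple[int, int], patch_size: int) -> FeatureMap:
    """Crop a square patch of side `patch_size` around `center` = (x, y).

    Assumption: the window is shifted inward when it would leave the map,
    so the returned patch always has the requested size.
    """
    height, width = len(feature_map), len(feature_map[0])
    x, y = center
    half = patch_size // 2
    # Clamp the top-left corner so that the whole window stays inside the map.
    left = min(max(x - half, 0), width - patch_size)
    top = min(max(y - half, 0), height - patch_size)
    return [row[left:left + patch_size] for row in feature_map[top:top + patch_size]]
```

In a multi-channel setting the same slicing would be applied to every channel; padding instead of clamping is an equally plausible border policy.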
While previous works usually utilize RNN to leverage temporal information from sequential video frames, we employ a Gated Recurrent Unit (GRU [ML-GRU-ArXiv-2014]) to learn local features from the intermediate features in a sequential and recursive manner. The reason is that the hidden state of the GRU at each step is learned from the intermediate feature of the current step and the hidden state of the previous step, through gating operations that involve the Hadamard product and the learnable parameters of the GRU. Furthermore, when analyzing a presented example, human beings’ knowledge of the example grows as information is gradually gained at each step. In other words, based on previous observations, their assessment of a suspicious example is reinforced after each new observation. Therefore, we specifically employ the GRU to memorize the observed information and learn local features. After the final step, the last hidden state is treated as the extracted local feature because it has perceived the local information gathered over all observation steps. Finally, the proposed framework jointly exploits global and local features by fusing the global feature and this local feature for classification.
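For readers who want the recurrence spelled out, a standard GRU update consistent with this description is sketched below. The notation is an assumption made for illustration: $x_t$ stands for the intermediate feature at step $t$, $h_{t-1}$ for the previous hidden state, $\sigma$ for the logistic sigmoid, $\odot$ for the Hadamard product, and $W_z$, $W_r$, $W_h$ for the three parameter matrices. Bias terms are omitted, and the exact form used in the framework may differ.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z\,[h_{t-1}, x_t]\right), \\
r_t &= \sigma\left(W_r\,[h_{t-1}, x_t]\right), \\
\tilde{h}_t &= \tanh\left(W_h\,[r_t \odot h_{t-1}, x_t]\right), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t .
\end{aligned}
```

Here $z_t$ and $r_t$ act as update and reset gates, so each new observation can overwrite or preserve parts of what the hidden state has accumulated so far.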
III-C Reinforcement Learning for Face Anti-Spoofing
To explore spoofing-discriminative local information, we leverage reinforcement learning to train an agent that can help predict locations of sub-patches where spoofing clues may appear. In this context, a reinforcement learning agent is an abstract subject that explores clues in a certain environment and predicts locations.
Environment: In our framework, we treat the feature map from the backbone as the environment in which our agent predicts locations and receives feedback to update its policy. This is because the backbone can distill shallow spoofing-related features from raw RGB pixels into the feature map. Thus, the feature map can provide spoofing-related information for the agent to predict appropriate locations from which effective local features can be extracted.
Although the input RGB image can also be set as the environment, we experimentally find that using the backbone to extract a feature map and setting it as the environment provides better results. We conjecture that the raw pixels of an RGB image may contain interference such that the agent could get overwhelmed in a complex environment. On the other hand, the backbone could filter out unrelated information and distill spoofing-related information from raw pixels, and thus provide a specific environment for the agent to explore spoofing clues with less disturbance. The experimental results in Section IV-C1 show the superiority of using the backbone and setting its feature map as the environment.
State: At each step, according to a certain policy, the agent predicts the location of a sub-patch based on its state. As the hidden state of the GRU carries information about past actions, we define the agent’s state at a given step as the hidden state produced at the previous step.
Action and Policy: In the framework, the agent learns to predict the location of a sub-patch to explore spoofing information. Hence, our agent’s action is to predict this location. Then, the sub-patch at the predicted location is cropped for the extraction of local features.
For effective location prediction, an optimal policy should guide the agent to predict a location according to its current state. Following the policy gradient theory [ML-2000-PolicyGradient], we parameterize the policy with a differentiable linear layer. In this way, we can optimize the parameters of the policy with standard backpropagation based on reward signals, as described later (Section III-D).
Reward: After predicting the locations, the agent should receive reward signals that evaluate how discriminative the information contained in the selected sub-patches is for classification. The more effective the predictions, the higher the rewards. Since the classification is conducted at the final step, we define a delayed reward: the reward signal at each step is determined by the total number of observation steps, the ground-truth label of the input, and the predicted probability distribution over the binary labels. The agent is trained to make the cumulative reward over all steps as high as possible.
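A delayed reward of this kind can be sketched as follows. The concrete values are an assumption borrowed from common practice in recurrent attention models: a reward of 1 at the final step when the predicted label matches the ground truth, and 0 everywhere else. The function names are hypothetical and are not taken from the original work.

```python
from typing import List, Sequence


def delayed_rewards(probabilities: Sequence[float], label: int, num_steps: int) -> List[float]:
    """Return one reward per observation step; only the final step can be rewarded."""
    # Predicted label = index of the largest class probability.
    predicted = max(range(len(probabilities)), key=lambda i: probabilities[i])
    rewards = [0.0] * num_steps
    if predicted == label:
        rewards[-1] = 1.0  # assumed reward value for a correct final decision
    return rewards


def cumulative_reward(rewards: Sequence[float]) -> float:
    """Sum of the per-step rewards, i.e. the quantity the agent tries to maximize."""
    return sum(rewards)
```

Under these assumptions the cumulative reward is simply 1 for a correctly classified example and 0 otherwise, so the agent is credited only through the final decision.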
III-D Training and Optimization
III-D1 Two-stage training scheme
When optimizing our framework, although end-to-end training is achievable, it may not provide satisfactory performance. As mentioned, the output feature map from the backbone can be seen as an environment in which our agent acts to learn its policy. If the backbone is involved in training, the environment will be unstable. Suppose that, in a given epoch, the policy is optimized according to the current “environment”. If the backbone is also being optimized, the “environment” will change in the next epoch. Therefore, with the policy learned from the previous “environment”, the agent may not act properly in the new one.
To tackle this problem, we propose to use a two-stage training scheme. At the first stage, we pretrain a ResNet18 model with the training data. Then, the parameters of the corresponding modules are loaded into the backbone from the pretrained model. Subsequently, the parameters of the backbone are frozen and excluded from the second-stage optimization, such that the feature map produced for a given input stays the same across epochs. As such, fixing the parameters of the backbone fixes the feature map, which helps keep a stable “environment” and extract more effective local features. The experimental results in Section IV-C show the superiority of our two-stage training.
III-D2 Joint optimization
At the second stage, we optimize the parameters of Branch 1 and Branch 2 jointly. The parameters of Branch 1 and Branch 2, except those of the policy, are optimized by the standard cross-entropy loss with binary labels for supervision.
The parameters of the policy are optimized with reinforcement learning. This optimization can be formulated as maximizing the expected cumulative reward, where the expectation is taken over the distribution of action sequences, which depends on the policy parameters.
According to the policy gradient theory [ML-2000-PolicyGradient], we adopt a differentiable linear layer to approximate the policy function. Hence, the objective can be maximized by computing its gradient and applying gradient ascent. To this end, we leverage the REINFORCE rule [ML-REINFORCE] to approximate this gradient. Because the policy is a differentiable linear layer, the gradient terms required by this rule can be computed directly, and the policy parameters can thus be optimized by standard backpropagation with the approximated gradient.
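To make the REINFORCE approximation concrete, the sketch below estimates the gradient for a toy one-dimensional location policy. Several things here are assumptions for illustration only: the policy is modeled as a Gaussian with a fixed standard deviation whose mean is a linear function of the state, locations are scalar rather than two-dimensional, and all function names are hypothetical. The text itself only states that the policy is a differentiable linear layer trained with the REINFORCE rule.

```python
from typing import List, Sequence, Tuple

Step = Tuple[Sequence[float], float]      # (state, sampled location)
Episode = Tuple[List[Step], float]        # (steps, cumulative reward)


def log_prob_grad(location: float, state: Sequence[float],
                  weights: Sequence[float], sigma: float) -> List[float]:
    """Gradient of log N(location; mean, sigma^2) w.r.t. weights, with mean = weights . state."""
    mean = sum(w * s for w, s in zip(weights, state))
    scale = (location - mean) / (sigma ** 2)
    return [scale * s for s in state]


def reinforce_gradient(episodes: Sequence[Episode], weights: Sequence[float],
                       sigma: float) -> List[float]:
    """Monte Carlo estimate: average over episodes of sum_t grad(log pi) * cumulative reward."""
    grad = [0.0] * len(weights)
    for steps, total_reward in episodes:
        for state, location in steps:
            for i, g in enumerate(log_prob_grad(location, state, weights, sigma)):
                grad[i] += g * total_reward
    return [g / len(episodes) for g in grad]


def gradient_ascent_step(weights: Sequence[float], grad: Sequence[float], lr: float) -> List[float]:
    """Move the policy parameters in the direction that increases the expected reward."""
    return [w + lr * g for w, g in zip(weights, grad)]
```

Episodes with zero cumulative reward contribute nothing to the estimate, so the update only raises the likelihood of location sequences that ended in a correct classification. A baseline term is often subtracted from the reward to reduce variance; it is left out here for brevity.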
This section describes how we conduct experiments to evaluate our method. To begin with, we introduce six benchmark databases employed in the experiments. After that, we illustrate the implementation details. Finally, we present and discuss the experimental results.
We utilize six publicly available face presentation attack databases in our experiments, including CASIA Face Anti-Spoofing Database [DB-CASIAFASD], IDIAP REPLAY-ATTACK [DB-IDIAP-RA], MSU Mobile Face Spoofing Database [FAS-IDA-TIFS-2015], OULU-NPU database [OULU_NPU_2017], the Spoofing in the Wild (SiW) database [FAS-Auxiliary-CVPR-2018] and the ROSE-YOUTU database [FAS-UnsupervisedDA-TIFS-2018].
IV-A1 CASIA FASD
The CASIA Face Anti-Spoofing Database (CASIA for short) has 20 and 30 subjects in its training and testing set respectively. For each subject, there are 12 videos, among which 3 genuine face videos are recorded from the genuine faces and 9 spoofing face videos from photos and screens. As such, the CASIA database has 600 videos in total.
IV-A2 IDIAP REPLAY-ATTACK
The IDIAP REPLAY-ATTACK database [DB-IDIAP-RA] (IDIAP for short) consists of 1,200 videos in total, with 360, 360, and 480 videos in the training, development, and testing sets, respectively. In this database, there are two illumination conditions: 1) a controlled condition where the background is uniform and the light source is a fluorescent lamp; 2) an adverse condition where the background is non-uniform and the light source is daylight. The attack videos of each subject involve 1) Print Attack: high-resolution face pictures printed on paper; and 2) Replay Attack: high-resolution pictures or videos displayed on the screen of an iPhone 3GS or an iPad. To collect such data, the webcam of a MacBook, an iPhone 3GS, and a Canon PowerShot camera are used.
IV-A3 MSU MFSD
The MSU Mobile Face Spoofing Database (MSU for short) [FAS-IDA-TIFS-2015] consists of 280 video clips of photo and video attacks from 35 subjects. Two types of cameras are used to collect the videos: the built-in camera of a MacBook Air 13” and the front-facing camera of a Google Nexus 5 Android phone. However, all the videos are collected only in normal indoor lighting environments.
The OULU-NPU database, similar to the IDIAP database, is divided into the training, development, and testing set with 20, 15, and 20 subjects, respectively. Overall, it contains 4950 face videos collected under three different environment conditions (e.g., different illumination and background conditions), with the frontal cameras of six mobile phones. As for attack mediums, two printers and two display devices were used to produce print attack and video attack examples. Furthermore, the OULU-NPU database provides four protocols for evaluation. Among them, Protocol 1, 2, and 3 aim to evaluate a model’s generalization capability to unseen environment conditions, unseen attack mediums, and unseen camera modules, respectively. Protocol 4 simultaneously considers the unseen environment conditions, attack mediums, and camera modules.
|Branch 1||1||Residual Block||
|Branch 2||2D Conv + GAP||
Table I: The parameters of each module in the framework. “2D conv” denotes a sequence of a 2D convolutional layer, a 2D batch normalization layer, and a ReLU layer. “Linear” denotes a sequence of a fully-connected layer and a ReLU layer. “GAP” denotes the Global Average Pooling layer. The input and output sizes of each layer are shown. The patch size refers to the size of the cropped sub-patches.
The Spoofing in the Wild (SiW for short) database [FAS-Auxiliary-CVPR-2018] covers 165 subjects. For each subject, eight genuine face videos and 20 spoofing face videos are recorded. As for data collection environments, four sessions with variations of distances, poses, illuminations, and expressions have been considered [FAS-Auxiliary-CVPR-2018]. For print attack examples, an HP Color LaserJet M652 printer is used to print high-resolution and low-resolution photos. Besides, four e-devices (Samsung Galaxy S8, iPhone 7, iPad Pro, and PC ASUS MB168B) are used to display spoofing faces on their screens. As for cameras, a Canon EOS T6 and a Logitech C920 webcam are utilized to capture data. In total, the SiW database has up to 4478 genuine and spoofing face videos. Also, it offers three protocols to evaluate the generalization capability of a model to unseen face poses and expressions (Protocol 1), unseen attack-producing mediums (Protocol 2), and unseen Presentation Attack types (Protocol 3).
The ROSE-YOUTU Face Liveness Detection Database (ROSE-YOUTU for short) is collected with the industry partner YouTu. It involves 20 subjects, and for each subject, there are 25 genuine and 150 to 200 spoofing face videos. The data is diversely collected, covering up to 5 different lighting conditions. Also, the front-facing cameras of five mobile devices (Hasee phone, Huawei phone, ZTE phone, iPad, iPhone 5s) were used to record the videos at a range of resolutions. Moreover, besides (still and quivering) printed photos and replay video examples (displayed on a Lenovo LCD screen and a Mac screen), the ROSE-YOUTU database further includes various paper mask attack examples. Such attacks can contain 3D information but are absent from the aforementioned five databases. Hence, we leverage the ROSE-YOUTU database to evaluate our method further.
IV-B Experimental Settings
IV-B1 Evaluation protocols and metrics
When evaluating the proposed framework, we report the experimental results in terms of Equal Error Rate (EER), Half-Total Error Rate (HTER), Attack Presentation Classification Error Rate (APCER), Bona Fide Presentation Classification Error Rate (BPCER), and Average Classification Error Rate (ACER) in different scenarios. For intra-database experiments on the CASIA, IDIAP, and ROSE-YOUTU databases, we use data of the training set of the given database to train models and report EER results on the corresponding testing sets. When conducting cross-database experiments, we report HTER. Besides, for the OULU-NPU and SiW databases, we respectively follow the four protocols [OULU_NPU_2017] and the three protocols [FAS-Auxiliary-CVPR-2018] to evaluate the generalization capability of our method by reporting ACER, APCER, and BPCER results.
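The relationship between these metrics can be sketched in a few lines. The snippet assumes that higher scores mean "more likely genuine", that labels are 1 for genuine and 0 for spoofing, and that there is a single attack type (with several attack types, APCER is commonly taken as the worst case across types). The function names are illustrative and do not come from any particular evaluation toolkit.

```python
from typing import Sequence, Tuple


def error_rates(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Return (APCER, BPCER) when scores >= threshold are accepted as genuine."""
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    genuines = [s for s, y in zip(scores, labels) if y == 1]
    apcer = sum(s >= threshold for s in attacks) / len(attacks)    # attacks accepted
    bpcer = sum(s < threshold for s in genuines) / len(genuines)   # genuine rejected
    return apcer, bpcer


def acer(apcer: float, bpcer: float) -> float:
    """Average Classification Error Rate: mean of APCER and BPCER."""
    return (apcer + bpcer) / 2


def hter(far: float, frr: float) -> float:
    """Half-Total Error Rate: mean of false acceptance and false rejection rates."""
    return (far + frr) / 2


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER: error at the observed threshold where the two rates are closest."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)):
        far, frr = error_rates(scores, labels, threshold)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

In cross-database testing, the threshold for HTER is typically fixed on a development set rather than chosen on the test data; that selection step is not shown here.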
IV-B2 Implementation Details
As for the framework input, we consider including background information, since spoofing-related clues are diverse and may not necessarily appear in face areas. This can be implemented by expanding face detection bounding boxes before cropping out faces. However, there are various attack scenarios, and the optimal bounding box size could depend on the scenario. Since an entire video frame can be regarded as a detected face cropped by a bounding box of a special configuration, we use this configuration by default for the framework input as a consistent way to evaluate the proposed framework. Nevertheless, we also evaluate the effectiveness of the framework under different bounding box configurations, where the framework inputs are consistently resized to the same size. Then, the backbone network embeds the input into feature maps, which are forwarded to Branch 1 and Branch 2 to extract global and local features, respectively. For the final classification, we fuse the global and local features from Branch 1 and Branch 2 by concatenation and use the result as the input of the final Fully Connected (FC) layer. We show the details of the backbone network, Branch 1, Branch 2, and the final FC layer in Table I, which also indicates the size used for cropping patches.
|Methods||Only local features||Only global features||Fused (ours)|
When training the framework, we follow the two-stage training scheme stated in Section III-D. At the first stage, we pretrain a ResNet18 model [CV-He-2016-ResNet] on the training data with the cross-entropy loss. The trained parameters of the first convolutional layer and the four subsequent residual blocks are then loaded into the backbone, and the parameters of the backbone are fixed and excluded from the second-stage optimization. At the second stage, the GRU’s hidden state is initialized, and the location of the initial patch is sampled from a normal distribution centered at the center of the input images. Then, Branch 1 and Branch 2 are optimized jointly from scratch by standard backpropagation with the gradients of the cross-entropy loss and of the reinforcement learning objective. Unless otherwise stated, the input configuration is “FULL”; the number of observation steps is set to 8; the patch size is set to 8; and the fusion method is Concatenation. In addition, we explore the impacts of the number of observation steps, the patch size, and different fusion methods in Section IV-C.
IV-C Experimental results
IV-C1 Analysis of Jointly Using Global and Local Features
In this subsection, we demonstrate the effectiveness of our proposed framework by jointly utilizing global and local features. To this end, we ablate Branch 1 and Branch 2 separately to compare results with only local features and with only global features. As shown in Table II, the results with only global features (Branch 2 ablated) are better than the results with only local features (Branch 1 ablated), which indicates that global features can be more effective than local features. Intuitively, people are likely to provide a reliable assessment with a glance at the original video frame, especially when discriminative artifacts appear, e.g., paper boundaries, bezels, or obvious reflections. However, when given merely a few local sub-patches, human beings could have trouble assessing the liveness of the original examples as discriminative artifacts may be absent in these patches. This is just like what the story Blind Men and An Elephant [book-The-Elephant-in-the-Dark] tells. Moreover, by fusing the global and local features, our proposed framework can further achieve better performance. Such improvement supports our motivation that “taking closer observations at local sub-patches” can provide more information to refine the classification.
For further analysis, we collect statistics on the falsely accepted examples for each medium on the ROSE-YOUTU database. Fig. 5 shows that, with only global features, the number of falsely accepted “Vm” examples (video attack recorded from a Mac display) is nearly 9000, which represents the largest proportion. By carefully reviewing the data in the ROSE-YOUTU database, we find that the visual quality of “Vm” is generally better than that of “Vl” (video attack recorded from a Lenovo display), as the resolution of a Mac display is much higher than that of a Lenovo LCD. For instance, Fig. 1(c) is a “Vl” example, and Fig. 1(e) is a “Vm” example. As shown, the visual quality of the “Vm” example is better than that of the “Vl” example. In other words, the spoofing faces of “Vm” look more similar to genuine faces than those of “Vl” (as well as the others). Therefore, “Vm” examples are falsely accepted as genuine faces more often than “Vl” and the others. Moreover, as shown in Fig. 5, our method generally leads to fewer falsely accepted spoofing examples across the various attack mediums. Although the falsely accepted “Vm” examples from our method still account for the largest proportion of false acceptances, their number is nearly half of that obtained with only global features. This means that our method can better discriminate spoofing examples of good visual quality by leveraging local information from sub-patches. It corresponds to our motivation that people can zoom in on local sub-patches and explore subtle spoofing clues to refine and confirm their assessment of the liveness of examples. So far, we have demonstrated that our framework can better counter spoofing attacks by jointly using global and local features. In the next subsection, we investigate how well our framework generalizes to inputs of other configurations.
IV-C2 Analysis of Configurations of the Framework Input
As our framework is designed to exploit discriminative information that does not necessarily appear on face areas, we also investigate its performance when the input includes different amounts of background around the detected faces. To this end, we use dlib’s CNN face detector [dlib09] to obtain detection bounding boxes. Subsequently, each bounding box is expanded by 0%, 20%, 40%, and 60% (i.e., ) to produce four face images that have different scales of background information. Besides, using the entire video frame can be treated as cropping faces with a special configuration for the bounding box, and we denote this configuration as “FULL”. As such, there are five groups of face images in different configurations for the framework input, and some examples are shown in Fig. 6. For the experiments, we adopt the OULU-NPU database, as it provides four protocols for extensive evaluation. Also, we train ResNet18 models to provide the baseline results, where only global features are considered. The experimental results are shown in Table III. Our method achieves better ACER results than the ResNet18 counterparts over the different input configurations. To sum up, the experiments show that the proposed method remains effective in jointly using global and local features extracted from face images that have different scales of background information.
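The five input configurations described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that the expansion is applied symmetrically about the box center and clipped to the frame, which the text does not specify.

```python
def expand_box(box, ratio, frame_w, frame_h):
    """Expand a (left, top, right, bottom) box by `ratio` of its size.

    Assumption: the extra width/height is split evenly on both sides and
    the result is clipped to the frame boundaries.
    """
    left, top, right, bottom = box
    dx = (right - left) * ratio / 2.0
    dy = (bottom - top) * ratio / 2.0
    return (
        max(0, int(round(left - dx))),
        max(0, int(round(top - dy))),
        min(frame_w, int(round(right + dx))),
        min(frame_h, int(round(bottom + dy))),
    )


def input_configurations(box, frame_w, frame_h):
    """Return the four expanded crops plus the FULL-frame configuration."""
    configs = {}
    for pct in (0, 20, 40, 60):
        configs[f"{pct}%"] = expand_box(box, pct / 100.0, frame_w, frame_h)
    # "FULL" treats the whole frame as a special bounding box.
    configs["FULL"] = (0, 0, frame_w, frame_h)
    return configs


if __name__ == "__main__":
    # A hypothetical detection in a 640x480 frame.
    for name, crop in input_configurations((100, 100, 200, 200), 640, 480).items():
        print(name, crop)
```

Clipping matters mostly for the 40% and 60% settings, where an expanded box near the frame border would otherwise extend outside the image.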
IV-C3 Effect of using reinforcement learning
In this subsection, we show the effectiveness of adopting deep reinforcement learning (DRL) for selecting patches. To be more specific, we compare our proposed DRL with the method of selecting the patches that have the highest SoftMax scores (denoted as the MAX-SCORES method) and the method of selecting patches randomly (denoted as the RANDOM method). To implement the MAX-SCORES method, we pretrain a patch-based CNN based on [FAS-Depth&Patch-IJCB-2018] with the training data to infer the SoftMax scores of the patches, and the patches with the highest SoftMax scores are selected for the framework. To implement the RANDOM method, we use a random number generator to generate the locations of the patches to be selected. Table IV compares the performance of the MAX-SCORES, RANDOM, and DRL methods with respect to patch selection. On the IDIAP database, the MAX-SCORES, RANDOM, and DRL methods all achieve 0.00% EER, meaning that local features can help with the final prediction regardless of how we select patches. Nevertheless, in the other experiments based on the ROSE-YOUTU, CASIA, and OULU-NPU databases, our proposed DRL shows its effectiveness in selecting patches by achieving the best EER results.
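The two baseline selection rules can be sketched as below. This is only an illustration under stated assumptions: the score dictionary stands in for the output of a pretrained patch classifier, and the function names, the grid-based location format, and the `rng` argument are hypothetical rather than taken from the paper.

```python
import random


def select_max_scores(patch_scores, num_patches):
    """MAX-SCORES baseline: keep the locations with the highest scores.

    `patch_scores` maps a (row, col) location to the SoftMax score that a
    pretrained patch classifier (assumed, not shown) assigned to that patch.
    """
    ranked = sorted(patch_scores, key=patch_scores.get, reverse=True)
    return ranked[:num_patches]


def select_random(map_h, map_w, patch_size, num_patches, rng):
    """RANDOM baseline: draw top-left patch corners uniformly at random."""
    return [
        (rng.randrange(map_h - patch_size + 1),
         rng.randrange(map_w - patch_size + 1))
        for _ in range(num_patches)
    ]


if __name__ == "__main__":
    scores = {(0, 0): 0.10, (0, 4): 0.90, (4, 0): 0.50, (4, 4): 0.30}
    print(select_max_scores(scores, 2))
    print(select_random(16, 16, 4, 2, random.Random(0)))
```

Neither rule uses feedback from the final classifier, which is the main difference from a learned policy: a DRL agent is rewarded according to how the selected patches affect the prediction.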
IV-C4 Analysis of Local Features Extraction
In this subsection, we analyze the impact of patch size and the number of steps for the local feature extraction.
Effect of patch size To analyze the effect of , we conduct experiments with , and we report the EER results on the CASIA, ROSE-YOUTU, and OULU-NPU-P1 in Table V. On the CASIA database, the EER improves from 0.35% to 0.17% when increases from 2 to 8 but deteriorates when we further increase to (0.28%). On the ROSE-YOUTU database, a similar trend can be observed: the EER improves to 1.79% when increases from 2 to 8. However, when , the EER degrades to 3.65%. On the OULU-NPU-P1, the best EER is achieved when , and the EER result of is better than that of and . Therefore, the optimal is different for different databases, and simply increasing may not necessarily lead to better performance. Nevertheless, we observe that can achieve the desired performance in general, and thus we fix for all other experiments.
Effect of total number of observation steps Besides the size of the local patches, we are also interested in the impact of different numbers of observation steps . To this end, we conduct experiments by increasing from 2 to 16, and the results are reported in Table VI. On the CASIA database, the EER improves when increases from 2 to 16. On the ROSE-YOUTU database, the best EER of 1.65% is achieved when , and the second-best EER of 1.79% is achieved when . On the OULU-NPU-P1, the EER improves as changes from 2 to 8 but gets worse when . In summary, simply increasing does not necessarily lead to better performance. We conjecture that when is increased to include more patches, the patches containing less discriminative information may degrade the performance. Nevertheless, we consistently choose for other experiments as it leads to the best or second-best performance in Table VI.
|Methods||Without backbone||With backbone|
|Observation steps||One-stage||Our two-stage|
IV-C5 Analysis of other settings of the framework
In this subsection, we analyze the impacts of the backbone, training strategies, and feature fusion methods in our proposed framework.
Effect of using the backbone for local features To study the effectiveness of using the backbone for feature embedding and local feature extraction, we also implement the framework without the backbone as a baseline, where patches are cropped at the predicted locations from the original RGB inputs instead of the embedded feature maps. Table VII shows the results for comparison. We observe that better performance is generally achieved with the backbone network. This improvement indicates that extracting local features from the feature maps produced by the backbone helps to capture more discriminative spoofing-related information. Moreover, the results with the backbone on the OULU-NPU-P1 and -P4 (2.58% and 3.12%) are significantly better than those without the backbone (8.43% and 9.57%). As Protocols 1 and 4 involve unseen environments in the testing data, conducting feature embedding through the backbone network before extracting local features may alleviate environmental interference to some extent, such that better performance can be achieved.
Effect of the two-stage training Although training our framework in an end-to-end manner is achievable, we propose to use a two-stage training scheme to better optimize our framework. Table VIII compares the results of the one-stage end-to-end training and our proposed two-stage training. The EER results of our two-stage training are all remarkably lower than 0.2% for . However, for the one-stage training experiments, when , the EER result is up to 20.1%. Although the EER decreases as increases, the best EER result is still above 4% (), which is much higher than all the results achieved by our two-stage training. Therefore, Table VIII shows that our two-stage training can help achieve better results by providing a stable environment such that the agent can learn to extract effective features, even when .
|Observation steps||Patch size||Average||Weighted Average||Concatenation|
Effect of fusion methods In this framework, the global and local features are fused for classification. In Table IX, we compare results among three different fusion methods: the Average, the Weighted Average [FAS-MSR-TIFS-2019], and the Concatenation. We observe that the EER results are all lower than 0.2% when and . However, when or , the EER results of the Average and the Weighted Average are higher than 9%, while the result with only global features is less than 1% (shown in Table II). We conjecture that this is because the Average and the Weighted Average fuse global and local features by averaging the elements at each dimension. When or , the extracted local features may not be effective enough. As a result, the discriminative information contained in the global features may be distorted by the averaging operation. By contrast, when the extracted local feature is not representative enough, the Concatenation can preserve the original global information to a larger extent. Therefore, the Concatenation provides stable results (all lower than 0.2% EER) under different settings of and , and we fix the Concatenation as the fusion method in other experiments.
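The difference between the three fusion rules can be illustrated with a small sketch. Plain Python lists stand in for feature vectors, the function names are hypothetical, and the scalar weight `alpha` is an assumption made for illustration; the weighting scheme of [FAS-MSR-TIFS-2019] may differ.

```python
def fuse_average(global_feat, local_feat):
    """Element-wise mean; both vectors must have the same dimension."""
    return [(g + l) / 2.0 for g, l in zip(global_feat, local_feat)]


def fuse_weighted_average(global_feat, local_feat, alpha=0.5):
    """Element-wise convex combination with an assumed scalar weight alpha."""
    return [alpha * g + (1.0 - alpha) * l
            for g, l in zip(global_feat, local_feat)]


def fuse_concatenation(global_feat, local_feat):
    """Keep both vectors intact by stacking them end to end."""
    return list(global_feat) + list(local_feat)


if __name__ == "__main__":
    g = [1.0, -1.0]   # an informative global feature (toy values)
    l = [0.0, 0.0]    # an uninformative local feature (toy values)
    print(fuse_average(g, l))        # the global signal is halved
    print(fuse_concatenation(g, l))  # the global signal is untouched
```

In this toy case, averaging with an uninformative local feature shrinks every global component, whereas concatenation leaves the global part unchanged for the classifier, which matches the conjecture in the paragraph above.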
fusion + NN [BottleCNN-JVCIR-2016]
|CoALBP (YCBCR) [FAS-CoALBP-AIVT-2012P]||17.1|
|CoALBP (HSV) [FAS-CoALBP-AIVT-2012P]||16.4|
IV-C6 Intra-database experiments
For further evaluation, we conduct experiments on six benchmark databases and compare our proposed method with state-of-the-art methods.
Results on the CASIA database and IDIAP database Table X provides the results of intra-database experiments on the CASIA and IDIAP databases. On the CASIA database, our method attains 0.17% EER, which is the best. On the IDIAP database, our method also achieves the best performance, with 0% EER on the development (DEV) set and 0% HTER on the testing (TEST) set. On both benchmark databases, our method achieves the best performance, showing its effectiveness.
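For readers less familiar with the two metrics used here, the following sketch shows how EER and HTER are commonly computed from liveness scores. It assumes that higher scores mean "more likely genuine" and that candidate thresholds are taken from the observed scores, which is a simplification; the paper's own evaluation script is not given, and the function names are hypothetical.

```python
def far_frr(genuine, spoof, threshold):
    """False acceptance rate and false rejection rate at a threshold."""
    far = sum(s >= threshold for s in spoof) / len(spoof)
    frr = sum(g < threshold for g in genuine) / len(genuine)
    return far, frr


def eer_and_threshold(genuine, spoof):
    """Approximate the EER as the point where FAR and FRR are closest."""
    best = None
    for thr in sorted(set(genuine) | set(spoof)):
        far, frr = far_frr(genuine, spoof, thr)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


def hter(genuine, spoof, threshold):
    """Half total error rate at a threshold fixed beforehand (e.g. on DEV)."""
    far, frr = far_frr(genuine, spoof, threshold)
    return (far + frr) / 2.0


if __name__ == "__main__":
    dev_genuine, dev_spoof = [0.9, 0.8, 0.7], [0.1, 0.2, 0.3]
    eer, thr = eer_and_threshold(dev_genuine, dev_spoof)
    print(eer, thr)
    print(hter([0.95, 0.60], [0.05, 0.75], thr))
```

This mirrors the DEV/TEST usage above: the threshold is chosen at the EER point on the development set and then applied unchanged to the test set to obtain the HTER.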
Protocols 1, 2, and 3 evaluate the models’ generalization capability to unseen face poses and expressions, unseen attack mediums, and unseen Presentation Attack types, respectively.
For the experiments of Protocol 1, the testing is done only once, so no standard deviation is reported.
Results on the ROSE-YOUTU database Table XI compares our method with the baseline methods on the ROSE-YOUTU database. Our method achieves the lowest EER (1.8%), while the second-best method, 3D-CNN [FAS-3DCNN-TIFS-2018], achieves 7.0% EER. This shows our method’s superiority. In addition, to evaluate how paper mask attacks in the ROSE-YOUTU database can defeat depth-based methods, we implement the DeSpoofing method [FAS-DeSpoofing-ECCV-2018] (https://github.com/yaojieliu/ECCV2018-FaceDeSpoofing) because it leverages depth information for training. However, it achieves only 12.3% EER, which indicates that depth-based methods can lose efficacy when encountering paper mask attack examples that carry depth information. By contrast, our method still performs favorably against the paper mask attack examples in the ROSE-YOUTU database.
|Color LBP [FAS-ColorTexture-TIFS-2016]||37.9||35.4||44.8||33.0|
|Color Texture [FAS-ColorTexture-TIFS-2016]||30.3||37.7||33.9||34.1|
|AlexNET without DA [FAS-UnsupervisedDA-TIFS-2018]||32.6||43.6|
|AlexNET with KMM [FAS-UnsupervisedDA-TIFS-2018]||31.6||43.6|
|AlexNET with SA [FAS-UnsupervisedDA-TIFS-2018]||35.0||38.5|
|AlexNET with KSA [FAS-UnsupervisedDA-TIFS-2018]||33.9||42.0|
|AlexNET with SA* [FAS-UnsupervisedDA-TIFS-2018]||30.7||36.2|
|AlexNET with KSA* [FAS-UnsupervisedDA-TIFS-2018]||30.1||38.8|
Results of the inter-database experiments where the models are trained with data from the ROSE-YOUTU database and tested on the CASIA and IDIAP databases. The performance is evaluated in terms of HTER (%). “*” means the results are obtained with the outlier removal proposed in [FAS-UnsupervisedDA-TIFS-2018]. The database on the left of “→” is used for training and the one on the right for testing. The best results are highlighted in bold. In the experiments, we set and .
Results on the SiW database Table XII shows the ACER results of our proposed framework and state-of-the-art methods on the SiW database. The Auxiliary method [FAS-Auxiliary-CVPR-2018], with extra depth map and rPPG signals, attains 1.0%, 0.57%, and 8.31% ACER for Protocols 1, 2, and 3, respectively. Besides, the “STASN+” method [FAS-CVPR2019-STASN] collects extra data outside the database for data augmentation and achieves 0.3%, 0.15%, and 5.58%, correspondingly. However, without extra data or auxiliary signals, and with only binary labels for supervision, our method achieves the best results for the three protocols (0.00%, 0.00%, and 4.51%, respectively). In addition, for the experiments of Protocols 2 and 3, our method obtains the smallest standard deviation, showing better stability. Moreover, the ACER results of all the listed methods for Protocol 3 are much higher than the results for Protocols 1 and 2. This indicates that the setting of unseen presentation attack types is more challenging than those of unseen face poses and expressions and of unseen attack mediums.
Results on the OULU-NPU database Table XIII compares our proposed method with state-of-the-art ones. In Protocol 1, our method achieves 4.7% ACER, better than the “MSR-ResNet” method (5.9%) [FAS-MSR-TIFS-2019], which is also based on ResNet18 [CV-He-2016-ResNet]. In Protocol 2, our method achieves the best ACER (1.9%). In Protocol 3, the Auxiliary method achieves the lowest 2.9% ACER, while our method achieves a very close ACER of 3.0%. In Protocol 4, our method achieves the second-best ACER of 7.2%. Overall, in terms of ACER, the DeSpoofing is better than the Auxiliary in Protocol 1 and 4, while the Auxiliary is better than the DeSpoofing in Protocol 2 and 3. This comparison shows that there may not be a method that is always optimal for all scenarios. Nevertheless, our method shows its effectiveness by achieving the best or the second-best ACER in Protocol 2, 3, and 4. Furthermore, according to Table III, using a proper setting of face cropping for the framework input can lead to better performance. Also, one can always further improve the framework with advanced neural networks and auxiliary information.
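As a reminder of how ACER is typically obtained, the sketch below follows the ISO/IEC 30107-3 convention as commonly applied on OULU-NPU, where APCER is taken as the worst case over the attack types and ACER is the mean of APCER and BPCER. This convention is an assumption here, since the text above does not restate it, and the function name and input format are hypothetical.

```python
def apcer_bpcer_acer(attack_accepted_by_type, bona_fide_accepted):
    """Compute APCER, BPCER, and ACER from binary decisions.

    `attack_accepted_by_type` maps an attack type to a list of booleans,
    True when an attack sample was wrongly accepted as bona fide.
    `bona_fide_accepted` lists booleans, True when a genuine sample was
    correctly accepted.
    """
    # APCER: worst-case attack acceptance rate over the attack types.
    apcer = max(sum(d) / len(d) for d in attack_accepted_by_type.values())
    # BPCER: fraction of genuine samples that were rejected.
    bpcer = sum(not d for d in bona_fide_accepted) / len(bona_fide_accepted)
    return apcer, bpcer, (apcer + bpcer) / 2.0


if __name__ == "__main__":
    attacks = {
        "print": [True, False, False, False],
        "replay": [False, False, False, False],
    }
    genuine = [True, True, True, False]
    print(apcer_bpcer_acer(attacks, genuine))
```

Because APCER takes the maximum over attack types, a method that fails on a single attack type is penalized even if it handles the others perfectly.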
IV-C7 Cross-database experiments
We also conduct cross-database experiments to evaluate the generalization capability of our method to different data domains. For conciseness, “A→B” denotes an experiment where we train on the database “A” and test on the database “B”.
Table XIV provides the cross-database experimental results among the CASIA, IDIAP, and MSU databases. In the CASIA→IDIAP and IDIAP→CASIA experiments, the Auxiliary method [FAS-Auxiliary-CVPR-2018] achieves the best results (27.6% and 28.4% HTER, respectively), while our framework achieves the second-best HTER (28.4% and 33.2%). Among methods without auxiliary information, such as [FAS-MSR-TIFS-2019], our performance is the best. Also, for the IDIAP→MSU and MSU→IDIAP experiments, we implement the DeSpoofing method [FAS-DeSpoofing-ECCV-2018], which outperforms the other baseline methods by achieving 33.2% and 27.8% HTER, respectively. However, our method surpasses it with lower HTER results (29.7% and 15.6%), which are the best.
Table XV provides the experimental results of ROSE-YOUTU→CASIA and ROSE-YOUTU→IDIAP. In the ROSE-YOUTU→CASIA experiment, our method achieves the best HTER of 8.1%. Both the ROSE-YOUTU and CASIA databases include spoofing attacks in which bezels and paper boundaries can be observed. Hence, the proposed framework can capture such discriminative artifacts to achieve good performance. By contrast, few paper boundaries and bezels are observed in the attack samples of the IDIAP database. Nevertheless, in the ROSE-YOUTU→IDIAP experiment, the proposed framework is still effective and achieves the best HTER (20.0%), at least 16 percentage points lower than the others. In summary, in the cross-database experiments, our proposed framework remains effective when the spoofing artifacts and backgrounds come from different data domains.
In this subsection, we conduct further analysis based on visualization. To show what types of information the global features are likely to capture for anti-spoofing, we propose to apply the Class Activation Mapping (CAM) [cam-zbl] to visualize activation heatmaps of global features. The CAM heatmaps are shown in Fig. 7, where red/blue areas mean high/low activation. Fig. 7 is a paper attack where the paper boundaries can be seen on the left and the right. Its CAM heatmap, Fig. 7, shows that the boundaries of both sides give high activation (red). Fig. 7(a) is a replay video attack with reflections appearing on the right, and we see from Fig. 7 that the reflection areas on the top right are red. Besides, we also explore the situation when the above artifacts are absent. Fig. 7 and Fig. 7 are replay video attack examples from the IDIAP database, and there are no discriminative artifacts observed. While the face area in Fig. 7 gives high activation, the background areas in Fig. 7 are also of high activation. In summary, discriminative information captured by global features may not necessarily appear on face areas, and information from backgrounds can also significantly contribute to anti-spoofing, even when bezels, reflections, etc., are not observed.
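The CAM computation referred to above can be summarized in a few lines. The sketch assumes the standard CAM setting, in which the last convolutional feature maps are globally average-pooled and fed to a linear classifier, so the heatmap for a class is the weighted sum of the feature maps using that class's classifier weights. Nested lists stand in for tensors, and the function names are hypothetical; the exact visualization pipeline used for Fig. 7 is not described in the text.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one target class."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def normalize(cam):
    """Rescale the map to [0, 1] so it can be rendered as a heatmap."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[(v - lo) / span for v in row] for row in cam]


if __name__ == "__main__":
    maps = [
        [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires at the top-left
        [[0.0, 0.0], [0.0, 2.0]],  # channel 1 fires at the bottom-right
    ]
    print(normalize(class_activation_map(maps, [1.0, 2.0])))
```

In practice the low-resolution map is upsampled to the input size and overlaid on the image, so that high values (red) mark regions that contribute most to the predicted class.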
Besides, we propose to investigate how the local features can help with the performance by visualizing the predicted locations. To this end, we propose to fuse the global feature with the local feature extracted at the step-index () for classification to get the confidence score . In this way, the visualization results are shown in Fig. 8, where the number under each image is the confidence score . As shown in Fig. 8, the performance of confidence scores shows an increasing trend. The first row of Fig. 8 is an attack example from the IDIAP database, a printed paper replayed in a video. As increases, the predicted patches cover the printed stripes in the background, and is generally increasing. The two rows below are printed paper attack examples from the CASIA database, where the boundaries of the printed paper can be treated as discriminative spoofing artifacts, and the patches also cover the paper boundaries. In these two rows, the performance of shows an increasing trend. The fourth row and fifth row are replay video attack examples from the OULU-NPU database and the ROSE-YOUTU database, respectively. We observe that the local patches mainly explore moiré patterns. Meanwhile, there are for the fourth and for the fifth respectively, indicating that simply increasing may not necessarily further improve the performance. Nevertheless, for , that still holds, showing that the information from patches can generally improve the overall performance.
Furthermore, we observe that the patches generally move from the center areas toward the boundary (background) areas. As the initial locations are sampled from a normal distribution whose symmetry center corresponds to the center of the input image, the initial patch is generally near the center areas. Thereby, the RNN extracts features from the center areas first. As spoofing features can also be found in the background areas, driven by reinforcement learning, the patches then move toward the background areas such that the RNN can extract features from these areas to improve performance.
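The initialization described above can be sketched as follows. The normalized coordinate convention (the image center at (0, 0) and the borders at ±1), the standard deviation, and the function names are assumptions made for illustration; the text only states that the initial location follows a normal distribution centered on the image center.

```python
import random


def sample_initial_location(rng, std=0.2):
    """Draw an (x, y) location around the image center, clipped to [-1, 1]."""
    x = max(-1.0, min(1.0, rng.gauss(0.0, std)))
    y = max(-1.0, min(1.0, rng.gauss(0.0, std)))
    return x, y


def to_pixel(location, width, height):
    """Map normalized coordinates in [-1, 1] to pixel coordinates."""
    x, y = location
    return (x + 1.0) / 2.0 * (width - 1), (y + 1.0) / 2.0 * (height - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    loc = sample_initial_location(rng)
    print(loc, to_pixel(loc, 256, 256))
```

With a small standard deviation, most initial patches fall close to the center, which is consistent with the observation that the patches start near the face and only later move toward the background.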
We also visualize misclassified spoofing examples from the CASIA, ROSE-YOUTU, and OULU-NPU databases, which are shown in Fig. 9. Fig. 9(a) is a printed paper attack from the CASIA database. Although the paper boundary at the bottom can be seen, it is not obvious, and most of the other areas have no blur or distortion observed. Fig. 9(b) is a replay video attack example from the ROSE-YOUTU database, and Fig. 9(c) is a replay video attack example from the OULU-NPU database. These figures show few discriminative artifacts, such as reflection. Based on the observations from these examples, as there are few discriminative artifacts observed, the extracted global and local features may not be effective in differentiating the spoofing faces from genuine ones.
IV-E Computational Analysis
In this subsection, we analyze the computational efficiency of our proposed method and the ResNet18. As we can see from Table XVI, the total number of parameters of our model is about 16.50M, while that of the ResNet18 is 11.18M. This increase is reasonable, as we introduce a local branch in our framework and our model achieves better performance in the FAS task. Although our model size increases by about 50% compared with the ResNet18, our proposed method does not introduce much computational burden. In terms of Multiply-Accumulate Operations (MACs) [IEEE-standard], which measure the total multiplication and addition operations required for calculation, our method requires only 0.04 Giga (1.7%) and 0.07 Giga (3.0%) more MACs than the ResNet18 baseline when and . Last but not least, we consider the inference efficiency by reporting Frames per Second (FPS) based on PyTorch and an NVIDIA GTX 1080 Ti GPU. As we can see, while the ResNet18 achieves 150 FPS, our method achieves 110 FPS and 70 FPS when and , which is reasonable as the local branch works recurrently. Nevertheless, in practice, one can always use various neural network acceleration techniques to speed up models.
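The figures quoted above can be cross-checked with some simple arithmetic. The snippet only uses the numbers stated in the text; the per-frame latencies and the implied baseline MACs are derived here as a back-of-the-envelope estimate and are not values reported in Table XVI.

```python
def relative_increase_pct(baseline, value):
    """Relative increase of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def latency_ms(fps):
    """Average time per frame implied by a throughput in frames per second."""
    return 1000.0 / fps


def implied_baseline(extra, extra_pct):
    """Baseline value implied by an absolute and a relative overhead."""
    return extra / (extra_pct / 100.0)


if __name__ == "__main__":
    # Parameters: 11.18M (ResNet18) vs. 16.50M (proposed model).
    print(round(relative_increase_pct(11.18, 16.50), 1))  # about 50%
    # Throughput: 150 FPS (ResNet18) vs. 110 and 70 FPS (proposed model).
    for fps in (150, 110, 70):
        print(fps, round(latency_ms(fps), 2))
    # Baseline GMACs implied by the two reported overheads.
    print(round(implied_baseline(0.04, 1.7), 2),
          round(implied_baseline(0.07, 3.0), 2))
```

The two implied baselines agree to within rounding, and the gap between the small MAC overhead and the larger FPS drop is consistent with the remark that the local branch runs recurrently, since sequential steps limit parallelism even when they add few operations.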
We present a novel two-branch framework to explore spoofing clues for the face anti-spoofing problem. The novelties of our work are twofold: 1) we propose to leverage a CNN and an RNN to extract both global and local information for the FAS task based on a single frame; 2) we propose a novel optimization strategy based on deep reinforcement learning, which is the first such attempt in the FAS problem. We conduct extensive experiments on six different databases to evaluate our proposed framework. The experimental results in both intra- and cross-database settings indicate that our proposed framework generally achieves state-of-the-art performance compared with various strong baselines, which demonstrates the effectiveness of our method.