PyTorch code for Regularized Fine-grained Meta Face Anti-spoofing (AAAI 2020)
Face presentation attacks have become an increasingly critical concern as face recognition is widely applied. Many face anti-spoofing methods have been proposed, but most of them ignore the generalization ability to unseen attacks. To overcome this limitation, this work casts face anti-spoofing as a domain generalization (DG) problem, and attempts to address it by developing a new meta-learning framework called Regularized Fine-grained Meta-learning. To let the face anti-spoofing model generalize well to unseen attacks, the proposed framework trains the model to perform well in simulated domain shift scenarios, which is achieved by finding generalized learning directions in the meta-learning process. Specifically, the proposed framework incorporates the domain knowledge of face anti-spoofing as regularization, so that meta-learning is conducted in a feature space regularized by the supervision of domain knowledge. This makes the model more likely to find generalized learning directions for the face anti-spoofing task. Besides, to further enhance the generalization ability of the model, the proposed framework adopts a fine-grained learning strategy that simultaneously conducts meta-learning in a variety of domain shift scenarios in each iteration. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
Face recognition, as one of the computer vision techniques[14, 35], has been successfully applied in a variety of applications in the real life, such as automated teller machines (ATMs), mobile payments, and entrance guard systems. Although much convenience is brought by the face recognition technique, many kinds of face presentation attacks (PA) also appear. Easy-accessible human faces from the Internet or social media can be abused to produce print attacks (i.e. based on the printed photo papers) or video replay attacks (i.e. based on the digital image/videos). These attacks can successfully hack a face recognition system deployed in a mobile phone or a laptop because those spoofs are visually extremely close to the genuine faces. Therefore, how to protect our face recognition systems against these presentation attacks has become an increasingly critical issue in the face recognition community.
Many face anti-spoofing methods have been proposed. Appearance-based methods extract various appearance cues to differentiate real from fake faces [5, 33, 34]; temporal-based methods differentiate based on various temporal cues [23, 29, 27, 17, 19]. Although these methods obtain promising performance in intra-dataset experiments, where training and testing data come from the same dataset, the performance dramatically degrades in cross-dataset experiments, where models are trained on one dataset and tested on a related but shifted dataset. This is because existing face anti-spoofing methods capture differentiation cues that are dataset-biased, and thus cannot generalize well to unseen testing data whose feature distribution differs from that of the training data (mainly caused by different attack materials or recording environments).
To overcome this limitation, this paper casts face anti-spoofing as a domain generalization (DG) problem. Compared to traditional unsupervised domain adaptation (UDA) [28, 25, 21, 37, 8, 24, 31, 7, 32, 36, 4, 30], which assumes access to labeled source domain data and unlabeled target domain data, DG assumes no access to any target domain information. In DG, multiple source domains are exploited to learn a model that can generalize well to unseen test data in the target domain. For face anti-spoofing, because we do not know what kind of attacks will be presented to the face recognition system, we have no clue about the testing dataset (target domain) at training time, so DG is the more suitable formulation for this task.
Inspired by [11, 15], this paper aims to address the problem of DG for face anti-spoofing in a meta-learning framework. However, directly applying existing vanilla meta-learning for DG algorithms to face anti-spoofing degrades performance because of two issues: 1) Face anti-spoofing models trained only with binary class supervision have been found to discover arbitrary differentiation cues with poor generalization. As illustrated in Fig. 2(a), if vanilla meta-learning algorithms are applied to face anti-spoofing with only binary class labels, the learning directions in the meta-train and meta-test steps will be arbitrary and biased, which makes it difficult for the meta-optimization step to summarize and find a generalized learning direction. 2) Vanilla meta-learning for DG methods coarsely divide the multiple source domains into two groups, forming one aggregated meta-train domain and one aggregated meta-test domain in each iteration of meta-learning. Thus only a single domain shift scenario is simulated in each iteration, which is sub-optimal for face anti-spoofing. To equip the model with generalization ability to unseen attacks in various scenarios, it is better to simulate a variety of domain shift scenarios, rather than a single one, for meta-learning in each iteration.
To address these two issues, as illustrated in Fig. 1, this paper proposes a novel regularized fine-grained meta-learning framework. For the first issue, compared to binary class labels, domain knowledge specific to the task of face anti-spoofing can provide more generalized differentiation information. Therefore, as illustrated in Fig. 2(b), the proposed framework incorporates the domain knowledge of face anti-spoofing as regularization into the feature learning process, so that meta-learning is conducted in a feature space regularized by the auxiliary supervision of domain knowledge. In this way, the regularized meta-learning can focus on more coordinated and better-generalized learning directions in the meta-train and meta-test steps for the face anti-spoofing task, and the summarized learning direction in the meta-optimization step can guide the face anti-spoofing model to exploit more generalized differentiation cues. For the second issue, the proposed framework adopts a fine-grained learning strategy, also shown in Fig. 2(b). This strategy divides the source domains into multiple meta-train and meta-test domains and jointly conducts meta-learning between each pair of them in each iteration. As such, a variety of domain shift scenarios are simultaneously simulated, and more abundant domain shift information can be exploited in meta-learning to train a generalized face anti-spoofing model.
Face Anti-spoofing Methods. Current face anti-spoofing methods can be roughly categorized into appearance-based methods and temporal-based methods. Appearance-based methods extract different appearance cues for attack detection. Multi-scale LBP and color texture methods extract various LBP descriptors in various color spaces to differentiate real from fake faces. Image distortion analysis detects surface distortions caused by the lower appearance quality of printed photos or replayed videos compared to real face skin. Yang et al. train a CNN to extract discriminative deep features for real/fake face classification. On the other hand, temporal-based methods extract different temporal cues across multiple frames to differentiate real from fake faces. The dynamic texture methods in [23, 29, 27] extract different facial motions. Liu et al. [18, 17] propose to capture discriminative rPPG signals from real/fake faces, and further learn a CNN-RNN model to estimate the different face depth and rPPG signals between real and fake faces. However, the performance of both appearance-based and temporal-based methods degrades in cross-dataset tests where unseen attacks are encountered. This is because all the above methods are likely to extract differentiation cues that are biased to the specific attack materials or recording environments of the training datasets. Comparatively, the proposed method conducts meta-learning for DG in simulated domain shift scenarios, which is designed to make the model generalize well and capture more generalized differentiation cues for face anti-spoofing. Note that a recent work proposes multi-adversarial discriminative deep domain generalization for face anti-spoofing. It assumes that generalized differentiation cues can be discovered by searching for a shared and discriminative feature space via adversarial learning. However, there is no guarantee that such a feature space exists among multiple source domains. Moreover, it needs to train multiple extra discriminators, one for each source domain. Comparatively, this paper does not need such a strong assumption, and meta-learning can be conducted without training extra discriminator networks for adversarial learning, which is more efficient.
Meta-learning for Domain Generalization Methods.
Unlike meta-learning for few-shot learning, meta-learning for DG is relatively less explored. MLDG designs a model-agnostic meta-learning method for DG. Reptile is a general first-order meta-learning method that can easily be adapted to the DG task. MetaReg learns regularizers for DG in a meta-learning framework. However, directly applying these methods to face anti-spoofing may run into the two issues mentioned above. Comparatively, our method conducts meta-learning in a feature space regularized by the auxiliary supervision of domain knowledge, within a fine-grained learning strategy. This yields a more feasible meta-learning for DG in the task of face anti-spoofing.
The overall proposed framework is illustrated in Fig. 3.
Suppose that we have access to $N$ source domains of the face anti-spoofing task, denoted as $D = \{D_1, D_2, \dots, D_N\}$. The objective of DG for face anti-spoofing is to make the model trained on the source domains generalize well to unseen attacks from the target domain. To this end, at each training iteration, we divide the original $N$ source domains by randomly selecting $N-1$ domains as meta-train domains and the remaining one as the meta-test domain (denoted as $D_t$). As such, the training/testing domain shift encountered in the real world can be simulated. In this way, our model can learn how to perform well in domain shift scenarios through many training iterations, and thus learn to generalize well to unseen attacks.
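The per-iteration split described above can be sketched in a few lines of plain Python (the function name and the three-dataset example are illustrative, not the paper's code): one source domain is held out as the meta-test domain, and every remaining domain is paired with it, so that several domain shift scenarios are available within a single iteration.

```python
import random

def sample_shift_scenarios(domains, rng=random):
    """Simulate one training iteration's domain shift scenarios.

    One source domain is randomly held out as the meta-test domain; each of
    the remaining N-1 domains is paired with it, yielding N-1 simulated
    train/test domain shifts per iteration (the fine-grained strategy),
    rather than a single aggregated split."""
    meta_test = rng.choice(domains)
    meta_train = [d for d in domains if d != meta_test]
    return [(trn, meta_test) for trn in meta_train]

# With three source datasets, e.g. O, C, I in the O&C&I-to-M setting:
scenarios = sample_shift_scenarios(["O", "C", "I"])
```

In the framework, each such pair would drive one inner meta-train update followed by an evaluation on the shared meta-test batch.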
Several existing vanilla meta-learning for DG methods can be applied to achieve the above objective, but their performance degrades for the task of face anti-spoofing due to the two issues mentioned in the introduction. To address these issues, this paper proposes a new meta-learning framework called regularized fine-grained meta-learning. In each meta-train and meta-test domain, we are provided with image and label pairs denoted as $x$ and $y$, where $y \in \{0, 1\}$ is the ground-truth binary class label ($y$ = 0/1 is the label of a fake/real face). Compared to the binary class labels, domain knowledge specific to the face anti-spoofing task can provide more generalized differentiation information. This paper adopts the face depth map as the domain knowledge. By comparing the spatial information, it can be observed that live faces have face-like depth, while faces of attacks presented on flat, planar papers or video screens have no face depth. In this way, for the first issue, we incorporate this domain knowledge as regularization into the feature learning process, so that meta-learning can be conducted in the feature space regularized by the auxiliary supervision of domain knowledge. Thus, this regularized meta-learning can focus on better-generalized learning directions in meta-train and meta-test for the task of face anti-spoofing. To this end, as illustrated in Fig. 3, our framework builds a convolutional neural network composed of a feature extractor (denoted as $F$) and a meta learner (denoted as $M$). A depth estimator (denoted as $D$) is further integrated into the network, through which the domain knowledge is incorporated. Besides, to address the second issue, the proposed framework adopts a fine-grained learning strategy in which meta-learning is jointly conducted among the $N-1$ meta-train domains and the one meta-test domain in each iteration, so that a variety of domain shift scenarios are simultaneously exploited. The whole meta-learning process is summarized in Algorithm 1 and the details are as follows:
We sample a batch in every meta-train domain $D_i$ ($i = 1, \dots, N-1$), denoted as $\hat{x}_i$, and conduct cross-entropy classification based on the binary class labels in each meta-train domain as follows:

$$\mathcal{L}_{Cls}(\hat{x}_i; \theta_F, \theta_M) = -\sum_{(x,y)\in\hat{x}_i} y \log M(F(x)) + (1-y)\log\big(1 - M(F(x))\big) \quad (1)$$
where $\theta_F$ and $\theta_M$ are the parameters of the feature extractor and the meta learner. In each meta-train domain, we can thus search the learning direction by calculating the gradient of the meta learner w.r.t. this loss, $\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_i; \theta_F, \theta_M)$. The updated meta learner can be calculated as $\theta_M^i = \theta_M - \alpha\,\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_i; \theta_F, \theta_M)$, where $\alpha$ is the inner learning rate. In the meantime, we incorporate face depth maps as the domain knowledge to regularize the above learning process of the feature extractor as follows:

$$\mathcal{L}_{Dep}(\hat{x}_i; \theta_F, \theta_D) = \sum_{(x,I)\in\hat{x}_i} \big\| D(F(x)) - I \big\|^2 \quad (2)$$
where $\theta_D$ is the parameter of the depth estimator and $I$ is the pre-calculated face depth map of the input face image. We use the state-of-the-art dense face alignment network PRNet to estimate the depth maps of real faces, which serve as the supervision for real faces. Attacks are assumed to have no face depth, so depth maps of all zeros are set as the supervision for fake faces.
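A minimal sketch of how the depth supervision targets could be assembled, assuming the depth maps of real faces have already been precomputed with PRNet (the function name, map size, and list-of-lists representation are all illustrative, not the released code):

```python
def depth_target(label, real_depth_map, size=(32, 32)):
    """Build the depth supervision for one face image.

    label: 1 for a real face, 0 for an attack.
    real_depth_map: a depth map precomputed by PRNet for a real face
    (represented here as a hypothetical HxW nested list).
    Real faces are supervised with their PRNet depth map; attacks are
    assumed flat, so their target is an all-zero map."""
    if label == 1:
        return real_depth_map
    h, w = size
    return [[0.0] * w for _ in range(h)]
```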
Moreover, we sample a batch in the one remaining meta-test domain $D_t$, denoted as $\hat{x}_t$. By adopting the fine-grained learning strategy, we encourage the face anti-spoofing model trained on every meta-train domain to simultaneously perform well on the disjoint meta-test domain, so that the model is trained to generalize well to unseen attacks of various scenarios. Thus, multiple cross-entropy classifications are jointly conducted over all the updated meta learners:

$$\sum_{i=1}^{N-1}\mathcal{L}_{Cls}(\hat{x}_t; \theta_F, \theta_M^i) \quad (3)$$
The domain knowledge is also incorporated in the meta-test step, as in the meta-train step:

$$\mathcal{L}_{Dep}(\hat{x}_t; \theta_F, \theta_D) = \sum_{(x,I)\in\hat{x}_t} \big\| D(F(x)) - I \big\|^2 \quad (4)$$
To summarize all the learning information in the meta-train and meta-test steps for optimization, we jointly train the three modules in our network as follows:

$$\theta_M \leftarrow \theta_M - \beta\,\nabla_{\theta_M}\sum_{i=1}^{N-1}\Big(\mathcal{L}_{Cls}(\hat{x}_i; \theta_F, \theta_M) + \mathcal{L}_{Cls}(\hat{x}_t; \theta_F, \theta_M^i)\Big) \quad (5)$$

$$\theta_F \leftarrow \theta_F - \beta\,\nabla_{\theta_F}\Big(\sum_{i=1}^{N-1}\big(\mathcal{L}_{Cls}(\hat{x}_i; \theta_F, \theta_M) + \mathcal{L}_{Dep}(\hat{x}_i; \theta_F, \theta_D) + \mathcal{L}_{Cls}(\hat{x}_t; \theta_F, \theta_M^i)\big) + \mathcal{L}_{Dep}(\hat{x}_t; \theta_F, \theta_D)\Big) \quad (6)$$

$$\theta_D \leftarrow \theta_D - \beta\,\nabla_{\theta_D}\Big(\sum_{i=1}^{N-1}\mathcal{L}_{Dep}(\hat{x}_i; \theta_F, \theta_D) + \mathcal{L}_{Dep}(\hat{x}_t; \theta_F, \theta_D)\Big) \quad (7)$$

where $\beta$ is the meta-optimization learning rate.
Note that in (6), the regression losses of depth estimation provide auxiliary supervision in the optimization of the feature extractor. This regularizes the feature learning process of the feature extractor. In this way, the classifications in (1) and (3) within the meta learner are restrictively conducted in the feature space regularized by the auxiliary supervision of domain knowledge, which makes meta-train and meta-test focus on better-generalized learning directions.
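The meta-train/meta-test interplay described above can be illustrated numerically with a 1-D logistic classifier standing in for the meta learner (plain Python in place of the paper's PyTorch networks; all data and names are illustrative). The inner step updates the meta learner on a meta-train batch, and the meta-test loss is then evaluated at the updated parameter, so its gradient with respect to the original parameter carries the simulated domain shift information:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(w, batch):
    """Binary cross-entropy of a 1-D logistic 'meta learner' with weight w
    on a batch of (feature, label) pairs; stands in for the L_Cls loss."""
    return -sum(y * math.log(sigmoid(w * x)) + (1 - y) * math.log(1.0 - sigmoid(w * x))
                for x, y in batch) / len(batch)

def bce_grad(w, batch):
    """Analytic gradient d/dw of bce: mean of (sigmoid(w*x) - y) * x."""
    return sum((sigmoid(w * x) - y) * x for x, y in batch) / len(batch)

# Illustrative meta-train and meta-test batches of (feature, label); label 1 = real.
meta_train = [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 0)]
meta_test = [(1.5, 1), (-1.5, 0)]

theta_M, alpha = 0.1, 0.5
# Inner meta-train step: theta_M_i = theta_M - alpha * grad.
theta_M_i = theta_M - alpha * bce_grad(theta_M, meta_train)
# The meta-test loss is evaluated at the *updated* parameter, as in the
# meta-optimization step, combining both learning signals.
meta_objective = bce(theta_M, meta_train) + bce(theta_M_i, meta_test)
```

Here the meta-test loss also drops after the inner step because the two toy domains agree on the decision rule; coordinating such directions across domains is exactly what the meta-optimization encourages.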
This section provides a more detailed analysis of the proposed method. The objective of (5) in the meta-optimization is as follows (omitting $\theta_F$ for simplicity):

$$\min_{\theta_M}\ \sum_{i=1}^{N-1}\Big[\mathcal{L}_{Cls}(\hat{x}_i; \theta_M) + \mathcal{L}_{Cls}\big(\hat{x}_t;\ \theta_M - \alpha\,\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_i; \theta_M)\big)\Big] \quad (8)$$
We apply a first-order Taylor expansion to the second term:

$$\mathcal{L}_{Cls}\big(\hat{x}_t;\ \theta_M - \alpha\,\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_i; \theta_M)\big) \approx \mathcal{L}_{Cls}(\hat{x}_t; \theta_M) - \alpha\,\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_i; \theta_M)\cdot\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_t; \theta_M) \quad (9)$$
and the objective becomes:

$$\min_{\theta_M}\ \sum_{i=1}^{N-1}\Big[\mathcal{L}_{Cls}(\hat{x}_i; \theta_M) + \mathcal{L}_{Cls}(\hat{x}_t; \theta_M) - \alpha\,\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_i; \theta_M)\cdot\nabla_{\theta_M}\mathcal{L}_{Cls}(\hat{x}_t; \theta_M)\Big] \quad (10)$$
The above objective shows that meta-optimization finds a generalized learning direction for the meta learner by: 1) minimizing the losses in all meta-train and meta-test domains, and 2) meanwhile coordinating the learning directions (gradient information) between meta-train and meta-test, so that the optimization is conducted without overfitting to a single domain. There are two major differences compared to vanilla meta-learning for DG: 1) the above objective is optimized in the feature space regularized by the domain knowledge supervision instead of in the original instance space. This makes both meta-train and meta-test focus on better-generalized learning directions, and thus their learning directions are more likely to be coordinated for the task of face anti-spoofing (via the third term above). 2) Vanilla meta-learning for DG is simply conducted between one aggregated meta-train domain and one aggregated meta-test domain in each iteration. Comparatively, the above objective is simultaneously optimized over multiple ($N-1$) pairs of meta-train and meta-test domains in each iteration. This fine-grained learning strategy conducts meta-learning in a variety of domain shift scenarios simultaneously, so the face anti-spoofing model can be trained to generalize well to unseen attacks of various scenarios.
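The first-order Taylor step in this analysis can be checked numerically on toy one-parameter losses (a sketch, not the paper's actual losses): for a small inner learning rate, the meta-test loss at the updated parameter is close to the meta-test loss at the original parameter minus the scaled dot product of the two gradients.

```python
def grad(f, w, eps=1e-6):
    """Central-difference derivative of a scalar function f at w."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

# Toy meta-train and meta-test losses over a shared scalar parameter.
L_trn = lambda w: (w - 1.0) ** 2
L_val = lambda w: (w - 2.0) ** 2

theta, alpha = 0.0, 0.01
# Exact meta-test loss after the inner meta-train step.
exact = L_val(theta - alpha * grad(L_trn, theta))
# First-order approximation: L_val(theta) - alpha * gradL_trn . gradL_val.
approx = L_val(theta) - alpha * grad(L_trn, theta) * grad(L_val, theta)
```

Because the two toy gradients point the same way here, the dot-product term lowers the objective, which is how the third term of the expanded objective rewards coordinated learning directions across domains.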
The evaluation of our method is conducted on four public face anti-spoofing datasets that contain both print and video replay attacks: Oulu-NPU (O for short), CASIA-MFSD (C for short), Idiap Replay-Attack (I for short), and MSU-MFSD (M for short). Table 1 in the supplementary material (code is available at https://github.com/rshaojimmy/AAAI2020-RFMetaFAS) shows the variations in these four datasets, and Figure 1 in the supplementary material shows some samples of the genuine faces and attacks. They show that, compared to the seen training data, attacks with unseen materials, illumination, background, resolution and so on cause significant domain shifts among these datasets.
Following the setting in prior work, one dataset is treated as one domain in our experiments. We randomly select three of the four datasets as source domains where domain generalization is conducted. The remaining one is the unseen target domain for testing, which is unavailable during training. Half Total Error Rate (HTER)
(half of the summation of false acceptance rate and false rejection rate) and Area Under Curve (AUC) are used as the evaluation metrics in our experiments.
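For concreteness, the two metrics can be computed as below (a sketch using a fixed 0.5 threshold for HTER; in practice the threshold is usually chosen on a development set, and label 1 denotes a real face):

```python
def hter(scores, labels, threshold=0.5):
    """Half Total Error Rate: mean of the false acceptance rate (attacks
    accepted as real) and false rejection rate (real faces rejected)."""
    fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    n_attack = sum(1 for y in labels if y == 0)
    n_real = sum(1 for y in labels if y == 1)
    return 0.5 * (fa / n_attack + fr / n_real)

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the probability that a random real face outscores a random attack."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```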
Our deep network is implemented on the PyTorch platform. The detailed structure of the proposed network is illustrated in Table 2 in the supplementary material. Training Details. The Adam optimizer is used for the optimization. The learning rates are set to 1e-3. The batch size is 20 per domain, and thus 60 in total for the 3 training domains. Testing. For a new testing sample $x$, its classification score is calculated as $s = M(F(x))$, where $F$ and $M$ are the trained feature extractor and meta learner.
[Table 1: Method | O&C&I to M | O&M&I to C | O&C&M to I | I&C&M to O]
We compare with several state-of-the-art face anti-spoofing methods: Multi-Scale LBP (MS_LBP); Binary CNN; Image Distortion Analysis (IDA); Color Texture (CT); LBPTOP; Auxiliary: to fairly compare with our method, which uses only single-frame information, we implement its face depth estimation component (denoted as Auxiliary(Depth Only)) and also compare with its reported results (denoted as Auxiliary(All)); MMD-AAE; and MADDG. Moreover, we also compare with related state-of-the-art meta-learning for DG methods on the face anti-spoofing task: MLDG; Reptile; and MetaReg.
From the comparison results in Table 1 and Fig. 4, it can be seen that the proposed method outperforms the state-of-the-art face anti-spoofing methods [20, 34, 33, 5, 19]. This is because these methods focus on extracting differentiation cues that only fit the attacks in the source domains. Comparatively, the proposed meta-learning for DG trains the face anti-spoofing model to generalize well in simulated domain shift scenarios, which significantly improves its generalization ability. Moreover, we also compare with DG methods based on adversarial learning for face anti-spoofing [16, 26], and our method again performs better. This is because, instead of learning a domain-shared feature space and training extra domain discriminators, our method only needs to train a simple network with a meta-learning strategy. This realizes DG for face anti-spoofing in a more feasible and efficient way.
Table 2 and Fig. 4 show that our method also outperforms state-of-the-art vanilla meta-learning for DG methods [15, 22] on the task of face anti-spoofing. This illustrates that, by addressing the two issues above, the proposed meta-learning framework better improves the generalization ability for the task of face anti-spoofing.
[Table: Method | O&C&I to M | O&M&I to C | O&C&M to I | I&C&M to O]
Considering that the O&M&I to C setting has the most significant domain shift, we evaluate the different components of our method in this setting as an example; the experimental results are shown in Fig. 5. Ours denotes the proposed method. Ours_wo/meta denotes the proposed network without the meta-learning component; in this setting, we do not conduct meta-learning in the meta learner. Ours_wo/reg denotes the proposed network without domain knowledge regularization; in this setting, we do not incorporate the face depth maps as domain knowledge to regularize the meta-learning process.
Figure 5 shows that the proposed network has degraded performance if any component is excluded. Specifically, the results of Ours_wo/meta verify that the meta-learning conducted in the meta learner benefits the generalization ability. The results of Ours_wo/reg show that without the regularization of domain knowledge supervision, the performance of our meta-learning for DG degrades significantly. This validates that, by addressing the first issue, the proposed meta-learning framework is better able to develop a generalized face anti-spoofing model.
As mentioned in the above analysis, compared to vanilla meta-learning for DG methods, our method adopts a fine-grained learning strategy that helps develop a face anti-spoofing model with generalization ability to unseen attacks of various scenarios. To verify the effectiveness of this strategy, we also run our method in the aggregated setting of vanilla meta-learning for DG, where the proposed regularized meta-learning is only conducted between one aggregated meta-train domain and one aggregated meta-test domain in each training iteration. The results are denoted as Ours (aggregation) in Table 3. Table 3 shows that our method obtains better performance than Ours (aggregation). This validates that the proposed meta-learning with the fine-grained learning strategy is better able to improve the generalization ability for the task of face anti-spoofing. Moreover, the third term in (10) coordinates the learning of meta-train and meta-test so as to prevent the optimization process from overfitting to a single domain. This improves the generalization ability, but at the same time involves computing second-order derivatives of the meta learner's parameters. Some works, such as Reptile, use a first-order approximation to decrease the computational complexity. We thus compare against a variant, denoted Ours (First-order) in Table 3, that replaces the second-order derivative computation in the meta learner with the first-order approximation proposed in Reptile. Results show that our method performs better, which verifies that the second-order derivative information in the third term of (10) is more effective and plays a key role in improving the generalization ability for the task of face anti-spoofing.
To provide more insight into why our method improves the generalization ability for the task of face anti-spoofing, we visualize the attention maps of the networks with the Global Average Pooling (GAP) method. Figure 6 shows some visualization results on testing samples of attacks for Binary CNN and our method. Binary CNN is trained only with the supervision of binary class labels for the face anti-spoofing task, which makes the model focus on capturing biased differentiation cues with poor generalization ability. The visualization of Binary CNN in Fig. 6 shows that, when encountering unseen testing attacks, this method pays the most attention to differentiation cues in the background (rows 1-2) or on paper edges/holding fingers (rows 3-5). These differentiation cues are not generalized, because they change when the attacks come from a new background or lack clear paper edges. Comparatively, Fig. 6 shows that our method always focuses on the internal face region when searching for differentiation cues. Such cues are more likely to be intrinsic and generalized for face anti-spoofing, and thus the generalization ability of our method is improved.
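The GAP-based attention map amounts to a class activation map: a weighted sum of the last convolutional layer's feature maps, with the weights taken from the target class in the global-average-pooling classifier. A minimal sketch (shapes, names, and the toy inputs are illustrative):

```python
def class_activation_map(feature_maps, class_weights):
    """Compute a class activation map.

    feature_maps: list of K feature maps, each an HxW nested list, from
    the last conv layer.
    class_weights: the K classifier weights of the target class applied
    after global average pooling.
    Returns the HxW attention map highlighting discriminative regions."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    return cam
```

Upsampled to the input resolution, such a map shows which image regions (background, paper edges, or the internal face) drive the classification.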
To improve the generalization ability of face anti-spoofing methods, this paper casts face anti-spoofing as a domain generalization problem, which is addressed in a new regularized fine-grained meta-learning framework. The proposed framework conducts meta-learning in a feature space regularized by domain knowledge supervision, so that better-generalized learning information for face anti-spoofing can be meta-learned. Besides, a fine-grained learning strategy is adopted that enables a variety of domain shift scenarios to be simultaneously exploited in meta-learning, so that the model can be trained to generalize well to unseen attacks of various scenarios. Comprehensive experimental results validate the effectiveness of the proposed method both quantitatively and visually.
Acknowledgments This project is partially supported by Hong Kong RGC GRF HKBU12200518. The work of X. Lan is partially supported by HKBU Tier 1 Start-up Grant.
As shown in Table 4 and Fig. 7, many kinds of variations, due to differences in materials, illumination, background, resolution and so on, exist across the four datasets (Oulu-NPU, CASIA-MFSD, Idiap Replay-Attack, and MSU-MFSD). Therefore, significant domain shift exists among these datasets.
The detailed structure of the proposed network is illustrated in Table 5. To be specific, each convolutional layer in the feature extractor, meta learner and depth estimator is followed by a batch normalization layer and a rectified linear unit (ReLU) activation function, and all convolutional kernels are of size 3×3. We extract the RGB and HSV channels of each input image as the network input. Inspired by the residual network, we use a short-cut connection that concatenates the responses of pool1-1, pool1-2 and pool1-3 and sends them to conv3-1 for depth estimation. This operation helps to ease the training procedure.