Face recognition techniques have been used in various identity authentication scenarios and have become increasingly prevalent in recent years. Despite their ease of use, face recognition systems are vulnerable to Presentation Attacks (PAs), a.k.a. Spoofing Attacks, in which an attacker creates face forgeries such as printed photos, digital displays, or masks and launches spoofing attacks by presenting the forgeries to the camera sensor of a face recognition system. To secure face recognition systems, both industry and academia have been paying increasing attention to the problem of Face Presentation Attack Detection (Face PAD), a.k.a. Face Anti-Spoofing (FAS), which aims to discriminate spoofing attacks from bona fide attempts of genuine users.
The FAS research community believes that there are intrinsic disparities between face images captured from bona fide attempts (abbreviated as genuine faces) and those captured from spoofing attacks (abbreviated as spoofing faces) [FAS-ColorTexture-TIFS-2016]. For example, because digital displays are made of glass and have high reflection coefficients, texture patterns such as reflections can be observed in the spoofing faces of replay attacks. Also, for printed photo attacks, the spoofing faces tend to present lower image quality due to the low Dots Per Inch (DPI) and color degradation [FAS-IDA-TIFS-2015].
Based on observations and analysis of domain knowledge, earlier researchers [FAS-ColorTexture-TIFS-2016, moire-analysis-TIFS-2015, IQA-ICPR-2014, DoG-ECCV-2010, SURF-SPL-2017, LBP-TOP-EJIVP-2014, LBP-FAS-BIOSIG-2012] design various kinds of handcrafted features to describe the disparities between genuine and spoofing faces and leverage these features to detect spoofing attacks. Different handcrafted features are usually devised based on different physical meanings. For example, handcrafted image descriptors such as Local Binary Pattern (LBP) [LBP-FAS-BIOSIG-2012] and Speeded Up Robust Features (SURF) [SURF-SPL-2017] are used to describe the texture discrepancy between genuine and spoofing faces. Besides, handcrafted features based on image quality have been proposed to detect spoofing attacks by analyzing the image quality (e.g., blurring) [FAS-IDA-TIFS-2015]. Other than analyzing a single frame, handcrafted features from sequential frames have also been proposed to analyze the motion difference between spoofing faces and genuine ones in the temporal domain [Motion-IJCB-2011, MotionLBP-ICB-2013, LBP-TOP-EJIVP-2014]. Despite their elegant interpretability, handcrafted features rely heavily on experts’ domain knowledge. These features are devised on the basis of specific considerations, and few of them can deal with diverse types of attacks.
Recently, deep neural networks have also been used to learn discriminative feature representations in a data-driven manner for the FAS problem [FAS-3DCNN-TIFS-2018, CAI-2020-DRL, CDCN-CVPR-2020, FAS-LSTMCNN-ICASSP-2018, LIZHI, Ternary-TIFS-2018, CBL]. The deep learning based methods have outperformed the traditional methods, achieving the desired performance in intra-domain experiments. However, training deep neural networks with only RGB images as input and simple binary labels as supervision easily makes the models overfit to the properties of the source domain training data. This results in poor generalization performance in cross-domain experiments where a domain shift exists between training and testing data, including but not limited to variations of environmental illumination, camera specifications, and materials of spoofing attack mediums [FAS-UnsupervisedDA-TIFS-2018, FAS-3DCNN-TIFS-2018].
Among various techniques that aim to mitigate the domain shift problem and improve generalization performance, one promising research direction is to combine task-aware handcrafted features and deep neural networks, and such methods are summarized as hybrid methods [FAS-Survey-arXiv-2021]. As illustrated in the top diagram of Fig. 1, some hybrid methods provide discriminative information by extracting handcrafted features from RGB images (e.g., using LBP [hashemifard2021compact]) or transforming the images in the RGB space to other spaces (e.g., the HSV space [FAS-Auxiliary-CVPR-2018, DTL, LIZHI]) as the input to the neural network models. Although the hybrid strategy can improve the generalization performance of the models to some extent, the extracted handcrafted features may not be representative and generalized enough under complex situations due to the diverse factors of domain shift problem (e.g., different cameras, lightings, attack mediums). As such, how to extract and utilize the features to improve the models’ generalization ability poses a unique challenge for the FAS problem.
In this work, we address the above challenge in a learning-to-learn framework. We devise a learnable neural network, the Meta Pattern Extractor, to extract the Meta Pattern (MP). The MP replaces handcrafted features to provide discriminative or auxiliary information to the target discriminator network to discriminate spoofing attacks (see the bottom of Fig. 1). We expect that the learnable network can extract a representative Meta Pattern, which benefits the generalization ability of the discriminator network. To be specific, a bi-level optimization problem is induced, where the optimizations of the Meta Pattern Extractor and the discriminator network are in the inner and outer optimization levels, respectively. Since solving a bi-level optimization problem is non-trivial, we simplify the optimization by a neat and effective approximation method, which can be easily solved via asynchronous backward propagation (see Section III) and needs no surrogate model. Moreover, we devise a two-stream Hierarchical Fusion Network (HFN) to fuse the original information from the RGB images and the discriminative information from the MP by using our proposed Hierarchical Fusion Module (HFM). The illustrations of our HFN and HFM are shown in Fig. 2.
We summarize our contributions as follows:
We push the hybrid method one step further to be the end-to-end data-driven method by learning to extract the MP from data, instead of extracting handcrafted features manually.
We devise a Hierarchical Fusion Network to fuse information from multiple feature hierarchies with our Hierarchical Fusion Module.
We conduct extensive experiments to verify the effectiveness of the proposed method. The experimental results show that our learned MP can generally achieve better performance over the compared handcrafted features, and our proposed method can achieve state-of-the-art performance in the cross-domain generalization benchmarks.
The remainder of the paper is as follows. Section II discusses the literature related to our work to provide background information. Section III illustrates our method of learning the MP and presents the details of the HFN. Section IV presents the experiments about the preliminary study and ablation study. In Section V, we conclude this paper.
II Related works
Multifarious FAS methods based on RGB images have been proposed in the past decade and have become the mainstream of the FAS research. In this section, we firstly review the development of FAS methods, from traditional methods based on handcrafted features to recent methods based on deep learning. Besides, we also review recent progress of domain generalization methods for FAS, which are most relevant to our work.
II-A Traditional methods for face anti-spoofing
Since spoofing faces of printed photos attacks and replay attacks have undergone multiple capture processes, some texture differences such as blurring, moire patterns, and printed noise can be observed in the spoofing faces because of the image quality distortion during the recapturing [MicroTexture-IJCB-2011, FAS-IDA-TIFS-2015]. Traditional methods use image descriptors such as Local Binary Pattern (LBP), Scale-invariant feature transform (SIFT), Speeded-up Robust Features (SURF), Histogram of Oriented Gradients (HOG), and Difference of Gaussians (DoG) to extract handcrafted features about texture information [LBP-FAS-BIOSIG-2012, SURF-SPL-2017, HOG, DoG-ECCV-2010, deep-forest-lbp]. Considering the color distortion of spoofing faces, Boulkenafet et al. [FAS-ColorTexture-TIFS-2016] utilize the color information and propose to extract color texture features from the illuminance and chrominance components (respective channels of images in YCrCb or HSV color spaces). Specific texture analyses for moire patterns have also been studied [moire-icb, moire-analysis-TIFS-2015]. Wen et al. [FAS-IDA-TIFS-2015] argue that texture features contain information about personal identity, which is redundant for anti-spoofing and could lead to poor generalization performance. Therefore, some works propose handcrafted features based on image quality and distortion analysis for the FAS task [FAS-IDA-TIFS-2015, IQA-ICPR-2014, LI-QUALITY]. Besides, some methods [Motion-IJCB-2011, MotionLBP-ICB-2013, LBP-TOP-EJIVP-2014] extract dynamic texture features from multiple frames to analyze the motion information in the temporal domain other than in the spatial domain.
II-B Deep learning methods for face anti-spoofing
Deep neural networks show strong feature learning capacity and have been widely used in recent FAS methods. Yang et al. [FAS-CNN-ComputerScience-2014] are the first to propose a method that uses a VGG-Net [vggnet] as a feature extractor for the FAS task. They first extract deep features from the fully connected layer of the VGG-Net, and then train a Support Vector Machine (SVM) classifier with the deep features to discriminate genuine and spoofing faces. Motivated by the great success of deep learning techniques, more FAS methods based on deep learning have been proposed [CAI-2020-DRL]. For example, a loss function has been devised for the optimization of the discrimination network [LIZHI]. Some methods use pixel (patch)-wise labels for the supervision of network training [FAS-Auxiliary-CVPR-2018, Ternary-TIFS-2018, deeppixel--ICB-2019, yu2021revisiting]. Yu et al. design spoofing-aware convolution structures for fine-grained feature learning [NASFAS-TPAMI-2020, CDCN-CVPR-2020].
However, learning deep models with only RGB images as input easily makes the models overfit the training data, so they cannot generalize well to the testing data if there are domain shifts between the training and testing data domains. Recently, some hybrid methods that combine handcrafted features and neural networks have been proposed to improve deep models’ generalization performance. For example, Chen et al. [FAS-MSR-TIFS-2019] transform images from the RGB space to the illumination-invariant multi-scale retinex (MSR) space and train an attention-based two-stream network with the RGB and MSR images. Yu et al. construct the spatio-temporal remote photoplethysmography (rPPG) map to represent the signal of heartbeats, and train a vision transformer to detect 3D mask attacks. Li et al. [TIFS-2019-MotionBlur] propose to analyze motion blur from replay attacks by fusing features extracted by a 1D CNN and the Local Similarity Pattern (LSP). Pinto et al. [Pinto] observe that attack mediums and human skin are made of different materials. As such, they utilize the Shape-from-Shading (SfS) algorithm to extract albedo, depth, and reflectance maps as the input of the proposed SfSNet to analyze the material difference between genuine faces and spoofing faces. Some other hybrid methods that take advantage of handcrafted features (e.g., LBP, HOG) and neural networks have also been studied [CNN-LBPTOP-2017, rehman2019perturbing].
Although these hybrid methods improve the generalization performance of deep neural networks, the handcrafted feature used in each hybrid method is based on special considerations. Due to the various factors of domain shifts, it is hard to consider all possible spoofing information by a type of handcrafted feature. This limitation may constrain the hybrid methods as it would be hard to choose the desired handcrafted features when given different source data. Thus, we push the hybrid methods one step further to end-to-end data-driven methods by learning to extract the Meta Pattern in this work.
II-C Domain generalization methods for face anti-spoofing
The data collection conditions of training and testing data could be different, including but not limited to the variations of camera specifications, environment illuminations, and presentation mediums. Such variations of capture condition result in the shift between training and testing data domain and deter the models’ reliability from being deployed in practical scenarios [FAS-UnsupervisedDA-TIFS-2018]. To tackle this problem, domain adaptation methods [FAS-UnsupervisedDA-TIFS-2018, LI-DISTILLATION] utilize some unlabeled target domain data to adapt a model trained with the source domain to improve the performance in the target domain. However, domain adaptation methods require using some target domain data, which is expensive to collect in real-world applications. By contrast, since domain generalization aims to learn a model that can be more generalized to the unseen data domains without using the target domain data [FAS-3DCNN-TIFS-2018], the FAS methods using domain generalization techniques have been extensively studied in recent years [RFMetaFAS-AAAI-2020, NASFAS-TPAMI-2020, MetaTeacher-TPAMI-2021, FAS-3DCNN-TIFS-2018, MADDG-CVPR-2019]. Meta-learning as an effective way to tackle the general domain generalization problem has been introduced to the face anti-spoofing task. Shao et al. use meta-learning to regularize the gradient calculated from the pixel map supervision [RFMetaFAS-AAAI-2020]. Yu et al. [NASFAS-TPAMI-2020] use meta-learning and Neural Architecture Search to search the network architecture that can be generalized to unseen data domains. Yu et al. [MetaTeacher-TPAMI-2021] propose Meta-Teacher to learn a teacher network that can provide data-specific labels for supervision in a teacher-student framework. Different from existing methods, in this paper, we propose a novel method to learn a domain-generalized model by learning to extract the MP from a Meta Pattern Extractor to provide discriminative information in our meta-learning paradigm.
In this section, we formulate the learning of the Meta Pattern Extractor and the discriminator network as a bi-level optimization problem and describe how we solve the optimization problem by using a neat but effective approximation method. After that, we introduce the instantiation details of the Meta Pattern Extractor and the discriminator network (Hierarchical Fusion Network), respectively.
III-A Learning to extract Meta Pattern
III-A1 Problem formulation
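The FAS problem can be formulated as a binary classification problem: genuine or spoofing. In general, a neural network $f_{\theta}$ with parameters $\theta$ can be trained for the genuine/spoofing classification, and the optimization of $\theta$ can be solved by Empirical Risk Minimization (ERM) over the training data, which can be expressed as:
$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(x, y) \in \mathcal{D}_{src}} \big[ \ell(f_{\theta}(x), y) \big], \tag{1}$$
where $x$ is the input RGB image, $y$ is the target label, $\mathcal{D}_{src}$ denotes the source domain data, $f_{\theta}(x)$ denotes the network output given the input $x$, and $\ell$ is the loss function (e.g., Cross Entropy loss). For simplicity, hereafter, we rewrite the loss function calculation used in Eq. 1 as:
$$\mathcal{L}(\theta; \mathcal{D}_{src}) = \mathbb{E}_{(x, y) \in \mathcal{D}_{src}} \big[ \ell(f_{\theta}(x), y) \big]. \tag{2}$$
$\theta^{*}$ can be solved by standard Stochastic Gradient Descent (SGD, without momentum). At the $t$-th iteration, given a batch of data $\mathcal{B}_{t}$, the update of $\theta$ can be expressed as
$$\theta_{t+1} = \theta_{t} - \alpha \nabla_{\theta} \mathcal{L}(\theta_{t}; \mathcal{B}_{t}), \tag{3}$$
where $\alpha$ is the learning rate. To improve the generalization ability, existing hybrid methods extract task-aware handcrafted features as the input to neural networks. If $h$ denotes the (non-learnable) method used to extract handcrafted features in hybrid methods, the optimization objective can be expressed as
$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta, h; \mathcal{D}_{src}), \quad \mathcal{L}(\theta, h; \mathcal{D}) = \mathbb{E}_{(x, y) \in \mathcal{D}} \big[ \ell(f_{\theta}(h(x)), y) \big], \tag{4}$$
and the gradient for the update in Eq. 3 can be rewritten as
$$\nabla_{\theta} \mathcal{L}(\theta_{t}, h; \mathcal{B}_{t}). \tag{5}$$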
However, manually devising the handcrafted feature extractor could be tricky due to the complexity of the source training data. Therefore, we propose to utilize the learning-to-learn paradigm to replace it with a learnable Meta Pattern Extractor, which is implemented by a convolutional neural network, to extract the Meta Pattern (MP). In this paradigm, we expect that the extracted MP can help the discriminator network be more generalized. Therefore, the optimization objective of the Meta Pattern Extractor is to improve the generalization ability of the discriminator network to unseen data domains. Based on Eq. 5, a bi-level optimization problem is induced:
Eq. 6 additionally involves data that represents unseen domains. In Eq. 6, the Meta Pattern Extractor is in the inner level instead of the upper level because it is trained to provide the MP as discriminative and auxiliary information, while the discriminator network is the target model to be optimized for face anti-spoofing. Therefore, the discriminator network is the optimization target in the upper level while the Meta Pattern Extractor is the optimization target in the inner level.
III-A2 Solving the bi-level optimization by approximation
Eq. 6 illustrates the induced bi-level optimization problem, but solving a bi-level optimization problem based on gradients is nontrivial, as the gradient calculation could be implicit and complicated. To solve the original bi-level optimization problem, we propose a neat and effective solution that approximates the inner-level optimum with a local minimum.
In practice, this local minimum can be obtained by several steps of backward propagation. If we consider one step only, the approximation is written as
As such, we relax the constraint of the original problem defined in Eq. 6, and the gradient for the update can be approximated as:
By using this approximation, we avoid the complex high-order calculation of the original bi-level problem.
In Algorithm 1, we describe the overall optimization procedure for the Meta Pattern Extractor and the discriminator network based on Eq. 7 and Eq. 8. As described in Algorithm 1, there are two nested loops. In the outer loop, source data from multiple domains is split into two subsets with no domain overlap between them. As such, one subset is “unseen” to the other and simulates data from unseen domains in each outer loop. The overall optimization procedure is neat and completely end-to-end (single-stage), and complicated high-order gradient calculation is avoided. Moreover, the Meta Pattern Extractor can be trained on the fly without using a surrogate model, and retraining the target model is not needed.
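The domain-disjoint split performed in each outer loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Algorithm 1 itself: the function name `split_source_domains`, the `num_held_out` parameter, and the choice of holding out exactly one domain per outer loop are hypothetical.

```python
import random


def split_source_domains(domains, num_held_out=1, rng=None):
    """Split source domains into two groups that share no domain.

    `num_held_out` is an assumption made for illustration; the text only
    states that the two subsets must not overlap in domain.
    """
    rng = rng or random.Random()
    if not 0 < num_held_out < len(domains):
        raise ValueError("each subset needs at least one domain")
    shuffled = list(domains)
    rng.shuffle(shuffled)
    held_out = shuffled[:num_held_out]   # plays the role of "unseen" data
    seen = shuffled[num_held_out:]       # remaining source domains
    return seen, held_out


# Example with three source datasets (e.g., the "C&I&O to M" setting).
seen, held_out = split_source_domains(["C", "I", "O"], rng=random.Random(0))
```

Re-drawing the split in every outer loop lets each source domain take the "unseen" role over the course of training.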
For the instantiation of the Meta Pattern Extractor, we propose to parameterize it by a convolutional neural network because convolutional kernels can work as learnable filters used to extract features. The structure of the extractor is shown in Fig. 3; it is a shallow network consisting of two convolutional layers. We do not consider a deep network because a deep network usually has more parameters and needs more data to fit. Meanwhile, in our approximation solution, the amount of data for the approximation is small. If the extractor is deep, the approximation may be insufficient because of the small amount of data. Thus, we parameterize the extractor by a network of two vanilla convolutional layers. The sigmoid activation function is used at the end to constrain the output range from 0 to 1. Besides, the output MP has the same size as the input RGB image, which follows [Pinto, FAS-MSR-TIFS-2019]. In the experiment section, we will discuss the effects of other instantiations of the extractor by considering central difference convolution [CDCN-CVPR-2020].
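A forward pass through such a shallow extractor can be sketched in plain Python as below. Only the two convolutional layers, the final sigmoid, and the size-preserving output come from the description above; the 3×3 kernel size, the ReLU between the layers, the four hidden channels, the three output channels, and all function names are assumptions made for illustration.

```python
import math
import random


def conv2d_same(x, weight, bias):
    """Zero-padded convolution; x is [C_in][H][W], weight is [C_out][C_in][k][k]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for o, w_o in enumerate(weight):
        plane = [[bias[o]] * w for _ in range(h)]
        for c in range(c_in):
            for i in range(h):
                for j in range(w):
                    acc = 0.0
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < w:
                                acc += w_o[c][di][dj] * x[c][ii][jj]
                    plane[i][j] += acc
        out.append(plane)
    return out


def relu(x):
    return [[[max(0.0, v) for v in row] for row in plane] for plane in x]


def sigmoid(x):
    return [[[1.0 / (1.0 + math.exp(-v)) for v in row] for row in plane] for plane in x]


def random_kernel(c_out, c_in, k, rng):
    return [[[[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def extract_meta_pattern(rgb, params):
    """Two convolutions, then a sigmoid that keeps the output in (0, 1)."""
    hidden = relu(conv2d_same(rgb, params["w1"], params["b1"]))  # ReLU is assumed
    return sigmoid(conv2d_same(hidden, params["w2"], params["b2"]))


rng = random.Random(0)
params = {
    "w1": random_kernel(4, 3, 3, rng), "b1": [0.0] * 4,  # 3 -> 4 channels (assumed)
    "w2": random_kernel(3, 4, 3, rng), "b2": [0.0] * 3,  # 4 -> 3 channels (assumed)
}
rgb = [[[rng.random() for _ in range(6)] for _ in range(5)] for _ in range(3)]
mp = extract_meta_pattern(rgb, params)  # same height and width as `rgb`
```

In practice the same structure would be expressed with a deep learning framework so that the kernels can be trained by backward propagation; the pure-Python version only makes the data flow explicit.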
III-B Hierarchical Fusion Network
The extracted MP provides task-aware discriminative information for FAS, while the original RGB image contains complete and detailed information. To effectively utilize both the detailed and discriminative information, we devise a two-stream Hierarchical Fusion Network (HFN) to fuse the input RGB images and the extracted MP for FAS. As shown in Fig. 2, the top stream of the HFN is the RGB stream that processes the information from RGB images, and the bottom stream is the MP stream that processes the information from the MP. The RGB and MP streams are identical. For implementation, we adopt the ResNet-50 [ResNet], which is commonly used in image classification, as the backbone for each stream. The features from the end of the two streams are fused to a fully connected layer to obtain a binary vector for the classification. Moreover, we improve the fusion performance by fusing the information more thoroughly from different feature hierarchies. As shown in Fig. 2, the features from different feature hierarchies are fused progressively via our proposed Hierarchical Fusion Module (HFM). The HFM is inspired by the Feature Pyramid Network (FPN). While the FPN constructs the feature pyramid from different hierarchies progressively to improve the detection for small objects [FPN], we fuse information from different hierarchies progressively to improve the fusion.
At each hierarchy of the HFN, an HFM fuses the feature map from the top (RGB) stream, the feature map from the bottom (MP) stream, and the fusion result of the prior hierarchy. The fusion result can be written as
where a convolutional layer is used to align the number of channels of feature maps from different hierarchies, and a nearest interpolation function upsamples the fusion result of the prior hierarchy to the same size as the feature maps of the two streams such that element-wise addition can be conducted. One similar fusion scheme that also fuses information from multiple feature hierarchies for the face anti-spoofing problem is the DC-CDN [DCCDN-IJCAI-2021]. The difference between the fusion in our HFN and the fusion in the DC-CDN is that the DC-CDN directly concatenates the feature maps from multiple levels along the channel axis [DCCDN-IJCAI-2021], which is straightforward but could be coarse, while our HFN fuses information progressively. In the experiment section, we conduct an ablation study to compare the fusion by our HFM with the fusion by concatenation.
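One fusion step of this kind can be sketched as below. The sketch assumes that a 1×1 convolution (a per-pixel channel mix) aligns each stream to a common channel count and that the prior fusion result already has that channel count; these details, the weights, and the function names are hypothetical and are not taken from the paper's equation.

```python
def conv1x1(x, weight):
    """Per-pixel channel mixing; x is [C_in][H][W], weight is [C_out][C_in]."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wc * x[c][i][j] for c, wc in enumerate(w_o)) for j in range(w)]
             for i in range(h)] for w_o in weight]


def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour interpolation of a [C][H][W] map to out_h x out_w."""
    in_h, in_w = len(x[0]), len(x[0][0])
    return [[[plane[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
             for i in range(out_h)] for plane in x]


def add_maps(*maps):
    """Element-wise addition of equally shaped [C][H][W] maps."""
    first = maps[0]
    return [[[sum(m[c][i][j] for m in maps) for j in range(len(first[0][0]))]
             for i in range(len(first[0]))] for c in range(len(first))]


def hierarchical_fuse(rgb_feat, mp_feat, prev_fused, w_rgb, w_mp):
    """Fuse RGB features, MP features, and the upsampled prior fusion result."""
    rgb_aligned = conv1x1(rgb_feat, w_rgb)
    mp_aligned = conv1x1(mp_feat, w_mp)
    if prev_fused is None:  # first fusion step has no prior result
        return add_maps(rgb_aligned, mp_aligned)
    h, w = len(rgb_aligned[0]), len(rgb_aligned[0][0])
    return add_maps(rgb_aligned, mp_aligned, upsample_nearest(prev_fused, h, w))


rgb_feat = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 channels, 4x4
mp_feat = [[[2.0] * 4 for _ in range(4)] for _ in range(2)]   # 2 channels, 4x4
prev_fused = [[[1.0, 2.0], [3.0, 4.0]]]                       # 1 channel, 2x2
fused = hierarchical_fuse(rgb_feat, mp_feat, prev_fused,
                          w_rgb=[[1.0, 1.0]], w_mp=[[0.5, 0.5]])
```

Because the three inputs are added rather than concatenated, the channel count stays constant across fusion steps, which is what allows the result to be passed on to the next hierarchy.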
In the final fusion hierarchy, the fused feature map is used to predict a pixel map to take advantage of the pixel-wise supervision [FAS-Auxiliary-CVPR-2018, deeppixel--ICB-2019, yu2021revisiting, Ternary-TIFS-2018]. As such, both the binary classification supervision and pixel-wise supervision are used to optimize our proposed HFN. We apply Binary Cross Entropy loss to the binary vector output from the FC layer, where spoofing faces are labeled as “0” and genuine faces are labeled as “1”. We apply pixel-wise Mean Square Error (MSE) loss to the output pixel maps from the HFN. In the pixel-wise supervision, each spoofing face or genuine face is assigned a target pixel map with all elements as “0” or “1”, respectively. As such, the final loss for optimization is
In the testing stage, we combine the output binary vector and the pixel map to get a score and use the score for classification. Given an output binary vector and an output pixel map, the score used for the classification is calculated by
represents the probability of the input being “genuine”.
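The two supervision signals used for training, described earlier in this subsection, can be sketched as follows. The label convention (spoofing = 0, genuine = 1) and the choice of Binary Cross Entropy and pixel-wise MSE come from the text; the unweighted sum of the two terms, the use of a single genuine-class probability, and the function names are assumptions made for illustration.

```python
import math


def binary_cross_entropy(p_genuine, label, eps=1e-7):
    """BCE for one sample; `p_genuine` is the predicted genuine probability."""
    p = min(max(p_genuine, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pixel_mse(pixel_map, label):
    """MSE against a target map whose elements all equal the label (0 or 1)."""
    values = [v for row in pixel_map for v in row]
    return sum((v - label) ** 2 for v in values) / len(values)


def total_loss(p_genuine, pixel_map, label):
    # Assumption: the two terms are summed with equal weight.
    return binary_cross_entropy(p_genuine, label) + pixel_mse(pixel_map, label)


# A confident, correct prediction on a genuine face gives a small loss.
loss_good = total_loss(0.9, [[0.9, 0.9], [0.9, 0.9]], label=1)
loss_bad = total_loss(0.1, [[0.1, 0.1], [0.1, 0.1]], label=1)
```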
|Method||C&I&O to M||O&M&I to C||O&C&M to I||I&C&M to O|
For evaluation, we consider the situation of complicated source domain data and use the multi-source domain generalization setting introduced in [MADDG-CVPR-2019], which has been the FAS cross-domain benchmark used by recent methods [RFMetaFAS-AAAI-2020, NASFAS-TPAMI-2020, SSDG-CVPR-2020, MetaTeacher-TPAMI-2021]. In this setting, four benchmark datasets, the MSU-MFSD [FAS-IDA-TIFS-2015] (M), IDIAP REPLAY-ATTACK [LBP-FAS-BIOSIG-2012] (I), CASIA-FASD [DB-CASIAFASD] (C), and OULU-NPU [OULU_NPU_2017] (O) datasets, are used, where any three of the four datasets are used for training and the remaining one is used for testing. For simplicity, we refer to this setting as MICO (after the initial letters of the four datasets). We also consider the ROSE-YOUTU dataset [FAS-UnsupervisedDA-TIFS-2018], a dataset for industrial needs. Similar to MICO, the ROSE-YOUTU (Y) dataset is used to conduct other cross-domain experiments in [DRUDA-TIFS-2020], where the MSU-MFSD [FAS-IDA-TIFS-2015] (M), IDIAP REPLAY-ATTACK [LBP-FAS-BIOSIG-2012] (I), and CASIA-FASD [DB-CASIAFASD] (C) datasets are also used. For simplicity, we refer to this setting as MICY. We also use MICY to provide a more extensive evaluation. Next, we briefly introduce these five datasets.
CASIA-FASD [DB-CASIAFASD] contains genuine and spoofing faces from 50 genuine subjects. Cameras of low, normal, and high imaging resolutions are used to capture face images. Therefore, each subject has 3 kinds of live faces captured under three different resolutions. Also, CASIA-FASD consists of three kinds of 2D attacks: warped photo attacks, cut photo attacks, and video attacks. Thus, there are 3×3=9 kinds of spoofing faces. There are lighting variances, but the variances are not annotated.
IDIAP REPLAY-ATTACK [LBP-FAS-BIOSIG-2012] captures all genuine and spoofing faces from 50 genuine subjects. Five attack manners including four kinds of replayed faces and one kind of printed face are used to produce spoofing faces. The data is captured under two different lighting conditions: normal lighting and adverse lighting.
MSU-MFSD [FAS-IDA-TIFS-2015] contains genuine and spoofing faces from 35 genuine subjects captured by two cameras. The spoofing faces are produced with two digital displays for replay attacks and one printer for photo attacks. The data is captured under an indoor lighting condition only.
OULU-NPU [OULU_NPU_2017] is a large-scale dataset with 55 subjects. The captured face images are of high image quality since the OULU-NPU dataset captures data of genuine faces and spoofing faces by six cameras of high resolution. The spoofing faces consist of two kinds of printed spoofing faces and two kinds of replayed spoofing faces. The data is collected under three environmental sessions.
ROSE-YOUTU (ROSE-YOUTU Face Liveness Detection Database) [FAS-UnsupervisedDA-TIFS-2018] is a recent large-scale dataset from the industry. It collects data from 20 subjects. For each subject, there are 25 genuine and 150 to 200 spoofing face videos, captured by five kinds of camera modules (the front-facing cameras of a Hasee phone, a Huawei phone, a ZTE phone, an iPad, and an iPhone 5s), such that the resolutions vary across devices. Other than photo attacks and replay video attacks, the ROSE-YOUTU dataset involves various paper mask attacks. Moreover, ROSE-YOUTU covers 5 different lighting conditions. Some face examples from these five datasets are shown in Fig. 4, and we can observe that different disparities appear in different datasets. For example, Fig. 4(d) shows the cutting edge. Color distortion appears in Fig. 4(d), Fig. 4(h), and Fig. 4(o). Fig. 4(o) shows moire patterns. Fig. 4(s) shows reflection patterns.
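The leave-one-dataset-out protocol of the MICO setting described above can be enumerated with a few lines of Python. The helper name `leave_one_out_protocols` is hypothetical, and the order of the source letters in each generated name is arbitrary (the paper writes, e.g., “C&I&O to M” for the same split).

```python
DATASETS = {
    "M": "MSU-MFSD",
    "I": "IDIAP REPLAY-ATTACK",
    "C": "CASIA-FASD",
    "O": "OULU-NPU",
}


def leave_one_out_protocols(codes):
    """Return (name, source_codes, target_code) for every held-out dataset."""
    protocols = []
    for target in codes:
        sources = [c for c in codes if c != target]
        name = "&".join(sources) + " to " + target
        protocols.append((name, sources, target))
    return protocols


mico = leave_one_out_protocols(list(DATASETS))
for name, sources, target in mico:
    print(name, "| train on:", [DATASETS[c] for c in sources],
          "| test on:", DATASETS[target])
```

The same helper applied to `["M", "I", "C", "Y"]` would enumerate the four MICY experiments.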
As for data processing, we use MTCNN [MTCNN] to detect and crop face images from video frames. The cropped face images are resized before being used as the network input. We use PyTorch [pytorch] to implement the networks and conduct the experiments. We use the Stochastic Gradient Descent (SGD) optimizer with a momentum value of 0.9. The learning rate and, unless otherwise specified, the number of approximation steps (used in Algorithm 1) are set to fixed values. In each mini-batch, we sample 4 genuine faces and 4 spoofing faces from each dataset domain to balance the samples across dataset domains and the ratio between genuine and spoofing faces. As such, there will be a batch of 24 (4 × 2 × 3) if there are 3 source domain datasets. For performance evaluation and comparison, we follow [RFMetaFAS-AAAI-2020, NASFAS-TPAMI-2020, SSDG-CVPR-2020, MetaTeacher-TPAMI-2021] to report the Half Total Error Rate (HTER) and the Area Under the Receiver Operating Characteristic Curve (AUC) as performance metrics.
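For reference, HTER is the average of the false acceptance rate and the false rejection rate at a given decision threshold. The sketch below assumes that a higher score means "more likely genuine" and uses a fixed threshold of 0.5; how the threshold is chosen in the benchmark is not stated in this section, so that value and the function name `hter` are illustrative assumptions.

```python
def hter(scores, labels, threshold=0.5):
    """Half Total Error Rate: mean of FAR and FRR at `threshold`.

    labels: 1 for genuine, 0 for spoofing; scores: higher means more genuine.
    """
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not genuine or not spoof:
        raise ValueError("both classes are required")
    frr = sum(s < threshold for s in genuine) / len(genuine)  # genuine rejected
    far = sum(s >= threshold for s in spoof) / len(spoof)     # spoof accepted
    return (far + frr) / 2


scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
error = hter(scores, labels)  # one genuine rejected, one spoof accepted
```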
|Method|C&I&O to M HTER (%)|C&I&O to M AUC (%)|O&M&I to C HTER (%)|O&M&I to C AUC (%)|O&C&M to I HTER (%)|O&C&M to I AUC (%)|I&C&M to O HTER (%)|I&C&M to O AUC (%)|
|NAS-Baseline w/ D-Meta [NASFAS-TPAMI-2020]|11.62|95.85|16.96|89.73|16.82|91.68|18.64|88.45|
|NAS w/ D-Meta [NASFAS-TPAMI-2020]|16.85|90.42|15.21|92.64|11.63|96.98|13.16|94.18|
|Method||M&C&Y to I||I&C&Y to M||I&M&Y to C||I&C&M to Y|
|Method||M&I to C||M&I to O|
IV-C Comparison with handcrafted features
In this section, we compare the MP with handcrafted features. We follow [FAS-ColorTexture-TIFS-2016] to extract the ColorLBP map from the R, G, and B channels respectively, with specified values for the number of neighbor pixels and the radius. We retain the 2D spatial structure in the LBP map instead of collecting the histogram to construct LBP feature vectors. Besides, we extract the Albedo, Depth, and Reflectance maps by using the Shape-from-Shading algorithm [Pinto] with the official code (https://github.com/allansp84/shape-from-shading-for-face-pad).
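A per-channel LBP map that keeps the 2D spatial structure, as opposed to a histogram, can be sketched as follows. The sketch assumes the basic 3×3 operator (8 neighbors at radius 1), a clockwise bit order starting at the top-left neighbor, and edge replication at the borders; none of these settings is confirmed by the text, and the function names are hypothetical.

```python
# Clockwise neighbour offsets starting from the top-left pixel (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_map(channel):
    """Return an LBP code per pixel, with the same height and width as the input."""
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            center = channel[i][j]
            code = 0
            for bit, (di, dj) in enumerate(OFFSETS):
                ii = min(max(i + di, 0), h - 1)  # replicate edges at the border
                jj = min(max(j + dj, 0), w - 1)
                if channel[ii][jj] >= center:
                    code |= 1 << bit
            out[i][j] = code
    return out


def color_lbp(rgb):
    """Apply the LBP operator to the R, G, and B channels independently."""
    return [lbp_map(channel) for channel in rgb]


channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
codes = lbp_map(channel)
```

Keeping the code map rather than its histogram preserves where each texture pattern occurs, so the map can be fed to a convolutional stream in the same way as an image.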
The extracted ColorLBP, Albedo, Depth, Reflectance, and MP maps are visually compared in Fig. 5. The first row and the second row are the genuine and spoofing face examples from the CASIA-FASD dataset [DB-CASIAFASD]. For all these maps, we can observe differences between the genuine and spoofing examples. As the other handcrafted features have been shown to provide discriminative information for spoofing detection [FAS-Survey-arXiv-2021], it is difficult to compare and rank these maps visually to state which is more representative. This difficulty corresponds to the motivating question of our work: “how to choose desired handcrafted features to devise a more generalized hybrid method for the face anti-spoofing problem”. Therefore, we address this question by using an end-to-end data-driven method: learning to extract the MP. To compare and analyze quantitatively, we train our proposed HFN with the MP based on Algorithm 1 (“HFN+MP”). To compare with the handcrafted features, we replace the MP with the ColorLBP, Albedo, Depth, and Reflectance maps respectively to train other HFNs, denoted as “HFN+ColorLBP”, “HFN+Albedo”, “HFN+Depth”, and “HFN+Reflectance”. The experimental results on the MICO setting are shown in TABLE I. When we compare the handcrafted ColorLBP, Reflectance, Albedo, and Depth maps in TABLE I (excluding the MP), we can observe that each map has its own advantages and disadvantages in different experiments. For example, “HFN+Reflectance” achieves the best HTER and AUC in “C&I&O to M”; “HFN+Depth” achieves the best HTER and AUC in “O&C&M to I”; “HFN+ColorLBP” achieves the best HTER and AUC in “I&C&M to O”.
This observation shows that the representation ability of different handcrafted features differs across source domain data, which reinforces our motivation that it is non-trivial to manually extract handcrafted features to improve models’ generalization ability under different complicated source domain data. As such, we propose to learn a generalized model by learning to extract the MP. We can see from TABLE I that our “HFN+MP” achieves the best AUC results in “O&M&I to C”, “O&C&M to I”, and “I&C&M to O” and the best HTER results in all four experiments. In summary, the experimental results in TABLE I justify the motivation of our work and demonstrate the effectiveness of our proposed MP.
IV-D Comparison with state-of-the-art methods
IV-D1 Comparisons on the MICO domain generalization benchmark
Recent FAS methods on domain generalization use the MICO benchmark for evaluation [RFMetaFAS-AAAI-2020, NASFAS-TPAMI-2020, SSDG-CVPR-2020, MetaTeacher-TPAMI-2021]. We also follow the MICO benchmark to compare our method with state-of-the-art methods. The results are shown in TABLE II. In the “C&I&O to M”, “O&M&I to C”, and “I&C&M to O” settings, our method HFN+MP achieves the best HTER and AUC results. In short, our method generally achieves state-of-the-art performance.
IV-D2 Comparisons in the MICY cross-domain experiments
To further evaluate the proposed method, we follow [DRUDA-TIFS-2020] to use the MICY setting to conduct domain generalization experiments. The experimental results can be found in TABLE III. In TABLE III, the listed methods are based on domain adaptation (DA), which uses the target domain data, whereas our method does not use target domain data. Even compared with these DA methods, our “HFN+MP” achieves significantly lower HTER than the other methods in the “I&C&Y to M”, “I&M&Y to C”, and “I&C&M to Y” settings. The results show the effectiveness of our MP. Although the AUC results of the other listed methods are not available, we still provide the AUC results of our method for readers who are interested in making comparisons.
IV-D3 Limited source domains
We also evaluate our proposed method when the number of source domains is limited. We follow [SSDG-CVPR-2020] and use a variant of the MICO benchmark in which only two datasets serve as source training data. As shown in TABLE IV, our proposed method achieves the best HTER and AUC compared with state-of-the-art methods. Moreover, in the “M&I to O” experiment, our proposed method achieves an HTER over 10% lower than that of the other methods. As such, our proposed method remains effective when fewer source domains are available.
Overall, through extensive comparisons with state-of-the-art methods in multi-source domain generalization experiments, we demonstrate the effectiveness of training deep models by learning to extract the Meta Pattern for the FAS problem. This provides a useful reference for future work on more generalized methods for the face anti-spoofing task, and possibly for other tasks.
IV-E Ablation study
In this section, we conduct an ablation study to show how each component of our proposed method affects performance. We study the effectiveness of the fusion with the HFM in the HFN. We also study the effectiveness of the approximation solution described in Algorithm 1 by comparing it with an end-to-end training strategy and by trying different numbers of approximation steps. Besides, we implement different instantiations of the Meta Pattern Extractor and compare the results. All ablation experiments use the MICO benchmark, and we report the HTER results for comparison.
IV-E1 The effect of the proposed Hierarchical Fusion Module
In our HFN, we adopt the Hierarchical Fusion Module to progressively fuse the RGB images and the extracted MP across multiple feature hierarchies. DC-CDN [DCCDN-IJCAI-2021], a two-stream network for FAS, also fuses information from multiple feature hierarchies of its two streams. In DC-CDN, the feature maps from different hierarchies are directly concatenated along the channel axis, which is straightforward but could be coarse. We improve the fusion by using the HFM to fuse the features progressively. In this ablation experiment, we keep the two-stream structure of the HFN, but we remove the HFM and instead concatenate the feature maps along the channel axis. The HTER results of the different fusion methods are compared in Fig. 6. Our HFM achieves lower HTER than concatenation in 18 out of 20 pairs of experiments. The proposed HFM therefore benefits not only the MP but also other handcrafted feature maps.
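The concatenation baseline used in this ablation can be illustrated with a minimal sketch. It assumes feature maps in (N, C, H, W) layout with matching batch and spatial sizes; the shapes and the function name `concat_fusion` are hypothetical, and the HFM itself is not reproduced here.

```python
import numpy as np

def concat_fusion(rgb_feat, mp_feat):
    """Fuse two feature maps of shape (N, C, H, W) by channel-wise concatenation."""
    same_batch = rgb_feat.shape[0] == mp_feat.shape[0]
    same_spatial = rgb_feat.shape[2:] == mp_feat.shape[2:]
    if not (same_batch and same_spatial):
        raise ValueError("batch and spatial dimensions must match")
    # axis=1 is the channel axis in (N, C, H, W) layout
    return np.concatenate([rgb_feat, mp_feat], axis=1)

# Hypothetical feature maps from the RGB stream and the MP stream.
rgb_feat = np.zeros((2, 64, 32, 32))
mp_feat = np.ones((2, 64, 32, 32))
fused = concat_fusion(rgb_feat, mp_feat)
print(fused.shape)  # (2, 128, 32, 32)
```

Concatenation simply stacks the channels of the two streams and leaves any interaction between them to the subsequent layers, which is one way to read the remark that it "could be coarse".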
IV-E2 The effect of the proposed optimization algorithm
According to Algorithm 1, we optimize the Meta Pattern Extractor and the HFN separately. If the two are instead treated as a single network, that network can be trained directly in an end-to-end manner according to Eq. 1 and Eq. 3. In this ablation experiment, we study the effectiveness of our proposed Optimization Algorithm (OA). The results are shown in Fig. 7. Our OA consistently achieves lower HTER than direct end-to-end training without the OA in all four experiments, which shows the effectiveness of the proposed OA.
IV-E3 The effect of different steps for the approximation
In our proposed optimization algorithm (Algorithm 1), we approximate the local minimum by taking a fixed number of gradient descent steps. The quality of this approximation is essential to the final optimization result. In this ablation experiment, we study how the number of steps affects the generalization performance. The HTER results for different numbers of steps (1, 2, 4, 8) are plotted in Fig. 9. When the number of steps increases from 1 to 4, the performance generally improves. We conjecture that when the number of steps is small, the approximation is not accurate enough. When the number of steps increases from 4 to 8, the performance can drop. We conjecture that with a large number of steps, the approximated local minimum may fall into an “overfitting” pitfall, which leads to poorer performance.
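The idea of approximating a local minimum with a fixed number of gradient descent steps, and alternating it with an update of the other parameter group, can be illustrated on a toy problem. The sketch below is not Algorithm 1: the scalar parameters `a` and `b`, the quadratic loss, the learning rate, and all function names are hypothetical stand-ins chosen only so that the behavior is easy to verify.

```python
def loss(a, b):
    """Toy objective with its minimum at a = 1, b = 1."""
    return (b - a) ** 2 + (a - 1.0) ** 2

def grad_a(a, b):
    return -2.0 * (b - a) + 2.0 * (a - 1.0)

def grad_b(a, b):
    return 2.0 * (b - a)

def approximate_inner_minimum(a, b, num_steps, lr=0.1):
    """Approximate argmin over b of loss(a, b) with num_steps gradient steps."""
    for _ in range(num_steps):
        b = b - lr * grad_b(a, b)
    return b

def alternate(a, b, num_steps, outer_iters=200, lr=0.1):
    """Alternate a multi-step update of b (a frozen) with one update of a (b frozen)."""
    for _ in range(outer_iters):
        b = approximate_inner_minimum(a, b, num_steps, lr)
        a = a - lr * grad_a(a, b)
    return a, b

a_final, b_final = alternate(a=3.0, b=-2.0, num_steps=4)
print(a_final, b_final, loss(a_final, b_final))
```

On this toy objective, more inner steps always bring `b` closer to the exact inner minimum; the sketch does not model the overfitting effect conjectured above, which depends on training data.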
IV-E4 The effect of different Meta Pattern Extractors
In our proposed method, we parameterize the Meta Pattern Extractor by a network of two convolutional layers. In this ablation experiment, we study the influence of different instantiations of the extractor by removing or adding one layer, which gives a one-layer and a three-layer convolutional network to compare with the default two-layer one. We also explore the central difference convolution (CDC) [CDCN-CVPR-2020], which is specifically designed for the face anti-spoofing problem. We replace the vanilla convolutions with the CDC in the one-, two-, and three-layer networks respectively and obtain three CDC-based counterparts. As shown in Fig. 8, different instantiations of the extractor lead to different performance. For example, in the “C&I&O to M” experiment, two extractors with the same number of layers but different convolution types reach different HTERs, and which convolution type performs better is not consistent across the three depths. We can empirically observe that the instantiation of the extractor influences the performance, but a theoretical analysis of this influence is left for future work. The design of the extractor is also an open problem beyond the scope of this paper, which we leave for future research.
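For readers unfamiliar with CDC, a minimal single-channel sketch is given below. It follows the commonly cited formulation in which the output equals the vanilla convolution response minus θ times the center pixel multiplied by the sum of the kernel weights. The function names, the NumPy implementation without padding, the odd kernel size, and the default value θ = 0.7 are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 2D cross-correlation of image x (H, W) with kernel w (k, k), no padding."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def central_difference_conv2d(x, w, theta=0.7):
    """CDC sketch: vanilla response minus theta * center pixel * sum of kernel weights.

    Assumes an odd kernel size so that each window has a well-defined center.
    """
    k = w.shape[0]
    c = k // 2
    # Pixel located under the kernel center for every output position.
    center = x[c:x.shape[0] - c, c:x.shape[1] - c]
    return conv2d_valid(x, w) - theta * center * w.sum()

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 7))
kernel = rng.normal(size=(3, 3))
print(central_difference_conv2d(image, kernel).shape)  # (4, 5)
```

With θ = 0 the operator reduces to the vanilla convolution, and with θ = 1 it responds only to differences between each pixel and the window center, so a constant image yields an all-zero output.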
V Conclusion and Future Work
In this paper, we present a novel method for domain-generalized FAS that learns to extract the Meta Pattern. Our method pushes the hybrid method one step further toward a fully end-to-end method, since the Meta Pattern is learned rather than extracted manually as a handcrafted feature. Besides, we devise a novel two-stream Hierarchical Fusion Network with our proposed Hierarchical Fusion Module to fuse the information from the RGB images and the MP. Moreover, our method achieves state-of-the-art performance on the MICO and MICY domain generalization benchmarks.
In the future, we will explore more effective methods based on the idea of the Meta Pattern, for example by improving the Meta Pattern Extractor and the optimization algorithm. We can also explore transferring the idea of the Meta Pattern to other areas, such as deepfake detection.