Face recognition systems have made remarkable progress and been widely applied in various scenarios [56, 63, 10, 77, 70, 73]. However, they are vulnerable to physical presentation attacks, such as print attacks, replay attacks or 3D-mask attacks. To make matters worse, these spoof mediums can be easily manufactured in practice. Thus, both academia and industry have paid extensive attention to developing face anti-spoofing (FAS) technology for securing face recognition systems.
Previous works on face anti-spoofing mainly focus on how to obtain discriminative information between live and spoofing faces. The texture-based methods [16, 57, 38, 75, 58, 3] leverage textural features to distinguish spoofing faces. Besides, several works [45, 53, 46, 36] are designed based on liveness cues (e.g. rPPG, reflection) for dynamic discrimination. Recent works [80, 1, 53, 42, 83, 65, 40, 25] employ convolutional neural networks (CNNs) to extract discriminative features, and have made great progress in face anti-spoofing. However, most existing anti-spoofing datasets are insufficient in subject numbers and variances. For instance, the commonly used OULU-NPU and SiW only have 20 and 90 subjects in the training set, respectively. As a consequence, models may easily suffer from over-fitting and thus lack the ability to generalize to unseen targets or unknown attacks.
To overcome the above issue, Yang et al. propose a data collection solution along with a data synthesis technique to obtain a large amount of training data, which well reflects real-world scenarios. However, this solution requires collecting data from external data sources, and the data synthesis technique is time-consuming and inefficient. Liu et al. design an adversarial framework to disentangle the spoof trace from the input spoofing face. The spoof trace is deformed towards the live face in the original dataset to synthesize new spoofing faces as the external data for training the generator in a supervised fashion. However, their method performs image-to-image translation and can only synthesize new spoof images with the same identity attributes, so the data still lacks inter-class diversity. Considering the above limitations, we try to solve this issue from the perspective of sample generation. The overall solution is presented in Fig. 1. We first train a generator on the original dataset to obtain new face images containing abundant new identities and the original spoofing patterns. Then, both the original and generated face images are utilized to enhance the training of the FAS network in a proper way. Specifically, our method can generate large-scale live and spoofing images with diverse variances from random noise, and the images are not limited to the same identity attributes. Moreover, the training process of the generator does not require external data sources.
In this paper, we propose a novel Dual Spoof Disentanglement Generation (DSDG) framework to enlarge the intra- and inter-class diversity without external data. Inspired by the promising capacity of the Variational Autoencoder (VAE) in interpretable latent disentanglement, we first adopt a VAE-like architecture to learn the joint distribution of the identity information and the spoofing patterns in the latent space. Then, the trained decoder can be utilized to generate large-scale paired live and spoofing images, with noise sampled from the standard Gaussian distribution as the input, to boost the diversity of the training set. It is worth mentioning that the generated face images contain diverse identities and variances and also preserve the original spoofing patterns. Different from the image-to-image translation of Liu et al., our method performs generation from noise, i.e. noise-to-image generation, and can generate both live and spoofing images with new identities. Unlike the solution of Yang et al., our method relies only on the original data source and not on external data sources. However, we observe that a small portion of the generated face images are partially distorted due to the inherent limitation of VAEs that some generated samples are blurry and of low quality. It is difficult to predict precise depth values for such noisy samples, which may thus obstruct the widely used depth-supervised training. To solve this issue, we propose a Depth Uncertainty Module (DUM) to estimate the confidence of the depth prediction. During training, the predicted depth map is no longer deterministic, but sampled from a dynamic depth representation, which is formulated as a Gaussian distribution with learned mean and variance and is related to the depth uncertainty of the original input face. In the inference stage, only the deterministic mean values are employed as the final depth map to distinguish live faces from spoofing faces. It is worth mentioning that DUM can be directly integrated with any depth-supervised FAS network, and we find in experiments that DUM can also improve the training with real data. Extensive experiments on multiple datasets and protocols are conducted to demonstrate the effectiveness of our method. In summary, our contributions are threefold:
We propose a Dual Spoof Disentanglement Generation (DSDG) framework to learn a joint distribution of the identity representation and the spoofing patterns in the latent space. Then, a large-scale new set of paired live and spoofing images is generated from random noise to boost the diversity of the training set.
To alleviate the adverse effects of some generated noisy samples in the training of FAS models, we introduce a Depth Uncertainty Module (DUM) to estimate the reliability of the predicted depth map by depth uncertainty learning. We further observe that DUM can also improve the training when only the original data is involved.
Extensive experiments on five popular FAS benchmarks demonstrate that our method achieves remarkable improvements over state-of-the-art methods in both intra- and inter-testing settings.
II Related Work
Face Anti-Spoofing. Early traditional face anti-spoofing methods mainly rely on hand-crafted features to distinguish live faces from spoofing faces, such as LBP [16, 17, 57], HoG [38, 75], SIFT and SURF. However, these methods are sensitive to variations in illumination, pose, etc. Taking advantage of the strong representation ability of CNNs, many CNN-based methods have been proposed recently. Similar to the early works, [58, 42, 32, 24, 65, 26, 25] extract the spatial texture from a single frame with CNN models. Others attempt to utilize CNNs to obtain discriminative features. For instance, Zhu et al. propose a general Contour Enhanced Mask R-CNN model to detect the contours of the spoofing medium in the image. Chen et al. consider the inherent variance from acquisition cameras at the feature level for generalized FAS, and design a network with two CNN branches to learn the camera-invariant spoofing features and recompose the low/high-frequency components, respectively. Besides, [47, 78, 71, 40] combine the spatial and temporal textures from the frame sequence to learn more distinguishing features between live and spoofing faces. Specifically, Zheng et al. present a two-stream spatial-temporal network with a scale-level attention module, which joins the depth and multi-scale information to extract the essential discriminative features. Meanwhile, some auxiliary supervision signals are adopted to enhance the model's robustness and generalization, such as rPPG [45, 53, 46], depth map [80, 1, 53, 79, 74], [84, 49] and reflection. Recently, some novel methods have been introduced to FAS. For instance, Quan et al. propose a semi-supervised learning based adaptive transfer mechanism to obtain more reliable pseudo-labeled data to learn the FAS model. Deb et al. introduce a Regional Fully Convolutional Network to learn local cues in a self-supervised manner. Cai et al. utilize deep reinforcement learning to extract discriminative local features by modeling the behavior of exploring face-spoofing-related information from image sub-patches. Yu et al. present a pyramid supervision to provide richer multi-scale spatial context for fine-grained supervision, which can be plugged into existing pixel-wise supervision frameworks. Zhang et al. decompose images into patches to construct a non-structural input and recombine patches from different subdomains or classes. Besides, [64, 31, 13, 69, 51, 76, 41, 68] introduce domain generalization into the FAS task to alleviate the poor generalizability to unseen domains. Qin et al. [60, 59] utilize a meta-teacher to supervise the presentation attack detector to learn rich spoofing cues.
Recently, researchers have paid more attention to solving face anti-spoofing with generative models. Jourabloo et al. decompose a spoofing face image into a live face and a spoof noise, then utilize the spoof noise for classification. Liu et al. present a GAN-like image-to-image framework to translate the face image from the RGB to the NIR modality. Then, the discriminative feature of the RGB modality can be obtained with the assistance of the NIR modality to further promote the generalization ability of the FAS model. Inspired by disentangled representation, Zhang et al. adopt a GAN-like discriminator to separate liveness features from the latent representations of the input faces for further classification. Liu et al. disentangle the spoof traces to reconstruct the live counterpart from the spoof faces and synthesize new spoof samples from the live ones. The synthesized spoof samples are further employed to train the generator. Finally, the intensity of the spoof traces is used for prediction.
Generative Model. Variational autoencoders (VAEs) and generative adversarial networks (GANs) are the most basic generative models. VAEs, which have promising capacity in latent representation, consist of a generative network (Decoder) and an inference network (Encoder). The decoder generates the visible variable given the latent variable , and the encoder maps the visible variable to the latent variable which approximates . In contrast, GANs employ a generator and a discriminator to implement a min-max mechanism. On one hand, the generator generates images to confuse the discriminator. On the other hand, the discriminator tries to distinguish between generated images and real images. Recently, several works have introduced the “X via generation” manner to facial sub-tasks, for instance, “recognition via generation” [89, 88, 21], “parsing via generation” and others [29, 9, 90, 8, 20, 44]. In this paper, we consider the interpretable factorized latent disentanglement of VAEs and explore “anti-spoofing via generation”.
Uncertainty in Deep Learning. In recent years, many works have begun to discuss the role uncertainty plays in deep learning from a theoretical perspective [2, 23, 34]. Meanwhile, uncertainty learning has been widely used in computer vision tasks to improve model robustness and interpretability, such as face analysis [35, 67, 11, 66], semantic segmentation [30, 33] and object detection [15, 39]. However, most methods focus on capturing the noise of the parameters by studying the model uncertainty. In contrast, Chang et al. apply data uncertainty learning to estimate the noise inherent in the training data. Specifically, they map each input image to a Gaussian distribution, and simultaneously learn the identity feature and the feature uncertainty. Inspired by Chang et al., we treat the minor partial distortion introduced during data generation as a kind of noise that makes precise depth values hard to predict, and introduce the depth uncertainty to capture such noise. To the best of our knowledge, this is the first work to utilize uncertainty learning in face anti-spoofing tasks.
Previous approaches rarely consider face anti-spoofing from the perspective of data, while the existing FAS datasets usually lack visual diversity due to the limited identities and insignificant variance. OULU-NPU and SiW only contain 20 and 90 identities in the training set, respectively. To mitigate the above issue and increase the intra- and inter-class diversity, we propose a Dual Spoof Disentanglement Generation (DSDG) framework to generate large-scale paired live and spoofing images without external data acquisition. In addition, we investigate the limitation of the generated images, and develop a Depth Uncertainty Learning (DUL) framework to make better use of them. In summary, our method aims to solve the following two problems: (1) how to generate diverse face data for the anti-spoofing task without external data acquisition, and (2) how to properly utilize the generated data to promote the training of face anti-spoofing models. Correspondingly, we first introduce DSDG in Sec. III-A, then describe DUL for face anti-spoofing in Sec. III-B.
III-A Dual Spoof Disentanglement Generation
Given the pairs of live and spoofing images from the limited identities, our goal is to train a generator which can generate diverse large-scale paired data from random noise. To achieve this goal, we propose a Dual Spoof Disentanglement VAE, which accomplishes two objectives: 1) preserving the spoofing-specific patterns in the generated spoofing images, and 2) guaranteeing the identity consistency of the generated paired images. The details of Dual Spoof Disentanglement VAE and the data generation process are depicted as follows.
III-A1 Dual Spoof Disentanglement VAE
The structure of the Dual Spoof Disentanglement VAE is illustrated in Fig. 2(a). It consists of two encoder networks, a decoder network and a spoof disentanglement branch. The encoders and are adopted to map the paired images to the corresponding distributions. Specifically, maps the live images to the identity distribution . maps the spoofing images to the spoofing pattern distribution and the identity distribution in the latent space. The processes can be formulated as:
where represents the posterior distribution. and denote the parameters of and , respectively.
To instantiate these processes, we follow the reparameterization routine in . Taking the encoder as an example, instead of directly obtaining the identity distribution , outputs the mean and the standard deviation of . Subsequently, the identity distribution can be obtained by: , where is random noise sampled from a standard Gaussian distribution. Similar processes are also conducted on and , respectively. After obtaining the , and , we concatenate them into a feature vector and feed it to the decoder to generate the reconstructed paired images and .
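The reparameterization step described above can be illustrated with a minimal sketch. It assumes the standard form in which a latent sample is the mean plus the standard deviation times Gaussian noise, applied element-wise; the function and variable names are hypothetical, and plain Python lists stand in for the tensors that a PyTorch implementation would use.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    mu and sigma are the per-dimension mean and standard deviation that
    an encoder would output (names are illustrative, not from the paper).
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu = [0.5, -1.0, 2.0]
sigma = [0.1, 0.0, 1.0]   # a zero standard deviation collapses to the mean
z = reparameterize(mu, sigma, rng)
```

Because the randomness enters only through the noise term, gradients can flow to the mean and the standard deviation, which is what makes the sampling step trainable.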
Spoof Disentanglement. Generally, the input spoofing images contain multiple spoof types (e.g. print and replay). To ensure that the generated spoofing image preserves the spoof information, it is crucial to disentangle the spoofing image into the spoofing pattern representation and the identity representation. Therefore, the encoder is designed to map the spoofing image into two distributions: and , i.e. the spoofing pattern representation and the identity representation in the latent space, respectively. To better learn the spoofing pattern representation , we adopt a classifier to predict the spoofing type by minimizing the following cross-entropy loss:
where represents a fully connected layer, and y is the label of the spoofing type.
In addition, an angular orthogonal constraint is adopted between and to guarantee that effectively disentangles the into the spoofing pattern representation and the identity representation. The angular orthogonal loss is formulated as:
where denotes inner product. By minimizing , and are constrained to be orthogonal, forcing the spoofing pattern representation and identity representation to be disentangled.
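As an illustration of an angular orthogonality penalty, the sketch below uses the absolute inner product of the two L2-normalized vectors. This is an assumed form chosen for illustration: whether the original loss takes an absolute value, a square, or the raw inner product cannot be recovered from the text here, and all names are hypothetical.

```python
import math

def angular_orthogonal_loss(z_pattern, z_identity, eps=1e-12):
    """|cos(angle)| between two latent vectors; 0 when they are orthogonal."""
    dot = sum(a * b for a, b in zip(z_pattern, z_identity))
    norm_p = math.sqrt(sum(a * a for a in z_pattern))
    norm_i = math.sqrt(sum(b * b for b in z_identity))
    # eps guards against division by zero for all-zero vectors
    return abs(dot) / max(norm_p * norm_i, eps)

loss_orthogonal = angular_orthogonal_loss([1.0, 0.0], [0.0, 3.0])
loss_parallel = angular_orthogonal_loss([1.0, 2.0], [2.0, 4.0])
```

Normalizing before taking the inner product makes the penalty depend only on the angle between the two representations, not on their magnitudes.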
Distribution Learning. We employ a VAE-like network to learn the joint distribution of the spoofing pattern representation and the identity representation. The posterior distributions , and are constrained by the Kullback-Leibler divergence:
Given the spoofing pattern representation , the identity representations and , the decoder network Dec aims to reconstruct the inputs and :
where denotes the parameters of the decoder network. In practice, we constrain the loss between the reconstructed images / and the original images / :
Distribution Alignment. Meanwhile, we align the identity distributions of and by a Maximum Mean Discrepancy loss to preserve the identity consistency in the latent space:
where denotes the dimension of and .
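A simple way to picture the distribution alignment is the linear-kernel form of Maximum Mean Discrepancy, which reduces to the squared distance between the batch means of the two identity representations. The sketch below uses that form and divides by the latent dimension; both the choice of kernel and the normalization are assumptions made for illustration, and the names are hypothetical.

```python
def mean_vector(batch):
    """Per-dimension mean of a batch given as a list of equal-length vectors."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]

def linear_mmd(z_spoof_identity, z_live_identity):
    """Squared distance between batch means, normalized by the dimension."""
    mean_s = mean_vector(z_spoof_identity)
    mean_l = mean_vector(z_live_identity)
    dim = len(mean_s)
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_l)) / dim

aligned = linear_mmd([[0.0, 1.0], [2.0, 3.0]], [[0.0, 1.0], [2.0, 3.0]])
shifted = linear_mmd([[0.0, 0.0], [2.0, 2.0]], [[3.0, 1.0], [3.0, 1.0]])
```

The value is zero when both batches share the same mean and grows as the two identity distributions drift apart in the latent space.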
Identity Constraint. To further preserve the identity consistency, an identity constraint is also employed in the image space. Similar to the previous work [21, 22], we leverage a pre-trained LightCNN  as the identity feature extractor and deploy a feature distance loss to constrain the identity consistency of the reconstructed paired images:
Overall Loss. The total objective function is a weighted sum of the above losses, defined as:
where , , and are the trade-off parameters.
III-A2 Data Generation
The procedure of generating paired live and spoofing images is shown in Fig. 2(b). Similar to the reconstruction process in Sec. III-A1, we first obtain the and by sampling from the standard Gaussian distribution . In order to keep the identity consistency of the generated paired data, the identity representation is not sampled but copied from . Then, we concatenate , and as a joint representation, and feed it to the trained decoder Dec to obtain the new paired live and spoofing images.
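The sampling procedure above can be sketched as follows. The sketch assumes that the spoofing pattern code and one identity code are drawn from a standard Gaussian and that the second identity code is an exact copy of the first; the concatenation order, the latent dimension and all names are hypothetical, and the trained decoder is not modeled.

```python
import random

def sample_joint_latent(dim, rng):
    """Build the joint latent vector that would be fed to a trained decoder."""
    z_spoof_pattern = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z_spoof_identity = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Identity consistency: the live identity code is copied, not resampled.
    z_live_identity = list(z_spoof_identity)
    return z_spoof_pattern + z_spoof_identity + z_live_identity

dim = 4
joint = sample_joint_latent(dim, random.Random(0))
```

Repeating the call with fresh noise yields a new identity each time, while the copied identity code keeps the live and spoofing images of each pair consistent.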
It is worth mentioning that DSDG can generate large-scale paired live and spoofing images through independently repeated sampling. Some generation results of DSDG are shown in Fig. 4, where the generated paired images contain identities that do not exist in the real data. In addition, the spoofing images successfully preserve the spoofing patterns from the real data. Thus, DSDG can effectively increase the diversity of the original training data. However, it is inevitable that a portion of the generated samples have some appearance distortions in the image space, such as blur, due to the inherent limitation of VAEs. When such samples are directly absorbed into the training data, the distortion may harm the training of the anti-spoofing model. To handle this issue, we propose Depth Uncertainty Learning to reduce the negative impact of such noisy samples during training.
III-B Depth Uncertainty Learning for FAS
The facial depth map, a representative supervision signal that reflects the facial depth information with respect to different face areas, has been widely applied in face anti-spoofing methods [53, 80, 71]. Typically, a depth feature extractor is used to learn the mapping from the input RGB image to the output depth map , where M and N are times and , respectively. The ground truth depth map of the live face is obtained by an off-the-shelf 3D face reconstruction network, such as PRNet. For the spoofing face, we directly set the depth map to zeros following . Each depth value of the output depth map corresponds to the image patch of the input , where and .
Depth Uncertainty Representation. As mentioned in Sec. III-A2, a few generated images may suffer from partial facial distortion, and it is difficult to predict a precise depth value for the corresponding face area. When these generated images are directly absorbed into the training set, such distortion will obstruct the training of the anti-spoofing model. We introduce depth uncertainty learning to solve this issue. Specifically, for each image patch of the input X, we no longer predict a fixed depth value in the training stage, but learn a depth distribution in the latent space, which is defined as a Gaussian distribution:
where is the mean that denotes the expected depth value of the image patch , and is the standard deviation that reflects the uncertainty of the predicted . During training, the final depth value is no longer deterministic, but stochastically sampled from the depth distribution .
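[Architecture table (columns: Layer, Output, DepthNet, CDCN): the Stem has 64 channels for both backbones; a concat(Low, Mid, High) layer with 384 channels follows. The remaining rows are not recoverable.]
[Architecture table (columns: Layer, Output, ResNet, MobileNetV2): the Stem has 64 channels for ResNet and 32 channels for MobileNetV2. The remaining rows are not recoverable.]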
Depth Uncertainty Module. We employ a Depth Uncertainty Module (DUM) to estimate the mean and the standard deviation simultaneously, which is presented in Fig. 3. DUM is placed behind the depth feature extractor and includes two independent convolutional layers. One predicts the , and the other predicts the . Both are parameters of a Gaussian distribution. However, the sampling operation is not differentiable during gradient backpropagation if we directly sample from the Gaussian distribution. We follow the reparameterization routine in to make the process learnable, and the depth value can be obtained as:
In addition, same as , we adopt a Kullback-Leibler divergence as the regularization term to constrain to be close to a constructed distribution :
where is the ground truth depth value of the patch .
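The per-patch sampling and the regularization term can be illustrated with a scalar sketch. It assumes that the sampled depth is the mean plus the standard deviation times standard Gaussian noise, and that the constructed target distribution is a Gaussian centered at the ground truth depth with unit variance; the unit variance and all names are assumptions made for illustration.

```python
import math
import random

def sample_depth(mu, sigma, rng):
    """Training-time depth value: d = mu + sigma * eps, eps ~ N(0, 1)."""
    return mu + sigma * rng.gauss(0.0, 1.0)

def kl_to_target(mu, sigma, target):
    """Closed-form KL( N(mu, sigma^2) || N(target, 1) ) for one patch."""
    return 0.5 * (sigma ** 2 + (mu - target) ** 2 - 1.0) - math.log(sigma)

rng = random.Random(0)
train_depth = sample_depth(0.6, 0.2, rng)   # stochastic during training
test_depth = 0.6                            # inference keeps only the mean
kl_match = kl_to_target(0.3, 1.0, 0.3)
kl_offset = kl_to_target(0.8, 1.0, 0.3)
```

The divergence vanishes only when the predicted distribution coincides with the target, so it pulls the predicted mean toward the ground truth depth and keeps the predicted standard deviation from collapsing to zero.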
Training for FAS. When training the FAS model, each input image with a size of is first fed into a CNN-based depth feature extractor to obtain the depth feature map with the size of . Specifically, in experiments, we use modified ResNet , MobileNetV2 , DepthNet  and CDCN  as the depth feature extractor to evaluate the universality of our method. Then, DUM is employed to transform the depth feature map to the predicted depth value . The architecture details are listed in Tab. I and Tab. II. Finally, the mean square error loss is utilized as the pixel-wise depth supervision constraint.
The total objective function is:
where and represent the losses of the real data, and and are the losses of the generated data. and are the trade-off parameters. The former is adopted to control the regularization term, and the latter controls the proportion of effects caused by the generated data during backpropagation. Besides, in order to properly utilize the generated data to promote the training of FAS model, we construct each training batch by a combination of the original and the generated data with a ratio of .
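The weighting of the four loss terms and the batch composition can be sketched as below. The grouping is an assumption inferred from the description: the regularization weight scales each Kullback-Leibler term, and the second weight scales the combined loss of the generated data. The default values are the ones reported in the implementation details, and the names are hypothetical.

```python
def total_loss(mse_real, kl_real, mse_gen, kl_gen,
               lambda_kl=1e-3, lambda_g=1e-1):
    """Combine real-data and generated-data losses (assumed grouping)."""
    real_term = mse_real + lambda_kl * kl_real
    generated_term = mse_gen + lambda_kl * kl_gen
    return real_term + lambda_g * generated_term

def split_batch(batch_size, ratio):
    """Number of original and generated samples in one training batch.

    ratio is the fraction of original data: 1 means original data only,
    0 means generated data only.
    """
    n_original = round(batch_size * ratio)
    return n_original, batch_size - n_original

loss = total_loss(1.0, 2.0, 3.0, 4.0, lambda_kl=0.5, lambda_g=0.1)
n_original, n_generated = split_batch(64, 0.75)
```

With a batch size of 64 (an illustrative value) and a ratio of 0.75, each batch would hold 48 original and 16 generated samples.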
IV-A Datasets and Metrics
Datasets. Five FAS benchmarks are adopted for experiments, including OULU-NPU, SiW, SiW-M, CASIA-MFSD and Replay-Attack. OULU-NPU consists of 4,950 genuine and attack videos from 55 subjects. The attack videos contain two print attacks and two video replay attacks. There are four protocols to validate the generalization ability of models across different attacks, scenarios and video recorders. SiW contains 165 subjects with two print attacks and four video replay attacks. Three protocols evaluate the generalization ability with different poses and expressions, across mediums and on unknown attack types. SiW-M includes 13 attack types (e.g. print, replay, 3D mask, makeup and partial attacks) and more identities, and is usually employed for evaluating diverse spoof attacks. CASIA-MFSD and Replay-Attack are small-scale datasets containing low-resolution videos with photo and video attacks. Specifically, the high-resolution datasets OULU-NPU and SiW are utilized to evaluate the intra-testing performance. For inter-testing, we conduct cross-type testing on the SiW-M dataset and cross-dataset testing between CASIA-MFSD and Replay-Attack.
Metrics. Our method is evaluated by the following metrics: Attack Presentation Classification Error Rate (APCER), Bona Fide Presentation Classification Error Rate (BPCER) and Average Classification Error Rate (ACER):
where TP, TN, FP and FN denote True Positive, True Negative, False Positive and False Negative, respectively. In addition, following , Equal Error Rate (EER) is employed for cross-type evaluation, and Half Total Error Rate (HTER) is adopted in cross-dataset testing.
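For reference, a commonly used confusion-matrix formulation of these metrics is sketched below. It assumes that the positive class is the bona fide (live) presentation, so that a false positive is an attack accepted as live; this convention and the function name are assumptions, and the sketch ignores the per-attack-type maximum that some evaluation protocols apply to APCER.

```python
def fas_error_rates(tp, tn, fp, fn):
    """Return (APCER, BPCER, ACER) from confusion-matrix counts.

    Assumes positive = bona fide, negative = attack.
    """
    apcer = fp / (tn + fp)   # share of attacks accepted as bona fide
    bpcer = fn / (fn + tp)   # share of bona fide samples rejected
    acer = (apcer + bpcer) / 2.0
    return apcer, bpcer, acer

apcer, bpcer, acer = fas_error_rates(tp=90, tn=95, fp=5, fn=10)
```

In this example 5 of 100 attacks are accepted and 10 of 100 bona fide samples are rejected, giving an APCER of 5%, a BPCER of 10% and an ACER of 7.5%.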
IV-B Implementation Details
We implement our method in PyTorch. In the data generation phase, we use the same encoder and decoder networks as . The learning rate is set to 2e-4 with the Adam optimizer, and , , and in Eq. 11 are empirically fixed to 50, 5, 1 and 10, respectively. During training, we comply with the principle of not applying additional data. In the face anti-spoofing phase, following [80, 71], we employ PRNet to obtain the depth map of the live face with a size of 32x32 and normalize the values within [0,1]. For the spoofing face, we set the depth map to zeros with the same size of 32x32. During training, we adopt the Adam optimizer with an initial learning rate of 1e-3. The trade-off parameters and in Eq. 15 are empirically set to 1e-3 and 1e-1, respectively, and the ratio is set to 0.75. We generate 20,000 image pairs by DSDG as the extra training data. During inference, we follow the routine of CDCN to obtain the final score by averaging the predicted values in the depth map.
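The inference-time scoring can be sketched in a few lines: the predicted mean depth map is averaged into a single score, and a higher score suggests a live face because spoofing faces are supervised with all-zero depth maps. The nested-list representation and the function name are hypothetical, and any decision threshold would have to be chosen on a development set.

```python
def liveness_score(depth_mean_map):
    """Average of all predicted mean depth values in a 2-D map."""
    values = [value for row in depth_mean_map for value in row]
    return sum(values) / len(values)

spoof_like = liveness_score([[0.0] * 32 for _ in range(32)])
mixed = liveness_score([[1.0, 1.0], [0.0, 0.0]])
```

Only the mean branch of the uncertainty module is used here, matching the description that the standard deviation plays no role at inference.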
|1||ResNet ||6.8||MobileNetV2 ||5.0||DepthNet ||1.6||CDCN ||1.0|
IV-C Ablation Study
Influence of the identity number. Tab. III presents the influence of different identity numbers on the model performance. We choose 5, 10, 15 and 20 face identities from OULU-NPU P1 as the training data. As the number of identities increases, the performance of the model improves. Besides, as shown in Tab. III, our method obtains better performance than the raw CDCN in each setting, especially when the number of identities is small. We argue that more face identities cover more facial features and facial structures, which can significantly improve the model generalization ability.
Influence of the ratio between the original data and the generated data in a batch. The hyper-parameter controls the ratio of the original data over the generated data in each training batch. Specifically, =1 or =0 represents only using the original data or the generated data, respectively. We generate 20,000 image pairs and vary the ratio on OULU-NPU P1. As shown in Tab. IV, with a proper value of the ratio (=0.5 or =0.75), the model achieves better performance than only using the original data (=1), indicating that the generated data can promote the training process. When is equal to 0.75, the model obtains the best result. Then, we fix the to 0.75 and gradually increase the number of generated image pairs from 10,000 to 30,000 in steps of 5,000. As shown in Tab. VI, the ACER is 0.7, 0.6, 0.3, 0.3 and 0.3, respectively. Thus, unless otherwise specified, we fix the to 0.75 and generate 20,000 image pairs for all experiments.
The results of cross-dataset testing between CASIA-MFSD and Replay-Attack. The evaluation metric is HTER (%).
Quantitative analysis of the identity learning. The , and are utilized to effectively disentangle the spoof images into the spoofing pattern representation and the identity representation. In order to investigate the effectiveness of each loss, when training DSDG we discard , and in Eq. 11 in turn, and evaluate the performance on OULU-NPU P1. As shown in Tab. VII, the performance drops significantly if any one of the losses is discarded, which further demonstrates the importance of each identity loss.
Sensitivity analysis of and . Tab. VIII shows the sensitivity analysis for the hyper-parameters and in Eq. 15, where controls the proportion of effects caused by the generated data in backpropagation and is the trade-off parameter of the Kullback-Leibler divergence constraint. Specifically, when setting and to 0.1 and 1e-3, respectively, the model achieves the best ACER. In this situation, we find that all loss values fall into a similar magnitude. Besides, most of the results outperform CDCN, whose ACER is 1.0, indicating that our method is not sensitive to these trade-off parameters over a large range.
Effectiveness of DSDG and DUM. Tab. V presents the ablation study of the different components in our method. “G” and “U” are short for DSDG and DUM, respectively. Modified ResNet, MobileNetV2, DepthNet and CDCN are employed as the baselines. We conduct the ablation study on all four protocols of OULU-NPU. The comparison results show that combining DSDG and DUM helps the models obtain better generalization ability, especially on the challenging Protocol 4. Moreover, both DSDG and DUM are universal methods that can be integrated into different network backbones. Meanwhile, we also find that the performance degrades on a few protocols when only DSDG is adopted. We attribute this to the presence of noisy samples in the generated data, which is nevertheless solved by DUM. In other words, depth uncertainty learning can indeed handle the adverse effects of partial distortion in the generated images, and even brings significant improvement in the case of only using the original data.
|Method||Metrics()||Replay||Mask Attacks||Makeup Attacks||Partial Attacks||Average|
|Half||Silic.||Trans.||Paper.||Manne.||Ob.||Imp.||Cos.||Funny Eye||Paper Gls.||Part. Paper|
Influence of different numbers of unknown spoof types. The spoof type label is used for better disentangling the specific spoof pattern during the training of DSDG. In some cases, the spoof types of images may be unavailable. We design an experiment to explore the influence of different numbers of unknown spoof types. Specifically, there are four spoof types in OULU-NPU P1. We assume that some of the spoof type labels are unknown, and set them as a class of “unknown” to train DSDG. We gradually increase the number of unknown spoof types from 0 to 4, where 4 means that all the spoof type labels are unavailable. The results are shown in Tab. IX. We can observe that ACER gets worse with more unknown spoof types, but in the worst case (Number = 4), our method still obtains 0.7 in ACER, which outperforms the previous best 0.8 in ACER and beats CDCN by 0.3 in ACER. Thus, our method can still improve the performance without fine-grained spoof type labels.
IV-D Intra Testing
We conduct the intra-testing on the OULU-NPU and SiW datasets. In order to ensure the fairness of the comparison, we split the data used for DSDG training according to the protocols of each dataset (e.g. OULU-NPU and SiW have 14 and 7 sub-protocols, respectively), and employ CDCN as the depth feature extractor, which is orthogonal to our contribution.
Results on OULU-NPU. Tab. X shows the comparisons on OULU-NPU. Our proposed method obtains the best results on all the protocols. It is worth mentioning that the performance is significantly improved by our method on the most challenging Protocol 4 which focuses on the evaluation across unseen environmental conditions, attacks and input sensors. That means our method is able to obtain a more generic model by adopting a more diverse training set with depth uncertainty learning.
Results on SiW. Tab. XI presents the results on SiW, where our method is compared with other state-of-the-art methods on three protocols. It can be seen that our method achieves the best performance on the first two protocols and a competitive performance on Protocol 3. Note that we obtain a non-ideal result (4.34%, 4.35%, 4.34% for APCER, BPCER, ACER, respectively) when reproducing CDCN on Protocol 3. Thus, our method still has the capacity to improve the generalization ability when encountering unknown presentation attacks.
IV-E Inter Testing
Cross-type Testing. SiW-M contains more diverse presentation attacks, and is more suitable for evaluating the generalization ability to unknown attacks. As shown in Tab. XIII, comparisons with seven state-of-the-art methods on leave-one-out protocols are conducted. Note that, unlike other datasets, the paired live and spoofing images are not accessible on SiW-M. Thus, we discard and in Eq. 11 when utilizing DSDG to generate data. In this case, our method still achieves the best average ACER (10.6%) and EER (9.5%), outperforming all the previous methods, as well as the best EER and ACER against most of the 13 attacks. In particular, our method yields a significant improvement (3.0% ACER and 3.6% EER) compared with CDCN, benefiting from the advantages of DSDG and DUM.
Cross-dataset Testing. In this experiment, CASIA-MFSD and Replay-Attack are used for cross-dataset testing. We first perform the training on CASIA-MFSD and test on Replay-Attack. As shown in Tab. XII, our method achieves competitive results and is better than CDCN. Then, we switch the datasets for the other trial, and obtain a significant improvement (5.9% HTER) compared with CDCN. Note that FAS-SGTD performs anti-spoofing at the video level, and the result of our method (26.7% HTER) is the best among all frame-level methods.
IV-F Analysis and Visualization
Visualization of Generated Images. We visualize some images generated by DSDG on OULU-NPU Protocol 1 in Fig. 4(b). The generated paired images provide the identity diversity that the original data lacks. Moreover, DSDG successfully disentangles the spoofing patterns from the real spoofing images and preserves them in the generated spoofing images. We also provide the generated results on SiW-M in Fig. 5. SiW-M contains more diverse presentation attacks (e.g. Replay, Print, 3D Mask, Makeup Attacks and Partial Attacks) and fewer identities for some presentation attacks. Even so, DSDG still generates spoofing images with diverse identities that retain the original spoofing patterns.
Visualization of Disentanglement on OULU-NPU. To better present the disentanglement ability of DSDG, we fix the identity representation, sample diverse spoofing pattern representations, and generate the corresponding images. Some generated results are presented in Fig. 6, where the odd rows are live images with the same identity and the even rows contain spoofing images with various spoof types. The results show that DSDG disentangles the spoofing pattern representation from the identity representation. Furthermore, we use t-SNE to visualize the spoofing pattern representations obtained by feeding the test spoofing images to the encoder. First, we discard the classifier and the orthogonal loss when training the generator, and present the distributions of the spoofing pattern representations in Fig. 7(a). We can observe that the distributions are mixed together. The corresponding distributions of the well-trained DSDG are shown in Fig. 7(b). In this case, the representations form four well-separated clusters that correspond to the four spoof types.
Visualization of Standard Deviation. Fig. 8 shows the visualization of the standard deviations predicted by DUM, where the red area indicates high standard deviation. The 1st and 3rd rows are images with good quality whose standard deviations are relatively consistent. The 5th row contains some noisy samples with distorted regions indicated in red boxes. Obviously, the distorted regions have higher standard deviations, as shown in the 6th row. Hence, DUM provides the ability to estimate the reliability of the generated images.
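As a rough illustration of how such a standard deviation map could be summarized into a per-image reliability indicator, the sketch below computes the mean standard deviation and the fraction of pixels whose standard deviation exceeds a threshold. This post-hoc summary is an assumption made for illustration and is not described as part of DUM; the function name and the threshold `high_std` are hypothetical.

```python
def uncertainty_summary(std_map, high_std=0.5):
    """Summarize a 2-D map of predicted per-pixel standard deviations.

    std_map: list of rows, each a list of non-negative floats.
    high_std: hypothetical cut-off above which a pixel counts as unreliable.
    Returns (mean standard deviation, fraction of pixels above high_std).
    """
    values = [v for row in std_map for v in row]
    if not values:
        raise ValueError("std_map is empty")
    mean_std = sum(values) / len(values)
    high_fraction = sum(v > high_std for v in values) / len(values)
    return mean_std, high_fraction


# A consistent map (good-quality image) versus a map with a distorted region.
clean_map = [[0.1, 0.1], [0.1, 0.1]]
noisy_map = [[0.1, 0.9], [0.1, 0.9]]
clean_mean, clean_frac = uncertainty_summary(clean_map)
noisy_mean, noisy_frac = uncertainty_summary(noisy_map)
```

In this toy example the map with the distorted region yields both a higher mean and a larger high-uncertainty fraction, mirroring the red regions in the figure.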
Visualization Comparison with CDCN. In Fig. 9, we present some hard samples on OULU-NPU Protocol 4, which evaluates generalization across unseen environmental conditions, attacks and input sensors. Some ambiguous samples are difficult for CDCN but are predicted correctly by our method, further demonstrating the effectiveness and the generalization ability of the proposed method. We also visualize the corresponding standard deviations of the raw CDCN with DUM to explore why DUM can boost the performance of the raw CDCN. It can be observed that some reflective areas have relatively larger standard deviations. The edge areas of the face and the background also have larger standard deviations. The precise depth of these areas is relatively difficult to predict. However, with DUM the model can estimate the depth uncertainty of these ambiguous areas and predict more robust results. This is why adopting DUM alone can also improve the performance of the model.
Considering that existing FAS datasets are insufficient in subject numbers and variances, which limits the generalization ability of FAS models, we propose in this paper a novel Dual Spoof Disentanglement Generation (DSDG) framework that contains a VAE-based generator. DSDG learns a joint distribution of the identity representation and the spoofing patterns in the latent space, and is thus able to preserve the spoofing-specific patterns in the generated spoofing images and guarantee the identity consistency of the generated paired images. With the help of DSDG, large-scale and diverse paired live and spoofing images can be generated from random noise without external data acquisition. The generated images retain the original spoofing patterns but contain new identities that do not exist in the real data. We utilize the generated image set to enrich the diversity of the training set and further promote the training of FAS models. However, due to the inherent limitations of VAEs, a portion of the generated images have partial distortions, for which precise depth values are difficult to predict, degrading the widely used depth-supervised optimization. Thus, we introduce the Depth Uncertainty Learning (DUL) framework and design the Depth Uncertainty Module (DUM) to alleviate the adverse effects of noisy samples by estimating the reliability of the predicted depth map. It is worth mentioning that DUM is a lightweight module that can be flexibly integrated with any depth-supervised training. Finally, we carry out extensive experiments and comparisons with state-of-the-art FAS methods on multiple datasets. The results demonstrate the effectiveness and the universality of our method. We also present extensive analyses and visualizations, showing the generation ability of DSDG and the effectiveness of DUM.
This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.