Multi-modal face anti-spoofing has also attracted an increasing number of researchers over the last two years. Some fusion methods [41, 25] have been published, but they restrict the interactions among different modalities, since the modalities are processed independently before fusion. This makes it difficult for the modalities to effectively exploit their relatedness from the beginning of the network to its end in order to boost overall performance. In this paper, we propose a partially shared branch multi-modal network (PSMM-Net) that allows exchanges and interactions among different modalities, aiming to capture correlated and complementary features.
Table 1: Comparison of existing face anti-spoofing datasets (columns: Dataset, Year, # subjects, # videos, Attacks, Modalities, Devices, Ethnicities). Total: 1,607 subjects, 23,538 videos.
Furthermore, data plays a key role in face anti-spoofing tasks. In existing face anti-spoofing datasets, such as CASIA-FASD , Replay-Attack , OULU-NPU , and SiW , the number of samples is relatively small, and most of them contain only the RGB modality. The recently released CASIA-SURF  includes 1,000 subjects and the RGB, Depth and IR modalities. Although it provides a larger dataset than the existing alternatives, it suffers from limited attack types (2D print attack) and a single ethnicity (Chinese people). Overall, the effect of cross-ethnicity on face anti-spoofing has received little attention in previous works. Therefore, we introduce the CASIA-CeFA dataset, the largest dataset in terms of subjects (see Table 1). In CASIA-CeFA, the attack types are diverse, including print attacks on cloth, video-replay attacks, 3D print and silica gel attacks. More importantly, it is the first public dataset designed for exploring the impact of cross-ethnicity in the study of face anti-spoofing. Some samples of the CASIA-CeFA dataset are shown in Fig. 1.
To sum up, the contributions of this paper are as follows: (1) We propose SD-Net to learn both static and dynamic features for a single modality; it is the first work incorporating dynamic images into face anti-spoofing. (2) We propose PSMM-Net to learn complementary information from multi-modal data in videos. (3) We release the CASIA-CeFA dataset, which includes multiple ethnicities, a large number of subjects and diverse 2D/3D attack types. (4) Extensive experiments with the proposed method on CASIA-CeFA and three other public datasets verify its high generalization capability.
2 Related Work
together with traditional classifiers, such as SVM or LDA, to perform binary anti-spoofing predictions. However, those methods lack good generalization capability when testing conditions vary, e.g., in lighting and background. Owing to the success of deep learning strategies over handcrafted alternatives in computer vision, some works [11, 19, 26] extended feature vectors with features from CNNs for face anti-spoofing. The authors of [2, 20] presented a two-stream network using RGB and Depth images as input. The work of  proposes a deep tree network to model spoofs by hierarchically learning sub-groups of features. However, these methods do not consider any kind of temporal information for face anti-spoofing.
Temporal-based Methods. In order to improve robustness in real applications, some temporal-based methods [24, 26] have been proposed, which require constrained, human-guided interaction, such as movements of the eyes, lips, and head. However, those methods do not provide a natural, user-friendly interaction. Even more importantly, these methods [24, 26] can become vulnerable when someone presents a replay attack, or a print photo attack with cut-out eye/mouth regions. Given that remote photoplethysmography (rPPG) signals (i.e., heart pulse signals) can be detected from real faces but not from spoofs, Liu et al.  proposed a CNN-RNN model to estimate rPPG signals with sequence-wise supervision and face depth with pixel-wise supervision. The estimated depth and rPPG are fused to distinguish real from fake faces. Feng et al.  distinguished between real and fake samples based on the difference between image quality and optical flow information. Yang et al.  proposed a spatio-temporal attention mechanism to fuse global temporal and local spatial information. All previous methods rely on a single visual modality, and no work considers the effect of cross-ethnicity on anti-spoofing.
Multi-modal Fusion Methods. Zhang et al.  proposed a fusion network with 3 streams using ResNet-18 as the backbone, where each stream is used to extract low-level features from RGB, Depth and IR data, respectively. These features are then concatenated and passed to the last two residual blocks. Similar to , Aleksandr et al.  used a fusion network with 3 streams, with ResNet-34 as the backbone and multi-scale feature fusion at all residual blocks. Tao et al.  proposed a multi-stream CNN architecture called FaceBagNet, which uses patch-level images as input and a modality feature erasing (MFE) operation to prevent overfitting and obtain more discriminative fused features. All previous methods consider only the concatenation of features from multiple modalities as the key fusion component. Unlike [41, 25, 29], we propose PSMM-Net, in which three modality-specific networks and one shared network are connected through a partially shared structure to learn discriminative fused features for face anti-spoofing.
Table 1 lists existing face anti-spoofing datasets. One can see that before 2019 the maximum number of available subjects was 165, on the SiW dataset . That clearly limited the generalization ability of new approaches under cross-dataset evaluation. Most datasets contain only RGB data, such as Replay-Attack , CASIA-FASD , SiW  and OULU-NPU . Recently, CASIA-SURF  was released, including 1,000 subjects and three modalities, namely RGB, Depth and IR. Although this alleviated the lack of data, it is limited in terms of attack types (only 2D print attacks) and includes only one ethnicity (Chinese people). As shown in Table 1, most datasets do not provide ethnicity information, except SiW and CASIA-SURF. Although the SiW dataset provides four ethnicities, it still does not consider the effect of cross-ethnicity on face anti-spoofing. This limitation also holds for the CASIA-SURF dataset.
3 Proposed Method
3.1 SD-Net for Single-modal
Single-modal Dynamic Image Construction.
Rank pooling [13, 33] defines a rank function that encodes a video into a feature vector. The learning process can be cast as a convex optimization problem using the RankSVM  formulation in Eq. 1. Let an RGB (Depth or IR) video sequence with  frames be represented as , and let  denote the average of the RGB (Depth or IR) features over time up to the -th frame. The process is formulated below.
where is the slack variable, and .
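For reference, a sketch of what Eq. 1 plausibly looks like, following the standard RankSVM-based rank-pooling objective of [13] (the symbols $u$, $V_t$, $S$, $T$ and $\lambda$ are our own notation, not necessarily the paper's):

```latex
% RankSVM rank-pooling objective (notation ours): V_t is the time
% average of frame features up to frame t, S(t|u) = <u, V_t> is the
% ranking score, and the hinge term max(0, .) plays the role of the
% slack variables penalizing pairs that violate S(t|u) > S(q|u) for t > q.
u^{*} \;=\; \arg\min_{u}\; \frac{\lambda}{2}\lVert u\rVert^{2}
  \;+\; \frac{2}{T(T-1)} \sum_{t > q} \max\bigl(0,\; 1 - S(t \mid u) + S(q \mid u)\bigr)
```

The learned parameter vector $u^{*}$ ranks later frames higher than earlier ones and thus summarizes the temporal evolution of the sequence.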
By optimizing Eq. 1, we map a sequence of frames to a single vector . In this paper, rank pooling is applied directly on the pixels of the RGB (Depth or IR) frames, so the dynamic image has the same size as the input frames. In our case, given an input frame, we compute its dynamic image online with rank pooling over consecutive frames. Our choice of dynamic images computed by rank pooling in SD-Net is further motivated by the fact that dynamic images have proved superior to regular optical flow [32, 13].
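As an illustration of how such a dynamic image can be computed online, the sketch below uses the closed-form approximate rank pooling of Bilen et al. (a weighted sum of frames via harmonic-number coefficients) rather than solving the RankSVM objective exactly; the function name and frame shapes are our own assumptions, not the paper's implementation.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling (Bilen et al.): collapse T frames into a
    single dynamic image as a weighted sum, where the weight of frame t is
    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), with H_t the t-th
    harmonic number and H_0 = 0."""
    T = len(frames)
    H = np.cumsum(1.0 / np.arange(1, T + 1))  # harmonic numbers H_1..H_T
    alphas = np.array([
        2 * (T - t + 1) - (T + 1) * (H[-1] - (H[t - 2] if t > 1 else 0.0))
        for t in range(1, T + 1)
    ])
    # Contract the frame axis: result has the same spatial size as one frame.
    return np.tensordot(alphas, np.stack(frames), axes=1)

# Toy usage: 8 consecutive RGB frames -> one dynamic image of the same size.
frames = [np.random.rand(112, 112, 3) for _ in range(8)]
di = dynamic_image(frames)
assert di.shape == (112, 112, 3)
```

For two frames the weights reduce to (-0.5, 0.5), i.e., the dynamic image is proportional to the frame difference, which matches the intuition that it encodes temporal evolution.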
As shown in Fig. 2, taking the RGB modality as an example, we propose SD-Net to learn hybrid features from static and dynamic images. It contains three branches: a static, a dynamic and a static-dynamic branch, which learn complementary features. The network takes ResNet-18  as the backbone. The static and dynamic branches each consist of five blocks (i.e., conv, res1, res2, res3, res4) and a Global Average Pooling (GAP) layer, while in the static-dynamic branch the conv and res1 blocks are removed, because it takes the fused res1 features from the static and dynamic branches as input.
For consistency of terminology in the rest of the paper, we divide the residual blocks of the network into a set of modules according to feature level, where  indicates the modality and  represents the feature level. Except for the first module , each module extracts static, dynamic and static-dynamic features using a residual block, denoted , and , respectively. The output features of each module are used as the input to the next module. The static-dynamic features of the first module are obtained by directly summing  and .
In order to ensure that each branch learns independent features, each branch employs an independent loss function after the GAP layer. In addition, a loss based on the summed features of all three branches is employed. Binary cross-entropy is used as the loss function. All branches are jointly and concurrently optimized to capture discriminative and complementary features for face anti-spoofing in image sequences. The overall objective function of SD-Net for modality  is defined as:
where , , and  are the losses of the static branch, dynamic branch, static-dynamic branch, and the summed features of all three branches, respectively.
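In symbols, the description above suggests an objective of the following form (the superscript $m$ and the subscript labels $s$, $d$, $sd$, $fuse$ are our own; each term is a binary cross-entropy loss):

```latex
% Sketch of the SD-Net objective for modality m: one loss per branch
% (static, dynamic, static-dynamic) plus a loss on the summed features
% of all three branches.
\mathcal{L}^{m} \;=\; \mathcal{L}^{m}_{s} \;+\; \mathcal{L}^{m}_{d}
  \;+\; \mathcal{L}^{m}_{sd} \;+\; \mathcal{L}^{m}_{fuse}
```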
3.2 PSMM-Net for Multi-modal Fusion
The architecture of the proposed PSMM-Net is shown in Fig. 3. It consists of two main parts: a) a modality-specific part, which contains three SD-Nets that learn features from the RGB, Depth and IR modalities, respectively; and b) a shared branch for all modalities, which aims to learn complementary features across modalities.
For the shared branch, we adopt ResNet-18, removing the first conv layer and the res1 block. In order to capture correlated and complementary semantics among modalities, we design mechanisms for information exchange and interaction between the SD-Nets and the shared branch. This is done in two ways: a) forward feeding of fused SD-Net features to the shared branch, and b) backward feeding of shared-branch module outputs to the SD-Net block inputs.
Forward Feeding. We fuse the static and dynamic SD-Net features of all modality branches and feed them as input to the corresponding shared block. The fusion process at feature level  can be formulated as:
In the shared branch,  denotes the input to the block and  denotes its output. Note that the first residual block is removed from the shared branch, so  equals zero.
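A plausible form of this forward-feeding fusion at feature level $k$ (all symbols here are our own notation):

```latex
% Forward feeding at level k: sum the static (s) and dynamic (d)
% features of all modalities and add the previous shared-branch output,
% which is zero before the first shared block (res1 is removed).
\hat{F}^{\mathrm{share}}_{k} \;=\;
  \sum_{m \in \{\mathrm{rgb},\,\mathrm{d},\,\mathrm{ir}\}}
  \left( F^{m}_{k,s} + F^{m}_{k,d} \right)
  \;+\; \bar{F}^{\mathrm{share}}_{k-1}
```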
Backward Feeding. The shared features are delivered back to the SD-Nets of the different modalities: the static and dynamic features are fused with  by addition. This can be denoted as:
where ranges from 2 to 3.
After feature fusion,  and  become the new static and dynamic features, which are then fed to the next module . Note that the exchange and interaction between the SD-Nets and the shared branch are performed only for static and dynamic features. This avoids the hybrid static-dynamic features being disturbed by multi-modal semantics.
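Since both feeding directions are element-wise summations, they can be sketched in a few lines of NumPy (function names and the toy shapes are our own illustration, not the paper's code):

```python
import numpy as np

def forward_feed(static_feats, dynamic_feats, shared_prev):
    """Forward feeding: fuse the static + dynamic features of every
    modality by element-wise summation and add the previous shared-block
    output (zero before the first shared block, since res1 is removed)."""
    fused = sum(s + d for s, d in zip(static_feats, dynamic_feats))
    return fused + shared_prev

def backward_feed(static, dynamic, shared_out):
    """Backward feeding: deliver the shared features back to one modality
    by adding them to its static and dynamic features."""
    return static + shared_out, dynamic + shared_out

# Toy example with three "modalities" at one feature level.
feats_s = [np.ones((2, 2)) for _ in range(3)]
feats_d = [np.ones((2, 2)) for _ in range(3)]
shared = forward_feed(feats_s, feats_d, np.zeros((2, 2)))
new_s, new_d = backward_feed(feats_s[0], feats_d[0], shared)
```

Summation (rather than concatenation) keeps all feature dimensions fixed, which is consistent with the paper's remark that element-wise summation is used at fusion points to prevent dimension explosion.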
Loss Optimization. Two main kinds of losses are employed to guide the training of PSMM-Net. The first corresponds to the losses of the three SD-Nets, i.e., the color (RGB), Depth and IR modalities, denoted , and , respectively. The second is the loss that guides the training of the entire network, denoted , which is based on the summed features of all SD-Nets and the shared branch. The overall loss of PSMM-Net is:
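Written out under that description (the subscript names are our own labels), the overall loss plausibly takes the form:

```latex
% Sketch of the PSMM-Net loss: the whole-network loss on the summed
% features plus the three per-modality SD-Net losses.
\mathcal{L} \;=\; \mathcal{L}_{whole} \;+\; \mathcal{L}_{rgb}
  \;+\; \mathcal{L}_{depth} \;+\; \mathcal{L}_{ir}
```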
4 CASIA-CeFA dataset
This section describes the CASIA-CeFA dataset. Its motivation is to provide increased diversity of attack types compared to existing datasets, as well as to explore the effect of cross-ethnicity in face anti-spoofing, which has received little attention in the literature. Furthermore, it contains three visual modalities, i.e., RGB, Depth and IR. In short, the main purpose of CASIA-CeFA is to provide the largest up-to-date face anti-spoofing dataset for evaluating the generalization performance of new PAD methods along three main axes: cross-ethnicity, cross-modality and cross-attack. In this section, we describe the CASIA-CeFA dataset in detail, including acquisition details, attack types, and the proposed evaluation protocols.
Table 2: Evaluation protocols (columns: Protocol, Subset, Ethnicity, Subjects, Modalities, PAIs, # real videos, # fake videos, # all videos).
Acquisition Details. We use an Intel RealSense camera to capture the RGB, Depth and IR videos simultaneously at 30 fps. The resolution of each video frame is 1280×720 pixels for all modalities. Performers are asked to move their head smoothly, up to a maximum deviation of about  from the frontal view. Data pre-processing is similar to that performed in , except that PRNet  is replaced by 3DDFA  for face region detection. Examples of originally recorded frames and processed face regions for the different visual modalities are shown in Fig. 1.
Statistics. As shown in Table 1, CASIA-CeFA consists of 2D and 3D attack subsets. The 2D attack subset includes print and video-replay attacks, covering three ethnicities (African, East Asian and Central Asian) with two attack types (face printed on cloth and video replay). Each ethnicity has 500 subjects. Each subject has 1 real sample, 2 fake samples of the print attack captured indoors and outdoors, and 1 fake sample of the video-replay attack. In total, there are 18,000 videos (6,000 per modality). The age and gender statistics of the 2D attack subset of CASIA-CeFA are shown in Fig. 4.
The 3D attack subset contains 3D print mask and silica gel face attacks. The 3D print mask part has 99 subjects, each with 18 fake samples captured under three attack styles and six lighting environments. The attack styles are: the mask only, the mask with a wig and glasses, and the mask with a wig and no glasses. The lighting conditions are outdoor sunshine, outdoor shade, indoor side light, indoor front light, indoor backlight and indoor regular light. In total, there are 5,346 videos (1,782 per modality). The silica gel part has 8 subjects, each with 8 fake samples captured under two attack styles and four lighting environments. The attack styles are wearing a wig with glasses and wearing a wig without glasses. The lighting environments are indoor side light, indoor front light, indoor backlight and indoor normal light. In total, there are 196 videos (64 per modality).
We design four protocols for the 2D attacks subset, as shown in Table 2, totalling 11 sub-protocols (1_1, 1_2, 1_3, 2_1, 2_2, 3_1, 3_2, 3_3, 4_1, 4_2, and 4_3). We divide subjects per ethnicity into three subject-disjoint subsets (second and fourth columns in Table 2). Each protocol has three data subsets: training, validation and testing sets, which contain 200, 100, and 200 subjects, respectively.
Protocol 1 (cross-ethnicity): Most public face PAD datasets contain a single ethnicity. Even the few datasets [20, 41] containing multiple ethnicities lack ethnicity labels or do not provide a protocol for cross-ethnicity evaluation. Therefore, we design the first protocol to evaluate the generalization of PAD methods under cross-ethnicity testing. One ethnicity is used for training and validation, and the remaining two ethnicities are used for testing. There are thus three different evaluations (third column of Protocol 1 in Table 2).
Protocol 2 (cross-PAI): Given the diversity and unpredictability of attack types produced by different presentation attack instruments (PAIs), it is necessary to evaluate the robustness of face PAD algorithms to this kind of variation (sixth column of Protocol 2 in Table 2).
Protocol 3 (cross-modality): Given the availability of affordable devices capturing complementary visual modalities (e.g., Intel RealSense, Microsoft Kinect), a multi-modal face anti-spoofing dataset was recently proposed . However, there is no standard protocol for exploring the generalization of face PAD methods when different train-test modalities are considered. We define three cross-modality evaluations, each using one modality for training and the two remaining ones for testing (fifth column of Protocol 3 in Table 2).
Protocol 4 (cross-ethnicity & cross-PAI): The most challenging protocol combines the conditions of Protocols 1 and 2. As shown in Protocol 4 of Table 2, the testing subset introduces two unknown target variations simultaneously.
5 Experiments

In this section, we conduct a series of experiments on publicly available face anti-spoofing datasets to verify the effectiveness of our method and the benefits of the presented CASIA-CeFA dataset. In the following, we introduce the employed datasets and metrics, implementation details, experimental settings, and results and analysis.
5.1 Datasets & Metrics
We evaluate the performance of PSMM-Net on two multi-modal (i.e., RGB, Depth and IR) datasets, CASIA-CeFA and CASIA-SURF , and evaluate SD-Net on two single-modal (i.e., RGB) face anti-spoofing benchmarks, OULU-NPU  and SiW . These are the mainstream datasets released in recent years, each with its own characteristics in terms of number of subjects, modalities, ethnicities, attack types, acquisition devices and PAIs. Experiments on these datasets therefore verify the performance of our method more convincingly.
In order to perform a consistent evaluation with prior works, we report experimental results using the following metrics, based on the respective official protocols: Attack Presentation Classification Error Rate (APCER) , Bona Fide Presentation Classification Error Rate (BPCER), Average Classification Error Rate (ACER), and the Receiver Operating Characteristic (ROC) curve .
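These three error rates follow standard ISO/IEC 30107-3-style definitions and can be computed as below (a minimal sketch; the function name and the label convention, 1 = bona fide and 0 = attack, are our own):

```python
def pad_metrics(labels, preds):
    """Compute APCER, BPCER and ACER from ground-truth labels and
    predicted labels (1 = bona fide, 0 = attack)."""
    n_attack = sum(1 for l in labels if l == 0)
    n_bona = sum(1 for l in labels if l == 1)
    # APCER: fraction of attack presentations wrongly accepted as bona fide.
    apcer = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1) / n_attack
    # BPCER: fraction of bona fide presentations wrongly rejected as attacks.
    bpcer = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0) / n_bona
    # ACER: mean of the two error rates.
    return apcer, bpcer, (apcer + bpcer) / 2

# Toy usage: 4 attacks (one accepted), 2 bona fide (one rejected).
apcer, bpcer, acer = pad_metrics([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```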
5.2 Implementation Details
The proposed PSMM-Net is implemented in TensorFlow and runs on a single NVIDIA TITAN X GPU. We resize the cropped face region to , and use random rotation within the range [, ], flipping, cropping and color distortion for data augmentation. All models are trained for  epochs with the Adaptive Moment Estimation (Adam) algorithm and an initial learning rate of , which is decreased by a factor of  after  and  epochs. The batch size of each CNN stream is , and the number of consecutive frames used to construct the dynamic image is set to  based on our experimental experience. In addition, all fusion points in this work use element-wise summation to prevent dimension explosion.
5.3 Baseline Model Evaluation
Before exploring the traits of our dataset, we first provide a benchmark for CASIA-CeFA based on the proposed method. From Table 3, in which the results of the four protocols are aggregated over their respective sub-protocols by computing mean and variance, we draw the following conclusions: (1) The ACER scores of the three sub-protocols of Protocol 1 are ,  and , respectively, indicating that it is necessary to study the generalization of face PAD methods across ethnicities. (2) In Protocol 2, when the print attack is used for training/validation and video replay and 3D mask are used for testing, the ACER score is  (sub-protocol 2_1); when the video-replay attack is used for training/validation and print and 3D attacks are used for testing, the ACER score is  (sub-protocol 2_2). The large gap between the two sub-protocols is mainly caused by different PAIs (i.e., different displays and printers) creating different artifacts. (3) Protocol 3 evaluates cross-modality. The best result is achieved for sub-protocol 3_1, with an ACER of ; the other two sub-protocols achieve similarly low performance. This means the best performance is achieved when the RGB data of the 2D attack subset is used for training/validation and the other two modalities of the 2D and 3D attack subsets are used for testing. (4) Protocol 4 is the most difficult evaluation scenario, simultaneously considering cross-ethnicity and cross-PAI. All sub-protocols achieve poor performance, with ACER scores of ,  and  for 4_1, 4_2 and 4_3, respectively.
5.4 Ablation Analysis
In order to verify the effectiveness of the proposed method, we perform a series of ablation experiments on Protocol 1 (cross-ethnicity) of the CASIA-CeFA dataset.
Static and Dynamic Features. We evaluate S-Net (the static branch of SD-Net), D-Net (the dynamic branch of SD-Net) and the full SD-Net. Results for the RGB, Depth and IR modalities are shown in Table 4. Compared with S-Net and D-Net, SD-Net achieves superior performance: for the RGB, Depth and IR modalities, the ACER of SD-Net is , , , versus , ,  for S-Net (improvements of , , ) and , ,  for D-Net (improvements of , , ), respectively. Furthermore, Table 4 shows that the performance of the Depth and IR modalities is superior to that of RGB. One reason is the variability in lighting conditions included in CASIA-CeFA.
In addition, we provide single-modal results on the protocols to facilitate comparison of face PAD algorithms, as shown in Table 5. When only a single modality is used, the performance of the Depth or IR modality is superior to that of the RGB modality.
Multiple Modalities. In order to show the effect of analysing different numbers of modalities, we evaluate one modality (RGB), two modalities (RGB and Depth), and three modalities (RGB, Depth and IR) with PSMM-Net. As shown in Fig. 3, PSMM-Net contains three SD-Nets and one shared branch. When only the RGB modality is considered, we use a single SD-Net; when two or three modalities are considered, we train PSMM-Net with two or three SD-Nets plus the shared branch, respectively. Results are shown in Table 6. The best results are obtained with all three modalities:  APCER,  BPCER and  ACER. Compared with using the single RGB modality and two modalities, the improvements are  and  for APCER,  and  for BPCER, and  and  for ACER, respectively.
|Method||TPR@FPR=10^-2 (%)||TPR@FPR=10^-3 (%)||TPR@FPR=10^-4 (%)||APCER (%)||BPCER (%)||ACER (%)|
|NHF fusion ||89.1||33.6||17.8||5.6||3.8||4.7|
|Single-scale SE fusion ||96.7||81.8||56.8||3.8||1.0||2.4|
|Multi-scale SE fusion ||99.8||98.4||95.2||1.6||0.08||0.8|
Fusion Strategy. In order to evaluate the performance of PSMM-Net, we compare it with two variants: naive halfway fusion (NHF) and PSMM-Net without the backward feeding mechanism (PSMM-Net-WoBF). As shown in Fig. 5, NHF combines the modules of the different modalities at a later stage (i.e., after module ), and PSMM-Net-WoBF removes the backward feeding from PSMM-Net. The comparison results in Table 7 show the higher performance of the proposed PSMM-Net, with an ACER of .
5.5 Methods Comparison
CASIA-SURF Dataset. Comparison results are shown in Table 8. The performance of PSMM-Net is superior to that of the competing multi-modal fusion methods, including halfway fusion , single-scale SE fusion  and multi-scale SE fusion . Compared with [41, 40], PSMM-Net improves performance by at least  for APCER,  for BPCER and  for ACER. When PSMM-Net is pretrained on CASIA-CeFA, performance improves further; concretely, the  score increases by  when pretraining with the proposed CASIA-CeFA dataset.
In 2019, a challenge on the CASIA-SURF dataset was run at CVPR (https://sites.google.com/qq.com/chalearnfacespoofingattackdete/welcome). The challenge results were very promising: the winning teams VisionLab , ReadSense  and Feather  obtained TPR = ,  and , respectively. The main reasons for this high performance are: 1) the use of several external datasets. VisionLab  used four large-scale datasets, namely CASIA-WebFace , MS-Celeb-1M , AFAD-Lite  and an Asian dataset , for pretraining, while Feather  used a large private dataset with a collection protocol similar to CASIA-SURF. 2) The use of many network ensembles. VisionLab , ReadSense  and Feather  average the outputs of ,  and  networks, respectively, to compute the final results. Thus, for a fair comparison, we omit VisionLab , ReadSense  and Feather  from Table 8.
SiW Dataset. Results for this dataset are shown in Table 9. We compare the proposed SD-Net with other methods without pretraining. Taking Protocol 1 of SiW as an example, SD-Net achieves the best ACER of , an improvement of  with respect to the second-best score,  (ACER), from STASN . With CASIA-CeFA pretraining, our method is competitive with STASN (Data)  ( versus  in terms of ACER), which used a large private dataset for pretraining. For Protocols 2 and 3 of SiW, our method achieves the best performance under all three evaluation metrics.
|Prot.||Method||APCER (%)||BPCER (%)||ACER (%)||Pretrain|
|STASN (Data) ||-||-||0.30|
|STASN (Data) ||-||-||0.15±0.05|
|STASN (Data) ||-||-||5.85±0.85|
|Prot.||Method||APCER (%)||BPCER (%)||ACER (%)||Pretrain|
|STASN (Data) ||1.4±1.4||3.6±4.6||2.5±2.2||Yes|
|STASN (Data) ||0.9±1.8||4.2±5.3||2.6±2.8||Yes|
OULU-NPU Dataset. We perform evaluation on the two most challenging protocols of OULU-NPU. Protocol 3 studies generalization across different acquisition devices, and Protocol 4 considers all the conditions of the previous three protocols simultaneously. The experimental results are shown in Table 10. Without pretraining, SD-Net obtains the best results in both Protocols 3 and 4; the ACER of our SD-Net is  versus  for STASN . With pretraining, our method ranks first and second on Protocols 3 and 4, respectively. In summary, without pretraining, the proposed method achieves state-of-the-art ACER in all protocols of SiW and OULU-NPU, and with CASIA-CeFA pretraining it also obtains the best ACER scores on most protocols. These results clearly demonstrate the effectiveness of the proposed method and of the collected CASIA-CeFA dataset.
6 Conclusion

In this paper, we have presented the CASIA-CeFA dataset for face anti-spoofing, the largest publicly available dataset in terms of modalities, subjects, ethnicities and attacks. Moreover, we have proposed a static-dynamic network (SD-Net) to learn both static and dynamic features from single-modal data, and a partially shared multi-modal network (PSMM-Net) to learn complementary information from multi-modal data in videos. Extensive experiments on four popular datasets show the high generalization capability of the proposed SD-Net and PSMM-Net, and the utility and challenges of the released CASIA-CeFA dataset.
TensorFlow: a system for large-scale machine learning. Cited by: §5.2.
-  (2017) Face anti-spoofing using patch and depth-based cnns. In IJCB, pp. 319–328. Cited by: §2.1.
-  (2016) Face spoofing detection using colour texture analysis. TIFS. Cited by: §1, §2.1.
-  (2017) Face antispoofing using speeded-up robust features and fisher vector encoding. SPL. Cited by: §1.
-  (2017) OULU-npu: a mobile face presentation attack database with real-world variations. In FG, Cited by: Table 1, §1, §2.2, §4, §5.1.
-  (2012) On the effectiveness of local binary patterns in face anti-spoofing. In Biometrics Special Interest Group, Cited by: Table 1, §1, §2.2.
-  (2016) Face recognition systems under spoofing attacks. In Face Recognition Across the Imaging Spectrum, Cited by: Table 1.
-  (2016) The replay-mobile face presentation-attack database. In BIOSIG, Cited by: Table 1.
-  (2013) Can face anti-spoofing countermeasures work in a real world scenario?. In ICB, Cited by: §1, §2.1.
-  (2014) Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect. In BTAS, Cited by: Table 1.
-  (2016) Integration of image quality and motion cues for face anti-spoofing: a neural network approach. JVCIR. Cited by: §1, §2.1, §2.1.
-  (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, Cited by: §4.
-  (2017) Rank pooling for action recognition. TPAMI 39 (4), pp. 773–787. Cited by: §1, §3.1, §3.1.
-  (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In ECCV, pp. 87–102. Cited by: §5.5.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
-  (2016) ISO/IEC JTC 1/SC 37 Biometrics. information technology biometric presentation attack detection part 1: framework. international organization for standardization. Note: https://www.iso.org/obp/ui/iso Cited by: §5.1.
-  (2018) Face de-spoofing: anti-spoofing via noise modeling. arXiv. Cited by: Table 10.
-  (2013) Context based face anti-spoofing. In BTAS, Cited by: §1.
-  An original face anti-spoofing approach using partial convolutional neural network. In IPTA, Cited by: §1, §2.1.
-  (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In CVPR, Cited by: Table 1, §1, §1, §2.1, §2.1, §2.2, §4, §5.1, Table 10, Table 9.
-  (2019) Deep tree learning for zero-shot face anti-spoofing. In CVPR, pp. 4680–4689. Cited by: §2.1.
-  (2011) Face spoofing detection from single images using micro-texture analysis. In IJCB, pp. 1–7. Cited by: §1, §2.1.
-  (2016) Ordinal regression with multiple output cnn for age estimation. In CVPR, pp. 4920–4928. Cited by: §5.5.
-  (2007) Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, Cited by: §1, §2.1.
-  (2019) Recognizing multi-modal face spoofing with face recognition networks. In PRCVW, pp. 0–0. Cited by: §1, §2.1, §5.5.
-  (2016) Cross-database face antispoofing with robust feature representation. In CCBR, Cited by: §1, §2.1, §2.1.
-  (2016) Secure face unlock: spoof detection on smartphones. TIFS. Cited by: §1, §2.1.
-  (2019) Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In CVPR, pp. 10023–10031. Cited by: §1.
-  (2019) FaceBagNet: bag-of-local-features model for multi-modal face anti-spoofing. In PRCVW, pp. 0–0. Cited by: §2.1, §5.5.
-  (2004) A tutorial on support vector regression. Statistics and computing 14 (3), pp. 199–222. Cited by: §3.1.
-  (2019) Deeply-learned hybrid representations for facial age estimation. pp. 3548–3554. Cited by: §3.1.
-  (2017) Ordered pooling of optical flow sequences for action recognition. In WACV, pp. 168–176. Cited by: §3.1.
-  Cooperative training of deep aggregation networks for rgb-d action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.1.
-  (2018) Exploiting temporal and depth information for multi-frame face anti-spoofing. arXiv. Cited by: Table 9.
-  (2015) Face spoof detection with image distortion analysis. TIFS. Cited by: Table 1.
-  (2013) Face liveness detection with component dependent descriptor.. In ICB, Cited by: §1, §2.1.
-  (2019) Face anti-spoofing: model matters, so does data. In CVPR, pp. 3507–3516. Cited by: §2.1, §5.5, §5.5, Table 10, Table 9.
-  (2014) Learning face representation from scratch. arXiv. Cited by: §5.5.
-  (2019) FeatherNets: convolutional neural networks as light as feather for face anti-spoofing. Cited by: §5.5.
-  (2019) CASIA-surf: a large-scale multi-modal benchmark for face anti-spoofing. arXiv:1908.10654. Cited by: §5.5, Table 8.
-  (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In CVPR, Cited by: Table 1, §1, §1, §2.1, §2.2, §4, §4, §5.1, §5.1, §5.5, Table 8, Table 9.
-  (2012) A face antispoofing database with diverse attacks. In ICB, Cited by: Table 1, §1, §2.2.
-  (2018) Towards pose invariant face recognition in the wild. In CVPR, pp. 2207–2216. Cited by: §5.5.
-  (2017) Face alignment in full pose range: a 3d total solution. TPAMI 41 (1), pp. 78–92. Cited by: §4.