Static and Dynamic Fusion for Multi-modal Cross-ethnicity Face Anti-spoofing

by Ajian Liu, et al.
University of Barcelona
Baidu, Inc.

Regardless of whether deep learning or handcrafted methods are used, the dynamic information in videos and the effect of cross-ethnicity are rarely considered in face anti-spoofing. In this work, we propose a static-dynamic fusion mechanism for multi-modal face anti-spoofing. Inspired by motion divergences between real and fake faces, we incorporate the dynamic image calculated by rank pooling, together with static information, into a convolutional neural network (CNN) for each modality (i.e., RGB, Depth and infrared (IR)). Then, we develop a partially shared fusion method to learn complementary information from multiple modalities. Furthermore, in order to study the generalization capability of the proposed method in terms of cross-ethnicity attacks and unknown spoofs, we introduce the largest public cross-ethnicity Face Anti-spoofing (CASIA-CeFA) dataset, covering 3 ethnicities, 3 modalities, 1607 subjects, and 2D plus 3D attack types. Experiments demonstrate that the proposed method achieves state-of-the-art results on CASIA-CeFA, CASIA-SURF, OULU-NPU and SiW.


1 Introduction

Figure 1: Samples of the CASIA-CeFA dataset. It contains 1607 subjects, 3 different ethnicities (i.e., Africa, East Asia, and Central Asia), with 4 attack types (i.e., print attack, replay attack, 3D print and silica gel attacks).

To enhance the security of face recognition systems, presentation attack detection (PAD) is a vital stage prior to visual face recognition [3, 4, 20, 28]. Most works in face anti-spoofing focus on still images (RGB, Depth or IR). These methods can be divided into two main categories: handcrafted methods [3, 9, 27] and deep learning based methods [11, 19, 26]. Handcrafted methods attempt to extract texture information or statistical features (e.g., HOG [18, 36] and LBP [22, 9]) to distinguish between real and spoof faces. Deep learning based methods automatically learn discriminative features from input images for face anti-spoofing [19, 26]. However, the analysis of motion divergences between real and fake faces has received little attention. To improve robustness in real applications, some temporal-based methods [24, 26] have been proposed, which require constrained, human-guided interaction, such as movements of the eyes, lips, and head. However, this does not provide a natural, user-friendly interaction. Different from those works, we capture the temporal/dynamic information by using the dynamic image generated by rank pooling [13], which does not need any human-guided interaction. Moreover, inspired by the motion divergences between real and fake faces, a static- and dynamic-based network (SD-Net) is further formulated by taking the static and dynamic images as input.

Multi-modal face anti-spoofing has also attracted an increasing number of researchers in the last two years. Some fusion methods [41, 25] have been published, but they restrict the interactions among different modalities, since the modalities are processed independently before the fusion point. It is therefore difficult for different modalities to exploit their relatedness from the beginning of the network to its end to boost the overall performance. In this paper, we propose a partially shared branch multi-modal network (PSMM-Net) that allows exchanges and interactions among different modalities, aiming to capture correlated and complementary features.

Dataset | Year | # sub | # num | Attacks | Mod. | Dev. | Eth.
Replay-Attack [6] | 2012 | 50 | 1200 | Pr,Re | R | CR | *
CASIA-FASD [42] | 2012 | 50 | 600 | Pr,Cu,Re | R | CR | *
3DMAD [10] | 2014 | 17 | 255 | M | R/D | CR/K | *
MSU-MFSD [35] | 2015 | 35 | 440 | Pr,Re | R | P/L | *
Replay-Mobile [8] | 2016 | 40 | 1030 | Pr,Re | R | P | *
Msspoof [7] | 2016 | 21 | 4704 (i) | | | |
OULU-NPU [5] | 2017 | 55 | 5940 | Pr,Re | R | |
SiW [20] | 2018 | 165 | 4620 | | | |
CASIA-SURF [41] | 2019 | 1000 | 21000 | | | |
CASIA-CeFA (Ours) | 2019 | 1500 | 18000 | Pr,Re | R/D/I | S | A/E/C
CASIA-CeFA (Ours) | | 99 | 5346 | M | | |
CASIA-CeFA (Ours) | | 8 | 192 | G | | |
CASIA-CeFA (Ours) | Total: 1607 subjects, 23538 videos
Table 1: Comparison of existing face PAD databases. ((i) indicates that the dataset contains only images; * indicates that the dataset does not specify ethnicities. Mod.: modalities, Dev.: devices, Eth.: ethnicities, Pr: print attack, Re: replay attack, Cu: cut, M: 3D print face mask, G: 3D silica gel face mask, R: RGB, D: Depth, I: IR, CR: RGB camera, CI: IR camera, K: Kinect, P: cellphone, L: laptop, S: Intel RealSense, AS: Asian, A: Africa, U: Caucasian, I: Indian, E: East Asia, C: Central Asia.)

Furthermore, data plays a key role in face anti-spoofing tasks. In existing face anti-spoofing datasets, such as CASIA-FASD [42], Replay-Attack [6], OULU-NPU [5], and SiW [20], the number of samples is relatively small and most contain only the RGB modality. The recently released CASIA-SURF [41] includes 1,000 subjects and the RGB, Depth and IR modalities. Although it is larger than the existing alternatives, it suffers from limited attack types (2D print attack) and a single ethnicity (Chinese people). Overall, the effect of cross-ethnicity for face anti-spoofing has received little attention in previous works. Therefore, we introduce the CASIA-CeFA dataset, the largest dataset in terms of subjects (see Table 1). In CASIA-CeFA, attack types are diverse, including print attacks (faces printed on cloth), video-replay attacks, 3D print and silica gel attacks. More importantly, it is the first public dataset designed for exploring the impact of cross-ethnicity in the study of face anti-spoofing. Some samples of the CASIA-CeFA dataset are shown in Fig. 1.

The contributions of this paper are summarized as follows: (1) We propose the SD-Net to learn both static and dynamic features for a single modality. It is the first work incorporating dynamic images for face anti-spoofing. (2) We propose the PSMM-Net to learn complementary information from multi-modal data in videos. (3) We release the CASIA-CeFA dataset, which includes 3 ethnicities, 1607 subjects and diverse 2D/3D attack types. (4) Extensive experiments of the proposed method on CASIA-CeFA and 3 other public datasets verify its high generalization capability.

2 Related Work

2.1 Methods

Image-based Methods. Image-based methods take still images as input, i.e., RGB, Depth or IR. Classical approaches combine handcrafted features, such as HOG [36], LBP [22, 9], SIFT [27] or SURF [3], with traditional classifiers, such as SVM or LDA, to perform binary anti-spoofing predictions. However, those methods lack good generalization capability when testing conditions, such as lighting and background, vary. Owing to the success of deep learning strategies over handcrafted alternatives in computer vision, some works [11, 19, 26] extended feature vectors with features from CNNs for face anti-spoofing. The authors of [2, 20] presented a two-stream network using RGB and Depth images as input. The work of [21] proposes a deep tree network to model spoofs by hierarchically learning sub-groups of features. However, these methods do not consider any kind of temporal information for face anti-spoofing.

Temporal-based Methods. To improve robustness in real applications, some temporal-based methods [24, 26] have been proposed, which require constrained, human-guided interaction, such as movements of the eyes, lips, and head. However, those methods do not provide a natural, user-friendly interaction. Even more importantly, these methods [24, 26] could become vulnerable if someone presents a replay attack or a printed photo with cut eye/mouth regions. Given that remote photoplethysmography (rPPG) signals (i.e., heart pulse signals) can be detected from real faces but not from spoofs, Liu et al. [20] proposed a CNN-RNN model to estimate rPPG signals with sequence-wise supervision and face depth with pixel-wise supervision. The estimated depth and rPPG are fused to distinguish real and fake faces. Feng et al. [11] distinguished between real and fake samples based on the difference between image quality and optical flow information. Yang et al. [37] proposed a spatio-temporal attention mechanism to fuse global temporal and local spatial information. All of these methods rely on a single visual modality, and no work considers the effect of cross-ethnicity for anti-spoofing.

Multi-modal Fusion Methods. Zhang et al. [41] proposed a fusion network with 3 streams using ResNet-18 as the backbone, where each stream is used to extract low-level features from RGB, Depth and IR data, respectively. Then, these features are concatenated and passed to the last two residual blocks. Similar to [41], Aleksandr et al. [25] used a fusion network with 3 streams. They used ResNet-34 as the backbone and multi-scale feature fusion at all residual blocks. Tao et al. [29] proposed a multi-stream CNN architecture called FaceBagNet, which uses patch-level images as input and a modality feature erasing (MFE) operation to prevent overfitting and obtain more discriminative fused features. All of these methods rely on the concatenation of features from multiple modalities as the key fusion component. Unlike [41, 25, 29], we propose the PSMM-Net, where three modality-specific networks and one shared network are connected by a partially shared structure to learn discriminative fused features for face anti-spoofing.

2.2 Datasets

Table 1 lists existing face anti-spoofing datasets. Before 2019, the maximum number of available subjects was 165, in the SiW dataset [20], which clearly limited the generalization ability of new approaches in cross-dataset evaluation. Most of the datasets contain only RGB data, such as Replay-Attack [6], CASIA-FASD [42], SiW [20] and OULU-NPU [5]. Recently, CASIA-SURF [41] has been released, including 1000 subjects with three modalities, namely RGB, Depth and IR. Although this relieved the shortage of data, it is limited in terms of attack types (only 2D print attack) and includes only one ethnicity (Chinese people). As shown in Table 1, most datasets do not provide ethnicity information, except SiW and CASIA-SURF. Although the SiW dataset provides four ethnicities, it still does not consider the effect of cross-ethnicity for face anti-spoofing. This limitation also holds for the CASIA-SURF dataset.

3 Proposed Method

Figure 2: SD-Net diagram, showing a single-modal static-dynamic-based network. We take the RGB modality and its corresponding dynamic image as an example. This architecture includes three branches: static (red arrow), dynamic (blue arrow) and static-dynamic (green arrow). The static-dynamic branch fuses the static and dynamic features of first res block outputs from static and dynamic branches (best viewed in color).

3.1 SD-Net for Single-modal

Single-modal Dynamic Image Construction.

Rank pooling [13, 33] defines a rank function that encodes a video into a feature vector. The learning process can be seen as a convex optimization problem using the RankSVM [30] formulation in Eq. 1. Let an RGB (Depth or IR) video sequence with $K$ frames be represented as $\langle I_1, I_2, \ldots, I_K \rangle$, and let $V_t = \frac{1}{t}\sum_{\tau=1}^{t} I_\tau$ denote the average of the RGB (Depth or IR) features over time up to the $t$-th frame. The process is formulated below.

$$\arg\min_{d} \; \frac{1}{2}\|d\|^2 + \delta \sum_{i>j} \xi_{ij}, \quad \text{s.t.} \;\; d^{\top}(V_i - V_j) \ge 1 - \xi_{ij}, \;\; \xi_{ij} \ge 0 \qquad (1)$$

where $\xi_{ij}$ is the slack variable, and $\delta = \frac{2}{K^2 - K}$.

By optimizing Eq. 1, we map a sequence of $K$ frames to a single vector $d$. In this paper, rank pooling is applied directly to the pixels of the RGB (Depth or IR) frames, so the dynamic image has the same size as the input frames. In our case, given an input frame, we compute its dynamic image online with rank pooling using $K$ consecutive frames. Our choice of dynamic images in SD-Net is further motivated by the fact that dynamic images have proved superior to regular optical flow [32, 13].
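To give a feel for how a dynamic image collapses several frames into one, the sketch below uses approximate rank pooling, a closed-form weighting with coefficients 2t - K - 1 for frame t of K, rather than solving the RankSVM problem of Eq. 1. This is a swapped-in simplification, not the optimization described above; the function name `approximate_dynamic_image` and the toy frames are hypothetical.

```python
def approximate_dynamic_image(frames):
    """Collapse K equally sized frames into one dynamic image.

    frames: list of K frames, each a list of rows of pixel intensities.
    Uses the closed-form weights alpha_t = 2t - K - 1 (t = 1..K), an
    approximation that stands in for the RankSVM optimization (assumption).
    """
    k = len(frames)
    if k < 2:
        raise ValueError("need at least two frames to encode motion")
    height, width = len(frames[0]), len(frames[0][0])
    weights = [2 * t - k - 1 for t in range(1, k + 1)]
    dynamic = [[0.0] * width for _ in range(height)]
    # Later frames get positive weights, earlier frames negative ones,
    # so pixels that do not change over time cancel out.
    for alpha, frame in zip(weights, frames):
        for r in range(height):
            for c in range(width):
                dynamic[r][c] += alpha * frame[r][c]
    return dynamic


# Toy 2x2 "video": one pixel brightens over time, the rest stay constant.
video = [[[10, 10], [10, 10 + 5 * t]] for t in range(4)]
dyn = approximate_dynamic_image(video)
```

Because the weights sum to zero, static pixels map to zero and only the changing pixel keeps a non-zero response, which is the property that makes the dynamic image a compact summary of motion.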

Single-modal SD-Net.

As shown in Fig. 2, taking the RGB modality as an example, we propose the SD-Net to learn hybrid features from static and dynamic images. It contains three branches: static, dynamic and static-dynamic, which learn complementary features. The network takes ResNet-18 [15] as the backbone. The static and dynamic branches each consist of five blocks (i.e., conv, res1, res2, res3, res4) and a Global Average Pooling (GAP) layer, while in the static-dynamic branch the conv and res1 blocks are removed, because this branch takes the fused res1 features of the static and dynamic branches as input.

For consistency of terminology with the rest of the paper, we divide the residual blocks of the network into a set of modules $\{M_t^{\kappa}\}_{t=1}^{4}$ according to feature level, where $\kappa \in \{color, depth, ir\}$ is an indicator of the modality and $t$ represents the feature level. Except for the first module $M_1^{\kappa}$, each module extracts static, dynamic and static-dynamic features by using a residual block, denoted as $X_s^{\kappa,t}$, $X_d^{\kappa,t}$ and $X_f^{\kappa,t}$, respectively. The output features from each module are used as the input for the next module. The static-dynamic features of the first module, $X_f^{\kappa,1}$, are obtained by directly summing $X_s^{\kappa,1}$ and $X_d^{\kappa,1}$.

In order to ensure that each branch learns independent features, each branch employs an independent loss function after the GAP layer [31]. In addition, a loss function based on the summed features from all three branches is employed. The binary cross-entropy loss is used as the loss function. All branches are jointly and concurrently optimized to capture discriminative and complementary features for face anti-spoofing in image sequences. The overall objective function of SD-Net for the modality $\kappa$ is defined as:

$$\mathcal{L}^{\kappa} = \mathcal{L}_{s}^{\kappa} + \mathcal{L}_{d}^{\kappa} + \mathcal{L}_{f}^{\kappa} + \mathcal{L}_{sum}^{\kappa} \qquad (2)$$

where $\mathcal{L}_{s}^{\kappa}$, $\mathcal{L}_{d}^{\kappa}$, $\mathcal{L}_{f}^{\kappa}$ and $\mathcal{L}_{sum}^{\kappa}$ are the losses for the static branch, the dynamic branch, the static-dynamic branch, and the summed features from all three branches of the network, respectively.
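The SD-Net objective can be illustrated with a small sketch that sums four binary cross-entropy terms, one per branch plus one for the summed features. Equal weighting of the four terms, the label convention (1 for a real face, 0 for a spoof) and the function names are assumptions made for illustration only.

```python
import math


def binary_cross_entropy(prob_real, label):
    """BCE for one sample; label is 1 for a real face, 0 for a spoof (assumed)."""
    eps = 1e-12  # guards against log(0)
    p = min(max(prob_real, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def sd_net_loss(p_static, p_dynamic, p_static_dynamic, p_summed, label):
    """Unweighted sum of the four losses (equal weights are an assumption)."""
    return sum(
        binary_cross_entropy(p, label)
        for p in (p_static, p_dynamic, p_static_dynamic, p_summed)
    )


# Hypothetical "real face" probabilities predicted by each head for one sample.
loss = sd_net_loss(0.9, 0.8, 0.85, 0.95, label=1)
```

Keeping a separate term per branch means a weak branch still receives its own gradient signal even when the summed prediction is already correct.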

Figure 3: The PSMM-Net diagram. It consists of two main parts. The first is the modality-specific network, which contains three SD-Nets to learn features from the RGB, Depth and IR modalities, respectively. The second is a shared branch for all modalities, which aims to learn the complementary features among different modalities.

3.2 PSMM-Net for Multi-modal Fusion

The architecture of the proposed PSMM-Net is shown in Fig. 3. It consists of two main parts: a) the modality-specific network, which contains three SD-Nets to learn features from the RGB, Depth and IR modalities, respectively; and b) a shared branch for all modalities, which aims to learn the complementary features among different modalities.

For the shared branch, we adopt ResNet-18, removing the first conv layer and res1 block. In order to capture correlations and complementary semantics among different modalities, information exchange and interaction among SD-Nets and the shared branch are designed. This is done in two different ways: a) forward feeding of fused SD-Net features to the shared branch, and b) backward feeding from shared branch modules output to SD-Net block inputs.

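Forward Feeding. We fuse the static and dynamic SD-Net features from all modality branches and feed them as input to the corresponding shared block. The fusion process at the $t$-th feature level can be formulated as:

$$\tilde{S}^{t} = \sum_{\kappa} X_s^{\kappa,t} + \sum_{\kappa} X_d^{\kappa,t} + S^{t}, \quad t = 1, 2, 3 \qquad (3)$$

where $X_s^{\kappa,t}$ and $X_d^{\kappa,t}$ are the static and dynamic features of modality $\kappa$ at level $t$. In the shared branch, $\tilde{S}^{t}$ denotes the input to the $(t+1)$-th block, and $S^{t}$ denotes the output of the $t$-th block. Note that the first residual block is removed from the shared branch, thus $S^{1}$ equals zero.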

Backward Feeding. Shared features are delivered back to the SD-Nets of the different modalities. The static features $X_s^{\kappa,t}$ and dynamic features $X_d^{\kappa,t}$ of modality $\kappa$ are added to the shared-branch output $S^{t}$ for feature fusion. This can be denoted as:

$$\hat{X}_s^{\kappa,t} = X_s^{\kappa,t} + S^{t}, \qquad \hat{X}_d^{\kappa,t} = X_d^{\kappa,t} + S^{t} \qquad (4)$$

where $t$ ranges from 2 to 3.

After feature fusion, $\hat{X}_s^{\kappa,t}$ and $\hat{X}_d^{\kappa,t}$ become the new static and dynamic features, which are then fed to the next module. Note that the exchange and interaction among the SD-Nets and the shared branch are only performed for static and dynamic features. This prevents the hybrid static-dynamic features from being disturbed by multi-modal semantics.
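The two feeding directions reduce to element-wise sums, which the following sketch spells out on toy feature vectors. It is a minimal illustration under assumptions: features are plain Python lists rather than feature maps, and the helper names `forward_feed` and `backward_feed` are hypothetical.

```python
def add(*vectors):
    """Element-wise sum of equally sized feature vectors."""
    return [sum(values) for values in zip(*vectors)]


def forward_feed(static_feats, dynamic_feats, shared_out):
    """Input to the next shared block: all static + all dynamic + shared output."""
    return add(*static_feats.values(), *dynamic_feats.values(), shared_out)


def backward_feed(static_feats, dynamic_feats, shared_out):
    """Add the shared output back onto each modality's static and dynamic features."""
    new_static = {m: add(f, shared_out) for m, f in static_feats.items()}
    new_dynamic = {m: add(f, shared_out) for m, f in dynamic_feats.items()}
    return new_static, new_dynamic


# Toy two-dimensional features per modality at one feature level.
static = {"color": [1.0, 2.0], "depth": [0.5, 0.5], "ir": [0.0, 1.0]}
dynamic = {"color": [0.25, 0.25], "depth": [0.5, 0.5], "ir": [0.25, 0.25]}

# At the first level the shared branch has no block yet, so its output is zero.
shared_in = forward_feed(static, dynamic, [0.0, 0.0])

# At later levels the shared output is fed back to every modality.
new_static, new_dynamic = backward_feed(static, dynamic, [1.0, 1.0])
```

Summation keeps the feature dimension fixed at every fusion point, whereas concatenation would grow it with the number of modalities.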

Loss Optimization. Two main kinds of losses are employed to guide the training of PSMM-Net. The first corresponds to the losses of the three SD-Nets, i.e., the color, depth and ir modalities, denoted as $\mathcal{L}^{color}$, $\mathcal{L}^{depth}$ and $\mathcal{L}^{ir}$, respectively. The second corresponds to the loss that guides the training of the entire network, denoted as $\mathcal{L}^{whole}$, which is based on the summed features from all SD-Nets and the shared branch. The overall loss of PSMM-Net is denoted as:

$$\mathcal{L} = \mathcal{L}^{whole} + \mathcal{L}^{color} + \mathcal{L}^{depth} + \mathcal{L}^{ir} \qquad (5)$$

4 CASIA-CeFA dataset

This section describes the CASIA-CeFA dataset. The motivation for this dataset is to provide a greater diversity of attack types than existing datasets, and to explore the effect of cross-ethnicity in face anti-spoofing, which has received little attention in the literature. Furthermore, it contains three visual modalities, i.e., RGB, Depth, and IR. In summary, the main purpose of CASIA-CeFA is to provide the largest face anti-spoofing dataset to date, allowing the generalization performance of new PAD methods to be evaluated in three main aspects: cross-ethnicity, cross-modality and cross-attack. We describe the dataset in detail below, including acquisition details, attack types, and the proposed evaluation protocols.

Figure 4: Age and gender distributions of the CASIA-CeFA.

Prot. | Sub-protocols | Subset | Ethnicity | Subjects | Modalities | PAIs | # real videos | # fake videos | # all videos
1 | 1_1 1_2 1_3 | Train | A C E | 1-200 | R&D&I | Pr&Re | 600/600/600 | 1800/1800/1800 | 2400/2400/2400
1 | 1_1 1_2 1_3 | Valid | A C E | 201-300 | R&D&I | Pr&Re | 300/300/300 | 900/900/900 | 1200/1200/1200
1 | 1_1 1_2 1_3 | Test | C&E A&E A&C | 301-500 | R&D&I | Pr&Re | 1200/1200/1200 | 6600/6600/6600 | 7800/7800/7800
2 | 2_1 2_2 | Train | A&C&E | 1-200 | R&D&I | Pr Re | 1800/1800 | 3600/1800 | 5400/3600
2 | 2_1 2_2 | Valid | A&C&E | 201-300 | R&D&I | Pr Re | 900/900 | 1800/900 | 2700/1800
2 | 2_1 2_2 | Test | A&C&E | 301-500 | R&D&I | Re Pr | 1800/1800 | 4800/6600 | 6600/8400
3 | 3_1 3_2 3_3 | Train | A&C&E | 1-200 | R D I | Pr&Re | 600/600/600 | 1800/1800/1800 | 2400/2400/2400
3 | 3_1 3_2 3_3 | Valid | A&C&E | 201-300 | R D I | Pr&Re | 300/300/300 | 900/900/900 | 1200/1200/1200
3 | 3_1 3_2 3_3 | Test | A&C&E | 301-500 | D&I R&I R&D | Pr&Re | 1200/1200/1200 | 5600/5600/5600 | 6800/6800/6800
4 | 4_1 4_2 4_3 | Train | A C E | 1-200 | R D I | Re | 600/600/600 | 600/600/600 | 1200/1200/1200
4 | 4_1 4_2 4_3 | Valid | A C E | 201-300 | R D I | Re | 300/300/300 | 300/300/300 | 600/600/600
4 | 4_1 4_2 4_3 | Test | C&E A&E A&C | 301-500 | R D I | Pr | 1200/1200/1200 | 5400/5400/5400 | 6600/6600/6600
Table 2: Four evaluation protocols are defined for CASIA-CeFA: 1) cross-ethnicity, 2) cross-PAI, 3) cross-modality and 4) cross-ethnicity & PAI. Note that the 3D attack subset of CASIA-CeFA is included in the test set of every protocol (not shown in the table). R: RGB, D: Depth, I: IR, A: Africa, C: Central Asia, E: East Asia, Pr: print attack, Re: replay attack; & indicates merging; A_B is the name of sub-protocol B of Protocol A.

Acquisition Details. We use the Intel RealSense to capture the RGB, Depth and IR videos simultaneously at 30 fps. The resolution is 1280 × 720 pixels for each video frame and all modalities. Performers are asked to move their head smoothly so as to have a maximum of around deviation of head pose in relation to the frontal view. Data pre-processing is similar to the one performed in [41], except that PRNet [12] is replaced by 3DDFA [44] for face region detection. Examples of original recorded images from video sequences and processed face regions for different visual modalities are shown in Fig. 1.

Statistics. As shown in Table 1, CASIA-CeFA consists of 2D and 3D attack subsets. The 2D attack subset covers three ethnicities (African, East Asian and Central Asian) and two attack types (print attack, with the face printed on cloth, and video-replay attack). Each ethnicity has 500 subjects. Each subject has 1 real sample, 2 fake samples of print attack captured indoors and outdoors, and 1 fake sample of video-replay. In total, there are 18000 videos (6000 per modality). The age and gender statistics for the 2D attack subset of CASIA-CeFA are shown in Fig. 4.

The 3D attack subset contains 3D print mask and silica gel face attacks. The 3D print mask part has 99 subjects, each with 18 fake samples captured under three attack styles and six lighting environments. Attack styles are wearing only the mask, wearing a wig and glasses, and wearing a wig and no glasses. Lighting conditions include outdoor sunshine, outdoor shade, indoor side light, indoor front light, indoor backlight and indoor regular light. In total, there are 5346 videos (1782 per modality). The silica gel part has 8 subjects, each with 8 fake samples captured under two attack styles and four lighting environments. Attack styles are wearing a wig and glasses, and wearing a wig and no glasses. Lighting environments include indoor side light, indoor front light, indoor backlight and indoor normal light. In total, there are 192 videos (64 per modality).
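The video counts quoted above follow from simple multiplication, and the short check below reproduces them. It only uses figures stated in this section and in Table 1; the dictionary layout and names are illustrative.

```python
MODALITIES = 3  # RGB, Depth, IR

# subset name -> (number of subjects, videos per subject and modality)
subsets = {
    "2D print/replay": (3 * 500, 1 + 2 + 1),  # 1 real, 2 print, 1 replay
    "3D print mask": (99, 18),                # 3 attack styles x 6 lightings
    "silica gel mask": (8, 8),                # 2 attack styles x 4 lightings
}

videos = {
    name: subjects * per_subject * MODALITIES
    for name, (subjects, per_subject) in subsets.items()
}
total_subjects = sum(subjects for subjects, _ in subsets.values())
total_videos = sum(videos.values())
```

The per-subset results agree with Table 1 (18000, 5346 and 192 videos), as do the totals of 1607 subjects and 23538 videos.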

Evaluation Protocols. We design four protocols for the 2D attack subset, as shown in Table 2, totalling 11 sub-protocols (1_1, 1_2, 1_3, 2_1, 2_2, 3_1, 3_2, 3_3, 4_1, 4_2, and 4_3). We divide the subjects of each ethnicity into three subject-disjoint subsets (second and fourth columns in Table 2). Each protocol has three data subsets: training, validation and testing sets, which contain 200, 100, and 200 subjects, respectively.
Protocol 1 (cross-ethnicity): Most public face PAD datasets contain a single ethnicity. Even though a few datasets [20, 41] contain multiple ethnicities, they lack ethnicity labels or do not provide a protocol for cross-ethnicity evaluation. Therefore, we design the first protocol to evaluate the generalization of PAD methods in cross-ethnicity testing. One ethnicity is used for training and validation, and the remaining two ethnicities are used for testing. Therefore, there are three different evaluations (third column of Protocol 1 in Table 2).
Protocol 2 (cross-PAI): Given the diversity and unpredictability of attack types from different presentation attack instruments (PAI), it is necessary to evaluate the robustness of face PAD algorithms to this kind of variation (sixth column of Protocol 2 in Table 2).
Protocol 3 (cross-modality): Given the release of affordable devices capturing complementary visual modalities (e.g., Intel RealSense, Microsoft Kinect), a multi-modal face anti-spoofing dataset was recently proposed [41]. However, there is no standard protocol to explore the generalization of face PAD methods when different train-test modalities are considered for evaluation. We define three cross-modality evaluations, each of them having one modality for training and the two remaining ones for testing (fifth column of Protocol 3 in Table 2).
Protocol 4 (cross-ethnicity & PAI): The most challenging protocol combines the conditions of Protocols 1 and 2. As shown in Protocol 4 of Table 2, the testing subset introduces two unknown target variations simultaneously.
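As a small illustration of how the cross-ethnicity sub-protocols of Protocol 1 are formed, the sketch below enumerates the leave-one-in splits: train and validate on one ethnicity, test on the other two. The function name and tuple layout are hypothetical; the ethnicity codes and sub-protocol names follow Table 2.

```python
ETHNICITIES = ("A", "C", "E")  # Africa, Central Asia, East Asia


def cross_ethnicity_splits(ethnicities=ETHNICITIES):
    """Yield (sub_protocol, train_valid_ethnicity, test_ethnicities)."""
    for index, train in enumerate(ethnicities, start=1):
        # Every ethnicity not used for training/validation is held out for testing.
        test = tuple(e for e in ethnicities if e != train)
        yield f"1_{index}", train, test


splits = list(cross_ethnicity_splits())
for name, train, test in splits:
    print(name, "train/valid:", train, "test:", "&".join(test))
```

Protocol 3 follows the same pattern with modalities (R, D, I) in place of ethnicities.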

Like [5], the mean and standard deviation of the evaluation metrics for these four protocols are calculated in our experiments. Detailed statistics for the different protocols are shown in Table 2. More information about CASIA-CeFA can be found in our supplementary material.

Figure 5: Comparison of network units for multi-modal fusion strategies. From left to right: NHF, PSMM-Net-WoBF and PSMM-Net. The fusion process for the feature level of each strategy is shown at the bottom.

5 Experiments

In this section, we conduct a series of experiments on publicly available face anti-spoofing datasets to verify the effectiveness of our methodology and the benefits of the presented CASIA-CeFA dataset. In the following, we introduce the employed datasets and metrics, implementation details, experimental settings, and results and analysis in turn.

5.1 Datasets & Metrics

We evaluate the performance of PSMM-Net on two multi-modal (i.e., RGB, Depth and IR) datasets, CASIA-CeFA and CASIA-SURF [41], and evaluate SD-Net on two single-modal (i.e., RGB) face anti-spoofing benchmarks, OULU-NPU [5] and SiW [20]. These are mainstream datasets released in recent years, each with its own characteristics in terms of the number of subjects, modalities, ethnicities, attack types, acquisition devices, PAIs, etc. Experiments on these datasets can therefore verify the performance of our method more convincingly.

In order to perform a consistent evaluation with prior works, we report the experimental results using the following metrics based on respective official protocols: Attack Presentation Classification Error Rate (APCER) [16], Bona Fide Presentation Classification Error Rate (BPCER), Average Classification Error Rate (ACER), and Receiver Operating Characteristic (ROC) curve [41].
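For readers less familiar with these metrics, the sketch below computes APCER, BPCER and ACER from binary decisions. It is a simplified illustration: all attack samples are pooled into one group, whereas official protocols may compute APCER per attack type and report the worst case. The label convention (1 for bona fide, 0 for attack) and the function name are assumptions.

```python
def pad_error_rates(labels, predictions):
    """Return (APCER, BPCER, ACER); 1 = bona fide (real), 0 = attack."""
    attacks = [p for y, p in zip(labels, predictions) if y == 0]
    bona_fide = [p for y, p in zip(labels, predictions) if y == 1]
    if not attacks or not bona_fide:
        raise ValueError("need at least one attack and one bona fide sample")
    apcer = sum(p == 1 for p in attacks) / len(attacks)      # attacks accepted as real
    bpcer = sum(p == 0 for p in bona_fide) / len(bona_fide)  # real faces rejected
    acer = (apcer + bpcer) / 2                               # average of the two
    return apcer, bpcer, acer


# Toy example: 4 real faces (one rejected) and 5 attacks (one accepted).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 1]
apcer, bpcer, acer = pad_error_rates(labels, predictions)
```

Since ACER is the plain average of the two error rates, a method cannot obtain a low ACER by trading one error type entirely for the other.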

5.2 Implementation Details

The proposed PSMM-Net is implemented with TensorFlow [1] and run on a single NVIDIA TITAN X GPU. We resize the cropped face region to , and use random rotation within the range of [, ], flipping, cropping and color distortion for data augmentation. All models are trained for epochs via Adaptive Moment Estimation (Adam) algorithm and initial learning rate of , which is decreased after and epochs with a factor of . The batch size of each CNN stream is , and the length of the consecutive frames used to construct dynamic map is set to by our experimental experience. In addition, all fusion points in this work use element summation operations to prevent dimension explosion.

Prot. | name | APCER(%) | BPCER(%) | ACER(%)
Prot. 1 | 1_1 | 0.5 | 0.8 | 0.6
Prot. 1 | 1_2 | 4.8 | 4.0 | 4.4
Prot. 1 | 1_3 | 1.2 | 1.8 | 1.5
Prot. 1 | Avg±Std | 2.2±2.3 | 2.2±1.6 | 2.2±2.0
Prot. 2 | 2_1 | 0.1 | 0.7 | 0.4
Prot. 2 | 2_2 | 13.8 | 1.2 | 7.5
Prot. 2 | Avg±Std | 7.0±9.7 | 1.0±0.4 | 4.0±5.0
Prot. 3 | 3_1 | 8.9 | 0.9 | 4.9
Prot. 3 | 3_2 | 22.6 | 4.6 | 13.6
Prot. 3 | 3_3 | 21.1 | 2.3 | 11.7
Prot. 3 | Avg±Std | 17.5±7.5 | 2.6±1.9 | 10.1±4.6
Prot. 4 | 4_1 | 33.3 | 15.8 | 24.5
Prot. 4 | 4_2 | 78.2 | 8.3 | 43.2
Prot. 4 | 4_3 | 50.0 | 5.5 | 27.7
Prot. 4 | Avg±Std | 53.8±22.7 | 9.9±5.3 | 31.8±10.0
Table 3: PSMM-Net evaluation on the four protocols of the CASIA-CeFA dataset, where A_B represents sub-protocol B of Protocol A, and Avg±Std indicates the mean and standard deviation over sub-protocols.

Prot. 1 | RGB APCER(%) | RGB BPCER(%) | RGB ACER(%) | Depth APCER(%) | Depth BPCER(%) | Depth ACER(%) | IR APCER(%) | IR BPCER(%) | IR ACER(%)
S-Net | 28.1±3.6 | 6.4±4.6 | 17.2±3.6 | 5.6±3.0 | 9.8±4.2 | 7.7±3.5 | 11.4±2.1 | 8.2±1.2 | 9.8±1.7
D-Net | 20.6±4.0 | 19.3±9.0 | 19.9±4.0 | 11.2±5.1 | 7.5±1.5 | 9.4±2.0 | 8.1±1.8 | 14.4±3.8 | 11.3±2.1
SD-Net | 14.9±6.0 | 10.3±1.8 | 12.6±3.4 | 7.0±8.1 | 5.2±3.5 | 6.1±5.4 | 7.3±1.2 | 5.5±1.8 | 6.4±1.3
Table 4: Ablation experiments on three single-modal groups: RGB, Depth and IR. Each modality group contains three experiments: static branch, dynamic branch and static-dynamic branch. Numbers in bold correspond to the best results per column.

Prot. name | RGB APCER(%) | RGB BPCER(%) | RGB ACER(%) | Depth APCER(%) | Depth BPCER(%) | Depth ACER(%) | IR APCER(%) | IR BPCER(%) | IR ACER(%)
Prot. 1 | 14.9±6.0 | 10.3±1.8 | 12.6±3.4 | 7.0±8.1 | 5.2±3.5 | 6.1±5.4 | 7.3±1.2 | 5.5±1.8 | 6.4±1.3
Prot. 2 | 45.0±39.1 | 1.6±1.9 | 23.3±18.6 | 13.6±18.7 | 1.2±0.7 | 7.4±9.7 | 8.1±11.0 | 1.5±1.8 | 4.8±6.4
Prot. 3 | 5.9 | 2.2 | 4.0 | 0.3 | 0.3 | 0.3 | 0.2 | 0.5 | 0.4
Prot. 4 | 65.8±16.4 | 8.3±6.5 | 35.2±5.8 | 18.5±8.2 | 7.0±5.2 | 12.7±5.7 | 6.8±2.9 | 4.2±3.3 | 5.5±2.2
Table 5: Experimental results of the SD-Net based on single modality on four protocols ( indicates that the modal type of the testing subset is consistent with the training subset).

5.3 Baseline Model Evaluation

Before exploring the traits of our dataset, we first provide a benchmark for CASIA-CeFA based on the proposed method. From Table 3, in which the results of the four protocols are derived from the respective sub-protocols by calculating the mean and standard deviation, we can draw the following conclusions: (1) In Protocol 1, the ACER scores of the three sub-protocols are 0.6%, 4.4% and 1.5%, respectively, indicating that it is necessary to study the generalization of face PAD methods across ethnicities. (2) In Protocol 2, when print attacks are used for training/validation and video-replay and 3D mask attacks are used for testing, the ACER score is 0.4% (sub-protocol 2_1), whereas when video-replay attacks are used for training/validation and print and 3D attacks are used for testing, the ACER score is 7.5% (sub-protocol 2_2). The large gap between the two sub-protocols arises mainly because different PAIs (i.e., different displays and printers) create different artifacts. (3) Protocol 3 evaluates cross-modality. The best result is achieved for sub-protocol 3_1, with an ACER of 4.9%. The other two sub-protocols achieve similar, clearly worse scores (13.6% and 11.7%). This means the best performance is achieved when RGB data of the 2D attack subset is used for training/validation while the other two modalities of the 2D and 3D attack subsets are used for testing. (4) Protocol 4 is the most difficult evaluation scenario, which simultaneously considers cross-ethnicity and cross-PAI. All sub-protocols achieve poor performance, with ACER scores of 24.5%, 43.2% and 27.7% for 4_1, 4_2 and 4_3, respectively.
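The per-protocol summary rows can be recomputed from the sub-protocol scores. The sketch below does so for the ACER column of Table 3 using Python's standard library; the use of the sample standard deviation (n - 1 denominator) is an assumption that happens to match the reported figures.

```python
from statistics import mean, stdev

# ACER (%) per sub-protocol, copied from Table 3.
acer = {
    "Prot. 1": [0.6, 4.4, 1.5],
    "Prot. 2": [0.4, 7.5],
    "Prot. 3": [4.9, 13.6, 11.7],
    "Prot. 4": [24.5, 43.2, 27.7],
}

# Mean and sample standard deviation over sub-protocols, to one decimal.
summary = {
    name: (round(mean(scores), 1), round(stdev(scores), 1))
    for name, scores in acer.items()
}

for name, (avg, std) in summary.items():
    print(f"{name}: {avg} +/- {std}")
```

With only two or three sub-protocols per protocol, the spread is dominated by a single hard case (for example 4_2), so the mean should be read together with the individual scores.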

5.4 Ablation Analysis

In order to verify the effectiveness of the proposed method, we perform a series of ablation experiments on Protocol 1 (cross-ethnicity) of the CASIA-CeFA dataset.

Static and Dynamic Features. We evaluate S-Net (static branch of SD-Net), D-Net (dynamic branch of SD-Net) and SD-Net. Results for the RGB, Depth and IR modalities are shown in Table 4. Compared with S-Net and D-Net, SD-Net achieves superior performance. For the RGB, Depth and IR modalities, the ACER of SD-Net is 12.6%, 6.1% and 6.4%, versus 17.2%, 7.7% and 9.8% for S-Net (improved by 4.6%, 1.6% and 3.4%) and 19.9%, 9.4% and 11.3% for D-Net (improved by 7.3%, 3.3% and 4.9%), respectively. Furthermore, Table 4 shows that the performance of the Depth and IR modalities is superior to that of RGB. One reason is the variability in lighting conditions included in CASIA-CeFA.

In addition, we provide the results of single-modal experiments on the four protocols to facilitate comparison of face PAD algorithms, shown in Table 5. It shows that when only a single modality is used, the performance of the Depth or IR modality is superior to that of the RGB modality.

Multiple Modalities. To show the effect of the number of modalities, we evaluate one modality (RGB), two modalities (RGB and Depth), and three modalities (RGB, Depth and IR) on PSMM-Net. As shown in Fig. 3, the PSMM-Net contains three SD-Nets and one shared branch. When only the RGB modality is considered, we use a single SD-Net for evaluation. When two or three modalities are considered, we use two or three SD-Nets and one shared branch to train the PSMM-Net model, respectively. Results are shown in Table 6. The best results are obtained when using all three modalities: 2.2% APCER, 2.2% BPCER and 2.2% ACER. Compared with using the single RGB modality and two modalities, the improvements are 12.7% and 0.1% for APCER, 8.1% and 7.0% for BPCER, and 10.4% and 3.5% for ACER, respectively.

Prot. 1 (PSMM-Net) | APCER(%) | BPCER(%) | ACER(%)
RGB | 14.9±6.0 | 10.3±1.8 | 12.6±3.4
RGB&Depth | 2.3±2.9 | 9.2±5.9 | 5.7±3.5
RGB&Depth&IR | 2.2±2.3 | 2.2±1.6 | 2.2±2.0
Table 6: Ablation experiments on the effect of multiple modalities. Numbers in bold correspond to the best result per column.

Method | APCER(%) | BPCER(%) | ACER(%)
NHF | 25.3±12.2 | 4.4±3.1 | 14.8±6.8
PSMM-WoBF | 12.7±0.4 | 3.2±2.3 | 7.9±1.3
PSMM-Net | 2.2±2.3 | 2.2±1.6 | 2.2±2.0
Table 7: Comparison of fusion strategies in Protocol 1 of CASIA-CeFA. Numbers in bold indicate the best results.

Method | TPR (%) | TPR (%) | TPR (%) | APCER (%) | BPCER (%) | ACER (%)
NHF fusion [41] | 89.1 | 33.6 | 17.8 | 5.6 | 3.8 | 4.7
Single-scale SE fusion [41] | 96.7 | 81.8 | 56.8 | 3.8 | 1.0 | 2.4
Multi-scale SE fusion [40] | 99.8 | 98.4 | 95.2 | 1.6 | 0.08 | 0.8
PSMM-Net | 99.9 | 99.3 | 96.2 | 0.7 | 0.06 | 0.4
PSMM-Net(CASIA-CeFA) | 99.9 | 99.7 | 97.6 | 0.5 | 0.02 | 0.2
Table 8: Comparison of the proposed method with three fusion strategies. All models are trained on the CASIA-SURF training subset and tested on the testing subset. Best results are bolded.

Fusion Strategy. To evaluate the performance of PSMM-Net, we compare it with two variants: naive halfway fusion (NHF) and PSMM-Net without the backward feeding mechanism (PSMM-Net-WoBF). As shown in Fig. 5, NHF combines the modules of different modalities at a later stage (i.e., after module), and PSMM-Net-WoBF removes the backward feeding from PSMM-Net. The comparison results in Table 7 show that the proposed PSMM-Net performs best, with an ACER of 2.2±2.0%.

5.5 Methods Comparison

CASIA-SURF Dataset. Comparison results are shown in Table 8. PSMM-Net outperforms the competing multi-modal fusion methods, including halfway fusion [41], single-scale SE fusion [41], and multi-scale SE fusion [40]. Compared with [41, 40], PSMM-Net reduces the error by at least 0.9% for APCER, 0.02% for BPCER, and 0.4% for ACER. Pretraining PSMM-Net on CASIA-CeFA improves performance further: the third TPR value in Table 8 rises from 96.2% to 97.6%, and ACER drops from 0.4% to 0.2%.
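Assuming the TPR values in Table 8 are measured at fixed false positive rates, as is common for CASIA-SURF evaluations, the sketch below shows one way to compute such a value from liveness scores. The function name `tpr_at_fpr`, the toy scores and the "accept when score > threshold" rule are assumptions made for illustration; the operating points used in the paper are not restated here.

```python
def tpr_at_fpr(real_scores, attack_scores, target_fpr):
    """TPR at the lowest threshold whose FPR does not exceed target_fpr.

    Scores are liveness scores: higher means more likely bona fide.
    A sample is accepted as real when its score is strictly above the threshold.
    """
    if not real_scores or not attack_scores:
        raise ValueError("need both real and attack scores")
    if target_fpr >= 1.0:
        return 1.0  # accepting every sample already satisfies the constraint
    # FPR only changes at attack scores, so those are the candidate thresholds.
    for thr in sorted(set(attack_scores)):
        fpr = sum(s > thr for s in attack_scores) / len(attack_scores)
        if fpr <= target_fpr:
            # Lowest admissible threshold gives the highest TPR.
            return sum(s > thr for s in real_scores) / len(real_scores)
    return 0.0  # only reached if target_fpr is negative


# Toy scores (hypothetical).
real = [0.9, 0.8, 0.7, 0.4]
attack = [0.1, 0.2, 0.3, 0.6]
print(tpr_at_fpr(real, attack, 0.25))
print(tpr_at_fpr(real, attack, 0.0))
```

Tightening the allowed FPR forces a higher threshold, which is why the TPR values in each row of Table 8 decrease from left to right.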

In 2019, a challenge on the CASIA-SURF dataset was held at CVPR. The results of the challenge were very promising, with the winning teams VisionLab [25], ReadSense [29] and Feather [39] reporting very high TPR scores. The main reasons for this high performance are: 1) Several external datasets were used. VisionLab [25] used four large-scale datasets, namely CASIA-WebFace [38], MS-Celeb-1M [14], AFAD-lite [23] and an Asian dataset [43], for pretraining, while Feather [39] used a large private dataset with a collection protocol similar to CASIA-SURF. 2) Many network ensembles were used. VisionLab [25], ReadSense [29], and Feather [39] each average the outputs of multiple networks to compute their final results. Thus, for a fair comparison, we omit VisionLab [25], ReadSense [29] and Feather [39] from Table 8.
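The output averaging mentioned above can be illustrated with a minimal sketch. It shows only plain per-sample score averaging; the challenge teams' actual ensembling code is not described in this paper, and `ensemble_average` and the toy scores are hypothetical.

```python
def ensemble_average(score_lists):
    """Average per-sample scores across models.

    score_lists: one list of scores per model, all covering the same samples
    in the same order.
    """
    if not score_lists:
        raise ValueError("need at least one model")
    n = len(score_lists[0])
    if any(len(scores) != n for scores in score_lists):
        raise ValueError("all models must score the same samples")
    # zip(*...) groups the scores that different models gave to one sample.
    return [sum(col) / len(score_lists) for col in zip(*score_lists)]


# Two hypothetical models scoring the same two samples.
print(ensemble_average([[0.5, 1.0], [1.0, 0.0]]))
```

Because an ensemble of this kind combines several trained networks, its scores are not directly comparable with those of a single model, which is the motivation for leaving the challenge entries out of Table 8.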

SiW Dataset. Results for this dataset are shown in Table 9. We first compare the proposed SD-Net with other methods without pretraining. Taking Protocol 1 of SiW as an example, SD-Net achieves the best ACER of 0.74%, an improvement of 0.26% over the second-best score of 1.00% (ACER) from STASN [37]. With CASIA-CeFA pretraining, our method is competitive with STASN (Data) [37] (0.35% versus 0.30% in terms of ACER), which was pretrained on a large private dataset. For Protocols 2 and 3 of SiW, our method achieves the best performance under the three evaluation metrics.

Prot.   Method                  APCER (%)    BPCER (%)    ACER (%)      Pretrain
1       FAS-BAS [20]            3.58         3.58         3.58          No
1       FAS-TD-SF [34]          1.27         0.83         1.05          No
1       STASN [37]              -            -            1.00          No
1       SD-Net                  0.14         1.34         0.74          No
1                               1.27         0.33         0.80          Yes
1       STASN (Data) [37]       -            -            0.30          Yes
1       SD-Net (CASIA-CeFA)     0.21         0.50         0.35          Yes
2       FAS-BAS [20]            0.57±0.69    0.57±0.69    0.57±0.69     No
2       FAS-TD-SF [34]          0.33±0.27    0.29±0.39    0.31±0.28     No
2       STASN [37]              -            -            0.28±0.05     No
2       SD-Net                  0.25±0.32    0.29±0.34    0.27±0.28     No
2                               0.08±0.17    0.25±0.22    0.17±0.16     Yes
2       STASN (Data) [37]       -            -            0.15±0.05     Yes
2       SD-Net (CASIA-CeFA)     0.09±0.17    0.21±0.25    0.15±0.11     Yes
3       FAS-BAS [20]            8.31±3.81    8.31±3.81    8.31±3.81     No
3       FAS-TD-SF [34]          7.70±3.88    7.76±4.09    7.73±3.99     No
3       STASN [37]              -            -            12.10±1.50    No
3       SD-Net                  3.74±2.15    7.85±1.42    5.80±0.36     No
3                               6.27±4.36    6.43±4.42    6.35±4.39     Yes
3       STASN (Data) [37]       -            -            5.85±0.85     Yes
3       SD-Net (CASIA-CeFA)     2.70±1.56    7.10±1.56    4.90±0.00     Yes
Table 9: Comparisons on SiW. '-' indicates a value that was not provided; '()' indicates that the method uses a model pretrained on the named dataset. Best results are bolded separately for the settings with and without pretraining.

Prot.   Method                  APCER (%)    BPCER (%)    ACER (%)    Pretrain
3       FAS-BAS [20]            2.7±1.3      3.1±1.7      2.9±1.5     No
3       FAS-Ds [17]             4.0±1.8      3.8±1.2      3.6±1.6     No
3       STASN [37]              4.7±3.9      0.9±1.2      2.8±1.6     No
3       SD-Net                  2.7±2.5      1.4±2.0      2.1±1.4     No
3       STASN (Data) [37]       1.4±1.4      3.6±4.6      2.5±2.2     Yes
3       SD-Net (CASIA-CeFA)     2.7±2.5      0.9±0.9      1.8±1.4     Yes
4       FAS-BAS [20]            9.3±5.6      10.4±6.0     9.5±6.0     No
4       FAS-Ds [17]             5.1±6.3      6.1±5.1      5.6±5.7     No
4       STASN [37]              6.7±10.6     8.3±8.4      7.5±4.7     No
4       SD-Net                  4.6±5.1      6.3±6.3      5.4±2.8     No
4       STASN (Data) [37]       0.9±1.8      4.2±5.3      2.6±2.8     Yes
4       SD-Net (CASIA-CeFA)     5.0±4.7      4.6±4.6      4.8±2.7     Yes
Table 10: Results of Protocols 3 and 4 on OULU-NPU. '()' indicates that the method uses a model pretrained on the named dataset. Best results are bolded separately for the settings with and without pretraining.

OULU-NPU Dataset. We evaluate on the two most challenging protocols of OULU-NPU. Protocol 3 studies generalization across different acquisition devices, and Protocol 4 considers all the conditions of the previous three protocols simultaneously. The experimental results are shown in Table 10. Without pretraining, SD-Net obtains the best ACER in both protocols: 2.1±1.4% versus 2.8±1.6% for STASN [37] on Protocol 3, and 5.4±2.8% versus 7.5±4.7% on Protocol 4. With pretraining, our method ranks first on Protocol 3 and second on Protocol 4. Based on the above experiments, without pretraining the proposed method achieves state-of-the-art ACER in all evaluated protocols on SiW and OULU-NPU. With pretraining on CASIA-CeFA, it also obtains the best ACER on most protocols. These results demonstrate the effectiveness of the proposed method and of the collected CASIA-CeFA dataset.
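The "mean±std" entries in Tables 9 and 10 summarize results over several runs of a protocol, such as the leave-one-camera-out folds of OULU-NPU Protocol 3. The following minimal sketch shows how such an entry can be formed. It assumes the population standard deviation, since the estimator is not stated in the text, and the per-fold values and the helper name `summarize` are hypothetical.

```python
import statistics


def summarize(values, digits=1):
    """Format per-fold percentages as 'mean±std'.

    Assumption: population standard deviation (statistics.pstdev).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return f"{mean:.{digits}f}±{std:.{digits}f}"


# Hypothetical per-fold ACER values in percent.
fold_acer = [1.0, 2.0, 3.0, 4.0]
print(summarize(fold_acer))
print(summarize(fold_acer, digits=2))
```

A large standard deviation relative to the mean, as in several Protocol 4 entries of Table 10, indicates that performance varies strongly between folds, so small differences in mean ACER should be read with care.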

6 Conclusion

In this paper, we have presented the CASIA-CeFA dataset for face anti-spoofing. It is the largest publicly available dataset in terms of modalities, subjects, ethnicities and attacks. Moreover, we have proposed a static-dynamic network (SD-Net) to learn both static and dynamic features from a single modality. We have also proposed a partially shared multi-modal network (PSMM-Net) to learn complementary information from multi-modal data in videos. Extensive experiments on four popular datasets show the high generalization capability of the proposed SD-Net and PSMM-Net, as well as the utility and challenges of the released CASIA-CeFA dataset.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, and M. Isard. TensorFlow: a system for large-scale machine learning. Cited by: §5.2.
  • [2] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu (2017) Face anti-spoofing using patch and depth-based cnns. In IJCB, pp. 319–328. Cited by: §2.1.
  • [3] Z. Boulkenafet, J. Komulainen, and A. Hadid (2016) Face spoofing detection using colour texture analysis. TIFS. Cited by: §1, §2.1.
  • [4] Z. Boulkenafet, J. Komulainen, and A. Hadid (2017) Face antispoofing using speeded-up robust features and fisher vector encoding. SPL. Cited by: §1.
  • [5] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid (2017) OULU-NPU: a mobile face presentation attack database with real-world variations. In FG, Cited by: Table 1, §1, §2.2, §4, §5.1.
  • [6] I. Chingovska, A. Anjos, and S. Marcel (2012) On the effectiveness of local binary patterns in face anti-spoofing. In Biometrics Special Interest Group, Cited by: Table 1, §1, §2.2.
  • [7] I. Chingovska, N. Erdogmus, A. Anjos, and S. Marcel (2016) Face recognition systems under spoofing attacks. In Face Recognition Across the Imaging Spectrum, Cited by: Table 1.
  • [8] A. Costa-Pazo, S. Bhattacharjee, E. Vazquez-Fernandez, and S. Marcel (2016) The replay-mobile face presentation-attack database. In BIOSIG, Cited by: Table 1.
  • [9] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel (2013) Can face anti-spoofing countermeasures work in a real world scenario?. In ICB, Cited by: §1, §2.1.
  • [10] N. Erdogmus and S. Marcel (2014) Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect. In BTAS, Cited by: Table 1.
  • [11] L. Feng, L. Po, Y. Li, X. Xu, F. Yuan, T. C. Cheung, and K. Cheung (2016) Integration of image quality and motion cues for face anti-spoofing: a neural network approach. JVCIR. Cited by: §1, §2.1, §2.1.
  • [12] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, Cited by: §4.
  • [13] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars (2017) Rank pooling for action recognition. TPAMI 39 (4), pp. 773–787. Cited by: §1, §3.1, §3.1.
  • [14] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In ECCV, pp. 87–102. Cited by: §5.5.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
  • [16] (2016) ISO/IEC JTC 1/SC 37 Biometrics. Information technology: biometric presentation attack detection, part 1: framework. International Organization for Standardization. Cited by: §5.1.
  • [17] A. Jourabloo, Y. Liu, and X. Liu (2018) Face de-spoofing: anti-spoofing via noise modeling. arXiv. Cited by: Table 10.
  • [18] J. Komulainen, A. Hadid, and M. Pietikainen (2013) Context based face anti-spoofing. In BTAS, Cited by: §1.
  • [19] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid (2016) An original face anti-spoofing approach using partial convolutional neural network. In IPTA, Cited by: §1, §2.1.
  • [20] Y. Liu, A. Jourabloo, and X. Liu (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In CVPR, Cited by: Table 1, §1, §1, §2.1, §2.1, §2.2, §4, §5.1, Table 10, Table 9.
  • [21] Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu (2019) Deep tree learning for zero-shot face anti-spoofing. In CVPR, pp. 4680–4689. Cited by: §2.1.
  • [22] J. Määttä, A. Hadid, and M. Pietikäinen (2011) Face spoofing detection from single images using micro-texture analysis. In IJCB, pp. 1–7. Cited by: §1, §2.1.
  • [23] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua (2016) Ordinal regression with multiple output cnn for age estimation. In CVPR, pp. 4920–4928. Cited by: §5.5.
  • [24] G. Pan, L. Sun, Z. Wu, and S. Lao (2007) Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, Cited by: §1, §2.1.
  • [25] A. Parkin and O. Grinchuk (2019) Recognizing multi-modal face spoofing with face recognition networks. In PRCVW, Cited by: §1, §2.1, §5.5.
  • [26] K. Patel, H. Han, and A. K. Jain (2016) Cross-database face antispoofing with robust feature representation. In CCBR, Cited by: §1, §2.1, §2.1.
  • [27] K. Patel, H. Han, and A. K. Jain (2016) Secure face unlock: spoof detection on smartphones. TIFS. Cited by: §1, §2.1.
  • [28] R. Shao, X. Lan, J. Li, and P. C. Yuen (2019) Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In CVPR, pp. 10023–10031. Cited by: §1.
  • [29] T. Shen, Y. Huang, and Z. Tong (2019) FaceBagNet: bag-of-local-features model for multi-modal face anti-spoofing. In PRCVW, Cited by: §2.1, §5.5.
  • [30] A. J. Smola and B. Schölkopf (2004) A tutorial on support vector regression. Statistics and computing 14 (3), pp. 199–222. Cited by: §3.1.
  • [31] Z. Tan, Y. Yang, J. Wan, G. Guo, and S. Li (2019) Deeply-learned hybrid representations for facial age estimation. pp. 3548–3554. Cited by: §3.1.
  • [32] J. Wang, A. Cherian, and F. Porikli (2017) Ordered pooling of optical flow sequences for action recognition. In WACV, pp. 168–176. Cited by: §3.1.
  • [33] P. Wang, W. Li, J. Wan, P. Ogunbona, and X. Liu (2018) Cooperative training of deep aggregation networks for rgb-d action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.1.
  • [34] Z. Wang, C. Zhao, Y. Qin, Q. Zhou, and Z. Lei (2018) Exploiting temporal and depth information for multi-frame face anti-spoofing. arXiv. Cited by: Table 9.
  • [35] D. Wen, H. Han, and A. K. Jain (2015) Face spoof detection with image distortion analysis. TIFS. Cited by: Table 1.
  • [36] J. Yang, Z. Lei, S. Liao, and S. Z. Li (2013) Face liveness detection with component dependent descriptor.. In ICB, Cited by: §1, §2.1.
  • [37] X. Yang, W. Luo, L. Bao, Y. Gao, D. Gong, S. Zheng, Z. Li, and W. Liu (2019) Face anti-spoofing: model matters, so does data. In CVPR, pp. 3507–3516. Cited by: §2.1, §5.5, §5.5, Table 10, Table 9.
  • [38] D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Learning face representation from scratch. arXiv. Cited by: §5.5.
  • [39] P. Zhang, F. Zou, Z. Wu, N. Dai, S. Mark, M. Fu, J. Zhao, and K. Li (2019) FeatherNets: convolutional neural networks as light as feather for face anti-spoofing. arXiv:1904.09290. Cited by: §5.5.
  • [40] S. Zhang, A. Liu, J. Wan, Y. Liang, G. Guo, S. Escalera, H. J. Escalante, and S. Z. Li (2019) CASIA-SURF: a large-scale multi-modal benchmark for face anti-spoofing. arXiv:1908.10654. Cited by: §5.5, Table 8.
  • [41] S. Zhang, X. Wang, A. Liu, C. Zhao, J. Wan, S. Escalera, H. Shi, Z. Wang, and S. Z. Li (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In CVPR, Cited by: Table 1, §1, §1, §2.1, §2.2, §4, §4, §5.1, §5.1, §5.5, Table 8, Table 9.
  • [42] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li (2012) A face antispoofing database with diverse attacks. In ICB, Cited by: Table 1, §1, §2.2.
  • [43] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, et al. (2018) Towards pose invariant face recognition in the wild. In CVPR, pp. 2207–2216. Cited by: §5.5.
  • [44] X. Zhu, X. Liu, Z. Lei, and S. Z. Li (2017) Face alignment in full pose range: a 3d total solution. TPAMI 41 (1), pp. 78–92. Cited by: §4.