
Deep Tree Learning for Zero-shot Face Anti-Spoofing

Face anti-spoofing is designed to keep face recognition systems from recognizing fake faces as genuine users. While advanced face anti-spoofing methods are being developed, new types of spoof attacks are also being created and becoming a threat to all existing systems. We define the detection of unknown spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Previous ZSFA works study only 1-2 types of spoof attacks, such as print/replay attacks, which limits the insight into this problem. In this work, we expand the ZSFA problem to a wide range of 13 types of spoof attacks, including print attacks, replay attacks, 3D mask attacks, and so on. A novel Deep Tree Network (DTN) is proposed to tackle ZSFA. The tree is learned to partition the spoof samples into semantic sub-groups in an unsupervised fashion. When a data sample arrives, whether it is a known or an unknown attack, DTN routes it to the most similar spoof cluster and makes the binary decision. In addition, to enable the study of ZSFA, we introduce the first face anti-spoofing database that contains diverse types of spoof attacks. Experiments show that our proposed method achieves the state of the art on multiple testing protocols of ZSFA.



1 Introduction

Face is one of the most popular biometric modalities due to its convenience of usage, e.g., access control, phone unlock. Despite the high recognition accuracy, face recognition systems are not able to distinguish between real human faces and fake ones, e.g., photograph, screen. Thus, they are vulnerable to face spoof attacks, which deceive the systems into recognizing the attacker as another person. To safely use face recognition, face anti-spoofing techniques are required to detect spoof attacks before performing recognition.

Attackers can utilize a wide variety of mediums to launch spoof attacks. The most common ones are replaying videos/images on digital screens, i.e., replay attack, and printed photographs, i.e., print attack. Different methods are proposed to handle replay and print attacks, based on either handcrafted features [35, 38, 7] or CNN-based features [4, 20, 32, 18]. Recently, high-quality custom 3D masks are also used for attacking, i.e., 3D mask attack. In [30, 31, 29], methods for detecting print/replay attacks are found to be less effective for this new spoof, and hence the authors leverage remote photoplethysmography (r-PPG) to detect the heart rate pulse as the spoofing cue. Further, facial makeup may also influence the outcome of recognition, i.e., makeup attack [12]. Many works [11, 12, 13] study facial makeup, although not as an anti-spoofing problem.

Figure 1: To detect unknown spoof attacks, we propose a Deep Tree Network (DTN) that learns, in an unsupervised manner, a hierarchical embedding for known spoof attacks. Samples of unknown attacks are routed through DTN and classified at the destination leaf node.

All aforementioned methods present algorithmic solutions to known spoof attack(s), where models are trained and tested on the same type(s) of spoof attacks. However, in real-world applications, attackers can also initiate spoof attacks that we, the algorithm designers, are not aware of, termed unknown spoof attacks¹. Researchers increasingly pay attention to the generalization of anti-spoofing models, i.e., how well are they able to detect spoof attacks that have never been seen during training? We define the problem of detecting unknown face spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Despite the success of face anti-spoofing on known attacks, ZSFA is a new and unsolved challenge to the community.

¹ There is a subtle distinction between 1) unseen attacks, attack types that are known to algorithm designers so that algorithms could be tailored to them, but whose data are unseen during training; and 2) unknown attacks, attack types that are neither known to designers nor seen during training. We do not differentiate these two cases and term both unknown attacks.

The first attempts on ZSFA are [45, 3]. They address ZSFA between print and replay attacks, and regard it as an outlier detection problem for live faces (a.k.a. real human faces). With handcrafted features, the live faces are modeled via standard generative models, e.g., GMM, auto-encoder. During testing, an unknown attack is detected if it lies outside the estimated live distribution. These ZSFA works have three drawbacks:

Lacking spoof type variety: Prior models are developed w.r.t. print and replay attacks only. The respective feature design may not be applicable to different unknown attacks.

No spoof knowledge: Prior models only use live faces, without leveraging the available known spoof data. While the unknown attacks are different, the known spoof attacks may still provide valuable information to learn the model.

Limitation of feature selection: They use handcrafted features such as LBP to represent live faces, which were shown to be less effective for known spoof detection [32, 27, 37, 48]. Recent deep learning models [32, 20] show the advantage of CNN models for face anti-spoofing.

This work aims to address all three drawbacks. Since a ZSFA model may perform differently when the unknown spoof attack is different, it should be evaluated on a wide range of unknown attack types. In this work, we substantially expand the study of ZSFA from the 1-2 types of spoof attacks considered previously to 13 types. Besides print and replay attacks, we include 5 types of 3D mask attacks, 3 types of makeup attacks, and 3 partial attacks. These attacks cover both impersonation spoofing, i.e., an attempt to be authenticated as someone else, and obfuscation spoofing, i.e., an attempt to cover the attacker's own identity. We collect the first face anti-spoofing database that includes these diverse spoof attacks, termed Spoof in the Wild database with Multiple Attack Types (SiW-M).

To tackle the broader ZSFA, we propose a Deep Tree Network (DTN). Assuming there are both homogeneous features among different spoof types and distinct features within each spoof type, a tree-like model is well-suited to handle this case: learning the homogeneous features in the early tree nodes and distinct features in later tree nodes. Without any auxiliary labels of spoof types, DTN learns to partition data in an unsupervised manner. At each tree node, the partition is performed along the direction of the largest data variation. In the end, it clusters the data into several sub-groups at the leaf level, and learns to detect spoof attacks for each sub-group independently, shown in Fig. 1. During the testing, a data sample is routed to the most similar leaf node to produce a binary decision of live vs. spoof.

In summary, our contributions in this work include:

Conduct an extensive study of zero-shot face anti-spoofing on 13 different types of spoof attacks;

Propose a Deep Tree Network (DTN) to learn features hierarchically and detect unknown spoof attacks;

Collect a new database for ZSFA and achieve the state-of-the-art performance on multiple testing protocols.

2 Prior Work

Dataset | Year | Num. of subj./vid. | Face variations (pose, expression, lighting) | Spoof attack types (replay, print, 3D mask, makeup, partial) | Total num. of spoof types
CASIA-FASD [50] / Frontal No No
Replay-Attack [15] / Frontal No Yes
HKBU-MARs [30] / Frontal No Yes
Oulu-NPU [9] / Frontal No No
SiW [32] / Yes Yes
SiW-M / Yes Yes
Table 1: Comparing our SiW-M with existing face anti-spoofing datasets.

Face Anti-spoofing Image-based face anti-spoofing refers to face anti-spoofing techniques that only take RGB images as input without extra information such as depth or heat. In early years, researchers utilized liveness cues, such as eye blinking and head motion, to detect print attacks [36, 37, 39, 24]. However, when encountering unknown attacks, such as a photograph with the eye portion cut out, or video replay, those methods fail completely. Later, research moved to more general texture analysis, addressing print and replay attacks. Researchers mainly utilize handcrafted features, e.g., LBP [16, 17, 35, 7], HoG [25, 47], SIFT [38] and SURF [8], with traditional classifiers, e.g., SVM and LDA, to make a binary decision. Those methods perform well on testing data from the same database. However, when testing conditions such as lighting and background change, they often have a large performance drop, which can be viewed as an overfitting issue. Moreover, they also show limitations in handling 3D mask attacks, as mentioned in [30].

To overcome the overfitting issue, researchers make various attempts. Boulkenafet et al. extract the spoofing features in the HSV and YCbCr color spaces [7]. Works in [2, 5, 6, 18, 46] consider features in the temporal domain. Recent works [4, 2] augment the data by using image patches, and fuse the scores from patches into a single decision. For 3D mask attacks, the heart pulse rate is estimated to differentiate 3D masks from real faces [28, 30]. In the deep learning era, researchers propose several CNN works [4, 20, 32, 18, 27, 37, 48] that outperform the traditional methods.

Zero-shot learning and unknown spoof attacks Zero-shot object recognition, or more generally, zero-shot learning, aims to recognize objects from unknown classes [40], i.e., object classes unseen in training. The overall idea is to associate the known and unknown classes via a semantic embedding, whose embedding spaces can be attributes [26], word vector [19], text description [49] and human gaze [22].

Zero-shot learning for unknown spoof attacks, i.e., ZSFA, is a relatively new topic with unique properties. Firstly, unlike zero-shot object recognition, ZSFA emphasizes the detection of spoof attacks, instead of recognizing specific spoof types. Secondly, unlike generic objects with rich semantic embedding, there is no explicit well-defined semantic embedding for spoof patterns [20]. As elaborated in Sec. 1, prior ZSFA works [45, 3] only model the live data via handcrafted features and standard generative models, with several drawbacks. In this work, we propose a deep tree network that learns the semantic embedding for known spoof attacks in an unsupervised manner. The partition of the data naturally associates certain semantic attributes with the sub-groups. During testing, the unknown attacks are projected to the embedding to find the closest attributes for spoof detection.

Deep tree networks Tree structure is often found helpful in tackling language-related tasks such as parsing and translation [14], due to the intrinsic relation of words and sentences. E.g., tree models are applied to joint vision and language problems such as visual question reasoning [10]. Tree structure also has the property of learning features hierarchically. Face alignment works [23, 41] utilize regression trees to estimate facial landmarks from coarse to fine. Xiong et al. propose a tree CNN to handle large-pose face recognition [44]. In [21], Kaneko et al. propose a GAN with decision trees to learn hierarchically interpretable representations. In our work, we utilize tree networks to learn the latent semantic embedding for ZSFA.

Face anti-spoofing databases Given the significance of a good-quality database, researchers have released several face anti-spoofing databases, such as CASIA-FASD [50], Replay-Attack [15], OULU-NPU [9], and SiW [32] for print/replay attacks, and HKBU-MARs [30] for 3D mask attacks. Early databases such as CASIA-FASD [50] and Replay-Attack [15] have limited subject variety, pose/expression/lighting variations, and video resolutions. Recent databases [9, 30, 32] improve those aspects, and also set up diverse evaluation protocols. However, up to now, all databases focus on either print/replay attacks or 3D mask attacks. To provide a comprehensive study of face anti-spoofing, especially the challenging ZSFA, we collect, for the first time, a database with diverse types of spoof attacks, as in Tab. 1. The details of our database are in Sec. 4.

Figure 2: The proposed Deep Tree Network (DTN) architecture. (a) the overall structure of DTN. A tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), and a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. (b) the concept of Tree Routing Unit (TRU): finding the base with largest variations; (c) the structure of each Convolutional Residual Unit (CRU); (d) the structure of the Supervised Feature Learning (SFL) in the leaf nodes.

3 Deep Tree Network for ZSFA

The main purposes of DTN are twofold: discover the semantic sub-groups for known spoofs; learn the features in a hierarchical way. The architecture of DTN is shown in Fig. 2. Each tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), while the leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. CRU is a block with convolutional layers and the short-cut connection. TRU defines a node routing function to route a data sample to one of the child nodes. The routing function partitions all visiting data along the direction with the largest data variation. SFL module concatenates the classification supervision and the pixel-wise supervision to learn the spoofing features.

3.1 Unsupervised Tree Learning

3.1.1 Node Routing Function

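For a TRU node, let us assume the input $\mathbf{x} = f(\mathbf{I} \mid \theta)$ is the vectorized feature response, $\mathbf{I}$ is the data input, $\theta$ denotes the parameters of the previous CRUs, and $\mathcal{S}$ is the set of data samples $\mathbf{I}_k$ that visit this TRU node. In [44], Xiong et al. define a routing function as:

$$\varphi(\mathbf{x}) = \mathbf{x}^{T}\mathbf{v} + \tau, \tag{1}$$

where $\mathbf{v}$ denotes the projection vector and $\tau$ is the bias. Data $\mathcal{S}$ can then be split into $\mathcal{S}_{left} = \{\mathbf{I}_k \mid \varphi(\mathbf{x}_k) < 0\}$ and $\mathcal{S}_{right} = \{\mathbf{I}_k \mid \varphi(\mathbf{x}_k) \geq 0\}$, and directed to the left and right child node, respectively. To learn this function, they propose to maximize the distance between the mean responses of $\mathcal{S}_{left}$ and $\mathcal{S}_{right}$, while keeping the mean response of $\mathcal{S}$ centered at $0$. This unsupervised loss is formulated as:

$$\mathcal{L} = \frac{\left(\frac{1}{N}\sum_{\mathbf{I}_k\in\mathcal{S}}\varphi(\mathbf{x}_k)\right)^{2}}{\left(\frac{1}{N_l}\sum_{\mathbf{I}_k\in\mathcal{S}_{left}}\varphi(\mathbf{x}_k) - \frac{1}{N_r}\sum_{\mathbf{I}_k\in\mathcal{S}_{right}}\varphi(\mathbf{x}_k)\right)^{2}}, \tag{2}$$

where $N$, $N_l$, $N_r$ denote the number of samples in each set.

However, in practice, minimizing Equ. 2 might not lead to a satisfactory solution. Firstly, the loss can be minimized by increasing the norm of either $\mathbf{v}$ or $\mathbf{x}$, which is a trivial solution. Secondly, even when the norms of $\mathbf{v}$, $\mathbf{x}$ are constrained, Equ. 2 is affected by the density of the data and can be sensitive to outliers. In other words, the zero expectation of $\varphi(\mathbf{x})$ does not necessarily result in a balanced partition of data $\mathcal{S}$. A local minimum could be reached when all data are split to one side. In some cases, the tree may collapse to a few (even one) leaf nodes.

To better partition the data, we propose a novel routing function and an unsupervised loss. Regardless of $\tau$, the dot product between $\mathbf{x}^{T}$ and $\mathbf{v}$ can be regarded as projecting $\mathbf{x}$ onto the direction of $\mathbf{v}$. We design $\mathbf{v}$ such that we can observe the largest variation after projection. Inspired by the concept of PCA, the optimal solution naturally becomes the largest PCA basis of data $\mathcal{S}$. To achieve this, we first constrain $\mathbf{v}$ to have unit norm and reformulate Equ. 1 as:

$$\varphi(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^{T}\mathbf{v}, \quad \lVert\mathbf{v}\rVert = 1, \tag{3}$$

where $\boldsymbol{\mu}$ is the mean of data $\mathcal{S}$. Then, finding $\mathbf{v}$ is identical to finding the eigenvector with the largest eigenvalue of the covariance matrix $\bar{\mathbf{X}}_{\mathcal{S}}^{T}\bar{\mathbf{X}}_{\mathcal{S}}$, where $\bar{\mathbf{X}}_{\mathcal{S}} = \mathbf{X}_{\mathcal{S}} - \boldsymbol{\mu}$, and $\mathbf{X}_{\mathcal{S}}$ is the data matrix. Based on the definition of eigen-analysis, $\bar{\mathbf{X}}_{\mathcal{S}}^{T}\bar{\mathbf{X}}_{\mathcal{S}}\mathbf{v} = \lambda\mathbf{v}$, our optimization aims to maximize:

$$\arg\max_{\mathbf{v},\theta}\ \lambda = \arg\max_{\mathbf{v},\theta}\ \mathbf{v}^{T}\bar{\mathbf{X}}_{\mathcal{S}}^{T}\bar{\mathbf{X}}_{\mathcal{S}}\mathbf{v}. \tag{4}$$

The loss for learning the routing function is formulated as:

$$\mathcal{L}_{route} = \exp\left(-\alpha\,\mathbf{v}^{T}\bar{\mathbf{X}}_{\mathcal{S}}^{T}\bar{\mathbf{X}}_{\mathcal{S}}\mathbf{v}\right) + \beta\,\mathrm{Tr}\left(\bar{\mathbf{X}}_{\mathcal{S}}^{T}\bar{\mathbf{X}}_{\mathcal{S}}\right), \tag{5}$$

where $\alpha$, $\beta$ are scalars fixed in our experiments. We apply the exponential function on the first term to make the maximization problem bounded. The second term is introduced as a regularizer to prevent trivial solutions by constraining the trace of the covariance matrix of $\bar{\mathbf{X}}_{\mathcal{S}}$.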
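To make the PCA view of the routing function concrete, the following is a minimal offline sketch, not the network's gradient-based training: it computes the mean and the direction of largest variance of a small set of feature vectors by power iteration, then routes each sample by the sign of its projection. The function names `top_direction` and `route` and the toy data are assumptions made for illustration only.

```python
import math
import random

def top_direction(features, iters=200):
    """Return (mean, v): the feature mean and a unit vector of largest variance."""
    n, dim = len(features), len(features[0])
    mean = [sum(x[j] for x in features) / n for j in range(dim)]
    centered = [[x[j] - mean[j] for j in range(dim)] for x in features]
    # Unnormalized covariance matrix: centered^T * centered.
    cov = [[sum(row[a] * row[b] for row in centered) for b in range(dim)]
           for a in range(dim)]
    rng = random.Random(0)
    v = [rng.random() + 0.1 for _ in range(dim)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return mean, v

def route(x, mean, v):
    """Routing response (x - mean)^T v; negative goes left, otherwise right."""
    return sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))

# Toy 2-D features (hypothetical): most of the variance lies along the first axis.
features = [[-2.0, -0.1], [-1.0, 0.1], [1.0, -0.1], [2.0, 0.1]]
mean, v = top_direction(features)
sides = ["left" if route(x, mean, v) < 0 else "right" for x in features]
print(sides)
```

Because the responses are mean-centered and projected onto the dominant direction, the toy samples split evenly between the two children, which is the balance property the loss above is designed to encourage.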

3.1.2 Tree of Known Spoofs

With the routing function, we can build the entire binary tree. Fig. 2 shows a binary tree of depth 4, with 8 leaf nodes. As mentioned earlier in Sec. 3, the tree is designed to find the semantic sub-groups from all known spoofs, and is termed the spoof tree. Similarly, we may also train a live tree with live faces only, as well as a general data tree with both live and spoof data. Compared to the spoof tree, the live and general data trees have some drawbacks. The live tree does not convey semantic meaning for the spoof, and the attributes learned at each node cannot help to route and better detect spoof. The general data tree may result in imbalanced sub-groups, where samples of one class outnumber the other. Such imbalance would cause bias for supervised learning in the next stage.

Hence, when we compute Equ. 5 to learn the routing functions, we only consider the spoof samples to construct the data matrix. To have a balanced sub-group for each leaf, we suppress the responses of live data to zero, so that all live data can be evenly partitioned to the child nodes. Meanwhile, we also suppress the responses of the spoof data that do not visit this node, so that every node models the distribution of a unique spoof subset.

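Formally, for each node, we maximize the routing function responses of spoof data that visit this node (denoted as $\mathcal{S}$), while minimizing the responses of other data (denoted as $\mathcal{S}^{-}$), including all live data and spoof data that do not visit this node, i.e., that visit neighboring nodes. To achieve this objective, we define the following loss:

$$\mathcal{L}_{uniq} = -\frac{1}{N}\sum_{\mathbf{I}_k\in\mathcal{S}}\left\lVert\bar{\mathbf{x}}_k^{T}\mathbf{v}\right\rVert^{2} + \frac{1}{N^{-}}\sum_{\mathbf{I}_k\in\mathcal{S}^{-}}\left\lVert\bar{\mathbf{x}}_k^{T}\mathbf{v}\right\rVert^{2},$$

where $\bar{\mathbf{x}}_k = \mathbf{x}_k - \boldsymbol{\mu}$ is the mean-centered feature of sample $k$, and $N$, $N^{-}$ denote the number of samples in $\mathcal{S}$ and $\mathcal{S}^{-}$.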
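One possible reading of this objective can be sketched in a few lines: reward large squared routing responses for the spoof samples that visit the node, and penalize squared responses for all other samples. The squared-response form, the function name `unique_loss`, and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
def unique_loss(visiting, others, mean, v):
    """Negative mean squared response of visiting samples plus that of the others."""
    def squared_response(x):
        r = sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))
        return r * r
    positive = sum(squared_response(x) for x in visiting) / len(visiting)
    negative = sum(squared_response(x) for x in others) / len(others)
    return -positive + negative

mean, v = [0.0, 0.0], [1.0, 0.0]          # hypothetical node parameters
spread_along_v = [[2.0, 0.0], [-2.0, 0.0]]  # strong responses along v
orthogonal_to_v = [[0.0, 1.0], [0.0, -1.0]]  # zero responses along v

good = unique_loss(spread_along_v, orthogonal_to_v, mean, v)
bad = unique_loss(orthogonal_to_v, spread_along_v, mean, v)
print(good, bad)
```

The loss is lower when the visiting samples respond strongly and the remaining samples respond weakly, which matches the stated goal of having each node model a unique spoof subset.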


3.2 Supervised Feature Learning

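Given the routing functions, a data sample $\mathbf{I}_k$ will be assigned to one of the leaf nodes. Let us first define the feature output of the leaf node as $\mathcal{F}(\mathbf{I}_k \mid \theta)$, shortened as $\mathcal{F}_k$ for simplicity. At each leaf node, we define two node-wise supervised tasks to learn discriminative features: binary classification drives the learning of a high-level understanding of live vs. spoof faces, and pixel-wise mask regression draws the CNN's attention to low-level local feature learning.

Classification supervision To learn a binary classifier, as shown in Fig. 2(d), we apply two additional convolution layers and two fully connected layers on $\mathcal{F}_k$ to generate a feature vector $\mathbf{c}_k$. We supervise the learning via the softmax cross entropy loss:

$$\mathcal{L}_{class} = \frac{1}{N}\sum_{\mathbf{I}_k\in\mathcal{S}}\left\{-(1-y_k)\log\frac{\exp(\mathbf{w}_0^{T}\mathbf{c}_k)}{\exp(\mathbf{w}_0^{T}\mathbf{c}_k)+\exp(\mathbf{w}_1^{T}\mathbf{c}_k)} - y_k\log\frac{\exp(\mathbf{w}_1^{T}\mathbf{c}_k)}{\exp(\mathbf{w}_0^{T}\mathbf{c}_k)+\exp(\mathbf{w}_1^{T}\mathbf{c}_k)}\right\},$$

where $\mathcal{S}$ represents all the data samples that arrive at this leaf node, $N$ denotes the number of samples in $\mathcal{S}$, $\{\mathbf{w}_0, \mathbf{w}_1\}$ are the parameters in the last fully connected layer, and $y_k$ is the label of data sample $k$ ($1$ denotes spoof, and $0$ live).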

Pixel-wise supervision We also concatenate another convolution layer to $\mathcal{F}_k$ to generate a map response $\mathbf{M}_k$. Inspired by the prior work [32], we leverage the semantic prior knowledge of face shapes and spoof attack position to provide a pixel-wise supervision. Using the dense face alignment model [33], we provide a binary mask $\mathbf{D}_k$, shown in Fig. 4, to indicate the pixels of spoof mediums. Thus, for a leaf node, the loss function for the pixel-wise supervision is:

$$\mathcal{L}_{mask} = \frac{1}{N}\sum_{\mathbf{I}_k\in\mathcal{S}}\left\lVert\mathbf{M}_k - \mathbf{D}_k\right\rVert_{1}.$$
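Overall loss Finally, we apply the supervised losses on the leaf nodes and the unsupervised losses on the TRU nodes, and formulate our training loss as:

$$\mathcal{L} = \alpha_1\sum_{i=1}^{p}\mathcal{L}_{class}^{i} + \alpha_2\sum_{i=1}^{p}\mathcal{L}_{mask}^{i} + \alpha_3\sum_{j=1}^{q}\mathcal{L}_{route}^{j} + \alpha_4\sum_{j=1}^{q}\mathcal{L}_{uniq}^{j},$$

where $\mathcal{L}_{class}^{i}$ and $\mathcal{L}_{mask}^{i}$ are the classification and pixel-wise losses at leaf node $i$, $\mathcal{L}_{route}^{j}$ and $\mathcal{L}_{uniq}^{j}$ are the routing and unique losses at TRU node $j$, $p$ and $q$ denote the number of leaf nodes and TRU nodes, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ are the regularization coefficients for each term, fixed in our experiments. For a 4-layer DTN, $p = 8$ and $q = 7$.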

Figure 3: The structure of the Tree Routing Unit (TRU).
Figure 4: Examples of live faces and the 13 types of spoof attacks. The second row shows the ground-truth masks for the pixel-wise supervision. The third row gives the number of subjects/videos for each type of data.

3.3 Network Architecture

Deep Tree Network (DTN) DTN is the main framework of the proposed model. It takes a 6-channel image as input, where the channels are the RGB and HSV color spaces. We concatenate three convolution layers and one max-pooling layer, and group them as one Convolutional Residual Unit (CRU). Each convolution layer is equipped with ReLU and a group normalization layer [43], due to the dynamic batch size in the network. We also apply a shortcut connection for each convolution layer. For each tree node, we deploy one CRU before the TRU. At the leaf node, DTN produces the feature representation $\mathcal{F}(\mathbf{I}_k \mid \theta)$ of input $\mathbf{I}_k$, then uses one convolution layer to generate the binary mask map $\mathbf{M}_k$.

Tree Routing Unit (TRU) TRU is the module routing the data sample to one of the child CRUs. As shown in Fig. 3, it first compresses the feature by using a convolution layer and resizing the response spatially. The root node and the later tree nodes compress their CRU features to different sizes. Compressing the input feature to a smaller size helps to reduce the burden of computing and saving the covariance matrix in Equ. 5: the covariance matrix grows quadratically with the length of the vectorized feature x, so the covariance matrix of the compressed feature needs far less memory than that of the uncompressed output of the first CRU.
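The memory argument can be checked with simple arithmetic: a covariance matrix over a vectorized feature of length K has K × K entries. The sketch below assumes 4-byte floats and uses hypothetical feature-map sizes (128×128×40 before compression and 32×32×10 after), which are illustrative choices rather than values taken from the text.

```python
def covariance_bytes(height, width, channels, bytes_per_value=4):
    """Memory needed for the covariance matrix of a vectorized feature map."""
    dim = height * width * channels      # length of the vectorized feature
    return dim * dim * bytes_per_value   # K x K entries

GIB = 2 ** 30
full = covariance_bytes(128, 128, 40)       # hypothetical uncompressed feature
compressed = covariance_bytes(32, 32, 10)   # hypothetical compressed feature

print(f"uncompressed: {full / GIB:.1f} GiB")
print(f"compressed:   {compressed / GIB:.3f} GiB")
print(f"reduction:    {full // compressed}x")
```

Under these assumptions, shrinking the feature length by a factor of 64 shrinks the covariance matrix by a factor of 64 squared, which is why compression before the routing function matters.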

After that, we vectorize the output and apply the routing function $\varphi(\mathbf{x})$. To compute the mean $\boldsymbol{\mu}$ in Equ. 3, instead of optimizing it as a variable of the network, we simply apply a batch normalization layer without scaling to save the moving average of each mini-batch. In the end, we project the compressed CRU response onto the largest basis $\mathbf{v}$ and obtain the projection coefficient. Then we assign the samples with negative coefficients to the left child CRU and the samples with positive coefficients to the right child CRU.
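The left/right assignment rule can be illustrated with a small traversal sketch. It makes two simplifying assumptions that are not from the text: the TRU nodes are stored in heap order (node i has children 2i+1 and 2i+2), and the same feature vector is reused at every node, whereas in the network each node sees the output of its own CRU. The names `tree_sizes` and `route_to_leaf` are hypothetical.

```python
def tree_sizes(depth):
    """Number of (leaf nodes, TRU nodes) in a complete binary tree with `depth` layers."""
    leaves = 2 ** (depth - 1)
    return leaves, leaves - 1

def route_to_leaf(x, trus, depth):
    """Follow the sign of (x - mean)^T v at each TRU; return the leaf index."""
    node = 0
    for _ in range(depth - 1):
        mean, v = trus[node]
        response = sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))
        # Negative coefficient goes to the left child, otherwise to the right child.
        node = 2 * node + 1 if response < 0 else 2 * node + 2
    return node - (2 ** (depth - 1) - 1)

# A 3-layer toy tree: the root splits on the first axis, its children on the second.
trus = [([0.0, 0.0], [1.0, 0.0]),
        ([0.0, 0.0], [0.0, 1.0]),
        ([0.0, 0.0], [0.0, 1.0])]
samples = [[-1.0, -1.0], [-1.0, 2.0], [1.0, -1.0], [1.0, 1.0]]
leaves = [route_to_leaf(x, trus, depth=3) for x in samples]
print(leaves, tree_sizes(4))
```

With four layers, the same counting gives 8 leaf nodes and 7 TRU nodes.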

Implementation details With the overall loss in Equ. 10, our proposed network is trained in an end-to-end fashion. All losses are computed based on each mini-batch. DTN modules and TRU modules are optimized alternately. While optimizing DTN, we keep the parameters of TRUs fixed and vice versa.

4 Spoof in the Wild Database with Multiple Attack Types

To benchmark face anti-spoofing methods specifically for unknown attacks, we collect the Spoof in the Wild database with Multiple Attack Types (SiW-M). Compared with the previous databases in Tab. 1, SiW-M shows a great diversity in spoof attacks, subject identities, environments and other factors.

For spoof data collection, we consider two spoofing scenarios: impersonation, which entails the use of spoof to be recognized as someone else, and obfuscation, which entails the use of spoof to hide the attacker's own identity. We collect videos of 13 types of spoof attacks, listed hierarchically in Fig. 4. For all mask attacks, partial attacks, obfuscation makeup and cosmetic makeup, we record HD videos. For impersonation makeup, we collect videos from YouTube due to the lack of special makeup artists. For print and replay attacks, we intend to collect videos from harder cases where the existing system fails. Hence, we deploy an off-the-shelf face anti-spoofing algorithm [32] and record spoof videos when the algorithm predicts live.

For live data, SiW-M includes more subjects than Oulu-NPU [9], CASIA-FASD [50], and SiW [32]. In addition, subjects are diverse in ethnicity and age. The live videos are collected in 3 sessions: 1) a room environment where the subjects are recorded with few variations such as pose, lighting and expression (PIE); 2) a different and much larger room where the subjects are also recorded with PIE variations; 3) a mobile phone mode, where the subjects are moving while the phone camera is recording. In the third session, extreme pose angles and lighting conditions are introduced. Similar to print and replay videos, we deploy the face anti-spoofing algorithm [32] to find the videos where the algorithm predicts spoof. Hence, this third session is a harder scenario.

In total, we collect videos and each lasts - seconds. The P videos are recorded by Logitech C webcam and Canon EOS T. To use SiW-M for the study of ZSFA, we define the leave-one-out testing protocols. Each time we train a model with 12 types of spoof attacks plus a portion of the live videos, and test on the held-out attack type plus the remaining live videos. There are no overlapping subjects between the training and testing sets of live videos.
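The leave-one-out protocols can be enumerated mechanically. The sketch below assumes the attack-type names from the column headers of Tab. 5, with abbreviations expanded (e.g., "Manne." read as mannequin, which is an assumption); the function name `leave_one_out_protocols` is hypothetical, and the live-video split is omitted.

```python
ATTACK_TYPES = [
    "replay", "print",
    "half mask", "silicone mask", "transparent mask", "paper mask", "mannequin",
    "obfuscation makeup", "impersonation makeup", "cosmetic makeup",
    "funny eye", "paper glasses", "partial paper",
]

def leave_one_out_protocols(attack_types):
    """Yield (training attack types, held-out unknown attack type) pairs."""
    for held_out in attack_types:
        train = [t for t in attack_types if t != held_out]
        yield train, held_out

protocols = list(leave_one_out_protocols(ATTACK_TYPES))
for train, test in protocols[:2]:
    print(f"test on unknown '{test}', train on {len(train)} known types")
```

Each protocol treats exactly one attack type as unknown, so the number of protocols equals the number of attack types.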

Methods CASIA [50] Replay-Attack [15] MSU [42] Overall
Video Cut Photo Warped Photo Video Digital Photo Printed Photo Printed Photo HR Video Mobile Video
NN+LBP [45]
Table 2: AUC (%) of the model testing on CASIA, Replay, and MSU-MFSD.

5 Experimental Results

5.1 Experimental Setup


We evaluate our proposed method on multiple databases. We deploy the leave-one-out testing protocols on SiW-M and report the results of the 13 experiments. Also, we test on previous face anti-spoofing databases, including CASIA [50], Replay-Attack [15], and MSU-MFSD [42], and compare with the state of the art.

Evaluation metrics

We evaluate with the following metrics: Attack Presentation Classification Error Rate (APCER) [1], Bona Fide Presentation Classification Error Rate (BPCER) [1], the average of APCER and BPCER, Average Classification Error Rate (ACER) [1], Equal Error Rate (EER), and Area Under Curve (AUC). Note that, in the evaluation of unknown attacks, we assume there is no validation set to tune the model and thresholds while calculating the metrics. Hence, we determine the threshold based on the training set and fix it for all testing protocols. A single test sample is one video frame, instead of one video.
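As a small worked example of the fixed-threshold metrics, the sketch below computes APCER, BPCER and ACER from per-frame scores. It assumes that a higher score means "more likely spoof", that label 1 marks spoof frames, and that each protocol contains a single attack type; the function name `acer_metrics`, the scores, and the threshold are illustrative.

```python
def acer_metrics(scores, labels, threshold):
    """Return (APCER, BPCER, ACER) for frames scored as spoof when score >= threshold.

    labels: 1 = spoof (attack presentation), 0 = live (bona fide presentation).
    """
    spoof = [s for s, y in zip(scores, labels) if y == 1]
    live = [s for s, y in zip(scores, labels) if y == 0]
    apcer = sum(s < threshold for s in spoof) / len(spoof)   # attacks accepted as live
    bpcer = sum(s >= threshold for s in live) / len(live)    # live rejected as spoof
    return apcer, bpcer, (apcer + bpcer) / 2

scores = [0.9, 0.4, 0.8, 0.1, 0.6, 0.2]   # hypothetical per-frame spoof scores
labels = [1, 1, 1, 0, 0, 0]
apcer, bpcer, acer = acer_metrics(scores, labels, threshold=0.5)
print(apcer, bpcer, acer)
```

Because the threshold is fixed in advance rather than tuned on test data, ACER reflects the error a deployed system would actually make on an unknown attack.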

Random routing
Proposed routing function
Table 3: Compare models with different routing strategies.
MPT [44]
Live data , Spoof data , Unique Loss
Live data , Spoof data , Unique Loss
Live data , Spoof data , Unique Loss
Live data , Spoof data , Unique Loss
Table 4: Compare models with different tree losses and strategies. The first two terms of row - refer to using live or spoof data in tree learning. The last row is our method.
Parameter setting

The proposed method is implemented in TensorFlow, and trained with a constant learning rate of with a batch size of . It takes epochs to converge. We randomly initialize all the weights using a normal distribution of mean and standard deviation.

5.2 Experimental Comparison

5.2.1 Ablation Study

All ablation studies use the Funny Eye protocol.

Different fusion methods In the proposed model, both the norm of the mask maps and binary spoof scores could be utilized for the final classification. To find the best fusion method, we compute ACER from using map norm, softmax score, the maximum of map norm and softmax score, and the average of two values, and obtain , , , and respectively. Since the average score of the mask norm and binary spoof score performs the best, we use it for the remaining experiments. Moreover, we set as the final threshold to compute APCER, BPCER and ACER for all the experiments.
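The four fusion options compared above can be sketched as follows. The map norm is taken here as the mean absolute value of the predicted mask map, and the threshold is a placeholder; both are assumptions, as are the names `map_norm` and `fuse_scores` and the toy values.

```python
def map_norm(mask_map):
    """Mean absolute value of the predicted mask map (assumed normalization)."""
    values = [abs(p) for row in mask_map for p in row]
    return sum(values) / len(values)

def fuse_scores(mask_map, softmax_score, method="average"):
    """Combine the mask-map norm and the softmax spoof score into one score."""
    norm = map_norm(mask_map)
    if method == "map":
        return norm
    if method == "softmax":
        return softmax_score
    if method == "max":
        return max(norm, softmax_score)
    if method == "average":
        return (norm + softmax_score) / 2
    raise ValueError(f"unknown fusion method: {method}")

THRESHOLD = 0.5  # placeholder value, not the threshold used in the experiments
mask_map = [[0.8, 0.6], [0.9, 0.7]]   # hypothetical 2x2 mask prediction
score = fuse_scores(mask_map, softmax_score=0.65)
is_spoof = score >= THRESHOLD
print(score, is_spoof)
```

Averaging lets a confident pixel-wise prediction compensate for an uncertain classifier output and vice versa, which is one plausible reason it performs best in the comparison.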

Different routing methods Routing is a crucial step to find the best subgroup for detecting the spoofness of a testing sample. To show the effect of proper routing, we evaluate alternative routing strategies: random routing and pick-one-leaf. Random routing denotes randomly selecting one leaf node for a testing sample to produce the prediction. Pick-one-leaf denotes constantly selecting one particular leaf node to produce results, for which we report the mean score and standard deviation over the possible leaf selections. As shown in Tab. 3, both strategies perform worse than the proposed routing function. In addition, the large standard deviation of the pick-one-leaf strategy shows the large performance difference of subgroups on the same type of unknown attacks, and demonstrates the necessity of proper routing.

Methods Metrics (%) Replay Print Mask Attacks Makeup Attacks Partial Attacks Average
Half Silicone Trans. Paper Manne. Obfusc. Imperson. Cosmetic Funny Eye Paper Glasses Partial Paper
Auxiliary[32] APCER
Ours APCER 1.0
Table 5: The evaluation and comparison of the testing on SiW-M.

Advantage of each loss function We have three important designs in our unsupervised tree learning: the route loss $\mathcal{L}_{route}$, the data used to compute the route loss, and the unique loss $\mathcal{L}_{uniq}$. To show the effect of each loss and the training strategy, we train and compare networks with each loss excluded and with alternative strategies. First, we train a network with the routing function proposed in [44], and then models with different modules on and off, shown in Tab. 4. The model with MPT [44] routes data to only a subset of the leaf nodes (i.e., the tree collapse issue), which limits the performance. Models without the unique loss exhibit the imbalanced routing issue, where sub-groups cannot be trained properly. Models using all data to learn the tree perform worse than those using spoof data only. Finally, the proposed method performs the best among all options.

5.2.2 Testing on existing databases

Figure 5: Visualization of the Tree Routing.

Following the protocol proposed in [3], we use CASIA [50], Replay-Attack [15] and MSU-MFSD [42] to perform ZSFA testing between replay and print attacks. Tab. 2 compares the proposed method with top three methods selected from over methods in [45, 3, 9]. Our proposed method outperforms the prior state of the art by a convincing margin of , and our smaller standard deviation further indicates a consistently good performance among unknown attacks.

5.2.3 Testing on SiW-M

We execute leave-one-out testing protocols on SiW-M. We compare with two of the most recent face anti-spoofing methods [9, 32], and set [32] as the baseline, which has demonstrated its SOTA performance on various benchmarks. For a fair comparison with the baseline, we provide the same pixel-wise labeling (as in Fig. 4), and set the same threshold of to compute APCER, BPCER, and ACER.

As shown in Tab. 5, our method achieves an overall better APCER, ACER and EER, with the improvement of baseline by , , and . Specifically, we reduce the ACERs of transparent mask, funny eye, and paper glasses by , , and , where the baseline models can be considered as total failures since they recognize most of the attacks as live. Note that, ACER is more valuable in the context of ZSFA: no evaluation data for setting threshold and considerably varied thresholds for obtaining the EER performance. For instance, EERs of paper glasses model are similar between the baseline and our method, but with a preset threshold, our method offers a much better ACER.

Moreover, the proposed method is a more compact model than [32]. Given the input size of , the baseline requires GFlops to compute the result while our method only needs GFlops ( smaller). More analyses are shown with visualization in Sec. 5.2.4.

Figure 6: Tree routing distribution of live/spoof data. The x-axis denotes leaf nodes, and the y-axis denotes types of data. The number in each cell represents the percentage (%) of data that fall in that leaf node. Each row sums to 100%. (a) Print Protocol. (b) Transparent Mask Protocol. The yellow box denotes the unknown attacks.

5.2.4 Visualization and Analysis

To provide a better understanding of the tree learning and ZSFA, we visualize the results in several ways. First, we illustrate the tree routing results. In Fig. 5, we rank the spoof data based on the routing function values, and provide examples with responses from the smallest to the largest. This offers an intuitive understanding of what is learned at each tree node. We observe an obvious spoof style transfer: for the nodes in the first two layers, the transfer captures the change of general spoof attributes such as image quality and color temperature; for the third-layer tree nodes, the transfer involves more spoof-type-specific changes, e.g., from eye portion spoofs to full-face 3D mask spoofs.

Further, Fig. 6 quantitatively analyzes the tree routing distributions of all types of data. We use two models, Print and Trans. Mask, to generate the distributions. Live samples are spread relatively widely across the leaf nodes, while the spoof attacks are routed to fewer, specific leaf nodes. The two distributions in Fig. 6 (a) and (b) share similar semantic sub-groups, which demonstrates that the proposed method succeeds in learning a tree. E.g., in both models, about half of the trans. mask samples share the same leaf node as ob. makeup. Comparing the two distributions, most unknown spoofs at test time are successfully routed to the most similar sub-groups in both models.
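
A routing distribution table of the kind shown in Fig. 6 can be computed as in the sketch below. The leaf count of eight assumes a binary tree with four third-layer nodes, and the sample assignments are made up; neither is taken from the paper's data.

```python
# Sketch: per-type routing distribution over leaf nodes, as row percentages.
from collections import Counter, defaultdict


def routing_distribution(assignments, num_leaves):
    """assignments: iterable of (data_type, leaf_index) pairs.
    Returns {data_type: [percentage per leaf]}; each row sums to 100."""
    counts = defaultdict(Counter)
    for data_type, leaf in assignments:
        counts[data_type][leaf] += 1
    table = {}
    for data_type, leaf_counts in counts.items():
        total = sum(leaf_counts.values())
        table[data_type] = [100.0 * leaf_counts[leaf] / total for leaf in range(num_leaves)]
    return table


NUM_LEAVES = 8  # assumed: binary tree with four third-layer nodes
assignments = [
    ("live", 0), ("live", 3), ("live", 5), ("live", 7),       # live spread over many leaves
    ("print", 2), ("print", 2), ("print", 2), ("print", 6),   # spoof concentrated in few leaves
]

table = routing_distribution(assignments, NUM_LEAVES)
for data_type, row in table.items():
    print(data_type, row)
```

In this toy input the live row is spread over four leaves while the print row concentrates in two, which mirrors the qualitative pattern described above.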

Figure 7: t-SNE Visualization of the DTN leaf features.

In addition, we use t-SNE [34] to visualize the feature space of the Print model. t-SNE projects the output of the leaf nodes to 2D by minimizing the KL divergence between the pairwise similarity distributions of the original and projected features. Fig. 7 shows that the features of different types of spoof attacks are well clustered into semantic sub-groups even though we do not provide any auxiliary labels. Based on these sub-groups, the features of the unknown print attacks lie well within the sub-group of replay and silicone mask, and thus are recognized as spoof. Moreover, the visualization helps explain the performance variation among different spoof attacks shown in Tab. 5. Among all protocols, the performance on trans. mask, funny eye, paper glasses, and ob. makeup is worse than on the others. The feature space shows that the live samples lie much closer to those attacks than to the others, and hence it is harder to distinguish them from the live samples. This demonstrates the diverse properties of different unknown attacks and the necessity of such a wide-range evaluation.
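
For reference, the standard t-SNE objective from [34] can be written as below. The notation is introduced here and is not taken from this paper: $x_i$ is a leaf feature, $y_i$ its 2D embedding, $N$ the number of samples, and $p_{j|i}$ the Gaussian conditional similarity between $x_i$ and $x_j$, whose bandwidth is set by the perplexity parameter.

```latex
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The embedding points $y_i$ are found by minimizing $C$ with gradient descent, so nearby features in the leaf space tend to stay nearby in the 2D plot, while distances between far-apart clusters are less reliable.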

6 Conclusions

This paper tackles the zero-shot face anti-spoofing problem among 13 types of spoof attacks. The proposed method leverages a deep tree network to route the unknown attacks to the most appropriate leaf node for spoof detection. The tree is trained in an unsupervised fashion to find the feature base with the largest variation to split the spoof data. We collect SiW-M, which contains more subjects and spoof types than any previous database. Finally, we experimentally show the superior performance of the proposed method.


This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. -. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.


  • [1] ISO/IEC JTC 1/SC 37 Biometrics. Information technology: Biometric presentation attack detection, Part 1: Framework. International Organization for Standardization, 2016.
  • [2] A. Agarwal, R. Singh, and M. Vatsa. Face anti-spoofing using Haralick features. In BTAS, 2016.
  • [3] S. R. Arashloo, J. Kittler, and W. Christmas. An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol. IEEE Access, 5:13868–13882, 2017.
  • [4] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu. Face anti-spoofing using patch and depth-based CNNs. In IJCB, 2017.
  • [5] W. Bao, H. Li, N. Li, and W. Jiang. A liveness detection method for face recognition based on optical flow field. In IEEE International Conference on Image Analysis and Signal Processing (IASP), 2009.
  • [6] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh. Face anti-spoofing via motion magnification and multifeature videolet aggregation. Technical report, 2014.
  • [7] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-spoofing based on color texture analysis. In ICIP, 2015.
  • [8] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face antispoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Processing Letters, 2017.
  • [9] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid. OULU-NPU: A mobile face presentation attack database with real-world variations. In FG, 2017.
  • [10] Q. Cao, X. Liang, B. Li, G. Li, and L. Lin. Visual question reasoning on general dependency tree. In CVPR, 2018.
  • [11] H. Chang, J. Lu, F. Yu, and A. Finkelstein. PairedCycleGAN: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018.
  • [12] C. Chen, A. Dantcheva, and A. Ross. Automatic facial makeup detection with application in face recognition. In ICB, 2013.
  • [13] C. Chen, A. Dantcheva, and A. Ross. Impact of facial cosmetics on automatic gender and age estimation algorithms. In IEEE International Conference on Computer Vision Theory and Applications (VISAPP), 2014.
  • [14] X. Chen, C. Liu, and D. Song. Tree-to-tree neural networks for program translation. arXiv preprint arXiv:1802.03691, 2018.
  • [15] I. Chingovska, A. Anjos, and S. Marcel. On the effectiveness of local binary patterns in face anti-spoofing. In BIOSIG, 2012.
  • [16] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. LBP-TOP based countermeasure against face spoofing attacks. In ACCV, 2012.
  • [17] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. Can face anti-spoofing countermeasures work in a real world scenario? In ICB, 2013.
  • [18] L. Feng, L. Po, Y. Li, X. Xu, F. Yuan, T. C. Cheung, and K. Cheung. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. Journal of Visual Communication and Image Representation, 2016.
  • [19] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
  • [20] A. Jourabloo, Y. Liu, and X. Liu. Face de-spoofing: Anti-spoofing via noise modeling. In ECCV, 2018.
  • [21] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative adversarial image synthesis with decision tree latent controller. In CVPR, 2018.
  • [22] N. Karessli, Z. Akata, B. Schiele, A. Bulling, et al. Gaze embeddings for zero-shot image classification. In CVPR, 2017.
  • [23] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, 2014.
  • [24] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun. Real-time face detection and motion analysis with application in “liveness” assessment. In TIFS, 2007.
  • [25] J. Komulainen, A. Hadid, and M. Pietikäinen. Context based face anti-spoofing. In BTAS, 2013.
  • [26] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
  • [27] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An original face anti-spoofing approach using partial convolutional neural network. In IEEE International Conference on Image Processing Theory Tools and Applications (IPTA), 2016.
  • [28] X. Li, J. Komulainen, G. Zhao, P. C. Yuen, and M. Pietikäinen. Generalized face anti-spoofing by detecting pulse from face videos. In ICPR, 2016.
  • [29] S. Liu, X. Lan, and P. C. Yuen. Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection. In ECCV, 2018.
  • [30] S. Liu, B. Yang, P. C. Yuen, and G. Zhao. A 3D mask face anti-spoofing database with real world variations. In CVPRW, 2016.
  • [31] S. Liu, P. C. Yuen, S. Zhang, and G. Zhao. 3D mask face anti-spoofing with remote photoplethysmography. In ECCV, 2016.
  • [32] Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In CVPR, 2018.
  • [33] Y. Liu, A. Jourabloo, W. Ren, and X. Liu. Dense face alignment. In ICCVW, 2017.
  • [34] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [35] J. Määttä, A. Hadid, and M. Pietikäinen. Face spoofing detection from single images using micro-texture analysis. In IJCB, 2011.
  • [36] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, 2007.
  • [37] K. Patel, H. Han, and A. K. Jain. Cross-database face antispoofing with robust feature representation. In CCBR, 2016.
  • [38] K. Patel, H. Han, and A. K. Jain. Secure face unlock: Spoof detection on smartphones. In TIFS, 2016.
  • [39] R. Shao, X. Lan, and P. C. Yuen. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing. In IJCB, 2017.
  • [40] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
  • [41] R. Valle and M. José. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In ECCV, 2018.
  • [42] D. Wen, H. Han, and A. K. Jain. Face spoof detection with image distortion analysis. In TIFS, 2015.
  • [43] Y. Wu and K. He. Group normalization. In ECCV, 2018.
  • [44] C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, and T. Kim. Conditional convolutional neural network for modality-aware face recognition. In ICCV, 2015.
  • [45] F. Xiong and W. Abdalmageed. Unknown presentation attack detection with face RGB images. In BTAS, 2018.
  • [46] Z. Xu, S. Li, and W. Deng. Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In ACPR, 2015.
  • [47] J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection with component dependent descriptor. In ICB, 2013.
  • [48] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601, 2014.
  • [49] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
  • [50] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li. A face antispoofing database with diverse attacks. In ICB, 2012.