Exploiting temporal and depth information for multi-frame face anti-spoofing

11/13/2018 ∙ by Zezheng Wang, et al.

Face anti-spoofing is significant to the security of face recognition systems. By utilizing pixel-wise supervision, depth supervised face anti-spoofing attains better generalization than binary classification does. Without considering the importance of sequential information in depth recovery, previous depth supervised methods regard depth only as an auxiliary supervision within a single frame. In this paper, we propose a depth supervised face anti-spoofing model that operates in both spatial and temporal domains. The temporal information from multiple frames is exploited and incorporated to improve facial depth recovery, so that more robust and discriminative features can be extracted to classify living and spoofing faces. Extensive experiments indicate that our approach achieves state-of-the-art results on standard benchmarks.







1 Introduction

Face recognition systems have become indispensable in many interactive AI systems for their convenience. However, most existing face recognition systems are vulnerable to face spoofing: attackers can easily deceive them with presentation attacks (PAs), e.g., printing a face on paper (print attack), replaying a face on a digital device (replay attack), or wearing a 3D mask (3D-mask attack). Such PAs can tamper with face recognition systems, posing direct risks to payment and identity verification scenarios with high public stakes. Therefore, face anti-spoofing plays a critically important role in the security of face recognition.

To defend against face spoofing attacks, a great number of face anti-spoofing methods [5, 14, 21, 23] have been proposed to discriminate living from fake faces. Previous approaches fall into two categories. The first comprises traditional methods, which commonly train shallow classifiers with hand-crafted features, e.g., LBP [10], SIFT [26], or SURF [6]. Considering only texture characteristics, these methods are often vulnerable to attacks using new replay mediums or 2D/3D masks.

Figure 1: To observe the difference between live and spoof (print attack here) scenes, we exaggerate the micro motion. The angle of the camera viewpoint reflects the relative depth among different facial keypoints. In the living scene (a), the angle between the nose and the right ear becomes smaller as the face moves right, while the angle between the left ear and the nose becomes larger. In the spoofing scene (b), the two angles do not change in this way.

In contrast to texture-based methods, CNN-based methods provide an alternative way to learn discriminative anti-spoofing representations in an end-to-end fashion. Some CNN-based methods [21, 23, 14, 34, 35, 30] treat face anti-spoofing as a pure binary classification problem (spoofing as 0 and living as 1) and train the network with a simple softmax loss. However, these methods fail to explore the nature of spoof patterns [20], which consist of skin detail loss, color distortion, moiré pattern, motion pattern, shape deformation, and spoofing artifacts.

Depth supervised face anti-spoofing relieves these issues to some extent. Intuitively, images of living faces contain face-like depth information, whereas spoofing face images presented on print or replay carriers have only planar depth information. Thus, Atoum et al. [2] and Liu et al. [20] propose single-frame depth supervised CNN architectures and improve the performance of presentation attack detection (PAD). However, they leverage facial depth only as an auxiliary supervision to detect static spoofing cues under pixel-wise supervision, neglecting the virtual depth difference between living and spoofing scenes. To compensate for the lack of temporal information, Liu et al. [20] take rPPG signals as extra supervisory cues, but do not use any depth information in the temporal domain. Yet depth information is more apparent across sequential frames: temporal motion is very useful for extracting depth information and thus holds great potential for face anti-spoofing detection.

In this paper, we explore temporal depth by combining temporal motion and facial depth, and demonstrate that face anti-spoofing can significantly benefit from temporal facial depth. Whether cameras or subjects move, sequential frames can be represented in a 3D space using the relative motion of objects. Inspired by the work [29] revealing the relationship between relative facial depth and motion, we believe the patterns of motion and depth are distinct between living and spoofing scenes, and that abnormal facial depth is reflected in a unique facial motion pattern in the temporal domain. Fig. 1 illustrates the apparently different temporal representations of living and spoofing scenes. To this end, we present a novel depth supervised neural network architecture with an optical flow guided feature block (OFFB) and a convolutional gated recurrent units (ConvGRU) module to incorporate temporal information from adjacent frames and long sequences, respectively. By combining short-term and long-term motion extractors, temporal motion can be used effectively to discriminate living and spoofing faces under depth supervision. To estimate facial depth in face anti-spoofing tasks, we further propose a novel contrastive depth loss to learn the topography of facial points.

We summarize the main contributions below:

  • We analyze the temporal depth in face anti-spoofing and demonstrate the usefulness of motion and depth for this task.

  • We propose a novel depth supervised architecture with OFF block (OFFB) and ConvGRU module to uncover facial depths and their unique motion patterns from temporal information of monocular frame sequences.

  • We design a contrastive depth loss to learn the topography of facial points for depth supervised face anti-spoofing.

  • We demonstrate superior performance over the state-of-the-art methods on widely used face anti-spoofing benchmarks.

2 Related Work

We review related face anti-spoofing works in three categories: binary supervised methods, depth supervised methods, and temporal-based methods.

Binary supervised Methods Since face anti-spoofing is essentially a binary classification problem, most previous anti-spoofing methods simply train a classifier under binary supervision, e.g., spoofing face as 0 and living face as 1. Binary classifiers include traditional classifiers and neural networks. Prior works usually rely on hand-crafted features, such as LBP [10, 11, 22], SIFT [26], SURF [6], HoG [18, 36], and DoG [27, 33], with traditional classifiers such as SVM and Random Forest. Since these manually-engineered features are often sensitive to varying conditions, such as camera devices, lighting conditions, and presentation attack instruments (PAIs), traditional methods often generalize poorly.

In recent years, CNNs have achieved great breakthroughs with the help of hardware development and data abundance, and are now widely used in face anti-spoofing tasks [21, 23, 14, 12, 19, 25, 35]. However, most deep learning methods simply treat face anti-spoofing as a binary classification problem with softmax loss. Both [19] and [25] fine-tune a pre-trained VGG-face model and take it as a feature extractor for the subsequent classification. Nagpal et al. [23] comprehensively study the influence of different network architectures and hyperparameters on face anti-spoofing. Feng et al. [12] and Li et al. [19] feed different kinds of face images into the CNN to learn discriminative features of living and spoofing faces.

Depth supervised Methods Compared with binary supervised face anti-spoofing methods, depth supervised methods have many advantages. Atoum et al. [2] utilize the facial depth map as a supervisory signal for the first time. They propose a two-stream CNN-based approach for face anti-spoofing, extracting local features and holistic depth maps from face images. In other words, they combine a depth-based CNN and a patch-based CNN on single frames to obtain discriminative representations that distinguish live vs. spoof. Their results show that depth estimation is beneficial for face anti-spoofing and obtains promising results, especially on higher-resolution images.

Liu et al. [20] propose a face anti-spoofing method that combines a spatial perspective (depth) with a temporal perspective (rPPG). They regard facial depth as an auxiliary supervision, along with rPPG signals. For temporal information, they use a simple RNN to learn the corresponding rPPG signals. However, due to the simplicity of their sequence processing, they require a non-rigid registration layer to remove the influence of facial poses and expressions, ignoring that unnatural changes of facial poses or expressions are themselves significant spoofing cues.

Temporal-based Methods Temporal information plays a vital role in face anti-spoofing tasks. [24, 25, 30] focus on the movement of key parts of the face; for example, [24, 25] make spoofing predictions based on eye blinking. These methods are vulnerable to replay attacks since they rely excessively on a single cue. Gan et al. [14] propose a 3D convolutional network to distinguish live vs. spoof. A 3D convolutional network is a stacked structure that learns temporal features in a supervised manner, but it depends on a significant amount of data and performs poorly on small databases. Xu et al. [34] propose an architecture combining LSTM units with a CNN for binary classification. Feng et al. [12] present a work that takes the optical flow magnitude map and Shearlet features as CNN inputs; their work shows that the optical flow map differs obviously between living faces and various kinds of spoofing faces. None of these prior temporal-based methods captures valid temporal information with a well-designed structure.

3 The Proposed Approach

In this section, we first introduce temporal depth in face anti-spoofing and show the useful patterns of motion and depth. Then we present the proposed temporal PAD method under depth supervision. As shown in Fig. 5, the proposed model mainly consists of two modules. One is the single-frame part, which seeks spoofing cues under static depth supervision. The other is the multi-frame part, which consists of an optical flow guided feature block (OFFB) as the short-term motion module and convolutional gated recurrent units (ConvGRU) modeling long-term motion patterns. The model can thus successfully explore spoofing cues in both spatial and temporal domains under depth supervision.

3.1 Temporal Depth in Face Anti-spoofing

Figure 2: The schematic diagram of motion and depth variation in the presentation attack scene.

As shown in Fig. 2, there are two image spaces. The right image space is the recording image space, which attackers use to record others' faces. The left image space is the realistic image space, where the real face recognition system is. Each space has its own focal point and focal distance; further quantities of interest are the distance between the realistic focal point and the recording image plane (i.e., the screen of a replay attack or the paper of a print attack), and the distance between the focal point and the nearest facial point. Consider three facial points at different distances to the camera at time t, which move down vertically by the same amount by time t+1. Each point maps to a coordinate on the recording image plane and induces an optical flow on that plane when it moves; these quantities in turn map onto the realistic image plane. By similar triangles in this pinhole model, two relations follow. The first relates the optical flows observed on the realistic image plane to the estimated depth differences between the facial points, so that relative depth can be computed from optical flow. The second accounts for the recording image plane itself moving vertically due to the shake of the attack carrier. Here, we only consider vertical movement. From these relations, we have some important conclusions:

  • In a real scene, we can obtain a correct estimate of the relative depth among facial points from their optical flows.

  • In a print attack scene, all facial points lie on the same physical plane, so the estimated relative depth between any two facial points is zero, which reveals that the observed face is a plane.

  • In a replay attack scene, if the recording geometry exactly compensates the projection, there is no abnormality of relative depth. If, in addition, there are no static spoofing cues, we call this case the Perfect Spoofing Scene (PSS). However, making up PSS attacks is costly and approximately impossible in practice, so the estimated relative depth in attacking scenes usually differs from that in real scenes. Moreover, the recording distance and the carrier's motion usually vary over a long sequence, and such variations can also be taken as a valid spoofing cue.

  • Since the largest depth difference among facial points is bounded, constraining the depth label of a living face to a normalized range is valid. For spoofing scenes, the abnormal relative depth is too complex to compute directly; therefore, we merely set the depth label of a spoofing face to all 0 to distinguish it from the living label, making the model learn the abnormality itself under depth supervision.

More details about the derivation are included in the supplementary material.
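The geometric intuition above can be checked with a toy pinhole-camera model (a simplified numerical sketch, not the paper's derivation; the keypoint layout and motion values are invented for illustration): under the same vertical translation, a point's image motion scales with the inverse of its depth, so a live face with a protruding nose produces unequal keypoint flows, while a printed, planar face produces identical ones.

```python
import numpy as np

def project(P, f=1.0):
    # Pinhole projection of 3D points (N, 3) onto the image plane.
    return f * P[:, :2] / P[:, 2:3]

# Keypoints of a live face: the nose sits closer to the camera than the ears.
live = np.array([[-0.1, 0.0, 1.00],   # left ear
                 [ 0.0, 0.0, 0.90],   # nose tip (smaller depth Z)
                 [ 0.1, 0.0, 1.00]])  # right ear
# A print attack: the same keypoints, but flattened onto one plane.
flat = live.copy()
flat[:, 2] = 1.0

shift = np.array([0.0, 0.02, 0.0])  # identical vertical motion of all points

def vertical_flow(P):
    # Image motion of each keypoint under the rigid translation `shift`.
    return (project(P + shift) - project(P))[:, 1]

live_flow = vertical_flow(live)
flat_flow = vertical_flow(flat)
# The flow magnitude scales with 1/Z: the live face yields a larger flow at
# the nose than at the ears, while the planar print yields identical flows.
```

This is precisely the cue the multi-frame model is designed to pick up: equal flows across keypoints are consistent with a plane, unequal flows with genuine facial relief.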

3.2 Single-frame Part

The single-frame part is important for learning static spoofing information. In this work, we train a simple CNN to regress the depth map instead of performing traditional binary classification with softmax loss. In the following, we introduce the proposed depth supervised single-frame architecture in two aspects: depth generation and network structure.

3.2.1 Depth Generation

Figure 3: The examples of our input and label for single-frame and multi-frame.

We train the single-frame network with a facial depth map, which reveals the face location and the 3D shape of the face in the 2D plane image, providing useful information for recognizing whether a face is real. To distinguish living faces from spoofing faces, we normalize the living depth map to the range [0, 1], while setting the spoofing depth map to all 0 [20].

In this module, we adopt the dense face alignment method PRNet [13] to estimate the 3D shape of the living face. PRNet projects the 3D shape of a complete face into UV space, through which we obtain a group of vertices representing the 3D coordinates of facial keypoints. Since these coordinates are sparse when mapped to the 2D plane image, we interpolate them to obtain dense face coordinates. By mapping and normalizing the interpolated coordinates to a 2D plane image, we obtain the generated facial depth map. Fig. 3 shows the labels and corresponding inputs of the proposed model.
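A minimal sketch of this label-generation step is given below, assuming vertices whose x and y coordinates are already normalized to [0, 1] (the `depth_label` helper and the 32×32 output size are illustrative choices, not the paper's exact pipeline, which uses PRNet vertices and a denser interpolation):

```python
import numpy as np

def depth_label(vertices, size=32, live=True):
    """Rasterize sparse facial vertices (N, 3), with x and y normalized to
    [0, 1], into a dense depth-map label. Spoofing labels are all-zero maps."""
    if not live:
        return np.zeros((size, size), np.float32)
    depth = np.zeros((size, size), np.float32)
    xs = np.clip((vertices[:, 0] * (size - 1)).round().astype(int), 0, size - 1)
    ys = np.clip((vertices[:, 1] * (size - 1)).round().astype(int), 0, size - 1)
    zs = vertices[:, 2].astype(np.float32)
    zs = (zs - zs.min()) / (zs.max() - zs.min() + 1e-8)  # normalize z to [0, 1]
    np.maximum.at(depth, (ys, xs), zs)  # keep the largest depth per pixel
    # A real pipeline would additionally interpolate the remaining empty
    # pixels, mirroring the dense-coordinate interpolation described above.
    return depth
```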

3.2.2 Network Structure

Figure 4: The kernel in contrastive depth loss.
Figure 5: The pipeline of the proposed architecture. The inputs are consecutive frames at a fixed interval. Our single-frame part extracts features at various levels and outputs the single-frame estimated facial depth. OFF blocks take single-frame features from two consecutive frames as inputs and compute short-term motion features. The final OFF features are then fed into the ConvGRUs to obtain long-term motion information and output the residual of the single-frame facial depth. Finally, the combined estimated multi-frame depth maps are supervised by the depth loss and the binary loss.

As shown in Fig. 5, our single-frame part contains three cascaded blocks connected after one convolution layer. Each block is composed of three convolution layers and one pooling layer. We resize the pooled features of the three blocks to a pre-defined size, concatenate them into one tensor, and regress the depth map from this tensor through three subsequent convolutional groups.

Given an original RGB image $I$, we obtain the estimated depth map $\hat{D}$ from the single-frame part. Supervised by the generated "ground truth" depth $D$, we design a novel depth loss with the following two parts. One is a squared Euclidean norm loss between $\hat{D}$ and $D$ for absolute depth regression:

$L_{euclidean} = \| \hat{D} - D \|_2^2,$

and the other is a contrastive depth loss:

$L_{contrast} = \sum_i \| K_i^{contrast} \odot \hat{D} - K_i^{contrast} \odot D \|_2^2,$

where $\odot$ represents the depthwise separable convolution operation [16], $K_i^{contrast}$ represents the contrastive convolution kernels shown in Fig. 4, and $i$ indexes the location of the "1" around the "-1". As shown in Fig. 4, the contrastive depth loss aims to learn the topography of each pixel, constraining the contrast from a pixel to its neighbors.

Putting these together, the single-frame loss can be written as:

$L_{single} = L_{euclidean} + L_{contrast},$

where $L_{single}$ is the final loss used in our single-frame part.
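The two loss terms can be sketched as follows, assuming eight 3×3 contrast kernels with −1 at the center and 1 at one of the eight neighbor positions, as depicted in Fig. 4 (the naive loop convolution and function names are illustrative, not the paper's implementation):

```python
import numpy as np

def contrast_kernels():
    # Eight 3x3 kernels: "-1" at the center, "1" at one neighbor position.
    kernels = []
    for i in range(3):
        for j in range(3):
            if i == 1 and j == 1:
                continue
            k = np.zeros((3, 3))
            k[1, 1] = -1.0
            k[i, j] = 1.0
            kernels.append(k)
    return kernels

def conv2d_valid(img, k):
    # Plain "valid" 2D cross-correlation, kept naive for readability.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def euclidean_depth_loss(pred, gt):
    return np.mean((pred - gt) ** 2)

def contrastive_depth_loss(pred, gt):
    # Squared difference of the "contrast" responses of prediction and
    # ground truth, summed over the eight kernels.
    return sum(np.mean((conv2d_valid(pred, k) - conv2d_valid(gt, k)) ** 2)
               for k in contrast_kernels())

def single_frame_loss(pred, gt):
    return euclidean_depth_loss(pred, gt) + contrastive_depth_loss(pred, gt)
```

Note that adding a constant offset to the prediction leaves the contrastive term at zero while the Euclidean term grows, illustrating that the contrastive loss constrains only the local topography, not the absolute depth.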

3.3 Multi-frame Part

In this part, we explore the short-term and long-term motion in the temporal domain. Specifically, the short-term motion is extracted by the optical flow guided feature block (OFFB), and the long-term motion is captured by the ConvGRU module.

3.3.1 Short-term Motion

Figure 6: The architecture of our OFF block.

We use the OFF block (OFFB), derived from [32], to extract short-term motion. The famous brightness constancy constraint in traditional optical flow can be formulated as:

$I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t),$

where $I(x, y, t)$ represents the brightness at location $(x, y)$ of the frame at time $t$. From this constraint, we can derive:

$\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0,$

where $v_x$ and $v_y$ represent the two-dimensional velocity of the pixel at $(x, y)$, and $\frac{\partial I}{\partial x}$, $\frac{\partial I}{\partial y}$ and $\frac{\partial I}{\partial t}$ are the gradients of $I$ with respect to $x$, $y$ and time, respectively. $(v_x, v_y)$ is exactly the optical flow. From this equation, we can see that the gradient vector $[\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}, \frac{\partial I}{\partial t}]$ is orthogonal to $[v_x, v_y, 1]$, and is thus guided by the optical flow. By replacing the image $I$ with a feature level $F(I)$ in the equations above, we obtain the optical flow guided features (OFF) [32]: the corresponding feature-level spatial and temporal gradients, which encode motion information guided by the feature-level optical flow.

Different from OFF [32], we add extra spatial and temporal gradients and morphological information into our own OFF block (OFFB). As shown in Fig. 6, there are five sub-modules: a reduced feature, obtained from the original single-frame feature at the current feature level and time to preserve fundamental morphological information; the spatial gradients at the current and the next frame; the temporal gradient between the two frames; and the OFFB feature from the previous feature level. To reduce the training burden, we apply a reduction operator (a convolution layer) to the original feature. We concatenate all five sub-modules and feed them into a convolution layer to reduce the feature dimension and produce the OFFB feature of the current level. More details about the spatial and temporal gradients can be found in [32]. In the remainder of this paper, we denote the output of the final OFF block as the OFFB feature.
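The gradient sub-modules of such a block can be sketched with simple finite differences (the real block computes gradients at the feature level inside the network, additionally concatenates the reduced feature and the previous-level OFFB feature, and fuses everything with convolution layers; this standalone sketch only illustrates the gradient terms):

```python
import numpy as np

def spatial_grads(F):
    # Horizontal and vertical gradients of a feature map (finite differences).
    gx = np.zeros_like(F)
    gy = np.zeros_like(F)
    gx[:, 1:] = F[:, 1:] - F[:, :-1]
    gy[1:, :] = F[1:, :] - F[:-1, :]
    return gx, gy

def off_gradients(F_t, F_t1):
    """Gradient sub-modules of an OFF-style block for the feature maps of two
    consecutive frames: spatial gradients at both times plus the temporal
    gradient, stacked into a (5, H, W) tensor."""
    gx_t, gy_t = spatial_grads(F_t)
    gx_t1, gy_t1 = spatial_grads(F_t1)
    g_temp = F_t1 - F_t  # temporal gradient between the two frames
    return np.stack([gx_t, gy_t, gx_t1, gy_t1, g_temp])
```

For a static scene the temporal-gradient channel vanishes, so whatever the downstream layers learn from it necessarily encodes motion.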

3.3.2 Long-term Motion

As a short-term motion feature, the OFFB feature primarily captures the motion between two consecutive frames, but has difficulty fusing long-sequence motion. We therefore resort to a ConvGRU module to capture the long-term motion.

GRU [9] derives from Long Short-Term Memory (LSTM) [15], with a simpler structure and fewer trainable parameters; like LSTM, it is designed to process long sequences. However, a standard GRU neglects spatial information in its hidden units. We therefore introduce convolution operations into the hidden layers to model spatiotemporal sequences, yielding Convolutional Gated Recurrent Units (ConvGRU), related to ConvLSTM [31]. The ConvGRU can be described as:

$R_t = \sigma(K_r \circledast [H_{t-1}, X_t]), \quad U_t = \sigma(K_u \circledast [H_{t-1}, X_t]),$
$\tilde{H}_t = \tanh(K_h \circledast [R_t \circ H_{t-1}, X_t]), \quad H_t = (1 - U_t) \circ H_{t-1} + U_t \circ \tilde{H}_t,$

where $X_t$, $H_t$, $U_t$ and $R_t$ are the matrices of input, output, update gate and reset gate, $K_r$, $K_u$ and $K_h$ are the kernels of the convolution layers, $\circledast$ is the convolution operation, $\circ$ denotes element-wise product, and $\sigma$ is the sigmoid activation function. By feeding the OFFB features into the ConvGRU, we obtain residual depth maps $\hat{D}_t^{r}$, one for each of the $N_f$ input frames. Then, based on the residual idea, we integrate the single-frame depth map and the residual depth map:

$\hat{D}_t^{multi} = \hat{D}_t^{single} + \alpha\, \hat{D}_t^{r},$

where $\alpha$ is the weight of the residual term in $\hat{D}_t^{multi}$. Finally, we build up the set of multi-frame depth maps $\{\hat{D}_t^{multi}\}_{t=1}^{N_f}$.
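A minimal cell consistent with the GRU gating scheme can be sketched as follows (for brevity the "convolutions" use 1×1 kernels, i.e., per-pixel channel mixing via einsum; a real ConvGRU would use e.g. 3×3 kernels so the hidden state propagates spatial context, and the random initialization here is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvGRUCell:
    """Minimal ConvGRU cell sketch with 1x1 "convolution" kernels."""

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hid_ch, in_ch + hid_ch)
        self.Wr = rng.normal(0.0, 0.1, shape)  # reset-gate kernel
        self.Wu = rng.normal(0.0, 0.1, shape)  # update-gate kernel
        self.Wh = rng.normal(0.0, 0.1, shape)  # candidate-state kernel

    def step(self, x, h):
        # x: (in_ch, H, W) input features; h: (hid_ch, H, W) hidden state.
        xh = np.concatenate([x, h], axis=0)
        r = sigmoid(np.einsum('oc,chw->ohw', self.Wr, xh))  # reset gate
        u = sigmoid(np.einsum('oc,chw->ohw', self.Wu, xh))  # update gate
        xrh = np.concatenate([x, r * h], axis=0)
        h_tilde = np.tanh(np.einsum('oc,chw->ohw', self.Wh, xrh))
        return (1.0 - u) * h + u * h_tilde  # gated interpolation
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden activations stay bounded across arbitrarily long sequences.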

3.3.3 Multi-frame Loss

We make the final decision to discriminate live vs. spoof in the multi-frame part. Nevertheless, in view of potentially unclear depth maps, we also consider a binary loss to look for the difference between living and spoofing depth maps. Note that the depth supervision is decisive, whereas the binary supervision plays an assisting role in discriminating the different kinds of depth maps. We thus establish the following multi-frame loss:

$L_{binary} = \mathrm{CrossEntropy}\big(f_{cls}(\hat{D}^{multi}), B_t\big), \qquad L_{multi} = \sum_{t=1}^{N_f} L_{depth}\big(\hat{D}_t^{multi}, D_t\big) + \beta\, L_{binary},$

where $D_t$ and $B_t$ are the depth label and binary label at time $t$, $\hat{D}^{multi}$ is the concatenation of the estimated depth maps of the $N_f$ frames, $f_{cls}$ denotes two fully connected layers and one softmax layer applied to the concatenated depth maps, which output the logits of the two classes, and $\beta$ is the weight of the binary loss in the final multi-frame loss $L_{multi}$. The binary loss is a cross entropy loss, and the multi-frame loss is simply the sum of the binary loss and the depth losses.

4 Experiment

4.1 Databases and Metrics

4.1.1 Databases

Four databases - OULU-NPU [7], SiW [20], CASIA-MFSD [37], and Replay-Attack [8] - are used in our experiments. OULU-NPU [7] is a high-resolution database consisting of 4950 real-access and spoofing videos; it provides four protocols to validate the generalization of models. SiW [20] contains more live subjects, with three protocols used for testing. CASIA-MFSD [37] and Replay-Attack [8] contain low-resolution videos; we use these two databases for cross testing.

4.1.2 Metrics

On the OULU-NPU and SiW datasets, we follow the original protocols and metrics for a fair comparison. OULU-NPU and SiW utilize 1) the Attack Presentation Classification Error Rate (APCER), which evaluates the highest error among all PAIs (e.g., print or display), 2) the Bona Fide Presentation Classification Error Rate (BPCER), which evaluates the error on real-access data, and 3) the ACER [17], which is the mean of APCER and BPCER:

$ACER = \frac{APCER + BPCER}{2}.$

HTER is adopted in the cross testing between CASIA-MFSD and Replay-Attack; it is the mean of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR):

$HTER = \frac{FRR + FAR}{2}.$
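These metrics can be implemented directly (the dictionary-based interface and the threshold convention, where a higher score means more live, are assumptions made for illustration):

```python
def apcer_bpcer_acer(attack_scores_by_pai, bona_fide_scores, threshold):
    """attack_scores_by_pai maps each PAI (e.g. 'print', 'display') to the
    liveness scores of its attack samples; a higher score means more live,
    so an attack is wrongly accepted when its score exceeds the threshold."""
    # APCER: worst-case acceptance rate over all PAIs.
    apcer = max(sum(s > threshold for s in ss) / len(ss)
                for ss in attack_scores_by_pai.values())
    # BPCER: fraction of bona fide (live) samples wrongly rejected.
    bpcer = sum(s <= threshold for s in bona_fide_scores) / len(bona_fide_scores)
    return apcer, bpcer, (apcer + bpcer) / 2.0  # ACER = mean of the two

def hter(frr, far):
    # Half Total Error Rate: mean of false rejection and false acceptance rates.
    return (frr + far) / 2.0
```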
4.2 Implementation Details

4.2.1 Training Strategy

The proposed method combines a single-frame part and a multi-frame part, and a two-stage strategy is applied in training. Stage 1: we train the single-frame part with the single-frame depth loss, in order to learn a fundamental representation. Stage 2: we fix the parameters of the single-frame part and fine-tune the parameters of the multi-frame part with the depth loss and binary loss. Note that the diverse data should be adequately shuffled for training stability and generalization of the learnt model. The network is fed with sequential frames sampled at an interval of three frames; this sampling interval lets the sampled frames retain enough temporal information within the limits of GPU memory.

4.2.2 Testing Strategy

For the final classification score, we feed the sequential frames into the network and obtain the multi-frame depth maps and the living logit. The final living score can be obtained by:

$score = \beta\, logit_{living} + \frac{1}{N_f} \sum_{t=1}^{N_f} \mathrm{mean}\big(\hat{D}_t^{multi} \circ M_t\big),$

where $\beta$ is the same weight as in the multi-frame loss, $M_t$ is the facial mask at frame $t$, which can be generated from the dense face landmarks of PRNet [13], and the second term denotes that we compute the mean of the predicted depth values within the facial areas as one part of the score.
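One plausible reading of this scoring rule is sketched below (the exact weighting between the binary logit and the depth term, here a convex combination with weight `beta`, is an assumption of this sketch rather than the paper's verified formula):

```python
import numpy as np

def living_score(living_logit, depth_maps, masks, beta=0.5):
    """Combine the softmax living logit with the mean predicted depth inside
    the facial masks. The weight `beta` mirrors the binary-loss weight and is
    an assumption of this sketch, not a value taken from the paper."""
    depth_part = float(np.mean([dm[m > 0].mean()
                                for dm, m in zip(depth_maps, masks)]))
    return beta * living_logit + (1.0 - beta) * depth_part
```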

4.2.3 Hyperparameter Setting

We implement our proposed method in Tensorflow [1], with a learning rate of 3e-3 for the single-frame part and 1e-2 for the multi-frame part. The batch size of the single-frame part is 10, and that of the multi-frame part is 2 with a sequence length of 5 in most of our experiments, except for protocol 3 of OULU-NPU, where the batch size is 4 and the sequence length is 3. The Adadelta optimizer is used in our training procedure, with ρ set to 0.95 and ε set to 1e-8. We set the loss weights to optimal values based on our experimental experience, and since the analysis in the following section shows that protocol 4 of OULU-NPU is the most challenging for generalization, we recommend the corresponding parameters for realistic scenes.

4.3 Experimental Comparison

4.3.1 Intra Testing

Prot. Method APCER(%) BPCER(%) ACER(%)
1 CPqD 2.9 10.8 6.9
  GRADIANT 1.3 12.5 6.9
  FAS-BAS [20] 1.6 1.6 1.6
  OURs 2.5 0.0 1.3
2 MixedFASNet 9.7 2.5 6.1
  FAS-BAS [20] 2.7 2.7 2.7
  GRADIANT 3.1 1.9 2.5
  OURs 1.7 2.0 1.9
3 MixedFASNet 5.3±6.7 7.8±5.5 6.5±4.6
  OURs 5.9±1.0 5.9±1.0 5.9±1.0
  GRADIANT 2.6±3.9 5.0±5.3 3.8±2.4
  FAS-BAS [20] 2.7±1.3 3.1±1.7 2.9±1.5
4 Massy_HNU 35.8±35.3 8.3±4.1 22.1±17.6
  GRADIANT 5.0±4.5 15.0±7.1 10.0±5.0
  FAS-BAS [20] 9.3±5.6 10.4±6.0 9.5±6.0
  OURs 14.0±3.4 4.1±3.4 9.2±3.4
Table 1: The results of intra testing on four protocols of OULU-NPU [7].
Prot. Method APCER(%) BPCER(%) ACER(%)
1 FAS-BAS [20] 3.58 3.58 3.58
  OURs 0.96 0.50 0.73
2 FAS-BAS [20] 0.57±0.69 0.57±0.69 0.57±0.69
  OURs 0.08±0.14 0.21±0.14 0.15±0.14
3 FAS-BAS [20] 8.31±3.81 8.31±3.80 8.31±3.81
  OURs 3.10±0.81 3.09±0.81 3.10±0.81
Table 2: The results of intra testing on three protocols of SiW [20].

We compare the performance of intra testing on the OULU-NPU and SiW datasets. OULU-NPU proposes four protocols to evaluate the generalization of face presentation attack detection (PAD) methods. Protocol 1 evaluates generalization under previously unseen illumination and background scenes. Protocol 2 evaluates generalization under unseen attack mediums, such as unseen printers or displays. Protocol 3 utilizes a Leave One Camera Out (LOCO) protocol to study the effect of input camera variation. Protocol 4 integrates all the constraints from protocols 1 to 3 and is therefore the most challenging. Table 1 shows that our proposed method ranks first on protocols 1, 2 and 4, and third on protocol 3. Our model thus generalizes well to external environments and attack mediums, and is slightly worse with respect to input camera variation. It is worth noting that our method has the lowest mean and standard deviation of ACER on protocol 4, the protocol most representative of real-life scenarios.

Tab. 2 compares the performance of our method with FAS-BAS [20] on SiW [20]. According to the purposes of the three SiW protocols and the results in Tab. 2, our method shows clear advantages in generalizing to (1) variations in face pose and expression, (2) variations of spoof mediums, and (3) cross-PAI settings.

4.3.2 Cross Testing

Method CASIA-MFSD → Replay-Attack Replay-Attack → CASIA-MFSD
Motion [11] 50.2 47.9
LBP-1 [11] 55.9 57.6
LBP-TOP [11] 49.7 60.6
Motion-Mag [3] 50.1 47.0
Spectral cubes [28] 34.4 50.0
CNN [35] 48.5 45.5
LBP-2 [4] 47.0 39.6
Colour Texture [5] 30.3 37.7
FAS-BAS [20] 27.6 28.4
OURs 17.5 24.0
Table 3: The results of cross testing between CASIA-MFSD and Replay-Attack, where training is performed on one database and testing on the other. The evaluation metric is HTER(%).

We utilize the CASIA-MFSD and Replay-Attack datasets for cross testing, which can be regarded as two protocols: training on CASIA-MFSD and testing on Replay-Attack (protocol CR), and training on Replay-Attack and testing on CASIA-MFSD (protocol RC). In Table 3, the HTER of our proposed method is 17.5 on protocol CR and 24.0 on protocol RC, reducing the error by 36.6% and 15.5% respectively compared with the previous state of the art. This improvement in cross testing demonstrates the generalization ability and superiority of our proposed method.

4.3.3 Ablation Study

Module Single-frame OFF-block ConvGRU D-S B-S ACER(%)
Model 1 ✓ - - ✓ - 4.4
Model 2 (w/o B-S) ✓ ✓ - ✓ - 3.5
Model 2 ✓ ✓ - ✓ ✓ 2.3
Model 3 (w/o B-S) ✓ - ✓ ✓ - 3.3
Model 3 ✓ - ✓ ✓ ✓ 3.1
Model 4 (w/o B-S) ✓ ✓ ✓ ✓ - 1.7
Model 4 ✓ ✓ ✓ ✓ ✓ 1.3
Table 4: The results of the ablation study on protocol 1 of OULU-NPU. B-S denotes binary supervision, and D-S denotes depth supervision.
Module Euclidean Depth Loss Contrastive Depth Loss ACER(%)
Model 1 (w/o contrastive) ✓ - 5.8
Model 1 ✓ ✓ 4.4
Table 5: The results of the ablation study on the influence of the contrastive depth loss on protocol 1 of OULU-NPU.

We conduct experiments on seven architectures to demonstrate the advantages of our proposed sequential structure under depth supervision. As shown in Table 4, Model 1 is the single-frame part of our method. Model 2 combines the single-frame CNN with the OFFB, and Model 3 combines the single-frame CNN with ConvGRU, each under binary and depth supervision; their "w/o B-S" variants discard the binary supervision. Model 4 is our complete architecture integrating all modules, and its "w/o B-S" variant likewise discards the binary supervision. Comparing the ACER of Models 2 and 3 with that of Model 1, we see that the OFFB module and the ConvGRU module both improve face anti-spoofing performance, and the ACER of Model 4 shows that their combination has an even more positive effect. By discarding the binary supervision, we test its effect on Models 2, 3 and 4: the multi-frame models with depth supervision alone still outperform the single-frame model, and the binary supervision indeed assists the model in distinguishing live vs. spoof.

In Table 5, we study the influence of the contrastive depth loss. Model 1 is our single-frame model supervised by both the Euclidean depth loss and the contrastive depth loss, while its variant is supervised only by the Euclidean depth loss. Comparing the two, we can see that the contrastive depth loss improves the generalization of our model.

Moreover, inference with the single-frame model costs around 18 ms and inference with the full multi-frame model around 96 ms, which indicates that our method is efficient enough to be applied in practice.

4.3.4 Qualitative Analysis

Figure 7: The generated results of a group of hard samples in OULU-NPU. D-score denotes the living score calculated as the mean of depth values in the facial area, i.e., the depth part of the final testing score.

Figure 7 presents the generated depth maps of a group of hard samples for the same person ID in OULU-NPU. From Figure 7, we can see that our multi-frame maps are more complete than the single-frame maps in real scenes. Although the multi-frame maps in spoofing scenes are visually noisier than the single-frame ones, the discrimination is obvious when considering only the multi-frame maps. Specifically, in the single-frame maps, the D-score of the real scene is 0.368, lower than that of the replay2 scene, probably because of the illumination condition in the replay2 scene. By contrast, in the multi-frame maps, the D-score of the real scene is higher than the D-scores of all attack scenes, and the discrimination among the multi-frame maps is obvious.

It is worth noting that this group of images shows hard samples: failure cases for the single-frame model that become success cases for the multi-frame model. Because of the residual architecture, which adds the multi-frame depth to the single-frame depth, the depth values in the multi-frame maps are higher than those in the single-frame maps.

5 Discussion

In Sec. 3.1 we discussed only the simple case of equal motion in the vertical dimension. In reality, facial variation and motion are more complex, involving forward/backward motion, rotation, deformation and so on. In these cases, we still assume that discriminative patterns of temporal depth and motion exist between living and spoofing faces in the temporal domain. Extending the theory and its applications in these directions is our future research; face anti-spoofing based on facial motion and depth is indeed promising and valuable.

6 Conclusions

In this paper, we analyze the usefulness of motion and depth in presentation attack detection. Based on this analysis, we propose a novel face anti-spoofing method that is depth supervised and exploits adequate temporal information. To capture spatiotemporal information, we take the OFF block as a short-term motion module and the ConvGRU as a long-term motion module, and combine them in our architecture. Our proposed method can discover spoof patterns efficiently and accurately under depth supervision. Extensive experimental results demonstrate the superiority of our method.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [2] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu. Face anti-spoofing using patch and depth-based cnns. In IJCB, pages 319–328, 2017.
  • [3] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh. Computationally efficient face spoofing detection with motion magnification. In CVPRW, pages 105–110, 2013.
  • [4] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-spoofing based on color texture analysis. In ICIP, pages 2636–2640, 2015.
  • [5] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face spoofing detection using colour texture analysis. IEEE Transactions on Information Forensics and Security, 11(8):1818–1830, 2016.
  • [6] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face antispoofing using speeded-up robust features and fisher vector encoding. IEEE Signal Processing Letters, 24(2):141–145, 2017.
  • [7] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid. Oulu-npu: A mobile face presentation attack database with real-world variations. In FGR, pages 612–618, 2017.
  • [8] I. Chingovska, A. Anjos, and S. Marcel. On the effectiveness of local binary patterns in face anti-spoofing. In Biometrics Special Interest Group, pages 1–7, 2012.
  • [9] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • [10] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. LBP-TOP based countermeasure against face spoofing attacks. In ACCV, pages 121–132, 2012.
  • [11] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. Can face anti-spoofing countermeasures work in a real world scenario? In ICB, pages 1–8, 2013.
  • [12] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and K.-W. Cheung. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. Journal of Visual Communication and Image Representation, 38:451–460, 2016.
  • [13] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018.
  • [14] J. Gan, S. Li, Y. Zhai, and C. Liu. 3D convolutional neural network based face anti-spoofing. In ICMIP, pages 1–5, 2017.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [17] International Organization for Standardization. ISO/IEC JTC 1/SC 37 Biometrics: Information technology - Biometric presentation attack detection - Part 1: Framework. https://www.iso.org/obp/ui/iso, 2016.
  • [18] J. Komulainen, A. Hadid, and M. Pietikainen. Context based face anti-spoofing. In BTAS, pages 1–8, 2013.
  • [19] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An original face anti-spoofing approach using partial convolutional neural network. In IPTA, pages 1–6, 2016.
  • [20] Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In CVPR, pages 389–398, 2018.
  • [21] O. Lucena, A. Junior, V. Moia, R. Souza, E. Valle, and R. Lotufo. Transfer learning using convolutional neural networks for face anti-spoofing. In International Conference Image Analysis and Recognition, pages 27–34, 2017.
  • [22] J. Määttä, A. Hadid, and M. Pietikäinen. Face spoofing detection from single images using micro-texture analysis. In IJCB, pages 1–7, 2011.
  • [23] C. Nagpal and S. R. Dubey. A performance evaluation of convolutional neural networks for face anti spoofing. arXiv preprint arXiv:1805.04176, 2018.
  • [24] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, pages 1–8, 2007.
  • [25] K. Patel, H. Han, and A. K. Jain. Cross-database face antispoofing with robust feature representation. In Chinese Conference on Biometric Recognition, pages 611–619, 2016.
  • [26] K. Patel, H. Han, and A. K. Jain. Secure face unlock: Spoof detection on smartphones. IEEE transactions on information forensics and security, 11(10):2268–2283, 2016.
  • [27] B. Peixoto, C. Michelassi, and A. Rocha. Face liveness detection under bad illumination conditions. In ICIP, pages 3557–3560. IEEE, 2011.
  • [28] A. Pinto, H. Pedrini, W. R. Schwartz, and A. Rocha. Face spoofing detection through visual codebooks of spectral temporal cubes. IEEE Transactions on Image Processing, 24(12):4726–4740, 2015.
  • [29] B. Shahraray and M. K. Brown. Robust depth estimation from optical flow. In ICCV, 1988.
  • [30] R. Shao, X. Lan, and P. C. Yuen. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3d mask face anti-spoofing. In IJCB, pages 748–755, 2017.
  • [31] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
  • [32] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In CVPR, pages 1390–1399, 2018.
  • [33] X. Tan, Y. Li, J. Liu, and L. Jiang. Face liveness detection from a single image with sparse low rank bilinear discriminative model. In ECCV, pages 504–517, 2010.
  • [34] Z. Xu, S. Li, and W. Deng. Learning temporal features using lstm-cnn architecture for face anti-spoofing. In ACPR, pages 141–145, 2015.
  • [35] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601, 2014.
  • [36] J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection with component dependent descriptor. In ICB, page 2, 2013.
  • [37] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li. A face antispoofing database with diverse attacks. In ICB, pages 26–31, 2012.

Appendix A Temporal Depth in Face Anti-spoofing

In this section, we use some simple examples to explain why exploiting temporal depth and motion is reasonable in the face anti-spoofing task.

Figure 8: The schematic diagram of motion and depth variation in different scenes.

A.1 Basic Scene

As shown in Fig. 8(a), node $O$ denotes the camera focus and $\Gamma$ represents the image plane of the camera. $P$ is one facial point at time $t$, and $P'$ is the corresponding point when $P$ moves down vertically for $\Delta d$ at time $t+1$. For example, $P$ can be the point of the nose or ear. $f$ denotes the focal distance, and $z$ is the horizontal distance from the focal point to the point $P$. $y$ and $y'$ are the corresponding coordinates in the vertical dimension, and $Y$ is the vertical coordinate of $P$ in the scene. When $P$ moves down vertically to $P'$ for $\Delta d$, the motion is reflected on the image plane as $u = y' - y$. According to the camera model, we can obtain:

$$\frac{y}{f} = \frac{Y}{z}, \qquad \frac{y'}{f} = \frac{Y + \Delta d}{z} \tag{19}$$

When $P$ moves down vertically for $\Delta d$ to $P'$, the motion $u$ on the image plane can be achieved:

$$u = y' - y = \frac{f\,\Delta d}{z} \tag{20}$$

As shown in Fig. 8(b), to distinguish the three points we write $P_1$, $P_2$ and $P_3$, with depths $z_1$, $z_1 + \Delta z_{12}$ and $z_1 + \Delta z_{13}$, transform Eq. 20 and get $u_1$, $u_2$ and $u_3$ ($u_2$ and $u_3$ are not shown in the figure):

$$u_1 = \frac{f\,\Delta d}{z_1}, \qquad u_2 = \frac{f\,\Delta d}{z_1 + \Delta z_{12}}, \qquad u_3 = \frac{f\,\Delta d}{z_1 + \Delta z_{13}} \tag{21}$$

where $\Delta z_{12}$ and $\Delta z_{13}$ are the corresponding depth differences. From Eq. 21, there are:

$$\Delta z_{12} = \frac{(u_1 - u_2)\,z_1}{u_2}, \qquad \Delta z_{13} = \frac{(u_1 - u_3)\,z_1}{u_3} \tag{22}$$

Removing $z_1$ from Eq. 22, the relative depth $r$ can be obtained:

$$r = \frac{\Delta z_{12}}{\Delta z_{13}} = \frac{(u_1 - u_2)\,u_3}{(u_1 - u_3)\,u_2} \tag{23}$$

From this equation, we can see that the relative depth $r$ can be estimated from the motions of three points, when $(u_1 - u_3)\,u_2 \neq 0$. The equations above concern real scenes. In the following, we introduce the derivation for attack scenes.
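As a concrete check of the relation in Eq. 23, the following sketch generates the three motions from the pinhole relation and recovers the relative depth; the focal length and depth values are our own toy numbers, chosen only for illustration:

```python
def relative_depth(u1, u2, u3):
    """Relative depth r = dz12/dz13 estimated from the vertical image
    motions of three facial points (Eq. 23); requires (u1 - u3) * u2 != 0."""
    return (u1 - u2) * u3 / ((u1 - u3) * u2)

# Pinhole relation: u_i = f * dd / z_i for a shared vertical motion dd.
f, dd = 800.0, 5.0                  # assumed focal length (px) and motion (mm)
z1, z2, z3 = 500.0, 520.0, 540.0    # assumed depths of three facial points (mm)
u1, u2, u3 = (f * dd / z for z in (z1, z2, z3))

r = relative_depth(u1, u2, u3)      # estimated from image motions only
true_r = (z2 - z1) / (z3 - z1)      # ground-truth dz12/dz13 = 0.5
```

Since the motions are generated exactly by the camera model, the two values agree up to floating-point error; if two of the depths coincide (z1 = z3), the denominator vanishes and the estimate is undefined, which is exactly the condition noted above.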

A.2 Attack Scene

A.2.1 What if the attack carriers move?

As shown in Fig. 8(c), there are two image spaces in attack scenes: one is the recording image space, where we replace $f$ with the recording focal distance $f_1$, and the other is the realistic image space, where we replace $f$ with the realistic focal distance $f_2$. In the recording image space, it's similar to Eq. 21:

$$u_1^r = \frac{f_1\,\Delta d}{z_1}, \qquad u_2^r = \frac{f_1\,\Delta d}{z_1 + \Delta z_{12}}, \qquad u_3^r = \frac{f_1\,\Delta d}{z_1 + \Delta z_{13}} \tag{24}$$

where $u_1^r$, $u_2^r$ and $u_3^r$ are the magnitudes of the optical flow when the three points move down vertically for $\Delta d$.

In the realistic image space, there are:

$$v_i = \frac{f_2\,u_i^r}{z_s}, \qquad i = 1, 2, 3 \tag{25}$$

where $u_1^r$, $u_2^r$ and $u_3^r$ are the motions of the three points on the recording image plane, $v_1$, $v_2$ and $v_3$ are the corresponding values mapped onto the realistic image plane, and $z_s$ denotes the distance from the realistic camera to the recording screen.

Actually, there are $v_i = f_2\,u_i^r / z_s$ only if the recording screen is static. Now, a vertical motion $\Delta d_s$ is given to the recording screen. By inserting $\Delta d_s$, we transform Eq. 25 into:

$$\hat v_i = \frac{f_2\,(u_i^r + \Delta d_s)}{z_s}, \qquad i = 1, 2, 3 \tag{26}$$

Because only $\hat v_1$, $\hat v_2$ and $\hat v_3$ can be observed directly in the sequential images, we can only estimate the relative depth via them. So we leverage Eq. 23 to estimate the relative depth $\hat r$:

$$\hat r = \frac{(\hat v_1 - \hat v_2)\,\hat v_3}{(\hat v_1 - \hat v_3)\,\hat v_2} \tag{27}$$

and then we can insert Eq. 26 into Eq. 27 to get:

$$\hat r = \frac{(u_1^r - u_2^r)\,(u_3^r + \Delta d_s)}{(u_1^r - u_3^r)\,(u_2^r + \Delta d_s)} \tag{28}$$
According to the equations above, some important conclusions can be summarized:

  • If $u_1^r = u_2^r = u_3^r$, the scene can be recognized as a print attack, and Eq. 28 becomes invalid, for $\hat v_1 = \hat v_2 = \hat v_3$ and the denominator in Eq. 27 is zero. So here we use Eq. 26 and

$$u_1^r = u_2^r = u_3^r \tag{29}$$

    to obtain:

$$\hat v_1 = \hat v_2 = \hat v_3 = \frac{f_2\,(u_1^r + \Delta d_s)}{z_s} \tag{30}$$

    In this case, it's obvious that the facial relative depth is abnormal and the face is fake.

  • If $u_1^r$, $u_2^r$ and $u_3^r$ are not all equal, the scene can be recognized as a replay attack.

    • If $\Delta d_s = 0$, there is:

$$\hat r = \frac{(u_1^r - u_2^r)\,u_3^r}{(u_1^r - u_3^r)\,u_2^r} = r \tag{31}$$

      In this case, if the two image planes are parallel and the single-frame model cannot detect the static spoof cues, the model will fail in the task of face anti-spoofing, since it can hardly find any abnormality in the relative depth estimated from the facial motion. We call this scene the Perfect Spoofing Scene (PSS). Of course, setting up a PSS costs a lot and is nearly impossible in practice.

    • If $\Delta d_s \neq 0$ and we want to meet Eq. 31, the following equation should be satisfied:

$$\frac{(u_1^r - u_2^r)\,(u_3^r + \Delta d_s)}{(u_1^r - u_3^r)\,(u_2^r + \Delta d_s)} = \frac{(u_1^r - u_2^r)\,u_3^r}{(u_1^r - u_3^r)\,u_2^r}, \qquad \text{i.e.,} \quad u_2^r = u_3^r \tag{32}$$

      However, in our assumption the three depths differ, so $u_2^r \neq u_3^r$ and:

$$\hat r \neq r \tag{33}$$

      This equation indicates that the relative depth can't be estimated precisely if the attack carrier moves in a replay attack. Moreover, $\Delta d_s$ usually varies when the attack carrier moves in a long-term sequence, leading to variation of $\hat r$. This kind of abnormality is more obvious along with the long-term motion.

  • If $\Delta z_{\max}$ denotes the largest depth difference among facial points, then $\Delta z_{1i}/\Delta z_{\max} \in [0, 1]$, showing that constraining the depth labels of living faces to $[0, 1]$ is valid. As analyzed above, for spoofing scenes the abnormal relative depth usually varies over time, so it is too complex to compute directly. Therefore, we merely set the depth label of spoofing faces to all 0 to distinguish it from the living label, making the model learn the abnormality itself under depth supervision.
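The two attack cases above can be checked numerically. In this sketch (all constants are our own toy values), a flat print attack yields three equal observed motions, which degenerates the estimator of Eq. 27, while a moving replay carrier biases the estimate away from the true relative depth:

```python
def relative_depth(v1, v2, v3):
    """Relative depth estimated from observed motions (Eq. 27);
    undefined when (v1 - v3) * v2 == 0."""
    return (v1 - v2) * v3 / ((v1 - v3) * v2)

f2, zs, dds = 800.0, 400.0, 3.0   # assumed realistic focal length, screen
                                  # distance, and screen motion

# Print attack: one flat depth, so the recorded motions are all equal and the
# observed motions are too -- the denominator of Eq. 27 becomes zero.
v_print = [f2 * (u + dds) / zs for u in (6.0, 6.0, 6.0)]
degenerate = v_print[0] == v_print[1] == v_print[2]

# Replay attack with a moving carrier: distinct recorded motions, but the
# added screen motion dds shifts each of them (Eq. 26), so the estimated
# relative depth deviates from the true one (Eq. 28).
ur = (8.0, 7.69, 7.41)                                      # recorded motions
r_true = relative_depth(*ur)                                # dds = 0 case
r_hat = relative_depth(*[f2 * (u + dds) / zs for u in ur])  # moving carrier
```

With dds = 0 the two estimates coincide (the Perfect Spoofing Scene); any nonzero, and especially time-varying, screen motion makes r_hat drift, which is the abnormality the depth supervision is meant to expose.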

A.2.2 What if the attack carriers rotate?

As shown in Fig. 8(d), we rotate the recording image plane by the degree $\theta$. $y_1^r$, $y_2^r$ and $y_3^r$ are the coordinates of the three points mapped on the recording image plane. The two black points at the right end of the green double arrows on the recording image plane (vertical) reach the two black points at the left end of the green double arrows on the recording image plane (rotated) when the recording image plane rotates, and the corresponding $y_i^r$ values do not change after the rotation. For convenient computation, we still map the rotated points back to the vertical recording image plane; the coordinates after mapping are $\hat y_1$, $\hat y_2$ and $\hat y_3$, and $l_1$, $l_2$ and $l_3$ are the corresponding distances shown in the figure. According to the relationship of the fundamental variables, we can obtain:

$$\bar y_i = y_i^r \cos\theta, \qquad l_i = y_i^r \sin\theta \tag{34}$$

$$\frac{\hat y_i}{f_2} = \frac{\bar y_i}{z_s + l_i} \tag{35}$$

Deriving from the equations above, we can get $\hat y_1$:

$$\hat y_1 = \frac{f_2\, y_1^r \cos\theta}{z_s + y_1^r \sin\theta} \tag{36}$$

and $\hat y_2$, $\hat y_3$ can also be calculated by imitating Eq. 36:

$$\hat y_2 = \frac{f_2\, y_2^r \cos\theta}{z_s + y_2^r \sin\theta}, \qquad \hat y_3 = \frac{f_2\, y_3^r \cos\theta}{z_s + y_3^r \sin\theta} \tag{37}$$

Subtracting $\hat y_2$ from $\hat y_1$, the following is achieved:

$$\hat y_1 - \hat y_2 = \frac{f_2\, z_s \cos\theta\,(y_1^r - y_2^r)}{(z_s + y_1^r \sin\theta)(z_s + y_2^r \sin\theta)} \tag{38}$$

Obviously, $z_s + y_i^r \sin\theta > 0$. We define $k_i = z_s + y_i^r \sin\theta$. And then we get the following equation:

$$\hat y_1 - \hat y_2 = \frac{f_2\, z_s \cos\theta}{k_1 k_2}\,(y_1^r - y_2^r) \tag{39}$$

where the relationship between $\hat y_1$ and $\hat y_3$ is just like that between $\hat y_1$ and $\hat y_2$, as well as $\hat y_2$ and $\hat y_3$. Note that for simplification, we only discuss the situation in which $y_1^r$, $y_2^r$ and $y_3^r$ are all positive.

Reviewing Eq. 25, we can confirm that the observed motions become $v_1 \approx m_1 u_1^r$, $v_2 \approx m_2 u_2^r$, $v_3 \approx m_3 u_3^r$. According to Eq. 27, the final $\hat r$ can be estimated:

$$\hat r = \frac{(m_1 u_1^r - m_2 u_2^r)\, m_3 u_3^r}{(m_1 u_1^r - m_3 u_3^r)\, m_2 u_2^r} \tag{40}$$

where $m_i$ and $k_i$ can be represented as:

$$m_i = \frac{f_2\, z_s \cos\theta}{k_i^2}, \qquad k_i = z_s + y_i^r \sin\theta \tag{41}$$

where $i \in \{1, 2, 3\}$. Observing Eq. 40, we can see that if $\theta = 0$ or $y_1^r = y_2^r = y_3^r$, there will be $m_1 = m_2 = m_3$ and $\hat r = r$.

Now, we discuss the sufficient condition of $\hat r \neq r$. When $\theta \neq 0$ and the coordinates $y_i^r$ differ, $m_1$, $m_2$ and $m_3$ are not all equal. Similar to Eq. 21, the relationship of the variables can be achieved:

$$y_1^r = \frac{f_1\, Y_1}{z_1}, \qquad y_2^r = \frac{f_1\, Y_2}{z_1 + \Delta z_{12}}, \qquad y_3^r = \frac{f_1\, Y_3}{z_1 + \Delta z_{13}} \tag{42}$$

From Eq. 42 and the relations above, we can obtain: