Deep Spatial Gradient and Temporal Depth Learning for Face Anti-spoofing
Face anti-spoofing is critical to the security of face recognition systems. Depth supervised learning has been proven as one of the most effective methods for face anti-spoofing. Despite the great success, most previous works still formulate the problem as a single-frame multi-task one by simply augmenting the loss with depth, while neglecting the detailed fine-grained information and the interplay between facial depths and moving patterns. In contrast, we design a new approach to detect presentation attacks from multiple frames based on two insights: 1) detailed discriminative clues (e.g., spatial gradient magnitude) between living and spoofing faces may be discarded through stacked vanilla convolutions, and 2) the dynamics of 3D moving faces provide important clues in detecting spoofing faces. The proposed method captures discriminative details via a Residual Spatial Gradient Block (RSGB) and efficiently encodes spatio-temporal information via a Spatio-Temporal Propagation Module (STPM). Moreover, a novel Contrastive Depth Loss is presented for more accurate depth supervision. To assess the efficacy of our method, we also collect a Double-modal Anti-spoofing Dataset (DMAD), which provides an actual depth map for each sample. The experiments demonstrate that the proposed approach achieves state-of-the-art results on five benchmark datasets including OULU-NPU, SiW, CASIA-MFSD, Replay-Attack, and the new DMAD. Code will be available at https://github.com/clks-wzz/FAS-SGTD.
Face recognition technology has become an indispensable component in many interactive AI systems for its convenience and human-level accuracy. However, most existing face recognition systems are easily spoofed through presentation attacks (PAs), ranging from printing a face on paper (print attack) to replaying a face on a digital device (replay attack) or wearing a 3D mask (3D-mask attack). Therefore, not only the research community but also the industry has recognized face anti-spoofing [18, 19, 4, 33, 39, 11, 23, 55, 1, 29, 12, 49, 45, 54, 21] as playing a critical role in securing face recognition systems.
In the past few years, both traditional methods [14, 42, 9] and CNN-based methods [35, 38, 20, 24, 46] have shown effectiveness in discriminating between living and spoofing faces. They often formalize face anti-spoofing as a binary classification between spoofing and living images. However, these approaches struggle to capture the intrinsic nature of spoofing patterns, such as the loss of skin details, color distortion, moiré patterns, and spoofing artifacts.
In order to overcome this issue, many auxiliary depth supervised face anti-spoofing methods have been developed. Intuitively, images of living faces contain face-like depth, whereas images of spoofing faces on print or replay carriers have only planar depth. Thus, Atoum et al. and Liu et al. propose single-frame depth supervised CNN architectures and improve presentation attack detection (PAD) accuracy.
By surveying past face anti-spoofing methods, we notice two problems that have not yet been fully solved: 1) Traditional methods usually design local descriptors for solving PAD, while modern deep learning methods instead learn to extract relatively high-level semantic features. Despite their effectiveness, we argue that low-level fine-grained patterns can also play a vital role in distinguishing living and spoofing faces, e.g., the spatial gradient magnitude shown in Fig. 1. How to aggregate local fine-grained information into convolutional networks is thus still unexplored for the face anti-spoofing task. 2) Recent depth supervised face anti-spoofing methods [2, 34] estimate facial depth based on a single frame and leverage depth as dense pixel-wise supervision in a direct manner. We argue that the depth discrimination between living and spoofing faces can be explored more adequately with multiple frames. A vivid and exaggerated example with assumed micro motion is illustrated in Fig. 2.
To address these problems, we present a novel depth supervised spatio-temporal network with a Residual Spatial Gradient Block (RSGB) and a Spatio-Temporal Propagation Module (STPM). Inspired by ResNet, our RSGB aggregates learnable convolutional features with spatial gradient magnitude via a shortcut connection. As a result, both local fine-grained patterns and conventional convolutional features can be captured via stacked RSGBs. To better utilize the information from multiple frames, STPM is designed to propagate short-term and long-term spatio-temporal features into the depth reconstruction. To supervise the models with facial depth more effectively, we propose a Contrastive Depth Loss (CDL) to learn the topography of facial points.
We believe that the accuracy of facial depth directly affects the establishment of the relationship between temporal motion and facial depth. We therefore collect a double-modal anti-spoofing dataset, named the Double-modal Anti-spoofing Dataset (DMAD), which provides an actual depth map for each sample. Extensive experiments show that actual depth is more appropriate for monocular PAD than generated depth. Note that this paper mainly focuses on planar attacks, which are the most common in practice.
We summarize the main contributions below.
We propose a novel depth supervised architecture that captures discriminative details via a Residual Spatial Gradient Block (RSGB) and efficiently encodes spatio-temporal information via a Spatio-Temporal Propagation Module (STPM) from monocular frame sequences.
We develop a Contrastive Depth Loss to learn the topography of facial points for depth supervised PAD.
We collect a double-modal dataset to verify that actual depth is more appropriate for monocular PAD than generated depth. This suggests that collecting depth images paired with RGB images benefits the progress of monocular PAD.
We demonstrate state-of-the-art performance of our method on widely used face anti-spoofing benchmarks.
Roughly speaking, previous face anti-spoofing works generally fall into three categories: binary supervised, depth supervised, and temporal-based methods.
Binary supervised Methods
Since face anti-spoofing is essentially a binary classification problem, most previous anti-spoofing methods train a classifier under binary supervision, e.g., spoofing faces as 0 and living faces as 1. Early works usually rely on hand-crafted features, such as LBP [14, 15, 37], SIFT, SURF, HoG [28, 52], and DoG [43, 48], and traditional classifiers, such as SVM and Random Forests. Because manually engineered features are sensitive to varying conditions, such as camera devices, lighting conditions and presentation attack instruments (PAIs), traditional methods often generalize poorly. Recently, CNNs have emerged as a powerful tool for face anti-spoofing tasks with the help of hardware advancement and data abundance. For instance, in early works like [30, 41], a pre-trained VGG-face model is fine-tuned to extract features in a binary-classification setting. However, most of these works treat face anti-spoofing as a binary classification problem with a cross-entropy loss, which easily latches onto arbitrary cues such as the screen bezel.
Depth supervised Methods Compared with the binary setting, depth supervised methods aim to learn more faithful patterns. In , the depth map of a face is utilized as a supervisory signal for the first time: a two-stream CNN-based approach extracts both patch features and holistic depth maps from face images, showing that depth estimation is beneficial for face anti-spoofing and obtains promising results, especially on higher-resolution images. In another work , the authors augment spatial facial depth as an auxiliary supervision along with temporal rPPG signals. More recently,  attempts to learn spoof noise and depth for generalized face anti-spoofing. However, these methods take stacked vanilla convolutional networks as the backbone and fail to capture the rich detailed patterns needed for depth estimation.
Temporal-based Methods In early work, the eye-blinking cue is used to predict spoofing. However, such methods are vulnerable to replay attacks since they rely heavily on heuristic assumptions about the nature of the attacks. More general approaches like 3D convolution or LSTM [50, 53] have recently been used to distinguish live from spoof images. In addition, the optical flow magnitude map and the Shearlet feature have been taken as CNN inputs, due to the obvious difference in flow patterns between living and spoofing faces. Based on the different color changes between living and spoofing face videos, rPPG [31, 34, 32] features have also been explored for PAD. To the best of our knowledge, no depth supervised temporal-based method has previously been proposed for the face anti-spoofing task.
In this section, we first present our depth-supervised spatio-temporal network structure, including the Residual Spatial Gradient Block (RSGB) and the Spatio-Temporal Propagation Module (STPM). Then our novel Contrastive Depth Loss (CDL) and the overall loss are described.
Designed in an end-to-end depth supervised fashion, our framework takes an N_f-frame face sequence as input and directly predicts the corresponding depth maps. As shown in Fig. 3, the backbone is composed of cascaded RSGBs followed by pooling layers, intended to extract fine-grained spatial features at the low, mid and high levels, respectively. These multi-level features are then concatenated to predict a coarse depth map for each frame.
In order to capture rich dynamic information, STPM is plugged between frames. Short-term Spatio-Temporal Block (STSTB) picks up spatio-temporal features from adjacent frames while ConvGRU propagates these short-term features in a multi-frame long-term view. Finally, the temporal depth maps estimated from STPM are used to refine the coarse depth from the backbone.
Fine-grained spatial details are vital for distinguishing bona fide and attack presentations. As illustrated in Fig. 1, the gradient magnitude responses of the living (Fig. 1(a)) and spoofing (Fig. 1(b)) face are quite different, which motivates the design of a Residual Spatial Gradient Block (RSGB) for capturing such discriminative clues. In this paper, we adopt the well-known Sobel operator to compute the gradient magnitude. In a nutshell, the horizontal and vertical gradients are derived from the following depthwise convolutions, respectively:

G_x = S_x ⊛ F,   G_y = S_y ⊛ F,   with S_x = [[−1, 0, 1], [−2, 0, 2], [−1, 0, 1]] and S_y = S_x^T,   (1)

where ⊛ denotes the depthwise convolution operation and F represents the input feature maps. As shown in Fig. 4, our RSGB adopts a shortcut connection to aggregate the learnable convolutional features with the gradient magnitude information, which enhances the representation of fine-grained spatial details. It can be formulated as

F_out = ReLU( BN( G(F) ) + Conv_1x1(F) ),   (2)

where F represents the input feature maps, Conv_1x1(F) denotes the feature maps altered through a 1×1 convolution that keeps the channel number consistent for the subsequent residual addition, and F_out denotes the output feature maps. BN and ReLU denote the normalization and ReLU layers, respectively, and G(·) represents the residual gradient magnitude mapping to be learned (the magnitude sqrt(G_x^2 + G_y^2) of Eq. 1). Note that the proposed RSGB can be plugged in at both the image and feature levels, extracting rich spatial context for the depth regression task.
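To make the block concrete, here is a minimal single-channel numpy sketch of the RSGB idea: Sobel depthwise filtering yields a gradient magnitude map, which is aggregated with shortcut features via residual addition and a ReLU. A scalar stands in for the 1×1 convolution, BatchNorm is omitted, and all function names are ours, not the paper's code.

```python
import numpy as np

SOBEL_X = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
SOBEL_Y = SOBEL_X.T

def conv2d_same(x, k):
    """Naive 'same' 2D cross-correlation with zero padding (one channel)."""
    pad = k.shape[0] // 2
    xp = np.pad(x, pad)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def rsgb(feat, w_1x1=1.0):
    """RSGB sketch for one channel: the residual branch is the Sobel
    gradient magnitude, the shortcut is the input passed through a
    (here scalar) stand-in for the 1x1 conv; BatchNorm is omitted."""
    gx = conv2d_same(feat, SOBEL_X)              # horizontal gradient
    gy = conv2d_same(feat, SOBEL_Y)              # vertical gradient
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)        # spatial gradient magnitude
    shortcut = w_1x1 * feat                      # 1x1-conv stand-in
    return np.maximum(grad_mag + shortcut, 0.0)  # residual add + ReLU
```

On a flat region the gradient branch contributes nothing, so the block passes the shortcut through; at edges it amplifies the response, which is the fine-grained cue the block is meant to preserve.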
The discrimination between the depths of living and spoofing faces can be explored more adequately with multiple frames. Therefore, we design STPM to extract multi-frame spatio-temporal features for depth estimation, via a Short-term Spatio-Temporal Block (STSTB) and ConvGRU.
STSTB. As illustrated in Fig. 3, STSTB extracts generalized short-term spatio-temporal information by fusing five kinds of features: the current compressed features F(t), the current spatial gradient features ∇F(t), the future spatial gradient features ∇F(t+1), the temporal gradient features F(t+1) − F(t), and the STSTB features from the previous level. The fused features provide weighted spatial and temporal information in a learnable/adaptive way. In this paper, the spatial and temporal gradients are implemented with Sobel-based depthwise convolution (similar to Eq. 1) and element-wise subtraction of temporal features, respectively. Note that the 1×1 convolutions compress the channel number for efficiency.
Different from the related OFF work, we consider both the spatial gradient of the current compressed features, ∇F(t), and the future spatial gradient features ∇F(t+1), while OFF only considers the former. Moreover, the current compressed feature itself also plays an important role in recovering a fine depth map and is therefore concatenated in STSTB as well. The detailed comparison between STSTB and OFF is studied in Sec. 5.3, which shows the advantage of STSTB, especially for the depth-supervised face anti-spoofing task.
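As a toy single-channel sketch of the STSTB fusion, the five cues can be stacked into one tensor; np.gradient approximates the Sobel filtering, a scalar stands in for the 1×1 channel compression, and all names are illustrative rather than the paper's code.

```python
import numpy as np

def ststb(f_t, f_t1, prev_level, compress=0.5):
    """Short-term Spatio-Temporal Block sketch for single-channel maps:
    fuse the current compressed features, current and future spatial
    gradients, the temporal gradient (element-wise subtraction), and the
    previous-level STSTB features."""
    cur = compress * f_t                  # current compressed features (1x1 conv stand-in)
    gy, gx = np.gradient(f_t)             # current spatial gradient
    grad_cur = np.hypot(gx, gy)
    gy1, gx1 = np.gradient(f_t1)          # future spatial gradient
    grad_fut = np.hypot(gx1, gy1)
    temp_grad = f_t1 - f_t                # temporal gradient via subtraction
    return np.stack([cur, grad_cur, grad_fut, temp_grad, prev_level])
```

In the real network this stack would be a channel-wise concatenation followed by learnable convolutions; here the stacking just makes the five-way fusion explicit.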
As the short-term information between two consecutive frames from STSTB has limited representation ability, it is natural to use a recurrent neural network to capture long-range spatio-temporal context. However, the classical LSTM and GRU neglect the spatial information in their hidden units. Considering the spatial neighborhood relationships in the hidden layers, ConvGRU is adopted to propagate the long-range spatio-temporal information. ConvGRU can be described as:

R_t = σ(K_r ⊛ [H_{t−1}, X_t]),
U_t = σ(K_u ⊛ [H_{t−1}, X_t]),
Ĥ_t = tanh(K_h ⊛ [R_t ⊙ H_{t−1}, X_t]),
H_t = (1 − U_t) ⊙ H_{t−1} + U_t ⊙ Ĥ_t,   (3)

where X_t, H_t, U_t and R_t are the input, output, update gate and reset gate, K_r, K_u and K_h are the kernels of the convolution layers, ⊛ is the convolution operation, ⊙ denotes the element-wise product, and σ denotes the sigmoid activation function.
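A minimal numpy sketch of one ConvGRU step on single-channel maps may help. The pair-of-kernels form below stands in for one convolution over the channel-concatenated hidden state and input, and all names are illustrative.

```python
import numpy as np

def conv_same(x, k):
    """Naive 'same' 2D cross-correlation with zero padding (one channel)."""
    pad = k.shape[0] // 2
    xp = np.pad(x, pad)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convgru_step(h_prev, x_t, kr, ku, kh):
    """One ConvGRU step on single-channel maps. Each gate uses a pair of
    3x3 kernels, one applied to the hidden state and one to the input --
    a stand-in for one conv over the concatenated [H_{t-1}, X_t]."""
    r = sigmoid(conv_same(h_prev, kr[0]) + conv_same(x_t, kr[1]))  # reset gate
    u = sigmoid(conv_same(h_prev, ku[0]) + conv_same(x_t, ku[1]))  # update gate
    h_cand = np.tanh(conv_same(r * h_prev, kh[0]) + conv_same(x_t, kh[1]))
    return (1.0 - u) * h_prev + u * h_cand                         # new hidden state
```

Because the new hidden state is a convex combination of the previous state and a tanh candidate, it stays bounded, and the convolutional gates preserve the spatial layout that a plain GRU would flatten away.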
Forwarding the RSGB-based backbone and STPM on a given N_f-frame input, we obtain the corresponding coarse depth maps D_coarse^t and temporal depth maps D_temporal^t, respectively, where t denotes the t-th frame. Then D_temporal^t is utilized to refine D_coarse^t in a weighted summation manner:

D_refined^t = (1 − α) · D_coarse^t + α · D_temporal^t,   (4)

where α is the trade-off weight between D_coarse^t and D_temporal^t. A higher value of α indicates greater importance of the multi-frame spatio-temporal features. Finally, the refined depth maps D_refined^t are obtained.
Besides designing the network architecture, we also need an appropriate loss function to guide the training. One major step forward of the current study is a novel Contrastive Depth Loss, which can be combined with the classical loss to further boost performance.
In classical depth-based face anti-spoofing, the Euclidean Distance Loss (EDL) is usually used for pixel-wise supervision, formulated as:

L_EDL = || D_pred − D_gt ||_2^2,   (5)

where D_pred and D_gt are the predicted depth and ground-truth depth, respectively. EDL supervises the predicted depth pixel by pixel, ignoring the depth differences among adjacent pixels. Intuitively, EDL merely helps the network learn the absolute distance from objects to the camera. However, the distance relationships between different objects are also important to supervise for depth learning. Therefore, as shown in Fig. 5, we propose the Contrastive Depth Loss (CDL) to offer extra strong supervision, which improves the generality of the depth-based face anti-spoofing model:

L_CDL = Σ_i || K_i^contrast ⊛ D_pred − K_i^contrast ⊛ D_gt ||_2^2,   (6)

where K_i^contrast is the i-th contrastive convolution kernel, i ∈ {0, 1, ..., 7}. The details of the kernels can be found in Fig. 5.
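A numpy sketch of CDL under our reading of Fig. 5: eight 3×3 kernels, each with −1 at the center and +1 at one of the eight neighbours, compared between prediction and ground truth with a 'valid' convolution on a single channel. The exact kernel layout is our assumption.

```python
import numpy as np

def conv_valid(x, k):
    """Naive 'valid' 2D cross-correlation of a single-channel map."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def contrast_kernels():
    """Eight 3x3 contrastive kernels: -1 at the center, +1 at one of the
    eight neighbours, so each response is the depth difference between a
    pixel and one of its neighbours."""
    kernels = []
    for di in range(3):
        for dj in range(3):
            if (di, dj) == (1, 1):
                continue
            k = np.zeros((3, 3))
            k[1, 1] = -1.0
            k[di, dj] = 1.0
            kernels.append(k)
    return kernels

def contrastive_depth_loss(pred, gt):
    """CDL sketch: mean squared difference between the contrastive
    responses of the predicted and ground-truth depth maps."""
    total = 0.0
    for k in contrast_kernels():
        diff = conv_valid(pred, k) - conv_valid(gt, k)
        total += np.mean(diff ** 2)
    return total / 8.0
```

Note that each kernel sums to zero, so CDL is invariant to a constant depth offset: it penalizes only wrong local topography, which is exactly what complements the absolute-distance supervision of EDL.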
In view of the potentially unclear depth map, we additionally consider a binary loss to discriminate between living and spoofing depth maps. Note that the depth supervision is decisive, whereas the binary supervision plays an assistant role in discriminating the different kinds of depth maps:

L_binary = − y · log(P) − (1 − y) · log(1 − P),   (7)

L_overall = L_EDL + L_CDL + β · L_binary,   (8)

where y is the binary ground-truth label, P is the living probability predicted from B, the pool-averaged map of the refined depth D_refined, and β is the hyper-parameter trading off the binary loss and the depth losses in the final overall loss L_overall.
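As a hedged sketch of how the loss terms combine, the snippet below uses a simple right/down neighbour-difference stand-in for CDL and an illustrative trade-off weight; the helper names and the value of beta are ours, not the paper's.

```python
import numpy as np

def overall_loss(pred_depth, gt_depth, logit, label, beta=0.1):
    """Overall loss sketch: pixel-wise EDL + a neighbour-difference
    stand-in for CDL + beta * binary cross-entropy on the live logit.
    `beta` is illustrative, not the paper's exact value."""
    edl = np.mean((pred_depth - gt_depth) ** 2)
    # CDL stand-in: match depth differences toward right and down neighbours
    cdl = np.mean((np.diff(pred_depth, axis=1) - np.diff(gt_depth, axis=1)) ** 2) \
        + np.mean((np.diff(pred_depth, axis=0) - np.diff(gt_depth, axis=0)) ** 2)
    p = 1.0 / (1.0 + np.exp(-logit))                     # sigmoid live probability
    bce = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return edl + cdl + beta * bce
```

A perfect depth prediction with a confident, correct logit gives a near-zero loss; either a depth error or a wrong binary label increases it, reflecting the decisive depth term and the assistant binary term.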
In this work, we collect a real double-modal dataset (RGB and Depth). Three kinds of display materials are used in replay attacks: AMOLED, OLED, and IPS/TFT screens. Meanwhile, three kinds of paper materials are adopted in print attacks: high-quality A4 paper, coated paper, and poster paper. The capture camera is a RealSense SR300, which offers corresponding RGB and depth images. There are 300 subjects, each of whom is recorded in three sessions and contributes one real category and two attack categories (print and replay). In total, we obtained 2,700 samples (4–12 second videos) in less than two months with two human workers. Tab. 1 details DMAD, and Fig. 6 shows some corresponding examples.
|Subset||Subject||Session||Modal Types||Presentation Material||# of live/attack vid.|
|Train||1~100||1~3||RGB, Depth||A4 Paper, AMOLED||900|
|Train||101~200||1~3||RGB, Depth||Coated Paper, OLED||900|
|Test||201~300||1~3||RGB, Depth||Poster Paper, IPS/TFT||900|
Five databases, OULU-NPU [10, 5], SiW, CASIA-MFSD, Replay-Attack, and DMAD, are used in our experiments. OULU-NPU is a high-resolution database, consisting of 4950 real access and spoofing videos and containing four protocols to validate the generalization of models. SiW contains more live subjects, and three protocols are used for testing. CASIA-MFSD and Replay-Attack both contain low-resolution videos.
On the OULU-NPU and SiW datasets, we follow the original protocols and metrics for a fair comparison. OULU-NPU, SiW and DMAD utilize 1) the Attack Presentation Classification Error Rate (APCER), which evaluates the highest error among all PAIs (e.g., print or display), 2) the Bona Fide Presentation Classification Error Rate (BPCER), which evaluates the error on real access data, and 3) the ACER, which evaluates the mean of APCER and BPCER:

ACER = (APCER + BPCER) / 2.
HTER is adopted in the cross-database testing between CASIA-MFSD and Replay-Attack, evaluating the mean of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR):

HTER = (FRR + FAR) / 2.
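The metrics can be sketched in plain Python. The score convention (higher = more live), the threshold, and the function names below are our assumptions for illustration.

```python
def apcer_bpcer_acer(scores, labels, pai_types, threshold=0.5):
    """ACER sketch: APCER is the worst per-PAI attack acceptance rate,
    BPCER the bona fide rejection rate, ACER their mean.
    labels: 1 = live, 0 = attack; pai_types tags each attack sample with
    its instrument (None for live samples)."""
    per_pai = {}
    for s, l, p in zip(scores, labels, pai_types):
        if l == 0:
            per_pai.setdefault(p, []).append(s >= threshold)  # attack accepted as live
    apcer = max(sum(v) / len(v) for v in per_pai.values())
    bona = [s < threshold for s, l in zip(scores, labels) if l == 1]
    bpcer = sum(bona) / len(bona)
    return apcer, bpcer, (apcer + bpcer) / 2

def hter(far, frr):
    """HTER: mean of False Acceptance Rate and False Rejection Rate."""
    return (far + frr) / 2
```

Taking the worst PAI (rather than the average) makes APCER a conservative measure: a detector cannot look good by handling replays well while failing on prints.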
The dense face alignment method PRNet is adopted to estimate the 3D shape of the living face and generate the facial depth map. A typical sample can be found in the third row of Fig. 6. To distinguish living faces from spoofing faces, at the training stage we normalize the living depth maps to the range [0, 1], while setting the spoofing depth maps to all 0, similar to prior depth supervised work.
The proposed method is trained with a two-stage strategy. Stage 1: we train the backbone with cascaded RSGBs using the depth losses (EDL and CDL), in order to learn a fundamental representation for predicting coarse depth maps. Stage 2: we fix the parameters of the backbone and train the STPM part with the overall loss to refine the depth maps. Our networks are fed N_f frames, sampled at an interval of three frames. This sampling interval lets the sampled frames retain enough temporal information within the limited GPU memory.
For the final classification score, we feed the sequential frames into the network and obtain the refined depth maps and the living logits. The final living score can be obtained by:

score = β · P_live + (1 − β) · (1 / N_f) · Σ_t ( Σ(D_refined^t ⊙ M_t) / Σ M_t ),

where β is the same as that in equation 8, P_live is the living probability from the binary branch, M_t is the mask of the face at frame t, which can be generated from the dense face landmarks of PRNet, and the second term computes the mean of the depth values within the facial areas as the other part of the score.
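A hypothetical sketch of this scoring rule: the masked mean depth is averaged over frames and mixed with the sigmoid of the binary logit. The mixing weight `alpha` and the exact combination below are illustrative, not the paper's exact formula.

```python
import numpy as np

def liveness_score(depth_maps, face_masks, live_logits, alpha=0.5):
    """Final score sketch: mix (i) the mean predicted depth inside the
    face mask, averaged over frames, with (ii) the mean sigmoid live
    probability from the binary head. `alpha` is an illustrative weight."""
    depth_part = np.mean([np.sum(d * m) / np.sum(m)
                          for d, m in zip(depth_maps, face_masks)])
    prob_part = np.mean(1.0 / (1.0 + np.exp(-np.asarray(live_logits))))
    return (1 - alpha) * prob_part + alpha * depth_part
```

A living sample (face-like depth inside the mask, positive logit) scores near 1, while a planar attack (near-zero depth, negative logit) scores near 0.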
Our proposed method is implemented in TensorFlow, with a learning rate of 1e-4 for the single-frame part and 1e-2 for the multi-frame part. The batch size of the single-frame part is 48, and that of the multi-frame part is 2, with N_f being 5 in our experiments. The Adadelta optimizer is used in our training procedure, with ρ = 0.95 and ε = 1e-8. The trade-off weights α and β are set based on our experimental experience.
Seven architectures are implemented to demonstrate the efficacy of vital parts (i.e., RSGB, STPM and loss functions) in the proposed method. As shown in Tab. 2, Model 1 can be treated as a raw baseline, consisting of a backbone network with stacked vanilla convolutions. Model 2 is supervised with extra contrastive depth loss. Based on Model 2, vanilla convolutions are replaced by RSGB in Model 3. Moreover, Model 4 and Model 5 are designed for validating the effectiveness of STSTB and ConvGRU. In Model 6, STSTB is replaced by normal OFF . Model 7 is our complete architecture with all modules and losses.
Efficacy of the Modules and Loss Functions. It can be seen from Tab. 2 that Model 2 outperforms Model 1, which means our proposed CDL helps estimate more accurate depth maps. With the progressively lower ACER of Models 3, 4 and 5, it is clear that RSGB, STSTB and ConvGRU each contribute to extracting effective discriminative features. Finally, comparing Model 5 with Model 7, binary supervision indeed assists in distinguishing live vs. spoof.
STSTB vs. OFF. As illustrated in Tab. 2, Model 7 with STSTB surpasses Model 6 with OFF by a large margin, which implies that the current and future gradient information is valuable for the spatio-temporal face anti-spoofing task. Model 6 even achieves an inferior result compared with Model 3, indicating that it is challenging to design an effective temporal module for the depth regression task.
Importance of Spatio-temporal Information for Depth Refinement. It can be seen from Eq. 4 that the depth map refinement is conducted in a weighted summation manner, and the hyper-parameter α controls the contribution of the temporal depth maps predicted by STPM. As shown in Fig. 7, with an appropriate value of α, the model benefits from the spatio-temporal information and achieves better performance than using only spatial information (α = 0). The best performance is obtained at an intermediate value of α.
Influence of Sampling Interval in the Spatio-temporal Architecture. We conduct experiments on one sub-protocol of Protocol 3 with various sampling intervals. Across the tested intervals, the ACER is 3.347%, 2.927%, 4.223%, and 2.934%, respectively. The ACER is lowest with an interval of three frames, which is used as the default setting for the following intra- and cross-database testing.
We compare the performance of intra-database testing on OULU-NPU, SiW and DMAD datasets. There are four protocols in OULU-NPU for evaluating the generalization of the developed face presentation attack detection (PAD) methods. Protocol 1 and Protocol 2 are designed to evaluate the generalization of PAD methods under previously unseen illumination scene and under unseen attack medium (e.g., unseen printers or displays), respectively. Protocol 3 utilizes a Leave One Camera Out (LOCO) protocol, in order to study the effect of the input camera variation. Protocol 4 considers all the above factors and integrates all the constraints from protocols 1 to 3, so protocol 4 is the most challenging.
Results on OULU-NPU. As shown in Tab. 3, our proposed method ranks first on all four protocols, which indicates that it generalizes well to external environments, attack mediums and input camera variation. It is worth noting that our model has the lowest mean and std of ACER on protocols 3 and 4, indicating its good accuracy and stability.
Results on SiW. Tab. 4 compares the performance of our method with two state-of-the-art methods, Auxiliary and STASN, on the SiW dataset. According to the purposes of the three protocols on SiW and the results in Tab. 4, our method shows clear advantages in generalizing to (a) variations of face pose and expression, (b) variations of different spoof mediums, and (c) cross presentation attack instruments.
Results on DMAD. The results of intra-database testing on DMAD are shown in Tab. 5. In this experiment, we still set the spoofing depth maps to zero when training the actual-depth model. Tab. 5 shows that the ACER (%) of the multi-frame model (Model 7) supervised by actual depth is 1.78 lower than that supervised by generated depth. This demonstrates that the actual depth map benefits monocular face anti-spoofing.
|Method||CR||RC|
|Spectral cubes||34.4||50.0|
|Colour Texture||30.3||37.7|
The results of cross-database testing between CASIA-MFSD and Replay-Attack (excerpt). The evaluation metric is HTER (%).
We utilize four datasets (CASIA-MFSD, Replay-Attack, SiW and OULU-NPU) to perform cross-database testing for measuring the generalization potential of the models.
Results on CASIA-MFSD and Replay-Attack. In this experiment, there are two cross-database testing protocols. One is training on CASIA-MFSD and testing on Replay-Attack, which we name protocol CR; the other is training on Replay-Attack and testing on CASIA-MFSD, which we name protocol RC. As shown in Tab. 6, the HTER (%) of our proposed method is 17.0 on protocol CR and 22.8 on protocol RC, reductions of 38.4% and 19.7%, respectively, compared with the previous state of the art. The improved performance on cross-database testing demonstrates the good generalization of the proposed method.
Results from SiW to OULU-NPU. Tab. 7 shows the cross-database testing results when training on SiW and testing on OULU-NPU. It can be seen that our method outperforms Auxiliary on three protocols (reducing ACER by 2.5%, 2.2% and 4.6% on protocols 1, 2 and 4, respectively). On protocol 3, the ACER of our method is 14.6±4.8%, slightly higher than that of Auxiliary. Considering the rPPG signal used in the Auxiliary method, combining it with the proposed method may also be a good choice.
The predicted depth maps of hard samples in OULU-NPU Protocol 3 are partly visualized in Fig. 8. It can be seen that some samples are difficult for single-frame PAD to detect. In contrast, our multi-frame method with STPM estimates more precise depth maps than the single-frame method. The difference between the depth maps of real and attack samples in the third row is also more significant, indicating the good discriminative information in the results of STPM.
The feature distribution of the testing videos on OULU-NPU Protocol 1 is shown in Fig. 9. The right image (w/ RSGB) presents better-clustered behavior than the left image (w/o RSGB), which demonstrates the discrimination ability of our proposed RSGB for distinguishing living and spoofing faces.
In this paper, we propose a novel face anti-spoofing method, which exploits fine-grained spatio-temporal information for facial depth estimation. In our method, Residual Spatial Gradient Block (RSGB) is utilized to detect more discriminative details while Spatio-Temporal Propagation Module (STPM) to encode spatio-temporal information. An extra Contrastive Depth Loss (CDL) is designed to improve the generality of depth-supervised PAD. We also investigate the effectiveness of actual depth map in face anti-spoofing. Extensive experiments demonstrate the superiority of our method.
This work has been partially supported by the Chinese National Natural Science Foundation Projects #61876178, #61806196, #61976229.
In this section, we use some simple examples to explain that exploiting temporal depth and motion is reasonable in the face anti-spoofing task.
As shown in Fig. 10(a), node O denotes the camera focus, and Image Plane represents the image plane of the camera. P is one facial point at time t, and P' is the corresponding point when P moves down vertically by Δy at time t'. For example, P can be the point of the nose or ear. f denotes the focal distance, and x is the horizontal distance from the focal point to the point P. v and v' are the corresponding coordinates in the vertical dimension on the image plane. When P moves down vertically to P' by Δy, the motion is reflected on the image plane as the displacement Δv. According to the camera model, we can obtain:

Δv = (f / x) · Δy.   (13)

When three facial points P_1, P_2, P_3, located at horizontal distances x_1, x_2, x_3, all move down vertically by Δy, the corresponding image motions can be achieved:

Δv_i = (f / x_i) · Δy,   i = 1, 2, 3,   (14)

where d_12 = x_1 − x_2 and d_23 = x_2 − x_3 are the corresponding depth differences. From Eq. 14, there are:

x_i = f · Δy / Δv_i,   i = 1, 2, 3.   (15)

Removing the unknown common factor f · Δy from Eq. 15, the relative depth d_12 / d_23 can be obtained:

d_12 / d_23 = (x_1 − x_2) / (x_2 − x_3) = (1/Δv_1 − 1/Δv_2) / (1/Δv_2 − 1/Δv_3).   (16)

From this equation, we can see that the relative depth can be estimated from the motion of three points, provided Δv_2 ≠ Δv_3. The equations above describe the real scene. In the following, we introduce the derivation for attack scenes.
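The real-scene derivation can be checked numerically. Under the pinhole model, the image motions of three points sharing a vertical translation determine their relative depth, with the focal distance and the motion magnitude canceling out; the numbers below are illustrative.

```python
def image_motion(f, dy, x):
    """Pinhole model: vertical image displacement of a point at
    horizontal distance x when it moves down by dy."""
    return f * dy / x

def relative_depth_from_motion(dv1, dv2, dv3):
    """Recover (x1 - x2) / (x2 - x3) from the three observed image
    motions; f and dy cancel in the three-point relation above."""
    return (1 / dv1 - 1 / dv2) / (1 / dv2 - 1 / dv3)
```

Because the ratio is independent of f and Δy, a model observing only image motion can still check whether the implied relative depth is consistent with a real 3D face.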
As shown in Fig. 10(c), there are two image spaces in attack scenes: one is the recording image space, whose focal distance we denote f_r, and the other is the realistic image space, whose focal distance we denote f_s. In the recording image space, similarly to Eq. 14:

Δv_{r,i} = (f_r / x_{r,i}) · Δy,   i = 1, 2, 3,   (17)

where Δv_{r,i} are the magnitudes of the optical flow when the three points move down vertically by Δy. In the realistic image space, there are:

Δu_i = (f_s / x_s) · Δw_i,   i = 1, 2, 3,   (18)

where Δw_i are the motions of the three points on the recording image plane (i.e., on the spoofing carrier, which lies at a common distance x_s), and Δu_i are the corresponding values mapped onto the realistic image plane. Actually, Δw_i is proportional to Δv_{r,i} if the recording screen is static. Now, a vertical motion Δy_s is given to the recording screen. By inserting it, we transform Eq. 18 into:

Δu_i = (f_s / x_s) · (Δw_i + Δy_s),   i = 1, 2, 3.   (19)

Since only Δu_i can be observed directly in the sequential images, we can only estimate the relative depth via Δu_i. So we leverage Eq. 16 to estimate the relative depth:

d̂_12 / d̂_23 = (1/Δu_1 − 1/Δu_2) / (1/Δu_2 − 1/Δu_3).   (20)
According to equations above, some important conclusions can be summarized:
If the estimated relative depth is abnormal, the scene can be recognized as a replay attack.
If Δy_s = 0, there is:

d̂_12 / d̂_23 = d_12 / d_23,   (24)

i.e., the estimated relative depth equals that of the recorded genuine face. In this case, if the two image planes are parallel and the single-frame model cannot detect the static spoof cues, the model will fail in the task of face anti-spoofing, since it can hardly find any abnormality in the relative depth estimated from the facial motion. We call this scene the Perfect Spoofing Scene (PSS). Of course, setting up a PSS costs a lot and is approximately impossible in practice.
If Δy_s ≠ 0 and we still want to meet Eq. 24, the following equation should be satisfied:

(1/(Δw_1 + Δy_s) − 1/(Δw_2 + Δy_s)) / (1/(Δw_2 + Δy_s) − 1/(Δw_3 + Δy_s)) = (1/Δw_1 − 1/Δw_2) / (1/Δw_2 − 1/Δw_3),

which in general holds only when Δy_s = 0. However, in our assumption, Δy_s ≠ 0, so:

d̂_12 / d̂_23 ≠ d_12 / d_23.
This indicates that the relative depth cannot be estimated correctly if the attack carrier moves in the replay attack. Moreover, Δy_s usually varies as the attack carrier moves in a long-term sequence, leading to variation of the estimated relative depth. This kind of abnormality becomes more obvious along with long-term motion.
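This conclusion admits a small numeric check: adding a common offset (the carrier's own motion) to all observed motions perturbs the three-point relative-depth estimate, and different offsets give different estimates. All numbers below are illustrative.

```python
def estimated_relative_depth(m1, m2, m3):
    """Three-point relative-depth estimate from observed image motions."""
    return (1 / m1 - 1 / m2) / (1 / m2 - 1 / m3)

# Replayed motions of three facial points under the pinhole model
# (illustrative focal distance, motion and point distances).
f_r, dy = 1.0, 0.1
xs = (2.0, 2.5, 3.2)                       # distances of the recorded points
base = [f_r * dy / x for x in xs]          # static carrier: pure replayed motion
true_ratio = (xs[0] - xs[1]) / (xs[1] - xs[2])

static_est = estimated_relative_depth(*base)
# A vertical carrier motion adds the same offset b to every observed motion.
moving_est = estimated_relative_depth(*[m + 0.05 for m in base])
```

With a static carrier the estimate matches the true ratio (the PSS case), while the common offset from a moving carrier shifts the estimate, and a time-varying offset makes it drift, which is the abnormality a temporal model can exploit.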
If d_max denotes the largest depth difference among facial points, then any facial depth normalized by d_max lies in [0, 1], showing that constraining the depth labels of living faces to [0, 1] is valid. As analyzed above, for spoofing scenes the abnormal relative depth usually varies over time, so it is too complex to be computed directly. Therefore, we simply set the depth label of spoofing faces to all 0 to distinguish them from living labels, making the model learn the abnormality under depth supervision by itself.
As shown in Fig. 10(d), we rotate the recording image plane by a certain degree. The two black points at the right end of the green double arrows on the recording image plane (vertical) will reach the two black points at the left end of the green double arrows on the recording image plane (rotated) when the recording image plane rotates, and the corresponding values do not change after rotation. For convenient computation, we still map the rotated points back to the vertical recording image plane, with the corresponding distances shown in the figure. According to the relationships among the fundamental variables, we can obtain:
Deriving from the equations above, we can get:
The remaining quantities can also be calculated by imitating Eq. 29:
Subtracting the former from the latter, the following is achieved:

Obviously, this difference is nonzero in general. With a suitable shorthand definition, we get the following equation:

where the remaining variables are related analogously. Note that, for simplification, we only discuss the situation in which these quantities are all positive.
where the two auxiliary terms can be represented as:

Observing Eq. 33, we can see that when either of these terms vanishes, the corresponding motion differences vanish as well.
Now, we discuss the sufficient condition for this abnormality to be established. Similar to Eq. 14, the relationship of the variables can be achieved:
From Eq. 35 and the relations above, we can obtain:
where the involved quantities are the corresponding coordinates of the three points in the x dimension of the recording image space. In facial regions, we can easily find corresponding points that satisfy the required relations, so Eq. 36 can be established. To establish Eq. 37, we only need to find a point satisfying the remaining condition. According to the derivation above, we can see that there exist cases in which the motion differences are nonzero, and there are also many cases satisfying the other conditions, which we do not elaborate here. When faces move, the absolute coordinates vary, leading to variation of the estimated relative depth of the three facial points at different moments, which does not occur in the real scene. That is to say, if the realistic image plane and the recording image plane are not parallel, we can find cases in which abnormal relative depth is detectable with the help of abnormal facial motion.
One basis of the elaboration above is that the structure of a face is similar to that of a hill: complex, dense, and undulating. This is interesting and worth exploiting further in face anti-spoofing.
Even though we only use some special examples to demonstrate our viewpoints and assumptions, they still support the reasonableness of utilizing facial motion to estimate relative facial depth in the face anti-spoofing task. In this way, the learned model can seek abnormal relative depth and motion in the facial regions. Our extensive experiments validate this assumption and indicate that the temporal depth method indeed improves the performance of face anti-spoofing.