Face recognition technology is widely used in daily applications such as mobile face payment, entrance authentication and office check-in machines. Unfortunately, the easy accessibility of the human face brings not only convenience but also presentation attacks. Face recognition systems are vulnerable to various types of presentation attacks, including printed photographs, digital video replays, 3D printed masks and silica gel faces. Therefore, detecting these varied presentation attacks has become an increasingly critical and challenging task in all face recognition and authentication systems.
Early works explore hand-crafted features such as Sparse Low Rank Bilinear (SLRB), Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), Difference of Gaussians (DoG), Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF), combined with traditional classifiers including Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA), to solve the binary classification of real versus fake faces. However, due to their poor generalization capability, these methods are difficult to apply under changing illumination.
Recently, deep convolutional neural networks (CNNs) have been widely used for various face-related tasks such as face detection, recognition, identification and landmark detection. Researchers have also introduced CNNs into the face anti-spoofing and liveness detection area. It has been shown that CNNs [7, 16, 26, 35] can extract richer deep semantic features from facial images than traditional methods for the binary classification of face anti-spoofing. However, the single-modal data input and supervised-learning mechanism make it difficult for these models to perform well in cross-dataset tests.
RGB data provide texture information but are insensitive to surface curvature and material reflectivity. To improve the accuracy of face anti-spoofing (FAS) systems, data from more modalities, such as depth and infrared (IR), are used as algorithm input. Depth cameras like the Intel RealSense D400 series (https://www.intelrealsense.com/) provide distance/depth information that makes it easier to distinguish real face surfaces from the flat faces in photographs and video replays on electronic displays. IR cameras are sensitive to the material reflectivity of different presentation attack instruments (PAIs). In particular, fusion methods [25, 28, 40] have become mainstream since the large-scale multi-modal dataset CASIA-SURF was publicly released.
Unfortunately, with the rapid development of presentation attack methods, face anti-spoofing systems have to deal with ever more challenging situations. As shown in Figure 1, it has become easier and cheaper to produce 3D printed masks, yet it is extremely difficult to distinguish them from real faces in both the RGB and depth modalities. In addition, a silica gel face is even more lifelike and resembles a real face in all modalities (RGB, depth and IR). Most previous approaches cannot distinguish these two presentation attacks from live faces.
To address this issue, we propose a novel pipeline-based fusion framework that extracts richer information from face data by handling each modality separately.
The main contributions of this paper are summarized as follows:
(1) We present PipeNet, an effective pipeline-based framework for the multi-modal face anti-spoofing task;
(2) In contrast to previous unified designs, we propose a Selective Modal Pipeline (SMP) design that differentiates feature extraction among the different modalities;
(3) We propose a novel method, Limited Frame Vote (LFV), to obtain stable and reliable predictions for video classification.
The rest of the paper is organized as follows: Section 2 briefly reviews related work in multi-modal face anti-spoofing; Section 3 introduces the CASIA-SURF CeFA dataset and baseline methods; Section 4 proposes our method PipeNet with the SMP and LFV modules; Section 5 describes the basic competition rules; Section 6 presents the experiments and analyzes the results; finally, Section 7 summarizes our work and future research directions.
2 Related Work
Traditional Methods: The facial motions in a sequence of frames were first utilized as cues in the face anti-spoofing task. For example, eye blinking is reliable evidence that a face is real [29, 24].
In contrast to requiring user cooperation, some existing approaches treat face anti-spoofing as a binary classification problem that can proceed with a single frame. Various hand-crafted features have been explored in previous works, such as SLRB, LBP, HOG, DoG, SIFT and SURF, followed by traditional classifiers such as SVM, LDA and Random Forest.
Table 1: Statistics of the CASIA-SURF CeFA protocols (subset, ethnicities, subjects, modalities, PAIs, and numbers of real/fake/all videos).
CNN-based Methods: CNN-based approaches attempt to learn deep feature representations from face image data for binary classification. Two-stream CNNs were proposed to overcome weak cross-dataset test performance using both patch and depth streams. Li et al. proposed linking partial features with Principal Component Analysis (PCA), instead of a fully-connected layer, to reduce feature dimensionality and avoid over-fitting.
Temporal Methods: Temporal information is also used to enhance a CNN's representation capability. Long Short-Term Memory (LSTM) [38, 34] can recurrently learn features to obtain context information, but the heavy computation overhead makes such methods difficult to deploy, especially on mobile devices. Remote photoplethysmography (rPPG) [2, 22, 18] measures the heart-rate pulse signal that exists only in a live face and can be used to detect 3D mask attacks. However, weak robustness makes rPPG vulnerable to illumination changes in the environment. More importantly, both methods need a long time for a single test shot, which is impractical for deployment.
Fusion-based Methods: Multi-modal methods can take advantage of complementary information among different modalities and provide more robust performance on the face anti-spoofing task. However, only small face anti-spoofing datasets with limited subjects and samples used to be available, which makes it easy to over-fit during CNN training.
In 2019, Zhang et al. released CASIA-SURF, a large-scale multi-modal dataset for face anti-spoofing, which consists of 1,000 subjects and 21,000 videos, each sample with three modalities of data (i.e., RGB, depth and IR).
Using the CASIA-SURF dataset, they held the first Face Anti-spoofing Attack Detection Challenge@CVPR2019. The top-3 solutions were as follows:
VisionLab won first place. They proposed a fusion network with a multi-level feature aggregation module that fully utilizes feature fusion from different modalities at both coarse and fine levels. FaceBagNet won second place. They proposed a patch-based multi-stream fusion CNN architecture based on bags of local features; the patch-level images help extract spoof-specific discriminative information, and a Modal Feature Erasing module randomly erases one modality to prevent over-fitting. FeatherNets won third place. They proposed a light-weight network architecture with a modified Global Average Pooling (GAP) called the streaming module; their fusion procedure uses an ensemble-plus-cascade structure to make the best use of each modality's data.
3 Dataset And Baseline
3.1 CASIA-SURF CeFA
The CASIA-SURF dataset contains faces of only one ethnicity. To provide a benchmark for improving generalization in cross-ethnicity anti-spoofing and for exploiting multi-modal continuous data, CASIA further released the largest up-to-date cross-ethnicity face anti-spoofing dataset, CASIA-SURF CeFA, in 2020. CASIA-SURF CeFA covers multiple ethnicities, modalities, subjects, and both 2D and 3D attack types. It is the first public dataset designed for exploring the impact of cross-ethnicity factors in the study of face anti-spoofing. As shown in Figure 2, the dataset includes samples from different ethnicities and several presentation attack types. Four protocols are introduced to measure the impact of various evaluation conditions: (1) cross-ethnicity (Protocol 1), (2) cross-PAI (Protocol 2), (3) cross-modality (Protocol 3) and (4) cross-ethnicity & PAI (Protocol 4, shown in Table 1). This paper focuses on the most difficult one, Protocol 4.
3.2 Baseline Methods
SD-Net. CASIA provides a baseline for the single-modal face anti-spoofing task via a ResNet-18-based network named SD-Net. It includes three branches: a static, a dynamic and a static-dynamic branch, which learn hybrid features from static and dynamic images. The static and dynamic branches each consist of res-blocks and a GAP layer. A detailed description of how dynamic images are generated is provided in the reference; in short, a dynamic image is computed online with rank pooling over consecutive frames. Dynamic images are selected for rank pooling in SD-Net because they have been shown to outperform conventional optical flow [31, 8].
PSMM-Net. To make full use of multi-modal image data and alleviate ethnic and attack bias, CASIA proposes a novel multi-modal fusion network, PSMM-Net, which has two main parts: a) a modality-specific part containing three SD-Nets that learn deep features from the RGB, depth and IR modalities, respectively; b) a shared branch for all modalities that learns the complementary features among them. Information exchange and interaction between the SD-Nets and the shared branch are also designed to capture correlations and complementary semantic information among the modalities. Two main kinds of losses guide the training of PSMM-Net. The first kind comprises the losses of the three SD-Nets for the RGB, depth and IR modalities, denoted L_rgb, L_depth and L_ir, respectively. The second kind is based on the summed features from all SD-Nets and the shared branch and guides the entire network training, denoted L_whole. The overall loss of PSMM-Net is the sum:

L = L_whole + L_rgb + L_depth + L_ir
4 Proposed Method

We focus on a fusion network for the CASIA-SURF CeFA dataset, since fusion networks achieved better stability and robustness against various kinds of presentation attacks on the earlier CASIA-SURF dataset.
4.1 The Overall Model Architecture
In this work, we propose a novel Pipeline-based CNN fusion architecture for multi-modal face anti-spoofing, named PipeNet. It is based on modified SENet154, with a Selective Modal Pipeline (SMP) module for multiple modalities of image input and a Limited Frame Vote (LFV) module for sequence input of video frames.
Figure 3 shows the overall network architecture. The inputs are facial videos in three modalities: RGB, depth and IR. For each modality, we take one frame as input and randomly crop it into patches, which are then sent to the corresponding modal pipeline. In each modal pipeline, data augmentation and CNN feature extraction are performed. The outputs of the three pipelines are concatenated and sent to the fusion module for further feature abstraction. After a linear layer, we obtain a prediction for each frame and send all of them to the Limited Frame Vote module for iterative calculation. The output is the real-face prediction probability for the original facial video.
4.2 Closing the Gap in Cross-ethnicity Learning
As shown in Table 1, in Protocol 4 of CASIA-SURF CeFA (comprising 4@1, 4@2 and 4@3), the training and test sets contain faces from different ethnicities, which is called cross-ethnicity learning. Figure 2 shows cropped real faces of different ethnicities (African, East Asian and Central Asian) in three modalities (RGB, depth and IR). Intuitively, the biggest gap between ethnicities is skin tone in the RGB modality. Because the test data are inconsistent with the training data, there is a risk of increased test loss and ACER. To reduce the gap between the input feature distributions of the training and test sets, we convert the RGB images of both sets into other representations such as HSV, YCbCr and grayscale, and pick the representation with the best performance. The objective is to reduce the influence of skin-tone information.
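As a hedged illustration of this conversion step, the sketch below implements the RGB-to-grayscale and RGB-to-YCbCr mappings with plain NumPy using the standard ITU-R BT.601 coefficients; in practice a library such as OpenCV would typically be used, and the function names here are illustrative, not from the paper:

```python
import numpy as np

def rgb_to_gray(img):
    """Luma from RGB using ITU-R BT.601 weights; img is (H, W, 3) float."""
    return img[..., 0] * 0.299 + img[..., 1] * 0.587 + img[..., 2] * 0.114

def rgb_to_ycbcr(img):
    """Full-range YCbCr (BT.601) from float RGB in [0, 255].
    The chroma channels Cb/Cr carry the color (skin-tone) information
    that this preprocessing aims to separate from luminance."""
    y  = rgb_to_gray(img)
    cb = 128.0 - 0.168736 * img[..., 0] - 0.331264 * img[..., 1] + 0.5 * img[..., 2]
    cr = 128.0 + 0.5 * img[..., 0] - 0.418688 * img[..., 1] - 0.081312 * img[..., 2]
    return np.stack([y, cb, cr], axis=-1)
```

A pure white pixel maps to Y = 255 with neutral chroma (Cb = Cr = 128), which is a quick sanity check for the coefficients.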
4.3 Selective Modal Pipeline for Different Modalities Data
In essence, face anti-spoofing attack detection is an image classification task. It mainly contains two phases: CNN feature extraction with the first half of the CNN layers, and (binary) classification of the feature maps with the remaining layers.
Because face images in different modalities contain specific feature distributions and attributes, we treat them differently with selective pipelines in the feature extraction phase. "Pipeline" means an end-to-end data flow that can be created and configured flexibly; "selective" means the structure of each pipeline is adapted to its modality. Pipelines for different modalities are independent and can be unified or specific.
We pick basic block candidates from ResNet, ResNeXt, XceptionNet, SENet, etc., align the input and output dimensions of these blocks so they can be linked into a pipeline, and deploy the pipelines in the feature extraction module, one per modality. We search for the best combination of the three pipelines experimentally, then concatenate the SMP outputs as the input of the fusion module. The selective pipeline structure greatly improves the accuracy and the efficiency of finding the most suitable structure.
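The pipeline idea can be sketched in a few lines. In this simplified stand-in, plain callables play the role of the CNN blocks (SE-ResNet/SE-ResNeXt bottlenecks in the real model) whose input/output dimensions are kept aligned so they chain freely; the helper names are assumptions for illustration:

```python
import numpy as np

def make_pipeline(blocks):
    """Chain feature-extraction blocks into one end-to-end pipeline."""
    def run(x):
        for block in blocks:
            x = block(x)
        return x
    return run

def smp_fuse(pipelines, modal_inputs):
    """Selective Modal Pipeline fusion: run each modality through its
    own (possibly different) pipeline, then concatenate the features
    as the input of the fusion module."""
    feats = [pipe(x) for pipe, x in zip(pipelines, modal_inputs)]
    return np.concatenate(feats, axis=-1)

# Toy blocks standing in for CNN stages; each pipeline may differ.
rgb_pipe   = make_pipeline([lambda x: x * 2.0, lambda x: x + 1.0])
depth_pipe = make_pipeline([lambda x: x - 0.5])
ir_pipe    = make_pipeline([np.abs])

fused = smp_fuse([rgb_pipe, depth_pipe, ir_pipe],
                 [np.ones(4), np.ones(4), -np.ones(4)])
```

The point of the design is that swapping a block in one modality's pipeline never touches the others, which is what makes the per-modality search tractable.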
4.4 Limited Frame Vote for Video Classification
Since the input is a video clip, i.e., a sequence of continuous frames of varying length, the proposed method adds a Limited Frame Vote (LFV) module to obtain the final statistical prediction for a video-clip input.
Several algorithms can measure the central tendency of a probability distribution or of the random variable characterized by that distribution; their performance mainly depends on the data distribution. The mathematical expectation and the geometric median are the most common statistics for samples.
To avoid outliers, the vote should use only part of the frame probabilities. We build the following algorithm on the PauTa criterion; the pseudocode is given in Algorithm 1. The input is the sequence of frame-level probabilities, and the thresholds are hard values set from experience.
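Algorithm 1 itself is not reproduced in this text, so the following is only a plausible NumPy sketch of a PauTa-style (3-sigma) vote, with the hand-set thresholds exposed as parameters k and max_iter:

```python
import numpy as np

def limited_frame_vote(frame_probs, k=3.0, max_iter=10):
    """Aggregate per-frame real-face probabilities into one video-level
    score.  Following the PauTa (3-sigma) criterion, frames whose
    probability deviates from the current mean by more than k standard
    deviations are treated as outliers and discarded, iteratively; the
    vote is the mean of the surviving frames.  k and max_iter stand in
    for the hard experience-based values mentioned in the text."""
    probs = np.asarray(frame_probs, dtype=np.float64)
    for _ in range(max_iter):
        mean, std = probs.mean(), probs.std()
        kept = probs[np.abs(probs - mean) <= k * std]
        if len(kept) == len(probs) or len(kept) == 0:
            break  # converged, or everything rejected: stop iterating
        probs = kept
    return probs.mean()
```

For a clip whose frames mostly score around 0.9 with one corrupted frame near 0.0, the outlier is rejected and the vote stays at 0.9 instead of being dragged down by the plain mean.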
4.5 Some Other Tricks in Training
Some training tricks also benefit our final accuracy. Section 4.5.1 is borrowed from the winners [28, 40] of the Face Anti-spoofing Attack Detection Challenge@CVPR2019; Sections 4.5.2 and 4.5.3 improve upon previous classic works.
4.5.1 Random Input Crop and Modal Dropout
During training, we apply patch-level inputs to the face anti-spoofing task and verify that patch-level inputs extract spoof-specific discriminative information better than the whole face area. We also adopt random dropout of one selected modality, since it avoids over-fitting and makes better use of the characteristics of all three modalities. In the inference phase, we use fixed patches and all modalities to make results reproducible.
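A minimal sketch of these two augmentations, assuming NumPy image arrays; the patch size and dropout probability are illustrative values, not the paper's settings:

```python
import random
import numpy as np

def random_patch(img, patch_size=48, rng=random):
    """Randomly crop a square patch from a face image of shape (H, W, C)."""
    h, w = img.shape[:2]
    top  = rng.randrange(h - patch_size + 1)
    left = rng.randrange(w - patch_size + 1)
    return img[top:top + patch_size, left:left + patch_size]

def modal_dropout(modal_batches, p=0.3, rng=random):
    """With probability p, zero out one randomly chosen modality
    (RGB / depth / IR) during training so the network cannot rely on
    any single modality.  p is an assumed hyperparameter."""
    if rng.random() < p:
        i = rng.randrange(len(modal_batches))
        modal_batches[i] = np.zeros_like(modal_batches[i])
    return modal_batches
```

At inference time neither function would be used: patches are fixed and all modalities are kept, matching the reproducibility note above.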
4.5.2 Dynamic Start Point for Cosine Decay Restarts
A cyclical cosine-annealing learning rate can avoid falling into poor local minima during gradient descent. We improve it by using a dynamic learning-rate (LR) starting point for each cycle, instead of the fixed value in the original version.
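One way to realize a dynamic start point is to decay the peak LR of each restart cycle geometrically; the sketch below assumes this form (the decay factor and cycle length are illustrative, not the paper's values):

```python
import math

def cosine_restarts_lr(step, cycle_len, base_lr, min_lr=0.0, decay=0.7):
    """Cyclical cosine-annealing schedule with restarts, where cycle c
    restarts from base_lr * decay**c (the dynamic start point) instead
    of the fixed base_lr of the original SGDR schedule."""
    cycle = step // cycle_len
    t = (step % cycle_len) / cycle_len        # position within the cycle
    start_lr = base_lr * (decay ** cycle)      # dynamic start point
    return min_lr + 0.5 * (start_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Within each cycle the LR follows a half-cosine from the cycle's start value down to min_lr, and each new cycle restarts lower than the last.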
Table 3 (excerpt): final challenge ranking, listing team leader, team name, affiliation and rank; e.g., Zitong Yu [39, 32] (team BOBO, Oulu) ranked 1st, and Qing Yang (team Hulking, ours, Intel) ranked 3rd.
4.5.3 Train/Validation Loss Balance Strategy
Snapshot ensembles inspired our search for the best snapshot points among the weight checkpoints during training. We built a checkpoint-auto-save system whose inputs are the ACER and the train/validation losses; it automatically saves checkpoints at multiple positions in each cycle as candidates for the global best.
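The paper does not spell out the selection rule, so the following is a hypothetical sketch of the kind of decision such a system might make: snapshot when ACER improves, or when ACER ties and the train/validation losses are better balanced (a rough proxy for less over-fitting):

```python
def should_save(acer, train_loss, val_loss, best):
    """Hypothetical checkpoint-auto-save rule.  'best' tracks the best
    checkpoint seen so far as {'acer': ..., 'gap': ...}, where gap is
    the absolute train/validation loss difference."""
    gap = abs(train_loss - val_loss)
    if acer < best["acer"]:
        return True
    return acer == best["acer"] and gap < best["gap"]
```

A real system would also persist several candidates per cycle, as described above, rather than keeping only a single best.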
5 Competition Details
5.1 Dataset and Protocol
We use the CASIA-SURF CeFA dataset introduced in Section 3. To promote the competition and raise the level of challenge, the most challenging protocol, Protocol 4, designed by combining the conditions of Protocols 1 and 2, was adopted as the evaluation criterion. As shown in Table 1, it contains three data subsets: training, validation and testing sets. It has three sub-protocols (i.e., 4@1, 4@2 and 4@3), in each of which one ethnicity is used for training and validation and the remaining two ethnicities are used for testing. Moreover, the protocol also considers PAIs by setting different attack types during the training and testing phases.
5.2 Evaluation Metrics
From the evaluation of the prediction results, we can obtain the numbers of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). Based on these four values, we can calculate the following evaluation metrics. The Attack Presentation Classification Error Rate is defined as

APCER = FP / (TN + FP),

the Bona Fide Presentation Classification Error Rate as

BPCER = FN / (FN + TP),

and the Average Classification Error Rate as

ACER = (APCER + BPCER) / 2.
Ultimately, the competition uses ACER to determine the final ranking.
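Treating a real face as the positive class, these metrics follow directly from the confusion-matrix counts; the helper below is an illustrative implementation of the formulas above, not code from the paper:

```python
def classification_error_rates(tp, fp, tn, fn):
    """APCER: fraction of attack presentations classified as bona fide;
    BPCER: fraction of bona fide presentations classified as attacks;
    ACER: their average, the competition's ranking metric."""
    apcer = fp / (tn + fp)
    bpcer = fn / (fn + tp)
    acer = (apcer + bpcer) / 2.0
    return apcer, bpcer, acer
```

For example, with 100 attacks of which 20 pass as real (FP) and 100 bona fide faces of which 10 are rejected (FN), APCER = 0.2, BPCER = 0.1 and ACER = 0.15.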
5.3 Training Configuration
5.4 Competition Results
We won 3rd place in the final ranking based on the organizers' reproduced results; the team scores and rankings are shown in Table 3. In the Chalearn Multi-modal Cross-ethnicity Face Anti-spoofing Recognition Challenge@CVPR2020, the ACER of our final submission on the test set was very close to the organizers' reproduced result.
Table 4 shows the best scores of our submissions for the trained models corresponding to the 4@1, 4@2 and 4@3 protocols.
6 Experiments and Results Analysis
6.1 Effect of Different Train/Validation Segmentation Ratio
How the training and validation data are segmented also affects the final result. We tried several ratios, including 15:1, 12:1, 10:1 and 3:1, and ultimately chose 15:1, which gave the best performance. The test set is much larger than the training and validation sets and includes presentation attack types that do not exist in the training set; we observe that a larger training set can still help the model learn more about the unseen samples in the test set.
6.2 Effect of Selective Modal Pipeline
As shown in Table 5, we evaluated several pipeline portfolios, including unified and selective designs. "Unified" means the pipelines for RGB, depth and IR are exactly the same, while "selective" means they differ. We examined their ACER on the dev and test sets simultaneously. In particular, we adjusted the augmentation parameters and the CNN block combination for each of the three pipelines; the CNN blocks consist of SE-ResNetBottleneck and several variants of SE-ResNeXtBottleneck. The results show that ACER is reduced when each modality is treated independently and customized with a different pipeline.
For this section, we repeated each quantitative experiment (training plus inference) three times for each of the three dataset protocols and report the average value to ensure stable and consistent model performance.
6.3 Does Frame Time Sequence Order Matter?
As shown in Table 6, before deciding to adopt the LFV module, we performed experiments to measure the effect of sequence order on video classification results. We took segments of continuous frames, corresponding to micro face motions or static faces, from both living and spoofing video samples, modified the PipeNet data augmentation module to process continuous frames, and tested three different frame orders: original, reverse and random. N/A marks cases where the dataset does not contain corresponding examples.
Table 6 (excerpt): the prediction probability for the 3D print mask attack in original frame order is 0.0968; the reverse and random orders are N/A.
The prediction probabilities across the different order options are very close. Two possible reasons are: a) the order of the frame sequence is not a key contributing factor in a fusion network, in contrast to an RGB single-modal network; b) even after basic cleaning and facial alignment, the dataset frames remain inconsistent and contain artifacts such as random translation and partial cropping, which disturb the context information.
As a result, we choose Limited Frames Vote strategy to obtain the probability of real face for each video sample.
7 Conclusion

In this paper, we propose a flexible and practical multi-stream network architecture to build a robust face anti-spoofing system. The model, named PipeNet, is a pipeline-based design with a Selective Modal Pipeline module and a Limited Frame Vote module. Quantitative experiments show that the customized pipeline for each modality in PipeNet makes better use of the different modalities' data. In addition, the LFV module provides stable and accurate predictions from continuous-frame input. We applied the proposed PipeNet to the Chalearn Multi-modal Cross-ethnicity Face Anti-spoofing Recognition Challenge@CVPR2020 and won third place. Our future research will focus on enhancing the self-adjustment capability of the modal pipelines.
References

- (2017) Face anti-spoofing using patch and depth-based CNNs. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 319–328.
- (2016) Remote photoplethysmography based on implicit living skin tissue segmentation. In 2016 23rd ICPR, pp. 361–365.
- (2016) Face anti-spoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Processing Letters.
- (2012) On the effectiveness of local binary patterns in face anti-spoofing. BIOSIG, pp. 1–7.
- Xception: deep learning with depthwise separable convolutions.
- (2013) Can face anti-spoofing countermeasures work in a real world scenario? In 2013 International Conference on Biometrics (ICB).
- (2016) Integration of image quality and motion cues for face anti-spoofing: a neural network approach. Journal of Visual Communication and Image Representation 38, pp. 451–460.
- (2017) Rank pooling for action recognition. TPAMI 39 (4), pp. 773–787.
- (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- (2017) Snapshot ensembles: train 1, get M for free.
- (2019) IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16–20, 2019. Computer Vision Foundation / IEEE.
- (2015) Face liveness detection from a single image via diffusion speed model. IEEE Transactions on Image Processing 24.
- (2012) Face spoofing detection from single images using texture and local shape analysis. Biometrics, IET 1, pp. 3–10.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- (2016) An original face anti-spoofing approach using partial convolutional neural network. In 2016 6th IPTA.
- (2020) CASIA-SURF CeFA: a benchmark for multi-modal cross-ethnicity face anti-spoofing.
- (2016) 3D mask face anti-spoofing with remote photoplethysmography. In European Conference on Computer Vision, pp. 85–100.
- (2020) Cross-ethnicity face anti-spoofing recognition challenge: a review. arXiv.
- SGDR: stochastic gradient descent with warm restarts.
- (2011) Face spoofing detection from single images using micro-texture analysis. In 2011 International Joint Conference on Biometrics (IJCB).
- (2017) PPGSecure: biometric presentation attack detection using photoplethysmograms. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 56–62.
- (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, pp. 971–987.
- (2007) Eyeblink-based anti-spoofing in face recognition from a generic webcamera. pp. 1–8.
- (2019) Recognizing multi-modal face spoofing with face recognition networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- (2016) Cross-database face antispoofing with robust feature representation. Lecture Notes in Computer Science, pp. 611–619.
- (2016) Secure face unlock: spoof detection on smartphones. IEEE Transactions on Information Forensics and Security 11.
- (2019) FaceBagNet: bag-of-local-features model for multi-modal face anti-spoofing.
- (2007) Blinking-based live face detection using conditional random fields. In Advances in Biometrics, Berlin, Heidelberg, pp. 252–260.
- (2010) Face liveness detection from a single image with sparse low rank bilinear discriminative model. In European Conference on Computer Vision, pp. 504–517.
- (2017) Ordered pooling of optical flow sequences for action recognition. In WACV, pp. 168–176.
- (2020) Deep spatial gradient and temporal depth learning for face anti-spoofing. In CVPR.
- (2017) Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 141–145.
- (2014) Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601.
- (2013) Face liveness detection with component dependent descriptor. pp. 1–6.
- (2019) Face anti-spoofing: model matters, so does data. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3502–3511.
- (2020) Searching central difference convolutional networks for face anti-spoofing. In CVPR.
- (2019) FeatherNets: convolutional neural networks as light as feather for face anti-spoofing.
- (2020) CASIA-SURF: a large-scale multi-modal benchmark for face anti-spoofing. IEEE Transactions on Biometrics, Behavior, and Identity Science.