Learning deep forest with multi-scale Local Binary Pattern features for face anti-spoofing

by   Rizhao Cai, et al.
Shenzhen University

Face Anti-Spoofing (FAS) is significant for the security of face recognition systems. Convolutional Neural Networks (CNNs) have been introduced to the field of FAS and have achieved competitive performance. However, CNN-based methods are vulnerable to adversarial attacks: attackers could generate adversarial-spoofing examples to circumvent a CNN-based face liveness detector. Studies on the transferability of adversarial attacks reveal that utilizing handcrafted feature-based methods could improve security at the system level. Therefore, handcrafted feature-based methods are worth exploring. In this paper, we introduce the deep forest, which was proposed by Zhou et al. as an alternative to CNNs, to the problem of FAS. To the best of our knowledge, this is the first attempt at exploiting the deep forest for FAS. Moreover, we propose to re-devise the representation construction by using LBP descriptors rather than the Grained-Scanning Mechanism of the original scheme. Our method achieves competitive results. On the benchmark database IDIAP REPLAY-ATTACK, 0% Equal Error Rate (EER) is achieved. This work provides a competitive option in a fusing scheme for improving system-level security and offers important ideas to those who want to explore methods besides CNNs.




I Introduction

Face recognition systems, which identify an individual by her/his face, have been widely used in practical applications such as mobile phone unlocking. However, the existing face recognition techniques cannot differentiate between genuine faces (captured from a human) and spoofing faces (captured from faces in images, digital displays, etc.). Most face recognition systems are therefore vulnerable to Presentation Attacks (PAs), such as print attacks and replay attacks. Attackers could bypass a face recognition system by presenting different types of spoofing faces, since face images are readily available to attackers from social platforms, e.g., Facebook and Instagram [26]. To guarantee the security of face recognition systems, there is an increasing demand for developing FAS techniques.

Fig. 1: Examples of Presentation Attack (PA). (a): display attack. The face is in a digital display screen [7]. (b): replay attack [39]. The face is in a video. (c): print attack. The face is in a print photo [26]. (d): print attack. The face is in a print photo that is tailored [15].

Traditionally, image descriptors, such as Local Binary Pattern (LBP) and Scale Invariant Feature Transform (SIFT), are utilized to extract features for describing the data from the FAS databases. Recently, with their powerful ability to learn deep representations from data, Convolutional Neural Networks (CNNs) have been successfully exploited in various visual tasks, e.g., object classification [12], face recognition [41], etc., and have achieved state-of-the-art performance. Attempts at applying CNNs to FAS have also been reported and have achieved much improvement [37, 12, 22, 35, 30].

Although CNN-based methods have shown excellent capacities, it has been pointed out that they are vulnerable to adversarial attacks [31]. Under such an attack, a CNN model would fail to correctly classify the adversarial examples, which are generated by imposing human-invisible perturbations on the original samples. What is more, though adversarial examples are usually manipulated in the digital world, they could still take effect even after a print-and-capture cycle [13, 9, 29]. In other words, the adversarial attack can be conducted in the physical world. Worse still, adversarial examples are shown to be transferable. Empirical experiments in [17, 28] and theoretical analysis in [33] show that adversarial examples can be transferred to attack other models as long as they adopt the same or similar features, even if the classification models are different (Support Vector Machine, Random Forest, etc.). Therefore, it is likely that attackers could generate adversarial-spoofing examples to attack a CNN model for face liveness detection in a face recognition system.

Fortunately, using handcrafted feature-based methods could be a solution. In [28, 33], it is revealed that adversarial examples are non-transferable when the victim model takes its input from a different feature space. This indicates that using handcrafted features extracted from RGB images as the input of a face anti-spoofing model could be an approach against the adversarial-spoofing attack targeted at CNN-based models. In cybersecurity applications, it is also suggested in [36] that ensembling a diverse pool of models built on different features could improve the security of a cyber system against the adversarial attack. Hence, to alleviate the threat of the adversarial attack, handcrafted feature-based methods also deserve exploration.

In this paper, we introduce a new feature-based method, the deep forest [40], to the FAS problem. The deep forest is an advanced synthesis of tree-ensemble methods. It consists of the Grained-Scanning Mechanism (GSM) for learning representations from data and the layer-cascade strategy for further processing the representations. The deep forest has been evaluated on several visual tasks, e.g., face recognition, handwriting recognition, etc., and it achieves competitive performance [40]. Since the deep forest is newly-published, there are not yet many works about using the deep forest in applications related to biometrics. To the best of our knowledge, we are the first to introduce the deep forest in the problem of the FAS. However, the performance is not satisfactory in our initial attempt when the GSM, proposed by [40], is directly used to learn representations for the spoofing detection. The unsatisfactory result suggests that the GSM is not competent enough in capturing the cues for the face spoofing detection. Inspired by texture analysis [21, 38, 23], the baseline approaches in the research area of the FAS, we propose to employ Local Binary Pattern (LBP) descriptors to construct the representations of spoofing information. Experimental results show that the proposed approach has achieved competitive performance.

To the best of our knowledge, this is the first work that introduces the deep forest to the problem of FAS. Our method offers an important reference and a competitive option to those who want to fuse diverse methods in their schemes for system-level security in their cases.

We re-devise the representation constructing by utilizing the LBP descriptors instead of the GSM. The proposed scheme that integrates LBP descriptors and the deep forest learning method achieves better results than that of the GSM [40].

The proposed scheme shows competitive performance compared to state-of-the-art approaches. On the IDIAP REPLAY-ATTACK database [7], 0% Equal Error Rate (EER) is achieved. Also, extensive experiments on the two newly-published databases, MSU USSA database and ROSE-YOUTU database have been conducted. On the MSU database, EER of 1.56% is obtained, which is a competitive result compared to the Patch-based CNN (0.55% EER) and the Depth-based CNN (2.62% EER) proposed by [2].

The rest of the paper is organized as follows: Section II presents brief literature reviews of approaches to FAS and of forest-related learning methods. The proposed scheme is elaborated in Section III. The performance of the proposed scheme is evaluated in Section IV. Finally, Section V concludes this paper.

II Related Works

Fig. 2: Examples of genuine faces and spoofing faces from MSU USSA database [26]. Columns (a), (b), (c) are the genuine face, display face and printed photo face, respectively, together with their corresponding magnified regions.

In this section, the literature on both traditional handcrafted feature-based methods and CNN-based methods in the problem of FAS is first reviewed, followed by the tree-ensemble learning methods.

II-A The Existing Works on Face Anti-Spoofing

II-A1 The Traditional Methods

Most of the traditional FAS approaches focus on designing handcrafted features and learning classifiers with traditional learning methods, e.g., Support Vector Machine (SVM) [2]. Texture analysis is one of the main approaches to spoofing face detection, since there are inherent texture disparities between genuine faces and the spoofing faces of print or replay attacks. As can be seen in Fig. 2, images of spoofing faces, compared to genuine faces, usually have lower quality and contain visual artifacts because of the recapturing process. These disparities can be described effectively by texture descriptors, and methods that capture them in the Fourier spectrum or in the spatial domain have been reported. Ref. [32] uses Difference-of-Gaussian (DoG) features to describe the frequency disturbance resulting from recapturing. The Local Phase Quantization (LPQ), which analyzes distortion through the phase, is also discussed in [10]. In the spatial domain, a significant number of works employ LBP-based features to describe the disparities in local texture information [21, 38, 23]. Analogously, methods that utilize the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) [3] have also been reported. To utilize motion information from the temporal domain, the texture-based methods mentioned above are extended to three orthogonal planes, e.g., LBP-TOP [27] and LPQ-TOP [1]. Moreover, the color information of spoofing faces, which is less abundant after the distortions of the recapturing process, is essential in discriminating spoofing faces. Therefore, color texture methods are proposed in [4], which extract features with the aforementioned methods from the separate channels of a certain color space (e.g., from the three components H, S, and V of an image in HSV space individually).

Fig. 3: The illustration of how the GSM learns representations for local information [40]. First, a sliding window with a certain stride is used to scan raw pixels. Then, all the scanned patches are fed to forests, a random forest (black) and a completely-random forest (rose). Finally, all the output results from the forests are concatenated as the representations of the raw pixel data. For full details about the GSM, please refer to [40].

II-A2 The Deep-learning Based Methods

Recently, CNN-based methods, with their powerful ability to learn deep representations from data, have attracted much research attention. Ref. [37] trains a CNN based on the AlexNet architecture [12] to learn deep representations for face anti-spoofing. After that, the feasibility of CNNs in learning deep representations for biometrics, including face anti-spoofing, is further demonstrated by [22], and more CNN-based methods are increasingly reported [2, 14]. In addition, efforts to exploit Long Short-Term Memory (LSTM) networks to utilize temporal information from video frames are also reported in [35, 30].

II-B The Tree-Ensemble Methods

Fig. 4: Illustrations of constructing multi-scale representations. Panels (a) and (b) illustrate how the MGSM and the proposed scheme, respectively, construct representations on multiple scales.

The tree-ensemble methods are based on decision trees. The random decision forest was first proposed as a solution to the dilemma between performance and generalization of the decision tree [11]. It was later improved into the Random Forest (RF) by introducing feature sampling and data bootstrapping [5]. The Completely-Random tree Forest (CRF) has a mechanism that is much more “random” than the RF, since it splits the nodes randomly, regardless of any criterion [16]. Both the RF and the CRF project the original features into subspaces by sampling the original features. This reduces the number of feature dimensions to process, which facilitates the handling of high-dimensional features [40]. The Gradient Boosting Decision Tree (GBDT) methods introduce loss functions for training, which are not included in the RF and the CRF. GBDT models are trained by boosting along the gradients of the loss. XGBoost is an effective implementation of GBDT. It provides a more flexible and powerful scheme that approximates the loss function by a second-order Taylor expansion, so users can define custom loss functions for their problems by supplying the first- and second-order derivatives. XGBoost has achieved superior performance among many GBDT implementations.
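The Taylor-expansion approximation mentioned above can be written out as follows. The notation is an assumption made here for illustration (it follows the usual presentation of XGBoost and is not defined in this paper): $l$ is the loss, $\hat{y}_i^{(t-1)}$ is the prediction for sample $i$ before boosting round $t$, $f_t$ is the tree added in round $t$, and $\Omega$ is a regularization term.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}
  \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
       + g_i\, f_t(x_i)
       + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}.
```

Because only $g_i$ and $h_i$ depend on the loss, a user-defined loss enters the training procedure through these two derivatives alone.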

The deep forest, proposed by [40], achieves performance competitive with CNN-based methods on several visual tasks reported in [40]. In [40], a Grained-Scanning Mechanism (GSM) is used to learn representations from data, and a cascade strategy is used for further processing the representations. The RF is the basis of the GSM of the deep forest, and the CRF offers another option. By combining different types of forests, the diversity of the representations learned by the deep forest can be improved [40]. XGBoost and other implementations of GBDT can also serve as a basis in the deep forest. More details about the deep forest can be found in [40]. Unlike CNNs, whose structures are fixed during the training process, the number of cascade levels of the deep forest model depends on the scale of the data and grows as the training proceeds. Once the output scores (accuracy, loss, etc.) begin to converge, the growth stops. Hence, the complexity of the model is adaptively adjusted according to the scale of the database. This ensures that the deep forest can maintain a satisfactory result even on a small-scale database [40].
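The adaptive growth described above can be sketched as a simple stopping rule. This is a minimal illustration and not the gcForest implementation: the function name and the parameters `patience` and `min_gain` are hypothetical, and the per-layer scores are assumed to be already computed.

```python
def grow_cascade(layer_scores, patience=3, min_gain=0.0):
    """Return how many cascade layers to keep.

    layer_scores: score (e.g. accuracy) obtained after each added layer.
    patience:     hypothetical number of layers without improvement
                  after which growth stops.
    min_gain:     hypothetical minimum improvement that counts as progress.
    """
    best_score = float("-inf")
    best_layer = 0
    for layer, score in enumerate(layer_scores, start=1):
        if score > best_score + min_gain:
            # This layer improved the score, so it becomes the new best depth.
            best_score, best_layer = score, layer
        elif layer - best_layer >= patience:
            # No improvement for `patience` layers: stop growing.
            break
    return best_layer


# Scores that plateau after the third layer.
kept = grow_cascade([0.90, 0.93, 0.95, 0.95, 0.949, 0.95, 0.96])
```

In this example the loop stops at the sixth layer, so the later value 0.96 is never examined and three layers are kept. The depth is therefore set by the data rather than fixed in advance.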

III The Proposed Deep Forest with LBP Features

The LBP is selected for two reasons. Firstly, LBP features cannot be reconstructed back into RGB pixel images, which is helpful against the adversarial-spoofing attack. Secondly, LBP is designed for texture description, which makes it appropriate for the face spoofing problem. This section first elaborates on how to use the LBP descriptors [22] to leverage texture information. Then, the proposed scheme integrating the deep forest and the LBP features is presented.

Fig. 5: The procedure of the deep forest learning with multi-scale representations. The left part contains the LBP representations on three different scales. The right part illustrates the cascading strategy. The “black” and “red” boxes are the output results of each forest from the previous layer. They are concatenated with the features on different scales in different layers to form the input of the next layer.

III-A The LBP-based Features for Texture Analysis

The Local Binary Pattern (LBP) descriptor proposed by [24] is a grey-scale descriptor that is effective for texture description. By calculating the LBP values of the binary patterns for each pixel and accumulating their occurrences into histograms, LBP features can be extracted to represent local texture information. The calculation of LBP can be described as

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p}, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

where $s(\cdot)$ takes the sign of the operand, $g_c$ denotes the intensity value of the central pixel and $g_p$ ($p = 0, \dots, P-1$) denotes the intensity values of the $P$ adjacent pixels distributed symmetrically on a circle of radius $R$. An image can be divided into several patches, and LBP histograms are calculated for each patch. Then, all the histograms can be concatenated into a feature vector to represent the image in the texture field. To fully exploit the color information, color LBP features are employed in this paper, following [4]. Color LBP features are obtained by extracting LBP features from each component of the color space individually (e.g., Red, Green, Blue in the RGB space or Hue, Saturation, Value in the HSV space) and concatenating the results into a feature vector [4]. These features based on LBP descriptors are called LBP features in this paper.
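To make the per-pixel computation concrete, the following is a minimal pure-Python sketch of the basic 8-neighbour LBP code and its histogram. It is an illustration under stated assumptions and not the authors' implementation: it uses the 3×3 neighbourhood directly (no interpolation on the circle), the neighbour ordering is arbitrary, and the function names are hypothetical.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code of pixel (r, c) in a 2-D list of grey values."""
    # Neighbour offsets, clockwise from the top-left corner (ordering is a
    # convention chosen for this sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[r][c]
    code = 0
    for p, (dr, dc) in enumerate(offsets):
        # Bit p is set when the neighbour is at least as bright as the centre.
        if img[r + dr][c + dc] >= center:
            code |= 1 << p
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 8, 3]]
code = lbp_code(patch, 1, 1)   # neighbours 9, 7 and 8 exceed the centre 6
hist = lbp_histogram(patch)
```

For this patch, bits 1, 3 and 5 are set, so the code is 2 + 8 + 32 = 42. A color LBP feature would repeat the histogram step for each channel and concatenate the results.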

The GSM learns the representations of local information from adjacent pixels within a certain window, and similarly, the extraction of LBP-based features also considers local information. On the other hand, the significant contrast between employing the GSM [40] (illustrated in Fig. 3) and the LBP features lies in how the representations are constructed. The GSM constructs representations by learning from data, while the LBP features construct representations with the domain knowledge of a researcher.

III-B The Proposed Multi-scale Representations

Firstly, we propose to use the multi-scale LBP descriptor to construct the multi-scale representations. Taking multiple scales into account is important because the image samples come from practical capturing conditions and the textural disparities vary. For example, although both Fig. 2 (b) and (c) are spoofing faces, they are captured under different conditions, i.e., different devices, different circumstances, etc., so they show different texture appearances in both patterns and scales. Therefore, different scales of local information should be taken into consideration. As illustrated in Fig. 4 (a), the Multi-Grained Scanning Mechanism (MGSM) [40] is used to learn representations from data on multiple scales. By changing the size of the sliding windows and conducting the GSM, relationships between pixels on different scales are learned, and local information on different scales can be leveraged [40]. On the other hand, Fig. 4 (b) illustrates our proposed scheme. In the proposed scheme, there is a sliding window for scanning patches of pixels, and LBP descriptors [24] are used to obtain LBP features. By changing the descriptor parameters, i.e., the number of neighbors $P$ and the radius $R$, representations on different scales can be obtained. To utilize color information, the color LBP features are adopted in the proposed scheme to construct representations on different color channels and scales according to [4]. One of the differences between the MGSM and our proposed scheme in constructing multi-scale representations lies in the selection of sliding windows. With the MGSM, windows of different sizes are needed to learn multi-scale representations, while multi-scale representations based on LBP features can be obtained with a fixed-size window. This is because the representations on a certain scale learned by the GSM depend only on the size of the window, whereas with LBP descriptors the representation on a certain scale can also be determined by the parameters of the descriptors, i.e., $P$ and $R$. Multiple window sizes are not adopted in this paper because, when small windows are used to extract LBP histograms, many of the bins are empty, and the obtained features are high-dimensional and sparse, i.e., less informative.
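The idea of obtaining several scales from one fixed window by varying only the LBP parameters can be sketched as below. This is an illustrative assumption-laden sketch, not the authors' code: the (neighbors, radius) settings in the example are hypothetical, raw histograms with 2^P bins are used (any uniform-pattern mapping is omitted), and off-grid samples are read with bilinear interpolation.

```python
import math


def circular_offsets(P, R):
    """Offsets (dy, dx) of P sampling points on a circle of radius R."""
    return [(-R * math.sin(2 * math.pi * p / P),
             R * math.cos(2 * math.pi * p / P)) for p in range(P)]


def bilinear(img, y, x):
    """Bilinearly interpolated grey value at a real-valued position."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, len(img) - 1), min(x0 + 1, len(img[0]) - 1)
    wy, wx = y - y0, x - x0
    top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
    bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def lbp_pr(img, r, c, P, R):
    """LBP code with P neighbours on a circle of radius R."""
    center = img[r][c]
    code = 0
    for p, (dy, dx) in enumerate(circular_offsets(P, R)):
        # Small tolerance absorbs floating-point error from interpolation.
        if bilinear(img, r + dy, c + dx) >= center - 1e-9:
            code |= 1 << p
    return code


def multiscale_histograms(img, settings):
    """One raw LBP histogram per (P, R) setting, all from the same window."""
    feats = []
    for P, R in settings:
        hist = [0] * (2 ** P)
        margin = int(math.ceil(R))  # skip pixels whose circle leaves the window
        for r in range(margin, len(img) - margin):
            for c in range(margin, len(img[0]) - margin):
                hist[lbp_pr(img, r, c, P, R)] += 1
        feats.append(hist)
    return feats


window = [[10] * 7 for _ in range(7)]                      # flat 7x7 window
feats = multiscale_histograms(window, [(8, 1), (8, 2)])    # hypothetical settings
```

The window size stays the same for both settings; only the sampling circle changes, which is what distinguishes this construction from scanning with windows of several sizes.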

Secondly, instead of concatenating all the representations on the three scales into a single feature vector, as performed in some traditional methods [4, 26], a circular cascading strategy is adopted in our proposed scheme by referring to [40]. This strategy is shown in Fig. 5. The representations on the three scales are fused individually with the outputs of the layers of the deep forest, so that each layer focuses on the representation on one scale. The representation on the first scale is fed to the first layer of the deep forest. The representation on the second scale is then fused with the output of the first layer to form the input of the second layer, and the representation on the third scale is fused with the output of the second layer in the same way. It should be noted that this cascading process is circular: in the next cycle, the representation on the first scale is concatenated with the output of the third layer to form the input of the fourth layer, and so on for every following cycle. The choice of scales and cascading strategies is flexible and can be adapted to the task.
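The circular schedule can be summarized with a small bookkeeping sketch. It assumes, as in the deep forest of [40], that each forest outputs one class-probability vector that is concatenated with the next layer's input; the function names and the feature lengths in the example are hypothetical.

```python
def scale_for_layer(layer, num_scales=3):
    """1-based index of the scale whose representation enters `layer`."""
    return (layer - 1) % num_scales + 1


def layer_input_dim(layer, scale_dims, num_forests, num_classes=2):
    """Input length of a cascade layer under the circular schedule.

    scale_dims:  feature length of the representation on each scale.
    num_forests: forests per layer, each assumed to emit `num_classes`
                 class probabilities.
    """
    rep_dim = scale_dims[scale_for_layer(layer, len(scale_dims)) - 1]
    if layer == 1:
        # The first layer sees only the first-scale representation.
        return rep_dim
    # Later layers also receive the class vectors of the previous layer.
    return rep_dim + num_forests * num_classes


schedule = [scale_for_layer(k) for k in range(1, 8)]
dims = [layer_input_dim(k, [100, 200, 300], num_forests=8) for k in range(1, 5)]
```

With three scales, the first-scale representation re-enters the cascade at layers 1, 4, 7 and so on, which is the circular behaviour described above.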

IV Experiments

In this section, a brief introduction for four databases, on which the experiments are conducted, will be first given. Then, the details about the settings of the experiments are shown. Finally, the experimental results are presented and discussed.

IV-A Databases

Four representative databases are employed in our experiments: two benchmark databases, CASIA FASD [39] and IDIAP REPLAY-ATTACK [7], and two newly-published databases, ROSE-YOUTU LIVENESS [15] and MSU USSA [26]. The IDIAP, CASIA and ROSE-YOUTU databases consist of videos, covering replay attacks, display attacks, and print attacks. The MSU database contains only images, i.e., only display attacks and print attacks. The scale of each database is summarized below.

The IDIAP REPLAY-ATTACK database [7] comprises about 50 subjects. There are 60 videos of genuine faces and 300 videos of spoofing faces in the training set. In the testing set, there are 80 videos of genuine faces and 400 videos of spoofing faces.

The CASIA database [39] consists of 600 videos from 50 subjects, 20 subjects for the training set and 30 subjects for the testing set. For each subject, there are 3 videos of genuine faces and 9 videos of spoofing faces.

The ROSE-YOUTU LIVENESS database [15] contains 10 and 12 subjects in the training set and the testing set, respectively. For each subject, there are 180 videos, covering various types of attacks and lighting conditions. This database is the latest database concerning PAs, and it includes a tailored print attack. As can be seen in Fig. 1 (d), the background is not included in the recapturing process, making this database more challenging.

The MSU USSA database [26] includes 1,000 genuine faces (about 1,000 subjects) and about 6,000 spoof face images. There is no division of the training set and the testing set, so a 5-fold validation protocol is used to evaluate the performance of the FAS methods.
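A 5-fold protocol of this kind can be sketched as follows. This is a generic illustration, not the exact protocol of [26]: the function name is hypothetical, the split is over sample indices, and whether the original folds are subject-disjoint is not stated here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
```

Each sample is used for testing exactly once, and the reported error rate is the average over the five folds.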

IV-B Experimental Setups

In the first place, it should be highlighted that some data preprocessing has been performed in our experiments. When conducting experiments on the MSU and IDIAP databases, the whole image frames are taken as the inputs to make full use of the information. This is because the PA places the spoof medium near the camera to achieve high recapturing quality, so the “background” is also recaptured (as shown in (a) and (c) of Fig. 1). The recaptured background provides useful information for spoof face detection. However, in the CASIA database and the ROSE-YOUTU database, the spoof medium is far away from the camera and hence the background is not included in the recapturing process (as shown in (b) and (d) of Fig. 1). Under this circumstance, if the whole frame is used as the input, unnecessary interference (from the genuine background) is introduced. Therefore, the Viola-Jones method [34] is utilized to detect the faces in frames from the ROSE-YOUTU and CASIA databases. The detected faces are cropped and employed as the inputs in the experiments. Then all the inputs (i.e., whole frames from the IDIAP and MSU databases or cropped face regions from the CASIA and ROSE-YOUTU databases) are normalized to a common resolution as a trade-off between computational complexity and performance, by referring to the prior work [26].

Secondly, the settings of the experiments are elaborated. In [40], three square sliding windows of different sizes are employed to evaluate the performance of the deep forest. Following this, three window sizes, 16, 32 and 64 pixels, with strides of 8, 16 and 32 pixels, respectively, are used for the MGSM in this paper, yielding representations on three scales. In our proposed scheme, the size of the sliding window is fixed at 32 pixels and the stride is 16 pixels, so each normalized image yields 49 overlapping sub-patches in total. Three LBP descriptors [24] with different parameter settings are utilized to construct representations on three scales. Color LBP features in the HSV and YCbCr spaces are also considered in this paper, i.e., features are extracted from each separate channel of an image. For each patch, the per-channel histograms are concatenated into one feature vector per scale; since there are 49 sub-patches for each image, the length of each final representation is 49 times the corresponding per-patch feature length. During the cascading operation, the representation on each scale (from the MGSM or from the proposed scheme) is fused with the layer assigned to that scale. This process continues circularly and stops automatically when the accuracy has converged for several rounds. As for the forests used in the deep forest, four RFs and four CRFs are employed, with 500 trees in each forest, following [40]. These are implemented with the gcForest package (https://github.com/kingfengji/gcForest) with the default forest settings. For more details about the mechanism of the deep forest, please refer to [40].
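The sub-patch count follows from the window size and stride. The sketch below is a worked check under an assumption: the input is taken to be a square image of side 128 pixels, a value chosen here only because it is consistent with the 49 sub-patches reported for a 32-pixel window and a 16-pixel stride. The per-patch feature length in the example is likewise hypothetical.

```python
def num_windows(length, window, stride):
    """Number of sliding-window positions along one image dimension."""
    return (length - window) // stride + 1


SIDE = 128  # assumed image side, consistent with 49 sub-patches

# Proposed scheme: fixed 32-pixel window, 16-pixel stride.
patches = num_windows(SIDE, 32, 16) ** 2

# MGSM windows of 16, 32 and 64 pixels with strides 8, 16 and 32.
mgsm_patches = [num_windows(SIDE, w, s) ** 2
                for w, s in [(16, 8), (32, 16), (64, 32)]]

# Final feature length = sub-patches x per-patch length (hypothetical value).
per_patch_length = 3 * 59
final_length = patches * per_patch_length
```

Under this assumption the fixed window gives 7 positions per dimension, and the three MGSM windows give 15, 7 and 3 positions per dimension, respectively.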

IV-C Experimental Results

IV-C1 Comparisons between Multi-scale Representations

Method              MSU USSA   IDIAP   CASIA
GSM (RGB) [40]      4.84       1.02    14.50
proposed (RGB)      4.17       0       11.82
proposed (HSV)      2.14       0.052   8.73
proposed (YCbCr)    1.56       0       9.66
TABLE I: Comparisons between two implementations of multi-scale representations on MSU USSA database, IDIAP database, CASIA database. Performance is evaluated by EER (%).

Table I provides the experimental results of the GSM and of the proposed scheme in terms of Equal Error Rate (EER). From Table I, by integrating the LBP features (RGB) with the deep forest, the EERs on the MSU, IDIAP and CASIA databases are reduced from 4.84% to 4.17%, from 1.02% to 0% and from 14.50% to 11.82%, respectively. These results suggest that LBP-based features are more competent than the GSM in exploiting texture information to represent the degradation of the spoofing faces. Furthermore, across different color spaces, the performance of LBP features in the HSV and YCbCr color spaces is generally better than that in the RGB color space. This is because a change of illumination should not interfere with the chrominance information, which is crucial in color texture methods, and the HSV and YCbCr spaces separate the illumination and chrominance components. In the RGB space, however, the three components remain highly correlated, and a slight variation of illumination alters R, G and B together, which may result in unexpected changes of chrominance and make the features less effective.
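The separation of illumination from chrominance can be illustrated with Python's standard `colorsys` module. This is a simplified sketch: an illumination change is modelled as a uniform scaling of the RGB values, which is an assumption made here for illustration, and the pixel values are arbitrary.

```python
import colorsys


def scale_brightness(rgb, factor):
    """Model an illumination change as a uniform scaling of R, G and B."""
    return tuple(min(1.0, channel * factor) for channel in rgb)


rgb = (0.8, 0.4, 0.2)
darker = scale_brightness(rgb, 0.5)      # all three RGB values change

h1, s1, v1 = colorsys.rgb_to_hsv(*rgb)
h2, s2, v2 = colorsys.rgb_to_hsv(*darker)

# Hue and saturation are unaffected; only the value channel changes.
hue_shift = abs(h1 - h2)
saturation_shift = abs(s1 - s2)
value_shift = abs(v1 - v2)
```

In RGB the brightness change touches every channel, whereas in HSV it is confined to V, so texture features computed on H and S are not disturbed by it.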


Fig. 6: The convergence curves of experiments on MSU database. An average of the results of the five validations is taken. The x-axis refers to the number of the cascade layer, which increases along with the training process. The y-axis refers to the testing accuracy of the output of each layer.

To further probe into the effectiveness of the proposed scheme, curves of the training accuracy output by each layer are drawn and shown in Fig. 6. An upward trend of the accuracy can be seen: it goes up along with the growth of the structure. The curve of the GSM shows limited improvement over different layers, which indicates that the GSM is not able to capture the texture information of the spoofing cues over different scales efficiently. Meanwhile, despite inferior accuracies in the first two layers, the accuracies of the proposed scheme (RGB, HSV, YCbCr) finally outperform that of the GSM. Moreover, the trend of the curves indicates that the cascading strategy enables LBP features to be re-represented. For instance, the representation on the first scale is fed to several layers in successive cycles, and the output accuracy improves each time. In the first layer, the deep forest model learns from the representation on a small scale. Then, after the subsequent layers, where the model has perceived more information from representations on larger scales, the model gains a better understanding of the distortion on different scales.

IV-C2 Comparisons with State-of-the-Art Approaches

Method                     IDIAP   CASIA   ROSE-YOUTU
LBP-TOP [27]               7.9     -       -
CoALBP (HSV) [4]           3.7     5.5     16.4
CoALBP (YCbCr) [4]         1.4     10.0    17.7
Fine-tuned AlexNet [37]    6.1     7.4     8.0
CNN+Conv-LSTM [30]         5.12    22.40   -
CNN+LSTM [30]              1.28    14.60   -
Patch-based CNN [2]        2.5     4.44    -
Depth-based CNN [2]        0.86    2.85    -
proposed (HSV)             0.052   8.73    10.9
proposed (YCbCr)           0       9.66    11.9
TABLE II: Comparisons between the proposed scheme and state-of-the-art approaches on the IDIAP database, CASIA database and ROSE-YOUTU database, in terms of EER (%).

Tables II and III provide the results of comparisons between the proposed scheme and the state-of-the-art approaches. From Table II, the proposed scheme with simple LBP features is demonstrated to be highly competitive. Firstly, on the CASIA database, the proposed scheme (HSV) achieves 8.73% EER. Although this result is inferior to the results of some CNN-based methods, i.e., the patch-based CNN (4.44%) [2], the depth-based CNN (2.85%) [2] and the fine-tuned AlexNet (7.4%) [37], it is better than the LSTM-based methods with 22.40% and 14.60% EERs presented in [30]. It is worth mentioning that, among the traditional methods (using SVM classifiers with handcrafted features), particularly among LBP-based methods, the Co-occurrence of Adjacent Local Binary Patterns (CoALBP) method [23] has achieved state-of-the-art performance [15]. Experiments on the CASIA database show that the proposed scheme with LBP features achieves a better result (9.66%) than CoALBP (10.0%) in YCbCr space. Moreover, experimental results on the ROSE-YOUTU database [15], a more diverse and challenging database, are presented in the last column of Table II. The results show that the performance of CoALBP, which performs well on the IDIAP database (3.7% in HSV and 1.4% in YCbCr) and the CASIA database (5.5% in HSV and 10.0% in YCbCr), degrades dramatically (16.4% in HSV and 17.7% in YCbCr) [15]. However, the proposed scheme, which is also related to the LBP, achieves 10.9% (HSV) and 11.9% (YCbCr). Furthermore, from Table II, the proposed scheme achieves 0% (YCbCr) on the IDIAP REPLAY-ATTACK database, which is better than the results of all the presented CNN-based methods and CoALBP. Therefore, it is concluded that the proposed scheme has achieved performance comparable to the state-of-the-art CNN methods and traditional methods.

Method                 EER (%)        HTER (%)
Patel et al. [26]      3.84           -
Patch-based CNN [2]    0.55 ± 0.26    0.41 ± 0.32
Depth-based CNN [2]    2.62 ± 0.73    2.22 ± 0.66
proposed (HSV)         2.14 ± 0.58    1.98 ± 0.58
proposed (YCbCr)       1.56 ± 0.61    1.33 ± 0.51
TABLE III: Performance in terms of EER (%) and HTER (%) on MSU USSA database. The results are obtained according to the 5-fold validation protocol in [26].
                             Number of trees
Dataset       Color space    64      128     256     500
CASIA         HSV            8.62    8.59    8.67    8.73
              YCbCr          9.53    9.54    9.61    9.66
IDIAP         HSV            0.054   0.047   0.048   0.052
              YCbCr          0.026   0.017   0.023   0
MSU USSA      HSV            1.99    1.96    2.22    2.14
              YCbCr          1.26    1.28    1.42    1.51
ROSE-YOUTU    HSV            10.4    10.4    10.7    10.9
              YCbCr          11.4    11.5    11.3    11.9
TABLE IV: Performance (EER %) of different numbers of trees in each forest.

The experimental results on the MSU USSA database, in terms of the EER and the Half Total Error Rate (HTER), are provided in Table III. According to Table III, the patch-based CNN [2] achieves the best results in both EER (0.55%) and HTER (0.41%) on the MSU database. Our proposed scheme achieves 2.14% EER and 1.98% HTER in HSV space, as well as 1.56% EER and 1.33% HTER in YCbCr space, which are better than those of the depth-based CNN [2] (2.62% EER and 2.22% HTER).
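
For readers less familiar with the two metrics, the following is a minimal sketch of how EER and HTER are commonly computed from liveness scores. It is not the evaluation code used in this work. The function names, the toy scores, and the convention that a higher score means "more likely live" are assumptions made for illustration only. The EER is taken at the threshold where the false acceptance rate (FAR) and false rejection rate (FRR) are closest, and the HTER is the mean of FAR and FRR on a separate set at a fixed threshold.

```python
def far_frr(genuine, attack, threshold):
    """Return (FAR, FRR); scores >= threshold are accepted as live (assumed convention)."""
    far = sum(s >= threshold for s in attack) / len(attack)    # attacks wrongly accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)   # live faces wrongly rejected
    return far, frr


def equal_error_rate(genuine, attack):
    """Scan the observed scores as thresholds; return (EER, threshold) where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(attack)):
        far, frr = far_frr(genuine, attack, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def hter(genuine, attack, threshold):
    """Half Total Error Rate: mean of FAR and FRR at a fixed threshold."""
    far, frr = far_frr(genuine, attack, threshold)
    return (far + frr) / 2


# Hypothetical toy scores, not data from the paper.
dev_genuine = [0.9, 0.8, 0.7, 0.4]
dev_attack = [0.1, 0.2, 0.3, 0.6]
eer, thr = equal_error_rate(dev_genuine, dev_attack)

test_genuine = [0.95, 0.85, 0.75, 0.5]
test_attack = [0.05, 0.15, 0.25, 0.35]
test_hter = hter(test_genuine, test_attack, thr)
print(f"EER = {eer:.3f} at threshold {thr}, test HTER = {test_hter:.3f}")
```

In this sketch the threshold is chosen on one score set and reused on another, which mirrors the usual practice of fixing the operating point before reporting HTER.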

In summary, taking Tables II and III together, our proposed method is highly competitive when compared with the state-of-the-art CNN-based methods and the traditional methods, e.g., CoALBP [23].

IV-C3 Comparisons of different numbers of trees in each forest

In the above experiments, we follow [40] and adopt 500 trees in each forest. Within a certain range, more trees in a forest yield better performance. However, too many trees introduce heavy computational costs. In [25], it is suggested that a good trade-off between performance and computational cost is achieved when the number of trees in a forest is between 64 and 128. There are no significant performance gains when the number of trees increases to 512, 1024, 2048 or more. The experimental results in Table IV show that performance does not necessarily drop when the number of trees is smaller than 500. This observation coincides with the conclusion in [25].
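
The diminishing return from adding trees can be illustrated with a simple idealized model. This is a sketch under strong assumptions and not the analysis of [25] or of this work. It assumes every tree votes independently and is correct with the same probability, here a hypothetical 0.6, and computes the error of the majority vote from the binomial distribution. Real forests have correlated trees, so their error levels off at a non-zero floor, but the shrinking gain per added tree shows the same trend.

```python
import math


def binom_log_pmf(n, k, p):
    """Log-probability that exactly k of n independent voters are correct."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def majority_vote_error(n_trees, p_correct):
    """Error of a majority vote over n_trees independent voters (idealized assumption)."""
    error = 0.0
    for k in range(n_trees // 2 + 1):        # k = number of correct votes, at most half
        prob = math.exp(binom_log_pmf(n_trees, k, p_correct))
        if 2 * k == n_trees:
            prob *= 0.5                      # a tie is resolved by a fair coin flip
        error += prob
    return error


# Tree counts taken from Table IV; the per-tree accuracy 0.6 is a made-up value.
tree_counts = [64, 128, 256, 500]
errors = {n: majority_vote_error(n, 0.6) for n in tree_counts}
for n in tree_counts:
    print(f"{n:4d} trees: idealized vote error = {errors[n]:.6f}")
```

Under these assumptions, most of the improvement is already obtained between 64 and 128 trees, and the additional gain from 256 to 500 trees is much smaller in absolute terms.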

V Conclusion and Future Work

Given the concern about adversarial attacks, in this paper we propose to utilize the deep forest [40] for the FAS problem. To the best of our knowledge, this is the first attempt to introduce the deep forest into the FAS problem. Inspired by works related to texture analysis, we re-devise the construction of multi-scale representations by integrating LBP descriptors with the deep forest learning scheme. Our proposed scheme has achieved better results than the original GSM proposed in [40]. Furthermore, compared with the state-of-the-art approaches, the proposed scheme has achieved competitive results on several benchmark databases. For example, 0% EER is achieved on the IDIAP dataset. This indicates the effectiveness and competitiveness of our proposed scheme. Hence, our method could offer a competitive option to those who would like to improve the security of their systems by fusing diverse approaches at the system level. Moreover, only a limited number of research works have exploited the deep forest on practical problems. This paper could serve as a useful reference for researchers who want to explore methods beyond CNN-based schemes.

Admittedly, the results of our approach are not as strong as those of some CNN-based methods. In the future, various efforts can be made to improve the overall performance, such as investigating more cascading strategies and feature extraction methods. In this work, the LBP is utilized because it is common in the FAS field and relatively simple to integrate with the deep forest. However, the LBP was designed by researchers in the computer vision community based on their domain knowledge, and such knowledge may not be fully applicable to the FAS problem. Some novel binary descriptor methods [18, 19, 20, 8] have attracted our strong interest and provide valuable references. Designed in a more data-driven way, they learn features from data and depend less on human prior knowledge. Hopefully, we could achieve better results by drawing on these methods.


  • [1] S. R. Arashloo, J. Kittler, and W. Christmas (2017) Face Spoofing Detection Based on Multiple Descriptor Fusion Using Multiscale Dynamic Binarized Statistical Image Features. IEEE Transactions on Information Forensics and Security 10 (11), pp. 2396–2407. Cited by: §II-A1.
  • [2] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu (2018) Face anti-spoofing using patch and depth-based CNNs. In IEEE International Joint Conference on Biometrics, Cited by: §I, §II-A1, §II-A2, §IV-C2, §IV-C2, TABLE II, TABLE III.
  • [3] Z. Boulkenafet, J. Komulainen, and A. Hadid (2017) Face Antispoofing Using Speeded-Up Robust Features and Fisher Vector Encoding. IEEE Signal Processing Letters 24 (2), pp. 141–145. Cited by: §II-A1.
  • [4] Z. Boulkenafet, J. Komulainen, and A. Hadid (2017) Face Spoofing Detection Using Colour Texture Analysis. IEEE Transactions on Information Forensics and Security 11 (8), pp. 1818–1830. Cited by: §II-A1, §III-A, §III-B, §III-B, §IV-C1, TABLE II.
  • [5] L. Breiman (2001) Random forest. Machine Learning 45, pp. 5–32. Cited by: §II-B.
  • [6] T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §II-B.
  • [7] I. Chingovska, A. Anjos, and S. Marcel (2012) On the Effectiveness of Local Binary Patterns in Face anti-Spoofing. In Biometrics Special Interest Group, pp. 1–7. Cited by: Fig. 1, §I, §IV-A, §IV-A.
  • [8] Y. Duan, J. Lu, J. Feng, and J. Zhou (2018-05) Context-aware local binary feature learning for face recognition. 40 (5), pp. 1139–1153. External Links: Document, ISSN 0162-8828 Cited by: §V.
  • [9] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634. Cited by: §I.
  • [10] D. Gragnaniello, G. Poggi, C. Sansone, and L. Verdoliva (2015) An investigation of local descriptors for biometric spoofing detection. IEEE Transactions on Information Forensics and Security 10 (4), pp. 849–863. Cited by: §II-A1.
  • [11] T. K. Ho (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8), pp. 832–844. Cited by: §II-B.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §I, §II-A2.
  • [13] A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. Cited by: §I.
  • [14] H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C. Kot (2018-10) Learning generalized deep feature representation for face anti-spoofing. 13 (10), pp. 2639–2652. External Links: Document, ISSN 1556-6013 Cited by: §II-A2.
  • [15] H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot (2018) Unsupervised Domain Adaptation for Face Anti-Spoofing. IEEE Transactions on Information Forensics and Security 13 (7), pp. 1794–1809. Cited by: Fig. 1, §IV-A, §IV-A, §IV-C2.
  • [16] F. T. Liu, M. T. Kai, Y. Yu, and Z. H. Zhou (2008) Spectrum of variable-random trees. Journal of Artificial Intelligence Research 32 (1), pp. 355–384. Cited by: §II-B.
  • [17] Y. Liu, X. Chen, C. Liu, and D. Song (2017) Delving into transferable adversarial examples and black-box attacks. In Proceedings of 5th International Conference on Learning Representations, Cited by: §I.
  • [18] J. Lu, V. E. Liong, and J. Zhou (2015-12) Cost-sensitive local binary feature learning for facial age estimation. 24 (12), pp. 5356–5368. External Links: Document, ISSN 1057-7149 Cited by: §V.
  • [19] J. Lu, V. E. Liong, and J. Zhou (2017-05) Deep hashing for scalable image search. 26 (5), pp. 2352–2367. External Links: Document, ISSN 1057-7149 Cited by: §V.
  • [20] J. Lu, V. E. Liong, and J. Zhou (2018-08) Simultaneous local binary feature learning and encoding for homogeneous and heterogeneous face recognition. 40 (8), pp. 1979–1993. External Links: Document, ISSN 0162-8828 Cited by: §V.
  • [21] J. Maatta, A. Hadid, and M. Pietikainen (2011) Face spoofing detection from single images using micro-texture analysis. In 2011 International Joint Conference on Biometrics (IJCB), pp. 1–7. Cited by: §I, §II-A1.
  • [22] D. Menotti, G. Chiachia, A. Pinto, W. R. Schwartz, H. Pedrini, A. X. Falcão, and A. Rocha (2015) Deep representations for iris, face, and fingerprint spoofing detection. IEEE Transactions on Information Forensics and Security 10 (4), pp. 864–879. Cited by: §I, §II-A2, §III.
  • [23] R. Nosaka, Y. Ohkawa, and K. Fukui (2011) Feature Extraction Based on Co-occurrence of Adjacent Local Binary Patterns. In Pacific Rim Conference on Advances in Image and Video Technology, pp. 82–91. Cited by: §I, §II-A1, §IV-C2, §IV-C2.
  • [24] T. Ojala, M. Pietikainen, and T. Maenpaa (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7), pp. 971–987. Cited by: §III-A, §III-B, §IV-B.
  • [25] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas (2012) How many trees in a random forest?. In Machine Learning and Data Mining in Pattern Recognition, P. Perner (Ed.), Berlin, Heidelberg, pp. 154–168. External Links: ISBN 978-3-642-31537-4 Cited by: §IV-C3.
  • [26] K. Patel, H. Han, and A. K. Jain (2016) Secure face unlock: Spoof detection on smartphones. IEEE Transactions on Information Forensics and Security 11 (10), pp. 2268–2283. Cited by: Fig. 1, §I, Fig. 2, §III-B, §IV-A, §IV-A, §IV-B, TABLE III.
  • [27] T. D. F. Pereira, A. Anjos, J. M. D. Martino, and S. Marcel (2012) LBP-TOP Based Countermeasure against Face Spoofing Attacks. In International Conference on Computer Vision, pp. 121–132. Cited by: §II-A1, TABLE II.
  • [28] P. Russu, A. Demontis, B. Biggio, G. Fumera, and F. Roli (2016) Secure kernel machines against evasion attacks. In Proceedings of the 2016 ACM workshop on artificial intelligence and security, pp. 59–69. Cited by: §I, §I.
  • [29] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, New York, NY, USA, pp. 1528–1540. External Links: ISBN 978-1-4503-4139-4, Link, Document Cited by: §I.
  • [30] Z. Sun, L. Sun, and Q. Li (2018) Investigation in Spatial-Temporal Domain for Face Spoof Detection. Note: Accessed on 24 Dec 2018. External Links: Link Cited by: §I, §II-A2, §IV-C2, TABLE II.
  • [31] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. CoRR abs/1312.6199. External Links: Link Cited by: §I.
  • [32] X. Tan, Y. Li, J. Liu, and L. Jiang (2010) Face liveness detection from a single image with sparse low rank bilinear discriminative model. In European Conference on Computer Vision, pp. 504–517. Cited by: §II-A1.
  • [33] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453. Cited by: §I, §I.
  • [34] P. Viola and M. Jones (2004-05) Rapid object detection using a cascade of simple features. International Journal of Computer Vision (57), pp. 137–154. Cited by: §IV-B.
  • [35] Z. Xu, S. Li, and W. Deng (2016) Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 141–145. Cited by: §I, §II-A2.
  • [36] F. Yang and Z. Chen (2018) Using randomness to improve robustness of machine-learning models against evasion attacks. abs/1808.03601. External Links: Link, 1808.03601 Cited by: §I.
  • [37] J. Yang, Z. Lei, and S. Z. Li (2014) Learn convolutional neural network for face anti-spoofing. Computer Science 9218, pp. 373–384. Cited by: §I, §II-A2, §IV-C2, TABLE II.
  • [38] J. Yang, Z. Lei, S. Liao, and S. Z. Li (2013) Face liveness detection with component dependent descriptor. In International Conference on Biometrics, pp. 1–6. Cited by: §I, §II-A1.
  • [39] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li (2012) A face antispoofing database with diverse attacks. In IAPR International Conference on Biometrics, pp. 26–31. Cited by: Fig. 1, §IV-A, §IV-A.
  • [40] Z. Zhou and J. Feng (2017) Deep forest: towards an alternative to deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3553–3559. External Links: Document, Link Cited by: §I, §I, Fig. 3, §II-B, §II-B, §III-A, §III-B, §III-B, §IV-B, §IV-C3, TABLE I, §V.
  • [41] Z. Zhu, P. Luo, X. Wang, and X. Tang (2013) Deep learning identity-preserving face space. In 2013 IEEE International Conference on Computer Vision (ICCV), pp. 113–120. Cited by: §I.