, especially since face recognition technology has reached a satisfactory level of performance. Nowadays, large-scale face recognition systems can easily be found, such as that of the Unique Identification Authority of India (UIDAI), which provides an identity to all residents of India. Although these biometric systems achieve high accuracy in recognizing customer faces, fake faces can still easily fool or bypass them [4, 5]. With the popularity of Internet communication and social media, people's biological information can proliferate rapidly, so criminals can easily obtain face pictures or videos. Indeed, once a face image is shared on the Internet, there is no further control over that image. For instance, Olaye has reported that a mobile app called FindFace allows a user to upload a picture and access that person's social network data, including good-quality images that can be used to mount face presentation attacks. To make matters worse, it is not difficult for criminals to use this biological information to launch a presentation attack. Therefore, considering this urgent security situation, an effective and reliable face PAD method must be developed to detect and counter such threats.
Depending on the kind of fake face, four types of presentation attacks can be considered: (i) printed face photos, (ii) displayed face images, (iii) replayed videos and (iv) 3D masks. In printed photo attacks, face photos are printed on paper and placed in front of the camera. Sometimes, to produce liveness signals such as eye blinking and confuse the face recognition system, the eye area of the photo is cut out and replaced with real blinking eyes. In both displayed image and replayed video attacks, the attacker uses a digital screen to show face images or videos. Compared with printed photo and displayed image attacks, replayed video attacks can exhibit motion and liveness information and are more challenging for common cameras. The 3D mask scenario refers to using 3D printing technology and virtual reality to fabricate a face mask and presenting it to the camera. However, 3D attacks are much more expensive to launch than traditional printed photo, displayed image and replayed video attacks. Fig. 1 shows examples of different face presentation attacks.
In the last decade, many face PAD methods have been proposed to detect fake faces [3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, most methods directly extract features and analyze the differences between real and fake faces in existing color spaces (e.g. RGB, HSV and YCbCr). For instance, the methods in [10, 12] distinguished real faces from fake ones in gray-scale space, and those in [9, 13] detected fake faces in RGB color space. Although these methods can achieve good detection performance, real and fake faces overlap in the original color spaces, as shown in Fig. 2. In particular, with the popularity of high-resolution screens and high-definition printers, fake faces are getting closer to real faces, which makes the detection task more difficult. Therefore, we propose an end-to-end deep learning network to generate a new color-liked space, in which real and fake faces are separated as much as possible.
Inspired by the generative adversarial network (GAN), a color-liked space generator is constructed to map existing color spaces. Then, a feature extractor is designed to obtain features from the learned color-liked space. As mentioned above, the goal of the proposed network is to separate real and fake faces as much as possible. Therefore, we introduce a novel points-to-center triplet mechanism to train the generator. The flowchart is illustrated in Fig. 3. Finally, the extracted features are fed into a Support Vector Machine (SVM) classifier to detect face presentation attacks. We train and test the proposed method on two publicly available databases: Replay-Attack and OULU-NPU. The experimental results demonstrate the effectiveness and generalization capabilities of the proposed method in detecting various fake faces compared to state-of-the-art approaches.
The main contributions of this paper are as follows:
Different from most previous works that detect presentation attacks in existing color spaces, we propose a novel approach based on color-liked space analysis. By maximizing the interclass distance and minimizing the intraclass distance, the learned color-liked space separates fake faces from genuine ones as much as possible instead of letting them overlap.
We develop a new points-to-center triplet training mechanism. Unlike the traditional mechanism, the proposed points-to-center mechanism can guarantee a stable decline in triplet loss.
Extensive experimental analysis is conducted on two recent and challenging face PAD databases using their publicly pre-defined evaluation protocols, which ensures the reproducibility of the results and a fair comparison with state-of-the-art methods. Furthermore, in our cross-database evaluation, the generated color-liked space shows promising generalization capabilities.
The remainder of the paper is organized as follows: Section II reviews the existing state-of-the-art face PAD methods and briefly reviews the development of triplet networks. Section III describes our proposed method based on color-liked space analysis. Section IV provides the details of the experimental setup and Section V discusses the obtained results. Finally, in Section VI, we conclude the paper and discuss some directions for future research.
II Related Work
II-A State-of-the-Art of Face PAD
In the past few years, face PAD has received great attention and many detection approaches have been developed [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. Based on different clues, these countermeasures can be further categorized into texture analysis [10, 12, 13], motion analysis [14, 15, 25], image quality analysis [16, 17, 18, 19, 20], and hardware based methods [11, 19, 20, 26, 27, 28].
II-A1 Texture analysis based methods
Due to the limitations of printers and display devices, there are substantial differences in color distribution and edge texture between real and fake faces. Therefore, multi-scale local binary pattern (LBP) features were extracted to describe edge texture differences. In another work, Chingovska et al. used different kinds of LBP features to improve detection accuracy. To capture the color differences in luminance and chrominance, Boulkenafet et al. [13, 29] proposed computing LBP features from different color spaces and concatenating them into a single feature vector. In a separate work, Boulkenafet et al. constructed a Gaussian pyramid and extracted multi-scale textural features from it; this method shows superior detection performance in inter-database tests. Instead of using LBP features, Akshay et al. extracted Haralick texture features for face PAD. With the recent advances of deep learning in computer vision [33, 34], deep textures have also been applied to face PAD. Yang et al. proposed an end-to-end convolutional neural network (CNN) model for face PAD. Instead of using fully-connected layers, Li et al. [36, 37] extracted hand-crafted features from convolutional feature maps and showed promising performance. To capture texture variations, Xu et al. proposed a long short-term memory (LSTM) network and Li et al. proposed a 3D CNN to detect face presentation attacks. In another work, Li et al. combined hand-crafted features with deep learning features and learned an embedding function that measures the distribution similarity of different face images. More recently, an LBP network that mimics the idea of basic LBP has been designed for face PAD. These texture analysis based approaches perform well when the artificial traces are obvious, such as rough face texture. However, with the popularity of high-definition screens, their detection performance tends to decrease drastically.
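As an illustration of the basic LBP descriptor that several of the methods above build on, the following is a minimal sketch in plain Python. It implements only the standard 8-neighbour, radius-1 operator on a grayscale image stored as a list of rows; the function names, the neighbour ordering and the histogram normalization are assumptions made for this example and are not taken from any of the cited methods.

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code for the interior pixel at row r, column c."""
    center = img[r][c]
    # Neighbours are visited clockwise, starting from the top-left pixel
    # (the ordering is a convention chosen for this sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the center sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

A color-texture variant in the spirit of the approaches above would compute one such histogram per channel of the chosen color space and concatenate the results into a single feature vector.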
II-A2 Motion analysis based methods
Apart from texture analysis, motion also plays an important role in face PAD. For instance, based on the fact that involuntary eye blinking typically occurs every 2 to 4 seconds, an undirected conditional random field framework was proposed to detect printed photo attacks. Whereas the background of a real face is fixed, the background of a fake face is unsteady. Therefore, Anjos et al. detected fake faces by analyzing the motion correlation coefficient between the face region and the background. Moreover, facial motion variations have also been explored for face PAD: LBP-TOP and LDP-TOP features have each been extracted to describe these variations. In another work, Santosh et al. used dynamic mode decomposition (DMD) to capture the dynamics of movements. Instead of analyzing facial motion, some works have tried to characterize the presentation medium and its movements (for example paper, a mobile phone or a laptop). Bao et al. addressed 2D face PAD by detecting planar media. Tan et al. used Difference-of-Gaussians (DoG) to extract the differences in motion deformation patterns between real and fake faces. Since there are no liveness signals in 3D mask attacks, Li et al. proposed a 3D mask PAD method that estimates the pulse from face videos. Although motion analysis based methods are effective against static image attacks and 3D mask attacks, they can still be easily fooled by replayed video attacks. Therefore, it is necessary to ask the subject to perform specific movements [49, 50].
II-A3 Image quality analysis based methods
Methods based on image quality analysis exploit the fact that reproduced face images or videos typically have lower quality and show a loss of detail and sharpness. Therefore, some methods use the high-frequency components of the face image to recognize fake faces. DoG filters have been used in two works to extract high-frequency information. In another work, Li et al. dealt with face PAD using a learned high-frequency feature mapping function. Instead of high-frequency information, Pinto et al. minimized possible negative impacts on low-level features and extracted them to describe the missing detail and sharpness in fake faces. Wen et al. extracted four different kinds of features (i.e. specular reflection, blurriness, chromatic moment and color diversity) to capture the light reflection differences between real and fake faces, which are caused by the medium of the liquid-crystal display (LCD) screen. Galbally et al. extracted different kinds of image quality assessment features to describe the quality of real and fake faces. Elsewhere, several no-reference image quality features and dense optical flow features were fused to detect low-quality fake faces. Such image quality analysis based approaches are expected to work well for low-resolution printed photo attacks or crude face masks, but are likely to fail for high-quality displayed images, replayed videos or 3D masks.
II-A4 Hardware based methods
In addition to analyzing face images and videos, additional or non-conventional hardware, e.g., multi-spectral, depth and light-field cameras, has been applied to acquire useful information about the reflectance properties and the shape of the observed faces. For instance, based on the fact that there is no depth information in 2D presentation attacks, Erdogmus et al. proposed a method that detects a planar surface such as a video display or an unbent printed paper. Pavlidis et al. and Zhang et al. acquired reflectance maps by computing the upper band of the near-infrared (NIR) spectrum and by combining two photodiodes, respectively, instead of using an intrinsic image decomposition algorithm. More recently, light-field cameras have made it possible to exploit disparity and depth information from a single capture. Therefore, these cameras have also been introduced into the face PAD task [26, 27, 28]. Even though hardware based methods can achieve good performance against replayed video attacks, some of them have operating restrictions in certain conditions. For instance, sunlight can cause severe perturbations for NIR and depth sensors, and wearable 3D masks are obviously challenging for methods relying on depth data.
II-B Triplet Network
The triplet network [54, 55] is an extension of the siamese network, which aims to find a mapping function from the original feature space to a new distance space in which samples of the same class are more similar to each other than to samples of different classes. While maximizing interclass distances and minimizing intraclass distances, the triplet network can also maintain a safe margin between different classes, unlike the siamese network. In the past few years, many computer vision and multimedia analysis tasks, such as face verification and person re-identification (Re-ID), have explored the effectiveness of the triplet network. For instance, Ding et al. used the triplet loss to train a deep neural network for person Re-ID. Wang et al. addressed the person Re-ID problem with a unified siamese and triplet deep architecture that jointly extracts single-image and cross-image feature representations. Swami et al. proposed a triplet network for face verification and obtained promising results. Wang et al. designed a siamese-triplet convolutional neural network with a ranking loss function to learn visual representations from unlabeled videos. As far as we know, although the triplet network can achieve better performance, no existing work has introduced it into face PAD.
III Proposed Face PAD Method in Learned Color-Liked Space
In this section, we present the pipeline of our proposed face PAD method, which extracts features from a learned color-liked space. The proposed method consists of three parts: a color-liked space generator, a feature extractor and a triplet loss. The overall pipeline is shown in Fig. 3. First, the captured RGB face image is fed into the color-liked space generator, which maps the existing color space into a new space. Then, the learned color-liked face image, real face images and fake face images are fed into weight-sharing feature extractors to obtain deep features. Finally, the points-to-center triplet training mechanism is used to maximize the interclass distance, minimize the intraclass distance, and keep a safe margin between the face classes. The following subsections explain each part in more detail.
III-A Color-Liked Space Generator
The architecture of the color-liked space generator is shown in Fig. 4. So that the generator maps existing color spaces while still allowing a pre-trained deep learning model to extract features, it outputs convolutional feature maps with three channels. More specifically, a given color video frame is first processed by a convolutional layer and a leaky rectified linear unit (lReLU) layer. The resulting feature maps are then passed through a series of concatenated residual blocks. In each residual block, the input is skip-connected to the output, following He et al. Finally, the feature maps are passed through a last convolutional layer, whose output is the generated color-liked space, i.e., the result of applying the mapping function of the generator, governed by the generator parameters, to the input frame. All convolutional filters in the generator share a single filter size, chosen to allow deeper models with a low number of parameters. Moreover, a batch normalization (BN) layer is introduced after each convolutional layer. The layer hierarchy of the color-liked space generator is summarized in Table I.
III-B Feature Extractor
A perceptual similarity measure was proposed by Dosovitskiy and Brox as well as Johnson et al. Rather than computing the triplet loss directly in the generated color-liked space, our method first maps the generated color-liked space into a feature space through a differentiable function. After feature extraction, each generated image is represented by its feature vector, which depends on the parameters of the feature extractor. Compared with working in the generated color-liked space itself, this feature space allows generator outputs that may not match same-class images with pixel-wise accuracy, and instead encourages the generator to produce images that have similar feature representations.
For the feature mapping, we use the pre-trained implementation of the popular VGG-19 network, which consists of stacked convolutional layers coupled with pooling operations that gradually decrease the spatial dimension of the image and extract higher-level features in higher layers. To acquire high-level features, we use the last fully connected layer to describe the image obtained in the generated color-liked space.
III-C Points-to-Center Triplet Loss
To begin with, consider the set of real face images and the set of fake face images, each containing a given number of face images. The goal of our color-liked space generator is to find a space mapping function that minimizes the intraclass distance, maximizes the interclass distance and keeps a safe margin between the real and fake sets. In triplet networks, conventional works randomly select triplet training samples to train the network [57, 58, 59]. However, this kind of points-to-points selection mechanism restricts the stability of network training. As illustrated in Fig. 5(a), when triplet training samples are combined randomly, the distributions of different classes may shift in the same direction, so the interclass distance cannot be maximized as intended.
To tackle this problem, we choose a center image as the anchor image and combine triplet samples with it, as shown in Fig. 5(b). More specifically, all real and fake face images are sequentially fed into the initialized color-liked space generator and the pre-trained feature extractor to obtain their features. The center of the real face features is then computed.
The anchor image is the real face image whose feature has the smallest Euclidean distance to this feature center. Given a threshold, if the Euclidean distance between a real face image and the anchor image is greater than the threshold, the real face image and the anchor image form a same-class image pair; if the Euclidean distance between a fake face image and the anchor image is less than the threshold, the fake face image and the anchor image form a different-class image pair. Finally, the same-class and different-class pairs are combined into the points-to-center triplet training samples, each consisting of the anchor image, a real face image and a fake face image. As mentioned above, our goal is to minimize the distance between the anchor and the real face image, maximize the distance between the anchor and the fake face image, and keep a safe margin between the two classes. Therefore, the loss function can be formulated as Eq. 1.
Here, the loss includes a safe margin, and the role of the max operation is to prevent the overall value of the loss function from being dominated by easily identifiable triplets.
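The points-to-center sample selection and the margin-based loss described in this subsection can be sketched as follows. This is a minimal illustration in plain Python on feature vectors stored as lists, not the authors' implementation. Several details are assumptions made for the example: the feature center is taken as the arithmetic mean, every selected real image is paired with every selected fake image, distances inside the loss are squared, and the loss is averaged over triplets. All function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def feature_center(features):
    """Center of a feature set (assumed here to be the arithmetic mean)."""
    n = len(features)
    return [sum(f[k] for f in features) / n for k in range(len(features[0]))]


def select_anchor(real_feats):
    """Index of the real face feature closest to the real-feature center."""
    center = feature_center(real_feats)
    return min(range(len(real_feats)),
               key=lambda i: euclidean(real_feats[i], center))


def build_triplets(real_feats, fake_feats, threshold):
    """Return the anchor index and (anchor, real, fake) index triplets."""
    a = select_anchor(real_feats)
    anchor = real_feats[a]
    # Real faces farther from the anchor than the threshold form same-class pairs.
    positives = [i for i, f in enumerate(real_feats)
                 if i != a and euclidean(f, anchor) > threshold]
    # Fake faces closer to the anchor than the threshold form different-class pairs.
    negatives = [j for j, f in enumerate(fake_feats)
                 if euclidean(f, anchor) < threshold]
    # Assumption: every same-class pair is combined with every different-class pair.
    return a, [(a, i, j) for i in positives for j in negatives]


def triplet_loss(real_feats, fake_feats, triplets, margin):
    """Mean hinge loss over triplets, using squared distances (assumption)."""
    if not triplets:
        return 0.0
    total = 0.0
    for a, p, n in triplets:
        d_pos = euclidean(real_feats[a], real_feats[p]) ** 2
        d_neg = euclidean(real_feats[a], fake_feats[n]) ** 2
        # max(0, .) drops triplets that already satisfy the margin.
        total += max(0.0, d_pos - d_neg + margin)
    return total / len(triplets)
```

Because the anchor is fixed for all triplets, every same-class sample is pulled toward the same point and every different-class sample is pushed away from it, which is the intuition behind the stability argument made for Fig. 5(b).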
III-D Gradient of Parameters
In the training stage, the stochastic gradient descent (SGD) algorithm is used to learn the network parameters, so the gradient of the loss with respect to the parameters must be calculated first. Consider, for each triplet sample, the difference between the same-class pair distance and the different-class pair distance:
and the loss function can be rewritten as
Then the partial derivative of the loss function about can be calculated as Eq. 6.
Based on the definition of , its gradient can be obtained as follows:
where is the abbreviation of , is the abbreviation of , and is the abbreviation of , respectively. For the parameter in feature extractor, its gradient can also be obtained as .
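For reference, the quantities described in this subsection can be written out under an assumed notation; the symbols below are illustrative and are not the paper's own. Let $f_a$, $f_p^{i}$ and $f_n^{i}$ denote the extracted features of the anchor, the real face image and the fake face image in the $i$-th triplet, let $m$ be the safe margin, $N$ the number of triplets, and $\theta$ the generator parameters. The use of squared Euclidean distances and of an average over triplets are assumptions of this sketch.

```latex
\begin{align*}
d_i &= \lVert f_a - f_p^{i} \rVert_2^2 - \lVert f_a - f_n^{i} \rVert_2^2, \\
L &= \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; d_i + m\right), \\
\frac{\partial L}{\partial \theta} &= \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\left[d_i + m > 0\right] \frac{\partial d_i}{\partial \theta}, \\
\frac{\partial d_i}{\partial \theta} &=
  2\,(f_a - f_p^{i})^{\top}
  \left(\frac{\partial f_a}{\partial \theta} - \frac{\partial f_p^{i}}{\partial \theta}\right)
  - 2\,(f_a - f_n^{i})^{\top}
  \left(\frac{\partial f_a}{\partial \theta} - \frac{\partial f_n^{i}}{\partial \theta}\right).
\end{align*}
```

Under these assumptions, triplets that already satisfy the margin ($d_i + m \le 0$) contribute no gradient, so only the remaining triplets drive the parameter update.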
III-E Implementation Details
The parameters of the feature extractor are initialized by transferring a pre-trained deep neural network (i.e. VGG-19). The parameters of the color-liked space generator, in contrast, are initialized as illustrated in Eq. 7, which ensures that all convolutional layers in the network initially have approximately the same output distribution and empirically improves the rate of convergence.
Eq. 7 involves samples drawn from a zero-mean, unit-standard-deviation Gaussian function and the number of input channels of the convolutional layer. In the training stage, the momentum and weight decay of SGD are set to 0.9 and 0.0005, respectively. The learning rate is decayed from its initial value by a damping factor each time all mini-batches have been traversed and randomly re-allocated.
The input face images are normalized to a fixed size, with pixel values scaled to the range from 0 to 1. In the color-liked space generator, five residual blocks are concatenated. The threshold is set to five times the average Euclidean distance between the real face images and the anchor image. In the testing stage, we extract the deep features of the learned face images with the feature extractor and use an SVM classifier to predict whether the input face is a presentation attack. We implement the proposed color-liked space generator with the MatConvNet toolbox, version 1.0-beta20 (http://www.vlfeat.org/matconvnet/), and the SVM with liblinear, version 1.96 (https://www.csie.ntu.edu.tw/cjlin/liblinear/).
IV Experimental Setup
IV-A Experimental Data
The IDIAP Replay-Attack database (https://www.idiap.ch/dataset/replayattack/download-proc) consists of 1300 video clips of real-access and attack attempts from 50 clients. These clients are divided into 3 subject-disjoint subsets for training, development and testing (15, 15 and 20, respectively). The real face videos are recorded under two different lighting conditions: controlled and adverse. Three types of attacks are created: printed photos, displayed images and replayed videos. In displayed image and replayed video attacks, high-quality images and videos of real clients are replayed on iPhone 3GS and iPad display devices. For printed photo attacks, high-quality images were printed on A4 paper and presented in front of the camera. Fig. 6 shows some examples of real and fake faces.
[Table II: detection results for different safe margin values on the Replay-Attack and OULU-NPU databases. Columns: Overall attacks, Printed photo, Displayed image, Replayed video.]
The OULU-NPU database (https://sites.google.com/site/oulunpudatabase/welcome) consists of 4950 real-access and attack videos from 55 clients. As in the Replay-Attack database, all clients are divided into 3 subject-disjoint subsets for training, development and testing (20, 15 and 20, respectively). The videos were recorded using the front cameras of six mobile devices in three sessions with different illumination conditions and background scenes. Two types of fake faces are created: printed photo and replayed video attacks. The attacks were created using two printers and two display devices. For the replayed video attacks, the original face videos were recorded by 6 different cell phones. Fig. 7 shows some examples of real and fake faces captured in the first scenario.
IV-B Evaluation Protocol
For performance evaluation, the results are reported in terms of the recently standardized ISO/IEC 30107-3 metrics: Attack Presentation Classification Error Rate (APCER) and Bona Fide Presentation Classification Error Rate (BPCER). In principle, these two metrics correspond to the false acceptance rate (FAR) and false rejection rate (FRR) commonly used in the PAD-related literature. However, unlike FAR and FRR, APCER and BPCER take into account the attacker's potential (such as expertise, resources and motivation) in the worst-case scenario. Note that APCER and BPCER depend on the decision threshold. Therefore, the development set is used to fine-tune the system parameters and estimate the threshold value. To summarize the overall system performance in a single value, the Average Classification Error Rate (ACER) is computed as the average of the APCER and the BPCER at the decision threshold estimated at the Equal Error Rate (EER) on the development set.
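The evaluation procedure described above can be sketched in a few lines of plain Python. The sketch assumes that a higher classifier score means "more likely bona fide", that the APCER is taken as the worst case over attack types, and that the EER threshold is approximated by the candidate score at which FAR and FRR on the development set are closest. The function names and the score convention are assumptions made for illustration.

```python
def apcer(attack_scores_by_type, threshold):
    """Worst-case share of attacks accepted as bona fide, over attack types."""
    rates = []
    for scores in attack_scores_by_type.values():
        accepted = sum(1 for s in scores if s >= threshold)
        rates.append(accepted / len(scores))
    return max(rates)


def bpcer(bona_fide_scores, threshold):
    """Share of bona fide presentations rejected as attacks."""
    rejected = sum(1 for s in bona_fide_scores if s < threshold)
    return rejected / len(bona_fide_scores)


def acer(attack_scores_by_type, bona_fide_scores, threshold):
    """Average of APCER and BPCER at a fixed decision threshold."""
    return (apcer(attack_scores_by_type, threshold)
            + bpcer(bona_fide_scores, threshold)) / 2.0


def eer_threshold(dev_attack_scores, dev_bona_fide_scores):
    """Development-set threshold where FAR and FRR are closest (approximate EER)."""
    candidates = sorted(set(dev_attack_scores) | set(dev_bona_fide_scores))
    best_t, best_gap = None, None
    for t in candidates:
        far = sum(1 for s in dev_attack_scores if s >= t) / len(dev_attack_scores)
        frr = sum(1 for s in dev_bona_fide_scores if s < t) / len(dev_bona_fide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```

In use, `eer_threshold` would be called once on development-set scores, and the returned threshold would then be passed unchanged to `acer` on the test-set scores, so that the test data never influence the operating point.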
V Experimental Results and Discussion
In this section, we present and discuss the detection results obtained in the generated color-liked space. First, the impact of the hyper-parameters is explored. Then, we compare our proposed points-to-center triplet training sample combination mechanism with the conventional random combination mechanism. After that, we visualize the distribution of real and fake faces in the proposed color-liked space and compare the detection results with those computed in existing color spaces. Finally, the performance of our method is compared against state-of-the-art algorithms, and the generalization capabilities of the learned color-liked space are evaluated in cross-database experiments.
V-A Safe Margin for Feature Extractor
In this part, the influence of the safe margin hyper-parameter on detection performance is presented and its optimal value is determined. Table II lists the detection results for different safe margin values. It can be clearly seen that the safe margin has a great influence on performance. For example, when it is set to 0.1, the ACER for replayed video attacks on the Replay-Attack database is 0.0%, whereas when it is set to 5, the ACER rises to 1.2%. We also find that the optimal value differs between databases. More specifically, for the Replay-Attack database the safe margin should be set to 0.1, giving an averaged ACER of 0.5%, while for the OULU-NPU database it should be set to 1, giving an averaged ACER of 5.4%. Considering the averaged ACER over both the Replay-Attack and OULU-NPU databases, we set the safe margin to 0.5 in our proposed method. In addition, the detection performance does not keep improving as the safe margin between real and fake faces increases. We conjecture that a larger safe margin causes the network to over-fit during training, especially because the training data for face PAD are very limited.
[Table III: detection results of the two combination mechanisms on the Replay-Attack and OULU-NPU databases. Columns: Mechanism, Overall attacks, Printed photo, Displayed image, Replayed video.]
V-B Points-to-Center Combination Mechanism
As mentioned above, the triplet training samples are combined using the proposed points-to-center (P2C) mechanism instead of random combination (RC). Table III shows the detection results obtained with these two combination mechanisms. The table shows that the proposed P2C mechanism significantly improves overall system performance, except for the ACER of replayed video attacks on the Replay-Attack database. Comparing the results on the two databases, the improvement brought by the P2C mechanism is more pronounced on the OULU-NPU database. More specifically, the overall ACER on the OULU-NPU database drops by more than 54%, whereas the overall ACER on the Replay-Attack database drops by 43%. The reason may be that the distribution of real and fake faces in the Replay-Attack database is simpler than that in the OULU-NPU database, which limits the advantage of P2C.
[Table IV: detection results in different color spaces on the Replay-Attack and OULU-NPU databases. Columns: Space, Overall attacks, Printed photo, Displayed image, Replayed video.]
Table V: comparison with the state of the art (all values in %). Columns 2-6 report results on Replay-Attack and columns 7-11 on OULU-NPU.
|Method||EER||HTER||APCER||BPCER||ACER||EER||HTER||APCER||BPCER||ACER|
|Scale LBP||0.7||3.1||-||-||-||-||-||-||-||-|
|Deep CNN||6.1||2.1||-||-||-||-||-||-||-||-|
|3D CNN||0.3||1.2||-||-||-||-||-||-||-||-|
|Image Quality||-||15.2||17.9+||12.5||15.2+||31.9||36.3||56.4||24.5||40.5|
|RGB LBP||7.3||11.2||10.1||14.3||12.2||11.2||18.5||18.8||20.0||19.4|
|HSV LBP||7.7||9.8||18.5||6.1||13.7||12.0||15.5||21.1||15.9||18.5|
|YCbCr LBP||4.1||7.4||14.6||8.9||10.4||14.2||17.8||21.7||13.0||17.3|
|Color-liked Space LBP||2.8||2.9||5.0||4.6||4.8||8.6||12.2||14.9||10.5||12.7|
|Color-liked Space VGG||0.8||0.7||2.4||0.0||1.2||7.6||6.0||3.3||9.4||6.3|
+ means the actual value is greater than the listed value.
was retested on OULU-NPU database.
was retested on Replay-Attack and OULU-NPU databases.
V-C Visualization of the Data Distribution and Color-Liked Space
For a clearer analysis of the generated color-liked space, the sample distributions of the Replay-Attack and OULU-NPU databases are plotted in Fig. 10 and Fig. 11, respectively. From Fig. 10(a) and Fig. 11(a), we can clearly see that real and fake face images overlap in RGB color space. In our generated space, however, real and fake faces are well separated, with a safe margin between them. Comparing Fig. 10 and Fig. 11, we find that the Replay-Attack database is easier to handle than the OULU-NPU database, which is consistent with the results shown in Table II.
Apart from plotting the distribution, the generated color-liked space is also visualized in Fig. 8. Since the output of the generator has three channels, we visualize randomly selected real and fake faces in the learned color-liked space under different channel orderings. The visualization shows that the difference between real and fake faces appears mainly in color saturation. More specifically, the real faces are more saturated than the fake faces, which can be explained by the differences in display colour gamut shown in Fig. 9. Comparing Fig. 8 with Fig. 9, we conclude that the learning process of the color-liked space essentially magnifies the differences in color saturation between real and fake faces.
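The saturation observation above can be probed with a simple measurement. The following sketch computes the mean saturation of a set of pixels using the standard HSV definition from Python's `colorsys` module. It is only a proxy: it operates on ordinary 8-bit RGB values rather than on the learned color-liked space, and the two pixel patches are made-up illustrative values, not data from either database.

```python
import colorsys


def mean_saturation(pixels):
    """Mean HSV saturation of an iterable of (r, g, b) pixels with 8-bit channels."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        # colorsys expects channel values in [0, 1].
        _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += s
        count += 1
    return total / count if count else 0.0


# Hypothetical patches: one vivid, one washed out.
vivid_patch = [(200, 40, 40), (210, 60, 50)]
washed_patch = [(200, 150, 150), (210, 170, 160)]
print(mean_saturation(vivid_patch) > mean_saturation(washed_patch))
```

Applied to face crops, a consistently lower mean saturation for recaptured images would agree with the gamut-based explanation given above, although this check alone is not a substitute for the learned space.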
V-D Comparison with Existing Color Spaces
Furthermore, we extract features from the last fully connected layer of VGG-19 in the learned color-liked space and in existing color spaces (i.e. RGB, HSV and YCbCr), and compare the detection results in Table IV. The comparison shows that the proposed color-liked space greatly improves detection performance. For instance, in RGB color space, the averaged APCER, BPCER and ACER are 6.1%, 3.1% and 4.6% on Replay-Attack and 17.7%, 11.6% and 14.7% on OULU-NPU, respectively. In our color-liked space, the corresponding averaged metrics are 1.1%, 0.1% and 0.6% on Replay-Attack and 5.1%, 5.8% and 5.4% on OULU-NPU. This indicates that the color-liked space makes it easier to distinguish real from fake faces than the RGB, HSV or YCbCr spaces.
V-E Comparison with the State of the Art
To remain consistent with the original evaluation protocol and to facilitate comparison by later works, we report both the original EER and HTER and the more recent APCER, BPCER and ACER. Table V gives a comparison with state-of-the-art face PAD approaches from the literature. Apart from VGG features, we also extract LBP features from the learned color-liked space. It can be seen that the proposed color-liked space based method outperforms the state-of-the-art algorithms on the Replay-Attack database. In particular, the HTER of the VGG features is reduced to 0.7%, even though the EER is not the best among the compared methods. For hand-crafted features, the LBP extracted from our learned color-liked space outperforms the approach that extracts LBP features from existing color spaces. On the OULU-NPU database, we compare the proposed method with the database baseline and with a previously proposed method, and our learned color-liked space based method significantly surpasses both. Within the learned color-liked space, the extracted VGG features give better detection results than LBP features. This can be explained by the fact that the color-liked space is learned using VGG features rather than LBP.
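Table VI: cross-database ACER (%). Columns 2-4: trained on Replay-Attack and tested on the OULU-NPU train, dev and test sets; columns 5-7: trained on OULU-NPU and tested on the Replay-Attack train, dev and test sets.
|Method||Train set||Dev set||Test set||Train set||Dev set||Test set|
|Image Quality||50.4||50.1||48.8||52.4||36.8||39.1|
|Color-liked Space LBP||49.9||47.1||44.2||46.7||48.8||48.3|
|Color-liked Space VGG||44.7||44.4||46.2||41.6||42.6||36.2|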
V-F Cross-Database Analysis
In real-world applications, face PAD techniques operate in open environments, where the conditions and attack scenarios are unknown. A cross-database evaluation is therefore conducted to gain insight into the generalization capabilities of our learned color-liked space. More specifically, the color-liked space generator is trained and tuned on one of the databases and then tested on the other. The computed ACERs are summarized in Table VI.
When the generator is trained on OULU-NPU and tested on Replay-Attack, we notice that the averaged ACER of VGG features is 40.1%. When the generator is trained on Replay-Attack and tested on OULU-NPU, the averaged metric is 45.1%. From these results, we conclude that the generator trained on Replay-Attack is not able to be generalized as good as trained on OULU-NPU. It is caused that the OULU-NPU database contains more variations in the collecting environment (e.g., light and camera quality) compared to Replay-Attack. Finally, compared with the baseline of color LBP based method , our proposed method is more stable.
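The two averaged figures quoted above can be reproduced from the Color-liked Space VGG row of the cross-database table. The short check below is only a sketch: the variable names are hypothetical, and the assignment of the first three value columns to "trained on Replay-Attack, tested on OULU-NPU" is an assumption inferred from the fact that the averages then match the quoted numbers.

```python
# ACER values (%) from the Color-liked Space VGG row, in table order
# (Train set, Dev set, Test set for each direction).
# Assumption: the first three columns are Replay-Attack -> OULU-NPU,
# the last three are OULU-NPU -> Replay-Attack.
vgg_replay_to_oulu = [44.7, 44.4, 46.2]
vgg_oulu_to_replay = [41.6, 42.6, 36.2]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Round to one decimal place, matching the precision used in the text.
avg_replay_to_oulu = round(mean(vgg_replay_to_oulu), 1)
avg_oulu_to_replay = round(mean(vgg_oulu_to_replay), 1)

print(avg_replay_to_oulu, avg_oulu_to_replay)
```

Under this column assignment the averages come out as 45.1% and 40.1%, which agrees with the values reported for the two training directions.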
Although our proposed method can work well in intra database and cross database tests, it is still difficult to deal with face PAD when the light of collecting environment is variant. This is due to the fact that the brightness of captured face images can affect the distribution of colour gamut. In addition to light variation, the quality of the camera is also an important factor in the effectiveness of our detection method. Especially for the camera is not good enough to capture color information, our proposed countermeasure will may not be applicable.
In this paper, we addressed the problem of face PAD from the viewpoint of color space analysis. Based on triplet training and a perceptual similarity measure, a new color-liked space was generated for fake face detection. Instead of randomly combining triplet training samples, we proposed a points-to-center combination method to address the instability of the training cost. Extensive experiments on two recent and challenging presentation attack databases (Replay-Attack and OULU-NPU) showed excellent results. On the OULU-NPU database, the proposed color-liked space based method outperformed the baseline, while very competitive results were achieved on the Replay-Attack database. Furthermore, in our cross-database evaluation, the proposed method showed promising generalization capabilities. Overall, the results on the Replay-Attack and OULU-NPU databases indicate that external environmental factors (e.g., lighting and camera quality) limit the effectiveness of the proposed detection method. We will therefore focus on reducing the influence of these factors and improving the robustness of our method. Finally, the proposed method chooses only one anchor image, taken from the real face images, which means that it minimizes the intraclass distance only within the real face set. We will therefore also try choosing a second anchor image from the fake face images, so as to minimize the intraclass distance in both the real and fake face sets.
This work was partly supported by the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JQ6090) and the Aerospace Science and Technology Foundation.
-  Z. Akhtar, G. Fumera, G. L. Marcialis, and F. Roli, “Evaluation of serial and parallel multibiometric systems under spoofing attacks,” in IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems, Sept 2012, Conference Proceedings, pp. 283–288.
-  D. Wen, H. Han, and A. K. Jain, “Face spoof detection with image distortion analysis,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 746–761, April 2015.
-  L. Li, P. L. Correia, and A. Hadid, “Face recognition under spoofing attacks: countermeasures and research directions,” IET Biometrics, vol. 7, no. 1, pp. 3–14, 2018.
-  Y. Li, K. Xu, Q. Yan, Y. Li, and R. H. Deng, “Understanding osn-based facial disclosure against face authentication systems,” ACM Symposium on Information, Computer and Communications Security, pp. 413–424, June 2014.
-  L. Omar and I. Ivrissimtzis, “Evaluating the resilience of face recognition systems against malicious attacks,” in Seventh UK British Machine Vision Workshop, Sept 2015, Conference Proceedings, pp. 5.1–5.9.
-  “Biometrics and facial recognition a dangerous new frontier for data,” https://www.itproportal.com/features/biometrics-and-facial-recognition-a-dangerous-new-frontier-for-data/, accessed Jun 4, 2018.
-  I. Manjani, S. Tariyal, M. Vatsa, R. Singh, and A. Majumdar, “Detecting silicone mask-based presentation attack via deep dictionary learning,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 7, pp. 1713–1723, July 2017.
-  Y. Xu, T. Price, J.-M. Frahm, and F. Monrose, “Virtual u: Defeating face liveness detection by building virtual models from your public photos,” in 25th USENIX Security Symposium. Austin, TX: USENIX Association, 2016, pp. 497–512.
-  H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot, “Unsupervised domain adaptation for face anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 7, pp. 1794–1809, July 2018.
-  I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of local binary patterns in face anti-spoofing,” in Biometrics Special Interest Group, Sept 2012, Conference Proceedings, pp. 1–7.
-  N. Erdogmus and S. Marcel, “Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect,” in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems, Sept 2013, Conference Proceedings, pp. 1–6.
-  J. Määttä, A. Hadid, and M. Pietikäinen, “Face spoofing detection from single images using micro-texture analysis,” in International Joint Conference on Biometrics, Oct 2011, Conference Proceedings, pp. 1–7.
-  Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face anti-spoofing based on color texture analysis,” in IEEE International Conference on Image Processing, Sept 2015, Conference Proceedings, pp. 2636–2640.
-  G. Pan, L. Sun, Z. Wu, and S. Lao, “Eyeblink-based anti-spoofing in face recognition from a generic webcamera,” in IEEE International Conference on Computer Vision, Oct 2007, Conference Proceedings, pp. 1–8.
-  Y. Li and X. Tan, “An anti-photo spoof method in face recognition based on the analysis of fourier spectra with sparse logistic regression,” in Chinese Conference on Pattern Recognition, Nov 2009, Conference Proceedings, pp. 1–5.
-  Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, “A face antispoofing database with diverse attacks,” in International Conference on Biometrics, March 2012, Conference Proceedings, pp. 26–31.
-  X. Tan, Y. Li, J. Liu, and L. Jiang, “Face liveness detection from a single image with sparse low rank bilinear discriminative model,” in European Conference on Computer Vision, Sept 2010, Conference Proceedings, pp. 504–517.
-  H. Li, S. Wang, and A. C. Kot, “Face spoofing detection with image quality regression,” in International Conference on Image Processing Theory Tools and Applications, Dec 2016, Conference Proceedings, pp. 1–6.
-  I. Pavlidis and P. Symosek, “The imaging issue in an automatic face/disguise detection system,” in IEEE Workshop on Computer Vision Beyond the Visible Spectrum: Methods and Applications, June 2000, Conference Proceedings, pp. 15–24.
-  Z. Zhang, D. Yi, Z. Lei, and S. Z. Li, “Face liveness detection by learning multispectral reflectance distributions,” in IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, May 2011, Conference Proceedings, pp. 436–441.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, Feb 1995.
-  Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid, “Oulu-npu: A mobile face presentation attack database with real-world variations,” in IEEE International Conference on Automatic Face and Gesture Recognition, May 2017, Conference Proceedings, pp. 612–618.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
-  A. Anjos and S. Marcel, “Counter-measures to photo attacks in face recognition: A public database and a baseline,” in International Joint Conference on Biometrics, Oct 2011, Conference Proceedings, pp. 1–7.
-  S. Kim, Y. Ban, and S. Lee, “Face liveness detection using a light field camera,” Sensors, vol. 14, no. 12, pp. 22471–22499, Nov 2014.
-  Z. Ji, H. Zhu, and Q. Wang, “Lfhog: A discriminative descriptor for live face detection from light field image,” in IEEE International Conference on Image Processing, Sept 2016, Conference Proceedings, pp. 1474–1478.
-  A. Sepas-Moghaddam, P. L. Correia, and F. Pereira, “Light field local binary patterns description for face recognition,” in IEEE International Conference on Image Processing, Sept 2017, Conference Proceedings, pp. 3815–3819.
-  Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face spoofing detection using colour texture analysis,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 8, pp. 1818–1830, Aug 2016.
-  Z. Boulkenafet, J. Komulainen, X. Feng, and A. Hadid, “Scale space texture analysis for face anti-spoofing,” in International Conference on Biometrics, June 2016, Conference Proceedings, pp. 1–6.
-  A. Agarwal, R. Singh, and M. Vatsa, “Face anti-spoofing using haralick features,” in IEEE International Conference on Biometrics Theory, Applications and Systems, Sept 2016, Conference Proceedings, pp. 1–6.
-  R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Transactions on Systems Man and Cybernetics, vol. smc-3, no. 6, pp. 610–621, Nov 1973.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, Dec 2012, Conference Proceedings, pp. 1097–1105.
-  Z. Xia, X. Peng, X. Feng, and A. Hadid, “Deep convolutional hashing using pairwise multi-label supervision for large-scale visual search,” Signal Processing Image Communication, vol. 59, pp. 109–116, 2017.
-  J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for face anti-spoofing,” Computing Research Repository, vol. abs/1408.5601, pp. 373–384, Aug 2014.
-  L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid, “An original face anti-spoofing approach using partial convolutional neural network,” in International Conference on Image Processing Theory Tools and Applications, Dec 2016, Conference Proceedings, pp. 1–6.
-  L. Li, X. Feng, X. Jiang, Z. Xia, and A. Hadid, “Face anti-spoofing via deep local binary patterns,” in IEEE International Conference on Image Processing, Sept 2018, Conference Proceedings, pp. 101–105.
-  Z. Xu, S. Li, and W. Deng, “Learning temporal features using lstm-cnn architecture for face anti-spoofing,” in Asian Conference on Pattern Recognition, Nov 2015, Conference Proceedings, pp. 141–145.
-  H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C. Kot, “Learning generalized deep feature representation for face anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 10, pp. 2639–2652, Oct 2018.
-  L. Li, X. Feng, Z. Xia, X. Jiang, and A. Hadid, “Face spoofing detection with local binary pattern network,” Journal of Visual Communication and Image Representation, vol. 54, pp. 182–192, 2018.
-  C. N. Karson, “Spontaneous eye-blink rates and dopaminergic systems,” Brain, vol. 106, no. 3, pp. 643–653, 1983.
-  T. D. F. Pereira, A. Anjos, J. M. D. Martino, and S. Marcel, “Lbp-top based countermeasure against face spoofing attacks,” in Asian Conference on Computer Vision Workshops, Nov 2012, Conference Proceedings, pp. 121–132.
-  Q. T. Phan, D. T. Dang-Nguyen, G. Boato, and F. G. B. D. Natale, “Face spoofing detection using ldp-top,” in IEEE International Conference on Image Processing, Sept 2016, Conference Proceedings, pp. 404–408.
-  G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, June 2007.
-  B. Zhang, Y. Gao, S. Zhao, and J. Liu, “Local derivative pattern versus local binary pattern: Face recognition with high-order local pattern descriptor,” IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 533–544, Feb 2010.
-  S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. S. Ho, “Detection of face spoofing using visual dynamics,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762–777, April 2015.
-  W. Bao, H. Li, N. Li, and W. Jiang, “A liveness detection method for face recognition based on optical flow field,” in International Conference on Image Analysis and Signal Processing, April 2009, Conference Proceedings, pp. 233–236.
-  X. Li, J. Komulainen, G. Zhao, P. C. Yuen, and M. Pietikäinen, “Generalized face anti-spoofing by detecting pulse from face videos,” in International Conference on Pattern Recognition, Dec 2017, Conference Proceedings, pp. 4244–4249.
-  G. Pan, L. Sun, Z. Wu, and Y. Wang, “Monocular camera-based face liveness detection by combining eyeblink and scene context,” Telecommunications Systems, vol. 47, no. 3-4, pp. 215–225, Aug 2011.
-  D. F. Smith, A. Wiliem, and B. C. Lovell, “Face recognition on consumer devices: Reflections on replay attacks,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 736–745, April 2015.
-  A. Pinto, H. Pedrini, W. R. Schwartz, and A. Rocha, “Face spoofing detection through visual codebooks of spectral temporal cubes,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4726–4740, Dec 2015.
-  J. Galbally, S. Marcel, and J. Fierrez, “Image quality assessment for fake biometric detection: Application to iris, fingerprint, and face recognition,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 710–724, Feb 2014.
-  L. Feng, L. M. Po, Y. Li, X. Xu, F. Yuan, C. H. Cheung, and K. W. Cheung, “Integration of image quality and motion cues for face anti-spoofing,” Journal of Visual Communication and Image Representation, vol. 38, pp. 451–460, July 2016.
-  E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” CoRR, vol. abs/1412.6622, 2014.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” CoRR, vol. abs/1503.03832, 2015.
-  J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” in Advances in Neural Information Processing Systems, 1994, pp. 737–744.
-  S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person re-identification,” Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
-  F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang, “Joint learning of single-image and cross-image representations for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1288–1296.
-  S. Sankaranarayanan, A. Alavi, and R. Chellappa, “Triplet similarity embedding for face verification,” CoRR, vol. abs/1602.03418, 2016.
-  X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” CoRR, vol. abs/1505.00687, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE conference on computer vision and pattern recognition, 2016, Conference Proceedings, pp. 770–778.
-  M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” in IEEE International Conference on Computer Vision (ICCV), 2017, Conference Proceedings, pp. 4501–4510.
-  S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, Conference Proceedings, pp. 448–456.
-  A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, Conference Proceedings, pp. 658–666.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, Conference Proceedings, pp. 694–711.
-  L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Physica-Verlag HD, Sept 2010, pp. 177–186.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE International Conference on Computer Vision (ICCV), Dec 2015, Conference Proceedings, pp. 1026–1034.
-  R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin, “Liblinear: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, no. 9, pp. 1871–1874, Aug 2008.
-  “Iso/iec jtc 1/sc 37 biometrics. information technology - biometric presentation attack detection - part 1: Framework,” International Organization for Standardization, Tech. Rep., 2016.