Camera Invariant Feature Learning for Generalized Face Anti-spoofing

There is an increasing consensus in learning based face anti-spoofing that the divergence in terms of camera models causes a large domain gap in real application scenarios. We describe a framework that eliminates the influence of the inherent variance from acquisition cameras at the feature level, leading to a generalized face spoofing detection model that is highly adaptive to different acquisition devices. In particular, the framework is composed of two branches. The first branch aims to learn camera invariant spoofing features via feature level decomposition in the high frequency domain. Motivated by the fact that spoofing features exist not only in the high frequency domain, the second branch further boosts the discrimination capability of the extracted spoofing features from an enhanced image based on the recomposition of the high-frequency and low-frequency information. Finally, the classification results of the two branches are fused together by a weighting strategy. Experiments show that the proposed method achieves better performance in both intra-dataset and cross-dataset settings, demonstrating high generalization capability in various application scenarios.

I Introduction

Face authentication services have been growing exponentially in the past decade, coinciding with the accelerated proliferation of acquisition devices and advances in artificial intelligence. Though remarkable recognition performance has been achieved, security remains a very challenging problem, as the system can be easily attacked even by a printed photo or a replayed video. To prevent the face recognition system from being vulnerable, face Presentation Attack Detection (PAD) algorithms have been widely studied to distinguish spoofing faces from live ones.

Recently, various face presentation attack methods have been developed to deceive authentication systems, such as the print attack (printing a face on paper), the replay attack (replaying a face video on another device), and the mask attack (wearing a mask). In the literature, numerous methods have been investigated for PAD, and the majority of them rely on computational vision algorithms. In particular, both handcrafted [1, 2, 3, 4] and deep learning [5, 6, 7, 8] based features have been developed. For deep learning based methods that rely on training models with labelled data, a large gap between the limited performance and the essential requirements has been observed when the training and testing data are from different domains. One typical example is that the training data are acquired with one type of camera while the testing data are from another type of camera. To increase the domain adaptation and generalization capability, efforts have been devoted to extracting auxiliary information, including depth [9] and Remote Photoplethysmography (rPPG) signals [10]. Attempts have also been made to use domain adaptation technology [11, 12, 13] to mitigate the gap between different domains.

Fig. 1: Illustrations of live (a) and spoofing (b) faces acquired with two different cameras. The gray-scale maps show corresponding features of the second residual layer of the face images. Two types of cameras in CASIA-FASD [14] database are used for demonstration, including the low quality camera (first and third columns) and high quality camera (second and fourth columns). The features are from the ResNet-18 [15] network trained on CASIA-FASD database.

It has been widely recognized that camera information is a dominant factor causing the domain gap. One typical example is shown in Fig. 1. In particular, although the spoofing face images are generated by an identical attack type (print attack) and identity, the spoofing patterns still vary dramatically across cameras. This phenomenon reveals that the camera divergence between training and testing can cause PAD performance degradation, which is further validated in Fig. 2(a). More specifically, it is observed that features from spoofing faces and live faces lack discrimination capability, with large overlap between them, when training and testing are performed under cross-camera settings. However, given the numerous camera types, it is difficult to collect sufficient training data and specifically train a model for each of them, especially with the surge of emerging cameras. Motivated by this, we aim to propose a novel deep learning based PAD model with high generalization capability. To this end, the camera information should be effectively removed. The composition of facial features, which has been widely studied in the literature [16, 17], motivates us to automatically learn the camera invariant features. Herein, we propose a feature level decomposition scheme, such that the trained model does not depend on the acquisition device. This allows the model to be widely applied in myriad applications, as the learned model can be well generalized to unseen cameras at large. Extensive experiments have demonstrated that the proposed scheme achieves state-of-the-art performance and reveals high generalization capability. The main contributions of this paper are as follows.

  • We propose a novel framework with two branches to improve the generalization capability of face spoofing detection. The first is the camera invariant branch, aiming to provide high-frequency domain features with the camera variance eliminated. Considering that spoofing clues in other frequency ranges (e.g., lighting) may be neglected by the first branch, the second branch is the feature discrimination augmentation branch, which generates features from an enhanced image recomposed from the low and high frequency layers of the original input image.

  • We develop a sophisticated camera variance removal scheme based on the feature level decomposition. The features with the mixture of spoofing and camera information are efficiently decomposed with a pseudo siamese network, in an effort to blindly infer the feature that well reflects the spoofing information while being invariant to different camera types.

  • The classification results of the two branches are fused for the final decision. Experiments show that our proposed method can achieve high accuracy not only on intra-dataset settings but also on cross-dataset scenarios, demonstrating superior generalization capacity with camera-invariant feature extraction.

The rest of this paper is organized as follows. We first review the related works in Section II. Subsequently, the proposed scheme is detailed in Section III. The experimental results are presented in Section IV, and finally we conclude this paper in Section V.

Fig. 2: T-SNE [18] visualization of the second-to-last fully connected layer. (a) ResNet-18; (b) the first branch of our proposed model. The two networks are trained on CASIA-FASD database and tested on two types of cameras in MSU-MFSD [19] and Replay-Attack [1] databases.

II Related Works

For face anti-spoofing, the intrinsic spoofing features on which the binary classification is performed are expected to be both discriminative and highly generalizable [20]. In the literature, a series of features have been studied and developed, including both hand-crafted and deep learning features.

II-A Hand-crafted Features

Based on the observation that certain characteristics of texture in spoofing faces and live faces are different, hand-crafted features were first exploited, including Local Binary Patterns (LBP) [1], Local Phase Quantization (LPQ) [2], Histogram of Gradients (HoG) [3], Scale-Invariant Feature Transform (SIFT) [4] and Speeded Up Robust Features (SURF) [2]. In contrast with feature extraction performed in the spatial domain, in [21], Li et al. utilized the dissimilarity in Fourier spectra by considering that fewer High Frequency (HF) components exist in spoofing images compared with the live ones. To obtain texture features based on the 3-D plane in videos, the high frequency information in both spatial and temporal domains is exploited in [22]. In [23], Chan et al. incorporated flash light for more stable spoofing feature extraction by reducing the influence of environmental factors. Compared with HF texture features, Low Frequency (LF) features have also been utilized in image quality based methods [24, 25, 8]. In [25], with the live face image as the reference, color distortion relevant features are extracted and compared by Mean Squared Error (MSE), Maximum Difference (MD), R-Averaged Maximum Difference (RAMD), etc. Regarding no-reference image quality assessment, in [19], the concatenated features of specularity, blurriness and color distortion are utilized.

Although these handcrafted features are computationally efficient and perform well in intra-database settings, they may easily fail when there are large variations in attack scenarios [26]. To tackle this issue, additional clues such as motion, depth and blood circulation have also been incorporated. Motion clues from eye blinking [27, 28] and word speaking [29, 30] can be acquired from multiple frames. Moreover, in [31], the pulse generated by facial blood circulation is used, since only live face videos exhibit such traits. In [29], the facial expression clues were first enhanced by an Eulerian motion magnification algorithm, and then the LBP texture features and Histograms of Oriented Optical Flow (HOOF) motion features were fused for final classification. The 3D depth information of the captured face [32], [33] and infrared images [34] are effective clues, though these solutions rely on additional sensors and could be more expensive to deploy.

II-B Deep Learning Features

For face PAD, deep learning based methods have also been widely studied for obtaining more discriminative features that account for the spoofing patterns. Yang et al. [5] first proposed to use a Convolutional Neural Network (CNN) for face spoofing detection. In [32], the pulse information and other spatial and temporal features learned by CNNs are fused together to boost the performance. In [35], a multi-level deep dictionary learning based method was proposed especially for silicone mask attacks. To improve the performance of CNNs, transfer learning based schemes have been adopted based on CNNs pretrained on ImageNet [36], VGG-Face [37] and GoogLeNet [6]. Motivated by denoising algorithms, in [17] the spoofing patterns are treated as spoofing noise in the live face and extracted by a CNN architecture for classification.

However, due to the limited size of existing labeled data, CNN models may be prone to over-fitting. To address this issue, existing methods can be classified into three categories. The first one uses auxiliary information. In [38], Atoum et al. proposed a two-stream CNN based model, where one stream is responsible for patch-based anti-spoofing while the other is developed for depth estimation. In addition to depth, remote Photoplethysmography (rPPG) signals have also been exploited from raw videos by a CNN-RNN scheme in [9]. The second category consists of domain adaptation based methods, which, considering the domain shift between different databases, have been proposed to shrink the domain gap between samples in the training and testing sets. In [11], Li et al. proposed an embedding function to map the image data into another space, such that a Maximum Mean Discrepancy (MMD) based loss can be optimized to evaluate the similarity of the source and target domains. In recent work [39], Wang et al. utilized a generative adversarial network for domain adaptation based face spoofing detection. The embedding space shared by the source and target domains can be learned when the discriminator cannot reliably predict whether a sample is from the source or the target domain. However, due to the requirement of attack samples from the test database, domain adaptation based methods may not be practical, as it is difficult to acquire spoofing images for every unseen device. The last category is domain generalization based methods. In [12], Li et al. utilized a 3D CNN model for spatial-temporal information extraction, where a regularization term is incorporated to reduce the domain shift among different domains by minimizing the MMD. To learn a more generalized representation for face anti-spoofing, Tu et al. adopted the Total Pairwise Confusion (TPC) [40] loss for CNN training, and moreover an identity based method was studied.

Fig. 3: Illustration of the proposed framework which consists of two branches. In the feature invariant branch, the camera information is removed from the features that are specifically responsible for spoofing detection in HF domain. In the feature discrimination augmentation branch, we aim to extract more discriminative features based on a recomposition of the learned low/high-frequency components. Finally, the classification results of two branches are fused for the final prediction.

III The Proposed Scheme

III-A Framework

Generally speaking, the hardware and software processing in visual information acquisition leaves unique traces in the final images or videos [41, 42, 43]. These distinct fingerprints, which usually lie in the HF domain (e.g., sensor pattern noise), imply unique camera information and exhibit strong invariance to the captured scene. Unfortunately, important clues for face spoofing detection, such as the moiré pattern in replay attacks or the texture of artificial materials (paper, mask), also belong to the HF domain. As such, to obtain camera invariant spoofing features, the camera contamination causing the domain divergence between different cameras should be eliminated in a scientifically sound way.

As illustrated in Fig. 3, we propose a two branch based model, in an effort to extract camera-free and spoofing specific features to achieve generalized face anti-spoofing. In the first branch, we focus on the HF information, which contains abundant clues regarding camera and spoofing relevant features. Instead of performing metric learning to learn camera irrelevant features, we treat the camera information as a factor that varies the distribution of the spoofing features, such that a feature decomposition scheme can be proposed to align features captured from different cameras and pursue camera-invariant feature learning. In particular, unlike the conventional siamese CNN architecture [44] which is designed with shared weights for its sub-networks, we propose an architecture based on a pseudo-siamese network, in which the two sub-networks share the same structure while each sub-network learns its own weights. More specifically, the first sub-network aims at camera information extraction, and the second sub-network targets the extraction of spoofing features accompanied with camera information. Due to the fact that the two sub-networks share the same camera classifier, the first sub-network is expected to extract the same camera information existing in the second one. Then we utilize a feature decomposition scheme, with which the spoofing features can be independently extracted based on the obtained camera feature. However, straightforwardly using only the first branch may limit the performance, as only HF features are considered, while other spoofing clues including lighting, reflection, etc. tend to be neglected. In view of this, in the second branch, we extract discrimination augmented features based on an enhanced image recomposed from the low/high-frequency signals to enhance the detection accuracy. Finally, the detection results of the two branches are fused for the final classification.
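A minimal PyTorch sketch of the pseudo-siamese arrangement described above is given below. It is an illustrative reconstruction rather than the authors' released code: the ResNet-18 residual layers, the shared per-pixel camera classifier, and the subtraction-based decomposition follow the description in this section, while the class name, channel widths, and the omitted pre-processing stem are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class PseudoSiamese(nn.Module):
    """Two sub-networks with identical structure but independent weights,
    plus a shared per-pixel camera classifier (hypothetical reconstruction)."""

    def __init__(self, num_cameras: int):
        super().__init__()

        def trunk() -> nn.Sequential:
            # Three residual layers of a ResNet-18 style backbone (Table II).
            m = resnet18()
            return nn.Sequential(m.layer1, m.layer2, m.layer3)

        self.camera_net = trunk()   # upper sub-network: camera information
        self.spoof_net = trunk()    # bottom sub-network: spoofing + camera info
        # Shared camera classifier: one output channel per camera type.
        self.camera_cls = nn.Conv2d(256, num_cameras, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor):
        # x: (B, 64, 56, 56) maps produced by the EDDF/convolution stem
        # of Section III-B (the stem itself is omitted in this sketch).
        f_cam = self.camera_net(x)   # camera feature maps
        f_mix = self.spoof_net(x)    # spoofing features mixed with camera info
        f_dec = f_mix - f_cam        # feature level decomposition, Eqn. (7)
        return self.camera_cls(f_cam), self.camera_cls(f_mix), f_dec
```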

III-B Pre-processing of Input Images

Apparently, face anti-spoofing relying on textures aims to extract the unique features that distinguish acquired genuine facial skin from screens, paper, photos and masks. Analogous to the camera information, these clues are irrelevant to the face identity but closely related to the texture quality, motivating us to first focus on the HF domain. As shown in Fig. 4, the Eight-Direction Differential Filter-set (EDDF) is adopted for high-frequency information extraction. In particular, to eliminate the influence of the background, we first crop the face region from the original image using the face detection algorithm proposed in [45]. Subsequently, we perform the filtering based on the EDDF, which is implemented as a fixed 2D convolution layer applying eight kernels to the three channels of the input image, producing 24 output feature maps. These maps are further enhanced by adaptively enlarging the receptive field with a multi-channel CNN layer, as shown in Fig. 3. The EDDF is adopted based on the inspiration of the steganalysis rich model (SRM) [46], where the high-frequency components can be extracted by a union of many diverse submodels, leading to comprehensive representations of the high frequency information with the eight corresponding high pass filters. The SRM has also been widely used for image manipulation detection [47] as well as color filter array (CFA) pattern extraction [48]. Instead of using bunches of high pass filters, we employ the EDDF to provide the basic residual operations, so that the useful high-frequency components can be adaptively extracted from the EDDF output. The resulting feature maps are treated as the input of a pseudo-siamese network for extracting spoofing specific features with the camera information eliminated. Instead of using only the learned features extracted from the HF information, the second branch learns to recompose the low/high frequency signals of the original image to augment spoofing clues for more discriminative feature learning. To this end, in this branch, the augmentation components are learned by a three channel convolution layer performed on the EDDF output. As such, the learned maps are combined with the original image to generate an augmented image, serving as the input of the second branch.
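The following is a hedged sketch of how the EDDF pre-processing could be realized as a fixed convolution layer. The exact kernel values are not specified in the text, so each 3×3 kernel is assumed to compute the difference between the center pixel and one of its eight neighbors, applied to every color channel (3 channels × 8 directions = 24 residual maps); function and variable names are ours.

```python
import torch
import torch.nn as nn


def build_eddf() -> nn.Conv2d:
    directions = [(-1, -1), (-1, 0), (-1, 1),
                  (0, -1),           (0, 1),
                  (1, -1),  (1, 0),  (1, 1)]
    kernels = []
    for dy, dx in directions:
        k = torch.zeros(3, 3)
        k[1, 1] = 1.0              # center pixel
        k[1 + dy, 1 + dx] = -1.0   # subtract the neighbor in this direction
        kernels.append(k)
    # Depthwise layout: every input channel is filtered by all eight kernels.
    weight = torch.stack(kernels).repeat(3, 1, 1).unsqueeze(1)  # (24, 1, 3, 3)
    conv = nn.Conv2d(3, 24, kernel_size=3, padding=1, groups=3, bias=False)
    conv.weight.data.copy_(weight)
    conv.weight.requires_grad_(False)  # fixed high-pass residual filters
    return conv


# Usage: high-frequency residual maps for a batch of cropped 224x224 faces.
eddf = build_eddf()
residuals = eddf(torch.randn(8, 3, 224, 224))  # -> (8, 24, 224, 224)
```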

III-C Camera Invariant Feature Learning

Given the HF domain input, camera invariant feature learning aims to obtain discriminative and generalized features based on feature level decomposition. More specifically, a pseudo-siamese network is adopted in an effort to remove the camera variance from the spoofing specific features. As illustrated in Fig. 3, in the first branch, the two sub-networks of the pseudo-siamese network share the same structure with three residual layers of ResNet-18 [15]. However, they are individually trained towards different targets. In particular, the upper sub-network in the first branch is specifically designed to obtain camera relevant features. Given the fact that the camera information is determined by the camera type, a classification loss on camera types is adopted to guide the generation of the camera information. The learned feature maps at the third residual layer serve as the camera feature maps, upon which a camera classifier is implemented with a CNN layer whose number of output channels equals the number of camera categories in the training database. Considering hard example mining, based on the original focal loss [49] we design a per-pixel multi-class focal loss on the output maps of the last CNN layer for camera type identification,

(1)
(2)

Herein, the loss is accumulated over the spatial dimensions of the output maps, and the camera-type label is defined per pixel: an element takes the value 1 if the corresponding location belongs to the labelled camera type and 0 otherwise. A constant parameter adjusts the penalty on easy examples, and the prediction is obtained by the softmax function in Eqn. (2), applied over the camera-type channels at each spatial location. The reason to adopt the pixel-wise loss to form the constraint is that the camera information should be shared and remain identical among local patches sampled from an image. Moreover, each element of the feature corresponds to a patch of the input image. When we enforce the constraint on these elements, it amounts to repeatedly sampling training patches from the image, which largely improves the diversity of the data and alleviates the over-fitting problem.
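A minimal sketch of such a per-pixel multi-class focal loss is given below, assuming that every spatial location of an image shares that image's camera label; the constant gamma and all names are placeholders rather than the paper's notation.

```python
import torch
import torch.nn.functional as F


def pixelwise_camera_focal_loss(logits: torch.Tensor,
                                camera_labels: torch.Tensor,
                                gamma: float = 2.0) -> torch.Tensor:
    """logits: (B, N, H, W) camera-type maps; camera_labels: (B,) integer ids."""
    b, _, h, w = logits.shape
    # Every spatial location of an image shares that image's camera label,
    # which amounts to repeatedly sampling local patches from the image.
    target = camera_labels.view(b, 1, 1).repeat(1, h, w)
    ce = F.cross_entropy(logits, target, reduction='none')  # (B, H, W)
    pt = torch.exp(-ce)                                     # per-pixel probability
    return (((1.0 - pt) ** gamma) * ce).mean()              # focal modulation


# Usage with 5 camera types on 14x14 output maps:
loss = pixelwise_camera_focal_loss(torch.randn(4, 5, 14, 14),
                                   torch.tensor([0, 2, 1, 4]))
```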

The bottom sub-network of the first branch aims to learn the spoofing specific features. Analogous to the upper sub-network, its learned feature maps at the third residual layer are shown in Fig. 3. They are processed by an average pooling layer, and two fully connected layers with a softmax function produce the prediction on spoofing. The binary (live vs. spoofing) focal loss is adopted to encourage the model to discriminate hard examples well and to increase the distance between genuine and spoofing samples, such that more discriminative spoofing features can be learned. It is formulated as follows,

(3)

where the ground truth label equals 0 for spoofing and 1 for live samples, and the prediction is the probability of the live class. Constant parameters control the balance between hard and easy samples. The spoofing features extracted by the bottom sub-network are mixed with camera information. As such, to further eliminate the camera variance, the camera classification loss is also imposed on the output maps generated by the bottom sub-network, driving its training with the shared camera classifier,

(4)
Fig. 4: Illustration of the eight kernels in EDDF.

Herein, our aim is to construct an approximation of the latent camera invariant features conditioned on the explicitly learned camera features and mixed spoofing features, with a strong prior that the resultant features have a strong capability in discriminating spoofing patterns while carrying no information regarding camera types. To this end, we formulate a feature level decomposition model by establishing the relationship among the camera feature, the mixed spoofing feature, and the desired camera invariant HF spoofing feature. As the feature we expect to acquire is independent of the camera type, we assume that the mixed spoofing feature is a combination of the camera invariant spoofing feature and the camera information, leading to a feature decomposition approach,

(5)

As the two sub-networks share the same camera classifier, the camera information in the mixed feature can be approximated by the camera feature learned in the upper sub-network,

(6)

As such, the camera invariant HF spoofing map can be acquired by subtraction,

(7)

In general, the desired feature should not be able to identify the camera type while being equipped with a strong capability in detecting spoofing. In particular, the loss that confuses the camera type (the decam loss) is defined as follows,

(8)
(9)

As such, the predicted probability of each camera type is desired to be equal to the average, indicating that the decomposed feature loses the capability of discriminating the camera, which facilitates camera-invariant feature learning.
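The sketch below illustrates the feature level decomposition together with one possible realization of the camera-confusion (decam) loss: the shared camera classifier applied to the decomposed feature is pushed towards a uniform distribution over the camera types, implemented here as a KL divergence to the uniform distribution. The paper's exact formulation in Eqns. (8)-(9) may differ; names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F


def decompose(mixed_feat: torch.Tensor, camera_feat: torch.Tensor) -> torch.Tensor:
    # Camera invariant spoofing feature = mixed feature - camera feature.
    return mixed_feat - camera_feat


def decam_loss(camera_logits_on_decomposed: torch.Tensor) -> torch.Tensor:
    """camera_logits_on_decomposed: (B, N, H, W), shared classifier applied
    to the decomposed feature."""
    n = camera_logits_on_decomposed.size(1)
    log_p = F.log_softmax(camera_logits_on_decomposed, dim=1)
    uniform = torch.full_like(log_p, 1.0 / n)
    # KL(uniform || p) is minimized when every camera type receives 1/N.
    return F.kl_div(log_p, uniform, reduction='batchmean')


# Usage:
f_mix, f_cam = torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14)
f_dec = decompose(f_mix, f_cam)
confusion = decam_loss(torch.randn(2, 5, 14, 14))  # shared classifier logits on f_dec
```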

Given the camera invariant feature map, the pooled feature can be acquired with an average pooling layer, and the face liveness prediction is obtained with two fully connected layers. For training the network, the binary (live vs. spoofing) focal loss is adopted again,

(10)

where the ground truth label and the predicted probability of face liveness are defined as before. The prediction from the first branch, which is used to detect spoofing, is based on the decomposed camera invariant feature.

It is worth mentioning that in this branch we adopt the camera type as the side information for camera invariant feature learning. In the training phase, we divide the face images in terms of camera types in the training set, then the pixel-wise multi-class cross-entropy loss is imposed for learning and extracting the camera feature. The per-pixel based loss is adopted with the assumption that the camera information existing in an image should be invariant to the acquired scene. Moreover, the number of camera categories for classification varies with the used training set. However, we do not attempt to employ the classification results in the testing phase, as totally different camera types from the training set may form the testing set. Instead, we only use the features learned in the second to last layer as the extracted camera information. Although the cameras in the testing set are unseen during training, the camera-specific information can also be successfully extracted. Subsequently, the camera feature will be decomposed from the learned mixed spoofing feature in the second sub-network, by which the camera invariant feature can be finally learned in the first branch.

III-D Feature Discrimination Capability Augmentation

The first branch aims to obtain camera invariant spoofing features in the HF domain, which inevitably ignores certain useful information in other frequency ranges, e.g., color and reflectance. To comprehensively obtain features extracted from both the high and low frequency domains for spoofing detection, the second branch is specifically learned. In this branch, to emphasize discriminative spoofing clues, we adopt the augmented image generated in the image pre-processing phase as the input. Again, after the last residual layer we acquire the augmented feature map, an average pooling layer is adopted to reduce the spatial dimension, and subsequently two fully connected layers with a softmax function predict the final result. The binary focal loss is also adopted here,

(11)

where the ground truth label and the predicted probability of face liveness are defined as in Eqn. (10).

III-E Spoofing Detection

In summary, six loss functions have been defined in the two branches and the total loss for training our network is a weighted sum of these loss functions,

(12)

where the weighting factors balance the contributions of the individual loss terms. In the testing phase, we use a weighted score to fuse the results from the two branches,

(13)

where the fusion weight is set empirically, and the two terms are the predicted probabilities of face liveness from the two branches.
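As an illustration only, the total objective and the test-time score fusion could be organized as below; the six loss terms, their weights, and the exact form of Eqn. (13) are placeholders, since only the qualitative description above is available.

```python
import torch


def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    # losses: the six terms of Eqn. (12), e.g. {'cam1': ..., 'spoof1': ...,
    # 'cam2': ..., 'decam': ..., 'spoof_dec': ..., 'spoof_aug': ...}.
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


def fuse_scores(p_invariant: torch.Tensor, p_augment: torch.Tensor,
                w: float) -> torch.Tensor:
    # Assumed form of Eqn. (13): the invariant-branch liveness probability
    # combined with a weighted augmentation-branch probability.
    return p_invariant + w * p_augment


# Usage: fuse per-sample liveness probabilities from the two branches.
score = fuse_scores(torch.tensor([0.9, 0.2]), torch.tensor([0.8, 0.4]), w=0.5)
```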

IV Experimental Results

Fig. 5: Illustrations of face samples in the four datasets. (a), (b) and (c) are images sampled from three camera types (low quality, normal quality and high quality) in the CASIA-FASD database [14]. From top to bottom are the live samples and the spoofing samples generated by printed photo, photo with eye cut and video replay. (d) are images sampled from the Replay-Attack database [1]. From top to bottom are the live sample and the spoofing samples generated by printed photo, screen photo and video replay. (e) and (f) are images sampled from two types of cameras (MacBook Air and Google Nexus 5) in the MSU-MFSD database [19]. From top to bottom are the live samples and the spoofing samples generated by printed photo and video replay with an iPad and an iPhone, respectively. (g)-(l) are images sampled from six mobile cameras (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual and OPPO N3) in the Oulu-NPU [50] database. From top to bottom are the live samples and the spoofing samples generated by printed photo and video replay.

IV-A Datasets

We evaluate our model on four face anti-spoofing datasets: CASIA-FASD [14], Replay-Attack [1], Oulu-NPU [50] and MSU-MFSD [19]. The descriptions of the four databases are shown in Table I.

CASIA-FASD: The CASIA-FASD database was published in 2012. In this database, 50 subjects are included. For each subject, 3 live and 9 fake video clips are provided. There are three attack types in this database including warped photo attack, cut photo attack and video replay attack. The cameras can be categorized into three different quality levels including low quality (web camera), normal quality (web camera) and high quality (Sony NEX-5). Among the 600 video clips, 240 are used for the training set and the rest are used for testing.

Replay-Attack: This database also contains 50 subjects. The total number of videos is 1,200, consisting of 360 videos (60 real-accesses and 300 attacks) in the training set, 360 videos (60 real-accesses and 300 attacks) in the development set and 480 videos (80 real-accesses and 400 attacks) in the testing set. Two types of attack, print attack and replay attack, are performed in this database. For the print attack, the face images are printed on high resolution A4 paper. For the replay attack, two devices are adopted, an iPhone 3GS and a first generation iPad. Different illumination conditions are studied in this database. The first is a controlled condition, where a fluorescent lamp is used for lighting with a uniform background. The other condition is adverse, where only daylight illuminates the scene. All videos in this database are recorded by the webcam of a MacBook.

Oulu-NPU: This database was published in 2017 and consists of 4950 video clips. These videos are divided into three sets: 1) the training set, which contains 360 real and 1440 attack videos of 20 subjects; 2) the development set, which contains 270 real and 1080 attack videos of 15 subjects; 3) the testing set, which contains 360 real and 1440 attack videos of 20 subjects. Six cameras are used for recording in this database. The attack types implemented in this database are print and video replay, generated by two types of printers and two types of display devices, respectively. Four testing protocols are considered to evaluate environmental condition variations, spoofing medium variations, camera variations, and the fusion of all of the above challenges.

MSU-MFSD: Two cameras of different resolutions are considered in this database: 1) the built-in camera of the MacBook Air 13-inch, and 2) the front-facing camera of the Google Nexus 5. The database contains 35 subjects and 280 videos in total. Two attack types are performed, printed photo and video replay. For the printed photo attack, a Canon 550D camera is used for high resolution face image acquisition and the photo is printed on A3 paper with an HP color printer. Two cameras are used for generating the replay attacks, an SLR camera and the iPhone 5S back-facing camera.

The sample face images of the four datasets, including CASIA-FASD, Replay-Attack, MSU-MFSD and Oulu-NPU, are shown in Fig. 5. Both live samples and spoofing samples generated by different attack types are presented. It should be noted that there is no subject overlap between the training and testing sets, implying that we test our model on unseen subjects in both intra-dataset and cross-dataset settings.

Database | Year | Identities | Genuine/Attack Samples | Attack Types | Capture Cameras | Display Devices
CASIA-FASD [14] | 2012 | 50 | 150/450 | 1 Print, 2 Replay | Low-quality webcam, normal-quality webcam, Sony NEX-5 | iPad
REPLAY-ATTACK [1] | 2012 | 50 | 200/1000 | 1 Print, 2 Replay | MacBook webcam | iPhone 3GS, iPad
MSU-MFSD [19] | 2015 | 35 | 110/330 | 1 Print, 2 Replay | MacBook Air webcam, Nexus 5 | iPad Air, iPhone 5S
OULU-NPU [50] | 2017 | 55 | 1980/3960 | 2 Print, 2 Replay | Samsung Galaxy S6 edge, OPPO N3, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual | Dell 1905FP, MacBook Retina
TABLE I: Descriptions of the four databases.

IV-B Implementation Details

We implement our model in PyTorch. For each database, we uniformly sample 30 frames from each video. To eliminate background variations, we crop the face region from each frame with the face detection algorithm in [45]. As our proposed method is a camera based method, we divide each database according to camera types: CASIA-FASD is divided into 3 sub-databases and MSU-MFSD into 2 sub-databases. As there is only a single camera in Replay-Attack, this database is not divided. The cropped face images are resized to 224×224. Considering the imbalance of positive and negative samples, we sample live and spoofing faces at a ratio of 1:1 within each batch during training. We also use data augmentation for better generalization capability, including horizontal and vertical flipping with a probability of 0.5, random rotation within 15 degrees, and color jitter implemented with the Transforms module in PyTorch. In Table II, we show the layer-wise network design of our proposed method. It is worth mentioning that we replace batch normalization layers with group normalization layers [51], with the group number set to 32 in each residual layer to make our network more stable.
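A sketch of these augmentation and normalization choices is given below, assuming standard torchvision transforms (the jitter strengths are our assumptions) and a recursive replacement of batch normalization by 32-group group normalization.

```python
import torch.nn as nn
from torchvision import transforms

# Training-time augmentation (flips with probability 0.5, rotation within
# 15 degrees, color jitter); the jitter strengths below are assumptions.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])


def bn_to_gn(module: nn.Module, num_groups: int = 32) -> nn.Module:
    # Recursively replace BatchNorm2d with GroupNorm (32 groups), as done in
    # the residual layers for more stable training.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            bn_to_gn(child, num_groups)
    return module
```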

The batch size in the training phase is 32 and we adopt the Adam optimizer. The learning rate begins at 0.004 for the first 20k steps and decays by a factor of 0.2 after every 10k steps. As the Oulu-NPU database contains many more samples than the other databases, we set the learning rate to 0.004 for the first 30k steps and further decay it by a factor of 0.2 after every 20k steps. We set the hyper-parameters to 0.5, 1.0, 4.0, 0.005, 5.0, 0.1, and 0.7 for all experiments. Five metrics are adopted for performance evaluation, including:

1) Equal Error Rate (EER). We use EER for intra-database evaluation and the development set is used for EER threshold selection.

2) Half Total Error Rate (HTER). HTER is the average of False Acceptance Rate (FAR) and False Rejection Rate (FRR). It is defined as follows:

HTER = (FAR + FRR) / 2. (14)

3) Attack Presentation Classification Error Rate (APCER). For a given Presentation Attack Instrument (PAI), it is defined as follows:

APCER_PAI = 1 - (1 / N_PAI) Σ_{i=1}^{N_PAI} Res_i, (15)

where N_PAI is the number of attack attempts of the PAI, and Res_i is 1 if the i-th attempt is predicted as 'attack' and 0 if predicted as 'live'. From the definition, the reported APCER only considers the worst performance among different PAIs, which means it penalizes approaches that only perform well on certain types of attacks.

4) BonaFide Presentation Classification Error Rate (BPCER):

BPCER = (Σ_{i=1}^{N_BF} Res_i) / N_BF, (16)

where N_BF is the total number of bona fide presentations, and Res_i is 1 if the i-th bona fide presentation is predicted as 'attack' and 0 otherwise.

5) Average Classification Error Rate (ACER). It is the average of APCER and BPCER:

ACER = (APCER + BPCER) / 2. (17)

The APCER, BPCER and ACER are the standardized ISO/IEC 30107-3 metrics in [52].
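For concreteness, the four metrics can be computed as in the following sketch, which follows the standard ISO/IEC 30107-3 definitions; variable names are ours.

```python
from typing import List


def hter(far: float, frr: float) -> float:
    return (far + frr) / 2.0


def apcer(attack_predictions: List[int]) -> float:
    # attack_predictions: 1 if an attack attempt of a given PAI is classified
    # as 'attack', 0 if classified as 'live'.
    return 1.0 - sum(attack_predictions) / len(attack_predictions)


def bpcer(bonafide_predictions: List[int]) -> float:
    # bonafide_predictions: 1 if a bona fide presentation is classified as
    # 'attack', 0 if classified as 'live'.
    return sum(bonafide_predictions) / len(bonafide_predictions)


def acer(apcer_value: float, bpcer_value: float) -> float:
    return (apcer_value + bpcer_value) / 2.0


# Usage: one PAI with 4 attacks (3 detected) and 5 bona fide samples (1 rejected).
print(apcer([1, 1, 1, 0]), bpcer([0, 0, 1, 0, 0]), acer(0.25, 0.2))
```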

IV-C Comparison with the State-of-the-Art Methods

To demonstrate the high generalization capability of our method, we first perform cross-database experiments on the above-mentioned four databases and compare our results with the state-of-the-art methods. We adopt seven cross-test settings in this experiment: CASIA-FASD → Replay-Attack, CASIA-FASD → MSU-MFSD, MSU-MFSD → Replay-Attack, MSU-MFSD → CASIA-FASD, (CASIA-FASD & MSU-MFSD) → Replay-Attack, (MSU-MFSD & Replay-Attack) → CASIA-FASD, and (CASIA-FASD & Replay-Attack) → MSU-MFSD. For simplicity, we denote these seven settings as C→R, C→M, M→R, M→C, (C&M)→R, (M&R)→C, and (C&R)→M. The R→C and R→M settings are not involved, as the camera classification cannot be performed when only one camera type exists in the Replay-Attack database. The comparison results are reported in Table III.

As can be seen, we compare our method with 14 classical methods, including both hand-crafted feature based methods and CNN based methods. It should be mentioned that for CNN based methods, classical CNN models are involved, including the original ResNet-18 [15] for binary classification, CNN [5], ResNet18+TripletLoss provided in [39], the DeepPixel [53] method, and Auxiliary [9] with the depth map only, which is denoted as Auxiliary (depth only). From Table III, we can find that our method achieves the lowest HTER in most cross-database settings except for M→C and C→M. Regarding the M→C and C→M settings, our method achieves the second best HTER results, which are competitive with the state-of-the-art models (27.3% vs. 24.3% and 17.5% vs. 14.0%). Moreover, for C→R, we achieve an HTER more than 10% lower than the state-of-the-art method "Auxiliary", which demonstrates the effectiveness of our method. It is worth noting that although the TripletLoss is utilized in [39] to reduce the intra-class distance and enlarge the inter-class distance, it still tends to over-fit and cannot generalize well to different databases. This suggests that our feature decomposition scheme is more suitable for the face anti-spoofing task, as some meaningful spoofing clues may be discarded due to the distance constraint imposed by the TripletLoss. In Table III, we also show the HTER results of the two branches under different settings. It is obvious that although using only the second branch cannot achieve the desired performance, it provides complementary information to the first branch, and the results can be further improved by fusing the two branches, which reveals that the results of the two branches can be effectively leveraged by our fusion strategy.

Layer name | Input size | Kernel size, Channels | Output size
Feature Invariant Branch
EDDF | 224×224 | [3×3, 24] | 224×224
Conv (after EDDF) | 224×224 | [5×5, 64] | 224×224
Residual Blocks: Conv | 224×224 | [7×7, 64], stride=2 | 112×112
Pooling | 112×112 | 3×3 Maxpooling, stride=2 | 56×56
Residual Block1 | 56×56 | - | 56×56
Residual Block2 | 56×56 | - | 28×28
Residual Block3 | 28×28 | - | 14×14
Camera Classifier | 14×14 | [3×3, Number of Cameras] | 14×14
Feature Discrimination Augmentation Branch
Augmentation Conv | 224×224 | [3×3, 3] | 224×224
Residual Blocks: Conv | 224×224 | [7×7, 64], stride=2 | 112×112
Pooling | 112×112 | 3×3 Maxpooling, stride=2 | 56×56
Residual Block1 | 56×56 | - | 56×56
Residual Block2 | 56×56 | - | 28×28
Residual Block3 | 28×28 | - | 14×14
Average Pooling | 14×14 | - | 1×1
Binary Classification (three heads)
FC1 (512, 128), ReLU, FC2 (128, 2) for each of the three binary classification heads
TABLE II: Architecture of the network in the proposed method.

Furthermore, we also evaluate the performance influenced by different fusion weighting parameters on cross-database testing (C→R). As shown in Fig. 6, we can find that the performance drops if the fusion weight is set to an inappropriate value (extremely large or small). The best value is around 0.5, which is half of the weight (1.0) of the first branch. This demonstrates that the camera invariant HF spoofing clues learned by the first branch are more important than the spoofing features learned by the second branch when the testing data exhibit a large distribution divergence against the training database.

Method | C→R | M→R | M→C | C→M | C&M→R | C&R→M | M&R→C
LBP [1] | 47.0 | 45.5 | - | - | - | - | -
LBP-TOP [22] | 49.7 | 46.5 | - | - | - | - | -
Motion [54] | 50.2 | - | - | - | - | - | -
CNN [5] | 48.5 | 37.1 | 37.8 | 26.3 | 29.3 | 21.2 | 37.2
Color LBP [26] | 37.9 | 44.8 | 45.7 | 21.0 | - | - | -
Color Tex. [26] | 30.3 | 33.9 | 46.0 | 20.4 | - | - | -
Color SURF [26] | 26.9 | 29.7 | 24.3 | 19.1 | - | - | -
ResNet18 [15] | 47.0 | 47.9 | 45.8 | 36.4 | 35.3 | 27.6 | 40.7
ResNet18+TripletLoss [39] | 43.3 | 11.5 | 37.8 | 14.0 | 27.3 | 23.8 | 43.3
Auxiliary [9] | 27.6 | - | - | - | - | - | -
Auxiliary (depth only) [9] | 29.1 | 26.7 | 44.5 | 36.1 | 29.7 | 16.6 | 35.5
DeepPixel [53] | 41.5 | 21.9 | 36.0 | 40.3 | 38.9 | 21.6 | 33.1
De-Spoof [17] | 28.5 | - | - | - | - | - | -
Invariant Branch (Ours) | 19.4 | 25.8 | 32.6 | 21.6 | 24.6 | 15.0 | 31.2
Augmentation Branch (Ours) | 31.2 | 28.1 | 36.3 | 31.9 | 29.3 | 18.6 | 37.8
Fusion (Ours) | 17.6 | 21.7 | 27.3 | 17.5 | 21.3 | 14.8 | 32.3
TABLE III: Cross-test results (HTER %) on CASIA-FASD, Replay-Attack, and MSU-MFSD. "-" indicates that the corresponding result is unavailable. The numbers in bold are the best results.

To further verify the effectiveness of our method, we also perform intra-database experiments on the CASIA-FASD database. The results are shown in Table IV, from which we can find that our method achieves the best EER result (0.89%), demonstrating that our method does not sacrifice intra-database accuracy even though our main target is to improve the generalization capability. For further analysis of the performance of our method under different attack types, we show the error rates of three attack types, Video Replay (VR), Print Photo (PP) and cut Photo Mask (PM), in Fig. 7. We can find that our method achieves lower error rates when the attack type is 'VR' or 'PP'. However, although the 'PM' attack also uses printed paper as the attack medium, our method cannot detect this attack type perfectly. The potential reason is that the live face parts (eyes and mouth) present in the input image may contaminate the spoofing features.

Fig. 6: The EER (%) results influenced by the fusion weighting parameter in Eqn. (13) under the cross-database (C→R) setting.
Fig. 7: The error rate of the falsely accepted attack types on the CASIA-FASD database. 'VR' means Video Replay, 'PP' indicates the Printed Photo and 'PM' means a mask made from the printed photo with the regions of the eyes and mouth cut out.
Fig. 8: Variations of the EER (%) results influenced by the fusion weighting parameter in Eqn. (13) under the intra-database (CASIA-FASD) setting.

Analogously, to investigate the influence of the fusion parameter in Eqn. (13) on the performance, we set different weightings for the fusion of the two branches, and the results are shown in Fig. 8. From Fig. 8, we can find a similar phenomenon: the EER results become higher if the weight is set too small or too large, and the best performance is achieved when it is set to 0.8. This indicates that both branches in our model are necessary, and provides useful evidence on the effectiveness of our fusion strategy.

Then we conduct intra-database experiments on Oulu-NPU. Four protocols are provided in the OULU-NPU database. Protocol 1 is used to evaluate the generalization capability under different environments. Protocol 2 is designed for performance evaluation on unseen attack types, including video replay and printed photo. Protocol 3 is the most relevant to our method, as it is used to evaluate how well a model can generalize to unseen camera types. Six camera types are contained in this protocol, in which five cameras are used for training and the remaining one for testing. Protocol 4 is the most challenging protocol, as different environments, attack mediums and cameras are all considered. We report our experimental results in Table V.

From the table, we observe that our method achieves the best ACER in Protocol 3 with the lowest variance across different cameras, which indicates that our method generalizes well to unseen camera modules. Moreover, our method achieves superior results compared with "GRADIANT" [55], which only uses binary labels for supervision. In Protocols 2 and 4, we achieve competitive performance compared with the "Auxiliary" and "DeSpoofing" methods, which also use auxiliary depth information for training. It is worth noting that our method achieves the lowest variance in Protocol 3 as well as Protocol 4, demonstrating the promising robustness of our method.

Method EER(%)
IQA [24] 32.4
Motion [54] 26.6
LBP [1] 18.2
LBP-TOP [22] 10.0
LBP+SVM [56] 4.9
CNN [5] 7.4
MSR-RESNET [57] 3.1
Color SURF [26] 2.2
ResNet18 [15] 5.2
DeepPixel [53] 2.6
ResNet18+TripletLoss [39] 1.4
Auxiliary (depth only)[9] 1.3
CamInva Branch (Ours) 2.3
Augmentation Branch (Ours) 2.9
Fusion (Ours) 0.89
TABLE IV: Intra-test results on CASIA-FASD. The numbers in bold are the best results.
Prot. | Methods | APCER (%) | BPCER (%) | ACER (%)
1 CPqD [55] 2.9 10.8 6.9
GRADIANT [55] 1.3 12.5 6.9
Auxiliary [9] 1.6 1.6 1.6
DeSpoofing [17] 1.2 1.7 1.5
Ours 3.8 2.9 3.4
2 MixedFASNet [55] 9.7 2.5 6.1
GRADIANT [55] 3.1 1.9 2.5
Auxiliary [9] 2.7 2.7 2.7
DeSpoofing [17] 4.2 4.4 4.3
Ours 3.6 1.2 2.4
3 MixedFASNet [55] 5.3±6.7 7.8±5.5 6.5±4.6
GRADIANT [55] 2.6±3.9 5.0±5.3 3.8±2.4
Auxiliary [9] 2.7±1.3 3.1±1.7 2.9±1.5
DeSpoofing [17] 4.0±1.8 3.8±1.2 3.6±1.6
Ours 3.8±1.3 1.1±1.1 2.5±0.8
4 Massy HNU [55] 35.8±35.3 8.3±4.1 22.1±17.6
GRADIANT [55] 5.0±4.5 15.0±7.1 10.0±5.0
Auxiliary [9] 9.3±5.6 10.4±6.0 9.5±6.0
DeSpoofing [17] 5.1±6.3 6.1±5.1 5.6±5.7
Ours 5.9±3.3 6.3±4.7 6.1±4.1
TABLE V: Experimental results of the four protocols on the OULU-NPU database.

IV-D Ablation Study

In this subsection, to reveal the functionalities of different modules in the proposed method, we perform an ablation analysis. The experiments are conducted under both intra-database (CASIA-FASD) and cross-database (training on CASIA-FASD and testing on Replay-Attack) settings. The results are shown in Table VI, in which the feature invariant branch is denoted as the first branch and the feature discrimination augmentation branch is denoted as the second branch. Moreover, Cam1, Cam2 and Cam3 correspond to the low quality, normal quality and high quality cameras in the CASIA-FASD dataset. To identify the effectiveness of the two branches, we show the experimental results of the two branches respectively without fusion, and a T-SNE visualization of the feature learned in the first branch is shown in Fig. 2(b). Subsequently, we remove the EDDF from each branch to evaluate its functionality. More specifically, when evaluating each branch without EDDF, we adopt the same residual architecture in each branch except that the input image is not processed by the EDDF, and the number of channels of the following CNN layer is three. Another important factor is the feature decomposition operation. To verify its importance, we ablate the camera classification sub-network in the first branch, such that the results of the second sub-network and the second branch are fused for comparison. We denote the corresponding results as Fusion w/o CamID in Table VI.

Methods | Intra Testing (EER %): CASIA-FASD | Cross Testing (HTER %): CASIA-FASD → Replay-Attack
1st branch Cam1: 2.5
Cam2: 1.8
Cam3: 2.6
Avg.: 2.3 19.4
1st branch w/o EDDF Cam1: 2.9
Cam2: 2.4
Cam3: 3.1
Avg.: 2.8 24.8
2nd branch Cam1: 2.1
Cam2: 4.9
Cam3: 1.6
Avg.: 2.9 31.2
2nd branch w/o EDDF Cam1: 3.3
Cam2: 7.0
Cam3: 2.5
Avg.: 4.3 32.1
Fusion w/o CamID Cam1: 2.4
Cam2: 1.2
Cam3: 3.1
Avg.: 2.2 29.3
Fusion Cam1: 0.7
Cam2: 1.8
Cam3: 0.2
Avg.: 0.9 17.6
TABLE VI: Ablation studies of our method on the CASIA-FASD and Replay-Attack databases.

From the EER results under the CASIA-FASD intra-database settings in Table VI, we can find that the variance of the detection results across cameras is significantly reduced in the first branch compared with the second branch, demonstrating that more generalized spoofing features can be learned in the first branch. In addition, the performance of the first branch on Cam1 and Cam3 is degraded compared with the second branch, due to the fact that only the HF feature is utilized in the first branch. However, a promising performance improvement (31.2 vs. 19.4) by the first branch is observed when testing cross-database (training on CASIA-FASD and testing on Replay-Attack). This phenomenon is reasonable, as more database-dependent information (e.g., environment, lighting) is used in the second branch, and such information causes the learned features to over-fit the training database, resulting in a negative effect when testing on a different database. Moreover, we find that ablating the EDDF in the first branch causes a performance drop under both intra-database and cross-database settings, demonstrating that the EDDF makes this branch pay attention to the HF components of the input image. Similar results can be observed when the EDDF is ablated in the second branch. With the EDDF, lower EER and HTER values can be acquired, as more discriminative parts can be augmented before being processed by the CNN layers. As such, it can be concluded that the EDDF is necessary for both branches. From Table VI, we can also find that the PAD performance is largely improved when we fuse the results of the two branches, outperforming each individual branch. This implies that the learned features of the two branches are complementary and that their results are well balanced by the weighted fusion strategy.

IV-E Feature Visualizations

To better understand the spoofing features learned by the two branches of our proposed method, we train the model on the CASIA-FASD database and visualize the feature maps of each sub-network. Moreover, to identify which components are augmented in the second branch, we also visualize the input map of the second branch. The results are shown in Fig. 9.

As we can see, in the second row, the contrast and details of the input images are augmented. Taking the augmented image as the input of the second branch, the last row shows the learned feature maps, from which we can find that a large difference exists between the feature maps of live and spoofing face images, especially in the regions of the nose and mouth. The third row shows the maps learned for camera type classification. In this row, although the face images under the same camera type are different (live vs. spoofing), similar camera patterns can still be extracted, and the patterns vary across camera types. In the fourth row, the spoofing feature maps learned in the second sub-network are shown. From these maps, we can find that more noise exists in the spoofing face images compared with the live samples. However, the noise patterns differ among different cameras, as the camera information mixed with the spoofing patterns causes the domain gap among different cameras.

Fig. 9: Visualization results of the learned feature maps. The model is trained on CASIA-FASD database. In the first row, from left to right are the input live and spoofing face images sampled from CASIA-FASD database (including three camera types: low quality camera, normal quality camera and high quality camera) and Replay Attack database. The second row shows the images augmented in the second branch. The sampled feature maps of the two subnetworks in the “Feature Invariant Branch” are shown in the following two rows respectively and the last row presents one of the features learned in the “Feature Discrimination Augmentation Branch”.

IV-F Investigations on the Robustness of Unknown Camera Feature Extraction

In the feature invariant branch of our model, we use the camera feature extracted by the first sub-network as side information to guide the model to remove the camera information incorporated in the spoofing detection feature, leading to camera invariant feature extraction for spoofing detection. In this sense, the camera features extracted from the testing set should be specific and different from the camera features extracted from the training set (as the camera types in the testing set are not identical to those of the training set). To verify this assumption, we provide the visualization of camera features extracted from the training set and testing set in Fig. 10 (an equal number of samples from each dataset is randomly selected for better visualization). More specifically, the number of camera types in the training set ranges from 2 to 5.

Fig. 10: T-SNE [18] visualization of the camera features. C1, C2, C3, M1, M2 and R represent the six different camera types in CASIA-FASD, MSU-MFSD and Replay-Attack datasets.

As shown in Fig. 10, we use different numbers of camera types for camera classifier training, and the results show that our camera classifier can extract the camera-specific features of unseen cameras in the testing set even when the number of camera types in the training set is limited (e.g., 2 or 3). Besides, we can also find that the camera classification capability improves (corresponding to more compact clusters for each camera feature) when the number of camera types for training increases. However, even though discriminative camera features can be efficiently extracted by the camera classifier, it cannot be fully guaranteed that the features do not carry other forms of noise that are not considered in the training data. To improve the robustness of camera feature extraction, we further explore the robustness of unknown camera feature extraction and conduct more studies on probability based unknown camera classification with a self-attention based strategy for feature refining.

Unknown camera classification. In this step, we study the "N + 1" camera classifier, where N is the number of camera categories in the training set. In the testing phase, if a sample is classified into the (N + 1)-th category, the sample is treated as an image captured by an unknown camera. Towards the implementation of the "N + 1" camera classifier, we adopt the softmax probability distribution as the category descriptor, which is widely used in the field of Out-of-Distribution (OOD) detection [58, 59, 60]. To be specific, we assume the camera classifier is trained on the N camera categories of the training set and tested on a testing set that additionally contains unknown camera types. We first show the average softmax probability of each camera type for testing in Fig. 11. Furthermore, we also visualize the probability distribution under the intra-dataset setting for comparison in Fig. 12. From the figures, we can observe that the probability distributions of the camera types in the training set differ significantly from the distributions of the unknown camera types. Inspired by this observation, to transfer the trained classifier into an "N + 1" classifier, a distribution descriptor is created. More specifically, for each sample in the testing set, we use the difference between the maximum classification probability and the second maximum probability predicted among the N categories to infer the probability that the sample belongs to the (N + 1)-th category,

(18)

where the predicted camera category of the sample is the one with the maximum probability, and the reference parameter can be calculated from the training set. To be specific, for each sample in the training set, if its maximum classification probability exceeds a given threshold, this sample is selected and the corresponding distribution descriptor is obtained as follows:

(19)

The final reference value is selected as the minimum value among the descriptors of all selected samples. Then the normalized probability that a testing sample belongs to category m (m = 1, ..., N + 1) can be acquired as follows:

(20)

To demonstrate the performance of our "N + 1" classifier on the testing set, we average the probability that each sample is classified into the (N + 1)-th category by camera type. The results are shown in Fig. 13. Besides, the intra-testing results on the CASIA-FASD dataset are also visualized in Fig. 14 for comparison. It is apparent that the unknown cameras are classified into the (N + 1)-th category with a high probability (above 0.85) while the known cameras are rejected by this category, which reveals the effectiveness of the "N + 1" classifier.
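A heavily hedged sketch of this "N + 1" inference is given below: the descriptor is the gap between the largest and second-largest softmax probabilities, the reference margin is the minimum gap among confidently classified training samples, and a simple ratio is assumed for the normalized probability of Eqn. (20), whose exact form is not reproduced here.

```python
import torch


def margin_descriptor(probs: torch.Tensor) -> torch.Tensor:
    """probs: (B, N) softmax outputs of the N-way camera classifier."""
    top2 = probs.topk(2, dim=1).values
    return top2[:, 0] - top2[:, 1]


def training_margin(train_probs: torch.Tensor, conf_thresh: float = 0.9) -> float:
    # Keep confidently classified training samples and take the minimum gap.
    confident = train_probs[train_probs.max(dim=1).values > conf_thresh]
    return margin_descriptor(confident).min().item()


def unknown_probability(test_probs: torch.Tensor, margin: float) -> torch.Tensor:
    # Gaps larger than the training margin suggest a known camera; smaller
    # gaps suggest the (N + 1)-th, unknown-camera category (assumed ratio form).
    gap = margin_descriptor(test_probs)
    return (1.0 - gap / margin).clamp(0.0, 1.0)
```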

Fig. 11: Visualization of the average probability of camera-specific samples in the N + 1 category classification, where N is the number of camera categories in the training set. The datasets used for training are annotated under each sub-figure and all samples in the six datasets are used for testing. The probability corresponds to the probability that the testing sample is classified into the corresponding category.
Fig. 12: Visualization of the average probability of camera-specific samples in the N + 1 category classification, where N is the number of camera categories in the training set. The samples used for training are a portion of the CASIA-FASD dataset and the rest of the samples are used for testing. Again, the probability corresponds to the probability that the testing sample is classified into the corresponding category.
Fig. 13: Visualization of the average probability of classifying the testing samples into the (N + 1)-th category. The datasets used for training are provided under each sub-figure and all samples in the six datasets are used for testing.
Fig. 14: Visualization of the average probability of classifying the testing samples into the (N + 1)-th category. The samples used for training are a portion of the CASIA-FASD dataset and the rest of the samples are used for testing.

Unknown camera feature refining. After the "N + 1" camera classification is finished, attention based unknown camera feature refining is conducted when the testing camera is classified into the (N + 1)-th category (i.e., an unknown camera).

Firstly, we use the normalized probability acquired in the above step to examine the affiliation of the extracted camera features with respect to each of the N + 1 classes as follows:

(21)

where the final category is obtained by comparing all probabilities and selecting the maximum one. When the final category is the (N + 1)-th one, the image is regarded as captured by an unknown camera. In this scenario, we aim to further improve the detection accuracy based on the philosophy of re-weighting the branches and refining the features.

1) In the weighted fusion, we increase the weight of the results acquired by the invariant branch from 0.7 to 1.0, due to the higher generalization ability of the invariant branch compared with the augmentation branch.

2) The extracted camera feature is refined, since unseen noise may be introduced by the unknown camera. For simplicity, we denote the composite feature and the camera feature in Fig. 3 by F and F_c, respectively. The composite feature F is a combination of the camera invariant spoofing feature and the camera feature. When F_c is aligned with the camera information contained in F, the camera invariant spoofing feature can be acquired. To reduce the noise injected into F_c, an attention based strategy is proposed. More specifically, as shown in Fig. 15, we adopt a self-attention module to measure the spatial attention map of F_c as follows:

A_{(i,j),(k,l)} = exp( S_{(i,j),(k,l)} ) / Σ_{(k',l')} exp( S_{(i,j),(k',l')} ),        (22)

where h and w indicate the spatial size of F_c, i and j are the indexes of row and column, respectively, and (k, l) enumerates all spatial positions. The similarity matrix S is given by

S = reshape(F_c)^T × reshape(F_c),        (23)

where the “×” operation represents matrix multiplication and reshape(·) flattens the h × w spatial positions of F_c. In particular, for each index in the spatial dimension, we calculate the normalized weights of itself and the other indices to synthesize the refined camera feature F'_c as follows:

F'_c(i, j) = Σ_{(k,l)} A_{(i,j),(k,l)} · F_c(k, l).        (24)

Based on the refined camera feature F'_c, the feature subtraction is performed, and the binary classification result is finally obtained. To verify the refining strategy, we compare the performance of our model with and without (W/O) unknown camera feature refining on different cross-dataset settings. The results are shown in Table VII. From the table, we can observe that the unknown camera feature refining strategy benefits the performance on most of the cross-dataset settings, which demonstrates that the features can be further enhanced to improve the effectiveness of our method.
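The following minimal NumPy sketch illustrates the attention based refining described above (cf. Fig. 15) together with the branch re-weighting from item 1). The dot-product similarity, the softmax normalization, and the helper names are assumptions based on our reconstruction of Eqs. (22)-(24), not the authors' exact implementation.

```python
import numpy as np

def refine_camera_feature(F_c):
    """Self-attention refinement of a camera feature map F_c with shape (h, w, c)."""
    h, w, c = F_c.shape
    X = F_c.reshape(h * w, c)                    # flatten the spatial positions
    S = X @ X.T                                  # pairwise similarity, cf. Eq. (23) (reconstructed)
    S = S - S.max(axis=1, keepdims=True)         # numerical stability for the softmax
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # attention map, cf. Eq. (22)
    refined = A @ X                              # weighted sum over positions, cf. Eq. (24)
    return refined.reshape(h, w, c)

def camera_invariant_feature(F, F_c, unknown_camera=True):
    """Subtract the (optionally refined) camera feature from the composite feature F."""
    if unknown_camera:                           # i.e., the sample falls into the (N+1)-th category
        F_c = refine_camera_feature(F_c)
        w_invariant = 1.0                        # invariant-branch weight raised from 0.7 to 1.0
    else:
        w_invariant = 0.7
    return F - F_c, w_invariant

# Toy usage on random 4x4x8 feature maps.
rng = np.random.default_rng(0)
F, F_c = rng.normal(size=(4, 4, 8)), rng.normal(size=(4, 4, 8))
spoof_feature, weight = camera_invariant_feature(F, F_c)
print(spoof_feature.shape, weight)               # (4, 4, 8) 1.0
```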

Fig. 15: Illustration of the attention based unknown camera feature refining. Herein, h, w and c represent the height, width, and number of channels respectively.
Method                               C→R    M→R    M→C    C→M    C&M→R   C&R→M   M&R→C
Ours (W/O Unknown Camera Refining)   17.6   21.7   27.3   17.5   21.3    14.8    32.3
Ours (Unknown Camera Refining)       17.0   21.0   27.1   13.8   19.9    15.3    30.7
TABLE VII: Performance (HTER %) comparison between our models with and without (W/O) unknown camera feature refining on different cross-dataset settings.
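For reference, the HTER values in Table VII follow the standard definition of the half total error rate, i.e., the average of the false acceptance and false rejection rates. The snippet below is a minimal illustration; the threshold-selection policy (typically fixed on a development set) is an assumption, not part of the reported protocol.

```python
import numpy as np

def hter(live_scores, spoof_scores, threshold):
    """Half Total Error Rate: average of FAR and FRR at a given decision threshold."""
    far = np.mean(np.asarray(spoof_scores) > threshold)   # spoofs accepted as live
    frr = np.mean(np.asarray(live_scores) <= threshold)   # live faces rejected
    return 0.5 * (far + frr)

# Toy usage: higher score means more likely to be live.
print(hter(live_scores=[0.9, 0.8, 0.4], spoof_scores=[0.1, 0.6, 0.2], threshold=0.5))  # ~0.333
```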

IV-G Investigations in Re-compression Scenarios

Compression artifacts are injected when video clips are represented in a compact format, which negatively affects spoofing detection. Moreover, with the rapid development of cloud computing, face anti-spoofing algorithms are also expected to be deployed on cloud platforms, where compression/re-compression can significantly reduce the bandwidth required to transmit the video stream to the cloud. As such, evaluating the spoofing detection performance on compressed/re-compressed videos is important and essential.

Herein, we conduct performance comparisons in scenarios where the testing videos are further re-compressed with different codecs based on the HEVC standard (x265) [61, 62] and the H.264/AVC standard (x264) [63, 64]. Three databases are adopted in the performance comparisons, including CASIA-FASD, MSU-MFSD and REPLAY-ATTACK. Besides the different compression standards explored, multiple quantization parameters (QP) are also selected, ranging from high to low bit rate (QP=17 to QP=42). The performance comparisons are shown in Tables VIII and IX. In particular, it is apparent that our proposed method achieves better detection performance across different QP values. However, it is generally acknowledged that compression artifacts may influence the detection accuracy. For example, the detection accuracy decreases as the QP increases, since important clues for spoofing detection can be distorted during compression. Herein, we also show the unknown camera feature refining results in Tables VIII and IX, in the rows denoted “Ours (Unknown Camera Refining)”. We can observe that the performance in terms of HTER (%) is further improved. Even in the high-QP scenarios, the refining strategy remains effective, which reveals the robustness of our scheme.
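The re-compression step can be scripted, for instance with ffmpeg, as in the sketch below. It assumes an ffmpeg build with libx264 and libx265; the file names and the audio-stripping flag are placeholders for illustration and are not taken from the actual evaluation pipeline.

```python
import subprocess
from pathlib import Path

def recompress(src: Path, dst: Path, codec: str = "x265", qp: int = 32) -> None:
    """Re-encode a test video at a fixed quantization parameter (QP)."""
    if codec == "x265":
        codec_args = ["-c:v", "libx265", "-x265-params", f"qp={qp}"]
    elif codec == "x264":
        codec_args = ["-c:v", "libx264", "-qp", str(qp)]
    else:
        raise ValueError(f"unsupported codec: {codec}")
    # "-an" drops the audio stream; only the video frames are needed for spoofing detection.
    subprocess.run(["ffmpeg", "-y", "-i", str(src), *codec_args, "-an", str(dst)], check=True)

# Example: regenerate one QP ladder (17-42, matching Tables VIII and IX) for a hypothetical clip.
for qp in (17, 22, 27, 32, 37, 42):
    recompress(Path("test_video.avi"), Path(f"test_video_x265_qp{qp}.mp4"), codec="x265", qp=qp)
```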

Configuration Methods Original QP17 QP22 QP27 QP32 QP37 QP42
CASIA Intra LBP+SVM [56] 7.5 10.8 16.2 20.6 16.9 22.2 28.8
Resnet18 [15] 5.2 7.6 8.3 8.1 8.7 9.5 10.9
DeepPixel [53] 2.6 3.2 3.0 3.5 3.7 5.2 4.4
Auxiliary (depth only) [9] 1.3 2.0 1.8 2.9 3.9 7.3 9.4
Ours 0.9 1.3 1.9 1.6 2.8 3.4 4.2
C→R LBP+SVM [56] 28.9 28.6 32.6 40.3 44.5 42.8 46.8
Resnet18 [15] 47.0 47.6 47.8 48.5 49.6 49.1 50.0
DeepPixel [53] 41.5 38.2 37.6 36.2 36.2 36.9 36.7
Auxiliary (depth only) [9] 28.1 26.2 25.6 28.4 32.5 37.8 43.6
Ours 17.6 18.7 18.7 18.9 20.6 22.8 20.2
Ours (Unknown Camera Refining) 17.0 17.2 16.8 16.4 18.7 21.9 19.9
C→M LBP+SVM [56] 48.1 48.5 48.8 49.3 48.4 45.7 47.4
Resnet18 [15] 36.4 37.2 36.0 36.0 34.3 35.8 36.5
DeepPixel [53] 40.3 40.1 39.0 37.7 39.4 39.6 40.1
Auxiliary (depth only) [9] 36.1 30.3 29.1 28.5 27.6 28.1 24.8
Ours 17.5 18.0 17.5 17.8 17.6 17.3 17.2
Ours (Unknown Camera Refining) 13.8 13.67 12.6 12.6 12.5 12.2 11.8
M→R LBP+SVM [56] 46.4 47.8 46.5 42.1 41.0 41.8 45.1
Resnet18 [15] 47.9 45.1 44.8 45.9 46.0 46.4 44.1
DeepPixel [53] 21.9 22.7 22.7 22.6 23.8 26.4 28.0
Auxiliary (depth only) [9] 26.7 28.0 27.6 26.5 28.1 30.9 35.2
Ours 21.7 22.1 23.2 19.8 25.7 30.3 36.0
Ours (Unknown Camera Refining) 21.0 19.9 19.5 19.2 24.3 28.9 30.1
M→C LBP+SVM [56] 49.8 50.0 49.2 49.7 48.9 50.0 48.3
Resnet18 [15] 45.8 48.4 48.3 48.6 48.7 47.9 49.0
DeepPixel [53] 36.1 34.6 35.2 34.5 35.4 36.3 36.9
Auxiliary (depth only) [9] 44.5 38.5 37.9 38.2 39.3 41.3 42.8
Ours 27.3 29.8 30.4 30.7 29.1 28.5 31.3
Ours (Unknown Camera Refining) 27.1 28.7 30.1 30.2 28.7 28.3 29.6
TABLE VIII: Performance comparisons of different face spoofing methods when the videos are compressed with HEVC standard (x265).
Configuration Methods Original QP17 QP22 QP27 QP32 QP37 QP42
CASIA Intra LBP+SVM [56] 7.5 8.7 32.8 25.3 32.0 28.0 32.9
Resnet18 [15] 5.2 7.4 7.1 7.9 9.0 8.7 11.8
DeepPixel [53] 2.6 3.4 3.5 3.3 3.7 5.2 7.3
Auxiliary (depth only) [9] 1.3 1.8 2.5 2.3 4.2 6.1 11.3
Ours 0.9 1.2 1.7 1.8 1.9 2.3 5.9
C→R LBP+SVM [56] 28.9 35.4 36.3 41.7 47.4 49.5 47.3
Resnet18 [15] 47.0 47.5 47.4 48.8 49.1 50.6 51.3
DeepPixel [53] 41.5 39.3 38.8 38.6 38.1 39.8 39.7
Auxiliary (depth only) [9] 28.1 28.2 27.0 28.9 34.6 39.3 42.8
Ours 17.6 21.9 18.5 18.2 21.6 20.8 22.1
Ours (Unknown Camera Refining) 17.0 17.5 17.5 17.1 20.2 20.0 21.5
C→M LBP+SVM [56] 48.1 46.6 47.8 49.5 49.8 48.9 48.1
Resnet18 [15] 36.4 36.8 36.1 35.0 35.0 35.5 34.2
DeepPixel [53] 40.3 41.2 40.2 39.7 41.0 42.2 42.8
Auxiliary (depth only) [9] 36.1 30.2 28.9 28.2 28.4 27.3 27.1
Ours 17.5 16.8 17.0 16.8 17.2 17.7 16.8
Ours (Unknown Camera Refining) 13.8 12.9 12.7 12.2 11.8 12.9 11.9
M→R LBP+SVM [56] 46.4 45.7 45.0 43.1 41.9 42.2 45.9
Resnet18 [15] 47.9 45.1 45.1 45.9 47.2 46.1 44.9
DeepPixel [53] 21.9 23.1 23.2 23.8 24.6 28.1 29.9
Auxiliary (depth only) [9] 26.7 27.5 27.2 28.1 31.4 36.2 40.8
Ours 21.7 21.9 24.5 21.9 27.0 34.0 36.7
Ours (Unknown Camera Refining) 21.0 21.0 19.9 19.5 26.3 28.7 31.3
M→C LBP+SVM [56] 49.8 49.8 48.8 49.2 48.8 49.2 49.9
Resnet18 [15] 45.8 47.8 48.3 49.8 48.6 48.8 50.0
DeepPixel [53] 36.1 34.5 35.1 34.9 34.7 36.9 39.6
Auxiliary (depth only) [9] 44.5 38.7 38.3 39.0 39.9 41.8 44.4
Ours 27.3 30.3 30.2 29.9 30.4 30.7 30.9
Ours (Unknown Camera Refining) 27.1 29.8 29.1 29.2 29.8 29.6 30.1
TABLE IX: Performance comparisons of different face spoofing methods when the videos are compressed with H.264/AVC standard (x264).

V Conclusions and Future Work

We aim to address one critical and practical challenge in face anti-spoofing, namely that diverse camera types are prone to cause a large domain gap when the training and testing data are captured by different cameras. In this paper, we present a novel camera invariant face anti-spoofing model, targeting the improvement of the generalization capability of face anti-spoofing in real applications. Our model is able to eliminate the influence of cameras in feature extraction based on feature domain decomposition, and obtains promising spoofing detection performance in practice through the combination of camera invariant and discrimination augmented feature extraction. Extensive experiments in intra-database and cross-database settings verify that the proposed scheme achieves superior performance and exhibits strong generalization capability in face spoofing detection.

Recent years have witnessed a surge in the variety of camera types along with different mobile devices. It is envisioned that the proposed method could be naturally incorporated to support applications on these devices. Owing to the elaborate design, more accurate face anti-spoofing is achieved without specifically collecting and labeling training data from a given camera, as evidenced by our experimental results. The proposed method is also extensible when more environments, attack types and compression levels are involved in face spoofing detection. As such, how the proposed model could be further extended into a more unified spoofing detection model is an interesting direction yet to be explored. It is also anticipated that the design philosophy could facilitate many other applications beyond face spoofing detection. One concrete example is low-level computer vision tasks such as image quality assessment, where a model that exhibits strong robustness to different cameras and acquisition environments is highly desired. Another example is high-level visual understanding on the cloud side, where a universal model that works across different cameras can be deployed in the cloud for efficient recognition and understanding.

VI Acknowledgement

The authors would like to thank the anonymous reviewers for their valuable comments which greatly helped improve this paper.

References

  • [1] I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of local binary patterns in face anti-spoofing,” in 2012 BIOSIG-proceedings of the international conference of biometrics special interest group (BIOSIG).   IEEE, 2012, pp. 1–7.
  • [2] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face antispoofing using speeded-up robust features and fisher vector encoding,” IEEE Signal Processing Letters, vol. 24, no. 2, pp. 141–145, 2016.
  • [3] J. Komulainen, A. Hadid, and M. Pietikäinen, “Context based face anti-spoofing,” in 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS).   IEEE, 2013, pp. 1–8.
  • [4] K. Patel, H. Han, and A. K. Jain, “Secure face unlock: Spoof detection on smartphones,” IEEE transactions on information forensics and security, vol. 11, no. 10, pp. 2268–2283, 2016.
  • [5] J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for face anti-spoofing,” arXiv preprint arXiv:1408.5601, 2014.
  • [6] K. Patel, H. Han, and A. K. Jain, “Cross-database face antispoofing with robust feature representation,” in Chinese Conference on Biometric Recognition.   Springer, 2016, pp. 611–619.
  • [7] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid, “An original face anti-spoofing approach using partial convolutional neural network,” in 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA).   IEEE, 2016, pp. 1–6.
  • [8] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and K.-W. Cheung, “Integration of image quality and motion cues for face anti-spoofing: A neural network approach,” Journal of Visual Communication and Image Representation, vol. 38, pp. 451–460, 2016.
  • [9] Y. Liu, A. Jourabloo, and X. Liu, “Learning deep models for face anti-spoofing: Binary or auxiliary supervision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 389–398.
  • [10] S. Liu, P. C. Yuen, S. Zhang, and G. Zhao, “3d mask face anti-spoofing with remote photoplethysmography,” in European Conference on Computer Vision.   Springer, 2016, pp. 85–100.
  • [11] H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot, “Unsupervised domain adaptation for face anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 7, pp. 1794–1809, 2018.
  • [12] H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C. Kot, “Learning generalized deep feature representation for face anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 10, pp. 2639–2652, 2018.
  • [13] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7167–7176.
  • [14] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, “A face antispoofing database with diverse attacks,” in 2012 5th IAPR international conference on Biometrics (ICB).   IEEE, 2012, pp. 26–31.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [16] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revisited: A joint formulation,” in European conference on computer vision.   Springer, 2012, pp. 566–579.
  • [17] A. Jourabloo, Y. Liu, and X. Liu, “Face de-spoofing: Anti-spoofing via noise modeling,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 290–306.
  • [18] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [19] D. Wen, H. Han, and A. K. Jain, “Face spoof detection with image distortion analysis,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 746–761, 2015.
  • [20] N. Evans, Handbook of Biometric Anti-Spoofing: Presentation Attack Detection.   Springer, 2019.
  • [21] J. Li, Y. Wang, T. Tan, and A. K. Jain, “Live face detection based on the analysis of fourier spectra,” in Biometric Technology for Human Identification, vol. 5404.   International Society for Optics and Photonics, 2004, pp. 296–303.
  • [22] T. de Freitas Pereira, J. Komulainen, A. Anjos, J. M. De Martino, A. Hadid, M. Pietikäinen, and S. Marcel, “Face liveness detection using dynamic texture,” EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, p. 2, 2014.
  • [23] P. P. Chan, W. Liu, D. Chen, D. S. Yeung, F. Zhang, X. Wang, and C.-C. Hsu, “Face liveness detection using a flash against 2d spoofing attack,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 2, pp. 521–534, 2017.
  • [24] J. Galbally and S. Marcel, “Face anti-spoofing based on general image quality assessment,” in 2014 22nd International Conference on Pattern Recognition.   IEEE, 2014, pp. 1173–1178.
  • [25] J. Galbally, S. Marcel, and J. Fierrez, “Image quality assessment for fake biometric detection: Application to iris, fingerprint, and face recognition,” IEEE transactions on image processing, vol. 23, no. 2, pp. 710–724, 2013.
  • [26] Z. Boulkenafet, J. Komulainen, and A. Hadid, “On the generalization of color texture-based face anti-spoofing,” Image and Vision Computing, vol. 77, pp. 1–9, 2018.
  • [27] G. Pan, Z. Wu, and L. Sun, “Liveness detection for face recognition,” in Recent advances in face recognition.   IntechOpen, 2008.
  • [28] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. Ho, “Detection of face spoofing using visual dynamics,” IEEE transactions on information forensics and security, vol. 10, no. 4, pp. 762–777, 2015.
  • [29] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh, “Computationally efficient face spoofing detection with motion magnification,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2013, pp. 105–110.
  • [30] T. A. Siddiqui, S. Bharadwaj, T. I. Dhamecha, A. Agarwal, M. Vatsa, R. Singh, and N. Ratha, “Face anti-spoofing with multifeature videolet aggregation,” in 2016 23rd International Conference on Pattern Recognition (ICPR).   IEEE, 2016, pp. 1035–1040.
  • [31] I. Chingovska, J. Yang, Z. Lei, D. Yi, S. Z. Li, O. Kahm, C. Glaser, N. Damer, A. Kuijper, A. Nouak et al., “The 2nd competition on counter measures to 2d face spoofing attacks,” in 2013 International Conference on Biometrics (ICB).   IEEE, 2013, pp. 1–6.
  • [32] T. Wang, J. Yang, Z. Lei, S. Liao, and S. Z. Li, “Face liveness detection using 3d structure recovered from a single camera,” in 2013 international conference on biometrics (ICB).   IEEE, 2013, pp. 1–6.
  • [33] Y. Wang, F. Nian, T. Li, Z. Meng, and K. Wang, “Robust face anti-spoofing with depth information,” Journal of Visual Communication and Image Representation, vol. 49, pp. 332–337, 2017.
  • [34] Z. Zhang, D. Yi, Z. Lei, and S. Z. Li, “Face liveness detection by learning multispectral reflectance distributions,” in Face and Gesture 2011.   IEEE, 2011, pp. 436–441.
  • [35] I. Manjani, S. Tariyal, M. Vatsa, R. Singh, and A. Majumdar, “Detecting silicone mask-based presentation attack via deep dictionary learning,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 7, pp. 1713–1723, 2017.
  • [36] A. Pinto, H. Pedrini, M. Krumdick, B. Becker, A. Czajka, K. W. Bowyer, and A. Rocha, “Counteracting presentation attacks in face, fingerprint, and iris recognition,” Deep Learning in Biometrics, vol. 245, 2018.
  • [37] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” British Machine Vision Association, 2015.
  • [38] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, “Face anti-spoofing using patch and depth-based cnns,” in 2017 IEEE International Joint Conference on Biometrics (IJCB).   IEEE, 2017, pp. 319–328.
  • [39] G. Wang, H. Han, S. Shan, and X. Chen, “Improving cross-database face presentation attack detection via adversarial domain adaptation,” in International Conference on Biometrics (ICB), 2019.
  • [40] X. Tu, J. Zhao, M. Xie, G. Du, H. Zhang, J. Li, Z. Ma, and J. Feng, “Learning generalizable and identity-discriminative representations for face anti-spoofing,” arXiv preprint arXiv:1901.05602, 2019.
  • [41] C. T. Li, “Source camera identification using enhanced sensor pattern noise,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 280–287, 2010.
  • [42] Y. Sutcu, S. Bayram, H. T. Sencar, and N. Memon, “Improvements on sensor noise based source camera identification,” in 2007 IEEE International Conference on Multimedia and Expo.   IEEE, 2007, pp. 24–27.
  • [43] M. Chen, J. Fridrich, and M. Goljan, “Digital imaging sensor identification (further study),” in Security, steganography, and watermarking of multimedia contents IX, vol. 6505.   International Society for Optics and Photonics, 2007, p. 65050P.
  • [44] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4353–4361.
  • [45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  • [46] J. Fridrich and J. Kodovsky, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
  • [47] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Learning rich features for image manipulation detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1053–1061.
  • [48] M. Goljan and J. Fridrich, “Cfa-aware features for steganalysis of color images,” in Media Watermarking, Security, and Forensics 2015, vol. 9409.   International Society for Optics and Photonics, 2015, p. 94090V.
  • [49] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [50] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid, “Oulu-npu: A mobile face presentation attack database with real-world variations,” in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).   IEEE, 2017, pp. 612–618.
  • [51] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
  • [52] ISO/IEC JTC 1/SC 37 Biometrics, “Information technology - Biometric presentation attack detection - Part 1: Framework,” International Organization for Standardization, 2016.
  • [53] A. George and S. Marcel, “Deep pixel-wise binary supervision for face presentation attack detection,” in Proceedings of 2019 International Conference on Biometrics (ICB).   IEEE, 2019, pp. 1–8.
  • [54] A. Anjos and S. Marcel, “Counter-measures to photo attacks in face recognition: a public database and a baseline,” in 2011 international joint conference on Biometrics (IJCB).   IEEE, 2011, pp. 1–7.
  • [55] Z. Boulkenafet, J. Komulainen, Z. Akhtar, A. Benlamoudi, D. Samai, S. E. Bekhouche, A. Ouafi, F. Dornaika, A. Taleb-Ahmed, L. Qin et al., “A competition on generalized software-based face presentation attack detection in mobile scenarios,” in 2017 IEEE International Joint Conference on Biometrics (IJCB).   IEEE, 2017, pp. 688–696.
  • [56] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face spoofing detection using colour texture analysis,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 8, pp. 1818–1830, 2016.
  • [57] H. Chen, G. Hu, Z. Lei, Y. Chen, N. M. Robertson, and S. Z. Li, “Attention-based two-stream convolutional networks for face spoofing detection,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 578–593, 2019.
  • [58] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
  • [59] Q. Yu and K. Aizawa, “Unsupervised out-of-distribution detection by maximum classifier discrepancy,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9518–9526.
  • [60] D. Mandal, S. Narayan, S. K. Dwivedi, V. Gupta, S. Ahmed, F. S. Khan, and L. Shao, “Out-of-distribution detection for generalized zero-shot action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9985–9993.
  • [61] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1649–1668, 2012.
  • [62] http://x265.org/.
  • [63] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the h. 264/avc video coding standard,” IEEE Transactions on circuits and systems for video technology, vol. 13, no. 7, pp. 560–576, 2003.
  • [64] https://x264.org/en/.
