Deep Transfer Across Domains for Face Anti-spoofing

01/17/2019 ∙ by Xiaoguang Tu, et al. ∙ Microsoft

A practical face recognition system demands not only high recognition performance, but also the capability of detecting spoofing attacks. While many face anti-spoofing approaches have been proposed in recent years, most of them do not generalize well to new databases. The generalization ability of face anti-spoofing methods needs to be significantly improved before they can be adopted by practical application systems. The main reason for the poor generalization of current approaches is the variety of materials among the spoofing devices. As the attacks are produced by putting a spoofing display (e.g., paper, electronic screen, forged mask) in front of a camera, the variety of spoofing materials can make the spoofing attacks quite different. Furthermore, the background and lighting conditions of a new environment can make both the real accesses and the spoofing attacks different. Another reason for the poor generalization is that limited labeled data is available for training in face anti-spoofing. In this paper, we focus on improving the generalization ability across different kinds of datasets. We propose a CNN framework that uses sparsely labeled data from the target domain to learn features that are invariant across domains for face anti-spoofing. Experiments on public-domain face spoofing databases show that the proposed method significantly improves the cross-dataset testing performance with only a small number of labeled samples from the target domain.

1 Introduction

In biometric face recognition systems, spoofing attacks are usually perpetrated using photographs, replayed videos or forged masks. Despite continuous work on face anti-spoofing over the years [1, 2, 3, 4, 5, 6, 7], most methods are limited to unrealistic intra-database testing scenarios rather than cross-database testing scenarios. In practice, the environment is not fixed: differences in lighting conditions, backgrounds and camera resolutions in a new environment may make the captured images look different. Besides, as the spoofing displays can be produced using different kinds of materials, such as paper, electronic screens and forged masks, the distributions of spoofing samples vary widely. Therefore, it is extremely hard for an approach trained on one dataset to perform well on other datasets.

Most of the published literature focuses on hand-crafted features, such as LBP [3], LBP-TOP [4] and HOG [8], and tries to capture texture differences between live and spoofing face images from the perspective of surface reflection and material differences. However, methods in this category may suffer from poor generalizability since the texture varies with the spoofing devices. In [5, 7], the researchers used CNNs to automatically learn features for face anti-spoofing and achieved promising performance. Nevertheless, CNNs need a large amount of spoofing data of various types to guarantee their generalization ability. Even the traditional approach for adapting deep models, fine-tuning, may require hundreds or thousands of labeled examples for each category that needs to be adapted [9]. Unfortunately, the current publicly available face spoofing datasets, such as Replay-Attack [3], CASIA-FASD [10], MSU-MFSD [11] and NUAA [2], are too limited to train a generalized network compared with the datasets for image classification and face recognition.

However, in the practical application of a face anti-spoofing product, it is reasonable to assume that the product's new owner will be able to label a handful of examples for a few types of training samples. Given this circumstance, we propose a domain transfer network that aims to learn domain-invariant features across two different datasets for robust face anti-spoofing. We propose to learn a shared feature subspace in which the distributions of the real access samples (genuine) from different domains, and the distributions of each type of spoofing attack (fake) from different domains, are drawn close, respectively. In the proposed framework, the abundant labeled source data are used to learn discriminative representations that distinguish genuine samples from fake samples. Meanwhile, the sparsely labeled target samples are fed to the network to calculate the feature distribution distance between the genuine samples from the source and target domains, and between the fake samples from the source and target domains, grouped by their materials. A kernel approach is adopted to map the features output by the CNN into a common kernel space, and the Maximum Mean Discrepancy (MMD) is adopted to measure the distribution distance between the samples from the source and target domains. This feature distribution distance is treated as a domain loss term that is added to the objective function and minimized during training of the network. We provide a comprehensive evaluation of the proposed method on popular datasets and show significant performance improvements.

2 Related Work

Existing face anti-spoofing approaches mainly rely on two cues for face liveness detection: the texture differences between live and spoofing face images, and fine-grained motions such as eye blinking, mouth movement and head movement across video frames.

In [12], the researchers utilized the difference in structural texture between 2D images and 3D images to detect spoofing attacks based on the analysis of Fourier spectra, where the reflections of light on 2D and 3D surfaces result in different frequency distributions. Tan et al. [2] used a variational retinex-based method and Difference-of-Gaussian (DoG) filters to extract latent reflectance features from face images to distinguish fake images from real images. In [13], Määttä et al. extracted the texture of 2D images using the multi-scale local binary pattern (LBP) to generate a concatenated histogram, which was fed into an SVM classifier for genuine/fake face classification. In later work, Chingovska et al. [3] applied the LBP operator and its variations to capture the textural properties of the input image. However, these methods analyze each frame in isolation and do not consider the temporal motion cues across video frames. To make use of the temporal information, Pereira et al. [4] proposed LBP-TOP, which considers three orthogonal planes intersecting at the center of a pixel in the $XY$ direction, $XT$ direction and $YT$ direction, where $T$ is the time axis. According to their experimental results, this multi-resolution LBP-TOP with an SVM classifier achieved the best HTER of 7.6% on the Replay-Attack dataset.

Figure 1: The distributions of genuine samples and two types of fake samples (print and video) from the database of MSU-MFSD (a) and Replay-Attack (b).
Figure 2: The flowchart of the proposed framework, where every input batch contains half the source images and half the target images. Features of the two domains output from the last pooling layer are used to calculate the distribution distance with kernel based MMD. The network is trained using the classification loss along with the distribution distance which is taken as domain loss.

Though promising performance can be achieved with the above-mentioned hand-crafted features under the intra-dataset protocol, a new dataset from a different domain may result in severe performance degradation, since many hand-crafted features are designed for specific conditions and cannot be easily transferred to new ones [14]. In order to obtain features with better generalization ability, quite a few researchers have used deep convolutional neural networks (CNNs) for face anti-spoofing [5, 15, 16, 17]. CNNs can automatically learn features and obtain discriminative cues between genuine and fake faces. If sufficient data from various domains is available, CNN-based methods can achieve fairly good generalization ability. Unfortunately, the current publicly available face anti-spoofing datasets are too limited to train a generalized network due to the difficulty of obtaining labeled training samples. The variety of materials of spoofing devices also increases the domain shift in the distribution of data representations.

To bridge the gap between two different distributions, many approaches have been proposed based on the ideas of domain adaptation [18, 19, 20, 21]. Multimodal deep learning architectures have been proposed to learn domain-invariant representations in [22]. However, this method operated primarily in a generative context and did not leverage the full representation power of supervised CNN representations. In [23], Ghifary et al. proposed pre-training with a denoising auto-encoder and then training a two-layer network simultaneously with the MMD domain confusion loss. Nevertheless, the learned network is relatively shallow and therefore lacks the strong semantic representation that is learned by directly optimizing a supervised deep CNN. In [20], Tzeng et al. proposed a new CNN architecture which introduces an adaptation layer and an additional domain confusion loss for classification. They used MMD both to select the depth and width of the architecture and as a regularizer during fine-tuning, and achieved state-of-the-art performance on the standard visual domain adaptation benchmark.

3 Deep Domain Transfer Network

The idea of domain adaptation has been successfully applied in many other fields [18, 24]. Similar ideas, however, have never been used in the study of face anti-spoofing, even though current face anti-spoofing approaches suffer severely from the problem of dataset bias. In order to remove the dataset bias and bridge the gap between distributions from different domains with limited training samples, we propose a deep domain transfer framework for face anti-spoofing with a kernel-based metric. Kernel-based metrics have already been used in many works to measure distribution distance [25, 26, 27]. If a suitable kernel is found, the input data can be mapped into a convenient feature space, and the distance between different distributions can be better quantified.

We propose to use a deep Convolutional Neural Network to extract discriminative features from the input images. We argue that the feature distributions of different datasets differ in the feature subspace learned by the shared feature extraction layers. Figure 1 illustrates the feature distributions along three principal components for two different datasets, where the features are learned by a CNN hidden layer. It shows that the distributions of the genuine samples and the two types of fake samples in the learned feature subspace vary from dataset to dataset. The proposed framework is outlined in Figure 2.

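Let $x$ be a $d$-dimensional column vector, and let $x^s$ and $x^t$ be data points in the source and the target datasets, respectively. Suppose that the source-domain data are $X_S = \{x_i^s\}_{i=1}^{n_s}$ and the target-domain data are $X_T = \{x_j^t\}_{j=1}^{n_t}$, where $n_s$ and $n_t$ are the numbers of data points. We denote the representation of the last pooling layer as $\phi(\cdot)$. Therefore, the representations of the source-domain points, $X_S$, and the target-domain points, $X_T$, can be represented as $\phi(X_S)$ and $\phi(X_T)$, respectively. Suppose $k$ is the kernel of a reproducing kernel Hilbert space $\mathcal{H}$ of functions. Then the MMD in $\mathcal{H}$ between the distributions $P$ and $Q$ is [28]:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x'}\left[k(x, x')\right] - 2\,\mathbb{E}_{x, y}\left[k(x, y)\right] + \mathbb{E}_{y, y'}\left[k(y, y')\right] \qquad (1)$$

where $x, x' \sim P$ and $y, y' \sim Q$. Many kernels, such as the Gaussian RBF, are characteristic [29, 30], which indicates that the MMD is a metric, and in particular that $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$. Given $X = \{x_i\}_{i=1}^{m} \sim P$ and $Y = \{y_j\}_{j=1}^{n} \sim Q$, one estimator of $\mathrm{MMD}^2(P, Q)$ is:

$$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{m(m-1)}\sum_{i \neq i'} k(x_i, x_{i'}) + \frac{1}{n(n-1)}\sum_{j \neq j'} k(y_j, y_{j'}) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) \qquad (2)$$

This estimator is unbiased, and has nearly minimum variance among unbiased estimators [28]. Taking the feature representations $\phi(X_S)$ and $\phi(X_T)$ as $X$ and $Y$, respectively, the kernel-based MMD between the features of source-domain and target-domain samples can be calculated accordingly. In our experiments, a mixture of RBF kernels was chosen for computing the MMD-based distribution distance.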

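Table 1: The training, testing and development sets contained in the Replay-Attack and MSU-MFSD databases.

| Dataset | Sample type | Train | Test | Devel |
|---|---|---|---|---|
| Replay-Attack-fixed (Adverse) | Genuine | 30 | 40 | 30 |
| Replay-Attack-fixed (Adverse) | Video attack | 30*4 | 40*4 | 30*4 |
| Replay-Attack-fixed (Adverse) | Paper attack | 30 | 40 | 30 |
| Replay-Attack-fixed (Controlled) | Genuine | 30 | 40 | 30 |
| Replay-Attack-fixed (Controlled) | Video attack | 30*4 | 40*4 | 30*4 |
| Replay-Attack-fixed (Controlled) | Paper attack | 30 | 40 | 30 |
| MSU-MFSD | Genuine | 30 | 20 | 20 |
| MSU-MFSD | Video attack | 30*2 | 20*2 | 20*2 |
| MSU-MFSD | Paper attack | 30 | 20 | 20 |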

As Figure 2 shows, the proposed framework minimizes the distance between the domains as well as the classification error between genuine and fake samples. The learned features should be domain-invariant to achieve good classification results on data from either the source domain or the target domain. If no labeled target-domain data are available, the objective loss is defined as:

$$\mathcal{L} = \mathcal{L}_C(X_S, y) + \lambda\,\mathrm{MMD}^2(X_S, X_T) \qquad (3)$$

where $X_S$ and $X_T$ denote the source-domain data and target-domain data, respectively, and $y$ denotes the labels of the source-domain data. $\mathcal{L}_C(X_S, y)$ is the classification loss on the labeled source-domain data, and $\mathrm{MMD}^2(X_S, X_T)$ is the domain loss between the source data $X_S$ and the target data $X_T$. The regularization parameter $\lambda$ determines how strongly we would like to confuse the domains.

In face anti-spoofing, spoofing images are usually obtained using different types of devices whose surfaces are made of different materials, e.g., printed photo, electronic screen or forged mask. We call these spoofing modalities in this paper. If sparsely labeled data from the target domain are available, we can define the objective function in an intra-modality way. Specifically, the source samples and target samples are divided into several subsets according to the material of their spoofing surfaces (e.g., printed paper, video screen, mask). Then, the MMD is calculated between the source and target subsets of each modality. Therefore, the semi-supervised objective loss is:

$$\mathcal{L} = \mathcal{L}_C(X_S, y) + \lambda\left[\mathrm{MMD}^2(X_S^{r}, X_T^{r}) + \sum_{m=1}^{M}\mathrm{MMD}^2(X_S^{f_m}, X_T^{f_m})\right] \qquad (4)$$

where $X_S^{r}$ and $X_T^{r}$ represent the real samples from the source and target domain, respectively, $X_S^{f_m}$ and $X_T^{f_m}$ represent the fake samples of the $m$-th spoofing modality from the two domains, and $M$ is the number of spoofing modalities in the database. As in Eq. (3), $\mathcal{L}_C(X_S, y)$ is the classification loss on the labeled source data and $\lambda$ is the regularization parameter. This objective function can be optimized with the Adam algorithm [31]. Adam offers a computationally efficient way to perform gradient-based optimization of stochastic objective functions, and is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. It is robust and well-suited to a wide range of non-convex optimization problems [31].
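The intra-modality domain loss described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the integer modality encoding (0 = genuine, 1 = print, 2 = video), the function names and the random stand-in features are hypothetical, and a biased (V-statistic) MMD estimate is used here because it stays well defined for very small target subsets. It only evaluates the loss value; gradient-based training with Adam is not shown.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidths=(2, 5, 10, 20, 40, 80)):
    """Biased MMD^2 estimate with a sum of Gaussian RBF kernels."""
    z = np.vstack([x, y])
    sq_norm = np.sum(z**2, axis=1)
    sq_dist = np.maximum(sq_norm[:, None] + sq_norm[None, :] - 2.0 * z @ z.T, 0.0)
    k = sum(np.exp(-sq_dist / (2.0 * s**2)) for s in bandwidths)
    m = len(x)
    return k[:m, :m].mean() + k[m:, m:].mean() - 2.0 * k[:m, m:].mean()

def modality_domain_loss(src_feat, src_mod, tgt_feat, tgt_mod):
    """Sum of MMD^2 terms, one per modality present in both domains."""
    shared = np.intersect1d(src_mod, tgt_mod)
    return sum(
        rbf_mmd2(src_feat[src_mod == mod], tgt_feat[tgt_mod == mod])
        for mod in shared
    )

def total_loss(cls_loss, src_feat, src_mod, tgt_feat, tgt_mod, lam=0.5):
    """Classification loss plus lam times the modality-wise domain loss."""
    return cls_loss + lam * modality_domain_loss(src_feat, src_mod, tgt_feat, tgt_mod)

# Hypothetical features: 20 source and 5 target samples per modality.
rng = np.random.default_rng(0)
src_feat = rng.normal(size=(60, 16))
src_mod = np.repeat([0, 1, 2], 20)      # assumed encoding: 0 genuine, 1 print, 2 video
tgt_feat = rng.normal(loc=1.0, size=(15, 16))
tgt_mod = np.repeat([0, 1, 2], 5)

same_loss = modality_domain_loss(src_feat, src_mod, src_feat, src_mod)
shifted_loss = total_loss(0.7, src_feat, src_mod, tgt_feat, tgt_mod)
print(same_loss, shifted_loss)
```

Grouping by modality means a print attack in the target domain is only pulled towards print attacks in the source domain, never towards genuine or video samples.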

We build the proposed ConvNet on AlexNet [32], which contains five convolutional and max pooling layers, and three fully connected layers. A "Two-Half" strategy is proposed for training the framework. Specifically, in every training batch, one half is source data and the other half is target data, as illustrated in Figure 2. Since only a few target samples are available, we randomly picked and copied these target samples to ensure that the two domains have the same number of samples. During training, the labeled samples from the source domain are used to learn discriminative features for distinguishing genuine from fake images, while the target samples are used to shrink the domain variance with respect to the source domain, modality by modality. Owing to the two terms of the joint loss, the proposed deep domain adaptation network can learn representations that are discriminative between genuine and fake samples due to the classification loss, while remaining invariant to domain shift due to the domain loss.
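A minimal sketch of the "Two-Half" batching idea follows. It assumes that copying the few target samples can be approximated by drawing target indices with replacement; the function name and the sample counts are hypothetical and chosen only for the example.

```python
import numpy as np

def two_half_batches(n_source, n_target, batch_size, rng):
    """Yield (source_idx, target_idx) pairs; each fills one half of a batch."""
    half = batch_size // 2
    src_order = rng.permutation(n_source)
    for start in range(0, n_source - half + 1, half):
        src_idx = src_order[start:start + half]
        # Only a few target samples exist, so they are reused (drawn with replacement).
        tgt_idx = rng.integers(0, n_target, size=half)
        yield src_idx, tgt_idx

rng = np.random.default_rng(0)
batches = list(two_half_batches(n_source=100, n_target=7, batch_size=32, rng=rng))
for src_idx, tgt_idx in batches[:2]:
    print(len(src_idx), len(tgt_idx), tgt_idx.max())
```

Each yielded pair indexes into the source and target sample arrays; concatenating the two halves gives one training batch with equal contributions from both domains.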

Table 2: Comparison of inter-test results between different pairs of datasets. The training set of the dataset in the first column is used to train the model, whereas the testing set of the dataset in the second column is used for model testing.

| Train | Test | Metric | Methods | | | |
|---|---|---|---|---|---|---|
| Replay-Attack (adverse) | Replay-Attack (controlled) | HTER (%) | 22.50 | 24.00 | 24.42 | 20.00 |
| Replay-Attack (adverse) | Replay-Attack (controlled) | AUC (%) | 77.50 | 76.00 | 75.58 | 80.00 |
| Replay-Attack (adverse) | MSU-MFSD | HTER (%) | 45.83 | 47.50 | 36.36 | 25.83 |
| Replay-Attack (adverse) | MSU-MFSD | AUC (%) | 54.17 | 52.50 | 63.64 | 74.17 |
| MSU-MFSD | Replay-Attack (controlled) | HTER (%) | 45.50 | 46.50 | 48.58 | 27.50 |
| MSU-MFSD | Replay-Attack (controlled) | AUC (%) | 54.50 | 53.50 | 51.42 | 72.50 |

4 Experiments

In this section, we first give a brief description of two benchmark datasets, Replay-Attack [3] and MSU-MFSD [11]. After that, we report the performance evaluation of the proposed method on the two datasets. We follow the protocols provided by these two datasets.

4.1 Experimental Settings and Datasets

To avoid any influence from the background, only the face region of each image was cropped, based on eye coordinates, and used as input. The eye coordinates were obtained by a facial landmark detection algorithm available on the internet. All the input face images were resized to a common size, and the regularization parameter $\lambda$ was set to 0.5. A batch normalization layer was used to overcome internal covariate shift. A mixture of Gaussian RBF kernels was chosen for the calculation of the domain loss. The mixed kernel function is a sum of Gaussian RBF kernels with fixed bandwidths 2, 5, 10, 20, 40 and 80.

MSU-MFSD: This dataset consists of 280 video recordings of genuine and attack faces. The recordings were taken from 35 individuals using two types of cameras with different resolutions. For the real accesses, each subject has two video recordings captured with the Android and laptop cameras. For the video attacks, a high definition video was taken of each subject using a Canon camera and an iPhone camera. The videos taken with the Canon camera were then replayed on an iPad Air screen to generate the HD replay attacks, while the videos recorded by the iPhone were replayed on the iPhone itself to generate the mobile replay attacks. For the printed attacks, the pictures were printed on A3 paper using an HP colour printer. The 35 subjects (280 videos) of the MSU-MFSD dataset were divided into training (120 videos) and testing (160 videos) subsets. The training subset contains 30 real access videos and 90 attack videos, while the testing subset contains 40 real accesses and 120 attacks.

Replay-Attack: This dataset consists of 1200 videos that include 200 real access videos and 1000 attack videos. The attacks were recorded under 2 different illumination conditions (controlled and adverse) and 2 support conditions (hand and fixed). Under the same conditions, a high resolution picture and video were taken of each person. Three types of attacks were designed: (1) print attack, (2) mobile attack, and (3) highdef (high definition) attack. The evaluation protocol splits the dataset into training (360 videos), testing (480 videos) and development (360 videos) subsets. The training and development subsets contain 60 real access videos and 300 attack videos each, whereas the testing subset contains 80 real accesses and 400 attacks. In our experiments, only the attacks recorded with fixed support were used.

4.2 Performance measure

We evaluate the adaptation ability of the proposed transfer network in this section. Half Total Error Rate (HTER) was used as the metric in our experiments to stay consistent with previous work. The HTER is half of the sum of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR). Both FAR and FRR depend on a threshold, and increasing the FAR will usually reduce the FRR and vice versa. We therefore followed previous works [5, 33] and used the development set to determine the threshold corresponding to the Equal Error Rate (EER), which was then used to compute the HTER. Unfortunately, there is no development set in the MSU-MFSD dataset. In this case, we split the testing set equally into two parts, namely the development set and the new testing set. The split datasets are shown in Table 1. As Table 1 shows, there are 30, 40 and 30 subjects used for training, testing and development in the Replay-Attack database, respectively, while the training, testing and development sets in MSU-MFSD contain 30, 20 and 20 subjects, respectively. The testing accuracy (ACC) achieved by each method is also calculated for comparison.
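The evaluation procedure above (choose the threshold at the EER on the development set, then report HTER on the test set) can be sketched as below. The score convention (higher means more likely genuine), the label encoding (1 = genuine, 0 = attack), the function names and the toy scores are assumptions made for this example.

```python
import numpy as np

def far_frr(scores, labels, threshold):
    """FAR and FRR at a threshold; samples with score >= threshold are accepted."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    accept = scores >= threshold
    far = np.mean(accept[labels == 0])     # attacks wrongly accepted
    frr = np.mean(~accept[labels == 1])    # genuine accesses wrongly rejected
    return far, frr

def eer_threshold(dev_scores, dev_labels):
    """Development-set threshold where FAR and FRR are closest to each other."""
    candidates = np.unique(dev_scores)
    gaps = [abs(np.subtract(*far_frr(dev_scores, dev_labels, t))) for t in candidates]
    return candidates[int(np.argmin(gaps))]

def hter(test_scores, test_labels, threshold):
    """Half Total Error Rate: the mean of FAR and FRR at a fixed threshold."""
    far, frr = far_frr(test_scores, test_labels, threshold)
    return 0.5 * (far + frr)

# Toy scores: four attacks followed by four genuine accesses in each set.
dev_scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])
dev_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
test_scores = np.array([0.1, 0.2, 0.3, 0.65, 0.5, 0.7, 0.8, 0.9])
test_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

thr = eer_threshold(dev_scores, dev_labels)
test_hter = hter(test_scores, test_labels, thr)
print(thr, test_hter)  # 0.6 0.25
```

Fixing the threshold on the development set keeps the test-set HTER free of any tuning on test data.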

4.3 Cross-Databases Spoofing Detection

For face anti-spoofing, the ability to adapt from one dataset to another is crucial for practical application. In this part, we evaluate this ability by cross-dataset testing (inter-test): the model is first trained using the samples from dataset A (source), and then tested on dataset B (target). The Replay-Attack and MSU-MFSD databases were used to evaluate the proposed method. Images in Replay-Attack were divided into two subsets according to their illumination conditions (adverse or controlled), which gives three datasets: MSU-MFSD, Replay-Attack-controlled and Replay-Attack-adverse. Replay-Attack-controlled is the subset of Replay-Attack where images are taken under controlled lighting, whereas Replay-Attack-adverse is the subset taken under uncontrolled lighting. For each of the three datasets, the images were split into three subsets: the training, testing and development sets. To run the experiments in inter-test fashion, the training set of dataset A was used to train the CNN models or the SVM classifiers, whereas the testing set of dataset B was used for testing. Three groups of inter-test performance were evaluated: Replay-Attack-adverse vs. Replay-Attack-controlled, Replay-Attack-adverse vs. MSU-MFSD, and MSU-MFSD vs. Replay-Attack-controlled.

We consider the situation in which the labeled data from the target domain is limited for training. In our experiments, we randomly selected only one labeled subject from the target dataset for training. The video frames of this subject, both genuine and fake, were stacked and copied to match the number of training samples from the source domain. Because the input data were video clips, each containing a number of consecutive frames, we used all the consecutive frames to train the network for the purpose of data augmentation. In the testing phase, however, the output probabilities of the consecutive frames of the same subject were averaged to determine the category of each video clip. Three popular methods, LBP [3], LBP-TOP [4] and DoG [10], were implemented for comparison with the proposed method. The features captured by the three methods were all fed into an SVM to obtain the final classification results. For the LBP method, all the consecutive frames of each video clip in the training set were used to extract the 59-dimensional holistic features and to train the SVM classifier. The final results were obtained on the test set by averaging the probabilities of all the consecutive frames per video clip. The LBP-TOP features were extracted per video clip, as this method involves the spatio-temporal information of video sequences. For the DoG operator, 30 frames of each video clip were randomly selected to train the SVM classifier to avoid high dimensionality of the DoG features, following the procedure of the original work [10]. The standard CNN (stdCNN) was also implemented for comparison with the three methods. The stdCNN has the same architecture as the proposed framework, only without the domain loss layer.
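The video-level decision described above, where frame probabilities belonging to the same clip are averaged, can be sketched as follows. The function name, the integer clip identifiers and the 0.5 decision threshold are assumptions for illustration; in the experiments the threshold would come from the development set.

```python
import numpy as np

def video_decision(frame_probs, video_ids, threshold=0.5):
    """Average per-frame genuine probabilities within each video clip."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    video_ids = np.asarray(video_ids)
    result = {}
    for vid in np.unique(video_ids):
        mean_prob = float(frame_probs[video_ids == vid].mean())
        # A clip is labeled genuine when its mean probability reaches the threshold.
        result[vid.item()] = (mean_prob, mean_prob >= threshold)
    return result

# Two hypothetical clips with three frames each.
decisions = video_decision(
    frame_probs=[0.9, 0.8, 0.4, 0.2, 0.6, 0.1],
    video_ids=[0, 0, 0, 1, 1, 1],
)
for vid, (prob, is_genuine) in decisions.items():
    print(vid, round(prob, 2), is_genuine)
```

Averaging lets a few misclassified frames (such as the 0.4 frame in clip 0 or the 0.6 frame in clip 1) be outvoted by the rest of the clip.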

The inter-test results of these methods are first summarized in Table 2 to evaluate their generalization ability on the three datasets. Between the Replay-Attack-adverse and Replay-Attack-controlled datasets, only the illumination condition differs, while other factors such as individuals and spoofing materials are kept unchanged. Therefore, the domain shift between these two datasets is not serious. This is why all four approaches achieved their best performance when testing across these two datasets. However, the MSU-MFSD database is totally different from the Replay-Attack database: the individuals, spoofing materials and illumination conditions are all different. As a result, the domain shift between these two databases is quite serious, and the performance achieved by the four approaches is relatively lower across these two databases.

Figure 3: The accuracy of intra-test and inter-test achieved by LBP (a), LBP-TOP (b), DoG (c) and stdCNN vs. DTCNN (d), where stdCNN stands for the standard CNN and DTCNN stands for the domain transferred CNN.

4.4 Intra-test vs. Inter-test

To illustrate the effectiveness of the proposed method, we compared the intra-test performance with the inter-test performance of each method: LBP, LBP-TOP, DoG and stdCNN. Suppose we evaluate the performance across dataset A and dataset B. The intra-test uses the training and testing sets from A to train and test the model. The inter-test, in contrast, uses the training set from A and the testing set from B to train and test the model, respectively.

Figure 3 reports the accuracy of intra-test and inter-test for LBP, LBP-TOP, DoG, stdCNN and the proposed method (DTCNN). As can be seen, all of the methods obtained satisfactory performance in the intra-test setting, and the CNN even achieved nearly perfect performance on the Replay-Attack database. However, in the intra-test on MSU-MFSD, we observed that the testing accuracy was significantly lower than the training accuracy. The main reason for this decline is over-fitting due to the shortage of training samples.

The best inter-test performance was still achieved when testing across Replay-Attack-adverse and Replay-Attack-controlled (Adver vs. Contrl). Compared with the intra-test results, the inter-test performance of the standard CNN degraded significantly (blue vs. green bars in Figure 3(d)), most notably when testing across Replay-Attack-adverse and MSU-MFSD (Adver vs. MFSD). However, when using the proposed method (DTCNN) for inter-test (blue vs. yellow bars in Figure 3(d)), there was a considerable boost in performance compared with the standard CNN for Adver vs. Contrl, Adver vs. MFSD and MFSD vs. Contrl.

Overall, the results of LBP, LBP-TOP, DoG and the standard CNN show that they cannot handle the cross-database challenge well. LBP, LBP-TOP and DoG are hand-crafted features that are designed for specific conditions and cannot be easily transferred to new ones. For standard CNNs, the learned features are specific to the training dataset. They may achieve superb performance on data from the training domain, but the performance may degrade considerably when testing on a totally different dataset. Unlike standard CNNs, the proposed method can make use of the domain information from the target dataset, bridging the gap between the feature distributions and hence obtaining satisfactory performance on the target set with only very few labeled samples from the target domain.

5 Summary and Conclusion

Cross-database face anti-spoofing replicates real application scenarios and is a challenge for biometric anti-spoofing. Although many existing methods achieve excellent performance in unrealistic intra-database testing, few of them achieve comparable performance on datasets from other domains. To bridge the gap between datasets from different domains, we proposed a CNN framework for face anti-spoofing that effectively adapts to a new domain with sparsely labeled target-domain data. The proposed network can learn an invariant feature space for the source and target samples by optimizing an objective that simultaneously minimizes the classification loss and the domain loss. As a result, the model trained with the labeled source data can also achieve satisfactory performance on the target dataset. Experiments on the Replay-Attack and MSU-MFSD datasets showed that the proposed framework greatly enhances cross-dataset testing performance with only a few labeled samples from the target domain. The proposed method could open new perspectives for future research on face anti-spoofing.

References

  • [1] G. Pan, L. Sun, Z. Wu, S. Lao, Eyeblink-based anti-spoofing in face recognition from a generic webcamera, in: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, IEEE, 2007, pp. 1–8.
  • [2] X. Tan, Y. Li, J. Liu, L. Jiang, Face liveness detection from a single image with sparse low rank bilinear discriminative model, Computer Vision–ECCV 2010 (2010) 504–517.
  • [3] I. Chingovska, A. Anjos, S. Marcel, On the effectiveness of local binary patterns in face anti-spoofing, in: Biometrics Special Interest Group, 2012, pp. 1–7.
  • [4] T. D. F. Pereira, A. Anjos, J. M. D. Martino, S. Marcel, Lbp-top based countermeasure against face spoofing attacks, in: International Conference on Computer Vision, 2012, pp. 121–132.
  • [5] J. Yang, Z. Lei, S. Z. Li, Learn convolutional neural network for face anti-spoofing, Computer Science 9218 (2014) 373–384.
  • [6] A. Pinto, W. R. Schwartz, H. Pedrini, A. D. R. Rocha, Using visual rhythms for detecting video-based facial spoof attacks, IEEE Transactions on Information Forensics and Security 10 (5) (2015) 1025–1038.
  • [7] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, A. Hadid, An original face anti-spoofing approach using partial convolutional neural network, in: Image Processing Theory Tools and Applications (IPTA), 2016 6th International Conference on, IEEE, 2016, pp. 1–6.
  • [8] J. Komulainen, A. Hadid, M. Pietikainen, Context based face anti-spoofing, in: Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on, IEEE, 2013, pp. 1–8.
  • [9] E. Tzeng, J. Hoffman, T. Darrell, K. Saenko, Simultaneous deep transfer across domains and tasks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4068–4076.
  • [10] Z. Zhang, J. Yan, S. Liu, Z. Lei, A face antispoofing database with diverse attacks, in: Iapr International Conference on Biometrics, 2012, pp. 26–31.
  • [11] D. Wen, H. Han, A. Jain, Face Spoof Detection with Image Distortion Analysis, IEEE Trans. Information Forensic and Security 10 (4) (2015) 746–761.
  • [12] J. Li, Y. Wang, T. Tan, A. K. Jain, Live face detection based on the analysis of fourier spectra, in: Defense and Security, International Society for Optics and Photonics, 2004, pp. 296–303.
  • [13] J. Määttä, A. Hadid, M. Pietikäinen, Face spoofing detection from single images using micro-texture analysis, in: Biometrics (IJCB), 2011 international joint conference on, IEEE, 2011, pp. 1–7.
  • [14] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35 (8) (2013) 1798–1828.
  • [15] A. Alotaibi, A. Mahmood, Deep face liveness detection based on nonlinear diffusion using convolution neural network, Signal, Image and Video Processing 4 (11) (2016) 713–720.
  • [16] K. Patel, H. Han, A. K. Jain, Cross-database face antispoofing with robust feature representation., in: CCBR, 2016, pp. 611–619.
  • [17] E. Valle, R. Lotufo, Transfer learning using convolutional neural networks for face anti-spoofing, in: Image Analysis and Recognition: 14th International Conference, ICIAR 2017, Montreal, QC, Canada, July 5–7, 2017, Proceedings, Vol. 10317, Springer, 2017, p. 27.
  • [18] R. Gopalan, R. Li, R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 999–1006.
  • [19] B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in: Proceedings of the IEEE international conference on computer vision, 2013, pp. 2960–2967.
  • [20] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, T. Darrell, Deep domain confusion: Maximizing for domain invariance, arXiv preprint arXiv:1412.3474.
  • [21] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks, Journal of Machine Learning Research 17 (59) (2016) 1–35.
  • [22] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal deep learning, in: Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696.
  • [23] M. Ghifary, W. B. Kleijn, M. Zhang, Domain adaptive neural networks for object recognition, arXiv preprint arXiv:1409.6041.
  • [24] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, S. Yan, Deep domain adaptation for describing people based on fine-grained clothing attributes, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5315–5324.
  • [25] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on pattern analysis and machine intelligence 25 (5) (2003) 564–577.
  • [26] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: European conference on computer vision, Springer, 2014, pp. 1–16.
  • [27] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, A. Gretton, Generative models and model criticism via optimized maximum mean discrepancy, arXiv preprint arXiv:1611.04488.
  • [28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test, Journal of Machine Learning Research 13 (Mar) (2012) 723–773.
  • [29] K. Fukumizu, A. Gretton, X. Sun, B. Schölkopf, Kernel measures of conditional dependence, in: Advances in neural information processing systems, 2008, pp. 489–496.
  • [30] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, G. R. Lanckriet, Hilbert space embeddings and metrics on probability measures, Journal of Machine Learning Research 11 (Apr) (2010) 1517–1561.
  • [31] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • [32] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [33] Z. Xu, S. Li, W. Deng, Learning temporal features using lstm-cnn architecture for face anti-spoofing, in: Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on, IEEE, 2015, pp. 141–145.