Smartphones with facial authentication features have made biometric systems remarkably common in our everyday lives. This is in addition to less prevalent and more traditional, yet critical, applications of biometric systems, such as automatic passport control and access control to high-security facilities. As services based on biometric recognition technologies gain popularity, presentation attack detection (PAD) is becoming a more crucial requirement for these systems. In parallel, attackers continually attempt to gain unauthorized access by designing new attack types, which makes developing defense mechanisms against presentation attacks more challenging. The goal of a PAD algorithm is to classify whether the input presented to the system is a bona-fide presentation (BF) or a presentation attack (PA), so that access is denied for PAs. While recognition systems for all biometric modalities, such as fingerprint and iris, are vulnerable to presentation attacks, the face modality poses a higher risk and extra challenges due to the easy access to high-resolution face images of most people, e.g., through social media, and due to the relatively easier fabrication of face PAs. In this paper, we focus exclusively on face PAD.
Similar to most sub-fields in computer vision, advances in deep learning, which is inspired by the nervous system, have led to significant face PAD improvements on benchmark datasets using convolutional neural network (CNN)-based end-to-end representation learning and classification [2, 13, 18, 27, 50, 53, 47, 9, 52]. Following the standard pipeline of supervised learning for deep learning, a large, labeled training dataset, which consists of known attacks and bona-fide data points, is collected and then used to train a deep network with a suitable architecture [16, 23, 11].
The vulnerability of using the aforementioned pipeline for PAD is that attackers can continually generate new types of PAs that are unknown to the system, i.e., absent in the training dataset. Since deep networks suffer from overconfidence in their predictions, the system may not be able to identify novel attack types that are generated at inference time after the initial training. Even if the unknown attack types can be identified, the standard deep learning pipeline necessitates collecting a sufficient number of samples of the new attack types and augmenting the training dataset. The model then needs to be retrained from scratch (or fine-tuned) on the augmented dataset. However, collecting labeled data is time consuming, model retraining is computationally inefficient, and both usually involve human intervention. As a result, it is highly desirable to enable biometric recognition systems to identify novel attack types on-the-fly, during deployment, and then to autonomously adapt their classification models to recognize these new attack types in the future.
We develop an algorithm for continual detection of emerging novel types of face PAs. Our objective is to enable the system to identify novel PAs. The model is then updated to learn the new attack types such that it does not forget the previously learned attack types. Our idea is based on enabling the network to identify new attack types as testing samples that are outside the training distribution samples (OTDS) in an embedding space [17, 38, 51, 25, 9]. The base model is then updated to classify these samples as new attack types in a continual learning (CL) setting, where catastrophic forgetting is addressed using experience replay. Despite being effective in CL settings, the idea of detecting OTDS has not been explored for PAD.
The main contributions of our work are as follows:
A new formulation of face PAD as a continual learning problem to equip a PAD system with defense mechanisms that allow learning novel attack types continually.
An algorithm to identify novel attack types as OTDS anomalies by continually screening the input data representations, and to enable the model to correctly classify them as attacks in the future via experience replay.
A new face anti-spoofing dataset with diverse attack types to evaluate our algorithm in CL settings.
2 Related Work
Our work straddles the intersection of two topics: detection of novel face PAs and continual learning.
Novel Presentation-Attack Detection: Novel class detection has been studied within several learning settings. The learning setting that we explore is most closely related to the zero-shot learning (ZSL) formulation [15, 48, 49, 34]. ZSL has been studied extensively, but works on ZSL for PAD have been quite limited. Most ZSL works identify novel classes using the standard semantic-based idea for describing a sample. In these works, it is assumed that the semantic description of a novel class is accessible a priori. Novel classes can then be identified by establishing relationships between known and unknown classes through their semantic descriptions. Note, however, that coming up with accurate semantic descriptions for PAs is challenging.
To relax the need for knowing the semantic descriptions a priori, Shao et al. proposed to learn an embedding space that is discriminative across several source domains to improve generalizability of the PAD model on novel PAs in new domains. Liu et al. used the similarity between an unknown type of attack and known attack types for zero-shot attack detection. Both approaches analyze data representations in an embedding space and identify new attack types as unfamiliar data points. Our work follows a similar strategy, where novel attack data points are identified as OTDS. We then use the collected OTDS to expand the generalizability of the base model on the identified novel attack types.
Continual Learning: Upon detecting novel attack data points, the base model needs to be continually updated to gain the ability to recognize the newly identified attack types in the future. Recent works on CL for deep neural networks have mostly focused on tackling catastrophic forgetting. Catastrophic forgetting occurs when a deep neural network is updated to learn drifts in the data distribution in a CL setting, which leads to underperformance on the previously learned tasks.
Several strategies have been proposed to mitigate catastrophic forgetting. One group of methods is based on regularizing the deep network weight parameters. The idea is to identify the network weights that are important for decent performance on past learned tasks, consolidate these weights according to their importance, and learn new tasks using the remaining weights that are unimportant for remembering past learned tasks. The main challenge is identifying the important weights and minimizing the negative effects of weight consolidation on the network learning capacity. A second group of works is based on the notion of experience replay, where the end-to-end training mechanism of deep networks is changed, rather than the network itself. The idea is to replay data points of past learned tasks along with the current task data to update the model through the pseudo-rehearsal process, i.e., retraining the model jointly on the past and the current data. Since training the network on full datasets is computationally expensive and the storage capacity is limited, only a subset of the training data for the past learned tasks should be used. These samples are stored in a memory buffer of fixed maximum capacity. The main challenge is how to select these samples. For example, Schaul et al. select samples that were uncommon and led to maximum learning effects in past tasks. To alleviate the need for a memory buffer, an alternative approach is to use generative models. The idea is to enable the model to generate pseudo-data points that are similar to the data points of the past tasks and use the pseudo-data points for pseudo-rehearsal [42, 37, 36]. We rely on memory-based experience replay to update the PAD model in our work.
3 Problem Statement
Consider a PAD task with initial labeled training data $\mathcal{D}^0 = (\mathbf{X}^0, \mathbf{Y}^0)$, where $\mathbf{X}^0 = [\mathbf{x}_1^0, \ldots, \mathbf{x}_N^0]$ is the collection of BF data points and a number of fixed known PA instances, and $\mathbf{Y}^0 = [\mathbf{y}_1^0, \ldots, \mathbf{y}_N^0]$ is the corresponding one-hot binary labels of the PAs and BFs. The training data points are assumed to be independent and identically distributed (iid) and are drawn from an unknown probability distribution, $p^0(\mathbf{x})$.
To solve the initial supervised PAD task, we select a parameterized family of functions $f_\theta(\cdot)$ with learnable parameters $\theta$. We then search for the optimal model using empirical risk minimization (ERM):
$$\hat{\theta}^0 = \arg\min_{\theta} \hat{e}_\theta = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(\mathbf{x}_i^0), \mathbf{y}_i^0\big), \tag{1}$$
which serves as a surrogate for the true risk $e_\theta = \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim p^0}\big[\mathcal{L}(f_\theta(\mathbf{x}), \mathbf{y})\big]$. Here, $\mathcal{L}(\cdot)$ is a proper discrimination loss function such as cross-entropy and $\mathbb{E}(\cdot)$ denotes the probability expectation operator. Upon training the base deep network model on the dataset, the PAD system is fielded for testing. We have depicted this base model as the PAD Module in Figure 1. If $N$ is large enough, the selected deep network structure is suitable, and the data observed during testing are drawn from the training distribution, then the model will generalize well during execution according to the theoretical guarantees of the PAC-learning framework. However, if new types of attacks are introduced after the initial training or any drift in the input distribution occurs, poor model generalization is expected. In other words, the model may fail to identify new attack types and misclassify them as bona-fide samples.
To allow for a robust and adaptive PAD system, we extend the standard one-shot training/testing formulation of PAD to a continual learning setting. To this end, we consider that, after the initial training phase, the system encounters PAD tasks $\mathcal{T}^t$, $t = 1, 2, \ldots$, that arrive sequentially during execution and must be addressed at inference time. Each task is specified by an unlabeled training dataset $\mathcal{D}^t = \{\mathbf{x}_i^t\}_{i=1}^{N_t}$, built from the input data points observed over a fixed time period, e.g., a day. The unlabeled dataset for subsequent tasks may contain new attack types that were not present when the previous tasks were learned, i.e., the tasks may have different distributions, $p^t(\mathbf{x}) \neq p^{t-1}(\mathbf{x})$. This means that we need to equip the PAD module with a mechanism such that the system can identify instances of unknown types of attacks in the dataset at each time-step and then update the model to learn them (see Figure 1).
If labeled data for BFs and all past learned PAs is accessible, expanding the model to learn each attack type would be a standard supervised learning problem similar to Eq. (1). We just need to augment the dataset with the detected instances of novel PA types and then retrain the base model. However, this would require a memory buffer with unlimited size to store the growing number of observed attack types. Retraining the model continually from scratch can also become computationally expensive and time-consuming.
As a solution, our goal is to update the model by incorporating the new PAs into the system’s knowledge by replaying only a subset of training data which are stored in a replay buffer as representative samples (see Figure 1). After updating the model, the system proceeds by learning the subsequent tasks through an iterative procedure. The major challenge of model updating is that, since the past learned attack types may always be encountered, the system must expand its ability to recognize the identified novel attack types such that it maintains the ability to recognize the past learned tasks. This means that the stored samples in the replay buffer need to be such that they can encode the information required to retain the past tasks knowledge. A high-level block-diagram visualization of our continual learning framework for PAD is provided in Figure 1.
4 Proposed Method
To solve the challenges of novel attack detection and model updating in a CL setting, we continually screen data representations in a discriminative embedding space, which is modeled as the output of a deep encoder. We assume that the deep network can be decomposed into an encoder subnetwork $\phi_v(\cdot): \mathcal{X} \rightarrow \mathcal{Z}$ with learnable parameters $v$, e.g., the convolutional layers of a CNN, and a classifier subnetwork $h_w(\cdot): \mathcal{Z} \rightarrow \mathcal{Y}$ with learnable parameters $w$, e.g., a sequence of fully connected layers, so that the full model is the composition $h_w \circ \phi_v$. Here, $\mathcal{Z}$ is the discriminative embedding space in which the input data points become separable after performing supervised learning. In the case of a deep neural network with good generalization performance, the embedding space should be discriminative and the data representations would form a bimodal distribution, similar to the visualization presented in Figure 2.
Figure 2 illustrates that the input data distribution is transformed into a bimodal distribution in the embedding space by the encoder subnetwork after learning a PAD task. PAs and BFs each form one mode of this distribution. A decision boundary between these two modes is learned by the classifier subnetwork to classify the input images in the future. The farther a data point lies from the learned decision boundary in the embedding space, the more confident the classifier subnetwork becomes about its prediction. The overconfident area in the embedding space on the BF side of the decision boundary is a major vulnerability of the PAD model (i.e., high-confidence false negatives). If a novel attack is designed such that it lies in this overconfident region, the model would fail to identify it. Our goal is to make the model robust and stable against this type of attack by screening the embedding space, using the intuition above.
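As an illustration only, the relation between distance from the decision boundary and confidence can be reproduced with a toy classifier. The sketch below assumes a linear boundary with a sigmoid output in a two-dimensional embedding; the weights `w`, bias `b`, and function name are hypothetical and are not taken from the model used in this paper.

```python
import math

def confidence(z, w, b):
    """Return the classifier's confidence in its predicted class.

    Assumes (for illustration) a linear boundary w.z + b = 0 with a sigmoid
    output; positive margins mean 'attack', negative margins mean 'bona-fide'.
    """
    margin = sum(wi * zi for wi, zi in zip(w, z)) + b
    p_attack = 1.0 / (1.0 + math.exp(-margin))
    return max(p_attack, 1.0 - p_attack)

# Hypothetical boundary: the vertical line z[0] = 0.
w, b = [1.0, 0.0], 0.0
near = confidence([-0.1, 0.0], w, b)  # just on the bona-fide side
far = confidence([-5.0, 0.0], w, b)   # deep inside the bona-fide side
```

A novel attack whose representation lands at the `far` point would be labeled bona-fide with very high confidence, which is the failure mode that the screening mechanism is meant to address.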
4.1 Novel Attack Detection
To tackle the vulnerability of PAD systems in the overconfident regions, we need to suppress the confidence of the model in those regions. To this end, we fit a parametric distribution to model the learned bimodal distribution in the embedding space. Our idea is based on expanding the base classifier subnetwork to classify the data points into three classes, namely BFs, PAs, and OTDS (see Figure 3). The intuition behind this idea is that novel PA instances are expected to be different from the training data in the embedding space. This means that we can identify them if the input lies outside the components of the bimodal distribution fitted on the embedding. Hence, if we can generate samples that lie outside this distribution, i.e., intuitively the gray region in Figure 3, we can augment the training data with the samples from this region and retrain the classifier subnetwork. As a result, the system will be capable of identifying OTDS data points during execution.
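To implement the above rationale, we need to estimate the distribution of the data representations in the embedding space at time-step $t$ before moving forward to start learning at $t+1$. The empirical version of the learned training distribution at time-step $t$ is encoded by the training data representations in the embedding space, $\{\phi_v(\mathbf{x}_i^t)\}_{i}$, where $\phi_v$ denotes the encoder subnetwork. (Footnote: We have used a slight abuse of notation. We have assumed that $\mathcal{D}^t$ denotes all the samples that are accessible for training at time $t$. As we will see, these labeled samples consist of novel attack types that have been detected at time $t$, combined with the samples that are selected and stored in the replay buffer from the previous model update at time $t-1$.) Inspired by prototypical networks, we model this distribution as a Gaussian mixture model (GMM):
$$\hat{p}^t(\mathbf{z}) = \sum_{j=1}^{2} \alpha_j^t \, \mathcal{N}(\mathbf{z} \mid \mu_j^t, \Sigma_j^t), \tag{2}$$
where $\alpha_j^t$ denotes the weight of each data mode, i.e., the prior probability for BFs and PAs, $\mathcal{N}(\mathbf{z} \mid \mu_j^t, \Sigma_j^t)$ is the empirical class-conditional probability distribution, and $\mu_j^t$ and $\Sigma_j^t$ denote the mean and the covariance matrix of each component. Estimating GMM parameters typically requires an iterative algorithm such as expectation maximization [22], which can be a computationally expensive procedure. However, since we have access to the labels of the data points, we can decouple the GMM components and compute the GMM parameters for each component independently, using MAP estimates. Consider $\mathcal{S}_j^t$ to be the support set for BFs ($j=1$) and PAs ($j=2$) in the training dataset, i.e., $\mathcal{S}_j^t = \{(\mathbf{x}_i^t, \mathbf{y}_i^t) \in \mathcal{D}^t \mid \arg\max \mathbf{y}_i^t = j\}$. Then, we can simply estimate the GMM parameters as:
$$\hat{\alpha}_j^t = \frac{|\mathcal{S}_j^t|}{N}, \qquad \hat{\mu}_j^t = \frac{1}{|\mathcal{S}_j^t|} \sum_{(\mathbf{x}_i^t, \mathbf{y}_i^t) \in \mathcal{S}_j^t} \phi_v(\mathbf{x}_i^t), \qquad \hat{\Sigma}_j^t = \frac{1}{|\mathcal{S}_j^t|} \sum_{(\mathbf{x}_i^t, \mathbf{y}_i^t) \in \mathcal{S}_j^t} \big(\phi_v(\mathbf{x}_i^t) - \hat{\mu}_j^t\big)\big(\phi_v(\mathbf{x}_i^t) - \hat{\mu}_j^t\big)^\top. \tag{3}$$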
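The decoupled, label-based estimation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name, the integer label encoding (0 for bona-fide, 1 for attack), and the 1/|S_j| covariance normalization are choices made here for the example.

```python
import numpy as np

def fit_class_gaussians(z, y):
    """Estimate one Gaussian component per class from labeled embeddings.

    z: (N, d) array of embedding vectors.
    y: (N,) array of integer labels (assumed: 0 = bona-fide, 1 = attack).
    Returns {label: (alpha, mu, sigma)}.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    params = {}
    for label in np.unique(y):
        support = z[y == label]            # support set of this class
        alpha = len(support) / len(z)      # mixture weight |S_j| / N
        mu = support.mean(axis=0)          # component mean
        centered = support - mu
        sigma = centered.T @ centered / len(support)  # component covariance
        params[int(label)] = (alpha, mu, sigma)
    return params

# Toy 2-D embeddings: three bona-fide points and one attack point.
z_toy = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [10.0, 10.0]])
y_toy = np.array([0, 0, 0, 1])
gmm = fit_class_gaussians(z_toy, y_toy)
```

Because the labels assign every point to exactly one component, each component is fitted in a single pass and no iterative optimization is needed.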
We rely on the prototypical distributional estimate to generate samples that are outside the training distribution. We draw random samples from the GMM distribution such that the samples lie in the overconfident region (see Figure 3). For this purpose, we draw random samples from the standard multidimensional Gaussian distribution and then map them through an affine transformation built from the mean and the square root of the covariance matrix of the GMM component. It is easy to check that these samples are distributed according to the Gaussian distribution of the GMM component. Since we have drawn them to be twice the root of the covariance matrix away from the mean, it is more likely that they lie outside the data cluster, as presented in Figure 3. We use this sampling strategy to generate data points from the overconfident region to expand our model.
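One possible reading of this sampling step is sketched below. The exact transformation is an assumption here: standard Gaussian noise is multiplied by `scale` times a symmetric square root of the component covariance and shifted by the component mean, with `scale=2.0` mirroring the "twice the root of the covariance" description. The function name and the `seed` parameter are hypothetical.

```python
import numpy as np

def sample_around_component(mu, sigma, n_samples, scale=2.0, seed=0):
    """Draw candidate samples spread widely around one Gaussian component.

    Assumed transformation: z = mu + scale * sqrt(sigma) @ n, with n ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    # Symmetric square root of the covariance via eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(np.asarray(sigma, dtype=float))
    root = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
    noise = rng.standard_normal((n_samples, mu.shape[0]))  # n ~ N(0, I)
    return mu + scale * noise @ root.T

candidates = sample_around_component([1.0, -1.0], np.eye(2), n_samples=5)
```

With `scale` greater than one, the candidates are more dispersed than the training representations of the component, so a larger share of them falls outside the data cluster; candidates that still land near the mean have to be filtered out afterwards.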
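Consider that we generate $M$ samples for the component $j$ as $\{\hat{\mathbf{z}}_i^j\}_{i=1}^{M}$. We fix a probability threshold $\tau$ for the GMM component and then build a pseudo-dataset:
$$\hat{\mathcal{D}}_{\text{OTDS}}^t = \big\{ (\hat{\mathbf{z}}_i^j, \mathbf{y}_{\text{OTDS}}) \;\big|\; \mathcal{N}(\hat{\mathbf{z}}_i^j \mid \hat{\mu}_j^t, \hat{\Sigma}_j^t) < \tau \big\}, \tag{4}$$
where $\mathbf{y}_{\text{OTDS}}$ is the label of the third (OTDS) class. In Eq. (4), by using the membership probability predicted by the GMM, we exclude all the generated samples that are close to the means of the GMM components, since these samples are inside the distribution. We then build the augmented dataset $\mathcal{D}^t \cup \hat{\mathcal{D}}_{\text{OTDS}}^t$ for training the ternary classifier and retrain the expanded classifier subnetwork. Note that $\mathcal{D}^t$ denotes the original binary classification dataset of BFs/PAs on which the network was initially trained.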
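The filtering step can be sketched as follows, assuming that the membership probability is the Gaussian density of the component and that candidates with density below the threshold are kept as OTDS pseudo-samples. The names `gaussian_density`, `keep_low_density`, and `tau` are illustrative and hypothetical.

```python
import numpy as np

def gaussian_density(z, mu, sigma):
    """Evaluate the multivariate Gaussian density N(z | mu, sigma) row-wise."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    diff = z - mu
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    norm = np.sqrt((2.0 * np.pi) ** mu.shape[0] * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

def keep_low_density(candidates, mu, sigma, tau):
    """Keep only candidates whose density under the component is below tau."""
    candidates = np.asarray(candidates, dtype=float)
    return candidates[gaussian_density(candidates, mu, sigma) < tau]

cands = [[0.0, 0.0], [3.0, 0.0], [0.5, 0.5]]
otds = keep_low_density(cands, mu=[0.0, 0.0], sigma=np.eye(2), tau=0.01)
```

In this toy case only the point three standard deviations from the mean survives; it would be labeled with the third class and appended to the binary training set before retraining the classifier subnetwork.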
As a result of the above process, when the system proceeds to time-step $t+1$ and samples of the dataset $\mathcal{D}^{t+1}$ are encountered during the model execution, the system is able to identify OTDS samples at the third output of the classifier. Let $\mathcal{D}_n^{t+1}$ denote the OTDS samples in the dataset $\mathcal{D}^{t+1}$. We can consider them to be in the attack class. If we retrain the model on the concatenated dataset $\mathcal{D}^t \cup \mathcal{D}_n^{t+1}$, the model would generalize well on the novel attack types. However, this requires storing all observed samples. In the following section, we describe a more efficient approach.
4.2 Experience Replay for Continual Learning
To update the model after forming the set of detected novel samples $\mathcal{D}_n^t$ at time $t$, we perform experience replay by relying on a replay memory buffer that stores a subset of the observed data after learning each task and before starting to learn a subsequent task. Let $\mathcal{D}_b$ denote the data points stored in the memory buffer (see Figure 1). At each batch of optimization, we include samples from both $\mathcal{D}_n^t$ and $\mathcal{D}_b$ in the data batch to update the model. As a result, the model learns to identify novel attacks while retaining the learned knowledge about the past tasks. The only remaining challenge in our framework is a strategy for selecting the samples to be stored in the buffer.
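The batch construction can be illustrated with a small generator. This is a sketch under assumptions: the even split between novel and replayed samples, the function name `mixed_batches`, and the `seed` parameter are choices made for the example and are not specified in the text.

```python
import random

def mixed_batches(novel, buffer, batch_size, seed=0):
    """Yield batches that combine novel samples with replayed buffer samples.

    Assumption: about half of each batch is novel data; the rest is drawn
    at random (without replacement within a batch) from the replay buffer.
    """
    rng = random.Random(seed)
    novel = list(novel)
    buffer = list(buffer)
    rng.shuffle(novel)
    half = max(1, batch_size // 2)
    for start in range(0, len(novel), half):
        chunk = novel[start:start + half]
        n_replay = min(batch_size - len(chunk), len(buffer))
        yield chunk + rng.sample(buffer, n_replay)

batches = list(mixed_batches(["n0", "n1", "n2", "n3"], ["b0", "b1", "b2"], batch_size=4))
```

Every novel sample appears exactly once per pass, while buffer samples are revisited across batches, so the past tasks contribute to every gradient step.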
A simple selection strategy is to randomly select samples for each of the BF and the PA classes to store in the memory buffer. Multiple strategies have been used in the CL literature to improve upon this baseline sampling strategy, including mean of features (MoF), ring buffer, and reservoir sampling. Since we learn the prototypical distribution as a GMM, we can also rely on a strategy similar to MoF. After training the model in the binary classification setting and fitting the GMM, we can compute the distance of all BFs and PAs from their corresponding Gaussian component's mean as $d_i^j = \|\phi_v(\mathbf{x}_i) - \hat{\mu}_j\|_2$, where $\phi_v$ denotes the encoder subnetwork and $\hat{\mu}_j$ the mean of component $j$. We sort these distances, for each class separately, and given the per-class memory budget $M_b$, we store the $M_b$ samples that are closest to the cluster means. Note that, as opposed to a normal CL setting, the labels are predicted for novel PAs in our setting, for $t \geq 1$. Hence, it is more likely that labels for samples close to the means are predicted correctly. However, information about higher moments of the distribution is lost when these samples are used for pseudo-rehearsal. As a result, the model prediction accuracy may degrade in the area close to the boundary of the classes in the future.
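The closest-to-mean selection can be sketched as below. The Euclidean distance, the function name, and the tie-breaking by original index are assumptions of this example rather than details given in the text.

```python
import numpy as np

def select_buffer_indices(z, y, budget_per_class):
    """Pick, per class, the samples whose embeddings are closest to the class mean.

    z: (N, d) embeddings; y: (N,) integer class labels.
    Returns the sorted indices of the samples to store in the buffer.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        mu = z[idx].mean(axis=0)
        dist = np.linalg.norm(z[idx] - mu, axis=1)   # distance to the class mean
        order = np.argsort(dist, kind="stable")      # ties keep original order
        chosen.extend(idx[order[:budget_per_class]].tolist())
    return sorted(chosen)

z_demo = [[0.0], [1.0], [5.0], [10.0], [11.0], [12.0]]
y_demo = [0, 0, 0, 1, 1, 1]
kept = select_buffer_indices(z_demo, y_demo, budget_per_class=2)
```

In the toy data the outlying sample at 5.0 is dropped from the first class, which reflects the trade-off noted above: central samples are more likely to carry correct predicted labels, but they say little about the spread of the class.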
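Given the samples stored in the buffer $\mathcal{D}_b$ and the novel samples $\mathcal{D}_n^t$ identified at time $t$, we solve the following pseudo-rehearsal problem for model updating:
$$\min_{v, w} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}_n^t} \mathcal{L}\big(h_w(\phi_v(\mathbf{x})), \mathbf{y}\big) + \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}_b} \mathcal{L}\big(h_w(\phi_v(\mathbf{x})), \mathbf{y}\big) + \lambda \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}_b} \big\| \phi_v(\mathbf{x}) - \phi_{v^{t-1}}(\mathbf{x}) \big\|_2^2, \tag{5}$$
where $\lambda$ is a trade-off parameter, $\phi_v$ and $h_w$ denote the encoder and the classifier subnetworks, and $v^{t-1}$ denotes the encoder parameters before the update. The first and the second terms in Eq. (5) are simply the supervised loss terms for the identified novel samples and the samples stored in the memory buffer, respectively. The third term is added to update the encoder subnetwork consistently with the past experiences. This term enforces the samples in the memory buffer to be mapped to the proximity of the same location in the embedding space after updating the model, to preserve past learned features. This term can be thought of as a regularization term that further mitigates catastrophic forgetting in addition to pseudo-rehearsal. Our algorithm, called Novel presentation Attack detection in Continual Learning (NACL), is described in Algorithm 1.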
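The three terms described above (supervised loss on the novel samples, supervised loss on the buffer samples, and an embedding-consistency penalty on the buffer samples) can be evaluated as in the following sketch. It operates on precomputed class probabilities and embeddings rather than on a network, and the batch averaging, the squared Euclidean penalty, and all names (`replay_objective`, `lam`) are assumptions of this example.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy given predicted class probabilities and integer labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked, 1e-12, 1.0))))

def replay_objective(p_new, y_new, p_buf, y_buf, z_buf_now, z_buf_old, lam):
    """Novel-sample loss + buffer loss + lam * embedding drift on buffer samples."""
    drift = np.asarray(z_buf_now, dtype=float) - np.asarray(z_buf_old, dtype=float)
    consistency = float(np.mean(np.sum(drift ** 2, axis=1)))
    return cross_entropy(p_new, y_new) + cross_entropy(p_buf, y_buf) + lam * consistency

value = replay_objective(
    p_new=[[0.5, 0.5]], y_new=[1],          # uncertain prediction on a novel attack
    p_buf=[[1.0, 0.0]], y_buf=[0],          # perfect prediction on a buffer sample
    z_buf_now=[[1.0, 0.0]], z_buf_old=[[0.0, 0.0]],  # buffer embedding moved by 1
    lam=0.5,
)
```

Setting `lam` to zero recovers plain experience replay; larger values penalize movement of the buffer samples in the embedding space more strongly.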
5 PADISI-Face Dataset
To validate our algorithm in a meaningful setup, we need PAD datasets with a diverse set of PAs, but such datasets are scarce in the literature. A secondary, yet important, contribution of our work is the introduction of the Face Presentation Attack Detection from Information Sciences Institute (PADISI-Face) dataset, which includes various major face spoofing attack types. To the best of our knowledge, the only other comparable dataset that is accessible at the moment is the recently released HQ-WMCA face anti-spoofing dataset. (Footnote: The Spoof in the Wild with Multiple Attacks Database (SiW-M) face anti-spoofing dataset is another existing dataset with various PA types, similar to PADISI-Face. However, SiW-M is temporarily inaccessible. Hence, PADISI-Face can serve as a possible substitute for SiW-M. The PADISI-Face dataset is publicly available at https://github.com/ISICV/PADISI_USC_Dataset.) In the PADISI-Face dataset, each capture consists of a multi-frame sequence of images. The PADISI-Face dataset contains a variety of spoofing attacks comparable to that of HQ-WMCA. Table 1 presents statistics of the collected dataset as well as those of HQ-WMCA, for comparison. Figure 4 visualizes instances of all the attack types in the dataset. For comprehensive details on the PADISI-Face dataset and its characteristics, please refer to the Appendix.
6 Experimental Validation
For our experiments, we adapt suitable benchmark datasets and build incremental PA detection tasks. Given a dataset with several classes, we assume that the base network is initially trained on a subset of attack types and bona-fide samples. The remaining attack types are observed in a set of sequentially arriving tasks. During each task, new attack types are detected and the model is updated to learn them.
6.1 Experimental Setup
Datasets: We perform experiments using the HQ-WMCA and the new PADISI-Face datasets, which are suitable for our learning setting. The provided unknown attack protocols of these datasets contain only unknown attack types in the testing set and are not suitable for the CL setting. As such, we used the Grandtest protocol of HQ-WMCA to first divide the samples into a training and a testing set. This protocol places a portion of the samples in the test set, with proportional division of each attack type between the training and testing sets, while ensuring that BF samples in the two sets are participant disjoint. For the PADISI-Face dataset, we followed the same division scheme. For both datasets, the CL tasks are constructed using the training set and evaluation is performed on the test set.
Baselines for comparison: Since no prior method in the literature addresses the continual PAD setting explored in this work, we compare the proposed method against three baselines: static training (ST), joint training (JT), and full replay (FR). In the ST setting, we report the performance of the base model after initial training, without further updating when new attack types in the dataset are encountered. This setting represents the performance of existing PAD algorithms when novel PAs are observed and serves as a lower bound. Improvement over this baseline demonstrates the relative effectiveness of our approach. In the JT setting, we train the model on the whole labeled training dataset, including all attack types, in the initial training. This setting serves as an upper bound which assumes that all attack types are known a priori. FR is a variation of Algorithm 1 in which we assume that the memory budget is unlimited. As a result, we can save and replay all the stored data points in the buffer. We also report the performance of NACL when random sampling (RS) is used to select the buffer samples. In the RS setting, we store randomly selected samples in the memory buffer. The comparison with RS is performed to investigate the effect of using the proposed sample selection technique. For a fair comparison, we use the same buffer size for both the RS and NACL methods. We set the buffer size to a fixed size of 100 samples, filled evenly with BF and PA samples.
Evaluation protocol: We evaluate the performance of all algorithms using the three standard PAD performance metrics: Attack Presentation Classification Error Rate (APCER), Bona-Fide Presentation Classification Error Rate (BPCER), and Average Classification Error Rate (ACER). As opposed to the common PAD evaluation setting, in which evaluation is performed once after training on the full dataset and only a single number is reported, we generate learning curves that report the PAD performance versus time during execution, so that our evaluation captures the learning dynamics. In our experiments, we use the original index-order of the classes for each dataset as the order in which the attacks are encountered. At each time-step $t$, we compute the performance of the model on the testing set after the corresponding task is learned and before proceeding to learn the next task. We report the average performance over multiple randomly initialized runs.
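For reference, the three metrics can be computed from binary decisions as in the sketch below. It is a simplified illustration: APCER is pooled over all attack types rather than computed per attack type, and the label encoding (1 for attack, 0 for bona-fide) and the function name are assumptions.

```python
def pad_error_rates(y_true, y_pred):
    """Return (APCER, BPCER, ACER) for binary PAD decisions.

    Assumed encoding: 1 = presentation attack, 0 = bona-fide.
    APCER: fraction of attacks accepted as bona-fide.
    BPCER: fraction of bona-fide presentations rejected as attacks.
    ACER:  average of the two error rates.
    """
    attacks = [p for t, p in zip(y_true, y_pred) if t == 1]
    bona_fide = [p for t, p in zip(y_true, y_pred) if t == 0]
    apcer = sum(p == 0 for p in attacks) / len(attacks)
    bpcer = sum(p == 1 for p in bona_fide) / len(bona_fide)
    return apcer, bpcer, (apcer + bpcer) / 2

# Four attacks (one missed) and two bona-fide presentations (one rejected).
apcer, bpcer, acer = pad_error_rates([1, 1, 1, 1, 0, 0], [1, 0, 1, 1, 0, 1])
```

Evaluating this function on the fixed test split after every model update yields one point of the learning curve per time-step.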
For details of the experimental setup, including the network structure, hyper-parameter values, optimization parameters, and our implementation, please see the Appendix.
6.2 Results
Similar to most works in the CL literature, there is a boundary between two subsequent tasks in our formulation. This boundary corresponds to the instants at which the model is updated after a period of data collection. During each task, i.e., each period in which the model is not updated, the system may encounter more than one attack type. We consider two sets of experiments for a thorough validation.
First, we consider that the initial training task in our experiments consists of training on bona-fide samples and only the first type of PA, according to the index used in the dataset (see the order of attack types in the Appendix). Each subsequent task is constructed by introducing one novel attack type. We report the performance of our algorithm and the baselines in Figure 5(a). At each time-step, we report the model performance on the full testing split of the datasets. We use (1-APCER), (1-BPCER), and (1-ACER) for visualization because learning curves are usually perceived to be increasing functions. Since the testing split is fixed, successful learning corresponds to rising learning curves. For a quantitative comparison, we have included the numerical values of the metrics in tabular format in the Appendix. By inspecting Figure 5(a), we observe, as expected, that ST is highly vulnerable to novel attack types, leading to high values for the APCER and ACER metrics. Note that the good BPCER performance of ST, i.e., a high value of (1-BPCER), is expected but is not sufficient. This baseline demonstrates the vulnerability of current PAD systems when novel attacks are encountered and justifies the necessity of developing algorithms for PAD in CL settings. When we use the designed novel attack detection mechanism, we can clearly see that performance improves significantly towards the JT upper bound as more attacks are identified and learned. Performance degradation in terms of the BPCER metric is expected due to the occurrence of catastrophic forgetting, but we see that improvements in APCER outweigh this degradation (see the ACER plots). Note that RS, FR, and NACL are all equipped with the proposed mechanism and their major difference is in the implementation of the experience replay procedure. We do not see a clear winner between these methods across all metrics, but note that NACL and RS store significantly less data in the memory than FR (only 100 samples).
We also note that in the majority of the time-steps NACL outperforms RS. We conclude that experience replay is an effective approach to address catastrophic forgetting.
A result that initially looks counter-intuitive is that, as opposed to the CL literature, FR does not clearly outperform NACL, despite storing and replaying all samples. However, note that in all of the RS, FR, and NACL methods, the labels predicted by the model (not the ground truth) are used in the retraining process. Therefore, FR can be more prone to label pollution, because all samples are stored, leading to performance degradation over time. To verify this intuition, in Table 2, we provide a comparison of the percentage of polluted labels (stored in the buffer and used for retraining) between the FR, RS, and NACL methods at each learning time-step for the tasks of the PADISI-Face dataset. As observed, FR indeed faces the challenge of label pollution, leading to performance degradation values similar to RS and NACL. We also observe that label pollution is lower for NACL at the initial time-steps, which may explain why the learning curves in Figure 5(a) saturate at later time-steps. This observation suggests that, as opposed to the normal situation in CL, FR is not necessarily a better option for experience replay even when there is no memory budget limit, due to label pollution.
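The pollution percentage compared in this discussion can be computed as the share of stored samples whose predicted label disagrees with the ground truth. The sketch below is illustrative; the function name and the convention of returning 0.0 for an empty buffer are assumptions.

```python
def label_pollution_percent(stored_labels, true_labels):
    """Percentage of buffer samples whose stored (predicted) label is wrong."""
    if not stored_labels:
        return 0.0  # assumed convention for an empty buffer
    wrong = sum(s != t for s, t in zip(stored_labels, true_labels))
    return 100.0 * wrong / len(stored_labels)

# Two of the four stored labels disagree with the ground truth.
pollution = label_pollution_percent([1, 1, 0, 1], [1, 0, 0, 0])
```

This quantity can only be measured offline, since the ground-truth labels are not available to the system during deployment.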
In the second set of our experiments, we consider that the initial training task consists of training on bona-fide samples and only the first PA type, according to the index used in the datasets. Subsequent tasks are constructed by introducing two novel attack types at each time-step. This setting is closer to a realistic situation. We have visualized the learning curves for our algorithm and the baselines in Figure 5(b). Comparing the results with those of Figure 5(a), we see that improvements in terms of the APCER metric are similar. This observation suggests that our algorithm is robust even when multiple attacks are encountered in each time-step. We also note that performance degradation in terms of the BPCER metric is smaller than in Figure 5(a). This observation is expected because the base model has been updated fewer times compared to the single-attack per task scenario. As a result, catastrophic forgetting has been less severe. We conclude that our approach is effective for automatically identifying novel attacks and retraining the model.
6.3 Analysis and Ablation Studies
To demonstrate the importance of the ideas used in the NACL algorithm, we perform ablative experiments. We considered the single-attack per task scenario and used the PADISI-Face dataset in these experiments. We first demonstrate the importance of detecting OTDS samples. Consider that OTDS samples are not detected but the model is updated using a binary prediction baseline. This means that in a CL setting, we always store all the testing samples that are identified as PAs during execution, assuming all of them to be new attack types, and use them to update the model at each time-step. We refer to this approach as No GMM (NG). In a second experiment, we report the performance of the FR setting when real labels are used (FRR), i.e., performance in the absence of label pollution. This means that upon identifying the novel attack data points, rather than using the labels predicted by the model, we use the real labels to update the model. Performance results for these settings are summarized in Table 5. The extremely poor performance of NG, measured in terms of APCER, demonstrates that detecting OTDS is necessary for PAD in a continual learning setting. We also observe that, when real labels are used, as expected from the previous discussion, FRR converges to an upper bound for NACL, close to the JT performance visualized in Figure 5(a). This observation suggests that a future direction for improving our algorithm is tackling the challenge of label pollution. We can also conclude that, to mitigate catastrophic forgetting further, a larger buffer size should be used.
[Table: columns are Task, APCER, BPCER, and ACER.]
We also study the effect of the temporal order in which the PAs are encountered on the performance of our algorithm. The order we used in our experiments is arbitrary and preset. In practice, however, the user does not have any control over the temporal order in which the PAs are observed during execution. For this reason, we consider two extreme cases of ordering. We use the pre-update difficulty of PA detection by the model to set a synthetic temporal ordering on the PA types. To this end, we start by learning the PA with the class index 1 in the dataset. After learning the first task, for all time-steps, we compute the performance of the model on all the remaining PA types. The detection rates for the remaining PAs are a measure of the difficulty of detecting (or learning) them by the model. We performed experiments using two orderings: easy to difficult (ED) and difficult to easy (DE). In the ED scenario, we pick the PA with the largest detection rate as the next observed PA. This PA is the easiest PA for the system to learn among the remaining PAs. It is the most similar PA to the learned PAs from the model's point of view. We continue until all the attacks are observed. In the DE scenario, we pick the PA with the lowest detection rate.
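The greedy construction of the two orderings can be sketched as follows. The callback `detection_rate(pa, learned)` is a hypothetical stand-in for evaluating the current model on a remaining attack type after the attacks in `learned` have been learned; in the toy usage it simply returns fixed rates.

```python
def greedy_order(remaining, detection_rate, easiest_first=True):
    """Order attack types greedily by the current model's detection rate.

    easiest_first=True gives the ED ordering (highest rate picked next);
    easiest_first=False gives the DE ordering (lowest rate picked next).
    """
    remaining = list(remaining)
    order = []
    while remaining:
        # Re-evaluate the remaining attack types after every model update.
        rates = {pa: detection_rate(pa, order) for pa in remaining}
        pick = max(remaining, key=rates.get) if easiest_first else min(remaining, key=rates.get)
        order.append(pick)
        remaining.remove(pick)
    return order

# Toy rates that do not change as attacks are learned (illustrative values).
toy_rates = {"print": 0.9, "mask": 0.2, "replay": 0.6}
ed = greedy_order(toy_rates, lambda pa, learned: toy_rates[pa], easiest_first=True)
de = greedy_order(toy_rates, lambda pa, learned: toy_rates[pa], easiest_first=False)
```

In an actual experiment the rates would change after each update, so the callback would need to retrain and re-evaluate the model at every step.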
Results for ED and DE temporal orderings are reported in Table 5. We observe that in both cases, NACL algorithm is able to improve the performance of the model as more PA types are encountered and learned. The final model performance after observing all PAs denotes that learning in the ED scenario is easier for the algorithm. This observation accords with our intuition because learning novel attacks that are less similar to the previously observed attack types is more challenging. We conclude that the particular PA observation ordering influences the performance of our algorithm, but our algorithm is effective in the worst-case scenario.
Finally, we highlight that our method is stronger at reducing false-negative predictions. In the Appendix, we demonstrate that by benefiting from manual annotation of the novel samples, i.e., reducing the label pollution, we can considerably reduce the false-positive predictions.
We study the problem of PA detection in a continual learning setting. Our proposed approach is based on screening the data representations in an embedding space. We estimate the learned training data distribution in the embedding space using a GMM. We use this distribution to enable the base model to identify novel attack types as outside-training-distribution samples. Experience replay is then used to update the model while tackling catastrophic forgetting. We also collected a new dataset that contains various types of face spoofing attacks. Experiments on two datasets demonstrate that our method is effective in a continual learning setting. Future research directions include tackling label pollution and considering tasks without sharp temporal boundaries.
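To make the screening idea concrete, the following is a minimal sketch of flagging outside-training-distribution samples with a Gaussian mixture in the embedding space. It is a simplification and not our implementation: instead of running EM, it fits one diagonal Gaussian per known class from labeled embeddings, and the thresholding rule (the lowest log-likelihood observed on the training embeddings) is an assumption chosen for the example. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Component = Tuple[float, List[float], List[float]]  # (weight, mean, variance)


def fit_components(embeddings: Dict[str, List[Vector]], var_floor: float = 1e-3) -> List[Component]:
    """Fit one diagonal Gaussian per known class, weighted by class frequency."""
    total = sum(len(points) for points in embeddings.values())
    components = []
    for points in embeddings.values():
        n, dim = len(points), len(points[0])
        mean = [sum(p[d] for p in points) / n for d in range(dim)]
        var = [max(sum((p[d] - mean[d]) ** 2 for p in points) / n, var_floor) for d in range(dim)]
        components.append((n / total, mean, var))
    return components


def log_likelihood(x: Vector, components: List[Component]) -> float:
    """Log-density of x under the mixture (log-sum-exp over components)."""
    logs = []
    for weight, mean, var in components:
        lp = math.log(weight)
        for xi, m, v in zip(x, mean, var):
            lp += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(lp)
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))


def is_otds(x: Vector, components: List[Component], threshold: float) -> bool:
    """Flag x as outside the training distribution if its likelihood is too low."""
    return log_likelihood(x, components) < threshold


# Toy 2-D embeddings for two known classes.
known = {
    "bona_fide": [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.1)],
    "print": [(3.0, 3.0), (3.1, 2.9), (2.9, 3.2)],
}
components = fit_components(known)
# Assumed rule: anything less likely than every training embedding is novel.
threshold = min(log_likelihood(p, components) for pts in known.values() for p in pts)
novel_flag = is_otds((10.0, -10.0), components, threshold)
known_flag = is_otds((0.05, 0.0), components, threshold)
```

Samples flagged in this way would then be stored and replayed together with buffered samples from earlier tasks when the model is updated.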
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2017-17020200005. The views and conclusions contained herein should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
-  A. Agarwal, D. Yadav, N. Kohli, R. Singh, M. Vatsa, and A. Noore. Face Presentation Attack with Latex Masks in Multispectral Videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 275–283, 2017.
-  Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu. Face anti-spoofing using patch and depth-based cnns. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pages 319–328, 2017.
-  A. Bulat and G. Tzimiropoulos. Super-FAN: Integrated Facial Landmark Localization and Super-Resolution of Real-World Low Resolution Faces in Arbitrary Poses with GANs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 109–117, 2018.
-  Z. Chen, B. Liu, R. Brachman, P. Stone, and F. Rossi. Lifelong Machine Learning. Morgan & Claypool Publishers, 2nd edition, 2018.
-  Ivana Chingovska, Nesli Erdogmus, André Anjos, and Sébastien Marcel. Face Recognition Systems Under Spoofing Attacks. In Thirimachos Bourlai, editor, Face Recognition Across the Imaging Spectrum, pages 165–194. Springer International Publishing, Cham, 2016.
-  Tejas Indulal Dhamecha, Richa Singh, Mayank Vatsa, and Ajay Kumar. Recognizing Disguised Faces: Human and Machine Evaluation. PLOS ONE, 9(7):1–16, 07 2014.
-  N. Erdogmus and S. Marcel. Spoofing in 2D face recognition with 3D masks and anti-spoofing with Kinect. In 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–6, 2013.
-  R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
-  A. George and S. Marcel. Learning one class representations for face presentation attack detection using multi-channel convolutional neural networks. IEEE Transactions on Information Forensics and Security, 16:361–375, 2021.
-  K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  G. Heusch, A. George, D. Geissbühler, Z. Mostaani, and S. Marcel. Deep models and shortwave infrared information to detect face presentation attacks. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(4):399–409, 2020.
-  A. Jourabloo, Y. Liu, and X. Liu. Face de-spoofing: Anti-spoofing via noise modeling. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 297–315, Cham, 2018. Springer International Publishing.
-  J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
-  Soheil Kolouri, Mohammad Rostami, Yuri Owechko, and Kyungnam Kim. Joint dictionaries for zero-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
-  S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
-  Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 389–398, 2018.
-  Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu. Deep tree learning for zero-shot face anti-spoofing. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4675–4684, 2019.
-  V. S. Lokhande, S. Tasneeyapant, A. Venkatesh, S. N. Ravi, and V. Generating accurate pseudo-labels in semi-supervised learning and avoiding overconfident predictions via hermite polynomial activations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11432–11440, 2020.
-  D. Lopez-Paz and M.’A. Ranzato. Gradient episodic memory for continual learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6470–6479, Red Hook, NY, USA, 2017. Curran Associates Inc.
-  T. K. Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.
-  Yaniv Morgenstern, Mohammad Rostami, and Dale Purves. Properties of artificial networks evolved to contend with natural spectra. Proceedings of the National Academy of Sciences, 111(Supplement 3):10868–10872, 2014.
-  Nagarajan Natarajan, Inderjit S Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NIPS, volume 26, pages 1196–1204, 2013.
-  P. Oza, H. V. Nguyen, and V. M. Patel. Multiple class novelty detection under data distribution shift. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020, pages 432–449, Cham, 2020. Springer International Publishing.
-  I. Pavlidis and P. Symosek. The imaging issue in an automatic face/disguise detection system. In Proceedings IEEE Workshop on Computer Vision Beyond the Visible Spectrum: Methods and Applications (Cat. No.PR00640), pages 15–24, 2000.
-  D. Pérez-Cabo, D. Jiménez-Cabello, A. Costa-Pazo, and R. J. Deep anomaly detection for generalized face anti-spoofing. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1591–1600, 2019.
-  R. Raghavendra, K. B. Raja, and C. Busch. Presentation Attack Detection for Face Recognition Using Light Field Camera. IEEE Transactions on Image Processing, 24(3):1060–1075, 2015.
-  R. Raghavendra, K. B. Raja, S. Venkatesh, F. A. Cheikh, and C. Busch. On the vulnerability of extended Multispectral face recognition systems towards presentation attacks. In 2017 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), pages 1–8, 2017.
-  S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542, 2017.
-  M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019.
-  A. Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
-  Mohammad Rostami, David Huber, and Tsai-Ching Lu. A crowdsourcing triage algorithm for geopolitical event forecasting. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 377–381, 2018.
-  Mohammad Rostami, David Isele, and Eric Eaton. Using task descriptions in lifelong machine learning for improved performance and zero-shot transfer. Journal of Artificial Intelligence Research, 67:673–704, 2020.
-  Mohammad Rostami, Soheil Kolouri, Eric Eaton, and Kyungnam Kim. SAR image classification using few-shot cross-domain transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
-  Mohammad Rostami, Soheil Kolouri, Praveen Pilly, and James McClelland. Generative continual concept learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5545–5552, 2020.
-  Mohammad Rostami, Soheil Kolouri, and Praveen K Pilly. Complementary learning for overcoming catastrophic forgetting using experience replay. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 3339–3345, 2019.
-  L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4393–4402, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. ArXiv, abs/1511.05952, 2015.
-  S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.
-  R. Shao, X. Lan, J. Li, and P. C. Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10015–10023, 2019.
-  H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
-  J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
-  Leonidas Spinoulas, Mohamed E. Hussein, David Geissbühler, Joe Mathai, Oswin G. Almeida, Guillaume Clivaz, Sébastien Marcel, and Wael Abd Almageed. Multispectral biometrics system framework: Application to presentation attack detection. IEEE Sensors Journal, pages 1–1, 2021.
-  Serban Stan and Mohammad Rostami. Unsupervised model adaptation for continual semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2593–2601, 2021.
-  Holger Steiner, Sebastian Sporrer, Andreas Kolb, and Norbert Jung. Design of an Active Multispectral SWIR Camera System for Skin Detection and Face Verification. Journal of Sensors, 2016:16, 2016.
-  Z. Wang, Z. Yu, C. Zhao, X. Zhu, Y. Qin, Q. Zhou, F. Zhou, and Z. Lei. Deep spatial gradient and temporal depth learning for face anti-spoofing. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5041–5050, 2020.
-  G. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao. Attentive region embedding network for zero-shot learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9376–9385, 2019.
-  G.-S. Xie, L. Liu, F. Zhu, F. Zhao, Z. Zhang, Y. Yao, J. Qin, and L. Shao. Region graph embedding network for zero-shot learning. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020, pages 562–580, Cham, 2020. Springer International Publishing.
-  X. Yang, W. Luo, L. Bao, Y. Gao, D. Gong, S. Zheng, Z. Li, and W. Liu. Face anti-spoofing: Model matters, so does data. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3502–3511, 2019.
-  Q. Yu and K. Aizawa. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9517–9525, 2019.
-  Z. Yu, X. Li, X. Niu, J. Shi, and G. Zhao. Face anti-spoofing with human material perception. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020, pages 557–575, Cham, 2020. Springer International Publishing.
-  Z. Yu, C. Zhao, Z. Wang, Y. Qin, Z. Su, X. Li, F. Zhou, and G. Zhao. Searching central difference convolutional networks for face anti-spoofing. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5304, 2020.
-  S. Zhang, A. Liu, J. Wan, Y. Liang, G. Guo, S. Escalera, H. J. Escalante, and S. Z. Li. CASIA-SURF: A Large-Scale Multi-Modal Benchmark for Face Anti-Spoofing. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(2):182–193, 2020.
-  Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, Nov 2000.
Appendix A PADISI-Face Dataset
Previous efforts on generating face PAD datasets have focused on a number of major attack types, including disguises, printed photographs, 3D masks, and replays. As novel types of attacks emerge, existing datasets might be insufficient for designing suitable PAD algorithms, because there has not been enough effort in generating datasets that cover a wide variety of attack categories. Table 4 summarizes a number of highly prevalent face PAD datasets in the literature. The table provides information about the types of presentation attack instruments (PAIs) present in each dataset. As can be seen, these datasets are limited in terms of the diversity of attack types they contain. PADISI-Face is a new dataset captured from different participants to offer a more diverse set of attack types. Thanks to its granular labels on these attack types, PADISI-Face is a suitable dataset for testing models in continual learning settings or when there should be significant variations between the training and testing datasets.
The PADISI-Face dataset is collected using a sensor array designed and built by our team, shown in Figure 6. The system is designed for more comprehensive future versions of PADISI-Face that will contain information beyond the visible range. The hardware comprises six different cameras spanning the visible (RGB), short-wave infrared (SWIR), and long-wave infrared (thermal) ranges of the electromagnetic spectrum. Additionally, there are two near-infrared (NIR) cameras for high-quality stereo depth estimation. For acquisition of data in the NIR and SWIR spectra, synchronized illumination from LEDs of different wavelengths (shown in Figure 6) was used. The synchronized sequence of LED illuminations was designed to maximize the throughput of the camera suite while increasing the temporal coherence between frames. Figure 7 shows examples of images collected for bona fide and several attack samples using the sensor array in various light ranges. For the NIR and SWIR modalities, dark channel subtraction is performed to reduce the effect of ambient illumination. Data was collected from each participant over two rounds. In the first round, bona fide samples were collected. In the second round, participants presented a presentation attack instrument (PAI). PADISI-Face will be made available to the research community.
|Dataset|PAI types|
|Pavlidis and Symosek|Facial disguises|
|3DMAD|3D mask attacks|
|BVSD|Facial disguises|
|GUC-LiFFAD|2D print and replay|
|MS-Spoof|2D print|
|BRSU|3D masks|
|EMSPAD|2D print|
|MLFP|2D & 3D masks|
|CASIA-SURF|2D print & cutouts|
|MAFPAD|2D print, mannequins|
||fake tattoo, eye area cover|
To enable face detection in all captured frames, we use a standard calibration process based on checkerboards. For the checkerboard to be visible in all wavelength regimes, a manual approach is used in which a sequence of frames is captured offline while the checkerboard is lit with a bright halogen light. This makes the checkerboard pattern visible and detectable by all cameras, which allows the standard calibration estimation process to be followed. The face can then be easily detected in the RGB space, and the calculated transformation for each camera can be applied to detect the face in the remaining camera frames.
Following the aforementioned approach, face landmarks are detected on the visible spectrum using . A bounding box is then constructed from the landmarks to obtain a tight crop of the face. The bounding box in the visible spectrum is then projected to the corresponding coordinate systems of the other cameras to extract approximately aligned faces in the different modalities. Each channel is then scaled to the range by dividing by the bit depth of the camera and then resized to pixels. In our experiments, we use the visible-range information as input to our algorithm. A future direction is to consider information beyond the visible range to perform PAD.
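The cropping and scaling steps above can be sketched as follows. This is a simplified illustration rather than our preprocessing code. It assumes that "dividing by the bit depth" means dividing each pixel by the largest value representable at that bit depth, so that values fall in [0, 1]. The projection of the bounding box to the other cameras and the final resizing are omitted, and the helper names `landmarks_to_bbox` and `crop_and_scale` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def landmarks_to_bbox(landmarks: Sequence[Tuple[float, float]]) -> BBox:
    """Tight axis-aligned box around the detected (x, y) landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return (math.floor(min(xs)), math.floor(min(ys)), math.ceil(max(xs)), math.ceil(max(ys)))


def crop_and_scale(channel: List[List[int]], bbox: BBox, bit_depth: int) -> List[List[float]]:
    """Crop one image channel to bbox and scale raw intensities.

    Assumption: raw values are divided by 2**bit_depth - 1, giving [0, 1].
    """
    x0, y0, x1, y1 = bbox
    max_value = float(2 ** bit_depth - 1)
    return [[pixel / max_value for pixel in row[x0:x1 + 1]] for row in channel[y0:y1 + 1]]


# Toy 3x4 single-channel 8-bit image and two landmarks.
image = [
    [0, 51, 102, 255],
    [10, 20, 30, 40],
    [255, 255, 0, 0],
]
bbox = landmarks_to_bbox([(1.2, 0.5), (2.7, 1.9)])
face = crop_and_scale(image, bbox, bit_depth=8)
```

Normalizing by the sensor's bit depth in this way keeps channels from cameras with different dynamic ranges on a common scale.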
Appendix B Reducing False-Positive Predictions
Our primary focus has been on reducing false-negative predictions. In practical settings, we can assume that labels for the detected novel data points become accessible with a delay by the end of each task, e.g., through manual annotation. To model this possibility, we performed an experiment using a selection scheme that assumes manual annotation is possible, i.e., label pollution is reduced to zero. Since the model is updated at discrete periods at the end of each task, we assume that the labeling delay is shorter than the time needed to update the model. Hence, by the time a task finishes, the labels for the novel samples are accessible before the model is updated. Table 5 presents results for the PADISI-Face dataset in the single PA/task scenario, where we compare Delayed Labels (DL) with NACL. We observe that this sampling scheme leads to reduced false-negative predictions. Additionally, we observe that BPCER performance also improves.
|Task|APCER|BPCER|ACER|
Appendix C Experimental Setup Details
We provide the details of the setup that we used to perform our experiments.
C.1 Network structure
In our experiments, the network consists of a pre-trained, fixed backbone encoder followed by fully connected layers that map to the label space.
An important limitation of CNN models trained on small datasets, such as biometric datasets, is that they tend to select features that do not generalize due to overfitting. For this reason, we opted to employ MoCo-v1 as a fixed backbone network to improve the generalizability of the extracted features. This network is trained on ImageNet using a contrastive loss that attempts to find similarities and dissimilarities among synthesized variants of training data samples in an unsupervised way. It is a subclass of self-supervised learning, in which a deep neural network is trained to solve pseudo-tasks. As a result, the network learns to extract discriminative features at its early layers to solve the pseudo-tasks. Thus, when we use MoCo-v1 as our backbone for PAD in a continual learning setting, no input or label information from the future PA types has been used to train the feature extraction model. This property ensures that no information about the training dataset has been used in training the backbone model, which allows us to claim that the new attacks are indeed unseen. We note that extracting features using this pre-trained network leads to separability of different types of attacks, as shown in the t-SNE visualizations of Figure 8 for the PADISI-Face dataset as an example, which enables our model to identify data points that belong to new attack classes. This observation demonstrates that we can use the backbone model as a good feature extractor to identify OTDS samples.
We use the same end-to-end network structure for a fair comparison among the methods. The MoCo backbone is followed by three fully connected layers with , , and (turn into nodes when the model is trained to identify OTDS) nodes each. MoCo's encoder is in essence a ResNet50 architecture with output nodes, used here as discriminative feature vectors to improve classification. In all experiments, the weights of the backbone network are frozen, and the learnable parameters in our formulation refer to the last fully connected layers. We use a ReLU non-linearity in the first two layers and a softmax non-linearity in the final layer. We have selected the layer with nodes to represent the embedding space on which the CL approach is performed, as described in the paper. At each task, the network is trained with 2 output nodes, and performance on the testing split is measured during training. When the task is learned, the network output is extended to include a third output. After identifying the OTDS samples, the network is again trained with 2 outputs. This process continues until all tasks are learned. To avoid redundant backbone inference at every stochastic gradient descent step, we compute the input features once initially and perform optimization only on the learnable layers. By doing so, we reduce learning time, with the understanding that in practice inference also needs to be performed end-to-end.
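The structure described above (frozen backbone, cached features, a three-layer head, and an output layer that grows from 2 to 3 nodes) can be illustrated with the following sketch. It is a toy in plain Python and not our Keras implementation: the backbone is a stand-in function, the layer widths (8, 5, 2) are placeholders rather than the sizes we used, and treating the second layer's output as the embedding is an assumption made for the example. No training loop is included.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Randomly initialise one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Sequence[float], layer: Layer) -> Vector:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def softmax(v: Vector) -> Vector:
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def head_forward(feature: Vector, layers: List[Layer]) -> Tuple[Vector, Vector]:
    """ReLU, ReLU, softmax head; returns (embedding, class probabilities)."""
    hidden = relu(dense(feature, layers[0]))
    embedding = relu(dense(hidden, layers[1]))  # assumed embedding layer
    return embedding, softmax(dense(embedding, layers[2]))


def add_output_node(layer: Layer, rng: random.Random) -> Layer:
    """Extend the output layer by one node (e.g. from 2 to 3 outputs)."""
    weights, biases = layer
    new_row = [rng.gauss(0.0, 0.1) for _ in weights[0]]
    return weights + [new_row], biases + [0.0]


calls = {"n": 0}


def toy_backbone(image: Vector) -> Vector:
    """Stand-in for the frozen encoder; counts how often it is invoked."""
    calls["n"] += 1
    return [sum(image), max(image), min(image), float(len(image))]


rng = random.Random(0)
images = [[0.1, 0.5, 0.9], [0.3, 0.3, 0.3]]
features = [toy_backbone(img) for img in images]  # backbone runs once per image
layers = [init_layer(4, 8, rng), init_layer(8, 5, rng), init_layer(5, 2, rng)]
for _ in range(3):  # repeated head passes reuse the cached features
    outputs = [head_forward(f, layers) for f in features]
layers[2] = add_output_node(layers[2], rng)
_, probs3 = head_forward(features[0], layers)
```

Because the backbone is frozen, its outputs never change between gradient steps, which is why caching them once is equivalent to recomputing them at every step.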
C.2 Implementation Parameters
We use the cross-entropy loss as the discrimination loss. We used Keras to implement the algorithm and the Adam optimizer to perform stochastic gradient descent. The learning rate is set to be with a decay rate of . We use a batch size of . At each batch, we select 100 points randomly and make sure the batch is balanced. To learn each task, we randomly initialize all the trainable weights (fully connected layers) and perform optimization using 10000 batches. At each training epoch, we compute the loss function on the training data split and the performance metrics on the testing split. We ran our code on a cluster node equipped with 4 Nvidia Tesla P100-SXM2 GPUs. Our code is provided as part of the supplementary material.
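One simple way to draw a balanced batch of 100 points is sketched below. The text above does not specify how balance is enforced, so this is an assumption made for illustration: half of the batch is drawn from bona fide samples and half from attack samples, with replacement so that a small class (for example, a small replay buffer) can still fill its half. The function name `balanced_batch` is hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def balanced_batch(
    bona_fide: Sequence[object],
    attacks: Sequence[object],
    batch_size: int = 100,
    rng: Optional[random.Random] = None,
) -> List[Tuple[object, int]]:
    """Draw a class-balanced batch of (sample, label) pairs.

    Labels: 0 = bona fide, 1 = presentation attack (assumed encoding).
    Sampling is with replacement (an assumption, not stated in the text).
    """
    rng = rng or random.Random()
    half = batch_size // 2
    batch = [(x, 0) for x in rng.choices(bona_fide, k=half)]
    batch += [(x, 1) for x in rng.choices(attacks, k=batch_size - half)]
    rng.shuffle(batch)  # avoid all bona fide samples coming first
    return batch


# Toy pools: 10 bona fide items and only 3 attack items.
batch = balanced_batch(list(range(10)), list(range(100, 103)), rng=random.Random(0))
n_bona_fide = sum(1 for _, label in batch if label == 0)
```

With this scheme every batch contains the same number of bona fide and attack samples regardless of how imbalanced the underlying pools are.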