Biometric recognition is based on the use of distinctive anatomical and behavioural characteristics to automatically recognise a subject. Among other biometric characteristics, fingerprints offer a high recognition accuracy and at the same time enjoy a high popular acceptance. Despite these and other advantages, fingerprint-based recognition systems can be circumvented by launching Presentation Attacks (PAs), in which an artificial fingerprint, denoted as Presentation Attack Instrument (PAI), is presented to a sensor [10, 12, 25, 17].
The threat posed by PAIs is not merely an academic issue. In 2002, Matsumoto et al. [25, 26] analysed the vulnerabilities of eleven commercial fingerprint-based biometric systems to gummy fingerprints. The experimental evaluation showed that to of the PAIs built with cooperative methods were accepted as bona fide presentations (i.e., genuine or live fingers). In 2009, Japan reported the use of presentation attacks in one of its airports, and in 2013, a Brazilian doctor used artificial silicone fingerprints to tamper with a biometric attendance system at a hospital in Sao Paulo.
In order to tackle those severe security issues, the development of Presentation Attack Detection (PAD) techniques, which automatically detect PAIs presented to the biometric capture device, is a mandatory task. It has attracted a lot of attention within the biometric research community, not only for fingerprint systems [24, 38], but also for other characteristics such as face or iris. These PAD methods can be broadly classified as hardware- or software-based approaches. Whereas the former require dedicated, and mostly expensive, hardware, software-based approaches focus on dynamic or static characteristics extracted from the same biometric samples used for recognition purposes. Therefore, software-based methods are less expensive, and will be the focus of this article.
The newest fingerprint PAD techniques based on deep learning and textural features have proven to be a powerful tool to detect most PAIs [3, 4, 28, 30]. However, they share a common limitation: they depend both on the material used for fabricating the PAIs and on the sensor used for acquiring the fingerprint samples. More specifically, their error rates increase by a factor of five to 18 when either the PAIs’ materials or the sensors utilised are not known a priori (see Table I).
|Deep Triple Embedding ||3.33%||25.25%||15.20%|
To address the issue of generalisation to unknown factors, we analyse the combination of local features (i.e., Scale-Invariant Feature Transform, SIFT) with three different general-purpose feature encoding approaches, which have shown remarkable results in object classification tasks [5, 32, 20]: Bag of Words (BoW), Vector of Locally Aggregated Descriptors (Vlad), and Fisher Vector (FV). The local descriptors, computed over the image gradient, allow capturing different artefacts produced by the materials used for building the PAIs. Then, the aforementioned encoding approaches assign each local descriptor (i.e., SIFT) to the closest entry in a visual vocabulary. This visual vocabulary defines a common feature space, thereby allowing a better generalisation to unknown attacks or capture devices.
In order to evaluate the performance of the proposed methods and to allow the reproducibility of the results, we conduct a thorough experimental evaluation on the LivDet 2011, LivDet 2013, and LivDet 2015 databases. The performance is reported in compliance with the ISO/IEC 30107 international standard on PAD evaluation, thereby allowing a rigorous analysis of the results. The evaluation shows the capacity of the new method to be used in high security applications: for a high security operating point with an Attack Presentation Classification Error Rate (APCER) of 1%, an average Bona Fide Presentation Classification Error Rate (BPCER) of 0.25%, 0.38% and 7.11% was achieved, respectively, on the three databases, thereby outperforming the state-of-the-art. In addition, we would like to highlight that the proposed method took part in the Fingerprint Liveness Detection Competition 2019, achieving the best detection performance with an average accuracy of 96.17%.
II Related work
As we mentioned in Sect. I, we focus on static software-based fingerprint PAD methods, since they are the most time- and cost-efficient. In particular, we review those methods based on either deep learning or addressing scenarios with unknown factors. For more details on other methods, the reader is referred to [24, 38, 39].
In this context, it has been observed that some textural properties including the morphology, smoothness, and ridge-valley structure may be different between attack and bona fide presentations, and can thus be used to discriminate them. Building upon this idea, several texture-based PAD methods have been proposed in the literature [19, 16]. More recently, new methods based on deep learning approaches have significantly outperformed any earlier PAD techniques. For instance, Nogueira et al. benchmarked three classic Convolutional Neural Networks (CNN). One of their proposals achieved the best results in the LivDet 2015 competition, with an overall accuracy of. In spite of those promising results, the main limitation of these methods is that they learn features from a whole image with a fixed size. In many cases, also within the LivDet databases, the Region of Interest (ROI) covers only a small area of the whole image (e.g., 19% for some subsets of LivDet 2011), thus not being large enough to allow an efficient PA detection. This is highlighted by the results achieved on the LivDet 2011 - Italdata dataset, where the ACER increased up to .
To address the small ROI issue, Pala and Bhanu proposed training a triple convolutional network on one fixed-size, randomly extracted patch per image. In spite of the improvement obtained with respect to the previous whole-image-based approach, several of the patches randomly extracted from Italdata 2011 could stem from the background region of the image, thereby resulting in a still high ACER of 5.1%.
More recently, and based on the fact that PAIs produce spurious minutiae on a fingerprint image, Chugh et al. [3, 4] proposed a deep learning framework for independently classifying local patches around minutiae extracted from a fingerprint image. The final bona fide vs PA decision was defined as the average between PAD scores of the local patches. This approach additionally allows finding PA regions inside a sample, even if the PAI only covers part of the underlying fingerprint. The method achieves the lowest ACER values reported so far over the LivDet databases (see Table I, left column). However, despite the excellent results reported in the known environment (i.e., known attacks and known sensors), an evaluation on more challenging scenarios (i.e., unknown sensors and/or PAI fabrication materials) shows an increase in the error rates (see Table I).
Finally, Park et al. propose an efficient CNN based on the fire module of SqueezeNet to optimise the hardware and time requirements. Evaluated over the LivDet 2011 to 2015 databases, the CNN outperforms for some datasets the work presented in , while reducing the execution time by a factor of more than six. It should be noted, though, that the performance of this PAD method under more challenging scenarios with unknown attacks or sensors remains unknown.
To sum up, the main drawback of the aforementioned methods is their high dependency both on the PAI fabrication materials and the capture device. To tackle these issues, several approaches based on handcrafted features have been followed. On the one hand, Rattani et al. proposed an automatic adaptation of Weibull-calibrated support vector machines (SVMs). Over the LivDet 2011 database, the obtained equal error rates (EERs) oscillated between 20 and 30% for the best configuration in the presence of unknown PAI species. On the other hand, Ding and Ross analysed an ensemble of one-class SVMs trained only on bona fide data, which lowered the error rates to 10-22% over the same dataset.
More recently, in an extension of , Chugh and Jain identified a subset of six out of 12 PAI species which can yield detection rates similar to those of known-attack scenarios. That is, training the SpoofBuster with only those six PAI species and testing on all 12 species results in an APCER = 10.24% at BPCER = 0.2%, very close to the APCER = 9.03% obtained when all PAI species are used for training. In spite of these impressive results, it should be noted that the selection of the training PAI species plays a crucial role in this study.
This dependency is highlighted again by Engelsma and Jain, who train multiple generative adversarial networks (GANs) on bona fide images acquired with the RaspiReader sensor. From the same 12 different PAI species, six are used for training and six for testing. In a benchmark with the method proposed in , the GANs outperform the SVMs. However, the average APCERs achieved for a BPCER = 0.2% vary from 31.42% to 68.98%, depending on the training set used. This shows again a high sensitivity to different training datasets. In addition, this approach is not directly comparable to those based on conventional (e.g., Crossmatch or Greenbit) sensors, since specific hardware, namely the RaspiReader, was used to acquire the samples.
Finally, Gajawada et al. try to tackle this dependency on the PAI species contained in the training set from a different perspective. They propose a so-called deep learning based “Universal Material Translator” (UMT). Given a reduced number (e.g., five) of samples from a new PAI species, the UMT extracts their main appearance features to embed them into a database of bona fide samples, in order to generate synthetic samples of the new PAI species. Those synthetic samples can then be utilised to train any CNN. Over the LivDet 2015 database, the authors showed how the proposed approach can improve the detection rates by up to 17%, achieving a remarkable 21.96% APCER for a BPCER = 0.1%. However, it should be noted that this approach does require some samples (i.e., five) of the analysed unknown PAI species.
In this context, our method tackles the issue of detection performance degradation in the presence of unknown factors (i.e., attacks, sensors, or databases) by transforming the local descriptors extracted from the fingerprint samples into a common feature space. This allows for better generalisation capabilities to more challenging scenarios, not needing any samples of the unknown attacks for training.
III Proposed method
Fig. 1 shows an overview of the proposed PAD approach, based on the fusion of three different feature encoding approaches. In the first common processing step, the Pyramid Histogram of Visual Words (PHOW) algorithm is used to extract local features: the so-called dense Scale-Invariant Feature Transform (dense-SIFT) descriptors (Sect. III-A). Subsequently, three encoding methods are applied to bring the aforementioned local descriptors into a common feature space: i) Bag-of-Words (BoW) (Sect. III-B1), ii) Fisher Vector (FV) (Sect. III-B2), and iii) Vector of Locally Aggregated Descriptors (Vlad) (Sect. III-B3). Afterwards, each set of encoded features is classified using a different Support Vector Machine (SVM) (Sect. III-C). The final bona fide (BP) vs presentation attack (PA) decision for the sample at hand is defined as a weighted score-level fusion of the three SVMs (Sect. III-D).
III-A Local Features Extraction: dense-SIFT Descriptors
As local feature descriptors we have chosen the dense-SIFT approach, computed over the image gradient, since they can capture lower coherence areas introduced by the coarseness of different PAI fabrication materials. In particular, the Pyramid Histogram Of visual Words (PHOW) approach proposed by  computes SIFT descriptors densely at fixed points on a regular grid with uniform spacing (e.g., 5 pixels), as summarised in Fig. 2 (left). For each point in the grid, the dense-SIFT descriptor computes the gradient vector for each pixel in the feature point’s neighbourhood (Fig. 2, top right), taking into account 8 different directions. Subsequently, a normalized 8-bin histogram of gradient directions (Fig. 2, bottom right) is built over sample regions. In addition, in order to account for the scale variation between fingerprints, these dense-SIFT descriptors are computed over four circular patches or windows with different scales . Therefore, each point in the grid is represented by four SIFT descriptors (i.e., one per ) comprising a total number of 128 features (i.e., 8-bin histograms).
It should be noted that windows with different scales allow extracting local information of fingerprints at different resolution levels, thereby detecting variable-size artefacts produced in the fabrication of PAIs. In addition, near-uniform local patches do not yield stable keypoints or descriptors. Therefore, we have used a fixed threshold on the average norm of the local gradient in order to remove local descriptors from low contrast regions (i.e., regions with an average norm value close to zero).
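The core of the descriptor computation described above can be illustrated with a minimal Python sketch. It is not the VLFeat implementation used in this work: it builds a single normalised 8-bin histogram of gradient directions for one patch and discards low-contrast patches by thresholding the average gradient norm. The function name `orientation_histogram`, the central-difference gradient, the magnitude-weighted voting and the threshold value `min_avg_norm` are assumptions made for illustration only; a full SIFT descriptor additionally splits the window into spatial cells and applies a Gaussian weighting.

```python
import math


def orientation_histogram(patch, n_bins=8, min_avg_norm=1e-3):
    """Normalised histogram of gradient directions for a 2-D patch.

    `patch` is a list of rows of floats. Returns None when the average
    gradient norm is not above `min_avg_norm` (low-contrast region).
    """
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    total_norm, count = 0.0, 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the image gradient.
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            norm = math.hypot(gx, gy)
            total_norm += norm
            count += 1
            if norm == 0.0:
                continue
            # Map the direction in [0, 2*pi) onto one of n_bins bins.
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * n_bins) % n_bins
            hist[b] += norm  # assumed: magnitude-weighted vote
    if count == 0 or total_norm / count <= min_avg_norm:
        return None  # low-contrast patch: no stable descriptor
    total = sum(hist)
    return [h / total for h in hist]


# A horizontal intensity ramp has all gradients pointing along +x (bin 0).
ramp = [[float(c) for c in range(4)] for _ in range(4)]
flat = [[1.0] * 4 for _ in range(4)]
ramp_hist = orientation_histogram(ramp)
flat_hist = orientation_histogram(flat)
```

In this toy example the ramp yields a histogram concentrated in the first bin, whereas the uniform patch is rejected, which mirrors the removal of descriptors from low-contrast regions.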
III-B Local Feature Encoding
In the second stage of the PAD algorithm, three different feature encoding approaches for the dense-SIFT descriptors are analysed.
III-B1 Bag of Words (BoW)
Bag-of-Words (BoW) based techniques were first developed for text categorisation tasks, in which a text document is assigned to one or more categories based on its content. For this purpose, BoW represents the text document by a sparse histogram of word occurrences based on a vocabulary. Following this same idea, Csurka et al. adapted this technique to represent local features from an image in terms of the so-called visual words. Our method builds upon this last approach.
As first proposed in , the BoW representation first computes the visual vocabulary as a codebook of $K$ centroids or visual words (see Fig. 1, top) with $k$-means clustering. Then, the BoW representation is defined as the histogram of the number of image descriptors assigned to each visual word. Its computation is summarised in Fig. 3. First, a multi-level pyramid of spatial histograms is used in order to incorporate spatial relationships between patches. To that end, the fingerprint image is partitioned into increasingly fine sub-regions, and the dense-SIFT descriptors inside each sub-region are assigned to the closest centroid among the $K$ visual words, using a fast version of $k$-means clustering. Subsequently, the histograms inside each sub-region are computed and stacked into a single, final feature vector.
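The hard-assignment encoding with a spatial pyramid can be sketched as follows. This is a minimal, standard-library Python illustration under stated assumptions, not the implementation evaluated in this work: the vocabulary is assumed to be already trained, level `l` of the pyramid is assumed to split the image into 2^l × 2^l cells, and the helper names (`nearest_word`, `bow_histogram`, `spatial_bow`) are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the closest visual word (hard assignment)."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def bow_histogram(descriptors, vocabulary):
    """Normalised histogram of visual-word occurrences."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def spatial_bow(located, vocabulary, width, height, levels=2):
    """Stack per-cell BoW histograms over a spatial pyramid.

    `located` holds (x, y, descriptor) triples; level l uses 2**l x 2**l cells
    (an assumed pyramid layout).
    """
    feature = []
    for level in range(levels):
        cells = 2 ** level
        for cy in range(cells):
            for cx in range(cells):
                inside = [d for (x, y, d) in located
                          if min(int(x * cells / width), cells - 1) == cx
                          and min(int(y * cells / height), cells - 1) == cy]
                feature.extend(bow_histogram(inside, vocabulary))
    return feature


vocabulary = [[0.0, 0.0], [10.0, 10.0]]
hist = bow_histogram([[1.0, 0.0], [9.0, 9.0], [10.0, 11.0]], vocabulary)
pyramid = spatial_bow([(0, 0, [1.0, 0.0]), (9, 9, [9.0, 9.0])],
                      vocabulary, width=10, height=10, levels=2)
```

With a vocabulary of size K and this pyramid layout, the stacked vector has K × (1 + 4 + ...) entries, which is why the vocabulary size directly drives the length of the final representation.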
III-B2 Fisher Vector (FV)
BoW approaches encode local features using a hard assignment, in which a local descriptor is only assigned to one visual word based on a similarity function. In contrast, the Fisher Vector (FV) method derives a kernel from a generative model of the data (e.g., Gaussian Mixture Model, GMM), and describes how the set of local descriptors deviates from an average distribution of the descriptors. The aforementioned model can be understood as a probabilistic visual vocabulary, which thereby allows a soft assignment. Thus, the FV paradigm encodes not only the number of descriptors assigned to each region, but also their position in terms of their deviation with respect to the pre-defined model.
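As proposed in , we train a GMM with diagonal covariances from the decorrelated dense-SIFT descriptors extracted in the previous step (see the second row in Fig. 1). In general, the $K$ components of the GMM are represented by the mixture weights ($\pi_k$), Gaussian means ($\mu_k$) and covariance matrices ($\Sigma_k$), with $k = 1, \dots, K$. This leads to an image representation which captures the average first-order and second-order differences between the $N$ local features $x_i$ and each of the GMM centres:
$$\Phi^{(1)}_k = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} \alpha_i(k) \left( \frac{x_i - \mu_k}{\sigma_k} \right)$$
$$\Phi^{(2)}_k = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} \alpha_i(k) \left( \frac{(x_i - \mu_k)^2}{\sigma_k^2} - 1 \right)$$
where $\alpha_i(k)$ is the soft assignment weight of the $i$-th feature to the $k$-th Gaussian, $\sigma_k^2$ denotes the diagonal of $\Sigma_k$, and the division is element-wise. It is important to highlight that the GMM parameters are computed during the training stage. Finally, the FV representation that defines a fingerprint image is obtained by stacking the differences: $\Phi = [\Phi^{(1)}_1, \Phi^{(2)}_1, \dots, \Phi^{(1)}_K, \Phi^{(2)}_K]$.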
With the aim of clustering the extracted local features with GMM diagonal covariance matrices, the dense-SIFT features are decorrelated using PCA. In our approach, the dense-SIFT descriptor dimension was reduced from 128 to $D$ = 64 components, so that the final FV representation is a vector of size $2KD$, where $K$ is the number of Gaussian components in the GMM and $D$ is the dimension of a dense-SIFT descriptor.
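As an illustration of the soft-assignment encoding, the following standard-library Python sketch computes a Fisher Vector for an already trained diagonal-covariance GMM. It assumes the commonly used formulation in which the posterior-weighted first-order deviations are scaled by 1/(N·sqrt(weight)) and the second-order deviations by 1/(N·sqrt(2·weight)); the function names and the toy parameters are hypothetical, and the PCA step is omitted.

```python
import math


def soft_assignments(x, weights, means, variances):
    """Posterior probability of each diagonal Gaussian given descriptor x."""
    log_p = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xj, mj, vj in zip(x, mu, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vj) + (xj - mj) ** 2 / vj)
        log_p.append(ll)
    m = max(log_p)  # subtract the maximum for numerical stability
    p = [math.exp(l - m) for l in log_p]
    s = sum(p)
    return [v / s for v in p]


def fisher_vector(descriptors, weights, means, variances):
    """Stack first- and second-order statistics per Gaussian (size 2*K*D)."""
    n, K, D = len(descriptors), len(weights), len(means[0])
    first = [[0.0] * D for _ in range(K)]
    second = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        q = soft_assignments(x, weights, means, variances)
        for k in range(K):
            for j in range(D):
                z = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                first[k][j] += q[k] * z
                second[k][j] += q[k] * (z * z - 1.0)
    fv = []
    for k in range(K):
        fv.extend(v / (n * math.sqrt(weights[k])) for v in first[k])
        fv.extend(v / (n * math.sqrt(2.0 * weights[k])) for v in second[k])
    return fv


# Toy GMM with K = 2 components in D = 2 dimensions (illustrative values).
fv = fisher_vector([[0.0, 1.0], [4.0, 3.0]],
                   weights=[0.5, 0.5],
                   means=[[0.0, 0.0], [4.0, 4.0]],
                   variances=[[1.0, 1.0], [1.0, 1.0]])
single = fisher_vector([[1.0, 0.0], [-1.0, 0.0]],
                       weights=[1.0], means=[[0.0, 0.0]],
                       variances=[[1.0, 1.0]])
```

The length of `fv` is 2·K·D, which for K = 1024 Gaussians and D = 64 PCA components would give 131072 entries per image.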
III-B3 Vector of Locally Aggregated Descriptors (Vlad)
In order to reduce the high-dimensional image representation produced by the FV and BoW approaches, thereby gaining in efficiency and memory usage, we have finally studied the Vector of Locally Aggregated Descriptors (Vlad) methodology (see Fig. 1, third row). This is a simplified non-probabilistic version of FV, which models the data distribution from the accumulated differences between each local descriptor and its closest centre in the visual vocabulary. Therefore, as in the BoW approach, a visual vocabulary needs to be computed in the first step with the $k$-means algorithm.
More specifically, a $D$-dimensional local feature descriptor $x$ (i.e., dense-SIFT descriptor) can be represented by a Vlad descriptor $v$ of size $K \times D$ as follows:
$$v_{k,j} = \sum_{x \,:\, \mathrm{NN}(x) = c_k} \left( x_j - c_{k,j} \right)$$
where $x_j$ and $c_{k,j}$ denote the $j$-th component of $x$, and its corresponding closest visual word $c_k$. In our method, $v$ is subsequently $L_2$-normalised in order to further improve the classification accuracy.
Finally, it is important to highlight that Vlad also uses PCA for decorrelating training data.
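The aggregation just described can be sketched in a few lines of standard-library Python. The sketch assumes a pre-trained vocabulary and uses an L2 normalisation of the stacked residuals; the function name `vlad` and the toy data are hypothetical, and the PCA step is left out for brevity.

```python
import math


def vlad(descriptors, vocabulary):
    """Accumulate residuals to the nearest visual word, then L2-normalise."""
    K, D = len(vocabulary), len(vocabulary[0])
    v = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Hard assignment to the closest visual word.
        k = min(range(K),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(x, vocabulary[i])))
        for j in range(D):
            v[k][j] += x[j] - vocabulary[k][j]
    flat = [c for row in v for c in row]
    norm = math.sqrt(sum(c * c for c in flat))
    return [c / norm for c in flat] if norm > 0 else flat


vocabulary = [[0.0, 0.0], [10.0, 10.0]]
encoded = vlad([[1.0, 0.0], [2.0, 0.0], [10.0, 14.0]], vocabulary)
```

Here the residuals accumulate to (3, 0) for the first word and (0, 4) for the second, so the normalised descriptor has K × D = 4 entries, half the length of the corresponding Fisher Vector.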
In order to classify the final encoded representations, separate linear SVMs have been used for each encoding approach. In order to find the optimal hyperplane separating the bona fide from the attack presentations, the optimisation algorithm bounds the loss from below. Therefore, we have trained two complementary SVMs as follows:
The first SVM labels the bona fide samples as +1 and the presentation attacks as -1, thereby yielding the corresponding (weights) and (bias) classifier parameters.
The second SVM labels the bona fide samples as -1 and the presentation attacks as +1, thereby yielding the corresponding and classifier parameters.
Subsequently, given an encoded feature descriptor, two different scores are computed, which estimate both the class of the sample (i.e., the score sign) and the confidence of such a decision (i.e., the absolute value of the score is the distance to the hyperplane):
The final score is then computed to minimise the distance to the corresponding hyperplane, thereby choosing the most reliable decision for the given vector:
III-D Fused Approach (FPAD)
Given three different individual PAD scores, , output by the corresponding SVM, we define the final fused score as follows:
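A weighted score-level fusion of three classifier outputs can be sketched as below. The exact fusion rule and its parameters are not recoverable from the text, so this Python sketch assumes a convex combination in which two weights are free and the third is their complement; it also assumes that positive scores indicate bona fide presentations. The names `fuse_scores` and `decide`, and the example weights, are hypothetical.

```python
def fuse_scores(s_bow, s_fv, s_vlad, w_bow, w_fv):
    """Assumed convex combination of three SVM scores.

    The Vlad weight is the complement 1 - w_bow - w_fv, so all three
    weights sum to one.
    """
    w_vlad = 1.0 - w_bow - w_fv
    if min(w_bow, w_fv, w_vlad) < 0.0:
        raise ValueError("weights must be non-negative and sum to one")
    return w_bow * s_bow + w_fv * s_fv + w_vlad * s_vlad


def decide(score, threshold=0.0):
    """Assumed convention: scores at or above the threshold are bona fide."""
    return "bona fide" if score >= threshold else "presentation attack"


fused = fuse_scores(s_bow=1.0, s_fv=-1.0, s_vlad=0.5, w_bow=0.25, w_fv=0.5)
label = decide(fused)
```

In this toy case the FV score dominates the fusion, so a confident attack decision from FV outweighs a bona fide vote from BoW.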
IV Experimental evaluation
In this section, we evaluate and benchmark the detection performance of each fingerprint encoding scheme described in Sect. III. Specifically, three goals were taken into account for the experimental protocol design: analyse the impact of the key parameter (vocabulary size) on the detection performance of the three proposed PAD schemes, benchmark the detection performance of our proposals against the top state-of-the-art approaches, and study the computational performance of the three fingerprint encoding schemes.
IV-A Experimental Protocol
The proposed PAD methods were implemented in C++ using the open-source VLFeat library (http://www.vlfeat.org/). All the experiments were conducted on an Intel(R) Xeon(R) CPU E5-2670 v2 processor at 2.50 GHz with 378 GB of RAM.
|Digital P.||3.20||1.85||2.70||1.10||1.61||0.20||0.10||3.15||0.00 ()|
|Digital P.||5.64||-||1.76||1.09||1.12||4.75||5.20||14.10||4.60 ()|
IV-A2 Evaluation Protocol and Metrics
To reach the aforementioned objectives, the experimental evaluation considers three different scenarios: known-material and known-sensor, known-sensor and unknown-material, and unknown-sensor and cross-database.
The detection performance is evaluated in compliance with the ISO/IEC IS 30107: we report the Attack Presentation Classification Error Rate (APCER), which refers to the percentage of misclassified presentation attacks for a fixed threshold, and the Bona Fide Presentation Classification Error Rate (BPCER), which indicates the percentage of misclassified bona fide presentations. We also include the Detection Error Trade-Off (DET) curves between both error rates, as well as the BPCER for a fixed APCER of 10% (BPCER10), 5% (BPCER20) and 1% (BPCER100).
Then, in order to establish a fair benchmark with the existing literature, we report the ACER as the average of the APCER and the BPCER for a fixed detection threshold .
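The metrics above can be made concrete with a short Python sketch. It assumes that higher scores indicate bona fide presentations and that a sample is accepted as bona fide when its score is at or above the threshold; the function names and the toy scores are illustrative only. `bpcer_at_apcer` returns the lowest BPCER among the thresholds whose APCER does not exceed the target, which corresponds to BPCER10, BPCER20 and BPCER100 for targets of 0.10, 0.05 and 0.01.

```python
def apcer(attack_scores, threshold):
    """Share of attacks wrongly accepted as bona fide."""
    return sum(s >= threshold for s in attack_scores) / len(attack_scores)


def bpcer(bona_fide_scores, threshold):
    """Share of bona fide presentations wrongly rejected as attacks."""
    return sum(s < threshold for s in bona_fide_scores) / len(bona_fide_scores)


def acer(attack_scores, bona_fide_scores, threshold):
    """Average of APCER and BPCER at one fixed threshold."""
    return 0.5 * (apcer(attack_scores, threshold)
                  + bpcer(bona_fide_scores, threshold))


def bpcer_at_apcer(attack_scores, bona_fide_scores, target_apcer):
    """Lowest BPCER over all thresholds with APCER <= target_apcer."""
    candidates = sorted(set(attack_scores) | set(bona_fide_scores))
    candidates.append(candidates[-1] + 1.0)  # threshold rejecting everything
    best = 1.0
    for t in candidates:
        if apcer(attack_scores, t) <= target_apcer:
            best = min(best, bpcer(bona_fide_scores, t))
    return best


attacks = [-2.0, -1.0, -0.5, 0.3]
bona_fide = [-0.2, 0.5, 1.0, 2.0]
acer_at_zero = acer(attacks, bona_fide, threshold=0.0)
strict = bpcer_at_apcer(attacks, bona_fide, target_apcer=0.0)
relaxed = bpcer_at_apcer(attacks, bona_fide, target_apcer=0.25)
```

The toy scores show the trade-off reported by the DET curves: demanding that no attack is accepted costs a quarter of the bona fide presentations, whereas tolerating one accepted attack in four removes all false rejections.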
IV-B Experimental Results
IV-B1 Known-Material and Known-Sensor Scenario
First, we optimise the algorithms’ detection performance in terms of the main key parameter: the visual vocabulary size $K$. To that end, we focus on the known scenario, in order to avoid a bias due to other variables. We test a range of values up to $K$ = 2048, since larger vocabularies would yield too long feature vectors, not usable for real-time applications. We found that the best value on average is $K$ = 1024 (for more details, the reader is referred to the appendix), and optimised the fusion parameters (see Sect. III-D) for this value in terms of the D-EER.
Fig. 4 shows the DET curves for the FPAD approach over all sensors for $K$ = 1024. As can be observed, for low APCER values of 1% (i.e., high security thresholds), the FPAD achieves a remarkable average BPCER100 = 0.25% (vs. 4.05% in ) for LivDet 2011 and 0.38% for LivDet 2013. In more detail, for LivDet 2011, the Digital Persona and Sagem sensors report a BPCER = 0% for any APCER ≥ 0.2%. Regarding the LivDet 2013 database, the results are similar for all sensors, and we observe a BPCER = 0% for any APCER ≥ 10%. In contrast, on LivDet 2015 the FPAD suffers a detection performance decrease, with error rates increasing by a factor of up to 42. More specifically, it shows a BPCER10 = 0.94%, BPCER20 = 2.12% and BPCER100 = 7.11%.
In Table VI(a), we benchmark our results with the state-of-the-art in terms of the ACER. The lowest value on each row is highlighted in bold. As can be observed, even if the individual feature encoding approaches do not outperform the FSB, the fused FPAD approach yields the lowest average ACER for both LivDet 2011 (0.28% vs. 1.67%) and LivDet 2013 (0.43%). On the other hand, the FSB achieves the best performance over LivDet 2015 (0.97% vs. 2.82%). Nonetheless, it should be noted that the main goal of the present work is not only to achieve the best performance at a single operating point (i.e., the ACER is measured for ) but overall for different applications requiring either a low BPCER (i.e., high convenience) or low APCER (i.e., high security), and also under more challenging and realistic conditions (i.e., unknown sensors or PAI species).
IV-B2 Known-Sensor and Unknown-Material Scenario
In this scenario, both training and test samples were acquired by the same sensor, while the presentation attacks in the test set were fabricated with unknown PAI species. We analyse in detail the best performing single approach (FV) and the FPAD method. For the latter, we select the fixed thresholds obtained for the known scenario (see values in Table I), and denote this configuration as “fixed thresholds”. In addition, we also evaluate its performance on the best threshold combination (hereafter referred to as “optimised thresholds”). The corresponding DET curves are reported in Fig. 5.
|Dataset||Train PAI species||Test PAI species||FV||Vlad||BoW||FPAD (Fx thr.)||FPAD (Op thr.)||FSB|
|Bio11||EcoFlex, Gelatine, Latex||Silgum, Woodglue||6.33||10.05||15.05||4.78||2.05||4.60|
|Bio13||Modasil, Woodglue||EcoFlex, Gelatine, Latex||1.00||2.82||5.50||1.50||0.00||1.30|
|Ita11||EcoFlex, Gelatine, Latex||Silgum, Woodglue, Other||4.50||16.50||21.83||3.60||2.00||5.20|
|Ita13||Modasil, Woodglue||EcoFlex, Gelatine, Latex||0.50||1.17||4.63||0.50||0.00||0.60|
Regarding the LivDet 2015 protocol, we can observe a similar behaviour between the FV encoding and the fused FPAD algorithm for fixed thresholds in Fig. 4(a). In particular, the BPCER10 and BPCER20 are slightly higher for the individual FV encoding (around 1.6-7% and 3.5-9%), but for high security thresholds, the FPAD achieves lower error rates (BPCER 14.3% vs. 14.4%). Also, the DET curves for Greenbit and Crossmatch are very close, whereas the performance for Hi Scan and Digital Persona decreases. In contrast, the optimised thresholds FPAD achieves the best performance for Hi Scan, only showing a lower performance for Digital Persona. In all cases, the detection rates are higher, yielding a low BPCER of 7%. Regarding the state-of-the-art,  achieves an average APCER of 22% for a BPCER = 0.1% for the Crossmatch dataset, and the FPAD approach achieves an APCER under 20%, thus highlighting its soundness.
In the second set of experiments, we follow the unknown-material protocol defined in . In this case, Fig. 4(b) shows one of the main strengths of FV encoding: under high security scenarios, an average BPCER100 under 5% can be achieved. In particular, for Italdata 2011 (BPCER100 = 6.20%) and Italdata 2013 (BPCER100 = ) those values outperform the ones reported by . Regarding the fused algorithms, it can also be observed that even the fixed thresholds configuration achieves a BPCER100 comparable to FSB (i.e., BPCER100 = 4.48% vs. 4.24%). In addition, the optimised thresholds FPAD reports a BPCER100 = 1.85%, which is less than half of the FSB value.
We finally compare in Table VI(b) the performance of our methods and FSB  in terms of the ACER. We can observe that the FV encoding outperforms the remaining algorithms for three out of the four datasets. Moreover, for the fixed and optimised thresholds, our FPAD pipeline achieves an average ACER = 2.61% and ACER = 1.01% respectively, which considerably outperforms the top state-of-the-art.
IV-B3 Unknown-Sensor and Cross-Database Scenarios
Finally, we evaluate the soundness of our proposals in scenarios where different (i.e., unknown) sensors are used following the unknown-sensor and cross-database scenarios proposed by .
In the first set of experiments, training and test samples are acquired using different sensors (i.e., sensor inter-operability analysis). Fig. 5(a) shows the corresponding ISO-compliant evaluation. As may be observed, training over the Italdata subset yields a better performance at all operating points than training over Biometrika (grey vs orange, and blue vs yellow curves). Only low BPCERs (below 0.5%) over the LivDet 2013 show a different behaviour. Moreover, for a fixed APCER of 1%, the FV encoding achieves a BPCER100 of 26.80%, which reduces the top state-of-the-art result (BPCER100 = 52.52%) by almost 50%. In addition, our optimised thresholds FPAD approach attains a BPCER = 0% for all APCERs over the Italdata13 train set. We may thus conclude that the method found the optimal common feature space from the Italdata 2013 training set to correctly classify the Biometrika 2013 samples.
Table VI(c) benchmarks all methods against FSB in terms of ACER. In general, and regardless of the particular train-test combination, FV encoding is able to outperform both the other two encoding approaches and the results obtained in  (i.e., average ACER = 7.83% for FV vs. 14.59% for FSB, which implies a relative improvement of 48%). Moreover, the FPAD also outperforms the FSB for both the fixed and the optimised thresholds by a relative improvement of 38% and 55%, respectively.
In the second experiment, the performance is evaluated across data collections acquired with the same sensor (i.e., train and test over the same sensor, but on samples acquired for LivDet 2011 and LivDet 2013, respectively). We refer to this protocol as the cross-database scenario. In Fig. 5(b) we can see different behaviours for each algorithm for the different datasets. Whereas the Biometrika curves (orange and yellow) are very close for the FV encoding, this is not the case for the fused FPAD. This is due to the different generalisation capabilities of the remaining encoding approaches (BoW and Vlad), as may be seen in Table VI(d). In particular, the ACERs achieved training over Biometrika 2011 are better than those achieved training over Biometrika 2013 for BoW (28.8% vs. 15.70%), and vice versa for Vlad (15.70% vs. 11.10%). In addition, the poor performance of BoW also affects the fixed thresholds FPAD, thereby yielding a poor BPCER100 of almost 60%. However, the optimised thresholds FPAD can improve the error rates yielded by FV, achieving an average BPCER100 of 26%.
Finally, coming back to the ACER-based benchmark with FSB, we may observe that, on average, the FV approach (ACER = 9.15%), the fixed thresholds FPAD (ACER = 17.75%) and the optimised thresholds FPAD (ACER = 8.23%) all outperform the FSB (ACER = 17.91%), by up to a 55% relative improvement.
IV-B4 Computational efficiency
In this last set of experiments, we study the computational efficiency of the proposed image encodings for different parameter configurations. For this purpose, we select the LivDet 2015 database, which contains the largest images. We found that the BoW encoding requires 0.38 seconds, Vlad 1.58 seconds, and FV 2.11 seconds. There is thus a trade-off between detection performance and time efficiency. However, in all cases, the algorithms can be utilised for real-time applications.
In this paper, we have proposed a new PAD method based on the combination of local dense-SIFT image descriptors and three different feature encoding approaches (i.e., FV, Vlad, and BoW). The experimental evaluation conducted over the publicly available LivDet 2011, LivDet 2013 and LivDet 2015 databases assessed the performance of our proposals with respect to the top state-of-the-art methods. The analysis of the detection performance showed that the FV reached the best individual detection accuracy for all databases. However, a score-level fusion of the three encoding approaches (denoted as FPAD) yielded an improved performance, significantly outperforming the top state-of-the-art results in the analysed scenarios, especially under the most challenging and realistic scenarios, where both unknown materials and unknown sensors are frequently employed. In addition, this fused approach achieved the highest detection accuracy in the LivDet 2019 competition.
It should be also noted that the fixed thresholds configurations do not always outperform the FV encoding as a standalone algorithm. This highlights the challenges faced when unknown sensors or PAI species are contained in the test set. However, a proper tuning of the thresholds yields a very promising performance for the FPAD algorithm.
In more detail, the ISO-compliant evaluation in terms of BPCER and APCER showed one of the main strengths of the FV encoding and the FPAD proposal: the low BPCERs achieved even for very high security operating points (i.e., APCER ≤ 1%). Specifically, the FPAD technique yielded an average BPCER100 of 25% on the unknown-sensor scenario, and a BPCER100 of 26% to 28% on the cross-database scenario, thereby outperforming the top state-of-the-art results by up to a relative 50% to 60%, respectively. Moreover, both methods proved to be suitable in the presence of unknown PAI species, achieving a BPCER100 as low as 4.6% and 1%. In summary, the previous results indicate that the orientation histograms provided by the dense-SIFT method correctly represent the lack of continuity in the ridge flow, and hence the artefacts produced in the fabrication of PAIs. In combination with dense-SIFT descriptors, FV as well as the fusion-based proposal found a new common feature space, which allows successfully detecting both known and unknown PAIs.
Finally, the computational efficiency evaluation showed that the BoW encoding required less than 400 milliseconds, while the Vlad and FV encodings required more than 1150 milliseconds. As future work, we will reduce the computational cost of the Vlad and FV encodings in order to obtain the best trade-off between detection accuracy and computational efficiency.
Analysis of the Detection Performance for Different Vocabulary Sizes
As mentioned in the article, the main parameter shared by all feature encoding approaches is the vocabulary size. The larger the vocabulary, the higher the number of visual words and, thus, the lower the information loss during the quantisation carried out to convert the local dense-SIFT descriptors into the so-called common feature space. However, a larger vocabulary also entails a higher computational cost and can eventually lead to overfitting. Therefore, we analyse here in detail the impact of the vocabulary size on the detection performance and the computational efficiency of the PAD method for each scenario.
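The role of the vocabulary in the quantisation step can be sketched with a minimal BoW encoder: each local descriptor is replaced by the index of its nearest visual word, and the image is summarised by a histogram with one bin per word. This is an illustrative sketch under simple assumptions (hard assignment, Euclidean distance, L1 normalisation, a vocabulary already learnt by clustering); it is not the implementation evaluated in this work.

```python
import numpy as np

def bow_encode(descriptors, vocabulary):
    """Hard-assign local descriptors to visual words and build a histogram.

    descriptors: array of shape (n, d), e.g. dense-SIFT descriptors of one image
    vocabulary:  array of shape (K, d), the visual words (cluster centres)
    Returns an L1-normalised histogram of length K.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    vocabulary = np.asarray(vocabulary, dtype=float)
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # quantisation: keep only the word index
    hist = np.bincount(nearest, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy example with K = 2 words in a 2-dimensional descriptor space.
words = [[0.0, 0.0], [10.0, 10.0]]
image_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [0.0, 0.0]]
histogram = bow_encode(image_descriptors, words)
```

With more words, each descriptor lies closer to its assigned word, so less information is discarded, at the price of a longer histogram and a more expensive nearest-word search.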
-A Known-Material and Known-Sensor Scenario
First, we analyse the impact of the vocabulary size on the performance of the three proposed schemes individually. We do so under this all-known scenario in order to avoid a bias due to other variables (i.e., unknown PAI species or sensors). More specifically, we test a limited range of vocabulary sizes, since larger values would yield feature vectors that are too long for real-time applications.
The ACER values for each method and vocabulary size are presented in Table VI(a), and graphically in Fig. 6(a). As can be observed, most curves reach a minimum (i.e., lowest ACER and thus best detection performance) for a vocabulary size of 1024. In some cases, the ACER is still decreasing at a vocabulary size of 2048 (e.g., the BoW encoding for LivDet 2013), and thus does not reach a minimum over the selected range. However, as mentioned above, larger vocabulary sizes would preclude real-time detection and are therefore not considered in the present study.
Focusing now on the best value on average, a vocabulary size of 1024, FV encoding achieves on average, for all sensors, an ACER of 2.13%, 1.88% and 3.31% on LivDet 2011, LivDet 2013 and LivDet 2015, respectively. In turn, the best Vlad performance is also found at a vocabulary size of 1024 (i.e., 2.88% on LivDet 2011 and 2.68% on LivDet 2013) for all databases except LivDet 2015, for which the best accuracy is reached at a vocabulary size of 2048 (ACER = 4.16%). Finally, the detection performance of the BoW encoding improves as the vocabulary size grows, thereby achieving its minimum ACER at a vocabulary size of 2048.
-B Known-Sensor and Unknown-Material Scenario
In this scenario, both training and test samples were acquired with the same sensor, while the presentation attacks in the test set stem from unknown PAI species, that is, species not seen during training.
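A protocol of this kind can be sketched as a partition of the attack samples by PAI species, so that no species used for testing appears in training. The sample identifiers and material names below are placeholders chosen for the example and do not reproduce the LivDet partitions.

```python
def split_attacks_by_species(attack_samples, unknown_species):
    """Partition attack samples so that test species are unseen in training.

    attack_samples:  list of (sample_id, species) tuples
    unknown_species: species reserved for the test set
    """
    unknown = set(unknown_species)
    train = [s for s in attack_samples if s[1] not in unknown]
    test = [s for s in attack_samples if s[1] in unknown]
    # Sanity check: the two partitions must not share any species.
    assert {sp for _, sp in train}.isdisjoint({sp for _, sp in test})
    return train, test

# Placeholder samples and material names, for illustration only.
samples = [("a1", "gelatine"), ("a2", "latex"), ("a3", "wood glue"), ("a4", "latex")]
train_attacks, test_attacks = split_attacks_by_species(samples, ["wood glue"])
```

Bona fide samples are not affected by this constraint and only need to be split into disjoint training and test subsets.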
In the first set of experiments, we select the LivDet 2015 database, since it already includes unknown PAI species for testing. Fig. 6(b) shows, in terms of ACER, the impact of the vocabulary size on the performance of the proposed encoding techniques. As can be seen, the average performance (represented with a dashed red line) improves as the vocabulary size increases, reaching its minimum ACER at a vocabulary size of 2048. More specifically, the FV encoding yields the best ACER results on average.
We have also analysed the unknown-material protocols previously proposed for LivDet 2011 and 2013. The results are presented in Table VI(b). In this case, only the BoW encoding reaches its best detection performance at a vocabulary size of 2048. In contrast, the best results on average are obtained with a vocabulary size of 512 for FV and of 256 for Vlad.
Finally, it should also be highlighted that, for all three datasets (i.e., LivDet 2011, 2013 and 2015), BoW shows a higher variability range across vocabulary sizes. For instance, for LivDet 2015 the ACER varies between 3.03 and 4.44 for FV, between 1.64 and 8.61 for Vlad, and between 9.28 and 16.60 for BoW. Therefore, BoW is much more sensitive to changes in the vocabulary size.
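The variability range quoted for LivDet 2015 can be expressed as the difference between the highest and the lowest ACER of each encoding. The short computation below uses only the minimum and maximum values given in the text, in percentage points.

```python
# (min ACER, max ACER) on LivDet 2015, as quoted in the text above.
acer_range = {"FV": (3.03, 4.44), "Vlad": (1.64, 8.61), "BoW": (9.28, 16.60)}

# Spread = max - min, rounded to two decimals to hide floating-point noise.
spread = {name: round(hi - lo, 2) for name, (lo, hi) in acer_range.items()}

for name, value in spread.items():
    print(f"{name}: {value:.2f} percentage points")
```

For these figures the spread is 1.41 for FV, 6.97 for Vlad and 7.32 for BoW.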
-C Unknown-Sensor and Cross-Database Scenarios
Finally, in order to evaluate the soundness of our proposals in scenarios where different (i.e., unknown) sensors are used, we follow previously proposed unknown-sensor and cross-database protocols.
In the first set of experiments, training and test samples are acquired using different sensors. Table VI(c) shows the ACER for different vocabulary sizes. As can be observed, the FV encoding achieves its best results at different vocabulary sizes, depending on the sensor used for training: whereas for Italdata 2011 and 2013 the lowest ACER is achieved at a vocabulary size of 512 (9.60% and 0.90%), for Biometrika it is obtained at a vocabulary size of 2048 (18.50% and 1.20%). In general, and regardless of the particular train-test combination, FV encoding outperforms both the other two encoding approaches and the previously reported results (i.e., ACER = 7.83% for FV vs. 14.59% for FSB, which implies a relative improvement of 48%). These results indicate that FV encoding found a set of common features in the training images that allows a correct detection of PAIs acquired with other sensors.
In the second experiment, we evaluate the effect of a change of data collection for the same sensor (i.e., training and test samples are captured with the same sensor, but taken from LivDet 2011 and LivDet 2013, respectively). We refer to this protocol as the cross-database scenario, and Table VI(d) shows the impact of the vocabulary size on each proposed approach. As can be observed, the FV encoding again outperforms both the other encoding approaches presented in this study and the top state-of-the-art results. In particular, in three out of four cases, the best performance is achieved at a vocabulary size of 2048. Only for Biometrika13 - Biometrika11 is the best performance reached at a vocabulary size of 512.
Under these last two scenarios, the variability range of the BoW performance is comparable to that of FV and Vlad. However, its ACER is up to 4.8 times higher, which makes this encoding less suitable for PAD purposes than the other two.
In general, we have seen how the vocabulary size can impact the performance of the PAD method, and how, depending on the scenario considered, different values yield the best performance. However, a vocabulary size of 1024 always achieved either the best performance for FV and Vlad or a performance close to it. Therefore, we can conclude that, if no data is available to carefully analyse the best option, 1024 can be chosen as a reasonable, albeit sub-optimal, vocabulary size.
-D Computational Efficiency
In this last set of experiments, we study the computational efficiency of the proposed image encodings for different parameter configurations. For this purpose, we select the LivDet 2015 database, since it contains the largest images. Table VII shows the average performance of the proposal over different vocabulary sizes. As could be expected, the vocabulary size has an impact on the average computational efficiency of the proposed methods, since the feature vector sizes depend directly on it. More specifically, these efficiency results indicate that larger vocabulary sizes worsen the computational efficiency of the PAD methods in many cases. On the other hand, in some cases, larger values also lead to a better detection performance.
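The dependence of the feature vector size on the vocabulary size can be made explicit with the standard formulations of the three encodings: one histogram bin per word for BoW, one residual vector per word for Vlad, and gradients with respect to the mean and variance of each Gaussian for FV (mixture-weight gradients omitted). The descriptor dimension of 128 assumes raw SIFT descriptors without dimensionality reduction, which is an assumption made for this sketch.

```python
def encoding_dims(vocabulary_size, descriptor_dim=128):
    """Feature vector length of each encoding under the standard formulations.

    descriptor_dim=128 assumes raw SIFT descriptors (no PCA); this is an
    assumption for the example, not a setting taken from the text.
    """
    k, d = vocabulary_size, descriptor_dim
    return {
        "BoW": k,         # one histogram bin per visual word
        "Vlad": k * d,    # one d-dimensional residual per visual word
        "FV": 2 * k * d,  # mean and variance gradients per Gaussian
    }

for size in (256, 512, 1024, 2048):
    print(size, encoding_dims(size))
```

Under these assumptions, a vocabulary size of 1024 gives 1024 components for BoW, 131072 for Vlad and 262144 for FV, which is consistent with BoW being the fastest of the three encodings.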
It should be noted that the processing times reported for the BoW encoding are below 400 milliseconds for every parameter combination, while those for the FV encoding are above 1100 milliseconds. Therefore, since FV is the most accurate approach, it will be interesting to improve its computational efficiency in future work in order to attain a better trade-off between detection accuracy and computational efficiency.
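Average per-image times such as those discussed above can be obtained with a simple wall-clock measurement. The helper below is a generic sketch: `encode` stands for any encoding function, `images` for any collection of inputs, and the number of repetitions is arbitrary. It does not reproduce the measurement setup behind Table VII.

```python
import time

def mean_time_ms(encode, images, repeats=3):
    """Average wall-clock time per image, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for image in images:
            encode(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(images))

# Dummy workload standing in for an image encoding function.
average_ms = mean_time_ms(lambda x: sum(range(x)), [10_000, 20_000, 30_000])
```

Repeating the measurement and averaging reduces the influence of transient system load on the reported value.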
-  Image classification using random forests and ferns. In Proc. Int. Conf. on Computer Vision (ICCV), Cited by: §III-A, §III.
-  (2019) Fingerprint presentation attack detection: generalization and efficiency. In Proc. Int. Conf. on Biometrics (ICB), Cited by: §II.
-  (2017) Fingerprint spoof detection using minutiae-based local patches. In Proc. Int. Joint Conf. on Biometrics (IJCB), pp. 581–589. Cited by: TABLE I, §I, §II, TABLE III.
-  (2018) Fingerprint spoof buster: use of minutiae-centered patches. IEEE Trans. on Information Forensics and Security. Cited by: §-C, TABLE I, §I, §II, §II, §II, §IV-B1, §IV-B2, §IV-B2, §IV-B3, §IV-B3, §IV-B3, TABLE III, TABLE IV, V(a), V(b), §V.
-  (2004) Visual categorization with bags of keypoints. In Proc. Int. Workshop on Statistical Learning in Computer Vision (ECCV), pp. 1–22. Cited by: §I, §III-B1, §III.
-  (2016) An ensemble of one-class SVMs for fingerprint spoof detection across different fabrication materials. In Proc. Int. Workshop on Information Forensics and Security (WIFS), Cited by: §II, §II.
-  Using the triangle inequality to accelerate k-means. In Proc. Int. Conf. on Machine Learning (ICML), pp. 147–153. Cited by: §III-B1.
-  (2019) Generalizing fingerprint spoof detector: learning a one-class classifier. In Proc. Int. Conf. on Biometrics (ICB), Cited by: §II.
-  (2019) Universal material translator: towards spoof fingerprint generalization. In Proc. Int. Conf. on Biometrics (ICB), Cited by: §II, §IV-B2.
-  (2006) On the vulnerability of fingerprint verification systems to fake fingerprints attacks. In Proc. Int. Carnahan Conf. on Security Technology, pp. 130–136. Cited by: §I.
-  (2017-08) Presentation attack detection in iris recognition. In Iris and Periocular Biometrics, C. Busch and C. Rathgeb (Eds.), Cited by: §I.
-  (2010) An evaluation of direct attacks using fake fingers generated from ISO templates. Pattern Recognition Letters 31 (8), pp. 725–732. Cited by: §I.
-  Biometric antispoofing methods: a survey in face recognition. IEEE Access 2, pp. 1530–1552. Cited by: §I.
-  (2013) LivDet 2013 fingerprint liveness detection competition 2013. In Proc. Int. Conf. on Biometrics (ICB), pp. 1–6. Cited by: §IV-A1.
-  (2017) Fingerprint presentation attack detection method based on a bag-of-words approach. In Proc. Iberoamerican Conf. on Pattern Recognition (CIARP), Cited by: §III-B1.
-  (2015) Local contrast phase descriptor for fingerprint liveness detection. Pattern Recognition 48 (4), pp. 1050–1058. Cited by: §II.
-  (2017) ISO/IEC FDIS 30107-3. information technology - biometric presentation attack detection - part 3: testing and reporting. International Organization for Standardization. Cited by: §I, §I, §IV-A2.
-  (2012) Aggregating local image descriptors into compact codes. IEEE Trans. on Pattern Analysis and Machine Intelligence 34 (9), pp. 1704–1716. Cited by: §III-B2, §III-B3, §III.
-  (2014) Multi-scale local binary pattern with filters for spoof fingerprint detection. Information Sciences 268, pp. 91–102. Cited by: §II.
-  (2014) Attribute-based classification for zero-shot visual object categorization. IEEE Trans. on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. Cited by: §I.
-  (2002) Text classification using string kernels. Journal of Machine Learning Research 2 (Feb), pp. 419–444. Cited by: §III-B1.
-  (2004) Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §I.
-  (2009) Handbook of fingerprint recognition. Springer Science & Business Media. Cited by: §I.
-  (2015) A survey on antispoofing schemes for fingerprint recognition systems. ACM Computing Surveys (CSUR) 47 (2), pp. 28. Cited by: §I, §II.
-  (2002) Impact of artificial gummy fingers on fingerprint systems. In Optical Security and Counterfeit Deterrence Techniques IV, Vol. 4677, pp. 275–289. Cited by: §I, §I.
-  (2002) Gummy and conductive silicone rubber fingers importance of vulnerability analysis. In Proc. Asian Conf. on Advances in Cryptology (ASIACRYPT), pp. 574–575. Cited by: §I.
-  (2015) LivDet 2015 fingerprint liveness detection competition 2015. In Proc. Int. Conf. on Biometrics Theory, Applications and Systems (BTAS), pp. 1–6. Cited by: §IV-A1.
-  (2016) Fingerprint liveness detection using convolutional neural networks. IEEE Trans. on Information Forensics and Security 11 (6), pp. 1206–1213. Cited by: §-B, §-C, TABLE VI, TABLE I, §I, §II, §II, 4(b), Fig. 6, §IV-B2, §IV-B3, TABLE III, TABLE IV, TABLE V.
-  (2019) LivDet in action-fingerprint liveness detection competition 2019. arXiv preprint arXiv:1905.00639. Cited by: §I, §V.
-  (2017) Deep triplet embedding representations for liveness detection. In Deep Learning for Biometrics, pp. 287–307. Cited by: TABLE I, §I, §II, TABLE III.
-  (2019) Presentation attack detection using a tiny fully convolutional network. IEEE Trans. on Information Forensics and Security. Cited by: §II, TABLE III.
-  (2016) Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. Computer Vision and Image Understanding 150, pp. 109–125. Cited by: §I.
-  (2010) Improving the fisher kernel for large-scale image classification. In Proc. European Conf. on Computer Vision (ECCV), pp. 143–156. Cited by: §III-B2.
-  (2015) Open set fingerprint spoof detection across novel fabrication materials. IEEE Trans. on Information Forensics and Security 10 (11), pp. 2447–2460. Cited by: §II.
-  (2013) Image classification with the Fisher vector: theory and practice. Int. Journal on Computer Vision 105 (3), pp. 222–245. Cited by: §I, §III-B2, §III.
-  (2016) Presentations and attacks, and spoofs, oh my. Image and Vision Computing 55, pp. 26–30. Cited by: §I.
-  (2013) Fisher vector faces in the wild. In BMVC, Vol. 2, pp. 4. Cited by: §III-B2.
-  (2014) Presentation attack detection methods for fingerprint recognition systems: a survey. IET Biometrics 3 (4), pp. 219–233. Cited by: §I, §II.
-  (2019) Biometric presentation attack detection: beyond the visible spectrum. arXiv preprint arXiv:1902.11065. Cited by: §II.
-  (2012) LivDet 2011-fingerprint liveness detection competition 2011. In Proc. Int. Conf. on Biometrics (ICB), pp. 208–215. Cited by: §IV-A1.