Review of the Fingerprint Liveness Detection (LivDet) competition series: 2009 to 2015

Luca Ghiani, et al., Università di Cagliari and Clarkson University. 09/06/2016

A spoof attack, a subset of presentation attacks, is the use of an artificial replica of a biometric in an attempt to circumvent a biometric sensor. Liveness detection, or presentation attack detection, distinguishes between live and fake biometric traits and is based on the principle that additional information can be garnered above and beyond the data procured by a standard authentication system to determine whether a biometric measure is authentic. The goals of the Liveness Detection (LivDet) competitions are to compare software-based fingerprint liveness detection and artifact detection algorithms (Part 1), as well as fingerprint systems which incorporate liveness detection or artifact detection capabilities (Part 2), using a standardized testing protocol and large quantities of spoof and live tests. The competitions are open to all academic and industrial institutions which have a solution for either software-based or system-based fingerprint liveness detection. The LivDet competitions were hosted in 2009, 2011, 2013 and 2015 and have provided a crucial look at the current state of the art in liveness detection schemes. There has been a noticeable increase in the number of participants as well as a noticeable decrease in error rates across competitions. Participants have grown from four to the most recent thirteen submissions for Fingerprint Part 1. Fingerprint Part 2 held steady at two submissions in each of the 2011 and 2013 competitions and had only one for the 2015 edition. The continuous increase of competitors demonstrates a growing interest in the topic.


1 Introduction

Among biometrics, fingerprints are probably the best-known and most widespread because of their properties: universality, durability and individuality. Unfortunately, it has been shown that fingerprint scanners are vulnerable to presentation attacks[1] carried out with an artificial replica of a fingerprint. Therefore, it is important to develop countermeasures to those attacks.

[1] Traditionally, a majority of papers refer to these types of attacks as “spoofing attacks”, but recently the term “presentation attacks” has become the standard standard . Similarly, presentation attack detection is the standard term for liveness detection. Presentation attack detection is a more general term and can refer to multiple approaches for detecting a presentation attack beyond liveness detection.

Numerous methods have been proposed to address the susceptibility of fingerprint devices to attacks by spoof fingers. One primary countermeasure to spoofing attacks is called “liveness detection” or presentation attack detection. Liveness detection is based on the principle that additional information can be garnered above and beyond the data procured and/or processed by a standard verification system, and this additional data can be used to verify whether an image is authentic. Liveness detection uses either a hardware-based or software-based system coupled with the authentication program to provide additional security. Hardware-based systems use additional sensors to gain measurements outside of the fingerprint image itself to detect liveness. Software-based systems use image processing algorithms to gather information directly from the collected fingerprint to detect liveness. These systems classify images as either live or fake.

In 2009, in order to assess the main achievements of the state of the art in fingerprint liveness detection, the University of Cagliari and Clarkson University organized the first Fingerprint Liveness Detection Competition.

The First International Fingerprint Liveness Detection Competition (LivDet) 2009 ld09 , provided an initial assessment of software systems based on the fingerprint image only. The second, third and fourth Liveness Detection Competitions (LivDet 2011 ld11 , 2013 ld13 and 2015 ld15 ) were created in order to ascertain the progressing state of the art in liveness detection, and also included integrated system testing.

This paper reviews the previous LivDet competitions and how they have evolved over the years. Section 2 describes the background of spoofing and liveness detection. Section 3 details the methods used in testing for the LivDet competitions as well as the datasets that have been generated by the competitions so far. Section 4 discusses the trends across the competitions, reflecting advances in the state of the art. Section 5 presents a quality analysis drawn from the LivDet experience, and Section 6 concludes the paper and discusses the future of the LivDet competitions.

2 Background

The concept of spoofing has existed for some time now. Research into spoofing can be traced back to 1998, when D. Willis and M. Lee tested six different biometric fingerprint devices against fake fingers and found that four of the six were susceptible to spoofing attacks sixbd . This line of research was taken up again in 2000-2002 by multiple institutions, including Putte and Kuening as well as Matsumoto et al. bfr ; iagf . Putte et al. examined different types of scanning devices as well as different ways of counterfeiting fingerprints bfr . The research presented by these authors looked at the vulnerability to spoofing. In 2001, Kallo et al. looked at a hardware solution to liveness detection, while in 2002, Schuckers delved into software approaches for liveness detection fra ; sasm . Liveness detection, with either hardware-based or software-based systems, is used to check if a presented fingerprint originates from a live person or an artificial finger. Usually the result of this analysis is a score used to classify images as either live or fake.

Many solutions have been proposed to address the vulnerability to spoofing spfs ; fsr . Bozhao Tan et al. proposed a solution based on ridge signal and valley noise analysis spfs . This solution examines the perspiration patterns along the ridges and the patterns of noise in the valleys of images spfs . It was proposed that since live fingers sweat but spoof fingers do not, the live fingerprint will look “patchy” compared to a spoof spfs . It was also proposed that, due to the properties of spoof materials, spoof fingers will have granules in the valleys that live fingers will not have spfs . Pietro Coli et al. examined static and dynamic features of collected images on a large data set fsr .

There are two general ways of creating artificial fingers: the cooperative method and the non-cooperative method. In the cooperative method the subject pushes their finger into a malleable material such as dental impression material, plastic, or wax, creating a negative impression of the fingerprint as a mold (see Figure 1). The mold is then filled with a material such as gelatin, Play-Doh or silicone. This cast can be used to represent a finger from a live subject (see Figure 2).

Figure 1: Negative impression of five fingers using consensual method.
Figure 2: Latex spoof on finger.

The non-cooperative method involves enhancing a latent fingerprint left on a surface, digitizing it through the use of a photograph, and finally printing the negative image on a transparency sheet. This printed image can then be made into a mold, for example by etching the image onto a printed circuit board (PCB), which can be used to create the spoof cast as seen in Figure 3.

Figure 3: Etched fingerprints on PCB.

Most competitions focus on matching, such as the Fingerprint Verification Competition held in 2000, 2002, 2004 and 2006 fvc06 and the ICB Competition on Iris Recognition (ICIR2013) icbir . However, these competitions did not consider spoofing.

The Liveness Detection Competition series was started in 2009 to create a benchmark for measuring liveness detection algorithms, similar to the benchmarks available for matching performance. At that time, no other public competition had examined liveness detection as part of a biometric modality for deterring spoof attacks. The motivation for organizing such a competition was that the first attempts to address this topic were often carried out on home-made data sets that were not publicly available, experimental protocols were not standardized, and the reported results were obtained on very small data sets. We pointed out these issues in vitdet .

Therefore, the basic goal of LivDet since its birth has been to allow researchers to test their own algorithms and systems on publicly available data sets, obtained and collected with the most up-to-date fingerprint replication techniques, enabled by the experience of the Clarkson and Cagliari laboratories, both active on this problem since 2000 and 2003, respectively. At the same time, running a “competition” instead of simply releasing data sets provides free-of-charge, third-party testing on a sequestered test set. (Clarkson and Cagliari have never taken part in LivDet as competitors, due to conflict of interest.)

LivDet 2009 provided results which demonstrated the state of the art at that time ld09 for fingerprint systems. LivDet continued in 2011, 2013 and 2015 ld11 ; ld13 ; ld15 and contained two parts: evaluation of software-based systems in Part 1: Algorithms, and evaluation of integrated systems in Part 2: Systems. Fingerprint is the focus of this paper; however, LivDet 2013 also included a Part 1: Algorithms for the iris biometric ldiris2013 , which continued in 2015.

Evaluation of spoof detection for facial systems has been performed in the Competition on Counter Measures to 2-D Facial Spoofing Attacks, first held in 2011 and again in 2013. The purpose of this competition is to address different methods of detection for 2-D facial spoofing fsa . The competition dataset consisted of 400 video sequences, 200 of them real attempts and 200 attack attempts fsa . A subset was released for training and another subset of the dataset was used for testing purposes.

              2010  2011  2012  2013  2014  2015   Sum
LivDet 2009      4     6     9     2    23    21    65
LivDet 2011                  3     7    21    10    41
LivDet 2013                              16    10    26
Table 1: Number of LivDet citations on Google Scholar over the years.

During these years many works have cited the publications related to the first three LivDet competitions: 2009 ld09 , 2011 ld11 and 2013 ld13 . A quick Google Scholar search produced 65 results for 2009, 41 for 2011 and 26 for 2013. Their distribution, ordered by publication year, is shown in more detail in Table 1.[2] In Tables 2, 3 and 4 a partial list of these publications is presented.

[2] The search was last updated on October 1, 2015.

Authors Algorithm Type Performance (Average Classification Error)
J. Galbally, et al. hpfld Quality Related Features. 6.6%
E. Marasco, and C. Sansone perspmorph Perspiration and Morphology-based Static Features 12.5%
J. Galbally, et al. iqa Image Quality Assessment 8.2%
E. Marasco, and C. Sansone mtf Multiple Textural Features 12.5%
L. Ghiani, et al. expres Comparison of Algorithms N.A.
D. Gragnaniello, et al. wml Wavelet-Markov Local 2.8%
R. Nogueira, et al. convnet Convolutional Networks 3.9%
Y. Jiang, and L. Xin coomat Co-occurrence Matrix 6.8%
Table 2: Publications that cite the LivDet 2009 paper.
Authors Algorithm Type Performance (Average Classification Error)
L. Ghiani, et al. expres Comparison of Algorithms N.A.
X. Jia, et al. mslbp Multi-Scale Local Binary Pattern 7.5% and 8.9%
D. Gragnaniello, et al. lcp Local Contrast Phase Descriptor 5.7%
N. Poh, et al. lrc Likelihood Ratio Computation N.A.
A. F. Sequeira, and J. S. Cardoso fldpci Modeling the Live Samples Distribution N.A.
L. Ghiani, et al. Binarized Statistical Image Features. 7.2%
X. Jia, et al. msltp Multi-Scale Local Ternary Patterns 9.8%
G.L. Marcialis, et al. msltp Comparison of Algorithms N.A.
R. Nogueira, et al. convnet Convolutional Networks 6.5%
Y. Zhang, et al. walbp Wavelet Analysis and Local Binary Pattern 12.5%
A. Rattani, et al. osfsd Textural Algorithms N.A.
P. Johnson, and S. Schuckers porechar Pore Characteristics 12.0%
X. Jia, et al. ocsvm One-Class SVM N.A.
Y. Jiang, and L. Xin coomat Co-occurrence Matrix 11.0%
Table 3: Publications that cite the LivDet 2011 paper.
Authors Algorithm Type Performance (Average Classification Error)
C. Gottschlich, et al. hig Histograms of Invariant Gradients. 6.7%
R. Nogueira, et al. convnet Convolutional Networks 3.6%
Y. Zhang, et al. walbp Wavelet Analysis and Local Binary Pattern 2.1%
P. Johnson, and S. Schuckers porechar Pore Characteristics N.A.
Table 4: Publications that cite the LivDet 2013 paper.

3 Methods and Datasets

The LivDet competitions feature two distinct parts, Part 1: Algorithms and Part 2: Systems, with protocols designed to eliminate variability that may be present across different algorithms or systems. The protocols for each part are described in further detail in this section, along with descriptions of each dataset created through the competition.

3.1 Part 1: Algorithm Datasets

The datasets for Part 1: Algorithms change with each competition. Each competition consists of three to four datasets of live and spoof images from different devices. Eighteen datasets have been completed and made available thus far across the four competitions: fifteen fingerprint liveness datasets and three iris liveness datasets.

LivDet 2009 consisted of data from three optical sensors: Crossmatch, Identix, and Biometrika. The fingerprint images were collected using the consensual approach from three different spoof material types: gelatin, silicone, and Play-Doh. The numbers of images available can be found in ld09 . Figure 4 shows example images from the datasets.

Figure 4: Examples of spoof images of the LivDet 2009 datasets. Crossmatch (top): (a) Play-Doh, (b) gelatin, (c) silicone; Identix (middle): (d) Play-Doh, (e) gelatin, (f) silicone; Biometrika (bottom): (g) Play-Doh, (h) gelatin, (i) silicone.

The dataset for LivDet 2011 consisted of images from four different optical devices, Biometrika, Digital Persona, ItalData and Sagem. The spoof materials were gelatin, latex, ecoflex, Play-doh, silicone and wood glue. More information can be found in ld11 . Figure 5 shows images used in the database.

Figure 5: Examples of fake fingerprint images of the LivDet 2011 datasets, from Biometrika (a) latex, (b) gelatin, (c) silicone; from Digital Persona: (d) latex, (e) gelatin, (f) silicone; from Italdata: (g) latex (h) gelatin (i) silicone; from Sagem: (j) latex (k) gelatin (l) silicone.

The dataset for LivDet 2013 consisted of images from four different devices: Biometrika, Crossmatch, ItalData and Swipe. Spoofs were made from gelatin, Body Double, latex, Play-Doh, Ecoflex, Modasil, and wood glue. LivDet 2013 featured the first use of the non-cooperative method for creating spoof images, which was used for the Biometrika and ItalData datasets. More information can be found in ld13 . Figure 6 gives example images from the databases.

Figure 6: Examples of fake fingerprint images of the LivDet 2013. From Crossmatch (a) body double, (b) latex, (c) wood glue, from Biometrika (d) gelatine, (e) latex, (f) wood glue, from Italdata (g) gelatine, (h) latex, (i) wood glue, from Swipe (j) body double, (k) latex, (l) wood glue.

The dataset for LivDet 2015 consists of images from four different optical devices; Green Bit, Biometrika, Digital Persona and Crossmatch. The spoof materials were Ecoflex, gelatin, latex, wood glue, a liquid Ecoflex and RTV (a two-component silicone rubber) for the Green Bit, the Biometrika and the Digital Persona datasets, and Playdoh, Body Double, Ecoflex, OOMOO (a silicone rubber) and a novel form of gelatin for Crossmatch dataset. More information can be found in ld15 .

3.2 Part 2: Systems Submissions

Public datasets were not released from the systems collections; however, data was collected on the submitted systems. Unlike Part 1: Algorithms, where data was pre-generated before the competition, Part 2: Systems data was collected through systematic testing of each submitted system. For LivDet 2011 this consisted of 500 live attempts from 50 people (5 images for each of the R1 and R2 fingers) as well as 750 attempts with spoofs of five materials (Play-Doh, gelatin, silicone, Body Double, and latex). For LivDet 2013, 1000 live attempts were conducted as well as 1000 spoof attempts with the materials Play-Doh, gelatin, Ecoflex, Modasil, and latex. In 2015 the system was tested using three known spoof recipes, and two unknown spoof recipes were also tested to examine the flexibility of the sensor toward novel spoof methods. The known recipes were Play-Doh, Body Double, and Ecoflex; the two unknown recipes were OOMOO (a silicone rubber) and a novel form of gelatin. A total of 2011 attempts were completed: 1010 live attempts from 51 subjects (2 images of each of their 10 fingers) and 1001 spoof attempts across the five different materials, giving approximately 200 images per spoof type. In total 500 spoofs were created, one from each of 5 fingers of 20 subjects for each of the five spoof materials, and two attempts were performed with each spoof.

The submitted system needs to output a file with the collected image as well as a liveness score in the range of 0 to 100, with 100 being the maximum degree of liveness and 50 being the threshold value used to determine whether an image is live or spoof. If the system is not able to process a live subject, it is counted as a failure to enroll and counted against the performance of the system (as part of Ferrlive). However, if the system is unable to process a spoof finger, it is considered a fake non-response and counted in favor of the system's effectiveness for spoof detection.
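To make this convention concrete, the following is a minimal sketch of how such a systems transaction log could be scored. The record format, the function name, and the choice of treating scores at or above the threshold as live are illustrative assumptions, not part of the official LivDet test harness.

```python
# Minimal scoring sketch for Part 2: Systems (illustrative, not the official harness).
# Each transaction is assumed to be a (is_live_subject, acquired, score) tuple,
# with score in [0, 100] and scores >= threshold treated as "live".

def evaluate_system(transactions, threshold=50):
    live_err = live_tot = fake_err = fake_tot = 0
    for is_live, acquired, score in transactions:
        if is_live:
            live_tot += 1
            # A failure to enroll on a live subject counts against the system (Ferrlive).
            if not acquired or score < threshold:
                live_err += 1
        else:
            fake_tot += 1
            # A fake non-response (no image acquired from a spoof) counts as a correct rejection.
            if acquired and score >= threshold:
                fake_err += 1
    return {"Ferrlive": 100.0 * live_err / max(live_tot, 1),
            "Ferrfake": 100.0 * fake_err / max(fake_tot, 1)}
```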

3.3 Image Quality

Fingerprint image quality has a powerful effect on the performance of a matcher. Many commercial fingerprint systems contain algorithms to ensure that only higher quality images are accepted by the matcher; low quality images are rejected because they have been shown to degrade the performance of a matcher nist . The algorithms and systems submitted to this competition did not use a quality check to determine which images would proceed to the liveness detection protocols. By taking into account the quality of the images before applying liveness detection, a more realistic level of error can be shown.

Our methodology uses the NIST Fingerprint Image Quality (NFIQ) software to examine the quality of all fingerprints used for the competition and to examine the effects of removing lower quality fingerprint images on the submitted liveness detection protocols. NFIQ computes a feature vector from a quality image map and minutiae quality statistics as the input to a multi-layer perceptron neural network classifier nist . The quality of the fingerprint is determined from the neural network output and is assigned on a scale from 1 (highest quality) to 5 (lowest quality).
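As an illustration of this screening step, the sketch below filters a folder of fingerprint images by NFIQ level before liveness testing. It assumes that the NBIS `nfiq` command-line tool is installed and prints the 1-5 quality value as the first token of its output; the exact invocation and the supported image formats depend on the NBIS build, so this is a sketch rather than a definitive recipe.

```python
# Quality-screening sketch: keep only images whose NFIQ level (1 = best, 5 = worst)
# is at or below a chosen cutoff. Assumes an `nfiq` executable that prints the
# quality value for a single image; adjust the call to match your NBIS install.
import subprocess
from pathlib import Path

def nfiq_level(image_path):
    out = subprocess.run(["nfiq", image_path], capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0])

def keep_higher_quality(image_dir, worst_allowed=3):
    """Return the image paths whose NFIQ level is at or below `worst_allowed`."""
    return [str(p) for p in sorted(Path(image_dir).glob("*"))
            if p.is_file() and nfiq_level(str(p)) <= worst_allowed]
```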

3.4 Performance Evaluation

The parameters adopted for the performance evaluation are the following:

  • Ferrlive: Rate of misclassified live fingerprints.

  • Ferrfake: Rate of misclassified fake fingerprints.

  • Average Classification Error (ACE): the mean of the two error rates, ACE = (Ferrlive + Ferrfake) / 2 (see the sketch after this list).

  • Equal Error Rate (EER): Rate at which Ferrlive and Ferrfake are equal.

  • Accuracy: Rate of correctly classified live and fake fingerprints at a 0.5 threshold.
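For reference, the sketch below shows one way these quantities can be computed from raw scores. It assumes each algorithm outputs a liveness score in [0, 100], that scores at or above 50 are classified as live, and that ACE is taken as the mean of Ferrlive and Ferrfake; the names and the threshold-sweeping EER approximation are illustrative.

```python
import numpy as np

def ferrlive(live_scores, threshold=50):
    return 100.0 * np.mean(np.asarray(live_scores, dtype=float) < threshold)   # live rejected as fake

def ferrfake(fake_scores, threshold=50):
    return 100.0 * np.mean(np.asarray(fake_scores, dtype=float) >= threshold)  # fake accepted as live

def ace(live_scores, fake_scores, threshold=50):
    return 0.5 * (ferrlive(live_scores, threshold) + ferrfake(fake_scores, threshold))

def accuracy(live_scores, fake_scores, threshold=50):
    live = np.asarray(live_scores, dtype=float)
    fake = np.asarray(fake_scores, dtype=float)
    correct = np.sum(live >= threshold) + np.sum(fake < threshold)
    return 100.0 * correct / (live.size + fake.size)

def eer(live_scores, fake_scores):
    # Sweep the threshold and return the error rate where Ferrlive and Ferrfake are closest.
    thresholds = np.linspace(0.0, 100.0, 1001)
    diffs = [abs(ferrlive(live_scores, t) - ferrfake(fake_scores, t)) for t in thresholds]
    t = thresholds[int(np.argmin(diffs))]
    return 0.5 * (ferrlive(live_scores, t) + ferrfake(fake_scores, t))
```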

3.5 Specific challenges

In the last two editions of the competition specific challenges were introduced. Two of the 2013 datasets, unlike all the other cases, contain spoofs that were collected using latent fingerprints. The 2015 edition had two new components: (1) the testing set included images from two kinds of spoof materials which were not present in the training set in order to test the robustness of the algorithms with regard to unknown attacks, and (2) one of the data sets was collected using a 1000 dpi sensor.

3.5.1 LivDet 2013 – Consensual vs. Semi-Consensual

In the consensual method the subject pushed his or her finger into a malleable material such as silicon gum, creating a negative impression of the fingerprint as a mold. The mold was then filled with a material such as gelatin. The “semi-consensual method” consisted of enhancing a latent fingermark pressed on a surface and digitizing it through the use of a common scanner.[3] Then, through a binarization process and with an appropriate threshold choice, the binarized image of the fingerprint was obtained. The thinning stage reduced the line thickness to one pixel, obtaining the skeleton of the fingerprint negative. This image was printed on a transparency sheet in order to obtain the mold. A gelatin or silicone material was dripped over this image and, after solidification, separated and used as a fake fingerprint.

[3] Obviously all subjects were fully aware of this process and gave full consent to replicate their fingerprints from their latent marks.

The consensual method leads to an almost perfect copy of a live finger, whose mark on a surface is difficult to recognize as a fake except by an expert dactyloscopist. On the other hand, the spoof created by the semi-consensual (or non-consensual) method is much less similar: in a latent fingerprint many details are lost, and the skeletonization process further deteriorates the spoof quality, making it easier to distinguish a live finger from a fake. However, while it could be hard to convince someone to leave the cast of a finger, it is potentially much easier to obtain one of their latent fingerprints. The spoof images in the Biometrika and Italdata 2013 datasets were created by printing the negative image on a transparency sheet. As we will see in the next section, the error rates for these datasets are, as would be expected, lower than those of the other datasets.

3.5.2 LivDet 2015 – Hidden Materials and 500 vs 1000 dpi

As already stated, the testing sets of LivDet 2015 included spoof images of never-seen-before materials. These materials were liquid Ecoflex and RTV for the Green Bit, Biometrika and Digital Persona datasets, and OOMOO and gelatin for the Crossmatch dataset. Our aim was to assess the reliability of the algorithms against unknown attacks: in a realistic scenario, the material used to attack a biometric system must be considered unknown, and a liveness detector should be able to deal with any kind of spoof material.

Another peculiarity of the 2015 edition was the presence of the Biometrika HiScan-PRO, a sensor with a resolution of 1000 dpi instead of the 500 dpi of most of the datasets used so far in the competition. It is reasonable to hypothesize that doubling the image resolution should benefit the feature extraction phase and, in turn, the final performance. The results shown in the next section do not confirm this hypothesis.

4 Examination of Results

In this Section, we analyze the experimental results for the four LivDet editions. Results show the growth and improvement across the four competitions.

4.1 Trends of Competitors and Results for Fingerprint Part 1: Algorithms

The number of competitors for Fingerprint Part 1: Algorithms has increased over the years. LivDet 2009 received a total of 4 algorithm submissions. LivDet 2011 saw a slight decrease, with only 3 organizations submitting algorithms; however, LivDet 2013 and LivDet 2015 were the largest of the competitions, with eleven submitted algorithms from nine participants in the former and submissions from ten participants in the latest edition. Submissions for each LivDet are detailed in Table 5.

Participants LivDet 2009 Algorithm Name
Dermalog Identification Systems GmbH Dermalog
Universidad Autonoma de Madrid ATVS
Anonymous Anonymous
Anonymous2 Anonymous2

Participants LivDet 2011 Algorithm Name
Dermalog Identification Systems GmbH Dermalog
University of Naples Federico II Federico
Chinese Academy of Sciences CASIA

Participants LivDet 2013 Algorithm Name
Dermalog Identification Systems GmbH Dermalog
Universidad Autonoma de Madrid ATVS
HangZhou JLW Technology Co Ltd HZ-JLW
Federal University of Pernambuco Itautec
Chinese Academy of Sciences CAoS
University of Naples Federico II (algorithm 1) UniNap1
University of Naples Federico II (algorithm 2) UniNap2
University of Naples Federico II (algorithm 3) UniNap3
First Anonymous participant Anonym1
Second Anonymous participant Anonym2
Third Anonymous participant Anonym3

Participants LivDet 2015 Algorithm Name
Instituto de Biociencias, Letras e Ciencias Exatas COPILHA
Institute for Infocomm Research (I2R) CSI
Institute for Infocomm Research (I2R) CSI_MM
Dermalog hbirkholz
Universidade Federal de Pernambuco hectorn
Anonymous participant anonym
Hangzhou Jinglianwen Technology Co., Ltd jinglian
Universidade Federal Rural de Pernambuco UFPE I
Universidade Federal Rural de Pernambuco UFPE II
University of Naples Federico II unina
New York University nogueira
Zhejiang University of Technology titanz

Table 5: Participants for Part 1: Algorithms.

This increase in participants shows the growth of interest in the topic, which has been coupled with a general decrease in error rates.

First of all, the two best algorithms for each competition, in terms of performance, are detailed in Table 6 based on the average error rate across the datasets where “Minimum Average” error rates are the best results and “Second Average” are the second best results.

2009
Minimum Avg Ferrlive Minimum Avg Ferrfake Second Avg Ferrlive Second Avg Ferrfake
13.2% 5.4% 20.1% 9.0%

2011
Minimum Avg Ferrlive Minimum Avg Ferrfake Second Avg Ferrlive Second Avg Ferrfake
11.8% 24.8% 24.5% 24.8%

2013
Minimum Avg Ferrlive Minimum Avg Ferrfake Second Avg Ferrlive Second Avg Ferrfake
11.96% 1.07% 17.64% 1.10%

2015
Minimum Avg Ferrlive Minimum Avg Ferrfake Second Avg Ferrlive Second Avg Ferrfake
5.13% 2.79% 6.45% 4.26%

Table 6: Two best error rates for each competition. A positive trend can be noticed in terms of both the Ferrlive and Ferrfake parameters. In particular, 2011 and 2015 presented very difficult tasks due to the high quality of the fake fingerprint images; they should therefore be taken as a reference of current liveness detector performance in the “worst scenario”, that is, the high quality reproduction of a subject’s fingerprint.

There is a stark difference between the results seen from LivDet 2009 to LivDet 2015. LivDet 2009 to LivDet 2011 did not see much decrease in error, whereas LivDet 2013 and LivDet 2015 each decreased in error with respect to the previous competition.

The mean values of the ACE (Average Classification Error) over all the participants, calculated for each dataset, confirm this trend. Mean and standard deviation are shown in Table 7.

2009
Mean Std. Dev.
Identix 8.27 4.65
Crossmatch 15.59 5.60
Biometrika 32.59 9.64
2011
Mean Std. Dev.
Biometrika 31.30 10.25
ItalData 29.50 9.42
Sagem 16.70 5.33
Digital Persona 23.47 13.70
2013
Mean Std. Dev.
Biometrika 7.32 8.80
Italdata 12.25 18.08
Swipe 16.67 15.30
2015
Mean Std. Dev.
GreenBit 11.47 7.10
Biometrika 15.81 7.12
DigitalPersona 14.89 13.72
Crossmatch 14.65 10.28
Table 7: Mean and standard deviation ACE values for each dataset of the competition.

The standard deviation values range between 5 and 18% depending on the dataset and competition edition. Mean ACE values confirm the error increase in 2011 due to the high quality of casts and fake materials. The low values in 2013 for Biometrika and Italdata are due, as stated before, to the use of latent fingerprints in the spoof creation process, which creates lower quality spoofs that are easier to detect. In order to confirm this, we compared these values with those obtained for the same sensors in 2011 (see Table 8). The last two rows of Table 8 report the average classification error and the related standard deviation over the above sets. Obviously, further and independent experiments are needed because the participants in 2011 and 2013 were different, so the algorithms are likely different as well. However, the results highlight the performance virtually achievable in two scenarios: a sort of “worst case” (LivDet 2011), where the quality of spoofs is very high, and a sort of “realistic case” (LivDet 2013), where spoofs are created from latent marks, as one may expect in practice. The fact that even in this case the average error remains close to 10% (see Table 8), whilst the standard deviation does not differ much from that of LivDet 2011, should not be underestimated. The improvement is likely due to the different way of creating the fakes.

Consensual Semi-consensual
(LivDet 2011) (LivDet 2013)
Biometrika 31.30 7.32
ItalData 29.50 12.25
Mean 30.40 9.78
Standard Deviation 1.27 3.49
Table 8: Comparison between mean ACE values for Biometrika and Italdata datasets from LivDet 2011 and 2013.

The abnormally high values for the LivDet 2013 Crossmatch dataset were due to an anomaly in the live data: the live images were difficult for the algorithms to recognize. All data was collected in the same time frame, and the data in the training and testing sets were determined randomly from the data collected. A follow-up test was conducted using benchmark algorithms at the University of Cagliari and Clarkson University, which produced scores similar to those of the submitted algorithms; the initial results had an EER of 41.28%. The data was further tested with 1000 iterations of train/test dataset generation using a random selection of images for the live training and test sets (with no common subjects between the sets, which is a requirement for all LivDet competitions). This provided the new error rates shown in Table 9 and Figures 7 and 8.

Average Error Rate Standard Deviation
FerrLive 7.57% 2.21%
FerrFake 13.41% 1.95%
Equal Error Rate 9.92% 1.42%
Table 9: Crossmatch 2013 Error Rates across 1000 Tests
Figure 7: FerrLive Rates across 1000 tests for Crossmatch 2013.
Figure 8: FerrFake Rates across 1000 tests for Crossmatch 2013.

Examining these results allows us to conclude that the original selection of subjects was an anomaly that caused improper error rates, because every other iteration, even when only a single subject changed, dropped FerrLive error rates to 15% and below. Data will be more closely examined in future LivDet competitions in order to counteract this problem, with data being processed on a benchmark algorithm before being given to participants. The solution for the LivDet 2013 data going forward is to rearrange the training and test sets for future studies using this data; researchers will need to be clear about which training/test split they used in their study. For this reason we removed the results obtained with the Crossmatch 2013 dataset from the experimental results.

LivDet 2015 error rates confirm a decreasing trend with respect to attacks made up of high quality spoofs, with all the mean values between 11% and 16%. These results are summarized in Figure 9.

Figure 9: LivDet results over the years: near red colors for 2009, near green for 2011, near magenta for 2013 and near blue for 2015.

Reported error rates suggest a slow, but steady advancement in the art of liveness and artifact detection. This gives supporting evidence that the technology is evolving and learning to adapt and overcome the presented challenges.

Comparing the performance on the LivDet 2015 datasets (as shown in Table 7 and in more detail in ld15 ), two other important remarks can be made: the higher resolution of the Biometrika sensor did not necessarily yield the best classification performance, while the small size of the images from the Digital Persona device generally degraded the accuracy of all algorithms.

The DET (Detection Error Tradeoff) curves in Figures 10 (a), 11 (a), 12 (a), 13 (a) show the performance of three of the four best algorithms for LivDet 2015 sorted by ACE (nogueira, unina and anonym). Unfortunately, we could not plot the performance of the jinglian algorithm because it output only two possible values, namely 0 or 100; with no intermediate values it was impossible to obtain different Ferrlive and Ferrfake values by varying the threshold and thus to produce a DET curve. More discussion of the fusion results in Figures 10 (b), 11 (b), 12 (b), 13 (b) is provided further below.
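For readers who wish to reproduce such curves from raw scores, a DET curve can be traced by sweeping the decision threshold and recording the (Ferrlive, Ferrfake) pair at each value; the sketch below assumes scores in [0, 100] and is illustrative rather than the exact plotting code used here. An algorithm that outputs only 0 or 100, like the one mentioned above, collapses to a single point instead of a curve.

```python
import numpy as np

def det_points(live_scores, fake_scores, n_points=1001):
    """Return (Ferrlive %, Ferrfake %) pairs obtained by sweeping the threshold."""
    live = np.asarray(live_scores, dtype=float)
    fake = np.asarray(fake_scores, dtype=float)
    points = []
    for t in np.linspace(0.0, 100.0, n_points):
        points.append((100.0 * np.mean(live < t),    # Ferrlive at threshold t
                       100.0 * np.mean(fake >= t)))  # Ferrfake at threshold t
    return points   # usually plotted on normal-deviate axes to obtain the DET curve
```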

(a)
(b)
Figure 10: LivDet 2015 Greenbit dataset: DET curves of the nogueira, unina and anonym algorithms (a), DET curves of the fusion of the three classifiers (b).
(a)
(b)
Figure 11: LivDet 2015 Biometrika dataset: DET curves of the nogueira, unina and anonym algorithms (a), DET curves of the fusion of the three classifiers (b).
(a)
(b)
Figure 12: LivDet 2015 Digital Persona dataset: DET curves of the nogueira, unina and anonym algorithms (a), DET curves of the fusion of the three classifiers (b).
(a)
(b)
Figure 13: LivDet 2015 Crossmatch dataset: DET curves of the nogueira, unina and anonym algorithms (a), DET curves of the fusion of the three classifiers (b).

Another important indicator of an algorithm's validity is the Ferrfake value calculated when Ferrlive is at most 1%. This value represents the percentage of spoofs able to get into the system when the rate of legitimate users that are rejected is no more than 1%. As a matter of fact, by varying the threshold, different Ferrfake and Ferrlive values are obtained: as the threshold grows from 0 to 100, Ferrfake decreases and Ferrlive increases. Obviously the Ferrlive value must be kept low to minimize the inconvenience to authorized users but, just as important, a low Ferrfake value limits the number of unauthorized users able to enter the system.
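One possible way of computing this operating point from raw scores is sketched below: the threshold is raised until no more than 1% of the live attempts are rejected, and Ferrfake is reported at that threshold. The function name is illustrative, and ties between live scores are handled only approximately.

```python
import numpy as np

def ferrfake_at_max_ferrlive(live_scores, fake_scores, max_ferrlive=1.0):
    live = np.sort(np.asarray(live_scores, dtype=float))
    fake = np.asarray(fake_scores, dtype=float)
    # Highest threshold that rejects at most max_ferrlive percent of the live attempts:
    # only the k lowest live scores can fall below it.
    k = int(np.floor(live.size * max_ferrlive / 100.0))
    threshold = live[min(k, live.size - 1)]
    return 100.0 * np.mean(fake >= threshold)   # spoofs still accepted as live
```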

Results in Tables 10, 11, 12 and 13 show that even the best performing algorithm (nogueira) is not yet good enough, since when Ferrlive is at most 1% its Ferrfake values (testing on all materials) range from 2.66% to 19.10%. These are the percentages of unauthorized users that the system is unable to correctly classify. If we consider only the unknown materials, the results are even worse, ranging from 3.69% to 25.30%.

On the basis of such results, we can say that none of the analyzed algorithms is able to generalize against never-seen-before spoofing attacks. We observed a performance drop, and we also found that the size of the drop is unpredictable, as it depends on the material. This should be a matter of future discussion.

all materials known materials unknown materials
unina 7.42 2.47 14.49
nogueira 2.66 1.94 3.69
anonym 18.61 10.75 29.82
average 9.56 5.05 16.00
std. dev. 8.19 4.94 13.13
Table 10: Ferrfake values of the best algorithms, calculated when Ferrlive is at most 1%, for the Crossmatch dataset.
all materials known materials unknown materials
unina 51.30 50.85 52.20
nogueira 19.10 16.00 25.30
anonym 80.83 79.25 84.00
average 50.41 48.70 53.83
std. dev. 30.87 31.68 29.38
Table 11: Ferrfake values of the best algorithms, calculated when Ferrlive is at most 1%, for the Digital Persona dataset.
all materials known materials unknown materials
unina 41.80 36.10 53.20
nogueira 17.90 15.15 23.40
anonym 75.47 75.25 75.90
average 45.06 42.17 50.83
std. dev. 28.92 30.51 26.33
Table 12: Ferrfake values of the best algorithms, calculated when Ferrlive is at most 1%, for the Green Bit dataset.
all materials known materials unknown materials
unina 11.60 7.50 19.80
nogueira 15.20 12.60 20.40
anonym 48.40 44.05 57.10
average 25.07 21.38 32.43
std. dev. 20.29 19.79 21.36
Table 13: Ferrfake values of the best algorithms, calculated when Ferrlive is at most 1%, for the Biometrika dataset.

On the other hand, the algorithms may be complementary, that is, one algorithm may be able to detect spoofs that another one misses. Accordingly, Figures 14, 15, 16 and 17 show, for each dataset, the 2D and 3D spread plots generated using three of the four best algorithms. In the 3D plots, each point's coordinates correspond to the three scores obtained by the three algorithms for the same image. In the 2D plots the same results are presented using two algorithms at a time, for a total of three 2D plots for each 3D plot. These plots give an idea of the correlation level of the spoof detection ability of the selected algorithms. As a matter of fact, it appears that in many doubtful cases the uncertainty of one classifier does not correspond to that of the others. For cases with high correlation, the points in the graph would lie more or less along a diagonal line. On these plots, however, the values are well spread out, which indicates a lower correlation between the algorithms and thus that they may be complementary.
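The degree of complementarity suggested by these plots can also be quantified numerically, for example through the correlation between the scores two algorithms assign to the same images; the short sketch below (with illustrative names, not part of the LivDet protocol) computes this for every pair of algorithms.

```python
import numpy as np
from itertools import combinations

def score_correlation(scores_a, scores_b):
    """Pearson correlation between two algorithms' scores on the same images."""
    return float(np.corrcoef(np.asarray(scores_a, dtype=float),
                             np.asarray(scores_b, dtype=float))[0, 1])

def pairwise_correlations(score_table):
    """score_table: dict mapping algorithm name -> list of scores, aligned per image."""
    return {(a, b): score_correlation(score_table[a], score_table[b])
            for a, b in combinations(sorted(score_table), 2)}
```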

(a)
(b)
(c)
(d)
Figure 14: 2D and 3D spread plots for the Greenbit dataset: (a) unina - nogueira, (b) nogueira - anonym, (c) unina - anonym, (d) nogueira - unina - anonym.
(a)
(b)
(c)
(d)
Figure 15: 2D and 3D spread plots for the Biometrika dataset: (a) unina - nogueira, (b) nogueira - anonym, (c) unina - anonym, (d) nogueira - unina - anonym.
(a)
(b)
(c)
(d)
Figure 16: 2D and 3D spread plots for the Digital Persona dataset: (a) unina - nogueira, (b) nogueira - anonym, (c) unina - anonym, (d) nogueira - unina - anonym.
(a)
(b)
(c)
(d)
Figure 17: 2D and 3D spread plots for the Crossmatch dataset: (a) unina - nogueira, (b) nogueira - anonym, (c) unina - anonym, (d) nogueira - unina - anonym.

Since there is evidence that the submitted algorithms may be complementary, we applied three fusion techniques at the score level, again using three of the four best algorithms: the sum of the a posteriori probabilities generated by the three algorithms, the product of the same a posteriori probabilities, and the majority vote (each fingerprint image is classified as live or fake if the majority of the individual algorithms rank it as live or fake). These three fusion techniques have very different characteristics: the product tends to collapse toward a negative classification (i.e. pattern rejection) rather than a positive one, because of the zero-product property; the sum has the opposite characteristic, namely it smooths the differences (like a low-pass filter); finally, the majority vote emphasizes where classifiers agree. In order to show the pros and cons of these classifier fusions we also plot the curve of the so-called “Oracle response”. The Oracle is an ideal fusion technique that correctly classifies a pattern if that pattern is correctly classified by at least one of the single classifiers; it indicates the maximum accuracy that can potentially be achieved. Obviously, the more complementary the results of the various classifiers, the better the results of the oracle.
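The three fusion rules and the oracle can be written compactly as in the sketch below, under the assumption that each algorithm's scores have been normalized to posterior probabilities of the live class in [0, 1], with 0.5 as the decision threshold; function names are illustrative.

```python
import numpy as np

def fuse_sum(probs):
    """probs: array of shape (n_algorithms, n_images) of live posteriors."""
    return np.mean(np.asarray(probs, dtype=float), axis=0)   # averaging smooths disagreements

def fuse_product(probs):
    return np.prod(np.asarray(probs, dtype=float), axis=0)   # one low score pulls the result toward "fake"

def fuse_majority(probs, threshold=0.5):
    votes = np.asarray(probs, dtype=float) >= threshold
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(float)   # 1 = live, 0 = fake

def oracle_accuracy(probs, is_live, threshold=0.5):
    """Percentage of images classified correctly by at least one individual algorithm."""
    preds = np.asarray(probs, dtype=float) >= threshold
    hit_any = np.any(preds == np.asarray(is_live, dtype=bool), axis=0)
    return 100.0 * np.mean(hit_any)   # upper bound for any fusion of these classifiers
```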

The DET curves in Figures 10 (b), 11 (b), 12 (b), 13 (b), compared with those of the single classifiers, clearly show an improvement, especially when using the sum rule. As stated before, the sum smooths the differences: if score values are balanced (that is, x for accept and 1-x for reject), the sum rule emphasizes where classifiers agree, much as the majority vote does. Moreover, since it averages the results, small doubts of two (wrong) classifiers may be counterbalanced by a strong response of the (correct) third, leading to better results. The only case in which the performance declines is the Digital Persona dataset. Although fusion at the score level appears to be a good way to improve performance, current results suggest that the individual algorithms must be carefully selected and that appropriate combination methodologies should be studied as well.

4.2 Trends of Competitors and Results for Fingerprint Part 2: Systems

Fingerprint Part 2: Systems has not yet shown growth in the number of competitors, but it has only existed for three competitions thus far. There are more limitations to the systems testing portion: it is not surprising that there are not great numbers of entrants given the need for a full fingerprint recognition device to be built with an integrated liveness detection module. There has also been a general lack of interest from companies in shipping full systems for testing, and it appears that there is more comfort in submitting an algorithm. Both LivDet 2011 and LivDet 2013 had two submissions, whilst LivDet 2015 had only one. Information about competitors is shown in Table 14.

Participants LivDet 2011 Algorithm Name
Dermalog Identification Systems GmbH Dermalog
GreenBit GreenBit

Participants LivDet 2013 Algorithm Name
Dermalog Identification Systems GmbH Dermalog
Anonymous Anonymous1

Participants LivDet 2015 Algorithm Name
Anonymous Anonymous2

Table 14: Participants for Part 2: Systems.

This portion of the LivDet competitions is notable for the rapid decrease in error rates. In the span of two years, the best results from LivDet 2011 were worse than the worst results of LivDet 2013. Thus, the systems tests showed a quicker decrease in error rates, and one of the systems submitted in 2013 had lower error rates than any algorithm submitted to LivDet.

In 2011, Dermalog performed at a FerrLive of 42.5% and a FerrFake of 0.8%. GreenBit performed at a FerrLive of 38.8% and a FerrFake of 39.47%. Both systems had high FerrLive scores.

The 2013 edition produced much better results since Dermalog performed at a FerrLive of 11.8% and a FerrFake of 0.6%. Anonymous1 performed at a FerrLive of 1.4% and a FerrFake of 0.0%. Both systems had low FerrFake rates. Anonymous1 received a perfect score of 0.0% error, successfully determining every spoof finger presented as a spoof.

Anonymous2, in 2015, scored a FerrLive of 14.95% and a FerrFake of 6.29% at the given threshold of 50 (Table 15), showing an improvement over the general results seen in LivDet 2011; however, the anonymous system did not perform as well as what was seen in LivDet 2013. There is an 11.09% FerrFake for known recipes and 1% for unknown recipes (Table 15). This result is the opposite of what had been seen in previous LivDet competitions, where known spoof types typically perform better than unknown spoof types. The error rate for spoof materials was primarily due to color differences in the Play-Doh. Testing across six different colors of Play-Doh found that certain colors behaved in different ways. For yellow and white Play-Doh, the system detected spoofs as fake with high accuracy. For brown and black Play-Doh, the system would not collect an image; these cases were therefore recorded as fake non-responses and not as errors in spoof detection. For pink and lime green Play-Doh, the system incorrectly accepted spoofs as live for almost 100% of the images collected. The fact that almost all pink and lime green Play-Doh images were accepted as live resulted in a 28% total error rate for Play-Doh. The system had a 6.9% fake non-response rate, primarily due to brown and black Play-Doh. This is the first LivDet competition in which the color of Play-Doh has been examined in terms of error rates.

Examining the trends over the three competitions shows that since 2011 there has been a downward trend in error rates for the systems. FerrLive in 2015, while higher than in 2013, is drastically lower than in 2011. FerrFake has had similar error rates over the years. While the 2015 competition showed a 6.29% FerrFake, the majority of that error stems from the Play-Doh material, particularly the pink and lime colors; discounting the errors seen with Play-Doh, FerrFake is below 2% for the 2015 system. The error rates for the winning systems over the years are shown in Figure 18 and Figure 19.

Submitted System FerrLive FerrFake
Anonymous 14.95% 6.29%

Submitted System FerrFake (Known) FerrFake (Unknown)
Anonymous 11.09% 1.00%
Table 15: FerrLive and FerrFake for the submitted system in LivDet 2015.
Figure 18: FerrFake for winning systems by year.
Figure 19: FerrLive for winning systems by year.

5 Quality analysis: A Lesson Learned from the LivDet Experience

This section explores the impact of spoof quality on algorithm performance.

Over a time span of ten years, the researchers in the respective groups have searched for materials suitable for creating fingerprint spoofs by experimenting with a very large number of materials and related variants. This process is performed independently of the feature extraction algorithms and the classification methods adopted. In general, it is difficult to create a spoof fingerprint that gives good quality fingerprint images on a consistent basis. The quality analysis here reflects this experience.

With regard to the observation above, Tables 16 and 17 report a summary of our research. They include a number of materials used as cast and/or mold. Table 16 summarizes the subjective evaluation made by our staff, over ten years of experience and more than 50000 spoofs produced, of the visual quality of the fake fingerprints obtained when combining the mold material (columns) and the cast material (rows). The visual quality assessment was made on the basis of an analysis of the spoofs under the microscope and of the quality of the images acquired by the capture devices. We found that certain materials are less suitable than others: several of them are not able to replicate the ridge and valley flow without adding evident artifacts, such as bubbles, and without altering the ridge edges.

Mold/Material Foil RTV Modasil Plasticine
Alginat 5 5
RTV 2 2 4
Ecoflex 4 1 3
Gelatine 2 1 3
Latex 2 1 3
Modasil 2 2 4
Transparent Silicone 3 5
White Silicone 2
Wood glue 2 1 3
SILIGUM 2 5
Table 16: Quality degree in the spoof fabrication process: 1-Very High, 2-High, 3-Medium, 4-Low, 5-Very Low.
Mold/Material Foil RTV Modasil Plasticine
Alginat 5 5
RTV 1 4 3
Ecoflex 1 4 4
Gelatine 1 4 3
Latex 1 4 3
Modasil 1 4 3
Transparent Silicone 1 3
White Silicone 1
Wood glue 1 4 4
SILIGUM 1 3
Table 17: Easiness degree in the fake fabrication process: 1-Very High, 2-High, 3-Medium, 4-Low, 5-Very Low.

Table 17 shows the subjective evaluation of the ease of obtaining a good spoof from the combination of the same materials as in Table 16. This evaluation depends on the solidification time of the adopted material, the difficulty of separating mold and cast without destroying one or both of them, and the natural dryness or wetness of the resulting spoof.

Tables 16 and 17 show that, from a practical viewpoint, many materials are difficult to handle when fabricating a fake finger. In many cases, the materials with this property also exhibit a low subjective quality level.

Therefore, thanks to this lesson, the LivDet competitions challenge participants with images coming from spoofs obtained with the best and most “potentially dangerous” materials. The choice of materials is made on the basis of the best trade-off between the criteria pointed out in Tables 16 and 17 and the objective quality values output by quality assessment algorithms such as NFIQ.

These observations have been confirmed across the four LivDet editions. In particular, the FerrFake and FerrLive rates for the different quality levels support the idea that image quality is correlated with the error rates. As an example, the error rates for each range of quality levels for Dermalog in LivDet 2011 Fingerprint Part 1: Algorithms are shown in Figure 20. The graphs range from images of only quality level 1 up to images of all quality levels. As lower quality spoof images were added, FerrFake generally decreased. When all images were included, including the worst quality ones, the error rates were less consistent, likely due to the variability in low quality spoofs.
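The cumulative breakdown behind such graphs can be sketched as follows, assuming a hypothetical per-image record of (NFIQ level, live/spoof label, liveness score); the record format and names are illustrative, not the exact analysis code used for Figure 20.

```python
def error_rates_by_quality(records, threshold=50):
    """records: iterable of (nfiq_level, is_live, score) tuples, score in [0, 100]."""
    records = list(records)
    rates = {}
    for worst_allowed in range(1, 6):
        # Progressively include lower quality images (NFIQ 1 = best, 5 = worst).
        subset = [r for r in records if r[0] <= worst_allowed]
        live = [s for q, is_live, s in subset if is_live]
        fake = [s for q, is_live, s in subset if not is_live]
        rates[worst_allowed] = (
            100.0 * sum(s < threshold for s in live) / max(len(live), 1),   # FerrLive
            100.0 * sum(s >= threshold for s in fake) / max(len(fake), 1),  # FerrFake
        )
    return rates   # {worst NFIQ level included: (FerrLive %, FerrFake %)}
```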

(a)
(b)
Figure 20: Error rates by quality level.

The percentages of images at each quality level for two representative datasets from LivDet 2011, 2013, and 2015, respectively, are given in Figures 21, 22, and 23. The Crossmatch dataset had high percentages of its data in the top two quality levels in both LivDet 2011 and 2013. The Swipe dataset had many images that were rated as lower quality, which is reflected in the data itself because of the difficulty of collecting spoof data on the swipe device.

Figure 21: Percentages of Images per Quality Level for LivDet 2011.
Figure 22: Percentages of images per Quality Level for LivDet 2013.
Figure 23: Percentages of images per Quality Level for LivDet 2015.

6 Conclusions

Since its first edition in 2009, the Fingerprint Liveness Detection Competition has aimed to offer research centres and companies a fair and independent assessment of their anti-spoofing algorithms and systems.

We have seen over time an increasing interest in this event, and general recognition of the enormous amount of data made publicly available. The number of citations that the LivDet competitions have collected is one of the tangible signs of such interest (about 100 citations according to Google Scholar) and further demonstrates the benefits that the scientific community has received from the LivDet events.

The competition results show that liveness detection algorithms and systems have strongly improved their performance: from about 70% classification accuracy achieved in LivDet 2011 to 90% classification accuracy in LivDet 2015. This result, obtained under very difficult conditions such as the consensual methodology of fingerprint replication, is comparable with that obtained in LivDet 2013 (first two data sets), where the algorithms were tested under the easier task of fingerprint replication from latent marks. Moreover, the two challenges characterizing the last edition, namely the presence of a 1000 dpi capture device and the evaluation against “unknown” spoofing materials, further contributed to showing the improvement that researchers have achieved on these issues: submitted algorithms performed very well on both 500 and 1000 dpi capture devices, and some of them also exhibited a good degree of robustness against never-seen-before attacks. The results reported on fusion also show that liveness detection could further benefit from the combination of multiple features and approaches. A specific section on algorithm and system fusion might be explicitly added to a future LivDet edition.

There is a dark side of the Moon, of course. It is evident that, despite the remarkable results reported in this paper, there is a clear need for further improvement. The current performance of most submissions is not yet good enough to embed a liveness detection algorithm into a fingerprint verification system: the error rate is still too high for many real applications. In the authors’ opinion, discovering and explaining the benefits and limitations of the currently used features is still an issue whose solution should be encouraged, because only a full understanding of the physical process which leads to the finger’s replica, and of what the feature extraction process exactly does, will shed light on the characteristics most useful for classification. We are aware that this is a challenging task, and many years could pass before seeing concrete results. However, we believe this could be the next challenge for a future edition of LivDet, the Fingerprint Liveness Detection Competition.

Acknowledgements

The first and second author had equal contributions to the research. This work has been supported by the Center for Identification Technology Research and the National Science Foundation under Grant No. 1068055, and by the project “Computational quantum structures at the service of pattern recognition: modeling uncertainty” [CRP-59872] funded by Regione Autonoma della Sardegna, L.R. 7/2007, Bando 2012.

References
