Biometric Quality: Review and Application to Face Recognition with FaceQnet

06/05/2020 ∙ by Javier Hernandez-Ortega, et al. ∙ 0

"The output of a computerised system can only be as accurate as the information entered into it." This rather trivial statement is the basis of one of the driving concepts in biometric recognition: biometric quality. Quality is nowadays widely regarded as the number one factor responsible for the good or bad performance of automated biometric systems. It refers to the ability of a biometric sample to be used for recognition purposes and produce consistent, accurate, and reliable results. Such a subjective term is objectively estimated by the so-called biometric quality metrics. These algorithms nowadays play a pivotal role in the correct functioning of systems, providing feedback to the users and working as invaluable audit tools. In spite of their unanimously accepted relevance, some of the most used and deployed biometric characteristics are lagging behind in the development of these methods. This is the case of face recognition. After a gentle introduction to the general topic of biometric quality and a review of past efforts in face quality metrics, in the present work we address the need for better face quality metrics by developing FaceQnet. FaceQnet is a novel open-source face quality assessment tool, inspired and powered by deep learning technology, which assigns a scalar quality measure to facial images as a prediction of their recognition accuracy. Two versions of FaceQnet have been thoroughly evaluated both in this work and independently by NIST, showing the soundness of the approach and its competitiveness with respect to current state-of-the-art metrics. Even though our work is presented here in the framework of face biometrics, the proposed methodology for building a fully automated quality metric can be very useful and easily adapted to other artificial intelligence tasks.




I Introduction

“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’… I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” - Charles Babbage, Passages from the Life of a Philosopher, 1864.

“Garbage In, Garbage Out.” The well-known computer science GIGO principle summarises, in a very efficient and graphic manner, the pivotal role that the soundness of the input data plays in the meaningfulness of the output of any computerised system. As in any other area of science, only nonsense conclusions can be expected from flawed premises, and automated digital systems are no exception to this rule.

In essence, the GIGO principle establishes a direct link between the reliability of the input and the output of a system. Therefore, it is easy to infer the huge advantages that would be brought about by a tool capable of assessing the robustness and accuracy of the input data to a specific automated system. Properly utilised, such a tool would have a major impact on the performance of the system, and on the ability of users to interpret its results, based on an objective measure of their consistency. In the field of biometrics, these invaluable tools are the so-called biometric quality metrics.

In biometrics, the general GIGO principle has been translated into the concept referred to as biometric quality. Fundamentally, the simple idea underlying biometric quality is that, if the biometric samples given as input to an automated recognition system are of low quality, unreliable and inaccurate results will be generated. Conversely, if the acquired biometric samples are of high quality, low error rates will be achieved.

The previous biometric quality statement leads to a foregone conclusion: high quality samples are preferable to low quality samples. However, such a seemingly trivial assertion raises one immediate fundamental question: How can biometric quality be measured so that we are capable of selecting high quality samples over low quality samples? That is, how can we define what biometric quality is? Furthermore, who establishes what a high or low quality biometric sample is? It is certainly not easy to give a closed, scientific, fact-based answer to these queries.

We all, as human beings, have an instinctive feeling of what a high or low quality sample is. For the sake of argument, let’s assume we are shown a well-focused frontal portrait of a person, with good homogeneous illumination, no shadows or occlusions, a uniform background, and high resolution. We would all agree that it represents a high-quality face image. Why? Because from such an image we would be able to recognise the person. On the other hand, a low-resolution facial picture, taken from an angle, somewhat blurred, and with heavy shadows, would be regarded by most as presenting low quality. Why? Because we would all have difficulties recognising the person based on that specific image. Analogous examples could be given for any other biometric characteristic.

In light of the previous argumentation, we may conclude that, in fact, we all possess a subjective, intuitive perception of biometric quality. However, it is difficult for us to translate this intangible insight into measurable objective numbers. While in the example presented above, we would all agree which one is the high quality picture and which is the low quality one, each person would very likely assign to them a different quality measure from to .

This is, precisely, the huge challenge addressed by automated biometric quality metrics: producing an objective quantitative estimation of an inherently subjective concept.

To this aim, quality metrics take advantage of a key notion hidden in the example proposed above regarding facial pictures. Ultimately, what makes humans decide whether or not a biometric sample is of high quality? Its ability to be used for its ultimate purpose: recognising the sample “owner”. Unconsciously, the question we pose to ourselves when deciding on the quality of a picture is: how likely is it that I would recognise the person based on this image? The same elementary principle can be exploited by automated quality metrics. They can assign a quality score to samples, based on the suitability of that sample for recognition purposes by automated systems. That is, a biometric quality metric can be, essentially, a predictor of biometric accuracy.

This understanding of biometric quality metrics as predictors of accuracy is in line with the utility definition given in standard ISO/IEC 29794 for the term quality [1]. This is also, by far, the interpretation followed in most implementations described in the specialised literature, where quality metrics are being applied to a wide variety of tasks such as: quality control of large databases with multiple contributors, design of re-enrolment strategies in case of low quality acquisitions, quality-based multimodal fusion, or adaptation of data processing techniques.

In spite of data quality being nowadays regarded as the number one factor impacting the performance of biometric systems [2, 3], the level of development and research effort in this field varies greatly among biometric characteristics. In particular, fingerprint recognition is clearly ahead in this unofficial classification, with countless published works and metrics that have led, perhaps most importantly, to the generation of a system-independent and open-source quality metric: the NFIQ [4] (with NFIQ 2.0 under development). This metric is widely accepted by the community as the gold standard that sets the performance bar for all other fingerprint quality algorithms.

On the other hand, facial recognition is one of the most deployed biometric technologies, with a great prospective market rise for the coming years [5]. However, with regard to the amount of effort devoted so far to data quality analysis, it is almost at the other end of the spectrum compared to fingerprints. To meet the growth expectations raised by this technology, a point has been reached where face biometrics needs to catch up with fingerprints in the study and understanding of data quality. This need has become ever more pressing with the advent of the new generation of biometric-enabled large European IT systems, such as the Schengen Information System (SIS), the Visa Information System (VIS) or the Entry Exit System (EES), which require a normalised way to audit the quality of biometric data shared by multiple contributors in central databases. In the current state of play, it is an almost unanimous claim coming from all stakeholders in the face recognition community (academia, governmental institutions, law enforcement agencies, border control agencies and standardization bodies) that a stronger commitment is required to work together on the generation of improved face quality metrics and, eventually, the development of a common standard benchmark similar to the fingerprint NFIQ.

While some valuable works have started to scratch the surface of face quality analysis (see Section III for a review of the state of the art), there is still a long way to go before it reaches the level of progress exhibited by fingerprint-based systems, and before the requirements of the community are met. The current work is a solid step towards bridging this existing technological gap.

With this objective, the article presents an innovative approach to face quality assessment. The new method, FaceQnet, takes advantage of the largely demonstrated ability of deep learning networks to extract the most salient information from face images for recognition purposes. Through a knowledge transfer process, these machine-learned features are combined with training groundtruth quality scores, produced in a completely automated way that does not involve human labelling. As a result, the system becomes fully scalable: it does not rely on the potentially biased human perception of quality, and its training takes strictly into account the one parameter it is expected to predict, namely the accuracy of automated recognition systems. Two successive versions of FaceQnet (v0 and v1) have been assessed and compared to other state-of-the-art methods, following our own evaluation and an independent evaluation performed by NIST as part of their Face Recognition Vendor Test (FRVT) campaign. The results have shown an improvement between the two implementations of the algorithm, providing new insight into the problem of face quality, and proving the soundness and competitiveness of the approach.

As ancillary contributions of the work, we provide FaceQnet as an open source project to the community, so that it can be used to further advance the field of face quality estimation. We have also generated and made available, together with FaceQnet, quality labels for popular face databases such as Labeled Faces in the Wild (LFW) and VGGFace2.

The current article is based on the preliminary work presented in [6], which will be referred to from now on as FaceQnet v0. Consequently, the model trained in the present work will be called FaceQnet v1. The main contributions with respect to [6] are: 1) a modification of the architecture of FaceQnet v0 to avoid overfitting; 2) the generation of new training groundtruth with data from more comparators, to reduce the system dependence of the quality measure; 3) an improved evaluation protocol, including a comparison with other metrics from the state of the art, and a larger variety of face images from four different public databases (see Fig. 1), in order to gain deeper knowledge of its accuracy regarding quality assessment for face recognition; and 4) a more comprehensive introduction and positioning with respect to related works. Even though FaceQnet is presented here in the framework of face biometrics, the proposed methodology for building an automated quality metric can be very useful for other problems in which a task performance prediction is desirable.

The rest of this paper is organized as follows. Section II provides an introduction to biometric quality, and its application in face recognition. Section III summarizes related works in face quality assessment. Section IV summarizes the datasets used. Sections V and VI describe the development and the evaluation of FaceQnet, respectively. Finally, concluding remarks are drawn in Section VII.

II Introduction to Biometric Quality Measures

In biometrics, a quality measure is essentially a function that takes a biometric sample as its input and returns an estimation of its quality level [7]. That quality level is usually related to the utility of the sample at hand, or in other words, the expected recognition accuracy when using that specific sample. Introducing high quality samples in a database should improve the accuracy of the recognition system, while low quality samples should have the opposite effect.

The quality of the samples can be also related to more subjective factors such as human perceived quality [8]. Other definitions of biometric quality are discussed in [7]: a quality measure can be an indicator of character, i.e., properties of the biometric source before being acquired (e.g., distinctiveness); or can also be an indicator of fidelity, i.e., the faithfulness of the acquired biometric sample with respect to the biometric source.

As in most of the related works in the literature, for the purpose of the present paper we concentrate on quality measures as predictors of recognition accuracy, which can be categorized according to:

Fig. 1: Examples of varying quality from the four databases used in this paper: VGGFace2 [9], BioSecure [10], CyberExtruder, and LFW [11]. The figure shows a selection of images from each database with variable quality according to their ICAO compliance values [12]. The samples go from high quality images (right column) to low quality images (left column) which suffer from diverse variability factors such as low resolution, blur, bad pose, occlusions, etc.

  • Groundtruth Definition: One of the main differences between approaches for developing quality measures is the definition of high and low quality, i.e., the generation of the groundtruth. Some works employ human perception as their groundtruth. Another approach consists in using an accuracy-based groundtruth, which will result in a quality measure that represents the correlation between the input sample and the expected recognition accuracy of automatic systems.

  • Type of Input: Quality assessment modules can also be classified with respect to the amount of information they employ in order to obtain the quality measures. In a Full-Reference (FR) approach, a high quality gallery sample is assumed to be available. The system compares the features from the probe samples with the ones from the high quality reference. In Reduced-Reference (RR) methods, just partial information of a high quality sample is available. No-Reference (NR) methods do not use any reference information to compare with the probe sample. These methods apply prior information from the samples the system is dealing with, for example for building a statistical model.

    | Ref | Year | Groundtruth Definition | Type of Input | Features Extracted | Output |
    |---|---|---|---|---|---|
    | [13] | 2006 | Human-based | No-Reference | Face features, image features | Score: individual presence of each factor |
    | [14] | 2006 | Human-based & Accuracy-based | No-Reference | Face features, image features | Human perception score & Machine recognition score |
    | [15] | 2007 | Human-based | No-Reference | Asymmetric face features | Score: presence of each factor |
    | [16] | 2007 | Human-based | No-Reference | Face features | Quality functions |
    | [17] | 2010 | Human-based | No-Reference | Illumination | Individual score |
    | [12] | 2012 | Accuracy-based | Reduced-Reference | Contrast, brightness, focus, sharpness and illumination | FQI (Face Quality Index) |
    | [18] | 2012 | Human-based | No-Reference | 20 ICAO compliance features | Score from each individual test |
    | [19] | 2013 | Accuracy-based | Reduced-Reference | Image features, comparator features, sensor features | Low/high quality label |
    | [20] | 2014 | Human-based | No-Reference | Texture features | Individual score |
    | [21] | 2015 | Accuracy-based | No-Reference | 2 face features: pose, illumination | Predicted FMR/FNMR |
    | [22] | 2018 | Human-based & Accuracy-based | No-Reference | CNN features | MQV (Machine-based Q.), and HQV (Human-based Q.) |
    | FaceQnet v0 [6] | 2019 | Human-based & Accuracy-based | No-Reference | CNN features | Numerical quality measure |
    | [23] | 2020 | Accuracy-based | No-Reference | Unsupervised CNN features | Numerical quality measure |
    | FaceQnet v1 [Present paper] | 2020 | Human-based & Accuracy-based | No-Reference | CNN features | Numerical quality measure |

    TABLE I: Summary of quality assessment works for face recognition, classified by: 1) the groundtruth definition process; 2) the type of input; 3) the features extracted; and 4) the type of output produced.
  • Features Extracted: Biometric quality measures can also be classified in terms of the type of features that are extracted from the samples. Quality-related factors can be measured based on: 1) hand-crafted features (defined by the designer of the method based on their own experience); or 2) machine-learned features, e.g. generated by a Deep Neural Network (DNN) based on a pool of annotated training data.

  • Output: The output of the different quality assessment algorithms is not always the same. Some methods generate a qualitative label for each sample in the database in order to distribute the samples into a few quality ranges (e.g. low, medium, or high quality). Other methods just output a decision declaring whether or not a specific sample is compliant with a quality standard. More complex works try to estimate the Probability Density Functions (PDFs) of the different variability factors present in the samples, e.g. blur or extreme pose. These PDFs estimate the degree to which these quality factors are present in each sample of the database. Some of the most recent approaches compute a numerical score for each input sample (e.g., a real value within a fixed range), which serves as a predictor of the expected accuracy when using that sample for recognition.
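The relation between the output types listed above can be illustrated with a minimal sketch. It assumes, purely for illustration, a numerical quality score in [0, 1]; the function names and the thresholds (0.33, 0.66, 0.5) are hypothetical and are not taken from any of the reviewed works.

```python
def to_quality_label(score: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a numerical quality score to a coarse qualitative label.

    The score range [0, 1] and both thresholds are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is assumed to lie in [0, 1]")
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


def is_compliant(score: float, threshold: float = 0.5) -> bool:
    """Binary compliance decision derived from the same numerical score."""
    return score >= threshold


labels = [to_quality_label(s) for s in (0.10, 0.50, 0.90)]
decisions = [is_compliant(s) for s in (0.10, 0.50, 0.90)]
```

A numerical score is the most general of these outputs, since both a label and a compliance decision can be derived from it by thresholding, whereas the opposite is not possible.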

II-A Applications of Biometric Quality Assessment

The output of a biometric quality assessment module can be used at different stages of the recognition task. For example, it can be used during the enrollment process for giving feedback to the users or the operators. It can also be employed during the different stages of recognition itself in order to improve the global accuracy of the comparator:

  • Selection of preprocessing techniques: If a recognition system detects that a biometric sample does not present enough quality, it could activate some additional preprocessing techniques to improve the quality of the final sample. These techniques may involve significant computational overhead, so it is important to know when they can be useful [24, 25].

  • Context switching: A single recognition framework may have different algorithms in its core, each one of them being robust against some specific variability factors and weak against others. The quality measure can be used to select the algorithm best suited to each sample [26, 27].

  • Fusion at decision-level: This case is closely related to context switching. It consists in having several recognition algorithms, each one with its weaknesses and strengths. Instead of employing only one of them (context switching), they can all be employed in parallel, using the quality information to perform a smart fusion of their output scores, weighting each output as a function of the quality measure [28, 29].

  • Complementing features: The quality measures can be considered as additional features for face analysis and recognition algorithms. Incorporating them into the feature vectors can help to improve the accuracy of such analysis and recognition algorithms [30, 31].

  • Sample selection: The quality information can be used for selecting only the best quality samples from a collection. Another approach consists in looking in the database for samples with a level of variability similar to that of the probe sample. This way, the acquisition conditions of the gallery and the probe samples would be as close as possible. This can boost the accuracy compared to using all the samples without taking their quality into account [32].

  • Template update/replacement: When a subject is recognized with high enough confidence, the system could use the probe sample to improve or replace the template of that subject that is stored in the database [33, 34].
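Two of the applications listed above, quality-based score fusion and sample selection, can be sketched in a few lines. This is only an illustrative sketch: the function names, the linear weighting rule, and the example values are assumptions, not the methods of the cited works.

```python
from typing import List, Sequence, Tuple


def quality_weighted_fusion(scores: Sequence[float],
                            qualities: Sequence[float]) -> float:
    """Fuse comparator scores, weighting each one by its quality estimate."""
    if len(scores) != len(qualities) or not scores:
        raise ValueError("scores and qualities must be non-empty and aligned")
    total = sum(qualities)
    if total <= 0:
        # No usable quality information: fall back to a plain average.
        return sum(scores) / len(scores)
    return sum(s * q for s, q in zip(scores, qualities)) / total


def select_best_samples(samples: Sequence[Tuple[str, float]],
                        k: int) -> List[str]:
    """Return the identifiers of the k samples with the highest quality."""
    ranked = sorted(samples, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical comparator scores and quality values.
fused = quality_weighted_fusion([0.9, 0.4], [0.8, 0.2])
best = select_best_samples(
    [("a.jpg", 0.31), ("b.jpg", 0.87), ("c.jpg", 0.55)], k=2)
```

In this example the comparator backed by the higher quality value dominates the fused score, and only the two highest quality images are kept.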

III Face Quality Assessment: Related Works

There are many application scenarios in which a face recognition system can take advantage of quality assessment. For example, in video-surveillance scenarios quality assessment can be employed for frame selection. In this type of setting, variability factors such as pose, occlusions, blur, etc., are usually present in the acquired images. As stated previously, the recognition accuracy can be improved, for example, by discarding the samples with low quality and using only those with the highest quality. In systems with strict storage requirements, the quality measures can be used to select the best quality images in order to store only those, reducing the amount of storage required. In forensic investigation, having a quality measure related to the face recognition accuracy may help to estimate the level of confidence of the decision. These are just a few examples of applications of face quality assessment.

Several face quality standards have been proposed so far, the most relevant and widespread being ICAO 9303 and ISO/IEC 19794-5 [1]. These standards are composed of a series of guidelines for the acquisition of high quality (i.e., portrait-like) images, usually for their inclusion in official documents (e.g., ID cards or passports). A number of vendors and academic works have developed tools to automatically check if an image complies with the guidelines given in these standards [18]. In general, these works provide as output a binary vector where each element indicates whether or not the image passed a specific guideline.

In Table I we include a compilation of relevant related works in quality assessment for face recognition. The selection has been made to be representative of the different stages of face quality assessment research in the last years.

The first works related to face image quality assessment date back to the early 2000s. The studies belonging to this first stage of research were generally centered on extracting hand-crafted features from face images and using them to calculate one or several quality measures. These features were meant to estimate the presence of one or various factors that have traditionally been considered to affect recognition performance, e.g. blurriness, non-frontal pose, or low resolution.

In [13] the authors (from Cognitec, one of the most relevant companies in face recognition) presented one of the first compendia of quality measures and showed the relationship between those measures and the recognition performance of a Cognitec face recognizer. The features they considered were all hand-crafted and included: the image sharpness, the openness of the eyes, the pose, and the presence of glasses.

The research in [15] presented a symmetry-based face quality assessment method that relied on the presence or absence of asymmetries in the face. The authors considered that those asymmetries can be caused by factors that have an impact on the recognition performance, such as heterogeneous illumination and non-frontal pose.

The work [16] introduced a quality assessment algorithm that checked the existence of factors like blur, heterogeneous lighting, non-frontal pose, and non-neutral expressions. The authors used eigenfaces for developing quality functions related to each one of the different quality factors. However, they did not integrate the different quality functions into a single measure for estimating the overall quality of a given face.

Unlike the three previously mentioned methods, [14] integrated several individual quality measures into an overall quality measure. That work computed various hand-crafted face-specific features like: lighting, pose, presence of eyeglasses, and resolution of the skin texture; and some image-specific features like: resolution of the complete image, existence of compression artifacts, and amount of noise coming from the acquisition sensor. The authors merged the individual quality measures into two different general measures: one based on human perception and another related to machine-recognition accuracy. They found that the quality measure related to machine recognition was able to improve the recognition accuracy, whereas the correlation coefficient between the match scores and the human-based quality measure was much lower. According to the authors, that was because different humans gave different relevance to each individual quality measure, some of them not being critical for face recognition.

Another of these “classic” hand-crafted approaches is the one presented in [17], where the authors studied the effect that illumination has on face recognition, concluding that some of the best performing face recognition algorithms (at that time) were highly sensitive to different illumination levels when evaluated with FRVT 2006.

In [12] the authors proposed an accuracy-based Face Quality Index (FQI) combining individual quality factors extracted from five image features: contrast, brightness, focus, sharpness, and illumination. They used the CASPEAL database, adding synthetic effects to the images (data augmentation) in order to emulate different real world variations. After computing a numerical value of quality for each feature, they defined the Face Quality Index by normalizing each quality measure and modeling the distribution of quality measures as Gaussian PDFs. Values close to the mean of each PDF denoted high quality, while scores far from the mean represented low quality. The high quality reference PDFs were obtained using a high quality subset from the FOCS database. Finally, they averaged all individual quality measures to compute the FQI.
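The idea of scoring each feature by its closeness to the mean of a high quality reference distribution, and then averaging, can be sketched as follows. This is not the implementation of [12]: the rescaling of the Gaussian so that its peak equals 1, the function names, and the reference statistics are all assumptions made for illustration.

```python
import math
from typing import Dict, Tuple


def gaussian_closeness(value: float, mean: float, std: float) -> float:
    """Gaussian PDF rescaled so that its peak (value == mean) equals 1."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z)


def face_quality_index(features: Dict[str, float],
                       reference: Dict[str, Tuple[float, float]]) -> float:
    """Average the per-feature closeness to the high quality reference."""
    scores = [gaussian_closeness(features[name], *reference[name])
              for name in reference]
    return sum(scores) / len(scores)


# Hypothetical (mean, std) reference statistics for two of the five features.
reference = {"contrast": (0.5, 0.1), "brightness": (0.6, 0.2)}
at_mean = face_quality_index({"contrast": 0.5, "brightness": 0.6}, reference)
off_mean = face_quality_index({"contrast": 0.8, "brightness": 0.1}, reference)
```

An image whose features sit exactly at the reference means obtains the maximum index, and the index decreases as the features move away from those means.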

Another approach is described in [18]. The authors presented the BioLab-ICAO framework, an evaluation tool for automatic ICAO compliance checking. The paper defined different individual tests for each input image. The output consists of a score for each test, going from to . Those individual scores were nevertheless not integrated into a final unified quality measure.

In [19] the authors computed quality features divided into three categories. The first class consists of image processing and face recognition related features, e.g. edge density, eye distance, face saturation, pose, etc. The second category is composed of sensor-related features like the ones that can be found in the EXIF headers of the images. The last class consists of features related to the comparators they employed, i.e. SVM. They drew conclusions about which features are more relevant to the overall recognition accuracy on the specific dataset they used (PaSC). They used that knowledge for splitting the whole dataset into two quality categories: low and high.

The authors of [20] captured a database mimicking a real-life Automatic Border Control (ABC) scenario, and applied face quality assessment to its video sequences. ABC is probably one of the most relevant applications of face recognition, and improving its robustness is of great interest for the industry and for governmental institutions. [20] evaluated the quality of the different frames of the videos by analyzing their texture and applied these quality measures for improving the recognition accuracy.

The work presented in [21] established a relationship between two image features, i.e., pose and illumination, and the final face recognition accuracy. They developed individual quality measures using PDFs in a way similar to [12]. However, the main difference between both works is that in [21] the individual quality measures are employed to finally estimate expected accuracy values, i.e., False Match Rate (FMR) and False Non-Match Rate (FNMR). The authors used six different face recognition systems in order to extract accuracy values from the databases: two of them were Commercial Off-The-Shelf Software (COTS) and four were open-source algorithms, and they applied them to three different datasets: MultiPIE, FRGC and CASPEAL.
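For reference, the two error rates mentioned above can be computed from comparison scores at a given decision threshold as in the following sketch. The scores and the threshold are made-up values, and the convention that higher scores mean higher similarity is an assumption.

```python
from typing import Sequence, Tuple


def error_rates(mated: Sequence[float], non_mated: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FMR, FNMR) for similarity scores at a decision threshold.

    A comparison is accepted when its score is >= threshold.
    """
    false_matches = sum(1 for s in non_mated if s >= threshold)
    false_non_matches = sum(1 for s in mated if s < threshold)
    return false_matches / len(non_mated), false_non_matches / len(mated)


mated_scores = [0.91, 0.85, 0.40, 0.77]      # same-subject comparisons
non_mated_scores = [0.10, 0.35, 0.62, 0.20]  # different-subject comparisons
fmr, fnmr = error_rates(mated_scores, non_mated_scores, threshold=0.5)
```

With these values, one of the four non-mated comparisons is wrongly accepted and one of the four mated comparisons is wrongly rejected.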

With the recent growth of the application of deep learning methods to the face recognition task due to their high accuracy, the research works associated with face quality assessment are also successfully adopting this type of approach. For example, in [22] the authors predicted quality measures related to recognition accuracy (referred to as Machine Quality Values, MQV) and others related to human perceived quality (Human Quality Values, HQV). They annotated the LFW database with human perceived quality using the Amazon Mechanical Turk platform, where the participants compared pairs of images from LFW and determined which one had the highest perceived quality. Unlike [21], where a value for recognition accuracy was predicted, [22] employed FMR and FNMR as accuracy values in the training stage and the output was a prediction of MQV or HQV. Another distinguishing point of this work is that the authors employed a pretrained CNN (VGGFace) to extract features from the images. Then they used those features to train their own classifier, which means that they successfully transferred knowledge from face recognition to quality prediction. The authors drew interesting conclusions, such as that both scores (MQV and HQV) are highly correlated with the recognition accuracy, even for cross-database predictions. They also concluded that automatic HQV is a more accurate predictor of accuracy than automatic MQV.

The work in [22] is probably one of the most advanced approaches to face quality estimation reported in the literature. However, it still presents some drawbacks: 1) a high amount of human effort is required to label the database with human perceived quality; and 2) a manual selection of a high quality image is needed for each subject to obtain the machine accuracy prediction, thus involving human effort and introducing human bias [35].

In [6] we presented FaceQnet v0, a deep learning method that had the objective of correlating the quality of an image to its expected accuracy for face recognition. It was designed as an extension of the work presented in [22]. We employed the BioLab-ICAO framework [18] for labeling the images of the VGGFace2 database with quality information related to their ICAO compliance level. The training of FaceQnet v0 was done using that automatically labelled groundtruth. We showed that the predictions from FaceQnet v0 were highly correlated with the face recognition accuracy of a state-of-the-art commercial system. However, our proposal had some limitations: we used only one face recognizer for the groundtruth generation (probably introducing system dependence); the presence of outliers in the groundtruth data significantly affected the training process; and because our testing protocol only included two different databases, we were not able to draw conclusions that could be applied to other data with full confidence.

Some recent face quality assessment works already mention FaceQnet v0 among their main references. One of them is [23], in which the authors proposed a face quality assessment method based on unsupervised learning. They computed the variations in the face embeddings coming out from several CNNs pretrained for face recognition. They developed a quality indicator by measuring the robustness across the different embeddings for a single face image. The authors compared their solution against six state-of-the-art face quality assessment approaches (FaceQnet v0 among them).
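The notion of measuring quality as the robustness across several embeddings of the same image can be sketched as follows. This is a simplified illustration and not the exact procedure of [23]: the use of the mean pairwise Euclidean distance, the mapping 2·sigmoid(−d), and the toy two-dimensional embeddings are assumptions.

```python
import math
from itertools import combinations
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def robustness_quality(embeddings: Sequence[Sequence[float]]) -> float:
    """Map the mean pairwise distance between embeddings into (0, 1]."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("at least two embeddings are required")
    mean_dist = sum(euclidean(a, b) for a, b in pairs) / len(pairs)
    # 2 * sigmoid(-d): equals 1 when all embeddings agree, decays towards 0.
    return 2.0 / (1.0 + math.exp(mean_dist))


# Toy embeddings of one image produced by three hypothetical networks.
stable = robustness_quality([[0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])
unstable = robustness_quality([[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]])
```

An image whose embeddings agree across networks is scored as high quality, while disagreement between embeddings lowers the score. No quality labels are needed, which is what makes the approach unsupervised.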

The present work represents a step forward in overcoming the limitations of [22] and FaceQnet v0 [6]. As a result, our proposed solution, i.e., FaceQnet v1 is: 1) based on state-of-the-art deep learning; 2) massively scalable without human intervention, thanks to the fully automatic generation of the groundtruth quality labels; 3) developed and tested using multiple face datasets and state-of-the-art face recognition systems; and 4) validated in an independent evaluation by NIST.

IV Datasets

In this section we describe the characteristics of the databases we used in the development and evaluation of our proposed face quality metric FaceQnet. A sample of images with different qualities from each database can be seen in Fig. 1.

IV-A VGGFace2 Database

In this work we used two disjoint subsets extracted from the VGGFace2 database [9], one for training our network, FaceQnet, and the other to evaluate our quality measure using two different face verification systems.

The full VGGFace2 database contains million images of , different identities, with an average of images for each subject. All the images in the database were obtained from Google Images and they correspond to well known celebrities such as actors/actresses, politicians, etc. The images were acquired under unconstrained conditions and present large variations in pose, age, illumination, etc. These variations imply different levels of quality.

The creators of the VGGFace2 database also published a CNN based on the ResNet-50 architecture [36] pretrained with their database, showing that they were able to obtain state-of-the-art results when testing against challenging face recognition benchmarks such as IJB-C [37], QUIS-CAMPI [38], or PaSC [39]. This is the model we used as the basis of FaceQnet v0 and v1 (we applied knowledge-transfer to change its domain from face recognition to quality assessment).

Fig. 2: Generation of the quality groundtruth. We first selected a subset of subjects from the VGGFace2 database. We then used ICAO compliance values (BioLab framework [18]) for selecting one gallery image for each subject. After that, we employed FaceNet [40], Face Recognition [41], and DeepSight [42] for feature extraction (Matchers 1, 2, and 3), and we obtained all the mated scores using the Euclidean Distance between the embeddings of the ICAO-compliant gallery images and the rest of the images of the same subject. This way we obtained three distances for each mated pair. Finally, we transformed the distances into normalized scores in the [,] range and we averaged the scores from the three different comparators. The normalized comparison scores are used as groundtruth quality measures of the non-ICAO images.

IV-B BioSecure Database

The BioSecure Multimodal Database [10] consists of , subjects whose biometric samples were acquired in three different scenarios. Images for the first scenario were obtained remotely using a webcam, the second is a portrait-type scenario using a high quality camera with homogeneous background, and the third scenario is uncontrolled, captured with mobile cameras both indoors and outdoors.

In the present work we have used this database for evaluation purposes. We used , images of subjects from the second and third scenarios for obtaining their quality measures with FaceQnet v1.

IV-C CyberExtruder Dataset

We used the CyberExtruder database (the Ultimate Face data set, whose provider is located at 1401 Valley Road, Wayne, New Jersey, 07470, USA) to perform the final accuracy tests of FaceQnet v1. The dataset contains , images of , people extracted from the Internet. The data is unrestricted, i.e., it contains large pose, lighting, expression, race, and age variability. It also contains images with occlusions.

IV-D Labeled Faces in the Wild

The Labeled Faces in the Wild (LFW) [11] database has also been processed by FaceQnet v1 in order to label it with quality measures for the accuracy tests. The database consists of , images of , different subjects, having , of them two or more different images available.

This database has been widely used in recent years for studying face recognition under unconstrained conditions. Publishing an accuracy-based quality measure for each image can help to boost the accuracy of the state-of-the-art face recognition systems that use this dataset for their benchmarks.

V FaceQnet: Development

Biometric quality estimation can be seen as a prediction of biometric accuracy, i.e., a regression problem. With FaceQnet we solve this regression problem in a supervised way using a groundtruth database composed of face images as the inputs and their groundtruth quality measures as the outputs.

The groundtruth database is generated by comparing images of presumed high quality (nearly ICAO compliant) against probe images of varying qualities, using three different recognition systems. First, we used third-party software to obtain ICAO compliance measures, thereby avoiding the inclusion of human bias when selecting the templates for the comparisons. This also allows for massive labelling since, thanks to the software, any number of face images can be processed without human intervention or effort. We then applied knowledge-transfer to a CNN pretrained for face recognition based on ResNet-50 to perform the quality prediction using the quality groundtruth. We named the resulting model FaceQnet v1.

The present paper is based on our preliminary work in [6], FaceQnet v0, whose objective was to correlate the quality of an image to its expected accuracy for face recognition. In the present work, we extend the results obtained in [6] by improving its main weak points: 1) the quality groundtruth is now generated using three different face comparators instead of only one; 2) the learning architecture is revised in order to avoid overfitting; and 3) the testing protocol now includes a comparative evaluation over four different databases of varying quality instead of only two, making possible a deeper understanding of how FaceQnet works.

V-A Generation of the Groundtruth Quality Measures

We can think of the quality in face recognition as a measure of the intra-variability of the images of a person. The ICAO standard for face quality imposes very strict acquisition conditions when capturing new images. Controlling variability factors such as resolution, illumination, pose, focus, etc., makes the images coming from the same person look as similar as possible, i.e., low intra-variability. This way the comparison scores should depend only on the differences between different users, i.e., inter-variability. Based on that rationale, in the current work we have made the following hypothesis in order to compute the quality groundtruth:

  • HYPOTHESIS 1: In this work we make the assumption that a perfectly compliant ICAO image represents perfect quality due to its low intra-variability. We assume that the mated comparison score between such perfect quality picture (i.e., ICAO compliant) and a picture of unknown quality can be a valid and accurate reflection of the quality level of picture (its level of intra-variability). If the comparison score is low, this must be due to the low quality of the image since is of known good quality. On the other hand, if the score is high, it can be assumed that the second image is of good quality, containing a low level of variability factors such as the ones mentioned before. Therefore, that comparison score can be used as a machine-generated groundtruth quality for picture .

To know which images from the training database were closest to ICAO compliance, we used the BioLab framework from [18] (see Fig. 2). This framework outputs a score between and for each one of its individual ICAO compliance tests. Not all of these tests have the same relevance for face recognition, so we selected a subset of them and then we computed a final averaged global ICAO compliance value. More specifically, the tests that we have selected are: blur level, too dark/light illumination, pixelation, heterogeneous background, roll/pitch/yaw levels, hat/cap presence, use of glasses, and presence of shadows.
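The averaging of the selected compliance tests can be sketched as follows. This is only an illustration under stated assumptions: the test names in `SELECTED_TESTS` and the dictionary layout are hypothetical and do not reflect the actual output format of the BioLab framework, and the example values are made up.

```python
from statistics import mean

# Hypothetical identifiers for the subset of compliance tests listed in the text.
SELECTED_TESTS = (
    "blur", "too_dark_or_light", "pixelation", "background",
    "roll_pitch_yaw", "hat_cap", "glasses", "shadows",
)

def global_icao_compliance(test_scores):
    """Average the selected per-test scores into one global compliance value.

    test_scores maps a test name to its compliance score; tests that are
    not in SELECTED_TESTS are ignored.
    """
    return mean(test_scores[name] for name in SELECTED_TESTS)

# Made-up scores for one image: every selected test scores 80 except blur.
example = {name: 80 for name in SELECTED_TESTS}
example["blur"] = 40
example["eyes_closed"] = 0  # not selected, so it does not affect the average
print(global_icao_compliance(example))
```

An unweighted mean matches the "averaged global ICAO compliance value" described in the text; any weighting of the individual tests would be a further assumption.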

As the training set for our quality assessment measure, we selected a subset of subjects from the VGGFace2 database. For each subject we selected the image with the highest ICAO compliance value as the gallery image, and we used the rest as probe images. To obtain the comparison scores we used three feature extractors to get embeddings for all the images in the database: the FaceNet model from [40], an open-source solution called Face Recognition [41], and a vendor solution from the BaseApp company called DeepSight Face [42]. We input each image of the training database into each one of the three comparators to extract three different -dimensional feature vectors. Using these embeddings, we computed the Euclidean distance between each template image and all the remaining samples of the same subject. These distances represent the dissimilarity between each test image and its corresponding “ICAO compliant” template. This process gave us three different mated distances for each pair of images. In order to fuse the distances into a single value (used as the training groundtruth), they were transformed into similarity scores in the [,] range using the following equation: , where is each mated distance normalized to zero mean and unit standard deviation.

Finally, the three normalized similarity scores were averaged to obtain the final groundtruth quality measures for training FaceQnet v1. As explained above, given that the reference images used to compute the similarity scores are nearly ICAO-compliant images of “perfect” quality, we can assume that the final similarity score represents the quality level of the probe image. If the resulting similarity scores are high, the corresponding probe images are likely to have high quality characteristics. On the contrary, if the scores are low we can assume that the probe images have low quality regarding the face recognition task.
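The normalization and fusion of the mated distances can be sketched as below. The exact distance-to-similarity mapping is not reproduced here, so the logistic function in `distances_to_scores` is an assumption chosen because it maps a zero-mean, unit-variance distance into a bounded similarity; normalizing over all mated distances of one comparator is likewise an assumption, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def euclidean(a, b):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distances_to_scores(distances):
    """Turn one comparator's mated distances into similarity scores.

    Distances are first normalized to zero mean and unit standard deviation,
    then mapped with a logistic function (an assumed form), so that a small
    distance yields a score near 1 and a large distance a score near 0.
    """
    mu = mean(distances)
    sigma = pstdev(distances) or 1.0  # guard against identical distances
    return [1.0 / (1.0 + math.exp((d - mu) / sigma)) for d in distances]

def fuse_scores(per_comparator_distances):
    """Average the normalized scores of several comparators, pair by pair.

    per_comparator_distances holds one list per comparator; position i in
    every list refers to the same gallery/probe pair.
    """
    per_comparator_scores = [distances_to_scores(d) for d in per_comparator_distances]
    return [mean(pair_scores) for pair_scores in zip(*per_comparator_scores)]

# Three comparators with different distance scales, three mated pairs each.
groundtruth = fuse_scores([
    [0.2, 0.6, 1.0],
    [2.0, 6.0, 10.0],
    [1.0, 3.0, 5.0],
])
print(groundtruth)  # highest value for the first (closest) pair
```

Normalizing each comparator separately before averaging keeps a comparator with a larger distance scale from dominating the fused groundtruth.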

Fig. 3: Top: Distribution of the groundtruth quality measures for training FaceQnet v1. The training quality measures are a combination of the verification scores obtained using FaceNet, DeepSight, and Face Recognition. Bottom: Training images of high and low subjective quality, and their corresponding groundtruth quality scores.

Fig. 3 (top) shows the distribution of the fused verification scores we used as the groundtruth quality measures for training FaceQnet v1. We calculated verification scores using the FaceNet, Face Recognition, and DeepSight recognizers, normalized them to the [,] range, and then combined them into a final groundtruth quality measure. Fig. 3 (bottom) shows some examples of training images of high and low subjective quality (selected manually) and their associated groundtruth quality measures. The figure shows that the measures are correlated to the subjective quality of the images, i.e., their level of ICAO compliance. With the experiments included in Section VI we show that the quality measures are also related to face recognition accuracy.

V-B Training of the Deep Regression Model

Fig. 4: FaceQnet (both v0 and v1 versions) is originally based on the ResNet-50 architecture [36], but replacing the last classification layer with two new ones designed for quality regression. For FaceQnet v1 we also added a dropout layer before the first additional fully connected layer. We trained only the new layers keeping the weights of the rest frozen, using the training set of face images and their groundtruth quality measures.

The proposed model, FaceQnet v1, is able to return a reliable prediction of the face recognition accuracy using just a probe image as its input. To that end, it performs end-to-end regression for quality estimation.

Due to the limited amount of face quality training data, we opted to apply knowledge-transfer, which has been shown to be very effective in other face analysis problems such as gender estimation or age estimation [43, 44]. In these works, the authors used a model that was pretrained for a different (but closely related) task using massive data, and they retrained it to perform the new task using only a limited set of data of the target task. This observation led us to the next hypothesis:

  • HYPOTHESIS 2: Facial feature vectors containing information about identity are quite likely to also contain information regarding face quality. Therefore, using knowledge transfer we should be able to extract quality-related information from recognition-related feature vectors.

To use a face-recognition embedding for quality estimation, we need to extract the quality related information from those vectors, and this is done by using the groundtruth quality measures described in Section V-A. We took as basis the ResNet-50 model from [36], pretrained for face recognition, and we removed the last classification layer. We substituted it with two additional Fully Connected (FC) layers to perform quality estimation. The ResNet-50 pretrained model extracts a vector of , elements designed for face recognition. The first added FC layer combines the elements of the embeddings, synthesizing them into feature vectors of elements that concentrate the quality related information. The second FC layer performs a final regression step that outputs a score, i.e., the final quality measure that helps us to know the level of suitability of an image for face recognition.

In order to improve the preliminary results from FaceQnet v0 [6], in the present paper for v1 we also added a dropout layer before the first fully connected layer. The final architecture of FaceQnet v1 is shown in Fig. 4. This change is intended to reduce overfitting, so that FaceQnet v1 generalizes better when facing images from different datasets and scenarios. In addition to the changes made to the generation of the groundtruth data, this change in the architecture makes the model more system and data independent in comparison to the FaceQnet v0 model described in [6].

The inputs to the network are face images of size previously cropped and aligned using MTCNN [45]. We froze all the weights of the old layers and we only trained the new layers using the quality groundtruth generated in the previous step (see Section V-A).
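The forward pass of the added regression head can be sketched in plain Python as follows. This is a toy illustration, not the released model: the layer sizes, the ReLU activation, the dropout rate, and the parameter dictionary layout are all assumptions, and the embedding stands in for the output of the frozen ResNet-50 backbone.

```python
import random

def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, rate, training, rng):
    """Inverted dropout: random units are zeroed only while training."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def quality_head(embedding, params, training=False, rng=None):
    """Map a backbone embedding to one scalar quality measure.

    Order of operations: dropout, first FC layer (ReLU assumed),
    second FC layer producing a single regression output.
    """
    rng = rng or random.Random(0)
    hidden = dropout(embedding, params["dropout_rate"], training, rng)
    hidden = relu(dense(hidden, params["w1"], params["b1"]))
    return dense(hidden, params["w2"], params["b2"])[0]

# Tiny hypothetical head: 2-element embedding -> 2 hidden units -> 1 output.
params = {
    "dropout_rate": 0.5,
    "w1": [[1.0, 0.0], [0.0, -1.0]], "b1": [0.0, 0.0],
    "w2": [[0.5, 0.5]], "b2": [0.1],
}
print(quality_head([1.0, 2.0], params))
```

Only the entries of `params` would be updated during training; the backbone that produces the embedding stays frozen, which is what keeps the amount of required training data small.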

Once trained, FaceQnet can be used as a “black box” that receives a face image and outputs a quality measure between and related to the face recognition accuracy. This quality measure can be understood as a proximity measure between the input image and a hypothetical corresponding ICAO compliant face image.

VI FaceQnet: Evaluation

VI-A US-NIST Assessment: FaceQnet v0

Fig. 5: Samples of images used in the NIST assessment. Figure extracted from [3].

As part of their Face Recognition Vendor Test (FRVT), the US NIST started in 2019 an ongoing evaluation of face quality metrics, the FRVT Quality Assessment (FRVT-Q). To date, there has been one wave of algorithms assessed in the competition. This first campaign comprised seven algorithms coming from six different participants and included the initial version of FaceQnet (v0) described in the preliminary work [6]. A description of the objectives, experimental protocol, and the full results of the competition so far was recently presented in a technical report [3].

The FRVT-Q evaluation was performed over a database that contained, for all subjects, three different image categories, each of them with a different expected quality level: 1) “Application” pictures, which correspond to high-resolution ICAO-type portraits (very high quality); 2) “Webcam” pictures, which correspond to close-to frontal images, taken indoors with a cooperative subject and no specific control over illumination or distance to the camera (good-to-average quality); 3) “Wild” pictures, which include photojournalism-style photos, taken under unconstrained conditions with large variations in resolution (large quality range, from very poor to good).

Fig. 6: Brief summary of the results from the FRVT-Q campaign organised by NIST in 2019 for the evaluation of face quality metrics. The graphs have been directly extracted from [3] and show the performance of the first version of FaceQnet presented in [6] (FaceQnet v0). (Top) quality measures for the three different types of images in the evaluation database. (Bottom) ERC plots showing the performance of the different quality assessment methods submitted to the evaluation. For a full description of the competition and results, we refer the interested reader to [3]. (Color image)

The evaluation included two main types of results for all the algorithms assessed: 1) Quality score distributions for each of the three image categories (i.e., application, webcam and wild); and 2) Error versus Reject Curves for different comparators for two verification tasks: “Application vs Webcam” and “Wild vs Wild”. Samples of the types of images used in the NIST evaluation are shown in Fig. 5.

A brief summary of the FRVT-Q results is shown in Fig. 6. The graphs in the figure have been directly extracted from [3] and have been selected to reflect the performance of the initial version of FaceQnet submitted to the evaluation [6] (FaceQnet v0). Fig. 6 (top) depicts the quality score distributions of FaceQnet v0 for the three image categories. The graphs in Fig. 6 (bottom) show the ERC curves for all the algorithms in the evaluation, based on the mated comparison scores obtained with the comparator “rankone_008”, for a FMR of , both for the “Application vs Webcam” scenario and the “Wild vs Wild” scenario. These ERC plots have been computed using only mated scores. Being able to predict the mated scores implies that the quality measure will be a predictor of the recognition accuracy.

The main conclusions that can be extracted from these results are:

  • Given the quality score distributions shown in Fig. 6 (top), we can say that FaceQnet v0 is capable of distinguishing with reasonable accuracy the difference in quality present in the three image categories considered in the competition. However, it has a tendency to saturate on the low end of the quality range, that is, it has a significantly limited ability to discern between poor quality images, assigning to all of them very low quality values (see the abnormally high lobe of the wild distribution around quality value ).

  • Fig. 6 (bottom) shows that FaceQnet v0 performs reasonably well in the quality estimation of average, good, and very high quality images (i.e., webcam and application categories). This could already be noticed in the distributions shown in the top graph and is further confirmed by the ERC curves of the “Application vs Webcam” scenario, where, for most of the curve, FaceQnet v0 only performs worse than the two “rankone” quality metrics. Please note that these ERC curves have been extracted using a “rankone” comparator, therefore, it could be expected that the “rankone” comparator and quality metric present the highest correlation of all participants.

    The ERC curves for the “Wild vs Wild” scenario show that FaceQnet v0 struggles in the presence of bad quality images, where its performance is worse than all other algorithms participating in the evaluation. Again, this confirms the observations extracted based on the distributions shown in the top graph. Based on these results, we may say that the metric is able to detect poor images (see the high lobe close to 0 in the “Wild” distribution), but it always assigns them the same very low quality. Therefore, it needs to improve its ability to discriminate between pictures corresponding to low values (quality range -).

The limitations of FaceQnet v0 in handling low quality images, revealed in this evaluation, have been partially addressed in the new release of the tool, FaceQnet v1, described in the present work, through: 1) a change in the architecture adding a dropout layer to avoid the quick saturation of the algorithm in the low end of the quality range; and 2) an improvement of the training process using additional datasets and face recognition systems to produce the groundtruth quality scores. To evaluate the improvement in performance due to the changes introduced, FaceQnet v1 has been evaluated following a protocol and metrics very similar to those used in the NIST evaluation. This self-assessment is described in the following section.

VI-B Self-conducted Assessment: FaceQnet v1

Fig. 7: Experimental scheme for testing FaceQnet v1. We computed only the mated verification scores for all the images in the test databases. In parallel, the quality of all the images involved in these mated pairs is obtained using FaceQnet v1. Finally, we calculated the FNMR values when discarding those mated pairs in which at least one of the images has its quality measure under a variable threshold. The mated comparison scores were computed using two different face recognition systems (FaceNet and Face++). The four test databases are: VGGFace2, Biosecure, CyberExtruder, and LFW.
Fig. 8: Distribution of the quality measures for the VGGFace2, Biosecure, CyberExtruder, and LFW databases (with sample images). (Top) quality measures from the preliminary FaceQnet v0 model from [6]. (Bottom) quality measures from the current FaceQnet v1 model. The example images illustrate how the new measures are more widespread along the [,] range than the old ones. VGGFace2 images obtained lower quality measures compared with those from the other databases since they contain more variability. The current FaceQnet v1 model distinguishes better between the quality of the different databases.

In this evaluation we followed a testing protocol similar to the one of NIST described in the previous section. The target is to evaluate the improvements of the FaceQnet v1 model we have trained in the current work.

We tested the FaceQnet v1 model on different datasets: VGGFace2 (no overlap with the training set), BioSecure, CyberExtruder, and LFW. These databases were captured under different conditions and therefore they present different levels of variability. A short description of the databases can be found in Section IV. The experimental scheme for validating FaceQnet is shown in Fig. 7.

First, we processed all the images from each test database with FaceQnet v1, obtaining a quality measure for each individual image. The resulting distributions of the quality measures are shown in Fig. 8, for both FaceQnet v0 and v1. That figure also shows some example images and their associated quality measures. The scores obtained using FaceQnet v1 are more widespread along the [,] range than the ones of FaceQnet v0. As expected, the VGGFace2 database presents a higher amount of low quality images since it represents real world acquisition conditions, while the quality values for the LFW or the BioSecure databases are slightly higher since their images were acquired in more controlled conditions.

Fig. 9: Distribution of the quality measures for the different scenarios of the BioSecure database. The testing set of the database comprises two different scenarios: mobile and studio. The mobile images are divided into indoor and outdoor subscenarios. The studio images were acquired with and without artificial illumination.

The testing dataset from BioSecure we have used in this self-evaluation has its images divided into two scenarios: a portrait-type scenario using a high quality camera with homogeneous background, both with and without artificial illumination (referred to as the “studio” scenario), and another scenario that is uncontrolled, captured with mobile cameras both indoors and outdoors (the “mobile” scenario). This yields a total of four subscenarios: “studio with illumination”, “studio without illumination”, “mobile indoor”, and “mobile outdoor”.

We decided to process all the images of each one of the BioSecure subscenarios to see if FaceQnet v1 is able to distinguish between the different types of images. Fig. 9 shows the distribution of the quality measures for the mentioned scenarios and subscenarios. As expected, the quality measures obtained for the “studio” conditions present a higher mean value than the ones from the “mobile” conditions, since the studio images were obtained with a higher quality camera, a homogeneous background and illumination, better pose, etc. Additionally, the varying acquisition conditions of the “mobile outdoor” subscenario cause its quality measures to be more spread along the quality range.

In the last experiment of this self-evaluation we compute Error versus Reject Curves (similarly to the NIST evaluation) for comparing the accuracy of the quality measures obtained with FaceQnet v1 against other Quality Assessment (QA) methods. As references for this comparison we selected a QA method for face recognition based on hand-crafted features [46] and another method designed for general Image Quality Assessment (IQA) [47]. We also include FaceQnet v0 [6] in the comparison, the same version evaluated by NIST as described in the previous subsection.

In this case we compute ERC plots for two different comparators: a COTS software called Face++ from MEGVII [48] and one of the recognizers used in the training phase, i.e., FaceNet. These two comparators used here for evaluation allow us to check how well our quality estimation correlates with the system used for development and also with a face verification system not seen during training. Face++ performs a comparison between two face images returning a numerical comparison score between and , while FaceNet returns a value between and . In both cases, the higher the score, the higher the probability of a mated comparison.

We compute the ERC plots for each combination of one testing database and one face recognizer from the evaluation set. ERC plots are calculated by discarding an increasing amount of images with low quality measures and then obtaining the new values of the FNMR [2]. The value of the verification threshold is set to fix the desired initial value of the FNMR by using the quantile function with the mated compared pairs. The same threshold is used for all the values in each ERC plot. The curves show the relationship between the FNMR and the reject rates, describing how the FNMR (ideally) decreases when the data with the worst quality is discarded. The goal is to show how correlated the quality measures are with the accuracy of each face recognizer. A trustworthy quality metric should be able to predict which images have a higher impact on the FNMR. The ERC plots have also been used previously in quality assessment for other biometric traits such as fingerprint (e.g., with NFIQ).
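The ERC computation described above can be sketched as follows. This is a minimal illustration under assumptions: the function names and the `(score, quality_a, quality_b)` tuple layout are hypothetical, higher comparison scores are taken to mean more similar images, and the quality of a mated pair is taken as the minimum of its two image qualities, which is equivalent to rejecting a pair when at least one image falls under the quality threshold.

```python
def fnmr(mated_scores, threshold):
    """Fraction of mated comparison scores that fall below the threshold."""
    if not mated_scores:
        return 0.0
    return sum(s < threshold for s in mated_scores) / len(mated_scores)

def threshold_for_fnmr(mated_scores, target_fnmr):
    """Pick the verification threshold at the target quantile of mated scores."""
    ordered = sorted(mated_scores)
    k = int(round(target_fnmr * len(ordered)))
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]

def error_vs_reject_curve(pairs, target_fnmr, reject_fractions):
    """Return (reject fraction, FNMR) points for a list of mated pairs.

    pairs: list of (score, quality_a, quality_b) tuples.
    The threshold is fixed once on all pairs and reused for every point.
    """
    threshold = threshold_for_fnmr([s for s, _, _ in pairs], target_fnmr)
    # Lowest-quality pairs first, so they are the first to be discarded.
    ranked = sorted(pairs, key=lambda p: min(p[1], p[2]))
    curve = []
    for fraction in reject_fractions:
        n_reject = int(fraction * len(ranked))
        kept_scores = [s for s, _, _ in ranked[n_reject:]]
        curve.append((fraction, fnmr(kept_scores, threshold)))
    return curve

# Toy data: the only low comparison score belongs to the lowest-quality pair.
pairs = [
    (0.10, 0.1, 0.9), (0.90, 0.8, 0.9), (0.80, 0.7, 0.8), (0.85, 0.9, 0.9),
    (0.70, 0.5, 0.6), (0.95, 0.9, 1.0), (0.75, 0.6, 0.7), (0.88, 0.8, 0.8),
    (0.92, 0.9, 0.9), (0.78, 0.7, 0.9),
]
print(error_vs_reject_curve(pairs, target_fnmr=0.1, reject_fractions=[0.0, 0.2]))
```

With a quality measure that tracks recognition accuracy, the FNMR drops as the reject fraction grows, as in the toy data; a measure unrelated to accuracy would leave the curve roughly flat.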


In Figs. 10 and 11 we have fixed the verification thresholds to obtain an initial FNMR of 10% for all the recognizers when using all the images indistinctly. Regarding the FaceNet results, the FaceQnet model trained in this paper (FaceQnet v1) is always among the two quality assessment methods with the highest correlation with the face recognition performance. The hand-crafted algorithm from [46] also obtains good results in its quality predictions. For Face++, FaceQnet v1 also stays among the two best QA methods, except for the CyberExtruder database where FaceQnet v0 [6] obtains the highest correlation between the quality measures and the recognition performance. Analyzing the ERC plots, it can be stated that FaceQnet v1 generates quality measures that are generally more correlated with the accuracy of face recognition than those of FaceQnet v0.

The general IQA algorithm from [47] slightly increases the accuracy of the face recognizers when discarding low quality images. However, its performance is quite poor when compared with the other QA methods that have been adjusted specifically for face quality assessment. This algorithm has been designed for detecting variability factors such as blur, resolution, and homogeneity, but looking at the complete image. These are factors that can affect the accuracy of face recognition, but they might not be the most relevant to detect which images are actually suitable for face recognition. The face QA methods are focused on the zone of the image that contains the face to be analyzed. The method from [46] obtained good results in face QA but, due to its hand-crafted nature, it might perform worse when facing data from other databases and/or scenarios. It would be difficult to adjust this algorithm against different types of images and variability factors. On the other hand, FaceQnet has the potential to be easily adjustable to any possible scenario using a set of training images for fine-tuning the deep model.

Fig. 10: ERC for reduced datasets obtained with the FaceNet comparator for the four testing data subsets. The initial FNMR has been set to 10%. Fractions of the images with lowest quality measures have been removed consecutively. Four different QA algorithms have been used for obtaining quality measures of the testing images: a general Image Quality Assessment (IQA) method [47], a method for face QA based on hand-crafted features [46], FaceQnet v0 [6], and the FaceQnet model of this paper (FaceQnet v1). The line labeled PERFECT is generated using . The closer the quality algorithm line is to the PERFECT line, the more related the quality measure is to face recognition accuracy. (Color image)

VII Conclusion and Discussion

Fig. 11: ERC for reduced datasets obtained with the Face++ comparator for the four testing data subsets. The experimental protocol is the same used for FaceNet (see Figure 10). The line labeled PERFECT is generated using . The closer the quality algorithm line is to the PERFECT line, the more related the quality measure is to face recognition accuracy. (Color image)

The unattainable dream of so many human endeavours: knowing now what is awaiting tomorrow. But is it really an unreachable goal? In different contexts, mathematical models are getting more and more accurate at this seemingly impossible task. This is the case of biometric quality metrics. In a way, biometric quality is a window into the future. Even if this can seem too poetic a definition for a computer algorithm, in reality, it may not be that far-fetched. Biometric quality metrics make it possible to estimate, in the present, the accuracy that a system will achieve in the future on some given set of data. It is not difficult to grasp the huge value of a tool capable of such a feat.

The importance of assessing data quality for improving the performance of operational systems has long been known among the biometric community. Already in 2006 and 2007 the US NIST organised two back-to-back workshops exclusively dedicated to the discussion of biometric quality and the promotion of research in the field.

As a result of these and other similar initiatives, quality estimation algorithms are being increasingly deployed worldwide. Large national and international IT systems such as the US-VISIT, US Personal Identity Verification (PIV), or the EU Visa Information System (VIS) and Schengen Information System (SIS), mandate the measurement and reporting of quality scores of captured images. This is already being achieved on a regular basis in the case of fingerprints, where there has been a huge effort dedicated to the study of quality metrics. This investment has paid great dividends, and has led to the development of NFIQ2, a system-independent, open-source fingerprint quality metric which has been included as the common quality benchmark in the ISO/IEC 29794 standard.

In spite of its importance, unanimously agreed by biometricians, the field of biometric quality assessment is far less advanced in most biometric characteristics than in the case of fingerprints. This is the situation where face recognition finds itself at the moment.

As recently as 2018, the US NIST organised a dedicated workshop to discuss all aspects related to face recognition technology, open to all interested parties, including academia, governmental institutions, law-enforcement agencies, border management agencies, and industry. Among the conclusions of the event, one of the urgent needs identified by all stakeholders was to address the lack of reliable face quality metrics by fostering research in this underdeveloped field. An analogous conclusion was reached in 2019 by the European Commission, following their study for the integration of an automated face recognition system in the Schengen Information System and in other large European IT systems [49]. The report, aimed at policy makers, echoes the appeal made by multiple law enforcement entities in Europe for the development of a standard, system-independent, face quality metric similar to the existing NFIQ2 in fingerprint recognition.

This rapidly spreading awareness of the blatant lack of sufficient investment in face quality has triggered a number of international initiatives to address the problem. Among them, the FRVT Quality Assessment campaign (FRVT-Q) held by US NIST is the first evaluation campaign aimed at comparing face quality metrics and setting the current state of the art in the field, which will allow us to understand the strengths and limitations of existing technology. Another example of the international commitment to tackle this issue is the launch by ISO/IEC JTC 1 SC37, the committee for standardisation in biometrics, of a collaborative work item in face quality, with the ambitious objective of producing standard algorithms for face quality estimation.

While the commitment from international institutions and policy makers is an essential part of the equation, real advance in face quality metrics requires fuel. Ultimately, research is the driving force at the core of all this technology. With this pressing necessity as main motivation, the present work can be regarded as a solid contribution to bridge the existing gap in face quality, advancing the field by producing: an up-to-date overall picture of the state of the art, new insights, new open source algorithms, reproducible results following standard evaluation protocols, and new public data for future research and advancement.

In particular, we have developed FaceQnet v1, a new quality metric powered by deep learning technology which receives a face image as input and produces a scalar quality score as an estimate of the suitability of the picture for use within face recognition systems. As a contribution to the collective effort being made to advance the domain of face quality, FaceQnet v1 is put at the disposal of the community as an open source tool through GitHub, together with the quality scores produced for each of the four test datasets used in the evaluation (VGGFace2, Biosecure, CyberExtruder, and LFW).

In order to reach the most meaningful conclusions possible, FaceQnet has undergone a double evaluation:

  • US NIST independent assessment. The initial version of FaceQnet (v0) was submitted with the first wave of algorithms to the on-going FRVT-Q evaluation campaign organised by US NIST. In that evaluation, while showing promising results with respect to the other participants, the original algorithm revealed some of its flaws. In the current work these limitations have been partially corrected with a new version of FaceQnet (v1).

  • Self-assessment. We have carried out a reproducible self-assessment of the metric, based on public data and following a standard evaluation protocol. This evaluation has shown an improvement with respect to the preliminary version of the algorithm presented in [6] (the same submitted to the NIST evaluation). The new metric described in this work has corrected some of the existing weak points, such as overfitting and system dependence, following a modification in the architecture and the training process. This evaluation has also shown the competitiveness of FaceQnet with respect to other state-of-the-art algorithms.

There are two major and, in our opinion, very valuable conclusions that can be drawn from these evaluations, regarding the two hypotheses that have been made in the work:

  • CONCLUSION 1. Hypothesis 1 is confirmed. The approach followed in the present work for generating groundtruth quality scores holds. It is safe to assume that the comparison score between a perfect quality picture A (i.e., an ICAO compliant picture) and a picture B of lower quality (of the same person) is a valid and accurate reflection of the quality level of picture B. Therefore, the comparison score thus produced can be used as a machine-generated groundtruth quality score for picture B. This strategy allows the groundtruth generation process to be automated, avoiding the highly time- and resource-consuming task of producing quality scores based on human perception, which may also be biased with respect to machines’ understanding of quality.

  • CONCLUSION 2. Hypothesis 2 is confirmed. Machine-learned features for face recognition contain not only the information regarding the identity of the person, but also the information regarding the quality of the picture. This quality-related information can be extracted from the original feature vector through a knowledge transfer process. Therefore, we can conclude that quality and identity are linked not only at the score level (quality measures are predictors of mated scores), but also at the feature level. We believe this new piece of knowledge can be very impactful and of high added value for the face quality forum, as the amount of labeled data available for face recognition is far larger than that tagged for face quality analysis. Consequently, it is possible to accurately train Deep Neural Networks (DNNs) for face recognition from scratch (or to use one of the already trained models), whereas such a process may not be feasible at the moment for face quality estimation. The confirmation of hypothesis 2 allows us to overcome this scarcity of data, releasing the full potential of deep learning systems developed for face recognition to be applied to quality estimation tasks as well.
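
The groundtruth generation strategy validated in Conclusion 1 can be illustrated with a minimal sketch. It assumes that each image has already been converted into an embedding by some face recognition model, and it uses cosine similarity, linearly rescaled to [0, 1], as the comparison score. Both the choice of comparator and the rescaling are assumptions made for illustration, and the function names are hypothetical; this is not the FaceQnet implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def groundtruth_quality(
    reference_embedding: Sequence[float],
    probe_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Label each probe image B with its mated comparison score against
    the high-quality reference image A of the same subject.

    Assumption: the score is cosine similarity mapped from [-1, 1] to
    [0, 1], so that higher values mean higher estimated quality.
    """
    return [
        (cosine_similarity(reference_embedding, probe) + 1.0) / 2.0
        for probe in probe_embeddings
    ]
```

Under these assumptions, the returned values would serve as regression targets when training a quality estimator on the probe images, with one reference image per subject playing the role of picture A.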

In addition to the two lessons pointed out above, the experimental evaluation of FaceQnet has also disclosed some critical points in the design of the algorithm that need to be carefully taken into account if a similar approach is applied by other researchers for the development of face quality metrics. Most importantly:

  • Selection of the training ICAO-compliant images. One of the key points in the present approach is the generation of the groundtruth quality scores based on a perfect ICAO compliant picture (see hypothesis 1). Due to the lack of public databases specifically designed for face quality assessment, such ICAO portraits were selected from an all-purpose face database, relying on an automated ICAO-compliance tester whose effectiveness has not been sufficiently proven. A manual supervision of the automatically selected pictures was performed as a second check, in order to ensure, to the largest extent possible, an overall high-quality level. In spite of our best efforts, it is likely that many of those training images, even though of high quality, were not fully ICAO compliant. Therefore, it is our strong belief that the training process would largely benefit if it were carried out on images initially acquired under ICAO restrictions, and not selected from a general “in-the-wild” type database.

  • Training database. A popular machine learning principle preaches: “in God we trust, all others must bring data”. Or, in other words, the more data, the better. More accurate results can be expected if the size of the training database is significantly increased, with images that cover the whole quality spectrum substantially and uniformly. This means that the training database should comprise, for each single subject: 1) pictures acquired in an ICAO compliant environment (see bullet point above); 2) pictures covering a large range of quality values (e.g., close-to-ICAO, frontal webcam indoor, frontal webcam outdoor, in the wild). To the best of our knowledge, no such database is yet available to the research community. It would be, in our view, an invaluable asset for further advancing the field of face quality assessment.

  • Face detector. In order to avoid biased results derived from features extracted from the background (e.g., if the background is homogeneous, the image may be automatically classified as ICAO), face quality assessment algorithms should rely exclusively on information stemming from the face, separating in this way the task of face detection from the task of face quality assessment. This means that, for an input image, the first task is to detect only the face and to perform a tight crop solely of that area in the picture. With this design the face detector may still have difficulties in properly locating the face, but that difficulty is independent of the face-only quality metrics that we advocate. This is the most flexible and informative approach for dealing with biometric quality in general, but we understand that in some applications a quality metric that integrates both the face segmentation and the biometric-only quality may be more efficient and operational. Although in our vision the biometric segmentation (face detection in this case) is not intrinsically part of the face quality algorithm, it can have a decisive impact on its outcome, depending on the accuracy of the face detector used. For the training and evaluation of face quality metrics, including FaceQnet, it is highly recommended to use face images with groundtruth segmentation information for the face area, so that a face detector is not required and, therefore, the possible variability introduced by it is removed from the system.
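
The tight crop recommended in the last point above can be sketched as follows, assuming that groundtruth segmentation is available as a bounding box in pixel coordinates. The image representation (a list of pixel rows), the box convention (left, top, right, bottom, with the right and bottom edges exclusive), and the function name are all hypothetical choices made for this illustration.

```python
from typing import List, Tuple

# Hypothetical stand-in for a real image array: rows of grey-level pixels.
Image = List[List[int]]
# Assumed box convention: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def tight_face_crop(image: Image, box: Box) -> Image:
    """Return only the face region, so that background pixels cannot
    influence a quality metric computed on the result."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box

    # Clamp the box to the image so that a slightly oversized annotation
    # still yields a valid crop.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)

    if left >= right or top >= bottom:
        raise ValueError("bounding box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]
```

Because the box comes from groundtruth annotations rather than from a detector, any variability that a face detector would introduce is kept out of the quality score, which is the separation of tasks argued for above.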

Even though our work has been developed particularly in the framework of face biometrics, the proposed methodology for building a fully automated quality metric can be useful for other problems as well. Our methods can in fact be the basis to develop performance prediction tools for any automated artificial intelligence pipeline when dealing with a specific input.

As a wrap-up, we can say that the present work represents a step forward in the arduous quest for the generation of robust, system-independent, standard face quality metrics. All algorithms, results, and data described in the article have been made available to the community, so that this work can serve as a cornerstone to further advance this fundamental field, for the future deployment and development of face recognition technology.

After all, let’s not forget that, as we stated at the beginning of this article, the results of a computerised system are only as reliable as the data you input. If you input data that is garbage, the result will be unreliable garbage. Consequently, detecting garbage at the input should be a compulsory, critical task for any automated system.


This work was supported in part by projects BIBECA from MINECO/FEDER (RTI2018-101248-B-I00), IDEA-FAST (IMI2-2018-15-two-stage-853981), PRIMA (ITN-2019-860315), and TRESPASS-ETN (ITN-2019-860813). The work was conducted in part during a research stay of J. H.-O. at the Joint Research Centre, Ispra, Italy. He is also supported by a PhD Scholarship from UAM. We also thank the experimental support given by Rudolf Haraksim at the JRC.


  • [1] D. Benini et al., “ISO/IEC 29794-1 Biometric Quality Framework Standard,” ISO/IEC JTC1 SC37, 2006.
  • [2] P. Grother and E. Tabassi, “Performance of biometric quality measures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 531–543, 2007.
  • [3] P. J. Grother, A. Hom, M. L. Ngan, and K. K. Hanaoka, “Draft report: Ongoing Face Recognition Vendor Test (FRVT) Part 5: Face Image Quality Assessment,” US Department of Commerce, National Institute of Standards and Technology, Tech. Rep., 2020.
  • [4] E. Tabassi, C. Wilson, and C. I. Watson, “Fingerprint Image Quality,” US Department of Commerce, National Institute of Standards and Technology, Tech. Rep., 2004. [Online]. Available:
  • [5] MarketsandMarkets, “Biometric System Market by Authentication Type (Single-Factor: Fingerprint, Iris, Palm Print, Face, Voice; Multi-Factor), Offering (Hardware, Software), Functionality (Contact, Noncontact, Combined), End User, and Region - Global Forecast to 2024,” Retrieved April 2020. [Online]. Available:
  • [6] J. Hernandez-Ortega, J. Galbally, J. Fierrez, R. Haraksim, and L. Beslay, “FaceQnet: Quality Assessment for Face Recognition based on Deep Learning,” IAPR International Conference on Biometrics (ICB), 2019.
  • [7] F. Alonso-Fernandez, J. Fierrez, and J. Ortega-Garcia, “Quality measures in biometric systems,” IEEE Security and Privacy, vol. 10, no. 6, pp. 52–62, 2012.
  • [8] A. Khodabakhsh, M. Pedersen, and C. Busch, “Subjective versus objective face image quality evaluation for face recognition,” International Conference on Biometric Engineering and Applications (ICBEA), pp. 36–42, 2019.
  • [9] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2018.
  • [10] J. Ortega-Garcia, J. Fierrez, F. Alonso-Fernandez, J. Galbally et al., “The multiscenario multienvironment Biosecure Multimodal Database (BMDB),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 6, pp. 1097–1111, 2010.
  • [11] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments,” UMass, Tech. Rep. 07-49, 2007.
  • [12] A. Abaza, M. A. Harrison, and T. Bourlai, “Quality metrics for practical face recognition,” IAPR International Conference on Pattern Recognition (ICPR), pp. 3103–3107, 2012.
  • [13] F. Weber, “Some quality measures for face images and their relationship to recognition performance,” in Biometric Quality Workshop. National Institute of Standards and Technology, Maryland, USA, 2006.
  • [14] R.-L. V. Hsu, J. Shah, and B. Martin, “Quality assessment of facial images,” in IEEE Biometrics Symposium: Special Session on Research at the Biometric Consortium Conference, 2006.
  • [15] X. Gao, S. Z. Li, R. Liu, and P. Zhang, “Standardization of face image sample quality,” in IAPR International Conference on Biometrics (ICB), 2007, pp. 242–251.
  • [16] M. Abdel-Mottaleb and M. H. Mahoor, “Application notes-algorithms for assessing the quality of facial images,” IEEE Computational Intelligence Magazine, vol. 2, no. 2, pp. 10–17, 2007.
  • [17] J. R. Beveridge, D. S. Bolme, B. A. Draper, G. H. Givens, Y. M. Lui, and P. J. Phillips, “Quantifying how lighting and focus affect face recognition performance,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, 2010, pp. 74–81.
  • [18] M. Ferrara, A. Franco, D. Maio, and D. Maltoni, “Face image conformance to ISO/ICAO standards in machine readable travel documents,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 4, pp. 1204–1213, 2012.
  • [19] P. J. Phillips, J. R. Beveridge, D. S. Bolme, B. A. Draper, G. H. Givens, Y. M. Lui, S. Cheng, M. N. Teli, and H. Zhang, “On the existence of face quality measures,” IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2013.
  • [20] R. Raghavendra, K. B. Raja, B. Yang, and C. Busch, “Automatic face quality assessment from video using gray level co-occurrence matrix: An empirical study on Automatic Border Control system,” IAPR International Conference on Pattern Recognition (ICPR), pp. 438–443, 2014.
  • [21] A. Dutta, R. Veldhuis, and L. Spreeuwers, “Predicting face recognition performance using image quality,” arXiv: 1510.07119, 2015.
  • [22] L. Best-Rowden and A. K. Jain, “Learning Face Image Quality From Human Assessments,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 12, pp. 3064–3077, 2018.
  • [23] P. Terhörst, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper, “SER-FIQ: Unsupervised estimation of face image quality based on stochastic embedding robustness,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [24] Y. Song, J. Zhang, L. Gong, S. He, L. Bao, J. Pan, Q. Yang, and M.-H. Yang, “Joint face hallucination and deblurring via structure generation and detail enhancement,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 785–800, 2019.
  • [25] K. Grm, W. J. Scheirer, and V. Štruc, “Face hallucination using cascaded super-resolution and identity priors,” IEEE Transactions on Image Processing, vol. 127, no. 6-7, pp. 785–800, 2020.
  • [26] F. Alonso-Fernandez, J. Fierrez, D. Ramos, and J. Gonzalez-Rodriguez, “Quality-based conditional processing in multi-biometrics: application to sensor interoperability,” IEEE Transaction on Systems, Man and Cybernetics Part A, vol. 40, no. 6, pp. 1168–1179, 2010.
  • [27] M. Nappi, S. Ricciardi, and M. Tistarelli, “Context awareness in biometric systems and methods: State of the art and future scenarios,” Image and Vision Computing, vol. 76, pp. 27–37, 2018.
  • [28] J. Fierrez, A. Morales, R. Vera-Rodriguez, and D. Camacho, “Multiple classifiers in biometrics. part 2: Trends and challenges,” Information Fusion, vol. 44, pp. 103–112, 2018.
  • [29] M. Singh, R. Singh, and A. Ross, “A comprehensive overview of biometric fusion,” Information Fusion, vol. 52, pp. 187–205, 2019.
  • [30] G. Fumera, G. L. Marcialis, B. Biggio, F. Roli, and S. C. Schuckers, “Multimodal anti-spoofing in biometric recognition systems,” in Handbook of Biometric Anti-Spoofing: Trusted Biometrics under Spoofing Attacks, S. Marcel, M. S. Nixon, and S. Z. Li, Eds.   Springer London, 2014, pp. 165–184.
  • [31] P. H. Pisani, A. Mhenni, R. Giot, E. Cherrier, N. Poh, A. C. P. d. L. Ferreira de Carvalho, C. Rosenberger, and N. E. B. Amara, “Adaptive biometric systems: Review and perspectives,” ACM Computing Surveys, vol. 52, no. 5, pp. 1–38, 2019.
  • [32] J. Hernandez-Ortega, S. Nagae, J. Fierrez, and A. Morales, “Quality-based pulse estimation from NIR face video with application to driver monitoring,” in Pattern Recognition and Image Analysis, A. Morales, J. Fierrez, J. S. Sanchez, and B. Ribeiro, Eds., 2019, pp. 108–119.
  • [33] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Incremental face alignment in the wild,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1859–1866, 2014.
  • [34] L. Didaci, G. L. Marcialis, and F. Roli, “Analysis of unsupervised template update in biometric recognition systems,” Pattern Recognition Letters, vol. 37, pp. 151–160, 2014.
  • [35] I. Serna, A. Peña, A. Morales, and J. Fierrez, “InsideBias: Measuring Bias in Deep Networks and Application to Face Gender Biometrics,” arXiv:2004.06592, 2020.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
  • [37] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney et al., “IARPA Janus Benchmark–C: Face Dataset and Protocol,” IAPR International Conference on Biometrics (ICB), pp. 158–165, 2018.
  • [38] J. C. Neves, G. Santos, S. Filipe, E. Grancho, S. Barra, F. Narducci, and H. Proença, “QUIS-CAMPI: Extending in the wild biometric recognition to surveillance environment,” International Conference on Image Analysis and Processing (ICIAP), pp. 59–68, 2015.
  • [39] J. R. Beveridge et al., “The challenge of face recognition from digital point-and-shoot cameras,” IEEE International Conference on Biometrics: Theory Applications and Systems (BTAS), pp. 1–8, 2013.
  • [40] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, 2015.
  • [41] A. Geitgey, “Face recognition API for Python and the Command Line,” Retrieved December 2019. [Online]. Available:
  • [42] BaseApp, “DeepSight: State of the Art Deep learning powered Face Detection, Recognition, Demographics, Gender, Age, Landmarks,” Retrieved December 2019. [Online]. Available:
  • [43] Y. Su, Y. Fu, Q. Tian, and X. Gao, “Cross-database age estimation based on transfer learning,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1270–1273, 2010.
  • [44] S. Ghosh, R. Singh, M. Vatsa, N. Ratha, and V. M. Patel, Domain Adaptation for Visual Understanding.   Springer, 2020.
  • [45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  • [46] J. Chen, Y. Deng, G. Bai, and G. Su, “Face image quality assessment based on learning to rank,” IEEE Signal Processing Letters, vol. 22, no. 1, pp. 90–94, 2014.
  • [47] C. Lennan, H. Nguyen, and D. Tran, “Image Quality Assessment,” Retrieved December 2019. [Online]. Available:
  • [48] Megvii, “Face++ Compare API,” Retrieved December 2019. [Online]. Available:
  • [49] J. Galbally, P. Ferrara, R. Haraksim, A. Psyllos, and L. Beslay, “Study on Face Identification Technology for its Implementation in the Schengen Information System,” JRC Science for Policy Report - 29808EN, 2019.