Face recognition algorithms are good examples of recent advances in Artificial Intelligence (AI). The performance of automatic face recognition has been boosted during the last decade, achieving very competitive accuracies in the most challenging scenarios. These improvements have been possible due to improved machine learning approaches (e.g., deep learning), powerful computation (e.g., GPUs), and larger databases (e.g., at the scale of millions of images). However, recognition accuracy is not the only aspect to consider when designing biometric systems. Algorithms play an increasingly important role in the decision-making of several processes involving humans, and these decisions therefore have a growing effect on our lives. Thus, there is currently a growing need to study AI behavior in order to better understand its impact on our society.
Face recognition systems are especially sensitive due to the personal information present in face images (e.g., identity, gender, ethnicity, and age). Previous works suggested that face recognition accuracy is affected by demographic covariates. In [7, 14], the authors demonstrated that the performance of commercial face recognition systems varies according to demographic attributes. In [18, 1], the authors evaluated how covariates affect the performance of face recognition systems based on deep neural network models. Among the different covariates, skin color is repeatedly singled out as a factor with a high impact on performance [7, 18]. However, ethnic face attributes go beyond skin color. The shape and size of facial features are partially defined by ancestry origin, and these differences can be used to accurately classify subjects according to their ancestry origin.
The number of published works pointing out biases in the results of face detection and recognition algorithms is large [14, 1, 3, 7, 18, 12]. Yet, only a limited number of works analyze how biases affect the learning process of these algorithms. The aim of this work is to analyze face recognition models from a discrimination-aware perspective. Previous studies have demonstrated that ethnicity and gender affect the performance of face recognition models. However, there is a lack of understanding regarding how this demographic information affects the model beyond its performance. The main contributions of this work are:
A general formulation of algorithmic discrimination for machine learning tasks. In this work, we apply this formulation in the context of face recognition.
Discrimination-aware performance analysis based on a new dataset, with 24K identities equally distributed among six demographic groups.
Study of the effects of gender and ethnicity on the feature representation of deep models.
Analysis of the demographic diversity present in some of the most popular face databases.
The rest of the paper is structured as follows: Section 2 presents our general formulation of algorithmic discrimination. Section 3 analyzes some of the most popular face recognition architectures and the experimental protocol followed in this work. Section 4 evaluates the causes and effects of biased learning in face recognition algorithms. Finally, Section 5 summarizes the main conclusions.
2 Formulation of Algorithmic Discrimination
Discrimination is defined by the Cambridge Dictionary as "treating a person or particular group of people differently, especially in a worse way than the way in which you treat other people, because of their skin color, sex, sexuality, etc."
For the purpose of studying discrimination in artificial intelligence at large, we now mathematically formulate algorithmic discrimination based on the previous dictionary definition. Even though ideas similar to the ones embedded in our formulation can be found elsewhere [5, 25], we did not find this kind of formulation in related works. We hope that formalizing these concepts will be beneficial in fostering further research and discussion on this hot topic.
Let's begin with notation and preliminary definitions. Assume $\mathbf{x}_j^i$ is a learned representation of individual $i$ (out of $I$ different individuals) corresponding to an input sample $j$ (out of $J$ samples) of that particular subject. That representation $\mathbf{x}$ is assumed to be useful for task $T$, e.g., face authentication or emotion recognition. That representation $\mathbf{x}$ is learned using an artificial intelligence approach with parameters $\mathbf{w}$. We also assume that there is a goodness criterion $G$ on that task maximizing some performance real-valued function $f$ in a given dataset $\mathcal{D}$ (collection of multiple samples) in the form:

$$G(\mathcal{D}) = \max_{\mathbf{w}} f(\mathcal{D}, T, \mathbf{w}) \qquad (1)$$
The most popular form of the previous expression minimizes a loss function $\mathcal{L}$ over a set of training samples in the form:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i,j} \mathcal{L}\left(O(\mathbf{x}_j^i \mid \mathbf{w}), T\right) \qquad (2)$$

where $O$ is the output of the learning algorithm that we seek to bring closer to the target function (or ground truth) $T$ defined by the task at hand. On the other hand, the $I$ individuals can be classified according to $D$ demographic criteria $C_d$, with $d = 1, \ldots, D$, which can be the source for discrimination, e.g., $C_1 = \text{Gender} = \{\text{Male}, \text{Female}\}$ (demographic criterion $d=1$ has two classes in this example). The particular class $k$ for a given demographic criterion $d$ and a given sample $\mathbf{x}_j^i$ is denoted as $C_d(\mathbf{x}_j^i) = k$, e.g., $C_1(\mathbf{x}_j^i) = \text{Female}$. We assume that all classes are well represented in dataset $\mathcal{D}$, i.e., the number of samples for each class in all criteria in $\mathcal{D}$ is significant. $\mathcal{D}_d^k$ represents all the samples corresponding to class $k$ of demographic criterion $d$.
Finally, our definition of algorithmic discrimination: an algorithm discriminates against the group of people represented by class $k$ (e.g., Female) when performing the task $T$ (e.g., face verification, or emotion recognition), if the goodness $G(\mathcal{D})$ in that task when considering the full set of data $\mathcal{D}$ (including multiple samples from multiple individuals) is significantly larger than the goodness $G(\mathcal{D}_d^k)$ in the subset of data corresponding to class $k$ of the demographic criterion $d$.
The representation $\mathbf{x}$ and the model parameters $\mathbf{w}$ will typically be real-valued vectors, but they can be any set of features combining real and discrete values. Note that the previous formulation can be easily extended to the case of a varying number of samples for different subjects, which is a usual case, or to classes $k$ that are not disjoint. Note also that the previous formulation is based on average performances over groups of individuals. Different performance across specific individuals is usual in many artificial intelligence tasks for diverse reasons, e.g., specific users who were not sensed properly, even for algorithms that on average may perform similarly for the different classes that can be the source of discrimination.
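The definition above can be sketched in code. In this toy example the goodness G is verification accuracy at a fixed threshold, and "significantly larger" is approximated by a simple margin; both choices, as well as all the synthetic scores, are illustrative assumptions and not part of the formulation itself:

```python
import numpy as np

def goodness(scores, labels, tau=0.5):
    """Goodness G as verification accuracy at a fixed threshold tau:
    a comparison is accepted when its score exceeds tau."""
    return float(np.mean((scores > tau).astype(int) == labels))

def discriminates(scores, labels, groups, k, margin=0.05):
    """Flag discrimination against class k: G over the full dataset D is
    larger (by at least `margin`) than G over the subset of class k."""
    g_full = goodness(scores, labels)
    g_k = goodness(scores[groups == k], labels[groups == k])
    return (g_full - g_k) > margin

# Toy data for one demographic criterion with two classes (0 and 1);
# class 1 gets noisier scores, emulating an underrepresented group.
rng = np.random.default_rng(0)
groups = np.array([0] * 500 + [1] * 500)
labels = rng.integers(0, 2, size=1000)   # 1 = genuine, 0 = impostor pair
noise = np.where(groups == 0, 0.1, 0.5)  # class 1: noisier scores
scores = labels + rng.normal(0.0, noise)
print(discriminates(scores, labels, groups, k=1))
```

In a real evaluation the margin would be replaced by a proper statistical significance test over the per-class goodness values.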
3 Face Recognition Algorithms
A face recognition algorithm, like other machine learning systems, can be divided into two different algorithms: the screener and the trainer. Both algorithms are used for different aims and should therefore be studied from different perspectives.
The screener (see Fig. 1) is an algorithm that, given two face images, generates an output associated with the probability that they belong to the same person. This probability is obtained by comparing the two learned representations obtained from a face model defined by its parameters. These parameters are previously trained based on a training dataset and the goodness criterion $G$ (see Fig. 1). If trained properly, the output of the trainer is a model with parameters capable of representing the input data (e.g., face images) in a highly discriminant feature space $\mathbf{x}$.
The most popular architecture used to model face attributes is the Convolutional Neural Network (CNN). This type of network has drastically reduced the error rates of face recognition algorithms in the last decade by learning highly discriminative features from large-scale databases. In our experiments we consider two popular face recognition pre-trained models: VGG-Face and ResNet-50. These models have been tested on competitive evaluations and public benchmarks [23, 6].
VGG-Face is a model based on the VGG-Very-Deep-16 CNN architecture trained on the VGGFace dataset. ResNet-50 is a CNN model with 50 layers and 41M parameters initially proposed for general-purpose image recognition tasks. The main difference between the ResNet architecture and traditional convolutional neural networks is the inclusion of residual connections, which allow information to skip layers and improve gradient flow.
Before applying the face models, we cropped the face images using the algorithm proposed in . The pre-trained models are used as embedding extractors, where $\mathbf{x}$ is an $\ell_2$-normalised learned representation of a face image. The similarity between two face descriptors $\mathbf{x}_1$ and $\mathbf{x}_2$ is calculated as the Euclidean distance $\|\mathbf{x}_1 - \mathbf{x}_2\|$. Two faces are assigned to the same identity if their distance is smaller than a threshold $\tau$. The recognition accuracy is obtained by comparing the distances between positive matches (i.e., $\mathbf{x}_1$ and $\mathbf{x}_2$ belong to the same person) and negative matches (i.e., $\mathbf{x}_1$ and $\mathbf{x}_2$ belong to different persons).
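This verification step can be sketched as follows. The `embed` function is a hypothetical stand-in for the real CNN extractor (a random linear map followed by l2-normalisation), and the threshold value is illustrative; only the pipeline structure matches the text:

```python
import numpy as np

def embed(image_vector, W):
    """Stand-in for the CNN embedding extractor: a linear map followed
    by l2-normalisation (the real models are VGG-Face / ResNet-50)."""
    x = W @ image_vector
    return x / np.linalg.norm(x)

def same_identity(x1, x2, tau=1.0):
    """Assign two faces to the same identity if the Euclidean distance
    between their l2-normalised descriptors is below the threshold tau."""
    return bool(np.linalg.norm(x1 - x2) < tau)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 512))        # hypothetical embedding weights
face_a = rng.normal(size=512)          # a sample of subject A
face_a2 = face_a + 0.01 * rng.normal(size=512)  # another sample of A
face_b = rng.normal(size=512)          # a sample of a different subject

xa, xa2, xb = embed(face_a, W), embed(face_a2, W), embed(face_b, W)
print(same_identity(xa, xa2), same_identity(xa, xb))
```

For unit-norm descriptors the distance of unrelated faces concentrates near sqrt(2), so a threshold around 1.0 separates the toy genuine and impostor pairs.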
The two face models considered in our experiments were trained with the VGGFace2 dataset according to the details provided in . As we will show in Section 4.3, the databases used to train these two models are highly biased. It is therefore expected that recognition models trained with this dataset will exhibit algorithmic discrimination.
3.1 Experimental protocol
Labeled Faces in the Wild (LFW) is a database for research on unconstrained face recognition. The database contains more than 13K images of faces collected from the web. In this study we consider the aligned images from the test set provided with View 1 and its associated evaluation protocol. This database is composed of images acquired in the wild, with large pose variations, varying facial expressions, image quality, illumination, and background clutter, among other variations. The Equal Error Rates achieved by the VGG-Face and ResNet-50 models on the LFW database serve as a baseline for the models and the rest of the experiments. We can observe the superior performance of the ResNet-50 model, with a performance around 3 times better than that of the VGG-Face model.
The experiments with DiveFace are carried out following a cross-validation methodology using three images for each of the 4K identities from each of the six classes available in DiveFace (72K face images in total). This results in 72K genuine comparisons and nearly 3M impostor comparisons.
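The genuine-comparison count follows directly from the protocol's combinatorics (three images per identity give C(3,2) = 3 genuine pairs each); the ~3M impostor comparisons presumably come from sampled cross-identity pairs and are not derived here:

```python
from math import comb

identities_per_class = 4_000
classes = 6
images_per_identity = 3

identities = identities_per_class * classes          # 24K identities
total_images = identities * images_per_identity      # 72K face images
genuine = identities * comb(images_per_identity, 2)  # 3 genuine pairs each
print(identities, total_images, genuine)
```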
3.2 DiveFace database: an annotation dataset for face recognition trained on diversity
DiveFace was generated using the Megaface MF2 training dataset. MF2 is part of the publicly available Megaface dataset, with 4.7 million faces from 672K identities, and it includes their respective bounding boxes. All images in the Megaface dataset were obtained from the Yahoo Flickr dataset.
DiveFace contains annotations equally distributed among six classes related to gender and ethnicity (see Fig. 4 for example images). Gender and ethnicity have been annotated following a semi-automatic process. There are 24K identities (4K per class). The average number of images per identity is 5.5, with a minimum of 3, for a total number of images greater than 120K. Users are grouped according to their gender (male or female) and three categories related to ethnic physical characteristics:
Group 1: people with ancestral origins in Europe, North-America, and Latin-America (with European origin).
Group 2: people with ancestral origins in Sub-Saharan Africa, India, Bangladesh, Bhutan, among others.
Group 3: people with ancestral origin in Japan, China, Korea, and other countries in that region.
We are aware of the limitations of grouping all human ethnic origins into only three categories. According to studies, there are more than 5K ethnic groups in the world. We categorized according to only three groups in order to maximize the differences among classes. Automatic classification algorithms based on these three categories reach accuracies of up to 98%.
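That broad, well-separated categories are easy to classify automatically can be illustrated with a toy sketch. The 32-dimensional feature vectors and the group-dependent offsets below are synthetic stand-ins for real face descriptors; the classifier choice is ours, not from the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical 32-d face descriptors with a group-dependent shift,
# standing in for real embeddings of the three ethnicity groups.
groups = np.repeat([0, 1, 2], 300)
centers = rng.normal(size=(3, 32))
X = centers[groups] + 0.8 * rng.normal(size=(900, 32))

X_tr, X_te, y_tr, y_te = train_test_split(X, groups, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```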
4 Causes and Effects of Biased Learning in Face Recognition Algorithms
4.1 Performance of face recognition: role of demographic information
Table 1: Equal Error Rate (EER, in %) per demographic group; in parentheses, the relative degradation with respect to the best class (Group 1 Male).

| Model | Group 1 Male | Group 1 Female | Group 2 Male | Group 2 Female | Group 3 Male | Group 3 Female |
|---|---|---|---|---|---|---|
| VGG-Face | 7.99 | 9.38 (17%) | 12.03 (50%) | 13.95 (76%) | 18.43 (131%) | 23.66 (196%) |
| ResNet-50 | 1.60 | 1.96 (22%) | 2.15 (34%) | 3.61 (126%) | 3.25 (103%) | 5.07 (217%) |
This section explores the effects of biased models on the performance of face recognition algorithms. Table 1 shows the performance obtained for each demographic group present in DiveFace. Traditional face recognition benchmarks usually do not explore this kind of demographic covariate. The results reported in Table 1 exhibit large gaps between the performances obtained for different demographic groups, suggesting that both gender and ethnicity significantly affect the performance of biased models. These effects are particularly pronounced for ethnicity, with a very large degradation of the results for the class least represented in the training data (Group 3 Female). This degradation amounts to a relative increment of the Equal Error Rate (EER) of 196% and 217% for VGG-Face and ResNet-50, respectively, with regard to the best class (Group 1 Male). These differences matter because they determine the rates at which faces are correctly and incorrectly matched. The results suggest that a subject's ethnic origin can strongly affect the probability of being incorrectly matched (false positives).
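The EER values reported in Table 1 are computed from genuine and impostor distance scores. A self-contained sketch with synthetic scores (the distribution parameters are illustrative, not DiveFace measurements):

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate for distance scores: the operating point where
    the false acceptance rate (impostor distance below the threshold)
    equals the false rejection rate (genuine distance above it)."""
    thresholds = np.quantile(np.concatenate([genuine, impostor]),
                             np.linspace(0, 1, 501))
    far = np.array([(impostor < t).mean() for t in thresholds])
    frr = np.array([(genuine >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

rng = np.random.default_rng(0)
# Hypothetical distance distributions: for the disadvantaged group the
# impostor distances sit closer to the genuine ones.
eer_best = eer(rng.normal(0.6, 0.1, 2000), rng.normal(1.2, 0.1, 2000))
eer_worst = eer(rng.normal(0.6, 0.1, 2000), rng.normal(0.9, 0.1, 2000))
print(eer_best < eer_worst)
```

The relative degradation quoted in the text is then simply `100 * (eer_worst - eer_best) / eer_best`.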
4.2 Understanding biased performances
The relatively low performance on Group 3 seems to originate in a limited ability to capture the most discriminant features for the groups underrepresented in the training databases. The results suggest that features capable of reaching high accuracy for a specific demographic group may be less competitive in others. Let us analyze the causes behind these degradations. Fig. 2 represents the probability distributions of genuine and impostor scores for Group 1 Male (the best group) and Group 3 Female (the worst group). The comparison reveals large differences in the impostor distributions: the genuine distributions (intra-class variability) of Group 3 and Group 1 are similar, but the impostor distributions (inter-class variability) differ significantly. The model has difficulty differentiating the face attributes of different subjects within the underrepresented group.
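One way to quantify this gap is the decidability index d' between the genuine and impostor score distributions: it stays high when the distributions are well separated and drops when the impostor distribution shifts toward the genuine one, as described above. The distribution parameters below are illustrative, not measured values:

```python
import numpy as np

def decidability(genuine, impostor):
    """d' index: separation between genuine and impostor score
    distributions (higher means easier verification)."""
    mg, mi = genuine.mean(), impostor.mean()
    vg, vi = genuine.var(), impostor.var()
    return float(abs(mg - mi) / np.sqrt(0.5 * (vg + vi)))

rng = np.random.default_rng(1)
# Same intra-class spread for both groups, but the impostor distances of
# the disadvantaged group shift toward the genuine ones (inter-class
# variability drops), mirroring the behavior described for Fig. 2.
best = decidability(rng.normal(0.6, 0.1, 3000), rng.normal(1.2, 0.1, 3000))
worst = decidability(rng.normal(0.6, 0.1, 3000), rng.normal(0.9, 0.1, 3000))
print(best > worst)
```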
Algorithmic discrimination implications: define the performance function $f$ as the accuracy of the face recognition model, and the goodness $G(\mathcal{D}_d^k)$ as that accuracy over all the samples corresponding to class $k$ of the demographic criterion $d$, for an algorithm trained on the full set of data $\mathcal{D}$ (as described in Eq. 1). The results suggest large differences between the goodness $G(\mathcal{D}_d^k)$ for different classes, especially for the classes of the ethnicity criterion.
4.3 Bias in face databases
Bias and discrimination are related concepts, but they are not necessarily the same thing. Bias is traditionally associated with the unequal representation of classes in a dataset. The history of automatic face recognition has been linked to the history of the databases used for algorithm training over the last two decades. The number of publicly available databases is high, and they allow training models with millions of face images. Fig. 3 summarizes the demographic statistics of some of the most cited face databases. Each of these databases is characterized by its own biases (e.g., image quality, pose, backgrounds, and aging). In this work, we highlight the unequal representation of demographic information in very popular face recognition databases. As can be seen, the differences between ethnic groups are severe. Even though people in Group 3 account for more than 35% of the world's population, they represent only 9% of the users in these popular face recognition databases.
Biased databases imply a double penalty for underrepresented classes. On the one hand, models are trained according to non-representative diversity. On the other hand, benchmark accuracies are reported over privileged classes and overestimate the real performance over a diverse society.
Recently, diverse and discrimination-aware databases have been proposed in [4, 20, 29]. These databases are valuable resources to explore how diversity can be used to improve face biometrics. However, some of these databases do not include identity labels [4, 20], so face images cannot be matched to other images of the same person. Therefore, these databases do not allow properly training or testing face recognition algorithms.
Algorithmic discrimination implications: classes are unequally represented in the most popular face databases.
4.4 Biased embedding space of deep models
We now analyze the effects of ethnicity and gender attributes on the embedding space generated by the VGG-Face and ResNet-50 models. CNNs are composed of a large number of stacked filters, trained to extract the richest information for a pre-defined task (e.g., face recognition). As face recognition models are trained to identify individuals, it is reasonable to expect that the response of the models varies slightly from one person to another. In order to visualize the response of the model to different faces, we consider a specific Class Activation Map (CAM) technique named Grad-CAM. This visualization technique uses the gradients of any target concept, flowing into the selected convolutional layer, to produce a coarse localization map. The resulting heat map highlights the regions of the image activated for the chosen target (e.g., an individual identity in our case). Fig. 4 represents the heat maps obtained by the ResNet-50 model for faces from different demographic groups. Additionally, we include the heat map obtained after averaging the results of 120 different individuals from the six demographic groups included in DiveFace. The activation maps show clear differences between ethnic groups, with the highest activation for Group 1 and the lowest for Group 3. These differences suggest that the features extracted by the model are, at least, partially affected by ethnic attributes.
On a different front, we applied a popular data visualization algorithm, t-SNE, to better understand the importance of ethnic features in the embedding space generated by deep models. t-SNE visualizes high-dimensional data by minimizing the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. Fig. 5 shows the projection of each face into a 2D space generated from the ResNet-50 embeddings with the t-SNE algorithm, with each point colored according to its ethnic attribute. As we can see, the resulting representation forms three clusters highly correlated with the ethnicity attributes. Note that ResNet-50 was trained for face recognition, not ethnicity detection; nevertheless, the ethnicity information is strongly embedded in the feature space, and a simple application of t-SNE reveals its presence.
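The t-SNE experiment can be sketched as follows. The embeddings here are synthetic stand-ins (16-d features with a group-dependent offset) rather than real ResNet-50 descriptors, and the silhouette score is our own way of checking that the 2D projection clusters by group:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for ResNet-50 embeddings: 16-d features with a
# group-dependent offset emulating embedded ethnicity information.
groups = np.repeat([0, 1, 2], 50)
centers = 6.0 * np.eye(3, 16)            # three well-separated centers
X = centers[groups] + 0.5 * rng.normal(size=(150, 16))

# Project to 2D with t-SNE, as done for Fig. 5.
X2 = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)
separation = silhouette_score(X2, groups)  # > 0: group-wise clusters
print(X2.shape, round(float(separation), 2))
```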
These two simple experiments illustrate the presence and importance of ethnic attributes in the feature space generated by face deep models.
Algorithmic discrimination implications: popular deep models trained for a task $T$ on biased databases (i.e., with unequally represented classes for a given demographic criterion such as gender) result in feature spaces (corresponding to the solution of Eq. 1) that introduce strong differentiation between classes. This differentiation affects the representation $\mathbf{x}$ and enables classifying between classes using $\mathbf{x}$, even though $\mathbf{x}$ was trained for solving a different task $T$.
5 Conclusions

This work has presented a comprehensive analysis of face recognition models from a new discrimination-aware perspective. It presents a new general formulation of algorithmic discrimination with application to face recognition. We have shown the high bias introduced when training deep models with the most popular databases employed in the literature, and testing with the DiveFace dataset, which has well-balanced data across demographic groups (available at GitHub: https://github.com/BiDAlab/DiveFace). We have evaluated two popular models according to the proposed formulation. Biased models based on competitive deep learning algorithms have been shown to be very sensitive to gender and ethnicity attributes. This sensitivity results in different feature representations and a large gap between performances depending on the ethnic origin. This gap reached up to 200% of relative error degradation between the best class (Group 1 Male) and the worst (Group 3 Female). These results suggest that false positives are 200% more likely in Group 3 Female than in Group 1 Male for the models evaluated in this work. These results encourage training more diverse models and developing methods capable of dealing with the differences inherent to demographic groups. Future work will go in line with this approach, as authors do in .
This work has been supported by projects: BIBECA (RTI2018-101248-B-I00 MINECO/FEDER), Bio-Guard (Ayudas Fundacion BBVA a Equipos de Investigacion Cientifica 2017).
Measuring the Gender and Ethnicity Bias in Deep Models for Face Recognition.
Iberoamerican Congress on Pattern Recognition (IAPR), Madrid, Spain, pp. 584–593. Cited by: §1, §1.
-  (2011) Quality Measures in Biometric Systems. IEEE Security & Privacy 10 (6), pp. 52–62. Cited by: §2.
Turning a Blind Eye: Explicit Removal of Biases and Variation from Deep Neural Network embeddings.
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany. Cited by: §1.
-  (2018-23–24 Feb) Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, S. A. Friedler and C. Wilson (Eds.), Proceedings of Machine Learning Research, Vol. 81, New York, NY, USA, pp. 77–91. External Links: Cited by: §1, §4.3.
Three Naive Bayes Approaches for Discrimination-Free Classification. Data Mining and Knowledge Discovery 21 (2), pp. 277–292. Cited by: §2.
-  (2018) VGGFace2: A Dataset for Recognising Faces Across Pose and Age. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), Lille, France, pp. 67–74. Cited by: §3, §3, Figure 3.
-  (2019) Demographic Effects in Facial Recognition and Their Dependence on Image Acquisition: An Evaluation of Eleven Commercial Systems. IEEE Transactions on Biometrics, Behavior, and Identity Science 1 (1), pp. 32–41. Cited by: §1, §1.
-  (2019) DebFace: De-biasing Face Recognition. arXiv preprint arXiv:1911.08080. Cited by: §1.
-  (2018) Ongoing Face Recognition Vendor Test (FRVT) Part 2: Identification. NIST Internal Report, U.S. Department of Commerce, National Institute of Standards and Technology. External Links: Cited by: §1.
-  (2016) MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, pp. 87–102. Cited by: Figure 3.
-  (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778. Cited by: §3.
-  (2019) DemogPairs: Quantifying the Impact of Demographic Imbalance in Deep Face Recognition. In 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG), Lille, France. Cited by: §1.
-  (2016) The Megaface Benchmark: 1 Million Faces for Recognition at Scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882. Cited by: §3.2, Figure 3.
-  (2012) Face Recognition Performance: Role of Demographic Information. IEEE Transactions on Information Forensics and Security 7 (6), pp. 1789–1801. Cited by: §1, §1.
-  (2019-04) Discrimination in the Age of Algorithms. Journal of Legal Analysis 10, pp. 113–174. External Links: Cited by: §3.
-  (2011) Describable Visual Attributes for Face Verification and Image Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (10), pp. 1962–1977. Cited by: Figure 3.
-  (2016) Labeled Faces in the Wild: A Survey. In Advances in Face Detection and Facial Image Analysis, Kawulok,Michal, M. E. Celebi, and B. Smolka (Eds.), pp. 189–248. Cited by: §3.1, Figure 3.
-  (2019) An experimental Evaluation of Covariates Effects on Unconstrained Face Verification. IEEE Transactions on Biometrics, Behavior, and Identity Science 1 (1), pp. 42–55. Cited by: §1, §1.
-  (2018) IARPA Janus Benchmark-C: Face Dataset and Protocol. In International Conference on Biometrics (ICB), Gold Coast, Australia, pp. 158–165. Cited by: Figure 3.
-  (2019) Diversity in Faces. arXiv preprint arXiv:1901.10436. Cited by: §4.3.
-  (2019) SensitiveNets: Learning Agnostic Representations with Application to Face Recognition. arXiv preprint arXiv:1902.00334. Cited by: 2nd item, §3.2, Figure 3.
-  (2009) The Multiscenario Multienvironment Biosecure Multimodal Database (BMDB). IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1097–1111. Cited by: Figure 3.
-  (2015) Deep Face Recognition. In British Machine Vision Conference (BMVC), Cited by: §3, §3, Figure 3.
-  (2019) Machine Behaviour. Nature 568 (7753), pp. 477–486. External Links: Cited by: §1.
-  (2019) Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products. In AAAI/ACM Conf. on AI Ethics and Society (AIES), Cited by: §2.
-  (2018) Deep Learning for Understanding Faces: Machines May Be Just as Good, or Better, than Humans. IEEE Signal Processing Magazine 35 (1), pp. 66–83. Cited by: §3.
-  (2017) Grad-CAM: Visual Explanations from Deep Networks Via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §4.4.
-  (2015) The New Data and New Challenges in Multimedia Research. arXiv preprint arXiv:1503.01817. Cited by: §3.2.
-  (2019) . arXiv preprint arXiv:1911.10692. Cited by: §4.3, §5.
-  (2011-06) Face Recognition in Unconstrained Videos with Matched Background Similarity. In Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, pp. 529–534. External Links: Cited by: Figure 3.
-  (2015) From Facial Parts Responses to Face Detection: A Deep Learning Approach. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3676–3684. Cited by: Figure 3.
-  (2014) Learning Face Representation from Scratch. arXiv preprint arXiv:1411.7923. Cited by: Figure 3.
-  (2016) Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §3.
Age Progression/Regression by Conditional Adversarial Autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818. Cited by: Figure 3.