Artificial Intelligence (AI) is developed to meet human needs that can be represented in the form of objectives. To this end, the most popular machine learning algorithms are designed to minimize a loss function that defines the cost of wrong solutions over a pool of samples. This simple but very successful scheme has enhanced the performance of AI in many fields such as Computer Vision, Speech Technologies, and Natural Language Processing. However, optimizing specific computable objectives may not lead to the behavior one expects or desires from AI. International agencies, academia and industry are alerting policymakers and the public about unanticipated effects and behaviors of AI agents that were not considered during the design phases. In this context, aspects such as trustworthiness and fairness should be included as learning objectives and not taken for granted (see Fig. 1).
Machine vision in general and face recognition algorithms in particular are good examples of recent advances in AI [35, 2, 6, 40]. The performance of automatic face recognition has been boosted during the last decade, achieving very competitive accuracies in the most challenging scenarios. These improvements have been made possible by advances in machine learning (e.g., deep learning), powerful computation (e.g., GPUs), and larger databases (e.g., on a scale of millions of images). However, recognition accuracy is not the only aspect to be considered when designing biometric systems. Algorithms play an increasingly important role in the decision-making of several processes involving humans, and these decisions therefore have an increasing impact on our lives. Thus, there is a growing need to study AI behavior in order to better understand its impact on our society. Face recognition systems are especially sensitive due to the personal information present in face images (e.g., identity, gender, ethnicity, and age).
The objective of a face recognition algorithm is to recognize when two face images belong to the same person. For this purpose, deep neural networks are usually trained to minimize a cost function over a dataset. Like many other supervised learning processes, the training methods of these networks consist of an iterative process where input images must be associated with the output labels (e.g., identities). This learning by imitation is highly sensitive to the characteristics of the dataset. The literature has demonstrated that face recognition accuracy is affected by demographic covariates [10, 26, 1, 23]. This behavior is a consequence of biases introduced into the dataset and of cost functions focused exclusively on performance improvement. The number of published works pointing out the potential discriminatory effects in the results of face detection and recognition algorithms is large [21, 7, 1, 4, 10, 26, 18, 12, 23].
In this environment, only a limited number of works analyze how biases affect the learning process of algorithms dealing with personal information [42, 13]. There is a lack of understanding regarding how demographic information affects popular and widely used pre-trained AI models beyond the performance.
On the other hand, the right to non-discrimination is deeply rooted in the normative framework that underlies various national and international regulations, and can be found, for example, in Article 7 of the Universal Declaration of Human Rights and Article 14 of the European Convention on Human Rights, among others. As evidence of these concerns, in April 2016 the European Parliament adopted a set of laws aimed at regulating the collection, storage and use of personal information: the General Data Protection Regulation (GDPR, EU 2016/679; available online at https://gdpr-info.eu/), which became applicable in May 2018. According to paragraph 71 of GDPR, data controllers who process sensitive data have to “implement appropriate technical and organizational measures …” that “… prevent, inter alia, discriminatory effects”.
The aim of this work is to analyze face recognition models using a discrimination-aware perspective and to demonstrate that learning processes involving such discrimination-aware perspective can be used to train more accurate and fairer algorithms. The main contributions of this work are:
A general formulation of algorithmic discrimination for machine learning tasks. In this work, we apply this formulation in the context of face recognition.
A comprehensive analysis of causes and effects of biased learning processes including: (i) discrimination-aware performance analysis based on three public datasets, with 64K identities equally distributed across demographic groups; (ii) study of deep representations and the role of sensitive attributes such as gender and ethnicity; (iii) complete analysis of demographic diversity present in some of the most popular face databases, and analysis of new databases available to train models based on diversity.
Based on our analysis of the causes and effects of biased learning algorithms, we propose an efficient discrimination-aware learning method to mitigate bias in deep face recognition models: SensitiveLoss. The method is based on the inclusion of demographic information in the popular triplet loss representation learning. SensitiveLoss incorporates fairness as a learning objective in the training process of the algorithm. The method works as an add-on to be applied over pre-trained representations and allows improving its performance and fairness without a complete re-training. We evaluate the method in three public databases showing an improvement in both overall accuracy and fairness. Our results show how to incorporate discrimination-aware learning rules to significantly reduce bias in deep learning models.
Preliminary work in this research line was presented previously. Key improvements here over that preliminary work include: (i) in-depth analysis of the state-of-the-art, including an extensive survey of face recognition databases; (ii) inclusion of two new datasets in the experiments involving 40,000 new identities and more than 1M images; and (iii) a novel discrimination-aware learning method called SensitiveLoss.
The rest of the paper is structured as follows: Section 2 summarizes the related works. Section 3 presents our general formulation of algorithmic discrimination. Section 4 presents the face recognition architectures used in this work. Section 5 evaluates the causes and effects of biased learning in face recognition algorithms. Section 6 presents the proposed discrimination-aware learning method. Section 7 presents the experimental results. Finally, Section 8 summarizes the main conclusions.
2 Related Work
2.1 Bias in face recognition
Facial recognition systems can suffer from various biases, ranging from those derived from variables of unconstrained environments like illumination, pose, expression and resolution of the face, through systematic errors such as image quality, to demographic factors of age, gender and race.
An FBI-coauthored study tested three commercial algorithms from companies that supply various public organizations in the US. In all three algorithms, African Americans were less likely to be successfully identified (i.e., more likely to be falsely rejected) than other demographic groups. A similar decline surfaced for females compared to males and for younger subjects compared to older subjects.
More recently, the latest NIST evaluation of commercial face recognition technology, the Face Recognition Vendor Test (FRVT) Ongoing, shows that, at sensitivity thresholds tuned to a fixed false match rate for white men, all but two of the evaluated algorithms were more than twice as likely to misidentify black women, and some by a considerably larger factor. The number of academic studies analyzing the fairness of face recognition algorithms has grown in recent years.
2.2 De-biasing face recognition
Some works attempt to eliminate bias in face recognition with so-called unlearning, which improves the results but at the cost of losing recognition accuracy. Das et al. proposed a Multi-Task CNN that also improved performance across subgroups of gender, race, and age. Finally, an extension of the triplet loss function has been developed to remove sensitive information from feature embeddings without losing performance in the main task.
In [42], researchers proposed a race-balanced reinforcement learning network to adaptively find appropriate margin losses for the different demographic groups. Their model significantly reduced the performance difference between demographic groups. In [13], an adversarial network disentangles the feature representations of gender, age, race and face recognition and minimizes their correlation. Both methods [42, 13] were applied to train de-biasing deep architectures for face recognition from scratch.
3 Formulating Algorithmic Discrimination
Discrimination is defined by the Cambridge Dictionary as treating a person or particular group of people differently, especially in a worse way than the way in which you treat other people, because of their skin color, sex, sexuality, etc.
For the purpose of studying discrimination in artificial intelligence at large, we now formulate Algorithmic Discrimination mathematically, based on the above dictionary definition. Even though ideas similar to those included in our formulation can be found elsewhere [8, 34], we did not find this kind of formulation in related works. We hope that the formalization of these concepts can be beneficial in fostering further research and discussion on this hot topic.
Let us begin with notation and preliminary definitions. Assume $\mathbf{x}_s^i$ is a learned representation of individual $i$ (out of $I$ different individuals) corresponding to an input image $\mathbf{I}_s^i$ ($s = 1, \ldots, S$ samples per individual). That representation $\mathbf{x}$ is assumed to be useful for task $\mathcal{T}$, e.g., face authentication or emotion recognition. That representation $\mathbf{x}$ is generated from the input image $\mathbf{I}$ using an artificial intelligence approach with parameters $\mathbf{w}$. We also assume that there is a goodness criterion $G$ in that task that maximizes some real-valued performance function $f$ in a given dataset $\mathcal{D}$ (collection of multiple images) in the form:

$$G(\mathcal{D}) = \max_{\mathbf{w}} f(\mathcal{D}, \mathbf{w}) \qquad (1)$$

The most popular form of the previous expression minimizes a loss function $\mathcal{L}$ over a set of training images $\mathcal{D}$ in the form:

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_{\mathbf{I}_s^i \in \mathcal{D}} \mathcal{L}\big(O(\mathbf{I}_s^i \mid \mathbf{w}),\, T_s^i\big) \qquad (2)$$

where $O$ is the output of the learning algorithm that we seek to bring closer to the target function (or groundtruth) $T$ defined by the task at hand. On the other hand, the $I$ individuals can be classified according to $D$ demographic criteria $C_d$, with $d = 1, \ldots, D$, which can be the source for discrimination, e.g., $C_1 = \{\text{Male}, \text{Female}\}$ (the demographic criterion Gender has two classes in this example). The particular class $k \in C_d$ for a given demographic criterion $d$ and a given sample is noted as $C_d(\mathbf{x}_s^i)$, e.g., $C_1(\mathbf{x}_s^i) = \text{Male}$. We assume that all classes are well represented in dataset $\mathcal{D}$, i.e., the number of samples for each class in all criteria in $\mathcal{D}$ is significant. $\mathcal{D}_d^k \subset \mathcal{D}$ represents all the samples corresponding to class $k$ of demographic criterion $d$.
Finally, our definition of Algorithmic Discrimination:
An algorithm discriminates the group of people represented with class $k$ (e.g., Female) when performing the task $\mathcal{T}$ (e.g., face verification), if the goodness $G$ in that task when considering the full set of data $\mathcal{D}$ (including multiple samples from multiple individuals) is significantly larger than the goodness in the subset of data $\mathcal{D}_d^k$ corresponding to class $k$ of the demographic criterion $d$.
The representation x and the model parameters w will typically be real-valued vectors, but they can be any set of features combining real and discrete values. Note that the previous formulation can be easily extended to the case of a varying number of samples for different subjects, which is a usual case, or to classes that are not disjoint. Note also that the previous formulation is based on average performances over groups of individuals. In many artificial intelligence tasks it is common to have different performance between specific individuals for various reasons (e.g., specific users who were not sensed properly), even in the case of algorithms that, on average, may have similar performance for the different classes that are the source of discrimination. Therefore, in our formulation and definition of Algorithmic Discrimination we opted to use average performances in demographic groups.
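The group-based definition above can be illustrated with a minimal sketch. It assumes that the goodness G is measured as mean accuracy and that "significantly larger" is approximated by a fixed margin; both choices, as well as the names `goodness`, `discriminated_classes` and the toy data, are illustrative assumptions and not part of the formulation itself.

```python
from statistics import mean


def goodness(samples):
    """Goodness G as mean accuracy over a set of samples (assumed metric)."""
    return mean(1.0 if s["correct"] else 0.0 for s in samples)


def discriminated_classes(samples, criterion, margin=0.05):
    """Return the overall goodness and the classes k of a demographic
    criterion whose goodness is lower than the overall one by more than
    `margin` (a hypothetical stand-in for a significance test)."""
    overall = goodness(samples)
    flagged = {}
    for k in sorted({s[criterion] for s in samples}):
        subset = [s for s in samples if s[criterion] == k]
        g_k = goodness(subset)
        if overall - g_k > margin:
            flagged[k] = g_k
    return overall, flagged


# Toy decisions (invented for illustration): 100 samples per class.
samples = (
    [{"gender": "Male", "correct": True}] * 95
    + [{"gender": "Male", "correct": False}] * 5
    + [{"gender": "Female", "correct": True}] * 80
    + [{"gender": "Female", "correct": False}] * 20
)

overall, flagged = discriminated_classes(samples, "gender")
print(f"overall goodness: {overall:.3f}")
print(f"classes flagged as discriminated: {flagged}")
```

In practice a statistical test over per-group performance would be a more defensible reading of "significantly larger" than a fixed margin.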
Other related works are now starting to investigate discrimination effects in AI with user-specific methods, e.g. [5, 48], but they are still lacking a mathematical framework with clear definitions of User-specific Algorithmic Discrimination (U-AD), in comparison to our defined Group-based Algorithmic Discrimination (G-AD). We will study and augment our framework with an analysis of U-AD in future work.
4 Face recognition: methods
A face recognition algorithm, like other machine learning systems, can be divided into two different algorithms: screener and trainer. The two algorithms serve different purposes.
The screener (see Fig. 2) is an algorithm that, given two face images, generates an output associated with the probability that they belong to the same person. This probability is obtained by comparing the two learned representations obtained from a face model defined by the parameters w. These parameters are trained previously by the trainer based on a training dataset and the goodness criterion G (see Fig. 2). If trained properly, the output of the trainer would be a model with parameters capable of representing the input data (e.g., face images) in a highly discriminant feature space x.
The most popular architecture used to model face attributes is the Convolutional Neural Network (CNN). This type of network has drastically reduced the error rates of face recognition algorithms in the last decade by learning highly discriminative features from large-scale databases. In our experiments we consider two popular face recognition pre-trained models: VGG-Face and ResNet-50. These models have been tested on competitive evaluations and public benchmarks [32, 9].
VGG-Face is a model based on the VGG-Very-Deep-16 CNN architecture trained on the VGGFace dataset. ResNet-50 is a CNN model with 50 layers and 41M parameters initially proposed for general purpose image recognition tasks. The main difference between the ResNet architecture and traditional convolutional neural networks is the inclusion of residual connections to allow information to skip layers and improve gradient flow.
Before applying the face models, we cropped the face images using a previously published algorithm. The pre-trained models are used as embedding extractors, where $\mathbf{x}$ is an $\ell_2$-normalised learned representation of a face image. The similarity between two face descriptors $\mathbf{x}_r$ and $\mathbf{x}_s$ is calculated as the Euclidean distance $\lVert \mathbf{x}_r - \mathbf{x}_s \rVert$. Two faces are assigned to the same identity if their distance is smaller than a threshold $\tau$. The recognition accuracy is obtained by comparing distances between positive matches (i.e., $\mathbf{x}_r$ and $\mathbf{x}_s$ belong to the same person) and negative matches (i.e., $\mathbf{x}_r$ and $\mathbf{x}_s$ belong to different persons).
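The verification rule described above can be sketched as follows. The three-dimensional embeddings, the function names and the threshold value are hypothetical; real descriptors from VGG-Face or ResNet-50 have many more dimensions, and the threshold would be tuned on data.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def euclidean_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_identity(emb_a, emb_b, threshold):
    """Decide whether two embeddings belong to the same person.

    Returns the decision and the distance between the normalised embeddings.
    """
    d = euclidean_distance(l2_normalise(emb_a), l2_normalise(emb_b))
    return d < threshold, d


# Toy embeddings (invented): two images of one person and one of another.
anchor = [0.9, 0.1, 0.2]
positive = [0.8, 0.2, 0.2]
negative = [0.1, 0.9, 0.3]
TAU = 0.5  # hypothetical decision threshold

match_pos, d_pos = same_identity(anchor, positive, TAU)
match_neg, d_neg = same_identity(anchor, negative, TAU)
print(f"positive pair: distance={d_pos:.3f}, same identity={match_pos}")
print(f"negative pair: distance={d_neg:.3f}, same identity={match_neg}")
```

Because the embeddings are normalised to unit length, the distance always lies between 0 and 2, which makes a single global threshold meaningful across image pairs.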
The two face models considered in our experiments were trained with the VGGFace2 dataset following previously published training details. As shown in Section 5.1, the databases used to train these two models are highly biased. Therefore, it is expected that other recognition models trained with this dataset will also present algorithmic discrimination.
| Dataset [ref] | # images | # identities | # avg. images per identity | Group 1 Male | Group 1 Female | Group 2 Male | Group 2 Female | Group 3 Male | Group 3 Female |
| Databases for discrimination-aware learning |
Table I: Demographic statistics of state-of-the-art face databases (ordered by number of images). In order to obtain demographic statistics, gender and ethnicity classification algorithms were trained based on a ResNet-50 model and 12K identities of the DiveFace database (equally distributed between the six demographic groups). Models were evaluated on 20K labeled images of Celeb-A with performance over 97%. The table includes the averaged demographic statistics for the most popular face databases in the literature.
5 Causes and Effects of Biased Learning in Face Recognition Algorithms
5.1 Bias in face databases
Bias and discrimination concepts are related to each other, but they are not necessarily the same thing. Bias is traditionally associated with unequal representation of classes in a dataset. The history of automatic face recognition has been linked to the history of the databases used for algorithm training during the last two decades. The number of publicly available databases is high, and they allow the training of models using millions of face images. Table I summarizes the demographic statistics of some of the most frequently cited face databases. Each of these databases is characterized by its own biases (e.g. image quality, pose, backgrounds, and aging). In this work, we highlight the unequal representation of demographic information in very popular face recognition databases. As can be seen, the differences between ethnic groups are serious. Even though the people in ethnic Group 3 (Asian) are more than 35% of the world’s population, they represent only 9% of the contents in those popular face recognition databases.
Biased databases imply a double penalty for underrepresented classes. On the one hand, models are trained according to non-representative diversity. On the other hand, accuracies are measured on privileged classes and overestimate the real performance over a diverse society.
5.2 Databases for discrimination-aware learning
Recently, diverse and discrimination-aware databases have been proposed in [7, 28, 42]. These databases are valuable resources for exploring how diversity can be used to improve face biometrics. However, some of these databases do not include identities [7, 28], so face images cannot be matched to other images. Therefore, these databases cannot be used to properly train or test face recognition algorithms. In our experiments we used three different public databases.
DiveFace contains annotations equally distributed among six classes related to gender and ethnicity. There are 24K identities (4K per class) and 3 images per identity, for a total of 72K images. Users are grouped according to their gender (male or female) and three categories related to ethnic physical characteristics: Group 1: people with ancestral origins in Europe, North-America, and Latin-America (with European origin). Group 2: people with ancestral origins in Sub-Saharan Africa, India, Bangladesh, Bhutan, among others. Group 3: people with ancestral origin in Japan, China, Korea, and other countries in that region.
Races Face in the Wild (RFW) is divided into four demographic classes: Caucasian, Asian, Indian and African. Each class has about 10K images of 3K individuals. There are no major differences in pose, age and gender distribution between the Caucasian, Asian and Indian groups. The African set has a smaller age difference than the others, and while in the other groups women represent about 35%, in Africans they represent less than 10%.
BUPT-Balancedface (BUPT-B) contains 1.3M images from 28K celebrities obtained from MS-Celeb-1M. Divided into 4 demographic groups, it is roughly balanced by race with 7K subjects per race: Caucasian, Asian, Indian, and African, with 326K, 325K, 275K and 324K images respectively. No gender data is available for this dataset.
Note that the groups included in all three databases are heterogeneous and they include people of different ethnicities. We are aware of the limitations of grouping all human ethnic origins into only four categories. According to studies, there are more than 5,000 ethnic groups in the world. Our experiments are similar to those reported in the literature, and include only four groups in order to maximize differences between classes. Automatic classification algorithms based on these reduced categories show performances of up to 98% accuracy.
Algorithmic Discrimination implications: classes are unequally represented in the most popular face databases. New databases and benchmarks are needed to train more diverse and heterogeneous algorithms. Evaluation over representative populations from different demographic groups is important to prevent discriminatory effects.
5.3 Biased embedding space of deep models
We now analyze the effects of ethnicity and gender attributes in the embedding space generated by the VGG-Face and ResNet-50 models. CNNs are composed of a large number of stacked filters. These filters are trained to extract the richest information for a pre-defined task (e.g., face recognition in VGG-Face and ResNet-50). As face recognition models are trained to identify individuals, it is reasonable to think that the response of the models can vary slightly from one person to another. In order to visualize the response of the model to different faces, we consider a specific Class Activation Map (CAM) technique named Grad-CAM. This visualization technique uses the gradients of a target flowing into the selected convolutional layer to produce a coarse localization map. The resulting heatmap highlights the activated regions in the image for the selected target (e.g., an individual identity in our case).
Fig. 3 represents the heatmaps obtained with the ResNet-50 model for faces from different demographic groups. Additionally, we include the heatmap obtained with ResNet-50 after averaging results from 120 different individuals from the six demographic groups included in DiveFace (last column). The activation maps show clear differences between ethnic groups, with the highest activation for Group 1 and the lowest for Group 3. These differences suggest that features extracted by the model are, at least partially, affected by the ethnic attributes. The activation maps obtained with the VGG-Face model are similar to those of ResNet-50.
On a different front, we applied a popular data visualization algorithm to better understand the importance of ethnic features in the embedding space generated by deep models. t-SNE is an algorithm to visualize high-dimensional data. This algorithm minimizes the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. Fig. 4 shows the projection of each face into a 2D space generated from ResNet-50 embeddings and the t-SNE algorithm. This t-SNE projection is unsupervised and uses only the face embeddings as input, without any labels. After running t-SNE, we colored each projected point according to its ethnic attribute. As we can see, the resulting face representation forms three clusters highly correlated with the ethnicity attributes. Note that ResNet-50 has been trained for face recognition, not ethnicity detection. However, the gender and ethnicity information is highly embedded in the feature space and the unsupervised t-SNE algorithm reveals the presence of this information.
These two experiments illustrate the presence and importance of ethnic attributes in the feature space generated by face deep models.
Algorithmic Discrimination implications: popular deep models trained for task T on biased databases (i.e., unequally represented classes for a given demographic criterion such as gender) result in feature spaces (corresponding to the solution of Eq. 2) that introduce strong differentiation between classes. This differentiation affects the representation x and enables classifying between classes using x, even though x was trained for solving a different task T.
6 Discrimination-Aware Learning with SensitiveLoss
As we have discussed, models trained and evaluated over privileged demographic groups may fail to generalize when the model is evaluated over groups different to the privileged one. This is a behavior caused by the wrong assumption of homogeneity in face characteristics of the world population. In this work we propose to reduce the bias in face recognition models incorporating a discrimination-aware learning process.
As shown in previous sections, the main causes of biased results are: i) biased databases not representative of a heterogeneous population; and ii) a learning process guided by loss functions focused exclusively on the improvement of the overall performance.
The methods proposed in this work to reduce bias are based on two strategies:
A modified loss function (SensitiveLoss) that incorporates demographic information to guide the learning process into a more inclusive feature space. The development of new cost functions capable of incorporating discrimination-aware elements into the training process is another way to reduce bias. Our approach is based on the popular triplet loss function and it can be applied to pre-trained models without needing the full re-training of the network.
6.1 SensitiveLoss: de-biasing cost function
Triplet loss was proposed as a distance metric in the context of nearest neighbor classification and adapted to improve the performance of face descriptors in verification algorithms [32, 9]. In this work we propose to incorporate demographic data to generate discrimination-aware triplets to train a new representation that mitigates biased learning.
Assume that an image is represented by an embedding descriptor $\mathbf{x}$ obtained by a pre-trained model (see Section 3 for notation). That image corresponds to a demographic group. A triplet is composed of three different images of two different people: Anchor ($A$) and Positive ($P$) are different images of the same person, and Negative ($N$) is an image of a different person. The Anchor and Positive share the same demographic labels, but these labels may differ for the Negative sample. The transformation $\varphi(\mathbf{x})$ represented by parameters $\mathbf{w}_D$ ($D$ for De-biasing) is trained to minimize the loss function:

$$\min_{\mathbf{w}_D} \sum_{(\mathbf{x}_A, \mathbf{x}_P, \mathbf{x}_N) \in \mathcal{S}} \max\big(0,\; \lVert \varphi(\mathbf{x}_A) - \varphi(\mathbf{x}_P) \rVert^2 - \lVert \varphi(\mathbf{x}_A) - \varphi(\mathbf{x}_N) \rVert^2 + \Delta\big) \qquad (3)$$

where $\lVert \cdot \rVert$ is the Euclidean Distance, $\Delta$ is a margin between genuine and impostor distances, and $\mathcal{S}$ is a set of triplets generated by an online sensitive triplet generator that guides the learning process (see details in Section 6.2). As shown in previous sections, the effects of biased training include a representation that fails to model properly the distance between faces from different people belonging to the same minority demographic group (e.g., Asian Females). The proposed triplet loss function considers both genuine and impostor comparisons and also allows demographic-aware information to be introduced. In order to guide the learning process in that discrimination-aware spirit, triplets from the demographic groups with the worst performance are prioritized in the online sensitive triplet generator (e.g., those for Asian Females). Fig. 5 shows the block diagram of the learning algorithm.
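As a minimal sketch of the loss described above, the following computes a triplet loss over a small set of triplets. It assumes the common hinge form on squared Euclidean distances, which may differ in detail from the exact expression used in this work; the optional `transform` argument is a hypothetical placeholder for the learned de-biasing transformation, and all numeric values are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(triplets, margin, transform=lambda x: x):
    """Sum of hinge triplet losses over (anchor, positive, negative) tuples.

    A triplet contributes only when the impostor distance does not exceed
    the genuine distance by at least `margin`.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        a, p, n = transform(anchor), transform(positive), transform(negative)
        total += max(0.0, sq_dist(a, p) - sq_dist(a, n) + margin)
    return total


# Toy 2-D embeddings (invented for illustration).
easy = ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])  # negative far from anchor
hard = ([0.0, 0.0], [0.5, 0.0], [0.6, 0.0])  # negative almost as close as positive

loss_easy = triplet_loss([easy], margin=0.2)
loss_hard = triplet_loss([hard], margin=0.2)
print(f"easy triplet loss: {loss_easy:.3f}")
print(f"hard triplet loss: {loss_hard:.3f}")
```

Only the hard triplet produces a non-zero loss, which is why the choice of triplets fed to the loss matters as much as the loss itself.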
6.2 SensitiveLoss: sensitive triplets
Inspired by the semi-hard selection proposed in [32, 9], we propose an online selection of triplets that prioritizes the triplets from demographic groups with lower performances (see Fig. 5). On the one hand, triplets within the same demographic group improve the ability to discriminate between samples with similar anthropometric characteristics (e.g., reducing the false acceptance rate in Asian Females). On the other hand, heterogeneous triplets (i.e., triplets involving different demographic groups) improve the generalization capacity of the model (i.e., the overall accuracy).
During the training process we distinguish between generation and selection of triplets:
Triplet Generation: this is where the triplets are formed and joined to compose a training batch. In our experiments, each batch is generated randomly with images from different identities equally distributed among the different demographic groups. We propose two types of triplet generation (see Fig. 5):
Unrestricted (U): the generator allows triplets with mixed demographic groups (i.e., the Negative may belong to the same demographic group as the Anchor or to a different one). Thus, with 300 identities, around 135K triplets are generated (from which the semi-hard ones will be selected).
Restricted (R): the generator does not allow triplets with mixed demographic groups (i.e., the Negative always belongs to the same demographic group as the Anchor). Thus, with 300 identities, more than 22K triplets are generated (from which the semi-hard ones will be selected).
Triplet Selection: Triplet selection is done online during the training process for efficiency. Among all the triplets in the generated batches, the online selection chooses those for which $\lVert \mathbf{x}_A - \mathbf{x}_N \rVert^2 - \lVert \mathbf{x}_A - \mathbf{x}_P \rVert^2 < \Delta$, with $\Delta$ the margin (i.e., the genuine distance is higher than the impostor distance up to the margin: a difficult triplet). If a demographic group is not well modeled by the network (in terms of either genuine or impostor comparisons), more triplets from this group are likely to be included in the online selection. This selection is purely guided by performance over each demographic group and could change for each batch depending on model deficiencies.
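The generation and selection steps above can be sketched as follows. The batch, the group labels and the selection rule (impostor distance not exceeding the genuine distance by at least the margin, on squared Euclidean distances) are illustrative assumptions; the function names are hypothetical and the exhaustive enumeration is only practical for small batches.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def generate_triplets(images, restricted):
    """Form (anchor, positive, negative) triplets from a batch.

    Anchor and positive are different images of the same identity. With
    restricted=True the negative must share the anchor's demographic group.
    """
    triplets = []
    for a, p in permutations(images, 2):
        if a["id"] != p["id"]:
            continue
        for n in images:
            if n["id"] == a["id"]:
                continue
            if restricted and n["group"] != a["group"]:
                continue
            triplets.append((a, p, n))
    return triplets


def select_difficult(triplets, margin):
    """Keep triplets whose impostor distance is not larger than the
    genuine distance by at least `margin` (assumed selection rule)."""
    return [
        (a, p, n)
        for a, p, n in triplets
        if sq_dist(a["emb"], n["emb"]) - sq_dist(a["emb"], p["emb"]) < margin
    ]


# Toy batch (invented): identities 2 and 3 of group "G3" are poorly separated.
batch = [
    {"id": 0, "group": "G1", "emb": [0.0, 0.0]},
    {"id": 0, "group": "G1", "emb": [0.1, 0.0]},
    {"id": 1, "group": "G1", "emb": [1.0, 0.0]},
    {"id": 1, "group": "G1", "emb": [1.1, 0.0]},
    {"id": 2, "group": "G3", "emb": [0.0, 1.0]},
    {"id": 2, "group": "G3", "emb": [0.05, 1.0]},
    {"id": 3, "group": "G3", "emb": [0.1, 1.0]},
    {"id": 3, "group": "G3", "emb": [0.15, 1.0]},
]

unrestricted = generate_triplets(batch, restricted=False)
restricted = generate_triplets(batch, restricted=True)
selected = select_difficult(restricted, margin=0.2)
print(len(unrestricted), len(restricted), len(selected))
print({a["group"] for a, _, _ in selected})
```

In this toy batch every selected triplet comes from the poorly separated group, which mirrors how the online selection concentrates training on the groups the model handles worst.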
We chose triplet loss as the basis for SensitiveLoss because it allows us to incorporate demographic-aware learning in a natural way. The process is data driven and does not require a large number of images per identity (e.g., while softmax requires a large number of samples per identity, we only use three images per identity). Another advantage is that it is not necessary to train the entire network, and triplet loss can be applied as a domain adaptation technique. In our case, we trained the model to move from a biased domain x to an unbiased domain (the transformed representation). Our results demonstrate that biased representations x that exhibit clear performance differences contain the information necessary to reduce such differences. In other words, bias can be at least partially corrected from representations obtained from pre-trained networks, and new models trained from scratch are not necessary. Similar strategies might be applied to other loss functions.
| Model | Group 1 Male | Group 1 Female | Group 2 Male | Group 2 Female | Group 3 Male | Group 3 Female | Avg | Std | SER |
| VGG-Face-U | 1.59 | 1.79 | 1.91 | 2.04 | 2.12 | 2.04 | 1.91 (23%) | 0.18 (61%) | 1.34 (61%) |
| VGG-Face-R | 1.39 | 1.60 | 1.72 | 1.98 | 2.07 | 1.79 | 1.76 (29%) | 0.23 (52%) | 1.49 (43%) |
| ResNet-50-U | 0.64 | 0.64 | 0.83 | 1.15 | 1.20 | 0.72 | 0.86 (17%) | 0.23 (21%) | 1.86 (26%) |
| ResNet-50-R | 0.64 | 0.58 | 0.89 | 1.14 | 1.19 | 0.67 | 0.85 (18%) | 0.24 (16%) | 2.06 (9%) |

| Model | | | | | Avg | Std | SER |
| VGG-Face-U | 7.67 | 8.94 | 12.97 | 10.54 | 10.03 (21%) | 1.98 (43%) | 1.69 (39%) |
| VGG-Face-R | 7.51 | 8.31 | 12.61 | 10.06 | 9.62 (24%) | 1.96 (44%) | 1.68 (40%) |
| ResNet-50-U | 2.72 | 3.20 | 3.78 | 3.66 | 3.34 (30%) | 0.42 (54%) | 1.39 (42%) |
| ResNet-50-R | 2.89 | 3.30 | 4.01 | 3.77 | 3.49 (27%) | 0.44 (52%) | 1.39 (42%) |

| Model | | | | | Avg | Std | SER |
| VGG-Face-U | 6.22 | 5.54 | 7.23 | 8.90 | 6.97 (25%) | 1.27 (51%) | 1.61 (41%) |
| VGG-Face-R | 6.24 | 5.29 | 6.94 | 8.93 | 6.85 (27%) | 1.33 (48%) | 1.69 (34%) |
| ResNet-50-U | 2.61 | 1.76 | 2.41 | 3.60 | 2.59 (33%) | 0.66 (45%) | 2.04 (1%) |
| ResNet-50-R | 2.58 | 1.75 | 2.49 | 3.66 | 2.62 (32%) | 0.68 (44%) | 2.09 (3%) |

Table II: EER per demographic group for matchers VGG-Face and ResNet-50 without and with our de-biasing SensitiveLoss module (U = Unrestricted Triplet Generation; R = Restricted Triplet Generation). Also shown: Average EER across demographic groups (Avg), Standard deviation (Std, lower means fairer), and Skewed Error Ratio (SER, 1 is fairest).
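The three summary columns can be recomputed from the per-group EERs. The sketch below assumes that the standard deviation is the population standard deviation and that the Skewed Error Ratio is the ratio between the highest and lowest group EER; both are assumptions that appear consistent with the reported values, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def fairness_summary(eers):
    """Summarise per-group EERs.

    avg: mean EER across demographic groups.
    std: population standard deviation (assumed; lower means fairer).
    ser: highest EER divided by lowest EER (assumed; 1 is fairest).
    """
    return {
        "avg": mean(eers),
        "std": pstdev(eers),
        "ser": max(eers) / min(eers),
    }


# Per-group EERs of the VGG-Face-U row with six demographic groups.
vgg_face_u = [1.59, 1.79, 1.91, 2.04, 2.12, 2.04]

summary = fairness_summary(vgg_face_u)
print({name: round(value, 3) for name, value in summary.items()})
```

For this row the recomputed average and standard deviation agree with the table to two decimals, while the ratio comes out near 1.33 against a reported 1.34, a gap that could be explained by the table being computed from unrounded EERs.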
7 Experimental Results
7.1 Performance of face recognition and role of demographic information
This section explores the effects of biased models on the performance of face recognition algorithms. The experiments are carried out with k-fold cross-validation across users and three images per identity. Thus, in every fold the three databases are divided into a training set and a test set, resulting in thousands of genuine comparisons and millions of impostor comparisons across DiveFace, RFW and BUPT.
Table II shows the performance obtained for each demographic group present in all three databases. In this section we focus on the results obtained by the Baseline systems (denoted as VGG-Face and ResNet-50). The different performances obtained for similar demographic groups in the three databases are caused by the different characteristics of each database (e.g., the African set has a smaller age difference than the others in RFW). The results reported in Table II exhibit large gaps between the performances obtained by the different demographic groups, suggesting that both gender and ethnicity significantly affect the performance of biased models. These effects are particularly high for ethnicity, with a very large degradation in performance for the class less represented in the training data. For DiveFace, this degradation produces a relative increment of the Equal Error Rate (EER) for both VGG-Face and ResNet-50 with regard to the best class (Group 1 Male). For RFW and BUPT-Balancedface the differences between demographic groups are similar to those obtained with DiveFace.
These differences are important as they mark the percentage of faces successfully matched and faces incorrectly matched for a certain threshold. These results indicate that ethnicity can greatly affect the chances of being mismatched (false positives).
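As an illustration of how a threshold turns scores into matched and mismatched faces, the following sketch estimates the EER from lists of genuine and impostor distance scores. It assumes that a pair is accepted when its distance is at or below the threshold, and it approximates the EER by scanning the observed scores as candidate thresholds. The function names and toy scores are hypothetical and are not taken from the experiments.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) for distance scores; a pair is accepted if distance <= threshold."""
    far = sum(d <= threshold for d in impostor) / len(impostor)  # false positives
    frr = sum(d > threshold for d in genuine) / len(genuine)     # false negatives
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by picking the observed score where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]  # (EER estimate, threshold)


# Toy distance scores (hypothetical values).
genuine_scores = [0.2, 0.3, 0.4, 0.6]
impostor_scores = [0.5, 0.7, 0.8, 0.9]

eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

Running the same function separately on the scores of each demographic group would give per-group EER values of the kind reported in Table II, under the assumptions stated above.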
The relatively low performance in some groups seems to originate from a limited ability to capture the best discriminant features for the samples underrepresented in the training databases. ResNet-50 seems to learn better discriminant features as it performs better than VGG-Face. Additionally, ResNet-50 shows smaller differences between demographic groups. The results suggest that features capable of reaching high accuracy for a specific demographic group may be less competitive in others. Let us now analyze the causes behind this degradation. Fig. 6 represents the probability distributions of genuine and impostor distance scores for all demographic groups. A comparison between genuine and impostor distributions reveals large differences for impostors. The genuine distribution (intra-class variability) between groups is similar, but the impostor distribution (inter-class variability) is significantly different. The baseline models behave differently between demographic groups when comparing face features from different people.
Algorithmic Discrimination implications: define the performance function as the accuracy of the face recognition model, and the goodness considering all the samples corresponding to class of the demographic criterion , for an algorithm trained on the full set of data (as described in Eq. 2). Results suggest large differences between the goodness for different classes, especially between the classes and Asian.
7.2 Performance of SensitiveLoss
The proposed de-biasing method SensitiveLoss does not require retraining the entire pre-trained models (see Fig. 5). The sensitive triplets are used to train a dense layer with the following characteristics: number of units equal to the size of the pre-trained representation x (, and , units for VGG-Face and ResNet-50 respectively), dropout (of 0.5), linear activation, random initialization, and normalization. This layer is relatively easy to train (10 epochs and Adam optimizer) and will be used to generate the new representation.
Table II shows the performance (Equal Error Rate, EER, in %) for each demographic group as well as the average EER on the DiveFace, RFW and BUPT test sets for the baseline models (VGG-Face and ResNet-50), and the SensitiveLoss methods described in Section 6 (Unrestricted and Restricted). In order to measure fairness, Table II includes the Standard deviation of the EER across demographic groups (Std) and the Skewed Error Ratio (SER). These measures were proposed in [42] to analyze the performance of de-biasing algorithms. The SER is calculated as the highest EER divided by the lowest EER across demographic groups.
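The fairness measures can be sketched in a few lines. The example below uses the four per-group EER values of the ResNet-50-R row shown in Table II; the group labels and the function name are hypothetical placeholders. Using the population standard deviation is an assumption, although it reproduces the Std tabulated for that row.

```python
from statistics import mean, pstdev


def fairness_summary(eer_by_group):
    """Average EER, Std across groups, and Skewed Error Ratio (highest / lowest EER)."""
    values = list(eer_by_group.values())
    return {
        "avg": mean(values),
        "std": pstdev(values),            # population std (assumption)
        "ser": max(values) / min(values)  # 1 is fairest
    }


# Per-group EER (%) from the ResNet-50-R row of Table II; labels are placeholders.
resnet50_r = {"group_a": 2.58, "group_b": 1.75, "group_c": 2.49, "group_d": 3.66}

summary = fairness_summary(resnet50_r)
rounded = {name: round(value, 2) for name, value in summary.items()}
# rounded == {"avg": 2.62, "std": 0.68, "ser": 2.09}
```

Because the SER depends only on the best and the worst group, a large gain for the best class can raise the SER even when the Std falls.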
The results obtained by SensitiveLoss outperform the baseline approaches by:
Improving the fairness metrics (Std and SER) with lower standard deviation in performance across demographic groups. Fairness improvements in terms of EER Std vary by model and database ranging from to relative improvements with an average improvement of . The SER is also reduced by a similar percentage except in the ResNet-50 model evaluated on BUPT-Balancedface. In this particular case, the standard deviation is clearly improved by but the SER is penalized by the large improvement obtained for the best class.
Reducing the Average EER in the three databases. The results show that discrimination-aware learning not only helps to train fairer representations but also more accurate ones. Our SensitiveLoss discrimination-aware learning results in better representations for specific demographic groups and collectively for all groups.
Concerning the triplet generation method (Unrestricted or Restricted, see Section 6.2), both methods show competitive performances with similar improvements over the baseline approaches. The higher number of triplets generated by the Unrestricted method (about times more) does not show clear improvements compared to the Restricted method.
Fig. 6 shows the score distributions obtained for the ResNet-50 model without and with our SensitiveLoss de-biasing method (with Unrestricted sensitive triplet generation). Table II showed performances for specific decision thresholds (at the EER) for face verification. Fig. 6 provides richer information without fixing the decision thresholds. In comparison to the baseline x, we see that the improvements in Accuracy and Fairness caused by our SensitiveLoss discrimination-aware representation mainly come from better alignment of impostor score distributions across demographic groups. These results suggest how the proposed SensitiveLoss learning method was able to correct the biased behavior of the baseline model.
|RL-RBN (arc) ||()||(39%)||()|
Training with RFW; Training with BUPT-Balancedface.
7.3 Comparison with the state-of-the-art
Table III shows the comparison of our approach with two recent state-of-the-art de-biasing techniques [42, 13]. These two methods consist of full networks trained specifically to avoid bias, whereas what we propose here with SensitiveLoss is not an entire network, but rather an add-on method to reduce the biased outcome of a given network.
The results of this comparison should be interpreted with care, because the networks compared are not the same. Nevertheless, the comparison gives a rough idea of the ranges of bias mitigation achieved by the three methods.
From Table III it can be seen that our approach has a performance at least comparable to that of dedicated networks trained from scratch to produce unbiased models. The simplicity of our SensitiveLoss discrimination-aware learning makes it suitable as an add-on for different networks and methods.
Table III also shows the performance of the proposed de-biasing method when training SensitiveLoss with the same or a different database. Note that users employed for training and testing are different in both cases. The results show similar average improvement, but differences in fairness metrics when trained with RFW and BUPT-Balancedface. Our hypothesis to explain this difference is that each database contains particular characteristics and each demographic group contains its own biases (e.g. age distribution is different for each database). These particular characteristics reduce the method’s ability to find fairer representations that generalize to all databases.
Algorithmic Discrimination implications: the discrimination-aware learning method proposed in this work, SensitiveLoss, is a step forward to prevent discriminatory effects in the usage of automatic face recognition systems. The representation reduces the discriminatory effects of the original representation x as differences between goodness criteria across demographic groups are reduced. However, differences still exist and should be considered in the deployment of these technologies.
We have presented a comprehensive analysis of face recognition models based on deep learning according to a new discrimination-aware perspective. We started by presenting a new general formulation of Algorithmic Discrimination with application to face recognition. We then showed the high bias introduced when training the deep models with the most popular face databases employed in the literature. We then evaluated two popular pre-trained face models (VGG-Face and ResNet-50) according to the proposed formulation.
The experiments are carried out on three public databases (DiveFace, RFW, and BUPT-B) comprising 64,000 identities and 1.5M images. The results show that the two tested face models are highly biased across demographic groups. In particular, we observed large performance differences in face recognition across gender and ethnic groups. These performance gaps reached up to 200% of relative error degradation between the best class and the worst. This means that false positives are 200% more likely for some demographic groups than for others when using the popular face models evaluated in this work.
We also looked at the interior of the tested models, revealing different activation patterns of the networks for different demographic groups. This corroborates the biased nature of these popular pre-trained face models.
After the bias analysis, we proposed a novel discrimination-aware training method, SensitiveLoss, based on a triplet loss function and online selection of sensitive triplets. Unlike related existing de-biasing methods, SensitiveLoss works as an add-on to pre-trained networks, thereby facilitating its application to problems (like face recognition) where carefully engineered models exist with excellent performance, but little attention was paid to fairness aspects in their inception. Experiments with SensitiveLoss demonstrate how simple discrimination-aware rules can guide the learning process towards fairer and more accurate representations. The results of the proposed SensitiveLoss representation outperform the baseline models for the three evaluated databases both in terms of average accuracy and fairness metrics. These results encourage the training of more diverse models and the development of methods capable of dealing with the differences inherent to demographic groups.
The framework analyzed in this work is focused on the analysis of Group-based Algorithmic Discrimination (G-AD). Future work will investigate how to incorporate User-specific Algorithmic Discrimination (U-AD) in the proposed framework. Additionally, the analysis of other covariates such as age will be included in the study. Discrimination by age is an important concern in applications such as automatic recruitment tools. Other future directions include the study of new methods to detect bias in the training process in an unsupervised way or the application of privacy-preserving techniques at image level [29].
This work has been supported by projects: PRIMA (MSCA-ITN-2019-860315), TRESPASS (MSCA-ITN-2019-860813), BIBECA (RTI2018-101248-B-I00 MINECO/FEDER), and Accenture. I. Serna is supported by a research fellowship from the Spanish CAM (PEJD-2018-PRE/TIC-9449).
-  Measuring the Gender and Ethnicity Bias in Deep Models for Face Recognition. In Iberoamerican Congress on Pattern Recognition, Madrid, Spain, pp. 584–593. Cited by: §1, §5.2.
-  (2018) Biometrics: In search of identity and security (Q & A). IEEE MultiMedia 25 (3), pp. 22–35. Cited by: §1.
-  (2011) Quality Measures in Biometric Systems. IEEE Security & Privacy 10 (6), pp. 52–62. Cited by: §3.
-  (2018) Turning a Blind Eye: Explicit Removal of Biases and Variation from Deep Neural Network embeddings. In European Conference on Computer Vision (ECCV), Munich, Germany. Cited by: §1, §2.2.
-  (2020) Fair Enough: Improving Fairness in Budget-Constrained Decision Making Using Confidence Thresholds. In AAAI Workshop on Artificial Intelligence Safety (SafeAI), New York, NY, USA, pp. 41–53. Cited by: §3.
-  (2017) Deep Learning for Biometrics. Springer, Advances in Computer Vision and Pattern Recognition book series (ACVPR). Cited by: §1.
-  (2018) Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Conference on Fairness, Accountability and Transparency, S. A. Friedler and C. Wilson (Eds.), Proceedings of Machine Learning Research, Vol. 81, New York, NY, USA, pp. 77–91. Cited by: §1, §5.2.
-  Three Naive Bayes Approaches for Discrimination-Free Classification. Data Mining and Knowledge Discovery 21 (2), pp. 277–292. Cited by: §3.
-  (2018) VGGFace2: A Dataset for Recognising Faces Across Pose and Age. In International Conference on Automatic Face & Gesture Recognition (FG), Lille, France, pp. 67–74. Cited by: TABLE I, §4, §4, §6.1, §6.2.
-  (2019) Demographic Effects in Facial Recognition and Their Dependence on Image Acquisition: An Evaluation of Eleven Commercial Systems. IEEE Transactions on Biometrics, Behavior, and Identity Science 1 (1), pp. 32–41. Cited by: §1.
-  (2018) Mitigating Bias in Gender, Age and Ethnicity Classification: a Multi-Task Convolution Neural Network Approach. In European Conference on Computer Vision (ECCV), Munich, Germany. Cited by: §2.2.
-  (2020) Demographic Bias in Biometrics: A Survey on an Emerging Challenge. arXiv:2003.02488. Cited by: §1, §2.1.
-  (2019) DebFace: De-biasing Face Recognition. arXiv:1911.08080. Cited by: §1, §2.2, §7.3, TABLE III.
-  (2018) Ongoing Face Recognition Vendor Test (FRVT) Part 2: Identification. NIST Internal Report, National Institute of Standards and Technology. Cited by: §1.
-  (2019) Ongoing Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects. NIST Internal Report, U.S. Department of Commerce, National Institute of Standards and Technology. Cited by: §2.1, §2.1, TABLE I.
-  (2016) MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, pp. 87–102. Cited by: TABLE I, §5.2.
-  (2016) Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778. Cited by: TABLE I, §4.
-  (2019) DemogPairs: Quantifying the Impact of Demographic Imbalance in Deep Face Recognition. In International Conference on Automatic Face & Gesture Recognition (FG), Lille, France. Cited by: §1, TABLE I.
-  (2019) FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age. arXiv:1908.04913. Cited by: TABLE I.
-  (2016) The MegaFace Benchmark: 1 Million Faces for Recognition at Scale. In Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, pp. 4873–4882. Cited by: TABLE I.
-  (2012) Face Recognition Performance: Role of Demographic Information. IEEE Transactions on Information Forensics and Security 7 (6), pp. 1789–1801. Cited by: §1, §2.1, §2.1, item i).
-  (2019) Discrimination in the Age of Algorithms. Journal of Legal Analysis 10, pp. 113–174. Cited by: §4.
-  (2020) Issues Related to Face Recognition Accuracy Varying Based on Race and Skin Tone. IEEE Transactions on Technology and Society 1, pp. 8–20. Cited by: §1.
-  (2011) Describable Visual Attributes for Face Verification and Image Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (10), pp. 1962–1977. Cited by: TABLE I.
-  (2016) Labeled Faces in the Wild: A Survey. In Advances in Face Detection and Facial Image Analysis, M. Kawulok, M. E. Celebi, and B. Smolka (Eds.), pp. 189–248. Cited by: TABLE I.
-  (2019) An Experimental Evaluation of Covariates Effects on Unconstrained Face Verification. IEEE Transactions on Biometrics, Behavior, and Identity Science 1 (1), pp. 42–55. Cited by: §1, §2.1.
-  (2018) IARPA Janus Benchmark-C: Face Dataset and Protocol. In International Conference on Biometrics (ICB), Gold Coast, Australia, pp. 158–165. Cited by: TABLE I.
-  (2019) Diversity in Faces. arXiv:1901.10436. Cited by: §5.2.
-  (2020) FlowSAN: Privacy-enhancing Semi-Adversarial Networks to Confound Arbitrary Face-based Gender Classifiers. IEEE Access 7, pp. 99735–99745. Cited by: §8.
-  (2019) SensitiveNets: Learning Agnostic Representations with Application to Face Recognition. arXiv:1902.00334. Cited by: §2.2, TABLE I, §5.2.
-  (2009) The Multiscenario Multienvironment Biosecure Multimodal Database (BMDB). IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1097–1111. Cited by: TABLE I.
-  (2015) Deep Face Recognition. In British Machine Vision Conference (BMVC), Swansea, UK. Cited by: TABLE I, §4, §4, §6.1, §6.2.
-  (2019) Machine Behaviour. Nature 568 (7753), pp. 477–486. Cited by: §1, §1.
-  (2019) Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products. In Conference on AI Ethics and Society (AIES), New York, NY, USA. Cited by: §3.
-  (2018) Deep Learning for Understanding Faces: Machines May Be Just as Good, or Better, than Humans. IEEE Signal Processing Magazine 35 (1), pp. 66–83. Cited by: §1.
-  (2018) Deep Learning for Understanding Faces: Machines May Be Just as Good, or Better, than Humans. IEEE Signal Processing Magazine 35 (1), pp. 66–83. Cited by: §4.
-  (2016) Artificial Intelligence: A Modern Approach. Pearson. Cited by: Fig. 1.
-  (2017) Grad-CAM: Visual Explanations from Deep Networks Via Gradient-Based Localization. In International Conference on Computer Vision (ICCV), Venice, Italy, pp. 618–626. Cited by: §5.3.
-  (2020) Algorithmic Discrimination: Formulation and Exploration in Deep Learning-based Face Biometrics. In AAAI Workshop on Artificial Intelligence Safety (SafeAI), New York, NY, USA. Cited by: §1.
-  (2020) Special Issue on Machine Vision with Deep Learning. International Journal of Computer Vision 128, pp. 771–772. Cited by: §1.
-  (2019) Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network. In International Conference on Computer Vision (ICCV), Seoul, Korea. Cited by: TABLE I, §5.2, item i).
-  (2020) Mitigate Bias in Face Recognition using Skewness-Aware Reinforcement Learning. In Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, Washington, USA. Cited by: §1, §2.2, TABLE I, §5.2, §5.2, item i), §7.2, §7.3, TABLE III.
-  (2006) Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems (NIPS), pp. 1473–1480. Cited by: §6.1.
-  (2011) Face Recognition in Unconstrained Videos with Matched Background Similarity. In Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, pp. 529–534. Cited by: TABLE I.
-  (2015) From Facial Parts Responses to Face Detection: A Deep Learning Approach. In International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 3676–3684. Cited by: TABLE I.
-  (2014) Learning Face Representation from Scratch. arXiv:1411.7923. Cited by: TABLE I.
-  (2016) Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §4.
-  (2020) Joint Optimization of AI Fairness and Utility: A Human-Centered Approach. In Conference on AI, Ethics, and Society (AIES), New York, NY, USA, pp. 400–406. Cited by: §3.
-  Age Progression/Regression by Conditional Adversarial Autoencoder. In Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, USA, pp. 5810–5818. Cited by: TABLE I.