Investigating the Impact of Inclusion in Face Recognition Training Data on Individual Face Identification

by Chris Dulhanty, et al.
University of Waterloo

Modern face recognition systems leverage datasets containing images of hundreds of thousands of specific individuals' faces to train deep convolutional neural networks to learn an embedding space that maps an arbitrary individual's face to a vector representation of their identity. The performance of a face recognition system in face verification (1:1) and face identification (1:N) tasks is directly related to the ability of an embedding space to discriminate between identities. Recently, there has been significant public scrutiny into the source and privacy implications of large-scale face recognition training datasets such as MS-Celeb-1M and MegaFace, as many people are uncomfortable with their face being used to train dual-use technologies that can enable mass surveillance. However, the impact of an individual's inclusion in training data on a derived system's ability to recognize them has not previously been studied. In this work, we audit ArcFace, a state-of-the-art, open source face recognition system, in a large-scale face identification experiment with more than one million distractor images. We find a Rank-1 face identification accuracy of 79.71% for individuals present in the model's training data and an accuracy of 75.73% for those not present. This modest difference in accuracy demonstrates that face recognition systems using deep learning work better for individuals they are trained on, which has serious privacy implications when one considers all major open source face recognition training datasets do not obtain informed consent from individuals during their collection.




1. Introduction

Face recognition systems using Deep Convolutional Neural Networks (DCNNs) depend on the collection of large image datasets containing thousands of sets of specific individuals’ faces for training. Using this data, DCNNs learn a set of parameters that can map an arbitrary individual’s face to a feature representation, or faceprint, that has small intra-class and large inter-class variability. The ability of a face recognition system to distinguish between identities within this embedding space depends on the size and diversity of its training data, along with its model capacity and underlying algorithms. Face recognition systems have benefited from the enabling power of the Internet in the collection of large-scale image datasets and from hardware improvements in enabling efficient training of large models. Recently, increased attention to face recognition by academia, industry and government has brought new researchers, ideas and funding to the field, leading to performance improvements on the benchmark tasks Labelled Faces in the Wild (LFW) (Huang et al., 2007) and MegaFace (Nech and Kemelmacher-Shlizerman, 2017). Consequently, face recognition systems are now being integrated into consumer and industrial electronic devices and offered as application programming interfaces (APIs) by providers such as Amazon, Microsoft, IBM, Megvii and Kairos. However, along with improved performance has come increased public discourse on the ethics of face recognition systems and their development.

Algorithmic auditing of commercial face analysis applications has uncovered disparate performance for intersectional groups across several tasks. Poor performance for darker skinned females by commercial face analysis APIs has been reported by Buolamwini, Gebru and Raji (Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019), as has lower accuracy in face identification by commercial systems with respect to lower (darker) skin reflectance by researchers at the US Department of Homeland Security (Cook et al., 2019). As bias in training data begets bias in model performance, efforts to create more diverse datasets for these tasks have resulted. IBM’s Diversity in Faces dataset (Merler et al., 2019), released in January 2019, is a direct response to this body of research. Using ten established coding schemes from scientific literature, researchers annotated one million face images in an effort to advance the study of fairness and accuracy in face recognition. However, this dataset has seen public scrutiny from a different, but equally notable perspective. A March 2019 investigation by NBC News into the origins of the dataset brought to the public conversation the issue of informed consent in large-scale academic image datasets, as IBM leveraged images from Flickr with a Creative Commons Licence without notifying content owners of their use (Solon, 2019).

To rationalize the collection of large-scale image datasets without explicit consent of individuals, some computer vision researchers appeal to the non-commercial nature of their work. However, work by Harvey et al. at MegaPixels has found that authors’ stated limitations on dataset use do not translate to real-world restrictions (Harvey and LaPlace, 2019). In the case of Microsoft’s MS-Celeb-1M dataset, authors included an explicit “non-commercial research purpose only” clause with the dataset, which was the largest publicly-available face recognition dataset at the time. However, as the dataset has been cited in published works by the research arms of many commercial entities, findings cannot easily be isolated from improvements in product offerings. As a direct result of MegaPixels’ work on the ethics, origins, and privacy implications of face recognition datasets, MS-Celeb-1M (Guo et al., 2016), Stanford’s Brainwash dataset (Stewart et al., 2016) and Duke’s Multi-Target, Multi-Camera dataset (Ristani et al., 2016) were removed from their authors’ websites in June 2019. However, in the case of MS-Celeb-1M, the data remains accessible via torrents, derived datasets and other hosts (Harvey and LaPlace, 2019).

In addition to issues of bias and informed consent in data collection, the general use of face recognition systems by commercial and government agencies has been challenged by civil rights groups and research centers, as there is no oversight for their deployment in civil society (ACLU, 2018; Whittaker et al., 2018). For these and other reasons, multiple cities in the United States have banned the use of face recognition systems for law enforcement purposes (Conger et al., 2019; Wu, 2019; Ravani, 2019). Many people are concerned with their identity being used to train the dual-use technology that is face recognition. With reports of face recognition being used by law enforcement entities to identify protesters in London (Bowcott, 2018) and Hong Kong (Mozur, 2019), and measures enacted to ban face masks in the latter location (Yu, 2019), there is merit in understanding the impact of one’s inclusion in the training data that fuels the development of these systems.

In an effort to inform the conversation about informed consent and privacy in the domain of face recognition, we conduct experiments on a state-of-the-art system. The goal of this work is to determine the impact of an individual’s inclusion in face recognition training data on a derived system’s ability to recognize them. To the best of the authors’ knowledge, this is the first paper to investigate this relationship.

The remainder of this paper is organized as follows: section two outlines ethical considerations for some decisions in the design and implementation of this work, section three provides background for the taxonomy, algorithms and data used in face recognition research, section four outlines the design of experiments used to address the research question, section five presents and discusses our results, and section six concludes the paper.

2. Ethical Considerations

2.1. Intent

The intent of this work is to investigate the performance of face recognition systems with respect to inclusion in training datasets. While one interpretation of this work may be to motivate efforts to mitigate demographic bias in the development of face recognition systems, it should be noted that increasing the performance of face recognition systems in any context can increase their ability to be used for oppressive purposes. In addition, due to historical societal injustices against marginalized populations and racially-biased police practices in the United States, a disproportionate number of African Americans and Hispanics are present in mugshot databases, often used by law enforcement agencies as data sources for face recognition systems (NAACP, 2018; Garvie, 2016). These populations are therefore poised to receive a greater burden of the effects of improved face recognition systems. We therefore position this work as informing the discussion on data privacy and consent when it comes to face recognition systems and do not advocate for technical improvements without a larger discussion on the appropriate use and legality of the technology.

2.2. Use of MS-Celeb-1M

As noted in the introduction, the MS-Celeb-1M dataset was removed from Microsoft’s website in June 2019. In a response to a Financial Times inquiry, Microsoft stated the website was retired “because the research challenge is over” (Murgia, 2019). However, a version of this dataset with detected and aligned faces from a “cleaned” subset of the original images is available from the Intelligent Behaviour and Understanding Group (iBUG) at Imperial College London. The dataset was offered as training data for the “Lightweight Face Recognition Challenge & Workshop” the group organized at ICCV 2019. The group has pre-trained face recognition models available as benchmarks for the challenge, trained on this data.

As this work aims to conduct experiments in a realistic setting in order to better inform the conversation around data collection processes, the analysis of a state-of-the-art model, trained on a large dataset is necessary to gain insights that are applicable to commercial applications. We therefore have decided to use the MS-Celeb-1M dataset, through its derived version offered for the ICCV 2019 Workshop, for the limited scope of this work.

Dataset                 Year Released  # Identities  # Images  Informed Consent Obtained?  Source
CASIA WebFace           2014           10,575        494K      No                          (Yi et al., 2014)
CelebA                  2015           10,177        203K      No                          (Liu et al., 2015)
VGGFace                 2015           2,622         2.6M      No                          (Parkhi et al., 2015)
MS-Celeb-1M             2016           99,952        10.0M     No                          (Guo et al., 2016)
UMDFaces                2016           8,277         368K      No                          (Bansal et al., 2017)
MegaFace (Challenge 2)  2016           672,057       4.7M      No                          (Nech and Kemelmacher-Shlizerman, 2017)
VGGFace2                2018           9,131         3.3M      No                          (Cao et al., 2018)
Table 1. Prominent open-source face recognition training datasets

3. Background

3.1. Face Recognition Tasks

Within the domain of face recognition lie two categories of tasks: face verification and face identification (Learned-Miller et al., 2016).

In face verification, the goal is to assess if a presented image matches the reference image of an individual, often to grant access to a physical device or location. Unlocking a smartphone with one’s face provides an example of face verification; a person presents their face to a phone and it is verified against a reference image of the known owner of the device. This task is referred to as 1:1 matching, as there is only one individual that the presented face image is compared against. In order to confirm a match, a threshold of similarity must be met, which can be set by the developer of a system to meet a specific level of security. Performance of a system on face verification tasks is reported in terms of accuracy: the fraction of correct verifications among all verification attempts.
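The thresholded 1:1 decision described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular system: the function names, the use of Euclidean distance as the similarity measure, the toy two-dimensional features and the threshold values are all assumptions made for the example.

```python
import math

def verify(probe, reference, threshold):
    """1:1 matching: accept when the probe lies within the distance threshold."""
    return math.dist(probe, reference) <= threshold

def verification_accuracy(trials, threshold):
    """Fraction of (probe, reference, same_person) trials decided correctly."""
    correct = sum(verify(p, r, threshold) == same for p, r, same in trials)
    return correct / len(trials)

# Toy 2-D features; the values are invented for illustration only.
trials = [
    ([0.0, 1.0], [0.1, 0.9], True),   # same person, features close together
    ([0.0, 1.0], [1.0, 0.0], False),  # different people, features far apart
    ([0.5, 0.5], [0.9, 0.1], True),   # same person, features further apart
]
for threshold in (0.5, 0.6):
    print(threshold, verification_accuracy(trials, threshold))
```

Loosening the threshold accepts the third pair and raises accuracy on these toy trials, but in practice a looser threshold also admits more false matches, which is the security trade-off mentioned above.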

In face identification, a gallery of known identities is constructed from face images of individuals in advance of testing. Subsequently, a face image of unknown identity is presented to the system as the probe. The probe is then matched for similarity with all images in the gallery, constituting 1:N matching. If the system guarantees that the identity of the probe is within the gallery of identities, the problem is considered closed-set face identification, otherwise it is considered open-set face identification.

Closed-set face identification tasks are common in academic benchmarks, as galleries are carefully constructed by their authors to contain all probes. In open-set face identification, a confidence threshold must be set to reject matches that do not meet a certain level of similarity. The selection of an appropriate threshold is especially relevant in high-risk applications such as law enforcement in which false positives have significant implications.
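As a rough sketch of the open-set case, the following example returns the closest gallery identity only when it falls within a rejection threshold. The gallery layout (one feature vector per identity), the labels, the distance measure and the threshold are hypothetical choices for illustration.

```python
import math

def identify_open_set(probe, gallery, threshold):
    """Return the closest gallery identity, or None when no match is close enough.

    gallery maps an identity label to a single feature vector (an assumption
    made to keep the example short).
    """
    best_id, best_dist = None, math.inf
    for identity, feature in gallery.items():
        dist = math.dist(probe, feature)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist <= threshold else None

# Invented identities and toy 2-D features.
gallery = {"identity_a": [0.0, 1.0], "identity_b": [1.0, 0.0]}
print(identify_open_set([0.1, 0.9], gallery, threshold=0.5))  # close to identity_a
print(identify_open_set([5.0, 5.0], gallery, threshold=0.5))  # far from everyone
```

In the closed-set setting the final threshold check would simply be dropped, since the probe is guaranteed to be in the gallery.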

Face identification performance is reported in terms of accuracy in returning the correct identity of a probe from the gallery, or in the open-set case, no identity if the probe does not exist in the gallery. Common performance metrics include Rank-1 accuracy, the fraction of all identification attempts in which the correct identity in the gallery is the most similar identity to the probe, and Rank-10 accuracy, the fraction of attempts in which the correct identity is among the ten most similar identities to the probe.
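Given the position of the correct identity in each ranked list, Rank-k accuracy reduces to a simple count. The sketch below assumes 1-based ranks; the function name and the example ranks are invented for illustration.

```python
def rank_k_accuracy(ranks, k):
    """Fraction of identification attempts whose correct match is in the top k.

    ranks holds the 1-based position of the correct identity in the ranked
    list returned for each probe.
    """
    return sum(r <= k for r in ranks) / len(ranks)

# Five hypothetical identification attempts.
ranks = [1, 3, 1, 12, 150]
for k in (1, 10, 100):
    print(f"Rank-{k} accuracy: {rank_k_accuracy(ranks, k):.2f}")
```

By construction, Rank-k accuracy can only stay the same or increase as k grows.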

3.2. Deep Face Recognition

Rapid improvements in image classification in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015) by AlexNet (Krizhevsky et al., 2012), ZFNet (Zeiler and Fergus, 2014), GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016) from 2012 to 2015 cemented the DCNN as the standard method in computer vision research and applications. While early uses of convolutional neural networks in face verification showed preliminary success (Chopra et al., 2005; Huang et al., 2012), it was not until the introduction of the aforementioned network architectures that the modern era of deep face recognition was in full swing. Coupled with innovations in loss function design and access to larger image datasets, modern face recognition systems have improved state-of-the-art performance on benchmark face verification and identification tasks significantly in the past six years. For a complete survey of the development of deep face recognition systems, please refer to the review paper by Wang and Deng (Wang and Deng, 2018); the following is a brief summary of major milestones.

The first system to adapt findings from ILSVRC to face recognition was Facebook’s DeepFace (Taigman et al., 2014), published in 2014 by Taigman et al. The nine-layer AlexNet-based model was trained on a private dataset of 4.4M images of 4K identities and achieved state-of-the-art accuracy on the face verification tasks LFW and YouTube Faces (YTF) (Wolf et al., 2011), reducing the error rate by more than 50% on the latter task.

Following this work, Google introduced FaceNet in 2015 with a major innovation in loss function design (Schroff et al., 2015). While the standard softmax loss function optimized inter-class differences, researchers found that intra-class differences remained high, which is problematic in the domain of face recognition. To rectify this problem, the triplet loss was introduced to jointly minimize the Euclidean distance between an anchor example and a positive example of the same identity and maximize the distance between the anchor and a negative example of a different identity. Using a ZFNet-based model and a private dataset of 200M images of 8M identities, they achieved state-of-the-art performance on LFW and YTF.
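For a single triplet, the loss described above is commonly written as max(||a - p||^2 - ||a - n||^2 + margin, 0). The sketch below evaluates this form on toy vectors; the margin value and the two-dimensional features are illustrative assumptions, and triplet mining and batching, which matter in practice, are left out.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    gap = squared_l2(anchor, positive) - squared_l2(anchor, negative)
    return max(gap + margin, 0.0)

anchor, positive = [0.0, 0.0], [0.1, 0.0]
easy_negative = [1.0, 0.0]   # already much farther away than the positive
hard_negative = [0.2, 0.0]   # nearly as close as the positive
print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative contributes no loss because it already satisfies the margin, while the hard negative produces a positive loss that would push it farther from the anchor during training.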

Innovations in loss functions dominated the next wave of improvements in benchmark tasks, motivated by improving discrimination between classes by making features more separable. Wen et al. introduced the Center Loss in 2016 (Wen et al., 2016), followed by Liu et al. with the Angular Softmax in 2017 (Liu et al., 2017). The Large Margin Cosine Loss was introduced in 2018 by Wang et al. (Wang et al., 2018), and in 2019, Deng et al. incorporated the Additive Angular Margin Loss into the ArcFace model (Deng et al., 2019a), considered state-of-the-art on multiple face recognition benchmarks when published.
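To make the idea of an additive angular margin concrete, the sketch below adds a margin to the angle between a feature and its target class before a scaled softmax cross-entropy. It is a simplified single-example illustration rather than the authors' implementation: the function name and inputs are hypothetical, the default scale and margin are set to commonly cited values, and the handling of angles that exceed pi after the margin is added is ignored.

```python
import math

def angular_margin_loss(cosines, target, scale=64.0, margin=0.5):
    """Softmax cross-entropy with an angular margin added to the target class.

    cosines[j] is assumed to be the cosine between the L2-normalized feature
    and the L2-normalized weight vector of class j.
    """
    logits = []
    for j, c in enumerate(cosines):
        theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for numerical safety
        if j == target:
            theta += margin                        # penalize the target angle
        logits.append(scale * math.cos(theta))
    top = max(logits)                              # stabilize the log-sum-exp
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]

cosines = [0.5, 0.4, 0.3]  # toy similarities to three classes; class 0 is correct
print(angular_margin_loss(cosines, target=0, margin=0.0))  # plain scaled softmax
print(angular_margin_loss(cosines, target=0, margin=0.5))  # with angular margin
```

With the margin, the same feature incurs a larger loss, so training must pull it closer in angle to its class center to compensate, which is the sense in which these losses make features more separable.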

Figure 1. Experimental procedure to generate feature representations of images in gallery and probe sets from ArcFace model

3.3. Face Recognition Training Datasets

Access to large-scale face recognition training datasets has been essential to the development of modern solutions by the academic community. While early published results in the DCNN era of face recognition came out of companies with access to massive private datasets, such as Facebook’s 500M images and 10M identities (Taigman et al., 2015) and Google’s 200M images and 8M identities (Schroff et al., 2015), the release of several open-source datasets in the ensuing years has allowed researchers to train models at scale. A summary of notable face recognition training datasets of the past six years is provided in Table 1. These datasets catalyzed the field of face recognition and led to great advances in model performance on benchmark tasks. They largely consist of celebrity identities and copyrighted images scraped from the internet.

One exception is MegaFace, which is derived from the YFCC100M dataset of 100M photos with a Creative Commons Licence, from 550K personal Flickr accounts (Thomee et al., 2015). While the Creative Commons Licence permits the fair use of images, including in this context, Ryan Merkley, CEO of Creative Commons, noted the trouble of conflating copyright with privacy in a March 2019 statement: “… copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online. Those issues rightly belong in the public policy space, and good solutions will consider both the law and the community norms of CC licenses and content shared online in general” (Merkley, 2019). While MegaFace contains unknown, non-celebrity identities, an October 2019 investigation by the New York Times demonstrated that account metadata associated with images in the dataset allows for a trivial real-world identification of individuals (Hill and Krolik, 2019).

In all datasets, no informed consent was sought or obtained for individuals contained therein.

4. Methodology

4.1. Face Recognition Model

4.1.1. Training Data

We employ a cleaned version of the MS-Celeb-1M dataset (Guo et al., 2016) as training data for a face recognition model in this work. This dataset was prepared for the ICCV 2019 Lightweight Face Recognition Challenge (Deng et al., 2019b). All face images were preprocessed by the RetinaFace model for face detection and alignment (Deng et al., 2019c). A similarity transformation was applied to each detected face using five predicted face landmarks to generate normalized face crops of 112 x 112 pixels.
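The alignment step can be illustrated with a least-squares similarity transform (rotation, uniform scale and translation) fitted from five landmarks to a fixed template. The sketch below is an assumption-laden illustration, not the RetinaFace pipeline: the template coordinates are round placeholder values for a 112 x 112 crop, the function names are hypothetical, and the image warping itself is omitted.

```python
def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Returns (a, b, tx, ty) such that a source point (x, y) maps to
    (a*x - b*y + tx, b*x + a*y + ty).
    """
    n = len(src)
    mx = sum(x for x, _ in src) / n
    my = sum(y for _, y in src) / n
    mu = sum(u for u, _ in dst) / n
    mv = sum(v for _, v in dst) / n
    num_a = num_b = denom = 0.0
    for (x, y), (u, v) in zip(src, dst):
        sx, sy, dx, dy = x - mx, y - my, u - mu, v - mv
        num_a += sx * dx + sy * dy
        num_b += sx * dy - sy * dx
        denom += sx * sx + sy * sy
    a, b = num_a / denom, num_b / denom
    return a, b, mu - (a * mx - b * my), mv - (b * mx + a * my)

def apply_similarity(params, point):
    """Apply the transform returned by fit_similarity to one (x, y) point."""
    a, b, tx, ty = params
    x, y = point
    return (a * x - b * y + tx, b * x + a * y + ty)

# Placeholder template: eyes, nose tip and mouth corners in a 112 x 112 crop.
TEMPLATE = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]
# Synthetic "detected" landmarks: the template rotated, scaled and shifted.
detected = [apply_similarity((0.0, 2.0, 300.0, 100.0), p) for p in TEMPLATE]
params = fit_similarity(detected, TEMPLATE)
print([tuple(round(c, 3) for c in apply_similarity(params, p)) for p in detected])
```

Because the synthetic landmarks are an exact similarity transform of the template, the fitted transform maps them back onto the template; with real detections the fit would only be approximate.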

As the original version of this dataset has been shown to exhibit considerable inter-class noise, efforts have been made to automatically clean the dataset (Jin et al., 2018). In the case of this version, after face detection and alignment, cleaning was performed by a semi-automatic refinement strategy. First, a pre-trained ArcFace model (Deng et al., 2019a) was used to automatically remove outlier images of each identity. A manual removal of incorrectly labelled images by “ethnicity-specific annotators” followed to result in a dataset of 5,179,510 images of 93,431 identities. We refer to this dataset as MS1M-RetinaFace.

4.1.2. Model

We select the ArcFace model (Deng et al., 2019a) to study in this work. ArcFace employs the Additive Angular Margin Loss and a ResNet100 backbone to arrive at a 512-dimensional feature representation of an input image. The model achieves a verification accuracy of 99.83% on LFW and Rank-1 identification accuracy of 81.91% on the MegaFace Challenge 1 with one million distractors, considered state-of-the-art results. We select the model for study as it is the top academic, open-source entrant on the National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT) 1:1 Verification, a benchmark used by many commercial entities to validate the performance of their face recognition systems. Pre-trained weights for this model were provided by iBUG.

Metric                 Probe Set      All    Males  Females
Rank-1 Accuracy (%)    In-Domain      79.71  78.50  80.93
                       Out-of-Domain  75.73  77.30  74.17
Rank-10 Accuracy (%)   In-Domain      90.82  90.92  90.73
                       Out-of-Domain  86.58  88.59  84.57
Rank-100 Accuracy (%)  In-Domain      92.72  92.52  92.92
                       Out-of-Domain  89.22  90.59  87.84
Table 2. Face identification accuracies of ArcFace model on different probe image sets with one million distractor images

4.2. Experiments

To determine the effect of inclusion in the training data of a face recognition system on its ability to identify an individual, we frame the problem as a closed-set face identification task. We construct two probe datasets and perform face identification on a gallery of one million distractor images. We assess the performance of the model on the probe datasets in terms of Rank-1, Rank-10 and Rank-100 identification accuracies. A visual representation of the datasets used in this work is shown in Figure 1.

4.2.1. Probe Data

We construct two probe datasets from the VGGFace2 dataset (Cao et al., 2018). Using regular expressions, we match identities in VGGFace2 by name with the identity list of MS1M-RetinaFace. We find 5,902 VGGFace2 identities present in MS1M-RetinaFace and 3,229 VGGFace2 identities not present in the training dataset. In each of these two groups, we randomly select 500 male identities and 500 female identities for evaluation, based on gender labels provided by VGGFace2 metadata. For each identity, we randomly select 50 images and perform face detection and alignment with the Multi-task Cascaded Convolutional Network (MTCNN) (Zhang et al., 2016) to generate normalized face crops of size 112 x 112 pixels. We refer to the set of 50,000 images of 1000 identities present in the training data as the in-domain probe set and the set of 50,000 images of 1000 identities not present in the training data as the out-of-domain probe set. We then generate 512-dimensional feature representations for all images in the in-domain and out-of-domain probe sets by running them through ArcFace.
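The probe set construction can be sketched as a name-matching step followed by stratified random sampling. Note that the two groups together account for all of VGGFace2: 5,902 + 3,229 = 9,131 identities, the total listed in Table 1. The normalization rule, function names, gender codes and toy names below are hypothetical; the exact regular expressions used for matching are not given in the text.

```python
import random
import re

def normalize(name):
    """Lowercase a name and strip everything except letters (an assumed rule)."""
    return re.sub(r"[^a-z]", "", name.lower())

def split_by_training_membership(candidates, training_names):
    """Split candidate identities into those present in / absent from training."""
    seen = {normalize(n) for n in training_names}
    in_domain = [n for n in candidates if normalize(n) in seen]
    out_of_domain = [n for n in candidates if normalize(n) not in seen]
    return in_domain, out_of_domain

def sample_identities(names, gender_of, per_gender, rng):
    """Randomly pick the same number of identities for each gender label."""
    chosen = []
    for gender in ("m", "f"):
        pool = sorted(n for n in names if gender_of[n] == gender)
        chosen.extend(rng.sample(pool, per_gender))
    return chosen

# Invented placeholder identities; the study samples 500 per gender per group.
candidates = ["Ann_Alpha", "Bob_Beta", "Cat_Gamma", "Dan_Delta", "Eve_Eps", "Fay_Zeta"]
gender_of = {"Ann_Alpha": "f", "Bob_Beta": "m", "Cat_Gamma": "f",
             "Dan_Delta": "m", "Eve_Eps": "f", "Fay_Zeta": "f"}
training_names = ["ann alpha", "Bob Beta", "cat-gamma"]
in_domain, out_of_domain = split_by_training_membership(candidates, training_names)
print(in_domain, out_of_domain)
print(sample_identities(in_domain, gender_of, per_gender=1, rng=random.Random(0)))
```

Sampling the two groups with the same procedure is what allows the later comparison to attribute differences to training inclusion rather than to how the probes were chosen.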

4.2.2. Gallery Data

We leverage the MegaFace Challenge 1 “Distractor” dataset (Kemelmacher-Shlizerman et al., 2016) of 1,027,058 images of 690,572 identities to form the basis of the gallery. We again apply MTCNN to generate normalized face crops of 112 x 112 pixels for each image and run each image through ArcFace to generate 512D feature representations of all images in the gallery.

4.2.3. Evaluation Protocol

The experiments conducted in this work follow the protocol of MegaFace Challenge 1, with our probe sets in place of the standard FaceScrub test set (Ng and Winkler, 2014). We employ the Linux development kit offered by MegaFace to perform evaluation. Each probe set is evaluated following Algorithm 1; a written description of this protocol follows.

A probe set contains 1000 identities, each with 50 images represented as 512D features. For each identity, we iterate over their images, adding one image to the gallery at a time, which we will refer to as the needle. We then iterate over the remaining 49 images, using each one as a probe. We rank all images in the gallery by L2 distance in feature space to the probe, and record the position of the needle in the ranked list. We report results for each probe set in terms of Rank-1, Rank-10 and Rank-100 face identification accuracies.

Result: Rank-1, Rank-10 and Rank-100 face identification accuracies for a probe set.
gallery contains 1M distractor images;
for identity in identities 1 to 1000 do
    for image_needle in images 1 to 50 do
        add image_needle to the gallery;
        for image_probe in images 1 to 50 do
            if image_needle != image_probe then
                rank all images in gallery by L2 distance to image_probe in feature space;
                if image_needle in first position in ranked list then
                    record a Rank-1 hit;
                if image_needle in first 10 positions in ranked list then
                    record a Rank-10 hit;
                if image_needle in first 100 positions in ranked list then
                    record a Rank-100 hit;
        remove image_needle from gallery;
Algorithm 1 Closed-set face identification evaluation
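A small executable sketch of this protocol is given below. It is not the MegaFace development kit: the function name and data layout are assumptions, and instead of sorting the full gallery it computes the needle's rank as one plus the number of distractors strictly closer to the probe, which yields the same position when ties are resolved in the needle's favour. With 1000 identities and 50 images each, the protocol amounts to 1000 x 50 x 49 = 2,450,000 identification attempts per probe set.

```python
import math

def evaluate_probe_set(distractors, probe_features, ks=(1, 10, 100)):
    """Closed-set identification accuracy at each rank in ks.

    distractors: list of gallery feature vectors.
    probe_features: dict mapping an identity to its list of feature vectors.
    """
    hits = {k: 0 for k in ks}
    attempts = 0
    for images in probe_features.values():
        for i, needle in enumerate(images):        # needle joins the gallery
            for j, probe in enumerate(images):
                if i == j:                         # never probe with the needle
                    continue
                needle_dist = math.dist(probe, needle)
                closer = sum(math.dist(probe, d) < needle_dist for d in distractors)
                rank = closer + 1                  # 1-based position of needle
                attempts += 1
                for k in ks:
                    hits[k] += rank <= k
    return {k: hits[k] / attempts for k in ks}

# Toy data: two distractors and one identity with three 2-D features.
distractors = [[0.0, 0.0], [10.0, 10.0]]
probe_features = {"identity_a": [[1.0, 1.0], [1.2, 1.0], [5.0, 5.0]]}
print(evaluate_probe_set(distractors, probe_features, ks=(1, 2)))
```

In the toy data, the outlying third image is outranked by a distractor whenever it serves as the needle, so two of the six attempts miss at Rank-1 but succeed at Rank-2.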

5. Results and Discussion

We present results of the experiments in Table 2 for Ranks 1, 10 and 100. We find there is a modest increase in face identification accuracy for identities present in the training data, compared to those who are not. In-domain identities have an identification accuracy that is 4.0 percentage points higher than out-of-domain identities at Rank-1, 4.2 percentage points higher at Rank-10, and 3.5 percentage points higher at Rank-100. Although the margin is not large, these results suggest that modern DCNN-based face recognition systems are biased towards individuals they are trained on.
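The differences quoted above follow directly from the "All" column of Table 2, as the short calculation below shows. Only the dictionary layout and variable names are introduced here; the accuracies are copied from the table.

```python
# "All" column of Table 2, in percent.
table2_all = {
    "Rank-1": {"in": 79.71, "out": 75.73},
    "Rank-10": {"in": 90.82, "out": 86.58},
    "Rank-100": {"in": 92.72, "out": 89.22},
}

# Absolute gap between in-domain and out-of-domain accuracy, in percentage points.
domain_gaps = {
    metric: round(acc["in"] - acc["out"], 2) for metric, acc in table2_all.items()
}
for metric, gap in domain_gaps.items():
    print(f"{metric}: in-domain higher by {gap:.2f} percentage points")
```

The gaps are absolute differences in accuracy; expressed relative to the out-of-domain accuracy they would be somewhat larger.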

The disparate performance between probe sets suggests some amount of overfitting has occurred in the model. Although the model generalizes well to new identities, as evidenced by results on benchmarks LFW, MegaFace and on NIST’s FRVT, these results indicate that the 93k identities the system is trained on are more easily identifiable in a large-scale study. As the model’s Additive Angular Margin Loss sought to increase discrimination between classes by making features more separable, it appears the model has learned to map identities to the same feature representation more consistently for those it has seen before.

We also investigated the role of gender in the performance of the face recognition model. We find small differences in performance between genders for in-domain identities, but a decrease in performance of roughly 3 to 4 percentage points for females compared to males who are out-of-domain, across all ranks. These results suggest that a gender bias against female identities exists in the face recognition model. As the model has a smaller drop in face identification accuracy between domains for males, it has a greater ability to generalize to new male identities. While we do not have gender labels available for all identities in MS1M-RetinaFace, recent work has demonstrated that large-scale face recognition datasets are largely biased towards lighter-skinned males (Merler et al., 2019). A representational bias in MS1M-RetinaFace may account for this disparate performance across genders. Looking at these results in a different way, the consistent performance for in-domain identities across genders is perhaps more evidence that the model is overfitting to identities it has seen before. If the model only had a gender bias, we would have seen disparate performance for genders on both probe sets; however, these results suggest the model may also exhibit a “training inclusion bias”.
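The gender comparisons can likewise be recomputed from Table 2. The sketch below derives the out-of-domain gap between males and females and the drop between domains for each gender; only the data layout and variable names are introduced here, and the accuracies are copied from the table.

```python
# Table 2 accuracies in percent: (in-domain, out-of-domain) per gender.
table2 = {
    "Rank-1": {"males": (78.50, 77.30), "females": (80.93, 74.17)},
    "Rank-10": {"males": (90.92, 88.59), "females": (90.73, 84.57)},
    "Rank-100": {"males": (92.52, 90.59), "females": (92.92, 87.84)},
}

# Out-of-domain accuracy of males minus females, in percentage points.
out_of_domain_gap = {
    m: round(v["males"][1] - v["females"][1], 2) for m, v in table2.items()
}
# Drop from in-domain to out-of-domain accuracy for each gender.
domain_drop = {
    m: {g: round(acc[0] - acc[1], 2) for g, acc in v.items()}
    for m, v in table2.items()
}
print(out_of_domain_gap)
print(domain_drop)
```

The out-of-domain gap is 3.13 points at Rank-1, 4.02 at Rank-10 and 2.75 at Rank-100, so the Rank-100 value falls slightly below the quoted 3 to 4 point range. The drop between domains is 1.2 to 2.3 points for males and 5.1 to 6.8 points for females.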

Results of this study lead to the question: is the bias towards individuals in training data truly a consequence of overtraining, or is this a fundamental element of deep face recognition models? If we look to the manner by which the model was trained, overfitting in a traditional sense seems unlikely, as early stopping was employed, and results on held-out test identities demonstrate strong generalization. Perhaps there is a generalization gap in performance between in-domain and out-of-domain identities that is not apparent in current validation protocols, and increased regularization can mitigate this gap. Further testing on different training datasets and model architectures will be necessary to gather more evidence to answer this question.

We did not analyze the effect of skin type on face recognition model performance in this study, as skin type annotations were not available to us at the time. However, two considerations were made to attempt to control for effects of skin type in these results. First, the selection of 1000 identities for each probe set is far larger than what is used in the standard protocol of MegaFace Challenge 1, where 80 identities are sampled from FaceScrub. Having a larger sample size helps to control for identities who may have either superior or poor performance due to possible model bias. In addition, the approach of randomly sampling the in-domain and out-of-domain probe sets ensures both contain a similar distribution of identities with respect to skin type, under the assumption that the identities common to MS1M-RetinaFace and VGGFace2 and the identities distinct to VGGFace2 follow the same distribution of skin type. As both MS1M-RetinaFace and VGGFace2 use the popularity of celebrities online to construct identity lists, this assumption seems to be reasonable. Having said this, the role of skin type in the performance of the model is a very important relationship to study, and this is planned for future work. Fitzpatrick skin type (Fitzpatrick, 1988) annotations will need to be collected for all individuals in VGGFace2 such that sampling can be done to ensure even representation in probe sets across gender and skin type, and to determine intersectional accuracy.

The results of this study are quite concerning from a privacy and informed consent perspective. As described in the background section on Face Recognition Training Datasets, there does not exist a major open-source dataset that gathers informed consent from the individuals it contains. Without these individuals’ knowledge or permission, the systems trained on their identities have a greater ability to identify them. As face recognition becomes more powerful and ubiquitous, the potential for misuse becomes greater. While MS-Celeb-1M contains only “celebrity” identities, this classification of an individual should not negate informed consent in the development of powerful surveillance technologies. Face recognition systems are unique among biometrics as the face can be easily captured at a distance without one’s knowledge. The face uniquely identifies an individual, and it is difficult to opt out of these systems without wearing a mask or other means of obfuscation, drawing undue attention to oneself. From a legal perspective, the concept of informed consent in the analysis of images of individuals’ faces has traction in some jurisdictions. As reported by the New York Times with reference to potential financial liabilities of MegaFace (Hill and Krolik, 2019), the Illinois Biometric Information Privacy Act (Assembly, 2008) is a state law enacted in 2008 that gives Illinois residents the right to seek financial compensation from entities using their face scans without their informed consent.

The experiments in this work aim to simulate a real-world testing environment of a state-of-the-art face recognition system, with a gallery of more than one million images. These findings, therefore, may hold for systems that are currently deployed in the real world.

6. Conclusion

In this work we present the first study to investigate the role of inclusion in face recognition training data on a derived system’s ability to identify an individual. Through the construction of two sets of probe data, one that overlaps with and one that is distinct from the training data of a state-of-the-art system, we conduct a large-scale face identification experiment. We find a modest improvement of 4 percentage points in face identification accuracy for individuals who are present in training data, which is highly problematic given the norm in the field is to not gather informed consent in the collection of training datasets. Future work will apply this methodology to more models, training datasets and distance metrics (e.g., cosine distance) to see if results are consistent. Following prior work (Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019; Cook et al., 2019), analysis of face recognition model bias with respect to gender, skin type and their intersections in large-scale face identification tasks is needed, as well as tying results to representational bias in training data. Additionally, the relationship between the number of images of an individual in training data and their ability to be identified is an interesting area of study. Finally, analysis of a face recognition model’s feature space directly provides an alternative to a task-based auditing approach, and may be fruitful for understanding nuances of inter- and intra-class differences.
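One point worth keeping in mind when comparing distance metrics: if features are L2-normalized, squared Euclidean distance and cosine distance are related by ||a - b||^2 = 2(1 - cos(a, b)), so both produce the same ranking. The sketch below checks this on random vectors. Whether the features in this study were normalized before ranking is not stated, so this is an illustration of the general identity rather than a claim about the experiments; all names and the toy dimensions are assumptions.

```python
import math
import random

def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def ranking(probe, gallery, distance):
    """Gallery indices ordered from most to least similar to the probe."""
    return sorted(range(len(gallery)), key=lambda i: distance(probe, gallery[i]))

rng = random.Random(0)
gallery = [unit([rng.gauss(0, 1) for _ in range(8)]) for _ in range(20)]
probe = unit([rng.gauss(0, 1) for _ in range(8)])
same_order = (ranking(probe, gallery, math.dist)
              == ranking(probe, gallery, cosine_distance))
print("identical rankings:", same_order)
```

For unnormalized features the two metrics can rank gallery images differently, which is the case in which repeating the experiment with cosine distance would be most informative.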

We would like to thank the Natural Sciences and Engineering Research Council of Canada and the Canada Research Chairs Program for their support.


  • ACLU (2018) ACLU calls for moratorium on law and immigration enforcement use of facial recognition. Cited by: §1.
  • Illinois General Assembly (2008) 740 ILCS 14 / Biometric Information Privacy Act. Cited by: §5.
  • A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan, and R. Chellappa (2017) UMDFaces: an annotated face dataset for training deep networks. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 464–473. Cited by: Table 1.
  • O. Bowcott (2018) Police face legal action over use of facial recognition cameras. The Guardian. Cited by: §1.
  • J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91. Cited by: §1, §6.
  • Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) VGGFace2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74. Cited by: Table 1, §4.2.1.
  • S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–546. Cited by: §3.2.
  • K. Conger, R. Fausset, and S. F. Kovaleski (2019) San Francisco bans facial recognition technology. The New York Times. Cited by: §1.
  • C. M. Cook, J. J. Howard, Y. B. Sirotin, J. L. Tipton, and A. R. Vemury (2019) Demographic effects in facial recognition and their dependence on image acquisition: an evaluation of eleven commercial systems. IEEE Transactions on Biometrics, Behavior, and Identity Science 1 (1), pp. 32–41. Cited by: §1, §6.
  • J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019a) ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §3.2, §4.1.1, §4.1.2.
  • J. Deng, J. Guo, D. Zhang, Y. Deng, X. Lu, and S. Shi (2019b) Lightweight face recognition challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §4.1.1.
  • J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019c) RetinaFace: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641. Cited by: §4.1.1.
  • T. B. Fitzpatrick (1988) The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology 124 (6), pp. 869–871. Cited by: §5.
  • C. Garvie (2016) The perpetual line-up: unregulated police face recognition in America. Georgetown Law, Center on Privacy & Technology. Cited by: §2.1.
  • Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Cited by: §1, Table 1, §4.1.1.
  • A. Harvey and J. LaPlace (2019) External Links: Link Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.2.
  • K. Hill and A. Krolik (2019) How photos of your kids are powering surveillance technology. The New York Times. Cited by: §3.3, §5.
  • G. B. Huang, H. Lee, and E. Learned-Miller (2012) Learning hierarchical representations for face verification with convolutional deep belief networks. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2518–2525. Cited by: §3.2.
  • G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §1.
  • C. Jin, R. Jin, K. Chen, and Y. Dou (2018) A community detection approach to cleaning extremely large face database. Computational Intelligence and Neuroscience 2018. Cited by: §4.1.1.
  • I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard (2016) The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882. Cited by: §4.2.2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §3.2.
  • E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua (2016) Labeled faces in the wild: a survey. In Advances in Face Detection and Facial Image Analysis, pp. 189–248. Cited by: §3.1.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220. Cited by: §3.2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). Cited by: Table 1.
  • R. Merkley (2019) Use and fair use: statement on shared images in facial recognition AI. Creative Commons. Cited by: §3.3.
  • M. Merler, N. Ratha, R. S. Feris, and J. R. Smith (2019) Diversity in faces. arXiv preprint arXiv:1901.10436. Cited by: §1, §5.
  • P. Mozur (2019) In Hong Kong protests, faces become weapons. The New York Times. Cited by: §1.
  • M. Murgia (2019) Microsoft quietly deletes largest public face recognition data set. Financial Times. Cited by: §2.2.
  • NAACP (2018) Criminal justice fact sheet. Cited by: §2.1.
  • A. Nech and I. Kemelmacher-Shlizerman (2017) Level playing field for million scale face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, Table 1.
  • H. Ng and S. Winkler (2014) A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 343–347. Cited by: §4.2.3.
  • O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), G. K. L. Tam (Ed.), pp. 41.1–41.12. ISBN 1-901725-53-7. Cited by: Table 1.
  • I. D. Raji and J. Buolamwini (2019) Actionable auditing: investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pp. 429–435. Cited by: §1, §6.
  • S. Ravani (2019) Oakland bans use of facial recognition technology, citing bias concerns. San Francisco Chronicle. Cited by: §1.
  • E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision Workshop on Benchmarking Multi-Target Tracking. Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §3.2.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §3.2, §3.3.
  • O. Solon (2019) Facial recognition’s ‘dirty little secret’: millions of online photos scraped without consent. Cited by: §1.
  • R. Stewart, M. Andriluka, and A. Y. Ng (2016) End-to-end people detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2325–2333. Cited by: §1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §3.2.
  • Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708. Cited by: §3.2.
  • Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2015) Web-scale training for face identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2746–2754. Cited by: §3.3.
  • B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li (2015) YFCC100M: the new data in multimedia research. arXiv preprint arXiv:1503.01817. Cited by: §3.3.
  • H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: §3.2.
  • M. Wang and W. Deng (2018) Deep face recognition: a survey. arXiv preprint arXiv:1804.06655. Cited by: §3.2.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Cited by: §3.2.
  • M. Whittaker, K. Crawford, R. Dobbe, G. Fried, E. Kaziunas, V. Mathur, S. Myers West, R. Richardson, J. Schultz, and O. Schwartz (2018) AI Now report 2018. Cited by: §1.
  • L. Wolf, T. Hassner, and I. Maoz (2011) Face recognition in unconstrained videos with matched background similarity. IEEE. Cited by: §3.2.
  • S. Wu (2019) Somerville city council passes facial recognition ban. The Boston Globe. Cited by: §1.
  • D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923. Cited by: Table 1.
  • E. Yu (2019) Hong Kong court reinstates mask ban before citywide election. The New York Times. Cited by: §1.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Cited by: §3.2.
  • K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §4.2.1.