Artificial Intelligence has made remarkable progress in the past decade. Numerous AI-based products have already become prevalent in the market, ranging from robotic surgical assistants to self-driving vehicles. The accuracy of AI systems has surpassed human capability in challenging tasks, such as face recognition (Taigman et al., 2014), lung cancer screening (Ardila et al., 2019) and pigmented skin lesion diagnosis (Tschandl et al., 2019). These practical applications of AI systems have prompted attention and support from industry, academia, and government.
While AI technologies have contributed to increased work productivity and efficiency, numerous reports have also documented algorithmic bias and discrimination caused by data-driven decision making in AI systems. For example, COMPAS, an automated risk assessment tool used in criminal justice (Brennan et al., 2009), was reported to be biased against Black defendants, assigning them higher risk scores than White defendants (Angwin et al., 2019). Another recent study reports racial and gender bias in computer vision APIs for facial image analysis, which were shown to be less accurate on certain racial or gender groups (Buolamwini and Gebru, 2018).
How can biased machine learning and computer vision models impact our society? Consider the following example. Suppose an online search engine, such as Google, tries to compile a list of medical clinic websites and sort them by relevance. This list may be presented to users as a search result or as advertising content. The search algorithm will use website content to determine and rank relevance, and any visual content, such as portraits of doctors, may be used as a feature in the pipeline. If the system relies on a biased computer vision model in this pipeline, the overall search results may inherit the same biases and eventually affect users' decision making. Scholars have discussed and found biases in online media, such as skewed search results (Goldman, 2008) or gender differences in STEM career ads (Lambrecht and Tucker, 2019), yet little is known about the mechanisms or origins of such biases.
While previous reports have shown that popular computer vision and machine learning models contain biases and exhibit disparate accuracies on different subpopulations, it is still difficult to identify the true causes of these biases. This is because one cannot know to which variable or factor the model responds. If we wish to verify whether a model indeed discriminates against a sensitive variable, e.g., gender, we need to isolate the factor of gender and intervene on its value for counterfactual analysis (Hardt et al., 2016).
The objective of our paper is to adopt an encoder-decoder architecture for facial attribute manipulation (Lample et al., 2017) and generate counterfactual images which vary along the dimensions of sensitive attributes: gender and race. These synthesized examples are then used to measure counterfactual fairness of black-box image classifiers offered by commercial providers. Figure 1 shows the overall process of our approach. Given an input image, we detect a face and generate a series of novel images by manipulating the target sensitive attributes while maintaining other attributes. We summarize our main contributions as follows.
We propose to use an encoder-decoder network (Lample et al., 2017) to generate novel face images, which allows counterfactual interventions. Unlike previous methods (Denton et al., 2019), our method explicitly isolates the factors for sensitive attributes, which is critical in identifying the true causes of model biases.
We construct a novel image dataset which consists of 64,500 original images collected from web search and more than 300,000 synthesized images manipulated from the original images. These images describe people in diverse occupations and can be used for studies on bias measurement or mitigation. Both the code and data will be made publicly available.
Using these new methods and data, we measure the counterfactual fairness of commercial computer vision classifiers and report whether and how these classifiers respond as the sensitive attributes are manipulated by our model.
ML and AI Fairness Fairness in machine learning has recently received much attention as a new criterion for model evaluation (Zemel et al., 2013; Hardt et al., 2016; Zafar et al., 2017; Kilbertus et al., 2017; Kusner et al., 2017). While the quality of a machine learning model has traditionally been assessed by its overall performance such as average classification accuracy measured from the entire dataset, the new fairness measures focus on the consistency of model behavior across distinct data segments or the detection of spurious correlations between target variables (e.g., loan approval) and protected attributes (e.g., race or gender).
The existing literature identifies a number of definitions and measures for ML/AI fairness (Corbett-Davies and Goel, 2018), including fairness through unawareness (Dwork et al., 2012), disparate treatment and disparate impact (Zafar et al., 2017), accuracy disparity (Buolamwini and Gebru, 2018), and equality of opportunity (Hardt et al., 2016). These multiple definitions are necessary because different notions of fairness apply to different tasks and contexts.
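As one concrete illustration, a measure like accuracy disparity can be computed by comparing per-group accuracies. The following is a minimal, hypothetical sketch (function and variable names are our own, not from any cited implementation):

```python
import numpy as np

def accuracy_disparity(y_true, y_pred, groups):
    """Gap between the best- and worst-served groups' accuracies.
    A minimal sketch of one fairness measure; names are illustrative."""
    accs = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        accs[g] = np.mean([y_true[i] == y_pred[i] for i in idx])
    return max(accs.values()) - min(accs.values())

# Group A is classified perfectly, group B only half the time.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
groups = ["A", "A", "B", "B"]
print(accuracy_disparity(y_true, y_pred, groups))  # 0.5
```

A perfectly consistent classifier would score 0 on this measure regardless of its overall accuracy, which is exactly why such group-wise measures complement average accuracy.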
A common difficulty in measuring fairness is that it is challenging to identify or differentiate the true causes of discriminatory model behavior, because the input data is built upon a combination of many factors. Consequently, it is difficult to conclude that variations in model outputs are solely caused by the sensitive or protected attributes. To overcome this limitation, Kusner et al. (2017) proposed the notion of counterfactual fairness based on causal inference. Here, a model, or predictor, is counterfactually fair as long as it produces the same output for any input whose value for the sensitive attribute is modified by an intervention but which is otherwise identical. Similar to (Kusner et al., 2017), our framework uses counterfactual fairness to measure whether the prediction of the model differs by the intervened gender of the input image, while separating out the influences of all the other factors in the background.
Fairness and Bias in Computer Vision Fairness in computer vision is becoming more critical as many systems are adopted in real-world applications. For example, face recognition systems such as Amazon's Rekognition are being used by law enforcement to identify criminal suspects (Harwell, 2019). If the system produces biased results (e.g., a higher false alarm rate for Black suspects), it may lead to a disproportionate arrest rate for certain demographic groups. To address this issue, scholars have attempted to identify biased representations of gender and race in public image datasets and computer vision models (Hendricks et al., 2018; Manjunatha et al., 2019; Kärkkäinen and Joo, 2019; McDuff et al., 2019). Buolamwini and Gebru (2018) have shown that commercial gender classification APIs are biased and perform least accurately on photographs of dark-skinned women. Kyriakou et al. (2019) also reported that image classification APIs may produce different results on faces of different genders and races. These studies, however, used existing images without interventions, and thus it is difficult to identify whether the classifiers responded to the sensitive attributes or to other visual cues. Kyriakou et al. (2019) used headshots of people with a clean white background, but this hinders the classifiers from producing many comparable tags.
Our paper is most closely related to Denton et al. (2019), who use a generative adversarial network (GAN) (Goodfellow et al., 2014) to generate face images to measure counterfactual fairness. Their framework incorporates a GAN trained on a face image dataset called CelebA (Liu et al., 2015), and generates a series of synthesized samples by modifying the latent code in the embedding space in the direction that would increase the strength of a given attribute (e.g., smile). Our paper differs from this work in the following ways. First, we use a different method to examine the essential concept of counterfactual fairness, by generating samples that separate the signals of the sensitive attributes out from the rest of the image. Second, our research uses the generated data to measure the bias of black-box image classification APIs, whereas Denton et al. (2019) measure the bias of a public dataset (Liu et al., 2015). Using our distinct method and data, we aim to identify the internal biases of models trained on unknown data.
Problem Formulation
The objective of our paper is to measure the counterfactual fairness of a predictor $f(x)$, a function of an image $x$. This predictor is an image classifier that automatically labels the content of input images. Without loss of generality, we consider a binary classifier, $f(x) \in \{0, 1\}$. This function classifies, for example, whether the image displays a doctor or not. We also define a sensitive attribute, $a$: gender or race. Typically, $a$ is a binary variable in the training data, but it can take a continuous value in our experiment since we can manipulate its value without restriction. Following (Hardt et al., 2016), this predictor satisfies counterfactual fairness if $f(x_{a \leftarrow a'}) = f(x_{a \leftarrow a''})$ for all $x$ and any $a'$ and $a''$, where $x_{a \leftarrow a'}$ indicates an intervention on the sensitive attribute, $a$. We now explain how this is achieved by an encoder-decoder network.
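Given synthesized counterfactual images, this definition suggests a direct check: a predictor is counterfactually fair on an input if its output is unchanged across all interventions on the sensitive attribute. A minimal sketch with a hypothetical black-box `predictor` (function names and the tolerance parameter are our own assumptions):

```python
import numpy as np

def is_counterfactually_fair(predictor, counterfactual_images, tol=0.0):
    """True if the predictor's output varies by at most `tol` across a set
    of images that differ only in the intervened sensitive attribute."""
    outputs = np.array([predictor(img) for img in counterfactual_images])
    return float(outputs.max() - outputs.min()) <= tol

# Toy check: a predictor ignoring the intervened region is fair; one that
# reads it is not. (Images here are plain arrays, not real faces.)
base, intervened = np.zeros((2, 2)), np.zeros((2, 2))
intervened[0, 0] = 1.0                      # the "manipulated" cue
ignores_cue = lambda img: float(img[1, 1])  # never looks at the cue
reads_cue = lambda img: float(img[0, 0])    # responds to the cue
print(is_counterfactually_fair(ignores_cue, [base, intervened]))  # True
print(is_counterfactually_fair(reads_cue, [base, intervened]))    # False
```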
The goal of this intervention is to manipulate an input image such that it changes the cues related to the sensitive attribute while retaining all other signals. We consider two sensitive attributes: gender and race. We manipulate facial appearance because the face is the strongest cue for gender and race identification (Moghaddam and Yang, 2002).
Counterfactual Data Synthesis
Before we elaborate on our proposed method for manipulating sensitive attributes, we briefly explain why such a method is necessary to show whether a model achieves counterfactual fairness. For an in-depth introduction to the framework of counterfactual fairness, we refer the reader to Kusner et al. (2017).
Many studies have reported skewed classification accuracy of existing computer vision models and APIs between gender and racial groups (Buolamwini and Gebru, 2018; Kyriakou et al., 2019; Kärkkäinen and Joo, 2019; Zhao et al., 2017). However, these findings are based on a comparative analysis, which directly compares the classifier outputs between male and female images (or White and non-White) in a given dataset. The limitation of this method is that it is difficult to identify the true sources of biased model outputs due to hidden confounding factors. Even if one can empirically show differences between gender groups, such differences may have been caused by non-gender cues such as hairstyle or image backgrounds (see (Muthukumar et al., 2018), for example). Since there exists an infinite number of possible confounding factors, it is very difficult to control for all of them.
Consequently, recent work on bias measurement and mitigation has adopted generative models which can synthesize or manipulate text or image data (Denton et al., 2019; Zmigrod et al., 2019). These methods generate hypothetical data in which only the sensitive attributes are switched. Such data can be used not only to measure counterfactual fairness but also to augment samples in existing biased datasets.
Face Attribute Synthesis
From the existing methods available for face attribute manipulation (Yan et al., 2016; Bao et al., 2017; He et al., 2019), we chose FaderNetwork (Lample et al., 2017) as our base model. FaderNetwork is a computationally efficient model that produces plausible results, but we made a few changes to make it more suitable for our study.
Figure 2 illustrates the flows of our model and (Denton et al., 2019). The model used in (Denton et al., 2019) is based on a GAN that is trained without using any attribute labels. As in standard GANs, this model learns the latent code space from the training set. This space encodes various information such as gender, age, race, and any other cues necessary for generating a facial image. These factors are all entangled in the space, and thus it is hard to control only the sensitive attribute, which is required for the purpose of counterfactual fairness measurement. In contrast, FaderNetwork directly observes and exploits the sensitive attributes in training and makes its latent space invariant to them.
Specifically, FaderNetwork is based on an encoder-decoder network with two special properties. First, it separates the sensitive attribute, $a$, from its encoder output, $E(x)$, and both are fed into the decoder such that it can reconstruct the original image, i.e., $D(E(x), a) \approx x$. Second, it makes $E(x)$ invariant to $a$ by using adversarial training, such that the discriminator cannot predict the correct value of $a$ given $E(x)$. At test time, an arbitrary value for $a$ can be given to obtain an image with a modified attribute value.
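The decoder's attribute "slider" can be illustrated with toy linear maps. This sketches only the $D(E(x), a)$ interface; the shapes and random weights are arbitrary illustrations, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 16))   # toy encoder: 16-d "image" -> 8-d latent z
W_dec = rng.normal(size=(16, 9))   # toy decoder: [z, a] -> 16-d "image"

def encode(x):
    """E(x): latent code (adversarial training would strip info about a)."""
    return W_enc @ x

def decode(z, a):
    """D(z, a): generate an image from the latent code and any attribute
    value a, including continuous values never seen in training."""
    return W_dec @ np.concatenate([z, [a]])

x = rng.normal(size=16)
z = encode(x)
# Sweep the attribute slider while z (all other content) stays fixed.
pos, neutral, neg = decode(z, 1.0), decode(z, 0.0), decode(z, -1.0)
# In this linear toy, equal steps in a produce equal changes in the output.
assert np.allclose(pos - neutral, neutral - neg)
```

The key design point mirrored here is that $a$ enters only at the decoder, so sweeping it at test time changes the attribute-dependent component while the latent content $z$ stays fixed.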
Since we want to minimize changes to dimensions other than the sensitive attributes, we added two additional steps. First, we segment the facial skin region from an input face using (Yu et al., 2018) (https://github.com/zllrunning/face-parsing.PyTorch) and only retain changes within that region. This prevents the model from affecting the background or hair regions. Second, we control for the effects of other attributes (e.g., smiling or young) which may be correlated with the main sensitive attribute, such that their values remain intact while the main attribute is being manipulated. This is achieved by modeling these attributes alongside the main sensitive attribute during training and fixing their values at test time. This step may look unnecessary because the model is expected to separate out all gender (or other sensitive attribute) related information. However, it is important to note that the dataset used to train our model may itself contain biases, and it is hard to guarantee that its sensitive attributes are not correlated with other attributes. By enforcing the model to produce fixed outputs for these variables, we can explicitly control for them (similar ideas have been used in recent work on attribute manipulation (He et al., 2019)). Figure 3 shows the comparison between our model and the original FaderNetwork. This approach allows our model to minimize changes in dimensions other than the main attribute being manipulated. Figure 4 shows randomly chosen results from our method.
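The first step, retaining changes only within the facial-skin region, amounts to mask-restricted compositing. A minimal sketch with grayscale arrays (in practice the mask would come from a face parser such as the BiSeNet model cited above; names here are illustrative):

```python
import numpy as np

def composite_face(original, manipulated, skin_mask):
    """Paste manipulated pixels back into the original image, but only
    where the skin mask is set; background and hair stay untouched."""
    out = original.copy()
    m = skin_mask.astype(bool)
    out[m] = manipulated[m]
    return out

original = np.zeros((3, 3))      # stand-in for the original image
manipulated = np.ones((3, 3))    # stand-in for the manipulated face
skin_mask = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]])
result = composite_face(original, manipulated, skin_mask)
print(result[1, 1], result[0, 0])  # 1.0 0.0 -> face changed, corner kept
```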
Computer Vision APIs
We measured the counterfactual fairness of commercial computer vision APIs which provide label classification for a large number of visual concepts: Google Vision API, Amazon Rekognition, IBM Watson Visual Recognition, and Clarifai. These APIs are widely used in commercial products as well as academic research (Xi et al., 2019). While public computer vision datasets usually focus on general concepts (e.g., 60 common object categories in MS COCO (Lin et al., 2014)), these services generate very specific and detailed labels for thousands of distinct concepts. While undoubtedly useful, these APIs have not been fully verified for fairness. They may be more likely to generate "positive" labels for people in certain demographic groups. Such labels may include highly paid and competitive occupations such as "doctor" or "engineer", or personal traits such as "leadership" or "attractive". We measure the sensitivity of these APIs using counterfactual samples generated by our models.
We constructed the baseline data used to synthesize samples. We are especially interested in the effects of gender and race changes on the profession-related labels provided by the APIs, and thus collected a new dataset of images related to various professions. We first obtained a list of 129 job titles from the Bureau of Labor Statistics (BLS) website and used Google Image search to download images. Many keywords resulted in biased search results in terms of the gender and race ratio. To obtain more diverse images, we additionally combined each job title with six demographic keywords (male, female, African American, Asian, Caucasian, and Hispanic). This resulted in around 250 images per keyword. We disregarded images without any face.
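The query construction described above can be sketched as a simple cross product of job titles and demographic modifiers (the titles below are a small illustrative subset of the 129):

```python
from itertools import product

job_titles = ["nurse", "software developer"]   # subset; the paper uses 129
modifiers = ["", "male", "female", "African American",
             "Asian", "Caucasian", "Hispanic"]

# One Google Image query per (modifier, title) pair; "" keeps the bare title.
queries = [f"{m} {t}".strip() for t, m in product(job_titles, modifiers)]
print(len(queries))  # 2 titles x 7 variants = 14
```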
We also needed datasets for training our model. For the gender manipulation model, we used CelebA (Liu et al., 2015), a very popular face attribute dataset with 40 labels annotated for each face. This dataset mostly contains faces of White people, and thus is not suitable for the race manipulation model. There is no publicly available dataset with a sufficiently large number of African Americans. Instead, we obtained a list of celebrity names for each gender and each ethnicity from an online website, FamousFix. We then used Google Image search to download up to 30 images for each celebrity. We estimated the gender and race of each face with a model trained on a public dataset (Kärkkäinen and Joo, 2019) and manually verified examples with low confidence. Finally, this dataset was combined with CelebA to train the race manipulation model.
After training, the two models (gender and race) were applied to the profession dataset to generate a series of manipulated images for each input image. If multiple faces were detected in an image, we only manipulated the face closest to the center of the image. These faces were pasted back into the original image, only within the facial skin region, and passed to each of the four APIs we tested. All the APIs provide both the presence of each label (binary) and a continuous classification confidence if the concept is present in the image. Figure 4 shows example images manipulated in gender and race.
The sensitivity of a classifier with respect to changes in the gender or race cues of images is measured as a slope estimated from the assigned attribute value, $a$, and the model output, $f(x_a)$, where $x_a$ is a synthesized image with its attribute manipulated to the value $a$. The range of $a$ is symmetric around the center, i.e., the gender-neutral face at $a = 0$; the inner part of this range is observed in training, and values beyond it extrapolate images beyond the training set. In practice, this still results in natural and plausible samples. From this range, we sampled 7 evenly spaced images for gender manipulation and 5 images for race manipulation. (We reduced the number from 7 to 5 as this was more cost effective and sufficient to discover the correlation between the attributes and output labels.) Let us denote by $x^{(i)}$ the $i$-th input image and by $\{x^{(i)}_{a}\}$ the set of its synthesized images. For each label, we thus obtain one classifier score per synthesized image. From the entire image set, we obtain a normalized classifier output vector: that is, we normalize each label's score vector such that its L1 norm is always 1, to allow comparisons across concepts. The slope of this normalized output against $a$ determines the sensitivity of the classifier to $a$, and its sign indicates the direction.
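The slope computation can be sketched as an ordinary least-squares fit of the normalized scores against the attribute values (the endpoints of the attribute range and the sum-to-one normalization in this sketch are illustrative assumptions):

```python
import numpy as np

def label_sensitivity(a_values, scores):
    """Slope of one label's (normalized) classifier score against the
    manipulated attribute value a; the sign gives the bias direction."""
    s = np.asarray(scores, dtype=float)
    s = s / s.sum()                    # normalize for cross-concept comparison
    slope, _ = np.polyfit(np.asarray(a_values, dtype=float), s, 1)
    return slope

# 7 evenly spaced attribute values with the gender-neutral face at a = 0
# (the exact endpoints are illustrative).
a_vals = np.linspace(-1.0, 1.0, 7)
scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # e.g. a label that rises with a
print(label_sensitivity(a_vals, scores) > 0)  # True: positively sensitive
```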
Tables 1 and 2 show the labels returned by each API that were more frequently activated for images manipulated to be closer to women and to men, respectively. Not surprisingly, we found that the models behave in a way closely related to the actual gender gap in many occupations, such as nurses or scientists (see also Figure 5). One can imagine that this bias was induced at least in part by the bias in online media and the web, from which the commercial models have been trained. Tables 3 and 4 show skewed gender and race representations in our main dataset of people's occupations. Indeed, many occupations such as nurse or engineer exhibit a very sharp gender contrast, and this may explain the behavior of the image classifiers. Figure 6 shows example images and their label prediction scores. (The APIs output a binary decision and a prediction confidence for each label. Our analysis is based on the binary values (true or false); we found that using confidence scores makes little difference in the final results.)
Similarly, Tables 5 and 6 show the labels that are most sensitive to the race manipulation. The tables show all the dimensions significantly correlated with the model output, except plain concepts such as "Face" or "Red color". We found that the APIs are in general less sensitive to race changes than to gender changes.
| API | Label | Slope |
|---|---|---|
| IBM | Secretary of State | -.107 |
| IBM | Secretary of the Int. | .114 |
| IBM | President of the U.S. | -.367 |

| Occupation | Female % | Occupation | Male % |
|---|---|---|---|
| nutritionist | .921 | pest control worker | .971 |
| hair stylist | .884 | logging worker | .950 |
| dental assistant | .835 | chief executive officer | .917 |
| merchandise displayer | .821 | lawn service worker | .909 |
| fashion designer | .775 | sales engineer | .889 |
| occupational therapy asst. | .772 | construction worker | .887 |
| travel agent | .734 | music director | .868 |
| medical transcriptionist | .732 | software developer | .857 |
| preschool teacher | .730 | golf player | .855 |

| Occupation | White % | Occupation | White % |
|---|---|---|---|
| construction inspector | .847 | software developer | .429 |
| boiler installer | .836 | medical assistant | .459 |
| baker | .818 | computer network architect | .500 |
AI fairness is an increasingly important criterion to evaluate models and systems. In real world applications, especially for private models whose training processes or data are unknown, it is difficult to identify their biased behaviors or to understand the underlying causes. We introduced a novel method based on facial attribute manipulation by an encoder-decoder network to synthesize counterfactual samples, which can help isolate the effects of the main sensitive variables on the model outcomes. Using this methodology, we were able to identify hidden biases of commercial computer vision APIs on gender and race. These biases, likely caused by the skewed representation in online media, should be adequately addressed in order to make these services more reliable and trustworthy.
This work was supported by the National Science Foundation SMA-1831848.
- Machine bias. ProPublica.
- End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine 25 (6), pp. 954.
- CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754.
- Evaluating the predictive validity of the COMPAS risk and needs assessment system. Criminal Justice and Behavior 36 (1), pp. 21–40.
- Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91.
- The measure and mismeasure of fairness: a critical review of fair machine learning. arXiv preprint arXiv:1808.00023.
- Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439.
- Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
- Search engine bias and the demise of search engine utopianism. In Web Search, pp. 121–133.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.
- Oregon became a testing ground for Amazon's facial-recognition policing. But what if Rekognition gets it wrong? Washington Post.
- AttGAN: facial attribute editing by only changing what you want. IEEE Transactions on Image Processing 28 (11), pp. 5464–5478.
- Women also snowboard: overcoming bias in captioning models. In European Conference on Computer Vision, pp. 793–811.
- FairFace: face attribute dataset for balanced race, gender, and age. arXiv preprint arXiv:1908.04913.
- Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pp. 656–666.
- Counterfactual fairness. In Advances in Neural Information Processing Systems, pp. 4066–4076.
- Fairness in proprietary image tagging algorithms: a cross-platform audit on people images. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13, pp. 313–322.
- Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Management Science 65 (7), pp. 2966–2981.
- Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pp. 5967–5976.
- Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738.
- Explicit bias discovery in visual question answering models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9562–9571.
- Characterizing bias in classifiers using generative models. arXiv preprint arXiv:1906.11891.
- Learning gender with support faces. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5), pp. 707–711.
- Understanding unequal gender classification accuracy from face images. arXiv preprint arXiv:1812.00099.
- DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
- Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. The Lancet Oncology.
- Understanding the political ideology of legislators from social media images. arXiv preprint arXiv:1907.09594.
- Attribute2Image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791.
- BiSeNet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341.
- Fairness constraints: mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962–970.
- Learning fair representations. In International Conference on Machine Learning, pp. 325–333.
- Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2979–2989.
- Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. arXiv preprint arXiv:1906.04571.