Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation

by   Zeyu Wang, et al.
Princeton University

Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.





1 Introduction

Computer vision models learn to perform a task by capturing relevant statistics from training data. These statistics range from low-level information about color or composition (zebras are black-and-white, chairs have legs) to contextual or societal cues (basketball players often wear jerseys, programmers are often male). Capturing these statistical correlations is helpful for the task at hand: chairs without legs are rare and programmers who are not male are rare, so capturing these dominant features will yield high accuracy on the target task of recognizing chairs or programmers. However, as computer vision systems are deployed at scale and in a variety of settings, especially where the initial training data and the final end task may be mismatched, it becomes increasingly important to both identify and develop strategies for manipulating the information learned by the model.

Societal Context. To motivate the work of this paper, consider one such example of social bias propagation: AI models that have learned to correlate activities with gender [4, 7, 46, 2]. Some real-world activities are more commonly performed by women and others by men. This real-world gender distribution skew becomes part of the data that trains models to recognize or reason about these activities (see Footnote 1). Naturally, these models then learn discriminative cues which include the gender of the actors. In fact, the gender correlation may even become amplified in the model, as Zhao et al. [46] demonstrate. We refer the reader to, e.g., [29] for a deeper look at these issues and their impact.

Footnote 1: Buolamwini and Gebru [6] note that collecting a more representative training dataset should be the first step of the solution. That is true in the cases they consider (where people with darker skin tones are dramatically and unreasonably undersampled in datasets) but may not be a viable approach to cases where the datasets accurately reflect the real-world skew.

Study Objectives and Contributions. In this work, we set out to provide an in-depth look at this problem of training visual classifiers in the presence of spurious correlations. We are inspired by prior work on machine learning fairness [45, 46, 36, 1] and aim to build a unified understanding of the proposed techniques. Code is available at

We begin by proposing a simple but surprisingly effective benchmark for studying the effect of data bias on visual recognition tasks. Classical literature on mitigating bias generally operates on simpler (often linear) models [10, 44], which are easier to understand and control; only recently have researchers begun looking at mitigating bias in end-to-end trained deep learning models [15, 2, 35, 16, 42]. Our work helps bridge the gap, proposing an avenue for exploring bias mitigation in Convolutional Neural Network (CNN) models within a simpler and easier-to-analyze setting than a fully-fledged black-box system. By utilizing dataset augmentation to introduce controlled biases, we provide simple and precise targets for model evaluation (Sec. 3).


Using this benchmark, we demonstrate that the presence of spurious bias in the training data severely degrades the accuracy of current models, even when the biased dataset contains strictly more information than an unbiased dataset. We then provide a thorough comparison of existing methods for bias mitigation, including domain adversarial training [41, 36, 1], Reducing Bias Amplification [46], and domain conditional training similar to [35]. To the best of our knowledge, no such comparison exists currently as these methods have been evaluated on different benchmarks under varying conditions and have not been compared directly. We conclude that a domain-independent approach inspired by [10] outperforms more complex competitors (Sec. 4).

Finally, we validate our findings in more realistic settings. We evaluate on the CelebA [28] benchmark for attribute recognition in the presence of gender bias (Sec. 5). We demonstrate that our domain-independent training model successfully mitigates real-world gender bias.

2 Related Work

Mitigating Spurious Correlation. Recent work on the effects of human bias on machine learning models investigates two challenging problems: identifying and quantifying bias in datasets, and mitigating its harmful effects. In relation to the former, [5, 27] study the effect of class-imbalance on learning, while [46] reveal the surprising phenomenon of bias amplification. Additionally, recent works have shown that ML models possess bias towards legally protected classes [26, 6, 4, 7]. Our work complements these by presenting a dataset that allows us to isolate and control bias precisely, alleviating the usual difficulties of quantifying bias.

On the bias mitigation side, early works investigate techniques for simpler linear models [22, 44]. Our constructed dataset allows us to isolate bias while not simplifying our architecture. More recently, works have begun looking at more sophisticated models. For example, [46] propose an inference update scheme to match a target distribution, which can remove bias. [35] introduce InclusiveFaceNet for improved attribute detection across gender and race subgroups; our discriminative architecture is inspired by this work. Conversely, [11] propose a scheme for decoupling classifiers, which we use to create our domain independent architecture. The last relevant approach to bias mitigation for us is adversarial mitigation [1, 45, 12, 15]. Our work uses our novel dataset to explicitly highlight the drawbacks, and offers a comparison between these mitigation strategies that would be impossible without access to a bias-controlled environment.

Fairness Criterion. Pinning down an exact and generally applicable notion of fairness is an inherently difficult and important task. Various fairness criteria have been introduced and analyzed, including demographic parity [23, 45], predictive parity [14], error-rate balance [31], equality-of-odds and equality-of-opportunity [17], and fairness-through-unawareness [30] to try to quantify bias. Recent work has shown that such criteria must be selected carefully; [31] prove minimizing error disparity across populations, even under relaxed assumptions, is equivalent to randomized predictions; [18] introduce and explain the limitations of an ‘oblivious’ discrimination criterion through a non-identifiability result; [30] demonstrate that ignoring protected attributes is ineffective due to redundant encoding; [10] show that demographic parity does not ensure fairness. We define our tasks such that test accuracy directly represents model bias.

Surveying Evaluations. We are inspired by previous works that aggregate ideas, methods and findings to provide a unifying survey of a subfield of computer vision [21, 33, 38, 20]. For example, [40] surveys relative dataset biases present in computer vision datasets, including selection bias (datasets favoring certain types of images), capture bias (photographers take similar photos), category bias (inconsistent or imprecise category definitions), and negative set bias (unrepresentative or unbalanced negative instances). We continue this line of work for bias mitigation methods for modern visual recognition systems, introducing a benchmark for evaluation which isolates bias, and showing that our analysis generalizes to other, more complex, biased datasets.

3 A Simple Setting for Studying Bias

We begin by constructing a novel benchmark for studying bias mitigation in visual recognition models. This setting makes it possible to demonstrate that the presence of spurious correlations in training data severely degrades the performance of current models, even if learning such spurious correlations is sub-optimal for the target task.

CIFAR-10S Setup. To do so, we design a benchmark that erroneously correlates target classification decisions (what object category is depicted in the image) with an auxiliary attribute (whether the image is color or grayscale).

We introduce CIFAR-10 Skewed (CIFAR-10S), based on CIFAR-10 [25], a dataset with 50,000 images evenly distributed between 10 object classes. In CIFAR-10S, each of the 10 original classes is subdivided into two new domain subclasses, corresponding to color and grayscale domains within that class. Per class, the 5,000 training images are split 95% to 5% between the two domains; five classes are 95% color and five classes are 95% grayscale. The total number of images allocated to each domain is thus balanced. For testing, we create two copies of the standard CIFAR-10 test set: one in color (Color) and one in grayscale (Gray). These two datasets are considered separately, and only the 10-way classification decision boundary is relevant.
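The split described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the function names, the choice of which five classes stay in color, and the luma weights used for the grayscale conversion are all assumptions made for the example.

```python
import numpy as np

def cifar10s_domain_mask(labels, color_classes, skew=0.95, seed=0):
    """Return a boolean array that is True where an image should become grayscale.

    `color_classes` (hypothetical argument) lists the classes that stay
    `skew` color; every other class becomes `skew` grayscale.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    to_gray = np.zeros(labels.shape[0], dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        gray_frac = 1.0 - skew if int(c) in color_classes else skew
        n_gray = int(round(gray_frac * idx.size))
        to_gray[idx[:n_gray]] = True
    return to_gray

def to_grayscale(images):
    """Convert uint8 RGB images (B, H, W, 3) to 3-channel grayscale.

    The luma weights are an assumption; any fixed linear map would
    preserve the 'linear transformation' property of the benchmark.
    """
    weights = np.array([0.299, 0.587, 0.114])
    gray = images.astype(np.float64) @ weights            # (B, H, W)
    gray = np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=-1)

# Usage: 5,000 images per class, classes 0-4 assumed to be the color-heavy ones.
labels = np.repeat(np.arange(10), 5000)
mask = cifar10s_domain_mask(labels, color_classes={0, 1, 2, 3, 4})
```

With a 95%-5% skew and five classes per side, exactly half of the training images end up in each domain, which matches the balance noted in the text.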

Discussion. We point out upfront that the analogy between color/grayscale and gender domains here wears thin: (1) we consider the two color/grayscale domains as purely binary and disjoint whereas the concept of gender is more fluid; (2) a color/grayscale domain classifier is significantly simpler to construct than a gender recognition model; (3) the transformation between color and grayscale images is linear whereas the manifestation of gender is much more complex.

Nevertheless, we adopt this simple framework to distill down the core algorithmic exploration before diving into the more complex setups in Sec. 5. This formulation has several compelling properties: (1) we can control the correlation synthetically by changing images from color to grayscale, maintaining control over the distribution, (2) we can guarantee that color images contain strictly more information than grayscale images, maintaining control over the discriminative cues in the images, and (3) unlike other datasets, there is no fairness/accuracy trade off since both are complementary. Furthermore, despite its simplicity, this setup still allows us to study the behavior of modern CNN architectures.

Key Issue. We ground the discussion by presenting one key result that is counter-intuitive and illustrates why this very simple setting is reflective of a much deeper problem. We train a standard ResNet-18 [19] architecture with a softmax and cross-entropy loss for 10-way object classification. Training on the skewed CIFAR-10S dataset and testing on Color images yields accuracy (see Footnote 2).

This may seem like a reasonable result until we observe that a model trained on an all-grayscale training set (so never having seen a single color image!) yields a significantly higher accuracy when tested out-of-domain on Color images.

Footnote 2: We report the mean across 5 training runs (except for CelebA in Sec. 5.2). Error bars are 2 standard deviations (95% confidence interval).

This disparity occurs because the model trained on CIFAR-10S learned to correlate the presence of color and the object classes. When faced with an all-color test set, it infers that it is likely that these images come from one of the five classes that were predominantly colored during training (Fig. 1). In a real world bias setting where the two domains correspond to gender and the classification targets correspond to activities, this may manifest itself as the model making overly confident predictions of activities traditionally associated with female roles on images of women [46].

4 Benchmarking Bias Mitigation Methods

Grounded in the task at hand (training recognition models in the presence of spurious correlations), we perform a thorough benchmark evaluation of bias mitigation methods. Many of these techniques have been proposed in the literature for this task; notable exceptions include prior shift inference for bias mitigation (Sec. 4.3), the distinction between discriminative and conditional training in this context (Sec. 4.4), and the different inference methods for conditional training from biased data (Sec. 4.4). Our findings are summarized in Table 1. In Sec. 5 we demonstrate how our findings on CIFAR-10S generalize to real world settings.

Figure 1: Confusion matrix of a ResNet-18 [19] classifier trained on the skewed CIFAR-10S dataset. The model has learned to correlate the presence of color with the five object classes (in bold) and predominantly predicts those classes on the all-color test set.

Model Name   Model   Test Inference   Bias   Accuracy (%): Color   Gray   Mean
Baseline   N-way softmax
Oversampling   N-way softmax, resampled
Adversarial   w/ uniform confusion [1, 41]
Adversarial   w/ reversal, proj. [45]
DomainDiscrim   joint ND-way softmax   RBA [46]
DomainIndepend   N-way classifier per domain
Table 1: Performance comparison of algorithms on CIFAR-10S. All architectures are based on ResNet-18 [19]. We investigate multiple bias mitigation strategies, and demonstrate that a domain-independent classifier outperforms all baselines on this benchmark.

Setup. To perform this analysis, we utilize the CIFAR-10S domain correlation benchmark of Sec. 3. We assume that at training time the domain labels are available (e.g., we know which images are color and which are grayscale in CIFAR-10S, or which images correspond to pictures of men or women in the real-world setting). All experiments in this section build on the ResNet-18 [19] architecture trained on the CIFAR-10S dataset, with $N = 10$ object classes and $D = 2$ domains. The models are trained from scratch on the target data, removing any potential effects from pretraining. Unless otherwise noted the models are trained for epochs, with SGD at a learning rate of with a factor of drop-off every epochs, a weight decay of , and a momentum of 0.9. During training, the image is padded with 4 pixels on each side and then a crop is randomly sampled from the image or its horizontal flip.
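The pad-crop-flip augmentation from the setup can be illustrated as follows. This is a sketch under stated assumptions: the crop is taken to be the same size as the input image, the padding is zero-valued, and the flip is applied with probability 0.5; none of these three details is stated in the text above, and the function name is hypothetical.

```python
import numpy as np

def pad_crop_flip(image, pad=4, rng=None):
    """Randomly flip an (H, W, C) image, zero-pad it, and crop back to (H, W, C)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    if rng.random() < 0.5:                       # assumed flip probability
        image = image[:, ::-1, :]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = int(rng.integers(0, 2 * pad + 1))      # offsets in [0, 2 * pad]
    left = int(rng.integers(0, 2 * pad + 1))
    return padded[top:top + h, left:left + w, :]

# Usage on a dummy 32x32 RGB image.
example = np.zeros((32, 32, 3), dtype=np.uint8)
augmented = pad_crop_flip(example, rng=np.random.default_rng(0))
```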

Evaluation. We consider two metrics: mean per-class per-domain accuracy (primary) and bias amplification of [46]. The test set is fully balanced across domains, so mean accuracy directly correlates with the model’s ability to avoid learning the domain correlation during training. We include the mean bias metric for completeness with the literature, as

$$\frac{1}{|C|} \sum_{c \in C} \frac{\max(Gr_c, Col_c)}{Gr_c + Col_c} - 0.5, \tag{1}$$

where $Gr_c$ is the number of grayscale test set examples predicted to be of class $c$, while $Col_c$ is the same for color.
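Both metrics are straightforward to compute from predictions. The sketch below assumes the mean bias takes the form "average over classes of max(Gr_c, Col_c) / (Gr_c + Col_c), minus 0.5", where Gr_c and Col_c count the grayscale and color test images predicted as class c; that form, and the function names, are assumptions of this example rather than a quotation of the authors' code.

```python
import numpy as np

def mean_per_class_per_domain_accuracy(y_true, y_pred, domain, n_classes, n_domains):
    """Average the accuracy over every (domain, class) cell that has test images."""
    y_true, y_pred, domain = map(np.asarray, (y_true, y_pred, domain))
    accs = []
    for d in range(n_domains):
        for c in range(n_classes):
            sel = (domain == d) & (y_true == c)
            if sel.any():
                accs.append(np.mean(y_pred[sel] == c))
    return float(np.mean(accs))

def mean_bias(pred_gray, pred_color, n_classes):
    """Assumed form: mean_c max(Gr_c, Col_c) / (Gr_c + Col_c) - 0.5."""
    gr = np.bincount(np.asarray(pred_gray), minlength=n_classes).astype(float)
    col = np.bincount(np.asarray(pred_color), minlength=n_classes).astype(float)
    total = gr + col
    valid = total > 0                            # skip classes never predicted
    return float(np.mean(np.maximum(gr, col)[valid] / total[valid]) - 0.5)
```

Under this form, a model whose predictions are split evenly between domains for every class scores 0, and a model that predicts each class in only one domain scores 0.5.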

4.1 Strategic Sampling

The simplest approach is to strategically sample with replacement to make the training data ‘look’ balanced with respect to the class-domain frequencies. That is, we sample rare examples more often during training, or, equivalently, utilize non-uniform misclassification cost [13, 3]. However, as detailed in [43], there are significant drawbacks to oversampling: (1) seeing exact copies of the same example during training makes overfitting likely, (2) oversampling increases the number of training examples without increasing the amount of information, which increases learning time.
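One way to realize such sampling is to draw each training example with probability inversely proportional to the size of its (class, domain) group. The sketch below is an illustrative NumPy version with hypothetical function names; it is not taken from the paper's implementation.

```python
import numpy as np

def balanced_sampling_weights(labels, domains):
    """Per-example sampling probabilities that equalize (class, domain) groups."""
    pairs = np.stack([np.asarray(labels), np.asarray(domains)], axis=1)
    _, inverse, counts = np.unique(
        pairs, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)                # shape differs across NumPy versions
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()

def sample_epoch(labels, domains, n_samples, seed=0):
    """Draw training indices with replacement so rare groups are seen more often."""
    rng = np.random.default_rng(seed)
    p = balanced_sampling_weights(labels, domains)
    return rng.choice(len(p), size=n_samples, replace=True, p=p)

# Usage: one class with a 95/5 domain split.
toy_labels = [0] * 100
toy_domains = [0] * 95 + [1] * 5
toy_p = balanced_sampling_weights(toy_labels, toy_domains)
```

Each of the 5 rare examples is drawn 19 times as often as a common one, which is exactly the repeated-copy behavior that makes overfitting a concern.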

Experimental Evaluation. The baseline model first presented in Sec. 3 is a ResNet-18 CNN with a softmax classification layer, which achieves accuracy. The same model with oversampling improves to accuracy. Both models drive the training loss to zero. Note that data augmentation is critical for this result: without data augmentation the oversampling model achieves only accuracy, overfitting to the data.

4.2 Adversarial Training

Another approach to bias mitigation commonly suggested in the literature is fairness through blindness. That is, if a model does not look at, or specifically encode, information about a protected variable, then it cannot be biased. To this end, adversarial training is set up through the minimax objective: maximize the classifier’s ability to predict the class, while minimizing the adversary’s ability to predict the protected variable based on the underlying learned features.

This intuitive approach, however, has a major drawback. Suppose we aim to have equivalent feature representations across domains. Even if a particular protected attribute does not exist in the feature representation of a classifier, combinations of other attributes can be used as a proxy. This phenomenon is termed redundant encoding in the literature [18, 10]. For an illustrative example, consider a real-world task of a bank evaluating a loan application, irrespective of the applicant’s gender. Suppose that the applicant’s employment history lists ‘nurse’. It can thus, by proxy, be inferred with high probability that the applicant is also a woman. However, employment history is crucial to the evaluation of a loan application, and thus the removal of this redundant encoding will degrade its ability to perform the evaluation.

Experimental Evaluation. We apply adversarial learning to de-bias the object classifier. We consider both the uniform confusion loss of [1] (inspired by [41]), and the loss reversal with gradient projection of [45] (see Footnote 3). The models are trained for epochs using Adam with a learning rate of 3e-4 and weight decay of 1e-4. We hold out 10,000 images to tune the hyperparameters before retraining the network on the entire training set. To verify training efficacy, we train SVM domain classifiers on the learned features: the accuracy is before and after adversarial training. These methods achieve only and accuracy, respectively. As Fig. 2 visually demonstrates, although the adversarial classifier enforces domain confusion, it additionally creates undesirable class confusion.

Footnote 3: We apply the adversarial classifiers on the penultimate layer for the [1, 41] model, and on the final classification layer for [45] as recommended by the authors. We experimented with other combinations of layers and losses, including applying the projection method of [45] onto the confusion loss of [1, 41], and achieved similar results.
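To make the uniform confusion objective concrete, the sketch below contrasts the two losses involved: the adversary's ordinary domain-classification loss, and a confusion loss that pushes the adversary's output towards a uniform distribution over domains. It is a NumPy illustration only; taking the confusion loss to be the cross-entropy against a uniform target is our reading of the general idea, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a (B, D) array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def domain_classification_loss(domain_logits, domains):
    """Loss minimized by the adversary: cross-entropy against true domain labels."""
    logp = log_softmax(domain_logits)
    return float(-logp[np.arange(len(domains)), domains].mean())

def uniform_confusion_loss(domain_logits):
    """Loss minimized by the feature extractor: cross-entropy against a uniform target."""
    logp = log_softmax(domain_logits)
    return float(-logp.mean(axis=1).mean())

# Usage: a fully confused adversary versus a confident one, with D = 2 domains.
confused = uniform_confusion_loss(np.zeros((4, 2)))
confident = uniform_confusion_loss(np.array([[8.0, -8.0]] * 4))
```

The confusion loss reaches its minimum, log D, exactly when the adversary cannot tell the domains apart, which is the "blindness" the method aims for.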

We run one additional experiment to validate the findings. We test whether models encode the domain (color/grayscale) information even when not exposed to a biased training distribution; if so, this would help explain why minimizing this adversarial objective would lead to a worse underlying feature representation and thus reduced classification accuracy. We take the feature representation of a 10-way classifier trained on all color images (so not exposed to color/grayscale skew) and train a linear SVM adversary on this feature representation to predict the color/grayscale domain of a new image. This yields an impressive 82% accuracy; since the ability to discriminate between the two domains emerges naturally even without biased training, it would make sense that requiring that the model not be able to distinguish between the two domains would harm its overall classification ability.

4.3 Domain Discriminative Training

The alternative to fairness through blindness is fairness through awareness [10] where the domain information is first explicitly encoded and then explicitly mitigated. The simplest approach is training an $ND$-way discriminative classifier where $N$ is the number of target classes and $D$ is the number of domains. The correlation between domains and classes can then be removed during inference in one of several ways.

Figure 2: Adversarial training [45] enforces domain confusion but also introduces unwanted class boundary confusion (t-SNE plots). Panels: without adversary and with adversary, each shown by domain and by class.

4.3.1 Prior Shift Inference

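If the outputs of the $ND$-way classifier can be interpreted as probabilities, a test-time domain solution to removing class-domain correlation was introduced in [37] and applied in [32] to visual recognition. Let the classifier output a joint probability $P(y, d \mid x)$ for target class $y$, domain $d$ and image $x$. We can assume that $P_{tr}(x \mid y, d) = P_{te}(x \mid y, d)$, i.e., the distribution of image appearance within a particular class and domain is the same between training and test time. However, $P_{tr}(y, d) \neq P_{te}(y, d)$, i.e., the correlation between target classes and domains may have changed. This suggests that the test-time probability should be computed as:

$$P_{te}(y, d \mid x) \propto P_{te}(x \mid y, d)\, P_{te}(y, d) \tag{2}$$
$$= P_{tr}(x \mid y, d)\, P_{te}(y, d) \tag{3}$$
$$\propto P_{tr}(y, d \mid x)\, \frac{P_{te}(y, d)}{P_{tr}(y, d)} \tag{4}$$

In theory, this requires access to the test label distribution $P_{te}(y, d)$; however, assuming uncorrelated $d$ and $y$ at test time (unbiased $P_{te}(d \mid y)$) and mean per-class accuracy evaluation (uniform $P_{te}(y)$), $P_{te}(y, d) \propto 1$.

Eqn. 4 then simplifies to $P_{te}(y, d \mid x) \propto P_{tr}(y, d \mid x) / P_{tr}(y, d)$, removing the test distribution requirement. With this assumption, the target class predictions can be computed directly as

$$\hat{y} = \arg\max_y \max_d P_{te}(y, d \mid x) \tag{5}$$

or, using the Law of Total Probability,

$$\hat{y} = \arg\max_y \sum_d P_{te}(y, d \mid x) \tag{6}$$
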
Experimental Evaluation. We train an $ND$-way classifier (20-way softmax in our setting) to discriminate between (class, domain) pairs. This discriminative model with inference prior shift towards a uniform test distribution (Eqn. 4) followed by sum of outputs (Eqn. 6) achieves accuracy, significantly outperforming the accuracy of the $N$-way softmax baseline. To quantify the effects of the two steps of inference: taking the highest output predictor rather than summing across domains (Eqn. 5) has no effect on accuracy because the two domains are easily distinguishable in this case; however, summing the outputs without first applying prior shift drops accuracy from to .
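The two inference steps can be sketched as follows, under the uniform test-prior assumption discussed above: the joint softmax output is divided by the training prior over (class, domain) pairs and renormalized, and the class is then chosen either by the single highest output or by summing over domains. The array layout (batch, class, domain) and the function names are assumptions of this example.

```python
import numpy as np

def prior_shift(p_train_joint, train_prior):
    """Reweight P_tr(y, d | x) by 1 / P_tr(y, d) and renormalize.

    p_train_joint: (B, N, D) softmax outputs; train_prior: (N, D) training
    frequencies of each (class, domain) pair.
    """
    shifted = p_train_joint / train_prior[None, :, :]
    return shifted / shifted.sum(axis=(1, 2), keepdims=True)

def predict_max(p_joint):
    """Pick the class of the single highest (class, domain) output."""
    return p_joint.max(axis=2).argmax(axis=1)

def predict_sum(p_joint):
    """Sum over domains, then pick the highest class."""
    return p_joint.sum(axis=2).argmax(axis=1)

# Toy case with N = 2, D = 2 and a 95/5 skew: class 0 is mostly domain 0.
train_prior = np.array([[0.475, 0.025],
                        [0.025, 0.475]])
# A domain-0 image that the biased model leans towards class 0.
p_tr = np.array([[[0.54, 0.01],
                  [0.44, 0.01]]])
before = predict_sum(p_tr)
after = predict_sum(prior_shift(p_tr, train_prior))
```

In this toy case the raw output favors class 0 only because that pairing dominated training; once the prior is divided out, the rare pairing (class 1 in domain 0) wins.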

Finally, we verify that the increase in accuracy is not just the result of the increased number of parameters in the classifier layer. We train an ensemble of baseline models, averaging their softmax predictions: one baseline achieves accuracy, two models achieve , and only an ensemble of five baseline models (with 55.9M trainable parameters) achieve accuracy on par with accuracy of the discriminative model (with 11.2M parameters).

4.3.2 Reducing Bias Amplification

An alternative inference approach is Reducing Bias Amplification (“RBA”) of Zhao et al. [46]. RBA uses corpus-level constraints to ensure inference predictions follow a particular distribution. They propose a Lagrangian relaxation iterative solver since the combinatorial optimization problem is challenging to solve exactly at large scale. This method effectively matches the desired inference distribution and reduces bias; however, the expensive optimization must be run on all test samples before a single inference is possible.

Experimental Evaluation. In the original setting of [46], training and test time biases are equal. However, RBA is flexible enough to optimize for any target distribution. On CIFAR-10S, we thus set the optimization target bias to 0 and the constraint epsilon to . To make the optimization as effective as possible, we substitute in the known test-time domain (because it can be perfectly predicted) so that the optimization only updates the class predictions.

Applying RBA on the scores results in accuracy, a improvement over the simpler inference but an insignificant improvement over of the Baseline model. Interestingly, we also observe that the benefits of RBA optimization are significantly lessened when prior shift is applied beforehand. For example, when using the post-prior shift scores, accuracy only improves negligibly from using inference to using RBA. Therefore, we conclude that applying RBA after prior shift is extraneous. However, the converse is not true as the best accuracy achieved by RBA without prior shift is significantly lower than the accuracy achieved with prior shift inference.

4.4 Domain Independent Training

One concern with the discriminative model is that it learns to distinguish between the class-domain cases; in particular, it explicitly learns the boundary between the same class across different domains (e.g., cat in grayscale versus cat in color, or a woman programming versus a man programming). This may be wasteful, as the $N$-way class decision boundaries may in fact be similar across domains and the additional distinction between the same class in different domains may not be necessary. Furthermore, the model is necessarily penalized in cases where the domain prediction is challenging but the target class prediction is unambiguous.

This suggests training separate classifiers per domain. Doing this naively, however, as an ensemble, will yield poor performance as each model will only see a fraction of the data. We thus consider a shared feature representation with an ensemble of classifiers. This alleviates the data reduction problem for the representation though not for the classifiers.

Given the predictions $P(y \mid d, x)$, multiple inference methods are possible. If the domain $d^*$ is known at test time, $\hat{y} = \arg\max_y P(y \mid d^*, x)$ is reasonable yet entirely ignores the learned class boundaries in the other domains $d \neq d^*$, and may suffer if some classes were poorly represented within $d^*$ during training. If a probabilistic interpretation is possible, then two inference methods are reasonable:

$$\hat{y} = \arg\max_y \max_d P(y \mid d, x) \tag{7}$$
$$\hat{y} = \arg\max_y \sum_d P(y \mid d, x)\, P(d \mid x) \tag{8}$$

However, Eqn. 7 again ignores the learned class boundaries across domains, and Eqn. 8 requires inferring $P(d \mid x)$ (which may either be trivial, as in CIFAR-10S, reducing to a single-domain model, or complicated to learn and implicitly encoding the correlations between $y$ and $d$ that we are trying to avoid). Further, in practice, while the probabilistic interpretation of a single model may be a reasonable approximation, the probabilistic outputs of the multiple independent models are frequently miscalibrated with respect to each other.

A natural option is to instead reason directly on class boundaries of the domains, and perform inference as (see Footnote 4)

$$\hat{y} = \arg\max_y \sum_d s(y, d, x) \tag{9}$$

where $s(y, d, x)$ are the network activations at the classifier layer. For linear classifiers with a shared feature representation this corresponds to averaging the class decision boundaries. We demonstrate that this technique works well in practice across both single and multi-label target classification tasks at removing class-domain correlations.

Footnote 4: Interestingly, under a softmax probabilistic model this inference corresponds to the geometric mean between the per-domain probabilities $P(y \mid d, x)$, which is a stable method for combining independent models with different output ranges.
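A small numerical check may help here. The sketch assumes the inference rule sums the pre-softmax activations of the per-domain classifiers for each class, and verifies on random scores that this gives the same prediction as taking the geometric mean of the per-domain softmax probabilities. The (batch, domain, class) layout and the function names are assumptions of the example.

```python
import numpy as np

def domain_independent_predict(scores):
    """scores: (B, D, N) classifier-layer activations, one N-way head per domain."""
    return scores.sum(axis=1).argmax(axis=1)

def geometric_mean_predict(scores):
    """Argmax of the geometric mean of per-domain softmax probabilities."""
    shifted = scores - scores.max(axis=2, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=2, keepdims=True))
    # The mean of log-probabilities over domains is the log of the geometric mean.
    return log_probs.mean(axis=1).argmax(axis=1)

# Usage: 100 random examples, D = 2 heads, N = 10 classes.
rng = np.random.default_rng(0)
random_scores = rng.normal(size=(100, 2, 10))
pred_sum = domain_independent_predict(random_scores)
pred_geo = geometric_mean_predict(random_scores)
```

The two agree because each head's softmax normalizer is constant across classes, so it drops out of the argmax.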

Experimental Evaluation.

We train a model for performing object classification on the two domains independently. This is implemented as two 10-way independent softmax classifiers sharing the same underlying network. At training time we use knowledge of the image domain to only update one of the classifiers. At test time we apply prior shift to adjust the output probabilities of both classifiers towards a uniform distribution, and consider two inference methods. First, we use only the classifier corresponding to the test domain, yielding a low accuracy as expected because it is not able to integrate information across the two domains (despite requiring specialized knowledge of the image domain). Instead, we combine the decision boundaries following Eqn. 9 and achieve accuracy, significantly outperforming the baseline of .
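The training-time rule (update only the classifier of the image's own domain) amounts to computing the cross-entropy on one head per example. The NumPy sketch below shows only the loss computation, with a hypothetical function name and an assumed (batch, domain, class) layout; in a real training loop the same selection would be done inside an automatic-differentiation framework.

```python
import numpy as np

def domain_independent_loss(scores, labels, domains):
    """Cross-entropy using, for each example, only the head of its own domain.

    scores: (B, D, N) activations; labels: (B,) class ids; domains: (B,) domain ids.
    """
    batch = np.arange(scores.shape[0])
    own = scores[batch, domains]                          # (B, N)
    shifted = own - own.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[batch, labels].mean())

# Usage: perturbing the other domain's head leaves the loss unchanged.
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(4, 2, 3))
toy_y = np.array([0, 1, 2, 1])
toy_d = np.array([0, 1, 0, 1])
loss_a = domain_independent_loss(toy_scores, toy_y, toy_d)
perturbed = toy_scores.copy()
perturbed[np.arange(4), 1 - toy_d] += 5.0
loss_b = domain_independent_loss(perturbed, toy_y, toy_d)
```

Because the other head's scores never enter the loss, its classifier weights receive no gradient from that example, while the shared feature extractor still learns from every image.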

4.5 Summary of Findings

So far we illustrated that the CIFAR-10S setup is an effective benchmark for studying bias mitigation, and provided a thorough evaluation of multiple techniques. We demonstrated the shortcomings of strategic resampling and of adversarial approaches for bias mitigation. We showed that the prior shift inference adjustment of output probabilities is a simpler, more efficient, and more effective alternative to the RBA technique [46]. Finally, we concluded that the domain-conditional model with explicit combination of per-domain class predictions significantly outperforms all other techniques. Table 1 lays out the findings.

Recall our original goal of Sec. 3 to train a model that mitigates the domain correlation bias in CIFAR-10S enough to classify color images of objects as well as a model trained on only grayscale images would. We have partially achieved that goal. The DomainIndependent model trained on CIFAR-10S achieves accuracy on color images, significantly better than of Baseline and approaching of the model trained entirely on grayscale images. However, much still remains to be done. We would expect that a model trained on CIFAR-10S would take advantage of the available color cues and perform even better than , ideally approaching accuracy of a model trained on all color images. The correlation bias is a much deeper problem for visual classifiers and much more difficult to mitigate than it appears at first glance.

5 Real World Experiments

While CIFAR-10S proves to be a useful landscape for bias isolation studies, there remains the implicit assumption throughout that such findings will generalize to other settings. Indeed, it is possible that they may not due to the synthetic nature of the proposed bias generation. We thus investigate our findings in three alternative scenarios. First, in Sec. 5.1 we consider two modifications to CIFAR-10S: varying the level of skew beyond the 95%-5% studied in Sec. 4, and replacing the color/grayscale domains with more realistic non-linear transformations. After verifying all our findings still hold, in Sec. 5.2 we consider face attribute recognition on the CelebA dataset [28] where the presence of attributes, e.g., “smiling”, is correlated with gender.

5.1 CIFAR Extensions

There are two key distinctions between the CIFAR-10S dataset studied in Sec. 4 and the real world scenarios where gender or race are correlated with the target outputs.

Varying Degrees of Domain Distribution. The first distinction is in the level of skew, where domain balance may be more subtle than the 95%-5% breakdown studied above. To simulate this setting, we validate on CIFAR with different levels of color/grayscale skew, using the setup of Sec. 4; the results are shown in Fig. 3 (Left). The DomainIndep model consistently outperforms the Baseline, although the effect is significantly more pronounced at higher skew levels. For reference, the average gender skew on the CelebA dataset [28] for face attribute recognition described in Sec. 5.2 is (see Footnote 5).

Footnote 5: In this multi-label setting the gender skew is computed on the dev set as the mean across 39 attributes of .

Figure 3: (Left) The DomainIndep model outperforms the Baseline on CIFAR-10S for varying levels of skew. (Right) To investigate more real-world domains instead of color-grayscale, we consider the subtle shift between CIFAR and 32x32 ImageNet [9, 34].

Other Non-Linear Transformations. The second distinction is that real-world protected attributes differ from each other in more than just a linear color-grayscale transformation (e.g., men and women performing the same task look significantly more different than the same image in color or grayscale). To approximate this in a simple setting, we followed the CIFAR protocol of Sec. 4, but instead of converting images to grayscale, we consider alternative domain options in Table 2. Arguably the most interesting shift corresponds to taking images of similar classes from ImageNet [34, 9], and we focus our discussion on that one.

The domain shift here is subtle (shown in Fig. 3 Right) but the conclusions hold: mean per-class per-domain accuracy is Baseline 79.4%, Adversarial [1, 41] and [45] (not shown in Table 2), DomainDiscriminative 81.5%, and our DomainIndependent model 83.5%. One interesting change is that Oversampling yields 78.6%, significantly lower than the baseline of 79.4%, so we investigate further. The drop can be explained by the five classes which were heavily skewed towards CIFAR images at training time: the model overfit to the small handful of ImageNet images which got oversampled, highlighting the concerns with oversampling particularly in situations where the two domains are different from each other and the level of imbalance is high. We observe similar results in the high-to-low-resolution domain shift (third and fourth columns of Table 2), where the two domains are again very different from each other. To counteract this effect we instead applied the class-balanced loss method of Cui et al. [8], cross-validating the hyperparameter on a validation set to , and achieved a more reasonable result of , on par with 79.4% of Baseline but still behind 83.5% of DomainIndependent.
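For readers unfamiliar with the class-balanced loss of [8], the re-weighting it relies on can be sketched as below. The weight of a group with n training samples is proportional to (1 - beta) / (1 - beta^n), the inverse of its "effective number" of samples. The function name, the example counts, and the value of `beta` are illustrative assumptions; the hyperparameter value chosen in the experiments above is not recoverable from this text.

```python
import numpy as np


def class_balanced_weights(counts, beta=0.999):
    """Per-group loss weights based on the effective number of samples.

    counts: number of training samples in each group (e.g. each class).
    beta:   hyperparameter in [0, 1); beta = 0 gives uniform weights and
            beta close to 1 approaches inverse-frequency weighting.
    """
    counts = np.asarray(counts, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so that the weights sum to the number of groups.
    return weights * counts.size / weights.sum()


# Example with a 95%-5% split of 5000 images (illustrative counts only).
w = class_balanced_weights([4750, 250], beta=0.999)
```

The resulting weights would multiply the per-sample loss, so that rare groups contribute more without the repeated exposure to the same few images that oversampling causes.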

Model      28x28 crop   1/2 res.   1/4 res.   ImageNet
Baseline   89.2         85.6       73.7       79.4
Oversamp   90.1         85.4       72.7       78.6
DomDiscr   91.6         88.5       77.3       81.5
DomIndep   93.0         90.2       79.9       83.5

Table 2: On CIFAR-10S, we consider other transformations instead of the grayscale domain: (1) cropping the center of the image, (2,3) reducing the image resolution [39], followed by upsampling or (4) replacing with 32x32 ImageNet images of the same class [9]. We use the inference of Eqn. 6 for DomDiscr and Eqn. 9 for DomIndep, and report mean per-class per-domain accuracy (in %). Our conclusions from Sec. 4 hold across all domain shifts.
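The first three transformations in Table 2 can be approximated with a few lines of NumPy. This is a sketch under assumptions: the pooling and upsampling filters (block averaging and nearest-neighbour repetition) are chosen for simplicity, and whether the 28x28 crop is resized back to 32x32 is not stated here, so the crop is returned at its native size.

```python
import numpy as np


def center_crop(images, size=28):
    """Crop the central size x size window of a batch shaped (N, H, W, C)."""
    h, w = images.shape[1:3]
    top, left = (h - size) // 2, (w - size) // 2
    return images[:, top:top + size, left:left + size, :]


def reduce_resolution(images, factor=2):
    """Downsample by block averaging, then upsample back to the input size.

    Assumption: H and W are divisible by `factor`. The output is floating point.
    """
    n, h, w, c = images.shape
    blocks = images.astype(np.float64).reshape(
        n, h // factor, factor, w // factor, factor, c)
    small = blocks.mean(axis=(2, 4))              # (N, H/f, W/f, C)
    # Nearest-neighbour upsampling back to (N, H, W, C).
    return small.repeat(factor, axis=1).repeat(factor, axis=2)


# Example on a random batch of CIFAR-sized images.
batch = np.random.default_rng(0).random((4, 32, 32, 3))
cropped = center_crop(batch)            # 28x28 crop
half_res = reduce_resolution(batch, 2)  # "1/2 res."
quarter_res = reduce_resolution(batch, 4)  # "1/4 res."
```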

5.2 CelebA Attribute Recognition

Finally, we verified our findings on the real-world CelebA dataset [28], used in [36] to study face attribute recognition when the presence of attributes, e.g., “smiling,” is correlated with gender. We trained models to recognize the 39 attributes (all except the “Male” attribute). Out of the 39 attributes, 21 occur more frequently with women and 18 with men, with an average gender skew of 80.0% when an attribute is present. During evaluation we consider the 34 attributes that have sufficient validation and test images. (Footnote 6: The removed attributes did not contain at least 1 positive male, positive female, negative male, and negative female image. They are: 5 o’clock shadow, goatee, mustache, sideburns and wearing necktie.)
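The per-attribute gender skew can be computed directly from the label matrix. The sketch below assumes that skew means the fraction of an attribute's positive images that belong to its majority gender, which matches the description of skew in the Analysis paragraph of this section; the function and argument names are hypothetical.

```python
import numpy as np


def attribute_gender_skew(attr_labels, is_male):
    """Per-attribute skew and majority gender.

    attr_labels: (num_images, num_attrs) binary matrix of attribute presence.
    is_male:     (num_images,) binary vector taken from the "Male" attribute.
    Returns (skew, more_women), where skew[a] is the majority-gender fraction
    among positive images of attribute a. Attributes with no positive images
    would divide by zero and should be filtered out beforehand.
    """
    attr_labels = np.asarray(attr_labels, dtype=bool)
    is_male = np.asarray(is_male, dtype=bool)
    men = attr_labels[is_male].sum(axis=0)
    women = attr_labels[~is_male].sum(axis=0)
    skew = np.maximum(men, women) / (men + women)
    return skew, women > men


# Toy example: 4 images, 2 attributes.
skew, more_women = attribute_gender_skew(
    [[1, 0], [1, 1], [1, 1], [0, 1]], is_male=[1, 1, 1, 0])
```

Averaging `skew` over all attributes gives a single dataset-level number comparable to the average skew quoted above.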

Task and Metric. The target task is multi-label classification, evaluated using mean average precision (mAP) across attributes. We remove the gender bias in the test set by using a weighted mAP metric: for an attribute that appears with $m$ men and $w$ women images, we weight every positive man image by $(m+w)/(2m)$ and every positive woman image by $(m+w)/(2w)$ when computing the true positive predictions. This simulates the setting where the total weight of positive examples within the class remains constant but is now equally distributed between the genders.
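A minimal sketch of the weighted average precision for one attribute follows. The positive-image weights, (m+w)/(2m) for men and (m+w)/(2w) for women with m and w the numbers of positive men and women images, are the ones implied by the requirement that the total positive weight stays m+w and is split equally between genders. Leaving negatives at weight 1 and ignoring score ties are assumptions of this sketch, and `gender_balanced_ap` is a hypothetical name.

```python
import numpy as np


def gender_balanced_ap(scores, labels, is_male):
    """Weighted average precision for a single attribute.

    Assumes at least one positive man and one positive woman image.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    is_male = np.asarray(is_male, dtype=bool)

    m = np.sum(labels & is_male)    # positive men images
    w = np.sum(labels & ~is_male)   # positive women images
    weights = np.ones_like(scores)  # negatives keep weight 1 (assumption)
    weights[labels & is_male] = (m + w) / (2.0 * m)
    weights[labels & ~is_male] = (m + w) / (2.0 * w)

    # Rank images from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    labels, weights = labels[order], weights[order]

    tp = np.cumsum(weights * labels)    # weighted true positives so far
    fp = np.cumsum(weights * ~labels)   # false positives so far
    precision = tp / (tp + fp)
    # Weighted mean of the precision at each positive image's rank.
    return float(np.sum(precision[labels] * weights[labels]) / tp[-1])


ap = gender_balanced_ap(
    scores=[0.9, 0.8, 0.7, 0.6], labels=[1, 0, 1, 1], is_male=[0, 0, 1, 1])
```

The weighted mAP would then be the mean of this quantity over the evaluated attributes.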

We also evaluate the bias amplification (BA) of each attribute [46]. For an attribute that appears more frequently with women, this is where are the number of women and men images respectively classified as positive for this attribute. For attributes that appear more frequently with men, the numerators are and . To determine the binary classifier decision we compute a score threshold for each attribute which maximizes the classifier’s F-score on the validation set. Since our methods aim to de-correlate gender with the attribute we expect that bias amplification will be negative as the predictions approach a uniform distribution across genders.
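The two steps above, choosing a per-attribute threshold and measuring bias amplification, can be sketched as follows. The exact formula is not legible in this text, so the sketch assumes the usual form from [46]: the majority gender's share of the positive predictions minus that gender's share in a reference set of labels. Which reference set is used (for example the training labels) is an assumption left to the caller, and both function names are hypothetical.

```python
import numpy as np


def best_f1_threshold(scores, labels):
    """Score threshold maximizing F1 on validation data for one attribute."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & labels)
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (#predicted + #actual positives)
        f1 = 2.0 * tp / (pred.sum() + labels.sum())
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t


def bias_amplification(pred_pos, is_male, ref_women, ref_men):
    """Majority-gender share among positive predictions minus its reference share.

    ref_women, ref_men: counts of women and men images with the attribute in
    the reference labels (an assumed input, see the lead-in).
    """
    pred_pos = np.asarray(pred_pos, dtype=bool)
    is_male = np.asarray(is_male, dtype=bool)
    g_m = np.sum(pred_pos & is_male)
    g_w = np.sum(pred_pos & ~is_male)
    if ref_women >= ref_men:   # attribute more frequent with women
        return float(g_w / (g_w + g_m) - ref_women / (ref_women + ref_men))
    return float(g_m / (g_w + g_m) - ref_men / (ref_women + ref_men))


t = best_f1_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
ba = bias_amplification([1, 1, 1, 0], is_male=[0, 0, 1, 1], ref_women=3, ref_men=1)
```

Under this definition a negative value means the predictions are less gender-skewed than the reference labels, which is the behavior the paragraph above expects from a de-correlating method.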

Training Setup. The images are the Aligned&Cropped subset of CelebA [28]. We use a ResNet-50 [19] base architecture pre-trained on ImageNet [34]. The FC layer of the ResNet model is replaced with two consecutive fully connected layers. Dropout and ReLU are applied to the intermediate output between the two fully connected layers, which has size 2048. The model is trained with a binary cross entropy loss with logits using a batch size of 32, for 50 epochs with the Adam optimizer [24] (learning rate 1e-4). The best model over all epochs is selected per inference method on the validation set. For adversarial training, we run an extensive hyperparameter search over the relative weights of the losses and the number of epochs of the adversary. We select the model with the highest weighted mAP on the validation set among all models that successfully train a de-biased representation (accuracy of the gender classifier drops by at least 1%; otherwise it’s essentially the Baseline model with the same mAP). The models are evaluated on the test set.
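The selection rule for the adversarial models can be written down in a few lines. This is a sketch under assumptions: the drop in gender-classifier accuracy is taken in percentage points relative to a reference accuracy supplied by the caller, and the dictionary keys are hypothetical names for values logged during the hyperparameter search.

```python
def select_debiased_model(candidates, reference_gender_acc, min_drop=1.0):
    """Pick the candidate with the highest validation weighted mAP among
    those whose gender-classifier accuracy dropped by at least `min_drop`.

    candidates: list of dicts with keys "name", "gender_acc" and "val_wmap"
                (all in percent). Returns None if no candidate qualifies.
    """
    eligible = [c for c in candidates
                if reference_gender_acc - c["gender_acc"] >= min_drop]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["val_wmap"])


# Illustrative values only, not results from the paper.
runs = [
    {"name": "run_a", "gender_acc": 95.0, "val_wmap": 74.0},
    {"name": "run_b", "gender_acc": 90.0, "val_wmap": 72.0},
    {"name": "run_c", "gender_acc": 85.0, "val_wmap": 71.0},
]
chosen = select_debiased_model(runs, reference_gender_acc=95.5)
```

Here `run_a` is excluded because its representation is not de-biased enough, even though it has the highest mAP.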

Model    Model                      mAP     BA
Base     N sigmoids                 74.7    0.010
Adver    w/uniform conf. [1, 41]    71.9    0.019
DomDis   2N sigm,                   73.8    0.007
DomInd   2N sigmoids,               73.8    0.009
         2N sigm,                   75.4   -0.039
         2N sigm,                   76.0   -0.037
         2N sigmoids,               76.3   -0.035

Table 3: Attribute classification accuracy evaluated using mAP (in %) weighted to ensure an equal distribution of men and women appearing with each attribute, and Bias Amplification. Evaluation is on the CelebA test set, across 34 attributes that have sufficient validation data; details in Sec. 5.2.

Results. Table 3 summarizes the results. The overall conclusions from Sec. 4 hold despite the transition to the multi-label setting and to real-world gender bias. Adversarial training, as before, de-biases the representation but also harms the mAP (71.9% compared to 74.7% for Baseline). In this multi-label setting we do not consider a probabilistic interpretation of the output, as the classifier models are trained independently instead of jointly in a softmax. Without this interpretation and prior shift, the DomainDiscriminative model achieves less competitive results than the baseline, at 73.8%. RBA inference of [46] towards a uniform distribution performs similarly at 73.8%. The DomainIndependent model successfully mitigates gender bias and outperforms the domain-unaware Baseline on this task, increasing the weighted mAP from 74.7% to 76.3%. Alternative inference methods, such as selecting the known domain, computing the max output over the domains, or summing the outputs of the probabilities directly, achieve similar bias amplification results but lower mAP (Table 3).
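The inference options for a model with one set of N outputs per domain (2N sigmoids for two genders) can be sketched as below. The three alternatives follow the description in the paragraph above; treating the primary method as a sum of the pre-sigmoid outputs across domains is an assumption of this sketch, as are the function and method names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def domain_independent_scores(logits, method="sum_logits", domain=None):
    """Combine per-domain outputs into one score per attribute.

    logits: (num_domains, num_attrs) pre-sigmoid outputs for one image.
    domain: index of the image's domain, needed only for "known_domain".
    """
    logits = np.asarray(logits, dtype=np.float64)
    if method == "sum_logits":      # assumed primary method
        return logits.sum(axis=0)
    if method == "known_domain":    # use only the head of the known domain
        return logits[domain]
    if method == "max":             # max output over the domains
        return logits.max(axis=0)
    if method == "sum_probs":       # sum the sigmoid probabilities directly
        return sigmoid(logits).sum(axis=0)
    raise ValueError(f"unknown method: {method}")


# Two domains, two attributes (illustrative numbers).
example = [[1.0, -2.0], [0.5, 3.0]]
scores = domain_independent_scores(example, method="sum_logits")
```

Because mAP depends only on the ranking of scores within each attribute, these combination rules differ only when they order the images differently.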

Figure 4: Per-attribute improvement of the DomainIndependent model over the Baseline model on the CelebA validation set, as a function of the level of gender imbalance in the attribute. Attributes with high skew (such as “bald”) benefit most significantly.

Analysis. We take a deeper look at the per-class results on the validation set to understand the factors that contribute to the improvement. Overall the DomainIndependent model improves over Baseline on 24 of the 34 attributes. Fig. 4 demonstrates that the level of gender skew in the attribute is highly correlated with the amount of improvement (). Attributes that have skew greater than (out of the positive training images for this attribute at least belong to one of the genders) always benefit from the DomainIndependent model. This is consistent with the findings from CIFAR-10S in Fig. 3(Left). When the level of skew is insufficiently high the harm from using fewer examples when training the DomainIndependent model outweighs the benefit of decomposing the representation.

Oversampling. Finally, we note that the Oversampling model in this case achieves a high mAP of 77.6% and bias amplification of -0.061, outperforming the other techniques. This is expected, as we know from prior experiments in Sec. 4 and 5.1 that oversampling performs better in settings where the two domains are more similar (color/grayscale, 28x28 vs 32x32 crop), and where the skew is low while the dataset size is large, so that it does not suffer from overfitting.

6 Conclusions

We provide a benchmark and a thorough analysis of bias mitigation techniques in visual recognition models. We draw several important algorithmic conclusions, while also acknowledging that this work does not attempt to tackle many of the underlying ethical fairness questions. What happens if the domain (gender in this case) is non-discrete? What happens if the imbalanced domain distribution is not known at training time – for example, if the researchers failed to identify the undesired correlation with gender? What happens in downstream tasks where these models may be used to make prediction decisions? We leave these and many other questions to future work.

7 Acknowledgements

This work is partially supported by the National Science Foundation under Grant No. 1763642, by Google Cloud, and by the Princeton SEAS Yang Family Innovation award. Thank you to Arvind Narayanan and to members of Princeton’s Fairness in AI reading group for great discussions.


  • [1] M. Alvi, A. Zisserman, and C. Nellaker (2018) Turning a blind eye: explicit removal of biases and variation from deep neural network embeddings. arXiv preprint arXiv:1809.02169. Cited by: §1, §1, §2, §4.2, Table 1, §5.1, Table 3, footnote 3.
  • [2] L. Anne Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach (2018) Women also snowboard: overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §1.
  • [3] S. Bickel, M. Brückner, and T. Scheffer (2009-09) Discriminative learning under covariate shift. Journal of Machine Learning Research. Cited by: §4.1.
  • [4] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016) Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, pp. 4349–4357. Cited by: §1, §2.
  • [5] M. Buda, A. Maki, and M. A. Mazurowski (2017-10) A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381 [cs, stat]. Cited by: §2.
  • [6] J. Buolamwini and T. Gebru (2018) Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, Cited by: §2, footnote 1.
  • [7] A. Caliskan, J. J. Bryson, and A. Narayanan (2017) Semantics derived automatically from language corpora contain human-like biases. Science 356 (6334), pp. 183–186. External Links: Document Cited by: §1, §2.
  • [8] Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie (2019-06) Class-balanced loss based on effective number of samples. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1.
  • [9] L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey (2018) CINIC-10 is not imagenet or CIFAR-10. CoRR abs/1810.03505. Cited by: Figure 3, §5.1, Table 2.
  • [10] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness Through Awareness. In Innovations in Theoretical Computer Science Conference, Cited by: §1, §1, §2, §4.2, §4.3.
  • [11] C. Dwork, N. Immorlica, A. T. Kalai, and M. Leiserson (2018) Decoupled classifiers for group-fair and efficient machine learning. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, Cited by: §2.
  • [12] H. Edwards and A. Storkey (2016) Censoring Representations with an Adversary. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [13] C. Elkan (2001) The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 973–978. Cited by: §4.1.
  • [14] P. Gajane and M. Pechenizkiy (2017) On formalizing fairness in prediction with machine learning. arXiv preprint arXiv:1710.03184. Cited by: §2.
  • [15] Y. Ganin and V. Lempitsky (2015) Unsupervised Domain Adaptation by Backpropagation. In International Conference on Machine Learning, pp. 1180–1189. Cited by: §1, §2.
  • [16] A. Grover, J. Song, A. Kapoor, K. Tran, A. Agarwal, E. J. Horvitz, and S. Ermon (2019) Bias correction of learned generative models using likelihood-free importance weighting. In NeurIPS 2019 : Thirty-third Conference on Neural Information Processing Systems, External Links: Link Cited by: §1.
  • [17] M. Hardt, E. Price, N. Srebro, et al. (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §2.
  • [18] M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3315–3323. External Links: Link Cited by: §2, §4.2.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3, Figure 1, Table 1, §4, §5.2.
  • [20] D. Hoiem, Y. Chodpathumwan, and Q. Dai (2012) Diagnosing error in object detectors. In ECCV, Cited by: §2.
  • [21] M. Huh, P. Agrawal, and A. A. Efros (2017) What makes imagenet good for transfer learning?. Cited by: §2.
  • [22] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba (2012) Undoing the Damage of Dataset Bias. In European Conference on Computer Vision, Cited by: §2.
  • [23] N. Kilbertus, M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf (2017) Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pp. 656–666. Cited by: §2.
  • [24] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: Link, 1412.6980 Cited by: §5.2.
  • [25] A. Krizhevsky and G. Hinton (2009) Learning Multiple Layers of Features from Tiny Images. Cited by: §3.
  • [26] S. Levin (2016-09) A beauty contest was judged by AI and the robots didn’t like dark skin. The Guardian. Cited by: §2.
  • [27] X. Liu, J. Wu, and Z. Zhou (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics 39 (2). Cited by: §2.
  • [28] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In ICCV, Cited by: §1, §5.1, §5.2, §5.2, §5.
  • [29] S. U. Noble (2018-02) Algorithms of oppression: how search engines reinforce racism. NYU Press. Cited by: §1.
  • [30] D. Pedreshi, S. Ruggieri, and F. Turini (2008) Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, New York, NY, USA, pp. 560–568. External Links: ISBN 978-1-60558-193-4, Link, Document Cited by: §2.
  • [31] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger (2017) On fairness and calibration. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5680–5689. External Links: Link Cited by: §2.
  • [32] A. Royer and C. H. Lampert (2015) Classifier adaptation at prediction time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1401–1409. Cited by: §4.3.1.
  • [33] O. Russakovsky, J. Deng, Z. Huang, A. C. Berg, and L. Fei-Fei (2013) Detecting avocados to zucchinis: what have we done, and where are we going?. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: Figure 3, §5.1, §5.2.
  • [35] H. J. Ryu, H. Adam, and M. Mitchell (2018) Inclusivefacenet: improving face attribute detection with race and gender diversity. In Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML), Cited by: §1, §1, §2.
  • [36] H. J. Ryu, M. Mitchell, and H. Adam (2017) Improving Smiling Detection with Race and Gender Diversity. arXiv preprint arXiv:1712.00193. Cited by: §1, §1, §5.2.
  • [37] M. Saerens, P. Latinne, and C. Decaestecker (2002) Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation 14 (1), pp. 21–41. Cited by: §4.3.1.
  • [38] G. A. Sigurdsson, O. Russakovsky, and A. Gupta (2017-10) What actions are needed for understanding human actions in videos?. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781538610329, Link, Document Cited by: §2.
  • [39] J. Su and S. Maji (2017) Adapting models to signal degradation using distillation. In British Machine Vision Conference (BMVC), Cited by: Table 2.
  • [40] A. Torralba and A. A. Efros (2011) Unbiased Look at Dataset Bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1521–1528. Cited by: §2.
  • [41] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4068–4076. Cited by: §1, §4.2, Table 1, §5.1, Table 3, footnote 3.
  • [42] T. Wang, J. Zhao, M. Yatskar, K. Chang, and V. Ordonez (2019) Balanced datasets are not enough: estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5310–5319. Cited by: §1.
  • [43] G. M. Weiss, K. McCarthy, and B. Zabar Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs?. Cited by: §4.1.
  • [44] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning Fair Representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 325–333. Cited by: §1, §2.
  • [45] B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, Cited by: §1, §2, §2, Figure 2, §4.2, Table 1, §5.1, footnote 3.
  • [46] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In EMNLP, Cited by: §1, §1, §1, §2, §2, §3, §4.3.2, §4.3.2, §4.5, Table 1, §4, §5.2, §5.2.