Mitigating Unwanted Biases with Adversarial Learning

Brian Hu Zhang, et al.
Stanford University

Machine learning is a tool for building models that accurately represent input training data. When undesired biases concerning demographic groups are in the training data, well-trained models will reflect those biases. We present a framework for mitigating such biases by including a variable for the group of interest and simultaneously learning a predictor and an adversary. The input to the network X, here text or census data, produces a prediction Y, such as an analogy completion or income bracket, while the adversary tries to model a protected variable Z, here gender or zip code. The objective is to maximize the predictor's ability to predict Y while minimizing the adversary's ability to predict Z. Applied to analogy completion, this method results in accurate predictions that exhibit less evidence of stereotyping Z. When applied to a classification task using the UCI Adult (Census) Dataset, it results in a predictive model that does not lose much accuracy while achieving very close to equality of odds (Hardt, et al., 2016). The method is flexible and applicable to multiple definitions of fairness as well as a wide range of gradient-based learning models, including both regression and classification tasks.




1 Introduction

Machine learning leverages data to build models capable of assessing the labels and properties of novel data. Unfortunately, the available training data frequently contains biases with respect to attributes that we would rather not use for decision making. Because machine learning builds models faithful to the training data, it can perpetuate these undesirable biases. For example, systems designed to predict creditworthiness and systems designed to perform analogy completion have been demonstrated to be biased against racial minorities and women, respectively. Ideally, we would build a model that captures exactly those generalizations from the data that are useful for performing the task at hand, while avoiding generalizations that the people building the model consider unfairly discriminatory.

Work on training machine learning systems that output fair decisions has defined several useful measurements for fairness: demographic parity, equality of odds, and equality of opportunity. These can be imposed as constraints or incorporated into a loss function in order to mitigate disproportionate outcomes in the system's output predictions regarding a protected demographic, such as sex.

In this paper, we examine these fairness measures in the context of adversarial debiasing. We consider supervised deep learning tasks in which the goal is to predict an output variable Y given an input variable X, while remaining unbiased with respect to some variable Z. We refer to Z as the protected variable. For these learning systems, the training data can be expressed as (input, output, protected) tuples (x, y, z). The predictor is usually given access to the protected variable Z, though this is not strictly necessary. This construction allows the determination of which types of bias are considered undesirable for a particular application to be made through the specification of the protected variable.

We speak to the concept of mitigating bias using the known term debiasing (note that "debias" may not be quite the right word, as not all bias is necessarily removed), following definitions provided by Hardt et al. (2016) and refined by Beutel et al. (2017).

Definition 1.

Demographic Parity. A predictor Ŷ satisfies demographic parity if Ŷ and Z are independent.

This means that P(Ŷ = ŷ) is equal for all values of the protected variable Z: P(Ŷ = ŷ | Z = z) = P(Ŷ = ŷ) for all z.

Definition 2.

Equality of Odds. A predictor Ŷ satisfies equality of odds if Ŷ and Z are conditionally independent given Y.

This means that, for all possible values y of the true label Y, the prediction distribution is the same for all values of the protected variable: P(Ŷ = ŷ | Z = z, Y = y) = P(Ŷ = ŷ | Y = y).

Definition 3.

Equality of Opportunity. If the output variable Y is discrete, a predictor Ŷ satisfies equality of opportunity with respect to a class y if Ŷ and Z are independent conditioned on Y = y.

This means that, for the particular value y of the true label, the prediction distribution is the same for all values of the protected variable: P(Ŷ = ŷ | Z = z, Y = y) = P(Ŷ = ŷ | Y = y).
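These definitions can be checked empirically by comparing subgroup prediction rates. The following sketch (the function name and return layout are illustrative, not from the paper) computes the demographic-parity gap and the per-class equality-of-odds gaps for binary predictions and a binary protected variable; a gap of 0 means the corresponding definition holds exactly on the sample:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, z):
    """Empirical fairness gaps for binary predictions y_pred and a
    binary protected variable z (all array-likes of 0/1 values).

    Returns (demographic-parity gap, {class y: equality-of-odds gap})."""
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    # Demographic parity compares P(Yhat = 1) across the two groups.
    dp_gap = abs(y_pred[z == 0].mean() - y_pred[z == 1].mean())
    eo_gaps = {}
    for y in np.unique(y_true):
        m = y_true == y  # condition on the true label, per equality of odds
        eo_gaps[int(y)] = abs(y_pred[m & (z == 0)].mean()
                              - y_pred[m & (z == 1)].mean())
    return dp_gap, eo_gaps
```

Equality of opportunity with respect to a class y corresponds to looking at only that class's entry of the equality-of-odds gaps.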

We present an adversarial technique for achieving whichever one of these definitions is desired. (Achieving equality of odds and demographic parity are generally incongruent goals; see also Kleinberg et al. (2016) for the incongruency between calibration and equalized odds.) A predictor will be trained to model Y as accurately as possible while satisfying one of the above equality constraints. Demographic parity will be achieved by introducing an adversary that attempts to predict a value for Z from the prediction Ŷ. The gradient of the adversary's loss will then be incorporated into the weight update rule of the predictor so as to reduce the amount of information about Z transmitted through Ŷ. Equality of odds will be achieved by also giving the adversary access to the true label Y, thereby penalizing only the information about Z which Ŷ contains beyond the information already contained in Y.

We consider the case where the protected variable is a discrete feature present in the training set, as well as the case in which the protected variable must be inferred from latent semantics (in particular, gender from word embeddings). To accomplish the latter, we adapt a technique presented by Bolukbasi et al. (2016) to define a subspace capturing the semantics of the protected variable, and then train a model to perform a word analogies task accurately while remaining unbiased on this protected variable. A consequence of this technique is that the network learns "debiased" embeddings: embeddings that have the semantics of the protected variable removed. These embeddings are still able to perform the analogy task well, but are better at avoiding problematic completions such as those shown in Bolukbasi et al. (2016).

Results on the UCI Adult Dataset demonstrate that the technique we introduce allows us to train a model that achieves equality of odds to within 1% on both protected groups.

We also compare with the related previous work of Beutel et al. (2017), and find that we are able to better equalize the differences between the two groups, as measured by both False Positive Rate and False Negative Rate (1 − True Positive Rate), although the previous work achieves a lower False Negative Rate overall.

We close with a discussion of caveats pertaining to this approach, of training difficulties that these models share with many adversarial approaches, and of difficulties that the fairness constraints themselves introduce.

2 Related Work

There has been significant work in the area of debiasing various specific types of data or predictors.

Debiasing word embeddings: Bolukbasi et al. (2016) devise a method to remove gender bias from word embeddings. The method relies on substantial human input; namely, it needs a large "training set" of gender-specific words.

Simple models: Lum and Johndrow (2016) demonstrate that removing the protected variable from the training data fails to yield a debiased model (since other variables can be highly correlated with the protected variable), and devise a method for learning fair predictive models in cases when the learning model is simple (e.g. linear regression). Hardt et al. (2016) discuss the shortcomings of focusing solely on demographic parity, present alternate definitions of fairness, and devise a method for deriving an unbiased predictor from a biased one, in cases when both the output variable and the protected variable are discrete.

Adversarial training: Goodfellow et al. (2014) pioneered the technique of using multiple networks with competing goals, forcing the first network to "deceive" the second, and applied this method to the problem of generating realistic images. Beutel et al. (2017) apply an adversarial training method to achieve equality of opportunity in cases when the output variable is discrete. They also discuss the adversary's ability to enforce a fairness constraint even when it has access to only a very small training sample.

3 Adversarial Debiasing

Figure 1: The architecture of the adversarial network.

We begin with a model, which we call the predictor, trained to accomplish the task of predicting Y given X. As in Figure 1, we assume that the model is trained by attempting to modify weights W to minimize some loss L_P(ŷ, y), using a gradient-based method such as stochastic gradient descent.

The output layer of the predictor is then used as an input to another network, called the adversary, which attempts to predict Z. This part of the network corresponds to the discriminator in a typical GAN [Goodfellow et al. 2014]. We will suppose the adversary has loss term L_A(ẑ, z) and weights U. Depending on the definition of fairness being achieved, the adversary may have other inputs.

  • For Demographic Parity, the adversary gets the predicted label Ŷ. Intuitively, this allows the adversary to try to predict the protected variable using nothing but the predicted label; the goal of the predictor is to prevent the adversary from doing so.

  • For Equality of Odds, the adversary gets Ŷ and the true label Y.

  • For Equality of Opportunity on a given class y, we can restrict the adversary's training set to examples where Y = y. (This technique of restricting the training set is discussed at length by Beutel et al. (2017), so we only mention it here.)

In order for gradients to propagate correctly, Ŷ above refers to the output layer of the network, not to the discrete prediction; for example, for a classification problem, Ŷ could refer to the output of the softmax layer.

We update U to minimize L_A at each training time step, according to the gradient ∇_U L_A. We modify W according to the expression:

    ∇_W L_P − proj_{∇_W L_A} ∇_W L_P − α ∇_W L_A        (1)

where α is a tuneable hyperparameter that can vary at each time step, proj_a b denotes the projection of b onto a, and we define proj_a b = 0 if a = 0.

Figure 2: Diagram illustrating the gradients in Eqn. 1 and the relevance of the projection term proj_{∇_W L_A} ∇_W L_P. Without the projection term, in the pictured scenario, the predictor would move in the direction labelled in the diagram, which actually helps the adversary. With the projection term, the predictor will never move in a direction that helps the adversary.

The middle term, proj_{∇_W L_A} ∇_W L_P, prevents the predictor from moving in a direction that helps the adversary decrease its loss, while the last term, α ∇_W L_A, attempts to increase the adversary's loss. Without the projection term, it is possible for the predictor to end up helping the adversary (see Fig. 2). Without the last term, the predictor will never actively try to hurt the adversary and, due to the stochastic nature of many gradient-based methods, will likely end up helping the adversary anyway. The result is that, when training completes, the desired definition of equality should be satisfied.
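As a concrete sketch, the update direction of Eqn. 1 can be computed from the two gradients as follows (the function name and the flat-vector representation of the gradients are illustrative assumptions, not from the original implementation):

```python
import numpy as np

def predictor_update(grad_lp, grad_la, alpha):
    """Update direction for the predictor weights W per Eqn. 1.

    grad_lp: gradient of the predictor loss L_P with respect to W
    grad_la: gradient of the adversary loss L_A with respect to W
    alpha:   tuneable hyperparameter weighting the adversarial term
    """
    norm_sq = np.dot(grad_la, grad_la)
    if norm_sq > 0.0:
        # proj_{grad_la} grad_lp: the component of grad_lp that would
        # also decrease the adversary's loss
        proj = (np.dot(grad_lp, grad_la) / norm_sq) * grad_la
    else:
        proj = np.zeros_like(grad_lp)  # proj onto a zero gradient is defined as 0
    return grad_lp - proj - alpha * grad_la
```

By construction, the returned direction has dot product −α‖∇_W L_A‖² with ∇_W L_A, so a gradient step along it never decreases the adversary's loss to first order.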

Notice that our definitions and method make no assumptions about the nature of the output and protected variables: in particular, they work with both regression and classification models, as well as with both discrete and continuous protected variables.

4 Properties

We note several properties of the above method that we believe distinguish it from past work.

  1. Generality: The above method can be used to enforce demographic parity, equality of odds, or equality of opportunity as described in Hardt et al. (2016). Further, it applies without modification to the cases when the output variable and/or protected variable are continuous instead of discrete.

  2. Model-agnostic: The adversarial approach described can be applied regardless of how simple or complex the predictor's model is, as long as the model is trained using a gradient-based method, as many modern learning models are. Further, as we discuss later, in at least some situations the adversary does not need to be nearly as complex as the predictor: a simple adversary can be used with a complex predictor.

  3. Optimality: Under certain conditions, we show that if the predictor converges, it must converge to a model that satisfies the desired fairness definition. Since the predictor also attempts to decrease the prediction loss L_P, the predictor should still perform well on the target task.

5 Theoretical Guarantees

Proposition 1.

Let the predictor, the adversary, and their weights W, U be defined according to Section 3. Let L_A(W, U) be the adversary's loss, convex in U, concave in W, and continuously differentiable everywhere. (We understand that the convexity and concavity assumptions are not satisfied in most use cases involving neural networks; however, as with most theoretical analyses of machine learning models (see, for example, Goodfellow et al. (2014) or Kingma and Ba (2014); the former makes even stronger assumptions), such assumptions are necessary for the proofs to work.)

Suppose that:

  1. When the predictor's weights are W₁, the predictor gives the same output regardless of input X (for example, when W₁ = 0).

  2. There are some weights U₁ that minimize L_A(W₁, U) and for which the weights applied to Ŷ have no effect on the output, so that for all W, L_A(W, U₁) = L_A(W₁, U₁).

  3. The predictor and adversary converge to W* and U* respectively.

Then L_A(W*, U*) = L_A(W₁, U₁). That is, the adversary gains no advantage from using the weights for Ŷ.


Proof. Since the adversary converges, L_A(W*, U*) ≤ L_A(W*, U) for all U: otherwise, since L_A is convex in U, the adversary's weights would keep moving toward a point of lower loss. In other words, at convergence the adversary's loss is the minimum achievable given the predictor's weights. Similarly, since the predictor converges, L_A(W*, U*) ≥ L_A(W, U*) for all W: otherwise, the predictor would be able to increase the adversary's loss by moving toward such a W, and the projection term and the negative weight on ∇_W L_A in Eqn. 1 would push the predictor to do so. Then:

    L_A(W₁, U₁) ≤ L_A(W₁, U*) ≤ L_A(W*, U*) ≤ L_A(W*, U₁) = L_A(W₁, U₁),

so we must have L_A(W*, U*) = L_A(W₁, U₁). ∎

Note that, in this proof, the adversary can be operating on different sets of inputs, as long as it is given Ŷ as one of them; for example, for demographic parity, it could be given only Ŷ, while for equality of odds it can be given both Ŷ and Y.

We will show in the next propositions that the adversary gaining no advantage from Ŷ is exactly the condition needed to guarantee that the desired definitions of equality are satisfied.

Proposition 2.

Let the training data be comprised of triples (X, Y, Z) drawn according to some distribution D. Suppose:

  1. The protected variable Z is discrete.

  2. The adversary is trained for demographic parity; i.e. the adversary is given only the prediction Ŷ.

  3. The adversary is strong enough that, at convergence, it has learned a randomized function A(Ŷ) that minimizes the cross-entropy loss; i.e. the adversary in fact achieves the optimal accuracy with which one can predict Z from Ŷ.

  4. The predictor completely fools the adversary; in particular, the adversary achieves loss H(Z), the entropy of Z.

Then the predictor satisfies demographic parity; i.e., Ŷ and Z are independent.


Proof. Notice that if the adversary draws Ẑ according to the distribution P(Z | Ŷ), then its loss is exactly the conditional entropy H(Z | Ŷ), where the expectation is taken over Ŷ. Now suppose for contradiction that Ŷ is dependent on Z. Then H(Z | Ŷ) < H(Z), so the adversary can achieve loss less than H(Z), contradicting assumption (4). ∎
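The entropy argument can be illustrated numerically. Below, two joint distributions over binary (Ŷ, Z) share the same marginal P(Z), hence the same H(Z) = 1 bit; the optimal adversary's loss H(Z | Ŷ) drops below H(Z) exactly when Ŷ carries information about Z (the specific joint tables are illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def cond_entropy(joint):
    """H(Z | Yhat) for a joint table with rows indexed by yhat,
    columns indexed by z."""
    h = 0.0
    for row in joint:
        p_row = row.sum()
        if p_row > 0:
            h += p_row * entropy(row / p_row)
    return h

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])          # yhat and z correlated
independent = np.outer([0.5, 0.5], [0.5, 0.5])

h_z = entropy(dependent.sum(axis=0))        # H(Z) = 1 bit for both tables
print(cond_entropy(dependent), h_z)         # H(Z | Yhat) < H(Z): adversary can win
print(cond_entropy(independent))            # equals H(Z): adversary fully fooled
```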

Proposition 3.

If assumptions (2)-(4) above are replaced with the analogous equality-of-odds assumptions (in particular, that the adversary is given Ŷ and Y, and that it cannot achieve loss better than the conditional entropy H(Z | Y)), then the predictor will satisfy Equality of Odds; i.e., Ŷ and Z are conditionally independent given Y.

Proof. Analogous to the above. Notice that if the adversary draws Ẑ according to P(Z | Ŷ, Y), then its loss is exactly the conditional entropy H(Z | Ŷ, Y), where the expectation is again taken over Ŷ and Y. But if Z is conditionally dependent on Ŷ given Y, then H(Z | Ŷ, Y) < H(Z | Y), so the adversary can achieve loss less than H(Z | Y). ∎

Note that Propositions 2 and 3 work analogously in the case of continuous Ŷ and Z, with the probability mass function replaced by the probability density function and the discrete entropy H replaced by the differential entropy h, since the relevant property (h(Z | Ŷ) = h(Z) if and only if Z and Ŷ are independent) holds for differential entropy as well. They also work analogously when the adversary is restricted to a limited set of predictors.

For example, an adversary using least-squares regression to enforce equality of odds can be thought of as one that outputs a density for Ẑ proportional to exp(−(z − r)²/c), where r is the output of the regressor and c is a fixed constant. Note now that the cross-entropy loss is nothing more than the expected negative log-likelihood, and so the function r that minimizes this quantity is the optimal least-squares regressor. Thus, for example, if we restrict r to be a linear function of Ŷ and Y, and the other conditions of Proposition 3 hold, then an analogous argument to the above propositions shows that Z has no linear relationship with Ŷ after conditioning on Y.

These claims together illustrate that a sufficiently powerful adversary trained on a sufficiently large training set can indeed accurately enforce the demographic parity or equality of odds constraints on the predictor, if the adversary and predictor converge. Guaranteed convergence is harder to achieve, both in theory and practice. In the practical scenarios below we discuss methods to encourage the training algorithm to converge, as well as reasonable choices of the adversary model that are both powerful and easy to train.

6 Experiments

All models, predictor and adversary alike, were trained using the Adam optimizer [Kingma and Ba 2014].

Toy Scenario

We generate a training sample of triples (x, y, z) (where z is the protected variable) as follows. For each example, let z ∈ {0, 1} be picked uniformly at random, and let v be drawn from N(z, 1). Let u, drawn from N(0, 1), vary independently. Then y := 1[v > 0] (where 1[·] denotes an indicator function) and x := v + u. Intuitively, the variable that we are trying to predict, y, depends directly on v, which is in turn shifted by the protected variable z. We are given as inputs the protected variable z and a noisy measurement x of v. The end goal is to train a model that predicts y while being unbiased with respect to z, effectively removing the direct signal for z from the learned model.

If one generically trains a logistic regression model to predict y given (x, z), it outputs a model of the form ŷ = σ(w₁x + w₂z + b) with substantial positive weight on both x and z: a reasonable model, but one that heavily incorporates the protected variable z. To debias, we now train a model that achieves demographic parity. Note that simply removing the variable z from the training data is insufficient for debiasing: the model will still learn to use x to predict y, and x is correlated with z. If we instead use the described technique and add another logistic model that tries to predict z given ŷ, we find that the predictor no longer includes z with a positive weight; the model actually learns a negative weight on z in order to balance out the effect of z on x. The resulting predictions satisfy P(ŷ | z = 0) = P(ŷ | z = 1); i.e., ŷ is not dependent on z, so we have successfully trained a model to predict y independently of z.
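A short simulation makes the setup concrete. The specific distributions, seed, and sample size below are illustrative choices consistent with the description above; the simulation checks both claims: the input x is correlated with z, while a group-balanced score of the form x − z is not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, n)       # protected variable, uniform on {0, 1}
v = rng.normal(z, 1.0)          # latent variable whose mean is shifted by z
u = rng.normal(0.0, 1.0, n)     # independent noise
y = (v > 0).astype(int)         # label: the indicator 1[v > 0]
x = v + u                       # observed input: a noisy measurement of v

# Dropping z from the inputs is not enough: x itself is correlated with z,
# so a model trained on x alone still picks up the protected signal.
print(np.corrcoef(x, z)[0, 1])  # about 0.33

# A score of the form x - z has the same mean in both groups, matching the
# debiased predictor's learned negative weight on z.
gap = abs((x - z)[z == 0].mean() - (x - z)[z == 1].mean())
print(gap)  # close to 0
```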

Word Embeddings

We train a model to perform the analogy task (i.e., fill in the blank: man : woman :: he : ?).

It is known that word embeddings reflect or amplify problematic biases in the data they are trained on, for example gender bias [Bolukbasi et al. 2016]. We seek to train a model that can still solve analogies well, but is less prone to these gender biases. We first calculate a "gender direction" using a method based on Bolukbasi et al. (2016), which gives a way of defining the protected variable. We use this technique in the context of defining gender for word embeddings but, as discussed in Bolukbasi et al. (2016), the technique generalizes to other protected variables and other forms of embeddings. Following that work, we pick 10 (male, female) word pairs and define the bias subspace to be the space spanned by the top k principal components of the pair differences, where k is a tuneable parameter. In our experiments we find that k = 1 gives reasonable results, so we did not experiment further.
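One way to approximate such a bias subspace is sketched below. The function names are illustrative, and the construction is simplified relative to Bolukbasi et al. (2016), who center each pair before performing PCA; here we take the principal directions of the raw pair differences:

```python
import numpy as np

def bias_subspace(pairs, k=1):
    """Top-k principal directions of the difference vectors between
    (male, female) embedding pairs, approximating the bias subspace.

    pairs: list of (male_vec, female_vec) numpy arrays.
    Returns an array of shape (k, dim) with orthonormal rows."""
    diffs = np.stack([a - b for a, b in pairs])
    # Right singular vectors of the difference matrix are its principal
    # directions, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]

def remove_bias_component(vec, basis):
    """Project out the component of vec lying in the bias subspace."""
    return vec - basis.T @ (basis @ vec)
```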

We use embeddings trained from Wikipedia and generate input data from the Google analogy data set [Mikolov et al. 2013]. For each analogy in the dataset, we let x comprise the word vectors of the first three words, y be the word vector of the fourth word, and z be the projection of y onto the bias subspace. It is worth noting that these word vectors, computed from the original embeddings, are never updated, nor is any projection onto the bias subspace applied to them; the original word embeddings are therefore never modified. What is learned is a transform from a biased embedding space to a debiased embedding space.

         biased                      debiased
  neighbor        similarity   neighbor        similarity
  nurse           1.0121       nurse           0.7056
  nanny           0.9035       obstetrician    0.6861
  fiancée         0.8700       pediatrician    0.6447
  maid            0.8674       dentist         0.6367
  fiancé          0.8617       surgeon         0.6303
  mother          0.8612       physician       0.6254
  fiance          0.8611       cardiologist    0.6088
  dentist         0.8569       pharmacist      0.6081
  woman           0.8564       hospital        0.5969
Table 1: Completions for he : she :: doctor : ?

As a model, we use the following: for an analogy a : b :: c : ?, let v := b − a + c, and output ŷ := v − ⟨v, w⟩w, where our model parameter is the vector w. Intuitively, v is the "generic" analogy vector as is commonly used for the analogy task (see e.g. Mikolov et al. 2013). If left to its own devices (i.e., if not told to be unbiased on anything), the model should either learn w = 0 or else learn w as a useless vector.

By contrast, if we add the adversarial discriminator network (here, simply a projection of the output onto the gender subspace), we expect the debiased prediction model to learn that w should be something close to the gender direction g (or −g), so that the discriminator cannot predict z. Indeed, both of these expectations hold: without debiasing, the trained vector w is approximately a unit vector nearly perpendicular to g; with debiasing, w is approximately a unit vector pointing in a direction highly correlated with g. Even after debiasing, gendered analogies such as man : woman :: he : she are still preserved; however, many biased analogies go away, suggesting that the adversarial training process was indeed successful. An example of the kinds of changes in analogy completions observed after debiasing is illustrated in Table 1. (The presence of nurse in the second position may seem worrying, but it should be noted that in this particular set of word embeddings, nurse is the nearest neighbor to doctor; no amount of debiasing will change this.)

UCI Adult Dataset

Feature Type Description
age Cont Age of the individual
capital_gain Cont Capital gains recorded
capital_loss Cont Capital losses recorded
education_num Cont Highest education level (numerical form)
fnlwgt Cont # of people census takers believe that observation represents
hours_per_week Cont Hours worked per week
education Cat Highest level of education achieved
income Cat Whether individual makes over $50K annually
marital_status Cat Marital status
native_country Cat Country of origin
occupation Cat Occupation
race Cat White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
relationship Cat Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
sex Cat Female, Male
workclass Cat Employer type
Table 2:

Features in the UCI dataset per individual. Features are either continuous (Cont) or Categorical (Cat). Categorical features are converted to sparse tensors for the model.

To better align with the work in Beutel et al. (2017), we attempt to enforce equality of odds on a model for the task of predicting the income of a person, in particular predicting whether the person's income exceeds $50K per year, given various attributes about the person, as made available in the UCI Adult dataset [Asuncion and Newman 2007].

Details on the features that the dataset provides are available in Table 2. We use both categorical and continuous columns as given, with the exception of the fnlwgt feature, which we discard. We convert the remaining columns into tensors where the categorical columns are sparse tensors, age is bucketized, and the rest of the continuous columns are real-valued.

As discussed before, to enforce equality of odds we give the adversary access to the true label Y. The adversary will learn the relationship between Y and Z regardless of what the predictor does; further, if the predictor's predictions Ŷ give more information about Z than is already contained in Y, the adversary will be able to improve its loss. Thus the predictor, in attempting to fool the adversary, will move toward making sure that Ŷ does not give such additional information; in other words, toward equality of odds.

Our protected variable is a binary-valued variable for the two sexes annotated, male and female.

Our predictor model is straightforward logistic regression: ŷ = σ(w₁ · x + b₁), where σ is the sigmoid function. Our adversary model takes the form of the following logistic-regression-like model:

    ẑ = σ(w₂ · [s, s·y, s·(1 − y)] + b₂),  where  s = σ((1 + |c|) σ⁻¹(ŷ)),

where c and b₂ are learnable scalars, w₂ is a learnable vector, and σ⁻¹ is the inverse of the sigmoid function (the logit function). Intuitively, we want our adversary to be able to learn functions that depend only on the boolean predicted value 1[ŷ > 1/2] together with the true label y, and thus enforce equality of odds; the adversary would learn such a function by making |c| extremely large, which pushes s toward 0 or 1. We add 1 to |c| to make sure the adversary never tries to ignore ŷ by setting c = 0, which could be a difficult local minimum for the adversary to escape. (The value added to |c| is an adjustable hyperparameter; we found reasonable results using the value 1 and thus did not feel the need to experiment further.) This adversary is both general enough to be used whenever Y and Z are discrete (if Y and Z are multi-class, the sigmoid becomes a softmax, but everything else remains the same), and powerful enough that deviation from true equality of odds should cause the adversary to be able to decrease its loss.

Without tweaking, this algorithm ran into issues with local minima, and the resulting models were often closer to demographic parity than to equality of odds. One technique helped: by increasing the hyperparameter α in Eqn. 1 over time, the predictor had a much easier time learning to deceive the adversary, and therefore enforced equality of odds more strictly. We set α proportional to √t (where t is the step counter), and to avoid divergence we set the predictor's step size η proportional to 1/t, so that η·α → 0 as t → ∞, as is preferred for stochastic gradient-based methods such as Adam.

We train the model twice, once with debiasing and once without, and present side-by-side confusion matrices on the test set for income bracket with respect to the protected variable values Male and Female in Table 3; we present the corresponding false positive rates (FPR) and false negative rates (FNR) in Table 4. Note that the false negative rate equals one minus the true positive rate, so the trade-offs are directly comparable to the values of an ROC curve.

Without Debiasing With Debiasing
Female Pred 0 Pred 1 Female Pred 0 Pred 1
True 0 4711 120 True 0 4518 313
True 1 265 325 True 1 263 327
Male Pred 0 Pred 1 Male Pred 0 Pred 1
True 0 6907 697 True 0 7071 533
True 1 1194 2062 True 1 1416 1840
Table 3: Confusion matrices on the UCI Adult dataset, with and without equality of odds enforcement.
                               Female                 Male
                           Without    With       Without    With
Beutel et al. (2017)  FPR  0.1875     0.0308     0.1200     0.1778
                      FNR  0.0651     0.0822     0.1828     0.1520
Current work          FPR  0.0248     0.0647     0.0917     0.0701
                      FNR  0.4492     0.4458     0.3667     0.4349
Table 4: False Positive Rate (FPR) and False Negative Rate (FNR) for income bracket predictions for the two sex subgroups, with and without adversarial debiasing.
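The group-conditional error rates in Table 4 follow directly from the confusion counts in Table 3; for instance, for the debiased Female group (the small helper below is illustrative, and the result agrees with Table 4 up to rounding):

```python
def rates(tn, fp, fn, tp):
    """False positive rate and false negative rate from the cells of a
    confusion matrix (tn, fp: true-0 row; fn, tp: true-1 row)."""
    return fp / (fp + tn), fn / (fn + tp)

# Debiased Female cells from Table 3.
fpr, fnr = rates(tn=4518, fp=313, fn=263, tp=327)
print(round(fpr, 4), round(fnr, 4))
```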

We notice that debiasing has only a small effect on overall accuracy (86.0% without debiasing vs 84.5% with, computed from Table 3), and that the debiased model indeed (nearly) obeys equality of odds: as shown in Table 4, with debiasing the FPR and FNR values are approximately equal across the sex subgroups (FPR 0.0647 vs 0.0701, FNR 0.4458 vs 0.4349).

Although the values do not exactly reach equality, neither difference is statistically significant: a two-proportion two-tailed large-sample z-test yields p-values of 0.25 for the FPR difference and 0.62 for the FNR difference.

7 Conclusion

In this work, we demonstrate a general and powerful method for training unbiased machine learning models. We state and prove theoretical guarantees for our method under reasonable assumptions, demonstrating in theory that it can enforce the constraints we claim, across multiple definitions of fairness, regardless of the complexity of the predictor's model or the nature (discrete or continuous) of the predicted and protected variables in question. We apply the method in practice to two very different scenarios: a standard supervised learning task, and the task of debiasing word embeddings while still maintaining the ability to perform a certain task (analogies). In both cases we demonstrate the ability to train a model that is demonstrably less biased than the original one, yet still performs extremely well on the task at hand. We discuss difficulties in getting these models to converge, and we propose, for the common case of discrete output and protected variables, a simple adversary that is usable regardless of the complexity of the underlying model.

8 Future Work

This process yields many questions that require further work to answer.

  1. The debiased word embeddings we have trained are still useful in analogies. Are they still useful in other, more complex tasks?

  2. The adversarial training method is finicky, in that getting the hyperparameters wrong results in quick divergence of the algorithm. What methods can be used to stabilize training and ensure convergence, so that the theoretical guarantees presented here apply?

  3. There is a body of existing work for image recognition using adversarial networks. Image recognition in general can sometimes be subject to various biases such as being more or less successful at recognizing the faces of people of different races. Can multiple adversaries be combined to create high accuracy image recognition systems which do not exhibit such biases?

  4. In general, do more complex predictors require more complex adversaries? It appears that when Y and Z are discrete, a very simple adversary suffices no matter how complex the predictor. Does this carry over to continuous cases, or would a simple adversary be too easy for a complex predictor to deceive?