1 Introduction
Machine learning leverages data to build models capable of assessing the labels and properties of novel data. Unfortunately, the available training data frequently contains biases with respect to attributes that we would rather not use for decision making. Because machine learning builds models faithful to the training data, it can perpetuate these undesirable biases. For example, systems designed to predict creditworthiness and systems designed to perform analogy completion have been demonstrated to be biased against racial minorities and women, respectively. Ideally, we would be able to build a model that captures exactly those generalizations from the data that are useful for performing some task, without being discriminatory in a way that the people building the model consider unfair.
Work on training machine learning systems that output fair decisions has defined several useful measurements for fairness: Demographic Parity, Equality of Odds, and Equality of Opportunity. These can be imposed as constraints or incorporated into a loss function in order to mitigate disproportional outcomes in the system’s output predictions regarding a protected demographic, such as sex.
In this paper, we examine these fairness measures in the context of adversarial debiasing. We consider supervised deep learning tasks in which the task is to predict an output variable $Y$ given an input variable $X$, while remaining unbiased with respect to some variable $Z$. We refer to $Z$ as the protected variable. For these learning systems, the predictor $\hat{Y} = f(X)$ can be constructed from (input, output, protected) tuples $(X, Y, Z)$. The predictor $f(X)$ is usually given access to the protected variable $Z$, though this is not strictly necessary. This construction allows the types of bias that are considered undesirable for a particular application to be chosen through the specification of the protected variable. We refer to the mitigation of bias by the established term debiasing,^{1} following definitions provided by Hardt et al. (2016) and refined by Beutel et al. (2017).
^{1} Note that “debias” may not be quite the right word, as not all bias is necessarily removed.
Definition 1.
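Demographic Parity. A predictor $\hat{Y}$ satisfies demographic parity if $\hat{Y}$ and $Z$ are independent.
This means that $P(\hat{Y} = \hat{y})$ is equal for all values of the protected variable $Z$: $P(\hat{Y} = \hat{y}) = P(\hat{Y} = \hat{y} \mid Z = z)$.
Definition 2.
Equality of Odds. A predictor $\hat{Y}$ satisfies equality of odds if $\hat{Y}$ and $Z$ are conditionally independent given $Y$.
This means that, for all possible values of the true label $Y$, $P(\hat{Y} = \hat{y})$ is the same for all values of the protected variable: $P(\hat{Y} = \hat{y} \mid Y = y) = P(\hat{Y} = \hat{y} \mid Z = z, Y = y)$.
Definition 3.
Equality of Opportunity. If the output variable $Y$ is discrete, a predictor $\hat{Y}$ satisfies equality of opportunity with respect to a class $y$ if $\hat{Y}$ and $Z$ are independent conditioned on $Y = y$.
This means that, for a particular value of the true label $Y$, $P(\hat{Y} = \hat{y})$ is the same for all values of the protected variable: $P(\hat{Y} = \hat{y} \mid Y = y) = P(\hat{Y} = \hat{y} \mid Z = z, Y = y)$.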
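The three definitions can be checked empirically on a set of predictions. The following is a minimal sketch, not part of the original method: it assumes binary predictions and labels, and the function names and toy data are hypothetical. Each function reports the largest gap in positive-prediction rate between protected groups, so a value of 0 means the corresponding definition holds exactly on the sample.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among the examples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected) if selected else float("nan")


def demographic_parity_gap(y_hat, z):
    """Largest difference in P(Y_hat = 1 | Z = g) across protected groups g."""
    rates = [positive_rate(y_hat, [zi == g for zi in z]) for g in sorted(set(z))]
    return max(rates) - min(rates)


def equal_opportunity_gap(y_hat, y, z, label=1):
    """Largest difference in P(Y_hat = 1 | Y = label, Z = g) across groups g."""
    rates = [
        positive_rate(y_hat, [zi == g and yi == label for yi, zi in zip(y, z)])
        for g in sorted(set(z))
    ]
    return max(rates) - min(rates)


def equalized_odds_gap(y_hat, y, z):
    """Worst equal-opportunity gap over every value of the true label."""
    return max(equal_opportunity_gap(y_hat, y, z, label) for label in sorted(set(y)))


# Hypothetical toy sample: true labels, protected attribute, binary predictions.
y = [1, 1, 0, 0, 1, 1, 0, 0]
z = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [1, 0, 0, 0, 1, 1, 1, 0]

dp_gap = demographic_parity_gap(y_hat, z)
opp_gap = equal_opportunity_gap(y_hat, y, z, label=1)
odds_gap = equalized_odds_gap(y_hat, y, z)
```

On this toy sample every gap is 0.5, because group 1 receives positive predictions more often than group 0 both overall and within each true label. A group with no examples for some label yields NaN, which a real evaluation would need to handle.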
We present an adversarial technique for achieving whichever one of these definitions is desired.^{2} A predictor will be trained to model $Y$ as accurately as possible while satisfying one of the above equality constraints. Demographic parity will be achieved by introducing an adversary that attempts to predict a value for $Z$ from $\hat{Y}$. The gradient of the adversary will then be incorporated into the weight update rule of the predictor so as to reduce the amount of information about $Z$ transmitted through $\hat{Y}$. Equality of odds will be achieved by also giving the adversary access to the true label $Y$, thereby limiting any information about $Z$ that $\hat{Y}$ contains beyond the information already contained in $Y$.
^{2} Achieving equality of odds and demographic parity are generally incongruent goals. See also Kleinberg, Mullainathan, and Raghavan (2016) for incongruency between calibration and equalized odds.
We consider the case where the protected variable is a discrete feature present in the training set as well as the case in which the protected variable must be inferred from latent semantics (in particular, gender from word embeddings). To accomplish the latter, we adapt a technique presented by Bolukbasi et al. (2016) to define a subspace capturing the semantics of the protected variable, and then train a model to perform a word analogy task accurately while remaining unbiased on this protected variable. A consequence of this technique is that the network learns “debiased” embeddings: embeddings that have the semantics of the protected variable removed. These embeddings are still able to perform the analogy task well, but are better at avoiding problematic examples such as those shown in Bolukbasi et al. (2016).
Results on the UCI Adult Dataset demonstrate the technique we introduce allows us to train a model that achieves equality of odds to within 1% on both protected groups.
We also compare with the related previous work of Beutel et al. (2017), and find that we are able to better equalize the differences between the two groups, measured by both False Positive Rate and False Negative Rate (1 − True Positive Rate), although we note that the previous work performs better overall for False Negative Rate.
We provide some discussion on caveats pertaining to this approach, difficulties in training these models that are shared by many adversarial approaches, as well as some discussion on difficulties that the fairness constraints introduce.
2 Related Work
There has been significant work done in the area of debiasing various specific types of data or predictors.
Debiasing word embeddings: Bolukbasi et al. (2016) devise a method to remove gender bias from word embeddings. The method relies on a lot of human input; namely, it needs a large “training set” of gender-specific words.
Simple models: Lum and Johndrow (2016) demonstrate that removing the protected variable from the training data fails to yield a debiased model (since other variables can be highly correlated with the protected variable), and devise a method for learning fair predictive models in cases when the learning model is simple (e.g. linear regression). Hardt et al. (2016) discuss the shortcomings of focusing solely on demographic parity, present alternate definitions of fairness, and devise a method for deriving an unbiased predictor from a biased one, in cases when both the output variable and the protected variable are discrete.
Adversarial training: Goodfellow et al. (2014) pioneered the technique of using multiple networks with competing goals to force the first network to “deceive” the second network, applying this method to the problem of creating real-life-like pictures. Beutel et al. (2017) apply an adversarial training method to achieve equality of opportunity in cases when the output variable is discrete. They also discuss the ability of the adversary to be powerful enough to enforce a fairness constraint even when it has access to a very small training sample.
3 Adversarial Debiasing
We begin with a model, which we call the predictor, trained to accomplish the task of predicting $Y$ given $X$. As in Figure 1, we assume that the model is trained by attempting to modify weights $W$ to minimize some loss $L_P(\hat{y}, y)$, using a gradient-based method such as stochastic gradient descent.
The output layer of the predictor is then used as an input to another network called the adversary, which attempts to predict $Z$. This part of the network corresponds to the discriminator in a typical GAN [Goodfellow et al.2014]. We will suppose the adversary has loss term $L_A(\hat{z}, z)$ and weights $U$. Depending on the definition of fairness being achieved, the adversary may have other inputs.

For Demographic Parity, the adversary gets the predicted label $\hat{Y}$. Intuitively, this allows the adversary to try to predict the protected variable using nothing but the predicted label. The goal of the predictor is to prevent the adversary from doing this.

For Equality of Odds, the adversary gets $\hat{Y}$ and the true label $Y$.

For Equality of Opportunity on a given class $y$, we can restrict the training set of the adversary to training examples where $Y = y$.^{3}
^{3} This last technique of restricting the training set is discussed at length by Beutel et al. (2017), so we only mention it here.
In order for gradients to propagate correctly, $\hat{Y}$ above refers to the output layer of the network, not to the discrete prediction; for example, for a classification problem, $\hat{Y}$ could refer to the output of the softmax layer.
We update $U$ to minimize $L_A$ at each training time step, according to the gradient $\nabla_U L_A$. We modify $W$ according to the expression:
$$\nabla_W L_P - \mathrm{proj}_{\nabla_W L_A} \nabla_W L_P - \alpha \nabla_W L_A \qquad (1)$$
where $\alpha$ is a tuneable hyperparameter that can vary at each time step, and we define $\mathrm{proj}_v x = 0$ if $v = 0$.
The middle term prevents the predictor from moving in a direction that helps the adversary decrease its loss, while the last term, $-\alpha \nabla_W L_A$, attempts to increase the adversary’s loss. Without the projection term, it is possible for the predictor to end up helping the adversary (see Fig. 2). Without the last term, the predictor will never try to hurt the adversary, and, due to the stochastic nature of many gradient-based methods, will likely end up helping the adversary anyway. The result is that when training is completed the desired definition of equality should be satisfied.
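The predictor update can be sketched on plain vectors. This is an illustrative sketch only: it assumes Eq. 1 has the form $\nabla_W L_P - \mathrm{proj}_{\nabla_W L_A} \nabla_W L_P - \alpha \nabla_W L_A$ with flattened gradients, and the function names and example values are hypothetical. A real implementation would obtain both gradients from an automatic differentiation framework.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(x, v):
    """Projection of x onto v; defined as the zero vector when v is zero."""
    vv = dot(v, v)
    if vv == 0.0:
        return [0.0] * len(x)
    scale = dot(x, v) / vv
    return [scale * a for a in v]


def predictor_update(grad_p, grad_a, alpha):
    """Assumed form of Eq. 1: grad_p - proj_{grad_a}(grad_p) - alpha * grad_a.

    grad_p: gradient of the predictor loss L_P with respect to W.
    grad_a: gradient of the adversary loss L_A with respect to W.
    """
    proj = project(grad_p, grad_a)
    return [gp - pr - alpha * ga for gp, pr, ga in zip(grad_p, proj, grad_a)]


# Hypothetical gradients: the predictor's own gradient partly helps the adversary.
update = predictor_update(grad_p=[1.0, 1.0], grad_a=[1.0, 0.0], alpha=0.5)

# With a zero adversary gradient the update reduces to the plain task gradient.
plain = predictor_update(grad_p=[2.0, 3.0], grad_a=[0.0, 0.0], alpha=1.0)
```

In the first call the component of the task gradient along the adversary gradient is removed, and the remaining component along that direction is $-\alpha$ times the adversary gradient. A descent step $W \leftarrow W - \eta \cdot \text{update}$ therefore moves up the adversary's loss while still following the part of the task gradient that is orthogonal to it.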
Notice that our definitions and method make no assumptions about the nature of the output and protected variables: in particular, they work with both regression and classification models, as well as with both discrete and continuous protected variables.
4 Properties
We note several properties of the above method that we believe distinguish it from past work.

Generality: The above method can be used to enforce demographic parity, equality of odds, or equality of opportunity as described in Hardt et al. (2016). Further, it applies without modification to the cases when the output variable and/or protected variable are continuous instead of discrete.

Model-agnostic: The adversarial approach described can be applied regardless of how simple or complex the predictor’s model is, as long as the model is trained using a gradient-based method, as many modern learning models are. Further, as we will discuss later, we suggest that, at least in some situations, the adversary does not need to be nearly as complex as the predictor: a simple adversary can be used with a complex predictor.

Optimality: Under certain conditions, we show that if the predictor converges, it must converge to a model that satisfies the desired fairness definition. Since the predictor also attempts to decrease the prediction loss $L_P$, the predictor should still perform well on the target task.
5 Theoretical Guarantees
Proposition 1.
Let the predictor, the adversary, and their weights $W$, $U$ be defined according to Section 3. Let $L_A(W, U)$ be the adversary’s loss, convex in $U$, concave in $W$, and continuously differentiable everywhere.^{4}
^{4} We understand that these assumptions are not satisfied in most use cases involving neural networks; however, as with most theoretical analyses of machine learning models (see, for example, Goodfellow et al. (2014) or Kingma and Ba (2014); the former makes even stronger assumptions), assumptions of concavity are necessary for any proofs to work.
Suppose that:

When the predictor’s weights are $W_0$, the predictor gives the same output regardless of input $x$. (For example, when $W_0 = 0$.)

There are some weights $U_0$ that minimize $L_A(W_0, U)$ when the weights for $\hat{y}$ have no effect on the output: for all $W$, $L_A(W, U_0) = L_A(W_0, U_0)$.

Predictor and adversary converge to $W^*$ and $U^*$ respectively.
Then, $L_A(W^*, U^*) = L_A(W_0, U_0)$. That is, the adversary gains no advantage from using the weights for $\hat{y}$.
Proof.
Since the adversary converges, $L_A(W^*, U^*) \le L_A(W^*, U_0)$: otherwise, since $L_A$ is convex in $U$, the adversary’s weights would move toward $U_0$. In other words, the adversary’s minimum is the point at which the adversary gains an advantage from using $\hat{y}$. Similarly, since the predictor converges, $L_A(W^*, U^*) \ge L_A(W_0, U^*)$: otherwise, the predictor would be able to increase the adversary’s loss by moving toward $W_0$, and the projection term and negative weight on $\nabla_W L_A$ in Eqn. 1 would push the predictor to move towards $W_0$. Then:
$$L_A(W_0, U_0) = L_A(W^*, U_0) \ge L_A(W^*, U^*) \ge L_A(W_0, U^*) \ge L_A(W_0, U_0),$$
so we must have $L_A(W^*, U^*) = L_A(W_0, U_0)$. ∎
Note that, in this proof, the adversary can be operating in a few different ways, as long as it is given $\hat{y}$ as one of its inputs; for example, for demographic parity, it could be given only $\hat{y}$; for equality of odds, it can be given both $\hat{y}$ and $y$.
We will show in the next propositions that the adversary gaining no advantage from information about $\hat{y}$ is exactly the condition needed to guarantee that desired definitions of equality are satisfied.
Proposition 2.
Let the training data be comprised of triples $(x, y, z)$ drawn according to some distribution $D$. Suppose:

The protected variable $Z$ is discrete.

The adversary is trained for demographic parity; i.e. the adversary is given only the prediction $\hat{y}$.

The adversary is strong enough that, at convergence, it has learned a randomized function $A$ that minimizes the cross-entropy loss $E_{(x,y,z) \sim D}[-\log P(A(\hat{y}) = z)]$; i.e. the adversary in fact achieves the optimal accuracy with which one can predict $Z$ from $\hat{Y}$.

The predictor completely fools the adversary; in particular, the adversary achieves loss $H(Z)$, the entropy of $Z$.
Then the predictor satisfies demographic parity; i.e., $\hat{Y} \perp Z$.
Proof.
Notice that if the adversary draws $A(\hat{y})$ according to the distribution $Z \mid \hat{Y} = \hat{y}$, then its loss is exactly the conditional entropy
$$E[-\log P(Z = z \mid \hat{Y} = \hat{y})] = H(Z \mid \hat{Y}),$$
where the expectation is taken over $(x, y, z) \sim D$. Now suppose for contradiction that $\hat{Y}$ is dependent on $Z$. Then $H(Z \mid \hat{Y}) < H(Z)$, so the adversary can achieve loss less than $H(Z)$, contradicting assumption (4). ∎
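The entropy argument in this proof can be checked numerically on small discrete distributions. The sketch below is illustrative only; the joint distributions and function names are hypothetical. It computes $H(Z)$ and $H(Z \mid \hat{Y})$ from a joint probability table and shows that the conditional entropy drops below $H(Z)$ exactly when the prediction carries information about the protected variable.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def marginal_and_conditional_entropy(joint):
    """joint[(y_hat, z)] = P(Y_hat = y_hat, Z = z). Returns (H(Z), H(Z | Y_hat))."""
    p_z, p_yhat = {}, {}
    for (y_hat, z), p in joint.items():
        p_z[z] = p_z.get(z, 0.0) + p
        p_yhat[y_hat] = p_yhat.get(y_hat, 0.0) + p
    h_z = entropy(p_z.values())
    h_z_given_yhat = 0.0
    for y_hat, p_y in p_yhat.items():
        if p_y == 0.0:
            continue
        # Distribution of Z restricted to this prediction value, renormalized.
        cond = [p / p_y for (yh, _), p in joint.items() if yh == y_hat]
        h_z_given_yhat += p_y * entropy(cond)
    return h_z, h_z_given_yhat


# Hypothetical joints: predictions independent of Z, and predictions correlated with Z.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h_ind, hc_ind = marginal_and_conditional_entropy(independent)
h_dep, hc_dep = marginal_and_conditional_entropy(dependent)
```

In the independent case the two entropies coincide, so an optimal adversary cannot beat loss $H(Z)$. In the dependent case the conditional entropy is strictly smaller, which is the gap the adversary would exploit.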
Proposition 3.
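If assumptions (2)-(4) above are replaced with the analogous equality of odds assumptions; in particular, that the adversary is given $\hat{y}$ and $y$, and the adversary cannot achieve loss better than $H(Z \mid Y)$, then the predictor will satisfy Equality of Odds; i.e., $(\hat{Y} \perp Z) \mid Y$.
Proof.
Analogous to the above. Notice that if the adversary draws $A(\hat{y}, y)$ according to the distribution $Z \mid \hat{Y} = \hat{y}, Y = y$, then its loss is exactly the conditional entropy
$$E[-\log P(Z = z \mid \hat{Y} = \hat{y}, Y = y)] = H(Z \mid \hat{Y}, Y),$$
where the expectation is again taken over $(x, y, z) \sim D$. But if $\hat{Y}$ is conditionally dependent on $Z$ given $Y$, then $H(Z \mid \hat{Y}, Y) < H(Z \mid Y)$, so the adversary can achieve loss less than $H(Z \mid Y)$. ∎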
Note that Propositions 2 and 3 work analogously in the case of continuous $Y$ and $Z$, with the probability mass function $P$ replaced with the probability density function $p$, and the discrete entropy $H$ replaced by the differential entropy $h$, since the relevant property ($h(Z \mid \hat{Y}) = h(Z)$ iff $\hat{Y} \perp Z$) holds for differential entropy as well. They also work analogously when the adversary is restricted to a limited set of predictors.
For example, an adversary using least-squares regression trying to enforce equality of odds can be thought of as one that outputs $\hat{z} \sim N(\mu(\hat{y}, y), \sigma^2)$, where $\mu$ is the output of the regressor, and $\sigma^2$ is a fixed constant. Note now that the differential entropy is nothing more than the expected negative log-likelihood, and so the function $\mu$ that minimizes this quantity is the optimal least-squares regressor. Thus, for example, if we restrict $\mu$ to be a linear function of $(\hat{y}, y)$, and the other conditions of Proposition 3 hold, then an analogous argument to the above propositions shows that $\hat{Y}$ has no linear relationship with $Z$ after conditioning on $Y$.
These claims together illustrate that a sufficiently powerful adversary trained on a sufficiently large training set can indeed accurately enforce the demographic parity or equality of odds constraints on the predictor, if the adversary and predictor converge. Guaranteed convergence is harder to achieve, both in theory and practice. In the practical scenarios below we discuss methods to encourage the training algorithm to converge, as well as reasonable choices of the adversary model that are both powerful and easy to train.
6 Experiments
All models were trained using the Adam optimizer [Kingma and Ba2014] for both predictor and adversary.
Toy Scenario
We generate a training sample (where is the protected variable) as follows. For each , let be picked uniformly at random, and let . Let vary independently. Then . (where denotes an indicator function). Intuitively, the variable that we are trying to predict, , depends directly on and . We are given as inputs the protected variable , and a noisy measurement of . The end goal would be to train a model that predicts while being unbiased on , effectively removing the direct signal for from the learned model.
If one trains generically a logistic regression model to predict given , it outputs something like , which is a reasonable model, but heavily incorporates the protected variable . To debias, we now train a model that achieves demographic parity. Note that removing the variable from the training data is insufficient for debiasing: the model will still learn to use to predict , and is correlated with . If we use the described technique and add in another logistic model that tries to predict given , we find that the predictor model outputs something like . Notice that not only is not included with a positive weight anymore, the model actually learns to use a negative weight on in order to balance out the effect of on . Notice that ; i.e., it is not dependent on , so we have successfully trained a model to predict independently of .
Word Embeddings
We train a model to perform the analogy task (i.e., fill in the blank: man : woman :: he : ?).
It is known that word embeddings reflect or amplify problematic biases from the data they are trained on, for example, gender [Bolukbasi et al.2016]. We seek to train a model that can still solve analogies well, but is less prone to these gender biases. We first calculate a “gender direction” $g$ using a method based on Bolukbasi et al. (2016), which gives a method for defining the protected variable. We will use this technique in the context of defining gender for word embeddings, but, as discussed in Bolukbasi et al. (2016), the technique generalizes to other protected variables and other forms of embeddings. Following Bolukbasi et al. (2016), we pick 10 (male, female) word pairs, and define the bias subspace to be the space spanned by the top $k$ principal components of the differences, where $k$ is a tuneable parameter. In our experiments, we find that $k = 1$ gives reasonable results, so we did not experiment further.
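As a rough illustration of how a one-dimensional bias subspace could be extracted from word-pair differences, the sketch below uses power iteration to find the leading eigenvector of the summed outer products of the difference vectors. Several things here are assumptions rather than details from the text: a single component is kept, the differences are not mean-centered, and the three-dimensional "embeddings" and the function name are hypothetical toy values.

```python
import math


def top_direction(diffs, iterations=200):
    """Leading eigenvector of sum_i d_i d_i^T, found by power iteration.

    Assumes the deterministic start vector is not orthogonal to that eigenvector.
    """
    dim = len(diffs[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # (sum_i d_i d_i^T) v equals sum_i (d_i . v) d_i, so no matrix is stored.
        nxt = [0.0] * dim
        for d in diffs:
            coeff = sum(a * b for a, b in zip(d, v))
            for j in range(dim):
                nxt[j] += coeff * d[j]
        norm = math.sqrt(sum(a * a for a in nxt))
        if norm == 0.0:
            break
        v = [a / norm for a in nxt]
    return v


# Hypothetical (male, female) embedding pairs that differ mostly along the first axis.
pairs = [
    ([1.0, 0.2, 0.0], [-1.0, 0.1, 0.0]),
    ([0.9, -0.1, 0.3], [-1.1, 0.0, 0.2]),
    ([1.2, 0.0, -0.2], [-0.8, 0.1, -0.1]),
]
diffs = [[a - b for a, b in zip(male, female)] for male, female in pairs]
g = top_direction(diffs)
```

With real embeddings one would normally use a linear algebra library's SVD or PCA routine instead; the loop above only makes the computation explicit.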
We use embeddings trained from Wikipedia to generate input data from the Google analogy data set [Mikolov et al.2013]. For each analogy in the dataset, we let $x$ comprise the word vectors for the first three words, $y$ be the word vector of the fourth word, and $z$ be $\mathrm{proj}_g\, y$. It is worth noting that these word vectors computed from the original embeddings are never updated, nor is there projection onto the bias subspace, and therefore the original word embeddings are never modified. What is learned is a transform from a biased embedding space to a debiased embedding space.

Table 1: Example of changes in analogy completions after debiasing.
biased neighbor | similarity | debiased neighbor | similarity
nurse | 1.0121 | nurse | 0.7056
nanny | 0.9035 | obstetrician | 0.6861
fiancée | 0.8700 | pediatrician | 0.6447
maid | 0.8674 | dentist | 0.6367
fiancé | 0.8617 | surgeon | 0.6303
mother | 0.8612 | physician | 0.6254
fiance | 0.8611 | cardiologist | 0.6088
dentist | 0.8569 | pharmacist | 0.6081
woman | 0.8564 | hospital | 0.5969
As a model, we use the following: let $v = x_2 + x_3 - x_1$, and output $\hat{y} = v - w w^T v$, where our model parameter is $w$. Intuitively, $v$ is the “generic” analogy vector as is commonly^{5} used for the analogy task. If left to its own devices (i.e., if not told to be unbiased on anything), the model should either learn $w = 0$ or else learn $w$ as a useless vector.
^{5} See e.g. Mikolov et al. (2013).
By contrast, if we add the adversarial discriminator network (here, simply $\hat{z} = w_2^T \hat{y}$), we expect the debiased prediction model to learn that $w$ should be something close to $g$ (or $-g$), so that the discriminator cannot predict $z$. Indeed, both of these expectations hold: without debiasing, the trained vector $w$ is approximately a unit vector nearly perpendicular to $g$; with debiasing, $w$ is approximately a unit vector pointing in a direction highly correlated with $g$. Even after debiasing, gendered analogies such as man : woman :: he : she are still preserved; however, many biased analogies go away, suggesting that the adversarial training process was indeed successful. An example of the kinds of changes in analogy completions observed after debiasing is illustrated in Table 1.^{6}
^{6} The presence of nurse in the second position may seem worrying, but it should be noted that in this particular set of word embeddings, nurse is the nearest neighbor to doctor; no amount of debiasing will change this.
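The effect of the learned vector on the analogy prediction can be illustrated with a few lines of code. This sketch assumes the predictor has the form $\hat{y} = v - w (w^T v)$ with $v = x_2 + x_3 - x_1$; the three-dimensional vectors, the direction `g`, and the function names are hypothetical toy values, not trained embeddings.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def analogy_prediction(x1, x2, x3, w):
    """Assumed predictor: y_hat = v - w (w . v), where v = x2 + x3 - x1."""
    v = [b + c - a for a, b, c in zip(x1, x2, x3)]
    coeff = dot(w, v)
    return [vi - coeff * wi for vi, wi in zip(v, w)]


g = [1.0, 0.0, 0.0]  # hypothetical unit-length gender direction
x1, x2, x3 = [0.5, 0.1, 0.2], [-0.5, 0.1, 0.3], [0.6, 0.4, 0.0]

# With w = 0 the model returns the generic analogy vector v unchanged.
undebiased = analogy_prediction(x1, x2, x3, w=[0.0, 0.0, 0.0])

# With w equal to the unit vector g, the component of v along g is removed.
debiased = analogy_prediction(x1, x2, x3, w=g)
```

When `w` matches the gender direction, the prediction has no component along `g`, so a linear adversary reading $\hat{y}$ has nothing to recover; the other coordinates of the prediction are left untouched.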
UCI Adult Dataset
Feature | Type | Description
age | Cont | Age of the individual
capital_gain | Cont | Capital gains recorded
capital_loss | Cont | Capital losses recorded
education_num | Cont | Highest education level (numerical form)
fnlwgt | Cont | Number of people the census takers believe the observation represents
hours_per_week | Cont | Hours worked per week
education | Cat | Highest level of education achieved
income | Cat | Whether the individual makes > $50K annually
marital_status | Cat | Marital status
native_country | Cat | Country of origin
occupation | Cat | Occupation
race | Cat | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
relationship | Cat | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
sex | Cat | Female, Male
workclass | Cat | Employer type
Table 2: Features in the UCI dataset per individual. Features are either continuous (Cont) or categorical (Cat). Categorical features are converted to sparse tensors for the model.
To better align with the work in Beutel et al. (2017), we attempt to enforce equality of odds on a model for the task of predicting the income of a person (in particular, predicting whether the income is greater than $50K) given various attributes about the person, as made available in the UCI Adult dataset [Asuncion and Newman2007].
Details on the features that the dataset provides are available in Table 2. We use both categorical and continuous columns as given, with the exception of the fnlwgt feature, which we discard. We convert the remaining columns into tensors where the categorical columns are sparse tensors, age is bucketized, and the rest of the continuous columns are real-valued.
As discussed before, to enforce equality of odds, we give the adversary access to the true label . The adversary will learn the relationship between and regardless of what the predictor does; further, if the predictor’s predictions give more information about than is already contained in , the adversary will be able to improve its loss. Thus, the predictor, in attempting to fool the adversary, will move toward making sure that does not give such additional information; in other words, toward equality of odds.
Our protected variable is a binaryvalued variable for the two sexes annotated, male and female.
Our predictor model is straightforward logistic regression: $\hat{y} = \sigma(w_1 \cdot x + b)$, where $\sigma$ is the sigmoid function. Our adversary model takes the form of the following logistic-regression-like model:
$$s = \sigma\big((1 + |c|)\,\sigma^{-1}(\hat{y})\big), \qquad \hat{z} = w_2 \cdot [s,\; s y,\; s(1 - y)] + b,$$
where $c$ and $b$ are learnable scalars, $w_2$ is a learnable vector, and $\sigma^{-1}$ is the inverse of the sigmoid function (logit function), $\sigma^{-1}(t) = \log t - \log(1 - t)$. Intuitively, we want our adversary to be able to learn functions of the form $\hat{z} = f(y, [\hat{y} > 0.5])$ (i.e. dependent only on the boolean predicted value $[\hat{y} > 0.5]$), and thus enforce equality of odds. Here, the adversary would learn such a function by making $|c|$ extremely large. We add 1 to $|c|$ to make sure the adversary never tries to ignore $\hat{y}$ by setting $c = 0$, which could be a difficult local minimum for the adversary to escape.^{7} This adversary is both general enough to be used whenever $Y$ and $Z$ are both discrete,^{8} and powerful enough that deviation from true equality of odds should cause the adversary to be able to decrease its loss.
^{7} This value added to $|c|$ is an adjustable hyperparameter; we found reasonable results using the value 1 and thus did not feel the need to experiment further.
^{8} If $Y$ and $Z$ are multi-class, then the sigmoid becomes a softmax, but everything else remains the same.
Without tweaking, this algorithm ran into issues with local minima, and the resulting models were often closer to demographic parity than equality of odds. We implemented a technique that helped: by increasing the hyperparameter $\alpha$ in Eqn. 1 over time, the predictor had a much easier time learning to deceive the adversary and therefore more strictly enforce equality of odds. We set $\alpha = \sqrt{t}$ (where $t$ is the step counter), and to avoid divergence we set the predictor’s step size to $\eta \propto 1/t$, so that $\alpha \eta \to 0$, as is preferred for stochastic gradient-based methods such as Adam.
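To make the role of the sharpening scalar concrete, the following sketch evaluates one plausible reading of the adversary described above: $s = \sigma((1 + |c|)\,\sigma^{-1}(\hat{y}))$ followed by a linear read-out $w_2 \cdot [s, s y, s(1 - y)] + b$. That functional form, the function names, and all parameter values are assumptions made for illustration; this is not a trained model.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Inverse of the sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sharpened(y_hat, c):
    """s = sigmoid((1 + |c|) * logit(y_hat)); a large |c| pushes s toward 0 or 1."""
    return sigmoid((1.0 + abs(c)) * logit(y_hat))


def adversary_output(y_hat, y, c, w2, b):
    """Assumed adversary read-out on the features [s, s*y, s*(1 - y)]."""
    s = sharpened(y_hat, c)
    features = [s, s * y, s * (1.0 - y)]
    return sum(wi * fi for wi, fi in zip(w2, features)) + b


soft = sharpened(0.6, c=0.0)        # c = 0 leaves the prediction unchanged
hard_high = sharpened(0.6, c=50.0)  # close to 1: behaves like the indicator [y_hat > 0.5]
hard_low = sharpened(0.4, c=50.0)   # close to 0
z_hat = adversary_output(0.6, y=1, c=0.0, w2=[1.0, 2.0, 3.0], b=0.5)
```

Because the multiplier is $1 + |c|$ rather than $|c|$, even $c = 0$ still passes $\hat{y}$ through to the adversary, which is the behavior the paragraph above motivates.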
We train the model twice, once with debiasing and once without, and present side-by-side confusion matrices on the test set for income bracket with respect to the protected variable values Male and Female in Table 3; we present the false positive rates (FPR) and false negative rates (FNR) in Table 4. Note that the false negative rate is equal to 1 − true positive rate, so the trade-offs are directly comparable to the values of an ROC curve.
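Table 3: Confusion matrices on the test set for income bracket, by sex, without and with debiasing.
Female | Without debiasing, Pred 0 | Without debiasing, Pred 1 | With debiasing, Pred 0 | With debiasing, Pred 1
True 0 | 4711 | 120 | 4518 | 313
True 1 | 265 | 325 | 263 | 327
Male | Without debiasing, Pred 0 | Without debiasing, Pred 1 | With debiasing, Pred 0 | With debiasing, Pred 1
True 0 | 6907 | 697 | 7071 | 533
True 1 | 1194 | 2062 | 1416 | 1840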
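Table 4: False positive rates (FPR) and false negative rates (FNR) by sex, without and with debiasing.
Model | Metric | Female, without | Female, with | Male, without | Male, with
Beutel et al. (2017) | FPR | 0.1875 | 0.0308 | 0.1200 | 0.1778
Beutel et al. (2017) | FNR | 0.0651 | 0.0822 | 0.1828 | 0.1520
Current work | FPR | 0.0248 | 0.0647 | 0.0917 | 0.0701
Current work | FNR | 0.4492 | 0.4458 | 0.3667 | 0.4349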
We notice that debiasing has only a small effect on overall accuracy (86.0% vs 84.5%), and that the debiased model indeed (nearly) obeys equality of odds: as shown in Table 4, with debiasing, the FNR and FPR values are approximately equal across sex subgroups: $FPR_{female} = 0.0647 \approx FPR_{male} = 0.0701$ and $FNR_{female} = 0.4458 \approx FNR_{male} = 0.4349$.
Although the values don’t exactly reach equality, neither difference is statistically significant: a two-proportion two-tail large-sample test yields $p$-values of 0.25 for FPR and 0.62 for FNR.
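The significance check can be reproduced from the debiased confusion matrices in Table 3. The sketch below assumes the test is the standard pooled two-proportion $z$-test with a two-sided normal $p$-value; the text does not spell out the exact variant, so that choice and the function name are assumptions.

```python
import math


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # P(|N(0, 1)| > |z|) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Debiased counts from Table 3 (tn, fp, fn, tp per group).
female = {"tn": 4518, "fp": 313, "fn": 263, "tp": 327}
male = {"tn": 7071, "fp": 533, "fn": 1416, "tp": 1840}

# FPR compares false positives among true negatives; FNR compares false negatives
# among true positives.
p_fpr = two_proportion_p_value(
    female["fp"], female["fp"] + female["tn"], male["fp"], male["fp"] + male["tn"]
)
p_fnr = two_proportion_p_value(
    female["fn"], female["fn"] + female["tp"], male["fn"], male["fn"] + male["tp"]
)
```

Under these assumptions the two values land close to the reported 0.25 and 0.62, so neither difference would be rejected at conventional significance levels.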
7 Conclusion
In this work, we demonstrate a general and powerful method for training unbiased machine learning models. We state and prove theoretical guarantees for our method under reasonable assumptions, demonstrating in theory that the method can enforce the constraints that we claim, across multiple definitions of fairness, regardless of the complexity of the predictor’s model, or the nature (discrete or continuous) of the predicted and protected variables in question. We apply the method in practice to two very different scenarios: a standard supervised learning task, and the task of debiasing word embeddings while still maintaining ability to perform a certain task (analogies). We demonstrate in both cases the ability to train a model that is demonstrably less biased than the original one, and yet still performs extremely well on the task at hand. We discuss difficulties in getting these models to converge. We propose, in the common case of discrete output and protected variables, a simple adversary that is usable regardless of the complexity of the underlying model.
8 Future Work
This process yields many questions that require further work to answer.

The debiased word embeddings we have trained are still useful in analogies. Are they still useful in other, more complex tasks?

The adversarial training method is hard to get right and often touchy, in that getting the hyperparameters wrong results in quick divergence of the algorithm. What ways can be used to stabilize training and ensure convergence, and thus ensure that the theoretical guarantees presented here can work?

There is a body of existing work for image recognition using adversarial networks. Image recognition in general can sometimes be subject to various biases such as being more or less successful at recognizing the faces of people of different races. Can multiple adversaries be combined to create high accuracy image recognition systems which do not exhibit such biases?

In general, do more complex predictors require more complex adversaries? It would appear that in the case of $Y$ and $Z$ discrete, a very simple adversary suffices no matter how complex the predictor. Does this also apply to continuous cases, or would a simple adversary be too easy to deceive for a complex predictor?
References
 [Asuncion and Newman2007] Asuncion, A., and Newman, D. 2007. UCI machine learning repository.
 [Beutel et al.2017] Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. H. 2017. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075.
 [Bolukbasi et al.2016] Bolukbasi, T.; Chang, K.-W.; Zou, J. Y.; Saligrama, V.; and Kalai, A. T. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, 4349–4357.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
 [Hardt et al.2016] Hardt, M.; Price, E.; Srebro, N.; et al. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 3315–3323.
 [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Kleinberg, Mullainathan, and Raghavan2016] Kleinberg, J.; Mullainathan, S.; and Raghavan, M. 2016. Inherent tradeoffs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
 [Lum and Johndrow2016] Lum, K., and Johndrow, J. 2016. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077.
 [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.