Fairness in Supervised Learning: An Information Theoretic Approach

01/13/2018 ∙ by AmirEmad Ghassami, et al. ∙ University of Illinois at Urbana-Champaign

Automated decision making systems are increasingly being used in real-world applications. In these systems for the most part, the decision rules are derived by minimizing the training error on the available historical data. Therefore, if there is a bias related to a sensitive attribute such as gender, race, religion, etc. in the data, say, due to cultural/historical discriminatory practices against a certain demographic, the system could continue discrimination in decisions by including the said bias in its decision rule. We present an information theoretic framework for designing fair predictors from data, which aim to prevent discrimination against a specified sensitive attribute in a supervised learning setting. We use equalized odds as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. To ensure fairness and generalization simultaneously, we compress the data to an auxiliary variable, which is used for the prediction task. This auxiliary variable is chosen such that it is decontaminated from the discriminatory attribute in the sense of equalized odds. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable.




I Introduction

Automated decision making systems based on statistical inference and learning are increasingly common in a wide range of real-world applications such as health care, law enforcement, education, and finance. These systems are trained on historical data, which might be biased with respect to certain attributes of the data points [1, 2, 3]. Hence, using such data without accounting for possible biases could result in discrimination, which is defined as a gratuitous distinction between individuals with different sensitive attributes. These attributes include sex, race, and religion, and are referred to as protected attributes in the literature. As an example, in the US justice system, courts use features of individuals such as their age, race, sex, and years spent in jail to estimate their risk of recidivism (future arrest). After considering these features, the court assigns a score to each incarcerated individual and decides whether to release that person by comparing the score against a threshold. As noted in the analysis of Angwin et al. [4], risk scores in the criminal justice system, specifically the COMPAS risk tool, are biased against African-Americans: the score assigns African-American individuals a higher risk of recidivism than is warranted. As another example, the authors in [5] studied the accuracy of gender representation in online image searches. Their results indicate that, for instance, in a Google image search for “C.E.O.”, 11 percent of the people depicted were women, even though 27 percent of U.S. chief executives are women; and in a search for “telemarketer”, 64 percent of the people depicted were women, while the occupation is evenly split between men and women.

There is an interesting connection between the problem of fairness and differential privacy [6, 7, 8]. Just as in differential privacy one tries to hide the identity of individuals, in the fairness problem the goal is to hide information about the protected attribute. More details regarding this connection are presented in [1].

Different criteria for assessing discrimination have been suggested in the literature. The most commonly used criterion is the so-called demographic parity, which requires the predictor to be statistically independent of the protected attribute. That is, denoting the protected attribute and the prediction by $A$ and $\hat{Y}$, respectively, demographic parity requires the model to satisfy

$$P(\hat{Y} = \hat{y} \mid A = a) = P(\hat{Y} = \hat{y} \mid A = a'), \quad \forall a, a', \hat{y}.$$

While demographic parity and its variants have been used in several works [9, 10, 11, 12], in some scenarios this criterion fails to provide fairness to all demographics [1]. For example, consider hiring, where the majority of the applicants are from a certain demographic. If we force the decision making system to be independent of demographic, the system has to hire the same fraction of applicants from each demographic. Therefore, the system may admit a less qualified individual from the smaller demographic to guarantee that the percentages of hired people from different demographics match. Moreover, denoting the true label by $Y$, in most cases, as in the image search example, $Y$ is correlated with the protected attribute $A$ (see Figure 1). Therefore, as demographic parity forces the prediction $\hat{Y}$ to be independent of $A$, this criterion will not be satisfied by the ideal predictor $\hat{Y} = Y$.

Hardt, Price and Srebro have recently proposed equalized odds as a new criterion of fairness [2]. This notion demands that the predictor $\hat{Y}$ be independent of the protected attribute $A$ conditioned on the actual label $Y$. Therefore, equalized odds requires the model to satisfy

$$P(\hat{Y} = \hat{y} \mid A = a, Y = y) = P(\hat{Y} = \hat{y} \mid A = a', Y = y), \quad \forall a, a', y, \hat{y}. \tag{1}$$


Returning to the example of hiring an employee, this measure implies that among the qualified applicants, the probability of hiring two people from different demographics should be the same. That is, if two people from different demographics are both qualified, or both not qualified, the system should hire them with equal probability. Also, note that unlike demographic parity, equalized odds allows for the ideal predictor $\hat{Y} = Y$, in which the prediction $\hat{Y}$ coincides with the true label $Y$.
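The contrast between the two criteria can be checked numerically. The following is a minimal sketch, assuming binary predictions and a hypothetical toy population (the group sizes, field names `a`, `y`, `y_hat`, and helper functions are illustrative and not taken from the paper). It measures the largest gap in positive-prediction rates across groups, first unconditionally (demographic parity) and then conditioned on the true label (equalized odds), for the ideal predictor that outputs the true label.

```python
def positive_rate(records, condition):
    """Fraction of records matching `condition` that received prediction 1.

    Assumes at least one record matches; an empty selection raises ZeroDivisionError.
    """
    selected = [r for r in records if condition(r)]
    return sum(r["y_hat"] for r in selected) / len(selected)


def demographic_parity_gap(records):
    """Largest difference in P(y_hat = 1 | A = a) across groups a."""
    groups = sorted({r["a"] for r in records})
    rates = [positive_rate(records, lambda r, g=g: r["a"] == g) for g in groups]
    return max(rates) - min(rates)


def equalized_odds_gap(records):
    """Largest difference in P(y_hat = 1 | A = a, Y = y) across groups, over all y."""
    groups = sorted({r["a"] for r in records})
    labels = sorted({r["y"] for r in records})
    worst = 0.0
    for y in labels:
        rates = [
            positive_rate(records, lambda r, g=g, y=y: r["a"] == g and r["y"] == y)
            for g in groups
        ]
        worst = max(worst, max(rates) - min(rates))
    return worst


# Hypothetical population: 6 of 10 qualified in group 0, 2 of 10 in group 1.
records = (
    [{"a": 0, "y": 1} for _ in range(6)]
    + [{"a": 0, "y": 0} for _ in range(4)]
    + [{"a": 1, "y": 1} for _ in range(2)]
    + [{"a": 1, "y": 0} for _ in range(8)]
)
for r in records:
    r["y_hat"] = r["y"]  # ideal predictor: prediction equals the true label

print(demographic_parity_gap(records))  # nonzero: the label is correlated with the group
print(equalized_odds_gap(records))      # zero: conditioned on the label, groups are treated alike
```

In this toy population the demographic parity gap is about 0.4 while the equalized odds gap is exactly zero, which matches the observation that the ideal predictor violates the former criterion but satisfies the latter.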


In this paper, we present a new framework for designing fair predictors from data. We utilize an information theoretic approach to model the information content of variables in the system relative to one another. We use equalized odds as the criterion to assess discrimination. In our proposed scheme, a data variable $X$ is first mapped to an auxiliary variable $U$, to decontaminate it from the discriminatory attribute as well as to ensure generalization. To design this auxiliary variable, for input variable $X$ and true label $Y$, we seek a compact representation $U$ of $X$ that contains at most a certain level of information about the variable $X$ (to avoid overfitting), but maximizes the mutual information $I(U;Y)$ (quality of decision). The auxiliary variable is in turn used as the input for the prediction task. Similar to [2], our framework is based only on joint statistics of the variables rather than functional forms; hence, such a formulation is more general. This is also practical, since in many cases the functional form of the score and the underlying training data are not public. Our formulation (unlike that of [2], for instance) allows both the protected attribute $A$ and the label $Y$ to have arbitrary cardinality, which implies that we can have multi-level protected attributes and labels. We cast the task of finding a fair predictor as an optimization problem and propose an iterative solution for solving this problem. We observe that the proposed solution does not necessarily converge for some levels of fairness. This suggests that for a given requirement on the accuracy of a predictor, certain levels of fairness may not be achievable.

A somewhat similar idea to our approach is presented in [9], in which the authors used an intermediate representation space with elements called prototypes. However, besides the fact that in that work demographic parity is used as the measure of discrimination, the method used for choosing the prototypes is quite different. Specifically, the main approach to avoid overfitting in that learning process is limiting the number of prototypes (unfortunately, nothing is said in that work about choosing the number of prototypes), while we achieve the same goal by controlling the information in the auxiliary variable about the data. The approach in [9] was extended in [13] with deep variational auto-encoders with priors that encourage independence between sensitive and latent factors of variation.

The rest of the paper is organized as follows. In Section II we review the notion of equalized odds and introduce our model as well as the details of our proposed learning procedure. Additionally, we propose the optimization that must be solved to address the fairness issue. In Section III we propose an iterative approach for solving the optimization problem introduced. Our concluding remarks are presented in Section IV.

II Model Description

We consider a purely observational setting in which we train a predictor from labeled data. For each sample, we have a set of attributes, which includes protected attributes such as gender, race, religion, etc. The protected attributes are denoted by $A$. We use $X$ to denote the rest of the attributes. We denote the true label by $Y$ and the prediction of the label by $\hat{Y}$. For instance, for the example regarding risk of recidivism explained in Section I, $A$ represents the race of each individual, $X$ represents other features of that individual (which could be correlated with the individual’s race) and $Y$ determines whether he/she has committed any crimes after being released from jail.

Figure 1: Graphical model of the proposed framework. $A$, $X$ and $Y$ denote the protected attribute, the rest of the attributes and the true label, respectively. $U$ is the compressed representation of $X$, which is used for designing the prediction $\hat{Y}$.

The graphical model of our setup is depicted in Figure 1. As seen in this figure, the protected attribute $A$ and the remaining attributes $X$ can be correlated, and given $X$, $A$ is independent of the true label $Y$. This property is essential; otherwise, the protected attribute would be a direct cause of the label, and using this attribute in the prediction process should not be considered discriminatory.

In order to find a fair predictor, if the joint distribution of $(A, X, Y)$ were known, we could find a prediction $\hat{Y}$ close to the true label $Y$ in the sense of equalized odds. However, in reality only the empirical distribution obtained from data is available; therefore we must make sure that the predictor generalizes.

Generalization: Since the number of available samples is finite, to prevent overfitting (that is, to ensure generalization) we should constrain our hypothesis space. To do so, we compress the data variable $X$ to an auxiliary variable $U$, which in turn is used for the prediction task. We also choose $U$ such that it is not contaminated by discrimination in the sense of equalized odds [2], defined in the following.

Definition 1.

[Equalized odds] We say that a variable $U$ satisfies equalized odds with respect to protected attribute $A$ and outcome $Y$, if $U$ and $A$ are independent conditional on $Y$, that is,

$$P(U = u \mid A = a, Y = y) = P(U = u \mid Y = y), \quad \forall u, a, y.$$

This definition is equivalent to the one in expression (1).

Once the auxiliary variable $U$ is decontaminated from the discriminatory attribute $A$, one can use any predictor to predict the label $Y$ from this auxiliary variable. We propose to apply a Bayesian empirical risk minimization decision rule in this work for the prediction task.

To obtain the mechanism for generating the auxiliary variable $U$, we seek a compact representation of the data variable $X$ that maximizes the utility/quality of prediction $I(U;Y)$, while it contains at most a certain level of information about the variable $X$. This is in essence similar to the goal in the information bottleneck (IB) method [14]. Maximizing $I(U;Y)$ corresponds to maximizing the utility of $U$, and keeping $I(U;X)$ bounded could be viewed as regularization, which rejects complex hypotheses to ensure generalization. See [15] for a detailed discussion regarding using mutual information for finding bounds on generalization error. Note that because we express fairness, accuracy and compactness via mutual information, we obtain a setting in which we do not need any requirement on the cardinality of variables (as opposed to [2, 9]).

Next, we present the details of designing the transition probability kernel for generating the auxiliary variable, as well as designing the final predictor.

II-A Designing the Auxiliary Variable

As stated earlier, the goal of our learning scheme is to produce a compressed representation $U$ of the data variable $X$, which has as much information about the true label $Y$ as possible, and is fair in the sense of Definition 1. We relax the equalized odds requirement in that we allow $U$ to have a certain amount of information about the protected attribute $A$ conditioned on $Y$. The reason for this choice will become clear in Section III. Therefore, the objective is to find a mechanism $P_{U|X}$, which maximizes $I(U;Y)$ as well as

  1. Ensures fairness: The information shared between the protected attribute and $U$ given the true label does not exceed a certain threshold $\epsilon$, that is, $I(U; A \mid Y) \le \epsilon$.

  2. Ensures generalization: The mutual information between $U$ and $X$ does not exceed a certain threshold $C$, that is, $I(U; X) \le C$.

Therefore, we aim to solve the following optimization problem.

$$\max_{P_{U|X}} \; I(U;Y) \quad \text{subject to} \quad I(U; A \mid Y) \le \epsilon, \quad I(U; X) \le C.$$
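The three quantities in this problem can be evaluated directly for any candidate mechanism once a joint distribution is fixed. The sketch below is illustrative only: it writes the auxiliary variable as `u`, the protected attribute as `a`, the remaining attributes as `x` and the label as `y`, and the toy joint distribution, the dictionary-based encoding and the function names are assumptions made for this example rather than anything specified in the paper. It computes the accuracy term I(U;Y), the fairness term I(U;A|Y) and the compression term I(U;X) for a mechanism given as a table of P(u|x).

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(S;T) in nats from a dict {(s, t): probability}."""
    p_s, p_t = defaultdict(float), defaultdict(float)
    for (s, t), p in joint.items():
        p_s[s] += p
        p_t[t] += p
    return sum(p * math.log(p / (p_s[s] * p_t[t]))
               for (s, t), p in joint.items() if p > 0)


def information_terms(p_axy, p_u_given_x):
    """Return (I(U;Y), I(U;A|Y), I(U;X)) when U is generated from X alone.

    p_axy: dict {(a, x, y): probability}; p_u_given_x: dict {x: {u: probability}}.
    """
    p_uy, p_ux, p_uay = defaultdict(float), defaultdict(float), defaultdict(float)
    for (a, x, y), p in p_axy.items():
        for u, q in p_u_given_x[x].items():
            p_uy[(u, y)] += p * q
            p_ux[(u, x)] += p * q
            p_uay[(u, a, y)] += p * q
    # I(U;A|Y) = sum over y of P(y) * I(U;A | Y = y).
    p_y = defaultdict(float)
    for (_, y), p in p_uy.items():
        p_y[y] += p
    i_ua_given_y = 0.0
    for y0, w in p_y.items():
        if w > 0:
            slice_y = {(u, a): p / w for (u, a, y), p in p_uay.items() if y == y0}
            i_ua_given_y += w * mutual_information(slice_y)
    return mutual_information(p_uy), i_ua_given_y, mutual_information(p_ux)


# Hypothetical joint with A -> X -> Y, so that A is independent of Y given X.
p_a = {0: 0.5, 1: 0.5}
p_x1_given_a = {0: 0.8, 1: 0.3}
p_y1_given_x = {0: 0.2, 1: 0.9}
p_axy = {}
for a in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            px = p_x1_given_a[a] if x == 1 else 1 - p_x1_given_a[a]
            py = p_y1_given_x[x] if y == 1 else 1 - p_y1_given_x[x]
            p_axy[(a, x, y)] = p_a[a] * px * py

identity = {0: {0: 1.0}, 1: {1: 1.0}}  # U copies X: most accurate, least compressed
constant = {0: {0: 1.0}, 1: {0: 1.0}}  # U ignores X: all three terms vanish
print(information_terms(p_axy, identity))
print(information_terms(p_axy, constant))
```

The two extreme mechanisms bracket the trade-off: copying the data maximizes the accuracy term but also carries information about the protected attribute given the label, while a constant representation is perfectly fair and perfectly compressed but useless for prediction.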

II-B Designing the Predictor

As stated before, after obtaining a decontaminated variable $U$, this variable can be used for the prediction task. We utilize a Bayesian decision rule described in the following.

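Let $\mathcal{U}$ be the alphabet of the variable $U$ and $\mathcal{Y}$ be the alphabet of variables $Y$ and $\hat{Y}$. To quantify the quality of a decision, define a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, where $\ell(y, \hat{y})$ determines the cost of predicting $\hat{y}$ when the true label was $y$. The decisions are based on auxiliary variable $U$, which is statistically related to the true label. We denote the decision rule by $\delta : \mathcal{U} \to \mathcal{Y}$. The loss of the decision rule is defined as follows.

$$L(\delta) = \mathbb{E}\big[\ell(Y, \delta(U))\big] = \sum_{u \in \mathcal{U}} \sum_{y \in \mathcal{Y}} P(u, y)\, \ell(y, \delta(u)).$$

Using $L(\delta)$, the Bayesian risk minimization decision rule is

$$\delta^{*} = \arg\min_{\delta} L(\delta), \qquad \text{that is,} \qquad \delta^{*}(u) = \arg\min_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} P(y \mid u)\, \ell(y, \hat{y}).$$

For instance, for the case of binary labels with Hamming loss, defined as $\ell(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$, we have

$$\delta^{*}(u) = \arg\max_{y \in \{0, 1\}} P(y \mid u),$$

which implies that we vote for the label with the maximum posterior probability.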
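A Bayesian decision rule of this kind is straightforward to tabulate when the alphabets are small. The following is a minimal sketch, assuming the joint distribution of the auxiliary variable and the true label is available as a table; the function name, the dictionary encoding and the numbers are hypothetical and chosen only for illustration. For each value of the auxiliary variable it selects the prediction with the smallest posterior expected loss.

```python
def bayes_decision_rule(p_uy, labels, loss):
    """For each u, pick the prediction minimizing posterior expected loss.

    p_uy: dict {(u, y): joint probability}; loss(y, y_hat) returns a cost.
    """
    rule = {}
    for u in sorted({u for (u, _) in p_uy}):
        # The joint P(u, y) can be used in place of the posterior P(y | u):
        # dividing by P(u) rescales every candidate equally and keeps the argmin.
        def expected_loss(y_hat, u=u):
            return sum(p * loss(y, y_hat) for (uu, y), p in p_uy.items() if uu == u)
        rule[u] = min(labels, key=expected_loss)
    return rule


def hamming(y, y_hat):
    """Hamming loss: zero cost for a correct prediction, unit cost otherwise."""
    return 0.0 if y == y_hat else 1.0


# Hypothetical joint distribution of (u, y).
p_uy = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.10, (1, 1): 0.40}
rule = bayes_decision_rule(p_uy, labels=[0, 1], loss=hamming)
print(rule)  # {0: 0, 1: 1}
```

With Hamming loss the rule reduces to picking the most probable label for each value of the auxiliary variable; passing an asymmetric `loss` instead would shift the decision toward the cheaper kind of error.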

III Solving the Fairness Optimization Problem

In this section, we propose a solution for the fairness optimization problem presented in Section II. The Lagrangian for this problem, referred to below as equation (2), combines the accuracy, compression, and fairness terms of that problem into a single objective. Its two multipliers determine the trade-off between accuracy, information compression, and fairness. (Footnote 2: Throughout the paper, uppercase letters for the argument of a distribution indicate all the parameters of the distribution.)
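For orientation, one way such a Lagrangian can be written is sketched below, with $U$ the auxiliary variable, $X$ the non-protected attributes, $A$ the protected attribute and $Y$ the true label. The multiplier names $\beta$ and $\gamma$, the sign convention (a quantity to be minimized) and the unit weight on the compression term are assumptions of this sketch and need not match the arrangement of equation (2).

```latex
\mathcal{L}\big[P_{U|X}\big]
  \;=\; I(U;X) \;-\; \beta\, I(U;Y) \;+\; \gamma\, I(U;A \mid Y),
\qquad \beta,\ \gamma \ \ge\ 0 .
```

Under this convention, a larger $\beta$ rewards information about the label, a larger $\gamma$ penalizes residual dependence on the protected attribute given the label, and the unit-weight term penalizes complex representations.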

Equation (2) is similar to the objective function in [16], where the authors aimed to uncover structures that are present in one variable but absent from another, with an application to hierarchical text categorization.

We propose an alternating optimization method to solve the aforementioned problem. The pseudo-code of the proposed approach is presented in Algorithm 1. In each iteration, the Lagrangian is reduced by minimizing the objective function over three distributions separately, one at a time. Two update functions, given in equations (3) and (4), are used for updating the conditional distribution of the auxiliary variable given the data.



Theorem 1.

For small enough values of the multiplier on the fairness term, and for any value of the other multiplier, Algorithm 1 converges to a stationary point of the Lagrangian function given in equation (2).

See Appendix A for a proof.

  Input: Empirical distribution , initial distributions , , and parameters , , termination threshold .
  Initiate , , and .
  while  do
     , .
     , .
     , .
  end while
  Output: Conditional distribution .
Algorithm 1 Designing the conditional distribution of .
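As a point of reference for the structure of such an alternating scheme, the sketch below implements the standard information bottleneck iteration of [14], which has no fairness term. It is not Algorithm 1: the update functions (3) and (4) are not reproduced here, and the function name, the trade-off parameter `beta`, the fixed iteration count and the toy inputs are all assumptions of this example. Each pass updates the marginal of the auxiliary variable, then the decoder, then the encoder, which mirrors the pattern of minimizing over three distributions one at a time.

```python
import math
import random


def kl(p, q):
    """KL(p || q) in nats for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def information_bottleneck(p_x, p_y_given_x, n_u, beta, n_iter=50, seed=0):
    """Standard IB iteration (no fairness term); returns the encoder P(u|x) as rows.

    Assumes every entry of p_y_given_x is strictly positive.
    """
    rng = random.Random(seed)
    n_x, n_y = len(p_x), len(p_y_given_x[0])
    # Random, strictly positive initial encoder P(u|x).
    enc = []
    for _ in range(n_x):
        row = [rng.random() + 0.1 for _ in range(n_u)]
        total = sum(row)
        enc.append([v / total for v in row])
    for _ in range(n_iter):
        # Step 1: marginal P(u) induced by the current encoder.
        p_u = [sum(p_x[x] * enc[x][u] for x in range(n_x)) for u in range(n_u)]
        # Step 2: decoder P(y|u); an unused value of u gets a uniform placeholder.
        dec = []
        for u in range(n_u):
            if p_u[u] > 0:
                dec.append([sum(p_x[x] * enc[x][u] * p_y_given_x[x][y]
                                for x in range(n_x)) / p_u[u] for y in range(n_y)])
            else:
                dec.append([1.0 / n_y] * n_y)
        # Step 3: encoder update, P(u|x) proportional to P(u) * exp(-beta * KL).
        new_enc = []
        for x in range(n_x):
            w = [p_u[u] * math.exp(-beta * kl(p_y_given_x[x], dec[u]))
                 for u in range(n_u)]
            z = sum(w)
            new_enc.append([v / z for v in w])
        enc = new_enc
    return enc


# Hypothetical input: four data values, two of each "type" of label distribution.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.2, 0.8]]
encoder = information_bottleneck(p_x, p_y_given_x, n_u=2, beta=5.0)
for row in encoder:
    print([round(v, 3) for v in row])
```

Because the encoder update depends on a data value only through its label distribution, data values with identical label distributions end up with identical encoder rows. Like any alternating scheme of this kind, the result can depend on the random starting point, so trying several values of `seed` is a reasonable precaution.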

In general there is no guarantee that Algorithm 1 converges to the global minimum of the Lagrangian. Nevertheless, experimental results show that this alternating optimization algorithm almost always converges to a local minimum of the objective function in (2). Note that since achieving the global optimum is not guaranteed, one should initialize the algorithm from several different starting distributions.

The fact that convergence occurs only for a certain range of values of the fairness multiplier suggests that, for a given requirement on the accuracy of a predictor, certain levels of fairness may not be achievable. This may imply an inherent bound on the level of fairness that any algorithm can achieve, a conclusion which could not have been obtained from the other existing works.

IV Conclusion

We studied the problem of fairness in supervised learning, which is motivated by the fact that automated decision making systems may inherit biases related to sensitive attributes, such as gender, race, religion, etc., from the historical data that they have been trained on. We presented a new framework for designing fair predictors from data via information theoretic machinery. Equalized odds was used as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. In our proposed scheme, a data variable is first mapped to an auxiliary variable to decontaminate it from the discriminatory attribute as well as to ensure generalization. We modeled the task of designing the auxiliary variable as an optimization problem which aims to force the variable to be fair in the sense of equalized odds and to maximize the mutual information between the auxiliary variable and the true label, whilst keeping the information that this variable contains about the data limited. We proposed an alternating optimization method for solving this problem. We observed that the proposed solution does not necessarily converge for some levels of fairness. This suggests that for a given requirement on the accuracy of a predictor, certain levels of fairness may not be achievable. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable. Finding an exact bound on the achievable level of fairness, as well as applying the proposed method to real data, are left as future work.

Appendix A Proof of Theorem 1

The Lagrangian in equation (2) can be written as follows:



We note that, the only unknown parameters are , and all of the other distributions can be estimated from the given samples of .
Changing the notation of to (to emphasize that it is designed), and using [17, Lemma 10.8.1], we can write the optimization as follows:

where the inner minimizations are over all probability distributions. Changing the order of three minimizations, we obtain


Since is a convex function, and summation of a convex function with a linear function remains convex, the first and the third terms of equation (6) combined is a convex function of . For any function , there exist small enough such that the combination of the first three terms of equation (6) remains convex with respect to each .

We add one more term to the Lagrangian for the constraint that for each , should sum up to 1. As a result, taking the derivative of this function with respect to , and setting it equal to zero, the minimum of the function can be found. Below, the derivative of each term is taken separately:


For the second term in we have

Due to the graphical model in Figure 1, we have


The derivative of and can be obtained similarly. Therefore, we have

For the third term in we have


Summing up all terms of the derivative and setting it equal to zero, we get the desired result in (3) and (4).

Using the calculated , we can minimize over and . Again using [17, Lemma 10.8.1], minimum is achieved in marginal distributions and , which can be found from according to Algorithm 1.

Regarding convergence, we note that the Lagrangian in equation (2) could be written as follows

Since the first three terms of are linear combinations of KL-divergences, and hence non-negative, is lower bounded by which is a constant. In addition, in Algorithm 1, assuming small enough , in each of three steps of the alternating algorithm, the value of decreases. Therefore, there exists , such that for values of , the algorithm converges to a stationary point of the objective function in (2).


This work was in part supported by MURI grant ARMY W911NF-15-1-0479, Navy N00014-16-1-2804 and NSF CNS 17-18952.


  • [1] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference.   ACM, 2012, pp. 214–226.
  • [2] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3315–3323.
  • [3] L. E. Celis, D. Straszak, and N. K. Vishnoi, “Ranking with fairness constraints,” arXiv preprint arXiv:1704.06840, 2017.
  • [4] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, 2016.
  • [5] M. Kay, C. Matuszek, and S. A. Munson, “Unequal representation and gender stereotypes in image search results for occupations,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems.   ACM, 2015, pp. 3819–3828.
  • [6] C. Dwork, “Differential privacy: A survey of results,” in International Conference on Theory and Applications of Models of Computation.   Springer, 2008, pp. 1–19.
  • [7] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in TCC, vol. 3876.   Springer, 2006, pp. 265–284.
  • [8] K. Kalantari, L. Sankar, and A. D. Sarwate, “Optimal differential privacy mechanisms under hamming distortion for structured source classes,” in Information Theory (ISIT), 2016 IEEE International Symposium on.   IEEE, 2016, pp. 2069–2073.
  • [9] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 325–333.
  • [10] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, “Certifying and removing disparate impact,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2015, pp. 259–268.
  • [11] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness constraints: Mechanisms for fair classification,” arXiv preprint arXiv:1507.05259, 2017.
  • [12] H. Edwards and A. Storkey, “Censoring representations with an adversary,” arXiv preprint arXiv:1511.05897, 2015.
  • [13] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, “The variational fair autoencoder,” arXiv preprint arXiv:1511.00830, 2015.
  • [14] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in The 37th Allerton Conference on Communication, Control, and Computing, 1999.
  • [15] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Advances in Neural Information Processing Systems, 2017, pp. 2521–2530.
  • [16] G. Chechik and N. Tishby, “Extracting relevant structures with side information,” in Advances in Neural Information Processing Systems, 2003, pp. 881–888.
  • [17] T. M. Cover and J. A. Thomas, Elements of information theory.   John Wiley & Sons, 2012.