A Reductions Approach to Fair Classification

03/06/2018 · by Alekh Agarwal, et al.

We present a systematic approach for achieving fairness in a binary classification setting. While we focus on two well-known quantitative definitions of fairness, our approach encompasses many other previously studied definitions as special cases. Our approach works by reducing fair classification to a sequence of cost-sensitive classification problems, whose solutions yield a randomized classifier with the lowest (empirical) error subject to the desired constraints. We introduce two reductions that work for any representation of the cost-sensitive classifier and compare favorably to prior baselines on a variety of data sets, while overcoming several of their disadvantages.


1 Introduction

Over the past few years, the media have paid considerable attention to machine learning systems and their ability to inadvertently discriminate against minorities, historically disadvantaged populations, and other protected groups when allocating resources (e.g., loans) or opportunities (e.g., jobs). In response to this scrutiny—and driven by ongoing debates and collaborations with lawyers, policy-makers, social scientists, and others (e.g., Barocas & Selbst, 2016)—machine learning researchers have begun to turn their attention to the topic of “fairness in machine learning,” and, in particular, to the design of fair classification and regression algorithms.

In this paper we study the task of binary classification subject to fairness constraints with respect to a pre-defined protected attribute, such as race or sex. Previous work in this area can be divided into two broad groups of approaches.

The first group of approaches incorporate specific quantitative definitions of fairness into existing machine learning methods, often by relaxing the desired definitions of fairness, and only enforcing weaker constraints, such as lack of correlation (e.g., Woodworth et al., 2017; Zafar et al., 2017; Johnson et al., 2016; Kamishima et al., 2011; Donini et al., 2018). The resulting fairness guarantees typically only hold under strong distributional assumptions, and the approaches are tied to specific families of classifiers, such as SVMs.

The second group of approaches eliminate the restriction to specific classifier families and treat the underlying classification method as a “black box,” while implementing a wrapper that either works by pre-processing the data or post-processing the classifier’s predictions (e.g., Kamiran & Calders, 2012; Feldman et al., 2015; Hardt et al., 2016; Calmon et al., 2017). Existing pre-processing approaches are specific to particular definitions of fairness and typically seek to come up with a single transformed data set that will work across all learning algorithms, which, in practice, leads to classifiers that still exhibit substantial unfairness (see our evaluation in Section 4). In contrast, post-processing allows a wider range of fairness definitions and results in provable fairness guarantees. However, it is not guaranteed to find the most accurate fair classifier, and requires test-time access to the protected attribute, which might not be available.

We present a general-purpose approach that has the key advantage of this second group of approaches—i.e., the underlying classification method is treated as a black box—but without the noted disadvantages. Our approach encompasses a wide range of fairness definitions, is guaranteed to yield the most accurate fair classifier, and does not require test-time access to the protected attribute. Specifically, our approach allows any definition of fairness that can be formalized via linear inequalities on conditional moments, such as demographic parity or equalized odds (see Section 2.1). We show how binary classification subject to these constraints can be reduced to a sequence of cost-sensitive classification problems. We require only black-box access to a cost-sensitive classification algorithm, which does not need to have any knowledge of the desired definition of fairness or protected attribute. We show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired fairness constraints.

Corbett-Davies et al. (2017) and Menon & Williamson (2018) begin with a similar goal to ours, but they analyze the Bayes optimal classifier under fairness constraints in the limit of infinite data. In contrast, our focus is algorithmic, our approach applies to any classifier family, and we obtain finite-sample guarantees. Dwork et al. (2018) also begin with a similar goal to ours. Their approach partitions the training examples into subsets according to protected attribute values and then leverages transfer learning to jointly learn from these separate data sets. Our approach avoids partitioning the data and assumes access only to a classification algorithm rather than a transfer learning algorithm.

A preliminary version of this paper appeared at the FAT/ML workshop (Agarwal et al., 2017), and led to extensions with more general optimization objectives (Alabi et al., 2018) and combinatorial protected attributes (Kearns et al., 2018).

In the next section, we formalize our problem. While we focus on two well-known quantitative definitions of fairness, our approach also encompasses many other previously studied definitions of fairness as special cases. In Section 3, we describe our reductions approach to fair classification and its guarantees in detail. The experimental study in Section 4 shows that our reductions compare favorably to three baselines, while overcoming some of their disadvantages and also offering the flexibility of picking a suitable accuracy–fairness tradeoff. Our results demonstrate the utility of having a general-purpose approach for combining machine learning methods and quantitative fairness definitions.

2 Problem Formulation

We consider a binary classification setting where the training examples consist of triples (X, A, Y), where X ∈ 𝒳 is a feature vector, A ∈ 𝒜 is a protected attribute (taking values in a finite set 𝒜), and Y ∈ {0, 1} is a label. The feature vector X can either contain the protected attribute A as one of the features or contain other features that are arbitrarily indicative of A. For example, if the classification task is to predict whether or not someone will default on a loan, each training example might correspond to a person, where X represents their demographics, income level, past payment history, and loan amount; A represents their race; and Y represents whether or not they defaulted on that loan. Note that X might contain their race as one of the features or, for example, contain their zipcode—a feature that is often correlated with race. Our goal is to learn an accurate classifier h : 𝒳 → {0, 1} from some set (i.e., family) ℋ of classifiers, such as linear threshold rules, decision trees, or neural nets, while satisfying some definition of fairness. Note that the classifiers in ℋ do not explicitly depend on A.

2.1 Fairness Definitions

We focus on two well-known quantitative definitions of fairness that have been considered in previous work on fair classification; however, our approach also encompasses many other previously studied definitions of fairness as special cases, as we explain at the end of this section.

The first definition—demographic (or statistical) parity—can be thought of as a stronger version of the US Equal Employment Opportunity Commission’s “four-fifths rule,” which requires that the “selection rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate” (see the Uniform Guidelines on Employment Selection Procedures, 29 C.F.R. §1607.4(D), 2015).

Definition 1 (Demographic parity—DP).

A classifier h satisfies demographic parity under a distribution over (X, A, Y) if its prediction h(X) is statistically independent of the protected attribute A—that is, if P[h(X) = ŷ | A = a] = P[h(X) = ŷ] for all a ∈ 𝒜 and ŷ ∈ {0, 1}. Because ŷ ∈ {0, 1}, this is equivalent to E[h(X) | A = a] = E[h(X)] for all a ∈ 𝒜.

The second definition—equalized odds—was recently proposed by Hardt et al. (2016) to remedy two previously noted flaws with demographic parity (Dwork et al., 2012). First, demographic parity permits a classifier which accurately classifies data points with one value A = a, such as the value a with the most data, but makes random predictions for data points with A ≠ a as long as the probabilities of h(X) = 1 match. Second, demographic parity rules out perfect classifiers whenever Y is correlated with A. In contrast, equalized odds suffers from neither of these flaws.

Definition 2 (Equalized odds—EO).

A classifier h satisfies equalized odds under a distribution over (X, A, Y) if its prediction h(X) is conditionally independent of the protected attribute A given the label Y—that is, if P[h(X) = ŷ | A = a, Y = y] = P[h(X) = ŷ | Y = y] for all a ∈ 𝒜, y ∈ {0, 1}, and ŷ ∈ {0, 1}. Because ŷ ∈ {0, 1}, this is equivalent to E[h(X) | A = a, Y = y] = E[h(X) | Y = y] for all a ∈ 𝒜 and y ∈ {0, 1}.

We now show how each definition can be viewed as a special case of a general set of linear constraints of the form

M μ(h) ≤ c,     (1)

where the matrix M ∈ R^{|𝒦|×|𝒥|} and the vector c ∈ R^{|𝒦|} describe the linear constraints, each indexed by k ∈ 𝒦, and μ(h) ∈ R^{|𝒥|} is a vector of conditional moments of the form

μ_j(h) = E[ g_j(X, A, Y, h(X)) | ℰ_j ]   for j ∈ 𝒥,

where g_j : 𝒳 × 𝒜 × {0, 1} × {0, 1} → [0, 1] and ℰ_j is an event defined with respect to (X, A, Y). Crucially, g_j depends on h(X), while ℰ_j cannot depend on h(X) in any way.

Example 1 (DP).

In a binary classification setting, demographic parity can be expressed as a set of |𝒜| equality constraints, each of the form E[h(X) | A = a] = E[h(X)]. Letting 𝒥 = 𝒜 ∪ {⋆}, g_j(X, A, Y, h(X)) = h(X) for all j, ℰ_a = {A = a} for a ∈ 𝒜, and ℰ_⋆ be the event encompassing all points in the sample space, each equality constraint can be expressed as μ_a(h) = μ_⋆(h) (note that μ_⋆(h) = E[h(X)]). Finally, because each such constraint can be equivalently expressed as a pair of inequality constraints of the form

μ_a(h) − μ_⋆(h) ≤ 0,   μ_⋆(h) − μ_a(h) ≤ 0,

demographic parity can be expressed as equation (1), where 𝒦 = 𝒜 × {+, −}, the row of M indexed by (a, +) encodes the constraint μ_a(h) − μ_⋆(h) ≤ 0, the row indexed by (a, −) encodes μ_⋆(h) − μ_a(h) ≤ 0, and c = 0. Expressing each equality constraint as a pair of inequality constraints allows us to control the extent to which each constraint is enforced by positing c_k > 0 for some (or all) k ∈ 𝒦.
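To make this representation concrete, the following sketch (ours, not code from the paper; NumPy assumed, with hypothetical helper names) builds M, the empirical moment vector μ̂(h), and ĉ = ε·1 for demographic parity with moment index set 𝒥 = 𝒜 ∪ {⋆}. The same M and event masks can be reused by the reduction described in Section 3.

```python
import numpy as np

def dp_constraints(pred, A, eps=0.0):
    """Demographic parity written as M @ mu_hat <= c_hat.
    Moments are indexed by J = groups + ['*']; constraints by K = groups x {+, -}."""
    pred, A = np.asarray(pred, dtype=float), np.asarray(A)
    groups = list(np.unique(A))
    # Event masks for each moment j: E_a = {A = a}, E_* = all points.
    event_masks = [A == a for a in groups] + [np.ones(len(A), dtype=bool)]
    mu_hat = np.array([pred[m].mean() for m in event_masks])  # mu_hat_j(h) = E[h(X) | E_j]
    rows = []
    for i, _ in enumerate(groups):
        row = np.zeros(len(event_masks))
        row[i], row[-1] = 1.0, -1.0   # mu_a(h) - mu_*(h) <= eps
        rows.append(row)
        rows.append(-row)             # mu_*(h) - mu_a(h) <= eps
    M = np.array(rows)
    c_hat = np.full(len(M), eps)
    return M, mu_hat, c_hat, event_masks

# A constraint k is violated whenever (M @ mu_hat - c_hat)[k] > 0.
```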

Example 2 (EO).

In a binary classification setting, equalized odds can be expressed as a set of equality constraints, each of the form E[h(X) | A = a, Y = y] = E[h(X) | Y = y]. Letting 𝒥 = (𝒜 ∪ {⋆}) × {0, 1}, g_j(X, A, Y, h(X)) = h(X) for all j, ℰ_{(a,y)} = {A = a, Y = y}, and ℰ_{(⋆,y)} = {Y = y}, each equality constraint can be equivalently expressed as a pair of inequality constraints,

μ_{(a,y)}(h) − μ_{(⋆,y)}(h) ≤ 0,   μ_{(⋆,y)}(h) − μ_{(a,y)}(h) ≤ 0.

As a result, equalized odds can be expressed as equation (1), where 𝒦 = 𝒜 × {0, 1} × {+, −}, the rows of M encode these pairs of inequalities, and c = 0. Again, we can posit c_k > 0 for some (or all) k to allow small violations of some (or all) of the constraints.

Although we omit the details, we note that many other previously studied definitions of fairness can also be expressed as equation (1). For example, equality of opportunity (Hardt et al., 2016) (also known as balance for the positive class; Kleinberg et al., 2017), balance for the negative class (Kleinberg et al., 2017), error-rate balance (Chouldechova, 2017), overall accuracy equality (Berk et al., 2017), and treatment equality (Berk et al., 2017) can all be expressed as equation (1); in contrast, calibration (Kleinberg et al., 2017) and predictive parity (Chouldechova, 2017) cannot, because to do so would require the event ℰ_j to depend on the prediction h(X). We note that our approach can also be used to satisfy multiple definitions of fairness, though if these definitions are mutually contradictory, e.g., as described by Kleinberg et al. (2017), then our guarantees become vacuous.

2.2 Fair Classification

In a standard (binary) classification setting, the goal is to learn the classifier with the minimum classification error: min_{h ∈ ℋ} err(h), where err(h) := P[h(X) ≠ Y]. However, because our goal is to learn the most accurate classifier while satisfying fairness constraints, as formalized above, we instead seek to find the solution to the constrained optimization problem

min_{h ∈ ℋ} err(h)   subject to   M μ(h) ≤ c.     (2)

(We consider misclassification error for concreteness, but all the results in this paper apply to any error of the form E[g_err(X, A, Y, h(X))], where g_err takes values in [0, 1].)

Furthermore, rather than just considering classifiers in the set ℋ, we can enlarge the space of possible classifiers by considering randomized classifiers that can be obtained via a distribution over ℋ. By considering randomized classifiers, we can achieve better accuracy–fairness tradeoffs than would otherwise be possible. A randomized classifier Q makes a prediction by first sampling a classifier h ∈ ℋ from Q and then using h to make the prediction. The resulting classification error is err(Q) = Σ_{h ∈ ℋ} Q(h) err(h) and the conditional moments are μ(Q) = Σ_{h ∈ ℋ} Q(h) μ(h) (see Appendix A for the derivation). Thus we seek to solve

min_{Q ∈ Δ} err(Q)   subject to   M μ(Q) ≤ c,     (3)

where Δ is the set of all distributions over ℋ.
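As a small illustration of this definition (our sketch, not the authors' code), a randomized classifier can be represented as a finite list of fitted classifiers with weights: prediction samples one member per query point, and err(Q) is the corresponding weighted average of the members' errors. X is assumed to be a 2-D NumPy array and the members scikit-learn-style estimators.

```python
import numpy as np

class RandomizedClassifier:
    """A distribution Q over a finite set of fitted classifiers."""

    def __init__(self, classifiers, weights, seed=0):
        self.classifiers = classifiers
        self.weights = np.asarray(weights, dtype=float)
        self.weights /= self.weights.sum()
        self.rng = np.random.default_rng(seed)

    def predict(self, X):
        # Sample h ~ Q independently for each example, then predict with it.
        idx = self.rng.choice(len(self.classifiers), size=len(X), p=self.weights)
        return np.array([self.classifiers[i].predict(x.reshape(1, -1))[0]
                         for i, x in zip(idx, X)])

    def expected_error(self, X, y):
        # err(Q) = sum_h Q(h) err(h): weighted average of the members' errors.
        errs = [np.mean(h.predict(X) != y) for h in self.classifiers]
        return float(np.dot(self.weights, errs))
```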

In practice, we do not know the true distribution over (X, A, Y) and only have access to a data set of n training examples {(X_i, A_i, Y_i)}_{i=1}^{n}. We therefore replace err(Q) and μ(Q) in equation (3) with their empirical versions êrr(Q) and μ̂(Q). Because of the sampling error in μ̂(Q), we also allow errors in satisfying the constraints by setting ĉ_k = c_k + ε_k for all k, where ε_k ≥ 0. After these modifications, we need to solve the empirical version of equation (3):

min_{Q ∈ Δ} êrr(Q)   subject to   M μ̂(Q) ≤ ĉ.     (4)

3 Reductions Approach

We now show how the problem (4) can be reduced to a sequence of cost-sensitive classification problems. We further show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired constraints.

3.1 Cost-sensitive Classification

We assume access to a cost-sensitive classification algorithm for the set ℋ. The input to such an algorithm is a data set of training examples {(X_i, C_i^0, C_i^1)}_{i=1}^{n}, where C_i^0 and C_i^1 denote the losses—costs in this setting—for predicting the labels 0 or 1, respectively, on example X_i. The algorithm outputs

argmin_{h ∈ ℋ} Σ_{i=1}^{n} [ h(X_i) C_i^1 + (1 − h(X_i)) C_i^0 ].     (5)

This abstraction allows us to specify different costs for different training examples, which is essential for incorporating fairness constraints. Moreover, efficient cost-sensitive classification algorithms are readily available for several common classifier representations (e.g., Beygelzimer et al., 2005; Langford & Beygelzimer, 2005; Fan et al., 1999). In particular, equation (5) is equivalent to a weighted classification problem, where the input consists of labeled examples {(X_i, Y_i, W_i)}_{i=1}^{n} with W_i ≥ 0 and Y_i ∈ {0, 1}, and the goal is to minimize the weighted classification error Σ_i W_i 1{h(X_i) ≠ Y_i}. This is equivalent to equation (5) if we set W_i = |C_i^0 − C_i^1| and Y_i = 1{C_i^0 ≥ C_i^1}.
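For instance, the cost-to-weight conversion just described can be implemented on top of any learner that accepts example weights. The sketch below is ours, with scikit-learn's LogisticRegression standing in for the cost-sensitive oracle; it follows the W_i = |C_i^0 − C_i^1|, Y_i = 1{C_i^0 ≥ C_i^1} recipe and is not the paper's released code.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

def cost_sensitive_oracle(X, cost0, cost1):
    """Approximately solve argmin_h sum_i [h(X_i) C_i^1 + (1 - h(X_i)) C_i^0]
    by reducing it to weighted binary classification."""
    cost0 = np.asarray(cost0, dtype=float)
    cost1 = np.asarray(cost1, dtype=float)
    weights = np.abs(cost0 - cost1)          # W_i = |C_i^0 - C_i^1|
    targets = (cost0 >= cost1).astype(int)   # predict 1 whenever it is the cheaper label
    if targets.min() == targets.max():
        # Degenerate case: one label is cheaper on every example.
        clf = DummyClassifier(strategy="constant", constant=int(targets[0]))
        return clf.fit(X, targets)
    clf = LogisticRegression(max_iter=1000)
    return clf.fit(X, targets, sample_weight=weights)
```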

3.2 Reduction

To derive our fair classification algorithm, we rewrite equation (4) as a saddle point problem. We begin by introducing a Lagrange multiplier λ_k ≥ 0 for each of the |𝒦| constraints, summarized as the vector λ ∈ R_{+}^{|𝒦|}, and form the Lagrangian

L(Q, λ) = êrr(Q) + λᵀ (M μ̂(Q) − ĉ).

Thus, equation (4) is equivalent to

min_{Q ∈ Δ}  max_{λ ∈ R_{+}^{|𝒦|}}  L(Q, λ).     (6)

For computational and statistical reasons, we impose an additional constraint on the ℓ1-norm of λ and seek to simultaneously find the solution to the constrained version of (6) as well as its dual, obtained by switching min and max:

min_{Q ∈ Δ}  max_{λ ∈ R_{+}^{|𝒦|}, ‖λ‖_1 ≤ B}  L(Q, λ),     (P)
max_{λ ∈ R_{+}^{|𝒦|}, ‖λ‖_1 ≤ B}  min_{Q ∈ Δ}  L(Q, λ).     (D)

Because L is linear in Q and λ and the domains of Q and λ are convex and compact, both problems have solutions (which we denote by Q† and λ†) and the minimum value of (P) and the maximum value of (D) are equal and coincide with L(Q†, λ†). Thus, (Q†, λ†) is a saddle point of L (Corollary 37.6.2 and Lemma 36.2 of Rockafellar, 1970).

We find the saddle point by using the standard scheme of Freund & Schapire (1996), developed for the equivalent problem of solving for an equilibrium in a zero-sum game. From a game-theoretic perspective, the saddle point can be viewed as an equilibrium of a game between two players: the Q-player choosing Q ∈ Δ and the λ-player choosing λ with λ ≥ 0 and ‖λ‖_1 ≤ B. The Lagrangian L(Q, λ) specifies how much the Q-player has to pay to the λ-player after they make their choices. At the saddle point, neither player wants to deviate from their choice.

Our algorithm finds an approximate equilibrium in which neither player can gain more than ν by changing their choice (where ν is an input to the algorithm). Such an approximate equilibrium corresponds to a ν-approximate saddle point of the Lagrangian, which is a pair (Q̂, λ̂), where

L(Q̂, λ̂) ≤ L(Q, λ̂) + ν   for all Q ∈ Δ,
L(Q̂, λ̂) ≥ L(Q̂, λ) − ν   for all λ ≥ 0 with ‖λ‖_1 ≤ B.

We proceed iteratively by running a no-regret algorithm for the λ-player, while executing the best response of the Q-player. Following Freund & Schapire (1996), the average play of both players converges to the saddle point. We run the exponentiated gradient algorithm (Kivinen & Warmuth, 1997) for the λ-player and terminate as soon as the suboptimality of the average play falls below the pre-specified accuracy ν. The best response of the Q-player can always be chosen to put all of the mass on one of the candidate classifiers h ∈ ℋ, and can be implemented by a single call to a cost-sensitive classification algorithm for the set ℋ.

Algorithm 1 fully implements this scheme, except for the functions BESTλ(Q) and BESTh(λ), which correspond to the best-response algorithms of the two players. (We need the best response of the λ-player to evaluate whether the suboptimality of the current average play has fallen below ν.) The two best response functions can be calculated as follows.

  Input: training examples {(X_i, A_i, Y_i)}_{i=1}^{n}
  Input: fairness constraints specified by g_j, ℰ_j, M, ĉ
  Input: bound B, accuracy ν, learning rate η
  Set θ_1 = 0 ∈ R^{|𝒦|}
  for t = 1, 2, … do
     Set λ_{t,k} = B exp(θ_{t,k}) / (1 + Σ_{k′ ∈ 𝒦} exp(θ_{t,k′})) for all k ∈ 𝒦
     h_t ← BESTh(λ_t)
     Q̂_t ← (1/t) Σ_{t′=1}^{t} δ_{h_{t′}},   L̄_t ← L(Q̂_t, BESTλ(Q̂_t))
     λ̄_t ← (1/t) Σ_{t′=1}^{t} λ_{t′},   L̲_t ← L(BESTh(λ̄_t), λ̄_t)
     if L̄_t − L̲_t ≤ ν then
        Return (Q̂_t, λ̄_t)
     end if
     Set θ_{t+1} = θ_t + η (M μ̂(h_t) − ĉ)
  end for
Algorithm 1 Exponentiated-gradient reduction for fair classification

BESTλ(Q): the best response of the λ-player.

The best response of the λ-player for a given Q is any maximizer of L(Q, λ) over all valid λs. In our setting, it can always be chosen to be either 0 or to put all of the mass B on the most violated constraint. Letting γ̂(Q) := M μ̂(Q) and letting e_k denote the k-th vector of the standard basis, BESTλ(Q) returns 0 if γ̂_k(Q) ≤ ĉ_k for all k, and otherwise returns B e_{k⋆}, where k⋆ = argmax_k [γ̂_k(Q) − ĉ_k].
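A direct transcription of this rule (our sketch; γ̂(Q) and ĉ are assumed to be NumPy arrays indexed by the constraints in 𝒦):

```python
import numpy as np

def best_lambda(gamma_hat, c_hat, B):
    """Best response of the lambda-player: put mass B on the most violated
    constraint, or return the zero vector if no constraint is violated."""
    slack = gamma_hat - c_hat            # gamma_hat = M @ mu_hat(Q)
    lam = np.zeros_like(slack)
    k_star = int(np.argmax(slack))
    if slack[k_star] > 0:
        lam[k_star] = B
    return lam
```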

BESTh(λ): the best response of the Q-player.

Here, the best response minimizes L(Q, λ) over all Qs in the simplex. Because L is linear in Q, the minimizer can always be chosen to put all of the mass on a single classifier h ∈ ℋ. We show how to obtain the classifier constituting the best response via a reduction to cost-sensitive classification. Letting p̂_j := P̂[ℰ_j] denote the empirical event probabilities, the Lagrangian for a Q which puts all of the mass on a single h is then

L(h, λ) = −λᵀ ĉ + (1/n) Σ_{i=1}^{n} [ 1{h(X_i) ≠ Y_i} + Σ_{j : (X_i, A_i, Y_i) ∈ ℰ_j} (Mᵀλ)_j g_j(X_i, A_i, Y_i, h(X_i)) / p̂_j ].

Assuming a data set of training examples {(X_i, A_i, Y_i)}_{i=1}^{n}, the minimization of L(h, λ) over h ∈ ℋ then corresponds to cost-sensitive classification on {(X_i, C_i^0, C_i^1)}_{i=1}^{n} with costs

C_i^0 = 1{Y_i ≠ 0} + Σ_{j : (X_i, A_i, Y_i) ∈ ℰ_j} (Mᵀλ)_j g_j(X_i, A_i, Y_i, 0) / p̂_j,
C_i^1 = 1{Y_i ≠ 1} + Σ_{j : (X_i, A_i, Y_i) ∈ ℰ_j} (Mᵀλ)_j g_j(X_i, A_i, Y_i, 1) / p̂_j.

(For general error, êrr(h) = Ê[g_err(X, A, Y, h(X))], the costs C_i^0 and C_i^1 contain, respectively, the terms g_err(X_i, A_i, Y_i, 0) and g_err(X_i, A_i, Y_i, 1) instead of 1{Y_i ≠ 0} and 1{Y_i ≠ 1}.)
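Continuing the sketch, the Q-player's best response can be obtained by assembling these per-example costs and handing them to a cost-sensitive oracle such as the one sketched in Section 3.1. The interface below is our own illustration, not the paper's code: events are passed as boolean masks, and the general g_j is specialized to g_j(X, A, Y, ŷ) = ŷ, as in demographic parity and equalized odds.

```python
import numpy as np

def best_h(X, y, lam, M, event_masks, oracle):
    """Best response of the Q-player for multipliers `lam`.

    M           : array of shape (num_constraints, num_moments)
    event_masks : list of boolean arrays, one per moment j, marking which
                  examples fall in the event E_j
    oracle      : a cost-sensitive learner, e.g. cost_sensitive_oracle from Section 3.1
    """
    n = len(y)
    # Adjustment added to the cost of predicting 1 on example i:
    #   sum_{j : i in E_j} (M^T lam)_j / p_hat_j       (since g_j(., y_hat) = y_hat here)
    mt_lam = M.T @ lam
    adjust = np.zeros(n)
    for j, mask in enumerate(event_masks):
        p_hat_j = mask.mean()
        adjust[mask] += mt_lam[j] / p_hat_j
    cost0 = (y == 1).astype(float)            # cost of predicting 0
    cost1 = (y == 0).astype(float) + adjust   # cost of predicting 1
    return oracle(X, cost0, cost1)
```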

Theorem 1.

Letting ρ := max_{h ∈ ℋ} ‖M μ̂(h) − ĉ‖_∞, Algorithm 1 satisfies, for every iteration t, the inequality

L̄_t − L̲_t ≤ B log(|𝒦| + 1) / (η t) + η ρ² B.

Thus, for η = ν / (2 ρ² B), Algorithm 1 will return a ν-approximate saddle point of L in at most 4 ρ² B² log(|𝒦| + 1) / ν² iterations.

This theorem, proved in Appendix B, bounds the suboptimality gap of the average play (Q̂_t, λ̄_t), which upper-bounds its suboptimality as a saddle point. The right-hand side of the bound is optimized by η = (1/ρ)√(log(|𝒦| + 1) / t), leading to the bound 2 ρ B √(log(|𝒦| + 1) / t). This bound decreases with the number of iterations and grows very slowly with the number of constraints |𝒦|. The quantity ρ is a problem-specific constant that bounds how much any single classifier can violate the desired set of fairness constraints. Finally, B is the bound on the ℓ1-norm of λ, which we introduced to enable this specific algorithmic scheme. In general, larger values of B will bring the problem (P) closer to (6), and thus also to (4), but at the cost of needing more iterations to reach any given suboptimality. In particular, as we derive in the theorem, achieving suboptimality ν may need up to 4 ρ² B² log(|𝒦| + 1) / ν² iterations.

Example 3 (DP).

Using the matrix M for demographic parity as described in Section 2, the cost-sensitive reduction for a vector of Lagrange multipliers λ uses costs

C_i^0 = 1{Y_i = 1},   C_i^1 = 1{Y_i = 0} + λ_{A_i} / p̂_{A_i} − Σ_{a ∈ 𝒜} λ_a,

where λ_a := λ_{(a,+)} − λ_{(a,−)} and p̂_a := P̂[A = a], effectively replacing two non-negative Lagrange multipliers by a single multiplier, which can be either positive or negative. Because λ_a = λ_{(a,+)} − λ_{(a,−)} for all a, Σ_a |λ_a| ≤ ‖λ‖_1 ≤ B. Furthermore, because all empirical moments are bounded in [0, 1], we can assume ρ ≤ 2, which, together with |𝒦| = 2|𝒜|, yields the bound of Theorem 1. Thus, Algorithm 1 terminates in at most 16 B² log(2|𝒜| + 1) / ν² iterations.
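Putting the pieces together for demographic parity with a binary protected attribute, the following self-contained sketch (ours; logistic regression stands in for the weighted-classification oracle, and this should not be read as the released fairlearn implementation) runs a fixed number of exponentiated-gradient iterations and returns the classifiers whose uniform mixture plays the role of Q̂_t.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expgrad_dp(X, y, A, eps=0.01, B=10.0, eta=0.5, T=20):
    """Exponentiated-gradient sketch for demographic parity with A in {0, 1}.
    Constraints: mu_a(h) - mu_*(h) <= eps and mu_*(h) - mu_a(h) <= eps for each a."""
    groups = np.unique(A)
    p_hat = {a: np.mean(A == a) for a in groups}
    K = [(a, s) for a in groups for s in (+1, -1)]   # constraint index set K
    theta = np.zeros(len(K))
    hs = []
    for _ in range(T):
        # lambda lives on the scaled simplex {lam >= 0 : ||lam||_1 <= B}.
        w = np.exp(theta)
        lam = B * w / (1.0 + w.sum())
        # Combined (signed) multiplier per group, as in Example 3.
        lam_a = {a: sum(s * lam[k] for k, (a2, s) in enumerate(K) if a2 == a)
                 for a in groups}
        lam_sum = sum(lam_a.values())
        # Costs: C^0_i = 1{Y_i = 1}, C^1_i = 1{Y_i = 0} + lam_{A_i}/p_{A_i} - sum_a lam_a.
        adjust = np.array([lam_a[a] / p_hat[a] - lam_sum for a in A])
        cost0 = (y == 1).astype(float)
        cost1 = (y == 0).astype(float) + adjust
        weights = np.abs(cost0 - cost1)
        targets = (cost0 >= cost1).astype(int)
        h = LogisticRegression(max_iter=1000).fit(X, targets, sample_weight=weights)
        hs.append(h)
        # Exponentiated-gradient update: theta <- theta + eta * (constraint violations).
        pred = h.predict(X)
        mu_star = pred.mean()
        viol = np.array([s * (pred[A == a].mean() - mu_star) - eps for (a, s) in K])
        theta += eta * viol
    return hs   # the returned randomized classifier is uniform over hs
```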

Example 4 (EO).

For equalized odds, the cost-sensitive reduction for a vector of Lagrange multipliers λ uses costs

C_i^0 = 1{Y_i = 1},   C_i^1 = 1{Y_i = 0} + λ_{(A_i, Y_i)} / p̂_{(A_i, Y_i)} − λ_{(⋆, Y_i)} / p̂_{Y_i},

where λ_{(a,y)} := λ_{(a,y,+)} − λ_{(a,y,−)}, λ_{(⋆,y)} := Σ_{a ∈ 𝒜} λ_{(a,y)}, and p̂_{(a,y)} := P̂[A = a, Y = y], p̂_y := P̂[Y = y]. If we again assume ρ ≤ 2, then we obtain the bound of Theorem 1 with |𝒦| = 4|𝒜|. Thus, Algorithm 1 terminates in at most 16 B² log(4|𝒜| + 1) / ν² iterations.

3.3 Error Analysis

Our ultimate goal, as formalized in equation (3), is to minimize the classification error while satisfying fairness constraints under a true but unknown distribution over (X, A, Y). In the process of deriving Algorithm 1, we introduced three different sources of error. First, we replaced the true classification error and true moments with their empirical versions. Second, we introduced a bound B on the magnitude of λ. Finally, we only run the optimization algorithm for a fixed number of iterations, until it reaches suboptimality level ν. The first source of error, due to the use of empirical rather than true quantities, is unavoidable and constitutes the underlying statistical error. The other two sources of error, the bound B and the suboptimality level ν, stem from the optimization algorithm and can be driven arbitrarily small at the cost of additional iterations. In this section, we show how the statistical error and the optimization error affect the true accuracy and the fairness of the randomized classifier returned by Algorithm 1—in other words, how well Algorithm 1 solves our original problem (3).

To bound the statistical error, we use the Rademacher complexity of the classifier family ℋ, which we denote by R_n(ℋ), where n is the number of training examples. We assume that R_n(ℋ) ≤ C n^{−α} for some C ≥ 0 and α ≤ 1/2. This assumption holds, with α = 1/2, in the vast majority of classifier families, including norm-bounded linear functions (see Theorem 1 of Kakade et al., 2009), neural networks (see Theorem 18 of Bartlett & Mendelson, 2002), and classifier families with bounded VC dimension (see Lemma 4 and Theorem 6 of Bartlett & Mendelson, 2002).

Recall that in our empirical optimization problem we assume that ĉ_k = c_k + ε_k, where the ε_k ≥ 0 are error bounds that account for the discrepancy between μ̂(Q) and μ(Q). In our analysis, we assume that these error bounds have been set in accordance with the Rademacher complexity of ℋ.

Assumption 1.

There exist C′ ≥ 0 and α ≤ 1/2 such that R_n(ℋ) ≤ C n^{−α} and, for each k ∈ 𝒦, ε_k ≥ C′ Σ_{j ∈ 𝒥} |M_{k,j}| n_j^{−α}, where n_j is the number of data points that fall in ℰ_j.

The optimization error can be bounded via a careful analysis of the Lagrangian and the optimality conditions of (P) and (D). Combining the three different sources of error yields the following bound, which we prove in Appendix C.

Theorem 2.

Let Assumption 1 hold. Let (Q̂, λ̂) be any ν-approximate saddle point of L, let Q⋆ minimize err(Q) subject to M μ(Q) ≤ c, and let ρ := max_h ‖M μ̂(h) − ĉ‖_∞. Then, with probability at least 1 − δ, the distribution Q̂ satisfies

err(Q̂) ≤ err(Q⋆) + 2ν + Õ(n^{−α}),

where Õ(·) suppresses polynomial dependence on ln(1/δ). If, in addition, ε_k ≤ Õ(Σ_j |M_{k,j}| n_j^{−α}) for all k, then, for all k,

γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + Õ(Σ_j |M_{k,j}| n_j^{−α}),   where γ(Q) := M μ(Q).

In other words, the solution Q̂ returned by Algorithm 1 achieves the lowest feasible classification error on the true distribution up to the optimization error, which grows linearly with ν, and the statistical error, which grows as n^{−α}. Therefore, if we want to guarantee that the optimization error does not dominate the statistical error, we should set ν ∝ n^{−α}. The fairness constraints on the true distribution are satisfied up to the optimization error (1 + 2ν)/B and up to the statistical error. Because the statistical error depends on the moments, and the error in estimating the moments grows as n_j^{−α}, we can set B ∝ n^{α} to guarantee that the optimization error does not dominate the statistical error. Combining this reasoning with the learning rate setting of Theorem 1 yields the following theorem (proved in Appendix C).

Theorem 3.

Let Assumption 1 hold and let Q⋆ minimize err(Q) subject to M μ(Q) ≤ c. Then Algorithm 1 run with ν ∝ n^{−α}, B ∝ n^{α}, and η set as in Theorem 1 terminates in Õ(n^{4α} log|𝒦|) iterations and returns Q̂, which with probability at least 1 − δ satisfies

err(Q̂) ≤ err(Q⋆) + Õ(n^{−α})   and   γ_k(Q̂) ≤ c_k + Õ(Σ_j |M_{k,j}| n_j^{−α})   for all k ∈ 𝒦.

Example 5 (DP).

If n_a denotes the number of training examples with A_i = a, then Assumption 1 states that we should set ε_{(a,+)} = ε_{(a,−)} ∝ n_a^{−α} + n^{−α}, and Theorem 3 then shows that for a suitable setting of B, ν, η, and the ε_k, Algorithm 1 will return a randomized classifier Q̂ with the lowest feasible classification error up to Õ(n^{−α}) while also approximately satisfying the fairness constraints

| P[h(X) = 1 | A = a] − P[h(X) = 1] | ≤ c_{(a,±)} + Õ(n_a^{−α})   for all a ∈ 𝒜,

where the probability is with respect to (X, A, Y) as well as h ∼ Q̂.

Example 6 (EO).

Similarly, if n_{(a,y)} denotes the number of examples with A_i = a and Y_i = y, and n_y denotes the number of examples with Y_i = y, then Assumption 1 states that we should set ε_{(a,y,+)} = ε_{(a,y,−)} ∝ n_{(a,y)}^{−α} + n_y^{−α}, and Theorem 3 then shows that for a suitable setting of B, ν, η, and the ε_k, Algorithm 1 will return a randomized classifier Q̂ with the lowest feasible classification error up to Õ(n^{−α}) while also approximately satisfying the fairness constraints

| P[h(X) = 1 | A = a, Y = y] − P[h(X) = 1 | Y = y] | ≤ c_{(a,y,±)} + Õ(n_{(a,y)}^{−α})

for all a ∈ 𝒜 and y ∈ {0, 1}. Again, the probability includes randomness under the true distribution over (X, A, Y) as well as h ∼ Q̂.

3.4 Grid Search

In some situations, it is preferable to select a deterministic classifier, even if that means a lower accuracy or a modest violation of the fairness constraints. A set of candidate classifiers can be obtained from the saddle point (Q†, λ†). Specifically, because Q† is a minimizer of L(Q, λ†) and L is linear in Q, the distribution Q† puts non-zero mass only on classifiers that are the Q-player’s best responses to λ†. If we knew λ†, we could retrieve one such best response via the reduction to cost-sensitive learning introduced in Section 3.2.

We can compute λ† using Algorithm 1, but when the number of constraints is very small, as is the case for demographic parity or equalized odds with a binary protected attribute, it is also reasonable to consider a grid of values λ, calculate the best response for each value, and then select the value with the desired tradeoff between accuracy and fairness.

Example 7 (DP).

When the protected attribute is binary, e.g., A ∈ {a, a′}, the grid search can in fact be conducted in a single dimension. The reduction formally takes two real-valued arguments λ_a and λ_{a′}, and then adjusts the costs for predicting ŷ = 1 by the amounts

δ_a = λ_a / p̂_a − (λ_a + λ_{a′})   and   δ_{a′} = λ_{a′} / p̂_{a′} − (λ_a + λ_{a′}),

respectively, on the training examples with A = a and A = a′. These adjustments satisfy p̂_a δ_a + p̂_{a′} δ_{a′} = 0, so instead of searching over λ_a and λ_{a′}, we can carry out the grid search over δ_{a′} alone and apply the implied adjustment δ_a = −(p̂_{a′} / p̂_a) δ_{a′} to the protected attribute value a.

With three attribute values, e.g., A ∈ {a, a′, a″}, we similarly have p̂_a δ_a + p̂_{a′} δ_{a′} + p̂_{a″} δ_{a″} = 0, so it suffices to conduct grid search in two dimensions rather than three.
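A one-dimensional grid search of this kind is a few lines on top of the weighted-classification oracle. The sketch below is ours (group labels 0 and 1, a hypothetical grid of adjustments, and logistic regression as the oracle); it sweeps the adjustment for one group and records the training error and DP gap of each resulting classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def grid_search_dp(X, y, A, deltas=np.linspace(-2.0, 2.0, 41)):
    """Binary protected attribute A in {0, 1}: sweep the cost adjustment
    delta_1 for group 1; delta_0 = -(p1/p0) * delta_1 is then implied."""
    p0, p1 = np.mean(A == 0), np.mean(A == 1)
    results = []
    for d1 in deltas:
        d0 = -p1 / p0 * d1
        adjust = np.where(A == 1, d1, d0)          # added to the cost of predicting 1
        cost0 = (y == 1).astype(float)
        cost1 = (y == 0).astype(float) + adjust
        weights = np.abs(cost0 - cost1)
        targets = (cost0 >= cost1).astype(int)
        h = LogisticRegression(max_iter=1000).fit(X, targets, sample_weight=weights)
        pred = h.predict(X)
        error = np.mean(pred != y)
        dp_gap = abs(pred[A == 1].mean() - pred[A == 0].mean())
        results.append((d1, error, dp_gap, h))
    return results   # pick the classifier with the preferred error/fairness tradeoff
```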

Example 8 (EO).

If A ∈ {a, a′}, we obtain the adjustment

δ_{(a,y)} = λ_{(a,y)} / p̂_{(a,y)} − (λ_{(a,y)} + λ_{(a′,y)}) / p̂_y

for an example with protected attribute value a and label y, and similarly for protected attribute value a′. In this case, separately for each y, the adjustments satisfy p̂_{(a,y)} δ_{(a,y)} + p̂_{(a′,y)} δ_{(a′,y)} = 0, so it suffices to do the grid search over δ_{(a′,0)} and δ_{(a′,1)} and set the parameters for a to δ_{(a,y)} = −(p̂_{(a′,y)} / p̂_{(a,y)}) δ_{(a′,y)}.

Figure 1: Test classification error versus constraint violation with respect to DP (top two rows) and EO (bottom two rows). All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable there. For our reduction approach we plot the convex envelope of the classifiers obtained on training data at various accuracy–fairness tradeoffs. We show 95% confidence bands for the classification error of our reduction approach and 95% confidence intervals for the constraint violation of post-processing. Our reduction approach dominates or matches the performance of the other approaches up to statistical uncertainty.

4 Experimental Results

We now examine how our exponentiated-gradient reduction (implementation available at https://github.com/Microsoft/fairlearn) performs at the task of binary classification subject to either demographic parity or equalized odds. We provide an evaluation of our grid-search reduction in Appendix D.

We compared our reduction with the score-based post-processing algorithm of Hardt et al. (2016), which takes as its input any classifier (i.e., a standard classifier trained without any fairness constraints) and derives a monotone transformation of the classifier’s output to remove any disparity with respect to the training examples. This post-processing algorithm works with both demographic parity and equalized odds, as well as with binary and non-binary protected attributes.

For demographic parity, we also compared our reduction with the reweighting and relabeling approaches of Kamiran & Calders (2012). Reweighting can be applied to both binary and non-binary protected attributes and operates by changing importance weights on each example with the goal of removing any statistical dependence between the protected attribute and the label. (Although reweighting was developed for demographic parity, the weights that it induces are achievable by our grid search, albeit the grid search for equalized odds rather than demographic parity.) Relabeling was developed for binary protected attributes. First, a classifier is trained on the original data (without considering fairness). The training examples close to the decision boundary are then relabeled to remove all disparity while minimally affecting accuracy. The final classifier is then trained on the relabeled data.
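For reference, the reweighting scheme of Kamiran & Calders (2012) is commonly described as assigning each (a, y) cell the weight P̂(A = a) P̂(Y = y) / P̂(A = a, Y = y); the sketch below is our reading of that scheme, not the code used in our evaluation.

```python
import numpy as np

def reweighing_weights(A, y):
    """Kamiran & Calders-style reweighting (our reading of the scheme):
    weight each (a, y) cell by P(A=a) * P(Y=y) / P(A=a, Y=y), which makes
    A and Y statistically independent under the reweighted empirical distribution."""
    A, y = np.asarray(A), np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for a in np.unique(A):
        for label in np.unique(y):
            cell = (A == a) & (y == label)
            if cell.any():
                w[cell] = (np.mean(A == a) * np.mean(y == label)) / np.mean(cell)
    return w   # pass as sample_weight to any weighted learner
```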

As the base classifiers for our reductions, we used the weighted classification implementations of logistic regression and gradient-boosted decision trees in scikit-learn (Pedregosa et al., 2011). In addition to the three baselines described above, we also compared our reductions to the “unconstrained” classifiers trained to optimize accuracy only.

We used four data sets, randomly splitting each one into training examples (75%) and test examples (25%):


  • The adult income data set (Lichman, 2013) (48,842 examples). Here the task is to predict whether someone makes more than $50k per year, with gender as the protected attribute. To examine the performance for non-binary protected attributes, we also conducted another experiment with the same data, using both gender and race (binarized into white and non-white) as the protected attribute. Relabeling, which requires binary protected attributes, was therefore not applicable here.

  • ProPublica’s COMPAS recidivism data (7,918 examples). The task is to predict recidivism from someone’s criminal history, jail and prison time, demographics, and COMPAS risk scores, with race as the protected attribute (restricted to white and black defendants).

  • Law School Admissions Council’s National Longitudinal Bar Passage Study (Wightman, 1998) (20,649 examples). Here the task is to predict someone’s eventual passage of the bar exam, with race (restricted to white and black only) as the protected attribute.

  • The Dutch census data set (Dutch Central Bureau for Statistics, 2001) (60,420 examples). Here the task is to predict whether or not someone has a prestigious occupation, with gender as the protected attribute.

While all the evaluated algorithms require access to the protected attribute A at training time, only the post-processing algorithm requires access to A at test time. For a fair comparison, we included A in the feature vector X, so all algorithms had access to it at both training time and test time.

We used the test examples to measure the classification error for each approach, as well as the violation of the desired fairness constraints, i.e., max_a |Ê[h(X) | A = a] − Ê[h(X)]| for demographic parity and max_{a,y} |Ê[h(X) | A = a, Y = y] − Ê[h(X) | Y = y]| for equalized odds.
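Concretely, the two reported disparities can be computed from test-set predictions as follows (our sketch of the metrics just described; NumPy assumed).

```python
import numpy as np

def dp_disparity(pred, A):
    """max_a | E[h(X) | A=a] - E[h(X)] | on the given sample."""
    pred, A = np.asarray(pred, dtype=float), np.asarray(A)
    return max(abs(pred[A == a].mean() - pred.mean()) for a in np.unique(A))

def eo_disparity(pred, A, y):
    """max_{a, y} | E[h(X) | A=a, Y=y] - E[h(X) | Y=y] | on the given sample."""
    pred, A, y = np.asarray(pred, dtype=float), np.asarray(A), np.asarray(y)
    return max(abs(pred[(A == a) & (y == label)].mean() - pred[y == label].mean())
               for a in np.unique(A) for label in np.unique(y))
```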

We ran our reduction across a wide range of tradeoffs between the classification error and the fairness constraints. We varied the constraint slack ε over a wide range and, for each value, ran Algorithm 1 across all the data sets. As expected, the returned randomized classifiers tracked the training Pareto frontier (see Figure 2 in Appendix D). In Figure 1, we evaluate these classifiers alongside the baselines on the test data.

For all the data sets, the range of classification errors is much smaller than the range of constraint violations. Almost all the approaches were able to substantially reduce or remove disparity without much impact on classifier accuracy. One exception was the Dutch census data set, where the classification error increased the most in relative terms.

Our reduction generally dominated or matched the baselines. The relabeling approach frequently yielded solutions that were not Pareto optimal. Reweighting yielded solutions on the Pareto frontier, but often with substantial disparity. As expected, post-processing yielded disparities that were statistically indistinguishable from zero, but the resulting classification error was sometimes higher than achieved by our reduction under a statistically indistinguishable disparity. In addition, and unlike the post-processing algorithm, our reduction can achieve any desired accuracy–fairness tradeoff, allows a wider range of fairness definitions, and does not require access to the protected attribute at test time.

Our grid-search reduction, evaluated in Appendix D, sometimes failed to achieve the lowest disparities on the training data, but its performance on the test data very closely matched that of our exponentiated-gradient reduction. However, if the protected attribute is non-binary, then grid search is not feasible. For instance, for the version of the adult income data set where the protected attribute takes on four values, the grid search would need to span three dimensions for demographic parity and six dimensions for equalized odds, both of which are prohibitively costly.

5 Conclusion

We presented two reductions for achieving fairness in a binary classification setting. Our reductions work for any classifier representation, encompass many definitions of fairness, satisfy provable guarantees, and work well in practice.

Our reductions optimize the tradeoff between accuracy and any (single) definition of fairness given training-time access to protected attributes. Achieving fairness when training-time access to protected attributes is unavailable remains an open problem for future research, as does the navigation of tradeoffs between accuracy and multiple fairness definitions.

Acknowledgements

We would like to thank Aaron Roth, Sam Corbett-Davies, and Emma Pierson for helpful discussions.

References

  • Agarwal et al. (2017) Agarwal, A., Beygelzimer, A., Dudík, M., and Langford, J. A reductions approach to fair classification. In Fairness, Accountability, and Transparency in Machine Learning (FATML), 2017.
  • Alabi et al. (2018) Alabi, D., Immorlica, N., and Kalai, A. T. Unleashing linear optimizers for group-fair learning and optimization. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), 2018.
  • Barocas & Selbst (2016) Barocas, S. and Selbst, A. D. Big data’s disparate impact. California Law Review, 104:671–732, 2016.
  • Bartlett & Mendelson (2002) Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • Berk et al. (2017) Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. arXiv:1703.09207, 2017.
  • Beygelzimer et al. (2005) Beygelzimer, A., Dani, V., Hayes, T. P., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pp. 49–56, 2005.
  • Boucheron et al. (2005) Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  • Calmon et al. (2017) Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems 30, 2017.
  • Chouldechova (2017) Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Special Issue on Social and Technical Trade-Offs, 2017.
  • Corbett-Davies et al. (2017) Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.
  • Donini et al. (2018) Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. arXiv:1802.08626, 2018.
  • Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.
  • Dwork et al. (2018) Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency (FAT), pp. 119–133, 2018.
  • Fan et al. (1999) Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. Adacost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 97–105, 1999.
  • Feldman et al. (2015) Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
  • Freund & Schapire (1996) Freund, Y. and Schapire, R. E. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (COLT), pp. 325–332, 1996.
  • Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  • Hardt et al. (2016) Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NIPS), 2016.
  • Johnson et al. (2016) Johnson, K. D., Foster, D. P., and Stine, R. A. Impartial predictive modeling: Ensuring fairness in arbitrary models. arXiv:1608.00528, 2016.
  • Kakade et al. (2009) Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800, 2009.
  • Kamiran & Calders (2012) Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.
  • Kamishima et al. (2011) Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650, 2011.
  • Kearns et al. (2018) Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Kivinen & Warmuth (1997) Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
  • Kleinberg et al. (2017) Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference, 2017.
  • Langford & Beygelzimer (2005) Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pp. 158–172, 2005.
  • Ledoux & Talagrand (1991) Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
  • Lichman (2013) Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
  • Menon & Williamson (2018) Menon, A. K. and Williamson, R. C. The cost of fairness in binary classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2018.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Rockafellar (1970) Rockafellar, R. T. Convex analysis. Princeton University Press, 1970.
  • Shalev-Shwartz (2012) Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • Wightman (1998) Wightman, L. LSAC National Longitudinal Bar Passage Study, 1998.
  • Woodworth et al. (2017) Woodworth, B. E., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory (COLT), pp. 1920–1953, 2017.
  • Zafar et al. (2017) Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 962–970, 2017.

Appendix A Error and Fairness for Randomized Classifiers

Let D denote the distribution over triples (X, A, Y). The accuracy of a classifier h is measured by its 0-1 error, err(h) := P_D[h(X) ≠ Y], which for a randomized classifier Q becomes

err(Q) = Σ_{h ∈ ℋ} Q(h) err(h).

The fairness constraints on a classifier h are M μ(h) ≤ c. Recall that μ_j(h) = E[g_j(X, A, Y, h(X)) | ℰ_j]. For a randomized classifier Q we define its moment as

μ_j(Q) := E_{(X,A,Y) ∼ D, h ∼ Q}[ g_j(X, A, Y, h(X)) | ℰ_j ] = Σ_{h ∈ ℋ} Q(h) μ_j(h),

where the last equality follows because ℰ_j is independent of the choice of h ∼ Q.

Appendix B Proof of Theorem 1

The proof follows immediately from the analysis of Freund & Schapire (1996) applied to the Exponentiated Gradient (EG) algorithm (Kivinen & Warmuth, 1997), which in our specific case is also equivalent to Hedge (Freund & Schapire, 1997).

Let Λ := {λ ∈ R_{+}^{|𝒦|} : ‖λ‖_1 ≤ B} and Λ′ := {λ′ ∈ R_{+}^{|𝒦|+1} : ‖λ′‖_1 = B}. We associate any λ ∈ Λ with the λ′ ∈ Λ′ that is equal to λ on coordinates 1 through |𝒦| and puts the remaining mass B − ‖λ‖_1 on the coordinate |𝒦| + 1.

Consider a run of Algorithm 1. For each t, let λ′_t ∈ Λ′ be the associated element of Λ′. Let r_t := M μ̂(h_t) − ĉ and let r′_t be equal to r_t on coordinates 1 through |𝒦| and put zero on the coordinate |𝒦| + 1. Thus, for any λ ∈ Λ and the associated λ′ ∈ Λ′, we have, for all t,

λᵀ (M μ̂(h_t) − ĉ) = (λ′)ᵀ r′_t,     (7)

and, in particular,