The emergence of machine learning in the last decade has given rise to an important debate regarding the ethical and societal responsibility of its applications. Machine learning has provided a universal toolbox enhancing decision making in many disciplines, from advertising and recommender systems to education and criminal justice. Unfortunately, both the data and their processing can be biased against specific population groups (even inadvertently) in every single step of the process [BS16]. This has generated societal and policy interest in understanding the sources of this discrimination, and interdisciplinary research has attempted to mitigate its shortcomings.
Discrimination is commonly an issue in applications where decisions need to be made sequentially. The most prominent such application is online advertising, where platforms need to sequentially select which ad to display in response to particular search queries. This process can introduce discrimination against protected groups in many ways, such as filtering out particular alternatives [DTD15, APJ16] and reinforcing existing stereotypes through search results [Swe13, KMM15]. Another canonical example of sequential decision making is medical trials, where under-exploration on female groups often leads to significantly worse treatments for them [LDM16]. Similar issues occur in image classification, as stressed by “gender shades” [BG18]. The reverse (over-exploration in minority populations) can also cause concerns, especially if conducted in a non-transparent fashion [BBC16].
In these sequential settings, the assumption that data are i.i.d. is often violated. Online advertising, recommender systems, medical trials, image classification, loan decisions, and criminal recidivism prediction all require decisions to be made sequentially. The corresponding label distributions are not identical across time and can be affected by the economy, recent events, etc. Similarly, labels are not independent across rounds: if a bank offers a loan, this decision can affect whether the loanee or their environment will be able to repay future loans, thereby affecting future labels, as discussed by Liu et al. [LDR18]. As a result, it is important to understand the effect of this adaptivity on non-discrimination.
The classical way to model settings that are not i.i.d. is via adversarial online learning [LW94, FS97], which poses the question: Given a class of predictors, how can we make online predictions that perform as well as the best predictor from the class in hindsight? The most basic online learning question (answered via the celebrated “multiplicative weights” algorithm) concerns competing with a finite set of predictors. The class is typically referred to as “experts” and its members can be thought of as “features” of the example, where we want to make online predictions that compete with the best 1-sparse predictor.
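To make the benchmark concrete, here is a minimal sketch of the multiplicative weights algorithm in the experts setting (the function name, learning rate, and loss encoding are our choices for illustration, not from the text):

```python
import math

def multiplicative_weights(loss_vectors, eta=0.1):
    """Prediction with expert advice via multiplicative weights.

    loss_vectors: list of per-round loss vectors; loss_vectors[t][i] is the
    loss in [0, 1] of expert i at round t. Returns the learner's cumulative
    expected loss when sampling an expert from the current weights each round.
    """
    n = len(loss_vectors[0])
    weights = [1.0] * n
    total_loss = 0.0
    for losses in loss_vectors:
        z = sum(weights)
        probs = [w / z for w in weights]          # distribution over experts
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # exponentially down-weight experts that incurred loss this round
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total_loss
```

With N experts and a learning rate of roughly √(ln N / T), the cumulative loss of this scheme stays within O(√(T ln N)) of the best expert in hindsight.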
In this work, we wish to understand the interplay between adaptivity and non-discrimination and therefore consider the most basic extension of the classical online learning question:
Given a class of individually non-discriminatory predictors, how can we combine them to perform as well as the best predictor, while preserving non-discrimination?
The assumption that predictors are individually non-discriminatory is a strong assumption and makes the task trivial in the batch setting, where the algorithm is given labeled examples and wishes to perform well on unseen examples drawn from the same distribution: the algorithm can learn the best predictor from the labeled examples and then follow it, and since this predictor is individually non-discriminatory, the algorithm exhibits no discrimination. This enables us to isolate the potential overhead caused by adaptivity and significantly strengthens any impossibility result. Moreover, we can assume that predictors have been individually vetted to satisfy the non-discrimination desiderata; we therefore wish to understand how to efficiently compose these non-discriminatory predictors while preserving non-discrimination.
1.1 Our contribution
Our impossibility results for equalized odds.
Surprisingly, we show that for a prevalent notion of non-discrimination, equalized odds, it is impossible to preserve non-discrimination while also competing comparably with the best predictor in hindsight (the no-regret property). Equalized odds, suggested by Hardt et al. [HPS16] in the batch setting, restricts the set of allowed predictors, requiring that, when examples come from different groups, the prediction is independent of the group conditioned on the label. In binary classification, this means that the false negative rate (the fraction of positive examples predicted negative) is equal across groups, and the same holds for the false positive rate (defined analogously). This notion was popularized by a recent debate on the potential bias of machine learning risk tools for criminal recidivism [ALMK16, Cho17, KMR17, FPCDG16].
Our impossibility results demonstrate that the order in which examples arrive significantly complicates the task of achieving the desired efficiency while preserving non-discrimination with respect to equalized odds. In particular, we show that any algorithm agnostic to the group identity either cannot achieve performance comparable to the best predictor or exhibits discrimination in some instances (Theorem 3.1). This occurs in phenomenally simple settings with only two individually non-discriminatory predictors, two groups, and perfectly balanced instances: the groups are of equal size and each receives an equal number of positive and negative labels. The only imbalance lies in the order in which the labels arrive, which is also relatively well behaved: labels are generated from two i.i.d. distributions, one in the first half of the instance and one in the second half. Although in many settings we cannot actively use the group identity of the examples due to legal reasons (e.g., in hiring), one may wonder whether these impossibility results disappear if we can actively use the group information to compensate for past mistakes. We show that this is not the case either (Theorem 3.2). Although our groups are not perfectly balanced there, the construction is again very simple and consists of only two groups and two predictors: one always predicting positive and one always predicting negative. The simplicity of the settings, combined with the very strong assumption that the predictors are individually non-discriminatory, speaks to the trade-off between adaptivity and non-discrimination with respect to equalized odds.
Our results for equalized error rates.
The strong impossibility results with respect to equalized odds invite the natural question of whether there exists some alternative fairness notion that, given access to non-discriminatory predictors, achieves efficiency while preserving non-discrimination. We answer the above positively by suggesting the notion of equalized error rates, which requires that the average expected loss (regardless of whether it stems from false positives or false negatives) encountered by each group be the same. This notion makes sense in settings where performance and fairness are measured with respect to the same objective. Consider a medical application where people from different subpopulations wish to receive appropriate treatment, and any error in treatment costs equally both towards performance and towards fairness. (In contrast, under equalized odds, a misprediction only counts towards the false negative metric if the true label is positive.) It is morally objectionable to discriminate against one group, e.g. based on race, using it as experimentation to enhance the quality of service of the other, and it is reasonable to require that all subpopulations receive the same quality of service.
For this notion, we show that, if all predictors are individually non-discriminatory with respect to equalized error rates, running separate multiplicative weights algorithms, one for each subpopulation, preserves this non-discrimination without decay in efficiency (Theorem 4.1). The key property we use is that the multiplicative weights algorithm is guaranteed to perform not only no worse than the best predictor in hindsight but also no better; this property holds for a broader class of algorithms [EDKMW08]. Our result applies to general loss functions beyond binary predictions and only requires predictors to satisfy the weakened assumption of being approximately non-discriminatory.
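The composition scheme behind this result can be sketched as follows: one multiplicative weights instance per group, each arriving example routed to the instance of its group, and the per-group average losses tracked for comparison (a sketch under our own naming; parameters are illustrative):

```python
import math
from collections import defaultdict

class MW:
    """Minimal multiplicative weights learner over a fixed set of experts."""
    def __init__(self, n_experts, eta=0.1):
        self.weights = [1.0] * n_experts
        self.eta = eta

    def probs(self):
        z = sum(self.weights)
        return [w / z for w in self.weights]

    def update(self, losses):
        self.weights = [w * math.exp(-self.eta * l)
                        for w, l in zip(self.weights, losses)]

def run_separate(rounds, n_experts, eta=0.1):
    """rounds: list of (group, loss_vector) pairs; one MW instance per group.

    Returns each group's average expected loss -- the quantity that the
    equalized error rates notion requires to be (approximately) equal.
    """
    learners = defaultdict(lambda: MW(n_experts, eta))
    group_loss = defaultdict(float)
    group_count = defaultdict(int)
    for group, losses in rounds:
        learner = learners[group]
        p = learner.probs()
        group_loss[group] += sum(pi * li for pi, li in zip(p, losses))
        group_count[group] += 1
        learner.update(losses)
    return {g: group_loss[g] / group_count[g] for g in group_loss}
```

Because each group's instance sees only that group's rounds, its average loss tracks the loss of the best expert on that group from both sides, which is what equalizes the error rates.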
Finally, we examine whether the two design decisions, running separate algorithms and using this particular algorithm that does not outperform the best expert, were important for the result. We first give evidence that running separate algorithms is essential: if we run a single instance of “multiplicative weights” or “follow the perturbed leader”, we cannot guarantee non-discrimination with respect to equalized error rates (Theorem 4.2). We then show that the property of not performing better than the best predictor is also crucial; in particular, better algorithms that satisfy the stronger guarantee of low shifting regret [HW01, BM05, LS15] are also unable to guarantee this non-discrimination (Theorem 4.2). These algorithms are considered superior to classical no-regret algorithms as they adapt better to changes in the environment, which has nice implications in game-theoretic settings [LST16]. Our latter impossibility result is a first application where having these strong guarantees against changing benchmarks is not necessarily desirable, and it is therefore of independent learning-theoretic interest.
1.2 Related work
There is a large line of work on fairness and non-discrimination in machine learning (see [PRT08, CKP09, DHP12, ZWS13, JKMR16, HPS16, Cho17, KMR17, KNRW18] for a non-exhaustive list). We elaborate on works that either study group notions of fairness or fairness in online learning.
The last decade has seen a lot of work on group notions of fairness, mostly in the classification setting. Examples include notions that compare the percentage of members predicted positive, such as demographic parity [CKP09] and disparate impact [FFM15], as well as equalized odds [HPS16] and calibration across groups [Cho17, KMR17]. There is no consensus on a universal fairness notion; rather, the specific notion considered is largely task-specific. In fact, previous works identified that these notions are often incompatible with each other [Cho17, KMR17], posed concerns that they may introduce unintentional discrimination [CDG18], and suggested the need to go beyond such observational criteria via causal reasoning [KCP17, KLRS17]. Prior to our work, group fairness notions have been studied primarily in the batch learning setting with the goal of optimizing a loss function subject to a fairness constraint, either in a post-hoc correction framework as proposed by Hardt et al. [HPS16] or more directly during training from batch data [ZWS13, GCGF16, WGOS17, ZVRG17, BDNP18], which requires care since the predictors may be discriminatory with respect to the particular metric of interest. The setting we focus on in this paper does not pose the above challenges, since all predictors are non-discriminatory; nevertheless, we obtain surprising impossibility results due to the order in which labels arrive.
Recently, fairness in online learning has also started receiving attention. One line of work focuses on imposing a particular fairness guarantee at all times for bandits and contextual bandits, either for individual fairness [JKMR16, KKM17] or for group fairness [CV17]. Another line of work points to counterintuitive externalities of using contextual bandit algorithms agnostic to the group identity and suggests that heterogeneity in data can replace the need for exploration [RSWW18, KMR18]. Moreover, following a seminal paper by Dwork et al. [DHP12], a line of work aims to treat similar people similarly in online settings [LRD17, GJKR18]. Our work distinguishes itself from these directions mainly in the objective: we require non-discrimination to hold in the long term instead of at all times, which extends the classical batch definitions of non-discrimination to the online setting. In particular, we only focus on situations where there are enough samples from each population of interest, and we do not penalize the algorithm for a few wrong decisions, which would render it overly pessimistic. Another difference is that previous work focuses either on individual notions of fairness or on i.i.d. inputs, while our work is about non-i.i.d. inputs under group notions of fairness.
Online learning protocol with group context.
We consider the classical online learning setting of prediction with expert advice, where a learner needs to make sequential decisions for T rounds by combining the predictions of a finite set of hypotheses (also referred to as experts). We denote the outcome space by Y; in binary classification, this corresponds to Y = {0, 1}. Additionally, we introduce a set G of disjoint groups, which identifies subsets of the population based on a protected attribute (such as gender, ethnicity, or income).
The online learning protocol with group context proceeds in rounds. Each round t is associated with a group context g_t in G and an outcome y_t in Y. We denote the resulting T-length time-group-outcome sequence by s = {(t, g_t, y_t)}. This is a random variable that can depend on the randomness in the generation of the groups and the outcomes. We use the shorthand s^t to denote the subsequence until round t. The exact protocol for generating these sequences is described below. At round t:
An example with group context g_t in G either arrives stochastically or is adversarially selected.
The learning algorithm or learner commits to a probability distribution p_t across experts, where p_t(i) denotes the probability that she follows the advice of expert i at round t. This distribution can be a function of the sequence s^(t-1). We call the learner group-unaware if she ignores the group contexts g_1, ..., g_t when selecting p_t.
An adversary then selects an outcome y_t in Y. The adversary is called adaptive if the groups/outcomes at round t are a function of the realization of s^(t-1); otherwise she is called oblivious. The adversary always has access to the learning algorithm, but an adaptive adversary additionally has access to the realized s^(t-1) and hence also knows p_t.
Simultaneously, each expert i makes a prediction f_t(i) in P, where P is a generic prediction space. For example, in binary classification, the prediction space could simply be the positive or negative labels, P = {0, 1}; or the probabilistic score, P = [0, 1], with f_t(i) interpreted as the probability the expert assigns to the positive label in round t; or even an uncalibrated score like the output of a support vector machine.
Let ℓ : P × Y → [0, 1] be the loss function between predictions and outcomes. This leads to a corresponding loss vector ℓ_t, where ℓ_t(i) = ℓ(f_t(i), y_t) denotes the loss the learner incurs if she follows expert i.
The learner then observes the entire loss vector ℓ_t (full information feedback) and incurs expected loss ⟨p_t, ℓ_t⟩ = Σ_i p_t(i) · ℓ_t(i). For classification, this feedback is obtained by observing y_t.
In this paper, we consider a setting where all the experts are fair in isolation (formalized below). Regarding the group contexts, our main impossibility results (Theorems 3.1 and 3.2) assume that the group contexts arrive stochastically from a fixed distribution, while our positive result (Theorem 4.1) holds even when they are adversarially selected.
For simplicity of notation, we assume throughout the presentation that the learner's algorithm produces the distribution p_t of round t deterministically based on s^(t-1), and therefore all our expectations are taken only over the randomness of s, as is the case in most algorithms. Our results extend when the algorithm uses extra randomness to select the distribution.
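One round of the protocol above can be summarized in code (an illustrative sketch; all names are ours, and the stochastic group arrival is one of the two options the protocol allows):

```python
import random

def online_round(t, groups, experts, learner, adversary, loss, history):
    """Play a single round of the online learning protocol with group context."""
    g = random.choice(groups)              # group context arrives stochastically
    p = learner(history)                   # distribution over experts, from history only
    preds = [expert(t, g) for expert in experts]
    y = adversary(t, g, history)           # adaptive adversary sees the history
    losses = [loss(pred, y) for pred in preds]
    expected_loss = sum(pi * li for pi, li in zip(p, losses))
    history.append((t, g, y, losses))      # full-information feedback
    return expected_loss
```

A group-unaware learner is simply one whose `learner` callback never inspects the group entries stored in `history`.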
Group fairness in online learning.
We now define non-discrimination (or fairness) with respect to a particular evaluation metric M; e.g. in classification, the false negative rate metric (FNR) is the fraction of examples with positive outcome that are incorrectly predicted negative. For any realization of the time-group-outcome sequence s and any group g in G, metric M induces a subset S(M, g, s) of the population that is relevant to it; for example, in classification, S(FNR, g, s) is the set of positive examples of group g. The performance of expert i on this subpopulation is denoted by M_g(i, s). An expert i is called fair in isolation with respect to metric M if, for every sequence s, her performance with respect to M is the same across groups, i.e. M_g(i, s) = M_g'(i, s) for all g, g' in G. Similarly, the learner's performance on this subpopulation is M_g(A, s). To formalize our non-discrimination desiderata, we require the algorithm to have similar expected performance across groups when given access to fair in isolation predictors. We make the following assumptions to avoid trivial impossibility results due to low-probability events or underrepresented populations. First, we take the expectation over sequences generated by the adversary (who has access to the learning algorithm A). Second, we require the relevant subpopulations to be, in expectation, large enough. Our positive results do not depend on either of these assumptions. More formally: consider a set of experts such that each expert is fair in isolation with respect to metric M. Learner A is called α-fair in composition with respect to metric M if, for all adversaries that produce relevant subpopulations of sufficiently large expected size for every group, it holds that |E[M_g(A, s)] − E[M_g'(A, s)]| ≤ α for all g, g' in G.
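For the false negative rate metric, both fairness in isolation and fairness in composition can be checked empirically on a realized sequence. The sketch below, with our own record format of (group, outcome, prediction) triples, computes the per-group FNR and the gap between two groups:

```python
def false_negative_rate(records, group):
    """FNR of one group over (group, outcome, prediction) records: among the
    group's positive examples (outcome == 1), the fraction predicted negative."""
    positives = [r for r in records if r[0] == group and r[1] == 1]
    if not positives:
        return 0.0
    return sum(1 for (_, _, pred) in positives if pred == 0) / len(positives)

def fnr_gap(records, group_a, group_b):
    """Witness for alpha-unfairness: the FNR gap between two groups."""
    return abs(false_negative_rate(records, group_a)
               - false_negative_rate(records, group_b))
```

The definition in the text takes the expectation of these quantities over the randomness of the sequence, rather than a single realization.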
We note that, in many settings, we wish to have non-discrimination with respect to multiple metrics simultaneously. For instance, equalized odds requires fairness in composition both with respect to false negative rate and with respect to false positive rate (defined analogously). Since we provide an impossibility result for equalized odds, focusing on only one metric makes the result even stronger.
The typical way to evaluate the performance of an algorithm in online learning is via the notion of regret. Regret compares the performance of the algorithm to the performance of the best expert in hindsight on the realized sequence, as defined below:

Regret(T) = Σ_{t=1}^{T} ⟨p_t, ℓ_t⟩ − min_i Σ_{t=1}^{T} ℓ_t(i).
In the above definition, the regret is a random variable depending on the sequence s, and therefore on the randomness in the groups and outcomes.
An algorithm satisfies the no-regret property (or Hannan consistency) in our setting if, for any losses realizable by the above protocol, the regret is sublinear in the time horizon T, i.e. Regret(T) = o(T). This property ensures that, as time goes by, the average regret vanishes. Many online learning algorithms, such as multiplicative weights updates, satisfy this property with Regret(T) = O(√(T log N)), where N is the number of experts.
We focus on the notion of approximate regret, a relaxation of regret that gives a small multiplicative slack to the algorithm. More formally, the (1 + ε)-approximate regret with respect to expert i is defined as:

ApxRegret_ε(T, i) = Σ_{t=1}^{T} ⟨p_t, ℓ_t⟩ − (1 + ε) · Σ_{t=1}^{T} ℓ_t(i).
We note that typical algorithms guarantee E[ApxRegret_ε(T, i)] ≤ O(log N / ε) simultaneously for all experts i. When the time horizon T is known in advance, setting ε = √(log N / T) recovers the aforementioned regret guarantee. When the time horizon is not known, one can obtain a similar guarantee by adjusting the learning rate of the algorithm appropriately.
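The quantity in question is straightforward to compute on a realized sequence (a sketch with our own naming):

```python
def approx_regret(alg_losses, expert_losses, eps):
    """(1 + eps)-approximate regret of the learner against one fixed expert:
    the learner's cumulative loss minus (1 + eps) times the expert's."""
    return sum(alg_losses) - (1.0 + eps) * sum(expert_losses)
```

Dividing by T and requiring the limit to be non-positive for every expert is the vanishing approximate regret property used below.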
Our goal is to develop online learning algorithms that combine fair in isolation experts in order to achieve both vanishing average expected ε-approximate regret, i.e. lim_{T→∞} E[ApxRegret_ε(T, i)]/T ≤ 0 for any fixed ε > 0 and every expert i, and also non-discrimination with respect to the fairness metrics of interest.
3 Impossibility results for equalized odds
In this section, we study a popular group fairness notion, equalized odds, in the context of online learning. A natural extension of equalized odds to online settings would require that the false negative rate, i.e. the percentage of positive examples predicted incorrectly, is the same across all groups, and that the same holds for the false positive rate. We assume that our experts are fair in isolation with respect to both the false negative and the false positive rate. A weaker notion than equalized odds is equality of opportunity, where the non-discrimination condition is required only for the false negative rate. We first study whether it is possible to achieve the vanishing regret property while guaranteeing α-fairness in composition with respect to false negative rate for arbitrarily small α. When the input is i.i.d., this is trivial, as we can learn the best expert in a sublinear number of rounds and then follow its advice; since this expert is fair in isolation, this guarantees vanishing non-discrimination.
In contrast, we show that, in a non-i.i.d. online setting, this goal is unachievable. We demonstrate this in phenomenally benign settings where there are just two groups that come from a fixed distribution and just two experts that are fair in isolation (with respect to false negative rate) even per round, not only ex post. Our first construction (Theorem 3.1) shows that no no-regret learning algorithm that is group-unaware can guarantee fairness in composition, even on instances that are perfectly balanced (each pair of label and group receives a quarter of the examples); the only adversarial component is the order in which these examples arrive. This is surprising because such a task is straightforward in the stochastic setting, as all hypotheses are non-discriminatory. We then study whether actively using the group identity can correct the aforementioned discrimination, similarly to how it enables correction against discriminatory predictors [HPS16]. The answer is negative even in this scenario (Theorem 3.2): if the population is sufficiently imbalanced, any no-regret learning algorithm is unfair in composition with respect to false negative rate even if it is not group-unaware.
3.1 Group-unaware algorithms
We first present the theorem about group-unaware algorithms.
For every small enough ε, there exists α > 0 such that any group-unaware algorithm satisfying vanishing approximate regret with respect to all experts is α-unfair in composition with respect to false negative rate, even for perfectly balanced sequences. In particular, for any group-unaware algorithm that ensures vanishing approximate regret (a requirement weaker than vanishing regret, so the impossibility result also applies to vanishing regret algorithms), there exists an oblivious adversary for assigning labels such that:
In expectation, half of the population corresponds to each group.
For each group, in expectation half of its labels are positive and the other half are negative.
The false negative rates of the two groups differ by a constant.
Consider an instance that consists of two groups {a, b}, two experts {h1, h2}, and two phases: Phase I and Phase II. Group a is the group we end up discriminating against, while group b is boosted by the discrimination with respect to false negative rate. At each round, the groups arrive stochastically with probability 1/2 each, independent of the history.
The experts output a score value in [0, 1], where the score can be interpreted as the probability that the expert assigns to the label being positive in round t. The loss function is the expected probability of error, ℓ(f_t(i), y_t) = |f_t(i) − y_t|. The two experts are very simple: h1 always predicts negative, i.e. f_t(h1) = 0 for all t, and h2 is an unbiased expert who, irrespective of the group or the label, makes an inaccurate prediction with probability ε, i.e. |f_t(h2) − y_t| = ε for all t. Both experts are fair in isolation with respect to both false negative and false positive rates: the FNR is 1 for h1 and ε for h2 regardless of the group, and the FPR is 0 for h1 and ε for h2, again independent of the group. The instance proceeds in two phases:
Phase I lasts for T/2 rounds. The adversary assigns negative labels to examples with group context b and assigns a label uniformly at random to examples from group a.
In Phase II, there are two plausible worlds:
if the expected probability that the algorithm assigns to expert h2 in Phase I exceeds a certain threshold, then the adversary assigns negative labels to both groups;
else the adversary assigns positive labels to examples with group context b, while examples from group a keep receiving positive and negative labels with probability 1/2 each.
We will show that, for any algorithm with the vanishing approximate regret property, the condition for the first world is never triggered and hence the above sequence is indeed balanced.
We now show why this instance is unfair in composition with respect to false negative rate. The proof involves showing the following two claims:
In Phase I, any vanishing approximate regret algorithm needs to select the negative expert h1 most of the time to ensure small approximate regret with respect to h1. This means that, in Phase I (where we encounter half of the positive examples from group a and none from group b), the false negative rate of the algorithm is close to 1.
In Phase II, any vanishing approximate regret algorithm should quickly catch up to ensure small approximate regret with respect to h2, and hence its false negative rate in this phase is closer to ε. Since the algorithm is group-unaware, this creates a mismatch between the false negative rate of group b (whose positive examples all arrive in this phase) and that of group a (which has also accumulated many false negatives before).
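The two claims can be observed in simulation. The sketch below runs a group-unaware multiplicative weights learner on an instance of the above shape and reports per-group false negative rates (the parameter values, learning rate, and horizon are our own illustrative choices, not the ones fixed in the formal construction):

```python
import math
import random

def simulate_two_phase(T=4000, eps=0.3, eta=0.5, seed=0):
    """Group-unaware multiplicative weights on the two-phase instance:
    Phase I: group b gets negative labels, group a uniform labels;
    Phase II: group b gets positive labels, group a stays uniform.
    Returns the expected false negative rate of each group."""
    rng = random.Random(seed)
    w = [1.0, 1.0]              # weights of (negative expert h1, eps-inaccurate expert h2)
    fn = {'a': 0.0, 'b': 0.0}   # expected false negatives per group
    pos = {'a': 0, 'b': 0}      # positive examples seen per group
    for t in range(T):
        g = 'a' if rng.random() < 0.5 else 'b'
        if t < T // 2:
            y = rng.randint(0, 1) if g == 'a' else 0   # Phase I
        else:
            y = rng.randint(0, 1) if g == 'a' else 1   # Phase II
        scores = [0.0, abs(y - eps)]     # experts' probabilities of a positive prediction
        p_neg = w[0] / (w[0] + w[1])     # learner's probability on h1
        if y == 1:
            pos[g] += 1
            # probability the learner predicts negative on this positive example
            fn[g] += p_neg + (1 - p_neg) * (1 - scores[1])
        losses = [abs(s - y) for s in scores]
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    return {g: fn[g] / pos[g] for g in fn}
```

With these parameters the learner follows the negative expert through most of Phase I and only catches up early in Phase II, so the measured FNR of group a noticeably exceeds that of group b.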
Upper bound on the probability of playing h2 in Phase I.
We now formalize the first claim by showing that any algorithm that places non-vanishing expected probability on h2 in Phase I does not satisfy the approximate regret property. The algorithm suffers an expected loss of ε every time it selects expert h2. On the other hand, when selecting expert h1, it suffers a loss of 0 for members of group b and an expected loss of 1/2 for members of group a. As a result, the expected loss of the algorithm in the first phase is:
In contrast, the negative expert h1 has, in Phase I, an expected loss of:
Therefore, if the algorithm places non-vanishing expected probability on h2 in Phase I, its approximate regret with respect to h1 is linear in the time horizon (and therefore not vanishing) since:
Upper bound on the probability of playing h1 in Phase II.
Regarding the second claim, we first show that any vanishing approximate regret algorithm places vanishing expected probability on h1 in Phase II.
The expected loss of the algorithm in the second phase is:
Since, in Phase I, the best-case scenario for the algorithm is to always select the better of the two experts, this implies that:
On the other hand, the cumulative expected loss of the ε-inaccurate expert h2 is:
Therefore, if the algorithm places non-vanishing expected probability on h1 in Phase II, its approximate regret with respect to h2 is linear in the time horizon since:
The last inequality holds for small enough ε and large enough T.
Thus, we have shown that any algorithm with vanishing approximate regret necessarily places vanishing expected probability on h1 in Phase II.
Gap in false negative rates between groups a and b.
We now compute the expected false negative rates for the two groups, assuming the second world is reached. Since we focus on algorithms that satisfy the vanishing regret property, we have already established that:
For ease of notation, we introduce shorthands for the algorithm's cumulative probabilities on each expert in each phase. Since the group context at round t arrives independently of the history and the adversary is oblivious, these quantities are independent of the realized group contexts.
Since the algorithm is group-unaware, the expected cumulative probability that the algorithm places on each expert in Phase II is the same for both groups. Combining this with the fact that, under the online learning protocol with group context, examples of group b arrive stochastically with probability 1/2 but only receive positive labels in Phase II, we obtain:
Recall that group b receives no positive labels in Phase I, hence the false negative rate on group b is:
In order to upper bound the above false negative rate, we introduce the following good event:
By a Chernoff bound, the probability of the bad event is:
For large enough T, this probability becomes negligible.
Therefore, by first using the bound on the false negative rate under the good event together with the bound on the probability of the bad event, and then taking the limit T → ∞, it holds that:
We now move to the false negative rate of group a:
Similarly to before, a Chernoff bound on the number of positive examples of group a yields the corresponding concentration guarantee.
Recall that, in our instance, the group context is independent of the algorithm's distributions. From our previous analysis we also know that:
As a result, using the above together with Inequalities (3), we obtain:
Therefore, similarly to before, it holds that:
As a result, the difference between the false negative rate of group a and that of group b can be made arbitrarily close to the claimed gap by selecting ε to be small enough, for any vanishing approximate regret algorithm. This concludes the proof. ∎
3.2 Group-aware algorithms
We now turn our attention to group-aware algorithms, which can use the group context of an example when selecting the probability of each expert, and we provide a similar impossibility result. There are three changes compared to the impossibility result for group-unaware algorithms. First, the adversary is not oblivious but adaptive. Second, we do not have perfect balance across populations; instead, the minority population arrives with a small probability while the majority population arrives with the remaining probability. Third, the labels are not equally distributed between positive and negative for each population; instead, positive labels for one group constitute at least a small constant fraction of that group's examples. Although the upper bounds on these parameters are not optimized, our impossibility result does not extend to all parameter regimes; understanding whether one can achieve fairness in composition for some values of the group and label imbalance is an interesting open question. Our impossibility guarantee is the following:
For any sufficiently small group imbalance and label imbalance, there exists α > 0 such that any algorithm with vanishing approximate regret with respect to all experts is α-unfair in composition.
The instance has two groups {a, b}. Examples with group context a are discriminated against and arrive randomly with a small probability, while examples with group context b are boosted by the discrimination and arrive with the remaining probability. There are again two experts {h1, h2}, which output score values in [0, 1], where the score can be interpreted as the probability that the expert assigns to the label being positive in round t. We use the earlier loss function ℓ(f_t(i), y_t) = |f_t(i) − y_t|. The first expert is again pessimistic and always predicts negative, i.e. f_t(h1) = 0, while the other expert is optimistic and always predicts positive, i.e. f_t(h2) = 1. These experts trivially satisfy fairness in isolation with respect to equalized odds (both false negative and false positive rate). A small constant fraction of group a's examples are positive. The instance proceeds in two phases.
Phase I lasts for a constant fraction of the T rounds. The adversary assigns negative labels to examples with group context b. For examples with group context a, the adversary acts as follows:
if the algorithm assigns probability below a certain threshold to the negative expert, then the adversary assigns a negative label;
otherwise, the adversary assigns a positive label.
In Phase II, there are two plausible worlds:
the adversary assigns negative labels to both groups if the expected number of rounds in which the algorithm selected the negative expert with probability higher than the threshold on members of group a is below a certain bound;
otherwise, she assigns positive labels to examples with group context b and negative labels to examples with group context a.
Note that, as before, the condition for the first world is never triggered by any no-regret learning algorithm (we elaborate on this below), which pins down the label frequencies of the instance.
The proof is based on the following claims:
In Phase I, any vanishing approximate regret algorithm enters the second world of Phase II.
This implies a lower bound on the false negative rate of group a.
In Phase II, any vanishing approximate regret algorithm assigns large enough probability to the positive expert for group b. This implies an upper bound on the false negative rate of group b, and therefore a constant gap between the false negative rates of the two groups.
Proof of first claim.
To prove the first claim, we argue by contradiction: assume that the condition for the first world is triggered. This means that the algorithm faces, in expectation, a linear number of negative examples while placing probability at most the threshold on the negative expert, thereby erring on each of them with constant probability. Therefore the expected loss of the algorithm is at least:
At the same time, expert h1 makes errors only on the (few) positive examples.
Therefore, the approximate regret of the algorithm with respect to h1 is linear in the time horizon (and therefore not vanishing) since:
This violates the vanishing approximate regret property, thereby leading to contradiction.
Proof of second claim.
The second claim follows directly from the above construction, since positive examples of group a only appear in Phase I, when the probability of error on them is greater than the threshold.
Proof of third claim.
Having established that any vanishing approximate regret algorithm always enters the second world, we now focus on the expected loss of expert h2 in this case. This expert errs at most on all of Phase I and on the Phase II examples with group context a:
Since group b receives only positive examples in Phase II, the expected loss of the algorithm is at least:
We now show that the algorithm must place large probability on the positive expert for group b in Phase II. If this is not the case, then the algorithm does not have vanishing approximate regret with respect to expert h2 since:
Given the above, we now establish a gap in fairness with respect to the false negative rate. Since group a only experiences positive examples in rounds where the negative expert is offered probability higher than the threshold, this means that:
Regarding group b, we need to take into account the low-probability event that the actual realization has significantly fewer examples of group b in Phase II than expected (all of which are positive). This can be handled via Chernoff bounds similar to those in the proof of the previous theorem. As a result, the expected false negative rate of group b is: