# Review of Mathematical frameworks for Fairness in Machine Learning

We review, from a mathematical point of view, the main fairness definitions and fair learning methodologies proposed in the literature in recent years. Following an independence-based approach, we consider how to build fair algorithms and how much their performance degrades compared to the possibly unfair case. This degradation is the price for fairness under the criteria of statistical parity or equality of odds. We also present novel results giving the expressions of the optimal fair classifier and of the optimal fair predictor (under a linear Gaussian regression model) in the sense of equality of odds.

## 1 Introduction

With the introduction of new ways of storing, sharing and streaming data, and the drastic growth in the capacity of computers to handle large computations, the conception of models has changed. Mathematical models were first designed following prior ideas or conjectures from physical or biological models, then tested by designing experiments to check the validity of their inventors' ideas. A model holds until new observations make it possible to reject its assumptions. The so-called Big Data era introduced a new paradigm: the observed data are assumed to convey enough information to understand the complexity of real life, and the more data, the better the description of reality. Hence building models optimised to fit the data has become an efficient way to obtain generalizable models able to describe and forecast the real world.

In this framework, the principle of supervised machine learning is to build a decision rule that fits the data from a set of labeled examples called the learning sample. This rule becomes a model or a decision algorithm that will be used for the whole population. In certain cases, mathematical guarantees can be provided to control the generalization error of the algorithm, which corresponds to the approximation made by building the model from the observations without knowing the true model that actually generated the data set. More precisely, the data are assumed to follow an unknown distribution while only its empirical distribution is at hand, so bounds are given to measure the error made by fitting a model on such observations and still using the model for new data. Yet the underlying assumption is that the observations all follow the same distribution, which can be correctly estimated from the learning sample. Any bias present in the learning sample will be implicitly learnt and incorporated in the prediction. The danger of an uncontrolled prediction is greater when the algorithm lacks interpretability, since it then provides predictions that seem to be drawn from an accurate black box, without any control or understanding of the reasons why they were chosen.

More precisely, in a supervised setting, the aim of a machine learning algorithm is to learn the relationships between characteristic variables $X$ and a target variable $Y$ in order to forecast new observations. Set the learning sample as $(Y_1,X_1),\dots,(Y_n,X_n)$, i.i.d. observations drawn from an unknown distribution $P$. Set the empirical distribution

$$P_n=\frac1n\sum_{i=1}^n\delta_{(Y_i,X_i)}.$$

The quality of the prediction will be measured using a loss function $\ell:(Y,\hat Y)\mapsto\ell(Y,\hat Y)$ that quantifies the error made while predicting $\hat Y$ when $Y$ is observed. Then, for a given chosen class of algorithms $\mathcal F$, consider the best model that can be estimated by minimizing over $\mathcal F$ the loss function (and possibly a penalty, for example to prevent overfitting), namely

 $$\hat f_n\in\arg\min_{f\in\mathcal F}\Big\{\frac1n\sum_{i=1}^n\ell(Y_i,f(X_i))+\lambda\,\mathrm{penalty}(f)\Big\}, \tag{1.1}$$

where $\lambda$ balances the contribution of both terms to get a trade-off between the bias and the efficiency of the algorithm. The oracle rule is the best (yet unknown) rule that could be constructed if the true distribution were known

 $$f^\star\in\arg\min_{f\in\mathcal F}E_P\{\ell(Y,f(X))+\lambda\,\mathrm{penalty}(f)\}.$$

The predictions are given by $\hat Y=\hat f_n(X)$. Results from machine learning theory ensure that, for proper choices of the set of rules $\mathcal F$, the prediction error behaves close to that of the oracle, in the sense that the excess risk

 $$E_P\{\ell(Y,\hat f_n(X))\}-E_P\{\ell(Y,f^\star(X))\}$$

is small. So mathematical guarantees warrant that the optimal forecast model reproduces, for new observations, the practices learnt from the learning set. It shapes reality according to the learnt rule, without questioning it or allowing it to evolve.
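As an illustration of the penalized minimization (1.1), the following minimal sketch takes the squared loss, linear rules and a ridge penalty. These choices, the simulated data and the names `penalized_erm`, `empirical_risk` and `w_true` are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def penalized_erm(X, y, lam):
    """Minimize (1/n) * sum_i (y_i - x_i^T w)^2 + lam * ||w||^2 over linear rules."""
    n, d = X.shape
    # First-order condition: (X^T X / n + lam * I) w = X^T y / n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def empirical_risk(X, y, w):
    """Average squared loss of the linear rule w on the sample (X, y)."""
    return float(np.mean((y - X @ w) ** 2))

# Simulated data (hypothetical): linear model with Gaussian noise
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ w_true + rng.normal(scale=0.5, size=200)
X_test = rng.normal(size=(5000, 3))
y_test = X_test @ w_true + rng.normal(scale=0.5, size=5000)

w_hat = penalized_erm(X_train, y_train, lam=0.1)
w_ols = penalized_erm(X_train, y_train, lam=0.0)

# Monte Carlo estimate of the excess risk of w_hat relative to the
# (here known) rule w_true, evaluated on fresh data
excess = empirical_risk(X_test, y_test, w_hat) - empirical_risk(X_test, y_test, w_true)
print(f"estimated excess risk: {excess:.4f}")
```

Setting `lam=0.0` recovers ordinary least squares, and increasing `lam` shrinks the fitted coefficients, which is the trade-off controlled by $\lambda$.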

## 2 A definition of fairness in machine learning as independence criterion

### 2.1 Definition of full fairness

There is no doubt that machine learning is a powerful tool that is improving human life and has shown great promise in the development of very different technological applications, including powering self-driving cars, accurately recognizing cancer in radiographs, or predicting our interests based upon past behavior, to name just a few. Yet along with its benefits, machine learning also raises delicate issues such as the presence of bias in model classifications and predictions. Hence, as predictive algorithms spread across a wide variety of fields, algorithmic fairness is gaining more and more attention, not only in the research community but also among the general population, whose daily life and activity are greatly affected. As a result, there has been a push for different approaches for assessing the presence of bias in machine learning algorithms in recent years. Similarly, various classifications have been proposed to understand the different sources of data bias. We refer to  for a recent review.

Consider the probability space $(\Omega,\mathcal B,\mathbb P)$, with $\mathcal B$ the Borel $\sigma$-algebra of subsets of $\mathbb R^d$ and $d\geq1$. We will assume in the following that the bias is modeled by the random variable $S\in\mathcal S$ that represents information about the observations $X$ that should not be included in the model for the prediction of the target $Y$. In the fair learning literature, the variable $S$ is referred to as the protected or sensitive attribute. We assume moreover that this variable is observed. Most fairness theory has been developed in the particular case when $\mathcal S=\{0,1\}$ and $S$ is a sensitive binary variable. In other words, the population is supposed to be possibly divided into two categories, taking the value $S=0$ for the minority (assumed to be the unfavored class), and $S=1$ for the default (and usually favored) class. Hence, we also study this case in more depth, which will be indicated where relevant in the rest of the chapter, but in principle we consider general $\mathcal S$. From a mathematical point of view, we follow the recent paper  that proposed the two following models, which aim at understanding how this bias could be introduced in the algorithms:

1. The first model corresponds to the case where the data $X$ are subject to a bias nuisance variable $S$ which, in principle, is assumed not to be involved in the learning task, and whose influence in the prediction should be removed. We refer here to the well-known example of the dog vs. wolf in , where the input data were images highly biased by the presence of background snow in the pictures of wolves, and the absence of it in those of dogs. As shown in Figure 3, this situation appears when the attributes $X$ are a biased version of unobserved fair attributes and the target variable $Y$ depends only on these fair attributes. In this framework, learning from $X$ induces biases while fairness requires:

Note that either nor is independent of the protected .

2. The second model corresponds to the situation where a biased decision is observed as the result of a fair score which has been biased by the practices giving rise to the target $Y$. Thus, a fair model in this case will change the predictions in order to make them independent of the protected variable. This is represented in Figure 3 and, formally, it is required that

where is not observed. Note that previous conditions do not imply the independence between and (even conditionally to ).

In the statistical literature, an algorithm is called fair or unbiased when its outcome does not depend on the sensitive variable. The notion of perfect fairness requires that the protected variable $S$ does not play any role in the forecast $\hat Y$ of the target $Y$. In other words, we will be looking at the independence between the protected variable $S$ and the outcome $\hat Y$, either unconditionally or conditionally on the true value of the target $Y$. These two notions of fairness are known in the literature as:

• Statistical parity (S.P.) deals with the independence between the outcome of the algorithm and the sensitive attribute

 $$\hat Y\perp\!\!\!\perp S \tag{2.1}$$
• Equality of odds (E.O.) considers the independence between the protected attribute and the outcome conditionally given the true value of the target

 $$\hat Y\perp\!\!\!\perp S\mid Y \tag{2.2}$$

Hence, a perfectly fair model should be chosen within a class satisfying one of the restrictions (2.1)-(2.2). Observe that the appropriate notion of fairness depends on the model assumed for the introduction of the bias in the algorithm: while statistical parity is suitable for model 1, equality of odds is suitable for model 2, and is especially well-suited for scenarios where ground truth is available for the historical decisions used during the training phase.

In this work, we tackle only these two main notions of fairness developed among the machine learning community. There are other definitions such as avoiding disparate treatment or predictive parity, defined respectively as $\hat Y\perp\!\!\!\perp S\mid X$ or $Y\perp\!\!\!\perp S\mid\hat Y$. A decision making system suffers from disparate treatment if it provides different outcomes for different groups of people with the same (or similar) values of non-sensitive features but different values of sensitive features. In other words, (partly) basing the decision outcomes on the sensitive feature value amounts to disparate treatment. Technically, the disparate treatment doctrine tries to counter explicit as well as intentional discrimination. It follows from the specification of disparate treatment that a decision maker with an intent to discriminate could try to disadvantage a group with a certain sensitive feature value (e.g., a specific race group) not by explicitly using the sensitive feature itself, but by intentionally basing decisions on a correlated feature (e.g., the non-sensitive feature location might be correlated with the sensitive feature race). This practice is often referred to as redlining in US anti-discrimination law and also qualifies as disparate treatment. However, such hidden intentional disparate treatment may be hard to detect, and some authors argue that statistical parity might be a more suitable framework for detecting such covert discrimination, while others focus only on explicit disparate treatment. For further details, we refer to the comprehensive study of fairness in machine learning given in .

The description of the metrics given above applies in a general context, yet all four fairness measures were originally proposed within the binary classification framework. Hence the corresponding references and equivalent names are presented in the following subsection, specifically for this context.

### 2.2 The special case of classification

Fairness has been widely studied in the binary classification setting. Here the problem consists in forecasting a binary variable $Y\in\{0,1\}$ using observed covariates $X$. We also introduce a notion of positive prediction: the value $1$ represents a success while $0$ is a failure. We refer to  for a complete description of classification problems in statistical learning. In this framework, the two main algorithmic fairness metrics are specified as follows.

• Statistical parity. Despite the early uses of this notion through the so-called $4/5$-rule for fair classification purposes by the State of California Fair Employment Practice Commission (FEPC) in 1971, it was first formally introduced as statistical parity in  in the particular case when $S$ is also binary. Since then it has received several other names in the fair learning literature. For instance, it has been equivalently named demographic parity or group fairness in the same introductory work, and elsewhere equal acceptance rate  or benchmarking . Formally, when $S$ is binary, this definition of fairness is satisfied when both subgroups are equally likely to have a successful outcome

 $$P(\hat Y=1\mid S=0)=P(\hat Y=1\mid S=1), \tag{2.3}$$

which can be extended to $P(\hat Y=1\mid S)=P(\hat Y=1)$ for general $S$, continuous or discrete. A related and more rigid measure is called avoiding disparate treatment in . It holds if the probability that the classifier outputs a specific value of the forecast given a feature vector does not change after observing the sensitive feature, namely

 $$P(\hat Y=\hat y\mid X=x,S=s)=P(\hat Y=\hat y\mid X=x).$$

• Equality of odds (or equalized odds) looks for the independence between the error of the algorithm and the protected variable. Hence, in practice, when $S$ is also binary it compares the error rates of the algorithmic decisions between the different groups of the population, and considers that a classifier is fair when both classes have equal False and True Positive Rates

 $$P(\hat Y=1\mid Y=i,S=0)=P(\hat Y=1\mid Y=i,S=1),\quad\text{for } i=0,1. \tag{2.4}$$

For general $S$, we note that this condition is equivalent to

 $$P(\hat Y=1\mid Y=i,S)=P(\hat Y=1\mid Y=i),\quad\text{for } i=0,1. \tag{2.5}$$

This second point of view was introduced in  and has been originally proposed for recidivism of defendants in . Over the last few years it has been given several names, including error rate balance in  or conditional procedure accuracy equality in .

Many other metrics have received significant recent attention in the classification literature. In this setting, the disparate treatment criterion cited above, also referred to as direct discrimination , looks at the equality, for all $x$,

 $$P(\hat Y=1\mid X=x,S=0)=P(\hat Y=1\mid X=x,S=1). \tag{2.6}$$

Furthermore, we note that equality of opportunity ( or ) and avoiding disparate mistreatment  are two metrics related to the previous equalized odds, yet weaker. The first one requires only the equality of true positive rates, that is, the case $i=1$ in (2.4), while the second looks at the equality of misclassification error rates across the groups:

 $$P(\hat Y\neq Y\mid S=0)=P(\hat Y\neq Y\mid S=1). \tag{2.7}$$

Thus, equality of odds implies both the lack of disparate mistreatment and equality of opportunity, but not vice versa. Finally, we also mention here predictive parity, which was introduced in . It requires the equality of positive predictive values across both groups. Therefore, mathematically it is satisfied when

 $$P(Y=1\mid\hat Y=1,S=0)=P(Y=1\mid\hat Y=1,S=1). \tag{2.8}$$
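The binary criteria (2.3), (2.4), (2.7) and (2.8) can all be estimated from a labeled sample by comparing empirical conditional frequencies between the two groups. The sketch below is one possible way to do so; the helper names `rate` and `fairness_gaps` and the toy arrays are hypothetical and chosen only for illustration.

```python
import numpy as np

def rate(event, given):
    """Empirical P(event | given); NaN when the conditioning set is empty."""
    event = np.asarray(event, dtype=bool)
    given = np.asarray(given, dtype=bool)
    return float(np.mean(event[given])) if given.any() else float("nan")

def fairness_gaps(y, y_hat, s):
    """Differences (group S=0 minus group S=1) for several binary criteria."""
    y, y_hat, s = (np.asarray(a) for a in (y, y_hat, s))
    g0, g1 = (s == 0), (s == 1)
    pos = (y_hat == 1)
    err = (y_hat != y)
    return {
        "statistical_parity": rate(pos, g0) - rate(pos, g1),                  # (2.3)
        "true_positive_rate": rate(pos, g0 & (y == 1)) - rate(pos, g1 & (y == 1)),   # (2.4), i=1
        "false_positive_rate": rate(pos, g0 & (y == 0)) - rate(pos, g1 & (y == 0)),  # (2.4), i=0
        "misclassification": rate(err, g0) - rate(err, g1),                  # (2.7)
        "predictive_parity": rate(y == 1, g0 & pos) - rate(y == 1, g1 & pos),  # (2.8)
    }

# Toy sample (hypothetical): four individuals per group
s     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y     = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_hat = np.array([1, 0, 1, 0, 1, 1, 0, 0])

gaps = fairness_gaps(y, y_hat, s)
for name, value in gaps.items():
    print(f"{name}: {value:+.2f}")
```

In this toy sample both groups receive a positive prediction half of the time, so the statistical parity gap is zero, while the error-rate gaps are not: a criterion is met when its gap is zero, and satisfying one criterion says nothing about the others.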

The fairness metrics defined above are evaluated only for binary predictions and outcomes. By contrast, we can find also in the literature a set of metrics involving explicit generation of a continuous-valued score denoted here by . Although scores could be used directly, they can alternatively serve as the input to a thresholding function that outputs a binary prediction.

Among this set, we highlight the notion of test-fairness, which extends predictive parity (2.8) when the prediction is a score. An algorithm satisfies this kind of fairness (or it is said to be calibrated) if for all scores , the individuals who have the same score have the same probability of belonging to the positive class, regardless of group membership. Formally, this is expressed as for all scores . This criterion was introduced in  and has also been termed matching conditional frequencies by .

A related metric called well-calibration  or calibration within groups  imposes an additional and more stringent condition: a model is well-calibrated if individuals assigned a given score have a probability of belonging to the positive class exactly equal to that score. If this condition is satisfied, then test-fairness also holds automatically, though not vice versa. Indeed, we note that the scores of a calibrated predictor can be transformed into scores satisfying well-calibration.

Finally, balance for positive/negative class was introduced in  as a generalization of the notion of equality of odds. Mathematically, this balance is expressed through the equalities of expected values .

### 2.3 Relationships between fairness criteria

It is also important to note that the wide variety of proposed criteria formalizing different notions of fairness (see reviews  and  for more details) has sometimes led to incompatible formulations. The conditions under which more than one metric can be simultaneously satisfied and, relatedly, the ways in which different metrics might be in tension have been studied in several works [20, 59, 11]. Indeed, in the following Propositions 2.1, 2.2 and 2.3 we revisit three impossibility theorems of fairness stating the mutual exclusivity, except in degenerate cases, of the three main criteria considered in fair learning.

We study first the combination of all three of these metrics and then explore conditions under which it may be possible to simultaneously satisfy two metrics. To begin with, it is interesting to note that from the definition of conditional probability, the respective probability distributions associated with each of these three fairness metrics can be expressed as follows:

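 $$\begin{aligned}\mathcal L(Y,\hat Y\mid S)&=\mathcal L(Y\mid\hat Y,S)\times\mathcal L(\hat Y\mid S)&&\text{(2.9)}\\&=\mathcal L(\hat Y\mid Y,S)\times\mathcal L(Y\mid S).&&\text{(2.10)}\end{aligned}$$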

We observe that on the right-hand side of equality (2.9) the first factor refers to predictive parity, while the second one to statistical parity. Similarly, in the equality (2.10) the first term represents equality of odds while the second one the base rate, that is the distribution of the true target among each group.

While the three results on fairness incompatibilities are stated hereafter in a general learning setting and their proofs are gathered in Appendix 6.1, in this section we present a discussion in the binary classification framework. Let us then consider the following notations for $s\in\{0,1\}$:

• the group-specific true positive rates $TPR_s=P(\hat Y=1\mid Y=1,S=s)$

• the group-specific false positive rates $FPR_s=P(\hat Y=1\mid Y=0,S=s)$

• the group-specific positive predictive values $PPV_s=P(Y=1\mid\hat Y=1,S=s)$

We consider first if a predictor can simultaneously satisfy equalized odds and statistical parity.

###### Proposition 2.1 (Statistical parity vs. Equality of odds)

If $S$ is dependent on $Y$ and $\hat Y$ is dependent on $Y$, then statistical parity and equality of odds cannot both hold.

In the special case of binary classification the result can be sharpened as follows. Observe that we can write, for $s\in\{0,1\}$,

 $$P(\hat Y=1\mid S=s)=P(Y=1\mid S=s)\,TPR_s+P(Y=0\mid S=s)\,FPR_s. \tag{2.11}$$

Then computing the difference between expression (2.11) for each class and assuming that equalized odds holds, namely

 $$TPR_0=TPR_1=P(\hat Y=1\mid Y=1)\quad\text{and}\quad FPR_0=FPR_1=P(\hat Y=1\mid Y=0),$$

we obtain

 $$\begin{aligned}P(\hat Y=1\mid S=0)-P(\hat Y=1\mid S=1)&=\big(P(Y=1\mid S=0)-P(Y=1\mid S=1)\big)\,P(\hat Y=1\mid Y=1)\\&\quad+\big(P(Y=0\mid S=0)-P(Y=0\mid S=1)\big)\,P(\hat Y=1\mid Y=0)\\&=\big(P(Y=1\mid S=0)-P(Y=1\mid S=1)\big)\big(P(\hat Y=1\mid Y=1)-P(\hat Y=1\mid Y=0)\big).\end{aligned}$$

Statistical parity requires the left-hand side to be exactly zero. Hence the right-hand side must also be zero, so necessarily $P(Y=1\mid S=0)=P(Y=1\mid S=1)$ or $P(\hat Y=1\mid Y=1)=P(\hat Y=1\mid Y=0)$. However, it is usually assumed that base rates differ across the groups, where the base rate of a group is the ratio of the number of people in the group who belong to the positive class ($Y=1$) to the total number of people in that group. Thus, statistical parity and equalized odds are simultaneously achieved only if true and false positive rates are equal. While this is mathematically possible, such a condition is not particularly useful since the goal is typically to develop a predictor in which the true positive rate is significantly higher than the false positive rate.
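The identity obtained from (2.11) under equalized odds can be checked numerically. The short sketch below does so for one arbitrary choice of base rates and error rates; the function names and the numerical values are hypothetical.

```python
import math

def positive_rate(base_rate, tpr, fpr):
    """Equation (2.11): P(Yhat=1 | S=s) from the group base rate and error rates."""
    return base_rate * tpr + (1.0 - base_rate) * fpr

def parity_gap_under_equalized_odds(p0, p1, tpr, fpr):
    """Both sides of the identity when TPR and FPR are shared by the two groups."""
    lhs = positive_rate(p0, tpr, fpr) - positive_rate(p1, tpr, fpr)
    rhs = (p0 - p1) * (tpr - fpr)
    return lhs, rhs

# Hypothetical values: base rates 0.3 and 0.5, common TPR 0.8 and FPR 0.2
lhs, rhs = parity_gap_under_equalized_odds(p0=0.3, p1=0.5, tpr=0.8, fpr=0.2)
print(lhs, rhs)

# With equal true and false positive rates the gap vanishes even though
# the base rates differ, which is the degenerate case discussed in the text
lhs_deg, _ = parity_gap_under_equalized_odds(p0=0.3, p1=0.5, tpr=0.4, fpr=0.4)
```

With different base rates and an informative classifier (true positive rate above false positive rate), the gap in positive rates is nonzero, so statistical parity fails.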

###### Proposition 2.2 (Statistical parity vs. Predictive parity)

If $S$ is dependent on $Y$, then statistical parity and predictive parity cannot both hold.

By contrast, in the binary classification setup the two fairness metrics are actually simultaneously feasible. Assume that statistical parity holds, that is, $P(\hat Y=1\mid S=0)=P(\hat Y=1\mid S=1)=P(\hat Y=1)$. Then, from equations (2.9)-(2.10) we can write the difference of positive predictive values

 $$PPV_0-PPV_1=\frac{TPR_0\,P(Y=1\mid S=0)-TPR_1\,P(Y=1\mid S=1)}{P(\hat Y=1)}. \tag{2.12}$$

Under predictive parity the left side of the above equation must be zero, which in turn requires that the ratio of the true positive rates of the two groups be the reciprocal of the ratio of the base rates, namely

 $$\frac{TPR_0}{TPR_1}=\frac{P(Y=1\mid S=1)}{P(Y=1\mid S=0)}. \tag{2.13}$$

Thus, while statistical and predictive parity can be simultaneously satisfied even with different base rates, the utility of such a predictor is limited when the ratio of the base rates differs significantly from 1, as this forces the true positive rate for one of the groups to be very low.

###### Proposition 2.3 (Predictive parity vs. Equality of odds)

If $S$ is dependent on $Y$, then predictive parity and equality of odds cannot both hold.

We explore this incompatibility in more detail in the binary classification framework. Assume that both conditions hold, that is,

 $$TPR_0=TPR_1,\quad FPR_0=FPR_1,\quad\text{and}\quad PPV_0=PPV_1. \tag{2.14}$$

Then we can write

 $$P(\hat Y=1\mid S)=\sum_{i=0,1}P(\hat Y=1\mid Y=i,S)\,P(Y=i\mid S)=TPR_0\,P(Y=1\mid S)+FPR_0\,P(Y=0\mid S).$$

This together with equations (2.9)-(2.10) implies

 $$P(\hat Y=1\mid Y=1,S=0)\,P(Y=1\mid S=0)=P(Y=1\mid\hat Y=1,S=0)\big[TPR_0\,P(Y=1\mid S=0)+FPR_0\,P(Y=0\mid S=0)\big],$$

and using the notations above we obtain

 $$TPR_0\,P(Y=1\mid S=0)=PPV_0\big[TPR_0\,P(Y=1\mid S=0)+FPR_0\,\big(1-P(Y=1\mid S=0)\big)\big].$$

Finally, we obtain the following expression for the group-specific base rate for $S=0$

 $$P(Y=1\mid S=0)=\frac{PPV_0\,FPR_0}{PPV_0\,FPR_0+(1-PPV_0)\,TPR_0} \tag{2.15}$$

and, reasoning likewise, for $S=1$

 $$P(Y=1\mid S=1)=\frac{PPV_1\,FPR_1}{PPV_1\,FPR_1+(1-PPV_1)\,TPR_1}. \tag{2.16}$$

Hence, in the absence of perfect prediction, under assumption (2.14) base rates have to be equal for both equalized odds and predictive parity to hold simultaneously. When perfect prediction is achieved, equations (2.15) and (2.16) take on the indeterminate form $0/0$ and therefore do not convey anything definitive about base rates in that scenario.
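Expressions (2.15) and (2.16) can be checked by a round trip: starting from a base rate and error rates, compute the positive predictive value by Bayes' rule, then recover the base rate. The sketch below uses hypothetical function names and values, and returns NaN in the perfect-prediction case where the formula is indeterminate.

```python
import math

def ppv_from_rates(base_rate, tpr, fpr):
    """Bayes' rule: P(Y=1 | Yhat=1) from the base rate and the error rates."""
    return tpr * base_rate / (tpr * base_rate + fpr * (1.0 - base_rate))

def base_rate_from_metrics(ppv, tpr, fpr):
    """Equation (2.15)/(2.16); NaN when the expression is the indeterminate 0/0."""
    denominator = ppv * fpr + (1.0 - ppv) * tpr
    if denominator == 0.0:
        return float("nan")
    return ppv * fpr / denominator

# Round trip with hypothetical values
ppv = ppv_from_rates(base_rate=0.3, tpr=0.8, fpr=0.2)
recovered = base_rate_from_metrics(ppv, tpr=0.8, fpr=0.2)
print(ppv, recovered)

# Perfect prediction: TPR = 1, FPR = 0, PPV = 1 gives 0/0
undetermined = base_rate_from_metrics(ppv=1.0, tpr=1.0, fpr=0.0)
```

Since the base rate is a function of $(PPV_s, TPR_s, FPR_s)$ alone, two groups sharing these three quantities necessarily share the same base rate, which is the content of the incompatibility.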

We also note that the less strict metric, equal opportunity (recall that it requires only equal TPR across groups), is compatible with predictive parity. This is evident from equations (2.15) and (2.16) when the condition $FPR_0=FPR_1$ is removed, thereby allowing equal opportunity and predictive parity to be simultaneously satisfied even with unequal base rates. However, achieving this with unequal base rates requires that the FPR differ across the groups. When the difference between the base rates is large, the variation between group-specific FPRs may have to be significant, which may reduce suitability for some applications. Hence, while equal opportunity and predictive parity are compatible in the presence of unequal base rates, practitioners should consider the cost (in terms of FPR difference) before attempting to achieve both simultaneously. A similar analysis is possible when considering parity in negative predictive value (NPV) instead of positive predictive value, i.e. equal opportunity and parity in NPV are compatible, but only at the cost of variation between group-specific true negative rates (TNRs).

## 3 Price for fairness in machine learning

In this section, we consider how to build fair algorithms and the consequences on the degradation of their performance compared to the possibly unfair case. This corresponds to the price for fairness. Recall that the performance of an algorithm is measured through its risk defined by

 $$R(f)=E\big(\ell(Y,f(X,S))\big).$$

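Define the following restrictions of a class $\mathcal F$ to the rules satisfying (2.1) and (2.2), respectively:

 $$\mathcal F_{SP}:=\{f\in\mathcal F:\ f(X,S)\perp\!\!\!\perp S\} \tag{3.1}$$

 $$\mathcal F_{EO}:=\{f\in\mathcal F:\ f(X,S)\perp\!\!\!\perp S\mid Y\} \tag{3.2}$$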

From a theoretical point of view, a fair model can be achieved by restricting the minimization (1.1) to such classes. The price for fairness is

 $$\mathcal E_{Fair}(\mathcal F):=\inf_{f\in\mathcal F_{Fair}}R(f)-\inf_{f\in\mathcal F}R(f). \tag{3.3}$$

If $\mathcal F$ denotes the class of all measurable functions, then $\inf_{f\in\mathcal F}R(f)$ is known as the Bayes risk. In the following, we will study the difference of the minimal risks in (3.3) under both fairness assumptions and in two different frameworks: regression and classification.

To address this issue, we will consider the Wasserstein (a.k.a. Monge-Kantorovich) distance between distributions. The Wasserstein distance is an appropriate tool for comparing probability distributions and arises naturally from the optimal transport problem (we refer to  for a detailed description). For $P$ and $Q$ two probability measures on $\mathbb R^d$, the squared Wasserstein distance between $P$ and $Q$ is defined as

 $$W_2^2(P,Q):=\min_{\pi\in\Pi(P,Q)}\int\|x-y\|^2\,d\pi(x,y)$$

where $\Pi(P,Q)$ is the set of probability measures on $\mathbb R^d\times\mathbb R^d$ with marginals $P$ and $Q$.
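On the real line, and for two empirical distributions with the same number of equally weighted points, the optimal coupling pairs the sorted samples, so the squared Wasserstein distance reduces to a mean of squared differences between order statistics. The sketch below covers only this special case; the function name `w2_squared_1d` is hypothetical.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two empirical measures on R
    with the same number of equally weighted atoms (monotone coupling)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    return float(np.mean((x - y) ** 2))

# A pure translation by 3 costs 3^2 = 9, whatever the order of the points
d_shift = w2_squared_1d([0.0, 1.0, 2.0], [5.0, 3.0, 4.0])
d_self = w2_squared_1d([2.0, 0.0, 1.0], [0.0, 1.0, 2.0])
print(d_shift, d_self)
```

In higher dimension, or with unequal weights, the minimization over couplings is a genuine linear program and this shortcut no longer applies.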

### 3.1 Price for fairness as Statistical Parity

The notion of perfect fairness given by the statistical parity criterion implies that the distribution of the predictor $\hat Y=f(X,S)$ does not depend on the protected variable $S$.

#### 3.1.1 Regression

In the regression problem, the statistical parity condition is expressed through the equality of distributions $\mathcal L(f(X,S)\mid S)=\mathcal L(f(X,S))$. In this setting, a standard definition of this statistical independence requires that $P(f(X,S)\in A\mid S=s)=P(f(X,S)\in A)$ for all $s\in\mathcal S$ and all measurable sets $A$. Since $f(X,S)$ is a real-valued random variable under the Borel $\sigma$-algebra, it is fully characterized by its cumulative distribution function, and so it suffices to consider sets of the form $A=(-\infty,a]$, $a\in\mathbb R$.

This fairness assumption implies the weaker condition $E(f(X,S)\mid S)=E(f(X,S))$ as presented in [29, 97], or equivalently when . Note that in the case where $S$ is a discrete variable, the previous criteria have a simpler expression. In particular, in the binary setup when $S\in\{0,1\}$, we can write

 $$\begin{aligned}E_{X,S}(f(X,S))&=E_S\big[E_X[f(X,S)\mid S]\big]\\&=P(S=0)\,E_X(f(X,S)\mid S=0)+P(S=1)\,E_X(f(X,S)\mid S=1).\end{aligned}$$

On the other hand, the definition of conditional expectation gives

 $$E_{X,S}(f(X,S)\mid S)=\frac{E_{X,S}[S\,f(X,S)]}{E(S)}=\frac{P(S=1)\,E_X(f(X,S)\mid S=1)}{P(S=1)}=E_X(f(X,S)\mid S=1).$$

From both equalities above we have that statistical parity holds if and only if

 $$P(S=0)\,E_X(f(X,S)\mid S=0)+P(S=1)\,E_X(f(X,S)\mid S=1)=E_X(f(X,S)\mid S=1),$$

which, if $P(S=0)\neq0$, reduces to

 $$E_X(f(X,S)\mid S=0)=E_X(f(X,S)\mid S=1).$$

In the general regression setting, we will use the following notations : , , When $\mathcal F$ is the set of all measurable functions from to , the optimal risk (a.k.a. Bayes risk), defined as

 $$R^\star:=R(\mathcal F)=\min_{f\in\mathcal F}E\|Y-f(X,S)\|^2,$$

is achieved by the Bayes estimator

 $$\eta(X,S):=E[Y\mid(X,S)].$$

Denote by $\mu_S$ the conditional distribution of the Bayes estimator given $S$ and, for a predictor $g$, by $\nu_S(g)$ the conditional distribution of $g(X,S)$ given $S$. In  the authors relate the excess risk to a minimization problem in the Wasserstein space, proving the following lower bound for the price for fairness.

###### Theorem 3.1
 $$\inf_{f\in\mathcal F_{Fair}}R(f)-\inf_fR(f)\geq\inf_{g\in\mathcal F}E\,W_2^2(\mu_S,\nu_S(g)). \tag{3.4}$$

Moreover, if and has density w.r.t. Lebesgue measure for almost every , then (3.4) becomes an equality

 $$\mathcal E_{Fair}(\mathcal F)=\inf_{g\in\mathcal F}E_S\,W_2^2(\mu_S,\nu_S(g)). \tag{3.5}$$

Imposing fairness thus comes at a quantifiable price, which depends on the 2-Wasserstein distance between the distributions of the Bayes predictors.

Finding the minimum in (3.5) is related to the minimization of the Wasserstein variation, known as the Wasserstein barycenter problem. Actually, for the Statistical Parity constraint,

 $$\inf_{g\in\mathcal F}E_S\,W_2^2(\mu_S,\nu_S(g))=\inf_{\nu(g)}E_S\,W_2^2(\mu_S,\nu(g))$$

which amounts to minimizing

 $$\nu\mapsto E_S\,W_2^2(\mu_S,\nu).$$

This problem has been studied in ,  or . The distributions are random distributions and define their distribution on the set of distributions. Hence The minimum is reached for the Wasserstein barycenter of . Note that if is discrete, in particular for the two class version , note , the distribution can be written as . Hence its barycenter is a measure that minimizes the functional

 $$\nu\mapsto\pi_0\,W_2^2(\mu_0,\nu)+(1-\pi_0)\,W_2^2(\mu_1,\nu).$$

Existence and uniqueness are ensured as soon as the $\mu_s$ have densities with respect to the Lebesgue measure.
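For distributions on the real line, the Wasserstein barycenter of two measures is obtained by averaging their quantile functions with the weights $\pi_0$ and $1-\pi_0$. The sketch below illustrates this for two empirical samples of equal size, a simplifying assumption; the function names and the simulated samples are hypothetical. It also checks that the minimal value of the functional equals $\pi_0(1-\pi_0)W_2^2(\mu_0,\mu_1)$, a known identity for two measures.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between equal-size empirical measures on R."""
    return float(np.mean((np.sort(x) - np.sort(y)) ** 2))

def barycenter_1d(x0, x1, pi0):
    """Atoms of the W2 barycenter with weights (pi0, 1 - pi0): average of quantiles."""
    return pi0 * np.sort(x0) + (1.0 - pi0) * np.sort(x1)

# Hypothetical conditional distributions of the Bayes predictor in each group
rng = np.random.default_rng(1)
mu0 = rng.normal(loc=0.0, scale=1.0, size=1000)
mu1 = rng.normal(loc=3.0, scale=2.0, size=1000)
pi0 = 0.7

nu = barycenter_1d(mu0, mu1, pi0)
cost = pi0 * w2_squared_1d(mu0, nu) + (1.0 - pi0) * w2_squared_1d(mu1, nu)
closed_form = pi0 * (1.0 - pi0) * w2_squared_1d(mu0, mu1)
print(cost, closed_form)
```

The barycenter sits closer to the group with the larger weight, so the majority group's predictions are moved less than the minority group's.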

#### 3.1.2 Classification

We consider the problem of quantifying the price for imposing statistical parity when the goal is predicting a label. In the following and without loss of generality, we assume that $Y$ is a binary variable with values in $\{0,1\}$. If $S$ is also binary, then Statistical Parity is often quantified in the fair learning literature using the so-called Disparate Impact (DI)

 $$DI(g,X,S)=\frac{P(g(X,S)=1\mid S=0)}{P(g(X,S)=1\mid S=1)}. \tag{3.6}$$

This measures the risk of discrimination when using the decision rule encoded in $g$ on data following the same distribution as in the test set. Hence, in  a classifier $g$ is said not to have a Disparate Impact at level $\tau$ when $DI(g,X,S)>\tau$. Perfect fairness is thus equivalent to the assumption that the disparate impact is exactly $1$. Note that the notion of DI defined in Eq. (3.6) was first introduced as the $4/5$-rule by the State of California Fair Employment Practice Commission (FEPC) in 1971. Since then, the threshold $0.8$ has been chosen in different trials as a legal score to judge whether the discriminations committed by an algorithm are acceptable or not (see e.g.  , or ).
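The Disparate Impact (3.6) can be estimated from predictions and group labels by a ratio of empirical positive rates. In the sketch below, the function name, the toy arrays and the threshold value 0.8 (the four-fifths rule) are illustrative assumptions.

```python
import numpy as np

def disparate_impact(y_hat, s):
    """Empirical version of (3.6): positive rate in group S=0 over group S=1."""
    y_hat, s = np.asarray(y_hat), np.asarray(s)
    a = float(np.mean(y_hat[s == 0] == 1))  # P(g = 1 | S = 0)
    b = float(np.mean(y_hat[s == 1] == 1))  # P(g = 1 | S = 1)
    if b == 0.0:
        raise ValueError("no positive prediction in group S=1: DI is undefined")
    return a / b

# Toy predictions (hypothetical): 1 success out of 4 in group 0, 2 out of 4 in group 1
y_hat = np.array([1, 0, 0, 0, 1, 1, 0, 0])
s     = np.array([0, 0, 0, 0, 1, 1, 1, 1])

di = disparate_impact(y_hat, s)
passes_four_fifths = di > 0.8
print(di, passes_four_fifths)
```

Here the minority group is selected half as often as the majority group, so the ratio falls below the 0.8 threshold.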

While in the classification problem the notion of statistical parity can be easily extended to general $S$, continuous or discrete, through the equality $P(g(X,S)=1\mid S)=P(g(X,S)=1)$, the Disparate Impact index has not been used in the literature for quantifying fairness in the general framework. Hence, we only consider the classification problem. Still, if $S$ is a multiclass sensitive variable, we observe that a fair classifier should satisfy, for all $s\in\mathcal S$,

 $$P(g(X,S)=1)=P(g(X,S)=1\mid S=s). \tag{3.7}$$

Hence, Disparate Impact could be extended to

 $$DI(g,X,S)=\min_{s\in\mathcal S}\frac{P(g(X,S)=1\mid S=s)}{P(g(X,S)=1\mid S=1)}. \tag{3.8}$$

Computing a bound for (3.3) is a difficult task and has been studied by several authors. In this specific framework, finding a lower bound for the loss of accuracy induced by the full statistical parity constraint remains an open problem. This is mainly due to the fact that the classification setting does not specify a model constraining the relationship between the labels $Y$ and the observations $X$, which allows too large a choice of models, contrary to the regression case.

Yet in different frameworks, some results can be proved. On the one hand, in  a notion of fairness is considered which corresponds to controlling the number of class changes when switching labels, which amounts to studying the difference between classification errors for plug-in rules corresponding to all possible thresholds of the Bayes score, called the model belief, . The authors achieve a bound using the distance and prove that the minimum loss is achieved for the 1-Wasserstein barycenter.

In the following we recall results obtained in  which study the price for fairness in statistical parity in the framework where we want to ensure that all classifiers trained by a transformation of the data will be fair with respect to the statistical parity definition.

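For this, consider the Balanced Error Rate

 $$BER(g,X,S)=\frac{P(g(X,S)=0\mid S=1)+P(g(X,S)=1\mid S=0)}{2}$$

corresponding to the problem of estimating the sensitive label $S$ from the prediction $g(X,S)$ in the most difficult case where the classes are well balanced between the groups labeled by the variable $S$. In this setting, unpredictability of the label warrants the fairness of the procedure. Actually, given $\varepsilon$, $S$ is not $\varepsilon$-predictable from $X$ if $BER(g,X,S)>\varepsilon$ for all $g$. Writing $a(g)=P(g(X,S)=1\mid S=0)$ and $b(g)=P(g(X,S)=1\mid S=1)$, the Disparate Impact (3.6) reads

 $$DI(g,X,S):=\frac{a(g)}{b(g)}.$$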

We consider classifiers $g$ such that $a(g)>0$ and $b(g)>0$.

###### Theorem 3.2 (Link between Disparate Impact and Predictability)

Given random variables , the classifier has Disparate Impact at level , with respect to , if, and only if, is predictable from .

Then, we can see that the notion of predictability and the distance in Total Variation between the conditional distributions of are connected through the following theorem

###### Theorem 3.3 (Total Variation distance)

Given the variables $X$ and $S$,

 $$\min_{g\in\mathcal F}BER(g,X,S)=\frac12\Big(1-d_{TV}\big(\mathcal L(X\mid S=0),\mathcal L(X\mid S=1)\big)\Big),$$

where $g$ varies in the family $\mathcal F$ of binary classifiers.

In particular, $S$ is not $\varepsilon$-predictable from $X$ if

 $$d_{TV}\big(\mathcal L(X\mid S=0),\mathcal L(X\mid S=1)\big)<1-2\varepsilon$$

where $d_{TV}$ is the Total Variation distance. Hence fairness for all classifiers is equivalent to the fact that

 $$\min_{g\in\mathcal F}BER(g,X,S)=\frac12$$

which is equivalent to

 $$d_{TV}(\mu_0,\mu_1)=0,$$

where we have set $\mu_s:=\mathcal L(X\mid S=s)$ for $s\in\{0,1\}$. Hence, perfect fairness for all classifiers in classification is equivalent to the fact that the distance between the conditional distributions of the characteristics of individuals, for the classes defined by the different values of $S$, is null.
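The link between the minimal Balanced Error Rate and the Total Variation distance can be checked by brute force when $X$ takes finitely many values. In the sketch below, the two conditional distributions `p0` and `p1` are hypothetical, and classifiers are taken to be functions of $X$ alone, encoded as 0/1 arrays over its support.

```python
import itertools
import numpy as np

def total_variation(p0, p1):
    """Total Variation distance between two distributions on a finite set."""
    return 0.5 * float(np.sum(np.abs(p0 - p1)))

def balanced_error_rate(g, p0, p1):
    """BER of a classifier g (0/1 array over the support of X) predicting S from X."""
    g = np.asarray(g)
    miss_group_1 = float(np.sum(p1[g == 0]))   # P(g(X) = 0 | S = 1)
    false_group_0 = float(np.sum(p0[g == 1]))  # P(g(X) = 1 | S = 0)
    return 0.5 * (miss_group_1 + false_group_0)

# Hypothetical conditional distributions of X given S=0 and S=1 on three points
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.3, 0.5])

# Exhaustive search over the 2^3 binary classifiers of X
min_ber = min(balanced_error_rate(g, p0, p1)
              for g in itertools.product([0, 1], repeat=len(p0)))
d_tv = total_variation(p0, p1)

# The minimum is attained by predicting S=1 wherever p1 exceeds p0
g_star = (p1 > p0).astype(int)
print(min_ber, 0.5 * (1.0 - d_tv), balanced_error_rate(g_star, p0, p1))
```

When `p0` and `p1` coincide, every classifier has a BER of exactly 1/2, which is the situation described as perfect fairness for all classifiers.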

Consider transformations that map the conditional distributions to a joint distribution. Let $T_S$ be a random transformation of $X$ such that $\mathcal{L}(T_0(X) \mid S=0) = \mathcal{L}(T_1(X) \mid S=1)$, and consider the transformed version $\tilde{X} = T_S(X)$. This transformation defines a way to repair the data in order to achieve fairness for all possible classifiers applied to the repaired data $\tilde{X}$. The map transforms the distributions $\mu_s$ into their images by $T_s$, namely $\mu_s \sharp T_s$ for all $s \in \{0,1\}$. Note that the choice of the transformation is equivalent to the choice of the target distribution $\nu$. Fairness is then achieved when the Total Variation distance is equal to zero, which amounts to saying that $T_0$ and $T_1$ map the conditional distributions towards the same distribution, hence $\mu_0 \sharp T_0 = \mu_1 \sharp T_1$.

In this framework the price of fairness can be quantified as follows. For a given deformation $T_S$, set

$$
\mathcal{E}(T_S) := \inf_{g \in \mathcal{G}} P\big(g(\tilde{X}) \neq Y\big) - R_B(X,S).
$$

The following theorem provides an upper bound for this price for fairness.

###### Theorem 3.4

() For each , assume that the function is Lipschitz with constant . Then, if ,

$$
\mathcal{E}(T_S) \le 2\sqrt{2}\,K \Big( \sum_{s=0,1} \pi_s W_2^2(\mu_s, \mu_s \sharp T_s) \Big)^{1/2}.
$$

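Hence the minimal excess risk in this setting is achieved by minimizing the previous quantity over possible transformations $T_S$. We thus obtain the following upper bound.

$$
\begin{aligned}
\inf_{T_S} \mathcal{E}(T_S)
&\le 2\sqrt{2}\,K \inf_{T_S} \Big( \sum_{s=0,1} \pi_s W_2^2(\mu_s, \mu_s \sharp T_s) \Big)^{1/2} \\
&\le 2\sqrt{2}\,K \inf_{\nu} \Big( \sum_{s=0,1} \pi_s W_2^2(\mu_s, \nu) \Big)^{1/2} \\
&= 2\sqrt{2}\,K \Big( \sum_{s=0,1} \pi_s W_2^2(\mu_s, \mu_B) \Big)^{1/2},
\end{aligned}
$$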

where $\mu_B$ denotes the Wasserstein barycenter between $\mu_0$ and $\mu_1$ with weights $\pi_s$ for $s \in \{0,1\}$.
Note that the previous theorem can easily be extended to the case where $S$ takes multiple discrete values. In the case where $S$ is continuous, the same result holds using the extension of the Wasserstein barycenter in  and provided that the conditional distributions are absolutely continuous with respect to the Lebesgue measure.
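To make the barycenter bound concrete, the following sketch evaluates it for two hypothetical one-dimensional Gaussian conditional distributions, for which $W_2^2$ and the barycenter have closed forms (the barycenter is again Gaussian, with mean and standard deviation given by the $\pi$-weighted averages). The parameter values, the Lipschitz constant `K = 1.0` and all names are assumptions made only for this example, and the grid search only compares against Gaussian candidates for $\nu$.

```python
import numpy as np

def w2_sq_gauss(m_a, s_a, m_b, s_b):
    """Squared 2-Wasserstein distance between N(m_a, s_a^2) and N(m_b, s_b^2) in 1D."""
    return (m_a - m_b) ** 2 + (s_a - s_b) ** 2

# Hypothetical group weights and conditional distributions mu_0, mu_1
pi0, pi1 = 0.3, 0.7
m0, s0 = 0.0, 1.0
m1, s1 = 2.0, 0.5
K = 1.0  # assumed Lipschitz constant

# 1D Gaussian barycenter: weighted average of means and of standard deviations
m_B = pi0 * m0 + pi1 * m1
s_B = pi0 * s0 + pi1 * s1

def cost(m, s):
    """Weighted sum of squared W2 distances to a Gaussian target N(m, s^2)."""
    return pi0 * w2_sq_gauss(m0, s0, m, s) + pi1 * w2_sq_gauss(m1, s1, m, s)

bary_cost = cost(m_B, s_B)
bound = 2 * np.sqrt(2) * K * np.sqrt(bary_cost)

# Sanity check: no Gaussian target on a coarse grid does better than the barycenter
best_grid = min(cost(m, s)
                for m in np.linspace(-1, 3, 41)
                for s in np.linspace(0.1, 2, 39))

print(f"barycenter cost = {bary_cost:.4f}, best grid cost = {best_grid:.4f}")
print(f"upper bound on the price for fairness = {bound:.4f}")
```

In this two-group Gaussian case the barycenter cost reduces to $\pi_0 \pi_1 W_2^2(\mu_0, \mu_1)$, so the bound grows with the distance between the two conditional distributions and vanishes when they coincide.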

### 3.2 Price for fairness as Equality of Odds

We now study the price for fairness understood as equality of odds, which considers the independence between the protected attribute and the outcome conditionally given the true value of the target, that is, the error of the algorithm.

#### 3.2.1 Regression

Consider the regression framework detailed in section 3.1.1 and let be a sample of i.i.d. random vectors observed from . Denote by and the matrices containing the observations of the non-sensitive and sensitive, respectively, features and . We will assume standard normal independent errors . Then, we consider the linear normal model

$$
Y = f_{\beta_0,\beta}(X,S) + \varepsilon, \tag{3.9}
$$

where the errors are such that , and the predictor

$$
f_{\beta_0,\beta}(X,S) = \beta_0 S + \beta^T X, \qquad \beta_0 \in \mathbb{R}, \ \beta \in \mathbb{R}^{p \times 1} \tag{3.10}
$$

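is a linear combination of the sensitive and non-sensitive attributes. Then, the joint distribution of $(X,S,Y)$ is $(p+2)$-dimensional normal, and we denote the vector of means and the covariance matrix as follows

$$
(X,S,Y) \sim N\left(
\begin{bmatrix} \mu_X \\ \mu_S \\ \mu_Y \end{bmatrix},
\begin{bmatrix}
\Sigma_X & \Sigma_{XS} & \Sigma_{XY} \\
\Sigma_{XS}^T & \Sigma_S & \Sigma_{SY} \\
\Sigma_{XY}^T & \Sigma_{SY}^T & \Sigma_Y
\end{bmatrix}
\right)
$$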

We note that the equality of odds criterion requires the fair linear predictor to be independent of $S$ conditionally given $Y$, that is

$$
f_{\beta_0,\beta}(X,S) \perp\!\!\!\perp S \mid Y,
$$

which under the normal model is equivalent to the second order moment constraint

$$
\operatorname{Cov}\big(f(X,S), S \mid Y\big) = 0. \tag{3.11}
$$

Hence, seeking a fair linear predictor amounts to obtaining conditions on the coefficients for (3.11) to hold. Since linear prediction can be seen as the most suitable framework for Gaussian processes, the relaxation (3.11) could be justified as being the appropriate notion of fairness when we restrict ourselves to linear predictors. Furthermore, linear predictors, especially under kernel transformations, are used in a wide array of applications. They thus form a practically relevant family of predictors where one would like to achieve non-discrimination. Therefore, in this section, we focus on obtaining non-discriminating linear predictors.

Now, if we denote by $C_{S,X,Y}$ the vector of correction for fairness

$$
C_{S,X,Y} := \frac{\Sigma_{XS}\Sigma_Y - \Sigma_{SY}\Sigma_{XY}}{\Sigma_S\Sigma_Y - \Sigma_{SY}^2}, \tag{3.12}
$$

then the optimal fair equality of odds predictor under the normal model can be exactly computed as in the following result, whose proof is set out in Appendix 6.2.

###### Proposition 3.5

Under the normal model (3.9), the optimal fair (equality of odds) linear predictor of the form (3.10) is given as the solution to the following optimization problem

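$$
\big(\hat{\beta}_{0,\mathrm{fair}}, \hat{\beta}_{\mathrm{fair}}\big) := \operatorname*{argmin}_{(\beta_0,\beta) \in \mathcal{F}_{EO}} E\big[(Y - f_{\beta_0,\beta}(X,S))^2\big],
$$

$$
\mathcal{F}_{EO} = \big\{ (\beta_0,\beta) \in \mathbb{R} \times \mathbb{R}^p \ \text{such that} \ \beta^T(\Sigma_{XS}\Sigma_Y - \Sigma_{SY}\Sigma_{XY}) + \beta_0(\Sigma_S\Sigma_Y - \Sigma_{SY}^2) = 0 \big\}.
$$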

If moreover $S$ and $Y$ are not linearly dependent, it can be exactly computed as

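$$
\hat{\beta}_{0,\mathrm{fair}} = -\hat{\beta}_{\mathrm{fair}}^T C_{S,X,Y}, \qquad \hat{\beta}_{\mathrm{fair}} = \Sigma_Z^{-1}\Sigma_{ZY},
$$

where

$$
\Sigma_Z = \Sigma_X + \Sigma_S C_{S,X,Y} C_{S,X,Y}^T - C_{S,X,Y}\Sigma_{XS}^T - \Sigma_{XS} C_{S,X,Y}^T, \qquad \Sigma_{ZY} = \Sigma_{XY} - \Sigma_{SY} C_{S,X,Y}.
$$

The signs follow from solving the linear constraint defining $\mathcal{F}_{EO}$ for $\beta_0$, which gives $\beta_0 = -\beta^T C_{S,X,Y}$, so that the fair predictor is a linear function of $Z := X - C_{S,X,Y} S$.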

Note that the case where $S$ and $Y$ are linearly dependent corresponds to a totally unfair scenario that is not worth studying. We observe that, while condition (3.11) is equivalent to equality of odds in the normal setting, it is generally a weaker constraint. However, the problem of achieving perfect fairness as equalized odds in a wider setup poses computational challenges, as discussed in . They showed that even in the restricted case of learning linear predictors, assuming a convex loss function, and demanding that only the sign of the predictor be non-discriminatory, the problem of matching FPR and FNR requires exponential time to solve in the worst case. Motivated by this hardness result (see Theorem 3 in ), they also proposed a relaxation of the criterion of equalized odds by a more tractable notion of non-discrimination based on second order moments. In particular, they proposed the notion of equalized correlations, which indeed is generally a weaker condition than (3.11), but when considering the squared loss and when the variables are jointly Gaussian, it is in fact equivalent (and, subsequently, equivalent to equality of odds). They also point out that for many distributions and hypothesis classes, there may not exist a non-constant, deterministic, perfectly fair predictor. Hence, we restrict ourselves here to the normal framework, in which the computation of the optimal fair predictor is still feasible.
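The constrained least squares problem of Proposition 3.5 can be explored numerically at the population level. The sketch below is only an illustration under stated assumptions: the variables are taken as centered, the joint covariance matrix is a randomly generated hypothetical one, and $\beta_0$ is eliminated directly from the linear constraint defining $\mathcal{F}_{EO}$, so the signs follow that constraint. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3

# Hypothetical positive definite covariance of the centered vector (X, S, Y)
A = rng.normal(size=(p + 2, p + 2))
Sigma = A @ A.T + 0.5 * np.eye(p + 2)
S_X, S_XS, S_XY = Sigma[:p, :p], Sigma[:p, p], Sigma[:p, p + 1]
S_S, S_SY, S_Y = Sigma[p, p], Sigma[p, p + 1], Sigma[p + 1, p + 1]

# Constraint: beta^T a + beta0 * d = 0, hence beta0 = -beta^T c with c = a / d
a = S_XS * S_Y - S_SY * S_XY
d = S_S * S_Y - S_SY ** 2
c = a / d

# The fair predictor is beta^T Z with Z = X - c S: least squares of Y on Z
S_Z = S_X + S_S * np.outer(c, c) - np.outer(c, S_XS) - np.outer(S_XS, c)
S_ZY = S_XY - S_SY * c
beta_fair = np.linalg.solve(S_Z, S_ZY)
beta0_fair = -beta_fair @ c

# Unconstrained least squares of Y on (X, S), for comparison
w = np.linalg.solve(Sigma[:p + 1, :p + 1], Sigma[:p + 1, p + 1])
beta_ols, beta0_ols = w[:p], w[p]

def risk(beta0, beta):
    """E[(Y - beta0*S - beta^T X)^2] for centered (X, S, Y)."""
    v = np.concatenate([beta, [beta0, -1.0]])
    return v @ Sigma @ v

def cond_cov(beta0, beta):
    """Cov(f, S | Y) under the Gaussian model."""
    cov_fS = beta @ S_XS + beta0 * S_S
    cov_fY = beta @ S_XY + beta0 * S_SY
    return cov_fS - cov_fY * S_SY / S_Y

print("Cov(f, S | Y), fair:", cond_cov(beta0_fair, beta_fair))
print("Cov(f, S | Y), unconstrained:", cond_cov(beta0_ols, beta_ols))
print("excess risk:", risk(beta0_fair, beta_fair) - risk(beta0_ols, beta_ols))
```

The difference between the two risks printed at the end is the population price for fairness in this toy configuration; it is nonnegative because the fair predictor minimizes the same loss over a smaller set.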

It is of interest to quantify the loss incurred when imposing the equality of odds fairness condition. This will be done by comparing with the general loss associated with the minimizer

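$$
\big[\hat{\beta}_0, \hat{\beta}^T\big]^T := \operatorname*{argmin}_{(\beta_0,\beta) \in \mathbb{R} \times \mathbb{R}^p} E\big[(Y - f_{\beta_0,\beta}(X,S))^2\big]. \tag{3.13}
$$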

We have performed some simulations to estimate the minimal excess risk in (3.3) when imposing equality of odds under this Gaussian linear regression framework. Precisely, we have considered and , such that

 X∼N(,).

The results of replications of the experiment are shown in Figure 7. There we present (a) the average minimal excess risk and (b) its standard deviation, as the sample size increases, taking particularly the values . We observe that the estimation seems to converge.

#### 3.2.2 Classification

We consider again the classification setting where we wish to predict a binary output label $Y$ from the pair $(X,S)$. In this section, we obtain the fair optimal classifier in the sense of equality of odds in the particular case where $S$ is also binary. We assume moreover that both the marginals and the joint distribution of $(S,Y)$ are non-degenerate, that is and . There are some other works dealing with the computation of Bayes-optimal classifiers under different notions of fairness. In  statistical parity and equality of opportunity are the considered constraints. Our approach here extends the one proposed in , where fairness is defined by the weaker notion of equality of opportunity, which requires just the equality of true positive rates across both groups.

An optimal fair classifier is formally defined here as the solution to the risk minimization problem over the class of binary classifiers satisfying the equality of odds conditions, that is

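$$
g^* \in \operatorname*{argmin}_{g \in \mathcal{F}_{EO}} R(g),
$$

where

$$
\mathcal{F}_{EO} := \big\{ g \in \mathcal{G} : P(g(X,S) = i \mid Y = i, S = 0) = P(g(X,S) = i \mid Y = i, S = 1), \ i = 0,1 \big\}.
$$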
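The two equalities defining $\mathcal{F}_{EO}$ can be checked empirically on a labeled sample. The sketch below computes, for each $i \in \{0,1\}$, the absolute difference between the groups of the empirical frequency of $g(X,S)=i$ given $Y=i$; the function name `equalized_odds_gaps` and the toy arrays are hypothetical.

```python
import numpy as np

def equalized_odds_gaps(y, pred, s):
    """Empirical gaps |P(g=i | Y=i, S=0) - P(g=i | Y=i, S=1)| for i = 0, 1.

    y, pred and s are arrays of 0/1 values; every (Y=i, S=grp) cell
    is assumed to be non-empty.
    """
    y, pred, s = map(np.asarray, (y, pred, s))
    gaps = {}
    for i in (0, 1):
        rates = [np.mean(pred[(y == i) & (s == grp)] == i) for grp in (0, 1)]
        gaps[i] = abs(rates[0] - rates[1])
    return gaps

# Toy sample (hypothetical): four individuals per group
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred = np.array([1, 0, 0, 0, 1, 1, 0, 1])

gaps = equalized_odds_gaps(y, pred, s)
print("gap on Y=0 (true negative rates):", gaps[0])
print("gap on Y=1 (true positive rates):", gaps[1])
```

A classifier belongs to the empirical counterpart of $\mathcal{F}_{EO}$ when both gaps are zero; requiring only the gap for $i=1$ to vanish corresponds to the weaker notion of equality of opportunity.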

In order to establish the form of such minimizer, we introduce the following assumption on the regression function.

###### Assumption 3.6

For each we require the mapping to be continuous on , where for all we let the regression function

$$
\eta(x,s) := P(Y = 1 \mid X = x, S = s) = E[Y \mid X = x, S = s]. \tag{3.14}
$$

The following result establishes that the optimal equalized odds classifier is obtained by recalibrating the Bayes classifier. Its proof is included in Appendix 6.3.

###### Proposition 3.7 (Optimal Rule)

Under Assumption 3.6, an optimal classifier $g^*$ can be obtained for all $x$ as

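$$
\begin{aligned}
g^*(x,1) &= \mathbb{1}\left\{ 1 \le 2\eta(x,1) - \theta_1^* \frac{\eta(x,1)}{P(Y=1,S=1)} + \theta_0^* \frac{1-\eta(x,1)}{P(Y=0,S=1)} \right\} \\
g^*(x,0) &= \mathbb{1}\left\{ 1 \le 2\eta(x,0) + \theta_1^* \frac{\eta(x,0)}{P(Y=1,S=0)} - \theta_0^* \frac{1-\eta(x,0)}{P(Y=0,S=0)} \right\},
\end{aligned}
$$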

where $(\theta_0^*, \theta_1^*)$ is determined from the equations

 EX∣S=1[η(X,1)g∗(X,1)]P(Y=1∣S=1) =EX∣S=0[η(X,0)g∗(X,0)]P(Y=1∣S=0) EX∣S=1[(1−η(X,1))g∗(X,1)]P(Y=0∣S=1)