A Central Limit Theorem for L_p transportation cost with applications to Fairness Assessment in Machine Learning

07/18/2018 · Eustasio del Barrio, et al.

We provide a Central Limit Theorem for the Monge-Kantorovich distance between two empirical distributions of sizes n and m, W_p(P_n,Q_m), for p>1 and observations on the real line, using a minimal amount of assumptions. We provide an estimate of the asymptotic variance which enables us to build a two-sample test to assess the similarity between two distributions. This test is then used to provide a new criterion to assess the notion of fairness of a classification algorithm.


1 Introduction

The analysis of the minimal transportation cost between two sets of random points, or of the transportation cost between an empirical and a reference measure, is by now a classical problem in probability, to which a significant amount of literature has been devoted. In the case of two sets of random points, say $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ in $\mathbb{R}^d$, the object of interest is

$$\min_{\sigma}\frac{1}{n}\sum_{i=1}^{n}c(X_i,Y_{\sigma(i)}),$$

where $\sigma$ ranges over the set of permutations of $\{1,\dots,n\}$ and $c$ is some cost function. This quantity is usually referred to as the cost of optimal matching. The optimal matching problem is closely related to the Kantorovich optimal transportation problem, which, in the Euclidean setting, amounts to the minimization of

$$\int c(x,y)\,d\pi(x,y),$$

with $\pi$ ranging in the set of joint probabilities on $\mathbb{R}^d\times\mathbb{R}^d$ with marginals $P$ and $Q$. Here $P$ and $Q$ are two probability measures on $\mathbb{R}^d$ and the minimal value of the above integral is known as the optimal transportation cost between $P$ and $Q$. The cost function $c(x,y)=\|x-y\|^p$, $p\ge 1$, has received special attention; the optimal transportation cost for this choice is $W_p^p(P,Q)$, where $W_p$ denotes the Monge-Kantorovich (Wasserstein) distance of order $p$. It is well known that with this choice of cost function the cost of optimal matching equals $W_p^p(P_n,Q_n)$, with $P_n$ and $Q_n$ denoting the empirical measures on $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$.
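On the real line, and for samples of equal size, the optimal matching simply pairs order statistics, which makes the empirical cost straightforward to compute. The following Python sketch (a minimal illustration with our own function name, not code accompanying the paper) computes $W_p^p(P_n,Q_n)$ in this way.

```python
import numpy as np

def empirical_wp_cost(x, y, p=2.0):
    """Empirical transportation cost W_p^p(P_n, Q_n) for two samples of equal size.

    On the real line the optimal matching pairs the i-th smallest x with the
    i-th smallest y, so the cost is the average of |x_(i) - y_(i)|^p.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if x.shape != y.shape:
        raise ValueError("this simple version assumes samples of equal size")
    return np.mean(np.abs(x - y) ** p)

# Toy example: two Gaussian samples shifted by 1, so W_2^2 is 1 in the limit.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)
y = rng.normal(1.0, 1.0, size=2000)
print(empirical_wp_cost(x, y, p=2.0))
```

For samples of different sizes one integrates the $p$-th power of the difference of the empirical quantile functions over $(0,1)$, as in the two-sample statements below.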

How large is the cost of optimal matching? Under the assumption that the $X_i$ are i.i.d. with law $P$, the $Y_j$ are i.i.d. with law $Q$, and $P$ and $Q$ have finite $p$-th moments, it is easy to conclude that

$$W_p^p(P_n,Q_n)\to W_p^p(P,Q)$$

almost surely. One might then wonder about the rate of approximation, that is, how far the empirical transportation cost is from its theoretical counterpart.

Much effort has been devoted to the case $P=Q$, namely, when the two random samples come from the same random generator. In this case $W_p(P,Q)=0$ and the goal is to determine how fast the empirical optimal matching cost vanishes. It is known from the early works [1] and [21] that the answer depends on the dimension $d$. In the case when $P$ is the uniform distribution on the unit hypercube, the empirical cost is of order $n^{-p/d}$ if $d\ge 3$, with a slightly worse rate if $d=2$. The results for $d\ge 3$ were later extended to a more general setup in [12], covering the case when $P$ has bounded support and a density satisfying some smoothness requirements. The one-dimensional case is different. If $p=1$ then, under some integrability assumptions, $\sqrt{n}\,W_1(P_n,P)$ converges weakly to a non-Gaussian limit, see [9]. If $p>1$ then it is still possible to get a limiting distribution for the empirical transportation cost, but now integrability assumptions are not enough and the available results require some smoothness conditions on the distribution (and on its density), see [10] for the case $p=2$. In fact, see [5], the condition that the underlying distribution has a positive density on an interval is necessary for boundedness of the rescaled sequence of empirical costs if $p>1$. Very recently a CLT in general dimension has been provided in [11]. The authors provide a CLT for quantities concentrating around their mean under some moment conditions (moments of order $4+\delta$ with $\delta>0$ are required). In this paper we sharpen their results in the one-dimensional case: we prove asymptotic normality of the empirical transportation cost, suitably centered and rescaled, under minimal moment and smoothness assumptions.
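For orientation, the classical one-dimensional result for $p=1$ alluded to above takes the following form (a standard statement of the type proved in [9]; we recall it only as a reference point, with notation of our choosing): if the d.f. $F$ of $P$ satisfies $\int_{-\infty}^{\infty}\sqrt{F(x)(1-F(x))}\,dx<\infty$, then

$$\sqrt{n}\,W_1(P_n,P)\;\xrightarrow{w}\;\int_{-\infty}^{\infty}\big|B(F(x))\big|\,dx,$$

where $B$ is a Brownian bridge on $[0,1]$. The limit is non-Gaussian, in contrast with the CLT's discussed in this paper.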

Such a result enables us to construct goodness-of-fit tests, but also to assess how similar two different distributions $P$ and $Q$ are. The similarity between the distributions is measured by the Wasserstein distance, and we want to test the null hypothesis that $W_p(P,Q)$ is at least some chosen threshold against the alternative that it is smaller. Note that, in a different setting, this test is also considered in [18].
An application is given by the recent framework of Fair Learning, or Disparate Impact assessment, which has received growing attention driven by the generalization of machine learning in everyday life. We refer for instance to [19], [17], [7] or [15] and references therein. In this setting, decisions are driven by machine learning procedures and the main concern is to detect whether a decision rule, learnt from a set of covariates, is biased with respect to a subcategory of the population. For this, one variable is designated as a protected attribute and splits the population into two groups. A decision rule is called unfair with respect to this attribute when it exhibits a different behavior depending mainly on the value of the protected attribute and not on the values of the remaining covariates. This discrimination may come from the algorithm or from a biased situation that has been learnt from the training sample. This framework was originally proposed in [13] and further developed in [8].
Many criteria have been given in the recent literature on Fair Learning to detect this situation (see [3] or [4] for a review). A majority of these definitions consider that the decision should be independent of the protected attribute, which means that the decision should behave similarly in both groups. Actually, if we consider the distribution of the classifier output conditionally on each of the two groups, then complete fairness, called Statistical Parity, is obtained when these two conditional distributions are the same, which corresponds to independence of the decision from the protected attribute. Therefore, the level of fairness can be quantified by estimating the similarity between these two conditional distributions. Hence we provide a new way of assessing fairness in machine learning by considering confidence intervals to test the similarity of these distributions with respect to the Monge-Kantorovich distance. We study how this criterion behaves compared to standard criteria in the fair learning literature. The paper is organized as follows. Section 2 provides the main result, i.e., the Central Limit Theorem for the transportation cost with $p>1$. Section 3 is devoted to some simulations, while Section 4 is devoted to the application of this test to detect disparate impact. Proofs are gathered in the Appendix.

2 CLT for transportation cost

In this section we present the main results in this paper, namely, CLT’s for the transportation cost between an empirical measure and a target measure or between two empirical measures.


To present our results, we set , , and consider the functions

(2.1)

We note that . Since implies while for we have , we see that is finite for every . Under the assumption we show in Lemma 2.1 below that, in fact, . This allows us to introduce also

(2.2)

We observe that changing by in (2.1) would not affect the definition of .

Lemma 2.1

If , , then and . Furthermore, if satisfy , and is continuous on then in as .

Proof. We set , , and observe that

The first claim follows upon using Hölder’s inequality to check that . For the second observe that implies that for every of continuity for (hence, for almost every ) and also that is uniformly integrable (and the same holds for , with convergence at every point in since is continuous). By the remarks before this Lemma we can assume without loss of generality that is continuous at . Then at every of continuity for . Using the bound (2) for and the fact that and are uniformly integrable, we see that the sequence is uniformly integrable and conclude that and in .  

It is convenient at this point to introduce the notation

(2.4)

Lemma 2.1 ensures that is a finite constant provided and have finite moments of order . Also, if then , which is the optimal transportation map from to is different from the identity on a set of positive measure and if is not a Dirac measure. We remark that is not, in general, symmetric in and .

We are ready now for the main result in this section.

Theorem 2.2

Assume that and is continuous on . Then

  • If are i.i.d. and is the empirical d.f. based on the ’s

  • If, furthermore, is continuous, are i.i.d. , independent of the ’s, is the empirical d.f. based on the ’s and then

A proof of this result is given in the Appendix. We would like to make some remarks about Theorem 2.2 at this point. There has been significant interest in empirical transportation costs in the recent literature. We should mention at least [14], giving moment bounds and concentration results for the empirical transportation cost in general dimension, and [5], with a comprehensive discussion of the one-dimensional case. Both papers focus on the case where the law underlying the empirical measure and the target measure are equal (in the setup of Theorem 2.2, the case where the two distributions coincide). With the more specific goal of CLT's for empirical transportation costs, [20] considers the case when the underlying probabilities are finitely supported, while [22]

covers probabilities with countable support. The approach in these two cases relies on Hadamard directional differentiability of the dual form of the finite (or countable) linear program associated to optimal transportation. Without the constraint of countable support,

[11] covers quadratic transportation costs in general dimension.

There are similarities between the approach in [11] and the presentation here, as one can see from a look at our Appendix. We must emphasize some significant differences, however. An obvious one is that here we only deal with one-dimensional probabilities. On the other hand, we cover general costs. A more significant difference is that the assumptions in Theorem 2.2 are sharp. Let us focus on (i) to discuss this point. To make sense of the transportation cost we must consider distributions with finite $p$-th moments. Now, if we want (i) to hold for every distribution with finite $p$-th moment, taking one of the two measures to be a Dirac mass shows that the other must have a finite moment of order $2p$, and it is then easy to check that finite moments of order $2p$ are also sufficient. Thus, the assumption of finite moments of order $2p$ for the two distributions seems to be a minimal requirement for (i) to hold. We note that for the quadratic cost, $p=2$, Theorem 4.1 in [11] required finite moments of order $4+\delta$ for some $\delta>0$.

Some words on the role of the continuity assumption in (i) are also in order here. That some sort of regularity of the quantile function is needed for handling the empirical transportation functional in dimension one was observed in [5]. In particular, absolute continuity of the underlying distribution turns out to be a necessary condition in this context (Theorem 5.6 in [5]). Continuity of the quantile function is also related to assumption (3) in [11]. In fact, that assumption, in the case of one-dimensional probabilities, implies that the measure is supported in a (possibly unbounded) interval and that the quantile function is differentiable in the interior of that interval. Hence, the regularity assumption in Theorem 2.2 is also slightly weaker than that in Theorem 4.1 in [11]. We should also note at this point that Theorem 1 in [20], for the case of finitely supported probabilities on the real line, corresponds to a case of discontinuity of the quantile functions, and this can lead to non-normal limiting distributions.

We would also like to discuss the role of the centering constants in Theorem 2.2. Under more restrictive assumptions there are similar CLT's in which the expected empirical transportation cost is replaced by the simpler centering constant $W_p^p(F,G)$ (see, e.g., Theorem 4.3 in [11]). In fact, the Kantorovich duality (see, e.g., [23]) yields that

$$W_p^p(F,G)=\sup_{(\varphi,\psi)\in\Phi_p}\Big(\int\varphi\,dF+\int\psi\,dG\Big),$$

where $\Phi_p$ is the set of pairs of integrable functions (with respect to $F$ and $G$, respectively) satisfying $\varphi(x)+\psi(y)\le |x-y|^p$ for all $x,y$. But this entails $E\,W_p^p(F_n,G)\ge W_p^p(F,G)$. Hence, we can replace the centering constants in Theorem 2.2 provided

(2.5)

Finding sharp conditions under which (2.5) holds seems to be a delicate issue. We limit ourselves to providing a set of sufficient conditions for it. The case has been considered in [5] and can be handled with simple moment conditions. The general case that we consider here seems to add some smoothness requirements. We limit our discussion to . We will assume that is twice differentiable, with nonvanishing density, , in the interior of and satisfies

(2.6)

Furthermore, we will assume that

(2.7)
(2.8)
(2.9)

Condition (2.6) is a natural condition for approximating the quantile process by a weighted uniform standard process. We refer to [10] for details. The other three conditions are implied by the stronger assumption

(2.10)

This condition is, essentially, needed for ensuring that is a bounded sequence, see [5]. We would like to note that (2.10) does not hold for Gaussian , while (2.7), (2.8) and (2.9) do.

With these assumptions we can prove the following.

Proposition 2.3

Assume . Under the assumptions of Theorem 2.2,

  • if satisfies (2.6) to (2.9) then (2.5) holds and, as a consequence,

  • if, furthermore, satisfies (2.6) to (2.9) then

A proof of Proposition 2.3 is given in the Appendix. The scheme of proof, in fact, relies on some auxiliary results in [10] that give, through a completely different approach, asymptotic normality of . The economy in assumptions that one can gain from dealing with the centering in Theorem 2.2 is, in our view, remarkable. Providing sharper conditions under which (2.5) holds remains an interesting open question.


For the statistical application of Theorem 2.2 it is of interest to have a consistent estimator of the asymptotic variances. In the two sample case this can be done as follows. Define

with and

(2.11)

We define similarly exchanging the roles of the ’s and the ’s. Finally, we set

(2.12)

We show next that is a consistent estimator of the asymptotic variance in the two sample case in Theorem 2.2. A consistent estimator for the asymptotic variance in the one sample case can be obtained similarly. We omit details.

Proposition 2.4

If and are continuous on then

almost surely.

Proof. Simply note that and apply Lemma 2.1.


As a consequence of Propositions 2.3 and 2.4 we have that if, additionally,

and (or ) is not a Dirac measure then

(2.13)

We can use (2.13) for statistical applications in several ways. From (2.13) we see that

(2.14)

is a confidence interval for with asymptotic confidence level . Alternatively, we could consider the testing problem

(2.15)

where is some threshold (to be determined by the practitioner). Rejection of the null in (2.15) would yield statistical evidence that the d.f.’s and are almost equal. We can handle this problem by rejecting the null if

(2.16)

It follows from (2.13) that the test defined by (2.16) has asymptotic level . In the next section we explore the use of this test for the assessment of fairness of learning algorithms.
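Before moving to that application, the following Python sketch illustrates how (2.13)-(2.16) can be turned into a concrete procedure. Since the explicit variance estimator (2.11)-(2.12) is not reproduced here, a naive bootstrap proxy is plugged in instead; the helper names, the quantile-grid approximation of the cost and the $\sqrt{nm/(n+m)}$ scaling are our own reading of the statements above, not the paper's code.

```python
import numpy as np
from scipy import stats

def wp_cost(x, y, p=2.0, grid=2000):
    """Empirical transportation cost between two samples (possibly of
    different sizes), computed by integrating |F_n^{-1} - G_m^{-1}|^p
    over a regular grid of probability levels."""
    t = (np.arange(grid) + 0.5) / grid
    return np.mean(np.abs(np.quantile(x, t) - np.quantile(y, t)) ** p)

def bootstrap_variance(x, y, p=2.0, n_boot=200, seed=0):
    """Naive bootstrap proxy for the asymptotic variance of the (rescaled)
    empirical cost.  This is NOT the estimator (2.11)-(2.12); it is only a
    generic stand-in so that the interval and the test below can be formed."""
    rng = np.random.default_rng(seed)
    n, m = len(x), len(y)
    reps = np.array([
        wp_cost(rng.choice(x, n), rng.choice(y, m), p) for _ in range(n_boot)
    ])
    return (n * m / (n + m)) * reps.var(ddof=1)

def wp_similarity_test(x, y, delta0, p=2.0, alpha=0.05):
    """One-sided decision in the spirit of (2.16): reject (i.e. declare the
    two distributions 'almost equal') when the upper confidence bound for the
    transportation cost falls below the threshold delta0."""
    n, m = len(x), len(y)
    rate = np.sqrt(n * m / (n + m))
    cost = wp_cost(x, y, p)
    sd = np.sqrt(bootstrap_variance(x, y, p))
    upper = cost + stats.norm.ppf(1 - alpha) * sd / rate
    return {"cost": cost, "upper_bound": upper, "reject": bool(upper < delta0)}
```

In practice one would replace bootstrap_variance by the consistent estimator of Proposition 2.4; rejection (a small upper confidence bound) is then interpreted as statistical evidence that the two distributions are almost equal, in the sense of (2.15).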

3 Simulations and Results

In this section, we first analyze the consistency of the variance estimation given by (2.11)-(2.12) and established in Proposition 2.4. Then we check the performance of the test and, finally, we apply both tools to the Fair Learning problem.

Consider two independent i.i.d. samples drawn from two distributions, and denote the corresponding empirical distribution functions. We have simulated these samples according to the following models, for which we can compute the exact expression of the asymptotic variance in Proposition 2.4.

Example 3.1 (Location model)

Consider and . We can write and the Wasserstein distance between both distributions is . In this case, we can compute the functions

and then

Hence, in this model we have an explicit expression for the true variance. In Table 1, we report the estimates for increasing sample sizes; they approach the true values in the limit.
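As a quick numerical check in the spirit of Table 1 (parameter choices and code are ours, not the paper's), one can simulate a pure location model, for which the Wasserstein distance equals the absolute shift for every $p\ge 1$, and watch the empirical distance stabilize around that value:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, p = 1.0, 2.0   # location shift and cost exponent (our choice)

def wp_distance(x, y, p):
    """Empirical W_p distance between two equal-size samples (sort and match)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1.0 / p)

for n in [100, 1000, 10000, 100000]:
    x = rng.normal(0.0, 1.0, n)        # F
    y = rng.normal(mu, 1.0, n)         # G = F shifted by mu
    print(n, round(wp_distance(x, y, p), 4))   # should approach |mu| = 1
```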

Example 3.2 (Scale-location model)

Consider and . In this case, , and then

n p=1 p=2 p=3
50 0.7811 5.9767 9.4721
100 0.8742 3.0618 9.668
200 0.9262 5.1305 10.1345
400 1.0510 4.9746 8.7785
500 1.0023 4.0164 9.2851
800 0.9858 3.4522 8.592
1000 1.0923 4.399 8.9125
2000 0.9868 3.4341 9.1057
5000 0.9932 4.1488 8.9690
10000 0.9999 4.0661 9.1961
20000 0.9842 4.0426 8.9744
50000 1.003 3.9567 9.1324
100000 0.9965 4.0184 8.9922
limit 1 4 9
Table 1: Estimates of the variance of the asymptotic distribution in the location model of Example 3.1. The last row gives the true limiting values.
p n =1 =0.9 =0.7 =0.5
1 50 0.062 0.146 0.481 0.825
100 0.055 0.193 0.698 0.974
200 0.053 0.275 0.918 1
400 0.051 0.413 0.995 1
500 0.051 0.481 0.999 1
800 0.052 0.64 1 1
1000 0.054 0.728 1 1
2000 0.047 0.937 1 1
2 50 0.074 0.167 0.513 0.839
100 0.063 0.198 0.717 0.979
200 0.059 0.272 0.927 1
400 0.055 0.422 0.995 1
500 0.05 0.484 0.999 1
800 0.053 0.651 1 1
1000 0.053 0.736 1 1
2000 0.051 0.935 1 1
3 50 0.071 0.154 0.515 0.822
100 0.0662 0.206 0.715 0.973
200 0.057 0.266 0.925 1
400 0.052 0.422 0.992 1
500 0.057 0.497 0.997 1
800 0.053 0.652 1 1
1000 0.053 0.733 1 1
2000 0.051 0.937 1 1
Table 2: Estimated probabilities of rejection in the location model.
p n =1, =2 =1, =1.5 =0, =2 =0, =1.5
1 50 0.047 0.165 0.535 0.996
100 0.045 0.195 0.8 1
200 0.036 0.323 0.974 1
400 0.052 0.532 1 1
500 0.056 0.614 1 1
800 0.035 0.810 1 1
1000 0.045 0.895 1 1
2000 0.050 0.994 1 1
2 50 0.078 0.376 0.595 0.998
100 0.067 0.551 0.823 1
200 0.062 0.786 0.976 1
400 0.055 0.969 1 1
500 0.059 0.985 1 1
800 0.052 1 1 1
1000 0.056 1 1 1
2000 0.05 1 1 1
3 50 0.091 0.569 0.571 0.997
100 0.093 0.762 0.758 1
200 0.072 0.935 0.939 1
400 0.06 1 0.996 1
500 0.064 0.999 0.997 1
800 0.069 1 1 1
1000 0.06 1 1 1
2000 0.049 1 1 1
Table 3: Estimated probabilities of rejection in the scale-location model

To check the performance of the test (2.15), we have simulated observations in the scenarios of Examples 3.1 and 3.2 for different values of the location and scale parameters. Table 2 shows the estimated frequencies of rejection of the test in the location model for different values of the cost parameter $p$. Under the null hypothesis, the rejection frequency stays close to the nominal level. Moreover, under the alternative, the values show that the test has high power. Similar results, contained in Table 3, are obtained for the scale-location model for different values of the threshold and of the cost parameter. Under the null hypothesis the estimated level again reaches the nominal value, and the test shows high power under the alternative. We note that even in the case $p=1$, which is not covered by the theoretical results in this paper, the simulations in both models show that the test has asymptotically the right level and that its power is very high in most cases.

4 Application to Fair Learning

Fair learning is devoted to the analysis of the biases that appear when automatic decisions (mainly classification rules) are learnt from a training sample. This sample may be prejudiced against a part of the population, which means that the variable to be predicted is, in the sample, unbalanced between the two groups. Hence, when trying to find a classification rule, the algorithm will reproduce the discrimination present in the sample rather than learn a true link function. This bias may have been introduced intentionally or may reflect the bias present in the use cases. A striking example is provided by banks predicting high income from a set of covariates in order to grant a loan. Despite their claims, this prediction leads to a clear discrimination between males and females, while this variable should not play any role in such a forecast. Yet the constitution of the training sample leads the classifier to underestimate the income of females compared to males.
Hence it is important to detect such automatic bias in order to prevent the generalization, and even worse the justification, of a discriminatory behavior. Many criteria have been proposed to quantify the influence of the group variable on the behavior of a machine learning algorithm; most of them consider a notion of similarity between the decisions of the algorithm conditionally on group membership. In the following we propose a new criterion by considering the Monge-Kantorovich distance between the distributions of the classification rule conditioned on the two groups. This problem is at the heart of recent studies in machine learning, leading to a new field of research called fair machine learning. To illustrate the application of the theory of the previous section to the problem of fairness in Machine Learning, we consider the Adult Income data set (available at https://archive.ics.uci.edu/ml/datasets/adult). It contains instances described by numeric and categorical attributes, together with a categorization of each person as having an income of more or less than $50,000 per year.

Recently, in [8] the problem of forecasting a binary variable using observed covariates, and assuming that the population is divided into two categories that represent a possible bias, modeled by a protected variable, was considered. The fairness criteria for classification problems considered there are the Disparate Impact (DI) and the Balanced Error Rate (BER), which were introduced in [13]. For a classification rule, the Disparate Impact is a score that measures how close the probabilities of a positive decision in the two groups are. The Balanced Error Rate describes how well the protected variable can be recovered by the classification rule originally meant to predict the target variable. Using these criteria, they designed procedures to remove the possible discrimination, both partially and totally, based on the idea of moving the conditional distributions given the value of the protected attribute. This approach, originally proposed in [13], can also be found in [16].
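For concreteness, the following Python sketch computes both criteria for a binary decision rule and a binary protected attribute: the DI is the ratio of positive-decision rates between the two groups, and the BER averages the two group-wise errors made when the decision is used to guess the protected attribute (helper names and conventions are ours, following our reading of [13]).

```python
import numpy as np

def disparate_impact(decision, s):
    """DI = P(decision = 1 | S = 0) / P(decision = 1 | S = 1),
    with S = 0 taken as the protected group."""
    decision, s = np.asarray(decision), np.asarray(s)
    return decision[s == 0].mean() / decision[s == 1].mean()

def balanced_error_rate(decision, s):
    """BER when the decision is used as a predictor of S:
    0.5 * ( P(decision = 1 | S = 0) + P(decision = 0 | S = 1) )."""
    decision, s = np.asarray(decision), np.asarray(s)
    return 0.5 * ((decision[s == 0] == 1).mean() + (decision[s == 1] == 0).mean())

# Toy example with an obviously unbalanced rule.
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
decision = np.array([0, 0, 1, 0, 1, 1, 0, 1])
print(disparate_impact(decision, s), balanced_error_rate(decision, s))
```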

Also in [8], confidence intervals for the empirical counterpart of the DI proposed in [4], together with some numerical results of the procedures that remove discrimination, are given for the Adult Income data and the logit classifier. This classifier is used to make the prediction from five numerical variables: Age, Education Level, Capital Gain, Capital Loss and Worked hours per week. Among the remaining categorical attributes, the sensitive attribute chosen as potentially protected is the Gender (“male” or “female”). While in that work the logit is used for binary classification of whether a person earns more or less than $50,000 per year, here we will consider the output of the logistic regression, that is, the estimated probability of a positive outcome, which provides for each observation a value on the real line and hence, for each group, a distribution on the real line.


This estimated probability is used to predict whether an individual will have a high income. This forecast algorithm presents some bias with respect to gender, in the sense that the training sample is biased in such a way that females with characteristics similar to those of males are less likely to be predicted a high income. This unfairness is usually exhibited using the Disparate Impact assessment, as discussed in [8].

We want to see whether the test (2.15) is an appropriate tool to assess fairness in algorithmic classification results and could replace the Disparate Impact. Actually, fairness should be achieved as soon as the distributions of the forecast probability in the two groups are close. We choose to control this closeness directly using the Wasserstein distance rather than the Disparate Impact criterion. Yet, we study the relationship between the notions of Disparate Impact and Balanced Error Rate of a classifier and the Wasserstein distance between the probability distributions of its output conditionally given the protected attribute. In [8], it is proved that the BER is related to the distance in total variation between these two conditional distributions, but that the optimal transportation cost is still a reasonable way of quantifying their discrepancy. Hence we study in simulations how the variation of the Wasserstein distance between the two conditional distributions affects the Disparate Impact and the BER. Fairness should increase as the distance between the two distributions becomes small, which indicates that the protected attribute does not affect the decision rule. For this, in Figures 2 and 3, we represent the evolution of the known criteria DI and BER with the Wasserstein distance, while the two conditional distributions are being pushed forward onto their Wasserstein barycenter, according to the partial repair procedure called Geometric Repair [13], which moves each conditional distribution towards the Wasserstein barycenter in order to reduce the disparity between the groups.
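A minimal sketch of this repair mechanism, under our own modeling choices (the repair interpolates empirical quantile functions toward a weighted barycenter, which is one common reading of the Geometric Repair of [13]; it is not the exact pipeline used for the figures):

```python
import numpy as np

GRID = np.linspace(0.005, 0.995, 200)   # probability levels for quantiles

def quantiles(sample):
    return np.quantile(sample, GRID)

def geometric_repair(scores_0, scores_1, lam):
    """Move each group's score distribution a fraction lam of the way
    toward the weighted Wasserstein barycenter (weights = group sizes)."""
    q0, q1 = quantiles(scores_0), quantiles(scores_1)
    w0 = len(scores_0) / (len(scores_0) + len(scores_1))
    q_bar = w0 * q0 + (1.0 - w0) * q1          # barycenter quantile function
    return (1 - lam) * q0 + lam * q_bar, (1 - lam) * q1 + lam * q_bar

rng = np.random.default_rng(2)
s0 = rng.beta(2, 5, 3000)   # protected group: lower scores on average
s1 = rng.beta(5, 2, 4000)

for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    r0, r1 = geometric_repair(s0, s1, lam)
    w2 = np.sqrt(np.mean((r0 - r1) ** 2))      # W_2 between repaired groups
    print(lam, round(w2, 4))                   # shrinks to 0 as lam -> 1
```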

Figure 1 shows confidence intervals for the empirical quadratic transportation cost as the amount of repair increases.

In Figure 2 we can see that the Disparate Impact decreases with the Wasserstein distance, and the desirable level of 0.8 is attained once the distance is small enough. Note that 0.8 is a threshold chosen in many trials about unfair algorithmic treatments (see for instance [13] or [24]). Moreover, Figure 3 confirms that the closer both distributions are in Wasserstein distance, the less predictable the protected variable is from the outcome of the regression. In conclusion, although the distance in total variation and the Wasserstein distance are of a very different nature, controlling the amount of fairness using the Wasserstein distance provides a control on the Disparate Impact. Moreover, it may be an alternative to the Balanced Error Rate and can provide a new control over the fairness of an algorithm. In this paper, we restricted ourselves to the logit classifier, but using a multidimensional version of the CLT, as in [11], we could provide a natural criterion of fairness directly on the observations by looking at the distance between the conditional distributions of the covariates in the two groups.

Figure 1: Confidence interval (2.14) for the quadratic transportation cost, as a function of the amount of repair.

Figure 2: Relationship between the DI and the Wasserstein distance.

Figure 3: Relationship between the BER and the Wasserstein distance.

Appendix

In this Appendix we provide the proof of Theorem 2.2. Parts (i) and (ii) can be handled similarly. Hence, for the sake of simplicity, we focus on part (i). The same techniques yield (ii) with little extra effort. Throughout the section we will assume that we are given i.i.d. random variables uniformly distributed on the interval $(0,1)$; we keep the usual notation for the corresponding empirical distribution function and for the related uniform empirical process. These allow us to represent any other i.i.d. sample with a given d.f. by applying its quantile inverse to the uniform variables. We use this construction in the sequel without further mention.

Given a distribution function we write for the empirical distribution function based on the sample and for the quantile inverse of . Note that . We fix a d.f. and define

(4.1)

Similarly, using the notation in (2.1) for and , we denote

(4.2)

where is a standard Brownian motion on . It follows from Lemma 2.1 that is a centered Gaussian r.v. with variance as in (2.4).

We provide now some empirical counterparts of Lemma 2.1: first, a general variance bound, and then, under more restrictive assumptions, an approximate continuity result for the trajectories of the associated process. The main ingredient in the proof is the Efron-Stein inequality for variances, which bounds the variance of a function of independent random variables by the expected squared changes obtained when each variable is replaced, one at a time, by an independent copy.
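In its standard form (notation here is ours, chosen only to spell out the verbal description above: $Z=f(X_1,\dots,X_n)$ with $X_1,\dots,X_n$ independent, $X_i'$ an independent copy of $X_i$, and $Z_i=f(X_1,\dots,X_{i-1},X_i',X_{i+1},\dots,X_n)$), the inequality reads

$$\mathrm{Var}(Z)\;\le\;\frac{1}{2}\sum_{i=1}^{n}E\big[(Z-Z_i)^2\big].$$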

We refer, for instance, to [6] for further details.

Proposition 4.1

If , , then there exists a finite constant , depending only on and such that

A valid choice of the constant is given by with

and

Proof. We recall that in equation (4.1) is the empirical distribution function based on the i.i.d. sample , . We set and , where is the empirical distribution function based on the sample and are i.i.d.. We write for the ordered sample. Let us assume that is continuous. Now, with denoting the rank of within the sample . Continuity of ensures that a.s. there are no ties and is a random permutation of . Let us write for the ranks in the sample . Now, is the minimal value of

among random vectors

which, conditionally given the ’s, have marginals and . This shows that

and, as a consequence,

Using the fact that