1. Introduction
Machine-learned models are increasingly being used in web applications for crucial decision-making tasks such as lending, hiring, and college admissions, driven by a confluence of factors such as ubiquitous connectivity, the ability to collect, aggregate, and process large amounts of fine-grained data, and the ease with which sophisticated machine learning models can be applied. Recently, there has been a growing awareness of the ethical and legal challenges posed by the use of such data-driven systems, which often make use of classification models that deal with users. Researchers and practitioners from different disciplines have highlighted the potential for such systems to discriminate against certain population groups, due to biases in data and algorithmic decision-making systems. Several studies have shown that classification and ranked results produced by a biased machine learning model can result in systemic discrimination and reduced visibility for an already disadvantaged group (Barocas and Hardt, 2017; Dwork et al., 2012; Hajian et al., 2016; Pedreschi et al., 2009), e.g., disproportionate association of higher risk scores of recidivism with minorities (Angwin et al., 2016), over-/under-representation and racial/gender stereotypes in image search results (Kay et al., 2015), and incorporation of gender and other human biases as part of algorithmic tools (Bolukbasi et al., 2016; Caliskan et al., 2017). One possible reason is that machine-learned prediction models that are trained on datasets exhibiting existing societal biases end up learning these biases, and can therefore reinforce (or even potentially amplify) them in their results.
Our goal is to develop a framework for identifying biases in machine-learned models across different subgroups of users, and to address the following questions:

Is the measured discrepancy statistically significant? When dealing with web-scale datasets, we are very likely to observe discrepancies of varying magnitudes owing to less-than-ideal scenarios and noise. Observed discrepancies do not necessarily imply that there is bias; the strength of the evidence (as presented by the data) must be considered in order to ascertain that there truly is bias. To this end, we seek to perform rigorous statistical hypothesis tests to quantify the likelihood of the observed discrepancy being due to chance.

Can we perform hypothesis tests in a metric-agnostic manner? When certain assumptions about the underlying distribution or the metric to be measured can be made, we can resort to parametric tests suited for these purposes. However, when we wish to have a pluggable interface for any metric (with respect to which we wish to measure discrepancies in fairness), we need to make the testing framework as generic as possible.
There are numerous definitions of fairness including equalized odds, equality of opportunity, individual or group fairness, and counterfactual fairness in addition to simply comparing model assessment metrics across groups. While each of these criteria has merit, there is no consensus on what qualifies a model as fair, and this question is beyond the scope of this paper. Our aim is not to address the relative virtues of these definitions of fairness, but rather to assess the strength of the evidence presented by a dataset that a model is unfair with respect to a given metric.
We develop a permutation testing framework that serves as a black-box approach to assessing whether a model is fair with respect to a given metric, and provide an algorithm that a practitioner can use to quantify the evidence against the assumption that a model is fair with respect to a specified metric. This is especially appealing because the framework is metric-agnostic.
Traditional permutation tests specify that the underlying data-generating mechanisms are identical between two populations and are somewhat limited in the claims that can be made regarding the fairness of machine learning models. We seek to determine whether a machine learning model has equitable performance for two populations in spite of potential inherent differences between these populations. In this paper, we illustrate the shortcomings of classical permutation tests, and propose an algorithm for permutation testing based on any metric of interest which is appropriate for assessing fairness. Open-source packages evaluating fairness (such as FairTest (Tramèr et al., 2017)) implement permutation tests which are not valid for their stated use. Our contribution is to illustrate the potential pitfalls in implementing permutation tests and to develop a permutation testing methodology which is valid in this context.
The rest of the paper is organized as follows. In Section 2, we provide a background on permutation tests and illustrate why traditional permutation tests can be problematic as well as how to solve these issues. Section 3 introduces permutation tests that can evaluate fairness in machine learning models. Simulations are presented in Section 4, followed by experiments using real-world datasets in Section 5. We discuss related work in Section 6 and conclude in Section 7. The proofs of all results are deferred to the Appendix for ease of readability.
2. Preliminaries
Permutation tests (discussed extensively in (Good, 2000)) are a natural choice for making comparisons between two populations; however, the validity of permutation tests is largely dependent on the hypothesis of interest, and these tests are frequently misapplied. We describe some background, illustrate misapplications of permutation tests, and establish valid permutation tests in the context of assessing the fairness of machine learning models.
2.1. Background
The standard setup of two-sample permutation tests is as follows. A sample $X_1, \dots, X_{n_X}$ is drawn from a population with distribution $P_X$ and a sample $Y_1, \dots, Y_{n_Y}$ is drawn from a population with distribution $P_Y$. The null hypothesis of interest is
$$H_0: P_X = P_Y,$$
which is often referred to as either the “strong” or “sharp” null hypothesis. A common example is comparing two drugs, perhaps a treatment with a placebo to study the effectiveness of a new drug. The observed data is some measure of outcome for each group given either the treatment or control. In this case, the null hypothesis is that there is no difference whatsoever in the observed outcomes between the two groups. A permutation test based on a test statistic $T = T(X_1, \dots, X_{n_X}, Y_1, \dots, Y_{n_Y})$ proceeds as follows.

For a large integer $B$, uniformly choose permutations $\pi_1, \dots, \pi_B$ of the integers $\{1, \dots, n_X + n_Y\}$.

Define $(Z_1, \dots, Z_{n_X + n_Y}) = (X_1, \dots, X_{n_X}, Y_1, \dots, Y_{n_Y})$.

Recompute the test statistic on the permutations of the data, resulting in $T_b = T(Z_{\pi_b(1)}, \dots, Z_{\pi_b(n_X + n_Y)})$ for $b = 1, \dots, B$.

Define the permutation distribution of $T$ to be the empirical distribution of the test statistics computed on the permuted data, namely $\hat{R}(t) = \frac{1}{B} \sum_{b=1}^{B} I\{T_b \le t\}$.

Reject the null hypothesis at level $\alpha$ if $T$ exceeds the $1 - \alpha$ quantile of the permutation distribution.
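The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the default of 1,000 permutations, and the choice of the difference of means as the example statistic are all assumptions made for the sake of the example.

```python
import math
import random


def mean_difference(x, y):
    """Difference of sample means, used here only as an example statistic."""
    return sum(x) / len(x) - sum(y) / len(y)


def permutation_test(x, y, statistic, num_permutations=1000, alpha=0.05, seed=0):
    """One-sided two-sample permutation test that rejects for large statistics.

    Returns (observed statistic, critical value, reject decision).
    """
    rng = random.Random(seed)
    n_x = len(x)
    pooled = list(x) + list(y)            # the pooled sample
    observed = statistic(x, y)

    permuted_stats = []
    for _ in range(num_permutations):
        rng.shuffle(pooled)               # a uniformly random permutation
        permuted_stats.append(statistic(pooled[:n_x], pooled[n_x:]))

    # Empirical (1 - alpha) quantile of the permutation distribution.
    permuted_stats.sort()
    k = max(0, math.ceil((1 - alpha) * num_permutations) - 1)
    critical_value = permuted_stats[k]
    return observed, critical_value, observed > critical_value
```

Any function of two samples can be passed as `statistic`, which is what makes the procedure metric-agnostic; whether the resulting test is valid for the hypothesis of interest depends on that choice.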
This test is appealing because it has an exactness property: when the “sharp” null hypothesis is true, the probability that the test rejects is exactly $\alpha$ (the Type I error rate). However, researchers are more commonly interested in testing a “weak” null hypothesis of the form
$$H_0: \theta(P_X) = \theta(P_Y),$$
where $\theta(\cdot)$ is some functional, parameter, etc. of the distribution. Furthermore, researchers typically wish to assign directional effects (such as concluding that $\theta(P_X) > \theta(P_Y)$) in addition to simply rejecting the null hypothesis. For instance, in the case of comparing two drugs, the null hypothesis may specify that the mean recovery times are identical between the two drugs: $\mu_X = \mu_Y$. Upon rejecting, the researcher would like to conclude either $\mu_X > \mu_Y$ or $\mu_X < \mu_Y$ so that recommendations for the more efficacious drug can be given. Merely knowing that there is a difference between the drugs but being unable to conclude which one is better would be unsatisfying.
While the permutation test is exact for the strong null hypothesis, this is not the case for the weak null. Depending on the test statistic used, the permutation test may not be valid (even asymptotically) for the weak null hypothesis: when only the weak null hypothesis is true, the rejection probability can be far larger than the specified level, whereas a valid statistical test requires it to be no larger than that level. This leads to a much higher Type I error rate than expected.
2.2. An Illustrative Example
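We use a simple example of comparing means to illustrate the problems with permutation tests. Suppose that $X_1, \dots, X_{n_X} \sim N(0, \sigma_X^2)$ and $Y_1, \dots, Y_{n_Y} \sim N(0, \sigma_Y^2)$ with $n_X / (n_X + n_Y) \to \lambda \in (0, 1)$. Suppose the test statistic used is $T = \sqrt{n_X + n_Y}\,(\bar{X} - \bar{Y})$. Note that the scaling is chosen to have a nondegenerate limiting distribution. The sampling distribution of $T$ is asymptotically normal with mean 0 and variance
$$\sigma_T^2 = \frac{\sigma_X^2}{\lambda} + \frac{\sigma_Y^2}{1 - \lambda}.$$
When permuting the data, samples for both groups are taken (without replacement) from the pooled data, and the permutation distribution behaves as though both samples were taken from the mixture distribution
$$\bar{P} = \lambda P_X + (1 - \lambda) P_Y,$$
where $P_X = N(0, \sigma_X^2)$ and $P_Y = N(0, \sigma_Y^2)$. The variance of this mixture distribution is
$$\bar{\sigma}^2 = \lambda \sigma_X^2 + (1 - \lambda) \sigma_Y^2.$$
Thus, for our chosen statistic, the permutation distribution is (asymptotically) normal with mean 0 and variance
$$\tau^2 = \frac{\bar{\sigma}^2}{\lambda} + \frac{\bar{\sigma}^2}{1 - \lambda} = \frac{\bar{\sigma}^2}{\lambda (1 - \lambda)}.$$
Under the strong null (specifying equality of distributions), $\sigma_X = \sigma_Y$ and the permutation distribution approximates the sampling distribution. However, under the weak null, it may be the case that $\sigma_X \neq \sigma_Y$ and consequently (unless $\lambda = 1/2$) $\tau^2 \neq \sigma_T^2$. Under the weak null, this means that the permutation distribution cannot be used as an approximation of the sampling distribution, and any inference based on the permutation distribution is therefore invalid.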
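To make the mismatch concrete, the sketch below evaluates the two limiting variances for a scaled difference of means and checks the sampling variance by simulation. It assumes zero-mean normal samples with standard deviations `sigma_x` and `sigma_y`, a statistic scaled by the square root of the total sample size, and a limiting fraction `lam` of observations in the first sample; all names are illustrative.

```python
import math
import random


def sampling_variance(sigma_x, sigma_y, lam):
    """Limiting variance of sqrt(n_x + n_y) * (mean(x) - mean(y))."""
    return sigma_x ** 2 / lam + sigma_y ** 2 / (1 - lam)


def permutation_variance(sigma_x, sigma_y, lam):
    """Limiting variance of the same statistic under permutation of pooled data."""
    mixture_variance = lam * sigma_x ** 2 + (1 - lam) * sigma_y ** 2
    return mixture_variance / lam + mixture_variance / (1 - lam)


def simulated_sampling_variance(n_x, n_y, sigma_x, sigma_y, reps=3000, seed=0):
    """Monte Carlo estimate of the sampling variance of the scaled statistic."""
    rng = random.Random(seed)
    scale = math.sqrt(n_x + n_y)
    stats = []
    for _ in range(reps):
        x_bar = sum(rng.gauss(0.0, sigma_x) for _ in range(n_x)) / n_x
        y_bar = sum(rng.gauss(0.0, sigma_y) for _ in range(n_y)) / n_y
        stats.append(scale * (x_bar - y_bar))
    centre = sum(stats) / reps
    return sum((s - centre) ** 2 for s in stats) / (reps - 1)
```

With `sigma_x = 5`, `sigma_y = 1` and `lam = 0.1`, the first formula gives roughly 251 while the second gives roughly 38, so a permutation test based on this statistic would use critical values that are far too small. The two formulas agree when the standard deviations are equal or when `lam = 0.5`.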
2.3. Valid Permutation Tests
Choosing a pivotal statistic (one that is asymptotically distribution-free, i.e., whose limiting distribution does not depend on the distribution of the observed data) can rectify the issue identified above. For instance, the sampling distribution of the Studentized statistic
$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2 / n_X + S_Y^2 / n_Y}},$$
where $S_X$ and $S_Y$ are the sample standard deviations, is asymptotically $N(0, 1)$. The statistic is pivotal because the asymptotic distribution does not rely on the distributions of the observed data. Since it is distribution-free, the permutation distribution of the Studentized statistic (which behaves as though the two groups were sampled from a distribution that is not necessarily the same as the underlying distributions of these two groups) is asymptotically standard normal as well. Typically, Studentizing the test statistic will give validity for testing with an asymptotically normal test statistic. If the chosen test statistic is asymptotically pivotal, the resulting permutation test can be expected to be asymptotically valid.
Note that even if we are interested in testing the strong null hypothesis, but wish to make directional conclusions, the directional errors can be quite large with the un-Studentized statistic. This can occur, for instance, when there is a small positive effect, and the difference of means is negative, but the test rejects due to a difference in standard deviations. The Studentized statistic ensures that the chance of a directional error is no larger than the nominal level.
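The effect of Studentizing can be checked with a small simulation. The sketch below is illustrative only: it assumes two zero-mean normal groups with unequal variances and unequal sizes (so only the weak null of equal means holds), and estimates how often a two-sided permutation test rejects at level 0.05 for the raw and the Studentized difference of means. The sample sizes, variances, and repetition counts are arbitrary choices kept small so that the code runs quickly.

```python
import math
import random


def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    centre = _mean(values)
    return sum((v - centre) ** 2 for v in values) / (len(values) - 1)


def mean_difference(x, y):
    return _mean(x) - _mean(y)


def studentized_mean_difference(x, y):
    standard_error = math.sqrt(_variance(x) / len(x) + _variance(y) / len(y))
    return (_mean(x) - _mean(y)) / standard_error


def permutation_p_value(x, y, statistic, num_permutations, rng):
    """Two-sided permutation p-value for the given statistic."""
    observed = abs(statistic(x, y))
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        if abs(statistic(pooled[:len(x)], pooled[len(x):])) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + num_permutations)


def null_rejection_rate(statistic, n_x=20, n_y=180, sigma_x=5.0, sigma_y=1.0,
                        alpha=0.05, reps=100, num_permutations=99, seed=0):
    """Fraction of simulated datasets (equal means) on which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sigma_x) for _ in range(n_x)]
        y = [rng.gauss(0.0, sigma_y) for _ in range(n_y)]
        if permutation_p_value(x, y, statistic, num_permutations, rng) <= alpha:
            rejections += 1
    return rejections / reps
```

Calling `null_rejection_rate(mean_difference)` and `null_rejection_rate(studentized_mean_difference)` should show the raw statistic rejecting far more often than 5% of the time in this configuration, while the Studentized statistic stays much closer to the nominal level; the exact figures vary with the seed and the small number of repetitions.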
3. Permutation Tests for Fairness
Based on the discussion above, let us now consider the problem of testing fairness of a machine learning model on two groups, say group A and group B. We may want to compare metrics such as the area under the receiver operating characteristic curve (AUC) between the two groups. In this setting, the permutation test is exact for testing the strong null hypothesis that the observed data (the features used by the model together with the outcome) arise from the same distribution in the two groups.
More concretely, suppose that the generated data for group A is $(X_1^A, Y_1^A), \dots, (X_{n_A}^A, Y_{n_A}^A)$ and the data for group B is $(X_1^B, Y_1^B), \dots, (X_{n_B}^B, Y_{n_B}^B)$, where the $X$'s are $p$-dimensional vectors of available features and the $Y$'s are response variables. The strong null hypothesis specifies
$$H_0: (X^A, Y^A) \overset{d}{=} (X^B, Y^B).$$
That is, the joint distributions of the features and the response are identical between the two groups. Surely if this holds, the model will be fair, but a permutation test may reject (or fail to reject) based on differences in the distribution of features rather than inherent fairness of the models. To further illustrate the issues of permutation testing for fairness, we will discuss tests of fairness based on several statistics for binary classifiers before presenting an algorithm for testing fairness for a general classification or regression problem.
3.1. Within Outcome Permutation
In the context of comparing binary classifiers, one can perform permutation tests under weaker assumptions by permuting data only within positive or negative labels. The distribution of binary classifier metrics typically depends on the number of observations in the positive and negative categories. A more generally applicable permutation test can therefore be performed by permuting the positive examples between groups A and B and separately permuting the negative examples between both groups. This method of permuting the data is valid under the null hypothesis
$$H_0: X^A \mid Y^A = 1 \overset{d}{=} X^B \mid Y^B = 1 \quad \text{and} \quad X^A \mid Y^A = 0 \overset{d}{=} X^B \mid Y^B = 0,$$
i.e., under the null hypothesis that the distribution of the features for the positive (and negative) labeled data is equal between the two groups. This is slightly more flexible than merely permuting labels in that it does not require that the proportion of positive and negative labels be the same between the groups, but it still retains exactness.
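A within-outcome permutation can be sketched as follows. The function below is a hypothetical helper: it shuffles group membership separately among the positive and the negative examples, which is equivalent to exchanging observations between the groups within each label while keeping each group's number of positives and negatives fixed.

```python
import random


def permute_groups_within_outcome(groups, labels, rng):
    """Shuffle group membership separately within each outcome label.

    `groups` and `labels` are parallel lists; the inputs are left unchanged.
    """
    permuted = list(groups)
    for outcome in set(labels):
        positions = [i for i, label in enumerate(labels) if label == outcome]
        shuffled = [groups[i] for i in positions]
        rng.shuffle(shuffled)
        for i, group in zip(positions, shuffled):
            permuted[i] = group
    return permuted
```

Recomputing the metric of interest on each such reassignment yields the permutation distribution for this weaker null hypothesis.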
3.2. Comparing AUC
In the context of testing for equality of AUC, the null hypothesis of interest is that the AUC of the model is equal between the two groups:
$$H_0: \mathrm{AUC}_A = \mathrm{AUC}_B.$$
For notational convenience, we use $X_i^{A,+}$, $i = 1, \dots, n_A^+$, and $X_j^{A,-}$, $j = 1, \dots, n_A^-$, for the observed features of group A corresponding to positive and negative labels respectively, with $n_A^+ + n_A^- = n_A$ (and similarly for group B). Assuming that the classifier of interest assigns a positive outcome if some function $f(\cdot)$ exceeds a threshold, the null hypothesis of equality of AUCs is equivalent to testing
$$H_0: P\big(f(X^{A,+}) > f(X^{A,-})\big) = P\big(f(X^{B,+}) > f(X^{B,-})\big).$$
This can be done by using the Mann-Whitney (MW) statistic,
$$T = \frac{1}{n_A^+ n_A^-} \sum_{i=1}^{n_A^+} \sum_{j=1}^{n_A^-} I\{f(X_i^{A,+}) > f(X_j^{A,-})\} - \frac{1}{n_B^+ n_B^-} \sum_{i=1}^{n_B^+} \sum_{j=1}^{n_B^-} I\{f(X_i^{B,+}) > f(X_j^{B,-})\}.$$
When using this statistic to perform a permutation test, the permutation distribution behaves as though both samples are taken from the mixture distribution
$$\bar{P} = \lambda P_A + (1 - \lambda) P_B,$$
where $P_A$ and $P_B$ are the distributions of the data in the two groups and $n_A / (n_A + n_B) \to \lambda$. In general, the permutation distribution of the statistic under this mixture distribution will not be equal to the sampling distribution when only the assumption of equality of AUCs holds. The variance of the sampling distribution (as stated in the next section) is dependent on the proportion of positive and negative examples, so if the two groups differ inherently in the number of positive and negative examples, the test will not be valid. Furthermore, using the permutation test will not be valid for making directional conclusions, which in this case means determining which group the model is biased towards.
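As an illustration, the Mann-Whitney form of the AUC and the resulting difference between two groups can be computed directly from model scores. The sketch below is a plain quadratic-time version; counting ties as one half is an assumption added here for robustness, and the function names are hypothetical.

```python
def mann_whitney_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(positive_scores) * len(negative_scores))


def auc_difference(scores_a, labels_a, scores_b, labels_b):
    """AUC of group A minus AUC of group B, with labels coded as 1 and 0."""
    def split(scores, labels):
        positives = [s for s, l in zip(scores, labels) if l == 1]
        negatives = [s for s, l in zip(scores, labels) if l == 0]
        return positives, negatives

    return (mann_whitney_auc(*split(scores_a, labels_a))
            - mann_whitney_auc(*split(scores_b, labels_b)))
```

This raw difference is the un-Studentized statistic; on its own it is subject to the validity problems described above.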
3.2.1. Validity of the Studentized Difference of AUCs
The MW test statistic has variance where
and are independently randomly selected feature vectors, , and is defined analogously. The asymptotic normality of the statistic is given by the following theorem.
Theorem 1.
Let denote the sampling distribution of the Mann-Whitney statistic under the null hypothesis. Then
where . Let denote the permutation distribution of the MW statistic. Then
in probability, where .
The permutation distribution behaves as though each sample was taken from the mixture distribution and may not approximate the sampling distribution. In particular, the variance of the permutation distribution is not necessarily the same as that of the sampling distribution.
These variances can be consistently estimated, for example by using DeLong’s method. The sampling distribution of the “Studentized” MW test statistic, which is normalized by a consistent estimator of its standard deviation, is asymptotically standard normal. Using the Studentized test statistic, the permutation distribution is asymptotically standard normal as well.
Theorem 2.
Let and denote the sampling distribution and permutation distribution, respectively, of the Mann-Whitney statistic normalized by a consistent estimate of its standard deviation. Then
in probability.
Because the Studentized statistic is asymptotically pivotal, the permutation distribution and the sampling distribution have the same limiting behavior, and hence the permutation distribution approximates the sampling distribution.
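One way to Studentize the difference of AUCs is sketched below, using the structural-components form of DeLong's variance estimator for each group and treating the two groups as independent. This is an illustrative sketch under those assumptions, not necessarily the exact estimator used for the results in this paper; it requires at least two positive and two negative examples per group and a nonzero variance estimate.

```python
import math


def auc_and_delong_variance(positive_scores, negative_scores):
    """Return the Mann-Whitney AUC and a DeLong-style estimate of its variance."""
    m, n = len(positive_scores), len(negative_scores)
    psi = [[1.0 if p > q else 0.5 if p == q else 0.0 for q in negative_scores]
           for p in positive_scores]
    v10 = [sum(row) / n for row in psi]                              # per positive
    v01 = [sum(psi[i][j] for i in range(m)) / m for j in range(n)]  # per negative
    auc = sum(v10) / m

    def sample_variance(values):
        centre = sum(values) / len(values)
        return sum((v - centre) ** 2 for v in values) / (len(values) - 1)

    return auc, sample_variance(v10) / m + sample_variance(v01) / n


def studentized_auc_difference(pos_a, neg_a, pos_b, neg_b):
    """Difference of group AUCs divided by its estimated standard deviation."""
    auc_a, var_a = auc_and_delong_variance(pos_a, neg_a)
    auc_b, var_b = auc_and_delong_variance(pos_b, neg_b)
    return (auc_a - auc_b) / math.sqrt(var_a + var_b)
```

In a permutation test, this statistic would be recomputed (including the variance estimate) on every permuted dataset.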
3.3. Proportion Statistics
Although permutation tests are typically valid for comparing proportions, permutation tests may be problematic for binary classification metrics which appear to be measuring proportions (such as true or false positive rate, sensitivity, specificity, etc.). A simple illustrative, classical example is comparing two Bernoulli proportions between independent samples. In this case, the proportion uniquely specifies the distribution of the random variables, and the null hypothesis of equality of distributions is equivalent to testing equality of proportions. In this case, the usual permutation test is valid.
Binary classification metrics such as false positive (or negative) rate, sensitivity, specificity, etc., which at face value seem to be a straightforward extension of the two-proportion z-test, can be problematic. We will focus our discussion on the false-negative rate; obvious extensions hold for other metrics.
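Suppose that the decision of the classifier is given by a function $g$ (so that $g(X) = 1$ if an observation with features $X$ is classified as a positive example and $g(X) = 0$ if the observation is classified as a negative example). Define
$$\widehat{\mathrm{FNR}}_A = \frac{\sum_{i=1}^{n_A} I\{g(X_i^A) = 0,\ Y_i^A = 1\}}{\sum_{i=1}^{n_A} I\{Y_i^A = 1\}}$$
to be the (empirical) false negative rate for group A, and define $\widehat{\mathrm{FNR}}_B$ analogously for group B. In the above notation, the difference of false negative rates is $T = \widehat{\mathrm{FNR}}_A - \widehat{\mathrm{FNR}}_B$. Scaled by, say, $\sqrt{n_A + n_B}$, and assuming that for each group the proportion of positive examples converges to $p_A$ and $p_B$ respectively, the asymptotic distribution of the statistic under the null hypothesis of a common false negative rate $q$ is
$$N\!\left(0,\ \frac{\sigma_A^2}{\lambda} + \frac{\sigma_B^2}{1 - \lambda}\right),$$
where $\sigma_A^2 = q(1 - q)/p_A$ and $\sigma_B^2 = q(1 - q)/p_B$, with $n_A / (n_A + n_B) \to \lambda$.
When permuting labels of the data, the number of positive examples assigned to group A or group B will differ for each permutation. Heuristically, the permutation distribution behaves as though each group was sampled from the mixture distribution between the two groups. In particular, the permutation distribution approximates the sampling distribution when the proportion of positive examples is the same between the two groups. The permutation distribution is asymptotically normal with mean 0 and variance
$$\frac{\bar{\sigma}^2}{\lambda} + \frac{\bar{\sigma}^2}{1 - \lambda},$$
where $\bar{\sigma}^2 = q(1 - q)/\bar{p}$ and $\bar{p} = \lambda p_A + (1 - \lambda) p_B$. In general, this variance will not be equal to the asymptotic variance of the sampling distribution and the permutation test may fail to be valid.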
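A general fix for permutation tests is to again use an asymptotically pivotal test statistic (one whose asymptotic distribution does not depend on the distributions generating the two groups). In this example, using a Studentized test statistic
$$T = \frac{\widehat{\mathrm{FNR}}_A - \widehat{\mathrm{FNR}}_B}{\sqrt{\widehat{\mathrm{FNR}}_A (1 - \widehat{\mathrm{FNR}}_A)/n_A^+ + \widehat{\mathrm{FNR}}_B (1 - \widehat{\mathrm{FNR}}_B)/n_B^+}}$$
(where hats denote the usual sample proportions and $n_A^+$, $n_B^+$ are the numbers of positive examples in the two groups) will cure the problem. When using this Studentized statistic, both the permutation distribution and the sampling distribution are asymptotically standard normal, which ensures the test is asymptotically valid.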
Theorem 3.
Assume that for both groups the probability of observing a positive example is nonzero, and the probability of correctly classifying these examples is bounded away from 0 and 1. Denote by and the permutation distribution of the Studentized statistic and the sampling distribution under the null hypothesis, respectively, for a sample of size from group A and from group B. Then,
almost surely.
While this result pertains to the false-negative rate, permutation tests based on Studentized statistics are generally valid. Verifying the validity of other proportion-based binary classification metrics is similar to the example of the false-negative rate.
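A Studentized difference of false-negative rates can be computed as in the following sketch. It assumes labels and predictions coded as 1 (positive) and 0 (negative), and uses the usual binomial variance estimate for each group's rate based on its number of positive examples; the function names are illustrative. The statistic is undefined when both empirical rates are exactly 0 or 1.

```python
import math


def false_negative_rate(y_true, y_pred):
    """Return (empirical FNR, number of positive examples) for one group."""
    predictions_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    misses = sum(1 for p in predictions_on_positives if p == 0)
    return misses / len(predictions_on_positives), len(predictions_on_positives)


def studentized_fnr_difference(y_true_a, y_pred_a, y_true_b, y_pred_b):
    """Difference of group FNRs divided by its estimated standard deviation."""
    fnr_a, n_pos_a = false_negative_rate(y_true_a, y_pred_a)
    fnr_b, n_pos_b = false_negative_rate(y_true_b, y_pred_b)
    variance = fnr_a * (1 - fnr_a) / n_pos_a + fnr_b * (1 - fnr_b) / n_pos_b
    return (fnr_a - fnr_b) / math.sqrt(variance)
```

The same pattern applies to other proportion-type metrics such as the false-positive rate, with the conditioning set changed from the positive to the negative examples.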
3.4. A General Algorithm
The examples presented above demonstrate the need for Studentizing statistics in binary classification problems. We now provide a general algorithm for performing permutation tests of fairness that is applicable to general classification problems, as well as regression problems.
In the examples presented above, closed-form, consistent variance estimators of the test statistic are easily computed. In examples where finding a variance estimator is difficult, a bootstrap variance estimator can be used. Suppose that data $Z_1^A, \dots, Z_{n_A}^A$ are observed from group A and $Z_1^B, \dots, Z_{n_B}^B$ are observed from group B. For a fixed large integer $B$, if we resample with replacement $n_A$ observations from the data of group A and $n_B$ observations from the data of group B for $b = 1, \dots, B$, then the bootstrap estimate of the variance of a statistic $T$ is
(1) $\hat{\sigma}^2 = \frac{1}{B - 1} \sum_{b=1}^{B} \left(T_b^* - \bar{T}^*\right)^2,$
where $T_b^*$ is the statistic computed on the $b$-th pair of resamples, $\bar{T}^*$ is the average of the $T_b^*$, and $B$ denotes the number of bootstrap trials. Whether the test statistic is asymptotically pivotal or needs to be Studentized, a valid permutation test can be performed according to the following algorithm.
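The procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a verbatim rendering of Algorithm 1: the metric is any function of a list of per-example records, the statistic is the difference of the metric between the groups divided by a bootstrap estimate of its standard deviation, the same Studentization is repeated on every permuted dataset, and a two-sided p-value is reported. All names and default trial counts are hypothetical.

```python
import math
import random


def bootstrap_std(data_a, data_b, metric, num_bootstrap, rng):
    """Bootstrap standard deviation of metric(A) - metric(B)."""
    diffs = []
    for _ in range(num_bootstrap):
        resample_a = rng.choices(data_a, k=len(data_a))   # with replacement
        resample_b = rng.choices(data_b, k=len(data_b))
        diffs.append(metric(resample_a) - metric(resample_b))
    centre = sum(diffs) / num_bootstrap
    return math.sqrt(sum((d - centre) ** 2 for d in diffs) / (num_bootstrap - 1))


def studentized_statistic(data_a, data_b, metric, num_bootstrap, rng):
    difference = metric(data_a) - metric(data_b)
    # Assumes the bootstrap standard deviation is nonzero.
    return difference / bootstrap_std(data_a, data_b, metric, num_bootstrap, rng)


def fairness_permutation_test(data_a, data_b, metric,
                              num_bootstrap=100, num_permutations=200, seed=0):
    """Return (observed Studentized statistic, two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = studentized_statistic(data_a, data_b, metric, num_bootstrap, rng)
    pooled = list(data_a) + list(data_b)
    n_a = len(data_a)
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        permuted = studentized_statistic(pooled[:n_a], pooled[n_a:], metric,
                                         num_bootstrap, rng)
        if abs(permuted) >= abs(observed):
            exceed += 1
    return observed, (1 + exceed) / (1 + num_permutations)


def accuracy(rows):
    """Example metric: share of (true label, predicted label) pairs that agree."""
    return sum(1 for y_true, y_pred in rows if y_true == y_pred) / len(rows)
```

For example, `fairness_permutation_test(rows_a, rows_b, accuracy)` compares accuracy between two lists of `(true label, predicted label)` pairs. The nested bootstrap makes the cost grow with the product of the two trial counts, so a closed-form variance estimator is preferable whenever one is available.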
4. Simulation Study
Most papers on fairness in machine learning focus on a single definition of fairness, and many adjust model training in order to reduce unfairness (Agarwal et al., 2018; Mary et al., 2019). We could not find any literature on statistically testing whether a model is fair other than FairTest (Tramèr et al., 2017). Through simulations, we compare our methodology to FairTest and illustrate some of the problems we are able to cure. Moreover, we show detailed simulations to demonstrate the need for Studentizing the test statistic and compare it to the bootstrap method.
4.1. Comparison With FairTest
FairTest (Tramèr et al., 2017) uses a permutation test based on Pearson’s correlation statistic. We demonstrate the issues of the permutation test as implemented in FairTest (which is based on an un-Studentized test statistic), both for testing correlation (which is the stated use) and independence between a protected attribute and model error. The permutation test implemented by FairTest is neither valid for testing the correlation between a protected attribute and model error nor very powerful for testing independence. To demonstrate these issues, we must know the ground truth of the model we are using, so we prefer to use simulated data rather than experimental data.
We first provide an example demonstrating that FairTest’s implementation is not a valid test for their stated use, but that Algorithm 1 provides a valid test. Suppose the protected attribute, is generated as a uniform random variable (bounded away from zero to avoid dividing by values near 0)
and the prediction error for a model is given as
(2) 
Such a setting may be a reasonable approximation to the protected attribute being a normalized age variable. In this model, the protected attribute and model error are uncorrelated, although the model error depends heavily on the protected attribute, so we do not have independence. Therefore, a test of uncorrelatedness at, say, the 5% nominal level (i.e., a test at a 5% level of significance) should have a null rejection probability of 0.05 in this setting, since the null hypothesis is indeed true. Table 1 reports a Monte Carlo approximation to the null rejection probability, obtained by generating 2,000 samples of the protected attribute and model errors and performing a permutation test using FairTest’s implementation (based on Pearson’s correlation) and our methodology (from Algorithm 1, based on the Studentized Pearson’s correlation). The reported probability of rejecting the null hypothesis is based on 1,000 permutations of the data and averages the test decisions over 10,000 such simulations.
Table 1. Rejection probability, by null hypothesis.

Null Hypothesis    Algorithm 1    FairTest    Desired
Uncorrelated       0.0428         0.7508      0.05
Independence       0.0501         0.0099      0.05
In this example of testing for uncorrelatedness, the rejection probability of either testing methodology should be equal to (or at least below) the nominal level (0.05) since the null hypothesis is indeed true. We find the null rejection probability for FairTest is dramatically above the nominal level, reaffirming that the test is not valid for testing uncorrelatedness, whereas the test using Algorithm 1 has a null rejection probability close to the nominal level.
Even if the practitioner desires to test the null hypothesis of independence between the protected attribute and model error, the test implemented by FairTest is “biased,” a statistical term meaning that the rejection probability can be below the nominal level of the test for some alternatives. For instance, if the test is performed at the 5% level, the rejection probability can be dramatically below 0.05 when the null hypothesis of independence is not true. Practically, this means that the power to detect unfairness in a dataset can be significantly worse than random guessing. It is also important to note that even in web-scale settings, where the issue is often determining practical significance rather than statistical significance, the naive permutation test may fail to detect bias because of the lack of power illustrated here.
To give a concrete example, suppose that the protected attribute is generated as an exponential distribution with rate parameter one (plus one to avoid the instability of dividing by values near zero),
and the prediction error for a model is again given as (2).
Performing a Monte Carlo simulation analogous to the one used for testing uncorrelatedness, we obtain the rejection probability of the test for independence given in Table 1. As we can see, the rejection probability for FairTest is substantially below the nominal level of 5% (despite the fact that the attribute and error are dependent), whereas the rejection probability using Algorithm 1 equals the nominal level. While the test using Algorithm 1 is not powerful in this setting, it is at least unbiased, which is a first-order requirement for a reasonable statistical test. The lack of power comes from the choice of test statistic, which can easily be changed if the practitioner desires power to detect dependence among uncorrelated variables.
Depending on the hypothesis of interest, the permutation testing implementation given in FairTest can be either invalid or biased. On the other hand, Algorithm 1 provides a test that is valid for correlation and unbiased for independence.
4.2. Comparison of Permutation Methods And The Bootstrap
We consider a case of comparing the false-negative rate of group A with group B. For group A, samples are generated, with an chance of observing a positive outcome and a chance of observing a negative outcome. For group B, samples are generated, with a chance of observing a positive outcome and an chance of observing a negative outcome. For both groups, the classifier has a true positive (and true negative) rate of . The classifier is fair in the sense that the false-negative rate is equal between the two groups. For simulations, data is generated in this manner and a permutation p-value is obtained using both the Studentized and un-Studentized statistics. Figures 1(a) and 1(b) give histograms of the p-values using the un-Studentized and Studentized statistic, respectively. Note that the p-values should be uniform, and the p-values using the Studentized statistic are much closer to the uniform distribution. At nominal level , the rejection probability using the un-Studentized statistic is (very anti-conservative) and the rejection probability using the Studentized statistic is (very nearly exact).
Keeping in mind the goal of providing a metric-agnostic system for inference regarding fairness, another natural choice of methodology would be to implement a bootstrap. In this case, we compare our approach with the “basic” bootstrap, implemented as follows:

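Uniformly resample, with replacement, $X_1^*, \dots, X_{n_A}^*$ from the observations of group A, and similarly take a uniform sample, with replacement, from group B.

If $T$ is a test statistic of interest, approximate the distribution of $T - \theta$ (with $\theta$ the population value that $T$ estimates) using the distribution of $T^* - T$, where $T^*$ is computed on the resampled data.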
The basic bootstrap has the advantage that the statistic need not be Studentized; however, it has no guarantees of exactness. In the simulation setting described above, the null rejection probability using the un-Studentized difference of false-negative rates is . The distributional approximation using the bootstrap is considerably worse than the permutation test based on the Studentized statistic, so we recommend using a permutation test over a bootstrap (see Figure 1(c)).
5. RealWorld Experiments
The permutation testing framework as described above was implemented in Scala, to work with machine learning pipelines that make use of Apache Spark (Apache Spark Team, 2014). The framework supports plugging in arbitrary metrics or statistics whose difference is to be compared, such as precision, recall or AUC. To Studentize the observed difference of the statistic between the two groups, we need to estimate its standard deviation. We achieve this by performing a bootstrap to obtain the distribution of these differences and computing an unbiased estimate of the variance, from which we obtain the standard deviation.
To studentize the differences obtained during the permutation trials, we make use of the standard deviation of the permutation distribution itself, rather than obtaining a bootstrap estimate for each trial. The estimates obtained through either method are approximately equal, and making use of the former dramatically reduces the runtime of our algorithm. This gives us a total time complexity of instead of a higher time complexity of for not much gain ( is the number of bootstrap trials, is the number of permutation trials, is the sample size considered, and the statistic computation is assumed to have a time complexity of ).
Experimental Setup: We performed our experiments on the ProPublica COMPAS dataset (Larson et al., 2016) (used for recidivism prediction and informing bail decisions) and the Adult dataset from the UCI Machine Learning Repository (Kohavi, 1996) (used for predicting income). The COMPAS dataset contains records, with the labels indicating whether a criminal defendant committed a crime within two years or not. The Adult dataset contains records, with the labels specifying whether an individual makes over $50K a year.
Both datasets were divided into an approximate train-validation-test split. We made use of all the features available except gender and race, which we treated as protected attributes. The numerical features were used as-is, while the categorical features were one-hot encoded. We also ignored the ‘final weight’ feature in the Adult dataset. We then trained a logistic regression model with regularization on each of these datasets, producing final models with a test AUC of for the COMPAS dataset, and a test AUC of for the Adult dataset.
Definitions of Fairness: Let the classifier be defined by the function $h: \mathcal{X} \to \{0, 1\}$, where $x \in \mathcal{X}$ is the input data point and the output $h(x)$ is the predicted label. The labels of the data points are given by the function $y(x)$, and the protected attributes are defined by the function $a(x) \in \mathcal{A}$ ($\mathcal{A}$ is the set of protected attribute values). Let $TP$ be the number of True Positives produced by the classifier and $P$ be the number of positive labels (we treat 1 as the positive and 0 as the negative label here). Using these notational conventions, some common metrics to assess fairness are defined below:

A classifier is said to have achieved Equalized Odds if, for all $a_1, a_2 \in \mathcal{A}$ and $y \in \{0, 1\}$,
$$\Pr\big(h(x) = 1 \mid a(x) = a_1,\ y(x) = y\big) = \Pr\big(h(x) = 1 \mid a(x) = a_2,\ y(x) = y\big).$$
Defining Equalized Odds Distances as:
$$D_y(a_1, a_2) = \big|\Pr\big(h(x) = 1 \mid a(x) = a_1,\ y(x) = y\big) - \Pr\big(h(x) = 1 \mid a(x) = a_2,\ y(x) = y\big)\big|,$$
we see that Equalized Odds can equivalently be defined as $D_y(a_1, a_2) = 0$ for all $a_1, a_2 \in \mathcal{A}$ and $y \in \{0, 1\}$.
We thus make use of the $D_y$'s as metrics for fairness.

Recall (or True Positive Rate, TPR) is defined as $TP / P$.
Thus the (absolute) difference in recall values between two protected groups $a_1$ and $a_2$ is nothing but $D_1(a_1, a_2)$ from above. We make use of this for performing permutation tests.

False Positive Rate (FPR) is defined as $FP / N$, where $FP$ is the number of False Positives and $N$ is the number of negative labels.
Thus the (absolute) difference in FPR values between two protected groups $a_1$ and $a_2$ is nothing but $D_0(a_1, a_2)$ from above. We make use of this for performing permutation tests.
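As a concrete illustration of these metrics, the sketch below computes the recall and FPR gaps between two protected groups from parallel lists of true labels, predicted labels, and group memberships. Labels are assumed to be coded as 1 and 0, the gaps are reported as absolute differences, and the function names are hypothetical.

```python
def positive_prediction_rate(y_true, y_pred, groups, group, label):
    """Share of examples in `group` with true label `label` predicted positive."""
    preds = [p for t, p, g in zip(y_true, y_pred, groups)
             if g == group and t == label]
    return sum(1 for p in preds if p == 1) / len(preds)


def equalized_odds_distances(y_true, y_pred, groups, group_a, group_b):
    """Absolute gaps in recall (label 1) and FPR (label 0) between two groups."""
    distances = {}
    for label, name in ((1, "recall"), (0, "fpr")):
        rate_a = positive_prediction_rate(y_true, y_pred, groups, group_a, label)
        rate_b = positive_prediction_rate(y_true, y_pred, groups, group_b, label)
        distances[name] = abs(rate_a - rate_b)
    return distances
```

Both gaps being zero corresponds to Equalized Odds holding exactly on the sample; a permutation test then asks whether a nonzero gap is larger than chance alone would produce.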
5.1. Empirical Analyses
The first empirical analysis compares the output of the permutation test with conventional fairness metrics. Specifically, we focus on performing a permutation test for the Recall (TPR) and the False Positive Rate (FPR) and compare this with the Equalized Odds fairness metric.
We consider to be the gender of the individual, comprised of two elements, Male (M) and Female (F). The classifier threshold is varied from to , and both the COMPAS (about uniformly random samples) and Adult (about uniformly random samples) test datasets are classified to measure and . We also ran permutation tests for FPR and Recall for each value of , using permutation trials and a significance level of to reject the null hypothesis. Figure 2 shows the resulting graphs, depicting both the Equalized Odds distance as well as the 95th percentile of the permutation distribution (values greater than this are rejected by the test for our chosen significance level).
Increasing the threshold $t$ reduces both the overall Recall and FPR values for the resulting classifier. When $t$ equals $0$ or $1$, perfect Equalized Odds is achieved because all examples are classified as positives or negatives respectively (the Recall and FPR values are then equal for all protected groups). However, intermediate values of $t$ result in varying degrees of Equalized Odds unfairness. With the distance alone, it is up to the end user to judge whether this difference is large enough to warrant addressing or is just a statistical anomaly. The permutation test, in contrast, makes a statistically sound decision, deeming a difference to be unfair only when it crosses the 95th percentile of the permutation distribution for a given value of $t$.
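The decision rule described here can be sketched as follows. This is a plain, unstudentized permutation test on the absolute Recall difference, obtained by shuffling group membership; it is an illustrative assumption and not necessarily the exact test statistic used in the paper. The function names, the default `n_trials=1000`, and the synthetic data are all hypothetical.

```python
import numpy as np

def recall(pred, label):
    """Fraction of positively labeled points that are predicted positive."""
    return pred[label == 1].mean()

def recall_gap(pred, label, is_a):
    """Absolute difference in Recall between group A and its complement."""
    return abs(recall(pred[is_a], label[is_a])
               - recall(pred[~is_a], label[~is_a]))

def permutation_test(pred, label, is_a, n_trials=1000, alpha=0.05, seed=0):
    """Return (observed gap, (1 - alpha) permutation quantile, p-value)."""
    is_a = np.asarray(is_a, dtype=bool)
    rng = np.random.default_rng(seed)
    observed = recall_gap(pred, label, is_a)
    null = np.empty(n_trials)
    for i in range(n_trials):
        # Shuffle group membership while keeping predictions and labels fixed.
        null[i] = recall_gap(pred, label, rng.permutation(is_a))
    cutoff = np.quantile(null, 1 - alpha)
    # Proportion of trials at least as extreme as the observed gap.
    p_value = np.mean(null >= observed)
    return observed, cutoff, p_value

# Synthetic example: all labels positive, Recall 0.9 in group A and 0.5 in group B.
rng = np.random.default_rng(1)
n = 2000
label = np.ones(2 * n, dtype=int)
is_a = np.arange(2 * n) < n
draws = rng.random(2 * n)
pred = np.where(is_a, draws < 0.9, draws < 0.5).astype(int)

observed, cutoff, p_value = permutation_test(pred, label, is_a, n_trials=200)
print(f"observed gap={observed:.3f}, 95th percentile={cutoff:.3f}, p={p_value:.3f}")
```

The observed gap is flagged only when it exceeds the `(1 - alpha)` quantile of the permuted gaps, which mirrors the comparison against the 95th percentile described above. Very small groups can leave a permuted group with no positive labels, a case this sketch does not handle.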
Recall that the test statistic being computed is the difference in an aggregate metric computed for each subset of the data (resulting from a partitioning of the data into two subsets). Let $A$ and $B$ be the resulting partitions comprised of the random variables $X_1, \dots, X_{n_A}$ and $Y_1, \dots, Y_{n_B}$ respectively, with sample sizes $n_A$ and $n_B$. Let the aggregate metric be given by $\bar{m}$, with the metric at the individual level being $m(\cdot)$, so that the test statistic is $T = \bar{m}_A - \bar{m}_B$. For simplicity, let us consider $\bar{m}$ to be a proportion statistic, but we can extend this reasoning to other metrics as well. It is well known that, for independent samples, the variance of the test statistic is $\operatorname{Var}(T) = \sigma_A^2 / n_A + \sigma_B^2 / n_B$, where $\sigma_A^2$ and $\sigma_B^2$ are the variances of the individual-level metric in each partition (of the form $p(1 - p)$ for a proportion). Hence, as the sample sizes $n_A$ and $n_B$ increase, the variance of $T$ decreases. Consequently, as the sample size increases, the permutation test is able to reject the null hypothesis for smaller differences at a fixed significance level.
In the second experiment, we looked into the effect of sample size on permutation testing, varying it from around to . For the Adult dataset, we were able to sample the test dataset directly, but for the COMPAS dataset, we had to take uniformly random samples from the training data due to insufficient test data points. For each choice of sample size, we varied the classifier threshold $t$ from $0$ to $1$ to obtain the minimum difference in Recall and FPR that was detected by the permutation test at a significance level of , with each test being run for permutation trials. Figure 3 shows the resulting graph, from which we can conclude that an increase in the sample size allows smaller differences to be detected with statistical significance.
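The relationship between sample size and the smallest detectable difference can be illustrated with a rough normal approximation. The sketch below assumes a proportion metric with a common rate `p` in both groups under the null hypothesis, a two-sided test, and equal group sizes for the inverse calculation; these assumptions, along with the function names, are introduced here for illustration and are not taken from the paper.

```python
from math import ceil, sqrt
from statistics import NormalDist

def min_detectable_difference(p, n_a, n_b, alpha=0.05):
    """Approximate rejection threshold for a difference of two proportions.

    Assumes the difference is roughly normal under the null hypothesis with
    variance p * (1 - p) * (1 / n_a + 1 / n_b), and a two-sided test.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))

def required_group_size(p, delta, alpha=0.05):
    """Smallest equal group size whose rejection threshold is at most delta."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(2 * p * (1 - p) * (z / delta) ** 2)

# The threshold shrinks roughly like 1 / sqrt(n) as the groups grow.
for n in (100, 1000, 10000):
    print(n, round(min_detectable_difference(0.5, n, n), 4))

print("group size needed to flag a gap of 0.05:", required_group_size(0.5, 0.05))
```

This only approximates the point at which an observed difference would be rejected; it says nothing about power, and a permutation test on real data will generally deviate from the normal approximation for small samples.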
Conversely, given a minimum difference to be detected, one can compute the sample size to be used for the permutation test. For example, suppose that , the test statistic follows a normal distribution, and we wish to identify the sample size such that the permutation test rejects at a significance level of . Since is approximately the th percentile, if we can estimate , we can use the equation for to obtain an estimate for and . As an example, working with the false negative rate example from Section 3.3, we can sample multiple , score them with the model and check if it is a false negative, thereby providing us with samples of the indicator function to estimate with.

Another effect of increasing the sample size is a reduction in the standard error of the p-value estimation (and consequently, a smaller confidence interval). Under the null hypothesis, the permutation test can be treated as statistically independent trials for which the probability of observing an extreme result remains the same. Thus, we can estimate the p-value as a binomial proportion, dividing the number of extreme trials (those resulting in values at least as extreme as the observed difference) by the total number of trials. We can also estimate the confidence interval and standard deviation of our estimate by making use of techniques such as those described in (Agresti and Coull, 1998; Wilson, 1927). There is also work (Brown et al., 2001) that compares these estimates and makes recommendations, but the common factor between these estimates is that they are inversely proportional to the number of trials raised to some power, indicating that our estimates of the p-value have lower standard errors and smaller confidence intervals if the number of trials is increased.

6. Related Work
There is extensive literature on different notions of fairness. (Speicher et al., 2018) proposes using inequality indices from economics, namely the Generalized Entropy Index (GEI), to measure how model predictions unequally benefit different groups. It also allows for the decomposition of the fairness metric into between-group and within-group components, to better understand where the source of inequality truly lies. Conditional Equality of Opportunity is another metric, proposed in (Beutel et al., 2019) as a technique to account for distributional differences. This builds on the notion of Conditional Parity (Ritov et al., 2017), which discusses fairness constraints conditioned on certain attributes as more general notions of fairness criteria like Demographic Parity and Equalized Odds. Conditional Equality quantifies this fairness criterion as a weighted sum (over the conditional attribute values) of individual conditional attribute deviations, with the weights being defined by how much importance certain attribute values are to be given over others. Although these metrics quantify the amount of ‘unfairness’ in an algorithm, they deem any nonzero value to be ‘unfair’. These metrics by themselves are insufficient to declare an algorithm to be unfair; we need statistically sound techniques, such as permutation tests, to reject the null hypothesis of fairness. There are numerous open-source packages computing fairness metrics including IBM’s AI Fairness 360 (https://aif360.mybluemix.net), Google’s MLFairnessGym (D’Amour et al., 2020), Themis (Galhotra et al., 2017), and FairTest (Tramèr et al., 2017), though many do not incorporate formal hypothesis testing. Permutation tests have been used to assess the performance of predictive models (e.g. (Ojala and Garriga, 2010)). Further, robust permutation tests for two-sample problems have been proposed in (Chung and Romano, 2013).
We are not aware of any related work that established the validity of permutation testing for assessing fairness.
7. Conclusion
There are many aspects of algorithmic fairness that are captured by various metrics and definitions. No single metric captures all aspects of fairness, and we would encourage a practitioner to evaluate fairness along multiple metrics to better understand where biases may be present. For this purpose, our contribution is to provide a methodology to assess the strength of evidence that a model may be unfair with respect to any metric a researcher may be interested in. The framework for permutation testing proposed in this paper provides a flexible, nonparametric approach to assessing fairness, reducing the practitioner’s burden of performing a statistical test to merely specifying a test statistic. Moreover, the framework attempts to close the gap left by the lack of a formal statistical test for detecting unfairness.
We demonstrated the performance of permutation testing through extensive experiments on two real-world datasets known to exhibit bias. An interesting aspect of the results is that a classifier exhibits detectable bias only for certain values of the threshold. Moreover, the values of the threshold for which bias was detectable depended on the metric under consideration. This reinforces the need to experiment with multiple definitions of fairness while attempting to determine if a model is biased. Testing across multiple metrics is greatly simplified by the use of our nonparametric testing framework. We also showed that our framework provides a better distributional approximation than the bootstrap.
Although the discussion in this paper focused on binary classification problems, mainly for simplicity of exposition and brevity, we remark that our methodology is also applicable to most other supervised learning problem settings.
References
A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60–69.
Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician 52 (2), pp. 119–126.
Machine bias. ProPublica.
Apache Spark: A fast and general engine for large-scale data processing. Note: https://spark.apache.org, last accessed on 2019-09-10.
Fairness in machine learning. In NIPS Tutorial.
Putting fairness principles into practice: challenges, metrics, and improvements. arXiv preprint arXiv:1901.04562.
Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS.
Interval estimation for a binomial proportion. Statistical Science, pp. 101–117.
Semantics derived automatically from language corpora contain human-like biases. Science 356 (6334).
Exact and asymptotically robust permutation tests. Ann. Statist. 41 (2), pp. 484–507.
Asymptotically valid and exact permutation tests based on two-sample U-statistics. Journal of Statistical Planning and Inference 168, pp. 97–105.
Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, pp. 525–534.
Fairness through awareness. In ITCS.
Fairness testing: testing software for discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 498–510.
Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Series in Statistics, Springer.
Algorithmic bias: from discrimination discovery to fairness-aware data mining. In KDD Tutorial on Algorithmic Bias.
Unequal representation and gender stereotypes in image search results for occupations. In CHI.
Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In KDD.
Data and analysis for ‘how we analyzed the COMPAS recidivism algorithm’. Note: https://github.com/propublica/compas-analysis, last accessed on 2019-09-10.
Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391.
Permutation tests for studying classifier performance. Journal of Machine Learning Research 11, pp. 1833–1863.
Measuring discrimination in socially-sensitive decision records. In SDM.
On conditional parity as a notion of non-discrimination in machine learning. arXiv preprint arXiv:1706.08519.
A unified approach to quantifying algorithmic unfairness: measuring individual & group unfairness via inequality indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2239–2248.
FairTest: discovering unwarranted associations in data-driven applications. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 401–416.
Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22 (158), pp. 209–212.
Appendix
Here we present the proofs of the results in the main text.
7.1. Proof of Theorem 1
To find the asymptotic distribution of the difference of AUCs, we will first find the asymptotic distribution of
Write for the common AUC of group A and B. Define
with and
Then, the multivariate central limit theorem for U-statistics gives
where has entries
and
with
and
Applying the delta method to , which has relevant gradient
gives that
Noting that the estimated AUC for group B is independent of that of group A yields the limiting distribution for the sampling distribution given in Theorem 1.
To derive the asymptotic distribution of the permutation distribution, write . The obvious multivariate extension of the limiting results for the permutation distribution of U-statistics given in (Chung and Romano, 2016) yields that the permutation distribution of
is asymptotically normal, in probability, with mean and variance
in probability, where is given by the same expression as had group A been sampled from distribution . Following the same delta method calculation as for the sampling distribution gives the desired asymptotic distribution of the permutation distribution.
7.2. Proof of Theorem 2
The results of Theorem 2 follow immediately from Slutsky’s Theorem.
7.3. Derivation of limiting distribution of FNR
For further compactness, write . The false negative rate of group A can be written as
where we define and to be the numerator and denominator of the preceding quantity. Define and analogously for group B. We wish to study the limiting behavior of
Since group A and group B are independent, it is enough to establish the asymptotic normality of
(and the same quantity for group B). Finding the limiting distribution is a routine application of the delta method. Assume and . Then,
where
Applying the delta method with the function (which has gradient ) gives that
where
Simple matrix algebra yields
Assuming , Slutsky’s theorem gives
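As a hedged, worked illustration of the delta-method step used in this derivation, consider a single group of size $n$. The notation below is introduced only for this illustration and is not taken from the original: $I_i$ indicates that observation $i$ is a false negative, $J_i$ indicates that its label is positive, $q = E[I_i]$, $p = E[J_i] > 0$, and the false negative rate is $r = q/p$. Since a false negative always has a positive label, $I_i J_i = I_i$ and hence $\operatorname{Cov}(I_i, J_i) = q(1 - p)$.

```latex
\begin{align*}
\sqrt{n}\left( \begin{pmatrix} \bar{I} \\ \bar{J} \end{pmatrix}
  - \begin{pmatrix} q \\ p \end{pmatrix} \right)
  &\xrightarrow{d} N(0, \Sigma),
  \qquad
  \Sigma = \begin{pmatrix} q(1-q) & q(1-p) \\ q(1-p) & p(1-p) \end{pmatrix}, \\
g(x, y) &= \frac{x}{y},
  \qquad
  \nabla g(q, p) = \left( \frac{1}{p},\; -\frac{q}{p^{2}} \right)^{\top}, \\
\nabla g^{\top} \Sigma \, \nabla g
  &= \frac{q(1-q)}{p^{2}} - \frac{2 q^{2} (1-p)}{p^{3}} + \frac{q^{2} (1-p)}{p^{3}}
   = \frac{q (p - q)}{p^{3}}
   = \frac{r (1 - r)}{p}, \\
\sqrt{n}\left( \frac{\bar{I}}{\bar{J}} - r \right)
  &\xrightarrow{d} N\!\left( 0, \frac{r(1-r)}{p} \right).
\end{align*}
```

Under these assumptions the limiting variance is that of a binomial proportion computed on roughly $np$ positively labeled observations, which is a useful sanity check on the matrix algebra.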
7.4. Proof of Theorem 3
Suppose that is a uniformly chosen permutation. We wish to study the asymptotic behavior of , the statistic computed on the permuted data.
Suppose that is the combined feature and label data for groups A and B indexed in no particular order and are the corresponding classifications. We can write the difference of false negative proportions (scaled by ) as
where
,
and .
Let be the set of observations satisfying the following conditions.
We begin by deriving the distribution of conditional on . Write