Recent research on fairness has formulated interesting new perspectives on machine learning methodologies and their deployment, through work on definitions, axiomatic characterizations, case studies, and algorithms (Hardt et al., 2016; Dwork et al., 2012; Kleinberg et al., 2017; Chouldechova, 2017; Woodworth et al., 2017).
Much of the work on fairness in machine learning has focused on classification, although the influential paper of Hardt et al. (2016) considers general frameworks that include regression. Just as the mean gives a coarse summary of a distribution, the regression curve gives a rough summary of a family of conditional distributions (Mosteller and Tukey, 1977). Quantile regression targets a more complete understanding of the dependence between a response variable and a collection of explanatory variables.
Given a conditional distribution $F_{Y|X}(y \mid x)$, the quantile function $q_\tau(x)$ is characterized by $F_{Y|X}(q_\tau(x) \mid x) = \tau$, or $q_\tau(x) = F^{-1}_{Y|X}(\tau \mid x)$. We consider the setting where an estimate $\hat q_\tau(X)$ is formed using a training set for which a protected attribute $A$ is unavailable. The estimate $\hat q_\tau(X)$ will often give quantiles that are far from $\tau$, when conditioned on the protected variable. We study methods that adjust the estimator using a heldout sample for which the protected attribute $A$ is observed.
As an example, to be developed at length below, consider forecasting the birth weight of a baby as a function of the demographics and personal history of the birth mother, including her prenatal care, smoking history, and educational background. As will be seen, when the race of the mother is excluded, the quantile function may be very inaccurate, particularly at the lower quantiles corresponding to low birth weights. If used as a basis for medical advice, such inaccurate forecasts could conceivably have health consequences for the mother and infant. It would be important to adjust the estimates if the race of the mother became available.
In this paper we study the simple procedure that adjusts an initial estimate $\hat q_\tau(X)$ by adding a term that is linear in the protected attribute $A$, obtained by carrying out a quantile regression of the residual $Y - \hat q_\tau(X)$ onto $A$. We show that this leads to an estimate for which the conditional quantiles are close to the target level $\tau$ for both subpopulations $A=0$ and $A=1$. This result follows from an empirical process analysis that exploits the special dual structure of quantile regression as a linear program. The main technical result of our paper is that our adjustment procedure is -fair at the population level. Roughly speaking, this means that the effective quantiles for the two subpopulations agree, up to a stochastic error that decays at a parametric rate. We establish this result using empirical process techniques that extend to more general types of attributes, not just binary.
In the following section we provide technical background on quantile regression, including its formulation in terms of linear programming, the dual program, and methods for inference. We also provide background on notions of fairness that are related to this work and give our definition of fairness. In Section 3 we formally state the methods and results. The key steps in the proof are given in Section 5. We illustrate these results on synthetic data and birth weight data in Section 6. We finish with a discussion of the results and possible directions for future work. Full proofs of the technical results are provided in Section 8.
In this section we review the essentials of quantile regression that will be relevant to our analysis. We also briefly discuss definitions of fairness.
2.1 Linear programming formulation
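The formulation of quantile estimates as solutions to linear programs starts with the “check” or “hockey stick” function $\rho_\tau(u)$ defined by
$$\rho_\tau(u) = (\tau - 1)\, u\, \mathbb{1}\{u \le 0\} + \tau\, u\, \mathbb{1}\{u > 0\}.$$
For the median, $\rho_{1/2}(u) = \tfrac{1}{2}|u|$. If $Y \sim F$ is a random variable, define $\alpha_\tau$ as the solution to the optimization $\alpha_\tau = \arg\min_{a} \mathbb{E}\, \rho_\tau(Y - a)$. Then the stationary condition is seen to be
$$0 = (\tau - 1)\, F(\alpha_\tau) + \tau\, \bigl(1 - F(\alpha_\tau)\bigr),$$
from which we conclude that $\alpha_\tau = F^{-1}(\tau)$ is the $\tau$-quantile of $F$. Similarly the conditional quantile of $Y$ given a random variable $X \in \mathbb{R}^p$ can be written as the solution to the optimization
$$q_\tau(x) = \arg\min_{q} \mathbb{E}\bigl(\rho_\tau(Y - q) \mid X = x\bigr).$$
For a linear estimator $\hat q_\tau(X) = X^T \hat\beta_\tau$, minimizing the empirical check function loss leads to the convex optimization
$$\hat\beta_\tau = \arg\min_{\beta} \sum_{i \le n} \rho_\tau(Y_i - X_i^T \beta).$$
Dividing the residual $Y_i - X_i^T\beta$ into its positive part $u_i$ and negative part $v_i$ yields the linear program
$$\min_{u, v \in \mathbb{R}^n,\ \beta \in \mathbb{R}^p}\ \tau\, \mathbf{1}^T u + (1 - \tau)\, \mathbf{1}^T v \quad \text{such that} \quad Y = X\beta + u - v,\ u \ge 0,\ v \ge 0.$$
The dual linear program is then formulated as
$$\max_{b}\ Y^T b \quad \text{such that} \quad X^T b = (1 - \tau)\, X^T \mathbf{1},\ b \in [0,1]^n. \qquad (2.1)$$
When $n > p$, the primal solution is obtained from a set of $p$ observations $X_h \in \mathbb{R}^{p \times p}$ for which the residuals are exactly zero, through the correspondence $\hat\beta_\tau = X_h^{-1} Y_h$. The dual variables $\hat b_\tau \in [0,1]^n$, also known as regression rank scores, play the role of ranks. In particular, the quantity $\int_0^1 \hat b_{\tau, i}\, d\tau$ can be interpreted as the quantile at which $Y_i$ lies for the conditional distribution of $Y$ given $X_i$ (Gutenbrunner and Jurečková, 1992). As seen below, the stochastic process $\hat b_\tau$ plays an important role in fairness and inference for quantile regression.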
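As a small illustration of the starting point of this subsection, the following sketch checks numerically that minimizing the empirical check loss over a constant recovers a sample quantile. It is a hedged example rather than part of the original analysis: the function names, the simulated Gaussian sample, and the brute-force search over data points are all assumptions made here for illustration, and the check loss is written as $\tau u$ for $u > 0$ and $(\tau - 1)u$ for $u \le 0$.

```python
import random


def check_loss(u, tau):
    """Check ("hockey stick") loss: tau*u for u > 0, (tau - 1)*u for u <= 0."""
    return tau * u if u > 0 else (tau - 1.0) * u


def quantile_by_check_loss(ys, tau):
    """Minimize the empirical check loss over the observed values.

    The objective is convex and piecewise linear in the candidate a, with
    kinks only at data points, so a minimizer exists among the data points.
    """
    return min(ys, key=lambda a: sum(check_loss(y - a, tau) for y in ys))


# Hypothetical sample used only for illustration.
random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(501)]
tau = 0.25
a_hat = quantile_by_check_loss(ys, tau)

# Subgradient optimality: at most tau*n points lie strictly below a_hat,
# and at least tau*n points lie at or below it.
n = len(ys)
below = sum(y < a_hat for y in ys)
at_or_below = sum(y <= a_hat for y in ys)
print(a_hat, below / n, at_or_below / n)
```

The two printed fractions bracket $\tau$, which is the sample analogue of the stationary condition for the population quantile.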
2.2 Notions of fairness
Hardt et al. (2016) introduce the notion of equalized odds to assess fairness of classifiers. Suppose a classifier $\hat Y$ serves to estimate some unobserved binary outcome variable $Y$. Then the estimator is said to satisfy the equalized odds property with respect to a protected attribute $A$ if
$$\hat Y \perp A \mid Y. \qquad (2.2)$$
This fairness property requires that the true positive rates $P\{\hat Y = 1 \mid Y = 1, A\}$ and the false positive rates $P\{\hat Y = 1 \mid Y = 0, A\}$ are constant functions of $A$. In other words, $\hat Y$ has the same proportion of type-I and type-II errors across the subpopulations determined by the different values of $A$.
This could be extended to a related notion of fairness for quantile regression estimators. Denote the true conditional quantiles for outcome given attributes as . Analogous to the definition of equalized odds in (2.2), we would call a quantile estimator fair if
Conditioned on the event , we say that is a false positive. Conditioned on the complementary event , we say that is a false negative. Thus, an estimator is fair if the false positive and false negative rates do not depend on the protected attribute .
The notion of fairness that we focus on in this paper is a natural one. Considering binary $A$, we ask if the average quantiles conditional on the protected attribute agree for $A=0$ and $A=1$. More precisely, define the effective quantiles as
$$\tau_a = P\{Y \le \hat q_\tau(X) \mid A = a\}, \qquad a \in \{0, 1\}. \qquad (2.4)$$
We say that the estimator $\hat q_\tau$ is fair if $\tau_0 = \tau_1$. Typically when $\hat q_\tau$ is trained on a sample of size $n$, exact equality is too strong to ask for. If the estimators are accurate, each of the effective quantiles should be approximately $\tau$, up to stochastic error that decays at rate $n^{-1/2}$. We say is -fair if . As shall be seen, this fairness property follows from the linear programming formulation when $A$ is included in the regression. As seen from the birth weight example in Section 6.2, if $A$ is not available at training time, the quantiles can be severely under- or over-estimated for a subpopulation. This formulation of fairness is closely related to calibration by group, and demographic parity (Kleinberg et al., 2017; Hardt et al., 2016; Chouldechova, 2017). An advantage of this fairness definition is that it can be evaluated empirically, and does not require a correctly specified model.
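Because the effective quantiles can be evaluated empirically, a short sketch may help fix ideas. The code below is an illustration under stated assumptions, not the authors' implementation: it takes the effective quantile of a group to be the fraction of that group's outcomes falling at or below the estimated quantile, and it uses a hypothetical simulated sample in which the estimate ignores the protected attribute entirely.

```python
import math
import random


def effective_quantiles(y, q_hat, a):
    """Empirical effective quantile per group: fraction with Y <= q_hat."""
    out = {}
    for group in sorted(set(a)):
        idx = [i for i, ai in enumerate(a) if ai == group]
        out[group] = sum(y[i] <= q_hat[i] for i in idx) / len(idx)
    return out


# Hypothetical data: the outcome is shifted upward when the attribute is 1.
random.seed(1)
n, tau = 2000, 0.5
a = [random.randint(0, 1) for _ in range(n)]
y = [2.0 * ai + random.gauss(0.0, 1.0) for ai in a]

# An estimate that ignores the attribute: the pooled sample tau-quantile.
pooled = sorted(y)[math.ceil(tau * n) - 1]
q_hat = [pooled] * n

eff = effective_quantiles(y, q_hat, a)
gap = abs(eff[0] - eff[1])  # absolute difference of effective quantiles
print(eff, gap)
```

In this toy setting the pooled estimate overshoots the target level for one group and undershoots it for the other, so the gap is far from zero.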
3 Method and Results
drawn i.i.d. from some joint distribution on , consider the problem of estimating the conditional quantile . Let denote the expected value operator under , or . Similarly define the probability operator under as .
Evaluate the level of fairness of an estimator with
An estimator with a smaller is considered more fair. This measurement of fairness generalizes the notion of balanced effective quantiles described in Section 2.2. Note that when the protected attribute is binary, is equivalent to for defined in (2.4).
From an initial estimator that is potentially unfair, we propose the following correction procedure. On a training set of size , compute and run quantile regression of on at level . Obtain regression slope and intercept . Define correction .
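The correction step can be sketched in a few lines. This is a minimal illustration, assuming a binary attribute coded as 0 and 1; the helper names, the simulated data, and the deliberately crude initial estimate `q_init` are hypothetical. It relies on the fact that a quantile regression on an intercept and a single binary regressor is saturated, so the check loss separates by group and the fit reduces to group-wise sample quantiles of the residuals.

```python
import math
import random


def sample_quantile(values, tau):
    """Order statistic that minimizes the empirical check loss at level tau."""
    s = sorted(values)
    return s[max(1, math.ceil(tau * len(s))) - 1]


def fit_correction(residuals, a, tau):
    """Quantile regression of residuals on binary A with an intercept.

    The saturated fit is the tau-quantile of the residuals in each group.
    """
    r0 = [r for r, ai in zip(residuals, a) if ai == 0]
    r1 = [r for r, ai in zip(residuals, a) if ai == 1]
    intercept = sample_quantile(r0, tau)
    slope = sample_quantile(r1, tau) - intercept
    return slope, intercept


def group_rates(y, q, a):
    """Fraction of outcomes at or below the estimate, per group."""
    return {g: sum(yi <= qi for yi, qi, ai in zip(y, q, a) if ai == g)
               / sum(ai == g for ai in a) for g in (0, 1)}


def simulate(n, rng):
    """Hypothetical data: Y = X + 1.5*A + noise, with A independent of X."""
    rows = []
    for _ in range(n):
        x, ai = rng.gauss(0.0, 1.0), rng.randint(0, 1)
        rows.append((x, ai, x + 1.5 * ai + rng.gauss(0.0, 1.0)))
    return rows


rng = random.Random(0)
tau = 0.1
q_init = lambda x: x  # hypothetical initial estimate that ignores A

fit, test = simulate(4000, rng), simulate(4000, rng)
slope, intercept = fit_correction(
    [y - q_init(x) for x, _, y in fit], [ai for _, ai, _ in fit], tau)

xs, attrs, ys = zip(*test)
before = group_rates(ys, [q_init(x) for x in xs], attrs)
after = group_rates(
    ys, [q_init(x) + slope * ai + intercept for x, ai in zip(xs, attrs)], attrs)
print(before, after)
```

On the fresh test sample the rates after correction should sit near the target level for both groups, while the uncorrected rates differ substantially.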
We show that this estimator will satisfy the following:
Reduced risk: It almost always improves the fit of .
Theorem 3.1 (Faithfulness and fairness).
Suppose , and has finite second moment. Then the corrected estimator satisfies
Furthermore, there exist positive constants such that ,
Under the stronger assumption that the distribution of is sub-Gaussian, there exist positive constants such that ,
If is binary, then the correction procedure gives balanced effective quantiles:
By modifying the proof of Theorem 3.1 slightly, Corollary 3.1 can be extended to the case where is categorical with categories. In this case the correction procedure needs to be adjusted accordingly. Instead of regressing on , regress on the span of the indicators for , leaving one category out to avoid collinearity. The corrected estimators will satisfy for all categories .
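The categorical extension described above can be illustrated in the same spirit. The sketch below is an assumption-laden example, not the authors' code: regressing the residuals on an intercept plus leave-one-out category indicators is a saturated model, so the fitted value for each category is simply that category's sample quantile of the residuals. The category labels, offsets, and function names are hypothetical.

```python
import math
import random


def sample_quantile(values, tau):
    """Order statistic that minimizes the empirical check loss at level tau."""
    s = sorted(values)
    return s[max(1, math.ceil(tau * len(s))) - 1]


def fit_group_shifts(residuals, categories, tau):
    """Fitted shift per category (intercept plus that category's coefficient)."""
    by_group = {}
    for r, c in zip(residuals, categories):
        by_group.setdefault(c, []).append(r)
    return {c: sample_quantile(rs, tau) for c, rs in by_group.items()}


# Hypothetical residuals whose location depends on a three-level attribute.
rng = random.Random(2)
tau = 0.2
cats = [rng.choice("abc") for _ in range(3000)]
offsets = {"a": 0.0, "b": 1.0, "c": -2.0}
resid = [offsets[c] + rng.gauss(0.0, 1.0) for c in cats]

shifts = fit_group_shifts(resid, cats, tau)
counts = {c: cats.count(c) for c in shifts}
rates = {c: sum(r <= shifts[c] for r, ci in zip(resid, cats) if ci == c)
            / counts[c] for c in shifts}
print(shifts, rates)
```

On the sample used for fitting, each category's rate matches the target level up to one observation, which mirrors the balance property claimed for every category.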
Define as the risk function, where .
Theorem 3.2 (Risk quantification).
The adjustment procedure satisfies
We note that in the different setting where $A$ is a treatment rather than an observational variable, it is of interest to obtain an unbiased estimate of the treatment effect. In this case a simple alternative approach is the so-called “double machine learning” procedure of Chernozhukov et al. (2016); in the quantile regression setting this would regress the residual onto the transformed attribute where is a predictive model of in terms of .
4 Fairness on the training set
When a set of regression coefficients is obtained by running quantile regression of on a design matrix , the estimated conditional quantiles on the training set are always “fair” with respect to any binary covariate that enters the regression. Namely, if a binary attribute is included in the quantile regression, then no matter what other attributes are regressed upon, on the training set the outcome will lie above the estimated conditional quantile for approximately a proportion , for each of the two subpopulations and . This phenomenon naturally arises from the mathematics behind quantile regression. This section explains this property, and lays some groundwork for the out-of-training-set analysis of the following section.
We claim that for any binary attribute , the empirical effective quantiles are balanced:
where denotes the empirical probability measure on the training set.
, the vector minimizing the Lagrangian tends to lie on the “corners” of the -dimensional cube, with many of its coordinates taking value either 0 or 1 depending on the sign of . We thus arrive at a characterization for , the solution to the dual program. For such that , For such that , the values are solutions to the linear system that makes (2.1) hold. But with covariates and , such equality will typically occur for at most out of terms. For large , these only enter the analysis as lower order terms. Excluding these points, the equality constraint in (2.1) translates to
Assuming that the intercept is included as one of the regressors, the above implies that
which, together with (4.2), implies balanced effective quantiles for any binary regressor. In particular, if the protected binary variable $A$ is included in the regression, the resulting model will be fair on the training data, in the sense that the quantiles for the subpopulations $A=0$ and $A=1$ will be approximately equal, and at the targeted level $\tau$.
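The balance argument of this section can be written out as a short worked derivation. This is a hedged sketch: it assumes the dual equality constraint takes the standard form $X^T b = (1-\tau) X^T \mathbf{1}$ with $b \in [0,1]^n$, that the design includes an intercept column and the binary column $A$, and that the dual solution equals the indicator of a positive residual except on at most $p$ observations with zero residual. The counts $n_0$ and $n_1$ of observations with $A_i = 0$ and $A_i = 1$ are notation introduced here.

```latex
% Assumption: dual constraint X^T b = (1 - \tau) X^T 1, read off column by column.
\begin{align*}
\text{intercept column:}\quad
  & \sum_{i=1}^{n} \hat b_i = (1-\tau)\, n, \\
\text{binary column } A:\quad
  & \sum_{i=1}^{n} A_i\, \hat b_i = (1-\tau) \sum_{i=1}^{n} A_i = (1-\tau)\, n_1, \\
\text{difference of the two:}\quad
  & \sum_{i=1}^{n} (1 - A_i)\, \hat b_i = (1-\tau)\, n_0 .
\end{align*}
% Replacing \hat b_i by the indicator of a positive residual changes each sum
% by at most p, the number of observations with zero residual.
\[
\frac{1}{n_a} \sum_{i \,:\, A_i = a} \mathbb{1}\{Y_i > \hat q_\tau(X_i)\}
  = 1 - \tau + O\!\left(\frac{p}{n_a}\right), \qquad a \in \{0, 1\}.
\]
```

Read this way, the proportion of training outcomes above the fitted quantile is approximately $1 - \tau$ within each subpopulation, whatever other regressors are included.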
This insight gives reason to believe that the quantile regression coefficients, when evaluated on an independent heldout set, should still produce conditional quantile estimates that are what we are calling -fair. In the following section we establish -fairness for our proposed adjustment procedure. This requires us to again exploit the connection between the regression coefficients and the fairness measurements formed by the duality of the two linear programs.
5 Proof Techniques
We first establish some necessary notation. From the construction of , the event is equivalent to for , which calls for analysis of stochastic processes of the following form. For , let
Let . It is easy to check that
Suppose is a countable family of real functions on and is some probability measure on . Let . If
(i) there exists such that for all and ;
(ii) the collection is a Vapnik-Chervonenkis (VC) class of sets,
then there exist positive constants for which
where . In particular, .
We note that more standard results could be used for concentration of measure over VC classes of Boolean functions, or over bounded classes of real functions. We use the lemma above because of its generality and to make our analysis self-contained. The proof of this result is included in Section 8.
Recall that . We have
We use Lemma 5.1 to control the tail of the first term using the VC class where and . For the second term we have
and we exploit the dual form of quantile regression in terms of rank scores together with large deviation bounds for sub-Gaussian random variables.
6.1 Experiments on synthetic data
In this section we show experiments on synthetic data that verify our theoretical claims. (Code and data for all experiments are available online at https://drive.google.com/file/d/1Ibaq5VWaAE4539hec4-UdIOgPsNv0x_t/view?usp=sharing.) The experiment is carried out in independent repeated trials. In each trial, data points are generated independently as follows:
Let . Generate from the multivariate distribution with correlated attributes: , where the covariance matrix takes value for diagonal entries and for off-diagonal entries.
The protected attribute depends on through a logistic model: with
Given , generate from a heteroscedastic model: .
The parameters , , are all generated independently from and stay fixed throughout all trials. The coefficient is set to be 3.
In each of the trials, conditional quantile estimators are trained on a training set of size and evaluated on the remaining size held out set. We train three sets of quantile estimators at :
1. Full quantile regression of on and .
2. Quantile regression of on only.
3. Take the estimator from procedure 2 and correct it with the method described in Section 3.
The average residuals are then evaluated on the test set for the and subpopulations. In Figure 1 we display the histograms of these average residuals across all trials for the quantile regression estimator on (1(a)) and the corrected estimator (1(b)). In the simulation we are running, is positively correlated with the response . Therefore when is excluded from the regression, the quantile estimator underestimates when and overestimates when . That is why we observe different residual distributions for the two subpopulations. This effect is removed once we apply the correction procedure, as shown in Figure 1(c).
We also test whether our correction procedure corrects the unbalanced effective quantiles of an unfair initializer. In each trial we measure the fairness level of an estimator by the absolute difference between the effective quantiles of the two subpopulations on a heldout set , where is defined as in (2.4).
We established in previous sections that quantile regression excluding attribute is in general not fair with respect to . A histogram of the fairness measure obtained from this procedure is shown in Figure 1(c) (salmon). Plotted together are the fairness measures after the correction procedure (light blue). For comparison we also include the histogram obtained from the full regression (black). Note that the full regression has the “unfair” advantage of having access to all observations of . Figure 1(c) shows that the correction procedure pulls the fairness measure to a level comparable to that of a full regression, which as we argued in Section 4, produces -fair estimators.
6.2 Birthweight data analysis
The birth weight dataset from Abrevaya (2001), which is analyzed by Koenker and Hallock (2001), includes the weights of 198,377 newborn babies, and other attributes of the babies and their mothers, such as the baby’s gender, whether or not the mother is married, and the mother’s age. One of the attributes includes information about the race of the mother, which we treat as the protected attribute . The variable is binary—black () or not black (). The birth weight is reported in grams. The other attributes include education of the mother, prenatal medical care, an indicator of whether the mother smoked during pregnancy, and the mother’s reported weight gain during pregnancy.
Figure LABEL:fig:all shows the coefficients obtained by fitting a linear quantile regression model, regressing birth weight on all other attributes. The model is fit two ways, either including the protected race variable (solid, salmon confidence bands), or excluding (long dashed, light blue confidence bands). The top-right figure shows that babies of black mothers weigh less on average, especially near the lower quantiles where they weigh nearly 300 grams less compared to babies of nonblack mothers. A description of other aspects of this linear model is given by Koenker and Hallock (2001). A striking aspect of the plots is the disparity between birth weights of infants born to black and nonblack mothers, especially at the left tail of the distribution. In particular, at the 5th percentile of the conditional distribution, the difference is more than 300 grams. Just as striking is the observation that when the race attribute is excluded from the model, the variable “married,” with which it has a strong negative correlation, effectively serves as a proxy, as seen by the upward shift in its regression coefficients. However, this and the other variables do not completely account for race, and as a result the model overestimates the weights of infants born to black mothers, particularly at the lower quantiles.
To correct for the unfairness of , we apply the correction procedure described in Section 3. For the target quantile , the corrected estimator achieves effective quantiles for the black population and for the nonblack population. Table LABEL:fig:effective (left) shows the effective quantiles at a variety of quantile levels. We see that the correction procedure consistently pulls the effective quantiles for both subpopulations closer to the target quantiles.
For 1000 randomly selected individuals from the test set, Figure LABEL:fig:effective (right) shows their observed birth weights plotted against the conditional quantile estimates at before (left) and after (right) the correction. The dashed line is the identity. When is not included in the quantile regression, the conditional quantiles for the black subpopulation are overestimated. Our procedure achieves fairness correction by shifting the estimates for the data points smaller (to the left) and shifting the data points larger (to the right). After the correction, the proportion of data points that satisfy is close to the target for both subpopulations.
In this paper we have studied the effects of excluding a distinguished attribute from quantile regression estimates, together with procedures to adjust for the bias in these estimates through post-processing. The linear programming basis for quantile regression leads to properties and analyses that complement what has appeared previously in the fairness literature. Several extensions of the work presented here should be addressed in future work. For example, the generality of the concentration result of Lemma 5.1 could allow the extension of our results to multiple attributes of different types. In the fairness analysis in Section 5 we used a linear quantile regression in the adjustment step, which allows us to more easily leverage previous statistical analyses of quantile rank scores (Gutenbrunner and Jurečková, 1992). Nonparametric methods would be another interesting direction to explore.
The birth data studied here has been instrumental in developing our thinking on fairness for quantile regression. It will be interesting to investigate the ideas introduced here for other data sets. If the tail behaviors, including outliers, of the conditional distributions for a set of subpopulations are very different, and the identification of those subpopulations is subject to privacy restrictions or other constraints that do not reveal them in the data, the issue of bias in estimation and decision making will come into play.
Proof of Lemma 5.1.
To prove the lemma we first transform the problem into bounding the tail of a Rademacher process via a symmetrization technique. Let be distributed i.i.d. Rademacher (). Write for the empirical (probability) measure that puts mass at each . We claim that for all ,
Proof of (8.1): Let be independent copies of and let be the corresponding empirical measure. Define events
For all ,
On the other hand, because is countable, we can always find mutually exclusive events for which
Since for all , the above is upper bounded by . From independence of and , it can be rewritten as
which is no greater than 2 since
Because is an independent copy of , by symmetry and are equal in distribution. Therefore
That concludes the proof of (8.1).
Denote as the Rademacher process . Let be the probability measure of conditioning on . By independence of and , is still Rademacher under , and it is sub-Gaussian with parameter 1. This implies that for all , is sub-Gaussian with parameter under . In other words,
We have shown that conditioning on , is a process with sub-Gaussian increments controlled by the norm with respect to . For brevity write for . Apply Theorem 3.5 in Dirksen (2015) to deduce that there exists a positive constant , such that for all ,
where is the diameter of under the metric , and is the generic chaining functional that satisfies
for some constant . Here stands for the -covering number of under the metric . We should comment that the generic chaining technique by Dirksen (2015) is a vast overkill for our purpose. With some effort the large deviation bounds we need can be derived using the classical chaining technique.
Because for all , we have , so that
via change of variables. To bound the covering number, invoke the assumption that is a VC class of sets. Suppose the VC dimension of is . By Lemma 19 in Nolan and Pollard (1987), there exists a positive constant for which the covering numbers satisfy
for all and any that is a finite measure with finite support on . Choose by . Choose with and for each . Suppose achieves the minimum. Since is an envelope function for both and ,
which by definition of , is equal to
Take square roots on both sides to deduce that
Plug into (8.3) this upper bound on the covering number to deduce that the integral in (8.3) converges, and is no greater than a constant multiple of . Recall that we also have . From (8.2), there exists a positive constant for which
Take so we have . If the zero function does not belong in , including it in does not disrupt the VC set property, and all previous analysis remains valid for . Letting yields
Under , is no longer deterministic. Divide the probability space according to the event :
Choose and (5.1) follows.
Proof of Theorem 3.1.
Recall that . Therefore
Note that we are only allowing to take rational values because Lemma 5.1 only applies to countable sets of functions. This restriction will not hurt us because the supremum of the processes over all equals the supremum over all . Let be the envelope function. We need to check that is a VC class of sets.
Since half spaces in $\mathbb{R}^2$ are of VC dimension 3 (Alon and Spencer, 2004, p. 221), the set forms a VC class. By the same arguments all four events in (8.5) form VC classes. Deduce that is also a VC class because the VC property is stable under finitely many union and intersection operations. The assumptions of Lemma 5.1 are satisfied, which gives that for all