Maximum Weighted Loss Discrepancy

06/08/2019 ∙ by Fereshte Khani, et al. ∙ Stanford University

Though machine learning algorithms excel at minimizing the average loss over a population, this might lead to large discrepancies between the losses across groups within the population. To capture this inequality, we introduce and study a notion we call maximum weighted loss discrepancy (MWLD), the maximum (weighted) difference between the loss of a group and the loss of the population. We relate MWLD to group fairness notions and robustness to demographic shifts. We then show MWLD satisfies the following three properties: 1) It is statistically impossible to estimate MWLD when all groups have equal weights. 2) For a particular family of weighting functions, we can estimate MWLD efficiently. 3) MWLD is related to loss variance, a quantity that arises in generalization bounds. We estimate MWLD with different weighting functions on four common datasets from the fairness literature. We finally show that loss variance regularization can halve the loss variance of a classifier and hence reduce MWLD without suffering a significant drop in accuracy.




1 Introduction

Machine learning algorithms have a profound effect on people, especially around critical decisions such as banking and criminal justice (Berk, 2012; Barocas and Selbst, 2016). It has been shown that standard learning procedures (empirical risk minimization) can result in classifiers where some demographic groups suffer significantly larger losses than the average population (Angwin et al., 2016; Bolukbasi et al., 2016). In this work, we consider the setting where demographic information is unavailable (Hashimoto et al., 2018), so we would like to ensure that no group suffers a loss much larger than average.

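We are interested in measuring the maximum weighted loss discrepancy (MWLD) of a model, which is, over all groups $g$, the maximum difference between the loss of a group $\mathbb{E}[\ell \mid g = 1]$ and the population loss $\mathbb{E}[\ell]$, weighted by a function $w$ that quantifies the importance of each group:

$$\mathrm{MWLD}(w) := \sup_{g} \; w(g)\,\bigl|\mathbb{E}[\ell \mid g = 1] - \mathbb{E}[\ell]\bigr| \tag{1}$$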

MWLD captures various notions of group fairness; for example, equal opportunity (Hardt et al., 2016) is MWLD with $\ell$ capturing false positives and weighting function $w(g) = 1$ for sensitive groups (e.g., defined by race, gender) and $w(g) = 0$ for all other groups. We also show that we can bound the loss of a population with shifted demographics: if we tilt the original distribution toward any group $g$ based on $w(g)$, the loss on the new distribution can be bounded using $\mathrm{MWLD}(w)$.

We consider estimating MWLD from finite data by plugging in the empirical distribution. There are two considerations: (i) does the estimator converge? and (ii) can we compute the estimator efficiently? The answers to these two questions depend on the weighting function. We first show that for the uniform weighting function ($w_0(g) = 1$), we cannot estimate $\mathrm{MWLD}(w_0)$ from finite samples (Proposition 3). Next, we study a family of decaying weighting functions ($w_k(g) = \mathbb{E}[g]^k$), where $k$ governs how much we account for the loss discrepancy of small groups. For this family, we show that the plug-in estimator (i) is efficient to compute and (ii) converges to the population $\mathrm{MWLD}(w_k)$ (Theorem 3).

Next, we show a connection to loss variance (Proposition 4.1), an important quantity that arises in generalization bounds (Maurer and Pontil, 2009) and is used as a regularization scheme (Mnih et al., 2008; Audibert et al., 2009; Shivaswamy and Jebara, 2010; Namkoong and Duchi, 2017). In particular, $\mathrm{MWLD}(w_{1/2})$ provides us with lower and upper bounds for the loss variance. We also propose an extension called coarse loss variance, which considers only a set of sensitive groups, allowing us to incorporate knowledge about sensitive attributes.

We validate maximum weighted loss discrepancy on four common datasets from the fairness literature: predicting recidivism, credit rating, income, and crime rate. We fit a logistic regression on these datasets and estimate its $\mathrm{MWLD}(w_k)$ for various $k$. We observe that $\mathrm{MWLD}(w_k)$ with smaller $k$ converges more slowly to the population quantity. We also observe that the group attaining MWLD shrinks as $k$ decreases. We then use loss variance (LV) and coarse loss variance (CLV) regularization to train models. Our empirical findings are as follows: 1) We halve the loss variance with only a small increase in the average loss. 2) In some cases, using loss variance as a regularizer simultaneously reduces the classification loss (higher accuracy) and loss variance (lower loss discrepancy).


Consider the prediction task of mapping each input $x \in \mathcal{X}$ to a probability distribution over an output space $\mathcal{Y}$. Let $h$ be a predictor which maps each input $x$ to a probability distribution over $\mathcal{Y}$; for binary classification, $\mathcal{Y} = \{0, 1\}$ and a predictor $h : \mathcal{X} \to [0, 1]$ returns the probability of $y = 1$. Let $\ell(h, z)$ be the (bounded) loss incurred by predictor $h$ on individual $z = (x, y)$, e.g., the zero-one or logistic loss. Let $p^\star$ denote the underlying distribution over individuals $z$; all expectations are computed with respect to $z \sim p^\star$. Define a group to be a measurable function $g : \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$ such that $g(z) = 1$ if individual $z$ is in the group and $g(z) = 0$ otherwise. Let $\mathcal{G}$ be the set of all groups. When clear from context, we use $\ell$ to denote $\ell(h, z)$ and $g$ to denote $g(z)$, so that $\mathbb{E}[\ell]$ is the population loss and $\mathbb{E}[\ell \mid g = 1]$ is the loss of group $g$.

2 Maximum Weighted Loss Discrepancy

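We now introduce our central object of study:

Definition 2 [Maximum weighted loss discrepancy (MWLD)]. For a weighting function $w$, loss function $\ell$, and predictor $h$, define the maximum weighted loss discrepancy $\mathrm{MWLD}(w, \ell, h)$ to be the maximum difference between the loss of a group and the population loss, weighted by $w$:

$$\mathrm{MWLD}(w, \ell, h) := \sup_{g \in \mathcal{G}} \; w(g)\,\bigl|\mathbb{E}[\ell \mid g = 1] - \mathbb{E}[\ell]\bigr| \tag{2}$$

where the weighting function (e.g., $w(g) = \mathbb{E}[g]^{1/2}$) intuitively controls the importance of group $g$.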

Group fairness interpretation.

By rearranging the terms of (2), we can bound the loss discrepancy of any group in terms of the group weight and the maximum weighted loss discrepancy:

$$\bigl|\mathbb{E}[\ell \mid g = 1] - \mathbb{E}[\ell]\bigr| \le \frac{\mathrm{MWLD}(w)}{w(g)} \tag{3}$$

where the bound is tighter for larger $w(g)$. Existing statistical notions of fairness such as equal opportunity (Hardt et al., 2016) can be viewed as enforcing MWLD to be small for a weighting function that is $1$ on sensitive groups (e.g., different races) and $0$ on all other groups; see Appendix A for further discussion.
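
To make the bound concrete, the following minimal sketch computes the empirical weighted loss discrepancy of one group and the per-group guarantee implied by (3). It assumes the size-based weighting $w(g) = \mathbb{E}[g]^k$ studied in Section 3; the function names and the toy data are illustrative and not taken from the paper.

```python
import numpy as np


def weighted_loss_discrepancy(losses, group, k=0.5):
    """Empirical w(g) * |E[loss | g = 1] - E[loss]|, assuming w(g) = E[g]^k."""
    losses = np.asarray(losses, dtype=float)
    group = np.asarray(group, dtype=bool)
    if not group.any():
        return 0.0  # an empty group has weight zero
    weight = group.mean() ** k
    gap = abs(losses[group].mean() - losses.mean())
    return float(weight * gap)


def discrepancy_bound(mwld, group_fraction, k=0.5):
    """Bound (3): |E[loss | g = 1] - E[loss]| <= MWLD(w) / w(g)."""
    return mwld / group_fraction ** k


# Toy data (hypothetical): four individuals, the group is the last one.
losses = np.array([0.0, 0.0, 1.0, 1.0])
group = np.array([False, False, False, True])
d = weighted_loss_discrepancy(losses, group, k=0.5)  # 0.25**0.5 * |1.0 - 0.5|
bound = discrepancy_bound(mwld=0.1, group_fraction=0.25, k=0.5)
```

With $k = 1/2$, a group covering a quarter of the population is guaranteed a loss discrepancy of at most twice the MWLD; smaller groups receive a weaker guarantee.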

Distributional shift interpretation.

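We can use MWLD to bound the loss on a population with shifted demographics. For any group $g$, define the mixture distribution $q := w(g)\, p^\star(\cdot \mid g = 1) + (1 - w(g))\, p^\star$, which tilts the original distribution $p^\star$ more towards group $g$ (assuming $w(g) \le 1$). Then via simple algebra (Proposition B in Appendix B), the loss under this new distribution can be controlled as follows:

$$\mathbb{E}_{z \sim q}[\ell] \le \mathbb{E}_{z \sim p^\star}[\ell] + \mathrm{MWLD}(w) \tag{4}$$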

This is similar in spirit to distributionally robust optimization (DRO) using a max-norm metric (Duchi and Namkoong, 2018), but the difference is that the mixture coefficient is group-dependent.
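
The algebra behind this kind of bound is short. As a sketch, assume the mixture has the form $q = w(g)\, p^\star(\cdot \mid g = 1) + (1 - w(g))\, p^\star$ with $0 \le w(g) \le 1$ (the symbols $q$ and $p^\star$ are notation assumed here). Then:

$$
\begin{aligned}
\mathbb{E}_{z \sim q}[\ell]
  &= w(g)\,\mathbb{E}[\ell \mid g = 1] + (1 - w(g))\,\mathbb{E}[\ell] \\
  &= \mathbb{E}[\ell] + w(g)\,\bigl(\mathbb{E}[\ell \mid g = 1] - \mathbb{E}[\ell]\bigr) \\
  &\le \mathbb{E}[\ell] + \mathrm{MWLD}(w).
\end{aligned}
$$

The last step uses only the definition of MWLD as a supremum over groups, so the excess loss under the shifted distribution is at most $\mathrm{MWLD}(w)$.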

How do we now operationalize MWLD? After all, the supremum over all groups in MWLD (2) appears daunting. In Section 3, we show how we can efficiently estimate MWLD for a restricted family of weighting functions.

3 Estimating Maximum Weighted Loss Discrepancy (MWLD)

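We now focus on the problem of estimating MWLD from data. For simplicity of notation, we write $\mathrm{MWLD}(w)$ instead of $\mathrm{MWLD}(w, \ell, h)$. Given $n$ points $z_1, \dots, z_n$ drawn i.i.d. from $p^\star$, our goal is to derive an estimator $\widehat{\mathrm{MWLD}}_n(w)$ that is (i) efficient to compute and (ii) accurately approximates $\mathrm{MWLD}(w)$. Formally: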


Whether this goal is achievable depends on the weighting function. Our first result is that we cannot estimate MWLD for the uniform weighting function ($w_0(g) = 1$):

Proposition 3. For any loss function $\ell$ and predictor $h$, if the loss is non-degenerate, then there is no estimator satisfying (5).

We prove Proposition 3 by constructing two statistically indistinguishable distributions whose values of $\mathrm{MWLD}(w_0)$ differ (see Appendix B for details). Proposition 3 is intuitive since MWLD for $w_0$ is asking for uniform convergence over all measurable functions, which is a statistical impossibility. It therefore seems natural to shift our focus to weighting functions that decay to zero as the measure of the group goes to zero.

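As our next result, we show that we can estimate MWLD for the weighting function $w_k(g) = \mathbb{E}[g]^k$ for $k \in (0, 1]$. In particular, we show (i) we can efficiently compute the empirical $\widehat{\mathrm{MWLD}}_n(w_k)$, and (ii) it converges to $\mathrm{MWLD}(w_k)$. Letting $\hat{\mathbb{E}}$ denote an expectation with respect to the $n$ points, define the plug-in estimator:

$$\widehat{\mathrm{MWLD}}_n(w_k) := \sup_{g \in \mathcal{G}} \; \hat{\mathbb{E}}[g]^k\,\bigl|\hat{\mathbb{E}}[\ell \mid g = 1] - \hat{\mathbb{E}}[\ell]\bigr| \tag{6}$$

Although $\widehat{\mathrm{MWLD}}_n(w_k)$ seems to have an intractable max, in the next theorem we prove we can actually compute it efficiently, and that $\widehat{\mathrm{MWLD}}_n(w_k)$ converges to $\mathrm{MWLD}(w_k)$, where the rate of convergence depends on $k$. The key is that $\widehat{\mathrm{MWLD}}_n(w_k)$ attains its max over a linear number of possible groups based on sorting by losses; this is what enables computational efficiency as well as uniform convergence.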

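Theorem 3. For $k \in (0, 1]$, let $w_k(g) = \mathbb{E}[g]^k$. Given $n$ i.i.d. samples from $p^\star$, we can compute (6) efficiently in $O(n \log n)$ time; moreover, for any accuracy and confidence parameters, once $n$ exceeds a threshold that depends on those parameters and on $k$, the plug-in estimate (6) is close to $\mathrm{MWLD}(w_k)$ with high probability.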

Proof sketch.

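For computational efficiency, we show that if we sort the points by their losses (in $O(n \log n)$ time), then there exists an index $i$ such that either the $i$ points with the smallest losses or the $i$ points with the largest losses form a group that achieves the empirical maximum weighted loss discrepancy (6).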

To show convergence, let be the weighted loss discrepancy for group , , and analogously let be its empirical counterpart. We first prove for any if , then with probability at least , for some constant . Furthermore, since we assumed , then . By combining these two upper bounds (), we compute an upper bound independent of , thereby applicable to all groups.

To show uniform convergence, from the sorting result, we only need to consider groups defined by thresholding the loss from one side and their counterparts that threshold from the other side. We prove uniform convergence over this set using the DKW inequality (Massart, 1990) and the same procedure we explained before for convergence. See Appendix B for the complete proof.
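
The sorting argument can be sketched in a few lines. The code below is an illustrative implementation, assuming the weighting $w_k(g) = \mathbb{E}[g]^k$ under the empirical distribution: for each group size it only evaluates the group of smallest losses and the group of largest losses, and it compares the result against an exponential-time search over all groups on a small hypothetical sample. The function names are not from the paper.

```python
import itertools

import numpy as np


def plugin_mwld(losses, k=0.5):
    """Plug-in MWLD for w_k(g) = E[g]^k in O(n log n) time via sorting."""
    sorted_losses = np.sort(np.asarray(losses, dtype=float))
    n = sorted_losses.size
    mean = sorted_losses.mean()
    sizes = np.arange(1, n + 1)
    weights = (sizes / n) ** k
    # Mean loss of the i smallest and of the i largest losses, for every i.
    low_means = np.cumsum(sorted_losses) / sizes
    high_means = np.cumsum(sorted_losses[::-1]) / sizes
    low = weights * (mean - low_means)    # groups below the average loss
    high = weights * (high_means - mean)  # groups above the average loss
    return float(max(low.max(), high.max()))


def brute_force_mwld(losses, k=0.5):
    """Reference value over all non-empty groups (exponential time)."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    best = 0.0
    for size in range(1, n + 1):
        for members in itertools.combinations(range(n), size):
            gap = abs(losses[list(members)].mean() - losses.mean())
            best = max(best, (size / n) ** k * gap)
    return best


rng = np.random.default_rng(0)
sample = rng.uniform(size=8)  # hypothetical losses in [0, 1]
fast, slow = plugin_mwld(sample), brute_force_mwld(sample)
```

For a fixed group size, the mean loss deviates most from the population mean when the group consists of the largest or the smallest losses, which is why scanning the sorted prefixes and suffixes is enough.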

4 A Closer Look at $\mathrm{MWLD}(w_k)$ and Connection to Loss Variance

As shown in Section 2, MWLD has two different interpretations: group fairness and distributional shift. In this section, we look at these interpretations for the family of weighting functions $w_k(g) = \mathbb{E}[g]^k$, for which we can efficiently estimate $\mathrm{MWLD}(w_k)$. In Section 4.1, we show a connection between a particular member of this family ($k = 1/2$) and loss variance. As an extension, in Section 4.2, we introduce coarse loss variance, a simple modification of loss variance which measures weighted loss discrepancy only for sensitive groups.

From the group fairness interpretation (3), $\mathrm{MWLD}(w_k)$ provides guarantees on the loss discrepancy of each group according to its size. Therefore groups with similar sizes have similar guarantees. For a fixed value of $\mathrm{MWLD}(w_k)$, Figure 1 (left) shows the bounds on the group loss discrepancy for different group sizes and values of $k$. The upper bound guarantee for smaller groups is weaker, and the parameter $k$ governs how much this upper bound varies across group sizes.

From the distributional shift interpretation (4), $\mathrm{MWLD}(w_k)$ provides a guarantee on the loss of a new distribution where the weight of a group $g$ is increased by a maximum factor of $w_k(g)$. Figure 1 (right) illustrates this maximum upweighting factor. The upweighting factor for smaller groups is smaller, and the parameter $k$ governs how much this factor varies across group sizes.

Figure 1: Left: upper bound guarantee for the loss discrepancy of a group for different values of $k$. Right: the magnitude of the shift in a group is dictated by the weighting function of the group.
Figure 2: Relationship between $\mathrm{Var}[\ell]$ and $\mathrm{MWLD}(w_{1/2})$.

4.1 Loss Variance and Maximum Weighted Loss Discrepancy

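In this section, we show an interesting connection between a particular member of the introduced family of weighting functions, $w_{1/2}(g) = \mathbb{E}[g]^{1/2}$, and loss variance, which appears prominently in generalization bounds (Maurer and Pontil, 2009). Loss variance, $\mathrm{Var}[\ell]$, is the average squared difference between the loss of individuals and the population loss:

$$\mathrm{Var}[\ell] := \mathbb{E}\bigl[(\ell - \mathbb{E}[\ell])^2\bigr] \tag{7}$$

From the law of total variance, we have $\mathrm{Var}[\ell] \ge \mathrm{Var}[\mathbb{E}[\ell \mid g]]$ for any group $g$. By observing that $\mathrm{Var}[\mathbb{E}[\ell \mid g]] \ge \mathbb{E}[g]\,(\mathbb{E}[\ell \mid g = 1] - \mathbb{E}[\ell])^2$, we see that the square root of the loss variance is an upper bound on $\mathrm{MWLD}(w_{1/2})$. This allows us to bound the loss of any group in terms of the loss variance (using (3)). A natural next question is about the tightness of the upper bound. How much larger can the variance be compared to $\mathrm{MWLD}(w_{1/2})$? The next proposition shows that loss variance also provides a lower bound on a function of $\mathrm{MWLD}(w_{1/2})$.

Proposition 4.1. For any measurable loss function $\ell$, the loss variance $\mathrm{Var}[\ell]$ is upper bounded by a function of $\mathrm{MWLD}(w_{1/2})$ (the explicit bound is derived in Appendix B).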

Proof sketch.

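We first center the losses and make the average loss $0$ (without changing the MWLD). For any $u > 0$, let $g_u$ be the group of points with loss greater than $u$. By definition of MWLD, we have $\mathbb{E}[g_u]^{1/2}\,\mathbb{E}[\ell \mid g_u = 1] \le \mathrm{MWLD}(w_{1/2})$. Since $\mathbb{E}[\ell \mid g_u = 1] \ge u$, we therefore have $\mathbb{P}[\ell > u] \le \mathrm{MWLD}(w_{1/2})^2 / u^2$. Using integration by parts, we express the variance with an integral expression in terms of the cumulative distribution function (CDF). Plugging this bound into the integral expression yields the result. For more details, see Appendix B. ∎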

Figure 2 shows the bounds on $\mathrm{Var}[\ell]$ for different values of $\mathrm{MWLD}(w_{1/2})$. This proposition establishes a connection between statistical generalization and MWLD (and thereby group fairness). Furthermore, this connection states that reducing the loss discrepancy between individuals and the population (7) leads to lower loss discrepancy between groups and the population (2) and vice versa.
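
As a worked illustration (not from the paper), take a zero-one loss with error rate $p$, so that the loss variance is $p(1-p)$, and assume the weighting $w_{1/2}(g) = \mathbb{E}[g]^{1/2}$. A group made only of misclassified points with total mass $a \le p$ has weighted discrepancy $\sqrt{a}\,(1 - p)$, and mixing in correctly classified points only lowers it; the symmetric argument applies to groups of correctly classified points. Hence:

$$
\mathrm{MWLD}(w_{1/2})
  = \max\bigl(\sqrt{p}\,(1 - p),\; \sqrt{1 - p}\,p\bigr)
  = \sqrt{p(1 - p)}\,\sqrt{\max(p,\, 1 - p)}
  \le \sqrt{p(1 - p)}.
$$

For $p = 1/2$ this gives $\mathrm{MWLD}(w_{1/2}) = 1/(2\sqrt{2}) \approx 0.354$ against a standard deviation of $0.5$, so in this example the square root of the loss variance overestimates the MWLD by a factor of at most $\sqrt{2}$.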

Figure 3: Each individual has two sensitive attributes, color and height, and one non-sensitive attribute, having a hat, (a) . (b) . Note that, since the weights of the groups not defined on sensitive attributes are , their expected loss can deviate a lot from average loss (e.g, the expected loss of individuals with hats is , which deviates a lot from average loss ).

4.2 Sensitive Attributes and Coarse Loss Variance

So far, we have focused on the loss discrepancy over all groups, which could be too demanding. Suppose we are given a set of sensitive attributes (e.g., race, gender), and we are interested only in groups defined in terms of those attributes. We define coarse loss variance, which first averages the losses of all individuals with the same sensitive attribute values and then considers the variance of these average losses. Formally, let $A$ denote the sensitive attributes; for example, $A = (\text{race}, \text{gender})$. Then the coarse loss variance is:

$$\mathrm{Var}\bigl[\mathbb{E}[\ell \mid A]\bigr] = \mathbb{E}\Bigl[\bigl(\mathbb{E}[\ell \mid A] - \mathbb{E}[\ell]\bigr)^2\Bigr] \tag{8}$$

Coarse loss variance is no larger than loss variance (7) because it ignores fluctuations in the losses of individuals who have identical sensitive attributes. Figure 3 shows the difference between loss variance and coarse loss variance. Analogous to Proposition 4.1 in the previous section, we show that coarse loss variance is a close estimate of $\mathrm{MWLD}(w_A)$, where $w_A(g) = \mathbb{E}[g]^{1/2}$ if $g$ is a function only of sensitive attributes and $w_A(g) = 0$ otherwise. Define $\mathcal{G}_A$ to be the set of groups $g$ such that $g$ only depends on the sensitive attributes $A$. Let , then we have:


For the formal propositions regarding coarse loss variance, see Corollary B in Appendix B. As a caveat, empirical coarse loss variance converges more slowly to its population counterpart than loss variance does (see Theorem C in Appendix C for the exact convergence rate).
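
The difference between the two variances is easy to see numerically. The sketch below is an illustrative empirical version, assuming each individual comes with one hashable sensitive-attribute setting (for example a tuple of attribute values); the function names and toy data are hypothetical.

```python
from collections import defaultdict

import numpy as np


def loss_variance(losses):
    """Var[loss] under the empirical distribution."""
    return float(np.var(np.asarray(losses, dtype=float)))


def coarse_loss_variance(losses, attrs):
    """Var[E[loss | A]]: variance of the per-cell mean losses.

    `attrs` holds one hashable sensitive-attribute setting per individual.
    """
    losses = np.asarray(losses, dtype=float)
    cells = defaultdict(list)
    for loss, setting in zip(losses, attrs):
        cells[setting].append(loss)
    overall = losses.mean()
    return float(sum(
        len(v) / losses.size * (np.mean(v) - overall) ** 2
        for v in cells.values()
    ))


# Hypothetical data: the same losses under two different attribute layouts.
losses = [0.0, 1.0, 0.0, 1.0]
within = coarse_loss_variance(losses, ["a", "a", "b", "b"])   # losses vary inside cells
between = coarse_loss_variance(losses, ["a", "b", "a", "b"])  # losses vary across cells
```

When all of the variation in the losses happens inside attribute cells, the coarse loss variance is zero even though the loss variance is not; when the cells line up with the losses, the two coincide.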

Remark 1.

In some applications, the impact of misclassification could be quite different depending on the true label of the individual. For example, in deciding whether to provide loans, denying an eligible individual a loan could have a greater negative impact than providing a loan to a defaulter. In such situations, we consider the variance conditioned on the label, i.e., $\mathbb{E}[\mathrm{Var}[\ell \mid y]]$, so that we do not attempt to pull the losses of individuals with different labels (and consequently different impacts of misclassification) together.

5 Experiments

We first explore the effect of the parameter $k$ in $\mathrm{MWLD}(w_k)$, as discussed in Section 4. We then use (coarse) loss variance to train models and show that we can halve the loss variance without a significant increase in average loss. Table 1 shows a summary of the datasets considered. For more details about these datasets, see Appendix E.

| Name | # Records | # Attributes | Variable to be predicted | Sensitive attributes | Other attributes |
|---|---|---|---|---|---|
| C&C | 1994 | 99 | High or low crime rate | Different races percentage (discretized) | Police budget, #Homeless, … |
| Income | 48842 | 14 | High or low salary | Race, Age (discretized), Gender | Marital status, Occupation, … |
| German | 1000 | 22 | Good or bad credit rating | Age (discretized), Gender | Foreign worker, Credit history, … |
| COMPAS_5 | 7214 | 5 | Recidivated or not | Race, Age (discretized), Gender | Prior counts, Charge degree, … |

Table 1: Statistics of datasets. For more details about these datasets, see Appendix E.

5.1 Estimating Maximum Weighted Loss Discrepancy

We first fit an L2-regularized logistic regression (LR) predictor on these datasets. Figure 4(a) shows the values of $\mathrm{MWLD}(w_k)$ for different values of $k$. As shown in Theorem 3, we expect the plug-in estimate to converge more slowly to the population $\mathrm{MWLD}(w_k)$ for smaller $k$. Empirically, we observe a bigger train-test gap for $\mathrm{MWLD}(w_k)$ for smaller $k$.

As discussed in Section 4, according to the group fairness interpretation of $\mathrm{MWLD}(w_k)$, we can bound the loss of any group in terms of $\mathrm{MWLD}(w_k)$, where small $k$ leads to a similar upper bound for all groups, while larger $k$ allows weaker upper bounds for smaller groups. For each group size, we compute the maximum loss discrepancy over groups of that size in the COMPAS_5 dataset. The solid black line in Figure 4(b) shows this plot. For different values of $k$, we plot the obtained upper bound from (3). Smaller $k$ leads to a tighter upper bound for small groups and larger $k$ leads to a tighter upper bound for large groups.

Figure 4: (a) The gap between values of $\mathrm{MWLD}(w_k)$ in train (dashed lines) and test (solid lines) is larger for smaller $k$. (b) The solid black line indicates the maximum loss discrepancy for different group sizes. Dashed lines show the obtained upper bound from (3). The upper bound is tighter for smaller groups when $k$ is small, and it is tighter for larger groups when $k$ is large.
Figure 5: A logistic regression with L2-regularizer (LR) leads to high (coarse) loss variance in all datasets.

5.2 Loss Variance Regularization

Recall that loss variance has three different interpretations: 1) it provides a lower bound on a function of the maximum weighted loss discrepancy (Proposition 4.1); 2) it measures the average loss discrepancy between individuals and the population (7); and 3) it is a regularizer to improve test error. In this section, we study loss variance regularization with respect to all three aspects. In all datasets that we consider, the effect of misclassification depends on the label. As explained in Remark 1, in order to not attempt to pull together the losses of individuals with different labels, we use loss variance and coarse loss variance conditioned on the label. Formally, we define two objectives based on loss variance and coarse loss variance as follows:


We optimize the objectives above using stochastic gradient descent. We use a logistic regression model for prediction. In all variance computations, $\ell$ is the log loss.
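
As a rough sketch of what a label-conditioned loss variance penalty looks like, the code below evaluates an objective of the assumed form "average log loss plus $\eta$ times $\mathbb{E}[\mathrm{Var}[\ell \mid y]]$". The exact objectives (10) and (11) are not reproduced here, so the form of the objective, the symbol `eta`, the omission of any L2 term, and the function names are all assumptions made for illustration.

```python
import numpy as np


def log_loss(p, y):
    """Per-individual log loss for predicted probabilities p of y = 1."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(y, dtype=float)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def label_conditioned_variance(losses, y):
    """E[Var[loss | y]]: within-label loss variance, averaged over labels."""
    losses, y = np.asarray(losses, dtype=float), np.asarray(y)
    return float(sum(
        (y == label).mean() * np.var(losses[y == label])
        for label in np.unique(y)
    ))


def lv_objective(p, y, eta):
    """Assumed form: average loss + eta * E[Var[loss | y]] (no L2 term)."""
    losses = log_loss(p, y)
    return float(losses.mean() + eta * label_conditioned_variance(losses, y))


y = np.array([0, 1, 0, 1])
even = lv_objective([0.5, 0.5, 0.5, 0.5], y, eta=3.0)    # identical losses, no penalty
uneven = lv_objective([0.1, 0.9, 0.6, 0.4], y, eta=3.0)  # losses differ within each label
```

Predictions whose losses are identical within each label incur no penalty, whereas predictions that are accurate for some individuals and poor for others with the same label are penalized in proportion to `eta`.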

Figure 6: First row: LV halves the loss variance by only increasing loss 2-3%. Second row: CLV halves the coarse loss variance by increasing the loss around 1-2%.
Figure 7: LV reduces loss variance and average loss simultaneously.

| Method | Acc. | DFPR | DFNR |
|---|---|---|---|
| Unconstrained | 0.668 | 0.18 | -0.30 |
| Zafar et al. (2017) | 0.661 | 0.03 | -0.11 |
| Hardt et al. (2016) | 0.645 | -0.01 | -0.01 |
| CLV () | 0.661 | 0.02 | -0.10 |
| CLV () | 0.656 | -0.04 | -0.04 |

Table 2: Comparison between different methods. DFPR (DFNR) denotes the difference between the false positive (false negative) rates of white individuals and black individuals.

Figure 5 shows that without any variance regularization ($\eta = 0$, the LR objective), both loss variance and coarse loss variance are large on all four datasets. This suggests LR predictions have high loss discrepancy both for groups and individuals. Note that these two notions are incomparable across datasets: smaller loss variance does not imply smaller coarse loss variance and vice versa.

Let’s now evaluate the training procedures we proposed to learn a predictor with lower loss discrepancy for groups and individuals. By varying the regularization parameter $\eta$ in (10), we visualize the trade-off between loss variance and average loss for LV. As shown in Figure 6 (first row), LV halves the loss variance by increasing average loss by only 2–3%. A similar result is shown in Figure 6 (second row) for CLV (11). However, since coarse loss variance allows for fluctuations in predictions across individuals with the same sensitive attributes (as opposed to loss variance), CLV is able to achieve a smaller increase in average loss (1–2%). As a baseline, we show the trade-off curve of LR obtained by varying the L2-regularizer $\lambda$.

We now compare LV and CLV; in particular, we are interested in the effect of LV in reducing coarse loss variance. Interestingly, on the C&C dataset LV has a better trade-off curve than CLV on the test distribution for small value of . On the German dataset, unlike CLV, LV reduces the coarse loss variance substantially at test time. These two observations suggest that LV might sometimes generalize to the test set better than CLV (as we mentioned in Section 4.2).

As we discussed in Section 4.1, loss variance has been studied as a way to improve the generalization error of a predictor. As shown in Figure 7, LV reduces loss variance and loss simultaneously on the German and C&C datasets for smaller values of the L2 regularization strength in (10). On the COMPAS_5 and Income datasets, since there are few attributes and many data points, neither LV nor LR improved the loss.

We now shift our focus from predicting a distribution over $\mathcal{Y}$ to classification, where the goal is to predict the label of an individual. We classify an individual to class $1$ if the predictor’s estimated probability of $y = 1$ exceeds the decision threshold, and to class $0$ otherwise. Our approach differs from previous work (Zafar et al., 2017; Hardt et al., 2016) mainly in that its goal is to protect all groups formed from all combinations of sensitive attributes, as opposed to treating each sensitive attribute individually. Nevertheless, we compare our model to these methods and show that loss variance regularization reduces the maximum loss discrepancy to a degree comparable to previous work. We pre-process COMPAS_5 in a similar fashion to Zafar et al. (2017) and compare our model to their model and to Hardt et al. (2016). We compute the difference between the false positive rates of blacks and whites (DFPR) and similarly the difference between the false negative rates of blacks and whites (DFNR). As shown in Table 2, compared to Zafar et al. (2017), our method reaches lower DFPR and DFNR; even when we choose a point with the same accuracy, our method still has lower DFPR and DFNR. Compared to Hardt et al. (2016), we obtain higher accuracy but worse DFPR and DFNR.
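
For reference, the two gap metrics can be computed as in the following sketch. The sign convention (rates of one group minus rates of everyone else), the function name, and the toy data are assumptions for illustration; the sketch also assumes each group contains at least one positive and one negative example.

```python
import numpy as np


def rate_gaps(y_true, y_pred, in_group_a):
    """Return (DFPR, DFNR): rates of group A minus rates of everyone else."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    in_group_a = np.asarray(in_group_a, dtype=bool)

    def fpr(mask):  # share of true negatives that are predicted positive
        return y_pred[mask & ~y_true].mean()

    def fnr(mask):  # share of true positives that are predicted negative
        return (~y_pred[mask & y_true]).mean()

    dfpr = fpr(in_group_a) - fpr(~in_group_a)
    dfnr = fnr(in_group_a) - fnr(~in_group_a)
    return float(dfpr), float(dfnr)


# Hypothetical labels and predictions for eight individuals.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
group_a = [True, True, True, True, False, False, False, False]
dfpr, dfnr = rate_gaps(y_true, y_pred, group_a)
```

In this toy example group A receives more false positives and fewer false negatives than the rest, so the two gaps have opposite signs.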

6 Related work

Algorithmic fairness. The issue of algorithmic fairness has risen in prominence with the increased prevalence of prediction (Barocas and Selbst, 2016). Group fairness notions, which ask for some approximate parity among some predefined groups, are very prevalent in the fairness literature. Many group fairness notions can be viewed as instantiations of MWLD with different weighting functions and different loss functions (see Appendix A). A major thrust of our work is to guarantee fairness for all or a large number of groups, which is shared by some recent work (Kearns et al., 2018; Agarwal et al., 2018; Hébert-Johnson et al., 2017). These works focus on a set of groups that can be expressed as low complexity functions of the sensitive attributes. Depending on the complexity of the functions, estimating any fairness notion across groups in this set can be NP-hard. In contrast, we consider all groups (appropriately weighted), which makes the estimation problem computationally tractable. Oblivious to the sensitive attributes, Zhang and Neill (2016) also try to protect all groups, using subset scan and parametric bootstrap-based methods to identify subgroups with high classification errors. They provide some heuristic methods and only focus on finding a group with high predictive bias, whereas we formally provide guarantees and introduce a regularizer for learning a predictor with low loss discrepancy for all groups.

Distributional robustness. In distributionally robust optimization (DRO), the broad goal is to control the worst-case loss over distributions close to the sampling distribution (Ben-Tal et al., 2013; Delage and Ye, 2010; Duchi et al., 2016; Wang et al., 2016; Esfahani and Kuhn, 2018; Duchi and Namkoong, 2018), whereas MWLD measures the worst-case weighted loss discrepancy over groups, which correspond to restrictions of the support. The two can be related as follows: conditional value at risk (CVaR), a particular instantiation of DRO, considers all distributions $q$ such that $q(z) \le p^\star(z) / \alpha$ for all $z$. If we relax groups to permit fractional membership ($g$ maps to $[0, 1]$ rather than $\{0, 1\}$), then DRO with the max-norm metric is equivalent to MWLD with the weighting function $w(g) = \mathbb{1}[\mathbb{E}[g] \ge \alpha]$, which considers all groups with size at least $\alpha$.

Loss variance regularization. Variance regularization stems from efforts to turn better variance-based generalization bounds into algorithms. Bennett (1962) and Hoeffding (1963) show that the excess risk of a hypothesis can be bounded according to its variance. Maurer and Pontil (2009) substitute the population variance by its empirical counterpart in the excess-risk bound and introduce sample variance penalization as a regularizer. Their analysis shows that under some settings, this regularizer can get better rates of convergence than traditional ERM. Variance regularization as an alternative to ERM has also been studied previously (Mnih et al., 2008; Audibert et al., 2009; Shivaswamy and Jebara, 2010). Recently, Namkoong and Duchi (2017) provided a convex surrogate for sample variance penalization via distributionally robust optimization. In this work, we provide a connection between this rich literature and algorithmic fairness.

7 Conclusion

We defined and studied maximum weighted loss discrepancy (MWLD). We gave two interpretations for MWLD: 1) Group fairness: it bounds the loss of any group compared to the population loss; 2) Robustness: it bounds the loss on a set of new distributions with shifted demographics, where the magnitude of the shift in a group is dictated by the weighting function of the group. In this paper, we studied computational and statistical challenges of estimating MWLD for a family of weighting functions ($w_k(g) = \mathbb{E}[g]^k$), and established a close connection between $\mathrm{MWLD}(w_{1/2})$ and loss variance. This motivated loss variance regularization as a way to improve fairness and robustness. We also proposed a variant of loss variance regularization that incorporates information about sensitive attributes. We showed that we can efficiently estimate MWLD for $w_k$. What other weighting functions does this hold for? We relied on the key property that the sup is attained on a linear number of possible groups; are there other structures?


All code, data and experiments for this paper are available on the Codalab platform at


This work was supported by Open Philanthropy Project Award. We would like to thank Tatsunori Hashimoto and Mona Azadkia for several helpful discussions and anonymous reviewers for useful feedback.


Appendix A Previous statistical notions

Statistical notions of fairness can be viewed as instantiations of MWLD (Definition 2) with different weighting functions and appropriate loss functions. We categorize existing notions into three rough categories and flesh out the associated weighting functions.

Group fairness:

Early work on statistical fairness (Kamiran and Calders [2012], Hardt et al. [2016], Zafar et al. [2017]) only controls the discrepancy of loss for a small number of groups defined on sensitive attributes (e.g., race and gender). This corresponds to a zero-one weighting function where weights of the fixed sensitive groups are $1$, and weights of all other groups are $0$.

Subgroup fairness:

Kearns et al. [2018] argue that group fairness is prone to the “fairness gerrymandering” problem whereby groups corresponding to combinations of sensitive attributes can have high loss even if groups corresponding to sensitive attributes individually are protected. (They show an illustrative example of a predictor for which black men and white women suffer very small loss but black women and white men suffer very high loss. Such a predictor has similar losses on the groups of blacks and whites (corresponding to the sensitive attribute race) and men and women (corresponding to gender); however, it has a high loss on the “sub-groups” defined on the combination of the attributes.) To mitigate this issue, they consider exponentially many subgroups defined by a structured class of functions over the sensitive attributes and control the loss of each group in this class weighted by its size. Formally, their definition corresponds to Definition 2 with the weighting function $w(g) = \mathbb{E}[g]$ for groups in this class and $w(g) = 0$ otherwise.

Large-group fairness:

Hashimoto et al. [2018] also consider the setting where sensitive attributes are unknown (which is also our main focus). They aim to control the losses of groups whose size is greater than some predefined value $\alpha$ (oblivious to sensitive attributes). In terms of Definition 2, this corresponds to the weighting function $w(g) = \mathbb{1}[\mathbb{E}[g] \ge \alpha]$.

Appendix B Missing Proofs

Fix any loss function and weighting function such that . For , the following holds:


We prove the following:


In the rest of the proof, all expectations are with respect to .

If then it is obvious that increasing weight of group will decrease the overall loss:


If , as shown in (3) by definition of maximum weighted loss discrepancy we have . We can bound the RHS of (13) as follows:


Proof of Proposition 3.


Consider distribution such that under this distribution:

We now construct a new distribution such that for this distribution. Let and be two points with loss and respectively (i.e., , ). We construct a new distribution as follows: for with probability , and and with probability each.

The maximum weighted loss discrepancy for this distribution is defined analogous to above. By the existence of groups corresponding to the singletons and with ,

We now assume an estimator exists and will show a contradiction. Let and denote two random variables corresponding to the estimates of and respectively; set , we have:


Now for some Set then we have:


which contradicts with the assumption that . ∎

Proof of Proposition 4.1.


We first prove .

We assume a more general case . We subtract the average loss, , from the loss of each point to center the losses and make the average loss (without changing the MWLD). First note that if , then . If , let be the cumulative density function (CDF) of .


We now derive an upper bound for ; we can compute an upper bound for in a similar way. For an arbitrary , consider the group of individuals with loss . By bounding the weighted loss of this group using MWLD, we obtain a bound on . For brevity, let .


Using integration by parts, we can express using . We then bound using the above upper bound (28).


By computing a similar bound for , the following holds:


Setting we have:


Now we prove .

We prove for any , , using law of total variance. Note that .


We now use the fact .


Corollary B. For any measurable loss function , the following holds:


Let be the random variable indicating the different values of sensitive attributes (), with the following distribution: ; and loss on point is defined as the expected loss of all point with sensitive attribute , formally: . As is only defined on we can now use in the Proposition 4.1 to prove the corollary. ∎

Appendix C Generalization bounds

[Maurer and Pontil [2009]] Let , and denote the training data. Let be the losses of individuals in the training data, with values in . Let denote the empirical variance, then for any , we have


[Hoeffding’s inequality] Let , be i.i.d. random variables with values in and let . Then with probability at least we have:


As a caveat, empirical coarse loss variance converges to its population counterpart more slowly than loss variance. Let be the number of different settings of the sensitive attributes (). As an example, if we have two sensitive attributes with each having different values, then and . Empirical coarse loss variance converges to the population coarse variance as , while empirical loss variance converges to its population counterpart as . In the following theorem, we state this bound formally and prove it.

Let denote the number of training data points; and be the loss function. Let be the set of all possible settings of sensitive features, and be the number of possible settings. Then for any , and , with probability of at least , the following holds:


Throughout the proof, we use the fact that if two random variables and their difference are bounded, then the difference of their squares is bounded as well. Formally, for any we have:


For simplifying the notations, let and denote and let respectively.


Using Theorem C and (48) while considering , with probability at least , (i) is less than .

We now compute an upper bound for (iii). First note that: and similarly, .


We derived (55), using Theorem C and (48) considering .

Finally, we compute an upper bound for (ii). Let denote the number of times that sensitive attributes setting of appeared in the training data.