Despite the successes of machine learning at complex tasks that involve making predictions about people, there is growing evidence that “state-of-the-art” models can perform significantly less accurately on minority populations than on the majority population. Indeed, a notable study of three commercial face recognition systems, known as the “Gender Shades” project [BG18], demonstrated significant performance gaps across different populations at classification tasks. While all systems achieved roughly 90% accuracy at gender detection on a popular benchmark, a closer investigation revealed that the systems were significantly less accurate on female subjects compared to males, and on dark-skinned individuals compared to light-skinned individuals. Worse yet, these discrepancies in accuracy compounded when comparing dark-skinned females to light-skinned males; classification accuracy differed between these groups by as much as 34%! This study confirmed empirically the intuition that machine-learned classifiers may optimize predictions to perform well on the majority population, inadvertently hurting performance on the minority population in significant ways.
A first approach to addressing this serious problem would be to update the training distribution to reflect the distribution of people, making sure historically-underrepresented populations are well-represented in the training data. While this approach may be viewed as an eventual goal, data from certain minority populations is often less available than data from the majority population, for historical and social reasons. In particular, we may not immediately have enough data from these underrepresented subpopulations to train a complex model. Additionally, even when adequate representative data is available, this process necessitates retraining the underlying prediction model. In the common setting where the learned model is provided as a service, like a commercial image recognition system, there may not be sufficient incentive (financial, social, etc.) for the service provider to retrain the model. Still, the clients of the model may want to improve the accuracy of the resulting predictions across the population, even when they are not privy to the inner workings of the prediction system.
At a high level, our work focuses on a setting, adapted from [HKRR18], that is common in practice but distinct from much of the other literature on fairness in classification. We are given black-box access to a classifier, $f_0$, and a relatively small “validation set” of labeled samples drawn from some representative distribution $\mathcal{D}$; our goal is to audit $f_0$ to determine whether the predictor satisfies a strong notion of subgroup fairness, multiaccuracy. Multiaccuracy requires (in a sense that we make formal in Section 2) that predictions be unbiased, not just overall, but on every identifiable subpopulation. If auditing reveals that the predictor does not satisfy multiaccuracy, we aim to post-process $f_0$ to produce a new classifier $f$ that is multiaccurate, without adversely affecting the subpopulations where $f_0$ was already accurate.
Even if the initial classifier $f_0$ was trained in good faith, it may still exhibit biases on significant subpopulations when evaluated on samples from $\mathcal{D}$. This setting can arise when minority populations are underrepresented in the distribution $\mathcal{D}'$ used to train $f_0$ compared to the desired distribution $\mathcal{D}$, as in the Gender Shades study [BG18]. In general, we make no assumptions about how $f_0$ was trained. In particular, $f_0$ may be an adversarially-chosen classifier, which explicitly aims to give erroneous predictions within some protected subpopulation while satisfying marginal statistical notions of fairness. Indeed, the influential work on “Fairness Through Awareness” [DHP12], followed by [KNRW17, HKRR18], demonstrated the weakness of statistical notions of fairness (such as statistical parity, equalized odds, and calibration), showing that prediction systems can exhibit material forms of discrimination against protected populations even though they satisfy statistical fairness conditions. Left unaddressed, such forms of discrimination may discourage the participation of minority populations, leading to further underrepresentation of these groups. Thus, our goal will be to mitigate systematic biases broadly enough to handle both inadvertent and malicious forms of discrimination.
We investigate a notion of fairness – multiaccuracy – originally proposed in [HKRR18], and develop a framework for auditing and post-processing for multiaccuracy. We develop a new algorithm, Multiaccuracy Boost, where a simple learning algorithm – the auditor – is used to identify subpopulations of $\mathcal{X}$ on which $f_0$ is systematically making more mistakes. This information is then used to iteratively post-process $f_0$ until the multiaccuracy condition – unbiased predictions on each identifiable subgroup – is satisfied. Our notion of multiaccuracy differs from parity-based notions of fairness, and is reasonable in settings such as gender detection where we would like to boost the classifier’s accuracy across many subgroups. We prove convergence guarantees for Multiaccuracy Boost and show that post-processing for multiaccuracy may actually improve the overall classification accuracy. We describe the post-processing algorithm in Section 3.
Empirically, we validate Multiaccuracy Boost in several different case studies: gender detection from images as in Gender Shades [BG18], a semi-synthetic medical diagnosis task, and adult income prediction. In all three cases, we use standard initial prediction models that achieve good overall classification error but exhibit biases against significant subpopulations. After post-processing, the accuracy improves across these minority groups, even though minority status is not explicitly given to the post-processing algorithm as a feature. As long as there are features in the audit set correlated with the (unobserved) human categories, Multiaccuracy Boost is effective at improving the classification accuracy across these categories. As suggested by the theory, Multiaccuracy Boost actually improves the overall accuracy by identifying subpopulations where the initial models systematically erred; further, post-processing does not significantly affect performance on groups where accuracy was already high. We show that Multiaccuracy Boost, which accesses $f_0$ only as a black box, performs comparably to, and sometimes even better than, very strong white-box alternatives that have full access to $f_0$. These results are reported in Section 4.
In Section 4.1, we explore the gender detection example further, investigating some of the practical aspects of multiaccuracy auditing and post-processing. In particular, we observe that the representation of images used for auditing (and post-processing) matters; we show that auditing is more effective when using an embedding of the images trained with an unsupervised autoencoder than when using the internal representation of the neural network used for prediction. These findings seem consistent with the guiding philosophy, put forth by [DHP12], that maintaining “awareness” is paramount to detecting unfairness. We also show that the auditing process, which we use algorithmically as a way to boost the accuracy of the classifier, can also be used to help people understand why their prediction models are making mistakes. Specifically, the output of the multiaccuracy auditor can be used to produce examples of inputs where the predictor errs significantly; this provides a human-interpretable view of the biases of the original classifier.
2 Setting and multiaccuracy
Let $\mathcal{X}$ denote the input space; we denote by $f^* : \mathcal{X} \to [0,1]$ the function that maps inputs to their label. Let $\mathcal{D}$ represent the validation data distribution, supported on $\mathcal{X}$; the distribution $\mathcal{D}$ can be viewed as the “true” distribution, on which we will evaluate the accuracy of the final model. In particular, we assume that the important subpopulations are sufficiently represented on $\mathcal{D}$ (cf. Remark on data distribution). Our post-processing learner receives as input a small sample of labeled validation data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i = f^*(x_i)$, as well as black-box access to an initial regression / classification model $f_0 : \mathcal{X} \to [0,1]$. The goal is to output a new model $f : \mathcal{X} \to [0,1]$ (using calls to $f_0$) that satisfies the multiaccuracy fairness conditions (described below).
Importantly, we make no further assumptions about $f_0$. Typically, we will think of $f_0$ as the output of a learning algorithm, trained on some other distribution $\mathcal{D}'$ (also supported on $\mathcal{X}$); in this scenario, our goal is to mitigate any inadvertently-learned biases. That said, another important setting assumes that $f_0$ is chosen adversarially to discriminate against a protected population of individuals, while aiming to appear accurate and fair on the whole; here, we aim to protect subpopulations against malicious misclassification. The formal guarantees of multiaccuracy provide meaningful protections from both of these important forms of discrimination.
For a subset $S \subseteq \mathcal{X}$, we use $x \sim \mathcal{D}|_S$ to denote a sample from $\mathcal{D}$ conditioned on membership in $S$. We take the characteristic function of $S$ to be $\chi_S(x) = 1$ if $x \in S$ and $\chi_S(x) = 0$ otherwise. For a hypothesis $f : \mathcal{X} \to [0,1]$, we denote the classification error of $f$ with respect to a subset $S$ as $\mathrm{er}_S(f) = \Pr_{x \sim \mathcal{D}|_S}[\hat{f}(x) \neq f^*(x)]$, where $\hat{f}(x)$ rounds $f(x)$ to $\{0,1\}$. For a function $h : \mathcal{X} \to \mathbb{R}$ and a subset $S$, let $h|_S$ be the restriction to $S$, where $h|_S(x) = h(x)$ if $x \in S$ and $h|_S(x) = 0$ otherwise. We use $\ell_S(f)$ to denote the expected cross-entropy loss of $f$ on $S$, where $\ell_S(f) = \mathbb{E}_{x \sim \mathcal{D}|_S}\big[-f^*(x)\log f(x) - (1 - f^*(x))\log(1 - f(x))\big]$.
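As a concrete illustration of this notation, the empirical analogues of $\mathrm{er}_S(f)$ and $\ell_S(f)$ on a finite sample can be sketched in a few lines of numpy (the function and variable names here are ours, purely for illustration):

```python
import numpy as np

def classification_error(f_vals, y, mask=None):
    """Empirical er_S(f): fraction of examples where round(f) disagrees
    with the label. f_vals are predictions in [0, 1], y are labels in
    {0, 1}, and mask is an optional boolean array selecting the subset S."""
    f_vals, y = np.asarray(f_vals, dtype=float), np.asarray(y, dtype=float)
    if mask is not None:
        f_vals, y = f_vals[mask], y[mask]
    return float(np.mean((f_vals >= 0.5).astype(float) != y))

def cross_entropy(f_vals, y, mask=None, eps=1e-12):
    """Empirical ell_S(f): average cross-entropy loss of f on the subset S."""
    f_vals, y = np.asarray(f_vals, dtype=float), np.asarray(y, dtype=float)
    if mask is not None:
        f_vals, y = f_vals[mask], y[mask]
    f_vals = np.clip(f_vals, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-y * np.log(f_vals) - (1 - y) * np.log(1 - f_vals)))
```

For example, a model can have moderate error overall yet zero error on a subset $S$, which is exactly the distinction the subscripted quantities track.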
The goal of multiaccuracy is to achieve low classification error, not just on $\mathcal{D}$ overall, but also on subpopulations of $\mathcal{X}$. This goal is formalized in the following definition, adapted from [HKRR18].
Definition (Multiaccuracy). Let $\alpha \geq 0$ and let $\mathcal{C}$ be a class of functions $c : \mathcal{X} \to [-1, 1]$. A hypothesis $f : \mathcal{X} \to [0,1]$ is $(\mathcal{C}, \alpha)$-multiaccurate if for all $c \in \mathcal{C}$:
$$\left|\, \mathbb{E}_{x \sim \mathcal{D}}\big[c(x) \cdot (f(x) - f^*(x))\big] \,\right| \leq \alpha.$$
$(\mathcal{C}, \alpha)$-multiaccuracy guarantees that a hypothesis appears unbiased according to a class of statistical tests defined by $\mathcal{C}$. As an example, we could define the class in terms of a collection of subsets $\mathcal{S} \subseteq 2^{\mathcal{X}}$, taking $\mathcal{C}$ to contain the characteristic function $\chi_S$ (and its negation $-\chi_S$) for each subset $S$ in the collection; in this case, $(\mathcal{C}, \alpha)$-multiaccuracy guarantees that for each $S \in \mathcal{S}$, the predictions of $f$ are at most $\alpha$-biased.
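On a finite sample, the definition can be checked directly. The sketch below (our own illustrative code, not from the paper) represents $\mathcal{C}$ as rows of test values $c(x_i)$ – e.g. $\chi_S$ and $-\chi_S$ for each subset $S$ in a collection – and checks the empirical bias condition:

```python
import numpy as np

def is_multiaccurate(f_vals, y, tests, alpha):
    """Empirical (C, alpha)-multiaccuracy check: for every test c in C,
    the bias |mean(c(x) * (f(x) - f*(x)))| must be at most alpha.

    tests: array of shape (num_tests, n) holding c(x_i) in [-1, 1],
    e.g. rows chi_S and -chi_S for each subset S in a collection."""
    residual = np.asarray(f_vals, dtype=float) - np.asarray(y, dtype=float)
    # One empirical expectation E[c(x) * (f(x) - f*(x))] per test.
    biases = np.asarray(tests, dtype=float) @ residual / len(residual)
    return bool(np.all(np.abs(biases) <= alpha))
```

Note that the expectation is taken over all of $\mathcal{D}$, so a test $\chi_S$ constrains the bias contributed by $S$, weighted by the density of $S$.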
Ideally, we would hope to take $\mathcal{C}$ to be the class of all statistical tests. Requiring multiaccuracy with respect to such a $\mathcal{C}$, however, requires learning the function $f^*$ exactly, which is information-theoretically impossible from a small sample. In practice, if we take $\mathcal{C}$ to be a learnable class of functions, then $(\mathcal{C}, \alpha)$-multiaccuracy guarantees accuracy on all efficiently-identifiable subpopulations.
For instance, if we took $\mathcal{C}$ to be the class of width-4 conjunctions, then multiaccuracy guarantees unbiasedness, not just on the marginal populations defined by race and, separately, gender, but on the subpopulations defined by the intersection of race, gender, and two other (possibly “unprotected”) features. In particular, the subpopulations that multiaccuracy protects can be overlapping and include groups beyond traditionally-protected populations. This form of computationally-bounded intersectionality provides strong protections against forms of discrimination, like subset targeting, discussed in [DHP12, HKRR18].
2.2 Classification accuracy from multiaccuracy
Multiaccuracy guarantees that the predictions of a classifier appear unbiased on a rich class of subpopulations; ideally, though, we would state a guarantee in terms of the classification accuracy, not just the bias. Intuitively, as we take $\mathcal{C}$ to define a richer class of tests, the guarantees of multiaccuracy become stronger. This intuition is formalized in the following proposition.
Proposition 1. Let $\gamma, \tau > 0$, and define $\bar{f}^*(x)$ as $2f^*(x) - 1$. Suppose that for $S \subseteq \mathcal{X}$ with $\Pr_{x \sim \mathcal{D}}[x \in S] \geq \gamma$, there is some $c \in \mathcal{C}$ such that $\mathbb{E}_{x \sim \mathcal{D}|_S}[c(x) \cdot \bar{f}^*(x)] \geq 1 - \tau$. Then if $f$ is $(\mathcal{C}, \alpha)$-multiaccurate, $\mathrm{er}_S(f) \leq \tau + 2\alpha/\gamma$.
That is, if there is a function in $\mathcal{C}$ that correlates well with the label function on a significant subpopulation $S$, then multiaccuracy translates into a guarantee on the classification accuracy on this subpopulation.
Remark on data distribution.
Note that in our definition of multiaccuracy, we take an expectation over the distribution $\mathcal{D}$ of validation data. Ideally, $\mathcal{D}$ should reflect the true population distribution, or could be aspirational, increasing the representation of populations who have experienced historical discrimination; for instance, the classification error guarantee of Proposition 1 improves as $\gamma$, the density of the protected subpopulation $S$, grows. Still, if we take the multiaccuracy error parameter $\alpha$ small enough, then we may still hope to improve the accuracy on less-represented subpopulations. Our experiments suggest that applying the multiaccuracy framework with an unbalanced validation distribution may still help improve the accuracy on underrepresented groups.
2.3 Auditing for multiaccuracy
With the definition of $(\mathcal{C}, \alpha)$-multiaccuracy in place, a natural question to ask is how to test whether a hypothesis $f$ satisfies the definition; further, if $f$ does not satisfy $(\mathcal{C}, \alpha)$-multiaccuracy, can we update $f$ efficiently to satisfy the definition, while maintaining the overall accuracy? We will use a learning algorithm to audit a classifier for multiaccuracy. The algorithm receives a small sample from $\mathcal{D}$ and aims to learn a function $h$ that correlates with the residual function $f^* - f$. In Section 3, we describe how to use such an auditor to solve the post-processing problem. This connection between subpopulation fairness and learning is also made in [KNRW17, HKRR18, KRR18], albeit for different tasks.
Definition (Multiaccuracy auditing).
Let $\alpha \geq 0$, and let $\mathcal{A}$ be a learning algorithm. Suppose $D \sim \mathcal{D}^m$ is a set of $m$ independent samples. A hypothesis $f$ passes $(\mathcal{A}, \alpha)$-multiaccuracy auditing if, for $h \leftarrow \mathcal{A}(D)$:
$$\mathbb{E}_{x \sim \mathcal{D}}\big[h(x) \cdot (f(x) - f^*(x))\big] \leq \alpha.$$
A special case of $(\mathcal{A}, \alpha)$-multiaccuracy auditing uses a naive learning algorithm that iterates over the statistical tests in a class $\mathcal{C}$. Concretely, in our experiments, we audit with ridge regression and decision tree regression; both auditors are effective at identifying subpopulations on which the model is underperforming. Further, in the image recognition setting, we show that auditing can be used to produce interpretable synopses of the types of mistakes a predictor makes.
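To make the ridge-regression auditor concrete, here is a minimal numpy sketch (our own closed-form stand-in for an off-the-shelf ridge solver; the feature matrix, regularization parameter, and return values are illustrative choices, not the paper's exact procedure). It fits a linear function $h$ of the features to the residual $f - f^*$ and reports the empirical correlation $\mathbb{E}[h(x)\cdot(f(x) - f^*(x))]$ it achieves:

```python
import numpy as np

def audit(X, f_vals, y, lam=1.0):
    """Ridge-regression auditor: fit h(x) = <w, x> + b to the residual
    f(x) - f*(x), returning (w, b, correlation). A large correlation
    flags a direction in feature space -- i.e., a subpopulation --
    on which f is systematically biased."""
    X = np.asarray(X, dtype=float)
    residual = np.asarray(f_vals, dtype=float) - np.asarray(y, dtype=float)
    Xb = np.hstack([X, np.ones((len(X), 1))])           # absorb the bias term
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0                                   # don't penalize the bias
    w = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ residual)  # closed-form ridge
    h = Xb @ w
    return w[:-1], w[-1], float(np.mean(h * residual))
```

If the returned correlation exceeds the threshold $\alpha$, the fitted $h$ is exactly the kind of function the post-processing algorithm of Section 3 uses to update the model.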
3 Post-processing for multiaccuracy
Here, we describe an algorithm, Multiaccuracy Boost, for post-processing a pre-trained model to achieve multiaccuracy. The algorithm is given black-box access to an initial hypothesis $f_0$ and a learning algorithm $\mathcal{A}$, and for any accuracy parameter $\alpha > 0$, outputs a hypothesis $f$ that passes $(\mathcal{A}, \alpha)$-multiaccuracy auditing. The post-processing algorithm is an iterative procedure, similar to boosting [FS95, SF12], that uses the multiplicative weights framework to improve suboptimal predictions identified by the auditor. This approach is similar to the algorithm given in [HKRR18] in the context of fairness and in [TTV09] in the context of pseudorandomness. Importantly, we adapt these algorithms so that Multiaccuracy Boost exhibits what we call the “do-no-harm” guarantee; informally, if $f_0$ has low classification error on some subpopulation $S$ identified by $\mathcal{A}$, then the resulting classification error on $S$ cannot increase significantly. In this sense, achieving our notion of fairness need not adversely affect the utility of the classifier.
A key algorithmic challenge is to learn a multiaccurate predictor without overfitting to the small sample of validation data. In theory, we prove bounds on the sample complexity necessary to guarantee good generalization as a function of the class $\mathcal{C}$, the error parameter $\alpha$, and the size $\gamma$ of the subpopulations we wish to protect. In practice, we need to balance the choice of $\mathcal{A}$ (or $\mathcal{C}$) and the number of iterations of our algorithm to make sure that the auditor is discovering true signal, rather than noise, in the validation data. Indeed, if the auditor learns an expressive enough class of functions, then our algorithm will start to overfit at some point; we show empirically that multiaccuracy post-processing improves the generalization error before overfitting. Next, we give an overview of the algorithm, and state its formal guarantees in Section 3.1.
At a high level, Multiaccuracy Boost starts by partitioning the input space based on the initial classifier into $X_0 = \{x : f_0(x) < 1/2\}$ and $X_1 = \{x : f_0(x) \geq 1/2\}$; note that we can compute this partition simply by calling $f_0$. Partitioning the search space based on the predictions of $f_0$ helps to ensure that the $f$ we output maintains the initial accuracy of $f_0$; in particular, it allows us to search over just the positive-labeled examples (negative, resp.) for a way to improve the classifier. The initial hypothesis $f_0$ may make false positive predictions and false negative predictions for very different reasons, even if in both cases the reason is simple enough to be identified by the auditor.
After partitioning the input space, the procedure iteratively uses the learning algorithm $\mathcal{A}$ to search over $\mathcal{X}$ (and within the partitions $X_0$ and $X_1$) for any function $h$ which correlates significantly with the current residual in prediction, $f^* - f_t$. If $\mathcal{A}$ successfully returns some function $h$ that identifies a significant subpopulation where the current hypothesis is inaccurate, the algorithm updates the predictions multiplicatively according to $h$. In order to update the predictions simultaneously for all $x \in \mathcal{X}$, at the $t$-th iteration we build $f_{t+1}$ by incorporating $h$ into the previous model $f_t$. This approach of augmenting the model at each iteration is similar to boosting. To guarantee good generalization, we assume that $\mathcal{A}$ uses a fresh sample per iteration. In practice, when we have few samples, we can put all of our samples in one batch and use noise-addition techniques to reduce overfitting [DFH15, RZ16]; this connection to adaptive data analysis was studied formally in [HKRR18].
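The iterative structure of this loop can be sketched on a fixed audit sample as follows. This is a simplified illustration, not Algorithm 3 itself: we omit the partition into $X_0$ and $X_1$, take the auditor as a callback returning per-example scores, and use a multiplicative-weights update on the odds; the paper's exact update and resampling scheme differ.

```python
import numpy as np

def multiaccuracy_boost(f0_vals, y, auditor, alpha=0.01, eta=0.1, max_iters=50):
    """Sketch of the Multiaccuracy Boost loop on a fixed audit sample.

    f0_vals: initial predictions in (0, 1); y: labels in {0, 1};
    auditor(f_vals, y) returns per-example scores h(x_i) (its guess at
    the residual). Predictions are updated multiplicatively on the odds,
    f <- f * exp(-eta * h) / normalizer, until the auditor's correlation
    with the residual falls below alpha (the auditing condition)."""
    f = np.clip(np.asarray(f0_vals, dtype=float), 1e-6, 1 - 1e-6)
    y = np.asarray(y, dtype=float)
    for _ in range(max_iters):
        h = auditor(f, y)
        if np.mean(h * (f - y)) <= alpha:   # passes multiaccuracy auditing
            break
        # Multiplicative-weights update on the odds, mapped back to (0, 1).
        odds = f / (1 - f) * np.exp(-eta * h)
        f = np.clip(odds / (1 + odds), 1e-6, 1 - 1e-6)
    return f
```

With an idealized auditor that returns the exact residual, the loop drives the biased predictions toward their labels and halts once the auditing condition is met.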
From the stopping condition, it is clear that when the algorithm terminates, the output $f$ passes $(\mathcal{A}, \alpha)$-multiaccuracy auditing. Thus, it remains to bound the number of iterations before Multiaccuracy Boost terminates. Additionally, as described, the algorithm evaluates statistics like $\mathbb{E}_{x \sim \mathcal{D}}[h(x) \cdot (f(x) - f^*(x))]$, which depend on the entire distribution $\mathcal{D}$; we need to bound the number of validation samples required to guarantee good generalization to the unseen population. Theorem 2 provides formal guarantees on the convergence rate and on the sample complexity from $\mathcal{D}$ needed to estimate these expectations sufficiently accurately.
Do no harm.
The distinction between our approach and most prior works on fairness (especially [KNRW17]) is made clear by the “do-no-harm” property that Multiaccuracy Boost exhibits, stated formally as Theorem 3. In a nutshell, the property guarantees that on any subpopulation that $\mathcal{A}$ audits, the classification error cannot increase significantly from $f_0$ to the post-processed classifier $f$. Further, the bound we prove is very pessimistic: both in theory and in practice, we do not expect any increase to occur. In particular, the convergence analysis of Multiaccuracy Boost follows by showing that at every update, the average cross-entropy loss on the population we update must drop significantly. Termination is guaranteed because, after too many iterations of auditing, the post-processing would have learned the labels perfectly. Thus, if we use Algorithm 3 to post-process a model that already achieves high accuracy on the validation distribution, the resulting model’s accuracy should not deteriorate in significant ways; empirically, we observe that classification accuracy (on a held-out test set) tends to improve over $f_0$ after multiaccuracy post-processing.
3.1 Formal guarantees of Multiaccuracy Boost
For clarity of presentation, we describe the formal guarantees of our algorithm assuming that $\mathcal{A}$ provably agnostically learns a class of tests $\mathcal{C}$, in order to describe the sample complexity appropriately. The guarantees on the rate of convergence do not rely on such an assumption. We show that, indeed, Algorithm 3 must converge in a bounded number of iterations. The proof follows by showing that, for an appropriately chosen step size $\eta$ (on the order of $\alpha$), each update improves the cross-entropy loss over the updated set, so the bound depends on the initial cross-entropy loss of $f_0$.
To estimate the statistics used in Multiaccuracy Boost, we need to bound the sample complexity required for the auditor to generalize. Informally, we say the dimension $d(\mathcal{C})$ of an agnostically learnable class $\mathcal{C}$ is a value such that $O\big((d(\mathcal{C}) + \log(1/\delta))/\alpha^2\big)$ random samples from $\mathcal{D}$ guarantee uniform convergence over $\mathcal{C}$ with accuracy $\alpha$ and with failure probability at most $\delta$. Examples of upper bounds on this notion of dimension include $\log|\mathcal{C}|$ for a finite class of tests, the VC-dimension [KV94] for boolean tests, and the metric entropy [BLM13] for real-valued tests. We state the formal guarantees as Theorem 2.
Theorem 2. Let $\alpha > 0$; suppose $\mathcal{A}$ agnostically learns a class $\mathcal{C}$ of dimension $d(\mathcal{C})$. Then, using a step size $\eta = \Theta(\alpha)$, Algorithm 3 converges to a $(\mathcal{C}, \alpha)$-multiaccurate hypothesis in $O(1/\alpha^2)$ iterations from $m = \tilde{O}\big((d(\mathcal{C}) + \log(1/\delta))/\alpha^2\big)$ random samples, with probability $1 - \delta$.
Roughly speaking, for constant $\alpha$, the sample complexity scales with the dimension of the class $\mathcal{C}$. For many relevant classes for which we would want to enforce $(\mathcal{C}, \alpha)$-multiaccuracy, this dimension will be significantly smaller than the amount of data required to train an accurate initial model $f_0$. Note also that our sample complexity bound is completely generic, and we make no effort to optimize the exact bound. In particular, for structured $\mathcal{C}$ and $\mathcal{D}$, better uniform convergence bounds can be proved. Further, appealing to a recent line of work on adaptive data analysis initiated by [DFH15, RZ16], we can avoid resampling at each iteration, as in [HKRR18].
Do no harm.
The do-no-harm property guarantees that the classification error on any subpopulation that $\mathcal{A}$ audits cannot increase significantly. As we assume $\mathcal{A}$ can identify a very rich class of overlapping sets, in aggregate this property gives a strong guarantee on the utility of the resulting predictor. Further, the proof of Theorem 3 reveals that this worst-case bound is very pessimistic and can be improved with stronger assumptions.
Theorem 3 (Do-no-harm).
Let $\alpha > 0$ and let $S \subseteq \mathcal{X}$ be a subpopulation where $\Pr_{x \sim \mathcal{D}}[x \in S] \geq \gamma$. Suppose $\mathcal{A}$ audits the characteristic function $\chi_S$ and its negation $-\chi_S$. Let $f$ be the output of Algorithm 3 when given $f_0$, $\mathcal{A}$, and $\alpha$ as input. Then the classification error of $f$ on the subset $S$ is bounded as $\mathrm{er}_S(f) \leq \mathrm{er}_S(f_0) + O(\alpha/\gamma)$.
Derivative learning for faster convergence
Here, we propose auditing with an algorithm $\mathcal{A}_\partial$, described formally in Algorithm C in the Appendix, that aims to learn a smoothed version of the partial derivative of the cross-entropy loss with respect to the predictions, $\frac{\partial \ell(f(x))}{\partial f(x)} = \frac{f(x) - f^*(x)}{f(x)(1 - f(x))}$, which grows in magnitude as the residual $|f(x) - f^*(x)|$ grows. We show that running Multiaccuracy Boost with $\mathcal{A}_\partial$ converges in a number of iterations that grows with $\log(1/\alpha)$, instead of polynomially in $1/\alpha$, as we would expect for a smooth, strongly convex objective [SS12, B15]. This sort of gradient method does not typically make sense when we do not have information about $f^*(x)$ for all $x$; nevertheless, if there is a simple way to improve $f$, we might hope to learn the partial derivative as a function of $x$ in order to update $f$. This application of Multiaccuracy Boost is similar in spirit to gradient boosting techniques [MBBF00, Fri01], which interpret boosting algorithms as running gradient descent on an appropriate cost-functional.
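For intuition, the derivative target that such an auditor fits can be sketched as follows. The clamping constant `delta` below is our own stand-in for the paper's smoothing / quadratic approximation, which tames the blow-up of the denominator when $f(x)$ approaches $0$ or $1$:

```python
import numpy as np

def smoothed_loss_derivative(f_vals, y, delta=0.1):
    """Partial derivative of the cross-entropy loss w.r.t. the prediction,
    d/df [-y log f - (1 - y) log(1 - f)] = (f - y) / (f (1 - f)),
    with the denominator clamped away from 0 (a simple stand-in for the
    smoothing used in Algorithm C to avoid outliers near f in {0, 1})."""
    f = np.asarray(f_vals, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.maximum(f * (1 - f), delta * (1 - delta))  # clamp the variance term
    return (f - y) / denom
```

The sign of this quantity indicates the direction of the needed correction, and its magnitude grows with the residual, which is what makes it a useful regression target for the auditor.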
In principle, if the magnitude of the residual $|f(x) - f^*(x)|$ is not too close to $0$ for most $x$, then the learned partial derivative function should correlate well with the true gradient. Empirically, we observe that $\mathcal{A}_\partial$ is effective at finding ways to improve the model quite rapidly. Formally, we show a linear convergence guarantee.
4 Experimental Evaluation
We evaluate the empirical performance of Multiaccuracy Boost in three case studies. The first and most in-depth case study aims to emulate the conditions of the Gender Shades study [BG18], to test the effectiveness of multiaccuracy auditing and post-processing on this important real-world example. In Section 4.1, we show experimental results for auditing using two different validation data sets. In particular, one data set is fairly unbalanced and similar to the data used to train $f_0$, while the other data set was developed in the Gender Shades study and is very balanced. For each experiment, we report, for various subpopulations, the population percentage in the audit set, the accuracy of the initial model $f_0$, the accuracy of our black-box post-processed model, and the accuracies of white-box benchmarks. In Section 4.1.1, we discuss further subtleties of applying the multiaccuracy framework involving the representation of inputs passed to $\mathcal{A}$ for auditing; in Section 4.1.2, we show how auditing can be used beyond post-processing. In particular, the hypotheses that $\mathcal{A}$ learns can be used to highlight subpopulations – in an interpretable way – on which the model is making mistakes.
We evaluate the effectiveness of multiaccuracy post-processing on two other prediction tasks. In both these case studies, we take the training and auditing data distribution to be the same, though the number of examples used for auditing will be quite small. Multiaccuracy still improves the performance on significant subpopulations. These results suggest some interesting hypotheses about why machine-learned models may incorporate biases in the first place, which warrant further investigation.
4.1 Multiaccuracy improves gender detection
In this case study, we replicate the conditions of the Gender Shades study [BG18] to evaluate the effectiveness of the multiaccuracy framework in a realistic setting. For our initial model, we train an inception-resnet-v1 [SIVA17] gender classification model using the CelebA data set with more than 200,000 face images [LLWT15]. The resulting test accuracy on CelebA for binary gender classification is 98.4%.
We applied Multiaccuracy Boost to this $f_0$ using two different auditing distributions. In the first case, we audit using data from the LFW+a set [WHT11, HRBLM07] (we fixed the original data set’s label noise for sex and race), which has demographic breakdowns similar to CelebA’s. In the second case, we audit using the PPB data set (developed in [BG18]), which has balanced representation across gender and race. These experiments allow us to track the effectiveness of Multiaccuracy Boost as the representation of minority subpopulations changes.
In both cases, the auditor is “blind” – it is not explicitly given the race or gender of any individual – and knows nothing about the inner workings of the classifier.
Specifically, we take the auditor to be a variant of $\mathcal{A}_\partial$ (Algorithm C) that performs ridge regression to fit the smoothed partial derivative of the loss. (To help avoid outliers, we smooth the loss and use a quadratic approximation.) Instead of training the auditor on raw input pixels, we use the low-dimensional representation of the input images derived by a variational autoencoder (VAE) trained on the CelebA data set using the Facenet [SKP15] library. (For more discussion of the representation used during auditing, cf. Section 4.1.1.)
To test the initial performance of $f_0$, we evaluated it on a random subset of the LFW+a data containing 6,880 face images, each of which is labeled with both gender and race – black (B) and non-black (N). For gender classification on LFW+a, $f_0$ achieves 94.4% accuracy. Even though the overall accuracy is high, the error rate is much worse for females (23.1%) than for males (0.7%), and worse for blacks (10.2%) than for non-blacks (5.1%); these results are qualitatively very similar to those observed for the commercial gender detection systems studied in [BG18]. We applied Multiaccuracy Boost, which converged in 7 iterations. The resulting classifier’s classification error on minority subpopulations was substantially reduced, even though the auditing distribution was similar to the training distribution.
We compare Multiaccuracy Boost against a strong white-box baseline. Here, we retrain the network of $f_0$ using the audit set. Specifically, we retrain the last two layers of the network, which gives the best results amongst retraining methods. We emphasize that this baseline requires white-box access to $f_0$, which is often infeasible in practice. Multiaccuracy Boost accesses $f_0$ only as a black box, without any additional demographic information, and still achieves comparable, if not improved, error rates compared to retraining. We report the overall classification accuracy as well as accuracy on different subpopulations – e.g. BF indicates black female – in Table 1.
The second face data set, PPB, in addition to being more balanced, is much smaller; thus, this experiment can be viewed as a stress test evaluating the data efficiency of our post-processing technique. The test set has 415 individuals and the audit set has size 855. PPB annotates each face as dark-skinned (D) or light-skinned (L). As with LFW+a, we evaluated the test accuracy of the original $f_0$, the multiaccurate post-processed classifier, and the retrained classifier on each subgroup. Multiaccuracy Boost converged in 5 iterations and again substantially reduced error, despite the small audit set and the lack of annotations for race or skin color (Table 2).
To further test the data efficiency of Multiaccuracy Boost, we evaluate the effect of audit set size on the resulting accuracy of each method. In Fig. 1, we report the performance of Multiaccuracy Boost versus the white-box retraining method for different sizes of audit set. The plot displays the difference in accuracy for the overall population along with the subgroups that suffered the most initial bias. It shows that the performance of Multiaccuracy Boost may actually be better than the white-box retraining baseline when validation data is especially scarce.
4.1.1 Representation matters
As discussed earlier, in the reported gender detection experiments, we audit for multiaccuracy using ridge regression over an encoding of images produced by a variational autoencoder. Using the representation of images produced by this encoding intuitively makes sense, as the autoencoder’s reconstruction objective aims to preserve as much information about the image as possible while reducing the dimension. Still, we may wonder whether multiaccuracy auditing over a different representation of the images would perform better. In particular, since we are interested in improving the accuracy on the gender detection task, it seems plausible that a representation of the images based on the internal layers of the initial prediction network might preserve the information salient to gender detection more effectively.
We investigate the importance of the representation used to audit empirically. In particular, we also evaluate the performance of Multiaccuracy Boost using the same auditor run over two other sets of features, given by the last layer and the second-to-last layer of the initial prediction residual network $f_0$. In Table 3, we show that using the unsupervised VAE representation yields the best results. Still, the representations from the last and second-to-last layers are competitive with that of the VAE, and on some subpopulations are better.
Collectively, these findings bolster the argument for “fairness through awareness,” which advocates that, in order to make fair predictions, sensitive information (like race or gender) should be given to the (trustworthy) classifier. While none of these representations explicitly encodes sensitive group information, the VAE representation does preserve information about the original input – for instance, skin color – that seems useful in understanding group status. The prediction network is trained to maximize prediction accuracy (on an unbalanced training data set), and thus its internal representations may reasonably contain less information about these sensitive features. These results suggest that the effectiveness of multiaccuracy does depend on the representation of inputs used for auditing, but so long as the representation is sufficiently expressive, Multiaccuracy Boost may be robust to the exact encoding of the features.
4.1.2 Multiaccuracy auditing as diagnostic
Consistent with [BG18], we’ve demonstrated that models trained in good faith on unbalanced data may exhibit significant biases on the minority populations. For instance, the initial classification error on black females is significant, whereas on white males it is near zero. Importantly, the only way we were able to report these accuracy disparities was by having access to a rich data set where gender and race were labeled. Often, this demographic information will not be available; indeed, the CelebA images are not labeled with race information, and as such, we were unable to evaluate the subpopulation classification accuracy on this set. Thus, practitioners may be faced with a problem: even if they know their model is making undesirable mistakes, it may not be clear whether these mistakes are concentrated on specific subpopulations. Absent any identification of the subpopulations on which the model is underperforming, collecting additional training data may not actually improve performance across the board.
We demonstrate that multiaccuracy auditing may serve as an effective diagnostic and interpretation tool, helping developers identify systematic biases in their models. The idea is simple: the auditor returns a hypothesis that essentially “scores” individual inputs by how wrong the prediction is. By examining the inputs whose scores have the largest magnitude, we may better understand the biases that the auditor is discovering.
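This scoring idea can be illustrated with a minimal sketch. Here the auditor is a hypothetical depth-1 regression stump fit to the residuals f(x) − y over a toy one-dimensional validation set; none of these choices come from the experiments above.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 threshold auditor h(x) = a if x < t else b by
    minimizing squared error against the residuals f(x) - y."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        a = sum(left) / len(left) if left else 0.0
        b = sum(right) / len(right) if right else 0.0
        err = sum((r - (a if x < t else b)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, a, b)
    _, t, a, b = best
    return lambda x: a if x < t else b

# Toy validation set: f always predicts 0.9, which is right for the
# x >= 0.5 region (label 1) and badly wrong for x < 0.5 (label 0).
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
ys = [0, 0, 0, 1, 1, 1]
f = lambda x: 0.9
residuals = [f(x) - y for x, y in zip(xs, ys)]
h = fit_stump(xs, residuals)

# Ranking inputs by |h(x)| surfaces the region where f errs most.
scores = sorted(xs, key=lambda x: abs(h(x)), reverse=True)
print(scores[:3])  # the x < 0.5 inputs, where the bias concentrates
```

In the experiments, this role is played by richer auditors (decision trees, regression over learned representations); the stump here only makes the ranking mechanics concrete.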
As an example, we test this idea on the PPB data set, evaluating the test images’ representations with the hypotheses the auditor returns. In Figure 2, we display the images in the test set that receive the largest positive and negative scores, respectively, according to the first and second hypotheses returned by the auditor. In the first round of auditing, the three highest-scoring images (top-left row) are all women, both black and white. Interestingly, all of the lowest-scoring images (bottom-left row) are men in suits, suggesting that suits may be a highly predictive feature of being a man according to the original classifier. Overall, the first round of auditing seems to identify gender as the primary axis of bias in the initial model. In the second round, after the classifier has been improved by one step of Multiaccuracy Boost, the auditor seems to hone in on the “dark-skinned women” subpopulation as the region of bias, as the highest-scoring images (top-right row) are all dark-skinned women.
4.2 Additional case studies
Multiaccuracy auditing and post-processing are broadly applicable in supervised learning tasks, not just in image classification applications. We demonstrate the effectiveness of Multiaccuracy Boost in two other settings: the adult income prediction task and a semi-synthetic disease prediction task.
Adult Income Prediction
For the first case study, we use the adult income prediction data set [Koh96], with 45,222 samples and 14 attributes (after removing subjects with unknown attributes), for the task of binary prediction of income above $50k. We remove the sensitive features of gender (female (F) and male (M)) and race (restricted to the two major groups, black (B) and white (W)) from the data to simulate settings where sensitive features are not available to the training algorithm. We trained a base model, a neural network with two hidden layers, on 27,145 randomly selected individuals. The test set consists of an independent set of 15,060 individuals.
We audit using a depth-bounded decision tree regression model fit to the residual between the predictions and the labels. The auditor receives validation data drawn from the same distribution as the training data. In particular, we post-process with 3,017 individuals sampled from the same adult income data set (disjoint from the training set). The auditor is given the same features as the original prediction model and thus is not given the gender or race of any individual. We evaluate the post-processed classifier on the same independent test set. Multiaccuracy Boost converges in 50 iterations.
As a baseline, we trained four separate neural networks, with the same architecture as before (two hidden layers), one for each of the four subgroups, using the audit data. As shown in Table 4, multiaccuracy post-processing achieves better accuracy both in aggregate and for each of the subgroups. Importantly, the subgroup-specific models require explicit access to the sensitive features of gender and race. Training a classifier for each subgroup, or explicitly adding subgroup accuracy to the training objective, assumes that the subgroup is already identified in the data. This is not feasible in the many applications where, say, race or more granular categories are not given. Even when the subgroups are identified, we often do not have enough samples to train accurate classifiers on each subgroup separately. This example illustrates that multiaccuracy can help boost the overall accuracy of a black-box predictor in a data-efficient manner.
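A minimal sketch of this post-processing loop follows, with the decision-tree auditor replaced by a fixed list of subgroup indicator functions (an assumption made for illustration). The update is the multiplicative-weights step on the predicted odds; the toy data, the predictor, and the groups are hypothetical.

```python
import math

def multiaccuracy_boost(f, data, subgroups, alpha=0.01, eta=0.5, max_iters=100):
    """Simplified Multiaccuracy Boost: post-process predictions f(x)
    until no subgroup in the audit class has residual correlation
    exceeding alpha (measured over the whole audit sample).

    data:      list of (x, y) pairs with y in {0, 1}
    subgroups: the audit class -- here, a list of indicator functions
    """
    preds = {x: f(x) for x, _ in data}
    for _ in range(max_iters):
        updated = False
        for g in subgroups:
            members = [(x, y) for x, y in data if g(x)]
            if not members:
                continue
            # correlation of the residual with the subgroup indicator
            corr = sum(preds[x] - y for x, y in members) / len(data)
            if abs(corr) > alpha:
                sign = 1.0 if corr > 0 else -1.0
                for x, _ in members:
                    p = preds[x]
                    # multiplicative-weights update on the odds p/(1-p)
                    w = p * math.exp(-eta * sign)
                    preds[x] = w / (w + (1 - p) * math.exp(eta * sign))
                updated = True
        if not updated:  # every subgroup passes the audit
            break
    # Returns post-processed predictions for the audited points only.
    return lambda x: preds[x]

# Toy example: f0 predicts 0.9 everywhere, which is accurate for the
# group with x[1] == 1 (label 1) but badly biased for x[1] == 0 (label 0).
data = [((i, 0), 0) for i in range(5)] + [((i, 1), 1) for i in range(5)]
f0 = lambda x: 0.9
subgroups = [lambda x: x[1] == 0, lambda x: x[1] == 1]
f1 = multiaccuracy_boost(f0, data, subgroups)
```

After convergence, the biased group's predictions are pulled down toward its true labels while the well-served group is left essentially intact; in the real algorithm the indicator functions are replaced by hypotheses returned by a learned auditor.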
4.2.1 Semi-Synthetic Disease Prediction
We design a disease prediction task based on real individuals, where the phenotype-to-disease relation is designed to differ across subgroups, in order to simulate a challenging setting. We use 40,000 individuals sampled from the UK Biobank [SGA15]; each individual is represented by 60 phenotype features. To generate a synthetic disease outcome for each subgroup, we divided the data set into four groups based on gender (male (M) and female (F)) and age (young (Y) and old (O)). For each subgroup, we create synthetic binary labels using a different polynomial function of the input features, with different levels of difficulty; the polynomial orders are 1, 4, 2, and 6 for the OF, OM, YF, and YM subgroups, respectively.
For the initial model, we trained a neural network with two hidden layers on 32,000 individuals, without using the gender and age features; hyperparameter search was done for the best weight-decay and drop-out parameters. The resulting model performs moderately well on every subpopulation, with the exception of old females (OF), where the classification error is significantly higher. Note that this subpopulation had the least representation in the training data. Again, we audit using decision tree regression on validation data drawn from the same distribution as training; specifically, the auditor receives a sample of 4,000 individuals without the gender or age features. As a baseline, we trained a separate classifier for each of the subgroups using the same audit data. As Table 5 shows, Multiaccuracy Boost significantly lowers the classification error on the old female population.
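The label-generation scheme can be sketched as follows; the polynomial score, the median thresholding, and the group sizes here are our own illustrative choices, not the exact construction used in the experiments.

```python
import random

def poly_score(features, order):
    """Hypothetical degree-`order` polynomial score of a feature vector."""
    return sum(x ** order for x in features) / len(features)

def median(values):
    return sorted(values)[len(values) // 2]

random.seed(0)
# Subgroup -> polynomial order (mirroring OF=1, OM=4, YF=2, YM=6).
orders = {"OF": 1, "OM": 4, "YF": 2, "YM": 6}
# 100 individuals per subgroup, 60 standard-normal phenotype features each.
population = {g: [[random.gauss(0, 1) for _ in range(60)] for _ in range(100)]
              for g in orders}

# Threshold each subgroup's scores at its median so labels are balanced.
labels = {}
for g, order in orders.items():
    sc = [poly_score(feats, order) for feats in population[g]]
    t = median(sc)
    labels[g] = [int(s > t) for s in sc]
```

Because each subgroup's labels depend on a polynomial of a different order, a single model trained without the group features faces a genuinely different prediction problem on each subgroup, which is the difficulty the experiment is designed to create.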
5 Discussion
In this work, we propose multiaccuracy auditing and post-processing as a method for improving the fairness and accountability of black-box prediction systems. Here, we discuss how our work fits into the growing literature on fairness for learning systems, and conclude with further discussion of our results and speculation about future investigations.
5.1 Related works
Many different notions of fairness have been proposed in the literature on learning and classification [DHP12, HPS16, ZWS13, DIKL17, HKRR18, KNRW17, HSNL18, KRR18, RY18]. Many of these works encode some notion of parity (e.g., different subgroups should have similar false positive rates) as an explicit objective or constraint in training the original classifier. The fairness properties are viewed as constraints on the classifier that ultimately limit the model’s utility; a common belief is that in order to achieve equitable treatment for protected subpopulations, performance on other subpopulations must necessarily degrade.
A notable exception to this pattern is the work of Hébert-Johnson et al. [HKRR18], which introduced a framework for achieving fairness notions that aim to provide accurate predictions for many important subpopulations. [HKRR18] introduced the notion of multiaccuracy (referred to there as “multi-accuracy-in-expectation”) and a stronger notion, dubbed multicalibration, in the context of regression tasks. Multicalibration guarantees (approximately) calibrated predictions, not just overall, but on a rich class of structured “identifiable” subpopulations. [HKRR18] provides theoretical algorithms for achieving multiaccuracy and multicalibration, and shows how to post-process a model to achieve multicalibration in a way that improves the regression objective (in terms of squared error) across all subpopulations. Our work directly extends the approach of [HKRR18], adapting it to the binary classification setting. Our post-processing algorithm, Multiaccuracy Boost, builds on the algorithm given in [HKRR18], providing the additional “do-no-harm” property: if the initial predictor has small classification error on some identifiable group, then the post-processed model will also have small classification error on that group.
Independent work of Kearns et al. [KNRW17] also investigated how to achieve statistical fairness guarantees, not just for traditionally-protected groups, but for rich families of subpopulations. [KNRW17] proposed a framework for auditing and learning models to achieve fairness notions like statistical parity and equal false positive rates. Both works [HKRR18, KNRW17] connect the task of learning a model that satisfies a given notion of fairness to the task of (weak) agnostic learning [Kea98, KSS94, KMV08, Fel10]. [KNRW17] also reduces the problem of learning a classifier satisfying parity-based notions of fairness across subgroups to the problem of auditing; it would be interesting to know whether their notion of auditing can be used by humans as a way to diagnose systematic discrimination.
Our approach to post-processing, which uses a learning algorithm as a fairness auditor, is similar in spirit to the approach to learning of [KNRW17], but differs technically in important ways. In particular, in the framework of [KNRW17], the auditor is used during (white-box) training to constrain the model selected from a pre-specified hypothesis class; ultimately, this constrains the accuracy of the predictions. In our setting (as in [HKRR18]), we do not restrict ourselves to an explicitly-defined hypothesis class, so we can augment the current model using the auditor; these augmentations improve the accuracy of the model.
Indeed, at a technical level, our post-processing algorithm is most similar to work on boosting [FS95, SF12], specifically gradient boosting [MBBF00, Fri01]. Still, our perspective is quite different from the standard boosting setting. Rather than using an expressive class of base classifiers to learn the target function directly, our setting focuses on the regime where data is limited and we must restrict our attention to simple classes. Thus, it becomes important to leverage the expressiveness (and initial accuracy) of the initial model if we are to obtain strong performance using the multiaccuracy approach. Further, the termination of Multiaccuracy Boost certifies that the final model satisfies -multiaccuracy; in general, standard boosting algorithms do not provide such a certificate.
Motivated by unfairness that arises as the result of feedback loops in classification settings, another recent work of Hashimoto et al. [HSNL18] aims to improve fairness at the subpopulation level. Specifically, their notion of fairness similarly aims to give accurate (i.e., bounded loss) predictions not just overall, but on all significant subpopulations. In the multiaccuracy setting, we argued that this goal was information-theoretically infeasible; [HSNL18] sidesteps this impossibility by optimizing over a fixed hypothesis class and formulating the problem as a min-max optimization. They show how to relax the problem of minimizing the worst-case subpopulation loss and reduce the relaxation to a certain robust optimization problem. While their approach does not guarantee optimality, it gives a strong certificate upper-bounding the maximum loss over all subpopulations.
A different approach to subgroup fairness is studied by Dwork et al. [DIKL17]. This work investigates the question of how to learn a “decoupled” classifier, where separate classifiers are learned for each subgroup and then combined to achieve a desired notion of fairness. While applicable in some settings, at times, this approach may be untenable. First, decoupling the classification problem requires that we have race, age, and other attributes of interest in the dataset and that the groups we wish to protect are partitioned by these attributes; this information is often not available. Even if this information is available, a priori, it may not always be obvious which subpopulations require special attention. In contrast, the multiaccuracy approach allows us to protect a rich class of overlapping subpopulations without explicit knowledge of the vulnerable populations. An interesting direction for future investigation could try to pair multiaccuracy auditing (to identify subpopulations in need of protection) with the decoupled classification techniques of [DIKL17].
The present work, along with [HKRR18, KNRW17, KRR18], can be viewed as studying information-fairness tradeoffs in prediction tasks (i.e., strengthening the notion of fairness that can be guaranteed using a small sample). These works fit into the larger literature on fairness in learning and prediction tasks [DHP12, ZWS13, BG18, HPS16, DIKL17, KRR18, RY18], as well as discussions of utility-fairness tradeoffs in fair classification [ALMK16, KMR17, Cho17, CG16, CDPF17, PRW17]. While fairness and accountability serve as the main motivations for developing the multiaccuracy framework, our results may have broader interest. In particular, multiaccuracy post-processing may be applicable in domain adaptation settings, particularly under the label distribution shift studied recently in [LWS18], when the learner gets a small number of labeled samples from the new distribution.
The multiaccuracy framework can be applied very broadly; importantly, we can post-process any initial model given only black-box access to it and a small set of labeled validation data. We show that in a wide range of realistic settings, post-processing for multiaccuracy helps to mitigate systematic biases in predictors across sensitive subpopulations, even when identifiers for these subpopulations are not given to the auditor explicitly. In our experiments, we observe that standard supervised learning optimizes for overall performance, leading to settings where certain subpopulations incur substantially worse error rates. Multiaccuracy provides a framework for fairness in classification that improves accuracy on identifiable subgroups without forcing a tradeoff between fairness and utility. We demonstrate – both theoretically and empirically – that post-processing with Multiaccuracy Boost serves as an effective tool for improving accuracy across important subpopulations, without harming the populations that are already classified well.
Multiaccuracy works to the extent that the auditor can effectively identify specific subgroups where the original classifier tends to make mistakes. The power of multiaccuracy lies in the fact that, in many settings, we can identify issues with the initial model using a relatively small amount of audit data. By the same token, multiaccuracy auditing is limited: if the mistakes appear overly complicated to the bounded auditor (for information- or complexity-theoretic reasons), then the auditor will not be able to identify them. Our empirical results suggest, however, that in many realistic settings, the subpopulations on which a classifier errs are efficiently identifiable. This observation may be of interest beyond the context of fairness. In particular, our experiments improving the accuracy of a model trained on CelebA on the LFW+a and PPB test sets suggest a lightweight, black-box alternative to more sophisticated transfer learning techniques, which may warrant further investigation.
Our empirical investigations reveal some additional interesting aspects of the multiaccuracy framework. In particular, we’ve shown that multiaccuracy auditing can identify underrepresented groups receiving suboptimal predictions even when the sensitive attributes defining these groups are not explicitly given to the auditor, which proves useful for diagnosing where models make mistakes. We believe this diagnostic capability may be of further interest within the study of model interpretability. Finally, it is striking that Multiaccuracy Boost tends to improve not just subgroup accuracy but also overall accuracy, even when the minority groups remain underrepresented in the validation data. While some of these findings may be due to suboptimal training of our initial models, we believe this is not the only factor at play. In particular, we hypothesize that understanding why models incorporate biases during training – and why simple interventions like multiaccuracy post-processing can significantly improve generalization error – requires investigating the dynamics of overfitting during training, not just on the population as a whole, but across significant subpopulations.
Acknowledgments
The authors thank Omer Reingold and Guy N. Rothblum for their advice and helpful discussions throughout the development of this work; we thank Weihao Kong, Aditi Raghunathan, and Vatsal Sharan for feedback on early drafts of this work.
References
- [ALMK16] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica, 2016.
- [B15] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- [BG18] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
- [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
- [CDPF17] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. KDD, 2017.
- [CG16] Alexandra Chouldechova and Max G’Sell. Fairer and more accurate, but for whom? FATML, 2016.
- [Cho17] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 2017.
- [DFH15] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
- [DHP12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pages 214–226, 2012.
- [DIKL17] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Decoupled classifiers for fair and efficient machine learning. arXiv preprint arXiv:1707.06613, 2017.
- [Fel10] Vitaly Feldman. Distribution-specific agnostic boosting. In Proceedings of the First Symposium on Innovations in Computer Science’10, 2010.
- [Fri01] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
- [FS95] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
- [HKRR18] Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Calibration for the (computationally-identifiable) masses. ICML, 2018.
- [HPS16] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
- [HRBLM07] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
- [HSNL18] Tatsunori B Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. ICML, 2018.
- [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
- [KMR17] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. ITCS, 2017.
- [KMV08] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 629–638. ACM, 2008.
- [KNRW17] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144v3, 2017.
- [Koh96] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In KDD, volume 96, pages 202–207. Citeseer, 1996.
- [KRR18] Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Fairness through computationally-bounded awareness. arXiv Preprint, 1803.03239, 2018.
- [KSS94] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.
- [KV94] Michael J. Kearns and Umesh Virkumar Vazirani. An introduction to computational learning theory. MIT press, 1994.
- [LLWT15] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
- [LWS18] Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola. Detecting and correcting for label shift with black box predictors. In ICML, 2018.
- [MBBF00] Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus R Frean. Boosting algorithms as gradient descent. In Advances in neural information processing systems, pages 512–518, 2000.
- [PRW17] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. On fairness and calibration. NIPS, 2017.
- [RY18] Guy N. Rothblum and Gal Yona. Probably approximately metric-fair learning. ICML, 2018.
- [RZ16] Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias via information usage. AISTATS, 2016.
- [SF12] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT press, 2012.
- [SGA15] Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779, 2015.
- [SIVA17] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
- [SKP15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
- [SS12] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
- [TTV09] Luca Trevisan, Madhur Tulsiani, and Salil Vadhan. Regularity, boosting, and efficiently simulating every high-entropy distribution. In Computational Complexity, 2009. CCC’09. 24th Annual IEEE Conference on, pages 126–136. IEEE, 2009.
- [WHT11] Lior Wolf, Tal Hassner, and Yaniv Taigman. Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE transactions on pattern analysis and machine intelligence, 33(10):1978–1990, 2011.
- [ZWS13] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 325–333, 2013.
Appendix A Multiaccuracy and classification error
Here, we prove Proposition 1.
Proposition (Restatement of Proposition 1).
Let as . Suppose that for with , there is some such that . Then if is -multiaccurate, .
For , let . Further denote . Note that the classification error on a set is .
Let and suppose where . Then, we derive the following inequality.
where (6) follows by Hölder’s inequality, from the fact that the contribution to the expectation of from and is lower bounded by , and by the definition for . Further, because we know any is misclassified, we can lower bound the contribution by . Thus, if , then by rearranging we conclude
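Since much of the notation in this argument was lost in extraction, the core inequality behind the proposition can be sketched in notation of our own choosing, where $\chi_S$ denotes the characteristic function of a subpopulation $S$:

```latex
% Sketch, in our notation: multiaccuracy over a class containing \chi_S
% bounds the bias of f on S.
Suppose $f$ is $\alpha$-multiaccurate with respect to a class $\mathcal{C}$
with $\chi_S \in \mathcal{C}$, and $\Pr_{x \sim \mathcal{D}}[x \in S] \ge \gamma$.
Multiaccuracy gives
\[
  \alpha \;\ge\; \Bigl|\, \mathbb{E}_{x \sim \mathcal{D}}
    \bigl[\chi_S(x)\,(f(x) - y(x))\bigr] \Bigr|
  \;=\; \Pr[x \in S] \cdot
    \Bigl|\, \mathbb{E}_{x \in S}\bigl[f(x) - y(x)\bigr] \Bigr|,
\]
so the average bias of $f$ on $S$ satisfies
$\bigl|\mathbb{E}_{x \in S}[f(x) - y(x)]\bigr| \le \alpha / \gamma$.
```

Bounding the bias on $S$ by $\alpha/\gamma$ is what lets the argument above translate multiaccuracy into a bound on classification error over any sufficiently large audited subpopulation.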
Theorem 3 follows by a similar argument.
Theorem (Restatement of Theorem 3).
Let and be a subpopulation where . Suppose for audits the characteristic function and its negation. Let be the output of Algorithm 3 when given , , and as input. Then the classification error of on the subset is bounded as
Suppose that . Consider ; suppose . By assumption, is audited on . Consider .
where (12) follows from applying Hölder’s inequality and the assumption that ; and (13) follows from lower bounding the contribution to the expectation based on the true label and the predicted label. Note that by the fact that passes multiaccuracy auditing by and the assumption that . Rearranging gives the following inequality
where the additional comes from accounting for the false positives.
A similar argument holds for with , using . We can expand as a convex combination of the classification error over and .
by the fact that is partitioned into and and is a corresponding convex combination of and . ∎
Appendix B Analysis of Algorithm 3
Here, we analyze the sample complexity and running time of Algorithm 3.
Theorem (Restatement of Theorem 2).
Let and suppose agnostic learns a class of dimension . Then, using , Algorithm 3 converges to a -multiaccurate hypothesis in iterations from samples with probability at least over the random samples.
b.1 Sample complexity
We essentially assume away the sample complexity issues by working with the notion of dimension. We give an example proof outline of a standard uniform convergence argument using metric entropy, as in [BLM13].
Suppose has -covering number . Then, with probability at least ,
The lemma follows from a standard uniform convergence argument. First, observe that because every and that the empirical estimate using samples has sensitivity . Thus, we can apply McDiarmid’s inequality to show concentration of the following statistic.
Then, using a standard covering argument, for the -covering number, we can bound the deviation with high probability. Specifically, taking samples guarantees that the empirical estimate for each will be within with probability at least . Taking small enough to union bound against every iteration and adjusting constants gives the lemma. ∎
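The concentration step can be sketched as follows, again in notation of our own choosing (with $m$ samples and summands bounded in $[-1,1]$):

```latex
% Sketch of the McDiarmid + covering argument (our notation).
For a fixed $h \in \mathcal{C}$, the empirical correlation
$\hat{Z}_h = \tfrac{1}{m}\sum_{i=1}^{m} h(x_i)\,(f(x_i) - y_i)$
has summands in $[-1,1]$, so changing one sample moves $\hat{Z}_h$ by at
most $2/m$, and McDiarmid's inequality gives
\[
  \Pr\bigl[\, |\hat{Z}_h - \mathbb{E}[\hat{Z}_h]| > \epsilon \,\bigr]
  \;\le\; 2\exp\!\left(-\tfrac{m\epsilon^2}{2}\right).
\]
Union bounding over an $\epsilon$-cover of size $N_\epsilon$ and over $T$
iterations, it suffices to take
\[
  m \;=\; O\!\left(\frac{\log N_\epsilon + \log(T/\delta)}{\epsilon^2}\right)
\]
samples for every estimate to be $\epsilon$-accurate with probability $1-\delta$.
```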
Note that this analysis is completely generic; more sophisticated arguments that leverage structure in the specific class of interest may improve the resulting bounds.
b.2 Convergence analysis
We track the progress of Algorithm 3 via the expected cross-entropy loss, showing that every update decreases this loss significantly. Since the loss is bounded below by zero, positive progress at each iteration, combined with an upper bound on the initial loss, gives the convergence result.
Note that when we estimate the statistical queries from data, we only have access to approximate answers. Thus, per the sample complexity argument above, we assume that each statistical query is -accurate. Further, we will update if we find an update where . Thus, at convergence, it should be clear that the resulting hypothesis will be -multiaccurate. The goal is to show that this way, Multiaccuracy Boost converges quickly.
Let and suppose . Given access to statistical queries that are -accurate, Algorithm 3 converges to a -multiaccurate hypothesis in iterations.
We state this lemma in terms of a class, but the proof reveals that any nontrivial update returned by the auditor suffices to make progress.
We begin by considering the effect of the multiplicative weights update as a univariate update rule. Suppose we use the multiplicative weights update rule to compute to be proportional to for some . We can track how changes based on the choice of .
Recall , so . Thus, we can rewrite (23) as follows.
where (25) follows from the form of the multiplicative weights update rule. Next, we expand the final logarithmic term.
Thus, we can express the change in after an update based on in terms of the inner product between and . In this sense, we can express the local progress during the update at time in terms of some global progress in the objective.
When we update simultaneously according to , we can express the change in expected cross-entropy as follows.
where (35) follows from the fact that we assumed that our estimates of the statistical queries were -accurate and that we update based on if is at least according to our estimates. Thus, taking , we see the change in expected cross-entropy over is at least , which proves the lemma. ∎
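The resulting iteration bound follows from a standard potential argument; the constants below are indicative of the shape of the bound rather than the exact ones from the analysis:

```latex
% Potential argument for the iteration count (our notation).
Each accepted update decreases the expected cross-entropy by at least
$\Omega(\eta \alpha)$, and the loss is bounded below by $0$. Writing
$L_0$ for the loss of the initial model, the number of updates $T$
satisfies
\[
  T \cdot \Omega(\eta \alpha) \;\le\; L_0
  \qquad\Longrightarrow\qquad
  T \;=\; O\!\left(\frac{L_0}{\eta \alpha}\right),
\]
and the choice $\eta = \Theta(\alpha)$ yields $T = O(L_0 / \alpha^2)$.
```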
Appendix C Linear convergence from gradient learning
Here we show that given an auditing algorithm that learns the cross-entropy gradients accurately, Algorithm 3 converges linearly. Consider the following auditor . We assume the norms and inner products are estimated accurately using .