Toward a better trade-off between performance and fairness with kernel-based distribution matching

10/25/2019 ∙ by Flavien Prost, et al.

As recent literature has demonstrated how classifiers often carry unintended biases toward some subgroups, deploying machine learned models to users demands careful consideration of the social consequences. How should we address this problem in a real-world system? How should we balance core performance and fairness metrics? In this paper, we introduce a MinDiff framework for regularizing classifiers toward different fairness metrics and analyze a technique with kernel-based statistical dependency tests. We run a thorough study on an academic dataset to compare the Pareto frontier achieved by different regularization approaches, and apply our kernel-based method to two large-scale industrial systems demonstrating real-world improvements.




1 Introduction

Over the last few years, the research community in machine learning has become more aware of the unintended biases that are learned and carried out by their models. Concerning behavior toward subgroups of the population has been pointed out in multiple applications ranging from the predictions of credit default HardtPS16 to the detection of abusive online comments Dixon.

This growing collective awareness has resulted in significant interest in developing and popularizing fairness metrics calders2010three, dwork2012fairness, HardtPS16, kearns2017preventing, Borkan, kallus2019fairness, as well as in finding efficient mitigation techniques Agarwal2018, Zafar, Gupta2016, Dixon, BeutelAdversarial, LiptonThreshold. However, these techniques typically come with challenges: they might require collecting additional data (e.g., re-balancing techniques in Dixon) or might generate instability in the training process (e.g., adversarial techniques in BeutelAdversarial, ZhangAdversarial, Madras).

We focus here on mitigation techniques in a classification setting and would like to ensure that our work is compatible with the challenges that real-world production systems face. We follow the work done by alexbeutelputting in which the authors optimize for Equality of Opportunity HardtPS16, i.e. equalized false positive rate, by minimizing the correlation between the subgroup identity and the predictions over negative examples. Their method is shown to be efficient in a production system and was later generalized to a pairwise ranking loss for recommender settings alexbeutelrecommendation.

Although this method gives good empirical results, the authors acknowledge that zero correlation does not guarantee the statistical independence that equality of opportunity requires. We therefore take inspiration from the work done by GrettonMMD, which developed a kernel-based statistic, called Maximum Mean Discrepancy (MMD), to test whether two samples are drawn from the same distribution. This statistic has the advantage of being easy to compute and can be optimized by gradient-descent algorithms, which has made it useful for applications such as domain adaptation Long2015, Bousmalis2016.

We focus on a mitigation technique which combines the MMD test and the framework of alexbeutelputting. This approach was mentioned in transfer learning for fairness Candice, but has not previously been studied as a core bias mitigation technique; we analyze here its efficiency and practical application. Our main contributions are as follows:

  • The definition of MinDiff, a lightweight framework with a collection of regularization techniques for optimizing fairness metrics.

  • An empirical comparison revealing the improvement of the accuracy/fairness trade-off with kernel-based approaches.

  • The description of the application of these techniques on two large scale production systems and the resulting improvements.

2 MinDiff Framework

2.1 Setting and notations

We consider a general task that consists in learning a function $f$ that maps a set of features $X$ to a binary label $Y \in \{0, 1\}$, such that $f(X)$ is the predicted probability that $Y = 1$. We assume that the system makes an adverse decision for each example where the predicted probability is above a certain threshold (undesirable outcome for the user). This setup is similar to many applications where the system might flag abusive comments Dixon or reject an applicant’s loan HardtPS16. For each example, we can compute a loss (e.g. cross-entropy) and measure the primary performance of the model with traditional metrics such as accuracy.

We further assume that each example is associated with a subgroup $A$, but this feature is available only for a fraction of the training data and is not observed at prediction time. For simplicity, we will focus on the binary case where $A \in \{0, 1\}$ (out of or in the subgroup); however, the concept applies to a more general setup.

We now want to evaluate the fairness of our model and consider Equality of Opportunity HardtPS16, which is defined as an equality of the False Positive Rates across groups (equality of opportunity can also be framed as equalized false negative rates). We measure the deviation from this ideal criterion with the False Positive Rate Gap ($FPR_{gap}$):

$$FPR_{gap} = \left| FPR_{A=0} - FPR_{A=1} \right|$$

where $FPR_{A=a}$ is the false positive rate measured on the examples with $A = a$.
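As an illustration of the metric just described, the following minimal sketch computes a False Positive Rate Gap from lists of predicted probabilities, labels and binary subgroup indicators. The function names, the 0.5 threshold and the toy data are assumptions made for this example and are not taken from the paper.

```python
def false_positive_rate(scores, labels, threshold):
    """Fraction of negative examples (label 0) scored above the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("no negative examples")
    return sum(s > threshold for s in negatives) / len(negatives)


def fpr_gap(scores, labels, groups, threshold=0.5):
    """Absolute difference between the FPR of subgroup 0 and subgroup 1."""
    rates = []
    for g in (0, 1):
        idx = [i for i, a in enumerate(groups) if a == g]
        rates.append(false_positive_rate(
            [scores[i] for i in idx], [labels[i] for i in idx], threshold))
    return abs(rates[0] - rates[1])


# Made-up data: subgroup 1 gets more false positives than subgroup 0.
scores = [0.9, 0.2, 0.7, 0.1, 0.8, 0.6, 0.3, 0.4]
labels = [0, 0, 1, 0, 0, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
gap = fpr_gap(scores, labels, groups)
```

In this toy data one of three negatives is flagged in subgroup 0 and two of three in subgroup 1, so the gap is one third.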

2.2 Mitigating bias with MinDiff

Numerous regularization approaches have been proposed to encourage models during training to optimize for various fairness definitions Zafar, alexbeutelputting, Madras, Agarwal2018, ZhangAdversarial, Gupta2016. Here, we build on the approach offered by alexbeutelputting to penalize the model for any dependence, among the negative examples, between the distribution of predicted probabilities and the subgroup label $A$. In practice, they add a regularization loss term that minimizes the correlation between the two. We name this framework “MinDiff” because it minimizes the difference of a given quantity between two slices of data.

$$\mathcal{L} = \mathcal{L}_{primary} + \lambda \, \left| \mathrm{Corr}\big(f(X), A \mid Y = 0\big) \right| \qquad (1)$$

where $\lambda$ is a hyperparameter controlling the trade-off between the primary loss and this MinDiff loss.
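To make the correlation-based regularizer concrete, here is a minimal sketch that evaluates such a penalty on the negative examples and adds it to a primary loss. Taking the absolute value of the Pearson correlation, the function names and the toy data are assumptions of this sketch; it only evaluates the loss value, whereas a real training setup would compute it inside an automatic-differentiation framework.

```python
import math


def pearson_correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: treat a constant input as uncorrelated
    return cov / math.sqrt(var_u * var_v)


def correlation_min_diff(scores, labels, groups):
    """|Corr(score, subgroup)| restricted to negative examples (label 0)."""
    neg = [(s, a) for s, y, a in zip(scores, labels, groups) if y == 0]
    return abs(pearson_correlation([s for s, _ in neg], [a for _, a in neg]))


def total_loss(primary_loss, scores, labels, groups, weight):
    """Primary loss plus the weighted MinDiff penalty."""
    return primary_loss + weight * correlation_min_diff(scores, labels, groups)


# Made-up negatives: subgroup 1 systematically receives higher scores.
scores = [0.2, 0.4, 0.6, 0.8]
labels = [0, 0, 0, 0]
groups = [0, 0, 1, 1]
penalty = correlation_min_diff(scores, labels, groups)
loss = total_loss(0.3, scores, labels, groups, weight=1.0)
```

Setting `weight` to zero recovers the primary loss, and larger values push the optimizer harder toward uncorrelated predictions on negatives.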

While correlation is not a sufficient test for statistical independence, the authors achieve good empirical results and claim that this is more easily adapted to real-world systems than other similar methods such as adversarial training BeutelAdversarial, Madras, ZhangAdversarial.

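We build on this technique and leverage the work done by GrettonMMD, who introduce a framework to test whether two samples are drawn from the same distribution. Their test statistic, called Maximum Mean Discrepancy (MMD), is the distance between the means of two samples $U$ and $V$ mapped into a Reproducing Kernel Hilbert Space, and its square is easy to estimate with the following formula:

$$\mathrm{MMD}^2(U, V) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} k(u_i, u_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(u_i, v_j) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(v_i, v_j)$$

where $u_1, \dots, u_m$ are elements from the sample $U$, $v_1, \dots, v_n$ elements from $V$, and $k$ is a universal kernel (Gaussian or Laplace kernels are used in practice).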

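Based on this work, we suggest a new implementation of MinDiff. We add a similar loss component which penalizes any statistical dependence between the predictions for negative examples and the associated subgroup, but we now measure it with the MMD function. Our new loss is then equal to:

$$\mathcal{L} = \mathcal{L}_{primary} + \lambda \, \mathrm{MMD}\big(f(X \mid A = 0, Y = 0), \; f(X \mid A = 1, Y = 0)\big) \qquad (2)$$

where $f(X \mid A = 0, Y = 0)$ are the predictions over negative examples where $A = 0$, $f(X \mid A = 1, Y = 0)$ are the predictions over negative examples where $A = 1$, and $\lambda$ is a hyperparameter controlling the trade-off between the primary and MinDiff losses.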
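The kernel-based penalty can be sketched in a few lines. The example below uses the biased estimate of the squared MMD on scalar predictions, with standard Gaussian and Laplace kernel forms; the exact estimator and kernel parametrization used by the authors are not recoverable from the text, so these choices, the function names, the default kernel length of 0.1 and the toy data are all assumptions.

```python
import math


def gaussian_kernel(x, y, length=0.1):
    """Gaussian kernel on scalars; `length` is the kernel length."""
    return math.exp(-((x - y) ** 2) / (2.0 * length ** 2))


def laplace_kernel(x, y, length=0.1):
    """Laplace kernel on scalars; `length` is the kernel length."""
    return math.exp(-abs(x - y) / length)


def mean_kernel(us, vs, kernel):
    """Average kernel value over all pairs (u, v)."""
    return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))


def mmd_squared(us, vs, kernel=gaussian_kernel):
    """Biased estimate of the squared MMD between two scalar samples."""
    if not us or not vs:
        raise ValueError("both samples must be non-empty")
    return (mean_kernel(us, us, kernel)
            - 2.0 * mean_kernel(us, vs, kernel)
            + mean_kernel(vs, vs, kernel))


def mmd_min_diff(scores, labels, groups, kernel=gaussian_kernel):
    """MMD penalty between the two subgroups' scores on negative examples."""
    out_group = [s for s, y, a in zip(scores, labels, groups) if y == 0 and a == 0]
    in_group = [s for s, y, a in zip(scores, labels, groups) if y == 0 and a == 1]
    return mmd_squared(out_group, in_group, kernel)


# Made-up predictions: subgroup 1 receives higher scores on negatives.
scores = [0.1, 0.2, 0.3, 0.9, 0.6, 0.7, 0.8, 0.95]
labels = [0, 0, 0, 1, 0, 0, 0, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
penalty_gauss = mmd_min_diff(scores, labels, groups, gaussian_kernel)
penalty_laplace = mmd_min_diff(scores, labels, groups, laplace_kernel)
same = mmd_squared([0.2, 0.4], [0.2, 0.4])
```

The penalty is zero when the two samples are identical and grows as the two score distributions move apart; the total loss would add it to the primary loss, scaled by the MinDiff weight.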

3 Academic Comparison

Task description

We use UCI’s Adult dataset UCIData, which contains census information on over 40,000 individuals. The task is to predict whether someone earns more or less than $50,000 (binary classification), and we use “sex” as the sensitive attribute, restricted in the dataset to binary values {male, female}. We use accuracy to measure the performance of the model and the False Positive Rate Gap as the fairness metric.

We set the architecture to a feed-forward neural network with one hidden layer and selected the parameters with cross-validation: we used a learning rate of 0.001, a batch size of 256 examples and 64 hidden units. Without any bias mitigation, the model is 84.5% accurate and has a $FPR_{gap}$ of 0.12.
We then train three different models by optimizing respectively the following losses: (a) the loss defined in equation 1 (used in alexbeutelputting), (b) our new loss defined in equation 2 with a Gaussian kernel, (c) our new loss defined in equation 2 with a Laplace kernel. We refer to these as the correlation, Gaussian MMD and Laplace MMD models respectively.

Both the Gaussian and Laplace kernels require a parameter called kernel length, which we set to be of the same order of magnitude as the standard deviation of the underlying distribution of predicted probabilities. This choice is motivated by the analysis reported in appendix A, where we describe how the value of this parameter impacts the results.

Figure 1: Visualization of the trade-off between performance and fairness. For the model, as we increase the MinDiff weight, the accuracy gets worse (left) while the $FPR_{gap}$ decreases (middle). The right plot shows the Pareto frontier (accuracy vs $FPR_{gap}$) for each MinDiff model.

Trade-off between accuracy and fairness

First, we start by varying the weight $\lambda$ associated with the MinDiff loss. For each value of $\lambda$, we report the accuracy and $FPR_{gap}$ as the average values over twenty training runs. We also compute the standard error of the estimates.
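The averaging step can be sketched as follows, using the usual standard error of the mean (sample standard deviation divided by the square root of the number of runs). The helper name and the five accuracy values are made up for illustration; the paper averages over twenty runs.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean of the runs and standard error of that mean."""
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, std_err


# Hypothetical accuracies from five training runs at one MinDiff weight.
accuracies = [0.84, 0.85, 0.83, 0.86, 0.82]
mean_acc, err_acc = mean_and_standard_error(accuracies)
```

The same helper would be applied to the fairness metric of each run to obtain the error bars for both curves.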

The resulting graphs for the model can be found in Figure 1. As expected, the MinDiff loss is able to reduce the $FPR_{gap}$ from 0.14 to 0.02 (middle plot); however, this comes with a decrease in accuracy from 0.85 to 0.65 (left plot). The trade-off between those two metrics induced by the model results in a Pareto frontier plotted on the right.
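One way to extract such a frontier from a sweep over MinDiff weights is to keep only the points that no other point dominates, where higher accuracy and lower FPR gap are both preferred. The function name and the (accuracy, FPR gap) pairs below are made up for this sketch and are not results from the paper.

```python
def pareto_frontier(points):
    """Keep (accuracy, fpr_gap) points not dominated by any other point.

    A point is dominated when another point has accuracy at least as high
    and FPR gap at least as low, with at least one strict improvement.
    """
    frontier = []
    for acc, gap in points:
        dominated = any(
            (a >= acc and g <= gap) and (a > acc or g < gap)
            for a, g in points
        )
        if not dominated:
            frontier.append((acc, gap))
    return sorted(set(frontier))


# Hypothetical (accuracy, FPR gap) pairs, one per MinDiff weight.
results = [(0.85, 0.14), (0.84, 0.08), (0.83, 0.09), (0.80, 0.03), (0.65, 0.02)]
frontier = pareto_frontier(results)
```

Here the point (0.83, 0.09) is dropped because (0.84, 0.08) is better on both metrics; the remaining four points form the frontier.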

MMD achieves a better “Pareto frontier”

The previous graphs help us to understand that these methods induce a significant trade-off between accuracy and fairness, controlled by $\lambda$ from Equation (2). We therefore compare the Pareto frontiers of the three models and report them in Figure 1 (right plot). We can see that, while the correlation and MMD models have a similar effect for small values of $\lambda$ (upper right part), the MMD models outperform the correlation model when we increase the fairness weight. In particular, MMD is able to reach a low $FPR_{gap}$ (< 0.01) with relatively high accuracy (> 83%), whereas the correlation model is unable to reduce the $FPR_{gap}$ below 0.02. We hypothesize that the correlation model is unable to entirely remove the bias as it only matches the mean and the variance of the two samples and not the true distribution. Additionally, we observe that the choice of the kernel does not seem to have a large impact, as the Gaussian and Laplace MMD models have similar results.
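The hypothesis that a correlation penalty can miss differences that a kernel penalty detects can be illustrated on a toy case. In the sketch below, two made-up samples of predictions share the same mean but have very different spreads: the correlation with the subgroup indicator is essentially zero, while a Gaussian-kernel MMD estimate is clearly positive. The helper names, the kernel length of 0.1 and the biased squared-MMD estimator are assumptions of this example.

```python
import math


def pearson_correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def gaussian_mmd_squared(us, vs, length=0.1):
    """Biased squared-MMD estimate with a Gaussian kernel on scalars."""
    def mean_k(xs, ys):
        total = sum(math.exp(-((x - y) ** 2) / (2.0 * length ** 2))
                    for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return mean_k(us, us) - 2.0 * mean_k(us, vs) + mean_k(vs, vs)


# Made-up predictions on negatives: same mean (0.5), different spread.
group_0 = [0.5, 0.5, 0.5, 0.5]
group_1 = [0.1, 0.9, 0.1, 0.9]

scores = group_0 + group_1
membership = [0] * len(group_0) + [1] * len(group_1)
corr = pearson_correlation(scores, membership)
mmd = gaussian_mmd_squared(group_0, group_1)
```

A regularizer driven by `corr` would see nothing left to fix in this example, whereas one driven by `mmd` would keep pushing the two distributions together.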

4 Applications to real-world systems

4.1 Classifier Setting

The first system follows the same framework as the one introduced in §2.1. We train a model $f$ with log-loss to predict a binary variable $Y$ for an item. When the predicted score is higher than a threshold (chosen for a fixed recall), we classify the item as “positive” (undesirable outcome).

            Correlation   MMD
FPR ratio   5.22          2.82

Table 1: The MMD loss results in significantly better FPR ratio than that of the correlation loss.

The original analysis showed that the false positive rate (FPR) is higher for the protected group than for the majority group. Our goal is to minimize the gap between the two, measured as the FPR ratio. Table 1 shows the comparison between different versions of MinDiff: MMD brings the FPR ratio down to 2.82, compared with 5.22 for the correlation loss, while both models maintain neutral performance for the main task.

4.2 Recommender System

We now focus on a large-scale production recommender system where the model is trained to predict the probability of an item being clicked by the user and is used at inference to score and rank the items to display. We consider a subgroup $A$ of users as a variable in $\{0, 1\}$. We build on the work of alexbeutelrecommendation, who suggested a pairwise fairness metric for recommendation as well as a MinDiff formulation on pairs of items to improve the system, both of which we summarize below.

Fairness metric for ranking. For a random pair of items where exactly one of the items is clicked, we can define the pairwise ranking accuracy as the frequency with which the model ranks higher the clicked item. With this, we can evaluate if the system is under-ranking a subgroup of items by computing the difference between the pairwise ranking accuracy when the clicked item is in or out of the subgroup (gap in pairwise ranking accuracy). We report this metric bucketed by level of satisfaction of the user after the click, as in alexbeutelrecommendation and define the total gap as the sum over all buckets.
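A minimal sketch of this metric, ignoring the bucketing by user satisfaction, might look as follows. Each pair is represented as a tuple of the clicked item's score, the unclicked item's score and a flag saying whether the clicked item is in the subgroup; this representation, the function names and the data are assumptions made for the example.

```python
def pairwise_accuracy(pairs):
    """Share of pairs in which the clicked item gets the higher score.

    Each pair is (score_clicked, score_unclicked, clicked_in_subgroup).
    """
    if not pairs:
        raise ValueError("no pairs")
    return sum(clicked > other for clicked, other, _ in pairs) / len(pairs)


def pairwise_accuracy_gap(pairs):
    """Pairwise accuracy when the clicked item is out of the subgroup
    minus pairwise accuracy when it is in the subgroup."""
    in_group = [p for p in pairs if p[2] == 1]
    out_group = [p for p in pairs if p[2] == 0]
    return pairwise_accuracy(out_group) - pairwise_accuracy(in_group)


# Made-up pairs: clicked items from the subgroup are ranked correctly less often.
pairs = [
    (0.9, 0.4, 0), (0.7, 0.2, 0), (0.3, 0.6, 0), (0.8, 0.1, 0),
    (0.6, 0.5, 1), (0.2, 0.7, 1), (0.4, 0.9, 1), (0.5, 0.3, 1),
]
gap = pairwise_accuracy_gap(pairs)
```

To follow the reporting described above, the same gap would be computed separately within each satisfaction bucket and the per-bucket gaps summed.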

MinDiff formulation. To optimize the fairness metric, alexbeutelrecommendation adds a MinDiff loss over random pairs of items in which exactly one item is clicked, penalizing the correlation between the difference in predicted scores of the two items and the subgroup membership of the clicked item.

Our algorithm re-uses this formulation but penalizes the dependence between these two quantities with MMD.

Results of our algorithm. We compare our MMD approach to the initial system and to the method of alexbeutelrecommendation and display the results in Figure 2. While the MinDiff approach with correlation reduced the total gap in pairwise ranking accuracy by 60%, MMD is able to bring the gap even lower (a reduction of 65% from the original number). Additionally, online experiments showed that these changes had a neutral impact on overall system performance.

Figure 2: Evolution of the gap in pairwise ranking accuracy. Panels: (a) original system, (b) with correlation-based MinDiff, (c) with MMD-based MinDiff.

5 Conclusion

In this paper, we define the MinDiff framework as a collection of regularization techniques for mitigating bias and analyze approaches based on the MMD statistical test. We show empirically that they achieve a better Pareto-frontier and describe two applications of it to real-world systems. With this work, we hope to help reduce the challenges in bringing ML Fairness to industry applications.



A Heuristics for values of the kernel length in the Gaussian and Laplace kernels

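We use the following formulas to compute the Gaussian and Laplace kernels:

$$k_{Gauss}(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right), \qquad k_{Laplace}(x, y) = \exp\left(-\frac{\|x - y\|}{\sigma}\right)$$

where $\sigma$ is a parameter called kernel length.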

We want to analyze how the kernel length $\sigma$ impacts the previous results and what value is optimal. We describe here our analysis only for the Gaussian kernel, although similar results apply to the Laplace one.

We vary the value of $\sigma$ for three values of the MinDiff weight $\lambda$ (0.1, 1 and 5) and report the results in Figure 3. As the metrics are not monotonic functions of $\sigma$, the Pareto frontier is harder to read and we therefore do not report it in this paper.

Figure 3: Impact of the kernel decay length on the Gaussian MMD model

We observe that the accuracy decreases slightly at first as we decrease the kernel length $\sigma$, and then drops as we decrease it further (left plot). We believe that this is because smaller values of $\sigma$ make the kernel more sensitive; as a result, the MMD returns higher values for any difference between the distributions. On the other hand, the $FPR_{gap}$ has a more complicated relationship with $\sigma$ and seems to be optimal for intermediate values of the kernel length (right plot).

Overall, there seems to be a sweet spot between 0.1 and 0.5 where we reach a good trade-off between fairness and accuracy. We observe that this range is close to the standard deviation of the underlying probability distribution and hypothesize that this heuristic should generalize to other distributions.
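The heuristic can be sketched as follows: take the standard deviation of the predicted probabilities as the kernel length, and compare how a Gaussian kernel rates the similarity of two nearby predictions for larger and smaller lengths. The function names, the standard Gaussian kernel form and the prediction values are assumptions of this example.

```python
import math
import statistics


def kernel_length_heuristic(predictions):
    """Standard deviation of the predicted probabilities, used as kernel length."""
    return statistics.pstdev(predictions)


def gaussian_kernel(x, y, length):
    """Gaussian kernel on scalars with the given kernel length."""
    return math.exp(-((x - y) ** 2) / (2.0 * length ** 2))


# Made-up predicted probabilities on negative examples.
predictions = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80]
length = kernel_length_heuristic(predictions)

# Similarity assigned to two predictions 0.2 apart, for several kernel lengths.
similarity = {s: gaussian_kernel(0.3, 0.5, s) for s in (1.0, length, 0.01)}
```

With a kernel length of 1.0 the two predictions look almost identical, with 0.01 they look entirely unrelated, and the heuristic value sits in between, which matches the intuition that very small kernel lengths make the penalty react to any difference between the distributions.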