Rényi Fair Inference

06/28/2019 ∙ by Sina Baharlouei, et al. ∙ University of Southern California

Machine learning algorithms have been increasingly deployed in critical automated decision-making systems that directly affect human lives. When these algorithms are trained only to minimize the training/test error, they can discriminate systematically against individuals based on sensitive attributes such as gender or race. Recently, there has been a surge of interest in the machine learning community in developing algorithms for fair machine learning. In particular, many adversarial learning procedures have been proposed to impose fairness. Unfortunately, these algorithms either can only impose fairness up to first-order dependence between the variables, or they lack computational convergence guarantees. In this paper, we use Rényi correlation as a measure of fairness of machine learning models and develop a general training framework to impose fairness. In particular, we propose a min-max formulation which balances accuracy and fairness when solved to optimality. For the case of discrete sensitive attributes, we suggest an iterative algorithm with a theoretical convergence guarantee for solving the proposed min-max problem. Our algorithm and analysis are then specialized to the fair classification and fair clustering problems under the disparate impact doctrine. Finally, the performance of the proposed Rényi fair inference framework is evaluated on the Adult and Bank datasets.




1 Introduction

As we experience the widespread adoption of machine learning models in automated decision-making, we have witnessed increased reports of instances in which the employed model results in discrimination against certain groups of individuals datta2015automated ; sweeney2013discrimination ; bolukbasi2016man ; machinebias2016 . In this context, discrimination is defined as the unwanted distinction against individuals based on their membership in a specific group. For instance, machinebias2016 presents an example of a computer-based risk assessment model for recidivism, which is biased against certain ethnicities. In another example, datta2015automated demonstrates gender discrimination in online advertisements for web pages associated with employment. Beyond the ethical considerations, equal treatment of different groups is legally required in many countries civilright1964 . Thus, research on fairness in machine learning has attracted significant attention in recent years; see calmon2017optimized ; feldman2015certifying ; hardt2016equality ; zhang2018mitigating ; xu2018fairgan ; dwork2018decoupled ; fish2016confidence ; woodworth2017learning ; zafar2017fairness ; zafar2015fairness ; perez2017fair ; bechavod2017penalizing .

Anti-discrimination laws imposed by many countries typically evaluate fairness by notions of disparate treatment and disparate impact. We say a decision-making process suffers from disparate treatment if its decisions discriminate against individuals of a certain protected group based on their sensitive/protected attribute information. On the other hand, we say it suffers from disparate impact if the decisions adversely affect a protected group of individuals with a certain sensitive attribute zafar2015fairness . In simpler words, disparate treatment is intentional discrimination against a protected group, while disparate impact is an unintentional disproportionate outcome that hurts a protected group. To quantify fairness, several notions of fairness have been proposed in the recent decade calders2009building ; hardt2016equality . Examples of these notions include demographic parity, equalized odds, and equalized opportunity.

The demographic parity condition requires that the model output (for example, the assigned label) is independent of the sensitive attributes. This definition might not be desirable when the base ground-truth outcomes of the two groups are completely different. This shortcoming can be addressed using the equalized odds notion hardt2016equality , which requires that the model output is conditionally independent of the sensitive attributes given the ground-truth label. Finally, equalized opportunity requires having equal false positive or false negative rates across protected groups. All of the above fairness notions require (conditional) independence between the model output and the sensitive attribute.

Machine learning approaches for imposing fairness can be broadly classified into three main categories: pre-processing methods, post-processing methods, and in-processing methods. Pre-processing methods modify the training data to remove discriminatory information before passing data to the decision-making process calders2009building ; feldman2015certifying ; kamiran2010classification ; kamiran2009classifying ; kamiran2012data ; dwork2012fairness ; calmon2017optimized ; ruggieri2014using . These methods map the training data to a transformed space in which the dependencies between the class label and the sensitive attributes are removed edwards2015censoring ; hardt2016equality ; xu2018fairgan ; sattigeri2018fairness ; raff2018gradient ; madras2018learning ; zemel2013learning ; louizos2015variational . On the other hand, post-processing methods adjust the output of a trained classifier to remove discrimination while maintaining high classification accuracy fish2016confidence ; dwork2018decoupled ; woodworth2017learning . The third category is the in-processing approach, which enforces fairness by either introducing constraints or adding a regularization term to the training procedure zafar2017fairness ; zafar2015fairness ; perez2017fair ; bechavod2017penalizing ; berk2017convex ; agarwal2018reductions ; celis2019classification ; donini2018empirical ; rezaei2019fair ; kamishima2011fairness ; zhang2018mitigating ; kearns2017preventing ; menon2018cost ; alabi2018unleashing . The Rényi fair inference framework proposed in this paper also belongs to this in-processing category.

Among in-processing methods, many add a regularization term or constraints to impose statistical independence between the classifier output and the sensitive attributes. To do that, various independence proxies have been used, such as mutual information kamishima2011fairness , false positive/negative rates bechavod2017penalizing , equalized odds donini2018empirical , the Pearson correlation coefficient zafar2015fairness ; zafar2017fairness , and the Hilbert-Schmidt independence criterion (HSIC) perez2017fair . As will be discussed in Section 2, many of these methods cannot capture higher-order dependence between random variables or lead to computationally expensive algorithms. Motivated by these limitations, we propose to use Rényi correlation to impose several known group fairness measures. Rényi correlation captures higher-order dependencies between random variables. Moreover, Rényi correlation is a normalized measure and can be computed efficiently in certain instances.

Using the Rényi correlation coefficient as a regularization term, we propose a min-max optimization framework for fair statistical inference. We apply our Rényi framework to the classification problem and show that it can be solved up to first-order stationarity using an iterative procedure. In addition to classification, we apply our regularization framework to the fair K-means clustering problem under the disparate impact doctrine. Many recent works on fair clustering propose a two-phase algorithm for solving the k-centers and k-median fair clustering problems chierichetti2017fair ; backurs2019scalable ; rosner2018privacy ; bercea2018cost . In the first phase, data points are partitioned into small subsets, referred to as fairlets, that satisfy fairness requirements. Then in the second phase, these fairlets are merged to form k clusters by one of the existing clustering algorithms. A similar approach was proposed by schmidt2018fair for the k-means clustering problem. These methods impose fairness as a hard constraint, which can degrade the clustering quality significantly. Moreover, phase one is typically computationally expensive and limits the scalability of the method. In contrast, our Rényi fair K-means clustering method imposes fairness as a soft constraint, and its complexity is no higher than that of the traditional K-means clustering method, also known as Lloyd’s algorithm. Finally, we evaluate the performance of our algorithms on the Bank and Adult datasets.


We introduce Rényi correlation as a tool to impose several notions of group fairness. Unlike Pearson correlation and HSIC, Rényi correlation captures higher order dependence of random variables.

Using Rényi correlation as a regularization term in training, we propose a min-max formulation for fair statistical inference. Unlike methods that use an adversarial neural network to impose fairness, we show that in particular instances (such as binary classification or sensitive attributes), it suffices to use a simple quadratic function as the adversarial objective. This observation helped us develop a simple multi-step gradient ascent-descent algorithm for fair inference and guarantee its theoretical convergence to first-order stationarity.

Our Rényi correlation framework leads to a natural fair classification method and a novel fair K-means clustering algorithm. For the K-means clustering problem, we show that a sufficiently large regularization coefficient yields perfect fairness under the disparate impact doctrine. Unlike the two-phase methods proposed in chierichetti2017fair ; backurs2019scalable ; rosner2018privacy ; bercea2018cost ; schmidt2018fair , our method does not require any pre-processing step, is scalable, and allows for regulating the trade-off between the clustering quality and fairness.

2 Rényi Correlation as a Measure of Dependence

The most widely used notions of group fairness in machine learning are demographic parity, equalized odds, and equalized opportunity. These notions require (conditional) independence between a certain model output and a sensitive attribute. This independence is typically imposed by adding fairness constraints or regularization terms to the training objective function. For instance, kamishima2011fairness added a regularization term based on mutual information. Since estimating the mutual information between the model output and the sensitive variables during training is not computationally tractable, the authors approximate the probability density functions using a logistic regression model. To obtain a tighter estimate, song2019learning used an adversarial approach that estimates the joint probability density function using a parameterized neural network. Although these works start from a well-justified objective function, they end up solving approximations of it due to computational barriers. Thus, no fairness guarantee can be provided even when the resulting optimization problems are solved to global optimality in the large-sample regime.

A more tractable measure of dependence between two random variables is the Pearson correlation. The Pearson correlation coefficient between two random variables $A$ and $B$ is defined as

$$\rho_P(A, B) = \frac{\mathrm{Cov}(A, B)}{\sqrt{\mathrm{Var}(A)}\,\sqrt{\mathrm{Var}(B)}},$$

where $\mathrm{Cov}(\cdot,\cdot)$ denotes the covariance and $\mathrm{Var}(\cdot)$ denotes the variance. The Pearson correlation coefficient is used in zafar2015fairness to decorrelate the binary sensitive attribute and the decision boundary of the classifier. A major drawback of Pearson correlation is that it only captures linear dependencies between random variables. In fact, two random variables $A$ and $B$ may be strongly dependent and still have zero Pearson correlation. This property raises concerns about the use of the Pearson correlation for imposing fairness. Similar to the Pearson correlation, the HSIC measure proposed in perez2017fair may be zero even if the two variables are strongly dependent. While universal kernels can be used to resolve this issue, they may come at the expense of computational intractability. In addition, HSIC is not a normalized dependence measure gretton2005kernel ; gretton2005measuring , which raises concerns about the appropriateness of using it as a measure of dependence.
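To make the limitation of Pearson correlation concrete, the following minimal sketch works through a toy example that is not from the paper: $A$ is assumed uniform on $\{-1, 0, 1\}$ and $B = A^2$. The helper name `pearson` and the chosen distribution are illustrative assumptions. Here $B$ is a deterministic function of $A$, yet the covariance vanishes by symmetry.

```python
import numpy as np

# Toy example (an assumption for illustration): A uniform on {-1, 0, 1}, B = A**2.
a_vals = np.array([-1.0, 0.0, 1.0])
p_a = np.array([1 / 3, 1 / 3, 1 / 3])
b_vals = a_vals ** 2  # B is a deterministic function of A


def pearson(x, y, p):
    """Pearson correlation of two discrete variables defined on a shared
    sample space whose outcomes have probabilities p."""
    mx, my = np.sum(p * x), np.sum(p * y)
    cov = np.sum(p * (x - mx) * (y - my))
    var_x = np.sum(p * (x - mx) ** 2)
    var_y = np.sum(p * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)


rho_p = pearson(a_vals, b_vals, p_a)

# Dependence check: the marginal P(B = 1) differs from P(B = 1 | A = 0).
p_b1 = np.sum(p_a[b_vals == 1.0])
p_b1_given_a0 = float(b_vals[a_vals == 0.0][0] == 1.0)
print(rho_p, p_b1, p_b1_given_a0)
```

The conditional probability of $B = 1$ changes with $A$, so the two variables are dependent even though the linear measure does not register it.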

In this paper, we suggest using the Hirschfeld-Gebelein-Rényi correlation renyi1959measures ; hirschfeld1935connection ; gebelein1941statistische as a dependence measure between random variables to impose fairness. Rényi correlation, also known as maximal correlation, between two random variables $A$ and $B$ is defined as

$$\rho_R(A, B) = \sup_{f, g}\ \mathbb{E}\big[f(A)\,g(B)\big] \quad \text{s.t.} \quad \mathbb{E}[f(A)] = \mathbb{E}[g(B)] = 0, \quad \mathbb{E}[f^2(A)] = \mathbb{E}[g^2(B)] = 1, \tag{1}$$

where the supremum is over the set of measurable functions $f(\cdot)$ and $g(\cdot)$ satisfying the constraints. Unlike HSIC and Pearson correlation, Rényi correlation is a normalized measure that captures higher-order dependencies between random variables. Rényi correlation between two random variables is zero if and only if the random variables are independent, and it is one if there is a strict dependence between the variables renyi1959measures . In addition, as we will discuss in Section 3, Rényi correlation leads to a computationally tractable framework when used to impose fairness in certain cases. These computational and statistical benefits make Rényi correlation a powerful tool in the context of fair inference.

3 A General Min-Max Framework for Rényi Fair Inference

Consider a learning task over a given random variable $Z$. Our goal is to minimize the average inference loss $\mathcal{L}(\cdot)$, where the loss function is parameterized by $\theta$. To find the value of $\theta$ with the smallest average loss, we need to solve the optimization problem

$$\min_{\theta}\ \mathbb{E}\big[\mathcal{L}(\theta, Z)\big],$$

where the expectation is taken over $Z$. Notice that this formulation is quite general and can include regression, classification, clustering, or dimensionality reduction tasks as special cases.

Assume that, in addition to minimizing the average loss, we are interested in bringing fairness to our learning task. Let $S$ be the sensitive attribute and $\hat{Y}_\theta(Z)$ be a certain output of our inference task using parameter $\theta$. Assume we are interested in reducing the dependence between the random variable $\hat{Y}_\theta(Z)$ and the sensitive attribute $S$. To balance goodness-of-fit and fairness, one can solve the following optimization problem

$$\min_{\theta}\ \mathbb{E}\big[\mathcal{L}(\theta, Z)\big] + \lambda\, \rho_R^2\big(\hat{Y}_\theta(Z), S\big), \tag{2}$$

where $\lambda$ is a positive scalar balancing fairness and goodness-of-fit. Notice that the above framework is quite general. For example, $\hat{Y}_\theta$ may be the assigned label in a classification task, the assigned cluster in a clustering task, or the output of a regressor in a regression task.

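Using the definition of Rényi correlation, we can rewrite optimization problem (2) as

$$\min_{\theta}\ \sup_{f, g}\ \mathbb{E}\big[\mathcal{L}(\theta, Z)\big] + \lambda \Big(\mathbb{E}\big[f(\hat{Y}_\theta(Z))\, g(S)\big]\Big)^2 \quad \text{s.t.} \quad \mathbb{E}[f(\hat{Y}_\theta(Z))] = \mathbb{E}[g(S)] = 0, \quad \mathbb{E}[f^2(\hat{Y}_\theta(Z))] = \mathbb{E}[g^2(S)] = 1, \tag{3}$$

where the supremum is taken over the set of measurable functions. The next natural question is whether this optimization problem can be solved efficiently in practice. This question motivates the discussion in the following subsection.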

3.1 Computing Rényi Correlation

The objective function in (3) may be non-convex in $\theta$ in general. Several algorithms have been recently proposed for solving such non-convex min-max optimization problems sanjabi2018convergence ; nouiehed2019solving ; jin2019minmax . Most of these methods require solving the inner maximization problem to (approximate) global optimality. More precisely, we need to be able to solve the optimization problem described in (1). While popular heuristic approaches such as parameterizing the functions $f$ and $g$ with neural networks can be used to solve (1), we focus on solving this problem in a more rigorous manner. In particular, we narrow our focus to the case of discrete random variables. This case covers many practical sensitive attributes, such as gender and race. In what follows, we show that in this case, (1) can be solved “efficiently” to global optimality.

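Theorem 3.1 (Restated from witsenhausen1975sequences ).

Let $a \in \{a_1, \ldots, a_c\}$ and $b \in \{b_1, \ldots, b_m\}$ be two discrete random variables. Then the Rényi coefficient $\rho_R(a, b)$ is equal to the second largest singular value of the matrix $Q = [q_{ij}] \in \mathbb{R}^{c \times m}$, where $q_{ij} = \dfrac{P(a = a_i,\, b = b_j)}{\sqrt{P(a = a_i)\, P(b = b_j)}}$.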

The above theorem provides a computationally tractable approach for computing the Rényi coefficient. This computation could be further simplified when one of the random variables is binary.
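The singular-value characterization can be sketched in a few lines of NumPy. The function name `renyi_correlation` and the two toy joint distributions are illustrative assumptions; the matrix is taken in its standard form, with entries equal to the joint probability divided by the square root of the product of the marginals.

```python
import numpy as np


def renyi_correlation(joint):
    """Maximal (Renyi) correlation of two discrete variables.

    joint[i, j] = P(a = a_i, b = b_j). Rows and columns with zero marginal
    probability are assumed to have been removed beforehand.
    """
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1)
    p_b = joint.sum(axis=0)
    q = joint / np.sqrt(np.outer(p_a, p_b))
    sing = np.linalg.svd(q, compute_uv=False)  # sorted in decreasing order
    return sing[1] if sing.size > 1 else 0.0


# Independent variables: the joint pmf is an outer product of marginals.
independent = np.outer([0.3, 0.7], [0.5, 0.2, 0.3])
# Deterministic dependence: A uniform on {-1, 0, 1}, B = A**2 (columns: B=0, B=1).
square_map = np.array([[0.0, 1 / 3], [1 / 3, 0.0], [0.0, 1 / 3]])

print(renyi_correlation(independent))
print(renyi_correlation(square_map))
```

The first case should give a value near zero and the second a value near one, matching the two extreme cases of the normalized measure.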

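Theorem 3.2 (Rephrased from Theorem 2 in farnia2015minimum and Theorem 3 in razaviyayn2015discrete ).

Suppose that $a \in \{1, \ldots, c\}$ is a discrete random variable and $b \in \{0, 1\}$ is a binary random variable. Let $\tilde{a}$ be the one-hot encoded version of $a$, i.e., $\tilde{a} = e_i$ if $a = i$, where $e_i$ is the $i$-th standard unit vector. Let $\tilde{b} = b - 1/2$. Then,

$$\rho_R^2(a, b) = 1 - \frac{\gamma}{P(b = 0)\, P(b = 1)},$$

where $\gamma \triangleq \min_{w \in \mathbb{R}^c} \mathbb{E}\big[(w^T \tilde{a} - \tilde{b})^2\big]$. Equivalently,

$$\gamma = \min_{w \in \mathbb{R}^c}\ \sum_{i=1}^{c} w_i^2\, P(a = i) - \sum_{i=1}^{c} w_i \big(P(a = i, b = 1) - P(a = i, b = 0)\big) + \frac{1}{4}.$$

Proof. Consider the random variable $a$ and its one-hot encoded version $\tilde{a}$. Then any function $f : \{1, \ldots, c\} \to \mathbb{R}$ can be equivalently represented as $f(\tilde{a}) = u^T \tilde{a}$, where $u \in \mathbb{R}^c$. Hence this function is separable in the sense defined in farnia2015minimum . Consequently, Theorem 2 in farnia2015minimum implies the desired result. ∎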
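As a sanity check on the binary case, the sketch below compares the squared second singular value with a least-squares expression on one arbitrarily chosen joint distribution. It assumes the relation $\rho_R^2 = 1 - \gamma / (P(b=0)P(b=1))$ with $\gamma$ the minimum of $\mathbb{E}[(w^T\tilde{a} - \tilde{b})^2]$ and $\tilde{b} = b - 1/2$; the numbers in `joint` are made up for illustration.

```python
import numpy as np

# joint[i, j] = P(a = i, b = j) for a in {0, 1, 2} and binary b in {0, 1}.
# The values are an arbitrary illustrative choice that sums to one.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])
p_a = joint.sum(axis=1)
p_b = joint.sum(axis=0)

# Singular-value route: squared second singular value of Q.
q = joint / np.sqrt(np.outer(p_a, p_b))
rho_sq_svd = np.linalg.svd(q, compute_uv=False)[1] ** 2

# Least-squares route: the quadratic in w is separable, so the minimizer is
# w_i = (P(a=i, b=1) - P(a=i, b=0)) / (2 P(a=i)).
diff = joint[:, 1] - joint[:, 0]
w = diff / (2.0 * p_a)
gamma = np.sum(w ** 2 * p_a) - np.sum(w * diff) + 0.25
rho_sq_ls = 1.0 - gamma / (p_b[0] * p_b[1])

print(rho_sq_svd, rho_sq_ls)
```

Both routes should agree up to floating-point error, which is what makes the quadratic form usable as a cheap adversarial objective when one variable is binary.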

Let us specialize our framework to classification and clustering problems in the next two sections.

4 Rényi Fair Classification

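In a typical (multi-class) classification problem, we are given samples from a random variable $Z \triangleq (X, Y)$ and the goal is to predict $Y$ from $X$. Here $X \in \mathbb{R}^d$ is the input feature vector, and $Y \in \{1, \ldots, c\}$ is the class label. Let $\hat{Y}_\theta(X) \in \{1, \ldots, c\}$ be the output of our classifier with

$$P\big(\hat{Y}_\theta(X) = i \mid X\big) = F_i(\theta, X), \quad i = 1, \ldots, c.$$

Here $\theta$ is the parameter of the classifier that needs to be tuned. For example, $F(\theta, X) = (F_1(\theta, X), \ldots, F_c(\theta, X))$ could represent the output of a neural network after the softmax layer, or the soft probability label assigned by a logistic regression model. In order to find the optimal parameter $\theta$, we need to solve the optimization problem

$$\min_{\theta}\ \mathbb{E}\big[\mathcal{L}(F(\theta, X), Y)\big],$$

where $\mathcal{L}$ is the loss function and the expectation is taken over the random variable $Z = (X, Y)$. Let $S$ be the sensitive attribute. We say a model satisfies demographic parity if the assigned label $\hat{Y}_\theta$ is independent of the sensitive attribute $S$; see dwork2012fairness . Using our regularization framework, to find the optimal parameter $\theta$ balancing classification accuracy and the fairness objective, we need to solve

$$\min_{\theta}\ \mathbb{E}\big[\mathcal{L}(F(\theta, X), Y)\big] + \lambda\, \rho_R^2\big(\hat{Y}_\theta, S\big). \tag{7}$$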

4.1 General Discrete Case

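When $S \in \{s_1, \ldots, s_m\}$ is discrete, Theorem 3.1 implies that (7) can be rewritten as

$$\min_{\theta}\ \max_{v \perp v_1,\ \|v\|^2 \le 1}\ \Big( f_D(\theta, v) \triangleq \mathbb{E}\big[\mathcal{L}(F(\theta, X), Y)\big] + \lambda\, v^T Q_\theta^T Q_\theta v \Big). \tag{8}$$

Here $v_1 = \big[\sqrt{P(S = s_1)}, \ldots, \sqrt{P(S = s_m)}\big]^T$ is the right singular vector corresponding to the largest singular value of $Q_\theta = [q_{ij}] \in \mathbb{R}^{c \times m}$, with $q_{ij} \triangleq \dfrac{P(\hat{Y}_\theta = i \mid S = s_j)\, P(S = s_j)}{\sqrt{P(\hat{Y}_\theta = i)\, P(S = s_j)}}$. Given training data $(x_n, y_n)_{n=1}^{N}$ sampled from the random variable $Z = (X, Y)$, we can estimate the entries of the matrix $Q_\theta$ using $P(\hat{Y}_\theta = i) \approx \frac{1}{N} \sum_{n=1}^{N} F_i(\theta, x_n)$ and $P(\hat{Y}_\theta = i \mid S = s_j) \approx \frac{1}{|\mathcal{X}_j|} \sum_{x \in \mathcal{X}_j} F_i(\theta, x)$, where $\mathcal{X}_j$ is the set of samples with sensitive attribute $s_j$. Motivated by the algorithm proposed in jin2019minmax , we present Algorithm 1 for solving (8).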

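1:Input: initial point $\theta^0$, step-size $\eta$.
2:for $t = 0, 1, \ldots, T$ do
3:     Set $v^{t+1} \leftarrow \arg\max_{v \perp v_1,\ \|v\| \le 1} f_D(\theta^t, v)$ by finding the second singular vector of $Q_{\theta^t}$
4:     Set $\theta^{t+1} \leftarrow \theta^t - \eta\, \nabla_\theta f_D(\theta^t, v^{t+1})$
5:end for
Algorithm 1 Rényi Fair Classifier for Discrete Sensitive Attributes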
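The inner maximization in step 3 of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `estimate_q` and `inner_max` are hypothetical, the soft outputs are random softmax probabilities made to depend on the sensitive attribute, and the matrix is estimated from the empirical joint distribution of soft outputs and group labels.

```python
import numpy as np


def estimate_q(probs, s, num_groups):
    """Estimate the normalized joint matrix from soft classifier outputs.

    probs: (N, c) array, probs[n, i] = soft probability of class i for sample n.
    s:     (N,) integer array with values in {0, ..., num_groups - 1}.
    """
    n = probs.shape[0]
    onehot_s = np.eye(num_groups)[s]           # (N, m) group indicators
    joint = probs.T @ onehot_s / n             # joint[i, j] ~ P(Yhat = i, S = j)
    p_y = joint.sum(axis=1)
    p_s = joint.sum(axis=0)
    return joint / np.sqrt(np.outer(p_y, p_s)), p_s


def inner_max(q, p_s):
    """Second right singular vector of q and the resulting penalty value."""
    _, sing, vt = np.linalg.svd(q, full_matrices=False)
    v = vt[1]                                  # second right singular vector
    v1 = np.sqrt(p_s)                          # top right singular vector
    return v, sing[1] ** 2, float(v @ v1)


rng = np.random.default_rng(0)
s = rng.integers(0, 3, size=500)
logits = rng.normal(size=(500, 4)) + 0.8 * np.eye(3, 4)[s]  # outputs depend on s
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

q, p_s = estimate_q(probs, s, num_groups=3)
v, penalty, overlap = inner_max(q, p_s)
print(penalty, overlap)
```

The returned `overlap` is the inner product of the maximizer with the top singular vector and should be numerically zero, while `penalty` is the estimated squared Rényi correlation that the outer gradient step would then reduce.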

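To understand the convergence behavior of Algorithm 1 for the nonconvex optimization problem (8), we first need to define an approximate stationary solution. Let us define $g(\theta) \triangleq \max_{v \perp v_1,\ \|v\| \le 1} f_D(\theta, v)$. Assume further that $f_D(\cdot, v)$ has $L_1$-Lipschitz gradient; then $g(\cdot)$ is $L_1$-weakly convex; for more details, see rafique2018non . For such a weakly convex function, we say $\theta^*$ is an $\epsilon$-stationary solution if the gradient of its Moreau envelope is smaller than $\epsilon$, i.e., $\|\nabla g_\beta(\theta^*)\| \le \epsilon$ with $g_\beta(\theta) \triangleq \min_{\theta'}\ g(\theta') + \frac{1}{2\beta} \|\theta - \theta'\|^2$, where $\beta > 0$ is a given, sufficiently small constant. The following theorem demonstrates the convergence of Algorithm 1.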

Theorem 4.1 (Rephrased from Theorem 27 in jin2019minmax ).

Suppose that $f_D$ is $L_0$-Lipschitz and $L_1$-gradient Lipschitz. Then Algorithm 1 computes an $\epsilon$-stationary solution of the objective function in (8) in $\mathcal{O}(\epsilon^{-4})$ iterations.

4.2 Binary Case

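When $S$ is binary, we can obtain a more efficient algorithm than Algorithm 1 by exploiting Theorem 3.2. In particular, by a simple rescaling of $\lambda$ and ignoring the constant terms, the optimization problem (7) can be written as

$$\min_{\theta}\ \max_{w \in \mathbb{R}^c}\ \mathbb{E}\big[\mathcal{L}(F(\theta, X), Y)\big] - \lambda \Big[ \sum_{i=1}^{c} w_i^2\, P(\hat{Y}_\theta = i) - \sum_{i=1}^{c} w_i \big( P(\hat{Y}_\theta = i, S = 1) - P(\hat{Y}_\theta = i, S = 0) \big) \Big].$$

Defining $\tilde{S} = 2S - 1$, the above problem can be rewritten as

$$\min_{\theta}\ \max_{w \in \mathbb{R}^c}\ \mathbb{E}\Big[ \mathcal{L}(F(\theta, X), Y) - \lambda \sum_{i=1}^{c} w_i^2\, F_i(\theta, X) + \lambda \sum_{i=1}^{c} w_i\, \tilde{S}\, F_i(\theta, X) \Big].$$

Thus, given training data $(x_n, y_n)_{n=1}^{N}$ sampled from the random variable $Z = (X, Y)$, with $\tilde{s}_n = 2 s_n - 1$, we need to solve

$$\min_{\theta}\ \max_{w \in \mathbb{R}^c}\ \Big( f(\theta, w) \triangleq \frac{1}{N} \sum_{n=1}^{N} \Big[ \mathcal{L}(F(\theta, x_n), y_n) - \lambda \sum_{i=1}^{c} w_i^2\, F_i(\theta, x_n) + \lambda \sum_{i=1}^{c} w_i\, \tilde{s}_n\, F_i(\theta, x_n) \Big] \Big). \tag{10}$$

Notice that the maximization problem in (10) is concave, separable, and has a closed-form solution. Motivated by nouiehed2019solving , we propose Algorithm 2 for solving (10).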

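1:Input: initial point $\theta^0$, step-size $\eta$.
2:for $t = 0, 1, \ldots, T$ do
3:     Set $w^{t+1} \leftarrow \arg\max_{w} f(\theta^t, w)$, i.e., set $w_i^{t+1} \leftarrow \dfrac{\sum_{n} \tilde{s}_n F_i(\theta^t, x_n)}{2 \sum_{n} F_i(\theta^t, x_n)}$ for all $i$
4:     Set $\theta^{t+1} \leftarrow \theta^t - \eta\, \nabla_\theta f(\theta^t, w^{t+1})$
5:end for
Algorithm 2 Rényi Fair Classifier for Binary Sensitive Attributes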
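The alternating structure of Algorithm 2 can be sketched for binary logistic regression with a binary sensitive attribute. This is a hedged illustration rather than the authors' code: it assumes the sample objective is the average of the log loss minus $\lambda \sum_i w_i^2 F_i$ plus $\lambda \sum_i w_i \tilde{s}_n F_i$ with $\tilde{s}_n = 2 s_n - 1$, and the function names, synthetic data, step size, and iteration count are all illustrative choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_outputs(theta, x):
    """Columns are P(Yhat = 0 | x) and P(Yhat = 1 | x)."""
    p1 = sigmoid(x @ theta)
    return np.stack([1.0 - p1, p1], axis=1)


def best_w(probs, s_tilde):
    """Closed-form maximizer of the concave, separable inner problem."""
    return (s_tilde @ probs) / (2.0 * probs.sum(axis=0))


def objective(theta, w, x, y, s_tilde, lam):
    probs = soft_outputs(theta, x)
    p1 = np.clip(probs[:, 1], 1e-12, 1 - 1e-12)
    log_loss = -(y * np.log(p1) + (1 - y) * np.log(1 - p1))
    penalty = probs @ (-(w ** 2)) + s_tilde * (probs @ w)
    return np.mean(log_loss + lam * penalty)


def grad_theta(theta, w, x, y, s_tilde, lam):
    p1 = sigmoid(x @ theta)
    # Per-sample coefficient of F_i in the penalty: -w_i^2 + w_i * s_tilde_n.
    coef = -(w ** 2)[None, :] + np.outer(s_tilde, w)
    dpen = (coef[:, 1] - coef[:, 0]) * p1 * (1.0 - p1)  # derivative w.r.t. the logit
    return x.T @ ((p1 - y) + lam * dpen) / x.shape[0]


def renyi_fair_logreg(x, y, s, lam, step=0.5, iters=500):
    s_tilde = 2.0 * s - 1.0
    theta = np.zeros(x.shape[1])
    for _ in range(iters):
        w = best_w(soft_outputs(theta, x), s_tilde)                      # step 3
        theta = theta - step * grad_theta(theta, w, x, y, s_tilde, lam)  # step 4
    return theta, best_w(soft_outputs(theta, x), s_tilde)


# Synthetic data (an assumption): one feature shifted by the sensitive attribute.
rng = np.random.default_rng(0)
n = 400
s = rng.integers(0, 2, size=n).astype(float)
x = np.column_stack([rng.normal(size=n) + 1.5 * s, np.ones(n)])
y = (x[:, 0] + 0.3 * rng.normal(size=n) > 0.75).astype(float)
theta, w = renyi_fair_logreg(x, y, s, lam=2.0)
print(theta, w)
```

Because the inner problem is solved exactly at each iteration, no adversarial network or inner learning rate is needed; only the outer step size has to be tuned.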

While the result in Theorem 4.1 can be applied to Algorithm 2, under the following assumption, we can achieve a better convergence rate using the methodology in nouiehed2019solving .

Assumption 4.1.

We assume that there exists a constant scalar $\mu > 0$ such that $\sum_{n=1}^{N} F_i(\theta, x_n) \ge \mu$ for all $i = 1, \ldots, c$.

This assumption is reasonable when softmax is used. This is because we can always assume $\theta$ lies in a compact set in practice, and hence the output of the softmax layer cannot be arbitrarily small.

Theorem 4.2 (Rephrased from nouiehed2019solving ).

Suppose that $f$ is $L_0$-Lipschitz and $L_1$-gradient Lipschitz. Then Algorithm 2 computes an $\epsilon$-stationary solution of the objective function in (10) in $\mathcal{O}(\epsilon^{-2})$ iterations.

Notice that this convergence rate is clearly a faster rate than the one obtained in Theorem 4.1.

Remark 4.3 (Extension to multiple sensitive attributes).

Our discrete Rényi classification framework can naturally be extended to the case of multiple discrete sensitive attributes by concatenating all attributes into one. For instance, when we have two sensitive attributes $S_1 \in \{0, 1\}$ and $S_2 \in \{0, 1\}$, we can consider them as a single attribute $S \in \{0, 1, 2, 3\}$ corresponding to the four combinations of $(S_1, S_2)$.
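The concatenation of attributes described in this remark amounts to encoding each combination as one integer. A minimal sketch, with made-up attribute arrays `s1` and `s2`:

```python
import numpy as np

# Two binary sensitive attributes for five samples (illustrative values).
s1 = np.array([0, 0, 1, 1, 1])
s2 = np.array([0, 1, 0, 1, 1])

# Encode each pair (s1, s2) as a single attribute with values in {0, 1, 2, 3}.
s_joint = np.ravel_multi_index((s1, s2), dims=(2, 2))
print(s_joint)
```

The same call handles attributes with more than two levels by changing `dims`, at the cost of a number of groups that grows multiplicatively.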

Remark 4.4 (Extension to other notions of fairness).

Our proposed framework imposes the demographic parity notion of group fairness. However, other notions of group fairness may be represented by (conditional) independence conditions. For such cases, we can again apply our framework. For example, we say a predictor $\hat{Y}_\theta$ satisfies the equalized odds condition if the predictor $\hat{Y}_\theta$ is conditionally independent of the sensitive attribute $S$ given the true label $Y$. Similar to formulation (7), the equalized odds fairness notion can be achieved by the following min-max problem

$$\min_{\theta}\ \mathbb{E}\big[\mathcal{L}(F(\theta, X), Y)\big] + \lambda \sum_{y} \rho_R^2\big(\hat{Y}_\theta, S \mid Y = y\big).$$

5 Fair Rényi Clustering

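In this section, we apply the proposed Rényi fair framework to the widely used K-means clustering problem. Given a set of data points $x_1, \ldots, x_N \in \mathbb{R}^d$, in the K-means problem we seek to partition them into $K$ clusters such that the following objective function is minimized:

$$\min_{A, C}\ \sum_{n=1}^{N} \sum_{k=1}^{K} a_{kn} \|x_n - c_k\|^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} a_{kn} = 1\ \ \forall n, \quad a_{kn} \in \{0, 1\},$$

where $c_k$ is the centroid of cluster $k$; the variable $a_{kn} = 1$ if data point $x_n$ belongs to cluster $k$ and it is zero otherwise; $A = [a_{kn}]$ and $C = [c_1, \ldots, c_K]$ represent the association matrix and the cluster centroids respectively. Now, suppose we have an additional sensitive attribute $S$ for each one of the given data points. In order to have a fair clustering under the disparate impact doctrine, we need to make the random variable $a_n = [a_{1n}, \ldots, a_{Kn}]$ independent of $S$. In other words, we need to make the clustering assignment independent of the sensitive attribute $S$. Using our framework in (2), we can easily add a regularizer to this problem to impose fairness under the disparate impact doctrine. In particular, for a binary sensitive attribute $S$, using Theorem 3.2 and absorbing the constants into the hyper-parameter $\lambda$, we need to solve

$$\min_{A, C}\ \max_{w \in \mathbb{R}^K}\ \sum_{n=1}^{N} \sum_{k=1}^{K} a_{kn} \|x_n - c_k\|^2 - \lambda \sum_{n=1}^{N} \big(a_n^T w - s_n\big)^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} a_{kn} = 1\ \ \forall n, \quad a_{kn} \in \{0, 1\},$$

where $a_n = (a_{1n}, \ldots, a_{Kn})^T$ encodes the clustering information of data point $x_n$ and $s_n$ is the sensitive attribute for data point $n$.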

Fixing the assignment matrix $A$ and the cluster centers $C$, the vector $w$ can be updated in closed form. More specifically, at each iteration $w_k$ equals the current proportion of the privileged group in the $k$-th cluster. Combining this idea with the update rules for assignments and cluster centers in the standard K-means algorithm, we propose Algorithm 3, which is a fair K-means algorithm under the disparate impact doctrine.

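1:Input: $X = \{x_1, \ldots, x_N\}$ and $S = \{s_1, \ldots, s_N\}$
2:Initialize: random assignment $A$ such that $\sum_{k} a_{kn} = 1$ for all $n$ and $a_{kn} \in \{0, 1\}$; set $A_{prev} = 0$.
3:while $A_{prev} \ne A$ do
4:     Set $A_{prev} = A$
5:     for $n = 1, \ldots, N$ do  ▷ Update $A$
6:         $k^* = \arg\min_{k}\ \|x_n - c_k\|^2 - \lambda (w_k - s_n)^2$
7:         Set $a_{k^* n} = 1$ and $a_{kn} = 0$ for all $k \ne k^*$
8:         Set $w_k = \dfrac{\sum_{n} s_n a_{kn}}{\sum_{n} a_{kn}}$ for all $k$  ▷ Update $w$
9:     end for
10:     Set $c_k = \dfrac{\sum_{n} a_{kn} x_n}{\sum_{n} a_{kn}}$ for all $k$  ▷ Update $c$
11:end while
Algorithm 3 Rényi Fair K-means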

The main difference between this algorithm and the popular K-means algorithm is in Step 6 of Algorithm 3. When $\lambda = 0$, this step would be identical to the update of cluster assignment variables in K-means. However, when $\lambda > 0$, Step 6 considers fairness when computing the distance used in updating the cluster assignments.
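A compact sketch of the fair K-means loop is given below. It is an illustration under stated assumptions, not the authors' implementation: the assignment rule is taken to be the squared distance minus $\lambda (w_k - s_n)^2$, the initial centers are computed from a random balanced assignment, a `max_iter` cap is added because termination is not guaranteed in this sketch, and empty clusters keep their previous center. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def renyi_fair_kmeans(x, s, k, lam, max_iter=100, seed=0):
    """Fair K-means sketch for a binary sensitive attribute s in {0, 1}."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    assign = rng.permutation(np.arange(n) % k)    # random start, no empty cluster
    counts = np.bincount(assign, minlength=k).astype(float)
    s_sums = np.bincount(assign, weights=s, minlength=k)
    centers = np.stack([x[assign == j].mean(axis=0) for j in range(k)])
    for _ in range(max_iter):
        prev = assign.copy()
        for i in range(n):
            w = s_sums / np.maximum(counts, 1.0)  # privileged share per cluster
            dist = np.sum((x[i] - centers) ** 2, axis=1)
            best = int(np.argmin(dist - lam * (w - s[i]) ** 2))
            old = assign[i]
            if best != old:                       # move the point, update bookkeeping
                counts[old] -= 1.0
                s_sums[old] -= s[i]
                counts[best] += 1.0
                s_sums[best] += s[i]
                assign[i] = best
        for j in range(k):                        # recompute centroids
            if counts[j] > 0:
                centers[j] = x[assign == j].mean(axis=0)
        if np.array_equal(prev, assign):
            break
    return assign, centers, s_sums / np.maximum(counts, 1.0)


# Synthetic data (an assumption): the two groups are shifted apart in feature space.
rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=200).astype(float)
x = rng.normal(size=(200, 2)) + 3.0 * s[:, None]
assign, centers, w = renyi_fair_kmeans(x, s, k=2, lam=0.0)
assign_fair, centers_fair, w_fair = renyi_fair_kmeans(x, s, k=2, lam=5.0)
print(w, w_fair)
```

Comparing the entries of `w` and `w_fair` gives a quick view of how the group proportions per cluster change once the fairness term is switched on.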

5.1 Numerical Experiments

In this section, we evaluate our fair logistic regression and fair K-means algorithms by performing experiments on the standard Bank (https://archive.ics.uci.edu/ml/datasets/Bank%20Marketing) and Adult (https://archive.ics.uci.edu/ml/datasets/adult) datasets. The Bank dataset contains information on individuals contacted by a Portuguese banking institution. For this dataset, we selected 3 continuous features: age, balance, and duration; the sensitive attribute is the marital status of the individuals. The Adult dataset contains census information on individuals, including education, gender, and capital gain. We selected five continuous features (age, fnlwgt, capital-gain, education-num, hours-per-week), and sampled a subset of the data points for the fair K-means problem. For the fair logistic regression problem, we run our algorithm on both datasets using all training samples. In both cases, we take the sensitive attribute to be the gender of the individuals. We implemented Algorithms 2 and 3 to solve the fair logistic regression and fair K-means clustering problems for the described datasets. The results are summarized in Figure 1.

We use the deviation of the elements of the vector $w$ as a measure of fairness. The element $w_k$ of $w$ represents the ratio of the number of data points that belong to the privileged group ($S = 1$) in cluster $k$ over the number of data points in that cluster. The deviation of these elements measures how much these ratios vary across clusters. A clustering solution is exactly fair if all entries of $w$ are the same. For a fixed number of clusters $K$, we plot in Figure 1 the minimum, maximum, average, and average standard deviation of the entries of $w$ for different values of $\lambda$. For an exactly fair clustering solution, the minimum, maximum, and average should coincide (and the standard deviation should vanish). As we can see in Figure 1, parts (a) and (b), increasing $\lambda$ yields an exactly fair clustering at the price of a higher clustering loss.

For fair logistic regression, we use the demographic parity (DP) violation as a measure of fairness, defined as $\big| P(\hat{Y} = 1 \mid S = 1) - P(\hat{Y} = 1 \mid S = 0) \big|$. A smaller DP violation indicates a fairer solution. Plot (c) shows the DP violation and training error for different values of $\lambda$. We can see that increasing $\lambda$ yields a fairer solution, at the price of a larger training error. In (d), we plot the DP violation and test error for different values of $\lambda$. Interestingly, the test error initially decreases as we increase $\lambda$. This may be due to the bias of the unfair logistic regression against the protected group; in that case, our fairness term plays the role of a regularizer that improves generalization. However, if we further increase $\lambda$, the test performance degrades again, indicating that the fairness term dominates for such large values of $\lambda$.
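The fairness metric used for the classifier can be computed directly from hard predictions. The sketch below assumes the common definition of DP violation for binary predictions and a binary sensitive attribute, namely the absolute gap between the positive-prediction rates of the two groups; the function name `dp_violation` and the toy arrays are illustrative.

```python
import numpy as np


def dp_violation(y_pred, s):
    """Absolute gap in positive-prediction rates between groups S=1 and S=0."""
    y_pred = np.asarray(y_pred, dtype=float)
    s = np.asarray(s)
    return abs(y_pred[s == 1].mean() - y_pred[s == 0].mean())


# Toy predictions: three of four positives in group 1, one of four in group 0.
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1])
s = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(dp_violation(y_pred, s))
```

A value of zero means both groups receive positive predictions at the same rate, which corresponds to demographic parity on the evaluated sample.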

Figure 1: Trade-off between accuracy and fairness for K-means and logistic regression problems