Addressing multiple metrics of group fairness in data-driven decision making

03/10/2020 · by Marius Miron, et al. · Universitat Pompeu Fabra · European Union

The Fairness, Accountability, and Transparency in Machine Learning (FAT-ML) literature proposes a varied set of group fairness metrics to measure discrimination against socio-demographic groups that are characterized by a protected feature, such as gender or race. Depending on the choice of metric, the same decision-making system can be deemed either fair or unfair. Several metrics have been proposed, some of them incompatible with each other. In this paper we study how these metrics relate to one another. We do so empirically, by observing that several of these metrics cluster together in two or three main clusters for the same groups and machine learning methods. In addition, we propose a robust way to visualize multidimensional fairness in two dimensions through a Principal Component Analysis (PCA) of the group fairness metrics. Experimental results on multiple datasets show that the PCA decomposition explains the variance between the metrics with one to three components.


1 Introduction

Machine Learning (ML) systems reduce uncertainty in decision-making by predicting relevant outcomes based on algorithmically detected patterns in data. However, a growing literature has uncovered algorithmic discrimination in sensitive contexts and described fairness-aware ML algorithms [Barocas2016, hajian2016algorithmic]. Statistical formalizations of a value-driven concept such as fairness, constructed so that they can be used in data-driven algorithms, have led to a long and confusing list of criteria and related metrics [narayanan2018translation]. These criteria suffer from incompatibilities and trade-offs between fairness and other objectives, such as accuracy. To better understand the relation between conflicting criteria, we study how a varied set of group-wise classification metrics relate to each other across multiple datasets. For instance, by determining which metric yields more disparity between groups, one can better explain the type of discrimination in a machine learning model and focus on optimizing that specific metric.

There are two main families of algorithmic fairness concepts: individual fairness [dwork2012fairness] and group fairness. The latter derives from a concept of non-discrimination on the basis of membership in a protected group; the majority of the existing literature on Fairness, Accountability, and Transparency in Machine Learning (FAT-ML) refers to it, and it is the focus of this paper. A protected group is a group distinguished by a protected feature [pedreschi2012study], where protected features are usually categories from a given legal context, such as gender or race in anti-discrimination legislation (see, e.g., Article 21 of the EU Charter of Fundamental Rights: https://fra.europa.eu/en/charterpedia/article/21-non-discrimination).

In the case of automatic classification algorithms, group fairness is the absence of group discrimination, and group discrimination is evidenced by imbalances with respect to classification metrics across protected and non-protected groups. These group-wise metrics are usually defined in terms of predicted risk scores, predicted outcomes, actual outcomes, or a combination of them.

Fulfilling specific fairness criteria in a machine learning algorithm is usually done by processing the training data, modifying the way the algorithm works, or modifying the output of the algorithm (pre-, in-, and post-processing [Barocas_mimeo]). In most cases, the unifying idea is that the objective of minimizing a loss function is constrained by the fulfillment of fairness criteria. Clearly, imposing many fairness constraints will make finding an optimum impossible, and in fact, "impossibility theorems" prove that multiple fairness criteria are incompatible under fairly weak assumptions [kleinberg2016inherent, Chouldechova2017, berk2017convex]. However, since different fairness notions lead to different fairness criteria, and not all notions can be fulfilled with just one criterion, algorithm developers are left with the dilemma of deciding between different value concepts when trying to implement an appropriate fairness metric in the algorithm [komiyama2018nonconvex].

In this paper, we develop a framework that helps better deal with this problem by shifting the decision margin: instead of choosing the appropriate fairness metric in a given decision setting, we assess disparate impact by considering a higher number of group classification metrics. Our contributions are listed below:

Clustering of metrics We start by showing empirically that many group metrics are highly correlated and can be clustered into two or three groups, which is in line with findings by Friedler et al. [friedler2019comparative]. Moreover, a Principal Component Analysis (PCA) uncovers how a multidimensional vector of group metrics can in practice be reduced to two main axes. This also allows for the ranking of different ML algorithms along these axes.

Fairness visualization In addition to the empirical analysis, our contributions go in the direction of fairness visualization. The clustering displays a multitude of fairness-related factors and their correlations in a single graph, allowing researchers to focus on a smaller set of metrics that are orthogonal to each other. Furthermore, in order to better visualize the disparities, we align and center the PCA decompositions of the metric matrices corresponding to different ML models.

The remainder of this paper is structured as follows. In Section 2 we discuss the relation to previous work. In Section 3 we present the methodology, including the problem definition, a method to visualize multiple group fairness metrics based on clustering, a method to evaluate the correlations of these metrics across datasets, and a method to compare ML models in terms of fairness using PCA. We present experiments using the proposed framework in Section 4. In Section 5 we present the conclusions.

2 Relation to previous work

The FAT-ML literature is voluminous, multidisciplinary, and rapidly evolving. Reviews on fairness criteria in decision making are provided by Romei and Ruggieri [romei2014multidisciplinary], Žliobaitė [zliobaite2017measuring], as well as Barocas and Selbst [Barocas2016], who elaborate on mechanisms to address biased data and algorithmic unfairness. There are different methods to ensure that fairness criteria are satisfied in classification algorithms [kamiran2012data, calmon2017optimized, Hardt2016, pleiss2017fairness, Zafar2017fairness, agarwal2018reductions, komiyama2018nonconvex]. In this research we do not propose a fair ML algorithm. Our goal is to study the disparity that occurs when deploying general-purpose ML algorithms. Specifically, we propose a framework to analyze fairness, similar to other frameworks which test software for discrimination, such as Themis [galhotra2017fairness], Aequitas [saleiro2018aequitas], and BlackBoxAuditing [adler2018auditing]. Following the recommendations in [friedler2019comparative], and in contrast to Themis and BlackBoxAuditing, we do not explore other metrics in the realms of causal discrimination or indirect influence (our framework can also explore feature importance, but this is not the goal of this paper). Similar to Aequitas, which provides a map to navigate between different group-fairness metrics according to the type of problem, we aim to give a bigger picture of how these metrics are related, and we plot the disparity between various protected groups in a lower-dimensional space.

The choice of an appropriate fairness measure from a long list of potential ones can be very complex, as it depends on the respective policy context and the stakeholders involved [narayanan2018translation]. For instance, [liu2018delayed, menon2018cost] restrict their analysis to two fairness metrics: true positive rates and predictive prevalence. Another fairness-enhancing method [Zafar2017fairness] optimizes for false positive rates and false negative rates, a constraint which is also discussed in [berk2017convex]. Furthermore, there are impossibility theorems [kleinberg2016inherent, Chouldechova2017] that mathematically prove the impossibility of reconciling different fairness notions if the prevalence of the outcome, i.e., the "base rate", differs across protected groups. There is a tension between fairness criteria and optimal accuracy [Corbett2017, Zafar2017fairness, berk2017convex, menon2018cost]. To that end, in this paper we study the conflict between group-wise accuracy metrics and fairness-related metrics. In particular, we study how group-wise binary classification metrics relate to each other using clustering and correlation.

Another approach to addressing the complexity in fair machine learning is to simplify the long list of fairness criteria before addressing its tensions. Our approach relates most to the one taken by Friedler et al. [friedler2019comparative], who show that most fairness measures are highly correlated with one another. As a further extension of their work, we use clustering to obtain a more in-depth view of how the measures are related to each other. After performing a PCA of all tested fairness measures, our approach goes one step further by extending the framework to the comparison of algorithms.

3 Methodology

3.1 Problem definition

A dataset contains a set of features $X$, including a set of protected features $A$, with $A \subset X$, and a set of associated binary outcomes $Y$. A binary decision-making system takes as input the features $X$ and solves a binary classification task with an output $\hat{Y} \in \{0, 1\}$, where the binary labels have different meanings depending on the task, e.g., not re-offended/re-offended, bad/good credit score, not receiving/receiving a benefit. Depending on the impact on the human subjects, a decision is assistive, as in the case of credit scoring (should/should not receive a loan), or punitive, as in the case of recidivism prediction (recidivist/non-recidivist) in criminal justice, where recidivism is defined as the act of a person committing a crime after they have been convicted of an earlier crime [brennan2013emergence]. A machine learning model solving this binary classification task yields a set of predictions $\hat{Y}$. The performance of the machine learning model is usually measured on test data comprising pairwise observations $x_i$ and their associated ground-truth binary labels $y_i$.

Group metrics: Let $M = \{m_1, \ldots, m_{|M|}\}$ be a set of metrics used to report the performance of a given machine learning model on the test set, such as the ones defined in Section 4.3 (accuracy, false positive rate, etc.). In the literature, group fairness is defined for a particular metric $m \in M$ in relation to a protected feature (e.g., gender). Hence, given a protected feature $a$ (e.g., gender) from $A$ and the associated groups $G_a$ (e.g., {Men, Women}), we compute the group-wise metrics $m(g)$ for each $g \in G_a$ (i.e., given gender as the protected feature, we compute the metrics for the groups Men and Women). An outcome is considered fair with respect to the metric $m$ and two groups $g_i, g_j \in G_a$ if $m(g_i) = m(g_j)$.

In this paper we aim at developing a method to evaluate group fairness that encompasses several metrics in $M$ rather than relying on a single metric $m$. To that end, we want to compare ML models $c \in C$ for all groups $G_a$ of a given protected feature $a$ across all the metrics in $M$.
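As an illustration, the parity condition above can be checked jointly over a set of metrics by comparing each group against a reference group; the following is a minimal sketch in Python with hypothetical metric values, not code from the paper.

```python
def parity_gaps(group_metrics: dict, reference: str) -> dict:
    """Per-metric difference between each group and the reference group
    (a zero gap means parity on that metric). Hypothetical helper."""
    ref = group_metrics[reference]
    return {
        group: {m: vals[m] - ref[m] for m in ref}
        for group, vals in group_metrics.items()
    }

# Toy example with two metrics and two gender groups.
group_metrics = {
    "Men":   {"tpr": 0.71, "fpr": 0.22},
    "Women": {"tpr": 0.64, "fpr": 0.31},
}
print(parity_gaps(group_metrics, reference="Men"))
```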

3.2 Clustering of metrics

Here we aim at discovering relations between different metrics, at finding out which produce more discrimination, and at determining which groups are more discriminated against across multiple features. We achieve this by clustering the metric vectors $m(g, c)$ computed for each group $g$ and machine learning method $c$ at each protected feature $a$. For each protected feature $a$ we form the matrix $F_a$ with the vectors $m(g, c)$ on its rows. The matrix $F_a$ has size $R \times |M|$, where $R = |G_a| \cdot |C|$.

Clustering the columns: We cluster the columns to analyze the relation between metrics. We compute the pairwise correlation between the columns of this matrix. Then, we perform a hierarchical clustering of the metrics using a correlation-based distance. Clusters are merged and created using the un-weighted pair grouping method (UPGMA) [day1984efficient].

Clustering the rows: We cluster the rows, i.e., the ML models and groups, to discover how far different groups are from each other for each ML model. This clustering involves computing the vector $d_a$ of pairwise distances between the rows of $F_a$.

In order to see if the clusters yielded by the distance vectors are similar across different datasets and protected features, we compute the correlation between the distance vectors. A high correlation means that the metrics produce a similar clustering across different datasets.
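To make the procedure concrete, the following is a minimal sketch of both clusterings using scipy, under our own assumptions that the column distance is one minus the correlation and that a random matrix stands in for $F_a$; variable names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for F_a: rows = (group, model) pairs, columns = metrics.
rng = np.random.default_rng(0)
F_a = rng.random((10, 12))

# Clustering the columns: correlation between metrics, then UPGMA (average linkage).
corr = np.corrcoef(F_a, rowvar=False)              # |M| x |M| correlation matrix
col_dist = squareform(1.0 - corr, checks=False)    # condensed correlation-based distance
col_linkage = linkage(col_dist, method="average")  # un-weighted pair grouping
metric_clusters = fcluster(col_linkage, t=2, criterion="maxclust")

# Clustering the rows: pairwise distances between (group, model) metric vectors.
row_dist = pdist(F_a)                              # condensed distance vector d_a
row_linkage = linkage(row_dist, method="average")
group_clusters = fcluster(row_linkage, t=3, criterion="maxclust")

print(metric_clusters, group_clusters)
```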

3.3 PCA decomposition on the columns and visualization

Due to the high number of metrics $|M|$, it is cumbersome to visualize the distance between groups and ML models using the matrix $F_a$ of dimensions $R \times |M|$. It is therefore desirable to reduce the number of metrics by projecting the columns of the matrix to a lower dimension.

Because we need to compare distances between a reference group, usually the largest group, and the other groups across different ML methods, a straightforward PCA decomposition of the full matrix $F_a$ produces a 2D or 3D scatter plot of all data points in the matrix, which makes it difficult to compare between various ML models. In order to have a fair comparison between ML models, we need to plot the disparity between the reference group and the other groups for each ML model separately. Thus, we form a metrics matrix for each ML model and obtain the corresponding basis vectors; the projection is then obtained by multiplying each matrix with its basis vectors. For a better visualization of the disparity between groups, we align and overlay these plots with the axes centered on the reference group.

For each protected feature $a$ we form a matrix $F_{a,c}$ which contains solely the metrics for an ML model $c$. Then, for each $c \in C$ we apply a PCA decomposition to the matrix $F_{a,c}$, obtaining the eigenvectors $W_c$. The projection is then computed as $P_{a,c} = F_{a,c} W_c$ for all $c \in C$. Let $r$ be the index of the largest group in $G_a$. Then, we align the matrix $P_{a,c}$ with respect to $r$ by subtracting the row vector $P_{a,c}[r]$ from each row of the matrix.
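A minimal sketch of the per-model projection and alignment, assuming sklearn's PCA and two components; the function name and toy data are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def aligned_projection(F_ac: np.ndarray, ref_row: int, n_components: int = 2):
    """Project one model's metrics matrix to a low dimension and center the
    projection on the reference (largest) group so that models can be overlaid."""
    pca = PCA(n_components=n_components)
    P = pca.fit_transform(F_ac)           # one row per group
    return P - P[ref_row], pca.explained_variance_ratio_

# Toy example: 5 groups x 12 metrics for one ML model, reference group in row 0.
rng = np.random.default_rng(1)
P_aligned, explained = aligned_projection(rng.random((5, 12)), ref_row=0)
print(P_aligned.shape, explained)
```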

4 Experiments

4.1 Datasets

4.1.1 YouthCAT: Recidivism in juvenile justice in Catalonia.


This dataset contains data on juvenile recidivism in Catalonia for offenders aged 12-17 years (N=4,753), provided by the Centre for Legal Studies and Specialised Training [cejfe_2017_savry] and available online at http://cejfe.gencat.cat/en/recerca/opendata/jjuvenil/reincidencia-justicia-menors/index.html. The crimes were committed between 2002 and 2010 and all sentences were finished by 2010. Recidivism was reported in 2013 and 2015. The dataset contains demographic features (age, foreign status, nationality: Spanish, European, Latin American, Maghrebi, Other) and criminal history features (number of crimes, type of crime, sentence). This dataset has been used in the following study on algorithmic fairness: [Miron2019].

4.1.2 COMPAS: Recidivism risk score in Broward County.

Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a risk assessment tool developed by Northpointe which assesses a criminal defendant's likelihood to re-offend (data provided by ProPublica, available online at https://github.com/propublica/compas-analysis/). The database contains criminal history, jail and prison time, demographics, and COMPAS risk scores for defendants from Broward County from 2013 and 2014. The data comprise demographic features (age, gender, race) and criminal history features (count of prior crimes, type of crime). This dataset has been used in the following studies on algorithmic fairness: [Chouldechova2017, Corbett2017].

4.1.3 Statlog: German Credit Dataset.

This dataset has been used to classify people (N=1,000) described by a set of attributes as good or bad credit risks (provided by Hamburg University, available online at https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). As attributes, the dataset contains demographic features (age, foreign status, gender and marital status) and qualitative features (status of account, savings, credit history, purpose). This dataset has been used in the following studies on algorithmic fairness: [zemel2013learning, fish2016confidence].

4.1.4 Credit: Default of Taiwanese Credit Card Clients.

This dataset [yeh2009comparisons] has been used to detect default payments in Taiwan (N=30,000) (provided by Chung Hua University, Taiwan, available online at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). The data include demographic features such as age, gender, marital status, and education, as well as credit history data including the amount of given credit, the amount of bill payment, and the history of past payments. This dataset has been used in the following studies on algorithmic fairness: [berk2017convex, lipton2018does].

These datasets differ from other datasets used in binary classification in that they correspond to decision-making processes which can be automated using machine learning. According to the classification in Section 3.1, the decisions in the case of credit scoring (Statlog) are assistive, while in the cases of recidivism prediction (COMPAS, YouthCAT) and credit card default (Credit) they are punitive.

Due to the impact on human subjects, the decisions have an important ethical dimension [Barocas2016] and are often addressed from the point of view of fairness in decision making, as indicated by their use in the FAT-ML literature. In this case we are interested in group fairness, i.e., whether the machine learning decisions are biased with respect to a particular category of people. The protected features are gender for all four datasets, foreigner status for YouthCAT and Statlog, and national group and race for YouthCAT and COMPAS, respectively.

4.2 Machine Learning Methods

Each of the datasets described in Section 4.1 poses a decision-making problem which can be modeled as a binary classification task: predicting whether someone will recidivate (COMPAS, YouthCAT), predicting whether someone will default on a payment (Credit), and predicting whether a person is a good creditor (Statlog).

We test a number of machine learning algorithms for supervised learning: logistic regression (logit), multi-layer perceptron (mlp), support vector machine with a linear kernel (lsvm), K-nearest neighbors (knn), random forest (rf), decision trees (tree), and naive Bayes (nb) [robert2014machine].

To account for overfitting, we use cross-validation to split the data between training and testing. In each split, the validation data is chosen from the training set, with a fixed percentage of randomly selected elements kept for validation. The validation set is used to tune the hyper-parameters of the ML models and to pick the binarization threshold for their predictions.
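A sketch of this splitting scheme with sklearn; the number of folds and the validation fraction below are placeholders, not the values used in the paper.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.random.rand(1000, 8)           # toy features
y = np.random.randint(0, 2, 1000)     # toy binary labels

kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # placeholder k
for seed, (train_idx, test_idx) in enumerate(kfold.split(X)):
    # Hold out a random part of the training fold for validation (placeholder fraction).
    tr_idx, val_idx = train_test_split(train_idx, test_size=0.2, random_state=seed)
    X_tr, y_tr, X_val, y_val = X[tr_idx], y[tr_idx], X[val_idx], y[val_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... tune hyper-parameters and the binarization threshold on (X_val, y_val) ...
```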

Fairness-aware machine learning aims at fixing the disparities in ML algorithms with respect to a single metric. Due to the fact that these methods operate at different steps (pre-, post-, or during training) and can optimize with respect to different metrics, we do not attempt a detailed comparison between these methods in this paper.

4.3 Metrics

Here we aim at computing a set of performance metrics from which we derive the group-wise metrics corresponding to the protected features and the groups in each dataset. There are various metrics for evaluating an ML method, and some of them are application-dependent. For instance, an ML system in a criminal justice context might be evaluated differently from one in e-commerce. Furthermore, the meaning of a metric changes when the decision-making intervention is assistive ($\hat{Y}=1$ means a good creditor in Statlog) or punitive ($\hat{Y}=1$ means a recidivist in COMPAS).

4.3.1 Performance metrics.

The ML predictions and their associated ground-truth binary labels determine four numbers: true positives $TP$ (correct positive assignments), true negatives $TN$ (correct negative assignments), false positives $FP$ (incorrect positive assignments), and false negatives $FN$ (incorrect negative assignments). From these we can calculate various metrics, including the True Positive Rate, $TPR = TP/(TP+FN)$, with its complement the False Negative Rate, $FNR = FN/(TP+FN)$; and the True Negative Rate, $TNR = TN/(TN+FP)$, with its complement the False Positive Rate, $FPR = FP/(TN+FP)$. Furthermore, we can compute the Positive Predictive Value, $PPV = TP/(TP+FP)$, with complement the False Discovery Rate, $FDR = FP/(TP+FP)$; and the Negative Predictive Value, $NPV = TN/(TN+FN)$, with complement the False Omission Rate, $FOR = FN/(TN+FN)$. Furthermore, we compute metrics which depend on the prevalence: the Predicted Prevalence $PPrev = (TP+FP)/N$ and the Predicted Positive Rate $PPR = (TP+FP)/N_g$, where $N$ is the total number of people in the dataset and $N_g$ is the total number of people in the data who are part of a group $g$. In this case, only $PPR$ makes sense as a group-wise metric.

We also compute Accuracy, $ACC = (TP+TN)/(TP+TN+FP+FN)$, which measures how well a model correctly detects or excludes a condition, and Balanced Accuracy, $BACC = (TPR+TNR)/2$.
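For reference, the rates above translate directly into code; the helper below is our own transcription of these formulas, not the paper's implementation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def rate_metrics(y_true, y_pred):
    """Confusion-matrix-based metrics used group-wise in the analysis."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "tpr": tp / (tp + fn), "fnr": fn / (tp + fn),
        "tnr": tn / (tn + fp), "fpr": fp / (tn + fp),
        "ppv": tp / (tp + fp), "fdr": fp / (tp + fp),
        "npv": tn / (tn + fn), "for": fn / (tn + fn),
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "bacc": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
print(rate_metrics(y_true, y_pred))
```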

If the ML predictions are probabilistic, then these metrics can be obtained at different classification thresholds applied to the output probability. To measure predictive performance we use the area under the ROC curve (AUC), which trades off the false positive rate (1 − specificity) against the true positive rate (sensitivity) over all thresholds.

4.3.2 Computing a classification threshold.

To maximize the performance of the models we choose the threshold which maximizes balanced accuracy on the validation set, defined as $BACC(t) = (TPR(t) + TNR(t))/2$, where $t$ is the varying threshold, $TPR(t)$ is the true positive rate, and $TNR(t)$ is the true negative rate. The best threshold is $t^* = \arg\max_t BACC(t)$ on the validation set. Here we report $BACC$ as $BACC(t^*)$ computed on the test set.
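A sketch of the threshold search, assuming a simple grid over probability values; the grid and function name are ours.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def best_threshold(y_val, p_val, grid=np.linspace(0.01, 0.99, 99)):
    """Return the threshold t* maximizing balanced accuracy on the validation set."""
    scores = [balanced_accuracy_score(y_val, (p_val >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

# The selected t* is then reused to binarize the test-set probabilities:
# y_pred_test = (p_test >= best_threshold(y_val, p_val)).astype(int)
```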

4.3.3 Group-wise metrics.

As described in Section 3.1, the classification metrics are computed group-wise to evaluate the fairness of ML models [narayanan2018translation]. Thus, we calculate $m(g, c)$ for each experiment, for each protected feature $a$ and its corresponding groups $g \in G_a$, and for all the ML models $c \in C$.
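A sketch of the group-wise evaluation loop, assuming a pandas test frame with the protected feature, labels, and predictions as columns; the column names and the metric function (e.g., the rate_metrics helper sketched above) are placeholders.

```python
import pandas as pd

def groupwise_metrics(df_test: pd.DataFrame, protected: str, y_col: str,
                      yhat_col: str, metric_fn) -> pd.DataFrame:
    """Compute the metric vector m(g, c) for every group g of one protected feature."""
    rows = {
        group: metric_fn(part[y_col].to_numpy(), part[yhat_col].to_numpy())
        for group, part in df_test.groupby(protected)
    }
    return pd.DataFrame(rows).T   # rows = groups, columns = metrics

# Example (hypothetical column names):
# F_race_logit = groupwise_metrics(test_df, "race", "y", "y_pred_logit", rate_metrics)
```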

4.4 Experimental setup

4.4.1 Data encoding.

Data is encoded numerically. Numerical values are normalized to have a mean of 0 and a standard deviation of 1. Categorical features are encoded as binary if they have two possible values, or using one-hot encoding when they have multiple categories. Ordinal features (e.g., low, medium, high) are also encoded numerically.
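A sketch of such an encoding with sklearn; the column names and ordinal categories below are placeholders for the dataset-specific features.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

numeric_cols = ["age", "n_prior_crimes"]          # placeholder column names
categorical_cols = ["nationality", "crime_type"]
ordinal_cols = ["risk_level"]                     # e.g., low / medium / high

encoder = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # binary or one-hot
    ("ord", OrdinalEncoder(categories=[["low", "medium", "high"]]), ordinal_cols),
])

# X_train = encoder.fit_transform(train_df)
# X_test = encoder.transform(test_df)
```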

4.4.2 Parameters and model selection.

We perform k-fold cross-validation. Each fold is replicated several times with a different random seed that controls the random split between the training, validation, and testing sets.

While some ML classifiers (nb, logit, mlp) naturally produce a classification probability, for the other classifiers (lsvm, tree, rf) this probability can be derived using additional methods implemented within the sklearn library.

For each random seed we determine the best hyper-parameters for each ML algorithm by training several models per algorithm, each representing a different random combination of hyper-parameters. For logit we pick the inverse of the regularization strength from a uniform distribution over a fixed range. For mlp, we use a two-layer network with sizes $(n, h)$, where $n$ is the number of input features and $h$ is chosen randomly from a uniform distribution over a fixed range; the batch size was determined experimentally, and we update the parameters using stochastic gradient descent for a fixed number of epochs. The cost function for mlp classification is binary cross-entropy, with a penalty on the weights to avoid over-fitting. For knn, the number of neighbors is picked randomly from a fixed range and the distance metric is chosen among Minkowski, Euclidean, and Manhattan. For lsvm, the penalty C is drawn from a uniform distribution. For rf, we randomly pick the number of estimators, the maximum depth, and the minimum number of samples per leaf from fixed ranges. We select as the best model for each ML algorithm the one having the highest AUCROC on the validation set.

4.4.3 Software implementation details.

The experiments are replicated for several seeds to ensure robustness and reproducibility. The code is implemented in Python using libraries such as pandas and sklearn-pandas for data processing, sklearn and pytorch for machine learning, and numpy and scipy for numerical processing. This research complies with research reproducibility principles, and the code is made publicly available as part of a framework (HUMAINT repository: https://gitlab.com/HUMAINT/humaint-fatml).

4.5 Results

4.5.1 Fairness analysis using clustering.

Towards a fairness analysis encompassing a wide variety of metrics, we look at how the group-wise measures, ML models, and groups cluster. The matrix $F_a$ for a protected feature $a$ (e.g., race) comprises all computed metrics across all groups and for all ML models. The hierarchical clustering of $F_a$ on rows and columns gives important information on how metrics relate to each other and on how groups differ across all metrics.

Figure 1: The matrix $F_a$ and the resulting hierarchical clustering for the metrics (columns) and for the groups and ML models (rows), for the COMPAS dataset and the protected feature race. The metrics are presented along with their corresponding variances.

Figure 1 shows an example of the matrix $F_a$ and of the resulting clusters for the COMPAS dataset and the protected feature "race" for a single seed. For brevity, we limit the number of ML models in the plot to the top two in terms of AUCROC: logit and mlp.

The matrix $F_a$ is presented as a table with the metrics on the columns and the groups (Caucasian, African-American, Hispanic, Other, Native-American) and ML models (logit and mlp) on the rows. We display the clustering information for the metrics above the columns and the clustering information for the groups and ML models to the left of the rows. The clustering is computed according to the method in Section 3.2.

First, we look at the clustering of the columns. We observe that the evaluation metrics cluster into two different groups: we associate the first group with performance metrics and the second one with error and prevalence metrics. The two main clusters observed in Figure 1 are in line with the correlations between metrics observed in [friedler2019comparative].

We display the variance of each metric near the corresponding label. The variance and the clustering information can be used to choose a set of metrics to optimize for, e.g., the metrics with the highest variance from two separate clusters. Note that the variance is equal for complementary metrics. There is less variance for the accuracy metrics, which are closely related. Since the ML models are trained to optimize predictive performance, this finding is in line with the trade-off between accuracy and fairness in the literature. Furthermore, the largest disparity between groups occurs at the metrics with the largest variance. Complementary metrics are, by definition, such that optimizing for parity in one also optimizes the other. Finally, a few metrics do not cluster closely with any other metric.

Second, we look at the clustering of the rows to assess the disparity between different groups and methods. With respect to the first cluster of metrics, African-Americans have a lower TNR for logit and mlp, meaning that they are less likely to be correctly labeled as non-recidivists, and a higher TPR, meaning that they are more likely to be correctly labeled as recidivists. With respect to the second cluster of metrics, a higher proportion of African-Americans and Native-Americans are classified as recidivists when compared to other groups (a higher predicted positive rate), and the ML models are less likely to wrongly classify them as recidivists. These disparities yielded by ML models on the COMPAS dataset have been widely discussed in the FAT-ML literature, however using only one or two fairness metrics.

4.5.2 Robustness testing across datasets.

Considering the clusters seen in Section 4.5.1 for the matrix $F_a$, we perform a PCA decomposition of the matrix for all datasets and seeds to determine the number of components and the explained variance. Across all machine learning methods and datasets, the first two components explain most of the variance between the groups for the chosen classification metrics. This finding is consistent with the two main clusters observed in Figure 1. Note that similar clusters were observed across other datasets and protected features, a fact confirmed by the high correlation coefficients in Figure 2. The correlation of metrics observed by Friedler et al. [friedler2019comparative] was not tested for robustness across multiple datasets.

We want to assess whether the clusters observed in Figure 1 are consistent across other datasets and protected features. To do so, we compute correlation coefficients between all distance vectors on all protected features and datasets. Means and standard deviations of the correlation coefficients across all seeds are reported in Figure 2.
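A sketch of this robustness check, under our assumption that the compared distance vectors are the condensed metric-wise distances (which have the same length for every dataset and protected feature as long as the metric set is fixed); the data and names are placeholders.

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import squareform

def metric_distance_vector(F_a: np.ndarray) -> np.ndarray:
    """Condensed correlation-based distance between the metric columns of F_a."""
    corr = np.corrcoef(F_a, rowvar=False)
    return squareform(1.0 - corr, checks=False)

def cross_setting_correlations(vectors: dict) -> dict:
    """Pearson correlation between the distance vectors of every pair of settings."""
    return {
        (a, b): np.corrcoef(vectors[a], vectors[b])[0, 1]
        for a, b in combinations(vectors, 2)
    }

# Toy example with two dataset/feature combinations sharing the same 12 metrics.
rng = np.random.default_rng(2)
vectors = {
    "compas/race": metric_distance_vector(rng.random((10, 12))),
    "youthcat/national_group": metric_distance_vector(rng.random((14, 12))),
}
print(cross_setting_correlations(vectors))
```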

Figure 2: The means and standard deviations of the correlations between the distance vectors for all datasets and protected features. Columns are in the same ordering as rows.

We observe that the distance vectors for YouthCAT and COMPAS across the selected protected features are highly correlated. The two datasets point to a similar scenario, criminal recidivism. However, in the case of COMPAS, race yields less correlation with the national group or foreigner status features. In fact, YouthCAT does not explicitly include race as a feature, although national groups may encode different races. This points to the fact that the categories considered in a dataset may yield different results in terms of group fairness [benthall2019racial].

Although each dataset and protected feature yields two or three clusters, the way the metrics are distributed between the clusters is scenario-dependent. Hence, except for COMPAS and YouthCAT, the metrics obtained for different datasets do not correlate highly with each other.

The particular case of Statlog shows distance vectors that are poorly correlated with the ones from the other three datasets. This can be explained by the fact that Statlog presents a different type of problem: the decision making is assistive, unlike in YouthCAT, COMPAS, and Credit, for which the decision is punitive. This simple fact changes the meaning of the labels and the meaning of the metrics. Note that in the comparative analysis of fairness-enhancing methods [friedler2019comparative], the datasets for which the decision is punitive are analyzed separately from the assistive ones.

4.5.3 Multi-metric fairness visualization using PCA decomposition.

We aim at visualizing the distance between the ML methods and groups across all metrics in a lower dimension, using the method described in Section 3.3. The plots are centered on the reference group, here taken to be the largest among all groups.

Figure 3: The aligned PCA projection for the protected feature "national_group" in YouthCAT, along with the fraction of explained variance for each component.

Figure 3 shows the data points corresponding to the PCA matrices aligned between ML models for the YouthCAT dataset and the protected feature "national_group". For brevity, we limit the number of ML models to the top two in terms of AUCROC: logit and mlp.

The groups in Figure 3 for the dataset YouthCAT are Spanish, European, Latin American, Maghrebi, and Other. The reference group on which the plot is centered is Spanish. We observe that points corresponding to Maghrebi are far from the reference group and from all the other groups on the axis determined by the first component. Similarly, the Europeans and Others are far from the reference group on the axis determined by the second component.

While the PCA axes do not hold any specific meaning, in contrast to the classification metrics, they can convey the magnitude of disparity, which is not easily accessible through the clustering in Figure 1. Moreover, each axis represents a linear combination of the metrics, which were shown to be highly correlated in Section 4.5.1 and by Friedler et al. [friedler2019comparative]. The coefficients of this linear combination can easily be obtained from the eigenvectors.

5 Conclusions

In this paper we propose a reproducible and open methodology to visualize and study group fairness in data-driven decision making, beyond the limitations of an analysis relying on a very limited set of metrics. The context in which this framework is developed is characterized by various facets and definitions of fairness metrics [romei2014multidisciplinary, zliobaite2017measuring, Barocas2016] and by impossibility theorems [kleinberg2016inherent, Chouldechova2017], which prove that different fairness metrics cannot, in general, be optimized simultaneously. In this context, we find that fairness measures are highly correlated and that it is convenient to visualize and assess fairness in two or three orthogonal dimensions. Moreover, our experiments show that the classification metrics group into two or three clusters. The resulting clusters do not generalize over the analyzed datasets and are dependent on each scenario.

A two-dimensional reduction makes it possible to compare different ML models in terms of fairness and to identify the groups affected by disparate impact. However, in this representation the axes do not hold any specific meaning, and it is difficult to claim that a group is discriminated against. We recommend that the PCA analysis in Section 4.5.3 be used in conjunction with the clustering plot in Section 4.5.1. While the former is useful to compare ML models and to obtain an initial measure of disparity, the latter offers information on which metrics are problematic for each group and on how these metrics are related.

5.0.1 Limitations.

The present study does not compare with decision-making systems which do not rely on ML, such as structured professional judgment tools like SAVRY [cejfe_2017_savry], which has been applied to the YouthCAT dataset. Nor does it compare fairness-enhancing methods. The results are reported for a set of machine learning methods and clusters of metrics. The variance of the PCA components can change when including different systems in the evaluation.

5.0.2 Future work.

The open source framework allows for the current methodology to be applied to any binary decision making dataset. We plan on extending the current study to include more datasets and a comparison with fairness-enhancing methods.

References