Detecting discriminatory risk through data annotation based on Bayesian inferences

01/27/2021 · Elena Beretta, et al. · Politecnico di Torino, Fondazione Bruno Kessler

Thanks to the continuous growth of computational power and data availability, research in machine learning has advanced with tremendous rapidity. Nowadays, the majority of automatic decision making systems are based on data. However, it is well known that machine learning systems can produce problematic results if they are built on partial or incomplete data. In fact, in recent years several studies have found a convergence of issues related to the ethics and transparency of these systems in the process of data collection and in the way data are recorded. Although rigorous data collection and analysis are fundamental to model design, this step is still largely overlooked by the machine learning community. For this reason, we propose a data annotation method based on Bayesian statistical inference that aims to warn about the risk of discriminatory results of a given data set. In particular, our method aims to deepen knowledge and promote awareness about the sampling practices employed to create the training set, highlighting that the probability of success or failure conditioned on membership in a minority group is given by the structure of the available data. We empirically test our system on three data sets commonly used by the machine learning community and investigate the risk of racial discrimination.


1. Introduction

Over the last decades, machine learning systems have been spreading widely across academic domains, and many public and private sectors are increasingly exploiting them. Their diffusion and pervasiveness are mainly driven by the exponential growth of computational power and the extensive availability of large amounts of data (Kleinberg, 2018). Supervised machine learning models are particularly widespread and now deeply rooted in different sectors due to their versatility of use. The predictive ability of supervised machine learning systems is deployed in disparate areas of application: credit reliability (Kleinberg, 2018), the justice system (Angwin and Parris, 2016; Berk et al., 2018), job recommendations (Siting et al., 2012), university selection processes (Kanoje et al., 2016), cultural contents (Schedl et al., 2018; Indira and Kavithadevi, 2019) and purchase recommendations (Oyebode and Orji, 2020). The key ingredient that supervised machine learning models have in common is the availability of a set of labeled data used to train the model in elaborating a response related to past events (Gebru et al., 2018). Since the known properties of the available data are used to create a classifier that makes predictions about new entities of the same type, the structure, properties and quality of the data largely and directly influence the quality of the model and of the results it produces (O’Neil, 2016; Albarghouthi and Vinitsky, 2019). Although data-driven decision models have been shown to produce both economic and social benefits, many researchers have highlighted several problems and harms related to their use in different areas, especially when they are built on partial or incomplete data (Hardt et al., 2016; Dwork et al., 2012). As a matter of fact, in recent years several studies have found a convergence of issues related to the ethics and transparency of these systems in the process of data collection and in the way data are recorded (Marda and Narayan, 2020). While the process of rigorous data collection and analysis is fundamental to the design of the model, this step is still largely overlooked by the machine learning community (Beretta et al., 2019; Jo and Gebru, 2020). As the practice of removing protected attributes from available data has been shown to potentially exacerbate further discrimination (Williams et al., 2018) - making bias even more difficult to detect - practices related to data collection, data transparency and data explainability become even more relevant and urgent. The aim of our work is to provide a data annotation system that serves as a diagnostic framework containing immediate information about data appropriateness, in order to more accurately assess the quality of the available data used to train models. We propose a data annotation method based on Bayesian statistical inference that aims to warn of the risk of discriminatory results of a given data set. In particular, our method aims to deepen the statistical knowledge related to the information contained in the available data, and to promote awareness of the sampling practices used to create the training set, highlighting that the probability of a discriminatory result is strongly influenced by the structure of the available data. We test our data annotation system on three data sets widely used in the machine learning community: the COMPAS dataset (Harry Thornburg Larson et al., 2016), the Drug Consumption dataset (Fehrman et al., 2015; Fehrman et al., 2017) and the Adult dataset (Kohavi and Becker, 1996).

1.1. Problem Statement

The majority of machine learning systems are based on the processing of historical data (Noble, 2018). This is particularly true of supervised machine learning models. Several studies have shown evidence that many equity and discrimination issues are due to the properties of the input data (Benjamin, 2019). Most of the data sets used today to train models are chosen through non-probabilistic methods, generating problems of data imbalance and representativeness (Eubanks, 2018; Noble, 2018). This means that different fractions of the population do not have the same opportunity to be represented within the sample - i.e., the training set - leading some groups of individuals to have a lower probability of being represented. Commonly observed effects of bad sampling are the underestimation and overestimation of some groups (Barocas et al., 2018). Undetected distortions in the data may also easily produce spurious statistical dependencies: this happens when the data structure induces dependence between two variables that are not linked by a real cause-effect relationship.

Data Sampling

A key moment in the pipeline of a machine learning model is when the programmed algorithm is supplied with training data representing the entities on which the model itself builds its knowledge to make predictions. The quality of the data used in this phase is fundamental for the desired result, according to the principle of "garbage in - garbage out": even the most sophisticated models can present distorted results in the presence of low quality data (Tommasi et al., 2017). One of the main causes of data distortion is the way the data are selected and provided to the algorithm, which can suffer from inaccuracy, lack of updates or inadequate representativeness. However, while knowledge of bias typologies has proliferated over the years, less attention is paid to issues concerning data collection, notation and sampling (McDuff et al., 2018). In the spirit of fostering a broader awareness of data handling, we provide a reasoned list of issues that may arise during this phase:

  • Data Selection: the proliferation of available data sets for the same kind of problem to be analyzed makes the a priori choice of a given data set hard (Gebru et al., 2018);

  • Inadequate sampling methods: most models are trained with data sets that have been "found" rather than obtained through probabilistic sampling methods, leading to limited or no control over the data (Asudeh et al., 2019);

  • Cost and Time Limit: collecting large amounts of data that present proportional representations of each property with respect to a sensitive attribute is time consuming and often costly and labor-intensive (Caliskan et al., 2017);

  • Data set validation: in the design of a machine learning model, more attention is paid to the mathematical basis of the classifier, restricting the data formation process to a black box (Gil et al., 2016; Geiger et al., 2020);

  • Validation planning: data validation, when applied, is often performed only after the model has been trained and used, making the feedback cycle inefficient and often ineffective (Holland et al., 2018);

  • Lack of statistical rigorousness: the suitability of the data set varies depending on the task for which the data are prepared. For instance, models based on linear regression imply assumptions of normality on the measurement error (Gebru et al., 2018; Wang and Ni, 2017). This specificity is often absent in the pipeline of machine learning models.

Miss-dependency

Two-dimensional or bivariate statistics is the study of the degree to which two distinct characters of the same statistical unit are connected. However, the connection only measures the degree of statistical dependency, without implying a cause-effect relationship between the variables. For instance, it can be shown that people with small feet make more spelling mistakes than people with large feet. This statistical dependency does not indicate that having small feet is the cause of spelling mistakes; the greater frequency of spelling mistakes may in fact be due to the younger age of people with small feet. In this case there is a third variable, age, responsible for the cause-effect relationship. While in a human-centered model - where the human makes the decisions - this distinction is quite evident, in a machine learning model miss-dependency is not always detectable. This depends on two reasons: i) the machine does not recognize the meaning of the instance but looks at the properties of the variables; ii) the way in which the data are structured modifies the interpretation that the machine derives regarding the relation of statistical dependency. This means that, while in a human-centered model it is the human who verifies whether the relationships of statistical dependence detected in the available data correspond to a cause-effect relationship, in machine learning models the machine is not always able to recognize a spurious connection, erroneously assigning a causal relationship to two or more variables. In other words, the structure of the available data is responsible for the success or failure relationships established with the protected attributes (ethnicity, gender, etc.) in the data. In addition, the rapid growth and spread of current machine learning systems is due in part to the ease of design of the models themselves, which, thanks to modern software, allows the construction of predictive models without the understanding and adoption of rigorous statistical analysis. This simplicity of design has created a gap between predictive and analytical-explicative power, favoring confusion between causality and statistical dependence. The distinction between statistical dependence and causal dependence in data is therefore a primary issue in machine learning models, especially in order to determine the causes of failure, the potential biases encoded in the data and the reliability of the application.
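To make the point concrete, the following minimal simulation (an illustrative sketch of ours, not part of the annotation system; all variable names are invented for the example) reproduces the feet/spelling situation: the two characters are statistically connected only through the confounding variable age, and the connection disappears once age is held fixed.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Age is the confounder: both foot size and spelling mistakes depend on it,
# but not on each other.
age = rng.integers(6, 18, size=5000)
foot_size = 15 + 0.8 * age + rng.normal(0, 1.0, size=5000)
mistakes = 30 - 1.5 * age + rng.normal(0, 2.0, size=5000)

df = pd.DataFrame({"age": age, "foot_size": foot_size, "mistakes": mistakes})

# Marginal association: strongly negative, and a model fed only these two
# columns would happily "learn" it.
print("marginal corr:", df["foot_size"].corr(df["mistakes"]).round(2))

# Conditioning on the confounder: within each age group the association vanishes.
within_age = df.groupby("age").apply(lambda g: g["foot_size"].corr(g["mistakes"]))
print("mean within-age corr:", within_age.mean().round(2))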
Based on the problems highlighted, our contribution aims to answer the following research questions:

  • Is it possible to establish the probability of composition of the training data from the available data set?

  • Do the available data commonly used by the machine learning community present a future discriminatory risk based on their structure?

2. Background

When machine learning model decisions are based on historical records, they tend to embed distortions that exist in reality and to crystallize them. Prejudices and human biases therefore become part of the technology itself. This is particularly evident with regard to ethnic discrimination. Over the last years, the rise of machine learning models in various sectors has led to a dramatic increase of discriminatory outcomes for ethnic minorities across different fields of application. A striking and well known case is the COMPAS software, used in U.S. courts to estimate the probability of defendants’ recidivism, which has been shown to underestimate the risk of recidivism for white defendants and overestimate it for black defendants (Harry Thornburg Larson et al., 2016). However, the COMPAS case is not an isolated phenomenon. In a 2017 experiment conducted on the Airbnb platform, applications from guests with typically African American names were found to be 16% less likely to be accepted than identical guests with typically white names (Edelman et al., 2017). Also in 2017, a geo-statistical analysis revealed that the design of the popular Pokémon GO game strengthens existing geographical prejudices, for example by benefiting urban areas and neighborhoods with smaller minority populations, economically disadvantaging ethnic minority areas (Colley et al., 2017). Several studies have demonstrated the discriminatory potential of targeted advertising (Speicher et al., 2018), (Song, 2020), which is only recently receiving interventions to remove the prejudicial content of the models. For example, Facebook, after years of scandals related to ads that exclude people based on race (Angwin and Parris, 2016), has finally removed the racial targeting option for ads (Kukura, 2020). In a 2019 study, a commercial algorithm widely used in the U.S. health care system to guide health care decisions was found to discriminate against black patients (Obermeyer et al., 2019). The algorithm falsely assigned a healthier condition to black patients despite their risk of complications being the same as for white patients, making black people less likely to receive additional financial resources for extra care. Although facial recognition technologies are now used in several domains, they still present many discriminatory issues related to differences in margins of error - software generally has a 20% higher recognition error margin for black women (Raji et al., 2020). As an example, we report what happened recently with Google Vision AI, a computer vision service for image labeling (Kayser-Bril, 2020). When the system was provided with two images of people holding a body temperature thermometer, it labeled the image containing the white person as an "electronic device", while in the image containing the black person the device held was labeled as a "gun". In a later experiment it was shown that it was sufficient to apply a pink mask on the black person’s hand for the software to label the image as "tool". Racial bias encoded in machine learning systems is likely to spread silently and like wildfire in everyday technologies. The increasing and ubiquitous spread of such models, also intended to make allocative decisions about people’s lives, makes the problem of prejudice and racial discrimination more urgent than ever. For this reason, and for the historical moment we are experiencing, our work intends to focus on racial discrimination in data.

3. Motivating Example

Given a population composed of 60% Caucasians, 35% black people and 15% Asian people, the probability of positive outcome for the respective ethnic groups is 70% for Caucasians, 20% for Blacks and 60% for Asians. What is the probability of failure with respect to the protected attribute Ethnicity?

In this example the probabilities are given rather than the group sizes in order to simplify the following notation. To offer a better understanding of the Methodology, this data will be used in Section 4. The data gives the probability of success, but similar reasoning is also valid for cases where the probability of failure is known. The intent is to verify whether the probabilities of success or failure of a subgroup are influenced by group membership - and vice versa - and more specifically how these probabilities affect the composition of the training set.
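As a check on this setup - using the notation that will be introduced in Section 4, where the target variable Y takes value 1 for a negative outcome and X denotes ethnicity - the overall failure probability implied by these figures follows from the law of total probability:

P(Y=1) = \sum_{x} P(Y=1 \mid X=x)\,P(X=x) = 0.3 \cdot 0.6 + 0.8 \cdot 0.35 + 0.4 \cdot 0.15 = 0.52

This is the value reported for P(Y = 1) in Table 2.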

4. Methodology

Our data annotation system is based on four modules:

  • Dependence: assesses the degree of connection between the protected attribute - in our study, ethnicity - and the target variable;

  • Diverseness: provides the training diversification probability with respect to each level of the protected attribute and the target variable;

  • Inclusiveness: provides the probability that two properties are simultaneously included in the training set;

  • Training Likelihood: provides the occurrence likelihood of the protected attribute levels given the target variable levels - and vice versa - before the training set is sampled.

4.1. Quantifying Dependence

Excluding some specific domains where the dependence of some protected attributes on the response variable is not considered problematic, but is rather fundamental for the understanding of a certain problem (for example the gender attribute in the medical field for the detection of particular diseases (Dahiwade et al., 2019)), in the broad field of machine learning systems the dependence between the protected attribute and the response variable has caused severe consequences (O’Neil, 2016; Noble, 2018). The dependence between the protected attribute and the response variable is therefore one of the major causes of discrimination and as such must be rigorously examined. The first step for a correct bias detection within the data is the dependency analysis between the different modalities of a protected attribute and the response variable. In statistics, the measurement of the degree of dependence of two qualitative variables is called contingency; contingency measures the degree of connection of two categorical variables. To determine the degree of connection, the marginal frequencies and the combined frequencies of the bivariate table are used. Given two categorical variables X and Y, the dependency or independence is established through the theoretical independence table once the table of the observed real data is given. The contingency is therefore given by the difference between the observed and theoretical frequencies:

c_{ij} = n_{ij} - \hat{n}_{ij} \qquad (1)

where n_{ij} is the observed frequency of cell (i,j) and \hat{n}_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} is the frequency expected under independence.

If the table of the observed real data and the theoretical table of independence coincide - that is, if for each cell the contingency c_{ij} is null - then the two variables are independent. Otherwise, it is necessary to measure the degree of connection between the variables. The degree of connection between two categorical variables is commonly measured by the Pearson connection index, obtained as the sum of the relative quadratic contingencies. The index assumes a value of zero in case of independence in distribution and increases as the degree of connection between variables increases:

\chi^2 = \sum_{i}\sum_{j} \frac{c_{ij}^2}{\hat{n}_{ij}} = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} \qquad (2)

In order to support Pearson’s connection index, the contingency coefficient is adopted with the purpose of rescaling the \chi^2 into the range [0;1]:

C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \qquad (3)

However, the effect size of the degree of connection between two categorical variables is not always easy to interpret, where by effect size we mean a quantitative measure of the magnitude of a phenomenon. To offer a better understanding of the relationship of dependency between two variables, several simplified methods of interpretation have been proposed, especially to guide social scientists in the interpretation of statistical test results. In the spirit of simplifying the interpretation of the dependency between the response variable and the protected categories for a data set user, we introduce the concept of the Effect Size Index w (ES w):

w = \sqrt{\sum_{i=1}^{m} \frac{(p_{1i} - p_{0i})^2}{p_{0i}}} \qquad (4)

where p_{0i} and p_{1i} are the proportions of the ith cell under the hypothesis of independence and in the observed data, respectively. Notice that, unlike the contingency coefficient, the ES w is not derived from frequencies but from proportions. The relationship between the Pearson connection index, the contingency coefficient and the ES Index is given by the following formula:

\chi^2 = n\,w^2, \qquad C = \sqrt{\frac{w^2}{1 + w^2}} \qquad (5)

As an alternative to Formula 4, it is also possible to calculate the ES w from the contingency coefficient:

w = \sqrt{\frac{C^2}{1 - C^2}} \qquad (6)

The size of the ES w between two variables is then evaluated through the use of Table 1, which relates the magnitude of the ES with a nominal label.

Magnitude Value
SMALL w = 0.1
MEDIUM w = 0.3
LARGE w = 0.5
Table 1. Conventional definitions of Effect Size Index w magnitude

The advantage of using the conventional conversion table for the user of the data set is that the magnitude of the dependency is displayed quickly and immediately without the need for more complex statistical tests.
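As an illustrative sketch (not the authors’ released code; the column names ethnicity and outcome are placeholders, and the nominal labels interpret the conventional values of Table 1 as thresholds), the dependence module can be computed from an available data set as follows:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def dependence_module(df: pd.DataFrame, protected: str = "ethnicity",
                      target: str = "outcome") -> dict:
    """Chi-square contingency, contingency coefficient C (Formula 3) and
    Effect Size Index w (Formula 6) for two categorical columns."""
    observed = pd.crosstab(df[protected], df[target])      # observed frequencies
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.to_numpy().sum()
    C = np.sqrt(chi2 / (chi2 + n))                          # contingency coefficient
    w = np.sqrt(C**2 / (1 - C**2))                          # equivalently sqrt(chi2 / n)
    if w >= 0.5:
        magnitude = "LARGE"
    elif w >= 0.3:
        magnitude = "MEDIUM"
    elif w >= 0.1:
        magnitude = "SMALL"
    else:
        magnitude = "VERY SMALL"
    return {"chi2": chi2, "contingency_coefficient": C,
            "effect_size_w": w, "magnitude": magnitude}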

4.2. Estimating Diverseness

Intuitively, the probability of an event represents how likely the event is to occur. According to the classical definition, the probability is given by the following ratio:

P(E) = \frac{\text{number of favorable cases}}{\text{number of possible cases}} \qquad (7)

We now apply this elementary theory to the problem of data collection in machine learning. When the data set is partitioned into training and test sets, a split with a more or less standard ratio (70/30 or 80/20) is generally performed, i.e. a sampling is performed on the available data. Let us consider the case in which the training data set is generated by random sampling on the original data set without further techniques (stratification or re-sampling) - for example in the case of a non-expert user. The probability that an event occurs then turns into the probability that the training set shows some existing property contained in the original data set:

P(\text{training set shows property } A) = \frac{\text{number of records in the available data with property } A}{\text{total number of records in the available data}} \qquad (8)

In our data annotation this ratio is introduced to allow the data set user to answer questions like: "If I perform a random sampling on the original data set, what is the probability that the training set is mainly composed of positive examples? What is the probability of belonging to a certain group with respect to the target variable?"

Prior Probabilities

The a priori probability of a data property is the degree of belief in the property in the absence of other information, also known as the unconditional probability. The degree of belief is the probability of a property being true in an uncertain environment. The probability refers to the belief and not to the truth of the fact, as it is not possible for the user to know the truth exactly, that is, whether the original data are representative of the real world. Since the user does not have access to the complete information, several hypotheses on how the real data are structured have to be drawn, assigning to each of them a probability of being true. Formally:

P(Y = y) = \frac{n_{Y=y}}{n}, \qquad P(X = x) = \frac{n_{X=x}}{n} \qquad (9)

We estimate the prior probabilities by using the data of the problem introduced in Section 3, where the target variable assumes value 1 in case of negative outcome and 0 otherwise.

Formula Probability
P(Y = 0) = 0.48
P(Y = 1) = 0.52
P(X = Caucasian) = 0.6
P(X = Black) = 0.35
P(X = Asian) = 0.15
Table 2. Example of prior probabilities

In this specific case, the prior probabilities indicate that the training set has probability 0.48 of being composed of individuals who display a positive outcome and 0.52 of being composed of individuals who display a negative outcome; finally, the probabilities that it is formed by individuals of white, black and Asian ethnicity are respectively 0.6, 0.35 and 0.15 (Table 2).
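A minimal sketch of this estimate, assuming the available data are loaded in a data frame with the placeholder columns ethnicity and outcome: the prior probabilities are simply the relative frequencies of each level in the available data, i.e. the composition one should expect from a purely random training split.

import pandas as pd

def diverseness_module(df: pd.DataFrame, protected: str = "ethnicity",
                       target: str = "outcome") -> dict:
    """Prior (unconditional) probabilities of each target and protected-attribute
    level, estimated as relative frequencies of the available data (Formula 9)."""
    return {
        "target_priors": df[target].value_counts(normalize=True).to_dict(),
        "protected_priors": df[protected].value_counts(normalize=True).to_dict(),
    }

# Usage, assuming `data` holds the available data set:
# priors = diverseness_module(data)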

4.3. Estimating Inclusiveness

Posterior Probabilities

Given two events A and B, the probability P(A \mid B) is called the posterior probability because it allows the calculation of the probability of A knowing that B occurred. In our case, the posterior probability means computing the probability that Y = y, knowing that X = x has occurred (and vice versa); in other words, the probability that the training set shows the property Y = y, knowing that the property X = x has occurred (and vice versa). We start by estimating the probability that two events occur simultaneously. From the definition of conditional probability:

P(Y = y \cap X = x) = P(Y = y \mid X = x)\,P(X = x) = P(X = x \mid Y = y)\,P(Y = y) \qquad (10)

Since, from the Compound Probability Theorem (Ross, 1996), P(Y = y \cap X = x) is equal to P(X = x \cap Y = y), i.e. the probability of both properties occurring is the same, either of the two formulas can be employed indistinctly.

Formula Probability
P(Y = 0 ∩ X = Caucasian) = 0.42
P(Y = 0 ∩ X = Black) = 0.07
P(Y = 0 ∩ X = Asian) = 0.09
P(Y = 1 ∩ X = Caucasian) = 0.18
P(Y = 1 ∩ X = Black) = 0.28
P(Y = 1 ∩ X = Asian) = 0.06
Table 3. Example of properties occurring simultaneously
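A sketch of the inclusiveness estimate, under the same placeholder column names as above: the joint probabilities of Formula 10 reduce to a cross-tabulation of the available data normalized over all cells.

import pandas as pd

def inclusiveness_module(df: pd.DataFrame, protected: str = "ethnicity",
                         target: str = "outcome") -> pd.DataFrame:
    """Probability that each (target level, protected level) pair occurs
    simultaneously, estimated as a relative frequency over the whole
    available data set (the joint probabilities of Formula 10)."""
    return pd.crosstab(df[target], df[protected], normalize="all")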

4.4. Estimating Training Likelihood

From the definition of conditional probability, we derive the Bayes Theorem for the properties of the training set:

P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{P(X = x)} \qquad (11)

In the case of binary classification, and in the case of protected attributes, we are in the presence of a partition of events. This means that the events are pairwise disjoint and that, as a whole, they are the only ones possible; i.e., if a certain property occurs, one and only one of them has certainly occurred. In other words, it is not possible for the training set to be composed of individuals who belong simultaneously to the black and white ethnic groups, or who simultaneously show a positive and a negative outcome. The union of the occurrences of the single properties is therefore the whole set of possible properties. For the properties outcome and ethnicity the generalization formulas are respectively:

P(X = x) = \sum_{y} P(X = x \mid Y = y)\,P(Y = y), \qquad P(Y = y) = \sum_{x} P(Y = y \mid X = x)\,P(X = x) \qquad (12)

By applying Formulas 10 and 12 the Bayes Theorem can be generalized for each property of the training set:

P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\,P(Y = y')}, \qquad P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\,P(X = x)}{\sum_{x'} P(Y = y \mid X = x')\,P(X = x')} \qquad (13)

The first equation in Formula 13 derives the probability of the outcome property given the ethnic property, while the second equation derives the probability of the ethnic property given the outcome property. In other words, it derives the probability of composition of the training set based on the posterior probabilities of the outcome and ethnicity properties. Once a random sampling has been carried out on the original data, the Formula answers the following questions:

  • In the sampled training set what is the probability of belonging to an ethnic group with respect to the outcome variable?

  • In the sampled training set what is the probability of obtaining a certain outcome with respect to the ethnic group?

Complementarily, the two equations can be interpreted as the probability of bias within the training set.

Formula Probability
P(Y = 0 | X = Caucasian) = 0.7
P(Y = 0 | X = Black) = 0.2
P(Y = 0 | X = Asian) = 0.6
P(Y = 1 | X = Caucasian) = 0.3
P(Y = 1 | X = Black) = 0.8
P(Y = 1 | X = Asian) = 0.4
P(X = Caucasian | Y = 1) = 0.34
P(X = Caucasian | Y = 0) = 0.87
P(X = Black | Y = 1) = 0.53
P(X = Black | Y = 0) = 0.15
P(X = Asian | Y = 1) = 0.11
P(X = Asian | Y = 0) = 0.18
Table 4. Example of posterior probabilities
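The sketch below (illustrative only, with made-up variable names) applies Formula 13 to the figures of the motivating example: for the failure event Y = 1 it reproduces, up to rounding, the posterior ethnicity probabilities reported in Table 4.

# Conditional failure probabilities P(Y=1 | X=x) and priors P(X=x) from Section 3.
p_fail_given_eth = {"Caucasian": 0.3, "Black": 0.8, "Asian": 0.4}
p_eth = {"Caucasian": 0.6, "Black": 0.35, "Asian": 0.15}

# Denominator of Formula 13: total probability of a failure, P(Y=1).
p_fail = sum(p_fail_given_eth[x] * p_eth[x] for x in p_eth)   # 0.52

# Posterior probability of each ethnicity given a failure, P(X=x | Y=1).
posterior = {x: p_fail_given_eth[x] * p_eth[x] / p_fail for x in p_eth}
print({x: round(p, 3) for x, p in posterior.items()})
# {'Caucasian': 0.346, 'Black': 0.538, 'Asian': 0.115}  -- cf. Table 4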

5. Case studies datasets

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions)¹ is a popular tool used by U.S. courts to estimate the defendants’ probability of recidivism. This dataset displays the probability of reoffending based on a two-year follow-up. The dataset has been shown to underestimate the risk of recidivism for white defendants and overestimate it for black defendants (Harry Thornburg Larson et al., 2016). ¹ Retrieved from: https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis
Drug Consumption (Fehrman et al., 2015; Fehrman et al., 2017) contains information on the consumption of 18 drugs based on personality traits and socio-economic attributes. For simplicity of analysis we assumed the consumption of Cannabis as the target variable, but the annotation of the dataset can be made for each target drug.
Adult Dataset (Kohavi and Becker, 1996) contains annual income census data from the US Census Bureau. It is commonly employed in forecasting tasks in order to predict the factors leading to an income below or above $50,000.

Property              COMPAS                  Drug Consumption          Adult Dataset
Size                  6172x9                  1885x31                   48842x15
Target variable       0 no / 1 yes            0 non user / 1 user       0 / 1
Levels of ethnicity   Asian, Black,           Asian, Black,             AIE, API, Black,
attribute             Caucasian, Hispanic,    Black/Asian, Caucasian,   Caucasian, Other
                      Native American (NA),   White/Asian,
                      Other                   White/Black, Other

AIE: American-Indian/Eskimo; API: Asian-Pac-Islander

Table 5. Summary of Datasets Prominent Properties

6. Results and Discussion

We performed the analyses that constitute our data annotation system for each of the datasets presented in Section 5. Sub-sections 6.1, 6.2, 6.3 and 6.4 report the analysis for each module - dependence, diverseness, inclusiveness and training likelihood, respectively - and contain an example of the corresponding graphic module. Figures 4 and 5 show an illustrative example of the graphical visualization for the complete notation.

6.1. Dependence

This module aims to analyze the connection relationships between the protected attribute Ethnicity and the target variable that are established and depend on the available data. For instance, for the COMPAS dataset the module highlights the dependency relationships between recidivism and different ethnic minorities. Summary results for dependence module are shown in Table 6.

                             COMPAS    Drug Consumption    Adult Dataset
Contingency coefficient      0.1413    0.1558              0.0994
Effect Size w                0.1427    0.1578              0.0999
Magnitude of Effect Size w   SMALL     SMALL               VERY SMALL
Table 6. Summary of Dependence Prominent Properties

None of the three datasets displays worrying dependency values between the protected attribute Ethnicity and the target variable, the magnitude of the Effect Size w being small or very small. However, the results of the COMPAS dataset - which is proven to contain bias - indicate that this module alone is not sufficient to reveal a latent bias risk. The degree of bias depends on the sample size and on the value of the contingency coefficient between the target variable and the protected attribute (Zhou et al., 2017). Smaller samples lead to more bias and higher variance (Zimmerman et al., 2017), and therefore the results of the dependency module must be analyzed in relation to the amount of data available. In order to facilitate the interpretation of the connection relations, we propose a graphic notation for dependence. Figure 1 shows the graphical representations of the dependency module based on different connection magnitudes.

Figure 1. Example of Dependence graphic visualization

6.2. Diverseness

This module aims to analyze the diverseness of the available data by estimating prior probabilities. They determine the probability that the training set will display a certain a priori composition based on the original data available, i.e. they show the probability of the training set composition stratified by each level of the target variable and of the protected attribute. For example, in our case study the module highlights the probability that the training set will be equally composed of ethnic minorities and ethnic majorities. Summary results for the diverseness module are shown in Table 7.

                     COMPAS   Drug Consumption   Adult Dataset
0                    0.545    0.329              0.239
1                    0.455    0.671              0.761
Caucasian            0.341    0.912              0.855
Black                0.514    0.018              0.096
Asian                0.005    0.014              -
Hispanic             0.082    -                  -
Native American      0.002    -                  -
Other                0.056    0.033              0.008
White/Black          -        0.011              -
White/Asian          -        0.011              -
Black/Asian          -        0.002              -
Amer-Indian-Eskimo   -        -                  0.010
Asian-Pac-Islander   -        -                  0.031
Table 7. Summary of Diverseness Analysis Results

In terms of target variable probabilities, the results show strong distortions for the Drug Consumption and Adult datasets, with a high probability of positive examples - i.e. examples showing a negative outcome - while the probabilities of the COMPAS dataset are quite homogeneous. Regarding the probabilities of the protected attribute ethnicity, the distortions are even more pronounced than those of the target variable, revealing a very high probability of composition for the Caucasian ethnicity in the Drug Consumption and Adult datasets. In the case of the COMPAS dataset the probabilities are indeed distorted, although not yet such as to predict, at this point of the analysis, more severe future distortions, which is why more in-depth analyses are required. Figure 2 shows the graphical representation of the diverseness module that simplifies the display of prior probabilities. The example gives the notation for a dataset where both the levels of the target variable and those of the protected attribute ethnicity are equiprobable.

Figure 2. Example of Diverseness graphic visualization

6.3. Inclusiveness

This module aims to analyze the inclusiveness of the available data by estimating the simultaneous probabilities. They determine the probability that the training set will display, pairwise, the target variable and protected attribute properties at the same time. For instance, in our case study the module highlights the probability that in the training set the property Asian appears simultaneously with the property success. Summary results for the inclusiveness module are shown in Table 8.

                       COMPAS   Drug Consumption   Adult Dataset
P(0 ∩ AIE)             -        -                  0.0006
P(0 ∩ Asian)           0.0023   0.0019             -
P(0 ∩ API)             -        -                  0.0041
P(0 ∩ Black)           0.1514   0.0023             0.0057
P(0 ∩ Black/Asian)     -        0.0000             -
P(0 ∩ Caucasian)       0.1281   0.0555             0.1061
P(0 ∩ Hispanic)        0.0320   -                  -
P(0 ∩ NA)              0.0006   -                  -
P(0 ∩ Other)           0.0219   0.0013             0.0005
P(0 ∩ White/Asian)     -        0.0004             -
P(0 ∩ White/Black)     -        0.0006             -
P(1 ∩ AIE)             -        -                  0.0042
P(1 ∩ Asian)           0.0008   0.0007             -
P(1 ∩ API)             -        -                  0.0111
P(1 ∩ Black)           0.1661   0.0010             0.0412
P(1 ∩ Black/Asian)     -        0.0003             -
P(1 ∩ Caucasian)       0.0822   0.1165             0.3115
P(1 ∩ Hispanic)        0.0189   -                  -
P(1 ∩ NA)              0.0005   -                  -
P(1 ∩ Other)           0.0124   0.0050             0.0036
P(1 ∩ White/Asian)     -        0.0016             -
P(1 ∩ White/Black)     -        0.0014             -

AIE: American-Indian/Eskimo; API: Asian-Pac-Islander; NA: Native American

Table 8. Summary of Inclusiveness Analysis Results

The results of this module show that the probability that two properties occur simultaneously is related to the sample size. Evidence of this can be found in the results of the Drug Consumption and Adult datasets, where the highest probabilities of simultaneous events involve the Caucasian property. The COMPAS dataset shows quite homogeneous probabilities, especially with regard to the Black property, while for the Caucasian property the highest probabilities are related to the simultaneous occurrence with the Non-recidivist property. Since the simultaneous probabilities depend on the number of examples within the available data and on the sample size, this result alone is not sufficient to establish a priori the certain presence of serious data distortions, although some evidence can already be seen. Figure 3 shows the graphical representation of the inclusiveness module that simplifies the display of the simultaneous probabilities. The example gives the notation for a dataset where all the properties of the target variable and those of the protected attribute ethnicity are equiprobable.

Figure 3. Example of Inclusiveness graphic visualization

6.4. Training Likelihood

This module aims to analyze the training likelihood of the data available by estimating the posterior probabilities. They determine the probability that in the training set the occurrence of the properties of the protected attribute is given by the properties of the target variable - and vice versa - . For example, in the COMPAS dataset they determine the probability that the occurrence of reoffending is given by the properties of the protected attribute ethnicity. Summary results for training likelihood module are shown in Table 9.

                     COMPAS   Drug Consumption   Adult Dataset
P(0 | AIE)           -        -                  0.117
P(0 | Asian)         0.742    0.731              -
P(0 | API)           -        -                  0.269
P(0 | Black)         0.477    0.697              0.121
P(0 | Black/Asian)   -        0.000              -
P(0 | Caucasian)     0.609    0.323              0.254
P(0 | Hispanic)      0.629    -                  -
P(0 | NA)            0.545    -                  -
P(0 | Other)         0.638    0.206              0.123
P(0 | White/Asian)   -        0.200              -
P(0 | White/Black)   -        0.300              -
P(1 | AIE)           -        -                  0.883
P(1 | Asian)         0.258    0.269              -
P(1 | API)           -        -                  0.731
P(1 | Black)         0.523    0.303              0.879
P(1 | Black/Asian)   -        1.000              -
P(1 | Caucasian)     0.391    0.677              0.746
P(1 | Hispanic)      0.371    -                  -
P(1 | NA)            0.455    -                  -
P(1 | Other)         0.362    0.794              0.877
P(1 | White/Asian)   -        0.800              -
P(1 | White/Black)   -        0.700              -
P(AIE | 0)           -        -                  0.005
P(AIE | 1)           -        -                  0.011
P(Asian | 0)         0.007    0.031              -
P(Asian | 1)         0.003    0.006              -
P(API | 0)           -        -                  0.035
P(API | 1)           -        -                  0.030
P(Black | 0)         0.450    0.037              0.048
P(Black | 1)         0.591    0.008              0.111
P(Black/Asian | 0)   -        0.000              -
P(Black/Asian | 1)   -        0.002              -
P(Caucasian | 0)     0.381    0.895              0.908
P(Caucasian | 1)     0.293    0.921              0.839
P(Hispanic | 0)      0.095    -                  -
P(Hispanic | 1)      0.067    -                  -
P(NA | 0)            0.002    -                  -
P(NA | 1)            0.002    -                  -
P(Other | 0)         0.065    0.021              0.004
P(Other | 1)         0.044    0.040              0.010
P(White/Asian | 0)   -        0.006              -
P(White/Asian | 1)   -        0.013              -
P(White/Black | 0)   -        0.010              -
P(White/Black | 1)   -        0.011              -

AIE: American-Indian/Eskimo; API: Asian-Pac-Islander; NA: Native American

Table 9. Summary of Training Likelihood Analysis Results

The results of this module show that the posterior probabilities of the target variable and of the protected attribute ethnicity are quite skewed in all datasets. In the case of the Adult dataset, given as occurred event 0 or event 1, the probability of occurrence of the Caucasian ethnic group is respectively 0.908 and 0.839 - i.e. very high for both events - while the probabilities of all the other ethnic groups conditioned on the target variable are significantly lower; this means that the original data contain many examples of individuals belonging to the Caucasian ethnic group. In the case of Drug Consumption, a similar reasoning can be carried out for the ethnicity probabilities conditioned on the target variable; moreover, notice that given the property Black/Asian, the probability of occurrence of event 1, i.e. that the individual is a consumer, is 1 - while the probability of 0 is 0 - which means that in the available data there are no examples of individuals belonging to the ethnic group Black/Asian showing a positive outcome, i.e. no negative examples. Figures 4 and 5 show the graphical visualization of our data annotation system for the COMPAS dataset.

Figure 4. Data annotation visualization for COMPAS dataset
Figure 5. Data annotation visualization for COMPAS dataset

The analysis of the COMPAS dataset shows that if an individual is randomly sampled from the original data for the training set, the probability that this individual is black knowing that the re-offending property has occurred - i.e. knowing the outcome of the re-offending event - is 0.591, while the probability that the individual is white knowing that the re-offending property has occurred is 0.293. Conversely, given as occurred the property Black, the probability that the individual has not reoffended is 0.477, while the probability that the individual has reoffended is 0.523; given the property Caucasian, the probability that the individual has not reoffended is 0.609, while the probability that the individual has reoffended is 0.391, which is significantly lower. This means that in this dataset reoffending is related to ethnicity, and that success or failure are determined by membership in a specific ethnic group. The differences in probability between the properties highlight the risk of future bias, and in the case of the COMPAS dataset they anticipate the underestimation of recidivism for the Caucasian ethnic group and the overestimation of recidivism for the Black ethnic group proven in recent studies (Harry Thornburg Larson et al., 2016).

6.5. Final Remarks

  • in traditional sampling practices, instead of observing all the units of a population, only a subset of the population is observed, and this subset must show certain probabilistic characteristics. In machine learning models the training set is sampled not from the real population but from the available data. While in classical sampling the empirical knowledge is effectively of a sample nature, in machine learning systems the available data are often of a sample nature too, precisely because it is not possible to make assumptions about the real population. Considering a random sampling from the available data, we have shown that the probability of composition of the training set can be predicted, highlighting that the structure of the data directly affects the probability distribution of its properties;

  • we analyzed three datasets frequently used by the machine learning community. All three showed more or less pronounced distortions for the protected attribute Ethnicity. Although the COMPAS dataset is the only one that has been shown to discriminate against black people, the Drug Consumption and Adult datasets reveal possible future bias to the detriment of ethnic minorities.

7. Related Work

Although there are a number of papers that deal with data annotation for ethical purposes, they are all very recent, indicating that this field of study is still only partially explored and has only recently received considerable attention. Our contribution differs from the others because it induces a probabilistic reasoning on the causes of model discrimination rooted in sampling problems; our intention is to deepen the knowledge of data validation analysis, focusing on the meaning of probabilities. From a graphical point of view, our work has been inspired by the Data Nutrition Labels (Holland et al., 2018), a data labeling system mainly based on descriptive data statistics. A similar approach is addressed in (Beretta et al., 2019), where an operational framework is proposed to identify the bias risks of automatic decision systems. In (Gebru et al., 2018) the authors propose a data labeling system based on discursive data sheets. In (Chang et al., 2017) the authors propose a collaborative crowdsourcing system to improve the quality of the labels.

Since ethically oriented data annotation represents a quite new field of study, there are several works that provide different types of labels. We believe that at present the focus should not be on achieving a unified data annotation system in the short term, but rather on the fact that the fair machine learning community is working together to focus attention on the data collection problem, especially because awareness of data issues is often not rooted outside of this community. It is important that this field and this work inspire greater awareness of the possible causes of discrimination arising from the fundamental ingredient that all users and designers of machine learning systems (from the most to the least experienced) rely on: data.

8. Conclusions

The purpose of the current study was to detect the potential risk of racial discrimination in future machine learning systems by providing a data annotation system based on Bayesian inference. Our notation serves as a diagnostic framework to immediately visualize data appropriateness and potential bias occurring when sampling the training set from an available dataset. The investigation of the probabilities of the training set sampling has shown that it is possible to establish a risk of future bias by observing the prior and posterior probabilities of the ethnicity and target variable properties. The empirical findings in this study provide a new perspective on data annotation practices by showing that Bayesian inference may reveal the risk of bias in three different widespread datasets. Furthermore, this study has raised important questions about awareness of the most widely used data sampling practices in the machine learning community. The findings of this investigation complement those of earlier studies. Our data annotation system is limited to the binary case and to the analysis of categorical variables for classification tasks; this would be a fruitful area for further work. Our intent is to expand the work in the following directions: i) extend the notation to multiple protected attributes - the probabilities of the training set will then be given by the vectors of the protected attribute combinations; ii) extend the notation to the non-binary case - for prediction tasks involving regression analysis, for example; iii) extend the probabilistic notation to non-labeled data.

References

  • Albarghouthi and Vinitsky (2019) Aws Albarghouthi and Samuel Vinitsky. 2019. Fairness-Aware Programming. In Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 211–219. https://doi.org/10.1145/3287560.3287588
  • Angwin and Parris (2016) Julia Angwin and Terry Jr. Parris. 2016. Facebook Lets Advertisers Exclude Users by Race. ProPublica. Retrieved September 12, 2020 from https://www.propublica.org/article/facebook-lets-advertisers-exclude-users-by-race
  • Asudeh et al. (2019) Abolfazl Asudeh, Zhongjun Jin, and Hosagrahar Visvesvaraya Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, New Jersey, US, 554–565.
  • Barocas et al. (2018) Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2018. Fairness and Machine Learning. http://www.fairmlbook.org.
  • Benjamin (2019) Ruha Benjamin. 2019. Assessing risk, automating racism. Science 366, 6464 (2019), 421–422. https://doi.org/10.1126/science.aaz3873 arXiv:https://science.sciencemag.org/content/366/6464/421.full.pdf
  • Beretta et al. (2019) Elena Beretta, Antonio Santangelo, Bruno Lepri, Antonio Vetró, and Juan Carlos De Martin. 2019. The Invisible Power of Fairness. How Machine Learning Shapes Democracy. In Advances in Artificial Intelligence, Proceedings of the 32nd Canadian Conference on Artificial Intelligence, Canadian AI 2019 (Kingston, ON, Canada), Marie-Jean Meurs and Frank Rudzicz (Eds.), Vol. 11489. Springer, Cham, Germany, 238–250. https://doi.org/10.1007/978-3-030-18305-9_19
  • Berk et al. (2018) Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2018. Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociological Methods & Research 50, 1 (2018), 3–44. https://doi.org/10.1177/0049124118782533
  • Caliskan et al. (2017) Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186. https://doi.org/10.1126/science.aal4230 arXiv:https://science.sciencemag.org/content/356/6334/183.full.pdf
  • Chang et al. (2017) Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 2334–2346. https://doi.org/10.1145/3025453.3026044
  • Colley et al. (2017) Ashley Colley, Jacob Thebault-Spieker, Allen Yilun Lin, Donald Degraen, Benjamin Fischman, Jonna Häkkilä, Kate Kuehl, Valentina Nisi, Nuno Jardim Nunes, Nina Wenig, Dirk Wenig, Brent Hecht, and Johannes Schöning. 2017. The Geography of PokéMon GO: Beneficial and Problematic Effects on Places and Movement. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 1179–1192. https://doi.org/10.1145/3025453.3025495
  • Dahiwade et al. (2019) D. Dahiwade, G. Patle, and E. Meshram. 2019. Designing Disease Prediction Model Using Machine Learning Approach. 1211–1215.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (Cambridge, Massachusetts) (ITCS ’12). Association for Computing Machinery, New York, NY, USA, 214–226. https://doi.org/10.1145/2090236.2090255
  • Edelman et al. (2017) Benjamin Edelman, Michael Luca, and Dan Svirsky. 2017. Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment. American Economic Journal: Applied Economics 9, 2 (April 2017), 1–22. https://doi.org/10.1257/app.20160213
  • Eubanks (2018) Virginia Eubanks. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press, Inc., USA.
  • Fehrman et al. (2015) Elaine Fehrman, Vincent Egan, and Evgeny M. Mirkes. 2015. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
  • Fehrman et al. (2017) Elaine Fehrman, Awaz K. Muhammad, Evgeny M. Mirkes, Vincent Egan, and Alexander N. Gorban. 2017. The Five Factor Model of Personality and Evaluation of Drug Consumption Risk. 231–242. https://doi.org/10.1007/978-3-319-55723-6_18
  • Gebru et al. (2018) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. arXiv:1803.09010
  • Geiger et al. (2020) R. Stuart Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, and Jenny Huang. 2020. Garbage in, Garbage out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 325–336. https://doi.org/10.1145/3351095.3372862
  • Gil et al. (2016) Yolanda Gil, Cédric H. David, Ibrahim Demir, Bakinam T. Essawy, Robinson W. Fulweiler, Jonathan L. Goodall, Leif Karlstrom, Huikyo Lee, Heath J. Mills, Ji-Hyun Oh, Suzanne A. Pierce, Allen Pope, Mimi W. Tzeng, Sandra R. Villamizar, and Xuan Yu. 2016. Toward the Geoscience Paper of the Future: Best practices for documenting and sharing research from data to software to provenance. Earth and Space Science 3, 10 (2016), 388–415. https://doi.org/10.1002/2015EA000136 arXiv:https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1002/2015EA000136
  • Hardt et al. (2016) Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (Barcelona, Spain) (NIPS’16). Curran Associates Inc., Red Hook, NY, USA, 3323–3331.
  • Harry Thornburg Larson et al. (2016) Jeff Harry Thornburg Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. ProPublica. Retrieved September 2, 2020 from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  • Holland et al. (2018) Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. CoRR abs/1805.03677 (2018), 21 pages. arXiv:1805.03677 http://arxiv.org/abs/1805.03677
  • Indira and Kavithadevi (2019) K. Indira and M. K. Kavithadevi. 2019. Efficient Machine Learning Model for Movie Recommender Systems Using Multi-Cloud Environment. Mobile Networks and Applications 24, 6 (2019), 1872–1882. https://doi.org/10.1007/s11036-019-01387-4
  • Jo and Gebru (2020) Eun Seo Jo and Timnit Gebru. 2020. Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 306–316. https://doi.org/10.1145/3351095.3372829
  • Kanoje et al. (2016) Sumitkumar Kanoje, Debajyoti Mukhopadhyay, and Sheetal Girase. 2016. User Profiling for University Recommender System Using Automatic Information Retrieval. In Procedia Computer Science, Vol. 78. Elsevier, Netherlands, 5–12. https://doi.org/10.1016/j.procs.2016.02.002 1st International Conference on Information Security & Privacy 2015.
  • Kayser-Bril (2020) Nicolas Kayser-Bril. 2020. Google apologizes after its Vision AI produced racist results. AlgorithmWatch. Retrieved August 17, 2020 from https://algorithmwatch.org/en/story/google-vision-racism/
  • Kleinberg (2018) Jon Kleinberg. 2018. Inherent Trade-Offs in Algorithmic Fairness. In Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems (Irvine, CA, USA) (SIGMETRICS ’18). ACM Press, New York, NY, USA, 40–40. https://doi.org/10.1145/3219617.3219634
  • Kohavi and Becker (1996) Ronny Kohavi and Barry Becker. 1996. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
  • Kukura (2020) Joe Kukura. 2020. Facebook (Finally) Removes Racial Ad Targeting. SFist. Retrieved September 12, 2020 from https://sfist.com/2020/08/31/facebook-finally-removes-racial-ad-targeting/
  • Marda and Narayan (2020) Vidushi Marda and Shivangi Narayan. 2020. Data in New Delhi’s Predictive Policing System. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 317–324. https://doi.org/10.1145/3351095.3372865
  • McDuff et al. (2018) Daniel McDuff, Roger Cheng, and Ashish Kapoor. 2018. Identifying Bias in AI using Simulation. arXiv:arXiv:1810.00471
  • Noble (2018) Safiya Umoja Noble. 2018. Algorithms of oppression: How search engines reinforce racism. NYU Press, New York, NY, USA.
  • Obermeyer et al. (2019) Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 6464 (2019), 447–453. https://doi.org/10.1126/science.aax2342 arXiv:https://science.sciencemag.org/content/366/6464/447.full.pdf
  • O’Neil (2016) Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, New York.
  • Oyebode and Orji (2020) Oladapo Oyebode and Rita Orji. 2020. A hybrid recommender system for product sales in a banking environment. , 11 pages. https://doi.org/10.1007/s42786-019-00014-w
  • Raji et al. (2020) Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (New York, NY, USA) (AIES ’20). Association for Computing Machinery, New York, NY, USA, 145–151. https://doi.org/10.1145/3375627.3375820
  • Ross (1996) Sheldon M Ross. 1996. Stochastic processes. Wiley, New Jersey, US. https://books.google.de/books?id=ImUPAQAAMAAJ
  • Schedl et al. (2018) Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116. https://doi.org/10.1007/s13735-018-0154-2
  • Siting et al. (2012) Z. Siting, H. Wenxing, Z. Ning, and Y. Fan. 2012. Job recommender systems: A survey. In 2012 7th International Conference on Computer Science Education (ICCSE). IEEE Xplore Digital Library, New York, 920–924.
  • Song (2020) Lin Song. 2020. Two-Sided Price Discrimination by Media Platforms. Marketing Science 39, 2 (2020), 317–338. https://doi.org/10.1287/mksc.2019.1211 arXiv:https://doi.org/10.1287/mksc.2019.1211
  • Speicher et al. (2018) Till Speicher, Muhammad Ali, Giridhari Venkatadri, Filipe Nunes Ribeiro, George Arvanitakis, Fabrício Benevenuto, Krishna P. Gummadi, Patrick Loiseau, and Alan Mislove. 2018. Potential for Discrimination in Online Targeted Advertising. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, New York, NY, USA, 5–19. http://proceedings.mlr.press/v81/speicher18a.html
  • Tommasi et al. (2017) Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. 2017. A Deeper Look at Dataset Bias. In G. Csurka (Ed.), Domain Adaptation in Computer Vision Applications. Advances in Computer Vision and Pattern Recognition. Springer, Cham, Switzerland, 37–55. https://doi.org/10.1007/978-3-319-58347-1_2
  • Wang and Ni (2017) Yan Wang and Xuelei Sherry Ni. 2017. Predicting Class-Imbalanced Business Risk Using Resampling, Regularization, and Model Ensembling Algorithms. International Journal of Managing Information Technology (IJMIT) 11, 1 (2017), 15 pages. https://ssrn.com/abstract=3366806
  • Williams et al. (2018) Betsy A. Williams, Catherine F. Brooks, and Yotam Shmargad. 2018. How Algorithms Discriminate Based on Data They Lack: Challenges, Solutions, and Policy Implications. Journal of Information Policy 8 (2018), 78–115. https://www.jstor.org/stable/10.5325/jinfopoli.8.2018.0078
  • Zhou et al. (2017) Yao Zhou, M. Isabel Vales, Aoxue Wang, and Zhiwu Zhang. 2017. Systematic bias of correlation coefficient may explain negative accuracy of genomic prediction. Briefings Bioinform 18, 5 (2017), 744–753. https://doi.org/10.1093/bib/bbw064
  • Zimmerman et al. (2017) Donald W. Zimmerman, Bruno D. Zumbo, and Richard H. Williams. 2017. Bias in estimation and hypothesis testing of correlation. Psicológica 24, 1 (2017), 133–158.