1. Introduction
Over the last decades, machine learning systems have spread widely across academic domains, and many public and private sectors increasingly rely on them. Their pervasiveness is mainly driven by the exponential growth of computational power and the extensive availability of large amounts of data (Kleinberg, 2018). Supervised machine learning models, in particular, are now deeply rooted in different sectors thanks to their versatility. The predictive ability of supervised machine learning systems is deployed in disparate areas of application: credit reliability (Kleinberg, 2018), the justice system (Angwin and Parris, 2016; Berk et al., 2018), job recommendations (Siting et al., 2012), university selection processes (Kanoje et al., 2016), cultural content (Schedl et al., 2018; Indira and Kavithadevi, 2019) and purchase recommendations (Oyebode and Orji, 2020). The key ingredient that supervised machine learning models have in common is the availability of a set of labeled data used to train the model in elaborating a response related to past events (Gebru et al., 2018).
Since the known properties of the available data are used to create a classifier that makes predictions about new entities of the same type, the structure, properties and quality of the data largely and directly influence the quality of the model and of the results it produces (O'Neil, 2016; Albarghouthi and Vinitsky, 2019). Although data-driven decision models have been shown to produce both economic and social benefits, many researchers have highlighted several problems and harms related to their use in different areas, especially when they are built on partial or incomplete data (Hardt et al., 2016; Dwork et al., 2012). As a matter of fact, in recent years several studies have found a convergence of issues related to the ethics and transparency of these systems in the process of data collection and in the way data are recorded (Marda and Narayan, 2020). While rigorous data collection and analysis are fundamental to the design of the model, this step is still largely overlooked by the machine learning community (Beretta et al., 2019; Jo and Gebru, 2020). As the practice of removing protected attributes from available data has been shown to potentially exacerbate further discrimination (Williams et al., 2018), making bias even more difficult to detect, practices related to data collection, data transparency and data explainability become even more relevant and urgent. The aim of our work is to provide a data annotation system that serves as a diagnostic framework containing immediate information about data appropriateness, in order to more accurately assess the quality of the data used in training models. We propose a data annotation method based on Bayesian statistical inference that aims to warn of the risk of discriminatory results of a given data set. In particular, our method aims to deepen the statistical knowledge related to the information contained in the available data, and to promote awareness of the sampling practices used to create the training set, highlighting that the probability of a discriminatory result is strongly influenced by the structure of the available data.
We test our data annotation system on three datasets widely used in the machine learning community: the COMPAS dataset (Larson et al., 2016), the Drug Consumption dataset (Fehrman et al., 2015; Fehrman et al., 2017) and the Adult dataset (Kohavi and Becker, 1996).
1.1. Problem Statement
The majority of machine learning systems are based on historical data processing (Noble, 2018). This is particularly true for supervised machine learning models. Several studies have shown evidence that many equity and discrimination issues are due to input data properties (Benjamin, 2019). Most of today's data sets used to train models are chosen through non-probabilistic methods, generating problems of data imbalance and representativeness (Eubanks, 2018; Noble, 2018). This means that different fractions of the population do not have the same opportunity to be represented within the sample (i.e., the training set), leading some groups of individuals to have a lower probability of being represented. Commonly observed effects of bad sampling are the underestimation and overestimation of some groups (Barocas et al., 2018). Undetected distortions in data may also easily produce spurious statistical noise. This happens when the data structure induces dependence between two variables that are not linked by a real cause-effect relationship.
Data Sampling
A key moment in the pipeline of a machine learning model is when the programmed algorithm is supplied with training data representing the entities on which the model itself trains its knowledge to make predictions. The quality of the data used in this phase is fundamental for the desired result, according to the principle of "garbage in, garbage out": even the most sophisticated models can present distorted results in the presence of low-quality data (Tommasi et al., 2017). One of the main causes of data distortion is the way the data are selected and provided to the algorithm, displaying problems related to inaccuracy, lack of update or inadequate representativeness. However, while knowledge of bias typologies has proliferated over the years, less attention is paid to issues concerning data collection, notation and sampling (McDuff et al., 2018). In the spirit of fostering a broader awareness of data handling, we provide a reasoned list of issues that may arise during this phase:
Data selection: the large proliferation of available data sets on the same kind of problem to be analyzed makes the a priori choice of a given data set hard (Gebru et al., 2018);

Inadequate sampling methods: most models are trained with data sets that have been "found" and not subjected to probabilistic sampling methods, leading to limited or no data control (Asudeh et al., 2019);

Cost and time limits: collecting large amounts of data that present proportional representations of each property with respect to a sensitive attribute is time-consuming and often costly and labor-intensive (Caliskan et al., 2017);

Validation planning: data validation, when applied, is often performed only after the model has been trained and used, making the feedback cycle inefficient and often ineffective (Holland et al., 2018);

Lack of statistical rigorousness: the suitability of the data set varies depending on the task for which the data are prepared. For instance, models based on linear regression imply assumptions of normality on the measurement error (Gebru et al., 2018; Wang and Ni, 2017). This specificity is often absent in the pipeline of machine learning models.
Miss-dependency
Two-dimensional or bivariate statistics is the study of the degree to which two distinct characters of the same statistical unit are connected. However, the connection only measures the degree of statistical dependency, without implying a cause-effect relationship between the variables. For instance, it can be shown that people with small feet make more spelling mistakes than people with large feet. However, this statistical dependency does not indicate that having small feet is the cause of spelling mistakes; the greater frequency of spelling mistakes may in fact be due to the younger age of people with small feet. In this case there could be a third variable, age, responsible for the cause-effect relationship. While in a human-centered model (where the human makes the decisions) this distinction is quite evident, in a machine learning model miss-dependency is not always deducible. This depends on two reasons: i) the machine does not recognize the meaning of the instance but looks at the properties of the variables; ii) the way in which the data are structured modifies the machine's interpretation of the relation of statistical dependency. This means that, while in a human-centered model it is the human who verifies that the relationships of statistical dependence detected in the available data lead to a cause-effect relationship, in machine learning models the machine is not always able to recognize a spurious connection, erroneously assigning a cause-effect relationship to two or more variables. In other words, the structure of the available data is responsible for the successful or failed relationships established with the protected attributes (ethnicity, gender, etc.) in the data.
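The spurious dependency described above can be made concrete with a short simulation. The sketch below uses synthetic data and illustrative variable names of our own choosing: foot size and spelling mistakes are both driven by age, so they appear strongly connected, yet the partial correlation controlling for age is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: age drives both foot size and spelling mistakes.
age = rng.uniform(6, 40, size=5000)
foot_size = 15.0 + 0.3 * age + rng.normal(0, 1, 5000)   # grows with age
mistakes = 40.0 - age + rng.normal(0, 2, 5000)          # shrinks with age

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Marginal connection: people with small feet make more mistakes.
r_fm = corr(foot_size, mistakes)

# Partial correlation controlling for the confounder age.
r_fa, r_ma = corr(foot_size, age), corr(mistakes, age)
r_partial = (r_fm - r_fa * r_ma) / np.sqrt((1 - r_fa**2) * (1 - r_ma**2))

print(f"corr(foot size, mistakes) = {r_fm:.2f}")      # strongly negative
print(f"partial corr given age    = {r_partial:.2f}")  # near zero
```

The machine, looking only at the marginal correlation, would treat foot size as predictive of spelling mistakes; the partial correlation reveals the spurious connection.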
In addition, the rapid growth and spread of current machine learning systems is due in part to the ease of design of the models themselves, which, thanks to modern software, allows the construction of predictive models while avoiding the understanding and adoption of rigorous statistical analysis. The simplicity of design has therefore created a gap between predictive and analytical-explicative power, favoring misinterpretation between causality and statistical dependence. The distinction between statistical dependence and causal dependence in data is therefore a primary issue in machine learning models, especially to determine the causes of failure, the potential biases encoded in the data and the reliability of application.
Based on the problems highlighted, our contribution aims to answer the following research questions:

Is it possible to establish the probability of composition of the training data starting from the available data set?

Do the data sets widely known to the machine learning community present a future discriminatory risk based on their structure?
2. Background
When machine learning model decisions are based on historical records, they tend to embed distortions that exist in reality and crystallize them. Prejudices and human biases therefore become part of the technology itself. This is particularly evident with regard to ethnic discrimination. Over the last years, the rise of machine learning models in various sectors has led to a dramatic increase of discriminatory outcomes for ethnic minorities across different fields of application. A striking and well-known case is the COMPAS software, used in U.S. courts to estimate the probability of defendants' recidivism, which has been shown to underestimate the risk of recidivism for white defendants and overestimate it for black defendants
(Larson et al., 2016). However, the COMPAS case is not an isolated phenomenon. In a 2017 experiment conducted on the Airbnb platform, applications from guests with typically African American names were found to be 16% less likely to be accepted than identical guests with typically white names (Edelman et al., 2017). Also in 2017, a geostatistical analysis revealed that the design of the popular Pokémon GO game strengthens existing geographical prejudices, for example by benefiting urban areas and neighborhoods with smaller minority populations, economically disadvantaging ethnic minority areas (Colley et al., 2017). Several studies have demonstrated the discriminatory potential of targeted advertising (Speicher et al., 2018; Song, 2020), which is only recently receiving interventions to remove the prejudicial content of the model. For example, Facebook, after years of scandals related to ads that exclude people based on race (Angwin and Parris, 2016), has finally removed the racial targeting option for ads (Kukura, 2020). In a 2019 study, a commercial algorithm widely used in the U.S. health care system to guide health care decisions was found to discriminate against black patients (Obermeyer et al., 2019). The algorithm falsely assigned a healthier condition to black patients despite the risk of complications being the same as for white patients, making black people less likely to receive more financial resources for extra care. Although facial recognition technologies are now used in several domains, they still present many discriminatory issues related to differences in margins of error: software generally has a 20% higher margin of recognition error for black women
(Raji et al., 2020). As an example, we report what happened recently with Google Vision AI, a computer vision service for image labeling (Kayser-Bril, 2020). When provided with two images of people holding a body temperature thermometer, the system labeled the image containing the white person as an "electronic device", while in the image containing the black person the device was labeled as a "gun". In a later experiment it was shown that applying a pink mask to the black person's hand was sufficient for the software to label the image as "tool". Racial bias encoded in machine learning systems is likely to spread silently, and like wildfire, in everyday technologies. The increasing and ubiquitous spread of such models, also intended to make allocative decisions about people's lives, makes the problem of prejudice and racial discrimination more urgent than ever. For this reason, and for the historical moment we are experiencing, our work intends to focus on racial discrimination in data.
3. Motivating Example
Given a population composed of 60% Caucasians, 35% black people and 15% Asian people, the probability of positive outcome for the respective ethnic groups is 70% for Caucasians, 20% for Blacks and 60% for Asians. What is the probability of failure with respect to the protected attribute Ethnicity?
In this example the probabilities are given rather than the numerosity in order to simplify the following notation. To offer a better understanding of the Methodology, these data will be used in Section 4. The data give the probability of success, but a similar reasoning is also valid for cases where the probability of failure is known. The intent is to verify whether the probabilities of success or failure of a subgroup are influenced by group membership (and vice versa) and, more specifically, how these probabilities affect the composition of the training set.
4. Methodology
Our data annotation system is based on four modules:

Dependence: assesses the degree of connection between the protected attribute (in our study, ethnicity) and the target variable;

Diverseness: provides the training diversification probability with respect to each level of the protected attribute and the target variable;

Inclusiveness: provides the probability that two properties are simultaneously included in the training set;

Training Likelihood: provides the occurrence likelihood of the protected attribute levels given the target variable levels (and vice versa) before the training set is sampled.
4.1. Quantifying Dependence
Excluding some specific domains where the dependence of some protected attributes on the response variable is not considered problematic, but rather fundamental to the understanding of a certain problem (for example, the gender attribute in the medical field for the detection of particular diseases (Dahiwade et al., 2019)), in the broad field of machine learning systems the dependence between the protected attribute and the response variable has caused severe consequences (O'Neil, 2016; Noble, 2018). The dependence between the protected attribute and the response variable is therefore one of the major causes of discrimination and as such must be rigorously examined. The first step for a correct bias detection within the data is given by the dependency analysis between the different modalities of a protected attribute and the response variable. In statistics, the measurement of the degree of dependence of two qualitative variables is called contingency; contingency measures the degree of connection of two categorical variables. To determine the degree of connection, the marginal frequencies and the combined frequencies of the bivariate table are used. Given two categorical variables $X$ and $Y$, the dependency or independence is established through the theoretical independence table once the table of the observed real data is given. The contingency is therefore given by the difference between the observed and theoretical frequencies:

$$c_{ij} = n_{ij} - \hat{n}_{ij}, \qquad \hat{n}_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} \tag{1}$$
If the table of the observed real data and the theoretical table of independence coincide, that is if for each cell the contingency $c_{ij}$ is null, then the two variables are independent. Otherwise, it is necessary to measure the degree of connection between the variables. The degree of connection between two categorical variables is commonly measured by the Pearson connection index $\chi^2$, obtained as the sum of the relative quadratic contingencies. The index assumes a value of zero in case of independence in distribution and increases as the degree of connection between the variables increases:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} \tag{2}$$
In order to support Pearson's connection index, the contingency coefficient is adopted with the purpose of reducing the $\chi^2$ index to the range $[0, 1]$:

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}} \tag{3}$$
However, the effect size of the degree of connection between two categorical variables is not always easy to interpret, where by effect size we mean a quantitative measure of the magnitude of a phenomenon. To offer a better understanding of the relationship of dependency between two variables, several simplified methods of interpretation have been proposed, especially to guide social scientists in the interpretation of statistical test results. In the spirit of simplifying the interpretation of the dependency between the response variable and the protected categories for a data set user, we introduce the concept of the Effect Size Index w (ES w):
$$w = \sqrt{\sum_{i} \frac{(p_{1i} - p_{0i})^2}{p_{0i}}} \tag{4}$$

where $p_{0i}$ and $p_{1i}$ are the expected and observed proportions of the $i$-th cell. Notice that, unlike the contingency coefficient, the ES w is not derived from frequencies but from proportions. The relationship between the Pearson connection index, the contingency coefficient and the ES Index is given by the following formula:
$$w = \sqrt{\frac{\chi^2}{n}} \tag{5}$$
Alternatively to Formula 4, the ES w can also be calculated from the contingency coefficient:

$$w = \sqrt{\frac{C^2}{1 - C^2}} \tag{6}$$
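Formulas 1 through 6 can be checked numerically. The sketch below uses a hypothetical 2×3 contingency table (counts chosen purely for illustration) and verifies that the Effect Size Index computed from proportions coincides with the values obtained from $\chi^2$ and from the contingency coefficient:

```python
import numpy as np

# Hypothetical 2x3 contingency table: outcome (rows) x ethnicity (cols).
observed = np.array([[420, 70, 90],
                     [180, 280, 60]], dtype=float)

n = observed.sum()
# Theoretical independence table (Formula 1): n̂_ij = n_i. * n_.j / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Pearson connection index (Formula 2)
chi2 = ((observed - expected) ** 2 / expected).sum()

# Contingency coefficient (Formula 3)
C = np.sqrt(chi2 / (chi2 + n))

# Effect Size Index w from proportions (Formula 4)...
p_obs, p_exp = observed / n, expected / n
w = np.sqrt(((p_obs - p_exp) ** 2 / p_exp).sum())

# ...and equivalently from chi2 (Formula 5) or from C (Formula 6)
assert np.isclose(w, np.sqrt(chi2 / n))
assert np.isclose(w, np.sqrt(C**2 / (1 - C**2)))

print(f"chi2 = {chi2:.2f}, C = {C:.4f}, w = {w:.4f}")
```

The two assertions confirm that Formulas 4, 5 and 6 are three equivalent routes to the same index.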
The size of the ES w between two variables is then evaluated through the use of Table 1, which relates the magnitude of the ES with a nominal label.
| Magnitude | Value |
|---|---|
| SMALL | w = 0.1 |
| MEDIUM | w = 0.3 |
| LARGE | w = 0.5 |
The advantage of using the conventional conversion table for the user of the data set is that the magnitude of the dependency is displayed quickly and immediately without the need for more complex statistical tests.
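The benchmarks in Table 1 are point values; binning an observed w below or between them is one common reading, and the one consistent with the labels reported later in Section 6.1 (including "VERY SMALL" for values below 0.1). A minimal helper, with naming of our own:

```python
def es_magnitude(w: float) -> str:
    """Map the Effect Size Index w to its conventional label (Table 1)."""
    if w < 0.1:
        return "VERY SMALL"
    if w < 0.3:
        return "SMALL"
    if w < 0.5:
        return "MEDIUM"
    return "LARGE"

# Effect size values reported in Section 6.1 for the three case studies:
for name, w in [("COMPAS", 0.1427), ("Drug Consumption", 0.1578),
                ("Adult", 0.0999)]:
    print(f"{name}: w = {w} -> {es_magnitude(w)}")
```

Applied to the values of Section 6.1, the helper reproduces the magnitudes of Table 6: SMALL, SMALL and VERY SMALL.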
4.2. Estimating Diverseness
Intuitively, the probability of an event represents how likely the event will occur. According to the classical definition the probability is given by the following ratio:
$$P(E) = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}} \tag{7}$$
We now apply this elementary theory to the problem of data collection in machine learning. When the data set is partitioned into training and test sets, a split with a more or less standard ratio (70/30 or 80/20) is generally performed, i.e. a sampling is performed on the available data. Let us consider the case in which the training set is generated by random sampling on the original data set without further techniques (stratification or resampling), for example in the case of a non-expert user. The probability that an event occurs turns into the probability that the training set shows some existing properties contained in the original data set:
$$P(\text{property}) = \frac{\text{number of examples with the property}}{\text{total number of examples in the data set}} \tag{8}$$
In our data annotation this ratio is introduced to allow the data set user to answer questions like: "If I perform a random sampling on the original data set, what is the probability that the training set is mainly composed of positive examples? What is the probability of belonging to a certain group with respect to the target variable?"
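As a sketch of Formula 8, one can simulate repeated random 70/30 splits on a toy data set (synthetic labels, illustrative only) and observe that the share of a property in the training set concentrates around its proportion in the original data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical available data: 1000 examples, 48% with a positive outcome
# (Y = 0), mirroring the prior of the motivating example.
y = rng.permutation(np.repeat([0, 1], [480, 520]))

def positive_share_of_random_split(y, train_frac=0.7):
    """Share of positive (Y = 0) examples in one random training split."""
    idx = rng.permutation(len(y))[: int(train_frac * len(y))]
    return (y[idx] == 0).mean()

shares = [positive_share_of_random_split(y) for _ in range(2000)]

# Without stratification, the training-set composition fluctuates around
# the proportion in the original data (here P(Y = 0) = 0.48).
print(f"mean share over 2000 splits = {np.mean(shares):.3f}")
```

The spread of the individual shares around 0.48 is exactly the sampling uncertainty that the prior probabilities of the next paragraph summarize.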
Prior Probabilities
The a priori probability of a data property is the degree of belief in the property in the absence of other information, also known as the unconditional probability. The degree of belief is the probability of a property being true in an uncertain environment. The probability refers to the belief and not to the truth of the fact, as it is not possible for the user to know the truth exactly, that is, whether the original data are representative of the real world. Since the user does not have access to the complete information, several hypotheses on how the real data are structured have to be drawn, assigning to each of them a probability of being true. Formally:
$$P(Y = y) = \frac{n_{Y = y}}{n}, \qquad P(E = e) = \frac{n_{E = e}}{n} \tag{9}$$
We estimate the prior probabilities by using the data of the problem introduced in Section 3, where the target variable assumes value 1 in case of negative outcome and 0 otherwise.

| Formula | Probability |
|---|---|
| $P(Y = 0)$ | 0.48 |
| $P(Y = 1)$ | 0.52 |
| $P(E = \text{Caucasian})$ | 0.60 |
| $P(E = \text{Black})$ | 0.35 |
| $P(E = \text{Asian})$ | 0.15 |
In this specific case, the prior probabilities indicate that the training set has probability 0.48 of being composed of individuals who display a positive outcome and 0.52 of being composed of individuals who display a negative outcome; finally, the probabilities that it is formed by individuals of white, black and Asian ethnicity are respectively 0.6, 0.35 and 0.15 (Table 2).
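The target-variable priors of Table 2 can be reproduced from the motivating example: $P(Y = 1)$ follows from the law of total probability over the ethnic groups, and $P(Y = 0)$ is taken as its complement. A minimal sketch:

```python
# Motivating example (Section 3): group shares and success probabilities.
p_group = {"Caucasian": 0.60, "Black": 0.35, "Asian": 0.15}
p_success_given_group = {"Caucasian": 0.7, "Black": 0.2, "Asian": 0.6}

# Prior of a negative outcome (Y = 1) via total probability over groups;
# prior of a positive outcome (Y = 0) as its complement.
p_y1 = sum(p_group[g] * (1 - p_success_given_group[g]) for g in p_group)
p_y0 = 1 - p_y1

print(f"P(Y = 0) = {p_y0:.2f}, P(Y = 1) = {p_y1:.2f}")  # 0.48 and 0.52
```

The ethnicity priors are instead read directly from the stated population composition.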
4.3. Estimating Inclusiveness
Posterior Probabilities
Given two events A and B, the probability $P(A \mid B)$ is called posterior probability because it allows to calculate the probability of A, knowing that B occurred. In our case, the posterior probability means computing the probability that $Y = y$, knowing that $E = e$ has occurred (and vice versa). In other words, the probability that the training set shows the property $Y = y$, knowing that the property $E = e$ has occurred (and vice versa). We start by estimating the probability that two events occur simultaneously. From the definition of conditional probability:

$$P(Y = y \cap E = e) = P(Y = y \mid E = e)\, P(E = e) \tag{10}$$
Since, from the Compound Probability Theorem (Ross, 1996), $P(Y = y \mid E = e)\,P(E = e)$ is equal to $P(E = e \mid Y = y)\,P(Y = y)$, i.e. the probability of both properties occurring is the same, either of the two formulas can be employed indistinctly.
| Formula | Probability |
|---|---|
| $P(Y = 0 \cap E = \text{Caucasian})$ | 0.42 |
| $P(Y = 0 \cap E = \text{Black})$ | 0.07 |
| $P(Y = 0 \cap E = \text{Asian})$ | 0.09 |
| $P(Y = 1 \cap E = \text{Caucasian})$ | 0.18 |
| $P(Y = 1 \cap E = \text{Black})$ | 0.28 |
| $P(Y = 1 \cap E = \text{Asian})$ | 0.06 |
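The simultaneous probabilities reported above follow directly from Formula 10, multiplying each conditional success or failure probability of the motivating example by the corresponding group share. A minimal sketch:

```python
# Motivating example (Section 3): group shares and success probabilities.
p_group = {"Caucasian": 0.60, "Black": 0.35, "Asian": 0.15}
p_success = {"Caucasian": 0.7, "Black": 0.2, "Asian": 0.6}

# Formula 10: P(Y = y, E = e) = P(Y = y | E = e) * P(E = e)
joint = {}
for e, pe in p_group.items():
    joint[(0, e)] = p_success[e] * pe        # positive outcome (Y = 0)
    joint[(1, e)] = (1 - p_success[e]) * pe  # negative outcome (Y = 1)

for key, p in sorted(joint.items()):
    print(key, round(p, 2))  # 0.42, 0.07, 0.09, 0.18, 0.28, 0.06
```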
4.4. Estimating Training Likelihood
From the definition of conditional probability, we derive the Bayes Theorem for the properties of the training set:
$$P(Y = y \mid E = e) = \frac{P(E = e \mid Y = y)\, P(Y = y)}{P(E = e)} \tag{11}$$
In the case of binary classification and of protected attributes, we are in the presence of a partition of events. This means that the events are disjoint from each other ($A_i \cap A_j = \emptyset$ for $i \neq j$) and that, taken as a whole, they are the only ones possible, i.e., if a certain property occurs, one and only one of them certainly appeared. In other words, it is not possible that the training set is composed of individuals who belong simultaneously to the black and white ethnic groups, or who simultaneously show a positive and a negative outcome. The union of the occurrences of the single properties is therefore the whole set of possible properties. For the properties outcome and ethnicity, the generalization formulas are respectively:
$$P(Y = y) = \sum_{e} P(Y = y \mid E = e)\, P(E = e), \qquad P(E = e) = \sum_{y} P(E = e \mid Y = y)\, P(Y = y) \tag{12}$$
By applying Formulas 10 and 12 the Bayes Theorem can be generalized for each property of the training set:
$$P(Y = y \mid E = e) = \frac{P(E = e \mid Y = y)\, P(Y = y)}{\sum_{y'} P(E = e \mid Y = y')\, P(Y = y')}, \qquad P(E = e \mid Y = y) = \frac{P(Y = y \mid E = e)\, P(E = e)}{\sum_{e'} P(Y = y \mid E = e')\, P(E = e')} \tag{13}$$
The first equation in Formula 13 derives the probability of the outcome property given the ethnic property, while the second equation derives the probability of the ethnic property given the outcome property. In other words, it derives the probability of composition of the training set based on the posterior probabilities of the outcome and ethnicity properties. Once a random sampling is carried out on the original data, the Formula answers the following questions:

In the sampled training set what is the probability of belonging to an ethnic group with respect to the outcome variable?

In the sampled training set what is the probability of obtaining a certain outcome with respect to the ethnic group?
Complementarily, the two equations can be interpreted as the probability of bias within the training set.
| Formula | Probability |
|---|---|
| $P(Y = 0 \mid E = \text{Caucasian})$ | 0.7 |
| $P(Y = 0 \mid E = \text{Black})$ | 0.2 |
| $P(Y = 0 \mid E = \text{Asian})$ | 0.6 |
| $P(Y = 1 \mid E = \text{Caucasian})$ | 0.3 |
| $P(Y = 1 \mid E = \text{Black})$ | 0.8 |
| $P(Y = 1 \mid E = \text{Asian})$ | 0.4 |
| $P(E = \text{Caucasian} \mid Y = 1)$ | 0.34 |
| $P(E = \text{Caucasian} \mid Y = 0)$ | 0.87 |
| $P(E = \text{Black} \mid Y = 1)$ | 0.53 |
| $P(E = \text{Black} \mid Y = 0)$ | 0.15 |
| $P(E = \text{Asian} \mid Y = 1)$ | 0.11 |
| $P(E = \text{Asian} \mid Y = 0)$ | 0.18 |
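The posterior probabilities of the ethnic groups given the outcome can be reproduced by applying Bayes' Theorem (Formula 11) to the motivating example, using the target-variable marginals of Table 2 ($P(Y = 1) = 0.52$ and $P(Y = 0) = 0.48$); the results match the values above up to the rounding used there. A minimal sketch:

```python
# Motivating example (Section 3): group shares and success probabilities.
p_group = {"Caucasian": 0.60, "Black": 0.35, "Asian": 0.15}
p_y0_given_e = {"Caucasian": 0.7, "Black": 0.2, "Asian": 0.6}

# Target-variable marginals as reported in Table 2.
p_y = {0: 0.48, 1: 0.52}

# Bayes (Formula 11): P(E = e | Y = y) = P(Y = y | E = e) P(E = e) / P(Y = y)
posterior = {}
for e in p_group:
    likelihood = {0: p_y0_given_e[e], 1: 1 - p_y0_given_e[e]}
    for y in (0, 1):
        posterior[(e, y)] = likelihood[y] * p_group[e] / p_y[y]

for key, p in sorted(posterior.items()):
    print(key, round(p, 3))
```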
5. Case Study Datasets
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions)^1 is a popular tool used by U.S. courts to estimate the defendants' probability of recidivism. This dataset displays the probability of reoffending over a two-year follow-up period. The dataset has been shown to underestimate the risk of recidivism for white defendants and overestimate it for black defendants (Larson et al., 2016).

^1 Retrieved from: https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis
Drug Consumption (Fehrman et al., 2015; Fehrman et al., 2017) contains information on the consumption of 18 drugs based on personality traits and socio-economic attributes. For simplicity of analysis we assumed the consumption of Cannabis as the target variable, but the annotation of the dataset can be made for each target drug.
Adult Dataset (Kohavi and Becker, 1996) contains adult annual income census data from the US Census Bureau. It is commonly employed in forecasting tasks to predict the factors leading to an income below or above $50,000.
| Property | COMPAS | Drug Consumption | Adult Dataset |
|---|---|---|---|
| Size | 6172 × 9 | 1885 × 31 | 48842 × 15 |
| Target variable | 0 = no, 1 = yes | 0 = non-user, 1 = user | 0 = income ≤ $50K, 1 = income > $50K |
| Levels of ethnicity attribute | Asian, Black, Caucasian, Hispanic, NA*, Other | Asian, Black, Black/Asian, Caucasian, White/Asian, White/Black, Other | AIE*, API*, Black, Caucasian, Other |

*AIE = American-Indian/Eskimo, API = Asian-Pac-Islander, NA = Native American.
6. Results and Discussion
We performed the analyses that constitute our data annotation system for each of the datasets presented in Section 5. Subsections 6.1, 6.2, 6.3 and 6.4 report the analysis for each module (dependence, diverseness, inclusiveness and training likelihood, respectively) and contain an example graphic module. Figures 4 and 5 show an illustrative example of the graphical visualization for the complete notation.
6.1. Dependence
This module aims to analyze the connection relationships, established on the available data, between the protected attribute Ethnicity and the target variable. For instance, for the COMPAS dataset the module highlights the dependency relationships between recidivism and the different ethnic minorities. Summary results for the dependence module are shown in Table 6.
| Measure | COMPAS | Drug Consumption | Adult Dataset |
|---|---|---|---|
| Contingency coefficient | 0.1413 | 0.1558 | 0.0994 |
| Effect size w | 0.1427 | 0.1578 | 0.0999 |
| Magnitude of Effect size w | SMALL | SMALL | VERY SMALL |
None of the three datasets displays worrying dependency values between the protected attribute Ethnicity and the target variable, the magnitude of the Effect Size w being small or very small. However, the results of the COMPAS dataset, which is proven to contain bias, indicate that this module alone is not sufficient to reveal a latent bias risk. The degree of bias depends on the sample size and on the value of the contingency coefficient between the target variable and the protected attribute (Zhou et al., 2017). Smaller samples lead to more bias and higher variance (Zimmerman et al., 2017); therefore the results of the dependency must be analyzed in relation to the amount of data available. In order to facilitate the interpretation of the connection relations, we propose a graphic notation for dependence. Figure 1 shows the graphical representations of the dependency module based on different connection magnitudes.
6.2. Diverseness
This module aims to analyze the diverseness of the available data by estimating the prior probabilities. They determine the probability that the training set will display an a priori environment based on the original data available, i.e. they show the probability of the training-set composition stratified by each of the target variable and protected attribute levels. For example, in our case study the module highlights the probability that the training set will be equally composed of ethnic minorities and ethnic majorities. Summary results for the diverseness module are shown in Table 7.
| Level | COMPAS | Drug Consumption | Adult Dataset |
|---|---|---|---|
| 0 | 0.545 | 0.329 | 0.239 |
| 1 | 0.455 | 0.671 | 0.761 |
| Caucasian | 0.341 | 0.912 | 0.855 |
| Black | 0.514 | 0.018 | 0.096 |
| Asian | 0.005 | 0.014 | — |
| Hispanic | 0.082 | — | — |
| Native American | 0.002 | — | — |
| Other | 0.056 | 0.033 | 0.008 |
| White/Black | — | 0.011 | — |
| White/Asian | — | 0.011 | — |
| Black/Asian | — | 0.002 | — |
| Amer-Indian-Eskimo | — | — | 0.010 |
| Asian-Pac-Islander | — | — | 0.031 |
In terms of target-variable probabilities, the results show strong distortions for the Drug Consumption and Adult datasets, with a high probability of positive examples (i.e. showing a negative outcome), while the probabilities of the COMPAS dataset are quite homogeneous. Regarding the probabilities of the protected attribute ethnicity, the distortions are even more pronounced than those of the target variable, revealing a very high probability of composition for the Caucasian ethnicity in the Drug Consumption and Adult datasets. In the case of the COMPAS dataset the probabilities are indeed distorted, although not yet such as to predict, at this point of the analysis, more severe future distortions, which is why more in-depth analyses are required. Figure 2 shows the graphical representation of the diverseness module, which simplifies the display of the prior probabilities. The example shows the notation for a dataset where both the levels of the target variable and those of the protected attribute ethnicity are equiprobable.
6.3. Inclusiveness
This module aims to analyze the inclusiveness of the available data by estimating the simultaneous probabilities. They determine the probability that the training set will simultaneously display, two by two, the target variable and protected attribute properties. For instance, in our case study the module highlights the probability that in the training set the property Asian appears simultaneously with the property success. Summary results for the inclusiveness module are shown in Table 8.
| Property pair | COMPAS | Drug Consumption | Adult Dataset |
|---|---|---|---|
| $0 \cap \text{AIE}$ | — | — | 0.0006 |
| $0 \cap \text{Asian}$ | 0.0023 | 0.0019 | — |
| $0 \cap \text{API}$ | — | — | 0.0041 |
| $0 \cap \text{Black}$ | 0.1514 | 0.0023 | 0.0057 |
| $0 \cap \text{Black/Asian}$ | — | 0.0000 | — |
| $0 \cap \text{Caucasian}$ | 0.1281 | 0.0555 | 0.1061 |
| $0 \cap \text{Hispanic}$ | 0.0320 | — | — |
| $0 \cap \text{NA}$ | 0.0006 | — | — |
| $0 \cap \text{Other}$ | 0.0219 | 0.0013 | 0.0005 |
| $0 \cap \text{White/Asian}$ | — | 0.0004 | — |
| $0 \cap \text{White/Black}$ | — | 0.0006 | — |
| $1 \cap \text{AIE}$ | — | — | 0.0042 |
| $1 \cap \text{Asian}$ | 0.0008 | 0.0007 | — |
| $1 \cap \text{API}$ | — | — | 0.0111 |
| $1 \cap \text{Black}$ | 0.1661 | 0.0010 | 0.0412 |
| $1 \cap \text{Black/Asian}$ | — | 0.0003 | — |
| $1 \cap \text{Caucasian}$ | 0.0822 | 0.1165 | 0.3115 |
| $1 \cap \text{Hispanic}$ | 0.0189 | — | — |
| $1 \cap \text{NA}$ | 0.0005 | — | — |
| $1 \cap \text{Other}$ | 0.0124 | 0.0050 | 0.0036 |
| $1 \cap \text{White/Asian}$ | — | 0.0016 | — |
| $1 \cap \text{White/Black}$ | — | 0.0014 | — |

AIE = American-Indian/Eskimo, API = Asian-Pac-Islander, NA = Native American.
The results of this module show that the probability that two properties occur simultaneously is related to the sample size. Evidence of this can be found in the results of the Drug Consumption and Adult datasets, where the highest probabilities of simultaneous events involve the Caucasian property. The COMPAS dataset shows quite homogeneous probabilities, especially with regard to the Black property, while for the Caucasian property the highest probabilities are related to the simultaneous occurrence with the Non-recidivist property. Since the simultaneous probabilities depend on the number of examples within the available data and on the sample size, this result alone is not sufficient to establish a priori the certain presence of serious data distortions, although some evidence can already be seen. Figure 3 shows the graphical representation of the inclusiveness module, which simplifies the display of the simultaneous probabilities. The example shows the notation for a dataset where all the properties of the target variable and those of the protected attribute ethnicity are equiprobable.
6.4. Training Likelihood
This module aims to analyze the training likelihood of the available data by estimating the posterior probabilities. They determine the probability that, in the training set, the occurrence of the properties of the protected attribute is given by the properties of the target variable (and vice versa). For example, in the COMPAS dataset they determine the probability that the occurrence of reoffending is given by the properties of the protected attribute ethnicity. Summary results for the training likelihood module are shown in Table 9.
                      COMPAS   Drug Consumption   Adult Dataset
P(0 | AIE)               –           –              0.117
P(0 | Asian)           0.742       0.731             –
P(0 | API)               –           –              0.269
P(0 | Black)           0.477       0.697            0.121
P(0 | Black/Asian)       –         0.000             –
P(0 | Caucasian)       0.609       0.323            0.254
P(0 | Hispanic)        0.629         –                –
P(0 | NA)              0.545         –                –
P(0 | Other)           0.638       0.206            0.123
P(0 | White/Asian)       –         0.200             –
P(0 | White/Black)       –         0.300             –
P(1 | AIE)               –           –              0.883
P(1 | Asian)           0.258       0.269             –
P(1 | API)               –           –              0.731
P(1 | Black)           0.523       0.303            0.879
P(1 | Black/Asian)       –         1.000             –
P(1 | Caucasian)       0.391       0.677            0.746
P(1 | Hispanic)        0.371         –                –
P(1 | NA)              0.455         –                –
P(1 | Other)           0.362       0.794            0.877
P(1 | White/Asian)       –         0.800             –
P(1 | White/Black)       –         0.700             –
P(AIE | 0)               –           –              0.005
P(AIE | 1)               –           –              0.011
P(Asian | 0)           0.007       0.031             –
P(Asian | 1)           0.003       0.006             –
P(API | 0)               –           –              0.035
P(API | 1)               –           –              0.030
P(Black | 0)           0.450       0.037            0.048
P(Black | 1)           0.591       0.008            0.111
P(Black/Asian | 0)       –         0.000             –
P(Black/Asian | 1)       –         0.002             –
P(Caucasian | 0)       0.381       0.895            0.908
P(Caucasian | 1)       0.293       0.921            0.839
P(Hispanic | 0)        0.095         –                –
P(Hispanic | 1)        0.067         –                –
P(NA | 0)              0.002         –                –
P(NA | 1)              0.002         –                –
P(Other | 0)           0.065       0.021            0.004
P(Other | 1)           0.044       0.040            0.010
P(White/Asian | 0)       –         0.006             –
P(White/Asian | 1)       –         0.013             –
P(White/Black | 0)       –         0.010             –
P(White/Black | 1)       –         0.011             –
AIE = American-Indian/Eskimo; API = Asian-Pac-Islander; NA = Native American.
The results of this module show that the posterior probabilities of the target variable and of the protected attribute ethnicity are quite skewed in all datasets. In the Adult dataset, given event 0 or event 1 as occurred, the probability of occurrence of the Caucasian ethnic group is respectively 0.908 and 0.839, i.e. very high for both events, while the probabilities of all other ethnic groups conditioned on the target variable are significantly lower; this means that the original data contain many examples of individuals belonging to the Caucasian ethnic group. In the Drug Consumption dataset, similar reasoning applies to the ethnicity probabilities conditioned on the target variable; moreover, given the property Black/Asian, the probability of occurrence of event 1, i.e. that the individual is a consumer, is 1, while the probability of event 0 is 0. This means that the available data contain no examples of individuals belonging to the ethnic group Black/Asian with outcome 0, i.e. no negative examples. Figures 4 and 5 show the graphical visualization of our data annotation system for the COMPAS dataset. The analysis of the COMPAS dataset shows that if an individual is randomly sampled from the original data for the training set, the probability that this individual is Black knowing that the reoffending property has occurred is 0.591, while the probability that the individual is Caucasian knowing that the reoffending property has occurred is 0.293. Conversely, given the property Black as occurred, the probability that the individual has not reoffended is 0.477, while the probability that the individual has reoffended is 0.523; given the property Caucasian, the probability that the individual has not reoffended is 0.609, while the probability that the individual has reoffended is 0.391, which is significantly lower. This means that in this dataset reoffending is correlated with ethnicity, and that success or failure is associated with membership in a specific ethnic group. The differences in probability between the properties highlight the risk of future bias, and in the case of the COMPAS dataset they anticipate the underestimation of recidivism for the Caucasian ethnic group and the overestimation of recidivism for the Black ethnic group proven in recent studies (Larson et al., 2016).
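The posterior probabilities discussed above are conditional relative frequencies, and the two directions of conditioning both follow from the same joint and marginal counts via Bayes' theorem. A small sketch with invented counts (the group labels "A" and "B" are hypothetical, loosely echoing the COMPAS pattern in which the positive outcome is more frequent in one group):

```python
from collections import Counter

def conditionals(rows):
    """Return P(target | ethnicity) and P(ethnicity | target) estimated
    from (target, ethnicity) pairs, using joint and marginal counts."""
    joint = Counter(rows)
    t_marg = Counter(t for t, _ in rows)   # marginal counts of the target
    e_marg = Counter(e for _, e in rows)   # marginal counts of the ethnicity
    p_t_given_e = {(t, e): c / e_marg[e] for (t, e), c in joint.items()}
    p_e_given_t = {(e, t): c / t_marg[t] for (t, e), c in joint.items()}
    return p_t_given_e, p_e_given_t

# Invented counts: group A has 6 positive and 4 negative examples,
# group B has 2 positive and 8 negative examples.
rows = [(1, "A")] * 6 + [(0, "A")] * 4 + [(1, "B")] * 2 + [(0, "B")] * 8
p_t_given_e, p_e_given_t = conditionals(rows)
print(p_t_given_e[(1, "A")])  # 0.6  : P(event 1 | group A)
print(p_e_given_t[("A", 1)])  # 0.75 : P(group A | event 1)
```

As in the tables above, the two directions of conditioning answer different questions: the first row asks how likely the outcome is within a group, the second asks how a given outcome is distributed across groups.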
6.5. Final Remarks

In traditional sampling practices, instead of observing all the units of a population, only a subset of the population is observed, and this subset must exhibit certain probabilistic characteristics. In machine learning models, the training set is sampled not from the real population but from the available data. While in classical sampling the empirical knowledge alone is effectively of a sample nature, in machine learning systems the available data are often of a sample nature too, precisely because it is not possible to make assumptions about the real population. Considering random sampling from the available data, we have shown that the probability of composition of the training set can be predicted, highlighting that the structure of the data directly affects the probability distribution of the properties;

we analyzed three datasets frequently used by the machine learning community. All three showed more or less pronounced distortions for the protected attribute ethnicity. Although the COMPAS dataset is the only one that has been shown to discriminate against Black people, the Drug Consumption and Adult datasets reveal possible future bias to the detriment of ethnic minorities.
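The first remark above, that the composition of a randomly drawn training set can be predicted from the structure of the available data, can be illustrated with the hypergeometric distribution: sampling without replacement from the available data fixes, in closed form, the probability of drawing any given number of minority examples. A sketch with invented counts (the population sizes below are hypothetical, not taken from the three datasets):

```python
from math import comb

def prob_minority_count(pop, minority, n, k):
    """P(exactly k minority examples in a training set of size n drawn
    without replacement from `pop` available examples, `minority` of
    which belong to the minority group): the hypergeometric pmf."""
    return comb(minority, k) * comb(pop - minority, n - k) / comb(pop, n)

# Invented example: 1000 available examples, 50 from the minority group,
# training set of size 100. Probability that the training set contains
# no minority example at all:
p_none = prob_minority_count(1000, 50, 100, 0)
print(round(p_none, 4))
```

Even in this mild scenario the chance of an all-majority training set is small but nonzero; with stronger imbalance in the available data it grows quickly, which is exactly the risk the annotation system is meant to surface.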
7. Related Work
Although a number of papers deal with data annotation for ethical purposes, they are all very recent, indicating that this field of study is still only partially explored and has only recently received considerable attention. Our contribution differs from the others because it induces probabilistic reasoning on the causes of model discrimination rooted in sampling problems; our intention is to deepen the knowledge of data validation analysis, focusing on the meaning of the probabilities. From a graphical point of view, our work is inspired by the Dataset Nutrition Label (Holland et al., 2018), a data labeling system mainly based on descriptive data statistics. A similar approach is taken in (Beretta et al., 2019), where an operational framework is proposed to identify the bias risks of automatic decision systems. In (Gebru et al., 2018) the authors propose a data labeling system based on discursive datasheets. In (Chang et al., 2017) the authors propose a collaborative crowdsourcing system to improve label quality.
Since ethically motivated data annotation is a rather new field of study, several works provide different types of labels. We believe that at present the focus should not be on achieving a unified data annotation system in the short term, but rather on the fact that the fair machine learning community is working together to draw attention to the data collection problem, especially because awareness of data issues is often not rooted outside this community. It is important that this field, and this work, inspire greater awareness of the possible causes of discrimination arising from the fundamental ingredient that all users and designers of machine learning systems (from the most to the least experienced) rely on: data.
8. Conclusions
The purpose of this study was to detect the potential risk of racial discrimination in future machine learning systems by providing a data annotation system based on Bayesian inference. Our notation serves as a diagnostic framework to immediately visualize data appropriateness and the potential bias occurring when the training set is sampled from an available dataset. The investigation of the probabilities of the training set sampling has shown that it is possible to establish a risk of future bias by observing the prior and posterior probabilities of the ethnicity and target variable properties. The empirical findings of this study provide a new perspective on data annotation practices by showing that Bayesian inference may reveal the risk of bias in three different widespread datasets. Furthermore, this study has raised important questions about awareness of the most widespread data sampling practices in the machine learning community. The findings of this investigation complement those of earlier studies. Our data annotation system is limited to the binary case and to the analysis of categorical variables for classification tasks; this would be a fruitful area for further work. Our intent is to expand the work in the following directions: i) extend the notation to multiple protected attributes, where the probabilities of the training set will be given by the vectors of the protected attribute combinations; ii) extend the notation to the non-binary case, for prediction tasks involving regression analysis for example; iii) extend the probabilistic notation to non-labeled data.
References
 Albarghouthi and Vinitsky (2019) Aws Albarghouthi and Samuel Vinitsky. 2019. Fairness-Aware Programming. In Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 211–219. https://doi.org/10.1145/3287560.3287588
 Angwin and Parris (2016) Julia Angwin and Terry Parris Jr. 2016. Facebook Lets Advertisers Exclude Users by Race. ProPublica. Retrieved September 12, 2020 from https://www.propublica.org/article/facebook-lets-advertisers-exclude-users-by-race
 Asudeh et al. (2019) Abolfazl Asudeh, Zhongjun Jin, and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, New Jersey, US, 554–565.
 Barocas et al. (2018) Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2018. Fairness and Machine Learning. http://www.fairmlbook.org.
 Benjamin (2019) Ruha Benjamin. 2019. Assessing risk, automating racism. Science 366, 6464 (2019), 421–422. https://doi.org/10.1126/science.aaz3873 arXiv:https://science.sciencemag.org/content/366/6464/421.full.pdf

 Beretta et al. (2019) Elena Beretta, Antonio Santangelo, Bruno Lepri, Antonio Vetrò, and Juan Carlos De Martin. 2019. The Invisible Power of Fairness. How Machine Learning Shapes Democracy. In Advances in Artificial Intelligence, Proceedings of the 32nd Canadian Conference on Artificial Intelligence, Canadian AI 2019 (Kingston, ON, Canada), Marie-Jean Meurs and Frank Rudzicz (Eds.), Vol. 11489. Springer, Cham, Germany, 238–250. https://doi.org/10.1007/978-3-030-18305-9_19
 Berk et al. (2018) Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2018. Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociological Methods & Research 50, 1 (2018), 3–44. https://doi.org/10.1177/0049124118782533
 Caliskan et al. (2017) Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186. https://doi.org/10.1126/science.aal4230 arXiv:https://science.sciencemag.org/content/356/6334/183.full.pdf
 Chang et al. (2017) Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 2334–2346. https://doi.org/10.1145/3025453.3026044
 Colley et al. (2017) Ashley Colley, Jacob Thebault-Spieker, Allen Yilun Lin, Donald Degraen, Benjamin Fischman, Jonna Häkkilä, Kate Kuehl, Valentina Nisi, Nuno Jardim Nunes, Nina Wenig, Dirk Wenig, Brent Hecht, and Johannes Schöning. 2017. The Geography of Pokémon GO: Beneficial and Problematic Effects on Places and Movement. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 1179–1192. https://doi.org/10.1145/3025453.3025495
 Dahiwade et al. (2019) D. Dahiwade, G. Patle, and E. Meshram. 2019. Designing Disease Prediction Model Using Machine Learning Approach. 1211–1215 pages.
 Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (Cambridge, Massachusetts) (ITCS ’12). Association for Computing Machinery, New York, NY, USA, 214–226. https://doi.org/10.1145/2090236.2090255
 Edelman et al. (2017) Benjamin Edelman, Michael Luca, and Dan Svirsky. 2017. Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment. American Economic Journal: Applied Economics 9, 2 (April 2017), 1–22. https://doi.org/10.1257/app.20160213
 Eubanks (2018) Virginia Eubanks. 2018. Automating Inequality: How HighTech Tools Profile, Police, and Punish the Poor. St. Martin’s Press, Inc., USA.
 Fehrman et al. (2015) Elaine Fehrman, Vincent Egan, and Evgeny M. Mirkes. 2015. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Fehrman et al. (2017) Elaine Fehrman, Awaz K. Muhammad, Evgeny M. Mirkes, Vincent Egan, and Alexander N. Gorban. 2017. The Five Factor Model of Personality and Evaluation of Drug Consumption Risk. 231–242 pages. https://doi.org/10.1007/978-3-319-55723-6_18
 Gebru et al. (2018) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. arXiv:1803.09010
 Geiger et al. (2020) R. Stuart Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, and Jenny Huang. 2020. Garbage in, Garbage out? Do Machine Learning Application Papers in Social Computing Report Where HumanLabeled Training Data Comes From?. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 325–336. https://doi.org/10.1145/3351095.3372862
 Gil et al. (2016) Yolanda Gil, Cédric H. David, Ibrahim Demir, Bakinam T. Essawy, Robinson W. Fulweiler, Jonathan L. Goodall, Leif Karlstrom, Huikyo Lee, Heath J. Mills, JiHyun Oh, Suzanne A. Pierce, Allen Pope, Mimi W. Tzeng, Sandra R. Villamizar, and Xuan Yu. 2016. Toward the Geoscience Paper of the Future: Best practices for documenting and sharing research from data to software to provenance. Earth and Space Science 3, 10 (2016), 388–415. https://doi.org/10.1002/2015EA000136 arXiv:https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1002/2015EA000136

 Hardt et al. (2016) Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (Barcelona, Spain) (NIPS’16). Curran Associates Inc., Red Hook, NY, USA, 3323–3331.
 Larson et al. (2016) Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. ProPublica. Retrieved September 2, 2020 from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
 Holland et al. (2018) Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. CoRR abs/1805.03677 (2018), 21 pages. arXiv:1805.03677 http://arxiv.org/abs/1805.03677
 Indira and Kavithadevi (2019) K. Indira and M. K. Kavithadevi. 2019. Efficient Machine Learning Model for Movie Recommender Systems Using Multi-Cloud Environment. Mobile Networks and Applications 24, 6 (2019), 1872–1882. https://doi.org/10.1007/s11036-019-01387-4
 Jo and Gebru (2020) Eun Seo Jo and Timnit Gebru. 2020. Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 306–316. https://doi.org/10.1145/3351095.3372829
 Kanoje et al. (2016) Sumitkumar Kanoje, Debajyoti Mukhopadhyay, and Sheetal Girase. 2016. User Profiling for University Recommender System Using Automatic Information Retrieval. In Procedia Computer Science, Vol. 78. Elsevier, Netherlands, 5–12. https://doi.org/10.1016/j.procs.2016.02.002 1st International Conference on Information Security & Privacy 2015.
 KayserBril (2020) Nicolas KayserBril. 2020. Google apologizes after its Vision AI produced racist results. AlgorithmWatch. Retrieved August 17, 2020 from https://algorithmwatch.org/en/story/googlevisionracism/
 Kleinberg (2018) Jon Kleinberg. 2018. Inherent TradeOffs in Algorithmic Fairness. In Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems (Irvine, CA, USA) (SIGMETRICS ’18). ACM Press, New York, NY, USA, 40–40. https://doi.org/10.1145/3219617.3219634
 Kohavi and Becker (1996) Ronny Kohavi and Barry Becker. 1996. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Kukura (2020) Joe Kukura. 2020. Facebook (Finally) Removes Racial Ad Targeting. SFist. Retrieved September 12, 2020 from https://sfist.com/2020/08/31/facebookfinallyremovesracialadtargeting/
 Marda and Narayan (2020) Vidushi Marda and Shivangi Narayan. 2020. Data in New Delhi’s Predictive Policing System. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 317–324. https://doi.org/10.1145/3351095.3372865
 McDuff et al. (2018) Daniel McDuff, Roger Cheng, and Ashish Kapoor. 2018. Identifying Bias in AI using Simulation. arXiv:arXiv:1810.00471
 Noble (2018) Safiya Umoja Noble. 2018. Algorithms of oppression: How search engines reinforce racism. NYU Press, New York, NY, USA.
 Obermeyer et al. (2019) Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 6464 (2019), 447–453. https://doi.org/10.1126/science.aax2342 arXiv:https://science.sciencemag.org/content/366/6464/447.full.pdf
 O’Neil (2016) Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, New York.
 Oyebode and Orji (2020) Oladapo Oyebode and Rita Orji. 2020. A hybrid recommender system for product sales in a banking environment. 11 pages. https://doi.org/10.1007/s42786-019-00014-w
 Raji et al. (2020) Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (New York, NY, USA) (AIES ’20). Association for Computing Machinery, New York, NY, USA, 145–151. https://doi.org/10.1145/3375627.3375820
 Ross (1996) Sheldon M Ross. 1996. Stochastic processes. Wiley, New Jersey, US. https://books.google.de/books?id=ImUPAQAAMAAJ
 Schedl et al. (2018) Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116. https://doi.org/10.1007/s13735-018-0154-2
 Siting et al. (2012) Z. Siting, H. Wenxing, Z. Ning, and Y. Fan. 2012. Job recommender systems: A survey. In 2012 7th International Conference on Computer Science Education (ICCSE). IEEE Xplore Digital Library, New York, 920–924.
 Song (2020) Lin Song. 2020. TwoSided Price Discrimination by Media Platforms. Marketing Science 39, 2 (2020), 317–338. https://doi.org/10.1287/mksc.2019.1211 arXiv:https://doi.org/10.1287/mksc.2019.1211
 Speicher et al. (2018) Till Speicher, Muhammad Ali, Giridhari Venkatadri, Filipe Nunes Ribeiro, George Arvanitakis, Fabrício Benevenuto, Krishna P. Gummadi, Patrick Loiseau, and Alan Mislove. 2018. Potential for Discrimination in Online Targeted Advertising. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, New York, NY, USA, 5–19. http://proceedings.mlr.press/v81/speicher18a.html

 Tommasi et al. (2017) Tatiana Tommasi, Patricia Novi, Barbara Caputo, and Tinne Tuytelaars. 2017. A Deeper Look at Dataset Bias. In G. Csurka (Ed.), Domain Adaptation in Computer Vision Applications. Advances in Computer Vision and Pattern Recognition. Springer, Cham, Switzerland, 37–55. https://doi.org/10.1007/978-3-319-58347-1_2
 Wang and Ni (2017) Yan Wang and Xuelei Sherry Ni. 2017. Predicting Class-Imbalanced Business Risk Using Resampling, Regularization, and Model Ensembling Algorithms. International Journal of Managing Information Technology (IJMIT) 11, 1 (2017), 15 pages. https://ssrn.com/abstract=3366806
 Williams et al. (2018) Betsy A. Williams, Catherine F. Brooks, and Yotam Shmargad. 2018. How Algorithms Discriminate Based on Data They Lack: Challenges, Solutions, and Policy Implications. Journal of Information Policy 8 (2018), 78–115. https://www.jstor.org/stable/10.5325/jinfopoli.8.2018.0078
 Zhou et al. (2017) Yao Zhou, M. Isabel Vales, Aoxue Wang, and Zhiwu Zhang. 2017. Systematic bias of correlation coefficient may explain negative accuracy of genomic prediction. Briefings Bioinform 18, 5 (2017), 744–753. https://doi.org/10.1093/bib/bbw064
 Zimmerman et al. (2017) Donald W. Zimmerman, Bruno D. Zumbo, and Richard H. Williams. 2017. Bias in estimation and hypothesis testing of correlation. Psicológica 24, 1 (2017), 133–158.