Bias and fairness in ML have received a lot of research attention in recent years, due at least in part to a number of high-profile issues with deployed ML systems Holstein et al. (2019). One of the challenges for research in this area is the availability of appropriate datasets. In this paper we present a family of synthetic datasets that we hope will address this issue.
Bias can occur in ML classification when we have a desirable outcome $Y = 1$ and a sensitive feature $S$, where $S = 0$ is the majority class and $S = 1$ is the sensitive minority. Bias may be quantified in various ways Caton and Haas (2020): one of the accepted measures is Disparate Impact ($\mathrm{DI}_S$) Feldman et al. (2015):

$$\mathrm{DI}_S = \frac{P(\hat{Y} = 1 \mid S = 1)}{P(\hat{Y} = 1 \mid S = 0)} < \tau$$

that is, the probability of good outcomes predicted for the sensitive minority is less than that for the majority. A threshold of $\tau = 0.8$ represents the 80% rule: scenarios with $\mathrm{DI}_S < 0.8$ would be considered clearly unfair.
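Under these definitions, disparate impact is straightforward to compute. The sketch below is our own illustration (not from the paper's released code) and assumes 0/1 NumPy arrays for the predictions and the binarised sensitive feature:

```python
import numpy as np

def disparate_impact(y_pred, sensitive):
    """DI: predicted favourable-outcome rate for the minority (sensitive == 1)
    divided by that for the majority (sensitive == 0)."""
    return y_pred[sensitive == 1].mean() / y_pred[sensitive == 0].mean()

# Toy example: the majority is predicted the good outcome three times as often.
y_pred = np.array([1, 0, 1, 1, 1, 0, 0, 0])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(disparate_impact(y_pred, s))  # 0.25 / 0.75 ≈ 0.33, failing the 80% rule
```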
It is often the case that disparate impact arises from shortcomings in the training data that reflect discriminatory practice in the past. It may also happen that the ML algorithm itself is biased, failing to pick up patterns in the data due to model underfitting. This is termed underestimation Cunningham and Delany (2021). An underestimation score ($\mathrm{US}_S$) in line with $\mathrm{DI}_S$ would be:

$$\mathrm{US}_S = \frac{P(\hat{Y} = 1 \mid S = 1)}{P(Y = 1 \mid S = 1)}$$
A poor underestimation score would indicate that the predicted outcomes for the minority are out of line with the actual outcomes in test data.
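A matching sketch for the underestimation score, again using assumed 0/1 arrays, compares the predicted positive rate for the minority against its actual positive rate in the test data:

```python
import numpy as np

def underestimation_score(y_true, y_pred, sensitive):
    """US: predicted positive rate for the minority divided by its actual
    positive rate; values below 1 indicate underestimation."""
    mask = sensitive == 1
    return y_pred[mask].mean() / y_true[mask].mean()

# Toy example: half of the minority's actual positives are missed.
y_true = np.array([1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0])
print(underestimation_score(y_true, y_pred, np.ones(4, dtype=int)))  # 0.25 / 0.5 = 0.5
```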
Even though there are many datasets available for ML research, few are suitable for testing the impacts of, and remedies for, disparate impact or underestimation. Feldman et al. (2015) introduced a synthetic dataset in 2015 to help address this issue. While this dataset has been used in a wide range of studies on bias in ML, it is limited in that it is a fairly simple dataset: the outcome is whether a student will be accepted for a college course, and there are just three inputs, SAT score, IQ and gender. By contrast, the synthetic dataset we present here has 18 features with complex interactions between them Kennedy et al. (2011). This dataset is presented in the next section and some basic experiments are presented in Section 3.
2 The Data
In this section, we present an overview of the synthetic credit scoring dataset summarised in Table 1. Each sample in the dataset represents a residential mortgage applicant’s information, and the class label is generated based on a threshold to distinguish between those likely to repay and those likely to default on their financial obligation. Hence, the classification objective is to assign borrowers to one of two groups: good or bad. The framework used to generate these synthetic datasets is explained in Kennedy et al. (2011) (see https://www.researchgate.net/publication/241802100_A_Framework_for_Generating_Data_to_Simulate_Application_Scoring).
In line with other research on bias in ML, we select Age as the sensitive attribute. Given that the context is mortgage applications, we categorize younger people as the underprivileged group and older people as the privileged group. Since Age is a continuous variable, we can vary the magnitude of under-representation by adjusting the threshold separating the privileged and unprivileged groups. In addition, we can vary the level of class imbalance in the dataset by adjusting the threshold on the class labels.
2.1 Generating Specific Datasets
We split the dataset generated in Section 2 into three subsets: small, medium and large base datasets, which contain 3,493, 13,231 and 37,607 samples respectively. Through user-defined parameters, it is possible to vary the level of bias by adjusting the threshold on the class label to distinguish between those likely to repay and those likely to default on their financial obligation (class imbalance), and on the sensitive feature Age to separate the privileged and unprivileged groups (feature imbalance). In addition, Gaussian random noise can be added to Score to increase classification complexity; this is controlled by a user-defined parameter. To ensure reproducibility, we have provided access to all of the data and code used in this article at the author’s GitHub page (https://github.com/williamblanzeisky/SBDG).
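The two thresholding steps and the optional noise can be sketched as below. This is our own illustration, not the SBDG code: only the column names (Age, Score) follow Table 1, and the function name, default threshold values and noise parameter are assumptions.

```python
import numpy as np
import pandas as pd

def prepare_dataset(df, age_threshold=35, bad_rate=0.25, noise_std=0.0, seed=0):
    """Illustrative sketch of the user-defined parameters:
    age_threshold controls feature imbalance, bad_rate controls class
    imbalance, noise_std adds Gaussian noise to Score."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Optional Gaussian noise on Score to increase classification complexity.
    if noise_std > 0:
        out["Score"] = out["Score"] + rng.normal(0.0, noise_std, len(out))
    # Feature imbalance: binarise Age (1 = older/privileged, 0 = younger).
    out["AgeGroup"] = (out["Age"] >= age_threshold).astype(int)
    # Class imbalance: threshold Score so that roughly `bad_rate` of
    # applicants are labelled Bad (0) and the rest Good (1).
    cutoff = out["Score"].quantile(bad_rate)
    out["Label"] = (out["Score"] > cutoff).astype(int)
    return out

# Hypothetical usage on a toy frame with the Table 1 column names:
toy = pd.DataFrame({"Age": np.linspace(18, 55, 100), "Score": np.arange(100.0)})
prepared = prepare_dataset(toy, age_threshold=35, bad_rate=0.25)
```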
2.2 The Exemplar Dataset
We also provide exemplar datasets where specific thresholds have been applied. The thresholds have been set so that there is 25% class imbalance and 30% feature imbalance. In the next section, we will conduct a set of experiments on one of these exemplar datasets to show clear evidence of negative legacy and underestimation.
3 Sample Experiments
Using the scikit-learn logistic regression classifier, we show clear evidence of negative legacy on the medium exemplar dataset. 70% of the observations in the dataset are used for training, and the remaining 30% are reserved for model testing. Figure 1 illustrates the disparate impact score and balanced accuracy for the training and test sets. We see that the logistic regression model does not satisfy the 80% rule for disparate impact.
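A minimal version of this experiment could look like the sketch below. To keep it self-contained we generate stand-in data with `make_classification` instead of loading the medium exemplar dataset, and the binarised sensitive feature is a made-up construction; only the 70/30 split, the classifier and the two metrics follow the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the medium exemplar dataset: 16 features, 25% class imbalance.
X, y = make_classification(n_samples=1000, n_features=16, weights=[0.25],
                           random_state=42)
# Hypothetical binarised sensitive feature with a ~30% minority.
s = (X[:, 0] > np.quantile(X[:, 0], 0.7)).astype(int)

# 70% of observations for training, 30% reserved for testing.
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, y, s, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

bal_acc = balanced_accuracy_score(y_te, y_hat)
di = y_hat[s_te == 1].mean() / y_hat[s_te == 0].mean()
print(f"balanced accuracy: {bal_acc:.3f}, disparate impact: {di:.3f}")
```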
In addition, we conducted some experiments to demonstrate the effectiveness of current remediation strategies in the literature for fixing underestimation. Specifically, we evaluate two categories of mitigation techniques: adding counterfactuals (pre-processing) Blanzeisky and Cunningham (2021a) and using Pareto Simulated Annealing to simultaneously optimize a classifier for balanced accuracy and underestimation (in-processing) Blanzeisky and Cunningham (2021b). Figure 2 shows the effectiveness of these strategies on the medium exemplar dataset. We can see that the in-processing strategy (PSA (BA+US)) outperforms the other strategies in terms of underestimation while maintaining comparable balanced accuracy.
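To give a flavour of the pre-processing idea, the sketch below shows one naive reading of counterfactual augmentation: copy positive-outcome examples from the privileged group, flip their sensitive feature, and append them as extra minority positives. This is our own simplified illustration, not the exact procedure of Blanzeisky and Cunningham (2021a); the function name and parameters are assumptions.

```python
import numpy as np

def add_counterfactuals(X, y, s_col, n_copies, seed=0):
    """Naive counterfactual augmentation sketch: sample positive-outcome
    rows from the privileged group (X[:, s_col] == 0), flip the sensitive
    feature to the minority value (1), and append them to the training set."""
    rng = np.random.default_rng(seed)
    pos_priv = np.where((y == 1) & (X[:, s_col] == 0))[0]
    picks = rng.choice(pos_priv, size=n_copies, replace=True)
    X_cf = X[picks].copy()
    X_cf[:, s_col] = 1  # counterfactual: same applicant, minority group
    return np.vstack([X, X_cf]), np.concatenate([y, np.ones(n_copies, dtype=y.dtype)])
```

Training a classifier on the augmented set raises the minority's positive rate in the training data, which is the intuition behind this family of remedies.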
Table 1: Summary of the features in the synthetic credit scoring dataset.

| Type | Feature | Description | Values | Code |
|---|---|---|---|---|
| Categorical | Location | Location of purchased dwelling | | |
| | Employment Sector | Borrower’s employment sector | | |
| | Occupation | Employment activity of the borrower | | |
| | Education | Highest level of formal education | | |
| | Label | Risk of defaulting | (0) Bad (Default); (1) Good (Paid) | Label |
| Binary | New Home | Newly built dwelling | (0) Old Home; (1) New Home | NewHome |
| | First Time Buyer | Never purchased property before | (0) New Owner; (1) Previous Owner | FTB |
| Continuous | Age | Age of the borrower | 18 - 55 | Age |
| | Income | Total income of the borrower | 0 - inf | Income |
| | Expenses-to-Income | Ratio of borrower-expenditure-to-income | 0 - 1 | Exp:Inc |
| | Loan Value | Amount advanced to the borrower | 0 - inf | LoanValue |
| | Loan-to-Value | Loan to value ratio | 1 - inf | LTV |
| | Loan Term | Length of the loan in years | 20 - 40 | LoanTerm |
| | Loan Rate | Interest rate paid on the loan | 0 - 1 | Interest |
| | House Value | Market value of the property | 1 - inf | HouseVal |
| | MRTI | Ratio of mortgage-repayments-to-income | 0 - 1 | MRTI |
| | Score | Estimated credit risk score | 0 - inf | Score |
4 Conclusions

The objective of this short paper is to introduce a set of synthetic datasets to the community working on bias in ML. The data relates to mortgage applications, and the model used to synthesise the data is based on a significant body of research on factors influencing mortgage approvals Kennedy et al. (2011). Altogether the datasets contain 37,607 examples described by 17 features. We focus on Age as the sensitive feature, and code is provided to convert this to a category. The threshold for this conversion can be adjusted to control the level of imbalance in the dataset. The level of imbalance in the outcome (Good/Bad) can also be controlled. We also provide exemplar datasets with 25% class imbalance and 30% feature imbalance.
- Blanzeisky and Cunningham (2021a) Algorithmic factors influencing bias in machine learning. CoRR abs/2104.14014.
- Blanzeisky and Cunningham (2021b) Using Pareto simulated annealing to address algorithmic bias in machine learning.
- Caton and Haas (2020) Fairness in machine learning: a survey. arXiv preprint arXiv:2010.04053.
- Cunningham and Delany (2021) Underestimation bias and underfitting in machine learning. Lecture Notes in Computer Science, pp. 20–31.
- Feldman et al. (2015) Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
- Holstein et al. (2019) Improving fairness in machine learning systems: what do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–16.
- Kennedy et al. (2011) A framework for generating data to simulate application scoring. In Credit Scoring and Credit Control XII, Conference Proceedings.