Introducing a Family of Synthetic Datasets for Research on Bias in Machine Learning

by William Blanzeisky, et al.

A significant impediment to progress in research on bias in machine learning (ML) is the limited availability of relevant datasets. This situation is unlikely to change much given the sensitivity of such data. For this reason, there is a role for synthetic data in this research. In this short paper, we present one such family of synthetic datasets. We provide an overview of the data, describe how the level of bias can be varied, and present a simple example of an experiment on the data.




1 Introduction

Bias and fairness in ML have received a lot of research attention in recent years. This is due at least in part to a number of high-profile issues with deployed ML systems Holstein et al. (2019). One of the challenges for research in this area is the limited availability of appropriate datasets. In this paper we present a family of synthetic datasets that we hope will address this issue.

Bias can occur in ML classification when we have a desirable outcome $Y = 1$ and a sensitive feature $S$, where $S = 1$ is the majority class and $S = 0$ is the sensitive minority. Bias may be quantified in various ways Caton and Haas (2020): one of the accepted measures is Disparate Impact ($\mathrm{DI}_S$) Feldman et al. (2015):

$$\mathrm{DI}_S = \frac{P[\hat{Y} = 1 \mid S = 0]}{P[\hat{Y} = 1 \mid S = 1]} < \tau$$

that is, the probability of a good outcome predicted for the sensitive minority is less than that for the majority. A threshold of $\tau = 0.8$ would represent the 80% rule: scenarios with $\mathrm{DI}_S < 0.8$ would be considered clearly unfair.

It is often the case that disparate impact arises due to shortcomings in the training data that reflect discriminatory practice in the past. It may also happen that the ML algorithm itself is biased, failing to pick up patterns in the data due to model underfitting. This is termed underestimation Cunningham and Delany (2021). An underestimation score ($\mathrm{US}_S$) in line with $\mathrm{DI}_S$ would be:

$$\mathrm{US}_S = \frac{P[\hat{Y} = 1 \mid S = 0]}{P[Y = 1 \mid S = 0]}$$

A poor underestimation score would indicate that the predicted outcomes for the minority are out of line with the actual outcomes in test data.
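As a rough illustration of the two measures, the sketch below computes disparate impact as the ratio of predicted good-outcome rates (minority over majority) and the underestimation score as the ratio of predicted to actual good-outcome rates for the minority. The function names, the 0/1 coding of the sensitive feature (0 for the minority) and the toy arrays are assumptions made for this example only; they are not taken from the authors' code.

```python
import numpy as np

def disparate_impact(y_pred, s):
    """Ratio of predicted good-outcome rates: minority (s == 0) over majority (s == 1)."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return y_pred[s == 0].mean() / y_pred[s == 1].mean()

def underestimation_score(y_true, y_pred, s):
    """Predicted over actual good-outcome rate, both restricted to the minority (s == 0)."""
    y_true, y_pred, s = np.asarray(y_true), np.asarray(y_pred), np.asarray(s)
    minority = s == 0
    return y_pred[minority].mean() / y_true[minority].mean()

# Toy data (illustrative only): 4 minority and 6 majority samples.
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_true = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0])

di = disparate_impact(y_pred, s)               # (1/4) / (4/6) = 0.375
us = underestimation_score(y_true, y_pred, s)  # (1/4) / (2/4) = 0.5
print(f"DI = {di:.3f}, US = {us:.3f}")
```

In this toy case the disparate impact of 0.375 falls well below the 0.8 threshold, and the underestimation score of 0.5 shows that the classifier predicts good outcomes for the minority at half the rate at which they actually occur.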

Even though there are many datasets available for ML research, few are suitable for testing the impacts of, and remedies for, disparate impact or underestimation. Feldman et al. (2015) introduced a synthetic dataset to help address this issue. While this dataset has been used in a wide range of studies on bias in ML, it is limited in that it is a fairly simple dataset. The outcome is whether a student will be accepted for a college course; there are just three inputs: SAT score, IQ and gender. By contrast, the synthetic dataset we present here has 18 variables, including the class label (see Table 1), with complex interactions between them Kennedy et al. (2011). This dataset is presented in the next section and some basic experiments are presented in Section 3.

2 The Data

In this section, we present an overview of the synthetic credit scoring dataset summarised in Table 1. Each sample in the dataset represents a residential mortgage applicant’s information and the class label is generated based on a threshold to distinguish between those likely to repay and those likely to default on their financial obligation. Hence, the classification objective is to assign borrowers to one of two groups: good or bad. The framework used to generate these synthetic datasets is explained in Kennedy et al. (2011).

In line with other research on bias in ML, we select Age as the sensitive attribute. Given that the context is mortgage applications, we categorize younger people as the underprivileged group ($S = 0$) and older people as the privileged group ($S = 1$). Since Age is a continuous variable, we can vary the magnitude of under-representation by adjusting the threshold that separates the privileged and unprivileged groups. In addition, we can also vary the level of class imbalance in the dataset by adjusting the threshold on the class labels.

2.1 Generating Specific Datasets

We split the dataset generated in Section 2 into three subsets: small, medium and large base datasets, which contain 3,493, 13,231 and 37,607 samples respectively. Through user-defined parameters, it is possible to vary the level of bias by adjusting the threshold on the class label that distinguishes between those likely to repay and those likely to default on their financial obligation (class imbalance), and the threshold on the sensitive feature Age that separates the privileged and unprivileged groups (feature imbalance). In addition, Gaussian random noise can be added to Score to increase classification complexity: this is controlled by a user-defined parameter. To ensure reproducibility, we have provided access to all of the data and code used in this article at the author’s GitHub page.
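The thresholding described above could be sketched along the following lines. This is not the authors' generator: the function name `make_biased_dataset`, its parameters, the quantile-based way of setting the thresholds, the derived column `S`, and the assumption that a higher Score corresponds to a Good label are all hypothetical choices made for illustration. Only the column names `Age` and `Score` come from Table 1.

```python
import numpy as np
import pandas as pd

def make_biased_dataset(df, age_quantile=0.3, score_quantile=0.75,
                        noise_std=0.0, seed=0):
    """Binarise Age and Score by quantile thresholds (hypothetical sketch).

    age_quantile   -- share of samples placed in the unprivileged group (S = 0)
    score_quantile -- share of samples labelled Bad (Label = 0)
    noise_std      -- std. dev. of Gaussian noise added to Score before labelling
    """
    rng = np.random.default_rng(seed)
    out = df.copy()  # leave the caller's frame untouched

    score = out["Score"].to_numpy(dtype=float)
    if noise_std > 0:
        score = score + rng.normal(0.0, noise_std, size=len(out))

    age_cut = out["Age"].quantile(age_quantile)
    score_cut = np.quantile(score, score_quantile)

    # ASSUMPTION: younger applicants form the minority; higher Score means Good.
    out["S"] = (out["Age"] > age_cut).astype(int)
    out["Label"] = (score > score_cut).astype(int)
    return out

# Usage on random stand-in data (not the published dataset).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"Age": rng.uniform(18, 55, 1000),
                    "Score": rng.uniform(0, 1, 1000)})
biased = make_biased_dataset(toy, age_quantile=0.3, score_quantile=0.75)
print(biased[["S", "Label"]].mean())
```

Expressing both thresholds as quantiles makes the resulting group sizes easy to read off directly; the published code may instead expose raw cut-off values.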

2.2 The Exemplar Dataset

We also provide exemplar datasets where specific thresholds have been applied. The thresholds have been set so that there is 25% class imbalance and 30% feature imbalance. In the next section, we will conduct a set of experiments on one of these exemplar datasets to show clear evidence of negative legacy and underestimation.

3 Sample Experiments

Using the scikit-learn logistic regression classifier, we show clear evidence of negative legacy on the medium exemplar dataset. 70% of the observations in the dataset are used for training, and the remaining 30% are reserved for model testing. Figure 1 illustrates the disparate impact score and balanced accuracy for the train and test sets. We see that the logistic regression model does not satisfy the 80% rule for disparate impact.
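An experiment of this shape might look like the following sketch, which fits a scikit-learn logistic regression on a 70/30 split and reports disparate impact and balanced accuracy on both partitions. The data here is a randomly generated stand-in with a built-in dependence on the sensitive feature, not the exemplar dataset, and the helper `disparate_impact` and all variable names are assumptions for illustration. The numbers it prints therefore say nothing about the results in Figure 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def disparate_impact(y_pred, s):
    """Predicted good-outcome rate for s == 0 divided by that for s == 1."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return y_pred[s == 0].mean() / y_pred[s == 1].mean()

# Stand-in data (NOT the paper's dataset): the outcome depends on s by construction.
rng = np.random.default_rng(0)
n = 2000
s = (rng.random(n) > 0.3).astype(int)            # roughly 30% minority (s == 0)
x = rng.normal(size=(n, 3))
logit = x[:, 0] + 0.5 * x[:, 1] + 1.5 * s - 1.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X = np.column_stack([x, s])                      # sensitive feature is the last column

# 70% of the observations for training, 30% held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

results = {}
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = clf.predict(X_part)
    results[name] = {
        "DI": disparate_impact(pred, X_part[:, -1]),
        "BA": balanced_accuracy_score(y_part, pred),
    }
    print(f"{name}: DI = {results[name]['DI']:.3f}, BA = {results[name]['BA']:.3f}")
```

A disparate impact value below 0.8 on either partition would indicate a violation of the 80% rule.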

In addition, we conducted some experiments to demonstrate the effectiveness of current remediation strategies in the literature for fixing underestimation. Specifically, we evaluate two categories of mitigation techniques: adding counterfactuals (pre-processing) Blanzeisky and Cunningham (2021a), and using Pareto Simulated Annealing to simultaneously optimize a classifier on balanced accuracy and underestimation (in-processing) Blanzeisky and Cunningham (2021b). Figure 2 shows the effectiveness of these strategies on the medium exemplar dataset. We can see that the in-processing strategy (PSA (BA+US)) outperforms the other strategies in terms of underestimation while maintaining comparable balanced accuracy.

Figure 1: A demonstration of negative legacy on the train and test sets. The red horizontal line represents the 80% rule of Disparate Impact.

Figure 2: Evaluation of two remediation strategies to fix underestimation. It is clear that both strategies (PSA (BA + US) and Counterfactual) fix underestimation without much loss in balanced accuracy. The blue vertical line represents the desired underestimation score.

| Type | Variable | Description | Range/Values | Label |
|---|---|---|---|---|
| Categorical | Location | Location of purchased dwelling | (1) Dublin; (2) Cork; (3) Galway; (4) Limerick; (5) Waterford; (6) Other | |
| Categorical | Employment Sector | Borrower’s employment sector | (1) Agriculture, Forestry and fishing; (2) Construction; (3) Wholesale and Retail; (4) Transportation and Storage; (5) Hospitality; (6) Information and Communication; (7) Professional, Scientific and Technical; (8) Admin and Support services; (9) Public Admin.; (10) Education; (11) Health; (12) Industry; (13) Financial; (14) Other | |
| Categorical | Occupation | Employment activity of the borrower | (1) Manager, Administrator, and Professional (henceforth MAP); (2) Associate professional and technical, Clerical and secretarial, Personal and protective service, and Sales (henceforth Office); (3) Craft and related (henceforth Trade); (henceforth Office); (4) Craft and related (henceforth Trade); (5) Other manual operators (henceforth Farmer); (6) Self Employed | |
| Categorical | Household | Family Composition | (1) 1 Adult, no child <18; (2) 1 Adult, 1+ child <18; (3) 2 Adults, no child <18; (4) 3+ adults, no child <18; (5) 2 Adults, 1+ child <18; (6) Other | |
| Categorical | Education | Highest level of formal education | (1) Primary or below (PB); (2) Lower secondary (LS); (3) Higher secondary (HS); (4) Post leaving certificate (PLC); (5) Third level non-honours degree (TLND); (6) Third level honours degree or above (TLHD); (7) Other | |
| Categorical | Label | Risk of defaulting | (0) Bad (Default); (1) Good (Paid) | Label |
| Binary | New Home | Newly built dwelling | (0) Old Home; (1) New Home | NewHome |
| Binary | First Time Buyer | Never purchased property before | (0) New Owner; (1) Previous Owner | FTB |
| Continuous | Age | Age of the borrower | 18 - 55 | Age |
| Continuous | Income | Total income of the borrower | 0 - inf | Income |
| Continuous | Expenses-to-Income | Ratio of borrower-expenditure-to-income | 0 - 1 | Exp:Inc |
| Continuous | Loan Value | Amount advanced to the borrower | 0 - inf | LoanValue |
| Continuous | Loan-to-Value | Loan to value ratio | 1 - inf | LTV |
| Continuous | Loan Term | Length of the loan in years | 20 - 40 | LoanTerm |
| Continuous | Loan Rate | Interest rate paid on the loan | 0 - 1 | Interest |
| Continuous | House Value | Market value of the property | 1 - inf | HouseVal |
| Continuous | MRTI | Ratio of mortgage-repayments-to-income | 0 - 1 | MRTI |
| Continuous | Score | Estimated credit risk score | 0 - inf | Score |

Table 1: Residential Mortgage Application Credit Scoring dataset description

4 Conclusion

The objective of this short paper is to introduce a set of synthetic datasets to the community working on bias in ML. The data relates to mortgage applications and the model used to synthesise the data is based on a significant body of research on factors influencing mortgage approvals Kennedy et al. (2011). Altogether the datasets contain 37,607 examples described by 17 features. We focus on Age as the sensitive feature and code is provided to convert this to a category. The threshold for this conversion can be adjusted to control the level of imbalance in the dataset. The level of imbalance in the outcome (Good/Bad) can also be controlled. We also provide exemplar datasets with 25% class imbalance and 30% feature imbalance.


References

  • [1] W. Blanzeisky and P. Cunningham (2021a) Algorithmic factors influencing bias in machine learning. CoRR abs/2104.14014.
  • [2] W. Blanzeisky and P. Cunningham (2021b) Using Pareto simulated annealing to address algorithmic bias in machine learning. arXiv preprint arXiv:2105.15064.
  • [3] S. Caton and C. Haas (2020) Fairness in machine learning: a survey. arXiv preprint arXiv:2010.04053.
  • [4] P. Cunningham and S. J. Delany (2021) Underestimation bias and underfitting in machine learning. Lecture Notes in Computer Science, pp. 20–31. ISBN 9783030739591, ISSN 1611-3349.
  • [5] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
  • [6] K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach (2019) Improving fairness in machine learning systems: what do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–16.
  • [7] K. Kennedy, S. J. Delany, and B. Mac Namee (2011) A framework for generating data to simulate application scoring. In Credit Scoring and Credit Control XII, Conference Proceedings.