Algorithmic Fairness Datasets: the Story so Far

Data-driven algorithms are being studied and deployed in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of algorithmic fairness researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of the risks and opportunities of automated decision-making for different populations. Algorithmic fairness progress hinges on data, which can be used appropriately only if adequately documented. Unfortunately, the algorithmic fairness community, as a whole, suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we survey over two hundred datasets employed in algorithmic fairness research, producing standardized and searchable documentation for each of them, along with in-depth documentation for the three most popular fairness datasets, namely Adult, COMPAS and German Credit. These documentation efforts support multiple contributions. Firstly, we summarize the merits and limitations of popular algorithmic fairness datasets, questioning their suitability as general-purpose fairness benchmarks. Secondly, we document hundreds of available alternatives, annotating their domain and supported fairness tasks, to assist dataset users in task-oriented and domain-oriented search. Finally, we analyze these resources from the perspective of five important data curation topics: anonymization, consent, inclusivity, labeling of sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel datasets.

1 Introduction

Following the widespread study and application of data-driven algorithms in contexts that are central to people’s well-being, a community of researchers has coalesced around the growing field of algorithmic fairness, investigating algorithms through the lens of justice, equity, bias, power and harms. A line of work gaining traction in the field, intersecting with critical data studies, human-computer interaction, and computer-supported cooperative work, studies data documentation and proposes standardized processes to describe key characteristics of datasets (gebru2018datasheets; holland2018dataset; bender2018data; geiger2020garbage; jo2020lessons; miceli2021documenting). Most prominently, gebru2018datasheets and holland2018dataset proposed two complementary documentation frameworks, called Datasheets for Datasets and Dataset Nutrition Labels, to improve data curation practices and favour more informed data selection and utilization by dataset users. Overall, this line of work has contributed to unprecedented attention to dataset documentation in Machine Learning (ML), including a novel track focused on datasets at the Conference on Neural Information Processing Systems (NeurIPS), an initiative to support dataset tracking in repositories for scholarly articles (https://medium.com/paperswithcode/datasets-on-arxiv-1a5a8f7bd104), and dedicated works producing retrospective documentation for existing datasets (bandy2021addressing; garbin2021structured), auditing their properties (prabhu2020large) and tracing their usage (peng2021mitigating).

In recent work, bender2021:dangers propose the notion of documentation debt in relation to training sets that are undocumented and too large to document retrospectively. We extend this definition to the collection of datasets employed in a given field of research. We see two components at work contributing to the documentation debt of a research community. On one hand, opacity is the result of poor documentation affecting single datasets, contributing to misunderstandings and misuse of specific resources. On the other hand, when relevant information exists but does not reach interested parties, there is a problem of documentation sparsity. One example that is particularly relevant for the algorithmic fairness community is the German Credit dataset (hofmann1994:sg), a popular resource in this field. While several recent works of algorithmic fairness experiment on this dataset using sex as a protected attribute (he2020geometric; yang2020fairness; baharlouei2020renyi; lohaus2020too; martinez2020minimax; wang2021fair), existing documentation shows that this feature cannot be reliably retrieved (gromping2019:sg). Moreover, the mere fact that a dataset exists and is relevant to a given task or a given domain may be unknown. The BUPT Faces datasets, for instance, were presented as the second existing resource for face analysis with race annotations (wang2020mitigating). However, several resources were already available at the time, including Labeled Faces in the Wild (han2014age), UTK Face (zhifei2017age), Racial Faces in the Wild (wang2019racial), and Diversity in Faces (merler2019diversity). Hereafter, for brevity, we only report dataset names; the relevant references and additional information can be found in Appendix A.

To reduce the documentation debt of the algorithmic fairness community, we survey the datasets used in more than 500 articles on fairness and ML, presented at seven major conferences, considering each edition in the period 2014–2021, and at more than twenty domain-specific workshops in the same period. We find over 200 datasets employed in studies of algorithmic fairness, for which we produce compact and standardized documentation, called data briefs. Data briefs are intended as a lightweight format to document fundamental properties of a data artifact, including its purpose, its features (with particular attention to sensitive ones), the underlying labeling procedure, and the envisioned ML task, if any. To favour domain-based and task-based search by dataset users, data briefs also indicate the domain of the processes that produced the data (e.g. radiology) and list the fairness tasks studied on a given dataset (e.g. fair ranking). For this endeavour, we have contacted creators and knowledgeable practitioners identified as primary points of contact for the datasets. We received feedback on preliminary versions of the data briefs from 72 curators and practitioners, whose efforts are acknowledged at the end of this article. Moreover, we identify and carefully analyze the three datasets utilized most often in the surveyed articles (Adult, COMPAS, and German Credit), retrospectively producing a datasheet and a nutrition label for each of them. From these documentation efforts, we extract a summary of the merits and limitations of popular algorithmic fairness benchmarks, a taxonomy of domains and fairness tasks for existing datasets, and a set of best practices for curating novel resources concerning anonymization, consent, inclusivity, labeling practices, and transparency.

Overall, we make the following contributions.

  • Analysis of popular fairness benchmarks. We produce detailed documentation for Adult, COMPAS, and German Credit, from which we extract a summary of their merits and limitations. We call into question their suitability as general-purpose fairness benchmarks due to contrived prediction tasks, noisy data, severe coding mistakes, and age.

  • Survey of existing alternatives. We document over two hundred resources used in fair ML research, annotating their domain, the tasks they support, and the roles they play in works of algorithmic fairness. By assembling sparse information on hundreds of datasets into a single document, we aim to support domain-oriented and task-oriented search by dataset users. Contextually, we provide and describe a novel taxonomy of tasks in algorithmic fairness.

  • Best practices for the curation of novel resources. We analyze different approaches to anonymization, consent, inclusivity, labeling, and transparency across these datasets. By comparing existing approaches and evaluating their advantages, we make the underlying concerns visible and practical, and extract best practices to inform the curation of new datasets and post-hoc remedies to existing ones.

The rest of this work is organized as follows. Section 2 presents the methodology of this survey. Section 3 analyzes the most popular datasets, namely Adult (§ 3.1), COMPAS (§ 3.2) and German Credit (§ 3.3), and provides an overall summary of their merits and limitations as fairness benchmarks (§ 3.4). Section 4 discusses alternative fairness resources from the perspective of the underlying domains (§ 4.1), the fair ML tasks they support (§ 4.2) and the roles they play (§ 4.3). Section 5 presents important topics in data curation, discussing existing approaches and best practices to avoid re-identification (§ 5.1), elicit informed consent (§ 5.2), consider inclusivity (§ 5.3), collect sensitive attributes (§ 5.4) and document datasets (§ 5.5). Finally, Section 6 contains concluding remarks and recommendations. Interested readers may find the data briefs in Appendix A, followed by the detailed documentation produced for Adult (Appendix B), COMPAS (Appendix C) and German Credit (Appendix D).

2 Methodology

In this survey, we consider (1) every article published in the proceedings of domain-specific conferences such as the ACM Conference on Fairness, Accountability, and Transparency (FAccT) and the AAAI/ACM Conference on Artificial Intelligence, Ethics and Society (AIES); (2) every article published in the proceedings of well-known machine learning and data mining conferences, including the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), the Conference on Neural Information Processing Systems (NeurIPS), the International Conference on Machine Learning (ICML), the International Conference on Learning Representations (ICLR), and the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD); (3) every article available from the Past Network Events and Older Workshops and Events of the FAccT network (https://facctconference.org/network/). We consider the period from 2014, the year of the first Workshop on Fairness, Accountability, and Transparency in Machine Learning, to the first week of May 2021, thus including works presented at FAccT and ICLR in 2021.

To target works of algorithmic fairness, we select a subsample of these articles whose titles contain any of the following strings, where the star symbol represents the wildcard character: *fair* (targeting e.g. fairness, unfair), *bias* (biased, debiasing), discriminat* (discrimination, discriminatory), *equal* (equality, unequal), *equit* (equity, equitable), disparate (disparate impact), *parit* (parity, disparities). These selection criteria are centered around equity-based notions of fairness, typically operationalized by equalizing some algorithmic property across individuals or groups of individuals. Through manual inspection by two authors, we discard articles where these keywords are used with a different meaning. Discarded works include articles that handle selection bias (kato2018learning), enhance desirable discriminating properties of models (chen2018virtual), or generally focus on model performance (li2018learning; zhong2019unequaltraining). This leaves us with 511 articles.
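
For illustration, the following is a minimal sketch of how such a wildcard-based title filter could be implemented; the regular expression is our own approximation of the patterns listed above, not the exact procedure used for this survey (which also involved manual inspection).

```python
import re

# Approximation of the title keywords listed above:
# *fair*, *bias*, discriminat*, *equal*, *equit*, disparate, *parit*.
# Prefix-only patterns are anchored at a word boundary; the rest are
# plain substring matches, so manual screening of false positives
# (e.g. "selection bias") remains necessary.
FAIRNESS_KEYWORDS = re.compile(
    r"fair|bias|\bdiscriminat|equal|equit|\bdisparate\b|parit",
    flags=re.IGNORECASE,
)

def matches_fairness_keywords(title: str) -> bool:
    """Return True if a paper title matches any of the survey keywords."""
    return FAIRNESS_KEYWORDS.search(title) is not None

# Example usage on a few made-up titles.
for title in [
    "Learning Fair Representations",
    "Debiasing Word Embeddings",
    "A Study of Disparate Impact in Lending",
    "Improving Image Classification Accuracy",  # no match
]:
    print(title, "->", matches_fairness_keywords(title))
```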

From the articles that pass this initial screening, we select datasets treated as important data artifacts, either being used to train/test an algorithm or undergoing a data audit, i.e., an in-depth analysis of different properties. We produce a data brief for these datasets by (1) reading the information provided in the surveyed articles, (2) consulting the provided references, and (3) reviewing scholarly articles or official websites found by querying Google with the dataset name. We discard the following:

  • Word Embeddings (WEs): we only consider the corpora they are trained on, and only if the WEs are trained as part of a given work rather than taken off the shelf;

  • toy datasets, i.e., simulations with no connection to real-world processes, unless used in more than one article, which we take as a sign of importance in the field;

  • resources that are only used as a minor source of auxiliary information, such as the percentage of US residents in each state;

  • datasets for which the available information is insufficient. This happens very seldom, i.e., when steps (1), (2), and (3) outlined above yield little to no information about the curators, purposes, features, and format of a dataset; for popular datasets, this is never the case.

For each of the 210 datasets satisfying the above criteria, we produce a data brief, available in Appendix A with a description of the underlying coding procedure.

Data briefs also keep track of dataset popularity by listing the fairness articles which employ a given resource. We identify Adult, COMPAS, and German Credit as the most utilized datasets in the surveyed algorithmic fairness literature. Following gebru2018datasheets and holland2018dataset, we produce in-depth documentation for these three datasets by carefully consulting relevant references previously known to the authors or found by querying search engines for academic publications with the dataset name. Appendices B-D contain datasheets and nutrition labels for these three datasets, preceded by a list of relevant publications that informed their compilation. The next section presents a summary of the merits and limitations of these datasets.

3 Most Popular Datasets

Figure 1: Utilization of datasets in fairness research follows a power law.

Figure 1 depicts the number of articles using each dataset, showing that dataset utilization in the surveyed scholarly works follows a power law distribution. Over 100 datasets are used only once, also because some of these resources are not publicly available. Complementing this long tail is a short head of nine datasets used in ten or more articles. These datasets are Adult (108 usages), COMPAS (77), German Credit (33), Communities and Crime (24), Law School (17), Bank Marketing (15), MovieLens (14), CelebA (13) and Credit Card Default (10). The tenth most used resource is the toy dataset from zafar2017fairness, used in 7 articles. In this section, we summarize positive and negative aspects of the three most popular datasets, namely Adult, COMPAS, and German Credit, informed by the extensive documentation in Appendices B, C and D.

3.1 Adult

The Adult dataset was created as a resource to benchmark the performance of machine learning algorithms on socially relevant data. Each instance is a person who responded to the March 1994 US Current Population Survey, represented along demographic and socio-economic dimensions, with features describing their profession, education, age, sex, race, personal and financial condition. The dataset was extracted from the census database, preprocessed, and donated to UCI Machine Learning Repository in 1996 by Ronny Kohavi and Barry Becker. A binary variable encoding whether respondents’ income is above $50,000 was chosen as the target of the prediction task associated with this resource.
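
As a concrete illustration of how this resource is commonly handled, the sketch below loads the UCI release of Adult and computes the base rate of the binary target by sex; the column names follow the accompanying adult.names documentation, while the local file path is an assumption.

```python
import pandas as pd

# Column names as documented in the UCI "adult.names" file.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Assumes a local copy of the UCI file "adult.data" (no header row).
adult = pd.read_csv("adult.data", names=COLUMNS, skipinitialspace=True)

# Binary target: whether income exceeds $50,000.
adult["income_gt_50k"] = (adult["income"] == ">50K").astype(int)

# Base rate of the positive class by sex, illustrating the different
# base rates across groups that make Adult a common fairness benchmark.
print(adult.groupby("sex")["income_gt_50k"].mean())
```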

Adult inherits some positive aspects from the best practices employed by the US Census Bureau. Although later filtered somewhat arbitrarily, the original sample was designed to be representative of the US population. Trained and compensated interviewers collected the data. Attributes in the dataset are self-reported and provided by consenting respondents. Finally, the original data from the US Census Bureau is well documented, and its variables can be mapped to Adult by consulting the original documentation (usdeptcomm1995current), except for a variable denominated fnlwgt, whose precise meaning is unclear.

A negative aspect of this dataset is the contrived prediction task associated with it. Income prediction from socio-economic factors is a task whose social utility appears rather limited. Moreover, the $50,000 threshold for binary prediction is high, and model properties such as accuracy are very sensitive to it (hardt2021facing). Finally, the dataset is rather old; more current resources are available to inform socio-economic studies of the US population.

3.2 COMPAS

This dataset was created for an external audit of racial biases in the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) risk assessment tool developed by Northpointe (now Equivant), which estimates the likelihood of a defendant becoming a recidivist. Instances represent defendants scored by COMPAS in Broward County, Florida, between 2013 and 2014, reporting their demographics, criminal record, custody and COMPAS scores. Defendants’ public criminal records were obtained from the Broward County Clerk’s Office website, matching them based on date of birth and first and last names. The dataset was augmented with jail records and COMPAS scores provided by the Broward County Sheriff’s Office. Finally, public incarceration records were downloaded from the Florida Department of Corrections website. Instances are associated with two target variables (is_recid and is_violent_recid), indicating whether defendants were booked in jail for a criminal offense (potentially violent) that occurred after their COMPAS screening but within two years.

On the upside, this dataset is recent and captures some relevant aspects of the COMPAS risk assessment tool and the criminal justice system in Broward County. On the downside, it was compiled from disparate sources, hence clerical errors and mismatches are present (larson2016how). Moreover, in its official release (propublica2016compas), the COMPAS dataset features redundant variables and data leakage due to spuriously time-dependent recidivism rates (barenstein2019propublica). For these reasons, researchers must perform further preprocessing in addition to the standard one by ProPublica. More subjective choices are required of researchers interested in counterfactual evaluation of risk-assessment tools, due to the absence of a clear indication of whether defendants were detained or released pre-trial (mishler2021fairness). The lack of a standard preprocessing protocol beyond the one by ProPublica (propublica2016compas), which is insufficient to handle these factors, may cause issues of reproducibility and difficulty in comparing methods. Moreover, according to Northpointe’s response to ProPublica’s study, several risk factors considered by the COMPAS algorithm are absent from the dataset (dieterich2016compas). Finally, defendants’ personal information (e.g. race and criminal history) is available in conjunction with obvious identifiers, making re-identification of defendants trivial (§ 5.1).
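
To make the preprocessing issue concrete, the sketch below applies the filtering steps popularized by ProPublica's analysis to their compas-scores-two-years.csv release; column names and thresholds follow ProPublica's published notebook and should be double-checked against the release in use, and, as discussed above, these steps alone do not resolve the redundancy and leakage issues.

```python
import pandas as pd

# Assumes a local copy of ProPublica's "compas-scores-two-years.csv".
df = pd.read_csv("compas-scores-two-years.csv")

# ProPublica-style filtering: keep cases where the COMPAS screening
# happened within 30 days of arrest, the recidivism flag is defined,
# the charge is not an ordinary traffic offense, and a score is available.
df = df[
    (df["days_b_screening_arrest"] <= 30)
    & (df["days_b_screening_arrest"] >= -30)
    & (df["is_recid"] != -1)
    & (df["c_charge_degree"] != "O")
    & (df["score_text"] != "N/A")
]

print(len(df), "defendants after ProPublica-style filtering")
```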

Overall, these considerations paint a mixed picture for a dataset of high social relevance that was extremely useful to catalyze attention on algorithmic fairness issues, while displaying several limitations in terms of its continued use as a flexible benchmark for fairness issues of all sorts. In this regard, bao2021COMPASlicated suggest avoiding the use of COMPAS to demonstrate novel approaches in algorithmic fairness, as considering the data without proper context may lead to misleading conclusions, which could misguidedly enter the broader debate on criminal justice and risk assessment.

3.3 German Credit

The German Credit dataset was created to study the problem of automated credit decisions at a regional bank in southern Germany. Instances represent loan applicants from 1973 to 1975, who were deemed creditworthy and were granted a loan, bringing about a natural selection bias. The data summarizes their financial situation, credit history, and personal situation, including housing and number of liable people. A binary variable encoding whether each loan recipient punctually paid every installment is the target of a classification task. Among the covariates, marital status and sex are jointly encoded in a single variable. Many documentation mistakes are present in the UCI entry associated with this resource (hofmann1994:sg). Due to one of these mistakes, users of this dataset are led to believe that the variable sex can be retrieved from the joint marital_status-sex variable; however, this is not the case. A revised version with correct variable encodings, called South German Credit, was donated to the UCI Machine Learning Repository (gromping2019:sg2) with an accompanying report (gromping2019:sg).

The greatest upside of this dataset is the fact that it captures a real-world application of credit scoring at a bank. On the downside, the data is nearly fifty years old, significantly limiting the societally useful insights that can be gleaned from it. Most importantly, the popular release of this dataset (hofmann1994:sg) comes with highly inaccurate documentation which contains wrong variable codings. For example, the variable reporting whether loan recipients are foreign workers has its coding reversed, so that, apparently, fewer than 5% of the loan recipients in the dataset would be German. Luckily, this error has no impact on numerical results obtained from this dataset, as it is irrelevant at the level of abstraction afforded by raw features, with the exception of potentially counterintuitive explanations in works of interpretability. This coding error, along with others discussed in gromping2019:sg, was corrected in a novel release of the dataset (gromping2019:sg2). Unfortunately, and most importantly for the fair ML community, retrieving the sex of loan applicants is simply not possible, contrary to what the original documentation suggested. This is due to the fact that one value of this feature was used to indicate both women who are divorced, separated or married and men who are single, while the original documentation reported that each feature value corresponded to applicants of a single sex (either male-only or female-only). This particular coding error ended up having a non-negligible impact on the fair ML community, where many works studying group fairness extracted sex from the joint variable and used it as a sensitive attribute, even years after the corrected documentation was published (wang2021fair). These coding mistakes are part of a documentation debt whose influence continues to affect the algorithmic fairness community.
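
For illustration, the flawed extraction criticized above typically resembles the sketch below, which maps the documented codes of the joint attribute in the UCI release (A91–A95) to a sex variable; per gromping2019:sg, one of these codes actually mixes non-single women with single men, so the derived attribute is unreliable and is shown here only as a cautionary example.

```python
import pandas as pd

# Assumes a local copy of the UCI file "german.data" (space-separated,
# 20 attributes plus the credit risk label in the last column).
german = pd.read_csv("german.data", sep=" ", header=None)
personal_status = german[8]  # attribute 9: joint marital_status-sex codes

# Commonly used (but flawed) mapping from the documented codes to sex.
# One of these codes in fact mixes non-single women with single men,
# so sex CANNOT be reliably recovered: the derived attribute is noisy
# at best and should not be used as a sensitive attribute.
documented_mapping = {
    "A91": "male", "A92": "female", "A93": "male",
    "A94": "male", "A95": "female",
}
german["sex_unreliable"] = personal_status.map(documented_mapping)
print(german["sex_unreliable"].value_counts())
```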

3.4 Summary

Adult, COMPAS and German Credit are the most used datasets in the surveyed algorithmic fairness literature. Their status as de facto fairness benchmarks is probably due to their use in seminal works (pedreshi2008discriminationaware; calders2009building) and influential articles (angwin2016machine) on algorithmic fairness. Once this prominence was established, researchers had clear incentives to study novel problems and approaches on these datasets, which have become even more established benchmarks in the algorithmic fairness literature as a result (bao2021COMPASlicated). On close scrutiny, the fundamental merits of these datasets are that they originate from human processes, encode protected attributes, and exhibit different base rates for the target variable across sensitive groups. Their use in recent works on algorithmic fairness can be interpreted as a signal that the authors have basic awareness of default data practices in the field and that the data was not made up to fit the algorithm. Overarching claims of significance in real-world scenarios stemming from experiments on these datasets should be met with skepticism. Experiments that rely on extracting a sex variable from the German Credit dataset should be considered noisy at best. As for alternatives, bao2021COMPASlicated suggest employing well-designed simulations. A complementary avenue is to seek different datasets that are relevant for the problem at hand. We hope that the two hundred data briefs accompanying this work will prove useful in this regard, favouring both domain-oriented and task-oriented searches, according to the taxonomy discussed in the next section.

4 Existing Alternatives

In this section, we discuss existing fairness resources from three different perspectives. In Section 4.1 we describe the different domains spanned by fairness datasets. In Section 4.2 we provide a taxonomy of fairness tasks supported by the same resources. In Section 4.3 we discuss the different roles played by these datasets in fairness research, such as supporting training and benchmarking.

4.1 Domain

Figure 2: Datasets employed in fairness research span diverse domains.

Algorithmic fairness concerns arise in any domain where Automated Decision Making (ADM) systems may influence human well-being. Unsurprisingly, the datasets in our survey reflect a variety of areas where ADM systems are studied or deployed, including criminal justice, education, search engines, online marketplaces, emergency response, social media, medicine and literature. In Figure 2, we report a subdivision of the surveyed datasets into different macrodomains; the total exceeds 210 because multiple domains are applicable to some datasets. We mostly follow the area-category taxonomy by Scimago (see the “subject area” and “subject category” drop-down menus at https://www.scimagojr.com/journalrank.php), departing from it where appropriate. For example, we consider computer vision and linguistics macrodomains of their own for the purposes of algorithmic fairness, as much fair ML work has been published in both disciplines. Below we present a summary of each macrodomain and its main subdomains.

Computer Science. Datasets from this macrodomain are very well represented, comprising information systems, social media, library and information sciences, computer networks, and signal processing. Information systems heavily feature datasets on search engines for various items such as text, images, worker profiles, and real estate, retrieved in response to queries issued by users (Occupations in Google Images, Scientist+Painter, Zillow Searches, Burst, Online Freelance Marketplaces, Bing US Queries, Symptoms in Queries). Other datasets represent problems of item recommendation, covering products, businesses, and movies (Amazon Recommendations, Amazon Reviews, Google Local, MovieLens, FilmTrust). The remaining datasets in this subdomain represent knowledge bases (Freebase15k-237, Wikidata) and automated screening systems (CVs from Singapore, Pymetrics Bias Group). Datasets from social media that are not focused on links and relationships between people are also considered part of computer science in this survey. These resources are often focused on text, powering tools and analyses of hate speech and toxicity (Civil Comments, Twitter Abusive Behavior, Twitter Offensive Language, Twitter Hate Speech Detection, Twitter Online Harassment), dialect (TwitterAAE), and political leaning (Twitter Presidential Politics). Twitter is by far the most represented platform, while datasets from Facebook (German Political Posts), Steemit (Steemit), Instagram (Instagram Photos), Reddit (RtGender), Fitocracy (RtGender), and YouTube (YouTube Dialect Accuracy) are also present. Datasets from library and information sciences are mainly focused on academic collaboration networks (Cora Papers, CiteSeer Papers, PubMed Diabetes Papers, ArnetMiner Citation Network, 4area, Academic Collaboration Networks), except for a dataset about peer review of scholarly manuscripts (Paper-Reviewer Matching).

Social Sciences. Datasets from social sciences are also plentiful, spanning law, education, social networks, demography, social work, political science, transportation, sociology and urban studies. Law datasets are mostly focused on recidivism (Crowd Judgement, COMPAS, Recidivism of Felons on Probation, State Court Processing Statistics, Los Angeles City Attorney’s Office Records) and crime prediction (Strategic Subject List, Philadelphia Crime Incidents, Stop, Question and Frisk, Real-Time Crime Forecasting Challenge, Dallas Police Incidents, Communities and Crime), with a granularity spanning the range from individuals to communities. In the area of education we find datasets that encode application processes (Nursery, IIT-JEE), student performance (Student, Law School, UniGe, ILEA, Student Performance, EdGap) and attempts at automated grading (Automated Student Assessment Prize). Some datasets on student performance support studies of differences across schools and educational systems, for which they report useful features (Law School, ILEA, EdGap), while the remaining datasets are more focused on differences in the individual condition of students, typically within the same institution. Datasets about social networks mostly concern online social networks (Facebook Ego-networks, Facebook Large Network, Pokec Social Network, Rice Facebook Network, Twitch Social Networks, University Facebook Networks), except for High School Contact and Friendship Network, also featuring offline relations. Demography datasets comprise census data from different countries (Dutch Census, Indian Census, National Longitudinal Survey of Youth, Section 203 determinations, US Census Data (1990)). Datasets from social work cover complex personal and social problems, including child maltreatment prevention (Allegheny Child Welfare), emergency response (Harvey Rescue) and drug abuse prevention (Homeless Youths’ Social Networks, DrugNet). Resources from political science describe registered voters (North Carolina Voters), electoral precincts (MGGG States), polling (2016 US Presidential Poll), and sortition (Climate Assembly UK). Transportation data summarizes rides and requests for taxis and ride-hailing services (NYC Taxi Trips, Shanghai Taxi Trajectories, Ride-hailing App). Sociology resources summarize online (Libimseti) and offline dating (Columbia University Speed Dating). Finally, we assign SafeGraph Research Release to urban studies.

Computer Vision. This is an area of early success for artificial intelligence, where fairness typically concerns learned representations and equality of performance across classes. The surveyed articles feature several popular datasets on image classification (ImageNet, MNIST, Fashion MNIST, CIFAR), visual question answering (Visual Question Answering), segmentation, recognition, and captioning (MS-COCO). We find over ten face analysis datasets (Labeled Faces in the Wild, UTK Face, Adience, FairFace, IJB-A, CelebA, Pilot Parliaments Benchmark, MS-Celeb-1M, Diversity in Faces, Multi-task Facial Landmark, Racial Faces in the Wild, BUPT Faces), for which fairness is most often intended as the robustness of classifiers across different subpopulations, without much regard for downstream benefits or harms to these populations. Synthetic images are popular to study the relationship between fairness and disentangled representations (dSprites, Cars3D, shapes3D). Similar studies can be conducted on datasets with spurious correlations between subjects and backgrounds (Waterbirds, Benchmarking Attribution Methods) or gender and occupation (Athletes and health professionals). Finally, the Image Embedding Association Test dataset is a fairness benchmark to study biases in image embeddings across religion, gender, age, race, sexual orientation, disability, skin tone, and weight. It is worth noting that this significant proportion of computer vision datasets is not an artifact of including CVPR in the list of candidate conferences, which contributed just five additional datasets (Multi-task Facial Landmark, Office31, Racial Faces in the Wild, BUPT Faces, Visual Question Answering).

Health. This macrodomain, comprising medicine, psychology and pharmacology, displays a notable diversity of subdomains affected by fairness concerns. Specialties represented in the surveyed datasets are mostly medical, including public health (Antelope Valley Networks, Willingness-to-Pay for Vaccine, Kidney Matching, Kidney Exchange Program), cardiology (Heart Disease, Arrhythmia, Framingham), endocrinology (Diabetes 130-US Hospitals, Pima Indians Diabetes Dataset), and health policy (Heritage Health, MEPS-HC). Specialties such as radiology (National Lung Screening Trial, MIMIC-CXR-JPG, CheXpert) and dermatology (SIIM-ISIC Melanoma Classification, HAM10000) feature several image datasets for their strong connections with medical imaging. Other specialties include critical care medicine (MIMIC-III), neurology (Epileptic Seizures), pediatrics (Infant Health and Development Program), sleep medicine (Apnea), nephrology (Renal Failure), pharmacology (Warfarin) and applied psychology (Drug Consumption). These datasets are often extracted from care data of multiple medical centers to study problems of automated diagnosis. Resources derived from longitudinal studies, including Framingham and Infant Health and Development Program, are also present. Works of algorithmic fairness in this domain are typically concerned with obtaining models with similar performance for patients across race and sex.

Economics and Business. We consider this a further macrodomain, comprising datasets from economics, finance and marketing. Economics datasets mostly consist of census data focused on wealth (Adult, US Family Income, Poverty in Colombia, Costarica Household Survey) and other resources which summarize employment (ANPE), tariffs (U.S. Harmonized Tariff Schedules) and division of goods (Spliddit Divide Goods). Finance resources feature data on microcredit and peer-to-peer lending (Mobile Money Loans, Kiva, Prosper Loans Network), mortgages (HMDA), loans (German Credit, Credit Elasticities), credit scoring (FICO) and default prediction (Credit Card Default). Finally, marketing datasets describe marketing campaigns (Bank Marketing), customer data (Wholesale) and advertising bids (Yahoo! A1 Search Marketing).

Linguistics. In addition to the textual resources we already described, such as the ones derived from social media, several datasets employed in the algorithmic fairness literature can be assigned to the domain of linguistics and Natural Language Processing (NLP). There are many examples of resources curated to be fairness benchmarks for different tasks, including machine translation (Bias in Translation Templates), sentiment analysis (Equity Evaluation Corpus), coreference resolution (Winogender, Winobias, GAP Coreference), named entity recognition (In-Situ), language models (BOLD) and word embeddings (WEAT). Other datasets have been considered for their size and importance for pretraining text representations (Wikipedia dumps, One billion word benchmark, BookCorpus, WebText) or their utility as NLP benchmarks (GLUE, Business Entity Resolution).

Miscellaneous. This macrodomain contains several datasets originating from the news domain (Yow news, Guardian Articles, Latin Newspapers, Adressa, Reuters 50 50, New York Times Annotated Corpus, TREC Robust04). Other resources include datasets on sushi preferences (Sushi), video games (FIFA 20 Players), the internet (Greek Websites), and toy datasets (Toy Dataset 1–4).

Arts and Humanities. In this area we mostly find literature datasets, containing text from literary works (Shakespeare, Curatr British Library Digital Corpus, Victorian Era Authorship Attribution, Nominees Corpus, Riddle of Literary Quality), which are typically studied with NLP tools. Other datasets assigned to this sphere include domain-specific information systems about books (Goodreads Reviews), movies (MovieLens) and music (Last.fm, Million Song Dataset, Million Playlist Dataset). Additionally, Olympic Athletes is a historical sports-related dataset, going back to the first edition of the Olympics in 1896.

Natural Sciences. This domain is represented by three datasets, from biology (iNaturalist), biochemistry (PP-Pathways) and plant science, with the classic Iris dataset.

As a whole, many of these datasets encode fundamental human activities where algorithms and ADM systems have been studied and deployed. Alertness and attention to equity seem especially important in some domains, including social sciences, computer science, medicine, and economics. Here the potential for impact may result in large benefits, but also great harm, particularly for vulnerable populations and minorities, who are more likely to be neglected during the design, training, and testing of an ADM system. After concentrating on domains, in the next section we analyze the variety of tasks studied in works of algorithmic fairness and supported by these datasets.

4.2 Task and setting

Researchers and practitioners are showing an increasing interest in algorithmic fairness, proposing solutions for many different tasks, including fair classification, regression and ranking. At the same time, the academic community is developing an improved understanding of important challenges that run across different tasks in the algorithmic fairness space (chouldechova2020snapshot), also thanks to practitioner surveys (holstein2019:improving) and studies of specific legal challenges (andrus2021what). To exemplify, the presence of noise corrupting labels for sensitive attributes represents a challenge that may apply across different tasks, including fair classification, regression and ranking. We refer to these challenges as settings, describing them in the second part of this section. While our survey focuses on fair ML datasets, it is cognizant of the wide variety of tasks tackled in the algorithmic fairness literature, which are captured in a specific field of our data briefs. In this section we provide an overview of common tasks and settings studied on these datasets, showing their variety and diversity.

4.2.1 Task

Fair classification (dwork2012fairness) is the most common task by far. Typically, it involves equalizing some measure of interest across subpopulations, such as the recall, precision, or accuracy for different racial groups. On the other hand, individually fair classification focuses on the idea that similar individuals (low distance in the covariate space) should be treated similarly (low distance in the outcome space), often formalized as a Lipschitz condition. Unsurprisingly, the most common datasets for fair classification are the most popular ones overall (§ 3), i.e., Adult, COMPAS, and German Credit.
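
As a minimal, dataset-agnostic illustration of equalizing a measure of interest across subpopulations, the sketch below computes the gap in recall between two groups; labels, predictions, and group memberships are toy placeholders.

```python
import numpy as np

def recall_gap(y_true, y_pred, group):
    """Absolute difference in recall (true positive rate) between two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    recalls = []
    for g in np.unique(group):
        positives = (group == g) & (y_true == 1)
        recalls.append(y_pred[positives].mean())
    return abs(recalls[0] - recalls[1])

# Toy example with made-up labels, predictions and a binary group attribute.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print("Recall gap:", recall_gap(y_true, y_pred, group))
```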

Fair regression (berk2017convex) concentrates on models that predict a real-valued target, requiring the average loss to be balanced across groups. Individual fairness in this context may require losses to be as uniform as possible across all individuals. Fair regression is a less popular task, often studied on the Communities and Crime dataset.

Fair ranking (yang2017measuring) requires ordering candidate items based on their relevance to a current need. Fairness in this context may concern both the people producing the items that are being ranked (e.g. artists) and those consuming the items (users of a music streaming platform). It is typically studied in applications of recommendation (MovieLens, Amazon Recommendations, Last.fm, Million Song Dataset, Adressa) and search engines (Yahoo! c14B Learning to Rank, Microsoft Learning to Rank, TREC Robust04).

Fair matching (kobren2019paper) is similar to ranking, as they are both tasks defined on two-sided markets. This task, however, is focused on highlighting and matching pairs of items on both sides of the market, without emphasis on the ranking component. Datasets for this task are from diverse domains, including dating (Libimseti, Columbia University Speed Dating), transportation (NYC Taxi Trips, Ride-hailing App) and organ donation (Kidney Matching, Kidney Exchange Program).

Fair risk assessment (coston2020counterfactual) studies algorithms that score instances in a dataset according to a predefined type of risk. Relevant domains include healthcare and criminal justice. Key differences with respect to classification are an emphasis on real-valued scores rather than labels, and awareness that the risk assessment process can lead to interventions impacting the target variable. For this reason, fairness concerns are often defined in a counterfactual fashion. The most popular dataset for this task is COMPAS, followed by datasets from medicine (IHDP, Stanford Medicine Research Data Repository), social work (Allegheny Child Welfare), Economics (ANPE) and Education (EdGap).

Fair representation learning (creager2019flexibly) concerns the study of features learnt by models as intermediate representations for inference tasks. A popular line of work in this space, called disentanglement, aims to learn representations where a single factor of import corresponds to a single feature. Ideally, this approach should select representations where sensitive attributes cannot be used as proxies for target variables. Cars3D and dSprites are popular datasets for this task, consisting of synthetic images depicting controlled shape types under a controlled set of rotations. Post-processing approaches are also applicable to obtain fair representations from biased ones via debiasing.

Fair clustering (chierichetti2017fair) is an unsupervised task concerned with the division of a sample into homogeneous groups. Fairness may be intended as an equitable representation of protected subpopulations in each cluster, or in terms of average distance from the cluster center. While Adult is the most common dataset for problems of fair clustering, other resources often used for this task include Bank Marketing, Diabetes 130-US Hospitals, Credit Card Default and US Census Data (1990).

Fair anomaly detection (zhang2021towards), also called outlier detection (davidson2020framework), is aimed at identifying surprising or anomalous points in a dataset. Fairness requirements involve equalizing salient quantities (e.g. acceptance rate, recall, precision, distribution of anomaly scores) across populations of interest. This problem is particularly relevant for members of minority groups, who, in the absence of specific attention to dataset inclusivity, are less likely to fit the norm in the feature space.

Fair districting (schutzman2020:to) is the division of a territory into electoral districts for political elections. Fairness notions brought forth in this space are either outcome-based, requiring that seats earned by a party roughly match their share of the popular vote, or procedure-based, ignoring outcomes and requiring that counties or municipalities are split as little as possible. MGGG States is a reference resource for this task.

Fair task assignment and truth discovery (goel2019crowdsourcing; li2020towards) are different subproblems in the same area, focused on the subdivision of work and the aggregation of answers in crowdsourcing. Here fairness may be intended in terms of errors in the aggregated answers, requiring errors to be balanced across subpopulations of interest, or in terms of the workload imposed on workers. A dataset suitable for this task is Crowd Judgement.

Fair spatio-temporal process learning (shang2020listwise) focuses on the estimation of models for processes which evolve in time and space. Surveyed applications include crime forecasting (Real-Time Crime Forecasting Challenge, Dallas Police Incidents) and disaster relief (Harvey Rescue), with fairness requirements focused on equalization of performance across different neighbourhoods and special attention to their racial composition.

Fair influence maximization (farnad2020unifying) models and optimizes the propagation of information and influence over networks, and has connections with graph covering problems. Applications include obesity prevention (Antelope Valley Networks) and drug-use prevention (Homeless Youths’ Social Networks). Fairness is typically intended with respect to different labels associated with people (e.g. gender, race), who are represented as nodes within a network, and should be reached by the information with the same probability.

Fair resource allocation/subset selection (babaioff2019fair; huang2020towards) can often be formalized as a classification problem with constraints on the number of positives. Fairness requirements are similar to those of classification. Subset selection may be employed to choose a group of people from a wider set for a given task (US Federal Judges, Climate Assembly UK). Resource allocation concerns the division of goods (Spliddit Divide Goods) and resources (ML Fairness Gym, German Credit).

Fair data summarization (celis2018fair) refers to presenting a summary of datasets that is equitable to subpopulations of interest. It may involve finding a small subset representative of a larger dataset (strongly linked to subset selection) or selecting the most important features (dimensionality reduction). Approaches for this task have been applied to select a subset of images (Scientist+Painter) or customers (Bank Marketing), that represent the underlying population across sensitive demographics.

Fair graph mining (kang2020inform) focuses on representations and prediction tasks on graph structures. Fairness may be defined either as a lack of bias in representations, or with respect to a final inference task defined on the graph. Fair graph mining approaches have been applied to knowledge bases (Freebase15k-237, Wikidata), collaboration networks (CiteSeer Papers, Academic Collaboration Networks) and social network datasets (Facebook Large Network, Twitch Social Networks).

Fair pricing (kallus2021fairness) concerns learning and deploying an optimal pricing policy for revenue while maintaining equity of access to services and consumer welfare across sensitive groups. Datasets employed in fair pricing are from the financial (Credit Elasticities) and public health domains (Willingness-to-Pay for Vaccine).

Fair advertising (celis2019toward) is also concerned with access to goods and services. It comprises both bidding strategies and auction mechanisms which may be modified to reduce discrimination with respect to the gender or race composition of the audience that sees an ad. One publicly available dataset for this subtask is Yahoo! A1 Search Marketing.

Fair routing (qian2015scram) is the task of suggesting an optimal path from a starting location to a destination. For this task, experimentation has been carried out on a semi-synthetic traffic dataset (Shanghai Taxi Trajectories). The proposed fairness measure requires equalizing the driving cost per customer across all drivers.

Fair entity resolution (cotter2019training) is a task focused on deciding whether multiple records refer to the same entity, which is useful, for instance, for the construction and maintenance of knowledge bases. Business Entity Resolution is a proprietary dataset for fair entity resolution, where constraints of performance equality across chain and non-chain businesses can be tested.

Fair sentiment analysis (kiritchenko2018examining) is a specific instance of fair classification, where text snippets are classified as positive, negative, or neutral depending on the sentiment they express. Fairness is often intended with respect to the entities mentioned in the text (e.g. men and women). The central idea is that the estimated sentiment for a sentence should not change if female entities (e.g. “her”, “woman”, “Mary”) are substituted with their male counterparts (“him”, “man”, “James”). The Equity Evaluation Corpus is a benchmark developed to assess gender and race bias in sentiment analysis models.
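
The substitution test described here can be sketched as follows; sentiment_score stands in for whatever sentiment model is under audit (a hypothetical placeholder), and the word-swap map is deliberately tiny.

```python
# Counterfactual substitution test for sentiment analysis, in the spirit of
# the Equity Evaluation Corpus: swap gendered entities and compare scores.
SWAPS = {"her": "him", "she": "he", "woman": "man", "Mary": "James"}

def swap_entities(sentence: str) -> str:
    """Replace female entities with male counterparts (toy word-level swap)."""
    return " ".join(SWAPS.get(token, token) for token in sentence.split())

def sentiment_gap(sentence: str, sentiment_score) -> float:
    """Score difference between a sentence and its gender-swapped version.

    sentiment_score is a hypothetical callable returning a real-valued score;
    a fair model should yield a gap close to zero.
    """
    return sentiment_score(sentence) - sentiment_score(swap_entities(sentence))

def dummy_model(sentence: str) -> float:
    """Stand-in for a real sentiment model, for demonstration only."""
    return 0.8 if "James" in sentence else 0.5

print(sentiment_gap("Mary made me feel happy", dummy_model))
```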

Bias in Word Embeddings (WEs) (bolukbasi2016man) is the study of undesired semantics and stereotypes captured by vectorial representations of words. WEs are typically trained on large text corpora (Wikipedia dumps) and audited for associations between gendered words (or other words connected to sensitive attributes) and stereotypical or harmful concepts, such as the ones encoded in WEAT.
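
As an illustration of such association tests, the sketch below computes a WEAT-style differential association statistic from word vectors; the vectors here are random placeholders standing in for trained embeddings.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """Mean cosine similarity of word vector w with attribute sets A and B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_statistic(X, Y, A, B):
    """WEAT-style test statistic: differential association of target word sets
    X and Y (e.g. career vs. family terms) with attribute sets A and B
    (e.g. male vs. female terms)."""
    return sum(association(x, A, B) for x in X) - sum(association(y, A, B) for y in Y)

# Toy example: random vectors stand in for trained word embeddings.
rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(3, 50)) for _ in range(4))
print(weat_statistic(X, Y, A, B))
```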

Bias in Language Models (LMs) (bordia2019identifying) is, quite similarly, the study of biases in LMs, flexible models of human language based on contextualized word representations that can be employed in a variety of linguistics and NLP tasks. LMs are trained on large text corpora from which they may learn spurious correlations and stereotypes. The BOLD dataset is an evaluation benchmark for LMs, based on prompts that mention different socio-demographic groups. LMs complete these prompts into full sentences, which can be tested along different dimensions (sentiment, regard, toxicity, emotion and gender polarity).

Fair Machine Translation (MT) (stanovsky2019evaluating) concerns automatic translation of text from a source language into a target one. MT systems can exhibit gender biases, such as a tendency to translate gender-neutral pronouns from the source language into gendered pronouns of the target language in accordance with gender stereotypes. For example, a “nurse” mentioned in a gender-neutral context in the source sentence may be rendered with feminine grammar in the target language. Bias in Translation Templates is a set of short templates to test such biases.

Fair speech-to-text (tatman2017:gender) is a speech recognition task requiring accurate annotation of spoken language into text across different demographics. YouTube Dialect Accuracy is a dataset developed to audit the accuracy of YouTube’s automatic captions across two genders and five dialects of English.

4.2.2 Setting

As noted at the beginning of this section, there are different settings (or challenges) that run across many tasks described above. Some of these settings are specific to fair ML, such as ensuring fairness across an exponential number of groups, or in the presence of noisy labels for sensitive attributes. Other settings are connected with common ML challenges, including few-shot and privacy-preserving learning. Below we describe common settings encountered in the surveyed articles. Most of these settings are tested on fairness datasets which are popular overall, i.e. Adult, COMPAS and German Credit. We highlight situations where this is not the case, potentially due to a given challenge arising naturally in some other dataset.

Rich-subgroup fairness (kearns2018preventing) is a setting where fairness properties are required to hold not only for a limited number of protected groups, but across an exponentially large number of subpopulations. This line of work represents an attempt to bridge the normative reasoning underlying individual and group fairness.

Noisy fairness is a general expression we adopt to indicate problems where sensitive attributes are missing (chen2019fairness), encrypted (kilbertus2018blind) or corrupted by noise (lamy2019noisetolerant). These problems respond to real-world challenges related to the confidential nature of protected attributes, which individuals may wish to hide, encrypt, or obfuscate. This setting is most commonly studied on highly popular fairness datasets (Adult, COMPAS), moderately popular ones (Law School and Credit Card Default), and a dataset about home mortgage applications in the US (HMDA).

Limited-label fairness comprises settings with limited information on the target variable, including situations where labelled instances are few (ji2020can), noisy (wang2021fair), or only available in aggregate form (sabato2020bounding).

Robust fairness problems arise under perturbations to the training set (huang2019stable), adversarial attacks (nanda2021fairness) and dataset shift (singh2021fairness). This line of research is often connected with work in robust machine learning, extending the stability requirements beyond accuracy-related metrics to fairness-related ones.

Dynamical fairness (liu2018delayed; damour2020fairness) entails repeated decisions in changing environments, potentially affected by the very algorithm that is being studied. Works in this space study the co-evolution of algorithms and populations on which they act over time. For example, an algorithm that achieves equality of acceptance rates across protected groups in a static setting may generate further incentives for the next generation of individuals from historically disadvantaged groups. Popular resources for this setting are FICO and the ML Fairness GYM.

Preference-based fairness (zafar2017from) denotes work informed by the preferences of stakeholders. For people subjected to a decision this is related to notions of envy-freeness and loss aversion (ali2019loss), while policy-makers can express indications on how to trade-off different fairness measures (zhang2020joint).

Multi-stage fairness (madras2018predict) refers to settings where several decision makers coexist in a compound decision-making process. Decision makers, both human and algorithmic, may act with different levels of coordination. A fundamental question in this setting is how to ensure fairness under composition of different decision mechanisms.

Fair few-shot learning (zhao2020fair) aims at developing fair ML solutions in the presence of a small number of data samples. The problem is closely related to, and possibly solved by, fair transfer learning (coston2019fair), where the goal is to exploit the knowledge gained on a problem to solve a different but related one. Datasets where this setting arises naturally are Communities and Crime, where one may restrict the training set to a subset of US states, and Mobile Money Loans, which consists of data from different African countries.

Fair private learning (bagdasaryan2019differential; jagielski2019differentially) studies the interplay between privacy-preserving mechanisms and fairness constraints. Works in this space consider the equity of machine learning models designed to avoid leakage of information about individuals in the training set. Common domains for datasets employed in this setting are face analysis (UTK Face, FairFace, Diversity in Faces) and medicine (CheXpert, SIIM-ISIC Melanoma Classification, MIMIC-CXR-JPG).

Additional settings that are less common include fair federated learning (li2020fair), where algorithms are trained across multiple decentralized devices, fair incremental learning (zhao2020maintaining), where novel classes may be added to the learning problem over time, fair active learning (noriegacampero2019active), allowing for the acquisition of novel information during inference, and fair selective classification (jones2021selective), where predictions are issued only if model confidence is above a certain threshold.

Overall, we found a variety of tasks defined on fairness datasets, ranging from generic ones, such as fair classification, to narrow ones specifically defined on certain datasets, such as fair districting on MGGG States and fair truth discovery on Crowd Judgement. Orthogonally to this dimension, many settings or challenges may arise to complicate these tasks, including noisy labels, system dynamics, and privacy concerns. Quite clearly, algorithmic fairness research has been expanding in both directions, by studying a variety of tasks under diverse and challenging settings. In the next section, we analyze the roles played in scholarly works by the surveyed datasets.

4.3 Role

The datasets used in algorithmic fairness research can play different roles. For example, some may be used to train novel algorithms, while others are suited to test existing algorithms from a specific point of view. Chapter 7 of barocas2019fair describes six different roles of datasets in machine learning. We adopt their framework to analyze fair ML datasets, adding to the taxonomy two roles that are specific to fairness research.

A source of real data. While synthetic datasets and simulations may be suited to demonstrate specific properties of a novel method, the usefulness of an algorithm is typically established on data from the real world. More than a sign of immediate applicability to important challenges, good performance on real-world sources of data signals that the researchers did not make up the data to suit the algorithm. This is likely the most common role for fairness datasets, especially common for the ones hosted on the UCI ML repository, including Adult, German Credit, Communities and Crime, Diabetes 130-US Hospitals, Bank Marketing, Credit Card Default, US Census Data (1990). These resources owe their popularity in fair ML research to being a product of human processes and to encoding protected attributes. Quite simply, they are sources of real human data.

A catalyst of domain-specific progress. Datasets can spur algorithmic insight and bring about domain-specific progress. Civil Comments is a great example of this role, powering the Jigsaw Unintended Bias in Toxicity Classification challenge. The challenge responds to a specific need in the space of automated moderation of toxic comments in online discussion. Early attempts at toxicity detection resulted in models that associate mentions of frequently attacked identities (e.g. gay) with toxicity, due to spurious correlations in training sets. The dataset and associated challenge tackle this issue by providing toxicity ratings for comments, along with labels encoding whether members of a certain group are mentioned, favouring measurement of undesired bias. Many other datasets can play a similar role, including Winogender, Winobias, and the Equity Evaluation Corpus. In a broader sense, COMPAS and the accompanying study (angwin2016machine) have been an important catalyst, not for a specific task, but for fairness research overall.

A way to numerically track progress on a problem. This role is common for machine learning benchmarks that also provide human performance baselines. Algorithmic methods approaching or surpassing these baselines are often taken as a sign that the task is “solved” and that harder benchmarks are required (barocas2019fair). Algorithmic fairness, however, is a complicated, context-dependent construct whose correct measurement is continuously debated. For this reason, we are unaware of any dataset playing a similar role in the algorithmic fairness literature.

A resource to compare models. Practitioners interested in solving a specific problem may take a large set of algorithms and test them on a group of datasets that are representative of their problem, in order to select the most promising ones. For well-established machine learning challenges, there are often leaderboards providing a concise comparison between algorithms for a given task, which may be used for model selection. This setting is rare in the fairness literature, also due to inherent difficulties in establishing a single measure of interest in the field. One notable exception is represented by friedler2019comparative, who employed a suite of four datasets (Adult, COMPAS, German Credit, Ricci) to compare the performance of four different approaches to fair classification.
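As an illustration of this model-comparison role (not the actual benchmark code of friedler2019comparative), the sketch below loops a few classifiers over a suite of datasets and reports accuracy alongside a demographic parity gap; the `DATASETS` loaders are hypothetical placeholders to be supplied by the user.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def demographic_parity_difference(y_pred, sensitive):
    """Largest gap in positive prediction rates across sensitive groups."""
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return max(rates) - min(rates)

# `DATASETS` maps a name to a loader returning (features X, labels y,
# sensitive attribute s); the loaders are hypothetical placeholders.
DATASETS = {}   # e.g. {"adult": load_adult, "compas": load_compas}
MODELS = {"logreg": LogisticRegression(max_iter=1000),
          "tree": DecisionTreeClassifier(max_depth=5)}

for data_name, load in DATASETS.items():
    X, y, s = load()
    X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
        X, y, s, test_size=0.3, random_state=0)
    for model_name, model in MODELS.items():
        y_hat = model.fit(X_tr, y_tr).predict(X_te)
        print(data_name, model_name,
              "acc=%.3f" % (np.asarray(y_hat) == np.asarray(y_te)).mean(),
              "dp_diff=%.3f" % demographic_parity_difference(y_hat, s_te))
```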

A source of pre-training data. Flexible, general-purpose models are often pre-trained to encode useful representations, which are later fine-tuned for specific tasks in the same domain. For example, large text corpora are often employed to train language models and word embeddings which are later specialized to support a variety of downstream NLP applications. Wikipedia dumps, for instance, are often used to train word embeddings and investigate their biases (brunet2019understanding; liang2020artificial; papakyriakopulos2020bias). Several algorithmic fairness works aim to study and mitigate undesirable biases in learnt representations. Corpora like Wikipedia dumps are used to obtain representations via realistic training procedures that mimic common machine learning practice as closely as possible.

A source of training data. Models for a specific task are typically learnt from training sets that encode relations between features and target variable in a representative fashion. One example from the fairness literature is Large Movie Review, used to train sentiment analysis models, later audited for fairness (liang2020artificial). For fairness audits, one alternative would be resorting to publicly available models, but sometimes a close control on training corpus and procedure is necessary. Indeed, it is interesting to study issues of model fairness in relation to biases present in respective training corpora, which can help explain the causes of bias (brunet2019understanding). Some works measure biases in representations before and after fine-tuning on a training set and regard the difference as a measure of bias in the training set. babaeianjelodar2020quantifying employ this approach to measure biases in RtGender, Civil Comments and datasets from GLUE.
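As a rough illustration of the before/after comparison described above (a simplified stand-in for the procedure of babaeianjelodar2020quantifying, which probes BERT-style models), the sketch below computes a WEAT-style association score on word embeddings; the embedding lookups and word lists are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association_score(emb, targets_a, targets_b, attributes):
    """Mean difference in cosine similarity between two target word sets
    (e.g. female vs. male terms) and a set of attribute words
    (e.g. career-related terms). `emb` maps a word to a vector."""
    def mean_sim(words):
        return np.mean([cosine(emb[w], emb[a]) for w in words for a in attributes])
    return mean_sim(targets_a) - mean_sim(targets_b)

# Sketch: score a model's embeddings before and after fine-tuning on a corpus;
# the gap is read as a rough proxy for bias contributed by the training set.
# `emb_pretrained` and `emb_finetuned` are hypothetical word->vector mappings.
# bias_before = association_score(emb_pretrained, FEMALE_WORDS, MALE_WORDS, CAREER_WORDS)
# bias_after  = association_score(emb_finetuned,  FEMALE_WORDS, MALE_WORDS, CAREER_WORDS)
# print("bias shift attributable to fine-tuning corpus:", bias_after - bias_before)
```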

A representative summary of a service. Much important work in the fairness literature is focused on measuring fairness and harms in the real world. This line of work includes audits of products and services, which rely on datasets extracted from the application of interest. Datasets created for this purpose include Amazon Recommendations, Pymetrics Bias Group, Occupations in Google Images, Zillow Searches, Online Freelance Marketplaces, Bing US Queries, YouTube Dialect Accuracy. Several other datasets were originally created for this purpose and later repurposed in the fairness literature as sources of real data, including Stop Question and Frisk, HMDA, Law School, and COMPAS.

An important source of data. Some datasets acquire a pivotal role in research and industry, to the point of being considered a de-facto standard for a given purpose. This status warrants closer scrutiny of the dataset, through which researchers aim to uncover potential biases and problematic aspects that may impact models and insights derived from it. ImageNet, for instance, is a dataset with millions of images across thousands of categories. Since its release in 2009, this resource has been used to train, benchmark, and compare hundreds of computer vision models. Given its status in machine learning research, ImageNet has been the subject of two quantitative investigations analyzing its biases and other problematic aspects in the person subtree, uncovering issues of representation (yang2020towards) and non-consensuality (prabhu2020large). A different data bias audit was carried out on SafeGraph Research Release. SafeGraph data captures mobility patterns in the US, with data from nearly 50 million mobile devices obtained and maintained by SafeGraph, a private data company. Its recent academic release has become a fundamental resource for pandemic research, to the point of being used by the Centers for Disease Control and Prevention to measure the effectiveness of social distancing measures (moreland2020timing). To evaluate its representativeness for the overall US population, coston2021leveraging have studied selection biases in this dataset.

In algorithmic fairness research, datasets play similar roles to the ones they play in machine learning according to barocas2019fair, including training, catalyzing attention, and signalling awareness of common data practices. One notable exception is that fairness datasets are not used to track algorithmic progress on a problem over time, likely due to the fact that there is no consensus on a single measure to be reported. On the other hand, two roles peculiar to fairness research are summarizing a service or product that is being audited and being an important dataset whose biases and ethical aspects are particularly worthy of attention. We note that these roles are not mutually exclusive and that datasets can play multiple roles. COMPAS, for example, was originally curated to perform an audit of pretrial risk assessment tools and was later used extensively in fair ML research as a source of real human data, becoming, overall, a catalyst for fairness research and debate.

In sum, existing fairness datasets originate from a variety of domains, support diverse tasks, and play different roles in the algorithmic fairness literature. In the next section we continue our discussion on the key features of these datasets with a change of perspective, asking which lessons can be learnt from existing resources for the curation of novel ones.

5 Best Practices for Dataset Curation

Figure 3: Most datasets employed in algorithmic fairness were created or updated after 2015, with a clear growth in recent years.

In this section, we analyze the surveyed datasets from different perspectives, typical of critical data studies, human-computer interaction, and computer-supported cooperative work. In particular, we discuss concerns of re-identification (§ 5.1), consent (§ 5.2), inclusivity (§ 5.3), sensitive attribute labeling (§ 5.4) and transparency (§ 5.5). We describe a range of approaches to, and degrees of consideration for, these topics, ranging from negligent to conscientious. Our aim is to make these concerns and related desiderata more visible and concrete, to help inform responsible curation of novel fairness resources, whose number has been increasing in recent years (Figure 3).

5.1 Re-identification

Motivation. Data re-identification (or de-anonymization) is a practice through which instances in a dataset, theoretically representing people in an anonymized fashion, are successfully mapped back to the respective individuals. Their identity is thus discovered and associated with the information encoded in the dataset features. Examples of external re-identification attacks include de-anonymization of movie ratings from the Netflix prize dataset (narayanan2008robust), identification of profiles based on social media group membership (wondracek2010practical) and identification of people depicted in verifiably pornographic categories of ImageNet (prabhu2020large). These analyses were carried out as ‘attacks’ by external teams for demonstrative purposes, but dataset curators and stakeholders may undertake similar efforts internally (mckenna2019:history_drb).

There are multiple harms connected to data re-identification, especially for the datasets featured in algorithmic fairness research, due to their social significance. Depending on the domain and breadth of information provided by a dataset, malicious actors may acquire information about mobility patterns, consumer habits, political leaning, psychological traits, and medical conditions of individuals, just to name a few. The potential for misuse is tremendous, including phishing attacks, blackmail, threats, and manipulation. Face recognition datasets are especially prone to successful re-identification as, by definition, they contain information strongly connected with a person’s identity. The problem also extends to general-purpose computer vision datasets. In a recent dataset audit, prabhu2020large found images of beach voyeurism and other non-consensual depictions in ImageNet, and were able to identify the victims using reverse image search engines, highlighting downstream risks of blackmail and other forms of abuse.

Disparate consideration. In this work, we find that fairness datasets are protected against re-identification with a full range of measures and degrees of care. Perhaps surprisingly, some datasets allow for straightforward re-identification of individuals, providing their full names. We do not discuss these resources here to avoid amplifying the harms discussed above. Other datasets afford plausible re-identification, providing social media handles and aliases, such as Twitter Abusive Behavior, Sentiment140, Facebook Large Network, and Google Local. Columbia University Speed Dating may also fall into this category, due to the restricted population from which the sample is drawn and the provision of participants’ age, field of study, and the ZIP code where they grew up. In contrast, many datasets come with strong guarantees against de-anonymization, which is especially typical of health data, such as MIMIC-III and Heritage Health (el2012deidentification). Indeed, health is a domain where a culture of patient record confidentiality is widely established and there is strong attention to harm avoidance. Datasets describing scholarly works and academic collaboration networks (Academic Collaboration Networks, PubMed Diabetes Papers, Cora, CiteSeer) are also typically de-identified, with numerical IDs substituting names. This is possibly a sign of attention to anonymization from curators when the data represents potential colleagues. As a consequence, researchers are protected from related harms, but posterior annotation of sensitive attributes, as in biega2019overview, becomes impossible. One notable exception is ArnetMiner Citation Network, which is especially focused on data mining from academic social networks and profiling of researchers.

Mitigating factors. A wide range of factors may help reduce the risk of re-identification. A first set of approaches concerns the distribution of data artefacts. Some datasets are simply kept private, minimizing risks in this regard. These include UniGe, Student Performance, Apnea, Symptoms in Queries, and Pymetrics Bias Group, the last two being proprietary datasets that are not disclosed to preserve intellectual property. Twitter Online Harassment is available upon request to protect the identities of the Twitter users it includes. Another interesting approach is a mixed release strategy: NLSY has some publicly available data, while access to further information that may favour re-identification (e.g. ZIP code and census tract) is restricted. For crawl-based datasets, it is possible to keep a resource private while providing code to recreate it (Bias in Bios). While this may alleviate some concerns, it will not deter motivated actors. As a post-hoc remedy, proactive removal of problematic instances is also a possibility, as shown by recent work on ImageNet (yang2020towards).

Another family of approaches is based on redaction, aggregation, and injection of noise. Obfuscation typically involves the distribution of proprietary company data at a level of abstraction which maintains utility to a company while hindering reconstruction of the underlying human-readable data, which also makes re-identification highly unlikely (Yahoo! c14B Learn to Rank, Microsoft Learning to Rank). Noise injection can take many forms, such as top-coding (Adult) and blurring. Targeted scrubbing of identifiable information is also rather common, with ad-hoc techniques applied in different domains. For example, the curators of ASAP, a dataset featuring student essays, removed personally identifying information from the essays using named entity recognition and several heuristics. Finally, aggregation of data into subpopulations of interest also supports the anonymity of the underlying individuals (FICO).
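To make these mitigation techniques concrete, here is a minimal pandas sketch of top-coding, noise injection, and aggregation with small-group suppression; the column names (`capital_gain`, `age`, `zip3`, `income`) and thresholds are hypothetical and do not reproduce the processing of any specific surveyed dataset.

```python
import numpy as np
import pandas as pd

def anonymize(df: pd.DataFrame, rng_seed: int = 0) -> pd.DataFrame:
    """Toy illustration of two row-level techniques mentioned above,
    applied to hypothetical columns of a tabular dataset."""
    out = df.copy()
    # Top-coding: cap a revealing numeric feature, so rare extreme values
    # (which are highly identifying) become indistinguishable.
    out["capital_gain"] = out["capital_gain"].clip(upper=50_000)
    # Noise injection: perturb a quasi-identifier such as age.
    rng = np.random.default_rng(rng_seed)
    out["age"] = out["age"] + rng.integers(-2, 3, size=len(out))
    return out

def aggregate_release(df: pd.DataFrame, min_group_size: int = 20) -> pd.DataFrame:
    """Release group-level statistics only, suppressing small groups
    for which aggregates would still be revealing."""
    summary = df.groupby("zip3")["income"].agg(["mean", "count"])
    return summary[summary["count"] >= min_group_size]
```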

So far we have covered datasets that feature human data derived from real-world processes. Toy datasets, on the other hand, are perfectly safe from this point of view; however, their social relevance is inevitably low. In this work we survey four popular ones, taken from zafar2017fairness; donini2018empirical; lipton2018does; singh2019policy. Semi-synthetic datasets aim for the best of both worlds by generating artificial data from models that emulate the key characteristics of the underlying processes, as is the case with Antelope Valley Networks, Kidney Matching, and the generative adversarial network trained by mcduff2019characterizing on MS-Celeb-1M.

One last important factor is the age of a dataset. Re-identification of old information about individuals requires matching with auxiliary resources from the same period, which are less likely to be maintained than comparable resources from recent years. Moreover, even if successful, the consequences of re-identification are likely mitigated by dataset age, as old information about individuals is less likely to support harm against them. The German Credit dataset, for example, represents loan applicants from 1973–1975, whose re-identification and subsequent harm appears less likely than re-identification for more recent datasets in the same domain.

Anonymization vs social relevance. Utility and privacy are typically considered conflicting objectives for a dataset (wieringa2021data). If we define social relevance as the breadth and depth of societally useful insights that can be derived from a dataset, a similar conflict with privacy becomes clear. Old datasets hardly afford any insight that is actionable and relevant to current applications. Insight derived from synthetic datasets is inevitably questionable. Noise injection increases uncertainty and reduces the precision of claims. Obfuscation hinders subsequent annotation of sensitive attributes. Conservative release strategies increase friction and deter from obtaining and analyzing the data. The most socially relevant fairness datasets typically feature confidential information (e.g. criminal history and financial situation) in conjunction with sensitive attributes of individuals (e.g. race and sex). For these reasons, the social impact afforded by a dataset and the safety against re-identification of included individuals are potentially conflicting objectives that require careful balancing. In the next section we discuss informed consent, another important aspect for the privacy of data subjects.

5.2 Consent

Motivation. In the context of data, informed consent is an agreement between a data processor and a subject, aimed at allowing collection and use of personal information while guaranteeing some control to the subject. paullada2020data note that in the absence of individual control on personal information, anyone with access to the data can process it with little oversight, possibly against the interest and well-being of data subjects. Consent is thus an important tool in a healthy data ecosystem that favours development, trust and dignity.

Negative examples. A separate framework, often conflated with consent, is copyright. Licenses such as Creative Commons discipline how academic and creative works can be shared and built upon, with proper credit attribution. According to the Creative Commons organization, however, their licenses are not suited to protecting privacy and covering research ethics (merkley2019use). In computer vision, and even more so in face recognition, consent and copyright are often considered and discussed jointly, and Creative Commons licenses are frequently taken as an all-inclusive permit encompassing intellectual property, consent and ethics (prabhu2020large). merler2019diversity, for example, mention privacy and copyright concerns in the construction of Diversity in Faces. These concerns are apparently jointly solved by obtaining images from YFCC-100M, due to the fact that “a large portion of the photos have Creative Commons license”. Indeed, lack of consent is a widespread and far-reaching problem in face recognition datasets (keyes2019government). prabhu2020large find several examples of non-consensual images in large-scale computer vision datasets. A particularly egregious example covered in this survey is MS-Celeb-1M, released in 2016 as the largest publicly available training set for face recognition in the world (guo2016msceleb1m). As suggested by its name, the dataset was supposed to feature only celebrities, “to enable our training, testing, and re-distributing under certain licenses” (guo2016msceleb1m). However, the dataset was later found to feature several people who are in no way celebrities and simply maintain an online presence. The dataset was retracted for this reason (murgia2019microsoft).

Positive examples. One domain where informed consent doctrine has been well-established for decades is medicine; fairness datasets from this space are typically sensitive to the topic. Experiments such as randomized controlled trials always require consent elicitation and often discuss the process in the respective articles. Infant Health and Development Program (IHDP), for instance, is a dataset used to study fair risk assessment. It was collected through the IHDP program, carried out between 1985 and 1988 in the US to evaluate the effectiveness of comprehensive early intervention in reducing developmental and health problems in low birth weight premature infants. brooks1992effects clearly state that “of the 1302 infants who met enrollment criteria, 274 (21%) had parents who refused consent and 43 were withdrawn before entry into the assigned group”. Longitudinal studies require trust and continued participation. They typically produce insights and data thanks to participants who have read and signed an informed consent form. Examples of such datasets include Framingham, stemming from a study on cardiovascular disease, and the National Longitudinal Survey of Youth, following the lives of representative samples of US citizens, focusing on their labor market activities and other significant life events. Field studies and derived datasets (DrugNet, Homeless Youths’ Social Networks) are also attentive to informed consent.

The FRIES framework. According to the Consentful Tech Project (https://www.consentfultech.io/), consent should be Freely given, Reversible, Informed, Enthusiastic, and Specific (FRIES). Below we expand on these points and discuss some fairness datasets through the FRIES lens. Pokec Social Network summarizes the connections of users of Pokec, a popular social network in Slovakia and the Czech Republic. Due to default privacy settings being predefined as public, a wealth of information for each profile was collected by curators, including information on demographics, politics, education, marital status and children (takac2012data). While privacy settings are a useful tool to control personal data, default public settings are arguably misleading and do not amount to freely given consent. In the presence of more conservative predefined settings, a user can explicitly choose to publicly share their information. This may be interpreted as consent to share one’s information here and now with other users; looser interpretations favouring data collection and distribution are also possible, but they seem rather lacking in specificity. It is far from clear that choosing public profile settings entails consent to become part of a study and a publicly available dataset for years to come.

This stands in contrast with Framingham and other datasets derived from medical studies, where consent may be provided or refused with fine granularity (levy2010consent). In this regard, let us consider a consent form for a recent Framingham exam (framingham2021consent). The form comes with five different consent boxes, covering participation in the examination, use of the resulting data, participation in genetic studies, sharing of data with external entities, and notification of findings to the subject. Before the consent boxes, a well-structured document informs participants of the reasons for the study, clarifies that they can choose to drop out without penalties at any point, provides a point of contact, and explains what will happen in the study and what the risks to the subject are. Some examples of accessible language and open explanations include the following:

  • “You have the right to refuse to allow your data and samples to be used or shared for further research. Please check the appropriate box in the selection below.”

  • “There is a potential risk that your genetic information could be used to your disadvantage. For example, if genetic research findings suggest a serious health problem, that could be used to make it harder for you to get or keep a job or insurance.”

  • “However, we cannot guarantee total privacy. […] Once information is given to outside parties, we cannot promise that it will be kept private.”

Moreover, the consent form is accessible from a website that promises to deliver a Spanish version soon, showing attention to linguistic minorities. Overall, this approach seems geared towards trust and truly informed consent.

In some cases, consent is rendered inapplicable by necessity. Allegheny Child Welfare, for instance, stems from an initiative by the Allegheny County Department of Human Services to develop assistive tools to support child maltreatment hotline screening decisions. Individuals who resort to this service are in a situation of need and emergency that makes enthusiastic consent highly unlikely. Similar considerations arise in any situation where data subjects are in a state of need and can only access a service by providing their data. An extreme example is Harvey Rescue, the result of crowdsourced efforts to connect rescue parties with people requesting help in the Houston area. Moreover, the provision of data is mandatory in some cases, such as the US census, which conflicts with meaningful, let alone enthusiastic, consent.

Finally, consent should be reversible, giving individuals a chance to revoke it and be removed from a dataset. This is an active area of research, studying specific tools for consent management (albanese2020dynamic) and approaches for retroactive removal of an instance from a model’s training set (ginart2019making). Unfortunately, even when discontinued or redacted, some datasets remain available through backchannels and derivatives. MS-Celeb-1M is, again, a negative example in this regard. The dataset was removed by Microsoft after widespread criticism and claims of privacy infringement. Despite this fact, it remains available via academic torrents (peng2021mitigating). Moreover, MS-Celeb-1M was used as a source of images for several datasets derived from it, including the BUPT Faces and Racial Faces in the Wild datasets covered in this survey. This fact demonstrates that harms related to data artefacts are not simply remedied via retirement or redaction. Ethical considerations about consent and potential harms to people must be more than an afterthought and need to enter the discussion during design.

5.3 Inclusivity

Motivation. Issues of representation, inclusion and diversity are central to the fair ML community. Due to historical biases stemming from structural inequalities, some populations and their perspectives are underrepresented in certain domains and in related data artefacts (jo2020lessons). For example, the person subtree of ImageNet contains images that skew toward male, young, and light-skinned individuals (yang2020towards). Female entities were found to be underrepresented in popular datasets for coreference resolution (zhao2018gender). Even datasets that match natural group proportions may support the development of biased tools with low accuracy for minorities.

Recent works have demonstrated the disparate performance of tools on sensitive subpopulations in domains such as health care (obermeyer2019dissecting), speech recognition (tatman2017:gender) and computer vision (Buolamwini2018gender). Inclusivity and diversity are often considered a primary solution in this regard, both in training sets, which support the development of better models, and test sets capable of flagging such issues.

Positive examples. Ideally, inclusivity should begin with a clear definition of data collection objectives (jo2020lessons). Indeed, we find that diversity and representation are strong points of datasets that were created to assess biases in services, products and algorithms (BOLD, HMDA, FICO, Law School, Scientist+Painter, CVs from Singapore, YouTube Dialect Accuracy, Pilot Parliaments Benchmark), which were designed and curated with special attention to sensitive groups. We also find instances of ex-post remedies to issues of diversity. As an example, the curators of ImageNet proposed a demographic balancing solution based on a web interface that removes the images of overrepresented categories (yang2020towards). A natural alternative is the collection of novel instances, a solution adopted for Framingham. This dataset stems from a study of key factors that contribute to cardiovascular disease, with participants recruited in Framingham, Massachusetts over multiple decades. Recent cohorts were especially designed to reflect the greater racial and ethnic diversity of the town (tsao2015cohort).

Negative examples. Among the datasets we surveyed, we highlight one whose low inclusivity is rather obvious. WebText is a 40 GB dataset that supported the training of the GPT-2 language model (radford2019language). The authors crawled every document reachable from outbound Reddit links that collected at least 3 karma. While this was considered a useful heuristic to achieve size and quality, it ended up skewing this resource towards content appreciated by Reddit users, who are predominantly male, young, and well connected to the internet. This should act as a reminder that size does not guarantee diversity (bender2021:dangers), and that sampling biases are almost inevitable.

Inclusivity is nuanced. While inclusivity surely requires attention to subpopulations, a more precise definition may depend on context and application. Based on the task at hand, an ideal sample may feature all subpopulations with equal presence, or proportionally to their share in the overall population. Let us call these the equal and proportional approaches to diversity. The equal approach is typical of datasets that are meant to be evaluation benchmarks (Pilot Parliaments Benchmark, Winobias) and allow for statistically significant statements on performance differences across groups. On the other hand, the proportional approach is rather common in datasets collected by census offices, such as US Census Data (1990), and in resources aimed precisely at studying issues of representation in services and products (Occupations in Google Images).
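As a small worked example of the distinction, the sketch below compares the observed composition of a sample against both targets, assuming a pandas Series of group labels and reference population shares supplied by the analyst; names and numbers are illustrative.

```python
import pandas as pd

def diversity_report(sample_groups: pd.Series, population_shares: dict) -> pd.DataFrame:
    """Compare the group composition of a sample against two targets:
    'equal' (every group has the same share) and 'proportional'
    (shares match a reference population). Inputs are hypothetical."""
    observed = sample_groups.value_counts(normalize=True)
    groups = sorted(set(observed.index) | set(population_shares))
    equal_target = 1.0 / len(groups)
    return pd.DataFrame({
        "observed": [observed.get(g, 0.0) for g in groups],
        "equal_target": [equal_target] * len(groups),
        "proportional_target": [population_shares.get(g, 0.0) for g in groups],
    }, index=groups)

# Usage sketch with made-up numbers:
# report = diversity_report(df["gender"], {"female": 0.51, "male": 0.49})
# print(report)
```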

Open-ended collection of data is ideal to ensure that various cultures are represented in the manner in which they would like to be seen (jo2020lessons). Unfortunately, we found no instance of datasets where sensitive labels were self-reported according to open-ended responses. On the contrary, individuals with non-conforming gender identities were excluded from some datasets and analyses. Bing US Queries is a proprietary dataset used to study differential user satisfaction with the Bing search engine across different demographic groups. It consists of a subset of Bing users who provided their gender at registration according to a binary categorization, which misrepresents or simply excludes non-binary users from the subset. Similarly, a dataset may be inclusive and allow for non-binary gender indication (Climate Assembly UK), but if used in conjunction with an auxiliary dataset where gender has binary encoding, a common solution is removing instances whose gender is neither female nor male (flanigan2020neutralizing).

Inclusivity does not guarantee benefits. To avoid downstream harms, inclusion by itself is insufficient. The context in which people and sensitive groups are represented should always be taken into account. Despite its overall skew towards male subjects, ImageNet has a high female-to-male ratio in classes such as bra, bikini and maillot, which often feature images that are voyeuristic, pornographic and non-consensual (prabhu2020large). Similarly, in MS-COCO, a famous dataset for object recognition, there is roughly a 1:3 female-to-male ratio, increasing to 0.95 for images of kitchens (hendricks2018women). This sort of representation is unlikely to benefit women in any way and, on the contrary, may contribute to reinforce stereotypes and support harmful biases.

Another clear (but often ignored) disconnect between the inclusion of a group and benefits to it lies in the task at hand and, more generally, in the possible uses afforded by a dataset. In this regard, we find many datasets from the face recognition domain which are presented as resources geared towards inclusion (Diversity in Faces, BUPT Faces, UTK Face, FairFace, Racial Faces in the Wild). Attention to subpopulations in this context is still called “diversity” (Diversity in Faces, FairFace, Racial Faces in the Wild) or “social awareness” (BUPT Faces), but is driven by business imperatives and goals of robustness for a technology that can very easily be employed for surveillance purposes and become detrimental to the vulnerable populations included in these datasets.

Overall, attention to subpopulations is an upside of many datasets we surveyed. However, inclusion, representation, and diversity can be defined in different ways according to the problem at hand. Individuals would rather be included on their own terms, and decide whether and how they should be represented. The problems of diversity and robustness have some clear commonalities, but it seems advisable to maintain separate languages and avoid equating either one with fairness.

5.4 Sensitive Attribute Labelling

Motivation. Datasets are often taken as factual information that supports objective computation and pattern extraction. The etymology of the word “data”, meaning “given”, is rather revealing in this sense. On the contrary, research in human-computer interaction, computer-supported cooperative work, and critical data studies argues that this belief is superficial, limited and potentially harmful (muller2019how; crawford2021excavating).

Data is, quite simply, a human-influenced entity (miceli2021documenting), determined by a chain of discretionary decisions on measurement, sampling and categorization, which shape how and by whom data will be collected and annotated, according to which taxonomy and based on which guidelines. Data science professionals, often more cognizant of the context surrounding data than theoretical researchers, report significant awareness of how curation and annotation choices influence their data and its relation with the underlying phenomena (muller2019how). In an interview, a senior text classification researcher responsible for ground truth annotation shows consciousness of their own influence on datasets by stating “I am the ground truth” (muller2019how).

Sensitive attributes, such as race and gender, are no exception in this regard. Inconsistencies in racial annotation are rather common within the same system (lum2020impact) and, even more so, across different systems (scheuerman2020how; khan2021one). External annotation (either human or algorithmic) is essentially based on co-occurrence of specific traits with membership in a group, thus running the risk of encoding and reinforcing stereotypes. Self-reported labels overcome this issue, although they are still based on an imposed taxonomy, unless provided in open-ended fashion. In this section, we summarize the practices through which sensitive attributes are annotated in datasets used in algorithmic fairness research.

Procurement of sensitive attributes. Self-reported labels for sensitive attributes are typical of datasets that represent users of a service, who may report their demographics during registration (Bing US Queries, MovieLens, Libimseti), or that were gathered through surveys (HMDA, Adult, Law School, Sushi, Willingness-to-Pay for Vaccine). These are all resources for which collection of protected attributes was envisioned at design, potentially as an optional step. However, when sensitive attributes are not available, their annotation may be possible through different mechanisms.

A common approach is having sensitive attributes labelled by non-experts, often workers hired on crowdsourcing platforms. CelebFaces Attributes Dataset (CelebA) features images of celebrities from the CelebFaces dataset, augmented with annotations of landmark location and categorical attributes, including gender, skin tone and age, which were annotated by a “professional labeling company” (liu2015deep). In a similar fashion, Diversity in Faces consists of images labeled with gender and age by workers hired through the Figure Eight crowd-sourcing platform, while the creators of FairFace hired workers on Amazon Mechanical Turk to annotate gender, race, and age in a public image dataset. This practice also raises concerns of fair compensation of labour, which are not discussed in this work.

Some creators employ algorithms to obtain sensitive labels. Face dataset curators often resort to the Face++ API (Racial Faces in the Wild, Instagram Photos, BUPT Faces) or other algorithms (UTK Face, FairFace). In essence, labeling is classifying; hence, measuring and reporting accuracy for this procedure is in order, yet it rarely happens. Creators occasionally note that automated labels were validated by human annotators (Racial Faces in the Wild, FairFace) and very seldom report inter-annotator agreement (Occupations in Google Images).
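A minimal sketch of the kind of validation report that could accompany automated labels follows, assuming a human-validated subsample is available; the function and variable names are illustrative, and Cohen's kappa is used here as one common agreement statistic.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def audit_automated_labels(auto_labels, human_labels, groups):
    """Report overall and per-group agreement between automated sensitive-
    attribute labels and a human-validated subset, plus Cohen's kappa.
    All three inputs are equal-length lists over a validation sample."""
    print("overall accuracy:", accuracy_score(human_labels, auto_labels))
    print("cohen's kappa   :", cohen_kappa_score(human_labels, auto_labels))
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        acc = accuracy_score([human_labels[i] for i in idx],
                             [auto_labels[i] for i in idx])
        print(f"accuracy for group {g!r}: {acc:.3f} (n={len(idx)})")
```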

Other examples of external labels include the geographic origin of candidates in resumes (CVs from Singapore), political leaning of US Twitter profiles (Twitter Political Searches), English dialect of tweets (TwitterAAE) and gender of subjects featured in image search results for professions (Occupations in Google Images). Annotation may also rely on external knowledge bases such as Wikipedia (e.g. https://en.wikipedia.org/wiki/Category:American_female_tennis_players), as is the case with RtGender. In situations where text written by individuals is available, rule-based approaches exploiting gendered nouns (“woman”) or pronouns (“she”) are also applicable (Bias in Bios, Demographics on Twitter).

Some datasets may simply have no sensitive attribute. These are often used in works of individual fairness, but may occasionally support studies of group fairness. For example, dSprites is a synthetic computer vision dataset where regular covariates may play the role of sensitive variables (locatello2019fairness). Alternatively, datasets can be augmented with simulated demographics, as done by madnadi2017:building who randomly assigned a native language to test-takers in ASAP.

Face datasets. Posterior annotation is especially common in computer vision datasets. The Pilot Parliaments Benchmark, for instance, was devised as a testbed for face analysis algorithms. It consists of images of parliamentary representatives from three African and three European countries, labelled by a surgical dermatologist with the Fitzpatrick skin type of the subjects (fitzpatrick1988validity). This is a dermatological scale for skin color, which can be retrieved from people’s appearance. On the contrary, annotations of race or ethnicity from a photo are simplistic at best, and it should be clear that they actually capture perceived race from the perspective of the annotator (FairFace, BUPT Faces). Careful nomenclature is an important first step to improve the transparency of a dataset and make the underlying context more visible. (In this article, we discuss sensitive attributes following the naming convention in the accompanying documentation of each dataset.)

Similarly to scheuerman2020how, we find that documentation accompanying face recognition datasets hardly ever describes how specific taxonomies for gender and race were chosen, conveying a false impression of objectivity. A description of the annotation process is typically present, but minimal. For Multi-task Facial Landmark, for instance, we only know that “The ground truths of the related tasks are labeled manually” (zhang2014facial).

Annotation trade-offs. It is worth re-emphasizing that sensitive label assignment is a classification task that rests on assumptions. Annotation of race and gender in images, for example, is based on the idea that they can be accurately ascertained from pictures, which is an oversimplification of these constructs. The envisioned classes (e.g. binary gender) are another subjective choice stemming from the point of view of dataset curators and may reflect narrow or outdated conceptions and potentially harm the data subjects. In this regard we quote the curators of MS-Celeb-1M, who do not annotate race but consider it for their sampling strategy: “We cover all the major races in the world (Caucasian, Mongoloid, and Negroid)” (guo2016msceleb1m). For these reasons, external annotation of sensitive attributes is controversial and inevitably influenced by dataset curators.

On the other hand, external annotation may be the only way to test specific biases. Occupations in Google Images, for instance, is an image dataset collected to study gender and skin tone diversity in image search results for various professions. The creators hired workers on Amazon Mechanical Turk to label the gender (male, female) and Fitzpatrick skin tone (Type 1–6) of the primary person in each image. The Pilot Parliaments Benchmark was also annotated externally to obtain a benchmark for the evaluation of face analysis technology, with a balanced representation of gender and skin type. Different purposes can motivate data collection and annotation of sensitive attributes. Purposes and aims should be documented clearly, while also reflecting on other uses and potential for misuse of a dataset (gebru2018datasheets). Dataset curators may use documentation to discuss these aspects and specify limitations for the intended use of a resource (peng2021mitigating). In the next section we focus on documentation and why it represents a key component of data curation.

5.5 Transparency

Motivation. Transparent and accurate documentation is a fundamental part of data quality. Its absence may lead to serious issues, including lack of reproducibility, concerns of scientific validity, ethical problems, and harms (barocas2019fair). Clear documentation can shine a light on the inevitable choices made by dataset creators and on the context surrounding the data. In the absence of this information, the curation mechanism mediating between reality and data is hidden, and the data risks being conflated with the phenomena it is meant to capture, to the point that the interpretation of numerical results can be misleading and overreaching (bao2021COMPASlicated).

Good documentation should discuss and explain features, providing context about who collected and annotated the data, how and for which purpose (gebru2018datasheets; denton2020bringing). This provides dataset users with information they can leverage to select appropriate datasets for their tasks and avoid unintentional misuse (gebru2018datasheets). Other actors, such as reviewers, may also access the official documentation of a dataset to ensure that it is employed in compliance with its stated purpose, guidelines and terms of use (peng2021mitigating).

Positive examples. In this survey, we find examples of excellent documentation in datasets related to studies and experiments, including CheXpert, Framingham and NLSY. Indeed, datasets curated by medical institutions and census offices are often well-documented. The ideal source of good documentation is a descriptor article published in conjunction with a dataset (e.g. MIMIC-III), which typically offers stronger guarantees than web pages in terms of quality and permanence. Official websites hosting and distributing datasets are also important to collect updates, errata, and additional information that may not be available at the time of release. The Million Song Dataset and Goodreads Reviews, for instance, are available on websites which contain a useful overview of the respective dataset, a list of updates, code samples, pointers to documentation, and contacts for further questions.

Negative examples. On the other hand, some datasets are opaque and poorly documented. Among publicly available ones, Arrhythmia is distributed with a description of the features but no context about the purposes, actors, and subjects involved in the data collection. Similarly, the whole curation process and composition of Multi-task Facial Landmark is described in a short paragraph, explaining it consists of 10,000 outdoor face images from the web that were labelled manually with gender. Most face datasets suffer from opaque documentation, especially concerning the choice of sensitive labels and their annotation.

Retrospective documentation. Good documentation may also be produced retrospectively (bandy2021addressing; garbin2021structured). German Credit is an interesting example of a dataset that was poorly documented for decades, until the recent publication of a report correcting severe coding mistakes (gromping2019:sg). For instance, from the old documentation it seemed possible to retrieve the sex of data subjects from a feature jointly encoding sex and marital status. The dataset archaeology work by gromping2019:sg shows that this is not the case, which has particular relevance for the many algorithmic fairness works using this dataset with sex as a protected feature, as this feature is simply not available. Numerical results obtained in this setting may be an artefact of the erroneous coding with which the dataset has been, and still is, officially distributed in the UCI repository (hofmann1994:sg). Until the report and the corrected version of the dataset (gromping2019:sg2) become well-known, the old version will remain prevalent and more mistakes will be made. In other words, while the documentation debt for this particular dataset has been retrospectively addressed, many algorithmic fairness works published after the report continue to use the German Credit dataset with sex as a protected attribute (he2020geometric; yang2020fairness; baharlouei2020renyi; lohaus2020too; martinez2020minimax; wang2021fair). This is an issue of documentation sparsity, where the right information exists but does not reach interested parties, including researchers and reviewers.

Documentation is a fundamental part of data curation, with most responsibility resting on creators. Dataset users can also play a role in mitigating the documentation debt by proactively looking for information about the resources they plan to use. Brief summaries discussing and motivating the chosen datasets can be included in scholarly articles, at least in supplementary materials when conflicting with page limitations. Indeed, documentation debt is a problem for the whole research community, which can be addressed collectively with retrospective contributions and clarifications. We argue that it is also up to individual researchers to seek contextual information for situating the data they want to use.

6 Conclusions and Recommendations

Algorithmic fairness is a young research area, undergoing a fast expansion, with diverse contributions in terms of methodology and applications. Progress in the field hinges on different resources, including, very prominently, datasets. In this work, we have surveyed hundreds of datasets used in the algorithmic fairness literature to help the research community reduce its documentation debt, improve the utilization of existing datasets, and support the curation of novel ones.

With respect to existing resources, we have shown that the most popular datasets in the fairness literature (Adult, COMPAS and German Credit) have limited merits beyond originating from human processes and encoding protected attributes. On the other hand, several negative aspects call into question their current status of general-purpose fairness benchmarks, including contrived prediction tasks, noisy data, severe coding mistakes, and age. In a practical demonstration of documentation debt and its consequences, we find several works of algorithmic fairness using German Credit with sex as a protected attribute, while careful analysis of recent documentation shows that this feature cannot be reliably retrieved from the data.

We have documented over two hundred datasets to provide viable alternatives, annotating their domain, the tasks they support, and the roles they play in works of algorithmic fairness. We have shown that the processes generating the data belong to many different domains, including, for instance, criminal justice, education, search engines, online marketplaces, emergency response, social media, medicine, and finance. At the same time, we have described a variety of tasks studied on these resources, ranging from generic, such as fair classification, to narrow, such as fair districting and fair truth discovery. Overall, such diversity of domains and tasks provides a glimpse into the variety of human activities and applications that can be impacted by automated decision making, and that can benefit from algorithmic fairness research. Task and domain annotations are made available in our data briefs to facilitate the work of researchers and practitioners interested in the study of algorithmic fairness applied to specific domains or tasks. By assembling sparse information on hundreds of datasets into a single document, we aimed to provide a useful reference to support both domain-oriented and task-oriented dataset search.

At the same time, we have analyzed issues connected to re-identification, consent, inclusivity, labeling, and transparency running across these datasets. By describing a range of approaches and attentiveness to these topics, we have looked to make them more visible and concrete. On one hand, this may prove valuable to inform post-hoc data interventions aimed at mitigating potential harms caused by existing datasets. On the other hand, as novel datasets are increasingly curated, published, and adopted in fairness research, it is important to motivate these concerns, make them tangible, and distill existing approaches into best practices for future endeavours of data curation, which we summarize below. Our recommendations complement (and do not replace) a growing body of work studying key aspects in the life cycle of datasets (gebru2018datasheets; jo2020lessons; prabhu2020large; crawford2021excavating; peng2021mitigating).

Social relevance of data, intended as the breadth and depth of societally useful insights afforded by datasets, is a central requirement in fairness research. Unfortunately, this may conflict with user privacy, favouring re-identification or leaving consideration of consent in the background. Consent should be considered during the initial design of a dataset, in accordance with existing frameworks, such as the FRIES framework outlined in the Consentful Tech Project. Moreover, different strategies are available to alleviate concerns of re-identification, including noise injection, conservative release, and synthetic data generation. Algorithmic fairness is motivated by aims of justice and harm avoidance for people, which should be extended to data subjects.

Inclusivity is also important for social relevance, as it allows for a wider representation and supports analyses that take important groups into account. However, inclusivity is insufficient in itself. The possible uses afforded by a dataset should always be considered, evaluating costs and benefits for the data subjects and the wider population. In the absence of these considerations, uncritical inclusivity runs the risk of simply supporting system robustness across sensitive attributes, such as race and gender, rebranded as fairness.

Sensitive attributes are a key ingredient to measure inclusion and increase the social relevance of a dataset. Although often impractical, it is typically preferable for sensitive attributes to be self-reported by data subjects. Externally assigned labels and taxonomies can harm individuals by erasing their needs and points of view. Sensitive attribute labelling is thus a shortcut whose advantages and disadvantages should be carefully weighed and, if chosen, it should be properly documented. Possible approaches based on human labour include expert and non-expert annotation, while automated approaches range from simple rule-based systems to complex and opaque algorithms. To label is to classify, hence measuring and reporting per-group accuracy is in order. Some labeling endeavours are more sensible than others: while skin tone can arguably be retrieved from pictures, annotations of race from an image actually capture perceived race from the perspective of the annotator. Rigorous nomenclature favours better understanding and clarifies the subjectivity of certain labels.

Reliable documentation shines a light on the inevitable choices made by dataset creators and on the context surrounding the data. This provides dataset users with information they can leverage to select appropriate datasets for their tasks and avoid unintentional misuse. Datasets for which some curation choices are poorly documented may appear more objective at first sight. However, it should be clear that objective data and opaque data are very different things. Proper documentation increases transparency, trust, and understanding. At a minimum, it should include the purpose of a data artifact, a description of the sample, the features and related annotation procedures, along with an explicit discussion of the associated task, if any. It should also clarify who was involved in the different stages of the data development process, with special attention to annotation. Data documentation also supports reviewers and readers of academic research in assessing whether a dataset was selected with good reason and utilized in compliance with its creators’ guidelines.

Understanding and budgeting for all these aspects during early design phases, rather than after collection or release, can be invaluable for data subjects, data users, and society. While possible remedies exist, data is an extremely fluid asset allowing for easy reproduction and derivatives of all sorts; remedies applied to a dataset do not necessarily benefit its derivatives. In this work, we have targeted the collective documentation debt of the algorithmic fairness community, resulting from the opacity surrounding certain resources and the sparsity of existing documentation. We have mainly targeted sparsity in a centralized documentation effort; as a result, we have found and described a range of weaknesses and best practices that can be adopted to reduce opacity and mitigate concerns of privacy and inclusion. Similarly to other types of data interventions, useful documentation can be produced after release, but, as shown in this work, the documentation debt may propagate nonetheless. In a mature research community, curators, users, and reviewers can all contribute to cultivating a data documentation culture and keeping the overall documentation debt in check.

Acknowledgements.
The authors would like to thank the following researchers and dataset creators for the useful feedback on the data briefs: Alain Barrat, Luc Behaghel, Asia Biega, Marko Bohanec, Chris Burgess, Robin Burke, Alejandro Noriega Campero, Margarida Carvalho, Abhijnan Chakraborty, Robert Cheetham, Paulo Cortez, Thomas Davidson, Maria De-Arteaga, Lucas Dixon, Michele Donini, Marco Duarte, Fehrman, H. Altay Guvenir, Moritz Hardt, Yu Hen Hu, Irina Higgins, Won Ik Cho, Rachel Huddart, Lalana Kagal, Dean Karlan, Vijay Keswani, Been Kim, Hyunjik Kim, Jiwon Kim, Svetlana Kiritchenko, Joseph A. Konstan, Varun Kumar, Jeremy Andrew Irvin, Jamie N. Larson, Jure Leskovec, Andrea Lodi, Oisin Mac Aodha, Loic Matthey, Julian McAuley, Brendan McMahan, Sergio Moro, Luca Oneto, Orestis Papakyriakopoulos, Stephen Robert Pfohl, Christopher G. Potts, Mike Redmond, Kit Rodolfa, Veronica Rotemberg, Rachel Rudinger, Sivan Sabato, Kate Saenko, Mark D. Shermis, Daniel Slunge, Luca Soldaini, Efstathios Stamatatos, Ryan Steed, Rachael Tatman, Schrasing Tong, Alan Tsang, Andreas van Cranenburgh, Lucy Vasserman, Roland Vollgraf, Alex Wang, Zeerak Waseem, Kellie Webster, Pang Wei Koh, Bryan Wilder, Nick Wilson, I-Cheng Yeh, Elad Yom-Tov, Michal Zabovsky, Yukun Zhu.

References

Appendix A Data briefs

Data briefs were drafted by the first author and reviewed by the remaining authors. For over 95% of the surveyed datasets, we identified at least one contact involved in the data curation process or familiar with the dataset, who received a preliminary version of the respective data brief and a request for corrections and additions. Data briefs are meant as a short documentation format to provide key information on datasets used in fairness research, comprising the following fields:

Description.

This is a free-text field reporting (1) the aim/purpose of a data artifact (i.e., why it was developed/collected), as stated by curators or inferred from context; (2) a high-level description of the available features; (3) the labeling procedure for annotated attributes, with special attention to sensitive ones, if any; (4) the envisioned ML task, if any.

Affiliation of creators.

Typically derived from reports, articles, or official web pages presenting a dataset. Datasets can be derivatives of other datasets (e.g., Adult). We typically refer to the definitive resource while providing the prior context where appropriate.

Domain.

The main field where the data is used (e.g., computer vision for ImageNet) or the field studying the processes and phenomena that produced the dataset (e.g., radiology for CheXpert).

Tasks in fairness literature.

An indication of the task performed on the dataset in each surveyed article that uses the current resource.

Data spec.

The main format of the data. The envisioned categories are text, image, time-series, tabular data, and pairs. The latter denotes a special type of tabular data where rows and columns correspond to entities and cells to a relation between them, such as relevance for query-document pairs, ratings for user-item pairs, co-authorship relation for author-author pairs. A “mixture” category was added for resources with multimodal data.

Sample size.

Dataset cardinality.

Year.

Last known update to the dataset. For resources whose collection and curation are ongoing (e.g., Framingham) we write 2021.

Sensitive features.

Sensitive attributes in the dataset. These are typically explicitly annotated, but may include implicit ones, such as textual references to people and their demographics in text datasets. References to gender, for instance, can easily be retrieved from English-language text corpora based on intrinsically gendered words, such as she, man, aunt.
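As an illustration of how such implicit references can be surfaced, the following minimal sketch counts intrinsically gendered English words in a text; the word lists and the example sentence are our own illustrative assumptions, not part of any surveyed resource.

```python
import re

# Illustrative (non-exhaustive) lists of intrinsically gendered English words.
FEMININE = {"she", "her", "woman", "women", "aunt", "mother", "sister"}
MASCULINE = {"he", "his", "him", "man", "men", "uncle", "father", "brother"}

def gender_references(text: str) -> dict:
    """Count occurrences of gendered words in a text, as a rough proxy
    for implicit references to people and their gender."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "feminine": sum(token in FEMININE for token in tokens),
        "masculine": sum(token in MASCULINE for token in tokens),
    }

print(gender_references("My aunt said she would meet the man tomorrow."))
# {'feminine': 2, 'masculine': 1}
```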

Link.

A link to the website where the resource can be downloaded or requested.

Further information.

Reference to works and web pages describing the dataset.

Following the algorithmic fairness literature, we define sensitive features as encoding membership in groups that are salient for society and have some special protection based on the law, including race, ethnicity, sex, gender, and age. We may occasionally stretch this definition and report features considered sensitive in some works, such as political leaning or education, so long as they reflect essential divisions in society. We also report domain-specific attributes considered sensitive in a given context, such as language for Section 203 determinations or brand ownership for Amazon Recommendations. We follow the language of the available documentation for the names and values of sensitive features, including distinctions between race and ethnicity. For datasets that report geographical information at any granularity (GPS coordinates, neighbourhoods, countries), we report “geography” among the sensitive attributes. If an article considers features to be sensitive in an arbitrary fashion (e.g., sepal width in the Iris dataset), we do not report them in the respective field.

For the dataset domain, we follow the area-category taxonomy defined by Scimago (https://www.scimagojr.com/journalrank.php), with the addition of “news”, “social media”, and “social networks”. Tasks in the fairness literature were labeled via open coding. The final taxonomy is detailed in Section 4.2. We distinguish works that focus on evaluation rather than on proposing novel solutions by writing, e.g., “fair ranking evaluation” instead of “fair ranking”. We use “evaluation” as a broad term for works focusing on audits of algorithms, products, platforms, or datasets and their properties from multiple fairness and accuracy perspectives. With some abuse of notation, we also use this label for works that focus on properties of fairness metrics (pleiss2017fairness). Unless otherwise specified, “fairness evaluation” is about fair classification, which is the most common task.
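For readers who prefer a machine-readable rendering of this format, the following minimal sketch expresses the fields above as a Python dataclass; the field names and the illustrative instance are ours and do not constitute an official schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataBrief:
    """Minimal machine-readable rendering of the data brief fields described above."""
    name: str
    description: str
    affiliation_of_creators: List[str]
    domain: List[str]
    tasks_in_fairness_literature: List[str]
    data_spec: str            # e.g. "tabular data", "text", "image", "pairs", "mixture"
    sample_size: str
    year: int                 # last known update
    sensitive_features: List[str]
    link: Optional[str] = None
    further_info: List[str] = field(default_factory=list)

# Illustrative instance, loosely based on the Adult brief below.
adult_brief = DataBrief(
    name="Adult",
    description="1994 US Current Population Survey extract with a binary income target.",
    affiliation_of_creators=["Silicon Graphics Inc."],
    domain=["economics"],
    tasks_in_fairness_literature=["fairness evaluation", "fair classification"],
    data_spec="tabular data",
    sample_size="tens of thousands of instances",
    year=1996,
    sensitive_features=["age", "sex", "race"],
    link="https://archive.ics.uci.edu/ml/datasets/adult",
)
```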

a.1 2010 Frequently Occurring Surnames

  • Description: this dataset reports all surnames occurring 100 or more times in the 2010 US Census, broken down by race (White, Black, Asian and Pacific Islander (API), American Indian and Alaskan Native only (AIAN), multiracial, or Hispanic).

  • Affiliation of creators: US Census Bureau.

  • Domain: linguistics.

  • Tasks in fairness literature: noisy fair subset selection (mehrotra2021mitigating).

  • Data spec: tabular data.

  • Sample size: K surnames.

  • Year: 2016.

  • Sensitive features: race.

  • Link: https://www.census.gov/topics/population/genealogy/data/2010_surnames.html

  • Further info: https://www2.census.gov/topics/genealogy/2010surnames/surnames.pdf

a.2 2016 US Presidential Poll

  • Description: this dataset was collected and maintained by FiveThirtyEight, a website specialized in opinion poll analysis. This resource was developed with the goal of providing an aggregated estimate based on multiple polls, weighting each input according to sample size, recency, and historical accuracy of the polling organization. For each poll, the dataset provides the period of data collection, its sample size, the pollster conducting it, their rating, and a url linking to the source data.

  • Affiliation of creators: FiveThirtyEight.

  • Domain: political science.

  • Tasks in fairness literature: limited-label fairness evaluation (sabato2020bounding).

  • Data spec: tabular data.

  • Sample size: K poll results.

  • Year: 2016.

  • Sensitive features: geography.

  • Link: http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv

  • Further info: https://projects.fivethirtyeight.com/2016-election-forecast/

a.3 4area

  • Description: this dataset was extracted from DBLP to study the problem of topic modeling on documents connected by links in a graph structure. The creators extracted from DBLP articles published at 20 major conferences from four related areas, i.e., database, data mining, machine learning, and information retrieval. Each author is associated with four continuous variables based on the fraction of research papers published in these areas. The associated task is the prediction of these attributes.

  • Affiliation of creators: University of Illinois at Urbana-Champaign.

  • Domain: library and information sciences.

  • Tasks in fairness literature: fair clustering (harb2020kfc).

  • Data spec: author-author pairs.

  • Sample size: K nodes (authors) connected by K edges (co-author relations).

  • Year: 2009.

  • Sensitive features: author.

  • Link: not available

  • Further info: sun2009itopic

a.4 Academic Collaboration Networks

  • Description: these datasets represent two collaboration networks from the preprint server arXiv, covering scientific papers submitted to the astrophysics (AstroPh) and condensed matter physics (CondMat) categories. Each node in the network is an author, with links indicating co-authorship of one or more articles. Nodes are indicated with ids, hence information about the researchers in the graph is not immediately available. These datasets were developed to study the evolution of graphs over time.

  • Affiliation of creators: Carnegie Mellon University; Cornell University.

  • Domain: library and information sciences.

  • Tasks in fairness literature: fair graph mining (kang2020inform).

  • Data spec: author-author pairs.

  • Sample size: 19K nodes (authors) connected by K edges (indications of co-authorship) (AstroPh). 23K nodes connected by K edges (CondMat).

  • Year: 2009.

  • Sensitive features: none.

  • Link: http://snap.stanford.edu/data/ca-AstroPh.html (AstroPh) and http://snap.stanford.edu/data/ca-CondMat.html (CondMat)

  • Further info: leskovec2007graph

a.5 Adience

  • Description: this resource was developed to favour the study of automated age and gender identification from images of faces. Photos were sourced from Flickr albums, among the ones automatically uploaded from iPhones and made available under a Creative Commons license. All images were manually labeled for age, gender and identity “using both the images themselves and any available contextual information”. These annotations are fundamental for the tasks associated with this dataset, i.e. age and gender estimation. One author of Buolamwini2018gender labeled each image in Adience with Fitzpatrick skin type.

  • Affiliation of creators: Adience; Open University of Israel.

  • Domain: computer vision.

  • Tasks in fairness literature: data bias evaluation (Buolamwini2018gender), robust fairness evaluation (nanda2021fairness).

  • Data spec: image.

  • Sample size: K images of K subjects.

  • Year: 2014.

  • Sensitive features: age, gender, skin type.

  • Link: https://talhassner.github.io/home/projects/Adience/Adience-data.html

  • Further info: eidinger2014:age; Buolamwini2018gender

a.6 Adressa

  • Description: this dataset was curated as part of the RecTech project on recommendation technology owned by Adresseavisen (shortened to Adressa), a large Norwegian newspaper. It summarizes one week of traffic to the newspaper website by both subscribers and non-subscribers during February 2017. The dataset describes reading events, i.e. a reader accessing an article, providing access timestamps and user information inferred from their IP. Specific information about the articles is also available, including author, keywords, body, and mentioned entities. The dataset curators also worked on an extended version of the dataset (Adressa 20M), ten times larger than the one described here.

  • Affiliation of creators: Norwegian University of Science and Technology; Adresseavisen.

  • Domain: news, information systems.

  • Tasks in fairness literature: fair ranking (chakraborty2019equality).

  • Data spec: user-article pairs.

  • Sample size: M ratings by M readers over articles.

  • Year: 2018.

  • Sensitive features: geography.

  • Link: http://reclab.idi.ntnu.no/dataset/

  • Further info: gulla2017adressa

a.7 Adult

  • Description: this dataset was created as a resource to benchmark the performance of machine learning algorithms on socially relevant data. Each instance is a person who responded to the March 1994 US Current Population Survey, represented along demographic and socio-economic dimensions, with features describing their profession, education, age, sex, race, personal and financial condition. The dataset was extracted from the census database, preprocessed, and donated to UCI Machine Learning Repository in 1996 by Ronny Kohavi and Barry Becker. A binary variable encoding whether respondents’ income is above $50,000 was chosen as the target of the prediction task associated with this resource. See Appendix B for extensive documentation; a minimal loading sketch follows this brief.

  • Affiliation of creators: Silicon Graphics Inc.

  • Domain: economics.

  • Tasks in fairness literature: fairness evaluation (sharma2020:cc; cardoso2019framework; oneto2019taking; friedler2019comparative; chen2018why; lipton2018does; pleiss2017fairness; diciccio2020evaluating; speicher2018unified; feldman2015certifying; maity2021statistical; kim2020fact; liu2019implicit; williamson2019fairness; vonkugelgen2021fairness; ngong2020towards; jabbari2020empirical; huan2020fairness; zliobaite2015relation), fair classification (he2020geometric; sharma2020data; goel2018nondiscriminatory; raff2018fair; zhang2018mitigating; hu2020fair; celis2019classification; yang2020fairness; cho2020fair; savani2020intraprocessing; wu2019cpfairness; donini2018empirical; quadrianto2017recycling; calmon2017optimized; xu2020algorithmic; zhang2017achieving; yurochkin2021sensei; vargo2021individually; chuang2021fair; roh2021fairbatch; yurochkin2020training; baharlouei2020renyi; lohaus2020too; martinez2020minimax; mukherjee2020two; roh2020frtrain; celis2020data; cotter2019training; gordaliza2019obtaining; wang2019repairing; agarwal2018reductions; creager2021exchanging; delobelle2020ethical; ogura2020convex; feldman2015certifying; zafar2017fairness; fish2015fair; raff2018gradient), fair clustering (abbasi2021fair; ghadiri2021socially; harb2020kfc; ahmadian2020fair; huang2019coresets; bera2019fair; chierichetti2017fair; brubach2020pairwise; mahabadi2020individual; backurs2019scalable; mary2019fairnessaware; wang2020augmented; berk2017convex; beutel2017data), noisy fair clustering (esmaeili2020probabilistic), fair active classification (noriegacampero2019active; bakker2019dadi), fair preference-based classification (ali2019loss; ustun2019fairness), noisy fair classification (lahoti2020fairness; wang2020robust; mozannar2020fair; kilbertus2018blind), fair anomaly detection (zhang2021towards), noisy fairness evaluation (awasthi2021evaluating), robust fairness evaluation (black2021leaveoneout), data bias evaluation (beretta2021detecting), rich-subgroup fairness evaluation (kearns2019empirical; chouldechova2017fairer), fair representation learning (ruoss2020learning; zhao2019inherent; zhao2020conditional; louizos2016variational; quadrianto2019discovering; madras2018learning), fair multi-stage classification (hu2020fairmultiple; goel2020importance), robust fair classification (mandal2020ensuring; huang2019stable; rezaei2021robust), dynamical fair classification (zhang2019group), fair ranking evaluation (kallus2019fairness), fair data summarization (chiplunkar2020how; jones2020fair; kleindessner2019fair; celis2018fair; halabi2020fairness), fair regression (agarwal2019fair), limited-label fair classification (chzhen2019leveraging; wang2021fair; choi2020group), limited-label fairness evaluation (ji2020can).

  • Data spec: tabular data.

  • Sample size: K instances.

  • Year: 1996.

  • Sensitive features: age, sex, race.

  • Link: https://archive.ics.uci.edu/ml/datasets/adult

  • Further info: kohavi1996scaling; kohavi1994adult_uci; usdeptcomm1995current; hardt2021facing; mckenna2019:history; mckenna2019:history_drb
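The following is a minimal sketch of loading Adult from the UCI repository and recovering its binary income target; the column names follow the adult.names file and the URL points to the standard UCI location, but the exact preprocessing varies across the surveyed works.

```python
import pandas as pd

# Column names as documented in the UCI "adult.names" file.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# The raw file has no header row; "?" marks missing values.
df = pd.read_csv(URL, names=COLUMNS, na_values="?", skipinitialspace=True)

# Binary prediction target: income above $50,000.
df["target"] = (df["income"] == ">50K").astype(int)

# Sensitive attributes commonly used in the fairness literature.
print(df.groupby("sex")["target"].mean())   # base rates by sex
print(df.groupby("race")["target"].mean())  # base rates by race
```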

a.8 Allegheny Child Welfare

  • Description: this dataset stems from an initiative by the Allegheny County Department of Human Services to develop assistive tools to support child maltreatment hotline screening decisions. Referrals received by Allegheny County via a hotline between September 2008 and April 2016 were assembled into a dataset. To obtain a relevant history and follow-up time for each referral, a subset of samples spanning the period from April 2010 to April 2014 is considered. Each data point pertains to a referral for suspected child abuse or neglect and contains a wealth of information from the integrated data management systems of Allegheny County. This data includes cross-sector administrative information for individuals associated with a report of child abuse or neglect, including data from child protective services, mental health services, and drug and alcohol services. The target to be estimated by risk models is future child harm, as measured e.g. by re-referrals, which complements the role of the screening staff who are focused on the information currently available about the referral.

  • Affiliation of creators: Allegheny County Department of Human Services; Auckland University of Technology; University of Southern California; University of Auckland; University of California.

  • Domain: social work.

  • Tasks in fairness literature: fairness evaluation of risk assessment (coston2020counterfactual), fair risk assessment (mishler2021fairness).

  • Data spec: tabular data.

  • Sample size: K calls.

  • Year: 2019.

  • Sensitive features: age, race, gender of child.

  • Link: not available

  • Further info: vaithianathan2017developing

a.9 Amazon Recommendations

  • Description: this dataset was crawled to study anti-competitive behaviour on Amazon, and the extent to which Amazon’s private label products are recommended on the platform. Considering the categories backpack and battery, where Amazon is known to have a strong private label presence, the creators gathered a set of organic and sponsored recommendations from Amazon.in, exploiting snowball sampling. Metadata for each product was also collected, including user rating, number of reviews, brand, seller.

  • Affiliation of creators: Indian Institute of Technology; Max Planck Institute for Software Systems.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking evaluation (dash2021when).

  • Data spec: item-recommendation pairs.

  • Sample size: M recommendations associated with K items.

  • Year: 2021.

  • Sensitive features: brand ownership.

  • Link: not available

  • Further info: dash2021when

a.10 Amazon Reviews

  • Description: this is a large-scale dataset of over ten million products and their reviews on Amazon, spanning more than two decades. It was created to study the problem of image-based recommendation and its dynamics. Rich metadata are available for both products and reviews: reviews consist of ratings, text, reviewer name, and review ID, while products include title, price, image, and sales rank.

  • Affiliation of creators: University of California, San Diego.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking (patro2019incremental).

  • Data spec: user-product pairs (reviews).

  • Sample size: M reviews of products.

  • Year: 2018.

  • Sensitive features: none.

  • Link: https://nijianmo.github.io/amazon/index.html

  • Further info: mcauley2015imagebased; he2016ups

a.11 Anpe

  • Description: this dataset represents a large randomized controlled trial, assigning job seekers in France to a program run by the Public employment agency (ANPE), or to a program outsourced to private providers by the Unemployment insurance organization (Unédic). The data involves 400 public employment branches and over 200,000 job-seekers. Data about job seekers includes their demographics, their placement program and the subsequent duration of unemployment spells.

  • Affiliation of creators: Paris School of Economics; Institute of Labor Economics; CREST; ANPE; Unédic; Direction de l’Animation de la Recherche et des Études Statistiques.

  • Domain: economics.

  • Tasks in fairness literature: fairness evaluation of risk assessment (kallus2019assessing).

  • Data spec: tabular data.

  • Sample size: K job seekers.

  • Year: 2012.

  • Sensitive features: age, gender, nationality.

  • Link: https://www.openicpsr.org/openicpsr/project/113904/version/V1/view?path=/openicpsr/113904/fcr:versions/V1/Archive&type=folder

  • Further info: behaghel2014private

a.12 Antelope Valley Networks

  • Description: this is a set of synthetic datasets generated to study the problem of influence maximization for obesity prevention. Samples of agents are generated to emulate the demographic and obesity distribution across regions in the Antelope Valley in California, exploiting data from the U.S. Census, the Los Angeles County Department of Public Health, and the Los Angeles Times Mapping L.A. project. Each agent in the network has a geographic region, gender, ethnicity, age, and connections to other agents, which are more frequent for agents with similar attributes. Agents are also assigned a weight status, which may change based on interactions with other agents in their ego-network, emulating social learning.

  • Affiliation of creators: National University of Singapore; University of Southern California.

  • Domain: public health.

  • Tasks in fairness literature: fair influence maximization (farnad2020unifying).

  • Data spec: agent-agent pairs.

  • Sample size: synthetic networks, containing individuals each.

  • Year: 2019.

  • Sensitive features: ethnicity, gender, age, geography.

  • Link: https://github.com/bwilder0/fair_influmax_code_release

  • Further info: wilder2018optimizing; tsang2019group

a.13 Apnea

  • Description: this dataset results from a sleep medicine study focused on establishing important factors for the automated diagnosis of Obstructive Sleep Apnea (OSA). The task associated with this dataset is the prediction of medical condition (OSA/no OSA) from available patient features, which include demographics, medical history, and symptoms.

  • Affiliation of creators: Massachusetts Institute of Technology; Massachusetts General Hospital; Harvard Medical School.

  • Domain: sleep medicine.

  • Tasks in fairness literature: fair preference-based classification (ustun2019fairness).

  • Data spec: mixture (time series and tabular data).

  • Sample size: K patients.

  • Year: 2016.

  • Sensitive features: age, sex.

  • Link: not available

  • Further info: ustun2016clinical

a.14 ArnetMiner Citation Network

  • Description: this dataset is one of the many resources made available by the ArnetMiner online service. The ArnetMiner system was developed for the extraction and mining of data from academic social networks, with a focus on profiling of researchers. The DBLP Citation Network is extracted from academic resources, such as DBLP, ACM and MAG (Microsoft Academic Graph). The dataset captures the relationships between scientific articles and their authors in a connected graph structure. It can be used for tasks such as community discovery, topic modeling, centrality and influence analysis. In its latest versions, the dataset comprises over 20 fields, including paper title, keywords, abstract, venue, year, along with authors, and their affiliations. The ArnetMiner project was partially funded by the Chinese National High-tech R&D Program, the National Science Foundation of China, IBM China Research Lab, the Chinese Young Faculty Research Funding program and Minnesota China Collaborative Research Program.

  • Affiliation of creators: Tsinghua University; IBM.

  • Domain: library and information sciences.

  • Tasks in fairness literature: fair graph mining (buyl2020debayes).

  • Data spec: article-article pairs.

  • Sample size: M papers connected by M citations.

  • Year: 2021.

  • Sensitive features: author.

  • Link: http://www.arnetminer.org/citation

  • Further info: tang2008arnetminer; https://www.aminer.org/

a.15 Arrhythmia

  • Description: data provenance for this set of patient records seems uncertain. The first work referencing this dataset dates to 1997 and details a machine learning approach for the diagnosis of arrhythmia, which presumably motivated its collection. Each data point describes a different patient; features include demographics, weight and height and clinical measurements from ECG signals, along with the diagnosis of a cardiologist into 16 different classes of arrhythmia (including none), which represents the target variable.

  • Affiliation of creators: Bilkent University; Baskent University.

  • Domain: cardiology.

  • Tasks in fairness literature: fair classification (donini2018empirical; mary2019fairnessaware), robust fair classification (rezaei2021robust), limited-label fair classification (chzhen2019leveraging).

  • Data spec: tabular data.

  • Sample size: patients.

  • Year: 1997.

  • Sensitive features: age, sex.

  • Link: https://archive.ics.uci.edu/ml/datasets/arrhythmia

  • Further info: guvenir1997supervised

a.16 Athletes and health professionals

  • Description: the datasets were developed to study the effects of bias in image classification. The health professional dataset (doctors and nurses) contains race and gender as sensitive features and the athlete dataset (basketball and volleyball players) contains gender and jersey color as sensitive features. Each subgroup, separated by combinations of sensitive features, is roughly balanced at around 200 images. The collected data was manually examined by the curators to remove stylized images and images containing both females and males.

  • Affiliation of creators: Massachusetts Institute of Technology.

  • Domain: computer vision.

  • Tasks in fairness literature: bias discovery (tong2020investigating).

  • Data spec: image.

  • Sample size: images of athletes and images of health professionals.

  • Year: 2020.

  • Sensitive features: gender (both), race (health professionals), jersey color (athletes).

  • Link: https://github.com/ghayat2/Datasets

  • Further info: tong2020investigating

a.17 Automated Student Assessment Prize (ASAP)

  • Description: this dataset was collected to evaluate the feasibility of automated essay scoring. It consists of a collection of essays by US students in grade levels 7–10, rated by at least two human raters. The dataset comes with a predefined training/validation/test split and powers the Hewlett Foundation Automated Essay Scoring competition on Kaggle. The curators tried to remove personally identifying information from the essays using a named entity recognizer (NER) and several heuristics.

  • Affiliation of creators: University of Akron; The Common Pool; OpenEd Solutions.

  • Domain: education.

  • Tasks in fairness literature: fair regression evaluation (madnadi2017:building).

  • Data spec: text.

  • Sample size: K student essays.

  • Year: 2012.

  • Sensitive features: none.

  • Link: https://www.kaggle.com/c/asap-aes/data/

  • Further info: shermis2014stateoftheart

a.18 Bank Marketing

  • Description: often simply called Bank dataset in the fairness literature, this resource was produced to support a study of success factors in telemarketing of long-term deposits within a Portuguese bank, with data collected over the period 2008–2010. Each data point represents a telemarketing phone call and includes client-specific features (e.g. job, education), features about the marketing phone call (e.g. day of the week and duration) and meaningful environmental features (e.g. euribor). The classification target is a binary variable indicating client subscription to a term deposit.

  • Affiliation of creators: Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, Lisboa; University of Minho.

  • Domain: marketing.

  • Tasks in fairness literature: fair classification (savani2020intraprocessing; baharlouei2020renyi; zafar2017fairness), fair clustering (abbasi2021fair; harb2020kfc; ahmadian2020fair; huang2019coresets; bera2019fair; mahabadi2020individual; backurs2019scalable; chierichetti2017fair), fair data summarization (halabi2020fairness), noisy fair classification (kilbertus2018blind), fairness evaluation (lipton2018does), limited-label fairness evaluation (ji2020can).

  • Data spec: tabular data.

  • Sample size: K phone contacts.

  • Year: 2012.

  • Sensitive features: age.

  • Link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

  • Further info: moro2014datadriven

a.19 Benchmarking Attribution Methods (BAM)

  • Description: this dataset was developed to evaluate different explainability methods in computer vision. It was constructed by pasting object pixels from MS-COCO (lin2014microsoft) into scene images from MiniPlaces (zhou2018:places). Objects are rescaled to a variable proportion between one third and one half of the scene images onto which they are pasted. Both scene images and object images belong to ten different classes, for a total of 100 possible combinations. Scene images were chosen between the ones that do not contain the objects from the ten MS-COCO classes. This dataset enables users to freely control how each object is correlated with scenes, from which ground truth explanations can be formed. The creators also propose a few quantitative metrics to evaluate interpretability methods by either contrasting different inputs in the same dataset or contrasting two models with the same input. A schematic compositing sketch follows this brief.

  • Affiliation of creators: Google.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (david2020debiasing).

  • Data spec: image.

  • Sample size: K images over 10 object classes and 10 image classes.

  • Year: 2020.

  • Sensitive features: none.

  • Link: https://github.com/google-research-datasets/bam

  • Further info: yang2019benchmarking
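The following is a minimal compositing sketch in the spirit of the construction described above, using Pillow; the file paths, mask handling, and exact rescaling policy are illustrative assumptions rather than the creators’ actual pipeline.

```python
import random
from PIL import Image

def paste_object(scene_path: str, object_path: str, mask_path: str) -> Image.Image:
    """Paste a segmented object onto a scene image, rescaled to roughly
    one third to one half of the scene width (illustrative, not the BAM pipeline)."""
    scene = Image.open(scene_path).convert("RGB")
    obj = Image.open(object_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")  # object segmentation mask

    # Rescale object (and mask) to a random fraction of the scene width.
    frac = random.uniform(1 / 3, 1 / 2)
    new_w = int(scene.width * frac)
    new_h = int(obj.height * new_w / obj.width)
    obj = obj.resize((new_w, new_h))
    mask = mask.resize((new_w, new_h))

    # Paste at a random location that keeps the object fully inside the scene.
    x = random.randint(0, max(scene.width - new_w, 0))
    y = random.randint(0, max(scene.height - new_h, 0))
    scene.paste(obj, (x, y), mask)
    return scene
```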

a.20 Bias in Bios

  • Description: this dataset was developed as a large-scale study of gender bias in occupation classification. It consists of online biographies of professionals scraped from the Common Crawl. Biographies are detected in crawls when they match the regular expression “name is a(n) title”, with title being one of twenty-eight common occupations. The gender of each person in the dataset is identified via the third person gendered pronoun, typically used in professional biographies. The envisioned task mirrors that of a job search automated system in a two-sided labor marketplace, i.e. automated occupation classification. The dataset curators provide python code to recreate the dataset from old Common Crawls. An illustrative matching sketch follows this brief.

  • Affiliation of creators: Carnegie Mellon University; University of Massachusetts Lowell; Microsoft; LinkedIn.

  • Domain: linguistics, information systems.

  • Tasks in fairness literature: fairness evaluation (dearteaga2019bias), fair classification (yurochkin2021sensei).

  • Data spec: text.

  • Sample size: K biographies.

  • Year: 2018.

  • Sensitive features: gender.

  • Link: https://github.com/Microsoft/biosbias

  • Further info: dearteaga2019bias
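The following minimal sketch illustrates the kind of pattern matching described above; the occupation list and the exact regular expression are illustrative assumptions, not the creators’ implementation.

```python
import re

# Illustrative subset of the twenty-eight occupation titles.
TITLES = ["professor", "nurse", "photographer", "attorney", "dentist"]

# "<name> is a(n) <title>" pattern, in the spirit of the one described above.
BIO_PATTERN = re.compile(
    r"^(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is an? (?P<title>" + "|".join(TITLES) + r")\b"
)

def detect_bio(sentence: str):
    """Return (name, occupation) if the sentence looks like a biography opener."""
    match = BIO_PATTERN.match(sentence)
    return (match.group("name"), match.group("title")) if match else None

print(detect_bio("Maria Rossi is a professor of chemistry."))
# ('Maria Rossi', 'professor')
```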

a.21 Bias in Translation Templates

  • Description: this resource was developed to study the problem of gender biases in machine translation. It consists of a set of short templates of the form One thing about the man/woman, [he/she] is [a ##], where [he/she] can be a gender-neutral or gender-specific pronoun, and [a ##] refers to a profession or conveys sentiment. Templates are built so that the part before the comma acts as a gender-specific clue, and the part after the comma contains information about gender and sentiment/profession. Accurate translations should correctly match the grammatical gender before and after the comma, in every word where it is required by the target language. The curators identify a set of languages to which this template is easily applicable, namely German, Korean, Portuguese, and Tagalog, which are chosen for their different properties with respect to grammatical gender. Depending on which language pair is being considered for translation, the curators identify a set of criteria for the evaluation of translation quality, with special emphasis on the correctness of grammatical gender. An instantiation sketch follows this brief.

  • Affiliation of creators: Seoul National University.

  • Domain: linguistics.

  • Tasks in fairness literature: bias evaluation of machine translation (cho2021towards).

  • Data spec: text.

  • Sample size: K templates.

  • Year: 2021.

  • Sensitive features: gender.

  • Link: https://github.com/nolongerprejudice/tgbi-x

  • Further info: cho2021towards
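The following minimal sketch illustrates how such templates can be instantiated before being fed to a translation system; the slot values below are illustrative assumptions, not the creators’ lexicons.

```python
from itertools import product

# Illustrative slot values; the creators' actual lexicons are larger.
GENDER_CLUES = ["man", "woman"]
PRONOUNS = ["he", "she", "they"]                     # gender-specific and gender-neutral
FILLERS = ["a doctor", "a nurse", "happy", "angry"]  # professions and sentiment words

TEMPLATE = "One thing about the {clue}, {pronoun} is {filler}."

def build_templates():
    """Enumerate source sentences to be fed to a machine translation system."""
    return [
        TEMPLATE.format(clue=clue, pronoun=pronoun, filler=filler)
        for clue, pronoun, filler in product(GENDER_CLUES, PRONOUNS, FILLERS)
    ]

sentences = build_templates()
print(len(sentences))   # 2 * 3 * 4 = 24
print(sentences[0])     # "One thing about the man, he is a doctor."
```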

a.22 Bing US Queries

  • Description: this dataset was created to investigate differential user satisfaction with the Bing search engine across different demographic groups. The authors selected log data of a random subset of Bing’s desktop and laptop users from the English-speaking US market over a two week period. The data was preprocessed by cleaning spam and bot queries, and it was enriched with user demographics, namely age (bucketed) and gender (binary), which were self-reported by users during account registration and automatically validated by the dataset curators. Moreover, queries were labeled with topic information. Finally, four different signals were extracted from search logs, namely graded utility, reformulation rate, page click count, and successful click count.

  • Affiliation of creators: Microsoft.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking evaluation (mehrotra2017auditing).

  • Data spec: query-result pairs.

  • Sample size: M (non-unique) queries issued by M distinct users.

  • Year: 2017.

  • Sensitive features: age, gender.

  • Link: not available

  • Further info: mehrotra2017auditing

a.23 Bold

  • Description: this resource is a benchmark to measure biases of language models with respect to sensitive demographic attributes. The creators identified six attributes (e.g. race, profession) and values of these attributes (e.g. African American, flight nurse), for which they gathered prompts from English-language Wikipedia, either from pages about the group (e.g. “A flight nurse is a registered”) or people representing it (e.g. “Over the years, Isaac Hayes was able”). Prompts are fed to different language models, whose outputs are automatically labelled for sentiment, regard, toxicity, emotion and gender polarity. These labels are also validated by human annotators hired on Amazon Mechanical Turk.

  • Affiliation of creators: Amazon; University of California, Santa Barbara.

  • Domain: linguistics.

  • Tasks in fairness literature: fairness evaluation of language models (dhamala2021bold).

  • Data spec: text.

  • Sample size: K prompts.

  • Year: 2021.

  • Sensitive features: gender, race, religion, profession, political leaning.

  • Link: https://github.com/amazon-research/bold

  • Further info: dhamala2021bold

a.24 BookCorpus

  • Description: this dataset was developed for the problem of learning general representations of text useful for different downstream tasks. It consists of text from 11,038 books by unpublished authors, collected from the web and available on https://www.smashwords.com/ in 2015. The BookCorpus contains thousands of duplicate books (only 7,185 are unique) and many are subject to copyright restrictions. The GPT (radford2018improving) and BERT (devlin2019BERT) language models were trained on this dataset.

  • Affiliation of creators: University of Toronto; Massachusetts Institute of Technology.

  • Domain: linguistics.

  • Tasks in fairness literature: data bias evaluation (tan2019assessing).

  • Data spec: text.

  • Sample size: 1B words in 74M sentences from 11K books.

  • Year: unknown.

  • Sensitive features: textual references to people and their demographics.

  • Link: not available

  • Further info: zhu2015aligning; bandy2021addressing

a.25 BUPT Faces

  • Description: this resource consists of two datasets, developed as a large scale collection suitable for training face verification algorithms operating on diverse populations. The underlying data collection procedure mirrors the one from RFW (§ A.141), including sourcing from MS-Celeb-1M and automated annotation of so-called race into one of four categories: Caucasian, Indian, Asian and African. For categories where not enough images were readily available, the authors resorted to the FreeBase celebrity list, downloading images of people from Google and cleaning them “both automatically and manually”. The remaining images were obtained from MS-Celeb-1M (§ A.113), on which the BUPT Faces datasets are heavily based.

  • Affiliation of creators: Beijing University of Posts and Telecommunications.

  • Domain: computer vision.

  • Tasks in fairness literature: fair reinforcement learning (wang2020mitigating).

  • Data spec: image.

  • Sample size: M images of K celebrities (BUPT-Globalface); M images of K celebrities (BUPT-Balancedface).

  • Year: 2019.

  • Sensitive features: race.

  • Link: http://www.whdeng.cn/RFW/Trainingdataste.html

  • Further info: wang2020mitigating

a.26 Burst

  • Description: Burst is a free provider of stock photography powered by Shopify. This dataset features a subset of Burst images used as a resource to test algorithms for fair image retrieval and ranking, aimed at providing, in response to a query, a collection of photos that is balanced across demographics. Images come with human-curated tags annotated internally by the Burst team.

  • Affiliation of creators: Shopify.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking (karako2018using).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2021.

  • Sensitive features: gender.

  • Link: not available

  • Further info: karako2018using; https://burst.shopify.com/

a.27 Business Entity Resolution

  • Description: this is a proprietary Google dataset in which the task is to predict whether a pair of business descriptions describe the same real-world business.

  • Affiliation of creators: Google.

  • Domain: linguistics.

  • Tasks in fairness literature: fair entity resolution (cotter2019training).

  • Data spec: text.

  • Sample size: 15K samples.

  • Year: 2019.

  • Sensitive features: geography, business size.

  • Link: not available

  • Further info: cotter2019training

a.28 Cars3D

  • Description: this dataset consists of CAD-generated models of 199 cars rendered from 24 rotation angles. Originally devised for visual analogy making, it is also used for more general research on learning disentangled representations.

  • Affiliation of creators: University of Michigan.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (locatello2019fairness).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2020.

  • Sensitive features: none.

  • Link: https://github.com/google-research/disentanglement_lib/tree/master/disentanglement_lib/data/ground_truth

  • Further info: reed2015deep

a.29 CelebA

  • Description: CelebFaces Attributes Dataset (CelebA) features images of celebrities from the CelebFaces dataset, augmented with annotations of landmark locations and binary attributes. The attributes, ranging from highly subjective (e.g. attractive, big nose) and potentially offensive ones (e.g. double chin) to more objective ones (e.g. black hair), were annotated by a “professional labeling company”.

  • Affiliation of creators: Chinese University of Hong Kong.

  • Domain: computer vision.

  • Tasks in fairness literature: fair classification (savani2020intraprocessing; kim2019multiaccuracy; chuang2021fair; lohaus2020too; creager2019flexibly), fair anomaly detection (zhang2021towards), bias discovery (amini2019uncovering), fairness evaluation of private classification (cheng2021can), fairness evaluation of selective classification (jones2021selective), fairness evaluation (wang2020towards), fair representation learning (quadrianto2019discovering), fair data summarization (chiplunkar2020how), fair data generation (choi2020fair).

  • Data spec: image.

  • Sample size: K face images of over K unique individuals.

  • Year: 2015.

  • Sensitive features: gender, age, skin tone.

  • Link: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

  • Further info: liu2015deep

a.30 CheXpert

  • Description: this dataset consists of chest X-ray images from patients that have been treated at the Stanford Hospital between October 2002 and July 2017. Each radiograph, either frontal or lateral, is annotated for the presence of 14 observations related to medical conditions. Most annotations were automatically extracted from free text radiology reports and validated against a set of 1,000 held-out reports, manually reviewed by a radiologist. For a subset of the X-ray images, high-quality labels are provided by a group of 3 radiologists. The task associated with this dataset is the automated diagnosis of medical conditions from radiographs.

  • Affiliation of creators: Stanford University.

  • Domain: radiology.

  • Tasks in fairness literature: fairness evaluation of selective classification (jones2021selective), fairness evaluation of private classification (cheng2021can).

  • Data spec: image.

  • Sample size: K chest radiographs from K patients.

  • Year: 2019.

  • Sensitive features: sex, age (of patient).

  • Link: https://stanfordmlgroup.github.io/competitions/chexpert/

  • Further info: irvin2019chexpert; garbin2021structured

a.31 Cifar

  • Description: CIFAR-10 and CIFAR-100 are a labelled subset of the 80 million tiny images database. CIFAR consists of 32x32 colour images that students were paid to annotate. The project, aimed at advancing the effectiveness of supervised learning techniques in computer vision, was funded by the Canadian Institute for Advanced Research, after which the dataset is named.

  • Affiliation of creators: University of Toronto.

  • Domain: computer vision.

  • Tasks in fairness literature: fairness evaluation (wang2020towards), fair incremental learning (zhao2020maintaining), robust fairness evaluation (nanda2021fairness).

  • Data spec: image.

  • Sample size: images x 10 classes (CIFAR-10) or images x 100 classes (CIFAR-100).

  • Year: 2009.

  • Sensitive features: none.

  • Link: https://www.cs.toronto.edu/~kriz/cifar.html

  • Further info: krizhevsky2009learning

a.32 CiteSeer Papers

  • Description: this dataset was created to study the problem of link-based classification of connected entities. The creators extracted a network of papers from CiteSeer, belonging to one of six categories: Agents, Artificial Intelligence, Database, Human Computer Interaction, Machine Learning and Information Retrieval. Each article is associated with a bag-of-word representation, and the associated task is classification into one of six topics.

  • Affiliation of creators: University of Maryland.

  • Domain: library and information sciences.

  • Tasks in fairness literature: fair graph mining (li2021on).

  • Data spec: paper-paper pairs.

  • Sample size: K articles connected by K citations.

  • Year: 2016.

  • Sensitive features: none.

  • Link: http://networkrepository.com/citeseer.php

  • Further info: lu2003:link

a.33 Civil Comments

  • Description: this dataset derives from an archive of the Civil Comments platform, a browser plugin for independent news sites, whose users peer-reviewed each other’s comments with civility ratings. When the plugin shut down, its operators decided to make the comments and metadata available, including the crowd-sourced toxicity ratings. A subset of this dataset was later annotated with a variety of sensitive attributes, capturing whether members of a certain group are mentioned in comments. This dataset powers the Jigsaw Unintended Bias in Toxicity Classification challenge.

  • Affiliation of creators: Jigsaw; Civil Comments.

  • Domain: social media.

  • Tasks in fairness literature: fair toxicity classification (adragna2020fairness; yurochkin2021sensei; chuang2021fair), fairness evaluation of selective classification (jones2021selective), fair robust toxicity classification (adragna2020fairness), fairness evaluation of toxicity classification (hutchinson2020unintented), fairness evaluation (babaeianjelodar2020quantifying).

  • Data spec: text.

  • Sample size: M comments.

  • Year: 2019.

  • Sensitive features: race/ethnicity, gender, sexual orientation, religion, disability.

  • Link: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification

  • Further info: borkan2019nuanced

a.34 Climate Assembly UK

  • Description: this resource was curated to study the problem of subset selection for sortition, a political system where decisions are taken by a subset of the whole voting population selected at random. The data describes participants in Climate Assembly UK, a panel organized by the Sortition Foundation in 2020 with the goal of understanding public opinion on how the UK can meet greenhouse gas emission targets. The panel consisted of 110 UK residents selected from a pool of 1,715 who responded to an invitation sent by the Sortition Foundation to citizens. Features for each subject in the pool describe their demographics and climate concern level.

  • Affiliation of creators: Carnegie Mellon University; Harvard University; Sortition Foundation.

  • Domain: political science.

  • Tasks in fairness literature: fair subset selection (flanigan2020neutralizing).

  • Data spec: tabular data.

  • Sample size: K pool participants.

  • Year: 2020.

  • Sensitive features: gender, age, education, urban/rural, geography, ethnicity.

  • Link: not available

  • Further info: flanigan2020neutralizing; https://www.climateassembly.uk/

a.35 Columbia University Speed Dating

  • Description: this dataset is the result of a speed dating experiment aimed at understanding preferences in mate selection in men and women. Subjects were recruited from students at Columbia University. Fourteen rounds were conducted with different proportions of male and female subjects over the period 2002–2004, with participants meeting each potential mate for four minutes and rating them thereafter on six attributes. They also provided an overall evaluation of each potential mate and a binary decision indicating interest in meeting again. Before an event, each participant filled in a survey disclosing their preferences, expectations, and demographics. The inference task associated with this dataset is optimal recommendation in symmetrical two-sided markets.

  • Affiliation of creators: Columbia University; Harvard University; Stanford University.

  • Domain: sociology.

  • Tasks in fairness literature: fair matching (zheng2018fairness), preference-based fair ranking (paraschakis2020matchmaking).

  • Data spec: person-person pairs.

  • Sample size: K dating records involving people.

  • Year: 2016.

  • Sensitive features: gender, age, race, geography.

  • Link: https://data.world/annavmontoya/speed-dating-experiment

  • Further info: fisman2006gender

a.36 Communities and Crime

  • Description: this dataset was curated to develop a software tool supporting the work of US police departments. It was especially aimed at identifying similar precincts to exchange best practices and share experiences among departments. The creators were supported by the police departments of Camden (NJ) and Philadelphia (PA). The factors included in the dataset were the ones deemed most important to define similarity of communities from the perspective of law enforcement; they were chosen with the help of law enforcement officials from partner institutions and academics of criminal justice, geography and public policy. The dataset includes socio-economic factors (aggregate data on age, income, immigration, and racial composition) obtained from the 1990 US census, along with information about policing (e.g. number of police cars available) based on the 1990 Law Enforcement Management and Administrative Statistics survey, and crime data derived from the 1995 FBI Uniform Crime Reports. In its released version on UCI, the task associated with the dataset is predicting the total number of violent crimes per 100K population in each community. The most referenced version of this dataset was preprocessed with a normalization step; after receiving multiple requests, the creators also published an unnormalized version.

  • Affiliation of creators: La Salle University; Rutgers University.

  • Domain: law.

  • Tasks in fairness literature: fair classification (yang2020fairness; Sharifi2019average; heidari2018fairness; lohaus2020too; cotter2019training; creager2019flexibly; cotter2018training), fair regression evaluation (heidari2019moral), fair few-shot learning (slack2020fairness; slack2019fair), rich-subgroup fairness evaluation (kearns2019empirical), rich-subgroup fair classification (kearns2018preventing), fair regression (chzhen2020fair; chzhen2020fairwassertein; romano2020achieving; agarwal2019fair; mary2019fairnessaware; komiyama2018nonconvex; ogura2020convex; berk2017convex), fair representation learning (ruoss2020learning), robust fair classification (mandal2020ensuring), fair private classification (jagielski2019differentially), fairness evaluation of transfer learning (lan2017discriminatory).

  • Data spec: tabular data.

  • Sample size: communities.

  • Year: 2009.

  • Sensitive features: race, geography.

  • Link: https://archive.ics.uci.edu/ml/datasets/communities+and+crime and http://archive.ics.uci.edu/ml/datasets/communities+and+crime+unnormalized

  • Further info: redmond2002datadriven

a.37 Compas

  • Description: this dataset was created for an external audit of racial biases in the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) risk assessment tool developed by Northpointe (now Equivant), which estimates the likelihood of a defendant becoming a recidivist. Instances represent defendants scored by COMPAS in Broward County, Florida, between 2013 and 2014, reporting their demographics, criminal record, custody and COMPAS scores. Defendants’ public criminal records were obtained from the Broward County Clerk’s Office website, matching them based on date of birth, first and last names. The dataset was augmented with jail records and COMPAS scores provided by the Broward County Sheriff’s Office. Finally, public incarceration records were downloaded from the Florida Department of Corrections website. Instances are associated with two target variables (is_recid and is_violent_recid), indicating whether defendants were booked in jail for a criminal offense (potentially violent) that occurred after their COMPAS screening but within two years. See Appendix C for extensive documentation; a minimal evaluation sketch follows this brief.

  • Affiliation of creators: ProPublica.

  • Domain: law.

  • Tasks in fairness literature: fair classification (he2020geometric; sharma2020data; goel2018nondiscriminatory; oneto2019taking; celis2019classification; canetti2019from; cho2020fair; savani2020intraprocessing; donini2018empirical; heidari2018fairness; russell2017when; quadrianto2017recycling; calmon2017optimized; diciccio2020evaluating; xu2020algorithmic; vargo2021individually; roh2021fairbatch; maity2021statistical; lohaus2020too; roh2020frtrain; celis2020data; cotter2019training; mary2019fairnessaware; wang2019repairing; delobelle2020ethical; ogura2020convex; lum2016statistical; zafar2017fairnessbeyond; berk2017convex; wadsworth2018achieving), fairness evaluation (cardoso2019framework; mcnamara2019equalized; kasy2021fairness; taskesen2021statistical; friedler2019comparative; wick2019unlocking; zhang2018equality; pleiss2017fairness; chaibubneto2020causal; speicher2018unified; corbettdavies2017algorithmic; liu2019implicit; agarwal2018reductions; ngong2020towards; jabbari2020empirical; chouldechova2017fair; grgichlaca2016case), fair risk assessment (coston2020counterfactual; mishler2021fairness; nabi2019learning), fair task assignment (goel2019crowdsourcing) for crowdsourced judgements, noisy fair classification (lahoti2020fairness; lamy2019noisetolerant; chzhen2019leveraging; kilbertus2018blind), data bias evaluation (beretta2021detecting), fair representation learning (ruoss2020learning; zhao2020conditional; bower2018debiasing), robust fair classification (mandal2020ensuring; rezaei2021robust), dynamical fairness evaluation (zhang2020how), fair reinforcement learning (metevier2019offline), fair ranking evaluation (kallus2019fairness; yang2017measuring), fair multi-stage classification (madras2018predict), dynamical fair classification (valera2018enhancing), preference-based fair classification (zafar2017from; ustun2019fairness), fair regression (komiyama2018nonconvex), fair multi-stage classification (goel2020importance), limited-label fair classification (chzhen2019leveraging; wang2021fair; choi2020group), robust fairness evaluation (slack2020fairness; slack2019fairness), rich subgroup fairness evaluation (chouldechova2017fairer; zhang2017identifying).

  • Data spec: tabular data.

  • Sample size: K defendants.

  • Year: 2016.

  • Sensitive features: sex, age, race.

  • Link: https://github.com/propublica/compas-analysis

  • Further info: angwin2016machine; larson2016how
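The following minimal sketch illustrates a common fairness-evaluation use of this resource, computing observed two-year recidivism rates and high-risk rates by race; the column names refer to the compas-scores-two-years.csv file in the repository linked above, and the filtering convention shown is only one of several adopted in the surveyed works.

```python
import pandas as pd

# Raw CSV from the ProPublica repository linked above (file name as in the repo).
URL = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")

df = pd.read_csv(URL)

# A common (ProPublica-style) filter: keep screenings close to the arrest date
# and drop records with missing recidivism information.
df = df[df["days_b_screening_arrest"].between(-30, 30) & (df["is_recid"] != -1)]

# Observed two-year recidivism rates and high-risk rates (decile score >= 7) by race.
summary = df.groupby("race").agg(
    recidivism_rate=("two_year_recid", "mean"),
    high_risk_rate=("decile_score", lambda s: (s >= 7).mean()),
)
print(summary)
```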

a.38 Cora Papers

  • Description: this resource was produced within the wider development effort for Cora, an Internet portal for computer science research papers available in the early 2000s. The portal supported keyword search, topical categorization of articles, and citation mapping. This dataset consists of articles and citation links between them. It contains bag-of-word representations for the text of each article, and the associated task is classification into one of seven topics.

  • Affiliation of creators: Just Research; Carnegie Mellon University; Massachusetts Institute of Technology; University of Maryland; Lawrence Livermore National Laboratory.

  • Domain: library and information sciences.

  • Tasks in fairness literature: .

  • Data spec: article-article pairs.

  • Sample size: K articles connected by K citations.

  • Year: 2019.

  • Sensitive features: none.

  • Link: https://relational.fit.cvut.cz/dataset/CORA

  • Further info: mccallum2000automating; sen2008collective

a.39 Costarica Household Survey

  • Description: this data comes from the national household survey of Costa Rica, performed by the national institute of statistics and census (Instituto Nacional de Estadística y Censos). The survey is aimed at measuring the socio-economic situation in the country and informing public policy. The data collection procedure is specifically designed to allow for precise conclusions with respect to six different regions of the country and about differences between urban and rural areas; stratification along these variables is deemed suitable. The 2018 survey contains a special section on the crimes suffered by respondents.

  • Affiliation of creators: Instituto Nacional de Estadística y Censos.

  • Domain: economics.

  • Tasks in fairness literature: fair classification (noriegacampero2020algorithmic).

  • Data spec: tabular data.

  • Sample size: K households.

  • Year: 2018.

  • Sensitive features: sex, age, birthplace, disability, geography, family size.

  • Link: https://www.inec.cr/encuestas/encuesta-nacional-de-hogares

  • Further info: https://www.inec.cr/sites/default/files/documetos-biblioteca-virtual/enaho-2018.pdf

a.40 Credit Card Default

  • Description: this dataset was built to investigate automated mechanisms for credit card default prediction following a wave of defaults in Taiwan connected to patterns of card over-issuing and over-usage. The dataset contains the payment history of customers of an important Taiwanese bank, from April to October 2005. Demographics, marital status, and education of customers are also provided, along with the amount of credit and a binary variable encoding default on payment, which is the target variable of the associated task.

  • Affiliation of creators: Chung-Hua University; Thompson Rivers University.

  • Domain: finance.

  • Tasks in fairness literature: fair classification (cho2020fair; berk2017convex), fair clustering (harb2020kfc; ghadiri2021socially; harb2020kfc; bera2019fair), noisy fair clustering (esmaeili2020probabilistic), noisy fair classification (wang2020robust), fair data summarization (Tantipongpipat2019multicriteria; samadi2018price), fairness evaluation (lipton2018does).

  • Data spec: tabular data.

  • Sample size: K credit card holders.

  • Year: 2016.

  • Sensitive features: gender, age.

  • Link: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

  • Further info: yeh2009comparisons

a.41 Credit Elasticities

  • Description: this dataset stems from a randomized trial conducted by a consumer lender in South Africa to study loan price elasticity. Prior customers were contacted by mail with limited-time loan offers at variable and randomized interest rates. The aim of the study was understanding the relationship between interest rate and customer acceptance rates, along with the benefits for the lender. Customers who accepted and received formal approval filled in a short survey with factors of interest for the study, including demographics, education, and prior borrowing history.

  • Affiliation of creators: Yale University; Dartmouth College.

  • Domain: finance.

  • Tasks in fairness literature: fair pricing evaluation (kallus2021fairness).

  • Data spec: tabular data.

  • Sample size: K clients.

  • Year: 2008.

  • Sensitive features: gender, age, geography.

  • Link: http://doi.org/10.3886/E113240V1

  • Further info: karlan2008credit

a.42 Crowd Judgement

  • Description: this dataset was assembled to compare the performance of the COMPAS recidivism risk prediction system against that of non-expert human assessors (dressel2018accuracy). A subset of 1,000 defendants were selected from the COMPAS dataset. Crowd-sourced assessors were recruited through Amazon Mechanical Turk. They were presented with a summary of each defendant, including demographics and previous criminal history, and asked to predict whether they would recidivate within 2 years of their most recent crime. These judgements, assembled via plain majority voting, ended up exhibiting accuracy and fairness levels comparable to those displayed by the COMPAS system. While this dataset was assembled for an experiment, it was later used to study the problem of fairness in crowdsourced judgements. A minimal aggregation sketch follows this brief.

  • Affiliation of creators: Dartmouth College.

  • Domain: law.

  • Tasks in fairness literature: fair truth discovery (li2020towards), fair task assignment (li2020towards; goel2019crowdsourcing) (for crowdsourced judgements).

  • Data spec: judge-defendant pair.

  • Sample size: K defendants from COMPAS and crowd-sourced labellers. Each defendant is judged by 20 different labellers.

  • Year: 2018.

  • Sensitive features: sex, age and race of defendants and crowd-sourced judges.

  • Link: https://farid.berkeley.edu/downloads/publications/scienceadvances17/

  • Further info: dressel2018accuracy

  • Variants: a similar dataset was collected by wang2019empirical.
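The following minimal sketch illustrates plain majority voting over crowd judgements; the table and column names are illustrative assumptions rather than the released data layout.

```python
import pandas as pd

# Illustrative long-format table: one row per (defendant, labeller) judgement.
judgements = pd.DataFrame({
    "defendant_id": [1, 1, 1, 2, 2, 2],
    "labeller_id":  ["a", "b", "c", "a", "b", "c"],
    "predicted_recidivism": [1, 1, 0, 0, 0, 1],
})

# Plain majority vote per defendant (ties broken towards the negative class here).
majority = (
    judgements.groupby("defendant_id")["predicted_recidivism"]
    .mean()
    .gt(0.5)
    .astype(int)
    .rename("majority_vote")
)
print(majority)
# defendant_id 1 -> 1, defendant_id 2 -> 0
```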

a.43 Curatr British Library Digital Corpus

  • Description: this dataset is a subset of English language digital texts from the British Library focused on volumes of 19th-century fiction, obtained through the Curatr platform. It was selected for the well-researched presence of stereotypical and binary concepts of gender in this literary production. The goal of the creators was studying gender biases in large text corpora and their relationship with biases in word embeddings trained on those corpora.

  • Affiliation of creators: University College Dublin.

  • Domain: literature.

  • Tasks in fairness literature: data bias evaluation (leavy2020mitigating).

  • Data spec: text.

  • Sample size: K books.

  • Year: 2020.

  • Sensitive features: textual references to people and their demographics.

  • Link: http://curatr.ucd.ie/

  • Further info: leavy2019curatr

a.44 CVs from Singapore

  • Description: this dataset was developed to test demographic biases in resume filtering. In particular, the authors studied nationality bias in automated resume filtering in Singapore, across the three major ethnic groups of the city state: Chinese, Malaysian and Indian. The dataset consists of 135 resumes (45 per ethnic group) used for application to finance jobs in Singapore, collected by Jai Janyani. The dataset only includes resumes for which the origin of the candidates can be reliably inferred to be either Chinese, Malaysian, or Indian from education and initial employment. The dataset also comprises 9 finance job postings from China, Malaysia, and India (3 per country). All job-resume pairs are rated for relevance/suitability by three annotators.

  • Affiliation of creators: University of Maryland.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking (deshpande2020mitigating).

  • Data spec: text.

  • Sample size: resumes.

  • Year: 2020.

  • Sensitive features: ethnic group.

  • Link: not available

  • Further info: deshpande2020mitigating

a.45 Dallas Police Incidents

  • Description: this dataset is due to the Dallas OpenData initiative (https://www.dallasopendata.com/) and “reflects crimes as reported to the Dallas Police Department” beginning June 1, 2014. Each incident comes with rich spatio-temporal data, information about the victim, the officers involved and the type of crime. A subset of the dataset is available on Kaggle (https://www.kaggle.com/carrie1/dallaspolicereportedincidents).

  • Affiliation of creators: Dallas Police Department.

  • Domain: law.

  • Tasks in fairness literature: fair spatio-temporal process learning (shang2020listwise).

  • Data spec: tabular.

  • Sample size: K incidents.

  • Year: 2021.

  • Sensitive features: age, race, and gender (of victim), geography.

  • Link: https://www.dallasopendata.com/Public-Safety/Police-Incidents/qv6i-rri7

  • Further info:

a.46 Demographics on Twitter

  • Description: this dataset was developed to test demographic classifiers on Twitter data. In particular, the tasks associated with this resource are the automatic inference of gender, age, location and political orientation of users. The true values for these attributes, which act as a ground truth for learning algorithms, were inferred from tweets and user bios, e.g. those matching the pattern “I’m a <gendered noun>”, with gendered nouns including mother, woman, father, and man (a minimal sketch of this heuristic follows this entry).

  • Affiliation of creators: Massachusetts Institute of Technology.

  • Domain: social media.

  • Tasks in fairness literature: fairness evaluation of sentiment analysis (shen2018darling).

  • Data spec: mixture.

  • Sample size: K profiles.

  • Year: 2017.

  • Sensitive features: gender, age, political orientation, geography.

  • Link: not available

  • Further info: vijayaraghavan2017twitter
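
The self-report heuristic mentioned in the description can be illustrated with a small regular-expression sketch; the exact pattern and the noun lists used by the dataset creators are not reported here, so both are assumptions.

    # Illustrative sketch of the "I'm a <gendered noun>" heuristic; the pattern
    # and noun lists are assumptions, not the creators' exact implementation.
    import re

    GENDERED_NOUNS = {
        "female": {"mother", "mom", "woman", "girl", "wife"},
        "male": {"father", "dad", "man", "boy", "husband"},
    }
    PATTERN = re.compile(r"\bI['\u2019]?m an? (\w+)", re.IGNORECASE)

    def infer_gender(bio):
        """Return 'female', 'male', or None from a self-description in the bio."""
        for match in PATTERN.finditer(bio):
            noun = match.group(1).lower()
            for gender, nouns in GENDERED_NOUNS.items():
                if noun in nouns:
                    return gender
        return None

    print(infer_gender("Coffee lover. I'm a mother of two."))  # -> 'female'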

a.47 Diabetes 130-US Hospitals

  • Description: this dataset contains 10 years of care data from 130 US hospitals extracted from Health Facts, a clinical database associated with a multi-institution data collection program. The dataset was extracted to study the association between the measurement of HbA1c (glycated hemoglobin) in human bloodstream and early hospital readmission, and was donated to UCI in 2014. The dataset includes patient demographics, in-hospital procedures, and diagnoses, along with information about subsequent readmissions.

  • Affiliation of creators: Virginia Commonwealth University; University of Cordoba; Polish Academy of Sciences.

  • Domain: endocrinology.

  • Tasks in fairness literature: fair clustering (chierichetti2017fair; bera2019fair; backurs2019scalable; mahabadi2020individual; huang2019coresets; bera2019fair).

  • Data spec: tabular data.

  • Sample size: K patients.

  • Year: 2014.

  • Sensitive features: age, race, gender.

  • Link: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008

  • Further info: strack2014impact

a.48 Diversity in Faces (DiF)

  • Description: this large dataset was created to favour the development and evaluation of robust face analysis algorithms across diverse demographics and domain-specific features, such as craniofacial distances and facial contrast. One million images of people’s faces from Flickr were labelled, mostly automatically, according to 10 different coding schemes, comprising, e.g., cranio-facial measurements, pose, and demographics. Age and gender were inferred both automatically and by human workers. Statistics about the diversity of this dataset along these coded measures are available in the accompanying report.

  • Affiliation of creators: IBM.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (quadrianto2019discovering), fairness evaluation of private classification (bagdasaryan2019differential).

  • Data spec: image.

  • Sample size: M images.

  • Year: 2019.

  • Sensitive features: skin color, age, and gender.

  • Link: https://www.ibm.com/blogs/research/2019/01/diversity-in-faces/

  • Further info: merler2019diversity

a.49 Drug Consumption

  • Description: this dataset was collected by Elaine Fehrman between March 2011 and March 2012, after receiving approval from the relevant ethics boards at the University of Leicester. The goal of this dataset is to seek patterns connecting an individual’s risk of drug consumption with demographics and psychometric measurements of the Big Five personality traits (NEO-FFI-R), impulsivity (BIS-11), and sensation seeking (ImpSS). The study employed an online survey tool from Survey Gizmo to recruit participants world-wide; over 93% of the final usable sample reported living in an English-speaking country. Target variables summarize the consumption of 18 psychoactive substances on an ordinal scale ranging from never using the drug to using it over a decade ago, or in the last decade, year, month, week, or day. The 18 substances considered in the study are classified as central nervous system depressants, stimulants, or hallucinogens and comprise the following: alcohol, amphetamines, amyl nitrite, benzodiazepines, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, magic mushrooms, nicotine, and Volatile Substance Abuse (VSA), along with one fictitious drug (Semeron) introduced to identify over-claimers (a minimal screening sketch follows this entry). A version of the dataset donated to the UCI Machine Learning Repository is associated with 18 prediction tasks, i.e. one per substance.

  • Affiliation of creators: Rampton Hospital; Nottinghamshire Healthcare NHS Foundation Trust; University of Leicester; University of Nottingham; University of Salahaddin.

  • Domain: applied psychology.

  • Tasks in fairness literature: fair classification (donini2018empirical; mary2019fairnessaware), evaluation of data bias (beretta2021detecting), limited-label fair classification (chzhen2019leveraging), robust fair classification (rezaei2021robust).

  • Data spec: tabular data.

  • Sample size: K respondents.

  • Year: 2016.

  • Sensitive features: age, gender, ethnicity, geography.

  • Link: https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29

  • Further info: fehrman2017five; fehrman2019personality
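
The over-claimer screening mentioned in the description can be sketched as follows; the column name ("semer") and the "CL0" code for never-used are assumptions about the UCI encoding.

    # Sketch of screening out over-claimers via the fictitious drug Semeron:
    # anyone reporting any Semeron use is flagged as unreliable. The column
    # name "semer" and the never-used code "CL0" are assumed encodings.
    import pandas as pd

    df = pd.read_csv("drug_consumption.csv")      # hypothetical local copy
    reliable = df[df["semer"] == "CL0"]           # respondents who never "used" Semeron
    print(f"Flagged {len(df) - len(reliable)} over-claimers out of {len(df)} respondents.")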

a.50 DrugNet

  • Description: this dataset was collected to study drug consumption patterns in connection with social ties and behaviour of drug users. The work puts particular emphasis on situations at risk of disease transmission and on the opportunity for prevention via recruitment of peer educators to demonstrate, disseminate and support HIV prevention practices among their connections. Participants were recruited in Hartford neighbourhoods with high drug-use activity, mostly via street outreach and recruitment by early participants. Eligibility criteria included being at least 18 years old, using an illicit drug, and signing an informed consent form. Each participant provided data about their drug use, most common sites of usage, HIV risk practices associated with drug use and sexual behavior, social ties deemed important by the respondent, and their demographics.

  • Affiliation of creators: Institute for Community Research of Hartford; Hispanic Health Council, Hartford; Boston College.

  • Domain: social work, social networks.

  • Tasks in fairness literature: fair graph clustering (kleindessner2019guarantees).

  • Data spec: person-person pairs.

  • Sample size: people.

  • Year: 2016.

  • Sensitive features: ethnicity, sex, age.

  • Link: https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/drugnet

  • Further info: weeks2002social

a.51 dSprites

  • Description: this dataset was assembled by researchers affiliated with Google DeepMind as an artificial benchmark for unsupervised methods aimed at learning disentangled data representations. Each image in the dataset consists of a black-and-white sprite with variable shape, scale, orientation and position. Together these are the generative factors underlying each image. Ideally, systems trained on this data should learn disentangled representations, such that latent image representations are clearly associated with changes in a single generative factor.

  • Affiliation of creators: Google.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (locatello2019fairness; creager2019flexibly).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2017.

  • Sensitive features: none.

  • Link: https://github.com/deepmind/dsprites-dataset

  • Further info: Higgins2017betaVAELB

a.52 Dutch Census

  • Description: this dataset was derived from the 2001 census carried out by the Dutch Central Bureau for Statistics to gather data about family composition, economic activities, levels of education, and occupation of Dutch citizens and foreigners from various countries of origin. A version of the dataset commonly employed in the fairness research literature has been preprocessed and made available online. The associated task is the classification of individuals into high-income and low-income professions.

  • Affiliation of creators: Bournemouth University; TU Eindhoven.

  • Domain: demography.

  • Tasks in fairness literature: fair classification (agarwal2018reductions; xu2020algorithmic; zhang2017achieving; lohaus2020too), fairness evaluation (cardoso2019framework).

  • Data spec: tabular data.

  • Sample size: K respondents.

  • Year: 2001.

  • Sensitive features: sex, age, citizenship.

  • Link: https://sites.google.com/site/conditionaldiscrimination/

  • Further info: zliobaite2011handling; https://microdata.worldbank.org/index.php/catalog/2102/data-dictionary/F2?file_name=NLD2001-P-H; https://www.cbs.nl/nl-nl/publicatie/2004/31/the-dutch-virtual-census-of-2001

a.53 EdGap

  • Description: this dataset focuses on education performance in different US counties, with a focus on inequality of opportunity and its connection to socioeconomic factors. Along with average SAT and ACT test scores by county, this dataset reports socioeconomic data from the American Community Survey by the Bureau of Census, including household income, unemployment, adult educational attainment, and family structure. Importantly, some states require all students to take ACT or SAT tests, while others do not. As a result, average test scores are inherently higher in states that do not require all students to test, and they are not directly comparable to average scores in states where testing is mandatory.

  • Affiliation of creators: Memphis Teacher Residency.

  • Domain: education.

  • Tasks in fairness literature: fair risk assessment (he2020inherent).

  • Data spec: tabular data.

  • Sample size: K counties.

  • Year: 2021.

  • Sensitive features: geography.

  • Link: https://www.edgap.org/

  • Further info:

a.54 Epileptic Seizures

  • Description: this dataset was curated to study electroencephalographic (EEG) time series in relation to epilepsy. The dataset consists of EEG recordings from healthy volunteers with eyes closed and eyes open, and from epilepsy patients during seizure-free intervals and during epileptic seizures. Each recording lasts 23.6 seconds. A version of this dataset, used in fairness research, was donated to the UCI Machine Learning Repository by researchers affiliated with the Rochester Institute of Technology in 2017, with a classification task based on the patients’ condition and state at the time of recording. The data was later removed from UCI at the original curators’ request.

  • Affiliation of creators: University of Bonn.

  • Domain: neurology.

  • Tasks in fairness literature: robust fairness evaluation (black2021leaveoneout).

  • Data spec: time series.

  • Sample size: individuals, each summarized by a K-point time series.

  • Year: 2017.

  • Sensitive features: none.

  • Link: https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition; http://epileptologie-bonn.de/cms/upload/workgroup/lehnertz/eegdata.html

  • Further info: andrzejak2001indications

a.55 Equity Evaluation Corpus (EEC)

  • Description: this dataset was compiled to audit sentiment analysis systems for gender and race bias. It is based on 11 short sentence templates; 7 templates include emotion words, while the remaining 4 do not. Moreover, each sentence includes one gender- or race-associated word, such as names predominantly associated with African American or European American people. Gender-related words consist of names, nouns, and pronouns. A minimal template-instantiation sketch follows this entry.

  • Affiliation of creators: National Research Council Canada.

  • Domain: linguistics.

  • Tasks in fairness literature: fair sentiment analysis evaluation (liang2020artificial).

  • Data spec: text.

  • Sample size: K sentences.

  • Year: 2018.

  • Sensitive features: race, gender.

  • Link: https://saifmohammad.com/WebPages/Biases-SA.html

  • Further info: kiritchenko2018examining
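
The template-and-slot construction described above can be approximated with the short sketch below; the templates and name lists are illustrative stand-ins rather than the exact EEC materials.

    # Illustrative generation of EEC-style sentences from templates; templates
    # and name lists are stand-ins, not the exact EEC materials.
    from itertools import product

    templates = [
        "{person} feels {emotion}.",
        "The situation makes {person} feel {emotion}.",
    ]
    names = {
        "African American female": ["Latisha"],
        "European American male": ["Adam"],
    }
    emotions = ["angry", "happy"]

    corpus = []
    for template, (group, group_names), emotion in product(templates, names.items(), emotions):
        for name in group_names:
            corpus.append((template.format(person=name, emotion=emotion), group, emotion))

    for sentence, group, emotion in corpus:
        print(f"{sentence}  [{group}, {emotion}]")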

a.56 Facebook Ego-networks

  • Description: this dataset was collected to study the problem of identifying users’ social circles, i.e. categorizing links between nodes in a social network. The data represents ten ego-networks whose central user was asked to fill in a survey and manually identify the circles to which their friends belonged. Features from each profile, including education, work and location are anonymized.

  • Affiliation of creators: Stanford University.

  • Domain: social networks.

  • Tasks in fairness literature: fair graph mining (li2021on).

  • Data spec: user-user pairs.

  • Sample size: K people connected by K friend relations.

  • Year: 2012.

  • Sensitive features: geography, gender.

  • Link: https://snap.stanford.edu/data/egonets-Facebook.html

  • Further info: leskovec2012learning

a.57 Facebook Large Network

  • Description: this dataset was developed to study the effectiveness of node embeddings for learning tasks defined on graphs. The dataset concentrates on verified Facebook pages of politicians, governmental organizations, television shows, and companies, represented as nodes, while edges represent mutual likes. In addition, each page comes with node embeddings which are extracted from the textual description of each page. The original task on this dataset is page category classification.

  • Affiliation of creators: University of Edinburgh.

  • Domain: social networks.

  • Tasks in fairness literature: fair graph mining evaluation (kang2020inform).

  • Data spec: page-page pairs.

  • Sample size: 20K nodes (pages) connected by K edges (mutual likes).

  • Year: 2019.

  • Sensitive features: none.

  • Link: http://snap.stanford.edu/data/facebook-large-page-page-network.html

  • Further info: rozemberczki2021multi

a.58 FairFace

  • Description: this dataset was developed as a balanced resource for face analysis with diverse race, gender and age composition. The associated task is race, gender and age classification. Starting from a large public image dataset (Yahoo YFCC100M), the authors sampled images incrementally to ensure diversity with respect to race, for which they considered seven categories: White, Black, Indian, East Asian, Southeast Asian, Middle East, and Latino. Sensitive attributes were annotated by workers on Amazon Mechanical Turk, and also through a model based on these annotations. Faces with low agreement between model and annotators were manually re-verified by the dataset curators. This dataset was annotated automatically with a binary Fitzpatrick skin tone label (cheng2021can).

  • Affiliation of creators: University of California, Los Angeles.

  • Domain: computer vision.

  • Tasks in fairness literature: fairness evaluation of private classification (cheng2021can).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2019.

  • Sensitive features: race, age, gender, skin tone.

  • Link: https://github.com/joojs/fairface

  • Further info: karkkainen2019fairface

a.59 Fashion MNIST

  • Description: this dataset is based on the product assortment of the Zalando website. It contains gray-scale, resized versions of thumbnail images of unique clothing products, labeled by in-house fashion experts according to their category, including e.g. trousers, coat and shirt. The envisioned task is object classification. The dataset, sharing the same size and structure as MNIST, was developed to provide a harder and more representative task, and to replace MNIST as a popular computer vision benchmark.

  • Affiliation of creators: Zalando.

  • Domain: computer vision.

  • Tasks in fairness literature: robust fairness evaluation (black2021leaveoneout).

  • Data spec: image.

  • Sample size: K images across 10 product categories.

  • Year: 2017.

  • Sensitive features: none.

  • Link: https://github.com/zalandoresearch/fashion-mnist

  • Further info: xiao2017fashionmnist

a.60 Fico

  • Description: based on a sample of 301,536 TransUnion TransRisk scores from 2003, this dataset was created to study the problem of adjusting predictors for compliance with the equality of opportunity fairness metric. The TransUnion data was preprocessed and aggregated to summarize the CDF of risk scores by race (Non-Hispanic white, Black, Hispanic, Asian); a sketch of this group-wise aggregation follows this entry. The original data comes from a 2007 report to the US Congress on credit scoring and its effects on the availability and affordability of credit, carried out by a dedicated Federal Reserve working group. The collection, creation, processing, and aggregation were carried out by the working group; the aggregated data was later scraped by the dataset creators, who made it available without any modification.

  • Affiliation of creators: Google; University of Texas at Austin; Toyota Technological Institute at Chicago.

  • Domain: finance.

  • Tasks in fairness literature: fairness evaluation (hardt2016equality), dynamical fair classification (liu2020disparate), dynamical fairness evaluation (zhang2020how; liu2018delayed; creager2020causal), fair resource allocation (goelz2019paradoxes).

  • Data spec: tabular data.

  • Sample size: N/A. CDFs are provided over risk scores, which are normalized (0–100%) and quantized with step 0.5%.

  • Year: 2016.

  • Sensitive features: race.

  • Link: https://github.com/fairmlbook/fairmlbook.github.io/tree/master/code/creditscore/data

  • Further info: usfr2007report; hardt2016equality; barocas2019fair
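
Given individual-level scores, the group-wise aggregation referenced in the description can be recomputed with a sketch like the following; the input file and columns ("race", "score") are hypothetical placeholders, since the raw individual-level data is not part of this release.

    # Sketch of computing per-group empirical CDFs over scores normalized to
    # 0-100 and quantized at 0.5 steps; input columns are hypothetical.
    import numpy as np
    import pandas as pd

    scores = pd.read_csv("transrisk_scores.csv")   # columns: race, score (0-100)
    grid = np.arange(0, 100.5, 0.5)                # 0.5% quantization steps

    cdfs = {
        race: np.searchsorted(np.sort(group["score"].to_numpy()), grid, side="right") / len(group)
        for race, group in scores.groupby("race")
    }
    cdf_table = pd.DataFrame(cdfs, index=grid)     # rows: score cutoffs, columns: groups
    print(cdf_table.tail())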

a.61 FIFA 20 Players

  • Description: this dataset was scraped by Stefano Leone and made available on Kaggle. It includes the players’ data for the Career Mode from FIFA 15 to FIFA 20, a popular football game. Several tasks are envisioned for this dataset, including a historical comparison of players.

  • Affiliation of creators: unknown.

  • Domain: .

  • Tasks in fairness literature: noisy fairness audit (awasthi2021evaluating).

  • Data spec: tabular data.

  • Sample size: K players.

  • Year: 2019.

  • Sensitive features: geography.

  • Link: https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset

  • Further info:

a.62 FilmTrust

  • Description: this dataset was crawled from the entire FilmTrust website, a movie recommendation service with a social network component. The dataset comprises user-movie ratings on a 5-star scale and user-user indications of trust about movie taste. This resource can be used to train and evaluate recommender systems.

  • Affiliation of creators: Northeastern University; Nanyang Technological University; American University of Beirut; University of Cambridge.

  • Domain: information systems, movies.

  • Tasks in fairness literature: fair ranking (liu2018personalizing).

  • Data spec: user-movie pairs and user-user pairs.

  • Sample size: K ratings by K users over K movies.

  • Year: 2011.

  • Sensitive features: none.

  • Link: https://guoguibing.github.io/librec/datasets.html

  • Further info: guo2016novel

a.63 Framingham

  • Description: the Framingham Heart Study began in 1948 under the direction of the National Heart, Lung, and Blood Institute (NHLBI), with the goal of identifying key factors that contribute to cardiovascular disease, given a mounting epidemic of cardiovascular disease whose etiology was mostly unknown at the time. Six different cohorts have been recruited over the years among citizens of Framingham, Massachusetts, without symptoms of cardiovascular disease. After the original cohort, two more were enrolled from the children and grandchildren of the first one. Additional cohorts were also started to reflect the increased racial and ethnic diversity in the town of Framingham. Participants in the study report on their habits (e.g. physical activity, smoking) and undergo regular physical examination and laboratory tests.

  • Affiliation of creators: National Heart, Lung, and Blood Institute (NHLBI); Boston University.

  • Domain: cardiology.

  • Tasks in fairness literature: fair ranking evaluation (kallus2019fairness).

  • Data spec: mixture.

  • Sample size: K respondents.

  • Year: 2021.

  • Sensitive features: age, sex, race.

  • Link: https://framinghamheartstudy.org/

  • Further info: kannel1979diabetes; tsao2015cohort

a.64 Freebase15k-237

  • Description: Freebase was a collaborative knowledge base which allowed its community members to fill in structured data about diverse entities and relations between them. This dataset was derived from a prior Freebase dataset (bordes2013:translating) by pruning redundant relations and augmenting it with textual relationships from the ClueWeb12 corpus. The creators of this dataset worked on the joint optimization of entity knowledge base and representations of the entities’ textual relations, with the goal of providing representations of entities suited for knowledge base completion.

  • Affiliation of creators: Microsoft; Stanford University.

  • Domain: information systems.

  • Tasks in fairness literature: fair graph mining (bose2019compositional), fairness evaluation in graph mining (fisher2020measuring).

  • Data spec: entity-relation-entity triples.

  • Sample size: K entities connected by K edges (relations).

  • Year: 2016.

  • Sensitive features: demographics of people featured in entities and their relations.

  • Link: https://www.microsoft.com/en-us/download/details.aspx?id=52312

  • Further info: toutanova2015representing

a.65 GAP Coreference

  • Description: this resource was developed as a gender-balanced coreference resolution dataset, useful for auditing gender-dependent differences in the accuracy of existing pronoun resolution algorithms and for training new algorithms that are less gender-biased. The dataset consists of thousands of ambiguous pronoun-name pairs in sentences extracted from Wikipedia. Several measures are taken to avoid the success of naïve heuristics and to favour diversity. Most notably, while the initial (automated) stage of the data collection pipeline extracts contexts with a female:male ratio of 1:9, feminine pronouns are oversampled to achieve a 1:1 ratio. Each example is presented to and annotated for coreference by three in-house workers.

  • Affiliation of creators: Google.

  • Domain: linguistics.

  • Tasks in fairness literature: data bias evaluation (kocijan2020gap).

  • Data spec: text.

  • Sample size: K sentences.

  • Year: 2018.

  • Sensitive features: gender.

  • Link: https://github.com/google-research-datasets/gap-coreference

  • Further info: webster2018mind

a.66 German Credit

  • Description: the German Credit dataset was created to study the problem of automated credit decisions at a regional bank in southern Germany. Instances represent loan applicants from 1973 to 1975, who were deemed creditworthy and were granted a loan, bringing about a natural selection bias. The data summarizes their financial situation, credit history and personal situation, including housing and number of liable people. A binary variable encoding whether each loan recipient punctually paid every installment is the target of a classification task. Among covariates, marital status and sex are jointly encoded in a single variable. Many documentation mistakes are present in the UCI entry associated with this resource (hofmann1994:sg). Due to one of these mistakes, users of this dataset are led to believe that the variable sex can be retrieved from the joint marital_status-sex variable; however, this is false. A revised version with correct variable encodings, called South German Credit, was donated by gromping2019:sg2 with an accompanying report (gromping2019:sg). See Appendix D for extensive documentation.

  • Affiliation of creators: Hypo Bank (OP/EDV-VP); Universität Hamburg; Strathclyde University (German Credit); Beuth University of Applied Sciences Berlin (South German Credit).

  • Domain: finance.

  • Tasks in fairness literature: fair classification (he2020geometric; sharma2020data; raff2018fair; celis2019classification; yang2020fairness; donini2018empirical; vargo2021individually; baharlouei2020renyi; lohaus2020too; martinez2020minimax; mary2019fairnessaware; delobelle2020ethical; raff2018gradient), fairness evaluation (friedler2019comparative; feldman2015certifying), active fair resource allocation (cai2020:fa), preference-based fair classification (zhang2020joint), active fair classification (noriegacampero2019active), noisy fair classification (kilbertus2018blind), robust fairness evaluation (black2021leaveoneout), fair representation learning (ruoss2020learning; louizos2016variational), fair reinforcement learning (metevier2019offline), fair ranking evaluation (kallus2019fairness; wu2018discrimination; yang2017measuring), fair ranking (singh2019policy; bower2021individually), fair multi-stage classification (goel2020importance), limited-label fair classification (chzhen2019leveraging; wang2021fair; choi2020group), limited-label fairness evaluation (ji2020can).

  • Data spec: tabular data.

  • Sample size: K.

  • Year: 1994 (German Credit); 2020 (South German Credit).

  • Sensitive features: age, geography.

  • Link: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) (German Credit); https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29 (South German Credit)

  • Further info: gromping2019:sg

a.67 German Political Posts

  • Description: this dataset was used as a training set for German word embeddings, with the goal of investigating biases in word representations. The authors used the Facebook and Twitter APIs to collect posts and comments from the social media channels of six main political parties in Germany (CDU/CSU, SPD, Bündnis 90/Die Grünen, FDP, Die Linke, AfD). Facebook posts are from the period 2015–2018, while tweets were collected between January and October 2018. Overall, the dataset consists of millions of posts, for a total of half a billion tokens. A subset of the Facebook comments (100,000) was labeled by human annotators based on whether they contain sexist content, with four sub-labels indicating sexist comments, sexist buzzwords, gender-related compliments, statements against gender equality, and assignment of gender-stereotypical roles to people.

  • Affiliation of creators: Technical University of Munich.

  • Domain: social media.

  • Tasks in fairness literature: bias evaluation in WEs (papakyriakopulos2020bias).

  • Data spec: text.

  • Sample size: M posts, comments, and tweets.

  • Year: 2020.

  • Sensitive features: textual references to people and their demographics.

  • Link: not available

  • Further info: papakyriakopulos2020bias

a.68 Glue

  • Description: this benchmark was assembled to reliably evaluate the progress of natural language processing models. It consists of multiple datasets and associated tasks from the natural language processing domain, including paraphrase detection, textual entailment, sentiment analysis and question answering. Given the quick progress registered by language models on GLUE, a similar benchmark called SuperGLUE was subsequently released comprising more challenging and diverse tasks (wang2019:superglue).

  • Affiliation of creators: New York University; University of Washington; DeepMind.

  • Domain: linguistics.

  • Tasks in fairness literature: fairness evaluation (babaeianjelodar2020quantifying; rudinger2017:social), fairness evaluation of language models (cheng2021fairfil), fairness evaluation of selective classification (jones2021selective).

  • Data spec: text.

  • Sample size: K samples. Datasets have variable sizes spanning three orders of magnitude.

  • Year: 2018.

  • Sensitive features: none.

  • Link: https://gluebenchmark.com/

  • Further info: wang2018glue

a.69 Goodreads Reviews

  • Description: there are several versions of this dataset, corresponding to different crawls. Here we refer to the most well documented one, by wan2018item. This resource consists of anonymized reviews collected from public user book shelves. Rich metadata is available for books and reviews, including authors, country code, publisher, user id, rating, timestamp, and text. A few medium-size subsamples focused on specific book genres are available. The task typically associated with this resource is book recommendation.

  • Affiliation of creators: University of California, San Diego.

  • Domain: literature, information systems.

  • Tasks in fairness literature: fair ranking evaluation (raj2020comparing), fairness evaluation (chen2018why).

  • Data spec: user-book pairs.

  • Sample size: M records from K users over M books.

  • Year: 2019.

  • Sensitive features: author.

  • Link: https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/

  • Further info: wan2018item

a.70 Google Local

  • Description: this dataset contains reviews and ratings from millions of users on local businesses from five different continents. Businesses are labelled with nearly 50 thousand categories. This resource was collected as a real world example of interactions between users and ratable items, with the goal of testing novel recommendation approaches. The dataset comprises data that is specific to users (e.g. places lived), businesses (e.g. GPS coordinates), and reviews (e.g. timestamps).

  • Affiliation of creators: University of California, San Diego.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking (patro2019incremental).

  • Data spec: user-business pairs.

  • Sample size: M reviews and ratings from M users on M local businesses.

  • Year: 2018.

  • Sensitive features: geography.

  • Link: https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local

  • Further info: he2017translationbased

a.71 Greek Websites

  • Description: this dataset was created to demonstrate the bias goggles tool, which enables users to explore diverse bias aspects connected with popular Greek web domains. The dataset is a subset of the Greek web, crawled from Greek websites that cover politics and sports, represent big industries, or are generally popular. Starting from a seed of hundreds of websites, crawlers followed the links up to depth 7, avoiding popular sites such as Facebook and Twitter. The final dataset has a graph structure, comprising pages and links between them.

  • Affiliation of creators: FORTH-ICS, University of Crete.

  • Domain: .

  • Tasks in fairness literature: bias discovery (konstantakis2020bias).

  • Data spec: page-page pairs.

  • Sample size: k pages from k domains.

  • Year: 2020.

  • Sensitive features: none.

  • Link: https://pangaia.ics.forth.gr/bias-goggles/about.html#Dataset

  • Further info: konstantakis2020bias

a.72 Guardian Articles

  • Description: this dataset consists of articles from The Guardian, retrieved from The Guardian Open Platform API. In particular, the authors crawled every article that appeared on the website between 2009 and 2018. They created this dataset to demonstrate a framework for the identification of gender biases in training data for machine learning.

  • Affiliation of creators: University College Dublin.

  • Domain: news.

  • Tasks in fairness literature: data bias evaluation (leavy2020mitigating).

  • Data spec: text.

  • Sample size: unknown.

  • Year: 2020.

  • Sensitive features: textual references to people and their demographics.

  • Link: not available

  • Further info: leavy2020mitigating

a.73 Ham10000

  • Description: the dataset comprises 10,015 dermatoscopic images collected over a period of 20 years at the Department of Dermatology of the Medical University of Vienna, Austria, and at the skin cancer practice of Cliff Rosendahl in Queensland, Australia. Images were acquired and stored through different modalities; each image depicts a lesion and comes with metadata detailing the region of the skin lesion, patient demographics, and diagnosis, which is the target variable. The dataset was employed for the lesion disease classification task of the ISIC 2018 challenge.

  • Affiliation of creators: Medical University of Vienna; University of Queensland.

  • Domain: dermatology.

  • Tasks in fairness literature: fair classification (martinez2020minimax).

  • Data spec: image.

  • Sample size: 10K images.

  • Year: 2018.

  • Sensitive features: age, sex.

  • Link: https://doi.org/10.7910/DVN/DBW86T

  • Further info: tschandl2018ham10000

a.74 Harvey Rescue

  • Description: this dataset is the result of crowdsourced efforts to connect rescue parties with people requesting help in the Houston area, mostly due to the flooding caused by Hurricane Harvey. Most requests are from August 28, 2017, and were sent via social media; they are timestamped and associated with the location of the people seeking help.

  • Affiliation of creators: Harvey Relief Handiworks; Harvey Relief Coalition.

  • Domain: social work.

  • Tasks in fairness literature: fair spatio-temporal process learning (shang2020listwise).

  • Data spec: tabular data.

  • Sample size: 1K help requests.

  • Year: 2017.

  • Sensitive features: geography.

  • Link: not available

  • Further info: http://harveyrelief.handiworks.co/

a.75 Heart Disease

  • Description: this dataset is a collection of medical data from separate groups of patients referred for cardiac catheterisation and coronary angiography at 5 different medical centers, namely the Cleveland Clinic (data from 1981–1984), the Hungarian Institute of Cardiology in Budapest (1983–1987), the Long Beach Veterans Administration Medical Center (1984–1987) and the University Hospitals of Basel and Zurich (1985). The binary target variable in this dataset encodes a diagnosis of Coronary artery disease. Covariates relate to patient demographics, exercise data (e.g. maximum heart rate) and routine test data (e.g. resting blood pressure). Overall, 76 covariates are available but 14 are recommended. Names and social security numbers of the patients were initially available, but have been removed from the publicly available dataset.

  • Affiliation of creators: Veterans Administration Medical Center, Long Beach; Hungarian Institute of Cardiology, Budapest; University Hospital, Zurich; University Hospital, Basel; Studer Corporation; Stanford University.

  • Domain: cardiology.

  • Tasks in fairness literature: fairness evaluation (pleiss2017fairness), fair active classification (noriegacampero2019active).

  • Data spec: tabular data.

  • Sample size: K patients.

  • Year: 1988.

  • Sensitive features: age, sex.

  • Link: https://archive.ics.uci.edu/ml/datasets/heart+disease

  • Further info: detrano1989international

a.76 Heritage Health

  • Description: this dataset was developed as part of the Heritage Health Prize competition with the goal of reducing the cost of health care by decreasing the number of avoidable hospitalizations. The competition requires predicting the number of days a patient will spend in hospital during the 12 months following a cutoff date. The dataset features basic demographic information about patients, along with data about prior hospitalizations (e.g. length of stay and diagnosis), laboratory tests and prescriptions.

  • Affiliation of creators: CHEO Research Institute, Inc; University of Ottawa; University of Maryland; Privacy Analytics, Inc; Kaggle; Heritage Provider Network.

  • Domain: health policy.

  • Tasks in fairness literature: fair multi-stage classification (madras2018predict), fair representation learning (louizos2016variational), fair classification (raff2018fair; raff2018gradient), fair transfer learning (madras2018learning).

  • Data spec: tabular data.

  • Sample size: K patients.

  • Year: 2011.

  • Sensitive features: age, sex.

  • Link: https://www.kaggle.com/c/hhp/data

  • Further info: el2012deidentification

a.77 High School Contact and Friendship Network

  • Description: this dataset was developed to compare and contrast different methods commonly employed to measure human interaction and build the underlying social network. Data corresponds to interactions and friendship relations between students of a French high school in Marseilles. The authors consider four different methods of network data collection, namely face-to-face contacts measured by two concurrent methods (sensors and diaries), self-reported friendship surveys, and Facebook links.

  • Affiliation of creators: Aix Marseille Université; Université de Toulon; Centre national de la recherche scientifique; ISI Foundation.

  • Domain: social networks.

  • Tasks in fairness literature: fair graph clustering (kleindessner2019guarantees).

  • Data spec: student-student pairs.

  • Sample size: students.

  • Year: 2015.

  • Sensitive features: gender.

  • Link: http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/

  • Further info: mastrandrea2015contact

a.78 Hmda

  • Description: The Home Mortgage Disclosure Act (HMDA) is a US federal law from 1975 mandating that financial institutions maintain and disclose information about mortgages to the public. Companies submit a Loan Application Register (LAR) to the Federal Financial Institutions Examination Council (FFIEC), which maintains and discloses the data. The LAR format is subject to changes, such as the one introduced in 2017. From 2018 onward, entries to the LAR comprise information about the financial institution (e.g. geography, id), the applicants (e.g. demographics, income), the house (e.g. value, construction method), the mortgage conditions (type, interest rate, amount) and the outcome. Ethnicity, race, and sex of applicants are self-reported.

  • Affiliation of creators: Federal Financial Institutions Examination Council.

  • Domain: finance.

  • Tasks in fairness literature: noisy fairness evaluation (chen2019fairness; kallus2020assessing).

  • Data spec: tabular data.

  • Sample size: M records.

  • Year: 2021.

  • Sensitive features: sex, geography, race, ethnicity.

  • Link: https://ffiec.cfpb.gov/data-browser/

  • Further info: https://ffiec.cfpb.gov/; https://www.consumerfinance.gov/data-research/hmda/

a.79 Homeless Youths’ Social Networks

  • Description: this dataset was collected to study methamphetamine use norms among homeless youth in association with their social networks. A sample of homeless youth aged 13–25 years was recruited between 2011 and 2012 from two drop-in centers in California. After obtaining informed consent/assent, participants filled in a survey and answered questions from an interview. The survey included questions on demographics, migratory status, educational status and housing. To reconstruct the social network between them, each participant provided information for up to 50 people with whom they had interacted during the previous 30 days.

  • Affiliation of creators: University of Denver; University of Southern California.

  • Domain: social work.

  • Tasks in fairness literature: fair influence maximization (rahmattalabi2019exploring).

  • Data spec: person-person pairs.

  • Sample size: youth.

  • Year: 2015.

  • Sensitive features: age, gender, sexual orientation, race and ethnicity.

  • Link: not available

  • Further info: barman2016sociometric

a.80 Iit-Jee

  • Description: this dataset was released in response to a Right to Information application filed in June 2009, and contains country-wide results for the Joint Entrance Exam (JEE) to the Indian Institutes of Technology (IITs), a group of prestigious engineering schools in India. The dataset contains the marks obtained by every candidate who took the test in 2009, divided according to the specific Math, Physics, and Chemistry sections of the test. Demographics such as ZIP code, gender, and birth categories (ethnic categories relating to the caste system) are also included.

  • Affiliation of creators: Indian Institute of Technology, Kharagpur.

  • Domain: education.

  • Tasks in fairness literature: fair ranking (celis2020interventions).

  • Data spec: tabular data.

  • Sample size: K students.

  • Year: 2009.

  • Sensitive features: gender, birth category.

  • Link: not available

  • Further info: celis2020interventions

a.81 Ijb-A

  • Description: the IARPA Janus Benchmark A (IJB-A) dataset was proposed as a face recognition benchmark with wide geographic representation and pose variation for subjects. It consists of in-the-wild images and videos of 500 subjects, obtained through internet searches over Creative Commons licensed content. The subjects were manually specified by the creators of the dataset to ensure broad geographic representation. The tasks associated with the dataset are face identification and verification. The dataset curators also collected the subjects’ skin color and gender, through an unspecified annotation procedure. Similar protected attributes (gender and Fitzpatrick skin type) were labelled by one author of Buolamwini2018gender.

  • Affiliation of creators: Noblis; National Institute of Standards and Technology (NIST); Intelligence Advanced Research Projects Activity (IARPA); Michigan State University.

  • Domain: computer vision.

  • Tasks in fairness literature: data bias evaluation (Buolamwini2018gender).

  • Data spec: image.

  • Sample size: K images of subjects.

  • Year: 2015.

  • Sensitive features: gender, skin color.

  • Link: https://www.nist.gov/itl/iad/image-group/ijb-dataset-request-form

  • Further info: klare2015pushing

a.82 Ilea

  • Description: this dataset was created by the Inner London Education Authority (ILEA) considering data from 140 British schools. It comprises the results of public examinations taken by students of age 16 over the period 1985–1987. These values are used as a measurement of school effectiveness, with emphasis on quality of education and equality of opportunity for students of different backgrounds and ethnicities. Student-level records report their sex and ethnicity, while school-level factors include the percentage of students eligible for free meals and the percentage of girls in each institute.

  • Affiliation of creators: Inner London Education Authority (ILEA).

  • Domain: education.

  • Tasks in fairness literature: fair representation learning (oneto2019learning; oneto2020exploiting).

  • Data spec: unknown.

  • Sample size: K students from secondary schools.

  • Year: unknown.

  • Sensitive features: age, sex, ethnicity.

  • Link: not available

  • Further info: nuttall1989differential; goldstein1991multilevel

a.83 Image Embedding Association Test (iEAT)

  • Description: the Image Embedding Association Test (iEAT) is a resource for quantifying biased associations between representations of social concepts and attributes in images. It mimics seminal work on biases in WEs (caliksan2017semantics), following the Implicit Association Test (IAT) from social psychology (greenwald1998measuring). The curators identified several combinations of target concepts (e.g. young) and attributes (e.g. pleasant), testing similarities between representations of these concepts learnt by unsupervised computer vision models. For each attribute/concept they obtained a set of images from the IAT, the CIFAR-100 dataset or Google Image Search, which act as the source of images and the associated sensitive attribute labels.

  • Affiliation of creators: Carnegie Mellon University; George Washington University.

  • Domain: computer vision.

  • Tasks in fairness literature: fairness evaluation of learnt representations (steed2021image).

  • Data spec: image.

  • Sample size: images for 15 iEATs.

  • Year: 2021.

  • Sensitive features: religion, gender, age, race, sexual orientation, disability, skin tone, weight.

  • Link: https://github.com/ryansteed/ieat/tree/master/data

  • Further info: steed2021image

a.84 ImageNet

  • Description: ImageNet is one of the most influential machine learning datasets of the 2010s. Much important work on computer vision, including early breakthroughs in deep learning, has been sparked by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a competition held yearly from 2010 to 2017. The most used portion of ImageNet is indeed the data powering the classification task in ILSVRC 2012, featuring 1,000 classes, over 100 of which represent different dog breeds. Recently, several works found problematic biases in the person subtree of ImageNet, tracing their causes and proposing approaches to remove them (prabhu2020large; yang2020towards; crawford2021excavating).

  • Affiliation of creators: Princeton University.

  • Domain: computer vision.

  • Tasks in fairness literature: fair classification (dwork2018decoupled), bias discovery (amini2019uncovering), data bias evaluation (yang2020towards), fair incremental learning (zhao2020maintaining), fairness evaluation (dwork2017decoupled).

  • Data spec: image.

  • Sample size: M images depicting K categories (synsets).

  • Year: 2021.

  • Sensitive features: people’s gender and other sensitive annotations may be present in synsets from the person subtree.

  • Link: https://image-net.org/

  • Further info: deng2009imagenet; barocas2019fair; prabhu2020large; yang2020towards; crawford2021excavating

a.85 In-Situ

  • Description: this dataset was curated to measure biases in named entity recognition algorithms, based on gender, race and religion of people represented by entities. The authors exploit census data to build a list of 123 names typical of men and women of different races and religions. Next, they extract 289 sentences mentioning people from the CoNLL 2003 NER test data (tjong-kim-sang2003:introduction), itself derived from Reuters 1990s news stories. Finally, they substitute the unigram person entity from the CoNLL 2003 shared task with each of the names obtained previously as specific to a demographic group (a minimal substitution sketch follows this entry).

  • Affiliation of creators: Twitter.

  • Domain: linguistics.

  • Tasks in fairness literature: fairness evaluation in entity recognition (mishra2020assessing).

  • Data spec: text.

  • Sample size: K sentences.

  • Year: 2020.

  • Sensitive features: gender, race and religion.

  • Link: https://github.com/napsternxg/NER_bias

  • Further info: mishra2020assessing
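
The substitution step described above can be sketched as follows; the NER tag format mirrors CoNLL 2003 (B-PER for a single-token person entity), while the name lists below are illustrative assumptions rather than the authors' curated lists.

    # Sketch of the name-substitution step: each sentence with a single-token
    # person entity is copied once per demographic-specific name.
    demographic_names = {
        "African American women": ["Latisha", "Keisha"],
        "White men": ["Brad", "Connor"],
    }

    # One tokenized sentence with CoNLL-style NER tags (illustrative).
    sentence = [("Smith", "B-PER"), ("visited", "O"), ("Berlin", "B-LOC")]

    variants = []
    for group, names in demographic_names.items():
        for name in names:
            tokens = [name if tag == "B-PER" else token for token, tag in sentence]
            variants.append((group, name, " ".join(tokens)))

    for group, name, text in variants:
        print(f"[{group}] {text}")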

a.86 iNaturalist Datasets

  • Description: these datasets were curated as challenging real-world benchmarks for large-scale fine-grained visual classification and feature visually similar classes with large class imbalance. They consist of images of plants and animals from iNaturalist, a social network where nature enthusiasts share information and observations about biodiversity. There are four different releases of the dataset: 2017, 2018, 2019, and 2021. A subset of the images are also annotated with bounding boxes and have additional metadata such as where and when the images were captured.

  • Affiliation of creators: California Institute of Technology; University of Edinburgh; Google; Cornell University; iNaturalist.

  • Domain: biology.

  • Tasks in fairness literature: fairness evaluation of private classification (bagdasaryan2019differential).

  • Data spec: image.

  • Sample size: M images from K different species of plants and animals.

  • Year: 2021.

  • Sensitive features: none.

  • Link: https://github.com/visipedia/inat_comp

  • Further info: vanhorn2018inaturalist; van2021benchmarking

a.87 Indian Census

  • Description: very little information seems to be available on this dataset. It represents a count of residents of 35 Indian states, repeated every ten years between 1951 and 2001.

  • Affiliation of creators: Office of the Registrar General of India.

  • Domain: demography.

  • Tasks in fairness literature: fairness evaluation of private resource allocation (pujol2020fair).

  • Data spec: tabular data.

  • Sample size: 35 states.

  • Year: unknown.

  • Sensitive features: geography.

  • Link: https://www.indiabudget.gov.in/budget_archive/es2006-07/chapt2007/tab97.pdf

  • Further info:

a.88 Infant Health and Development Program (IHDP)

  • Description: this dataset is the result of the IHDP program carried out between 1985 and 1988 in the US. A longitudinal randomized trial was conducted to evaluate the effectiveness of comprehensive early intervention in reducing developmental and health problems in low birth weight premature infants. Families in the experimental group received an intervention based on an educational program delivered through home visits, a daily center-based program and a parent supporting group. Children in the study were assessed across multiple cognitive, behavioral, and health dimensions longitudinally in four phases at ages 3, 5, 8, and 18. The dataset also contains information on household composition, source of health care, parents’ demographics and employment.

  • Affiliation of creators: unknown.

  • Domain: pediatrics.

  • Tasks in fairness literature: fair risk assessment (madras2019fairness; yi2019fair).

  • Data spec: mixture.

  • Sample size: K infants.

  • Year: 1993.

  • Sensitive features: race and ethnicity (of parents), age (maternal), gender (of infant).

  • Link: https://www.icpsr.umich.edu/web/HMCA/studies/9795

  • Further info: brooks1992effects

a.89 Instagram Photos

  • Description: this dataset was crawled from Instagram to explore trade-offs between fairness and revenue in platforms that serve ads to their users. The authors crawled metadata from photos (location and tags) and users (names), using Kevin Systrom as a seed user and cascading into profiles that like or comment on photos. The curators concentrated on cities with enough geotagged data, namely New York and Los Angeles. Moreover, they labeled the users with gender and race. Gender was labeled via US social security data, using the proportion of babies with a given name registered with either gender; a gender was only assigned to users whose first name had at least 50 recorded births, of which at least 95% were of one gender (a minimal sketch of this rule follows this entry). Race was labeled using the Face++ API on a subset of photos. Photos were not downloaded; rather, they were fed to Face++ via their publicly available URLs. Finally, the ground truth labels were validated by two research assistants. To emulate a location-based advertisement model, the creators devised a task aimed at predicting which topics a user will be interested in, given their locations from previous check-ins.

  • Affiliation of creators: Columbia University.

  • Domain: social media.

  • Tasks in fairness literature: fair advertising (riederer2017price).

  • Data spec: unknown.

  • Sample size: M photos from K users.

  • Year: 2017.

  • Sensitive features: race, gender, geography.

  • Link: not available

  • Further info: riederer2017price
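
The name-based gender rule mentioned in the description can be sketched with made-up birth counts; the thresholds (at least 50 recorded births, at least 95% of one gender) come from the description, while everything else is illustrative.

    # Sketch of the name-based gender assignment rule: assign a gender only if
    # the first name has >= 50 recorded births and >= 95% of them are of a
    # single gender. The counts below are made-up examples.
    def assign_gender(female_births, male_births, min_births=50, threshold=0.95):
        total = female_births + male_births
        if total < min_births:
            return None
        if female_births / total >= threshold:
            return "female"
        if male_births / total >= threshold:
            return "male"
        return None

    print(assign_gender(970, 30))    # -> 'female'
    print(assign_gender(20, 25))     # -> None (too few recorded births)
    print(assign_gender(400, 350))   # -> None (neither gender reaches 95%)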

a.90 Iris

  • Description: the most popular dataset on the UCI Machine Learning Repository was created by E. Anderson and popularized by R.A. Fisher in the pattern recognition community in the 1930s. The measurements in this collection represent the length and width of sepal and petals of different Iris flowers, collected to evaluate the morphological variation of different Iris species. The typical learning task associated with this dataset is labelling the species based on the available measurements.

  • Affiliation of creators: Missouri Botanical Garden; Washington University.

  • Domain: plant science.

  • Tasks in fairness literature: fair clustering (chen2019proportionally; abbasi2021fair).

  • Data spec: tabular data.

  • Sample size: samples from three species of Iris.

  • Year: 1988.

  • Sensitive features: none.

  • Link: https://archive.ics.uci.edu/ml/datasets/iris

  • Further info: (anderson1936species; fisher1936use)

a.91 KDD Cup 99

  • Description: this dataset was developed for a data mining competition on cybersecurity, focused on building an automated network intrusion detector based on TCP dump data. The task is predicting whether a connection is legitimate and inoffensive or symptomatic of an attack, such as denial-of-service or user-to-root; tens of attack classes have been simulated and annotated within this dataset. The available features include basic TCP/IP information, network traffic and contextual features, such as number of failed login attempts.

  • Affiliation of creators: Massachusetts Institute of Technology.

  • Domain: computer networks.

  • Tasks in fairness literature: fair clustering (chen2019proportionally).

  • Data spec: tabular data.

  • Sample size: M connections.

  • Year: 1999.

  • Sensitive features: none.

  • Link: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

  • Further info: tavallaee2009detailed

a.92 Kidney Exchange Program

  • Description: this dataset is based on data of the Canadian Kidney Paired Donation Program (KPD) and was built to study strategic behavior among entities controlling part of the incompatible patient-donor pairs. The instances were generated from Canadian Blood Services data on the KPD and from census data; the random instance generator is available upon request. The instances are weighted graphs: the incompatible patient-donor pairs represent the vertices, an arc means that the donor of a vertex is compatible with the patient of another vertex, and weights represent the benefit of the donation (a toy sketch of this encoding follows this entry). Compatibility is encoded based on true blood type distribution and risk of transplant rejection.

  • Affiliation of creators: Université de Montréal; Polytechnique de Montréal.

  • Domain: public health.

  • Tasks in fairness literature: fair matching evaluation (farnadi2019enhancing).

  • Data spec: patient-donor pairs.

  • Sample size: 180.

  • Year: 2020.

  • Sensitive features: blood type, geography.

  • Link: https://github.com/mxmmargarida/KEG

  • Further info: carvalho2019game
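
The graph encoding described above can be illustrated with a toy sketch; the blood-type rule below is the standard ABO compatibility table, while the pair data and weights are made up rather than drawn from the actual instances.

    # Toy sketch of the instance structure: vertices are incompatible
    # patient-donor pairs, arcs denote donor-to-patient compatibility, and
    # weights stand in for the benefit of a transplant (made-up values).
    import networkx as nx

    ABO_COMPATIBLE = {
        "O": {"O", "A", "B", "AB"},
        "A": {"A", "AB"},
        "B": {"B", "AB"},
        "AB": {"AB"},
    }

    # Each entry: (pair id, donor blood type, patient blood type).
    pairs = [(0, "A", "B"), (1, "B", "A"), (2, "O", "AB")]

    G = nx.DiGraph()
    G.add_nodes_from(pid for pid, _, _ in pairs)
    for pid_u, donor_u, _ in pairs:
        for pid_v, _, patient_v in pairs:
            if pid_u != pid_v and patient_v in ABO_COMPATIBLE[donor_u]:
                G.add_edge(pid_u, pid_v, weight=1.0)

    print(list(G.edges(data=True)))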

a.93 Kidney Matching

  • Description: this dataset was created via a simulator based on real data provided by the Organ and Tissue Authority of Australia. The data was validated against additional information from the Australian Bureau of Statistics, the Public and Research sets, and Wikipedia. The simulator models the probability distribution over the Blood Type and State of donors and patients, along with the quality of a donated organ (summarized by Kidney Donor Patient Index) and of a patient (quantified by the Expected Post-Transplant Survival). The envisioned task for this data is optimal matching of organs and patients.

  • Affiliation of creators: unknown.

  • Domain: public health.

  • Tasks in fairness literature: fair matching evaluation (mattei2018fairness).

  • Data spec: tabular data.

  • Sample size: unknown.

  • Year: 2018.

  • Sensitive features: age, geography, blood type.

  • Link: not available

  • Further info: mattei2018axiomatic

a.94 Kiva

  • Description: this dataset was obtained from kiva.org, a non-profit organization allowing low-income entrepreneurs and students to borrow money through loan crowdfunding. The data summarizes all transactions that occurred in 2017. Transactions are typically between $25 and $50 and range from $5 to $10,000. Features include information about the loan, such as its purpose, sector and amount, and data specific to the borrower and their demographics. Women are prevalent in this dataset, probably due to the priorities of partner organizations and the easier access to capital enjoyed by men in many countries.

  • Affiliation of creators: Kiva; DePaul University.

  • Domain: finance.

  • Tasks in fairness literature: fair ranking (burke2018balanced; liu2018personalizing; sonboli2020and), bias discovery (sonboli2019localized).

  • Data spec: tabular data.

  • Sample size: M transactions involving K loans and K users.

  • Year: 2018.

  • Sensitive features: gender, geography, activity.

  • Link: not available

  • Further info: sonboli2019localized

a.95 Labeled Faces in the Wild (LFW)

  • Description: LFW is a public benchmark for face verification, maintained by researchers affiliated with the University of Massachusetts. It was built to measure the progress of face verification systems in unconstrained settings (e.g. variable pose, illumination, resolution). The dataset consists of images of people who appeared in the news, labelled with the name of the respective individual. According to the perception of human coders who were later asked to annotate this dataset, the images mostly skew white, male, and below 60 years of age.

  • Affiliation of creators: University of Massachusetts, Amherst; Stony Brook University.

  • Domain: computer vision.

  • Tasks in fairness literature: fair data summarization (samadi2018price), fair clustering (ghadiri2021socially), robust fairness evaluation (black2021leaveoneout).

  • Data spec: image.

  • Sample size: K face images of K individuals.

  • Year: 2007.

  • Sensitive features: gender, age, race.

  • Link: http://vis-www.cs.umass.edu/lfw/

  • Further info: huang2007labeled; han2014age; gebru2018datasheets

a.96 Large Movie Review

  • Description: a set of reviews from IMDB, collected, filtered and preprocessed by researchers affiliated with Stanford University. Polarity judgements are balanced between positive and negative reviews and are automatically inferred from star-based ratings: 7 or more stars are labelled positive, while 4 or fewer are labelled negative (this rule is sketched after this entry). The dataset was collected to provide a large benchmark for sentiment analysis algorithms.

  • Affiliation of creators: Stanford University.

  • Domain: linguistics.

  • Tasks in fairness literature: fair sentiment analysis evaluation (liang2020artificial).

  • Data spec: text.

  • Sample size: K reviews.

  • Year: 2011.

  • Sensitive features: textual references to people and their demographics.

  • Link: https://ai.stanford.edu/~amaas/data/sentiment/

  • Further info: maas2011learning
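
A sketch of the star-to-polarity rule described above; following maas2011learning, reviews in the middle of the 1-10 star scale are assumed to be excluded from the benchmark.

    def polarity(stars: int):
        """Map a 1-10 star rating to a polarity label (None = excluded)."""
        if stars >= 7:
            return "positive"
        if stars <= 4:
            return "negative"
        return None

    assert polarity(8) == "positive"
    assert polarity(3) == "negative"
    assert polarity(5) is None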

a.97 Last.fm

  • Description: the Last.fm datasets were collected via the Last.fm API with the purpose of studying music consumption, discovery and recommendation on the web. Two datasets are provided: LFM1K, comprising timestamped listening habits of a limited user sample (1K) at song granularity, and LFM360K, containing the top 50 most played artists of a wider user population (360K).

  • Affiliation of creators: Barcelona Music and Audio Technologies; Universitat Pompeu Fabra.

  • Domain: music, information systems.

  • Tasks in fairness literature: fair ranking evaluation (ekstrand2018all).

  • Data spec: user-song pairs (LFM1K); user-artist pairs (LFM360K).

  • Sample size: 19M timestamped records of 1K users playing songs from 170K artists (LFM1K); M play counts (user-artist pairs) for 400K users over 300K artists (LFM360K).

  • Year: 2010.

  • Sensitive features: user age, gender, geography; artist.

  • Link: http://ocelma.net/MusicRecommendationDataset/

  • Further info: Celma:Springer2010

a.98 Latin Newspapers

  • Description: this dataset was built to study gender bias in language models and their connection with the corpora they have been trained on. It was built by crawling articles from the websites of three newspapers from Chile, Peru, and Mexico. More detailed information about this resource seems to be missing.

  • Affiliation of creators: Capital One.

  • Domain: news.

  • Tasks in fairness literature: data bias evaluation (florez2019unintended).

  • Data spec: text.

  • Sample size: K articles.

  • Year: 2019.

  • Sensitive features: textual references to people and their demographics.

  • Link: not available

  • Further info: florez2019unintended

a.99 Law School

  • Description: this dataset was collected to study the performance in law school and on the bar examination of minority examinees, in connection with affirmative action programs established after 1967 and subsequent anecdotal reports suggesting low bar passage rates for black examinees. Students, law schools, and state boards of bar examiners contributed to this dataset. The study tracks students who entered law school in fall 1991 through three or more years of law school and up to five administrations of the bar examination. Variables include demographics of candidates (e.g. age, race, sex), their academic performance (undergraduate GPA, law school admission test score, and law school GPA), personal circumstances (e.g. financial responsibility for others during law school), along with information about law schools and bar exams (e.g. the geographical area where the exam was taken). The associated task in machine learning is prediction of passage of the bar exam.

  • Affiliation of creators: Law School Admission Council (LSAC).

  • Domain: education.

  • Tasks in fairness literature: fair classification (yang2020fairness; cho2020fair; russell2017when; agarwal2018reductions; berk2017convex), rich-subgroup fairness evaluation (kearns2019empirical), noisy fair classification (lahoti2020fairness; lamy2019noisetolerant), fairness evaluation (black2020fliptest; kusner2017counterfactual), fair regression (chzhen2020fair; chzhen2020fairwassertein; agarwal2019fair; komiyama2018nonconvex), fair representation learning (ruoss2020learning), robust fair classification (mandal2020ensuring), limited-label fair classification (wang2021fair).

  • Data spec: tabular data.

  • Sample size: K examinees.

  • Year: 1998.

  • Sensitive features: sex, race, age.

  • Link: not available

  • Further info: wightman1998lsac

a.100 Libimseti

  • Description: this dataset was collected to explore the effectiveness of recommendations in online dating services based on collaborative filtering. It was collected in collaboration with employees of the dating platform libimseti.cz, one of the largest Czech dating websites at the time. The data consists of anonymous ratings provided by (and to) users of the web service on a 10-point scale.

  • Affiliation of creators: Charles University in Prague; Libimseti.

  • Domain: sociology, information systems.

  • Tasks in fairness literature: fair matching (tziavelis2019equitable).

  • Data spec: user-user pairs.

  • Sample size: 10M ratings over 200K users.

  • Year: 2007.

  • Sensitive features: gender.

  • Link: http://colfi.wz.cz/

  • Further info: brozovsky2006recommender; brozovsky2007recommender

a.101 Los Angeles City Attorney’s Office Records

  • Description: this dataset was extracted from the Los Angeles City Attorney’s case management system. It consists of a collection of records aimed at powering data-driven approaches to decision making and resource allocation for misdemeanour recidivism reduction via individually tailored social service interventions. Focusing on cases handled by the office between 1995–2017, the data includes information about jail bookings, charges, court appearances, outcomes, and demographics.

  • Affiliation of creators: Los Angeles City Attorney’s Office; University of Chicago.

  • Domain: law.

  • Tasks in fairness literature: fair classification (rodolfa2020case).

  • Data spec: tabular data.

  • Sample size: M unique individuals associated with M cases.

  • Year: 2020.

  • Sensitive features: race, ethnicity.

  • Link: not available

  • Further info: rodolfa2020case

a.102 Meps-Hc

  • Description: the Medical Expenditure Panel Survey (MEPS) data is collected by the US Department of Health and Human Services, to survey healthcare spending and utilization by US citizens. Overall, this is a set of large-scale surveys of families and individuals, their employers, and medical providers (e.g. doctors, hospitals, pharmacies). The Household Component (HC) focuses on households and individuals, who provide information about their demographics, medical conditions and expenses, health insurance coverage, and access to care. Individuals included in a panel undergo five rounds of interviews over two years. Healthcare expenditure is often regarded as a target variable in machine learning applications, where it has been used as a proxy for healthcare utilization, with the goal of identifying patients in need.

  • Affiliation of creators: Agency for Healthcare Research and Quality.

  • Domain: health policy.

  • Tasks in fairness literature: fair transfer learning (coston2019fair), fair regression (romano2020achieving), fairness evaluation (singh2019understanding).

  • Data spec: tabular data.

  • Sample size: K, variable on a yearly basis.

  • Year: 2021.

  • Sensitive features: gender, ethnicity, age.

  • Link: https://meps.ahrq.gov/mepsweb/data_stats/download_data_files.jsp

  • Further info: https://www.ahrq.gov/data/meps.html

a.103 MGGG States

  • Description: developed by the Metric Geometry and Gerrymandering Group (https://mggg.org/), this dataset contains precinct-level aggregated information about the demographics and political leaning of voters in each district. The data hinges on several distinct sources, including GIS mapping files from the US Census Bureau (https://www.census.gov/geographies/mapping-files.html), demographic data from IPUMS NHGIS (https://www.nhgis.org/), and election data from the MIT Election Data and Science Lab (https://electionlab.mit.edu/). Source and precise data format vary by state.

  • Affiliation of creators: Tufts University.

  • Domain: political science.

  • Tasks in fairness literature: fair districting for electoral precincts (schutzman2020:to).

  • Data spec: mixture.

  • Sample size: variable number of precincts (thousands) per state.

  • Year: 2021.

  • Sensitive features: race, political affiliation (representation in different precincts).

  • Link: https://github.com/mggg-states

  • Further info: https://mggg.org/

a.104 Microsoft Learning to Rank

  • Description: this dataset was released to spur advances in learning to rank algorithms, capable of producing a list of documents in response to a text query, ranked according to their relevance for the query. The dataset contains relevance judgements for query-document pairs, obtained “from a retired labeling set” of the Bing search engine. Over 100 numerical features are provided for each query-document pair, summarizing the salient lexical properties of the pair and the quality of the webpage, including its page rank.

  • Affiliation of creators: Microsoft.

  • Domain: information systems.

  • Tasks in fairness literature: fair ranking (bower2021individually).

  • Data spec: query-document pairs.

  • Sample size: K queries.

  • Year: 2013.

  • Sensitive features: none.

  • Link: https://www.microsoft.com/en-us/research/project/mslr/

  • Further info: qin2013introducing

a.105 Million Playlist Dataset (MPD)

  • Description: this dataset powered the 2018 RecSys Challenge on automatic playlist continuation. It consists of a sample of public Spotify playlists created by US Spotify users between 2010–2017. Each playlist consists of a title, track list and additional metadata. For each track, MPD provides the title, artist, album, duration and Spotify pointers. User data is anonymized. The dataset was augmented with record label information crawled from the web (knees2019towards).

  • Affiliation of creators: Spotify; Johannes Kepler University; University of Massachusetts.

  • Domain: music, information systems.

  • Tasks in fairness literature: data bias evaluation (knees2019towards).

  • Data spec: tabular data.

  • Sample size: M playlists containing M unique tracks by K artists.

  • Year: 2018.

  • Sensitive features: artist, record label.

  • Link: https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge

  • Further info: chen2018recsys

a.106 Million Song Dataset (MSD)

  • Description: this dataset was created as a large-scale benchmark for algorithms in the musical domain. Song data was acquired through The Echo Nest API, capturing a wide array of information about the song (duration, loudness, key, tempo, etc.) and the artist (name, id, location, etc.). In total the dataset creators retrieved one million songs, and for each song 55 fields are provided as metadata. This dataset also powers the Million Song Dataset Challenge, integrating the MSD with implicit feedback from taste profiles gathered from an undisclosed set of applications.

  • Affiliation of creators: Columbia University; The Echo Nest.

  • Domain: music, information systems.

  • Tasks in fairness literature: dynamical evaluation of fair ranking (ferraro2019artist).

  • Data spec: user-song pairs.

  • Sample size: M play counts over M users and K songs.

  • Year: 2012.

  • Sensitive features: artist; geography.

  • Link: http://millionsongdataset.com/; https://www.kaggle.com/c/msdchallenge

  • Further info: bertinmahieux2011million; mcfee2012milllion

a.107 Mimic-Cxr-Jpg

  • Description: this dataset was curated to encourage research in medical computer vision. It consists of chest x-rays sourced from the Beth Israel Deaconess Medical Center between 2011–2016. Each image is tagged with one or more of fourteen labels, derived from the corresponding free-text radiology reports via natural language processing tools. A subset of 687 report-label pairs has been validated by board-certified radiologists with 8 years of experience.

  • Affiliation of creators: Massachusetts Institute of Technology; Beth Israel Deaconess Medical Center; Stanford University; Harvard Medical School; National Library of Medicine.

  • Domain: radiology.

  • Tasks in fairness literature: fairness evaluation of private classification (cheng2021can).

  • Data spec: images.

  • Sample size: K images of K patients.

  • Year: 2019.

  • Sensitive features: sex.

  • Link: https://physionet.org/content/mimic-cxr-jpg/2.0.0/

  • Further info: johnson2019mimic

a.108 Mimic-Iii

  • Description: this dataset was extracted from a database of patients admitted to critical care units at the Beth Israel Deaconess Medical Center in Boston (MA), following the widespread adoption of digital health records in US hospitals. Data comprises vital signs, medications, laboratory measurements, notes and observations by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, length of stay, survival data, and demographics. The dataset spans over a decade of intensive care unit stays for adult and neonatal patients.

  • Affiliation of creators: Massachusetts Institute of Technology; Beth Israel Deaconess Medical Center; A*STAR.

  • Domain: critical care medicine.

  • Tasks in fairness literature: fair classification (martinez2020minimax), fairness evaluation (chen2018why; zhang2020hurtful), robust fair classification (singh2021fairness).

  • Data spec: mixture.

  • Sample size: 60K patients.

  • Year: 2016.

  • Sensitive features: age, ethnicity, gender.

  • Link: https://mimic.mit.edu/

  • Further info: johnson2016mimiciii

a.109 ML Fairness Gym

  • Description: this resource was developed to study the long-term behaviour and emergent properties of fair ML systems. It is an extension of OpenAI Gym (brockman2016openai), simulating the actions of agents within environments as Markov Decision Processes. As of 2021, four environments have been released. (1) Lending emulates the decisions of a bank, based on the perceived credit-worthiness of individuals, which is distributed according to an artificial sensitive feature. (2) Attention allocation concentrates on agents tasked with monitoring sites for incidents. (3) College admission relates to sequential game theory, where agents represent colleges and environments contain students capable of strategically manipulating their features at different costs, for instance through preparation courses. (4) Infectious disease models the problem of vaccine allocation and its long-term consequences on people in different demographic groups. A generic interaction-loop sketch is given after this entry.

  • Affiliation of creators: Google.

  • Domain: N/A.

  • Tasks in fairness literature: dynamical fair resource allocation (atwood2019fair; damour2020fairness), dynamical fair classification (damour2020fairness).

  • Data spec: time series.

  • Sample size: variable.

  • Year: 2020.

  • Sensitive features: synthetic.

  • Link: https://github.com/google/ml-fairness-gym

  • Further info: damour2020fairness
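
A generic interaction-loop sketch, assuming only the gym-style reset/step interface that the resource inherits from OpenAI Gym; the environment constructor named in the usage comment is a hypothetical placeholder, and the actual class names should be taken from the repository.

    def run_episode(env, policy, horizon=100):
        """Roll out `policy` in a gym-style environment and return the history."""
        observation = env.reset()
        history = []
        for _ in range(horizon):
            action = policy(observation)
            observation, reward, done, info = env.step(action)
            history.append((observation, reward, info))
            if done:
                break
        return history

    # Hypothetical usage: env = make_lending_env(); run_episode(env, my_policy)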

a.110 Mnist

  • Description: one of the most famous resources in computer vision, this dataset was created from an earlier database released by the National Institute of Standards and Technology (NIST). It consists of hand-written digits collected among high-school students and Census Bureau employees, which image processing systems must label correctly. Several augmentations have also been used in the fairness literature; they are discussed at the end of this section and sketched in code after the list of variants.

  • Affiliation of creators: AT&T Labs.

  • Domain: computer vision.

  • Tasks in fairness literature: fair clustering (harpeled2019near; li2020deep), fair anomaly detection (zhang2021towards), fair classification (creager2021exchanging).

  • Data spec: image.

  • Sample size: K images across 10 digits.

  • Year: 1998.

  • Sensitive features: none.

  • Link: http://yann.lecun.com/exdb/mnist/

  • Further info: lecun1998gradientbased; barocas2019fair

  • Variants:

    • MNIST-USPS (li2020deep): merge with USPS dataset of handwritten digits (hull1994:database).

    • Color-reverse MNIST (li2020deep) or MNIST-Invert (zhang2021towards): images from MNIST, reversed by mapping each pixel value v to 255 - v.

    • Color MNIST (arjovsky2020invariant): images from MNIST colored red or green based on class label.

    • C-MNIST: images from MNIST, such that both digits and background are colored.

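A simplified sketch of the two lightest variants above, operating on numpy arrays of grayscale images with pixel values in [0, 255]; the red/green class split used here for Color MNIST is an illustrative assumption, and arjovsky2020invariant additionally inject label noise.

    import numpy as np

    def invert(images):
        """MNIST-Invert / color-reverse MNIST: map each pixel value v to 255 - v."""
        return 255 - images

    def color_by_label(images, labels):
        """Simplified Color MNIST: digits 0-4 drawn in red, digits 5-9 in green."""
        n, h, w = images.shape
        colored = np.zeros((n, h, w, 3), dtype=images.dtype)
        red = labels < 5
        colored[red, :, :, 0] = images[red]      # red channel
        colored[~red, :, :, 1] = images[~red]    # green channel
        return colored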

a.111 Mobile Money Loans

  • Description: this dataset captures the ongoing collaboration between some banks and mobile network operators in East Africa. Phone data, including mobile money transactions, is used as “soft” financial data to create a credit score. Mobile money (bank-less) transactions represent a low-barrier tool for the financial inclusion of the poor and are fairly popular in some African countries.

  • Affiliation of creators: unknown.

  • Domain: finance.

  • Tasks in fairness literature: fair transfer learning (coston2019fair).

  • Data spec: tabular data.

  • Sample size: K people.

  • Year: unknown.

  • Sensitive features: age, gender.

  • Link: not available

  • Further info: speakman2018:three

a.112 MovieLens

  • Description: first released in 1998, MovieLens datasets represent user ratings from the movie recommender platform run by the GroupLens research group from the University of Minnesota. While different datasets have been released by GroupLens, in this section we concentrate on MovieLens 1M, the one predominantly used in fairness research. User-system interactions take the form of a quadruple (UserID, MovieID, Rating, Timestamp), with ratings expressed on a 1-5 star scale; a minimal parsing sketch is given after this entry. The dataset also reports user demographics such as age and gender, which are voluntarily provided by the users.

  • Affiliation of creators: University of Minnesota.

  • Domain: information systems, movies.

  • Tasks in fairness literature: fair ranking (burke2018balanced; sonboli2020and; dickens2020hyperfair; farnadi2018fairnessaware; liu2018personalizing), fair ranking evaluation (ekstrand2018all; yao2017beyond; yao2017new), fair data summarization (halabi2020fairness), fair representation learning (oneto2020exploiting; oneto2019learning), fair graph mining (buyl2020debayes; bose2019compositional), noisy fair ranking (burke2018synthetic).

  • Data spec: user-movie pairs.

  • Sample size: M reviews by K users over K movies.

  • Year: 2003.

  • Sensitive features: gender, age.

  • Link: https://grouplens.org/datasets/movielens/1m/

  • Further info: harper2015movielens
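
A minimal parsing sketch, assuming the '::'-separated ratings.dat layout of MovieLens 1M (UserID::MovieID::Rating::Timestamp); the file path is a placeholder.

    from collections import namedtuple

    Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])

    def parse_ratings(path="ml-1m/ratings.dat"):
        """Yield one Rating per line of the '::'-separated ratings file."""
        with open(path, encoding="latin-1") as f:
            for line in f:
                user, movie, stars, ts = line.strip().split("::")
                yield Rating(int(user), int(movie), int(stars), int(ts))

    # Usage: ratings = list(parse_ratings()); the user demographics live in the
    # companion users.dat file and can be joined on user_id for fairness analyses.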

a.113 MS-Celeb-1M

  • Description: this dataset was created as a large-scale public benchmark for face recognition. The creators cover a wide range of countries and emphasize diversity echoing outdated notions of race: “We cover all the major races in the world (Caucasian, Mongoloid, and Negroid)” (guo2016msceleb1m). While (in theory) containing only images of celebrities, the dataset was found to feature people who simply must maintain an online presence, and was retracted for this reason. Despite the termination of the hosting website, the dataset is still searched for, available, and used to build new fairness datasets, such as RFW (§ A.141) and BUPT Faces (§ A.25). The dataset was recently augmented with gender and nationality data automatically inferred from biographies of people (mcduff2019characterizing). From nationality, a race-related attribute was also annotated on a subset of 20,000 images.

  • Affiliation of creators: Microsoft.

  • Domain: computer vision.

  • Tasks in fairness literature: fairness evaluation through artificial data generation (mcduff2019characterizing).

  • Data spec: image.

  • Sample size: M images representing K people.

  • Year: 2016.

  • Sensitive features: gender, race, geography.

  • Link: not available

  • Further info: guo2016msceleb1m; mcduff2019characterizing; murgia2019microsoft

a.114 Ms-Coco

  • Description: this dataset was created with the goal of improving the state of the art in object recognition. The dataset consists of over 300,000 labeled images collected from Flickr. Each image was annotated based on whether it contains one or more of the 91 object types proposed by the authors. Segmentations are also provided to indicate the region where objects are located in each image. Finally, five human-generated captions are provided for each image. Annotation, segmentation and captioning were performed by human annotators hired on Amazon Mechanical Turk. A subset of the images depicting people have been augmented with gender labels “man” and “woman” based on whether captions mention one word but not the other (zhao2017men; hendricks2018women); this heuristic is sketched after this entry.

  • Affiliation of creators: Cornell University; Toyota Technological Institute; Facebook; Microsoft; Brown University; California Institute of Technology; University of California at Irvine.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (david2020debiasing), fair classification (hendricks2018women).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2014.

  • Sensitive features: gender.

  • Link: https://cocodataset.org/

  • Further info: lin2014microsoft
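
A sketch of the caption-based heuristic mentioned above; the word lists are illustrative, while zhao2017men and hendricks2018women define the actual gendered word sets.

    MAN_WORDS = {"man", "men", "male", "boy", "boys"}
    WOMAN_WORDS = {"woman", "women", "female", "girl", "girls"}

    def gender_label(captions):
        """Label an image 'man' or 'woman' only if its captions mention words
        from exactly one of the two lists; otherwise leave it unlabelled."""
        tokens = {tok.strip(".,") for c in captions for tok in c.lower().split()}
        has_man = bool(tokens & MAN_WORDS)
        has_woman = bool(tokens & WOMAN_WORDS)
        if has_man and not has_woman:
            return "man"
        if has_woman and not has_man:
            return "woman"
        return None

    print(gender_label(["A man riding a horse.", "A person on a horse."]))  # man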

a.115 Multi-task Facial Landmark (MTFL)

  • Description: this dataset was developed to evaluate the effectiveness of multi-task learning in problems of facial landmark detection. The dataset builds upon an existing collection of outdoor face images sourced from the web already labelled with bounding boxes and landmarks (yi2013deep), by annotating whether subjects are smiling or wearing glasses, along with their gender and pose. These annotations, whose provenance is not documented, allow researchers to define additional classification tasks for their multi-task learning pipeline.

  • Affiliation of creators: The Chinese University of Hong Kong.

  • Domain: computer vision.

  • Tasks in fairness literature: fair clustering (li2020deep).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2014.

  • Sensitive features: gender.

  • Link: http://mmlab.ie.cuhk.edu.hk/projects/TCDCN.html

  • Further info: zhang2014facial; zhang2015learning

a.116 National Longitudinal Survey of Youth

  • Description: the National Longitudinal Surveys from the US Bureau of Labor Statistics follow the lives of representative samples of US citizens, focusing on their labor market activities and other significant life events. Subjects periodically provide responses to questions about their education, employment, housing, income, health, and more. Two different cohorts were started in 1979 (NLSY79) and 1997 (NLSY97), which have been associated with the machine learning tasks of income prediction and GPA prediction, respectively.

  • Affiliation of creators: US Bureau of Labor Statistics.

  • Domain: demography.

  • Tasks in fairness literature: fair regression (komiyama2018nonconvex; chzhen2020fairwassertein; chzhen2020fair).

  • Data spec: tabular data.

  • Sample size: K respondents (NLSY79); K respondents (NLSY97).

  • Year: 2021.

  • Sensitive features: age, race, sex.

  • Link: https://www.bls.gov/nls/nlsy79.htm (NLSY79); https://www.bls.gov/nls/nlsy97.htm (NLSY97)

  • Further info:

a.117 National Lung Screening Trial (NLST)

  • Description: the NLST was a randomized controlled trial aimed at understanding whether imaging through low-dose helical computed tomography reduces lung cancer mortality relative to chest radiography. Participants were recruited at 33 screening centers across the US, among subjects deemed at risk of lung cancer based on age and smoking history, and were made aware of the trial. A breadth of features about participants is available, including demographics, disease history, smoking history, family history of lung cancer, type, and results of screening exams.

  • Affiliation of creators: National Cancer Institute’s Division of Cancer Prevention, Division of Cancer Treatment and Diagnosis.

  • Domain: radiology.

  • Tasks in fairness literature: fair preference-based classification (ustun2019fairness).

  • Data spec: image.

  • Sample size: K participants.

  • Year: 2020.

  • Sensitive features: age, ethnicity, race, sex.

  • Link: https://cdas.cancer.gov/nlst/

  • Further info: national2011national; https://www.cancer.gov/types/lung/research/nlst

a.118 New York Times Annotated Corpus

  • Description: this corpus contains nearly two million articles published in The New York Times over the period 1987–2007. For some articles, annotations by library scientists are available, including topics, mentioned entities, and summaries. The data is provided in News Industry Text Format (NITF).

  • Affiliation of creators: The New York Times.

  • Domain: news.

  • Tasks in fairness literature: bias evaluation in WEs (brunet2019understanding).

  • Data spec: text.

  • Sample size: M articles.

  • Year: 2008.

  • Sensitive features: textual references to people and their demographics.

  • Link: https://catalog.ldc.upenn.edu/LDC2008T19

  • Further info:

a.119 Nominees Corpus

  • Description: this corpus was curated to study gender-related differences in literary production, with attention to the perception of quality. It consists of fifty Dutch-language fiction novels nominated for either the AKO Literatuurprijs (shortlist) or the Libris Literatuur Prijs (longlist) in the period 2007–2012. The corpus was curated to control for nominee gender and country of origin. Word counts, LIWC counts, and metadata for this dataset are available at http://dx.doi.org/10.17632/tmp32v54ss.2.

  • Affiliation of creators: University of Amsterdam.

  • Domain: literature.

  • Tasks in fairness literature: fairness evaluation (koolen2017:stereotypes).

  • Data spec: text.

  • Sample size: novels.

  • Year: 2017.

  • Sensitive features: gender, geography (of author).

  • Link: not available

  • Further info: koolen2017:stereotypes; koolen2018reading

a.120 North Carolina Voters

  • Description: US voter data is collected, curated, and maintained for multiple reasons. Data about voters in North Carolina is collected publicly as part of voter registration requirements and also privately. Private companies curating these datasets sell voter data as part of products, which include outreach lists and analytics. These datasets include voters’ full names, address, demographics, and party affiliation.

  • Affiliation of creators: North Carolina State Board of Elections.

  • Domain: political science.

  • Tasks in fairness literature: data bias evaluation (coston2021leveraging), fair clustering (abbasi2021fair), fairness evaluation of advertisement (speicher2018potential).

  • Data spec: tabular data.

  • Sample size: M voters.

  • Year: 2021.

  • Sensitive features: race, ethnicity, age, geography.

  • Link: https://www.ncsbe.gov/results-data/voter-registration-data

  • Further info:

  • Variants: a privately curated version of this dataset is maintained by L2 (https://l2-data.com/states/north-carolina/).

a.121 Nursery

  • Description: this dataset encodes applications for a nursery school in Ljubljana, Slovenia. To favour transparent and objective decision-making, a computer-based decision support system was developed for the selection and ranking of applications. The target variable is thus the output of an expert system based on a set of rules, taking as input information about the family, including housing, occupation and financial status, included in the dataset. The variables were reportedly constructed in a careful manner, taking into account laws that were in force at that time and following advice given by leading experts in the field. However, the variables also appear to be coded rather subjectively. For example, the variable social condition admits the value “slightly problematic”, allegedly reserved for “When education ability of parents is low (unequal, inconsistent education, exaggerated pretentiousness or indulgence, neurotic reactions of parents), or there are improper relations in family (easier forms of parental personality disturbances, privileged or ignored children, conflicts in the family)”. Given that the true map between inputs and outputs is known, this resource is mostly useful to evaluate methods of structure discovery.

  • Affiliation of creators: University of Maribor; Jožef Stefan Institute; University of Ljubljana; Center for Public Enterprises in Developing Countries.

  • Domain: education.

  • Tasks in fairness literature: fair classification (romano2020achieving).

  • Data spec: tabular data.

  • Sample size: K combinations of input data (hypothetical applicants).

  • Year: 1997.

  • Sensitive features: family wealth.

  • Link: https://archive.ics.uci.edu/ml/datasets/nursery

  • Further info: olave1989application

a.122 NYC Taxi Trips

  • Description: this dataset was collected through a Freedom of Information Law request from the NYC Taxi and Limousine Commission. Data points represent New York taxi trips over 4 years (2010–2013), complete with spatio-temporal data, trip duration, number of passengers, and cost. Reportedly, the dataset contains a large number of errors, including misreported trip distance, duration, and GPS coordinates. Overall, these errors account for 7% of all trips in the dataset.

  • Affiliation of creators: University of Illinois.

  • Domain: transportation.

  • Tasks in fairness literature: fair matching (lesmana2019balancing; nanda2020:bt).

  • Data spec: tabular data.

  • Sample size: M taxi trips.

  • Year: 2016.

  • Sensitive features: none.

  • Link: https://experts.illinois.edu/en/datasets/new-york-city-taxi-trip-data-2010-2013-2

  • Further info: https://bit.ly/3yrT8jt

  • Variants: a similar, smaller dataset was obtained by Chris Whong from the NYC Taxi and Limousine Commission under the Freedom of Information Law (http://www.andresmh.com/nyctaxitrips/).

a.123 Occupations in Google Images

  • Description: this dataset was collected to study gender and skin tone diversity in image search results for jobs, and its relation to gender and race concentration in different professions. The dataset consists of the top 100 results for 96 occupations from Google Image Search, collected in December 2019. The creators hired workers on Amazon Mechanical Turk to label the gender (male, female) and Fitzpatrick skin tone (Type 1–6) of the primary person in each image, adding “Not applicable” and “Cannot determine” as possible options. Three labels were collected for each image, and each image was assigned the majority label where one existed; this aggregation is sketched after this entry.

  • Affiliation of creators: Yale University.

  • Domain: information systems.

  • Tasks in fairness literature: noisy fair subset selection (mehrotra2021mitigating).

  • Data spec: image.

  • Sample size: K images of 100 occupations.

  • Year: 2019.

  • Sensitive features: gender, skin tone (inferred).

  • Link: https://drive.google.com/drive/u/0/folders/1j9I5ESc-7NRCZ-zSD0C6LHjeNp42RjkJ

  • Further info: celis2020implicit
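
A sketch of the majority-label aggregation described above, where each image received three crowdsourced annotations; the label strings are illustrative.

    from collections import Counter

    def majority_label(labels):
        """Return the label chosen by at least two of the three annotators, else None."""
        label, count = Counter(labels).most_common(1)[0]
        return label if count >= 2 else None

    print(majority_label(["female", "female", "cannot determine"]))  # female
    print(majority_label(["male", "female", "not applicable"]))      # None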

a.124 Office31

  • Description: this dataset was curated to support domain adaptation algorithms for computer vision systems. It features images of 31 different office tools (e.g. chair, keyboard, printer) from 3 different domains: listings on Amazon, high quality camera images, low quality webcam shots.

  • Affiliation of creators: University of California, Berkeley.

  • Domain: computer vision.

  • Tasks in fairness literature: fair clustering (li2020deep).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2011.

  • Sensitive features: none.

  • Link: https://paperswithcode.com/dataset/office-31

  • Further info: saenko2010adapting

a.125 Olympic Athletes

  • Description: this is a historical sports-related dataset on the modern Olympic Games from their first edition in 1896 to the 2016 Rio Games. The dataset was consolidated by Randi H Griffin utilizing SportsReference as the primary source of information. For each athlete, the dataset comprises demographics, height, weight, competition, and medal.

  • Affiliation of creators: unknown.

  • Domain: history.

  • Tasks in fairness literature: fair clustering (huang2019coresets).

  • Data spec: tabular data.

  • Sample size: K athletes.

  • Year: 2018.

  • Sensitive features: sex, age.

  • Link: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

  • Further info: https://www.sports-reference.com/

a.126 Omniglot

  • Description: this dataset was designed to study the problem of automatically learning basic visual concepts. It consists of handwritten characters from different alphabets drawn online via Amazon Mechanical Turk by 20 different people.

  • Affiliation of creators: New York University; University of Toronto; Massachusetts Institute of Technology.

  • Domain: computer vision.

  • Tasks in fairness literature: fair few-shot learning (li2020fair).

  • Data spec: image.

  • Sample size: K images from 50 different alphabets.

  • Year: 2019.

  • Sensitive features: none.

  • Link: https://github.com/brendenlake/omniglot

  • Further info: lake2015humanlevel

a.127 One billion word benchmark

  • Description: this dataset was proposed in 2014 as a benchmark for language models. The authors sourced English textual data from the EMNLP 6th workshop on Statistical Machine Translation (http://statmt.org/wmt11/training-monolingual.tgz), more specifically the Monolingual language model training data, comprising a news crawl from 2007–2011 and data from the European Parliament website. Preprocessing includes removal of duplicate sentences, removal of rare words (appearing fewer than 3 times), and mapping of out-of-vocabulary words to the UNK token; a simplified sketch follows this entry. The ELMo contextualized WEs (peters2018:deep) were trained on this benchmark.

  • Affiliation of creators: Google; University of Edinburgh; Cantab Research Ltd.

  • Domain: linguistics.

  • Tasks in fairness literature: data bias evaluation (tan2019assessing).

  • Data spec: text.

  • Sample size: M words.

  • Year: 2014.

  • Sensitive features: textual references to people and their demographics.

  • Link: https://opensource.google/projects/lm-benchmark

  • Further info: chelba2014billion
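
A simplified sketch of the preprocessing steps listed above, applied to already-tokenized sentences; here rare-word removal and out-of-vocabulary mapping are collapsed into a single step that replaces infrequent tokens with an UNK symbol.

    from collections import Counter

    def preprocess(sentences, min_count=3, unk="<UNK>"):
        """Drop duplicate sentences, then map tokens seen fewer than
        `min_count` times to the UNK token."""
        unique = list(dict.fromkeys(tuple(s) for s in sentences))
        counts = Counter(tok for sent in unique for tok in sent)
        vocab = {tok for tok, c in counts.items() if c >= min_count}
        return [[tok if tok in vocab else unk for tok in sent] for sent in unique]

    corpus = [["the", "cat", "sat"], ["the", "cat", "sat"],  # duplicate sentence
              ["the", "dog", "sat"], ["the", "dog", "ran"],
              ["a", "rare", "word"]]
    print(preprocess(corpus))  # only "the" occurs at least 3 times in this toy corpus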

a.128 Online Freelance Marketplaces

  • Description: this dataset was created to audit racial and gender biases on TaskRabbit and Fiverr, two popular online freelancing marketplaces. The dataset was built by crawling workers’ profiles from both websites, including metadata, activities, and past job reviews. Profiles were later annotated with perceived demographics (gender and race) by Amazon Mechanical Turk workers based on profile images. On TaskRabbit, the authors executed search queries for all task categories in the 10 largest cities where the service is available, logging workers’ ranking in search results. On Fiverr, they concentrated on 9 tasks of diverse nature. The total number of queries that were issued on each platform, resulting in as many search result pages, is not explicitly stated.

  • Affiliation of creators: Northeastern University, GESIS Leibniz Institute for the Social Sciences, University of Koblenz-Landau, ETH Zürich.

  • Domain: information systems.

  • Tasks in fairness literature: fairness evaluation (hannak2017bias).

  • Data spec: query-result pairs.

  • Sample size: K workers (Fiverr); K (TaskRabbit).

  • Year: 2017.

  • Sensitive features: gender, race.

  • Link: not available

  • Further info: hannak2017bias

a.129 Paper-Reviewer Matching

  • Description: this dataset summarizes the peer review assignment process of 3 different conferences, namely one edition of Medical Imaging and Deep Learning (MIDL) and two editions of the Conference on Computer Vision and Pattern Recognition (called CVPR and CVPR2018). The data, provided by OpenReview and the Computer Vision Foundation, consist of a matrix of paper-reviewer affinities, a set of coverage constraints to ensure each paper is properly reviewed, and a set of upper bound constraints to avoid imposing an excessive burden on reviewers. A toy formulation of this assignment problem is sketched after this entry.

  • Affiliation of creators: unknown.

  • Domain: library and information sciences.

  • Tasks in fairness literature: fair matching (kobren2019paper).

  • Data spec: paper-reviewer pairs.

  • Sample size: reviewers for papers (MIDL); K reviewers for K papers (CVPR); K reviewers for K papers (CVPR2018).

  • Year: 2019.

  • Sensitive features: none.

  • Link: not available

  • Further info: kobren2019paper
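
A toy LP-relaxation sketch of the assignment problem described above, maximizing total affinity subject to each paper receiving k reviews and each reviewer handling at most u papers; the affinity matrix is random, since the actual MIDL/CVPR data is not public.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n_reviewers, n_papers, k, u = 4, 3, 2, 2
    affinity = rng.random((n_reviewers, n_papers))

    # Decision variables x[r, p] in [0, 1], flattened row-major; linprog minimizes,
    # so the objective is the negated affinity.
    c = -affinity.ravel()

    # Coverage constraints: each paper receives exactly k (fractional) reviews.
    A_eq = np.zeros((n_papers, n_reviewers * n_papers))
    for p in range(n_papers):
        A_eq[p, p::n_papers] = 1
    b_eq = np.full(n_papers, k)

    # Load constraints: each reviewer handles at most u papers.
    A_ub = np.zeros((n_reviewers, n_reviewers * n_papers))
    for r in range(n_reviewers):
        A_ub[r, r * n_papers:(r + 1) * n_papers] = 1
    b_ub = np.full(n_reviewers, u)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    print(res.x.reshape(n_reviewers, n_papers).round(2))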

a.130 Philadelphia Crime Incidents

  • Description: this dataset is provided as part of OpenDataPhilly initiative. It summarizes hundreds of thousands of crime incidents handled by the Philadelphia Police Department over a period of ten years (2006–2016). The dataset comes with fine spatial and temporal granularity and has been used to monitor seasonal and historical trends and measure the effect of police strategies.

  • Affiliation of creators: Philadelphia Police Department.

  • Domain: law.

  • Tasks in fairness literature: fair resource allocation (elzayn2019fair).

  • Data spec: tabular data.

  • Sample size: M crime incidents.

  • Year: 2021.

  • Sensitive features: geography.

  • Link: https://www.opendataphilly.org/dataset/crime-incidents

  • Further info:

a.131 Pilot Parliaments Benchmark (PPB)

  • Description: this dataset was developed as a benchmark with a balanced representation of gender and skin type to evaluate the performance of face analysis technology. The dataset features images of parliamentary representatives from three African countries (Rwanda, Senegal, South Africa) and three European countries (Iceland, Finland, Sweden) to achieve a good balance between skin type and gender while reducing potential harms connected with lack of consent from the people involved. Three annotators provided gender and Fitzpatrick labels. A certified surgical dermatologist provided the definitive Fitzpatrick skin type labels. Gender was annotated based on name, gendered title, and photo appearance.

  • Affiliation of creators: Massachusetts Institute of Technology; Microsoft.

  • Domain: computer vision.

  • Tasks in fairness literature: fair classification (kim2019multiaccuracy; amini2019uncovering), fairness evaluation (Buolamwini2018gender; raji2019actionable), bias discovery (kim2019multiaccuracy; amini2019uncovering).

  • Data spec: image.

  • Sample size: K images of K individuals.

  • Year: 2018.

  • Sensitive features: gender, skin type.

  • Link: http://gendershades.org/

  • Further info: Buolamwini2018gender

a.132 Pima Indians Diabetes Dataset (PIDD)

  • Description: this resource owes its name to the respective entry on the UCI repository (now unavailable), and was derived from a medical study of Native Americans from the Gila River Community, often called Pima. The study was initiated in the 1960s by the National Institute of Diabetes and Digestive and Kidney Diseases and found a large prevalence of diabetes mellitus in this population. The dataset commonly available nowadays represents a subset of the original study, focusing on women of age 21 or older. It reports whether they tested positive for diabetes, along with eight covariates that were found to be significant risk factors for this population. These include the number of pregnancies, skin thickness, and body mass index, based on which algorithms should predict the test results.

  • Affiliation of creators: Logistics Management Institute; National Institute of Diabetes and Digestive and Kidney Diseases; Johns Hopkins University.

  • Domain: endocrinology.

  • Tasks in fairness literature: fairness evaluation (sharma2020:cc), fair clustering (chen2019proportionally).

  • Data spec: tabular data.

  • Sample size: subjects.

  • Year: 2016.

  • Sensitive features: age.

  • Link: https://www.kaggle.com/uciml/pima-indians-diabetes-database

  • Further info: smith1988using; radin2017digital

a.133 Pokec Social Network

  • Description: this graph dataset summarizes the network of users of Pokec, a social network service popular in Slovakia and the Czech Republic. Because the default privacy settings were predefined as public, the curators collected a wealth of information for each profile, including demographics, politics, education, marital status, and children, wherever available. This resource was collected to perform data analysis in social networks.

  • Affiliation of creators: University of Zilina.

  • Domain: social networks.

  • Tasks in fairness literature: fair data summarization (halabi2020fairness).

  • Data spec: user-user pairs.

  • Sample size: M nodes (profiles) connected by M edges (friendship relations).

  • Year: 2013.

  • Sensitive features: gender, geography, age.

  • Link: https://snap.stanford.edu/data/soc-pokec.html

  • Further info: takac2012data

a.134 Popular Baby Names

  • Description: this dataset summarizes birth registrations in New York City, focusing on the name, sex, and race of newborns, and providing a reliable source of data to assess naming trends in New York. A similar nation-wide database is maintained by the US Social Security Administration.

  • Affiliation of creators: City of New York, Department of Health and Mental Hygiene (NYC names); United States Social Security Administration (US names).

  • Domain: linguistics.

  • Tasks in fairness literature: fair sentiment analysis (yurochkin2020training; mukherjee2020two), bias discovery in WEs (swinger2019what).

  • Data spec: tabular data.

  • Sample size: K unique names (NYC names); K unique names (US names).

  • Year: 2021.

  • Sensitive features: sex, race.

  • Link: https://catalog.data.gov/dataset/popular-baby-names (NYC names); https://www.ssa.gov/oact/babynames/limits.html (US names)

  • Further info:

a.135 Poverty in Colombia

  • Description: this dataset stems from an official survey of households performed yearly by the Colombian national statistics department (Departamento Administrativo Nacional de Estadística). The survey is aimed at soliciting information about employment, income, and demographics. The data serves as an input for studies on poverty in Colombia.

  • Affiliation of creators: Departamento Administrativo Nacional de Estadística.

  • Domain: economics.

  • Tasks in fairness literature: fair classification (noriegacampero2020algorithmic).

  • Data spec: tabular data.

  • Sample size: unknown.

  • Year: 2018.

  • Sensitive features: age, sex, geography.

  • Link: https://www.dane.gov.co/index.php/estadisticas-por-tema/pobreza-y-condiciones-de-vida/pobreza-y-desigualdad/pobreza-monetaria-y-multidimensional-en-colombia-2018

  • Further info: https://www.dane.gov.co/files/investigaciones/condiciones_vida/pobreza/2018/bt_pobreza_monetaria_18.pdf

a.136 PP-Pathways

  • Description: this dataset represents a network of physical interactions between proteins that are experimentally documented in humans. The dataset was assembled to study the problem of automated discovery of the proteins (nodes) associated with a given disease. Starting from a few known disease-associated proteins and a map of protein-protein interactions (edges), the task is to find the full list of proteins associated with said disease.

  • Affiliation of creators: Stanford University; Chan Zuckerberg Biohub.

  • Domain: biology.

  • Tasks in fairness literature: fair graph mining (kang2020inform).

  • Data spec: protein-protein pairs.

  • Sample size: K proteins (nodes) linked by K physical interactions.

  • Year: 2018.

  • Sensitive features: none.

  • Link: http://snap.stanford.edu/biodata/datasets/10000/10000-PP-Pathways.html

  • Further info: agrawal2018large

a.137 Prosper Loans Network

  • Description: this dataset represents transactions on the Prosper marketplace, a famous peer-to-peer lending service where US-based users can register as lenders or borrowers. This resource has a graph structure and covers the period 2005–2011. Loan records include user ids, timestamps, loan amount, and rate. The dataset was first associated with a study of arbitrage and its profitability in a peer-to-peer lending system.

  • Affiliation of creators: Prosper; University College Dublin.

  • Domain: finance.

  • Tasks in fairness literature: fair classification (li2020fairness).

  • Data spec: lender-borrower pairs.

  • Sample size: M loan records involving K people.

  • Year: 2015.

  • Sensitive features: none.

  • Link: http://mlg.ucd.ie/datasets/prosper.html

  • Further info: redmond2013temporal

a.138 PubMed Diabetes Papers

  • Description: this dataset was created to study the problem of classification of connected entities via active learning. The creators extracted a set of articles related to diabetes from PubMed, along with their citation network. The task associated with the dataset is inferring a label specifying the type of diabetes addressed in each publication. For this task, TF/IDF-weighted term frequencies of every article are available.

  • Affiliation of creators: University of Maryland.

  • Domain: library and information sciences.

  • Tasks in fairness literature: fair graph mining (li2021on).

  • Data spec: article-article pairs.

  • Sample size: K articles connected by K citations.

  • Year: 2020.

  • Sensitive features: none.

  • Link: https://linqs.soe.ucsc.edu/data

  • Further info: namata2012query

a.139 Pymetrics Bias Group

  • Description: Pymetrics is a company that offers a candidate screening tool to employers. Candidates play a core set of twelve games, derived from psychological studies. The resulting gamified psychological measurements are exploited to build predictive models for hiring, where positive examples are provided by high-performing employees from the employer. Pymetrics staff maintain a Pymetrics Bias Group dataset for internal fairness audits by asking players to fill in an optional demographic survey after they complete the games.

  • Affiliation of creators: Pymetrics.

  • Domain: information systems.

  • Tasks in fairness literature: fairness evaluation (wilson2021building).

  • Data spec: tabular data.

  • Sample size: K users.

  • Year: 2021.

  • Sensitive features: gender, race.

  • Link: not available

  • Further info: wilson2021building

a.140 Race on Twitter

  • Description: this dataset was collected to power applications of user-level race prediction on Twitter. Twitter users were hired through Qualtrics, where they filled in a survey providing their Twitter handle and demographics, including race, gender, age, education, and income. The dataset creators downloaded the most recent 3,200 tweets by the users who provided their handle. The data, allegedly released in an anonymized and aggregated format, appears to be unavailable.

  • Affiliation of creators: University of Pennsylvania.

  • Domain: social media.

  • Tasks in fairness literature: fairness evaluation (ballburack2021differential).

  • Data spec: text.

  • Sample size: M tweets from K users.

  • Year: 2018.

  • Sensitive features: race, gender, age.

  • Link: http://www.preotiuc.ro/

  • Further info: preotiucpietro2018user

a.141 Racial Faces in the Wild (RFW)

  • Description: this dataset was developed as a benchmark for face verification algorithms operating on diverse populations. The dataset comprises 4 clusters of images extracted from MS-Celeb-1M (§ A.113), a dataset that was discontinued by Microsoft due to privacy violations. Clusters are of similar size and contain individuals labelled Caucasian, Asian, Indian and African. Half of the labels (Asian, Indian) are derived from the “Nationality attribute of FreeBase celebrities”; the remaining half (Caucasian, African) is automatically estimated via the Face++ API. This attribute is referred to as “race” by the authors, who also assert “carefully and manually” cleaning every image. Clusters feature multiple images of each individual to allow for face verification applications.

  • Affiliation of creators: Beijing University of Posts and Telecommunications; Canon Information Technology (Beijing).

  • Domain: computer vision.

  • Tasks in fairness literature: fair reinforcement learning (wang2020mitigating).

  • Data spec: image.

  • Sample size: K images of K individuals.

  • Year: 2019.

  • Sensitive features: race (inferred).

  • Link: http://www.whdeng.cn/RFW/testing.html

  • Further info: wang2019racial

a.142 Real-Time Crime Forecasting Challenge

  • Description: this dataset was assembled and released by the US National Institute of Justice in 2017 with the goal of advancing the state of automated crime forecasting. It consists of calls-for-service (CFS) records provided by the Portland Police Bureau for the period 2012–2017. Each CFS record contains spatio-temporal data and crime-related categories. The dataset was released as part of a challenge with a total prize of $1,200,000.

  • Affiliation of creators: National Institute of Justice.

  • Domain: law.

  • Tasks in fairness literature: fair spatio-temporal process learning (shang2020listwise).

  • Data spec: tabular data.

  • Sample size: 700K CFS records.

  • Year: 2017.

  • Sensitive features: geography.

  • Link: https://nij.ojp.gov/funding/real-time-crime-forecasting-challenge-posting#data

  • Further info: conduent2018:real

a.143 Recidivism of Felons on Probation

  • Description: this dataset covers probation cases of persons who were sentenced in 1986 in 32 urban and suburban US jurisdictions. It was assembled to study the behaviour of individuals on probation and their compliance with court orders across states. Possible outcomes include successful discharge, new felony rearrest, and absconding. The information on probation cases was frequently obtained through manual reviews and transcription of probation files, mostly by college students. Variables include probationer’s demographics, educational level, wage, history of convictions, disciplinary hearings and probation sentences. The final dataset consists of K probation cases “representative of 79,043 probationers”.

  • Affiliation of creators: US Department of Justice; National Association of Criminal Justice Planners.

  • Domain: law.

  • Tasks in fairness literature: limited-label fair classification (wang2020augmented).

  • Data spec: tabular data.

  • Sample size: K probation cases.

  • Year: 2005.

  • Sensitive features: sex, race, ethnicity, age.

  • Link: https://www.icpsr.umich.edu/web/NACJD/studies/9574

  • Further info: https://bjs.ojp.gov/data-collection/recidivism-survey-felons-probation

a.144 Renal Failure

  • Description: the dataset was created to compare the performance of two different algorithms for automated renal failure risk assessment. Considering patients who received care at NYU Langone Medical Center, each entry encodes their health records, demographics, disease history, and lab results. The final version of the dataset has a cutoff date, considering only patients who did not have kidney failure by that time, and reporting, as a target ground truth, whether they proceeded to have kidney failure within the next year.

  • Affiliation of creators: New York University; New York University Langone Medical Center.

  • Domain: nephrology.

  • Tasks in fairness literature: fairness evaluation (williams2019quantification).

  • Data spec: tabular data.

  • Sample size: M patients.

  • Year: 2019.

  • Sensitive features: age, gender, race.

  • Link: not available

  • Further info: williams2019quantification

a.145 Reuters 50 50

  • Description: this dataset was extracted from the Reuters Corpus Volume 1 (RCV1), a large corpus of newswire stories, to study the problem of authorship attribution. The 50 most prolific authors were selected from RCV1, considering only texts labeled corporate/industrial. The dataset consists of short news stories from these authors, labelled with the name of the author.

  • Affiliation of creators: University of the Aegean.

  • Domain: news.

  • Tasks in fairness literature: fair clustering (harb2020kfc).

  • Data spec: text.

  • Sample size: K articles.

  • Year: 2011.

  • Sensitive features: author, textual references to people and their demographics.

  • Link: http://archive.ics.uci.edu/ml/datasets/Reuter_50_50

  • Further info: houvardas2006n

a.146 Ricci

  • Description: this dataset relates to the US Supreme Court labor discrimination case Ricci v. DeStefano (2009), connected with the disparate impact doctrine. It represents 118 firefighter promotion tests, providing the scores and race of each test taker. Eighteen firefighters from the New Haven Fire Department claimed “reverse discrimination” after the city refused to certify a promotion examination where they had obtained high scores. The reasons why city officials avoided certifying the examination included concerns of a potential violation of the ‘four-fifths’ rule (sketched after this entry), as, given the vacancies at the time, no black firefighter would be promoted. The dataset was published and popularized by Weiwen Miao for pedagogical use.

  • Affiliation of creators: Haverford College.

  • Domain: law.

  • Tasks in fairness literature: fairness evaluation (feldman2015certifying; friedler2019comparative), limited-label fairness evaluation (ji2020can).

  • Data spec: tabular data.

  • Sample size: test takers.

  • Year: 2018.

  • Sensitive features: race.

  • Link: http://jse.amstat.org/jse_data_archive.htm; https://github.com/algofairness/fairness-comparison/tree/master/fairness/data/raw

  • Further info: gastwirth2009formal; miao2010did
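
A sketch of the ‘four-fifths’ rule mentioned above: the selection rate of the disadvantaged group should be at least 80% of the highest group selection rate. The counts below are illustrative and are not the actual Ricci figures.

    def disparate_impact_ratio(selected_a, total_a, selected_b, total_b):
        """Ratio of the lower selection rate to the higher one."""
        rate_a = selected_a / total_a
        rate_b = selected_b / total_b
        return min(rate_a, rate_b) / max(rate_a, rate_b)

    ratio = disparate_impact_ratio(selected_a=3, total_a=27, selected_b=16, total_b=41)
    print(round(ratio, 2), "violates the four-fifths rule" if ratio < 0.8 else "passes")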

a.147 Rice Facebook Network

  • Description: this dataset represents the Facebook sub-network of students and alumni of Rice University. It consists of a crawl of reachable profiles in the Rice Facebook network, augmented with academic information obtained from Rice University directories. This collection was created to study the problem of inferring unknown attributes in a social network based on the network graph and the attributes that are available for a fraction of users.

  • Affiliation of creators: MPI-SWS; Rice University; Northeastern University.

  • Domain: social networks.

  • Tasks in fairness literature: fair influence maximization (ali2019fairness).

  • Data spec: user-user pairs.

  • Sample size: K profiles connected by K edges.

  • Year: 2010.

  • Sensitive features: none.

  • Link: not available

  • Further info: mislove2010you

a.148 Riddle of Literary Quality

  • Description: this text corpus was assembled to study the factors that correlate with the acceptance of a text as literary (or non-literary) and good (or bad). It consists of 401 Dutch-language novels published between 2007 and 2012. These works were selected because they were bestsellers or frequently borrowed from libraries in the period 2009–2012. Due to copyright reasons, the data is not publicly available.

  • Affiliation of creators: Huygens ING – KNAW; University of Amsterdam; Fryske Akademy.

  • Domain: literature.

  • Tasks in fairness literature: fairness evaluation (koolen2017:stereotypes).

  • Data spec: text.

  • Sample size: novels.

  • Year: 2017.

  • Sensitive features: gender (of author).

  • Link: not available

  • Further info: koolen2017:stereotypes; https://literaryquality.huygens.knaw.nl/

a.149 Ride-hailing App

  • Description: this dataset was gathered from a ride-hailing app operating in an undisclosed major Asian city. It summarizes spatio-temporal data about ride requests (jobs) and their assignment to drivers over 29 consecutive days. The data tracks the position and status of taxis, logged every 30–90 seconds.

  • Affiliation of creators: Max Planck Institute for Software Systems; Max Planck Institute for Informatics.

  • Domain: transportation.

  • Tasks in fairness literature: fair matching (suhr2019twosided).

  • Data spec: driver-job pairs.

  • Sample size: K drivers handling K job requests.

  • Year: 2019.

  • Sensitive features: geography.

  • Link: not available

  • Further info: suhr2019twosided

a.150 RtGender

  • Description: this dataset captures differences in online commenting behaviour in response to posts and videos by female and male users. It was created by collecting posts and top-level comments from four platforms: Facebook, Reddit, Fitocracy, and TED talks. For each of the four sources, the possibility of reliably reporting the gender of the poster or presenter shaped the data collection procedure. Authors of posts and videos were selected among users self-reporting their gender or public figures for whom gender annotations were available. For instance, the authors created two Facebook-based datasets: one containing all posts and associated top-level comments for the 412 members of the US Congress who have public Facebook pages, and a similar one for 105 American public figures (journalists, novelists, actors, actresses, etc.). The gender of these figures was derived from their presence on Wikipedia category pages relevant for gender (e.g. https://en.wikipedia.org/wiki/Category:American_female_tennis_players). The gender of commenters and a reliable ID identifying them across comments may be useful for some analyses. The authors report commenters’ first names and a randomized ID, which support these uses while reducing the chances of re-identification based on last name and Facebook ID.

  • Affiliation of creators: Stanford University; University of Michigan; Carnegie Mellon University.

  • Domain: social media, linguistics.

  • Tasks in fairness literature: fairness evaluation (babaeianjelodar2020quantifying).

  • Data spec: text.

  • Sample size: M posts with M comments.

  • Year: 2018.

  • Sensitive features: gender. (Annotations for Facebook and TED come from Wikipedia and mirkin2015:motivating, respectively; Reddit and Fitocracy rely on self-reported labels.)

  • Link: https://nlp.stanford.edu/robvoigt/rtgender/

  • Further info: voigt2018rtgender

a.151 SafeGraph Research Release

  • Description: this dataset captures mobility patterns in the US and Canada. It is maintained by SafeGraph, a data company powering analytics about access to Points-of-Interest (POI) and mobility, including pandemic research. SafeGraph data is sourced from millions of mobile devices, whose users allow location tracking by some apps. The Research Release dataset consists of aggregated estimates of hourly visit counts to over 6 million POI. Given the increasing importance of SafeGraph data, which directly influences not only private initiatives but also public policy, audits of data representativeness are being carried out both internally (squire2019measuring) and externally (coston2021leveraging).

  • Affiliation of creators: SafeGraph.

  • Domain: urban studies.

  • Tasks in fairness literature: data bias evaluation (coston2021leveraging).

  • Data spec: mixture.

  • Sample size: M POI.

  • Year: 2021.

  • Sensitive features: geography.

  • Link: https://www.safegraph.com/academics

  • Further info: https://docs.safegraph.com/v4.0/docs

a.152 Scientist+Painter

  • Description: this resource was crawled to study the problem of fair and diverse representation in subsets of instances selected from a large dataset, with a focus on gender concentration in professions. The dataset consists of approximately 800 images that equally represent male scientists, female scientists, male painters, and female painters. These images were gathered from Google image search, selecting the top 200 medium-sized JPEG files that passed the strictest level of Safe Search filtering. Then, each image was processed to obtain sets of 128-dimensional SIFT descriptors. The descriptors were combined, subsampled and then clustered using k-means into 256 clusters (a minimal sketch of this pipeline is given after this entry).

  • Affiliation of creators: École Polytechnique Fédérale de Lausanne (EPFL); Microsoft; University of California, Berkeley.

  • Domain: information systems.

  • Tasks in fairness literature: fair data summarization (celis2016fair; celis2018fair).

  • Data spec: image.

  • Sample size: images.

  • Year: 2016.

  • Sensitive features: male/female.

  • Link: goo.gl/hNukfP

  • Further info: celis2016fair
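
The feature-construction pipeline described for this dataset (per-image SIFT descriptors, pooled, subsampled, and clustered into 256 visual words) can be sketched as follows. This is a minimal illustration under stated assumptions, not the creators' code: it uses OpenCV's SIFT and scikit-learn's k-means, and the image folder and subsampling size are placeholders.

  # Sketch of the described pipeline: per-image SIFT descriptors, pooled,
  # subsampled, and clustered into 256 visual words with k-means.
  import glob
  import cv2                      # opencv-python
  import numpy as np
  from sklearn.cluster import KMeans

  sift = cv2.SIFT_create()
  descriptors = []
  for path in glob.glob("images/*.jpg"):          # placeholder image folder
      img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
      if img is None:
          continue
      _, desc = sift.detectAndCompute(img, None)  # desc has shape (n_keypoints, 128)
      if desc is not None:
          descriptors.append(desc)

  all_desc = np.vstack(descriptors)
  # Subsample descriptors before clustering (the rate here is an arbitrary choice).
  rng = np.random.default_rng(0)
  idx = rng.choice(len(all_desc), size=min(50_000, len(all_desc)), replace=False)

  kmeans = KMeans(n_clusters=256, n_init=10, random_state=0).fit(all_desc[idx])
  # Each image can then be represented by a 256-bin histogram of its
  # descriptors' cluster assignments.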

a.153 Section 203 determinations

  • Description: this dataset was created in support of the language minority provisions of the Voting Rights Act, Section 203. The data contains information about the limited-English-proficient voting population by jurisdiction, which is used to determine whether election materials must be printed in minority languages. For each combination of language protected by Section 203 and US jurisdiction, the dataset provides information about the total population, the population of voting age, and the US-citizen population of voting age, combining this information with language spoken at home and overall English proficiency.

  • Affiliation of creators: US Census Bureau.

  • Domain: demography.

  • Tasks in fairness literature: fairness evaluation of private resource allocation (pujol2020fair).

  • Data spec: tabular data.

  • Sample size: K combinations of jurisdictions and languages potentially spoken therein.

  • Year: 2017.

  • Sensitive features: geography, language.

  • Link: https://www.census.gov/data/datasets/2016/dec/rdo/section-203-determinations.html

  • Further info: https://www.census.gov/programs-surveys/decennial-census/about/voting-rights/voting-rights-determination-file.2016.html

a.154 Sentiment140

  • Description: this dataset was created to study the problem of sentiment analysis in social media, envisioning applications such as product quality and brand reputation analysis via Twitter monitoring. The sentiment of tweets, retrieved via the Twitter API, is automatically inferred based on the presence of emoticons conveying joy or sadness (a minimal sketch of this labelling heuristic is given after this entry). This dataset is part of the LEAF benchmark for federated learning. In federated learning settings, devices correspond to accounts.

  • Affiliation of creators: Stanford University.

  • Domain: social media.

  • Tasks in fairness literature: fair federated learning (li2020fair).

  • Data spec: text.

  • Sample size: M tweets by K accounts.

  • Year: 2012.

  • Sensitive features: textual references to people and their demographics.

  • Link: http://help.sentiment140.com/home

  • Further info: go2009twitter
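
The emoticon-based labelling heuristic mentioned above can be illustrated with a minimal sketch. The emoticon lists and the discard rule below are assumptions for illustration, not the creators' exact implementation (see go2009twitter for that).

  # Illustrative distant-supervision labelling in the spirit of Sentiment140:
  # tweets with "joy" emoticons are labelled positive, "sadness" emoticons negative.
  POSITIVE = (":)", ":-)", ":D", "=)")   # example lists, not the original ones
  NEGATIVE = (":(", ":-(", "=(")

  def label_tweet(text):
      """Return 'positive', 'negative', or None (tweet discarded)."""
      has_pos = any(e in text for e in POSITIVE)
      has_neg = any(e in text for e in NEGATIVE)
      if has_pos == has_neg:             # no emoticon, or mixed signals: discard
          return None
      return "positive" if has_pos else "negative"

  def strip_emoticons(text):
      """Remove the emoticons so a classifier cannot rely on them directly."""
      for e in POSITIVE + NEGATIVE:
          text = text.replace(e, "")
      return text.strip()

  print(label_tweet("exam went great :)"))    # -> positive
  print(label_tweet("stuck in traffic :("))   # -> negative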

a.155 Shakespeare

  • Description: this dataset is available as part of the LEAF benchmark for federated learning (caldas2018leaf). It is built from “The Complete Works of William Shakespeare”, where each speaking role represents a different device. The task envisioned for this dataset is next character prediction.

  • Affiliation of creators: Google; Carnegie Mellon University; Determined AI.

  • Domain: literature.

  • Tasks in fairness literature: fair federated learning (li2020fair).

  • Data spec: text.

  • Sample size: M tokens over K speaking roles.

  • Year: 2020.

  • Sensitive features: textual references to people and their demographics.

  • Link: https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/shakespeare

  • Further info: mcmahan2017communicationefficient; caldas2018leaf

a.156 Shanghai Taxi Trajectories

  • Description: this semi-synthetic dataset represents the road network and traffic patterns of Shanghai. Trajectories were collected from thousands of taxis operating in Shanghai. Spatio-temporal traffic patterns were extracted from these trajectories and used to build the dataset.

  • Affiliation of creators: Shanghai Jiao Tong University; CITI-INRIA Lab.

  • Domain: transportation.

  • Tasks in fairness literature: fair routing (qian2015scram).

  • Data spec: unknown.

  • Sample size: unknown.

  • Year: 2015.

  • Sensitive features: geography.

  • Link: not available

  • Further info: qian2015scram

a.157 shapes3D

  • Description: this dataset is an artificial benchmark for unsupervised methods aimed at learning disentangled data representations. It consists of images of 3D shapes in a walled environment, with variable floor colour, wall colour, object colour, scale, shape and orientation.

  • Affiliation of creators: DeepMind; Wayve.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (locatello2019fairness), fair data generation (choi2020fair).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2018.

  • Sensitive features: none.

  • Link: https://github.com/deepmind/3d-shapes

  • Further info: kim2018disentangling

a.158 SIIM-ISIC Melanoma Classification

  • Description: this dataset was developed to advance the study of automated melanoma classification. The resource consists of dermoscopy images from six medical centers. Images in the dataset are tagged with a patient identifier, allowing lesions from the same patient to be mapped to one another. Images were queried from medical databases among patients with dermoscopy imaging from 1998 to 2019, ranging in size from 307,200 to 24,000,000 pixels. A curated subset is employed for the 2020 ISIC Grand Challenge (https://www.kaggle.com/c/siim-isic-melanoma-classification). This dataset was annotated automatically with a binary Fitzpatrick skin tone label (cheng2021can).

  • Affiliation of creators: Memorial Sloan Kettering Cancer Center; University of Queensland; University of Athens; IBM; Universitat de Barcelona; Melanoma Institute Australia; Sydney Melanoma Diagnostic Center; Emory University; Medical University of Vienna; Mayo Clinic; SUNY Downstate Medical School; Stony Brook Medical School; Rabin Medical Center; Weill Cornell Medical College.

  • Domain: dermatology.

  • Tasks in fairness literature: fairness evaluation of private classification (cheng2021can).

  • Data spec: image.

  • Sample size: K images of K patients.

  • Year: 2020.

  • Sensitive features: skin type.

  • Link: https://doi.org/10.34970/2020-ds01

  • Further info: rotemberg2021patient

a.159 SmallNORB

  • Description: this dataset was assembled by researchers affiliated with New York University as a benchmark for robust object recognition under variable pose and lighting conditions. It consists of images of 50 different toys belonging to 5 categories (four-legged animals, human figures, airplanes, trucks, and cars) obtained by 2 different cameras.

  • Affiliation of creators: New York University; NEC Labs America.

  • Domain: computer vision.

  • Tasks in fairness literature: fair representation learning (locatello2019fairness).

  • Data spec: image.

  • Sample size: K images.

  • Year: 2005.

  • Sensitive features: none.

  • Link: https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/

  • Further info: lecun2004learning

a.160 Spliddit Divide Goods

  • Description: this dataset summarizes instances of usage of the divide goods feature of Spliddit, a not-for-profit academic endeavor providing easy access to fair division methods. A typical use case for the service is inheritance division. Participants express their preferences by dividing 1,000 points between the available goods. In response, the service provides suggestions that are meant to maximize the overall satisfaction of all stakeholders.

  • Affiliation of creators: Spliddit.

  • Domain: economics.

  • Tasks in fairness literature: fair preference-based resource allocation (babaioff2019fair).

  • Data spec: tabular data.

  • Sample size: K division instances.

  • Year: 2016.

  • Sensitive features: none.

  • Link: not available

  • Further info: caragiannis2016unreasonable; http://www.spliddit.org/apps/goods

a.161 Stanford Medicine Research Data Repository

  • Description: this is a data lake/repository developed at Stanford University, supporting a number of data sources and access pipelines. The aim of the underlying project is to facilitate access to clinical data for research purposes through flexible and robust management of medical data. The data comes from Stanford Health Care, the Stanford Children’s Hospital, the University Healthcare Alliance and Packard Children’s Health Alliance clinics.

  • Affiliation of creators: Stanford University.

  • Domain: medicine.

  • Tasks in fairness literature: fair risk assessment (pfohl2019creating).

  • Data spec: mixture.

  • Sample size: 3M individuals.

  • Year: 2021.

  • Sensitive features: race, ethnicity, gender, age.

  • Link: https://starr.stanford.edu/

  • Further info: lowe2009stride; datta2020new

a.162 State Court Processing Statistics (SCPS)

  • Description: this resource was curated as part of the SCPS program. The program tracked felony defendants from charging by the prosecutor until disposition of their cases, for a maximum of 12 months (24 months for murder cases). The data represents felony cases filed in approximately 40 populous US counties in the period 1990–2009. Each defendant is described by 106 variables covering demographics, arrest charges, criminal history, pretrial release and detention, adjudication, and sentencing.

  • Affiliation of creators: US Department of Justice.

  • Domain: law.

  • Tasks in fairness literature: fairness evaluation of multi-stage classification (green2019disparate).

  • Data spec: tabular data.

  • Sample size: K defendants.

  • Year: 2014.

  • Sensitive features: gender, race, age, geography.

  • Link: https://www.icpsr.umich.edu/web/NACJD/studies/2038/datadocumentation

  • Further info: https://bjs.ojp.gov/data-collection/state-court-processing-statistics-scps

a.163 Steemit

  • Description: this resource was collected to test novel approaches for personalized content recommendation in social networks. It consists of two separate datasets summarizing interactions in the Spanish subnetwork and the English subnetwork of Steemit, a blockchain-based social media website. The datasets summarize user-post interactions in a binary fashion, using comments as a proxy for positive engagement. The datasets cover a whole year of commenting activity over the period 2017–2018 and comprise the text of posts.

  • Affiliation of creators: Hong Kong University of Science and Technology; WeBank.

  • Domain: social media.

  • Tasks in fairness literature: fairness evaluation (xiao2019beyond).

  • Data spec: user-post pairs.

  • Sample size: K users interacting over K posts.

  • Year: 2019.

  • Sensitive features: textual references to people and their demographics.

  • Link: https://github.com/HKUST-KnowComp/Social-Explorative-Attention-Networks

  • Further info: xiao2019beyond

a.164 Stop, Question and Frisk

  • Description: Stop, Question and Frisk (SQF) is an expression that commonly refers to a New York City policing program under which officers can briefly detain, question, and search a citizen if the officer has a reasonable suspicion of criminal activity. Concerns about race-based disparities in this practice have been raised repeatedly, especially in connection with the subjective nature of “reasonable suspicion” and the fact that being in a “high-crime area” lawfully lowers the bar for what may constitute reasonable suspicion. The NYPD has a policy of keeping track of most stops, recording them in UF-250 forms which are maintained centrally and distributed by the NYPD. The form includes information such as the place and time of a stop, its duration and outcome, along with data on the demographics and physical appearance of the suspect. Currently available data pertains to the years 2003–2020.

  • Affiliation of creators: New York Police Department.

  • Domain: law.

  • Tasks in fairness literature: preference-based fair classification (zafar2017from), robust fair classification (kallus2018residual), noisy fair classification (kilbertus2018blind), fairness evaluation (goel2017combatting).

  • Data spec: tabular data.

  • Sample size: M records.

  • Year: 2021.

  • Sensitive features: race, age, sex, geography.

  • Link: https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page

  • Further info: gelman2007analysis; goel2016precinct

a.165 Strategic Subject List

  • Description: this dataset was funded through a Bureau of Justice Assistance grant and leveraged by the Illinois Institute of Technology to develop the Chicago Police Department’s Strategic Subject Algorithm. The algorithm provides a risk score which reflects an individual’s probability of being involved in a shooting incident either as a victim or an offender. For each individual, the dataset provides information about the circumstances of their arrest, their demographics and criminal history. The dataset covers arrest data from the period 2012–2016; the associated program was discontinued in 2019.

  • Affiliation of creators: Chicago Police Department; Illinois Institute of Technology.

  • Domain: law.

  • Tasks in fairness literature: fairness evaluation (black2020fliptest).

  • Data spec: tabular data.

  • Sample size: K individuals.

  • Year: 2020.

  • Sensitive features: race, sex, age.

  • Link: https://data.cityofchicago.org/Public-Safety/Strategic-Subject-List-Historical/4aki-r3np

  • Further info: hollywood2019real

a.166 Student

  • Description: the data was collected from two Portuguese public secondary schools in the Alentejo region to investigate student achievement prediction and identify decisive factors in student success. The data tracks student performance in Mathematics and Portuguese through the 2005–2006 school year and is complemented by demographic, socio-economic, and personal data obtained through a questionnaire. Numerical grades (20-point scale) collected by students over three terms are typically the target of the associated prediction task.

  • Affiliation of creators: University of Minho.

  • Domain: education.

  • Tasks in fairness literature: fair regression (chzhen2020fairwassertein; chzhen2020fair; heidari2019on), rich-subgroup fairness evaluation (kearns2019empirical), fair data summarization (jones2020fair).

  • Data spec: tabular data.

  • Sample size: students.

  • Year: 2014.

  • Sensitive features: sex, age.

  • Link: https://archive.ics.uci.edu/ml/datasets/student+performance

  • Further info: cortez2008using

a.167 Student Performance

  • Description: this resource represents students at an undisclosed US research university, spanning the Fall 2014 to Spring 2019 terms. The associated task is predicting student success based on university administrative records. Student features include demographics and academic information on prior achievement and standardized test scores.

  • Affiliation of creators: Cornell University.

  • Domain: education.

  • Tasks in fairness literature: fairness evaluation (lee2020evaluation).

  • Data spec: tabular data.

  • Sample size: unknown.

  • Year: 2020.

  • Sensitive features: gender, racial-ethnic group.

  • Link: not available

  • Further info: lee2020evaluation

a.168 Sushi

  • Description: this dataset was sourced online via a commercial survey service to evaluate rank-based approaches to solicit preferences and provide recommendations. The dataset captures the preferences for different types of sushi held by people in different areas of Japan. These are encoded both as ratings in a 5-point scale and ordered lists of preferences, which recommenders should learn via collaborative filtering. Demographic data was also collected to study geographical preference patterns.

  • Affiliation of creators: Japanese National Institute of Advanced Industrial Science and Technology (AIST).

  • Domain: .

  • Tasks in fairness literature: fair data summarization (chiplunkar2020how).

  • Data spec: user-sushi pairs.

  • Sample size: K respondents.

  • Year: 2016.

  • Sensitive features: gender, age, geography.

  • Link: https://www.kamishima.net/sushi/

  • Further info: kamishima2003:nantonac

a.169 Symptoms in Queries

  • Description: the purpose of this dataset is to study, using only aggregate statistics, the fairness and accuracy of a classifier that predicts whether an individual has a certain type of cancer based on their Bing search queries. The dataset does not include individual data points. It provides, for each US state and for 18 types of cancer, the proportion of individuals who have this cancer in the state according to CDC 2019 data (https://gis.cdc.gov/Cancer/USCS/DataViz.html), and the proportion of individuals who are predicted to have this cancer according to the classifier computed from Bing queries.

  • Affiliation of creators: Microsoft; Ben-Gurion University of the Negev.

  • Domain: information systems, public health.

  • Tasks in fairness literature: limited-label fairness evaluation (sabato2020bounding).

  • Data spec: tabular data.

  • Sample size: statistics for cancer types across US states.

  • Year: 2020.

  • Sensitive features: geography.

  • Link: https://github.com/sivansabato/bfa/blob/master/cancer_data.m

  • Further info: sabato2020bounding

a.170 TAPER Twitter Lists

  • Description: this resource was collected to study the problem of personalized expert recommendation, leveraging Twitter lists in which users labelled other users as relevant for (or expert in) a given topic. The creators started from a seed dataset of over 12 million geo-tagged Twitter lists, which they filtered to keep only US-based users in the following topics: news, music, technology, celebrities, sports, business, politics, food, fashion, art, science, education, marketing, movie, photography, and health. A subset of this dataset was annotated with user race (whites and non-whites) via Face++ (zhu2018fairnessaware).

  • Affiliation of creators: Texas A&M University.

  • Domain: social media.

  • Tasks in fairness literature: fair ranking (zhu2018fairnessaware).

  • Data spec: user-topic pairs.

  • Sample size: K Twitter lists featuring K list members.

  • Year: 2016.

  • Sensitive features: race.

  • Link: not available

  • Further info: ge2016taper

a.171 Toy Dataset 1

  • Description: this dataset consists of K points generated as follows. A binary class label y is drawn at random for each point. Next, a two-dimensional feature vector x is assigned to each point, sampled from a Gaussian distribution whose mean and covariance depend on y, so that p(x | y = 1) and p(x | y = -1) are Gaussians with different parameters. Finally, each point’s sensitive attribute z is sampled from a Bernoulli distribution with p(z = 1) = p(x' | y = 1) / (p(x' | y = 1) + p(x' | y = -1)), where x' is a rotated version of x, i.e. x' = R(φ) x for a rotation matrix R(φ). The rotation angle φ controls the correlation between the class label y and the sensitive attribute z (a generation sketch is given after this entry).

  • Affiliation of creators: Max Planck Institute for Software Systems.

  • Domain: N/A.

  • Tasks in fairness literature: fair classification (zafar2017fairness; roh2020frtrain), fair preference-based classification (zafar2017from; ali2019loss), fair few-shot learning (slack2019fair; slack2020fairness), noisy fair classification (kilbertus2018blind).

  • Data spec: tabular data.

  • Sample size: K points.

  • Year: 2017.

  • Sensitive features: N/A.

  • Link: https://github.com/mbilalzafar/fair-classification/tree/master/disparate_impact/synthetic_data_demo

  • Further info: zafar2017fairness
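
The generating process described above can be sketched as follows. The Gaussian parameters, the sample size, and the rotation angle φ below are illustrative assumptions; the original values are given in zafar2017fairness and in the linked repository.

  # Sketch of the Toy Dataset 1 generating process (illustrative parameters).
  import numpy as np
  from scipy.stats import multivariate_normal

  rng = np.random.default_rng(0)
  n = 2000                         # number of points (assumed)
  phi = np.pi / 4                  # rotation angle controlling the y-z correlation (assumed)

  # Assumed class-conditional Gaussians for the two-dimensional features.
  g_pos = multivariate_normal(mean=[2, 2], cov=[[5, 1], [1, 5]])
  g_neg = multivariate_normal(mean=[-2, -2], cov=[[10, 1], [1, 3]])

  y = rng.integers(0, 2, size=n)   # random binary class labels
  x = np.where(y[:, None] == 1,
               g_pos.rvs(n, random_state=1),
               g_neg.rvs(n, random_state=2))

  # Rotate the features and draw the sensitive attribute from a Bernoulli whose
  # parameter compares the class-conditional densities at the rotated point.
  rot = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
  x_rot = x @ rot.T
  p_z1 = g_pos.pdf(x_rot) / (g_pos.pdf(x_rot) + g_neg.pdf(x_rot))
  z = rng.binomial(1, p_z1)        # sensitive attribute correlated with y through phi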

a.172 Toy Dataset 2

  • Description: this dataset contains synthetic relevance judgements over pairs of queries and documents that are biased against a minority group. For each query there are 10 candidate documents, 8 from a majority group G_0 and 2 from a minority group G_1. Each document is associated with a two-dimensional feature vector x = (x_1, x_2), with both components sampled uniformly at random from a fixed interval. The relevance of each document is computed from its features and clipped between 0 and 5. Feature x_2 is then corrupted and replaced by zero for documents in group G_1, leading to a biased representation between groups, such that any use of x_2 should lead to unfair rankings (a generation sketch is given after this entry).

  • Affiliation of creators: Cornell University.

  • Domain: N/A.

  • Tasks in fairness literature: fair ranking (singh2019policy; bower2021individually).

  • Data spec: query-document pairs.

  • Sample size: K relevance judgements overs queries with candidate documents.

  • Year: 2019.

  • Sensitive features: N/A.

  • Link: https://github.com/ashudeep/Fair-PGRank

  • Further info: singh2019policy
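
A minimal sketch of this construction follows. The sampling interval, the number of queries, and the relevance function (here the sum of the two features) are illustrative assumptions; the original construction is described in singh2019policy and in the Fair-PGRank repository.

  # Sketch of the Toy Dataset 2 generating process (illustrative choices).
  import numpy as np

  rng = np.random.default_rng(0)
  n_queries = 100

  queries = []
  for _ in range(n_queries):
      group = np.array([0] * 8 + [1] * 2)        # 8 majority (G_0), 2 minority (G_1) documents
      x = rng.uniform(0, 3, size=(10, 2))        # two features per document (interval assumed)
      rel = np.clip(x.sum(axis=1), 0, 5)         # relevance from the features, clipped to [0, 5]
      x_observed = x.copy()
      x_observed[group == 1, 1] = 0.0            # corrupt the second feature for minority documents
      queries.append({"features": x_observed, "relevance": rel, "group": group})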

a.173 Toy Dataset 3

  • Description: this dataset was created to demonstrate undesirable properties of a family of fair classification approaches. Each instance in the dataset is associated with a sensitive attribute z, a target variable y encoding employability, one feature that is important for the problem at hand and correlated with z (work_experience), and a second feature which is unimportant yet also correlated with z (hair_length). The full data generating process is specified in lipton2018does; an illustrative sketch of its structure is given after this entry.

  • Affiliation of creators: Carnegie Mellon University; University of California, San Diego.

  • Domain: N/A.

  • Tasks in fairness literature: fairness evaluation (lipton2018does; black2020fliptest).

  • Data spec: tabular data.

  • Sample size: K points.

  • Year: 2018.

  • Sensitive features: N/A.

  • Link: not available

  • Further info: lipton2018does
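
The structure of the dataset (two features correlated with the sensitive attribute, only one of which drives the target) can be illustrated with the sketch below. All distributions and parameters here are assumptions for illustration; the actual generating process is the one specified in lipton2018does.

  # Illustrative sketch of the Toy Dataset 3 structure (assumed distributions).
  import numpy as np

  rng = np.random.default_rng(0)
  n = 2000

  z = rng.binomial(1, 0.5, size=n)                 # sensitive attribute (e.g. gender)
  # hair_length: correlated with z but irrelevant to employability (assumed Betas).
  hair_length = 35 * rng.beta(2, 2 + 5 * z)
  # work_experience: correlated with z and relevant to employability (assumed Poisson).
  work_experience = rng.poisson(8 + 2 * z)
  # Employability depends on work_experience only (assumed logistic link).
  p_employable = 1 / (1 + np.exp(-(work_experience - 9)))
  y = rng.binomial(1, p_employable)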

a.174 Toy Dataset 4

  • Description: in this toy example, features are generated according to four 2-dimensional isotropic Gaussian distributions with different means and variances. Each of the four distributions corresponds to a different combination of binary label and protected attribute as follows: (1) ; (2) ; (3) ; (4)