From determining which individuals should be detained awaiting trial (Harris and Lofstrom, 2020) to choosing which images are retrieved through Internet search (Noble, 2018), algorithmic systems are increasingly tasked with making a variety of decisions. In response, concerns about the potential biases that these systems reflect and produce have also arisen. As a recent example of this, an English student assessment algorithm was shown to inherit (and possibly even amplify) the English Collegiate system’s bias for students from private schools (Adams and McIntyre, 2020), contributing in part to the waves of student protests that followed its deployment.
As the risk of algorithmic discrimination has increased in recent years, so too has the number of proposed fixes from the field of algorithmic fairness. Countless fairness strategies, metrics, and toolkits have been developed and, more rarely, integrated into algorithmic systems and products. Most of these innovations revolve around measuring, and sometimes mitigating, disparities in treatment and performance across a “sensitive” or “protected” attribute. Typically, this process will require access to data revealing this “sensitive” or “protected” attribute. Using open-access, technical toolkits as an example, the fairness methods in Fairlearn (Microsoft) (Microsoft, 2020), AI Fairness 360 (IBM) (Bellamy et al., 2018), What-if tool (Google) (Wexler et al., 2020), LiFT (LinkedIn) (Vasudevan and Kenthapadi, 2020), Aequitas (Saleiro et al., 2019), and FAT-forensics (Sokol et al., 2020) all require access to demographic attribute data. In many situations, however, information about demographics can be extremely difficult for practitioners to even procure (Holstein et al., 2019; Veale and Binns, 2017). Toolkits that do not require demographic data access generally rely on simulated data (e.g., (D’Amour et al., 2020)) or approach the problem more from the angle of design (e.g., (Barghouti et al., 2020)).
Demographic data stands apart from the other types of data fed into algorithmic systems in that both its collection and use are socially and legally constrained (Andrus et al., 2020). Data protection regulation, such as the EU’s General Data Protection Regulation (GDPR) (European Parliament and Council of European Union, 2016), builds in extra protections for data about certain attributes, deeming it off-limits for many use cases. Perceiving this potential conflict between data availability and making systems less discriminatory, legal scholars have argued that we will likely need regulatory provisions for collecting demographic data to audit algorithmic bias (Žliobaitė and Custers, 2016; Zarsky, 2014; Williams et al., 2018; Tischbirek, 2020).
With an uncertain future of such provisions, computer science researchers have proposed a number of methods and techniques that explicitly try to bypass the collection of demographic data through inference and proxies (e.g., (Romanov et al., 2019; Gupta et al., 2018; Zhang, 2016)), to only use demographic data during model training (e.g., (Kamishima et al., 2011; Zafar et al., 2017; Sattigeri et al., 2019; Ghili et al., 2019)), or to identify and mitigate algorithmic discrimination for computationally identified groups without the use of demographics at all (e.g., (Benthall and Haynes, 2019; Hashimoto et al., 2018; Lahoti et al., 2020)). [Footnote 1: Raghavan et al. (2019) report that many of the hiring companies whose statements they analyzed seem to rely on the training-only approach.] Coming at the problem from a privacy angle, others have sought to make demographic data collection and use more private and less centralized through various combinations of data sanitization, cryptographic privacy, differential privacy, and third-party auditors (e.g., (Hajian et al., 2015; Jagielski et al., 2019; Kilbertus et al., 2018; Kuppam et al., 2020; Veale and Binns, 2017)).
Acknowledging that these alternative methods exist, our goal was to understand how demographic data availability is actually encountered and dealt with in practice. By interviewing a wide range of practitioners dealing with concerns around algorithmic fairness, we strove to characterize the actual impediments to demographic data use as they emerge on the ground. We also assessed a number of concerns practitioners have around how to ensure that their demographic data collection process is responsible and ethical. Though there is some question as to whether algorithmic fairness techniques meaningfully address discrimination and injustice and thereby justify the collection and use of this sensitive data (Barabas, 2019; Hoffmann, 2019; Katell et al., 2020; Selbst et al., 2019), understanding how practitioners in industry think about and address the use of demographic data is an important piece of this debate. By sketching some of the limits of the current practice of algorithmic fairness, we can better assess what interventions around demographic data may be necessary and how they should be applied.
2. Background and Related Research
Past work on the barriers to demographic data procurement and use has largely considered the question of legality. While it is clear that data protection regulations like GDPR are likely to inhibit the procurement, storage, and use of many demographic attributes, it is less clear what types of carve-outs might apply if using the data for fairness purposes (Goodman, 2016). There is also an open question of whether or not anti-discrimination law permits the inclusion of demographic attributes in decision-making processes (Xiang, 2021; Bent, 2020). Bogen et al. (2020) survey the requirements of U.S. anti-discrimination law around the collection and usage of sensitive attribute data in employment, credit, and healthcare, and find that “there are few clear, generally accepted principles about when and why companies should collect sensitive attribute data for anti-discrimination purposes” (Bogen et al., 2020, p. 2).
A more critical branch of scholarship is now interrogating the algorithmic fairness community’s conceptualization of demographic attributes themselves and how that propagates into notions of fairness. Work in this area has addressed race (Hanna et al., 2020; Benthall and Haynes, 2019), gender (Hamidi et al., 2018; Scheuerman et al., 2020; Hu and Kohler-Hausmann, 2020), and disability (Bennett and Keyes, 2020), calling into question what it means to even rely on these categories as a basis for assessing unfairness, and what harms are reproduced by relying on these infrastructures of categorization. Similar to this work, important contributions emerging from various academic fields and activist communities on the issue of justice in data use interrogate when and where it is acceptable to collect what types of data and what degree of control data subjects should have over their data thereafter (Taylor, 2017; Cifor et al., 2019; Milan and Treré, 2019; Rainie et al., 2019; Petty et al., 2018; d4bl, 2020). These lines of work center questions of autonomy and representation around data and its use, going beyond the common legal standards of privacy and anti-discrimination.
Less work has been done, however, in understanding how practitioners confront issues around demographic data procurement and use and if the difficulties they encounter mirror those discussed above. There is a growing literature of work studying fairness practitioners themselves that has focused on the needs of public sector practitioners (Veale et al., 2018), the needs of private sector practitioners (Holstein et al., 2019) and the organizational patterns they fit within (Rakova et al., 2020). These studies lay important groundwork in identifying the problem, but with this paper we aim to provide more texture as to how practitioners themselves think about and navigate this gap, and, in turn, what paths forward would need to address in order to be effective. [Footnote 2: Holstein et al. found, for example, that a majority of their survey respondents “indicated that the availability of tools to support fairness auditing without access to demographics at an individual level would be at least ‘Very’ useful” (Holstein et al., 2019, p. 8).]
Building off of themes from prior work in this area, we scoped an interview study looking at how demographic data procurement and use actually proceeds in practice. We interviewed 38 practitioners from 26 organizations, with 5 participants being the maximum number coming from a single organization. All participants either A) were involved in efforts to detect bias or unfairness in a model or product, B) were familiar with company policies relevant to the usage of demographic data, or C) were familiar with regulations or policies relevant to the collection or use of demographic data.
Most participants were from for-profit tech companies (34 out of 38). A slight majority (55%) of participants were from companies with more than 1,000 employees, with the rest from smaller organizations. 80% of participants worked in the US. Participants held a variety of positions within their organizations (Table 1) and came from a diverse set of sectors (Table 2). For the categories in Table 1, “External Consultant” includes outside auditors and advisors for various issues surrounding system bias and fairness. The “Leadership Role” category is used to describe participants overseeing the work of larger teams (e.g., “Director/Head of X”). “Tech Contributor” is an umbrella term for roles such as software engineers, engineering managers, and data scientists.
Table 1: Participant positions
|Position (abbreviation)|Participants|
|---|---|
|External Consultant (EC)|4|
|Leadership Role (LR)|9|
|Product/Program/Project Manager (PM)|7|
|Tech Contributor (TC)|10|
Table 2: Participant sectors
|Sector|Participant IDs|Participants|
|---|---|---|
|Ad Tech|LP1, LP2|2|
|Finance|LR2, LR3, TC3, TC4|4|
|Healthcare|EC1, LR8, PM5, TC7|4|
|Hiring/HR|LR1, LR4, LR7, PM7, TC10|5|
|Social|LP3, LP4, LR6, PM6, TC5, TC9|6|
|Telecom|LR9, PM1, TC1|3|
|Other|EC2, EC3, EC4, R4, TC8|5|
We employed a variety of recruitment methods. Firstly, we identified individuals in the authors’ professional networks that would most likely have experience with attempting to implement algorithmic fairness techniques. These contacts were asked to participate themselves and were also encouraged to share our call for participants with relevant individuals in their own network. Similarly, we distributed an interest form and study description to various organizational mailing lists. In order to broaden the search beyond our established networks, we also carried out searches for various participant archetypes through the LinkedIn Recruiter service (LinkedIn, 2020). As suggested by Maramwidze-Merrison (2016), LinkedIn can be a prime mechanism for identifying and contacting specialists that would otherwise not be accessible to the researcher. By using a Recruiter account, we were able to send requests for participation to individuals with greater than two degrees of separation from the account-holder. Given the typical use case of this medium, however, we clearly stated that we were not reaching out with an employment opportunity. In addition to these methods, we also conducted searches for news articles and organizational publications on topics related to algorithmic fairness and bias auditing in order to identify teams and authors to directly reach out to. Finally, following each interview we asked participants to refer any relevant contacts. With all of these recruitment methods, we attempted to sample participants from industries with varying degrees of regulation, as we expected this to be an important axis of analysis based on the existing literature.
Participation in the interviews involved one 60-75 minute video call. In some cases, we agreed to have multiple interviewees from the same organization participate in the same call if they so requested. Before the call, we shared our consent and confidentiality practices with participants and then obtained verbal consent before starting the interview. We transcribed the audio recordings, and deleted them after transcription. We redacted all findings to maintain the anonymity of participants and their institutions.
Since the term “demographic data” may mean different things to different people, we provided the following definition at the beginning of the interviews. This was intended as a starting point, but we also invited participants to speak on other related topics.
Demographic data includes most regulated categories (e.g., “protected classes” and “sensitive attributes” like sex, race, national origin) as well as other less-protected classes (e.g., socioeconomic class and geography). Demographic data can even include data that might be used as proxies for these variables, such as likes or dislikes.
The interview questions focused on asking participants to walk us through examples of times when they had participated in successful or unsuccessful attempts to use demographic data for bias assessments. We probed for details about what data they wanted to use and what their intended purpose was. We asked about the availability or obtainability of the desired data. And we asked followup questions about what types of approvals or interactions with other teams were required for the collection or usage of this type of data.
In cases where the participant was in a legal/policy role, the questions were similar to the above, but framed from the vantage point of someone reviewing a request for data access or usage. In these interviews, we also asked additional questions about what legal or other risks they take into account when making decisions. In cases where participants were identified based on press reporting or organizational publications on algorithmic fairness issues, we sometimes included more tailored questions about the content of the publication.
After hearing about more specific cases, we also asked for more general reflections from participants on the trade-offs involved in this type of bias assessment work — e.g., “How do you balance user privacy concerns with the need for demographic data?” In addition, we heard some insightful reflections in response to more forward-looking questions such as “In your ideal world, what would bias assessment look like for you?”
To analyze the interviews, we used open and closed coding in MaxQDA. Open-coding was used to parse and summarize an initial selection of 25% of the interviews. Open codes were then iteratively grouped and synthesized using thematic networks (Attride-Stirling, 2001), with themes organized around either the challenges with or explicit constraints on demographic data use. These themes and their subthemes then informed a set of closed codes that were applied and expanded upon over the corpus of transcripts. This paper reflects our preliminary findings from this stage of the analysis.
4. Results Overview
Before delving into emergent themes, it is important to first note some of the broad trends we heard concerning the availability of demographic data. Almost every participant described access to demographic data as a significant barrier to implementing various fairness techniques, a result that mirrors the responses from Holstein et al. (2019). In terms of accessible data, virtually all companies that collected or commissioned their own data had access to gender and age and frequently used that data to detect bias. These categories of demographic data were regarded as entailing much less sensitivity and privacy risk than other categories. Outside of employment, healthcare, and the occasional financial service contexts, few practitioners had direct access to data on race, so most did not try to assess their systems and products for racial bias. When attempts were made, it was generally through proxies or inferred values, but such methods were rarely deployed in practice (conforming to previous findings by Holstein et al. (2019) for demographic attributes more generally). Finally, practitioners in business-to-business (B2B) companies and external consultants generally had very limited access to sensitive data of any kind.
We organize the themes uncovered through our analysis into two types. The first type are the regulatory and organizational constraints on demographic data procurement and use, which are generally maintained by a network of actors outside the practitioner’s team (e.g., legal and compliance teams, policy teams, company leadership, external auditors, and regulators). The second type are concerns surrounding demographic data procurement and use that the practitioners themselves surfaced or encountered during the course of their work. Having made this distinction, however, it is important to note that it was not uncommon for the constraints in the first category to inform the concerns of practitioners or, vice versa, for the concerns of practitioners to feed into organizational policy.
5. Constraints on Demographic Data Procurement and Use That Practitioners Must Work Within
Looking first to the network of constraints surrounding the procurement and use of demographic data, we discuss three major factors at play. The first, and perhaps most obvious, are the various regimes of privacy laws and policies. Second, we consider the role of anti-discrimination laws and policies in determining what types of demographic data get collected. Finally, we consider a suite of potential restrictions that stem from organizational priorities and practices.
5.1. Privacy Laws and Policies
GDPR-related notions like data-subject consent and the Privacy By Design standard of only collecting required data seemed to primarily guide legal, policy, and privacy teams’ handling of data requests, despite most participants’ organizations being based in the U.S. As described by LP1, a legal practitioner in the Ad Tech space:
“The general framework is that the rules apply to data about people. And then, one axis on there is ‘what does it mean that the data is about a person.’ [T]he old school way of looking at that was [whether the data was] identifiable, which meant it had their name or contact information, email address. The newer GDPR-like way to look at it is that even pseudonymous data — [i.e.,] I’m not [First Name, Last Name], I’m ABC789 (or my computer or my browser is) — even pseudonymous data is regulated.”
5.1.1. Consent and Sensitive Data
For this work we left our discussions of “demographic data” open to any membership category of data relevant to our participants. Generally speaking, however, practitioners pointed to the categories defined by law as protected or sensitive. An important distinction that emerged for organizations with operations in Europe was between categories classified as “special” by GDPR and those that are not. Most notably, gender, age, and region, three categories not considered “special,” were available to most of the practitioners we spoke to, whereas race, a “special” category, was not. This is significant because race was seen to be a very salient axis of analysis for every domain, yet attempting to procure data on race was rare. The absence of this data was keenly felt by practitioners attempting to show the failure modes of various systems, as illustrated by TC9, a product manager in the social media domain: “Every time we do one of these assessments, people ask, ‘Where’s race?’”
While there are likely other motivations for some companies to not actively collect “special category data,” GDPR has raised the bar for private organizations that wish to use this data for the purposes of bias detection or fairness assessments. Organizations seeking to collect this type of data in Europe must now acquire freely given consent from all data subjects or make clear claims about how the data use aligns with another special category carve-out listed in Article 9 of GDPR (European Parliament and Council of European Union, 2016). Though these requirements are not necessarily prohibitive, participants discussed how the general practice is to just avoid the risks of wide-scale special category data collection altogether. In the rare cases where this data was procurable, it was generally through government datasets, open-sourced data with self-identification, or by following explicitly applicable regulation and standards. There were also some reported cases where racial data was collected for user studies and experimentation, but this was done with explicit consent and compensation, as well as strict legal review. The only domain in which we saw the attempted collection of racial data at a global scale was for hiring and HR. Though not directly addressed in our interviews, the carve-outs for this type of data are a bit more explicit and tend to defer to employment law. [Footnote 3: See, e.g., the German Federal Data Protection Act Section 26 (3) (German Bundestag, 2017).]
5.1.2. Don’t Collect More Than What You Need
Beyond the actual legal risks incurred by potential GDPR violations, we also heard from practitioners that it is “the combination of GDPR and the ethos that GDPR lives in, which is an ethos of tech fear and nervousness and lack of trust” (PM3) that can lead to inaction. When public trust in data use is already low, the Privacy By Design standard of not collecting more data than you need can very easily take precedence over the more extended and drawn out process of ensuring product fairness. Describing an experience voiced to us by multiple participants, TC5 talked of how “for the collection of new data, there’s deep conversations with policy [and] legal, and then we need to loop in external stakeholders [to ask], ‘does the need to be able to measure this seem to sufficiently balance out the user concerns for privacy?’” In many of our interviews, we heard practitioners recall cases where data requests would be stifled by this question of necessity.
Interestingly, practitioners reported that inferring the demographic data they need is similarly frowned upon when the need revolves around detecting bias, fairness, or discrimination. One potential reason for this is the increasing prevalence of data access request rights through regulation such as GDPR and the California Consumer Privacy Act. LP4 discussed how with “data access requests, it’s still an open question under certain laws whether inferred data is user data and [thus] needs to be shared or not. And so [a] concern around that [is], ‘are people aware that this might be happening?’” In this way, the potential for forced revelation of inferred data might be encouraging organizations to be more conservative about what they infer. All of this considered, the line where the fear of violating GDPR and other data protection law ends and where the desire to not collect more data than you need begins is a hazy one at best, but both factors seem to play a role in deeming certain types of data too sensitive to collect.
5.1.3. Data Sharing
One final difficulty often presented by data privacy regulation in this space is the inaccessibility of demographic attribute data for outside auditors or consultants as well as for Business-to-Business (B2B) vendors of algorithmic tools. Though we did hear of cases where external auditors were brought into an organization and given pseudonymized access to data, there are many cases where auditors, consultants, and vendors had to build their models or provide recommendations without ever seeing the data. For practitioners in this kind of situation, some spoke of the importance of publicly available datasets drawn from domains similar to the organizations they were assisting. By combining this external data with API access or knowledge of the infrastructures in use, these practitioners noted that they could point to specific, high-risk failure modes without needing to risk any legal privacy violations. In one unique case, a practitioner opted to forego the use of public/private data altogether, instead making the case for “pathways” of discrimination in conceptual Structural Causal Models (SCMs) built in concert with subject matter experts and other employees from the organization they were assisting. The SCM was used to inform the client of potential sources of bias in their data.
5.2. Anti-Discrimination Laws and Policies
Most often, the closest legal and policy teams come to thinking about fairness is through anti-discrimination policy. For less regulated domains, there is sometimes a tension between organizations’ privacy and anti-discrimination policies, with privacy policies discouraging the collection of demographic data and anti-discrimination policies encouraging it. Given that corporate anti-discrimination policies around data or algorithmic products are often not well-established, privacy usually wins out. For this section, we first consider a few domains where we saw clear anti-discrimination standards interact with privacy requirements, and then move to looking at how this interaction plays out when liability for discrimination is less clear.
Within the financial industry in the United States, as described by Bogen et al. (2020), products vary widely in their standards and requirements for demographic data collection. One of the most notable distinctions to draw is between mortgage products and credit-based products. For the former, lenders are required by law to collect demographic attribute information, while for the latter they are nearly barred from doing so. In both of these cases, however, financial institutions are held to anti-discrimination standards by various watchdog institutions, such as the Consumer Financial Protection Bureau (CFPB). Interestingly, we heard in our interviews that a number of financial institutions have taken their cue from the CFPB to infer racial categories through the Bayesian Improved Surname Geocoding (BISG) (Elliott et al., 2009) method when the data is either sparse or not collected. That being said, the machine learning practitioners we talked to in this industry reported never having direct access to these attributes, inferred or otherwise. Compliance teams with special data access will test models after they have been produced, sending them back if they are found to be discriminatory. When discussing this arrangement, TC4 suggested that “they kind of want the whole testing function to be a black box,” likely such that no intentional circumvention of compliance tests can occur. This can have the unfortunate effect, however, of leaving engineering teams in the dark about the impacts of their models. For some forms of discrimination, however, clear regulation does not necessarily lead to measurement and mitigation. On the issue of regulations prohibiting financial service discrimination based on sexuality, LR3 noted that, “in practice, the regulators understand that data for certain classes are not easy to obtain and so they don’t make any enforcement. […] So it’s one of those things where when the rubber hits the road, you have to figure out, is this actually something you can do?” Even for classes of data that are possibly easier to obtain, financial institutions can still be resistant to their collection depending on how anti-discrimination is enforced. As described in a congressional testimony given by the GAO (Williams, 2008), collecting demographic data actually makes financial institutions liable for fixing uncovered disparities, incentivizing inaction.
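As a rough illustration of the kind of inference BISG performs, the sketch below applies Bayes' rule to combine a surname-based probability with a geography-based likelihood. It is a minimal sketch under stated assumptions, not the CFPB's or any institution's implementation: the function name `bisg_posterior`, the group labels, and all of the numbers are hypothetical, and it assumes that surname and geography are conditionally independent given group membership.

```python
def bisg_posterior(p_group_given_surname, p_geo_given_group):
    """Combine a surname-based prior with a geography likelihood via Bayes' rule.

    p_group_given_surname: dict mapping group -> P(group | surname)
    p_geo_given_group:     dict mapping group -> P(geography | group)
    Returns a dict mapping group -> P(group | surname, geography).
    """
    # Numerator of Bayes' rule for each group.
    unnormalized = {
        group: p_group_given_surname[group] * p_geo_given_group[group]
        for group in p_group_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no group has nonzero probability for this input")
    # Normalize so the posterior probabilities sum to one.
    return {group: value / total for group, value in unnormalized.items()}


# Hypothetical, made-up inputs for a single individual.
prior = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}
likelihood = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.002}

posterior = bisg_posterior(prior, likelihood)
print(posterior)
```

Note that the output is a probability distribution over groups rather than a label, which is part of why inferred attributes raise the accuracy and data-access questions discussed in this section.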
In the Hiring/HR domain, there are generally clearer standards to defer to. In the U.S., practitioners mentioned how the demographic categories on EEO-1 forms (gender and race/ethnicity) were often made available to them such that they could ensure their models’ adherence to the “80% rule.” [Footnote 4: The EEOC’s adverse impact rule of thumb, under which a selection rate for one group that is less than four-fifths that of another may point to deeper issues (Commission and others, 1979).] Difficulties can arise, however, when the organization’s standard practices for bias and discrimination assessment go beyond the expectations, requirements, and norms for doing so in the country of business. When discussing doing business in a country outside their own, PM7 described how, “We talked about demographics and they were like, ‘This is not something that is important to us.’ […] It becomes a real challenge to get customers even interested in why we do [model debiasing].” As a result of these types of concerns, participants discussed bringing on employment anti-discrimination experts specializing in the various regions they operate in. In terms of how procured demographic data often gets used, teams doing model debiasing will rely on specially collected and curated datasets, generally from client organizations, that have demographic attributes of reliable accuracy. Typically, this type of high-quality data is only used for the purposes of reducing discriminatory effects before deployment, as there are too many practical and legal constraints on including demographic attribute data in deployment.
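To make the “80% rule” concrete, the following sketch computes each group's selection rate, divides it by the highest group's rate, and flags any group whose ratio falls below four-fifths. It is an illustrative sketch only: the function names, group labels, and counts are hypothetical and are not drawn from our interviews, and a real assessment would involve additional statistical and legal judgment.

```python
def adverse_impact_ratios(outcomes):
    """Return each group's selection rate divided by the highest selection rate.

    outcomes: dict mapping group -> (number selected, number of applicants)
    """
    rates = {
        group: selected / applicants
        for group, (selected, applicants) in outcomes.items()
    }
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any selections; ratios are undefined")
    return {group: rate / highest for group, rate in rates.items()}


def four_fifths_flags(outcomes, threshold=0.8):
    """Flag groups whose impact ratio is below the threshold (default four-fifths)."""
    ratios = adverse_impact_ratios(outcomes)
    return {group: ratio < threshold for group, ratio in ratios.items()}


# Hypothetical counts: (selected, applicants) per group.
outcomes = {"group_x": (48, 80), "group_y": (12, 30)}

ratios = adverse_impact_ratios(outcomes)
flags = four_fifths_flags(outcomes)
print(ratios)
print(flags)
```

In this made-up example the selection rates are 0.6 and 0.4, so the second group's ratio is about 0.67 and it is flagged. Computing the check requires group membership for every applicant, which is why access to the EEO-1 categories matters for practitioners in this domain.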
Healthcare was the third and final domain where we spoke to participants that pointed to explicit anti-discrimination policies. Though not necessarily codified in law, professionals from one organization noted that they try to strictly adhere to the Institute of Medicine’s Six Dimensions of Health Care Quality (Institute of Medicine (US) Committee on Quality of Health Care in America, 2001), one of which is health equity. This establishes a clear alignment between the goals of their organizations and the push for anti-discrimination in algorithmic systems. Interestingly, this was also the main domain in which practitioners saw the potential for algorithmic fairness as extending beyond interventions to the algorithm or product. While in other domains societal bias could sometimes be seen as beyond the scope of the practitioners’ organization, practitioners in healthcare reported using their trained models to inform interventions anywhere in the patient treatment pipeline. In other words, uncovered discrimination or unfairness did not necessarily have to stem from institutional practices or models to warrant addressing. Instead, as has been suggested elsewhere (Barabas et al., 2018; Andrus and Gilbert, 2019), algorithmic fairness assessments could be used as a type of diagnostic tool for the whole network of actors and interactions, identifying salient disparities in intake, treatment, and outcome and pointing to possible causes for these disparities. As such, demographic attributes were often procured without much resistance, and these attributes would be included in both model training and model deployment as a key means for identifying inequalities.
In cases outside of these domains, we saw an abundance of uncertainty that often led to risk-aversion and inaction. One of the biggest constraints participants described was that even without clear anti-discrimination policies, there were still fears that procuring demographic data and using it to uncover discrimination without a clear plan to mitigate it (either because doing so might not align with the organization’s priorities or because there are no clear answers for how to resolve the issues) would open the door to legal liability. As PM4 described, “I think there’s always this sort of balancing test that the lawyers are applying, which is weighing whether the benefits of being proactive about testing for bias outweigh the risks associated with this leaking or this becoming scooped up in some sort of litigation or something like that.” Furthermore, without clear guidance on what policies to employ around product-level discrimination, fairness remains a “nice to have” while there are sometimes whole teams dedicated to privacy compliance. Participants also expressed the sentiment that unless an organization is distinguishing itself by its commitment to fairness (however defined) or directly responding to outside pressure, it is often easier to just not collect the data and not call attention to any possible algorithmic discrimination.
5.3. Organizational Priorities and Structures
As in any corporate setting, practitioners working on algorithmic fairness face constraints set by organizational structures, risk aversion, and the drive for profit. Usually there are provisions for responsible AI and fair ML practitioners to do their work unbeholden to the short-term goals of other product teams, but participants still described constraints imposed by organizational priorities and practices. It is important to note that many of these constraints seem to exist on a spectrum of severity, largely dictated by the amount of buy-in and commitment from both leadership and the organization’s clientele to ensuring algorithmic fairness. [Footnote 5: See (Rakova et al., 2020) for a deeper look at the array of difficulties across varying degrees of organizational support that Responsible AI and ML Fairness teams face.]
Starting with issues arising from tenuous fairness commitments, multiple practitioners mentioned having to justify their bias analyses and calls for more or better data with expected improvements to key performance indicators (KPIs). In these cases, it was not uncommon for practitioners to find that their proposed interventions were orthogonal to certain KPIs, such that they had to make more qualitative arguments about how different groups would be better served by these interventions. If this approach failed, the work was typically abandoned or at least shelved. A similar constraint came from practitioners reporting that at times the cost of gathering more detailed or more representative data simply came with too high of a price tag and too long of a lead time to be deemed worthwhile.
Another significant concern we heard was that often fairness and responsibility teams do not have clear pathways for pushing recommendations into practice. Other teams’ existing processes around data use and procurement are generally already well-established, and so it can be very difficult to figure out both how to plug in operationally as well as how much those processes can be overwritten or overloaded in the pursuit of fairness.
A final constraint on this front is the need to avoid bad PR stemming from demographic data collection. Generally, undertaking fairness/ethics projects results in at least neutral press about the initiative, so long as it appears to be in good faith. Attempts to collect demographic data, however, are likely to be seen as just an extension of the frequently reported trends of data misappropriation and data misuse (Seneviratne, 2019). Specifically addressing the issue of racial data collection at her organization, PM2 discussed how, from the public’s perspective, such data collection would be viewed with suspicion: “[It’s] just adding one more thing. [The public might say,] ‘now you’re asking me if I’m White or Black, why would you need that information?’ Right? ‘You’re already in trouble with data. How can I trust you with this?’ So I think it was an optics thing, like, we can’t even ask that. It’s kind of off the table.” The calculus can change dramatically, however, in cases where there is already a lot of public attention around a company’s biased services. As described by EC3, “public interest and attention and focus – it actually puts these companies at a tipping point where they can’t just do nothing, because even though they can say they’re complying with all the laws, clearly the public doesn’t feel like that’s adequate.” In cases such as these, participants reported that their companies would start up limited and measured data collection processes to address, at least in part, the source of public concern.
6. Concerns held by Algorithmic Fairness Practitioners
Beyond business, legal, and reputational pressures from other parts of the organization, we see a number of sources of caution surrounding the procurement and subsequent use of demographic data on the ML fairness practitioner side.
6.1. Concerns Around Self-Reporting
Regarding the actual process of collecting demographic data, practitioners reported a number of concerns. Perhaps the most salient of these were the many reasons why self-reported demographic data can be unreliable or incomplete. In most cases, there are not strong incentives for individuals to respond to requests for their demographic data. Outside of government surveys and forms, participants noted that organizations might need to incentivize individuals to provide reliable data. [Footnote 6: This is by no means a new problem; social scientists have long been studying ways to increase survey response rates, as well as the biases associated with non-response (Marquis et al., 1986).] In the case of the American financial industry, one participant noted that although they are permitted by federal law to collect demographic data through optional surveys, the response rates are incredibly low. While it might have been possible to improve the response rate by making the intended use for the data more clear (Crotty, 2020), this participant noted that more reliably collecting demographic data could actually increase the liability of the institution, as discussed in 5.2. As such, the sparse survey data was only used for validating data that had been compiled and inferred via other sources.
There are many possible reasons for why a survey might get a low response rate, but one often-repeated reason was the matter of trust. LR6 discussed how when their team was working on improving search diversity for their social media platform, they surveyed users asking, “would you want to give us more demographic information?” LR6 described how “the consensus was, ‘no, [if you collect this data,] then what are you not showing me? […] Who are you to decide what I would like just because I am this person?’” In cases where individuals are wary of the reasons behind or potential outcomes of demographic data collection, it is reasonable to expect that the response rate and response accuracy are likely to be very poor.
Within the healthcare domain, practitioners were acutely aware of the types of difficulties associated with accurate data collection, as the systems they produce can have direct consequences on life chances and there is a high bar for data quality. As LR8 described, “We think about data collection as data donation, it’s like donating blood. It takes effort, someone’s got to do it, right? We don’t have [an] automated way to collect all this information, so we have to think carefully about the most effective and efficient ways to collect […] most of the data that we collect.” While it is likely to be difficult to transfer this style of thinking to other domains, posing data collection for anti-discrimination and fairness assessments as something akin to “data donation” might be a useful framing for increasing buy-in and response rates.
On top of data reliability concerns, how organizations would even go about the process of collecting demographic data is unclear. EC4, an expert who assists platform companies trying to deal with potential bias or discrimination, discussed how an organization might “want to do a disparate impact [analysis] on gender” when “[they] don’t have gender.” EC4 asked, “Are [they] going to throw up a [pop-up] and say, ‘Hey, tell us about your gender identity,’ to their whole user base? There’s some cost to doing that.” Due to the range of potential public responses and PR risks as described in 5.3, the reputational cost could even exceed the expense of designing and deploying such a pop-up. So, as EC4 put it, whether “any company is ever going to do that to a user base of millions of people that are already on their service” remains an open question.
Furthermore, some practitioners discussed how even reliably collected demographic data can lead to issues down the line. Most often, demographic data is self-reported by selecting only one answer from a set of predetermined options, which may or may not include an “Other” category. What individuals signify by selecting one of these options, however, is not always uniform. TC9 outlined a number of the potential concerns that can arise here surrounding gender:
“And I think one thing that I’m very conscious of is that anything can be misinterpreted. […] So for example, we have four gender options, ‘male,’ ‘female,’ ‘unknown,’ and ‘custom gender’, where ‘custom gender’ is an opt-in, non-binary gender that a user can add a specific string for. ‘Unknown’ is you haven’t provided it or you’ve explicitly said you don’t want to tell us your gender. It’s a whole bunch of things. And even though this is not inferred data, I’m still very, very careful whenever I talk about gender analyses, to be really clear about what exactly we’re seeing here, because ‘custom gender’ does not mean non-binary, ‘unknown’ gender does not mean the person opted into not giving it to us.”
Similarly, other practitioners described exercising caution around the demographic category data they do have access to, acknowledging the myriad identities that can fit within a single box on a form.
6.2. Concerns around Proxies and Inference
Where practitioners’ concerns around the explicit collection of demographic data mirrored some of the more well-established pitfalls and risks of survey design, we see a rather different set of concerns around the use of proxies and inferred demographic data. For demographic categories that are unlikely to get the collection go-ahead from various oversight teams (likely for the reasons given in Section 5), we sometimes see practitioners infer attributes directly using available data or look at other available categories of data that have an implicit correlation with attributes of interest.
When asked whether, and if so how, inferred demographics were used in any of their fairness analyses, we heard a wide range of responses from practitioners. Participants from U.S. financial institutions noted that they follow the standard set by the Consumer Financial Protection Bureau in using Bayesian Improved Surname Geocoding (BISG) to infer individual race (Bureau, 2014). Participants from domains where demographic data collection is mandated, such as the HR/Hiring space, on the other hand, balked at the question, as exemplified by the response of LR4: “We both A) don’t have a reason to do that in the line of what we’re doing, B) would be deeply opposed to doing that in this context. […] Fascinating, I would want to know who else would do that in what context and why you would ever do that?”
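To make concrete what the BISG approach mentioned above involves, the following is a minimal sketch of its core Bayesian update, assuming the standard formulation in which the surname-based probability of each category is multiplied by the share of that category's population living in the individual's geographic area and the result is renormalized. The function name, the category labels, and all numbers are hypothetical placeholders for illustration; they are not real census figures and this is not any institution's production implementation.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    posterior(r) is proportional to P(r | surname) * P(geo | r),
    which assumes surname and geography are independent given r.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the evidence rules out every category")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical toy inputs (NOT real census data).
surname_prior = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}
geo_likelihood = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.002}

posterior = bisg_posterior(surname_prior, geo_likelihood)
for race, probability in posterior.items():
    print(f"{race}: {probability:.2f}")
```

Note that the output is a probability distribution over categories rather than a single label, which is one reason such estimates are arguably better suited to aggregate analyses than to classifying individuals.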
Responses from participants in other domains were situated between these two extremes. Oftentimes inference is the only available option for conducting discrimination or bias assessments, but there are few standardized practices for doing so. Of the participants that discussed using demographic inference, it was largely for content data, such as in skin-tone classification for images or author gender prediction for pieces of text. Inferring demographics in other cases, while perhaps feasible, was seen as introducing privacy risks as well as dignity concerns to any fairness evaluations conducted with them. These concerns were especially heightened around sensitive attributes. As described by TC5, “In contexts of race or sexual orientation, or other really sensitive groups, we don’t view inferring attributes without fully informed user consent as a path forward. We think that all of the concerns are around whether [our organization] has the data. And so, if we were to try to pursue [acquiring the data], we would need to do it explicitly and, you know, with fully informed consent.” Other practitioners suggested that when inference is done at the group level (i.e. “X percent of this dataset is women”), that might resolve some of the consent and dignity concerns of explicitly classifying individuals.
Although inferred demographics are generally used in contexts where self-reported demographics are not available, an interesting dilemma arises in cases where inferred demographics are more suitable for the task. As pointed to by various scholars (Hanna et al., 2020; Greiner and Rubin, 2010), for some types of fairness assessment observed race makes sense to use (e.g., when the relevant question is whether someone was discriminated against based on others’ perception of their race), whereas for others it might be more suitable to use self-identified race (e.g., for measuring discrimination in governmental policies). While the two concepts may appear facially similar, the measurement regimes required to accurately collect each are very different. For instance, EC3 discussed how on one of the systems they worked on, there were potential issues based on how users perceived other users’ races: “So you’re doing perceived race. Then the question is, so if I’m a tech company and I want to study this racial experience gap, right, should I go around guessing everybody’s races?” While having labelers or an algorithm guess someone’s race is perhaps the best way to operationalize the concept of perceived race, it opens the door to subsequent backlash given the sensitivity of these labels.
A unique set of concerns arises for proxies that are not explicitly treated as such. In some cases, when a salient attribute is inaccessible, practitioners use other available attributes to obtain some signal about potential discrimination or unfairness surrounding that more important attribute (e.g., using a subscription tier-level to point towards potential problems across socio-economic status). In cases such as these, practitioners were very explicit that these proxies are not meant to be treated as the attribute itself: TC9 explained, “We’re very careful to not draw further conclusions, because we don’t want to make assumptions or directly infer [the attribute of interest], basically.” Instead these attributes are treated as a very rough signal, where if you notice large disparities, you should treat it as a call for deeper inspection. Although the algorithmic fairness literature has frequently called out the dangers of using proxies for salient attributes or target variables in fairness analyses (Barocas and Selbst, 2016; Corbett-Davies and Goel, 2018; Jacobs and Wallach, 2019), this is often the only option practitioners have to make a quantitative argument for escalation and deeper analysis. The real risk that we see arising with this practice is when/if the use of the proxy becomes sufficiently normalized such that practitioners no longer reflect on its inadequacies.
6.3. Relevance of Demographic Categories
Most types of fairness analysis require practitioners to identify and focus on discrete groups, but there is often discomfort and uncertainty around what groups warrant attention and how those groups should be defined. These concerns were seen to be especially salient within organizations and teams that had to build and maintain products and services for regions outside of their own.
Looking first to the determination of demographic categories, numerous practitioners discussed the inadequacy of the standard categories. This was especially a concern in industries where demographic data is collected according to governmental standards. PM7, a product manager in the hiring/HR domain, pointed to how they were actively “finding ways to gather more data outside of the employment law data,” as “the categories were defined back in the sixties and are challenging these days. […] You only get six boxes for your race. There’s two boxes for gender. You know, it just doesn’t work for a lot of people.” For other domains, the established categories were sometimes used out of a sense of tradition or simply because it would be hard to change them. Adhering to these inapt categorization schemas can be an impediment to meaningfully understanding the types of bias present, however, as the choice of demographic categories can lead to dramatically disparate analyses and conclusions (Howell and Emerson, 2017; Hanna et al., 2020).
A common response to this problem was to try to make the categories representative, but practitioners were generally unsure of where to draw the line. R3, who was tasked with expanding the demographic categories and variables of concern used by their organization, said of this issue: “You can go on and on and create a list that’s like 20 or even a hundred long, but we need to set a limit, […] like what’s the reasonable limit on how many categories you want to put in such a question and what we want to recommend internally to like product teams to at least sample across.” A common concern here is that you start to lose statistical significance as the groups get smaller. Concerning asking more granular demographic questions, a practitioner from the Healthcare domain (LR8) asked, “how should we capture […] relatively low frequency answers to those questions? […] All of a sudden you become an N of 1 and you don’t fit into any category and we can’t figure out how to help you. Whereas, if we could have lumped you into another category, might we have been able to do better?”
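The statistical concern raised here can be illustrated with a rough sketch. Assuming a simple normal approximation for a group's measured rate (for example, an error rate), the margin of error around that rate grows as the group shrinks. The function name and the sample sizes below are illustrative assumptions, and the approximation itself becomes unreliable for very small groups, which is part of the problem practitioners described.

```python
import math


def margin_of_error(rate, n, z=1.96):
    """Approximate 95% margin of error for a proportion measured on n people.

    Uses the normal approximation z * sqrt(rate * (1 - rate) / n),
    which is only a rough guide when n is small.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(rate * (1 - rate) / n)


# Hypothetical group sizes: the same observed rate becomes far less
# informative as a category is split into smaller and smaller groups.
for group_size in (10_000, 1_000, 100, 10):
    moe = margin_of_error(0.5, group_size)
    print(f"n={group_size:>6}: rate 0.50 +/- {moe:.3f}")
```

Under these assumptions, a disparity of a few percentage points is detectable in a group of thousands but indistinguishable from noise in a group of ten.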
Whether or not the demographic categories are expanded before data collection, there still remains the question of what the most relevant categories or combinations of categories are to consider for evaluation and deeper inspection. PM4 said of this problem, “[this] step of [asking], ‘what demographic slices are most relevant,’ that’s very difficult to get because almost nobody feels like they have the authority to say, ‘these are the ones that matter.’” While it might not require much extra effort or resourcing to assess bias across all available attributes, practitioners’ teams were also generally responsible for informing algorithmic, design, and policy changes. On top of this, several practitioners recounted needing to find bias issues with groups at the intersection of multiple attributes as well (Crenshaw, 1989; Hoffmann, 2019). Because considering all possible intersections of demographic attributes is generally not practical for a small team (e.g., if you have five attributes, each with four options, then there are 4^5 = 1,024 total groups), some practitioners had used written feedback from clients and users to help identify where there might be gaps.
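The combinatorial growth behind this concern is easy to verify. The sketch below enumerates every full intersection for five attributes with four options each; the attribute and option names are hypothetical placeholders.

```python
from itertools import product
from math import prod

# Hypothetical schema: five attributes, each with four options.
attributes = {
    f"attribute_{i}": [f"option_{j}" for j in range(4)] for i in range(5)
}

# The number of full intersections is the product of the option counts.
n_groups = prod(len(options) for options in attributes.values())

# Enumerating the groups explicitly yields one tuple per intersection.
groups = list(product(*attributes.values()))

print(n_groups, len(groups))
```

Adding a sixth four-option attribute would quadruple the count again, so even a modest schema quickly produces more groups than a small team can inspect, and many of those groups will contain very few people.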
Arguably the most straightforward approach to this problem is to just defer to legal or policy requirements around what groups to consider, but they often do not map onto practitioners’ views of what is ethically or culturally salient. R1 notes that this can become a significant impediment to making systems more fair in practice: “I’d say that the main conflict is where we’ll do some testing on a product and reveal issues. And we’ll say ‘look, this, it doesn’t perform on this group.’ Or, ‘it makes these failures.’ And then the policy team will say [something] like, ‘well, we don’t have a specific policy about this, so why are you testing for it?’” When these conflicts arise between the types of categories practitioners think are most salient and the categories that their organizations expect them to assess bias for, a substantial minority of participants reported having to just shelve their work on those categories.
An additional concern reported by practitioners was that their companies’ awareness of what categories are legally salient is often based on the laws of the region in which the company is located. Concerning the general focus on “race, ethnicity, gender, and maybe sexual orientation” at their organization, PM4 remarked, “I often think that the way we approach this is very U.S. centric, that our understanding of what demographic characteristics would be relevant in other countries and cultures is pretty rudimentary.” Some organizations try to ameliorate this concern by introducing subject matter experts with understandings of regional and cultural standards, but this is often only if there are explicit anti-discrimination regimes they have to work within. In reference to their algorithmic HR support tools, LR1 said, “The main problem is local employment law. We don’t have knowledge of local employment law, and every country is different. And so we always work with what we call a HR business partner, who is able to bring in labor experts if needed.”
Finally, when the relevant data is just not available for testing, it may lead to a default focus on whatever category data is on hand. When discussing ensuring dataset diversity beyond gender, PM1 stated that “it’s a complex problem because we are not allowed to collect such [demographic] data. So that’s why gender is prioritized, because we can do something about that.” Similar to the reliance on loose proxies, this issue of focusing on categories due to availability and not more principled reasons was a common concern relayed to us by participants.
6.4. Ensuring Trustworthiness and Alignment In Data Procurement
When conducting fairness work, the intended purpose is usually to make systems work better for people. As such, there is possibly an inherent alignment between model designer and data subject. This relationship can be difficult to bear out in practice, however, due to both low public trust in technology companies and a lack of organizational incentivization for practitioners.
On the issue of trust, some practitioners in less regulated industries expressed concerns about whether their organizations could take the steps to meaningfully garner public trust around demographic data collection. Given the PR-cycles around data misuse and mishandling (Lapowsky, 2019), practitioners expressed awareness that users and clients simply do not believe their data is going to be used in a way they have control over or for their benefit (Auxier et al., 2019). We even heard from a few practitioners that they were not confident themselves that data collected for bias, fairness, or discrimination assessments would be sufficiently sheltered from other uses. One potential path forward here came from external auditors and consultants. EC4 noted that such external entities could require organizations to make enforceable, public commitments to narrowly use specially collected data only for fairness and anti-discrimination purposes in order to surmount concerns around misuse.
Some participants did, however, express hopes that their organizations might be moving in a direction of making more clear what data they would like to collect and why. A key element of trust-building, as one practitioner pointed out, is that the relationship needs to go both ways — users and clients provide the data needed to assess system performance, and then organizations follow through on their commitments, providing at least a summary of the results back to users and clients. As described by LR9, “trust is a word that is almost meaningless. Trustworthy is something else, and much more weighty. […] Trustworth[iness] is showing capacity on both sides of the trust dynamic. [Saying] someone is trustworthy means that they can receive information and treat it respectfully. Trust alone is…toothless.” There are likely to be many organizational and legal barriers to this type of trust-building exercise, but some practitioners felt that they might need to start exploring other ways of addressing fairness if the sensitive data they currently need cannot be collected in a trustworthy, consensual way.
Building on the notions of trustworthiness and consent, a few practitioners reported trying to reach a higher level of alignment with their data subjects. Concerning the use of demographic data in their research, R2 broached the question, “if you recognize that this system is working poorly for some demographic, how do you rectify that in a way that meaningfully pays attention and gives voice to this group that it’s not working for?” By ensuring that data subjects not only have a say in what data gets collected, but are actually involved in the thinking about how the data should be used, it is both more likely that data will be employed in the interest of those it is about and that the fairness achieved will be more than a satisfied formalized constraint (Green and Viljoen, 2020; Dobbe et al., 2018; Mulligan et al., 2019; Martin Jr. et al., 2020; Katell et al., 2020).
6.5. Mitigation Uncertainty
When practitioners are uncertain about how effectively they will be able to use demographic data and make changes based upon it, they can be deterred from trying to overcome the barriers and concerns described in previous sections. From our conversations with practitioners, we saw this uncertainty focus around 1) the ability of the practitioner to provide a solution for detected bias, and 2) the likelihood that the practitioner’s analyses and solutions would be used by their organization.
Firstly, as discussed in Section 1, the algorithmic fairness literature provides a number of techniques for potentially mitigating detected bias, but practitioners noted that there are few applicable, touchstone examples of these techniques in practice. Moreover, the methods proposed often do not detail exactly what types of harms they are designed to mitigate, so it is not always clear which methods should be applied in which contexts. The only domains in which we saw practitioners proceed with confidence were those with relatively well-established definitions of what it means to make systems fair (e.g., Hiring/HR and Finance) and those where the solutions were not limited to algorithmic fixes (e.g. Healthcare). In the first case, precedents such as the 80% rule (see 5.2) are straightforward enough to operationalize that practitioners can address bias with relative confidence. In the latter case, a lack of clear technical solutions did not hamper efforts to detect bias since practitioners could report evidence of bias to relevant stakeholders and the necessary interventions could be made. LR8 provided an example where a model showed that individuals who spoke English as a second language had worse predicted health outcomes associated with sepsis. Instead of pursuing a technical intervention, they found some of the issues stemmed from an absence of Spanish language materials on the infection in the community. In this case, had the spoken-language data field been either not collected or thrown out of the model, as a more naive anti-discrimination policy might dictate, more equitable health outcomes would have been directly inhibited.
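As an indication of why the 80% rule is comparatively easy to operationalize, the sketch below computes each group's selection rate relative to the highest group's rate and flags any group whose ratio falls below four-fifths. The function names, group labels, and counts are hypothetical, and the sketch deliberately ignores the sample-size and significance considerations that a real adverse impact analysis would need to address.

```python
def adverse_impact_ratios(selected, applicants):
    """Return each group's selection rate divided by the highest group's rate."""
    rates = {group: selected[group] / applicants[group] for group in applicants}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any selections")
    return {group: rate / highest for group, rate in rates.items()}


def below_four_fifths(selected, applicants, threshold=0.8):
    """Flag groups whose ratio to the highest selection rate is under the threshold."""
    ratios = adverse_impact_ratios(selected, applicants)
    return {group: ratio < threshold for group, ratio in ratios.items()}


# Hypothetical counts for illustration only.
applicants = {"group_a": 100, "group_b": 80}
selected = {"group_a": 50, "group_b": 24}

ratios = adverse_impact_ratios(selected, applicants)
flags = below_four_fifths(selected, applicants)
print(ratios)
print(flags)
```

In this toy example group_b is selected at 30% against group_a's 50%, a ratio of 0.6, so it would be flagged for closer review. The calculation needs only group membership and outcomes, which is why it depends so directly on demographic data being available.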
Secondly, fairness practitioners often want to go beyond just the minimum requirements of corporate policy, but it is not always clear what will be permitted. One element of this is that, as described in 6.3, their work can simply run ahead of what the policy team has considered. When conducting fairness analyses, practitioners mentioned surfacing uncomfortable questions, such as what their responsibility is for historical inequalities that impact various performance metrics. When organizations are not prepared to answer these questions, it can be difficult for practitioners to know how to (and whether to) proceed with demographic data collection and use. A related issue is that when there are more established policies on fairness, they generally seek to ensure all parties are treated “the same.” Whether this is through equalizing performance metrics across groups, such as calibration or predictive parity, or by removing/ignoring demographic data entirely, practitioners worried that achieving equity might require more nuanced treatments of different groups. As argued by TC9, “I would like to make a little bit stronger of a stand on [being] anti-bias. Instead of just making sure our models aren’t biased, [we should] potentially carve out a space for treating groups differently and treating the historically marginalized group with more care.” Policies that attempt to treat every group the same can be seen as trying to espouse a “view from nowhere,” (Haraway, 1988) which in practice just means taking the perspective of the majority or those with power (e.g. treating the N-word as hate speech implies that the speaker is not Black).
Participants discussed many challenges related to demographic data procurement that they faced when seeking to detect algorithmic bias. One core challenge was balancing privacy and anti-discrimination laws and policies. Most organizations did not have very well-developed anti-discrimination policies compared to privacy policies, making the heightened consent requirements under data protection law more salient than the desire to check for bias in data or products. In addition, there were organizational constraints given that fairness goals did not always align with financial objectives and could potentially create blowback in the press given the lack of public trust in tech companies’ data collection practices. As a result of these barriers, the presence or absence of regulatory guidance had a large influence on whether bias detection for particular demographic categories was part of organizations’ standard practice. In domains without clear guidance, data collection was far less common than in domains with regulatory mandates or industry standard practices.
In addition to these legal, policy, financial, and public relations barriers, practitioners also expressed varied concerns they held themselves regarding the procurement and use of demographic data for fairness purposes. They discussed, on the one hand, the unreliability of self-reported demographics, but on the other hand, uncertainty around the appropriateness of using proxies or inferred demographics. Practitioners also reported difficulties with the contested nature and regional differences of many demographic attributes. Moreover, practitioners expressed concern around how to build trustworthy and transparent processes for data collection. Finally, the lack of clear examples and guidance on how to address bias once it has been detected worried many practitioners. Though these concerns existed for many practitioners, they did not see them as unresolvable. In fact, most interviewees expressed a desire to start exploring responsible strategies for collection and use, since it is difficult to address issues that are not being measured.
Having mapped out a dense web of challenges practitioners face when attempting to use demographic data in service of fairness goals, we do not believe that the next step should be to simply lower the barriers to collecting demographic data. To the contrary, many of the challenges participants raised highlight deep normative questions not only around how but also whether demographic data should be collected and used by algorithm developers. While the main goal of this paper was to explore the contours of these normative questions from the practitioner perspective, we take the rest of this section to consider three elements of possible paths forward: clearer legal requirements around data protection and anti-discrimination, privacy-respecting algorithmic fairness strategies, and more meaningful inclusion of data subjects in determining how their data gets used.
Looking first at legal provisioning, using demographic data to assess and address algorithmic discrimination will likely require clear carve outs in data protection regulation. This could entail establishing official third party data holders and/or auditors, or it could mean just opening the door for private organizations to do this work themselves. During the course of this interview project, we did see the UK Information Commissioner’s Office publish explicit guidance on how to legally use special category data to assess fairness in algorithmic systems (UK Information Commissioner’s Office, 2020), but it is not yet clear what impact this will have. The approach of relaxing demographic data protections by itself, however, will likely leave many of the other constraints and practitioner concerns unaddressed. Further still, anti-discrimination law does not necessarily provide incentives for using demographic attributes to inform algorithmic decision-making (Xiang, 2021; Harned and Wallach, 2019; Tischbirek, 2020; Bent, 2020). As such, data protection carve-outs would likely need to coincide with updated protections around algorithmic discrimination. Looking to domains where demographic data collection is already legally provisioned (e.g. hiring/HR in the U.S.), we see that practitioner concerns do not dissolve so much as change shape (e.g. issues with the government data categories), suggesting the importance of crafting this legislation in collaboration with practitioners.
Algorithmic fairness strategies that maintain privacy are often proposed as a means of enabling data collection in low-trust environments: they construct barriers around sensitive data, ensuring it can only be used for assessing and addressing bias and discrimination. Veale and Binns (2017) recommend a number of approaches that enlist third parties with higher trust to collect the data and conduct secure fairness analyses, fair model training, and knowledge sharing. Other proposed methods build on this third-party arrangement, ensuring cryptographic or differential privacy from the point of collection (Kilbertus et al., 2018; Jagielski et al., 2019) and reducing privacy and trust concerns even further. Decentralized approaches such as federated learning could have similar benefits, as they could possibly eliminate the need for data sharing entirely, though methods for fairness analyses are less built out in this area (Kairouz et al., 2019). While these methods could go a long way in addressing some of the trust-related concerns discussed by practitioners, they could exacerbate other challenges. By taking on the responsibilities of collecting, handling, and processing this data, the third party would also inherit some of the practitioners’ more socially-minded concerns, such as how to determine demographic saliency and how to meaningfully resolve detected unfairness. Depending on the nature of the third-party organization and the level of investment from the primary company, this deferral of responsibility could increase the likelihood that these questions simply remain unaddressed.
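To make the differential privacy variant of this third-party arrangement more concrete, the following is a minimal sketch only; it is not the method of Kilbertus et al. (2018) or Jagielski et al. (2019). It assumes a hypothetical trusted third party that holds (group, selected) records, treats the list of group labels as public, and releases per-group selection rates after adding Laplace noise to each count. The function names, the `epsilon` parameter, and the toy records are all illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_selection_rates(records, groups, epsilon, rng):
    """Release per-group selection rates from noisy counts.

    Assumptions (illustrative, not from the paper):
    - `records` is an iterable of (group, selected) pairs held by the third party;
    - `groups` is a fixed, public list of labels, and every record's group is in it;
    - each person appears in exactly one (group, selected) cell, so adding or
      removing one person changes one count by 1 and Laplace(1/epsilon) noise
      per cell suffices for epsilon-differential privacy.
    """
    counts = {(g, s): 0 for g in groups for s in (True, False)}
    for group, selected in records:
        counts[(group, bool(selected))] += 1

    scale = 1.0 / epsilon
    rates = {}
    for g in groups:
        # Clamping and dividing are post-processing and do not weaken the guarantee.
        pos = max(0.0, counts[(g, True)] + laplace_noise(scale, rng))
        neg = max(0.0, counts[(g, False)] + laplace_noise(scale, rng))
        total = pos + neg
        rates[g] = pos / total if total > 0 else None
    return rates


def parity_gap(rates):
    """Largest difference in (noisy) selection rate between any two groups."""
    known = [r for r in rates.values() if r is not None]
    return max(known) - min(known) if known else None


# Toy data: the practitioner only ever sees the noisy rates and the gap.
toy_records = (
    [("a", True)] * 30 + [("a", False)] * 70
    + [("b", True)] * 60 + [("b", False)] * 40
)
released = noisy_selection_rates(toy_records, ["a", "b"], epsilon=1.0, rng=random.Random(0))
print(released, parity_gap(released))
```

Smaller values of `epsilon` give stronger privacy but noisier rates, which matters most for small groups. Note that the sketch leaves open exactly the questions raised above, such as who chooses the group labels and what is done once a gap is detected.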
The final approach we consider is more firmly incorporating data subjects into the work around fairness. As a relevant and timely example of why this is important, throughout the course of this project we saw debates flare up around the use of racial data in COVID response plans (Butler, 2020; Reventlow, ; Singh, 2020; James, 2020). Opponents argue that without a strong commitment to actually intervening on and reversing systemic health inequalities, generating such data is more likely to perpetuate harm (e.g. through the construction of various negative narratives blaming disadvantaged communities for health disparities (Singh, 2020)) than to address it (Onuoha, 2020). In a similar way, we might expect hasty, uncritical pushes for demographic data collection to leave the roots of discrimination unaddressed, possibly even covering deeper issues with a veneer of having reduced algorithmic bias. As such, we see strong potential in more transparent, inclusionary practices surrounding demographic data and algorithmic fairness (Katell et al., 2020; Wong, 2020; Martin Jr. et al., 2020). As discussed in Section 6.4, finding ways to keep data subjects informed and giving them a voice in how their data is used should ideally be an element of any project that builds on sensitive data. Though meaningfully maintaining this relationship would likely be difficult and costly, doing so could also address issues of privacy and a majority of practitioners’ concerns. Data subject consent would be a more accessible legal basis for demographic data collection, and concerns around data reliability and accuracy would likely be reduced as the collection process comes to more closely reflect what LR8 referred to as “data donation”, in which shared data could mutually benefit the individual, their community, and the organization collecting the data. Furthermore, more deeply engaging various groups in the collection and use of demographic data is likely to complement internal processes of determining both the salient demographics to consider and the ways to meaningfully address uncovered bias.
We provide these three brief considerations of paths forward, but much more work is required to adequately resolve the myriad constraints and concerns characterized in previous sections. While there has been much discussion within the algorithmic fairness literature about the legal and technical issues posed by sensitive and protected attribute availability, we hope this work can both ground and expand the range of issues being considered by seeing them through the practitioner’s lens.
- The Guardian. External Links: Cited by: §1.
- Towards a Just Theory of Measurement: A Principled Social Measurement Assurance Program for Machine Learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Honolulu HI USA, pp. 445–451 (en). External Links: Cited by: §5.2.
- Working to Address Algorithmic Bias? Don’t Overlook the Role of Demographic Data. (en-US). External Links: Cited by: §1.
- Thematic networks: an analytic tool for qualitative research. Qualitative Research 1 (3), pp. 385–405 (en). External Links: Cited by: §3.
- Americans and privacy: Concerned, confused and feeling lack of control over their personal information. Pew Research Center: Internet, Science & Tech (blog). November 15, 2019. Cited by: §6.4.
- Interventions over predictions: reframing the ethical debate for actuarial risk assessment. In Conference on Fairness, Accountability and Transparency, pp. 62–76. Cited by: §5.2.
- Beyond Bias: Re-Imagining the Terms of ‘Ethical AI’ in Criminal Law. SSRN Electronic Journal (en). External Links: Cited by: §1.
- Algorithmic Equity Toolkit. (en). External Links: Cited by: §1.
- Big Data’s Disparate Impact. California Law Review 104, pp. 671. External Links: Cited by: §6.2.
- AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. arXiv:1810.01943 [cs]. External Links: Cited by: §1.
- What is the point of fairness?: disability, AI and the complexity of justice. ACM SIGACCESS Accessibility and Computing (125), pp. 1–1 (en). External Links: Cited by: §2.
- Is Algorithmic Affirmative Action Legal?. Georgetown Law Journal 108, pp. 803. Cited by: §2, §7.
- Racial categories in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* ’19, Atlanta, GA, USA, pp. 289–298 (en). External Links: Cited by: §1, §2.
- Awareness in practice: tensions in access to sensitive attribute data for antidiscrimination. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 492–500. Cited by: §2, §5.2.
- Using publicly available information to proxy for unidentified race and ethnicity. External Links: Cited by: §6.2.
- Al Jazeera. Cited by: §7.
- Feminist data manifest-no. Cited by: §2.
- Questions and answers to clarify and provide a common interpretation of the uniform guidelines on employee selection procedures. Federal Register. Cited by: footnote 4.
- The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv:1808.00023 [cs]. External Links: Cited by: §6.2.
- Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal F., pp. 139. Cited by: §6.3.
- Revised form for self-identification of disability released. (en-US). External Links: Cited by: §6.1.
- Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 conference on fairness, accountability, and transparency, FAccT ’20, New York, NY, USA, pp. 525–534. External Links: Cited by: §1.
- Data 4 Black Lives. External Links: Cited by: §2.
- A Broader View on Bias in Automated Decision-Making: Reflecting on Epistemology and Dynamics. arXiv:1807.00553 [cs, math, stat] (en). External Links: Cited by: §6.4.
- Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Services and Outcomes Research Methodology 9 (2), pp. 69. Cited by: §5.2.
- Regulation (EU) 2016/679 (General Data Protection Regulation). External Links: Cited by: §1, §5.1.1.
- Federal Data Protection Act of 30 June 2017 (BDSG). External Links: Cited by: footnote 3.
- Eliminating latent discrimination: train then mask. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3672–3680. Cited by: §1.
- A step towards accountable algorithms? Algorithmic discrimination and the European Union General Data Protection. In 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona. NIPS Foundation. Cited by: §2.
- Algorithmic realism: expanding the boundaries of algorithmic thought. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona Spain, pp. 19–31 (en). External Links: Cited by: §6.4.
- Causal Effects of Perceived Immutable Characteristics. The Review of Economics and Statistics 93 (3), pp. 775–785. External Links: Cited by: §6.2.
- Proxy Fairness. arXiv:1806.11212 [cs, stat] (en). External Links: Cited by: §1.
- Discrimination- and privacy-aware patterns. Data Mining and Knowledge Discovery 29 (6), pp. 1733–1782 (en). External Links: Cited by: §1.
- Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI ’18, Montreal QC, Canada, pp. 1–13 (en). External Links: Cited by: §2.
- Towards a Critical Race Methodology in Algorithmic Fairness. pp. 12 (en). Cited by: §2, §6.2, §6.3.
- Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective. Feminist Studies 14 (3), pp. 575–599. External Links: Cited by: §6.5.
- Stretching human laws to apply to machines: The dangers of a ‘Colorblind’ Computer. Florida State University Law Review, Forthcoming. Cited by: §7.
- Reforming Pretrial Justice in California. pp. 34 (en). Cited by: §1.
- Fairness Without Demographics in Repeated Loss Minimization. arXiv:1806.08010 [cs, stat] (en). External Links: Cited by: §1.
- Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society 22 (7), pp. 900–915 (en). External Links: Cited by: §1, §6.3.
- Improving fairness in machine learning systems: What do industry practitioners need?. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI ’19, pp. 1–16 (en). External Links: Cited by: §1, §2, §4, footnote 2.
- So What “Should” We Use? Evaluating the Impact of Five Racial Measures on Markers of Social Inequality. Sociology of Race and Ethnicity 3 (1), pp. 14–30 (en). External Links: Cited by: §6.3.
- What’s Sex Got to Do With Fair Machine Learning?. pp. 11 (en). Cited by: §2.
- Crossing the Quality Chasm: A New Health System for the 21st Century. National Academies Press (US), Washington (DC) (eng). External Links: Cited by: §5.2.
- Measurement and Fairness. arXiv:1912.05511 [cs] (en). External Links: Cited by: §6.2.
- Differentially private fair learning. In International Conference on Machine Learning, pp. 3000–3008. Cited by: §1, §7.
- Race-based COVID-19 data may be used to discriminate against racialized communities. (en). External Links: Cited by: §7.
- Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs, stat] (en). External Links: Cited by: §7.
- Fairness-aware Learning through Regularization Approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650. External Links: Cited by: §1.
- Toward situated interventions for algorithmic equity: lessons from the field. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona Spain, pp. 45–55 (en). External Links: Cited by: §1, §6.4, §7.
- Blind Justice: Fairness with Encrypted Sensitive Attributes. arXiv:1806.03281 [cs, stat] (en). External Links: Cited by: §1, §7.
- Fair Decision Making using Privacy-Protected Data. arXiv:1905.12744 [cs] (en). External Links: Cited by: §1.
- Fairness without Demographics through Adversarially Reweighted Learning. arXiv:2006.13114 [cs, stat] (en). External Links: Cited by: §1.
- How Cambridge Analytica Sparked the Great Privacy Awakening. Wired (en-us). External Links: Cited by: §6.4.
- LinkedIn Recruiter: The Industry-Standard Recruiting Tool. (en). External Links: Cited by: §3.
- Innovative Methodologies in Qualitative Research: Social Media Window for Accessing Organisational Elites for interviews. 12 (2), pp. 11 (en). Cited by: §3.
- Response bias and reliability in sensitive topic surveys. Journal of the American Statistical Association 81 (394), pp. 381–389. Cited by: footnote 6.
- Participatory Problem Formulation for Fairer Machine Learning Through Community Based System Dynamics. arXiv:2005.07572 [cs, stat] (en). External Links: Cited by: §6.4, §7.
- Fairlearn. Microsoft. External Links: Cited by: §1.
- Big Data from the South(s): Beyond Data Universalism. Television & New Media 20 (4), pp. 319–335 (en). External Links: Cited by: §2.
- This Thing Called Fairness: Disciplinary Confusion Realizing a Value in Technology. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–36 (en). External Links: Cited by: §6.4.
- Algorithms of oppression: How search engines reinforce racism. NYU Press. Cited by: §1.
- When Proof Is Not Enough. (en-US). External Links: Cited by: §7.
- Our data bodies: reclaiming our data. June 15, pp. 37. Cited by: §2.
- Mitigating Bias in Algorithmic Hiring: Evaluating Claims and Practices. arXiv:1906.09208 [cs] (en). External Links: Cited by: footnote 1.
- Indigenous data sovereignty. Cited by: §2.
- Where Responsible AI meets Reality: Practitioner Perspectives on Enablers for shifting Organizational Practices. arXiv:2006.12358 [cs] (en). External Links: Cited by: §2, footnote 5.
- Data collection is not the solution for Europe’s racism problem. Al Jazeera. External Links: Cited by: §7.
- What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes. arXiv:1904.05233 [cs, stat] (en). External Links: Cited by: §1.
- Aequitas: A Bias and Fairness Audit Toolkit. arXiv:1811.05577 [cs] (en). External Links: Cited by: §1.
- Fairness GAN: Generating datasets with fairness properties using a generative adversarial network. IBM Journal of Research and Development 63 (4/5), pp. 3:1–3:9 (en). External Links: Cited by: §1.
- How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW1), pp. 1–35 (en). External Links: Cited by: §2.
- Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* ’19, Atlanta, GA, USA, pp. 59–68 (en). External Links: Cited by: §1.
- The ugly truth: tech companies are tracking and misusing our data, and there’s little we can do. (en). External Links: Cited by: §5.3.
- Collecting race-based data during pandemic may fuel dangerous prejudices. (en). External Links: Cited by: §7.
- FAT Forensics: A Python Toolbox for Implementing and Deploying Fairness, Accountability and Transparency Algorithms in Predictive Systems. Journal of Open Source Software 5 (49), pp. 1904. External Links: Cited by: §1.
- What is data justice? The case for connecting digital rights and freedoms globally. Big Data & Society 4 (2), pp. 205395171773633 (en). External Links: Cited by: §2.
- Artificial intelligence and discrimination: Discriminating against discriminatory systems. In Regulating artificial intelligence, pp. 103–121. Cited by: §1, §7.
- What do we need to do to ensure lawfulness, fairness, and transparency in AI systems?. (en). External Links: Cited by: §7.
- LiFT: A Scalable Framework for Measuring Fairness in ML Applications. arXiv:2008.07433 [cs]. Note: Accepted for publication in CIKM 2020 External Links: Cited by: §1.
- Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society 4 (2), pp. 205395171774353 (en). External Links: Cited by: §1, §1, §7.
- Fairness and Accountability Design Needs for Algorithmic Support in High-Stakes Public Sector Decision-Making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI ’18, Montreal QC, Canada, pp. 1–14 (en). External Links: Cited by: §2.
- The What-If Tool: Interactive Probing of Machine Learning Models. IEEE Transactions on Visualization and Computer Graphics 26 (1), pp. 56–65. External Links: Cited by: §1.
- How Algorithms Discriminate Based on Data They Lack: Challenges, Solutions, and Policy Implications. Journal of Information Policy 8, pp. 78 (en). External Links: Cited by: §1.
- Fair lending: race and gender data are limited for non-mortgage lending. Subcommittee on Oversight and Investigations, Committee on Financial Services, House of Representatives. External Links: Cited by: §5.2.
- Democratizing Algorithmic Fairness. Philosophy & Technology 33 (2), pp. 225–244 (en). External Links: Cited by: §7.
- Reconciling legal and technical approaches to algorithmic bias. Tennessee Law Review 88 (3). Cited by: §2, §7.
- Fairness Constraints: Mechanisms for Fair Classification. In Artificial Intelligence and Statistics, pp. 962–970 (en). External Links: Cited by: §1.
- Understanding discrimination in the scored society. Washington Law Review 89, pp. 1375 (en). Cited by: §1.
- Assessing Fair Lending Risks Using Race/Ethnicity Proxies. Management Science 64 (1), pp. 178–197. External Links: Cited by: §1.
- Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models. Artificial Intelligence and Law 24 (2), pp. 183–201 (en). External Links: Cited by: §1.