Software-Supported Audits of Decision-Making Systems: Testing Google and Facebook's Political Advertising Policies

02/26/2021 ∙ by J. Nathan Matias, et al. ∙ Princeton University The University of Chicago cornell university 0

How can society understand and hold accountable complex human and algorithmic decision-making systems whose systematic errors are opaque to the outside? These systems routinely make decisions on individual rights and well-being, and on protecting society and the democratic process. Practical and statistical constraints on external audits can lead researchers to miss important sources of error in these complex decision-making systems. In this paper, we design and implement a software-supported approach to audit studies that auto-generates audit materials and coordinates volunteer activity. We implemented this software in the case of political advertising policies enacted by Facebook and Google during the 2018 U.S. election. Guided by this software, a team of volunteers posted 477 auto-generated ads and analyzed the companies' actions, finding systematic errors in how companies enforced policies. We find that software can overcome some common constraints of audit studies, within limitations related to sample size and volunteer capacity.



There are no comments yet.


page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

In 2018, both Facebook and Google introduced policies that restricted who could purchase advertisements about elections and topics of “national importance” (Leathern, 2018; Wakabayashi, 2018). During the 2016 U.S. presidential election, foreign actors had attempted to influence U.S. voters by purchasing ads, among other tactics (Isaac and Wakabayashi, 2017). Under pressure from regulators and civil society, the companies needed a way to review potentially millions of ads a week, identify which ones were related to an election, and determine which should be published–all without disrupting ad markets and other forms of societally-important speech. They needed large-scale decision-making systems.

Decision-making systems, which often combine human and software processes, routinely determine high-stakes outcomes for millions of people. These systems shape people’s lives in areas including judicial sentencing (Angwin et al., 2016), child protection (Chouldechova et al., 2018), public health (Card, 2018; Chen et al., 2019), immigration (Whittaker, 2019), and hiring (Barabas, 2015). Decision-making systems are used worldwide to protect the public from harmful speech (Gillespie, 2018), to protect democracies from attempts to undermine elections (Madrigal, 2018), and to censor political speech (King et al., 2014). Because all decision-making systems make errors, the public depends on independent audits to hold decision-making institutions accountable, correct errors, and set policy for their decisions (Popper, 1947; Diakopoulos, 2014).

Attempts to audit decision-making systems can also advance scientific understanding of society, algorithms, and how they interact in the field (Bandiera et al., 2011; Kitchin, 2017; Rahwan et al., 2019). Audit studies, unlike financial audits or qualitative audits (Maclay, 2010; Sandberg, 2019)

, are randomized trials that quantitatively estimate systematic statistical error in decision-making systems. By conducting lab and field experiments that observe systematic errors in decision-making, scientists have advanced understanding of human psychology

(Kahneman and Tversky, 1972), organizational behavior (Pager, 2007), and market behavior (Salganik and Watts, 2009). Computer scientists use similar methods to study decision-making by algorithms (Chen and Wilson, 2017; Sandvig et al., 2014).

Because audit studies are complex to design and coordinate, researchers tend to focus on a small number of decision-making error types. But if they are too simple, studies can fail to observe important sources of error (Adler et al., 2018). In this paper, we describe, prototype, and evaluate novel software for increasing the complexity and validity of audit studies. Like other computer science papers that introduce a new approach to software-supported research (Geiger and Ribes, 2011; Matias and Mou, 2018), we present the method and include an example of its use. We created software to (a) support variable selection, (b) generate audit materials, (b) allocate audit attempts to testers, and (d) generate pre-registered statistical results and illustrations.

Using this software, we conducted a comparative audit of Google and Facebook’s political advertising policies during the 2018 U.S. midterm election. In our study, testers posted 477 auto-generated ads across the two platforms, testing each platform’s policy decisions for different ad targeting criteria, for partisan bias across multiple kinds of content, and for different advertiser characteristics. While Google did not make any mistakes in our audit, Facebook mistakenly prohibited 18% of ads for government websites and 5% of ads to non-election civic events. We did not find evidence of political bias in these mistakes.

Overall, this paper demonstrates the potential for software to streamline the design and implementation of audit studies while expanding their realism and validity. We conclude with design implications for software-supported audits of decision-making systems.

2. Background and Related Work

We begin by providing background on audit studies and introduce limitations that software can help overcome. We then provide background on the political advertising policies of Facebook and Google during the 2018 U.S. midterm elections to illustrate these challenges in the field. Finally, to frame our discussion of design considerations for audit software, we review the state of methods for software-supported audits.

2.1. Audit Studies and Their Limitations

Audit studies are field experiments that describe and explain systematic error made by a decision-making system. These randomized trials, first developed in the 1970s (Wienk, 1979; Hakken, 1979), became more common in the 2000s after a series of field experiments by Devah Pager demonstrated systematic racial discrimination by U.S. employers. Pager’s studies overturned an inaccurate consensus that racial differences in hiring outcomes were due to differences in skills (Pager, 2007). These audits, which examined the effects of race and criminal record on hiring, demonstrated that when presented with two equally-capable applicants, firms preferred white candidates to black candidates on average, even though firms were legally prohibited from considering race (Pager, 2003). While social scientists and advocates have persuasively argued the shortcomings of a focus on decision-making systems when addressing social inequality (Alexander, 2010), audit studies remain a valuable method for studying errors by a wider range of decision-making systems.

2.1.1. How Audit Studies Work

In audit studies, researchers start with hypotheses about decision-making error based on the (a) characteristics of the tester and (b) the choice being offered to the decision-making system. To create a controlled test, researchers recruit “testers” who are as similar as possible on all characteristics except those of interest to the study (e.g. skin tone). Testers are then randomly assigned to prompt a decision-making system to make a choice, for example by applying for a job. Across multiple prompts, testers also vary the stimulus provided to the institution (e.g. altering the CV) and record the decision made by the system. Researchers then conduct statistical tests for differences in decision rates for different categories of testers and prompts.

2.1.2. Dimensional Complexity

Audit studies, like all randomized trials, increase in dimensional complexity with each added variable. Pager’s earliest study was two-dimensional. The study worked with all-male testers, varying tester race and whether the CV included a criminal record. Had Pager added gender, the study would require three dimensions, with four testers for every gender presentation tested. Had Pager considered additional variations in CV details such as education level, the study would have grown further in complexity.

Among testers, dimensional complexity creates challenges of recruitment, training, and coordination. The number of testers increases exponentially with each new variable and linearly with each new possible attribute for a given variable. As more variables are added, the complexity of a study could quickly become too large for humans to coordinate alone.

Among the prompts being tested, dimensional complexity creates challenges of design and coordination. For example, auditing political advertising policies is dimensionally-complex in a country with two major parties, 50 states, and a range of issues where companies could make errors in the enforcement of their policies. When auditing election advertising policies, testers might need to prompt companies with many different prompts which are somehow similar enough for comparison, different enough to avoid detection of the audit itself, and tailored to the specific variables being tested. The effort of designing these prompts and allocating them to testers would quickly become too complex for humans to conduct efficiently.

2.1.3. Trading off Dimensional Complexity and External Validity

Since findings from audit studies only generalize to the kind of testers and prompts being sampled, audits have a trade-off between complexity and external validity. In the social sciences, external validity refers to the repeatability of a study outside the lab and also to the degree to which findings describe phenomena that actually occur in the field (Shadish et al., 2002; Cialdini, 1980). Simple audits that test a small number variables can detect decision-making errors that are important to society and to science. Yet even when researchers match testers and prompts to common cases, the findings cannot describe the performance of a decision-making system outside the audit sample. For example, while Pager’s study established discrimination among male job applicants, it took other audits to establish systematic hiring discrimination toward women (Bertrand and Mullainathan, 2004).

To put it simply, designers of audits face a trade-off between external validity and dimensional complexity. A more complex study with more variables can include a wider range of people and prompts. A narrower study with fewer variables will cover a more constrained range of cases. This trade-off has high-stakes consequences when auditing complex decision-making systems. In such high-dimensional settings, auditors could miss important errors because the test did not cover a wide enough range of prompts. By creating software to automatically generate audit materials, estimate sample sizes, and allocate prompts across testers, we hoped to increase the validity of audit studies by expanding their dimensional complexity.

2.2. Platform Political Advertising Policies in 2018

We tested a software-supported approach to audit studies in an audit of Facebook and Google’s political advertising policies during the 2018 US midterm elections.

Since the 2016 United States presidential election, voters have become more aware of the potential for online advertising on social media platforms to influence elections. For example, Russian actors placed election-related ads on Facebook to influence American voters, which covered topics such as gun control, racial tension, and immigration (Politico, 2018; Facebook, 2018). Politicians criticized Facebook for allowing these ads to be published (C-Span, 2018). Many insisted that Facebook should have detected and prevented these attempted influence campaigns by observing available signals such as payment currency–the Russian ads were paid for in rubles (Hill, 2017).

In response to public concern and Congressional hearings, Facebook (Madrigal, 2018) and Google (Reardon, 2018) implemented policies in 2018 to limit who can publish election-related ads during national elections. During several national elections, including the U.S. 2018 midterm election, companies required advertisers to confirm their identity and nationality before permitting them to publish ads promoting candidates in the period before the election. In this, companies were forcing advertisers to comply with U.S. Federal Election Commission (FEC) regulations that required disclosure of who funded federal election advertising campaigns (Valentino-DeVries, 2018).

Facebook also developed policies that restricted who could publish so-called “issue ads,” a category that regulators have struggled to define. In the U.S. legal tradition, issue ads include content that a “reasonable person” might guess are about an election. Issue ads evade election regulations by avoiding direct encouragement to vote for a candidate (Hayward, 1999). These ads are common and controversial in U.S. elections. Issue ads are also the kind of ads published by foreign influence operations during the 2016 U.S. presidential election. To simplify decades of policy making and Supreme Court cases, because issue ads are to difficult to define clearly, the FEC did not restrict issue ads or require funding disclosure for these ads in the 2018 election (Hayward, 1999; Valentino-DeVries, 2018). Under pressure from politicians to do what the government considered too complex, Facebook developed policies in 2018 that forbade advertisers from publishing ads about “issues of national importance” without verifying themselves.

2.2.1. Prior Research on Political Advertising Markets

Researchers have studied these political advertising policies using approaches from information security and algorithm design.

Security researchers have investigated how the Russian Intelligence Research Agency (IRA) used Facebook’s ad targeting tools to create politically divisive ads before the 2016 election (Ribeiro et al., 2019). Other security researchers have identified ways that malicious advertisers could prevent their political ads from being disclosed in Facebook’s political ad library (Edelson et al., 2020).

Since machine learning models inform the decision to label content as political advertising, researchers have studied platform policies from the perspective of algorithm and market design. One team developed a browser plugin to gather Facebook ads shown on volunteers’ timelines. After using machine learning to classify ads as political or not, they found cases of political ads that Facebook failed to label 

(Silva et al., 2020). Another team showed that the content of an ad on Facebook (e.g., its associated image) can cause an ad to be disproportinately delivered to one group of users over another, despite the same targeting criteria and the same bidding strategy being used across ads (Ali et al., 2019a, b).

2.2.2. Decision-Making Systems for Political Ads

To determine which advertisers needed to be authorized, Facebook and Google created decision-making systems that used a combination of machine learning and human processes to review whether a given ad would be required to comply with a company’s political ad policies. In 2017, Facebook announced that they would hire 1,000 content moderators to help review ads. They also developed a machine learning system to detect ”inauthentic Pages and the ads they run” (Constine, 2017). While Google was less open about their methods, they also reviewed ads for political content and required advertisers for political candidates to verify their identity (Reardon, 2018).

Upon identifying and verifying an advertiser, both Facebook and Google promised to disclose the advertiser’s identity when publishing the ad, as well as publish reports and datasets of election-related advertisers (Bowles and Frenkel, 2018). Both companies published interactive websites that provided information about the entities who had published past and present political advertisements (Smith, 2018; Constine, 2017). In their press releases, both companies expressed hope that these datasets would provide greater transparency toward advertisers as well as disclosure of the decisions made by companies (Smith, 2018; Leathern, 2018).

2.2.3. Systematic Errors from Political Ad Policy Enforcement

Google and Facebook’s political advertising policy systems, like any decision-making system, can make systematic errors. In previous decades, the U.S. government had restrained political advertising regulations to protect freedom of expression, since mistakes could substantially impact both an election and American civic life (Hayward, 1999). Systematic errors by platform enforcement systems introduce three risks to society:

  • Platforms might be too permissive, creating systems that were ineffective at preventing, removing, or labeling ads that violate a platform’s policies and election law.

  • Platform decision-making systems might be too restrictive, forbidding important public discourse protected by the constitution that was unrelated to elections and permitted by a company’s own policies.

  • Platforms could also be politically-biased if systematic errors advantaged or disadvantaged a given political candidate, party, or ideology.

Throughout 2018, platforms were accused of all three kinds of decision-making failures. Google’s decision-making system was accused of being too permissive by failing to apply its political advertising restrictions to ads by major political candidates (Glazer and Haggin, 2019). Facebook’s decision-making system were accused of being politically biased by failing to label a political ad that asked San Francisco voters to vote ”yes” on a school bond proposition and by blocking non-partisan ads from newspapers (Merrill and Tobin, 2018). Lastly, Facebook was accused of being too restrictive by falsely labeling ads for veterans, LGBTQ+ people, and even baked beans whose brand was similar to the name of a former U.S. president (Gale, 2018; Rosenberg, 2018; Mak, 2018). All of these cases involved policy actions that were inconsistent with a platform’s own written policies.

2.2.4. How Audit Studies Advance Understanding of Political Advertising Policies

While stories about political ad policy enforcement errors were observed in platform transparency reports, none of the datasets published by Facebook or Google during the 2018 election can be used to ask if the policies were too permissive, too restrictive, or politically biased. To answer these questions, researchers would need to compare restricted accounts and ads to those that were permitted. Since transparency datasets only include ads that companies labeled as election-related, it is impossible to make this comparison. Even if platforms were to publish the full set of all ads submitted during the election period, observational studies could still produce an inaccurate picture of the fairness of a company’s processes. For example, if one candidate in an election was more or less competent at online advertising than their opponent, observational transparency data might lead researchers to mistakenly conclude that the company’s decision-making system wrongfully favored the more competent advertiser.

Unlike correlation studies of administrative data, audit studies evaluate decision-making systems under balanced and controlled conditions in the field. Consequently, audit studies provide reliable tests of a system’s average performance, including any cases of under-enforcement, over-enforcement, and bias. Yet audit studies have typically been too difficult to design and coordinate for complex processes like elections, which involve many candidates campaigning across many regions with local issues.

2.3. Software-Supported Audit Studies

The methods used by prior research in computer science about decision-making systems have tended to depend on the intended use of the research and the standpoint of the researchers. In the philosophy of science, standpoint refers to the ways that questions such as “by whom” and ”for whom” shape every detail of the research endeavor (Harding, 1991).

Across decision-making research, researchers working from the standpoint of designers, vendors, and administrators will use methods that benefit from privileged knowledge about a system (Raji et al., 2020; Obermeyer et al., 2019). Those working from a journalistic, consumer protection, or policy enforcement standpoint will tend to treat systems as a black box, often in situations where those who control a decision-making system could with-hold access to internal information about how a system works (Diakopoulos, 2014). For researchers standing on the outside, audit studies can identify problems, and contribute some limited understanding of the nature of a problem (Weiss, 1979; Buolamwini and Gebru, 2018).

Computer scientists have made progress defining the challenge of dimensional complexity for audit studies of fully-automated decision-making systems in the lab (Adler et al., 2018; Kearns et al., 2018). Yet the work of conducting audits in the field is constrained by the challenges of working with testers and designing prompts for those testers to use.

In some domains, software tools have coordinated large-scale volunteer monitoring of decision-making systems that shape network conditions, such as censorship and filtering. Two prominent examples are OONI (Filasto and Appelbaum, 2012) and Encore (Burnett and Feamster, 2015), each of which operates in a slightly different manner. In the case of OONI, volunteers run software on their own machines to test various aspects of network connectivity, such as the reachability of various Internet destinations. Encore can recruit and coordinate a much larger user base, since it does not require special software installation, and since tests can occur with minimal to no training.

Computer scientists have simulated human testers with software. By removing humans from the research process, scientists can generate testers with a wide range of characteristics (Tschantz et al., 2015). Despite the efficiency and variety of this promising approach, questions remain about the reliability of a method that could be interpreted by decision-making systems as inauthentic or fraudulent activity (Narayanan and Reisman, 2017).

Social scientists have created externally-valid audit prompts by manually adapting real-world content. In one audit of employment discrimination, researchers downloaded resumés from an online website and edited them to create their decision prompts (Bertrand and Mullainathan, 2004). In a study of online censorship in China, researchers looked for news articles about political and non-political collective action that they then tested posting to Chinese social media (King et al., 2014)

. While drawing from real cases increases the naturalism of an audit study, the authors of these studies use labored words like “scour” to describe the substantial effort required by this manual process. Neither study considered more than two binary variables, likely constrained by the substantial effort of finding examples and adapting them into scientifically-valid prompts.

3. Design Considerations for Software-Supported Audits of Decision-Making Systems

We argue that carefully designed software can help automate and streamline audits of decision-making systems. In our case, we were interested in writing software to help audit the enforcement of Facebook and Google’s political ad policies. To design our audit-support software, we considered the following design decisions that anyone conducting an audit of decision-making systems must face.

3.1. Choosing Variables to Test and Sample Sizes

The designers of any audit study need to make decisions about which variables to test. Researchers must decide which characteristics of the testers and the characteristics of the prompts that they wish to test and which ones they wish to hold constant. These decisions determine the dimensional complexity of the study and are constrained by the feasible sample size required to observe meaningful error rates. While larger, more complex audits offer greater precision and validity, they also require more testers, more time from testers, and more money (we paid for every ad we posted).

Audit studies also vary the prompt provided to the decision-making system being tested. In the case of political advertising policies, we were interested in democrat and republican candidates, progressive and conservative issues, and elections at a federal and local level.

In social scientific audit studies, testers are chosen for having “bundles” of characteristics that enable comparison between traits that might be impossible for an individual person to change, such as their skin tone (Sen and Wasow, 2016). Researchers must choose how many variations to include in the study design. In the audit of political advertising policies, we considered the citizenship of the tester, the geolocation of their internet activity, browser and platform language settings, whether their bank account was based in the U.S., their billing address, and what currency their bank account used.

To support the design of our audit study, we wrote software that took input on the characteristics and prompts we wished to test. The software bundled those characteristics into combinations of personas and prompts, and simulated the sample size needed to observe meaningful differences in error rates.

3.2. Generating Realistic Audit Prompts

In any audit study, the validity of the study depends on the realism of the prompts–the occasions and materials that force the decisions recorded by testers. When auto-generating prompts, researchers risk losing that validity in favor of automation. Yet automation can also improve the realism of audit studies when researchers draw from a range of qualitative and quantitative evidence, including news stories about platform mistakes, public data about an election, and other relevant public datasets.

Many audit studies are prompted by news stories about individual cases. This qualitative evidence about mistakes can provide a valuable source for creating realistic audit prompts. For our study, we read news reports, observed decisions by companies, interviewed people whose ads we believed were mistakenly removed, and examined the ads that they posted. Through this process, we learned more about the genre of prompts we needed to create (advertisements), as well as people’s theories about the nature of platform mistakes.

Once researchers identify the genre of audit prompt their study requires, they can draw from a range of data sources to generate prompts using software. In our case, we were studying advertisements, which include a title, subtitle, web link, and in some cases, an image. To generate realistic prompts with software, we needed data sources that our software could use to generate each of an online ad.

Since our audit study focused on political advertising policies, we also needed public data about the election. Public election datasets provide information about the structure of the election into states, voting districts, individual races, and individual candidates. The Federal Elections Commission API provides up-to-date details on every candidate running for federal office, which party they were from, and which voting districts are voting on which positions.111

The creation of realistic audit prompts will likely involve matching multiple datasets. in our study, since election datasets are linked with geographic regions, we were able to match election data with public domain and commercial data sources. These datasets increased the realism of our audit, since our ads were advertising real products, events, and places, with accurate descriptions and photographs.

Overall, the validity of any audit study depends on the realism of the prompts used by researchers. When audit materials are informed by on-the-ground observation and generated by software using real-world information, researchers can develop highly realistic audit materials.

3.3. Design Diagnosis and Statistical Power

Since audit studies set out to describe systematic errors through statistics, researchers need to specify a sample size for the number of testers and the number of prompts. Deciding on a sample size is often an iterative process of making decisions in light of trade-offs of complexity, precision, and cost. With larger samples, researchers can estimate base rates and error rates with greater precision. As dimensional complexity increases, researchers must either increase the sample size or reduce the precision of their findings. If the sample size is too small, researchers may fail to observe important forms of systematic error. Since each observation requires time, coordination effort, and sometimes direct costs, researchers also have reasons to seek the smallest sample size that matches their precision goals.

For low-dimensional studies (such as an audit study based on one binary variable), researchers can use simple power analysis tools to choose a sample size. In high-dimensional studies, researchers may wish to observe error at different levels of precision for different characteristics. For example, in an audit of political advertising policies, differences in error rates based on currency type might be on a different scale or have a different importance from errors that differ by political party. Researchers may also want to allocate sample sizes to observe party-based errors with greater precision. Researchers may also have a limited number of testers or seek to operate within certain financial constraints. Commonly-available power calculators cannot support such levels of dimensional complexity.

By simulating audit studies using the MIDA approach, researchers can explore the trade-offs of different aspects of study design. To develop and diagnose the final details of field experiments, including sample size, social scientists have recently developed a software-supported approach of Model, Inquiry, Data Strategy, and Answer Strategy (MIDA) (Blair et al., 2019). In this approach, researchers create a Model (M) by describing and simulating the processes they wish to be able to observe in the field—in our case, the decision-making system. Researchers then also simulate their data collection process (Inquiry, Data), which may also have sources of error. Finally, researchers diagnose the details of their study, from sample size to analysis plan (Answer), by examining the results of the full simulation.

3.4. Involving and Coordinating Volunteers

The selection, training, and coordination of testers are essential to any audit study. When designing this audit, we considered using Facebook and Google’s own advertising targeting systems to recruit volunteers with the required characteristics, as other algorithm audits have done (Matias et al., 2017). We also considered creating software to automatically place ads on behalf of testers, reducing the effort and training required of testers.

Since the study as we designed it required a small number of testers, we did not implement these ideas for recruiting and coordinating participants. We did however design the prompt-generation software to allocate them to testers in correspondence with the study design.

In audit studies of online platform policies, we also considered the risk of companies observing communications between the researchers and among testers. To maintain the integrity of the audit study, we needed the organizations managing the decision-making systems to be unaware of our study. Maintaining secrecy can be difficult when testing platforms that also monitor communications between individuals. To address this risk, we coordinated testers using end-to-end encrypted group chat software from an organization independent from Facebook or Google.222

3.5. Law and Ethics

Researchers who conduct audit studies must always consider the legal and ethical risks for four parties: the testers, people who might encounter the prompts, the parties that oversee the decision-making system, and the researchers (Pager, 2007). In our case, this meant the researchers, the testers posting ads, the people reviewing the ads, members of the public who might see the ads, and the companies whose systems we were auditing.

Researchers face legal risks if the audit could be interpreted as a violation of the law. For example, the lead researchers on this study are employed by a university that holds a 501(c)(3) non-profit status. According to U.S. federal law, 501(c)(3)s are forbidden from participating in electioneering, which includes placing advertisements that advocate for a candidate. In the absence of policies that protect audit studies, we chose to focus on mistaken enforcement and issue ads, omitting tests for too-permissive enforcement of political ad policies.

Testers could also face risks if their behavior was interpreted as illegal or if conducting the audit might introduce irreparable harms to how their user accounts are treated by a company. Avoiding campaign ads protected our testers as well as the researchers.

In audits of ad platforms, members of the public are sometimes shown ads that are part of the audit. These ads might introduce election risks if they influence political beliefs and behaviors. By choosing to test ads that do not engage in campaigning, and by balancing the number of ads that could be mistakenly construed as right or left leaning, we minimized the risk to voters and to elections (Desposato, 2014).

Finally, audit studies create risks for those responsible for decision-making systems and for the people responsible for individual decisions. If our research discovered serious errors in a platform’s attempts to protect elections, it could create reputational and legal risks for those companies–balanced against the benefits of public knowledge to democratic society. More seriously, systematic errors might be wrongly attributed to the individuals making decisions, who are often in precarious employment positions (Roberts, 2014). While we did not take steps to minimize the risk for Facebook and Google as companies, we always add a disclaimer to our findings that they describe error across a system on average rather than evidence on any individual person’s performance.

Careful legal and ethical consideration are especially important with audit studies, which sometimes fall outside of the scope of university ethics boards. Our own institution’s IRB determined decided that our audit it does not fall under the purview of the U.S. Common Rule for research ethics, since in their view it constitutes a contribution to institutional rather than generalizable knowledge. In their view, research about Facebook and Google’s decision-making systems at a moment in time was not generalizable enough for them to review. Despite this decision, we have followed standard practices in academic research ethics to minimize the risk from our research and to protect the privacy of those involved–seeking consent from testers, storing all data securely, anonymizing the data, and minimizing the number of people who were exposed to the ads.

4. Software-Supported Auditing in Practice in the 2018 U.S. Midterm Elections

In 2018, we developed software to support an audit of Google and Facebook’s political advertising policy enforcement systems. Here we present the design and results of the audit, followed by what researchers can learn from this experience about software-supported audit studies.

4.1. Audit Study Research Questions

What kinds of ads do Facebook and Google mistakenly prohibit? Second, what percentage of non-election ads are wrongly prohibited by these companies? By asking these questions, our audit study examines the problem of over-reaches in platform content policy enforcement, a problem that has long been identified as a major risk to civil liberties worldwide (Seltzer, 2010; MacKinnon, 2012).

By investigating the rate at which non-election ads are prohibited, we are studying the decisions of a platform’s enforcement systems on average. Policy enforcement mistakes could result from many factors, including the details of a company’s policies, the quality of training, the behavior of automated filter software, and perhaps differences in the judgment of individual workers. Because audit studies cannot typically distinguish between internal organizational factors, and because we evaluate the system as a whole rather than individuals, these findings should never be interpreted as a reflection upon any individual worker enacting policy for these companies.

4.2. Software-Generated Audit Prompts

To answer these questions, we created software to support an audit of decision-making systems for political advertising policies pertaining to the United States. Our software queried publicly-available APIs to help generate ads that testers published to targeted geographically-targeted audiences on each platform. Our software also simulated a power analysis to determine how many ads needed to be placed for each tester, and it allocated ads to their respective testers. Each observation combined three high-level variables: Ad Type, Leaning, and Location. We summarize the methods of our audit study below.

(a) Left-leaning parks/parades ad
(b) Right-leaning parks/parades ad
(c) Left-leaning product ad
(d) Right-leaning product ad
Figure 1. Previews of Facebook ads that were posted to one of our Facebook pages during this study
(a) Left-leaning product ad
(b) Right-leaning product ad
Figure 2. Previews of Google ads that were posted during this study.

4.2.1. Designing Relevant Ads

We designed three kinds of ads in consultation with multiple U.S. election lawyers: product mistakes, community event mistakes, and government website mistakes. To test erroneous removal of non-election product ads, we chose products with names that included the surnames of political candidates, based on reports that Facebook required authorization by the makers of “Bush’s Beans”, an American food company that shares its name with former United States presidents (Rosenberg, 2018).

To test erroneous removal of ads for community events, we included ads for Veterans Day celebrations after Facebook reportedly removed ads for non-election gatherings of LGBTQ people and websites for U.S. military veterans on the grounds that they were election-related (Mak, 2018; Gale, 2018). In these cases, Facebook mistakenly prevented non-election advertisements from being published, and the company has apologized and changed its decisions after news articles mentioned those decisions.

Finally, to test erroneous removal of ads linking to government websites, we designed ads for national parks after hearing informal reports, later corroborated, that Facebook was preventing some government services from publishing ads about non-partisan public service information (Smith, 2019). We associated a rightward political leaning to Veterans Day ads emphasizing the respect of the military and a leftward political leaning to environment-focused ads encouraging use of public parks.

We summarize our tested variables below:

  • Ad type: Whether a non-political ad could be mistaken for:

    • Candidate support ad

      • Product ads for music albums in which the artist shared the last name of a political candidate for the 2018 midterms.

    • Issue ad

      • National park ads that encouraged people to visit a particular park before it is ”destroyed by climate change”.

      • Veterans Day parade ads that encouraged people to visit a particular parade, ”respect the military”, and to ”remember the troops”.

  • Leaning: whether the ad could be mistaken for left or right leaning content, or for supporting Republican or Democrat candidates

  • Location: the targeted location of the ad, based on a specific election that the ad could be mistaken for:

    • A state (governor) or federal (house) election

    • Geographically-targeted advertising toward regions voting in that election, either a state (governor) or voting district (house)

4.2.2. Ad Generation

Once we designed types of ads to place and determined how many ads should be placed, we wrote software to help generate ads for Facebook and Google. Figure 1 and Figure 2 show examples of our ads.

To generate non-election product ads, our software scraped data about 2018 midterm elections from the Federal Elections Commission. The FEC API provided up-to-date details on every candidate running for federal office, which party they were from, and which voting districts are voting on which positions. Our software then queried the BestBuy product API for music albums that shared a surname with a candidate in that election.333 For the image of each ad, our software scraped the image of the first music album that was returned for a query with a candidate’s surname. For the body text of each ad, our software used a candidate’s surname and the physical format of the respective music album to generate text that read ”Check out this ¡Album format¿ from the one and only ¡Candidate surname¿!” For the header text of each ad, our software used a candidate’s surname and the physical format of the respective music album to generate text that read ”[State]: Calling all music lovers.”

To generate ads for government websites, our software scraped the list of U.S. National Parks, the park location in relation to voting districts, and the website for each park from an API provided by the National Park Service.444 For the image of each ad, our software scraped the image returned by the API for the national park. For the body text of each ad, our software used the name of a national park to generate text that read ”Visit the ¡National park¿ before it’s destroyed by climate change!” For the header text of each ad, we wrote ”Don’t forget about nature.”

Finally, to generate ads for community events, our software scraped a list of local observances of Veterans Day–an non-partisan national holiday held on November 11th, 3 days after election day–from At the time we scraped the data, the website listed the details of Veterans Day parades across the United States. For the image of each ad, we manually accessed images from a collection of Wikimedia Commons images of the U.S. flag. Our prompts matched the location of these celebrations with voting districts in the test. For the body text of each ad, our software used the name of a Veterans Day parade to generate text that read ”Don’t forget about our troops. Visit the ¡Parade name¿.” For the header text of each ad, our software used the respective state that a parade was planned for to generate text that read ”[State]: Respect our military in November.”

4.2.3. Ad Placement

To place our ads, we recruited 7 testers who are U.S. citizens and who have characteristics that we thought might influence the chance of an advertisement to receive enforcement. We allocated our advertisements to testers by sampling each combination of ad type, investigator type, and location. Where the platform required ads to be associated with a group, channel, or page, investigators created a separate page for each ad type.

The two types of testers were:

  • US: U.S. citizens with an EN-US browser locales and U.S. IP address locations using U.S. Dollars to place ads

  • Non-US: U.S. citizen with a non-US browser locale and non-US IP address, using a non-US bank and non-US currency (CAD, GBP) to place ads

Each tester created an advertiser account on Google and Facebook, as well as a Facebook page for every ad type that they tested. Pages offered geographically-specific content related to the topic of the ads that were associated with them. Testers posted Adwords ads to Google on auto-generated search terms relevant to a given ad. Testers attempted to publish each ad for a period of 48 hours at a budget of 1 unit of currency per day (US, CAD, GBP). Testers then recorded whether the ad was published or prevented from being published by the platform, citing their political advertising policies.

Since none of the prohibited ads were published by the platforms and since none of the published ads were publicly labeled by platforms as election-related, none of these ads appeared in platform transparency reports. Although we had intended testers to seek authorization and record how platforms reported the identity of advertisers, we realized during the audit that appearing in platform transparency reports would de-anonymize our testers and represent an undue burden.

4.2.4. Software-supported Research Design Diagnosis

To refine the final study design, we wrote power analysis software that implements the MIDA process for design diagnosis of audit studies. First, we developed inputs on the minimum observable difference and bias we wanted to be able to observe for different combinations of personas and ads. Using that information, the software simulated thousands of possible audit studies to estimate the number of prompts and volunteers that would be needed. For each possible sample size and data collection plan we considered, this configurable software simulated estimates and error bars for multi-variate comparisons on each of our tested variables–ad type, political leaning, targeting location of the ad, and the location of the testers (U.S. or non-U.S.) (Appendix Figures 1, 2, and 3).

Based on the results of this research diagnosis, we created 20 ads for each combination of a platform, investigator location, election location, political leaning, and ad type.

4.2.5. Analysis Plan

We pre-registered the analysis and results-generating code at the Open Science Framework before collecting any data.555 All of our code, data, and training materials are publicly available on GitHub.666 We have published these materials to be completely transparent about how we carried out this study.

4.3. Audit Study Results

From 2018-09-17 through 2018-10-10, our team of 7 posted a total of 477 ads to Facebook and Google.777One intended Facebook ad was not found in our final records and may not have been posted. We have removed this observation from the analysis We observed whether the ad was prevented from being published by the platform for allegedly violating policies about election advertising.888One product ad was blocked by Facebook because the platform judged that the cover image included too much skin. We manually chose a different image and made another attempt, which was published by the platform. We did not count this as an ad blocked for its relation to the election.

Google did not prohibit any of the 239 ads that we posted to their platform. Facebook however prevented 10 out of 238 ads from publication, citing their election policies, a total of 4.2% of the ads we placed.999To confirm that these ads were genuinely permissible, we submitted two of the ads to Facebook’s appeals process, and the company reversed their decision for both. Parks and parade ads were 9 of the ads that Facebook prohibited, and 1 of them was a product ad. Among ads that Facebook prevented from publishing, 3 could have been mistaken for being right leaning or for Republican candidates, and 7 might have been mistaken for being left leaning or for Democrat candidates.

Our main analysis estimates the rate at which a certain type of advertisement is permitted. To do so, we computed the groupwise means and confidence intervals for the chance of an ad of a certain kind posted by a certain kind of person to be permitted by a platform. Groupwise means and confidence intervals are generated using the Wilson estimation method for confidence intervals (the binom.confint function in the R library

binom). At small sample sizes, a small increase in the number of ads placed may lead to large differences in the calculated 95% confidence intervals. The Wilson estimation method minimizes variation in confidence intervals between small differences in the sample size at smaller samples (Brown et al., 2001). Groupwise means and confidence intervals for each tested characteristic are available in Figure 3 and Table 2.

We also conducted exploratory logistic regression models within the ads posted to Facebook. We found that Facebook permitted 89% park and holiday advertisements compared to 99% of product ads, a statistically-significant difference of 10 percentage points (p=0.005) (Figure

4, Table 1 “Ad Type”). We also tested hypotheses about differences in leaning, ad location, and the location of the ad-poster. In each case, we did not observe a statistically-significant result, though it is possible that with a larger sample size, we may have done so (Table 1).

Figure 3. Estimated chance of publication for a given ad combination (election, political leaning). 477 ad placements were attempted by 7 people from 2018-09-17 to 2018-10-10. Product ads are music albums that share the artist’s last name with a candidate. Veterans Day & National Park ads are about events and places that could be mistaken by platform policy enforcers as election-related ads of national importance. 95% confidence intervals use the Wilson method. Code & data: anonymized.
Figure 4. Estimated Facebook publication rate for non-election advertisements comparing Veterans Day & National Park ads (issue mistake) to Product ads (candidate mistake). 238 ad placements on Facebook were attempted by 6 people from 2018-09-17 to 2018-10-10. Product ads are music albums that share the artist’s last name with a gubernatorial candidate. Veterans Day & National Park ads are about events & places that could be mistaken by platform policy enforcers as election-related ads of national importance. Results from a logistic regression (p=0.005). Code & data: anonymized.
Ad Type Leaning Location Ad Poster
Ad Type (Park & Parade)
Leaning (Republican)
Location (State)
Poster Location (US)
AIC 72.62 85.25 83.99 83.06
BIC 79.56 92.20 90.93 90.00
Log Likelihood -34.31 -40.63 -39.99 -39.53
Deviance 68.62 81.25 79.99 79.06
Num. obs. 238 238 238 238
, ,
Table 1. Logistic regression models testing univariate differences in publication rates based on ad type (Park & Parade vs Product), Leaning (Republican vs Democrat) Location, (State vs Federal), and Ad Poster Location (US vs non-US)
platform ad poster location leaning ad type # published
Facebook US federal Democrat candidate.mistake 20 100.0%
Facebook US federal Republican candidate.mistake 19 94.7%
Facebook US state Democrat candidate.mistake 20 100.0%
Facebook US state Democrat issue.mistake 20 75.0%
Facebook US state Republican candidate.mistake 20 100.0%
Facebook US state Republican issue.mistake 21 90.5%
Facebook Non-US federal Democrat candidate.mistake 20 100.0%
Facebook Non-US federal Republican candidate.mistake 19 100.0%
Facebook Non-US state Democrat candidate.mistake 19 100.0%
Facebook Non-US state Democrat issue.mistake 20 90.0%
Facebook Non-US state Republican candidate.mistake 20 100.0%
Facebook Non-US state Republican issue.mistake 20 100.0%
Google US federal Democrat candidate.mistake 20 100.0%
Google US federal Republican candidate.mistake 19 100.0%
Google US state Democrat candidate.mistake 20 100.0%
Google US state Democrat issue.mistake 20 100.0%
Google US state Republican candidate.mistake 20 100.0%
Google US state Republican issue.mistake 21 100.0%
Google Non-US federal Democrat candidate.mistake 20 100.0%
Google Non-US federal Republican candidate.mistake 20 100.0%
Google Non-US state Democrat candidate.mistake 19 100.0%
Google Non-US state Democrat issue.mistake 20 100.0%
Google Non-US state Republican candidate.mistake 20 100.0%
Google Non-US state Republican issue.mistake 20 100.0%
Table 2. Number of ads placed and percentage of ads published, for each combination of platform, investigator location, election location, leaning, and type of advertisement.

Taken together, our audit demonstrated that Facebook was systematically prohibiting non-election material of importance to American civic life, including public holidays and government websites. Further reports and journalism corroborated our findings, showing that prohibitions on government advertising affected housing and urban development departments (Smith, 2019). When scaled across an entire society these errors could have significant impacts on civic life and people’s access to government services. While we cannot confirm the role of our research in corporate policy changes, Facebook altered its policies to accommodate government advertising in 2019 (Ortutay and Writer, 2019). As is common with social scientific audit studies (Quillian et al., 2017), any evaluation of Facebook’s changes to its decision-making system would require further audits.

4.4. Limitations

All enforcement systems make mistakes. Our audit study shows the rates at which Facebook and Google prohibited non-election ads under their political advertising policies during the 2018 U.S. midterm elections.

Overall, our study discovers evidence of systematic mistakes (false positives) by Facebook, who prevented the publication of ads that were acceptable within the company’s own policies. Facebook regularly blocked ads for national holidays, government national park websites, and non-partisan products that happen to share common characters with a candidate name. This false positive rate of 4.2% is not representative of all advertisements posted to Facebook, but it does represent an important segment of online advertising in the United States.

This audit study found no evidence of political ad policy enforcement by Google. This audit focuses on decisions that are too restrictive and, for legal reasons, does not study decisions that are too permissive. For that reason, we cannot provide guidance on whether our finding was due to a general lack of political policy enforcement by the company, whether the company has narrower internal policies, or whether Google was more accurate at enforcing its policies than Facebook in the cases we examined.

This study has several limitations. First, we have no information about either company’s enforcement of ads that genuinely violate their policies. Second, our findings only generalize to the kinds of ads we tested: it is possible that companies made more mistakes with other kinds of ads that have received attention in the press, such as news articles and LGBTQ gatherings. Third, failures to find differences in publication rates should not be interpreted as proof of no difference; a larger study may have detected differences more clearly. Fourth, in this study, we offered 1 unit of currency (such as one US or Canadian dollar) per day per ad. If platforms offered greater scrutiny to ads that involve more money, it is possible that the rate of mistakes might be higher or lower in those cases. Finally, since platforms frequently change policies, internal guidelines, and training procedures with little public notice, these findings are most strongly informative about the period we studied.

We note that our audit study was not completely automated. Our software enabled us scrape images and text for certain ad types, generate ads based on templates, conduct a power analysis, and allocate ads to testers. Human testers were essential to this audit study, since we wanted to rule out platform policies about inauthentic activity. In other types of audits, researchers might be more able to use of platform-provided APIs to automatically place ads, enabling them to achieve greater dimensional complexity in their audit studies.

5. Discussion

In this paper, we have described common design challenges that constrain the validity of audit studies, field experiments that contribute pragmatic and scientific knowledge about decision-making system errors. We have described areas where software could guide the design of these studies, broaden their dimensional complexity, and make them more efficient. To explore and validate these ideas, we prototyped software to support an audit of Google and Facebook’s political ad policies during the 2018 U.S. midterm election. Using a study design informed by our software and using prompts auto-generated by that software, we were able to observe systematic errors in Facebook’s human and machine decision-making about political ads. We also learned wider lessons for designing software to support audit studies.

The trade-off between validity and dimensional complexity is a central design challenge for any audit study. In this project, we identified several areas where software can broaden the dimensional complexity of an audit and guide decisions about those trade-offs. By auto-generating prompts with data from APIs, researchers can reduce the labor of creating them manually while ensuring consistency. While our system to allocate prompts to testers was simple, we expect that software for recruiting and coordinating testers could also substantially expand the potential complexity of audit studies. We also found that software simulations can guide researcher decisions about sample sizes and the validity-complexity trade-off.

Despite these gains in complexity, some constraints cannot be addressed with software. The first is budget. Since our audit involved publishing ads, the financial cost of our audit study was a linear function of our sample size, determined by the statistical power we needed. Second, we limited our prompts to kinds of content that could be generated from publicly-available data. The potential for automating audit materials for a given system will be related to the availability of such data sources. Finally, for audits that involve active tester involvement, training and ongoing coordination are further limits on the complexity of any audit study.

Some of these constraints might be overcome through future work in automated audit studies. For example, software that acts on behalf of testers to carry out audit procedures may be able to reduce the complexities of training and coordinating testers– without reducing the realism of the audit. Future audits could reduce sample sizes through sequential testing algorithms that incorporate information from each new observation into models that auto-allocate new testers to post new prompts. Even without sequential testing, audit study researchers could use targeted advertising and matching algorithms to recruit and allocate audit prompts across multi-dimensional samples of participants.

Audit studies contribute valuable knowledge to science and society. In democracies, the public knowledge from audit studies is valuable for holding power accountable and correcting errors. By supporting the audit process with software, researchers can broaden the validity and value of this valuable research method.

6. Acknowledgments

We are grateful to Molly Sauter, who provided logistical support to this study, and to Jon Penney, who provided helpful advice and feedback. This research project was supported financially by the Princeton University Center for Information Technology Policy.


  • (1)
  • Adler et al. (2018) Philip Adler, Casey Falk, Sorelle A. Friedler, Tionney Nix, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. 2018. Auditing black-box models for indirect influence. Knowledge and Information Systems 54, 1 (2018), 95–122. Publisher: Springer.
  • Alexander (2010) Michelle Alexander. 2010. The new Jim Crow: Mass incarceration in the age of colorblindness. The New Press.
  • Ali et al. (2019a) Muhammad Ali, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. 2019a. Discrimination through Optimization: How Facebook’s Ad Delivery Can Lead to Biased Outcomes. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–30.
  • Ali et al. (2019b) Muhammad Ali, Piotr Sapiezynski, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. 2019b. Ad Delivery Algorithms: The Hidden Arbiters of Political Messaging. arXiv preprint arXiv:1912.04255 (2019).
  • Angwin et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. ProPublica, May 23 (2016), 2016.
  • Bandiera et al. (2011) Oriana Bandiera, Iwan Barankay, and Imran Rasul. 2011. Field experiments with firms. Journal of Economic Perspectives 25, 3 (2011), 63–82.
  • Barabas (2015) Chelsea Chelsea Marie Barabas. 2015. Engineering the American dream: a study of bias and perceptions of merit in the high-tech labor market. Master’s thesis. Massachusetts Institute of Technology.
  • Bertrand and Mullainathan (2004) Marianne Bertrand and Sendhil Mullainathan. 2004. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American economic review 94, 4 (2004), 991–1013.
  • Blair et al. (2019) Graeme Blair, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. Declaring and diagnosing research designs. American Political Science Review 113, 3 (2019), 838–859. Publisher: Cambridge University Press.
  • Bowles and Frenkel (2018) Nellie Bowles and Sheera Frenkel. 2018. Facebook and Twitter Plan New Ways to Regulate Political Ads - The New York Times. New York Times (May 2018).
  • Brown et al. (2001) Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. 2001. Interval Estimation for a Binomial Proportion. Statist. Sci. 16, 2 (May 2001), 101–133.
  • Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency. 77–91.
  • Burnett and Feamster (2015) Sam Burnett and Nick Feamster. 2015. Encore: Lightweight measurement of web censorship with cross-origin requests. In ACM SIGCOMM Computer Communication Review, Vol. 45. ACM, 653–667.
  • C-Span (2018) C-Span. 2018. Senator Al Franken Questions Facebook V.P. About Political Ads Purchased with Foreign Currency. Accessed 12 March 2018.
  • Card (2018) Catherine Card. 2018. How Facebook AI Helps Suicide Prevention | Facebook Newsroom.
  • Chen et al. (2019) Irene Y. Chen, Peter Szolovits, and Marzyeh Ghassemi. 2019. Can AI Help Reduce Disparities in General Medical and Mental Health Care? AMA journal of ethics 21, 2 (2019), 167–179.
  • Chen and Wilson (2017) Le Chen and Christo Wilson. 2017. Observing algorithmic marketplaces in-the-wild. ACM SIGecom Exchanges 15, 2 (2017), 34–39.
  • Chouldechova et al. (2018) Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency. 134–148.
  • Cialdini (1980) Robert B. Cialdini. 1980. Full-cycle social psychology. Applied social psychology annual (1980).
  • Constine (2017) John Constine. 2017. Facebook will hire 1,000 and make ads visible to fight election interference. TechCrunch (2017).
  • Desposato (2014) Scott Desposato. 2014. Ethical Challenges and Some Solutions for Field Experiments. San Diego: University of California. Available at www. desposato. org/ethicsfieldexperiments. pdf (2014).
  • Diakopoulos (2014) Nicholas Diakopoulos. 2014. Algorithmic accountability reporting: On the investigation of black boxes. (2014).
  • Edelson et al. (2020) Laura Edelson, Tobias Lauinger, and Damon McCoy. 2020. A Security Analysis of the Facebook Ad Library. In 2020 IEEE Symposium on Security and Privacy (SP).
  • Facebook (2018) Facebook. 2018. Hard Questions: Russian Ads Delivered to Congress. Accessed 05 March 2018.
  • Filasto and Appelbaum (2012) Arturo Filasto and Jacob Appelbaum. 2012. OONI: Open Observatory of Network Interference.. In FOCI.
  • Gale (2018) David Gale. 2018. Facebook’s Problem With Veterans. Wall Street Journal (Oct. 2018).
  • Geiger and Ribes (2011) R. Stuart Geiger and David Ribes. 2011. Trace ethnography: Following coordination through documentary practices. In 2011 44th hawaii international conference on system sciences. IEEE, 1–10.
  • Gillespie (2018) Tarleton Gillespie. 2018. Custodians of the Internet: Platforms, content moderation, and the hidden decisions that shape social media. Yale University Press.
  • Glazer and Haggin (2019) Emily Glazer and Patience Haggin. 2019. Google’s Tool to Tame Election Influence Has Flaws. Wall Street Journal (2019).
  • Hakken (1979) Jon Hakken. 1979. Discrimination against Chicanos in the Dallas rental housing market: An experimental extension of the housing market practices survey. Division of Evaluation, US Dept. of Housing and Urban Development, Office of ….
  • Harding (1991) Sandra Harding. 1991. Whose science? Whose knowledge?: Thinking from women’s lives. Cornell University Press.
  • Hayward (1999) Allison R. Hayward. 1999. When Does an Advertisement about Issues become an Issues Ad. Cath. UL Rev. 49 (1999), 63.
  • Hill (2017) The Hill. 2017. Franken Blasts Facebook For Accepting Rubles For U.S. Election Ads. Accessed 30 April 2018.
  • Isaac and Wakabayashi (2017) Mike Isaac and Daisuke Wakabayashi. 2017. Russian Influence Reached 126 Million Through Facebook Alone. The New York Times (Oct. 2017).
  • Kahneman and Tversky (1972) Daniel Kahneman and Amos Tversky. 1972.

    Subjective probability: A judgment of representativeness.

    Cognitive psychology 3, 3 (1972), 430–454.
  • Kearns et al. (2018) Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning. 2564–2572.
  • King et al. (2014) Gary King, Jennifer Pan, and Margaret E. Roberts. 2014. Reverse-engineering censorship in China: Randomized experimentation and participant observation. Science 345, 6199 (2014), 1251722.
  • Kitchin (2017) Rob Kitchin. 2017. Thinking critically about and researching algorithms. Information, Communication & Society 20, 1 (2017), 14–29.
  • Leathern (2018) Rob Leathern. 2018. Shining a Light on Ads With Political Content | Facebook Newsroom.
  • MacKinnon (2012) Rebecca MacKinnon. 2012. Consent of the networked: The worldwide struggle for Internet freedom. Vol. 50. Basic Books.
  • Maclay (2010) Colin Maclay. 2010. Protecting Privacy and Expression Online. Access Controlled: The Shaping of Power, Rights, and Rules in Cyberspace (2010), 87–108.
  • Madrigal (2018) Alexis Madrigal. 2018. Will Facebook’s New Ad Transparency Protect Democracy? The Atlantic (May 2018).
  • Mak (2018) Aaron Mak. 2018. Facebook Thought an Ad From Bush’s Baked Beans Was “Political” and Removed It. Slate Magazine (May 2018).
  • Matias and Mou (2018) J. Nathan Matias and Merry Mou. 2018. CivilServant: Community-led experiments in platform governance. In Proceedings of the 2018 CHI conference on human factors in computing systems. 1–13.
  • Matias et al. (2017) J. Nathan Matias, Megan Steiner, and Aimee Rickman. 2017. Who Gets to Use Facebook’s Rainbow ’Pride’ Reaction?
  • Merrill and Tobin (2018) Jeremy B. Merrill and Ariana Tobin. 2018. Facebook’s Screening for Political Ads Nabs News Sites Instead of Politicians. ProPublica (2018).
  • Narayanan and Reisman (2017) Arvind Narayanan and Dillon Reisman. 2017. The princeton web transparency and accountability project. In Transparent data mining for big and small data. Springer, 45–67.
  • Obermeyer et al. (2019) Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 6464 (2019), 447–453. Publisher: American Association for the Advancement of Science.
  • Ortutay and Writer (2019) Barbara Ortutay and Ap Technology Writer. 2019. Facebook tightens political ad rules, but leaves loopholes.
  • Pager (2003) Devah Pager. 2003. The mark of a criminal record. American journal of sociology 108, 5 (2003), 937–975.
  • Pager (2007) Devah Pager. 2007. The use of field experiments for studies of employment discrimination: Contributions, critiques, and directions for the future. The Annals of the American Academy of Political and Social Science 609, 1 (2007), 104–133.
  • Politico (2018) Politico. 2018. Full Mueller indictment on Russian election case. Accessed 05 March 2018.
  • Popper (1947) Karl Popper. 1947. The open society and its enemies. Routledge.
  • Quillian et al. (2017) Lincoln Quillian, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. 2017. Meta-analysis of field experiments shows no change in racial discrimination in hiring over time. Proceedings of the National Academy of Sciences 114, 41 (2017), 10870–10875.
  • Rahwan et al. (2019) Iyad Rahwan, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W. Crandall, Nicholas A. Christakis, Iain D. Couzin, Matthew O. Jackson, Nicholas R. Jennings, Ece Kamar, Isabel M. Kloumann, Hugo Larochelle, David Lazer, Richard McElreath, Alan Mislove, David C. Parkes, Alex ‘Sandy’ Pentland, Margaret E. Roberts, Azim Shariff, Joshua B. Tenenbaum, and Michael Wellman. 2019. Machine behaviour. Nature 568, 7753 (April 2019), 477–486.
  • Raji et al. (2020) Inioluwa Deborah Raji, Andrew Smart, Rebecca N White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 33–44.
  • Reardon (2018) Reardon. 2018. Politicians who want to run ads on Google will have new rules to follow. CNET (May 2018).
  • Ribeiro et al. (2019) Filipe N Ribeiro, Koustuv Saha, Mahmoudreza Babaei, Lucas Henrique, Johnnatan Messias, Fabricio Benevenuto, Oana Goga, Krishna P Gummadi, and Elissa M Redmiles. 2019. On microtargeting socially divisive ads: A case study of russia-linked ad campaigns on facebook. In Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM, 140–149.
  • Roberts (2014) Sarah T. Roberts. 2014. Behind the screen: The hidden digital labor of commercial content moderation. PhD Thesis. University of Illinois at Urbana-Champaign.
  • Rosenberg (2018) Eli Rosenberg. 2018. Facebook blocked many gay-themed ads as part of its new advertising policy, angering LGBT groups. Washington Post (Oct. 2018).
  • Salganik and Watts (2009) Matthew J. Salganik and Duncan J. Watts. 2009. Web-based experiments for the study of collective social dynamics in cultural markets. Topics in cognitive science 1, 3 (2009), 439–468.
  • Sandberg (2019) Sheryl Sandberg. 2019. A Second Update on Our Civil Rights Audit | Facebook Newsroom.
  • Sandvig et al. (2014) Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort. 2014. Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry 22 (2014).
  • Seltzer (2010) Wendy Seltzer. 2010. Free speech unmoored in copyright’s safe harbor: Chilling effects of the DMCA on the first amendment. Harv. JL & Tech. 24 (2010), 171. Publisher: HeinOnline.
  • Sen and Wasow (2016) Maya Sen and Omar Wasow. 2016. Race as a bundle of sticks: Designs that estimate effects of seemingly immutable characteristics. Annual Review of Political Science 19 (2016).
  • Shadish et al. (2002) William R. Shadish, Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. (2002).
  • Silva et al. (2020) Márcio Silva, Lucas Santos de Oliveira, Athanasios Andreou, Pedro Olmo Vaz de Melo, Oana Goga, and Fabrício Benevenuto. 2020. Facebook Ads Monitor: An Independent Auditing System for Political Ads on Facebook. In Proceedings of The Web Conference 2020. 224–234.
  • Smith (2018) Michee Smith. 2018. Introducing a new transparency report for political ads. Google (2018).
  • Smith (2019) Sarah Smith. 2019. Facebook denies Houston’s ads promoting fair housing over race, religion references. Houston Chronicle (May 2019).
  • Tschantz et al. (2015) Michael Carl Tschantz, Amit Datta, Anupam Datta, and Jeannette M. Wing. 2015. A methodology for information flow experiments. In 2015 IEEE 28th Computer Security Foundations Symposium. IEEE, 554–568.
  • Valentino-DeVries (2018) Jennifer Valentino-DeVries. 2018. I Approved This Facebook Message — But You Don’t Know That. ProPublica (Feb. 2018).
  • Wakabayashi (2018) Daisuke Wakabayashi. 2018. Google Will Ask Buyers of U.S. Election Ads to Prove Identities - The New York Times. The New York Times (May 2018).
  • Weiss (1979) Carol H. Weiss. 1979. The many meanings of research utilization. Public administration review 39, 5 (1979), 426–431. Publisher: JSTOR.
  • Whittaker (2019) Zach Whittaker. 2019. U.S. has denied entry to some because of others’ social media. TechCrunch (Aug. 2019).
  • Wienk (1979) Ronald E. Wienk. 1979. Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey. (1979).