ALPHA: Audit that Learns from Previously Hand-Audited Ballots

01/07/2022
by Philip B. Stark, University of California, Berkeley

BRAVO, the most widely tried method for risk-limiting election audits (RLAs), cannot accommodate sampling without replacement or stratified sampling, which can improve efficiency and may be required by law. It applies only to ballot-polling audits, which are less efficient than comparison audits. It applies to plurality, majority, super-majority, proportional representation, and ranked-choice voting contests, but not to many social choice functions for which there are RLA methods, such as approval voting, STAR-Voting, Borda count, and general scoring rules. And while BRAVO has the smallest expected sample size among sequentially valid ballot-polling-with-replacement methods when reported vote shares are exactly right, it can require arbitrarily large samples when the reported winner(s) really won but the reported vote shares are wrong. ALPHA is a simple generalization of BRAVO that (i) works for sampling with and without replacement and for Bernoulli sampling; (ii) increases power for stratified audits by avoiding the need to use a P-value combining function or to maximize P-values over nuisance parameters within strata, and by allowing adaptive sampling across strata; (iii) works not only for ballot-polling but also for ballot-level comparison, batch-polling, and batch-level comparison audits, sampling with or without replacement, uniformly or with weights proportional to size; (iv) works for all social choice functions covered by SHANGRLA; and (v) in situations where both ALPHA and BRAVO apply, requires smaller samples than BRAVO when the reported vote shares are wrong but the outcome is correct: five orders of magnitude smaller in some examples. ALPHA includes the family of betting martingale tests in RiLACS, with a different betting strategy parametrized as an estimator of the population mean and with explicit flexibility to accommodate sampling weights and population bounds that vary by draw.


1 Introduction

A risk-limiting audit (RLA) is a procedure that has a known minimum probability of correcting the reported outcome of an election contest if the reported outcome is wrong. The risk limit of an RLA is the maximum chance that the RLA will not correct the electoral outcome, if the outcome is wrong. The outcome means the political outcome (who or what won), not the numerical vote tallies, which are practically impossible to get exactly right. An RLA requires a trustworthy record of the validly cast votes: a manual count of those records is the recourse to correct wrong outcomes. (Generally, the record is a complete set of validly cast hand-marked paper ballot cards that has been kept demonstrably secure. Machine-marked ballot cards cannot be considered a trustworthy record of voter intent. See Appel, DeMillo and Stark (2020); Appel and Stark (2020); Stark and Wagner (2012).) Establishing whether the record of votes is trustworthy prior to conducting a risk-limiting audit is generically called a compliance audit (Stark and Wagner, 2012; Appel and Stark, 2020). RLAs are recommended by the National Academies of Sciences, Engineering, and Medicine (National Academies of Sciences, Engineering, and Medicine, 2018), the American Statistical Association (American Statistical Association, 2010), and other groups concerned with election integrity. They are in law in about half a dozen U.S. states and have been piloted in roughly a dozen U.S. states and in Denmark.

BRAVO (Lindeman, Stark and Yates, 2012) is a particularly simple method to conduct an RLA of plurality and supermajority contests. It relies on sampling ballot cards uniformly at random with replacement from all ballot cards validly cast in the contest. (In general, a ballot comprises one or more ballot cards, each of which contains some of the contests a given voter is eligible to vote in. Many countries and some U.S. states have one-card ballots, but many U.S. states routinely have ballots that comprise two or more ballot cards.) Stark and Teague (2014) showed how BRAVO can be used to audit proportional representation schemes such as D'Hondt. Blom, Stuckey and Teague (2018) showed how BRAVO can be used to audit ranked-choice voting as well. BRAVO is based on Wald's (Wald, 1945) sequential probability ratio test (SPRT) of the simple hypothesis $\theta = \mu$ against a simple alternative $\theta = \eta$ from IID Bernoulli($\theta$) observations. (A Bernoulli($\theta$) random variable takes the value 1 with probability $\theta$ and the value 0 with probability $1 - \theta$; its expected value is $\theta$.) Because it requires IID observations, BRAVO is limited to ballot-polling audits and to using samples drawn with replacement, both of which limit efficiency and applicability. (A ballot-polling audit involves manually interpreting randomly selected ballot cards, but does not use the voting system's interpretation of individual ballot cards or groups of ballot cards, other than the reported results. As discussed below, comparison audits, which compare the voting system's interpretation of ballot cards to manual interpretations of the same cards, can be more efficient.)

Auditing a plurality contest with BRAVO involves using the SPRT to test a number of hypotheses: for each reported winner $w$ and each reported loser $\ell$, let $\pi_{w\ell}$ be the conditional probability that a ballot selected at random with replacement from all ballot cards validly cast in the contest shows a valid vote for $w$, given that it shows a valid vote for $w$ or for $\ell$, and let $p_{w\ell}$ be the number of votes reported for $w$, divided by the total votes reported for $w$ and $\ell$ combined. BRAVO tests the hypotheses $\pi_{w\ell} = 1/2$ against the alternatives $\pi_{w\ell} = p_{w\ell}$ for every (reported winner, reported loser) pair.

BRAVO for a supermajority contest can be simpler or more involved than for a plurality contest. Suppose that the contest requires a candidate to receive at least a fraction $f$ of the valid votes to be a winner. (We allow the possibility $f \le 1/2$, in which case "supermajority" is a misnomer and there can be more than one winner.) Suppose candidate $w$ is reported to be a winner. Let $\pi_w$ denote the conditional probability that a ballot selected at random from all ballot cards validly cast in the contest shows a valid vote for $w$, given that it shows a valid vote for any candidate in the contest, and let $p_w$ be the number of votes reported for $w$, divided by the total valid votes reported in the contest. BRAVO uses the SPRT to test the hypothesis $\pi_w = f$ against the alternative $\pi_w = p_w$ for each reported winner. If $f > 1/2$, there can be only one reported winner. If $f \le 1/2$, there can be more than one, in which case that hypothesis needs to be tested for all candidates (not just the reported winners), to confirm that (only) the reported winner(s) won. If it is reported that no candidate received at least a fraction $f$ of the valid votes, BRAVO tests, for each candidate, the hypothesis that the candidate received at least a fraction $f$ of the valid votes against the alternative that the candidate received their reported share, to confirm that none received $f$ or more.

Consider $n$ independent, identically distributed (IID) draws $X_1, \ldots, X_n$ from a population of 0s and 1s (the population is binary). Let $\theta$ be the population fraction of 1s. Let $X_j$ be the value selected on the $j$th draw. Then $\Pr(X_j = 1) = \theta$ and $\Pr(X_j = 0) = 1 - \theta$. By independence, the probability of a sequence $(x_1, \ldots, x_n)$ is the product of the probabilities of the terms, which can be written

$$\Pr(X_1 = x_1, \ldots, X_n = x_n) = \prod_{j=1}^{n} \theta^{x_j} (1 - \theta)^{1 - x_j}. \qquad (1)$$

The ratio of the probability of the sequence if $\theta = \eta$ to the probability if $\theta = \mu$ is

$$\mathrm{LR}_n := \prod_{j=1}^{n} \left( \frac{\eta}{\mu} \right)^{x_j} \left( \frac{1 - \eta}{1 - \mu} \right)^{1 - x_j}. \qquad (2)$$

Wald's SPRT rejects the hypothesis that $\theta = \mu$ in favor of the alternative $\theta = \eta$ at significance level $\alpha$ if $\mathrm{LR}_j \ge 1/\alpha$ for any $j$. That is, $\Pr_\mu(\exists j : \mathrm{LR}_j \ge 1/\alpha) \le \alpha$: the SPRT is a sequentially valid test. Moreover, $P_j := \min(1, 1/\max_{i \le j} \mathrm{LR}_i)$ is an anytime $P$-value for the hypothesis $\theta = \mu$. That is, for any $p \in (0, 1)$ and every $j$, $\Pr_\mu(P_j \le p) \le p$.

The SPRT is quite general; this is perhaps the simplest example. Wald's proof that the SPRT is sequentially valid is complicated, but Ville's inequality (Ville, 1939) yields a simple proof. Given a sequence of random variables $X_1, X_2, \ldots$, let $X^j$ denote the finite sequence $(X_1, \ldots, X_j)$. A sequence of absolutely integrable random variables $(Z_j)$ is a martingale with respect to a sequence of random variables $(X_j)$ if $E(Z_j \mid X^{j-1}) = Z_{j-1}$. It is a supermartingale if $E(Z_j \mid X^{j-1}) \le Z_{j-1}$. The expected value of every term of a martingale is the same. A (super)martingale is nonnegative if $\Pr(Z_j \ge 0) = 1$ for all $j$. Ville's inequality is a version of Markov's inequality for supermartingales: the chance that a nonnegative supermartingale ever exceeds any multiple of its expected value is at most the reciprocal of that multiple. That is, if $(Z_j)$ is a nonnegative supermartingale with respect to $(X_j)$ with $E(Z_1) = a > 0$, then $\Pr(\sup_j Z_j \ge k a) \le 1/k$.

The Bernoulli likelihood ratio $(\mathrm{LR}_j)$ is a martingale with respect to $(X_j)$ if $\theta = \mu$:

$$E(\mathrm{LR}_j \mid X^{j-1}) = \mathrm{LR}_{j-1} \left[ \mu \cdot \frac{\eta}{\mu} + (1 - \mu) \cdot \frac{1 - \eta}{1 - \mu} \right] = \mathrm{LR}_{j-1}. \qquad (3)$$

Because $E(\mathrm{LR}_1) = 1$, Ville's inequality implies that $\Pr_\mu(\exists j : \mathrm{LR}_j \ge 1/\alpha) \le \alpha$. (More generally, sequences of likelihood ratios are nonnegative martingales with respect to the distribution in the denominator.)

Wald (1945) proved that among all sequentially valid tests of the hypothesis $\theta = \mu$, the SPRT with alternative $\eta$ has the smallest expected sample size to reject when in fact $\theta = \eta$. But when the true $\theta$ is well below $\eta$ (even though $\theta > \mu$, so the null is false), the SPRT can fail to reject the null, continuing to sample forever, and when $\theta$ is only moderately below $\eta$, it can be very inefficient. As a result, when reported vote shares are incorrect but the reported winner(s) really won, BRAVO can require enormous samples, even when the true margin is large.
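To make the sequential test concrete, here is a minimal sketch (not from the BRAVO paper) of the Bernoulli SPRT as a running likelihood ratio with an anytime $P$-value; the function name and parameter values are illustrative.

```python
import numpy as np

def sprt_anytime_pvalue(x, mu=0.5, eta=0.55):
    """Bernoulli SPRT: running likelihood ratio of eta vs. mu for 0/1 data x,
    and the anytime P-value min(1, 1/max_j LR_j), valid at every sample size."""
    x = np.asarray(x)
    # per-draw factors: eta/mu for a 1, (1-eta)/(1-mu) for a 0
    factors = np.where(x == 1, eta / mu, (1 - eta) / (1 - mu))
    lr = np.cumprod(factors)
    p = np.minimum(1.0, 1.0 / np.maximum.accumulate(lr))
    return lr, p

# example: the audit can stop once the anytime P-value drops to the risk limit
rng = np.random.default_rng(1)
sample = rng.binomial(1, 0.6, size=500)      # hypothetical draws, true share 0.6
lr, p = sprt_anytime_pvalue(sample, mu=0.5, eta=0.6)
print("smallest anytime P-value so far:", p.min())
```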

This paper introduces ALPHA, a simple adaptive extension of BRAVO. It is motivated by the SPRT for the Bernoulli and its optimality when the simple alternative is true. While BRAVO tests against the alternative that the true vote shares are equal to the reported vote shares, ALPHA combines the reported results with the audit sample to estimate the reported winner's share of the vote before the $j$th ballot is examined, given what the first $j-1$ ballot cards have shown. ALPHA also generalizes BRAVO to situations where the population values are not necessarily binary, but merely nonnegative and bounded: the Bernoulli parameter is just the population mean of a binary population, but the martingale property continues to hold when the parameter is the mean of a nonnegative, bounded population. That generalization allows ALPHA to be used with SHANGRLA to audit supermajority contests and to conduct comparison audits of a wide variety of social choice functions—any for which there is a SHANGRLA audit. In contrast, BRAVO requires the list elements to be binary-valued. Finally, ALPHA works for sampling with or without replacement, while BRAVO is specifically for sampling with replacement (IID observations). (The SPRT for a population percentage using sampling without replacement is straightforward, but was not in the original BRAVO paper (Lindeman, Stark and Yates, 2012).)

2 ALPHA and SHANGRLA

2.1 SHANGRLA

Before introducing ALPHA, we provide additional motivation for constructing a more general test than BRAVO: the SHANGRLA framework for RLAs. SHANGRLA (Stark, 2020) checks outcomes by testing half-average assertions, each of which claims that the mean of a finite list of numbers between 0 and some upper bound $u$ is greater than 1/2. Each list of numbers results from applying an assorter to the ballot cards. The assorter uses the votes and possibly other information (e.g., how the voting system interpreted the ballot) to assign a number between 0 and $u$ to each ballot. Some assorters assign only the numbers 0 and 1, or 0, 1/2, and 1, but for others, there are more possible values.

The correctness of the outcomes under audit is implied by the intersection of a collection of such assertions; the assertions depend on the social choice function, the number of candidates, and other details (Stark, 2020). SHANGRLA tests the negation of each assertion, i.e., it tests the complementary null hypothesis that each assorter mean is not greater than 1/2. If that hypothesis is rejected for every assertion, the audit concludes that the outcome is correct. Otherwise, the audit expands, potentially to a full hand count. If every null is tested at level $\alpha$, this results in a risk-limiting audit with risk limit $\alpha$: if the outcome is not correct, the chance the audit will stop shy of a full hand count is at most $\alpha$. No adjustment for multiple testing is needed (Stark, 2020).

The core, canonical statistical problem in SHANGRLA is to test the hypothesis that $\bar{x} := \frac{1}{N}\sum_{i=1}^{N} x_i \le 1/2$ using a sample from a finite population $\{x_1, \ldots, x_N\}$, where each $x_i \in [0, u]$, with $u$ known. (An equivalent problem is to test whether the mean of a bounded, nonnegative population is at most some other known threshold: rescale the values so that the threshold becomes 1/2.) This formulation unifies polling audits and comparison audits; the difference is only in how the values $x_i$ are calculated; see section 2.4. The sample might be drawn with or without replacement; it might be drawn from the population as a whole (unstratified sampling), or the population might be divided into strata, each of which is sampled independently (stratified sampling). Or it might be drawn using Bernoulli sampling, where each item is included independently, with some common probability. Or batches (clusters) of ballot cards might be sampled instead of individual cards; see section 4.

For instance, consider one reported winner $w$ and one reported loser $\ell$ in a single-winner or multi-winner plurality contest (any number of (winner, loser) pairs can be audited simultaneously using the same sample (Stark, 2020)). Let $N$ denote the number of ballot cards validly cast in the contest. The assorter assigns the $i$th ballot the value 1 if the ballot has a valid vote for the reported winner, the value 0 if it has a valid vote for the reported loser, and the value 1/2 otherwise. That reported winner really beat that reported loser if the mean of the resulting list is greater than 1/2. In a multi-winner plurality contest with $W$ reported winners and $L$ reported losers, the reported winners really won if the mean of every one of the $WL$ lists for the (reported winner, reported loser) pairs is greater than 1/2.
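As a concrete illustration, here is a toy sketch of such a plurality assorter (not code from SHANGRLA; candidate names are invented):

```python
def plurality_assorter(vote, winner, loser):
    """Assign 1 to a valid vote for the reported winner, 0 to a valid vote for
    the reported loser, and 1/2 to everything else (other candidates, non-votes)."""
    if vote == winner:
        return 1.0
    if vote == loser:
        return 0.0
    return 0.5

# the assertion "winner beat loser" holds iff the mean assorter value exceeds 1/2
ballots = ["Alice", "Bob", "Alice", "Carol", "Alice", ""]
values = [plurality_assorter(b, winner="Alice", loser="Bob") for b in ballots]
print(sum(values) / len(values))   # about 0.67 > 1/2 for this toy list
```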

2.2 The ALPHA martingale test

We start by developing a one-sided test of the hypothesis $\theta = \mu$, then show that the $P$-value is monotone in the hypothesized mean, so the test is valid for the composite hypothesis $\theta \le \mu$, as SHANGRLA requires. Let $\theta := \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. Assume $x_i \in [0, u_i]$ for some known $u_i$. For ballot-polling audits of plurality contests, $u_i = 1$. For sampling uniformly from bounded, finite populations, $u_i = u$ is the upper bound on the elements of the population. See section 4 for an example (weighted sampling without replacement) where the bound varies by draw.

Let $\mu_j := E(X_j \mid X^{j-1})$, computed under the null hypothesis $\theta = \mu$. Let $\eta_j = \eta_j(X^{j-1})$, $j = 1, 2, \ldots$, be a predictable sequence in the sense that $\eta_j$ may depend on $X^{j-1}$, but not on $X_k$ for $k \ge j$. We now define the ALPHA martingale $(T_j)$. Let $T_0 := 1$ and

$$T_j := T_{j-1} \left( \frac{X_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right). \qquad (4)$$

This can be rearranged to yield

$$T_j = T_{j-1} \left( 1 + \frac{\eta_j - \mu_j}{\mu_j (u_j - \mu_j)} (X_j - \mu_j) \right). \qquad (5)$$

Equivalently,

$$T_j = \prod_{i=1}^{j} \left( \frac{X_i}{\mu_i} \cdot \frac{\eta_i - \mu_i}{u_i - \mu_i} + \frac{u_i - \eta_i}{u_i - \mu_i} \right). \qquad (6)$$

Under the null hypothesis, $E(X_j \mid X^{j-1}) = \mu_j$, and $T_j$ is nonnegative since $X_j$, $\mu_j$, and $\eta_j$ are all in $[0, u_j]$, with $\mu_j \le \eta_j \le u_j$. Also,

$$E(T_j \mid X^{j-1}) = T_{j-1} \left( \frac{E(X_j \mid X^{j-1})}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right) = T_{j-1} \cdot \frac{(\eta_j - \mu_j) + (u_j - \eta_j)}{u_j - \mu_j} = T_{j-1}. \qquad (7)$$

Thus if $E(X_j \mid X^{j-1}) = \mu_j$, $(T_j)$ is a nonnegative martingale with respect to $(X_j)$, starting at 1. If $E(X_j \mid X^{j-1}) \le \mu_j$, then $E(X_j \mid X^{j-1})/\mu_j \le 1$ and $(\eta_j - \mu_j)/(u_j - \mu_j) \ge 0$, so

$$E(T_j \mid X^{j-1}) \le T_{j-1}. \qquad (8)$$

Thus $(T_j)$ is a nonnegative supermartingale starting at 1 if $E(X_j \mid X^{j-1}) \le \mu_j$ for all $j$. It follows from Ville's inequality (Ville, 1939) that, if the null hypothesis is true,

$$\Pr\left( \exists j : T_j \ge 1/\alpha \right) \le \alpha. \qquad (9)$$

That is, $\min(1, 1/T_j)$ is an "anytime $P$-value" for the null hypothesis $\theta \le \mu$.

Note that the derivation did not use any information about the population other than that the values are nonnegative and bounded and that, under the null, $E(X_j \mid X^{j-1}) \le \mu_j$: it applies to populations that are nonnegative and bounded, not merely binary populations, to sampling with or without replacement, and to weighted sampling. That means it can be used generically to test any SHANGRLA assertion, allowing it to be used to audit a wide variety of social choice functions—including plurality, multi-winner plurality, super-majority, D'Hondt and other proportional representation schemes, Borda count, approval voting, STAR-Voting, arbitrary scoring rules, and IRV—using sampling with or without replacement, with or without stratification, and sampling uniformly or using probability proportional to "size" (see section 4). The ALPHA martingales comprise the same family of betting martingales studied by Waudby-Smith and Ramdas (2021) and Waudby-Smith, Stark and Ramdas (2021), but are parametrized differently; see section 2.3 below.
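The update in equation (4) is straightforward to implement. The following sketch (not the paper's reference implementation) computes the ALPHA supermartingale and anytime $P$-values for a fixed upper bound $u$, for sampling with or without replacement; the default estimator for $\eta_j$ is a fixed alternative chosen only for illustration.

```python
import numpy as np

def alpha_martingale(x, mu=0.5, u=1.0, eta_fn=None, N=None):
    """ALPHA test supermartingale (eq. 4) and anytime P-values.

    x      : assorter values in the order drawn, each in [0, u]
    mu     : null value of the population mean
    u      : upper bound on assorter values (held fixed here for simplicity)
    eta_fn : predictable estimate eta_j = eta_fn(j, S_{j-1}); defaults to a
             fixed alternative halfway between mu and u (illustrative choice)
    N      : population size for sampling without replacement (None => with replacement)
    """
    if eta_fn is None:
        eta_fn = lambda j, S: (mu + u) / 2
    T, S, m = 1.0, 0.0, mu
    T_path = []
    for j, xj in enumerate(x, start=1):
        if m < 0:
            T = np.inf        # the null would require a negative mean of the remaining cards
        elif 0 < m < u:
            eta = min(max(eta_fn(j, S), m + 1e-9), u - 1e-9)   # keep eta_j strictly in (m, u)
            T *= (xj / m) * (eta - m) / (u - m) + (u - eta) / (u - m)
        # if m == 0 or m >= u, leave T unchanged (a conservative choice for this sketch)
        S += xj
        if N is not None and N - j > 0:
            m = (N * mu - S) / (N - j)   # conditional null mean for the next draw
        T_path.append(T)
    T_path = np.array(T_path)
    return T_path, np.minimum(1.0, 1.0 / np.maximum.accumulate(T_path))
```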

2.2.1 Sampling without replacement.

To use ALPHA to make tests about the mean of a finite population from a sample drawn without replacement, we need $\mu_j := E(X_j \mid X^{j-1})$ computed on the assumption that $\theta = \mu$. For sampling without replacement from a population of $N$ items with mean $\mu$, after draw $j-1$ the mean of the remaining numbers is $(N\mu - S_{j-1})/(N - j + 1)$, where $S_{j-1} := \sum_{i=1}^{j-1} X_i$. Thus the conditional expectation of $X_j$ given $X^{j-1}$ under the null is $\mu_j = (N\mu - S_{j-1})/(N - j + 1)$. If $N\mu - S_{j-1} < 0$ for any $j$, the null hypothesis is certainly false.
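As a quick illustration with invented numbers: with $N = 100$ cards, null mean $\mu = 1/2$, and a sample of $j - 1 = 10$ cards whose assorter values sum to $S_{10} = 8$, the null implies $\mu_{11} = (100 \cdot 0.5 - 8)/(100 - 10) = 42/90 \approx 0.467$ for the 11th draw; if instead the running sum ever exceeded $N\mu = 50$, the null would already be impossible.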

2.2.2 BRAVO is a special case of ALPHA.

BRAVO is ALPHA with the following restrictions:

  • the sample is drawn with replacement from ballot cards that do have a valid vote for the reported winner $w$ or the reported loser $\ell$ (ballot cards with votes for other candidates or non-votes are ignored);

  • ballot cards are encoded as 0 or 1, depending on whether they have a valid vote for the reported winner or for the reported loser; the only possible values of $X_j$ are 0 and 1;

  • $u = 1$, $\mu = 1/2$, and $\mu_j = 1/2$ for all $j$, since the sample is drawn with replacement;

  • $\eta_j = \eta_0 := N_w/(N_w + N_\ell)$, where $N_w$ is the number of votes reported for candidate $w$ and $N_\ell$ is the number of votes reported for candidate $\ell$: $\eta_j$ is not updated as data are collected.

It follows from Wald (1945) that BRAVO minimizes the expected sample size to reject the null hypothesis when the reported winner really received the reported share of the votes. The motivation for this paper is that the reported winner almost never receives exactly their reported vote share, and BRAVO (and other RLA methods that rely on the reported vote share) may then have poor performance—even though they are still guaranteed to limit the risk that an incorrect result will become final to at most $\alpha$.
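A quick numerical check of this correspondence (a sketch; the reported share below is made up): with $u = 1$, $\mu_j = 1/2$, and a fixed $\eta_j = \eta_0$, each ALPHA factor equals the BRAVO likelihood-ratio factor.

```python
# ALPHA factor vs. BRAVO likelihood-ratio factor for a single 0/1 draw
eta_0 = 0.55                      # hypothetical reported share
for xj in (0, 1):
    alpha_factor = (xj / 0.5) * (eta_0 - 0.5) / (1 - 0.5) + (1 - eta_0) / (1 - 0.5)
    bravo_factor = eta_0 / 0.5 if xj == 1 else (1 - eta_0) / 0.5
    assert abs(alpha_factor - bravo_factor) < 1e-12
```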

When the reported vote shares are incorrect, using a method that adapts to the observed audit data can help, as we shall see.

2.3 Relationship to RiLACS and Betting Martingales

Waudby-Smith and Ramdas (2021) and Waudby-Smith, Stark and Ramdas (2021) develop tests and confidence sequences for the mean of a bounded population using "betting" martingales of the form

$$M_j := \prod_{i=1}^{j} \left( 1 + \lambda_i (X_i - \mu_i) \right), \qquad (10)$$

where, as above, $\mu_i := E(X_i \mid X^{i-1})$ is computed on the assumption that the null hypothesis is true. The sequence $(M_j)$ can be viewed as the fortune of a gambler in a series of wagers. The gambler starts with a stake of 1 unit and bets a fraction $\lambda_i \mu_i$ of their current wealth on the outcome of the $i$th wager. The value $M_j$ is the gambler's wealth after the $j$th wager. The gambler is not permitted to borrow money, so to ensure that when $X_i = 0$ (corresponding to losing the $i$th bet) the gambler does not end up in debt ($M_j < 0$), $\lambda_i$ cannot exceed $1/\mu_i$.

The ALPHA martingale is of the same form:

$$T_j = \prod_{i=1}^{j} \left( 1 + \lambda_i (X_i - \mu_i) \right), \qquad (11)$$

identifying $\lambda_i := \frac{\eta_i/\mu_i - 1}{u_i - \mu_i}$. Choosing $\lambda_i$ is equivalent to choosing $\eta_i$:

$$\eta_i = \mu_i \left( 1 + \lambda_i (u_i - \mu_i) \right). \qquad (12)$$

As $\eta_i$ ranges from $\mu_i$ to $u_i$, $\lambda_i$ ranges continuously from 0 to $1/\mu_i$, the same range of values of $\lambda_i$ permitted in Waudby-Smith and Ramdas (2021) and Waudby-Smith, Stark and Ramdas (2021): selecting $\lambda_i$ is equivalent to selecting a method for estimating $\theta$. That is, the ALPHA martingales are identical to the betting martingales in Waudby-Smith and Ramdas (2021) and Waudby-Smith, Stark and Ramdas (2021); the difference is only in how $\lambda_i$ is chosen. (ALPHA also makes explicit that the upper bound on $X_i$ can depend on $i$.)

Waudby-Smith and Ramdas (2021) and Waudby-Smith, Stark and Ramdas (2021) consider two classes of strategies for picking $\lambda_i$ intended to maximize the expected rate at which the gambler's wealth grows. One of the classes is approximately optimal if the true population mean is known (much like BRAVO is optimal when the reported results are correct); the other does not use prior information, instead adapting to the data. The ALPHA representation of the betting martingales provides a family of tradeoffs between those extremes, using different estimates of $\theta$ based on the reported results and the sample. Parametrizing the selection of $\lambda_i$ in terms of an estimate of $\theta$ may aid intuition in developing more powerful martingale tests (in the sense that they tend to reject for smaller samples) for particular applications—such as election audits.
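A sketch of the reparametrization between $\eta$ and $\lambda$ (helper names are illustrative):

```python
def lam_from_eta(eta, mu, u):
    """Betting fraction lambda corresponding to the ALPHA estimate eta."""
    return (eta / mu - 1) / (u - mu)

def eta_from_lam(lam, mu, u):
    """ALPHA estimate eta corresponding to a betting fraction lambda in [0, 1/mu] (eq. 12)."""
    return mu * (1 + lam * (u - mu))

# round trip: eta = u corresponds to lambda = 1/mu, and eta = mu to lambda = 0
assert abs(lam_from_eta(1.0, 0.5, 1.0) - 2.0) < 1e-12
assert abs(eta_from_lam(0.0, 0.5, 1.0) - 0.5) < 1e-12
```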

2.4 Comparison audits

In the SHANGRLA framework, there is no formal difference between polling audits (which do not use the voting system's interpretation of ballot cards) and comparison audits, which involve comparing how the voting system interpreted cards to how humans interpret the same cards. Either way, the correctness of the election outcome is implied by a collection of assertions, each of which is of the form, "the average of this list of numbers in $[0, u]$ is greater than 1/2." The only difference is the particular function that assigns numbers in $[0, u]$ to ballot cards. For polling audits, the number assigned to a card depends on the votes on that card as interpreted by a human (and on the social choice function and other parameters of the contest), but not on how the voting system interpreted the card. For comparison audits, the number also depends on how the system interpreted that card and on the reported "assorter margin." See Stark (2020) for details.

Because ALPHA can test the hypothesis that the mean of a bounded, nonnegative population is not greater than 1/2 (even for populations with more than two values), it works for comparison and polling audits with no modification. The interpretation of the population values is different: instead of being related to vote shares, they are related to the amount of overstatement error in the system's interpretation of the ballot cards. For comparison audits, the initial value of the alternative, $\eta_0$, could be chosen by making assumptions about how often the system made errors of various kinds. The risk is rigorously limited even if those assumptions are wrong, but the choice affects the performance.

To conduct a comparison RLA, auditors export subtotals or other vote records from the voting system and commit to them (e.g., by publishing them). Election auditors first check whether the exported records agree with the overall reported contest outcomes. If not, the election fails the audit: even according to the voting system, some reported outcome is wrong. If things add up, the audit next checks whether differences between the voting system's exported records and a human interpretation of the votes on ballot cards could have altered any reported outcome, by manually checking a random sample of the voting system's exported records against a manual interpretation of the votes on the corresponding physical batches of ballot cards.

This is like checking an expense report. Committing to the subtotals is like submitting the expense report. An auditor can check the accuracy of the report by first checking the addition (checking whether the exported batch-level results produce the reported contest outcomes), then manually checking a sample of the reported expenses against the physical paper receipts (checking the accuracy of the machine interpretation of the cards).

2.5 Setting $\eta_j$ to be an estimate of $\theta$

Since the SPRT minimizes the expected sample size to reject the null when the alternative is true, we might be able to construct an efficient test by using as the alternative $\eta_j$ an estimate of $\theta$ based on the audit data and the reported results. Any estimate that does not depend on $X_k$ for $k \ge j$ preserves the (super)martingale property under the null, and the auditor has the freedom to "change horses" and use a different estimator at will as the sample evolves. For example, $\eta_j$ might be constant, as it is in BRAVO. Or it could be constant for the first 100 draws, then switch to the unbiased estimate of $\theta$ based on the sample so far once $j > 100$. Or it could be a Bayes estimate of $\theta$ using the data and a prior concentrated on the alternative, centered at the value of $\theta$ implied by the reported results. (The $P$-value of the test is still a frequentist $P$-value; the estimate affects the power.) Or it could give the reported results weight that depends on the sample variance or the standard error of the sample mean, giving them more weight when the variability is small. Or it could be the estimate implied by choosing $\lambda_j$ using one of the methods for selecting $\lambda_j$ described by Waudby-Smith and Ramdas (2021).

2.5.1 Naively maximizing the expected value of $T_j$ does not work.

Suppose that $\theta > \mu$, i.e., that the alternative hypothesis is true. What value of $\eta_j$ maximizes $E_\theta(T_j/T_{j-1} \mid X^{j-1})$?

$$E_\theta\!\left( \frac{T_j}{T_{j-1}} \,\middle|\, X^{j-1} \right) = \frac{\theta_j}{\mu_j}\cdot\frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} = \frac{\eta_j(\theta_j/\mu_j - 1) + u_j - \theta_j}{u_j - \mu_j}, \qquad (13)$$

where $\theta_j := E_\theta(X_j \mid X^{j-1})$ is the true conditional mean. This is monotone increasing in $\eta_j$ (when $\theta_j > \mu_j$), so it is maximized by $\eta_j = u_j$, for a single draw. But if $\eta_j = u_j$ and $X_j = 0$, then $T_k = 0$ for all $k \ge j$, and the test will never reject the null hypothesis, no matter how many more data are collected. This is essentially the observation made by Kelly (Kelly, 1956) in his development of the Kelly criterion. Keeping $\eta_j < u_j$ hedges against that possibility.

Instead of picking $\eta_j$ to maximize the expected value of the next factor $T_j/T_{j-1}$, one can pick it to maximize the expected rate at which $T_j$ grows. In the binary data case, the Kelly criterion (Kelly, 1956), discussed by Waudby-Smith and Ramdas (2021), leads to the optimal choice of $\lambda_j$ when $\theta$ is known. For sampling with replacement, this is $\lambda_j = 4\theta - 2$, since $\mu = 1/2$. This corresponds to $\eta_j = \theta$.

2.5.2 Illustration: a simple way to select $\eta_j$.

Any choice of $\eta_j$ that depends only on $X^{j-1}$ preserves the martingale property and thus the validity of the ALPHA test. To show the potential of ALPHA, the simulations reported below are based on setting $\eta_j$ to be a simple "truncated shrinkage" estimate of $\theta$. The estimator shrinks the sample mean towards the reported result $\eta_0$ as if the reported result were the mean of $d$ draws from the population ($d$ is not necessarily an integer). To ensure that the alternative hypothesis corresponds to the reported winner really winning, we need $\eta_j > \mu_j$, and to keep the estimate consistent with the constraint that the population values are in $[0, u]$, we need $\eta_j \le u$. The following estimate of $\theta$ meets both requirements:

$$\eta_j := \left( \frac{d\,\eta_0 + S_{j-1}}{d + j - 1} \;\vee\; (\mu_j + \epsilon_j) \right) \wedge u, \qquad (14)$$

where $S_{j-1} := \sum_{i=1}^{j-1} X_i$ is the sample sum of the first $j-1$ draws and $\epsilon_j > 0$ is a small amount, discussed below, that keeps $\eta_j$ above $\mu_j$.

Choosing $\eta_0$. The starting value $\eta_0$ could be the value of $\theta$ implied by the reported results. For a polling audit, that might be based on the reported margin in a plurality contest. For a comparison audit, that might be based on historical experience with tabulation error. But the procedure could be made "fully adaptive" by starting with a default value of $\eta_0$ that does not depend on the reported results.

Choosing $d$. As $d \to \infty$, the sample size for ALPHA approaches that of BRAVO (in the binary data case). The larger $d$ is, the more strongly anchored the estimate is to the reported vote shares, and the smaller the penalty ALPHA pays when the reported results are exactly correct. Using a small value of $d$ is particularly helpful when the true population mean is far from the reported results. The smaller $d$ is, the faster the method adapts to the true population mean, but the higher the variance is. Whatever $d$ is, the relative weight of the reported vote shares decreases as the sample size increases.

Choosing $\epsilon_j$. To allow the estimated winner's share $\eta_j$ to approach $\mu_j$ as the sample grows (if the sample mean approaches $\mu_j$ or less), we shall take $\epsilon_j := c/\sqrt{d + j - 1}$ for a nonnegative constant $c$. The estimate is thus the sample mean, shrunk towards $\eta_0$ and truncated to the interval $[\mu_j + \epsilon_j, u]$, where $\epsilon_j \to 0$ as the sample size grows.
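A minimal sketch of this truncated shrinkage rule for sampling without replacement (the parameter values $\eta_0$, $d$, and $c$ below are placeholders, not recommendations from the paper):

```python
import numpy as np

def shrink_trunc_eta(x, N, mu=0.5, u=1.0, eta0=0.52, d=100, c=0.01):
    """eta_j for each draw j: the running mean shrunk toward eta0 (prior weight d),
    kept at least eps_j above the current null mean mu_j and at most u (eq. 14)."""
    x = np.asarray(x, dtype=float)
    j = np.arange(1, len(x) + 1)
    S_prev = np.insert(np.cumsum(x), 0, 0.0)[:-1]        # S_{j-1}: sum of the first j-1 draws
    mu_j = (N * mu - S_prev) / (N - j + 1)               # null mean before draw j (w/o replacement)
    eps_j = c / np.sqrt(d + j - 1)
    shrunk = (d * eta0 + S_prev) / (d + j - 1)
    return np.minimum(u, np.maximum(shrunk, mu_j + eps_j))
```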

3 Pseudo-algorithm for ballot-level comparison and ballot-polling audits

The algorithm below is written for a single SHANGRLA assertion, but the audit can be conducted in parallel for any number of assertions using the same sampled ballot cards; no multiplicity adjustment for the number of assertions is needed. There are assorters for polling audits, which do not use information about how the voting system interpreted ballot cards, and for comparison audits, which require the voting system to commit to how it interpreted each ballot card before the audit starts. For comparison audits, the first step is to verify that the data exported from the voting system reproduce the reported election outcome, that is, to check whether applying the social choice function to the cast vote records (CVRs) gives the same winners. We shall assume that a compliance audit has shown that the paper trail is trustworthy. For comparison audits, we assume that the system has exported a CVR for every ballot card. (For methods to deal with a mismatch between the number of ballot cards and the number of CVRs, see Stark (2020).) A toy Python sketch of the loop follows the pseudo-algorithm.

  • Set audit parameters:

    • select the risk limit $\alpha$; decide whether to sample with or without replacement

    • set $u$ as appropriate for the assertion under audit and the sampling method; for uniform sampling of ballots with or without replacement, $u$ is the assorter upper bound

    • set $N$, the number of ballot cards in the population of cards from which the sample is drawn

    • set $\eta_0$. For polling audits, $\eta_0$ could be the reported mean value of the assorter. For comparison audits, $\eta_0$ could be based on assumed or historical error rates.

    • define the function to update $\eta_j$ based on the sample, e.g.,
      $\eta_j := \left( \frac{d\,\eta_0 + S_{j-1}}{d + j - 1} \vee (\mu_j + \epsilon_j) \right) \wedge u$, where $S_{j-1}$ is the sample sum of the first $j-1$ draws and $\epsilon_j := c/\sqrt{d + j - 1}$; set any free parameters in the function (e.g., $c$ and $d$ in this example). The only requirement is that $\eta_j \in (\mu_j, u)$, where $\mu_j := E(X_j \mid X^{j-1})$ is computed under the null.

  • Initialize variables

    • $j \leftarrow 0$: sample number

    • $S \leftarrow 0$: sample sum

    • $T \leftarrow 1$: test statistic

    • $m \leftarrow \mu = 1/2$: population mean under the null

  • While $T < 1/\alpha$ and not all ballot cards have been audited:

    • draw a ballot card at random and set $j \leftarrow j + 1$

    • determine $X_j$ by applying the assorter to the selected ballot card (and to the CVRs, for comparison audits)

    • if $m < 0$, set $T \leftarrow \infty$. Otherwise, set $T \leftarrow T \cdot \left( \frac{X_j}{m} \cdot \frac{\eta_j - m}{u - m} + \frac{u - \eta_j}{u - m} \right)$; set $S \leftarrow S + X_j$

    • for sampling without replacement, set $m \leftarrow (N/2 - S)/(N - j)$

    • if desired, break and conduct a full hand count instead of continuing to audit
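Putting the pieces together, here is a toy driver for the loop above, reusing the illustrative alpha_martingale and shrink_trunc_eta sketches from sections 2.2 and 2.5.2 (the population, shares, and parameters are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
N, alpha = 2_000, 0.05
# hypothetical ballot-polling population: 55% of the (winner, loser) cards favor the winner
population = rng.permutation(np.repeat([1.0, 0.0], [1_100, 900]))

# predictable estimate: eta_j from the truncated shrinkage rule; although the first j values
# are passed, the estimator only uses the first j-1 of them
eta_fn = lambda j, S: shrink_trunc_eta(population[:j], N, eta0=0.52, d=100, c=0.01)[-1]
T, p = alpha_martingale(population, mu=0.5, u=1.0, eta_fn=eta_fn, N=N)

hits = np.nonzero(T >= 1 / alpha)[0]
if hits.size:
    print("audit can stop after", hits[0] + 1, "ballot cards at risk limit", alpha)
else:
    print("sample exhausted without confirming the outcome: full hand count")
```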

4 Batch-Polling and Batch-Level Comparison Audits

So far we have been discussing audits that sample and manually interpret individual ballot cards: ballot-polling audits, which use only the manual interpretation of the sampled ballots, and ballot-level comparison audits, which also use the system's interpretation of the sampled ballot cards (CVRs). Ballot-level comparison audits are the most efficient strategy (measured by expected sample size) if the voting system can export CVRs in a way that allows the corresponding physical ballot cards to be identified, retrieved, and interpreted manually. Legacy systems cannot. Even with modern equipment, reporting CVRs linked to physical ballot cards while maintaining vote anonymity is hard if votes are tabulated in precincts or vote centers, because the order in which ballot cards are scanned, tabulated, and stored can be nearly identical to the order in which they were cast.

Many jurisdictions tabulate and store ballot cards in physical batches for which the voting system can report batch-level results. (Vote centers and vote-by-mail can make batch-level comparison audits hard or impossible, since some voting systems can only report vote subtotals for batches based on political geography (e.g., precincts), which may not correspond to physically identifiable batches. To create physical batches that match the reporting batches would require sorting the ballot cards.) Thus it can be desirable to sample and interpret batches of ballot cards instead of individual ballot cards, i.e., to use cluster sampling. Batch-polling audits use human interpretation of the votes in the batches but not the voting system's tabulation (other than the system's report of who won, and possibly the reported vote totals). Batch-level comparison audits compare human interpretation of the ballot cards in the sampled batches to the voting system's interpretation of the same cards. Batch-level comparison audits are operationally similar to existing, non-risk-limiting audits in many states, including California and New York, but RLAs provide statistical guarantees, unlike the statutory audits in California and New York.

For many social choice functions (including all scoring rules), knowing the total number of votes reported for each candidate in each batch is enough to conduct a batch-level comparison audit. But for some voting systems and some social choice functions, batch-level results contain too little information. For instance, to audit instant-runoff voting (IRV), it is not enough to know how many voters gave each rank to each candidate: the joint distribution of ranks matters.

As discussed above, SHANGRLA audits of one or more contests involve a collection of assorters: functions from ballot cards (and possibly additional information, such as the reported outcome, reported margin, and the system's interpretation of the votes on the ballot card) to $[0, u]$. The domain of each assorter is a set of ballot cards that could comprise all ballot cards cast in the election or a smaller set, provided it includes every card that contains the contest the assorter is relevant for. Targeting audit sampling using information about which ballot cards purport to contain which contests (card-style data) can vastly improve audit efficiency while rigorously maintaining the risk limit even if the voting system misidentifies which cards contain which contests (Glazer, Spertus and Stark, 2021). There are also techniques for dealing with missing ballot cards (Bañuelos and Stark, 2012; Stark, 2020).

We will consider a single assorter $A$ and suppress the dependence on the assertion to simplify the notation. Considerations for testing multiple assertions using the same sample are in section 4.2.1. Let $N$ denote the number of ballot cards in the domain $\mathcal{B}$ of $A$. Every audited contest outcome is correct if every assorter mean is greater than 1/2, i.e., if, for every assorter under audit,

$$\bar{A} := \frac{1}{N} \sum_{b \in \mathcal{B}} A(b) > \frac{1}{2}. \qquad (15)$$

Ballot cards are tabulated and stored in disjoint, physically identifiable batches of ballot cards. Let $N_c$ be the number of ballot cards in batch $c$. We assume that each assorter domain is the union of some of the batches. Let $\mathcal{C}$ be the indices of the batches to which the assorter applies and let $C$ denote the cardinality of $\mathcal{C}$. Define

$$A_c := \sum_{b \in \mathrm{batch}\ c} A(b), \qquad (16)$$

the total of the assorter over batch $c$. Then $\bar{A} = \frac{1}{N}\sum_{c \in \mathcal{C}} A_c$. Let $u_c$ be an upper bound on $A_c$, for instance $u_c := N_c u$, where $u$ is the upper bound on the values the assorter assigns to individual cards. Tighter upper bounds than that may be calculable, in particular for batch comparison audits: depending on the reported votes in batch $c$, the upper bound might not be attainable for every ballot card in the batch. Let $U := \sum_{c \in \mathcal{C}} u_c$ be the sum of the batch upper bounds.

4.1 Batch Sampling with Equal Probabilities

Define

$$x_c := \frac{C}{N} A_c, \qquad c \in \mathcal{C}. \qquad (17)$$

Then

$$\frac{1}{C} \sum_{c \in \mathcal{C}} x_c = \frac{1}{N} \sum_{c \in \mathcal{C}} A_c = \bar{A}. \qquad (18)$$

That is, the mean of the $C$ values $x_c$ is equal to the mean of the $N$ values $A(b)$. Let $\tilde{u} := \frac{C}{N} \max_{c \in \mathcal{C}} u_c$. Then the values $x_c$ are in $[0, \tilde{u}]$, so if we sample batches with equal probability (with or without replacement), testing whether the population mean is less than or equal to 1/2 is an instance of the problem solved by ALPHA, the tests in Waudby-Smith and Ramdas (2021) and Waudby-Smith, Stark and Ramdas (2021), and the Kaplan martingale test; for sampling with replacement, it is also solved by the Kaplan–Wald and Kaplan–Markov tests.
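A sketch of the transformation in equation (17), with invented batch subtotals and sizes and an assorter upper bound of $u = 1$ per card:

```python
import numpy as np

# hypothetical batch data: assorter totals A_c and batch sizes N_c for C batches
A_c = np.array([412.5, 380.0, 455.0, 290.5])
N_c = np.array([800, 750, 900, 600])
C, N = len(A_c), N_c.sum()

u = 1.0                      # per-card assorter upper bound
u_c = N_c * u                # simple batch-level upper bounds
x_c = (C / N) * A_c          # eq. (17): one value per batch
u_tilde = (C / N) * u_c.max()  # common upper bound on the x_c

# eq. (18): the batch-level mean equals the ballot-level assorter mean
assert np.isclose(x_c.mean(), A_c.sum() / N)
```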

However, because batch sizes may vary widely, using a single upper bound for all batches may have a great deal of slack for some batches, which can reduce the efficiency of the tests. By sampling batches with unequal probabilities, we can transform the problem to one where the upper bounds on the batches are sharper. This may lead to more efficient audits, depending on fixed costs related to retrieving batches; checking, recording, and opening seals; re-sealing batches and returning them to storage; etc.

4.2 Batch Sampling with Probability Proportional to a Bound on the Assorter

Let $B_j$ denote the batch selected in the $j$th draw. For sampling without replacement, let $\mathcal{C}_j := \mathcal{C} \setminus \{B_1, \ldots, B_{j-1}\}$; for sampling with replacement, let $\mathcal{C}_j := \mathcal{C}$. For sampling without replacement, let $\mathcal{B}_j$ be the set of ballot cards in the batches indexed by $\mathcal{C}_j$; for sampling with replacement, let $\mathcal{B}_j$ be all the ballot cards in the domain of the assorter. Then $\mathcal{C}_j$ are the indices of the batches from which the $j$th sample batch will be drawn, and $\mathcal{B}_j$ are the ballot cards those batches contain. Let $U_j := \sum_{c \in \mathcal{C}_j} u_c$. The $j$th batch is selected at random from $\mathcal{C}_j$, with chance $u_c/U_j$ of selecting batch $c$.

Define

$$x_c := \frac{A_c}{u_c} \, u_j, \qquad c \in \mathcal{C}_j, \qquad (19)$$

where $u_j := U_j/|\mathcal{B}_j|$. (For sampling without replacement, this typically varies with $j$.) Let $X_j := x_{B_j}$ be the value selected on the $j$th draw; note $0 \le X_j \le u_j$. Consider the expected value of $X_j$ given $X^{j-1}$:

$$E(X_j \mid X^{j-1}) = \sum_{c \in \mathcal{C}_j} \frac{u_c}{U_j} \cdot \frac{A_c}{u_c} \, u_j = \frac{1}{|\mathcal{B}_j|} \sum_{c \in \mathcal{C}_j} A_c, \qquad (20)$$

the mean value of the assorter over the ballots that remain in the population just before the $j$th draw. Under the null hypothesis that $\bar{A} \le 1/2$,

$$E(X_j \mid X^{j-1}) \le \frac{N/2 - \sum_{i=1}^{j-1} A_{B_i}}{|\mathcal{B}_j|} =: \mu_j. \qquad (21)$$

Let $\eta_j \in (\mu_j, u_j)$ be an estimate of the population mean based on $X^{j-1}$ and define

$$T_j := T_{j-1} \left( \frac{X_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right). \qquad (22)$$

This is an example of ALPHA allowing the upper bound $u_j$ on $X_j$ to vary with $j$, with a corresponding draw-dependent constraint on $\eta_j$.

4.2.1 Auditing many assertions using the same weighted sample of batches.

To audit more than one assertion using the same sample of batches, the sampling weights (and thus the batch-level a priori bounds) for the different assertions need to be commensurable, in the following sense. Suppose there are $K$ assorters, $A_1, \ldots, A_K$; let $\mathcal{B}^k$ be the domain of assorter $A_k$; and let $\mathcal{C}^k$ denote the set of batches that comprise $\mathcal{B}^k$ ($\mathcal{C}^k_j$ denotes the set of batches in the domain of assorter $A_k$ from which the $j$th sample will be drawn). Suppose that batches $c$ and $d$ are relevant for assorters $k$ and $k'$. Let $u^k_c$ be the upper bound for assorter $k$ in batch $c$ and define $u^{k'}_c$, $u^k_d$, and $u^{k'}_d$ analogously. Then we need $u^k_c/u^{k'}_c = u^k_d/u^{k'}_d$. The easiest way to accomplish that is to take $u^k_c := N_c u^k$ for all $c$ and all $k$, where $u^k$ is the upper bound on the values assorter $A_k$ assigns to individual cards. Tighter bounds may be possible in some cases, depending on the batch-level reports for all the contests under audit.

5 Stratified Sampling

Stratified sampling—partitioning ballot cards into disjoint strata and sampling independently from those strata—can be helpful in RLAs (Stark, 2008; Higgins, Rivest and Stark, 2011; Ottoboni et al., 2018; Stark, 2020). For instance, some states (including California) require jurisdictions to draw audit samples independently. Auditing a cross-jurisdictional contest then involves stratified samples; the cards cast in each jurisdiction comprise the strata. Stratified sampling can also offer logistical advantages by making it possible to use different audit strategies (polling, batch polling, ballot-level comparison, batch-level comparison) for different subsets of ballot cards, for instance, if some ballot cards are tabulated using equipment that can report how it interpreted each ballot and some are not.

Stratified batch-comparison RLAs were developed in the first paper on risk-limiting audits, Stark (2008). The approach was tightened in Higgins, Rivest and Stark (2011). Ottoboni et al. (2018) developed a more flexible approach, SUITE (Stratified Union-Intersection Tests of Elections), which does not require using the same sampling or audit strategy in different strata. In particular, SUITE allows using polling in some strata and ballot-level or batch-level comparisons in others. BRAVO does not work for auditing in the polling strata in that context, because it makes inferences about the votes for one candidate as a fraction of the votes that are either for that candidate or one other candidate, that is, it conditions on the event that the selected card has a vote for either the reported winner or the reported loser. That suffices to tell who won a plurality contest—by auditing every (reported winner, reported loser) pair—if all the ballot cards are in a single stratum, but not when the sample is stratified.

When the sample is stratified, what is needed is an inference about the number of votes in the stratum for each candidate. To solve that problem, Ottoboni et al. (2018) used a test in the polling stratum based on the multinomial distribution, maximizing the -value over a nuisance parameter, the number of ballot cards in the stratum with no valid vote for either candidate. SUITE represents the hypothesis that the outcome is wrong as a union of intersections of hypotheses. The union is over all ways of partitioning outcome-changing errors across strata. The intersection is across strata for each partition in the union. For each partition, for each stratum, SUITE computes a -value for the hypothesis that the error in that stratum exceeds its allocation, then combines those -values across strata (using a combining function such as Fisher’s combining function) to test the intersection hypothesis that the error in every stratum exceeds its allocation in the partition. If the maximum -value of that intersection hypothesis over all allocations of outcome-changing error is less than or equal to the risk limit, the audit stops. Stark (2020) extends the union-intersection approach to use SHANGRLA assorters, avoiding the need to maximize -values over nuisance parameters in individual strata and permitting sampling with or without replacement.

5.1 ALPHA obviates the need to use a combining function across strata

Because ALPHA works with polling and comparison strategies, it can be the basis of the test in every stratum, whereas SUITE used completely different "risk measuring functions" for strata where the audit involves ballot polling and strata where the audit involves comparisons. We shall see that this obviates the need to use a combining function to combine P-values across strata: the stratum-level test statistics can simply be multiplied. This is because (predictably) multiplying terms in the product representation of different sequences—each of which, under the nulls in the intersection, is a nonnegative supermartingale starting at one—yields a nonnegative supermartingale starting at one. Thus the product of the stratum-wise test statistics in any order (including interleaving terms across strata) is also a test statistic with the property that the chance it is ever greater than or equal to $1/\alpha$ is at most $\alpha$ under the intersection null. Because Fisher's combining function adds two degrees of freedom to the chi-square distribution for each stratum, avoiding the need for a combining function can substantially increase power as the number of strata grows. Table 1 illustrates this increase: it shows the combined P-value for the intersection hypothesis when the P-value in each stratum is 0.5. The number of strata ranges from 2—which might arise in an audit in a single jurisdiction when stratifying on mode of voting (in-person versus absentee)—to 150—which might arise in auditing a cross-jurisdictional contest in a state with many counties. For instance, Georgia has 159 counties, Kentucky has 120, Texas has 254, and Virginia has 133.

strata Fisher’s combination supermartingale
2 0.5966 0.25000000
5 0.7319 0.03125000
10 0.8374 0.00097656
25 0.9514 0.00000003
50 0.9917 0.00000000
100 0.9997 0.00000000
150 1.0000 0.00000000
Table 1: Overall P-value for the intersection null hypothesis if the P-value in each stratum is 0.5, for Fisher's combining function (column 2) and for martingale-based tests (column 3). The "stratification penalty" arising from the large number of degrees of freedom (twice the number of strata) for Fisher's combining function can be entirely avoided by using martingale-based tests, which permit simply multiplying the test statistic values across strata and taking the reciprocal of the result (or 1, if 1 is smaller) as the P-value.
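The entries of Table 1 can be reproduced in a few lines (a sketch; scipy is used only for the chi-square tail probability):

```python
import math
from scipy.stats import chi2

def fisher_combined_p(p_values):
    """Fisher's combining function: -2 * sum(log p) has a chi-square
    distribution with 2k degrees of freedom under the intersection null."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2.sf(stat, df=2 * len(p_values))

def martingale_combined_p(p_values):
    """Multiply the stratum test statistics (1/p each) and take the reciprocal, capped at 1."""
    prod = 1.0
    for p in p_values:
        prod *= 1.0 / p
    return min(1.0, 1.0 / prod)

for k in (2, 5, 10, 25, 50, 100, 150):
    ps = [0.5] * k
    print(k, round(fisher_combined_p(ps), 4), f"{martingale_combined_p(ps):.8f}")
```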

5.2 Supermartingale-based tests of intersection hypotheses

Here is a sketch of how ALPHA can be used for stratified audits. Suppose there are $N$ ballots in all, partitioned into $S$ strata. (This section will overload $S$ to mean two related things: the number of strata and a mapping from counting numbers to strata. Elsewhere in the paper, $S$ refers to a sample sum.) Stratum $s$ contains $N_s$ ballot cards; $N = \sum_{s=1}^{S} N_s$. We want to test the hypothesis that $\bar{A} \le 1/2$. Let $u_s$ be the upper bound on the numbers the assorter assigns in stratum $s$. Let $\bar{A}_s$ be the average of the assorter restricted to stratum $s$, so $\bar{A} = \sum_{s=1}^{S} N_s \bar{A}_s / N$. Suppose $(\theta_1, \ldots, \theta_S)$ satisfies $\sum_{s=1}^{S} N_s \theta_s / N \le 1/2$. We sample independently from the strata. Let $(T^s_j)$ denote the test supermartingale for stratum $s$ to test the null $\bar{A}_s \le \theta_s$, $j = 0, 1, \ldots$. Let $X^s_j$ denote the $j$th draw from the $s$th stratum, and define $\mu^s_j$, $\eta^s_j$, and $u^s_j$ analogously. Define

$$t^s_j := \frac{X^s_j}{\mu^s_j} \cdot \frac{\eta^s_j - \mu^s_j}{u^s_j - \mu^s_j} + \frac{u^s_j - \eta^s_j}{u^s_j - \mu^s_j}. \qquad (23)$$

Recall from equation (6) that

$$T^s_j = \prod_{i=1}^{j} t^s_i. \qquad (24)$$

Let $S(1), S(2), \ldots$ be the sequence of strata from which successive terms are taken and let $(Y_i)$ be the corresponding interleaving of the samples, i.e., if $S(i) = s$, then $Y_i$ is the next element of $(X^s_1, X^s_2, \ldots)$ not yet seen.

  • Initialize: set $M_0 := 1$ and set the count of draws taken from each stratum to 0.

  • For $i = 1, 2, \ldots$: if $S(i) = s$, take the next draw from stratum $s$ (the $j$th from that stratum, say) and set $M_i := M_{i-1} \, t^s_j$.

The stratum-selector $S$ is arbitrary and can be adaptive: it can depend on the previously observed data. But it must be predictable in the sense that $S(i)$ can rely only on observations already entered into the calculation, i.e., on $Y_1, \ldots, Y_{i-1}$. It could ignore past data and simply select strata by round-robin, skipping any strata whose samples have been exhausted. Or it could concatenate the samples: take every draw from stratum 1 first, then every draw from stratum 2, and so on. In general, the power will depend on the mapping $S$. For instance, if data from stratum $s$ suggest that $\bar{A}_s \le \theta_s$, future values of $S$ might omit stratum $s$, concentrating instead on strata where there is some evidence that the intersection null is false, to maximize the expected rate at which the test supermartingale grows. Indeed, choosing $S$ can be viewed as a (possibly finite-population) multi-armed bandit problem: which stratum should the next sample come from to maximize the expected rate of growth of the test statistic?

We now define the intersection test martingale:

$$M_i := \prod_{l=1}^{i} t^{S(l)}_{j(l)}, \qquad (25)$$

where $j(l)$ is the number of draws taken from stratum $S(l)$ up to and including step $l$. Now $S(i)$ and $j(i)$ are predictable from $Y_1, \ldots, Y_{i-1}$. Suppose $\bar{A}_s \le \theta_s$ for every stratum $s$ and that $S(i) = s$. The samples from different strata are independent, so the conditional expectation of $M_i/M_{i-1} = t^s_{j(i)}$ given $Y_1, \ldots, Y_{i-1}$ is its conditional expectation given the previous draws from stratum $s$, which is at most 1. Thus

$$E(M_i \mid Y_1, \ldots, Y_{i-1}) \le M_{i-1}. \qquad (26)$$

That is, under the intersection null, $(M_i)$ is a nonnegative supermartingale starting at 1, and by Ville's inequality,

$$\Pr\left( \exists i : M_i \ge 1/\alpha \right) \le \alpha. \qquad (27)$$

To audit the assertion, we need to check whether there is any $(\theta_1, \ldots, \theta_S)$ with $\sum_s N_s \theta_s / N \le 1/2$ for which $\max_i M_i < 1/\alpha$. If there is, sampling needs to continue. We thus need to find

$$\min_{\{(\theta_s) : \sum_s N_s \theta_s / N \le 1/2\}} \ \max_i M_i, \qquad (28)$$

the solution to a finite-dimensional optimization problem.
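A minimal sketch of interleaving stratum-level ALPHA factors into a single intersection supermartingale, using a round-robin stratum selector and sampling with replacement within strata (the stratum data, bounds, nulls, and alternatives below are all invented):

```python
import numpy as np

def intersection_martingale(stratum_samples, thetas, us, etas):
    """Multiply ALPHA factors across strata, interleaved round-robin.

    stratum_samples : list of 1-D arrays, the draws from each stratum (with replacement)
    thetas          : stratum null means tested jointly (the intersection null)
    us, etas        : per-stratum upper bounds and fixed alternatives, eta in (theta, u)
    Returns the running product M_i, a nonnegative supermartingale under the intersection null.
    """
    iters = [iter(s) for s in stratum_samples]
    live = list(range(len(iters)))
    M, path = 1.0, []
    while live:
        for s in list(live):                 # round-robin over strata not yet exhausted
            try:
                x = next(iters[s])
            except StopIteration:
                live.remove(s)
                continue
            m, u, eta = thetas[s], us[s], etas[s]
            M *= (x / m) * (eta - m) / (u - m) + (u - eta) / (u - m)
            path.append(M)
    return np.array(path)

# toy example with two strata sampled with replacement
rng = np.random.default_rng(3)
strata = [rng.binomial(1, 0.56, 400).astype(float), rng.binomial(1, 0.54, 300).astype(float)]
M = intersection_martingale(strata, thetas=[0.5, 0.5], us=[1.0, 1.0], etas=[0.55, 0.55])
print("anytime P-value for this intersection null:", min(1.0, 1.0 / M.max()))
```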

6 Bernoulli Sampling

Ottoboni et al. (2020) developed a ballot-polling risk-limiting audit (BBP) based on Bernoulli sampling, where each ballot card is independently included in the sample with some common probability $p$. This results in a random sample of random size. Conditional on the sample size, it is a simple random sample of ballot cards. Their approach to testing whether one candidate received more votes than another involves conditioning on the realized sample size and maximizing an SPRT P-value over a nuisance parameter, the number of ballot cards that do not contain a valid vote for either of those candidates. They find that BBP requires sample sizes comparable to BRAVO for the same margin.

ALPHA, combined with the SHANGRLA transformation, eliminates the need to perform the maximization over a nuisance parameter. To use ALPHA, the sample needs to have a notional ordering. That ordering can come from randomly permuting the sample, or from setting a canonical ordering of the ballot cards before the sample is selected, e.g., a lexicographical ordering, then considering the sample order to be the lexicographical order of the cards in the sample.

If the initial Bernoulli sample does not suffice to confirm the outcome, the sample can be expanded using the same approach Ottoboni et al. (2020) took in section 4 of their paper. Since the performance of BBP is similar to that of BRAVO, one might expect that applying ALPHA to Bernoulli samples would require smaller sample sizes (i.e., smaller selection probabilities) than BBP. We do not perform any simulations here, but we plan to investigate the efficiency of ALPHA/SHANGRLA versus BBP in future work.

7 Simulations

7.1 Sampling with replacement

Table 2 reports mean sample sizes of ALPHA and BRAVO for the same true vote shares $\theta$ and the same reported shares $\eta_0$, using the truncated shrinkage estimate of $\theta$, for a variety of choices of $d$, all for a risk limit of 5%. Results are based on 1,000 replications for each combination of parameters. Sample sizes were limited to 10 million ballot cards: if a method required a sample bigger than that in any of the 1,000 replications, the result is listed as '—'. As expected, BRAVO is best (or tied for best) when $\eta_0 = \theta$, i.e., when the reported vote shares are exactly right. ALPHA with a small value of $d$ is best when the discrepancy between the reported and true shares is large, and ALPHA with a large value of $d$ is best when the discrepancy is small, with a few exceptions, where BRAVO beats ALPHA for every $d$ considered when $\eta_0 = \theta$ and the margin is small. When $d$ is large, ALPHA often does nearly as well as BRAVO even when $\eta_0 = \theta$. When vote shares are wrong (and the reported winner still won), ALPHA often reduces average sample sizes substantially, even when the true margin is large. Indeed, in many cases, the sample size for BRAVO exceeded 10 million ballot cards in some runs, while the average for ALPHA was up to five orders of magnitude lower.

The SPRT is known to perform poorly, sometimes never leading to a decision, when the true share is materially below the alternative (even though it exceeds 1/2). In such cases, ALPHA did much better for every choice of $d$ considered, although the advantage was largest for small $d$.

The simulations show that the performance of BRAVO can also be poor when the reported share understates the true share. In most of those cases, ALPHA performed better than BRAVO for all choices of $d$. For instance, for a true margin of 20%, ALPHA mean sample sizes were 204–353, while BRAVO required far larger samples in some runs.

θ (true)  η (reported)  ALPHA d=10  ALPHA d=100  ALPHA d=500  ALPHA d=1000  BRAVO
0.505     0.505         102,500     91,024       82,757       79,414        58,266
0.505     0.51          102,738     91,878       84,088       80,564        —
0.505     0.52          103,418     93,842       88,512       87,685        —
0.505     0.53          103,746     96,611       97,731       103,630       —
0.505     0.54          104,490     99,535       110,216      126,618       —
0.505     0.55          105,071     104,047      125,659      158,247       —
0.505     0.6           110,346     135,445      269,961      440,573       —
0.505     0.65          118,727     190,702      519,166      920,839       —
0.505     0.7           129,332     265,380      861,560      1,597,004     —
0.51      0.505         24,476      21,487       19,798       19,258        19,965
0.51      0.51          24,598      21,598       19,577       18,841        14,930
0.51      0.52          24,717      22,036       19,839       19,035        —
0.51      0.53          24,760      22,451       20,846       20,928        —
0.51      0.54          24,930      23,029       22,888       24,602        —
0.51      0.55          25,017      23,856       25,848       30,351        —
0.51      0.6           25,954      30,041       57,261       93,849        —
0.51      0.65          27,721      42,078       116,207      209,576       —
0.51      0.7           30,117      61,550       201,768      376,304       —
Table 2: Mean sample sizes over 1,000 replications for ALPHA with the truncated shrinkage estimator (d = 10, 100, 500, 1000) and for BRAVO, by true share θ and reported share η, at risk limit 5%. '—' indicates that the sample size exceeded 10 million ballot cards in at least one replication.