1 Introduction
A risk-limiting audit (RLA) is a procedure that has a known minimum probability of correcting the reported outcome of an election contest if the reported outcome is wrong. The risk limit of an RLA is the maximum chance that the RLA will not correct the electoral outcome, if the outcome is wrong. The outcome means the political outcome (who or what won), not the numerical vote tallies, which are practically impossible to get exactly right. An RLA requires a trustworthy record of the validly cast votes; a manual count of those records is the recourse to correct wrong outcomes. (Footnote 1: Generally, the record is a complete set of validly cast hand-marked paper ballot cards that has been kept demonstrably secure. Machine-marked ballot cards cannot be considered a trustworthy record of voter intent. See Appel, DeMillo and Stark (2020); Appel and Stark (2020); Stark and Wagner (2012).) Establishing whether the record of votes is trustworthy prior to conducting a risk-limiting audit is generically called a compliance audit (Stark and Wagner, 2012; Appel and Stark, 2020). RLAs are recommended by the National Academies of Sciences, Engineering, and Medicine (National Academies of Sciences, Engineering, and Medicine, 2018), the American Statistical Association (American Statistical Association, 2010), and other groups concerned with election integrity. They are in law in about half a dozen U.S. states and have been piloted in roughly a dozen U.S. states and in Denmark.
BRAVO (Lindeman, Stark and Yates, 2012) is a particularly simple method to conduct an RLA of plurality and supermajority contests. It relies on sampling ballot cards uniformly at random with replacement from all ballot cards validly cast in the contest. (Footnote 2: In general, a ballot comprises one or more ballot cards, each of which contains some of the contests a given voter is eligible to vote in. Many countries and some U.S. states have one-card ballots, but many U.S. states routinely have ballots that comprise two or more ballot cards.) Stark and Teague (2014) showed how BRAVO can be used to audit proportional representation schemes such as D’Hondt. Blom, Stuckey and Teague (2018) showed how BRAVO can be used to audit ranked-choice voting as well. BRAVO is based on Wald’s (Wald, 1945) sequential probability ratio test (SPRT) of the simple hypothesis $\theta = \mu$ against a simple alternative $\theta = \eta$ from IID Bernoulli($\theta$) observations. (A Bernoulli($\theta$) random variable takes the value $1$ with probability $\theta$ and the value $0$ with probability $1-\theta$; its expected value is $\theta$.) Because it requires IID Bernoulli observations, BRAVO is limited to ballot-polling audits and to using samples drawn with replacement, both of which limit efficiency and applicability. (A ballot-polling audit involves manually interpreting randomly selected ballot cards, but does not use the voting system’s interpretation of individual ballot cards or groups of ballot cards, other than the reported results. As discussed below, comparison audits, which compare the voting system’s interpretation of ballot cards to manual interpretations of the same cards, can be more efficient.)
Auditing a plurality contest with BRAVO involves using the SPRT to test a number of hypotheses: for each reported winner $w$ and each reported loser $\ell$, let $\theta_{w\ell}$ be the conditional probability that a ballot card selected at random with replacement from all ballot cards validly cast in the contest shows a valid vote for $w$, given that it shows a valid vote for $w$ or for $\ell$, and let $\eta_{w\ell}$ be the number of votes reported for $w$, divided by the total votes reported for $w$ and $\ell$ combined. BRAVO tests the hypotheses $\theta_{w\ell} = 1/2$ against the alternatives $\theta_{w\ell} = \eta_{w\ell}$ for every $(w, \ell)$ pair.
BRAVO for a supermajority contest can be simpler or more involved than for a plurality contest. Suppose that the contest requires a candidate to receive at least a fraction $f$ of the valid votes to be a winner. (We allow the possibility that $f \le 1/2$, in which case “supermajority” is a misnomer and there can be more than one winner.) Suppose candidate $w$ is reported to be a winner. Let $\theta_w$ denote the conditional probability that a ballot card selected at random from all ballot cards validly cast in the contest shows a valid vote for $w$, given that it shows a valid vote for any candidate in the contest, and let $\eta_w$ be the number of votes reported for $w$, divided by the total valid votes reported in the contest. BRAVO uses the SPRT to test the hypothesis $\theta_w = f$ against the alternative $\theta_w = \eta_w$ for each reported winner. If $f > 1/2$, there can be only one reported winner. If $f \le 1/2$, there can be more than one, in which case that hypothesis needs to be tested for all candidates (not just the reported winners), to confirm that (only) the reported winner(s) won. If it is reported that no candidate received at least a fraction $f$ of the valid votes, BRAVO tests, for each candidate $c$, the hypothesis that $c$ received a fraction $f$ of the valid votes against the alternative that $c$ received the reported fraction $\eta_c$ of the valid votes, to confirm that none received $f$ or more.
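Consider independent, identically distributed (IID) draws from a population $\{x_i\}_{i=1}^N$, where each $x_i \in \{0, 1\}$ (the population is binary). Let $\theta$ be the population fraction of 1s. Let $X_k$ be the value of $x_i$ selected on the $k$th draw. Then $\Pr\{X_k = 1\} = \theta$ and $\Pr\{X_k = 0\} = 1 - \theta$. By independence, the probability of a sequence $X_1 = y_1, \ldots, X_j = y_j$ is the product of the probabilities of the terms, which can be written
$$\Pr\{X_1 = y_1, \ldots, X_j = y_j\} = \prod_{k=1}^j \theta^{y_k} (1-\theta)^{1-y_k}. \tag{1}$$
The ratio of the probability of the sequence if $\theta = \eta$ to the probability if $\theta = \mu$ is
$$T_j := \prod_{k=1}^j \frac{\eta^{X_k} (1-\eta)^{1-X_k}}{\mu^{X_k} (1-\mu)^{1-X_k}} = \prod_{k=1}^j \left( X_k \frac{\eta}{\mu} + (1 - X_k) \frac{1-\eta}{1-\mu} \right). \tag{2}$$
Wald’s SPRT rejects the hypothesis that $\theta = \mu$ at significance level $\alpha$ if $T_j \ge 1/\alpha$ for any $j$. That is, $\Pr_{\theta = \mu}\{\sup_j T_j \ge 1/\alpha\} \le \alpha$: the SPRT is a sequentially valid test. Moreover, $\min(1, 1/T_j)$ is an anytime $P$-value for the hypothesis $\theta = \mu$. That is, for any $p \in [0, 1]$, $\Pr_{\theta = \mu}\{\inf_j T_j^{-1} \le p\} \le p$.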
The SPRT is quite general; this is perhaps the simplest example. Wald’s proof that the SPRT is sequentially valid is complicated, but Ville’s inequality (Ville, 1939) yields a simple proof. Given a sequence of random variables $(X_j)_{j \ge 1}$, let $X^j$ denote the finite sequence $(X_1, \ldots, X_j)$. A sequence of absolutely integrable random variables $(T_j)$ is a martingale with respect to a sequence of random variables $(X_j)$ if $\mathbb{E}(T_j \mid X^{j-1}) = T_{j-1}$. It is a supermartingale if $\mathbb{E}(T_j \mid X^{j-1}) \le T_{j-1}$. The expected value of every term of a martingale is the same. A (super)martingale is nonnegative if $\Pr\{T_j \ge 0\} = 1$ for all $j$. Ville’s inequality is a version of Markov’s inequality for supermartingales: the chance that a nonnegative supermartingale ever exceeds any multiple of its initial expected value is at most the reciprocal of that multiple. That is, if $(T_j)_{j \ge 0}$ is a nonnegative supermartingale with respect to $(X_j)_{j \ge 1}$, then for any $c > 0$, $\Pr\{\exists j : T_j \ge c\, \mathbb{E} T_0\} \le 1/c$.
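The Bernoulli SPRT statistic $(T_j)$ is a martingale with respect to $(X_j)$ if $\theta = \mu$:
$$\mathbb{E}(T_j \mid X^{j-1}) = T_{j-1}\, \mathbb{E}\left( X_j \frac{\eta}{\mu} + (1 - X_j) \frac{1-\eta}{1-\mu} \right) = T_{j-1} \left( \mu \frac{\eta}{\mu} + (1-\mu) \frac{1-\eta}{1-\mu} \right) = T_{j-1}. \tag{3}$$
Because $\mathbb{E} T_1 = 1$, Ville’s inequality implies that $\Pr_{\theta = \mu}\{\exists j : T_j \ge 1/\alpha\} \le \alpha$. (More generally, sequences of likelihood ratios are nonnegative martingales with respect to the distribution in the denominator.)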
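The test just described can be sketched in a few lines of Python. This is a minimal illustration, not BRAVO's reference implementation: the function name `sprt_p_values` and the argument names `mu` (null value) and `eta` (alternative value) are hypothetical, and the draws are assumed to be coded as 0 or 1.

```python
def sprt_p_values(xs, mu, eta):
    """Anytime P-values min(1, 1/T_j) for the Bernoulli SPRT.

    xs  : sequence of 0/1 draws (assumed IID, i.e., sampled with replacement)
    mu  : value of theta under the null hypothesis
    eta : value of theta under the simple alternative
    """
    t = 1.0
    p_values = []
    for x in xs:
        # Likelihood-ratio factor: eta/mu for a 1, (1 - eta)/(1 - mu) for a 0.
        t *= x * eta / mu + (1 - x) * (1 - eta) / (1 - mu)
        p_values.append(min(1.0, 1.0 / t) if t > 0 else 1.0)
    return p_values


# Two draws that both show a vote for the reported winner.
p = sprt_p_values([1, 1], mu=0.5, eta=0.75)
```

The null is rejected at level `alpha` as soon as any entry of the returned list is at most `alpha`.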
Wald (1945) proved that among all sequentially valid tests of the hypothesis $\theta = \mu$, the SPRT with alternative $\theta = \eta$ has the smallest expected sample size to reject when in fact $\theta = \eta$. But when $\mu < \theta < \eta$, the SPRT can fail to reject the null, continuing to sample forever, and when $\theta > \eta$, it can be very inefficient. As a result, when reported vote shares are incorrect but the reported winner(s) really won, BRAVO can require enormous samples, even when the true margin is large.
This paper introduces ALPHA, a simple adaptive extension of BRAVO. It is motivated by the SPRT for the Bernoulli distribution and its optimality when the simple alternative is true. While BRAVO tests against the alternative that the true vote shares are equal to the reported vote shares, ALPHA combines the reported results with the audit sample to estimate the reported winner’s share of the vote before the $j$th ballot card is examined, using what the first $j-1$ ballot cards have shown. ALPHA also generalizes BRAVO to situations where the population values $\{x_i\}$ are not necessarily binary, but merely nonnegative and bounded: the Bernoulli parameter $\theta$ is just the population mean of a binary population, but the martingale property continues to hold when $\theta$ is the mean of a nonnegative, bounded population. That generalization allows ALPHA to be used with SHANGRLA to audit supermajority contests and to conduct comparison audits of a wide variety of social choice functions: any for which there is a SHANGRLA audit. In contrast, BRAVO requires the list elements to be binary-valued. Finally, ALPHA works for sampling with or without replacement, while BRAVO is specifically for sampling with replacement (IID observations). (The SPRT for a population percentage using sampling without replacement is straightforward, but was not in the original BRAVO paper (Lindeman, Stark and Yates, 2012).)
2 ALPHA and SHANGRLA
2.1 SHANGRLA
Before introducing ALPHA, we provide additional motivation for constructing a more general test than BRAVO: the SHANGRLA framework for RLAs. SHANGRLA (Stark, 2020) checks outcomes by testing half-average assertions, each of which claims that the mean of a finite list of numbers between $0$ and $u$ is greater than $1/2$. Each list of numbers results from applying an assorter to the ballot cards. The assorter uses the votes and possibly other information (e.g., how the voting system interpreted the ballot card) to assign a number between $0$ and $u$ to each ballot card. Some assorters assign only the numbers $0$ and $1$ or $0$, $1/2$, and $1$, but for others, there are more possible values.
The correctness of the outcomes under audit is implied by the intersection of a collection of such assertions; the assertions depend on the social choice function, the number of candidates, and other details (Stark, 2020). SHANGRLA tests the negation of each assertion, i.e., it tests the complementary null hypothesis that each assorter mean is not greater than $1/2$. If that hypothesis is rejected for every assertion, the audit concludes that the outcome is correct. Otherwise, the audit expands, potentially to a full hand count. If every null is tested at level $\alpha$, this results in a risk-limiting audit with risk limit $\alpha$: if the outcome is not correct, the chance the audit will stop shy of a full hand count is at most $\alpha$. No adjustment for multiple testing is needed (Stark, 2020).
The core, canonical statistical problem in SHANGRLA is to test the hypothesis that $\theta := \frac{1}{N} \sum_{i=1}^N x_i \le 1/2$ using a sample from a finite population $\{x_i\}_{i=1}^N$, where each $x_i \in [0, u]$, with $u$ known. (Footnote 3: An equivalent problem is to test the hypothesis that $\bar{y} \le \mu$ using a sample from $\{y_i\}_{i=1}^N$, where each $y_i \in [0, 1]$; let $y_i := x_i/u$ and set $\mu := 1/(2u)$.) This formulation unifies polling audits and comparison audits; the difference is only in how the values $\{x_i\}$ are calculated; see section 2.4. The sample might be drawn with or without replacement; it might be drawn from the population as a whole (unstratified sampling), or the population might be divided into strata, each of which is sampled independently (stratified sampling). Or it might be drawn using Bernoulli sampling, where each item is included independently, with some common probability. Or batches (clusters) of ballot cards might be sampled instead of individual cards; see section 4.
For instance, consider one reported winner and one reported loser in a single-winner or multi-winner plurality contest (any number of pairs can be audited simultaneously using the same sample (Stark, 2020)). Let $N$ denote the number of ballot cards validly cast in the contest. The assorter assigns the $i$th ballot card the value $x_i = 1$ if the card has a valid vote for the reported winner, the value $x_i = 0$ if it has a valid vote for the reported loser, and the value $x_i = 1/2$ otherwise. That reported winner really beat that reported loser if $\frac{1}{N} \sum_{i=1}^N x_i > 1/2$. In a multi-winner plurality contest with $W$ reported winners and $L$ reported losers, the reported winners really won if the mean of every one of the $WL$ lists for the (reported winner, reported loser) pairs is greater than $1/2$.
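The plurality assorter described above can be sketched as follows. The function name `plurality_assorter`, the candidate names, and the representation of a ballot card as a single string (or `None` for no valid vote) are illustrative assumptions, not SHANGRLA's actual interface.

```python
def plurality_assorter(vote, winner, loser):
    """Assign 1 for a valid vote for the reported winner, 0 for a valid vote
    for the reported loser, and 1/2 for anything else."""
    if vote == winner:
        return 1.0
    if vote == loser:
        return 0.0
    return 0.5


# Hypothetical contest: 5 votes for Alice, 3 for Bob, 2 for Carol, 1 non-vote.
cards = ["Alice"] * 5 + ["Bob"] * 3 + ["Carol"] * 2 + [None]
values = [plurality_assorter(v, winner="Alice", loser="Bob") for v in cards]
assorter_mean = sum(values) / len(values)

# Alice really beat Bob exactly when the assorter mean exceeds 1/2.
alice_beat_bob = assorter_mean > 0.5
```

Votes for third candidates and non-votes pull the mean toward 1/2 without changing which side of 1/2 it falls on.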
2.2 The ALPHA martingale test
We start by developing a one-sided test of the hypothesis $\theta = \mu$, then show the $P$-value is monotone in $\mu$, so the test is valid for the hypothesis $\theta \le \mu$, as SHANGRLA requires. Let $X^j := (X_1, \ldots, X_j)$. Assume $X_i \in [0, u_i]$ for some known $u_i$. For ballot-polling audits of plurality contests, $u_i = 1$. For sampling uniformly from bounded, finite populations, $u_i$ is the upper bound on the elements of the population. See section 4 for an example (weighted sampling without replacement) where $u_i$ varies with $i$.
Let $\mu_j := \mathbb{E}(X_j \mid X^{j-1})$ computed under the null hypothesis $\theta = \mu$. Let $\eta_j = \eta_j(X^{j-1})$, $j = 1, 2, \ldots$, be a predictable sequence in the sense that $\eta_j$ may depend on $X^{j-1}$, but not on $X_k$ for $k \ge j$. We now define the ALPHA martingale $(T_j)$. Let $T_0 := 1$ and
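$$T_j := T_{j-1}\, u_j^{-1} \left( X_j \frac{\eta_j}{\mu_j} + (u_j - X_j) \frac{u_j - \eta_j}{u_j - \mu_j} \right), \quad j = 1, 2, \ldots \tag{4}$$
This can be rearranged to yield
$$T_j = T_{j-1} \left( \frac{X_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right). \tag{5}$$
Equivalently,
$$T_j = \prod_{i=1}^j \left( \frac{X_i}{\mu_i} \cdot \frac{\eta_i - \mu_i}{u_i - \mu_i} + \frac{u_i - \eta_i}{u_i - \mu_i} \right), \quad j \ge 1. \tag{6}$$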
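Under the null hypothesis, $T_j$ is nonnegative since $X_j$, $\mu_j$, and $\eta_j$ are all in $[0, u_j]$. Also,
$$\mathbb{E}(T_j \mid X^{j-1}) = T_{j-1} \left( \frac{\mu_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right) = T_{j-1}. \tag{7}$$
Thus if $\theta = \mu$, $(T_j)$ is a nonnegative martingale with respect to $(X_j)$, starting at $1$. If $\theta < \mu$, then $\mathbb{E}(X_j \mid X^{j-1}) < \mu_j$ and $r_j := \mathbb{E}(X_j \mid X^{j-1})/\mu_j < 1$, so, provided $\eta_j \ge \mu_j$,
$$\mathbb{E}(T_j \mid X^{j-1}) = T_{j-1} \left( r_j \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right) \le T_{j-1}. \tag{8}$$
Thus $(T_j)$ is a nonnegative supermartingale starting at $1$ if $\theta \le \mu$. It follows from Ville’s inequality (Ville, 1939) that if $\theta \le \mu$,
$$\Pr\{\exists j : T_j \ge \alpha^{-1}\} \le \alpha. \tag{9}$$
That is, $\min(1, 1/T_j)$ is an “anytime $P$-value” for the null hypothesis $\theta \le \mu$.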
Note that the derivation did not use any information about $\{x_i\}$ other than $X_j \in [0, u_j]$: it applies to populations that are nonnegative and bounded, not merely binary populations, to sampling with or without replacement, and to weighted sampling. That means it can be used generically to test any SHANGRLA assertion, allowing it to be used to audit a wide variety of social choice functions (including plurality, multi-winner plurality, supermajority, D’Hondt and other proportional representation schemes, Borda count, approval voting, STAR-Voting, arbitrary scoring rules, and IRV) using sampling with or without replacement, with or without stratification, and sampling uniformly or using probability proportional to “size” (see section 4). The ALPHA martingales comprise the same family of betting martingales studied by Waudby-Smith and Ramdas (2021); Waudby-Smith, Stark and Ramdas (2021), but are parametrized differently; see section 2.3 below.
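As a numerical check on the algebra, the sketch below implements one multiplicative term of the ALPHA martingale in the two forms given in (4) and (5) and accumulates the product for a constant alternative. It assumes sampling with replacement (so the null mean is the same for every draw); the function names and the default values `mu=0.5`, `eta=0.7`, `u=1.0` are hypothetical choices for illustration.

```python
def alpha_term_eq4(x, mu_j, eta_j, u_j):
    """One factor of T_j written as in (4): (x*eta/mu + (u-x)(u-eta)/(u-mu)) / u."""
    return (x * eta_j / mu_j + (u_j - x) * (u_j - eta_j) / (u_j - mu_j)) / u_j


def alpha_term_eq5(x, mu_j, eta_j, u_j):
    """The same factor written as in (5): (x/mu)(eta-mu)/(u-mu) + (u-eta)/(u-mu)."""
    return (x / mu_j) * (eta_j - mu_j) / (u_j - mu_j) + (u_j - eta_j) / (u_j - mu_j)


def alpha_martingale(xs, mu=0.5, eta=0.7, u=1.0):
    """Return [T_1, ..., T_n] for draws xs in [0, u], with constant mu_j and eta_j."""
    t, path = 1.0, []
    for x in xs:
        t *= alpha_term_eq5(x, mu, eta, u)
        path.append(t)
    return path


# The data need not be binary: 0.5 is a legitimate assorter value.
path = alpha_martingale([1.0, 0.5, 0.0, 1.0])
```

With these inputs the factors are 1.4, 1.0, 0.6 and 1.4, so the statistic only grows when the data favor the alternative.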
2.2.1 Sampling without replacement.
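To use ALPHA to make tests about the mean $\theta$ of a finite population from a sample drawn without replacement, we need $\mathbb{E}(X_j \mid X^{j-1})$ computed on the assumption that $\theta := \frac{1}{N} \sum_{i=1}^N x_i = \mu$. For sampling without replacement from a population with mean $\mu$, after draw $j-1$, the mean of the remaining numbers is $(N\mu - \sum_{k=1}^{j-1} X_k)/(N - j + 1)$. Thus the conditional expectation of $X_j$ given $X^{j-1}$ under the null is $\mu_j = (N\mu - \sum_{k=1}^{j-1} X_k)/(N - j + 1)$. If $N\mu - \sum_{k=1}^{j-1} X_k < 0$ for any $j$, the null hypothesis is certainly false.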
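The conditional null mean for sampling without replacement is easy to compute directly. The helper below is a small sketch; its name and signature are hypothetical.

```python
def null_mean_without_replacement(N, mu, previous_draws):
    """Mean of the values still in the population before the next draw,
    assuming the population of size N has mean exactly mu.

    previous_draws holds the values X_1, ..., X_{j-1} already observed;
    it must contain fewer than N values.
    """
    remaining = N - len(previous_draws)          # equals N - j + 1
    return (N * mu - sum(previous_draws)) / remaining


# Population of 10 values with hypothesized mean 1/2; two 1s already drawn.
m3 = null_mean_without_replacement(10, 0.5, [1, 1])

# A negative value signals that the null hypothesis is certainly false.
m4 = null_mean_without_replacement(4, 0.5, [1, 1, 1])
```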
2.2.2 BRAVO is a special case of ALPHA.
BRAVO is ALPHA with the following restrictions:
- the sample is drawn with replacement from ballot cards that do have a valid vote for the reported winner $w$ or the reported loser $\ell$ (ballot cards with votes for other candidates or non-votes are ignored)
- ballot cards are encoded as 1 or 0, depending on whether they have a valid vote for the reported winner or for the reported loser; $u = 1$ and the only possible values of $x_i$ are 0 and 1
- $\mu = 1/2$, and $\mu_j = 1/2$ for all $j$ since the sample is drawn with replacement
- $\eta_j = \eta_0 := N_w/(N_w + N_\ell)$, where $N_w$ is the number of votes reported for candidate $w$ and $N_\ell$ is the number of votes reported for candidate $\ell$: $\eta_j$ is not updated as data are collected
It follows from Wald (1945) that BRAVO minimizes the expected sample size to reject the null hypothesis $\theta = 1/2$ when $w$ really received the share $\eta_0$ of the reported votes. The motivation for this paper is that $w$ almost never receives exactly their reported vote share, and BRAVO (and other RLA methods that rely on the reported vote share) may then have poor performance, even though they are still guaranteed to limit the risk that an incorrect result will become final to at most $\alpha$.
When the reported vote shares are incorrect, using a method that adapts to the observed audit data can help, as we shall see.
2.3 Relationship to RiLACS and Betting Martingales
Waudby-Smith and Ramdas (2021); Waudby-Smith, Stark and Ramdas (2021) develop tests and confidence sequences for the mean of a bounded population using “betting” martingales of the form
$$M_j := \prod_{i=1}^j \bigl( 1 + \lambda_i (X_i - \mu_i) \bigr), \tag{10}$$
where, as above, $\mu_i := \mathbb{E}(X_i \mid X^{i-1})$ computed on the assumption that the null hypothesis is true. The sequence $(M_j)$ can be viewed as the fortune of a gambler in a series of wagers. The gambler starts with a stake of $1$ unit and bets a fraction $\lambda_i$ of their current wealth on the outcome of the $i$th wager. The value $M_j$ is the gambler’s wealth after the $j$th wager. The gambler is not permitted to borrow money, so to ensure that when $X_i = 0$ (corresponding to losing the $i$th bet) the gambler does not end up in debt ($M_i < 0$), $\lambda_i$ cannot exceed $1/\mu_i$.
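The ALPHA martingale is of the same form:
$$\frac{T_j}{T_{j-1}} = \frac{X_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} = 1 + \frac{\eta_j/\mu_j - 1}{u_j - \mu_j} (X_j - \mu_j), \tag{11}$$
identifying $\lambda_j \equiv (\eta_j/\mu_j - 1)/(u_j - \mu_j)$. Choosing $\lambda_j$ is equivalent to choosing $\eta_j$:
$$\lambda_j = \frac{\eta_j/\mu_j - 1}{u_j - \mu_j} \iff \eta_j = \mu_j \bigl( 1 + \lambda_j (u_j - \mu_j) \bigr). \tag{12}$$
As $\eta_j$ ranges from $\mu_j$ to $u_j$, $\lambda_j$ ranges continuously from $0$ to $1/\mu_j$, the same range of values of $\lambda_j$ permitted in Waudby-Smith and Ramdas (2021); Waudby-Smith, Stark and Ramdas (2021): selecting $\lambda_j$ is equivalent to selecting a method for estimating $\theta$. That is, the ALPHA martingales are identical to the betting martingales in Waudby-Smith and Ramdas (2021); Waudby-Smith, Stark and Ramdas (2021); the difference is only in how $\lambda_j$ is chosen. (ALPHA also makes explicit that the upper bound on $X_j$ can depend on $j$.)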
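The correspondence between the bet and the alternative can be checked numerically. In this sketch the function names are hypothetical, and `mu`, `u`, `eta` and `lam` stand for the null conditional mean, the upper bound, the alternative, and the bet for a single draw.

```python
def eta_to_lambda(eta, mu, u):
    """Bet implied by the alternative eta: (eta/mu - 1) / (u - mu)."""
    return (eta / mu - 1.0) / (u - mu)


def lambda_to_eta(lam, mu, u):
    """Alternative implied by the bet lam: mu * (1 + lam * (u - mu))."""
    return mu * (1.0 + lam * (u - mu))


def betting_term(x, lam, mu):
    """One factor of the betting martingale: 1 + lam * (x - mu)."""
    return 1.0 + lam * (x - mu)


def alpha_term(x, eta, mu, u):
    """One factor of the ALPHA martingale in its rearranged form."""
    return (x / mu) * (eta - mu) / (u - mu) + (u - eta) / (u - mu)


mu, u, eta = 0.5, 1.0, 0.7
lam = eta_to_lambda(eta, mu, u)          # 0.8 for these inputs
same = all(
    abs(betting_term(x, lam, mu) - alpha_term(x, eta, mu, u)) < 1e-12
    for x in (0.0, 0.3, 0.5, 1.0)
)
```

At the endpoints, `eta = mu` gives a bet of 0 and `eta = u` gives the maximum permitted bet `1/mu`.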
Waudby-Smith and Ramdas (2021); Waudby-Smith, Stark and Ramdas (2021) consider two classes of strategies for picking $\lambda_j$ intended to maximize the expected rate at which the gambler’s wealth grows. One of the classes is approximately optimal if $\theta$ is known (much like BRAVO is optimal when the reported results are correct); the other does not use prior information, instead adapting to the data. The ALPHA representation of the betting martingales provides a family of tradeoffs between those extremes, using different estimates of $\theta$ based on $X^{j-1}$ and $\eta_0$. Parametrizing the selection of $\lambda_j$ in terms of an estimate $\eta_j$ of $\theta$ may aid intuition in developing more powerful martingale tests (in the sense that they tend to reject for smaller samples) for particular applications, such as election audits.
2.4 Comparison audits
In the SHANGRLA framework, there is no formal difference between polling audits (which do not use the voting system’s interpretation of ballot cards) and comparison audits, which involve comparing how the voting system interpreted cards to how humans interpret the same cards. Either way, the correctness of the election outcome is implied by a collection of assertions, each of which is of the form, “the average of this list of $N$ numbers in $[0, u]$ is greater than $1/2$.” The only difference is the particular function that assigns numbers in $[0, u]$ to ballot cards. For polling audits, the number assigned to a card depends on the votes on that card as interpreted by a human (and on the social choice function and other parameters of the contest), but not on how the voting system interpreted the card. For comparison audits, the number also depends on how the system interpreted that card and on the reported “assorter margin.” See Stark (2020) for details.
Because ALPHA can test the hypothesis that the mean of a bounded, nonnegative population is not greater than 1/2 (even for populations with more than two values), it works for comparison and polling audits with no modification. The interpretation of $\theta$ is different: instead of being related to vote shares, it is related to the amount of overstatement error in the system’s interpretation of the ballot card. For comparison audits, the initial value for the alternative, $\eta_0$, could be chosen by making assumptions about how often the system made errors of various kinds. The risk is rigorously limited even if those assumptions are wrong, but the choice affects the performance.
To conduct a comparison RLA, auditors export subtotals or other vote records from the voting system and commit to them (e.g., by publishing them). Election auditors first check whether the exported records agree with the overall reported contest outcomes. If not, the election fails the audit: even according to the voting system, some reported outcome is wrong. If things add up, the audit next checks whether differences between the voting system’s exported records and a human interpretation of the votes on ballot cards could have altered any reported outcome, by manually checking a random sample of the voting system’s exported records against a manual interpretation of the votes on the corresponding physical ballot cards or batches of ballot cards.
This is like checking an expense report. Committing to the subtotals is like submitting the expense report. An auditor can check the accuracy of the report by first checking the addition (checking whether the exported batch-level results produce the reported contest outcomes), then manually checking a sample of the reported expenses against the physical paper receipts (checking the accuracy of the machine interpretation of the cards).
2.5 Setting $\eta_j$ to be an estimate of $\theta_j$
Since the SPRT minimizes the expected sample size to reject the null when the alternative is true, we might be able to construct an efficient test by using as the alternative an estimate of $\theta_j := \mathbb{E}(X_j \mid X^{j-1})$, the true conditional mean of the $j$th draw (which equals $\theta$ for sampling with replacement), based on the audit data and the reported results. Any estimate of $\theta_j$ that does not depend on $X_k$ for $k \ge j$ preserves the (super)martingale property under the null, and the auditor has the freedom to “change horses” and use a different estimator at will as the sample evolves. For example, $\eta_j$ might be constant, as it is in BRAVO. Or it could be constant for the first 100 draws, then switch to the unbiased estimate of $\theta_j$ based on $X^{j-1}$ once $j > 100$. Or it could be a Bayes estimate of $\theta_j$ using data $X^{j-1}$ and a prior concentrated on $(\mu, u]$, centered at the value of $\theta$ implied by the reported results. (The $P$-value of the test is still a frequentist $P$-value; the estimate affects the power.) Or it could give the sample mean a weight that depends on the sample variance or the standard error of the sample mean, giving it more weight when the variability is small. Or it could be the estimate implied by choosing $\lambda_j$ using one of the methods for selecting $\lambda_j$ described by Waudby-Smith and Ramdas (2021).
2.5.1 Naively maximizing $\mathbb{E}(T_j \mid X^{j-1})$ does not work.
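Suppose that $\theta > \mu$, i.e., that the alternative hypothesis is true. What value of $\eta_j$ maximizes $\mathbb{E}(T_j \mid X^{j-1})$? With $\theta_j = \mathbb{E}(X_j \mid X^{j-1})$ the true conditional mean of the $j$th draw,
$$\mathbb{E}(T_j \mid X^{j-1}) = T_{j-1} \left( \frac{\theta_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u_j - \mu_j} + \frac{u_j - \eta_j}{u_j - \mu_j} \right) = T_{j-1}\, \frac{\eta_j (\theta_j/\mu_j - 1) + u_j - \theta_j}{u_j - \mu_j}. \tag{13}$$
This is monotone increasing in $\eta_j$ (because $\theta_j > \mu_j$), so it is maximized for $\eta_j = u_j$, for a single draw. But if $\eta_j = u_j$ and $X_j = 0$, then $T_k = 0$ for all $k \ge j$, and the test will never reject the null hypothesis, no matter how many more data are collected. This is essentially the observation made by Kelly (Kelly, 1956) in his development of the Kelly criterion. Keeping $\eta_j < u_j$ hedges against that possibility.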
Instead of picking $\eta_j$ to maximize the next term $T_j$, one can pick it to maximize the rate at which $T_j$ grows. In the binary data case, the Kelly criterion (Kelly, 1956), discussed by Waudby-Smith and Ramdas (2021), leads to the optimal choice when $\theta$ is known. For sampling with replacement, this is $\lambda_j = (\theta - \mu)/(\mu(1 - \mu))$, which equals $4\theta - 2$ when $\mu = 1/2$. This corresponds to $\eta_j = \theta$.
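For reference, the Kelly-optimal bet in the binary case follows from a short calculation. The sketch assumes IID draws with $\Pr\{X = 1\} = \theta$, upper bound $u = 1$, and a constant bet $\lambda$; the symbol $g(\lambda)$ for the expected log growth per draw is introduced here only for illustration.

```latex
\begin{align*}
g(\lambda) &:= \mathbb{E} \log\bigl(1 + \lambda (X - \mu)\bigr)
  = \theta \log\bigl(1 + \lambda (1-\mu)\bigr) + (1-\theta) \log(1 - \lambda\mu), \\
g'(\lambda) &= \frac{\theta (1-\mu)}{1 + \lambda (1-\mu)} - \frac{(1-\theta)\mu}{1 - \lambda\mu} = 0
  \iff \lambda = \frac{\theta - \mu}{\mu (1-\mu)}, \\
\eta &= \mu \bigl(1 + \lambda (1-\mu)\bigr) = \mu + (\theta - \mu) = \theta .
\end{align*}
```

Because $g$ is concave in $\lambda$, the stationary point is the maximizer, and for $\mu = 1/2$ it reduces to $\lambda = 4\theta - 2$.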
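2.5.2 Illustration: a simple way to select $\eta_j$.
Any choice of $\eta_j$ that depends only on $X^{j-1}$ preserves the martingale property and thus the validity of the ALPHA test. To show the potential of ALPHA, the simulations reported below are based on setting $\eta_j$ to be a simple “truncated shrinkage” estimate of $\theta$. The estimator shrinks towards the reported result $\eta_0$ as if the reported result were the mean of $d > 0$ draws from the population ($d$ is not necessarily an integer). To ensure that the alternative hypothesis corresponds to the reported winner really winning, we need $\eta_j > \mu_j$, and to keep the estimate consistent with the constraint that $X_j \le u$, we need $\eta_j \le u$. The following estimate of $\theta$ meets both requirements:
$$\eta_j := \left( \frac{d \eta_0 + \sum_{k=1}^{j-1} X_k}{d + j - 1} \vee (\mu_j + \epsilon_j) \right) \wedge u, \tag{14}$$
where $\vee$ denotes the maximum, $\wedge$ denotes the minimum, and $\epsilon_j > 0$ is a small margin whose choice is discussed below.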
Choosing $\eta_0$. The starting value $\eta_0$ could be the value of $\theta$ implied by the reported results. For a polling audit, that might be based on the reported margin in a plurality contest. For a comparison audit, that might be based on historical experience with tabulation error. But the procedure could be made “fully adaptive” by starting with, say, $\eta_0 = (u + \mu)/2$ or $\eta_0 = u$.
Choosing $d$. As $d \to \infty$, the sample size for ALPHA approaches that of BRAVO (in the binary data case). The larger $d$ is, the more strongly anchored the estimate is to the reported vote shares, and the smaller the penalty ALPHA pays when the reported results are exactly correct. Using a small value of $d$ is particularly helpful when the true population mean is far from the reported results. The smaller $d$ is, the faster the method adapts to the true population mean, but the higher the variance is. Whatever $d$ is, the relative weight of the reported vote shares decreases as the sample size increases.
Choosing $\epsilon_j$. To allow the estimated winner’s share $\eta_j$ to approach $\mu_j$ as the sample grows (if the sample mean approaches $\mu_j$ or less), we shall take $\epsilon_j := c/\sqrt{d + j - 1}$ for a nonnegative constant $c$, for instance $c = (\eta_0 - \mu)/2$. The estimate $\eta_j$ is thus the sample mean, shrunk towards $\eta_0$ and truncated to the interval $[\mu_j + \epsilon_j, u]$, where $\epsilon_j \to 0$ as the sample size grows.
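A minimal sketch of the truncated shrinkage estimate follows. The function name `shrink_trunc_eta` and the default values `d=100.0` and `c=0.1` are hypothetical choices for illustration; `prev_sum` is the sum of the first `j - 1` draws and `mu_j` is the conditional null mean for draw `j`.

```python
import math


def shrink_trunc_eta(j, prev_sum, mu_j, eta0, u=1.0, d=100.0, c=0.1):
    """Truncated shrinkage estimate used as the alternative for draw j.

    The sample sum is pooled with eta0 as if eta0 were the mean of d earlier
    draws; the result is kept at least eps_j above mu_j and at most u.
    """
    eps_j = c / math.sqrt(d + j - 1)
    shrunk = (d * eta0 + prev_sum) / (d + j - 1)
    return min(u, max(shrunk, mu_j + eps_j))


# Before any data: the estimate is the reported value eta0.
first = shrink_trunc_eta(1, 0.0, mu_j=0.5, eta0=0.7)

# After 100 draws with sample mean 0.3, the pooled mean is 0.5,
# so the lower truncation at mu_j + eps_j takes over.
later = shrink_trunc_eta(101, 30.0, mu_j=0.5, eta0=0.7)
```

Larger `d` keeps the estimate near `eta0` for longer; smaller `d` lets the sample mean dominate sooner.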
3 Pseudo-algorithm for ballot-level comparison and ballot-polling audits
The algorithm below is written for a single SHANGRLA assertion, but the audit can be conducted in parallel for any number of assertions using the same sampled ballot cards; no multiplicity adjustment for the number of assertions is needed. There are assorters for polling audits, which do not use information about how the voting system interpreted ballot cards, and for comparison audits, which require the voting system to commit to how it interpreted each ballot card before the audit starts. For comparison audits, the first step is to verify that the data exported from the voting system reproduce the reported election outcome, that is, to check whether applying the social choice function to the cast vote records (CVRs) gives the same winners. We shall assume that a compliance audit has shown that the paper trail is trustworthy. For comparison audits, we assume that the system has exported a CVR for every ballot card. (For methods to deal with a mismatch between the number of ballot cards and the number of CVRs, see Stark (2020).)

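- Set audit parameters:
  - Select the risk limit $\alpha \in (0, 1)$; decide whether to sample with or without replacement.
  - Set $u$ as appropriate for the assertion under audit and the sampling method; for uniform sampling of ballot cards with or without replacement, $u$ is the assorter upper bound.
  - $N$ is the number of ballot cards in the population of cards from which the sample is drawn.
  - Set $\eta_0$. For polling audits, $\eta_0$ could be the reported mean value of the assorter. For comparison audits, $\eta_0$ could be based on assumed or historical error rates.
  - Define the function to update $\eta$ based on the sample, e.g., $\eta(j, S) = \left( \frac{d \eta_0 + S}{d + j - 1} \vee (\mu_j + \epsilon(j)) \right) \wedge u$, where $S$ is the sample sum of the first $j-1$ draws and $\epsilon(j) = c/\sqrt{d + j - 1}$; set any free parameters in the function (e.g., $d$ and $c$ in this example). The only requirement is that $\eta(j, S) \in (\mu_j, u]$, where $\mu_j := \mathbb{E}(X_j \mid X^{j-1})$ is computed under the null.
- Initialize variables:
  - $j \leftarrow 0$: sample number
  - $T \leftarrow 1$: test statistic
  - $S \leftarrow 0$: sample sum
  - $m \leftarrow 1/2$: population mean under the null
- While $T < 1/\alpha$ and not all ballot cards have been audited:
  - Draw a ballot card at random.
  - $j \leftarrow j + 1$
  - Determine $X_j$ by applying the assorter to the selected ballot card (and the CVR, for comparison audits).
  - If $m < 0$, $T \leftarrow \infty$. Otherwise, $T \leftarrow T\, u^{-1} \left( X_j \frac{\eta(j, S)}{m} + (u - X_j) \frac{u - \eta(j, S)}{u - m} \right)$.
  - $S \leftarrow S + X_j$
  - For sampling without replacement, $m \leftarrow (N/2 - S)/(N - j)$.
  - If desired, break and conduct a full hand count instead of continuing to audit.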
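The steps above can be sketched in Python for sampling without replacement. This is an illustration under stated assumptions, not a production audit tool: the names `shrink_trunc` and `alpha_audit`, the default parameter values, and the use of a seeded shuffle as the source of randomness are all hypothetical, and the handling of the boundary cases `m == 0` and `m >= u` is a choice made for this sketch rather than something the algorithm specifies.

```python
import math
import random


def shrink_trunc(j, s, m, eta0, u, d, c):
    """Example update rule eta(j, S): shrink toward eta0, then truncate."""
    eps = c / math.sqrt(d + j - 1)
    return min(u, max((d * eta0 + s) / (d + j - 1), m + eps))


def alpha_audit(assorter_values, alpha=0.05, eta0=0.7, u=1.0, d=100.0, c=None, seed=0):
    """Sample assorter values without replacement until T >= 1/alpha or every
    card has been examined. Returns (number of draws, final T)."""
    N = len(assorter_values)
    if c is None:
        c = (eta0 - 0.5) / 2
    order = list(range(N))
    random.Random(seed).shuffle(order)   # a random order is a sample without replacement
    j, t, s, m = 0, 1.0, 0.0, 0.5        # sample number, statistic, sample sum, null mean
    while t < 1 / alpha and j < N:
        x = assorter_values[order[j]]
        j += 1
        if m < 0:
            t = math.inf                 # the null hypothesis is certainly false
        elif m >= u:
            t = 0.0                      # sketch's choice: the null cannot be rejected
        elif m == 0:
            # sketch's choice: any positive value is impossible under the null
            t = math.inf if x > 0 else t * (u - shrink_trunc(j, s, m, eta0, u, d, c)) / u
        else:
            eta = shrink_trunc(j, s, m, eta0, u, d, c)
            t *= (x / m) * (eta - m) / (u - m) + (u - eta) / (u - m)
        s += x
        if j < N:
            m = (N / 2 - s) / (N - j)    # null mean of the cards not yet drawn
    return j, t


# Hypothetical two-candidate contest: 700 cards for the winner, 300 for the loser.
draws, stat = alpha_audit([1.0] * 700 + [0.0] * 300, alpha=0.05, eta0=0.7)
```

In this example the audit must stop with `stat >= 20`: once more than half of all cards have shown a vote for the winner, the null mean becomes negative and the statistic is set to infinity.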

4 Batch-Polling and Batch-Level Comparison Audits
So far we have been discussing audits that sample and manually interpret individual ballot cards: ballot-polling audits, which use only the manual interpretation of the sampled ballot cards, and ballot-level comparison audits, which also use the system’s interpretation of the sampled ballot cards (CVRs). Ballot-level comparison audits are the most efficient strategy (measured by expected sample size) if the voting system can export CVRs in a way that allows the corresponding physical ballot cards to be identified, retrieved, and interpreted manually. Legacy systems cannot. Even with modern equipment, reporting CVRs linked to physical ballot cards while maintaining vote anonymity is hard if votes are tabulated in precincts or vote centers, because the order in which ballot cards are scanned, tabulated, and stored can be nearly identical to the order in which they were cast.
Many jurisdictions tabulate and store ballot cards in physical batches for which the voting system can report batch-level results. (Footnote 4: Vote centers and vote-by-mail can make batch-level comparison audits hard or impossible, since some voting systems can only report vote subtotals for batches based on political geography (e.g., precincts), which may not correspond to physically identifiable batches. To create physical batches that match the reporting batches would require sorting the ballot cards.) Thus it can be desirable to sample and interpret batches of ballot cards instead of individual ballot cards, i.e., to use cluster sampling. Batch-polling audits use human interpretation of the votes in the batches but not the voting system’s tabulation (other than the system’s report of who won, and possibly the reported vote totals). Batch-level comparison audits compare human interpretation of the ballot cards in the sampled batches to the voting system’s interpretation of the same cards. Batch-level comparison audits are operationally similar to existing, non-risk-limiting audits in many states, including California and New York, but RLAs provide statistical guarantees, unlike the statutory audits in California and New York.
For many social choice functions (including all scoring rules), knowing the total number of votes reported for each candidate in each batch is enough to conduct a batch-level comparison audit. But for some voting systems and some social choice functions, batch-level results contain too little information. For instance, to audit instant-runoff voting (IRV), it is not enough to know how many voters gave each rank to each candidate: the joint distribution of ranks matters.
As discussed above, SHANGRLA audits of one or more contests involve a collection of assorters $\{A_a\}$, functions from ballot cards (and possibly additional information, such as the reported outcome, reported margin, and the system’s interpretation of the votes on the ballot card) to $[0, u]$. The domain of assorter $A_a$ is $\mathcal{D}_a$, which could comprise all ballot cards cast in the election or a smaller set, provided $\mathcal{D}_a$ includes every card that contains the contest that the assorter is relevant for. Targeting audit sampling using information about which ballot cards purport to contain which contests (card style data) can vastly improve audit efficiency while rigorously maintaining the risk limit even if the voting system misidentifies which cards contain which contests (Glazer, Spertus and Stark, 2021). There are also techniques for dealing with missing ballot cards (Bañuelos and Stark, 2012; Stark, 2020).
We will consider a single assorter $A$ with domain $\mathcal{D}$, suppressing the dependence on $a$ to simplify the notation. Considerations for testing multiple assertions using the same sample are in section 4.2.1. Let $N$ denote the number of ballot cards in $\mathcal{D}$. Every audited contest outcome is correct if every assorter mean is greater than $1/2$, i.e., if for every assorter $A$,
$$\bar{A} := \frac{1}{N} \sum_{b_i \in \mathcal{D}} A(b_i) > 1/2. \tag{15}$$
Ballot cards are tabulated and stored in $B$ disjoint, physically identifiable batches $\{G_g\}_{g=1}^B$ of ballot cards. Let $N_g$ be the number of ballot cards in batch $g$. We assume that each assorter domain $\mathcal{D}$ is the union of some of the batches: $\mathcal{D} = \cup_{g \in \mathcal{G}} G_g$. Let $\mathcal{G}$ be the indices of the batches to which the assorter applies and let $|\mathcal{G}|$ denote the cardinality of $\mathcal{G}$. Define
$$A_g := \sum_{b_i \in G_g} A(b_i), \tag{16}$$
the total of the assorter over batch $g$. Then $\bar{A} = \frac{1}{N} \sum_{g \in \mathcal{G}} A_g$. Let $u_g$ be an upper bound on $A_g$, for instance $u_g = u N_g$. Tighter upper bounds than that may be calculable, in particular for batch comparison audits: depending on the reported votes in batch $g$, the upper bound $u$ might not be attainable for every ballot card in the batch. Let $U := \sum_{g \in \mathcal{G}} u_g$ be the sum of the batch upper bounds.
4.1 Batch Sampling with Equal Probabilities
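Define
$$x_g := \frac{|\mathcal{G}|}{N} A_g, \quad g \in \mathcal{G}. \tag{17}$$
Then
$$\frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} x_g = \frac{1}{N} \sum_{g \in \mathcal{G}} A_g = \bar{A}. \tag{18}$$
That is, the mean of the values $\{x_g\}$ is equal to the mean of the values $\{A(b_i)\}$. Let $u_{\max} := \max_{g \in \mathcal{G}} u_g$ and $u := |\mathcal{G}|\, u_{\max}/N$. Then $\{x_g\}$ are in $[0, u]$, so if we sample batches with equal probability (with or without replacement), testing whether the population mean is less than or equal to $1/2$ is an instance of the problem solved by ALPHA, the tests in Waudby-Smith and Ramdas (2021); Waudby-Smith, Stark and Ramdas (2021), and the Kaplan martingale test; for sampling with replacement, it is also solved by the Kaplan-Wald and Kaplan-Markov tests.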
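The rescaling of batch totals can be checked on a toy example. In this sketch the three batches, their assorter values, and the function name `batch_values_equal_prob` are made up for illustration; each batch total is multiplied by the number of batches and divided by the number of cards, and the per-card assorter bound is assumed to be 1.

```python
def batch_values_equal_prob(batch_totals, n_cards):
    """Rescale batch totals so that their mean equals the per-card assorter mean."""
    n_batches = len(batch_totals)
    return [total * n_batches / n_cards for total in batch_totals]


# Hypothetical batches of per-card assorter values (each in [0, 1]).
batches = [[1, 1, 0.5, 0], [1, 0], [0.5, 0.5, 1, 1, 1, 0]]
totals = [sum(b) for b in batches]
n_cards = sum(len(b) for b in batches)

x = batch_values_equal_prob(totals, n_cards)
mean_of_batch_values = sum(x) / len(x)
mean_of_card_values = sum(totals) / n_cards

# A single upper bound for all batches, driven by the largest batch.
upper = len(batches) * max(len(b) for b in batches) / n_cards
```

Here the shared bound is 1.5 although the smallest batch can contribute at most 0.5, which is the slack that motivates unequal-probability sampling.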
However, because batch sizes may vary widely, using a single upper bound for all batches may have a great deal of slack for some batches, which can reduce the efficiency of the tests. By sampling batches with unequal probabilities, we can transform the problem to one where the upper bounds on the batches are sharper. This may lead to more efficient audits, depending on fixed costs related to retrieving batches; checking, recording, and opening seals; resealing batches and returning them to storage; etc.
4.2 Batch Sampling with Probability Proportional to a Bound on the Assorter
Let denote the batch selected in the th draw. For sampling without replacement, let ; for sampling with replacement, let . For sampling without replacement, let ; for sampling with replacement, let . Then are the indices of the batches from which the th sample batch will be drawn, and are the ballot cards those batches contain. Let . The th batch is selected at random from , with chance of selecting .
Define
(19) 
where . (For sampling without replacement, this typically varies with .) Let be the value of selected on the th draw. Consider the expected value of given :
(20)  
the mean value of the assorter over the ballots that remain in the population just before the th draw. Under the null hypothesis that ,
(21) 
Let be an estimate of based on and define
(22) 
This is an example of ALPHA allowing the upper bound on to vary with , with a corresponding draw-dependent constraint on .
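To make the unequal-probability idea concrete, here is a small sketch for sampling with replacement only. It assumes (this is a standard probability-proportional-to-size rescaling, not a transcription of the paper's definitions) that a batch is drawn with probability proportional to its bound and is scored as the sum of the bounds times the batch total, divided by the batch bound times the number of cards. All names and data are hypothetical.

```python
import random

# Hypothetical batches: (number of cards, total of the assorter over the batch).
batches = [(3, 2.5), (2, 1.0), (5, 3.5)]
u = 1.0  # assumed per-card upper bound on the assorter

N = sum(n for n, _ in batches)
bounds = [u * n for n, _ in batches]   # assumed batch-level bounds: u * batch size
U = sum(bounds)
probs = [b / U for b in bounds]        # selection probability of each batch


def score(total, bound):
    """Rescaled value recorded when a batch is drawn (assumed form)."""
    return U * total / (bound * N)


scores = [score(t, b) for (_, t), b in zip(batches, bounds)]

# The exact expected value of one draw equals the per-card mean of the assorter.
expected = sum(p * s for p, s in zip(probs, scores))
card_mean = sum(t for _, t in batches) / N

# Every score lies in [0, U / N], whatever the batch size.
upper = U / N

# Drawing a with-replacement sample of batch indices with these weights.
rng = random.Random(12345)
sample = rng.choices(range(len(batches)), weights=bounds, k=4)

print(expected, card_mean, upper, sample)
```

With bounds equal to the per-card bound times the batch size, the score reduces to the batch's average assorter value, so the same bound applies to large and small batches alike. Sampling without replacement would require updating the remaining batches and the sum of their bounds after each draw, which this sketch does not attempt.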
4.2.1 Auditing many assertions using the same weighted sample of batches.
To audit more than one assertion using the same sample of batches, the sampling weights (and thus the batch-level a priori bounds) for the assertions need to be commensurable, in the following sense. Suppose there are assorters, ; let be the domain of assorter ; and let denote the set of batches that comprise ( denotes the set of batches in the domain of assorter from which the th sample will be drawn.) Suppose that batches and are relevant for assorters and . Let be the upper bound for assorter in batch and define , , and analogously. Then we need . The easiest way to accomplish that is to take for all and all . Tighter bounds may be possible in some cases, depending on the batch-level reports for all the contests under audit.
5 Stratified Sampling
Stratified sampling, that is, partitioning ballot cards into disjoint strata and sampling independently from those strata, can be helpful in RLAs (Stark, 2008; Higgins, Rivest and Stark, 2011; Ottoboni et al., 2018; Stark, 2020). For instance, some states (including California) require jurisdictions to draw audit samples independently. Auditing a cross-jurisdictional contest then involves stratified samples; the cards cast in each jurisdiction comprise a stratum. Stratified sampling can also offer logistical advantages by making it possible to use different audit strategies (polling, batch polling, ballot-level comparison, batch-level comparison) for different subsets of ballot cards, for instance, if some ballot cards are tabulated using equipment that can report how it interpreted each ballot and some are not.
Stratified batch-comparison RLAs were developed in the first paper on risk-limiting audits, Stark (2008). The approach was tightened in Higgins, Rivest and Stark (2011). Ottoboni et al. (2018) developed a more flexible approach, SUITE (Stratified Union-Intersection Tests of Elections), which does not require using the same sampling or audit strategy in different strata. In particular, SUITE allows using polling in some strata and ballot-level or batch-level comparisons in others. BRAVO does not work for auditing in the polling strata in that context, because it makes inferences about the votes for one candidate as a fraction of the votes that are either for that candidate or one other candidate, that is, it conditions on the event that the selected card has a vote for either the reported winner or the reported loser. That suffices to tell who won a plurality contest, by auditing every (reported winner, reported loser) pair, if all the ballot cards are in a single stratum, but not when the sample is stratified.
When the sample is stratified, what is needed is an inference about the number of votes in the stratum for each candidate. To solve that problem, Ottoboni et al. (2018) used a test in the polling stratum based on the multinomial distribution, maximizing the P-value over a nuisance parameter, the number of ballot cards in the stratum with no valid vote for either candidate. SUITE represents the hypothesis that the outcome is wrong as a union of intersections of hypotheses. The union is over all ways of partitioning outcome-changing errors across strata. The intersection is across strata for each partition in the union. For each partition, for each stratum, SUITE computes a P-value for the hypothesis that the error in that stratum exceeds its allocation, then combines those P-values across strata (using a combining function such as Fisher’s combining function) to test the intersection hypothesis that the error in every stratum exceeds its allocation in the partition. If the maximum P-value of that intersection hypothesis over all allocations of outcome-changing error is less than or equal to the risk limit, the audit stops. Stark (2020) extends the union-intersection approach to use SHANGRLA assorters, avoiding the need to maximize P-values over nuisance parameters in individual strata and permitting sampling with or without replacement.
5.1 ALPHA obviates the need to use a combining function across strata
Because ALPHA works with polling and comparison strategies, it can be the basis of the test in every stratum, whereas SUITE used completely different “risk measuring functions” for strata where the audit involves ballot polling and strata where the audit involves comparisons. We shall see that this obviates the need to use a combining function to combine P-values across strata: they can just be multiplied. This is because (predictably) multiplying terms in the product representation of different sequences, each of which, under the nulls in the intersection, is a nonnegative supermartingale starting at one, yields a nonnegative supermartingale starting at one. Thus the product of the stratum-wise test statistics in any order (including interleaving terms across strata) is also a test statistic with the property that the chance it is greater than or equal to $1/\alpha$ is at most $\alpha$ under the intersection null. Because Fisher’s combining function adds two degrees of freedom to the chi-square distribution for each stratum, avoiding the need for a combining function can substantially increase power as the number of strata grows. Table 1 illustrates this increase: it shows the combined P-value for the intersection hypothesis when the P-value in each stratum is 0.5. The number of strata ranges from 2, which might arise in an audit in a single jurisdiction when stratifying on mode of voting (in-person versus absentee), to 150, which might arise in auditing a cross-jurisdictional contest in a state with many counties. For instance, Georgia has 159 counties, Kentucky has 120, Texas has 254, and Virginia has 133.

strata  Fisher’s combination  supermartingale
2  0.5966  0.25000000 
5  0.7319  0.03125000 
10  0.8374  0.00097656 
25  0.9514  0.00000003 
50  0.9917  0.00000000 
100  0.9997  0.00000000 
150  1.0000  0.00000000 
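The two columns of the table can be recomputed with a few lines of code. The sketch below is an illustration rather than the code used for the paper: it takes a stratum-level P-value of 0.5, evaluates Fisher's combination with the closed-form chi-square tail for an even number of degrees of freedom, and compares it with the plain product of the stratum-level P-values. The function names are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's combination: -2 * sum(log p) is chi-square with 2K df under the null."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # the statistic divided by 2
    # Chi-square survival function with 2K degrees of freedom (closed form):
    # exp(-x/2) * sum_{j=0}^{K-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def product_p(p_values):
    """Product of the stratum-level P-values, capped at 1."""
    return min(1.0, math.prod(p_values))


for n_strata in (2, 5, 10, 25, 50, 100, 150):
    p = [0.5] * n_strata
    print(f"{n_strata:4d}  {fisher_combined_p(p):.4f}  {product_p(p):.8f}")
```

For 2 and 5 strata this gives roughly 0.5966 and 0.7319 for Fisher's combination, against 0.25 and 0.03125 for the product.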
5.2 Supermartingale-based tests of intersection hypotheses
Here is a sketch of how ALPHA can be used for stratified audits. Suppose there are ballots in all, partitioned into strata. (This section will overload to mean two related things: the number of strata and a mapping from counting numbers to strata. Elsewhere in the paper, refers to a sample sum.) Stratum contains ballot cards; . We want to test the hypothesis that . Let be the upper bound on the numbers assigns. Let be the average of the assorter restricted to stratum , so . Suppose satisfies . We sample independently from the strata. Let denote the test supermartingale for stratum to test the null , . Let denote the th draw from the th stratum, and define , , and analogously. Define
(23) 
Recall from equation 6 that
(24) 
Let and be an interleaving of the samples, i.e., if , then is the index of the next element of not yet seen.

Initialize: set

For : If ,
The stratum selector is arbitrary and can be adaptive: it can depend on the previously observed data. But it must be predictable in the sense that can rely only on observations already entered into the calculation, for . It could ignore past data and simply select strata by round-robin: (), skipping any strata whose samples have been exhausted. Or it could concatenate the samples: if we have drawn times from stratum , then , ; , , etc. In general, the power will depend on the mapping . For instance, if data from stratum suggest that , future values of might omit stratum , concentrating instead on strata where there is some evidence that the intersection null is false, to maximize the expected rate at which the test supermartingale grows. Indeed, choosing can be viewed as a (possibly finite-population) multi-armed bandit problem: which stratum should the next sample come from to maximize the expected rate of growth of the test statistic?
We now define the intersection test martingale:
(25) 
Now and are predictable from . Suppose and . The samples from different strata are independent, so the conditional expectation of given is the conditional expectation of given , which is at most 1. Thus
(26) 
That is, under the intersection null, is a nonnegative supermartingale starting at 1, and by Ville’s inequality,
(27) 
To audit the assertion, we need to check whether there is any with for which . If there is, sampling needs to continue. We thus need to find
(28) 
the solution to a finite-dimensional optimization problem.
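The interleaving step can be sketched in a few lines for one fixed allocation of the null means across strata. The sketch makes several simplifying assumptions that are not taken from the text above: sampling is with replacement, the alternative value `eta` is held fixed rather than updated from the data, the stratum selector is round-robin, and each multiplicative term has the form shown in `alpha_term`, whose expected value is 1 when the mean of the draw equals `mu`. All names and data are hypothetical. A full audit would also need to maximize the resulting P-value over allocations, which is not shown.

```python
from itertools import zip_longest


def alpha_term(x, mu, eta, u=1.0):
    """One multiplicative term; its expected value is 1 when E[x] = mu."""
    return (x / mu) * (eta - mu) / (u - mu) + (u - eta) / (u - mu)


def round_robin(per_stratum_terms):
    """Interleave terms one stratum at a time, skipping exhausted strata."""
    missing = object()
    for group in zip_longest(*per_stratum_terms, fillvalue=missing):
        for term in group:
            if term is not missing:
                yield term


def intersection_path(per_stratum_terms):
    """Running product of the interleaved terms, starting from 1."""
    path, value = [], 1.0
    for term in round_robin(per_stratum_terms):
        value *= term
        path.append(value)
    return path


# Hypothetical data: two strata, stratum-level null means 0.5, fixed eta = 0.6.
samples = [[1, 1, 0, 1], [1, 1]]
strata_terms = [[alpha_term(x, mu=0.5, eta=0.6) for x in xs] for xs in samples]
path = intersection_path(strata_terms)

alpha = 0.05  # risk limit
reject_intersection_null = max(path) >= 1 / alpha
print(path, reject_intersection_null)
```

Because multiplication is commutative, the final value of the product does not depend on the interleaving; what the selector changes is the path, and hence how early the threshold might be crossed.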
6 Bernoulli Sampling
Ottoboni et al. (2020) developed a ballot-polling risk-limiting audit (BBP) based on Bernoulli sampling, where each ballot card is independently included in the sample with probability $p$. This results in a random sample of random size. Conditional on the sample size, it is a simple random sample of ballot cards. Their approach to testing whether one candidate received more votes than another involves conditioning on the realized sample size and maximizing an SPRT P-value over a nuisance parameter, the number of ballot cards that do not contain a valid vote for either of those candidates. They find that BBP requires sample sizes comparable to BRAVO for the same margin.
ALPHA, combined with the SHANGRLA transformation, eliminates the need to perform the maximization over a nuisance parameter. To use ALPHA, the sample needs to have a notional ordering. That ordering can come from randomly permuting the sample, or from setting a canonical ordering of the ballot cards before the sample is selected, e.g., a lexicographical ordering, then considering the sample order to be the lexicographical order of the cards in the sample.
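As a minimal sketch of the second option, the code below fixes a lexicographic ordering of hypothetical card identifiers before sampling, then includes each card independently with a given probability, so that the sample inherits the canonical order. The function name, the identifiers, and the use of Python's `random` module (a stand-in that is not suitable for a real audit) are assumptions made for illustration.

```python
import random


def bernoulli_sample(card_ids, p, seed):
    """Include each card independently with probability p; keep canonical order."""
    rng = random.Random(seed)
    ordered = sorted(card_ids)  # canonical (lexicographic) ordering, fixed first
    return [card for card in ordered if rng.random() < p]


# Hypothetical identifiers for 1,000 ballot cards.
cards = [f"card-{i:04d}" for i in range(1000)]
sample = bernoulli_sample(cards, p=0.05, seed=2022)
print(len(sample), sample[:3])
```

The sample size is random, and the cards can then be processed in the order returned.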
If the initial Bernoulli sample does not suffice to confirm the outcome, the sample can be expanded using the same approach Ottoboni et al. (2020) took in section 4 of their paper. Since the performance of BBP is similar to that of BRAVO, one might expect that applying ALPHA to Bernoulli samples would require lower sample sizes (i.e., lower selection probabilities) than BBP. We do not perform any simulations here, but we plan to investigate the efficiency of ALPHA/SHANGRLA versus BBP in future work.
7 Simulations
7.1 Sampling with replacement
Table 2 reports mean sample sizes of ALPHA and BRAVO for the same true vote shares , with the same choices of , using the truncated shrinkage estimate of , for a variety of choices of , all for a risk limit of 5%. Results are based on 1,000 replications for each true . Sample sizes were limited to 10 million ballot cards: if a method required a sample bigger than that in any of the 1,000 replications, the result is listed as ‘—’. As expected, BRAVO is best (or tied for best) when , i.e., when the reported vote shares are exactly right. ALPHA with a small value of is best when is large; and ALPHA with a small value of is best when is small, with a few exceptions, where BRAVO beats ALPHA with when and is small. When is large, ALPHA often does as well as BRAVO even when . When vote shares are wrong—and the reported winner still won—ALPHA often reduces average sample sizes substantially, even when the true margin is large. Indeed, in many cases, the sample size for BRAVO exceeded 10 million ballot cards in some runs, while the average for ALPHA was up to five orders of magnitude lower.
The SPRT is known to perform poorly—sometimes never leading to a decision—when . In such cases, ALPHA did much better for all choices of when , and for and when .
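A short calculation shows why this happens. Writing `theta` for the true vote share and `eta` for the share used as the alternative (names assumed here for illustration), and considering a two-candidate contest sampled with replacement against a null share of 1/2, the expected logarithm of one SPRT term is the average drift of the log test statistic. When the drift is zero or negative, the statistic does not tend to grow and the test may never reach its threshold.

```python
import math


def bravo_drift(theta, eta):
    """Expected log of one SPRT term: true share theta, alternative eta, null 1/2."""
    return theta * math.log(eta / 0.5) + (1 - theta) * math.log((1 - eta) / 0.5)


for theta, eta in [(0.505, 0.505), (0.505, 0.51), (0.51, 0.52), (0.51, 0.55)]:
    print(f"theta={theta}, eta={eta}: drift={bravo_drift(theta, eta):+.3e}")
```

The drift is positive when the two shares agree, is essentially zero (very slightly negative) when the alternative's margin is twice the true margin, and is clearly negative when the alternative overstates the margin by more than that.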
The simulations show that the performance of BRAVO can also be poor when . In most of those cases, ALPHA performed better than BRAVO for all choices of . For instance, when (a margin of 20%) and , ALPHA mean sample sizes were 204–353, but BRAVO sample sizes exceeded 10 million ballot cards for some runs.
θ  η  ALPHA d=10  ALPHA d=100  ALPHA d=500  ALPHA d=1000  BRAVO

0.505  0.505  102,500  91,024  82,757  79,414  58,266
0.505  0.51  102,738  91,878  84,088  80,564  —
0.505  0.52  103,418  93,842  88,512  87,685  —
0.505  0.53  103,746  96,611  97,731  103,630  —
0.505  0.54  104,490  99,535  110,216  126,618  —
0.505  0.55  105,071  104,047  125,659  158,247  —
0.505  0.6  110,346  135,445  269,961  440,573  —
0.505  0.65  118,727  190,702  519,166  920,839  —
0.505  0.7  129,332  265,380  861,560  1,597,004  —
0.51  0.505  24,476  21,487  19,798  19,258  19,965
0.51  0.51  24,598  21,598  19,577  18,841  14,930
0.51  0.52  24,717  22,036  19,839  19,035  —
0.51  0.53  24,760  22,451  20,846  20,928  —
0.51  0.54  24,930  23,029  22,888  24,602  —
0.51  0.55  25,017  23,856  25,848  30,351  —
0.51  0.6  25,954  30,041  57,261  93,849  —
0.51  0.65  27,721  42,078  116,207  209,576  —
0.51  0.7  30,117  61,550  201,768  376,304  —