1 Introduction
A risk-limiting audit (RLA) of an election is a procedure that has a known, pre-specified minimum chance of correcting the electoral outcome if the outcome is incorrect, that is, if the reported outcome differs from the outcome that a full manual tabulation of the votes would find. RLAs require a durable, voter-verifiable record of voter intent, such as paper ballots, and they assume that this audit trail is sufficiently complete and accurate that a full hand tally would show the true electoral outcome. That assumption is not automatically satisfied: a compliance audit (Stark and Wagner, 2012) is required.
Risk-limiting audits are generally incremental: they examine more ballots, or batches of ballots, until either (i) there is strong statistical evidence that a full hand tabulation would confirm the outcome, or (ii) the audit has led to a full hand tabulation, the result of which should become the official result.
RLAs have been piloted in California, Colorado, and Ohio, and a test of RLA procedures has been conducted in Arizona. RLA bills are being drafted or are already under consideration in California, Virginia, Washington, and other states. A number of laws have either allowed or mandated risk-limiting audits, including California AB 2023 (Saldaña), SB 360 (Padilla), and AB 44 (Mullin); Rhode Island SB 413A and HB 5704A; and Colorado Revised Statutes (CRS) 1-7-515.
CRS 1-7-515 required Colorado to implement risk-limiting audits beginning in 2017. (There are provisions to allow the Secretary of State to exempt some counties.) The first set of coordinated risk-limiting election audits across the state took place in Colorado in November 2017. [Footnote 1: See https://www.sos.state.co.us/pubs/elections/RLA/2017RLABackground.html]
Colorado’s “uniform voting system” program [Footnote 2: https://www.sos.state.co.us/pubs/elections/VotingSystems/UniformVotingSystem.html] led many Colorado counties to purchase (or to plan to purchase) voting systems that are auditable at the ballot level: those systems export cast vote records (CVRs) for individual ballots in a manner that allows the corresponding paper ballot to be identified, and conversely, make it possible to find the CVR corresponding to any particular paper ballot. We call counties that have such systems “CVR” counties. It is estimated that by June 2018, 98.2% of active Colorado voters will be in CVR counties. CVR counties can perform “ballot-level comparison audits” (Lindeman and Stark, 2012), which are currently the most efficient approach to risk-limiting audits in that they require examining fewer ballots than other methods do, when the outcome of the contest under audit is in fact correct.
Other counties (“legacy” or “no-CVR” counties) have systems that do not allow auditors to check how the system interpreted voter intent for individual ballots. Their election results can still be audited, provided their voting systems create a voter-verifiable paper trail (e.g., voter-marked paper ballots) that is conserved to ensure that it remains accurate and intact, and organized well enough to permit ballots to be selected at random. Pilot audits in California suggest that the most efficient way to audit such systems is by “ballot polling” (Lindeman et al., 2012; Lindeman and Stark, 2012) (in contrast to “batch-level comparisons,” for example).
There is currently no literature on how to perform risk-limiting audits of contests that include CVR counties and no-CVR counties by combining ballot polling and ballot-level comparisons. Existing methods would either require all counties to use the lowest common denominator, ballot polling (which does not take advantage of the CVRs, and thus is expected to require more auditing than a method that does take advantage of the CVRs), or would require no-CVR counties to perform batch-level comparisons, which were found in California to be (generally) less efficient than ballot-polling audits. [Footnote 3: See Rivest (2018) for a different (Bayesian) approach to auditing contests that include both CVR counties and no-CVR counties.]
The open-source audit software used for Colorado’s 2017 audits, RLATool (https://github.com/FreeAndFair/ColoradoRLA/), needs to be improved to audit contests that cross county lines and to audit small contests efficiently.
First, the current version (1.1.0) of RLATool needs to be modified to recognize and group together contests that cross jurisdictional boundaries; currently, it treats every contest as if it were entirely contained in a single county. Margins and risk limits apply to entire contests, not to the portion of a contest included in a county. RLATool also does not allow the user to select the sample size, nor does it directly allow an unstratified random sample to be drawn across counties. Second, to audit a contest that includes voters in “legacy” counties (counties with voting systems that cannot export cast vote records) and voters in counties with newer systems, new statistical methods are needed to keep the efficiency of ballot-level comparison audits that the newer systems afford. Third, auditing contests that appear only on a subset of ballots can be made much more efficient if the sample can be drawn from just those ballots that contain the contest. While allowing samples to be restricted to ballots reported to contain a particular contest is not essential in the short run, it will be necessary eventually to make it feasible to audit smaller contests.
This document focuses on near-term requirements for risk-limiting audits in Colorado. Section 2 presents a number of crude approaches that could be implemented easily but might require examining substantially more ballots. Section 3 presents an approach based on comparison audits with different batch sizes. This approach is statistically elegant and relatively efficient, but might require changing how counties handle their ballots. Section 4 presents our recommended approach, which combines ballot-level comparisons in counties that can perform them with ballot polling in the no-CVR counties. All the approaches require new software, including at least minor modifications to RLATool. We provide example software implementing the risk calculations for our recommended approach as a Python Jupyter notebook. [Footnote 4: See https://github.com/pbstark/CORLA18.] Section 5 describes how audit efficiency could be improved in CVR counties by combining CVR data with data from Colorado’s voter registration system, SCORE. [Footnote 5: SCORE is Colorado’s voter registration system, which also tracks who voted. See https://www.sos.state.co.us/pubs/elections/SCORE/SCOREhome.html.] Sections 6 and 7 explain the recommended modifications to ballot-level comparison and ballot-polling audits, respectively. Section 8 summarizes our recommendations and considerations for implementation.
1.1 Priorities for Colorado audits
Auditing efficiency is controlled in part by how well the audit can limit the sample to ballots that contain the contests under audit. Some contests are on (essentially) every ballot, for instance the governor’s race. Others, such as mayoral contests, may appear on only a small fraction of ballots cast in a county. Partisan primaries—even for statewide office—are somewhere in between, because in general no single party’s primary appears on every ballot cast in the state. Thus, either we accept reduced efficiency for the sake of simplicity by continuing to sample ballots uniformly from within counties (or collections of counties), or we develop a way to focus the auditing on the ballots that contain the contest. The latter requires external information, e.g., from SCORE, as discussed below.
Moreover, party primaries for statewide offices (and perhaps other contests) will include CVR counties and no-CVR counties, so we need a method to audit across both kinds of voting technology.
This report addresses both issues, providing options for effectively auditing heterogeneous voting technology, varying in efficiency, complexity, and on whom any additional audit burden falls.
2 Crude (and unpleasant) approaches
Here and generally throughout the paper, we discuss auditing a single contest at a time, although the same sample can be used to audit more than one contest and there are ways of combining audits of different contests into a single process (Stark, 2009b, 2010). We use terminology drawn from a number of papers; the key reference is Lindeman and Stark (2012). An overstatement error is an error that caused the margin between any reported winner and any reported loser to appear larger than it really was. An understatement error is an error that caused the margin between every reported winner and every reported loser to appear to be smaller than it really was.
2.1 Hand count the legacy counties
The simplest approach to combining legacy counties with CVR counties is to require every legacy county to do a full hand count of the primaries, and to conduct a ballot-level comparison audit in CVR counties, based on contest margins adjusted for the results of the manual tallies in the legacy counties. For instance, imagine a contest with two candidates, reported winner and reported loser . Suppose the total number of reported votes for candidate is and the total for candidate is , so that , since is the reported winner. Suppose that a full manual tally of the votes in the legacy counties shows votes for and votes for . Suppose that a total of ballots were cast in the CVR counties. Then the diluted margin for the comparison audit in the CVR counties is defined to be . Requiring a full hand count in the legacy counties has obvious disadvantages, except perhaps in very close contests where ballot polling is not efficient. (But it does have the advantage of not forcing CVR counties to do additional auditing to compensate for the legacy counties.)
2.2 Subtract error bounds for the legacy counties from vote totals
If ballot accounting and SCORE data can provide good upper bounds on the number of ballots cast in each contest in legacy counties, there are simple upper bounds on the total possible overstatement error each legacy county could contribute to the overall contest results; those can be subtracted from the overall margin (as in the previous subsection) and the remainder of the contests can be audited in CVR counties against the adjusted margins. For instance, consider a primary that appears on $N$ ballots in legacy counties. Suppose that in legacy counties, the overall, statewide contest winner, $w$, is reported to have received $V_w$ votes, and some loser, $\ell$, is reported to have received $V_\ell$ votes. (Note that $V_\ell$ could be greater than $V_w$: $w$ is not necessarily the reported winner in the legacy counties.) Then the most overstatement error that those counties could possibly have in determining whether $w$ in fact beat $\ell$ is if every reported undervote, invalid vote, or vote for a different candidate (there are $N - V_w - V_\ell$ such ballots) had in fact been a vote for $\ell$ (producing a 1-vote overstatement), and every vote reported for $w$ was in fact a vote for $\ell$ (producing a 2-vote overstatement). The reduction in the margin that would produce is $(N - V_w - V_\ell) + 2V_w = N + V_w - V_\ell$ votes.
Whereas the previous approach places the auditing burden created by obsolescent equipment entirely on the legacy counties, this approach places it entirely upon the CVR counties. Also, in a close contest, it could require a full hand count in every county that might not otherwise be necessary.
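As a rough illustration of the bound described in this subsection, the following Python sketch computes the worst-case reduction in the margin for one legacy county and subtracts it from a contest-wide margin. The function and argument names are hypothetical (they are not part of RLATool), and the formula reflects one reading of the text: each ballot reported as something other than a vote for the winner or the loser could hide a 1-vote overstatement, and each vote reported for the winner could hide a 2-vote overstatement.

```python
def max_margin_reduction(ballots, votes_winner, votes_loser):
    """Worst-case overstatement (in votes) of the winner-loser margin in one county.

    ballots: upper bound on ballots containing the contest (hypothetical input,
             e.g. derived from ballot accounting and SCORE)
    votes_winner, votes_loser: reported votes for the two candidates
    """
    if min(ballots, votes_winner, votes_loser) < 0:
        raise ValueError("counts must be non-negative")
    if votes_winner + votes_loser > ballots:
        raise ValueError("reported votes exceed the ballot bound")
    # Undervotes, invalid votes, and votes for other candidates.
    other = ballots - votes_winner - votes_loser
    # 1 vote for each "other" ballot, 2 votes for each reported winner vote.
    return other + 2 * votes_winner


def adjusted_margin(overall_margin, legacy_counties):
    """Margin left for the CVR counties after subtracting every legacy
    county's worst case. A non-positive result means this crude approach
    cannot confirm the outcome without more information."""
    return overall_margin - sum(max_margin_reduction(*c) for c in legacy_counties)


# Made-up numbers: one legacy county with 1,000 ballots.
print(max_margin_reduction(1000, 400, 350))        # 1050
print(adjusted_margin(5000, [(1000, 400, 350)]))   # 3950
```

The example shows why the approach is pessimistic: a county whose reported margin is only 50 votes still removes 1,050 votes from the margin that the CVR counties must confirm.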
2.3 Treat legacy counties as if every ballot selected from them for audit has a two-vote overstatement
A third simple-but-pessimistic approach is to sample uniformly from all counties as if one were performing a ballot-level comparison audit everywhere, but to treat any ballot selected from a legacy county as a two-vote overstatement. This approach has the same disadvantages as the previous approach.
3 Variable batch sizes
Another approach is to perform a comparison audit across all counties, but to use batches consisting of more than one ballot (batch-level comparisons) in legacy counties and batches consisting of a single ballot (ballot-level comparisons) in CVR counties. [Footnote 6: For majority and plurality elections, including those in which voters can select more than one candidate, audits can be based on overstatement and understatement errors at the level of batches.] This requires that the no-CVR counties report vote subtotals for physically identifiable batches. If a county’s voting system can only report subtotals by precinct but the county does not sort paper ballots by precinct, this approach might require revising how the county handles its paper; we understand that this is the case in many Colorado counties.
That said, many California counties that do not sort vote-by-mail (VBM) ballots by precinct conduct the statutory 1% audits by manually retrieving the ballots for just those precincts selected for audit from whatever physical batches they happen to be in: the situation is identical to that in Colorado.
Another solution is the “Boulder-style” batch-level audit, [Footnote 7: See http://bcn.boulder.co.us/~neal/elections/boulderaudit1011/.] which requires generating vote subtotals after each physical batch is scanned, and exporting those subtotals in machine-readable form. That in turn may require using extra memory cards, repeatedly initializing and deleting tabulation databases, or other measures that add complexity and opportunity for human error.
While those two approaches are laborious, they would provide a viable short-term solution, especially combined with information from SCORE to check that the reported batch-level results contain the correct number of ballots for each contest under audit. Moreover, they do not unduly increase the workload in CVR counties to compensate for legacy equipment.
This kind of variable-batch-size comparison audit approach would require modifying or augmenting RLATool in several ways:

The CVR reporting tool would need to be modified to allow no-CVR counties to report batch-level results in a manner analogous to how CVR counties report ballot-level results, or an external tool would need to be provided.

The sampling algorithm would have to allow sampling batches, and sampling them with unequal probability, because efficient batch-level audits involve sampling batches with probability proportional to a bound on the possible overstatement error in the batch. It would also need to calculate the appropriate sampling probability for each batch (of whatever size). Again, this could be accommodated using an external tool to draw the sample from legacy counties.

The risk calculations would need to be modified. This, too, could be done with external software, with suitable provisions for capturing audit data from RLATool or directly from legacy counties.
None of these changes is enormous; the mathematics and statistics are already worked out in published papers, and there is exemplar code for calculating the batch-level error bounds, drawing the samples with probability proportional to an error bound, and calculating the attained risk from the sample results. Indeed, this is the method that was used in several of California’s pilot audits, including the audit in Orange County. A derivation of a method for comparison audits with variable batch sizes is given below in section 6.
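The unequal-probability sampling step listed above can be sketched in a few lines, assuming each batch already has a non-negative error bound. This is only an illustration: it uses Python's `random.Random` as a stand-in, whereas a real audit would need a publicly verifiable, seeded pseudo-random number generator, and the batch labels and bounds are invented.

```python
import random


def draw_batches(error_bounds, n_draws, seed):
    """Draw batches with replacement, with probability proportional to each
    batch's error bound.

    error_bounds: dict mapping a batch label to its non-negative error bound
    n_draws:      number of independent draws
    seed:         seed for the (stand-in) pseudo-random number generator
    """
    labels = list(error_bounds)
    weights = [error_bounds[b] for b in labels]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("bounds must be non-negative and not all zero")
    rng = random.Random(seed)
    # random.choices samples with replacement using the given weights.
    return rng.choices(labels, weights=weights, k=n_draws)


# Hypothetical bounds: a CVR county contributes single-ballot batches, a
# legacy county contributes a larger batch with a larger bound.
bounds = {"cvr-ballot-1": 0.02, "cvr-ballot-2": 0.02, "legacy-batch-A": 1.3}
sample = draw_batches(bounds, n_draws=5, seed=20180301)
print(sample)
```

Because the draws are with replacement, the same batch can appear more than once in the sample; a batch whose bound is zero is never drawn.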
4 Stratified “hybrid” audits
Other approaches involve stratification: partitioning the cast ballots into non-overlapping groups and sampling independently from those groups. One could stratify by county, but in general it is simpler and more efficient statistically (i.e., results in auditing fewer ballots) to minimize the number of strata. We consider methods that use two strata: CVR counties and no-CVR counties. Collectively, the ballots cast in CVR counties comprise one stratum and the ballots cast in legacy counties comprise a second stratum; every ballot cast in the contest is in exactly one of the two strata. We assume that the samples are drawn from the two strata independently.
As explained below, these stratified “hybrid” audits require the specification of some additional parameters: $\lambda_1$ and $\lambda_2$, for dividing the tolerable overstatement error up, and the strata risk limits $\alpha_1$ and $\alpha_2$.
4.1 Partitioning the total permissible overstatement into strata
The simplest approach to stratification involves partitioning the risk limit and the tolerable overstatement error of the tabulation into two pieces, one for the (pooled) CVR counties and one for the (pooled) no-CVR counties. Let $V_{w\ell}$ denote the contest-wide margin (in votes) of reported winner $w$ over reported loser $\ell$. Let $V_{w\ell,s}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ in stratum $s$. Note that $V_{w\ell,s}$ might be negative in one stratum. Let $A_{w\ell}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ that a full hand count of the entire contest would show, that is, the actual margin rather than the reported margin. Reported winner $w$ really beat reported loser $\ell$ if and only if $A_{w\ell} > 0$. Define $A_{w\ell,s}$ to be the actual margin (in votes) of $w$ over $\ell$ in stratum $s$; this too may be negative.
Let $\omega_{w\ell,s} \equiv V_{w\ell,s} - A_{w\ell,s}$ be the overstatement of the margin of $w$ over $\ell$ in stratum $s$. Reported winner $w$ really beat reported loser $\ell$ if and only if $\omega_{w\ell,1} + \omega_{w\ell,2} < V_{w\ell}$.
Pick $\lambda_1$ and define $\lambda_2 = 1 - \lambda_1$. These values partition the total tolerable overstatement between the two strata: If $\omega_{w\ell,1} < \lambda_1 V_{w\ell}$ and $\omega_{w\ell,2} < \lambda_2 V_{w\ell}$, candidate $w$ really received more votes than candidate $\ell$. Some pairs $(\omega_{w\ell,1}, \omega_{w\ell,2})$ can be ruled out a priori, because (for instance) $\omega_{w\ell,s} \le 2N_s$, where $N_s$ is the number of ballots cast in stratum $s$. There are other simple, sharper bounds, sketched below.
The choice of $\lambda_1$ (which determines the tolerable overstatement in each stratum), the strata risk limits $\alpha_1$ and $\alpha_2$, and details of the audit procedures affect the workload and the overall risk limit. (See section 4.1.1 and section 8.)
For ballot-level comparison audits, auditing to ensure that $\omega_{w\ell,s} < \lambda_s V_{w\ell}$ is discussed in section 6. It is a minor modification of the method embodied in RLATool.
For ballot-polling audits, auditing to ensure that $\omega_{w\ell,s} < \lambda_s V_{w\ell}$ is discussed in section 7. Note that this requires a more substantial modification of the standard ballot-polling calculations, because the standard calculations consider only the fraction of ballots with a vote for either $w$ or $\ell$ that contain a vote for $w$, while we need to make an inference about the difference between the number of votes for $w$ and the number of votes for $\ell$. This introduces an additional unknown nuisance parameter, the number of ballots with votes for either $w$ or $\ell$.
4.1.1 Combining stratum-level risk limits
We audit to test the two hypotheses $\{\omega_{w\ell,s} \ge \lambda_s V_{w\ell}\}$, $s = 1, 2$, independently for the two strata. If we reject both hypotheses, we conclude that the contest outcome is correct; otherwise, we manually retabulate the contest in one or both strata, depending on the audit rules. Those rules matter: the two audits might need to be conducted to smaller risk limits individually than the desired risk limit for the contest as a whole.
Recall that the samples are drawn independently from the two strata. Pick $\alpha_1$ and $\alpha_2$. (Below we discuss the choice further.) We audit each stratum $s$ to test the hypothesis $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ (the overstatement exceeds the tolerable overstatement) at risk limit $\alpha_s$, as if it were its own election. The audits can be conducted at the same time or sequentially; there is no coordination between the audits unless one of them leads to a full hand count but the other does not: see below.
How do these two stratum-level “risk limits” $\alpha_1$ and $\alpha_2$ determine the overall risk that the audit will not correct the outcome if the outcome is wrong? The overall risk depends on the rule for what we do if the audit in one stratum leads to a full manual tally of that stratum.
Here are the possibilities. Bear in mind that for the outcome to be wrong, at least one stratum must have a net overstatement greater than or equal to its tolerable overstatement: That is, if $\omega_{w\ell,1} + \omega_{w\ell,2} \ge V_{w\ell}$, then $\omega_{w\ell,1} \ge \lambda_1 V_{w\ell}$ or $\omega_{w\ell,2} \ge \lambda_2 V_{w\ell}$, or both. If the tolerable overstatement is exceeded in only one stratum, $s$, then the chance that the stratum will be fully hand counted is at least $1 - \alpha_s$.
If both $\omega_{w\ell,1} \ge \lambda_1 V_{w\ell}$ and $\omega_{w\ell,2} \ge \lambda_2 V_{w\ell}$, then the chance both are completely tabulated by hand is at least $(1-\alpha_1)(1-\alpha_2)$, since the audit samples in the two strata are independent.
What should we do if the audit leads to a full tally in one stratum, $s$, that reveals that indeed its tolerable overstatement has been exceeded, but the other audit has not led to a full tabulation, because it has not started, because it is still underway, or because it terminated without a full hand tally? We consider two options. The simpler is to automatically require a full hand count of the other stratum. If the audit uses this rule, then we can take $\alpha_1 = \alpha_2 = \alpha$, and the procedure will have risk limit $\alpha$. However, this rule creates the possibility of requiring a full hand count in circumstances where it may seem substantively superfluous. For instance, one can imagine an audit of a statewide contest in which the tolerable overstatement in no-CVR counties is exceeded, yet the outcome still could be verified without a full hand count in the CVR counties.
The second approach is to adjust the tolerable overstatement in the other stratum, $s'$, in light of the known manual tally in the stratum $s$ that has been fully hand tallied: we will test the overstatement $\omega_{w\ell,s'}$ against the threshold $V_{w\ell} - \omega_{w\ell,s}$, rather than the original value $\lambda_{s'} V_{w\ell}$. (Because the overstatement in stratum $s$ exceeded the tolerable overstatement, the updated tolerable overstatement in stratum $s'$ will be smaller than the original value.) Then to reject the new null hypothesis in stratum $s'$ is to conclude that the overall outcome is correct.
If and when the hypothesis in stratum $s'$ changes, the audit in that stratum might be able to stop on the basis of the data already observed; it might need to continue; or, if it had stopped based on the original threshold $\lambda_{s'} V_{w\ell}$, it might need to examine more ballots, possibly continuing to a full hand tally.
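To make the threshold adjustment concrete, here is a small Python sketch. It assumes (our reading of the rule, with hypothetical names) that the tolerable overstatement is split between the strata by a fraction `lambda_1` of the contest-wide margin, and that once one stratum has been fully hand counted, the other stratum's threshold becomes the contest-wide margin minus the overstatement found in the counted stratum.

```python
def initial_thresholds(margin, lambda_1):
    """Split the contest-wide margin (in votes) into the tolerable
    overstatements for stratum 1 and stratum 2."""
    return lambda_1 * margin, (1 - lambda_1) * margin


def updated_threshold(margin, known_overstatement):
    """Tolerable overstatement in the remaining stratum once the other
    stratum's overstatement is known from a full hand count."""
    return margin - known_overstatement


# Made-up numbers: a 1,000-vote margin, 30% of it allotted to stratum 1.
t1, t2 = initial_thresholds(1000, 0.3)
# Suppose a full hand count of stratum 1 finds a 450-vote overstatement,
# which exceeds its allotment t1.
new_t2 = updated_threshold(1000, 450)
print(t1, t2, new_t2)
```

In this example the remaining stratum must now be audited against a threshold of 550 votes instead of roughly 700, which is why an audit that had already stopped might need to resume.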
We will now show in detail that this rule allows the contest to be audited at risk limit $\alpha$ by selecting values of $\alpha_1$ and $\alpha_2$ that sum to a bit more than $\alpha$: specifically, such that $(1-\alpha_1)(1-\alpha_2) \ge 1 - \alpha$. For instance, suppose we want the overall risk limit to be 5%. If we use a risk limit of 4% in the no-CVR stratum and a risk limit of 1.04% in the CVR stratum, the overall risk limit is not larger than $1 - (1 - 0.04)(1 - 0.0104) \approx 4.998\% < 5\%$.
The statistical wrinkle is that adjusting for the manual tally in the hand-counted stratum changes the hypothesis being tested in the other stratum in a way that is itself random: whether the original null is tested or the new null is tested depends on what the sample reveals in stratum $s$. If the hypothesis does change, there is only one value possible for the new threshold $V_{w\ell} - \omega_{w\ell,s}$ (which depends on the reported margin and the count in stratum $s$), but it is unknown until $\omega_{w\ell,s}$ is known.
We assume that before any data are collected, the audit specifies two families of tests: for each stratum $s$, a family of level-$\alpha_s$ tests of the null hypothesis that the overstatement in the stratum is greater than or equal to $c$, for all feasible values of $c$. That is,
$$\Pr\{\text{reject } \omega_{w\ell,s} \ge c \mid \omega_{w\ell,s} \ge c\} \le \alpha_s \tag{1}$$
for $s = 1, 2$, and all feasible $c$. Moreover, we insist that the test depend on data only from ballots selected from its stratum. Because the samples in the two strata are independent, for all feasible pairs $(c_1, c_2)$,
$$\Pr\{\text{reject neither } \omega_{w\ell,1} \ge c_1 \text{ nor } \omega_{w\ell,2} \ge c_2 \mid \omega_{w\ell,1} \ge c_1 \text{ and } \omega_{w\ell,2} \ge c_2\} \ge (1-\alpha_1)(1-\alpha_2). \tag{2}$$
What is the chance that the audit leads to a full hand tabulation if the outcome is incorrect? One way the audit can lead to a full hand tally is if it leads to a full count in one stratum, the null hypothesis in the other stratum is changed, and the audit in the second stratum then proceeds to a full manual tally. (There are other ways the audit can lead to a full hand tally, for instance, if neither null hypothesis is rejected, but this is one way.)
If the outcome is wrong, there is at least one stratum in which the overstatement exceeds the threshold $\lambda_s V_{w\ell}$. Let $s$ be one such stratum. Then the chance the audit in stratum $s$ leads to a full manual tally in that stratum is at least $1 - \alpha_s$. If the audit leads to a full manual tally in stratum $s$ and the overall outcome is wrong, then the (new) null hypothesis in the other stratum, $s'$, must be true. If we started to audit that new hypothesis ab initio, the chance that we would reject it would be at most $\alpha_{s'}$, so the chance the audit would lead to a full hand count of stratum $s'$ is at least $1 - \alpha_{s'}$. The question is whether “changing hypotheses” could make that chance smaller. Inequality (2) shows that it cannot: for any feasible pair of overstatement thresholds, $(c_1, c_2)$, if $\omega_{w\ell,1} \ge c_1$ and $\omega_{w\ell,2} \ge c_2$, the chance that neither the hypothesis $\omega_{w\ell,1} \ge c_1$ nor the hypothesis $\omega_{w\ell,2} \ge c_2$ will be rejected is at least $(1-\alpha_1)(1-\alpha_2)$.
Therefore, for this procedure, the chance that there will be a full hand count in both strata is at least $(1-\alpha_1)(1-\alpha_2)$ if the outcome is incorrect, even if the probability were zero that both of the original audits would proceed to a full hand count. The overall risk limit is thus not larger than $1 - (1-\alpha_1)(1-\alpha_2)$.
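The arithmetic behind the 4% and 1.04% example can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes the bound derived in this section: with independent samples, the overall risk limit is at most one minus the product of the two stratum-level confidence levels.

```python
def overall_risk_limit(alpha_1, alpha_2):
    """Upper bound on the overall risk when the two strata are audited
    independently at risk limits alpha_1 and alpha_2."""
    return 1 - (1 - alpha_1) * (1 - alpha_2)


def second_stratum_limit(alpha, alpha_1):
    """Largest risk limit for the second stratum that keeps the overall
    bound at or below alpha, given the first stratum's limit alpha_1."""
    if not 0 < alpha_1 <= alpha < 1:
        raise ValueError("need 0 < alpha_1 <= alpha < 1")
    return 1 - (1 - alpha) / (1 - alpha_1)


# The example from the text: 4% in the no-CVR stratum, 1.04% in the CVR stratum.
print(overall_risk_limit(0.04, 0.0104))   # about 0.049984, just under 5%
print(second_stratum_limit(0.05, 0.04))   # about 0.010417
```

The second function shows that 1.04% is slightly below the largest value that could be used for the CVR stratum when the no-CVR stratum is audited at 4% and the overall target is 5%.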
4.2 Constraining the total overstatement across strata
A more statistically efficient approach to ensuring that the overstatement error in the two strata does not exceed the margin is to try to constrain the sum of the overstatement errors in the two strata, rather than constrain the pieces separately: there are many ways that the total overstatement could be less than $V_{w\ell}$ without having the overstatement in stratum $s$ less than $\lambda_s V_{w\ell}$, $s = 1, 2$. To that end, imagine all values $\lambda_1$ and $\lambda_2 = 1 - \lambda_1$. If, for all such pairs, we can reject the hypothesis that the overstatement error in stratum 1 is greater than or equal to $\lambda_1 V_{w\ell}$ and the overstatement error in stratum 2 is greater than or equal to $\lambda_2 V_{w\ell}$, then we can conclude that the outcome is correct.
To test the conjunction hypothesis (i.e., that both of those null hypotheses are true), we use Fisher’s combining function. Let $p_s$ be the $p$-value of the hypothesis $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$. If the null hypothesis that $\omega_{w\ell,1} \ge \lambda_1 V_{w\ell}$ and $\omega_{w\ell,2} \ge \lambda_2 V_{w\ell}$ is true, then the combination
$$-2 \sum_{s=1}^{2} \ln p_s \tag{3}$$
has a probability distribution that is dominated by the chi-square distribution with 4 degrees of freedom. [Footnote 8: If the two tests had continuously distributed $p$-values, the distribution would be exactly chi-square with four degrees of freedom, but if either $p$-value has atoms when the null hypothesis is true, it is in general stochastically smaller. This follows from a coupling argument along the lines of Theorem 4.12.3 in Grimmett and Stirzaker (2001).]
Hence, if, for all $\lambda_1$ and $\lambda_2 = 1 - \lambda_1$, the combined statistic is greater than the $1 - \alpha$ quantile of the chi-square distribution with 4 degrees of freedom, the audit can stop.
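The following Python sketch illustrates Fisher's combining function for two stratum-level $p$-values at a single allocation of the tolerable overstatement. It is a simplified illustration with hypothetical function names: it uses the closed-form upper tail of the chi-square distribution with 4 degrees of freedom, $e^{-x/2}(1 + x/2)$, and it does not perform the search over all allocations that the procedure requires.

```python
import math


def fisher_statistic(p1, p2):
    """Fisher's combining function for two p-values: -2 (ln p1 + ln p2)."""
    for p in (p1, p2):
        if not 0 < p <= 1:
            raise ValueError("p-values must lie in (0, 1]")
    return -2.0 * (math.log(p1) + math.log(p2))


def chi2_4df_upper_tail(x):
    """P(X > x) for X chi-square with 4 degrees of freedom (closed form)."""
    return math.exp(-x / 2) * (1 + x / 2)


def fisher_combined_pvalue(p1, p2):
    """Conservative combined p-value for the conjunction of the two nulls."""
    return chi2_4df_upper_tail(fisher_statistic(p1, p2))


def can_stop(p1, p2, alpha):
    """True if, for this one allocation, the combined evidence is strong
    enough at risk limit alpha. The audit may stop only if this holds for
    every allocation, which this sketch does not check."""
    return fisher_combined_pvalue(p1, p2) <= alpha


print(fisher_combined_pvalue(0.05, 0.05))   # roughly 0.0175
print(can_stop(0.01, 0.2, alpha=0.05))
```

Comparing the combined $p$-value with $\alpha$ is equivalent to comparing the statistic with the $1 - \alpha$ quantile of the chi-square distribution with 4 degrees of freedom.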
5 Sampling from subcollections
To audit efficiently contests that appear on only a fraction of the ballots cast in one or more counties, one must be able to sample from just those ballots (or, at least, from a subset of all ballots that contains every such ballot). Because the CVRs cannot be entirely trusted (otherwise, the audit would be superfluous), we cannot rely on them to determine which ballots contain a given contest. However, if we have independent knowledge of the number of ballots that contain a given contest (e.g., from the SCORE system), then there are methods that allow the sample to be drawn from ballots whose CVRs contain the contest and still limit the risk rigorously. See Benaloh et al. (2011) and Bañuelos and Stark (2012) for details.
6 Batch comparison audits of a tolerable overstatement in votes
In this section we expand previous comparison auditing work (already embodied in RLATool) to handle two new requirements. The first allows the specification of the parameters discussed in section 4. The second handles batch-level auditing.
To meet the first requirement, we consider auditing in a single stratum to test whether the overstatement of any margin (in votes) exceeds some fraction $\lambda_s$ of the overall margin $V_{w\ell}$ between reported winner $w$ and reported loser $\ell$. If the stratum contains all the ballots cast in the contest, then for $\lambda_s = 1$, this would confirm the election outcome. For stratified audits, we might want to test other values of $\lambda_s$, as described above.
In Colorado, comparison audits have been ballot-level (i.e., batches consisting of a single ballot). This section also addresses the second requirement by deriving a method for batches of arbitrary size, which might be useful for Colorado to audit contests that include CVR counties and legacy counties. We keep the a priori error bounds tighter than the “super-simple” method (Stark, 2010). To keep the notation simpler, we consider only a single contest, but the MACRO test statistic (Stark, 2009b, 2010) automatically extends the result to auditing multiple contests simultaneously. The derivation is for plurality contests, including “vote-for-$k$” plurality contests. Majority and supermajority contests are a minor modification (Stark, 2008). [Footnote 9: So are some forms of preferential and approval voting, such as Borda count, and proportional representation contests, such as D’Hondt (Stark and Teague, 2014). Changes for IRV/STV are more complicated.]
6.1 Notation

$\mathcal{W}$: the set of reported winners of the contest

$\mathcal{L}$: the set of reported losers of the contest

$N_s$ ballots were cast in all in stratum $s$. (The contest might not appear on all ballots.)

$P$ “batches” of ballots are in stratum $s$. A batch contains one or more ballots. Every ballot in stratum $s$ is in exactly one batch.

$n_p$: number of ballots in batch $p$. $N_s = \sum_{p=1}^{P} n_p$.

$v_{pi}$: the reported votes for candidate $i$ in batch $p$

$a_{pi}$: actual votes for candidate $i$ in batch $p$. If the contest does not appear on any ballot in batch $p$, then $a_{pi} = 0$.

$V_{w\ell,s}$: Reported margin in stratum $s$ of reported winner $w$ over reported loser $\ell$, in votes.

$V_{w\ell}$: Overall reported margin of reported winner $w$ over reported loser $\ell$, in votes, for the entire contest (not just stratum $s$)

$V$: smallest reported overall margin between any reported winner and reported loser: $V \equiv \min_{w \in \mathcal{W}, \ell \in \mathcal{L}} V_{w\ell}$

$A_{w\ell,s}$: actual margin in the stratum of reported winner $w$ over reported loser $\ell$, in votes

$A_{w\ell}$: actual margin of reported winner $w$ over reported loser $\ell$, in votes, for the entire contest (not just in stratum $s$)
6.2 Reduction to maximum relative overstatement
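If the contest is entirely contained in stratum $s$, then the reported winners of the contest are the actual winners if
$$\min_{w \in \mathcal{W}, \ell \in \mathcal{L}} A_{w\ell,s} > 0.$$
Here, we address the case that the contest may include a portion outside the stratum. To combine independent samples in different strata, it is convenient to be able to test whether the net overstatement error in a stratum exceeds a given threshold.
Instead of testing that condition directly, we will test a condition that is sufficient but not necessary for the inequality to hold, to get a computationally simple test that is still conservative (i.e., the risk is not larger than its nominal value).
For every winner, loser pair $(w, \ell)$, we want to test whether the overstatement error exceeds some threshold, generally one tied to the reported margin between $w$ and $\ell$. For instance, for a simple stratified audit, we might take the threshold to be $\lambda_s V_{w\ell}$.
We want to test whether
$$\max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \sum_{p=1}^{P} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}} \ge \lambda_s.$$
The maximum of sums is not larger than the sum of the maxima; that is,
$$\max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \sum_{p=1}^{P} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}} \le \sum_{p=1}^{P} \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}}.$$
Define
$$e_p \equiv \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}}.$$
Then no reported margin is overstated by a fraction $\lambda_s$ or more if
$$E \equiv \sum_{p=1}^{P} e_p < \lambda_s.$$
Thus if we can reject the hypothesis $E \ge \lambda_s$, we can conclude that no pairwise margin was overstated by as much as a fraction $\lambda_s$.
Testing whether $E \ge \lambda_s$ would require a very large sample if we knew nothing at all about $e_p$ without auditing batch $p$: a single large value of $e_p$ could make $E$ arbitrarily large. But there is an a priori upper bound for $e_p$. Whatever the reported votes $v_{pi}$ are in batch $p$, we can find the potential values of the actual votes $a_{pi}$ that would make the error $e_p$ largest, because $a_{pi}$ must be between 0 and $n_p$, the number of ballots in batch $p$:
$$\frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}} \le \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}}.$$
Hence,
$$e_p \le \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}} \equiv u_p. \tag{4}$$
Knowing that $e_p \le u_p$ might let us conclude reliably that $E < \lambda_s$ by examining only a small number of batches, depending on the values $\{u_p\}$ and on the values of $\{e_p\}$ for the audited batches.
To make inferences about $E$, it is helpful to work with the taint $t_p \equiv e_p / u_p \le 1$. Define $U \equiv \sum_{p=1}^{P} u_p$. Suppose we draw batches at random with replacement, with probability $u_p / U$ of drawing batch $p$ in each draw, $p = 1, \ldots, P$. (Since $u_p \ge 0$, these are all nonnegative numbers, and they sum to 1, so they define a probability distribution on the batches.)
Let $T_j$ be the value of $t_p$ for the batch selected in the $j$th draw. Then $\{T_j\}$ are IID, $\Pr\{T_j \le 1\} = 1$, and
$$\mathbb{E}(T_1) = \sum_{p=1}^{P} \frac{u_p}{U} t_p = \frac{1}{U} \sum_{p=1}^{P} e_p = \frac{E}{U}.$$
Thus $E = U \, \mathbb{E}(T_1)$.
So, if we have strong evidence that $\mathbb{E}(T_1) < \lambda_s / U$, we have strong evidence that $E < \lambda_s$.
This approach can be simplified even further by noting that $u_p$ has a simple upper bound that does not depend on $v_{pi}$. At worst, the reported result for batch $p$ shows $n_p$ votes for the “least-winning” apparent winner of the contest with the smallest margin, but a hand interpretation would show that all $n_p$ ballots in the batch had votes for the runner-up in that contest. Since $V_{w\ell} \ge V$ and $0 \le v_{pi} \le n_p$,
$$u_p = \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}} \le \frac{2 n_p}{V}.$$
Thus if we use $2 n_p / V$ in lieu of $u_p$, we still get conservative results. (We also need to redefine $U$ to be the sum of those upper bounds.) An intermediate, still conservative approach would be to use this upper bound for batches that consist of a single ballot, but use the sharper bound (4) when $n_p > 1$. Regardless, for the new definition of $u_p$ and $U$, $\{T_j\}$ are IID, $\Pr\{T_j \le 1\} = 1$, and
$$\mathbb{E}(T_1) = \frac{E}{U}.$$
So, if we have evidence that $\mathbb{E}(T_1) < \lambda_s / U$, we have evidence that $E < \lambda_s$.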
6.3 Testing
To test whether , there are a variety of methods available. One particularly “clean” sequential method is based on Wald’s Sequential Probability Ratio Test (SPRT) (Wald, 1945). Harold Kaplan pointed out this method on a website that no longer exists. A derivation of this “Kaplan-Wald” method is given in Stark and Teague (2014, Appendix A); to apply the method here, take in their equation 18.
A different sequential method, the Kaplan-Markov method (also due to Harold Kaplan), is given in Stark (2009a).
7 Ballot-polling audits of a tolerable overstatement in votes
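As an illustration of how such a sequential test can be computed, here is a minimal sketch of a Kaplan-Markov style P-value. It assumes the null hypothesis is written in the generic form "the expected taint is at least `t0`" and that every taint is at most 1; the parameter `t0` and the function name are hypothetical, and the mapping from the threshold used in this section to `t0` must be taken from the text.

```python
import math


def kaplan_markov_p_value(taints, t0):
    """Sequentially valid P-value for H0: E[T] >= t0, from IID draws with T <= 1.

    Under H0, each factor (1 - T_j) / (1 - t0) is nonnegative with expected value
    at most 1, so the running product of (1 - t0) / (1 - T_j) bounds the P-value
    (Markov's inequality applied to a nonnegative martingale).
    """
    if not 0 < t0 < 1:
        raise ValueError("t0 must lie strictly between 0 and 1")
    running = 1.0
    p_min = 1.0
    for t in taints:
        if t > 1:
            raise ValueError("taints cannot exceed 1")
        # A taint of exactly 1 makes the factor infinite: no later draw can lower the product.
        running = math.inf if t == 1 else running * (1 - t0) / (1 - t)
        p_min = min(p_min, running)
    return p_min


# 150 draws with no error at all, against a hypothetical null mean taint of 0.02.
print(kaplan_markov_p_value([0.0] * 150, 0.02))
```

Taking the minimum over the running products is what makes the test sequential: the audit may stop the first time the product falls below the risk limit.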
7.1 Conditional trihypergeometric test
We consider a single stratum , containing ballots. We will sample individual ballots without replacement from stratum . Of the ballots, have a vote for but not for , have a vote for but not for , and have votes for both and or neither nor , including undervotes and invalid ballots. We might draw a simple random sample of ballots ( fixed ahead of time), or we might draw sequentially without replacement, so the sample size could be random. For instance, the rule for determining could depend on the data. (Sampling with replacement leads to simpler arithmetic, but is not as efficient.)
Regardless, we assume that, conditional on the attained sample size , the ballots are a simple random sample of size from the ballots in the population. In the sample, ballots contain a vote for but not , with and defined analogously. Then, conditional on , the joint distribution of is trihypergeometric:
(5)
The test statistic will be the diluted sample margin, . This is the sample difference in the number of ballots for the winner and for the loser, divided by the total number of ballots in the sample. We want to test the compound hypothesis . The value of is inferred from the definition . Thus,
The alternative is the compound hypothesis . (To use Wald’s Sequential Probability Ratio Test, we might pick a simple alternative instead, e.g., and , the reported values, provided .) Hence, we will reject for large values of . Conditional on , the event is the event .
Suppose we observe . The test will condition on the event . (In contrast, the BRAVO ballot-polling method (Lindeman et al., 2012) conditions only on .)
The P-value of the simple hypothesis that there are ballots with a vote for but not for , ballots with a vote for but not for , and ballots with votes for both and or neither nor (including undervotes and invalid ballots) is the sum of these probabilities for events when . Therefore,
(6) 
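The conditional trihypergeometric P-value for a simple hypothesis can be computed by direct enumeration. The following is a minimal sketch under assumed names: `pop_w` and `pop_l` are the hypothesized numbers of ballots in the stratum for the winner only and for the loser only, `pop_total` is the number of ballots in the stratum, and `obs_w`, `obs_l` are the corresponding counts in a sample of size `n`. The P-value is the probability that the sample difference is at least as large as the one observed.

```python
from math import comb


def trihypergeometric_p_value(pop_w, pop_l, pop_total, obs_w, obs_l, n):
    """P(B_w - B_l >= obs_w - obs_l | sample size n) under a simple hypothesis."""
    pop_u = pop_total - pop_w - pop_l  # ballots for both, neither, or invalid
    if pop_u < 0 or not 0 <= obs_w + obs_l <= n <= pop_total:
        raise ValueError("inconsistent population or sample counts")
    observed_diff = obs_w - obs_l
    favorable = 0
    for i in range(min(pop_w, n) + 1):          # winner-only ballots in the sample
        for j in range(min(pop_l, n - i) + 1):  # loser-only ballots in the sample
            if i - j >= observed_diff:
                # comb returns 0 when n - i - j exceeds pop_u, so impossible samples drop out
                favorable += comb(pop_w, i) * comb(pop_l, j) * comb(pop_u, n - i - j)
    return favorable / comb(pop_total, n)


# Hypothetical stratum: 1000 ballots, hypothesized tie at 450 each, sample of 100.
print(trihypergeometric_p_value(450, 450, 1000, 55, 40, 100))
```

Integer arithmetic is exact until the final division, so the sketch is numerically safe, but the double loop grows quadratically with the sample size.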
7.2 Conditional hypergeometric test
Another approach is to condition on both the events and . We describe the hypothesis test here, but do not advocate for using it. We found that this approach was inefficient in some simulation experiments.
Given , all samples of size from the ballots are equally likely, by hypothesis. Hence, in particular, all samples of size for which are equally likely. There are such samples. Among these samples, may take values . For a fixed , there are samples with and .
The factor counts the number of ways to sample of the remaining ballots. If we divide out this factor, we simply count the number of ways to sample ballots from the group of ballots for or for . There are equally likely samples of size from the ballots with either a vote for or for , but not both, and of these samples, contain ballots with a vote for but not . Therefore, conditional on and , the probability that is
The P-value of the simple hypothesis that there are ballots with a vote for but not for , ballots with a vote for but not for , and ballots with votes for both and or neither nor (including undervotes and invalid ballots) is the sum of these probabilities for events when . This event occurs for . Therefore,
(7) 
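A minimal sketch of this conditional calculation follows, using only the standard library. The names are assumptions: `pop_w` and `pop_l` are the hypothesized numbers of ballots for the winner only and for the loser only, and `obs_w`, `obs_l` are the matching sample counts. Conditional on the number of sampled ballots that show a vote for exactly one of the two candidates, a sample difference at least as large as the observed one is the same event as at least `obs_w` winner-only ballots.

```python
from math import comb


def hypergeometric_tail(good, bad, draws, k_min):
    """P(X >= k_min) for the number X of good items in `draws` draws without replacement."""
    if not 0 <= draws <= good + bad:
        raise ValueError("cannot draw more items than the population holds")
    favorable = sum(
        comb(good, k) * comb(bad, draws - k)  # comb is 0 when draws - k > bad
        for k in range(max(k_min, 0), min(good, draws) + 1)
    )
    return favorable / comb(good + bad, draws)


def conditional_hypergeometric_p_value(pop_w, pop_l, obs_w, obs_l):
    """P-value conditional on both the sample size and obs_w + obs_l."""
    return hypergeometric_tail(pop_w, pop_l, obs_w + obs_l, obs_w)


# Hypothetical: tie hypothesized at 450 each; sample shows 55 for the winner, 40 for the loser.
print(conditional_hypergeometric_p_value(450, 450, 55, 40))
```

Only the two hypothesized counts enter the calculation; the number of remaining ballots drops out once the test conditions on the number of sampled ballots for either candidate.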
This conditional P-value is thus the tail probability of the hypergeometric distribution with parameters “good” items, “bad” items, and a sample of size . This calculation is numerically stable and fast; tail probabilities of the hypergeometric distribution are available and well-tested in all standard statistics software.
7.3 Maximizing the P-value over the null set
The composite null hypothesis does not specify or separately, only that for some fixed, known . The (conditional) P-value of this composite hypothesis for is the maximum P-value for all values that are possible under the null hypothesis,
(8) 
wherever the summand is defined. (Equivalently, define if , , or .)
7.3.1 Optimizing over the parameter
The following result enables us to only test hypotheses along the boundary of the null set.
Theorem 1.
Assume that . Suppose the composite null hypothesis is . The P-value is maximized on the boundary of the null region, i.e., when .
Proof.
Without loss of generality, let and assume that is fixed. Let be the fixed, unknown number of ballots for or for in stratum . The P-value for the simple hypothesis that is
(9) 
where is defined as the term in the summand and for pairs that don’t appear in the summation.
Assume that is given. The P-value for this simple hypothesis is
Terms in the fraction can be simplified: choose the corresponding pairs in the numerator and denominator. Fractions of the form can be expressed as . Fractions of the form can be expressed as . Thus, the P-value can be written as
The last inequality follows from the fact that and are nonnegative, and that – it is a possible outcome under the null hypothesis.
∎
7.3.2 Optimizing over the parameter
We have shown empirically (but do not prove) that this tail probability, as a function of , has a unique maximum at one of the endpoints when is either as small or as large as possible, given , , and the observed sample values and . If the empirical result is true, then finding the maximum is trivial; otherwise, computing the unconditional P-value is a simple one-dimensional optimization problem.
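If the endpoint result is not relied upon, the one-dimensional search can be done by brute force. The sketch below is an illustration under assumed names, not the authors' implementation: `c` is the hypothesized difference on the boundary of the null set, `pop_total` is the number of ballots in the stratum, and the search runs over every feasible hypothesized winner-only count `pop_w`, with the loser-only count fixed at `pop_w - c`.

```python
from math import comb


def tri_p_value(pop_w, pop_l, pop_total, obs_w, obs_l, n):
    """Trihypergeometric P(B_w - B_l >= obs_w - obs_l | sample size n)."""
    pop_u = pop_total - pop_w - pop_l
    favorable = 0
    for i in range(min(pop_w, n) + 1):
        for j in range(min(pop_l, n - i) + 1):
            if i - j >= obs_w - obs_l:
                favorable += comb(pop_w, i) * comb(pop_l, j) * comb(pop_u, n - i - j)
    return favorable / comb(pop_total, n)


def max_p_value_on_boundary(c, pop_total, obs_w, obs_l, n):
    """Maximize the P-value over all (pop_w, pop_l) with pop_w - pop_l = c."""
    best_p, best_w = -1.0, None
    for pop_w in range(max(c, 0), pop_total + 1):
        pop_l = pop_w - c
        if pop_w + pop_l > pop_total:
            break  # larger pop_w would exceed the stratum size
        p = tri_p_value(pop_w, pop_l, pop_total, obs_w, obs_l, n)
        if p > best_p:
            best_p, best_w = p, pop_w
    return best_p, best_w


# Hypothetical small stratum: 200 ballots, boundary difference c = 0, sample of 40.
print(max_p_value_on_boundary(0, 200, 25, 10, 40))
```

The exhaustive search costs one P-value evaluation per candidate value, which is affordable for small strata; checking only the two endpoints would be much faster if the empirical result holds.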
7.4 Conditional testing
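If the conditional tests are always conducted at significance level $\alpha$ or less, i.e., so that $\Pr\{\text{Type I error} \mid M = m\} \le \alpha$ for every attainable value $m$ of the statistic $M$ on which the test conditions, then the overall procedure has significance level $\alpha$ or less:
$$\Pr\{\text{Type I error}\} = \sum_{m} \Pr\{\text{Type I error} \mid M = m\}\,\Pr\{M = m\} \le \alpha \sum_{m} \Pr\{M = m\} = \alpha. \qquad (10)$$
In particular, this implies that our conditional hypergeometric test will have the correct risk limit unconditionally.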
8 Recommendations
We have outlined several methods Colorado might use to audit cross-jurisdictional contests that include CVR counties and no-CVR counties. We expect that stratified “hybrid” audits will be the most palatable, given the constraints on time for software development and the logistics of the audit itself, because the workflow for counties would be the same as it was in November, 2017.
What would change is the risk calculation “behind the scenes,” including the algorithms used to decide when the audit can stop. Those algorithms could be implemented in software external to RLATool. The minimal modification to RLATool that would be required to conduct a hybrid audit is to allow the sample size from each county to be controlled externally, e.g., by uploading a parameter file once per round, rather than using a formula that is based on the margin within that county alone. The parameter file would be generated by external software that does the audit calculations described here based on the detailed audit progress and discrepancy data available from RLATool’s rla_export command.
To conduct a hybrid audit, one must choose two numbers in addition to the risk limit :
(i) one stratum-wise risk limit, (the other, , is determined from and the overall risk limit, )
(ii) the tradeoff (allocation) of the tolerable overstatement between strata, (the value of is )
Those parameters can be chosen essentially arbitrarily (provided ) and the audit will still be risk-limiting; however, they can be optimized to reduce the expected workload under various assumptions about tabulation errors in the two strata. (Software that can be used to run scenarios is available at https://www.github.com/pbstark/CORLA18; see below.)
In either stratum, increasing the risk limit or increasing the tolerable overstatement will decrease the required sample size from that stratum (assuming that the actual overstatement in that stratum is less than its allowable overstatement). The relative change in sample size as the risk limit changes scales similarly in the two strata, because the risk limit enters both ballot-level comparisons and ballot-polling the same way: as the logarithm. However, the relative change in sample sizes as the tolerable overstatement changes scales quite differently in the two strata: linearly in the ballot-level comparison stratum, but quadratically in the ballot-polling stratum. Hence, the workload is not as sensitive to how the risk limit is allocated across strata as it is to how the tolerable overstatement is allocated.
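The linear and quadratic scaling can be seen with two textbook approximations, which stand in for the stratum-level tests and are not the calculations used in the notebooks. The sketch assumes a comparison audit that finds no discrepancies, with a Kaplan-Markov style stopping rule and an assumed error-inflation factor `gamma`, and a two-candidate ballot-polling SPRT with no invalid ballots whose reported vote shares are correct. Here `margin` denotes the margin, as a fraction of ballots, that each test must confirm; all function and parameter names are hypothetical.

```python
from math import ceil, log


def comparison_sample_size(alpha, margin, gamma=1.03905):
    """Smallest n with (1 - margin / (2 * gamma)) ** n <= alpha, assuming no discrepancies."""
    return ceil(log(alpha) / log(1 - margin / (2 * gamma)))


def polling_expected_sample_size(alpha, margin):
    """Wald's approximation to the expected sample size of a two-candidate SPRT."""
    share_w = (1 + margin) / 2
    # Expected log likelihood ratio contributed by one ballot; roughly margin**2 / 2.
    drift = share_w * log(2 * share_w) + (1 - share_w) * log(2 * (1 - share_w))
    return ceil(log(1 / alpha) / drift)


alpha = 0.05
for margin in (0.02, 0.01):
    print(margin, comparison_sample_size(alpha, margin), polling_expected_sample_size(alpha, margin))
```

Halving the margin roughly doubles the comparison sample size but roughly quadruples the ballot-polling sample size, while changing `alpha` moves both only through its logarithm.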
8.1 Software and examples
Examples of stratified hybrid audits are in Jupyter notebooks available at https://www.github.com/pbstark/CORLA18. The first two examples are contained in a single notebook, “hybridauditexample1”. The first example is a hypothetical medium-sized election with a total of votes and a diluted margin of . 9.1% of the ballots come from no-CVR counties. The risk limit is 10%. If the audit in the CVR stratum found no errors and the allowable overstatement error was 30% of the margin, it would terminate after examining 1,213 ballots. In over 90% of 10,000 simulations, an audit of 250 ballots from the no-CVR stratum would have sufficed to confirm that the overstatement error in that stratum did not exceed its allocation, 70% of the margin. A sample of 450 ballots was sufficient to stop the audit in 99% of simulations. As always, could be adjusted to rebalance the expected workload between strata, perhaps taking into account the expected workload for audits of countywide or intracounty contests, so as to minimize (or quite possibly eliminate) any additional burden imposed by the stratified audit.
If a CVR were available for all counties and we could have run a ballot-level comparison audit for the entire contest, rather than stratifying, an audit with risk limit 10% that found no errors would have concluded after examining just 263 ballots. The efficiency gain relative to the hybrid audit comes from two sources. First, ballot-level comparison audits are substantially more efficient than ballot-polling audits. Second, the hybrid audit requires dividing the margin and risk limit between two strata. This results in both strata using smaller risk limits. In order to keep the workload low, it is necessary to allocate a disproportionately high fraction of the margin to the no-CVR stratum; the CVR stratum must increase its workload to compensate.
Another method discussed in Section 2.3 is to perform a ballot-level comparison audit statewide, but to treat any ballot sampled from a no-CVR county as showing a two-vote overstatement. In this example, this worst-case method would lead to a full hand count. However, the situation may be more optimistic for Colorado: if only 1.2% of ballots came from the no-CVR stratum and the overall margin were in fact 10,000 votes, then this method would require checking 430 ballots.
The second example is a hypothetical large statewide election with a total of 2 million ballots and a diluted margin of nearly . The risk limit is 5%. If the audit in the CVR stratum found no errors and the allowable overstatement error was 10% of the margin, it would terminate after examining 50 ballots. In over 90% of 10,000 simulations, an audit of 50 ballots from the no-CVR stratum would have sufficed to confirm that the overstatement error in that stratum did not exceed its allocation, 90% of the margin. A sample of 100 ballots was sufficient to stop the audit in 99% of simulations. If a CVR were available for all counties and we could have run a ballot-level comparison audit for the entire contest, rather than stratifying, an audit with risk limit 5% that found no errors would have concluded after examining just 24 ballots.
A second notebook, “hybridauditexample2,” illustrates the workflow for conducting a hybrid audit of this kind. The example election has a total of 2 million ballots. The reported margin is just over , but in reality the vote totals for the reported winner and reported loser are identical in both strata. The risk limit is 5%. The example illustrates two scenarios. In the first scenario, the audit in the CVR stratum escalates to a full hand count and the allowable overstatement in the no-CVR stratum must be adjusted. Using the new allowable overstatement in the no-CVR stratum makes it impossible to terminate the audit, even for samples as large as 5% of the ballots. In the second scenario, the audit in the no-CVR stratum terminates with a sample of 500 ballots. However, the audit in the CVR stratum will still lead to a full hand count and the audit in the no-CVR stratum must be redone using the adjusted allowable overstatement, putting us back in the first scenario. In both cases, the audit leads to a full recount of all the ballots.
These notebooks can be modified and run with different contest sizes, margins, risk limits, and allocations of allowable error, in order to estimate the workload of different scenarios.
References
 Bañuelos and Stark (2012) J.H. Bañuelos and P.B. Stark. Limiting risk by turning manifest phantoms into evil zombies. Technical report, arXiv.org, 2012. URL http://arxiv.org/abs/1207.3413. Retrieved 17 July 2012.
 Benaloh et al. (2011) J. Benaloh, D. Jones, E. Lazarus, M. Lindeman, and P.B. Stark. SOBA: Secrecy-preserving observable ballot-level audits. In Proceedings of the 2011 Electronic Voting Technology Workshop / Workshop on Trustworthy Elections (EVT/WOTE ’11). USENIX, 2011. URL http://statistics.berkeley.edu/~stark/Preprints/soba11.pdf.
 Grimmett and Stirzaker (2001) Geoffrey R. Grimmett and David R. Stirzaker. Probability and Random Processes. Oxford University Press, August 2001. ISBN 0198572220. URL http://www.amazon.ca/exec/obidos/redirect?tag=citeulike0920&path=ASIN/0198572220.
 Lindeman et al. (2012) M. Lindeman, P.B. Stark, and V. Yates. BRAVO: Ballot-polling risk-limiting audits to verify outcomes. In Proceedings of the 2011 Electronic Voting Technology Workshop / Workshop on Trustworthy Elections (EVT/WOTE ’11). USENIX, to appear 2012.
 Lindeman and Stark (2012) Mark Lindeman and Philip B. Stark. A gentle introduction to risk-limiting audits. IEEE Security and Privacy, 10:42–49, 2012.
 Rivest (2018) Ronald L. Rivest. Bayesian tabulation audits: Explained and extended, January 1, 2018. URL https://arxiv.org/abs/1801.00528.
 Stark (2008) P.B. Stark. Conservative statistical post-election audits. Ann. Appl. Stat., 2:550–581, 2008. URL http://arxiv.org/abs/0807.4005.
 Stark (2009a) P.B. Stark. Risk-limiting post-election audits: P-values from common probability inequalities. IEEE Transactions on Information Forensics and Security, 4:1005–1014, 2009a.
 Stark (2009b) P.B. Stark. Auditing a collection of races simultaneously. Technical report, arXiv.org, 2009b. URL http://arxiv.org/abs/0905.1422v1.
 Stark (2010) P.B. Stark. Super-simple simultaneous single-ballot risk-limiting audits. In Proceedings of the 2010 Electronic Voting Technology Workshop / Workshop on Trustworthy Elections (EVT/WOTE ’10). USENIX, 2010. URL http://www.usenix.org/events/evtwote10/tech/full_papers/Stark.pdf.
 Stark and Teague (2014) Philip B. Stark and Vanessa Teague. Verifiable European elections: Risk-limiting audits for D’Hondt and its relatives. JETS: USENIX Journal of Election Technology and Systems, 3.1, 2014. URL https://www.usenix.org/jets/issues/0301/stark.
 Stark and Wagner (2012) Philip B. Stark and David A. Wagner. Evidence-based elections. IEEE Security and Privacy, 10:33–41, 2012.
 Wald (1945) A. Wald. Sequential tests of statistical hypotheses. Ann. Math. Stat., 16:117–186, 1945.