Next Steps for the Colorado Risk-Limiting Audit (CORLA) Program

by Mark Lindeman, et al.

Colorado conducted risk-limiting tabulation audits (RLAs) across the state in 2017, including both ballot-level comparison audits and ballot-polling audits. Those audits covered only contests restricted to a single county; methods to efficiently audit contests that cross county boundaries and combine ballot polling and ballot-level comparisons have not been available. Colorado's current audit software (RLATool) needs to be improved to audit these contests that cross county lines and to audit small contests efficiently. This paper addresses these needs. It presents extremely simple but inefficient methods, more efficient methods that combine ballot polling and ballot-level comparisons using stratified samples, and methods that combine ballot-level comparison and variable-size batch comparison audits in a way that does not require stratified sampling. We conclude with some recommendations, and illustrate our recommended method using examples that compare it to existing approaches. Exemplar open-source code and interactive Jupyter notebooks are provided that implement the methods and allow further exploration.




1 Introduction

A risk-limiting audit (RLA) of an election is a procedure that has a known, pre-specified minimum chance of correcting the electoral outcome if the outcome is incorrect—that is, if the reported outcome differs from the outcome that a full manual tabulation of the votes would find. RLAs require a durable, voter-verifiable record of voter intent, such as paper ballots, and they assume that this audit trail is sufficiently complete and accurate that a full hand tally would show the true electoral outcome. That assumption is not automatically satisfied: a compliance audit (Stark and Wagner, 2012) is required.

Risk-limiting audits are generally incremental: they examine more ballots, or batches of ballots, until either (i) there is strong statistical evidence that a full hand tabulation would confirm the outcome, or (ii) the audit has led to a full hand tabulation, the result of which should become the official result.

RLAs have been piloted in California, Colorado, and Ohio, and a test of RLA procedures has been conducted in Arizona. RLA bills are being drafted or are already under consideration in California, Virginia, Washington, and other states. A number of laws have either allowed or mandated risk-limiting audits, including California AB 2023 (Saldaña), SB 360 (Padilla), and AB 44 (Mullin); Rhode Island SB 413A and HB 5704A; and Colorado Revised Statutes (CRS) 1-7-515.

CRS 1-7-515 required Colorado to implement risk-limiting audits beginning in 2017. (There are provisions to allow the Secretary of State to exempt some counties.) The first set of coordinated risk-limiting election audits across the state took place in Colorado in November 2017.

Colorado’s “uniform voting system” program led many Colorado counties to purchase (or to plan to purchase) voting systems that are auditable at the ballot level: those systems export cast vote records (CVRs) for individual ballots in a manner that allows the corresponding paper ballot to be identified, and conversely, make it possible to find the CVR corresponding to any particular paper ballot. We call counties that have such systems “CVR” counties. It is estimated that by June 2018, 98.2% of active Colorado voters will be in CVR counties. CVR counties can perform “ballot-level comparison audits” (Lindeman and Stark, 2012), which are currently the most efficient approach to risk-limiting audits in that they require examining fewer ballots than other methods do, when the outcome of the contest under audit is in fact correct.

Other counties (“legacy” or “no-CVR” counties) have systems that do not allow auditors to check how the system interpreted voter intent for individual ballots. Their election results can still be audited, provided their voting systems create a voter-verifiable paper trail (e.g., voter-marked paper ballots) that is conserved to ensure that it remains accurate and intact, and organized well enough to permit ballots to be selected at random. Pilot audits in California suggest that the most efficient way to audit such systems is by “ballot-polling” (Lindeman et al., 2012; Lindeman and Stark, 2012) (in contrast to “batch-level comparisons,” for example).

There is currently no literature on how to perform risk-limiting audits of contests that include CVR counties and no-CVR counties by combining ballot polling and ballot-level comparisons. Existing methods would either require all counties to use the lowest common denominator, ballot-polling (which does not take advantage of the CVRs, and thus is expected to require more auditing than a method that does take advantage of the CVRs), or would require no-CVR counties to perform batch-level comparisons, which were found in California to be (generally) less efficient than ballot-polling audits. Rivest (2018) gives a different (Bayesian) approach to auditing contests that include both CVR counties and no-CVR counties.

The open-source audit software used for Colorado’s 2017 audits, RLATool, needs to be improved to audit contests that cross county lines and to audit small contests efficiently.

First, the current version (1.1.0) of RLATool needs to be modified to recognize and group together contests that cross jurisdictional boundaries; currently, it treats every contest as if it were entirely contained in a single county. Margins and risk limits apply to entire contests, not to the portion of a contest included in a county. RLATool also does not allow the user to select the sample size, nor does it directly allow an unstratified random sample to be drawn across counties. Second, to audit a contest that includes voters in “legacy” counties (counties with voting systems that cannot export cast vote records) and voters in counties with newer systems, new statistical methods are needed to keep the efficiency of ballot-level comparison audits that the newer systems afford. Third, auditing contests that appear only on a subset of ballots can be made much more efficient if the sample can be drawn from just those ballots that contain the contest. While allowing samples to be restricted to ballots reported to contain a particular contest is not essential in the short run, it will be necessary eventually to make it feasible to audit smaller contests.

This document focuses on near-term requirements for risk-limiting audits in Colorado. Section 2 presents a number of crude approaches that could be implemented easily but might require examining substantially more ballots. Section 3 presents an approach based on comparison audits with different batch sizes. This approach is statistically elegant and relatively efficient, but might require changing how counties handle their ballots. Section 4 presents our recommended approach, which combines ballot-level comparisons in counties that can perform them with ballot-polling in the no-CVR counties. All the approaches require new software, including at least minor modifications to RLATool. We provide example software implementing the risk calculations for our recommended approach as a Python Jupyter notebook. Section 5 describes how audit efficiency could be improved in CVR counties by combining CVR data with data from SCORE, Colorado’s voter registration system, which also tracks who voted. Sections 6 and 7 explain the recommended modifications to ballot-level comparison and ballot-polling audits, respectively. Section 8 summarizes our recommendations and considerations for implementation.

1.1 Priorities for Colorado audits

Auditing efficiency is controlled in part by how well the audit can limit the sample to ballots that contain the contests under audit. Some contests are on (essentially) every ballot, for instance the governor’s race. Others, such as mayoral contests, may appear on only a small fraction of ballots cast in a county. Partisan primaries—even for statewide office—are somewhere in between, because in general no single party’s primary appears on every ballot cast in the state. Thus, either we accept reduced efficiency for the sake of simplicity by continuing to sample ballots uniformly from within counties (or collections of counties), or we develop a way to focus the auditing on the ballots that contain the contest. The latter requires external information, e.g., from SCORE, as discussed below.

Moreover, party primaries for statewide offices (and perhaps other contests) will include CVR counties and no-CVR counties, so we need a method to audit across both kinds of voting technology.

This report addresses both issues, providing options for effectively auditing heterogeneous voting technology, varying in efficiency, complexity, and on whom any additional audit burden falls.

2 Crude (and unpleasant) approaches

Here and generally throughout the paper, we discuss auditing a single contest at a time, although the same sample can be used to audit more than one contest and there are ways of combining audits of different contests into a single process (Stark, 2009b, 2010). We use terminology drawn from a number of papers; the key reference is Lindeman and Stark (2012). An overstatement error is an error that caused the margin between any reported winner and any reported loser to appear larger than it really was. An understatement error is an error that caused the margin between every reported winner and every reported loser to appear to be smaller than it really was.

2.1 Hand count the legacy counties

The simplest approach to combining legacy counties with CVR counties is to require every legacy county to do a full hand count of the primaries, and to conduct a ballot-level comparison audit in CVR counties, based on contest margins adjusted for the results of the manual tallies in the legacy counties. For instance, imagine a contest with two candidates, reported winner $w$ and reported loser $\ell$. Suppose the total number of reported votes for candidate $w$ is $V_w$ and the total for candidate $\ell$ is $V_\ell$, so that $V_w > V_\ell$, since $w$ is the reported winner. Suppose that a full manual tally of the votes in the legacy counties shows $A_w$ votes for $w$ and $A_\ell$ votes for $\ell$. Suppose that a total of $N_{CVR}$ ballots were cast in the CVR counties. Then the diluted margin for the comparison audit in the CVR counties is the contest margin, adjusted for the hand-tallied legacy results $A_w$ and $A_\ell$, divided by $N_{CVR}$. Requiring a full hand count in the legacy counties has obvious disadvantages, except perhaps in very close contests where ballot polling is not efficient. (But it does have the advantage of not forcing CVR counties to do additional auditing to compensate for the legacy counties.)

2.2 Subtract error bounds for the legacy counties from vote totals

If ballot accounting and SCORE data can provide good upper bounds on the number of ballots cast in each contest in legacy counties, there are simple upper bounds on the total possible overstatement error each legacy county could contribute to the overall contest results; those can be subtracted from the overall margin (as in the previous subsection) and the remainder of the contests can be audited in CVR counties against the adjusted margins. For instance, consider a primary that appears on $N$ ballots in legacy counties. Suppose that in legacy counties, the overall, statewide contest winner, $w$, is reported to have received $V_w$ votes, and some loser, $\ell$, is reported to have received $V_\ell$ votes. (Note that $V_\ell$ could be greater than $V_w$: $w$ is not necessarily the reported winner in the legacy counties.) Then the most overstatement error that the legacy counties could possibly have in determining whether $w$ in fact beat $\ell$ is if every reported undervote, invalid vote, or vote for a different candidate had in fact been a vote for $\ell$ (producing a 1-vote overstatement), and every vote reported for $w$ was in fact a vote for $\ell$ (producing a 2-vote overstatement). The reduction in the margin that would produce is $(N - V_w - V_\ell) + 2V_w = N + V_w - V_\ell$ votes.
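The worst-case bound described above is simple arithmetic, and a minimal sketch may help make it concrete. The function and argument names below are hypothetical (they are not part of RLATool), and the sketch assumes a vote-for-one contest in which the ballot count for the legacy counties is a trusted upper bound.

```python
def max_legacy_overstatement(n_ballots: int, votes_w: int, votes_l: int) -> int:
    """Largest possible overstatement of the w-over-l margin in legacy counties.

    Hypothetical helper. Assumes a vote-for-one contest, so the reported
    votes for w and l cannot exceed the number of ballots.
    """
    if min(n_ballots, votes_w, votes_l) < 0 or votes_w + votes_l > n_ballots:
        raise ValueError("vote totals are inconsistent with the ballot count")
    # Undervotes, invalid votes, and votes for other candidates:
    # each could really be a vote for l (a 1-vote overstatement).
    other = n_ballots - votes_w - votes_l
    # Every reported vote for w could really be a vote for l
    # (a 2-vote overstatement). Reported votes for l cannot be overstated.
    return other + 2 * votes_w


def adjusted_margin(overall_margin: int, n_ballots: int,
                    votes_w: int, votes_l: int) -> int:
    """Overall reported margin minus the worst-case legacy overstatement."""
    return overall_margin - max_legacy_overstatement(n_ballots, votes_w, votes_l)


# Illustrative numbers only: 1,000 legacy ballots, 400 reported for w,
# 350 reported for l, and a contest-wide reported margin of 5,000 votes.
bound = max_legacy_overstatement(1000, 400, 350)
remaining = adjusted_margin(5000, 1000, 400, 350)
```

If the adjusted margin is zero or negative, this crude approach cannot confirm the outcome without hand counting, which is the cost noted in the next paragraph.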

Whereas the previous approach places the auditing burden created by obsolescent equipment entirely on the legacy counties, this approach places it entirely upon the CVR counties. Also, in a close contest, it could require a full hand count in every county that might not otherwise be necessary.

2.3 Treat legacy counties as if every ballot selected from them for audit has a two-vote overstatement

A third simple-but-pessimistic approach is to sample uniformly from all counties as if one were performing a ballot-level comparison audit everywhere, but to treat any ballot selected from a legacy county as a two-vote overstatement. This approach has the same disadvantages as the previous approach.

3 Variable batch sizes

Another approach is to perform a comparison audit across all counties, but to use batches consisting of more than one ballot (batch-level comparisons) in legacy counties and batches consisting of a single ballot (ballot-level comparisons) in CVR counties. (For majority and plurality elections, including those in which voters can select more than one candidate, audits can be based on overstatement and understatement errors at the level of batches.) This requires that the no-CVR counties report vote subtotals for physically identifiable batches. If a county’s voting system can only report subtotals by precinct but the county does not sort paper ballots by precinct, this approach might require revising how the county handles its paper; we understand that this is the case in many Colorado counties.

That said, many California counties that do not sort vote-by-mail (VBM) ballots by precinct conduct the statutory 1% audits by manually retrieving the ballots for just those precincts selected for audit from whatever physical batches they happen to be in: the situation is identical to that in Colorado.

Another solution is the “Boulder-style” batch-level audit, which requires generating vote subtotals after each physical batch is scanned, and exporting those subtotals in machine-readable form. That in turn may require using extra memory cards, repeatedly initializing and deleting tabulation databases, or other measures that add complexity and opportunity for human error.

While those two approaches are laborious, they would provide a viable short-term solution, especially combined with information from SCORE to check that the reported batch-level results contain the correct number of ballots for each contest under audit. Moreover, they do not unduly increase the workload in CVR counties to compensate for legacy equipment.

This kind of variable-batch-size comparison audit approach would require modifying or augmenting RLATool in several ways:

  1. The CVR reporting tool would need to be modified to allow no-CVR counties to report batch-level results in a manner analogous to how CVR counties report ballot-level results, or an external tool would need to be provided.

  2. The sampling algorithm would have to allow sampling batches—and sampling them with unequal probability, because efficient batch-level audits involve sampling batches with probability proportional to a bound on the possible overstatement error in the batch. It would also need to calculate the appropriate sampling probability for each batch (of whatever size). Again, this could be accommodated using an external tool to draw the sample from legacy counties.

  3. The risk calculations would need to be modified. This, too, could be done with external software, with suitable provisions for capturing audit data from RLATool or directly from legacy counties.
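The second modification, sampling batches with probability proportional to an error bound, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the dictionary layout are hypothetical, and Python's built-in generator stands in for the publicly verifiable pseudo-random number generator that a real audit would need.

```python
import random


def sample_batches_pps(error_bounds: dict, n_draws: int, seed: int) -> list:
    """Draw batches with replacement, with probability proportional to bound.

    `error_bounds` maps a (hypothetical) batch identifier to an upper bound
    on the overstatement error that batch could hold. A single-ballot batch
    in a CVR county and a multi-ballot batch in a legacy county are treated
    the same way; only their bounds differ.
    """
    batch_ids = list(error_bounds)
    weights = [error_bounds[b] for b in batch_ids]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("bounds must be nonnegative and not all zero")
    # Illustration only: an audit would use a transparent, seedable PRNG.
    rng = random.Random(seed)
    return rng.choices(batch_ids, weights=weights, k=n_draws)


# A batch whose bound is zero cannot hold any overstatement and is never drawn.
bounds = {"legacy-batch-A": 0.5, "legacy-batch-B": 0.0, "cvr-ballot-17": 1.5}
draws = sample_batches_pps(bounds, n_draws=200, seed=12345)
```

Because draws are made with replacement, the same batch can be selected more than once; it is audited once and its result is counted for each draw.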

None of these changes is enormous; the mathematics and statistics are already worked out in published papers, and there is exemplar code for calculating the batch-level error bounds, drawing the samples with probability proportional to an error bound, and calculating the attained risk from the sample results. Indeed, this is the method that was used in several of California’s pilot audits, including the audit in Orange County. A derivation of a method for comparison audits with variable batch sizes is given below in section 6.

4 Stratified “hybrid” audits

Other approaches involve stratification: partitioning the cast ballots into non-overlapping groups and sampling independently from those groups. One could stratify by county, but in general it is simpler and more efficient statistically (i.e., results in auditing fewer ballots) to minimize the number of strata. We consider methods that use two strata: CVR counties and no-CVR counties. Collectively, the ballots cast in CVR counties comprise one stratum and the ballots cast in legacy counties comprise a second stratum; every ballot cast in the contest is in exactly one of the two strata. We assume that the samples are drawn from the two strata independently.

As explained below, these stratified “hybrid” audits require the specification of some additional parameters: $\lambda_1$ and $\lambda_2$, for dividing the tolerable overstatement error between the strata, and the strata risk limits $\alpha_1$ and $\alpha_2$.

4.1 Partitioning the total permissible overstatement into strata

The simplest approach to stratification involves partitioning the risk limit and the tolerable overstatement error of the tabulation into two pieces, one for the (pooled) CVR counties and one for the (pooled) no-CVR counties. Let $V_{w\ell} > 0$ denote the contest-wide margin (in votes) of reported winner $w$ over reported loser $\ell$. Let $V_{w\ell,s}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ in stratum $s$. Note that $V_{w\ell,s}$ might be negative in one stratum. Let $A_{w\ell}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ that a full hand count of the entire contest would show, that is, the actual margin rather than the reported margin. Reported winner $w$ really beat reported loser $\ell$ if and only if $A_{w\ell} > 0$. Define $A_{w\ell,s}$ to be the actual margin (in votes) of $w$ over $\ell$ in stratum $s$; this too may be negative.

Let $\omega_{w\ell,s} \equiv V_{w\ell,s} - A_{w\ell,s}$ be the overstatement of the margin of $w$ over $\ell$ in stratum $s$. Reported winner $w$ really beat reported loser $\ell$ if and only if $\omega_{w\ell} \equiv \omega_{w\ell,1} + \omega_{w\ell,2} < V_{w\ell}$.

Pick $\lambda_1$ and define $\lambda_2 = 1 - \lambda_1$. These values partition the total tolerable overstatement between the two strata: If $\omega_{w\ell,1} < \lambda_1 V_{w\ell}$ and $\omega_{w\ell,2} < \lambda_2 V_{w\ell}$, candidate $w$ really received more votes than candidate $\ell$. Some pairs $(\lambda_1, \lambda_2)$ can be ruled out a priori, because (for instance) $\omega_{w\ell,s} \in [-2N_s, 2N_s]$, where $N_s$ is the number of ballots cast in stratum $s$. There are other simple, sharper bounds, sketched below.

The choice of $\lambda_1$ (which determines the tolerable overstatement in each stratum), the strata risk limits $\alpha_1$ and $\alpha_2$, and details of the audit procedures affect the workload and the overall risk limit. (See section 4.1.1 and section 8.)

For ballot-level comparison audits, auditing to ensure that $\omega_{w\ell,s} < \lambda_s V_{w\ell}$ is discussed in section 6. It is a minor modification of the method embodied in RLATool.

For ballot-polling audits, auditing to ensure that $\omega_{w\ell,s} < \lambda_s V_{w\ell}$ is discussed in section 7. Note that this requires a more substantial modification of the standard ballot-polling calculations, because the standard calculations consider only the fraction of ballots with a vote for either $w$ or $\ell$ that contain a vote for $w$, while we need to make an inference about the difference between the number of votes for $w$ and the number of votes for $\ell$. This introduces an additional unknown nuisance parameter, the number of ballots with votes for either $w$ or $\ell$.
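As a rough illustration of the partition, the sketch below splits a reported margin between two strata and applies the crude a priori bound that the overstatement in a stratum lies between minus and plus twice the number of ballots cast there. The function names and status labels are hypothetical; the sharper bounds mentioned in the text are not implemented.

```python
def tolerable_overstatements(margin_votes: float, lambda_1: float) -> tuple:
    """Split the tolerable overstatement: (lambda_1 * V, (1 - lambda_1) * V)."""
    lambda_2 = 1.0 - lambda_1
    return lambda_1 * margin_votes, lambda_2 * margin_votes


def null_status(threshold: float, n_ballots: int) -> str:
    """Classify the stratum null 'overstatement >= threshold' a priori.

    Uses only the crude bound that the overstatement in a stratum with
    n_ballots ballots lies in [-2 * n_ballots, 2 * n_ballots].
    """
    bound = 2 * n_ballots
    if threshold > bound:
        return "impossible"   # the null cannot be true; no sampling needed
    if threshold <= -bound:
        return "certain"      # the null must be true; it can never be rejected
    return "open"             # the audit has to decide


# Illustrative numbers: a 1,000-vote margin with three quarters of the
# tolerable overstatement allotted to stratum 1.
t1, t2 = tolerable_overstatements(1000, 0.75)
status_small_stratum = null_status(t1, n_ballots=300)    # 750 > 600
status_large_stratum = null_status(t2, n_ballots=5000)
```

A choice of the split that makes one stratum's null "certain" is useless, since that stratum would always proceed to a full hand count.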

4.1.1 Combining stratum-level risk limits

We audit to test the two hypotheses $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$, $s \in \{1, 2\}$, independently for the two strata. If we reject both hypotheses, we conclude that the contest outcome is correct; otherwise, we manually re-tabulate the contest in one or both strata, depending on the audit rules. Those rules matter: the two audits might need to be conducted to smaller risk limits individually than the desired risk limit for the contest as a whole.

Recall that the samples are drawn independently from the two strata. Pick $\alpha_1, \alpha_2 \in (0, 1)$. (Below we discuss the choice further.) We audit each stratum $s$ to test the hypothesis $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ (the overstatement exceeds the tolerable overstatement) at risk limit $\alpha_s$, as if it were its own election. The audits can be conducted at the same time or sequentially; there is no coordination between the audits unless one of them leads to a full hand count but the other does not: see below.

How do these two stratum-level “risk limits” $\alpha_1$ and $\alpha_2$ determine the overall risk that the audit will not correct the outcome if the outcome is wrong? The overall risk depends on the rule for what we do if the audit in one stratum leads to a full manual tally of that stratum.

Here are the possibilities. Bear in mind that for the outcome to be wrong, at least one stratum must have a net overstatement greater than or equal to its tolerable overstatement: That is, if $\omega_{w\ell} \ge V_{w\ell}$, then $\omega_{w\ell,1} \ge \lambda_1 V_{w\ell}$ or $\omega_{w\ell,2} \ge \lambda_2 V_{w\ell}$, or both. If the tolerable overstatement is exceeded in only one stratum, $s$, then the chance that the stratum will be fully hand counted is at least $1 - \alpha_s$.

If both $\omega_{w\ell,1} \ge \lambda_1 V_{w\ell}$ and $\omega_{w\ell,2} \ge \lambda_2 V_{w\ell}$, then the chance both are completely tabulated by hand is at least $(1 - \alpha_1)(1 - \alpha_2)$, since the audit samples in the two strata are independent.

What should we do if the audit leads to a full tally in one stratum, $s$, that reveals that indeed its tolerable overstatement has been exceeded, but the other audit has not led to a full tabulation, because it has not started, because it is still underway, or because it terminated without a full hand tally? We consider two options. The simpler is to automatically require a full hand count of the other stratum. If the audit uses this rule, then we can take $\alpha_1 = \alpha_2 = \alpha$, where $\alpha$ is the risk limit desired for the contest as a whole, and the procedure will have risk limit $\alpha$. However, this rule creates the possibility of requiring a full hand count in circumstances where it may seem substantively superfluous. For instance, one can imagine an audit of a statewide contest in which the tolerable overstatement in no-CVR counties is exceeded, yet the outcome still could be verified without a full hand count in the CVR counties.

The second approach is to adjust the tolerable overstatement in the other stratum, $s'$, in light of the known manual tally in the stratum that has been fully hand tallied: we will test against the threshold $V_{w\ell} - \omega_{w\ell,s}$, rather than the original value $\lambda_{s'} V_{w\ell}$. (Because the overstatement in stratum $s$ exceeded the tolerable overstatement, the updated tolerable overstatement in stratum $s'$ will be smaller than the original value.) Then to reject the new null hypothesis in stratum $s'$ is to conclude that the overall outcome is correct.

If and when the hypothesis in stratum $s'$ changes, the audit in that stratum might be able to stop on the basis of the data already observed; it might need to continue; or, if it had stopped based on the original threshold $\lambda_{s'} V_{w\ell}$, it might need to examine more ballots, possibly continuing to a full hand tally.

We will now show in detail that this rule allows the contest to be audited at risk limit $\alpha$ by selecting values of $\alpha_1$ and $\alpha_2$ that sum to a bit more than $\alpha$: specifically, such that $(1 - \alpha_1)(1 - \alpha_2) \ge 1 - \alpha$. For instance, suppose we want the overall risk limit to be 5%. If we use a risk limit of 4% in the no-CVR stratum and a risk limit of 1.04% in the CVR stratum, the overall risk limit is not larger than $1 - (1 - 0.04)(1 - 0.0104) \approx 0.04998 < 0.05$.
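The arithmetic of combining stratum-level risk limits can be checked with a short sketch. It assumes the product rule, under which the overall risk limit is one minus the product of one minus each stratum risk limit; that rule reproduces the 4% and 1.04% example in the text. The function names are hypothetical.

```python
def overall_risk_limit(alpha_1: float, alpha_2: float) -> float:
    """Overall risk limit implied by two independent stratum risk limits."""
    return 1.0 - (1.0 - alpha_1) * (1.0 - alpha_2)


def largest_alpha_2(alpha: float, alpha_1: float) -> float:
    """Largest stratum-2 risk limit that keeps the overall limit at alpha.

    Solves (1 - alpha_1) * (1 - alpha_2) = 1 - alpha for alpha_2.
    """
    if not 0.0 <= alpha_1 <= alpha < 1.0:
        raise ValueError("need 0 <= alpha_1 <= alpha < 1")
    return 1.0 - (1.0 - alpha) / (1.0 - alpha_1)


# The example from the text: 4% in the no-CVR stratum, 1.04% in the CVR stratum.
combined = overall_risk_limit(0.04, 0.0104)   # about 0.049984
headroom = largest_alpha_2(0.05, 0.04)        # about 0.010417
```

The second function shows how little is lost by the product rule: with 4% spent in one stratum, the other can use slightly more than the 1% that simple subtraction would allow.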

The statistical wrinkle is that adjusting for the manual tally in the hand-counted stratum changes the hypothesis being tested in the other stratum in a way that is itself random: whether the original null is tested or the new null is tested depends on what the sample reveals in stratum $s$. If the hypothesis does change, there is only one value possible for the new threshold $V_{w\ell} - \omega_{w\ell,s}$, which depends on the reported margin and the count in stratum $s$, but that value is unknown until $\omega_{w\ell,s}$ is known.

We assume that before any data are collected, the audit specifies two families of tests: for each stratum $s$, a family of level-$\alpha_s$ tests of the null hypothesis that the overstatement in the stratum is greater than or equal to $c$, for all feasible values of $c$. That is,

$$\Pr \{ \text{reject } \omega_{w\ell,s} \ge c \} \le \alpha_s \quad \text{whenever } \omega_{w\ell,s} \ge c,$$

for $s = 1, 2$, and all feasible $c$. Moreover, we insist that the test depend on data only from ballots selected from its stratum. Because the samples in the two strata are independent, for all feasible pairs $(c_1, c_2)$,

$$\Pr \{ \text{reject neither } \omega_{w\ell,1} \ge c_1 \text{ nor } \omega_{w\ell,2} \ge c_2 \} \ge (1 - \alpha_1)(1 - \alpha_2) \quad \text{whenever } \omega_{w\ell,1} \ge c_1 \text{ and } \omega_{w\ell,2} \ge c_2.$$

What is the chance that the audit leads to a full hand tabulation if the outcome is incorrect? One way the audit can lead to a full hand tally is if it leads to a full count in one stratum, the null hypothesis in the other stratum is changed, and the audit in the second stratum then proceeds to a full manual tally. (There are other ways the audit can lead to a full hand tally, for instance, if neither null hypothesis is rejected, but this is one way.)

If the outcome is wrong, there is at least one stratum in which the overstatement is at least the threshold $\lambda_s V_{w\ell}$. Let $s$ be one such stratum, and let $s'$ denote the other stratum. Then the chance the audit in stratum $s$ leads to a full manual tally in that stratum is at least $1 - \alpha_s$. If the audit leads to a full manual tally in stratum $s$ and the overall outcome is wrong, then the (new) null hypothesis in the other stratum, $\omega_{w\ell,s'} \ge V_{w\ell} - \omega_{w\ell,s}$, must be true. If we started to audit that new hypothesis ab initio, the chance that we would reject it would be at most $\alpha_{s'}$, so the chance the audit would lead to a full hand count of stratum $s'$ is at least $1 - \alpha_{s'}$. The question is whether “changing hypotheses” could make that chance smaller. The product inequality displayed above shows that it cannot: for any feasible pair of overstatements, $(c_1, c_2)$, if $\omega_{w\ell,1} \ge c_1$ and $\omega_{w\ell,2} \ge c_2$, the chance that neither the hypothesis $\omega_{w\ell,1} \ge c_1$ nor the hypothesis $\omega_{w\ell,2} \ge c_2$ will be rejected is at least $(1 - \alpha_1)(1 - \alpha_2)$.

And therefore, for this procedure, the chance that there will be a full hand count in both strata is at least $(1 - \alpha_1)(1 - \alpha_2)$ if the outcome is incorrect, even if the probability were zero that both of the original audits would proceed to a full hand count. The overall risk limit is thus not larger than $1 - (1 - \alpha_1)(1 - \alpha_2)$.

4.2 Constraining the total overstatement across strata

A more statistically efficient approach to ensuring that the overstatement error in the two strata does not exceed the margin is to try to constrain the sum of the overstatement errors in the two strata, rather than constrain the pieces separately: there are many ways that the total overstatement could be less than $V_{w\ell}$ without having the overstatement in stratum $s$ less than $\lambda_s V_{w\ell}$, $s \in \{1, 2\}$. To that end, imagine all values $\lambda_1$ and $\lambda_2 = 1 - \lambda_1$. If, for all such pairs, we can reject the hypothesis that the overstatement error in stratum 1 is greater than or equal to $\lambda_1 V_{w\ell}$ and the overstatement error in stratum 2 is greater than or equal to $\lambda_2 V_{w\ell}$, then we can conclude that the outcome is correct.

To test the conjunction hypothesis (i.e., that both of those null hypotheses are true), we use Fisher’s combining function. Let $p_s(c)$ be the $p$-value of the hypothesis $\omega_{w\ell,s} \ge c$. If the null hypothesis that $\omega_{w\ell,1} \ge \lambda_1 V_{w\ell}$ and $\omega_{w\ell,2} \ge \lambda_2 V_{w\ell}$ is true, then the combination

$$\chi(\lambda_1, \lambda_2) = -2 \sum_{s=1}^{2} \ln p_s(\lambda_s V_{w\ell})$$

has a probability distribution that is dominated by the chi-square distribution with 4 degrees of freedom. (If the two tests had continuously distributed $p$-values, the distribution would be exactly chi-square with four degrees of freedom, but if either $p$-value has atoms when the null hypothesis is true, it is in general stochastically smaller. This follows from a coupling argument along the lines of Theorem 4.12.3 in Grimmett and Stirzaker (2001).)

Hence, if, for all $\lambda_1$ and $\lambda_2 = 1 - \lambda_1$, the combined statistic $\chi(\lambda_1, \lambda_2)$ is greater than the $1 - \alpha$ quantile of the chi-square distribution with 4 degrees of freedom, the audit can stop.

The calculation of $p_s(\cdot)$ uses the procedures discussed in sections 6 and 7.
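Fisher's combination of two stratum $p$-values needs no statistical library, because the upper tail of the chi-square distribution with 4 degrees of freedom has the closed form $e^{-x/2}(1 + x/2)$. The sketch below is an illustration with hypothetical function names; the stratum $p$-value functions are placeholders supplied by the caller, and checking a finite grid of splits only approximates the requirement that the condition hold for every split.

```python
import math


def fisher_statistic(p_1: float, p_2: float) -> float:
    """Fisher's combining function for two p-values: -2 (ln p_1 + ln p_2)."""
    if not (0.0 < p_1 <= 1.0 and 0.0 < p_2 <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    return -2.0 * (math.log(p_1) + math.log(p_2))


def chi2_4df_upper_tail(x: float) -> float:
    """P(X > x) for X chi-square with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0) if x > 0 else 1.0


def fisher_combined_pvalue(p_1: float, p_2: float) -> float:
    """Conservative p-value for the hypothesis that both stratum nulls hold."""
    return chi2_4df_upper_tail(fisher_statistic(p_1, p_2))


def max_combined_pvalue(p_value_1, p_value_2, margin, lambda_grid) -> float:
    """Largest combined p-value over a grid of splits (lambda, 1 - lambda).

    p_value_1(c) and p_value_2(c) are caller-supplied functions returning
    the p-value of 'overstatement >= c' in each stratum (placeholders here).
    """
    return max(
        fisher_combined_pvalue(p_value_1(lam * margin),
                               p_value_2((1.0 - lam) * margin))
        for lam in lambda_grid
    )


combined_p = fisher_combined_pvalue(0.1, 0.1)
```

Saying that the statistic exceeds the $1 - \alpha$ quantile is the same as saying that the combined $p$-value is below $\alpha$, so under this sketch the audit could stop when the largest combined $p$-value over the splits is no more than the risk limit.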

5 Sampling from subcollections

To audit efficiently a contest that appears on only a fraction of the ballots cast in one or more counties, we must be able to sample from just those ballots (or, at least, from a subset of all ballots that contains every such ballot). Because the CVRs cannot be entirely trusted (otherwise, the audit would be superfluous), we cannot rely on them to determine which ballots contain a given contest. However, if we have independent knowledge of the number of ballots that contain a given contest (e.g., from the SCORE system), then there are methods that allow the sample to be drawn from ballots whose CVRs contain the contest and still limit the risk rigorously. See Benaloh et al. (2011) and Bañuelos and Stark (2012) for details.

6 Batch comparison audits of a tolerable overstatement in votes

In this section we expand previous comparison auditing work (already embodied in RLATool) to handle two new requirements. The first allows the specification of the parameters discussed in section 4. The second handles batch-level auditing.

The first requires us to consider auditing in a single stratum $s$ to test whether the overstatement of any margin (in votes) exceeds some fraction $\lambda_s$ of the overall margin $V_{w\ell}$ between reported winner $w$ and reported loser $\ell$. If the stratum contains all the ballots cast in the contest, then for $\lambda_s = 1$, this would confirm the election outcome. For stratified audits, we might want to test other values of $\lambda_s$, as described above.

In Colorado, comparison audits have been ballot-level (i.e., batches consisting of a single ballot). This section also addresses the second requirement by deriving a method for batches of arbitrary size, which might be useful for Colorado to audit contests that include CVR counties and legacy counties. We keep the a priori error bounds tighter than the “super-simple” method (Stark, 2010). To keep the notation simpler, we consider only a single contest, but the MACRO test statistic (Stark, 2009b, 2010) automatically extends the result to auditing multiple contests simultaneously. The derivation is for plurality contests, including “vote-for-$k$” plurality contests. Majority and super-majority contests are a minor modification (Stark, 2008). So are some forms of preferential and approval voting, such as Borda count, and proportional representation contests, such as D’Hondt (Stark and Teague, 2014); changes for IRV/STV are more complicated.

6.1 Notation

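  • $\mathcal{W}$: the set of reported winners of the contest

  • $\mathcal{L}$: the set of reported losers of the contest

  • $N_s$: number of ballots cast in all in stratum $s$. (The contest might not appear on all ballots.)

  • $P$: number of “batches” of ballots in stratum $s$. A batch contains one or more ballots. Every ballot in stratum $s$ is in exactly one batch.

  • $n_p$: number of ballots in batch $p$. $N_s = \sum_{p=1}^{P} n_p$.

  • $v_{pi} \ge 0$: the reported votes for candidate $i$ in batch $p$

  • $a_{pi} \ge 0$: actual votes for candidate $i$ in batch $p$. If the contest does not appear on any ballot in batch $p$, then $a_{pi} = 0$.

  • $V_{w\ell,s} \equiv \sum_{p=1}^{P} (v_{pw} - v_{p\ell})$: Reported margin in stratum $s$ of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$, in votes.

  • $V_{w\ell}$: Overall reported margin of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$, in votes, for the entire contest (not just stratum $s$)

  • $V$: smallest reported overall margin between any reported winner and reported loser: $V \equiv \min_{w \in \mathcal{W}, \ell \in \mathcal{L}} V_{w\ell}$

  • $A_{w\ell,s} \equiv \sum_{p=1}^{P} (a_{pw} - a_{p\ell})$: actual margin in the stratum of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$, in votes

  • $A_{w\ell}$: actual margin of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$, in votes, for the entire contest (not just in stratum $s$)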

6.2 Reduction to maximum relative overstatement

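If the contest is entirely contained in stratum $s$, then the reported winners of the contest are the actual winners if

$$\min_{w \in \mathcal{W}, \ell \in \mathcal{L}} A_{w\ell,s} > 0.$$

Here, we address the case that the contest may include a portion outside the stratum. To combine independent samples in different strata, it is convenient to be able to test whether the net overstatement error in a stratum exceeds a given threshold.

Instead of testing that condition directly, we will test a condition that is sufficient but not necessary for the inequality to hold, to get a computationally simple test that is still conservative (i.e., the risk is not larger than its nominal value).

For every winner, loser pair $(w, \ell)$, we want to test whether the overstatement error exceeds some threshold, generally one tied to the reported margin between $w$ and $\ell$. For instance, for a simple stratified audit, we might take the threshold to be $\lambda_s V_{w\ell}$.

We want to test whether

$$\max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \sum_{p=1}^{P} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}} \ge \lambda_s.$$

The maximum of sums is not larger than the sum of the maxima; that is,

$$\max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \sum_{p=1}^{P} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}} \le \sum_{p=1}^{P} \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}}.$$

Define the maximum relative overstatement in batch $p$ to be $e_p \equiv \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \left[ (v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell}) \right] / V_{w\ell}$. Then no reported margin is overstated by a fraction $\lambda_s$ or more if

$$E \equiv \sum_{p=1}^{P} e_p < \lambda_s.$$

Thus if we can reject the hypothesis $E \ge \lambda_s$, we can conclude that no pairwise margin was overstated by as much as a fraction $\lambda_s$.

Testing whether $E \ge \lambda_s$ would require a very large sample if we knew nothing at all about $e_p$ without auditing batch $p$: a single large value of $e_p$ could make $E$ arbitrarily large. But there is an a priori upper bound for $e_p$. Whatever the reported votes $v_{pi}$ are in batch $p$, we can find the potential values of the actual votes $a_{pi}$ that would make the error $e_p$ largest, because $a_{pi}$ must be between 0 and $n_p$, the number of ballots in batch $p$:

$$\frac{(v_{pw} - a_{pw}) - (v_{p\ell} - a_{p\ell})}{V_{w\ell}} \le \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}}, \quad \text{so} \quad e_p \le \tilde{u}_p \equiv \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}}.$$

Knowing that $e_p \le \tilde{u}_p$ might let us conclude reliably that $E < \lambda_s$ by examining only a small number of batches, depending on the values $\tilde{u}_p$ and on the values of $e_p$ for the audited batches.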

To make inferences about , it is helpful to work with the taint . Define . Suppose we draw batches at random with replacement, with probability of drawing batch in each draw, . (Since , these are all positive numbers, and they sum to 1, so they define a probability distribution on the batches.)

Let be the value of for the batch selected in the th draw. Then are IID, , and

Thus .

So, if we have strong evidence that , we have strong evidence that .

This approach can be simplified even further by noting that has a simple upper bound that does not depend on . At worst, the reported result for batch shows votes for the “least-winning” apparent winner of the contest with the smallest margin, but a hand interpretation would show that all ballots in the batch had votes for the runner-up in that contest. Since and ,

Thus if we use in lieu of , we still get conservative results. (We also need to re-define to be the sum of those upper bounds.) An intermediate, still conservative approach would be to use this upper bound for batches that consist of a single ballot, but use the sharper bound (4) when . Regardless, for the new definition of and , are IID, , and

So, if we have evidence that , we have evidence that .
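The quantities in this reduction can be computed directly from batch subtotals. The following is a minimal sketch, not the authors' implementation: the two-batch contest, the candidate names, and the function names `error_bound` and `batch_error` are all hypothetical. Here `u` holds the a priori per-batch error bounds, `U` is their sum, and the taint is a batch's observed error divided by its bound.

```python
from fractions import Fraction

def error_bound(reported, n_ballots, pairs, margins):
    """A priori bound on a batch's error: max over (winner, loser) pairs of
    (reported winner votes - reported loser votes + ballots in batch) / overall margin."""
    return max(
        Fraction(reported[w] - reported[l] + n_ballots, margins[(w, l)])
        for (w, l) in pairs
    )

def batch_error(reported, actual, pairs, margins):
    """Observed error of an audited batch: the largest overstatement of any
    pairwise margin, as a fraction of that pair's overall reported margin."""
    return max(
        Fraction(reported[w] - actual[w] - reported[l] + actual[l], margins[(w, l)])
        for (w, l) in pairs
    )

# Hypothetical contest: A is the reported winner, B the reported loser.
pairs = [("A", "B")]
margins = {("A", "B"): 30}          # overall reported margin in votes
batches = [
    {"n": 100, "reported": {"A": 60, "B": 40}},
    {"n": 50, "reported": {"A": 30, "B": 20}},
]

u = [error_bound(b["reported"], b["n"], pairs, margins) for b in batches]
U = sum(u)
draw_prob = [u_p / U for u_p in u]  # sampling probability of each batch

# Suppose a hand count of batch 0 finds A: 58, B: 42 (a 4-vote overstatement).
e_0 = batch_error(batches[0]["reported"], {"A": 58, "B": 42}, pairs, margins)
taint_0 = e_0 / u[0]

# Cruder bound that ignores the reported subtotals: 2 * n_p / V.
crude = [Fraction(2 * b["n"], min(margins.values())) for b in batches]

print(u, U, draw_prob, e_0, taint_0, crude)
```

Exact rational arithmetic (`Fraction`) is used so that the sampling probabilities sum to exactly 1; floating point would serve equally well in practice.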

6.3 Testing

To test whether , there are a variety of methods available. One particularly “clean” sequential method is based on Wald’s Sequential Probability Ratio Test (SPRT) (Wald (1945)). Harold Kaplan pointed out this method on a website that no longer exists. A derivation of this “Kaplan-Wald” method is given in Stark and Teague (2014, Appendix A); to apply the method here, take in their equation 18.

A different sequential method, the Kaplan-Markov method (also due to Harold Kaplan), is given in Stark (2009a).
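As an illustration of how such a sequential test can be computed, the sketch below adapts the Kaplan-Markov idea to a threshold on the mean taint. It is a sketch under stated assumptions rather than the formula from the cited papers: it takes the null hypothesis to be that the mean taint is at least `lam / U`, where `lam` is the tolerable overstatement fraction for the stratum and `U` is the sum of the batch error bounds, and the function name is hypothetical. Under that null, the variables 1 minus taint are nonnegative with mean at most `1 - lam / U`, so the running product of `(1 - lam / U) / (1 - taint)` is a valid sequential $p$-value by Markov's inequality applied to a nonnegative supermartingale.

```python
import math

def kaplan_markov_p(taints, lam, U):
    """Sequentially valid p-value for the null 'mean taint >= lam / U'.

    taints: observed taints, in the order the batches were drawn (each <= 1).
    lam:    tolerable overstatement as a fraction of the margin (assumed name).
    U:      sum of the a priori batch error bounds; must exceed lam.
    """
    if U <= lam:
        raise ValueError("U must exceed lam for the test to be informative")
    null_bound = 1 - lam / U        # upper bound on E(1 - T) under the null
    running = 1.0
    p_value = 1.0
    for t in taints:
        if t >= 1:
            running = math.inf      # a fully tainted draw: product can never recover
        else:
            running *= null_bound / (1 - t)
        p_value = min(p_value, running)   # smallest value attained so far
    return p_value

# With no errors found, each draw multiplies the p-value by (1 - lam / U).
print(kaplan_markov_p([0] * 29, lam=1, U=10))
```

With `lam=1` and `U=10`, 29 error-free draws bring the value below 0.05, while 28 do not.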

7 Ballot-polling audits of a tolerable overstatement in votes

7.1 Conditional tri-hypergeometric test

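We consider a single stratum $s$, containing $N_s$ ballots. We will sample individual ballots without replacement from stratum $s$. Of the $N_s$ ballots, $A_{w,s}$ have a vote for $w$ but not for $\ell$, $A_{\ell,s}$ have a vote for $\ell$ but not for $w$, and $A_{u,s} = N_s - A_{w,s} - A_{\ell,s}$ have votes for both $w$ and $\ell$ or neither $w$ nor $\ell$, including undervotes and invalid ballots. We might draw a simple random sample of $n$ ballots ($n$ fixed ahead of time), or we might draw sequentially without replacement, so the sample size $B$ could be random. For instance, the rule for determining $B$ could depend on the data. (Sampling with replacement leads to simpler arithmetic, but is not as efficient.)

Regardless, we assume that, conditional on the attained sample size $n$, the ballots are a simple random sample of size $n$ from the $N_s$ ballots in the population. In the sample, $B_w$ ballots contain a vote for $w$ but not $\ell$, with $B_\ell$ and $B_u$ defined analogously. Then, conditional on $B = n$, the joint distribution of $(B_w, B_\ell, B_u)$ is tri-hypergeometric:

$$\mathbb{P}\{B_w = i, B_\ell = j \mid B = n\} = \frac{\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{j} \binom{N_s - A_{w,s} - A_{\ell,s}}{n - i - j}}{\binom{N_s}{n}}.$$

The test statistic will be the diluted sample margin, $D \equiv (B_w - B_\ell)/B$. This is the sample difference in the number of ballots for the winner and for the loser, divided by the total number of ballots in the sample. We want to test the compound hypothesis $A_{w,s} - A_{\ell,s} \le c$. The value of $c$ is inferred from the definition $\omega_{w\ell,s} \equiv V_{w\ell,s} - A_{w\ell,s} = V_{w\ell,s} - (A_{w,s} - A_{\ell,s})$. Thus,

$$c = V_{w\ell,s} - \omega_{w\ell,s} = V_{w\ell,s} - \lambda_s V_{w\ell}.$$

The alternative is the compound hypothesis $A_{w,s} - A_{\ell,s} > c$. (To use Wald’s Sequential Probability Ratio Test, we might pick a simple alternative instead, e.g., $A_{w,s} = V_{w,s}$ and $A_{\ell,s} = V_{\ell,s}$, the reported values of those counts, provided $V_{w,s} - V_{\ell,s} > c$.) Hence, we will reject for large values of $D$. Conditional on $B = n$, the event $D = (B_w - B_\ell)/B = d$ is the event $B_w - B_\ell = nd$.

Suppose we observe $D = d$. The test will condition on the event $B = n$. (In contrast, the BRAVO ballot-polling method (Lindeman et al., 2012) conditions only on $B_w + B_\ell = m$.)

The $p$-value of the simple hypothesis that there are $A_{w,s}$ ballots with a vote for $w$ but not for $\ell$, $A_{\ell,s}$ ballots with a vote for $\ell$ but not for $w$, and $N_s - A_{w,s} - A_{\ell,s}$ ballots with votes for both $w$ and $\ell$ or neither $w$ nor $\ell$ (including undervotes and invalid ballots) is the sum of these probabilities for events when $B_w - B_\ell \ge nd$. Therefore,

$$\mathbb{P}\{D \ge d \mid B = n\} = \sum_{\substack{(i,j):\ i, j \ge 0 \\ i - j \ge nd \\ i + j \le n}} \frac{\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{j} \binom{N_s - A_{w,s} - A_{\ell,s}}{n - i - j}}{\binom{N_s}{n}}.$$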
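A direct way to evaluate this tail probability for a simple hypothesis is to enumerate the possible sample counts. The sketch below is illustrative only; the function and argument names are hypothetical. It takes the hypothesized numbers of ballots for the winner only (`A_w`) and for the loser only (`A_l`) among `N` ballots, and the observed sample counts `b_w` and `b_l` in a sample of size `n`, and sums the tri-hypergeometric probabilities of all samples whose winner-minus-loser difference is at least the observed one.

```python
from math import comb

def trihypergeometric_p(b_w, b_l, n, A_w, A_l, N):
    """Conditional p-value: probability, given sample size n, that the sample
    difference (winner-only minus loser-only ballots) is at least b_w - b_l
    when the population holds A_w winner-only and A_l loser-only ballots."""
    A_u = N - A_w - A_l                 # ballots for both or neither
    if A_u < 0:
        raise ValueError("A_w + A_l cannot exceed N")
    observed_diff = b_w - b_l
    favorable = 0
    for i in range(0, min(n, A_w) + 1):
        for j in range(0, min(n - i, A_l) + 1):
            if i - j >= observed_diff:
                # math.comb returns 0 when the sample asks for more
                # "other" ballots than the population contains
                favorable += comb(A_w, i) * comb(A_l, j) * comb(A_u, n - i - j)
    return favorable / comb(N, n)

# Tiny example: 10 ballots (4 winner-only, 3 loser-only, 3 other), sample of 2.
# Only the sample (2 winner-only, 0 loser-only) has a difference of at least 2:
# C(4,2) / C(10,2) = 6 / 45.
print(trihypergeometric_p(b_w=2, b_l=0, n=2, A_w=4, A_l=3, N=10))
```

The double loop costs on the order of `n` squared binomial evaluations, which is workable for the sample sizes typical of an audit round; exact integer arithmetic avoids underflow until the final division.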
7.2 Conditional hypergeometric test

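Another approach is to condition on both the events $B = n$ and $B_w + B_\ell = m$, where $B_w$ and $B_\ell$ are the numbers of sampled ballots with a vote for $w$ but not $\ell$ and for $\ell$ but not $w$, respectively. We describe the hypothesis test here, but do not advocate for using it. We found that this approach was inefficient in some simulation experiments.

Given $B = n$, all samples of size $n$ from the ballots are equally likely, by hypothesis. Hence, in particular, all samples of size $n$ for which $B_w + B_\ell = m$ are equally likely. There are $\binom{A_{w,s} + A_{\ell,s}}{m} \binom{N_s - A_{w,s} - A_{\ell,s}}{n - m}$ such samples. Among these samples, $B_w$ may take values $i = 0, 1, \ldots, m$. For a fixed $i$, there are $\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{m - i} \binom{N_s - A_{w,s} - A_{\ell,s}}{n - m}$ samples with $B_w = i$ and $B_\ell = m - i$.

The factor $\binom{N_s - A_{w,s} - A_{\ell,s}}{n - m}$ counts the number of ways to sample $n - m$ of the remaining ballots. If we divide out this factor, we simply count the number of ways to sample ballots from the group of ballots for $w$ or for $\ell$. There are $\binom{A_{w,s} + A_{\ell,s}}{m}$ equally likely samples of size $m$ from the ballots with either a vote for $w$ or for $\ell$, but not both, and of these samples, $\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{m - i}$ contain $i$ ballots with a vote for $w$ but not $\ell$. Therefore, conditional on $B = n$ and $B_w + B_\ell = m$, the probability that $B_w = i$ is

$$\frac{\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{m - i}}{\binom{A_{w,s} + A_{\ell,s}}{m}}.$$

The $p$-value of the simple hypothesis that there are $A_{w,s}$ ballots with a vote for $w$ but not for $\ell$, $A_{\ell,s}$ ballots with a vote for $\ell$ but not for $w$, and $N_s - A_{w,s} - A_{\ell,s}$ ballots with votes for both $w$ and $\ell$ or neither $w$ nor $\ell$ (including undervotes and invalid ballots) is the sum of these probabilities for events when $B_w - B_\ell \ge nd$. This event occurs for $B_w \ge (m + nd)/2$. Therefore,

$$\mathbb{P}\{D \ge d \mid B = n, B_w + B_\ell = m\} = \sum_{i = \lceil (m + nd)/2 \rceil}^{m} \frac{\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{m - i}}{\binom{A_{w,s} + A_{\ell,s}}{m}}.$$

This conditional $p$-value is thus the tail probability of the hypergeometric distribution with parameters $A_{w,s}$ “good” items, $A_{\ell,s}$ “bad” items, and a sample of size $m$. This calculation is numerically stable and fast; tail probabilities of the hypergeometric distribution are available and well-tested in all standard statistics software.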
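For illustration, the hypergeometric tail can also be computed with nothing beyond the standard library. This is a minimal sketch with hypothetical names: `A_w` and `A_l` are the hypothesized numbers of winner-only and loser-only ballots in the stratum, and `b_w` and `b_l` are the observed sample counts. Given that the sample contains `m = b_w + b_l` such ballots, a sample difference at least as large as the observed one is the same event as at least `b_w` winner-only ballots.

```python
from math import comb

def hypergeometric_p(b_w, b_l, A_w, A_l):
    """Conditional p-value given the number m = b_w + b_l of sampled ballots
    for exactly one of the two candidates: P{at least b_w of the m are
    winner-only} under A_w 'good' and A_l 'bad' items."""
    m = b_w + b_l
    if m > A_w + A_l:
        raise ValueError("sample has more such ballots than the hypothesis allows")
    total = comb(A_w + A_l, m)
    # comb(A_w, i) is 0 for i > A_w, so the upper limit m is safe.
    tail = sum(comb(A_w, i) * comb(A_l, m - i) for i in range(b_w, m + 1))
    return tail / total

# 4 winner-only and 3 loser-only ballots; both sampled ballots favor the winner:
# C(4,2) / C(7,2) = 6 / 21.
print(hypergeometric_p(b_w=2, b_l=0, A_w=4, A_l=3))
```

Note that the result does not depend on the overall sample size or on the number of other ballots, which is what conditioning on both events buys.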

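7.3 Maximizing the $p$-value over the null set

The composite null hypothesis does not specify $A_{w,s}$ or $A_{\ell,s}$ separately, only that $A_{w,s} - A_{\ell,s} \le c$ for some fixed, known $c$. The (conditional) $p$-value of this composite hypothesis for $D = d$ is the maximum $p$-value for all values $(A_{w,s}, A_{\ell,s})$ that are possible under the null hypothesis,

$$\max_{\substack{A_{w,s}, A_{\ell,s} \in \{0, 1, \ldots, N_s\}: \\ A_{w,s} - A_{\ell,s} \le c, \\ A_{w,s} + A_{\ell,s} \le N_s}} \ \sum_{\substack{(i,j):\ i, j \ge 0 \\ i - j \ge nd \\ i + j \le n}} \frac{\binom{A_{w,s}}{i} \binom{A_{\ell,s}}{j} \binom{N_s - A_{w,s} - A_{\ell,s}}{n - i - j}}{\binom{N_s}{n}},$$

wherever the summand is defined. (Equivalently, define $\binom{m}{k} \equiv 0$ if $k > m$, $k < 0$, or $m \le 0$.)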

7.3.1 Optimizing over the parameter

The following result enables us to only test hypotheses along the boundary of the null set.

Theorem 1.

Assume that . Suppose the composite null hypothesis is . The -value is maximized on the boundary of the null region, i.e. when .


Without loss of generality, let and assume that is fixed. Let be the fixed, unknown number of ballots for or for in stratum . The -value for the simple hypothesis that is


where is defined as the term in the summand and for pairs that don’t appear in the summation.

Assume that is given. The -value for this simple hypothesis is

Terms in the fraction can be simplified: choose the corresponding pairs in the numerator and denominator. Fractions of the form can be expressed as . Fractions of the form can be expressed as . Thus, the -value can be written as

The last inequality follows from the fact that and are nonnegative, and that – it is a possible outcome under the null hypothesis.

7.3.2 Optimizing over the parameter

We have shown empirically (but do not prove) that this tail probability, as a function of , has a unique maximum at one of the endpoints when is either as small or as large as possible, given , , and the observed sample values and . If the empirical result is true, then finding the maximum is trivial; otherwise, it is a trivial one-dimensional optimization problem to compute the unconditional -value.
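The one-dimensional search can be sketched as a brute-force scan. The sketch assumes, per Theorem 1 and under that theorem's conditions, that only the boundary of the null set needs to be examined, and it checks every feasible split along that boundary rather than relying on the empirical endpoint observation. All names are hypothetical; `c` is the hypothesized maximum difference between winner-only and loser-only ballots, and `N` is the number of ballots in the stratum.

```python
from math import comb

def trihypergeometric_p(b_w, b_l, n, A_w, A_l, N):
    """Tail probability of the sample difference under a simple hypothesis."""
    A_u = N - A_w - A_l
    observed_diff = b_w - b_l
    favorable = 0
    for i in range(0, min(n, A_w) + 1):
        for j in range(0, min(n - i, A_l) + 1):
            if i - j >= observed_diff:
                favorable += comb(A_w, i) * comb(A_l, j) * comb(A_u, n - i - j)
    return favorable / comb(N, n)

def max_p_on_boundary(b_w, b_l, n, c, N):
    """Largest p-value over simple hypotheses with A_w - A_l == c.

    Returns (p_value, A_w, A_l) for the maximizing split."""
    best = (0.0, None, None)
    A_l = max(0, -c)                    # smallest A_l keeping A_w = A_l + c >= 0
    while 2 * A_l + c <= N:             # A_w + A_l cannot exceed N
        A_w = A_l + c
        p = trihypergeometric_p(b_w, b_l, n, A_w, A_l, N)
        if p > best[0]:
            best = (p, A_w, A_l)
        A_l += 1
    return best

# 10 ballots, sample of 2 with both ballots for the winner, null: A_w - A_l <= 0.
print(max_p_on_boundary(b_w=2, b_l=0, n=2, c=0, N=10))
```

In this toy case the maximum, 10/45, occurs at the largest feasible split (5 ballots for each candidate), consistent with the endpoint behavior described above. The scan costs one tail evaluation per candidate split, so for large strata a search restricted to the endpoints, or a coarser grid, would be the practical choice.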

7.4 Conditional testing

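If the conditional tests are always conducted at significance level $\alpha$ or less, i.e., so that $\mathbb{P}\{\text{Type I error} \mid B = n\} \le \alpha$ for every attainable sample size $n$, then the overall procedure has significance level $\alpha$ or less:

$$\mathbb{P}\{\text{Type I error}\} = \sum_{n} \mathbb{P}\{\text{Type I error} \mid B = n\} \, \mathbb{P}\{B = n\} \le \sum_{n} \alpha \, \mathbb{P}\{B = n\} = \alpha.$$
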
In particular, this implies that our conditional hypergeometric test will have the correct risk limit unconditionally.

8 Recommendations

We have outlined several methods Colorado might use to audit cross-jurisdictional contests that include CVR counties and no-CVR counties. We expect that stratified “hybrid” audits will be the most palatable, given the constraints on time for software development and the logistics of the audit itself, because the workflow for counties would be the same as it was in November 2017.

What would change is the risk calculation “behind the scenes,” including the algorithms used to decide when the audit can stop. Those algorithms could be implemented in software external to RLATool. The minimal modification to RLATool that would be required to conduct a hybrid audit is to allow the sample size from each county to be controlled externally, e.g., by uploading a parameter file once per round, rather than using a formula that is based on the margin within that county alone. The parameter file would be generated by external software that does the audit calculations described here based on the detailed audit progress and discrepancy data available from RLATool’s rla_export command.

To conduct a hybrid audit, one must choose two numbers in addition to the risk limit $\alpha$:

  • one stratum-wise risk limit, $\alpha_1$ (the other, $\alpha_2$, is determined from $\alpha_1$ and the overall risk limit, $\alpha$)

  • the tradeoff (allocation) of the tolerable overstatement between strata, $\lambda_1$ (the value of $\lambda_2$ is $1 - \lambda_1$)

Those parameters can be chosen essentially arbitrarily (provided ) and the audit will still be risk-limiting; however, they can be optimized to reduce the expected workload under various assumptions about tabulation errors in the two strata. (Software that can be used to run scenarios is available at; see below.)

In either stratum, increasing the risk limit or increasing the tolerable overstatement will decrease the required sample size from that stratum (assuming that the actual overstatement in that stratum is less than its allowable overstatement). The relative change in sample size as the risk limit changes scales similarly in the two strata, because the risk limit enters both ballot-level comparisons and ballot-polling the same way: as the logarithm. However, the relative change in sample sizes as the tolerable overstatement changes scales quite differently in the two strata: linearly in the ballot-level comparison stratum, but quadratically in the ballot-polling stratum. Hence, the workload is not as sensitive to how the risk limit is allocated across strata as it is to how the tolerable overstatement is allocated.
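The scaling argument can be made concrete with rough sample-size approximations. These formulas are assumptions of this sketch, not the calculations used by the audit software: for an error-free ballot-level comparison audit, roughly 2 ln(1/α) divided by the tolerable overstatement, and for ballot polling, the familiar small-margin approximation of roughly 2 ln(1/α) divided by the square of the relevant margin. Here `tol` is the tolerable overstatement expressed as a fraction of the ballots in the stratum; the function names are hypothetical.

```python
import math

def comparison_size(alpha, tol):
    """Rough sample size for an error-free ballot-level comparison audit
    (assumed approximation: inversely proportional to tol)."""
    return 2 * math.log(1 / alpha) / tol

def polling_size(alpha, tol):
    """Rough sample size for ballot polling
    (assumed approximation: inversely proportional to tol squared)."""
    return 2 * math.log(1 / alpha) / tol ** 2

# Halving the tolerable overstatement: about 2x for comparison, 4x for polling.
tol_effect_comparison = comparison_size(0.05, 0.01) / comparison_size(0.05, 0.02)
tol_effect_polling = polling_size(0.05, 0.01) / polling_size(0.05, 0.02)

# Halving the risk limit changes both sample sizes by the same factor,
# ln(1/0.025) / ln(1/0.05), about 1.23.
risk_effect_comparison = comparison_size(0.025, 0.02) / comparison_size(0.05, 0.02)
risk_effect_polling = polling_size(0.025, 0.02) / polling_size(0.05, 0.02)

print(tol_effect_comparison, tol_effect_polling)
print(risk_effect_comparison, risk_effect_polling)
```

Under these approximations, moving tolerable overstatement toward the ballot-polling stratum pays off faster than moving risk toward it, which is why the allocation of overstatement is the more consequential choice.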

8.1 Software and examples

Examples of stratified hybrid audits are in Jupyter notebooks available at The first two examples are contained in a single notebook, “hybrid-audit-example-1”. The first example is a hypothetical medium-sized election with a total of votes and a diluted margin of . 9.1% of the ballots come from no-CVR counties. The risk limit is 10%. If the audit in the CVR stratum found no errors and the allowable overstatement error was 30% of the margin, it would terminate after examining 1,213 ballots. In over 90% of 10,000 simulations, an audit of 250 ballots from the no-CVR stratum would have sufficed to confirm that the overstatement error in that stratum did not exceed its allocation, 70% of the margin. A sample of 450 ballots was sufficient to stop the audit in 99% of simulations. As always, could be adjusted to rebalance the expected workload between strata, perhaps taking into account the expected workload for audits of countywide or intra-county contests, so as to minimize (or quite possibly eliminate) any additional burden imposed by the stratified audit.

If a CVR were available for all counties and we could have run a ballot-level comparison audit for the entire contest, rather than stratifying, an audit with risk limit 10% that found no errors would have concluded after examining just 263 ballots. The efficiency gained comes from two sources. First, ballot-level comparison audits are substantially more efficient than ballot-polling audits. Second, the hybrid audit requires dividing the margin and risk limit between two strata. This results in both strata using smaller risk limits. In order to keep the workload low, it is necessary to allocate a disproportionately high fraction of the margin to the no-CVR stratum; the CVR stratum must increase its workload to compensate.

Another method discussed in Section 2.3 is to perform a ballot-level comparison audit statewide, but to treat any ballot sampled from the no-CVR county as showing a two-vote overstatement. In this example, this worst-case method would lead to a full hand count. However, the situation may be more optimistic for Colorado: if only 1.2% of ballots came from the no-CVR stratum and the overall margin were in fact 10,000 votes, then this method would require checking 430 ballots.

The second example is a hypothetical large statewide election with a total of 2 million ballots and a diluted margin of nearly . The risk limit is 5%. If the audit in the CVR stratum found no errors and the allowable overstatement error was 10% of the margin, it would terminate after examining 50 ballots. In over 90% of 10,000 simulations, an audit of 50 ballots from the no-CVR stratum would have sufficed to confirm that the overstatement error in that stratum did not exceed its allocation, 90% of the margin. A sample of 100 ballots was sufficient to stop the audit in 99% of simulations. If a CVR were available for all counties and we could have run a ballot-level comparison audit for the entire contest, rather than stratifying, an audit with risk limit 5% that found no errors would have concluded after examining just 24 ballots.

A second notebook, “hybrid-audit-example-2,” illustrates the workflow for conducting a hybrid audit of this kind. The example election has a total of 2 million ballots. The reported margin is just over , but in reality the vote totals for the reported winner and reported loser are identical in both strata. The risk limit is 5%. The example illustrates two scenarios. In the first scenario, the audit in the CVR stratum escalates to a full hand count and the allowable overstatement in the no-CVR stratum must be adjusted. Using the new allowable overstatement in the no-CVR stratum makes it impossible to terminate the audit, even for samples as large as 5% of the ballots. In the second scenario, the audit in the no-CVR stratum terminates with a sample of 500 ballots. However, the audit in the CVR stratum will still lead to a full hand count and the audit in the no-CVR stratum must be redone using the adjusted allowable overstatement, putting us back in the first scenario. In both cases, the audit leads to a full recount of all the ballots.

These notebooks can be modified and run with different contest sizes, margins, risk limits, and allocations of allowable error, in order to estimate the workload of different scenarios.