Risk-Limiting Audits by Stratified Union-Intersection Tests of Elections (SUITE)

by Kellie Ottoboni, et al.

Risk-limiting audits (RLAs) offer a statistical guarantee: if a full manual tally of the paper ballots would show that the reported election outcome is wrong, an RLA has a known minimum chance of leading to a full manual tally. RLAs generally rely on random samples. Stratified sampling (partitioning the population of ballots into disjoint strata and sampling independently from the strata) may simplify logistics or increase efficiency compared to simpler sampling designs, but makes risk calculations harder. We present SUITE, a new method for conducting RLAs using stratified samples. SUITE considers all possible partitions of outcome-changing error across strata. For each partition, it combines P-values from stratum-level tests into a combined P-value; there is no restriction on the tests used in different strata. SUITE maximizes the combined P-value over all partitions of outcome-changing error. The audit can stop if that maximum is less than the risk limit. Voting systems in some Colorado counties (comprising 98.2% of voters) allow auditors to check how the system interpreted each ballot, which allows ballot-level comparison RLAs. Other counties use ballot polling, which is less efficient. Extant approaches to conducting an RLA of a statewide contest would require major changes to Colorado's procedures and software, or would sacrifice the efficiency of ballot-level comparison. SUITE does not. It divides ballots into two strata: those cast in counties that can conduct ballot-level comparisons, and the rest. Stratum-level P-values are found by methods derived here. The resulting audit is substantially more efficient than statewide ballot polling. SUITE is useful in any state with a mix of voting systems or that uses stratified sampling for other reasons. We provide an open-source reference implementation and exemplar calculations in Jupyter notebooks.






1 Introduction

A risk-limiting audit (RLA) of an election contest is a procedure that has a known minimum chance of leading to a full manual tally of the ballots if the electoral outcome according to that tally would differ from the reported outcome. Outcome means the winner(s) (or, for instance, whether there is a runoff)—not the numerical vote totals. RLAs require a durable, voter-verifiable record of voter intent, such as paper ballots, and they assume that this audit trail is sufficiently complete and accurate that a full hand tally would show the true electoral outcome. That assumption is not automatically satisfied: a compliance audit [16] is required to check whether the paper trail is trustworthy.

Current methods for risk-limiting audits are generally sequential hypothesis testing procedures: they examine more ballots, or batches of ballots, until either (i) there is strong statistical evidence that a full hand tabulation would confirm the outcome, or (ii) the audit has led to a full hand tabulation, the result of which should become the official result.

RLAs have been conducted in California, Colorado, Indiana, Ohio, Virginia, and Denmark, and are required by law in Colorado (CRS 1-7-515) and Rhode Island (SB 413A and HB 5704A).

The most efficient and transparent sampling design for risk-limiting audits selects individual ballots uniformly at random, with or without replacement [13]. Risk calculations for such samples can be made simple without sacrificing rigor [14, 6]. However, auditing contests that cross jurisdictional boundaries with this design requires coordinating sampling in different counties, and may require different counties to use a lowest-common-denominator method for assessing risk from the sample, which would not take full advantage of the capabilities of some voting systems. For instance, any system that uses paper ballots as the official record can conduct ballot-polling audits, while ballot-level comparison audits require systems to generate cast vote records that can be checked manually against a human reading of the paper [5, 6]. (These terms are described in Section 3.)

Stratified RLAs have been considered previously, primarily to conform with legacy audit laws under which counties draw audit samples independently of each other, but also to allow auditors to start the audit before all vote-by-mail or provisional ballots have been tallied, by sampling independently from ballots cast in person, by mail, and provisionally, as soon as subtotals for each group are available [9, 4]. However, extant methods address only a single approach to auditing, batch-level comparisons, and only a particular test statistic.

Here, we introduce SUITE, a more general approach to conducting RLAs using stratified samples. SUITE is a twist on intersection-union tests [7], which represent the null hypothesis as the intersection of a number of simpler hypotheses, and the alternative hypothesis as a union of their alternatives. In contrast, here, the null is the union of simpler hypotheses, and the alternative is the intersection of their alternatives. The approach involves finding the maximum $P$-value over a vector of nuisance parameters that describe the simple hypotheses, namely all allocations of tabulation error across strata for which a full count would find a different electoral outcome than was reported. (A nuisance parameter is a property of the population that is not of direct interest, but that affects the probability distribution of the data. Overstatement is error that made the margin of one or more winners over one or more losers appear larger than it really was. The total overstatement across strata determines whether the reported outcome is correct; the overstatements in individual strata are nuisance parameters that affect the distribution of the audit sample.)

The basic building block for the method is testing whether the overstatement error in a single stratum is greater than or equal to a quota. Fisher’s combining function is used to merge $P$-values for tests in different strata into a single $P$-value for the hypothesis that the overstatement in every stratum is greater than or equal to its quota. If that hypothesis can be rejected for all stratum-level quotas that could change the outcome (that is, if the maximum combined $P$-value is sufficiently small), the audit can stop.

It is not actually necessary to consider all possible quotas: the $P$-value involves a sum of monotonic functions, which allows us to find upper and lower bounds everywhere using only values on a discrete grid. We present a numerical procedure, implemented in Python, to find bounds on the maximum $P$-value when there are two strata. The procedure can be generalized to more than two strata.

Section 2 presents the new approach to stratified auditing. Section 3 illustrates the method by solving a problem pertinent to Colorado: combining ballot polling in one stratum with ballot-level comparisons in another. This requires straightforward modifications to the mathematics behind ballot polling and ballot-level comparison to allow the overstatement to be compared to specified thresholds other than the overall contest margin; those modifications are described in Sections 3.1 and 3.2. Section 4 gives numerical examples of simulated audits, using parameters intended to reflect how the procedure would work in Colorado. We provide example software implementing the risk calculations for our recommended approach in Python Jupyter notebooks. (Footnote 1: See https://github.com/pbstark/CORLA18.) Section 5 gives recommendations and considerations for implementation.

2 Stratified audits

Stratified sampling involves partitioning a population into non-overlapping groups and drawing independent random samples from those groups. [9, 4] developed RLAs based on comparing stratified samples of batches of ballots to hand counts of the votes in those batches: batch-level comparison RLAs, using a particular test statistic. The method we develop here is more general and more flexible: it can be used with any test statistic, and test statistics in different strata need not be the same—which is key to combining audits of ballots cast using diverse voting technologies.

Here and below, we consider auditing a single plurality contest at a time, although the same sample can be used to audit more than one contest (and super-majority contests), and there are ways of combining audits of different contests into a single process [10, 14]. We use terminology drawn from a number of papers, notably [6].

An overstatement error is an error that caused the margin between any reported winner and any reported loser to appear larger than it really was. An understatement error is an error that caused the margin between every reported winner and every reported loser to appear to be smaller than it really was. Overstatements cast doubt on outcomes; understatements do not, even though they are tabulation errors.

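We use $w$ to denote a reported winner and $\ell$ to denote a reported loser. The total number of reported votes for candidate $w$ is $V_w$ and the total for candidate $\ell$ is $V_\ell$. Thus $V_w > V_\ell$, since $w$ is reported to have gotten more votes than $\ell$.

Let $V_{w\ell} \equiv V_w - V_\ell > 0$ denote the contest-wide margin (in votes) of $w$ over $\ell$. We have $S$ strata. Let $V_{w\ell,s}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ in stratum $s$. Note that $V_{w\ell,s}$ might be negative in one stratum, but $\sum_{s=1}^S V_{w\ell,s} = V_{w\ell} > 0$. Let $A_{w\ell}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ that a full hand count would show: the actual margin, in contrast to the reported margin $V_{w\ell}$. Reported winner $w$ really beat reported loser $\ell$ if and only if $A_{w\ell} > 0$. Define $A_{w\ell,s}$ to be the actual margin (in votes) of $w$ over $\ell$ in stratum $s$.

Let $\omega_{w\ell,s} \equiv V_{w\ell,s} - A_{w\ell,s}$ be the overstatement of the margin of $w$ over $\ell$ in stratum $s$. Reported winner $w$ really beat reported loser $\ell$ if and only if $\omega_{w\ell} \equiv \sum_{s=1}^S \omega_{w\ell,s} < V_{w\ell}$.

An RLA is a test of the hypothesis that the outcome is wrong, that is, that $w$ did not really beat $\ell$: $\omega_{w\ell} \ge V_{w\ell}$. The null is true if and only if there exists some $S$-tuple of real numbers $(\lambda_s)_{s=1}^S$ with $\sum_{s=1}^S \lambda_s = 1$ such that $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ for all $s$. (Footnote 2: “If” is straightforward: summing the stratum inequalities gives $\omega_{w\ell} \ge V_{w\ell}$. For “only if,” suppose $\omega_{w\ell} \ge V_{w\ell}$. Set $\lambda_s = \omega_{w\ell,s}/V_{w\ell}$ for $s < S$ and $\lambda_S = 1 - \sum_{s<S} \lambda_s$. Then $\sum_s \lambda_s = 1$, and $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ for all $s$.) Thus if we can reject the conjunction hypothesis $\bigcap_s \{\omega_{w\ell,s} \ge \lambda_s V_{w\ell}\}$ at significance level $\alpha$ for all $(\lambda_s)$ such that $\sum_s \lambda_s = 1$, we can stop the audit, and the risk limit will be $\alpha$.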

2.1 Fisher’s combination method

Fix $\lambda \equiv (\lambda_s)_{s=1}^S$, with $\sum_s \lambda_s = 1$. To test the conjunction hypothesis that $S$ stratum null hypotheses are true, that is, that $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ for all $s$, we use Fisher’s combining function. Let $p_s(\lambda_s)$ be the $P$-value of the hypothesis $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$. If the null hypothesis is true, then

$$\chi(\lambda) = -2 \sum_{s=1}^S \ln p_s(\lambda_s) \qquad (1)$$

has a probability distribution that is dominated by the chi-square distribution with $2S$ degrees of freedom. (Footnote 3: If the stratum-level tests had continuously distributed $P$-values, the distribution would be exactly chi-square with $2S$ degrees of freedom, but if any of the $P$-values has atoms when the null hypothesis is true, it is in general stochastically smaller. This follows from a coupling argument along the lines of Theorem 4.12.3 in [3].) Fisher’s combined statistic will tend to be small when all stratum-level null hypotheses are true. If any is false, then as the sample size increases, Fisher’s combined statistic will tend to grow.

If, for all $\lambda$ with $\sum_s \lambda_s = 1$, we can reject the conjunction hypothesis at level $\alpha$ (i.e., if the minimum value of Fisher’s combined statistic over all $\lambda$ is larger than the $1-\alpha$ quantile of the chi-square distribution with $2S$ degrees of freedom), the audit can stop.
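The combination step for a single allocation can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the chi-square tail probability is computed from the closed form that holds for even degrees of freedom, so only the standard library is needed.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of the chi-square distribution (closed form, even df)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return min(1.0, math.exp(-half) * total)

def fisher_combined_pvalue(pvalues):
    """Combine independent stratum-level P-values for one allocation of error."""
    if any(p < 0 or p > 1 for p in pvalues):
        raise ValueError("P-values must lie in [0, 1]")
    if any(p == 0 for p in pvalues):
        return 0.0
    chi = -2.0 * sum(math.log(p) for p in pvalues)   # Fisher's combining function
    return chi2_sf_even_df(chi, 2 * len(pvalues))    # 2S degrees of freedom

# Hypothetical stratum-level P-values for one allocation (illustration only).
combined = fisher_combined_pvalue([0.1, 0.1])
```

The audit can stop only if the combined value is at most the risk limit for every allocation, so a single call like this is one evaluation inside a maximization over allocations.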

If the audit is allowed to “escalate” in steps, increasing the sample size sequentially, then either the tests used in the separate strata have to be sequential tests, or multiplicity needs to be taken into account, for instance by adjusting the risk limit at each step. Otherwise, the overall procedure can have a risk limit that is much larger than $\alpha$. For examples of controlling for multiplicity when using non-sequential testing procedures in an RLA, see [9, 11].

The stratum-level $P$-value $p_s(\lambda_s)$ could be a $P$-value for the hypothesis $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ from any test procedure. We assume, however, that $p_s$ is based on a one-sided test, and that the tests for different values of $\lambda_s$ “nest” in the sense that if $a > b$, then $p_s(a) \le p_s(b)$. This monotonicity is a reasonable requirement because the evidence that the overstatement is greater than $a V_{w\ell}$ should be weaker than the evidence that the overstatement is greater than $b V_{w\ell}$, if $a > b$. In particular, this monotonicity holds for the tests proposed in Sections 3.1 and 3.2.

One could use a function other than Fisher’s to combine the stratum-level $P$-values into a $P$-value for the conjunction hypothesis, provided it satisfies these properties (see [7]):

  • the function is non-increasing in each argument and symmetric with respect to rearrangements of the arguments

  • the combining function attains its supremum when one of the arguments approaches zero

  • for every level $\alpha$, the critical value of the combining function is finite and strictly smaller than the function’s supremum.

For instance, one could use Liptak’s function, $T = \sum_s \Phi^{-1}(1 - p_s)$, where $\Phi$ is the standard normal distribution function, or Tippett’s function, $T = \max_s (1 - p_s)$.

Fisher’s function is convenient for this application because the tests in different strata are independent, so the chi-squared distribution dominates the distribution of $\chi(\lambda)$ when the null hypothesis is true. If tests in different strata were correlated, the null distribution of the combination function would need to be calibrated by simulation; some other combining function might have better properties than Fisher’s [7].

2.2 Maximizing Fisher’s combined $P$-value for $S = 2$

We now specialize to $S = 2$ strata. The set of $(\lambda_1, \lambda_2)$ such that $\lambda_1 + \lambda_2 = 1$ is then a one-dimensional family: if $\lambda_1 = \lambda$, then $\lambda_2 = 1 - \lambda$. For a given set of data, finding the maximum $P$-value over all $\lambda$ is thus a one-dimensional optimization problem. We provide two software solutions to the problem.

The first approach approximates the maximum via a grid search, refining the grid once the maximum has been bracketed. This is not guaranteed to find the global maximum exactly, although it can approximate the maximum as closely as one desires by refining the mesh, since the objective function is continuous.
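A refined grid search of this kind might look like the following sketch. It is not the code in the reference notebooks: the function names, the number of grid points, the number of refinement rounds, and the toy stratum $P$-value functions are all assumptions made for illustration. For two strata the combined $P$-value has the closed form $q(1 - \ln q)$ with $q = p_1 p_2$.

```python
import math

def fisher_pvalue_two_strata(p1, p2):
    """Fisher's combined P-value for S = 2 independent strata (chi-square, 4 df)."""
    q = p1 * p2
    if q >= 1.0:
        return 1.0
    if q <= 0.0:
        return 0.0
    return q * (1.0 - math.log(q))

def maximize_combined_pvalue(p1_fun, p2_fun, lam_lo, lam_hi, points=21, rounds=4):
    """Approximate the maximum over lambda by a grid search that zooms in each round."""
    best_lam, best_p = lam_lo, -1.0
    lo, hi = lam_lo, lam_hi
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        for i in range(points):
            lam = min(hi, lo + i * step)
            p = fisher_pvalue_two_strata(min(1.0, p1_fun(lam)),
                                         min(1.0, p2_fun(1.0 - lam)))
            if p > best_p:
                best_lam, best_p = lam, p
        # Refine: restrict the next grid to the cells next to the current best point.
        lo, hi = max(lam_lo, best_lam - step), min(lam_hi, best_lam + step)
    return best_lam, best_p

# Hypothetical stratum P-values that shrink as the quota grows (illustration only).
toy_p1 = lambda quota: min(1.0, math.exp(-3.0 * quota))
toy_p2 = lambda quota: min(1.0, math.exp(-6.0 * quota))
best_lam, max_p = maximize_combined_pvalue(toy_p1, toy_p2, 0.0, 1.0)
```

Because the search only evaluates grid points, the value it returns is a lower bound on the true maximum, which is why the bounding approach described next is the more rigorous of the two.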

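The second, more rigorous approach uses bounds on Fisher’s combining function $\chi(\lambda)$ for all $\lambda$. (A lower bound on $\chi$ implies an upper bound on the $P$-value: if, for all $\lambda$, the lower bound is larger than the $1-\alpha$ quantile of the chi-squared distribution with 4 degrees of freedom, the maximum $P$-value is no larger than $\alpha$.)

Some values of $\lambda$ can be ruled out a priori, because (for instance) $\omega_{w\ell,s} = V_{w\ell,s} - A_{w\ell,s} \le V_{w\ell,s} + N_s$, where $N_s$ is the number of ballots cast in stratum $s$, and thus

$$1 - \frac{V_{w\ell,2} + N_2}{V_{w\ell}} \le \lambda \le \frac{V_{w\ell,1} + N_1}{V_{w\ell}}. \qquad (2)$$

Let $\lambda_-$ and $\lambda_+$ be lower and upper bounds on $\lambda$.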
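As a small worked sketch of how such a priori bounds might be computed for two strata: the overstatement in a stratum cannot exceed the reported stratum margin plus the number of ballots cast there, because the actual margin is at least minus the number of ballots. The function name and the vote counts below are hypothetical.

```python
def feasible_lambda_range(margin_1, ballots_1, margin_2, ballots_2):
    """
    Range of lambda (the share of the contest-wide margin allotted to stratum 1)
    that is not ruled out a priori.

    margin_s  : reported margin in votes in stratum s (may be negative)
    ballots_s : number of ballots cast in stratum s
    """
    total_margin = margin_1 + margin_2
    if total_margin <= 0:
        raise ValueError("the contest-wide reported margin must be positive")
    # Stratum 1 cannot hold more overstatement than margin_1 + ballots_1.
    lam_upper = (margin_1 + ballots_1) / total_margin
    # Stratum 2 cannot hold more than margin_2 + ballots_2, so stratum 1 needs the rest.
    lam_lower = 1.0 - (margin_2 + ballots_2) / total_margin
    return lam_lower, lam_upper

# Hypothetical contest: 100,000 ballots and a 2,000-vote margin in stratum 1,
# 10,000 ballots and a tie in stratum 2.
lam_lower, lam_upper = feasible_lambda_range(2000, 100_000, 0, 10_000)
```

Negative values of $\lambda$ are allowed: they correspond to understatement in one stratum offset by a larger overstatement in the other.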

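Recall that each $p_s$ is monotone in its quota: a larger quota is a more extreme null hypothesis, so its $P$-value is no larger. Hence, as a function of $\lambda$, $p_1(\lambda)$ decreases monotonically and $p_2(1-\lambda)$ increases monotonically. Suppose $[a, b] \subset [\lambda_-, \lambda_+]$. Then for all $\lambda \in [a, b]$, $p_1(\lambda) \le p_1(a)$ and $p_2(1-\lambda) \le p_2(1-b)$. Thus

$$\chi(\lambda) = -2\left(\ln p_1(\lambda) + \ln p_2(1-\lambda)\right) \ge -2\left(\ln p_1(a) + \ln p_2(1-b)\right) \equiv \chi_-(a, b). \qquad (3)$$

This gives a lower bound for $\chi$ on the interval $[a, b]$; the corresponding upper bound is $\chi_+(a, b) \equiv -2\left(\ln p_1(b) + \ln p_2(1-a)\right)$. Partitioning $[\lambda_-, \lambda_+]$ into a collection of intervals $[a_k, a_{k+1}]$ and finding $\chi_-(a_k, a_{k+1})$ and $\chi_+(a_k, a_{k+1})$ for each yields piecewise-constant lower and upper bounds for $\chi(\lambda)$.

If, for all $\lambda \in [\lambda_-, \lambda_+]$, the lower bound on $\chi$ is larger than the $1-\alpha$ quantile of the chi-square distribution with 4 degrees of freedom, the audit can stop. On the other hand, if for some $\lambda$, the upper bound is less than the $1-\alpha$ quantile of the chi-square distribution with 4 degrees of freedom, or if $\chi(a_k)$ is less than this quantile at any grid point $a_k$, the sample size in one or both strata needs to increase. If the lower bound is less than the $1-\alpha$ quantile on some interval, but $\chi$ is above this quantile at every grid point $a_k$, then one should improve the lower bound by refining the grid and/or by increasing the sample size in one or both strata.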
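The decision rule above can be sketched as follows. This is an illustrative simplification rather than the notebook code: the names and the toy $P$-value functions are assumptions. It works on the $P$-value scale, which is equivalent, since a lower bound on Fisher's statistic above the critical quantile is the same as an upper bound on the combined $P$-value below the risk limit. It only assumes each stratum $P$-value is monotone in its quota, so that its largest value on a grid cell occurs at one of the cell's endpoints.

```python
import math

def combined_pvalue(p1, p2):
    """Fisher's combined P-value for two independent strata (chi-square with 4 df)."""
    q = min(1.0, p1) * min(1.0, p2)
    if q <= 0.0:
        return 0.0
    return q * (1.0 - math.log(q))   # exp(-chi/2) * (1 + chi/2) with chi = -2 ln q

def audit_decision(p1_fun, p2_fun, lam_lo, lam_hi, alpha, cells=100):
    """Return 'stop', 'escalate' (sample more), or 'refine' (bounds too coarse)."""
    grid = [lam_lo + (lam_hi - lam_lo) * k / cells for k in range(cells + 1)]
    p1 = [p1_fun(g) for g in grid]
    p2 = [p2_fun(1.0 - g) for g in grid]
    # The combined P-value at some grid point exceeds the risk limit.
    if any(combined_pvalue(a, b) > alpha for a, b in zip(p1, p2)):
        return "escalate"
    for k in range(cells):
        # Upper bound on the combined P-value anywhere in the k-th cell.
        upper = combined_pvalue(max(p1[k], p1[k + 1]), max(p2[k], p2[k + 1]))
        if upper > alpha:
            return "refine"
    return "stop"

def toy(rate):
    """Hypothetical monotone stratum P-value as a function of the quota."""
    return lambda quota: min(1.0, math.exp(-rate * quota))

decision = audit_decision(toy(8.0), toy(8.0), 0.0, 1.0, alpha=0.05)
```

With a very coarse grid the same data can return "refine" instead of "stop", which mirrors the remark that the lower bound can be improved by refining the grid.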

3 Auditing cross-jurisdictional contests

As mentioned above, stratified sampling can simplify audit logistics by allowing jurisdictions to sample ballots independently of each other, or by allowing a single jurisdiction to sample independently from different collections of ballots (e.g., vote-by-mail versus cast in person). SUITE allows stratified samples to be combined into an RLA of contests that include ballots from more than one stratum.

We present an example where SUITE is helpful for a different reason: it enables an RLA to take advantage of differences among voting systems to reduce audit sample sizes, which solves a current problem in Colorado.

CRS 1-7-515 requires Colorado to conduct risk-limiting audits beginning in 2017. The first risk-limiting election audits under this statute were conducted in November, 2017; the second were conducted in July, 2018. (Footnote 4: See https://www.sos.state.co.us/pubs/elections/RLA/2017RLABackground.html) Counties cannot audit contests that cross jurisdictional boundaries (cross-jurisdictional contests, such as gubernatorial contests and most federal contests) on their own: margins and risk limits apply to entire contests, not to the portion of a contest included in a county. Colorado has not yet conducted an RLA of a cross-jurisdictional contest, although it has performed RLA-like procedures on individual jurisdictions’ portions of some cross-jurisdictional contests. To audit statewide elections and contests that cross county lines, Colorado will need to implement new approaches and make some changes to its auditing software, RLATool.

Colorado’s voting systems are heterogeneous. Some counties (containing about 98% of active voters, as of this writing) have voting systems that export cast vote records (CVRs) in a way that the paper ballot corresponding to each CVR can be identified uniquely and retrieved. We call counties with such voting systems CVR counties. In CVR counties, auditors can manually check the accuracy of the voting system’s interpretation of individual ballots. In other counties (legacy or no-CVR counties) there is no way to check the accuracy of the system’s interpretation of voter intent for individual ballots.

Contests entirely contained in CVR counties can be audited using ballot-level comparison audits [6], which compare CVRs to the auditors’ interpretation of voter intent directly from paper ballots. Ballot-level comparison audits are currently the most efficient approach to risk-limiting audits in that they require examining fewer ballots than other methods do, when the outcome of the contest under audit is in fact correct. Contests involving no-CVR counties can be audited using ballot-polling audits [5, 6], which generally require examining more ballots than ballot-level comparison audits to attain the same risk limit.

Colorado’s challenge is to audit contests that include ballots cast in both CVR counties and no-CVR counties. There is no literature on how to combine ballot polling with ballot-level comparisons to audit cross-jurisdictional contests that include voters in CVR counties and voters in no-CVR counties. (Footnote 5: See [8] for a different (Bayesian) approach to auditing contests that include both CVR counties and no-CVR counties. In general, Bayesian audits are not risk-limiting.)

Colorado could simply revert to ballot-polling audits for cross-jurisdictional contests that include votes in no-CVR counties, but that would entail a loss of efficiency. Alternatively, Colorado could use batch-level comparison audits, with single-ballot batches in CVR counties and larger batches in no-CVR counties. (Footnote 6: Since so few ballots are cast in no-CVR counties, cruder approaches might work, for instance, pretending that no-CVR counties had CVRs, but treating any ballot sampled from a no-CVR county as if it had a 2-vote overstatement error. See [1].) The statistical theory for such audits has been worked out (see, e.g., [9, 10, 12, 14] and Section 0.A, below); indeed, this is the method that was used in several of California’s pilot audits, including the audit in Orange County, California. However, batch-level comparison audits were found to be less efficient than ballot-polling audits in these pilots [2].

Moreover, to use batch-level comparison audits in Colorado would require major changes to RLATool, for reporting batch-level contest results prior to the audit, for drawing the sample, for reporting audit findings, and for determining when the audit can stop. The changes would include modifying data structures, data uploads, random sampling procedures, and the county user interface. No-CVR counties would also have to revise their audit procedures. Among other things, they would need to report vote subtotals for physically identifiable groups of ballots before the audit starts. No-CVR counties with voting systems that can only report subtotals by precinct might have to make major changes to how they handle ballots, for instance, sorting all ballots by precinct. These are large changes.

We show here that SUITE makes possible a “hybrid” RLA that keeps the advantages of ballot-level comparison audits in CVR counties but does not require major changes to how no-CVR counties audit, nor major changes to RLATool. The key is to use stratified sampling with two strata: ballots cast in CVR counties and those cast in no-CVR counties.

In order to use Equation 1, we must develop stratum-level tests for the overstatement error that are appropriate for the corresponding voting system. Sections 3.1 and 3.2 describe these tests for overstatement in the CVR and no-CVR strata, respectively.

3.1 Comparison audits of overstatement quotas

To use comparison auditing in the approach to stratification described above requires extending previous work to test whether the overstatement error is greater than or equal to $\lambda_s V_{w\ell}$, rather than simply $V_{w\ell}$. Appendix 0.A derives this generalization for arbitrary batch sizes, including batches consisting of one ballot. The derivation considers only a single contest, but the MACRO test statistic [10, 14] automatically extends the result to auditing any number of contests simultaneously. The derivation is for plurality contests, including “vote-for-$k$” plurality contests. Majority and super-majority contests are a minor modification [9]. (Footnote 7: So are some forms of preferential and approval voting, such as Borda count, and proportional representation contests, such as D’Hondt [15]. For a derivation of ballot-level comparison risk-limiting audits for super-majority contests, see https://github.com/pbstark/S157F17/blob/master/audit.ipynb. (Last visited 14 May 2018.) Changes for IRV/STV are more complicated.)

3.2 Ballot-polling audits of overstatement quotas

To use the new stratification method with ballot polling requires a different approach than [5] took: their approach tests whether $w$ got a larger share of the votes than $\ell$, but we need to test whether the overstatement of the margin (in votes) in the stratum is greater than or equal to a threshold (namely, $\lambda_s V_{w\ell}$), that is, whether the actual margin in the stratum is at most $V_{w\ell,s} - \lambda_s V_{w\ell}$. This introduces a nuisance parameter, the number of ballots with votes for either $w$ or $\ell$. We address this by maximizing the probability ratio in Wald’s Sequential Probability Ratio Test [17] over all possible values of the nuisance parameter. Appendix 0.B develops the test.

4 Numerical examples

Jupyter notebooks containing calculations for hybrid stratified audits intended to be relevant for Colorado are available at https://www.github.com/pbstark/CORLA18.

hybrid-audit-example-1 contains two hypothetical examples. The first has cast ballots, of which 9.1% were in no-CVR counties. The diluted margin (the margin in votes, divided by the total number of ballots cast) is . In 94% of 10,000 simulations in which the reported results were correct, drawing 700 ballots from the CVR stratum and 500 ballots from the no-CVR stratum (1,200 ballots in all) allowed SUITE to confirm the outcome at 10% risk. For the remaining 6%, further expansion of the audits would have been necessary.

If it were possible to conduct a ballot-level comparison audit for the entire contest, an RLA with risk limit 10% could terminate after examining 263 ballots if it found no errors. A ballot-polling audit of the entire contest would have been expected to examine about 14,000 ballots, more than 10% of ballots cast. The hybrid audit is less efficient than a ballot-level comparison audit, but far more efficient than a ballot-polling audit.

The second contest has 2 million cast ballots, of which 5% were cast in no-CVR counties. The diluted margin is about . The workload for SUITE at 5% risk is quite low: In 100% of 10,000 simulations in which the reported results were correct, auditing 43 ballots from the CVR stratum and 15 ballots from the no-CVR stratum would have confirmed the outcome. If it were possible to conduct a ballot-level comparison audit for the entire contest, an RLA at risk limit 5% could terminate after examining 31 ballots if it found no errors. The additional work for the hybrid stratified audit is disproportionately in the no-CVR counties.

A second notebook, hybrid-audit-example-2, illustrates the workflow for SUITE for an election with 2 million ballots cast. The reported margin is just over , but the reported winner and reported loser are actually tied in both strata. The risk limit is 5%. For a sample of 500 ballots from the CVR stratum and 1000 ballots from the no-CVR stratum, the maximum combined $P$-value is over 25%, so the audit cannot stop there.

A third notebook, fisher_combined_pvalue, illustrates the numerical methods used to check whether the maximum combined $P$-value is below the risk limit. It includes code for the tests in the two strata, for the lower and upper bounds $\lambda_-$ and $\lambda_+$ for $\lambda$, for evaluating Fisher’s combining function on a grid, and for computing bounds on the $P$-value via Equation 3.

5 Discussion

We present SUITE, a new class of procedures for RLAs based on stratified random sampling. SUITE is agnostic about the capability of voting equipment in different strata, unlike previous methods, which require batch-level comparisons in every stratum. SUITE allows arbitrary tests to be used in different strata; if those tests are sequentially valid, then the overall RLA is sequential. (Otherwise, multiplicity adjustments might be needed if one wants an audit that escalates in stages. See [9, 11] for two approaches.)

Like other RLA methods, SUITE poses auditing as a hypothesis test. The null hypothesis is a union over all partitions of outcome-changing error across strata. The hypothesis is rejected if the maximum $P$-value over all such partitions is sufficiently small. Each possible partition yields an intersection hypothesis, tested by combining $P$-values from different strata using Fisher’s combining function (or a suitable replacement).

Among other things, the new approach solves a current problem in Colorado: how to conduct RLAs of contests that cross jurisdictional lines, such as statewide contests and many federal contests.

We give numerical examples in Jupyter notebooks that can be modified to estimate the workload for different contest sizes, margins, and risk limits. In our numerical experiments, the new method requires auditing far fewer ballots than previous approaches would.

Appendix 0.A Comparison tests for an overstatement quota

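0.A.1 Notation

  • $\mathcal{W}$: the set of reported winners of the contest

  • $\mathcal{L}$: the set of reported losers of the contest

  • $N_s$ ballots were cast in stratum $s$. (The contest might not appear on all ballots.)

  • $P$ “batches” of ballots are in stratum $s$. A batch contains one or more ballots. Every ballot in stratum $s$ is in exactly one batch.

  • $n_p$: number of ballots in batch $p$. $N_s = \sum_{p=1}^P n_p$.

  • $v_{pi} \ge 0$: reported votes for candidate $i$ in batch $p$

  • $a_{pi} \ge 0$: actual votes for candidate $i$ in batch $p$. If the contest does not appear on any ballot in batch $p$, then $a_{pi} = 0$.

  • $V_{w\ell,s} \equiv \sum_{p=1}^P (v_{pw} - v_{p\ell})$: reported margin in stratum $s$ of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$, in votes.

  • $V_{w\ell}$: overall reported margin in votes of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$ for the entire contest (not just stratum $s$)

  • $V \equiv \min_{w \in \mathcal{W}, \ell \in \mathcal{L}} V_{w\ell}$: smallest reported overall margin in votes between any reported winner and reported loser

  • $A_{w\ell,s} \equiv \sum_{p=1}^P (a_{pw} - a_{p\ell})$: actual margin in votes in the stratum of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$

  • $A_{w\ell}$: actual margin in votes of reported winner $w \in \mathcal{W}$ over reported loser $\ell \in \mathcal{L}$ for the entire contest (not just in stratum $s$)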

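0.A.2 Reduction to maximum relative overstatement

If the contest is entirely contained in stratum $s$, then the reported winners of the contest are the actual winners if

$$\min_{w \in \mathcal{W}, \ell \in \mathcal{L}} A_{w\ell,s} > 0.$$

Here, we address the case that the contest may include a portion outside the stratum. To combine independent samples in different strata, it is convenient to be able to test whether the net overstatement error in a stratum is greater than or equal to a given threshold.

Instead of testing that condition directly, we will test a condition that is sufficient but not necessary for the inequality to hold, to get a computationally simple test that is still conservative (i.e., the level is not larger than its nominal value).

For every winner, loser pair $(w, \ell)$, we want to test whether the overstatement error is greater than or equal to some threshold, generally one tied to the reported margin between $w$ and $\ell$. For instance, for a hybrid stratified audit, we set the threshold to be $\lambda_s V_{w\ell}$.

We want to test whether

$$\sum_{p=1}^P \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \ge \lambda_s.$$

The maximum of sums is not larger than the sum of the maxima; that is,

$$\max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \sum_{p=1}^P \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le \sum_{p=1}^P \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}}.$$

Define

$$e_p \equiv \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}}.$$

Then no reported margin is overstated by a fraction $\lambda_s$ or more if

$$E \equiv \sum_{p=1}^P e_p < \lambda_s.$$

Thus if we can reject the hypothesis $E \ge \lambda_s$, we can conclude that no pairwise margin was overstated by as much as a fraction $\lambda_s$.

Testing whether $E \ge \lambda_s$ would require a very large sample if we knew nothing at all about $e_p$ without auditing batch $p$: a single large value of $e_p$ could make $E$ arbitrarily large. But there is an a priori upper bound for $e_p$. Whatever the reported votes $v_{pi}$ are in batch $p$, we can find the potential values of the actual votes $a_{pi}$ that would make the error $e_p$ largest, because $a_{pi}$ must be between 0 and $n_p$, the number of ballots in batch $p$:

$$\frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le \frac{v_{pw} - 0 - v_{p\ell} + n_p}{V_{w\ell}}.$$

Hence,

$$e_p \le \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}} \equiv u_p. \qquad (4)$$

Knowing that $e_p \le u_p$ might let us conclude reliably that $E < \lambda_s$ by examining only a small number of batches, depending on the values $\{u_p\}_{p=1}^P$ and on the values of $\{e_p\}$ for the audited batches.

To make inferences about $E$, it is helpful to work with the taint $t_p \equiv e_p/u_p \le 1$. Define $U \equiv \sum_{p=1}^P u_p$. Suppose we draw $n$ batches at random with replacement, with probability $u_p/U$ of drawing batch $p$ in each draw, $p = 1, \ldots, P$. (Since $u_p \ge 0$, these are all nonnegative numbers, and they sum to 1, so they define a probability distribution on the $P$ batches.)

Let $T_j$ be the value of $t_p$ for the batch selected in the $j$th draw. Then $\{T_j\}_{j=1}^n$ are IID, $T_j \le 1$, and

$$\mathbb{E} T_1 = \sum_{p=1}^P \frac{u_p}{U} t_p = \frac{1}{U} \sum_{p=1}^P e_p = \frac{E}{U}.$$

Thus $E = U \, \mathbb{E} T_1$. So, if we have strong evidence that $\mathbb{E} T_1 < \lambda_s/U$, we have strong evidence that $E < \lambda_s$.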
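The quantities in this reduction can be computed directly. The sketch below uses hypothetical batch data for a two-candidate contest and hypothetical names throughout; it computes each batch's relative overstatement, its a priori upper bound, the sampling probabilities, and the taints, and then evaluates the expected taint under that sampling design.

```python
def batch_bounds(reported, actual, sizes, margins):
    """
    Per-batch relative overstatement and its a priori upper bound.

    reported[p][i], actual[p][i] : votes for candidate i in batch p
    sizes[p]                     : number of ballots in batch p
    margins[(w, ell)]            : contest-wide reported margin of w over ell, in votes
    """
    errors, bounds = [], []
    for v, a, n in zip(reported, actual, sizes):
        errors.append(max((v[w] - a[w] - v[ell] + a[ell]) / m
                          for (w, ell), m in margins.items()))
        # Worst case: no actual votes for w, every ballot in the batch for ell.
        bounds.append(max((v[w] - v[ell] + n) / m
                          for (w, ell), m in margins.items()))
    return errors, bounds

# Hypothetical data: "A" is the reported winner, "B" the reported loser.
reported = [{"A": 60, "B": 40}, {"A": 55, "B": 45}, {"A": 30, "B": 20}]
actual   = [{"A": 58, "B": 42}, {"A": 55, "B": 45}, {"A": 31, "B": 19}]
sizes    = [100, 100, 50]
margins  = {("A", "B"): 40}

errors, bounds = batch_bounds(reported, actual, sizes, margins)
total_bound = sum(bounds)
draw_prob = [u / total_bound for u in bounds]         # chance of drawing each batch
taints = [e / u for e, u in zip(errors, bounds)]      # assumes every bound is positive
expected_taint = sum(pr * t for pr, t in zip(draw_prob, taints))
```

In this toy example the expected taint equals the total relative overstatement divided by the sum of the bounds, which is the identity the test relies on.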

This approach can be simplified even further by noting that $u_p$ has a simple upper bound that does not depend on $v_{pi}$. At worst, the reported result for batch $p$ shows $n_p$ votes for the “least-winning” apparent winner of the contest with the smallest margin, but a hand interpretation would show that all $n_p$ ballots in the batch had votes for the runner-up in that contest. Since $V_{w\ell} \ge V$ and $0 \le v_{pi} \le n_p$,

$$u_p = \max_{w \in \mathcal{W}, \ell \in \mathcal{L}} \frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}} \le \frac{n_p - 0 + n_p}{V} = \frac{2 n_p}{V}.$$

Thus if we use $2 n_p/V$ in lieu of $u_p$, we still get conservative results. (We also need to re-define $U$ to be the sum of those upper bounds.) An intermediate, still conservative approach would be to use this upper bound for batches that consist of a single ballot, but use the sharper bound (4) when $n_p > 1$. Regardless, for the new definition of $u_p$ and $U$, $\{T_j\}_{j=1}^n$ are IID, $T_j \le 1$, and

$$\mathbb{E} T_1 = \frac{E}{U}.$$

So, if we have evidence that $\mathbb{E} T_1 < \lambda_s/U$, we have evidence that $E < \lambda_s$.

0.A.3 Testing $\mathbb{E} T_1 \ge \lambda_s/U$

A variety of methods are available to test whether $\mathbb{E} T_1 < \lambda_s/U$. One particularly elegant sequential method is based on Wald’s Sequential Probability Ratio Test (SPRT) [17]. Harold Kaplan pointed out this method on a website that no longer exists. A derivation of this Kaplan-Wald method is in Appendix A of [15]; to apply the method here, take $t = \lambda_s$ in their equation 18. A different sequential method, the Kaplan-Markov method (also due to Harold Kaplan), is given in [12].

Appendix 0.B Ballot-polling tests for an overstatement quota

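In this section, we derive a ballot-polling test of the hypothesis that the margin (in votes) in a single stratum is less than or equal to a threshold $c$.

0.B.1 Wald’s SPRT with a nuisance parameter

Consider a single stratum $s$ containing $N$ ballots, of which $N_w$ have a vote for $w$ but not for $\ell$, $N_\ell$ have a vote for $\ell$ but not for $w$, and $N_u = N - N_w - N_\ell$ have votes for both $w$ and $\ell$ or neither $w$ nor $\ell$, including undervotes and invalid ballots. Ballots are drawn sequentially without replacement, with equal probability of selecting each as-yet-unselected ballot in each draw.

We want to test the compound hypothesis that $N_w - N_\ell \le c$ against the alternative that $N_w = V_w$, $N_\ell = V_\ell$, and $N_u = V_u$, with $V_w - V_\ell > c$.

The values $V_w$, $V_\ell$, and $V_u$ are the reported results for stratum $s$ (or values related to those reported results; see [5]). In this problem, $N_u$ (equivalently, $N_w + N_\ell$) is a nuisance parameter: we care about $N_w - N_\ell$.

Let $X_k$ be $w$, $\ell$, or $u$ according to whether the ballot selected on the $k$th draw shows a vote for $w$ but not $\ell$, $\ell$ but not $w$, or something else. Let $B_w \equiv \sum_{k=1}^n 1\{X_k = w\}$; and define $B_\ell$ and $B_u$ analogously.

The probability of a given data sequence $X_1, \ldots, X_n$ under the alternative hypothesis is

$$\frac{\prod_{i=0}^{B_w-1}(V_w - i) \; \prod_{i=0}^{B_\ell-1}(V_\ell - i) \; \prod_{i=0}^{B_u-1}(V_u - i)}{\prod_{i=0}^{n-1}(N - i)}.$$

If $B_w \le B_\ell$, the data obviously do not provide evidence against the null, so we suppose that $B_w > B_\ell$, in which case, the element of the null that will maximize the probability of the observed data has $N_w - N_\ell = c$. Under the null hypothesis, the probability of $X_1, \ldots, X_n$ is

$$\frac{\prod_{i=0}^{B_w-1}(N_w - i) \; \prod_{i=0}^{B_\ell-1}(N_w - c - i) \; \prod_{i=0}^{B_u-1}(N_u - i)}{\prod_{i=0}^{n-1}(N - i)}$$

for some value $N_w$ and the corresponding $N_u = N - 2N_w + c$. How large can that probability be under the null? The probability under the null is maximized by any integer $N_w \in \{\max(B_w, B_\ell + c), \ldots, \lfloor (N - B_u + c)/2 \rfloor\}$ that maximizes

$$\prod_{i=0}^{B_w-1}(N_w - i) \; \prod_{i=0}^{B_\ell-1}(N_w - c - i) \; \prod_{i=0}^{B_u-1}(N - 2N_w + c - i).$$

The logarithm is monotonic, so any maximizer also maximizes

$$f(N_w) \equiv \sum_{i=0}^{B_w-1} \ln(N_w - i) + \sum_{i=0}^{B_\ell-1} \ln(N_w - c - i) + \sum_{i=0}^{B_u-1} \ln(N - 2N_w + c - i).$$

The first two terms on the right increase monotonically with $N_w$ and the last term decreases monotonically with $N_w$. This yields bounds without having to evaluate $f$ everywhere. Suppose $N_w^- < N_w^+$. Then for all integer $N_w$ between $N_w^-$ and $N_w^+$,

$$f(N_w) \le \sum_{i=0}^{B_w-1} \ln(N_w^+ - i) + \sum_{i=0}^{B_\ell-1} \ln(N_w^+ - c - i) + \sum_{i=0}^{B_u-1} \ln(N - 2N_w^- + c - i).$$

The optimization problem can be solved using a branch and bound approach. For instance, start by evaluating

$$f_w(N_w) \equiv \sum_{i=0}^{B_w-1} \ln(N_w - i), \quad f_\ell(N_w) \equiv \sum_{i=0}^{B_\ell-1} \ln(N_w - c - i), \quad f_u(N_w) \equiv \sum_{i=0}^{B_u-1} \ln(N - 2N_w + c - i)$$

at the smallest feasible value of $N_w$, the largest feasible value of $N_w$, and their midpoint, to get the values of $f = f_w + f_\ell + f_u$ at those three points, along with upper bounds on $f$ on the ranges between them. At stage $j$, we have evaluated $f_w$, $f_\ell$, and $f_u$ at $j$ points $N_1 < N_2 < \cdots < N_j$, and we have upper bounds on $f$ on the ranges between those points. Let $U_m$ be the upper bound on $f$ for $N_w \in \{N_m, \ldots, N_{m+1}\}$. Suppose that for some $i$, $f(N_i) \ge \max_m U_m$. Then $N_i$ is a global maximizer of $f$. If there is some $U_m > \max_i f(N_i)$, then subdivide the range with the largest $U_m$, calculate $f_w$, $f_\ell$, and $f_u$ at the new point, and repeat. This algorithm must terminate by identifying a global maximizer after a finite number of steps.

A conservative $P$-value for the null hypothesis after $n$ items have been drawn is thus

$$P_n = \frac{\max_{N_w} \prod_{i=0}^{B_w-1}(N_w - i) \; \prod_{i=0}^{B_\ell-1}(N_w - c - i) \; \prod_{i=0}^{B_u-1}(N - 2N_w + c - i)}{\prod_{i=0}^{B_w-1}(V_w - i) \; \prod_{i=0}^{B_\ell-1}(V_\ell - i) \; \prod_{i=0}^{B_u-1}(V_u - i)}.$$

Because the test is built on Wald’s SPRT, the sample can expand sequentially and (if the null hypothesis is true) the chance that $P_n < p$ is never larger than $p$. That is, $\Pr\{\inf_n P_n < p\} \le p$ if the null is true.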
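A compact sketch of this $P$-value follows. It is not the notebook implementation: the function names are hypothetical, the threshold is assumed to be an integer, and the maximization over the nuisance parameter uses an exhaustive scan of the feasible integers rather than branch and bound, which is simpler to read but slower for large strata. Products of consecutive integers are evaluated on the log scale with `math.lgamma`.

```python
import math

def log_falling(m, b):
    """ln[m (m-1) ... (m-b+1)] for integers m >= b >= 0."""
    return math.lgamma(m + 1) - math.lgamma(m - b + 1)

def ballot_polling_pvalue(B_w, B_l, B_u, N, V_w, V_l, c):
    """
    Conservative P-value for the null N_w - N_l <= c in a stratum of N ballots.

    B_w, B_l, B_u : sample counts (votes for w only, for l only, everything else)
    V_w, V_l      : reported counts, used as the alternative; V_u = N - V_w - V_l
    c             : integer threshold on the margin in votes
    """
    V_u = N - V_w - V_l
    if B_w <= B_l or B_w > V_w or B_l > V_l or B_u > V_u:
        # No evidence against the null, or the sample contradicts the reported
        # tallies; reporting 1 is conservative in both cases.
        return 1.0
    lo = max(B_w, B_l + c)
    hi = (N - B_u + c) // 2
    if lo > hi:
        # The boundary null cannot produce this sample. If the binding constraint
        # is N_w >= B_w, neither can any smaller margin; otherwise stay conservative.
        return 0.0 if B_w >= B_l + c else 1.0
    log_null = max(
        log_falling(n_w, B_w)
        + log_falling(n_w - c, B_l)
        + log_falling(N - 2 * n_w + c, B_u)
        for n_w in range(lo, hi + 1)
    )
    log_alt = log_falling(V_w, B_w) + log_falling(V_l, B_l) + log_falling(V_u, B_u)
    return min(1.0, math.exp(log_null - log_alt))

# Hypothetical stratum: 100 ballots, reported 60 to 40, five sampled ballots all for w.
p_value = ballot_polling_pvalue(5, 0, 0, N=100, V_w=60, V_l=40, c=0)
```

In a hybrid audit this function would be evaluated with the threshold set to the reported stratum margin minus the quota being tested, once for each allocation of error considered.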

A Jupyter notebook implementing this approach is given in https://github.com/pbstark/CORLA18.