RiLACS: Risk-Limiting Audits via Confidence Sequences

Accurately determining the outcome of an election is a complex task with many potential sources of error, ranging from software glitches in voting machines to procedural lapses to outright fraud. Risk-limiting audits (RLAs) are statistically principled "incremental" hand counts that provide statistical assurance that reported outcomes accurately reflect the validly cast votes. We present a suite of tools for conducting RLAs using confidence sequences – sequences of confidence sets which uniformly capture an electoral parameter of interest from the start of an audit to the point of an exhaustive recount with high probability. Adopting the SHANGRLA framework, we design nonnegative martingales which yield computationally and statistically efficient confidence sequences and RLAs for a wide variety of election types.




1 Introduction

The reported outcome of an election may not match the validly cast votes for a variety of reasons, including software configuration errors, bugs, human error, and deliberate malfeasance. Trustworthy elections start with a trustworthy paper record of the validly cast votes. Given access to a trustworthy paper trail of votes, a risk-limiting audit (RLA) can provide a rigorous probabilistic guarantee:

  1. If an initially announced assertion about an election is false, this will be corrected by the audit with high probability;

  2. If the aforementioned assertion is true, then it will be confirmed (with probability one).

Here, an electoral assertion is simply a claim about the aggregated votes cast (e.g. “Alice received more votes than Bob”). An auditor may wish to audit several claims: for example, whether the reported winner is correct or whether the margin of victory is as large as announced.

From a statistical point of view, efficient risk-limiting audits can be implemented as sequential hypothesis tests. Namely, one tests the null hypothesis $H_0$: “the assertion is false,” versus the alternative $H_1$: “the assertion is true”. Imagine then observing a random sequence of voter-cast ballots $X_1, X_2, \dots, X_N$, where $N$ is the total number of ballots. A sequential hypothesis test is represented by a sequence of binary-valued functions:

$$(\phi_t)_{t=1}^N, \qquad \phi_t := \phi(X_1, \dots, X_t) \in \{0, 1\},$$

where $\phi_t = 1$ represents rejecting $H_0$ (typically in favor of $H_1$), and $\phi_t = 0$ means that $H_0$ has not yet been rejected. The sequential test (and thus the RLA) stops as soon as $\phi_t = 1$ or once all $N$ ballots are observed, whichever comes first. The “risk-limiting” property of RLAs states that if the assertion is false (in other words, if $H_0$ holds), then

$$\mathbb{P}_{H_0}\left(\exists t \in \{1, \dots, N\} : \phi_t = 1\right) \leq \alpha,$$

which is equivalent to type-I error control of the sequential test at level $\alpha$. Another way of interpreting the above statement is as follows: if the assertion is incorrect, then with probability at least $1 - \alpha$, $\phi_t = 0$ for every $t \in \{1, \dots, N\}$ and hence all $N$ ballots will eventually be inspected, at which point the “true” outcome (which is the result of the full hand count) will be known with certainty.

1.1 SHANGRLA Reduces Election Auditing to Sequential Testing

Designing the sequential hypothesis test $(\phi_t)_{t=1}^N$ depends on the type of vote, the aggregation method, or the social choice function for the election, and thus past works have constructed a variety of tests. Some works have designed $(\phi_t)_{t=1}^N$ in the context of a particular type of election [lindeman2012bravo, ottoboni2019bernoulli, rivest2017clipaudit]. On the other hand, the “SHANGRLA” (Sets of Half-Average Nulls Generate RLAs) framework unifies many common election types including plurality elections, approval voting, ranked-choice voting, and more by reducing each of these to a simple hypothesis test of whether a finite collection of finite lists of bounded numbers has mean at most 1/2 [stark2019sets, blom2021]. Let us give an illustrative example to show how SHANGRLA can be used in practice.

Suppose we have an election with two candidates, Alice and Bob. A ballot may contain a vote for Alice or for Bob, or it may contain no valid vote, e.g., because there was no selection or an overvote. It is reported that Alice and Bob received $N_A$ and $N_B$ votes respectively with $N_A > N_B$ and that there were a total of $N_U$ invalid ballots for a total of $N = N_A + N_B + N_U$ voters. We encode votes for Alice as “1”, votes for Bob as “0” and invalid votes as “1/2”, to obtain a set of numbers $\{x_1, \dots, x_N\}$. Crucially, Alice indeed received more votes than Bob if and only if $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i > 1/2$. In other words, the report that Alice beat Bob can be translated into the assertion that $\mu^\star > 1/2$.

SHANGRLA proposes to audit an assertion by testing its complement: rejecting that “complementary null” is affirmative evidence that the assertion is indeed true. In other words, if one can ensure that $(X_1, \dots, X_N)$ is a random permutation of $\{x_1, \dots, x_N\}$ by sampling ballots without replacement (each ballot is chosen uniformly amongst remaining ballots), then we can concern ourselves with designing a hypothesis test $(\phi_t)_{t=1}^N$ to test the null $H_0: \mu^\star \leq 1/2$ against the alternative $H_1: \mu^\star > 1/2$.
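The encoding described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `encode_ballot`, `population_mean`, and the toy ballot list are hypothetical and do not come from the paper's code.

```python
from fractions import Fraction


def encode_ballot(vote):
    """Map one ballot to the value used in the text:
    1 for Alice, 0 for Bob, 1/2 for a ballot with no valid vote."""
    if vote == "Alice":
        return Fraction(1)
    if vote == "Bob":
        return Fraction(0)
    return Fraction(1, 2)  # blank ballot, overvote, etc.


def population_mean(ballots):
    """Mean of the encoded ballots (mu* in the text)."""
    xs = [encode_ballot(v) for v in ballots]
    return sum(xs) / len(xs)


# Hypothetical toy election: 5 votes for Alice, 3 for Bob, 2 invalid ballots.
ballots = ["Alice"] * 5 + ["Bob"] * 3 + [None] * 2
mu_star = population_mean(ballots)

# "Alice beat Bob" holds exactly when the mean exceeds 1/2.
alice_won = mu_star > Fraction(1, 2)
```

Exact rational arithmetic is used here so that a tie (mean exactly 1/2) is never misclassified by floating-point rounding.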

One of the major benefits of SHANGRLA is the ability to reduce a wide range of election types to a testing problem of the above form. This permits the use of powerful statistical techniques which were designed specifically for such testing problems (but may not have been designed with RLAs in mind). Throughout this paper, we adopt the SHANGRLA framework, and while we return to the example of plurality elections for illustrative purposes, all of our methods can be applied to any election audit which has a SHANGRLA-like testing reduction [stark2019sets].

1.2 Confidence Sequences

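In the fixed-time (i.e. non-sequential) hypothesis testing regime, there is a well-known duality between hypothesis tests and confidence intervals for a parameter $\mu^\star$ of interest. We describe this briefly for $\mu^\star \in [0, 1]$ for simplicity. For each $\mu \in [0, 1]$, suppose that $\phi^\mu \equiv \phi^\mu(X_1, \dots, X_n)$ is a level-$\alpha$ nonsequential, fixed-sample test for the hypothesis $H_0: \mu^\star = \mu$ versus $H_1: \mu^\star \neq \mu$. Then, a nonsequential, fixed-sample $(1-\alpha)$ confidence interval for $\mu^\star$ is given by the set of all $\mu$ for which $\phi^\mu$ does not reject, that is

$$C_n := \{\mu \in [0, 1] : \phi^\mu = 0\}.$$

As we discuss further in Section 2, an analogous duality holds for sequential hypothesis tests and time-uniform confidence sequences (here and throughout the paper, “time” is used to refer to the number of samples so far, and need not correspond to any particular units such as hours or seconds). We first give a brief preview of the results to come. Consider a family of sequential hypothesis tests $\{(\phi_t^\mu)_{t=1}^N\}_{\mu \in [0,1]}$, meaning that for each $\mu \in [0, 1]$, $(\phi_t^\mu)_{t=1}^N$ is a sequential test for $H_0: \mu^\star = \mu$. Then, the set of all $\mu$ for which $\phi_t^\mu = 0$,

$$C_t := \{\mu \in [0, 1] : \phi_t^\mu = 0\},$$

forms a $(1-\alpha)$ confidence sequence for $\mu^\star$, meaning that

$$\mathbb{P}\left(\exists t \in [N] : \mu^\star \notin C_t\right) \leq \alpha,$$

where $[N]$ is used to denote the set $\{1, \dots, N\}$. In other words, $C_t$ will cover $\mu^\star$ at every single time $t$, except with some small probability at most $\alpha$. Since $C_t$ is typically an interval $[L_t, U_t]$, we call the lower endpoint $(L_t)_{t=1}^N$ a lower confidence sequence (and similarly for upper).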

Figure 1: 95% Lower confidence sequences for the margin of a plurality election between Alice and Bob for three different auditing methods. Votes for Alice are encoded by “1” and those for Bob are encoded by “0”. The parameter of interest is then the average of these votes, which in this particular example is 54% (given by the horizontal grey line). The outcome is verified once the lower confidence sequence exceeds 1/2. The time at which this happens is given by the vertical blue, green, and pink lines.

In particular, given the sequential hypothesis testing problem that arises in SHANGRLA, we can cast the RLA as a sequential estimation problem that can be solved by developing confidence sequences (see Figure 1). As we will see in Section 2, our confidence sequences provide added flexibility and an intuitive visualizable interpretation for SHANGRLA-compatible election audits, without sacrificing any statistical efficiency. [Footnote 1: Code to reproduce all plots can be found at …]

1.3 Contributions and Outline

The contributions of this work are twofold. First, we introduce confidence sequences to the election auditing literature as intuitive and flexible ways of interpreting and visualizing risk-limiting audits. Second, we present algorithms for performing RLAs based on confidence sequences by deriving statistically and computationally efficient nonnegative martingales. At the risk of oversimplifying the issue, modern RLAs face a computational-statistical efficiency tradeoff. Methods such as BRAVO are easy to compute but potentially less statistically efficient than the current state of the art, KMart [stark2019sets], which in turn can be prohibitively expensive to compute for large elections. The methods presented in this paper resolve this tradeoff: they typically match or outperform both BRAVO and KMart, while remaining practical to compute in large elections.

In Section 2, we show how confidence sequences generate risk-limiting audits, how they relate to more familiar RLAs based on sequentially valid $p$-values, and how they can be used to audit multiple contests. Section 3 derives novel confidence sequence-based RLAs and compares them to past RLA methods via simulation. In Section 4, we illustrate how the previously derived techniques can be applied to an audit of Canada’s 43rd federal election. Finally, Section 5 discusses how all of the aforementioned results apply to risk-limiting tallies for coercion-resistant voting schemes.

2 Confidence Sequences are Risk-Limiting

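Consider an election consisting of $N$ ballots. Following SHANGRLA [stark2019sets], suppose that these can be transformed to a set of $[0, u]$-bounded real numbers $x_1, \dots, x_N$ with mean $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i$ for some known $u > 0$. Suppose that electoral assertions can be made purely in terms of $\mu^\star$. A classical $(1-\alpha)$ confidence interval for $\mu^\star$ is an interval $C_n$ computed from data $X_1, \dots, X_n$ with the guarantee that

$$\forall n \in [N], \quad \mathbb{P}(\mu^\star \notin C_n) \leq \alpha.$$

In contrast, a confidence sequence for $\mu^\star$ is a sequence of confidence sets $C_1, \dots, C_N$, which all simultaneously capture $\mu^\star$ with probability at least $1 - \alpha$. That is,

$$\mathbb{P}\left(\exists t \in [N] : \mu^\star \notin C_t\right) \leq \alpha \quad \iff \quad \mathbb{P}\left(\forall t \in [N],\ \mu^\star \in C_t\right) \geq 1 - \alpha.$$

The two probabilistic statements above are equivalent, but provide a different way of interpreting $\alpha$ and the corresponding guarantee.

If we have access to a confidence sequence for $\mu^\star$, we can audit any assertion about the election outcome made in terms of $\mu^\star$ with risk limit $\alpha$. Here, we use $\mathcal{A} \subseteq [0, u]$ to denote an assertion. For example, SHANGRLA typically uses assertions of the form “$\mu^\star$ is greater than $1/2$”, in which case $\mathcal{A} = (1/2, u]$.

Algorithm: Risk-limiting audits via confidence sequences (RiLACS)

Input: Assertion $\mathcal{A}$, risk limit $\alpha$.
for $t \in [N]$ do
     Randomly sample and remove $X_t$ from the remaining ballots.
     Compute $C_t$ at level $\alpha$.
     if $C_t \subseteq \mathcal{A}$ then
         Certify the assertion $\mathcal{A}$ and stop if desired.
     end if
end for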

If the goal is to finish the audit as soon as possible above all else, then one can ignore the “if desired” condition. However, continued sampling can provide added assurance in $\mathcal{A}$, and maintains the risk limit at $\alpha$. The following theorem summarizes the risk-limiting guarantee of the above algorithm.
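The sampling loop above can be sketched as follows for values in $[0, 1]$ and an assertion of the form “the mean exceeds a threshold”. The function names and the callback signature are assumptions made for illustration. To keep the sketch runnable without any statistical machinery, it plugs in a deliberately crude lower bound, `trivial_lower_bound`, which holds with probability one because unseen values are at least 0; a real audit would substitute a much tighter lower confidence sequence such as those derived in Section 3.

```python
import random


def trivial_lower_bound(seen, n_total):
    """Deterministic lower bound on the mean of N values in [0, 1]:
    every unseen value is at least 0, so the mean is at least sum(seen) / N."""
    return sum(seen) / n_total


def rilacs_audit(population, lower_bound_fn, threshold=0.5, seed=0):
    """Sample ballots without replacement and certify 'mean > threshold'
    as soon as the lower confidence bound exceeds the threshold.

    Returns (certified, number_of_ballots_examined)."""
    rng = random.Random(seed)
    remaining = list(population)
    rng.shuffle(remaining)  # a uniformly random order = sampling without replacement
    seen = []
    for x in remaining:
        seen.append(x)
        if lower_bound_fn(seen, len(population)) > threshold:
            return True, len(seen)
    # Every ballot was examined without certifying: this is a full hand count.
    return False, len(seen)


# Hypothetical election: 70 votes for Alice (1) and 30 for Bob (0).
certified, n_examined = rilacs_audit([1] * 70 + [0] * 30, trivial_lower_bound)
```

With the trivial bound the audit only stops once Alice's observed votes alone exceed half of all ballots, which is why tighter confidence sequences matter for workload.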

Theorem 1.

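Let $(C_t)_{t=1}^N$ be a $(1-\alpha)$ confidence sequence for $\mu^\star$. Let $\mathcal{A}$ be an assertion about the electoral outcome (in terms of $\mu^\star$). The audit mechanism that certifies $\mathcal{A}$ as soon as $C_t \subseteq \mathcal{A}$ has risk limit $\alpha$.

Proof.

We need to prove that if $\mu^\star \notin \mathcal{A}$, then $\mathbb{P}(\exists t \in [N] : C_t \subseteq \mathcal{A}) \leq \alpha$. First, notice that if $C_t \subseteq \mathcal{A}$, then we must have that $\mu^\star \notin C_t$ since $\mu^\star \notin \mathcal{A}$. Then,

$$\mathbb{P}\left(\exists t \in [N] : C_t \subseteq \mathcal{A}\right) \leq \mathbb{P}\left(\exists t \in [N] : \mu^\star \notin C_t\right) \leq \alpha,$$

where the second inequality follows from the definition of a confidence sequence. This completes the proof. ∎

Let us see how this theorem can be used in an example. Consider an election with two candidates, Alice and Bob, and a total of $N$ cast ballots. Let $x_1, \dots, x_N \in \{0, 1/2, 1\}$ be the list of numbers that result from encoding votes for Alice as 1, votes for Bob as 0, and ballots that do not contain a valid vote as $1/2$. Let $(C_t)_{t=1}^N$ be a $(1-\alpha)$ confidence sequence for $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i$. If we wish to audit the assertion that “Alice beat Bob”, then $\mathcal{A} = (1/2, 1]$ and $u = 1$. We can sequentially sample $X_1, X_2, \dots$ without replacement, certifying the assertion once $C_t \subseteq (1/2, 1]$. By Theorem 1, this limits the risk to level $\alpha$.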

2.1 Relationship to Sequential Hypothesis Testing

The earliest work on RLAs did not use anytime $p$-values [stark2008conservative, stark2009cast], but since about 2009, most RLA methods have used anytime $p$-values to conduct sequential hypothesis tests [stark2009risk, ottoboni2018risk, ottoboni2019bernoulli, stark2019sets, huang2020unified]. An anytime $p$-value is a sequence of $p$-values $(p_t)_{t=1}^N$ with the property that under some null hypothesis $H_0$,

$$\mathbb{P}_{H_0}\left(\exists t \in [N] : p_t \leq \alpha\right) \leq \alpha. \tag{1}$$

The anytime $p$-values are typically defined implicitly for each null hypothesis $H_0: \mu^\star = \mu$ and yield a sequential hypothesis test $\phi_t^\mu := \mathbb{1}(p_t^\mu \leq \alpha)$. As alluded to in Section 1.2, this immediately recovers a confidence sequence:

$$C_t := \{\mu \in [0, u] : \phi_t^\mu = 0\}.$$

Notice in Figure 2 that the times at which nulls are rejected (or “stopping times”) are the same for both confidence sequences and the associated $p$-values. Thus, nothing is lost by basing the RLA on confidence sequences rather than anytime $p$-values. Confidence sequences benefit from being visually intuitive and are arguably easier to interpret than anytime $p$-values.

For example, consider conducting an RLA for a simple two-candidate election between Alice and Bob with no invalid votes. Suppose that it is reported that Alice won, i.e., $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i > 1/2$ where $x_i = 1$ if the $i$th ballot is for Alice, 0 if for Bob, and $1/2$ if the ballot does not contain a valid vote for either candidate. A sequential RLA in the SHANGRLA framework would posit a null hypothesis $H_0: \mu^\star \leq 1/2$ (the complement of the announced result: Bob actually won or the outcome is a tie), sample random ballots sequentially, and stop the audit (confirming the announced result) if and when $H_0$ is rejected at significance level $\alpha$. If $H_0$ is not rejected before all ballots have been inspected, the true outcome is known. [Footnote 2: At any point during the sampling, an election official can choose to abort the sampling and perform a full hand count for any reason. This cannot increase the risk limit: the chance of failing to correct an incorrect reported outcome does not increase.]

Figure 2: The duality between anytime $p$-values and confidence sequences for three nulls of the form $H_0: \mu^\star \leq \mu_0$. The $p$-value for $H_0: \mu^\star \leq 0.45$ (pink dash-dotted line) drops below $\alpha = 0.05$ after 578 samples, exactly when the lower confidence sequence exceeds 0.45. However, the $p$-value for $H_0: \mu^\star \leq 0.5$ never reaches $0.05$ and the 95% confidence sequence never excludes 0.5, the true value of $\mu^\star$.

On the other hand, a ballot-polling RLA [lindeman2012bravo] based on confidence sequences proceeds by computing a lower confidence bound for the fraction $\mu^\star$ of votes for Alice. The audit stops, confirming the outcome, if and when this lower bound is larger than 1/2. If that does not occur before the last ballot has been examined, the true outcome is known. In this formulation, there is no need to define a null hypothesis as the complement of the announced result, interpret the resulting $p$-value, and so on. The approach also works for comparison audits using the “overstatement assorter” approach developed in [stark2019sets], which transforms the problem into the same canonical form: testing whether the mean of any list in a collection of nonnegative, bounded lists is less than 1/2.

2.2 Auditing Multiple Contests

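It is known that RLAs of multi-candidate, multi-winner elections can be reduced to several pairwise contests without adjusting for multiplicity [lindeman2012bravo]. This is accomplished by testing whether every single reported winner beat every single reported loser, and stopping once each of these tests rejects their respective nulls at level $\alpha$. For example, suppose it is reported that a set of candidates $\mathcal{W}$ beat a set of candidates $\mathcal{L}$ in a $k$-winner plurality contest with $K$ candidates in all (that is, $|\mathcal{W}| = k$ and $|\mathcal{L}| = K - k$). For each reported winner $w \in \mathcal{W}$ and each reported loser $\ell \in \mathcal{L}$, encode votes for candidate $w$ as “1”, votes for $\ell$ as “0” and ballots with no valid vote in the contest or with a vote for any other candidate as “1/2” to obtain the population $x_1^{w,\ell}, \dots, x_N^{w,\ell}$. Then as before, candidate $w$ beat candidate $\ell$ if and only if $\mu^\star_{w,\ell} := \frac{1}{N} \sum_{i=1}^N x_i^{w,\ell} > 1/2$. In a two-candidate plurality election we would have proceeded by testing the null $H_0: \mu^\star \leq 1/2$ against the alternative $H_1: \mu^\star > 1/2$. To use the decomposition of a single-winner or multi-winner plurality contest into a set of pairwise contests, we test each null $H_0^{w,\ell}: \mu^\star_{w,\ell} \leq 1/2$ for $w \in \mathcal{W}$ and $\ell \in \mathcal{L}$. The audit stops if and when all $k(K-k)$ null hypotheses are rejected. Crucially, if candidate $w$ did not win (i.e. $\mu^\star_{w,\ell} \leq 1/2$ for some $\ell \in \mathcal{L}$), then

$$\mathbb{P}\left(\text{reject } H_0^{w',\ell'} \text{ for all } w' \in \mathcal{W},\ \ell' \in \mathcal{L}\right) \leq \mathbb{P}\left(\text{reject } H_0^{w,\ell}\right) \leq \alpha.$$

The same technique applies when auditing with confidence sequences. Let $(C_t^{w,\ell})_{t=1}^N$ be $(1-\alpha)$ confidence sequences for $\mu^\star_{w,\ell}$, $w \in \mathcal{W}$, $\ell \in \mathcal{L}$. We verify the electoral outcome of every contest once $C_t^{w,\ell} \subseteq (1/2, 1]$ for all $w \in \mathcal{W}$, $\ell \in \mathcal{L}$. Again, if $\mu^\star_{w,\ell} \leq 1/2$ for some $w \in \mathcal{W}$ and $\ell \in \mathcal{L}$, then

$$\mathbb{P}\left(\exists t \in [N] : C_t^{w',\ell'} \subseteq (1/2, 1] \text{ for all } w' \in \mathcal{W},\ \ell' \in \mathcal{L}\right) \leq \mathbb{P}\left(\exists t \in [N] : C_t^{w,\ell} \subseteq (1/2, 1]\right) \leq \alpha.$$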

This technique can be generalized to handle audits of any number of contests from the same audit sample, as explained in [stark2019sets]. For the sake of brevity, we omit the derivation, but it is a straightforward extension of the above.
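As a small illustration of the pairwise decomposition described in this section, the sketch below builds one encoded population mean per (reported winner, reported loser) pair. The candidate labels, vote counts, and function names are hypothetical.

```python
from itertools import product


def pairwise_mean(ballots, winner, loser):
    """Mean of the pairwise encoding: 1 for the reported winner,
    0 for the reported loser, 1/2 for anything else."""
    total = 0.0
    for vote in ballots:
        if vote == winner:
            total += 1.0
        elif vote == loser:
            total += 0.0
        else:
            total += 0.5  # other candidate or no valid vote
    return total / len(ballots)


def pairwise_means(ballots, winners, losers):
    """One population mean per (winner, loser) pair."""
    return {(w, l): pairwise_mean(ballots, w, l) for w, l in product(winners, losers)}


# Hypothetical single-winner contest: A is the reported winner over B and C.
ballots = ["A"] * 40 + ["B"] * 35 + ["C"] * 20 + [None] * 5
means = pairwise_means(ballots, winners=["A"], losers=["B", "C"])

# The reported outcome is correct exactly when every pairwise mean exceeds 1/2.
outcome_correct = all(m > 0.5 for m in means.values())
```

In an audit these means are of course unknown; each pair would instead get its own lower confidence sequence, and the outcome is certified once all of them exceed 1/2.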

3 Designing Powerful Confidence Sequences for RLAs

So far we have discussed how to conduct RLAs from confidence sequences for the parameter $\mu^\star$. In this section, we will discuss how to derive powerful confidence sequences for the purposes of conducting RLAs as efficiently as possible. For mathematical and notational convenience in the following derivations, we consider the case where $u = 1$. Note that nothing is lost in this setup since any population of $[0, u]$-bounded numbers can be scaled to the unit interval $[0, 1]$ by dividing each element by $u$ (thereby scaling the population’s mean as well).

As discussed in Section 2.1, we can construct confidence sequences by “inverting” sequential hypothesis tests. In particular, given a sequential hypothesis test $(\phi_t^\mu)_{t=1}^N$, the sequence of sets,

$$C_t := \{\mu \in [0, 1] : \phi_t^\mu = 0\},$$

forms a $(1-\alpha)$ confidence sequence for $\mu^\star$. Consequently, in order to develop powerful RLAs via confidence sequences, we can simply focus on carefully designing sequential tests $(\phi_t^\mu)_{t=1}^N$. [Footnote 3: Notice that it is not always feasible to compute the set of all $\mu \in [0, 1]$ such that $\phi_t^\mu = 0$ since $[0, 1]$ is uncountably infinite. However, all confidence sequences we will derive in this section are intervals (i.e. convex), and thus we can find the endpoints using a simple grid search or standard root-finding algorithms.]

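To design sequential hypothesis tests, we start by finding martingales that translate to powerful tests. To this end, define $M_0(\mu) := 1$ and consider the following process for $t \in [N]$:

$$M_t(\mu) := \prod_{i=1}^t \left(1 + \lambda_i \left(X_i - m_i(\mu)\right)\right),$$

where $\lambda_i \in [0, 1/m_i(\mu)]$ is a tuning parameter depending only on $X_1, \dots, X_{i-1}$, and

$$m_i(\mu) := \frac{N\mu - \sum_{j=1}^{i-1} X_j}{N - (i-1)}$$

is the conditional mean of $X_i$ given $X_1, \dots, X_{i-1}$ if the mean of $x_1, \dots, x_N$ were $\mu$.

Following [waudby2020estimating, Section 6], the process $(M_t(\mu^\star))_{t=0}^N$ is a nonnegative martingale starting at one. Formally, this means that $M_0(\mu^\star) = 1$, $M_t(\mu^\star) \geq 0$, and

$$\mathbb{E}\left(M_t(\mu^\star) \mid X_1, \dots, X_{t-1}\right) = M_{t-1}(\mu^\star)$$

for each $t \in [N]$. Importantly for our purposes, nonnegative martingales are unlikely to ever become very large. This fact is known as Ville’s inequality [ville1939etude, howard_exponential_2018], which serves as a generalization of Markov’s inequality to nonnegative (super)martingales, and can be stated formally as

$$\mathbb{P}\left(\exists t \in [N] : M_t(\mu^\star) \geq 1/\alpha\right) \leq \alpha M_0(\mu^\star) = \alpha,$$

where $\alpha \in (0, 1)$, and the equality follows from the fact that $M_0(\mu^\star) = 1$. As alluded to in Section 2, $M_t(\mu^\star)$ can be interpreted as the reciprocal of an anytime $p$-value:

$$\mathbb{P}\left(\exists t \in [N] : \frac{1}{M_t(\mu^\star)} \leq \alpha\right) \leq \alpha,$$

which matches the probabilistic guarantee in (1). As a direct consequence of Ville’s inequality, if we define the test $\phi_t^\mu := \mathbb{1}(M_t(\mu) \geq 1/\alpha)$, then

$$\mathbb{P}\left(\exists t \in [N] : \phi_t^{\mu^\star} = 1\right) \leq \alpha,$$

and thus $(\phi_t^\mu)_{t=1}^N$ is a level-$\alpha$ sequential hypothesis test. We can then invert $(\phi_t^\mu)_{t=1}^N$ and apply Theorem 1 to obtain confidence sequence-based RLAs with risk limit $\alpha$.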
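A minimal sketch of the product process and the resulting Ville-style test is given below. It assumes values in $[0, 1]$ and, purely for illustration, a constant bet `lam` that is capped so each factor stays nonnegative; the function names and the constant-bet choice are assumptions of this sketch, not the tuned bets discussed in Section 3.1.

```python
import math


def conditional_mean(mu, past_sum, n_seen, n_total):
    """Mean of the not-yet-sampled values if the population mean were mu."""
    return (n_total * mu - past_sum) / (n_total - n_seen)


def betting_martingale(xs, mu, n_total, lam=1.0):
    """Running values of the product process for the null 'mean equals mu'.
    `lam` is an illustrative constant bet, capped at 1 / conditional mean."""
    values, m_t, past_sum = [], 1.0, 0.0
    for n_seen, x in enumerate(xs):
        c = conditional_mean(mu, past_sum, n_seen, n_total)
        if c < 0 or (c == 0 and x > 0):
            # The ballots seen so far are impossible under this null.
            m_t = math.inf
        elif c > 0:
            m_t *= 1.0 + min(lam, 1.0 / c) * (x - c)
        values.append(m_t)
        past_sum += x
    return values


def first_rejection_time(xs, mu, n_total, alpha=0.05, lam=1.0):
    """First t (1-indexed) at which the process reaches 1/alpha, else None."""
    for t, m_t in enumerate(betting_martingale(xs, mu, n_total, lam), start=1):
        if m_t >= 1.0 / alpha:
            return t
    return None


# Hypothetical sample: the first ten ballots drawn are all for Alice (N = 1000).
stop = first_rejection_time([1] * 10, mu=0.5, n_total=1000, alpha=0.05, lam=1.0)
```

Here each Alice ballot multiplies the process by roughly 1.5 when testing the null mean 1/2, so the threshold $1/\alpha = 20$ is first crossed at the eighth ballot.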

3.1 Designing Martingales and Tests from Reported Vote Totals

So far, we have found a process that is a nonnegative martingale when $\mu = \mu^\star$, but what happens when $\mu \neq \mu^\star$? This is where the tuning parameters $(\lambda_t)_{t=1}^N$ come into the picture. Recall that an electoral assertion $\mathcal{A}$ is certified once $C_t \subseteq \mathcal{A}$. Therefore, to audit assertions quickly, we want $C_t$ to be as tight as possible. Since $C_t$ is defined as the set of $\mu$ such that $M_t(\mu) < 1/\alpha$, we can make $C_t$ tight by making $M_t(\mu)$ as large as possible. To do so, we must carefully choose $(\lambda_t)_{t=1}^N$. This choice will depend on the type of election as well as the amount of information provided prior to the audit. First consider the case where reported vote totals are given (in addition to the announced winner).

Figure 3: Ballot-polling audit workload distributions under four possible outcomes of a two-candidate plurality election. Workload is defined as the number of distinct ballots examined before completing the audit. The first example considers an outcome where Alice and Bob received 2750 and 2250 votes respectively, and no ballots were invalid, for a margin of $0.1$. The second, third, and fourth examples have the same margin, but with increasing numbers of invalid or “nuisance” ballots represented by $N_U$. Notice that in the case with no nuisance ballots, a priori Kelly and BRAVO have an edge, while in the setting with many nuisance ballots, a priori Kelly vastly outperforms BRAVO. On the other hand, neither SqKelly nor dKelly require tuning based on the reported outcomes, but SqKelly outperforms dKelly in all four scenarios.

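For example, recall the election between Alice and Bob of Section 2, and suppose that $x_1, \dots, x_N$ is the list of numbers encoding votes for Alice as 1, votes for Bob as 0, and ballots with no valid vote for either candidate as 1/2. Recall that Alice beat Bob if and only if $\mu^\star > 1/2$, so we are interested in testing the null hypothesis $H_0: \mu^\star \leq 1/2$. Suppose it is reported that Alice beat Bob with $N_A'$ votes for Alice, $N_B'$ for Bob, and $N_U'$ nuisance votes (i.e. either invalid or for another party). If the reported outcome is correct, then for any fixed $\lambda$, we know the exact value of

$$\prod_{i=1}^N \left(1 + \lambda (x_i - 1/2)\right) = (1 + \lambda/2)^{N_A'} (1 - \lambda/2)^{N_B'}, \tag{4}$$

which is an inexact but reasonable proxy for $M_N(1/2)$, the final value of the process $(M_t(1/2))_{t=0}^N$. Nuisance ballots have $x_i = 1/2$ and therefore contribute a factor of one. We can then choose the value of $\lambda$ that maximizes (4). Some algebra reveals that the maximizer of (4) is given by

$$\lambda' := 2\,\frac{N_A' - N_B'}{N_A' + N_B'}. \tag{5}$$

We then truncate to obtain

$$\lambda_t^{\mathrm{apK}} := \min\left\{\lambda',\ \frac{1}{m_t(1/2)}\right\}, \tag{6}$$

ensuring that it lies in the allowable range $[0, 1/m_t(1/2)]$, where $m_t(\mu)$ denotes the conditional mean of $X_t$ given $X_1, \dots, X_{t-1}$ if the population mean were $\mu$. We call this choice of $\lambda_t^{\mathrm{apK}}$ a priori Kelly due to its connections to Kelly’s criterion [kelly1956new, waudby2020estimating] for maximizing products of the form (4). This choice of $\lambda_t^{\mathrm{apK}}$ also has the desirable property of yielding convex confidence sequences, which we summarize below.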

Proposition 1.

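Let $X_1, \dots, X_N$ be a sequential random sample from $\{x_1, \dots, x_N\}$ with $x_i \in [0, 1]$. Consider $\lambda_t^{\mathrm{apK}}$ from (6) and define the process $M_t(\mu) := \prod_{i=1}^t \left(1 + \lambda_i^{\mathrm{apK}} (X_i - m_i(\mu))\right)$ for any $\mu \in [0, 1]$, where $m_i(\mu)$ is the conditional mean of $X_i$ given $X_1, \dots, X_{i-1}$ if the population mean were $\mu$. Then the confidence set

$$C_t := \{\mu \in [0, 1] : M_t(\mu) < 1/\alpha\}$$

is an interval with probability one.

Proof.

Notice that since $\lambda_t^{\mathrm{apK}} \geq 0$ and $m_t(\mu)$ is nondecreasing in $\mu$, we have that

$$1 + \lambda_t^{\mathrm{apK}} \left(X_t - m_t(\mu)\right)$$

is a nonincreasing function of $\mu$ for each $t$. Consequently, $M_t(\mu)$ is a nonincreasing and quasiconvex function of $\mu$, so its sublevel sets are convex. ∎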
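Because the confidence set is an interval, its lower endpoint can be located by bisection rather than an exhaustive search. The sketch below is one way to do this under stated assumptions: values lie in $[0, 1]$, the bet `lam` is a fixed number supplied by the caller (for instance a reported-margin-based choice), and the bet is capped at one over the conditional mean for each candidate $\mu$ so that every factor stays nonnegative. The function names and the capping rule are illustrative, not the authors' implementation.

```python
import math


def martingale_value(xs, mu, n_total, lam):
    """Value of the product process after observing xs, for candidate mean mu.
    The bet is capped at 1 / (conditional mean) so each factor is nonnegative;
    with this cap the value is nonincreasing in mu."""
    m_t, past_sum = 1.0, 0.0
    for n_seen, x in enumerate(xs):
        c = (n_total * mu - past_sum) / (n_total - n_seen)
        if c < 0 or (c == 0 and x > 0):
            return math.inf  # the sample is impossible if the mean were mu
        if c > 0:
            m_t *= 1.0 + min(lam, 1.0 / c) * (x - c)
        past_sum += x
    return m_t


def lower_confidence_bound(xs, n_total, lam, alpha=0.05, tol=1e-6):
    """Conservative lower endpoint of {mu in [0, 1] : M_t(mu) < 1/alpha}."""
    threshold = 1.0 / alpha
    if martingale_value(xs, 0.0, n_total, lam) < threshold:
        return 0.0  # even mu = 0 is not rejected
    lo, hi = 0.0, 1.0  # invariant: mu = lo is rejected, mu = hi is not
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if martingale_value(xs, mid, n_total, lam) >= threshold:
            lo = mid
        else:
            hi = mid
    return lo  # every mu <= lo is rejected, so lo never overstates the bound


# Hypothetical sample: eight ballots, all for Alice, out of N = 1000.
bound = lower_confidence_bound([1] * 8, n_total=1000, lam=1.0)
```

Returning `lo` rather than `hi` errs on the conservative side, so numerical tolerance can only delay certification, never cause a premature one.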

Note that any sequence $(\lambda_t)_{t=1}^N$ whose elements lie in the allowable range would have yielded a valid nonnegative martingale, but we chose that which maximizes (4) so that the resulting hypothesis test is powerful. In situations more complex than two-candidate plurality contests, the maximizer of (4) can still be found efficiently via standard root-finding algorithms. All of these methods are implemented in our Python package.
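For the two-candidate case, the maximizer can be checked numerically. The sketch below computes twice the reported margin among valid votes and compares the log of the product in (4) at that bet against a grid of alternatives; `truncate_bet` clips a bet into a range of the form $[0, 1/\text{conditional mean}]$. All function names are hypothetical, and the vote totals are the ones quoted in the Figure 3 caption.

```python
import math


def apriori_kelly_bet(n_alice, n_bob):
    """Bet maximizing (1 + lam/2)^n_alice * (1 - lam/2)^n_bob over lam."""
    return 2.0 * (n_alice - n_bob) / (n_alice + n_bob)


def log_wealth(lam, n_alice, n_bob):
    """Log of the product in (4); nuisance ballots contribute log(1) = 0."""
    return n_alice * math.log1p(lam / 2.0) + n_bob * math.log1p(-lam / 2.0)


def truncate_bet(lam, cond_mean):
    """Clip a bet into [0, 1 / cond_mean] (cond_mean assumed positive)."""
    return max(0.0, min(lam, 1.0 / cond_mean))


# Reported totals from the first example of Figure 3: 2750 vs 2250 votes.
lam_star = apriori_kelly_bet(2750, 2250)

# Numerical sanity check against a grid of bets in [0, 2).
grid = [k / 100.0 for k in range(200)]
best_on_grid = max(grid, key=lambda lam: log_wealth(lam, 2750, 2250))
```

Setting the derivative of `log_wealth` to zero gives $N_A'(2 - \lambda) = N_B'(2 + \lambda)$, which rearranges to the closed form used in `apriori_kelly_bet`.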

While audits based on a priori Kelly display excellent empirical performance (see Figure 3), their efficiency may be hurt when vote totals are erroneously reported. Small errors in reported vote totals seem to have minor adverse effects on stopping times (and in some cases can be slightly beneficial), but larger errors can significantly affect stopping time distributions (see Figure 4). If we wish to audit the reported winner of an election but prefer not to rely on (or do not have access to) exact reported vote totals, we need an alternative to a priori Kelly. In the following section, we describe a family of such alternatives.

Figure 4: Stopping times for a priori Kelly under various degrees of error in reported outcomes. In the above legends, $N_A$ refers to the true number of votes for Alice, while $N_A'$ refers to the incorrectly reported number of votes. Notice that empirical performance is relatively strong when $N_A'$ is close to $N_A$ but is adversely affected when the discrepancy is large, especially in the right-hand side plot with a narrower margin.

3.2 Designing Martingales and Tests without Vote Totals

If the exact vote totals are not known, but we still wish to audit an assertion (e.g. that Alice beat Bob), we need to design a slightly different martingale that does not depend on maximizing (4) directly. Instead of finding an optimal $\lambda$, we will take $D$ points evenly spaced on the allowable range $[0, 1/m_i(\mu)]$, where $m_i(\mu)$ is the conditional mean of $X_i$ given $X_1, \dots, X_{i-1}$ if the population mean were $\mu$, and “hedge our bets” among all of these. Making this more precise, note that a convex combination of martingales (with respect to the same filtration) is itself a martingale [waudby2020estimating], and thus for any $(\theta_1, \dots, \theta_D)$ such that $\theta_d \geq 0$ and $\sum_{d=1}^D \theta_d = 1$, we have that

$$M_t^D(\mu^\star) := \sum_{d=1}^D \theta_d \prod_{i=1}^t \left(1 + \frac{d}{(D+1)\, m_i(\mu^\star)} \left(X_i - m_i(\mu^\star)\right)\right) \tag{7}$$

forms a nonnegative martingale starting at one. Notice that we no longer have to depend on the reported vote totals to begin an audit. Furthermore, confidence sequences generated using sublevel sets of $M_t^D(\mu)$ are intervals with probability one [waudby2020estimating, Proposition 4]. Nevertheless, choosing $(\theta_1, \dots, \theta_D)$ is a nontrivial task. A natural (but as we will see, suboptimal) choice is to set $\theta_d = 1/D$ for each $d$. Previous works [waudby2020estimating] call this dKelly (for “diversified Kelly”), a name we adopt here. In fact, this choice of $(\theta_d)_{d=1}^D$ gives an arbitrarily close and computationally efficient approximation to the Kaplan martingale (KMart) [stark2019sets] which can otherwise be prohibitively expensive to compute for large $N$.
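The diversified construction can be sketched as a weighted sum of $D$ product processes, one per grid point in the allowable betting range. The code below is an illustrative reading of that idea with uniform (dKelly) weights; the function names, the grid parameterization $d/((D+1)\,c)$ with $c$ the conditional mean, and the tiny population used for the check are all assumptions of the sketch. The final lines verify the martingale property by brute force, averaging over every sampling order of a four-ballot population.

```python
import math
from itertools import permutations


def diversified_martingale(xs, mu, n_total, weights):
    """Weighted sum of D product processes, where process d bets the
    fraction d / (D + 1) of the maximum allowable bet 1 / conditional mean."""
    n_grid = len(weights)
    products = [1.0] * n_grid
    past_sum = 0.0
    for n_seen, x in enumerate(xs):
        c = (n_total * mu - past_sum) / (n_total - n_seen)
        if c < 0 or (c == 0 and x > 0):
            return math.inf  # sample impossible if the mean were mu
        if c > 0:
            for d in range(1, n_grid + 1):
                bet = d / ((n_grid + 1) * c)
                products[d - 1] *= 1.0 + bet * (x - c)
        past_sum += x
    return sum(w * p for w, p in zip(weights, products))


def dkelly_weights(n_grid):
    """Uniform convex weights."""
    return [1.0 / n_grid] * n_grid


# Brute-force check on a tiny hypothetical population: at the true mean,
# the value after two draws averages to one over all sampling orders.
population = [1.0, 0.0, 1.0, 0.5]
mu_star = sum(population) / len(population)
weights = dkelly_weights(10)
orders = list(permutations(population))
average = sum(
    diversified_martingale(order[:2], mu_star, len(population), weights)
    for order in orders
) / len(orders)
```

Computing $D$ running products costs $O(D)$ work per ballot, which is what makes this kind of mixture cheap compared with exact integration over all bets.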

Figure 5: Various values of the convex weights $(\theta_1, \dots, \theta_D)$, which can be used in the construction of the diversified martingale (7). Notice that the linear and square weights are largest for $d$ near 0, and decrease as $d$ approaches a cutoff, finally remaining at 0 for all large $d$. Smaller values of $d$ are upweighted since they correspond to those values of $\lambda$ in the allowable range that are optimal for smaller (i.e. interesting) electoral margins. This is in contrast to the constant weight function, which sets $\theta_d = 1/D$ for each $d$. We find that square weights perform well in practice (see Figure 3) but these can be tuned and tailored based on prior knowledge and the particular problem at hand.

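Better choices of $(\theta_1, \dots, \theta_D)$ exist for the types of elections one might encounter in practice. Recall that near-optimal values of $\lambda$ are given by (5). However, setting $\theta_d = 1/D$ for each $d$ implicitly treats all grid points as equally reasonable values of $\lambda$. Elections with large values of $\mu^\star$ (e.g. closer to 1) are “easier” to audit, and the interesting or “difficult” regime is when $\mu^\star$ is close to (but strictly larger than) 1/2. Therefore, we recommend designing $(\theta_1, \dots, \theta_D)$ so that $M_t^D$ upweights optimal values of $\lambda$ for margins close to 0, and downweights those for margins close to 1. Consider the following concrete examples. First, we have the truncated-square weights,

$$\theta_d^{\mathrm{square}} := \frac{\gamma_d^{\mathrm{square}}}{\sum_{k=1}^D \gamma_k^{\mathrm{square}}},$$

where the unnormalized weight $\gamma_d^{\mathrm{square}}$ decreases quadratically in $d$ until it reaches zero and stays there (see Figure 5), and we normalize by $\sum_{k=1}^D \gamma_k^{\mathrm{square}}$ to ensure that $\sum_{d=1}^D \theta_d^{\mathrm{square}} = 1$. Another sensible choice is given by the truncated-linear weights, where we simply replace $\gamma_d^{\mathrm{square}}$ by a linearly decreasing, truncated counterpart $\gamma_d^{\mathrm{linear}}$. These values of $\theta_d^{\mathrm{linear}}$ and $\theta_d^{\mathrm{square}}$ are large for small $d$ and small for large $d$, and hence the summands in the martingale given by (7) are upweighted for implicit values of $\lambda$ which are optimal for “interesting” margins close to 0, and downweighted for simple margins much larger than 0 (see Figure 5).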
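A sketch of such truncated weights is shown below. The exact functional form and truncation point are not reproduced here, so the default `cutoff=1/3`, the grid parameterization $d/(D+1)$, and the function name are assumptions chosen only to illustrate the shape described above: large weights for small $d$, decaying to exactly zero beyond a cutoff, then normalized to sum to one.

```python
def truncated_power_weights(n_grid, power=2, cutoff=1.0 / 3.0):
    """Convex weights proportional to max(0, cutoff - d / (n_grid + 1)) ** power.

    power=2 gives a truncated-square shape and power=1 a truncated-linear one.
    The cutoff value is an illustrative assumption."""
    grid = [d / (n_grid + 1) for d in range(1, n_grid + 1)]
    gamma = [max(0.0, cutoff - g) ** power for g in grid]
    total = sum(gamma)
    if total == 0.0:
        raise ValueError("cutoff is too small for this grid; all weights vanish")
    return [g / total for g in gamma]


square_weights = truncated_power_weights(99, power=2)
linear_weights = truncated_power_weights(99, power=1)
```

Either list can be passed wherever a vector of convex weights $(\theta_1, \dots, \theta_D)$ is expected.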

When $\theta^{\mathrm{square}}$ is combined with $M_t^D$, we refer to the resulting martingales and confidence sequences as SqKelly. We compare their empirical workload against that of a priori Kelly, dKelly, and BRAVO in Figure 3. A hybrid approach is also possible: suppose we want to use reported outcomes or prior knowledge alongside these convex-weighted martingales. We can simply choose $(\theta_1, \dots, \theta_D)$ so that $M_t^D$ upweights values in a neighborhood of $\lambda'$ (or some other value chosen based on prior knowledge). [Footnote 5: The use of the word “prior” here should not be interpreted in a Bayesian sense. No matter what values of $(\theta_1, \dots, \theta_D)$ are chosen, the resulting tests and confidence sequences have frequentist risk-limiting guarantees.]

4 Illustration: Auditing Canada’s 43rd Federal Election

We now apply the techniques derived in Section 3 to risk-limiting audits of the 2019 Canadian federal election, which is made up of many plurality contests between 6 major political parties. [Footnote 6: While Canada has many registered political parties, only a handful have come close to winning seats in the house of commons, and hence should be considered in an audit. As a somewhat arbitrary rule, we considered those parties which satisfied the Leaders’ Debates Commission’s 2019 participation criteria. These consisted of The Liberal Party of Canada, The Conservative Party of Canada (PC), The New Democratic Party (NDP), The Green Party, The Bloc Québécois (Bloc), and the People’s Party of Canada (PPC). Independent candidates were also included where appropriate.]

Figure 6: A map of Canada’s 338 ridings, each representing one seat in the house of commons. Ridings are colored according to which party received the greatest number of votes in the 2019 federal election. The PPC is omitted from the legend here as they did not win any seats.

The country is made up of 338 so-called “ridings” (see Figure 6). These are geographic regions, each corresponding to one seat in the house of commons. For each riding, a multi-party, single-winner plurality contest takes place where the winner is awarded the respective seat. Generally speaking, the party with the greatest number of seats forms government (there are exceptions to this but these will not be important for the purposes of auditing). In US elections, states and electoral college votes play similar roles to ridings and seats, respectively. Since each riding’s underlying contest takes the form of a multi-party, single-winner plurality election, we can simply apply the techniques for auditing multiple contests outlined in Section 2.2 alongside the martingales and confidence sequences developed in Section 3.

The data-driven web application

We designed and developed an interactive Python- and Bokeh-based [bokeh] web application where users can display audits of any Canadian riding in a single click. This combined two data sources: one for electoral outcomes as recorded by hand-counted paper ballots in the 2019 federal election [elections_act_2019, canadaVotingResultsRaw], and one to draw the map of electoral districts [canadaElectoralDistricts]. After cleaning and merging, the data consisted of 347 records. Each record consists of a geographic information systems (GIS) polygon to draw the riding, vote totals for each party, and other information. The additional 9 records correspond to islands which are not separate ridings but require their own GIS polygon to be drawn on a map.

Figure 7: Example risk-limiting audit for the riding of Waterloo, Ontario using SqKelly. This screenshot was captured after zooming the map of Figure 6 in on southern Ontario. In this example, it was (correctly) reported that the Liberal party received 31,085 out of 63,708 total votes. Clicking on Waterloo’s polygon will begin the audit shown in the right-hand side, which displays six lower confidence sequences for the pairwise contests between the Liberal party and each reportedly losing party. The Liberal party’s win is certified once each of these confidence sequences exceeds 1/2, which in this case happened after sampling roughly 160 ballots.

Following the notation of Section 2.2, recall that the electoral parameter of interest is defined as

$$\mu^\star_{w,\ell} := \frac{1}{N} \sum_{i=1}^N x_i^{w,\ell}, \quad \text{where}$$

  • $x_i^{w,\ell} = 1$ if the $i$th ballot shows a vote for $w$,

  • $x_i^{w,\ell} = 0$ if the $i$th ballot shows a vote for $\ell$, and

  • $x_i^{w,\ell} = 1/2$ if the $i$th ballot shows a vote for any other party.

Also recall that the reported assertion, “$w$ received more votes than $\ell$ for each $\ell \in \mathcal{L}$”, is certified once the lower confidence sequences for $\mu^\star_{w,\ell}$ exceed 1/2 for each $\ell \in \mathcal{L}$. Furthermore, this yields an RLA with risk limit $\alpha$, without needing to perform any multiplicity adjustments for constructing several confidence sequences (see Section 2.2 for more details). For example, the right-hand side plot of Figure 7 displays an RLA with risk limit $\alpha$ for the assertion “the Liberal party received the largest number of votes” by computing six lower confidence sequences for $\mu^\star_{w,\ell}$, where $w$ is the Liberal party and $\ell$ ranges over the reportedly losing parties.

It is important to keep in mind that electoral outcomes in the underlying data sets correspond to hand-counted paper ballot vote totals [elections_act_2019, canadaVotingResultsRaw]. Therefore, the right-hand side plot in the web application (e.g. Figure 7) demonstrates the length of time that an audit would last, given correctly-reported outcomes, and assuming that the recorded data match the true votes cast. In practice, our confidence sequences would only rely on an assertion to audit (e.g. “The Liberal party received the most votes”) and a simple random sample without replacement from the physical stack of ballots cast. Moreover, the web application is easily adapted to this practical scenario, an extension we plan to pursue in future work.

A key feature of this app is its interactivity. Users can hover their cursors over ridings to see reported vote totals, click and drag the map around, zoom in on regions of interest, and so on. When the user has found a riding they wish to audit, they can simply click on that riding’s polygon to immediately compute lower confidence sequences and begin the RLA (see Figure 7). Server-side computation and client-side updates are fully asynchronous, meaning users can interact with the app while the audit is being conducted, and the audit will not “lock up”. A demo of these features can be found and the code is available on

5 Risk-Limiting Tallies via Confidence Sequences

Rather than audit an already-announced electoral outcome, it may be of interest to determine (for the purposes of making a first announcement) the election winner with high probability, without counting all ballots. Such procedures are known as risk-limiting tallies (RLTs), which were developed for coercion-resistant, end-to-end verifiable voting schemes [jamroga2019risk]. For example, suppose a voter is being coerced to vote for Bob. If the final vote tally reveals that Bob received few or no votes, then the coercer will suspect that the voter did not comply with instructions. RLTs mitigate this issue by providing high-probability guarantees that the reported winner truly won, while leaving a large proportion of votes shrouded. In such cases, the voter is guaranteed plausible deniability, as they can claim to the coercer that their ballot is simply among the unrevealed ones.

While the motivations for RLTs are quite different from those for RLAs, the underlying techniques are similar, and the same is true for confidence sequence-based RLTs. All methods introduced in this paper can be applied to RLTs (with the exception of “a priori Kelly”, since it depends on the reported outcome), but they must be given two-sided power. Consider the martingales we discussed in Section 3.2,

$$M_t(\mu) := \sum_{d=1}^{D} \theta_d \prod_{i=1}^{t} \left(1 + \lambda_i^{d} \left(X_i - C_i(\mu)\right)\right), \qquad (8)$$

where $(\theta_1, \dots, \theta_D)$ are convex weights. Recall that our confidence sequences at a given time $t$ were defined as those $\mu$ for which $M_t(\mu) < 1/\alpha$. In other words, a given value $\mu$ is only excluded from the confidence set if $M_t(\mu)$ is large. However, notice that $M_t(\mu)$ will become large if the conditional mean $C_t(\mu^\star)$ is larger than the null conditional mean $C_t(\mu)$, but the same cannot be said if $C_t(\mu^\star) < C_t(\mu)$. As a consequence, the resulting confidence sequences are all one-sided lower confidence sequences. To ensure that our bounds have non-trivial two-sided power, we can simply combine (8) with a martingale that also grows when $C_t(\mu^\star) < C_t(\mu)$.
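
The one-sided behaviour described above can be seen numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses uniform convex weights, a small fixed grid of bets in [0, 1] (which keeps every factor nonnegative), and takes the null conditional mean to be the mean of the unsampled ballots implied by a candidate value `mu` when sampling without replacement. The function name and the sample data are hypothetical.

```python
import math


def one_sided_martingale(xs, mu, N, lambdas):
    """Uniformly weighted (theta_d = 1/D) mixture of product martingales.

    xs      -- encoded ballots in [0, 1], sampled without replacement
    mu      -- candidate (null) value of the population mean
    N       -- total number of ballots
    lambdas -- fixed bets in [0, 1] (an assumption that keeps every factor >= 0)
    """
    products = [1.0] * len(lambdas)
    seen_sum = 0.0
    for i, x in enumerate(xs, start=1):
        # Null conditional mean of the remaining N - (i - 1) ballots.
        c = (N * mu - seen_sum) / (N - (i - 1))
        if c < -1e-12 or c > 1 + 1e-12:
            return math.inf  # mu is impossible given the ballots already seen
        c = min(max(c, 0.0), 1.0)  # guard against floating-point drift
        products = [p * (1 + lam * (x - c)) for p, lam in zip(products, lambdas)]
        seen_sum += x
    return sum(products) / len(products)


N = 1000
bets = [0.25, 0.5, 0.75]
high = one_sided_martingale([1.0] * 20, 0.5, N, bets)  # sample mean above mu
low = one_sided_martingale([0.0] * 20, 0.5, N, bets)   # sample mean below mu
print(high > 1.0, low < 1.0)
```

With positive bets, the process grows only when the sampled ballots sit above the null conditional mean. When they sit below it, the process shrinks toward zero and can never exceed the rejection threshold, which is why a second, mirrored martingale is needed for a tally.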

Figure 8: Confidence sequence-based risk-limiting tally for a two-candidate election. Unlike RLAs, RLTs require two-sided confidence sequences so that the true winner can be determined (with high probability) without access to an announced result. Notice that testing the same null is less efficient in an RLT than in an RLA. This is a necessary sacrifice for having nontrivial power against other alternatives.
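Proposition 2.

For nonnegative vectors $(\theta_1^+, \dots, \theta_D^+)$ and $(\theta_1^-, \dots, \theta_D^-)$ that each sum to one, define the processes

$$M_t^+(\mu) := \sum_{d=1}^{D} \theta_d^+ \prod_{i=1}^{t} \left(1 + \lambda_i^{d+} \left(X_i - C_i(\mu)\right)\right), \qquad M_t^-(\mu) := \sum_{d=1}^{D} \theta_d^- \prod_{i=1}^{t} \left(1 - \lambda_i^{d-} \left(X_i - C_i(\mu)\right)\right),$$

with $C_i(\mu)$ the null conditional mean as in (8). Next, for $\beta \in [0, 1]$, define their mixture

$$M_t^\pm(\mu) := \beta M_t^+(\mu) + (1 - \beta) M_t^-(\mu).$$

Then, $M_t^\pm(\mu^\star)$ is a nonnegative martingale starting at one. Consequently,

$$\mathfrak{C}_t^\pm := \left\{ \mu \in [0, 1] : M_t^\pm(\mu) < 1/\alpha \right\}$$

forms a $(1-\alpha)$ confidence sequence for $\mu^\star$.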


Proof. This follows immediately from the fact that both $M_t^+(\mu^\star)$ and $M_t^-(\mu^\star)$ are martingales with respect to the same filtration, and that convex combinations of such martingales are also martingales. ∎
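
As a rough illustration of how the mixture in Proposition 2 might be used in a two-candidate tally, the sketch below evaluates it on a grid of candidate means and reports a winner only when every retained value lies on one side of 1/2. The uniform weights, the fixed bets in [0, 1], the grid resolution, the without-replacement form of the null conditional mean, the function names and the sample data are all assumptions made for this example.

```python
import math


def two_sided_martingale(xs, mu, N, lambdas, beta=0.5):
    """beta * M_plus + (1 - beta) * M_minus, each with uniform weights 1/D."""
    plus = [1.0] * len(lambdas)
    minus = [1.0] * len(lambdas)
    seen_sum = 0.0
    for i, x in enumerate(xs, start=1):
        c = (N * mu - seen_sum) / (N - (i - 1))  # null conditional mean
        if c < -1e-12 or c > 1 + 1e-12:
            return math.inf  # mu contradicts the ballots already seen
        c = min(max(c, 0.0), 1.0)  # guard against floating-point drift
        plus = [p * (1 + lam * (x - c)) for p, lam in zip(plus, lambdas)]
        minus = [m * (1 - lam * (x - c)) for m, lam in zip(minus, lambdas)]
        seen_sum += x
    d = len(lambdas)
    return beta * sum(plus) / d + (1 - beta) * sum(minus) / d


def confidence_interval(xs, N, lambdas, alpha=0.05, beta=0.5, grid_size=201):
    """Smallest interval covering all grid values mu with martingale < 1/alpha."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    kept = [mu for mu in grid
            if two_sided_martingale(xs, mu, N, lambdas, beta) < 1 / alpha]
    return (min(kept), max(kept)) if kept else None


def tally_decision(interval):
    """Name a winner only when the whole interval lies on one side of 1/2."""
    if interval is None:
        return "undetermined"
    lo, hi = interval
    if lo > 0.5:
        return "candidate coded 1 wins"
    if hi < 0.5:
        return "candidate coded 0 wins"
    return "keep sampling"


# Made-up sample: 80 of N = 1000 ballots, three quarters of them coded 1.
xs = [1.0, 1.0, 1.0, 0.0] * 20
interval = confidence_interval(xs, N=1000, lambdas=[0.25, 0.5, 0.75])
print(interval, tally_decision(interval))
```

Setting `beta=1.0` discards the downward-betting component and recovers a purely one-sided lower bound, at the cost of never being able to conclude that the candidate coded 0 won.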

With this setup and notation in mind, $M_t$ as defined in Section 3.2 is a special case of $M_t^\pm$ with $\beta = 1$. As noted by [jamroga2019risk], RLTs involving multiple assertions do require correction for multiple testing, unlike RLAs. The same is true for confidence sequence-based RLTs (and hence the tricks of Section 2.2 do not apply). It suffices to perform a simple Bonferroni correction by constructing $(1 - \alpha/K)$ confidence sequences to establish $K$ simultaneous assertions.
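
The arithmetic of the Bonferroni correction is simple enough to state in code. In this small sketch, the function name and the example numbers are hypothetical: each of the simultaneous assertions is tested at a reduced level, so a value is excluded only once its martingale reaches the correspondingly larger threshold.

```python
def bonferroni_threshold(alpha, num_assertions):
    """Per-assertion rejection threshold for a total risk limit alpha."""
    per_assertion_alpha = alpha / num_assertions
    return 1.0 / per_assertion_alpha


# Illustrative numbers only: a total risk limit of 5% split over 3 assertions.
print(bonferroni_threshold(0.05, 3))  # the threshold a martingale must reach
print(bonferroni_threshold(0.05, 1))  # a single assertion needs no correction
```

By the union bound, the chance that any of the individual confidence sequences ever fails to cover its parameter is at most the sum of the per-assertion levels, which is the total risk limit.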

6 Summary

This paper presented a general framework for conducting risk-limiting audits based on confidence sequences, and derived computationally and statistically efficient martingales for computing them. We showed how a priori Kelly takes advantage of the reported vote totals (if available) to stop ballot-polling audits significantly earlier than extant ballot-polling methods, and how alternative martingales such as SqKelly also provide strong empirical performance in the absence of reported outcomes. Finally, we demonstrated how a simple tweak to the aforementioned algorithms provides two-sided confidence sequences, which can be used to perform risk-limiting tallies. Confidence sequences and these martingales can be applied to ballot-level comparison audits and batch-level comparison audits as well, using “overstatement assorters” [stark2019sets], which reduce comparison audits to the same canonical statistical problem: testing whether the mean of any list in a collection of non-negative bounded lists is at most 1/2. We hope that this new perspective on RLAs and its associated software will aid in making election audits simpler, faster, and more transparent.