1 Introduction
The reported outcome of an election may not match the validly cast votes for a variety of reasons, including software configuration errors, bugs, human error, and deliberate malfeasance. Trustworthy elections start with a trustworthy paper record of the validly cast votes. Given access to a trustworthy paper trail of votes, a risk-limiting audit (RLA) can provide a rigorous probabilistic guarantee:

If an initially announced assertion about an election is false, this will be corrected by the audit with high probability;

If the aforementioned assertion is true, then it will be confirmed (with probability one).
Here, an electoral assertion is simply a claim about the aggregated votes cast (e.g. “Alice received more votes than Bob”). An auditor may wish to audit several claims: for example, whether the reported winner is correct or whether the margin of victory is as large as announced.
From a statistical point of view, efficient risk-limiting audits can be implemented as sequential hypothesis tests. Namely, one tests the null hypothesis $H_0$: "the assertion is false," versus the alternative $H_1$: "the assertion is true". Imagine then observing a random sequence of voter-cast ballots $X_1, X_2, \dots, X_N$, where $N$ is the total number of ballots. A sequential hypothesis test is represented by a sequence of binary-valued functions
$$\phi_t \equiv \phi_t(X_1, \dots, X_t) \in \{0, 1\},$$
where $\phi_t = 1$ represents rejecting $H_0$ (typically in favor of $H_1$), and $\phi_t = 0$ means that $H_0$ has not yet been rejected. The sequential test (and thus the RLA) stops as soon as $\phi_t = 1$ or once all $N$ ballots are observed, whichever comes first. The "risk-limiting" property of RLAs states that if the assertion is false (in other words, if $H_0$ holds), then
$$\mathbb{P}_{H_0}\left(\exists t \in \{1, \dots, N\} : \phi_t = 1\right) \leq \alpha,$$
which is equivalent to type-I error control of the sequential test. Another way of interpreting the above statement is as follows: if the assertion is incorrect, then with probability at least $1 - \alpha$ we have $\phi_t = 0$ for every $t$, and hence all $N$ ballots will eventually be inspected, at which point the "true" outcome (which is the result of the full hand count) will be known with certainty.

1.1 SHANGRLA Reduces Election Auditing to Sequential Testing
Designing the sequential hypothesis test depends on the type of vote, the aggregation method, and the social choice function for the election, and thus past works have constructed a variety of tests. Some works have designed sequential tests in the context of a particular type of election [lindeman2012bravo, ottoboni2019bernoulli, rivest2017clipaudit]. On the other hand, the "SHANGRLA" (Sets of Half-Average Nulls Generate RLAs) framework unifies many common election types (including plurality elections, approval voting, ranked-choice voting, and more) by reducing each of these to a simple hypothesis test of whether any list in a finite collection of finite lists of bounded numbers has mean at most 1/2 [stark2019sets, blom2021]. Let us give an illustrative example to show how SHANGRLA can be used in practice.
Suppose we have an election with two candidates, Alice and Bob. A ballot may contain a vote for Alice or for Bob, or it may contain no valid vote, e.g., because there was no selection or an overvote. It is reported that Alice and Bob received $N_A$ and $N_B$ votes respectively with $N_A > N_B$, and that there were a total of $N_U$ invalid ballots, for a total of $N = N_A + N_B + N_U$ voters. We encode votes for Alice as "1", votes for Bob as "0", and invalid votes as "1/2", to obtain a set of numbers $x_1, \dots, x_N \in \{0, 1/2, 1\}$. Crucially, Alice indeed received more votes than Bob if and only if $\frac{1}{N} \sum_{i=1}^N x_i > 1/2$. In other words, the report that Alice beat Bob can be translated into the assertion that $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i > 1/2$.
SHANGRLA proposes to audit an assertion by testing its complement: rejecting that "complementary null" is affirmative evidence that the assertion is indeed true. In other words, if one can ensure that $X_1, \dots, X_N$ is a random permutation of $x_1, \dots, x_N$ by sampling ballots without replacement (each ballot is chosen uniformly amongst the remaining ballots), then we can concern ourselves with designing a hypothesis test of the null $H_0 : \mu^\star \leq 1/2$ against the alternative $H_1 : \mu^\star > 1/2$.
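To make the encoding concrete, here is a minimal Python sketch of this reduction; the function names `encode` and `assertion_holds` are our own illustrations, not part of any particular package:

```python
# Encode a two-candidate plurality contest in SHANGRLA style:
# votes for Alice -> 1, votes for Bob -> 0, anything else -> 1/2.
def encode(ballot):
    """Map a raw ballot to its SHANGRLA-style value in {0, 1/2, 1}."""
    return {"Alice": 1.0, "Bob": 0.0}.get(ballot, 0.5)

def assertion_holds(ballots):
    """'Alice beat Bob' holds iff the mean encoded value exceeds 1/2."""
    x = [encode(b) for b in ballots]
    return sum(x) / len(x) > 0.5

ballots = ["Alice"] * 5 + ["Bob"] * 3 + ["invalid"] * 2
print(assertion_holds(ballots))  # Alice leads 5 to 3, so this prints True
```

The audit then amounts to testing whether the mean of these encoded values exceeds 1/2, using only a random sample of them.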
One of the major benefits of SHANGRLA is the ability to reduce a wide range of election types to a testing problem of the above form. This permits the use of powerful statistical techniques which were designed specifically for such testing problems (but may not have been designed with RLAs in mind). Throughout this paper, we adopt the SHANGRLA framework, and while we return to the example of plurality elections for illustrative purposes, all of our methods can be applied to any election audit which has a SHANGRLA-like testing reduction [stark2019sets].
1.2 Confidence Sequences
In the fixed-time (i.e., nonsequential) hypothesis testing regime, there is a well-known duality between hypothesis tests and confidence intervals for a parameter $\mu^\star$ of interest. We describe this briefly for $\mu^\star \in [0, 1]$ for simplicity. For each $m \in [0, 1]$, suppose that $\phi(m)$ is a level-$\alpha$ nonsequential, fixed-sample test for the hypothesis $H_0 : \mu^\star = m$ versus $H_1 : \mu^\star \neq m$. Then, a nonsequential, fixed-sample confidence interval for $\mu^\star$ is given by the set of all $m$ for which $\phi(m)$ does not reject, that is,
$$C := \{m \in [0, 1] : \phi(m) = 0\}.$$
As we discuss further in Section 2, an analogous duality holds for sequential hypothesis tests and time-uniform confidence sequences (here and throughout the paper, "time" is used to refer to the number of samples so far, and need not correspond to any particular units such as hours or seconds). We first give a brief preview of the results to come. Consider a family of sequential hypothesis tests $(\phi_t(m))_{t=1}^N$ indexed by $m \in [0, 1]$, meaning that for each $m$, $(\phi_t(m))_{t=1}^N$ is a sequential test for $H_0 : \mu^\star = m$. Then, the set of all $m$ for which $\phi_t(m) = 0$,
$$C_t := \{m \in [0, 1] : \phi_t(m) = 0\},$$
forms a confidence sequence for $\mu^\star$, meaning that
$$\mathbb{P}\left(\forall t \in [N],\ \mu^\star \in C_t\right) \geq 1 - \alpha,$$
where $[N]$ is used to denote the set $\{1, \dots, N\}$. In other words, $C_t$ will cover $\mu^\star$ at every single time $t$, except with some small probability $\alpha$. Since $C_t$ is typically an interval $[L_t, U_t]$, we refer to the lower endpoints $(L_t)_{t=1}^N$ as a lower confidence sequence (and similarly for upper).
In particular, given the sequential hypothesis testing problem that arises in SHANGRLA, we can cast the RLA as a sequential estimation problem that can be solved by developing confidence sequences (see Figure 1; code to reproduce all plots can be found at github.com/wannabesmith/RiLACS). As we will see in Section 2, our confidence sequences provide added flexibility and an intuitive, visualizable interpretation for SHANGRLA-compatible election audits, without sacrificing any statistical efficiency.

1.3 Contributions and Outline
The contributions of this work are twofold. First, we introduce confidence sequences to the election auditing literature as intuitive and flexible ways of interpreting and visualizing risk-limiting audits. Second, we present algorithms for performing RLAs based on confidence sequences by deriving statistically and computationally efficient nonnegative martingales. At the risk of oversimplifying the issue, modern RLAs face a computational-statistical efficiency tradeoff. Methods such as BRAVO are easy to compute, but potentially less statistically efficient than the current state-of-the-art, KMart [stark2019sets]; KMart, however, can be prohibitively expensive to compute for large elections. The methods presented in this paper resolve this tradeoff: they typically match or outperform both BRAVO and KMart, while remaining practical to compute in large elections.
In Section 2, we show how confidence sequences generate risk-limiting audits, how they relate to more familiar RLAs based on sequentially valid $p$-values, and how they can be used to audit multiple contests. Section 3 derives novel confidence-sequence-based RLAs and compares them to past RLA methods via simulation. In Section 4, we illustrate how the previously derived techniques can be applied to an audit of Canada's 43rd federal election. Finally, Section 5 discusses how all of the aforementioned results apply to risk-limiting tallies for coercion-resistant voting schemes.
2 Confidence Sequences are Risk-Limiting
Consider an election consisting of $N$ ballots. Following SHANGRLA [stark2019sets], suppose that these can be transformed to a set of bounded real numbers $x_1, \dots, x_N \in [0, u]$ with mean $\mu^\star$, for some known upper bound $u > 0$. Suppose that electoral assertions can be made purely in terms of $\mu^\star$. A classical confidence interval for $\mu^\star$ is an interval $\dot{C}_t$ computed from data $X_1, \dots, X_t$ with the guarantee that
$$\mathbb{P}\left(\mu^\star \in \dot{C}_t\right) \geq 1 - \alpha.$$
In contrast, a confidence sequence for $\mu^\star$ is a sequence $(C_t)_{t=1}^N$ of confidence sets, which all simultaneously capture $\mu^\star$ with probability at least $1 - \alpha$. That is,
$$\mathbb{P}\left(\forall t \in [N],\ \mu^\star \in C_t\right) \geq 1 - \alpha, \quad \text{or equivalently,} \quad \mathbb{P}\left(\exists t \in [N] : \mu^\star \notin C_t\right) \leq \alpha.$$
The two probabilistic statements above are equivalent, but provide different ways of interpreting $(C_t)_{t=1}^N$ and the corresponding guarantee.
If we have access to a confidence sequence $(C_t)_{t=1}^N$ for $\mu^\star$, we can audit any assertion about the election outcome made in terms of $\mu^\star$ with risk limit $\alpha$. Here, we use $A \subseteq \mathbb{R}$ to denote an assertion. For example, SHANGRLA typically uses assertions of the form "$\mu^\star$ is greater than $1/2$", in which case $A = (1/2, \infty)$.

Risk-limiting audits via confidence sequences (RiLACS): sequentially sample ballots without replacement; after each ballot, update $C_t$; once $C_t \subseteq A$, certify the assertion and, if desired, stop sampling.

If the goal is to finish the audit as soon as possible above all else, then one can ignore the "if desired" condition. However, continued sampling can provide added assurance in the assertion, and maintains the risk limit at $\alpha$. The following theorem summarizes the risk-limiting guarantee of the above algorithm.
Theorem 1.
Let $(C_t)_{t=1}^N$ be a $(1 - \alpha)$ confidence sequence for $\mu^\star$. Let $A$ be an assertion about the electoral outcome (in terms of $\mu^\star$). The audit mechanism that certifies $A$ as soon as $C_t \subseteq A$ has risk limit $\alpha$.
Proof.
We need to prove that if $\mu^\star \notin A$, then $\mathbb{P}(\exists t \in [N] : C_t \subseteq A) \leq \alpha$. First, notice that if $C_t \subseteq A$ and $\mu^\star \notin A$, then we must have that $\mu^\star \notin C_t$, since $C_t \subseteq A$. Then,
$$\mathbb{P}\left(\exists t \in [N] : C_t \subseteq A\right) \leq \mathbb{P}\left(\exists t \in [N] : \mu^\star \notin C_t\right) \leq \alpha,$$
where the second inequality follows from the definition of a confidence sequence. This completes the proof. ∎
Let us see how this theorem can be used in an example. Consider an election with two candidates, Alice and Bob, and a total of $N$ cast ballots. Let $x_1, \dots, x_N$ be the list of numbers that result from encoding votes for Alice as 1, votes for Bob as 0, and ballots that do not contain a valid vote as $1/2$. Let $(L_t)_{t=1}^N$ be a lower confidence sequence for $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i$. If we wish to audit the assertion that "Alice beat Bob", then $A = (1/2, \infty)$, and the assertion holds if and only if $\mu^\star \in A$. We can sequentially sample without replacement, certifying the assertion once $L_t > 1/2$. By Theorem 1, this limits the risk to level $\alpha$.
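The audit loop in this example can be sketched in a few lines of Python. For concreteness we plug in a deliberately crude lower confidence sequence (a one-sided Hoeffding bound taken at level $\alpha/N$ and unioned over all $N$ times); it is valid but far looser than the betting-based sequences derived in Section 3, and the function names are our own:

```python
import math
import random

def audit(ballots, alpha=0.05, seed=0):
    """Sample encoded ballots without replacement and certify the assertion
    "Alice beat Bob" once a lower confidence sequence for the mean exceeds
    1/2. Returns the number of ballots inspected at certification, or None
    if the audit reaches a full hand count without certifying."""
    rng = random.Random(seed)
    x = list(ballots)
    rng.shuffle(x)  # uniform sampling without replacement
    N, running_sum = len(x), 0.0
    for t, xi in enumerate(x, start=1):
        running_sum += xi
        # Hoeffding lower bound at level alpha/N, unioned over t = 1..N
        lower = running_sum / t - math.sqrt(math.log(N / alpha) / (2 * t))
        if lower > 0.5:
            return t  # certify the assertion and stop
    return None  # all N ballots inspected: the outcome is known exactly

# 10,000 encoded ballots: 60% show a vote for Alice
ballots = [1.0] * 6000 + [0.0] * 4000
print(audit(ballots))  # certifies, at the latest by the full count
```

Tighter confidence sequences shrink the certification time substantially, which is precisely the design goal of Section 3.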
2.1 Relationship to Sequential Hypothesis Testing
The earliest work on RLAs did not use anytime $p$-values [stark2008conservative, stark2009cast], but since about 2009, most RLA methods have used anytime $p$-values to conduct sequential hypothesis tests [stark2009risk, ottoboni2018risk, ottoboni2019bernoulli, stark2019sets, huang2020unified]. An anytime $p$-value is a sequence $(p_t)_{t=1}^N$ of $p$-values with the property that under some null hypothesis $H_0$,

(1) $\quad \mathbb{P}_{H_0}\left(\exists t \in [N] : p_t \leq \alpha\right) \leq \alpha.$

The anytime $p$-values are typically defined implicitly for each null hypothesis $H_0 : \mu^\star = m$ and yield a sequential hypothesis test $\phi_t(m) := \mathbb{1}\left(p_t(m) \leq \alpha\right)$. As alluded to in Section 1.2, this immediately recovers a confidence sequence:
$$C_t := \{m \in [0, 1] : p_t(m) > \alpha\}.$$
Notice in Figure 2 that the times at which nulls are rejected (or "stopping times") are the same for both confidence sequences and the associated $p$-values. Thus, nothing is lost by basing the RLA on confidence sequences rather than anytime $p$-values. Confidence sequences benefit from being visually intuitive and are arguably easier to interpret than anytime $p$-values.
For example, consider conducting an RLA for a simple two-candidate election between Alice and Bob with no invalid votes. Suppose that it is reported that Alice won, i.e., $\mu^\star := \frac{1}{N} \sum_{i=1}^N x_i > 1/2$, where $x_i = 1$ if the $i$th ballot is for Alice, 0 if for Bob, and $1/2$ if the ballot does not contain a valid vote for either candidate. A sequential RLA in the SHANGRLA framework would posit a null hypothesis $H_0 : \mu^\star \leq 1/2$ (the complement of the announced result: Bob actually won or the outcome is a tie), sample random ballots sequentially, and stop the audit (confirming the announced result) if and when $H_0$ is rejected at significance level $\alpha$. If $H_0$ is not rejected before all ballots have been inspected, the true outcome is known. (At any point during the sampling, an election official can choose to abort the sampling and perform a full hand count for any reason. This cannot increase the risk limit: the chance of failing to correct an incorrect reported outcome does not increase.)
On the other hand, a ballot-polling RLA [lindeman2012bravo] based on confidence sequences proceeds by computing a lower confidence bound for the fraction of votes for Alice. The audit stops, confirming the outcome, if and when this lower bound is larger than 1/2. If that does not occur before the last ballot has been examined, the true outcome is known. In this formulation, there is no need to define a null hypothesis as the complement of the announced result, interpret the resulting $p$-value, and so on. The approach also works for comparison audits using the "overstatement assorter" approach developed in [stark2019sets], which transforms the problem into the same canonical form: testing whether the mean of any list in a collection of nonnegative, bounded lists is less than 1/2.
2.2 Auditing Multiple Contests
It is known that RLAs of multi-candidate, multi-winner elections can be reduced to several pairwise contests without adjusting for multiplicity [lindeman2012bravo]. This is accomplished by testing whether every single reported winner beat every single reported loser, and stopping once each of these tests rejects its respective null at level $\alpha$. For example, suppose it is reported that a set $\mathcal{W}$ of candidates beat a set $\mathcal{L}$ of candidates in a $k$-winner plurality contest with $C$ candidates in all (that is, $|\mathcal{W}| = k$ and $|\mathcal{L}| = C - k$). For each reported winner $w \in \mathcal{W}$ and each reported loser $\ell \in \mathcal{L}$, encode votes for candidate $w$ as "1", votes for $\ell$ as "0", and ballots with no valid vote in the contest or with a vote for any other candidate as "1/2" to obtain the population $x_1^{w,\ell}, \dots, x_N^{w,\ell}$. Then, as before, candidate $w$ beat candidate $\ell$ if and only if $\mu^{w,\ell} := \frac{1}{N} \sum_{i=1}^N x_i^{w,\ell} > 1/2$. In a two-candidate plurality election we would have proceeded by testing the null $H_0 : \mu^\star \leq 1/2$ against the alternative $H_1 : \mu^\star > 1/2$. To use the decomposition of a single-winner or multi-winner plurality contest into a set of pairwise contests, we test each null $H_0^{w,\ell} : \mu^{w,\ell} \leq 1/2$ for $w \in \mathcal{W}$ and $\ell \in \mathcal{L}$. The audit stops if and when all null hypotheses are rejected. Crucially, if candidate $w$ did not win (i.e., $\mu^{w,\ell} \leq 1/2$ for some $\ell \in \mathcal{L}$), then
$$\mathbb{P}\left(\text{every } H_0^{w',\ell'} \text{ is rejected}\right) \leq \mathbb{P}\left(H_0^{w,\ell} \text{ is rejected}\right) \leq \alpha.$$
The same technique applies when auditing with confidence sequences. Let $(L_t^{w,\ell})_{t=1}^N$ be lower confidence sequences for $\mu^{w,\ell}$, $w \in \mathcal{W}$, $\ell \in \mathcal{L}$. We certify the electoral outcome of every contest once $L_t^{w,\ell} > 1/2$ for all $w \in \mathcal{W}$, $\ell \in \mathcal{L}$. Again, if $\mu^{w,\ell} \leq 1/2$ for some $w \in \mathcal{W}$ and $\ell \in \mathcal{L}$, then
$$\mathbb{P}\left(\exists t \in [N] : \forall w' \in \mathcal{W},\ \ell' \in \mathcal{L},\ L_t^{w',\ell'} > 1/2\right) \leq \mathbb{P}\left(\exists t \in [N] : L_t^{w,\ell} > 1/2\right) \leq \alpha.$$
This technique can be generalized to handle audits of any number of contests from the same audit sample, as explained in [stark2019sets]. For the sake of brevity, we omit the derivation, but it is a straightforward extension of the above.
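As an illustration, the pairwise reduction can be coded directly; the helper names below are ours, and in practice the lower bounds would come from the confidence sequences of Section 3 rather than full-population means:

```python
from itertools import product

def pairwise_population(ballots, winner, loser):
    """Encode the (winner, loser) pairwise contest: votes for the winner
    -> 1, votes for the loser -> 0, all other ballots -> 1/2."""
    return [1.0 if b == winner else (0.0 if b == loser else 0.5)
            for b in ballots]

def certified(lower_bounds):
    """Certify once every pairwise lower confidence bound exceeds 1/2;
    no multiplicity adjustment is needed for this conjunction."""
    return all(lb > 0.5 for lb in lower_bounds.values())

ballots = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
winners, losers = ["A"], ["B", "C"]
populations = {(w, l): pairwise_population(ballots, w, l)
               for w, l in product(winners, losers)}
# Full-population means (a real audit estimates these sequentially):
means = {pair: sum(pop) / len(pop) for pair, pop in populations.items()}
print(means)  # {('A', 'B'): 0.6, ('A', 'C'): 0.65}
```

Because certification requires every pairwise bound to clear 1/2 simultaneously, a single false pairwise null already caps the certification probability at $\alpha$, which is why no Bonferroni-style correction is needed.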
3 Designing Powerful Confidence Sequences for RLAs
So far we have discussed how to conduct RLAs from confidence sequences for the parameter $\mu^\star$. In this section, we will discuss how to derive powerful confidence sequences for the purposes of conducting RLAs as efficiently as possible. For mathematical and notational convenience in the following derivations, we consider the case where $x_1, \dots, x_N \in [0, 1]$. Note that nothing is lost in this setup, since any population of bounded numbers can be scaled to the unit interval by dividing each element by $u$ (thereby scaling the population's mean as well).
As discussed in Section 2.1, we can construct confidence sequences by "inverting" sequential hypothesis tests. In particular, given a sequential hypothesis test $(\phi_t(m))_{t=1}^N$ for each $m \in [0, 1]$, the sequence of sets
$$C_t := \{m \in [0, 1] : \phi_t(m) = 0\}$$
forms a confidence sequence for $\mu^\star$. Consequently, in order to develop powerful RLAs via confidence sequences, we can simply focus on carefully designing sequential tests $\phi_t(m)$. (Notice that it is not always feasible to compute the set of all $m$ such that $\phi_t(m) = 0$, since $[0, 1]$ is uncountably infinite. However, all confidence sequences we will derive in this section are intervals (i.e., convex), and thus we can find their endpoints using a simple grid search or standard root-finding algorithms.)
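Since these confidence sequences are intervals, a simple grid search recovers their endpoints. Below is a sketch with an artificial monotone process standing in for the martingales of this section (all names ours):

```python
def lower_endpoint(M, alpha, grid_size=10_000):
    """Return (approximately) the smallest m in [0, 1] with M(m) < 1/alpha.
    Assumes M is nonincreasing in m, so every larger m also lies in the
    confidence set {m : M(m) < 1/alpha}."""
    for j in range(grid_size + 1):
        m = j / grid_size
        if M(m) < 1.0 / alpha:
            return m
    return 1.0

# An artificial nonincreasing stand-in for the map m -> M_t(m):
M = lambda m: 100.0 * (1.0 - m)
print(lower_endpoint(M, alpha=0.05))  # near 0.8, since 100*(1 - m) < 20 iff m > 0.8
```

A standard bracketing root-finder (e.g., `scipy.optimize.brentq` applied to $m \mapsto M(m) - 1/\alpha$) locates the same endpoint with far fewer evaluations.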
To design sequential hypothesis tests, we start by finding martingales that translate to powerful tests. To this end, consider the following process for each $m \in [0, 1]$:

(2) $\quad M_t(m) := \prod_{i=1}^t \left(1 + \lambda_i(m)\left(X_i - \mu_i(m)\right)\right),$

where $\lambda_i(m)$ is a tuning parameter depending only on $X_1, \dots, X_{i-1}$, and
$$\mu_i(m) := \frac{Nm - \sum_{j=1}^{i-1} X_j}{N - i + 1}$$
is the conditional mean of $X_i$ given $X_1, \dots, X_{i-1}$ if the mean of $x_1, \dots, x_N$ were $m$.
Following [waudby2020estimating, Section 6], the process $(M_t(m))_{t=0}^N$ is a nonnegative martingale starting at one when $\mu^\star = m$. Formally, this means that $M_0(m) = 1$, that $M_t(m) \geq 0$ for all $t$, and that
$$\mathbb{E}\left[M_t(m) \mid X_1, \dots, X_{t-1}\right] = M_{t-1}(m)$$
for each $t \in [N]$. Importantly for our purposes, nonnegative martingales are unlikely to ever become very large. This fact is known as Ville's inequality [ville1939etude, howard_exponential_2018], which serves as a generalization of Markov's inequality to nonnegative (super)martingales, and can be stated formally as

(3) $\quad \mathbb{P}\left(\exists t \in [N] : M_t(m) \geq 1/\alpha\right) \leq \alpha \, \mathbb{E}\left[M_0(m)\right] = \alpha,$

where $\alpha \in (0, 1)$, and the equality follows from the fact that $M_0(m) = 1$. As alluded to in Section 2, $M_t(m)$ can be interpreted as the reciprocal of an anytime $p$-value $p_t(m) := 1/M_t(m)$:
$$\mathbb{P}\left(\exists t \in [N] : p_t(m) \leq \alpha\right) \leq \alpha \quad \text{when } \mu^\star = m,$$
which matches the probabilistic guarantee in (1). As a direct consequence of Ville's inequality, if we define the test $\phi_t(m) := \mathbb{1}\left(M_t(m) \geq 1/\alpha\right)$, then
$$\mathbb{P}\left(\exists t \in [N] : \phi_t(m) = 1\right) \leq \alpha \quad \text{when } \mu^\star = m,$$
and thus $(\phi_t(m))_{t=1}^N$ is a level-$\alpha$ sequential hypothesis test. We can then invert these tests and apply Theorem 1 to obtain confidence-sequence-based RLAs with risk limit $\alpha$.
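The following self-contained sketch implements the process in (2) with a constant bet $\lambda$ together with the test $\phi_t(m) = \mathbb{1}(M_t(m) \geq 1/\alpha)$; the function names are our own:

```python
def cond_mean(m, N, prev_sum, i):
    """mu_i(m): conditional mean of X_i under the null "population mean
    equals m" when sampling without replacement (i is 1-indexed)."""
    return (N * m - prev_sum) / (N - i + 1)

def betting_martingale(samples, m, lam, N):
    """M_t(m) = prod_{i<=t} (1 + lam * (X_i - mu_i(m))) for a fixed bet."""
    M, s, path = 1.0, 0.0, []
    for i, x in enumerate(samples, start=1):
        M *= 1.0 + lam * (x - cond_mean(m, N, s, i))
        s += x
        path.append(M)
    return path

alpha = 0.05
# A sample prefix favorable to Alice: 16 ones and 2 zeros out of N = 1000
samples = [1] * 8 + [0] + [1] * 8 + [0]
path = betting_martingale(samples, m=0.5, lam=1.0, N=1000)
rejected = any(M_t >= 1 / alpha for M_t in path)
print(rejected)  # True: the martingale crosses 1/alpha = 20
```

Each one observed multiplies the wealth by roughly $1.5$ and each zero by roughly $0.5$, so a sufficiently lopsided sample drives $M_t(1/2)$ past $1/\alpha$ and rejects the null $m = 1/2$.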
3.1 Designing Martingales and Tests from Reported Vote Totals
So far, we have found a process $M_t(m)$ that is a nonnegative martingale when $\mu^\star = m$, but what happens when $\mu^\star \neq m$? This is where the tuning parameters $\lambda_i(m)$ come into the picture. Recall that an electoral assertion is certified once $C_t \subseteq A$. Therefore, to audit assertions quickly, we want $C_t$ to be as tight as possible. Since $C_t$ is defined as the set of $m$ such that $M_t(m) < 1/\alpha$, we can make $C_t$ tight by making $M_t(m)$ as large as possible for values of $m$ outside of $A$. To do so, we must carefully choose $\lambda_i(m)$. This choice will depend on the type of election as well as the amount of information provided prior to the audit. First consider the case where reported vote totals are given (in addition to the announced winner).
For example, recall the election between Alice and Bob of Section 2, and suppose that $x_1, \dots, x_N$ is the list of numbers encoding votes for Alice as 1, votes for Bob as 0, and ballots with no valid vote for either candidate as 1/2. Recall that Alice beat Bob if and only if $\mu^\star > 1/2$, so we are interested in testing the null hypothesis $H_0 : \mu^\star \leq 1/2$. Suppose it is reported that Alice beat Bob with $N_A$ votes for Alice, $N_B$ for Bob, and $N_U$ nuisance votes (i.e., either invalid or for another party). If the reported outcome is correct, then for any fixed $\lambda$, we know the exact value of

(4) $\quad \prod_{i=1}^N \left(1 + \lambda\left(x_i - 1/2\right)\right) = \left(1 + \lambda/2\right)^{N_A} \left(1 - \lambda/2\right)^{N_B},$

which is an inexact but reasonable proxy for $M_N(1/2)$, the final value of the process $(M_t(1/2))_{t=0}^N$. We can then choose the value of $\lambda$ that maximizes (4). Some algebra reveals that the maximizer of (4) is given by

(5) $\quad \lambda^\star := \frac{2(N_A - N_B)}{N_A + N_B}.$

We then truncate $\lambda^\star$ to obtain

(6) $\quad \lambda^{\mathrm{apK}} := 0 \vee \lambda^\star \wedge 2,$

ensuring that it lies in the allowable range $[0, 2]$. We call this choice of $\lambda$ "a priori Kelly" due to its connections to Kelly's criterion [kelly1956new, waudby2020estimating] for maximizing products of the form (4). This choice of $\lambda$ also has the desirable property of yielding convex confidence sequences, which we summarize below.
Proposition 1.
Let $X_1, \dots, X_N$ be a sequential random sample, drawn without replacement, from $x_1, \dots, x_N \in [0, 1]$ with mean $\mu^\star$. Consider $\lambda^{\mathrm{apK}}$ from (6) and define the process $M_t(m) := \prod_{i=1}^t \left(1 + \lambda^{\mathrm{apK}}\left(X_i - \mu_i(m)\right)\right)$ for any $m \in [0, 1]$. Then the confidence set
$$C_t := \left\{m \in [0, 1] : M_t(m) < 1/\alpha\right\}$$
is an interval with probability one.
Proof.
Notice that since $\lambda^{\mathrm{apK}} \geq 0$ and $\mu_i(m)$ is a nondecreasing function of $m$, we have that
$$1 + \lambda^{\mathrm{apK}}\left(X_i - \mu_i(m)\right)$$
is a nonincreasing function of $m$ for each $i \in [N]$. Consequently, $M_t(m)$, being a product of nonnegative, nonincreasing factors, is a nonincreasing and quasiconvex function of $m$, so its sublevel sets are convex. ∎
Note that any predictable sequence $(\lambda_i)_{i=1}^N$ taking values in the allowable range would have yielded a valid nonnegative martingale, but we chose the value that maximizes (4) so that the resulting hypothesis test is powerful. In situations more complex than two-candidate plurality contests, the maximizer of (4) can still be found efficiently via standard root-finding algorithms. All of these methods are implemented in our Python package (github.com/wannabesmith/RiLACS).
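Under the reconstruction of (5) and (6) above, the a priori Kelly bet is a one-line computation (function name ours):

```python
def apriori_kelly(N_A, N_B):
    """Bet maximizing (1 + lam/2)^N_A * (1 - lam/2)^N_B over lam,
    truncated to the allowable range [0, 2]."""
    lam = 2.0 * (N_A - N_B) / (N_A + N_B)
    return max(0.0, min(lam, 2.0))

print(apriori_kelly(600, 400))  # 0.4: a 20-point reported margin among valid votes
print(apriori_kelly(1000, 0))   # 2.0: a reported landslide saturates the bet
```

Note that the bet scales with the reported margin among valid votes, so close races lead to cautious bets and lopsided races to aggressive ones.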
While audits based on a priori Kelly display excellent empirical performance (see Figure 3), their efficiency may be hurt when vote totals are erroneously reported. Small errors in reported vote totals seem to have minor adverse effects on stopping times (and in some cases can be slightly beneficial), but larger errors can significantly affect stopping time distributions (see Figure 4). If we wish to audit the reported winner of an election but prefer not to rely on (or do not have access to) exact reported vote totals, we need an alternative to a priori Kelly. In the following section, we describe a family of such alternatives.
3.2 Designing Martingales and Tests without Vote Totals
If the exact vote totals are not known, but we still wish to audit an assertion (e.g., that Alice beat Bob), we need to design a slightly different martingale that does not depend on maximizing (4) directly. Instead of finding a single optimal $\lambda$, we will take $K$ points $\lambda^{(1)}, \dots, \lambda^{(K)}$ evenly spaced on the allowable range and "hedge our bets" among all of these. Making this more precise, note that a convex combination of martingales (with respect to the same filtration) is itself a martingale [waudby2020estimating], and thus for any $w_1, \dots, w_K \geq 0$ such that $\sum_{k=1}^K w_k = 1$, we have that

(7) $\quad M_t^K(m) := \sum_{k=1}^K w_k \prod_{i=1}^t \left(1 + \lambda^{(k)}\left(X_i - \mu_i(m)\right)\right)$

forms a nonnegative martingale starting at one. Notice that we no longer have to depend on the reported vote totals to begin an audit. Furthermore, confidence sequences generated using sublevel sets of $M_t^K(m)$ are intervals with probability one [waudby2020estimating, Proposition 4]. Nevertheless, choosing $w_1, \dots, w_K$ is a nontrivial task. A natural (but, as we will see, suboptimal) choice is to set $w_k = 1/K$ for each $k \in [K]$. Previous works [waudby2020estimating] call this dKelly (for "diversified Kelly"), a name we adopt here. In fact, this choice of weights gives an arbitrarily close (as $K$ grows) and computationally efficient approximation to the Kaplan martingale (KMart) [stark2019sets], which can otherwise be prohibitively expensive to compute for large sample sizes.
Better choices of $w_1, \dots, w_K$ exist for the types of elections one might encounter in practice. Recall that near-optimal values of $\lambda$ are given by (5). However, setting $w_k = 1/K$ for each $k$ implicitly treats all $\lambda^{(k)}$ as equally reasonable values of $\lambda$. Elections with large values of $\mu^\star$ (e.g., closer to 1) are "easier" to audit, and the interesting or "difficult" regime is when $\mu^\star$ is close to (but strictly larger than) 1/2. Therefore, we recommend designing $w_1, \dots, w_K$ so that (7) upweights the values of $\lambda^{(k)}$ which are optimal for margins close to 0, and downweights those for margins close to 1. Consider the following concrete examples. First, we have the truncated-square weights,
$$\tilde{w}_k := \left(1 - \lambda^{(k)}/2\right)^2,$$
and we normalize by $\sum_{j=1}^K \tilde{w}_j$ to ensure that $\sum_{k=1}^K w_k = 1$. Another sensible choice is given by the truncated-linear weights, where we simply replace $(1 - \lambda^{(k)}/2)^2$ by $1 - \lambda^{(k)}/2$. These values of $\tilde{w}_k$ are large for $\lambda^{(k)}$ near 0 and small for $\lambda^{(k)}$ near 2, and hence the summands in the martingale given by (7) are upweighted for implicit values of $\lambda$ which are optimal for "interesting" margins close to 0, and downweighted for simple margins much larger than 0 (see Figure 5).
When (7) is combined with the truncated-square weights, we refer to the resulting martingales and confidence sequences as SqKelly. We compare their empirical workload against that of a priori Kelly, dKelly, and BRAVO in Figure 3. A hybrid approach is also possible: suppose we want to use reported outcomes or prior knowledge alongside these convex-weighted martingales. We can simply choose $w_1, \dots, w_K$ so that (7) upweights values in a neighborhood of $\lambda^{\mathrm{apK}}$ (or some other value chosen based on prior knowledge). (The use of the word "prior" here should not be interpreted in a Bayesian sense. No matter what values of $w_1, \dots, w_K$ are chosen, the resulting tests and confidence sequences have frequentist risk-limiting guarantees.)
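A minimal sketch of the SqKelly construction, assuming the truncated-square weights as written above (all names ours):

```python
def sqkelly(samples, m, N, K=10):
    """Mixture martingale (7) over K evenly spaced bets in [0, 2) with
    truncated-square weights, which favor bets tuned for close margins."""
    lams = [2.0 * k / K for k in range(K)]          # 0, 2/K, ..., 2(K-1)/K
    raw = [(1.0 - lam / 2.0) ** 2 for lam in lams]  # truncated-square
    total = sum(raw)
    weights = [r / total for r in raw]              # normalize to sum to one
    prods, s, path = [1.0] * K, 0.0, []
    for i, x in enumerate(samples, start=1):
        mu = (N * m - s) / (N - i + 1)              # null conditional mean
        prods = [p * (1.0 + lam * (x - mu)) for p, lam in zip(prods, lams)]
        s += x
        path.append(sum(w * p for w, p in zip(weights, prods)))
    return path

samples = ([1] * 3 + [0]) * 5  # 15 ones and 5 zeros out of N = 1000
path = sqkelly(samples, m=0.5, N=1000)
print(path[-1] > 1.0)  # evidence against the null m = 1/2 is accumulating
```

No reported vote totals enter the computation; the mixture hedges over bets appropriate for a range of possible margins.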
4 Illustration: Auditing Canada’s 43rd Federal Election
We now apply the techniques derived in Section 3 to risk-limiting audits of the 2019 Canadian federal election, which is made up of many plurality contests between 6 major political parties. (While Canada has many registered political parties, only a handful have come close to winning seats in the House of Commons, and hence should be considered in an audit. As a somewhat arbitrary rule, we considered those parties which satisfied the Leaders' Debates Commission's 2019 participation criteria. These consisted of the Liberal Party of Canada, the Conservative Party of Canada, the New Democratic Party (NDP), the Green Party, the Bloc Québécois (Bloc), and the People's Party of Canada (PPC). Independent candidates were also included where appropriate.)
The country is made up of 338 so-called "ridings" (see Figure 6). These are geographic regions, each corresponding to one seat in the House of Commons. For each riding, a multi-party, single-winner plurality contest takes place, and the winner is awarded the respective seat. Generally speaking, the party with the greatest number of seats forms government (there are exceptions to this rule, see www.elections.ca/content.aspx?section=res&dir=ces&document=part1&lang=e, but these will not be important for the purposes of auditing). In US elections, states and electoral college votes play similar roles to ridings and seats, respectively. Since each riding's underlying contest takes the form of a multi-party, single-winner plurality election, we can simply apply the techniques for auditing multiple contests outlined in Section 2.2 alongside the martingales and confidence sequences developed in Section 3.
The data-driven web application
We designed and developed an interactive Python- and Bokeh-based [bokeh] web application where users can display audits of any Canadian riding in a single click. This combined two data sources: one for electoral outcomes as recorded by hand-counted paper ballots in the 2019 federal election [elections_act_2019, canadaVotingResultsRaw], and one to draw the map of electoral districts [canadaElectoralDistricts]. After cleaning and merging, the data consisted of 347 records. Each record consists of a geographic information systems (GIS) polygon to draw the riding, vote totals for each party, and other information. The additional 9 records correspond to islands which are not separate ridings but require their own GIS polygon to be drawn on a map.
Following the notation of Section 2.2, recall that the electoral parameter of interest is defined as
$$\mu^{w,\ell} := \frac{1}{N} \sum_{i=1}^N x_i^{w,\ell},$$
where

$x_i^{w,\ell} = 1$ if the $i$th ballot shows a vote for $w$,

$x_i^{w,\ell} = 0$ if the $i$th ballot shows a vote for $\ell$, and

$x_i^{w,\ell} = 1/2$ if the $i$th ballot shows a vote for any other party.
Also recall that the reported assertion, "$w$ received more votes than $\ell$ for each $\ell$," is certified once the lower confidence sequences for $\mu^{w,\ell}$ exceed 1/2 for each $\ell$. Furthermore, this yields an RLA with risk limit $\alpha$, without needing to perform any multiplicity adjustments for constructing several confidence sequences (see Section 2.2 for more details). For example, the right-hand side plot of Figure 7 displays an RLA with risk limit $\alpha$ for the assertion "the Liberal party received the largest number of votes" by computing six lower confidence sequences for $\mu^{w,\ell}$, where $w$ is the Liberal party and $\ell$ ranges over the five other major parties and independent candidates.
It is important to keep in mind that electoral outcomes in the underlying data sets correspond to hand-counted paper ballot vote totals [elections_act_2019, canadaVotingResultsRaw]. Therefore, the right-hand side plot in the web application (e.g., Figure 7) demonstrates the length of time that an audit would last, given correctly reported outcomes, and assuming that the recorded data match the true votes cast. In practice, our confidence sequences would only rely on an assertion to audit (e.g., "The Liberal party received the most votes") and a simple random sample without replacement from the physical stack of ballots cast. Moreover, the web application is easily adapted to this practical scenario, an extension we plan to pursue in future work.
A key feature of this app is its interactivity. Users can hover their cursors over ridings to see reported vote totals, click and drag the map around, zoom in on regions of interest, and so on. When the user has found a riding they wish to audit, they can simply click on that riding's polygon to immediately compute lower confidence sequences and begin the RLA (see Figure 7). Server-side computation and client-side updates are fully asynchronous, meaning users can interact with the app while the audit is being conducted, and the audit will not "lock up". A demo of these features can be found online (ian.waudbysmith.com/audit_demo.mov) and the code is available on GitHub (github.com/WannabeSmith/RiLACS/tree/main/canada_audit).
5 Risk-Limiting Tallies via Confidence Sequences
Rather than audit an already-announced electoral outcome, it may be of interest to determine (for the purposes of making a first announcement) the election winner with high probability, without counting all ballots. Such procedures are known as risk-limiting tallies (RLTs), which were developed for coercion-resistant, end-to-end verifiable voting schemes [jamroga2019risk]. For example, suppose a voter is being coerced to vote for Bob. If the final vote tally reveals that Bob received few or no votes, then the coercer will suspect that the voter did not comply with instructions. RLTs provide a way to mitigate this issue by providing high-probability guarantees that the reported winner truly won, while leaving a large proportion of votes shrouded. In such cases, the voter is guaranteed plausible deniability, as they can claim to the coercer that their ballot is simply among the unrevealed ones.
While the motivations for RLTs are quite different from those for RLAs, the underlying techniques are similar. The same is true for confidence-sequence-based RLTs. All methods introduced in this paper can be applied to RLTs (with the exception of "a priori Kelly", since it depends on the reported outcome), but with two-sided power. Consider the martingales we discussed in Section 3.2,

(8) $\quad M_t^K(m) := \sum_{k=1}^K w_k \prod_{i=1}^t \left(1 + \lambda^{(k)}\left(X_i - \mu_i(m)\right)\right),$

where $w_1, \dots, w_K$ are convex weights. Recall that our confidence sequences at a given time $t$ were defined as those $m \in [0, 1]$ for which $M_t^K(m) < 1/\alpha$. In other words, a given value $m$ is only excluded from the confidence set if $M_t^K(m)$ is large. However, notice that $M_t^K(m)$ will become large if the conditional mean of $X_i$ is larger than the null conditional mean $\mu_i(m)$, but the same cannot be said if it is smaller. As a consequence, the resulting confidence sequences are all one-sided lower confidence sequences. To ensure that our bounds have nontrivial two-sided power, we can simply combine (8) with a martingale that also grows when the observations fall below $\mu_i(m)$.
Proposition 2.
For nonnegative vectors $(w_1^+, \dots, w_K^+)$ and $(w_1^-, \dots, w_K^-)$ that each sum to one, define the processes
$$M_t^+(m) := \sum_{k=1}^K w_k^+ \prod_{i=1}^t \left(1 + \lambda^{(k)}\left(X_i - \mu_i(m)\right)\right) \quad \text{and} \quad M_t^-(m) := \sum_{k=1}^K w_k^- \prod_{i=1}^t \left(1 - \lambda^{(k)}\left(X_i - \mu_i(m)\right)\right).$$
Next, for $\eta \in [0, 1]$, define their mixture
$$M_t^\pm(m) := \eta M_t^+(m) + (1 - \eta) M_t^-(m).$$
Then, $(M_t^\pm(m))_{t=0}^N$ is a nonnegative martingale starting at one when $\mu^\star = m$. Consequently,
$$C_t^\pm := \left\{m \in [0, 1] : M_t^\pm(m) < 1/\alpha\right\}$$
forms a $(1 - \alpha)$ confidence sequence for $\mu^\star$.
Proof.
This follows immediately from the fact that both $(M_t^+(m))_{t=0}^N$ and $(M_t^-(m))_{t=0}^N$ are martingales with respect to the same filtration, and that convex combinations of such martingales are also martingales. ∎
With this setup and notation in mind, SqKelly as defined in Section 3.2 is a special case of $M_t^\pm(m)$ with $\eta = 1$. As noted by [jamroga2019risk], RLTs involving multiple assertions do require correction for multiple testing, unlike RLAs. The same is true for confidence-sequence-based RLTs (and hence the tricks of Section 2.2 do not apply). It suffices to perform a simple Bonferroni correction by constructing $(1 - \alpha/d)$ confidence sequences to establish $d$ simultaneous assertions.
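A sketch of the two-sided mixture in Proposition 2, keeping the bets small enough that both component processes stay nonnegative (all names and the particular bet grid are ours):

```python
def two_sided_mixture(samples, m, N, lams=(0.0, 0.25, 0.5, 0.75), eta=0.5):
    """eta * M_t^+ + (1 - eta) * M_t^- with uniform weights: the '+'
    component grows when observations exceed the null conditional mean,
    and the '-' component grows when they fall below it. Bets are kept in
    [0, 1) so every factor 1 +/- lam * (x - mu) stays positive for x in
    [0, 1]."""
    K = len(lams)
    w = [1.0 / K] * K  # uniform convex weights for both components
    plus, minus = [1.0] * K, [1.0] * K
    s, path = 0.0, []
    for i, x in enumerate(samples, start=1):
        mu = (N * m - s) / (N - i + 1)  # null conditional mean
        plus = [p * (1.0 + lam * (x - mu)) for p, lam in zip(plus, lams)]
        minus = [p * (1.0 - lam * (x - mu)) for p, lam in zip(minus, lams)]
        s += x
        Mp = sum(wk * p for wk, p in zip(w, plus))
        Mm = sum(wk * p for wk, p in zip(w, minus))
        path.append(eta * Mp + (1.0 - eta) * Mm)
    return path

# A run of ballots all below the null m = 1/2 (all zeros):
path = two_sided_mixture([0.0] * 20, m=0.5, N=1000)
print(path[-1] > 1.0)  # the '-' component detects that the mean is below 1/2
```

A symmetric run of all ones would instead be detected by the '+' component, which is precisely the two-sided power that RLTs require.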
6 Summary
This paper presented a general framework for conducting risk-limiting audits based on confidence sequences, and derived computationally and statistically efficient martingales for computing them. We showed how a priori Kelly takes advantage of the reported vote totals (if available) to stop ballot-polling audits significantly earlier than extant ballot-polling methods, and how alternative martingales such as SqKelly also provide strong empirical performance in the absence of reported outcomes. Finally, we demonstrated how a simple tweak to the aforementioned algorithms provides two-sided confidence sequences, which can be used to perform risk-limiting tallies. These confidence sequences and martingales can be applied to ballot-level comparison audits and batch-level comparison audits as well, using "overstatement assorters" [stark2019sets], which reduce comparison audits to the same canonical statistical problem: testing whether the mean of any list in a collection of nonnegative, bounded lists is at most 1/2. We hope that this new perspective on RLAs and its associated software will aid in making election audits simpler, faster, and more transparent.