1 Introduction
The well-known stochastic multi-armed bandit (MAB) problem is a sequential decision-making task in which a learner repeatedly chooses an action (or arm) and receives a noisy reward. The learner’s objective is to maximize cumulative reward by exploring the actions to discover optimal ones (those having the highest expected reward), balanced with exploiting the seemingly best ones. The problem, originally stemming from experiments in medicine (Robbins, 1952), has applications in fields such as ranking (Kveton et al., 2015), recommendation systems (collaborative filtering) (Caron and Bhagat, 2013), investment portfolio design (Hoffman et al., 2011) and online advertising (Schwartz et al., 2017), to name a few. Such applications, relying on sensitive data, raise privacy concerns.
Differential privacy (Dwork et al., 2006) has in recent years become the gold standard for privacy-preserving data analysis alleviating such concerns, as it requires that the output of the data-analysis algorithm has a limited dependency on any single datum. Differentially private variants of online learning algorithms have been successfully devised in various settings (Smith and Thakurta, 2013), including a private UCB algorithm for the MAB problem (details below) (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016) as well as UCB variations in the linear (Kannan et al., 2018) and contextual (Shariff and Sheffet, 2018) settings.
More formally, in the MAB problem at every timestep the learner selects an arm a out of K available arms, pulls it and receives a random reward drawn i.i.d. from a distribution of bounded support and unknown mean μ_a. The Upper Confidence Bound (UCB) algorithm for the MAB problem was developed in a series of works (Berry and Fristedt, 1985; Agrawal, 1995) culminating in (Auer et al., 2002a), and is provably optimal for the MAB problem (Lai and Robbins, 1985). The UCB algorithm maintains a time-dependent high-probability upper bound on each arm’s mean, and at each timestep optimistically pulls the arm with the highest bound. The above-mentioned differentially private (DP) analogues of the UCB algorithm follow the same procedure except for maintaining noisy mean estimations
using the “tree-based mechanism” (Chan et al., 2010; Dwork et al., 2010). This mechanism continuously releases aggregated statistics over a stream of observations, introducing only poly-logarithmic noise in each timestep. The details of this polylog factor are the focus of this work. It was recently shown (Shariff and Sheffet, 2018) that any DP stochastic MAB algorithm (in this work, we focus on pure ε-DP, rather than approximate DP) must incur an added pseudo-regret of Ω(K·log(T)/ε). However, it is commonly known that any algorithm that relies on the tree-based mechanism must incur a strictly larger added pseudo-regret. Indeed, the tree-based mechanism maintains a binary tree over the streaming observations, a tree of depth O(log(T)), where each node in this tree holds an i.i.d. sample from a Laplace distribution. At each timestep t, the mechanism outputs the sum of the first t observations added to the sum of the nodes on the root-to-t-th-leaf path in the binary tree. As a result, the variance of the added noise at each timestep is O(log^3(T)/ε^2), making the noise per timestep Ω(log^1.5(T)/ε). (In fact, most analyses of the tree-based mechanism rely on the union bound over all T timesteps, obtaining a noise bound of O(log^2(T)/ε); consequently the added-regret bound of the DP-UCB algorithm is O(K·log^2(T)/ε). (Tossou and Dimitrakakis, 2016) claim a tighter bound, but (i) rely on approximate DP rather than pure DP and, more importantly, (ii) “sweep under the rug” several factors that are themselves poly-logarithmic in T; (Mishra and Thakurta, 2015) show a weaker bound.) Thus, in a setting where each of the K tree-mechanisms (one per arm) is run over Ω(T/K) observations (say, if all arms have comparable suboptimality gaps), the private UCB algorithm must unavoidably obtain an added regret super-logarithmic in T (on top of the regret of the UCB algorithm). It is therefore clear that the challenge in devising an optimal DP algorithm for the MAB problem, namely an algorithm with added regret of O(K·log(T)/ε), is algorithmic in nature: we must replace the suboptimal tree-based mechanism with a different, simpler, mechanism.

Our Contribution and Organization.
In this work, we present an optimal algorithm for the stochastic MAB problem, which meets both the non-private lower bound of (Lai and Robbins, 1985) and the private lower bound of (Shariff and Sheffet, 2018). Our algorithm is a DP variant of the Successive Elimination (SE) algorithm (Even-Dar et al., 2002), a different optimal algorithm for stochastic MAB. SE works by pulling all arms sequentially, maintaining the same confidence interval around the empirical average of each arm’s reward (as all remaining arms are pulled the exact same number of times); and when an arm is found to be noticeably suboptimal in comparison to a different arm, it is eliminated from the set of viable arms (all arms are viable initially). To design a DP analogue of SE we first consider the case of K = 2 arms and ask ourselves: what is the optimal way to privately discern whether the gap between the mean rewards of two arms is positive or negative? This motivates the study of private stopping rules, which take as input a stream of i.i.d. observations from a distribution of bounded support (bounded in magnitude by some known R) and unknown mean μ, and halt once they obtain a multiplicative θ-approximation of μ with confidence of at least 1 − δ. Note that due to the multiplicative nature of the required approximation, it is impossible to straightforwardly use the Hoeffding or Bernstein bounds; rather, a stopping rule must alter its halting condition with time. (Domingo et al., 2002) proposed a stopping rule known as the Nonmonotonic Adaptive Sampling (NAS) algorithm, which relies on Hoeffding’s inequality to maintain a confidence interval at each timestep. They showed a sample complexity bound that was later slightly improved by (Mnih et al., 2008). The work of (Dagum et al., 2000) shows an essentially matching sample complexity lower bound. Stopping rules have also been applied to Reinforcement Learning and Racing algorithms (see Sajed et al. (2018); Mnih et al. (2008)).

In this work we introduce a DP analogue of the NAS algorithm that is based on the sparse vector technique (SVT), with added sample complexity of (roughly) O(R·log(1/δ)/(εθμ)). Moreover, we show that this added sample complexity is optimal, in the sense that any DP stopping rule has a matching sample complexity lower bound. After we introduce preliminaries in Section 2, we present the private NAS in Section 3. We then turn our attention to the design of the private SE algorithm. Note that straightforwardly applying private stopping rules yields a suboptimal algorithm whose added regret scales with the number of arm pairs rather than with the number of arms. Instead, we partition the algorithm’s arm-pulls into epochs, where epoch e is set to eliminate all arms with suboptimality gaps greater than roughly 2^{-e}. By design each epoch must be at least twice as long as the previous epoch, and so we can reset the algorithm in between epochs (computing empirical means from fresh reward samples) while incurring only a constant-factor increase to the regret bound. Note that as a side benefit our algorithm also solves the private Best Arm Identification problem, with provably optimal cost. Details appear in Section 4. We also assess the empirical performance of our algorithm in comparison to the DP-UCB baseline and show that the improvement in analysis (despite the use of large constants) is also empirically evident; details are provided in Section 5. Lastly, future directions for this work are discussed in Section 6.

Discussion.
Some may find the results of this work underwhelming: after all, the improvement we put forth is solely over poly-logarithmic factors, and admittedly these are already subsumed by the non-private regret bound of the algorithm under many “natural” settings of parameters. Our reply to these is twofold. First, our experiments (see Section 5) show a significantly improved performance empirically, which is due solely to the different algorithmic approach. Second, as the designers of privacy-preserving learning algorithms, it is our “moral duty” to quantify the added cost of privacy on top of the already existing cost, and to push this added cost to its absolute lowest.
We would also like to emphasize a more philosophical point arising from this work. Both the UCB algorithm and the SE algorithm are provably optimal for the MAB problem in the non-private setting, and are therefore equivalent. But the UCB algorithm makes an input-dependent choice (which arm to pull) at each timestep, whereas the SE algorithm’s input-dependent choices are reflected only in the special timesteps at which it declares “eliminate arm a” (in any other timestep it simply chooses the next viable arm). In that sense, the SE algorithm is simpler than the UCB algorithm, making it the less costly of the two to privatize. In other words, differential privacy gives quantitative reasoning for preferring one algorithm over another because “simpler is better.” While not a full-fledged theory (yet), we believe this narrative is of importance to anyone who designs differentially private data-analysis algorithms.
2 Preliminaries
Stopping Rules.
In the stopping rule problem, the input consists of a stream of i.i.d. samples X_1, X_2, X_3, ... drawn from a distribution over an a-priori known bounded support and with unknown mean μ. Given θ, δ ∈ (0, 1), the goal of the stopping rule is to halt after seeing as few samples as possible while releasing a multiplicative θ-approximation of μ at halting time. Namely, a stopping rule halts at some time t and releases an estimate μ̂ such that Pr[|μ̂ − μ| ≤ θ|μ|] ≥ 1 − δ. (It should be clear that the halting time increases as θ decreases.) During any timestep t, we denote the running sum S_t = X_1 + ... + X_t and the running average X̄_t = S_t / t.
Stochastic MAB and its optimal bounds.
The formal description of the stochastic MAB problem was provided in the introduction. Formally, the bound maintained by the UCB algorithm for each arm a at a given timestep t is μ̂_a + sqrt(2·log(t)/n_a), with μ̂_a denoting the empirical mean reward from pulling arm a and n_a denoting the number of times a was pulled thus far. We use a* to denote the leading arm, namely, an arm of highest mean reward: a* = argmax_a μ_a. Given any arm a we denote the mean-gap as Δ_a = μ_{a*} − μ_a, with Δ_{a*} = 0 by definition. Additionally we denote the horizon by T, the number of rounds that a MAB algorithm will be run for. An algorithm that chooses arm a_t at timestep t incurs an expected regret, or pseudo-regret, of Δ_{a_t}. It is well-known (Lai and Robbins, 1985) that any consistent regret-minimization algorithm (a regret-minimization algorithm is called consistent if its regret is sub-polynomial, namely o(T^α) for any α > 0) must incur a pseudo-regret of Ω(Σ_{a: Δ_a > 0} log(T)/Δ_a); and indeed the UCB algorithm meets this bound and has pseudo-regret of O(Σ_{a: Δ_a > 0} log(T)/Δ_a). However, the minimax regret bound of the UCB algorithm is O(sqrt(K·T·log(T))), obtained by an adversary that knows T and sets all suboptimal arms’ gaps to Θ(sqrt(K·log(T)/T)), whereas the minimax lower bound of any algorithm is slightly smaller: Ω(sqrt(K·T)) (Auer et al., 2002a).
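For concreteness, the UCB rule described above can be sketched as follows, assuming rewards in [0, 1] and the standard radius sqrt(2·log(t)/n_a):

```python
import math

def ucb_index(emp_mean, n_pulls, t):
    """UCB index: empirical mean plus the confidence radius sqrt(2 ln t / n)."""
    return emp_mean + math.sqrt(2.0 * math.log(t) / n_pulls)

def choose_arm(emp_means, pulls, t):
    """Pull each arm once first, then pick the arm with the highest index."""
    for a, n in enumerate(pulls):
        if n == 0:
            return a
    return max(range(len(pulls)),
               key=lambda a: ucb_index(emp_means[a], pulls[a], t))
```

Note how a heavily pulled arm with a slightly better empirical mean can still lose to an under-explored arm, which is exactly the optimism that drives exploration.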
Differential Privacy.
In this work, we preserve event-level privacy under continuous observation (Dwork et al., 2010). Formally, we say two streams S and S′ are neighbours if they differ on a single entry in a single timestep t, and are identical in any other timestep. An algorithm A is ε-differentially private if for any two neighboring streams S and S′ and for any set D of possible sequences of decisions made from timestep 1 through T, it holds that Pr[A(S) ∈ D] ≤ e^ε · Pr[A(S′) ∈ D]. Note that much like its input, the output of A is also released in a stream-like fashion, and the requirement should hold for all decisions made by A in all timesteps.
In this work, we use two mechanisms that are known to be DP. The first is the Laplace mechanism (Dwork et al., 2006). Given a function f that takes as input a stream and releases an output in R^d, we denote its global sensitivity as the largest L1-distance between f(S) and f(S′) over any two neighboring streams; and the Laplace mechanism adds a random (independent) sample from Lap(GS_f/ε) to each coordinate of f. The other mechanism we use is the sparse-vector technique (SVT), which takes in addition a sequence of queries (each query of global sensitivity Λ), and halts at the very first query whose value exceeds a given threshold. The SVT works by adding random noise sampled i.i.d. from Lap(2Λ/ε) to the threshold and from Lap(4Λ/ε) to each of the query values. See (Dwork et al., 2014) for more details.
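A minimal sketch of these two primitives, with the SVT in its simplest halt-at-first-success (AboveThreshold) form; the noise scales 2Λ/ε and 4Λ/ε follow the standard presentation in (Dwork et al., 2014):

```python
import math
import random

def laplace(scale, rng):
    """Inverse-CDF sample of a centered Laplace variable with the given scale."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def above_threshold(query_values, threshold, eps, sensitivity=1.0, rng=None):
    """Sparse-vector (AboveThreshold) sketch: return the index of the first
    query whose noisy value exceeds the noisy threshold, or None if no query
    ever does. Threshold noise is Lap(2*Lambda/eps), query noise Lap(4*Lambda/eps)."""
    rng = rng or random.Random(0)
    noisy_thr = threshold + laplace(2.0 * sensitivity / eps, rng)
    for i, q in enumerate(query_values):
        if q + laplace(4.0 * sensitivity / eps, rng) >= noisy_thr:
            return i
    return None
```

With a meaningful ε the returned index is only correct with high probability; the sketch below uses a huge ε merely to make its behavior deterministic for illustration.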
Concentration bounds.
A Laplace r.v. X ~ Lap(b) is sampled from a distribution with density proportional to exp(−|x|/b). It is known that Var(X) = 2b², and that for any t > 0 it holds that Pr[|X| ≥ t·b] = exp(−t).
Throughout this work we also rely on the Hoeffding inequality (Hoeffding, 1963). Given a collection X_1, ..., X_n of i.i.d. random variables that take value in a finite interval of length R and have mean μ, it holds that Pr[|X̄ − μ| ≥ t] ≤ 2·exp(−2nt²/R²), where X̄ denotes their average.

Additional Notation and Facts.
Throughout this work, log denotes the logarithm base 2. Given two distributions P and Q, we denote their total-variation distance as d_TV(P, Q). We emphasize that we made no effort to minimize constants throughout this work. We also rely on the following folklore fact; for completeness, its proof is shown in Appendix Section A.
Fact 2.1.
Fix any and any . Then for any it holds that , and for any it holds that .
3 Differentially Private Stopping Rule
In this section, we derive a differentially private stopping rule algorithm, DP-NAS, which is based on the non-private NAS (Nonmonotonic Adaptive Sampling) algorithm. The non-private NAS is rather simple. Given t and δ, denote by c_t the confidence-interval width derived from the Hoeffding bound, with a confidence budget split over the timesteps (say, δ_t = δ/(t(t+1))), for t i.i.d. random samples bounded in magnitude by R; thus, w.p. at least 1 − δ it holds that |X̄_t − μ| ≤ c_t simultaneously for all t. The NAS algorithm halts at the first t for which |X̄_t| ≥ (1 + 1/θ)·c_t, equivalently c_t ≤ (θ/(1+θ))·|X̄_t|. Indeed, such a stopping rule assures that |X̄_t − μ| ≤ c_t ≤ θ|μ|, where the last inequality follows from rearranging c_t ≤ (θ/(1+θ))·|X̄_t| ≤ (θ/(1+θ))·(|μ| + c_t).
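A sketch of the non-private NAS rule just described, assuming samples bounded in magnitude by R and splitting the confidence budget over timesteps as δ_t = δ/(t(t+1)):

```python
import math

def nas_stopping_rule(stream, theta, delta, R):
    """Non-private NAS sketch: after t samples compute the Hoeffding radius
    c_t = R * sqrt(log(2 t (t+1) / delta) / (2 t)) (union bound over all t)
    and halt once c_t <= (theta / (1 + theta)) * |mean_t|, which guarantees
    |mean_t - mu| <= theta * |mu| with probability at least 1 - delta."""
    total = 0.0
    for t, x in enumerate(stream, start=1):
        total += x
        mean = total / t
        c_t = R * math.sqrt(math.log(2.0 * t * (t + 1) / delta) / (2.0 * t))
        if c_t <= theta / (1.0 + theta) * abs(mean):
            return t, mean
    return None  # stream exhausted before the rule could halt
```

Because the halting condition compares c_t against |mean_t| rather than a fixed threshold, the halting time automatically adapts to the (unknown) magnitude of μ.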
In order to make NAS differentially private we use the sparse vector technique, since the algorithm is essentially posing a series of threshold queries: “is |X̄_t| at least (1 + 1/θ)·c_t?”. Recall that the sparse-vector technique adds random noise both to the threshold and to the answer of each query, and so we must adjust the naïve threshold of (1 + 1/θ)·c_t to a somewhat larger one in order to make sure that, at halting time, X̄_t is sufficiently close to μ. Lastly, since our goal is to provide a private approximation of the distribution mean, we also apply the Laplace mechanism to the released average to assert that the output is differentially private. Details appear in Algorithm 1.
Theorem 3.1.
Algorithm 1 is an ε-DP stopping rule.
Proof.
First, we argue that Algorithm 1 is differentially private. This follows immediately from the fact that the algorithm is a combination of the sparse-vector technique with the Laplace mechanism. The first part of the algorithm halts when the noisy running sum exceeds the noisy threshold. Indeed, this is the sparse-vector mechanism for a sum-query of sensitivity no more than 2R. It follows that sampling both the threshold noise and the query noise from suitably scaled Laplace distributions suffices to maintain DP. Similarly, adding a suitably scaled Laplace sample suffices to release the mean with DP at the very last step of the algorithm.
Under the assumption that all X_i are i.i.d. samples from a distribution of mean μ, the Hoeffding bound and a union bound over all timesteps give that |X̄_t − μ| ≤ c_t simultaneously for all t, except with small probability. Standard concentration bounds on the Laplace distribution likewise bound the threshold noise, the query noise, and the noise added to the released average. It follows that w.p. at least 1 − δ none of these bad events happen.
It follows that at the time t at which we halt, the halting condition together with the above bounds gives |X̄_t − μ| ≤ c_t ≤ θ|μ|, where the latter inequality follows from the same rearrangement as in the non-private analysis. Therefore, the released noisy average is a multiplicative θ-approximation of μ. ∎
Rather than analyzing the utility of Algorithm 1, namely, the high-probability bounds on its stopping time, we now turn our attention to a slight modification of the algorithm and analyze the revised algorithm’s utility. The modification we introduce, albeit technical and non-instrumental in the utility bounds, plays a conceptual role in the description of later algorithms. We introduce Algorithm 2, where we exponentially reduce the number of SVT queries using the standard doubling technique. Instead of querying the magnitude of the average at each timestep, we query it at exponentially growing intervals, thus paying no more than a constant factor in the utility guarantees while still reducing the number of SVT queries dramatically.
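A sketch combining the SVT-style halting condition with the doubling schedule (the noise scales and the threshold-adjustment constant are illustrative, not Algorithm 2's exact ones):

```python
import math
import random

def dp_nas_doubling(stream, theta, delta, R, eps, seed=0):
    """Sketch of the doubling-interval private stopping rule: the SVT
    halting condition is checked only at times t = 1, 2, 4, 8, ..., so the
    number of SVT queries is logarithmic in the halting time while the
    halting time itself grows by at most a constant factor."""
    rng = random.Random(seed)

    def lap(b):
        u = rng.random() - 0.5
        return -b * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

    thr_noise = lap(4.0 * R / eps)   # SVT threshold noise (sum queries
                                     # have sensitivity at most 2R)
    total, t, next_check = 0.0, 0, 1
    for x in stream:
        t += 1
        total += x
        if t == next_check:
            # Hoeffding radius with confidence shared over all timesteps
            c_t = R * math.sqrt(math.log(2.0 * t * (t + 1) / delta) / (2.0 * t))
            # SVT query: has |sum| cleared the (noisy) moving threshold?
            if abs(total) + lap(8.0 * R / eps) >= \
                    t * (1.0 + 1.0 / theta) * c_t + thr_noise:
                # release a Laplace-perturbed mean
                return t, total / t + lap(2.0 * R / (eps * t))
            next_check *= 2
    return None
```

Compared to checking at every timestep, the doubling schedule can overshoot the ideal halting time by at most a factor of 2, which is absorbed into the constant of the sample complexity bound.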
Corollary 3.2.
Algorithm 2 is an ε-DP stopping rule.
Proof.
The only difference between Algorithms 1 and 2 lies in checking the halting condition at exponentially increasing time intervals, namely at times t = 2^j for j = 0, 1, 2, .... The privacy analysis remains the same as in the proof of Theorem 3.1, and the correctness analysis is modified by considering only the timesteps at which we check the halting condition. Formally, we denote by E the event where (i) the Hoeffding bound holds at every checked timestep, and the (ii) threshold noise, (iii) query noise, and (iv) release noise are all suitably bounded. Analogously to the proof of Theorem 3.1 we bound Pr[¬E] ≤ δ and the result follows. ∎
Theorem 3.3.
Fix θ ∈ (0, 1) and δ ∈ (0, 1). Let X_1, X_2, ... be an ensemble of i.i.d. samples from any distribution over a range bounded in magnitude by R and with mean μ > 0. Then with probability at least 1 − δ, Algorithm 2 halts by a timestep t* = O(((R/(θμ))² + R/(εθμ)) · log(R/(θμδ))).
Proof.
Recall the event E from the proof of Corollary 3.2 and its four conditions. We assume E holds, and so the algorithm releases a multiplicative θ-approximation of μ. To prove the claim, we show that under E, the halting condition must be satisfied by timestep t*.

Under E, the threshold noise, the query noise and the Hoeffding radius are all suitably bounded; and so it suffices to show that at time t* the (noise-free) empirical average clears the adjusted threshold with room to spare. In fact, we show something slightly stronger: that at time t* each of these quantities is at most a small constant fraction of θμ. This however is an immediate corollary of the following three facts, of which the first two rely on Fact 2.1:

- for any t ≥ t*, the Hoeffding radius c_t is at most a small constant fraction of θμ;
- for any t ≥ t*, the high-probability bounds on the SVT threshold noise and query noise are at most a small constant fraction of θμ;
- for any t ≥ t*, under E, X̄_t ≥ μ − c_t clears the adjusted halting threshold.

It follows therefore that at time t* all three conditions hold, and so, due to the exponential growth of the intervals, by time 2t* we reach some t which is a power of 2, on which we pose a query to the SVT mechanism and halt. ∎
3.1 Private Stopping Rule Lower Bounds
We turn our attention to proving the (near) optimality of Algorithm 2. A non-private lower bound was proven in (Dagum et al., 2000), who showed that no stopping rule algorithm can achieve a sample complexity better than Ω((σ²/(θμ)²)·log(1/δ)), with σ² denoting the variance of the underlying distribution. In this section, we prove a lower bound on the additional sample complexity that any DP stopping rule algorithm must incur. We summarize our result below:
Theorem 3.4.
Any ε-differentially private stopping rule whose input consists of a stream of i.i.d. samples from a distribution over a support bounded in magnitude by R and with mean μ must have a sample complexity of Ω(R·log(1/δ)/(εθμ)).
Proof.
Fix θ, μ and R such that 0 < μ < R and θ ∈ (0, 1/3), and fix ε and δ. We define two distributions, P and Q, over a support consisting of two discrete points: {0, R}. Setting Pr_P[X = R] = μ/R, we have that the mean of P is μ. Set μ′ as any number infinitesimally below the threshold of (1−θ)μ/(1+θ), so that we have that the intervals [(1−θ)μ′, (1+θ)μ′] and [(1−θ)μ, (1+θ)μ] are disjoint; we set the parameters of Q s.t. Pr_Q[X = R] = μ′/R, so the mean of Q is μ′. By definition, the total-variation distance d_TV(P, Q) = (μ − μ′)/R = Θ(θμ/R).
Let M be any ε-differentially private stopping rule. Denote m = ln(1/(4δ)) / (6ε·d_TV(P, Q)) = Θ(R·log(1/δ)/(εθμ)). Let E be the event “after seeing at most m samples, M halts and outputs a number in the interval [(1−θ)μ, (1+θ)μ].” We now apply the following, very elegant, lemma from (Karwa and Vadhan, 2018), stating that the group-privacy loss of an ε-differentially private mechanism taking as input m i.i.d. samples either from a distribution P or from a distribution Q scales effectively as εm·d_TV(P, Q).
Lemma 3.5 (Lemma 6.1 from (Karwa and Vadhan, 2018)).
Let M be any ε-differentially private mechanism, fix a natural number m and two distributions P and Q, and let X_P and X_Q denote ensembles of m i.i.d. samples taken from P and Q resp. Then for any possible set of outputs S it holds that Pr[M(X_P) ∈ S] ≤ exp(6εm·d_TV(P, Q)) · Pr[M(X_Q) ∈ S].
And so, applying M over m i.i.d. samples taken from Q, we must have that Pr[E] ≤ δ, since under Q any output in the interval [(1−θ)μ, (1+θ)μ] fails to be a θ-approximation of the mean μ′. Applying Lemma 3.5 to our setting, we get that over m i.i.d. samples taken from P, Pr[E] ≤ exp(6εm·d_TV(P, Q)) · δ = 1/4, since exp(6εm·d_TV(P, Q)) = 1/(4δ) by our choice of m. Since, by definition, the probability of the event “after seeing at most m samples, M halts and outputs a number outside the interval [(1−θ)μ, (1+θ)μ]” over i.i.d. samples from P is at most δ, it must be that M halts after seeing strictly more than m samples w.p. at least 1 − δ − 1/4 ≥ 1/2. ∎
Combining the non-private lower bound of (Dagum et al., 2000) and the bound of Theorem 3.4, we immediately infer the overall sample complexity bound, which follows from the fact that the two-point distribution used in the proof of Theorem 3.4 has variance Θ(Rμ).
Corollary 3.6.
There exists a distribution for which any ε-differentially private stopping rule algorithm has a sample complexity of Ω((σ²/(θμ)² + R/(εθμ)) · log(1/δ)).
Discussion.
How optimal is Algorithm 2? The sample complexity bound in Theorem 3.3 can be interpreted as the sum of a non-private part and a private part. The non-private part is proportional to (R/(θμ))² and the private part to R/(εθμ), up to logarithmic factors. If we add in the assumption that the logarithmic factors are comparable, namely log(R/(θμ)) = O(log(1/δ)), we get that the upper bound of Theorem 3.3 matches the lower bound in Corollary 3.6.
How benign is this assumption? Much like (Mnih et al., 2008), we too believe it is a very mild assumption. Specifically, in the next section, where we deal with finite sequences of length T, we set δ as proportional to 1/T. Since over a finite-length sequence we can only retrieve an approximation of μ when the required number of samples is at most T, the assumption is trivially satisfied there. However, we cannot completely disregard the possibility of using a private stopping rule in a setting where, for example, both θ and δ are constants whereas μ is sub-constant. In such a setting, the remaining logarithmic factor may dominate log(1/δ), and there it might be possible to improve on the performance of Algorithm 2 (or tighten the lower bound).
4 An Optimal Private MAB Algorithm
In this section, our goal is to devise an optimal differentially private algorithm for the stochastic K-arm bandit problem, in a setting where all rewards are bounded in [0, 1]. We denote the mean reward of each arm a as μ_a, the best arm as a*, and for any a we refer to the gap Δ_a = μ_{a*} − μ_a. We seek an optimal algorithm in the sense that it should meet both the non-private instance-dependent bound of (Lai and Robbins, 1985) and the lower bound of (Shariff and Sheffet, 2018); namely, an algorithm with an instance-dependent pseudo-regret bound of O(Σ_{a: Δ_a > 0} log(T)/Δ_a + K·log(T)/ε). The algorithm we devise is a differentially private version of the Successive Elimination (SE) algorithm (Even-Dar et al., 2002). SE initializes by setting all arms as viable options, and iteratively pulls all viable arms, maintaining the same confidence interval around the empirical average of each viable arm’s reward. Once some viable arm’s upper confidence bound is strictly smaller than the lower confidence bound of some other viable arm, the arm with the lower empirical reward is eliminated and is no longer considered viable. It is worthwhile to note that the classical UCB algorithm and the SE algorithm have the same asymptotic pseudo-regret. To design the differentially private analogue of SE, we use our results from the previous section regarding stopping rules. After all, in the special case where we have K = 2 arms, we can straightforwardly use the private stopping rule to assess the mean of the difference between the arms up to a constant multiplicative factor (say θ = 1/2). The question lies in applying this algorithm in the general K > 2 case.
Here are a few failed first attempts. The most straightforward idea is to apply stopping rules / SVTs for all pairs of arms; but since the reward of a single pull of any single arm plays a role in up to K − 1 SVT instantiations, it follows that we would have to scale down the privacy loss of each SVT to ε/K, resulting in an added regret scaled up by a factor of K. In an attempt to reduce the number of SVT instantiations, we might consider asking for each arm whether there exists an arm with a significantly greater reward, yet it still holds that the reward from a single pull of the leading arm plays a role in K − 1 SVT instantiations. Next, consider merging all queries into a single SVT, posing in each round K queries (one per arm) and halting once we find that a certain arm is suboptimal; but this results in a single SVT that may halt K − 1 times, causing us yet again to scale ε down by a factor of K.
In order to avoid scaling ε down by a factor of K, our solution leverages the combination of parallel decomposition and geometrically increasing intervals. Namely, we partition the arm pulls of the algorithm into epochs of geometrically increasing lengths, where in epoch e we eliminate all arms of suboptimality gap on the order of 2^{-e}. In fact, it turns out we needn’t apply the SVT at the end of each epoch (we thank the anonymous referee for this elegant observation) but rather just test for a noticeably under-performing arm using a private histogram. The key point is that at the beginning of each new epoch we nullify all counters and start the mean-reward estimation completely anew (over the remaining set of viable arms), and so a single reward plays a role in only one epoch, allowing for ε-DP mean estimation in each epoch (rather than ε/K). Yet, because the epochs are of exponentially growing lengths, the total number of pulls of any suboptimal arm is proportional to the length of the epoch in which it is eliminated, resulting in only a constant-factor increase to the regret. The full-fledged details appear in Algorithm 3.
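The epoch structure just described can be sketched as follows; the per-epoch pull counts, noise scale, and elimination threshold are illustrative constants, `pull` is a hypothetical reward oracle, and this is not the paper's Algorithm 3 verbatim:

```python
import math
import random

def dp_successive_elimination(pull, K, eps, delta, max_epochs=20, rng=None):
    """Sketch of epoch-based private successive elimination: epoch e targets
    gap 2**-e, pulls every viable arm enough times for an accurate private
    mean (counters are reset between epochs, so each reward affects exactly
    one epoch), then eliminates arms noticeably below the noisy leader.
    pull(a) returns one reward in [0, 1]."""
    rng = rng or random.Random(0)

    def lap(b):
        u = rng.random() - 0.5
        return -b * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

    viable = list(range(K))
    for e in range(1, max_epochs + 1):
        if len(viable) == 1:
            break
        gap = 2.0 ** -e
        # enough pulls so Hoeffding radius and Laplace noise are << gap
        n = math.ceil(max(32.0 * math.log(4.0 * K * e * e / delta) / gap ** 2,
                          8.0 / (eps * gap) * math.log(4.0 * K * e * e / delta)))
        means = {}
        for a in viable:                       # fresh estimates each epoch
            s = sum(pull(a) for _ in range(n))
            means[a] = s / n + lap(1.0 / (eps * n))   # private release
        best = max(means.values())
        viable = [a for a in viable if means[a] >= best - gap / 2]
    return viable
```

Because counters are reset at each epoch boundary, the Laplace noise per epoch needs scale only 1/(ε·n) rather than 1/(ε·n/K), which is the parallel-decomposition point made above.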
Theorem 4.1.
Algorithm 3 is ε-differentially private.
Proof.
Consider two streams of arm rewards that differ in the reward of a single arm in a single timestep. This timestep plays a role in a single epoch e. Moreover, let a be the arm whose reward differs between the two neighboring streams. Since the reward of each arm is bounded in [0, 1], it follows that the difference between the empirical means of arm a computed in epoch e on the two neighboring streams is at most 1/n_e, where n_e denotes the number of pulls of each viable arm in epoch e. Thus, adding noise of Lap(1/(ε·n_e)) to each such mean guarantees ε-DP. ∎
To argue about the optimality of Algorithm 3, we require the following lemma, a key step in the subsequent theorem that bounds the pseudo-regret of the algorithm.
Lemma 4.2.
Fix any instance of the MAB problem; denote by a* its optimal arm (of highest mean), and by Δ_a the gap between the mean of arm a* and any suboptimal arm a. Fix any horizon T. Then w.p. at least 1 − δ it holds that Algorithm 3 pulls each suboptimal arm a for a number of timesteps upper bounded by min{T, O(log(T)/Δ_a² + log(T)/(εΔ_a))}.
Proof of Lemma 4.2.
The bound of T is trivial, so we focus on proving the latter bound. Given an epoch e with target gap Δ_e = 2^{-e}, we denote by G_e the event where for all arms a viable in epoch e it holds that both (i) the empirical mean of a in epoch e is within Δ_e/8 of μ_a and (ii) the Laplace noise added to it is at most Δ_e/8 in magnitude; and we also denote by G the intersection of all the G_e. The Hoeffding bound, concentration of the Laplace distribution and the union bound over all arms and epochs give that Pr[¬G] ≤ δ. The remainder of the proof continues under the assumption that G holds, and so, for any epoch e and any viable arm a in this epoch, the noisy empirical mean of a is within Δ_e/4 of μ_a. As a result, for any epoch e and any two viable arms a, b, the difference of their noisy empirical means is within Δ_e/2 of μ_a − μ_b.
Next, we argue that under G the optimal arm a* is never eliminated. Indeed, for any epoch e, denote by b the viable arm of highest noisy empirical mean; it is simple enough to see that the noisy-mean difference between b and a* is at most (μ_b − μ_{a*}) + Δ_e/2 ≤ Δ_e/2, so the algorithm doesn’t eliminate a*.
Next, we argue that, under G, in any epoch e we eliminate all viable arms with suboptimality gap Δ_a ≥ 2Δ_e. Fix an epoch e and a viable arm a with suboptimality gap Δ_a ≥ 2Δ_e. Note that we have set the per-arm number of pulls n_e so that, under G, each noisy empirical mean is within Δ_e/4 of its true mean. Therefore, since arm a* remains viable, the noisy-mean difference between a* and a is at least Δ_a − Δ_e/2 > Δ_e/2, guaranteeing that arm a is removed from the viable set.
Lastly, fix a suboptimal arm a and let e_a be the first epoch such that 2Δ_{e_a} ≤ Δ_a, so that under G arm a is eliminated by the end of epoch e_a; moreover Δ_{e_a} > Δ_a/4. Using the immediate observation that the epoch lengths at least double from one epoch to the next, we have that the total number of pulls of arm a is at most twice its number of pulls in epoch e_a. The bound Δ_{e_a} > Δ_a/4 together with the setting of the per-epoch pull counts allows us to conclude and infer that under G the total number of pulls of arm a is at most O(log(T)/Δ_a² + log(T)/(εΔ_a)). ∎
Theorem 4.3.
Fix any horizon T. The pseudo-regret of Algorithm 3 is at most O(Σ_{a: Δ_a > 0} log(T)/Δ_a + K·log(T)/ε).
Proof.
In order to bound the expected regret based on the high-probability bound given in Lemma 4.2, we set δ = 1/T. (Alternatively, we use the standard guess-and-double technique when the horizon is unknown, i.e., we start with a small guess for T and, upon reaching the guessed horizon, we double the guess.) Thus, with probability at most 1/T we may pull a suboptimal arm on all timesteps, incurring regret of at most T, which contributes T·(1/T) = O(1) to the expectation; and otherwise, since each time we pull a suboptimal arm a we incur an expected regret of Δ_a, our overall expected regret when T is sufficiently large is, by Lemma 4.2, proportional to at most Σ_{a: Δ_a > 0} (log(T)/Δ_a + log(T)/ε) = O(Σ_{a: Δ_a > 0} log(T)/Δ_a + K·log(T)/ε). ∎
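The guess-and-double technique mentioned in the proof can be sketched as follows; `run_phase` is a hypothetical stand-in for restarting Algorithm 3 with a known-horizon guess:

```python
def guess_and_double(run_phase, total_steps, initial_guess=2):
    """Guess-and-double sketch for an unknown horizon: run the algorithm
    configured for a guessed horizon, and once the guess is exceeded,
    restart it with a doubled guess. Since phase lengths grow geometrically,
    the summed regret is within a constant factor of the known-horizon
    regret. run_phase(T) simulates T steps and returns the regret incurred."""
    guesses, t, regret = [], 0, 0.0
    guess = initial_guess
    while t < total_steps:
        steps = min(guess, total_steps - t)   # last phase may be truncated
        regret += run_phase(steps)
        guesses.append(guess)
        t += steps
        guess *= 2
    return guesses, regret
```

Note that each restart discards all collected statistics, mirroring the per-epoch resets of Algorithm 3, so privacy accounting stays per-phase.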