
# An Optimal Private Stochastic-MAB Algorithm Based on an Optimal Private Stopping Rule

We present a provably optimal differentially private algorithm for the stochastic multi-armed bandit problem, as opposed to the private analogue of the UCB-algorithm [Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016], which doesn't meet the recently discovered lower bound of Ω(K log(T)/ϵ) [Shariff and Sheffet, 2018]. Our construction is based on a different algorithm, Successive Elimination [Even-Dar et al., 2002], that repeatedly pulls all remaining arms until an arm is found to be suboptimal and is then eliminated. In order to devise a private analogue of Successive Elimination we visit the problem of a private stopping rule, which takes as input a stream of i.i.d. samples from an unknown distribution and returns a multiplicative (1 ± α)-approximation of the distribution's mean, and we prove the optimality of our private stopping rule. We then present the private Successive Elimination algorithm, which meets both the non-private lower bound [Lai and Robbins, 1985] and the above-mentioned private lower bound. We also compare empirically the performance of our algorithm with the private UCB algorithm.


## 1 Introduction

The well-known stochastic multi-armed bandit (MAB) is a sequential decision-making task in which a learner repeatedly chooses an action (or arm) and receives a noisy reward. The learner’s objective is to maximize cumulative reward by exploring the actions to discover optimal ones (having the highest expected reward), balanced with exploiting them. The problem, originally stemming from experiments in medicine (Robbins, 1952), has applications in fields such as ranking (Kveton et al., 2015), recommendation systems (collaborative filtering) (Caron and Bhagat, 2013), investment portfolio design (Hoffman et al., 2011) and online advertising (Schwartz et al., 2017), to name a few. Such applications, relying on sensitive data, raise privacy concerns.

Differential privacy (Dwork et al., 2006) has in recent years become the gold standard for privacy-preserving data analysis and alleviates such concerns, as it requires that the output of the data-analysis algorithm has only a limited dependency on any single datum. Differentially private variants of online learning algorithms have been successfully devised in various settings (Smith and Thakurta, 2013), including a private UCB-algorithm for the MAB problem (details below) (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016) as well as UCB variations in the linear (Kannan et al., 2018) and contextual (Shariff and Sheffet, 2018) settings.

More formally, in the MAB problem at every timestep $t$ the learner selects an arm $a$ out of $K$ available arms, pulls it and receives a random reward drawn i.i.d. from a distribution $\mathcal{P}_a$ of support $[0,1]$ and unknown mean $\mu_a$. The Upper Confidence Bound (UCB) algorithm for the MAB problem was developed in a series of works (Berry and Fristedt, 1985; Agrawal, 1995) culminating in (Auer et al., 2002a), and is provably optimal for the MAB problem (Lai and Robbins, 1985). The UCB algorithm maintains a time-dependent high-probability upper bound for each arm's mean, and at each timestep optimistically pulls the arm with the highest bound. The above-mentioned $\epsilon$-differentially private ($\epsilon$-DP) analogues of the UCB-algorithm follow the same procedure except for maintaining noisy estimations using the “tree-based mechanism” (Chan et al., 2010; Dwork et al., 2010). This mechanism continuously releases aggregated statistics over a stream of $T$ observations, introducing only $\mathrm{poly}\log(T)/\epsilon$ noise in each timestep. The details of this poly-log factor are the focus of this work.

It was recently shown (Shariff and Sheffet, 2018) that any $\epsilon$-DP stochastic MAB algorithm must incur an added pseudo-regret of $\Omega(K\log(T)/\epsilon)$. (In this work, we focus on pure $\epsilon$-DP, rather than $(\epsilon,\delta)$-DP.) However, it is commonly known that any algorithm that relies on the tree-based mechanism must incur an added pseudo-regret of $\omega(K\log(T)/\epsilon)$. Indeed, the tree-based mechanism maintains a binary tree over the $T$ streaming observations, a tree of depth $\log_2(T)$, where each node in this tree holds an i.i.d. sample from a $\mathsf{Lap}(\log_2(T)/\epsilon)$ distribution. At each timestep $t$, the mechanism outputs the sum of the first $t$ observations added to the sum of the nodes on the root-to-$t$th-leaf path in the binary tree. As a result, the variance of the added noise at each timestep is $\Theta(\log^3(T)/\epsilon^2)$, making the noise per timestep $\omega(\log(T)/\epsilon)$. (In fact, most analyses of the tree-based mechanism rely on the union bound over all $T$ timesteps, which yields an even larger poly-logarithmic noise bound and, consequentially, a correspondingly larger added-regret bound for the DP-UCB algorithm. (Tossou and Dimitrakakis, 2016) claim a tighter bound, but (i) rely on $(\epsilon,\delta)$-DP rather than pure DP and, more importantly, (ii) “sweep under the rug” several factors that are themselves non-negligible; see also the bound shown in (Mishra and Thakurta, 2015).) Thus, in a setting where each of the $K$ tree-mechanisms (one per arm) is run over a number of observations polynomial in $T$ (say, if all arms have sufficiently small suboptimality gaps), the private UCB-algorithm must unavoidably obtain an added regret of $\omega(K\log(T)/\epsilon)$ (on top of the regret of the UCB-algorithm). It is therefore clear that the challenge in devising an optimal DP algorithm for the MAB problem, namely an algorithm with added regret of $O(K\log(T)/\epsilon)$, is algorithmic in nature: we must replace the suboptimal tree-based mechanism with a different, simpler, mechanism.
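To make the source of the extra logarithmic factors concrete, here is a minimal Python sketch of a tree-based (binary) counting mechanism. It assumes rewards in $[0,1]$ and a stream whose length is known in advance; the names `laplace` and `tree_prefix_sums` and the exact noise scale are illustrative assumptions, not the construction analyzed in the cited works.

```python
import math
import random


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def tree_prefix_sums(xs, eps, rng):
    """Return noisy prefix sums of xs (each x assumed in [0, 1]) using dyadic blocks."""
    T = len(xs)
    if T == 0:
        return []
    levels = max(1, math.ceil(math.log2(T))) + 1  # number of tree levels
    scale = levels / eps  # each x_i contributes to at most one node per level
    prefix = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
    noisy = {}  # (level, index) -> noisy block sum, drawn once and then reused

    def node(level, idx):
        key = (level, idx)
        if key not in noisy:
            lo = idx << level
            hi = lo + (1 << level)
            noisy[key] = prefix[hi] - prefix[lo] + laplace(scale, rng)
        return noisy[key]

    out = []
    for t in range(1, T + 1):
        total, start = 0.0, 0
        # The binary expansion of t splits [0, t) into at most `levels` dyadic blocks.
        for level in reversed(range(levels)):
            if t & (1 << level):
                total += node(level, start >> level)
                start += 1 << level
        out.append(total)
    return out
```

Each released prefix sum adds up to `levels` independent Laplace samples, each of scale `levels / eps`, which is where the noise beyond $\log(T)/\epsilon$ comes from.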

#### Our Contribution and Organization.

In this work, we present an optimal algorithm for the stochastic MAB problem, which meets both the non-private lower bound of (Lai and Robbins, 1985) and the private lower bound of (Shariff and Sheffet, 2018). Our algorithm is a DP variant of the Successive Elimination (SE) algorithm (Even-Dar et al., 2002), a different optimal algorithm for stochastic MAB. SE works by pulling all arms sequentially, maintaining the same confidence interval around the empirical average of each arm’s reward (as all remaining arms are pulled the exact same number of times); and when an arm is found to be noticeably suboptimal in comparison to a different arm, it is then eliminated from the set of viable arms (all arms are viable initially). To design a DP-analogue of SE we first consider the case of $K=2$ arms and ask ourselves: what is the optimal way to privately discern whether the gap between the mean rewards of two arms is positive or negative? This motivates the study of private stopping rules, which take as input a stream of i.i.d. observations from a distribution of support $[-R,R]$ and unknown mean $\mu$, and halt once they obtain a $(1\pm\alpha)$-approximation of $\mu$ with confidence of at least $1-\beta$. Note that due to the multiplicative nature of the required approximation, it is impossible to straightforwardly use the Hoeffding or Bernstein bounds; rather, a stopping rule must alter its halting condition with time. (Domingo et al., 2002) proposed a stopping rule known as the Nonmonotonic Adaptive Sampling (NAS) algorithm that relies on Hoeffding’s inequality to maintain a confidence interval at each timestep. Their sample complexity bound was later improved slightly by (Mnih et al., 2008). The work of (Dagum et al., 2000) shows an essentially matching sample complexity lower bound. Stopping rules have also been applied to Reinforcement Learning and Racing algorithms (see Sajed et al. (2018); Mnih et al. (2008)).

In this work we introduce an $\epsilon$-DP analogue of the NAS algorithm that is based on the sparse vector technique (SVT), with added sample complexity of (roughly) $R\log(1/\beta)/(\epsilon\alpha|\mu|)$. Moreover, we show that this added sample complexity is optimal in the sense that any $\epsilon$-DP stopping rule has a matching sample complexity lower bound. After we introduce preliminaries in Section 2, we present the private NAS in Section 3. We then turn our attention to the design of the private SE algorithm. Note that straightforwardly applying private stopping rules yields a suboptimal algorithm, whose regret bound carries an extra factor depending on the number of arms. Instead, we partition the algorithm’s arm-pulls into epochs, where epoch $e$ is set to eliminate all arms with suboptimality-gaps greater than $2^{-e}$. By design each epoch must be at least twice as long as the previous epoch, and so we can reset the algorithm in-between epochs (that is, compute empirical means from fresh reward samples) while incurring only a constant-factor increase to the regret bound. Note that as a side benefit our algorithm also solves the private Best Arm Identification problem, with provably optimal cost. Details appear in Section 4. We also assess the empirical performance of our algorithm in comparison to the DP-UCB baseline and show that the improvement in analysis (despite the use of large constants) is also empirically evident; details are provided in Section 5. Lastly, future directions for this work are discussed in Section 6.

#### Discussion.

Some may find the results of this work underwhelming — after all the improvement we put forth is solely over -factors, and admittedly they are already subsumed by the non-private regret bound of the algorithm under many “natural” settings of parameters. Our reply to these is two-fold. First, our experiments (see Section 5) show a significantly improved performance empirically, which is only due to the different algorithmic approach. Second, as the designers of privacy-preserving learning algorithms it is our “moral duty” to quantify the added cost of privacy on top of the already existing cost, and push this added cost to its absolute lowest.

We would also like to emphasize a more philosophical point arising from this work. Both the UCB-algorithm and the SE-algorithm are provably optimal for the MAB problem in the non-private setting, and are therefore equivalent. But the UCB-algorithm makes in each timestep an input-dependent choice (which arm to pull), whereas the SE-algorithm's input-dependent choices are reflected only in special timesteps in which it declares “eliminate arm $a$” (in any other timestep it chooses the next viable arm). In that sense, the SE-algorithm is simpler than the UCB-algorithm, making it the less costly of the two to privatize. In other words, differential privacy gives quantitative reasoning for preferring one algorithm to another because “simpler is better.” While not a full-fledged theory (yet), we believe this narrative is of importance to anyone who designs differentially private data-analysis algorithms.

## 2 Preliminaries

#### Stopping Rules.

In the stopping rule problem, the input consists of a stream of i.i.d samples drawn from a distribution over an a-priori known support and with unknown mean . Given , the goal of the stopping rule is to halt after seeing as few samples as possible while releasing a -approximation of at halting time. Namely, a -stopping rule halts at some time and releases such that . (It should be clear that the halting time increases as decreases.) During any timestep , we denote and .

#### Stochastic MAB and its optimal bounds.

The formal description of the stochastic MAB problem was provided in the introduction. Formally, the bound maintained by the UCB-algorithm for each arm $a$ at a given timestep $t$ is $\bar\mu_a + \sqrt{2\log(t)/t_a}$, with $\bar\mu_a$ denoting the empirical mean reward from pulling arm $a$ and $t_a$ denoting the number of times $a$ was pulled thus far. We use $a^*$ to denote the leading arm, namely, an arm of highest mean reward: $\mu_{a^*} = \max_a \mu_a$. Given any arm $a$ we denote the mean-gap as $\Delta_a = \mu_{a^*} - \mu_a$, with $\Delta_{a^*} = 0$ by definition. Additionally, we denote the horizon by $T$, the number of rounds that a MAB algorithm will be run for. An algorithm that chooses arm $a_t$ at timestep $t$ incurs an expected regret or pseudo-regret of $\sum_{t}\Delta_{a_t}$. It is well-known (Lai and Robbins, 1985) that any consistent regret-minimization algorithm must incur a pseudo-regret of $\Omega\left(\sum_{a\neq a^*}\log(T)/\Delta_a\right)$; and indeed the UCB-algorithm meets this bound and has pseudo-regret of $O\left(\sum_{a\neq a^*}\log(T)/\Delta_a\right)$. (A regret minimization algorithm is called consistent if its regret is sub-polynomial, namely in $o(T^p)$ for any $p>0$.) However, the minimax regret bound of the UCB-algorithm is $O(\sqrt{KT\log(T)})$, obtained by an adversary that knows $T$ and sets all suboptimal arms’ gaps to $\sqrt{K\log(T)/T}$, whereas the minimax lower bound of any algorithm is slightly smaller: $\Omega(\sqrt{KT})$ (Auer et al., 2002a).
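As a point of reference for the non-private baseline, the following Python sketch implements the textbook UCB1 rule with index $\bar\mu_a + \sqrt{2\log(t)/t_a}$. The function name `ucb1` and the `pull` callback are hypothetical names introduced for illustration, and the index is the standard UCB1 one, which may differ in constants from the variant analyzed in this work.

```python
import math
import random


def ucb1(pull, K, T):
    """Run UCB1 for T rounds on K arms; return the per-arm pull counts."""
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull every arm once to initialise its estimate
        else:
            # Optimistic index: empirical mean plus a confidence bonus.
            a = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        sums[a] += pull(a)
        counts[a] += 1
    return counts
```

Note that the arm chosen at every timestep depends on the observed rewards, which is the property that makes this rule costly to privatize.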

#### Differential Privacy.

In this work, we preserve event-level privacy under continuous observation (Dwork et al., 2010). Formally, we say two streams are neighbours if they differ on a single entry in a single timestep $t$, and are identical on any other timestep. An algorithm $\mathcal{M}$ is $\epsilon$-differentially private if for any two neighboring streams $S$ and $S'$ and for any set $\mathcal{D}$ of decisions made from timestep $1$ through $T$, it holds that $\Pr[\mathcal{M}(S)\in\mathcal{D}] \leq e^{\epsilon}\Pr[\mathcal{M}(S')\in\mathcal{D}]$. Note that much like its input, the output is also released in a stream-like fashion, and the requirement should hold for all decisions made by $\mathcal{M}$ in all timesteps.

In this work, we use two mechanisms that are known to be $\epsilon$-DP. The first is the Laplace mechanism (Dwork et al., 2006). Given a function $f$ that takes as input a stream $S$ and releases an output in $\mathbb{R}^d$, we denote its global sensitivity as $GS(f) = \max_{S,S'\ \mathrm{neighbors}}\|f(S)-f(S')\|_1$; and the Laplace mechanism adds a random (independent) sample from $\mathsf{Lap}(GS(f)/\epsilon)$ to each coordinate of $f(S)$. The other mechanism we use is the sparse-vector technique (SVT), which takes, in addition to $S$, a sequence of queries (each query has a global sensitivity of at most $GS$), and halts with the very first query whose value exceeds a given threshold. The SVT works by adding random Laplace noise, sampled i.i.d. with scale proportional to $GS/\epsilon$, both to the threshold and to each of the query values. See (Dwork et al., 2014) for more details.
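The two mechanisms can be sketched in a few lines of Python. This is a minimal illustration that uses the textbook AboveThreshold noise scales ($2\,GS/\epsilon$ for the threshold and $4\,GS/\epsilon$ for each query); these constants, and the function names `laplace_mechanism` and `above_threshold`, are assumptions made for the example and need not match the constants used in this work.

```python
import random


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, eps, rng):
    """Release a scalar statistic with eps-DP by adding Lap(sensitivity / eps)."""
    return value + laplace(sensitivity / eps, rng)


def above_threshold(query_values, threshold, sensitivity, eps, rng):
    """Return the index of the first query whose noisy value clears the noisy
    threshold, or None if no query does (textbook AboveThreshold constants)."""
    noisy_threshold = threshold + laplace(2.0 * sensitivity / eps, rng)
    for i, q in enumerate(query_values):
        if q + laplace(4.0 * sensitivity / eps, rng) >= noisy_threshold:
            return i
    return None
```

The privacy cost of `above_threshold` does not grow with the number of queries answered before halting, which is what makes it attractive for a stopping rule.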

#### Concentration bounds.

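A Laplace r.v. $X\sim\mathsf{Lap}(\lambda)$ is sampled from a distribution with density proportional to $\exp(-|x|/\lambda)$. It is known that $\mathbb{E}[X]=0$ and $\mathrm{Var}[X]=2\lambda^2$, and that for any $\tau>0$ it holds that $\Pr[|X|>\tau] = e^{-\tau/\lambda}$.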

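Throughout this work we also rely on the Hoeffding inequality (Hoeffding, 1963). Given a collection $X_1,\ldots,X_t$ of i.i.d. random variables that take value in a finite interval of length $R$ with mean $\mu$, it holds that $\Pr\left[\left|\frac{1}{t}\sum_{i=1}^{t}X_i - \mu\right| \geq \tau\right] \leq 2\exp\left(-2t\tau^2/R^2\right)$.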
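Inverting these two tail bounds gives the confidence radii used repeatedly in the rest of the paper. The helper names below (`hoeffding_radius`, `laplace_tail`) are hypothetical and introduced only for illustration; the formulas are the standard inversions of the Hoeffding and Laplace tail bounds.

```python
import math


def hoeffding_radius(n, length, beta):
    """Radius h with Pr[|mean - mu| >= h] <= beta for n i.i.d. samples
    supported on an interval of the given length."""
    return length * math.sqrt(math.log(2.0 / beta) / (2.0 * n))


def laplace_tail(scale, beta):
    """Value c with Pr[|Lap(scale)| > c] = beta."""
    return scale * math.log(1.0 / beta)
```

For example, `hoeffding_radius(n, 1.0, beta)` shrinks like $1/\sqrt{n}$, whereas the Laplace noise on an average of $n$ rewards has scale proportional to $1/(\epsilon n)$, so the privacy term decays faster in $n$.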

#### Additional Notation and Facts.

Throughout this work $\log$ denotes the natural logarithm (base $e$). Given two distributions $\mathcal{P}$ and $\mathcal{Q}$, we denote their total-variation distance as $d_{\rm TV}(\mathcal{P},\mathcal{Q})$. We emphasize we made no effort to minimize constants throughout this work. We also rely on the following folklore fact. For completeness, its proof is shown in Appendix Section A.

###### Fact 2.1.

Fix any and any . Then for any it holds that , and for any it holds that .

## 3 Differentially Private Stopping Rule

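In this section, we derive a differentially private stopping rule algorithm, DP-NAS, which is based on the non-private NAS (Nonmonotonic Adaptive Sampling). The non-private NAS is rather simple. Given $\beta$, denote $h_t$ as the confidence interval derived by the Hoeffding bound for $t$ i.i.d. random samples bounded in magnitude by $R$, with per-timestep confidence levels chosen so that a union bound over all timesteps gives total failure probability at most $\beta$; thus, w.p. $\geq 1-\beta$ it holds that $|\bar X_t - \mu| \leq h_t$ for all $t$. The NAS algorithm halts at the first $t$ for which $|\bar X_t| \geq h_t\left(1+\frac{1}{\alpha}\right)$. Indeed, such a stopping rule assures that $|\bar X_t - \mu| \leq h_t \leq \alpha(|\bar X_t| - h_t) \leq \alpha|\mu|$, where the last inequality follows from $|\bar X_t| - |\mu| \leq |\bar X_t - \mu| \leq h_t$.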
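The non-private rule can be sketched as follows. This is a minimal Python illustration that assumes samples in $[-R,R]$ and halts once $|\bar X_t| \geq h_t(1+1/\alpha)$; the per-timestep failure probability $\beta/(t(t+1))$, which sums to $\beta$ over all $t$, is an assumption made for the example, and `nas` and `sample` are hypothetical names.

```python
import math
import random


def nas(sample, alpha, beta, R, max_steps=10**6):
    """Non-private stopping rule: return (estimate, halting time)."""
    total = 0.0
    for t in range(1, max_steps + 1):
        total += sample()
        mean = total / t
        beta_t = beta / (t * (t + 1))  # assumed split; sums to beta over all t
        # Hoeffding radius for t samples on an interval of length 2R.
        h_t = R * math.sqrt(2.0 * math.log(2.0 / beta_t) / t)
        if abs(mean) >= h_t * (1.0 + 1.0 / alpha):
            return mean, t
    raise RuntimeError("did not halt within max_steps")
```

The halting time grows as $|\mu|$ shrinks, since the radius $h_t$ must fall below roughly $\alpha|\mu|$ before the condition can hold.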

In order to make NAS differentially private we use the sparse vector technique, since the algorithm is basically asking a series of threshold queries: . Recall that the sparse-vector technique adds random noise both to the threshold and to the answer of each query, and so we must adjust the naïve threshold of to some in order to make sure that is sufficiently close to . Lastly, since our goal is to provide a private approximation of the distribution mean, we also apply the Laplace mechanism to to assert the output is differentially private. Details appear in Algorithm 1.

###### Theorem 3.1.

Algorithm 1 is an $\epsilon$-DP $(\alpha,\beta)$-stopping rule.

###### Proof.

First, we argue that Algorithm 1 is -differentially private. This follows immediately from the fact that the algorithm is a combination of the sparse-vector technique with the Laplace mechanism. The first part of the algorithm halts when . Indeed, this is the sparse-vector mechanism for a sum-query of sensitivity of no more than . It follows that sampling both the threshold-noise and the query noise from suffices to maintain -DP. Similarly, adding a sample from suffices to release the mean with -DP at the very last step of the algorithm.

Since , under the assumption that all are i.i.d samples from a distribution of mean , the Hoeffding-bound and union-bound give that . Standard concentration bound on the Laplace distribution give that , , and . It follows that w.p. none of these events happen, and so .

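It follows that at the time we halt we have that

$$|\bar X_t - \mu| \overset{\text{Hoeffding}}{\leq} h_t \leq \alpha\left(|\bar X_t| - h_t\right) - \frac{\alpha}{t}\left(c_t + A_t + B\right) \overset{(*)}{\leq} \alpha|\mu| - \frac{\alpha}{t}\left(c_t + A_t + B\right) \leq \alpha|\mu| - \frac{|L|}{t}$$

where $(*)$ is due to $|\bar X_t| - h_t \leq |\mu|$. Therefore, by the triangle inequality, we have that $\left|\bar X_t + \frac{L}{t} - \mu\right| \leq |\bar X_t - \mu| + \frac{|L|}{t} \leq \alpha|\mu|$. ∎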

Rather than analyzing the utility of Algorithm 1, namely, the high-probability bounds on its stopping time, we now turn our attention to a slight modification of the algorithm and analyze the revised algorithm’s utility. The modification we introduce, albeit technical and non-instrumental in the utility bounds, plays a conceptual role in the description of later algorithms. We introduce Algorithm 2, where we exponentially reduce the number of SVT queries using the standard doubling technique. Instead of querying the magnitude of the average at each timestep, we query it at exponentially growing intervals, thus paying no more than a constant factor in the utility guarantees while still reducing the number of SVT queries dramatically.
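The combination of SVT, a final Laplace release, and doubling can be sketched as below. This Python sketch is not Algorithm 2 itself: it uses the textbook AboveThreshold constants, splits the privacy budget evenly between the SVT and the released mean, and omits the accuracy slack terms (the $c_t$ of the proof of Theorem 3.1), so it illustrates the privacy structure only and carries no $(\alpha,\beta)$ guarantee. All names and the per-check confidence split are assumptions.

```python
import math
import random


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_stopping_rule(sample, alpha, beta, R, eps, rng, max_doublings=24):
    """Check the halting condition only at times t = 2^k; return (estimate, t)."""
    sens = 2.0 * R  # one sample in [-R, R] moves the running sum by at most 2R
    eps_svt = eps / 2.0   # assumed budget split: half for the SVT ...
    eps_mean = eps / 2.0  # ... and half for the released mean
    noisy_zero = laplace(2.0 * sens / eps_svt, rng)  # noisy version of threshold 0
    total, t = 0.0, 0
    for k in range(max_doublings + 1):
        while t < 2 ** k:
            total += sample()
            t += 1
        beta_k = beta / ((k + 1) * (k + 2))  # assumed split over the checks
        h_t = R * math.sqrt(2.0 * math.log(2.0 / beta_k) / t)
        # Sum query of sensitivity 2R; the subtracted term is data independent.
        query = abs(total) - t * h_t * (1.0 + 1.0 / alpha)
        if query + laplace(4.0 * sens / eps_svt, rng) >= noisy_zero:
            return total / t + laplace(sens / (t * eps_mean), rng), t
    raise RuntimeError("did not halt within the allowed number of doublings")
```

Because the condition is tested only at powers of two, a stream of length $t$ triggers about $\log_2 t$ SVT queries instead of $t$, at the price of overshooting the ideal halting time by at most a factor of two.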

###### Corollary 3.2.

Algorithm 2 is an $\epsilon$-DP $(\alpha,\beta)$-stopping rule.

###### Proof.

The only difference between Algorithms 1 and 2 lies in checking the halting condition at exponentially increasing time-intervals, namely during times $2^k$ for $k\in\mathbb{N}$. The privacy analysis remains the same as in the proof of Theorem 3.1, and the algorithm correctness analysis is modified by considering only the timesteps during which we check the halting condition. Formally, we denote as the event where (i) , (ii) , (iii) , and (iv) . Analogous to the proof of Theorem 3.1 we bound and the result follows. ∎

###### Theorem 3.3.

Fix and . Let be an ensemble of i.i.d samples from any distribution over the range and with mean . Denote , , . Then with probability at least , Algorithm 2 halts by timestep .

###### Proof.

Recall the event from the proof of Corollary 3.2 and its four conditions. We assume holds and so the algorithm releases a -approximation of . To prove the claim, we show that under , at time it must hold that .

Under we have that and ; and so it suffices to show that . In fact, since we show something slightly stronger: that at time we have . This however is an immediate corollary of the following three facts.

1. For any we have , implying .

2. For any we have , implying .

3. For any we have .

where the first two rely on Fact 2.1. It follows therefore that at time all three conditions hold and so, due to the exponential growth of the intervals, by time we reach some $t$ which is a power of $2$, on which we pose a query for the SVT mechanism and halt. ∎

### 3.1 Private Stopping Rule Lower bounds

We turn our attention to proving the (near) optimality of Algorithm 2. A non-private lower bound was proven in (Dagum et al., 2000), who showed no stopping rule algorithm can achieve a sample complexity better than (with denoting the variance of the underlying distribution). In this section, we prove a lower bound on the additional sample complexity that any -DP stopping rule algorithm must incur. We summarize our result below:

###### Theorem 3.4.

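Any $\epsilon$-differentially private $(\alpha,\beta)$-stopping rule whose input consists of a stream of i.i.d. samples from a distribution over support $[-R,R]$ and with mean $\mu$, must have a sample complexity of $\Omega\left(\frac{R\log(1/\beta)}{\epsilon\alpha|\mu|}\right)$.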

###### Proof.

Fix such that and , and fix and . We define two distributions over a support consisting of two discrete points: . Setting we have that . Set as any number infinitesimally below the threshold of , so that we have ; we set the parameters of s.t. so . By definition, the total variation distance .

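Let $\mathcal{M}$ be any $\epsilon$-differentially private $(\alpha,\beta)$-stopping rule. Denote $n = \frac{R\log(1/\beta)}{12\epsilon\alpha\mu}$, where $\mu$ is the mean of $\mathcal{P}$. Let $E$ be the event “after seeing at most $n$ samples, $\mathcal{M}$ halts and outputs a number in the interval $[(1-\alpha)\mu, (1+\alpha)\mu]$.” We now apply the following, very elegant, lemma from (Karwa and Vadhan, 2018), stating that the group privacy loss of a differentially private mechanism taking as input $n$ i.i.d. samples either from a distribution $\mathcal{P}$ or from a distribution $\mathcal{Q}$ scales effectively as $\epsilon n\cdot d_{\rm TV}(\mathcal{P},\mathcal{Q})$.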

###### Lemma 3.5 (Lemma 6.1 from (Karwa and Vadhan, 2018)).

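Let $\mathcal{M}$ be any $\epsilon$-differentially private mechanism, fix a natural $n$ and fix two distributions $\mathcal{P}$ and $\mathcal{Q}$, and let $S$ and $S'$ denote an ensemble of $n$ i.i.d. samples taken from $\mathcal{P}$ and $\mathcal{Q}$ resp. Then for any possible set of outputs $O$ it holds that $\Pr[\mathcal{M}(S)\in O] \leq e^{6\epsilon n\cdot d_{\rm TV}(\mathcal{P},\mathcal{Q})}\Pr[\mathcal{M}(S')\in O]$.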

And so, applying over i.i.d samples taken from , we must have that , since . Applying Lemma 3.5 to our setting, we get

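$$\Pr_{\mathcal{M},\,S\sim\mathcal{P}^n}[E] \leq e^{6\epsilon n\cdot d_{\rm TV}(\mathcal{P},\mathcal{Q})}\Pr_{\mathcal{M},\,S\sim\mathcal{Q}^n}[E] \leq \beta\cdot\exp\left(6\epsilon n\cdot\frac{\alpha\mu}{R}\right) = \beta\cdot\exp\left(\frac{6\epsilon\alpha\mu}{R}\cdot\frac{R\log(1/\beta)}{12\epsilon\alpha\mu}\right) = \frac{\beta}{\sqrt{\beta}} < \frac{1}{2}$$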

since . Since, by definition, we have that the probability of the event “after seeing at most samples, halts and outputs a number outside the interval ” over i.i.d samples from is at most , then it must be that halts after seeing strictly more than samples w.p. . ∎

Combining the non-private lower bound of (Dagum et al., 2000) and the bound of Theorem 3.4, we immediately infer the overall sample complexity bound, which follows from the fact that the variance of the distribution used in the proof of Theorem 3.4 has variance of .

###### Corollary 3.6.

There exists a distribution for which any -differentially private -stopping rule algorithm has a sample complexity of .

#### Discussion.

How optimal is Algorithm 2? The sample complexity bound in Theorem 3.3 can be interpreted as the sum of the non-private and private parts. The non-private part is and the private part is . If we add in the assumption that we get that the upper-bound of Theorem 3.3 matches the lower-bound in Corollary 3.6.

How benign is this assumption? Much like in (Mnih et al., 2008), we too believe it is a very mild assumption. Specifically, in the next section, where we deal with finite sequences of length , we set as proportional to . Since over finite-length sequence we can only retrieve an approximation of if , requiring is trivial. However, we cannot completely disregard the possibility of using a private stopping rule in a setting where, for example, both are constants whereas is a sub-constant. In such a setting, may dominate , and there it might be possible to improve on the performance of Algorithm 2 (or tighten the bound).

## 4 An Optimal Private MAB Algorithm

In this section, our goal is to devise an optimal $\epsilon$-differentially private algorithm for the stochastic $K$-arms bandit problem, in a setting where all rewards lie in $[0,1]$. We denote the mean reward of each arm as $\mu_a$, the best arm as $a^*$, and for any $a\neq a^*$ we refer to the gap $\Delta_a = \mu_{a^*}-\mu_a$. We seek an algorithm that is optimal in the sense that it should meet both the non-private instance-dependent bound of (Lai and Robbins, 1985) and the lower bound of (Shariff and Sheffet, 2018); namely an algorithm with an instance-dependent pseudo-regret bound of $O\left(\sum_{a\neq a^*}\frac{\log(T)}{\Delta_a} + \frac{K\log(T)}{\epsilon}\right)$. The algorithm we devise is a differentially private version of the Successive Elimination (SE) algorithm (Even-Dar et al., 2002). SE initializes by setting all arms as viable options, and iteratively pulls all viable arms, maintaining the same confidence interval around the empirical average of each viable arm’s reward. Once some viable arm’s upper confidence bound is strictly smaller than the lower confidence bound of some other viable arm, the arm with the lower empirical reward is eliminated and is no longer considered viable. It is worthwhile to note that the classical UCB algorithm and the SE algorithm have the same asymptotic pseudo-regret. To design the differentially private analogue of SE, we use our results from the previous section regarding stopping rules. After all, in the special case where we have $K=2$ arms, we can straightforwardly use the private stopping rule to assess the mean of the difference between the arms up to a constant multiplicative factor. The question lies in applying this algorithm in the $K>2$ case.

Here are a few failed first attempts. The most straightforward idea is to apply stopping rules / SVTs for all pairs of arms; but since a reward of a single pull of any single arm plays a role in $\Theta(K)$ SVT instantiations, it follows we would have to scale down the privacy-loss of each SVT to roughly $\epsilon/K$, resulting in an added regret scaled up by a factor of roughly $K$. In an attempt to reduce the number of SVT-instantiations, we might consider asking for each arm whether there exists an arm with a significantly greater reward, yet it still holds that the reward from a single pull of the leading arm plays a role in $\Theta(K)$ SVT-instantiations. Next, consider merging all queries into a single SVT, posing in each round $K$ queries (one per arm) and halting once we find that a certain arm is suboptimal; but this results in a single SVT that may halt $\Theta(K)$ times, causing us yet again to scale $\epsilon$ by a factor of roughly $K$.

In order to avoid scaling down $\epsilon$ by a factor of $K$, our solution leverages the combination of parallel decomposition and geometrically increasing intervals. Namely we partition the arm pulls of the algorithm into epochs of geometrically increasing lengths, where in epoch $e$ we eliminate all arms of optimality-gap $\geq 2^{-e}$. In fact, it turns out we needn’t apply the SVT at the end of each epoch, but rather just test for a noticeably underperforming arm using a private histogram. (We thank the anonymous referee for this elegant observation.) The key point is that at the beginning of each new epoch we nullify all counters and start the mean-reward estimation completely anew (over the remaining set of viable arms), and so a single reward plays a role in only one epoch, allowing for $\epsilon$-DP mean-estimation in each epoch (rather than only a fraction of $\epsilon$). Yet due to the fact that the epochs are of exponentially growing lengths, the total number of pulls for any suboptimal arm is proportional to the length of the epoch in which it is eliminated, resulting in only a constant factor increase to the regret. The full-fledged details appear in Algorithm 3.
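The epoch structure can be sketched in Python as follows. The epoch length $R_e$ and the radii $h_e$ and $c_e$ use the constants that appear in the proof of Lemma 4.2; the elimination threshold $2h_e + 2c_e$, the function name `dp_successive_elimination`, the `pull` callback and the `max_epochs` guard are assumptions made for this illustration rather than a transcription of Algorithm 3.

```python
import math
import random


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_successive_elimination(pull, K, eps, beta, rng, max_epochs=30):
    """Epoch-based private elimination; returns (surviving arms, pull counts)."""
    S = list(range(K))
    pulls = [0] * K
    for e in range(1, max_epochs + 1):
        if len(S) == 1:
            break
        delta_e = 2.0 ** (-e)  # gap targeted in this epoch
        log_h = math.log(8 * len(S) * e * e / beta)
        log_c = math.log(4 * len(S) * e * e / beta)
        R_e = int(max(32 * log_h / delta_e ** 2, 8 * log_c / (eps * delta_e))) + 1
        h_e = math.sqrt(log_h / (2 * R_e))  # Hoeffding radius
        c_e = log_c / (R_e * eps)           # Laplace tail radius
        noisy = {}
        for a in S:
            # Fresh samples only: a reward influences exactly one epoch.
            total = sum(pull(a) for _ in range(R_e))
            pulls[a] += R_e
            noisy[a] = total / R_e + laplace(1.0 / (eps * R_e), rng)
        best = max(noisy.values())
        # Assumed rule: drop arms whose noisy mean trails the best by a clear margin.
        S = [a for a in S if best - noisy[a] <= 2 * h_e + 2 * c_e]
    return S, pulls
```

Since each reward is used in a single epoch and each noisy mean has sensitivity $1/R_e$, the Laplace noise of scale $1/(\epsilon R_e)$ is applied with the full budget $\epsilon$ in every epoch.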

###### Theorem 4.1.

Algorithm 3 is $\epsilon$-differentially private.

###### Proof.

Consider two streams of arm-rewards that differ on the reward of a single arm in a single timestep. This timestep plays a role in a single epoch $e$. Moreover, let $a$ be the arm whose reward differs between the two neighboring streams. Since the reward of each arm is bounded in $[0,1]$, it follows that the difference of the mean of arm $a$ between the two neighboring streams is at most $1/R_e$, where $R_e$ is the number of pulls of each viable arm in epoch $e$. Thus, adding noise of $\mathsf{Lap}(1/(\epsilon R_e))$ to the empirical mean of arm $a$ guarantees $\epsilon$-DP. ∎

To argue about the optimality of Algorithm 3, we require the following lemma, a key step in the following theorem that bounds the pseudo-regret of the algorithm.

###### Lemma 4.2.

Fix any instance of the $K$-MAB problem, and denote $a^*$ as its optimal arm (of highest mean), and the gaps between the mean of arm $a^*$ and any suboptimal arm $a\neq a^*$ as $\Delta_a$. Fix any horizon $T$. Then w.p. $\geq 1-\beta$ it holds that Algorithm 3 pulls each suboptimal arm $a$ for a number of timesteps upper bounded by

$$\min\left\{T,\; O\left(\left(\log(K/\beta)+\log\log(1/\Delta_a)\right)\left(\frac{1}{\Delta_a^2}+\frac{1}{\epsilon\Delta_a}\right)\right)\right\}$$

###### Proof of Lemma 4.2.

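The bound of $T$ is trivial so we focus on proving the latter bound. Given an epoch $e$ we denote by $\mathcal{E}_e$ the event where for all arms $a\in S$ it holds that both (i) $|\mu_a - \bar\mu_a| \leq h_e$ and (ii) $|\bar\mu_a - \tilde\mu_a| \leq c_e$, where $\bar\mu_a$ is the empirical mean of arm $a$ in epoch $e$ and $\tilde\mu_a$ is its noisy version; and also denote $\mathcal{E} = \bigcap_{e\geq 1}\mathcal{E}_e$. The Hoeffding bound, concentration of the Laplace distribution and the union bound over all arms in $S$ give that $\Pr[\mathcal{E}_e] \geq 1-\frac{\beta}{2e^2}$, thus $\Pr[\mathcal{E}] \geq 1-\frac{\beta}{2}\sum_{e\geq 1}\frac{1}{e^2} \geq 1-\beta$. The remainder of the proof continues under the assumption that $\mathcal{E}$ holds, and so, for any epoch $e$ and any viable arm $a$ in this epoch we have $|\tilde\mu_a - \mu_a| \leq h_e + c_e$. As a result for any epoch $e$ and any two arms $a^1, a^2$ we have that $\left|(\tilde\mu_{a^1}-\tilde\mu_{a^2}) - (\mu_{a^1}-\mu_{a^2})\right| \leq 2h_e + 2c_e$.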

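Next, we argue that under $\mathcal{E}$ the optimal arm $a^*$ is never eliminated. Indeed, for any epoch $e$, we denote the arm $a_e = \arg\max_{a\in S}\tilde\mu_a$ and it is simple enough to see that $\tilde\mu_{a_e} - \tilde\mu_{a^*} \leq (\mu_{a_e}-\mu_{a^*}) + 2h_e + 2c_e \leq 2h_e + 2c_e$, so the algorithm doesn’t eliminate $a^*$.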

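Next, we argue that, under $\mathcal{E}$, in any epoch $e$ we eliminate all viable arms with suboptimality gap $\geq 2^{-e} = \Delta_e$. Fix an epoch $e$ and a viable arm $a$ with suboptimality gap $\Delta_a \geq \Delta_e$. Note that we have set parameter $R_e$ so that

$$h_e = \sqrt{\frac{\log(8|S|\cdot e^2/\beta)}{2R_e}} < \sqrt{\frac{\log(8|S|\cdot e^2/\beta)}{2\cdot\frac{32\log(8|S|e^2/\beta)}{\Delta_e^2}}} = \frac{\Delta_e}{8}, \qquad c_e = \frac{\log(4|S|\cdot e^2/\beta)}{R_e\,\epsilon}$$

and, since $R_e > \frac{8\log(4|S|e^2/\beta)}{\epsilon\Delta_e}$, also $c_e < \frac{\Delta_e}{8}$. Therefore, since arm $a^*$ remains viable, we have that $\tilde\mu_{\max} - \tilde\mu_a \geq \tilde\mu_{a^*} - \tilde\mu_a \geq \Delta_a - (2h_e+2c_e) > \Delta_e - \frac{\Delta_e}{2} = \frac{\Delta_e}{2} > 2h_e + 2c_e$, guaranteeing that arm $a$ is removed from $S$.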

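Lastly, fix a suboptimal arm $a$ and let $e(a)$ be the first epoch such that $\Delta_{e(a)} \leq \Delta_a$, implying $\Delta_{e(a)} > \Delta_a/2$. Using the immediate observation that for any epoch $e$ we have $R_e \leq R_{e+1}/2$, we have that the total number of pulls of arm $a$ is

$$\sum_{e\leq e(a)} R_e \leq \sum_{e\leq e(a)} 2^{e-e(a)}R_{e(a)} \leq R_{e(a)}\sum_{i\geq 0}2^{-i} \leq 2\left(\frac{32\log(8|S|\cdot e(a)^2/\beta)}{\Delta_{e(a)}^2} + \frac{8\log(4|S|\cdot e(a)^2/\beta)}{\epsilon\Delta_{e(a)}}\right)$$

The bounds $|S|\leq K$, $e(a) < \log_2(2/\Delta_a)$, and $\Delta_{e(a)} > \Delta_a/2$ allow us to conclude and infer that under $\mathcal{E}$ the total number of pulls of arm $a$ is at most $O\left(\left(\log(K/\beta)+\log\log(1/\Delta_a)\right)\left(\frac{1}{\Delta_a^2}+\frac{1}{\epsilon\Delta_a}\right)\right)$. ∎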

###### Theorem 4.3.

Under the same notation as in Lemma 4.2 and for sufficiently large $T$, the expected regret of Algorithm 3 is at most $O\left(\sum_{a\neq a^*}\frac{\log(T)}{\Delta_a} + \frac{K\log(T)}{\epsilon}\right)$.

###### Proof.

In order to bound the expected regret based on the high-probability bound given in Lemma 4.2, we must set $\beta = 1/T$. (Alternatively, we use the standard guess-and-double technique when the horizon $T$ is unknown, i.e., we start with a guess of $T$ and double the guess whenever the number of timesteps reaches it.) Thus, with probability at most $1/T$ we may pull a suboptimal arm on all $T$ timesteps, incurring expected regret of at most $\frac{1}{T}\cdot T = 1$; and with probability $\geq 1-1/T$, since each time we pull a suboptimal arm $a$ we incur an expected regret of $\Delta_a$, our overall expected regret when $T$ is sufficiently large is proportional to at most

$$\sum_{a\neq a^*}\left(\log\left(\frac{K}{1/T}\right)+\log\log(1/\Delta_a)\right)\left(\frac{\Delta_a}{\Delta_a^2}+\frac{\Delta_a}{\epsilon\Delta_a}\right) = \sum_{a\neq a^*}\log\left(TK\cdot\log(1/\Delta_a)\right)\left(\frac{1}{\Delta_a}+\frac{1}{\epsilon}\right)$$