# Towards Instance Optimal Bounds for Best Arm Identification

In the classical best arm identification (Best-1-Arm) problem, we are given n stochastic bandit arms, each associated with a reward distribution with an unknown mean. We would like to identify the arm with the largest mean with probability at least 1-δ, using as few samples as possible. Understanding the sample complexity of Best-1-Arm has attracted significant attention over the last decade. However, the exact sample complexity of the problem is still unknown. Recently, Chen and Li made the gap-entropy conjecture concerning the instance sample complexity of Best-1-Arm. Given an instance I, let μ_[i] be the ith largest mean and Δ_[i]=μ_[1]-μ_[i] be the corresponding gap. H(I)=∑_{i=2}^n Δ_[i]^-2 is the complexity of the instance. The gap-entropy conjecture states that Ω(H(I)·(ln δ^-1+Ent(I))) is an instance lower bound, where Ent(I) is an entropy-like term determined by the gaps, and that there is a δ-correct algorithm for Best-1-Arm with sample complexity O(H(I)·(ln δ^-1+Ent(I))+Δ_[2]^-2 ln ln Δ_[2]^-1). If the conjecture is true, we would have a complete understanding of the instance-wise sample complexity of Best-1-Arm. We make significant progress towards the resolution of the gap-entropy conjecture. For the upper bound, we provide a highly nontrivial algorithm which requires O(H(I)·(ln δ^-1+Ent(I))+Δ_[2]^-2 ln ln Δ_[2]^-1 · polylog(n,δ^-1)) samples in expectation. For the lower bound, we show that for any Gaussian Best-1-Arm instance with gaps of the form 2^-k, any δ-correct monotone algorithm requires Ω(H(I)·(ln δ^-1+Ent(I))) samples in expectation.


## 1 Introduction

The stochastic multi-armed bandit is one of the most popular and well-studied models for capturing the exploration-exploitation tradeoffs in many application domains. There is a huge body of literature on numerous bandit models from several fields including stochastic control, statistics, operations research, machine learning and theoretical computer science. The basic stochastic multi-armed bandit model consists of $n$ stochastic arms with unknown distributions. One can adaptively take samples from the arms and make decisions depending on the objective. Popular objectives include maximizing the cumulative sum of rewards, or minimizing the cumulative regret (see e.g., [Cesa-Bianchi and Lugosi(2006), Bubeck et al.(2012)]).

In this paper, we study another classical multi-armed bandit model, called the pure exploration model, where the decision-maker first performs a pure-exploration phase by sampling from the arms, and then identifies an optimal (or nearly optimal) arm, which serves as the exploitation phase. The model is motivated by many application domains such as medical trials [Robbins(1985), Audibert and Bubeck(2010)], communication networks [Audibert and Bubeck(2010)], online advertisement [Chen et al.(2014)], and crowdsourcing [Zhou et al.(2014), Cao et al.(2015)]. The best arm identification problem (Best-1-Arm) is the most basic pure exploration problem in stochastic multi-armed bandits. The problem has a long history (first formulated in [Bechhofer(1954)]) and has attracted significant attention over the last decade [Audibert and Bubeck(2010), Even-Dar et al.(2006), Mannor and Tsitsiklis(2004), Jamieson et al.(2014), Karnin et al.(2013), Chen and Li(2015), Carpentier and Locatelli(2016), Garivier and Kaufmann(2016)]. We now formally define the problem and set up some notation.

###### Definition 1.1

Best-1-Arm: We are given a set of $n$ arms $\{A_1,\ldots,A_n\}$. Arm $A_i$ has a reward distribution $D_i$ with an unknown mean $\mu_i$. We assume that all reward distributions are Gaussian distributions with unit variance. Upon each play of $A_i$, we get a reward sampled i.i.d. from $D_i$. Our goal is to identify the arm with the largest mean using as few samples as possible. We assume here that the largest mean is strictly larger than the second largest (i.e., $\mu_{[1]}>\mu_{[2]}$) to ensure the uniqueness of the solution, where $\mu_{[i]}$ denotes the $i$th largest mean.

###### Remark 1.2

Some previous algorithms for Best-1-Arm take a sequence (instead of a set) of arms as input. In this case, we may simply assume that the algorithm randomly permutes the sequence at the beginning. Thus the algorithm will have the same behaviour on two different orderings of the same set of arms.

###### Remark 1.3

For the upper bound, everything proved in this paper also holds if the distributions are 1-sub-Gaussian, which is a standard assumption in the bandit literature. On the lower bound side, we need to assume that the distributions are from some family parametrized by the means and satisfy certain properties. See Remark D.4. Otherwise, it is possible to distinguish two distributions using 1 sample even if their means are very close. We cannot hope for a nontrivial lower bound in such generality.

The Best-1-Arm problem for Gaussian arms was first formulated in [Bechhofer(1954)]. Most early works on Best-1-Arm did not analyze the sample complexity of the algorithms (although they proved that their algorithms are $\delta$-correct). The early advances are summarized in the monograph [Bechhofer et al.(1968)].

Over the past two decades, significant research effort has been devoted to understanding the optimal sample complexity of the Best-1-Arm problem. On the lower bound side, [Mannor and Tsitsiklis(2004)] proved that any $\delta$-correct algorithm for Best-1-Arm takes $\Omega(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\ln\delta^{-1})$ samples in expectation. In fact, their result is an instance-wise lower bound (see Definition 1.6). [Kaufmann et al.(2015)] also provided an $\Omega(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\ln\delta^{-1})$ lower bound for Best-1-Arm, which improved the constant factor in [Mannor and Tsitsiklis(2004)]. [Garivier and Kaufmann(2016)] focused on the asymptotic sample complexity of Best-1-Arm as the confidence level $\delta$ approaches zero (treating the gaps as fixed), and obtained a complete resolution of this case (even for the leading constant). (In contrast, our work focuses on the situation where both $\delta$ and all gaps are variables that tend to zero. In fact, if we let the gaps, i.e., the $\Delta_{[i]}$'s, tend to $0$ while keeping $\delta$ fixed, their lower bound is not tight.) [Chen and Li(2015)] showed that for each $n$ there exists a Best-1-Arm instance with $n$ arms that requires $\Omega(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\ln\ln n)$ samples, which further refines the lower bound.

The algorithms for Best-1-Arm have also been significantly improved in the last two decades [Even-Dar et al.(2002), Gabillon et al.(2012), Kalyanakrishnan et al.(2012), Karnin et al.(2013), Jamieson et al.(2014), Chen and Li(2015), Garivier and Kaufmann(2016)]. [Karnin et al.(2013)] obtained an upper bound of

$$O\left(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\left(\ln\ln\Delta_{[i]}^{-1}+\ln\delta^{-1}\right)\right).$$

The same upper bound was obtained by [Jamieson et al.(2014)] using a UCB-type algorithm called lil'UCB. Recently, the upper bound was improved to

$$O\left(\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}+\sum_{i=2}^{n}\Delta_{[i]}^{-2}\left(\ln\ln\min(\Delta_{[i]}^{-1},n)+\ln\delta^{-1}\right)\right)$$

by [Chen and Li(2015)]. There is still a gap between the best known upper and lower bounds.

To understand the sample complexity of Best-1-Arm, it is important to study a special case, which we term SIGN-$\xi$. The problem can be viewed as a special case of Best-1-Arm where there are only two arms, and we know the mean of one arm. SIGN-$\xi$ will play a very important role in our lower bound proof.

###### Definition 1.4

SIGN-$\xi$: $\xi$ is a fixed constant. We are given a single arm with unknown mean $\mu\neq\xi$. The goal is to decide whether $\mu>\xi$ or $\mu<\xi$. Here, the gap of the problem is defined to be $\Delta=|\mu-\xi|$. Again, we assume that the distribution of the arm is a Gaussian distribution with unit variance.

In this paper, we are interested in algorithms (either for Best-1-Arm or for SIGN-$\xi$) that can identify the correct answer with probability at least $1-\delta$. This is often called the fixed confidence setting in the bandit literature.

###### Definition 1.5

For any $\delta\in(0,1)$, we say that an algorithm $\mathbb{A}$ for Best-1-Arm (or SIGN-$\xi$) is $\delta$-correct, if on any Best-1-Arm (or SIGN-$\xi$) instance, $\mathbb{A}$ returns the correct answer with probability at least $1-\delta$.

### 1.1 Almost Instance-wise Optimality Conjecture

It is easy to see that no function $f(n,\delta)$ (only depending on $n$ and $\delta$) can serve as an upper bound of the sample complexity of Best-1-Arm (with $n$ arms and confidence level $\delta$). Instead, the sample complexity depends on the gaps. Intuitively, the smaller the gaps are, the harder the instance is (i.e., more samples are required). Since the gaps completely determine an instance (for Gaussian arms with unit variance, up to shifting), we use the $\Delta_{[i]}$'s as the parameters to measure the sample complexity.

Now, we formally define the notion of instance-wise lower bounds and instance optimality. For algorithm $\mathbb{A}$ and instance $I$, we use $T_{\mathbb{A}}(I)$ to denote the expected number of samples taken by $\mathbb{A}$ on instance $I$.

###### Definition 1.6 (Instance-wise Lower Bound)

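For a Best-1-Arm instance $I$ and a confidence level $\delta$, we define the instance-wise lower bound of $I$ as

$$L(I,\delta):=\inf_{\mathbb{A}:\ \mathbb{A}\text{ is }\delta\text{-correct for Best-1-Arm}}T_{\mathbb{A}}(I).$$

We say a Best-1-Arm algorithm $\mathbb{A}$ is instance optimal, if it is $\delta$-correct, and for every instance $I$, $T_{\mathbb{A}}(I)=O(L(I,\delta))$.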

Now, we consider the Best-1-Arm problem from the perspective of instance optimality. Unfortunately, even for the two-arm case, no instance optimal algorithm may exist. In fact, [Farrell(1964)] showed that for any $\delta$-correct algorithm $\mathbb{A}$ for SIGN-$\xi$, we must have

$$\liminf_{\Delta\to 0}\frac{T_{\mathbb{A}}(I)}{\Delta^{-2}\ln\ln\Delta^{-1}}=\Omega(1).$$

This implies that any $\delta$-correct algorithm requires $\Omega(\Delta^{-2}\ln\ln\Delta^{-1})$ samples in the worst case. Hence, the upper bound of $O(\Delta^{-2}\ln\ln\Delta^{-1})$ for SIGN-$\xi$ is generally not improvable. However, for a particular SIGN-$\xi$ instance $I_\Delta$ with gap $\Delta$, there is a $\delta$-correct algorithm that only needs $O(\Delta^{-2}\ln\delta^{-1})$ samples for this instance, implying $L(I_\Delta,\delta)=\Theta(\Delta^{-2}\ln\delta^{-1})$. See [Chen and Li(2015)] for details.
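As a rough illustration of why knowing the gap helps, the following Python sketch runs a fixed-sample test for the SIGN problem under the assumption that the gap is at least a known value `gap`. It is not a correct algorithm in the sense of Definition 1.5, since it can err too often on instances with a smaller gap, and it is not the algorithm from [Chen and Li(2015)]. The function names and the constant 2 in the sample count are choices made for this example; the count follows from the Gaussian tail bound $\Pr[\hat\mu-\mu\ge t]\le\exp(-nt^2/2)$ for the mean of $n$ unit-variance samples.

```python
import math
import random


def samples_for_known_gap(gap, delta):
    """Sample count n with exp(-n * gap**2 / 2) <= delta (unit-variance Gaussian)."""
    return math.ceil(2.0 * math.log(1.0 / delta) / gap ** 2)


def sign_xi_known_gap(draw, xi, gap, delta):
    """Decide whether the arm's mean lies above or below xi.

    Assumes |mu - xi| >= gap; `draw` is a hypothetical callable returning one sample.
    """
    n = samples_for_known_gap(gap, delta)
    mean_hat = sum(draw() for _ in range(n)) / n
    return ">" if mean_hat > xi else "<"


rng = random.Random(0)
arm = lambda: rng.gauss(2.0, 1.0)  # hypothetical arm with mean 2.0
print(samples_for_known_gap(0.5, 0.01))
print(sign_xi_known_gap(arm, xi=0.0, gap=0.5, delta=0.01))
```

The number of samples grows as the inverse square of the assumed gap and only logarithmically in the inverse confidence level, with no doubly logarithmic term; that term is the price of not knowing the gap in advance.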

Despite the above fact, [Chen and Li(2016)] conjectured that the two-arm case is the only obstruction toward an instance optimal algorithm. Moreover, based on some evidence from the previous work [Chen and Li(2015)], they provided an explicit formula and conjectured that $L(I,\delta)$ can be expressed by the formula. Interestingly, the formula involves an entropy term (similar entropy terms also appear in [Afshani et al.(2009)] for completely different problems). In order to state Chen and Li's conjecture formally, we define the entropy term first.

###### Definition 1.7

Given a Best-1-Arm instance $I$ and $k\in\mathbb{N}$, let

$$G_k=\{i\in[2,n]\mid 2^{-(k+1)}<\Delta_{[i]}\le 2^{-k}\},\quad H_k=\sum_{i\in G_k}\Delta_{[i]}^{-2},\quad\text{and}\quad p_k=H_k\Big/\sum_{j}H_j.$$

We can view $\{p_k\}$ as a discrete probability distribution. We define the following quantity as the gap entropy of instance $I$:

$$\mathrm{Ent}(I)=\sum_{k\in\mathbb{N}:\,G_k\neq\emptyset}p_k\ln p_k^{-1}.$$

Note that it is exactly the Shannon entropy of the distribution defined by $\{p_k\}$.

###### Remark 1.8

We choose to partition the arms based on the powers of $2$. There is nothing special about the constant $2$, and replacing it by any other constant only changes $\mathrm{Ent}(I)$ by a constant factor.
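To make Definition 1.7 concrete, here is a small Python sketch that computes the instance complexity (the sum of inverse squared gaps) and the gap entropy from a list of arm means. The function name `gap_entropy` and the example means are illustrative and do not come from the paper.

```python
import math
from collections import defaultdict


def gap_entropy(means):
    """Return (H, Ent) for a Best-1-Arm instance given by its list of means."""
    ordered = sorted(means, reverse=True)
    gaps = [ordered[0] - m for m in ordered[1:]]
    if not gaps or min(gaps) <= 0:
        raise ValueError("need at least two arms and a unique best arm")
    group_complexity = defaultdict(float)  # k -> H_k
    for gap in gaps:
        # Group k holds the gaps with 2^-(k+1) < gap <= 2^-k.
        # (Gaps larger than 1 would land in groups with negative k.)
        k = math.floor(-math.log2(gap))
        group_complexity[k] += gap ** -2
    total = sum(group_complexity.values())  # H(I)
    # Shannon entropy of the distribution p_k = H_k / H(I).
    entropy = sum((h / total) * math.log(total / h)
                  for h in group_complexity.values())
    return total, entropy


# One arm with gap 1/2 and two arms with gap 1/4.
H, ent = gap_entropy([1.0, 0.5, 0.75, 0.75])
print(H, ent)
```

An instance whose sub-optimal arms all fall in one group has gap entropy zero, while spreading the complexity evenly over many groups makes the entropy grow.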

###### Conjecture 1.9 (Gap-Entropy Conjecture (Chen and Li(2016)))

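There is an algorithm for Best-1-Arm with sample complexity

$$O\left(L(I,\delta)+\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}\right)$$

for any instance $I$ and confidence level $\delta$. We say such an algorithm is almost instance-wise optimal for Best-1-Arm. Moreover,

$$L(I,\delta)=\Theta\left(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\cdot\left(\ln\delta^{-1}+\mathrm{Ent}(I)\right)\right).$$

###### Remark 1.10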

As we mentioned before, the term $\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}$ is sufficient and necessary for distinguishing the best and the second best arm, even though it is not an instance-optimal bound. The gap-entropy conjecture states that modulo this additive term, we can obtain an instance optimal algorithm. Hence, the resolution of the conjecture would provide a complete understanding of the sample complexity of Best-1-Arm (up to constant factors). All the previous bounds for Best-1-Arm agree with Conjecture 1.9, i.e., existing upper (lower) bounds are no smaller (larger) than the conjectured bound. See [Chen and Li(2016)] for details.

### 1.2 Our Results

In this paper, we make significant progress toward the resolution of the gap-entropy conjecture. On the upper bound side, we provide an algorithm that almost matches the conjecture.

###### Theorem 1.11

There is a $\delta$-correct algorithm for Best-1-Arm with expected sample complexity

$$O\left(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\cdot\left(\ln\delta^{-1}+\mathrm{Ent}(I)\right)+\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}\cdot\mathrm{polylog}(n,\delta^{-1})\right).$$

Our algorithm matches the main term in Conjecture 1.9. For the additive term (which is typically small), we lose a $\mathrm{polylog}(n,\delta^{-1})$ factor. In particular, for those instances where the additive term is $\mathrm{polylog}(n,\delta^{-1})$ times smaller than the main term, our algorithm is optimal.

On the lower bound side, although we are not able to settle the lower bound completely, we do obtain a rather strong bound. We need to introduce some notation first. We say an instance is discrete, if the gaps of all the sub-optimal arms are of the form $2^{-k}$ for some positive integer $k$. We say an instance $I'$ is a sub-instance of an instance $I$, if $I'$ can be obtained by deleting some sub-optimal arms from $I$. Formally, we have the following theorem.

###### Theorem 1.12

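For any discrete instance $I$, confidence level $\delta$, and any $\delta$-correct algorithm $\mathbb{A}$ for Best-1-Arm, there exists a sub-instance $I'$ of $I$ such that

$$T_{\mathbb{A}}(I')\ge c\cdot\left(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\cdot\left(\ln\delta^{-1}+\mathrm{Ent}(I)\right)\right),$$

where $c$ is a universal constant.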

We say an algorithm $\mathbb{A}$ is monotone, if $T_{\mathbb{A}}(I')\le T_{\mathbb{A}}(I)$ for every $I'$ and $I$ such that $I'$ is a sub-instance of $I$. Then we immediately have the following corollary.

###### Corollary 1.13

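For any discrete instance $I$ and confidence level $\delta$, for any monotone $\delta$-correct algorithm $\mathbb{A}$ for Best-1-Arm, we have that

$$T_{\mathbb{A}}(I)\ge c\cdot\left(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\cdot\left(\ln\delta^{-1}+\mathrm{Ent}(I)\right)\right),$$

where $c$ is a universal constant.

We remark that all previous algorithms for Best-1-Arm have monotone sample complexity bounds. The above corollary also implies that if an algorithm has a monotone sample complexity bound, then the bound must be $\Omega\left(\sum_{i=2}^{n}\Delta_{[i]}^{-2}\cdot(\ln\delta^{-1}+\mathrm{Ent}(I))\right)$ on all discrete instances.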

## 2 Related Work

#### Sign-ξ and A/B testing.

In the A/B testing problem, we are asked to decide which of the two given arms has the larger mean. A/B testing is in fact equivalent to the SIGN-$\xi$ problem. It is easy to reduce SIGN-$\xi$ to A/B testing by constructing a fictitious arm with mean $\xi$. For the other direction, given an instance of A/B testing, we may define an arm as the difference between the two given arms, and the problem reduces to SIGN-$\xi$ where $\xi=0$. In particular, our refined lower bound for SIGN-$\xi$ stated in Lemma 4.1 also holds for A/B testing. [Kaufmann et al.(2015), Garivier and Kaufmann(2016)] studied the limiting behavior of the sample complexity of A/B testing as the confidence level $\delta$ approaches zero. In contrast, we focus on the case that both $\delta$ and the gap $\Delta$ tend to zero, so that the complexity term due to not knowing the gap in advance will not be dominated by the $\ln\delta^{-1}$ term.
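The reduction from A/B testing to the SIGN problem with threshold zero can be sketched in a few lines of Python. One detail the sketch makes explicit, as an assumption of this example rather than a statement from the paper, is that the difference of two independent unit-variance samples has variance 2, so it is rescaled by $\sqrt{2}$ to obtain a unit-variance arm; this changes the gap only by a constant factor. The helper name `difference_arm` and the arm means are hypothetical.

```python
import math
import random


def difference_arm(draw_a, draw_b):
    """Combine two independent unit-variance arms into one unit-variance arm.

    The returned arm has mean (mu_a - mu_b) / sqrt(2), so its sign relative
    to the threshold 0 tells which original arm has the larger mean.
    """
    scale = math.sqrt(2.0)
    return lambda: (draw_a() - draw_b()) / scale


rng = random.Random(1)
arm_a = lambda: rng.gauss(1.0, 1.0)  # hypothetical arm with mean 1.0
arm_b = lambda: rng.gauss(0.0, 1.0)  # hypothetical arm with mean 0.0
arm = difference_arm(arm_a, arm_b)

samples = [arm() for _ in range(20000)]
mean_hat = sum(samples) / len(samples)
var_hat = sum((x - mean_hat) ** 2 for x in samples) / len(samples)
print(round(mean_hat, 2), round(var_hat, 2))
```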

#### Best-k-Arm.

The Best-k-Arm problem, in which we are required to identify the $k$ arms with the largest means, is a natural extension of Best-1-Arm. Best-k-Arm has been extensively studied in the past few years [Kalyanakrishnan and Stone(2010), Gabillon et al.(2011), Gabillon et al.(2012), Kalyanakrishnan et al.(2012), Bubeck et al.(2013), Kaufmann and Kalyanakrishnan(2013), Zhou et al.(2014), Kaufmann et al.(2015), Chen et al.(2017)], and most results for Best-k-Arm are generalizations of those for Best-1-Arm. As in the case of Best-1-Arm, the sample complexity bounds of Best-k-Arm depend on the gap parameters of the arms, yet in the context of Best-k-Arm the gap of an arm is typically defined as the distance from its mean to either $\mu_{[k+1]}$ or $\mu_{[k]}$ (depending on whether the arm is among the best $k$ arms or not). The Combinatorial Pure Exploration problem, which further generalizes the cardinality constraint in Best-k-Arm (i.e., to choose exactly $k$ arms) to general combinatorial constraints, was also studied [Chen et al.(2014), Chen et al.(2016), Gabillon et al.(2016)].

#### PAC learning.

The sample complexity of Best-1-Arm and Best-k-Arm in the probably approximately correct (PAC) setting has also been well studied in the past two decades. For Best-1-Arm, the tight worst-case sample complexity bound was obtained by [Even-Dar et al.(2002), Mannor and Tsitsiklis(2004), Even-Dar et al.(2006)]. [Kalyanakrishnan and Stone(2010), Kalyanakrishnan et al.(2012), Zhou et al.(2014), Cao et al.(2015)] also studied the worst-case sample complexity of Best-k-Arm in the PAC setting.

## 3 Preliminaries

Throughout the paper, $I$ denotes an instance of Best-1-Arm (i.e., $I$ is a set of arms). The arm with the largest mean in $I$ is called the optimal arm, while all other arms are sub-optimal. We assume that every instance has a unique optimal arm. $A_i$ denotes the arm in $I$ with the $i$-th largest mean, unless stated otherwise. The mean of an arm $A$ is denoted by $\mu_A$, and we use $\mu_{[i]}$ as a shorthand notation for $\mu_{A_i}$ (i.e., the $i$-th largest mean in an instance). Define $\Delta_A=\mu_{[1]}-\mu_A$ as the gap of arm $A$, and let $\Delta_{[i]}=\Delta_{A_i}$ denote the gap of arm $A_i$. We assume that $\Delta_{[2]}>0$ to ensure the optimal arm is unique.

We partition the sub-optimal arms into different groups based on their gaps. For each $k\in\mathbb{N}$, group $G_k$ is defined as $\{A\in I:\Delta_A\in(2^{-(k+1)},2^{-k}]\}$. For brevity, let $G_{\ge k}$ and $G_{\le k}$ denote $\bigcup_{i\ge k}G_i$ and $\bigcup_{i\le k}G_i$ respectively. The complexity of arm $A_i$ is defined as $\Delta_{[i]}^{-2}$, while the complexity of instance $I$ is denoted by $H(I)=\sum_{i=2}^{n}\Delta_{[i]}^{-2}$ (or simply $H$, if the instance is clear from the context). Moreover, $H_k=\sum_{A\in G_k}\Delta_A^{-2}$ denotes the total complexity of the arms in group $G_k$. $(H_k)_k$ naturally defines a probability distribution on $\mathbb{N}$, where the probability of $k$ is given by $p_k=H_k/H$. The gap-entropy of the instance $I$ is then denoted by

$$\mathrm{Ent}(I)=\sum_{k}p_k\ln p_k^{-1}.$$

Here and in the following, we adopt the convention that $0\ln 0^{-1}=0$.

## 4 A Sketch of the Lower Bound

### 4.1 A Comparison with Previous Lower Bound Techniques

We briefly discuss the novelty of our new lower bound technique, and argue why the previous techniques are not sufficient to obtain our result. To obtain a lower bound on the sample complexity of Best-1-Arm, all previous works [Mannor and Tsitsiklis(2004), Chen et al.(2014), Kaufmann et al.(2015), Garivier and Kaufmann(2016)] are based on creating two similar instances with different answers, and then applying the change of distribution method (originally developed in [Kaufmann et al.(2015)]) to argue that a certain number of samples are necessary to distinguish two such instances. The idea was further refined by [Garivier and Kaufmann(2016)]. They formulated a max-min game between the algorithm and some instances (with different answers than the given instance) created by an adversary. The value of the game at equilibrium would be a lower bound on the number of samples one requires to distinguish the current instance from several worst-case adversary instances. However, we notice that even in the two-arm case, one cannot prove the $\Omega(\Delta^{-2}\ln\ln\Delta^{-1})$ lower bound by considering only one max-min game to distinguish the current instance from other instances. Roughly speaking, the $\ln\ln\Delta^{-1}$ factor is due to not knowing the actual gap $\Delta$, and any lower bound that can bring out the $\ln\ln\Delta^{-1}$ factor should reflect the union bound paid for the uncertainty of the instance. In fact, for the Best-1-Arm problem with $n$ arms, the gap entropy term $\mathrm{Ent}(I)$ exists for a similar reason (not knowing the gaps). Hence, any lower bound proof for Best-1-Arm that can bring out the $\mathrm{Ent}(I)$ term necessarily has to consider the uncertainty of the current instance as well (in fact, the random permutation of all arms is the kind of uncertainty we need for the new lower bound). In our actual lower bound proof, we first obtain a very tight understanding of the SIGN-$\xi$ problem (Lemma 4.1); Farrell's lower bound [Farrell(1964)] is not sufficient for our purpose. Then, we provide an elegant reduction from SIGN-$\xi$ to Best-1-Arm, by embedding the SIGN-$\xi$ problem into a collection of Best-1-Arm instances.

### 4.2 Proof of Theorem 1.12

Following the approach in [Chen and Li(2015)], we establish the lower bound by a reduction from SIGN-$\xi$ to discrete Best-1-Arm instances, together with a more refined lower bound for SIGN-$\xi$ stated in the following lemma.

###### Lemma 4.1

Suppose , and is a -correct algorithm for SIGN-. is a probability distribution on defined by . denotes the Shannon entropy of distribution . Let denote the expected number of samples taken by when it runs on an arm with distribution and . Define . Then,

$$\sum_{k=1}^{m}p_k\alpha_k=\Omega\left(\mathrm{Ent}(P)+\ln\delta^{-1}\right).$$

It is well known that to distinguish the normal distribution $N(2^{-k},1)$ from $N(-2^{-k},1)$, $\Omega(4^k)$ samples are required. Thus, $\alpha_k$ denotes the ratio between the expected number of samples taken by $\mathbb{A}$ and the corresponding lower bound, which measures the "loss" due to not knowing the gap in advance. Then Lemma 4.1 can be interpreted as follows: when the gap is drawn from a distribution $P$, the expected loss is lower bounded by the sum of the entropy of $P$ and $\ln\delta^{-1}$. We defer the proof of Lemma 4.1 to Appendix D.

Now we prove Theorem 1.12 by applying Lemma 4.1 and an elegant reduction from SIGN-$\xi$ to Best-1-Arm.

Proof of Theorem 1.12. Let $c_0$ be the hidden constant in the big-$\Omega$ in Lemma 4.1, i.e.,

$$\sum_{k=1}^{m}p_k\alpha_k\ge c_0\cdot\left(\mathrm{Ent}(P)+\ln\delta^{-1}\right).$$

We claim that Theorem 1.12 holds for constant .

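Suppose towards a contradiction that $\mathbb{A}$ is a $\delta$-correct (for some confidence level $\delta$) algorithm for Best-1-Arm and $I$ is a discrete instance, while for every sub-instance $I'$ of $I$,

$$T_{\mathbb{A}}(I')<c\cdot H(I)\left(\mathrm{Ent}(I)+\ln\delta^{-1}\right).$$

Recall that $H(I)$ and $\mathrm{Ent}(I)$ denote the complexity and entropy of instance $I$, respectively.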

#### Construct a distribution of SIGN-ξ instances.

Let $n_k$ be the number of arms in $I$ with gap $2^{-k}$, and $m$ be the greatest integer such that $n_m>0$. Since $I$ is discrete, the complexity of instance $I$ is given by

$$H(I)=\sum_{k=1}^{m}4^k n_k.$$

Let $p_k=4^k n_k/H(I)$. Then $(p_k)_{k=1}^{m}$ defines a distribution $P$ on $\{2^{-1},2^{-2},\ldots,2^{-m}\}$. Moreover, the Shannon entropy of distribution $P$ is exactly the entropy of instance $I$, i.e., $\mathrm{Ent}(P)=\mathrm{Ent}(I)$. Our goal is to construct an algorithm for SIGN-$\xi$ that violates Lemma 4.1 on distribution $P$.
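For a discrete instance, the complexity and the induced distribution depend only on how many arms have each gap. The Python sketch below computes them from a map of arm counts, using the weights $4^k n_k/H(I)$ that appear in the calculation later in this proof. The function name and the example counts are illustrative, not taken from the paper.

```python
import math


def discrete_instance_distribution(counts):
    """counts maps k to n_k, the number of arms with gap 2^-k.

    Returns (H, p, entropy): the complexity H(I), the distribution
    p_k = 4^k * n_k / H(I), and its Shannon entropy.
    """
    complexity = sum(4 ** k * n_k for k, n_k in counts.items())
    p = {k: 4 ** k * n_k / complexity for k, n_k in counts.items() if n_k > 0}
    entropy = sum(p_k * math.log(1.0 / p_k) for p_k in p.values())
    return complexity, p, entropy


# One arm with gap 1/2 and two arms with gap 1/4.
H, p, ent = discrete_instance_distribution({1: 1, 2: 2})
print(H, p, ent)
```

Because an arm with gap $2^{-k}$ falls into group $k$ and contributes $4^k$ to the complexity, the entropy computed here coincides with the gap entropy of the instance.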

#### A family of sub-instances of I.

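Let $U=\{k\in[m]:n_k>0\}$ be the set of "types" of arms that are present in $I$. We consider the following family of instances obtained from $I$. For $S\subseteq U$, define $I_S$ as the instance obtained from $I$ by removing exactly one arm of gap $2^{-k}$ for each $k\in S$. Note that $I_S$ is a sub-instance of $I$.

Let $\overline{S}$ denote $U\setminus S$, the complement of set $S$ relative to $U$. For $S\subseteq U$ and $k\in\overline{S}$, let $\tau_k^{S}$ denote the expected number of samples taken on all the arms with gap $2^{-k}$ when $\mathbb{A}$ runs on $I_S$. Define $\alpha_k^{S}=\tau_k^{S}/(4^k n_k)$. We note that $4^k\alpha_k^{S}$ is the expected number of samples taken on every arm with gap $2^{-k}$ in instance $I_S$. (Recall that a Best-1-Arm algorithm is defined on a set of arms, so the arms with identical means in the instance cannot be distinguished by $\mathbb{A}$. See Remark 1.2 for details.)

We have the following inequality:

$$\sum_{S\subseteq U}\sum_{k\in\overline{S}}4^k n_k\alpha_k^{S}=\sum_{S\subseteq U}\sum_{k\in\overline{S}}\tau_k^{S}\le\sum_{S\subseteq U}T_{\mathbb{A}}(I_S)<2^{|U|}\cdot c\cdot H(I)\left(\mathrm{Ent}(I)+\ln\delta^{-1}\right).\qquad(1)$$

The second step holds because the left-hand side only counts part of the samples taken by $\mathbb{A}$. The last step follows from our assumption and the fact that $I_S$ is a sub-instance of $I$.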

#### Construct algorithm Anew from A.

Now we define an algorithm for SIGN- with . Given an arm , we first choose a set uniformly at random from all subsets of . Recall that denotes the mean of the optimal arm in . runs the following four algorithms through in parallel:

1. Algorithm simulates on .

2. Algorithm simulates on .

3. Algorithm simulates on .

4. Algorithm simulates on .

More precisely, when one of the four algorithms requires a new sample from (or ), we draw a sample from arm , feed to and , and then feed to and . Note that the samples taken by the four algorithms are the same up to negation and shifting.

terminates as soon as one of the four algorithms terminates. If one of and identifies as the optimal arm, or one of and identifies an arm other than as the optimal arm, outputs “”; otherwise it outputs “”.

Clearly, is correct if all of through are correct, which happens with probability at least . Note that since , the condition of Lemma 4.1 is satisfied.

#### Upper bound the sample complexity of Anew.

The crucial observation is that when and , effectively simulates the execution of on . In fact, since all arms are Gaussian distributions with unit variance, the arm is the same as an arm with gap in the original Best--Arm instance. Recall that the number of samples taken on each of the arms with gap in instance is . Therefore, the expected number of samples taken on is upper bounded by .555 Recall that if terminates after taking samples from , the number of samples taken by on is also (rather than ). Likewise, when and , is equivalent to the execution of on , and thus the expected number of samples on is less than or equal to . Analogous claims hold for the case and algorithms and as well.

It remains to compute the expected loss of on distribution and derive a contradiction to Lemma 4.1. It follows from a simple calculation that

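$$\begin{aligned}
\sum_{k=1}^{m}p_k\alpha_k
&\le\sum_{k\in U}p_k\cdot\frac{1}{2^{|U|}}\left(\sum_{S\subseteq U:\,k\in S}\alpha_k^{S\setminus\{k\}}+\sum_{S\subseteq U:\,k\in\overline{S}}\alpha_k^{\overline{S}\setminus\{k\}}\right)\\
&=\frac{1}{2^{|U|-1}}\sum_{k\in U}\sum_{S\subseteq U:\,k\in S}p_k\,\alpha_k^{S\setminus\{k\}}\\
&=\frac{1}{2^{|U|-1}}\sum_{S\subseteq U}\sum_{k\in\overline{S}}\frac{4^k n_k}{H(I)}\cdot\alpha_k^{S}\\
&\le\frac{2^{|U|}}{2^{|U|-1}}\cdot c\cdot\left(\mathrm{Ent}(I)+\ln\delta^{-1}\right).
\end{aligned}$$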

The first step follows from our discussion on algorithm . The third step renames the variables and rearranges the summation. The last line applies (1). This leads to a contradiction to Lemma 4.1 and thus finishes the proof.

## 5 Warmup: Best-1-Arm with Known Complexity

To illustrate the idea of our algorithm for Best-1-Arm, we consider the following simplified yet still non-trivial version of Best-1-Arm: the complexity of the instance, $H(I)$, is given, yet the means of the arms are still unknown.

### 5.1 Building Blocks

We introduce some subroutines that are used throughout our algorithm.

#### Uniform sampling.

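The first building block is a uniform sampling procedure, Unif-Sampl$(S,\varepsilon,\delta)$, which takes $O(\varepsilon^{-2}\ln\delta^{-1})$ samples from each arm in set $S$. Let $\hat\mu_A$ be the empirical mean of arm $A$ (i.e., the average of all sampled values from $A$). It obtains an $\varepsilon$-approximation of the mean of each arm with probability $1-\delta$. The following fact follows directly from the Chernoff bound.

###### Fact 5.1

Unif-Sampl$(S,\varepsilon,\delta)$ takes $O(|S|\varepsilon^{-2}\ln\delta^{-1})$ samples. For each arm $A\in S$, we have

$$\Pr\left[|\hat\mu_A-\mu_A|\le\varepsilon\right]\ge 1-\delta.$$

We say that a call to procedure Unif-Sampl$(S,\varepsilon,\delta)$ returns correctly, if $|\hat\mu_A-\mu_A|\le\varepsilon$ holds for every arm $A\in S$. Fact 5.1 implies that when $|S|=1$, the probability of returning correctly is at least $1-\delta$.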
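A minimal Python sketch of a uniform sampling routine in the spirit of Fact 5.1 is given below. It assumes unit-variance Gaussian arms, for which the mean of $n$ samples satisfies $\Pr[|\hat\mu-\mu|>\varepsilon]\le 2\exp(-n\varepsilon^2/2)$; the explicit sample count, the function name `unif_sampl`, and the representation of arms as callables are choices made for this example, not details taken from the paper.

```python
import math
import random


def unif_sampl(arms, eps, delta):
    """Sample every arm equally often; return (empirical means, total samples).

    `arms` maps a hypothetical arm name to a callable that draws one reward.
    """
    # Choose n so that 2 * exp(-n * eps**2 / 2) <= delta for each arm.
    n = math.ceil(2.0 * math.log(2.0 / delta) / eps ** 2)
    means = {name: sum(draw() for _ in range(n)) / n
             for name, draw in arms.items()}
    return means, n * len(arms)


rng = random.Random(3)
arms = {
    "a": lambda: rng.gauss(0.0, 1.0),  # hypothetical arm with mean 0.0
    "b": lambda: rng.gauss(1.0, 1.0),  # hypothetical arm with mean 1.0
}
means, total = unif_sampl(arms, eps=0.5, delta=0.01)
print(means, total)
```

The guarantee holds per arm, so making every arm in a set accurate simultaneously costs a union bound over the set.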

#### Median elimination.

[Even-Dar et al.(2002)] introduced the Median Elimination algorithm for the PAC version of Best-1-Arm. Med-Elim$(S,\varepsilon,\delta)$ returns an arm in $S$ with mean at most $\varepsilon$ away from the largest mean. Let $\mu_{[1]}(S)$ denote the largest mean among all arms in $S$. The performance guarantee of Med-Elim is formally stated in the next fact.

###### Fact 5.2

Med-Elim$(S,\varepsilon,\delta)$ takes $O(|S|\varepsilon^{-2}\ln\delta^{-1})$ samples. Let $A$ be the arm returned by Med-Elim. Then

$$\Pr\left[\mu_A\ge\mu_{[1]}(S)-\varepsilon\right]\ge 1-\delta.$$

We say that Med-Elim$(S,\varepsilon,\delta)$ returns correctly, if it holds that $\mu_A\ge\mu_{[1]}(S)-\varepsilon$.
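The following Python sketch shows the round structure of a median-elimination procedure: sample all surviving arms, discard the half with the lowest empirical means, then tighten the accuracy and confidence parameters. The halving schedule follows the usual presentation of Median Elimination, but the constant in the per-round sample count is illustrative and not tuned to reproduce the guarantee of Fact 5.2; the function name and arm representation are assumptions of this example.

```python
import math
import random


def med_elim(arms, eps, delta):
    """Return the name of the arm that survives repeated halving.

    `arms` maps a hypothetical arm name to a callable that draws one reward.
    """
    survivors = list(arms)
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(survivors) > 1:
        # Illustrative per-arm sample count for this round.
        n = math.ceil(4.0 * math.log(3.0 / delta_l) / eps_l ** 2)
        means = {a: sum(arms[a]() for _ in range(n)) / n for a in survivors}
        # Keep the better half (rounded up) by empirical mean.
        survivors.sort(key=means.get, reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return survivors[0]


rng = random.Random(5)
true_means = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 1.0}  # hypothetical instance
arms = {name: (lambda m=mu: rng.gauss(m, 1.0)) for name, mu in true_means.items()}
best = med_elim(arms, eps=0.4, delta=0.01)
print(best)
```

Since the number of surviving arms halves each round while the per-arm sample count grows only by a constant factor, the total number of samples is dominated by the first round, which is why the cost is linear in the number of arms.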

#### Fraction test.

Procedure Frac-Test decides whether a sufficiently large fraction (compared to a lower and an upper fraction threshold) of arms in $S$ have small means (compared to a lower and an upper mean threshold). The procedure randomly samples a certain number of arms from $S$ and estimates their means using Unif-Sampl. Then it compares the fraction of arms with small means to the thresholds and returns an answer accordingly. The detailed implementation of Frac-Test is relegated to Appendix A, where we also prove the following fact.

###### Fact 5.3

takes samples, where and . With probability , the following two claims hold simultaneously:

• If Frac-Test returns True, .

• If Frac-Test returns False, .

We say that a call to procedure Frac-Test returns correctly, if both the two claims above hold; otherwise the call fails.

#### Elimination.

Finally, procedure eliminates the arms with means smaller than threshold from . More precisely, the procedure guarantees that at most a fraction of arms in the result have means smaller than . On the other hand, for each arm with mean greater than , with high probability it is not eliminated. We postpone the pseudocode of procedure Elimination and the proof of the following fact to Appendix A.

###### Fact 5.4

takes samples in expectation, where . Let denote the set returned by . Then with probability at least ,

 |{A∈S′:μA

Moreover, for each arm with , we have

 Pr[A∈S′]≥1−δ/2.

We say that a call to Elimination returns correctly if both and hold; otherwise the call fails. Here denotes the arm with the largest mean in set . Fact 5.4 directly implies that procedure Elimination returns correctly with probability at least .

### 5.2 Algorithm

Now we present our algorithm for the special case that the complexity of the instance is known in advance. The Known-Complexity algorithm takes as its input a Best-1-Arm instance $I$, the complexity $H$ of the instance, as well as a confidence level $\delta$. The algorithm proceeds in rounds, and maintains a sequence $\{S_r\}$ of arm sets, each of which denotes the set of arms that are still considered as candidate answers at the beginning of round $r$.

Roughly speaking, the algorithm eliminates the arms with $\Omega(\varepsilon_r)$ gaps at the $r$-th round, if they constitute a large fraction of the remaining arms. Here $\varepsilon_r$ is the accuracy parameter that we use in round $r$. To this end, Known-Complexity first calls procedures Med-Elim and Unif-Sampl to obtain $\hat\mu_r$, which is an estimation of the largest mean among all arms in $S_r$ up to an $O(\varepsilon_r)$ error. After that, Frac-Test is called to determine whether a large proportion of arms in $S_r$ have $\Omega(\varepsilon_r)$ gaps. If so, Frac-Test returns True, and then Known-Complexity calls the Elimination procedure with carefully chosen parameters to remove suboptimal arms from $S_r$.

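Algorithm Known-Complexity

Input: Instance $I$ with complexity $H$ and risk $\delta$.

Output: The best arm.

In each round $r$, if $|S_r|=1$ the algorithm returns the only arm in $S_r$; otherwise it calls Med-Elim, Unif-Sampl, Frac-Test and, when Frac-Test returns True, Elimination, as described above.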

The following two lemmas imply that there is a $\delta$-correct algorithm for Best-1-Arm that matches the instance-wise lower bound up to a $\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}$ additive term. (Lemma 5.6 only bounds the number of samples conditioning on a high-probability event, so the algorithm may take arbitrarily many samples when the event does not occur. However, Known-Complexity can be transformed to a $\delta$-correct algorithm with the same (unconditional) sample complexity bound, using the "parallel simulation" technique in the proof of Theorem 1.11 in Appendix C.)

###### Lemma 5.5

For any Best-1-Arm instance $I$ and confidence level $\delta$, Known-Complexity returns the optimal arm in $I$ with probability at least $1-\delta$.

###### Lemma 5.6

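For any Best-1-Arm instance $I$ and confidence level $\delta$, conditioning on a high-probability event, Known-Complexity takes

$$O\left(H(I)\cdot\left(\ln\delta^{-1}+\mathrm{Ent}(I)\right)+\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}\right)$$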

samples in expectation.

### 5.3 Observations

We state a few key observations on Known-Complexity, which will be used throughout the analysis. The proofs are identical to those of Observations A.3 through A.5 in Appendix A. The following observation bounds the value of $\hat\mu_r$ at round $r$, assuming the correctness of Unif-Sampl and Med-Elim.

###### Observation 5.7

If Unif-Sampl returns correctly at round , . Here denotes the largest mean of arms in . If both Unif-Sampl and Med-Elim return correctly, .

The following two observations bound the thresholds used in Frac-Test and Elimination by applying Observation 5.7.

###### Observation 5.8

At round , let and denote the two thresholds used in Frac-Test. If Unif-Sampl returns correctly, . If both Med-Elim and Unif-Sampl return correctly, .

###### Observation 5.9

Let and denote the two thresholds used in Elimination. If Unif-Sampl returns correctly, . If both Med-Elim and Unif-Sampl return correctly, .

### 5.4 Correctness

We define $E$ as the event that all calls to procedures Unif-Sampl, Frac-Test, and Elimination return correctly. We will prove in the following that Known-Complexity returns the correct answer with probability $1$ conditioning on $E$, and that $\Pr[E]\ge 1-\delta$. Note that Lemma 5.5 directly follows from these two claims.

#### Event E implies correctness.

It suffices to show that conditioning on , Known-Complexity never removes the best arm, and the algorithm eventually terminates. Suppose that . Observation 5.9 guarantees that at round , the upper threshold used by Elimination is smaller than or equal to . By Fact 5.4, the correctness of Elimination guarantees that .

It remains to prove that Known-Complexity terminates conditioning on . Define . Suppose $r^*$ is the smallest integer greater than such that Med-Elim returns correctly at round $r^*$. (Footnote 7: Med-Elim returns correctly with probability at least in each round, so $r^*$ is well-defined with probability .) By Observation 5.9, the lower threshold in Elimination is greater than or equal to . The correctness of Elimination implies that

 |Sr∗+1|−1=|Sr∗+1∩G≤rmax|≤|Sr∗+1∩G

It follows that . Therefore, the algorithm terminates either before or at round .

#### E happens with high probability.

We first note that at round $r$, the probability that either Unif-Sampl or Frac-Test fails (i.e., returns incorrectly) is at most $2\delta_r$. By a union bound, the probability that at least one call to Unif-Sampl or Frac-Test returns incorrectly is upper bounded by

$$\sum_{r=1}^{\infty}2\delta_r=\sum_{r=1}^{\infty}\frac{\delta}{5r^2}<\delta/2.$$
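As a quick numerical sanity check of this union bound (a sketch, not part of the proof): since $\sum_{r\ge 1} r^{-2}=\pi^2/6$, the series equals $\delta\pi^2/30\approx 0.329\,\delta$, which is indeed below $\delta/2$. The function name `union_bound_total` and the value `delta = 0.1` are illustrative choices only.

```python
import math


def union_bound_total(delta: float, rounds: int = 100_000) -> float:
    """Partial sum of delta / (5 r^2) over r = 1..rounds."""
    return sum(delta / (5 * r * r) for r in range(1, rounds + 1))


delta = 0.1  # illustrative confidence level
partial = union_bound_total(delta)
closed_form = delta * math.pi ** 2 / 30  # uses sum_{r>=1} 1/r^2 = pi^2/6

print(partial, closed_form, delta / 2)
```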

It remains to bound the probability that Elimination fails at some round, yet procedures Unif-Sampl and Frac-Test are always correct. Define as the probability that, given the value of at the beginning of round , at least one call to Elimination returns incorrectly in round or later, yet Unif-Sampl and Frac-Test always return correctly. We prove by induction that for any that contains the optimal arm ,

$$P(r,S_r)\le\frac{\delta}{\hat{H}}\left(128\,C(r,S_r)+16\,M(r,S_r)\,\varepsilon_r^{-2}\right),\tag{2}$$

where and

$$C(r,S_r)\coloneqq\sum_{i=r-1}^{\infty}|S_r\cap G_i|\sum_{j=r}^{i+1}\varepsilon_j^{-2}+\sum_{i=r}^{r_{\max}+1}\varepsilon_i^{-2}.$$

The details of the induction are postponed to Appendix E.

Observe that and

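$$C(1,I)=\sum_{i=0}^{\infty}|S_r\cap G_i|\sum_{j=1}^{i+1}4^j+\sum_{i=1}^{r_{\max}+1}4^i\le\frac{16}{3}\left(\sum_{i=0}^{\infty}|S_r\cap G_i|\,4^i+4^{r_{\max}}\right)\le\frac{16}{3}\left(\sum_{i=0}^{\infty}\sum_{A\in S_r\cap G_i}\Delta_A^{-2}+\Delta_{[2]}^{-2}\right)\le\frac{32}{3}H(I).$$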
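The chain of inequalities can be checked numerically on a toy instance. The sketch below relies on assumptions that are not restated in this section: that $G_i$ collects the arms whose gap lies in $(2^{-(i+1)},2^{-i}]$, that $r_{\max}=\lfloor\log_2\Delta_{[2]}^{-1}\rfloor$, and that all gaps are at most $1$. The helper name `c_and_h` and the gap values are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def c_and_h(gaps: Sequence[float]) -> Tuple[int, float]:
    """Return (C(1, I), H(I)) for suboptimal-arm gaps in (0, 1]."""
    # Assumed grouping: an arm with gap g falls in G_i for i = floor(log2(1/g)).
    groups = Counter(math.floor(math.log2(1.0 / g)) for g in gaps)
    # Assumed definition of r_max via the smallest gap Delta_[2].
    r_max = math.floor(math.log2(1.0 / min(gaps)))
    c = sum(cnt * sum(4 ** j for j in range(1, i + 2)) for i, cnt in groups.items())
    c += sum(4 ** i for i in range(1, r_max + 2))
    h = sum(g ** -2 for g in gaps)
    return c, h


c, h = c_and_h([0.3, 0.2, 0.2, 0.05, 0.01])
print(c, 32 / 3 * h)
```

The key step is the geometric sum $\sum_{j=1}^{i+1}4^j=(4^{i+2}-4)/3<\frac{16}{3}4^i$, together with $4^i\le\Delta_A^{-2}$ for every arm $A$ in $G_i$ under the assumed grouping.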

Therefore we conclude that

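$$\Pr[E]\ge 1-P(1,S_1)-\frac{\delta}{2}\ge 1-\frac{\delta}{\hat{H}}\left(128\,C(1,I)+16\,M(1,I)\,\varepsilon_1^{-2}\right)-\frac{\delta}{2}\ge 1-128\cdot\frac{\delta}{4096H}\cdot\frac{32H}{3}-\frac{\delta}{2}\ge 1-\delta,$$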

which completes the proof of correctness. Here the first step applies a union bound. The second step follows from inequality (2), and the third step plugs in and .
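The constants in the third step can be verified with exact rational arithmetic. This is only an arithmetic check: it confirms that $128\cdot\frac{1}{4096}\cdot\frac{32}{3}=\frac{1}{3}$, so the total failure probability is at most $\frac{\delta}{3}+\frac{\delta}{2}=\frac{5\delta}{6}\le\delta$. The variable names are illustrative.

```python
from fractions import Fraction

# Coefficient of delta contributed by the Elimination term: 128 * (1/4096) * (32/3).
elimination_coeff = 128 * Fraction(1, 4096) * Fraction(32, 3)

# Add the delta/2 budget spent on Unif-Sampl and Frac-Test.
total_failure_coeff = elimination_coeff + Fraction(1, 2)

print(elimination_coeff, total_failure_coeff)
```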

### 5.5 Sample Complexity

As in the proof of Lemma 5.5, we define $E$ as the event that all calls to procedures Unif-Sampl, Frac-Test, and Elimination return correctly. We prove that Known-Complexity takes

$$O\left(H(I)\left(\ln\delta^{-1}+\mathrm{Ent}(I)\right)+\Delta_{[2]}^{-2}\ln\ln\Delta_{[2]}^{-1}\right)$$

samples in expectation conditioning on $E$.

#### Samples taken by Unif-Sampl and Frac-Test.

By Facts 5.1 and 5.3, procedures Unif-Sampl and Frac-Test take samples in total at round .

In the proof of correctness, we showed that conditioning on , the algorithm does not terminate before or at round (for ) implies that Med-Elim fails between round and round , which happens with probability at most . Thus for , the expected number of samples taken by Unif-Sampl and Frac-Test at round is upper bounded by

$$O\left(0.01^{\,k-r_{\max}-1}\cdot\varepsilon_k^{-2}\left(\ln\delta^{-1}+\ln k\right)\right).$$

Summing over all yields the following upper bound: