# The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime

We propose a novel technique for analyzing adaptive sampling called the Simulator. Our approach differs from the existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective allows us to match the strength of both Fano and change-of-measure techniques, without succumbing to the limitations of either method. For concreteness, we apply our techniques to a structured multi-arm bandit problem in the fixed-confidence pure exploration setting, where we show that the constraints on the means imply a substantial gap between the moderate-confidence sample complexity, and the asymptotic sample complexity as δ→ 0 found in the literature. We also prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors. Moreover, our lower bounds zero-in on the number of times each individual arm needs to be pulled, uncovering new phenomena which are drowned out in the aggregate sample complexity. Our new analysis inspires a simple and near-optimal algorithm for the best-arm and top-k identification, the first practical algorithm of its kind for the latter problem which removes extraneous log factors, and outperforms the state-of-the-art in experiments.

## 1 Introduction

The goal of adaptive sampling is to estimate some unknown property about the world, using as few measurements as possible from a set of possible measurement actions (footnote 1: we only work with finitely many measurements, but this may be generalized as in [1]). At each time step $t$, a learner chooses a measurement action $a_t$ based on past observations, and receives an observation. We assume that the observations from action $a$ are drawn i.i.d. from a distribution $\nu_a$, which is unknown to the learner. In particular, the vector of distributions $\nu = (\nu_1, \dots, \nu_n)$, called the instance, encodes the distribution of all possible measurement actions. The instance can be thought of as describing the state of the world, and our property of interest is a function of the instance. We focus on what is called the fixed-confidence pure-exploration setting, where the algorithm decides to stop at some (possibly random) time $T$, and returns an output which is allowed to differ from the true property with probability at most $\delta$ on any instance $\nu$. Since $T$ is exactly equal to the number of measurements taken, the goal of adaptive pure-exploration problems is to design algorithms for which $T$ is as small as possible, either in expectation or with high probability.

Crucially, we often expect the instance $\nu$ to lie in a known constraining set. This allows us to encode a broad range of problems of interest as pure-exploration multi-arm bandit (MAB) problems [2, 3] with structural constraints. As an example, the adaptive linear prediction problem of [4, 5] (known in the literature as linear bandits) is equivalent to MAB, subject to the constraint that the mean vector $\mu$ (where $\mu_a$ is the mean of $\nu_a$) lies in the subspace spanned by the rows of the matrix whose columns are the vector-valued features associated with arms $1$ through $n$. The noisy combinatorial optimization problems of [6, 7, 8] can also be cast in this fashion. Moreover, by considering properties other than the top mean, one can use the above framework to model signal recovery and compressed sensing [1, 9], subset-selection [10], and additional variants of combinatorial optimization [11, 12, 13].

The purpose of this paper is to present new machinery to better understand the consequences of structural constraints, and of different types of objectives, on the sample complexity of adaptive learning problems. This paper presents bounds for some structured adaptive sampling problems which characterize the sample complexity in the regime where the probability of error $\delta$ is a moderately small constant (or even inverse-polynomial in the number of measurements). In contrast, prior work has addressed the sample complexity of adaptive sampling problems in the asymptotic regime where $\delta \to 0$, where such problems often admit algorithms whose asymptotic dependence on $\delta$ matches lower bounds for each ground-truth instance, even matching the exact instance-dependent leading constant [14, 15, 16]. Analogous asymptotically-sharp and instance-specific results (even for structured problems) also hold in the regret setting where the time horizon tends to infinity [17, 8, 18, 19, 20].

The upper and lower bounds in this paper demonstrate that the $\delta \to 0$ asymptotics can paint a highly misleading picture of the true sample complexity when $\delta$ is not too small. This occurs for two reasons:

1. Asymptotic characterizations of the sample complexity of adaptive estimation problems occur on a time horizon where the learner can learn an optimal measurement allocation tailored to the ground truth instance $\nu$. In the short run, however, learning favorable measurement allocations is extremely costly, and the allocation requires considerably more samples to learn than it itself would prescribe.

2. Asymptotic characterizations are governed by the complexity of discriminating the ground truth $\nu$ from any single alternative hypothesis. This neglects the sorts of multiple-hypothesis and suprema-of-empirical-process effects that are ubiquitous in high-dimensional statistics and learning theory (e.g. those reflected in Fano-style bounds).

To understand these effects, we introduce a new framework for analyzing adaptive sampling called the “Simulator”. Our approach differs from the existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one, given any limited amount of data collected up to any given time. Our framework allows us to characterize granular, instance dependent properties that any successful adaptive learning algorithm must have. In particular, these insights inspire a new, theoretically near-optimal, and practically state-of-the-art algorithm for the top-k subset selection problem. We emphasize that the Simulator framework is concerned with how an algorithm samples, rather than its final objective. Thus, we believe that the techniques in this paper can be applied more broadly to a wide class of problems in the active learning community.

## 2 Preliminaries

As alluded to in the introduction, the adaptive estimation problems in this paper can be formalized as multi-arm bandit problems, where the instances $\nu$ lie in an appropriate constraint set, called an instance class (e.g., the mean vectors $\mu$, where $\mu_a$ is the mean of $\nu_a$, lie in some specified polytope). We use the term arms to refer both to the indices $a$ and the distributions $\nu_a$ they index. The stochastic multi-arm bandit formulation has been studied extensively in the pure-exploration setting considered in this work [2, 3, 10, 21, 22, 23, 14, 15]. At each time $t$, a learner plays an action $a_t$, and observes an observation drawn i.i.d. from $\nu_{a_t}$. At some time $T$, the learner decides to end the game and return some output. Formally, let $\mathcal{F}_t$ denote the sigma-algebra generated by the actions and observations up to time $t$, and some additional randomness independent of all the samples (this represents randomization internal to the algorithm). A sequential sampling algorithm consists of

1. A sampling rule $(a_t)$, where $a_t$ is $\mathcal{F}_{t-1}$-measurable

2. A stopping time $T$ with respect to the filtration $(\mathcal{F}_t)$

3. An output rule, which is $\mathcal{F}_T$-measurable.

We let $N_a(t)$ denote the number of samples collected from arm $a$ by time $t$. In particular, $N_a(T)$ is the number of times arm $a$ is pulled by the algorithm before terminating, and $\sum_{a=1}^n N_a(T) = T$. A MAB algorithm corresponds to the case where the decision rule is a singleton arm, and, more generally, a TopK algorithm specifies a set of $k$ arms. We will use $\mathsf{Alg}$ as a variable which describes a particular algorithm, and use the notation $\mathbb{P}_{\nu,\mathsf{Alg}}$ and $\mathbb{E}_{\nu,\mathsf{Alg}}$ to denote probabilities and expectations which are taken with respect to the samples drawn from $\nu$, and the (possibly randomized) sampling, stopping, and output decisions made by $\mathsf{Alg}$. Finally, we adopt the following notion of correctness, which corresponds to the “fixed-confidence” setting in the active learning literature:

###### Definition 1.

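Fix a set of instances. We say that a MAB algorithm $\mathsf{Alg}$ is $\delta$-correct for a best-arm mapping $a^*$ over this class of instances if, for every instance $\nu$ in the class, the output of $\mathsf{Alg}$ differs from $a^*(\nu)$ with probability at most $\delta$ under $\mathbb{P}_{\nu,\mathsf{Alg}}$. We say that a TopK algorithm is $\delta$-correct for a top-$k$ mapping if the analogous guarantee holds with the top-$k$ set in place of $a^*(\nu)$.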
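The three components of a sequential sampling algorithm, together with the pull counts $N_a(T)$, can be illustrated with a toy sampler. This is only a sketch: the round-robin sampling rule, the fixed-budget stopping rule, and the name `run_round_robin` are hypothetical choices made for illustration, and a fixed budget is not $\delta$-correct in general.

```python
import random

def run_round_robin(means, pulls_per_arm, rng):
    """Toy sequential sampler on unit-variance Gaussian arms.

    Sampling rule: round-robin over arms (hypothetical choice).
    Stopping rule: stop after a fixed budget (hypothetical choice).
    Output rule: return the arm with the largest empirical mean.
    """
    n = len(means)
    counts = [0] * n   # counts[a] plays the role of N_a(t)
    sums = [0.0] * n
    t = 0
    while t < n * pulls_per_arm:            # stopping rule
        a = t % n                           # sampling rule
        sums[a] += rng.gauss(means[a], 1.0) # observation from arm a
        counts[a] += 1
        t += 1
    empirical = [sums[a] / counts[a] for a in range(n)]
    output = max(range(n), key=lambda a: empirical[a])  # output rule
    return output, counts, t

best, counts, T = run_round_robin([3.0, 0.0, 0.0], 50, random.Random(0))
print(best, counts, T)
```

The returned counts always sum to the stopping time, mirroring the identity between $T$ and the total number of pulls.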

Typically, the best arm mapping is defined as the arm with the highest mean, and the top-$k$ mapping as the arms with the $k$ largest means, which captures the notion of the arm or set of arms that yield the highest reward. When the best-arm mapping returns the highest-mean arm, and the observations are sub-Gaussian (footnote 2: formally, $X$ is $\sigma^2$-sub-Gaussian if $\mathbb{E}[e^{\lambda(X - \mathbb{E}X)}] \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$), the problem complexity for MAB is typically parameterized in terms of the “gaps” $\Delta_a = \mu_{a^*} - \mu_a$ between the means [24]. More generally, sample complexity is parametrized in terms of the $\mathrm{KL}(\nu_a, \nu_b)$, the KL divergences between the measures $\nu_a$ and $\nu_b$. For ease of exposition, we will present our high-level contributions in terms of gaps, but the body of the work will also present more general results in terms of KL divergences. Finally, our theorem statements will use $\gtrsim$ and $\lesssim$ to denote inequalities up to constant factors. In the text, we shall occasionally use $\gtrsim$ and $\lesssim$ more informally, hiding doubly-logarithmic factors in problem parameters.

## 3 Statements of Lower Bound Results

Typically, lower bounds in the bandit and adaptive sampling literature are obtained by the change-of-measure technique [24, 9, 14]. To contextualize our findings, we begin by stating the state-of-the-art change-of-measure lower bound, as it appears in [25]. For a class of instances, let $\mathrm{Alt}(\nu)$ denote the set of instances $\tilde\nu$ in the class such that $a^*(\tilde\nu) \neq a^*(\nu)$. Then:

###### Proposition 1 (Theorem 1 [14]).

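If $\mathsf{Alg}$ is $\delta$-correct for all instances in the class, then the expected number of samples $\mathsf{Alg}$ collects under $\nu$, $\mathbb{E}_{\nu,\mathsf{Alg}}[T]$, is bounded below by the solution to the following optimization problem

$$\min_{\tau \in \mathbb{R}^n_{\ge 0}} \sum_{a=1}^n \tau_a \quad \text{subject to} \quad \min_{\tilde\nu \in \mathrm{Alt}(\nu)} \sum_{a=1}^n \tau_a \,\mathrm{KL}(\nu_a, \tilde\nu_a) \ge \mathrm{kl}(\delta, 1-\delta) \tag{1}$$

where $\mathrm{kl}(\delta, 1-\delta) = \delta\log\frac{\delta}{1-\delta} + (1-\delta)\log\frac{1-\delta}{\delta}$, which scales like $\log(1/\delta)$ as $\delta \to 0$.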

The above proposition says that the expected sample complexity is lower bounded by the following non-adaptive experiment design problem: minimize the total number of samples $\sum_a \tau_a$ subject to the constraint that these samples can distinguish between a null hypothesis $\nu$, and any alternative hypothesis $\tilde\nu$ for $\tilde\nu \in \mathrm{Alt}(\nu)$, with Type-I and Type-II errors at most $\delta$. We will call the optimization problem in Equation 1 the Oracle Lower Bound, because it captures the best sampling complexity that could be attained by a powerful “oracle” who knows how to optimally sample under $\nu$.
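The right-hand side of the constraint in Equation 1 is the binary relative entropy between error levels $\delta$ and $1-\delta$. The short sketch below evaluates it and compares it against $\log(1/\delta)$; the function name `binary_kl` is an illustrative choice.

```python
import math

def binary_kl(x, y):
    """kl(x, y): KL divergence between Bernoulli(x) and Bernoulli(y), for 0 < x, y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

for delta in (1e-1, 1e-2, 1e-4, 1e-8):
    value = binary_kl(delta, 1 - delta)
    ratio = value / math.log(1 / delta)
    print(f"delta={delta:g}  kl(delta, 1-delta)={value:.3f}  ratio to log(1/delta)={ratio:.3f}")
```

The ratio approaches one as $\delta$ shrinks, while for moderate $\delta$ the quantity is only a small constant.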

Unlike the oracle, a real learner would never have access to the true instance $\nu$. Indeed, for instances with sufficient structure, Equation 1 gives a misleading view of the intrinsic difficulty of the problem. For example, consider the class of instances where $\nu_a = \mathcal{N}(\mu_a, 1)$, and $\mu$ lies in the simplex, i.e. $\mu_a \ge 0$ and $\sum_a \mu_a = 1$. If the ground truth instance $\nu$ has $\mu_{a^*} = 0.9$ for some $a^*$, then any oracle which uses the knowledge of the ground truth to construct a sampling allocation can simply put all of its samples on arm $a^*$. Indeed, the simplex constraint implies that $a^*$ is indeed the best arm of $\nu$, and that any instance $\tilde\nu$ which has a best arm other than $a^*$ must have $\tilde\mu_{a^*} \le 0.5$. Thus, for all $\tilde\nu \in \mathrm{Alt}(\nu)$, $\mathrm{KL}(\nu_{a^*}, \tilde\nu_{a^*}) \ge (0.9 - 0.5)^2/2 = 0.08$. In other words, the sampling vector

$$\tau_a = \begin{cases} (0.08)^{-1}\,\mathrm{kl}(\delta, 1-\delta) & a = a^* \\ 0 & a \neq a^* \end{cases} \tag{2}$$

is feasible for Equation 1, which means that the optimal number of samples predicted by Equation 1 is no more than $(0.08)^{-1}\,\mathrm{kl}(\delta, 1-\delta)$. But this predicted sample complexity doesn’t depend on the number of arms!
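The simplex example can be checked numerically. The sketch below assumes unit-variance Gaussian arms and a top mean of 0.9, which is the setting under which the constant 0.08 in Equation 2 arises as $(0.9-0.5)^2/2$; the function names are hypothetical.

```python
import math

def binary_kl(x, y):
    """kl(x, y): KL divergence between Bernoulli(x) and Bernoulli(y)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def simplex_oracle_budget(delta, mu_star=0.9):
    """Samples placed on the best arm by the allocation in Equation 2.

    Assumes unit-variance Gaussians, so KL(N(a,1), N(b,1)) = (a - b)^2 / 2,
    and uses that any alternative must have mean at most 1/2 on the best arm.
    """
    kl_gap = (mu_star - 0.5) ** 2 / 2.0   # equals 0.08 when mu_star = 0.9
    return binary_kl(delta, 1 - delta) / kl_gap

for delta in (0.25, 0.05, 0.001):
    print(f"delta={delta:g}  oracle budget={simplex_oracle_budget(delta):.1f}")
```

The budget is a function of $\delta$ alone: the number of arms never enters the computation.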

So how hard is the simplex really? To address this question, we prove the first lower bound in the literature which, to the authors’ knowledge (footnote 3: while this manuscript was on one of its authors’ websites, the same result was obtained independently by [26]), accurately characterizes the complexity of a strictly easier problem: when the means are known up to a permutation. Because the theorem holds when the measures are known up to a permutation, it also holds in the more general setting when the measures satisfy any permutation-invariant constraints, including when a) the means lie on the simplex, b) the means lie in an $\ell_p$ ball, or c) the vector of sorted means satisfies arbitrary constraints (e.g. weighted constraints on the sorted means [27]).

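In what follows, let $S_n$ denote the group of permutations on $n$ elements and $\pi(a)$ denote the index which $a$ is mapped to under $\pi$. For an instance $\nu$, we let $\pi(\nu)$ denote the permuted instance in which arm $\pi(a)$ has distribution $\nu_a$, and define the instance class $S_n(\nu) = \{\pi(\nu) : \pi \in S_n\}$. Moreover, we use the notation $\pi \sim S_n$ to denote that $\pi$ is drawn uniformly at random from $S_n$. With this notation, $N_{\pi(b)}(T)$ is the number of times we pull the arm indexed by $\pi(b)$, i.e. the number of samples from $\nu_b$. Likewise, $\mathbb{E}_{\pi(\nu),\mathsf{Alg}}[N_{\pi(b)}(T)]$ is the expected number of samples from $\nu_b$, since arm $\pi(b)$ of $\pi(\nu)$ has distribution $\nu_b$, not the distribution $\nu_{\pi(b)}$. The following theorem essentially says that if the instance is randomly permuted before the start of the game, no $\delta$-correct algorithm can avoid taking a substantial number of samples from $\nu_b$ for any $b \neq a^*$.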

###### Theorem 1 (Lower bounds on Permutations).

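Let $\nu$ be an instance with unique best arm $a^*$, and for $b \neq a^*$, define $\tau_b = \frac{1}{\mathrm{KL}(\nu_{a^*}, \nu_b) + \mathrm{KL}(\nu_b, \nu_{a^*})}$. If $\mathsf{Alg}$ is $\delta$-correct over $S_n(\nu)$ then

$$\mathbb{E}_{\pi \sim S_n} \mathbb{P}_{\pi(\nu),\mathsf{Alg}}\left[N_{\pi(b)}(T) > \tau_b \log(1/4\eta)\right] \ge \eta - \delta \tag{3}$$

for any $\eta \in [\delta, 1/4]$ and any $b \neq a^*$, and by Markov’s inequality

$$\mathbb{E}_{\pi \sim S_n} \mathbb{E}_{\pi(\nu),\mathsf{Alg}}[T] \ge \mathbb{E}_{\pi \sim S_n}\left[\sum_{b \neq a^*} \mathbb{E}_{\pi(\nu),\mathsf{Alg}}[N_{\pi(b)}(T)]\right] \ge \sup_{\eta \in [\delta, 1/4]} (\eta - \delta)\log(1/4\eta) \sum_{b \neq a^*} \tau_b. \tag{4}$$

In particular, if $\mathsf{Alg}$ is $\delta$-correct for a constant $\delta$ bounded away from $1/4$, then $\mathbb{E}_{\pi \sim S_n} \mathbb{E}_{\pi(\nu),\mathsf{Alg}}[T] \gtrsim \sum_{b \neq a^*} \tau_b$.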
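The passage from the tail bound (3) to the bound in expectation (4) is a direct application of Markov’s inequality to the nonnegative pull counts. As a worked step, writing $s_b = \tau_b \log(1/4\eta)$ for the threshold in (3):

```latex
\begin{align*}
\mathbb{E}_{\pi \sim S_n}\mathbb{E}_{\pi(\nu),\mathsf{Alg}}\bigl[N_{\pi(b)}(T)\bigr]
  &\ge s_b \cdot \mathbb{E}_{\pi \sim S_n}\mathbb{P}_{\pi(\nu),\mathsf{Alg}}\bigl[N_{\pi(b)}(T) > s_b\bigr]
     && \text{(Markov, since } N_{\pi(b)}(T) \ge 0\text{)} \\
  &\ge (\eta - \delta)\,\tau_b \log(1/4\eta)
     && \text{(by (3))}.
\end{align*}
% Summing over b \neq a^*, using T \ge \sum_{b \neq a^*} N_{\pi(b)}(T),
% and optimizing over \eta \in [\delta, 1/4] gives (4).
```

Because $\log(1/4\eta) \ge 0$ on the range $\eta \le 1/4$, the threshold $s_b$ is nonnegative and the first inequality is valid throughout that range.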

When the reward distributions are $\mathcal{N}(\mu_a, 1)$, $\tau_b = \Delta_b^{-2}$ (recall $\Delta_b = \mu_{a^*} - \mu_b$). In this setting, applying the oracle bound of Proposition 1 to permutations implies a lower bound of order $\max_{b \neq a^*} \Delta_b^{-2}\log(1/\delta)$. Combining this bound with Theorem 1 yields that

$$\mathbb{E}_{\pi \sim S_n} \mathbb{E}_{\pi(\nu),\mathsf{Alg}}[T] \gtrsim \max\left\{\max_{b \neq a^*} \Delta_b^{-2}\log(1/\delta),\ \sum_{b \neq a^*} \Delta_b^{-2}\right\} \tag{5}$$

For comparison, Proposition 1 alone only implies a lower bound of order $\max_{b \neq a^*} \Delta_b^{-2}\log(1/\delta)$, since an oracle who knows how to sample could place all their samples on $a^*$. Thus, for constant $\delta$, our lower bound differs from the bound in Proposition 1 by up to a factor of $n$, the number of arms. In particular, when the gaps are all on the same order, the asymptotics only paint an accurate picture of the sample complexity once $\delta$ is exponentially small in $n$.
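To make the size of this gap concrete, the following sketch evaluates the two expressions in Equation 5 with all constants dropped. The function names and the example instance (equal gaps of 0.1) are illustrative assumptions, not part of the analysis.

```python
import math

def oracle_style_bound(gaps, delta):
    """max_b gap_b^{-2} * log(1/delta): the term inherited from Proposition 1."""
    return max(g ** -2 for g in gaps) * math.log(1 / delta)

def moderate_confidence_bound(gaps, delta):
    """Right-hand side of Equation 5, ignoring constant factors."""
    return max(oracle_style_bound(gaps, delta), sum(g ** -2 for g in gaps))

for n in (11, 101, 1001):
    gaps = [0.1] * (n - 1)          # n - 1 suboptimal arms with equal gaps
    ratio = moderate_confidence_bound(gaps, 0.1) / oracle_style_bound(gaps, 0.1)
    print(f"n={n}  ratio={ratio:.1f}")
```

With equal gaps the ratio grows linearly in the number of arms for fixed $\delta$, and collapses to one only when $\log(1/\delta)$ exceeds the number of suboptimal arms.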

In fact, our lower bound is essentially unimprovable: Section A.1 provides an upper bound for the setting where the top-two means are known, whose expected sample complexity on any permutation matches the on-average complexity in Equation 5 up to constant and doubly-logarithmic factors. Together, these upper and lower bounds depict two very different regimes:

1. When $\delta$ is fixed, the lower bound on the complexity of the constrained problem essentially matches known upper bounds for the unconstrained best-arm problem [23, 22]. Thus, in this regime, imposing or removing permutation-invariant constraints does not affect the sample complexity.

2. As $\delta \to 0$, an algorithm which knows the means up to a permutation can learn to optimistically and aggressively focus its samples on the top arm, yielding an asymptotic sample complexity predicted by Proposition 1, one which is potentially far smaller than that of the unconstrained problem (footnote 4: in fact, using a track-and-stop strategy similar to [14], one could design an algorithm which matches the constant factor in Proposition 1).

These two regimes show that the Simulator and oracle lower bounds are complementary, and go after two different aspects of problem difficulty: in the second regime, the oracle lower bound characterizes the samples sufficient to verify that arm $a^*$ is the best, whereas in the first regime, the Simulator characterizes the samples needed to learn a favorable sampling allocation (footnote 5: the simulator also provides a lower bound on the tail of the number of pulls from a suboptimal arm since, by Equation 3, with probability at least $\eta - \delta$ over the random permutation, arm $b$ is pulled more than $\tau_b \log(1/4\eta)$ times. This shows that even though you can learn an oracle allocation on average, there is always a small risk of oversampling. Such effects do not appear from Proposition 1, which only controls the number of samples taken in expectation). We remark that [25] also explores the problem of learning-to-sample by establishing the implications of Proposition 1 for finite-time regret; however, their approach does not capture any effects which aren’t reflected in Proposition 1. Finally, we note that proving a lower bound for learning a favorable strategy in our setting must consider some sort of average or worst-case over the instances. Indeed, one could imagine an algorithm that starts off by pulling the first arm until it has collected enough samples to test whether the first arm is the best, and then pulling the second arm to test whether that arm is the best, and so on. If the first arm is the best, this algorithm can successfully identify it without pulling any of the others, thereby matching the oracle lower bound.

### 3.1 Sharper Multiple-Hypothesis Lower Bounds

In contrast to the oracle lower bounds, the active PAC learning literature (e.g., binary classification) leverages classical tools like Fano’s inequality with packing arguments [28, 29] and other measures of class complexity such as the disagreement coefficient [30]. Because these arguments consider multiple hypotheses simultaneously, they can capture effects which the worst-case binary-hypothesis oracle lower bounds like Equation 1 can miss, and the considerable gap between two-way and multiple tests is well-known in the passive setting [31]. Unfortunately, existing techniques which capture this multiple-hypothesis complexity lead to coarse, worst- or average-case lower bounds for adaptive problems because they rely on constructions which are either artificially symmetric, or are highly pessimistic [28, 29, 10]. Moreover, the constructions rarely shed insights on why active learning algorithms seem to avoid paying the costs for multiple hypotheses that would occur in the passive setting, e.g. the folk theorem: “active learning removes log factors” [9].

As a first step towards understanding these effects, we prove the first instance-based lower bound which sheds light on why active learning is able to effectively reduce the number of hypotheses it needs to distinguish. To start, we prove a qualitative result for a simplified problem, using a novel reduction to Fano’s inequality via the simulator:

###### Theorem 2.

Let $\mathsf{Alg}$ be $\delta$-correct for a sufficiently small constant $\delta$, and consider a game with a best arm of measure $\nu_1$ and $n-1$ arms of measure $\nu_2$. For $1 \le m \le n$, let $S_m$ denote the set of arms which $\mathsf{Alg}$ pulls at least on the order of $\mathrm{KL}(\nu_1, \nu_2)^{-1}\log(n/m)$ times. Then

$$\mathbb{P}_{\pi \sim S_n} \mathbb{P}_{\pi(\nu),\mathsf{Alg}}\left[\{\pi(1) \in S_m\} \wedge \{|S_m| \ge m\}\right] \ge \frac{3}{4} \tag{6}$$

For Gaussian rewards with unit variance, $\mathrm{KL}(\nu_1, \nu_2) = \Delta^2/2$, where $\Delta = \mu_1 - \mu_2$ is the gap between the means, so the above theorem states that, for any $m$, any correct algorithm must sample some $m$ arms, including the top arm, on the order of $\Delta^{-2}\log(n/m)$ times. Thus, the number of samples allocated by the oracle of Proposition 1 is necessarily insufficient to identify the best arm for moderate $\delta$. This is because, until sufficiently many samples have been taken, one cannot distinguish between the best arm and other arms exhibiting large statistical deviations. Looking at exponential-gap style upper bounds [23, 21], which halve the number of arms in consideration at each round, we see that our lower bound is qualitatively sharp for some algorithms (footnote 6: we believe that UCB-style algorithms exhibit this same qualitative behavior). Further, we emphasize that this set of $m$ arms which must be pulled on the order of $\Delta^{-2}\log(n/m)$ times may be random (footnote 7: in fact, for an algorithm which only samples that many arms this often, this subset of arms must be random. This is because for a fixed subset of arms, one could apply Theorem 2 to the remaining arms), depends on the random fluctuations in the samples collected, and thus cannot be determined using knowledge of the instance alone. Stated otherwise, if one sampled according to the proportions prescribed by Proposition 1, then the total number of samples one would need to collect would be suboptimal by a logarithmic factor. Thus, effective adaptive sampling should adapt its allocation to the statistical deviations in the collected data, not just the ground truth instance. We stress that the Simulator is indispensable for establishing this result, because it lets us characterize the stage-wise sampling allocation of adaptive algorithms.
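The claim that the best arm cannot be told apart from suboptimal arms with large deviations until enough samples are taken can be illustrated by a small Monte Carlo experiment. This is an illustrative sketch only: the parameters (200 arms, gap 0.5) and the function name are assumptions, and the empirical mean of $t$ unit-variance Gaussian samples is drawn directly as a Gaussian with standard deviation $1/\sqrt{t}$.

```python
import math
import random

def confusion_rate(n, gap, t, trials, rng):
    """Fraction of trials in which some suboptimal arm's empirical mean,
    after t unit-variance Gaussian samples per arm, is at least the best arm's."""
    sd = 1.0 / math.sqrt(t)   # std of an empirical mean of t samples
    confused = 0
    for _ in range(trials):
        best = rng.gauss(gap, sd)                                 # best arm, mean = gap
        rival = max(rng.gauss(0.0, sd) for _ in range(n - 1))     # n - 1 arms, mean 0
        confused += rival >= best
    return confused / trials

rng = random.Random(0)
for t in (4, 16, 64, 400):
    print(f"t={t:4d}  confusion rate={confusion_rate(200, 0.5, t, 200, rng):.2f}")
```

With few samples per arm, the largest of the suboptimal empirical means typically exceeds that of the best arm; the confusion only disappears once the per-arm sample count is on the order of $\Delta^{-2}\log n$.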

Guided by this intuition, a more sophisticated proof strategy establishes the following guarantee for MAB with Gaussian rewards (a more general result for single-parameter exponential families is stated in Theorem 5):

###### Proposition 2 (Lower Bound for Gaussian MAB).

Suppose $\nu^{(1)}$ has measures $\nu_a = \mathcal{N}(\mu_a, 1)$, with $\mu_1 > \mu_2 \ge \dots \ge \mu_n$. Then, if $\mathsf{Alg}$ is $\delta$-correct over $S_n(\nu^{(1)})$,

$$\mathbb{E}_{\pi \sim S_n} \mathbb{E}_{\pi(\nu^{(1)}),\mathsf{Alg}}\left[N_{\pi(1)}(T)\right] \gtrsim \max_{2 \le m \le n} \Delta_m^{-2}\log(m/\delta) \quad \text{where } \Delta_m = \mu_1 - \mu_m \tag{7}$$

In particular, when all the gaps are on the same order $\Delta$, then the top arm must be pulled on the order of $\Delta^{-2}\log(n/\delta)$ times. When the gaps are different, the bound trades off between a larger $\log(m/\delta)$ factor and a smaller inverse-gap-squared $\Delta_m^{-2}$ as $m$ grows. As we explain in Section D.1, this tradeoff is best understood in the sense that the algorithm is conducting an instance-dependent union bound, where the union bound places more confidence on means closer to the top. The proof itself is quite involved, and constitutes the main technical contribution of this paper. We devote Section D.1 to explaining the intuition and proof roadmap. Our argument makes use of “tilted distributions”, which arise in the Herbst argument for log-Sobolev inequalities in the concentration-of-measure literature [32]. Tiltings translate the tendency of some empirical means to deviate far above their averages (i.e. to anti-concentrate) into a precise information-theoretic statement that they “look like” draws from the top arm. To the best of our knowledge, this constitutes the first use of tiltings to establish information-theoretic lower bounds, and we believe this technique may have broader use.

### 3.2 Instance-Specific Lower bound for TopK

Proposition 2 readily implies the first instance-specific lower bound for TopK. The idea is that, if I can identify an arm $j$ as one of the top $k$ arms, then, in particular, I can identify arm $j$ as the best arm among $\{j\} \cup \{k+1, \dots, n\}$. Similarly, if I can reject arm $j$ as not part of the top $k$, then I can identify it as the “worst” arm among $\{1, \dots, k\} \cup \{j\}$. Section E formally proves the following lower bound using this reduction:

###### Proposition 3 (Lower Bound for Gaussian TopK).

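Suppose $\nu^{(1)}$ has measures $\nu_a = \mathcal{N}(\mu_a, 1)$, with $\mu_1 \ge \dots \ge \mu_k > \mu_{k+1} \ge \dots \ge \mu_n$. Then, if $\mathsf{Alg}$ is $\delta$-correct over $S_n(\nu^{(1)})$,

$$\mathbb{E}_{\pi \sim S_n} \mathbb{E}_{\pi(\nu^{(1)}),\mathsf{Alg}}\left[N_{\pi(j)}(T)\right] \gtrsim \begin{cases} \max_{m > k}\, (\mu_j - \mu_m)^{-2}\log((m-k+1)/\delta) & j \le k \\ \max_{m \le k}\, (\mu_j - \mu_m)^{-2}\log((k+2-m)/\delta) & j > k \end{cases} \tag{8}$$

By taking $m = k+1$ and $m = k$ in the first and second lines of (8), respectively, our result recovers the gap-dependent bounds of [10] and [16]. Moreover, when the gaps are on the same order, we recover the worst-case lower bound from [10].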
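The right-hand side of Equation 8 is straightforward to evaluate for a given vector of means. The sketch below does so with constants dropped; the function name is hypothetical, arms are 1-indexed to match the statement, and the means are assumed to be sorted in decreasing order with $\mu_k > \mu_{k+1}$.

```python
import math

def topk_pull_lower_bound(mu, k, j, delta):
    """Right-hand side of Equation 8 for arm j (1-indexed), ignoring constants.

    Assumes mu is sorted in decreasing order and mu[k-1] > mu[k].
    """
    n = len(mu)
    if j <= k:
        # arm j must be separated from arms k+1, ..., m for every m > k
        return max((mu[j - 1] - mu[m - 1]) ** -2 * math.log((m - k + 1) / delta)
                   for m in range(k + 1, n + 1))
    # arm j must be separated from arms m, ..., k for every m <= k
    return max((mu[j - 1] - mu[m - 1]) ** -2 * math.log((k + 2 - m) / delta)
               for m in range(1, k + 1))

mu = [1.0, 1.0, 0.0, 0.0, 0.0]
for j in range(1, len(mu) + 1):
    print(j, round(topk_pull_lower_bound(mu, k=2, j=j, delta=0.1), 3))
```

On this instance with equal gaps, the top arms pick up a factor depending on the number of bottom arms, and the bottom arms a factor depending on $k$, which is the asymmetry the LUCB++ confidence levels are designed to exploit.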

#### 3.2.1 Comparison with [26]

After a manuscript of the present work was posted on one of its authors’ websites, [26] presented an alternative proof of Proposition 3, also by a reduction to MAB. Instead of tiltings, their argument handles different gaps by a series of careful reductions to a symmetric problem, to which they apply Proposition 1. As in this paper, their proof hinges on a “simulation” argument which compares the behavior of an algorithm on an instance to a run of an algorithm where the reward distributions change mid-game. This seems to suggest that our simulator framework is in some sense a natural tool for these sorts of lower bounds.

While our works prove many of the same results, our papers differ considerably in emphasis. The goal of this work is to explain why algorithms must incur the sample complexities that they do, rather than just sharpen logarithmic factors. In this vein, we establish Theorem 2, which has no analogue in [26]. Moreover, we believe that the proof of Proposition 2 based on tiltings is a step towards novel lower bounds for more sophisticated problems by translating intuitions about large deviations into precise, information-theoretic statements. Further still, our Theorem 1 (and Proposition 7 in the appendix) imply lower bounds on the tail-deviations of the number of times suboptimal arms need to be sampled in constrained problems (see footnote 5).

## 4 LUCB++

The previous section showed that for TopK in the worst case, the bottom $n-k$ arms must be pulled in proportion to $\log(k/\delta)$ times while the top $k$ arms must be pulled in proportion to $\log((n-k)/\delta)$ times. Inspired by these new insights, the original LUCB algorithm of [10], and the analysis of [22] for the MAB setting, in this section we propose a novel algorithm for TopK: LUCB++. The LUCB++ algorithm proceeds exactly like that of [10], the only difference being the definition of the confidence bounds used in the algorithm.

At each round $t$, let $\hat\mu_a$ denote the empirical mean of all the samples from arm $a$ collected so far. Let $U(t, \delta)$ be an anytime confidence bound based on the law of the iterated logarithm (see Kaufmann et al. [33, Theorem 8] for explicit constants). Finally, we let $\mathrm{TOP}_t$ denote the set of the $k$ arms with the largest empirical means. The algorithm is outlined in Figure 1, and satisfies the following guarantee:

###### Theorem 3.

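Suppose that each $\nu_a$ is sub-Gaussian. Then, for any $\delta \in (0, 1)$, the LUCB++ algorithm is $\delta$-correct, and the stopping time $T$ satisfies

$$T \le \sum_{i=1}^{k} c\,\Delta_i^{-2}\log\left(\frac{(n-k)\log(\Delta_i^{-2})}{\delta}\right) + \sum_{j=k+1}^{n} c\,\Delta_j^{-2}\log\left(\frac{k\log(\Delta_j^{-2})}{\delta}\right)$$

with probability at least $1-\delta$, where $c$ is a universal constant.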
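A minimal sketch of an LUCB-style loop with asymmetric confidence levels is given below. Several choices here are assumptions made for illustration rather than the algorithm as specified in Figure 1: the confidence levels $\delta/(2(n-k))$ for arms in the empirical top set and $\delta/(2k)$ for the rest are chosen to mirror the $\log((n-k)/\delta)$ and $\log(k/\delta)$ terms in Theorem 3, the radius `conf_radius` is a crude union-bound-over-time stand-in for the law-of-the-iterated-logarithm bound of [33], and the arms are simulated as unit-variance Gaussians.

```python
import math
import random

def conf_radius(t, delta):
    """Crude anytime radius for unit-variance noise (stand-in for the LIL bound)."""
    return math.sqrt(2.0 * math.log(4.0 * t * t / delta) / t)

def lucb_pp_sketch(means, k, delta, rng, max_pulls=200_000):
    """LUCB-style top-k identification with asymmetric confidence levels.

    Requires 1 <= k < len(means). Returns (estimated top-k set, total pulls).
    """
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n

    def pull(a):
        sums[a] += rng.gauss(means[a], 1.0)
        counts[a] += 1

    for a in range(n):               # one initial pull per arm
        pull(a)

    while True:
        emp = [sums[a] / counts[a] for a in range(n)]
        order = sorted(range(n), key=lambda a: emp[a], reverse=True)
        top, rest = order[:k], order[k:]
        # lower bounds for the empirical top set, upper bounds for the rest
        lower = {a: emp[a] - conf_radius(counts[a], delta / (2 * (n - k))) for a in top}
        upper = {a: emp[a] + conf_radius(counts[a], delta / (2 * k)) for a in rest}
        h = min(top, key=lambda a: lower[a])    # weakest arm inside the top set
        l = max(rest, key=lambda a: upper[a])   # strongest challenger outside
        if lower[h] > upper[l] or sum(counts) >= max_pulls:
            return set(top), sum(counts)
        pull(h)
        pull(l)

top_set, pulls = lucb_pp_sketch([2.0, 2.0, 0.0, 0.0, 0.0, 0.0], k=2, delta=0.05,
                                rng=random.Random(0))
print(sorted(top_set), pulls)
```

The `max_pulls` cap is a safety guard for the sketch and plays no role in the analysis; swapping in a tighter anytime bound only changes `conf_radius`.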

By Proposition 3 we recognize that when the gaps are all the same, the sample complexity of the LUCB++ algorithm is unimprovable up to doubly-logarithmic factors. This is the first practical algorithm that removes extraneous log factors on the sub-optimal arms [10, 12]. However, it is known that not all instances must incur a multiplicative $\log(n-k)$ on the top arms [12, 26]. Indeed, when $k = 1$ this problem is just the best-arm identification problem and the sample complexity of the above theorem, ignoring doubly-logarithmic factors, scales like $\Delta_1^{-2}\log(n/\delta) + \sum_{j \ge 2} \Delta_j^{-2}\log(1/\delta)$. But there exist algorithms for this particular best-arm setting whose sample complexity is just $\sum_i \Delta_i^{-2}\log(1/\delta)$, exposing a case where Theorem 3 is loose [21, 22, 23, 12]. In general, this additional factor is unnecessary on the top arms in certain regimes, but when the number of arms is large, such cases are unlikely to be encountered in practice.

While this manuscript was in preparation, [26] proposed a TopK algorithm which satisfies stronger theoretical guarantees, essentially matching the lower bound in Proposition 3. However, their algorithm (and the matroid-bandit algorithm of [12]) relies on exponential-gap elimination, making it unsuitable for practical use (footnote 8: while exponential-gap elimination algorithms might have the correct dependence on problem parameters, their constant factors in the sample complexity are incredibly high, because they rely on median elimination as a subroutine; see [22] for discussion). Furthermore, our improved LUCB++ confidence intervals can be reformulated for different KL-divergences, leading to tighter bounds for non-Gaussian rewards such as Bernoullis. Moreover, we can “plug in” our LUCB++ confidence intervals into other LUCB-style algorithms, sharpening their log factors. For example, one could amend the confidence intervals in the CLUCB algorithm of [11] for combinatorial bandits, which would yield slight improvements for arbitrary decision classes, and near-optimal bounds for matroid classes considered in [12].

To demonstrate the effectiveness of our new algorithm we compare to a number of natural baselines: LUCB of [10], a version of the oracle strategy of [14], and uniform sampling. All three use the stopping condition of [10], which is met when the confidence bounds of the empirical top $k$ arms (footnote 9: to avoid any effects due to the particular form of the anytime confidence bound used, we use the same finite-time law-of-the-iterated-logarithm confidence bound used in [33, Theorem 8] for all of the algorithms) do not overlap with those of the bottom $n-k$, employing a union bound over all arms. Consider a TopK instance constructed with unit-variance Gaussian arms, in which the top $k$ arms share one mean and all remaining arms share a lower mean. Table 1 presents the average number of samples taken by the algorithms before reaching the stopping criterion, relative to the number of samples taken by LUCB++. For these means, the oracle strategy pulls each arm a number of times proportional to a weight which takes one value on the top $k$ arms and another on the remaining arms. Note that the uniform strategy is identical to the oracle strategy, but with the same weight for all arms.

## 5 Lower Bounds via The Simulator

As alluded to in the introduction, our lower bounds treat adaptive sampling decisions made by the algorithm as hypothesis tests between different instances $\nu$. Using a type of gadget we call a Simulator, we reduce lower bounds on adaptive sampling strategies to a family of lower bounds on different, possibly data-dependent and time-specific non-adaptive hypothesis testing problems.

The Simulator acts as an adversarial channel intermediating between the algorithm $\mathsf{Alg}$, and i.i.d. samples from the true instance $\nu$. Given an instance $\nu$, let $\mathrm{Tr}$ denote a random transcript of an infinite sequence of samples drawn i.i.d. from $\nu$, where the $s$-th entry for arm $a$ is drawn from $\nu_a$. We can think of sequential sampling algorithms as operating by interacting with the transcript, where the sample at round $t$ is obtained by reading the $N_{a_t}(t)$-th sample of arm $a_t$ off from $\mathrm{Tr}$ (recall that $N_a(t)$ is the number of times arm $a$ has been pulled at the end of round $t$). With this notation, we define a simulator as follows:

###### Definition 2 (Simulator).

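A simulator $\mathrm{Sim}$ is a map which sends $\mathrm{Tr}$ to a modified transcript $\widehat{\mathrm{Tr}}$, which $\mathsf{Alg}$ will interact with instead of $\mathrm{Tr}$ (Figure 1). We allow this mapping to depend on the ground truth $\nu$ and some internal randomness.

Equivalently, $\mathrm{Sim}(\nu)$ is a measure on a random process $\widehat{\mathrm{Tr}}$, which, unlike $\mathrm{Tr}$, does not require the samples to be i.i.d. (or even independent). Hence, we use the shorthand $\mathrm{Sim}(\nu)$ to refer to the measure corresponding to $\widehat{\mathrm{Tr}}$, and let $\mathbb{P}_{\mathrm{Sim}(\nu),\mathsf{Alg}}$ denote the probability taken with respect to $\mathrm{Sim}$’s modified transcript $\widehat{\mathrm{Tr}}$, and the internal randomness in $\mathsf{Alg}$ and $\mathrm{Sim}$. With this notation, the quantities $\mathrm{KL}_{\mathsf{Alg}}(\mathrm{Sim}(\nu^{(1)}), \mathrm{Sim}(\nu^{(2)}))$ and $\mathrm{TV}_{\mathsf{Alg}}(\mathrm{Sim}(\nu^{(1)}), \mathrm{Sim}(\nu^{(2)}))$ are well defined as the KL and TV divergences of the random process $\widehat{\mathrm{Tr}}$ under the measures $\mathrm{Sim}(\nu^{(1)})$ and $\mathrm{Sim}(\nu^{(2)})$.

Note that, in general, the KL divergence between the unmodified transcripts is infinite if $\nu^{(1)}_a \neq \nu^{(2)}_a$ for some $a$, since $\nu^{(1)}$ (resp. $\nu^{(2)}$) governs an infinite i.i.d. sequence. However, in this paper we will always design our simulator so that $\mathrm{KL}_{\mathsf{Alg}}(\mathrm{Sim}(\nu^{(1)}), \mathrm{Sim}(\nu^{(2)}))$ is finite, and is in fact quite small. The hope is that if the modified transcript conveys too little information to distinguish between $\nu^{(1)}$ and $\nu^{(2)}$, then $\mathsf{Alg}$ will have to behave similarly on both simulated instances. Hence, we will show that if $\mathsf{Alg}$ behaves differently on two instances $\nu^{(1)}$ and $\nu^{(2)}$, yet $\mathrm{Sim}$ limits information between them, then $\mathsf{Alg}$’s behavior must differ quite a bit under $\mathrm{Sim}(\nu^{(i)})$ versus $\nu^{(i)}$, for either $i = 1$ or $i = 2$. Formally, we will show that $\mathsf{Alg}$ will have to “break” the simulator, in the following sense: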

###### Definition 3 (Breaking).

Given measure , algorithm , and simulator , we say that is a truthful event under if, for all events ,

 PSim(ν),Alg[E∧W]=Pν,Alg[E∧W] (10)

Moreover, will say that is breaks on under . Recall that is the -algebra generated by , and the actions/samples collected by up to time .

The key insight is that, whenever doesn’t break (i.e. on a truthful event ), a run of on can be perfectly simulated by running on . But if, fudges in a way that drastically limits information about , this means that can be simulated using little information about , which will contradict information theoretic lower bounds. This suggests the following recipe for proving lower bounds:

1) State a claim you wish to falsify over a class of instances (e.g., the best arm is not pulled more than times, with some probability ).
2) Phrase your claim as candidate truthful events on each instance (e.g. where is the best arm of ).
3) Construct a simulator such that is truthful on , but (or ) is small for alternative pairs . For example, if the truthful event is , then the simulator should only modify samples .
4) Apply an information-theoretic lower bound (e.g., Proposition 4 to come) to show that the simulator breaks (e.g. is large for at least one , or for a drawn uniformly from ).

## 6 Applying the Simulator to Permutations

In what follows, we show how to use the simulator to prove Theorem 1. At a high level, our lower bound follows from considering pairs of instances where the best arm is swapped-out for a sub-optimal arm, and ultimately averaging over those pairs. On each such pair, we apply a version of Le Cam’s method to the simulator setup (proof in Section B.1):

###### Proposition 4 (Simulator Le Cam).

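Let $\nu^{(1)}$ and $\nu^{(2)}$ be two measures, $\mathrm{Sim}$ be a simulator, and let $W_1, W_2$ be two truthful events under $\mathrm{Sim}(\nu^{(i)})$ for $i = 1, 2$. Then, for any algorithm $\mathrm{Alg}$,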

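$$\sum_{i=1}^{2} \mathbb{P}_{\nu^{(i)},\mathrm{Alg}}(W_i^c) \ge \sup_{E \in \mathcal{F}_T} \left| \mathbb{P}_{\nu^{(1)},\mathrm{Alg}}(E) - \mathbb{P}_{\nu^{(2)},\mathrm{Alg}}(E) \right| - Q\left(\mathrm{KL}_{\mathrm{Alg}}\left(\mathrm{Sim}(\nu^{(1)}), \mathrm{Sim}(\nu^{(2)})\right)\right) \tag{11}$$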

where . The bound also holds with replaced by .

Note that Equation 11 decouples the behavior of the algorithm under $\nu^{(1)}$ and $\nu^{(2)}$ from the information limited by the simulator. This proposition makes formal the intuition from Section 5 that an algorithm which behaves differently on two distinct instances must “break” a simulator which severely limits the information between them.

### 6.1 Lower Bounds on 1-Arm Swaps

The key step in proving Theorem 1 is to establish a simple lower bound that holds for pairs of instances obtained by “swapping” the best arm.

###### Proposition 5.

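Let $\nu$ be an instance with unique best arm $a^*$. For $b \ne a^*$, let $\nu^{(b,a^*)}$ be the instance obtained by swapping $a^*$ and $b$, namely $\nu^{(b,a^*)}_{a^*} = \nu_b$, $\nu^{(b,a^*)}_{b} = \nu_{a^*}$, and $\nu^{(b,a^*)}_{a} = \nu_a$ for $a \notin \{a^*, b\}$. Then, if $\mathrm{Alg}$ is $\delta$-correct, one has that for any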

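$$\frac{1}{2}\left\{ \mathbb{P}_{\nu,\mathrm{Alg}}\left[N_b(T) > \tau(\eta)\right] + \mathbb{P}_{\nu^{(b,a^*)},\mathrm{Alg}}\left[N_{a^*}(T) > \tau(\eta)\right] \right\} \ge \eta - \delta \tag{12}$$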

where

This bound implies that, if an instance is drawn uniformly from $\{\nu, \nu^{(b,a^*)}\}$, then any $\delta$-correct algorithm has to pull the suboptimal arm, namely the distribution $\nu_b$, at least $\tau(\eta)$ times on average, with probability $\eta - \delta$. Proving this proposition requires choosing an appropriate simulator. To this end, fix a $\tau$, and let $\mathrm{Sim}$ map the samples $X_{[s,a]}$ to $\hat{X}_{[s,a]}$ such that,

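$$\mathrm{Sim}: \hat{X}_{[s,a]} \mapsfrom \begin{cases} X_{[s,a]} & a \ne a^*, b \\ X_{[s,a]} & a \in \{a^*, b\},\ s \le \tau \\ \overset{\mathrm{iid}}{\sim} \nu_{a^*} & a \in \{a^*, b\},\ s > \tau \end{cases} \tag{13}$$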

where, for $a \in \{a^*, b\}$ and $s > \tau$, the notation $\overset{\mathrm{iid}}{\sim} \nu_{a^*}$ means that the samples are taken independently of everything else (in particular, independent of and ), using internal randomness . We emphasize that depends crucially on , , and .
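The three cases of Equation 13 translate directly into code. The sketch below is illustrative only: it assumes a transcript truncated to a finite horizon and stored as one list per arm, and the names `simulate` and `draw_best` (a callable producing one fresh draw from the law of arm $a^*$) are hypothetical, as is the Gaussian example data.

```python
import random

def simulate(tr, a_star, b, tau, draw_best, seed=1):
    """Return a modified copy of tr following the three cases of Equation 13."""
    rng = random.Random(seed)          # stands in for the simulator's internal randomness
    sim = [row[:] for row in tr]       # case 1: arms other than a_star and b are untouched
    for a in {a_star, b}:
        # case 2: the first tau samples (0-based indices 0..tau-1) are kept as they are
        # case 3: every later sample is replaced by a fresh draw from arm a_star's law
        for s in range(tau, len(sim[a])):
            sim[a][s] = draw_best(rng)
    return sim

# Example data: unit-variance Gaussian arms (an assumption), arm 0 is the best arm.
means = [1.0, 0.0, 0.0]
data_rng = random.Random(0)
tr = [[data_rng.gauss(mu, 1.0) for _ in range(8)] for mu in means]
sim = simulate(tr, a_star=0, b=1, tau=3,
               draw_best=lambda rng: rng.gauss(means[0], 1.0))
```

Past the first `tau` entries, arms `a_star` and `b` carry samples from the same law in the simulated transcript, so only the first `tau` entries of those two arms can reveal which of them is truly the best.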

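Note that the only entries of the modified transcript whose distributions differ under $\mathrm{Sim}(\nu)$ and $\mathrm{Sim}(\nu^{(b,a^*)})$ are just the first $\tau$ entries from arms $a^*$ and $b$, namely $\{\hat{X}_{[s,a]} : a \in \{a^*, b\},\ s \le \tau\}$. Hence, by a data-processing inequality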

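$$\mathrm{KL}_{\mathrm{Alg}}\left(\mathrm{Sim}(\nu), \mathrm{Sim}(\nu^{(b,a^*)})\right) \le \tau \left\{ \mathrm{KL}(\nu_{a^*}, \nu_b) + \mathrm{KL}(\nu_b, \nu_{a^*}) \right\} \tag{14}$$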
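To get a feel for the size of the right-hand side of Equation 14, consider a worked special case. The following assumes, purely for illustration and not as part of the statement above, that the two arms are unit-variance Gaussians, $\nu_{a^*} = \mathcal{N}(\mu_{a^*}, 1)$ and $\nu_b = \mathcal{N}(\mu_b, 1)$, with gap $\Delta = \mu_{a^*} - \mu_b$:

$$\mathrm{KL}(\nu_{a^*}, \nu_b) = \mathrm{KL}(\nu_b, \nu_{a^*}) = \frac{\Delta^2}{2}, \qquad \text{so} \qquad \tau \left\{ \mathrm{KL}(\nu_{a^*}, \nu_b) + \mathrm{KL}(\nu_b, \nu_{a^*}) \right\} = \tau \Delta^2 .$$

Under this assumption the two simulated instances remain hard to distinguish as long as $\tau$ is small compared with $1/\Delta^2$.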

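Using the notation of Proposition 4, let $\nu^{(1)} = \nu$, $\nu^{(2)} = \nu^{(b,a^*)}$, and let $W_1 = \{N_b(T) \le \tau\}$ and $W_2 = \{N_{a^*}(T) \le \tau\}$ (i.e., under $\nu$ and $\nu^{(b,a^*)}$, you sample the suboptimal arm no more than $\tau$ times). Proposition 5 now follows immediately from Proposition 4, elementary manipulations, and the following claim: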

###### Claim 1.

For and defined above, is truthful on under .

###### Proof of Claim 1.

The samples and have the same distribution under and for and , by construction. Moreover, the samples and for are also i.i.d. draws from , so they have the same distribution as the samples and under and respectively. Thus, the only samples whose distributions are changed by the simulator are the samples under and under , respectively, which never accesses under and , respectively. ∎
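The argument in this proof can be checked numerically with a coupling. The sketch below is not the paper's construction verbatim: it assumes unit-variance Gaussian arms, uses a deterministic UCB-style rule as a stand-in for the algorithm, and uses a coupled variant of the simulator that, under the instance where arm `a_star` is best, leaves that arm's late samples alone (they already have the right law) and overwrites only arm `b`'s samples past `tau`. All function names are hypothetical.

```python
import math
import random

def make_transcript(means, horizon, rng):
    """tr[a][s] is the (s + 1)-th sample of arm a, drawn from N(means[a], 1)."""
    return [[rng.gauss(mu, 1.0) for _ in range(horizon)] for mu in means]

def ucb_run(tr, T):
    """Deterministic UCB-style sampling rule that reads its samples from tr."""
    n = len(tr)
    counts, sums, history = [0] * n, [0.0] * n, []
    for t in range(T):
        if t < n:
            a = t                                   # pull every arm once first
        else:
            a = max(range(n), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t + 1) / counts[i]))
        x = tr[a][counts[a]]                        # next unread sample of arm a
        counts[a] += 1
        sums[a] += x
        history.append((a, x))
    return history, counts

def coupled_simulator(tr, b, tau, mean_best, rng):
    """Overwrite arm b's samples past tau with draws from the best arm's law."""
    sim = [row[:] for row in tr]
    for s in range(tau, len(sim[b])):
        sim[b][s] = rng.gauss(mean_best, 1.0)
    return sim

def check_truthfulness(trials=200, T=60, tau=30):
    """Count runs where N_b(T) <= tau; on each, true and simulated runs must agree."""
    means, a_star, b = [1.0, 0.0, 0.0], 0, 1
    on_w = 0
    for seed in range(trials):
        rng = random.Random(seed)
        tr = make_transcript(means, T, rng)
        sim = coupled_simulator(tr, b, tau, means[a_star], rng)
        hist_true, counts_true = ucb_run(tr, T)
        hist_sim, _ = ucb_run(sim, T)
        if counts_true[b] <= tau:                   # the candidate truthful event
            on_w += 1
            assert hist_true == hist_sim            # the run never read a modified sample
    return on_w
```

Calling `check_truthfulness()` raises no assertion error: whenever arm `b` is pulled at most `tau` times, the sampling rule only ever reads entries the simulator left untouched, so the simulated run coincides with the true one sample for sample.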

### 6.2 Proving Theorem 1 from Proposition 5

Theorem 1 can be proven directly using the machinery established thus far. However, we will introduce a reduction to “symmetric algorithms” which will both expedite the proof of Theorem 1 and come in handy for additional bounds as well. For a transcript , let denote the transcript , and denote probability taken w.r.t. the randomness of acting on the fixed (deterministic) transcript .

###### Definition 4 (Symmetric Algorithm).

We say that an algorithm is symmetric if the distribution of its sampling sequence and output commutes with permutations. That is, for any permutation , transcript , sequence of actions , and output ,

 PAlg,Tr[(a1,a2,…,aT,ˆS)=(A1,A2,…,