Function optimization over structured domains is a basic sequential decision making problem. A well-known formulation of this problem is Probably Approximately Correct (PAC) best arm identification in multi-armed bandits , in which a learner is given a set of arms with unknown (and unrelated) means. The learner must sequentially test arms and output, as soon as possible with high confidence, a near-optimal arm (where optimality is defined in terms of the largest mean).
Often, the arms (decisions) and their associated rewards, possess structural relationships, allowing for more efficient learning of the rewards and transfer of learnt information, e.g., two ‘close enough’ arms may have similar mean rewards. One of the best-known examples of structured decision spaces is the linear bandit, whose arms are vectors (points) in . The reward or function value of an arm is an unknown linear function of its vector representation, and the goal is to find an arm with maximum reward in the shortest possible time by measuring arms’ rewards sequentially with noise. This framework models an array of structured online linear optimization problems including adaptive routing , smooth function optimization over graphs , subset selection  and, in the nonparametric setting, black box optimization in smooth function spaces , among others.
Although no-regret online learning for linear bandits is a well-understood problem (see  and references therein), the PAC-sample complexity of best arm identification in this model has not received significant attention until recently . The state of the art here is the work of Fiez et al. , who give an algorithm with optimal (instance-dependent) PAC sample complexity. However, a closer look indicates that the algorithm assumes repeated oracle access to a minimax optimization problem111This is, in fact, a plug-in version of a minimax optimization problem representing an information-theoretic sample complexity lower bound for the problem.; it is not clear, from a performance standpoint, in what manner (and to what accuracy) this optimization problem should be practically solved222 For its experiments, the paper implements a (approximate) minimax oracle using the Frank-Wolfe algorithm and a heuristic stopping rule, but this is not rigorously justifiable for nonsmooth optimization, see Sec.
For its experiments, the paper implements a (approximate) minimax oracle using the Frank-Wolfe algorithm and a heuristic stopping rule, but this is not rigorously justifiable for nonsmooth optimization, see Sec.4. to enjoy the claimed sample complexity. Hence, the question of how to design an explicit algorithm with optimal PAC sample complexity for best arm identification in linear bandits has remained open.
In this paper, we resolve this question affirmatively by giving an explicit linear bandit best-arm identification algorithm with instance-optimal PAC sample complexity and, more importantly, a clearly quantified computational effort. We achieve this goal using new techniques: the main ingredient in the proposed algorithm is a game-theoretic interpretation of the minimax optimization problem that is at the heart of the instance-based sample complexity lower bound. This in turn yields an adaptive, sample-based approach using carefully constructed confidence sets for the unknown parameter . The adaptive sampling strategy is driven by the interaction of 2 no-regret online learning subroutines that attempt to solve the minimax problem approximately, obviating the worry of i) solving the optimal minimax allocation to a suitable precision and ii) making an integer sampling allocation from it by rounding, which occur in the approach of Fiez et al . We note that the seeds of this game-theoretic approach were laid by the recent work of Degenne et al.  for the simple (i.e., unstructured) multiarmed bandit problem. However, our work demonstrates a novel extension of their methodology to solve best-arm learning in structured multi-armed bandits for the first time to the best of our knowledge.
1.1 Related Work
The PAC best arm identification problem for linearly parameterised bandits is first studied in , in which an adaptive algorithm is given with a sample complexity guarantee involving a hardness term () which in general renders the sample complexity suboptimal. Tao et al 
take the path of constructing new estimators instead of ordinary least squares, using which they give an algorithm achieving the familiar sum-of-inverse-gaps sample complexity known for standard bandits; this is, however, not optimal for general linear bandits. The LinGapE algorithm is an attempt at solving best arm identification with a fully adaptive strategy, but its sample complexity in general is not instance-optimal and can additionally scale with the total number of arms, in addition to the extra dimension-dependence known to be incurred by self-normalized inequality-based confidence set constructions . Zaki et al  design a fully adaptive algorithm based on the Lower-Upper Confidence Bound (LUCB) principle  with limited guarantees for 2 or 3 armed settings. Fiez et al  give a phased elimination algorithm achieving the ideal information-theoretic sample complexity but with minimax oracle access and an additional rounding operation; we detail an explicit arm-playing strategy that eliminates both these steps, in the same high-level template. In a separate vein, game-theoretic techniques to solve minimax problems have been in existence for over a couple of decades ; only recently have they been combined with optimism to give a powerful framework to solve adaptive hypothesis testing problems .
Table 1 compares the sample complexities of various best arm identification algorithms in the literature.
2 Problem Statement and Notation
We study the problem of best arm identification in linear bandits with the arm set , where each arm333In general, the number of arms can be much larger than the ambient dimension, i.e., is a vector in . We will interchangeably use and the set , whenever the context is clear. In every round the agent chooses an arm , and receives a reward , where is assumed to be a fixed but unknown vector, and is zero-mean noise assumed to be conditionally subgaussian, i.e., . We denote by the distribution of the reward obtained by pulling arm i.e., whenever
Given two probability distributionsover , denotes the KL Divergence of and (assuming ). Given , let , where we assume that is such that the argmax is unique.
A learning algorithm for the best arm identification problem comprises the following rules: (1) a sampling rule, which determines based on the past play of arms and observations, which arm to pull next, (2) a stopping rule, which controls the end of sampling phase and is a function of the past observations and reward, and (3) a recommendation rule, which, when the algorithm stops, offers a guess for the best arm. The goal of a learning algorithm is: Given an error probability , identify (guess) with probability by pulling as few (in an expected sense) arms as possible. Any algorithm that (1) stops with probability 1 and (2) returns upon stopping with probability at least is said to be -Probably Approximately Correct (-PAC). For clarity of exposition, we distinguish the above linear bandit setting from what we term the unstructured bandit setting, wherein and the canonical basis vectors (the former setting generalizes the latter). The (expected) number of samples consumed by an algorithm in determining the optimal arm in any bandit setting (not necessarily the linear setting) is called its sample complexity.
In the rest of the paper, we will assume that . Given a positive definite matrix , we denote by , the matrix norm induced by . For any , we define to be the gap between the largest expected reward and the expected reward of (suboptimal) arm . Let . We denote as the closed ball with center and radius . For any measurable space we define to be the set of all probability measures on . is big-Oh notation that suppresses logarithmic dependence on problem parameters. For the benefit of the reader, we provide a glossary of commonly used symbols in Sec. A in the Appendix.
|LinGapE444Here is a complicated term defined in terms of a solution to an offline optimization problem in . ||Fully adaptive, sub-optimal in general.|
|ALBA ||Fully adaptive, sub-optimal in general (see )|
|PELEG (this paper)||
Note: is a term depending only on the geometry of the arm set.
3 Overview of Algorithmic Techniques
The phased elimination approach: Fiez et al . We first note that a lower bound on the sample complexity of any - PAC algorithm for the canonical (i.e., unstructured) bandit setting  was generalized by Fiez et al  to the linear bandit setting, assuming
to be standard normal random variables. This result states that any-PAC algorithm in the linear setting must satisfy , where and , where . The bound suggests a natural -PAC strategy, namely, to sample arms according to the distribution
In fact, as [8, Sec. 2.2] explains, using the Ordinary Least Squares (OLS) estimator for and sampling arm exactly times with ensures with probability Unfortunately, this sampling distribution cannot directly be implemented since is unknown.
Fiez et al circumvent this difficulty by designing a nontrivial strategy (RAGE) that attempts to mimic the optimal allocation in phases. Specifically, in phase , it tries to eliminate arms that are about -suboptimal (in their gaps), by solving (1) with a plugin estimate of . The resulting fractional allocation, passed through a separate discrete rounding procedure, gives an integer pull count distribution which ensures that all surviving arms’ mean differences are estimated with high precision and confidence.
Though direct and appealing, this phased elimination strategy is based crucially on solving minimax problems of the form (1). Though the inner (max) function is convex as a function of on the probability simplex (see e.g., Lemma 1 in ), it is non-smooth, and it is not made explicit how, and to what extent, it must be solved in . Fortunately, we are able to circumvent this obstacle by using ideas from games between no-regret online learners with optimism, as introduced by the work of Degenne et al  for unstructured bandits.
From Pure-exploration Games to -PAC Algorithms: Degenne et al . We briefly explain some of the insights in  that we leverage to design an explicit linear bandit--PAC algorithm with low computational complexity. For a fixed weight parameter consider the two-player, zero-sum Pure-exploration Game in which the player (or column player) plays an arm while the (or row) player chooses an alternative bandit model such that then receives a payoff of from For a given define and the mixed strategy that attains With moving first and playing a mixed strategy the value of the game becomes . In the unstructured bandit setting, to match the sample complexity lower bound, any algorithm must essentially sample arm at rate where is the number of times Arm has been sampled up to time . This helps explain why any -PAC algorithm implicitly needs to solve the Pure Exploration Game
We crucially employ no-regret online learners to solve the Pure Exploration Game for linear bandits. More precisely, no-regret learning with the well-known Exponential Weights rule/Negative-entropy mirror descent algorithm  on one hand, and a best-response convex programming subroutine on the other, provides a direct sampling strategy that obviates the need for separate allocation optimization and rounding for sampling as in . One crucial advantage of our approach (inspired by ) is that we only use a best response oracle to solve for , which gives us a computational edge over  who employ the computationally more costly max-min oracle to solve or, its linear bandit equivalent,
4 Algorithm and Sample Complexity Bound
Our algorithm, that we call “Phased Elimination Linear Exploration Game” (PELEG), is presented in detail as Algorithm 1. PELEG proceeds in phases with each phase consisting of multiple rounds, maintaining a set of active arms for testing during Phase . An OLS estimate of is used to estimate the mean reward of active arms and, at the end of phase , every active arm with a plausible reward more than below that of some arm in is eliminated. Suppose . If we can ensure that in every Phase then PELEG will terminate within phases, where This statement is proved in Corollary 2 in the Supplementary Material.
If we knew then we could sample arms according to the optimal distribution in (1). However, since all we now have at our disposal is the knowledge that we can instead construct a sampling distribution by solving the surrogate , and sampling each arm in sufficiently often to produce a small enough confidence set. This is precisely what RAGE  does. However solving this optimization is, as mentioned in Sec. 3, computationally expensive and RAGE repeatedly accesses a minmax oracle to do this. Note that in simulating this algorithm, the authors implement an approximate oracle using the Frank-Wolfe method to solve the outer optimization over [8, Sec. F]. The operation, however, renders the optimization objective non-smooth, and it is well-known that the Frank-Wolfe iteration can fail with even simple non-differentiable objectives (see e.g., ). We, therefore, deviate from RAGE at this point by employing three novel techniques, the first two motivated by ideas in .
We formulate the above minimax problem as a two player, zero-sum game. We solve the game sequentially, converging to its Nash equilibrium by invoking the use of the EXP-WTS algorithm . Specifically, in each round
in a phase, PELEG supplies EXP-WTS with an appropriate loss functionand receives the requisite sampling distribution (lines 15 & 18 of the algorithm). This is then fed to the second no-regret learner – a best response subroutine – that finds the ‘most confusing’ plausible model to focus next on (line 16). This is a minimization of a quadratic function over a union of finitely many convex sets (halfspaces intersecting a ball) which can be transparently implemented in polynomial time.
Once the sampling distribution is found, there still remains the problem of actually sampling according to it. Given a distribution approximating it by sampling or times can lead to too few (resp. many) samples. Other naive sampling strategies are, for the same reason, unusable. While  invokes a specialized rounding algorithm for this purpose, we opt for a more efficient tracking procedure (line 19): In each Round of Phase , we sample Arm , where is the number of times Arm has been sampled up to time In Lem. 3, we show that this procedure is efficient, i.e.,
Finally, in each phase
, we need to sample arms often enough to (i) construct confidence intervals of size at mostaround (ii) ensure that and (iii) that In Sec. E, we prove a Key Lemma (whose argument is discussed in Sec. 5) to show that our novel Phase Stopping Criterion ensures this with high probability.
It is worth remarking that naively trying to adapt the strategy of Degenne et al  to the linear bandit structure yields a suboptimal (multiplicative ) dependence in the sample complexity, thus we adopt the phased elimination template of Fiez et al . We also find, interestingly, that this phased structure eliminates the need to use more complex, self-tuning online learners like AdaHedge  in favour of the simpler Exponential Weights (Hedge).
The main theoretical result of this paper is the following performance guarantee.
Theorem 1 (Sample Complexity of Algorithm 1).
With probability at least , PELEG returns the optimal arm after rounds, with
As explained in Sec. 3, the optimal (oracle) allocation requires samples. Comparing this with (2), we see that our algorithm is instance optimal up to logarithmic factors, barring the term, so the optimality holds whenever . Recall that is the smallest eigenvalue of
is the smallest eigenvalue of. is reasonable to expect given that in most applications, feature vectors (i.e., ) are chosen to represent the feature space well which translates to a high value of .
The main computational effort in Algorithm 1 is in checking the phase stopping criterion (line 13) and implementing the best-response model learner (line 16), both of which are explicit quadratic programs. Note also that bounding the losses submitted to EXP-WTS to within is required only for the regret analysis of EXP-WTS to go through. In practice, as the simulation results show, PELEG works without this and, in fact, permits efficient solution of Step 16 in the algorithm, further reducing computational complexity.
5 Sketch of Sample Complexity Analysis
At a high level the proof of Theorem 1 involves two main parts: (1) a correctness argument for the central while loop that eliminates arms, and (2) a bound for its length, which, when added across all phases, gives the overall sample complexity bound.
1. Ensuring progress (arm elimination) in each phase. At the heart of the analysis is the following result which guarantees that upon termination of the central while loop, the uncertainty in estimating all differences of means among the surviving (i.e., non-eliminated) arms remains bounded.
Lemma 1 (Key Lemma).
After each phase ,
Proof sketch. Phase ends at time when the ellipsoid , with center and shape according to the arms played in the phase so far, becomes small enough to avoid intersecting the half spaces , for all surviving arms , within the ball (Step 13 of the algorithm) which is required to keep loss functions bounded for no-regret properties.
Suppose, for the sake of simplicity, that only two arms and are present when phase starts. Figure 0(a) depicts a possible situation when the phase ends. and with are halfspaces, denoted in gray, that intersect the ball in the areas colored red. In this situation, the ellipsoid , shaded in blue, has just broken away from the red regions in the interior of the ball. Because its extent in the direction
lies within the strip between the two hyperplanes bounding, it can be shown (see proof of lemma in appendix) that is small enough to not exceed roughly .
The more challenging situation is when the ellipsoid breaks away from the red regions by breaching the boundary of the ball , as in the green ellipsoid in Figure 0(b). The while loop terminating at this time would not satisfy the objective of controlling to within , since the extent of the ellipsoid in the direction is larger than the gap between the halfspaces and . A key idea we introduce here is to shrink the hyperplane gap (i.e., ) by a factor (precisely ) which is represented by the min operation in Step 7. In doing this we bring the halfspaces closer, and then insist that the ellipsoid break away from these new halfspaces within the ball. This more stringent requirement guarantees that when the loop terminates, the extent of the final ellipsoid (shaded in blue) stays within the original, unshrunk, gap ensuring .
2. Bounding the number of arm pulls in a phase. The main bound on the length of the central while loop is the following result.
Lemma 2 (Phase length bound).
Let . There exists such that , the length of any phase is bounded as :
To prove this we use the no-regret property of both the best-response and the EXP-WTS learner (the full proof appears in the appendix). A key novelty here is the introduction of the ball as a technical device to control the -norm radius of the final stopped ellipsoid (inequality in the proof) when used with the basic tracking rule over arms introduced by Degenne et al .
We numerically evaluate PELEG, against the algorithms -static (), LUCB (), ALBA (), LinGapE () and RAGE (), for 3 common benchmark settings. The oracle lower bound is also calculated. Note: In our implementation, we ignore the term in the phase stopping criterion; this has the advantage of making the criterion check-able in closed form. We simulate independent, observation noise in each round. All results reported are averaged over 50 trials. We also empirically observe a success rate in identifying the best arm, although a confidence value of is passed in all cases.
Setting 1: Standard bandit. The arm set is the standard basis in 5 dimensions. The unknown parameter is set to , where , with swept across . As noted in , for close to , -static’s essentially uniform allocation is optimal, since we have to estimate all directions equally accurately. However, PELEG performs better (Fig. 2(a)) due to being able to eliminate suboptimal arms earlier instead of uniformly across all arms. Fig. 2(b) compares PELEG and RAGE in the smaller window , where PELEG is found to be competitive (and often better than) RAGE.
Setting 2: Unit sphere. The arms set comprises of vectors sampled uniformly from the surface of the unit sphere . We pick the two closest arms, say and , and then set for , making the best arm. We simulate all algorithms over dimensions . This setting was first introduced in , and PELEG is uniformly competitive with the other algorithms (Fig. 2(c)).
Setting 3: Standard bandit with a confounding arm . We instantiate canonical basis arms and an additional arm , , with so that the first arm is the best arm. By setting , the th arm becomes the closest competitor. Here, the performance critically depends on how much an agent focuses on comparing arm and arm . LinGapE performs very well in this setting, and PELEG and RAGE are competitive with it (Fig. 2(d)).
7 Concluding Remarks
We have proposed a new, explicitly described algorithm for best arm identification in linear bandits, using tools from game theory and no-regret learning to solve minimax games. Several interesting directions remain unexplored. Removing the less-than-ideal dependence on the featureof the arm geometry and the extra logarithmic dependence on are perhaps the most interesting technical questions. It is also of great interest to see if a more direct game-theoretic strategy, along the lines of 
, exists for structured bandit problems, as also whether one can extend this machinery to solve for best policies in more general Markov Decision Processes.
This work is largely theoretical in its objective. However, the problem that it attempts to lay sound theoretical foundations for is a widely encountered search problem based on features in machine learning. As a result, we anticipate that its implications may carry over to domains that involve continuous, feature-based learning, such as attribute-based recommendation systems, adaptive sensing and robotics applications. Proper care must be taken in such cases to ensure that recommendations or decisions from the algorithms set out in this work do not transgress considerations of safety and bias. While we do not address such concerns explicitly in this work, they are important in the design and operation of automated systems that continually interact with human users.
- Abbasi-Yadkori et al.  Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved Algorithms for Linear Stochastic Bandits. In Proc. NIPS, pages 2312–2320, 2011.
- Awerbuch and Kleinberg  B. Awerbuch and R. Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
- Cesa-Bianchi and Lugosi  N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- Cheung and Li  E. Cheung and Y. Li. Solving separable nonsmooth problems using frank-wolfe with uniform affine approximations. In IJCAI, pages 2035–2041, 2018.
- de Rooij et al.  S. de Rooij, T. van Erven, P. Grunwald, and W. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.
- Degenne et al.  R. Degenne, W. M. Koolen, and P. Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
Even-Dar et al. 
E. Even-Dar, S. Mannor, and Y. Mansour.
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.J. Mach. Learn. Res., 7:1079–1105, Dec. 2006.
- Fiez et al.  T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676, 2019.
- Freund and Schapire  Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
- Garivier and Kaufmann [Jun. 2016] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Conference On Learning Theory, pages 998–1027, Jun. 2016.
- Kalyanakrishnan et al.  S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, 2012.
- Kaufmann et al.  E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, Jan. 2016.
- Kuroki et al.  Y. Kuroki, L. Xu, A. Miyauchi, J. Honda, and M. Sugiyama. Polynomial-time algorithms for combinatorial pure exploration with full-bandit feedback. arXiv preprint arXiv:1902.10582, 2019.
- Lattimore and Szepesvári  T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Liyuan Xu 
M. S. Liyuan Xu, Junya Honda.
Fully adaptive algorithm for pure exploration in linear bandits.
International Conference on Artificial Intelligence and Statistics, 2017.
- Shalev-Shwartz et al.  S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194, 2011.
- Soare et al.  M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27, pages 828–836, 2014.
- Srinivas et al.  N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In In International Conference on Machine Learning, 2010.
- Tao et al.  C. Tao, S. Blanco, and Y. Zhou. Best arm identification in linear bandits with linear dimension dependency. volume 80 of Proceedings of Machine Learning Research, pages 4877–4886. PMLR, 2018.
- Valko et al.  M. Valko, R. Munos, B. Kveton, and T. Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46–54, 2014.
- Zaki et al.  M. Zaki, A. Mohan, and A. Gopalan. Towards optimal and efficient best arm identification in linear bandits. arXiv preprint arXiv:1911.01695, 2019.
Appendix A Glossary of symbols
the EXP-WTS algorithm, used to compute the mixed strategy of the player in each round of PELEG.
the index of the best arm, i.e., .
the closed ball of radius in , centered at .
is the union of all hyperplanes
dimension of space in which the feature vectors reside.
maximum allowable probability of erroneous arm selection (a.k.a confidence parameter).
, is the confidence ellipsoid with center , shaped by and
number of feature vectors.
the length of Phase
rewards from Arm are all drawn IID from
, the set of all probability measures on some given set
fixed but unknown vector in that parameterizes the means of i.e., the mean of is
number of times Arm has been sampled up to Round of PELEG.
OLS estimate of at the end of Phase of PELEG.
the design matrix in Round of Phase
the design matrix formed by sampling arms
, the feature set.
the set of features that survive Phase of PELEG.
Appendix B Technical lemmas
b.1 Tracking lemma
The subroutine recommends a distribution in every round over the set of arms. In order to play an arm from this distribution we use a “tracking” rule, which helps the number of arm pulls to stay close to the cumulative sum for each arm .
Lemma 3 ( tracks ).
In any phase , if for , the following strategy for pulling the arms is used.
then, for all and for all , .
We first show the upper-bound. We need to show that the inequality holds for all arms. First, let . We will use induction on .
Base case: At , .
Let, the induction hold for all . We will show that the inequality holds for . If , then .
Next, let . We note that by definition of , we have
Here, the inequality (*) follows because of the following fact:
for any sequence of positive numbers and ,
Consequently, . Rearranging completes the proof of the right hand side.
For the lower bound inequality, we observe that for any ,
Here, the inequality follows from the the upper-bound on . ∎
b.1.1 Details of (Exp-Wts)
We employ the EXP-WTS algorithm to recommend to the MAX player, the arm to be played in round . At the start of every phase , an EXP-WTS subroutine is instantiated afresh, with initial weight vectors to be for each of the experts. The experts are taken to be standard unit vectors with at the position, . The EXP-WTS subroutine recommends an exponentially-weighted probability distribution over the number of arms, depending upon the weights on each expert. The loss function supplied to update the weights of each expert, is indicated in Step 18 of Algorithm 1.
In any phase , at any round , has a regret bounded as
The proof involves a simple modification of the proof of the regret analysis of the EXP-WTS algorithm (see for example, ), with loss scaled by followed by the well-known doubling trick. ∎
Appendix C Proof of Key Lemma
Let , for ease of notation. The phase stopping criterion is
Note that the set depends on the value that takes in phase . Depending on the value of , we divide the analysis into the following two cases.
Case 1. .
In this case . For any phase , and , let us define the ellipsoid . The phase stopping rule at round is equivalent to :