This paper is concerned with the problem of optimising an unknown linear function over a finite domain when given the ability to sequentially test and observe noisy function values at domain points of our choice. In the language of online learning, this is the problem of best arm identification in linearly parameterised bandits; in classical statistics, it is essentially the problem of adaptive, sequential composite hypothesis testing where each hypothesis corresponds to one domain point being optimal. From the point of view of causal inference, it can be interpreted as the problem of learning the best (i.e., most rewarding) intervention from among a set of parameterised interventions available at hand with respect to an observed variable (Lattimore et al., 2016, Sen et al., 2016), with the key difference in this work being that the causal response of the variable to an intervention is modelled as being linear in the intervention value. The linear structure endows the model with complex but exploitable structure, in that it makes possible inference about the utility (function value) of an intervention (bandit arm) by using observations from other, correlated interventions, akin to what happens in standard (batch) prediction with linear regression.
In the linear bandit setting, each arm or action is associated with a fixed, known feature vector $x \in \mathbb{R}^d$, and the expected reward obtained by choosing to pull the arm with feature vector $x$ is $x^\top \theta^*$, where $\theta^* \in \mathbb{R}^d$ is a fixed but unknown vector. We specifically consider the probably-approximately-correct (PAC) objective of the learner (agent) declaring a guess for the identity of the optimal arm, after it has made an internally determined (and potentially random) number of sequential plays of arms, which is required to be correct with at least a given probability: the fixed confidence best arm identification goal (Even-Dar et al., 2006). In this regard, our focus is on both the statistical and computational efficiency of adaptive arm-sampling strategies, i.e., designing strategies (a) whose number of plays is as close as possible to the quantifiable information-theoretic limit on sample complexity across all strategies, and (b) which can determine in a computationally lightweight manner the next arm to play at each adaptive round.
Broadly, there are two different approaches towards solving such pure exploration problems: (i) uniform sampling-elimination based and (ii) adaptive sampling-UCB based. The algorithm of Tao et al. (2018) is based on the former approach while those of Soare et al. (2014) and Xu et al. (2017) are based on the latter, adaptive-sampling idea, but both have sample complexity guarantees that do not depend finely on the problem instance and linear structure, and thus are worst-case optimal at best.
The XY-static algorithm of Soare et al. (2014) is a static algorithm which fixes the schedule of arm plays before collecting any observations. Hence, it is not able to adapt towards pulling arms which may be more “informative” for identifying the best arm for the given instance, and can consequently only be worst-case optimal. The LinGapE algorithm of Xu et al. (2017) is a fully adaptive algorithm which performs well experimentally. However, it requires solving optimization problems at the start, one for every pair of arms, which is computationally inefficient when the number of arms is large. Finally, the X-ElimTil-p algorithm of Tao et al. (2018) is an elimination-based algorithm, and though its sample complexity scales only linearly with the dimension, the number of arms it needs to sample in each round is already far from optimal even for the case of the canonical (unstructured) multi-armed bandit (MAB) having the standard basis vectors as the arms. This can again only be worst-case optimal. A summary of the sample complexity bounds of the above algorithms appears in Table 1.
A very recent departure from this worst-case sample complexity dependence is the work of Fiez et al. (2019), which presents a provably instance-optimal best arm identification algorithm for linear (and more generally transductive) bandits. However, implementation of this algorithm requires computing an arguably costly rounding procedure to determine (in phases) a schedule of arms to play, as well as solving a minimax optimization problem which may be computationally very expensive. (In fact, these aspects of their algorithm have prevented us from successfully implementing and testing it.)
Our contributions and organization. In contrast with existing work, we aim to take a qualitatively different route towards the design of linear best arm identification, by drawing inspiration from the upper confidence bound principle, which is known to give sample-optimal performance for canonical MAB (Kalyanakrishnan et al., 2012, Jamieson et al., 2014). In this conceptually simple and elegant approach, in each decision round the learner constructs a statistically plausible confidence set for the underlying bandit instance (the weight vector in our linear setting) based on past observations, and then plays the arm that best appears to reduce the uncertainty about the optimal linear arm.
We generalise the Lower Upper Confidence Bound (LUCB) algorithm of Kalyanakrishnan et al. (2012) to the linear bandit setting. To achieve this, we introduce a new geometric “maximum overlap” principle as a basis for the learner to identify which arm is most informative to play at any given round. This results in a fully data-dependent arm selection strategy which we call Generalized-LUCB (GLUCB) (Section 3). We then proceed to rigorously analyse the sample complexity of GLUCB for certain specialized (yet instructive) cases in Section 4, and finally compare its empirical performance with other state-of-the-art methods in Section 5.
As a comment on execution times compared to the other algorithms proposed for this problem, our proposed algorithm GLUCB improves significantly on the time complexity of LinGapE (Xu et al., 2017) and X-ElimTil-p (Tao et al., 2018), as it does not require solving any offline optimization problems.
2 Problem Statement and Notation
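We study the problem of best arm identification in linear multi-armed bandits (LMABs) with the arm set $\mathcal{X} = \{x_1, \ldots, x_K\}$, where $K$ is finite but possibly large. We will interchangeably use $\mathcal{X}$ and the set $[K] = \{1, \ldots, K\}$ whenever the context is clear. Each arm is a vector in $\mathbb{R}^d$. The quantity $d$ will, henceforth, be called the ambient dimension. At every round $t$, the agent chooses an arm $x_t \in \mathcal{X}$ and receives a reward $y_t = x_t^\top \theta^* + \eta_t$, where $\theta^* \in \mathbb{R}^d$ is assumed to be a fixed but unknown vector and $\eta_t$ is zero-mean noise assumed to be conditionally $R$-sub-Gaussian, i.e., $\mathbb{E}[e^{\lambda \eta_t} \mid x_1, \ldots, x_t, \eta_1, \ldots, \eta_{t-1}] \le \exp(\lambda^2 R^2 / 2)$ for all $\lambda \in \mathbb{R}$. Let $x^* = \arg\max_{x \in \mathcal{X}} x^\top \theta^*$. The goal of the agent is, given an error probability $\delta$, to identify $x^*$ with probability at least $1 - \delta$ by pulling as few arms as possible (in the literature, this is known as the “fixed-$\delta$” regime; Kaufmann et al., 2016). Henceforth, we will call this the LMAB (linear multi-armed bandit) problem. When restricted to the case where $K = d$ and $\mathcal{X}$ is the standard ordered basis $\{e_1, \ldots, e_d\}$, the problem reduces to the standard MAB (SMAB) problem studied, for instance, in Kalyanakrishnan et al. (2012).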
In the rest of the paper, we will assume that and that the agent has information of some upper bound on say, . Let be a positive definite matrix, then we denote by , the matrix norm induced by . Let for any , be the gap between the largest expected reward and the expected reward for arm . Denote by , the smallest reward gap.
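The observation model and the reward gaps described above can be illustrated with a minimal simulation sketch. It assumes Gaussian noise (one particular sub-Gaussian choice), and the names `pull`, `best_arm`, `gaps` and `noise_std` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def pull(arm, theta_star, rng, noise_std=1.0):
    """Return one noisy reward x^T theta* + eta for the chosen arm.

    Gaussian noise is an assumption made for this sketch.
    """
    return float(arm @ theta_star + noise_std * rng.normal())

def best_arm(arms, theta):
    """Index of the arm with the largest expected reward under theta."""
    return int(np.argmax(arms @ theta))

def gaps(arms, theta):
    """Gap between the largest expected reward and that of each arm."""
    means = arms @ theta
    return means.max() - means

# Usage: the unstructured case, with the standard basis as the arm set.
arms = np.eye(3)
theta_star = np.array([0.5, 0.2, 0.1])
rng = np.random.default_rng(0)
reward = pull(arms[best_arm(arms, theta_star)], theta_star, rng)
```

The smallest positive entry of `gaps(arms, theta_star)` plays the role of the smallest reward gap.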
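Table 1: Sample complexity bounds of the algorithms discussed above.
|Algorithm|Sample complexity|
|---|---|
|XY-static (Soare et al., 2014)| |
|LinGapE (Xu et al., 2017)| |
|X-ElimTil-p (Tao et al., 2018)| |
|RAGE (Fiez et al., 2019)|Instance-dependent lower bound (up to log factors)|
Note: the LinGapE bound involves a complicated term defined in terms of a solution to an offline optimization problem in Xu et al. (2017).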
3 The GLUCB Algorithm
This section is organized as follows. We begin with a description of the ingredients required to construct GLUCB, including “MaxOverlap.” Thereafter, we formally describe the GLUCB algorithm. Finally, we show how GLUCB is a generalization of LUCB.
To begin with, note that any algorithm for the best arm identification problem requires the following ingredients:
a stopping rule: which decides when the agent must stop sampling arms, and is a function of past observations, arms chosen and rewards only,
a sampling rule: which determines, based on the arms played and rewards observed hitherto, which arm to pull next (clearly, this rule is invoked only if the stopping rule decides not to stop); and
a recommendation rule: which, when the stopping rule decides to stop, chooses the index of the arm that is to be reported as the best.
Each of these steps will now be developed in detail and combined to give the full GLUCB algorithm. Towards this we first introduce some technical desiderata.
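Let $x_1, \ldots, x_t$ be a sequence of arms played until time $t$ by any adaptive strategy (i.e., a strategy which chooses to play an arm depending on the past arm pulls and their corresponding observations) and let $y_1, \ldots, y_t$ be the received rewards. The (regularized) least squares estimate of $\theta^*$ at time $t$ is given by $\hat{\theta}_t = V_t^{-1} b_t$, where $V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top$ and $b_t = \sum_{s=1}^{t} x_s y_s$ for a regularization parameter $\lambda > 0$.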
By standard results on least squares confidence sets for adaptive sampling (Abbasi-Yadkori et al., 2011b), it can be shown that, with high probability, the unknown parameter lies in a confidence ellipsoid centred at the least squares estimate (recall that the noise is assumed to be sub-Gaussian).
Notice that the ellipsoid is time-indexed, since, as more arms are chosen, the estimate changes and so does the ellipsoid. In the sequel, we sometimes use an abbreviated notation for the ellipsoid in the interest of space.
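The estimate and the confidence ellipsoid can be sketched numerically as follows. This is a minimal sketch, not the paper's exact construction: the radius uses the self-normalized bound of Abbasi-Yadkori et al. in its commonly quoted form, and the names `lam` (regularizer), `R` (sub-Gaussian parameter) and `S` (assumed bound on the norm of the unknown parameter) are assumptions made for illustration.

```python
import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Regularized least squares: returns (theta_hat, V) with V = lam*I + X^T X."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def confidence_radius(V, lam, delta, R=1.0, S=1.0):
    """Radius R*sqrt(2*log(det(V)^(1/2) * det(lam*I)^(-1/2) / delta)) + sqrt(lam)*S."""
    d = V.shape[0]
    _, logdet_V = np.linalg.slogdet(V)
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(lam)
    return R * np.sqrt(2.0 * (log_ratio + np.log(1.0 / delta))) + np.sqrt(lam) * S

def in_ellipsoid(theta, theta_hat, V, radius):
    """True if the V-weighted distance between theta and theta_hat is within the radius."""
    diff = theta - theta_hat
    return float(np.sqrt(diff @ V @ diff)) <= radius
```

Because `V` grows as more arms are played, the ellipsoid shrinks along the directions that have been sampled most.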
We also define a set , where and are vectors in . Next, for any , define , as the cone of parameters such that, if , then is the optimal arm. Clearly partition the entire parameter space , modulo the degenerate regions where more than one arm is optimal. Furthermore, let be the index of the arm that currently appears to be the best.
3.1 Ingredients of the GLUCB algorithm
Following the intuition in (Soare et al., 2014, Sec. 3), we observe that a good choice for a stopping rule could be to stop the algorithm when the confidence ellipsoid is completely contained within one of the cones. Therefore, we wish to design an algorithm which minimizes the overlap of the current confidence ellipsoid with every cone which currently seems to be suboptimal, i.e., all the cones other than the current home cone of the least squares estimate. That way, the algorithm can quickly insert the ellipsoid completely into one of the cones, and because the ellipsoid contains the unknown parameter with high probability, so does that cone, since the ellipsoid is now a subset of it. This also means that upon stopping, the arm associated with that cone will be the arm recommended.
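The containment test in this stopping idea can be checked in closed form: the minimum of a linear function $y^\top\theta$ over an ellipsoid centred at the estimate equals $y^\top\hat\theta$ minus the radius times the norm of $y$ weighted by the inverse design matrix. The sketch below uses that standard fact; the function name and the arguments `V` and `radius` are assumptions for illustration, and this is not claimed to be the exact form of the stopping rule of GLUCB.

```python
import numpy as np

def ellipsoid_in_cone(arms, i, theta_hat, V, radius):
    """Check whether {theta : ||theta - theta_hat||_V <= radius} lies in the cone of arm i.

    The ellipsoid is inside the cone iff, for every other arm j, the smallest
    value of (x_i - x_j)^T theta over the ellipsoid is non-negative.
    """
    V_inv = np.linalg.inv(V)
    for j in range(len(arms)):
        if j == i:
            continue
        y = arms[i] - arms[j]
        width = radius * np.sqrt(y @ V_inv @ y)  # half-width of the interval for y^T theta
        if y @ theta_hat - width < 0:
            return False
    return True
```

With few samples the width term dominates and the test fails; as the design matrix grows, the width shrinks and the ellipsoid eventually fits inside a single cone.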
Definition 1 (MaxOverlap).
The MaxOverlap of set on set is defined to be the maximum distance of set from the boundary of another set .
Here denotes the closure of the set and its topological boundary (Rudin (1964)). Hence, at time , our algorithm GLUCB is defined as sampling the arm
The following result shows that the MaxOverlap-based arm sampling rule reduces to a concrete prescription for the linear bandit setting. (Due to space constraints, proof details are omitted and can be found in the Appendix.)
At time step , define the arm
We are now ready to describe the ingredients of GLUCB. At every time step define the “Advantage” of arm as
Stopping rule: The algorithm stops when the “Advantage” defined above becomes non-positive for every arm other than the current best arm
Sampling rule: Play the arm which minimizes the current max Advantage: In case of a tie, the agent selects an arm uniformly randomly.
Recommendation rule: Once the algorithm stops, the current best arm is recommended as the guess for the best arm.
Note that the GLUCB algorithm (Algorithm 1) reduces to the well-known LUCB algorithm of Kalyanakrishnan et al. (2012) for the SMAB problem. If we consider the case when , and the arms being the standard basis , we see that the arm in algorithm 1 corresponds to which is what LUCB would suggest. Indeed, when , . However, with we obtain, . Also, it is easy to check the stopping criterion also reduces to that of LUCB. Hence, Algorithm 1, when applied to the unstructured case, plays the current best and the closest arm simultaneously every time till the algorithm stops.
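For reference, the unstructured special case can be sketched as a plain LUCB-style loop: in each round the current best arm and its strongest challenger are both played, and the loop stops once their confidence intervals separate. This is a minimal sketch under stated assumptions: the confidence width below is one generic anytime choice, not the exact expression of Kalyanakrishnan et al. (2012), and the function name, Gaussian noise and `max_rounds` guard are hypothetical.

```python
import math
import random

def lucb(means, delta=0.05, noise_std=1.0, seed=0, max_rounds=100000):
    """LUCB-style best arm identification for an unstructured bandit (sketch)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k

    def sample(i):
        sums[i] += rng.gauss(means[i], noise_std)
        counts[i] += 1

    for i in range(k):  # play every arm once to initialise the estimates
        sample(i)
    t = k
    best = 0
    for _ in range(max_rounds):
        mu = [sums[i] / counts[i] for i in range(k)]
        # Assumed confidence width (a generic union-bound style choice).
        width = [noise_std * math.sqrt(2.0 * math.log(4.0 * k * t * t / delta) / counts[i])
                 for i in range(k)]
        best = max(range(k), key=lambda i: mu[i])
        challenger = max((i for i in range(k) if i != best),
                         key=lambda i: mu[i] + width[i])
        # Stop when the lower bound of the best arm clears the challenger's upper bound.
        if mu[best] - width[best] >= mu[challenger] + width[challenger]:
            return best, t
        sample(best)
        sample(challenger)
        t += 2
    return best, t
```

For example, `lucb([1.0, 0.0, 0.0])` returns the recommended arm index together with the number of samples used.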
We now provide some preliminary theoretical results regarding the sample complexity performance of GLUCB.
4 Analysis of GLUCB
The following result proves the correctness of GLUCB.
For any arbitrary sampling strategy, Algorithm 1 returns the optimal arm upon stopping with probability at least $1 - \delta$.
We will now analyze the sample complexity of GLUCB when and . For this we first present the following useful result on the convexity of a certain norm-based function on the probability simplex.
For any , and , where and , the function is convex in
4.1 Analysis of GLUCB for Linear MAB with two arms
Let for which the arm set is . Let and with . For this simple case, it is clear that the set We aim to analyze the sample complexity of GLUCB by tracking the possible sample paths of (which turns out to be tractable in this setting). Let be the number of times arm has been pulled till time . Then, playing GLUCB guarantees that,
In any round ,
The proof of the result relies on the following observations.
there is a tie.
The lemma tells us that for any , if then there is a tie. Next, we show that whenever an arm (say arm 1 w.l.o.g) is played for times while arm 2 for times we will be forced to play arm 2.
In fact, for the two-arm case, the sample complexity of GLUCB is optimal. The following ‘potential function’-based result formally establishes this fact.
Let us consider two algorithms and , where is GLUCB and is any other algorithm. Let for We can now state
On the other hand, a detailed analysis of the information-theoretic lower bound on sample complexity, e.g., Kaufmann et al. (2016) or Fiez et al. (2019), yields the following result where is the optimal vector of arm frequencies in the min-max optimisation problem of the lower bound (termed in Fiez et al. (2019)).
[Lower bound for ]. For , with any two arms and such ,
The (expected) sample complexity of GLUCB for the 2-arm setting is at most , where , where is the probability simplex over 2 arms and represents the optimal arm.
The quantity is the usual information theoretic lower bound on best arm identification sample complexity (Fiez et al., 2019), with the sample complexity of GLUCB being only away from it; the extra factor arises because of weaker concentration bounds for adaptive strategies.
4.2 Linear MAB with three arms
This section deals with a representative example of the linear bandit. Let , and the arm set Let . This setup is particularly interesting when is close to 0. An algorithm which is optimal for standard MAB will quickly discard arm 2, and would continue to sample arms 1 and 3 until stopping. However, this is not the optimal strategy, since pulling arm 2 gives valuable information about . We will see that this is what GLUCB does.
Compared to algorithms designed for best arm identification in standard MAB, GLUCB identifies the structure (if any) present in the arms and tries to exploit it. For the particular case just described, Arm 3 is always dominated by the other two arms, i.e., we will see in the key Lemma 5 that, in order to minimize the uncertainty in any direction, the reduction obtained by pulling Arm 3 is always dominated by that for some other arm.
If GLUCB run on the above problem instance and is its stopping time, then,
The proof of this result relies on the following key lemma.
If , then Arm 3 is never played.
Next, the following lemma shows an upper-bound on the time taken by GLUCB to discard arm 2 as or . We show this by bounding the number of samples required by GLUCB such that .
With probability , for all , where , Arm 3
Finally we bound the number of samples needed by GLUCB to stop once the set has frozen.
The number of samples needed for GLUCB to stop once in steady state is upper bounded by
The term inside the max in the theorem statement, is small and can be absorbed into the leading terms. Hence, the sample complexity of GLUCB can be written as .
A crucial observation here is that the geometry of the problem enters the sample complexity (in terms of ), which. since is small, reduces the sample complexity compared to that of a standard MAB algorithm running on the instance.
We will now see that this is indeed the optimal strategy.
4.2.1 Lower-bound for the three-arm case
By (Fiez et al., 2019, Theorem 1), the expected sample complexity of any PAC best arm identification algorithm for LMAB is lower bounded as:
where . By solving the above optimization problem, we have
For , and , the expected sample complexity is lower bounded as
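To make the lower-bound optimization concrete, the following sketch evaluates, for a given vector of arm frequencies `w`, the worst-case ratio between the squared uncertainty in the direction separating the best arm from a competitor and the squared gap, and then searches over candidate frequency vectors. The objective is written in the form commonly associated with Fiez et al. (2019); that form, the random search, and all function names are assumptions for illustration, and a unique best arm is assumed.

```python
import numpy as np

def design_matrix(arms, w):
    """A(w) = sum_i w_i x_i x_i^T."""
    return (arms * w[:, None]).T @ arms

def worst_case_ratio(arms, theta, w):
    """max over suboptimal arms x of ||x* - x||^2_{A(w)^{-1}} / gap(x)^2."""
    means = arms @ theta
    star = int(np.argmax(means))
    A_inv = np.linalg.pinv(design_matrix(arms, w))
    ratios = []
    for j in range(len(arms)):
        if j == star:
            continue
        y = arms[star] - arms[j]
        gap = means[star] - means[j]
        ratios.append(float(y @ A_inv @ y) / gap ** 2)
    return max(ratios)

def approx_rho(arms, theta, n_candidates=2000, seed=0):
    """Crude random search over the simplex; returns (best value, best weights)."""
    rng = np.random.default_rng(seed)
    k = len(arms)
    candidates = np.vstack([np.full(k, 1.0 / k),
                            rng.dirichlet(np.ones(k), size=n_candidates)])
    values = [worst_case_ratio(arms, theta, w) for w in candidates]
    j = int(np.argmin(values))
    return values[j], candidates[j]
```

For two orthogonal unit arms with unit gap, the search recovers the uniform allocation. A convex solver would be the natural replacement for the random search in larger instances.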
5 Experiments
In this section, we compare the performance of GLUCB with XY-static (Soare et al., 2014), LUCB (Kalyanakrishnan et al., 2012), LinGapE (Xu et al., 2017) and X-ElimTil-p (Tao et al., 2018) through experiments in three synthetic settings and simulations based on real data. For LinGapE, we implement the version of the algorithm which has been analyzed in their paper. We implement X-ElimTil-p with $p = 0$, using the setting mentioned in Tao et al. (2018).
5.1 Experiments based on synthetic data
Throughout, we assume that the noise is independent across rounds. The results reported are averaged over 100 trials under each setting. We report the average number of samples to stop in each case, for each algorithm. The empirical probability of error in each case was found to be 0.
Dataset 1: This is the setting introduced by Soare et al. (2014) for linear bandits. We set up the linear bandit problem with $d+1$ arms, where the features are the canonical basis vectors together with an additional arm, with parameters chosen as in Soare et al. (2014) so that the first arm is the best arm and the $(d+1)$th arm is the most ambiguous arm. We test by varying the dimension $d$. With $d = 2$, this setup resembles the case we analyzed in Section 4.2.
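A minimal sketch of how an instance of this kind can be constructed is given below. The specific choices are assumptions taken from the usual description of the Soare et al. (2014) benchmark, not values confirmed by the text above: the extra arm is $[\cos\omega, \sin\omega, 0, \ldots, 0]$ for a small angle $\omega$, and the parameter vector is $[2, 0, \ldots, 0]$. The function name and defaults are hypothetical.

```python
import numpy as np

def soare_instance(d, omega=0.1):
    """Canonical basis in R^d plus one extra arm at a small angle omega to e_1 (d >= 2)."""
    arms = np.vstack([np.eye(d), np.zeros((1, d))])
    arms[d, 0] = np.cos(omega)
    arms[d, 1] = np.sin(omega)
    theta_star = np.zeros(d)
    theta_star[0] = 2.0  # assumed parameter: only the first coordinate is rewarded
    return arms, theta_star

arms, theta_star = soare_instance(d=3)
means = arms @ theta_star  # arm 0 is best; the extra arm is a close second
```

The gap between the first arm and the extra arm is $2(1 - \cos\omega)$, which is why small angles make the instance hard.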
Dataset 2: In this dataset, feature vectors are sampled uniformly at random the surface of the unit sphere centered at the origin. We pick the two closest arms, say and , and then set for . This makes as the best arm. We test the algorithm for .
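The construction of this dataset can be sketched as follows. The mixing rule used to place the parameter vector, $\theta^* = x_i + 0.01\,(x_j - x_i)$ for the closest pair $(x_i, x_j)$, is an assumption borrowed from the setup commonly attributed to Xu et al. (2017); the function name, the constant `mix` and the seed are hypothetical.

```python
import numpy as np

def random_sphere_instance(n_arms, d, mix=0.01, seed=0):
    """Arms drawn uniformly from the unit sphere; theta* placed next to one of the closest pair."""
    rng = np.random.default_rng(seed)
    arms = rng.normal(size=(n_arms, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)  # normalised Gaussians are uniform on the sphere
    gram = arms @ arms.T
    np.fill_diagonal(gram, -np.inf)  # ignore self-similarity
    # On the unit sphere, the closest pair is the pair with the largest inner product.
    i, j = np.unravel_index(np.argmax(gram), gram.shape)
    theta_star = arms[i] + mix * (arms[j] - arms[i])
    return arms, theta_star, int(i), int(j)
```

With this rule, arm `i` has the largest expected reward and arm `j` is its nearest competitor.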
Dataset 3: This setup is important as this shows the efficiency of GLUCB in the case when there may be many arms which competing for the second best arm. For a given value of , the armset contains feature vectors from where where . was fixed to be . We conduct the experiment by varying .
5.2 Experiments based on real data
We conduct an experiment on the Yahoo! Webscope dataset R6A (https://webscope.sandbox.yahoo.com/), which consists of 36-dimensional features accompanied by binary outcomes. We modify the setting as is done in Xu et al. (2017) so that it can be adapted to the best arm identification setting. We construct the 36-dimensional feature set by random sampling from the dataset, and the reward is generated as $+1$ with probability $(1 + x^\top \theta^*)/2$ and $-1$ with probability $(1 - x^\top \theta^*)/2$, where $\theta^*$ is the regularized least squares estimator fitted on the original dataset. For the choice of the feature vectors and the detailed procedure, we refer the reader to the paper of Xu et al. (2017).
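A binary reward whose mean is linear in the features can be simulated as in the sketch below. It assumes the reward takes the value $+1$ with probability $(1 + x^\top\theta)/2$ and $-1$ otherwise, so that the expected reward equals $x^\top\theta$; this rule follows the description usually given for the Xu et al. (2017) setup and is an assumption here, as is the function name.

```python
import numpy as np

def binary_reward(x, theta, rng):
    """Draw a +/-1 reward with mean x^T theta (assumes |x^T theta| <= 1)."""
    p_plus = 0.5 * (1.0 + float(x @ theta))
    p_plus = min(1.0, max(0.0, p_plus))  # guard against values slightly outside [0, 1]
    return 1.0 if rng.random() < p_plus else -1.0

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])
theta = np.array([0.3, 0.0])
rewards = [binary_reward(x, theta, rng) for _ in range(20000)]
empirical_mean = sum(rewards) / len(rewards)  # should be close to x^T theta
```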
6 Conclusion and future work
We have generalised the LUCB best arm identification algorithm to bandits with linear structure via a new MaxOverlap rule to reason under uncertainty. The resulting GLUCB algorithm is computationally very attractive compared to many state-of-the-art algorithms for linear bandits. In particular, it does not require solving optimisation problems, which become inefficient when the number of arms is large. Viewed from another perspective, the algorithm leverages the fact that a strategy which tries to greedily maximize the gap between the current best and second best arms is optimal for the BAI problem. We show that for the special case of two arms GLUCB is better than any causal algorithm for the BAI problem. We also show orderwise optimality in the case of three arms.
In light of the analysis presented and the performance of GLUCB in our experiments, we conjecture that our algorithm is optimal for any set of arms. Proving this forms part of our future work. We conduct several experiments based on synthetically designed environments and a real-world dataset and show the superior performance of GLUCB over other algorithms. Furthermore, we believe that the extra factor in the sample complexity results is due to the general concentration bound for adaptive sequences and can be improved, which also remains future work. More generally, it is interesting to ask if the general max-overlap principle works for other, non-linear bandit reward structures as well.
- Abbasi-Yadkori et al. (2011a) Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved Algorithms for Linear Stochastic Bandits. In Proc. NIPS, pages 2312–2320, 2011a.
- Abbasi-Yadkori et al. (2011b) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2312–2320. Curran Associates, Inc., 2011b.
- Even-Dar et al. (2006) Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105, December 2006. ISSN 1532-4435.
- Fiez et al. (2019) Tanner Fiez, Lalit Jain, Kevin Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. arXiv preprint arXiv:1906.08399, 2019.
- Garivier and Kaufmann (Jun. 2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference On Learning Theory, pages 998–1027, Jun. 2016.
- Horn and Johnson (2012) Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. ISBN 0521548233, 9780521548236.
- Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
- Kalyanakrishnan et al. (2012) Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In ICML, 2012.
- Kaufmann et al. (2016) Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. J. Mach. Learn. Res., 17(1):1–42, January 2016. ISSN 1532-4435.
- Lattimore et al. (2016) Finnian Lattimore, Tor Lattimore, and Mark D Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pages 1181–1189, 2016.
- Rudin (1964) Walter Rudin. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964.
- Sen et al. (2016) Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An nmf approach. arXiv preprint arXiv:1606.00119, 2016.
- Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. CoRR, abs/1409.6110, 2014.
- Tao et al. (2018) Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4877–4886, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/tao18a.html.
- Xu et al. (2017) Liyuan Xu, Junya Honda, and Masashi Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. International Conference on Artificial Intelligence and Statistics, 2017.
7.1 Proof of Prop. 3.1
As mentioned in Sec. 3, at time , our algorithm samples arm
Where we have used the fact that To get a closed form solution for the above, we solve the optimization problem : explicitly to obtain:
The agent must stop choosing arms if this value is zero, i.e., when no longer intersects any of the suboptimal cones. Due to the over arms the above strategy is inefficient to implement. A slight modification of the above which is easier to implement, is as follows. recalling the definition of HalfSpace, define the cone () which has the maximum overlap with the current cone.
which is straightforward to implement on a computer. Then play an arm according to the following rule. If play:
The last step uses the Matrix Inversion Lemma Horn and Johnson (2012).
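The rank-one case of the Matrix Inversion Lemma (the Sherman–Morrison formula) is what makes it cheap to evaluate how the inverse design matrix would change if a candidate arm were played. The sketch below is a generic illustration of that identity; the function name is hypothetical, and a symmetric positive definite matrix is assumed.

```python
import numpy as np

def rank_one_update(V_inv, x):
    """Return (V + x x^T)^{-1} given V^{-1}, for symmetric positive definite V."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)

# Usage: compare against a direct inverse on a small example.
V = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -1.0])
updated = rank_one_update(np.linalg.inv(V), x)
direct = np.linalg.inv(V + np.outer(x, x))
```

The update costs a matrix-vector product and an outer product, instead of a full inversion per candidate arm.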
7.2 Lower Bound on Sample Complexity
We begin by restating the result of [Garivier and Kaufmann (Jun. 2016)] in the special case of linear bandits. Let be a given set of arms in . Let be any vector in . For any arbitrary vector , we define . Define the set .
Lemma 8 (General change-of-measure based lower bound of Garivier and Kaufmann (Jun. 2016)).
Let For any strategy and any linear bandit with the unknown parameter vector . Let the noise be normal with a variance parameter of
Then the expected sample complexity of any strategy
where, for and
Theorem (Lower Bound).
Let For any strategy and any linear bandit problem with the unknown parameter vector ,
where the expectation is under and , where are the arms in
Recall the definition from Lemma 8,
where is the KL divergence between any two distributions. Hence, we have,
for some We will first consider the inner part of the expression above (a convex program), which can be re-written as
where is defined as in the theorem. Writing the Lagrangian, we get
Setting , we get
Substituting this value of into the Lagrangian, we obtain:
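The steps above can be sketched as a worked derivation for a single competing arm $x \neq x^*$. The notation is an assumption made for illustration: $y = x^* - x$, $\Delta = y^\top\theta^* > 0$, $A(w) = \sum_a w_a x_a x_a^\top$ is assumed invertible, the noise is Gaussian with variance $\sigma^2$, and $\mu \ge 0$ is the Lagrange multiplier.

```latex
\begin{align*}
&\text{Inner problem:} &&
  \min_{\theta}\ \frac{1}{2\sigma^2}\,(\theta-\theta^*)^\top A(w)\,(\theta-\theta^*)
  \quad \text{s.t.}\quad y^\top\theta \le 0, \\
&\text{Lagrangian:} &&
  L(\theta,\mu) = \frac{1}{2\sigma^2}\,(\theta-\theta^*)^\top A(w)\,(\theta-\theta^*) + \mu\, y^\top\theta, \\
&\text{Setting } \nabla_\theta L = 0: &&
  \theta = \theta^* - \sigma^2 \mu\, A(w)^{-1} y, \\
&\text{Active constraint } y^\top\theta = 0: &&
  \mu = \frac{\Delta}{\sigma^2\, \|y\|^2_{A(w)^{-1}}}, \\
&\text{Optimal value:} &&
  \frac{1}{2\sigma^2}\,\sigma^4\mu^2\, \|y\|^2_{A(w)^{-1}}
  = \frac{\Delta^2}{2\sigma^2\, \|y\|^2_{A(w)^{-1}}}.
\end{align*}
```

Taking the minimum of this value over all competing arms and then the maximum over $w$ in the probability simplex gives a bound of the familiar form, in which the sample complexity scales with $\min_w \max_{x \neq x^*} \|x^* - x\|^2_{A(w)^{-1}} / \Delta_x^2$.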