The linear bandit is the simplest generalisation of the finite-armed bandit. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$ with $|\mathcal{A}| = K$ and $\|x\|_2 \le 1$ for all $x \in \mathcal{A}$. A learner interacts with the bandit over $n$ rounds. In each round $t \in \{1, \dots, n\}$ the learner chooses an action (arm) $A_t \in \mathcal{A}$ and observes a payoff $Y_t = \langle A_t, \theta \rangle + \eta_t$, where $\eta_t$ is standard Gaussian noise and $\theta \in \mathbb{R}^d$ is an unknown parameter. The optimal action is $x^* = \operatorname{argmax}_{x \in \mathcal{A}} \langle x, \theta \rangle$, which is not known since it depends on $\theta$. The assumption that $\mathcal{A}$ spans $\mathbb{R}^d$ is non-restrictive, since if $\mathcal{A}$ has rank $r < d$, then one can simply use a different basis for which all but $r$ coordinates are always zero and then drop them from the analysis. The Gaussian assumption can be relaxed to $1$-subgaussian noise for our upper bound, but is needed for the lower bound. Our performance measure is the expected pseudo-regret (from now on just the regret), which is given by
$$R_\theta(n) = n \langle x^*, \theta \rangle - \mathbb{E}\left[\sum_{t=1}^n \langle A_t, \theta \rangle\right],$$
where the expectation is taken with respect to the actions of the strategy and the noise. There are a number of algorithms designed for minimising the regret, all of which use one of two algorithmic designs. The first is the principle of optimism in the face of uncertainty, which was originally applied to finite-armed bandits by Agrawal (1995); Katehakis and Robbins (1995); Auer et al. (2002) and many others, and more recently to linear bandits (Auer, 2002; Dani et al., 2008; Abbasi-Yadkori et al., 2011, 2012). The second algorithm design is Thompson sampling, which is an old algorithm (Thompson, 1933) that has experienced a resurgence in popularity because of its impressive practical performance and theoretical guarantees for finite-armed bandits (Kaufmann et al., 2012; Korda et al., 2013). Thompson sampling has also recently been applied to linear bandits with good empirical performance (Chapelle and Li, 2011) and near-minimax theoretical guarantees (Agrawal and Goyal, 2013).
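To make the interaction protocol concrete, here is a minimal simulation sketch in Python, assuming unit-variance Gaussian noise; the arm vectors, the parameter and all function names are illustrative rather than taken from the paper:

```python
import random

def mean_reward(x, theta):
    # <x, theta>: the expected payoff of action x
    return sum(a * b for a, b in zip(x, theta))

def play(x, theta, rng):
    # observed payoff: inner product plus standard Gaussian noise
    return mean_reward(x, theta) + rng.gauss(0.0, 1.0)

def pseudo_regret(chosen, arms, theta):
    # expected regret of a sequence of chosen actions
    best = max(mean_reward(x, theta) for x in arms)
    return sum(best - mean_reward(x, theta) for x in chosen)

arms = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.5)]
theta = (1.0, 0.1)                 # unknown to the learner
rng = random.Random(0)
y = play(arms[0], theta, rng)      # one noisy observation
r = pseudo_regret([arms[1], arms[1]], arms, theta)  # two pulls of a bad arm
```

Here the best mean is $1.0$ (the first arm), so each pull of the second arm, whose mean is $0.1$, contributes $0.9$ to the pseudo-regret.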
While both approaches lead to practical algorithms (especially Thompson sampling), we will show they are fundamentally flawed in that algorithms based on these ideas cannot be close to asymptotically optimal. Along the way we characterise the optimal achievable asymptotic regret and design a strategy achieving it. This is an important message because optimism and Thompson sampling are widely used beyond the finite-armed case. Examples include generalised linear bandits (Filippi et al., 2010), spectral bandits (Valko et al., 2014)
, and even learning in Markov decision processes (Auer et al., 2010; Gopalan and Mannor, 2015).
The disadvantages of these approaches are obscured in the worst-case regime, where both are quite close to optimal. One might question whether the asymptotic analysis is relevant in practice. The gold standard would be instance-dependent finite-time guarantees like those available for finite-armed bandits, but historically the asymptotic analysis has served as a useful guide towards understanding the trade-offs in finite time. Besides hiding the structure of specific problems, pushing for optimality in the worst-case regime can also lead to sub-optimal instance-dependent guarantees. For example, the MOSS algorithm for finite-armed bandits is minimax optimal, but far from finite-time optimal (Audibert and Bubeck, 2009). For these reasons we believe that understanding the asymptotics of a problem is a useful first step towards optimal finite-time instance-dependent guarantees, which are the most desirable.
It is worth mentioning that partial monitoring (a more complicated online learning setting) is a well-known example of the failure of optimism (Bartók et al., 2014). Although related, the partial monitoring framework is more general than the bandit setting because the learner may not observe the reward even for the action they take, which means that additional exploration is usually necessary in order to gain information. Basic results in partial monitoring are concerned with characterizing whether an instance is easier or harder than bandit instances. More recently, the question of asymptotic instance optimality was studied in finite stochastic partial monitoring (Komiyama et al., 2015) and in the special setting of learning with side information (Wu et al., 2015). While the algorithms derived in these works served as inspiration, neither the analysis nor the algorithms generalise in a simple, direct fashion to the linear setting, which requires a careful study of how information is transferred between actions.
For a positive semidefinite matrix $A$ (written as $A \succeq 0$) and vector $x$ we write $\|x\|_A^2 = x^\top A x$. The Euclidean norm of a vector $x$ is $\|x\|_2$ and the spectral norm of a matrix $A$ is $\|A\|$. The pseudo-inverse of a matrix $A$ is denoted by $A^+$. The mean of arm $x$ is $\mu_x = \langle x, \theta \rangle$ and the optimal mean is $\mu^* = \max_{x \in \mathcal{A}} \mu_x$. Let $x^*$ be any optimal action such that $\mu_{x^*} = \mu^*$. The sub-optimality gap of arm $x$ is $\Delta_x = \mu^* - \mu_x$, and $\Delta_{\min}$ and $\Delta_{\max}$ denote the smallest nonzero gap and the largest gap respectively. The number of times arm $x$ has been chosen after round $t$ is denoted by $T_x(t)$. A policy is consistent if for all $\theta \in \mathbb{R}^d$ and $p > 0$ it holds that $R_\theta(n) = o(n^p)$. Note that this is equivalent to $\lim_{n \to \infty} R_\theta(n)/n^p = 0$ for all $p > 0$ and also to $\mathbb{E}[T_x(n)] = o(n^p)$ for all suboptimal $x$. When more appropriate, we will use Landau notation (also with $o$, $\omega$ and $\Omega$). Vectors in $\mathbb{R}^K$ will often be indexed by the action set, which we assume has an arbitrary fixed order. For example, we might write $\alpha \in \mathbb{R}^{\mathcal{A}}$ and refer to $\alpha_x$ for some $x \in \mathcal{A}$.
3 Lower Bound
We note first that the finite-armed UCB algorithm of Agrawal (1995); Katehakis and Robbins (1995) can be used on this problem by disregarding the structure on the arms to achieve an asymptotic regret of
$$\limsup_{n \to \infty} \frac{R_\theta(n)}{\log(n)} = \sum_{x : \Delta_x > 0} \frac{2}{\Delta_x}\,.$$
This quantity depends linearly on the number of suboptimal arms, which may be very large (much larger than the dimension $d$) and is very undesirable. Nevertheless we immediately observe that the asymptotic regret should be logarithmic. The following theorem and its corollary characterise the optimal asymptotic regret.
Fix $\theta \in \mathbb{R}^d$ such that there is a unique optimal arm. Let $\pi$ be a consistent policy and let
$$\bar G_t = \mathbb{E}\left[\sum_{s=1}^t A_s A_s^\top\right],$$
which we assume is invertible for sufficiently large $t$. Then for all suboptimal $x \in \mathcal{A}$ it holds that
$$\limsup_{t \to \infty} \log(t)\, \|x\|^2_{\bar G_t^{-1}} \le \frac{\Delta_x^2}{2}\,.$$
The astute reader may recognize $\|x\|^2_{\bar G_t^{-1}}$ as the squared width of the confidence interval for the estimate of $\langle x, \theta \rangle$ when using a linear least squares estimator. The result says that this width has to shrink at least logarithmically with a specific constant. Before the proof of Theorem 1 we present a trivial corollary and some consequences. The assumption that $\bar G_t$ is eventually invertible can be relaxed. In fact, if $\bar G_t$ is not eventually invertible, then the algorithm must suffer linear regret on some problem. This is quite natural because a singular $\bar G_t$ implies the algorithm has not explored at all in some direction. The proof of this fact may be found in Appendix C.
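The quantity in the theorem is easy to evaluate. The sketch below (my own helper names, hard-coded to two dimensions for brevity) builds the Gram matrix from pull counts and computes the squared confidence width $x^\top G^{-1} x$:

```python
def gram(arms, counts):
    # G = sum_x T(x) * x x^T for 2-d actions
    g = [[0.0, 0.0], [0.0, 0.0]]
    for x, n in zip(arms, counts):
        for i in range(2):
            for j in range(2):
                g[i][j] += n * x[i] * x[j]
    return g

def width_sq(x, g):
    # ||x||^2_{G^{-1}} = x^T G^{-1} x via the closed-form 2x2 inverse
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    inv = [[g[1][1] / det, -g[0][1] / det],
           [-g[1][0] / det, g[0][0] / det]]
    y = [inv[0][0] * x[0] + inv[0][1] * x[1],
         inv[1][0] * x[0] + inv[1][1] * x[1]]
    return x[0] * y[0] + x[1] * y[1]

arms = [(1.0, 0.0), (0.0, 1.0)]
g = gram(arms, [100, 10])        # 100 pulls of e1, 10 pulls of e2
w = width_sq((0.0, 1.0), g)      # width for the rarely pulled arm: 1/10
```

With 100 pulls of $e_1$ and 10 of $e_2$, the width for $e_2$ is $1/10$: rarely played directions have wide confidence intervals, and the theorem forces every such width to shrink like $\Delta_x^2/(2\log t)$.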
Let $\pi$ be a consistent policy and $\theta \in \mathbb{R}^d$ be such that there is a unique optimal arm in $\mathcal{A}$. Then
$$\liminf_{n \to \infty} \frac{R_\theta(n)}{\log(n)} \ge c(\mathcal{A}, \theta)\,, \qquad (1)$$
where $c(\mathcal{A}, \theta)$ is defined as the solution to the following optimisation problem:
$$c(\mathcal{A}, \theta) = \inf_{\alpha \in [0,\infty)^{\mathcal{A}}} \sum_{x \in \mathcal{A}} \alpha(x) \Delta_x \quad \text{subject to} \quad \|x\|^2_{H_\alpha^{-1}} \le \frac{\Delta_x^2}{2} \text{ for all } x \text{ with } \Delta_x > 0\,, \qquad (2)$$
where $H_\alpha = \sum_{x \in \mathcal{A}} \alpha(x) x x^\top$.
As with the previous result, in (1) the reader may recognize the leading term of the confidence width for estimating the mean reward of $x$. Unsurprisingly, the width of this confidence interval has to shrink at least as fast as the width of the confidence interval for estimating the gap $\Delta_x$. The intuition underlying the optimisation problem (2) is that no consistent strategy can escape allocating samples so that the gaps of all suboptimal actions are identified with high confidence, while a good strategy will also minimise the regret subject to this identifiability condition. The proof of Corollary 2 is given in Appendix B.
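The optimisation problem can be explored numerically. The following brute-force sketch (illustrative names, two-dimensional actions, and a crude grid in place of a proper solver) fixes a large weight on the optimal arm, whose gap is zero and whose pulls are therefore free, and searches over the weight of the suboptimal arm:

```python
def h_inv_width(x, alphas, arms):
    # ||x||^2_{H(alpha)^{-1}} with H = sum_i alpha_i * a_i a_i^T (2-d)
    h = [[0.0, 0.0], [0.0, 0.0]]
    for a, w in zip(arms, alphas):
        for i in range(2):
            for j in range(2):
                h[i][j] += w * a[i] * a[j]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    y = [( h[1][1] * x[0] - h[0][1] * x[1]) / det,
         (-h[1][0] * x[0] + h[0][0] * x[1]) / det]
    return x[0] * y[0] + x[1] * y[1]

def c_by_grid(arms, gaps, grid):
    # brute-force the optimisation: weight on the optimal arm is free
    # (its gap is zero), so we fix it large and search over the rest
    best = float("inf")
    for a in grid:
        alphas = [1e6 if g == 0 else a for g in gaps]
        cost = sum(w * g for w, g in zip(alphas, gaps))
        feasible = all(h_inv_width(x, alphas, arms) <= g * g / 2
                       for x, g in zip(arms, gaps) if g > 0)
        if feasible:
            best = min(best, cost)
    return best

arms = [(1.0, 0.0), (0.0, 1.0)]   # standard basis, first arm optimal
gaps = [0.0, 1.0]
grid = [0.1 * k for k in range(1, 101)]
c = c_by_grid(arms, gaps, grid)
```

For this standard-basis instance the binding constraint is $1/\alpha \le 1/2$, so the grid search returns a cost close to $2$, matching the closed form in the finite-armed example below.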
Example 3 (Finite armed bandits).
Suppose $\mathcal{A} = \{e_1, \dots, e_d\}$ consists of the standard basis vectors. Then
$$c(\mathcal{A}, \theta) = \sum_{x : \Delta_x > 0} \frac{2}{\Delta_x}\,,$$
which recovers the lower bound of Lai and Robbins (1985).
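The closed form in this example is a one-liner to verify in code (the function name is my own): with standard-basis arms $H_\alpha$ is diagonal, so each constraint binds at $\alpha_x = 2/\Delta_x^2$ and the objective evaluates to $\sum_x 2/\Delta_x$.

```python
def lai_robbins_constant(gaps):
    # standard-basis arms make H(alpha) diagonal, so each constraint
    # reads 1/alpha_x <= gap_x^2 / 2 and binds at alpha_x = 2/gap_x^2;
    # the objective then evaluates to the sum of 2/gap_x
    return sum(2.0 / g for g in gaps if g > 0)

c = lai_robbins_constant([0.0, 0.5, 1.0])   # optimal arm has gap zero
```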
Let $\mathcal{A} = \{x_1, x_2, x_3\}$ where $x_1$ is optimal, $x_1$ and $x_2$ point in nearly the same direction, and $x_3$ is well separated from both. Then $c(\mathcal{A}, \theta) < c(\mathcal{A} \setminus \{x_3\}, \theta)$ for all sufficiently small angles between $x_1$ and $x_2$. The example serves to illustrate the interesting fact that the problem becomes significantly harder if $x_3$ is removed from the action-set. The reason is that $x_1$ and $x_2$ are pointing in nearly the same direction, so learning the difference $\langle x_1 - x_2, \theta \rangle$ is very challenging. But determining which of $x_1$ and $x_2$ is optimal is easy by playing $x_3$. So we see that in linear bandits there is a complicated trade-off between information and regret that makes the structure of the optimal strategy more interesting than in the finite setting.
The closest prior work to our lower bound is by Komiyama et al. (2015) and Agrawal et al. (1989). The latter consider stochastic partial monitoring when the reward is part of the observation. In this setting, in each round the learner selects one of finitely many actions and receives an observation from a distribution that depends on the chosen action and an unknown parameter, but is otherwise known. While this model could cover our setting, the results are developed only for the case when the unknown parameter belongs to a finite set, an assumption on which all of their results heavily depend. Komiyama et al. (2015), on the other hand, restrict partial monitoring to the case when the observations belong to a finite set, while the parameter belongs to the unit simplex. While this problem also has a linear structure, their results do not generalize beyond the discrete observation setting.
4 Proof of Theorem 1
We make use of two standard results from information theory. The first is a high probability version of Pinsker’s inequality.
Let $P$ and $Q$ be measures on the same measurable space $(\Omega, \mathcal{F})$. Then for any event $A \in \mathcal{F}$,
$$P(A) + Q(A^c) \ge \frac{1}{2} \exp\left(-\mathrm{KL}(P, Q)\right)\,,$$
where $A^c$ is the complement of $A$ and $\mathrm{KL}(P, Q)$ is the relative entropy between $P$ and $Q$, which is defined as $+\infty$ if $P$ is not absolutely continuous with respect to $Q$, and is $\int_\Omega \log \frac{dP}{dQ}\, dP$ otherwise.
This result follows easily from Lemma 2.6 of Tsybakov (2008).
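The inequality is easy to sanity-check numerically for discrete measures. In this sketch (illustrative names; an event is a set of outcome indices) the left-hand side minus the lower bound is nonnegative for every event:

```python
import math

def kl(p, q):
    # relative entropy between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pinsker_hp_gap(p, q, event):
    # P(A) + Q(A^c) minus the lower bound exp(-KL(P, Q)) / 2
    pa = sum(p[i] for i in event)
    qac = sum(q[i] for i in range(len(q)) if i not in event)
    return pa + qac - 0.5 * math.exp(-kl(p, q))

p = [0.9, 0.1]
q = [0.2, 0.8]
gap = pinsker_hp_gap(p, q, {0})   # nonnegative for every event A
```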
The second lemma is sometimes called the information processing lemma and shows that the relative entropy between measures on sequences of outcomes for the same algorithm interacting with different bandits can be decomposed in terms of the expected number of times each arm is chosen and the relative entropies of the distributions of the arms. There are many versions of this result (e.g., Auer et al. (1995) and Gerchinovitz and Lattimore (2016)). To state the result, assume without loss of generality that the measure space underlying the action-reward sequence $(A_1, Y_1, \dots, A_n, Y_n)$ is $(\mathcal{A} \times \mathbb{R})^n$ and $A_t$ and $Y_t$ are the respective coordinate projections: $A_t(a_1, y_1, \dots, a_n, y_n) = a_t$ and $Y_t(a_1, y_1, \dots, a_n, y_n) = y_t$.
Let $P$ and $P'$ be the probability measures on the sequence $(A_1, Y_1, \dots, A_n, Y_n)$ for a fixed bandit policy interacting with a linear bandit with standard Gaussian noise and parameters $\theta$ and $\theta'$ respectively. Under these conditions the KL divergence of $P$ and $P'$ can be computed exactly and is given by
$$\mathrm{KL}(P, P') = \frac{1}{2} \sum_{x \in \mathcal{A}} \mathbb{E}\left[T_x(n)\right] \langle x, \theta - \theta' \rangle^2\,,$$
where $\mathbb{E}$ is the expectation operator induced by $\theta$.
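The decomposition is straightforward to evaluate once the expected pull counts are given; a short sketch with illustrative names:

```python
def kl_linear_bandits(expected_counts, arms, theta1, theta2):
    # KL(P, P') = 0.5 * sum_x E[T_x(n)] * <x, theta - theta'>^2
    d = [a - b for a, b in zip(theta1, theta2)]
    total = 0.0
    for n, x in zip(expected_counts, arms):
        inner = sum(xi * di for xi, di in zip(x, d))
        total += 0.5 * n * inner * inner
    return total

arms = [(1.0, 0.0), (0.0, 1.0)]
kl_val = kl_linear_bandits([10, 5], arms, (1.0, 0.0), (1.0, 1.0))
```

Note that only actions with $\langle x, \theta - \theta' \rangle \neq 0$ contribute, which is exactly what the lower-bound proof exploits: a confusing alternative $\theta'$ is cheap to hide in directions the policy rarely explores.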
Proof of Theorem 1.
Recall that is the optimal arm, which we assumed to be unique. Let be a suboptimal arm (so ) and be an event to be chosen later. Rearranging (3) gives and recalling that , together with Lemma 6 we get that
Now we choose $\theta'$ "close" to $\theta$, but in such a way that in the bandit determined by $\theta'$ the optimal action is not $x^*$. Selecting ensures that is small, because the policy is consistent. Intuitively, this holds because if is large then is not used much in , hence must be large. If is large, then is used often in , hence must be large. But from the consistency of the policy we know that both and are sub-polynomial. Let and (both to be chosen later) and define $\theta'$ by
where we also restrict so that . Then,
Hence the mean reward of is higher than that of in .
On the other hand, introducing and to denote the expectation operator induced by and using that by (7), is suboptimal in , we also have
Adding up the two inequalities and lower bounding by , which holds when (which we assume from now on), we get
which completes the proof that is indeed small. Now we calculate the term on the left-hand side of (5). Using the definition of , we get
where in the last line we introduced
Since is consistent, . Hence, for all such that ,
Now take a subsequence such that
Let . A simple calculation gives and hence if is any cluster point of , say, the subsequence of the subsequence converges to , and then
Since was arbitrarily small, the result will follow once we establish that . To show this, assume on the contrary that . This implies that and through it also implies that . Let , where is the identity matrix. Then, , so and thus
The uniqueness assumption of the theorem can be lifted at the price of more work and by slightly changing the theorem statement. In particular, the theorem statement must be restricted to those suboptimal actions that can be made optimal by changing to , while none of the previously optimal actions remain optimal. That is, the statement only concerns such that but there exists such that and . The choice of would still be as before, except that is selected as the optimal action under that maximizes . Then, in the proof, has to be redefined to be the total number of times an optimal action is chosen, and at the end one also needs to show that the chosen satisfies .
5 Concentration
Before introducing the new algorithm we analyse the concentration properties of the least squares estimator. Our results refine the existing guarantees by Abbasi-Yadkori et al. (2011), and are necessary in order to obtain asymptotic optimality. Let $G_t = \sum_{s=1}^t A_s A_s^\top$ be the Gram matrix after round $t$ and $\hat\theta_t = G_t^{-1} \sum_{s=1}^t A_s Y_s$ be the empirical (least squares) estimate, where each $A_s$ is selected based on $A_1, Y_1, \dots, A_{s-1}, Y_{s-1}$. The empirical estimate of the sub-optimality gaps is $\hat\Delta_x = \hat\mu^* - \hat\mu_x$, where $\hat\mu_x = \langle x, \hat\theta_t \rangle$ and $\hat\mu^* = \max_{y \in \mathcal{A}} \hat\mu_y$. We will also use the notation $\hat\mu$ and $\hat\Delta$ for vectors of empirical means and sub-optimality gaps (indexed by the arms).
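The estimator can be sketched in a few lines. The code below (my own names, two-dimensional, and using the plain inverse since the example's Gram matrix is non-singular) computes the least squares estimate and the empirical gaps; with noiseless rewards it recovers the parameter exactly:

```python
def least_squares_2d(history):
    # theta_hat = G^{-1} b with G = sum x x^T and b = sum x * y
    g = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for x, y in history:
        for i in range(2):
            b[i] += x[i] * y
            for j in range(2):
                g[i][j] += x[i] * x[j]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return ((g[1][1] * b[0] - g[0][1] * b[1]) / det,
            (-g[1][0] * b[0] + g[0][0] * b[1]) / det)

def gap_estimates(arms, theta_hat):
    # empirical gap: best empirical mean minus the arm's empirical mean
    means = [sum(a * t for a, t in zip(x, theta_hat)) for x in arms]
    best = max(means)
    return [best - m for m in means]

# sanity check: noiseless rewards recover theta exactly
theta = (1.0, 0.1)
hist = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.1), ((1.0, 1.0), 1.1)]
th = least_squares_2d(hist)
gaps = gap_estimates([(1.0, 0.0), (0.0, 1.0)], th)
```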
For any , sufficiently large and such that is almost surely non-singular,
where for some universal constant
The result improves on the elegant concentration guarantee of Abbasi-Yadkori et al. (2011) because asymptotically we have , while there it was . Note that the restriction on may be relaxed with a small additional argument. The proof of Theorem 8 relies on a peeling argument and is given in Appendix A. For the remainder we abbreviate and , which are chosen so that
6 Optimal Strategy
A barycentric spanner of the action space $\mathcal{A}$ is a set $B = \{b_1, \dots, b_d\} \subseteq \mathcal{A}$ such that for any $x \in \mathcal{A}$ there exists an $\alpha \in [-1, 1]^d$ with $x = \sum_{i=1}^d \alpha_i b_i$. The existence of a barycentric spanner is guaranteed because $\mathcal{A}$ is finite and spans $\mathbb{R}^d$ (Awerbuch and Kleinberg, 2004). We propose a simple strategy that operates in three phases called the warm-up phase, the success phase and the recovery phase. In the warm-up the algorithm deterministically chooses its actions from a barycentric spanner to obtain a rough estimate of the sub-optimality gaps. The algorithm then uses the estimated gaps as a substitute for the true gaps to determine the optimal pull counts for each action, and starts implementing this strategy. Finally, if an anomaly is detected that indicates the inaccuracy of the estimated gaps then the algorithm switches to the recovery phase where it simply plays UCB.
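A barycentric spanner is easy to compute for small instances: any size-$d$ subset maximising the absolute determinant works, since by Cramer's rule swapping one basis element for any arm cannot increase the determinant, so every coefficient is at most one in absolute value. A two-dimensional sketch (exhaustive search with illustrative names; Awerbuch and Kleinberg give an efficient iterative-swapping variant):

```python
from itertools import combinations

def barycentric_spanner_2d(arms):
    # the pair maximising |det| is a barycentric spanner: by Cramer's
    # rule every arm's coordinates in that basis lie in [-1, 1]
    def det(a, b):
        return a[0] * b[1] - a[1] * b[0]
    return max(combinations(arms, 2), key=lambda p: abs(det(*p)))

def coords(x, basis):
    # coefficients of x in the spanner basis (Cramer's rule)
    a, b = basis
    d = a[0] * b[1] - a[1] * b[0]
    return ((x[0] * b[1] - x[1] * b[0]) / d,
            (a[0] * x[1] - a[1] * x[0]) / d)

arms = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.4)]
basis = barycentric_spanner_2d(arms)
# every arm has coefficients bounded by 1 in absolute value
ok = all(max(abs(c) for c in coords(x, basis)) <= 1.0 + 1e-9 for x in arms)
```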
For any $\Delta \in [0, \infty)^{\mathcal{A}}$ define $\alpha(\Delta) \in [0, \infty)^{\mathcal{A}}$ to be a solution to the optimisation problem
$$\min_{\alpha \in [0,\infty)^{\mathcal{A}}} \sum_{x \in \mathcal{A}} \alpha(x) \Delta_x \quad \text{subject to} \quad \|x\|^2_{H_\alpha^{-1}} \le \frac{\Delta_x^2}{2} \text{ for all } x \text{ with } \Delta_x > 0\,.$$
Assuming that the optimal arm is unique, the strategy given in Algorithm 1 satisfies
$$\limsup_{n \to \infty} \frac{R_\theta(n)}{\log(n)} \le c(\mathcal{A}, \theta)\,.$$
7 Proof of Theorem 10
We analyse the regret in each of the three phases. The warm-up phase has length , so its contribution to the asymptotic regret is negligible. There are two challenges. The first is to show that the recovery phase happens with probability at most . Then, since the regret in the recovery phase is logarithmic by known results for UCB, this ensures that the expected regret incurred in the recovery phase is also negligible. The second challenge is to show that the expected regret incurred during the success phase is asymptotically matching the lower bound in Theorem 1.
The set of rounds when the algorithm is in the warm-up/success/recovery phases are denoted by , and respectively. We introduce two failure events that occur when the errors in the empirical estimates of the arms are excessively large. Let be the event that there exists an arm and round such that
Similarly, let be the event that there exists an arm and round such that
Theorem 8 with and (12) imply that and . The failure events determine the quality of the estimates throughout time. The following two lemmas show that if does not occur then the regret is asymptotically optimal, while if occurs then the regret is logarithmic with some constant factor that depends only on the problem (determined by the action set and the parameter ). Since occurs with probability at most , the contribution of the latter component is negligible asymptotically.
If does not occur then Algorithm 1 never enters the recovery phase. Furthermore,
Let for any . Then
Proof of Lemma 11.
First, if is the round at the end of the warm-up period then by the definition of the algorithm there is a barycentric spanner and for . Let be arbitrary. Then, by the definition of the barycentric spanner, we can write where for all . Therefore,
Recalling the definition of in the algorithm we have
Consider the case when does not hold. Then, for all arms and rounds after the warm-up period we have
Therefore for all after the warm-up period we have , which means the success phase never ends and so the first part of the lemma is proven. It remains to bound the regret. Since we are only concerned with the asymptotics we may take to be large enough so that , which implies that . For , the solution to the optimisation problem in Definition 9 with the true gaps, it holds that
Letting and , we have
Therefore, , where . Also,
where in the last inequality we used the fact that . Then the regret in the success phase is
Our second lemma shows that provided fails, the regret in the success phase is at most logarithmic:
It holds that:
The proof follows by showing the existence of a constant that depends on the action set and the parameter, but not on $n$, such that the regret suffered in the success phase whenever does not hold is almost surely at most . The result follows from this because . See Appendix E for details.
Proof of Theorem 10.
We decompose the regret into the regret suffered in each of the phases:
The warm-up phase has length , which contributes asymptotically negligibly to the regret:
By Lemma 11, the recovery phase only occurs if occurs and . Therefore by well-known guarantees for UCB (Bubeck and Cesa-Bianchi, 2012) there exists a universal constant such that