The contextual linear bandit is a practical setting for many sequential decision-making problems, especially in on-line applications (Agarwal et al., 2009; Li et al., 2010). Our main contribution is a new algorithm that is asymptotically optimal, computationally efficient and empirically well-behaved in finite-time regimes. As a consequence of asymptotic optimality, the algorithm adapts to certain easy cases where it achieves sub-logarithmic regret.
Popular approaches for regret minimisation in contextual bandits include ε-greedy (Langford and Zhang, 2007), explicit optimism-based algorithms (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011), and implicit ones, such as Thompson sampling (Agrawal and Goyal, 2013). Although these algorithms enjoy near-optimal worst-case guarantees and can be quite practical, they are known to be arbitrarily suboptimal in the asymptotic regime, even in the non-contextual linear bandit (Lattimore and Szepesvari, 2017).
We propose an optimization-based algorithm that estimates and tracks the optimal allocation for each context/action pair. This technique is best known for its effectiveness in pure exploration (Chan and Lai, 2006; Garivier and Kaufmann, 2016; Degenne et al., 2019, and others). The approach has also been used for regret minimisation in linear bandits with fixed action sets (Lattimore and Szepesvari, 2017) and structured bandits (Combes et al., 2017). The last two articles provide algorithms for the non-contextual case and hence cannot be applied directly to our setting. More importantly, however, the algorithms are not practical. The first algorithm uses a complicated three-phase construction that barely updates its estimates. The second algorithm is not designed to handle large action spaces and has a ‘lower-order’ term in the regret that depends linearly on the number of actions and dominates the regret in all practical regimes. This lower-order term is not merely an artifact of the analysis, but is also reflected in the experiments (see Section 5.4 for details).
The most closely related work is by Ok et al. (2018), who study a reinforcement learning setting. A stochastic contextual bandit can be viewed as a Markov decision process where the state represents the context and the transition is independent of the action. The general structured setting of that paper means our setting is covered by their algorithm. Again, however, the algorithm is too general to exploit the specific structure of the contextual bandit problem. Their algorithm is asymptotically optimal, but suffers from lower-order terms that are linear in the number of actions and dominate the regret in all practically interesting regimes. In contrast, our algorithm is asymptotically optimal, but also practical in finite-horizon regimes, as will be demonstrated by our experiments.
The contextual linear bandit also serves as an interesting example where the asymptotics of the problem are not indicative of what should be expected in finite time (see the second scenario in Section 5.2). This is in contrast to many other bandit models where the asymptotic regret is also roughly optimal in finite time (Lattimore and Szepesvári, 2019). There is an important lesson here: designing algorithms that optimize for the asymptotic regret may entail huge sacrifices in finite time.
Another interesting phenomenon is related to the idea of ‘natural exploration’ that occurs in contextual bandits (Bastani et al., 2017; Kannan et al., 2018). A number of authors have started to investigate the striking performance of greedy algorithms in contextual bandits. In most bandit settings the greedy policy does not explore sufficiently and suffers linear regret. In some contextual bandit problems, however, the changing features ensure the algorithm cannot help but explore. Our algorithm and analysis highlight this effect (see Section 3.1 for details). If the context distribution is sufficiently rich, then the algorithm is eventually almost completely greedy and enjoys sub-logarithmic regret. An advantage of our approach is that we do not need strong assumptions on the context distribution: the algorithm adapts to the problem in a data-dependent fashion, in the sense that even when the context distribution is not sufficiently rich, we preserve the usual optimality guarantee. As another contribution, we also prove that algorithms based on optimism enjoy sub-logarithmic regret in this setting (Section 3.1).
The rest of the paper is organized as follows. Section 2 introduces the basic problem setup. Section 3 studies the asymptotic regret lower bound for the linear contextual bandit. Section 4 introduces our optimal allocation matching algorithm and presents the asymptotic regret upper bound. Section 5 conducts several experiments. Section 6 discusses interesting directions for future work.
Let . For a matrix and vector, we denote . The cardinality of a set is denoted by .
2 Problem Setting
We consider the stochastic -armed contextual linear bandit with a horizon of rounds and possible contexts. The assumption that the contexts are discrete is often reasonable in practice. For instance, in a recommender system users are often clustered into finitely many user groups. For each context there is a known feature/action set with . The interaction protocol is as follows. First the environment samples a sequence of independent contexts from an unknown distribution over
and each context is assumed to appear with positive probability. At the start of round the context is revealed to the learner, who may use their observations to choose an action . The reward is , where is a sequence of independent standard Gaussian random variables and is an unknown parameter. The Gaussian assumption can be relaxed to a conditionally sub-Gaussian assumption for the regret upper bound, but it is necessary for the regret lower bound. Throughout, we consider a frequentist setting in the sense that is fixed. For simplicity, we assume each spans and for all .
The performance metric is the cumulative expected regret, which measures the difference between the expected cumulative reward collected by the omniscient policy that knows and the learner’s expected cumulative reward. The optimal arm associated with context is . Then the expected cumulative regret of a policy when facing the bandit determined by is
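As a concrete illustration of this protocol and the regret definition, here is a minimal simulation; the two-context instance, its action sets, and the parameter values are hypothetical, chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-context instance (values are illustrative only).
theta = np.array([1.0, 0.0])                      # unknown parameter
action_sets = {                                   # feature/action set per context
    0: np.array([[1.0, 0.0], [0.0, 1.0]]),
    1: np.array([[0.5, 0.5], [1.0, 0.2]]),
}
context_probs = [0.5, 0.5]

def play(policy, n):
    """Run a policy for n rounds; return its cumulative (pseudo-)regret."""
    regret = 0.0
    for t in range(n):
        c = rng.choice(len(context_probs), p=context_probs)  # sample context
        A = action_sets[c]
        a = policy(c, A)                          # learner chooses an action
        means = A @ theta
        regret += means.max() - means[a]          # gap of the chosen action
    return regret

# A uniformly random policy as a baseline.
rand_regret = play(lambda c, A: rng.integers(len(A)), 2000)
# The omniscient policy that knows theta incurs zero regret.
opt_regret = play(lambda c, A: int(np.argmax(A @ theta)), 2000)
```

The random baseline accumulates regret linearly, while the omniscient policy incurs none, matching the definition of cumulative expected regret above.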
Note that this cumulative regret also depends on the context distribution and action sets. They are omitted from the notation to reduce clutter and because there will never be ambiguity.
3 Asymptotic Lower Bound
We investigate the fundamental limit of the linear contextual bandit by deriving its instance-dependent asymptotic lower bound. First, we define the class of policies under consideration.
[Consistent Policy] A policy is called consistent if the regret is subpolynomial for any bandit in that class and all context distributions:
The next lemma is the key ingredient in proving the asymptotic lower bound. Given a context , let be the suboptimality gap. Furthermore, let .
Assume that for all and that is uniquely defined for each context and let be consistent. Then for sufficiently large the expected covariance matrix
is invertible. Furthermore, for any context and any arm ,
The proof is deferred to Appendix A.1 in the supplementary material. Intuitively, the lemma shows that any consistent policy must collect sufficient statistical evidence at confidence level that suboptimal arms really are suboptimal. This corresponds to ensuring that the width of an appropriate confidence interval is approximately smaller than the sub-optimality gap .
[Asymptotic Lower Bound] Under the same conditions as Section 3,
where is defined as the optimal value of the following optimization problem:
subject to the constraint that for any context and suboptimal arm ,
Given the result in Lemma 3, the proof of Theorem 3 follows exactly the same idea as the proof of Corollary 2 in Lattimore and Szepesvari (2017) and is thus omitted here. Later on we will prove a matching upper bound in Theorem 4.2 and argue that our asymptotic lower bound is sharp.
In the above we adopt the convention that so that whenever . The inverse of a matrix with infinite entries is defined by passing to the limit in the obvious way, and is not technically an inverse.
Let us denote by an optimal solution to the above optimization problem. It serves as the optimal allocation rule for each arm such that the cumulative regret is minimized subject to the constraint that the width of the confidence interval of each sub-optimal arm is small. Specifically, can be interpreted as the approximate optimal number of times arm should be played having observed context .
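To make the constraint concrete, the following numpy sketch checks feasibility of a candidate allocation; the constant 1/2 is a hedged reading of the lower-bound constraint, and the helper names and toy instance are illustrative, not taken from the paper:

```python
import numpy as np

def design_matrix(alloc, action_sets):
    """H(alpha) = sum over context/arm pairs of the allocation weight times a a^T."""
    d = next(iter(action_sets.values())).shape[1]
    H = np.zeros((d, d))
    for c, A in action_sets.items():
        for i, x in enumerate(A):
            H += alloc[c][i] * np.outer(x, x)
    return H

def is_feasible(alloc, action_sets, gaps, slack=1e-9):
    """Check, for every suboptimal context/arm pair, that the squared
    confidence width ||a||^2 in the H(alpha)^{-1} norm is at most
    Delta^2 / 2 (a hedged reading of the constraint; constants may differ)."""
    Hinv = np.linalg.inv(design_matrix(alloc, action_sets))
    for c, A in action_sets.items():
        for i, x in enumerate(A):
            if gaps[c][i] > 0 and x @ Hinv @ x > gaps[c][i] ** 2 / 2 + slack:
                return False
    return True

# Toy instance (hypothetical): one context, standard-basis actions, gaps [0, 1].
A_sets = {0: np.array([[1.0, 0.0], [0.0, 1.0]])}
gaps = {0: [0.0, 1.0]}
```

Scaling the allocation up shrinks the confidence widths, so a large enough allocation is always feasible; the optimization problem seeks the cheapest such allocation in terms of regret.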
Our lower bound may also be derived from the general bound in Ok et al. (2018), since a stochastic contextual bandit can be viewed as a kind of Markov decision process. We use an alternative proof technique and the two lower bound statements have different forms. The proof is included for completeness.
When and consists of the standard basis vectors, the problem reduces to the classical multi-armed bandit and , which matches the well-known asymptotic lower bound of Lai and Robbins (1985).
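Concretely, for unit-variance Gaussian rewards the Lai–Robbins constant takes the familiar closed form (a standard computation under the paper's noise model, with $\Delta_a$ denoting the gap of arm $a$ and $\mu^*$ the optimal mean):

```latex
c(\theta)
  = \sum_{a \,:\, \Delta_a > 0}
    \frac{\Delta_a}{\operatorname{KL}\!\bigl(\mathcal{N}(\mu_a, 1),\, \mathcal{N}(\mu^*, 1)\bigr)}
  = \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\Delta_a^2 / 2}
  = \sum_{a \,:\, \Delta_a > 0} \frac{2}{\Delta_a}.
```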
The constant depends on both the unknown parameter and the action sets , but not the context distribution . In this sense there is a certain discontinuity in the hardness measure as a function of the context distribution. More precisely, problems where is arbitrarily close to zero may have different regret asymptotically than the problem obtained by removing context entirely. Clearly as tends to zero the th context is observed with vanishingly small probability in finite time and hence the asymptotically optimal regret may not be representative of the finite-time hardness.
3.1 Sub-logarithmic regret
Our matching upper and lower bounds reveal the interesting phenomenon that if the action sets satisfy certain conditions, then sub-logarithmic regret is possible. Consider the scenario that the set of optimal arms spans . Let be a large constant to be defined subsequently and for each context and arm let
Since the set of optimal arms spans it holds that for any context and arm ,
Hence, the constraint in Eq. 6 is satisfied for sufficiently large . Since with this choice of we have , it follows that . Therefore our upper bound will show that when the set of optimal actions spans our new algorithm satisfies
The choice of above shows that when span , an asymptotically optimal algorithm only needs to play suboptimal arms sub-logarithmically often, which means the algorithm is eventually very close to the greedy algorithm. Bastani et al. (2017) and Kannan et al. (2018) also investigate the striking performance of greedy algorithms in contextual bandits. However, Bastani et al. (2017) assume covariate diversity of the context distribution, while Kannan et al. (2018) assume the context is artificially perturbed with noise. Both conditions are hard to check in practice. In addition, Bastani et al. (2017) only provide a rate-optimal algorithm, while our algorithm is optimal including constants (see Theorem 4.2 for details).
As claimed in the introduction, we also prove that algorithms based on optimism can enjoy bounded regret when the set of optimal actions spans the space of all actions. The proof of the following theorem is given in the supplementary material.
Consider the policy that plays optimistically by
Suppose that is such that spans . Then, for suitable with , the expected regret .
Note that the choice of for which the above theorem holds also guarantees the standard minimax bound for this algorithm, showing that LinUCB can adapt online to this nice case.
4 Optimal Allocation Matching
The instance-dependent asymptotic lower bound provides an optimal allocation rule. However, the optimal allocation depends on the unknown sub-optimality gap. In this section, we present a novel matching algorithm that simultaneously estimates the unknown parameter using least squares and updates the allocation rule.
Let be the number of pulls of arm after round and . The least squares estimator is . For each context , the estimated sub-optimality gap of arm is and the estimated optimal arm is . The minimum nonzero estimated gap is
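A minimal numpy sketch of this estimator and the estimated gaps; the small ridge term and all numbers here are illustrative additions for numerical safety, not part of the paper's definitions:

```python
import numpy as np

def least_squares(X, y, reg=1e-6):
    """Least-squares estimate from observed features X and rewards y.
    The tiny ridge term `reg` is added only to keep the solve stable;
    the paper's estimator is the plain (unregularized) least squares."""
    V = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(V, X.T @ y)

def estimated_gaps(theta_hat, A):
    """Estimated sub-optimality gaps for one context's action set A."""
    means = A @ theta_hat
    return means.max() - means

# Toy usage with synthetic data.
rng = np.random.default_rng(1)
theta = np.array([1.0, -0.5])
X = rng.normal(size=(500, 2))
y = X @ theta + rng.normal(size=500)   # unit-variance Gaussian noise
theta_hat = least_squares(X, y)
```

With a few hundred observations the estimate concentrates around the true parameter, and the gap of the estimated optimal arm is zero by construction.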
4.1 Algorithm
Next we define an optimization problem similar to the one in (5), but with a different normalisation.
Let be the constant given by
where is an absolute constant. We write . For any define as a solution of the following optimization problem:
and that is invertible.
If is an estimate of , we call the solution an approximated allocation rule in contrast to the optimal allocation rule defined in Remark 3. Our algorithm alternates between exploration and exploitation, depending on whether or not all the arms have satisfied the approximated allocation rule. We are now ready to describe the algorithm, which starts with a brief initialisation phase.
In the first rounds the algorithm chooses any action in the action set such that is not in the span of . This is always possible by the assumption that spans for all contexts . At the end of the initialisation phase is guaranteed to be invertible.
In each round after the initialisation phase the algorithm checks if the following criterion holds for any :
The algorithm exploits if Eq. 11 holds and explores otherwise, as explained below.
The algorithm exploits by taking the greedy action:
The algorithm explores when Eq. 11 does not hold. This means that some actions have not been explored sufficiently. There are two cases to consider. First, when there exists an such that
the algorithm then computes two actions
Let be the number of exploration rounds defined in Algorithm 1. If the forced-exploration criterion is met, the algorithm plays arm , which executes the forced exploration. Otherwise it plays arm . Finally, rounds where no such exists are called wasted. In these rounds the algorithm acts optimistically, as LinUCB does (Abbasi-Yadkori et al., 2011):
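The explore/exploit decision just described can be sketched as follows; the exact exploitation test (Eq. 11) is simplified here to a scaled-count check, and all names and thresholds are illustrative assumptions:

```python
import numpy as np

def choose_action(A, counts, alloc_hat, theta_hat, t, te, eps_t):
    """One hedged step of the explore/exploit logic for the current context.
    counts[i]: pulls of arm i so far; alloc_hat: approximated allocation;
    te: number of exploration rounds so far; eps_t: forced-exploration rate."""
    target = np.log(max(t, 2)) * np.asarray(alloc_hat)
    under = [i for i in range(len(A)) if counts[i] < target[i]]
    if not under:                                   # criterion holds: exploit
        return int(np.argmax(A @ theta_hat))        # greedy action
    least = min(range(len(A)), key=lambda i: counts[i])
    if counts[least] < eps_t * te:                  # forced exploration
        return least
    # Otherwise track the allocation: play the most under-explored arm.
    return max(under, key=lambda i: target[i] - counts[i])
```

Once every arm's pull count exceeds its (log-scaled) allocation target, the sketch becomes purely greedy, mirroring the exploitation branch above.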
The naive forced exploration can be improved by calculating a barycentric spanner (Awerbuch and Kleinberg, 2008) for each action set and then playing the least played action in the spanner. In normal practical setups, where forced exploration plays a limited role, this makes very little difference. For a finite-time worst-case analysis, however, it may be crucial: otherwise the regret may depend linearly on the number of actions, while using the spanner guarantees the forced exploration is sample efficient.
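For completeness, here is a sketch of the Awerbuch–Kleinberg swap procedure for computing a C-approximate barycentric spanner; the initialization and stopping rule are simplified relative to the original algorithm:

```python
import numpy as np

def barycentric_spanner(X, C=2.0, max_sweeps=50):
    """Return row indices of a C-approximate barycentric spanner of the rows
    of X: every row of X is a combination of the selected rows with
    coefficients in [-C, C]. Assumes the rows of X span R^d."""
    n, d = X.shape
    # Start from any d rows forming an invertible matrix.
    idx = []
    for i in range(n):
        if np.linalg.matrix_rank(X[idx + [i]]) == len(idx) + 1:
            idx.append(i)
        if len(idx) == d:
            break
    B = X[idx].copy()
    for _ in range(max_sweeps):
        swapped = False
        for j in range(d):
            for i in range(n):
                trial = B.copy()
                trial[j] = X[i]                     # try swapping row j for x_i
                if abs(np.linalg.det(trial)) > C * abs(np.linalg.det(B)):
                    B, idx[j], swapped = trial, i, True
        if not swapped:                             # no C-improving swap left
            return idx
    return idx
```

Forcing exploration only over the spanner's d actions keeps the forced-exploration cost independent of the total number of actions.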
4.2 Asymptotic Upper Bound
Our main theorem is that Algorithm 1 is asymptotically optimal under mild assumptions. Suppose that is uniquely defined and is continuous at for all contexts and actions . Then the policy proposed in Algorithm 1 with satisfies
Together with the asymptotic lower bound in Theorem 3, we can argue that the optimal allocation matching algorithm is asymptotically optimal and the lower bound (4) is sharp. The assumption that is continuous at is used to ensure the stability of our algorithm. We prove that the uniqueness assumption actually implies continuity (see the supplementary material). There are, however, certain corner cases where uniqueness does not hold, for example when .
4.3 Proof Sketch
The complete proof is deferred to Appendix A.2 in the supplementary material. At a high level, the analysis of the optimization-based approach consists of three parts: (1) showing that the algorithm's estimate of the true parameter is close to the truth in finite time; (2) showing that the algorithm subsequently samples arms approximately according to the unknown optimal allocation; and (3) showing that, once arms have been sampled sufficiently according to the optimal allocation, the greedy action is optimal with high probability. Existing optimization-based algorithms suffer from dominant ‘lower-order’ terms because they use simple empirical means for part (1), while here we use the data-efficient least-squares estimator.
We denote by the set of exploration rounds, decomposed into the disjoint sets of forced exploration , unwasted exploration , and wasted exploration (LinUCB) , and let Exploit be the set of exploitation rounds.
Regret while exploiting
The criterion in Eq. 11 guarantees that the greedy action is optimal with high probability in exploitation rounds. To see this, note that if is an exploitation round, then the sub-optimality gap of greedy action satisfies the following with high probability:
Since the instantaneous regret either vanishes or is larger than , we have
Regret while exploring
Based on the design of our algorithm, the regret while exploring is decomposed into three kinds of explorations,
Shortly we argue that the regret incurred in is at most logarithmic and hence the regret in rounds associated with forced exploration is sub-logarithmic:
The regret in W-Explore is also sub-logarithmic. To see this, we first argue that , since each context has positive probability. Combining this with the fact that is logarithmic in and that the regret of LinUCB is square root in the time horizon,
The regret in UW-Explore is logarithmic in with the asymptotically optimal constant using the definition of the optimal allocation:
Of course many details have been hidden here, which are covered in detail in the supplementary material.
5 Experiments
In this section, we first empirically compare our proposed algorithm with LinUCB (Abbasi-Yadkori et al., 2011) on some specific problem instances to showcase their strengths and weaknesses. We examine OSSB (Combes et al., 2017) on instances with large action spaces to illustrate its weakness due to ignoring the linear structure. Since Combes et al. (2017) demonstrated that OSSB dominates the algorithm of Lattimore and Szepesvari (2017), we omit the latter from our experiments. Finally, we include a comparison with LinTS (Agrawal and Goyal, 2013).
To save computation, we follow a lazy-update approach similar to that proposed in Section 5.1 of Abbasi-Yadkori et al. (2011): the idea is to recompute the optimization problem (10) whenever increases by a constant factor, and in all scenarios we choose (the arbitrary value) . All code was written in Python. To solve the convex optimization problem (10), we use the CVXPY library (Diamond and Boyd, 2016).
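A sketch of the lazy-update trigger, under the assumption that the monitored quantity is the determinant of the design matrix (the doubling factor 2.0 and all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V = np.eye(2)                           # design matrix after initialisation
last_det = np.linalg.det(V)
recomputes = 0
for t in range(1, 501):
    x = rng.normal(size=2)              # feature of the arm played at round t
    V += np.outer(x, x)                 # rank-one design update each round
    det = np.linalg.det(V)
    if det > 2.0 * last_det:            # lazy trigger: determinant doubled
        recomputes += 1                 # (re-solve the allocation here)
        last_det = det
```

Since the determinant grows polynomially in t, the number of re-solves is only logarithmic in the horizon, which is the point of the lazy update.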
5.1 Fixed Action Set
Finite-armed linear bandits with fixed action set are a special case of linear contextual bandits. Let and let the true parameter be . The action set is fixed and , , . We consider . By construction, is the optimal arm. From Figure 1, we observe that LinUCB suffers significantly more regret than our algorithm. The reason is that if is very small, then and point in almost the same direction and so choosing only these arms does not provide sufficient information to quickly learn which of or is optimal. On the other hand, and point in very different directions and so choosing allows a learning agent to quickly identify that is in fact optimal. LinUCB stops pulling once it is optimistic and thus does not balance this trade-off between information and regret. Our algorithm, however, takes this into consideration by tracking the optimal allocation ratios.
Fixed action set. The results are averaged over 100 realizations. Here and also later, the shaded areas show the standard errors.
5.2 Changing Action Set
We consider a simple but representative case when there are only two action sets and available.
Scenario One. In each round, is drawn with probability 0.3 while is drawn with probability 0.7. Set contains , , and , while set contains , , and . The true parameter is . From the left panel of Figure 2, we observe that LinUCB, while it starts better, eventually suffers more regret than our algorithm again.
Scenario Two. In each round, is drawn with probability , while is drawn with probability . Set contains three actions: , , , while set contains three actions: , , . Clearly, and are the optimal arms for each action set and they span . Based on the allocation rule in Section 3.1, the algorithm is advised, on asymptotic grounds, to pull actions and very often. However, since the probability that is drawn is extremely small, we are very likely to fall back to wasted exploration and use LinUCB to explore. Thus, in the short term, our algorithm will suffer from the same drawback as optimistic algorithms, described in Section 5.1. Although the asymptotics will eventually “kick in”, it may take an extremely long time to see the benefits of this, and the algorithm’s finite-time performance will be poor. Indeed, this is seen in the right panel of Figure 2, which shows that the performance of our algorithm and that of LinUCB nearly coincide in this case.
5.3 Bounded Regret
In Section 3, we showed that when the optimal arms of all action sets span , our algorithm achieves sub-logarithmic regret. Through experiments, we show that the algorithm can even suffer bounded regret. We consider . At each round, is drawn with probability 0.8, while is drawn with probability 0.2, and the true parameter is . Set contains three actions: , , , while set contains three actions: , , . As discussed before, and are the optimal arms for each action set and they span . The results are shown in the left subpanel of Figure 3. Our algorithm achieves bounded regret in a relatively short time period. Interestingly, we found that LinUCB can also achieve bounded regret when the optimal arms of changing action sets span .
5.4 Large Action Space
When the action space is large, OSSB suffers significantly larger regret and performs unstably because it ignores the linear structure. The regret of (the theoretically justified version of) LinTS is also very large due to the unnecessary variance factor required by its theory.
6 Conclusion
We presented a new optimization-based algorithm for linear contextual bandits that is asymptotically optimal and adapts to both the action sets and the unknown parameter. The new algorithm enjoys sub-logarithmic regret when the collection of optimal actions spans , a property that we also prove for optimism-based approaches. There are many open questions. A natural starting point is to prove near-minimax optimality of the new algorithm, possibly with minor modifications. Our work also highlights the dangers of focusing too intensely on asymptotics, which for contextual bandits completely hide the dependence on the context distribution. This motivates the intriguing challenge of understanding the finite-time instance-dependent regret. Another open direction is to consider the asymptotics when the context space is continuous, which has received little attention.
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems.
- Agarwal et al. (2009) Agarwal, D., Chen, B.-C., Elango, P., Motgi, N., Park, S.-T., Ramakrishnan, R., Roy, S. and Zachariah, J. (2009). Online models for content optimization. In Advances in Neural Information Processing Systems.
- Agrawal and Goyal (2013) Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning.
- Awerbuch and Kleinberg (2008) Awerbuch, B. and Kleinberg, R. (2008). Online linear optimization and adaptive routing. Journal of Computer and System Sciences 74 97–114.
- Bastani et al. (2017) Bastani, H., Bayati, M. and Khosravi, K. (2017). Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011 .
- Chan and Lai (2006) Chan, H. P. and Lai, T. L. (2006). Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Analysis 25 179–201.
- Chu et al. (2011) Chu, W., Li, L., Reyzin, L. and Schapire, R. E. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
- Combes et al. (2017) Combes, R., Magureanu, S. and Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems.
- Dani et al. (2008) Dani, V., Hayes, T. P. and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Conference on Learning Theory.
- Degenne et al. (2019) Degenne, R., Koolen, W. M. and Ménard, P. (2019). Non-asymptotic pure exploration by solving games. arXiv preprint arXiv:1906.10431 .
- Diamond and Boyd (2016) Diamond, S. and Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17 1–5.
- Garivier and Kaufmann (2016) Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In 29th Annual Conference on Learning Theory (V. Feldman, A. Rakhlin and O. Shamir, eds.), vol. 49 of Proceedings of Machine Learning Research. PMLR, Columbia University, New York, New York, USA.
- Kannan et al. (2018) Kannan, S., Morgenstern, J. H., Roth, A., Waggoner, B. and Wu, Z. S. (2018). A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems.
- Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 4–22.
- Langford and Zhang (2007) Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems.
- Lattimore and Szepesvari (2017) Lattimore, T. and Szepesvari, C. (2017). The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics.
- Lattimore and Szepesvári (2019) Lattimore, T. and Szepesvári, C. (2019). Bandit algorithms. preprint .
- Li et al. (2010) Li, L., Chu, W., Langford, J. and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. WWW ’10, ACM, New York, NY, USA.
- Ok et al. (2018) Ok, J., Proutiere, A. and Tranos, D. (2018). Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems.
- Rusmevichientong and Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research 35 395–411.
- Tsybakov (2008) Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. 1st ed. Springer Publishing Company, Incorporated.
- Vershynin (2010) Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 .
Appendix A Proofs of Asymptotic Lower and Upper Bounds
First of all, we define the sub-optimal action set as and denote and .
a.1 Proof of Lemma 3
The proof idea is as follows: if is not sufficiently large in every direction, then some alternative parameters are not sufficiently identifiable.
We fix a consistent policy and fix a context as well as a sub-optimal arm . Consider another parameter such that it is close to but is not the optimal arm in bandit for action set . Specifically, we construct
where is some positive semi-definite matrix and is some absolute constant that will be specified later. Since the sub-optimality gap satisfies
which ensures that is -suboptimal in bandit .
We define and let and be the measures on the sequence of outcomes induced by the interaction between the policy and the bandit and respectively. By the definition of in (2), we have
Applying the Bretagnolle–Huber inequality and the divergence decomposition lemma, it holds that for any event ,
In the following, we derive a lower bound on ,
where the first inequality comes from the fact that for all . Define the event as follows,
On the event , the optimal action of action set is pulled in at most half of the total rounds. Then it holds that
Define another event as follows,
where will be chosen later and is the probability that the environment picks context . From the definition of , we have . By the standard Hoeffding inequality (Vershynin, 2010), it holds that
Letting , we have
On the other hand, let the expectation be taken with respect to the probability measure . Then can be lower bounded as follows,
where we discard all the sub-optimality gap terms except . Using the fact that is -suboptimal, it holds that
Combining the lower bounds of and together, it holds that