
Adaptive Exploration in Linear Contextual Bandit

10/15/2019
by   Botao Hao, et al.

Contextual bandits serve as a fundamental model for many sequential decision making tasks. The most popular theoretically justified approaches are based on the optimism principle. While these algorithms can be practical, they are known to be suboptimal asymptotically (Lattimore and Szepesvari, 2017). On the other hand, existing asymptotically optimal algorithms for this problem do not exploit the linear structure in an optimal way and suffer from lower-order terms that dominate the regret in all practically interesting regimes. We start to bridge the gap by designing an algorithm that is asymptotically optimal and has good finite-time empirical performance. At the same time, we make connections to the recent literature on when exploration-free methods are effective. Indeed, if the distribution of contexts is well behaved, then our algorithm acts mostly greedily and enjoys sub-logarithmic regret. Furthermore, our approach is adaptive in the sense that it automatically detects the nice case. Numerical results demonstrate significant regret reductions by our method relative to several baselines.


1 Introduction

The contextual linear bandit is a practical setting for many sequential decision-making problems, especially in on-line applications (Agarwal et al., 2009; Li et al., 2010). Our main contribution is a new algorithm that is asymptotically optimal, computationally efficient and empirically well-behaved in finite-time regimes. As a consequence of asymptotic optimality, the algorithm adapts to certain easy cases where it achieves sub-logarithmic regret.

Popular approaches for regret minimisation in contextual bandits include ε-greedy (Langford and Zhang, 2007), explicit optimism-based algorithms (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011), and implicit ones, such as Thompson sampling (Agrawal and Goyal, 2013). Although these algorithms enjoy near-optimal worst-case guarantees and can be quite practical, they are known to be arbitrarily suboptimal in the asymptotic regime, even in the non-contextual linear bandit (Lattimore and Szepesvari, 2017).

We propose an optimization-based algorithm that estimates and tracks the optimal allocation for each context/action pair. This technique is best known for its effectiveness in pure exploration (Chan and Lai, 2006; Garivier and Kaufmann, 2016; Degenne et al., 2019, and others). The approach has also been used for regret minimisation in linear bandits with fixed action sets (Lattimore and Szepesvari, 2017) and in structured bandits (Combes et al., 2017). The last two articles provide algorithms for the non-contextual case and hence cannot be applied directly to our setting. More importantly, however, the algorithms are not practical. The first algorithm uses a complicated three-phase construction that barely updates its estimates. The second algorithm is not designed to handle large action spaces and has a ‘lower-order’ term in the regret that depends linearly on the number of actions and dominates the regret in all practical regimes. This lower-order term is not merely an artifact of the analysis, but is also reflected in the experiments (see Section 5.4 for details).

The most related work is by Ok et al. (2018), who study a reinforcement learning setting. A stochastic contextual bandit can be viewed as a Markov decision process where the state represents the context and the transition is independent of the action. The general structured setting studied in that paper means our problem is covered by their algorithm. Again, however, the algorithm is too general to exploit the specific structure of the contextual bandit problem. Their algorithm is asymptotically optimal, but suffers from lower-order terms that are linear in the number of actions and dominate the regret in all practically interesting regimes. In contrast, our algorithm is not only asymptotically optimal, but also practical in finite-horizon regimes, as demonstrated by our experiments.

The contextual linear bandit also serves as an interesting example where the asymptotics of the problem are not indicative of what should be expected in finite time (see the second scenario in Section 5.2). This is in contrast to many other bandit models where the asymptotic regret is also roughly optimal in finite time (Lattimore and Szepesvári, 2019). There is an important lesson here: designing algorithms that optimize for the asymptotic regret may make huge sacrifices in finite time.

Another interesting phenomenon is related to the idea of ‘natural exploration’ that occurs in contextual bandits (Bastani et al., 2017; Kannan et al., 2018). A number of authors have started to investigate the striking performance of greedy algorithms in contextual bandits. In most bandit settings the greedy policy does not explore sufficiently and suffers linear regret. In some contextual bandit problems, however, the changing features ensure the algorithm cannot help but explore. Our algorithm and analysis highlight this effect (see Section 3.1 for details). If the context distribution is sufficiently rich, then the algorithm is eventually almost completely greedy and enjoys sub-logarithmic regret. An advantage of our approach is that we do not need strong assumptions on the context distribution: the algorithm adapts to the problem in a data-dependent fashion, in the sense that even when the context distribution is not sufficiently rich, we preserve the usual optimality guarantee. As another contribution, we also prove that algorithms based on optimism enjoy sub-logarithmic regret in this setting (Section 3.1).

The rest of the paper is organized as follows. Section 2 introduces the basic problem setup. Section 3 studies the asymptotic regret lower bound for the linear contextual bandit. Section 4 introduces our optimal allocation matching algorithm and presents the asymptotic regret upper bound. Section 5 reports several experiments. Section 6 discusses interesting directions for future work.

Notation

Let $[n] = \{1, 2, \dots, n\}$ for a positive integer $n$. For a positive definite matrix $A$ and a vector $x$, we denote $\|x\|_A = \sqrt{x^\top A x}$. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$.

2 Problem Setting

We consider the stochastic -armed contextual linear bandit with a horizon of rounds and possible contexts. The assumption that the contexts are discrete is often reasonable in practice; for instance, in a recommender system users are often clustered into finitely many user groups. For each context there is a known feature/action set with . The interaction protocol is as follows. First, the environment samples a sequence of independent contexts from an unknown distribution over , and each context is assumed to appear with positive probability. At the start of round the context is revealed to the learner, who may use their observations to choose an action . The reward is , where is a sequence of independent standard Gaussian random variables and is an unknown parameter. The Gaussian assumption can be relaxed to a conditional sub-Gaussian assumption for the regret upper bound, but it is necessary for the regret lower bound. Throughout, we consider a frequentist setting in the sense that is fixed. For simplicity, we assume each spans and for all .

The performance metric is the cumulative expected regret, which measures the difference between the expected cumulative reward collected by the omniscient policy that knows and the learner’s expected cumulative reward. The optimal arm associated with context is . Then the expected cumulative regret of a policy when facing the bandit determined by is

Note that this cumulative regret also depends on the context distribution and action sets. They are omitted from the notation to reduce clutter and because there will never be ambiguity.
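To make the protocol concrete, the following is a minimal simulation sketch in Python. It assumes the standard linear-reward model described above (the reward is the inner product of the chosen feature vector with the unknown parameter, plus standard Gaussian noise); the function and variable names are ours and are introduced only for illustration.

```python
import numpy as np

def run_contextual_bandit(policy, theta, action_sets, context_probs, horizon, seed=0):
    """Simulate the interaction protocol described above (illustrative sketch).

    theta         : unknown parameter vector, shape (d,)
    action_sets   : list of arrays; action_sets[c] has shape (K_c, d)
    context_probs : probability of each context (all entries positive)
    policy        : callable (context_index, action_set, history) -> arm index
    Returns the cumulative expected regret over `horizon` rounds.
    """
    rng = np.random.default_rng(seed)
    history, regret = [], 0.0
    for t in range(horizon):
        c = rng.choice(len(action_sets), p=context_probs)   # context is revealed
        A = action_sets[c]
        a = policy(c, A, history)                           # learner chooses an action
        reward = A[a] @ theta + rng.standard_normal()       # linear reward + N(0, 1) noise
        history.append((c, A[a], reward))
        regret += np.max(A @ theta) - A[a] @ theta          # expected regret increment
    return regret
```

Any learner that maps the revealed action set and the interaction history to an action index can be plugged in through this interface, including the baselines compared in Section 5.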

3 Asymptotic Lower Bound

We investigate the fundamental limit of the linear contextual bandit by deriving its instance-dependent asymptotic lower bound. First, we define the class of policies under consideration.

[Consistent Policy] A policy is called consistent if the regret is subpolynomial for any bandit in that class and all context distributions:

(1)

The next lemma is the key ingredient in proving the asymptotic lower bound. Fix a context and let be the suboptimality gap. Furthermore, let .

Assume that for all and that is uniquely defined for each context and let be consistent. Then for sufficiently large the expected covariance matrix

(2)

is invertible. Furthermore, for any context and any arm ,

(3)

The proof is deferred to Appendix A.1 in the supplementary material. Intuitively, the lemma shows that any consistent policy must collect sufficient statistical evidence at confidence level that suboptimal arms really are suboptimal. This corresponds to ensuring that the width of an appropriate confidence interval is approximately smaller than the sub-optimality gap .

[Asymptotic Lower Bound] Under the same conditions as Lemma 3,

(4)

where is defined as the optimal value of the following optimization problem:

(5)

subject to the constraint that for any context and suboptimal arm ,

(6)

Given the result in Lemma 3, the proof of Theorem 3 follows exactly the same idea as the proof of Corollary 2 in Lattimore and Szepesvari (2017) and is thus omitted here. Later we will prove a matching upper bound in Theorem 4.2 and argue that our asymptotic lower bound is sharp.

In the above we adopt the convention that so that whenever . The inverse of a matrix with infinite entries is defined by passing to the limit in the obvious way, and is not technically an inverse.

Let us denote by an optimal solution to the above optimization problem. It serves as the optimal allocation rule: the cumulative regret is minimized subject to the constraint that the confidence interval of each sub-optimal arm is sufficiently narrow. Specifically, can be interpreted as the approximately optimal number of times arm should be played when context has been observed.
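As an illustration of how such an allocation can be computed, the following CVXPY sketch solves a program of the form (5)–(6). Since the displayed equations are not reproduced here, the constraint below uses the standard form from Lattimore and Szepesvari (2017), namely that the weighted confidence width of every sub-optimal arm, measured in the inverse design matrix induced by the allocation, is at most half the squared gap; the exact normalisation in (5)–(6) may differ.

```python
import cvxpy as cp
import numpy as np

def optimal_allocation(action_sets, gaps):
    """Sketch of the lower-bound allocation program.

    action_sets[c] : array of shape (K_c, d) of feature vectors for context c
    gaps[c]        : array of sub-optimality gaps for the same arms
    Returns one allocation array per context.
    """
    alphas = [cp.Variable(A.shape[0], nonneg=True) for A in action_sets]
    # Design matrix induced by the allocation.
    H = sum(alphas[c][a] * np.outer(A[a], A[a])
            for c, A in enumerate(action_sets) for a in range(A.shape[0]))
    constraints = []
    for c, A in enumerate(action_sets):
        for a in range(A.shape[0]):
            if gaps[c][a] > 0:  # constraint only for sub-optimal arms
                constraints.append(cp.matrix_frac(A[a], H) <= gaps[c][a] ** 2 / 2)
    objective = cp.Minimize(sum(alphas[c] @ gaps[c] for c in range(len(action_sets))))
    cp.Problem(objective, constraints).solve()
    return [al.value for al in alphas]
```

Scaled by a logarithmic factor in the horizon, the returned allocation plays the role of the quantity interpreted above: it prescribes approximately how often each sub-optimal arm should be played in each context.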

Our lower bound may also be derived from the general bound in Ok et al. (2018), since a stochastic contextual bandit can be viewed as a kind of Markov decision process. We use an alternative proof technique and the two lower bound statements have different forms. The proof is included for completeness.

When and is the standard basis vectors, the problem reduces to classical multi-armed bandit and , which matches the well-known asymptotic lower bound by Lai and Robbins (1985).

The constant depends on both the unknown parameter and the action sets , but not on the context distribution . In this sense there is a certain discontinuity in the hardness measure as a function of the context distribution. More precisely, problems where is arbitrarily close to zero may have asymptotic regret different from that of the problem obtained by removing the context entirely. Clearly, as tends to zero the th context is observed with vanishingly small probability in finite time, and hence the asymptotically optimal regret may not be representative of the finite-time hardness.

3.1 Sub-logarithmic regret

Our matching upper and lower bounds reveal the interesting phenomenon that if the action sets satisfy certain conditions, then sub-logarithmic regret is possible. Consider the scenario that the set of optimal arms spans . Let be a large constant to be defined subsequently and for each context and arm let

Then,

(7)

Since the set of optimal arms spans it holds that for any context and arm ,

(8)

Combining Eq. 7 and Eq. 8,

Hence, the constraint in Eq. 6 is satisfied for sufficiently large . Since with this choice of we have , it follows that . Therefore our upper bound will show that when the set of optimal actions spans our new algorithm satisfies

The choice of above shows that when the optimal arms span , an asymptotically optimal algorithm only needs to play suboptimal arms sub-logarithmically often, which means the algorithm is eventually very close to the greedy algorithm. Bastani et al. (2017) and Kannan et al. (2018) also investigate the striking performance of greedy algorithms in contextual bandits. However, Bastani et al. (2017) assume covariate diversity of the context distribution, while Kannan et al. (2018) assume the contexts are artificially perturbed with noise. Both conditions are hard to check in practice. In addition, Bastani et al. (2017) only provide a rate-optimal algorithm, while our algorithm is optimal including the leading constant (see Theorem 4.2 for details).
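For concreteness, the argument of this subsection can be written out explicitly. The notation below (allocation $\alpha$, gaps $\Delta(b,a)$, optimal arm $a^*_b$ of context $b$, design matrix $H(\alpha)$) is ours, and the constant $2$ in the constraint follows the standard form assumed earlier rather than the paper's exact Eq. 6:

$$
\alpha(b,a) \;=\; \begin{cases} c & \text{if } a = a^*_b,\\ 0 & \text{otherwise,}\end{cases}
\qquad
H(\alpha) \;=\; c\sum_{b} a^*_b (a^*_b)^\top \;\succeq\; c\,\lambda_{\min} I \;\succ\; 0,
$$
$$
\|a\|^2_{H(\alpha)^{-1}} \;\le\; \frac{\|a\|^2}{c\,\lambda_{\min}} \;\le\; \frac{\Delta(b,a)^2}{2}
\quad\text{for all sufficiently large } c,
\qquad
\sum_{b,a} \alpha(b,a)\,\Delta(b,a) \;=\; c \sum_b \Delta(b, a^*_b) \;=\; 0 .
$$

Here $\lambda_{\min}$ is the smallest eigenvalue of $\sum_b a^*_b (a^*_b)^\top$, which is positive because the optimal arms span the whole space. The feasible allocation above therefore has zero cost, the optimal value of (5) vanishes, and suboptimal arms need only be played sub-logarithmically often.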

As claimed in the introduction, we also prove that algorithms based on optimism can enjoy bounded regret when the set of optimal actions spans the space of all actions. The proof of the following theorem is given in the supplementary material.

Consider the policy that plays optimistically by

Suppose that is such that spans . Then, for suitable with , the expected regret .

Note that the choice of for which the above theorem holds also guarantees the standard minimax bound for this algorithm, showing that LinUCB can adapt online to this nice case.
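For reference, the following is a minimal sketch of the optimistic arm selection used by LinUCB-type policies such as the one in the theorem above. The confidence radius is left as a parameter because the specific choice (later given in Eq. 9) is not reproduced in this extract.

```python
import numpy as np

def linucb_arm(A, theta_hat, G, beta):
    """Optimistic (LinUCB-style) arm selection: maximise the empirical value
    plus an exploration bonus proportional to the confidence width.

    A         : (K, d) action set of the current context
    theta_hat : least-squares estimate of the unknown parameter
    G         : (regularised) Gram matrix of previously played features
    beta      : confidence radius (placeholder for the paper's choice in (9))
    """
    G_inv = np.linalg.inv(G)
    widths = np.sqrt(np.einsum('ki,ij,kj->k', A, G_inv, A))  # ||a||_{G^{-1}} per arm
    ucb = A @ theta_hat + beta * widths
    return int(np.argmax(ucb))
```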

4 Optimal Allocation Matching

The instance-dependent asymptotic lower bound provides an optimal allocation rule. However, the optimal allocation depends on the unknown sub-optimality gap. In this section, we present a novel matching algorithm that simultaneously estimates the unknown parameter using least squares and updates the allocation rule.

4.1 Algorithm

Let be the number of pulls of arm after round and . The least squares estimator is . For each context the estimated sub-optimality gap of arm is and the estimated optimal arm is . The minimum nonzero estimated gap is

Next we define an optimization problem similar to (5), but with a different normalisation.

Let be the constant given by

(9)

where is an absolute constant. We write . For any define as a solution of the following optimization problem:

(10)

subject to

and that is invertible.

If is an estimate of , we call the solution an approximated allocation rule in contrast to the optimal allocation rule defined in Remark 3. Our algorithm alternates between exploration and exploitation, depending on whether or not all the arms have satisfied the approximated allocation rule. We are now ready to describe the algorithm, which starts with a brief initialisation phase.

Initialisation

In the first rounds the algorithm chooses any action in the action set such that is not in the span of . This is always possible by the assumption that spans for all contexts . At the end of the initialisation phase is guaranteed to be invertible.
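A small sketch of this initialisation phase, with illustrative names: pull any action whose feature vector lies outside the span of the features pulled so far, until the Gram matrix has full rank and is therefore invertible.

```python
import numpy as np

def initialise(context_stream, d):
    """Initialisation-phase sketch. `context_stream` yields one action set
    (an array of shape (K, d)) per round. Returns the pulled feature vectors,
    whose Gram matrix sum_s A_s A_s^T is invertible on exit."""
    pulled = np.empty((0, d))
    for A in context_stream:
        if np.linalg.matrix_rank(pulled) == d:
            break                                    # Gram matrix is now invertible
        for a in A:
            if np.linalg.matrix_rank(np.vstack([pulled, a])) > np.linalg.matrix_rank(pulled):
                pulled = np.vstack([pulled, a])      # a lies outside the current span
                break
    return pulled
```

Because every action set is assumed to span the whole space, such an action always exists while the rank is deficient, so the phase ends after at most as many rounds as the dimension.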

Main phase

In each round after the initialisation phase the algorithm checks if the following criterion holds for any :

(11)

The algorithm exploits if Eq. 11 holds and explores otherwise, as explained below.

Exploitation.

The algorithm exploits by taking the greedy action:

(12)

Exploration.

The algorithm explores when Eq. 11 does not hold. This means that some actions have not been explored sufficiently. There are two cases to consider. First, when there exists an such that

the algorithm then computes two actions

(13)

Let be the number of exploration rounds defined in Algorithm 1. If the algorithm plays arm , which executes the forced exploration. Otherwise it plays arm . Finally, rounds in which there does not exist such an are called wasted. In these rounds the algorithm acts optimistically, as in LinUCB (Abbasi-Yadkori et al., 2011):

(14)

where is defined in Eq. 9. The complete algorithm is presented in Algorithm 1.

The naive forced exploration can be improved by calculating a barycentric spanner (Awerbuch and Kleinberg, 2008) for each action set and then playing the least played action in the spanner. In typical practical setups, where the forced exploration plays a limited role, this makes very little difference. For a finite-time worst-case analysis, however, it may be crucial: otherwise the regret may depend linearly on the number of actions, whereas using the spanner guarantees that the forced exploration is sample efficient.

4.2 Asymptotic Upper Bound

Our main theorem is that Algorithm 1 is asymptotically optimal under mild assumptions. Suppose that is uniquely defined and is continuous at for all contexts and actions . Then the policy proposed in Algorithm 1 with satisfies

(15)

Together with the asymptotic lower bound in Theorem 3, we can argue that the optimal allocation matching algorithm is asymptotically optimal and that the lower bound (4) is sharp. The assumption that is continuous at is used to ensure the stability of our algorithm. We prove that the uniqueness assumption actually implies the continuity (see the supplementary material). There are, however, certain corner cases where uniqueness does not hold, for example when .

Input: exploration parameter , exploration counter .
# initialisation
for  to  do
      Observe an action set , pull arm such that is not in the span of .
end for
for  to  do
      Observe an action set and compute the optimization problem (10) based on the estimated gap .
      if  then
            # exploitation
            Pull arm .
      else
            # exploration
            if  then
                  Pull arm according to LinUCB in (14).
            else
                  Calculate in (13).
                  if  then
                        Pull arm .
                  else
                        Pull arm .
                  end if
            end if
      end if
      Update .
end for
Algorithm 1: Optimal Allocation Matching
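To complement the pseudocode, here is a schematic Python sketch of one round of the main phase. The displayed conditions (11) and (13) and the forced-exploration test are not reproduced in this extract, so the tests below (the logarithmic scaling, `forced_threshold`, and the callable `linucb_arm`, e.g. the sketch from Section 3.1) are placeholders standing in for them rather than the paper's exact rules.

```python
import numpy as np

def omm_round(A_t, counts_t, alloc_t, criterion_holds, theta_hat,
              n_explore, forced_threshold, linucb_arm, log_t):
    """One main-phase round of optimal allocation matching (control-flow sketch).

    A_t             : (K, d) action set of the observed context
    counts_t        : empirical pull counts of these K arms
    alloc_t         : approximated allocation for these arms from (10)
    criterion_holds : whether the exploitation criterion (11) is satisfied
    n_explore       : exploration counter from Algorithm 1
    """
    if criterion_holds:
        return int(np.argmax(A_t @ theta_hat))        # exploitation: greedy arm (12)
    deficit = alloc_t * log_t - counts_t              # how far each arm lags its allocation
    under = np.where(deficit > 0)[0]
    if under.size == 0:
        return linucb_arm(A_t)                        # wasted round: optimistic fallback (14)
    if n_explore <= forced_threshold:                 # forced exploration of the least pulled arm
        return int(np.argmin(counts_t))
    return int(under[np.argmax(deficit[under])])      # track the most under-allocated arm (13)
```

After the pull, the counts, the least-squares estimate and the exploration counter are updated as in the last lines of Algorithm 1.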

4.3 Proof Sketch

The complete proof is deferred to Appendix A.2 in the supplementary material. At a high level, the analysis of the optimization-based approach consists of three parts: (1) showing that the algorithm’s estimate of the true parameter is close to the truth in finite time; (2) showing that the algorithm subsequently samples arms approximately according to the unknown optimal allocation; and (3) showing that the greedy action is optimal with high probability once arms have been sampled sufficiently according to the optimal allocation. Existing optimization-based algorithms suffer from dominant ‘lower-order’ terms because they use simple empirical means for part (1), while here we use the data-efficient least-squares estimator.

We denote as the set of exploration rounds, decomposed into disjoint sets of forced exploration , unwasted exploration and wasted exploration (LinUCB), and let Exploit be the set of exploitation rounds.
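In this notation, the decomposition underlying the analysis can be written as follows, where $\Delta_t$ denotes the instantaneous regret of the arm played in round $t$ (the symbols are ours):

$$
R_n \;=\; \mathbb{E}\Big[\sum_{t \in \mathrm{Exploit}} \Delta_t\Big]
 \;+\; \mathbb{E}\Big[\sum_{t \in \mathrm{F\text{-}Explore}} \Delta_t\Big]
 \;+\; \mathbb{E}\Big[\sum_{t \in \mathrm{W\text{-}Explore}} \Delta_t\Big]
 \;+\; \mathbb{E}\Big[\sum_{t \in \mathrm{UW\text{-}Explore}} \Delta_t\Big].
$$

As argued below, the first three terms are sub-logarithmic, while the last term is logarithmic with the asymptotically optimal constant.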

Regret while exploiting

The criterion in Eq. 11 guarantees that the greedy action is optimal with high probability in exploitation rounds. To see this, note that if is an exploitation round, then the sub-optimality gap of the greedy action satisfies the following with high probability:

Since the instantaneous regret either vanishes or is larger than , we have

Regret while exploring

Based on the design of our algorithm, the regret while exploring decomposes according to the three kinds of exploration,

Shortly we argue that the regret incurred in is at most logarithmic and hence the regret in rounds associated with forced exploration is sub-logarithmic:

The regret in W-Explore is also sub-logarithmic. To see this, we first argue that since each context has positive probability. Combining this with the fact that is logarithmic in and that the regret of LinUCB is square root in the time horizon,

The regret in UW-Explore is logarithmic in with the asymptotically optimal constant using the definition of the optimal allocation:

Many details have been omitted here; they are covered in the supplementary material.

5 Experiments

In this section, we first empirically compare our proposed algorithm and LinUCB (Abbasi-Yadkori et al., 2011) on some specific problem instances to showcase their strengths and weaknesses. We then examine OSSB (Combes et al., 2017) on instances with large action spaces to illustrate its weakness due to ignoring the linear structure. Since Combes et al. (2017) demonstrated that OSSB dominates the algorithm of Lattimore and Szepesvari (2017), we omit the latter from our experiments. Finally, we include a comparison with LinTS (Agrawal and Goyal, 2013).

To save computation, we follow a lazy-update approach similar to that proposed in Section 5.1 of Abbasi-Yadkori et al. (2011): the idea is to recompute the optimization problem (10) whenever increases by a constant factor, and in all scenarios we choose (the arbitrary value) . All code was written in Python. To solve the convex optimization problem (10), we use the CVXPY library (Diamond and Boyd, 2016).
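A sketch of the lazy-update rule, with an illustrative factor of 2 standing in for the (elided) value chosen in the experiments:

```python
def lazily_recompute(counter, state, solve_program, factor=2.0):
    """Re-solve the allocation program (10) only when `counter` has grown by a
    constant factor since the last solve; otherwise reuse the cached solution."""
    if state.get("alloc") is None or counter >= factor * state["last"]:
        state["alloc"] = solve_program()   # e.g. the CVXPY sketch in Section 3
        state["last"] = max(counter, 1)
    return state["alloc"]
```

Here `state` starts as an empty dictionary and `counter` is whichever quantity the lazy-update rule tracks (elided above); `solve_program` wraps the CVXPY call.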

5.1 Fixed Action Set

Finite-armed linear bandits with fixed action set are a special case of linear contextual bandits. Let and let the true parameter be . The action set is fixed and , , . We consider . By construction, is the optimal arm. From Figure 1, we observe that LinUCB suffers significantly more regret than our algorithm. The reason is that if is very small, then and point in almost the same direction and so choosing only these arms does not provide sufficient information to quickly learn which of or is optimal. On the other hand, and point in very different directions and so choosing allows a learning agent to quickly identify that is in fact optimal. LinUCB stops pulling once it is optimistic and thus does not balance this trade-off between information and regret. Our algorithm, however, takes this into consideration by tracking the optimal allocation ratios.

Figure 1: Fixed action set. The results are averaged over 100 realizations. Here and later, the shaded areas show the standard errors.

5.2 Changing Action Set

We consider a simple but representative case when there are only two action sets and available.

Scenario One. In each round, is drawn with probability 0.3 while is drawn with probability 0.7. Set contains , , and , while set contains , , and . The true parameter is . From the left panel of Figure 2, we observe that LinUCB, although it starts off better, eventually again suffers more regret than our algorithm.

Figure 2: Changing action sets. The left panel is for scenario one and the right panel is for scenario two. The results are averaged over 100 realizations.

Scenario Two. In each round, is drawn with probability , while is drawn with probability . Set contains three actions: , , , while set contains three actions: , , . Clearly, and are the optimal arms of the two action sets and they span . According to the allocation rule in Section 3.1, the asymptotically optimal algorithm should pull actions and very often. However, since the probability that is drawn is extremely small, we are very likely to fall back to wasted exploration and use LinUCB to explore. Thus, in the short term, our algorithm suffers from the same drawback as optimistic algorithms, described in Section 5.1. Although the asymptotics will eventually “kick in”, it may take an extremely long time to see the benefits, and the algorithm’s finite-time performance will be poor. Indeed, this is seen in the right panel of Figure 2, which shows that the performance of our algorithm and that of LinUCB nearly coincide in this case.

5.3 Bounded Regret

In Section 3, we showed that when the optimal arms of all action sets span , our algorithm achieves sub-logarithmic regret. Through experiments, we show that the algorithm can even achieve bounded regret. We consider . At each round, is drawn with probability 0.8 while is drawn with probability 0.2, and the true parameter is . Set contains three actions: , , , while set contains three actions: , , . As discussed before, and are the optimal arms of the two action sets and they span . The results are shown in the left panel of Figure 3. Our algorithm achieves bounded regret within a relatively short time period. Interestingly, we found that LinUCB can also achieve bounded regret when the optimal arms of the changing action sets span .

Figure 3: The left panel is for bounded regret and right panel is for large action space. The results are averaged over 100 realizations.

5.4 Large Action Space

We consider the finite-armed linear bandit with a fixed action set. Let and . We generate actions uniformly distributed on the -dimensional unit sphere. The results are shown in the right panel of Figure 3. When the action space is large, OSSB suffers significantly larger regret and is unstable because it ignores the linear structure. The regret of (the theoretically justified version of) LinTS is also very large due to the unnecessary variance factor required by its theory.

6 Discussion

We presented a new optimization-based algorithm for linear contextual bandits that is asymptotically optimal and adapts to both the action sets and the unknown parameter. The new algorithm enjoys sub-logarithmic regret when the collection of optimal actions spans , a property that we also prove for optimism-based approaches. There are many open questions. A natural starting point is to prove near-minimax optimality of the new algorithm, possibly with minor modifications. Our work also highlights the dangers of focusing too intensely on asymptotics, which for contextual bandits completely hide the dependence on the context distribution. This motivates the intriguing challenge of understanding the finite-time instance-dependent regret. Another open direction is to consider the asymptotics when the context space is continuous, a setting that has so far received little attention.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems.
  • Agarwal et al. (2009) Agarwal, D., Chen, B.-C., Elango, P., Motgi, N., Park, S.-T., Ramakrishnan, R., Roy, S. and Zachariah, J. (2009). Online models for content optimization. In Advances in Neural Information Processing Systems.
  • Agrawal and Goyal (2013) Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning.
  • Awerbuch and Kleinberg (2008) Awerbuch, B. and Kleinberg, R. (2008). Online linear optimization and adaptive routing. Journal of Computer and System Sciences 74 97–114.
  • Bastani et al. (2017) Bastani, H., Bayati, M. and Khosravi, K. (2017). Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011 .
  • Chan and Lai (2006) Chan, H. P. and Lai, T. L. (2006). Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Analysis 25 179–201.
  • Chu et al. (2011) Chu, W., Li, L., Reyzin, L. and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
  • Combes et al. (2017) Combes, R., Magureanu, S. and Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems.
  • Dani et al. (2008) Dani, V., Hayes, T. P. and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback .
  • Degenne et al. (2019) Degenne, R., Koolen, W. M. and Ménard, P. (2019). Non-asymptotic pure exploration by solving games. arXiv preprint arXiv:1906.10431 .
  • Diamond and Boyd (2016) Diamond, S. and Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17 1–5.
  • Garivier and Kaufmann (2016) Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In 29th Annual Conference on Learning Theory (V. Feldman, A. Rakhlin and O. Shamir, eds.), vol. 49 of Proceedings of Machine Learning Research. PMLR, Columbia University, New York, New York, USA.
  • Kannan et al. (2018) Kannan, S., Morgenstern, J. H., Roth, A., Waggoner, B. and Wu, Z. S. (2018). A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems.
  • Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 4–22.
  • Langford and Zhang (2007) Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems.
  • Lattimore and Szepesvari (2017) Lattimore, T. and Szepesvari, C. (2017). The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics.
  • Lattimore and Szepesvári (2019) Lattimore, T. and Szepesvári, C. (2019). Bandit algorithms. preprint .
  • Li et al. (2010) Li, L., Chu, W., Langford, J. and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. WWW ’10, ACM, New York, NY, USA.
    URL http://doi.acm.org/10.1145/1772690.1772758
  • Ok et al. (2018) Ok, J., Proutiere, A. and Tranos, D. (2018). Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems.
  • Rusmevichientong and Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research 35 395–411.
  • Tsybakov (2008) Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. 1st ed. Springer Publishing Company, Incorporated.
  • Vershynin (2010) Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 .

Appendix A Proofs of Asymptotic Lower and Upper Bounds

First of all, we define the sub-optimal action set as and denote and .

A.1 Proof of Lemma 3

The proof idea is as follows: if is not sufficiently large in every direction, then some alternative parameters are not sufficiently identifiable.

Step One.

We fix a consistent policy and fix a context as well as a sub-optimal arm . Consider another parameter such that it is close to but is not the optimal arm in bandit for action set . Specifically, we construct

where is some positive semi-definite matrix and is some absolute constant that will be specified later. Since the sub-optimality gap satisfies

(16)

it follows that is -suboptimal in bandit .

We define and let and be the measures on the sequence of outcomes induced by the interaction between the policy and the bandit and respectively. By the definition of in (2), we have

Applying the Bretagnolle-Huber inequality and the divergence decomposition lemma (both stated in the supplementary material), it holds that for any event ,

(17)
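For reference, the two ingredients take the following standard form for unit-variance Gaussian noise, in notation we introduce here ($\mathbb{P}_\theta$ and $\mathbb{P}_{\tilde\theta}$ are the measures induced by the interaction with the two bandits, $A_t$ is the action played in round $t$, and $\bar G_n$ is the expected design matrix from (2)):

$$
\mathbb{P}_{\theta}(A) + \mathbb{P}_{\tilde\theta}(A^{c}) \;\ge\; \tfrac{1}{2}\exp\!\big(-\mathrm{KL}(\mathbb{P}_{\theta},\mathbb{P}_{\tilde\theta})\big),
\qquad
\mathrm{KL}(\mathbb{P}_{\theta},\mathbb{P}_{\tilde\theta}) \;=\; \tfrac{1}{2}\,\mathbb{E}_{\theta}\Big[\sum_{t=1}^{n}\langle A_t,\theta-\tilde\theta\rangle^{2}\Big] \;=\; \tfrac{1}{2}\,\|\theta-\tilde\theta\|_{\bar G_n}^{2}.
$$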

Step Two.

In the following, we derive a lower bound on ,

where the first inequality comes from the fact that for all . Define the event as follows,

(18)

When event holds, the optimal action of action set is pulled in at most half of the total rounds. Then it holds that

Define another event as follows,

(19)

where will be chosen later and is the probability that the environment picks context . From the definition of , we have . By Hoeffding's inequality (Vershynin, 2010), it holds that

which implies

By the definition of events in (18),(19), we have

Letting , we have

(20)

On the other hand, let the expectation be taken with respect to the probability measure . Then can be lower bounded as follows,

where we throw out all the sub-optimality gap terms except . Using the fact that is -suboptimal, it holds that

(21)

We have now derived the lower bounds (20) and (21) for , respectively.

Step Three.

Combining the lower bounds of and together, it holds that