# Nonparametric Stochastic Contextual Bandits

We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $O(T^{(1+D)/(2+D)})$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement (kNN-UCB). We then give global intrinsic dimension dependent and ambient dimension independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification.


## Introduction

Multi-armed bandits (MABs) are an important sequential optimization problem introduced by [Robbins1985]. These models have been used extensively in a wide variety of fields related to statistics and machine learning.

The classical MAB consists of $K$ arms, where at each point in time the learner can sample (or pull) one of them and observe a reward. Various objectives can then be established, such as finding the best arm (Top-Arm Identification) or minimizing some regret over time.

For contextual bandits (also referred to as bandits with side information or covariates), the learner has access to a context on which the payoffs depend. Then, based on the observations, we aim to determine the best policy (or context-to-arm mapping) and to optimize some notion of regret.

Most approaches to stochastic contextual bandits make strong assumptions on the payoffs. A popular approach models the mean reward for each arm as being linear in the context space [Chu et al.2011, Li et al.2010]. However, this is rarely the case in real data. In this paper, we take a more general approach and allow the reward functions to be non-linear and of arbitrary shape.

Using recent developments in nonparametric statistics [Jiang2017b], we show that with simple and easily implementable techniques, we can construct bandit algorithms which can learn over the entire context space with strong guarantees, despite the difficulty that arises with allowing a wide variety of reward functions. While this is not the first work which blends nonparametric statistics with bandits, we are the first to show simple and practical methods while still maintaining strong theoretical guarantees.

We reanalyze the uniform and upper confidence bound sampling strategies and demonstrate what nonparametric approaches can offer to contextual bandit learning. No other technique can adapt to the inherently difficult and complex real world reward functions while allowing such a strong theoretical understanding of the underlying algorithms.

While nonparametric models are powerful in their ability to learn arbitrary functions free of distributional assumptions, a major weakness is the curse of dimensionality. In order to have any theoretical guarantees, they require an exponential-in-dimension number of samples. However, when the data lies on an unknown low-dimensional structure such as a manifold, we show that our algorithms can converge as if the data were in a lower dimension and not in the potentially much larger ambient dimension. Another striking fact is that no preprocessing of the data is required. This is of practical importance because modern data has increasingly more features but the underlying degrees of freedom often remain small.

We then discuss recovering geometric structures in the context space based on bandit performance. Specifically, we recover the connected components of the context space in which a particular bandit is the top-arm. Although learning a context-to-arm mapping gives us the estimated top-arm at each point in the context space, this alone does not reveal the space's topological structure, such as the number and shapes of connected components. We recover these structures with uniform consistency guarantees under mild assumptions, where the shapes and relative positions of the components can be arbitrary and the number of such components is recovered automatically.

We then provide an extension to infinite-armed bandits and conclude with empirical results from simulations and image classification on the MNIST dataset.

## Setup

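Suppose there are $K$ bandit arms indexed in $[K]$. At each time-step $t$, the learner observes a context $x_t \in \mathcal{X} \subseteq \mathbb{R}^D$, where $x_t$ is drawn i.i.d. from a context density $p_X$ with compact support $\mathcal{X}$ that is bounded below away from zero (e.g. $\inf_{x \in \mathcal{X}} p_X(x) \ge p_0$ for some $p_0 > 0$). Then the learner chooses an arm $I_t \in [K]$ and observes reward

$$r_t = f_{I_t}(x_t) + \xi_t,$$

where $\xi_t$ is drawn according to a white noise random variable $\xi$ and $f_i$ is the $i$-th arm's mean reward. We make the following assumptions.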

###### Assumption 1.

(Lipschitz Mean Reward) There exists $L$ such that $|f_i(x) - f_i(x')| \le L\,|x - x'|$ for all $x, x' \in \mathcal{X}$ and $i \in [K]$.

###### Assumption 2.

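(Sub-Gaussian White noise) $\xi$ satisfies $\mathbb{E}[\xi] = 0$ and is sub-Gaussian with parameter $\sigma^2$ (i.e. $\mathbb{E}[\exp(\lambda \xi)] \le \exp(\sigma^2 \lambda^2 / 2)$ for all $\lambda \in \mathbb{R}$).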

We require the finite-sample strong uniform consistency result (Theorem 1) for $k$-NN regression, defined as follows:

###### Definition 1 (k-NN).

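Given samples $(x_1, y_1), \dots, (x_n, y_n)$, let the $k$-NN radius of $x$ be $r_k(x) := \inf\{r : |B(x, r) \cap \{x_1, \dots, x_n\}| \ge k\}$, where $B(x, r) := \{x' : |x - x'| \le r\}$, and the $k$-NN set of $x$ be $N_k(x) := B(x, r_k(x)) \cap \{x_1, \dots, x_n\}$. Then for $x \in \mathcal{X}$,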

$$\hat{f}_{k\text{-NN}}(x) := \frac{1}{|N_k(x)|} \sum_{i=1}^{n} y_i \cdot \mathbf{1}[x_i \in N_k(x)].$$

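
The estimator in Definition 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean distance and that $N_k(x)$ contains every sample within the $k$-NN radius (so ties are included); the function name `knn_regress` and the toy data are hypothetical.

```python
import math

def knn_regress(x, xs, ys, k):
    """k-NN regression estimate at query x: the mean label over N_k(x)."""
    dists = [math.dist(x, xi) for xi in xs]
    r_k = sorted(dists)[k - 1]  # k-NN radius: distance to the k-th nearest sample
    neighbors = [y for d, y in zip(dists, ys) if d <= r_k]  # labels of N_k(x)
    return sum(neighbors) / len(neighbors)

# Toy data (assumed for illustration): three nearby samples and one far away.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
ys = [1.0, 2.0, 3.0, 100.0]
print(knn_regress((0.1, 0.1), xs, ys, k=2))  # averages the two nearest labels
```

The far-away sample never enters the average for small `k`, which is the locality property the uniform rate in Theorem 1 relies on.
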
###### Theorem 1.

(Rate for -NN [Jiang2017b]) Let . There exists and universal constant such that if and

, then with probability at least

,

$$\sup_{x \in \mathcal{X}} |f(x) - \hat{f}_{k\text{-NN}}(x)| \le C \sqrt{\log n \cdot \log(1/\delta)} \cdot n^{-1/(2+D)}.$$

It will be implicitly understood from here on that $\hat{f}_i$ denotes the $k$-NN regression estimate of $f_i$ under the settings of Theorem 1.

## Top-Arm Identification

###### Definition 2.

($\epsilon$-optimal arm) Arm $i$ is $\epsilon$-optimal at context $x$ if $f_i(x) \ge \max_{j \in [K]} f_j(x) - \epsilon$.

We next show a result on $\epsilon$-optimal arm recovery that holds uniformly over contexts:

###### Theorem 2.

(-optimal arm recovery) Let . For Algorithm 1, with probability at least , if

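$$T \ge K \max\left\{ N_0,\ \log\left(\frac{C\sqrt{\log(1/\delta)}}{\epsilon}\right) \cdot \frac{(2+D)\,(2C)^{2+D} \log(1/\delta)^{1+D/2}}{\epsilon^{2+D}} \right\},$$

then $\hat{\pi}(x)$ is $\epsilon$-optimal at context $x$ uniformly for all $x \in \mathcal{X}$.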

###### Remark 1.

This result shows that with samples, we can determine an -approximate best arm. Known lower bounds in nonparametric regression stipulate that we need to identify differences between functions of size so our result matches lower bounds up to logarithmic factors.

###### Proof.

By Theorem 1, it follows that based on the choice of , each arm has at least enough time such that . Thus, we have , defining ,

$$f_{\pi(x)}(x) - f_{\hat{\pi}(x)}(x) \le \hat{f}_{\pi(x)}(x) - \hat{f}_{\hat{\pi}(x)}(x) + \epsilon \le \epsilon,$$

as desired. ∎
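
The uniform sampling strategy analyzed above can be sketched as follows. Algorithm 1 itself is not reproduced in this text, so the sketch rests on assumptions: arms are pulled round-robin, each arm's mean reward is estimated by $k$-NN regression, and the returned policy is the plug-in rule $\hat{\pi}(x) = \arg\max_i \hat{f}_i(x)$. The one-dimensional contexts, the toy reward functions, and all names are hypothetical.

```python
import random

def knn_regress(x, xs, ys, k):
    """k-NN regression on one-dimensional contexts."""
    dists = [abs(x - xi) for xi in xs]
    r_k = sorted(dists)[k - 1]
    vals = [y for d, y in zip(dists, ys) if d <= r_k]
    return sum(vals) / len(vals)

def uniform_sampling_policy(pull, num_arms, horizon, k, rng):
    """Pull arms round-robin, then return the plug-in policy x -> argmax_i f_hat_i(x)."""
    data = [([], []) for _ in range(num_arms)]
    for t in range(horizon):
        x = rng.random()      # context drawn uniformly on [0, 1]
        arm = t % num_arms    # uniform allocation across arms
        data[arm][0].append(x)
        data[arm][1].append(pull(arm, x))

    def policy(x):
        estimates = [knn_regress(x, xs, ys, k) for xs, ys in data]
        return max(range(num_arms), key=lambda i: estimates[i])

    return policy

rng = random.Random(0)
means = [lambda x: x, lambda x: 1.0 - x]  # toy Lipschitz mean rewards (assumed)
pull = lambda arm, x: means[arm](x) + rng.gauss(0.0, 0.1)
policy = uniform_sampling_policy(pull, num_arms=2, horizon=400, k=15, rng=rng)
print(policy(0.9), policy(0.1))
```

In this toy problem the two mean rewards are far apart away from $x = 0.5$, so the plug-in policy only has to resolve a large gap at the queried contexts.
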

## Regret Analysis For UCB Strategy

Define $T_i(t)$ to be the number of times arm $i$ was pulled by time $t$.

We use the following notion of regret.

$$R_T = \sum_{t=1}^{T} \left[ \max_{i} f_i(x_t) - f_{I_t}(x_t) \right].$$

###### Remark 2.

Note that this notion of regret is different from those studied in classical MABs as well as other works in nonparametric contextual bandits. Usually the expected form is bounded. Here, our regret analysis is not under this expectation and hence is a stronger notion of regret.
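
As a concrete reading of this regret, the sketch below sums the per-round gap between the best mean reward at the observed context and the mean reward of the arm actually chosen. The helper name `cumulative_regret` and the toy inputs are hypothetical; arms are indexed from 0 here.

```python
def cumulative_regret(mean_rewards, contexts, chosen_arms):
    """Sum over rounds of max_i f_i(x_t) - f_{I_t}(x_t), with no expectation taken."""
    total = 0.0
    for x, arm in zip(contexts, chosen_arms):
        values = [f(x) for f in mean_rewards]
        total += max(values) - values[arm]
    return total

# Toy example (assumed): two arms with mean rewards x and 1 - x.
means = [lambda x: x, lambda x: 1.0 - x]
print(cumulative_regret(means, contexts=[0.2, 0.8, 0.5], chosen_arms=[0, 0, 1]))
```

Only the first round contributes here: arm 0 is pulled at $x = 0.2$ where arm 1 is better by $0.6$.
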

###### Theorem 3.

Let . Suppose that and in Algorithm 2. Then we have that with probability at least ,

$$R_T \le M_1\, 2^{\frac{1+D}{2+D}}\, K \sqrt{\log T \cdot \log(TK/\delta)} \cdot T^{\frac{1+D}{2+D}} + K M_0 \max_i \|f_i\|_\infty.$$

###### Remark 3.

This shows a sub-linear regret of order $T^{(1+D)/(2+D)}$, up to logarithmic factors.

###### Proof.

Denote to be the -NN regression estimate of at time . Letting , we have by Theorem 1

$$\begin{aligned}
R_T &\le \sum_{t=1}^{T} \sigma\big(T_{\hat{\pi}(x_t)}(t-1)\big) + C_0 \le K \sum_{i=1}^{T} \sigma(i) + C_0 \\
&= M_1 K \sqrt{\log T \cdot \log(TK/\delta)} \sum_{t=1}^{T} t^{-1/(2+D)} + C_0 \\
&\le M_1 K \sqrt{\log T \cdot \log(TK/\delta)} \int_{0}^{T} (1+t)^{-1/(2+D)}\, dt + C_0 \\
&\le M_1\, 2^{\frac{1+D}{2+D}}\, K \sqrt{\log T \cdot \log(TK/\delta)} \cdot T^{\frac{1+D}{2+D}} + C_0.
\end{aligned}$$

The first inequality holds because, for a sub-optimal arm to be chosen, its upper confidence bound must be higher than that of the optimal arm at $x_t$, and the regret at that time-step is then bounded by the confidence bound. The second inequality follows from a simple combinatorial argument: each time a sub-optimal arm is chosen its count increments, and otherwise no regret is incurred. ∎
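
The UCB rule that this proof reasons about can be sketched as below. Algorithm 2 is not reproduced in this text, so the details are assumptions: the index of arm $i$ is its $k$-NN estimate plus a confidence width proportional to $n^{-1/(2+D)}$, where $n$ is the arm's pull count; `width_scale` stands in for the constants and logarithmic factors; `k` is held fixed for simplicity. All names and the toy problem are hypothetical.

```python
import random

def knn_regress(x, xs, ys, k):
    """k-NN regression on one-dimensional contexts."""
    dists = [abs(x - xi) for xi in xs]
    r_k = sorted(dists)[k - 1]
    vals = [y for d, y in zip(dists, ys) if d <= r_k]
    return sum(vals) / len(vals)

def knn_ucb(pull, num_arms, horizon, k, width_scale, dim, rng):
    """Run a k-NN UCB rule and return the list of (context, chosen arm) pairs."""
    xs = [[] for _ in range(num_arms)]
    ys = [[] for _ in range(num_arms)]
    history = []
    for t in range(horizon):
        x = rng.random()  # context drawn uniformly on [0, 1]
        if t < num_arms:
            arm = t  # pull every arm once so each estimate is defined
        else:
            def ucb_index(i):
                n = len(xs[i])
                estimate = knn_regress(x, xs[i], ys[i], min(k, n))
                # assumed confidence width, shrinking like n^(-1/(2+D))
                return estimate + width_scale * n ** (-1.0 / (2 + dim))
            arm = max(range(num_arms), key=ucb_index)
        xs[arm].append(x)
        ys[arm].append(pull(arm, x))
        history.append((x, arm))
    return history

rng = random.Random(0)
means = [lambda x: x, lambda x: 1.0 - x]  # toy Lipschitz mean rewards (assumed)
pull = lambda arm, x: means[arm](x) + rng.gauss(0.0, 0.1)
history = knn_ucb(pull, num_arms=2, horizon=600, k=10, width_scale=0.5, dim=1, rng=rng)
regret = sum(max(f(x) for f in means) - means[arm](x) for x, arm in history)
print(f"cumulative regret over {len(history)} rounds: {regret:.2f}")
```

For comparison, a policy that picks an arm uniformly at random incurs expected regret $0.25$ per round on this toy problem, i.e. $150$ over $600$ rounds.
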

## Contextual Bandits on Manifolds

###### Assumption 3.

(Manifold Assumption) and the family of are supported on , where:

• is a -dimensional smooth compact Riemannian manifold without boundary embedded in compact subset .

• The volume of is bounded above by a constant.

• has condition number , which controls the curvature and prevents self-intersection.

Let be the density of with respect to the uniform measure on .

###### Theorem 4.

(Manifold Rate for -NN [Jiang2017b]) Let . There exists and universal constant such that if and , then with probability at least ,

$$\sup_{x \in \mathcal{X}} |f(x) - f_k(x)| \le C \sqrt{\log n \cdot \log(1/\delta)} \cdot n^{-1/(2+d)}.$$

Then, simply by using Theorem 4 instead of Theorem 1, we automatically enjoy faster rates for Theorems 2 and 3.

###### Theorem 5.

(-optimal arm recovery on manifolds) Let . For Algorithm 1, with probability at least , if

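$$T \ge K \max\left\{ N_0,\ \log\left(\frac{C\sqrt{\log(1/\delta)}}{\epsilon}\right) \cdot \frac{(2+D)\,(2C)^{2+d} \log(1/\delta)^{1+D/2}}{\epsilon^{2+d}} \right\},$$

then $\hat{\pi}(x)$ is $\epsilon$-optimal at context $x$ uniformly for all $x \in \mathcal{X}$.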

###### Remark 4.

Now the sample complexity scales as $\epsilon^{-(2+d)}$ instead of $\epsilon^{-(2+D)}$, up to logarithmic factors.

###### Theorem 6.

(UCB Regret Analysis on Manifolds) Let . Suppose that and in Algorithm 2. Then we have that with probability at least ,

$$R_T \le M_1\, 2^{\frac{1+d}{2+d}}\, K \sqrt{\log T \cdot \log(TK/\delta)} \cdot T^{\frac{1+d}{2+d}} + K M_0 \max_i \|f_i\|_\infty.$$
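
To see how much the intrinsic dimension can matter, the short sketch below evaluates the exponents appearing in the bounds above for an ambient and an intrinsic dimension. The ambient value 784 corresponds to 28 by 28 pixel images; the intrinsic value 10 is a purely hypothetical choice for illustration, not a measured quantity.

```python
def regret_exponent(dim):
    """Exponent of T in the regret bounds of Theorems 3 and 6."""
    return (1 + dim) / (2 + dim)

def sample_complexity_exponent(dim):
    """Exponent of 1/epsilon in the sample-size conditions of Theorems 2 and 5."""
    return 2 + dim

ambient_dim = 784    # 28 x 28 pixel images
intrinsic_dim = 10   # hypothetical intrinsic dimension, for illustration only
for name, dim in (("ambient", ambient_dim), ("intrinsic", intrinsic_dim)):
    print(name, dim, round(regret_exponent(dim), 4), sample_complexity_exponent(dim))
```

The regret exponent approaches 1 as the dimension grows, so a bound stated in the ambient dimension is nearly linear in $T$, while the same bound in a small intrinsic dimension is visibly sub-linear.
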

## Topological Analysis

In this section, we discuss how topological features of the bandit arms can be recovered. This is similar to recovering the Hartigan notion of clusters as level-sets of the density functions from a finite sample [Chaudhuri and Dasgupta2010, Jiang2017a], but here we find similar structures in the reward functions based on noisy observations of them. We give procedures which can estimate, with consistency guarantees, the following structure: maximal connected regions in $\mathcal{X}$ where a particular arm is the top-arm.

From the uniform sampling strategy earlier, we obtained an estimated policy $\hat{\pi}$ which is $\epsilon$-optimal uniformly in $\mathcal{X}$ with high probability. Although this is already powerful in giving us the mapping between the context space and the corresponding top-arm, it does not immediately tell us the topological features of this mapping. In this subsection, we discuss how to recover the connected components of $\mathcal{X}_i$, the region where arm $i$ is the top-arm.

We give the following simple procedure.

We now give a consistency result for Algorithm 3.

First, we require the following regularity assumption, which ensures that there are no full-dimensional regions where the top-arm is not unique. This ensures that it is possible to unambiguously recover the regions where a particular arm is top.

###### Assumption 4.

The region in $\mathcal{X}$ where the top-arm is not unique has measure $0$, and for each arm $i$, the region where it is the unique top-arm can be partitioned into full-dimensional connected components.

Our rates will be in terms of the Hausdorff distance.

###### Definition 3.
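
$$d_H(A, B) = \inf\{\epsilon \ge 0 : A \subseteq B \oplus \epsilon,\ B \subseteq A \oplus \epsilon\},$$

where $A \oplus \epsilon := \{x : \inf_{a \in A} d(x, a) \le \epsilon\}$.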
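
For finite point sets the infimum in this definition is attained and equals the larger of the two directed distances, which makes it easy to compute. The sketch below assumes Euclidean distance; the function name and the toy sets are hypothetical.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets under Euclidean distance."""
    def directed(P, Q):
        # smallest eps such that every point of P lies within eps of Q
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(hausdorff(A, B))  # 2.0
```

Every point of `A` also lies in `B`, so the distance is driven entirely by the extra point `(1.0, 2.0)`, whose nearest point in `A` is two units away.
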

###### Theorem 7.

Suppose that . Let be the maximal connected components of . Define the following minimum distance between two connected components.

$$R_0 := \min_{p \ne q}\ \inf_{x \in C_p,\, y \in C_q} d(x, y).$$

Also define the following minimum separation in the reward functions

$$D_0 := \inf_{x \notin \mathcal{X}_i \oplus R_0/4} \left( \max_{j \in [K]} f_j(x) - f_i(x) \right).$$

Then the following holds simultaneously for all . Let Algorithm 3 with setting return . Then for sufficiently large, and there exists permutation of such that

$$d_H(C_j, C_{\gamma(j)}) \le \xi(n)$$

for some $\xi(n)$ that satisfies $\xi(n) \to 0$ as $n \to \infty$.

###### Proof.

We first show that no two connected components can appear in the same returned component in Algorithm 3. We choose sufficiently large such that in light of Theorem 1, we have

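$$\sup_{x \in \mathcal{X}} \max_{j \in [K]} |\hat{f}_j(x) - f_j(x)| \le \frac{D_0}{3}.$$

Then, uniformly for any $x \notin \mathcal{X}_i \oplus R_0/4$, we have

$$\hat{f}_i(x) \le f_i(x) + \frac{D_0}{3} \le \max_{j \in [K]} f_j(x) - \frac{2D_0}{3} \le \max_{j \in [K]} \hat{f}_j(x) - \frac{D_0}{3}.$$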

Thus, is disjoint from the returned points. Since , it follows that no two connected components points will appear in the same returned connected component from Algorithm 3.

Next, we show that for each connected component , there exists for some such that . It suffices to show that for each , we have that for sufficiently large, . There are thus two directions to show, that and . To show the first, define

$$D_1 := \inf_{x \in (C_q \oplus r) \setminus (C_q \oplus (r/2))} \left( \max_{j \in [K]} f_j(x) - f_i(x) \right).$$

Then choose sufficiently large such that in light of Theorem 1, we have

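$$\sup_{x \in \mathcal{X}} \max_{j \in [K]} |\hat{f}_j(x) - f_j(x)| \le \frac{D_1}{3}.$$

Then we have for all , if , then

$$\hat{f}_i(x) \le f_i(x) + \frac{D_1}{3} \le \max_{j \in [K]} f_j(x) - \frac{2D_1}{3},$$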

thus, . The other direction follows from a similar argument.

All that remains is to show that such points appear in the same connected component in the graph computed by Algorithm 3. This follows from uniform concentration bounds on balls (e.g. [Chaudhuri and Dasgupta2010]). ∎
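
The kind of graph-based procedure this proof refers to can be sketched as follows. Algorithm 3 is not reproduced in this text, so this is only an assumed reading: keep the sample points at which a given arm is estimated to be the top-arm, connect any two kept points within a chosen radius, and return the connected components of that graph. The function name, the `radius` parameter, and the toy estimated policy are hypothetical.

```python
import math

def top_arm_components(points, est_top_arm, arm, radius):
    """Connected components of the radius-neighborhood graph on points where `arm` is estimated top."""
    kept = [p for p in points if est_top_arm(p) == arm]
    unvisited = set(range(len(kept)))
    components = []
    while unvisited:
        start = unvisited.pop()
        stack, comp = [start], [kept[start]]
        while stack:
            u = stack.pop()
            near = [v for v in unvisited if math.dist(kept[u], kept[v]) <= radius]
            for v in near:
                unvisited.remove(v)
                comp.append(kept[v])
                stack.append(v)
        components.append(comp)
    return components

# Toy setup (assumed): contexts on a grid in [0, 1]; arm 0 is estimated top near both ends.
points = [(i / 20,) for i in range(21)]
est_top_arm = lambda p: 0 if abs(p[0] - 0.5) > 0.22 else 1
comps = top_arm_components(points, est_top_arm, arm=0, radius=0.06)
print(len(comps), sorted(len(c) for c in comps))
```

The number of components is not supplied as an input; it falls out of the graph, which matches the claim that the number of regions is recovered automatically.
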

## Infinite-Armed Bandits

In this section, we consider the setting where the action space is no longer a finite set of bandits, but a compact subset $\mathcal{A}$ of $\mathbb{R}^{D'}$ for some $D' \ge 1$.

We give analogous results for uniform sampling top-arm identification and regret bounds for a UCB-type strategy.

###### Definition 4.

(Mean Reward function)

$$f : \mathcal{X} \times \mathcal{A} \to \mathbb{R},$$

where $f(x, a)$ is the expected reward of action $a$ at context $x$.

###### Assumption 5.

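(Lipschitz Reward) There exists $L$ such that for all $x, x' \in \mathcal{X}$ and $a, a' \in \mathcal{A}$, $|f(x, a) - f(x', a')| \le L\,|(x, a) - (x', a')|$, where $(x, a)$ represents the $(D + D')$-dimensional concatenation of $x$ and $a$.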

Then at each time $t$, the learner observes context $x_t$, chooses arm $a_t \in \mathcal{A}$, and observes a stochastic reward

$$r_t = f(x_t, a_t) + \xi_t,$$

where $\xi_t$ are i.i.d. white noise with mean $0$ and variance $\sigma^2$.

###### Definition 5.

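($\epsilon$-optimal arm) Define arm $a$ to be $\epsilon$-optimal at context $x$ if $f(x, a) \ge \sup_{a' \in \mathcal{A}} f(x, a') - \epsilon$.

The following is a result on $\epsilon$-optimal arm recovery that holds uniformly over the context and action spaces: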

###### Theorem 8.

(-optimal arm recovery) There exists constant such that the following holds. Let . For Algorithm 4, with probability at least , we have that for

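$$T \ge \tilde{C}_1 \log\left(\frac{\sqrt{\log(1/\delta)}}{\epsilon}\right) \frac{\log(1/\delta)^{1+(D+D')/2}}{\epsilon^{D+D'+2}} + \tilde{C}_2,$$

arm $\hat{\pi}(x)$ is $\epsilon$-optimal at context $x$ uniformly for all $x \in \mathcal{X}$.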

###### Proof.

By Theorem 1, it follows that based on the choice of , there is enough time spent on pulling each arm such that . Thus, we have , defining ,

$$f(x, \pi(x)) - f(x, \hat{\pi}(x)) \le \frac{\epsilon}{2} + \hat{f}(x, \pi(x)) + \frac{\epsilon}{2} - \hat{f}(x, \hat{\pi}(x)) \le \epsilon,$$

as desired. ∎
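
A minimal sketch of the infinite-armed variant follows. Algorithm 4 is not reproduced in this text, so the sketch assumes the natural analogue of uniform sampling: draw contexts and actions uniformly, run $k$-NN regression on the concatenated point $(x, a)$, and act greedily over a finite grid of candidate actions. The grid, the toy reward $1 - |x - a|$, and all names are hypothetical.

```python
import math
import random

def knn_regress(query, points, ys, k):
    """k-NN regression over concatenated (context, action) points."""
    dists = [math.dist(query, p) for p in points]
    r_k = sorted(dists)[k - 1]
    vals = [y for d, y in zip(dists, ys) if d <= r_k]
    return sum(vals) / len(vals)

def fit_policy(reward, horizon, k, action_grid, rng):
    """Sample (context, action) pairs uniformly, then act greedily on the joint k-NN estimate."""
    points, ys = [], []
    for _ in range(horizon):
        x, a = rng.random(), rng.random()
        points.append((x, a))
        ys.append(reward(x, a))

    def policy(x):
        return max(action_grid, key=lambda a: knn_regress((x, a), points, ys, k))

    return policy

rng = random.Random(0)
# Toy mean reward (assumed): Lipschitz in (x, a), maximized at a = x.
reward = lambda x, a: 1.0 - abs(x - a) + rng.gauss(0.0, 0.05)
grid = [i / 10 for i in range(11)]
policy = fit_policy(reward, horizon=2000, k=20, action_grid=grid, rng=rng)
print(policy(0.3))
```

Because the estimate lives on the joint space, its accuracy is governed by the combined dimension $D + D'$, which is the dimension that appears in Theorem 8.
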

Finally, using the notion of regret

$$R_T = \sum_{t=1}^{T} \left[ \sup_{a \in \mathcal{A}} f(x_t, a) - f(x_t, a_t) \right],$$

we give the following result. The proof idea is similar to that of Theorem 3 and is omitted here.

###### Theorem 9.

There exists and such that the following holds. Let . Suppose that and are chosen sufficiently large in Algorithm 5 depending on and . Then we have that with probability at least ,

$$R_T \le \tilde{C}_1 \sqrt{\log T \cdot \log(T/\delta)} \cdot T^{\frac{1+D+D'}{2+D+D'}} + \tilde{C}_2.$$

###### Remark 5.

This shows a sub-linear regret of order $T^{(1+D+D')/(2+D+D')}$, up to logarithmic factors.

## Related Works

Canonical works for the standard bandit problem are [Lai and Robbins1985], [Berry and Fristedt1985], [Gittins, Glazebrook, and Weber2011], [Auer et al.2002], [Cesa-Bianchi and Lugosi2006], and [Bubeck and Cesa-Bianchi2012].

Work in contextual bandits can be roughly classified into adversarial and stochastic approaches. Much of the former, initiated by [Auer et al.2002], assumes that there is an adversarial game between nature and the learner where, based on a context seen by both players, nature generates rewards for each arm at the same time the learner chooses an arm. Solutions typically involve game theoretical methods. In the stochastic approach, one assumes that the rewards for the arms are generated by a context-dependent distribution.

Approaches to modeling the arm rewards as a function of context are most commonly parametric. One of the most popular is that of linear payoffs, studied under a minimax framework [Goldenshluger and Zeevi2009, Goldenshluger and Zeevi2013], with UCB-type algorithms [Chu et al.2011, Li et al.2010, Auer et al.2002], or with Thompson sampling [Agrawal and Goyal2013].

However, it is often the case that the dependency between the payoffs and the contexts is complex and therefore difficult to capture with models such as linear payoffs, many of which require strong assumptions on the data. To alleviate this, we can go beyond parametric modeling and blend nonparametric statistics with contextual bandits. Despite the advantage of learning much more general context-payoff dependencies, this line of work has received far less attention.

To the best of our knowledge, the first such work appeared in [Yang and Zhu2002], who used histogram, $k$-NN, and kernel methods and showed asymptotic convergence rates. [Rigollet and Zeevi2010] and [Perchet and Rigollet2013] then combined histogram-type binning techniques from nonparametric statistics to obtain strong regret guarantees for contextual bandits with optimality guarantees.

[Lu, Pál, and Pál2010] study an interesting setting where the reward satisfies a Lipschitz condition jointly over the context and action spaces. They provide upper and lower regret bounds based on a covering argument and give results in terms of the packing dimension. This is highly related to the infinite-armed bandit setting in the present work; we provide similar regret guarantees but with a simple and practical procedure.

More recently, [Qian and Yang2016b] and [Qian and Yang2016a] use the strong uniform consistency properties of kernel smoothing regression to establish regret guarantees.

[Langford and Zhang2008] and [Dudik et al.2011] alternatively impose neither linear nor smoothness assumptions on the mean reward function. The former propose a modification of an $\epsilon$-greedy policy and showed that expected regret converges to while the latter considers a finite class of policies.

In this paper, using recent finite-sample results about $k$-NN regression established in [Jiang2017b], we show that using the simple $k$-NN regression is an effective alternative approach. Moreover, unlike many other nonparametric techniques, $k$-NN adapts to a lower intrinsic dimension [Kpotufe2011], and thus we show that our regret bounds can adapt to a lower intrinsic dimension automatically and perform as if we were operating in that lower dimensional space.

## Experiments

### Simulations

We consider three two-arm bandit scenarios in the two-dimensional unit square, where the context density is uniform. We set arm $i$ to be top in region $R_i$. Figure 1 illustrates the regions for the different scenarios.

• Scenario 1 (Quintic Function): We define two regions above and below a quintic function:

• Scenario 2 (Smiley): We use two circles and a semicircle to demarcate the regions in a "smiley face" pattern.

• Scenario 3 (Bullseye): We define the regions using the alternating regions of four concentric circles centered in the support.

The true reward functions of the two arms are as follows.

$$f_i(x) = \begin{cases} 1, & x \in R_i \\ 0.5, & x \in R_j,\ j \ne i \end{cases}$$

The learner observes the rewards with white noise random variable .
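
As an illustration of how such a scenario can be simulated, the sketch below builds a Bullseye-style environment with the reward structure above. The ring radii, the noise level passed to `noisy_reward`, and the 0/1 arm indexing are all assumptions made for this example; the text does not give the values used in the experiments.

```python
import math
import random

RADII = (0.1, 0.2, 0.3, 0.4)  # hypothetical radii of the four concentric circles
CENTER = (0.5, 0.5)           # center of the unit square

def bullseye_top_arm(x):
    """Arm (0 or 1) that is top at context x: regions alternate across the circles."""
    r = math.dist(x, CENTER)
    circles_crossed = sum(r > radius for radius in RADII)
    return circles_crossed % 2

def mean_reward(arm, x):
    """Mean reward 1 inside the arm's own region and 0.5 elsewhere."""
    return 1.0 if bullseye_top_arm(x) == arm else 0.5

def noisy_reward(arm, x, rng, noise_sd):
    """Observed reward: mean reward plus Gaussian white noise (noise_sd is assumed)."""
    return mean_reward(arm, x) + rng.gauss(0.0, noise_sd)

rng = random.Random(0)
x = (rng.random(), rng.random())  # context drawn uniformly from the unit square
print(x, bullseye_top_arm(x), noisy_reward(0, x, rng, noise_sd=0.5))
```

The mean rewards here are piecewise constant, so the sharp boundaries between regions are what the learner has to recover from noisy observations.
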

We compare the performance of $k$-NN regression (nonparametric) and Ridge regression at top-arm identification and regret minimization in the three scenarios. Mirroring our theoretical discussion, we use uniform sampling for top-arm identification and the UCB strategy for regret analysis. Note that Ridge regression with UCB is the LinUCB algorithm.

#### Qualitative Analysis

We first show qualitatively that $k$-NN regression can successfully model the bandits whereas the linear method cannot. The difficulty of the task is illustrated by Figure 2, which plots 10k uniformly drawn samples from each scenario with a colormap. A human would have a hard time recovering the regions where each arm is top due to the randomness in the observed rewards. This randomness is considerable as we set to be the same as .

We fix the number of training samples to 10k and the number of nearest neighbors to . We evaluate on 10k random test samples. Figure 3 shows that $k$-NN regression does an excellent job of reproducing the region boundaries. Ridge regression does a poor job in the Quintic Function case, making a linear approximation to the quintic curve, and completely fails in the Smiley and Bullseye cases, simply choosing the arm whose top-arm region is larger.

#### Quantitative Analysis

We report numerical results and optimal hyperparameters in Table 1. We tuned the other hyperparameters using grid search on a validation set of size 1k and evaluate the performance of our models on a test set of size 1k. We use the UCB strategy in [Auer et al.2002] (a simplified version of UCB by [Agrawal and Goyal2013]). We found that a confidence level of 0.1 worked well for all settings. We see that $k$-NN significantly outperforms Ridge regression for both top-arm identification and regret minimization in all three scenarios (Table 1).

### Image Classification Experiments

We extend our experiments to image classification on the canonical MNIST dataset, which consists of 60k training images and 10k test images of isolated, normalized, hand-written digits. The task is to classify each 28×28 image into one of ten classes. We reframe this as a contextual MAB problem by treating the classes as arms and the images as the contexts. Note that for every context, the payoffs of all arms are known: 1 if the class is the true label and 0 otherwise. We compare $k$-NN and Ridge regressions at regret minimization using the UCB strategy. As before, we use the UCB strategy in [Auer et al.2002] and fix the confidence level to 0.1. We do not employ any data augmentation.
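
The reduction from classification to bandit feedback described above can be written down directly. The helper names are hypothetical, and reading the reported regret percentages as an average per-round regret is an assumption of this sketch.

```python
def bandit_feedback(true_label, chosen_arm):
    """Payoff of the pulled arm: 1 if it matches the true class, 0 otherwise."""
    return 1.0 if chosen_arm == true_label else 0.0

def average_regret(true_labels, chosen_arms):
    """Average per-round regret; the best arm always earns 1, so regret is 1 - payoff."""
    payoffs = [bandit_feedback(y, a) for y, a in zip(true_labels, chosen_arms)]
    return sum(1.0 - p for p in payoffs) / len(payoffs)

# Toy example (assumed): four images, one misclassified pull.
print(average_regret(true_labels=[3, 7, 7, 1], chosen_arms=[3, 7, 2, 1]))  # 0.25
```

Under this reading, the learner only sees the payoff of the class it guessed, even though the payoffs of all ten arms are determined by the label.
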

We obtain test regret of 17.5% from LinUCB with , where is the coefficient of L2 regularization, and significantly lower test regret of 5.8% from 4-NNUCB. Figure 4 shows that -NN regression maintains lower regret than Ridge regression over a range of values of and . We note that Ridge regression working well for relatively large values of itself suggests that it is a poor model for the task.

## Conclusion

For the multi-armed bandit setting, we use nonparametric regression to attain tight results for top-arm identification and a sublinear regret of $O(T^{(1+D)/(2+D)})$, where $D$ is the dimension of the context. We also show that if the underlying context space has a lower intrinsic dimension $d$, then our algorithm automatically adapts to the lower dimension and attains a faster rate of $O(T^{(1+d)/(2+d)})$. We also provide a procedure for recovering the maximal connected regions in a support where a particular arm is the top-arm and provide a consistency analysis. We then give a natural extension to infinite-armed contextual bandits. Our simulations confirm that our method is able to learn in the contextual setting with arbitrary decision boundaries, even in the presence of significant noise, and our experiments on classification of MNIST images demonstrate superior performance of our method over LinUCB on a real world task.

## References

• [Agrawal and Goyal2013] Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 127–135.
• [Auer et al.2002] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM journal on computing 32(1):48–77.
• [Berry and Fristedt1985] Berry, D. A., and Fristedt, B. 1985. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability), volume 12. Springer.
• [Bubeck and Cesa-Bianchi2012] Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1–122.
• [Cesa-Bianchi and Lugosi2006] Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, learning, and games. Cambridge university press.
• [Chaudhuri and Dasgupta2010] Chaudhuri, K., and Dasgupta, S. 2010. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, 343–351.
• [Chu et al.2011] Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 208–214.
• [Dudik et al.2011] Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
• [Gittins, Glazebrook, and Weber2011] Gittins, J.; Glazebrook, K.; and Weber, R. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
• [Goldenshluger and Zeevi2009] Goldenshluger, A., and Zeevi, A. 2009. Woodroofe’s one-armed bandit problem revisited. The Annals of Applied Probability 19(4):1603–1633.
• [Goldenshluger and Zeevi2013] Goldenshluger, A., and Zeevi, A. 2013. A linear response bandit problem. Stochastic Systems 3(1):230–261.
• [Jiang2017a] Jiang, H. 2017a. Density level set estimation on manifolds with dbscan. arXiv preprint arXiv:1703.03503.
• [Jiang2017b] Jiang, H. 2017b. Rates of uniform consistency for k-nn regression. arXiv preprint arXiv:1707.06261.
• [Kpotufe2011] Kpotufe, S. 2011. k-nn regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, 729–737.
• [Lai and Robbins1985] Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4–22.
• [Langford and Zhang2008] Langford, J., and Zhang, T. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, 817–824.
• [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, 661–670. ACM.
• [Lu, Pál, and Pál2010] Lu, T.; Pál, D.; and Pál, M. 2010. Showing relevant ads via lipschitz context multi-armed bandits. In Thirteenth International Conference on Artificial Intelligence and Statistics.
• [Perchet and Rigollet2013] Perchet, V., and Rigollet, P. 2013. The multi-armed bandit problem with covariates. The Annals of Statistics 41(2):693–721.
• [Qian and Yang2016a] Qian, W., and Yang, Y. 2016a. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research.
• [Qian and Yang2016b] Qian, W., and Yang, Y. 2016b. Randomized allocation with arm elimination in a bandit problem with covariates. Electronic Journal of Statistics 10(1):242–270.
• [Rigollet and Zeevi2010] Rigollet, P., and Zeevi, A. 2010. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630.
• [Robbins1985] Robbins, H. 1985. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers. Springer. 169–177.
• [Yang and Zhu2002] Yang, Y., and Zhu, D. 2002. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics 30(1):100–121.