Introduction
Multi-armed bandits (MABs) are an important sequential optimization problem introduced by Robbins [Robbins1985]. These models have been used extensively in a wide variety of fields related to statistics and machine learning.
The classical MAB consists of arms where at each point in time the learner can sample (or pull) one of them and observe a reward. Various objectives can then be established, such as finding the best arm (Top-Arm Identification) or minimizing some notion of regret over time.
For contextual bandits (also referred to as bandits with side information or covariates), the learner has access to a context on which the payoffs depend. Then, based on the observations, we aim to determine the best policy (or context-to-arm mapping) and to optimize some notion of regret.
Most approaches to stochastic contextual bandits make strong assumptions on the payoffs. A popular approach models the mean reward for each arm as being linear in the context space [Chu et al.2011, Li et al.2010]. However, this is rarely the case in real data. In this paper, we take a more general approach and allow the reward functions to be nonlinear and of arbitrary shape.
Using recent developments in nonparametric statistics [Jiang2017b], we show that with simple and easily implementable techniques, we can construct bandit algorithms which can learn over the entire context space with strong guarantees, despite the difficulty that arises with allowing a wide variety of reward functions. While this is not the first work which blends nonparametric statistics with bandits, we are the first to show simple and practical methods while still maintaining strong theoretical guarantees.
We reanalyze the uniform and upper confidence bound sampling strategies and demonstrate what nonparametric approaches can offer to contextual bandit learning. No other technique can adapt to the inherently difficult and complex real world reward functions while allowing such a strong theoretical understanding of the underlying algorithms.
While nonparametric models are powerful in their ability to learn arbitrary functions free of distributional assumptions, a major weakness is the curse of dimensionality. In order to have any theoretical guarantees, they require an exponential-in-dimension number of samples. However, when the data lies on an unknown low-dimensional structure such as a manifold, we show that our algorithms converge as if the data were in the lower dimension rather than in the potentially much larger ambient dimension. Another striking fact is that no preprocessing of the data is required. This is of practical importance because modern data has increasingly more features while the underlying degrees of freedom often remain small.
We then discuss recovering geometric structures in the context space based on bandit performance. Specifically, we recover the connected components of the context space in which a particular bandit is the top-arm. Although learning a context-to-arm mapping gives us the estimated top-arm at each point in the context space, this alone does not reveal the space's topological structure, such as the number and shapes of connected components. We recover these structures with uniform consistency guarantees under mild assumptions, where the shapes and relative positions of the components can be arbitrary and the number of such components is recovered automatically.
We then provide an extension to infinite-armed bandits and conclude with empirical results from simulations and image classification on the MNIST dataset.
Setup
Suppose there are bandit arms indexed in . At each timestep , the learner observes a context where is drawn i.i.d. from a context density with compact support bounded below away from zero (e.g. for some ). Then the learner chooses an arm and observes reward
where
is drawn according to white noise random variable
and is the th arm’s mean reward. We make the following assumptions.
Assumption 1.
(Lipschitz Mean Reward) There exists such that for all and .
Assumption 2.
(Sub-Gaussian White Noise) satisfies and is sub-Gaussian with parameter (i.e. for all ).
We require the finite-sample strong uniform consistency result (Theorem 1) for k-NN regression, defined as follows:
Definition 1 (k-NN).
Let the k-NN radius of be where and the k-NN set of be . Then for ,
Theorem 1.
(Rate for k-NN [Jiang2017b]) Let . There exists and universal constant such that if and , then with probability at least ,
It will be implicitly understood from here on that denotes the k-NN regression estimate of under the settings of Theorem 1.
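As a rough illustration of Definition 1, the sketch below computes a k-NN regression estimate together with the k-NN radius at a query context. The function name `knn_regress`, the Euclidean metric, and the toy data are assumptions made for this example, and ties in distance are broken arbitrarily.

```python
import math

def knn_regress(train_x, train_y, query, k):
    """Return (k-NN estimate, k-NN radius) at `query` for a single arm.

    train_x: list of context vectors (tuples); train_y: observed rewards.
    """
    # Sort the training points by Euclidean distance to the query.
    by_dist = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = by_dist[:k]
    radius = nearest[-1][0]  # distance to the k-th nearest neighbor
    estimate = sum(y for _, y in nearest) / len(nearest)
    return estimate, radius

# Toy data (hypothetical): four one-dimensional contexts with noiseless rewards.
xs = [(0.0,), (1.0,), (2.0,), (10.0,)]
ys = [0.0, 1.0, 2.0, 10.0]
estimate, radius = knn_regress(xs, ys, (0.9,), k=2)
```

Here the two nearest neighbors of 0.9 are the contexts at 1.0 and 0.0, so the estimate is the average of their rewards and the radius is the distance to the farther of the two.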
Top-Arm Identification
Definition 2.
(optimal arm) Arm is optimal at context if .
We now show a uniform (over contexts) result about optimal arm recovery:
Theorem 2.
(optimal arm recovery) Let . For Algorithm 1, with probability at least , if
then is optimal at context uniformly for all .
Remark 1.
This result shows that with samples, we can determine an approximate best arm. Known lower bounds in nonparametric regression stipulate that we need to identify differences between functions of size so our result matches lower bounds up to logarithmic factors.
Proof.
By Theorem 1, it follows that based on the choice of , each arm has at least enough time such that . Thus, we have , defining ,
as desired. ∎
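The following is a minimal sketch of top-arm identification by uniform sampling followed by a greedy k-NN policy, in the spirit of Theorem 2. It is not a transcription of Algorithm 1: drawing the arm uniformly at random each round, the names `uniform_sampling_policy` and `knn_mean`, and the two-arm toy problem are all assumptions made for illustration.

```python
import math
import random

def knn_mean(xs, ys, query, k):
    """Average reward of the k training contexts closest to `query`."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(xs, ys))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def uniform_sampling_policy(contexts, pull, n_arms, k, rng):
    """Pull a uniformly random arm per round; return the greedy k-NN policy."""
    data = [([], []) for _ in range(n_arms)]  # per-arm (contexts, rewards)
    for x in contexts:
        arm = rng.randrange(n_arms)
        data[arm][0].append(x)
        data[arm][1].append(pull(x, arm))

    def policy(query):
        # Estimate every arm's mean reward at the query and pick the largest.
        estimates = [knn_mean(xs, ys, query, k) for xs, ys in data]
        return max(range(n_arms), key=estimates.__getitem__)

    return policy

# Toy problem (hypothetical): two arms on [0, 1] with crossing mean rewards.
rng = random.Random(0)
means = [lambda x: x[0], lambda x: 1.0 - x[0]]
pull = lambda x, arm: means[arm](x) + rng.gauss(0.0, 0.1)
contexts = [(rng.random(),) for _ in range(2000)]
policy = uniform_sampling_policy(contexts, pull, n_arms=2, k=50, rng=rng)
```

In this toy setup arm 0 has the higher mean reward for contexts above 0.5 and arm 1 for contexts below it, so the learned policy should switch arms near the crossing point.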
Regret Analysis For UCB Strategy
Define to be the number of times arm was pulled by time .
We use the following notion of regret.
Remark 2.
Note that this notion of regret is different from those studied in classical MABs as well as other works in nonparametric contextual bandits. Usually the expected form is bounded. Here, our regret analysis is not under this expectation and hence is a stronger notion of regret.
Theorem 3.
Let . Suppose that and in Algorithm 2. Then we have that with probability at least ,
Remark 3.
This shows a sublinear regret of .
Proof.
Denote to be the k-NN regression estimate of at time . Letting , we have by Theorem 1
The first inequality holds because the confidence bound of a suboptimal arm must be higher than that of the optimal arm at in order for that arm to be chosen, and the regret at that timestep is bounded by the confidence bound. The second inequality holds by the following simple combinatorial argument: each time a suboptimal arm is chosen, its count increments; otherwise no regret is incurred. ∎
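To make the optimism argument concrete, here is a hedged sketch of a UCB-style rule built on k-NN estimates. It is not Algorithm 2: the index used here (k-NN mean, plus a noise term scaling as one over the square root of k, plus a Lipschitz constant times the k-NN radius) and the parameters `noise_bonus` and `lipschitz` are assumptions chosen for illustration.

```python
import math
import random

def knn_ucb(contexts, pull, n_arms, k, noise_bonus, lipschitz):
    """Run an optimistic k-NN bandit and return the sequence of chosen arms."""
    data = [([], []) for _ in range(n_arms)]  # per-arm (contexts, rewards)
    choices = []
    for x in contexts:
        index = []
        for xs, ys in data:
            if len(ys) < k:
                index.append(math.inf)  # force initial exploration of each arm
                continue
            nearest = sorted((math.dist(p, x), y) for p, y in zip(xs, ys))[:k]
            mean = sum(y for _, y in nearest) / k
            radius = nearest[-1][0]  # k-NN radius at x for this arm
            # Assumed confidence width: noise term plus bias (radius) term.
            index.append(mean + noise_bonus / math.sqrt(k) + lipschitz * radius)
        arm = max(range(n_arms), key=index.__getitem__)
        data[arm][0].append(x)
        data[arm][1].append(pull(x, arm))
        choices.append(arm)
    return choices

# Toy problem (hypothetical): arm 0 is better everywhere on [0, 1].
rng = random.Random(0)
means = [0.8, 0.2]
pull = lambda x, arm: means[arm] + rng.gauss(0.0, 0.1)
contexts = [(rng.random(),) for _ in range(1500)]
choices = knn_ucb(contexts, pull, n_arms=2, k=10, noise_bonus=0.1, lipschitz=1.0)
late_share = choices[-500:].count(0) / 500
```

An arm that has rarely been pulled near the current context has a large k-NN radius and therefore a large index, which is what drives exploration in this sketch. Once the suboptimal arm has enough samples everywhere, its index drops below that of the better arm.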
Contextual Bandits on Manifolds
Assumption 3.
(Manifold Assumption) and the family of are supported on , where:

is a dimensional smooth compact Riemannian manifold without boundary embedded in a compact subset .

The volume of is bounded above by a constant.

has condition number , which controls the curvature and prevents self-intersection.
Let be the density of with respect to the uniform measure on .
Theorem 4.
(Manifold Rate for k-NN [Jiang2017b]) Let . There exists and universal constant such that if and , then with probability at least ,
Then, simply by using Theorem 4 instead of Theorem 1, we automatically enjoy faster rates for Theorems 2 and 3.
Theorem 5.
(optimal arm recovery on manifolds) Let . For Algorithm 1, with probability at least , if
then is optimal at context uniformly for all .
Remark 4.
Now the sample complexity is instead of .
Theorem 6.
(UCB Regret Analysis on Manifolds) Let . Suppose that and in Algorithm 2. Then we have that with probability at least ,
Topological Analysis
In this section, we discuss how topological features of the bandit arms can be recovered. This is similar to recovering the Hartigan notion of clusters as level sets of the density function from a finite sample [Chaudhuri and Dasgupta2010, Jiang2017a], but here we find similar structures in the reward functions based on noisy observations of them. We give procedures which can estimate, with consistency guarantees, the following structure: maximal connected regions in where a particular arm is the top-arm.
From the uniform sampling strategy earlier, we obtained an estimated policy which is optimal uniformly in with high probability. Although this is already powerful in giving us the mapping between the context space and the corresponding top-arm, it does not immediately tell us the topological features of this mapping. In this section, we discuss how to recover the connected components of , the region where arm is the top-arm.
We give the following simple procedure.
We now give a consistency result for Algorithm 3.
First, we require the following regularity assumption, which ensures that there are no full-dimensional regions where the top-arm is not unique. This ensures that it is possible to unambiguously recover the regions where a particular arm is top.
Assumption 4.
The region in where the top-arm is not unique has measure , and for each arm , the region where it is unique can be partitioned into full-dimensional connected components.
Our rates will be in terms of the Hausdorff distance.
Definition 3.
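$d_H(A, B) = \max\{\sup_{x \in A} d(x, B),\ \sup_{x \in B} d(x, A)\}$, where $d(x, A) = \inf_{a \in A} \|x - a\|$.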
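For finite point sets, the Hausdorff distance can be computed directly from pairwise distances. The sketch below assumes the standard definition with the Euclidean metric. The function name `hausdorff` and the two example sets are illustrative only.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty point sets."""
    def directed(P, Q):
        # Largest distance from a point of P to its nearest point in Q.
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (3.0, 0.0)]
d = hausdorff(A, B)
```

In this example every point of A is within distance 1 of B, but the point (3, 0) in B is at distance 2 from A, so the two directed distances differ and the larger one determines the result.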
Theorem 7.
Suppose that . Let be the maximal connected components of . Define the following minimum distance between two connected components.
Also define the following minimum separation in the reward functions
Then the following holds simultaneously for all . Let Algorithm 3 with setting return . Then for sufficiently large, and there exists permutation of such that
for some that satisfies as .
Proof.
We first show that no two connected components can appear in the same returned component in Algorithm 3. We choose sufficiently large such that in light of Theorem 1, we have
. Then, uniformly for any , we have
Thus, is disjoint from the returned points. Since , it follows that no points from two different connected components will appear in the same returned connected component from Algorithm 3.
Next, we show that for each connected component , there exists for some such that . It suffices to show that for each , we have that for sufficiently large, . There are thus two directions to show, that and . To show the first, define
Then choose sufficiently large such that in light of Theorem 1, we have
. Then we have for all , if , then
thus, . The other direction follows from a similar argument.
All that remains is to show that such points appear in in the same connected component in the graph computed by Algorithm 3. This follows from uniform concentration bounds on balls (e.g. [Chaudhuri and Dasgupta2010]). ∎
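A procedure of the kind analyzed above can be sketched as follows. This is an assumption-laden illustration, not Algorithm 3 itself: it keeps the sample points where a given arm's estimated reward leads every other arm by a margin, links kept points that lie within distance `eps` of each other, and returns the connected components of that graph. The names `top_arm_components`, `margin`, and `eps` are hypothetical.

```python
import math

def top_arm_components(points, estimates, arm, margin, eps):
    """Group points where `arm` leads by `margin` into eps-connected components."""
    # Keep points where `arm` beats every other arm by at least `margin`.
    kept = [p for p, est in zip(points, estimates)
            if all(est[arm] - est[j] >= margin
                   for j in range(len(est)) if j != arm)]
    components, unvisited = [], set(range(len(kept)))
    while unvisited:
        stack = [unvisited.pop()]
        component = []
        while stack:  # depth-first search over the eps-neighborhood graph
            u = stack.pop()
            component.append(kept[u])
            linked = [v for v in unvisited
                      if math.dist(kept[u], kept[v]) <= eps]
            unvisited.difference_update(linked)
            stack.extend(linked)
        components.append(component)
    return components

# Toy example (hypothetical): arm 0 is top on both ends of [0, 1], arm 1 in the middle.
points = [(i / 100,) for i in range(101)]
estimates = [(abs(x[0] - 0.5), 0.2) for x in points]
comps0 = top_arm_components(points, estimates, arm=0, margin=0.05, eps=0.015)
comps1 = top_arm_components(points, estimates, arm=1, margin=0.05, eps=0.015)
```

The margin removes points near the decision boundary, which is what keeps distinct regions from being linked through a thin band of ambiguous points. In the toy example arm 0 yields two components and arm 1 yields one.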
Infinite-Armed Bandits
In this section, we consider the setting where the action space is no longer a finite set of bandits, but a compact subset of for some .
We give analogous results for top-arm identification under uniform sampling and regret bounds for a UCB-type strategy.
Definition 4.
(Mean Reward function)
where is the expected reward of action at context .
Assumption 5.
(Lipschitz Reward) There exists such that for all and , , where represents the dimensional concatenation of and .
Then at each time , the learner chooses arm and observes context and a stochastic reward
where are i.i.d. white noise with mean and variance .
Definition 5.
(optimal arm) Define arm to be optimal at context if .
The following is a uniform (over context and action space) result about optimal arm recovery:
Theorem 8.
(optimal arm recovery) There exists constant such that the following holds. Let . For Algorithm 4, with probability at least , we have that for
arm is optimal at context uniformly for all .
Proof.
By Theorem 1, it follows that based on the choice of , there is enough time spent on pulling each arm such that . Thus, we have , defining ,
as desired. ∎
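For the infinite-armed setting, one way to picture the estimator is as k-NN regression over concatenated (context, action) vectors, as in Assumption 5. The sketch below is not Algorithm 4. Sampling actions uniformly at random, maximizing over a finite `action_grid`, and the toy reward function are all assumptions made for this illustration.

```python
import math
import random

def joint_knn_policy(contexts, actions, rewards, action_grid, k):
    """Return a policy that maximizes the joint k-NN estimate over a grid."""
    joint = [tuple(x) + tuple(a) for x, a in zip(contexts, actions)]

    def estimate(x, a):
        # k-NN regression at the concatenated (context, action) query point.
        query = tuple(x) + tuple(a)
        nearest = sorted((math.dist(z, query), r)
                         for z, r in zip(joint, rewards))[:k]
        return sum(r for _, r in nearest) / len(nearest)

    def policy(x):
        return max(action_grid, key=lambda a: estimate(x, a))

    return policy

# Toy problem (hypothetical): the best action equals the context.
rng = random.Random(0)
contexts = [(rng.random(),) for _ in range(4000)]
actions = [(rng.random(),) for _ in range(4000)]
rewards = [1.0 - abs(a[0] - x[0]) + rng.gauss(0.0, 0.1)
           for x, a in zip(contexts, actions)]
grid = [(i / 20,) for i in range(21)]
policy = joint_knn_policy(contexts, actions, rewards, grid, k=30)
best_action = policy((0.3,))
```

Because the mean reward is Lipschitz in the concatenated vector, nearby (context, action) pairs carry information about each other, which is why a single joint estimator suffices in this sketch. The grid resolution limits how closely the returned action can approach the true maximizer.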
Finally, using the notion of regret
we give the following result. The proof idea is similar to that of Theorem 3 and is omitted here.
Theorem 9.
There exists and such that the following holds. Let . Suppose that and are chosen sufficiently large in Algorithm 5 depending on and . Then we have that with probability at least ,
Remark 5.
This shows a sublinear regret of .
Related Works
Canonical works for the standard bandit problem are [Lai and Robbins1985; Berry and Fristedt1985; Gittins, Glazebrook, and Weber2011; Auer et al.2002; Cesa-Bianchi and Lugosi2006; Bubeck and Cesa-Bianchi2012].
Work in contextual bandits can be roughly classified into adversarial and stochastic approaches. Much of the former, initiated by Auer et al. [Auer et al.2002], assumes that there is an adversarial game between nature and the learner where, based on a context seen by both players, nature generates rewards for each arm at the same time the learner chooses an arm. Solutions typically involve game-theoretical methods. In the stochastic approach, one assumes that the rewards for the arms are generated by a context-dependent distribution.
Approaches to modeling the arm rewards as a function of context are most commonly parametric. One of the most popular is that of linear payoffs, studied under a minimax framework [Goldenshluger and Zeevi2009, Goldenshluger and Zeevi2013], with UCB-type algorithms [Chu et al.2011, Li et al.2010, Auer et al.2002], or with Thompson sampling [Agrawal and Goyal2013].
However, it is often the case that the dependency between the payoffs and the contexts is complex and therefore difficult to capture with models such as linear payoffs, many of which require strong assumptions on the data. To alleviate this, we can go beyond parametric modeling and blend nonparametric statistics with contextual bandits. Despite the advantage of learning much more general context-payoff dependencies, this line of work has received far less attention.
To the best of our knowledge, the first such work is that of Yang and Zhu [Yang and Zhu2002], who used histogram, k-NN, and kernel methods and showed asymptotic convergence rates. Rigollet and Zeevi [Rigollet and Zeevi2010] and Perchet and Rigollet [Perchet and Rigollet2013] then used histogram-type binning techniques from nonparametric statistics to obtain strong regret guarantees for contextual bandits with optimality guarantees.
Lu, Pál, and Pál [Lu, Pál, and Pál2010] study an interesting setting where the reward depends on a Lipschitz measure that is joint in the context and the action space. They provide upper and lower regret bounds based on a covering argument and give results in terms of the packing dimension. This is highly related to the infinite-armed bandit setting in the present work; we provide similar regret guarantees but with a simple and practical procedure.
More recently, Qian and Yang [Qian and Yang2016b, Qian and Yang2016a] use the strong uniform consistency properties of kernel smoothing regression to establish regret guarantees.
Langford and Zhang [Langford and Zhang2008] and Dudik et al. [Dudik et al.2011] alternatively impose neither linear nor smoothness assumptions on the mean reward function. The former propose a modification of an ε-greedy policy and show that expected regret converges to while the latter consider a finite class of policies.
In this paper, using recent finite-sample results about k-NN regression established by Jiang [Jiang2017b], we show that simple k-NN regression is an effective alternative approach. Moreover, unlike many other nonparametric techniques, k-NN adapts to a lower intrinsic dimension [Kpotufe2011], and thus we show that our regret bounds adapt to a lower intrinsic dimension automatically and perform as if we were operating in that lower-dimensional space.
Experiments
Simulations
We consider three two-arm bandit scenarios in the two-dimensional unit square, where is uniform. We set arm to be top in region respectively. Figure 1 illustrates the regions for the different scenarios.

Scenario 1 (Quintic Function): We define two regions above and below a quintic function:

Scenario 2 (Smiley): We use two circles and a semicircle to demarcate the regions in a "smiley face" pattern.

Scenario 3 (Bullseye): We define the regions using the alternating regions of four concentric circles centered in the support.
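As an illustration of how such a region layout can be encoded, the sketch below labels a point in the unit square by the ring of a bullseye it falls in, with the top-arm alternating from ring to ring. The radii, the center, and the choice of which arm is top in the innermost disc are hypothetical values picked for this example, not the ones used in the experiments.

```python
import bisect
import math

def bullseye_top_arm(point, radii=(0.1, 0.2, 0.3, 0.4), center=(0.5, 0.5)):
    """Return the index (0 or 1) of the top arm at `point`.

    `radii` must be sorted in increasing order; the arm alternates per ring.
    """
    r = math.dist(point, center)
    ring = bisect.bisect_right(radii, r)  # number of circles with radius <= r
    return ring % 2

labels = [bullseye_top_arm(p) for p in
          [(0.5, 0.5), (0.65, 0.5), (0.75, 0.5), (0.5, 0.85), (0.95, 0.5)]]
```

The five query points lie in the innermost disc, the three successive rings, and the region outside the largest circle, so their labels alternate between the two arms.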
The true reward functions of the two arms are as follows.
The learner observes the rewards with white noise random variable .
We compare the performance of k-NN regression (nonparametric) and Ridge regression at top-arm identification and regret minimization in the three scenarios. Mirroring our theoretical discussion, we use uniform sampling for top-arm identification and the UCB strategy for regret analysis. Note that Ridge regression with UCB is the LinUCB algorithm.
Scenario | Quintic Function | Quintic Function | Smiley | Smiley | Bullseye | Bullseye
Model | Ridge | kNN | Ridge | kNN | Ridge | kNN
Top-Arm Test Error from Uniform Sampling | 0.065 | 0.002 | 0.080 | 0.000 | 0.335 | 0.005
Number of samples | 500k | 500k | 2k | 5000k | 100k | 500k
Number of neighbors | - | 100 | - | 50 | - | 20
Test Regret from UCB sampling | 0.0315 | 0.001 | 0.0375 | 0.0135 | 0.161 | 0.004
Number of samples | 1k | 500k | 5k | 1000k | 50k | 1000k
Number of neighbors | - | 100 | - | 20 | - | 100
Table 1: Results for the Ridge and k-NN regressors. Each model was tuned individually and optimal hyperparameters are shown. k-NN performs better on both metrics for all three scenarios.
Qualitative Analysis
We first qualitatively show that k-NN regression can successfully model the bandits whereas the linear method cannot. The difficulty of the task is illustrated by Figure 2, which plots 10k uniformly drawn samples from each scenario with a colormap. We can see that a human would have a hard time recovering the regions where each arm is top due to the randomness in the observed rewards. This randomness is considerable as we set to be the same as .
We fix the number of training samples to 10k and the number of nearest neighbors to . We evaluate on 10k random test samples. Figure 3 shows that k-NN regression does an excellent job of reproducing the region boundaries. Ridge regression does a poor job in the Quintic Function case, making a linear approximation to the quintic curve, and completely fails in the Smiley and Bullseye cases, simply choosing the arm whose top-arm region is larger.
Quantitative Analysis
We report numerical results and optimal hyperparameters in Table 1. We tuned the other hyperparameters using grid search on a validation set of size 1k, and we evaluate the performance of our models on a test set of size 1k. We use the UCB strategy in [Auer et al.2002] (a simplified version of the UCB by [Agrawal and Goyal2013]). We found that a confidence level of worked well for all settings. We see that k-NN significantly outperforms Ridge regression for both top-arm identification and regret minimization in all three scenarios (Table 1).
Image Classification Experiments
We extend our experiments to image classification on the canonical MNIST dataset, which consists of 60k training images and 10k test images of isolated, normalized, handwritten digits. The task is to classify each 28×28 image into one of ten classes. We reframe this as a contextual MAB problem by treating the classes as arms and the images as the contexts. Note that for every context, the payoffs of all arms are known: 1 if the class is the true label and 0 otherwise. We compare k-NN and Ridge regression at regret minimization using the UCB strategy. As before, we use the UCB strategy in [Auer et al.2002] and fix the confidence level to 0.1. We do not employ any data augmentation.
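The reframing of classification as a contextual bandit can be sketched as a replay loop over labelled examples, where the learner is shown only the 0/1 reward of the class it chose. The function name `run_classification_bandit` and the `policy` and `update` callables are hypothetical placeholders for whichever learner is plugged in.

```python
def run_classification_bandit(images, labels, policy, update):
    """Replay labelled data as a bandit problem and return the average regret.

    policy(x) -> chosen class (arm); update(x, arm, reward) trains the learner.
    """
    total_regret = 0.0
    for x, y in zip(images, labels):
        arm = policy(x)
        reward = 1.0 if arm == y else 0.0  # payoff is 1 only for the true label
        update(x, arm, reward)  # the learner sees the chosen arm's reward only
        total_regret += 1.0 - reward  # the best arm always has payoff 1
    return total_regret / len(labels)
```

Since the best arm has payoff 1 at every context, the average regret in this loop coincides with the learner's online error rate.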
We obtain test regret of 17.5% from LinUCB with , where is the coefficient of L2 regularization, and significantly lower test regret of 5.8% from 4-NN-UCB. Figure 4 shows that k-NN regression maintains lower regret than Ridge regression over a range of values of and . We note that Ridge regression working well for relatively large values of itself suggests that it is a poor model for the task.
Conclusion
For the multi-armed bandit setting, we use nonparametric regression to attain tight results for top-arm identification and a sublinear regret of , where is the dimension of the context. We also show that if the underlying context space has a lower intrinsic dimension , then our algorithm automatically adapts to the lower dimension and attains a faster rate of . We also provide a procedure for recovering the maximal connected regions in a support where a particular arm is the top-arm and provide a consistency analysis. We then give a natural extension to infinite-armed contextual bandits. Our simulations confirm that our method is able to learn in the contextual setting with arbitrary decision boundaries, even in the presence of significant noise, and our experiments on classification of MNIST images demonstrate superior performance of our method over LinUCB on a real-world task.
References
 [Agrawal and Goyal2013] Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 127–135.
 [Auer et al.2002] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multi-armed bandit problem. SIAM journal on computing 32(1):48–77.
 [Berry and Fristedt1985] Berry, D. A., and Fristedt, B. 1985. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability), volume 12. Springer.
 [Bubeck and Cesa-Bianchi2012] Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1–122.
 [Cesa-Bianchi and Lugosi2006] Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, learning, and games. Cambridge university press.
 [Chaudhuri and Dasgupta2010] Chaudhuri, K., and Dasgupta, S. 2010. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, 343–351.

 [Chu et al.2011] Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 208–214.
 [Dudik et al.2011] Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
 [Gittins, Glazebrook, and Weber2011] Gittins, J.; Glazebrook, K.; and Weber, R. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
 [Goldenshluger and Zeevi2009] Goldenshluger, A., and Zeevi, A. 2009. Woodroofe’s one-armed bandit problem revisited. The Annals of Applied Probability 19(4):1603–1633.
 [Goldenshluger and Zeevi2013] Goldenshluger, A., and Zeevi, A. 2013. A linear response bandit problem. Stochastic Systems 3(1):230–261.
 [Jiang2017a] Jiang, H. 2017a. Density level set estimation on manifolds with DBSCAN. arXiv preprint arXiv:1703.03503.
 [Jiang2017b] Jiang, H. 2017b. Rates of uniform consistency for k-NN regression. arXiv preprint arXiv:1707.06261.
 [Kpotufe2011] Kpotufe, S. 2011. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, 729–737.
 [Lai and Robbins1985] Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4–22.

 [Langford and Zhang2008] Langford, J., and Zhang, T. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, 817–824.
 [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, 661–670. ACM.
 [Lu, Pál, and Pál2010] Lu, T.; Pál, D.; and Pál, M. 2010. Showing relevant ads via Lipschitz context multi-armed bandits. In Thirteenth International Conference on Artificial Intelligence and Statistics.
 [Perchet and Rigollet2013] Perchet, V., and Rigollet, P. 2013. The multi-armed bandit problem with covariates. The Annals of Statistics 41(2):693–721.
 [Qian and Yang2016a] Qian, W., and Yang, Y. 2016a. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research.
 [Qian and Yang2016b] Qian, W., and Yang, Y. 2016b. Randomized allocation with arm elimination in a bandit problem with covariates. Electronic Journal of Statistics 10(1):242–270.
 [Rigollet and Zeevi2010] Rigollet, P., and Zeevi, A. 2010. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630.
 [Robbins1985] Robbins, H. 1985. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers. Springer. 169–177.
 [Yang and Zhu2002] Yang, Y., and Zhu, D. 2002. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics 30(1):100–121.