The multi-armed bandit is a simple model which exemplifies the exploration-exploitation trade-off in reinforcement learning. Solutions to this problem have numerous practical applications, from sequential clinical trials to web-page ad placement (Bubeck et al., 2012). We focus upon the stochastic setting, in which an agent is given access to a collection of unknown reward distributions (arms); the agent sequentially selects a reward distribution to sample from, so as to maximise its cumulative reward. One of the most widely used strategies for stochastic multi-armed bandits is the Upper Confidence Bound (UCB) algorithm, which is based on the principle of optimism in the face of uncertainty (Lai and Robbins, 1985; Agrawal, 1995; Auer et al., 2002). Garivier and Cappé's KL-UCB algorithm utilises tighter upper confidence bounds, yielding sharper regret bounds and superior empirical performance (Garivier and Cappé, 2011). Multi-armed bandits with covariates extend this simple model by allowing the reward distributions to depend upon observable side information (Bubeck et al., 2012, Section 4.3). For example, in sequential clinical trials the agent might have access to a patient's MRI scan or genome sequence; in web-page ad placement side information might include a particular user's preferences and purchasing history. Owing to their widespread applicability, multi-armed bandits with covariates have been extensively studied (Beygelzimer et al., 2011; Kakade et al., 2008; Langford and Zhang, 2008; Perchet et al., 2013; Qian and Yang, 2016; Rigollet and Zeevi, 2010; Seldin et al., 2011; Slivkins, 2011; Wang et al., 2005a, b; Yang et al., 2002). In this paper we consider the non-parametric setting, in which the relationship between reward distribution and side information is assumed to satisfy smoothness conditions, without specifying a particular parametric form.
Yang and Zhu proved strong consistency for an epsilon-greedy approach to this problem, using either nearest neighbour or histogram based methods to model the functional dependency of the reward distribution upon the covariate (Yang et al., 2002). Rigollet and Zeevi introduced the UCBogram, which partitions the covariate space into cubes and runs the UCB algorithm locally on each member of the partition (Rigollet and Zeevi, 2010). They prove a regret bound with exponents depending upon distributional assumptions, including a natural extension of the Tsybakov margin condition (Tsybakov, 2004). Unfortunately, the regret bound is sub-optimal when the margin parameter is greater than one. Later, Perchet and Rigollet developed the Adaptively Binned Successive Elimination (ABSE) algorithm, which runs the Successive Elimination algorithm locally on increasingly refined partitions of the covariate space, and demonstrated that ABSE achieves minimax optimal regret guarantees for all values of the margin parameter (Perchet et al., 2013). Hence, the adaptive refinement of the partition of the covariate space enables the ABSE algorithm to take advantage of low-noise conditions, expressed as a margin condition. Despite the strong theoretical merits of the ABSE algorithm, it has several limitations owing to its dependency upon a partition of the feature space into dyadic hyper-cubes. Firstly, there are many applications in which it is natural to construct a metric between data points which cannot be embedded in a Euclidean space without significant distortion. Examples include the Wasserstein distance between images and the edit distance on graphs (Frogner et al., 2015; Luxburg and Bousquet, 2004). However, neither the UCBogram nor the ABSE algorithm can be applied to non-Euclidean metric spaces.
Secondly, the regret bounds for the ABSE algorithm require that the marginal distribution over the covariates be absolutely continuous with respect to Lebesgue measure, with a density bounded from below on the unit hyper-cube. Whilst this condition is not entirely necessary for the analysis, the proof does depend crucially upon the existence of constants for which every dyadic hyper-cube has measure either bounded below, in proportion to its volume, or equal to zero. However, this condition does not hold for many well-behaved measures on Euclidean space (see Appendix G for a simple example). Thirdly, the construction of the partitions in both the UCBogram and the ABSE algorithm requires prior knowledge of the intrinsic dimensionality of the covariate space as an input parameter. In the case of the UCBogram the dimension is used to choose the optimal partition size (Rigollet and Zeevi, 2010, Theorem 3.1). In the case of the ABSE algorithm, a partition element is refined after a number of rounds which depends upon the dimension (Perchet et al., 2013, Equation (5.2)). Moreover, if covariates are supported on a low-dimensional sub-manifold, then the intrinsic dimensionality of the sub-manifold is unlikely to be known in advance. The aim of the current paper is to address these three limitations.
The $k$-nearest neighbour method is amongst the simplest approaches to supervised learning, and it enjoys strong theoretical guarantees. Kpotufe has shown that the $k$-nearest neighbour regression algorithm attains distribution dependent minimax optimal rates, without prior knowledge of the intrinsic dimensionality of the data (Kpotufe, 2011). Chaudhuri and Dasgupta have shown that the $k$-nearest neighbour method attains distribution dependent minimax optimal rates in the supervised classification setting (Chaudhuri and Dasgupta, 2014). In particular, the $k$-nearest neighbour classifier automatically takes advantage of low noise in the data, expressed as a margin condition. In light of these theoretical strengths, it is natural to apply the $k$-nearest neighbour method to the problem of multi-armed bandits with covariates. We propose the $k$-nearest neighbour UCB algorithm ($k$-NN UCB), a conceptually simple procedure for multi-armed bandits with covariates which combines the UCB algorithm with $k$-nearest neighbour regression. The algorithm does not require prior knowledge of the intrinsic dimensionality of the data. It is also naturally anytime, without resorting to the doubling trick. We prove a regret bound for the $k$-NN UCB algorithm which is minimax optimal up to logarithmic factors. In particular, the algorithm automatically takes advantage of both low intrinsic dimensionality of the marginal distribution over the covariates and low noise conditions, expressed as a margin condition. In addition, focusing on the case of bounded rewards, we give corresponding regret bounds for the $k$-nearest neighbour KL-UCB algorithm ($k$-NN KL-UCB), an analogue of the KL-UCB algorithm (Garivier and Cappé, 2011) adapted to the setting of multi-armed bandits with covariates. Finally, we present empirical results which demonstrate the ability of both $k$-NN UCB and $k$-NN KL-UCB to take advantage of situations where the data is supported on an unknown sub-manifold of a high-dimensional feature space.
2 Bandits on a metric space
In this section we shall introduce some notation and background.
We consider the problem of bandits with covariates on metric spaces. Suppose we have a metric space $(\mathcal{X}, \rho)$. Given $x \in \mathcal{X}$ and $r > 0$ we let $B_r(x)$ denote the open metric ball of radius $r$, centred at $x$. Given $n \in \mathbb{N}$ we let $[n] = \{1, \dots, n\}$. Given a collection of arms $\mathcal{A}$, we let $\mathbb{P}$ denote a distribution over random variables $(X, Y)$ with $X \in \mathcal{X}$ and $Y = (Y_a)_{a \in \mathcal{A}}$, where $Y_a$ denotes the value of arm $a$. We let $\mu$ denote the marginal of $\mathbb{P}$ over $\mathcal{X}$ and let $\mathcal{X}_\mu$ denote its support. For each $a \in \mathcal{A}$ we define a function $f_a : \mathcal{X} \to \mathbb{R}$ by $f_a(x) = \mathbb{E}[Y_a \mid X = x]$. For each $t \in \mathbb{N}$ a random sample $(X_t, Y_t)$ is drawn i.i.d. from $\mathbb{P}$. We are allowed to view the feature vector $X_t$, and we must choose an arm $a_t \in \mathcal{A}$ and receive the stochastic reward $Y_{t, a_t}$. We are able to observe the value of our chosen arm, but not the value of the remaining arms. Our sequential choice of arms is given by a policy $\pi = (\pi_t)_{t \in \mathbb{N}}$ consisting of functions $\pi_t : \mathcal{X} \to \mathcal{A}$, where $a_t = \pi_t(X_t)$ is determined purely by $X_t$ and the known reward history $\{(X_s, a_s, Y_{s, a_s})\}_{s < t}$. The goal is to choose arms so as to maximise the cumulative reward $\sum_{t \in [n]} Y_{t, a_t}$. In order to quantify the quality of a policy $\pi$ we compare its expected cumulative reward to that of an oracle policy $\pi^*$ defined by $\pi^*(x) \in \operatorname{argmax}_{a \in \mathcal{A}} f_a(x)$. We define the regret by $\mathcal{R}_n(\pi) = \mathbb{E}\left[ \sum_{t \in [n]} \left( f_{\pi^*(X_t)}(X_t) - f_{a_t}(X_t) \right) \right]$.
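To make the setting concrete, the following sketch simulates the model above with two arms on $[0,1]$ and accumulates the regret of a uniformly random policy against the oracle. The reward functions, noise level and horizon are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(a, x):
    # Hypothetical conditional reward functions f_a(x) for two arms.
    return np.sin(np.pi * x) if a == 0 else x ** 2

n, K = 1000, 2
regret = 0.0
for t in range(n):
    x = rng.uniform()                          # covariate X_t drawn from mu
    a = rng.integers(K)                        # a deliberately poor random policy
    y = f(a, x) + 0.1 * rng.normal()           # observed reward of the chosen arm
    oracle = max(f(b, x) for b in range(K))    # oracle plays argmax_a f_a(x)
    regret += oracle - f(a, x)                 # expected regret increment
print(f"regret of the random policy over {n} rounds: {regret:.1f}")
```

Any sensible strategy should make this quantity grow sub-linearly in $n$; the random policy's regret grows linearly.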
We shall make the following assumptions:
Assumption 1 (Dimension assumption)
There exist constants $C_d > 0$ and $d > 0$ such that for all $x \in \mathcal{X}_\mu$ and all $r \in (0, \operatorname{diam}(\mathcal{X}_\mu)]$, we have $\mu\left(B_r(x)\right) \geq C_d \cdot r^d$.
Assumption 1 holds for well-behaved measures which are absolutely continuous with respect to the Riemannian volume form on a $d$-dimensional sub-manifold of Euclidean space (see Proposition 2, Appendix H). See Appendix G for an example where Assumption 1 holds whilst the measure of dyadic sub-cubes is not well behaved.
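The scaling behaviour in Assumption 1 can be observed numerically. The sketch below (an illustration under assumed parameters, not code from the paper) samples points on a one-dimensional curve embedded in $\mathbb{R}^3$ and estimates the exponent $d$ from the empirical measure of two balls: the estimate is close to the intrinsic dimension one, not the ambient dimension three.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.uniform(0, 1, 20000)
# Points on a helix: a 1-dimensional sub-manifold of R^3.
pts = np.stack([np.cos(4 * t), np.sin(4 * t), t], axis=1)

x = pts[0]                                   # an arbitrary point on the curve
dists = np.linalg.norm(pts - x, axis=1)
r1, r2 = 0.05, 0.1
m1 = np.mean(dists < r1)                     # empirical mu(B_{r1}(x))
m2 = np.mean(dists < r2)                     # empirical mu(B_{r2}(x))
d_hat = np.log(m2 / m1) / np.log(r2 / r1)    # slope of log-mass vs log-radius
print(f"estimated intrinsic dimension: {d_hat:.2f}")
```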
Assumption 2 (Lipschitz assumption)
There exists a constant $\lambda > 0$ such that for all $a \in \mathcal{A}$ and all $x, x' \in \mathcal{X}$, we have $\left| f_a(x) - f_a(x') \right| \leq \lambda \cdot \rho(x, x')$.
Assumption 2 quantifies the requirement that similar covariates should imply similar conditional reward expectations. Let $f^*(x) = \max_{a \in \mathcal{A}} f_a(x)$. For each $a \in \mathcal{A}$ let $\Delta_a(x) = f^*(x) - f_a(x)$, and for $\varepsilon > 0$ define $\mathcal{M}_{a, \varepsilon} = \left\{ x \in \mathcal{X}_\mu : 0 < \Delta_a(x) \leq \varepsilon \right\}$.
Assumption 3 (Margin assumption)
There exist constants $C_m > 0$ and $\alpha > 0$ such that for all $a \in \mathcal{A}$ and $\varepsilon \in (0, 1]$ we have $\mu\left( \mathcal{M}_{a, \varepsilon} \right) \leq C_m \cdot \varepsilon^\alpha$.
Assumption 3 quantifies the difficulty of the problem. It is a natural analogue of Tsybakov's margin condition (Tsybakov, 2004), introduced in this setting by Rigollet and Zeevi (2010). Perchet and Rigollet showed that if the support of the marginal is a manifold and the margin exponent exceeds one, then a single arm must be optimal on the interior of the support (Perchet et al., 2013, Proposition 3.1). All of our theoretical results require Assumptions 1, 2 and 3. We shall also use one of the following two assumptions.
Assumption 4 (Subgaussian noise assumption)
For each $a \in \mathcal{A}$ and $x \in \mathcal{X}$ the arms have sub-Gaussian noise, i.e. there exists $\sigma > 0$ such that for all $\gamma \in \mathbb{R}$, $\mathbb{E}\left[ \exp\left( \gamma \left( Y_a - f_a(x) \right) \right) \mid X = x \right] \leq \exp\left( \gamma^2 \sigma^2 / 2 \right)$.
Assumption 5 (Bounded rewards assumption)
For all $a \in \mathcal{A}$ and $x \in \mathcal{X}$, we have $Y_a \in [0, 1]$ almost surely.
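As an illustration of Assumption 3 (an assumed example, not from the paper): for two Lipschitz reward functions on $[0,1]$ whose gap vanishes linearly wherever it vanishes, the measure of the low-gap region scales linearly in $\varepsilon$, i.e. the margin condition holds with exponent $\alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(size=200_000)
# Gap Delta(x) between two hypothetical Lipschitz reward functions.
gap = np.abs(np.sin(np.pi * x) - x ** 2)
m = {eps: float(np.mean((gap > 0) & (gap <= eps))) for eps in (0.01, 0.02, 0.04)}
for eps, frac in m.items():
    print(f"mu(0 < Delta <= {eps}) ~ {frac:.5f}")
```

Doubling $\varepsilon$ roughly doubles the measure of the low-gap region, consistent with $\alpha = 1$.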
3 Nearest neighbour algorithms
In this section we introduce a pair of nearest neighbour based UCB strategies. We begin by introducing a generalised $k$-nearest neighbour index strategy, of which the other strategies are special cases.
3.1 The generalised k-nearest neighbour index strategy
Suppose we are at a time step $t$ and we have access to the reward history $\{(X_s, a_s, Y_{s, a_s})\}_{s < t}$. For each $a \in \mathcal{A}$ we let $n_{t,a}$ denote the number of times arm $a$ has been played before time $t$, and we let $s_{a,1}, \dots, s_{a, n_{t,a}}$ be an enumeration of those times such that for each $q < n_{t,a}$, $\rho(X_{s_{a,q}}, X_t) \leq \rho(X_{s_{a,q+1}}, X_t)$. Given $a \in \mathcal{A}$ and $k \in [n_{t,a}]$ we define the $k$-nearest neighbour radius $r_{t,a}(k) = \rho(X_{s_{a,k}}, X_t)$ and let $N_{t,a}(k) = \{s_{a,1}, \dots, s_{a,k}\}$ denote the corresponding set of the $k$ nearest times. We adopt the convention that $r_{t,a}(k) = \infty$ whenever $k > n_{t,a}$. For each $k \in [n_{t,a}]$ we define the $k$-nearest neighbour reward estimate $\hat{f}_{t,a}(k) = \frac{1}{k} \sum_{s \in N_{t,a}(k)} Y_{s, a_s}$. In addition, given a constant $\theta > 0$ and a non-decreasing function $\varphi : \mathbb{N} \to [0, \infty)$ we define a corresponding uncertainty value by $U_{t,a}(k) = \sqrt{\frac{\theta \cdot \varphi(t)}{k}} + \lambda \cdot r_{t,a}(k)$.
We shall combine these quantities to construct an index corresponding to an upper-confidence bound on the reward function. Our algorithm then proceeds as follows. At each time step $t$, a feature vector $X_t$ is received. For each arm $a$, the algorithm selects a number of neighbours $k$ by minimising the uncertainty value. The algorithm then selects the arm which maximises the index. The pseudo-code for this generalised $k$-NN index strategy is presented in Algorithm 1.
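A minimal sketch of the strategy just described, for real-valued covariates with the usual metric. The index and uncertainty here are assumptions chosen for illustration: a UCB-style index of mean plus uncertainty, with the uncertainty combining a noise term that shrinks in $k$ and a Lipschitz bias term that grows in $k$; the paper's exact constants and exploration function may differ.

```python
import numpy as np

def knn_index_choose(x, history, n_arms, t, theta=2.0, lam=1.0):
    """history: list of (covariate, arm, reward) triples observed before time t."""
    phi = np.log(max(t, 2))                    # assumed exploration function
    best_arm, best_index = 0, -np.inf
    for a in range(n_arms):
        # Enumerate previous plays of arm a by distance of their covariate to x.
        obs = sorted((abs(xi - x), ri) for (xi, ai, ri) in history if ai == a)
        if not obs:
            return a                           # unexplored arms get an infinite index
        radii = np.array([d for d, _ in obs])  # k-NN radii
        means = np.cumsum([r for _, r in obs]) / np.arange(1, len(obs) + 1)
        ks = np.arange(1, len(obs) + 1)
        U = np.sqrt(theta * phi / ks) + lam * radii   # uncertainty for each k
        k = int(np.argmin(U))                  # k minimising the uncertainty (0-based)
        index = means[k] + U[k]                # upper confidence bound for arm a
        if index > best_index:
            best_arm, best_index = a, index
    return best_arm
```

For instance, with five plays of each arm at the same covariate, the arm with the higher empirical mean is selected, while any arm with no observations is tried first.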
By selecting $k$ so as to minimise the uncertainty, we avoid giving an explicit formula for $k$. This is fortunate, since in order to obtain optimal regret bounds, any such formula would necessarily depend upon both the time horizon $n$ and the intrinsic dimensionality $d$ of the data, and in general, neither $n$ nor $d$ will be known a priori by the learner. Selecting $k$ in this way is inspired by Kpotufe's procedure for selecting $k$ in the regression setting, so as to minimise an upper bound on the squared error (Kpotufe, 2011).
3.2 k-Nearest Neighbour UCB
The $k$-Nearest Neighbour UCB algorithm ($k$-NN UCB) is a special case of Algorithm 1 with the following index function, which adds the uncertainty value to the corresponding $k$-nearest neighbour reward estimate,
The $k$-NN UCB algorithm satisfies the following regret bound whenever the noise is subgaussian (Assumption 4). First we take the exploration function $\varphi(t) = \log t$ and choose the constant $\theta$ sufficiently large in terms of the sub-Gaussian constant $\sigma$.  Suppose that Assumption 1 holds with constants $(C_d, d)$, Assumption 2 holds with Lipschitz constant $\lambda$, Assumption 3 holds with constants $(C_m, \alpha)$, and Assumption 4 holds. Let $\pi$ be the $k$-NN UCB algorithm (Algorithm 1 with index as in equation (1)). Then there exists a constant $C$, depending solely upon $C_d$, $d$, $\lambda$, $C_m$, $\alpha$ and $\sigma$, such that for all $n \in \mathbb{N}$ we have $\mathcal{R}_n(\pi) \leq C \cdot n \left( \frac{\log n}{n} \right)^{\frac{1+\alpha}{2+d}}$.
Theorem 3.2 follows from the more general Theorem 4 in Section 4. The full proof is given in Appendix A. Note that, with a suitable choice of constants, we obtain a regret bound which is minimax optimal up to logarithmic factors for any smooth compact embedded sub-manifold (see Theorem H.3, Appendix H for details).
3.3 k-Nearest Neighbour KL-UCB
The $k$-Nearest Neighbour KL-UCB algorithm is another special case of Algorithm 1, customised for the setting of bounded rewards. It is an adaptation of the KL-UCB algorithm of Garivier and Cappé (2011), which has shown strong empirical performance combined with tight regret bounds. Given $p, q \in [0, 1]$ we define the (Bernoulli) Kullback-Leibler divergence $\mathcal{K}(p, q)$ by $\mathcal{K}(p, q) = p \log\left(\frac{p}{q}\right) + (1-p) \log\left(\frac{1-p}{1-q}\right)$.
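Computationally, a KL-based index requires inverting this divergence: the upper confidence bound is the largest $q$ with $k \cdot \mathcal{K}(\hat{p}, q)$ below the exploration threshold, which can be found by bisection since $\mathcal{K}(\hat{p}, \cdot)$ is increasing on $[\hat{p}, 1]$. The constants below are illustrative assumptions.

```python
import math

def kl_bernoulli(p, q):
    # Bernoulli KL divergence K(p, q), clamped away from 0 and 1.
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, k, phi, iters=60):
    # Largest q in [p_hat, 1] with k * K(p_hat, q) <= phi, found by bisection.
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if k * kl_bernoulli(p_hat, mid) <= phi:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.5, 100, math.log(1000)))
```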
Suppose that Assumption 1 holds with constants $(C_d, d)$, Assumption 2 holds with Lipschitz constant $\lambda$, Assumption 3 holds with constants $(C_m, \alpha)$, and Assumption 5 holds. Let $\pi$ be the $k$-NN KL-UCB algorithm (Algorithm 1 with index as in equation (2)). Then there exists a constant $C$, depending solely upon $C_d$, $d$, $\lambda$, $C_m$ and $\alpha$, such that for all $n \in \mathbb{N}$ we have $\mathcal{R}_n(\pi) \leq C \cdot n \left( \frac{\log n}{n} \right)^{\frac{1+\alpha}{2+d}}$.
Theorem 3.3 follows from the more general Theorem 4 in Section 4. The full proof is given in Appendix B. As with Theorem 3.2, we may select the constants to obtain a regret bound which is minimax optimal up to logarithmic factors. Experiments on synthetic data indicate that the $k$-NN KL-UCB algorithm typically outperforms the $k$-NN UCB algorithm, just as the KL-UCB algorithm (Garivier and Cappé, 2011) typically outperforms the standard UCB algorithm (see Section 5). However, the regret bounds in Theorems 3.2 and 3.3 are of the same order.
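The empirical advantage of the KL-based index has a simple explanation: by Pinsker's inequality $\mathcal{K}(p, q) \geq 2(p - q)^2$, the KL upper confidence bound never exceeds the corresponding Hoeffding-style bound $\hat{p} + \sqrt{\varphi / (2k)}$, and it is markedly tighter for empirical means near $0$ or $1$. A numerical illustration with assumed constants:

```python
import math

def kl_bernoulli(p, q):
    # Bernoulli KL divergence, clamped away from 0 and 1.
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_upper(p_hat, k, phi, iters=60):
    # KL upper confidence bound, via bisection on [p_hat, 1].
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if k * kl_bernoulli(p_hat, mid) <= phi else (lo, mid)
    return lo

k, phi = 50, math.log(500)
for p_hat in (0.05, 0.5, 0.95):
    hoeffding = p_hat + math.sqrt(phi / (2 * k))   # Hoeffding-style bound
    print(f"p_hat={p_hat}: KL bound {kl_upper(p_hat, k, phi):.3f} "
          f"vs Hoeffding bound {hoeffding:.3f}")
```

Note that the Hoeffding-style bound can exceed one, whereas the KL bound never does; this is precisely where the KL-based index gains.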
4 Regret analysis
In order to prove Theorems 3.2 and 3.3 we first prove the more general Theorem 4. Suppose we have a $k$-NN index strategy (Algorithm 1) with a given index. We shall define for this strategy a set of good events as follows. For each time $t$, arm $a$ and $k$ we define the event
Let .  Suppose that Assumption 1 holds with constants , Assumption 2 holds with Lipschitz constant and Assumption 3 holds with constants . Suppose is a -NN index strategy (Algorithm 1) with index . Then there exists a constant , depending solely upon such that for all we have
hold with high probability. The proof of Theorem 4 consists of two primary components. Firstly, we prove an upper bound on the number of times an arm is pulled with covariates in a given region of the metric space with a sufficiently high local margin (see Lemma 3). A key difference from the regret bounds of Rigollet and Zeevi (2010) and Perchet et al. (2013) is that these local bounds hold for arbitrary subsets, rather than just the members of the partition constructed by the algorithm. Secondly, we construct a partition of the covariate space based on local values of the margin, with regions of low margin partitioned into smaller pieces (see the proof of Proposition 1). The local upper bound is then applied to members of the partition to derive the regret bound. Given a subset and we define and let
There exists a constant , depending solely upon , such that for all we have
For any subset and any we have .
See Appendix D. The following key lemma bounds the number of times an arm is pulled in a given region of the covariate space.
Given a subset and an arm with , the following holds almost surely
Clearly we can assume that . We define
Note that as holds we must have . Since and we must have . Moreover, given any with we must have for some . Thus, . Note that implies . Choose so that . Since and we have . On the other hand, since holds we have, and
Thus, given above and the definitions of and we have
By the Lipschitz assumption (Assumption 2) together with the fact that we must have
Combining with the above proves the lemma. Lemma 4 applies Assumption 1 to obtain an analogue of nested hyper-cubes within the metric space. The proof adapts ideas from geometric measure theory (Käenmäki et al., 2012).
Suppose that Assumption 1 holds. Given , and there exists a finite collection of subsets which satisfies:
For each , is a partition of .
Given with , and , either or .
For all , we have and
See Appendix E. We are now ready to complete the proof of Proposition 1, which entails Theorem 4. [Proof of Proposition 1] Throughout the proof, we use constants depending solely upon the parameters of the assumptions. We shall apply Lemma 4 to construct a cover of the covariate space based upon the local value of the margin. First let . Take some (to be specified later), let and and let be a collection of subsets satisfying properties (1),(2),(3) from Lemma 4. In particular, for all and we have and . First let
For each we define
We claim that for all we have
For the claim follows straightforwardly from the fact that is a partition of . Now suppose the claim holds for some . By properties (1) and (2) in Lemma 4 for any ,
Moreover, if then . Thus, we have
Hence, given that the claim holds for it must also hold for . From the special case where we deduce that,
Thus, given that we have
Moreover, since for , we have . Hence,
Now take and consider . We have ,
and . Hence, by Lemma 3 we have
Combining with Lemma 2 and we have
Moreover, it follows from the definition of that for all we have . Hence, by Assumption 3 we have
Thus, we have
Thus, if we take we have