# The K-Nearest Neighbour UCB algorithm for multi-armed bandits with covariates

In this paper we propose and explore the k-Nearest Neighbour UCB algorithm for multi-armed bandits with covariates. We focus on a setting where the covariates are supported on a metric space of low intrinsic dimension, such as a manifold embedded within a high dimensional ambient feature space. The algorithm is conceptually simple and straightforward to implement. The k-Nearest Neighbour UCB algorithm does not require prior knowledge of either the intrinsic dimension of the marginal distribution or the time horizon. We prove a regret bound for the k-Nearest Neighbour UCB algorithm which is minimax optimal up to logarithmic factors. In particular, the algorithm automatically takes advantage of both low intrinsic dimensionality of the marginal distribution over the covariates and low noise in the data, expressed as a margin condition. In addition, focusing on the case of bounded rewards, we give corresponding regret bounds for the k-Nearest Neighbour KL-UCB algorithm, which is an analogue of the KL-UCB algorithm adapted to the setting of multi-armed bandits with covariates. Finally, we present empirical results which demonstrate the ability of both the k-Nearest Neighbour UCB and k-Nearest Neighbour KL-UCB to take advantage of situations where the data is supported on an unknown sub-manifold of a high-dimensional feature space.


## 1 Introduction

The multi-armed bandit is a simple model which exemplifies the exploration-exploitation trade-off in reinforcement learning. Solutions to this problem have numerous practical applications, from sequential clinical trials to web-page ad placement.

The k-nearest neighbour method is amongst the simplest approaches to supervised learning. In addition, it has strong theoretical guarantees. Kpotufe has shown that the k-nearest neighbour regression algorithm attains distribution-dependent minimax optimal rates, without prior knowledge of the intrinsic dimensionality of the data (Kpotufe (2011)). Chaudhuri and Dasgupta have shown that the k-nearest neighbour method attains distribution-dependent minimax optimal rates in the supervised classification setting (Chaudhuri and Dasgupta (2014)). In particular, the k-nearest neighbour classifier automatically takes advantage of low noise in the data, expressed as a margin condition. In light of these theoretical strengths, it is natural to apply the k-nearest neighbour method to the problem of multi-armed bandits with covariates. We propose the k-nearest neighbour UCB algorithm (k-NN UCB), a conceptually simple procedure for multi-armed bandits with covariates which combines the UCB algorithm with k-nearest neighbour regression. The algorithm does not require prior knowledge of the intrinsic dimensionality of the data. It is also naturally anytime, without resorting to the doubling trick. We prove a regret bound for the k-NN UCB algorithm which is minimax optimal up to logarithmic factors. In particular, the algorithm automatically takes advantage of both low intrinsic dimensionality of the marginal distribution over the covariates and low noise conditions, expressed as a margin condition. In addition, focusing on the case of bounded rewards, we give corresponding regret bounds for the k-nearest neighbour KL-UCB algorithm (k-NN KL-UCB), which is an analogue of the KL-UCB algorithm (Garivier and Cappé (2011)) adapted to the setting of multi-armed bandits with covariates. Finally, we present empirical results which demonstrate the ability of both k-NN UCB and k-NN KL-UCB to take advantage of situations where the data is supported on an unknown sub-manifold of a high-dimensional feature space.

## 2 Bandits on a metric space

In this section we shall introduce some notation and background.

### 2.1 Notation

We consider the problem of bandits with covariates on metric spaces. Suppose we have a metric space $(\mathcal{X}, \rho)$. Given $x\in\mathcal{X}$ and $r>0$ we let $B_r(x)$ denote the open metric ball of radius $r$, centred at $x$. Given $m\in\mathbb{N}$ we let $[m] := \{1,\dots,m\}$. Given a collection of $A$ arms, we let $\mathcal{P}$ denote a distribution over random variables $(X, Y^1, \dots, Y^A)$ with $X\in\mathcal{X}$ and $Y^a\in\mathbb{R}$, where $Y^a$ denotes the value of arm $a\in[A]$. We let $\mu$ denote the marginal of $\mathcal{P}$ over $X$ and let $\mathcal{X}_\mu$ denote its support. For each $a\in[A]$ we define a function $f_a$ by $f_a(x) := \mathbb{E}[Y^a \mid X = x]$. For each $t\in\mathbb{N}$ a random sample $(X_t, Y^1_t, \dots, Y^A_t)$ is drawn i.i.d. from $\mathcal{P}$. We are allowed to view the feature vector $X_t$ and we must choose an arm $\pi_t\in[A]$ and receive the stochastic reward $Y^{\pi_t}_t$. We are able to observe the value of our chosen arm, but not the value of the remaining arms. Our sequential choice of arms is given by a policy $\pi$ consisting of functions $\pi_t$, where $\pi_t$ is determined purely by $X_t$ together with the known reward history $\{(X_s, \pi_s, Y^{\pi_s}_s)\}_{s\in[t-1]}$. The goal is to choose $\pi$ so as to maximise the cumulative reward $\sum_{t\in[n]} Y^{\pi_t}_t$. In order to quantify the quality of a policy $\pi$ we compare its cumulative reward to that of an oracle policy $\pi^*$ defined by $\pi^*_t := \mathrm{argmax}_{a\in[A]} f_a(X_t)$. We define the regret by $R_n(\pi) := \sum_{t\in[n]}\left(Y^{\pi^*_t}_t - Y^{\pi_t}_t\right)$.
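To make the protocol concrete, the following minimal simulation sketches the interaction loop above; the two-arm environment, the noise level and all names here are our own illustrative choices, not part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-arm environment: covariates are uniform on [0, 1] and the
# conditional mean reward of arm a at covariate x is f(a, x).
def f(a, x):
    return [x, 1.0 - x][a]

def oracle_arm(x):
    # The oracle policy plays argmax_a f_a(x) at every round.
    return max(range(2), key=lambda a: f(a, x))

def run(policy, n=1000):
    """Play n rounds and return the realised regret sum_t (Y^{pi*}_t - Y^{pi}_t)."""
    regret = 0.0
    for _ in range(n):
        x = rng.uniform()
        # All arm values are drawn, but the learner only observes the chosen one.
        y = [f(a, x) + rng.normal(scale=0.1) for a in range(2)]
        regret += y[oracle_arm(x)] - y[policy(x)]
    return regret

# A policy that ignores the covariate suffers regret growing linearly in n.
print(run(lambda x: 0, n=2000))
```

By construction the oracle policy itself incurs zero realised regret, since it plays $\pi^*_t$ at every round.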

### 2.2 Assumptions

We shall make the following assumptions:

###### Assumption 1 (Dimension assumption)

There exist constants $C_d > 0$ and $d > 0$ such that for all $x\in\mathcal{X}_\mu$ and all $r > 0$, we have $\mu\left(B_r(x)\right) \ge \min\left\{1,\, C_d\cdot r^d\right\}$.

Assumption 1 holds for well-behaved measures which are absolutely continuous with respect to the Riemannian volume form on a $d$-dimensional sub-manifold of Euclidean space (see Proposition 2, Appendix H). See Appendix G for an example where Assumption 1 holds whilst the measure of dyadic sub-cubes is not well behaved.

###### Assumption 2 (Lipschitz assumption)

There exists a constant $\lambda > 0$ such that for all $a\in[A]$ and $x, x'\in\mathcal{X}$, we have $\left|f_a(x) - f_a(x')\right| \le \lambda\cdot\rho(x, x')$.

Assumption 2 quantifies the requirement that similar covariates should imply similar conditional reward expectations. Let $f^*(x) := \max_{a\in[A]} f_a(x)$. For each $a\in[A]$ let $\Delta_a(x) := f^*(x) - f_a(x)$, and define

$$\Delta(x) := \begin{cases} \min_{a\in[A]}\left\{\Delta_a(x) : \Delta_a(x) > 0\right\} & \text{if } \exists\, a\in[A] \text{ with } \Delta_a(x) > 0,\\ 0 & \text{otherwise.}\end{cases}$$
###### Assumption 3 (Margin assumption)

There exist constants $C_\alpha > 0$ and $\alpha > 0$ such that for all $\delta > 0$ we have $\mu\left(\left\{x\in\mathcal{X} : 0 < \Delta(x) \le \delta\right\}\right) \le C_\alpha\cdot\delta^\alpha$.

Assumption 3 quantifies the difficulty of the problem. It is a natural analogue of Tsybakov's margin condition (Tsybakov (2004)) introduced by Rigollet and Zeevi (2010). Perchet and Rigollet showed that when $\mathcal{X}_\mu$ is a manifold, large values of $\alpha$ force the same arm to be optimal on the interior of $\mathcal{X}_\mu$ (Perchet et al., 2013, Proposition 3.1). All of our theoretical results require Assumptions 1, 2 and 3. We shall also use one of the following two assumptions.
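As a numeric illustration of the margin $\Delta(x)$ defined above, the following snippet computes the smallest positive gap among the arm means at a fixed covariate; the arm means used are arbitrary examples.

```python
def margin(means):
    """Delta(x): the smallest positive gap max_a f_a(x) - f_a(x), or 0 if no gap is positive."""
    best = max(means)
    positive_gaps = [best - m for m in means if best - m > 0]
    return min(positive_gaps) if positive_gaps else 0.0

print(margin([0.3, 0.7]))       # single suboptimal arm: gap 0.4
print(margin([0.5, 0.5]))       # tied arms: Delta(x) = 0
print(margin([1.0, 0.2, 0.6]))  # smallest positive gap is 1.0 - 0.6
```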

###### Assumption 4 (Subgaussian noise assumption)

For each $a\in[A]$ and $t\in\mathbb{N}$ the arms have sub-gaussian noise, i.e. for all $\theta\in\mathbb{R}$ and $x\in\mathcal{X}$,

$$\mathbb{E}\left[\exp\left(\theta\cdot\left(Y^a_t - f_a(x)\right)\right)\,\middle|\, X_t = x\right] \le \exp\left(\theta^2/2\right).$$

###### Assumption 5 (Bounded rewards assumption)

For all $a\in[A]$ and $t\in\mathbb{N}$, $Y^a_t\in[0,1]$.

## 3 Nearest neighbour algorithms

In this section we introduce a pair of nearest neighbour based UCB strategies. We begin by introducing a generalized k-nearest neighbours index strategy, of which the other strategies are special cases.

### 3.1 The generalized k-nearest neighbours index strategy

Suppose we are at a time step $t\in\mathbb{N}$ and we have access to the reward history $\{(X_s, \pi_s, Y^{\pi_s}_s)\}_{s\in[t-1]}$. For each $x\in\mathcal{X}$ we let $\left(\tau_{t,q}(x)\right)_{q\in[t-1]}$ be an enumeration of $[t-1]$ such that for each $q\in[t-2]$,

$$\rho\left(x, X_{\tau_{t,q}(x)}\right) \le \rho\left(x, X_{\tau_{t,q+1}(x)}\right).$$

Given $k\in[t-1]$ and $x\in\mathcal{X}$ we define $\Gamma_{t,k}(x) := \left\{\tau_{t,q}(x) : q\in[k]\right\}$ and let

$$r_{t,k}(x) := \max\left\{\rho(x, X_s) : s\in\Gamma_{t,k}(x)\right\} = \rho\left(x, X_{\tau_{t,k}(x)}\right).$$

We adopt the convention that $c/0 := +\infty$ for all $c \ge 0$. For each $a\in[A]$ we define

$$N^a_{t,k}(x) := \sum_{s\in\Gamma_{t,k}(x)}\mathbb{1}\{\pi_s = a\},\qquad S^a_{t,k}(x) := \sum_{s\in\Gamma_{t,k}(x)}\mathbb{1}\{\pi_s = a\}\cdot Y^a_s,\qquad \hat f^a_{t,k}(x) := S^a_{t,k}(x)/N^a_{t,k}(x).$$

In addition, given a constant $\theta > 0$ and a non-decreasing function $\varphi$ we define a corresponding uncertainty value by

$$U^a_{t,k}(x) := \sqrt{\frac{\theta\cdot\log t}{N^a_{t,k}(x)}} + \varphi(t)\cdot r_{t,k}(x).$$

We shall combine $N^a_{t,k}$, $S^a_{t,k}$, $\hat f^a_{t,k}$ and $U^a_{t,k}$ to construct an index $I^a_{t,k}$ corresponding to an upper-confidence bound on the reward function $f_a$. Our algorithm then proceeds as follows. At each time step $t$, a feature vector $X_t$ is received. For each arm $a\in[A]$, the algorithm selects a number of neighbours $k_t(a)$ by minimising the uncertainty $U^a_{t,k}(X_t)$. The algorithm then selects the arm which maximises the index $I^a_{t,k_t(a)}(X_t)$. The pseudo-code for this generalised k-NN index strategy is presented in Algorithm 1.

By selecting $k_t(a)$ so as to minimise the uncertainty $U^a_{t,k}(X_t)$ we avoid giving an explicit formula for $k$. This is fortuitous, since in order to obtain optimal regret bounds, any such formula would necessarily depend upon both the time horizon $n$ and the intrinsic dimensionality $d$ of the data, and in general, neither $n$ nor $d$ will be known a priori by the learner. Selecting $k$ in this way is inspired by Kpotufe's procedure for selecting $k$ in the regression setting, so as to minimise an upper bound on the squared error (Kpotufe (2011)).
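A minimal sketch of this selection rule, instantiated with the k-NN UCB index of Section 3.2, is given below; the function signature, the default $\theta$, and the choice $\varphi(t) = \log(t+2)$ are our own illustrative assumptions rather than tuned recommendations.

```python
import numpy as np

def knn_ucb_arm(x, X_hist, arm_hist, y_hist, t, n_arms,
                theta=2.0, phi=lambda t: np.log(t + 2)):
    """Pick an arm: for each arm, choose k minimising U^a_{t,k}(x), then
    play the arm maximising the index hat f^a_{t,k}(x) + U^a_{t,k}(x)."""
    if len(X_hist) == 0:
        return 0  # no history yet: any arm will do
    X_hist, arm_hist, y_hist = map(np.asarray, (X_hist, arm_hist, y_hist))
    dists = np.linalg.norm(X_hist - x, axis=1)
    order = np.argsort(dists)              # tau_{t,1}(x), tau_{t,2}(x), ...
    best_index, best_arm = -np.inf, 0
    for a in range(n_arms):
        best_U, best_I = np.inf, np.inf    # an unexplored arm keeps an infinite index
        for k in range(1, len(order) + 1):
            nbrs = order[:k]
            pulls = arm_hist[nbrs] == a
            N = int(np.sum(pulls))         # N^a_{t,k}(x)
            if N == 0:
                continue
            f_hat = float(np.sum(y_hist[nbrs] * pulls)) / N   # hat f^a_{t,k}(x)
            r = dists[order[k - 1]]                           # r_{t,k}(x)
            U = np.sqrt(theta * np.log(t) / N) + phi(t) * r   # uncertainty U^a_{t,k}(x)
            if U < best_U:                 # k_t(a) minimises the uncertainty
                best_U, best_I = U, f_hat + U
        if best_I > best_index:
            best_index, best_arm = best_I, a
    return best_arm
```

Note that scanning all $k\in[t-1]$ costs $O(t)$ per arm; in practice one can restrict the scan to, say, powers of two without changing the spirit of the rule.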

### 3.2 k-Nearest Neighbour UCB

The k-Nearest Neighbour UCB algorithm (k-NN UCB) is a special case of Algorithm 1 with the following index function,

$$I^a_{t,k}(x) = \hat f^a_{t,k}(x) + U^a_{t,k}(x). \tag{1}$$

The k-NN UCB algorithm satisfies the following regret bound whenever the noise is subgaussian (Assumption 4). First we let $\overline{\log}(n) := \max\{1, \log n\}$ and define $M := \max_{a\in[A]}\sup_{x\in\mathcal{X}_\mu}\Delta_a(x)$. For all $\lambda \ge 0$ let $\varphi^{-1}(\lambda) := \min\{t\in\mathbb{N} : \varphi(t)\ge\lambda\}$.

###### Theorem 3.2

Suppose that Assumption 1 holds with constants $C_d, d$, Assumption 2 holds with Lipschitz constant $\lambda$, Assumption 3 holds with constants $C_\alpha, \alpha$ and Assumption 4 holds. Let $\pi$ be the k-NN UCB algorithm (Algorithm 1 with $I^a_{t,k}$ as in equation (1)). Then for all non-decreasing $\varphi$ there exists a constant $C$, depending solely upon $\theta$, $\lambda$, $C_d$, $d$, $C_\alpha$ and $\alpha$, such that for all $n\in\mathbb{N}$ we have

$$\mathbb{E}\left[R_n(\pi)\right] \le M\cdot\varphi^{-1}(\lambda) + C\cdot A\cdot\left(M\cdot\varphi(n)^{d} + n\cdot\left(\frac{\varphi(n)^{d}\cdot\overline{\log}(n)}{n}\right)^{\min\left\{\frac{\alpha+1}{d+2},\,1\right\}}\right).$$

Theorem 3.2 follows from the more general Theorem 4 in Section 4. The full proof is given in Appendix A. Note that by taking $\varphi$ to grow logarithmically we obtain a regret bound which is minimax optimal up to logarithmic factors for any smooth compact embedded sub-manifold (see Theorem H.3, Appendix H for details).

### 3.3 k-Nearest Neighbour KL-UCB

The k-Nearest Neighbour KL-UCB algorithm is another special case of Algorithm 1, customized for the setting of bounded rewards. The k-Nearest Neighbour KL-UCB algorithm is an adaptation of the KL-UCB algorithm of Garivier and Cappé (2011), which has shown strong empirical performance combined with tight regret bounds. Given $p, q\in(0,1)$ we define the Kullback-Leibler divergence $d(p,q)$ by

$$d(p,q) := p\cdot\log(p/q) + (1-p)\cdot\log\left((1-p)/(1-q)\right).$$

The k-NN KL-UCB index is then given by

$$I^a_{t,k}(x) = \sup\left\{\omega\in[0,1] : N^a_{t,k}(x)\cdot d\left(\hat f^a_{t,k}(x), \omega\right) \le \theta\cdot\log t\right\} + \varphi(t)\cdot r_{t,k}(x). \tag{2}$$

###### Theorem 3.3

Suppose that Assumption 1 holds with constants $C_d, d$, Assumption 2 holds with Lipschitz constant $\lambda$, Assumption 3 holds with constants $C_\alpha, \alpha$ and Assumption 5 holds. Let $\pi$ be the k-NN KL-UCB algorithm (Algorithm 1 with $I^a_{t,k}$ as in equation (2)). Then for all non-decreasing $\varphi$ there exists a constant $C$, depending solely upon $\theta$, $\lambda$, $C_d$, $d$, $C_\alpha$ and $\alpha$, such that for all $n\in\mathbb{N}$ we have

$$\mathbb{E}\left[R_n(\pi)\right] \le \varphi^{-1}(\lambda) + C\cdot A\cdot\left(\varphi(n)^{d} + n\cdot\left(\frac{\varphi(n)^{d}\cdot\overline{\log}(n)}{n}\right)^{\min\left\{\frac{\alpha+1}{d+2},\,1\right\}}\right).$$

Theorem 3.3 follows from the more general Theorem 4 in Section 4. The full proof is given in Appendix B. As with Theorem 3.2, we may select $\varphi$ so as to obtain a regret bound which is minimax optimal up to logarithmic factors. Experiments on synthetic data indicate that the k-NN KL-UCB algorithm typically outperforms the k-NN UCB algorithm, just as the KL-UCB algorithm (Garivier and Cappé (2011)) typically outperforms the standard UCB algorithm (see Section 5). However, the regret bounds in Theorems 3.2 and 3.3 are of the same order.
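Since $d(p,\cdot)$ is increasing on $[p, 1]$, the supremum in equation (2) can be computed to arbitrary precision by bisection. The sketch below makes the usual $0\log 0 = 0$ convention explicit via clamping; the defaults for $\theta$ and $\varphi$ are placeholder assumptions, not tuned values.

```python
import numpy as np

def kl_bernoulli(p, q):
    """d(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)), clamped away from {0, 1}."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(f_hat, N, t, r, theta=2.0, phi=lambda t: np.log(t + 2), iters=50):
    """sup{w in [0,1] : N * d(f_hat, w) <= theta log t} + phi(t) * r, via bisection."""
    budget = theta * np.log(t) / N
    lo, hi = f_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(f_hat, mid) <= budget:
            lo = mid               # mid still satisfies the constraint: move up
        else:
            hi = mid
    return lo + phi(t) * r
```

The index always lies in $[\hat f^a_{t,k}(x),\, 1 + \varphi(t)\cdot r_{t,k}(x)]$, and shrinks towards $\hat f^a_{t,k}(x)$ as $N^a_{t,k}(x)$ grows.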

## 4 Regret analysis

In order to prove Theorems 3.2 and 3.3 we first prove the more general Theorem 4. Suppose we have a k-NN index strategy (Algorithm 1) with index $I^a_{t,k}$. We shall define for the index strategy a set of good events as follows. For each $t\in\mathbb{N}$, $k\in[t-1]$ and $a\in[A]$ we define the event

$$G^a_{t,k} := \left\{\varphi(t)\ge\lambda\right\}\cap\left\{I^a_{t,k}(X_t) - 2\cdot U^a_{t,k}(X_t) \le f_a(X_t) \le I^a_{t,k}(X_t)\right\}.$$

Let $G_t := \bigcap_{a\in[A]}\bigcap_{k\in[t-1]} G^a_{t,k}$.

###### Theorem 4

Suppose that Assumption 1 holds with constants $C_d, d$, Assumption 2 holds with Lipschitz constant $\lambda$ and Assumption 3 holds with constants $C_\alpha, \alpha$. Suppose $\pi$ is a k-NN index strategy (Algorithm 1) with index $I^a_{t,k}$. Then there exists a constant $C$, depending solely upon $\lambda$, $C_d$, $d$, $C_\alpha$ and $\alpha$, such that for all $n\in\mathbb{N}$ we have

$$\mathbb{E}\left[R_n(\pi)\right] \le C\cdot A\cdot\left(M\cdot\varphi(n)^{d} + n\cdot\left(\frac{\theta\cdot\varphi(n)^{d}\cdot\overline{\log}(n)}{n}\right)^{\min\left\{\frac{\alpha+1}{d+2},\,1\right\}}\right) + M\cdot\sum_{t\in[n]}\left(1 - \mathbb{P}[G_t]\right).$$

Theorems 3.2 and 3.3 are deduced from Theorem 4 in Appendices A and B, respectively. In both cases, the deduction amounts to using concentration inequalities to show that the good events $G_t$ hold with high probability. The proof of Theorem 4 consists of two primary components. Firstly, we prove an upper bound on the number of times an arm is pulled with covariates in a given region of the metric space with a sufficiently high local margin (see Lemma 3). A key difference with the regret bounds of (Rigollet and Zeevi (2010), Perchet et al. (2013)) is that these local bounds hold for arbitrary subsets, rather than just the members of the partition constructed by the algorithm. Secondly, we construct a partition of the covariate space based on local values of the margin, with regions of low margin partitioned into smaller pieces (see the proof of Proposition 1). The local upper bound is then applied to members of the partition to derive the regret bound. Given a subset $B\subseteq\mathcal{X}$ and $a\in[A]$ we define $\Delta_a(B) := \sup_{x\in B}\Delta_a(x)$ and let

$$T^a_n(\pi, B) := \sum_{t\in[n]}\mathbb{1}\{G_t\}\cdot\mathbb{1}\{X_t\in B\}\cdot\mathbb{1}\{\pi_t = a\},\qquad \tilde R^a_n(\pi, B) := \sum_{t\in[n]}\mathbb{1}\{G_t\}\cdot\mathbb{1}\{X_t\in B\}\cdot\mathbb{1}\{\pi_t = a\}\cdot\left(Y^{\pi^*_t}_t - Y^{\pi_t}_t\right).$$
###### Lemma 1

$\mathbb{E}\left[R_n(\pi)\right] \le \sum_{a\in[A]}\mathbb{E}\left[\tilde R^a_n(\pi,\mathcal{X})\right] + M\cdot\sum_{t\in[n]}\left(1 - \mathbb{P}[G_t]\right)$.

See Appendix D. In light of Lemma 1, in order to prove Theorem 4 it suffices to prove the following proposition (Proposition 1).

###### Proposition 1

There exists a constant $C$, depending solely upon $\lambda$, $C_d$, $d$, $C_\alpha$ and $\alpha$, such that for all $a\in[A]$ and $n\in\mathbb{N}$ we have

$$\mathbb{E}\left[\tilde R^a_n(\pi,\mathcal{X})\right] \le C\cdot\left(M\cdot\varphi(n)^{d} + n\cdot\left(\frac{\theta\cdot\varphi(n)^{d}\cdot\overline{\log}(n)}{n}\right)^{\min\left\{\frac{\alpha+1}{d+2},\,1\right\}}\right).$$

Before proving Proposition 1 we require three lemmas (2, 3 and 4 below).

###### Lemma 2

For any subset $B\subseteq\mathcal{X}$ and any $a\in[A]$ we have $\mathbb{E}\left[\tilde R^a_n(\pi, B)\right] \le \Delta_a(B)\cdot\mathbb{E}\left[T^a_n(\pi, B)\right]$.

See Appendix D. The following key lemma bounds the number of times an arm is pulled in a given region of the covariate space.

###### Lemma 3

Given a subset $B\subseteq\mathcal{X}$ and an arm $a\in[A]$ with $\Delta_a(B) > 4\cdot\varphi(n)\cdot\mathrm{diam}(B)$, the following holds almost surely

$$T^a_n(\pi, B) \le \frac{4\theta\cdot\overline{\log}(n)}{\left(\Delta_a(B) - 4\cdot\varphi(n)\cdot\mathrm{diam}(B)\right)^2} + 1.$$

Clearly we can assume that $T^a_n(\pi, B) \ge 1$. We let $t\in[n]$ denote the final time step at which $G_t$ holds, $X_t\in B$ and $\pi_t = a$, and define

$$k(B) := \max\left\{q\in[t-1] : X_{\tau_{t,q}(X_t)}\in B\right\}.$$

Note that as $X_t\in B$ holds we must have $\rho(X_t, X_s) \le \mathrm{diam}(B)$ for every $s\in[t-1]$ with $X_s\in B$. Moreover, given any $s\in[t-1]$ with $X_s\in B$ we must have $s = \tau_{t,q}(X_t)$ for some $q\in[k(B)]$. Thus, $N^a_{t,k(B)}(X_t) \ge T^a_n(\pi, B) - 1$. Note that $X_{\tau_{t,k(B)}(X_t)}\in B$ implies $r_{t,k(B)}(X_t) \le \mathrm{diam}(B)$. Choose $z^*\in[A]$ so that $f_{z^*}(X_t) = \max_{z\in[A]} f_z(X_t)$. Since $\pi_t = a$ and $G_t$ holds we have $I^a_{t,k_t(a)}(X_t) \ge I^{z^*}_{t,k_t(z^*)}(X_t) \ge f_{z^*}(X_t)$. On the other hand, since $G_t$ holds we have

$$f_a(X_t) \ge I^a_{t,k_t(a)}(X_t) - 2\cdot U^a_{t,k_t(a)}(X_t).$$

Thus, given $z^*$ above and the definitions of $k_t(a)$ and $k(B)$ we have

$$\left(f_{z^*}(X_t) - f_a(X_t)\right)/2 \le U^a_{t,k_t(a)}(X_t) \le U^a_{t,k(B)}(X_t) = \sqrt{\frac{\theta\cdot\log t}{N^a_{t,k(B)}(X_t)}} + \varphi(t)\cdot r_{t,k(B)}(X_t) \le \sqrt{\frac{\theta\cdot\log t}{T^a_n(\pi, B) - 1}} + \varphi(n)\cdot\mathrm{diam}(B).$$

By the Lipschitz assumption (Assumption 2) together with the fact that $\varphi(n) \ge \varphi(t) \ge \lambda$ we must have

$$f_{z^*}(X_t) - f_a(X_t) \ge \Delta_a(B) - 2\lambda\cdot\mathrm{diam}(B) \ge \Delta_a(B) - 2\varphi(n)\cdot\mathrm{diam}(B).$$

Combining with the above proves the lemma. Lemma 4 applies Assumption 1 to obtain an analogue of nested hyper-cubes within $\mathcal{X}_\mu$. The proof adapts ideas from geometric measure theory (Käenmäki et al. (2012)).

###### Lemma 4

Suppose that Assumption 1 holds. Given $\delta\in(0,1)$, $r\in(0,1/3)$ and $q\in\mathbb{N}$ there exists a finite collection of subsets $\left\{Z_{l,i} : l\in\{0\}\cup[q],\, i\in[m_l]\right\}$ which satisfies:

1. For each $l\in\{0\}\cup[q]$, $\left\{Z_{l,i} : i\in[m_l]\right\}$ is a partition of $\mathcal{X}_\mu$.

2. Given $l, l'\in\{0\}\cup[q]$ with $l < l'$, $i\in[m_l]$ and $j\in[m_{l'}]$, either $Z_{l',j}\subseteq Z_{l,i}$ or $Z_{l',j}\cap Z_{l,i} = \emptyset$.

3. For all $l\in\{0\}\cup[q]$ and $i\in[m_l]$, we have $\mathrm{diam}(Z_{l,i}) \le \delta\cdot r^l$ and

$$\mu\left(Z_{l,i}\right) \ge C_d\cdot\left((\delta/4)\cdot(1-3r)\cdot r^l\right)^d.$$

See Appendix E. We are now ready to complete the proof of Proposition 1, which entails Theorem 4.

[Proof of Proposition 1] Throughout the proof $c_1, c_2, \dots$ will denote constants depending solely upon $\lambda$, $C_d$, $d$, $C_\alpha$ and $\alpha$. We shall apply Lemma 4 to construct a cover of $\mathcal{X}_\mu$ based upon the local value of $\Delta_a$. Take some $q\in\mathbb{N}$ (to be specified later), let $\delta(n) := 1/\varphi(n)$, and let $\{Z_{l,i}\}$ be a collection of subsets satisfying properties (1), (2), (3) from Lemma 4 with $\delta = \delta(n)$ and $r = 1/4$. In particular, for all $l\in\{0\}\cup[q]$ and $i\in[m_l]$ we have $\mathrm{diam}(Z_{l,i}) \le \delta(n)\cdot 4^{-l}$ and $\mu(Z_{l,i}) \ge C_d\cdot\left(\delta(n)/16\right)^d\cdot 4^{-ld}$. First let

$$\mathcal{Z}^a_{\mathrm{big}} := \left\{Z_{0,i} : i\in[m_0],\, \Delta_a(Z_{0,i}) \ge 5\cdot\varphi(n)\cdot\delta(n)\right\}.$$

For each $l\in[q]$ we define

$$\mathcal{Z}^a_l := \left\{Z_{l,i} : i\in[m_l],\, 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l} \le \Delta_a(Z_{l,i}) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l+1}\right\}.$$

Finally, define

$$Z^a_{\mathrm{small}} := \left\{x\in\mathcal{X} : 0 < \Delta_a(x) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-q}\right\},\qquad Z^a_0 := \left\{x\in\mathcal{X} : \Delta_a(x) = 0\right\}.$$

We claim that for all $r\in\{0\}\cup[q]$ we have

$$\mathcal{X}_\mu \subseteq \bigcup\left(\mathcal{Z}^a_{\mathrm{big}}\cup\left(\bigcup_{l\in[r]}\mathcal{Z}^a_l\right)\cup\left\{Z_{r,i} : i\in[m_r],\, \Delta_a(Z_{r,i}) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-r}\right\}\right).$$

For $r = 0$ the claim follows straightforwardly from the fact that $\{Z_{0,i} : i\in[m_0]\}$ is a partition of $\mathcal{X}_\mu$. Now suppose the claim holds for some $r < q$. By properties (1) and (2) in Lemma 4, for any $i\in[m_r]$,

$$Z_{r,i} = \bigcup\left\{Z_{r+1,j} : j\in[m_{r+1}],\, Z_{r+1,j}\subseteq Z_{r,i}\right\}.$$

Moreover, if $Z_{r+1,j}\subseteq Z_{r,i}$ then $\Delta_a(Z_{r+1,j}) \le \Delta_a(Z_{r,i})$. Thus, we have

$$\bigcup\left\{Z_{r,i} : i\in[m_r],\, \Delta_a(Z_{r,i}) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-r}\right\} \subseteq \bigcup\left\{Z_{r+1,j} : j\in[m_{r+1}],\, \Delta_a(Z_{r+1,j}) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-r}\right\} = \bigcup\left(\mathcal{Z}^a_{r+1}\cup\left\{Z_{r+1,j} : j\in[m_{r+1}],\, \Delta_a(Z_{r+1,j}) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-r-1}\right\}\right).$$

Hence, given that the claim holds for $r$ it must also hold for $r+1$. From the special case where $r = q$ we deduce that

$$\mathcal{X}_\mu \subseteq \bigcup\left(\mathcal{Z}^a_{\mathrm{big}}\cup\left(\bigcup_{l\in[q]}\mathcal{Z}^a_l\right)\cup\left\{Z^a_{\mathrm{small}},\, Z^a_0\right\}\right).$$

Thus, given that $\mathbb{E}\left[\tilde R^a_n(\pi, Z^a_0)\right] = 0$ we have

$$\mathbb{E}\left[\tilde R^a_n(\pi,\mathcal{X})\right] \le \sum_{Z\in\mathcal{Z}^a_{\mathrm{big}}}\mathbb{E}\left[\tilde R^a_n(\pi, Z)\right] + \sum_{l=1}^{q}\sum_{Z\in\mathcal{Z}^a_l}\mathbb{E}\left[\tilde R^a_n(\pi, Z)\right] + \mathbb{E}\left[\tilde R^a_n(\pi, Z^a_{\mathrm{small}})\right].$$

We begin by considering $Z\in\mathcal{Z}^a_{\mathrm{big}}$. Given $Z\in\mathcal{Z}^a_{\mathrm{big}}$ we have $\Delta_a(Z) \ge 5\cdot\varphi(n)\cdot\delta(n)$ and $\mathrm{diam}(Z) \le \delta(n)$. By Lemmas 2 and 3 we have

$$\mathbb{E}\left[\tilde R^a_n(\pi, Z)\right] \le \Delta_a(Z)\cdot\left(\frac{4\theta\cdot\overline{\log}(n)}{\left(\Delta_a(Z) - 4\cdot\varphi(n)\cdot\mathrm{diam}(Z)\right)^2} + 1\right) \le \frac{5\theta\cdot\overline{\log}(n)}{4\cdot\varphi(n)\cdot\delta(n)} + M.$$

Moreover, since $\mu(Z) \ge C_d\cdot\left(\delta(n)/16\right)^d$ for $Z\in\mathcal{Z}^a_{\mathrm{big}}$, we have $\#\mathcal{Z}^a_{\mathrm{big}} \le C_d^{-1}\cdot\left(16/\delta(n)\right)^d$. Hence,

$$\sum_{Z\in\mathcal{Z}^a_{\mathrm{big}}}\mathbb{E}\left[\tilde R^a_n(\pi, Z)\right] \le c_1\cdot\varphi(n)^d\cdot\left(\theta\cdot\overline{\log}(n) + M\right). \tag{3}$$

Now take $l\in[q]$ and consider $Z\in\mathcal{Z}^a_l$. We have $\mathrm{diam}(Z) \le \delta(n)\cdot 4^{-l}$,

$$5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l} \le \Delta_a(Z) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l+1},$$

and hence $\Delta_a(Z) - 4\cdot\varphi(n)\cdot\mathrm{diam}(Z) \ge \varphi(n)\cdot\delta(n)\cdot 4^{-l}$. Hence, by Lemma 3 we have

$$T^a_n(\pi, Z) \le \frac{\theta\cdot\overline{\log}(n)}{\left(\varphi(n)\cdot\delta(n)\right)^2}\cdot 4^{2l+1} + 1.$$

Combining with Lemma 2 and $\Delta_a(Z) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l+1}$ we have

$$\mathbb{E}\left[\tilde R^a_n(\pi, Z)\right] \le c_2\cdot\theta\cdot\overline{\log}(n)\cdot 4^l.$$

Moreover, it follows from the definition of $\mathcal{Z}^a_l$ that for all $Z\in\mathcal{Z}^a_l$ and $x\in Z$ with $\Delta_a(x) > 0$ we have $0 < \Delta(x) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l+1}$. Hence, by Assumption 3 we have

$$\#\mathcal{Z}^a_l\cdot C_d\cdot\left(\delta(n)/16\right)^d\cdot 4^{-ld} \le \sum_{Z\in\mathcal{Z}^a_l}\mu(Z) \le C_\alpha\cdot\left(5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-l+1}\right)^\alpha.$$

Thus, we have

$$\sum_{Z\in\mathcal{Z}^a_l}\mathbb{E}\left[\tilde R^a_n(\pi, Z)\right] \le c_3\cdot\varphi(n)^d\cdot\theta\cdot\overline{\log}(n)\cdot 4^{l(d+1-\alpha)}. \tag{4}$$

Finally, every $x\in Z^a_{\mathrm{small}}$ satisfies $0 < \Delta_a(x) < 5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-q}$. Hence, by Assumption 3 we have $\mu\left(Z^a_{\mathrm{small}}\right) \le C_\alpha\cdot\left(5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-q}\right)^\alpha$. Hence, by Lemma 2 we have

$$\mathbb{E}\left[\tilde R^a_n(\pi, Z^a_{\mathrm{small}})\right] \le \left(5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-q}\right)\cdot\mathbb{E}\left[T^a_n(\pi, Z^a_{\mathrm{small}})\right] \le \left(5\cdot\varphi(n)\cdot\delta(n)\cdot 4^{-q}\right)\cdot n\cdot\mu\left(Z^a_{\mathrm{small}}\right) \le c_4\cdot n\cdot 4^{-q(\alpha+1)}. \tag{5}$$

Combining equations (3), (4) and (5) we have

$$\mathbb{E}\left[\tilde R^a_n(\pi,\mathcal{X})\right] \le c_5\cdot\left(\varphi(n)^d\left(M + \theta\cdot\overline{\log}(n)\cdot\sum_{l=0}^{q} 4^{l(d+1-\alpha)}\right) + n\cdot 4^{-q(\alpha+1)}\right) \le c_6\cdot\left(\varphi(n)^d\left(M + \theta\cdot\overline{\log}(n)\cdot\left(1 + 4^{q(d+1-\alpha)}\right)\right) + n\cdot 4^{-q(\alpha+1)}\right).$$

Thus, if we take $q = \left\lceil\frac{1}{d+2}\cdot\log_4\left(\frac{n}{\theta\cdot\varphi(n)^d\cdot\overline{\log}(n)}\right)\right\rceil$ we have

$$\mathbb{E}\left[\tilde R^a_n(\pi,\mathcal{X})\right] \le c_7\cdot\left(\left(M + \theta\cdot\overline{\log}(n)\right)\cdot\varphi(n)^d + n\cdot\left(\frac{\theta\cdot\varphi(n)^d\cdot\overline{\log}(n)}{n}\right)^{\frac{\alpha+1}{d+2}}\right).$$