# Gaussian Process Bandits for Tree Search: Theory and Application to Planning in Discounted MDPs

We motivate and analyse a new Tree Search algorithm, GPTS, based on recent theoretical advances in the use of Gaussian Processes for Bandit problems. We consider tree paths as arms and we assume the target/reward function is drawn from a GP distribution. The posterior mean and variance, after observing data, are used to define confidence intervals for the function values, and we sequentially play arms with highest upper confidence bounds. We give an efficient implementation of GPTS and we adapt previous regret bounds by determining the decay rate of the eigenvalues of the kernel matrix on the whole set of tree paths. We consider two kernels in the feature space of binary vectors indexed by the nodes of the tree: linear and Gaussian. The regret grows in square root of the number of iterations T, up to a logarithmic factor, with a constant that improves with bigger Gaussian kernel widths. We focus on practical values of T, smaller than the number of arms. Finally, we apply GPTS to Open Loop Planning in discounted Markov Decision Processes by modelling the reward as a discounted sum of independent Gaussian Processes. We report similar regret bounds to those of the OLOP algorithm.

## 1 Introduction

In order to motivate the work presented here, we first review the problem of tree search and its bandit-based approaches. We motivate the use of models of arm dependencies in bandit problems, for the purpose of searching trees. We then introduce our approach based on Gaussian Processes, that we analyse in the rest of this paper.

### 1.1 Context

Tree search consists in looking for an optimal sequence of nodes to select, starting from the root, in order to maximise a reward given when a leaf is reached. We introduce this problem in more detail, we motivate the use of bandit algorithms for tree search and we review existing techniques.

#### 1.1.1 Tree search

##### Applications

Tree search is important in Artificial Intelligence for Games, where the machine represents possible sequences of moves as a tree and looks ahead for the first move which is most likely to yield a win. Rewards are given by Monte Carlo simulations where we randomly finish the game from the current position and return 1 for a win, 0 otherwise. Tree search can also be used to search for an optimum in a space of sequences of given length, as in sequence labelling. More generally, it can be used to search any topological space for which a tree of coverings is defined, as shown by Bubeck et al. (2009), where each node corresponds to a region of the space. For instance, if the space to search is a $d$-dimensional hyper-rectangle, the root node of the tree of coverings is the whole hyper-rectangle, and children nodes are defined recursively by splitting the region of the current node in two: each region is a hyper-rectangle and we split in the middle along the longest side.
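The recursive splitting described above can be sketched in a few lines. This is only an illustration under simple assumptions: a region is stored as a list of `(low, high)` intervals, ties between equally long sides are broken by taking the first one, and the names `split_box` and `covering_at_depth` are ours rather than taken from Bubeck et al. (2009).

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def split_box(box: Box) -> Tuple[Box, Box]:
    """Split a hyper-rectangle in the middle along its longest side."""
    # Index of the dimension with the largest extent (first one on ties).
    axis = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
    low, high = box[axis]
    mid = (low + high) / 2.0
    left = list(box)
    right = list(box)
    left[axis] = (low, mid)
    right[axis] = (mid, high)
    return left, right


def covering_at_depth(root: Box, depth: int) -> List[Box]:
    """Return the 2**depth regions found at a given depth of the tree."""
    level = [root]
    for _ in range(depth):
        level = [child for box in level for child in split_box(box)]
    return level


root = [(0.0, 4.0), (0.0, 1.0)]
left, right = split_box(root)
```

Each node of the resulting binary tree is a region, and a path from the root corresponds to a nested sequence of smaller and smaller regions.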

##### Planning in Markov Decision Processes

In MDPs, an agent takes a sequence of actions that take it into a sequence of states, gets rewards from the environment for each action it takes, and aims at maximising its total reward. Alternatively, a simpler objective is to maximise the discounted sum of rewards the agent gets: a discount factor $\gamma$ is given beforehand and a weight of $\gamma^t$ is applied to the reward obtained at time $t$, for all $t$. If a generative model of the MDP is available (i.e. given a state we can determine the actions available from this state and the rewards obtained for each of these actions, without calling the environment), then we can represent the possible sequences of actions as a tree and determine the reward for each path through this tree (as a discounted sum of intermediate rewards). The idea of using bandit algorithms in the search for an optimal action in large state-space MDPs (i.e. planning) was introduced by Kocsis and Szepesvári (2006) and also considered by Chang et al. (2007) [Footnote 1: Bandits are also used by Ortner (2010) for closed-loop planning (where the chosen actions depend on the current states) in MDPs with deterministic transitions.], as an alternative to costly dynamic programming approaches that aim to approximate the optimal value function.
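As a small illustration of the reward attached to one path of the tree, the sketch below computes a discounted sum of intermediate rewards. It assumes that time starts at 0, so the first reward has weight 1, and that the discount factor (called `gamma` here) lies in $[0, 1)$; the function name is hypothetical.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the intermediate rewards along one path."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    # enumerate starts at t = 0, so the first reward is not discounted.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


path_reward = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Two paths that share their first actions share the corresponding leading terms of this sum, which is why rewards of nearby paths are close.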

##### Challenges

Searching trees with large branching factors can be computationally challenging, as applications to the game of Go have shown. It requires efficiently selecting branches to explore based on their estimated potential (i.e. how good the reward can be at leaves of paths going through a given branch) and the uncertainty in the estimations. Similarly, high depths can be unattainable due to lack of computational time and bad selection of the branches to explore. A tree search algorithm should not waste too much time in exploring sub-optimal branches, while still exploring enough in order not to miss the optimal one. Bandit algorithms can be used to guide the selection of nodes in the exploration of the tree, based on knowledge acquired from previous reward samples. However, one must be cautious that the process of selecting the best nodes to explore first does not itself become too computationally expensive. In the work of Gelly and Wang (2006) on the search of Go game trees, bandit algorithms allow a more efficient exploration of the tree compared to traditional Branch & Bound approaches (Alpha-Beta).

#### 1.1.2 Bandit problems

The bandit problem is a simple model of the trade-off between exploration and exploitation. The multi-armed bandit is an analogy with a traditional slot machine, known as a one-armed bandit, but with multiple arms. In the stochastic bandit scenario, the player, after pulling (or ‘playing’) an arm selected from the finite set of arms, receives a reward. It is assumed that the reward obtained when playing arm $i$ is a sample from a distribution $\nu_i$, unknown to the player, and that samples are iid. A stochastic bandit problem is characterised by a set of probability distributions $\{\nu_i\}$ [Footnote 2: Non-stochastic bandit problems are also of interest, as well as problems in which the distributions are allowed to change through time (see Bubeck, 2010, for an overview of the different types of bandit problems).].

##### Measure of performance

The objective of the player is to maximise the collected reward sum (or ‘cumulative reward’) through iterative plays of the bandit. The optimal arm selection policy $S^*$, i.e. the policy that yields maximum expected cumulative reward, consists in selecting arm $i^* = \arg\max_i \mu_i$ to play at each iteration, where $\mu_i$ denotes the mean reward of arm $i$. The expected cumulative reward of $S^*$ at time $T$ (after $T$ iterations) is $T \mu_{i^*}$. The performance of a policy $S$ is assessed by the analysis of its cumulative regret at time $T$, defined as the difference between the expected cumulative reward of $S^*$ and $S$ at time $T$.
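To make the definition concrete, the following sketch evaluates the expected cumulative regret of a given sequence of plays when the mean rewards are known (which is only the case in simulations). The argument names `means` and `plays` are illustrative.

```python
from typing import Sequence


def cumulative_regret(means: Sequence[float], plays: Sequence[int]) -> float:
    """Expected cumulative regret of a sequence of plays.

    means[i] is the mean reward of arm i, plays[t] the arm chosen at step t.
    """
    best = max(means)  # mean reward of the optimal arm
    # Each play of arm i costs (best - means[i]) in expectation.
    return sum(best - means[arm] for arm in plays)


regret = cumulative_regret([0.2, 0.5, 0.9], [0, 2, 1, 2])
```

Here the policy pays 0.7 for playing arm 0 once and 0.4 for playing arm 1 once, and nothing for the two plays of the optimal arm.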

##### Exploration vs. exploitation

A good policy must optimally balance the learning of the distributions and the exploitation of arms which have been learnt as having high expected rewards. When the number of arms is finite and smaller than the number of experiments allowed, it is possible to explore all the possible options (arms) a certain number of times, thus building empirical estimates $\hat{\mu}_i$ of the mean rewards, and exploit the best performing ones. As the number of times we play the same arm grows, we expect our reward estimate to improve.

##### Optimism in the face of uncertainty

A popular strategy for balancing exploration and exploitation consists in applying the so-called “optimism in the face of uncertainty” principle: reward estimates and uncertainty estimates are maintained for each arm, such that the probability that the actual mean-reward values are outside of the confidence intervals drops quickly. The arm to be played at each time step is the one for which the upper bound of the confidence interval is the highest. This strategy, as implemented in the UCB algorithm, has been shown by Auer et al. (2002) to achieve optimal regret growth-rate for problems with independent arms: problem-specific upper bound in $O(\log T)$, and problem-independent upper bound in $O(\sqrt{T \log T})$ [Footnote 3: A regret bound is said to be problem-specific when it involves quantities that are specific to the current bandit problem, such as the sub-optimalities of arms, based on the means of the distributions for this problem. The second bound, however, does not involve such quantities.].
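A minimal simulation of this principle is sketched below, using the UCB1 index of Auer et al. (2002), i.e. the empirical mean plus $\sqrt{2 \log t / n_i}$ where $n_i$ is the number of plays of arm $i$. The Bernoulli reward model, the function name `ucb1` and the fixed seed are assumptions made for the example.

```python
import math
import random
from typing import List, Sequence


def ucb1(means: Sequence[float], horizon: int, seed: int = 0) -> List[int]:
    """Run UCB1 on Bernoulli arms and return how often each arm was played."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of plays of each arm
    sums = [0.0] * n_arms   # total reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            # Upper confidence bound: empirical mean plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = ucb1([0.1, 0.9], horizon=2000)
```

The bonus shrinks as an arm is played more often, so a clearly sub-optimal arm is only revisited occasionally, when the logarithmic term has grown enough.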

#### 1.1.3 Bandit-based Tree Search algorithms

Typically, algorithms proceed in iterations. After the $t$-th iteration, a leaf node $x_t$ is selected and a reward $y_t$ is received. It is usually assumed that there exists a mean-reward function $f$ such that $y_t$ is a noisy observation of $f(x_t)$. Other common assumptions are that $f^*$, the highest value of $f$, is known (or an upper bound on $f$ is known) and that $f$ is always bigger than $0$. The algorithm stops when a convergence criterion is met, when a computational/time budget is exhausted (in game tree search for instance), or when a maximum number of iterations has been specified (this is referred to as “fixed horizon” exploration, as opposed to “anytime”). In the end, a path through the tree is given. This can simply be the path that leads to the leaf node that received the highest reward.

##### Path selection as a sequence of bandit problems

The algorithm developed by Kocsis and Szepesvári (2006), UCT, considers bandit problems at each node in the tree. The children of a given node represent the arms of the associated bandit problem, and the rewards obtained for selecting an arm are the values obtained at a leaf. At each iteration, we start from the root and select children nodes by invoking the bandit algorithms of the parents, until a leaf is reached and a reward is received, which is then back-propagated to the ancestors up to the root. The bandit algorithm used in UCT is UCB (Auer et al., 2002), which stands for Upper Confidence Bounds and implements the principle of optimism in the face of uncertainty [Footnote 4: In the tree setting however, rewards are not iid and the values used by the UCB algorithms at each node are not true upper confidence bounds.].

##### “Smooth” trees

Although Gelly and Wang (2006) reported that UCT performed very well on Go game trees, it was shown by Coquelin and Munos (2007) that it can behave poorly in certain situations because of “overly optimistic assumptions in the design of its upper confidence bounds” (Bubeck and Munos, 2010), leading to a high lower bound on its cumulative regret. Another algorithm was proposed, BAST (Bandit Algorithm for Smooth Trees), which can be parameterised to adapt to different levels of smoothness of the reward function on leaves, and to deal with the situations that UCT handles badly. BAST differs from UCT only in the definition of its ‘upper confidence bounds’ (UCT is actually a special case of BAST, corresponding to a particular value of one of the algorithm’s parameters). A time-independent regret upper bound was derived; however, it was expressed in terms of the sub-optimality values of nodes (dependent on the reward on nodes, hence unknown to the algorithm) and was thus problem-specific. Also, quite paradoxically, the bound could become very high for smooth functions (because of terms).

##### Optimistic planning in discounted MDPs

The discount factor $\gamma$ implies a particular smoothness of the function on tree paths (the smaller $\gamma$, the smoother the function), which is the starting point of the work of Bubeck and Munos (2010) on the Open Loop Optimistic Planning (OLOP) algorithm, close in spirit to BAST. OLOP has been proved to be minimax optimal, up to a logarithmic factor, which means that the upper bound growth rate of its simple regret matches the lower bound [Footnote 5: The simple regret is defined as the difference between $f^*$ and the best value of $f$ for the arms that have been played.]. However, OLOP requires the knowledge of the time horizon and the regret bounds do not apply when the algorithm is run in an anytime fashion.

##### Measure of performance

A Tree Search algorithm’s performance can be measured, as for a bandit algorithm, by its cumulative regret $R_T$. However, although this is a good objective to achieve a good exploration/exploitation balance, we might be ultimately interested in a bound on how far the reward value for the best node we would see after $T$ iterations is from the optimal $f^*$. Or it might be more useful to bound the regret after a given execution time (instead of a number of iterations) in order to compare algorithms that have different computational complexity.

### 1.2 Many-armed bandit algorithms

It is of interest to consider bandit problems in which there are more arms than the possible number of plays, or in which there is an infinity of arms. We refer to this as the “many-armed” bandit problem. In this case, we need a model of dependencies between arms in order to get, from one play, information about several arms – and not only the one that was played. We show how such models can be applied to online global optimisation. In particular, we review the use of Gaussian Processes for modelling arm dependencies.

#### 1.2.1 Bandits for online global optimisation

Bandit algorithms have been used to focus exploration in global optimisation. Each point in the space of search is an arm, and rewards are given as we select points where we want to observe the function. Even though the actual objective may not be to minimise the cumulative regret but to minimise the simple regret, we have seen above how a bound on the former can give a bound on the latter. The cumulative regret is also interesting as it forces algorithms not to waste samples. Samples can be costly to acquire in certain applications, as they might involve a physical and expensive action, such as deploying a sensor or taking a measurement at a particular location (see the experiments on sensor networks performed by Srinivas et al., 2010), or they can simply be computationally costly: the fewer samples, the quicker we can find a maximum.

##### Modelling dependencies

The observations may or may not be noisy. In the latter case, the bandit problem is trivial when the search space has fewer elements than the maximum number of iterations we can perform. But in global optimisation, the search space is usually continuous. In that case, as pointed out by Wang et al. (2008), if no assumption is made on the smoothness of $f$, the search might be arbitrarily hard. The key idea is to model dependencies between arms, through smoothness assumptions on $f$, so that information can be gained about several arms (if not the whole set of arms) when playing only one arm. Modelling dependencies is also beneficial in problems with finite numbers of arms, as it speeds up the learning. Pandey et al. (2007) have developed an algorithm which exploits cluster structures among arms, applied to a content-matching problem (matching webpages to ads). Auer and Shawe-Taylor (2010) use a kernelised version of LinRel, a UCB-type algorithm introduced by Auer (2003) for linear optimisation and further analysed by Dani et al. (2008), for an image retrieval task with eye-movement feedback. LinRel has a regret in $\tilde{O}(\sqrt{T})$, i.e. that grows in $\sqrt{T}$ up to a logarithmic term [Footnote 6: The $\tilde{O}$ notation is the one used by Bubeck et al. (2009) and is equivalent to the $O^*$ notation used by Srinivas et al. (2010): $u_n = \tilde{O}(v_n)$ iff there exist $\alpha, \beta > 0$ such that $u_n \le \alpha \log(v_n)^{\beta} v_n$.].

##### Continuous arm spaces

Bandit problems in continuous arm spaces have been studied notably by Kleinberg et al. (2008), Wang et al. (2008) and Bubeck et al. (2009). To each bandit problem corresponds a mean-reward function $f$ in the space of arms. Kleinberg et al. (2008) consider metric spaces, Lipschitz functions, and derive a regret growth-rate in $\tilde{O}(T^{(d+1)/(d+2)})$, which strongly depends on the dimension $d$ of the input space. Bubeck et al. (2009), however, consider arbitrary topological spaces, weak-Lipschitz functions (i.e. local smoothness assumptions only) and derive a regret in $\tilde{O}(\sqrt{T})$. The rate of growth is this time independent of the dimension of the input space. Quite interestingly, the algorithm of Bubeck et al. (2009), HOO, uses BAST on a recursive splitting of the space where each node corresponds to a region of the space and regions are divided in halves, i.e. all non-leaf nodes have two children. BAST is used to select smaller and smaller regions to randomly sample in. The algorithm developed by Wang et al. (2008), UCB-AIR, assumes that the probability that an arm chosen uniformly at random is $\epsilon$-optimal scales in $\epsilon^{\beta}$. Thus, when there are many near-optimal arms and when choosing a certain number of arms uniformly at random, there exists at least one which is very good with high probability. Their regret bound is in $\tilde{O}(\sqrt{T})$ when $\beta < 1$ and $f^* < 1$, and in $\tilde{O}(T^{\beta/(1+\beta)})$ otherwise.

#### 1.2.2 Gaussian Process optimisation

##### GP assumption

In the global optimisation setting, a very popular assumption in the Bayesian community is that $f$ is drawn from a Gaussian Process, due to the flexibility and power of GPs (see Brochu et al., 2009, for a review of Bayesian optimisation using GPs) and their applicability in practice in engineering problems. GP optimisation is sometimes referred to as “Kriging” and response surfaces (see Grünewalder et al., 2010, and references therein). GPs are probability distributions over functions, that characterise a belief on the smoothness of functions. The idea, roughly, is that similar inputs are likely to yield similar outputs. The similarity is defined by a kernel/covariance function between inputs [Footnote 7: We use the terms ‘kernel function’ and ‘covariance function’ equivalently in the rest of the paper.]. Parameterising the covariance function translates into a parametrisation of the smoothness assumption. Note that this is a global smoothness assumption which is thus stronger than that of Bubeck et al. (2009). It is, like the UCB-AIR assumption, a probabilistic assumption too, although a stronger one. Srinivas et al. (2010) claim that the GP assumption is neither too weak nor too strong in practice. One added benefit of this Bayesian framework is the possibility of tuning the parameters of our smoothness assumption (encoded in the kernel) by maximising the likelihood of the observed data, which can be written in closed-form for the commonly used Automatic Relevance Determination kernel (see Rasmussen and Williams, 2006, chap. 5). In comparison, parameter tuning is critical for HOO to perform well and parameters need to be tuned by hand.

##### Acquisition of samples

Similarly to bandit problems, function samples are acquired iteratively and it is important to find ways to efficiently focus the exploration of the input space. The acquisition of function samples was often based on heuristics, such as the Expected Improvement and the Most Probable Improvement (Mockus, 1989) that proved successful in practice (Lizotte et al., 2007). A more principled approach is that of Osborne et al. (2009) which considers a fixed number of iterations (“finite horizon” in the bandit terminology) and fully exploits the Bayesian framework to compute at each time step the expected loss over all possible remaining allocations as a function of the arm allocated at time $t$ [Footnote 8: In their approach, the loss is defined by the simple regret but one could imagine using the cumulative regret instead.]. For this, the probability of loss is broken down into the probability of loss given the arms at times $t+1$ to $T$, times the probability of picking these arms, which can also be broken down recursively. This is similar in spirit to the pioneering work of Gittins and Jones (1979) on bandit problems and on “dynamic allocation indices” for arms (also known as Gittins indices). Here, computing the optimal allocation of samples has an extremely high computational cost, which is warranted in problems where function samples are very expensive themselves [Footnote 9: In their experiments, the number of iterations was only twice the dimension of the problem.]. The simple regret of this procedure was analysed by Grünewalder et al. (2010) in the case where observations are not noisy.

##### UCB heuristic for acquiring samples

GP approaches have been extended in the bandit setting with the Gaussian Process Upper Confidence Bound algorithm (GP-UCB or GPB) presented by Dorard et al. (2009), for which a theoretical regret bound was given by Srinivas et al. (2010), based on the rate of decay of the eigenvalues of the kernel matrix on the whole set of arms, if finite, or of the kernel operator: $\tilde{O}(\sqrt{T})$ for the linear and Gaussian kernels. This seems to match, up to a logarithmic factor, $T$ times the lower bound on the simple regret given by Grünewalder et al. (2010), which is a lower bound on the cumulative regret. As the name GP-UCB indicates, the sample acquisition heuristic is based on the optimism in the face of uncertainty principle, where the GP posterior mean and variance are used to define confidence intervals. Better results than with other Bayesian acquisition criteria were obtained on the sensor network applications presented by Srinivas et al. (2010). There still remains the problem of finding the maximum of the upper confidence function in order to implement this algorithm, but Brochu et al. (2009) showed that global search heuristics are very effective. [Footnote 10, on the extension of GP approaches to the bandit setting: In practice, rewards are taken in $[0,1]$ in bandit problems, but it is more convenient when dealing with Gaussian Processes to have output spaces centred around $0$ (easier expressions for the posterior mean when the prior mean is the $0$ function). With GPs, we do not assume that the $f$ values are within a known interval. We previously mentioned that an upper bound on $f$ could be known, but there is no easy way to encode this knowledge in the prior, which is probably what motivated Graepel et al. (2010) to consider a generalised linear model with a probit link function, in order to learn the Click Through Rates of ads (in $[0,1]$) displayed by web search engines, while maximising the number of clicks (also an exploration vs. exploitation problem).]

### 1.3 A Gaussian Process approach to Tree Search

In light of this, we consider a GP-based algorithm for searching the potentially very large space of tree paths, with a UCB-type heuristic for choosing arms to play at each iteration. We consider only one bandit problem for the whole tree, where arms are tree paths [Footnote 11: This is similar to Bubeck and Munos (2010, sec. 4) where bandit algorithms for continuous arms spaces are compared to OLOP.]. The kernel used with the GP algorithm is therefore a kernel between paths, and it can be defined by looking at nodes in common between two paths. The GP assumption makes sense for tree search as similar paths will have nodes in common, and we expect that the more nodes two paths have in common, the more likely they are to have similar rewards (this is clearly true for discounted MDPs). Owing to GPs, we can share information gained for ‘playing’ a path with other paths that have nodes in common (which Bubeck and Munos, 2010, also aim at doing, as stated in the last part of their Introduction section). Also, we will be able to use the results of Srinivas et al. (2010) to derive problem-independent regret bounds for our algorithm [Footnote 12: The bounds will be expressed in terms of the maximum branching factor and depth of the tree, and of the parameters of the kernel in our model, but they won’t depend on actual $f$ values.], once we have studied the decay rate of the eigenvalues of the kernel matrix on the set of all arms (tree paths here), which determines the rate of growth of the cumulative regret in their work.
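To illustrate what a kernel between paths based on common nodes can look like, here is a sketch of a linear and a Gaussian kernel on binary feature vectors indexed by the nodes of the tree. Several choices are assumptions made for the example: a path is encoded as the tuple of child indices followed from the root, a node is identified by the prefix leading to it, the root is left out of the features, and the Gaussian kernel is parameterised as $\exp(-\|\cdot\|^2 / (2 s^2))$ with width $s$.

```python
import math
from typing import Set, Tuple

Path = Tuple[int, ...]  # sequence of child indices followed from the root


def nodes_on_path(path: Path) -> Set[Path]:
    """Nodes visited by a path, each identified by its prefix (root excluded)."""
    return {path[:depth] for depth in range(1, len(path) + 1)}


def linear_kernel(p: Path, q: Path) -> float:
    """Inner product of the binary node-indicator vectors: shared node count."""
    return float(len(nodes_on_path(p) & nodes_on_path(q)))


def gaussian_kernel(p: Path, q: Path, width: float) -> float:
    """Gaussian kernel on the same binary feature vectors."""
    # The squared Euclidean distance between two binary vectors is the number
    # of nodes that belong to exactly one of the two paths.
    sq_dist = len(nodes_on_path(p) ^ nodes_on_path(q))
    return math.exp(-sq_dist / (2.0 * width ** 2))


a, b, c = (0, 1, 1), (0, 1, 0), (1, 1, 1)
shared_ab = linear_kernel(a, b)
shared_ac = linear_kernel(a, c)
```

Paths `a` and `b` diverge only at the last step and share two nodes, whereas `a` and `c` diverge at the root and share none, so both kernels rate `a` and `b` as more similar.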

##### Assumptions

Similarly to BAST, we wish to model different levels of smoothness of the response/reward function on the leaves/paths. For this, we can extend the notion of characteristic length-scale to such functions by considering a Gaussian covariance function in a feature space for paths. Smoothness of the covariance/kernel translates to quick eigenvalue decay rate which can be used to improve the regret bound. As we already said, the parameter(s) of our smoothness assumption can be learnt from training data. Note that the GP smoothness assumption is global, whereas BAST only assumes smoothness for $\eta$-optimal nodes. But in examples such as Go tree search we can expect $f$ to be globally smooth, and for planning in discounted MDPs this is even clearer as $f$ is defined as a sum of intermediate rewards and is thus Lipschitz with respect to a certain metric (see Bubeck and Munos, 2010, sec. 4). As such, $f$ is also made smoother by decreasing the value of the discount factor $\gamma$. Finally, GPs allow us to model uncertainty, which results in tight confidence bounds, and can also be taken into account when outputting a sequence of actions at the end of the tree search: instead of taking the best observed action, we might take the one with highest lower confidence bound for a given threshold.

##### Main results

We derive regret bounds for our proposed GP-based Tree Search algorithm, run in an anytime fashion (i.e. without knowing the total number of iterations in advance), with tight constants in terms of the parameters of the Tree Search problem. The regret can be bounded with high probability in:

• for small values of

• for where $B$ is the maximum branching factor and $D$ the maximum depth of the tree [Footnote 13: $D$ is considered to be fixed but we will see in Section 5 that we can extend our analysis to cases where $D$ depends on $T$.]

• otherwise.

Although the rates are worse for smaller values of $T$, the bounds are tighter because the constants are smaller. For , we have a constant in for the linear kernel, and in for the Gaussian kernel with width : the regret improves when the width increases. Having small constants in terms of the size of the problem is important, since the number of arms is very large in practice and computational budgets do not allow us to go beyond this value.

### 1.4 Outline of this paper

First, we describe the GP-UCB (or GPB) algorithm in greater detail and its application to tree search in Section 2. In particular, we show how the search for the max of the upper confidence function can be made efficient in the tree case. The theoretical analysis of the algorithm begins in Section 3 with the analysis of the eigenvalues of the kernel matrix on the whole set of tree paths. It is followed in Section 4 by the derivation of an upper bound on the cumulative regret of GPB for tree search, that exploits the eigenvalues’ decay rate. Finally, in Section 5, we compare GPB to other algorithms, namely BAST for tree search and OLOP for MDP planning, from a theoretical perspective. We also show how a cumulative regret bound can be used to derive other regret bounds. We propose ideas for other Tree Search algorithms based on Gaussian Process Bandits, before bringing forward our conclusions.

## 2 The algorithm

In this section, we show how Gaussian Processes can be applied to the many-armed bandit problem, we review the theoretical analysis of the GPB algorithm and we describe its application to tree search.

### 2.1 Description of the Gaussian Process Bandits algorithm

We formalise the Gaussian Process assumption on the reward function, before giving the criterion for arm selection in the GPB framework.

#### 2.1.1 The Gaussian Process assumption

##### Definition

A GP is a probability distribution over functions, and is used here to formalise our assumption on how smooth we believe $f$ to be. It is an extension of multi-variate Gaussians to an infinite number of variables (an $n$-variate Gaussian is actually a distribution over functions defined on spaces of exactly $n$ elements). A GP is characterised by a mean function and a covariance function. The mean is a function on the input space $\mathcal{X}$ and the covariance $\kappa$ is a function of two variables in this space – think of the extension of a vector and a matrix to an infinite number of components. When choosing inputs $x_1$ and $x_2$, the probability density for outputs $f(x_1)$ and $f(x_2)$ is a 2-variate Gaussian with covariance matrix

$$\begin{pmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) \end{pmatrix}$$

This holds when extending to any $n$ inputs. We see here that the role of the similarity measure between arms is taken by the covariance function, and, by specifying how much outputs co-vary, we characterise how likely we think that a set of outputs for a finite set of inputs is, based on the similarities between these inputs, thus expressing a belief on the smoothness of $f$.
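As a toy illustration of how a covariance function turns input similarity into output co-variation, the sketch below builds the covariance matrix of the outputs for a finite set of scalar inputs. The Gaussian covariance with a `length_scale` parameter is an assumption chosen for the example, as are the function names.

```python
import math
from typing import List, Sequence


def gaussian_cov(x1: float, x2: float, length_scale: float = 1.0) -> float:
    """Gaussian (squared-exponential) covariance between two scalar inputs."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def covariance_matrix(
    inputs: Sequence[float], length_scale: float = 1.0
) -> List[List[float]]:
    """Covariance matrix of the GP outputs at a finite set of inputs."""
    return [[gaussian_cov(a, b, length_scale) for b in inputs] for a in inputs]


near = covariance_matrix([0.0, 0.1])  # two similar inputs
far = covariance_matrix([0.0, 3.0])   # two dissimilar inputs
```

The off-diagonal entry is close to 1 for the nearby pair and close to 0 for the distant pair: under this prior, outputs at similar inputs are expected to be similar, while outputs at distant inputs are nearly independent.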

##### Inference and noise modelling

The reward may be observed with noise which, in the GP framework, is modelled as additive Gaussian white noise. The variance of this noise characterises the variability of the reward when always playing the same arm. In the absence of any extra knowledge on the problem at hand, $f$ is flat a priori, so our GP prior mean is the function $0$. The GP model allows us, each time we receive a new sample (i.e. an arm-reward pair), to use probabilistic reasoning to update our belief of what $f$ may be – it has to come relatively close to the sample values (we are only off because of the noise) but at the same time it has to agree with the level of smoothness dictated by the covariance function – thus creating a posterior belief. In addition to creating a ‘statistical picture’ of $f$, encoded in the GP posterior mean, the GP model gives us error bars (the GP posterior variance). In other terms, it gives us confidence intervals for each value $f(x)$.

#### 2.1.2 Basic notations

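We consider a space of arms $\mathcal{X}$ and a kernel $\kappa$ between elements of $\mathcal{X}$. In our model, the reward after playing arm $x \in \mathcal{X}$ is given by $f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2_{noise})$ and $f$ is a function drawn once and for all from a Gaussian Process with zero mean and with covariance function $\kappa$. Arms played up to time $t$ are $x_1, \dots, x_t$ with rewards $y_1, \dots, y_t$. The vector of concatenated reward observations is denoted $y_t$. The GP posterior at time $t$ after seeing data $(x_1, y_1), \dots, (x_t, y_t)$ has mean $\mu_t(x)$ with variance $\sigma_t^2(x)$.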

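Matrix $C_t$ and vector $k_t(x)$ are defined as follows:

$$(C_t)_{i,j} = \kappa(x_i, x_j) + \sigma^2_{noise}\,\delta_{i,j}, \qquad (k_t(x))_i = \kappa(x, x_i)$$

$\mu_t(x)$ and $\sigma_t^2(x)$ are then given by the following equations (see Rasmussen and Williams, 2006, chap. 2):

$$\mu_t(x) = k_t(x)^T C_t^{-1} y_t \tag{1}$$

$$\sigma_t^2(x) = \kappa(x, x) - k_t(x)^T C_t^{-1} k_t(x) \tag{2}$$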
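Equations (1) and (2) can be evaluated directly for a small number of observations. The sketch below is a plain-Python illustration rather than an efficient implementation: it solves the linear system with a hand-written Gaussian elimination (the helper `solve` is ours), and the Gaussian kernel `rbf` is an assumed example of covariance function.

```python
import math
from typing import Callable, List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gp_posterior(
    kernel: Callable[[float, float], float],
    xs: Sequence[float],
    ys: Sequence[float],
    noise_var: float,
    x: float,
) -> Tuple[float, float]:
    """Posterior mean and variance at x, following equations (1) and (2)."""
    n = len(xs)
    # C_t: kernel matrix on the played arms plus noise variance on the diagonal.
    c = [[kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k = [kernel(x, xi) for xi in xs]  # k_t(x)
    # alpha = C_t^{-1} k_t(x); since C_t is symmetric, k^T C^{-1} y = alpha^T y.
    alpha = solve(c, k)
    mean = sum(a * y for a, y in zip(alpha, ys))
    var = kernel(x, x) - sum(a * ki for a, ki in zip(alpha, k))
    return mean, var


def rbf(a: float, b: float) -> float:
    """Gaussian kernel with unit width (an assumption for this example)."""
    return math.exp(-((a - b) ** 2) / 2.0)


mean_at_obs, var_at_obs = gp_posterior(rbf, [0.0], [1.0], 0.5, 0.0)
mean_far, var_far = gp_posterior(rbf, [0.0], [1.0], 0.5, 10.0)
```

At the observed arm, the posterior mean is pulled towards the observation and the variance drops below the prior variance; far from any observation, the posterior reverts to the prior (mean 0, variance 1).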

#### 2.1.3 UCB arm selection

The algorithm plays a sequence of arms and aims at optimally balancing exploration and exploitation. For this, we select arms iteratively by maximising an upper confidence function $f_t$:

$$x_{t+1} = \arg\max_{x \in \mathcal{X}} f_t(x), \qquad f_t(x) = \mu_t(x) + \sqrt{\beta_t}\,\sigma_t(x)$$

In Section 2.3.2 we show how we can find the argmax of $f_t$ efficiently, in the tree search problem.
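For a small finite set of arms, the selection step can be sketched by brute force. The example assumes that the upper confidence function is the posterior mean plus $\sqrt{\beta_t}$ times the posterior standard deviation, as in GP-UCB, and that posterior means and variances have already been computed for every arm; the name `select_arm` is illustrative. It does not reflect the efficient tree-specific search of Section 2.3.2.

```python
import math
from typing import Sequence


def select_arm(
    means: Sequence[float], variances: Sequence[float], beta: float
) -> int:
    """Index of the arm maximising mean + sqrt(beta) * standard deviation."""

    def upper_bound(i: int) -> float:
        # Clip tiny negative variances that can arise from rounding errors.
        return means[i] + math.sqrt(beta) * math.sqrt(max(variances[i], 0.0))

    return max(range(len(means)), key=upper_bound)


means = [0.5, 0.4]
variances = [0.0, 1.0]
greedy_choice = select_arm(means, variances, beta=0.0)
optimistic_choice = select_arm(means, variances, beta=4.0)
```

With `beta=0.0` the rule picks the arm with the highest posterior mean, while a larger `beta` favours the arm whose value is still uncertain.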

##### Interpretation

The arm selection problem can be seen as an active learning problem: we want to learn $f$ accurately in regions where the function values seem to be high, and do not care much if we make inaccurate predictions elsewhere. The $\beta_t$ term balances exploration and exploitation: the bigger it gets, the more it favours points with high $\sigma_t(x)$ (exploration), while if $\beta_t = 0$, the algorithm is greedy. In the original UCB formula, $\beta_t \sim \log t$.

##### Balance between exploration and exploitation

A choice of $\beta_t$ corresponds to a choice of an upper confidence bound. Srinivas et al. (2010) give a regret bound that holds with high probability and relies on the fact that the values $f(x)$ lie between their lower and upper confidence bounds. If $\mathcal{X}$ is finite, this happens with probability $1 - \delta$ if:

$$\sqrt{\beta_t} = \sqrt{2 \log\left(\frac{|\mathcal{X}|\, t^2 \pi^2}{6\delta}\right)}$$

However, the constants in their bounds were not optimised, and scaling by a constant specific to the problem at hand might be beneficial in practice. In their sensor network application, they tune the scaling parameter by cross-validation.
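The confidence parameter for a finite arm set is cheap to compute. The sketch below expands the logarithm into a sum so that very large numbers of arms (such as the number of paths of a deep tree) do not overflow; the function name `sqrt_beta` and its arguments are hypothetical.

```python
import math

def sqrt_beta(num_arms, t, delta):
    """sqrt(beta_t) = sqrt(2 log(|X| t^2 pi^2 / (6 delta))) for a finite arm set."""
    # log(|X| t^2 pi^2 / (6 delta)) written as a sum of logarithms.
    log_term = (math.log(num_arms) + 2.0 * math.log(t)
                + math.log(math.pi ** 2 / (6.0 * delta)))
    return math.sqrt(2.0 * log_term)

# Example: 5 ** 200 arms is far too large for a float, but its logarithm is fine.
value = sqrt_beta(5 ** 200, t=10, delta=0.1)
```

The value grows only logarithmically with both the number of arms and the time step.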

### 2.2 Theoretical background

The GPB algorithm was studied by Srinivas et al. (2010) in the cases of finite and infinite number of arms, under the assumption that the mean-reward function is drawn from a Gaussian Process with zero mean and given covariance function, and in a more agnostic setting where has low complexity as measured under the RKHS norm induced by a given kernel. Their work is core to the regret bounds we give in Section 4.

#### 2.2.1 Overview

##### Finite case analysis

When all values are within their confidence intervals (which, by design of the upper confidence bounds, happens with high probability), a relationship can be given between the regret of the algorithm and its information gain after acquiring samples (i.e. playing arms). When everything is Gaussian, the information gain can easily be written in terms of the eigenvalues of the kernel matrix on the training set of arms that have been played so far. The simplest case is for a linear kernel in dimensions. However, in general there is no simple expression for these eigenvalues since we do not know which arms have been played. (The process of selecting arms is non-deterministic because of the noise introduced in the responses. We could perhaps determine the probabilities of arms being selected, but such an analysis would be problem-specific, since it would depend on values: we could, as is done in the UCB proof, look at the probability of selecting an arm given the number of times each arm has been selected so far, and do a recursion, which gives a problem-specific bound in terms of the ’s.) Thanks to the result of Nemhauser et al. (1978), we can use the fact that the information gain is a sub-modular function in order to bound our information gain by the “greedy information gain”, which can itself be expressed in terms of the eigenvalues of the kernel matrix on the whole set of arms (which is known and fixed), instead of the kernel matrix on the training set. We present this analysis in slightly more detail in Section 2.2.

##### Infinite case analysis

The analysis requires discretising the input space (assuming it is a subspace of ), and we need additional regularity assumptions on the covariance function in order to have all values within their confidence intervals with high probability. The discretisation is finer at each time step , and the information gain is bounded by an expression of the eigenvalues of the kernel matrix on . The expected sum of these can be linked to the sum of eigenvalues of the kernel operator spectrum with respect to the uniform distribution over , for which an expression is known for common kernels such as the Gaussian and Matérn kernels.

#### 2.2.2 Finite number of arms

We present two main results of Srinivas et al. (2010) that will be needed in Section 4. First, we show that the regret of UCB-type algorithms can be bounded with high probability based on a measure of how quick the function can be learnt in an information theoretic sense: the maximum possible information gain after iterations (“max infogain”). Intuitively, a small growth rate means that there is not much information left to be gained after some time, hence that we can learn quickly, which should result in small regrets. The max infogain is a problem dependent quantity and its growth is determined by properties of the kernel and of the input space.

Second, we give an expression of the information gain of the “greedy” algorithm, that aims to maximise the immediate information gain at each iteration, in terms of the eigenvalues of the kernel matrix on . The max infogain is bounded by a constant times the greedy infogain. We can thus bound the regret in terms of the eigenvalues of the kernel matrix.

##### Notations

In the following, will denote the total number of iterations performed by the algorithm, the cumulative regret after iterations, the immediate regret at time step , the number of arms, the kernel matrix on the set of arms, the feature representation of the arm, and an element of the feature space (which might not correspond to one of the ’s). and will denote respectively the information gain of the greedy algorithm and of the (GP-)UCB algorithm.

##### Theorem

Theorem 1 of Srinivas et al. (2010) uses the fact that GPB always picks the arm with highest UCB value in order to relate the regret to :

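$$R_T \le \sqrt{\frac{16}{\log\left(1 + \sigma^{-2}_{\text{noise}}\right)} \log\left(\frac{N T^2 \pi^2}{6\delta}\right) T\, I_u(T)} \quad \text{with probability } 1 - \delta$$

##### Greedy infogain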

We define the “greedy algorithm” as the algorithm which is allowed to pick linear combinations of arms in , with a vector of weights of norm equal to 1, in order to maximise the immediate information gain at each time step. An arm in this extended space of linear combinations of the ’s is no longer characterised by an index but by a weight vector. Infogain maximisers are arms that maximise the variance, which is given at arm by where is the posterior covariance matrix at time for the greedy algorithm and is a weight vector of norm 1. Let denote the eigenvectors of eigenvalues (in decreasing order) of . It can be shown that and share the same eigenbasis and that the greedy algorithm selects arms such that their weight vectors are among the first eigenvectors of . The eigenvalue for the eigenvector of is given by:

$$\hat{\hat{\lambda}}_{i,t} = \frac{\hat{\lambda}_i}{1 + \sigma^{-2}_{\text{noise}}\, m_{i,t}\, \hat{\lambda}_i}$$

where denotes the number of times has been selected up to time (we say that a weight vector is selected when the corresponding arm is selected).

Consequently, is selected for the first time at time if all eigenvectors of of indices smaller than have been selected at least once and:
(this will be useful in Section 4.4).

An expression of the greedy infogain can be given in terms of the eigenvalues of :

$$I_g(T) = \frac{1}{2} \sum_{t=1}^{\min(T,N)} \log\left(1 + \sigma^{-2}_{\text{noise}}\, m_t\, \hat{\lambda}_t\right)$$

where denotes the number of times has been selected during the iterations. We see that the rate of decay of the eigenvalues has a direct impact on the rate of growth of .

##### Maximum possible infogain

An information-theoretic argument for submodular functions (Nemhauser et al., 1978) gives a relationship between the infogain of the GP-UCB algorithm at time and the infogain of the greedy algorithm at time , based on the constant :

$$I^*(T) \le \frac{1}{1 - e^{-1}}\, I_g(T)$$

As a consequence:

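$$I_u(T) \le I^*(T) \le \frac{1}{2\left(1 - e^{-1}\right)} \sum_{t=1}^{\min(T,N)} \log\left(1 + \sigma^{-2}_{\text{noise}}\, m_t\, \hat{\lambda}_t\right) \tag{3}$$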

It might seem that would always be bounded by a constant because , which would imply a regret growth in . However, as we will see in Section 5, we may be interested in running the algorithm with a finite horizon and letting depend on . Also, we aim to provide tight bounds for the case where , with improved constants. The growth rate of might become higher, but the tight constants will result in tighter bounds. Finally, we aim to study how these constants are improved for smoother kernels.
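To make the chain of bounds concrete, the sketch below simulates the greedy algorithm directly on the eigen-directions, computes the greedy infogain, and plugs the bound of Inequality 3 into the regret bound of the theorem. It assumes the eigenvalues are supplied in decreasing order and that the greedy algorithm picks, at each step, the direction with the largest current posterior eigenvalue; all function names are hypothetical.

```python
import math

def greedy_counts(eigenvalues, T, noise_var):
    """Number of times each eigen-direction is selected by the greedy algorithm."""
    counts = [0] * len(eigenvalues)
    for _ in range(T):
        # Posterior eigenvalue of direction i after m_i selections.
        post = [lam / (1.0 + m * lam / noise_var)
                for lam, m in zip(eigenvalues, counts)]
        counts[post.index(max(post))] += 1
    return counts

def greedy_infogain(eigenvalues, counts, noise_var):
    """I_g(T); directions that were never selected contribute log(1) = 0."""
    return 0.5 * sum(math.log(1.0 + m * lam / noise_var)
                     for lam, m in zip(eigenvalues, counts))

def regret_bound(eigenvalues, T, noise_var, delta):
    """High-probability regret bound, using Inequality 3 to bound I_u(T)."""
    N = len(eigenvalues)
    counts = greedy_counts(eigenvalues, T, noise_var)
    max_infogain = greedy_infogain(eigenvalues, counts, noise_var) / (1.0 - math.exp(-1.0))
    log_term = math.log(N * T ** 2 * math.pi ** 2 / (6.0 * delta))
    return math.sqrt(16.0 / math.log(1.0 + 1.0 / noise_var) * log_term * T * max_infogain)

bound = regret_bound([7 / 3, 1.0, 1 / 3, 1 / 3], T=10, noise_var=1.0, delta=0.1)
```

A fast eigenvalue decay keeps the selection counts concentrated on a few directions, which is what keeps the infogain term small.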

### 2.3 Application to tree search

Let us consider trees of maximum branching factor and depth . As announced in the introduction, our Gaussian Processes Tree Search algorithm (GPTS) considers tree paths as arms of a bandit problem. The number of arms is (number of leaves or number of tree paths). Therefore, drawing from a GP is equivalent to drawing an N-dimensional vector of values from a multi-variate Gaussian.

#### 2.3.1 Feature space

A path is given by a sequence of nodes : where is always the root node and has depth . We consider the feature space indexed by all the nodes of the tree and defined by

$$\phi_n(x) = \begin{cases} 1 & \text{if } \exists\, 1 \le i \le d,\ x = x_i \\ 0 & \text{otherwise.} \end{cases}$$

The dimension of this space is equal to the number of nodes in the tree .

##### Linear and Gaussian kernels

The linear kernel in this space simply counts the number of nodes in common between two paths: intuitively, the more nodes in common, the closer the rewards of these nodes should be. We could model different levels of smoothness of by considering a Gaussian kernel in this feature space and adapting the width parameter.
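Both kernels can be sketched without materialising the feature vectors. In the sketch below a path is a tuple of child indices and a node is identified by the prefix of choices leading to it; this encoding, the function names, and the $\exp(-\lVert\cdot\rVert^2 / (2 s^2))$ parameterisation of the Gaussian kernel are assumptions made for illustration.

```python
import math

def path_nodes(path):
    """Set of nodes on a path; the empty prefix () is the root."""
    return {path[:i] for i in range(len(path) + 1)}

def linear_kernel(p, q):
    """Inner product of the binary feature vectors: number of shared nodes."""
    return len(path_nodes(p) & path_nodes(q))

def gaussian_kernel(p, q, s):
    """Gaussian kernel of width s in the binary feature space."""
    # For binary features the squared distance is the number of nodes
    # that belong to exactly one of the two paths.
    sq_dist = len(path_nodes(p) ^ path_nodes(q))
    return math.exp(-sq_dist / (2.0 * s ** 2))

# Two paths of depth 3 that share the root and their first two choices.
shared = linear_kernel((0, 1, 1), (0, 1, 0))
```

Larger widths `s` make the Gaussian kernel decay more slowly with the number of differing nodes, which corresponds to a smoother reward function over paths.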

##### More kernels

More generally, we could consider kernel functions characterised by a set of decreasing values in where is the value of the kernel product between two paths that have nodes not in common. Once the are chosen, we can give an explicit feature representation for this kernel, based on the original feature space: we only change the components by taking instead of if a node at depth is in the path, and we take at depth (root). Thus, consider 2 paths that differ on nodes: the first nodes only will be in common, hence the inner product of their feature vectors will be , which is equal to the kernel product between the 2 paths, by definition of . Note that the kernel is normalised by imposing , which will be required in Section 3.1.

#### 2.3.2 Maximisation of ft in the tree

The difficulty in implementing the GPB algorithm is to find the maximum of the upper confidence function when the computational cost of an exhaustive search is prohibitive due to a large number of arms – as for most tree search applications. At time we look for the path which maximises . Here, we can benefit from the tree structure in order to perform this search in only. We first define some terminology and then prove this result.

##### Terminology

A node is said to be explored if there exists in the training data such that contains , and it is said to be unexplored otherwise. A sub-tree is defined here to be a set of nodes that have same parent (called the root of the sub-tree), together with their descendants. A sub-tree is unexplored if no path in the training data goes through this sub-tree. A maximum unexplored sub-tree is a sub-tree such that its root belongs to an in the training data.

##### Proof and procedure

can be expressed as a function of instead of a function of (see Equations 1 and 2) and we argue that all paths that go through a given unexplored sub-tree will have same value, hence same value. Let be such a path, where is defined such that node has been explored but not for . All ’s that go through have the same first nodes , and the other nodes do not matter in kernel computations since they haven’t been visited.

Consequently we just need to evaluate on one randomly chosen path that goes through the unexplored sub-tree , all other such paths having the same value for . We represent maximum unexplored sub-trees by “dummy nodes” and, similarly to leaf nodes, we compute and store values for dummy nodes. The number of dummy nodes in memory is per visited node with unexplored siblings: it is the sub-tree containing the unexplored siblings and their descendants. There are at most such nodes per path in the training data, and there are paths in the training data, hence the number of dummy nodes is less than or equal to .

This would mean that the number of nodes (leaf or dummy) to examine in order to find the maximiser of would be in . The search can be made more efficient than examining all these nodes one by one: we assign upper confidence values recursively to all other nodes (non-leaf and non-dummy) by taking the maximum of the upper confidence values of their children. The maximiser of can thus be found by starting from the root, selecting the node with highest upper confidence value, and so on until a leaf or a dummy node is reached. This method of selecting a path is the same as that of UCT and has a cost of only. After playing an arm, we would need to update the upper confidence values of all leaf nodes and dummy nodes (in ), and with this method we would also need to update the upper confidence values of these nodes’ ancestors, adding an extra cost in .
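The recursive assignment of upper confidence values and the root-to-leaf descent can be sketched as below. The `Node` class and the function names are hypothetical, and for clarity the sketch recomputes the values of the whole stored tree rather than updating only the ancestors of modified nodes.

```python
class Node:
    """A stored node: internal nodes have children, leaf and dummy nodes do not."""
    def __init__(self, children=None, ucb=None):
        self.children = children or []
        self.ucb = ucb            # stored directly for leaf and dummy nodes

def propagate(node):
    """Give each internal node the maximum upper confidence value of its children."""
    if node.children:
        node.ucb = max(propagate(child) for child in node.children)
    return node.ucb

def select_path(root):
    """Descend from the root, always following the child with the highest value."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda child: child.ucb)
        path.append(node)
    return path

# A small stored tree: one explored leaf, one dummy node, one other leaf.
leaf, dummy, other = Node(ucb=0.3), Node(ucb=0.9), Node(ucb=0.7)
root = Node(children=[Node(children=[leaf, dummy]), Node(children=[other])])
propagate(root)
best = select_path(root)[-1]
```

If the descent ends on a dummy node, any path through the corresponding unexplored sub-tree can be played, since all such paths share the same upper confidence value.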

##### Pseudo-code

Pseudo-code that implements the search for the argmax of in is given in Algorithm 1. We sometimes talk about kernel products between leaves and reward on leaves because paths can be identified by their leaf nodes. Note that with this algorithm, we might choose the same leaf node more than once unless .

## 3 Kernel matrix eigenvalues

For our analysis, we ‘expand’ the tree by creating extra nodes so that all branches have the same branching factor . This construction is purely theoretical as the algorithm doesn’t need a representation of the whole tree, nor the expanded tree, in order to run.

### 3.1 Recursive block representation of the kernel matrix

We write the kernel matrix on all paths through an expanded tree with branching factor and depth , and the matrix of ones of dimension . and completely characterise the tree (here, nodes don’t have labels) so is expressed only in terms of and . It can be expressed in block matrix form with and blocks:

$$K_{B,1} = (\chi_0 - \chi_1)\, I + \chi_1 J_B \tag{4}$$

and

where is the value of the kernel product between any two paths that have nodes not in common.

To see this, one must think of the -tree as a root pointing to -trees. On the 1st diagonal block of is the kernel matrix for the paths that go through the first -tree. Because the kernel function is normalised, this stays the same when we prepend the same nodes (here the new root) to all paths, so it is . Similarly, on the other diagonal blocks we have . In order to complete the block matrix representation of we just need to know that any two paths that go through different -trees only have the root in common, and we use the definition of .

Let us denote by and the matrices of blocks by blocks:

We can then write:

$$K_{B,D} = \chi_D\, J^{(B)}(J_{B^{D-1}}) - \chi_D\, I^{(B)}(J_{B^{D-1}}) + I^{(B)}(K_{B,D-1}) \tag{5}$$
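The recursion can be checked numerically against a kernel matrix built path by path. The sketch below reads $J^{(B)}(M)$ as the $B \times B$ block matrix whose blocks all equal $M$ and $I^{(B)}(M)$ as the block-diagonal matrix with $M$ on the diagonal, which is the interpretation suggested by the argument above; it expresses both with Kronecker products. The function names and the list `chi` (with `chi[d]` the kernel value for paths that have `d` nodes not in common) are hypothetical.

```python
import itertools
import numpy as np

def kernel_matrix_recursive(B, D, chi):
    """K_{B,D} built from Equations 4 and 5."""
    K = (chi[0] - chi[1]) * np.eye(B) + chi[1] * np.ones((B, B))   # Equation 4
    for depth in range(2, D + 1):
        J = np.ones((B ** (depth - 1), B ** (depth - 1)))
        K = (chi[depth] * np.kron(np.ones((B, B)), J)    # chi_D J^(B)(J)
             - chi[depth] * np.kron(np.eye(B), J)        # chi_D I^(B)(J)
             + np.kron(np.eye(B), K))                    # I^(B)(K_{B,depth-1})
    return K

def kernel_matrix_direct(B, D, chi):
    """Same matrix, computed by comparing every pair of paths."""
    paths = list(itertools.product(range(B), repeat=D))
    def differing(p, q):
        common = 0
        while common < D and p[common] == q[common]:
            common += 1
        return D - common
    return np.array([[chi[differing(p, q)] for q in paths] for p in paths])

chi = [np.exp(-d / 1.5 ** 2) for d in range(4)]
same = np.allclose(kernel_matrix_recursive(2, 3, chi), kernel_matrix_direct(2, 3, chi))
```

The two constructions agree because `itertools.product` enumerates paths with the first choice varying slowest, which matches the block ordering of the recursion.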

### 3.2 Eigenvalues

For simplicity in the derivations, we consider here the distinct eigenvalues in increasing order, with multiplicities . We will later need to “convert” these to the notations used by Srinivas et al. (2010) in order to use their results.

We show by recursion that, for all , has distinct eigenvalues with multiplicities :

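$$\forall i \in [1, D], \quad \bar{\lambda}^{(D)}_i = \sum_{j=0}^{i-1} B^j (\chi_j - \chi_{j+1}) \quad \text{and} \quad \nu^{(D)}_i = (B - 1) B^{D-i} \tag{6}$$

$$\bar{\lambda}^{(D)}_{D+1} = \sum_{j=0}^{D-1} B^j (\chi_j - \chi_{j+1}) + B^D \chi_D \quad \text{and} \quad \nu^{(D)}_{D+1} = 1 \tag{7}$$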

We also show that and share same eigenbasis, and the eigenvector with highest eigenvalue is the vector of ones , which is also the eigenvector of with highest eigenvalue.
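Before going through the proof, Equations 6 and 7 can be compared with a numerical eigendecomposition on a small tree. The sketch below uses a Gaussian-type choice of the values `chi[d]` as an example; the function names and the chosen parameters are assumptions made for illustration.

```python
import itertools
import numpy as np

def distinct_eigenvalues(B, D, chi):
    """(eigenvalue, multiplicity) pairs from Equations 6 and 7, in increasing order."""
    pairs, acc = [], 0.0
    for i in range(1, D + 1):
        acc += B ** (i - 1) * (chi[i - 1] - chi[i])
        pairs.append((acc, (B - 1) * B ** (D - i)))
    pairs.append((acc + B ** D * chi[D], 1))
    return pairs

def kernel_matrix(B, D, chi):
    """Kernel matrix on all paths, computed by comparing every pair of paths."""
    paths = list(itertools.product(range(B), repeat=D))
    def differing(p, q):
        common = 0
        while common < D and p[common] == q[common]:
            common += 1
        return D - common
    return np.array([[chi[differing(p, q)] for q in paths] for p in paths])

B, D, s = 3, 3, 1.2
chi = [np.exp(-d / s ** 2) for d in range(D + 1)]
pairs = distinct_eigenvalues(B, D, chi)
# Repeat each predicted eigenvalue according to its multiplicity.
predicted = sorted(val for val, mult in pairs for _ in range(mult))
numerical = np.linalg.eigvalsh(kernel_matrix(B, D, chi))   # ascending order
agree = np.allclose(predicted, numerical)
```

The multiplicities add up to $B^D$, so the predicted list has exactly as many entries as there are paths.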

#### 3.2.1 Proof

##### Preliminary result: eigenanalysis of Jb and J(B)

has two eigenvalues: with multiplicity and with multiplicity . We denote by the eigenvectors of , in decreasing order of corresponding eigenvalue. is the vector of ones. The coordinates of are notated . For all from to we define as a concatenation of vectors:

For all , by definition of . For all -dimensional vector and matrix :

$$J^{(B)}(M)\, U^{(B)}_i(v) = 0$$

Hence is an eigenvector of with eigenvalue equal to .

##### Recursion

We propose eigenvectors of , use Equation 5 and determine the value of each term of the sum multiplied by the proposed eigenvectors, in order to get an expression for the eigenvalues.

• For . From Equation 4, are also eigenvectors of with eigenvalue , hence has multiplicity as expected. is also an eigenvector of with eigenvalue , and .

• Let us assume the result is true for a given depth .

• The largest eigenvalue of is

$$\bar{\lambda}^{(D-1)}_D = B^{D-1} \chi_{D-1} + \sum_{j=0}^{D-2} B^j (\chi_j - \chi_{j+1})$$

with multiplicity . Let us apply to the corresponding eigenvector , and multiply it to the expression of given in Equation 5.

• and is a matrix of ones in dimensions, hence:

$$J^{(B)}(J_{B^{D-1}})\, U^{(B)}_B(\mathbf{1}_{B^{D-1}}) = B^D\, U^{(B)}_B(\mathbf{1}_{B^{D-1}})$$
• is also the highest eigenvector of , with eigenvalue , hence:

$$I^{(B)}(J_{B^{D-1}})\, U^{(B)}_B(\mathbf{1}_{B^{D-1}}) = B^{D-1}\, U^{(B)}_B(\mathbf{1}_{B^{D-1}})$$
• By definition of and :

$$I^{(B)}(K_{B,D-1})\, U^{(B)}_B(\mathbf{1}_{B^{D-1}}) = \bar{\lambda}^{(D-1)}_D\, U^{(B)}_B(\mathbf{1}_{B^{D-1}})$$

As a consequence, is the eigenvector of with highest eigenvalue (this will be confirmed later), equal to .

• Let us apply to for all from to .

• Owing to the preliminary result, we have:

$$J^{(B)}(J_{B^{D-1}})\, U^{(B)}_k(\mathbf{1}_{B^{D-1}}) = 0$$
• Since is the eigenvector of with eigenvalue :

$$I^{(B)}(J_{B^{D-1}})\, U^{(B)}_k(\mathbf{1}_{B^{D-1}}) = B^{D-1}\, U^{(B)}_k(\mathbf{1}_{B^{D-1}})$$
• Since is the eigenvector of with highest eigenvalue:

$$I^{(B)}(K_{B,D-1})\, U^{(B)}_k(\mathbf{1}_{B^{D-1}}) = \bar{\lambda}^{(D-1)}_D\, U^{(B)}_k(\mathbf{1}_{B^{D-1}})$$

for the same reasons as previously.

As a consequence, and we have found eigenvectors of with eigenvalue equal to . These vectors are also eigenvectors of with eigenvalue , which comes from the preliminary result and the fact that .

• For from to , let us apply , for all from to , to all eigenvectors of with eigenvalue equal to . By definition of :

$$I^{(B)}(K_{B,D-1})\, U^{(B)}_k(v) = \bar{\lambda}^{(D-1)}_i\, U^{(B)}_k(v)$$

being also an eigenvector of with eigenvalue :

$$J^{(B)}(J_{B^{D-1}})\, U^{(B)}_k(v) = 0, \qquad I^{(B)}(J_{B^{D-1}})\, U^{(B)}_k(v) = 0$$

As a consequence, eigenvalues stay unchanged but their multiplicities are all multiplied by (because goes from to and we have identified times as many eigenvectors) which gives . Again, the preliminary result allows us to show that the are also eigenvectors of with eigenvalue .

• The total number of multiplicities for all found eigenvalues is equal to so we have identified all the eigenvectors.

#### 3.2.2 Re-ordering of the kernel matrix eigenvalues

In order to match the notations of Srinivas et al. (2010), we re-write the eigenvalues as a sequence . We first need to reverse the order of the eigenvalues and thus consider the sequence of ’s. We obtain the lambda hats by repeating the lambda bars as many times as their multiplicities. with such that . For , with hence . from which we have:

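$$\forall t \in [1, N],\ \exists i \in [-1, D-1], \quad \hat{\lambda}_t = \bar{\lambda}_{D-i} \quad \text{with} \quad \log_B(t) - 1 \le i < \log_B(t) \tag{8}$$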

### 3.3 Linear kernel

The linear kernel is an inner product in the feature space, which amounts to counting how many nodes in common two paths have. It takes values from to . The normalised linear kernel divides these values by . If two paths of depth differ on nodes, they have nodes in common:

$$\chi_d = \frac{D + 1 - d}{D + 1}$$

For all , , hence for We use Inequality 8 to get a lower and an upper bound on for :

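$$\bar{\lambda}_{D-i} = \frac{N B^{-i} - 1}{(B - 1)(D + 1)}$$

$$\frac{N B^{-\log_B(t)} - 1}{(B - 1)(D + 1)} \le \hat{\lambda}_t \le \frac{N B^{1 - \log_B(t)} - 1}{(B - 1)(D + 1)}$$

$$\forall t > 1, \quad \frac{N - t}{(B - 1)(D + 1)\, t} \le \hat{\lambda}_t \le \frac{N B - t}{(B - 1)(D + 1)\, t}$$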

The bounds for are obtained by adding to the bounds above. Indeed:

$$\hat{\lambda}_1 = \bar{\lambda}_{D+1} = \sum_{j=0}^{D-1} B^j (\chi_j - \chi_{j+1}) + B^D \chi_D = \sum_{j=0}^{D} B^j (\chi_j - \chi_{j+1}) + B^D \chi_{D+1}$$

We thus see that the expression for only differs from the expressions for other ’s by an added term.
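The bounds for the normalised linear kernel can be verified exactly on small trees with rational arithmetic. The sketch below lists the eigenvalues in decreasing order, repeated according to their multiplicities, and checks them against the lower and upper bounds for $t > 1$; the function names are hypothetical.

```python
from fractions import Fraction

def linear_lambda_hat(B, D):
    """Eigenvalues of the normalised linear kernel matrix, in decreasing order."""
    c = (B - 1) * (D + 1)
    # Equation 6 with chi_j - chi_{j+1} = 1 / (D + 1).
    lam_bar = [Fraction(B ** i - 1, c) for i in range(1, D + 1)]
    mult = [(B - 1) * B ** (D - i) for i in range(1, D + 1)]
    # Equation 7 with chi_D = 1 / (D + 1).
    hats = [lam_bar[-1] + Fraction(B ** D, D + 1)]
    for val, m in zip(reversed(lam_bar), reversed(mult)):
        hats.extend([val] * m)
    return hats

def linear_bounds(B, D, t):
    """Lower and upper bounds on the t-th largest eigenvalue, for t > 1."""
    N, c = B ** D, (B - 1) * (D + 1)
    return Fraction(N - t, c * t), Fraction(N * B - t, c * t)

hats = linear_lambda_hat(3, 3)
holds = all(linear_bounds(3, 3, t)[0] <= hats[t - 1] <= linear_bounds(3, 3, t)[1]
            for t in range(2, 3 ** 3 + 1))
```

Using `Fraction` avoids spurious failures at the values of `t` where a bound is attained with equality.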

### 3.4 Gaussian kernel

We give an expression for for this kernel, before giving bounds on and studying the influence of the kernel width on these bounds.

#### 3.4.1 Value of χd and ¯λi

The squared Euclidean distance in the paths feature space is twice the number of nodes where they differ: path 1 contains nodes indexed by that path 2 doesn’t contain, and path 2 contains nodes indexed by that path 1 doesn’t contain, so the and components of the feature vectors differ. The components of the difference of the feature vectors will be except at the -indices and at the -indices where they will be or . Summing the squares gives .

Consequently, the Gaussian kernel is an exponential on minus the number of nodes where paths differ (from to ):

$$\chi_d = \exp\left(-\frac{d}{s^2}\right)$$

For all , , hence for all ,

$$\bar{\lambda}_i = \left(1 - \frac{q_s}{B}\right) \sum_{j=0}^{i-1} q_s^j = C_s \left(q_s^i - 1\right)$$

where

$$q_s = B \exp\left(-\frac{1}{s^2}\right), \qquad C_s = \frac{1 - \frac{q_s}{B}}{q_s - 1}$$

By definition, $q_s < B$. Let us focus on the case where $q_s > 1$ so that $C_s$ is always positive, which is equivalent to:

$$s > \frac{1}{\sqrt{\log(B)}}$$
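As a sanity check, the closed form $C_s(q_s^i - 1)$ can be compared with the sum of Equation 6 evaluated with $\chi_d = \exp(-d / s^2)$. The function name and the example parameters below are hypothetical, and the sketch requires $q_s \neq 1$.

```python
import math

def gaussian_lambda_bar(B, D, s):
    """Distinct eigenvalues for the Gaussian kernel: closed form and direct sum."""
    q = B * math.exp(-1.0 / s ** 2)
    C = (1.0 - q / B) / (q - 1.0)            # undefined when q == 1
    closed_form = [C * (q ** i - 1.0) for i in range(1, D + 1)]
    chi = [math.exp(-d / s ** 2) for d in range(D + 1)]
    direct = [sum(B ** j * (chi[j] - chi[j + 1]) for j in range(i))
              for i in range(1, D + 1)]
    return q, C, closed_form, direct

# s = 1.5 exceeds 1 / sqrt(log 3), so q_s > 1 and C_s > 0.
q, C, closed_form, direct = gaussian_lambda_bar(B=3, D=4, s=1.5)
```

For widths below the threshold the closed form still holds, but `C` becomes negative because `q` drops below 1.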

#### 3.4.2 Bounds on ^λt

Once again, Inequality 8 gives us a lower and an upper bound on :

$$C_s \left(q_s^D\, q_s^{-\log_B(t)} - 1\right) \le \hat{\lambda}_t \le C_s \left(q_s^D\, q_s^{-\log_B(t)}\, q_s - 1\right)$$

As we will see in the next section, in Inequality 3, we are only interested in indices that are smaller than . As for the linear kernel, we can bound by expressions in . Indeed:

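$$\begin{aligned}
q_s^{-\log_B(t)} &= t^{-\log_B(q_s)} \\
q_s^{-\log_B(t)} &= t^{-1 + \frac{1}{s^2 \log(B)}} \\
\frac{1}{t} \le q_s^{-\log_B(t)} &\le \frac{1}{t} \exp\left(\frac{D}{s^2}\right) \quad \text{since } t \le B^D \\
\frac{1}{t} \le q_s^{-\log_B(t)} &\le \frac{N}{t}\, q_s^{-D}
\end{aligned}$$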

Which thus gives:

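$$\forall t > 1, \quad \frac{C_s \left(N \exp\left(-\frac{D}{s^2}\right) - t\right)}{t} \le \hat{\lambda}_t \le \frac{C_s \left(N q_s - t\right)}{t}$$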

for .
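These bounds can also be checked numerically on small trees. The sketch below lists the eigenvalues for the Gaussian kernel in decreasing order, repeated according to the multiplicities of Equations 6 and 7, and evaluates the two bounds; it assumes $q_s > 1$, and the function names and tolerance are hypothetical.

```python
import math

def gaussian_lambda_hat(B, D, s):
    """Eigenvalues of the Gaussian kernel matrix on paths, in decreasing order."""
    q = B * math.exp(-1.0 / s ** 2)
    C = (1.0 - q / B) / (q - 1.0)
    lam_bar = [C * (q ** i - 1.0) for i in range(1, D + 1)]
    # Largest eigenvalue (Equation 7) with chi_D = exp(-D / s^2).
    hats = [lam_bar[-1] + B ** D * math.exp(-D / s ** 2)]
    for i in range(D, 0, -1):
        hats.extend([lam_bar[i - 1]] * ((B - 1) * B ** (D - i)))
    return hats

def gaussian_bounds(B, D, s, t):
    """Lower and upper bounds on the t-th largest eigenvalue, for t > 1."""
    N = B ** D
    q = B * math.exp(-1.0 / s ** 2)
    C = (1.0 - q / B) / (q - 1.0)
    return C * (N * math.exp(-D / s ** 2) - t) / t, C * (N * q - t) / t

hats = gaussian_lambda_hat(2, 5, 3.0)
tol = 1e-9   # the upper bound is attained at t = N, hence the tolerance
holds = all(gaussian_bounds(2, 5, 3.0, t)[0] - tol <= hats[t - 1]
            <= gaussian_bounds(2, 5, 3.0, t)[1] + tol
            for t in range(2, 2 ** 5 + 1))
```

The lower bound becomes negative, and therefore uninformative, once `t` exceeds $N \exp(-D / s^2)$.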

#### 3.4.3 Influence of the kernel width

From the above we have:

$$\hat{\lambda}_t \le \frac{N C_s q_s}{t}$$

Note that

$$C_s q_s = \frac{(B - q_s)\, q_s}{B\, (q_s - 1)} = \left(1 + \frac{1}{q_s - 1}\right) \left(1 - \frac{q_s}{B}\right)$$

and increases when increases, hence decreases and decreases. As a result, decreases. Also, since tends to when tends to infinity, the limit of is when tends to infinity. The upper-bound improves over that of the linear kernel when is big enough so that .

Now, let us look at the rate at which tends to zero: when is bigger than , we have:

 Csqs