Gaussian Process Bandits for Tree Search: Theory and Application to Planning in Discounted MDPs

09/03/2010
by   Louis Dorard, et al.
0

We motivate and analyse a new Tree Search algorithm, GPTS, based on recent theoretical advances in the use of Gaussian Processes for Bandit problems. We consider tree paths as arms and we assume the target/reward function is drawn from a GP distribution. The posterior mean and variance, after observing data, are used to define confidence intervals for the function values, and we sequentially play arms with highest upper confidence bounds. We give an efficient implementation of GPTS and we adapt previous regret bounds by determining the decay rate of the eigenvalues of the kernel matrix on the whole set of tree paths. We consider two kernels in the feature space of binary vectors indexed by the nodes of the tree: linear and Gaussian. The regret grows in square root of the number of iterations T, up to a logarithmic factor, with a constant that improves with bigger Gaussian kernel widths. We focus on practical values of T, smaller than the number of arms. Finally, we apply GPTS to Open Loop Planning in discounted Markov Decision Processes by modelling the reward as a discounted sum of independent Gaussian Processes. We report similar regret bounds to those of the OLOP algorithm.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/15/2020

On Information Gain and Regret Bounds in Gaussian Process Bandits

Consider the sequential optimization of an expensive to evaluate and pos...
research
05/21/2018

Online Learning in Kernelized Markov Decision Processes

We consider online learning for minimizing regret in unknown, episodic M...
research
09/18/2019

No-Regret Learning in Unknown Games with Correlated Payoffs

We consider the problem of learning to play a repeated multi-agent game ...
research
11/04/2022

Online Learning and Bandits with Queried Hints

We consider the classic online learning and stochastic multi-armed bandi...
research
10/05/2021

Contextual Combinatorial Volatile Bandits via Gaussian Processes

We consider a contextual bandit problem with a combinatorial action set ...
research
02/16/2016

Stochastic Process Bandits: Upper Confidence Bounds Algorithms via Generic Chaining

The paper considers the problem of global optimization in the setup of s...
research
04/02/2022

A UCB-based Tree Search Approach to Joint Verification-Correction Strategy for Large Scale Systems

Verification planning is a sequential decision-making problem that speci...

Please sign up or login with your details

Forgot password? Click here to reset