Online Learning in Kernelized Markov Decision Processes

05/21/2018
by Sayak Ray Chowdhury, et al.

We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are either sampled from Gaussian process priors (fully Bayesian setting) or are members of the Reproducing Kernel Hilbert Spaces induced by symmetric, positive semi-definite kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are stated in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandits, and highlight how the structure of the transition and reward functions influences learning performance. Our results apply to multi-dimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits and on the adaptive control of parametric linear dynamical systems with quadratic costs.
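The maximum information gain that governs such bounds, gamma_T = max_{|A| <= T} 0.5 * log det(I + sigma^{-2} K_A), can be estimated numerically by greedy selection, since the mutual-information objective is submodular. The sketch below is illustrative only and not taken from the paper: the RBF kernel, lengthscale, noise variance, and uniform candidate grid are all assumed parameters.

    import numpy as np

    def rbf_kernel(X, Y, lengthscale=0.2):
        # Squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 * l^2)).
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

    def greedy_information_gain(X, T, noise_var=0.1):
        # Greedily approximates gamma_T = max_{|A|<=T} 0.5*logdet(I + K_A/sigma^2)
        # by repeatedly picking the point with the largest posterior variance and
        # accumulating its marginal mutual-information gain (chain rule).
        n = X.shape[0]
        K = rbf_kernel(X, X)
        var = np.diag(K).copy()          # posterior variances, updated in place
        V = np.zeros((T, n))             # rank-one update vectors (Cholesky-style)
        gain, selected = 0.0, []
        for t in range(T):
            i = int(np.argmax(var))      # most informative remaining point
            gain += 0.5 * np.log1p(var[i] / noise_var)
            # Posterior covariance of every point with x_i, then deflate variances.
            cov_i = K[i] - V[:t].T @ V[:t, i]
            V[t] = cov_i / np.sqrt(var[i] + noise_var)
            var -= V[t] ** 2
            selected.append(i)
        return gain, selected

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.uniform(0.0, 1.0, size=(200, 1))   # candidate state-action points
        for T in (5, 10, 20, 40):
            g, _ = greedy_information_gain(X, T)
            print(f"T={T:3d}  greedy gamma_T ~= {g:.3f}")

Greedy variance maximization is (1 - 1/e)-optimal for this submodular objective, which is why it is the standard way to estimate gamma_T in the kernelized bandit literature.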


