Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration and Planning

10/05/2022 ∙ by Reda Ouhamma, et al.
We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects calibrated Gaussian noise into the parameter of the rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of Õ(√(d³H³K)), where d is the dimension of the parameters, H is the episode length, and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by √H and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to H and K.
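The three ingredients the abstract lists — a penalized estimator, Gaussian noise injected into the reward parameter, and greedy planning against the perturbed parameter — can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a plain linear reward model, substitutes ridge regression for the penalized MLE of the exponential-family model, uses a hypothetical fixed `noise_scale` in place of the paper's calibrated noise, and reduces planning to a finite argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def penalized_mle(features, rewards, reg=1.0):
    """Ridge-penalized least squares: a stand-in for the penalized
    maximum likelihood estimator of the reward parameter."""
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ rewards)

def perturbed_parameter(theta_hat, noise_scale, rng):
    """RLSVI-style exploration: add Gaussian noise to the estimated
    reward parameter instead of maintaining confidence sets."""
    return theta_hat + noise_scale * rng.standard_normal(theta_hat.shape)

def greedy_action(theta_tilde, action_features):
    """Planning step: with rewards linear in the features, acting
    greedily w.r.t. the perturbed parameter is a simple argmax."""
    return int(np.argmax(action_features @ theta_tilde))

# Toy data: rewards linear in features with true parameter theta_star.
d, n = 3, 200
theta_star = np.array([1.0, -0.5, 0.2])
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta_hat = penalized_mle(X, y)
theta_tilde = perturbed_parameter(theta_hat, noise_scale=0.05, rng=rng)
actions = np.eye(d)  # three candidate actions, one per basis direction
best = greedy_action(theta_tilde, actions)
```

The point of the noise injection is that, with the right (calibrated) scale, the perturbed parameter is optimistic often enough to drive exploration while keeping the planning step as tractable as planning with the point estimate.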

