Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

12/15/2020
by   Dongruo Zhou, et al.

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named UCRL-VTR^+ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that UCRL-VTR^+ attains an Õ(dH√(T)) regret, where d is the dimension of the feature mapping, H is the length of each episode, and T is the number of interactions with the MDP. We also prove a matching lower bound Ω(dH√(T)) for this setting, which shows that UCRL-VTR^+ is minimax optimal up to logarithmic factors. In addition, we propose the UCLK^+ algorithm for the same family of MDPs under discounting and show that it attains an Õ(d√(T)/(1-γ)^1.5) regret, where γ ∈ [0,1) is the discount factor. Our upper bound matches the lower bound Ω(d√(T)/(1-γ)^1.5) proved by Zhou et al. (2020) up to logarithmic factors, suggesting that UCLK^+ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
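
To make the linear mixture structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small tabular setting in which the transition kernel is a weighted sum of d basis kernels, and it takes the weights theta on the probability simplex purely so that the mixture is a valid kernel. The helper names (random_kernel, mixture_kernel, integrated_feature, expected_value) are hypothetical. The sketch checks the identity that an integration oracle provides: the expected next-state value under the mixture equals the inner product of theta with the per-basis expected values.

```python
import random

def random_kernel(n_states, n_actions, rng):
    """Return a random kernel P[s][a][s2]; each P[s][a] sums to 1."""
    kernel = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            per_action.append([w / total for w in weights])
        kernel.append(per_action)
    return kernel

def mixture_kernel(basis, theta):
    """P(s2 | s, a) = sum_i theta[i] * basis[i][s][a][s2]."""
    n_states = len(basis[0])
    n_actions = len(basis[0][0])
    return [[[sum(t * b[s][a][s2] for t, b in zip(theta, basis))
              for s2 in range(n_states)]
             for a in range(n_actions)]
            for s in range(n_states)]

def integrated_feature(basis, value, s, a):
    """Hypothetical integration oracle: entry i is E_{s2 ~ basis[i]}[value[s2]]."""
    return [sum(p * v for p, v in zip(b[s][a], value)) for b in basis]

def expected_value(kernel, value, s, a):
    """Expected next-state value under a kernel at state s, action a."""
    return sum(p * v for p, v in zip(kernel[s][a], value))

# Illustrative sizes and weights (assumptions, not taken from the paper).
rng = random.Random(0)
d, S, A = 3, 4, 2
basis = [random_kernel(S, A, rng) for _ in range(d)]
theta = [0.5, 0.3, 0.2]  # simplex weights: a simplifying assumption
P = mixture_kernel(basis, theta)
V = [rng.random() for _ in range(S)]

# Largest gap between the direct expectation and <phi_V(s, a), theta>.
max_gap = max(
    abs(expected_value(P, V, s, a)
        - sum(f * t for f, t in zip(integrated_feature(basis, V, s, a), theta)))
    for s in range(S) for a in range(A)
)
```

Because the expectation is linear in theta, estimating the unknown weights reduces to a regression on these d-dimensional integrated features, which is why the dimension d, rather than the number of states, appears in the regret bounds quoted above.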

02/17/2021

Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

We study the reinforcement learning for finite-horizon episodic Markov d...
06/23/2022

Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

We study reinforcement learning with linear function approximation where...
02/15/2021

Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

We study reinforcement learning in an infinite-horizon average-reward se...
09/12/2021

Improved Algorithms for Misspecified Linear Markov Decision Processes

For the misspecified linear Markov decision process (MLMDP) model of Jin...
08/29/2019

Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity

In this paper, we settle the sampling complexity of solving discounted t...
02/25/2021

Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally

We study the statistical limits of Imitation Learning (IL) in episodic M...
12/21/2021

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Policy optimization methods are one of the most widely used classes of R...