Variance Reduction Methods for Sublinear Reinforcement Learning

02/26/2018
by   Sham Kakade, et al.

This work considers the problem of provably optimal reinforcement learning for (episodic) finite-horizon MDPs, i.e., how an agent learns to maximize its long-term reward in an uncertain environment. The main contribution is a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of O(√(HSAT) + H^5SA), where T is the number of time steps the agent acts in the MDP, S is the number of states, A is the number of actions, and H is the (episodic) horizon. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve ϵ-average regret (for any constant ϵ) is O(SA), a number of samples far smaller than that required to learn any non-trivial estimate of the transition model, which is specified by O(S^2A) parameters. The importance of sub-linear algorithms is largely the motivation for Q-learning and other "model-free" approaches. vUCQ also enjoys minimax-optimal regret in the long run, matching the Ω(√(HSAT)) lower bound. vUCQ is a successive-refinement method in which the algorithm reduces the variance of its Q-value estimates and couples this estimation scheme with an upper-confidence-based algorithm. Technically, the coupling of these two techniques is what yields both the sub-linear regret property and the asymptotically optimal regret.


