Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

12/21/2021
by Tianhao Wu, et al.

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms, yet their theoretical understanding remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art regret bound for policy-based methods in <cit.> is only Õ(√(S^2AH^4K)), where S is the number of states, A is the number of actions, H is the horizon, and K is the number of episodes. This leaves a √(SH) gap relative to the information-theoretic lower bound Ω̃(√(SAH^3K)). To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves Õ(√(SAH^3K) + √(AH^4K)) regret. When S > H, our algorithm is minimax optimal up to logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
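
The two comparisons in the abstract (the √(SH) gap and the S > H condition) follow from simple arithmetic on the stated bounds. The short derivation below is only a sketch that ignores constants and logarithmic factors; it uses nothing beyond the quantities S, A, H, K defined above.

```latex
% Gap between the earlier upper bound and the lower bound:
\frac{\sqrt{S^{2} A H^{4} K}}{\sqrt{S A H^{3} K}}
  = \sqrt{\frac{S^{2} A H^{4} K}{S A H^{3} K}}
  = \sqrt{S H}.

% When the second term of the new bound is dominated by the first:
\sqrt{A H^{4} K} \le \sqrt{S A H^{3} K}
  \iff A H^{4} K \le S A H^{3} K
  \iff H \le S.

% Hence, for S \ge H:
\sqrt{S A H^{3} K} + \sqrt{A H^{4} K}
  \le 2 \sqrt{S A H^{3} K},
% which matches the lower bound \sqrt{S A H^{3} K} up to a constant factor.
```

In other words, the additive √(AH^4K) term affects the rate only in the regime where the horizon exceeds the number of states.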
