Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

09/16/2022
by Shenao Zhang, et al.

Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, this complexity can grow exponentially even for the simplest nonlinear models, in which case global convergence is impossible within finitely many iterations. When the model suffers a large generalization error, quantified by the model complexity, the uncertainty can be large. The sampled model on which the current policy is greedily optimized will then be unreliable, leading to aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which consists of a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering greater stability. A conservative range of randomness is then guaranteed by maximizing the expectation of the model value. Without harmful sampling procedures, CDPO still achieves the same regret as PSRL. More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously. Empirical results further validate the exploration efficiency of CDPO.
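To make the dual update concrete, here is a minimal illustrative sketch of the two steps on a toy multi-armed bandit. It is not the paper's algorithm: a "model" is reduced to a vector of mean rewards per action, the model posterior is a finite weighted set of such vectors, and all names (`posterior`, `referential_update`, `conservative_update`) are hypothetical. The first step optimizes greedily against a single reference model; the second keeps the candidate policy with the highest *expected* value under the posterior, rather than under one sampled model.

```python
# Toy CDPO-style dual update on a bandit. A model is a list of mean
# rewards per action; the posterior is a list of (weight, model) pairs.
# Names and structure are illustrative assumptions, not the paper's code.

def referential_update(posterior):
    """Greedy policy under one reference model (here: the posterior
    mean), mimicking the stability of optimizing a fixed model."""
    n_actions = len(posterior[0][1])
    mean_model = [sum(w * m[a] for w, m in posterior) for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: mean_model[a])

def conservative_update(posterior, candidate_actions):
    """Among candidate policies, keep the one maximizing the value
    in expectation over the model posterior, not over a single sample."""
    def expected_value(a):
        return sum(w * m[a] for w, m in posterior)
    return max(candidate_actions, key=expected_value)

# Posterior over two plausible reward models, with weights 0.6 and 0.4.
posterior = [(0.6, [1.0, 0.2, 0.5]), (0.4, [0.1, 0.9, 0.6])]

ref_action = referential_update(posterior)              # step 1
action = conservative_update(posterior, [ref_action, 2])  # step 2
```

In the real algorithm the "policies" are parametric and the expectations are over learned dynamics models, but the same pattern holds: the referential step supplies stable candidates, and the conservative step scores them by expected model value to avoid over-exploration.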


