Supervised Policy Update

05/29/2018
by Quan Ho Vuong, et al.

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy into a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. Our experiments show that SPU can outperform PPO in sample efficiency on simulated robotic locomotion tasks.
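The two-step structure described in the abstract can be illustrated with a small tabular sketch. It is not the authors' implementation: the function names (`proximal_target`, `fit_policy`), the `temperature` parameter, and the toy advantages are hypothetical, and the target used here is the standard closed form for a KL-penalized objective (the old policy reweighted by exponentiated advantages), chosen only as one plausible instance of a non-parameterized proximal policy.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (one row per state).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def proximal_target(pi_old, advantages, temperature=1.0):
    # Step 1 (assumed form): a non-parameterized policy near pi_old,
    # proportional to pi_old * exp(A / temperature) in each state.
    shifted = advantages - advantages.max(axis=-1, keepdims=True)
    w = pi_old * np.exp(shifted / temperature)
    return w / w.sum(axis=-1, keepdims=True)

def fit_policy(logits, target, lr=0.5, steps=500):
    # Step 2: supervised regression. Gradient descent on the cross-entropy
    # between the target policy and the parameterized policy softmax(logits).
    logits = logits.copy()
    for _ in range(steps):
        logits -= lr * (softmax(logits) - target)  # d(CE)/d(logits)
    return logits

# Toy problem: 2 states, 3 actions, uniform current policy (made-up numbers).
pi_old = np.full((2, 3), 1.0 / 3.0)
advantages = np.array([[1.0, 0.0, -1.0],
                       [0.0, 0.5, -0.5]])

target = proximal_target(pi_old, advantages)
new_logits = fit_policy(np.log(pi_old), target)
pi_new = softmax(new_logits)
print(pi_new)
```

In this sketch the parameterized policy is just a table of logits, so the regression step can match the target almost exactly; with a neural network policy the fit would be approximate, and the choice of regression labels would determine which underlying optimization problem is being solved.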

research
10/23/2018

Hierarchical Approaches for Reinforcement Learning in Parameterized Action Space

We explore Deep Reinforcement Learning in a parameterized action space. ...
research
06/14/2020

Optimistic Distributionally Robust Policy Optimization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization...
research
04/17/2018

An Adaptive Clipping Approach for Proximal Policy Optimization

Very recently proximal policy optimization (PPO) algorithms have been pr...
research
05/28/2021

Improving Generalization in Mountain Car Through the Partitioned Parameterized Policy Approach via Quasi-Stochastic Gradient Descent

The reinforcement learning problem of finding a control policy that mini...
research
06/25/2019

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Proximal policy optimization and trust region policy optimization (PPO a...
research
10/18/2022

Proximal Learning With Opponent-Learning Awareness

Learning With Opponent-Learning Awareness (LOLA) (Foerster et al. [2018a...
research
01/29/2019

Trust Region-Guided Proximal Policy Optimization

Model-free reinforcement learning relies heavily on a safe yet explorato...
