Reparameterized Policy Learning for Multimodal Trajectory Optimization

07/20/2023
by   Zhiao Huang, et al.

We investigate the challenge of parametrizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly used Gaussian parameterization. To achieve this, we propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. By conditioning the policy on a latent variable, we derive a novel variational bound as the optimization objective, which promotes exploration of the environment. We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and a learned world model to achieve strong exploration capabilities and high data efficiency. Empirical results demonstrate that our method can help agents evade local optima in tasks with dense rewards and solve challenging sparse-reward environments by incorporating an object-centric intrinsic reward. Our method consistently outperforms previous approaches across a range of tasks. Code and supplementary materials are available on the project page: https://haosulab.github.io/RPG/
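To make the idea of a latent-conditioned policy concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a one-dimensional action, a discrete latent z with a uniform prior, and a Gaussian conditional whose mean depends only on z; the state and the learned world model are omitted for brevity. The names `MODE_MEANS`, `SIGMA`, `sample_latent`, `sample_action`, and `sample_policy` are hypothetical. It illustrates how marginalizing over z yields a multimodal action distribution that a single Gaussian cannot represent.

```python
import random

# Hypothetical toy setup: a 1-D action and two latent modes.
MODE_MEANS = {0: -1.0, 1: 1.0}  # assumed per-latent action means
SIGMA = 0.1                     # assumed shared standard deviation


def sample_latent(rng):
    """Draw a discrete latent z from a uniform prior p(z)."""
    return rng.choice(sorted(MODE_MEANS))


def sample_action(z, rng):
    """Reparameterized Gaussian sample: a = mu_z + sigma * eps, eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return MODE_MEANS[z] + SIGMA * eps


def sample_policy(n, seed=0):
    """Sample n actions from the marginal policy sum_z p(z) pi(a | z)."""
    rng = random.Random(seed)
    return [sample_action(sample_latent(rng), rng) for _ in range(n)]


actions = sample_policy(1000)
# Count how many actions fall near each mode and near the midpoint.
near_left = sum(1 for a in actions if abs(a + 1.0) < 0.5)
near_right = sum(1 for a in actions if abs(a - 1.0) < 0.5)
near_zero = sum(1 for a in actions if abs(a) < 0.5)
print(near_left, near_right, near_zero)
```

In this toy setting the samples cluster around the two assumed means and almost none land near zero, whereas a single Gaussian fitted to the same samples would place its mean near zero, between the two modes.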

research
10/31/2019

VASE: Variational Assorted Surprise Exploration for Reinforcement Learning

Exploration in environments with continuous control and sparse rewards r...
research
10/08/2020

Learning Intrinsic Symbolic Rewards in Reinforcement Learning

Learning effective policies for sparse objectives is a key challenge in ...
research
06/18/2021

MADE: Exploration via Maximizing Deviation from Explored Regions

In online reinforcement learning (RL), efficient exploration remains par...
research
04/04/2022

Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

We present Reward-Switching Policy Optimization (RSPO), a paradigm to di...
research
02/08/2022

Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning

We propose a new policy parameterization for representing 3D rotations d...
research
02/18/2023

HOPE: Human-Centric Off-Policy Evaluation for E-Learning and Healthcare

Reinforcement learning (RL) has been extensively researched for enhancin...
research
11/28/2016

Improving Policy Gradient by Exploring Under-appreciated Rewards

This paper presents a novel form of policy gradient for model-free reinf...
