The trade-off between exploration and exploitation is at the core of reinforcement learning (RL). Designing efficient exploration algorithms, while being a highly nontrivial task, is essential to success in many RL tasks [burda2018exploration, ecoffet2019go]. Hence, it is natural to ask the following high-level question: What can we achieve by pure exploration? To address this question, several settings related to meta reinforcement learning (meta RL) have been proposed (see, e.g., [wang2016learning, duan2016rl, finn2017model]). One common setting in meta RL is to learn a model in a reward-free environment in the meta-training phase, and to use the learned model as an initialization for fast adaptation to new tasks in the meta-testing phase [eysenbach2018diversity, gupta2018unsupervised, nagabandi2018learning]. Since the agent still needs to explore the environment under the new tasks in the meta-testing phase (some new tasks may require many additional samples and others few), it is unclear how to evaluate the effectiveness of the exploration in the meta-training phase. Another setting is to learn a policy in a reward-free environment and test the policy under a task with a specific reward function (such as the score in Montezuma's Revenge) without further training on the task [burda2018exploration, ecoffet2019go, burda2018large]. However, there is no guarantee that the algorithm has fully explored the transition dynamics of the environment unless we test the learned model against arbitrary reward functions. Recently, Jin et al. [jin2020reward] proposed a theoretical framework that fully decouples exploration and exploitation. Further, they designed a provably efficient algorithm that conducts a finite number of steps of reward-free exploration and returns near-optimal policies for arbitrary reward functions.
However, their algorithm is designed for the tabular case and can hardly be extended to continuous or high-dimensional state spaces since they construct a policy for each state.
In this paper, we consider a similar zero-shot meta RL framework, as follows: First, a single policy is trained to explore the dynamics of the reward-free environment in the exploration phase (i.e., the meta-training phase). Then, a dataset of trajectories is collected by executing the learned exploratory policy. In the planning phase (i.e., the meta-testing phase), an arbitrary reward function is specified and a batch RL algorithm [lange2012batch, fujimoto2018off] is applied to solve for a good policy solely based on the dataset, without further interaction with the environment. This framework is suited to scenarios in which there are many reward functions of interest or the reward is designed offline to elicit desired behavior. The key to success in this framework is to obtain, with as few samples as possible, a dataset with good coverage over all possible situations in the environment, which in turn requires the exploratory policy to fully explore the environment.
Several methods that encourage various forms of exploration have been developed in the reinforcement learning literature. The maximum entropy framework [haarnoja2017reinforcement] maximizes the cumulative reward while simultaneously maximizing the entropy over the action space conditioned on each state. This framework results in several efficient and robust algorithms, such as soft Q-learning [schulman2017equivalence, nachum2017bridging], SAC [haarnoja2018soft] and MPO [abdolmaleki2018maximum]. On the other hand, maximizing the state space coverage may result in better exploration. Various kinds of objectives and regularizations have been used to better explore the state space, including information-theoretic metrics [houthooft2016vime, mohamed2015variational, eysenbach2018diversity] (especially the entropy of the state space [hazan2018provably, islam2019entropy]), the prediction error of a dynamical model [burda2018large, pathak2017curiosity, de2018curiosity], the state visitation count [burda2018exploration, bellemare2016unifying, ostrovski2017count], and other heuristic signals such as novelty [lehman2011novelty, conti2018improving], surprise [achiam2017surprise] or curiosity [schmidhuber1991possibility].
To obtain an exploratory policy for the zero-shot meta RL framework, we propose to maximize the Rényi entropy over the state-action space in the exploration phase. In particular, we demonstrate the advantage of using the state-action space, instead of the state space, via a simple example (see Section 3 and Figure 1). Moreover, Rényi entropy generalizes a family of entropies, including the commonly used Shannon entropy. We justify the use of Rényi entropy as the objective function theoretically by providing an upper bound on the number of samples in the dataset that ensures a near-optimal policy is obtained for any reward function in the planning phase.
Further, we derive a gradient ascent update rule for maximizing the Rényi entropy over the state-action space. The derived update rule is similar to the vanilla policy gradient update, with the reward replaced by a function of the discounted stationary state-action distribution of the current policy. We use a variational autoencoder (VAE) [kingma2013auto] as the density model to estimate this distribution. The corresponding reward changes over iterations, which makes it hard to accurately estimate a value function under the current reward. To address this issue, we propose to estimate a state value function using off-policy data with the reward relabeled by the current density model. This enables us to efficiently update the policy in a stable way using PPO [schulman2017proximal]. Afterwards, we collect a dataset by executing the learned policy. In the planning phase, when a reward function is specified, we augment the dataset with the reward function and use a batch RL algorithm, batch-constrained deep Q-learning (BCQ) [fujimoto2018off, fujimoto2019benchmarking], to plan for a good policy under the reward function. We conduct experiments on several environments with discrete, continuous or high-dimensional state spaces. The experimental results indicate that our algorithm is effective, sample-efficient and robust in the exploration phase, and achieves good performance under the zero-shot meta RL framework.
Our contributions are summarized as follows:
(Section 3) We consider a zero-shot meta RL framework that separates exploration from exploitation and therefore places a higher requirement for an exploration algorithm. To efficiently explore under this framework, we propose a novel objective that maximizes the Rényi entropy over the state-action space in the exploration phase and justify this objective theoretically.
(Section 4) We propose a practical algorithm based on a derived policy gradient formulation to maximize the Rényi entropy over the state-action space for the zero-shot meta RL framework.
(Section 5) We conduct a wide range of experiments and the results indicate that our algorithm is efficient and robust during the exploration and results in superior performance in the downstream planning phase for arbitrary reward functions.
A reward-free environment can be formulated as a controlled Markov process (CMP) [fernandez1994controlled] $(\mathcal{S}, \mathcal{A}, P, \mu, \gamma)$, which specifies the state space $\mathcal{S}$, the action space $\mathcal{A}$, the transition dynamics $P(s' \mid s, a)$, the initial state-action distribution $\mu$ and the discount factor $\gamma \in (0, 1)$. A (stationary) policy $\pi_\theta$ parameterized by $\theta$ specifies the probability $\pi_\theta(a \mid s)$ of choosing the actions on each state. The stationary discounted state visitation distribution (or simply the state distribution) under the policy $\pi$ is defined as $d^\pi_{\mathcal{S}}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s \mid \pi)$. The stationary discounted state-action visitation distribution (or simply the state-action distribution) under the policy $\pi$ is defined as $d^\pi(s, a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s, a_t = a \mid \pi)$. Unless otherwise stated, we use $d^\pi$ to denote the state-action distribution. We also use $d^\pi_{s,a}$ to denote the distribution whose trajectories start from the state-action pair $(s, a)$ instead of from the initial state-action distribution $\mu$.
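For a small tabular CMP, the discounted state-action distribution defined above can be computed directly. The following sketch is our own illustration (the array layout `P[s, a, s']` and the variable names are not from the paper); it accumulates the discounted visitation sums term by term:

```python
import numpy as np

def discounted_sa_distribution(P, pi, mu, gamma, iters=2000):
    """Compute d^pi(s, a) = (1 - gamma) * sum_t gamma^t Pr(s_t = s, a_t = a).

    P[s, a, s'] : transition probabilities, pi[s, a] : policy probabilities,
    mu[s] : initial state distribution (placed over states for simplicity).
    """
    S, A = pi.shape
    state_dist = mu.copy()                 # Pr(s_t = s)
    d = np.zeros((S, A))
    coef = 1.0                             # gamma^t
    for _ in range(iters):
        sa = state_dist[:, None] * pi      # Pr(s_t = s, a_t = a)
        d += coef * sa
        # next-state distribution: sum over (s, a) of Pr(s_t=s, a_t=a) P(s'|s,a)
        state_dist = np.einsum("sa,sax->x", sa, P)
        coef *= gamma
    return (1.0 - gamma) * d
```

Since the series is geometric, truncating after a few thousand terms is numerically exact for moderate discount factors.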
When a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is specified, the CMP becomes a Markov decision process (MDP) [sutton2018introduction]. The objective for the MDP is to find a policy that maximizes the expected cumulative reward, where $Q^{\pi}$ denotes the Q function. The policy gradient for this objective is $\nabla_\theta J(\pi_\theta) \propto \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\left[Q^{\pi_\theta}(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s)\right]$ [williams1992simple].
Rényi entropy for a distribution $d$ is defined as $H_\alpha(d) = \frac{1}{1-\alpha}\log\left(\sum_{x} d(x)^{\alpha}\right)$ for $\alpha \geq 0$, $\alpha \neq 1$, where $d(x)$ is the probability mass or the probability density function on $x$ (and the summation becomes an integration in the latter case). When $\alpha = 0$, Rényi entropy becomes Hartley entropy and equals the logarithm of the size of the support of $d$. When $\alpha \to 1$, Rényi entropy becomes Shannon entropy [bromiley2004shannon, sanei2016renyi].
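The definition above, together with its two limiting cases, is easy to check numerically. A minimal sketch (our own helper, not code from the paper):

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # zero-mass outcomes contribute nothing
    if np.isclose(alpha, 1.0):         # limit alpha -> 1: Shannon entropy
        return -np.sum(p * np.log(p))
    if alpha == 0.0:                   # Hartley entropy: log of support size
        return np.log(len(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)
```

For a uniform distribution, $H_\alpha$ equals the logarithm of the support size for every $\alpha$, and $H_\alpha$ approaches Shannon entropy as $\alpha \to 1$.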
Given a distribution $d$, the corresponding density model parameterized by $\phi$ gives a probability density estimate of $d$ based on samples drawn from $d$. The variational auto-encoder (VAE) [kingma2013auto] is a popular density model that maximizes the variational lower bound (ELBO) of the log-likelihood. Specifically, VAE maximizes the lower bound of $\log p_\phi(x)$, i.e., $\mathbb{E}_{z \sim q_e(z \mid x)}[\log p_d(x \mid z)] - D_{\mathrm{KL}}(q_e(z \mid x) \,\|\, p(z))$, where $p_d$ and $q_e$ are the decoder and the encoder respectively and $p(z)$ is a prior distribution for the latent variable $z$.
3 The objective for the exploration phase
The objective for the exploration phase under the zero-shot meta RL framework is to induce an informative and compact dataset: the informative condition is that the dataset should have good coverage, so that the planning phase generates good policies for arbitrary reward functions; the compact condition is that the size of the dataset should be as small as possible while still ensuring successful planning. In this section, we show that the Rényi entropy over the state-action space, $H_\alpha(d^\pi)$, is a good objective function for the exploration phase. We first demonstrate the advantage of maximizing the state-action space entropy over maximizing the state space entropy with a toy example. Then, we provide a motivation for using Rényi entropy by analyzing a deterministic setting. Finally, we provide an upper bound on the number of samples needed in the dataset for successful planning if we have access to a policy that maximizes the Rényi entropy over the state-action space. For ease of analysis, we assume the state-action space is discrete in this section, and derive a practical algorithm that deals with continuous state-action spaces in the next section.
Why maximize the entropy over the state-action space? We demonstrate the advantage of maximizing the entropy over the state-action space with a toy example shown in Figure 1. The example contains an MDP with two actions and five states. The first action always drives the agent back to the first state, while the second action moves the agent to the next state. For simplicity of presentation, we consider a particular value of the discount factor, but other values behave similarly. The policy that maximizes the entropy of the state distribution is a deterministic policy that takes the actions shown in red. The dataset obtained by executing this policy is poor, since the planning algorithm fails when, in the planning phase, a sparse reward is assigned to one of the state-action pairs that the policy visits with zero probability (e.g., a reward function that is nonzero on exactly one such pair). In contrast, a policy that maximizes the entropy of the state-action distribution avoids this problem. For example, when executing the policy that maximizes the Rényi entropy over the state-action space, the expected size of the dataset needed to contain all the state-action pairs is only 44 (cf. Appendix The toy example in Figure 1). Note that, when the transition dynamics is known to be deterministic, a dataset containing all the state-action pairs is sufficient for the planning algorithm to obtain an optimal policy, since then the full transition dynamics is known.
Why using Rényi entropy? We analyze a deterministic setting where the transition dynamics is known to be deterministic. In this setting, the objective for the framework can be expressed as a specific objective function for the exploration phase. This objective function is hard to optimize but motivates us to use Rényi entropy as the surrogate.
We define $K$ as the cardinality of the state-action space. Given an exploratory policy $\pi$, we assume the dataset is collected in a way such that the transitions in the dataset can be treated as i.i.d. samples from $d^\pi$, where $d^\pi$ is the state-action distribution induced by the policy $\pi$.
In the deterministic setting, we can recover the exact transition dynamics of the environment from a dataset of transitions that contains all the state-action pairs. Such a dataset leads to successful planning for arbitrary reward functions, and therefore satisfies the informative condition. In order to obtain such a dataset that is also compact, we stop collecting samples once the dataset contains all the pairs. Given the distribution $d$ from which we collect samples, the average size of the dataset is $F(d)$, where
$F(d) = \int_{0}^{\infty} \left( 1 - \prod_{i=1}^{K} \left( 1 - e^{-d_i t} \right) \right) dt$,
which is a result of the coupon collector's problem [flajolet1992birthday]. Accordingly, the objective for the exploration phase can be expressed as minimizing $F(d^\pi)$ over the policy $\pi$. We show the contour of this function in Figure 2a. We can see that, when any component of the distribution approaches zero, $F(d)$ increases rapidly.
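The expected dataset size in this coupon-collector argument can be evaluated numerically (for evaluation, not for optimization). The sketch below is our own illustration of the Poissonized identity cited above, and shows that a skewed distribution needs far more samples than a near-uniform one:

```python
import numpy as np
from scipy.integrate import quad

def expected_cover_time(p):
    """Expected number of i.i.d. draws from p until every outcome has appeared,
    via the Poissonization identity:
        F(p) = integral_0^inf ( 1 - prod_i (1 - exp(-p_i * t)) ) dt.
    """
    p = np.asarray(p, dtype=float)
    integrand = lambda t: 1.0 - np.prod(1.0 - np.exp(-p * t))
    val, _ = quad(integrand, 0, np.inf, limit=200)
    return val
```

For two equally likely outcomes the classical answer is $2(1 + 1/2) = 3$ draws, and pushing mass away from uniform (e.g., probabilities 0.9 and 0.1) sharply increases the expected cover time.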
However, this function involves an improper integral that is hard to handle, and therefore cannot be directly used as an objective function in the exploration phase. One common choice for a tractable objective function is Shannon entropy [hazan2018provably, islam2019entropy]. Still, Shannon entropy suffers from the following problem: the policy that maximizes Shannon entropy may visit some state-action pairs with vanishing probability, so that many more samples are needed to collect these pairs under such a policy. We provide an illustrative example in Figure 2b. Consider a CMP with three state-action pairs (specified in Appendix The example in Figure 2). For this CMP, the feasible region $\{d^\pi : \pi \in \Pi\}$ in the distribution space is the green line, where $\Pi$ is the set of all stationary policies. The state-action distribution of the Shannon-entropy-maximizing policy is marked by the green star. We can see that, although the maximum probability with which any policy can reach the last state-action pair is already small, this policy visits the pair with an even smaller probability, so the expected number of samples needed to ensure successful planning is much larger. By comparison, the expected number of samples is substantially smaller when Rényi entropy with a small $\alpha$ is used as the objective function. This indicates that Rényi entropy with a small $\alpha$ is a better objective function than Shannon entropy.
Theoretical justification for the objective function. Next, we formally justify the use of Rényi entropy over the state-action space with the following theorem. For ease of analysis, we consider a standard episodic case: the MDP has a finite planning horizon $H$ with the objective to maximize the cumulative reward $\mathbb{E}\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]$, where the initial state $s_1$ is picked from an initial state distribution $\mu$. We assume the reward function is deterministic. A (non-stationary, stochastic) policy $\pi = \{\pi_h\}_{h=1}^{H}$ specifies the probability of choosing the actions on each state and at each step. The state-action distribution induced by $\pi$ on the $h$-th step is denoted $d^\pi_h$.
Denote by $\Phi = \{\pi^{(h)}\}_{h=1}^{H}$ a set of policies, where $\pi^{(h)}$ maximizes the Rényi entropy $H_\alpha(d^\pi_h)$ of the state-action distribution on the $h$-th step and $\alpha \in [0, 1)$. Construct a dataset $\mathcal{D}$ with $N$ trajectories, each of which is collected by first uniformly randomly choosing a policy from $\Phi$ and then executing that policy. Assume $N$ is sufficiently large; the precise bound depends on the size of the state-action space, the horizon $H$, the accuracy $\epsilon$, the confidence $\delta$, the entropy parameter $\alpha$ and an absolute constant. Then, there exists a planning algorithm such that, for any reward function $r$, with probability at least $1 - \delta$, the output policy $\hat{\pi}$ of the planning algorithm based on $\mathcal{D}$ is $\epsilon$-optimal, i.e., $V^{\hat{\pi}} \geq \max_{\pi} V^{\pi} - \epsilon$.
We provide the proof in Appendix Proof of Theorem 3.1. The theorem justifies that Rényi entropy with small $\alpha$ is a proper objective function for the exploration phase, since the number of samples needed to ensure successful planning is bounded when $\alpha$ is small. When $\alpha \to 1$, the bound becomes infinite. The algorithm proposed by Jin et al. [jin2020reward] requires sampling a number of trajectories where $\tilde{O}$ hides a logarithmic factor, which matches our bound when $\alpha$ is small. However, they construct a policy for each state on each step, whereas we only need $H$ policies, which easily adapts to the non-tabular case.
In this section, we develop an algorithm that works in the non-tabular case. In the exploration phase, we update the policy to maximize $H_\alpha(d^{\pi_\theta})$. We first derive a gradient ascent update rule, which is similar to vanilla policy gradient with the reward replaced by a function of the state-action distribution of the current policy. Afterwards, we estimate the state-action distribution using VAE. We also estimate a value function and update the policy using PPO, which is more sample-efficient and robust than vanilla policy gradient. Then, we obtain a dataset by collecting samples from the learned policy. In the planning phase, we use a popular batch RL algorithm, BCQ, to plan for a good policy when the reward function is specified. One may also use other batch RL algorithms. We show the pseudocode of the process in Algorithm 1, the details of which are described in the following paragraphs.
Policy gradient formulation. Let us first consider the gradient of the objective function $H_\alpha(d^{\pi_\theta})$, where the policy is approximated by a policy network with parameters denoted as $\theta$. We omit the dependency of $d^{\pi_\theta}$ on $\theta$. The gradient of the objective function is given in (2). As a special case, when $\alpha \to 1$, the Rényi entropy becomes the Shannon entropy and the gradient turns into (3). Due to the space limit, the derivation can be found in Appendix Policy gradient of state-action space entropy.
There are two terms in the gradients.
The first term equals the policy gradient with the reward proportional to $(d^{\pi_\theta}(s,a))^{\alpha-1}$ (or $-\log d^{\pi_\theta}(s,a)$ in the Shannon case), which resembles the policy gradient (of cumulative reward) for a standard MDP.
This term encourages the policy to choose the actions that lead to rarely visited state-action pairs.
In a similar way, the second term resembles the policy gradient of the instant reward and encourages the agent to choose the actions that are rarely selected at the current step.
For stability, we replace this term with the gradient of the entropy of the policy over the actions, which encourages the agent to choose the actions uniformly conditioned on the state samples and therefore plays a similar role (see also Appendix .3). We found that using this term leads to more stable performance empirically, since it does not suffer from the high variance induced by the estimation of $d^{\pi_\theta}$. Accordingly, we update the policy based on the following formula, where $\beta$ is a hyperparameter:
Discussion. Islam et al. [islam2019entropy] motivate the agent to explore by maximizing the Shannon entropy over the state space, resulting in an intrinsic reward that is similar to ours when $\alpha \to 1$. Bellemare et al. [bellemare2016unifying] use an intrinsic reward proportional to $\hat{N}(s)^{-1/2}$, where $\hat{N}(s)$ is an estimate of the visit count of $s$. Our algorithm with $\alpha = 1/2$ induces a similar reward.
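To make the comparison above concrete, the intrinsic rewards can be sketched as follows, assuming the entropy-induced reward is proportional to the estimated density raised to the power alpha - 1 (consistent with the count-based analogy in the text, but written in our own notation):

```python
import numpy as np

def intrinsic_reward(density, alpha):
    """Intrinsic reward induced by maximizing the visitation entropy:
    rarely visited pairs (small density) receive large reward.

    alpha -> 1 recovers the Shannon case (-log d); alpha = 0.5 gives a
    count-style bonus proportional to 1 / sqrt(d), as in pseudo-count methods.
    """
    d = np.asarray(density, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.log(d)
    return d ** (alpha - 1.0)
```

Smaller $\alpha$ amplifies the bonus on rarely visited pairs, which matches the motivation for preferring small $\alpha$ in the previous section.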
Sample collection. To estimate $d^{\pi_\theta}$ for calculating the underlying reward, we collect samples in the following way (cf. Line 5 of Algorithm 1): in each iteration, we sample a batch of trajectories. In each trajectory, we terminate the rollout at each step with probability $1 - \gamma$, so that the collected state-action pairs follow the discounted visitation distribution. In this way, we obtain a set of trajectories of varying lengths. Then, we use VAE to estimate $d^{\pi_\theta}$ based on these samples (cf. Line 6 of Algorithm 1).
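The geometric-termination rollout above can be sketched as follows; `env_step`, `reset` and `policy` are assumed interfaces for illustration, not part of the original algorithm:

```python
import numpy as np

def collect_discounted_samples(env_step, reset, policy, gamma, n_traj, rng):
    """Collect (s, a) samples whose marginal distribution matches the
    discounted visitation distribution d^pi: after recording each step,
    terminate the rollout with probability 1 - gamma."""
    samples = []
    for _ in range(n_traj):
        s = reset()
        while True:
            a = policy(s)
            samples.append((s, a))
            if rng.random() < 1.0 - gamma:   # geometric termination
                break
            s = env_step(s, a)
    return samples
```

Trajectory lengths are geometric with mean $1/(1-\gamma)$, so the number of environment steps per iteration is easy to budget in advance.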
Value function. Instead of performing vanilla policy gradient, we update the policy using PPO, which is more robust and sample-efficient. However, the underlying reward function changes across iterations. This makes it hard to incrementally learn the value function that is used to reduce variance in PPO. We propose to train a value function network using relabeled off-policy data. In each iteration, we maintain a replay buffer that stores the trajectories of the last several iterations (cf. Line 5 of Algorithm 1). Next, we recalculate the reward for each state-action pair in the buffer with the latest density estimator, i.e., we assign $\hat{d}(s,a)^{\alpha-1}$ or $-\log \hat{d}(s,a)$ for each pair $(s, a)$, where $\hat{d}$ is the density estimate. Based on these rewards, we estimate the target value for each state using the truncated TD($\lambda$) estimator [sutton2018introduction], which balances bias and variance (see the details in Appendix Value function estimator). Then, we train the value function network $V_\psi$ (where $\psi$ is the parameter) to minimize the mean squared error with respect to the target values.
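The paper's exact truncated TD($\lambda$) estimator is given in its appendix; a standard backward-recursive form, which we assume here as a sketch, interpolates between one-step TD targets ($\lambda = 0$) and Monte Carlo returns bootstrapped at the truncation point ($\lambda = 1$):

```python
import numpy as np

def truncated_td_lambda_targets(rewards, values, gamma, lam):
    """TD(lambda) targets via the backward recursion
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    bootstrapping from V at the (truncated) end of the trajectory.

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (length T + 1).
    """
    T = len(rewards)
    targets = np.zeros(T)
    g = values[T]                       # bootstrap at truncation
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * g)
        targets[t] = g
    return targets
```

With the relabeled rewards from the density model plugged in as `rewards`, the resulting `targets` serve as regression targets for the value network.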
Policy network. In each iteration, we update the policy network to maximize the clipped surrogate objective function that is used in PPO, where the advantage is estimated using GAE [schulman2015high] and the learned value function $V_\psi$.
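The PPO clipped surrogate referred to above can be sketched as follows (a standard form of the objective, not the paper's exact equation):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate: mean of min(rho * A, clip(rho, 1-eps, 1+eps) * A),
    where rho = pi_new(a|s) / pi_old(a|s) is the importance ratio."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(adv, dtype=float)
    return np.mean(np.minimum(rho * adv,
                              np.clip(rho, 1.0 - eps, 1.0 + eps) * adv))
```

The clipping caps the incentive to move the policy far from the data-collecting policy, which is what makes the relabeled off-policy updates stable in practice.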
We conduct experiments to answer the following questions: 1) Does our exploration algorithm MaxRenyi empirically obtain near-optimal policies in terms of the entropy over the state-action space? 2) How does MaxRenyi perform compared with previous exploration algorithms in the exploration phase? 3) Does MaxRenyi lead to better performance in the planning phase compared with previous exploration algorithms? 4) Does our algorithm succeed in complex environments? The experiments are conducted five times with different random seeds. The lines in the figures indicate the average and the shaded areas or the error bars indicate the standard deviation. The detailed experiment settings and hyperparameters can be found in Appendix Experiment settings and hyperparameters.
Experiments on Pendulum and FourRooms. To answer the first question, we compare the entropy induced by the agent during training in MaxRenyi with the maximum entropy the agent can achieve. We implement MaxRenyi for two simple environments, Pendulum and FourRooms, where we can solve for the maximum entropy by brute-force search. Pendulum from OpenAI Gym [1606.01540] has a continuous state-action space. For this task, the entropy is estimated by discretizing the state space into grids and the action space into two discrete actions. FourRooms is a grid-world environment with four actions and deterministic transitions. We show the results in Figure 3. The results indicate that our exploration algorithm approaches the optimum in terms of the corresponding state-action space entropy for different $\alpha$, in both discrete and continuous settings.
Experiments on MultiRooms. To answer the second and the third questions, we implement our algorithm for the MultiRooms environment from minigrid [gym_minigrid], which is hard to explore for standard RL algorithms. In the exploration phase, the agent has to navigate and open doors to explore through a series of rooms. In the planning phase, we randomly assign a goal state to one of the grids, reward the agent if it reaches that state, and then train a policy with this reward function. There are four actions: turning left/right, moving forward and opening the door. The observation is the first-person perspective of the agent, which is high-dimensional and partially observable (cf. Figure 4a).
To answer the second question, we compare our exploration algorithm MaxRenyi with RND [burda2018exploration] (which uses the prediction error as the intrinsic reward), MaxEnt [hazan2018provably] (which maximizes the state space entropy) and an ablated version of our algorithm, MaxRenyi(VPG) (which updates the policy directly by vanilla policy gradient using (2) and (3)). We show the performance of the different algorithms in the exploration phase in Figure 4b. First, we observe that using PPO to update the policy performs better than the ablated VPG version, indicating that MaxRenyi benefits from the variance reduction of a value function. Second, we see that MaxRenyi is more sample-efficient than MaxEnt, which invokes multiple runs of a standard RL algorithm. Also, MaxRenyi is more stable than RND, which explores well at the start but later degenerates as the agent becomes familiar with all the states.
To answer the third question, we collect datasets of different sizes by executing different exploratory policies, and use the datasets to compute policies with different reward functions using BCQ. We show the results in Figure 4c. First, we observe that the datasets generated by running random policies do not lead to successful planning, indicating the importance of learning a good exploratory policy in this framework. Second, a dataset with only 8k samples leads to successful planning (with a cumulative reward larger than 0.8) using MaxRenyi, whereas a dataset with 16k samples is needed to succeed in the planning phase when using MaxEnt or RND. This illustrates that MaxRenyi leads to better performance in the planning phase (i.e., it attains good policies with fewer samples) than the previous exploration algorithms.
Experiments on Montezuma's Revenge. To answer the last question, we implement our algorithm for Montezuma's Revenge [machado18arcade], a video game with high-dimensional observations and discrete actions. We show the result for the exploration phase in Figure 5a. We observe that our algorithm with different values of $\alpha$ successfully explores 10 to 20 rooms within 0.2 billion steps and performs better than RND. Then, we collect a dataset with 100 million samples by executing an obtained exploratory policy. We design two sparse reward functions that only reward the agent if it goes through Room 3 or Room 7 (to the next room). We show the trajectories of the policies planned with the two reward functions in red and blue respectively in Figure 5b. We see that, although the reward functions are sparse, the agent chooses the correct path (e.g., opening the correct door in Room 1 with the only key) and successfully reaches the specified room. This indicates that our algorithm generates good policies for different reward functions based on offline planning in complex environments.
In this paper, we consider a zero-shot meta RL framework, which is useful for training agents with multiple reward functions or for designing the reward function offline. In this framework, an exploratory policy is learned by interacting with a reward-free environment in the exploration phase and generates a dataset. In the planning phase, when the reward function is specified, a policy is computed offline to maximize the corresponding cumulative reward using a batch RL algorithm based on the dataset. We propose a novel objective function, the Rényi entropy over the state-action space, for the exploration phase. We justify this objective theoretically and design a practical algorithm to optimize it. In the experiments, we show that our exploration algorithm is effective under this framework, while being more sample-efficient and more robust than the previous exploration algorithms.
The zero-shot meta RL framework studied in this paper may be useful for designing safe reinforcement learning agents. Under this framework, we can iteratively design the reward function and train the agent offline to avoid behaviors that are dangerous to the system. Furthermore, it is possible to have the trajectories collected by humans, which ensures safety during the overall training process. In this case, what constitutes a good trajectory dataset is an interesting research question, and our work may be seen as an attempt to answer this question.
The toy example in Figure 1
In this example, the initial state-action pair is fixed. Due to the simplicity of this toy example, we can solve for the policy that maximizes the Rényi entropy of the state-action distribution in closed form. The optimal policy is
where each entry represents the probability of choosing a given action on the corresponding state. The corresponding state-action distribution is
where each element represents the probability mass of the state-action distribution on the corresponding state-action pair. Using equation (1), we obtain the expected number of samples collected from this distribution until the dataset contains all the state-action pairs, which is 44.
The example in Figure 2
In this example, we consider a CMP with three state-action pairs. Specifically, there are two states $s_1$ and $s_2$. The state $s_1$ has only one action $a_1$. The state $s_2$ has two actions $a_1$ and $a_2$. Therefore, the state-action space is $\{(s_1, a_1), (s_2, a_1), (s_2, a_2)\}$ and the state space is $\{s_1, s_2\}$. The transition matrix of this CMP is
and the initial state action distribution of this CMP is
This example illustrates a property that holds widely in other CMPs: a good objective function should encourage the policy to visit the hard-to-reach states as frequently as possible, and Rényi entropy does a better job than Shannon entropy in this respect.
Proof of Theorem 3.1
We provide Theorem 3.1 to justify our objective that maximizes the entropy over the state-action space in the exploration phase for the zero-shot meta RL setting. In the following analysis, we consider the standard episodic and tabular case, as follows: the MDP is defined as $(\mathcal{S}, \mathcal{A}, H, P, r, \mu)$, where $\mathcal{S}$ is the state space with $|\mathcal{S}| = S$, $\mathcal{A}$ is the action space with $|\mathcal{A}| = A$, $H$ is the length of each episode, $P = \{P_h\}_{h=1}^{H}$ is the set of unknown transition matrices, $r = \{r_h\}_{h=1}^{H}$ is the set of reward functions and $\mu$ is the unknown initial state distribution. We denote by $P_h(s' \mid s, a)$ the transition probability from $(s, a)$ to $s'$ on the $h$-th step. We assume that the instant reward $r_h(s, a)$ for taking the action $a$ on the state $s$ on the $h$-th step is deterministic. A policy $\pi = \{\pi_h\}_{h=1}^{H}$ specifies the probability of choosing the actions on each state and at each step. We denote the set of all such policies as $\Pi$.
The agent with policy $\pi$ interacts with the MDP as follows: at the start of each episode, an initial state $s_1$ is sampled from the distribution $\mu$. Next, at each step $h = 1, \ldots, H$, the agent first observes the state $s_h$ and then selects an action $a_h$ from the distribution $\pi_h(\cdot \mid s_h)$. The environment returns the reward $r_h(s_h, a_h)$ and transits to the next state $s_{h+1}$ according to the probability $P_h(\cdot \mid s_h, a_h)$. The state-action distribution induced by $\pi$ on the $h$-th step is $d^\pi_h(s, a) = \Pr(s_h = s, a_h = a \mid \pi)$.
The state value function for a policy $\pi$ is defined as the expected cumulative reward from a given state onward. The objective is to find a policy that maximizes the cumulative reward $V^\pi = \mathbb{E}\left[\sum_{h=1}^{H} r_h(s_h, a_h) \mid \pi\right]$.
The proof goes through in a similar way to that of Jin et al. [jin2020reward].
Proof of Theorem 3.1.
Given the dataset specified in the theorem, a reward function and the level of accuracy , we use the planning algorithm shown in Algorithm 2 to obtain a near-optimal policy . The planning algorithm first estimates a transition matrix based on and then solves for a policy based on the estimated transition matrix and the reward function .
Define $\hat{V}^\pi$ as the value function of the policy $\pi$ on the MDP with the estimated transition dynamics $\hat{P}$. We first decompose the error into the following terms:
Then, the proof of the theorem goes as follows:
To bound the two evaluation error terms, we first present Lemma .4 to show that the policy which maximizes Rényi entropy is able to visit the state-action space reasonably uniformly, by leveraging the convexity of the feasible region in the state-action distribution space (Lemma .2). Then, with Lemma .4 and the concentration inequality in Lemma .5, we show that the two evaluation error terms can be bounded for any policy and any reward function (Lemma .7).
To bound the optimization error term, we use the natural policy gradient (NPG) algorithm as the APPROXIMATE-MDP-SOLVER in Algorithm 2 to solve for a near-optimal policy on the estimated and the given reward function . Finally, we apply the optimization bound for the NPG algorithm [agarwal2019optimality] to bound the optimization error term (Lemma .8).
Definition .1 (Feasible region).
The feasible region in the state-action distribution space on the $h$-th step is defined as $\mathcal{K}_h = \{d^\pi_h : \pi \in \Pi\} \subseteq \Delta_{K}$, where $K$ is the cardinality of the state-action space.
Hazan et al. [hazan2018provably] proved the convexity of such feasible regions in the infinite-horizon and discounted setting. For completeness, we provide a similar proof on the convexity of in the episodic setting.
[Convexity of $\mathcal{K}_h$] $\mathcal{K}_h$ is convex. Namely, for any $\pi_1, \pi_2 \in \Pi$ and $x \in [0, 1]$, denoting $d_1 = d^{\pi_1}_h$ and $d_2 = d^{\pi_2}_h$, there exists a policy $\pi \in \Pi$ such that $d^\pi_h = x d_1 + (1 - x) d_2$.
Proof of Lemma .2.
Define a mixture policy that first chooses from $\{\pi_1, \pi_2\}$ with probability $x$ and $1 - x$ respectively at the start of each episode, and then executes the chosen policy through the episode. Define the state distribution (or the state-action distribution) for this mixture policy on each step as $\tilde{d}_h$. Obviously, $\tilde{d}_h = x d_1 + (1 - x) d_2$. For any mixture policy, we can construct a (non-mixture) policy $\pi$ with $\pi_h(a \mid s) \propto \tilde{d}_h(s, a)$ such that $d^\pi_h = \tilde{d}_h$ [puterman2014markov]. In this way, we find a policy $\pi \in \Pi$ such that $d^\pi_h = x d_1 + (1 - x) d_2$. ∎
Similar to Jin et al. [jin2020reward], we define $\epsilon$-significance for state-action pairs on each step and show that the policy that maximizes Rényi entropy is able to reasonably uniformly visit the significant state-action pairs.
Definition .3 ($\epsilon$-significance).
A state-action pair $(s, a)$ on the $h$-th step is $\epsilon$-significant if there exists a policy $\pi$ such that the probability of reaching $(s, a)$ on the $h$-th step following the policy $\pi$ is greater than $\epsilon$, i.e., $d^\pi_h(s, a) > \epsilon$.
Recall the way we construct the dataset $\mathcal{D}$: with the set of policies $\Phi = \{\pi^{(h)}\}_{h=1}^{H}$, where $\pi^{(h)}$ maximizes the Rényi entropy of the state-action distribution on the $h$-th step, we first uniformly randomly choose a policy from $\Phi$ at the start of each episode, and then execute that policy through the episode. Note that there is a set of policies that maximize the Rényi entropy on the $h$-th step, since the policy on the subsequent steps does not affect the entropy on the $h$-th step. We denote the induced state-action distribution on the $h$-th step as $\bar{d}_h$, i.e., the dataset can be regarded as being sampled from $\bar{d}_h$ for each step $h$. Therefore, $\bar{d}_h = \frac{1}{H}\sum_{h'=1}^{H} d^{\pi^{(h')}}_h \geq \frac{1}{H} d^{\pi^{(h)}}_h$.
If the state-action pair $(s, a)$ is $\epsilon$-significant on the $h$-th step, then the lower bound (7) on $\bar{d}_h(s, a)$ holds.
Proof of Lemma .4.
For any $\epsilon$-significant pair $(s, a)$, consider a policy $\pi$ that reaches $(s, a)$ on the $h$-th step with probability greater than $\epsilon$. Denote $d_1 = d^{\pi^{(h)}}_h$ and $d_2 = d^{\pi}_h$. We can treat each distribution
as a vector with $K$ dimensions and use the first dimension to represent the probability on $(s, a)$. Denote $d_x = (1 - x) d_2 + x d_1$. Since $d_1, d_2 \in \mathcal{K}_h$, by Lemma .2, $d_x \in \mathcal{K}_h$. Since $H_\alpha$ is concave over the probability simplex when $\alpha \in [0, 1)$ and $d_1$ maximizes $H_\alpha$ over $\mathcal{K}_h$, $H_\alpha(d_x)$ will monotonically increase as we increase $x$ from 0 to 1, i.e.,
This is true when , i.e., when , we have
Since are non-negative, we have
Note that , we have
Since , we have . Using the fact that , we can obtain the inequality in (7). ∎
When , we have
Suppose $\hat{P}_h$ is the empirical transition matrix estimated from samples that are i.i.d. drawn from $\bar{d}_h$ and $P_h$ is the true transition matrix. Then, with probability at least $1 - \delta$, for any $h$ and any value function $V$, we have:
Proof of Lemma .5.
The lemma is proved in a similar way to Lemma C.2 in [jin2020reward], which uses a concentration inequality and a union bound. The proof differs in that we do not need to count the different events on the action space for the union bound, since the state-action distribution is given here (instead of only the state distribution). This results in a missing factor in the logarithm term compared with their lemma. ∎
Lemma .6 (Lemma E.15 in Dann et al. [dann2017unifying]).
For any two MDPs $M'$ and $M''$ with rewards $r'$ and $r''$ and transition probabilities $P'$ and $P''$, the difference in the values $V'$ and $V''$ with respect to the same policy $\pi$ is
Under the conditions of Theorem 3.1, with probability at least $1 - \delta$, for any reward function $r$ and any policy $\pi$, we have:
Proof of Lemma .7.
The proof is similar to that of Lemma 3.6 in Jin et al. [jin2020reward] but differs in several details. Therefore, we still provide the proof for completeness. The proof is based on the following observations: 1) the total contribution of all insignificant state-action pairs is small; 2) the estimated transition dynamics $\hat{P}$ is reasonably accurate for significant state-action pairs, due to the concentration inequality in Lemma .5.
Using Lemma .6 on the value functions $V^\pi$ and $\hat{V}^\pi$ with the true transition dynamics $P$ and the estimated transition dynamics $\hat{P}$ respectively, we have
Let $X_h$ be the set of $\epsilon$-significant state-action pairs on the $h$-th step. We have
By the definition of $\epsilon$-significant pairs, we have
As for the remaining term, by the Cauchy–Schwarz inequality and Lemma .4, we have
By Lemma .5, we have