Learning One Representation to Optimize All Rewards

03/14/2021
by Ahmed Touati et al.

We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from observations or an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning. The unsupervised FB loss is well-principled: if training is perfect, the policies obtained are provably optimal for any reward function. With imperfect training, the sub-optimality is proportional to the unsupervised approximation error. The FB representation learns long-range relationships between states and actions, via a predictive occupancy map, without having to synthesize states as in model-based approaches. This is a step towards learning controllable agents in arbitrary black-box stochastic environments. This approach compares well to goal-oriented RL algorithms on discrete and continuous mazes, pixel-based MsPacman, and the FetchReach virtual robot arm. We also illustrate how the agent can immediately adapt to new tasks beyond goal-oriented RL.
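To make the test-phase recipe concrete, the sketch below (hypothetical code, not the authors' implementation) illustrates the two steps the abstract describes: estimating a reward representation z from reward-labelled states as z = E[r(s) B(s)], and then acting greedily with respect to F(s, a, z)^T z using the pretrained forward and backward representations. The functions fb_forward and fb_backward, the embedding dimension, and the toy reward are illustrative assumptions standing in for the learned FB networks.

```python
import numpy as np

# Hypothetical stand-ins for the pretrained FB networks (illustrative only):
#   fb_forward(s, a, z) -> F(s, a, z), a d-dimensional vector
#   fb_backward(s)      -> B(s),       a d-dimensional vector
# In the paper these are neural networks trained with TD learning during the
# unsupervised phase; random linear maps are enough here to show the interface.
d, state_dim, n_actions = 16, 3, 4
rng = np.random.default_rng(0)
W_f = rng.normal(size=(d, state_dim + 1 + d))  # acts on [state, action, z]
W_b = rng.normal(size=(d, state_dim))          # acts on state

def fb_forward(s, a, z):
    return W_f @ np.concatenate([s, [float(a)], z])

def fb_backward(s):
    return W_b @ s

# Test phase, step 1: estimate the reward representation
#   z_r = E_{s ~ rho}[ r(s) B(s) ]
# from a batch of reward-labelled states, with no further training.
states = rng.normal(size=(128, state_dim))
rewards = (np.linalg.norm(states, axis=1) < 1.0).astype(float)  # toy "reach the origin" reward
z_r = np.mean(rewards[:, None] * np.stack([fb_backward(s) for s in states]), axis=0)

# Test phase, step 2: act greedily with respect to F(s, a, z_r)^T z_r,
# which the FB theory identifies with a (near-)optimal Q-function for that reward.
def policy(s):
    return int(np.argmax([fb_forward(s, a, z_r) @ z_r for a in range(n_actions)]))

print(policy(np.zeros(state_dim)))
```

In the goal-reaching special case the same estimate reduces, up to scaling, to B evaluated at the target state, which is why the policy can be pointed at a new goal with a single evaluation of B rather than any retraining or planning.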

Related research

06/19/2020: On Reward-Free Reinforcement Learning with Linear Function Approximation
Reward-free reinforcement learning (RL) is a framework which is suitable...

02/07/2020: Reward-Free Exploration for Reinforcement Learning
Exploration is widely regarded as one of the most challenging aspects of...

10/12/2021: Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation
We study the model-based reward-free reinforcement learning with linear ...

03/03/2021: Successor Feature Sets: Generalizing Successor Representations Across Policies
Successor-style representations have many advantages for reinforcement l...

01/18/2021: Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint
In reinforcement learning, temporal difference-based algorithms can be s...

11/01/1997: Dynamic Non-Bayesian Decision Making
The model of a non-Bayesian agent who faces a repeated game with incompl...

06/12/2019: Fast Task Inference with Variational Intrinsic Successor Features
It has been established that diverse behaviors spanning the controllable...
