Synthetic Returns for Long-Term Credit Assignment

02/24/2021
by David Raposo, et al.

Since the earliest days of reinforcement learning, the workhorse method for assigning credit to actions over time has been temporal-difference (TD) learning, which propagates credit backward timestep-by-timestep. This approach suffers when delays between actions and rewards are long and when intervening unrelated events contribute variance to long-term returns. We propose state-associative (SA) learning, where the agent learns associations between states and arbitrarily distant future rewards, then propagates credit directly between the two. In this work, we use SA-learning to model the contribution of past states to the current reward. With this model we can predict each state's contribution to the far future, a quantity we call "synthetic returns". TD-learning can then be applied to select actions that maximize these synthetic returns (SRs). We demonstrate the effectiveness of augmenting agents with SRs across a range of tasks on which TD-learning alone fails. We show that the learned SRs are interpretable: they spike for states that occur after critical actions are taken. Finally, we show that our IMPALA-based SR agent solves Atari Skiing – a game with a lengthy reward delay that posed a major hurdle to deep-RL agents – 25 times faster than the published state-of-the-art.
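As a rough illustration of the idea in the abstract, the sketch below fits a tabular model in which the reward at each step is predicted from the summed contributions of earlier states, then reads those learned contributions off as synthetic returns. The functional form (a per-state gate `g`, contribution `c`, and baseline `b`), the tabular parameterization, the key/door toy task, and the `alpha` mixing weight are all assumptions made for this example; they are not taken from the abstract, and the paper's agent uses learned neural networks rather than lookup tables.

```python
# Minimal sketch, under stated assumptions: r_t ~ g[s_t] * sum_{k<t} c[s_k] + b[s_t]
# Toy task (hypothetical): picking up a key at t=0 leads to reward at the door.
NO_KEY, KEY, DISTRACTOR, DOOR = range(4)


def make_episode(has_key, n_distractors=3):
    """One episode: first state, a few distractor states, then the door."""
    first = KEY if has_key else NO_KEY
    states = [first] + [DISTRACTOR] * n_distractors + [DOOR]
    rewards = [0.0] * (len(states) - 1) + [1.0 if has_key else 0.0]
    return states, rewards


def train(episodes, n_states, lr=0.05, epochs=500):
    """Fit c, g, b by online gradient descent on squared reward-prediction error."""
    c = [0.0] * n_states  # contribution of a past state to later rewards
    g = [1.0] * n_states  # gate: how much the current state "listens" to the past
    b = [0.0] * n_states  # baseline reward prediction for the current state
    for _ in range(epochs):
        for states, rewards in episodes:
            counts = [0] * n_states  # how often each state occurred before t
            for s_t, r_t in zip(states, rewards):
                past = sum(c[s] * n for s, n in enumerate(counts))
                gate = g[s_t]  # keep the pre-update gate for the c gradient
                err = gate * past + b[s_t] - r_t
                b[s_t] -= lr * err
                g[s_t] -= lr * err * past
                for s, n in enumerate(counts):
                    c[s] -= lr * err * gate * n
                counts[s_t] += 1
    return c, g, b


def augment(states, rewards, c, alpha=1.0):
    """Add the synthetic return c[s_t] to each reward before running TD-learning."""
    return [r + alpha * c[s] for s, r in zip(states, rewards)]


episodes = [make_episode(True), make_episode(False)]
c, g, b = train(episodes, n_states=4)
print("contribution of KEY vs NO_KEY:", c[KEY], c[NO_KEY])
print("augmented rewards (key episode):", augment(*make_episode(True), c))
```

In this toy setting the learned contribution of the key state ends up larger than that of the no-key state, so the augmented reward gives a TD learner an immediate signal at the step that actually caused the distant reward.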

research
06/29/2023

Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis

To make reinforcement learning more sample efficient, we need better cre...
research
10/06/2019

Probabilistic Successor Representations with Kalman Temporal Differences

The effectiveness of Reinforcement Learning (RL) depends on an animal's ...
research
09/11/2018

Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

Learning long-term dependencies in extended temporal sequences requires ...
research
10/12/2022

Contrastive introspection (ConSpec) to rapidly identify invariant steps for success

Reinforcement learning (RL) algorithms have achieved notable success in ...
research
10/15/2018

Optimizing Agent Behavior over Long Time Scales by Transporting Value

Humans spend a remarkable fraction of waking life engaged in acts of "me...
research
10/23/2020

Learning Guidance Rewards with Trajectory-space Smoothing

Long-term temporal credit assignment is an important challenge in deep r...
research
02/04/2014

Short-term plasticity as cause-effect hypothesis testing in distal reward learning

Asynchrony, overlaps and delays in sensory-motor signals introduce ambig...
