Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

by Vihang P. Patil, et al.

Reinforcement learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks can often be hierarchically decomposed into sub-tasks. A step in the Q-function, where the expectation of the return increases, can be associated with solving a sub-task. RUDDER was introduced to identify these steps and redistribute reward to them, so that reward is given immediately when a sub-task is solved. Since the problem of delayed rewards is mitigated, learning is considerably sped up. However, for complex tasks, the exploration strategies currently deployed in RUDDER struggle to discover episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically the number of demonstrations is small, and RUDDER's LSTM model, being a deep learning method, does not learn well from so few examples. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model with a profile model that is obtained from multiple sequence alignment of demonstrations. As is known from bioinformatics, profile models can be constructed from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which considerably reduces the delay of rewards, thus speeding up learning. Align-RUDDER outperforms competitors on complex artificial tasks with delayed reward and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.
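The idea of reward redistribution described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the event labels, and the toy scoring function are all hypothetical. The sketch assumes that some score can be computed for every prefix of an episode (in Align-RUDDER this role would be played by the alignment score against the profile model; here a simple ordered sub-task counter stands in for it), and it gives each step a share of the episode return proportional to the score increase at that step.

```python
from typing import Callable, List, Sequence


def redistribute_reward(
    events: Sequence[str],
    episode_return: float,
    prefix_score: Callable[[Sequence[str]], float],
) -> List[float]:
    """Spread a delayed episode return over the time steps.

    Step t receives a share proportional to the score increase
    prefix_score(events[:t + 1]) - prefix_score(events[:t]),
    so the redistributed rewards sum to the original return.
    """
    if not events:
        return []
    # Score of every prefix, from the empty prefix to the full episode.
    scores = [prefix_score(events[:t]) for t in range(len(events) + 1)]
    diffs = [scores[t + 1] - scores[t] for t in range(len(events))]
    total = scores[-1] - scores[0]
    if total == 0:
        # No informative step found: leave the reward at the final step.
        return [0.0] * (len(events) - 1) + [episode_return]
    return [episode_return * d / total for d in diffs]


def make_subtask_score(subtasks: Sequence[str]) -> Callable[[Sequence[str]], float]:
    """Hypothetical stand-in for a profile model's alignment score.

    Counts how many sub-tasks from `subtasks` have been completed,
    in order, within the given prefix of events.
    """
    def score(prefix: Sequence[str]) -> float:
        done = 0
        for event in prefix:
            if done < len(subtasks) and event == subtasks[done]:
                done += 1
        return float(done)

    return score


if __name__ == "__main__":
    # Toy episode: the return of 6.0 is only observed at the very end.
    score = make_subtask_score(["key", "door", "goal"])
    episode = ["move", "key", "move", "door", "move", "goal"]
    print(redistribute_reward(episode, 6.0, score))
    # prints [0.0, 2.0, 0.0, 2.0, 0.0, 2.0]
```

In this toy episode the three steps that complete a sub-task each receive a reward of 2.0 immediately, instead of the agent receiving 6.0 only at the end. The total return is unchanged, which is the property that lets reward redistribution reduce the delay without altering what the episode is worth.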





Automata Guided Reinforcement Learning With Demonstrations

Tasks with complex temporal structures and long horizons pose a challeng...

Curiosity-Driven Multi-Criteria Hindsight Experience Replay

Dealing with sparse rewards is a longstanding challenge in reinforcement...

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

We propose a general and model-free approach for Reinforcement Learning ...

Policy Gradient from Demonstration and Curiosity

With reinforcement learning, an agent could learn complex behaviors from...

Playing hard exploration games by watching YouTube

Deep reinforcement learning methods traditionally struggle with tasks wh...

RUDDER: Return Decomposition for Delayed Rewards

We propose a novel reinforcement learning approach for finite Markov dec...

Making Efficient Use of Demonstrations to Solve Hard Exploration Problems

This paper introduces R2D3, an agent that makes efficient use of demonst...

Code Repositories


Code to reproduce results on toy tasks and companion blog for the paper.
