Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks

by   Jesus Bujalance Martin, et al.
MINES ParisTech

During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is to learn from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a 6 degrees-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithm (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, integrating into our method two improvements from previous works allows our approach to outperform all baselines.



page 6

page 8


Learning from demonstrations with SACR2: Soft Actor-Critic with Reward Relabeling

During recent years, deep reinforcement learning (DRL) has made successf...

GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving

Deep reinforcement learning (DRL) has been demonstrated to be effective ...

Self-Imitation Learning from Demonstrations

Despite the numerous breakthroughs achieved with Reinforcement Learning ...

Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations

Model-free deep reinforcement learning (RL) has demonstrated its superio...

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

We propose a general and model-free approach for Reinforcement Learning ...

Achieving Sample-Efficient and Online-Training-Safe Deep Reinforcement Learning with Base Controllers

Application of Deep Reinforcement Learning (DRL) algorithms in real-worl...

Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations

Learning agile skills is one of the main challenges in robotics. To this...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Despite having known great success, reinforcement-learning algorithms have yet to prove that they can consistently produce good results across different domains. Two of the main remaining challenges are reward shaping and sample efficiency. Reward shaping makes it very difficult to translate results from one task to another, and often relies on the intuition of the designer rather than a robust methodology. Sample-efficient algorithms are required to obtain faster and more reliable results, particularly in robotics where it is much harder to deploy an algorithm outside of simulation. Recent progress has enabled some deployment to real robots, but there are still issues to consider such as safety or human intervention (DBLP:journals/corr/abs-2104-11203). The recent trend towards more data-driven algorithms could be a solution to both of these problems. Indeed, additional data can alleviate the need for online data collection, and demonstration data can guide the agent to good behaviours along a simple task-agnostic reward function.

In this work we focus on off-policy reinforcement learning, and present a way to leverage offline data in the form of optimal demonstrations. Our method is based on the observation that, in hindsight, a successful episode of collected experience is in fact a demonstration, so it should receive the same treatment. In particular, we propose to add a reward bonus to all transitions coming from both demonstrations and successful episodes. Our approach provides a simple way of tying positive rewards and desired behaviour, without any task-specific reward shaping.

We focus on three manipulation tasks with sparse rewards: reaching a target, pushing a button, and flipping a switch. We instantiate our approach with Soft Actor-Critic (SAC) (haarnoja2018soft) and Deep Deterministic Policy Gradient (DDPG) (DBLP:journals/corr/LillicrapHPHETS15), and compare it to two other algorithms, DBLP:journals/corr/VecerikHSWPPHRL17 and nair2018overcoming, presented in detail in section 2. Just like ours, these methods are task-agnostic and generic, in the sense that they can be applied to any off-policy algorithm with some minor modifications.

The contributions of this paper are the following:

  • We introduce a new method that consists in giving a reward bonus to demonstrations and relabeling successful episodes as demonstrations. We test it across different tasks and algorithms, and show that it greatly improves upon the base algorithm. It also consistently solves the task without demonstrations.

  • We propose a new algorithm that integrates into our method two improvements from previous works. We test it on the reaching task with SAC and it outperforms all baselines.

2 Related work

Reinforcement Learning (RL) and Imitation Learning (IL) are the two most popular paradigms to solve decision-making tasks within machine learning. In classical RL, the data is collected during training via interactions with the environment, which provides a reward signal that we try to maximize. In IL, the data is generally collected before training, and it consists of expert trajectories of a behaviour that we wish to copy or imitate.

Learning for robotics. In this paper, we focus on general-purpose RL algorithms that can be applied to any problem. However, roboticists have for a long time pursued sample efficiency in order to deploy their algorithms in the real world. The recent RAPS from dalal2021accelerating alters the action space by manually specifying a library of parameterized robot action primitives, and the recent C2F-ARM from james2021coarse relies on point cloud inputs and a clever pre-processing pipeline to achieve impressive results from just a handful of demonstrations. Outside of RL, Transporters Networks (zeng2020transporter) exploit spatial symmetries and RGB-D sensors to solve a variety of tasks from just a few samples, and NTP (DBLP:journals/corr/abs-1710-01813) and the more recent NTG (huang2019neural) use clever representations to execute a hierarchy of movement primitives that solve complex sequential tasks from demonstrations. Many successful algorithms for robotics have come from model-based RL, since they can learn from task-agnostic data. Earlier works proposed dynamics models over latent representation spaces, which could be learnt from a reconstruction objective (finn2016deep), or directly from their ability to produce models that accurately explain the observed data (zhang2019solar). More recently, works such as finn2017deep or schmeckpeper2020learning successfully learnt a model directly in image space. More generally, the recent trend of offline RL has given algorithms the ability to leverage huge amounts of data, expert or not, such as in MT-Opt (kalashnikov2021mt).

Relabeling past experience. Since off-policy RL algorithms can theoretically use data coming from any policy, a natural idea was to share data between tasks in a multi-task setting. An even better idea came in HER (NIPS2017_453fadbd), where the authors pointed out that if we accidentally solve one task when trying to perform another task, that experience is still optimal if we relabel the goal that was initially intended. Similar and more general works followed, such as GCSL (ghosh2019learning), Generalized Hindsight (li2020generalized), and HIPI (eysenbach2020rewriting), which reframes the relabeling problem as inverse RL. RCP (DBLP:journals/corr/abs-1912-13465) extended the idea to the single-task setting, by learning a policy conditioned on the trajectory return.

Learning from demonstrations.

The most straight-forward variant of IL is behaviour cloning (BC), where we directly look for a policy that acts like the expert by solving a supervised learning problem. Despite its simplicity, BC has been successful in a variety of complex applications such as autonomous driving

(bojarski2016end). More robust algorithms have been proposed, like Dagger from ross2011reduction, which does however require that the expert is available during training, or the recent Implicit BC (florence2021implicit)

, achieving impressive results with energy-based models. Another variant of IL is inverse RL (IRL), where we try to infer the reward function that the expert was most likely trying to maximize, while optionally jointly learning a policy. IRL algorithms benefit from online data collection, but assume that the reward signal from the environment is unknown. The most recent IRL algorithms are based on adversarial optimization, such as GCL from

finn2016guided, which presents a similar idea to GAIL from ho2016generative.

Self-Imitation Learning. Having an expert to learn from can be very helpful, but such a luxury is not always available. Instead, we can apply IL methods to the experiences collected by the agent in the environment, to further supervise its behaviour. Oh2018SIL

introduced Self-Imitation Learning (SIL), where an additional loss function pushed the agent to imitate its own decisions in the past only when they resulted in larger returns than expected. Further works such as Self-Imitation Advantage Learning

(DBLP:journals/corr/abs-2012-11989) followed.

Learning from both demonstrations and reinforcement learning. Demonstrations can be used to design the reward, guide exploration, augment the training data, initialize policies, etc. In NAC (gao2018reinforcement), the demonstrations, which can be sub-optimal, are used as the only training data during the first iterations. In zhu2018reinforcement the demonstrations are used to augment the manually designed task reward with an imitation-based reward. In DAPG (DBLP:journals/corr/abs-1709-10087) the demonstrations are used twice: to pretrain with behavior cloning, and to augment the policy gradient equation. In DAC (liu2020demonstration), they introduce a novel objective based on an augmented reward, the larger the closer the policy to the expert policy. We will focus on three algorithms that can be applied to any continous-action off-policy algorithm with minor modifications.

DDPGfD and SACfD (DBLP:journals/corr/VecerikHSWPPHRL17) (originally DDPGfD based on DDPG) introduces three main ideas: transitions from demonstrations are added to the replay buffer, prioritized replay is used for sampling transitions (demonstration data is given a bonus to be sampled more often), and a mix of 1-step and n-step return losses is used.

DDPGBC and SACBC (nair2018overcoming) (has no name and originally based on DDPG) also introduces three main ideas: transitions from demonstrations are added to a separate additional replay buffer, an auxiliary behaviour cloning loss is applied to samples from this buffer, and some episodes are reset to a state sampled uniformly from a demonstration. The authors additionally present other ideas regarding the multi-goal setting which we won’t cover in this work.

SQIL (reddy2020sqil) is actually a pure IL algorithm, since the reward signal is supposed unknown, but we present it here since it also incorporates demonstration data into the replay buffer. The replay buffer is initially filled with demonstrations where the rewards are always , and new experiences collected by the agent are added with reward .

3 Background

Reinforcement learning. An agent interacts with an environment by performing an action and observing a feedback signal (reward ) and the new state of the environment. The goal is to find the policy (function mapping states to actions ) that maximizes the discounted cumulative reward (the parameter makes future rewards smaller):


The Q-function is defined as the reward-to-go from the state if we pick the action and then follow . This function obeys the Bellman equation:


Deep Deterministic Policy Gradient. The goal is to learn a deterministic policy . To train the critic, we can approximate the right-hand expectation of the Bellman equation with samples, set it equal to , and minimize the MSBE loss on a parameterized (in practice, two additional approximators and are used to compute the targets ):


To train the actor, simple gradient descent w.r.t. on the loss .

Soft Actor-Critic. In classic RL, the optimal policy is always deterministic under full observability, but stochastic policies have interesting properties: better exploration and robustness (due to wider coverage of states), and multi-modality. This new objective promotes stochasticity by maximizing the entropy of the policy ( is a hyper-parameter, and we omit for simplicity):


A new Q-function (slightly different) is derived and follows the soft Bellman equation:


Similar to DDPG, to train the critic we can approximate the right-hand expectation with samples, set it equal to , and minimize the MSBE loss on a parameterized (in practice, two approximators and are trained, with their respective target versions and ).

To train the actor , the actor loss is derived from the reparameterization trick to compute samples , where is some random noise:


4 Method

We propose SACR2 and DDPGR2, acronyms for SAC with Reward Relabeling and DDPG with Reward Relabeling respectively, a straight-forward method that could be implemented to any other off-policy reinforcement-learning algorithm with sparse rewards. Let be the sparse reward from the environment, the reward bonus of our method, and the average length of the task demonstrations. First, we add demonstration data to the buffer: the last transition of each expert trajectory is given the sparse reward , and the other transitions are given a reward equal to , typically smaller than . Then, as the agent explores the environment during training, every new successful episode is relabeled: the last transitions leading to the sparse reward are assigned a reward equal to .

Figure 1: Reward Relabeling for combined Reinforcement and Imitation Learning.

Reward bonus to demonstrations. The first part of our algorithm is most similar to SQIL (reddy2020sqil). Intuitively, it gives the agent an incentive to imitate the expert. Their paper shows theoretical connections between SQIL and regularized behaviour cloning. One important difference is that SQIL is a pure imitation-learning algorithm, while our method learns from both the reward bonuses and the reward from the environment. This also means that our reward bonus should be carefully tuned so that it has an impact without completely swallowing the environment reward . Intuitively, SQIL also gives an incentive to avoid states that weren’t in the demonstration data, which could potentially be harmful if those states led to successful behaviour.

Reward bonus to successful episodes. The relabeling part of our algorithm tries to mitigate this issue and is most similar to Self-Imitation Learning (SIL) (Oh2018SIL). In our method, the self-imitation is achieved by effectively treating successful episodes as if they were demonstrations. As a precautionary measure, we only relabel the last transitions leading to the sparse reward, where

is the average length of the demonstrations. Intuitively, we do not wish to reward the early transitions of overly long episodes since they probably didn’t help to the completion of the task.

In order to give a reward bonus to successful episodes, we need to wait for the episodes to end first. We follow the idea presented in HER (NIPS2017_453fadbd) to relabel past experiences and modify the transitions’ rewards. To the best of our knowledge, HIPI (eysenbach2020rewriting) is the only existing method to also modify the rewards this way. While their method relies on inverse RL to tackle the more general multi-task setting, ours is more straight-forward for the simpler single-task setting with sparse rewards.

Require: reward bonus, average length of demonstrations;
Initialize buffer with demonstrations, set reward for all non-final transitions;
Initialize empty episode;
while not converged do
       do off-policy RL update;
       if len(episode) == 0 then
             collect one episode;
             if episode is successful then
                   set for the last non-final transitions;
             end if
       end if
      pop a transition from the episode and add it to the buffer;
end while
Algorithm 1 Reward Relabeling for combined Reinforcement and Imitation Learning

5 Experimental setup

Figure 2: Snapshots from each task: reach target (left), flip switch (middle), push button (right).

We evaluate our method on three simulated tasks for a 6 degrees-of-freedom robot manipulator: reaching a ball, pushing a button, and flipping a switch. The target object (ball, button, switch) appears randomly at the beginning of each episode within the reach of the robot: the ball appears anywhere in the 3D space, the button is bound to the floor, and the switch is on a wall which appears randomly in the scene (see figure 2). The initial position of the target object changes after each episode, but it doesn’t move during an episode. The initial state of the robot is always the same, close to an upright position. An episode ends once the robot has completed the task, or after expiration (100 time-steps for the reaching task, and 150 for the other two). The reward is fully sparse, and is equal to if the robot solves the task and 0 otherwise. The state has 22 dimensions, 19 from the robot proprioceptive state (joint angles, joint speeds, gripper pose) and 3 from the task-related information (3D coordinates of the target object).

Simulator and demonstration data collection. The three tasks belong to the benchmark and learning environment RLBench from james2020rlbench, which is built on top of CoppeliaSim (rohmer2013v). The backend physics engine is the Bullet physics library (coumans2015bullet), and the expert demonstrations are provided by RLBench and rely on OMPL (sucan2012open) for motion planning.

Experiments. The basic SAC algorithm without demonstrations is able to solve these tasks (with close to 100

accuracy) on some runs. Our goal is to solve the tasks consistently and reduce the amount of training steps required to do so. We want to answer three questions: How does SACR2 perform? How to tune its hyperparameters? How does our method perform under rougher conditions (weaker base algorithm, few demonstrations available, no demonstrations available)?

In order to answer these questions, we compare our method SACR2 to a simple baseline SAC+Demo, which we define as SAC with demonstrations in the buffer, as well as SACfD (DBLP:journals/corr/VecerikHSWPPHRL17) and SACBC (nair2018overcoming). We apply the following modifications to all 4 algorithms:

  • For SAC+Demo, SACR2, and SACfD: Single buffer initially filled with 200 demonstrations, and kept thereafter at a ratio of demonstration data. After 130000 training iterations, around 800 more demonstrations have been added to the buffer. For SACBC: Two separate buffers, one exclusively filled with demonstrations (with a comparable amount).

  • We additionally put 1000 random interactions alongside the demonstrations in the buffer, and pre-train during 3000 iterations before collecting any data.

  • For SAC+Demo, SACR2, and SACBC: The data is sampled from the buffer according to prioritized experience replay (PER) (DBLP:journals/corr/SchaulQAS15). For SACfD: A modified version of PER which boosts the sampling of demonstration data is used.

  • The replay ratio on the collected data is set to 32. Since the batch size is set to 64, the agent takes two environment steps per training step.

  • L2 regularization losses on the weights of the critic and the actor.

On top of these changes, SACBC uses an auxiliary behaviour cloning loss and resets some episodes to demonstration states, SACfD uses an auxiliary n-step loss, and SACR2 uses the reward bonus presented in section 4.

6 Results

Figure 3:

Learning curves for different experiments on the three tasks. The results are smoothed with a rolling window of 100 episodes, and the standard error is computed on three random seeds.

Figure 4: For a given value of the reward bonus b, we plot the amount of training steps required for the average episode success to reach a certain threshold (requires a window of time-steps to be above the threshold). The results are smoothed with a rolling window of 100 episodes, and the standard error is computed on three random seeds.

The results (figures 2(a), 2(b), 2(c)) show that both SACR2 and SACfD greatly outperform our baseline (tied in push button, SACfD has an edge in reach target, and SACR2 has a small edge in flip switch). SACBC also improves the performance by a considerable margin in reach target, but surprisingly performs worse than the baseline on the other two tasks. Overall, the improvements are much less noticeable in the flip switch task. One possible explanation is that the quality of the demonstrations in this task is much worse, as the planner often struggled to find the shortest trajectory.

Impact of hyperparameter b. Figure 4 shows the impact of different values of . In the reach target task, going past a certain value (7 for SAC and 3 for DDPG) no longer improves performance and it even decreases it as the training becomes less stable (particularly towards the end). In the push button task, the plot suggests that we could have further increased the value of .

Impact of online relabeling. What happens if we skip the online relabeling in SACR2? We still give a reward bonus to the transitions coming from demonstrations, but we no longer relabel successful episodes. Figure 2(f) shows that, without relabeling, the learning process becomes unstable, probably due to the lack of consistency on the rewards once the agent has collected enough successful episodes. However, even without relabeling we can notice an initial boost during training, coming from the agent imitating the demonstrations more aggressively in the beginning.

Best algorithm. Since both SACBC and SACfD introduce multiple ideas, we carried on an ablation study in our previous work (martin2021learning) to evaluate them on isolation. We tested SACR2, four loss functions, two buffer configurations, pre-training on demonstrations, resetting some episodes to demonstration states, and modifying the sampling probabilities of the replay buffer. The results showed that the n-step loss from SACfD, our approach SACR2, and the behaviour cloning loss from SACBC, are the three methods with the greatest impact on performance.

In order to see if these methods stack together, we introduce SACR2*, which is defined as SACR2 with the addition of the behaviour cloning loss from SACBC and the n-step loss from SACfD. As shown in Figure 2(d), SACR2* outperforms all the other methods, and is particularly faster to reach an accuracy of , although other methods catch up later on and are equivalent towards the end.

From SAC to DDPG. We test our reward relabeling method with another base algorithm: DDPG. Figure 2(e) shows that DDPGR2 and DDPGfD have similar performances, while DDPG+Demo fails to make any progress on the task.

Low-data regime. The previous results were obtained with a large amount of demonstrations (), which is unrealistic for real-world applications. What happens if we limit the amount of demonstrations to just 100? Figure 2(g) shows that our baseline SAC+Demo struggles to solve the task on some runs. Compared to the high-data regime (figure 2(b)), SACBC suffers the largest drop in performance, while both SACR2 and SACfD remain very competitive in this regime.

Learning without demonstrations. One final interesting experiment is to see whether SACR2 can also improve the performance when no demonstrations are available, by just relabeling successful episodes. Figure 2(h) shows that SACR2 actually solves the task consistently, while SAC only solves it on 2 out of the 4 runs. However, over the runs where SAC solved the task, its sample efficiency was comparable to SACR2. Intuitively, SACR2 helps propagate the signal of the sparse reward by explicitly turning zero rewards into positive rewards, rather than entirely relying on bootstrapping rare future rewards.

7 Discussion and Conclusion

We propose Reward Relabelling for combined Reinforcement and Imitation Learning, a generic method that can be applied to any off-policy reinforcement-learning algorithm. It encourages two behaviours: imitate the expert demonstrations (if available), and imitate the past successful trajectories. From our experiments, our method SACR2 had comparable results to SACfD (DBLP:journals/corr/VecerikHSWPPHRL17) and outperformed SACBC (nair2018overcoming) and our baseline SAC+Demo. Although SACR2 and SACfD are completely different, the idea behind them is similar: the goal is to propagate the rare sparse reward quicker, either by looking ahead at a further horizon (SACfD) or by explicitly propagating the rewards (SACR2). Interestingly, we show that these methods stack together, as the best results were obtained with SACR2*, which combines SACR2 with the n-step loss from SACfD and the behaviour cloning loss from SACBC.

Regarding SACR2, further analysis needs to be done on the impact of the hyper-parameters and across different tasks, since our results didn’t shed much light on how to choose them. The main limitation of our method is that it assumes that the expert demonstrations are optimal: Other works like SACBC, NAC (gao2018reinforcement) and the more recent GRI (chekroun2021gri) are able to handle sub-optimal demonstrations, which is very useful for most tasks outside of robotics, and even for more complex robotics tasks where an optimal planner is not available. Also, our results come from rather simple tasks, and we don’t know how they would translate to more complex tasks.

Our results show that Reward Relabelling can greatly improve performance, even when no demonstrations are available and the only supervision comes from a sparse reward. Many improvements could be brought to the method, such as using a more principled value rather than a constant reward bonus, or implementing a decay or other strategy to switch the focus to the environment reward once it becomes more common. Finally, it would be interesting to test our method with a Q-learning-type base algorithm on a task with discrete actions.