1 Introduction
While reinforcement learning (RL) has shown remarkable success in many challenging domains Mnih et al. (2015); Silver et al. (2016); Schrittwieser et al. (2020), tasks with sparse rewards and long horizons remain extremely difficult to solve. In such tasks, a positive reward is only encountered after the RL agent reaches the goal at the end of a long sequence of actions, meaning that it cannot learn any useful signal until this occurs (and the agent typically must reach the goal several times to learn reliably). Furthermore, the learning signal decreases exponentially with the horizon, which, combined with slow gradient-based updates, can lead to catastrophic forgetting even after the agent learns to reach the goal.
Expert demonstrations can help RL agents solve difficult tasks Rajeswaran et al. (2018); Vecerik et al. (2017); Nair et al. (2018). These demonstrations can be used in the supervised setting where the agent imitates the expert’s behavior, termed imitation learning (IL). However, naive behavior cloning (BC) of the expert’s trajectories suffers from covariate shift: the agent’s policy drifts away from the expert’s, which leads to compounding errors due to RL’s sequential nature. Furthermore, the distribution of states given in the demonstrations often has low-dimensional support with respect to the entire state space. As such, the agent cannot extrapolate correctly when outside of the demonstration data. Many approaches to solve this issue have been proposed Ross et al. (2011); Sun et al. (2017); Laskey et al. (2017), but such approaches require an interactive expert to query the correct actions.
Another way to use demonstrations is to combine imitation learning with reinforcement learning in a form of learning from demonstrations (LfD) Schaal and others (1997); Kim et al. (2013); Hester et al. (2018). In this case, demonstrations do not simply act as supervised labels and can guide the agent’s exploration, and also act as augmentations to good data samples. These LfD approaches either use demonstrations to pretrain the policy Schaal and others (1997); Hester et al. (2018), use an auxiliary imitation loss in conjunction with the policy update Rajeswaran et al. (2018); Nair et al. (2018), or modify the reward function such that the agent is rewarded when it imitates the demonstrations Zhu et al. (2018); Reddy et al. (2020). However, these methods require interactions with the environment whereas we do not assume such access in our setting.
One primary concern with relying on demonstrations is that they are costly to obtain, especially in real-world applications. Requiring an operator to provide in-the-loop corrections to handle covariate shift is often prohibitive as well. Given only a few offline trajectories demonstrating successful task completion, an agent ought to be able to replicate the behavior from similar starting conditions, even if there are small perturbations along the way, and correct itself when necessary. We are primarily interested in this setting. In this work, we aim to minimize the number of demonstrations necessary for sparse-reward tasks while preserving successful task completion. To tackle covariate shift, we seek an approach that is robust in the sense of Figures 0(a) and 0(b): the agent is only trained with a few demonstrations from a single start state, but at evaluation time, it must generalize its behavior to a variety of unseen start states. If the agent can successfully complete the task from states adjacent to the demonstrated trajectory, then it can recover from small deviations from the demonstration.
Inspired by the notion of “funnels” in robotics (Mason, 1985) and feedback control (Burridge et al., 1999; Majumdar and Tedrake, 2017), we introduce a reverse-time generative model that can generate possible paths leading the agent back onto the demonstrations. These reverse rollouts provide useful information because every rollout ends within the support of the demonstration data. Assuming that the demonstrations lead to the goal, imitating both these generated reverse rollouts and the original demonstration data allows the agent to reach the goal from more starting conditions, including unseen ones. As illustrated in Figures 0(c) and 0(d), typical behavior-cloning (BC) methods focus learning on the small number of demonstrated states, whereas our proposed approach, Backwards Model-based Imitation Learning (BMIL), uses the reverse rollouts to learn a wider region of attraction around the demonstration. We validate our approach on a number of long-horizon, sparse-reward continuous-control tasks. Even from only a handful of demonstrations, BMIL provides a significant increase in the region of attraction and robustness on many domains compared to BC, or to using a forward dynamics model.
We summarize our contributions as follows:

- We propose an imitation learning method that pairs a backwards dynamics model with a policy and trains on both demonstrations and imagined model rollouts.
- In the restrictive setting of an offline expert and no access to environment interactions, we show that a backwards model can improve robustness over behavior cloning.
- Our experiments on a variety of long-horizon, sparse-reward domains demonstrate that BMIL can noticeably extend the region of attraction around the demonstration data, even when trained on very small subsets of the state space.
2 Related Work
Imitation learning has a long history (Pomerleau, 1989; Schaal and others, 1997) and is well-studied, as documented in various surveys (Argall et al., 2009; Osa et al., 2018; Hussein et al., 2017). The challenges of covariate shift and compounding errors are also well-known (Pomerleau, 1989; Ross and Bagnell, 2010). Most solutions involve on-policy imitation learning, where environment interaction and interactive querying of the expert allow for the agent’s distribution to match that of the expert (Ross et al., 2011; Sun et al., 2017). More closely related to our approach are methods that modify or augment the demonstrations to increase robustness. Laskey et al. (2017) inject noise into the supervisor’s policy during training to force the demonstrator to provide corrections. Luo et al. (2020) learn a dynamics model from demonstrations to conservatively extrapolate a value function that encourages the agent to return to the expert data distribution. Generative approaches have also been used in imitation learning (Ho and Ermon, 2016; Wang et al., 2017), but their focus is not on robustly following a few goal-reaching demonstrations.
Time-reversibility has been explored in RL, often as a form of regularization (Thodoroff et al., 2018; Nair et al., 2020; Zhang et al., 2020; Rahaman et al., 2020; Satija et al., 2020). Reverse-time dynamics models, also called predecessor models, have also been used in RL (Edwards et al., 2018; Goyal et al., 2019; Schroecker et al., 2019; Lai et al., 2020; Lee et al., 2020; Yu et al., 2021; Grinsztajn et al., 2021). However, in all cases, the reverse-time dynamics model is used as either an alternative to the forward-time dynamics model or as an auxiliary model in addition to the standard forward-time dynamics model, in order to mitigate model-compounding errors. The result is that the reverse-time dynamics model can accelerate RL and enable greater sample-efficiency. In this work, we take a different perspective where the reverse-time dynamics model is used to generate possible, unseen paths that can lead back to the expert demonstration and thus to the goal, thereby improving robustness in following the demonstration. Our work is also similar to Wang et al. (2021), where a reverse-time dynamics model is used to generate possible trajectories; however, their focus is on offline RL, where the generated trajectories are used to connect distinct sets of states in the offline dataset.
3 Method
3.1 Preliminaries
We model the setting as a Markov Decision Process (MDP) with a continuous state space $\mathcal{S}$ and a continuous action space $\mathcal{A}$. The transition function $T(s_{t+1} \mid s_t, a_t)$ defines the distribution of the next state $s_{t+1}$ given the current state $s_t$ and the action $a_t$ taken at timestep $t$. The objective of imitation learning is to learn a policy $\pi_\theta$, parameterized by $\theta$, that matches the expert's policy $\pi^*$. We assume that the expert generates demonstrations $\mathcal{D}$ by rolling out its policy in the environment. Note that we consider the more restrictive case where demonstrations consist only of transitions $(s_t, a_t, s_{t+1})$ and not rewards, where $s_{t+1}$ is the next state.
Behavior cloning (BC) is a form of imitation learning where the policy $\pi_\theta$ learns to imitate expert actions via supervised learning. The policy is found by minimizing the negative log-likelihood over the demonstration data,
$$\min_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ -\log \pi_\theta(a \mid s) \right].$$
Note that BC does not require environment interactions and can be considered offline. Furthermore, expert behavior is inferred only from demonstrations, without access to the expert policy.
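As a concrete illustration of this objective, the sketch below trains a Gaussian policy on demonstration tuples by minimizing the negative log-likelihood. It is a minimal example assuming PyTorch and hypothetical module names, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """pi_theta(a | s): a diagonal Gaussian over actions, conditioned on the state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return Normal(self.mean(h), self.log_std.exp())

def bc_update(policy, optimizer, states, actions):
    """One behavior-cloning step: minimize -log pi_theta(a | s) over a demo batch."""
    dist = policy(states)
    loss = -dist.log_prob(actions).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```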
Compounding errors in behavior cloning
As pointed out in Ross and Bagnell (2010); Venkatraman et al. (2015), behavior cloning suffers from covariate shift, where errors in the policy can compound and lead the agent to states from which it cannot recover. Intuitively, this occurs because the states in the training data are a small subset of the entire state space, and it is difficult for the policy to learn the optimal action for states outside of the training data. During a rollout, once the policy makes an error and leaves the support of $\mathcal{D}$, it may encounter completely new states, leading to compounding errors. Furthermore, as the agent moves farther away from $\mathcal{D}$, there is very little hope of it taking the correct action and moving back onto the distribution of the training data.
3.2 Problem Setting
We consider the same setting as BC, where we do not assume access to the environment or the expert policy during training, and only expert demonstrations without rewards are given. Furthermore, we assume that the expert demonstrations are given in an MDP where the set of initial states $\mathcal{S}_0$ and the set of goals $\mathcal{G}$ are both very small subsets of the entire state space. An example of this scenario is a maze environment where the agent starts from the same initial state and tries to reach a fixed goal. Formally, we define the demonstration trajectories $\tau = (s_0, a_0, s_1, \dots, s_T)$ as coming from a probability density
$$p_{\pi^*}(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi^*(a_t \mid s_t)\, T(s_{t+1} \mid s_t, a_t),$$
where $\rho_0$ is the initial state distribution, $s_0 \in \mathcal{S}_0$, and $s_T \in \mathcal{G}$. In our experiments, $\mathcal{S}_0$ and $\mathcal{G}$ consist of a single start or goal state and/or the $\epsilon$-ball of its neighborhood. As such, only a few demonstrations are required to learn a stable optimal policy. Note that this is different from domains in previous work Rajeswaran et al. (2018); Fu et al. (2020), which consider random goal states and therefore require many more demonstrations to learn optimal policies. We also assume that the expert policy is optimal in the sense that all demonstrations successfully reach a goal. While this is not strictly necessary in our method, our setting does not include rewards in $\mathcal{D}$ and thus we cannot discern whether demonstrations are optimal. This allows us to ignore the issue of modifying rewards as done in several offline RL algorithms (Fujimoto et al., 2019; Kumar et al., 2020). In order to use task completion success rates as an evaluation metric in our experiments, we consider only optimal demonstrations.
Our objective is to learn a policy that is robust to policy errors when imitating expert behavior and can learn to reach the goal from a variety of initial states. This is different from the multi-goal or multi-task setting, where the agent learns to solve multiple goals or tasks, usually from a small number of initial states. More formally, the robustness of the policy is defined as the expected task success
$$\mathbb{E}_{s_0 \sim \rho_0'} \left[ \mathbb{1}\{\pi_\theta \text{ reaches } \mathcal{G} \text{ from } s_0\} \right],$$
where the expectation is taken over an initial-state distribution $\rho_0'$ whose support is a strict superset of $\mathcal{S}_0$. Note that our measure of robustness is somewhat coarse, in that we do not consider the shortest path to reach the goal from every start state (which would probably require more information, such as diverse trajectories or environment interactions). Instead, we seek to extend the region of attraction around $\mathcal{D}$ such that the learned policy can still reach the goal.
As we consider continuous states in our work, we measure the robustness using samples from , where we randomize either some or all of the state dimensions. For example, in robotic manipulation domains, we vary the position of the gripper as we are primarily concerned with being able to learn robustness from a variety of different starting positions. In other domains, we are interested in the policy’s ability to recover from arbitrary initial states and so we vary not only the agent’s starting position, but also the initial joint positions and velocities by adding uniformly random noise.
Throughout, we assume that for any transition, there exists an action that allows the agent to move back towards the previous state $s_t$ when in the state $s_{t+1}$. This is true for many navigation and physics-based domains if we ignore rare circumstances such as irrecoverable unsafe states or the breakdown of the agent. We exclude such cases and assume that state transitions are reversible. We discuss some possible ways to incorporate irrecoverable states in Section 6.
3.3 Backwards Model-based Imitation Learning
In our work, we use a backwards dynamics model to provide more synthetic training data to the policy and therefore increase the policy’s robustness. We call our method backwards model-based imitation learning or BMIL.
Backwards model
The backwards model is a probabilistic generative model defined as $q_\phi(s_t, a_t \mid s_{t+1})$, which estimates the conditional distribution of the reverse-time dynamics: it takes in the next state and outputs the previous state and previous action. As we consider only continuous state and action spaces, we implement $q_\phi$ as a conditional Gaussian, parameterized by $\phi$. The backwards model is decomposed into two functions, $q_\phi(s_t, a_t \mid s_{t+1}) = q_\phi(s_t \mid a_t, s_{t+1})\, q_\phi(a_t \mid s_{t+1})$: an action generator and a previous state generator. The action generator predicts which action was taken in order to land in the next state. There may be several such actions, from different states, that can lead to the next state; thus the action generator implicitly encodes a backwards policy. It is important for this backwards policy to closely match the learned forward policy, yet be different enough to generate diverse new rollouts for the policy to train on. The previous state generator predicts the previous state given the next state and the previous action taken. The goal of this generator is to accurately predict the backwards dynamics.
As we consider the setting with no access to the expert or the environment, $q_\phi$ is trained only on $\mathcal{D}$. The action generator and previous state generator are jointly trained by maximum likelihood,
$$\max_\phi \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}} \left[ \log q_\phi(s_t \mid a_t, s_{t+1}) + \log q_\phi(a_t \mid s_{t+1}) \right], \qquad (1)$$
where $s_t, a_t, s_{t+1}$ are the state, action, and next state, respectively, at timestep $t$.
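To make the factorization and Eqn. 1 concrete, the following sketch implements the action generator $q_\phi(a_t \mid s_{t+1})$ and the previous state generator $q_\phi(s_t \mid a_t, s_{t+1})$ as two conditional diagonal Gaussians trained jointly by maximum likelihood. This is a minimal, hypothetical PyTorch implementation (architecture and names are assumptions), not the authors' exact code.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

def mlp(in_dim, out_dim, hidden=256):
    """Small MLP that outputs the mean of a diagonal Gaussian."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class BackwardsModel(nn.Module):
    """q_phi(s_t, a_t | s_{t+1}) = q_phi(s_t | a_t, s_{t+1}) * q_phi(a_t | s_{t+1})."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.action_gen = mlp(state_dim, action_dim)             # a_t given s_{t+1}
        self.state_gen = mlp(state_dim + action_dim, state_dim)  # s_t given (a_t, s_{t+1})
        self.action_log_std = nn.Parameter(torch.zeros(action_dim))
        self.state_log_std = nn.Parameter(torch.zeros(state_dim))

    def action_dist(self, next_state):
        return Normal(self.action_gen(next_state), self.action_log_std.exp())

    def state_dist(self, next_state, prev_action):
        x = torch.cat([next_state, prev_action], dim=-1)
        return Normal(self.state_gen(x), self.state_log_std.exp())

    def loss(self, state, action, next_state):
        """Negative of the joint log-likelihood in Eqn. 1 over demonstration tuples."""
        nll_action = -self.action_dist(next_state).log_prob(action).sum(-1)
        nll_state = -self.state_dist(next_state, action).log_prob(state).sum(-1)
        return (nll_action + nll_state).mean()
```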
Model rollouts
Given expert demonstrations $\mathcal{D}$, we use the backwards model to generate several possible short reverse rollouts, or traces, starting from every state in $\mathcal{D}$. As all of these traces end on states within the demonstration data, and as all demonstrations reach the goal, following these traces will eventually lead to the goal. For all states in $\mathcal{D}$, we generate traces in a time-reversed manner, where we start from the last state-action pair and then predict backwards for a fixed number of timesteps. These traces are collected into a buffer $\mathcal{B}$. As we assumed that there are no irrecoverable states in our setting, the rollouts reflect possible paths that the agent could have taken to reach $\mathcal{D}$. If the reverse-time model is accurate and the previous-action generator gives sufficiently diverse actions, the traces are then samples from the region of attraction, or “funnels,” around every state along the demonstration. As we assume all demonstrations reach the goal, these samples from the funnels can be used to learn a robust policy $\pi_\theta$, as it can follow the traces onto the optimal path.
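The trace-generation step can be sketched as follows: starting from each demonstration state, we repeatedly sample a previous action and a previous state from the backwards model and store the resulting state-action pairs. This is a simplified sketch building on the hypothetical `BackwardsModel` above (the first-action perturbation described next is omitted for clarity).

```python
import torch

@torch.no_grad()
def generate_traces(backwards_model, demo_states, horizon):
    """Roll the backwards model out for `horizon` steps from every demonstration state.

    Returns a list of (state, action) pairs forming the trace buffer B: each imagined
    previous state is paired with the action that, according to the model, leads from
    it back towards the demonstration, so the pairs can serve as imitation targets.
    """
    buffer = []
    for s_end in demo_states:                  # every trace ends on a state in D
        s = s_end.unsqueeze(0)
        for _ in range(horizon):
            a_prev = backwards_model.action_dist(s).sample()         # a_t | s_{t+1}
            s_prev = backwards_model.state_dist(s, a_prev).sample()  # s_t | a_t, s_{t+1}
            buffer.append((s_prev.squeeze(0), a_prev.squeeze(0)))
            s = s_prev                         # continue backwards in time
    return buffer
```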
Action selection strategy for $q_\phi$

As the backwards model is trained only on a limited number of expert demonstrations, it is likely that $q_\phi$ can only learn accurate reverse-time dynamics for states contained within or close to the demonstration data Xu et al. (2020). Thus, repeatedly rolling out $q_\phi$ would only generate traces whose state-action pairs are contained within $\mathcal{D}$ and would not help with learning robust policies. However, we would like to generate diverse traces with new, unseen state-action pairs in order to robustify the policy. To balance model misprediction accuracy against generating plausible state-action pairs, we perturb only the first action generated from $q_\phi$ and not the subsequent actions, and also use short horizon lengths for the traces. Note that we are essentially choosing a good action selection strategy for $q_\phi$. Let $a^*$ be the action that the expert would take. A good action selection strategy would place more probability mass close to the support of $a^*$, providing a “cover” of it but with a wider tail to provide diverse rollouts. As our backwards model is probabilistic (implemented as a conditional Gaussian with diagonal covariance), we can easily perturb the generated action by increasing the distribution's variance.

Let $\hat{a}$ be the previous action output by $q_\phi$. We consider two ways to generate the perturbed action: 1) simple scaling of the distribution's variance, and 2) resampling a new action by adding uniform noise, $\hat{a} + u$ with $u \sim \mathcal{U}(-\sigma, \sigma)$, where $\sigma$ is a fixed hyperparameter. For the scaling strategy, we further scale the variance by the entropy of the probability density, as we wish to make the distribution “wider” for more peaked distributions.
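The two perturbation strategies can be sketched as below, applied only to the first backwards step of each trace. The entropy-based widening and the uniform-noise resampling follow the description above, but the exact scaling formula and coefficient names are illustrative assumptions rather than the paper's precise definitions.

```python
import torch
from torch.distributions import Normal, Uniform

def perturbed_first_action(backwards_model, next_state, strategy="scale",
                           coeff=30.0, noise=0.3):
    """Sample the first previous action with extra spread to diversify traces."""
    dist = backwards_model.action_dist(next_state)   # diagonal Gaussian over a_t
    if strategy == "scale":
        # Widen the Gaussian; the widening grows as the distribution becomes
        # peakier (lower entropy), so narrow distributions are spread out more.
        widen = coeff * torch.exp(-dist.entropy().sum(-1, keepdim=True))
        wide_dist = Normal(dist.mean, dist.stddev * (1.0 + widen))
        return wide_dist.sample()
    if strategy == "resample":
        # Add uniform noise u ~ U(-noise, noise) to the sampled action.
        a = dist.sample()
        return a + Uniform(-noise, noise).sample(a.shape)
    return dist.sample()                              # no perturbation
```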
Algorithm

Our method BMIL is outlined in Algorithm 1. Given expert demonstrations $\mathcal{D}$ with tuples of the form $(s_t, a_t, s_{t+1})$, we train the backwards model using Eqn. 1 to estimate the reverse-time dynamics $q_\phi$. We train our policy on both the demonstration data $\mathcal{D}$ and the model traces $\mathcal{B}$ by sampling from both at a fixed ratio and using maximum likelihood,
$$\max_\theta \; \beta\, \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \log \pi_\theta(a \mid s) \right] + (1 - \beta)\, \mathbb{E}_{(s,a) \sim \mathcal{B}} \left[ \log \pi_\theta(a \mid s) \right], \qquad (2)$$
where $\beta$ is the probability of sampling from $\mathcal{D}$. As our aim is to learn a robust policy while still succeeding at the original start states and goals, we sample from the demonstrations at a higher ratio than from the model traces. Note that BMIL does not depend on the type of imitation learning policy. Any algorithm can be used as long as the policy can be trained with samples of the form $(s, a)$.
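Putting the pieces together, the training loop alternates backwards-model updates, trace generation, and policy updates on a β-mixture of demonstration and trace samples, mirroring Eqn. 2. This condensed sketch reuses the hypothetical helpers above and is not a faithful reproduction of Algorithm 1 or the released code.

```python
import random
import torch

def train_bmil(policy, backwards_model, demos, epochs, updates_per_epoch,
               beta=0.8, horizon=3, batch_size=64, lr=3e-4):
    """demos: list of (state, action, next_state) tensor tuples from the expert."""
    pi_opt = torch.optim.Adam(policy.parameters(), lr=lr)
    bm_opt = torch.optim.Adam(backwards_model.parameters(), lr=lr)
    s, a, s_next = (torch.stack(x) for x in zip(*demos))

    for _ in range(epochs):
        # 1) Fit the backwards model on the demonstrations (Eqn. 1).
        bm_loss = backwards_model.loss(s, a, s_next)
        bm_opt.zero_grad(); bm_loss.backward(); bm_opt.step()

        # 2) Generate short reverse traces that end on demonstration states.
        traces = generate_traces(backwards_model, s_next, horizon)
        demo_pairs = list(zip(s, a))

        # 3) Update the policy on a beta-mixture of demo and trace samples (Eqn. 2).
        for _ in range(updates_per_epoch):
            batch = [random.choice(demo_pairs) if random.random() < beta
                     else random.choice(traces) for _ in range(batch_size)]
            states = torch.stack([pair[0] for pair in batch])
            actions = torch.stack([pair[1] for pair in batch])
            bc_update(policy, pi_opt, states, actions)
```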
4 Experiment Design
4.1 Environments
We validate our approach on several continuous control domains: 1) the Fetch robotics environment Plappert et al. (2018), 2) maze navigation with two different agents, and 3) Adroit hand manipulation Rajeswaran et al. (2018). Figure 2 shows sample images of some environments. For the Fetch robotics environments, we consider the “Push” and “PickAndPlace” tasks, where the objective is to control a Fetch end effector to either push an object to the goal or pick up an object and place it at the target location. For the maze environments, we consider three mazes of increasing difficulty, where an agent must learn to move itself and then reach the goal; we use both a simple Point agent and a 29-DoF Ant agent. For the Adroit environment, we use the “Relocate” task, where one must control a 24-DoF Adroit hand to pick up a ball and move it to a target location. All domains use the MuJoCo simulator Todorov et al. (2012), for a total of nine distinct domains. All environments have sparse reward structures, where either every step has a constant negative reward until the goal is reached (Fetch) or only the goal has a non-zero reward (Maze, Adroit). In particular, the Maze and Adroit environments are quite challenging, as they both require controlling the agent's joints to perform locomotion (Maze) or dexterous manipulation (Adroit) over a long horizon. More detailed descriptions of each environment, including its observation space, are provided in Appendix A.
4.2 Demonstrations and Implementation Details
To generate demonstrations in the Fetch and Maze domains, we train an expert policy by adding the goal position to the state, as in goal-oriented learning, and use off-the-shelf RL algorithms (Raffin, 2020; Haarnoja et al., 2017). For the Adroit domain, we use a pre-trained policy from Rajeswaran et al. (2018). We use 5 demonstrations on the Push task and 10 on the PickAndPlace task, and 20 demonstrations for all Maze and Adroit environments.
For the policy, we use fully connected neural networks with ReLU activations. For the backwards model, we use 4-layer MLPs for both the action predictor and the previous state predictor, and use diagonal Gaussian distributions. To train the policy, we use a different demonstration-sampling ratio for the Fetch, Maze, and Adroit environments (see Appendix B); we find that higher ratios are necessary for longer-horizon and more complex domains. For the model rollouts, we use the variance scaling action selection strategy for the first action only and use increasing rollout lengths for all domains, similar to Janner et al. (2019). For a more detailed discussion of experiment details, see Appendix B. Our code for the modified environments, generating expert policies, and running all experiments is available at https://github.com/jypark0/bmil.

4.3 Evaluation
We evaluate BMIL against behavior cloning (BC) and VINS Luo et al. (2020). VINS specifically aims to learn value functions robust to perturbations using negative sampling and the induced policy learns self-corrective behavior. VINS was chosen as it is most relevant to our setting; other methods such as DART Laskey et al. (2017) or SQIL Reddy et al. (2020) require either an online expert or environment interactions.
We use the same number of demonstrations for all methods and also keep the same policy network architecture and the total number of policy gradient steps equal across all methods. We train both the policy and backwards model until the backwards model loss converges. Note that our goal is not to solve the training task faster but rather to robustify the policy using the backwards model. Additionally, we wish to solve the task at various starting conditions while still being able to succeed at the original initial start-goal states.
To evaluate the robustness of the learned policy, we vary the initial states and compute task success rates. For Fetch, we fix the initial gripper, object, and goal position during training and vary the gripper's position within the table boundaries during evaluation, using many sampled start positions. For Maze, we initialize the agent to a random start position within a discretized grid of the maze and also add random uniform noise to the agent joints' qpos and qvel. We sample several initial states for each discrete grid cell and compute the success rate; sampling multiple points per grid cell gives us an idea of which positions make it easy for the agent to reach the goal. Intuitively, such positions would be those near the goal and the demonstrated path. For Adroit, we generate random initial states by adding uniform noise to the qpos of the hand.
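As an illustration of this protocol, robustness can be estimated by resetting the environment to randomized initial states, rolling out the learned policy, and recording the fraction of episodes that reach the goal. The sketch below assumes a Gym-style environment with a hypothetical `reset_to(state)` method and a `success` flag in `info`; it is not the exact evaluation script.

```python
import torch

@torch.no_grad()
def evaluate_robustness(env, policy, initial_states, max_steps=500):
    """Fraction of episodes that reach the goal when started from perturbed states."""
    successes = 0
    for s0 in initial_states:
        obs = env.reset_to(s0)            # hypothetical: reset the env to a given state
        reached_goal = False
        for _ in range(max_steps):
            obs_t = torch.as_tensor(obs, dtype=torch.float32)
            action = policy(obs_t).mean.numpy()   # act with the policy mean
            obs, _, done, info = env.step(action)
            reached_goal = reached_goal or info.get("success", False)
            if done:
                break
        successes += int(reached_goal)
    return successes / len(initial_states)
```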
5 Results
Our experiments aim to answer the following questions: 1) how robust a policy does BMIL learn? and 2) which components of BMIL are important for improving robustness?
5.1 Robustness evaluation
| Domain | Task | BC | VINS | BMIL |
|---|---|---|---|---|
| Fetch | Push (5 demos) | 1 | 1.06 | 1.21 |
| Fetch | PickAndPlace (10 demos) | 1 | 0.84 | 4.31 |
| Maze (Point) | UMaze | 1 | 0.81 | 0.98 |
| Maze (Point) | Room5x11 | 1 | 0.47 | 1.05 |
| Maze (Point) | Corridor7x7 | 1 | 1.12 | 1.16 |
| Maze (Ant) | UMaze | 1 | 0.71 | 1.03 |
| Maze (Ant) | Room5x11 | 1 | 0.91 | 0.87 |
| Maze (Ant) | Corridor7x7 | 1 | 0.90 | 0.81 |
| Adroit | Relocate (20 demos) | 1 | 0.48 | 1.68 |

Table 1: Robustness of each method, shown relative to BC and averaged over evaluation trials. BMIL improves robustness considerably over BC in most environments.
The robustness results are shown in Table 1. We note that the absolute robustness percentages are generally low for all methods because of the difficulty of extrapolating from limited demonstrations with a single pair of initial start and goal states. We therefore also include the relative improvement over BC.
In the Fetch environments, BMIL substantially increases robustness over BC and VINS. In particular, our method achieves approximately 1.2x and 4.3x the robustness of BC on Push and PickAndPlace, respectively, whereas VINS performs similarly to BC. We see a similar pattern on the harder Adroit environment, where BMIL improves robustness over BC by roughly 1.7x. For the Maze environments, BMIL generally outperforms BC for the Point agent, while the robustness is decreased for the Ant agent. Somewhat surprisingly, BC performs quite well on the long-horizon Maze domains. It may be that BC has some built-in extrapolation capabilities or that the backwards model may need better latent representations with more powerful networks.
Empirically, we can see that having short reverse rollouts from the backwards model and using only slight perturbations still helps to increase robustness, even though the traces contain some model misprediction errors. We hypothesize that these traces do not necessarily need to be accurate in order to benefit the policy and simply need to be plausible paths that lead to the demonstrations. It may be that having the general correct direction contained in the traces is sufficient for the policy to eventually reach states in the demonstration data.
The success rates during training are shown in Table 5 in Appendix C.1. BMIL achieves success rates close to 100% for most domains, suggesting that increased robustness does not necessarily come at the cost of decreased performance on the demonstrated start-goal states. On the other hand, VINS cannot consistently succeed during training, even though its robustness is similar to BC's.
Visualization of robustness
Figure 3 shows which starting positions succeed during the robustness evaluation for Fetch; the green points correspond to successful episodes. We can see that both BC and BMIL succeed more frequently when starting near the demonstration data (approximately a straight line from the start (red) to the goal (blue)). However, BMIL learns a much larger region of attraction than BC and even succeeds at points that are much farther away from $\mathcal{D}$. We hypothesize that instead of perturbing a single state within $\mathcal{D}$ as done in VINS, learning a short reverse rollout from this state allows BMIL to learn optimal paths from states much farther away from $\mathcal{D}$, leading to higher robustness values. Figure 4 shows a similar visualization for some Point maze environments, where the agent's position is discretized into a grid. BMIL learns a region of attraction that is either slightly bigger than or similar in size to that of BC, but has a higher rate of success in each cell.
5.2 Additional Experiments
| | BC | Forward model | BMIL |
|---|---|---|---|
| Push (5 demos) | 1 | 1.03 | 1.21 |
| PickAndPlace (10 demos) | 1 | 1.03 | 4.31 |

Table 2: Robustness relative to BC on the Fetch environments when the policy is trained with rollouts from a forward dynamics model instead of the backwards model.
Forward vs Backward Dynamics
We first analyze the utility of a backwards versus a forward dynamics model. On the Fetch environments, we train BMIL with a forward dynamics model and compare against the original backwards model $q_\phi$. The forward model is implemented nearly identically to other n-step model-based RL algorithms (e.g., Janner et al. (2019)), with the exception of no environment interactions. As with the backwards model, we generate rollouts from the forward model starting from demonstrated states and train the policy on both the demonstrations and traces. To generate model rollouts from demonstration states, we use the action from the policy $\pi_\theta$. The total number of parameters is kept approximately constant for the forward and backwards models. As shown in Table 2, the forward model offers little to no benefit over BC for both Push and PickAndPlace, suggesting that the backwards model is required to produce a robust policy.
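For comparison, the forward-model baseline rolls out imagined trajectories forward in time from demonstration states, using actions sampled from the current policy. A minimal sketch (again with hypothetical helper names and a `state_dist(s, a)` interface assumed for the forward model) is:

```python
import torch

@torch.no_grad()
def generate_forward_traces(forward_model, policy, demo_states, horizon):
    """Roll a forward dynamics model out from demonstration states using policy actions."""
    buffer = []
    for s0 in demo_states:
        s = s0.unsqueeze(0)
        for _ in range(horizon):
            a = policy(s).sample()                            # action from current policy
            s_next = forward_model.state_dist(s, a).sample()  # s_{t+1} | s_t, a_t
            buffer.append((s.squeeze(0), a.squeeze(0)))
            s = s_next
    return buffer
```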
Action selection strategy
We test different action selection strategies for trace generation in Figure 4(a) (and Figure 10 in Appendix C.2). Compared to no perturbation (None), either action selection strategy improves robustness, as it can lead to more diverse trajectories outside the support of the demonstrations. We use the variance scaling strategy SC(30) for all Fetch experiments, as it was more stable than SC(50).
Number of demonstrations
We also study our method’s performance with varying numbers of demonstrations. As shown in Figure 4(b), both BC and BMIL improve in robustness with more demonstrations, but BC plateaus at a much lower level. On the other hand, BMIL requires slightly more demonstrations than BC needs to succeed during training, in order to train the backwards model (Figure 11 in Appendix C.2).
Computation budget
As BMIL trains both the policy and the backwards model, it requires more total gradient updates than BC. We therefore also train BC for more steps to match or exceed BMIL's computation budget on the Fetch domains, as shown in Figure 6. However, more training does not improve BC's robustness and, if anything, has a harmful effect.
Training model first and then the policy
As an offline method, BMIL does not require the backwards model to be trained in a single loop along with the policy: we can instead train the model fully first and then train the policy. We compare this model-first approach with the process outlined in Algorithm 1 and find no noticeable differences on the Fetch domains, as seen in Table 6.
6 Discussion
This work proposes a method to tackle the issue of covariate shift in imitation learning. We consider the restrictive setting where the expert is offline, so that its behavior can only be inferred from demonstrations, and where no additional environment interactions are available. Specifically, we show that pairing a generative backwards model with behavior cloning allows a policy to learn a wider region of attraction around the demonstration data. By rolling out imagined traces from states within the demonstration and perturbing actions to generate diverse traces, BMIL learns a wider funnel than naive BC. Through experiments on several long-horizon, sparse-reward, continuous control domains, we show that BMIL noticeably improves robustness when trained on a narrow set of initial start and goal states and evaluated from random starting positions.
There are many possible extensions for future work. BMIL does not necessarily preclude the use of image observations as we only assume that slightly perturbing an action will lead to new next states close to the original next state. However, to handle images, our approach likely requires an additional encoder and possibly more complex network architectures and augmentation techniques. Another interesting avenue could be to quantify how an increasing coverage of state space contained within the demonstration data affects robustness for both BC and BMIL. Finally, one could consider the setting of irrecoverable states and either resample rollouts containing such unsafe states or incorporate a measure of safety within the backwards model when generating model rollouts.
This material is based upon work supported by the National Science Foundation under Grant No. 2107256. This work was completed in part using the Discovery cluster, supported by Northeastern University’s Research Computing team.
References
- A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483. External Links: ISSN 0921-8890 Cited by: §2.
- Sequential composition of dynamically dexterous robot behaviors. The International Journal of Robotics Research 18 (6), pp. 534–555. Cited by: §1.
- Forward-backward reinforcement learning. External Links: 1803.10227 Cited by: §2.
- D4RL: datasets for deep data-driven reinforcement learning. External Links: 2004.07219 Cited by: §3.2.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §3.2.
- There is no turning back: a self-supervised approach for reversibility-aware reinforcement learning. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), Cited by: §2.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. Cited by: §B.1, §4.2.
- Deep q-learning from demonstrations. In AAAI, Cited by: §1.
- Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29, pp. . Cited by: §2.
- Imitation learning: a survey of learning methods. ACM Comput. Surv. 50 (2). External Links: ISSN 0360-0300 Cited by: §2.
- When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, Cited by: Table 3, §4.2, §5.2.
- Learning from limited demonstrations. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. . Cited by: §1.
- Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 1179–1191. Cited by: §3.2.
- Bidirectional model-based policy optimization. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 5618–5627. Cited by: §2.
- DART: noise injection for robust imitation learning. In 1st Annual Conference on Robot Learning, CoRL 2017, Mountain View, California, USA, November 13-15, 2017, Proceedings, Proceedings of Machine Learning Research, Vol. 78, pp. 143–156. Cited by: §1, §2, §4.3.
- Context-aware dynamics model for generalization in model-based reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 5757–5766. Cited by: §2.
- Learning self-correctable policies and value functions from demonstrations with negative sampling. In International Conference on Learning Representations, Cited by: §2, §4.3.
- Funnel libraries for real-time robust feedback motion planning. The International Journal of Robotics Research 36 (8), pp. 947–982. Cited by: §1.
- The mechanics of manipulation. In IEEE International Conference on Robotics and Automation, Cited by: §1.
- Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1.
- Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. Cited by: §1, §1.
- TRASS: time reversal as self-supervision. In 2020 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 115–121. Cited by: §2.
- Planning with goal-conditioned policies. In NeurIPS, pp. 14814–14825. Cited by: Appendix A.
- An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179. External Links: ISSN 1935-8253 Cited by: §2.
- Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv. Cited by: Appendix A, §4.1.
- ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, D. Touretzky (Ed.), Vol. 1, pp. . Cited by: §2.
- RL baselines3 zoo. GitHub. Note: https://github.com/DLR-RM/rl-baselines3-zoo Cited by: §B.1, §4.2.
- Learning the arrow of time for problems in reinforcement learning. In International Conference on Learning Representations, Cited by: §2.
- Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), Cited by: Appendix A, §B.1, §1, §1, §3.2, §4.1, §4.2.
- SQIL: imitation learning via reinforcement learning with sparse rewards. In International Conference on Learning Representations, Cited by: §1, §4.3.
- Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington (Eds.), Proceedings of Machine Learning Research, Vol. 9, Chia Laguna Resort, Sardinia, Italy, pp. 661–668. Cited by: §2, §3.1.
- A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 627–635. Cited by: §1, §2.
- Constrained Markov decision processes via backward value functions. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 8502–8511. Cited by: §2.
- Learning from demonstration. Advances in neural information processing systems, pp. 1040–1046. Cited by: §1, §2.
- Mastering atari, go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609. Cited by: §1.
- Generative predecessor models for sample-efficient imitation learning. In International Conference on Learning Representations, Cited by: §2.
- Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §1.
- Deeply AggreVaTeD: differentiable imitation learning for sequential prediction. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 3309–3318. Cited by: §1, §2.
- GNU parallel - the command-line power tool. ;login: The USENIX Magazine 36 (1), pp. 42–47. Cited by: §6.
- Temporal regularization for markov decision process. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . Cited by: §2.
- Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: Appendix A, §4.1.
- Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817. Cited by: §1.
- Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 3024–3030. External Links: ISBN 0262511290 Cited by: §3.1.
- Offline reinforcement learning with reverse model-based imagination. Advances in Neural Information Processing Systems 34, pp. 29420–29432. Cited by: §2.
- Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . Cited by: §2.
- How neural networks extrapolate: from feedforward to graph neural networks. arXiv preprint arXiv:2009.11848. Cited by: §3.3.
- PlayVirtual: augmenting cycle-consistent virtual trajectories for reinforcement learning. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), Cited by: §2.
- Learning retrospective knowledge with reverse reinforcement learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 19976–19987. Cited by: §2.
- Reinforcement and imitation learning for diverse visuomotor skills. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. Cited by: §1.
Checklist

- For all authors…
  - Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
  - Did you describe the limitations of your work?
  - Did you discuss any potential negative societal impacts of your work?
  - Have you read the ethics review guidelines and ensured that your paper conforms to them?
- If you are including theoretical results…
  - Did you state the full set of assumptions of all theoretical results?
  - Did you include complete proofs of all theoretical results?
- If you ran experiments…
  - Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Descriptions of the implementation are provided in the main text and supplementary material, and the code and data are released as a public repository.
  - Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Briefly in Section 4, and in full in the supplementary material.
  - Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 5.
  - Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See supplementary material.
- If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
  - If your work uses existing assets, did you cite the creators?
  - Did you mention the license of the assets?
  - Did you include any new assets either in the supplemental material or as a URL? See Section 4.2 for the code repository URL.
  - Did you discuss whether and how consent was obtained from people whose data you’re using/curating?
  - Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
- If you used crowdsourcing or conducted research with human subjects…
  - Did you include the full text of instructions given to participants and screenshots, if applicable?
  - Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
  - Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
Appendix A Environments
Fetch
These environments from Plappert et al. [2018] involve controlling the Fetch robotic arm. We consider two tasks, “Push” and “PickAndPlace”, as shown in Figure 7. The objective of “Push” is to push the object on a table towards a fixed goal position using a closed gripper. The objective of “PickAndPlace” is to pick up the object by controlling the gripper and move it towards the goal. The observation space is 25-dimensional and consists of the end effector coordinates and its linear velocity, the gripper’s position and velocity, and the object’s pose, velocities, and its relative position/velocity to the gripper. All Fetch environments have a fixed episode length, and an episode is considered successful if the object is at the goal at the end of the episode.
Maze
There are three maze environments of increasing difficulty, with two different agents (Point, Ant) for each environment (cf. Figure 8). The initial start and goal positions are fixed, and the objective is to control the agent to reach the goal (colored in red). The observation space consists of the agent’s joint positions/velocities, the current timestep, and the agent’s current Cartesian position. We use the MuJoCo Todorov et al. [2012] simulator and adjust the gear ratio for Ant to prevent it from falling over, as in Nasiriany et al. [2019].
Adroit
In this domain from Rajeswaran et al. [2018], the goal is to control a 24-DoF Adroit hand to pick up a ball from the table and move it to a target location. The agent must manipulate each finger and wrist joint with dexterity to grasp the object correctly and learn the correct ball and target positions. An observation consists of hand joint angles, the object position/orientation, and the target position/orientation. We modify the original domain by fixing the initial qpos of the hand and by terminating the episode when the agent correctly solves the task.
Appendix B Experiment Details
B.1 Expert demonstrations
For the Fetch environments, we use pre-defined settings in Raffin [2020] to train an expert policy. For the maze environments, we predefine a series of subgoals, and the position of the next subgoal is concatenated to the state. The reward is dense, using the negative Euclidean distance to the next subgoal and subsequent subgoals, and we use soft actor-critic Haarnoja et al. [2017] as the expert policy. For the Adroit environment, we use the pre-trained policy from Rajeswaran et al. [2018].
On the Fetch domains, each episode is short, so the demonstrations amount to only a few hundred training samples in total. We thus make identical copies of each demonstration in order to stably train the policy and backwards model (we do the same for all other baselines). This has a similar effect to performing more gradient updates, but we found it to be computationally cheaper. For both the Maze and Adroit environments, we use 20 demonstrations and do not repeat them, as the episodes are sufficiently long.
B.2 Hyperparameters
All hyperparameter settings for BMIL are provided in Tables 3 and 4. For all experiments, we used an internal cluster with single-GPU compute nodes, each with 10 virtual CPU cores and either an NVIDIA P100 or V100 GPU.
For all methods, we use the same number of policy gradient steps for a fair comparison. For the VINS baseline, we rely on the hyperparameter settings provided in the paper for the Fetch environments and also run hyperparameter sweeps to find the best settings specific to our environments. We run longer hyperparameter sweeps for the Maze and Adroit environments.
| | Fetch: Push | Fetch: PickAndPlace | Adroit: Relocate |
|---|---|---|---|
| epochs | 200 | 200 | 600 |
| policy updates per epoch | 100 | 100 | 50 |
| batch size | 64 | 64 | 64 |
| demonstrations | 5 | 10 | 20 |
| demonstration sampling ratio | 0.5 | 0.5 | 0.8 |
| trace horizon length | 1 | | |
| action selection strategy | entropy | entropy | entropy |
| action selection coefficient | 30 | 30 | 3 |

Table 3: BMIL hyperparameters for the Fetch and Adroit environments.
| | Point: UMaze | Point: Room5x11 | Point: Corridor7x7 | Ant: UMaze | Ant: Room5x11 | Ant: Corridor7x7 |
|---|---|---|---|---|---|---|
| epochs | 800 | 800 | 800 | 400 | 400 | 400 |
| policy updates per epoch | 250 | 250 | 250 | 500 | 500 | 500 |
| batch size | 256 | 256 | 256 | 256 | 256 | 256 |
| demonstrations | 20 | 20 | 20 | 20 | 20 | 20 |
| demonstration sampling ratio | 0.8 | 0.95 | 0.95 | 0.9 | 0.95 | 0.95 |
| trace horizon length | | | | | | |
| action selection strategy | entropy | entropy | entropy | entropy | entropy | entropy |
| action selection coefficient | 40 | 1 | 1 | 40 | 10 | 10 |

Table 4: BMIL hyperparameters for the Maze environments.
Appendix C Results
C.1 Main results
Table 5: Success rates during training (on the original start and goal positions) for BC, VINS, and BMIL on the Fetch (Push, PickAndPlace), Maze (Point and Ant: UMaze, Room5x11, Corridor7x7), and Adroit (Relocate) environments.
Table 5 shows the success rates during training, on environments where the start and goal positions are unchanged. We see that both BC and BMIL achieve high success rates across all environments, while VINS does not for the Maze environments.
C.2 Additional Experiments
For all additional experiments on the Fetch domains, we run multiple trials for each method. The error bars or shaded regions denote 95% confidence intervals.

Figure 10 shows the robustness of different action selection strategies on Fetch-Push. All action selection strategies generally perform similarly, with RS(0.3) and SC(50) possibly having a slight edge over no perturbation (None), though the error bars are fairly large.

Figure 11 shows the success rates during training with a varying number of demonstrations on Fetch-PickAndPlace. We see that a few demonstrations are sufficient for BC to achieve a high success rate on the original start and goal positions. However, BMIL requires somewhat more demonstrations to attain the same level of performance, as the backwards model needs a certain number of samples to train stably. We use 10 demonstrations in our experiments.
| | BC | BMIL | BMIL (model first) |
|---|---|---|---|
| Push (5 demos) | 1 | 1.21 | 1.09 |
| PickAndPlace (10 demos) | 1 | 4.31 | 4.85 |

Table 6: Robustness relative to BC when the backwards model is fully trained before the policy, compared to BMIL's joint training loop.
Table 6 shows the results of training the backwards model first compared to BMIL which trains both the model and the policy in the same training loop. We find similar robustness values, where training the model first produces slightly lower values on Push and slightly higher values on PickAndPlace.