Reinforcement Learning (RL) has shown impressive results in a plethora of simulated tasks, ranging from attaining super-human performance in video-games Mnih et al., Vinyals et al. and board-games Silver et al., to learning complex locomotion behaviors Heess et al., Florensa et al. [2017a]. Nevertheless, these successes are only faintly echoed in real-world robotics Riedmiller et al., Zhu et al. [2018a]. This is due to the difficulty of setting up the same learning environment that is enjoyed in simulation. One of the critical assumptions that is hard to satisfy in the real world is access to a reward function. Self-supervised methods have the power to overcome this limitation.
A very versatile and reusable form of self-supervision for robotics is to learn how to reach any previously observed state upon demand. This problem can be formulated as training a goal-conditioned policy Kaelbling, Schaul et al. that seeks to obtain the indicator reward of having the observation exactly match the goal. Such a reward does not require any additional instrumentation of the environment beyond the sensors the robot already has. But in practice this reward is never observed, because in continuous spaces like the ones in robotics the exact same observation is never seen twice. Luckily, if we are using an off-policy RL algorithm Lillicrap et al., Haarnoja et al., we can “relabel” a collected trajectory by replacing its goal with a state actually visited during that trajectory, therefore observing the indicator reward as often as we wish. This method was introduced as Hindsight Experience Replay Andrychowicz et al. or HER, although it used special resets, and the reward was in fact an $\epsilon$-ball around the goal, which only makes sense in lower-dimensional state-spaces. More recently the method was shown to work directly from vision with a special reward Nair et al. [2018a], and even with only the indicator reward of exactly matching observation and goal Florensa et al. [2018a].
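As a concrete illustration, hindsight relabeling over a collected trajectory can be sketched as follows. This is a minimal sketch with a list-based trajectory layout; the function name and data structures are ours, not from any particular HER implementation:

```python
import numpy as np

def her_relabel(trajectory, rng):
    """Relabel each transition's goal with a state actually reached later
    in the same trajectory (the "future" strategy), so the sparse
    indicator reward 1[s' == g] is observed as often as we wish."""
    relabeled = []
    for t, (s, a, s_next, _goal) in enumerate(trajectory):
        # pick a state visited from step t onwards as the new goal
        future = rng.integers(t, len(trajectory))
        new_goal = trajectory[future][2]  # the achieved next state
        reward = float(np.array_equal(s_next, new_goal))
        relabeled.append((s, a, s_next, new_goal, reward))
    return relabeled
```

With this scheme the indicator reward fires whenever the relabeled goal coincides with the transition's own next state, which is guaranteed for at least some transitions of every trajectory.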
In theory these approaches could learn how to reach any goal, but the breadth-first nature of the algorithm means that some areas of the space take a long time to be learned Florensa et al. [2018b]. This is especially challenging when there are bottlenecks between different areas of the state-space, and random motion might not traverse them easily Florensa et al. [2017b]. Some practical examples of this are pick-and-place, or navigating narrow corridors between rooms, as illustrated in Fig. 2 depicting the diverse set of environments we work with. In both cases a specific state needs to be reached (grasp the object, or enter the corridor) before a whole new area of the space is discovered (placing the object, or visiting the next room). This problem could be addressed by engineering a reward that guides the agent towards the bottlenecks, but this defeats the purpose of trying to learn without direct reward supervision. In this work we study how to leverage a few demonstrations that traverse those bottlenecks to boost the learning of goal-reaching policies.
Learning from Demonstrations, or Imitation Learning (IL), is a well-studied field in robotics Kalakrishnan et al., Ross et al., Bojarski et al.. In many cases it is easier to obtain a few demonstrations from an expert than to provide a good reward that describes the task. Most of the previous work on IL is centered around trajectory following, or performing a single task. Furthermore, it is limited by the performance of the demonstrations, or relies on engineered rewards to improve upon them. In this work we study how IL methods can be extended to the goal-conditioned setting, and show that, combined with techniques like HER, they can outperform the demonstrator without the need for any additional reward. We also investigate how the different methods degrade when the trajectories of the expert become less optimal, or less abundant. We also observe that these methods can be run in a completely reset-free fashion, hence overcoming another limitation of RL in the real world. Finally, the method we develop is able to leverage demonstrations that do not include the expert actions. This is very convenient in practical robotics, where demonstrations might have been given by a motion planner, by kinesthetic demonstrations (moving the agent externally, rather than actually actuating it), or even by another agent. To our knowledge, this is the first framework that can boost goal-conditioned policy learning with only state demonstrations.
We define a discrete-time finite-horizon discounted Markov decision process (MDP) by a tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, H)$, where $\mathcal{S}$ is a state set, $\mathcal{A}$ is an action set, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ is a transition probability distribution, $\gamma \in [0, 1]$ is a discount factor, and $H$ is the horizon. Our objective is to find a stochastic policy $\pi_\theta$ that maximizes the expected discounted reward within the MDP, $\eta(\pi_\theta) = \mathbb{E}_\tau \big[ \sum_{t=0}^{H} \gamma^t r(s_t, a_t) \big]$. We denote by $\tau = (s_0, a_0, \ldots, s_H)$ the entire state-action trajectory, where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi_\theta(a_t | s_t)$, and $s_{t+1} \sim \mathcal{P}(s_{t+1} | s_t, a_t)$. In the goal-conditioned setting that we use here, the policy and the reward are also conditioned on a “goal” $g \in \mathcal{G}$. The reward is $r(s_t, a_t, g) = \mathbb{1}\{s_{t+1} = g\}$, and hence the return is $\gamma^{t_g}$, where $t_g$ is the number of time-steps to the goal. Given that the transition probability is not affected by the goal, a goal can be “relabeled” in hindsight, so a transition $(s_t, a_t, s_{t+1}, g, r = 0)$ can be treated as $(s_t, a_t, s_{t+1}, g' = s_{t+1}, r = 1)$. Finally, we also assume access to trajectories that were collected by an expert attempting to reach a goal sampled uniformly among the feasible goals. Those trajectories must be approximately geodesics, meaning that the actions are taken such that the goal is reached as fast as possible.
3 Related Work
Imitation Learning can be seen as an alternative to reward crafting to train desired behaviors. There are many ways to leverage demonstrations, from Behavioral Cloning Pomerleau, which directly maximizes the likelihood of the expert actions under the training agent policy, to Inverse Reinforcement Learning, which extracts a reward function from those demonstrations and then trains a policy to maximize it Ziebart et al., Finn et al., Fu et al.. Another formulation, close to the latter and introduced by Ho and Ermon, is Generative Adversarial Imitation Learning (GAIL), explained in detail in the next section. Originally, the algorithms used to optimize the policy were on-policy methods like Trust Region Policy Optimization Schulman et al., but recently there has been a wave of works leveraging the efficiency of off-policy algorithms without loss in stability Blondé and Kalousis, Sasaki et al., Schroecker et al., Kostrikov et al.. This is a key capability that we are going to exploit later on.
Unfortunately, most work in the field cannot outperform the expert unless another reward is available during training Vecerik et al., Gao et al., Sun et al., which might defeat the purpose of using demonstrations in the first place. Furthermore, most tasks tackled with these methods consist of tracking expert state trajectories Zhu et al. [2018b], Peng et al., and cannot adapt to unseen situations.
In this work we are interested in goal-conditioned tasks, where the objective is to be able to reach any state upon demand. This kind of multi-task learning is pervasive in robotics, but challenging if no reward-shaping is applied. Relabeling methods like Hindsight Experience Replay Andrychowicz et al. unlock learning even in the sparse-reward case Florensa et al. [2018a]. Nevertheless, the inherent breadth-first nature of the algorithm might still make learning complex policies very inefficient. To overcome this exploration issue we investigate the effect of leveraging a few demonstrations. The closest prior work is by Nair et al. [2018b], where a Behavioral Cloning loss is used with a Q-filter. We found that a simple annealing of the Behavioral Cloning loss Rajeswaran et al. works better. Furthermore, we also introduce a new relabeling technique for the expert trajectories that is particularly useful when only a few demonstrations are available. We also experiment with goal-conditioned GAIL, leveraging its recently shown compatibility with off-policy algorithms.
4 Demonstrations in Goal-conditioned tasks
In this section we describe the different algorithms we compare against running only Hindsight Experience Replay Andrychowicz et al.. First we revisit adding a Behavioral Cloning loss to the policy update as in Nair et al. [2018b], then we propose a novel expert relabeling technique, and finally we formulate for the first time a goal-conditioned GAIL algorithm and propose a method to train it with state-only demonstrations.
4.1 Goal-conditioned Behavioral Cloning
The most direct way to leverage demonstrations is to construct a data-set $\mathcal{D}$ of all state-action-goal tuples $(s_t, a_t, g)$, and run a supervised regression algorithm. In the goal-conditioned case and assuming a deterministic policy $\mu_\theta(s, g)$, the loss is:

$$\mathcal{L}_{BC}(\theta, \mathcal{D}) = \sum_{(s_t, a_t, g) \in \mathcal{D}} \big\| \mu_\theta(s_t, g) - a_t \big\|_2^2$$
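This regression can be sketched as follows; a minimal illustrative implementation of a goal-conditioned Behavioral Cloning objective in mean-squared-error form, with names of our choosing:

```python
import numpy as np

def bc_loss(policy, dataset):
    """Mean squared error between a deterministic goal-conditioned
    policy mu(s, g) and the expert actions over the demo dataset.
    `dataset` is an iterable of (state, action, goal) tuples."""
    errors = [np.sum((policy(s, g) - a) ** 2) for s, a, g in dataset]
    return float(np.mean(errors))
```

For example, a point-mass expert whose optimal action is simply the displacement towards the goal achieves zero loss on its own demonstrations.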
This loss and its gradient are computed without any additional environment samples from the trained policy $\mu_\theta$. This makes it particularly convenient to combine a gradient descent step based on this loss with other policy updates. In particular, we can use a standard off-policy Reinforcement Learning algorithm like DDPG Lillicrap et al., where we fit a critic $Q_\phi(s, a, g)$ to the Bellman targets, and then estimate the gradient of the expected return as:

$$\nabla_\theta J = \mathbb{E}_{(s, g)} \Big[ \nabla_a Q_\phi(s, a, g) \big|_{a = \mu_\theta(s, g)} \, \nabla_\theta \mu_\theta(s, g) \Big]$$
In our goal-conditioned case, the critic fitting can also benefit from “relabeling” as done in HER Andrychowicz et al.. The improvement guarantees with respect to the task reward are lost when we combine the BC and the deterministic policy gradient updates, but this can be side-stepped by either applying a Q-filter to the BC loss as proposed in Nair et al. [2018b], or by annealing the BC loss weight as we do in our experiments, which allows the agent to eventually outperform the expert.
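The annealed combination can be sketched as follows. The multiplicative decay of 0.9 every 250 rollouts matches the schedule reported in the appendix; the function names and the scalar form of the actor objective are illustrative:

```python
def annealed_bc_weight(beta0, rollouts, decay=0.9, every=250):
    """Multiplicatively anneal the BC loss weight every `every` rollouts,
    so the policy can eventually outperform the demonstrator."""
    return beta0 * decay ** (rollouts // every)

def combined_actor_loss(q_value, bc_error, beta):
    """DDPG actor objective (maximize Q, i.e. minimize -Q) plus the
    annealed Behavioral Cloning term."""
    return -q_value + beta * bc_error
```

Early in training the BC term dominates and keeps the policy near the demonstrations; as the weight decays, the policy-gradient term takes over and can surpass the expert.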
4.2 Relabeling the expert
The expert trajectories have been collected by asking the expert to reach a specific goal $g$. But they are also valid trajectories to reach any other state visited within the demonstration! This is the key motivating insight for the new type of relabeling we propose: if we have the transitions $(s_t, a_t, s_{t+1}, g)$ in a demonstration, we can also consider the transition $(s_t, a_t, s_{t+1}, g' = s_{t+k})$ as coming from the expert! Indeed the demonstration also went through the state $s_{t+k}$, so if that had been the goal, the expert would also have generated this transition. This can be understood as a type of data augmentation leveraging the assumption that the tasks we work on are quasi-static. It is particularly effective in the low-data regime, where few demonstrations are available. The effect of expert relabeling can be visualized in the four rooms environment, as it is a 2D task where states and goals can be plotted. In Fig. 1 we compare the final performance of two agents, one trained with pure Behavioral Cloning, and the other also using expert relabeling.
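The expert relabeling augmentation can be sketched as follows, over a list-based demonstration of (state, action, next state, goal) tuples; the function name and layout are ours:

```python
def expert_relabel(demo):
    """Augment a demonstration: every state s_{t+k} visited later in the
    trajectory is treated as a goal the expert was (implicitly)
    reaching with the transition at time t."""
    augmented = []
    for t, (s, a, s_next, _g) in enumerate(demo):
        for k in range(t, len(demo)):
            augmented.append((s, a, s_next, demo[k][2]))
    return augmented
```

A demonstration of length $n$ thus yields $n(n+1)/2$ expert transitions instead of $n$, which is exactly why the technique matters most when demonstrations are scarce.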
4.3 Goal-conditioned GAIL with Hindsight
The compounding error in Behavioral Cloning might make the policy deviate arbitrarily from the demonstrations, and it requires many demonstrations when the state dimension increases. The first problem is less severe in our goal-conditioned case, because we do want to visit and be able to purposefully reach all states, even the ones that the expert did not visit. But the second drawback becomes pressing when attempting to scale this method to practical robotics tasks, where the observations might be high-dimensional sensory input like images. Both problems can be mitigated by using other Imitation Learning algorithms that can leverage additional rollouts collected by the learning agent in a self-supervised manner, like GAIL Ho and Ermon. In this section we extend the formulation of GAIL to tackle goal-conditioned tasks, and then we detail how it can be combined with HER Andrychowicz et al., which allows the policy to outperform the demonstrator and generalize to reaching all goals. We call the final algorithm goal-GAIL. First of all, the discriminator $\mathcal{D}_\psi$ needs to also be conditioned on the goal $g$, and be trained by minimizing the cross-entropy loss

$$\mathcal{L}_{GAIL}(\psi) = -\mathbb{E}_{(s, a, g) \sim \mathcal{D}_{expert}} \big[ \log \mathcal{D}_\psi(s, a, g) \big] - \mathbb{E}_{(s, a, g) \sim \pi_\theta} \big[ \log \big( 1 - \mathcal{D}_\psi(s, a, g) \big) \big]$$
Once the discriminator is fitted, we can run our favorite RL algorithm on the reward $\log \mathcal{D}_\psi(s_t, a_t, g)$. In our case we use the off-policy algorithm DDPG Lillicrap et al. to allow for the relabeling techniques outlined above. In the goal-conditioned case we also supplement the discriminator reward with the indicator reward $\mathbb{1}\{s_{t+1} = g\}$. This combination is slightly tricky because the fitted $Q$ function no longer has the clear interpretation it has when only one of the two rewards is used Florensa et al. [2018a]. Nevertheless, both rewards push the policy towards the goals, so they should not conflict strongly. Furthermore, to avoid any drop in final performance, the weight of the reward coming from GAIL ($\delta_{GAIL}$) can be annealed.
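The combined reward can be sketched as follows, assuming the convention that the discriminator outputs high probability on expert-like transitions so that its log serves as a reward; the logit-based interface and names are ours:

```python
import numpy as np

def goal_gail_reward(s_next, goal, d_logit, delta_gail, eps=1e-8):
    """Indicator reward for exactly matching the goal, supplemented by
    the discriminator term weighted by delta_gail (annealable)."""
    indicator = float(np.array_equal(s_next, goal))
    d = 1.0 / (1.0 + np.exp(-d_logit))  # discriminator probability from logit
    return indicator + delta_gail * float(np.log(d + eps))
```

Setting `delta_gail` to zero recovers the pure sparse-indicator reward of HER, while a positive weight densifies the signal with the discriminator's judgment of how expert-like the transition is.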
All possible variants we study are detailed in Algorithm 1. In particular, keeping only the Behavioral Cloning loss falls back to pure Behavioral Cloning, setting its weight to zero removes the BC component, agent trajectories may or may not be relabeled, the discriminator output may be removed from the reward, and EXPERT RELABEL indicates whether the expert relabeling explained above should be performed. In the next section we test these variants in the diverse environments depicted in Fig. 2.
4.4 Use of state-only Demonstrations
Both Behavioral Cloning and GAIL use state-action pairs from the expert. This limits the use of these methods, combined or not with HER, to setups where the exact same agent was actuated to reach different goals. Nevertheless, much more data could be cheaply available if the action were not required. For example, non-expert humans might not be able to operate a robot, but might be able to move the robot along the desired trajectory. This is called a kinesthetic demonstration. Another type of state-only demonstration is the one used in third-person imitation Stadie et al., where the expert performs the task with an embodiment different from the agent that needs to learn the task. This has mostly been applied to the trajectory-following case; in our case, every demonstration might have a different objective.
Furthermore, we would like to propose a method that not only leverages state-only demonstrations, but can also outperform the quality and coverage of the demonstrations given, or at least generalize to similar goals. Our main insight is that we can replace the action in the GAIL formulation by the next state $s'$, and in most environments this should be as informative as having access to the action directly. Intuitively, given a desired goal $g$, it should be possible to determine if a transition $s \to s'$ is taking the agent in the right direction. The loss function to train a discriminator able to tell apart the current agent from the demonstrations (which always transition towards the goal) is simply:

$$\mathcal{L}_{GAIL}(\psi) = -\mathbb{E}_{(s, s', g) \sim \mathcal{D}_{expert}} \big[ \log \mathcal{D}_\psi(s, s', g) \big] - \mathbb{E}_{(s, s', g) \sim \pi_\theta} \big[ \log \big( 1 - \mathcal{D}_\psi(s, s', g) \big) \big]$$
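This state-only discriminator loss can be sketched on pre-computed logits as follows; the network producing the logits from $(s, s', g)$ tuples is omitted, and the names are ours:

```python
import numpy as np

def state_only_disc_loss(logits_expert, logits_agent, eps=1e-8):
    """Cross-entropy discriminator loss on (s, s', g) tuples: expert
    transitions are labeled 1, agent transitions 0. No actions are
    needed, only consecutive states and the goal."""
    d_e = 1.0 / (1.0 + np.exp(-np.asarray(logits_expert, dtype=float)))
    d_a = 1.0 / (1.0 + np.exp(-np.asarray(logits_agent, dtype=float)))
    return float(-np.mean(np.log(d_e + eps)) - np.mean(np.log(1.0 - d_a + eps)))
```

A discriminator that confidently separates expert from agent transitions drives this loss towards zero; flipping the labels makes it large.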
5 Experiments
We are interested in answering the following questions:
- Can the use of demonstrations accelerate the learning of goal-conditioned tasks without reward?
- Is Expert Relabeling an efficient way of doing data augmentation on the demonstrations?
- Can state-only demonstrations be leveraged as well as full trajectories?
- Compared to Behavioral Cloning methods, is GAIL more robust to noise in the expert actions?
We evaluate these questions in two different simulated robotic goal-conditioned tasks, detailed in the next subsection along with the performance metric used throughout the experiments section. All the results use 20 demonstrations. All curves show 5 random seeds and the shaded area is one standard deviation.
Experiments are conducted in two continuous environments in MuJoCo Todorov et al.. The performance metric we use in all our experiments is the percentage of goals in the feasible goal space the agent is able to reach. We call this metric coverage. To estimate this percentage we sample feasible goals uniformly and execute a rollout of the current policy. It is considered a success if the agent gets within $\epsilon$ of the desired goal. Note that during training we do not assume access to the feasible goal distribution, nor use any distance to the goal to give rewards. These are two assumptions commonly made in works using HER Andrychowicz et al., Nair et al. [2018b] that we do not make.
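The coverage metric can be sketched as follows; `rollout` stands in for executing the current policy towards a goal and returning the final state, and the names are ours:

```python
import numpy as np

def coverage(rollout, goals, epsilon):
    """Fraction of uniformly sampled feasible goals reached to within
    epsilon by the current policy; `rollout(g)` returns the final
    state of an episode conditioned on goal g."""
    hits = sum(1 for g in goals if np.linalg.norm(rollout(g) - g) <= epsilon)
    return hits / len(goals)
```

Because the goals used for evaluation are sampled uniformly over the feasible space, coverage directly measures how much of that space the policy has mastered, not just performance near the demonstrations.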
Four rooms environment: This is a continuous twist on a well-studied problem in the Reinforcement Learning literature. A point mass is placed in an environment with four rooms connected through small openings, as depicted in Fig. 1(a). The action space of the agent is continuous and specifies the desired change in state space, and the goal-space exactly corresponds to the state-space.
Pick and Place: This task is the same as the one described by Nair et al. [2018b], where a Fetch robot needs to pick up a block and place it at a desired point in space. The control is four-dimensional, corresponding to a change in position of the end-effector as well as a change in gripper opening. The goal space is three-dimensional and is restricted to the position of the block.
5.2 Goal-conditioned Imitation Learning
In goal-conditioned tasks, HER Andrychowicz et al. should eventually converge to a policy able to reach any desired goal. Nevertheless, this might take a long time, especially in environments with bottlenecks that need to be traversed before accessing a whole new area of the goal space. In this section we show how the methods introduced in the previous section can leverage a few demonstrations to improve the convergence speed of HER. This was already studied for the case of Behavioral Cloning by Nair et al. [2018b], and in this work we show we also get a benefit when using GAIL as the Imitation Learning algorithm, which brings considerable advantages over Behavioral Cloning, as shown in the next sections.
In both environments, we observe that running GAIL with relabeling (GAIL+HER) considerably outperforms running either in isolation. HER alone converges very slowly, although as expected it eventually reaches the same final performance if run long enough. On the other hand, GAIL by itself learns fast at the beginning, but its final performance is capped. This is because, despite collecting more samples in the environment, those samples come with no reward of any kind indicating what task to perform (reach the given goals). Therefore, once GAIL has extracted all the information it can from the demonstrations, it cannot keep learning and generalize to goals further from the demonstrations. This is no longer an issue when it is combined with HER, as our results show.
5.3 Expert relabeling
Here we show that the Expert Relabeling technique introduced in Section 4.2 is beneficial when using demonstrations in the goal-conditioned imitation learning framework. As shown in Fig. 4, our expert relabeling technique brings considerable performance boosts for both Behavioral Cloning methods and goal-GAIL in both environments.
We also perform a further analysis of the benefit of the expert relabeling in the four-rooms environment because it is easy to visualize in 2D the goals the agent can reach. We see in Fig. 1 that without the expert relabeling, the agent fails to learn how to reach many intermediate states visited in the middle of a demonstration.
The performance of pure Behavioral Cloning is plotted as a horizontal dotted line, given that it does not require any additional environment steps. We observe that combining BC with HER always produces faster learning than running HER alone, and reaches higher final performance than pure BC with only 20 demonstrations.
5.4 Using state-only demonstrations
Behavioral Cloning and standard GAIL rely on state-action tuples coming from the expert. Nevertheless, there are many cases in robotics where we have access to demonstrations of a task without the actions. In this section we emphasize that all the results obtained with our goal-GAIL method and reported in Fig. 3 and Fig. 4 do not require any access to the actions the expert took. Surprisingly, in the four rooms environment, despite the more restricted information goal-GAIL has access to, it outperforms BC combined with HER. This might be due to the superior imitation learning performance of GAIL, and to the fact that these tasks might be solvable by only matching the state-distribution of the expert. We also run the experiment of training GAIL conditioned only on the current state, and not the action (as also done in other non-goal-conditioned works Fu et al.), and we observe that the discriminator learns a very well shaped reward that clearly encourages the agent to go towards the goal, as pictured in Fig. 5. See the Appendix for more details.
5.5 Robustness to sub-optimal expert
In the above sections we assumed access to a perfectly optimal expert. Nevertheless, in practical applications experts might behave more erratically, not always taking the best action towards the given goal. In this section we study how the different methods perform when a sub-optimal expert is used. To do so we collect trajectories with a modified version of our optimal expert $\pi^*$: first we condition it on a perturbed goal $g + \epsilon_g$, where $\epsilon_g \sim \mathcal{N}(0, \sigma^2 I)$, so the expert does not go exactly where it is asked to; second we add noise $\epsilon_a$ to the optimal actions; and third we make it $\alpha$-greedy. All together, the sub-optimal expert takes the action $\pi^*(s, g + \epsilon_g) + \epsilon_a$ with probability $1 - \alpha$, and otherwise a uniformly sampled random action in the action space.
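The corrupted expert can be sketched as follows; the parameter names and the uniform random-action range are illustrative, not the paper's notation:

```python
import numpy as np

def suboptimal_expert(expert, s, g, rng, goal_std, action_std, alpha):
    """Corrupt an optimal expert in the three ways described above:
    perturbed goal, additive Gaussian action noise, and alpha-greedy
    random actions drawn uniformly from the action space."""
    a_opt = expert(s, g + rng.normal(0.0, goal_std, size=np.shape(g)))
    if rng.random() < alpha:
        return rng.uniform(-1.0, 1.0, size=np.shape(a_opt))  # random action
    return a_opt + rng.normal(0.0, action_std, size=np.shape(a_opt))
```

With all three corruption parameters set to zero this reduces exactly to the optimal expert, which makes it easy to sweep the degree of sub-optimality in experiments.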
In Fig. 6 we observe that approaches that directly try to copy the expert's actions, like Behavioral Cloning, suffer greatly under a sub-optimal expert, to the point that they barely provide any improvement over plain Hindsight Experience Replay. On the other hand, methods based on training a discriminator between expert and current agent behavior are able to leverage much noisier experts. A possible explanation of this phenomenon is that the discriminator can give a positive signal as long as the transition is “in the right direction”, without trying to exactly enforce a single action. Under this lens, some noise in the expert might actually improve the performance of these adversarial approaches, as has been observed in the generative-modeling literature Goodfellow et al..
6 Conclusions and Future Work
Hindsight relabeling can be used to learn useful behaviors without any reward supervision for goal-conditioned tasks, but it is inefficient when the state-space is large or includes exploration bottlenecks. In this work we show how only a few demonstrations can be leveraged to improve the convergence speed of these methods. We introduce a novel algorithm, goal-GAIL, that converges faster than HER and to a better final performance than a naive goal-conditioned GAIL. We also study the effect of expert relabeling as a type of data augmentation on the provided demonstrations, and demonstrate that it improves the performance of our goal-GAIL as well as goal-conditioned Behavioral Cloning. We emphasize that our goal-GAIL method only needs state demonstrations, without using expert actions like Behavioral Cloning methods do. Finally, we show that goal-GAIL is robust to sub-optimalities in the expert behavior.
All the above factors make our goal-GAIL algorithm well suited for real-world robotics, an exciting direction for future work. Along the same lines, we also want to test the performance of these methods on vision-based tasks. Our preliminary experiments show that Behavioral Cloning fails completely in the low-data regime in which we operate (fewer than 20 demonstrations).
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Vinyals et al.  Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Yuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. Technical report, 2019.
- Silver et al.  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George Van Den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 10 2017. ISSN 14764687. doi: 10.1038/nature24270. URL http://arxiv.org/abs/1610.00633.
- Heess et al.  Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, and David Silver. Emergence of Locomotion Behaviours in Rich Environments. 7 2017. URL http://arxiv.org/abs/1707.02286.
- Florensa et al. [2017a] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic Neural Networks for Hierarchical Reinforcement Learning. International Conference in Learning Representations, pages 1–17, 2017a. URL http://arxiv.org/abs/1704.03012.
- Riedmiller et al.  Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van De Wiele, Volodymyr Mnih, Nicolas Heess, and Tobias Springenberg. Learning by Playing – Solving Sparse Reward Tasks from Scratch. International Conference on Machine Learning, 2018.
- Zhu et al. [2018a] Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, and Vikash Kumar. Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost. 10 2018a. URL http://arxiv.org/abs/1810.06045.
- Kaelbling  Leslie P. Kaelbling. Learning to Achieve Goals. International Joint Conference on Artificial Intelligence (IJCAI), pages 1094–1098, 1993.
- Schaul et al.  Tom Schaul, Dan Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators. International Conference on Machine Learning, 2015. URL http://jmlr.org/proceedings/papers/v37/schaul15.pdf.
- Lillicrap et al.  Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, pages 1–14, 2015. URL http://arxiv.org/abs/1509.02971.
- Haarnoja et al.  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning, pages 1–15, 2018.
- Andrychowicz et al.  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. Advances in Neural Information Processing Systems, 2017. URL http://arxiv.org/abs/1707.01495.
- Nair et al. [2018a] Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual Reinforcement Learning with Imagined Goals. Advances in Neural Information Processing Systems, 2018a.
- Florensa et al. [2018a] Carlos Florensa, Jonas Degrave, Nicolas Heess, Jost Tobias Springenberg, and Martin Riedmiller. Self-supervised Learning of Image Embedding for Continuous Control. In Workshop on Inference to Control at NeurIPS, 2018a. URL http://arxiv.org/abs/1901.00943.
- Florensa et al. [2018b] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic Goal Generation for Reinforcement Learning Agents. International Conference in Machine Learning, 2018b. URL http://arxiv.org/abs/1705.06366.
- Florensa et al. [2017b] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse Curriculum Generation for Reinforcement Learning. Conference on Robot Learning, pages 1–16, 2017b. URL http://arxiv.org/abs/1707.05300.
- Kalakrishnan et al.  Mrinal Kalakrishnan, Jonas Buchli, Peter Pastor, and Stefan Schaal. Learning locomotion over rough terrain using terrain templates. In International Conference on Intelligent Robots and Systems, pages 167–172. IEEE, 2009. ISBN 978-1-4244-3803-7. doi: 10.1109/IROS.2009.5354701. URL http://ieeexplore.ieee.org/document/5354701/.
- Ross et al.  Stéphane Ross, Geoffrey J Gordon, and J Andrew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. International Conference on Artificial Intelligence and Statistics, 2011.
- Bojarski et al.  Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. 2016. URL http://arxiv.org/abs/1604.07316.
- Pomerleau  Dean A Pomerleau. ALVINN: an autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems, pages 305–313, 1989. URL https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf.
- Ziebart et al.  Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.
- Finn et al.  Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. International Conference on Machine Learning, 2016. URL http://arxiv.org/abs/1603.00448.
- Fu et al.  Justin Fu, Katie Luo, and Sergey Levine. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. International Conference in Learning Representations, 10 2018. URL http://arxiv.org/abs/1710.11248.
- Ho and Ermon  Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning. Advances in Neural Information Processing Systems, 2016. URL http://arxiv.org/abs/1606.03476.
- Schulman et al.  John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization. International Conference on Machine Learning, 2015.
- Blondé and Kalousis  Lionel Blondé and Alexandros Kalousis. Sample-Efficient Imitation Learning via Generative Adversarial Nets. AISTATS, 2019.
- Sasaki et al.  Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample Efficient Imitation Learning for Continuous Control. International Conference in Learning Representations, pages 1–15, 2019.
- Schroecker et al.  Yannick Schroecker, Mel Vecerik, and Jonathan Scholz. Generative predecessor models for sample-efficient imitation learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkeVsiAcYm.
- Kostrikov et al.  Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning. International Conference in Learning Representations, 2019.
- Vecerik et al.  Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. pages 1–11, 2017.
- Gao et al.  Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement Learning from Imperfect Demonstrations. International Conference on Machine Learning, 2018.
- Sun et al.  Wen Sun, J. Andrew Bagnell, and Byron Boots. Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning. pages 1–14, 2018. URL http://arxiv.org/abs/1805.11240.
- Zhu et al. [2018b] Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. Reinforcement and Imitation Learning for Diverse Visuomotor Skills. Robotics: Science and Systems, 2018b. URL http://arxiv.org/abs/1802.09564.
- Peng et al.  Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills. Transactions on Graphics (Proc. ACM SIGGRAPH), 37(4), 2018. doi: 10.1145/3197517.3201311. URL http://arxiv.org/abs/1804.02717.
- Nair et al. [2018b] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming Exploration in Reinforcement Learning with Demonstrations. International Conference on Robotics and Automation, 2018b. URL http://arxiv.org/abs/1709.10089.
- Rajeswaran et al.  Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. Robotics: Science and Systems, 2018.
- Stadie et al.  Bradly C. Stadie, Pieter Abbeel, and Ilya Sutskever. Third-Person Imitation Learning. International Conference in Learning Representations, 3 2017. URL http://arxiv.org/abs/1703.01703.
- Todorov et al.  Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
- Goodfellow et al.  Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Advances in Neural Information Processing Systems, pages 1–9, 2014.
Appendix A Hyperparameters and Architectures
In the two environments, i.e. the Four Rooms environment and Fetch Pick & Place, the task horizons are set to 300 and 100 respectively. In all experiments, the Q function, policy and discriminator are parameterized by fully connected neural networks with two hidden layers of size 256. DDPG is used for policy optimization. The initial value of the Behavioral Cloning loss weight is annealed by 0.9 per 250 rollouts collected. We found empirically that there is no need to anneal the discriminator reward weight $\delta_{GAIL}$.
For the experiments with a sub-optimal expert in Section 5.5, the goal-noise scale and the greediness parameter $\alpha$ are set separately for the Four Rooms environment and Fetch Pick & Place.
Appendix B Effect of Different Input of Discriminator
We trained the discriminator in three settings:
- current state and goal: $\mathcal{D}_\psi(s, g)$
- current state, next state and goal: $\mathcal{D}_\psi(s, s', g)$
- current state, action and goal: $\mathcal{D}_\psi(s, a, g)$