Imitation learning enables autonomous agents to learn complex behaviors from demonstrations, which are often easy and intuitive for users to provide. However, learning expressive neural network policies from imitation requires a large number of demonstrations, particularly when learning from high-dimensional inputs such as image pixels. Meta-imitation learning has emerged as a promising approach for allowing an agent to leverage data from previous tasks in order to learn a new task from only a handful of demonstrations Duan et al. (2017); Finn et al. (2017b); James et al. (2018). However, in many practical few-shot imitation settings, there is an identifiability problem: it may not be possible to precisely determine a policy from one or a few demonstrations, especially in a new situation. Even if a demonstration precisely communicates what the task entails, it might not precisely communicate how to accomplish it in new situations. For example, it may be difficult to discern from a single demonstration where to grasp an object when it is in a new position, or how much force to apply in order to slide an object without knocking it over. It may be expensive to collect more demonstrations to resolve such ambiguities, and even when we can, it may not be obvious to a human demonstrator where the agent’s difficulty arises from. In contrast, success-or-failure feedback is easy for the user to provide, and exploratory interaction is useful for learning how to perform the task. To this end, our goal is to build an agent that can first infer a policy from one demonstration, then attempt the task using that policy while receiving binary user feedback, and finally use the feedback to improve its policy such that it can consistently solve the task.
This vision of learning new tasks from a few demonstrations and trials inherently requires some amount of prior knowledge or experience, which we can acquire through meta-learning across a range of previous tasks. To this end, we develop a new meta-learning algorithm that incorporates elements of imitation learning with trial-and-error reinforcement learning. In contrast to previous meta-imitation learning approaches that learn one-shot imitation learning procedures through imitation Duan et al. (2017); Finn et al. (2017b), our approach enables the agent to improve at the test task through trial-and-error. Further, from the perspective of meta-RL algorithms that aim to learn efficient RL procedures Duan et al. (2016); Wang et al. (2016); Finn et al. (2017a), our approach also has significant appeal: as we aim to scale meta-RL towards broader task distributions and learn increasingly general RL procedures, exploration and efficiency become exceedingly difficult. However, a demonstration can significantly narrow down the search space while also providing a practical means for a user to communicate the goal, enabling the agent to achieve few-shot learning of behavior. While the combination of demonstrations and reinforcement has been studied extensively in single-task problems Kober et al. (2013); Sun et al. (2018); Rajeswaran et al. (2018); Le et al. (2018), this combination is particularly important in meta-learning contexts where few-shot learning of new tasks is simply not possible without demonstrations. Further, we can even significantly improve upon prior methods that study this combination by using meta-learning to more effectively integrate the information coming from both sources.
The primary contribution of this paper is a meta-learning algorithm that enables effective learning of new behaviors with a single demonstration and trial experience. In particular, after receiving a demonstration that illustrates a new goal, the meta-trained agent can learn to accomplish that goal through trial-and-error with only binary success-or-failure labels. We evaluate our algorithm and several prior methods on a challenging, vision-based control problem involving manipulation tasks from four distinct families of tasks: button-pressing, grasping, pushing, and pick and place. We find that our approach can effectively learn tasks with new, held-out objects using one demonstration and a single trial, while significantly outperforming meta-imitation learning, meta-reinforcement learning, and prior methods that combine demonstrations and reward feedback. To our knowledge, our experiments are the first to show that meta-learning can enable an agent to adapt to new tasks with binary reinforcement signals from raw pixel observations, which we show with a single meta-model for a variety of distinct manipulation tasks. Videos of our experimental results can be found on the supplemental website: https://sites.google.com/view/watch-try-learn-project.
2 Related Work
Learning to learn, or meta-learning, has a long-standing history in the machine learning literature Thrun and Pratt (1998); Schmidhuber (1987); Bengio et al. (1992); Hochreiter et al. (2001). We particularly focus on meta-learning in the context of control. Our approach builds on and significantly improves upon meta-imitation learning Duan et al. (2017); Finn et al. (2017b); James et al. (2018); Paine et al. (2018) and meta-reinforcement learning Duan et al. (2016); Wang et al. (2016); Mishra et al. (2018); Rakelly et al. (2019), extending contextual meta-learning approaches. Unlike prior work in few-shot imitation learning Duan et al. (2017); Finn et al. (2017b); Yu et al. (2018); James et al. (2018); Paine et al. (2018), our method enables the agent to additionally improve using trial-and-error experience. In contrast to work in multi-task and meta-reinforcement learning Duan et al. (2016); Wang et al. (2016); Finn et al. (2017a); Mishra et al. (2018); Houthooft et al. (2018); Sung et al. (2017); Nagabandi et al. (2019); Sæmundsson et al. (2018); Hausman et al. (2017), our approach learns to use one demonstration to address the meta-exploration problem Gupta et al. (2018); Stadie et al. (2018). Our work also requires only one round of on-policy data collection for the vision-based manipulation tasks, while nearly all prior meta-learning works require thousands of iterations of on-policy data collection, amounting to hundreds of thousands of trials Duan et al. (2016); Wang et al. (2016); Finn et al. (2017a); Mishra et al. (2018).
Combining demonstrations and trial-and-error experience has long been explored in the machine learning and robotics literature Kober et al. (2013). This ranges from simple techniques such as demonstration-based pre-training and initialization Peters and Schaal (2006); Kober and Peters (2009); Kormushev et al. (2010); Kober et al. (2013); Silver et al. (2016) to more complex methods that incorporate both demonstration data and reward information in the loop of training (Taylor et al., 2011; Brys et al., 2015; Subramanian et al., 2016; Hester et al., 2018; Sun et al., 2018; Rajeswaran et al., 2018; Nair et al., 2018; Le et al., 2018). The key contribution of this paper is an algorithm that can learn how to learn from both demonstrations and rewards. This is quite different from RL algorithms that incorporate demonstrations: learning from scratch with demonstrations and RL involves a slow, iterative learning process, while fast adaptation with a meta-trained policy involves extracting inherently distinct pieces of information from the demonstration and the trials. The demonstration provides information about what to do, while the small number of RL trials can disambiguate the task and how it should be performed. This disambiguation process resembles a process of elimination and refinement. As a result, we get a procedure that significantly exceeds the efficiency of prior approaches, requiring only one demonstration and one trial to adapt to a new test task, even from pixel observations, by leveraging previous data. In comparison, single-task methods for learning from demonstrations and rewards typically require hundreds or thousands of trials to learn tasks of comparable difficulty Rajeswaran et al. (2018); Nair et al. (2018).
3 Meta-Learning from Demonstrations and Rewards
We first formalize the problem statement that we consider, then describe our approach and its implementation.
3.1 Problem Statement
We would like to design a problem statement that encapsulates the setting described above. Following the typical meta-learning problem statement Finn (2018), we will assume a distribution of tasks $p(\mathcal{T})$, from which the meta-training tasks and held-out meta-test tasks are drawn. A task $\mathcal{T}$ is defined as a finite-horizon Markov decision process (MDP), $\mathcal{T} = (\mathcal{S}, \mathcal{A}, r, P, H)$, with continuous state space $\mathcal{S}$, continuous action space $\mathcal{A}$, binary reward function $r(s_t, a_t)$, unknown dynamics $P(s_{t+1} \mid s_t, a_t)$, and horizon $H$. Both the reward and dynamics may vary across tasks.
During meta-training, we will assume supervision in the form of several expert demonstration trajectories $\mathscr{D}_i = \{d_i^k\}$ per task $\mathcal{T}_i$, and a reward function $r_i$ that can be queried for each of the meta-training tasks. After meta-training with this supervision, the goal at meta-test time is to quickly learn a new meta-test task $\mathcal{T}_j$. In particular, at meta-test time, the agent is first provided with $k$ demonstrations $\{d_j^k\}$ (where $k$ is small, e.g. one or a few), followed by the opportunity to attempt the task, for which it receives a reward label. To attempt the task, the agent must infer a policy based on the demonstrations that will be suitable for gathering information about the task. We will refer to the roll-outs from this policy as trial episodes $\{\tau_j^l\}$. The agent must use the trial episodes (along with the demonstrations $\{d_j^k\}$) to infer a policy that succeeds at the task. In the ensuing retrial episode, this policy is evaluated. We use a trial $\tau$ to denote a sequence of states, actions, and rewards,
$$\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_H, a_H, r_H), \qquad (1)$$
while a demonstration $d = (s_1, a_1, s_2, a_2, \dots, s_H, a_H)$ denotes a sequence of only states and actions.
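To make the distinction between the two trajectory types concrete, here is a minimal Python sketch (the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    state: List[float]
    action: List[float]
    reward: Optional[float] = None  # present in trials, absent in demonstrations

@dataclass
class Trajectory:
    steps: List[Step]

    @property
    def has_rewards(self) -> bool:
        # a trial carries (s, a, r) triples; a demonstration only (s, a) pairs
        return all(s.reward is not None for s in self.steps)

# a demonstration: states and actions only
demo = Trajectory([Step([0.0], [1.0]), Step([0.5], [0.2])])
# a trial: states, actions, and binary rewards
trial = Trajectory([Step([0.0], [1.0], 0.0), Step([0.5], [0.2], 1.0)])
```

The retrial policy later consumes both kinds of trajectory, so representing them with a shared structure (reward optional) keeps the conditioning interface uniform.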
3.2 Learning to Imitate and Try Again
With the aforementioned problem statement in mind, we aim to develop a method that can learn to learn from both demonstration and trial-and-error experience. To do so, we need two capabilities: first, the ability to infer a policy from a few demonstrations that is suitable for gathering information about the task, i.e. a trial policy, and second, the ability to extract and integrate information from these trials with that of the demonstrations to learn a successful policy, which we refer to as the retrial policy. The trial policy can be written as $\pi_\theta(a \mid s, \{d^k\})$, while the retrial policy is $\pi_\phi(a \mid s, \{d^k\}, \{\tau^l\})$.
In theory we could design a single model to represent both the trial and retrial policies, for example by using a single MAML Finn et al. (2017a) model with a shared initial parameter setting. However, a key challenge when the trial and retrial policy share initial weights is that updates for the retrial policy will also affect the trial policy, thereby changing the distribution of trial trajectories that the retrial policy should expect. This requires the algorithm to constantly re-collect on-policy trial trajectories from the environment during meta-training, which is particularly difficult in real-world problem settings with broad task distributions. Instead, we found it substantially simpler to represent and train these two policies separately, decoupling their optimization. We parameterize the trial and retrial policies by $\theta$ and $\phi$, respectively, and denote the parameterized policies $\pi_\theta$ and $\pi_\phi$. Our approach allows us to train the trial policy first, freeze its weights while collecting trial data from the environment for each meta-training task $\mathcal{T}_i$, and then train the retrial policy using the collected stationary trial data without having to visit the environment again.
How do we train each of these policies with off-policy demonstration and trial data? The trial policy must be trained in a way that will provide useful exploration for inferring the task. One simple and effective strategy for exploration is posterior or Thompson sampling Russo et al. (2018); Rakelly et al. (2019), i.e. greedily acting according to the policy’s current belief of the task. To this end, we train the trial policy using a meta-imitation learning setup, where for any task $\mathcal{T}_i$ the trial policy conditions on one or a few training demonstrations $\{d_i^k\}$ and is trained to maximize the likelihood of the actions under another demonstration $d_i^*$ of the same task (sampled from $\mathscr{D}_i$ without replacement). This leads us to the objective:
$$\min_\theta \sum_{\mathcal{T}_i} \mathbb{E}_{\{d_i^k\},\, d_i^* \sim \mathscr{D}_i} \Big[ -\sum_{(s_t, a_t) \in d_i^*} \log \pi_\theta(a_t \mid s_t, \{d_i^k\}) \Big] \qquad (2)$$
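A minimal numpy sketch of this kind of objective, assuming a hypothetical linear demo-conditioned trial policy with unit-variance Gaussian action noise (all names and the linear policy head are illustrative, not the paper's architecture):

```python
import numpy as np

def gaussian_nll(actions, means):
    """Negative log-likelihood of actions under a unit-variance Gaussian,
    summed over timesteps: the inner term of a behavior-cloning objective."""
    return 0.5 * np.sum((actions - means) ** 2 + np.log(2 * np.pi))

def trial_loss(theta, demo_context, query_states, query_actions):
    """Trial objective for one task: condition on a context vector summarizing
    the conditioning demo, then score the actions of a held-out demo."""
    n = len(query_states)
    # concatenate each query state with the (repeated) demo context
    inputs = np.concatenate([query_states, np.tile(demo_context, (n, 1))], axis=1)
    means = inputs @ theta  # hypothetical linear policy head
    return gaussian_nll(query_actions, means)

# tiny example: 2 timesteps, 1-D states/actions, 1-D demo context
theta = np.zeros((2, 1))
loss = trial_loss(theta, np.array([0.5]),
                  np.array([[0.0], [1.0]]),     # held-out demo states
                  np.array([[0.1], [-0.2]]))    # held-out demo actions
```

Minimizing this loss over mini-batches of tasks (summing the per-task terms) recovers the meta-imitation training of the trial policy.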
We train the retrial policy in a similar fashion, but additionally condition on one or a few trial trajectories $\{\tau_i^l\}$, which are the result of executing the trial policy in the environment. Suppose for any task $\mathcal{T}_i$, we have a set of demo-trial pairs $\mathscr{D}_i^{\text{trial}} = \{(\{d_i^k\}, \{\tau_i^l\})\}$. Then the retrial objective is:
$$\min_\phi \sum_{\mathcal{T}_i} \mathbb{E}_{(\{d_i^k\}, \{\tau_i^l\}) \sim \mathscr{D}_i^{\text{trial}},\; d_i^* \sim \mathscr{D}_i} \Big[ -\sum_{(s_t, a_t) \in d_i^*} \log \pi_\phi(a_t \mid s_t, \{d_i^k\}, \{\tau_i^l\}) \Big] \qquad (3)$$
During meta-training, we first train the trial policy by minimizing Eq. 2 with mini-batches of tasks $\mathcal{T}_i$ and corresponding demonstrations $\mathscr{D}_i$. After training, we freeze $\theta$ to obtain a fixed trial policy. We then iteratively sample a set of task demonstrations $\{d_i^k\}$ and collect one or a few trial trajectories $\{\tau_i^l\}$ in the environment using the demo-conditioned trial policy $\pi_\theta$. We store the resulting demo-trial pairs in a new dataset $\mathscr{D}_i^{\text{trial}}$. We then train our retrial policy by minimizing Eq. 3 with mini-batches of tasks and corresponding demo-trial pair datasets $\mathscr{D}_i^{\text{trial}}$. At meta-test time, for any test task $\mathcal{T}_j$ we first collect trials using the trial policy $\pi_\theta$. Then we execute the retrial episode using the retrial policy $\pi_\phi$. We refer to our approach as Watch-Try-Learn (WTL), and describe our meta-training and meta-test procedures in more detail in Alg. 1 and Alg. 2, respectively.
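The decoupled two-stage meta-training procedure can be sketched as follows (the callables and task structure are stub interfaces we assume for illustration, not the paper's API):

```python
def meta_train(tasks, train_policy, collect_trials):
    """Two-stage Watch-Try-Learn meta-training sketch: fit the trial policy
    on demonstrations, freeze it, collect one batch of trials per task,
    then fit the retrial policy on the now-stationary trial data."""
    # Stage 1: meta-imitation training of the trial policy (Eq. 2 style)
    theta = train_policy([(t.demos,) for t in tasks])

    # Stage 2a: with theta frozen, visit the environment once per task
    demo_trial_pairs = {t.name: collect_trials(theta, t) for t in tasks}

    # Stage 2b: train the retrial policy on demos plus the collected trials
    # (Eq. 3 style) without any further environment interaction
    phi = train_policy([(t.demos, demo_trial_pairs[t.name]) for t in tasks])
    return theta, phi
```

The key property the sketch highlights is that the environment is only visited between the two training stages, so the retrial policy trains on a stationary trial distribution.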
3.3 Watch-Try-Learn Implementation
WTL and Alg. 1 allow for general representations of the trial policy $\pi_\theta$ and retrial policy $\pi_\phi$, so long as they integrate the task conditioning information: the trial policy conditions on demonstration trajectories, while the retrial policy conditions on both demonstration and trial trajectories. To enable WTL to flexibly choose how to integrate the information coming from demonstrations and trials, we condition the policies by embedding trajectory information into a context vector using a neural network. This architectural choice resembles prior contextual meta-learning works Duan et al. (2016); Mishra et al. (2018); Duan et al. (2017); Rakelly et al. (2019), which have previously considered how to meta-learn efficiently from one modality of data (trials or demonstrations), but not how to integrate multiple sources, including off-policy trial data.
We illustrate the policy architecture in Figure 2. Given a demonstration trajectory $d$ or trial trajectory $\tau$, we embed each timestep’s state, action, and reward (recall Eq. 1) to produce per-timestep embeddings. In the retrial case, where we have both demo and trial embeddings per timestep, we concatenate them together. We then apply 1-D convolutions to the per-timestep embeddings, followed by a flatten operation and fully connected layers to produce a single context embedding vector for the actor network. At the output, we use mixture density networks (MDN) Bishop (1994) to represent the distribution over actions. Hence, the actor network receives the context embedding and produces the parameters of a Gaussian mixture.
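As an illustration of the MDN output head, here is a numpy sketch of sampling an action from the actor's predicted Gaussian mixture (the parameter shapes and function name are our assumptions, not the paper's implementation):

```python
import numpy as np

def sample_mdn_action(logits, means, log_stds, rng):
    """Sample one action from a Gaussian mixture parameterized by the actor:
    mixture logits (K,), component means (K, action_dim), log-stds (K, action_dim)."""
    # softmax over mixture components (shifted for numerical stability)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    k = rng.choice(len(w), p=w)                 # pick a mixture component
    eps = rng.standard_normal(means.shape[1])   # unit Gaussian noise
    return means[k] + np.exp(log_stds[k]) * eps # sample from that component

rng = np.random.default_rng(0)
# 3 mixture components over a 2-D action space, near-zero variance
action = sample_mdn_action(np.zeros(3), np.zeros((3, 2)),
                           np.full((3, 2), -20.0), rng)
```

A mixture output is useful here because demonstrations for ambiguous tasks can be multi-modal, and a single Gaussian would average incompatible action modes.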
4 Experiments
In our experiments, we aim to evaluate our method on challenging few-shot learning domains that span multiple task families, where the agent must use both demonstrations and trial-and-error to effectively infer a policy for the task. Prior meta-imitation benchmarks Duan et al. (2017); Finn et al. (2017c) generally contain only a few tasks, and these tasks can be easily disambiguated given a single demonstration. Meanwhile, prior meta-reinforcement learning benchmarks Duan et al. (2016); Finn et al. (2017a) tend to contain fairly similar tasks that a meta-learner can solve with little exploration and no demonstration at all. Motivated by these shortcomings, we design two new problems where a meta-learner can leverage a combination of demonstration and trial experience: a toy reaching problem and a challenging multitask gripper control problem, described below. We evaluate how the following methods perform in those environments:
BC: A behavior cloning method that does not condition on either demonstration or trial-and-error experience, trained across all meta-training data. We train BC policies via maximum likelihood on expert demonstration actions.
MIL Finn et al. (2017b); James et al. (2018): A meta-imitation learning method that conditions on demonstration data, but does not leverage trial-and-error experience. We train MIL policies to minimize Eq. 2 just like trial policies, but MIL methods lack a retrial step. To perform a controlled comparison, we use the same architecture for both MIL and WTL.
WTL: Our Watch-Try-Learn method, which conditions on demonstration and trial experience. In all experiments, we consider the setting where the agent receives one demonstration and can take one trial.
BC + SAC: In the gripper environment we study how much trial-and-error experience state-of-the-art reinforcement learning algorithms would require to solve a single task. While WTL meta-learns a single model that needs just one trial episode per meta-test task, in “BC + SAC” we fine-tune a separate RL agent for each meta-test task and analyze how much trial experience it needs to match WTL’s single-trial performance. In particular, we pre-train a policy on the meta-training demonstrations similar to BC, then fine-tune for each meta-test task using soft actor-critic Haarnoja et al. (2018), where the SAC actor is initialized with our BC pretrained policy.
(Figure caption fragment: results are averaged over separate training runs with identical hyperparameters, evaluated on randomly sampled meta-test tasks; shaded regions indicate confidence intervals.)
4.1 Reaching Environment Experiments
To first verify that our method can actually leverage demonstration and trial experience in a simplified problem domain, we begin with a family of toy planar reaching tasks inspired by Finn et al. (2017b) and shown in Fig. 3. The demonstrator shows which of two objects to reach towards, but crucially the agent’s dynamics are unknown at the start of a task and are not necessarily the same as the demonstrator’s: each of the two joints may have reversed orientation with some fixed probability. This simulates a domain-adaptive setting (e.g., human and robot dynamics will be completely different when imitating from a human video demonstration). Since observing the demonstration alone is insufficient, a meta-learner must use its own trial episodes to identify the task dynamics and successfully reach the target object. The environment returns a reward at each timestep, which penalizes the reacher’s distance to the target object, as well as the magnitude of its control inputs at that timestep.
To obtain expert demonstration data, we first train a reinforcement learning agent using normalized advantage functions (NAF) Gu et al. (2016), where the agent receives oracle observations that show only the true target. With our trained expert demonstration agent, we collect 2 demonstrations per task for 10000 meta-training tasks and 1000 meta-test tasks. For these toy experiments we use simplified versions of the architecture in Fig. 2 as described in Appendix B.
We show the results of each method in the reaching environment in Fig. 3; the WTL results are the retrial episode returns. Our method quickly learns to imitate the expert, while methods that do not leverage trial information struggle due to the uncertainty in the task dynamics. Finally, Fig. 3 shows that single-task RL agents typically require tens of thousands of episodes to approach the same performance our meta-trained WTL method achieves after one demonstration and a single trial.
4.2 Gripper Environment Experiments
We design the gripper environment to include tasks from four broad task families: button pressing, grasping, pushing, and pick and place. Unlike prior meta-imitation learning benchmarks, tasks of the same family that are qualitatively similar still have subtle but consequential differences. Some pushing tasks might require the agent to always push the left object towards the right object, while others might require the agent to always push, for example, the cup towards the teapot. Appendix A describes each task family and its variations.
The gripper environment is a realistic 3-D simulation created using the Bullet physics engine Coumans and Bai (2016). The agent controls a floating gripper that can move and rotate freely in space (6 DOFs), and 2 symmetric fingers that open and close together (1 DOF). The virtual scene is set up with the same viewpoint and table as in Figure 4, with random initial positions of the gripper, as well as two random kitchenware objects and a button panel that are placed on the table. The button panel contains two press-able buttons of different colors. One of the two kitchenware objects is placed onto a random location on the left half of the table, and the other kitchenware object is placed onto a random location on the right half of the table. For vision policies, the environment provides image observations (RGB images from the fixed viewpoint) and the 7-D gripper position vector at each timestep. For state-space policies it instead provides the 7-D gripper state and the 6-D poses (translation + rotation) of all non-fixed objects in the scene. The environment returns a reward of 1 once the agent has successfully completed a task, and 0 otherwise.
For each task, the demonstration trajectories are recorded by a human controlling the gripper through the HTC Vive virtual reality setup. Between the demonstration episode and the trial/retrial episodes, the kitchenware objects and button panel are repositioned: a kitchenware object on the left half of the table moves to a random location on the right half of the table, and vice versa. Similarly, the two colored buttons in the button panel swap positions. Swapping the lateral positions of objects and buttons after the demonstration is crucial because otherwise, for example, the difference between the two types of pushing tasks would be meaningless.
In our virtual reality setup, a human demonstrator recorded demonstrations for a large set of distinct tasks, each involving a distinct set of kitchenware objects. We held out the tasks corresponding to several object sets for our meta-validation dataset, which we used for hyperparameter selection. Similarly, we selected and held out further object sets of tasks for our meta-test dataset, which we used for final evaluations.
(Table 1: comparison of WTL, 1 trial (ours) against RL fine-tuning with SAC: BC + SAC after 900 trials and BC + SAC after 2000 trials.)
We trained and evaluated MIL, BC, and WTL policies with both state-space observations and vision observations. Sec. 3.3 describes our WTL trial and retrial architectures, Appendix B describes hyperparameter selection using the meta-validation tasks, and Appendix B.3 analyzes the sample and time complexity of WTL. The MIL policy uses an identical architecture and objective to the WTL trial policy, while the BC policy architecture is the same as the WTL trial policy without any embedding components. For vision-based models, we crop and resize image observations before providing them as input.
We show the meta-test task success rates in Fig. 5, where the WTL success rate is simply the success rate of the retrial policy. Overall, in both state space and vision domains, we find that WTL outperforms MIL and BC by a substantial margin, indicating that it can effectively leverage information from the trial and integrate it with that of the demonstration in order to achieve greater performance.
Finally, for the BC + SAC comparison, we pre-trained an actor with behavior cloning and fine-tuned RL agents per task with identical hyperparameters using the TF-Agents Guadarrama et al. (2018) SAC implementation. Table 1 shows that BC + SAC fine-tuning typically requires hundreds of trial episodes per task to reach the same performance our meta-trained WTL method achieves after one demonstration and a single trial episode. Appendix B.4 shows the BC + SAC training curves averaged across the different meta-test tasks.
5 Conclusion
We proposed a meta-learning algorithm that allows an agent to quickly learn new behavior from a single demonstration followed by trial experience and associated (possibly sparse) rewards. The demonstration allows the agent to infer the type of task to be performed, and the trials enable it to improve its performance by resolving ambiguities in new test time situations. We presented experimental results where the agent is meta-trained on a broad distribution of tasks, after which it is able to quickly learn tasks with new held-out objects from just one demonstration and a trial. We showed that our approach outperforms prior meta-imitation approaches in challenging experimental domains. We hope that this work paves the way towards more practical algorithms for meta-learning behavior. In particular, we believe that the Watch-Try-Learn approach enables a natural way for non-expert users to train agents to perform new tasks: by demonstrating the task and then observing and critiquing the performance of the agent on the task if it initially fails. In future work, we hope to explore the potential of such an approach to be meta-trained on a much broader range of tasks, testing the performance of the agent on completely new held-out tasks rather than on held-out objects. We believe that this will be enabled by the unique combination of demonstrations and trials in the inner loop of a meta-learning system, since the demonstration guides the exploration process for subsequent trials, and the use of trials allows the agent to learn new task objectives which may not have been seen during meta-training.
Acknowledgments
We would like to thank Luke Metz and Archit Sharma for reviewing an earlier draft of this paper, and also Alex Irpan for valuable discussions.
- Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, pages 6–8, 1992.
- Bishop (1994) Christopher M Bishop. Mixture density networks. Technical report, Aston University, 1994.
- Brys et al. (2015) Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In IJCAI, 2015.
- Coumans and Bai (2016) E Coumans and Y Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016.
- Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- Duan et al. (2017) Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. Neural Information Processing Systems (NIPS), 2017.
- Finn (2018) Chelsea Finn. Learning to Learn with Gradients. PhD thesis, UC Berkeley, 2018.
- Finn et al. (2017a) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017a.
- Finn et al. (2017b) Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. Conference on Robot Learning (CoRL), 2017b.
- Finn et al. (2017c) Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. Conference on Robot Learning (CoRL), 2017c.
- Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
- Gupta et al. (2018) Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. arXiv preprint arXiv:1802.07245, 2018.
- Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
- Hausman et al. (2017) Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph J Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Neural Information Processing Systems (NIPS), 2017.
- Hester et al. (2018) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. AAAI, 2018.
- Hochreiter et al. (2001) Sepp Hochreiter, A Younger, and Peter Conwell. Learning to learn using gradient descent. International Conference on Artificial Neural Networks (ICANN), 2001.
- Houthooft et al. (2018) Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv:1802.04821, 2018.
- James et al. (2018) Stephen James, Michael Bloesch, and Andrew J Davison. Task-embedded control networks for few-shot imitation learning. arXiv preprint arXiv:1810.03237, 2018.
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
- Kober and Peters (2009) Jens Kober and Jan R Peters. Policy search for motor primitives in robotics. In Neural Information Processing Systems (NIPS), 2009.
- Kober et al. (2013) Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
- Kormushev et al. (2010) Petar Kormushev, Sylvain Calinon, and Darwin G Caldwell. Robot motor skill coordination with em-based reinforcement learning. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 3232–3237. IEEE, 2010.
- Le et al. (2018) Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, and Hal Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
- Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 2016.
- Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. International Conference on Learning Representations (ICLR), 2018.
- Nagabandi et al. (2019) Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. International Conference on Learning Representations (ICLR), 2019.
- Nair et al. (2018) Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. International Conference on Robotics and Automation (ICRA), 2018.
- Paine et al. (2018) Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden, et al. One-shot high-fidelity imitation: Training large-scale deep nets with rl. arXiv preprint arXiv:1810.05017, 2018.
- Peters and Schaal (2006) Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In International Conference on Intelligent Robots and Systems (IROS), 2006.
- Rajeswaran et al. (2018) Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. Robotics: Science and Systems, 2018.
- Rakelly et al. (2019) Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
- Russo et al. (2018) Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
- Sæmundsson et al. (2018) Steindór Sæmundsson, Katja Hofmann, and Marc Peter Deisenroth. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551, 2018.
- Schmidhuber (1987) Jürgen Schmidhuber. Evolutionary principles in self-referential learning. PhD thesis, Institut für Informatik, Technische Universität München, 1987.
- Guadarrama et al. (2018) Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Chris Harris, Vincent Vanhoucke, and Eugene Brevdo. TF-Agents: A library for reinforcement learning in TensorFlow. https://github.com/tensorflow/agents, 2018. [Online; accessed 30-November-2018].
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
- Stadie et al. (2018) Bradly C Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever. Some considerations on learning to explore via meta-reinforcement learning. arXiv preprint arXiv:1803.01118, 2018.
- Subramanian et al. (2016) Kaushik Subramanian, Charles L Isbell Jr, and Andrea L Thomaz. Exploration from demonstration for interactive reinforcement learning. In International Conference on Autonomous Agents & Multiagent Systems, 2016.
- Sun et al. (2018) Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. International Conference on Learning Representations (ICLR), 2018.
- Sung et al. (2017) Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv:1706.09529, 2017.
- Taylor et al. (2011) Matthew E Taylor, Halit Bener Suay, and Sonia Chernova. Integrating reinforcement learning with human demonstrations of varying ability. In International Conference on Autonomous Agents and Multiagent Systems, 2011.
- Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 1998.
- Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, 2016.
- Yu et al. (2018) Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. Robotics: Science and Systems (RSS), 2018.
Appendix A Gripper Environment Task Types
Our gripper environment has four broad task families: button pressing, grasping, sliding, and pick and place. Tasks in each family come in one of two types:
Button pressing 1: the demonstrator presses the left (respectively, right) button, and the agent must press the left (respectively, right) button.
Button pressing 2: the demonstrator presses one of the two buttons, and the agent must press the button of the same color.
Grasping 1: the demonstrator grasps and lifts the left (respectively, right) object. The agent must grasp and lift the left (respectively, right) object.
Grasping 2: the demonstrator grasps and lifts one of the two objects. The agent must grasp and lift that same object.
Sliding 1: the demonstrator pushes the left object into the right object (respectively, the right object into the left object). The agent must push the left object into the right object (respectively, the right object into the left object).
Sliding 2: the demonstrator pushes one object (A) into the other (B). The agent must push object A into object B.
Pick and Place 1: the demonstrator picks up one of the objects and places it on the near (respectively, far) edge of the table. The agent must pick up the same object and place it on the near (respectively, far) edge of the table.
Pick and Place 2: the demonstrator picks up one of the objects and places it on the left (respectively, right) edge of the table. The agent must pick up the same object and place it on the left (respectively, right) edge of the table.
Appendix B Experimental Details
We trained all policies using the ADAM optimizer Kingma and Ba (2015), on varying numbers of Nvidia Tesla P100 GPUs. Whenever more than one GPU is used, we use synchronized gradient updates, and the batch size refers to the batch size of each individual GPU worker.
b.1 Reacher Environment Models
For the reacher toy problem, every neural network in both the WTL policies and the baseline policies uses the same architecture: two hidden layers of 100 neurons each, with ReLU activations on every layer except the output layer, which has no activation.
Rather than mixture density networks, our BC and MIL policies are deterministic policies trained with a standard mean squared error (MSE) loss. The WTL re-trial policy is also deterministic and trained with MSE, while the WTL trial policy is stochastic and samples actions from a Gaussian distribution with learned mean and diagonal covariance (equivalently, an MDN simplified to a single component). We found that a deterministic policy works best for maximizing the MIL baseline's average return, whereas a stochastic WTL trial policy achieves lower returns itself but makes learning easier for the WTL re-trial policy. The embedding architectures are also simplified: the demonstration "embedding" is simply the state at the final timestep of the demonstration trajectory, and the trial embedding architecture replaces the temporal convolution and flatten operations with a simple average over per-timestep trial embeddings.
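The trial-policy head described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the network dimensions at the bottom are hypothetical, and only the two-hidden-layer, 100-unit ReLU architecture and the single-component Gaussian (mean plus diagonal log-std) output are taken from the text.

```python
import numpy as np

def mlp(x, weights):
    """Two hidden layers with ReLU; linear output layer (no activation)."""
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU on hidden layers only
    W, b = weights[-1]
    return x @ W + b

def init_mlp(rng, sizes):
    """He-initialized weight/bias pairs for the given layer sizes."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def gaussian_policy(state, weights, rng):
    """Stochastic trial policy: the MLP outputs a mean and a log-std per
    action dimension, i.e. a single-component, diagonal-covariance MDN."""
    out = mlp(state, weights)
    mean, log_std = np.split(out, 2)
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

# Hypothetical dimensions for illustration (not stated in the paper):
state_dim, action_dim = 10, 2
rng = np.random.default_rng(0)
weights = init_mlp(rng, [state_dim, 100, 100, 2 * action_dim])
action = gaussian_policy(rng.standard_normal(state_dim), weights, rng)
```

A deterministic BC/MIL policy under this interface would simply return the mean and train on MSE, while the stochastic head above adds sampling noise that produces the more diverse trials the re-trial policy learns from.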
We trained all policies for 50000 steps with a batch size of tasks and a learning rate, on a single GPU operating at 25 gradient steps per second.
b.2 Gripper Environment Models
Table 2: Selected hyperparameters per method (learning rate and number of MDN components).
For each state-space and vision method we ran a simple hyperparameter sweep. We tried and MDN mixture components, and for each we sampled learning rates log-uniformly from the range (state-space) or (vision), leading to hyperparameter experiments per method. We selected the best hyperparameters by average success rate on the meta-validation tasks; see Table 2. We trained all gripper models with GPU workers and a batch size of tasks for (state-space) and (vision) steps. Training the WTL trial and re-trial policies on the vision-based gripper environment takes 14 hours each.
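Log-uniform sampling of learning rates, as used in the sweep above, can be done as follows. The sweep ranges and MDN component counts below are placeholders, since the exact values are omitted in the text.

```python
import numpy as np

def sample_log_uniform(rng, low, high, n):
    """Sample n values log-uniformly from [low, high]: uniform in log-space,
    then exponentiate, so each decade is sampled equally often."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=n))

# Hypothetical sweep grid (component counts and ranges are illustrative):
rng = np.random.default_rng(0)
sweep = [(lr, k)
         for k in (1, 5, 10)                       # MDN component counts
         for lr in sample_log_uniform(rng, 1e-5, 1e-3, 3)]
```

Each `(learning rate, components)` pair would then be trained independently and ranked by average success rate on the meta-validation tasks.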
b.3 Algorithmic Complexity of WTL
WTL’s trial and re-trial policies are trained off-policy via imitation learning, so only two phases of data collection are required: the set of demonstrations for training the trial policy, and the set of trial-policy rollouts for training the re-trial policy. The time and sample complexity of data collection is linear in the number of tasks, since only one demo and one trial are required per task. Furthermore, because the optimization of the trial and re-trial policies is decoupled into separate learning stages, the sample complexity is fixed with respect to hyperparameter sweeps: a fixed dataset of demos is used to obtain a good trial policy, and a fixed dataset of trials (from the trial policy) is used to obtain a good re-trial policy.
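The two decoupled phases can be sketched as follows. This is a toy illustration of the data-flow only: the `expert`, `rollout`, and `fit` callables are hypothetical stand-ins (here "policies" are just the datasets they were fit on), not the paper's training code.

```python
def collect(tasks, behavior):
    """One rollout per task, so sample complexity is linear in len(tasks)."""
    return {task: behavior(task) for task in tasks}

def train_wtl(tasks, expert, rollout, fit):
    """Two decoupled phases; each dataset is collected once and can be
    reused across hyperparameter sweeps."""
    demos = collect(tasks, expert)                    # phase 1 data
    trial_policy = fit(demos)                         # imitation on demos
    trials = collect(tasks, lambda t: rollout(trial_policy, t))  # phase 2 data
    retrial_policy = fit({t: (demos[t], trials[t]) for t in tasks})
    return trial_policy, retrial_policy

# Toy instantiation with string "rollouts":
tasks = ["press", "grasp", "slide"]
trial, retrial = train_wtl(
    tasks,
    expert=lambda t: f"demo-{t}",
    rollout=lambda policy, t: f"trial-{t}",
    fit=lambda data: dict(data),
)
```

Note that a hyperparameter sweep would only re-run the two `fit` calls on the cached `demos` and `trials` dictionaries; no new environment interaction is needed.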
The forward-backward pass of the vision network is the dominant factor in the computational cost of training the vision-based re-trial architecture. Thus, the time complexity of a single SGD update to WTL scales linearly with the number of sub-sampled frames used to compute the demo and trial visual features that form the contextual embedding. In practice, however, the model size and the number of sub-sampled frames are small enough that computing embeddings for all frames in demos and trials can be vectorized efficiently on GPUs.