Watch, Try, Learn: Meta-Learning from Demonstrations and Reward

06/07/2019 ∙ by Allan Zhou, et al. ∙ Google ∙ The Team at X

Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration. Our experiments show that our method significantly outperforms prior approaches on a set of challenging, vision-based control tasks.




1 Introduction

Imitation learning enables autonomous agents to learn complex behaviors from demonstrations, which are often easy and intuitive for users to provide. However, learning expressive neural network policies from imitation requires a large number of demonstrations, particularly when learning from high-dimensional inputs such as image pixels. Meta-imitation learning has emerged as a promising approach for allowing an agent to leverage data from previous tasks in order to learn a new task from only a handful of demonstrations Duan et al. (2017); Finn et al. (2017b); James et al. (2018). However, in many practical few-shot imitation settings, there is an identifiability problem: it may not be possible to precisely determine a policy from one or a few demonstrations, especially in a new situation. And even if a demonstration precisely communicates what the task entails, it might not precisely communicate how to accomplish it in new situations. For example, it may be difficult to discern from a single demonstration where to grasp an object when it is in a new position or how much force to apply in order to slide an object without knocking it over. It may be expensive to collect more demonstrations to resolve such ambiguities, and even when we can, it may not be obvious to a human demonstrator where the agent’s difficulty is arising from. Alternatively, it is easy for the user to provide success-or-failure feedback, while exploratory interaction is useful for learning how to perform the task. To this end, our goal is to build an agent that can first infer a policy from one demonstration, then attempt the task using that policy while receiving binary user feedback, and finally use the feedback to improve its policy such that it can consistently solve the task.

Figure 1: After watching one demonstration (left), the scene is re-arranged. With one trial episode (middle), our method can learn to solve the task in the retrial episode (right) by leveraging both the demo and trial-and-error experience.

This vision of learning new tasks from a few demonstrations and trials inherently requires some amount of prior knowledge or experience, which we can acquire through meta-learning across a range of previous tasks. To this end, we develop a new meta-learning algorithm that incorporates elements of imitation learning with trial-and-error reinforcement learning. In contrast to previous meta-imitation learning approaches that learn one-shot imitation learning procedures through imitation Duan et al. (2017); Finn et al. (2017b), our approach enables the agent to improve at the test task through trial-and-error. Further, from the perspective of meta-RL algorithms that aim to learn efficient RL procedures Duan et al. (2016); Wang et al. (2016); Finn et al. (2017a), our approach also has significant appeal: as we aim to scale meta-RL towards broader task distributions and learn increasingly general RL procedures, exploration and efficiency become exceedingly difficult. However, a demonstration can significantly narrow down the search space while also providing a practical means for a user to communicate the goal, enabling the agent to achieve few-shot learning of behavior. While the combination of demonstrations and reinforcement has been studied extensively in single-task problems Kober et al. (2013); Sun et al. (2018); Rajeswaran et al. (2018); Le et al. (2018), this combination is particularly important in meta-learning contexts where few-shot learning of new tasks is simply not possible without demonstrations. Further, we can even significantly improve upon prior methods that study this combination by using meta-learning to more effectively integrate the information coming from both sources.

The primary contribution of this paper is a meta-learning algorithm that enables effective learning of new behaviors with a single demonstration and trial experience. In particular, after receiving a demonstration that illustrates a new goal, the meta-trained agent can learn to accomplish that goal through trial-and-error with only binary success-or-failure labels. We evaluate our algorithm and several prior methods on a challenging, vision-based control problem involving manipulation tasks from four distinct families of tasks: button-pressing, grasping, pushing, and pick and place. We find that our approach can effectively learn tasks with new, held-out objects using one demonstration and a single trial, while significantly outperforming meta-imitation learning, meta-reinforcement learning, and prior methods that combine demonstrations and reward feedback. To our knowledge, our experiments are the first to show that meta-learning can enable an agent to adapt to new tasks with binary reinforcement signals from raw pixel observations, which we show with a single meta-model for a variety of distinct manipulation tasks. Videos of our experimental results can be found on the supplemental website.

2 Related Work

Learning to learn, or meta-learning, has a long-standing history in the machine learning literature Thrun and Pratt (1998); Schmidhuber (1987); Bengio et al. (1992); Hochreiter et al. (2001). We particularly focus on meta-learning in the context of control. Our approach builds on and significantly improves upon meta-imitation learning Duan et al. (2017); Finn et al. (2017b); James et al. (2018); Paine et al. (2018) and meta-reinforcement learning Duan et al. (2016); Wang et al. (2016); Mishra et al. (2018); Rakelly et al. (2019), extending contextual meta-learning approaches. Unlike prior work in few-shot imitation learning Duan et al. (2017); Finn et al. (2017b); Yu et al. (2018); James et al. (2018); Paine et al. (2018), our method enables the agent to additionally improve upon trial-and-error experience. In contrast to work in multi-task and meta-reinforcement learning Duan et al. (2016); Wang et al. (2016); Finn et al. (2017a); Mishra et al. (2018); Houthooft et al. (2018); Sung et al. (2017); Nagabandi et al. (2019); Sæmundsson et al. (2018); Hausman et al. (2017), our approach learns to use one demonstration to address the meta-exploration problem Gupta et al. (2018); Stadie et al. (2018). Our work also requires only one round of on-policy data collection, collecting only trials for the vision-based manipulation tasks, while nearly all prior meta-learning works require thousands of iterations of on-policy data collection, amounting to hundreds of thousands of trials Duan et al. (2016); Wang et al. (2016); Finn et al. (2017a); Mishra et al. (2018).

Combining demonstrations and trial-and-error experience has long been explored in the machine learning and robotics literature Kober et al. (2013). This ranges from simple techniques such as demonstration-based pre-training and initialization Peters and Schaal (2006); Kober and Peters (2009); Kormushev et al. (2010); Kober et al. (2013); Silver et al. (2016) to more complex methods that incorporate both demonstration data and reward information in the loop of training (Taylor et al., 2011; Brys et al., 2015; Subramanian et al., 2016; Hester et al., 2018; Sun et al., 2018; Rajeswaran et al., 2018; Nair et al., 2018; Le et al., 2018). The key contribution of this paper is an algorithm that can learn how to learn from both demonstrations and rewards. This is quite different from RL algorithms that incorporate demonstrations: learning from scratch with demonstrations and RL involves a slow, iterative learning process, while fast adaptation with a meta-trained policy involves extracting inherently distinct pieces of information from the demonstration and the trials. The demonstration provides information about what to do, while the small number of RL trials can disambiguate the task and how it should be performed. This disambiguation process resembles a process of elimination and refinement. As a result, we get a procedure that significantly exceeds the efficiency of prior approaches, requiring only one demonstration and one trial to adapt to a new test task, even from pixel observations, by leveraging previous data. In comparison, single-task methods for learning from demonstrations and rewards typically require hundreds or thousands of trials to learn tasks of comparable difficulty Rajeswaran et al. (2018); Nair et al. (2018).

3 Meta-Learning from Demonstrations and Rewards

We first formalize the problem statement that we consider, then describe our approach and its implementation.

3.1 Problem Statement

We would like to design a problem statement that encapsulates the setting described above. Following the typical meta-learning problem statement Finn (2018), we assume a distribution over tasks, from which the meta-training tasks and held-out meta-test tasks are drawn. Each task is defined as a finite-horizon Markov decision process (MDP) with a continuous state space, a continuous action space, a binary reward function, unknown dynamics, and an episode horizon. Both the reward and dynamics may vary across tasks.

During meta-training, we assume supervision in the form of several expert demonstration trajectories per task, and a reward function that can be queried for each of the meta-training tasks. After meta-training with this supervision, the goal at meta-test time is to quickly learn a new meta-test task. In particular, at meta-test time, the agent is first provided with a small number of demonstrations (e.g. one or a few), followed by the opportunity to attempt the task, for which it receives a reward label. To attempt the task, the agent must infer a policy based on the demonstrations that will be suitable for gathering information about the task. We will refer to the roll-outs from this policy as trial episodes. The agent must use the trial episodes (along with the demonstrations) to infer a policy that succeeds at the task. In the ensuing retrial episode, this policy is evaluated. We use a trial to denote a sequence of states, actions, and rewards,

$$\tau = [(s_1, a_1, r_1), \ldots, (s_H, a_H, r_H)], \qquad (1)$$

where $H$ is the horizon, while a demonstration denotes a sequence of only states and actions.

3.2 Learning to Imitate and Try Again

With the aforementioned problem statement in mind, we aim to develop a method that can learn to learn from both demonstration and trial-and-error experience. To do so, we need two capabilities: first, the ability to infer a policy from a few demonstrations that is suitable for gathering information about the task, i.e. a trial policy, and second, the ability to extract and integrate information from these trials with that of the demonstrations to learn a successful policy, which we refer to as the retrial policy. The trial policy conditions on the current state and the demonstrations, while the retrial policy additionally conditions on the trial trajectories.

In theory we could design a single model to represent both the trial and retrial policies, for example by using a single MAML Finn et al. (2017a) model with a shared initial parameter setting. However, a key challenge when the trial and retrial policy share initial weights is that updates for the retrial policy will also affect the trial policy, thereby changing the distribution of trial trajectories that the retrial policy should expect. This requires the algorithm to constantly re-collect on-policy trial trajectories from the environment during meta-training, which is particularly difficult in real-world problem settings with broad task distributions. Instead, we found it substantially simpler to represent and train these two policies separately, decoupling their optimization. We parameterize the trial and retrial policies with separate sets of weights. Our approach allows us to train the trial policy first, freeze its weights while collecting trial data from the environment for each meta-training task, and then train the retrial policy using the collected stationary trial data without having to visit the environment again.

How do we train each of these policies with off-policy demonstration and trial data? The trial policy must be trained in a way that will provide useful exploration for inferring the task. One simple and effective strategy for exploration is posterior or Thompson sampling Russo et al. (2018); Rakelly et al. (2019), i.e. greedily act according to the policy’s current belief of the task. To this end, we train the trial policy using a meta-imitation learning setup, where for any task the trial policy conditions on one or a few training demonstrations and is trained to maximize the likelihood of the actions under another demonstration of the same task (sampled from the task's demonstrations without replacement). This leads us to the trial objective (Eq. 2): the expected negative log-likelihood of the held-out demonstration's actions under the demonstration-conditioned trial policy.
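One plausible way to write this meta-imitation objective is sketched below. The notation is assumed for illustration and is not taken from the original equation: $\theta$ parameterizes the trial policy, $\mathcal{D}^*_i$ is the demonstration set of task $\mathcal{T}_i$, $\{d_{i,k}\}$ are the conditioning demonstrations, and $d^{\text{test}}_i$ is the held-out demonstration whose actions are imitated.

```latex
% Sketch of a trial objective under assumed notation (not the paper's exact Eq. 2).
\mathcal{L}^{\text{trial}}(\theta)
  = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\;
    \mathbb{E}_{\{d_{i,k}\},\, d^{\text{test}}_i \sim \mathcal{D}^*_i}
    \Big[ - \sum_{(s_t, a_t) \in d^{\text{test}}_i}
      \log \pi^{\text{trial}}_{\theta}\big(a_t \mid s_t, \{d_{i,k}\}\big) \Big]
```

Minimizing this quantity is the same as maximizing the likelihood of the held-out demonstration's actions, which matches the description in the text.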


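We train the retrial policy in a similar fashion, but additionally condition on one or a few trial trajectories, which are the result of executing the trial policy in the environment. Suppose that for any task we have a set of demo-trial pairs. Then the retrial objective (Eq. 3) is the same negative log-likelihood of a held-out demonstration's actions, now evaluated under the retrial policy conditioned on a demo-trial pair.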
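A sketch of how the retrial objective might be written follows. All symbols are assumptions made for illustration: $\phi$ parameterizes the retrial policy, $\mathcal{D}_i$ is the dataset of demo-trial pairs for task $\mathcal{T}_i$, $\{d_{i,k}\}$ and $\{\tau_{i,l}\}$ are the conditioning demonstrations and trials, and $d^{\text{test}}_i$ is a held-out demonstration drawn from the remaining demonstrations $\mathcal{D}^*_i \setminus \{d_{i,k}\}$.

```latex
% Sketch of a retrial objective under assumed notation (not the paper's exact Eq. 3).
\mathcal{L}^{\text{retrial}}(\phi)
  = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\;
    \mathbb{E}_{(\{d_{i,k}\}, \{\tau_{i,l}\}) \sim \mathcal{D}_i,\;
                d^{\text{test}}_i \sim \mathcal{D}^*_i \setminus \{d_{i,k}\}}
    \Big[ - \sum_{(s_t, a_t) \in d^{\text{test}}_i}
      \log \pi^{\text{retrial}}_{\phi}\big(a_t \mid s_t, \{d_{i,k}\}, \{\tau_{i,l}\}\big) \Big]
```

The only structural difference from the trial objective is the extra conditioning on the trial trajectories and their rewards.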


During meta-training, we first train the trial policy by minimizing Eq. 2 with mini-batches of tasks and corresponding demonstrations. After training, we freeze the trial policy's weights to obtain a fixed trial policy. We then iteratively sample a set of task demonstrations and collect one or a few trial trajectories in the environment using the demo-conditioned trial policy. We store the resulting demo-trial pairs in a new dataset for that task. We then train our retrial policy by minimizing Eq. 3 with mini-batches of tasks and corresponding demo-trial pair datasets. At meta-test time, for any test task we first collect trials using the demo-conditioned trial policy. Then we execute the retrial episode using the retrial policy, conditioned on both the demonstrations and the trials. We refer to our approach as Watch-Try-Learn (WTL), and describe our meta-training and meta-test procedures in more detail in Alg. 1 and Alg. 2, respectively.

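Algorithm 1 Watch-Try-Learn: Meta-training
  Input: Training tasks
  Input: Demo data per task
  Input: Number of training steps
  Randomly initialize the trial policy parameters and the retrial policy parameters
  for each training step do
     Sample a meta-training task
     Update the trial policy parameters using the trial objective (see Eq. 2)
  end for
  for each meta-training task do
     Initialize an empty dataset for demo-trial pairs.
     while not done do
        Sample demonstrations from the task's demo data
        Collect trials with the frozen, demo-conditioned trial policy
        Add the resulting demo-trial pair to the task's dataset
     end while
  end for
  for each training step do
     Sample a meta-training task
     Update the retrial policy parameters using the retrial objective (see Eq. 3)
  end for
  return the trial and retrial policy parameters

Algorithm 2 Watch-Try-Learn: Meta-testing
  Input: Test tasks
  Input: Demo data for each test task
  for each test task do
     Sample demonstrations
     Collect trials with the demo-conditioned trial policy
     Perform the task with the retrial policy, conditioned on the demonstrations and trials
  end for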
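The three phases of meta-training can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the authors' implementation: `update_trial`, `rollout`, and `update_retrial` are hypothetical callables supplied by the caller, and the fixed `pairs_per_task` count stands in for the unspecified "while not done" condition.

```python
import random
from typing import Callable, Dict, List, Tuple

Demo = List[Tuple[float, float]]           # (state, action) pairs
Trial = List[Tuple[float, float, float]]   # (state, action, reward) triples


def meta_train(
    tasks: List[int],
    demos: Dict[int, List[Demo]],   # at least two demos per task
    num_steps: int,
    pairs_per_task: int,            # assumed stopping rule for data collection
    update_trial: Callable,         # hypothetical: (theta, cond, target) -> theta
    rollout: Callable,              # hypothetical: (theta, task, cond) -> Trial
    update_retrial: Callable,       # hypothetical: (phi, cond, trial, target) -> phi
    theta,
    phi,
    rng: random.Random,
):
    # Phase 1: meta-imitation training of the trial policy (no environment access).
    for _ in range(num_steps):
        task = rng.choice(tasks)
        cond, target = rng.sample(demos[task], 2)  # sampled without replacement
        theta = update_trial(theta, cond, target)

    # Phase 2: theta is frozen from here on; collect demo-trial pairs once.
    pairs: Dict[int, List[Tuple[Demo, Trial]]] = {}
    for task in tasks:
        pairs[task] = []
        for _ in range(pairs_per_task):
            cond = rng.choice(demos[task])
            pairs[task].append((cond, rollout(theta, task, cond)))

    # Phase 3: train the retrial policy on the stored, stationary pairs.
    for _ in range(num_steps):
        task = rng.choice(tasks)
        cond, trial = rng.choice(pairs[task])
        target = rng.choice([d for d in demos[task] if d is not cond])
        phi = update_retrial(phi, cond, trial, target)

    return theta, phi, pairs
```

Because the environment is only touched in the second phase, the retrial policy never invalidates the data it is trained on, which is the decoupling argued for in the text.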

3.3 Watch-Try-Learn Implementation

Figure 2: Our vision-based retrial architecture: RGB observations are passed through a 4-layer CNN with ReLU activations and layer normalization, followed by a spatial softmax layer that extracts 2D keypoints Levine et al. (2016). The flattened keypoints are concatenated with the current gripper pose, gripper velocity, and context embedding. The resulting state-context vector is passed through an actor network, which predicts the parameters of a Gaussian mixture over the commanded end-effector position, axis-angle orientation, and finger angle. A demo embedding network applies a separate vision network to 40 ordered images subsampled from a demo trajectory. A trial embedding network produces a trial embedding in a similar fashion. We concatenate the demo embeddings, trial embeddings, and trial rewards, then apply a 10x1 convolution across the time dimension, flatten, and apply an MLP to produce the final context embedding. The trial policy architecture is the same as shown here, but omits the concatenation of trial embeddings and trial rewards.
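The spatial softmax step mentioned in the caption can be illustrated for a single feature channel. The sketch below is a plain-Python version under two assumptions that the caption does not specify: coordinates are normalized to [-1, 1], and no learned temperature is applied. A real model would run this once per CNN channel to obtain one 2D keypoint each.

```python
import math
from typing import List, Tuple


def spatial_softmax_keypoint(feature_map: List[List[float]]) -> Tuple[float, float]:
    """Expected (x, y) location of one feature channel, each in [-1, 1].

    Assumes the map has at least two rows and two columns.
    """
    height, width = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtract max for stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)

    x = y = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            p = w / total  # softmax probability of pixel (i, j)
            x += p * (2.0 * j / (width - 1) - 1.0)
            y += p * (2.0 * i / (height - 1) - 1.0)
    return x, y
```

A uniform activation map yields a keypoint at the image center, while a sharply peaked map yields a keypoint at the location of the peak.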

WTL and Alg. 1 allow for general representations of the trial policy and retrial policy, so long as they integrate the task conditioning information: the trial policy conditions on demonstration trajectories, while the retrial policy conditions on both demonstration and trial trajectories. To enable WTL to flexibly choose how to integrate the information coming from demonstrations and trials, we condition the policies by embedding trajectory information into a context vector using a neural network. This architectural choice resembles prior contextual meta-learning works Duan et al. (2016); Mishra et al. (2018); Duan et al. (2017); Rakelly et al. (2019), which have previously considered how to meta-learn efficiently from one modality of data (trials or demonstrations), but not how to integrate multiple sources, including off-policy trial data.

We illustrate the policy architecture in Figure 2. Given a demonstration trajectory or trial trajectory we embed each timestep’s state, action, and reward (recall Eq. 1) to produce per-timestep embeddings. In the retrial case where we have both demo and trial embeddings per-timestep, we concatenate them together. We then apply 1-D convolutions to the per-timestep embeddings, followed by a flatten operation and fully connected layers to produce a single context embedding vector for the actor network. At the output, we use mixture density networks (MDN) Bishop (1994) to represent the distribution over actions. Hence, the actor network receives the context embedding and produces the parameters of a Gaussian mixture.
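To make the mixture density output concrete, the following sketch evaluates the log-likelihood of an action under a Gaussian mixture, which is the quantity a maximum-likelihood imitation loss would use. Diagonal covariances are an assumption made here for simplicity; the function name and argument layout are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def mixture_log_likelihood(
    action: Sequence[float],
    weights: Sequence[float],            # mixture weights, positive, summing to 1
    means: Sequence[Sequence[float]],    # one mean vector per component
    stds: Sequence[Sequence[float]],     # one per-dimension std vector per component
) -> float:
    """log p(action) under a mixture of diagonal Gaussians."""
    component_logps = []
    for w, mu, sigma in zip(weights, means, stds):
        logp = math.log(w)
        for a, m, s in zip(action, mu, sigma):
            logp += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        component_logps.append(logp)
    # log-sum-exp over components for numerical stability
    top = max(component_logps)
    return top + math.log(sum(math.exp(lp - top) for lp in component_logps))
```

The imitation loss for one timestep would then be the negative of this value, with the weights, means, and standard deviations produced by the actor network.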

4 Experiments

In our experiments, we aim to evaluate our method on challenging few-shot learning domains that span multiple task families, where the agent must use both demonstrations and trial-and-error to effectively infer a policy for the task. Prior meta-imitation benchmarks Duan et al. (2017); Finn et al. (2017c) generally contain only a few tasks, and these tasks can be easily disambiguated given a single demonstration. Meanwhile, prior meta-reinforcement learning benchmarks Duan et al. (2016); Finn et al. (2017a) tend to contain fairly similar tasks that a meta-learner can solve with little exploration and no demonstration at all. Motivated by these shortcomings, we design two new problems where a meta-learner can leverage a combination of demonstration and trial experience: a toy reaching problem and a challenging multitask gripper control problem, described below. We evaluate how the following methods perform in those environments:

BC: A behavior cloning method that does not condition on either demonstration or trial-and-error experience, trained across all meta-training data. We train BC policies using maximum log-likelihood with expert demonstration actions.

MIL Finn et al. (2017b); James et al. (2018): A meta-imitation learning method that conditions on demonstration data, but does not leverage trial-and-error experience. We train MIL policies to minimize Eq. 2 just like trial policies, but MIL methods lack a retrial step. To perform a controlled comparison, we use the same architecture for both MIL and WTL.

WTL: Our Watch-Try-Learn method, which conditions on demonstration and trial experience. In all experiments, we consider the setting where the agent receives one demonstration and can take one trial.

BC + SAC: In the gripper environment we study how much trial-and-error experience state-of-the-art reinforcement learning algorithms would require to solve a single task. While WTL meta-learns a single model that needs just one trial episode per meta-test task, in “BC + SAC” we fine-tune a separate RL agent for each meta-test task and analyze how much trial experience it needs to match WTL’s single-trial performance. In particular, we pre-train a policy on the meta-training demonstrations similar to BC, then fine-tune for each meta-test task using soft actor-critic (SAC) Haarnoja et al. (2018), where the SAC actor is initialized with our BC pre-trained policy.

Figure 3: Left: Our reaching toy environment with per-task randomized dynamics. Right: Average return of each method on held out meta-test tasks, after one demonstration and one trial. Our Watch-Try-Learn (WTL) method is quickly able to learn to imitate the demonstrator. Each line shows the average over separate training runs with identical hyperparameters, evaluated on randomly sampled meta-test tasks. Shaded regions indicate confidence intervals.

4.1 Reaching Environment Experiments

To first verify that our method can actually leverage demonstration and trial experience in a simplified problem domain, we begin with a family of toy planar reaching tasks inspired by Finn et al. (2017b) and shown in Fig. 3. The demonstrator shows which of two objects to reach towards, but crucially the agent’s dynamics are unknown at the start of a task and are not necessarily the same as the demonstrator’s: each of the two joints may, with some probability, have reversed orientation. This simulates a domain-adaptive setting (e.g., human and robot dynamics will be completely different when imitating from a human video demonstration). Since observing the demonstration alone is insufficient, a meta-learner must use its own trial episodes to identify the task dynamics and successfully reach the target object. The environment returns a reward at each timestep, which penalizes the reacher’s distance to the target object, as well as the magnitude of its control inputs at that timestep.
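The randomized dynamics and the per-timestep reward can be sketched as follows. The flip probability, the control-cost weight `control_cost`, and the use of a Euclidean norm for the control magnitude are all assumptions made for illustration; the text only states that distance and control magnitude are penalized.

```python
import math
import random
from typing import List, Sequence


def sample_joint_signs(rng: random.Random, flip_prob: float) -> List[float]:
    """Per-task dynamics: each of the two joints is reversed with prob. flip_prob."""
    return [-1.0 if rng.random() < flip_prob else 1.0 for _ in range(2)]


def apply_dynamics(command: Sequence[float], joint_signs: Sequence[float]) -> List[float]:
    """A reversed joint moves opposite to the commanded direction."""
    return [sign * c for sign, c in zip(joint_signs, command)]


def step_reward(
    fingertip: Sequence[float],
    target: Sequence[float],
    command: Sequence[float],
    control_cost: float = 0.1,  # hypothetical weight
) -> float:
    """Negative distance to the target minus a penalty on control magnitude."""
    distance = math.hypot(fingertip[0] - target[0], fingertip[1] - target[1])
    effort = math.hypot(command[0], command[1])
    return -distance - control_cost * effort
```

An agent that copies the demonstrator's commands on a task with a reversed joint moves away from the target, which is why a single trial is informative here.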

To obtain expert demonstration data, we first train a reinforcement learning agent using normalized advantage functions (NAF) Gu et al. (2016), where the agent receives oracle observations that show only the true target. With our trained expert demonstration agent, we collect 2 demonstrations per task for 10000 meta-training tasks and 1000 meta-test tasks. For these toy experiments we use simplified versions of the architecture in Fig. 2 as described in Appendix B.

We show the results of each method in the reaching environment in Fig. 3; the WTL results are the retrial episode returns. Our method is quickly able to learn to imitate the expert, while methods that do not leverage trial information struggle due to the uncertainty in the task dynamics. Finally, Fig. 3 shows that single-task RL agents typically require tens of thousands of episodes to approach the same performance our meta-trained WTL method achieves after one demonstration and a single trial.

4.2 Gripper Environment Experiments

Figure 4: Illustration of the four distinct task families: button pressing, grasping, sliding, and pick-and-place. We meta-train each model on hundreds of tasks from each of these task families, involving nearly one hundred different kitchenware objects.

We design the gripper environment to include tasks from four broad task families: button pressing, grasping, pushing, and pick and place. Unlike prior meta-imitation learning benchmarks, tasks of the same family that are qualitatively similar still have subtle but consequential differences. Some pushing tasks might require the agent to always push the left object towards the right object, while others might require the agent to always push, for example, the cup towards the teapot. Appendix A describes each task family and its variations.

The gripper environment is a realistic 3-D simulation created using the Bullet physics engine Coumans and Bai (2016). The agent controls a floating gripper that can move and rotate freely in space (6 DOFs), and 2 symmetric fingers that open and close together (1 DOF). The virtual scene is set up with the same viewpoint and table as in Figure 4 with random initial positions of the gripper, as well as two random kitchenware objects and a button panel that are placed on the table. The button panel contains two press-able buttons of different colors. One of the two kitchenware objects is placed onto a random location on the left half of the table, and the other kitchenware object is placed onto a random location on the right half of the table. For vision policies, the environment provides image observations (RGB images from the fixed viewpoint) and the 7-D gripper position vector at each timestep. For state-space policies it instead provides the 7-D gripper state and the 6-D poses (translation + rotation) of all non-fixed objects in the scene. The environment returns a binary reward that indicates whether the agent has successfully completed the task.

For each task, the demonstration trajectories are recorded by a human controlling the gripper through the HTC Vive virtual reality setup. Between the demonstration episode and the trial/retrial episodes, the kitchenware objects and button panel are repositioned: a kitchenware object on the left half of the table moves to a random location on the right half of the table, and vice versa. Similarly, the two colored buttons in the button panel swap positions. Swapping the lateral positions of objects and buttons after the demonstration is crucial because otherwise, for example, the difference between the two types of pushing tasks would be meaningless.

In our virtual reality setup, a human demonstrator recorded demonstrations for a collection of distinct tasks involving distinct sets of kitchenware objects. We held out the tasks corresponding to a subset of these object sets for our meta-validation dataset, which we used for hyperparameter selection. Similarly, we selected and held out the tasks of another subset of object sets for our meta-test dataset, which we used for final evaluations.

Figure 5: The average success rate of different methods in the gripper control environment, for both state space (non-vision) and vision based policies. The leftmost column displays aggregate results across all task families. Our Watch-Try-Learn (WTL) method significantly outperforms the meta-imitation (MIL) baseline, which in turn outperforms the behavior cloning (BC) baseline. We conducted training runs of each method with identical hyperparameters and evaluated each run on held out meta-test tasks. Error bars indicate confidence intervals.
Method Success Rate
WTL, 1 trial (ours)
RL fine-tuning with SAC
BC + SAC, 900 trials
BC + SAC, 2000 trials
Table 1: Average success rates across meta-test tasks using state space observations. For BC + SAC we pre-train with behavior cloning and use RL to fine-tune a separate agent on each meta-test task. The table shows BC + SAC performance after 900 and 2000 trials per task.

We trained and evaluated MIL, BC, and WTL policies with both state-space observations and vision observations. Sec. 3.3 describes our WTL trial and retrial architectures, Appendix B describes hyperparameter selection using the meta-validation tasks, and Appendix B.3 analyzes the sample and time complexity of WTL. The MIL policy uses an identical architecture and objective to the WTL trial policy, while the BC policy architecture is the same as the WTL trial policy without any of the embedding components. For vision-based models, we crop and resize image observations before providing them as input.

We show the meta-test task success rates in Fig. 5, where the WTL success rate is simply the success rate of the retrial policy. Overall, in both state space and vision domains, we find that WTL outperforms MIL and BC by a substantial margin, indicating that it can effectively leverage information from the trial and integrate it with that of the demonstration in order to achieve greater performance.

Finally, for the BC + SAC comparison we pre-trained an actor with behavior cloning and fine-tuned RL agents per task with identical hyperparameters using the TFAgents Guadarrama et al. (2018) SAC implementation. Table 1 shows that BC + SAC fine-tuning typically requires hundreds of trial episodes per task to reach the same performance our meta-trained WTL method achieves after one demonstration and a single trial episode. Appendix B.4 shows the BC + SAC training curves averaged across the different meta-test tasks.

5 Discussion

We proposed a meta-learning algorithm that allows an agent to quickly learn new behavior from a single demonstration followed by trial experience and associated (possibly sparse) rewards. The demonstration allows the agent to infer the type of task to be performed, and the trials enable it to improve its performance by resolving ambiguities in new test time situations. We presented experimental results where the agent is meta-trained on a broad distribution of tasks, after which it is able to quickly learn tasks with new held-out objects from just one demonstration and a trial. We showed that our approach outperforms prior meta-imitation approaches in challenging experimental domains. We hope that this work paves the way towards more practical algorithms for meta-learning behavior. In particular, we believe that the Watch-Try-Learn approach enables a natural way for non-expert users to train agents to perform new tasks: by demonstrating the task and then observing and critiquing the performance of the agent on the task if it initially fails. In future work, we hope to explore the potential of such an approach to be meta-trained on a much broader range of tasks, testing the performance of the agent on completely new held-out tasks rather than on held-out objects. We believe that this will be enabled by the unique combination of demonstrations and trials in the inner loop of a meta-learning system, since the demonstration guides the exploration process for subsequent trials, and the use of trials allows the agent to learn new task objectives which may not have been seen during meta-training.

6 Acknowledgements

We would like to thank Luke Metz and Archit Sharma for reviewing an earlier draft of this paper, and also Alex Irpan for valuable discussions.


Appendix A Gripper Environment Task Types

Our gripper environment has four broad task families: button pressing, grasping, sliding, and pick and place. But tasks in each family may come in one of two types:

  • Button pressing 1: the demonstrator presses the left (respectively, right) button, and the agent must press the left (respectively, right) button.

  • Button pressing 2: the demonstrator presses one of the two buttons, and the agent must press the button of the same color.

  • Grasping 1: the demonstrator grasps and lifts the left (respectively, right) object. The agent must grasp and lift the left (respectively, right) object.

  • Grasping 2: the demonstrator grasps and lifts one of the two objects. The agent must grasp and lift that same object.

  • Sliding 1: the demonstrator pushes the left object into the right object (respectively, the right object into the left object). The agent must push the left object into the right object (respectively, the right object into the left object).

  • Sliding 2: the demonstrator pushes one object (A) into the other (B). The agent must push object A into object B.

  • Pick and Place 1: the demonstrator picks up one of the objects and places it on the near (respectively, far) edge of the table. The agent must pick up the same object and place it on the near (respectively, far) edge of the table.

  • Pick and Place 2: the demonstrator picks up one of the objects and places it on the left (respectively, right) edge of the table. The agent must pick up the same object and place it on the left (respectively, right) edge of the table.

Appendix B Experimental Details

We trained all policies using the Adam optimizer Kingma and Ba [2015] on varying numbers of Nvidia Tesla P100 GPUs. Whenever there is more than one GPU, we use synchronized gradient updates, and the batch size refers to the batch size of each individual GPU worker.
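As a rough illustration of the synchronized updates described above, the sketch below averages per-worker gradients before applying one shared update, so the number of examples behind each update is the per-worker batch size times the number of workers. The function names, the plain SGD step (the paper trains with Adam), and all numeric values are illustrative assumptions rather than the authors' implementation.

```python
def average_gradients(worker_grads):
    """Average a list of per-worker gradient vectors elementwise."""
    num_workers = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / num_workers for i in range(dim)]


def synchronized_step(params, worker_grads, learning_rate):
    """Apply one shared update computed from the averaged gradient.

    Plain SGD is used here only for brevity; the paper trains with Adam.
    """
    avg = average_gradients(worker_grads)
    return [p - learning_rate * g for p, g in zip(params, avg)]


def effective_batch_size(per_worker_batch, num_workers):
    """Total number of examples contributing to one synchronized update."""
    return per_worker_batch * num_workers


# Hypothetical values, not taken from the paper.
params = [1.0, -2.0]
grads = [[0.5, 1.0], [1.5, -1.0]]  # one gradient vector per worker
new_params = synchronized_step(params, grads, learning_rate=0.1)
```

Because every worker applies the same averaged gradient, all copies of the parameters stay identical after each step.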

B.1 Reacher Environment Models

For the reacher toy problem, every neural network in both the WTL policies and the baseline policies uses the same architecture: two hidden layers of 100 neurons each, with ReLU activations on each layer except the output layer, which has no activation.
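A minimal sketch of the architecture described above, written in plain Python for readability. Only the two 100-unit hidden layers, the ReLU activations, and the linear output come from the text; the input and output sizes, the initialization scheme, and the function names are assumptions made for illustration.

```python
import random


def init_mlp(sizes, seed=0):
    """Create weights and biases for a fully connected network.

    sizes: layer widths, e.g. [in_dim, 100, 100, out_dim].
    """
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / fan_in) ** 0.5  # He-style scale; an assumption
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Run the network: ReLU after every layer except the last."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


# Hypothetical input and output sizes; the hidden widths follow the text.
net = init_mlp([4, 100, 100, 2])
action = forward(net, [0.1, -0.2, 0.3, 0.0])
```

Leaving the output layer without an activation lets the network produce unbounded, possibly negative outputs.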

Rather than mixture density networks, our BC and MIL policies are deterministic policies trained by standard mean squared error (MSE). The WTL retrial policy is also deterministic and trained by MSE, while the WTL trial policy is stochastic and samples actions from a Gaussian distribution with learned mean and diagonal covariance (equivalently, it is a simplification of the MDN to a single component). We found that a deterministic policy works best for maximizing the MIL baseline’s average return, while a stochastic WTL trial policy achieves lower returns itself but facilitates easier learning for the WTL retrial policy. Embedding architectures are also simplified: the demonstration “embedding” is simply the state of the last demonstration trajectory timestep, while the trial embedding architecture replaces the temporal convolution and flatten operations with a simple average over per-timestep trial embeddings.
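The pieces named in this paragraph can be sketched as follows: an MSE loss for the deterministic policies, a diagonal Gaussian for the stochastic trial policy, and a trial embedding formed by averaging per-timestep embeddings. Parameterizing the Gaussian by a log standard deviation, and all function names, are assumptions for illustration and not details given in the paper.

```python
import math
import random


def mse_loss(predicted, target):
    """Mean squared error, as used for the deterministic policies."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def gaussian_nll(mean, log_std, action):
    """Negative log-likelihood of an action under a diagonal Gaussian."""
    total = 0.0
    for m, ls, a in zip(mean, log_std, action):
        var = math.exp(2.0 * ls)
        total += 0.5 * ((a - m) ** 2 / var + 2.0 * ls
                        + math.log(2.0 * math.pi))
    return total


def sample_action(mean, log_std, rng):
    """Draw one action from the diagonal Gaussian trial policy."""
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]


def average_embedding(per_timestep_embeddings):
    """Trial embedding as the mean of the per-timestep embeddings."""
    steps = len(per_timestep_embeddings)
    dim = len(per_timestep_embeddings[0])
    return [sum(e[i] for e in per_timestep_embeddings) / steps
            for i in range(dim)]
```

A single-component Gaussian trained by maximum likelihood reduces to a scaled squared error when the variance is held fixed, which is one way to see the relationship between the stochastic trial policy and the MSE-trained deterministic policies.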

We trained all policies for 50000 steps using a batch size of tasks and a learning rate, using a single GPU operating at 25 gradient steps per second.

B.2 Gripper Environment Models

Method    Learning Rate    MDN Components
Vision Models
  WTL                      20
  MIL                      20
  BC                       20
State-Space Models
  WTL                      20
  MIL                      20
  BC                       20
Table 2: Best Model Hyperparameters for Gripper Environment Experiments

For each state-space and vision method we ran a simple hyperparameter sweep. We tried and MDN mixture components, and for each we sampled learning rates log-uniformly from the range (state-space) or (vision), leading to hyperparameter experiments per method. We selected the best hyperparameters by average success rate on the meta-validation tasks; see Table 2. We trained all gripper models with a GPU workers and a batch size of tasks for (state-space) and (vision) steps. Training the WTL trial and re-trial policies on the vision-based gripper environment takes 14 hours each.
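The sweep procedure can be sketched as below: learning rates are drawn log-uniformly for each candidate number of mixture components, and the configuration with the highest meta-validation success rate is kept. The component options, learning-rate range, sample count, and function names are placeholders chosen for illustration; they are not the values used in the paper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Sample a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def build_sweep(component_options, lr_low, lr_high, samples_per_option, seed=0):
    """Return a list of (num_components, learning_rate) configurations."""
    rng = random.Random(seed)
    return [(k, sample_log_uniform(lr_low, lr_high, rng))
            for k in component_options
            for _ in range(samples_per_option)]


def select_best(configs, success_rates):
    """Pick the configuration with the highest meta-validation success rate."""
    best_index = max(range(len(configs)), key=lambda i: success_rates[i])
    return configs[best_index]


# All values below are placeholders, not the settings used in the paper.
sweep = build_sweep(component_options=[5, 20], lr_low=1e-5, lr_high=1e-2,
                    samples_per_option=3)
```

Sampling in log space spreads the candidates evenly across orders of magnitude, which suits learning rates better than uniform sampling over the raw range.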

B.3 Algorithmic Complexity of WTL

WTL’s trial and re-trial policies are trained off-policy via imitation learning, so only two phases of data collection are required: the set of demonstrations for training the trial policy, and the set of trial policy rollouts for training the re-trial policy. The time and sample complexity of data collection is linear in the number of tasks, since only one demo and one trial are required per task. Furthermore, because the optimization of the trial and re-trial policies is decoupled into separate learning stages, the sample complexity is fixed with respect to hyperparameter sweeps. A fixed dataset of demos is used to obtain a good trial policy, and a fixed dataset of trials (from the trial policy) is used to obtain a good re-trial policy.
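The two decoupled stages can be sketched as a short pipeline. Every function name and argument here is hypothetical: `demonstrate`, `rollout`, `fit_trial_policy`, and `fit_retrial_policy` stand in for the demonstrator, the environment, and the two imitation-learning procedures, none of which are specified at this level in the text.

```python
def collect_demos(tasks, demonstrate):
    """Phase 1 data: one demonstration per task."""
    return {task: demonstrate(task) for task in tasks}


def collect_trials(tasks, demos, trial_policy, rollout):
    """Phase 2 data: one trial-policy rollout per task, given its demo."""
    return {task: rollout(task, trial_policy, demos[task]) for task in tasks}


def train_wtl(tasks, demonstrate, rollout, fit_trial_policy, fit_retrial_policy):
    """Two decoupled stages; each dataset is collected once and then reused."""
    demos = collect_demos(tasks, demonstrate)
    trial_policy = fit_trial_policy(demos)
    trials = collect_trials(tasks, demos, trial_policy, rollout)
    retrial_policy = fit_retrial_policy(demos, trials)
    return trial_policy, retrial_policy, demos, trials
```

Each task contributes exactly one call to `demonstrate` and one to `rollout`, so data collection scales linearly with the number of tasks, and rerunning either fitting step with new hyperparameters reuses the stored datasets without collecting more data.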

The forward-backward pass of the vision network is the dominant factor in the computational cost of training the vision-based retrial architecture. Thus, the time complexity for a single SGD update to WTL is , where is the number of sub-sampled frames used to compute the demo and trial visual features for forming the contextual embedding. However, in practice the model size and number of sub-sampled frames are small enough that computing embeddings for all frames in demos and trials can be vectorized efficiently on GPUs.

B.4 BC + SAC Baseline Training Details

Figure 6: We pre-train agents with behavior cloning and use RL to fine-tune on each meta-test task. X-axis shows environment experience steps collected per task. For comparison, our WTL method requires only a single trial ( environment steps) per task. Error bars indicate confidence intervals.