Scalable Multi-Task Imitation Learning with Autonomous Improvement

by Avi Singh, et al.

While robot learning has demonstrated promising results for enabling robots to automatically acquire new skills, a critical challenge in deploying learning-based systems is scale: acquiring enough data for the robot to effectively generalize broadly. Imitation learning, in particular, has remained a stable and powerful approach for robot learning, but critically relies on expert operators for data collection. In this work, we target this challenge, aiming to build an imitation learning system that can continuously improve through autonomous data collection, while simultaneously avoiding the explicit use of reinforcement learning, to maintain the stability, simplicity, and scalability of supervised imitation. To accomplish this, we cast the problem of imitation with autonomous improvement into a multi-task setting. We utilize the insight that, in a multi-task setting, a failed attempt at one task might represent a successful attempt at another task. This allows us to leverage the robot's own trials as demonstrations for tasks other than the one that the robot actually attempted. Using an initial dataset of multi-task demonstration data, the robot autonomously collects trials which are only sparsely labeled with a binary indication of whether the trial accomplished any useful task or not. We then embed the trials into a learned latent space of tasks, trained using only the initial demonstration dataset, to draw similarities between various trials, enabling the robot to achieve one-shot generalization to new tasks. In contrast to prior imitation learning approaches, our method can autonomously collect data with sparse supervision for continuous improvement, and in contrast to reinforcement learning algorithms, our method can effectively improve from sparse, task-agnostic reward signals.




I Introduction & Related Work

Fig. 1: In this figure, we describe our problem setting and differentiate it from the standard imitation learning and meta-imitation learning problem settings. In standard imitation learning, we are provided a dataset of demonstrations for a specific task, and we learn a policy to mimic the behavior in the demonstrations. In meta-imitation learning, we are provided with several such datasets drawn from a task distribution, and we learn a single meta-policy that can generalize to new tasks in this distribution from a single demonstration. We also consider a one-shot imitation learning setting in this paper; however, unlike standard meta-imitation, our method can also utilize autonomously collected experience. After filtering out trials that do not succeed on any task, our method adds the autonomously collected trials to the meta-training set, matching them with existing tasks using an automated pairing method. We use this newly collected and organized data to update the policy, establishing a learning loop that can perpetually improve from autonomously collected data.

Robotic learning holds the potential to enable generalist robots: robotic systems that can autonomously perform a wide range of different behaviors. In order to enable robots to be generalists, equipped with large repertoires of skills, we need them to be able to acquire each skill from a relatively modest amount of experience, and to make it feasible to add new skills using only a little bit of data. Such generalist systems not only have tremendous promise for applications across a range of domains that cannot currently be automated, but would also represent a significant step forward in artificial intelligence research. Two of the most successful approaches to general-purpose robotic learning are imitation learning, where a robot uses human-provided demonstration data to learn skills [1, 2], and reinforcement learning (RL), where it uses a trial-and-error learning process [3, 4].

When combined with high-capacity function approximators, such as deep neural networks, both techniques have been shown to enable complex robotic skills directly from low-level sensory observations, such as images [2, 5]. However, both techniques also have substantial limitations, some of which we will aim to mitigate in this work. First, the combination of deep networks with imitation learning and reinforcement learning results in very high data complexity. In imitation learning, this means that human operators must provide a large number of demonstrations for each task, while in the case of reinforcement learning, this translates into very large autonomous data collection requirements, particularly to overcome the challenges of exploration [6]. With the data requirements in both cases, it is thus difficult to scale these techniques to enable truly generalist robots equipped with large repertoires of skills. The two approaches also have some distinct strengths and weaknesses. Reinforcement learning enables a robot to improve, perhaps perpetually, from its own experience [5, 7, 8], but RL-based policies are generally more limited in the complexity of the tasks they can accomplish, due to the difficulty of discovering task solutions autonomously and the complexity of the RL optimization problem [9, 10, 11]. By comparison, imitation learning methods present a much simpler supervised learning problem, making them suitable for more complex skills [12, 2], but they require a more manual process of collecting human demonstration data [2], and the robot cannot continue to improve autonomously from its own experience [13]. This last point in particular is a major shortcoming of current imitation learning approaches: the capacity for continuous and autonomous self-improvement is an exceptionally powerful feature of RL.

Fig. 2: The MILI algorithm. We bootstrap a one-shot imitation policy using a multi-task imitation learning dataset. We then use this policy to collect trials in new environments. A latent task space, learned using the same initial dataset, is then used to find similarities in the collected trials, and generate new tasks for meta-imitation learning. We update our meta-policy using the newly collected data, and repeat this process until convergence.

Multi-task [14] and meta-imitation learning [15, 16, 17, 18, 19, 20, 21] alleviate the data collection problem to some extent by reusing data across many different tasks. These methods typically learn a parameterized policy that is a function of both the current observation and the task, thus obtaining a single policy that is capable of performing several different tasks and generalizing to new tasks. For multi-task imitation learning, the task is typically specified in the form of a task index, while in meta-imitation, the task is specified using a single demonstration for that task. Thus, meta-imitation learning systems learn a one-shot imitation policy, capable of performing a new task from just a single demonstration for it. However, these approaches still require manually collecting data for hundreds of different tasks, and cannot improve from autonomously collected data. In this paper, we will also consider a one-shot imitation learning problem. However, unlike these prior works, our goal is to develop an algorithm that can improve its one-shot imitation performance using additional autonomously-collected data.

To this end, we utilize the following insight: if a robot is trained in a multi-task setting consisting of multiple distinct tasks with hundreds of different objects, it is likely to perform some useful behavior when it attempts a task in a new setting, even if it is not the behavior that was actually intended. This data can then be leveraged to learn the skill that was actually performed. This insight can be viewed as a generalization of prior works on goal-conditioned learning via hindsight goal relabeling [22, 23, 24, 25], but in the space of tasks rather than the space of goals to be reached. It is this insight, and how we can leverage it in a meta-imitation learning framework, that provides the novel contribution of this work: a framework for autonomous data collection and improvement without explicit reinforcement learning, that instead leverages a supervised meta-imitation learning method to enable a robot to be bootstrapped off of demonstration data, and then improve its repertoire of skills via its own autonomously collected experience.

Our approach, which we call Multi-task Imitation Learning with Improvement (MILI), is summarized in Figure 2. We bootstrap a meta imitation policy by performing one-shot imitation learning on a human-collected dataset of paired demonstrations, which consists of pairs of optimal trajectories corresponding to the same task (following the standard meta-imitation learning set-up in prior work [26, 16]). We then use this policy to collect trials autonomously. For each such trial, a user needs to provide an answer to the following yes/no question: did the robot succeed at any task of interest? If the answer is yes, this trajectory is added to an augmented meta-training set. This binary label can be provided via an automated reward function or direct human feedback. We use the initial demonstration dataset of paired demonstrations to also learn a latent space of tasks, where demonstrations corresponding to the same tasks are pushed close to each other, and demonstrations corresponding to different tasks are pushed apart. We then embed all of the autonomously collected trajectories into this learned task space, and generate a dataset of paired trajectories by pairing trajectories that are close to each other in the latent space.

The core contribution of this paper is a framework that allows imitation learning systems to continuously improve through autonomous data collection and learning. We empirically study a challenging, vision-based manipulation setting consisting of hundreds of tasks from four distinct task families: button pressing, sliding, grasping, and pick-and-place. Our framework allows for substantial improvements over standard imitation learning and meta-imitation learning approaches and, critically, has the potential to continuously improve itself without human-provided demonstrations.

II Preliminaries

In this section, we summarize the one-shot imitation learning problem setting considered in prior work [15, 16, 21, 19, 18]. We first formalize the single-task imitation learning setting, and then extend it to the one-shot setting.

Imitation Learning via Behavior Cloning. In the standard single-task imitation learning setting [1, 27], we are provided a dataset of expert demonstrations $\mathcal{D}$. Each demonstration consists of a trajectory of observations and actions denoting optimal behavior for that task, and we need to learn a policy that can mimic this expert behavior. While there are several ways to perform imitation learning from expert demonstrations, such as inverse reinforcement learning [28, 29, 30, 31, 32] and occupancy matching [33], we consider behavior cloning [1] in this paper due to its stability and ease of use. We train a policy $\pi_\theta$, parameterized using a neural network with parameters $\theta$, that takes the observation $o$ as input and outputs a distribution over actions. The parameters $\theta$ of the policy are trained with stochastic gradient descent to minimize the following loss function:


where $\pi_\theta(a \mid o)$ represents the distribution over actions $a$ for observation $o$.
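For concreteness, a behavior cloning objective of this kind is commonly written as a negative log-likelihood over the expert's observation-action pairs. The form below is a sketch under that assumption; the symbols $\mathcal{L}_{\text{BC}}$, $d$, $o_t$ and $a_t$ are notation chosen here for illustration and may differ from the authors' exact expression.

```latex
\mathcal{L}_{\text{BC}}(\theta, \mathcal{D})
  = -\sum_{d \in \mathcal{D}} \; \sum_{(o_t, a_t) \in d} \log \pi_\theta(a_t \mid o_t)
```

Here $d$ ranges over demonstrations in the dataset and $(o_t, a_t)$ over the observation-action pairs within one demonstration.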

One-Shot Imitation Learning. In the one-shot imitation learning setting [26, 16], the goal is to learn a meta-policy that can adapt to new, unseen tasks from just a single demonstration for that task. In order to achieve this for tasks with high-dimensional observations such as pixels, which typically require large datasets to learn an effective policy [2, 34], we need to transfer knowledge from demonstrations of previously seen tasks to the task at hand. Thus, instead of assuming access to expert demonstrations for a single task, one-shot imitation learning assumes an unknown distribution of tasks $p(\mathcal{T})$, and is provided with a set of tasks drawn from this distribution, which are called meta-training tasks. More concretely, for each training task $\mathcal{T}_i$, we have access to a set of demonstrations $\mathcal{D}_i$. Different tasks contain different objects, and different actions can be performed on those objects. For example, as shown in Figure 4, a mug could be picked up, a plate could be pushed across a table, a button could be pressed, and a glass could be placed on a tray. The combination of an action and an object constitutes a unique task.

One-shot imitation learning techniques learn a meta-policy $\pi_\theta$, which takes as input both the current observation $o$ and a demonstration $d$ corresponding to the task which is to be performed, and outputs a distribution over actions. The demonstration specifies to the meta-policy what task is to be performed, and conditioning on different demos can lead to different tasks being performed for the same observation. At training time, we first sample a task $\mathcal{T}_i$, and then sample two demonstrations $d_m$ and $d_n$ corresponding to this task. We condition the meta-policy on one of these two demonstrations, say demonstration $d_m$, and optimize the following loss on the expert observation-action pairs from the other demonstration, $d_n$:

We obtain the complete one-shot imitation learning loss by summing across all tasks and all possible demonstration pairs that can be drawn from the same task:


where $N$ is the total number of training tasks.
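As a sketch of the two losses described in this subsection, assume the same negative log-likelihood form as in behavior cloning, with the policy additionally conditioned on a demonstration. The notation is illustrative rather than the authors' own: $d_m$ and $d_n$ are two demonstrations of task $\mathcal{T}_i$, $\mathcal{D}_i$ is that task's demonstration set, $N$ is the number of training tasks, and $\mathcal{L}_{\text{pair}}$ and $\mathcal{L}_{\text{OS}}$ are hypothetical names.

```latex
% Loss for one demonstration pair: condition on d_m, imitate d_n
\mathcal{L}_{\text{pair}}(\theta, d_m, d_n)
  = -\sum_{(o_t, a_t) \in d_n} \log \pi_\theta(a_t \mid o_t, d_m)

% Summed over all training tasks and all ordered pairs within a task
\mathcal{L}_{\text{OS}}(\theta)
  = \sum_{i=1}^{N} \; \sum_{\substack{d_m, d_n \in \mathcal{D}_i \\ m \neq n}}
    \mathcal{L}_{\text{pair}}(\theta, d_m, d_n)
```

With four demonstrations per task, as in the experiments reported later, the inner sum would run over twelve ordered pairs per task.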

III Problem Statement

In this section, we describe our problem setting, and highlight its differences from the standard imitation and reinforcement learning problem settings. Similar to the one-shot imitation learning problem setting described in Section II, our goal is to learn a meta-policy that can adapt to new, unseen tasks from just a single demonstration for that task, and we assume access to a set of expert trajectories $\mathcal{D}$ for a subset of the meta-training tasks. However, unlike the one-shot imitation learning setting, which only assumes access to a static set of demonstrations, we also assume that we can attempt new trials on the meta-training tasks using a meta-policy trained only on $\mathcal{D}$.

This is similar to the multi-task reinforcement learning and meta-reinforcement learning settings [35, 36], but we do not assume access to any task-specific reward functions for any of the tasks. We may also wish to constrain the space of learnable behaviors so that the robot avoids treating “knocking objects onto the floor” or “not touching any objects” as meaningful tasks. This requires access to a filtering function, which operates on the trial and returns TRUE if any useful behavior was performed during the trial. This could be automated via a learned classifier for “useful” behaviors, or implemented manually via a human annotator who annotates trajectories collected by the meta-policy. Note that the human annotator does not need to specify what task was achieved by the robot in a particular trial: it could be pushing a bowl, picking up a mug or pressing a button, but the human does not need to provide a task label, as long as the behavior is “useful”. This makes it possible to scale to hundreds of tasks without increasing the difficulty of the annotation. If the filtering function is not provided, the robot can in principle learn to perform all possible behaviors, including undesirable behaviors such as throwing objects onto the floor. While this is not necessarily unreasonable, it could in practice crowd out the desirable behaviors and increase training time.

IV MILI: Multi-Task Imitation Learning with Improvement

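1:  Input: Training tasks, demonstration dataset
2:  Input: Data collection batch size
3:  Input: Pairing threshold
4:  {One-shot imitation pre-training}
5:  Train the meta-policy on the demonstration dataset (see Equation 3)
6:  {Autonomous data collection and improvement}
7:  Initialize empty trial dataset
8:  for each iteration do
9:     for each trial in the batch do
10:         Sample a training task
11:         Sample a demo from that task's demonstrations
12:         Collect a trial from the task with the meta-policy conditioned on the demo
13:         if the filtering function returns TRUE for the trial then
14:            Add the trial to the trial dataset
15:         end if
16:     end for
17:     for each pair of trials in the trial dataset do
18:         Compute the similarity between the embeddings of the two trials
19:         if the similarity exceeds the pairing threshold then
20:            Create a new task dataset from the pair
21:            Add the new task dataset to the demonstration dataset
22:         end if
23:     end for
24:     Update the meta-policy on the augmented dataset (see Equation 2)
25:  end for
26:  return the meta-policy
Algorithm 1 MILI: Multi-Task Imitation Learning with Improvement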
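The control flow of Algorithm 1 can be sketched in Python as one round of collection, filtering, pairing and retraining. This is a minimal illustration and not the authors' implementation: `mili_round`, the `Trial` and `Dataset` types, and the callables `collect_trial`, `is_useful`, `similarity` and `train` are all hypothetical stand-ins for the meta-policy rollout, the filtering function, the latent-space comparison and the meta-imitation update.

```python
import random
from itertools import combinations
from typing import Callable, Dict, List

Trial = List[float]                # hypothetical stand-in for a recorded trajectory
Dataset = Dict[str, List[Trial]]   # task name -> trajectories for that task


def mili_round(
    dataset: Dataset,
    trials: List[Trial],
    collect_trial: Callable[[str, Trial], Trial],
    is_useful: Callable[[Trial], bool],
    similarity: Callable[[Trial, Trial], float],
    train: Callable[[Dataset], None],
    batch_size: int,
    threshold: float,
    rng: random.Random,
) -> Dataset:
    """Run one collect / filter / pair / retrain round (illustrative only)."""
    task_names = sorted(dataset)
    for _ in range(batch_size):
        task = rng.choice(task_names)        # sample a training task
        demo = rng.choice(dataset[task])     # sample a demo to condition on
        trial = collect_trial(task, demo)    # roll out the meta-policy
        if is_useful(trial):                 # task-agnostic binary filter
            trials.append(trial)

    # Copy the lists so the caller's dataset is left untouched.
    augmented = {name: list(demos) for name, demos in dataset.items()}
    for i, j in combinations(range(len(trials)), 2):
        if similarity(trials[i], trials[j]) > threshold:
            # Each sufficiently similar pair becomes its own new task.
            augmented[f"new_task_{i}_{j}"] = [trials[i], trials[j]]

    train(augmented)                         # supervised meta-imitation update
    return augmented
```

The `trials` list is passed in and extended in place, so calling `mili_round` repeatedly with the same list mirrors the algorithm's single trial dataset that persists across iterations.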

As shown in Figure 2, we extend a multi-task, meta-imitation learning pipeline with the ability to learn from the policy’s own experience, while preserving the stability and simplicity of supervised learning methods. In order to achieve this, we need to (a) learn a policy that can perform a diverse set of behaviors in a variety of different environments, (b) utilize this policy to autonomously collect data in new environments, and (c) learn an improved one-shot imitation policy that leverages both the initial dataset and the autonomously collected data. For obtaining a data collection policy, we perform meta-imitation learning on a human-provided dataset. We condition this meta-policy on random demos for collecting trials in new environments. We also learn a latent space of skills using the same initial dataset, which is then used for organizing the collected trials into new tasks. We now run meta-imitation learning on this expanded dataset, resulting in an improved policy that utilizes both human-provided and autonomously collected data.

Learning a data collection policy. Naïvely performing behavior cloning (as described in Section II) on the human-provided dataset is unlikely to result in diverse behavior: since a robot can perform many useful tasks in any given scene, cloning the actions from all human-provided trajectories without providing any context will lead to an averaging of the different possible actions, and the behavior generated by the policy will not be useful. Therefore, we instead train a demo-conditioned meta-policy using the loss function described in Equation 2 (see Section II). We can now use this meta-policy to collect trials in new scenes by conditioning it on different demonstrations from the human-provided dataset.

Utilizing autonomously collected data. A straightforward way to improve our meta-policy’s performance using the collected trials would be to optimize the one-shot imitation learning loss defined in Equation 2 on the collected trajectories. However, in order to optimize this loss, we need at least two successful trials for any given task, as described in Section II. That is, we need at least two trials depicting optimal behavior for the same task for it to be useful for learning. Since our filtering function only labels trials as being useful for any task, we cannot use it for assigning trajectories to specific tasks. If we can find two trials that perform a similar behavior, we can add them to our dataset as corresponding to a new task.

However, finding similar trials can be non-trivial: we wish to learn policies from high-dimensional observation spaces such as visual inputs, which implies that our trials consist of videos, and finding the distance between two videos by computing a standard distance metric like the Euclidean distance is unlikely to result in useful pairings. This motivates the need to learn a latent space of tasks, in which we can embed any given trial, and compute meaningful distances between the trials. If two trials are found to be close to each other in the latent space, we can add them to the dataset as a pair of demonstrations corresponding to a new task, and update the meta-policy to minimize the loss in Equation 2. In Section IV-A, we describe how we can learn a latent space of tasks jointly with the meta-policy, using only the human-provided one-shot imitation dataset and no extra supervision. In Section IV-B, we describe how we utilize a learned latent space of tasks and meta-policy to improve from autonomously collected data.

IV-A Learning a Latent Task Space Jointly with the Policy

Following prior work in one-shot imitation learning [26, 18, 20], we learn a meta-policy that consists of two neural networks: an embedding network and a policy network, where $\theta$ represents the parameters of the two neural networks (shown together in Figure 3). Below, we detail the roles performed by these networks, their architecture, and the loss functions that we use to train them.

Embedding network and contrastive loss functions. The embedding network accepts as input a demo of the task to be performed. It consists of a convolutional neural network followed by 1-D temporal convolutions, and embeds the demo into a fixed-length vector, which we refer to as the demo embedding. Intuitively, we would like the embeddings to satisfy the following property: we want two demo embeddings to be close to each other if the demos correspond to the same task, and we wish them to be further apart for demos that belong to different tasks. Formally, this can be accomplished by considering a distance function between two demo embeddings: the distance should be low when the two demos correspond to the same task, and it should be high when they correspond to different tasks. Contrastive loss functions [37] satisfy this property, and are given by the following expression:

where $m$ is a margin (which we set to 1.0).
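To make the role of the margin concrete, here is a minimal sketch of a margin-based contrastive loss for a single pair of demo embeddings, in the general style of [37]. It assumes a Euclidean distance and squared penalty terms; the distance function, any scaling constants and the exact form used by the authors are not specified here, and `euclidean` and `contrastive_loss` are hypothetical helper names.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(
    u: Sequence[float],
    v: Sequence[float],
    same_task: bool,
    margin: float = 1.0,
) -> float:
    """Contrastive loss for one pair of embeddings (assumed squared form)."""
    distance = euclidean(u, v)
    if same_task:
        # Same task: any separation is penalised, pulling the pair together.
        return distance ** 2
    # Different tasks: penalised only while the pair is closer than the margin.
    return max(0.0, margin - distance) ** 2
```

Under this form, embeddings of different tasks that are already at least one margin apart contribute no loss, so training effort concentrates on pairs that are still confusable.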

Fig. 3: We train visuomotor meta-imitation policies end-to-end, based on the setup described by [20]. Upper left and right: RGB observations are mapped to visual features, which are transformed into desired gripper positions. Lower: The task embedding network aggregates visual features over a sequence of image observations (obtained from a demonstration episode or policy rollout) into a “task embedding”, which is used to condition multi-task behavior from the policy network. Unlike [20], the task embedding network does not condition on observed gripper positions in the demo/rollouts.

Policy network. The remaining part of the meta-policy consists of a policy network. It takes as input an image of the current scene (along with other parts of the robot’s state such as end-effector pose), and outputs a distribution over actions. In order to learn to predict actions from this information, we use the one-shot imitation learning loss in Equation 2 in Section II. Our complete loss function,


is minimized using the Adam [38] optimizer, and we train both the embedding and policy networks jointly, sharing the convolutional layers between them. This procedure yields the meta-policy $\pi_\theta$, where $\theta$ denotes the learned parameters.

IV-B Autonomous Improvement

Our method is summarized in Alg. 1. We utilize the learned meta-policy for collecting data in new scenes. Given a scene, we condition the learned meta-policy on a randomly sampled demo from the initial dataset (see Lines 9-11 in Alg. 1). We then run the meta-policy in this new scene, and add the resulting trial to the trial dataset if the filtering function deems the trial to have performed some useful task (the function does not provide any label about what the performed task was). We then search the trial dataset for similar trials by computing the cosine similarity (i.e. a normalized dot-product) between the embedding of the newly collected trial and the embeddings of all trials collected in the past (see Lines 17-18 in Alg. 1). If the cosine similarity for any pair of trials is found to be above a pre-specified threshold (which we determine using cross-validation to be 0.9), we add the pair of trials to the dataset as corresponding to a new task (see Lines 19-21 in Alg. 1). We then update our meta-policy using the one-shot imitation learning loss (see Line 24 in Alg. 1).
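The pairing step can be illustrated with a short sketch that compares every pair of trial embeddings by normalized dot product and keeps the pairs above the threshold. The function names `cosine_similarity` and `pair_trials` are hypothetical, the embeddings are assumed to be plain lists of floats, and treating a zero-length embedding as dissimilar to everything is a choice made here for robustness.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Normalized dot product of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_trials(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine_similarity(embeddings[i], embeddings[j]) > threshold
    ]
```

This exhaustive comparison is quadratic in the number of trials; with thousands of filtered trials a batched matrix product or an approximate nearest-neighbour index would likely be preferable, though the text does not say which approach was used.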

V Experiments

Our experiments seek to answer the following questions:

  1. Does our method enable autonomous improvement? That is, can the trials generated by the policy in new environments improve one-shot imitation performance in comparison to only meta-imitation learning or only behavior cloning on a static dataset?

  2. How does the performance vary with the number of trials collected?

  3. How does the performance of our learned latent space-based pairing model compare to an oracle that pairs the autonomously collected data optimally?

Fig. 4: Our dataset of tasks consists of four distinct task families: button pressing, grasping, pushing and pick and place. Within each task family, we have hundreds of tasks, and our dataset contains over a hundred distinct kitchenware objects. Our train set has 520 tasks, while our validation and test sets have 40 tasks each. We tune hyperparameters on the validation tasks, and report final performance on the test tasks.

We conduct our experiments on a realistic 3D simulation created using the Bullet physics engine [39], shown in Figure 4. It consists of a 7-DoF robotic gripper controlled with continuous position control from visual observations at 10Hz to accomplish manipulation tasks from four distinct task families, and contains over a hundred different kitchenware objects [40]. The four task families that we consider in our experiments are: button pressing, grasping, pushing and pick and place. Example tasks from each of the four families are shown in Figure 4. Different tasks within these families correspond to different sets of objects and manipulating those objects in different manners. For example, a pushing task in a scene containing both a mug and a bowl could be to push the bowl to the mug, or it could be to push the mug to the bowl. To succeed at a task, the robot must be able to perform the task from multiple different initial object arrangements such that it cannot simply memorize the motion from the one provided demonstration. The policy network, visualized in Figure 3, uses pixel inputs for all the tasks, i.e. an RGB image of size 100×100, which gets passed in to the control network along with the 7-DoF gripper pose, three dimensions of which correspond to its position in 3D space, three correspond to the Euler angles, and one corresponds to the finger angle. The output of the policy is the desired 7-DoF pose of the gripper, represented as the 3D position. We train on 520 tasks, use a validation set of 40 tasks for hyperparameter tuning, and show performance on a test set of 40 tasks, which contain held-out objects.

Demonstrations for the initial dataset are collected by a human using an HTC Vive virtual reality system. Four demonstrations are collected for each task, and the object positions are randomized between demonstrations. At evaluation time, we are provided with one demonstration for a given test task, and the policy performance is evaluated with an object arrangement that is different from the demonstration.

Fig. 5: We evaluate one-shot imitation performance of our policy on 40 unseen test tasks from four distinct task families. We ran our method, as well as both of the comparisons, with five random seeds, and report the average final performance and standard error across seeds. All methods use the same human demonstration data. We see that our method outperforms the meta-imitation learning baseline for all task families, indicating that the autonomously collected experience results in a better policy.
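The reporting convention of averaging over seeds with a standard error can be sketched as follows. This assumes the usual definition of standard error, namely the sample standard deviation divided by the square root of the number of seeds; `mean_and_standard_error` is a hypothetical helper, and the success rates in the usage line are made-up placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean; needs at least two values."""
    mean = statistics.fmean(values)
    # statistics.stdev uses the sample (n - 1) estimator.
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Placeholder per-seed success rates for five seeds (illustrative only).
mean, sem = mean_and_standard_error([0.30, 0.32, 0.34, 0.33, 0.31])
print(f"{mean:.3f} +/- {sem:.3f}")
```

With only five seeds the standard error is itself a noisy estimate, so small gaps between methods should be read with that in mind.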

V-A Autonomous Improvement

We first evaluate the central question of our paper: can the robot collect trials in new scenes and use them to improve its performance? We bootstrap the meta-policy from 800 human teleoperation demos on 200 different tasks, and learn a one-shot imitation meta-policy as well as a latent task space from this data (see Line 5 in Algorithm 1). We collect 60K trials from 150 different scenes. We find that 8.2K of these trials pass the filtering function and get added to our trial dataset. Our method finds pairs of similar trials in the trial dataset using the pairing threshold, adds them to the training dataset, and re-runs meta-imitation learning (see Lines 17-24 in Algorithm 1). The performance of this policy is shown in Figure 5, and videos can be found on our project website. While it is possible to run several iterations of our method for continuous improvement, we only evaluate one round of our method in this section. We compare our method against:

Meta-imitation. This corresponds to optimizing the loss function described in Equation 2 on the human demonstration dataset, and is representative of prior work [18].

Behavior cloning. This corresponds to optimizing the loss function described in Equation 1 on all the demos from the human demonstration dataset.

As shown in Figure 5, our method substantially outperforms the meta-imitation learning baseline, achieving a relative improvement of 36.8%. The meta-imitation baseline in turn outperforms the behavior cloning baseline, which is expected since the behavior cloning baseline does not incorporate task-specific information. This shows that our proposed method is capable of incorporating policy roll-outs in the learning process so as to improve the policy’s one-shot imitation performance on new, unseen tasks.

V-B Varying the Number of Trials

A central claim in our paper is that a policy can continuously improve if it is able to collect more and more data and use it for learning. In this section, we aim to answer the following question: do we achieve monotonically improving performance as we collect more trials and add them to the dataset? To answer this question, we run our method with five random seeds while varying the number of trials collected: 10K, 20K, 40K and 60K. The results of this experiment are summarized in Figure 6, and show the following trend: as we keep collecting more trials, the one-shot imitation performance of our meta-policy for unseen tasks keeps improving. The relative improvements in performance are highest for the first 10K trials collected, but the performance starts levelling off as we collect more data.

V-C Using Oracle Task Labels

Evaluating meta-imitation learning on an optimally paired dataset of policy rollouts allows us to measure what success rates we might hope MILI to achieve if we learned a perfect latent space. This oracle achieves a one-shot imitation success rate of as compared to MILI’s 32.7%, suggesting that MILI can drastically reduce the burden of labeling tasks without significant loss in performance.

Fig. 6: Performance of MILI improves as we increase the number of trials.

VI Discussion and Future Work

In this paper, we study how meta-imitation learning can be used to enable an agent to improve with autonomously collected data. Our aim is to preserve all of the desirable properties of imitation learning – simplicity and stability – while augmenting our method with one of the most powerful features of RL: the ability to improve from autonomously collected trials. Our approach is based on a simple observation: in a multi-task setting, the robot’s attempts to perform a new task, even if unsuccessful, may still serve as successful trials for other tasks. Therefore, if we can match them with corresponding prior trials and thereby form new task datasets for meta-learning, we can incorporate the robot’s own attempts as pseudo-demonstrations for other tasks. Unlike RL, our method does not require any task-specific reward function, only a general, task-agnostic filtering function that indicates whether a given trial is useful for any task. This makes it straightforward to scale our approach to large skill repertoires. After bootstrapping our meta-imitation policy from a few hundred human-provided demonstrations, we can collect tens of thousands of autonomous trials, and use them to meta-train a final policy that substantially outperforms standard meta-imitation learning on the initial dataset.

Limitations. Our work opens several possible directions for future work. Currently, our method still requires a filtering function, which means we need to obtain a binary label from a human oracle for each trial that we collect to decide if it should be included in the training set. While this signal is easy to obtain, future work might investigate learning-based methods that perform this filtering automatically, further boosting the scalability of this approach. Further, while our method improves with more data, the overall success rates are still relatively low due to the challenging nature of the environment. We expect that further advances in vision-based meta-learning will complement the approach in this paper to boost performance. Finally, another exciting direction for future work is to devise a complete lifelong learning approach based on our method, performing repeated iterations of data collection and meta-imitation on a real-world robotic system so as to continually improve the policy on both the original set of tasks and new tasks discovered during data collection.