Transferring Hierarchical Structure with Dual Meta Imitation Learning

by   Chongkai Gao, et al.
Tsinghua University

Hierarchical Imitation Learning (HIL) is an effective way for robots to learn sub-skills from long-horizon unsegmented demonstrations. However, the learned hierarchical structure lacks a mechanism for transfer across multiple tasks or to new tasks, forcing it to be learned from scratch in each new situation. Transferring and reorganizing modular sub-skills requires fast adaptation of the whole hierarchical structure. In this work, we propose Dual Meta Imitation Learning (DMIL), a hierarchical meta imitation learning method in which the high-level network and the sub-skills are iteratively meta-learned with model-agnostic meta-learning. DMIL uses the likelihood of state-action pairs under each sub-skill as the supervision for high-level network adaptation, and uses the adapted high-level network to determine a distinct data set for each sub-skill's adaptation. We theoretically prove the convergence of the iterative training process of DMIL and establish the connection between DMIL and the Expectation-Maximization algorithm. Empirically, we achieve state-of-the-art few-shot imitation learning performance on the Meta-world <cit.> benchmark and competitive results on long-horizon tasks in the Kitchen environment.




1 Introduction

Imitation learning (IL) has shown promising results for intelligent robots conveniently acquiring skills from expert demonstrations reinforceandimitation; deepmimic. Nevertheless, imitating long-horizon unsegmented demonstrations has been a challenge for IL algorithms because of the well-known issue of compounding errors areductionofimitation. This is a crucial problem for applying IL methods to robots, since many practical manipulation tasks are long-horizon. Hierarchical Imitation Learning (HIL) aims to tackle this problem by decomposing long-horizon tasks with a hierarchical model, in which a set of sub-skills accomplish specific parts of the long-horizon task, and a high-level network is responsible for switching among sub-skills over the course of the task. Such a hierarchical structure is usually modeled with Options ddo; ddco; optiongail or goal-conditioned IL paradigms hierarchicalimitationlearning. HIL reflects how humans solve complex tasks, and has been considered a valuable direction for IL algorithms analtorithmicperspectiveonimitationlearning.

However, most current HIL methods have no explicit mechanism to transfer previously learned sub-skills to new tasks given few-shot demonstrations. This requirement arises because the learned hierarchical structure may conflict with discrepant situations in new tasks. As shown in Figure 1(a), both the high-level network and the sub-skills need to be transferred to new forms to satisfy new requirements: the high-level network needs new ways to schedule sub-skills in new tasks (for example, calling different sub-skills at the same state), and each sub-skill needs to adapt to new specific forms in new tasks (for example, grasping different kinds of objects). An approach is needed that endows HIL with the ability to simultaneously transfer both the high-level network and the sub-skills from few-shot new-task demonstrations; this would significantly increase the generalization ability of HIL methods and allow them to be applied to a wider range of scenarios.

(a) Illustration of the bi-level transfer problem of HIL in new tasks.
(b) Comparison of MIL and DMIL.
Figure 1: (a) Both the high-level network and sub-skills need to be transferred to new tasks. Above: when the robot arm is over a half-open drawer, the task can be either opening or closing the drawer, which requires the high-level network to call different sub-skills. Below: the same sub-skill pick-place may take different specific forms in new tasks. (b) DMIL aims to integrate MAML into HIL with a novel iterative optimization procedure that meta-learns both the high-level network and the sub-skills.

Recently, meta imitation learning (MIL) methods oneshotimitationfromobservinghumans; oneshotimitationvisuallearningviametalearning; oneshothierarchicalimitationlearning have employed model-agnostic meta-learning (MAML) maml in the imitation learning procedure to enable the learned policy to quickly adapt to new tasks with few-shot demonstrations. MAML first fine-tunes the policy network in the inner loop, then evaluates the fine-tuned network to update its initial parameters with end-to-end gradient descent in the outer loop. The success of MIL inspires us to integrate MAML into HIL to transfer the hierarchical structure to new tasks. However, this is not straightforward. MAML can be directly applied to IL methods because most of them learn a monolithic policy in an end-to-end fashion, which conforms to the original setting of MAML. In HIL, by contrast, the bi-level policy network is trained in an iterative and self-supervised paradigm. Intuitively, this bi-level optimization procedure makes it necessary to add MAML to both levels of HIL to transfer the whole structure. Since MAML itself is also a bi-level process, scheduling the inner and outer loops of the meta-learning processes of both levels of HIL so that training converges becomes a difficult problem.
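As background, the inner/outer-loop structure of MAML can be sketched on a toy scalar regression problem. This is a first-order, single-parameter illustration with made-up task data, not the full second-order MAML applied to policy networks by the methods cited above:

```python
import numpy as np

def loss_grad(theta, X, y):
    """Gradient of the MSE loss 0.5*mean((X*theta - y)^2) w.r.t. the scalar theta."""
    return np.mean((X * theta - y) * X)

def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05):
    """One meta-update: adapt on each task's support set (inner loop), evaluate
    the adapted parameter on the query set, and average the (first-order)
    meta-gradients to update the shared initialization (outer loop)."""
    meta_grads = []
    for (Xs, ys, Xq, yq) in tasks:
        theta_adapted = theta - inner_lr * loss_grad(theta, Xs, ys)  # inner loop
        meta_grads.append(loss_grad(theta_adapted, Xq, yq))          # outer eval
    return theta - outer_lr * np.mean(meta_grads)
```

After repeated meta-updates, the initialization moves to a point from which one inner-loop step fits each task well.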

In this work, we propose a novel hierarchical meta imitation learning framework called Dual Meta Imitation Learning (DMIL) that successfully incorporates MAML into the iterative training process of HIL, as shown in Figure 1(b). We first adopt the EM-like HIL method ddo as our basic bi-level HIL structure, where the high-level network and the sub-skills mutually supervise each other: the likelihood of each state-action pair in few-shot demonstrations under each sub-skill provides supervision for training the high-level network, and the high-level network in turn determines the data sets for training each sub-skill. We then design an elaborate bi-level MAML procedure for this hierarchical structure so that it can be fully meta-learned. In this procedure, we first fine-tune the high-level network and the sub-skills in sequence in the inner loop, then meta-update them simultaneously in the outer loop. We theoretically prove the convergence of this training procedure by leveraging previous results from amortizedbayesianmeta; pmaml; gradientem to reframe both MAML and DMIL as hierarchical Bayes inference processes, and derive the convergence of DMIL from established convergence results for MAML convergenceofmaml.

We test our method on the challenging Meta-world benchmark environments metaworld and the Kitchen environment of the D4RL benchmark d4rl. In our experiments, we successfully acquire a set of meaningful sub-skills from a large set of manipulation tasks, and achieve state-of-the-art few-shot imitation learning performance on the ML45 suite. In summary, the main contributions of this paper are as follows:

  • We propose DMIL, a novel hierarchical meta imitation learning framework that meta-learns both the high-level network and sub-skills from unsegmented multi-task demonstrations in a general EM-like fashion.

  • We propose a novel training algorithm for DMIL to schedule the meta-learning processes of its bi-level networks and theoretically prove its convergence.

  • We achieve state-of-the-art few-shot imitation learning performance on the Meta-world benchmark environments and competitive results in the Kitchen environment.

2 Related Work

2.1 Hierarchical Imitation Learning

Recovering the inherent sub-skills contained in expert demonstrations and then reusing them with hierarchical structures has long been an interesting topic in the hierarchical imitation learning (HIL) domain. According to whether pretraining tasks are available, we can divide HIL methods into two categories. The first kind manually designs a set of simple pretraining tasks that encourage distinct skills or primitives, and then learns a high-level network to master the switching of primitives to accomplish complex tasks stochasticnn; mcp; pretrain1; pretrain2; hierarchicalimitationlearning. However, for unsegmented demonstrations where no pretraining tasks are provided, which is the situation in this paper, these methods cannot be applied.

The second kind of methods learn sub-skills with unsupervised learning. ddo; ddco acquire Options option from demonstrations with an Expectation-Maximization-like procedure and use the Baum-Welch algorithm to estimate the parameters of different options. optiongan; optiongail integrate generative adversarial networks into the option discovery process. infogail; directedinfogail; learningcompoundtaskswithouttaskspecificknowledge incorporate the generative adversarial imitation learning gail framework and an information-theoretic metric infogan to simultaneously imitate the expert and maximize the mutual information between latent sub-skill categories and the corresponding trajectories, yielding decoupled sub-skills. Mixture-of-experts (MoE) methods compute a weighted sum over all primitives to produce the action, rather than using only one primitive at each time step moe1; moe2; moe3. Other methods seek an appropriate latent space into which sub-skills can be mapped, and then condition a policy on the latent variable to reuse sub-skills latentplay; latentspace; learninganembeddingspace; fist; spirl.

For transferring learned sub-skills, some works fine-tune the whole structure on new tasks fist. However, the performance of fine-tuning depends entirely on the generalization of deep networks, which may vary across tasks and network designs.

2.2 Meta Imitation Learning

Meta imitation learning, or one-shot imitation learning, leverages various meta-learning methods and multi-task demonstrations to meta-learn a policy that can be quickly adapted to a new task with few-shot new-task demonstrations. oneshotimitationlearning; transformerbasedmeta employ self-attention modules to process the whole demonstration and the current observation to predict the current action. oneshothierarchicalimitationlearning; oneshotimitationvisuallearningviametalearning; oneshotimitationfromobservinghumans use model-agnostic meta-learning (MAML) maml to achieve one-shot imitation for various manipulation tasks from robot or human visual demonstrations. learningaprioroverintent; metairl propose to meta-learn a robust reward function that can be quickly adapted to new tasks and then used to perform IRL in those tasks. However, they require downstream inverse reinforcement learning after the adaptation of the reward function, which conflicts with our goal of few-shot adaptation. Most of the above methods learn only one monolithic policy and lack the ability to model multiple sub-skills in long-horizon tasks. Some works tackle the multi-modal data problem in meta-learning by avoiding a single parameter initialization across all tasks multimodalmaml; modularmetalearning; metalearningsharedhierarchies; hierarchicallystructuredmetalearning, but they lack a mechanism to schedule the switching of different sub-skills over time. Some works also meta-learn a set of sub-skills in a hierarchical structure oneshothierarchicalimitationlearning; metalearningsharedhierarchies, but they either use manually designed pretraining tasks or relearn the high-level network in new tasks, which is not appropriate in few-shot imitation learning settings.

3 Method

3.1 Preliminaries

We denote a discrete-time finite-horizon Markov decision process (MDP) as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, P, r, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T$ is the time horizon, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $r(s_t, a_t)$ is the reward function, and $\rho_0(s_0)$ is the distribution of the initial state $s_0$.

Figure 2: The iterative meta-learning process of DMIL at each iteration. Left: the supervision of high-level network (sub-skill categories) comes from the most accurate sub-skill (the green one, sub-skill 1 here). Right: the sub-skill updated at current step (the green one, sub-skill 0 here) is determined by the fine-tuned high-level network.

3.2 Formulation of Meta Imitation Learning Problem

We first introduce the general setting of the meta imitation learning problem. The goal of meta imitation learning is to extract common knowledge from a set of robot manipulation tasks drawn from the same task distribution $p(\mathcal{T})$, and adapt it to new tasks quickly with few-shot new-task demonstrations. As in the model-agnostic meta-learning algorithm (MAML) maml, we formalize the common knowledge as the initial parameters $\theta$ of the policy network, which can be efficiently adapted with new-task gradients.

For each task $\mathcal{T}_i$, a set of demonstrations $\mathcal{D}_i$ is provided, where $\mathcal{D}_i$ consists of $N$ demonstration trajectories $\{\tau_1, \dots, \tau_N\}$, and each $\tau_j$ consists of a sequence of state-action pairs $\{(s_1, a_1), \dots, (s_{H_j}, a_{H_j})\}$, where $H_j$ is the length of $\tau_j$. Each $\mathcal{D}_i$ is randomly split into a support set $\mathcal{D}_i^{tr}$ and a query set $\mathcal{D}_i^{val}$ for meta-training and meta-testing respectively. During the training phase, we sample tasks from $p(\mathcal{T})$; in each task $\mathcal{T}_i$, we use $\mathcal{D}_i^{tr}$ to fine-tune $\theta$ with gradient descent to get the adapted task-specific parameters $\theta_i'$, then evaluate them on $\mathcal{D}_i^{val}$ to get the meta-gradient of $\theta$, and we optimize the initial parameters with the average of the meta-gradients from all tasks. As in oneshotimitationvisuallearningviametalearning; oneshotimitationlearning, we use the behavior cloning bc loss as our metric for meta-training and meta-testing. It trains a policy $\pi_\theta$ that maximizes the likelihood of expert actions, i.e., $\max_\theta \frac{1}{M}\sum_{(s_t, a_t)} \log \pi_\theta(a_t \mid s_t)$, where $M$ is the number of provided state-action pairs. We denote the loss function of this optimization problem as $\mathcal{L}_{BC}(\theta, \mathcal{D})$, and the general objective of the meta imitation learning problem is:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{BC}\big(\theta_i', \mathcal{D}_i^{val}\big), \tag{1}$$

where $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{BC}(\theta, \mathcal{D}_i^{tr})$, and $\alpha$ is a hyper-parameter representing the inner-update learning rate.

3.3 Dual Meta Imitation Learning (DMIL)

In this work we assume that at each time step $t$, the robot may switch to a different sub-skill to accomplish the task. We define the sub-skill category at each time step as $c_t \in \{0, 1, \dots, K-1\}$, where $K$ is the maximum number of sub-skills. We assume a successful trajectory of a task is generated by several (at least one) sub-skill policies, i.e., $\pi_E(a_t \mid s_t) = \pi_{\phi_{c_t}}(a_t \mid s_t)$, where $\pi_E$ represents the expert policy. Our goal is to learn such a hierarchical structure from multi-task demonstrations in an unsupervised fashion. In our model, a high-level network $\pi_\theta$ parameterized by $\theta$ determines the sub-skill category $\hat{c}_t$ at each time step $t$, and the $\hat{c}_t$-th sub-skill $\pi_{\phi_{\hat{c}_t}}$ among the $K$ different sub-skills is called to predict the corresponding action $\hat{a}_t$ for state $s_t$, where the hat symbol denotes a predicted result. We use $\theta'$ and $\phi'$ to represent the adapted parameters of $\theta$ and $\phi = \{\phi_0, \dots, \phi_{K-1}\}$ respectively. In order to achieve few-shot learning in new tasks, we condition the high-level network only on states, i.e., $\hat{c}_t = \pi_\theta(s_t)$, since at test time we only have access to states and not to action information.
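As a concrete illustration of this bi-level policy, the following sketch uses random linear maps for both levels; the dimensions, weights, and argmax selection rule are illustrative assumptions, not the paper's architecture (the paper uses fully-connected networks for both levels):

```python
import numpy as np

STATE_DIM, ACTION_DIM, K = 4, 2, 3  # hypothetical sizes

rng = np.random.default_rng(0)
W_high = rng.normal(size=(K, STATE_DIM))                # high-level "network"
W_skills = rng.normal(size=(K, ACTION_DIM, STATE_DIM))  # K sub-skill "networks"

def high_level(state):
    """Predict a sub-skill category from the state only: no action is needed,
    so the same network can be used at test time."""
    logits = W_high @ state
    return int(np.argmax(logits))

def act(state):
    c = high_level(state)        # which sub-skill to call at this time step
    return W_skills[c] @ state   # the selected sub-skill predicts the action

state = rng.normal(size=STATE_DIM)
action = act(state)
```

Only the high-level selection changes which parameters produce the action; the sub-skills never see each other's data at execution time.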

DMIL first fine-tunes both $\theta$ and $\phi$ and then meta-updates them. In a new task, $\pi_\theta$ may not provide correct sub-skill categories, as stated in the introduction. However, the sub-skills retain the ability to supervise the high-level network using knowledge learned from previous tasks and the few-shot demonstrations, because most robot manipulation tasks are composed of a set of shared basis skills such as reach, push, and pick-place. As shown on the left side of Figure 2, the sub-skill whose prediction is closest to the expert action can serve as supervision for the high-level network, which learns to classify the corresponding state into this sub-skill. On the other hand, the adapted high-level network can classify each data point in the provided demonstrations to a sub-skill, giving each sub-skill its own data for fine-tuning, as shown on the right side of Figure 2. In summary, DMIL performs four steps in one training iteration. We call them High-Inner-Update (HI), Low-Inner-Update (LI), High-Outer-Update (HO), and Low-Outer-Update (LO), which represent the fine-tuning and meta-updating processes of the bi-level networks respectively. The key problem is how to arrange these optimization steps to ensure convergence. We first introduce the four steps formally, then discuss how to schedule them in the next section. The whole procedure is summarized in Algorithm 1.

HI: For each sampled task $\mathcal{T}_i$, we sample a first batch of trajectories $\mathcal{D}_{i,1}$ from $\mathcal{D}_i^{tr}$. The principle of this step is to use the sub-skill that predicts the action closest to the expert action to provide a self-supervised category ground truth for training the high-level network, which is a classifier in form. We pass every state-action pair $(s_t, a_t)$ directly to each sub-skill, compute $\mathcal{L}_{BC}(\phi_k, (s_t, a_t))$, and choose the ground truth at each time step as the sub-skill category that minimizes it:

$$c_t = \arg\min_{k} \mathcal{L}_{BC}\big(\phi_k, (s_t, a_t)\big). \tag{2}$$

Then we get the predicted sub-skill categories from the high-level network, $\hat{c}_t = \pi_\theta(s_t)$, and use a cross-entropy loss to train the high-level network:

$$\mathcal{L}_{H}\big(\theta, \mathcal{D}_{i,1}\big) = \sum_t \mathrm{CE}\big(c_t, \pi_\theta(s_t)\big). \tag{3}$$

Finally, we perform gradient descent on the high-level network and get $\theta'$. Note that $\phi$ is frozen here.
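The self-supervised labeling at the heart of the HI step can be sketched as follows, under two simplifying assumptions that are not the paper's setup: each sub-skill is a plain linear map, and its behavior-cloning loss on a pair is the squared prediction error:

```python
import numpy as np

def subskill_labels(states, actions, skills):
    """For each (s_t, a_t), label it with the sub-skill whose predicted action
    is closest to the expert action (the argmin over per-skill BC losses)."""
    # errors[k, t] = ||skills[k] @ s_t - a_t||^2
    errors = np.stack([np.sum((states @ W.T - actions) ** 2, axis=1)
                       for W in skills])
    return np.argmin(errors, axis=0)

def cross_entropy(logits, labels):
    """Cross-entropy between high-level logits and the self-supervised labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])
```

Gradient descent on `cross_entropy` w.r.t. the high-level parameters (frozen sub-skills) would then give the adapted high-level network of the HI step.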

LI: We sample a second batch of trajectories $\mathcal{D}_{i,2}$ from $\mathcal{D}_i^{tr}$. The adapted high-level network $\pi_{\theta'}$ processes each state in $\mathcal{D}_{i,2}$ to get a sub-skill category at each time step; thus we obtain $K$ separate data sets $\mathcal{D}_{i,2}^{k}$, one for each sub-skill. Then we compute the adaptation loss for each sub-skill on its corresponding data set. In the case of a continuous action space, we assume actions follow Gaussian distributions, so we have:

$$\mathcal{L}_{BC}\big(\phi_k, \mathcal{D}_{i,2}^{k}\big) = -\frac{1}{M_k} \sum_{(s_t, a_t) \in \mathcal{D}_{i,2}^{k}} \log \pi_{\phi_k}(a_t \mid s_t), \tag{4}$$

where $M_k$ is the number of state-action pairs in $\mathcal{D}_{i,2}^{k}$. Finally, we perform gradient descent on the sub-skills and get $\phi'$. Note that $\theta'$ is frozen in this process.
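The LI step's data partitioning and Gaussian behavior-cloning loss can be sketched as follows. Unit-variance Gaussians are a simplifying assumption here, under which the negative log-likelihood reduces to squared error up to an additive constant; `high_level` is an illustrative stand-in for the adapted high-level network:

```python
import numpy as np

def partition_by_skill(states, high_level, K):
    """Return, for each sub-skill k, the indices of the states that the
    (adapted) high-level network assigns to it."""
    cats = np.array([high_level(s) for s in states])
    return [np.where(cats == k)[0] for k in range(K)]

def gaussian_bc_loss(pred_actions, expert_actions):
    """-log N(a | mu=pred, sigma=I) averaged over pairs, dropping constants."""
    return 0.5 * np.mean(np.sum((pred_actions - expert_actions) ** 2, axis=1))
```

Each sub-skill would then take a gradient step on `gaussian_bc_loss` computed over its own index set only, with the high-level network frozen.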

HO: We sample a third batch of trajectories $\mathcal{D}_{i,3}$ from $\mathcal{D}_i^{val}$ and compute the ground-truth categories $c_t$ as in the HI process. Then we use this batch to compute the meta-gradient of the high-level network, which equals:

$$g_\theta^i = \nabla_\theta \mathcal{L}_{H}\big(\theta', \mathcal{D}_{i,3}\big). \tag{5}$$

LO: We sample a fourth batch $\mathcal{D}_{i,4}$ from $\mathcal{D}_i^{val}$ and partition it with $\pi_{\theta'}$ as in the LI process, then use it to compute the meta-gradient of the sub-skills, which equals:

$$g_\phi^i = \nabla_\phi \sum_{k} \mathcal{L}_{BC}\big(\phi_k', \mathcal{D}_{i,4}^{k}\big). \tag{6}$$
Note that after processing all sampled tasks, we average the meta-gradients across tasks and perform gradient descent on the initial parameters together, updating the high-level parameters $\theta$ and the sub-skill parameters $\phi$ simultaneously; i.e., we do not update them within the HO and LO steps themselves. This is crucial to ensure convergence.

For testing, although our method nominally needs two batches of trajectories for one round of adaptation, in practice we find that using a single trajectory to perform both HI and LI also works well in new tasks, so DMIL can satisfy the one-shot imitation learning requirement. In addition to the above process, we also add an auxiliary loss that encourages meaningful sub-skills by discouraging excessively frequent switching between sub-skills over time. Detailed information can be found in Appendix B.
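The update schedule just described — immediate adaptation in HI/LI, but stored meta-gradients in HO/LO that are applied jointly after all tasks — can be sketched as follows. All names (`hi_grad`, `li_grad`, `ho_grad`, `lo_grad`, `sample_batches`) are illustrative placeholders for the corresponding gradient computations, and the sketch is first-order:

```python
def dmil_iteration(theta, phi, tasks, inner_lr, outer_lr,
                   hi_grad, li_grad, ho_grad, lo_grad):
    """One schematic DMIL training iteration over a batch of tasks."""
    theta_meta_grads, phi_meta_grads = [], []
    for task in tasks:
        b1, b2, b3, b4 = task.sample_batches(4)
        # HI: adapt the high level immediately (sub-skills frozen).
        theta_a = theta - inner_lr * hi_grad(theta, phi, b1)
        # LI: adapt the sub-skills on data partitioned by the adapted high level.
        phi_a = phi - inner_lr * li_grad(theta_a, phi, b2)
        # HO / LO: only STORE the meta-gradients; no parameter update yet.
        theta_meta_grads.append(ho_grad(theta_a, phi, b3))
        phi_meta_grads.append(lo_grad(theta_a, phi_a, b4))
    # Meta-update both levels together with the averaged meta-gradients.
    theta = theta - outer_lr * sum(theta_meta_grads) / len(tasks)
    phi = phi - outer_lr * sum(phi_meta_grads) / len(tasks)
    return theta, phi
```

Deferring the outer updates until after the loop is the schedule whose convergence is analyzed in the next section.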

4 Theoretical Analysis

DMIL is a novel iterative hierarchical meta-learning procedure, and its convergence needs to be proved to ensure feasibility. As stated in the previous section, what makes DMIL special is that in HI and LI we update the parameters of each module immediately, whereas in HO and LO we store the gradients of each part and update all parameters simultaneously. In this section, we show that this schedule makes DMIL converge by rewriting both MAML and DMIL as hierarchical variational Bayes problems and establishing the equivalence between them, since the convergence of MAML is proved in convergenceofmaml. Proofs of all theorems are in Appendix C.

4.1 Hierarchical Variational Bayes Formulation of MAML

According to amortizedbayesianmeta, MAML can be viewed as a hierarchical variational Bayes inference process. The general meta-learning objective (1), which equals maximizing $\log p(\mathcal{D} \mid \theta)$ (we write $\mathcal{D}$ for the collection of task data sets for short), can be formulated as follows:

$$\log p(\mathcal{D} \mid \theta) \geq \sum_i \Big[ \mathbb{E}_{q_{\lambda_i}(z_i)} \log p\big(\mathcal{D}_i^{val} \mid z_i\big) - \mathrm{KL}\big(q_{\lambda_i}(z_i) \,\big\|\, p(z_i \mid \theta)\big) \Big], \tag{7}$$

where the $z_i$ represent the local latent variables for task $\mathcal{T}_i$, and the $\lambda_i$ are the variational parameters of the approximate posteriors over $z_i$. We write $\lambda_i = \lambda_i(\theta, \mathcal{D}_i^{tr})$ to mean that $\lambda_i$, and hence $q_{\lambda_i}$, are determined by the prior parameters $\theta$ and the support data $\mathcal{D}_i^{tr}$. First we need to optimize the bound w.r.t. $\lambda_i$. According to C.2, we have:

$$\lambda_i^* = \arg\max_{\lambda_i} \; \mathbb{E}_{q_{\lambda_i}(z_i)} \log p\big(\mathcal{D}_i^{tr} \mid z_i\big) - \mathrm{KL}\big(q_{\lambda_i}(z_i) \,\big\|\, p(z_i \mid \theta)\big), \tag{8}$$

and we can establish the connection between equation (8) and the fine-tuning process in MAML by the following lemma:

Lemma 1. In the case where $q_{\lambda_i}$ is a Dirac-delta function $\delta(z_i - \phi_i)$ and the prior $p(z_i \mid \theta)$ is chosen to be Gaussian, equation (8) equals the inner-update step of MAML, that is, maximizing $\log p(\mathcal{D}_i^{tr} \mid z_i)$ w.r.t. $z_i$ by early-stopped gradient ascent with $\theta$ as the initial point:

$$\phi_i = \theta + \alpha \nabla_\theta \log p\big(\mathcal{D}_i^{tr} \mid \theta\big). \tag{9}$$
Then we need to optimize the bound w.r.t. $\theta$. Since we evaluate the adapted parameters with only $\mathcal{D}_i^{val}$, we restrict the evaluation term of the bound to the query data. We give the following theorem to establish the connection between the meta-update process and the optimization of $\theta$:

Theorem 1. In the case where the uncertainty in the global latent variables is small, the following equation holds:

$$\nabla_\theta \log p\big(\mathcal{D} \mid \theta\big) \approx \sum_i \nabla_\theta \log p\big(\mathcal{D}_i^{val} \mid \phi_i(\theta)\big). \tag{10}$$
A general EM algorithm first computes the distribution of the latent variables (E-step), then optimizes the joint distribution of the latent variables and the trainable parameters (M-step); the likelihood of the data can be proved to be monotonically non-decreasing, which guarantees convergence, since the evidence lower bound of the likelihood is monotonically non-decreasing. Here the adapted task-specific parameters are the latent variables, and the initial parameters correspond to the trainable parameters. Lemma 1 and Theorem 1 correspond to the E-step and M-step respectively. In the following part, we establish the equivalence between equation (9) and equations (3) and (4), and between equation (10) and equations (5) and (6), to prove the equivalence between DMIL and MAML.
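As a self-contained illustration of the E-step/M-step structure and the monotone likelihood property invoked here, consider EM on a toy two-component 1-D Gaussian mixture; unit variances and equal mixing weights are simplifying assumptions, so only the means are estimated:

```python
import numpy as np

def log_likelihood(x, mus):
    """Data log-likelihood under the two-component unit-variance mixture."""
    comp = np.stack([np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi) for m in mus])
    return np.sum(np.log(0.5 * comp.sum(axis=0)))

def em_step(x, mus):
    """One EM iteration on the component means."""
    # E-step: posterior responsibility of each component for each point.
    comp = np.stack([np.exp(-0.5 * (x - m) ** 2) for m in mus])
    resp = comp / comp.sum(axis=0)
    # M-step: re-estimate means as responsibility-weighted averages.
    return [np.sum(r * x) / np.sum(r) for r in resp]
```

Each iteration can only raise (or keep) the data log-likelihood, which is the convergence argument the text carries over to DMIL.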

4.2 Modeling DMIL with hierarchical variational Bayes framework

For simplicity, here we only derive the result for one specific task $\mathcal{T}_i$, since the derivatives of the parameters from multiple tasks simply add up. We first establish the connection between the maximization of the task likelihood and the particular loss functions in DMIL:

Theorem 2. In the case of Dirac-delta posteriors, the maximization in equation (8) decomposes into the DMIL inner-update losses: the cross-entropy loss of equation (3) for the high-level network, and the behavior-cloning losses of equation (4) for the sub-skills, where the latter are computed on the data sets determined by the adapted high-level network $\pi_{\theta'}$.

Note that the appearance of data sets determined by the adapted high-level network in this decomposition connects it with equations (3) and (4) in DMIL. According to equation (8), finding the optimal posterior equals maximizing the regularized likelihood under specific conditions, and in Theorem 2 we prove that this maximization corresponds to equations (3) and (4) in DMIL. Thus Theorem 2 corresponds to the E-step of DMIL, where we take $\theta'$ and $\phi'$ as the latent variables and optimize them with a coordinate descent method, which can be proved equal to equation (9) in C.5.

For the M-step, we take $\theta$ and $\phi$ as the trainable parameters. According to Theorem 1, we can take the derivative of the query likelihood at the adapted parameters to maximize the joint distribution of the latent variables and the trainable parameters, and thereby the likelihood of the data set; this yields exactly the gradients computed in the HO and LO steps, equations (5) and (6). Note that this computation can be accomplished automatically with standard deep learning libraries such as PyTorch pytorch. This establishes the equivalence between DMIL and MAML, and the convergence of DMIL follows.

For a clearer comparison, MAML is an iterative process of an E-step followed by an M-step, while DMIL is an iterative process of two successive E-steps (HI, then LI) followed by a joint M-step; the posterior estimation stage has no effect on the initial parameters, so it can be divided into two steps as in DMIL. This decoupled fine-tuning is exactly what we need in order to first adapt the high-level network and then adapt the sub-skills. If we instead fine-tuned all parameters end-to-end in a single step, the sub-skills would receive supervision from an unadapted high-level network, which may provide incorrect classifications. In contrast, the meta-update must be done for both levels at the same time: if we updated $\theta$ and $\phi$ successively, the later update would receive a derivative different from the one in MAML, and the equivalence could not be proved.

5 Experiments

In our experiments we aim to answer the following questions: (a) Can DMIL successfully transfer the learned hierarchical structure to new tasks with few-shot new-task demonstrations? (b) Can DMIL achieve higher performance than other few-shot imitation learning methods? (c) What are the effects of the different parts of DMIL, such as the sub-skill number $K$, the bi-level meta-learning procedure, and the continuity constraint? Code is provided in an anonymous repository, and video results are provided in the supplementary materials.

5.1 Environments and Experiment Setups

| Method | ML10 meta-training (1-shot / 3-shot) | ML10 meta-testing (1-shot / 3-shot) | ML45 meta-training (1-shot / 3-shot) | ML45 meta-testing (1-shot / 3-shot) |
|---|---|---|---|---|
| OptionGAIL | 0.455±0.011 / 0.952±0.016 | 0.241±0.042 / 0.640±0.025 | 0.506±0.008 / 0.715±0.006 | 0.220±0.013 / 0.481±0.010 |
| MIL | 0.776±0.025 / 0.869±0.029 | 0.361±0.040 / 0.689±0.032 | 0.584±0.011 / 0.745±0.017 | 0.205±0.024 / 0.510±0.005 |
| PEMIRL | 0.598±0.023 / 0.810±0.007 | 0.162±0.003 / 0.256±0.009 | 0.289±0.051 / 0.396±0.024 | 0.105±0.005 / 0.126±0.008 |
| MLSH | 0.506±0.134 / 0.725±0.021 | 0.106±0.032 / 0.135±0.009 | 0.235±0.093 / 0.295±0.021 | 0.050±0.000 / 0.050±0.000 |
| DMIL | 0.775±0.010 / 0.949±0.009 | 0.396±0.016 / 0.710±0.021 | 0.590±0.010 / 0.859±0.008 | 0.376±0.004 / 0.640±0.009 |

Table 1: Success rates of different methods on Meta-world environments with $K=3$. Entries are mean ± standard deviation; each data point comes from 20 random seeds.
| Task (unseen) | FIST-no-FT | SPiRL | DMIL (ours) |
|---|---|---|---|
| Microwave, Kettle, Top Burner, Light Switch | 2.0±0.0 | 2.1±0.48 | 1.5±0.48 |
| Microwave, Bottom Burner, Light Switch, Slide Cabinet | 0.0±0.0 | 2.3±0.49 | 2.35±0.39 |
| Microwave, Kettle, Hinge Cabinet, Slide Cabinet | 1.0±0.0 | 1.9±0.29 | 3.15±0.22 |
| Microwave, Kettle, Hinge Cabinet, Slide Cabinet | 2.0±0.0 | 3.3±0.38 | 2.95±0.44 |

Table 2: Cumulative rewards of different methods on four unseen tasks in the Kitchen environment. In the original table, boldface indicates objects excluded during training.

We evaluate DMIL on two representative robot manipulation environments. The first is the Meta-world benchmark metaworld, which contains 50 diverse robot manipulation tasks, as shown in Figures 6 and 7. We use both the ML10 suite (10 meta-training tasks and 5 meta-testing tasks) and the ML45 suite (45 meta-training tasks and 5 meta-testing tasks) to evaluate our method, and collect 2K demonstrations for each task. We choose Meta-world because a large set of diverse manipulation tasks can help our method learn semantic skills. We use the following approaches for comparison in this environment. Option-GAIL: a hierarchical generative adversarial imitation learning method that discovers options from unsegmented demonstrations optiongail; we use Option-GAIL to evaluate the effect of meta-learning in DMIL. MIL: a transformer-based meta imitation learning method transformerbasedmeta; we use MIL to evaluate the effect of the hierarchical structure in DMIL. MLSH: the meta-learning shared hierarchies method metalearningsharedhierarchies, which relearns the high-level network in every new task; we use MLSH to evaluate the effect of fine-tuning (rather than relearning) the high-level network in new tasks. PEMIRL: a contextual meta inverse RL method that transfers the reward function to new tasks metairl; we use PEMIRL to show that DMIL can transfer to new tasks with significantly different reward functions.

The second is the Kitchen environment of the D4RL benchmark d4rl, which contains five different sub-tasks in the same kitchen scene. Accomplishing each episode requires the sequential completion of four specific sub-tasks, as shown in Figure 9. We use an open demonstration data set replaypolicylearning to train our method. During training, we exclude interactions with selected objects, and at test time we provide demonstrations that involve manipulating the excluded objects, making these unseen tasks. We choose this environment to show that DMIL can be used in long-horizon tasks and is robust across different environments. We use two approaches for comparison in this experiment. SPiRL: an extension of skill extraction methods to imitation learning over a skill space spirl. FIST: an algorithm that extracts skills from offline data with an inverse dynamics model and a distance function fist.

We use fully-connected neural networks for both the high-level network and the sub-skills. Detailed descriptions of the environment setup, the demonstration collection procedure, the hyper-parameter settings, the training process, and the compared methods can be found in the appendix.


5.2 Results

Figure 3: Sub-skill probabilities over time for the window-close and peg-insert-side tasks of Meta-world and the microwave-kettle-top burner-light switch task of the Kitchen environment.

Table 1 shows the success rates of different methods on the ML10 and ML45 suites with sub-skill number $K=3$. We perform 1-shot and 3-shot experiments to show the progressive few-shot performance of each method. DMIL achieves the best results on the ML10 testing suite and on both the ML45 training and testing suites, which shows the superiority of our method for transferring across a large set of manipulation tasks. OptionGAIL achieves high success rates on both the ML10 and ML45 training suites; these results come from its hierarchical structure, which has adequate capacity to fit the potentially multi-modal behaviors in multi-task demonstrations. MIL achieves comparable results on all meta-testing tasks but is worse than DMIL, which shows the necessity of the meta-learning process. In comparison, PEMIRL and MLSH are mediocre across all suites. This is because reward functions are difficult to transfer across tasks with few-shot demonstrations, and the relearned high-level network of MLSH damages previously learned knowledge. We also illustrate t-SNE results of these methods in Figure 4(a) and analyze them further in Appendix D.2.

Table 2 shows the rewards of different methods on four unseen tasks in the Kitchen environment. FIST-no-FT refers to a variant of FIST that does not use future state conditioning, which makes the comparison fairer. DMIL achieves higher rewards on two of the four tasks and comparable results on the other two, which exhibits the effectiveness of the bi-level meta-training procedure. The poor performance of DMIL on the first task may come from the choice of the sub-skill number $K$ or from low-quality demonstrations. We perform ablation studies on $K$ in the next section.


Figure 3 shows curves of the sub-skill probabilities over time for two Meta-world tasks, window-close and peg-insert-side, and for the microwave-kettle-top burner-light switch task in the Kitchen environment. The activation of the sub-skills shows a strong correlation with the different stages of each task. In the first two tasks, one sub-skill activates when the robot is approaching something, another activates when the robot is picking something up, and a third activates when the robot is manipulating something. In the third task, one sub-skill activates when the robot is manipulating the microwave, another when the robot is manipulating the kettle or the light switch, and a third when the robot is manipulating the burner switch. This shows that DMIL has the ability to learn semantic sub-skills from unsegmented multi-task demonstrations.

5.3 Ablation Studies

| K | ML10 meta-training (1-shot / 3-shot) | ML10 meta-testing (1-shot / 3-shot) | ML45 meta-training (1-shot / 3-shot) | ML45 meta-testing (1-shot / 3-shot) |
|---|---|---|---|---|
| 2 | 0.76 / 0.955 | 0.32 / 0.72 | 0.563 / 0.818 | 0.44 / 0.67 |
| 3 | 0.775 / 0.949 | 0.396 / 0.71 | 0.59 / 0.859 | 0.376 / 0.64 |
| 5 | 0.795 / 0.94 | 0.52 / 0.57 | 0.713 / 0.92 | 0.21 / 0.48 |
| 10 | 0.8 / 0.975 | 0.38 / 0.62 | 0.736 / 0.931 | 0.34 / 0.64 |

Table 3: Success rates for different sub-skill numbers $K$ on Meta-world environments.

In this section we perform ablation studies on different components of DMIL to assess their effects. Due to limited space, we put the ablations on fine-tuning steps, the bi-level meta-learning processes, the continuity constraint, and hard/soft EM choices in Appendix D.

Effect of different sub-skill numbers K: Table 3 shows the effect of the sub-skill number K in the Meta-world experiments. A larger K leads to higher success rates on meta-training tasks, but a smaller K leads to better results on meta-testing tasks. This suggests that an excessive number of sub-skills may result in over-fitting on the training data, while a smaller K plays the role of regularization. In the Kitchen experiments, we observe a similar phenomenon in Table 5. It is worth noting that in both environments we did not encounter collapse problems, i.e., every sub-skill gets well-trained even for the largest K we tested. This may be because, in the meta-training stage, adding more sub-skills helps the whole structure reach a lower loss, so DMIL uses all of them during training. However, as our supplementary videos show, sub-skills trained with a large K are not as semantically meaningful during task execution as those trained with a small K.

6 Discussion and Future Work

In this work, we propose DMIL, which meta-learns a hierarchical structure from unsegmented multi-task demonstrations and endows it with the fast adaptation ability needed to transfer to new tasks. We theoretically prove its convergence by reframing both MAML and DMIL as hierarchical Bayesian inference processes and establishing their equivalence. Empirically, we successfully acquire transferable hierarchical structures in both the Meta-world and Kitchen environments. Generally speaking, DMIL can be viewed from two perspectives: as a hierarchical extension of MAML that performs better on multi-modal task distributions, or as a meta-learning enhancement of skill-discovery methods that makes learned skills quickly transferable to new tasks.

The limitations of DMIL come from several aspects, and future work can seek meaningful extensions along these lines. First, DMIL models all tasks with a bi-level structure; real-world tasks, however, may have multi-level structures. One could extend DMIL to multi-level hierarchies as done in recent work hierarchicalmultitask. Second, DMIL does not capture temporal information in demonstrations. Future state conditioning, as in fist, appears to be an effective tool for improving few-shot imitation learning performance on long-horizon tasks such as those in the Kitchen environment. Future work could employ temporal modules such as the transformer transformer as the high-level network of DMIL to improve its performance.


Appendix A Algorithm

  Require: task distribution, multi-task demonstrations, initial parameters of the high-level network and the sub-skill policies, inner and outer learning rates.
  while not done do
   Sample a batch of tasks
   for each sampled task do
    Sample demonstrations from the task
    Evaluate the high-level loss according to Equation 3
    Compute the adapted parameters of the high-level network
    Evaluate the sub-skill losses according to Equation 4
    Compute the adapted parameters of the sub-skills
    Evaluate the outer losses with the adapted parameters
   end for
   Update the initial parameters with the outer losses
  end while
Algorithm 1 Dual Meta Imitation Learning
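Algorithm 1's dual adaptation can be sketched in code. The snippet below is a minimal toy illustration, not the paper's implementation: the high-level network and sub-skills are linear models, sub-skill policies are assumed to be unit-variance Gaussians (so the log-likelihood reduces to a negative squared error), and all names such as `dmil_inner` and `skill_loglik` are hypothetical. The outer meta-update, which would differentiate through both adaptation steps, is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4          # number of sub-skills, state/action dimension
alpha = 0.1          # inner-loop learning rate

# Toy parameterization: sub-skill k is a linear Gaussian policy a ~ N(s @ W[k], I);
# the high-level network scores sub-skills with logits s @ H.
H = 0.1 * rng.normal(size=(D, K))
W = 0.1 * rng.normal(size=(K, D, D))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def skill_loglik(W, s, a):
    # log-likelihood of every (s, a) pair under every sub-skill, up to a constant
    err = a[:, None, :] - np.einsum('td,kde->tke', s, W)   # (T, K, D)
    return -0.5 * (err ** 2).sum(-1)                       # (T, K)

def dmil_inner(H, W, s, a):
    # Step 1: adapt the high-level network, using sub-skill likelihoods
    # as (soft) supervision for a cross-entropy loss.
    target = softmax(skill_loglik(W, s, a))
    p = softmax(s @ H)
    grad_H = s.T @ (p - target) / len(s)       # cross-entropy gradient
    H_ad = H - alpha * grad_H
    # Step 2: adapt each sub-skill on the data that the *adapted*
    # high-level network assigns to it.
    assign = softmax(s @ H_ad)                 # (T, K) responsibilities
    W_ad = W.copy()
    for k in range(K):
        w = assign[:, k:k + 1]                 # weight of each pair for skill k
        err = s @ W[k] - a
        W_ad[k] = W[k] - alpha * s.T @ (w * err) / max(w.sum(), 1e-8)
    return H_ad, W_ad

# One inner adaptation on random stand-in demonstration data
s, a = rng.normal(size=(20, D)), rng.normal(size=(20, D))
H_ad, W_ad = dmil_inner(H, W, s, a)
```

In a full implementation, the outer losses would be evaluated with `H_ad` and `W_ad` on held-out demonstrations, and the initial `H` and `W` would be updated by backpropagating through both adaptation steps, as in MAML.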

Appendix B Auxiliary Loss

We adopt an auxiliary loss for DMIL to better elicit meaningful sub-skills by penalizing excessive switching of sub-skills along the trajectory. This comes from an intuitive idea: each sub-skill should be a temporally extended macro-action, so the high-level policy should only need to switch between sub-skills a few times along a task, following the same idea of macro-actions as in MLSH metalearningsharedhierarchies. We denote:


and the auxiliary loss is:


Although this operation appears discrete, in practice we can use operations in modern deep learning frameworks such as PyTorch pytorch to make it differentiable. We add this loss to both training objectives with a weighting coefficient. We also perform ablation studies on this coefficient; results are in Table 7.
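As one concrete (hypothetical) instantiation of this idea, the penalty can be computed from the soft sub-skill probabilities themselves, which keeps it differentiable: the dot product of consecutive soft assignments is close to 1 when the active sub-skill is unchanged and close to 0 at a switch, so summing its complement approximates the number of switches. This sketches one possible variant, not necessarily the paper's exact formulation; `switching_penalty` is an assumed name.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def switching_penalty(logits):
    """Soft count of sub-skill switches along one trajectory.

    logits: (T, K) high-level scores per time step. Writing the same
    expression with a framework such as PyTorch makes it differentiable
    end-to-end.
    """
    p = softmax(logits)
    # 1 - p_t . p_{t-1} is ~0 when the active sub-skill persists, ~1 at a switch
    return float(np.sum(1.0 - np.sum(p[1:] * p[:-1], axis=-1)))
```

A trajectory that keeps one sub-skill active yields a penalty near 0, while one that switches on every step yields a penalty near T - 1.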

Appendix C Proofs

C.1 Proof of Lemma 1

Lemma 1 In case the variational distribution is a Dirac-delta function and a Gaussian prior is chosen, Equation 8 equals the inner-update step of MAML, that is, maximizing the posterior by early-stopped gradient ascent starting from the initial parameters:


Proof: under the conditions of Lemma 1, we have:


As stated in recasting, in the case of linear models, early stopping of an iterative gradient descent process equals maximum a posteriori (MAP) estimation earlylinear. In our case the posterior is that of the task-specific parameters, and MAML is a Bayesian process that finds the MAP estimate as the point estimate. In the nonlinear case, this point estimate is not necessarily the global mode of the posterior; following earlystopping, one can instead define an implicit posterior distribution for which the early-stopping procedure of MAML acts as a prior, yielding a similar result.
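The linear/quadratic case of this correspondence is easy to check numerically. In the sketch below (all values hypothetical), a single gradient step on a 1-D quadratic loss, with a suitably matched step size, coincides exactly with the MAP estimate under a Gaussian prior centered at the initialization:

```python
# Early-stopped gradient descent as MAP estimation (quadratic toy case):
# one gradient step on L(theta) = 0.5 * lam * (theta - t_star)**2 starting
# from theta0 equals the MAP under a Gaussian prior N(theta0, s2), provided
# the step size is alpha = s2 / (lam * s2 + 1).
lam, t_star, theta0, s2 = 2.0, 3.0, -1.0, 0.5

alpha = s2 / (lam * s2 + 1.0)
one_step = theta0 - alpha * lam * (theta0 - t_star)        # truncated GD

# MAP: argmin_theta L(theta) + (theta - theta0)**2 / (2 * s2), in closed form
map_est = (lam * s2 * t_star + theta0) / (lam * s2 + 1.0)

assert abs(one_step - map_est) < 1e-12
```

Running more gradient steps corresponds to a weaker effective prior (larger s2), which is the sense in which the stopping point encodes the prior.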

C.2 Proof of Equation 8

Equation 8 can be written as:

where in MAML we assume a point estimate, and we use the joint distribution in place of the conditional since we assume the prior follows a uniform distribution. Thus Equation 8 is proved.

C.3 Proof of Theorem 1

Theorem 1 In case the uncertainty in the global latent variables is small, the following equation holds:




where the first approximate equality holds because the variational inference approximation error is small enough, and the second approximate equality holds because, under the stated condition and assuming the model is a neural network, the equality holds almost everywhere. Note that the condition of Theorem 1 is usually satisfied since we are using MAML, where the initial parameters are assumed to be deterministic.

From another perspective, the right-hand side of the above equation is the widely used meta-gradient of MAML, which is proven to converge by convergenceofmaml.
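This meta-gradient is easy to verify on a toy problem. The sketch below (hypothetical 1-D quadratic loss) compares the chain-rule form of the MAML meta-gradient, (I - alpha*H) applied to the loss gradient at the adapted point, against a finite-difference estimate of the derivative of the post-adaptation loss:

```python
# Toy check of the MAML meta-gradient d/dtheta L(theta - alpha * L'(theta))
# on a 1-D quadratic, comparing the chain-rule form (1 - alpha*H) * L'(theta_ad)
# (here the Hessian H of L is 1) against a finite-difference estimate.
alpha, c = 0.3, 2.0
L  = lambda th: 0.5 * (th - c) ** 2
dL = lambda th: th - c

def meta_loss(th):
    # outer loss evaluated at the adapted (inner-updated) point
    return L(th - alpha * dL(th))

theta = -1.5
theta_ad = theta - alpha * dL(theta)           # one inner gradient step
meta_grad = (1 - alpha * 1.0) * dL(theta_ad)   # chain rule through the step

eps = 1e-6
fd = (meta_loss(theta + eps) - meta_loss(theta - eps)) / (2 * eps)
assert abs(meta_grad - fd) < 1e-6
```

The (1 - alpha*H) factor is exactly the second-order term that distinguishes full MAML from its first-order approximation, which drops it.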

C.4 Proof of Theorem 2

Theorem 2 Under the stated conditions, we have:




Proof: since the second term is independent of the variables being optimized, we consider only the first conditional probability:


In the HI step, the sub-skill parameters are fixed; thus Equation 23 becomes a convex optimization problem:


The solution of this problem satisfies the condition that the high-level network predicts the sub-skill category at each time step to be the sub-skill with the highest likelihood, in which case the objective is maximized. If we choose the high-level network to be a classifier with a Softmax output layer, minimizing the cross-entropy loss in Equation 3 equals maximizing Equation 24; thus Equation 21 is proved.

In the LI step, the high-level network is fixed, and the data sets for optimizing each sub-skill are also fixed. Thus we need to maximize each sub-skill's likelihood on its own data set. Under the stated policy parameterization, this leads to the loss function in Equation 4; thus Equation 22 is proved, which finishes the proof of Theorem 2.
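The HI-step solution above amounts to a hard-EM E-step: each time step is labeled with its highest-likelihood sub-skill, and those one-hot labels become cross-entropy targets for the high-level network. A small numeric sketch, with all likelihood values hypothetical:

```python
import numpy as np

# Per-sub-skill log-likelihoods of each (s_t, a_t) pair, shape (T, K)
loglik = np.array([[-0.1, -2.0, -3.0],    # t = 0: sub-skill 0 fits best
                   [-4.0, -0.2, -1.0],    # t = 1: sub-skill 1 fits best
                   [-5.0, -3.0, -0.3]])   # t = 2: sub-skill 2 fits best

assign = np.argmax(loglik, axis=1)           # hard E-step labels
targets = np.eye(loglik.shape[1])[assign]    # one-hot cross-entropy targets

def cross_entropy(p, targets):
    # loss the high-level classifier minimizes against the E-step labels
    return -np.mean(np.sum(targets * np.log(p + 1e-12), axis=1))
```

A soft-EM variant would instead use the softmax of `loglik` as the targets, which is the hard/soft EM choice ablated in Appendix D.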

C.5 Proof of the E-step of DMIL

According to Theorem 1, we aim to maximize Equation 17 with respect to the parameters by coordinate gradient ascent from the initial point. We need to prove that in DMIL we can also achieve the global maximum, as in MAML. We first state the following lemma:

Lemma 2 Let a point be the solution found by coordinate gradient descent of an objective along a set of coordinate directions. If the objective can be decomposed as:

where the first term is a differentiable convex function and each remaining term is a convex function of a single coordinate direction, then the solution found is the global minimum of the objective.

Proof: for an arbitrary other point, we have:


Now let’s consider our problem. Consider