
Skill-based Model-based Reinforcement Learning

by Lucy Xiaoyang Shi, et al.

Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every action for long-horizon tasks is not practical, akin to a human planning out every muscle movement. Instead, humans efficiently plan with high-level skills to solve complex tasks. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the skill outcomes, rather than predicting all small details in the intermediate states, step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency for both model-based RL and skill-based RL. Code and videos are available at <>





1 Introduction

A key trait of human intelligence is the ability to plan abstractly for solving complex tasks [18]. For instance, we perform cooking by imagining outcomes of high-level skills, like washing and cutting vegetables, instead of planning every muscle movement involved [2]. This ability to plan with temporally-extended skills helps scale our internal model to long-horizon tasks by reducing the search space of behaviors. To apply this insight to artificial intelligence agents, we propose a novel skill-based and model-based reinforcement learning (RL) method that learns a model and a policy in a high-level skill space, enabling accurate long-term prediction and efficient long-term planning.

Typically, model-based RL involves learning a flat single-step dynamics model, which predicts the next state from the current state and action. This model can then be used to simulate “imaginary” trajectories, which significantly improves sample efficiency over model-free alternatives [9, 10]. However, such model-based RL methods have shown only limited success in long-horizon tasks due to inaccurate long-term prediction [20] and computationally expensive search [19, 11, 1].

Skill-based RL enables agents to solve long-horizon tasks by acting with multi-action subroutines (skills) [38, 16, 17, 28, 15, 3] instead of primitive actions. This temporal abstraction of actions enables systematic long-range exploration and allows RL agents to plan farther into the future while requiring a shorter horizon for policy optimization, which makes long-horizon downstream tasks more tractable. Yet, on complex long-horizon tasks such as furniture assembly [14], skill-based RL still requires a few million to a billion environment interactions to learn [15], which is impractical for real-world applications (e.g., robotics and healthcare).

To combine the best of both model-based RL and skill-based RL, we propose Skill-based Model-based RL (SkiMo), which enables effective planning in the skill space using a skill dynamics model. Given a state and a skill to execute, the skill dynamics model directly predicts the resultant state after skill execution, without needing to model every intermediate step and low-level action (Figure 1), whereas the flat dynamics model predicts the immediate next state after one action execution. Thus, planning with skill dynamics requires fewer predictions than flat (single-step) dynamics, resulting in more reliable long-term future predictions and plans.
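This difference in prediction burden can be sketched numerically: a flat model needs one forward pass per environment step, whereas a skill dynamics model needs one per H-step skill. This is an illustrative sketch; the function names and the 500-step/10-step numbers are our own, not the paper's implementation.

```python
# Counting model calls needed to imagine a T-step future with a flat
# single-step dynamics model vs. a skill dynamics model that jumps
# H low-level steps per prediction.

def flat_rollout_calls(T: int) -> int:
    """One forward pass of the flat dynamics per environment step."""
    return T

def skill_rollout_calls(T: int, H: int) -> int:
    """One forward pass of the skill dynamics per H-step skill."""
    return -(-T // H)  # ceil(T / H)

T, H = 500, 10                    # e.g. a 500-step task with 10-step skills
print(flat_rollout_calls(T))      # 500 single-step predictions
print(skill_rollout_calls(T, H))  # 50 skill-level predictions
```

Fewer predictions per imagined trajectory means fewer opportunities for model error to compound, which is the core argument for skill-level dynamics.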

Concretely, we first jointly learn the skill dynamics model and a skill repertoire from large offline datasets collected across diverse tasks [21, 28, 29]. This joint training shapes the skill embedding space for easy skill dynamics prediction and skill execution. Then, to solve a complex downstream task, we train a hierarchical task policy that acts in the learned skill space. For more efficient policy learning and better planning, we leverage the skill dynamics model to simulate skill trajectories.

The main contribution of this work is to propose Skill-based Model-based RL (SkiMo), a novel sample-efficient model-based hierarchical RL algorithm that leverages task-agnostic data to extract not only a reusable skill set but also a skill dynamics model. The skill dynamics model enables efficient and accurate long-term planning for sample-efficient RL. Our experiments show that our method outperforms the state-of-the-art skill-based and model-based RL algorithms on long-horizon navigation and robotic manipulation tasks with sparse rewards.


Figure 1: Intelligent agents can use their internal models to imagine potential futures for planning. Instead of planning out every primitive action (black arrows in a), they aggregate action sequences into skills (red and blue arrows in b). Further, instead of simulating each low-level step, they can leap directly to the predicted outcomes of executing skills in sequence (red and blue arrows in c), which leads to better long-term prediction and planning compared to predicting step-by-step (blurriness of images represents the level of error accumulation in prediction). With the ability to plan over skills, they can accurately imagine and efficiently plan for long-horizon tasks.

2 Related Work

Model-based RL leverages a learned dynamics model of the environment to plan, ahead of time, a sequence of actions that leads to the desired behavior. The dynamics model predicts the future state of the environment, and optionally the associated reward, after taking a specific action, for planning [1, 10] or subsequent policy search [7, 9, 24, 10]. By simulating candidate behaviors in imagination instead of in the physical environment, model-based algorithms improve the sample efficiency of RL agents [9, 10]. Typically, model-based approaches leverage the model for planning, e.g., with CEM [31] and MPPI [42]. Alternatively, the model can be used to generate imaginary rollouts for policy optimization [9, 10]. Yet, due to the accumulation of prediction error at each step [20] and the growing search space, long-horizon planning remains inaccurate and computationally expensive [19, 11, 1].

To facilitate learning of long-horizon behaviors, many works in skill-based RL let the agent act over temporally-extended skills (i.e., options [39] or motion primitives [26]), which can be represented as sub-policies or coordinated sequences of low-level actions. Temporal abstraction effectively reduces the task horizon for the agent and enables directed exploration [25], a major challenge in RL. However, skill-based RL is still impractical for real-world applications, requiring a few million to a billion environment interactions [15]. In this paper, we use model-based RL to guide the planning of skills and thereby improve the sample efficiency of skill-based approaches.

There have been attempts to plan over skills in model-based RL [37, 43, 20, 44, 34]. However, these approaches still utilize the conventional flat (single-step) dynamics model, which struggles at handling long-horizon planning due to error accumulation. To fully unleash the potential of temporally abstracted skills, we devise a skill-level dynamics model to provide accurate long-term prediction, which is essential for solving long-horizon tasks. To the best of our knowledge, SkiMo is the first work that jointly learns skills and a skill dynamics model from data for model-based RL.

3 Method

In this paper, we aim to improve the long-horizon learning capability of RL agents. To enable accurate long-term prediction and efficient long-horizon planning for RL, we introduce SkiMo, a novel skill-based and model-based RL method that combines the benefits of both frameworks. A key change from prior model-based approaches is the use of a skill dynamics model that directly predicts the outcome of a chosen skill, which enables efficient and accurate long-term planning. In this section, we give an overview of our approach, which consists of two phases: (1) learning the skill dynamics model and skills from an offline task-agnostic dataset (Section 3.3) and (2) downstream task learning with the skill dynamics model (Section 3.4), as illustrated in Figure 2.


Figure 2: Our approach, SkiMo, combines model-based RL and skill-based RL for sample efficient learning of long-horizon tasks. SkiMo consists of two phases: (1) learn a skill dynamics model and a skill repertoire from offline task-agnostic data, and (2) learn a high-level policy for the downstream task by leveraging the learned model and skills.

3.1 Preliminaries

Reinforcement Learning

We formulate our problem as a Markov decision process (MDP) [40], which is defined by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \rho_0, \gamma)$ of the state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R(s, a)$, transition probability $P(s' \mid s, a)$, initial state distribution $\rho_0$, and discount factor $\gamma$. We define a policy $\pi(a \mid s)$ that maps from a state $s$ to an action $a$. Our objective is to learn the optimal policy $\pi^*$ that maximizes the expected discounted return, $\mathbb{E}\big[\sum_{t=0}^{T-1} \gamma^t R(s_t, a_t)\big]$, where $T$ is the variable episode length.
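As a concrete instance of the objective above, a minimal pure-Python sketch of computing the discounted return of one episode's reward sequence (the helper name is ours):

```python
# Discounted return: sum_t gamma^t * r_t, accumulated backwards so each
# step is one multiply-add (Horner's rule).

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1], gamma=0.5))  # 0.5**2 * 1 = 0.25
```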

Unlabeled Offline Data

We assume access to a reward-free, task-agnostic dataset [21, 28]: a set of state-action trajectories $\mathcal{D} = \{\tau_i\}$ with $\tau = (s_0, a_0, s_1, a_1, \ldots)$. Since it is task-agnostic, this data can be collected from training data for other tasks, unsupervised exploration, or human teleoperation. We do not assume this dataset contains solutions for the downstream task; therefore, tackling the downstream task requires re-composition of skills learned from diverse trajectories.

Skill-based RL

We define skills as a sequence of actions with a fixed horizon $H$. (Our method is compatible with variable-length skills [13, 36, 35] and goal-conditioned skills [24] with minimal change; for simplicity, we adopt fixed-length skills in this paper.) We parameterize a skill as a skill latent $z$ and a skill policy $\pi^L(a \mid s, z)$ that maps a skill latent and state to the corresponding action sequence. The skill latent and skill policy can be trained as a variational auto-encoder (VAE) [12], where a skill encoder embeds a sequence of transitions into a skill latent $z$, and the skill policy decodes it back to the original action sequence. Following SPiRL [28], we also learn the skill distribution in the offline data, a skill prior $p(z \mid s)$, to guide the downstream task policy toward promising skills in the large skill space.
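A shape-level sketch of this skill VAE parameterization may help; all dimensions and the random linear maps below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

# A skill encoder q(z | s_t, a_{t:t+H-1}) maps an H-step action sequence
# to a latent skill z, and the skill policy (decoder) maps (s, z) back to
# one primitive action per step. Fresh random linear maps stand in for
# learned networks.
rng = np.random.default_rng(0)
S, A, H, Z = 4, 2, 10, 8   # state dim, action dim, skill horizon, skill dim

def skill_encoder(state, actions):
    """Return the mean of q(z | s, a_seq)."""
    x = np.concatenate([state, actions.ravel()])
    W = rng.standard_normal((Z, x.size)) * 0.01
    return W @ x

def skill_policy(state, z):
    """Decode one primitive action from (state, skill latent)."""
    x = np.concatenate([state, z])
    W = rng.standard_normal((A, x.size)) * 0.01
    return W @ x

state = rng.standard_normal(S)
actions = rng.standard_normal((H, A))
z = skill_encoder(state, actions)   # one latent per H actions
a0 = skill_policy(state, z)         # first decoded action
print(z.shape, a0.shape)            # (8,) (2,)
```

The key shape fact: H low-level actions compress into a single latent $z$, which is what lets the dynamics model skip intermediate steps.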

3.2 SkiMo Model Components

SkiMo consists of three major model components: the skill policy ($\pi^L$), the skill dynamics model ($D$), and the task policy ($\pi$), along with auxiliary components for representation learning and value estimation. Throughout, $s$, $a$, $z$, and $h$ denote a state, action, latent skill embedding, and latent state representation, respectively.

For convenience, we label the trainable parameters $\theta$, $\phi$, and $\psi$ of each component according to the phase in which they are trained:

  1. Learned from offline data and finetuned in downstream RL ($\theta$): The state representation module ($E_\theta$) and the skill dynamics model ($D_\theta$) are first trained on the offline task-agnostic data and then finetuned in downstream RL to account for unseen states and transitions.

  2. Learned only from offline data ($\phi$): The observation reconstruction module ($O_\phi$), skill encoder ($q_\phi$), skill prior ($p_\phi$), and skill policy ($\pi^L_\phi$) are learned from the offline task-agnostic data.

  3. Learned in downstream RL ($\psi$): We train a value function ($Q_\psi$), a reward function ($R_\psi$), and the high-level task policy ($\pi_\psi$) for the downstream task using environment interactions.

3.3 Pre-Training Skill Dynamics Model and Skills from Task-agnostic Data

Our method, SkiMo, consists of pre-training and downstream RL phases. In pre-training, SkiMo leverages offline data to extract (1) skills for temporal abstraction of actions, (2) skill dynamics for skill-level planning in a latent state space, and (3) a skill prior [28] to guide exploration. Specifically, we jointly learn the skill policy and skill dynamics model, instead of learning them separately [43, 20, 44], in a self-supervised manner. The key insight is that this joint training shapes the latent skill space and state embedding such that the skill dynamics model can easily predict the future.

In contrast to prior works that learn models completely online [9, 33, 10], we leverage existing offline task-agnostic datasets to pre-train a skill dynamics model and skill policy. This offers the benefit that the model and skills are agnostic to specific tasks so that they may be used in multiple tasks. Afterwards in the downstream RL phase, the agent continues to finetune the skill dynamics model to accommodate task-specific trajectories.

To learn a low-dimensional skill latent space that encodes action sequences, we train a conditional VAE [12] on the offline dataset that reconstructs an action sequence through a skill embedding given a state-action sequence, as in SPiRL [28, 29]. Specifically, given $H$ consecutive states and actions, a skill encoder $q_\phi$ predicts a skill embedding $z$ and a skill decoder (i.e., the low-level skill policy) $\pi^L_\phi$ reconstructs the original action sequence $\hat{a}_{t:t+H-1}$ from $z$:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}\Big[\lVert \hat{a}_{t:t+H-1} - a_{t:t+H-1} \rVert_2^2 + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid s_t, a_{t:t+H-1}) \,\|\, p(z)\big)\Big], \tag{2}$$

where $z$ is sampled from $q_\phi(z \mid s_t, a_{t:t+H-1})$ and $\beta$ is a weighting factor regularizing the skill latent distribution toward a prior $p(z)$, a tanh-transformed unit Gaussian distribution.
To ensure the latent skill space is suited for long-term prediction, we jointly train a skill dynamics model with the VAE above. The skill dynamics model learns to predict the latent state $H$ steps ahead conditioned on a skill, over $N$ sequential skill transitions, using the latent state consistency loss [10]. To prevent a trivial solution and to encode rich information from observations, we additionally train an observation decoder with an observation reconstruction loss. Altogether, the skill dynamics $D_\theta$, state encoder $E_\theta$, and observation decoder $O_\phi$ are trained on the following objective:

$$\mathcal{L}_{\text{model}} = \mathbb{E}\sum_{i=0}^{N-1}\Big[\lVert \hat{h}_{t+(i+1)H} - \bar{E}(s_{t+(i+1)H}) \rVert_2^2 + \lVert O_\phi(\hat{h}_{t+iH}) - s_{t+iH} \rVert_2^2\Big], \tag{3}$$

where $\hat{h}_t = E_\theta(s_t)$ and $\hat{h}_{t+(i+1)H} = D_\theta(\hat{h}_{t+iH}, z_{t+iH})$, such that gradients are back-propagated through time. For stable training, we use a target network $\bar{E}$ whose parameters are slowly soft-copied from $E_\theta$.
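The latent-consistency idea can be sketched with toy numpy stand-ins. The array shapes, `dynamics`, and `soft_update` below are illustrative assumptions, not the paper's networks:

```python
import numpy as np

# Roll the skill dynamics forward in latent space and regress each predicted
# latent onto a slow-moving target encoding of the true future state.

rng = np.random.default_rng(1)
Dh, Z = 6, 3                                      # latent state dim, skill dim
W_dyn = rng.standard_normal((Dh, Dh + Z)) * 0.1   # skill dynamics D(h, z)
W_enc = rng.standard_normal((Dh, Dh)) * 0.1       # online encoder E(s)
W_tgt = W_enc.copy()                              # target encoder (EMA copy)

def dynamics(h, z):
    return np.tanh(W_dyn @ np.concatenate([h, z]))

def consistency_loss(states, skills):
    """states[i] is the true state every H steps; skills[i] the skill taken."""
    h = W_enc @ states[0]                 # encode first state with online encoder
    loss = 0.0
    for s_next, z in zip(states[1:], skills):
        h = dynamics(h, z)                # latent rollout (BPTT would flow here)
        target = W_tgt @ s_next           # target encoder: no gradient in practice
        loss += float(np.sum((h - target) ** 2))
    return loss

def soft_update(tau=0.01):
    """EMA soft-copy of encoder parameters into the target encoder."""
    global W_tgt
    W_tgt = (1 - tau) * W_tgt + tau * W_enc

states = rng.standard_normal((4, Dh))     # s_t, s_{t+H}, s_{t+2H}, s_{t+3H}
skills = rng.standard_normal((3, Z))
print(consistency_loss(states, skills))   # a non-negative scalar
soft_update()
```

Because each predicted latent feeds the next prediction, training on multi-skill sequences is what makes gradients flow through time.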

Furthermore, to guide exploration in downstream RL, we also extract a skill prior [28] from the offline data that predicts the skill distribution at any state. The skill prior is trained by minimizing the KL divergence between the output distributions of the skill encoder and the skill prior:

$$\mathcal{L}_{\text{prior}} = D_{\mathrm{KL}}\big(\mathrm{sg}\big(q_\phi(z \mid s_t, a_{t:t+H-1})\big) \,\|\, p_\phi(z \mid s_t)\big), \tag{4}$$

where $\mathrm{sg}$ denotes the stop-gradient operator.
Combining the objectives above, we jointly train the policy, model, and prior, which leads to a well-shaped skill latent space optimized for both skill reconstruction and long-term prediction:

$$\mathcal{L}_{\text{pre-train}} = \mathcal{L}_{\text{VAE}} + \mathcal{L}_{\text{model}} + \mathcal{L}_{\text{prior}}. \tag{5}$$
3.4 Downstream Task Learning with Learned Skill Dynamics Model

To accelerate downstream RL with the learned skill repertoire, SkiMo learns a high-level task policy that outputs a latent skill embedding $z$, which is then translated into a sequence of $H$ actions by the pre-trained skill policy to act in the environment [28, 29].

To further improve sample efficiency, we propose to use model-based RL in the skill space by leveraging the skill dynamics model. The skill dynamics model and task policy can generate imaginary rollouts in the skill space by repeating (1) sampling a skill, $z \sim \pi_\psi(z \mid h)$, and (2) predicting the latent state $H$ steps ahead after executing the skill, $h' = D_\theta(h, z)$. Our skill dynamics model thus requires only $1/H$ of the dynamics predictions and action selections of flat model-based RL approaches [9, 10], resulting in more efficient and accurate long-horizon imaginary rollouts (see Appendix, Figure 9).
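A skill-space imaginary rollout can be sketched as follows, with toy stand-ins for the task policy and skill dynamics (names and dimensions are ours, not the paper's):

```python
import numpy as np

# Alternate (1) selecting a skill from the task policy and (2) jumping H
# low-level steps at once with the skill dynamics, entirely in latent space.

rng = np.random.default_rng(3)
Dh, Z = 5, 3
W_pi = rng.standard_normal((Z, Dh)) * 0.1        # task policy pi(z | h)
W_dyn = rng.standard_normal((Dh, Dh + Z)) * 0.1  # skill dynamics D(h, z)

def imagine(h0, n_skills):
    """Return latent states and skills from an n_skills-long imagined rollout."""
    h, latents, skills = h0, [h0], []
    for _ in range(n_skills):
        z = np.tanh(W_pi @ h)                         # (1) select a skill
        h = np.tanh(W_dyn @ np.concatenate([h, z]))   # (2) jump H steps at once
        latents.append(h)
        skills.append(z)
    return latents, skills

latents, skills = imagine(rng.standard_normal(Dh), n_skills=50)
print(len(latents), len(skills))  # 51 50
```

Fifty skill-level predictions here stand in for what would be 500 single-step predictions with a flat model and $H = 10$.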

Following TD-MPC [10], we leverage these imaginary rollouts both for planning (Algorithm 2) and policy optimization (Equation (7)), significantly reducing the number of necessary environment interactions. During rollout, we perform Model Predictive Control (MPC), which re-plans every step using CEM and executes the first skill of the skill plan (see Appendix, Section C for more details).
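A minimal CEM planner over skill sequences might look as follows. This is illustrative only: a toy quadratic score stands in for evaluating imagined rollouts with the learned reward and value functions, and all names and sizes are assumptions.

```python
import numpy as np

# Cross-entropy method over a short plan of skills: sample candidate plans
# from a Gaussian, keep the top-scoring elites, refit the Gaussian, repeat.

rng = np.random.default_rng(2)
Z, PLAN_LEN = 4, 5                  # skill dim, number of skills in the plan
target = np.ones((PLAN_LEN, Z))     # toy optimum: the all-ones plan

def score(plan):
    """Stand-in for the imagined return of executing this skill plan."""
    return -float(np.sum((plan - target) ** 2))

def cem_plan(iters=20, pop=64, elites=8):
    mu = np.zeros((PLAN_LEN, Z))
    std = np.ones((PLAN_LEN, Z))
    for _ in range(iters):
        cand = mu + std * rng.standard_normal((pop, PLAN_LEN, Z))
        best = cand[np.argsort([score(c) for c in cand])[-elites:]]
        mu, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mu

plan = cem_plan()
first_skill = plan[0]   # MPC: execute only the first skill, then re-plan
print(float(np.abs(plan - target).max()))
```

Executing only `first_skill` and re-planning at the next decision point is the MPC loop described above.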

To evaluate imaginary rollouts, we train a reward function $R_\psi$ that predicts the sum of $H$-step rewards (for clarity, we use $\tilde{r}_t$ to denote the sum of $H$-step rewards, $\tilde{r}_t = \sum_{i=0}^{H-1} r_{t+i}$) and a Q-value function $Q_\psi$. We also finetune the skill dynamics model and state representation model on the downstream task to improve the model prediction:

$$\mathcal{L}_{\text{RL}} = \mathbb{E}\sum_{i=0}^{N-1}\Big[\big(R_\psi(h_{t+iH}, z_{t+iH}) - \tilde{r}_{t+iH}\big)^2 + \big(Q_\psi(h_{t+iH}, z_{t+iH}) - y_{t+iH}\big)^2 + \lVert D_\theta(h_{t+iH}, z_{t+iH}) - \bar{E}(s_{t+(i+1)H}) \rVert_2^2\Big], \tag{6}$$

where $y_{t+iH}$ is the temporal-difference target computed with target networks.
Finally, we train a high-level task policy $\pi_\psi$ to maximize the estimated $Q$-value while regularizing it toward the pre-trained skill prior [28], which helps the policy output plausible skills:

$$\mathcal{L}_{\pi} = \mathbb{E}\Big[-Q_\psi(h_t, z_t) + \alpha\, D_{\mathrm{KL}}\big(\pi_\psi(z \mid h_t) \,\|\, p_\phi(z \mid s_t)\big)\Big], \quad z_t \sim \pi_\psi(z \mid h_t). \tag{7}$$
For both Equation (6) and Equation (7), consecutive skill-level transitions can be sampled together, so that the models can be trained using backpropagation through time, similar to Equation (3).
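The trade-off in the policy objective can be illustrated with toy quantities. The closed-form KL between unit-variance Gaussians is exact; everything else below (the Q function, the prior mean, the weight) is an assumed stand-in:

```python
import numpy as np

# Pick the policy mean that maximizes Q while a KL term pulls it toward the
# pre-trained skill prior. With unit-variance Gaussians,
# KL(N(mu_pi, I) || N(mu_prior, I)) = 0.5 * ||mu_pi - mu_prior||^2.

def policy_loss(mu_pi, q_value, mu_prior, alpha=0.1):
    kl = 0.5 * float(np.sum((mu_pi - mu_prior) ** 2))
    return -q_value(mu_pi) + alpha * kl

# Toy Q: prefers skills near z* = (1, 1); the prior sits at the origin.
q = lambda z: -float(np.sum((z - 1.0) ** 2))
prior_mu = np.zeros(2)

loss_at_qmax = policy_loss(np.ones(2), q, prior_mu)   # -0 + 0.1 * 1.0 = 0.1
loss_at_prior = policy_loss(prior_mu, q, prior_mu)    # -(-2) + 0     = 2.0
print(loss_at_qmax, loss_at_prior)
```

With a small $\alpha$, the Q term dominates and the policy moves toward high-value skills; a larger $\alpha$ keeps it closer to skills the prior considers plausible.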
4 Experiments

In this paper, we propose a model-based RL approach that can efficiently and accurately plan long-horizon trajectories over the skill space, rather than the primitive action space, by leveraging the skills and skill dynamics model learned from an offline task-agnostic dataset. Hence, in our experiments, we aim to answer the following questions: (1) Can the use of the skill dynamics model improve the efficiency of RL training for long-horizon tasks? and (2) Is the joint training of skills and the skill dynamics model essential for efficient model-based learning?

4.1 Tasks


(a) Maze


(b) Kitchen


(c) Mis-aligned Kitchen


(d) CALVIN


Figure 3: We evaluate our method on four long-horizon, sparse reward tasks. (a) The green point mass navigates the maze to reach the goal (red). (b, c) The robot arm in the kitchen must complete four tasks in the correct order (Microwave - Kettle - Bottom Burner - Light and Microwave - Light - Slide Cabinet - Hinge Cabinet). (d) The robot arm needs to compose skills learned from extremely task-agnostic data (Open Drawer - Turn on Lightbulb - Move Slider Left - Turn on LED).

To evaluate whether our method can efficiently learn temporally-extended tasks with sparse rewards, we compare it to prior model-based RL and skill-based RL approaches on four tasks: 2D maze navigation, two robotic kitchen manipulation tasks, and a robotic tabletop manipulation task, as illustrated in Figure 3. More details about the environments, tasks, and offline data can be found in Section C.


Maze

We use the maze navigation task from Pertsch et al. [29], where a point mass agent is initialized randomly near the green region and needs to reach the fixed goal region in red (Figure 3(a)). The agent observes its 2D position and 2D velocity, and controls its $(x, y)$-velocity. The agent receives a sparse reward of 100 only when it reaches the goal. The task-agnostic offline dataset from Pertsch et al. [29] consists of 3,046 trajectories between randomly sampled initial and goal positions.


Kitchen

We use the FrankaKitchen environment and 603 teleoperation trajectories from D4RL [5]. The 7-DoF Franka Emika Panda arm needs to perform four sequential sub-tasks (Microwave - Kettle - Bottom Burner - Light). In Mis-aligned Kitchen, we also test another task sequence (Microwave - Light - Slide Cabinet - Hinge Cabinet), which has a low sub-task transition probability in the offline data distribution [29]. The agent observes an 11D robot state and a 19D object state, and uses 9D joint velocity control. The agent receives a reward of 1 for every sub-task completed in order.


CALVIN

We adapt the CALVIN benchmark [23] to have the target task Open Drawer - Turn on Lightbulb - Move Slider Left - Turn on LED and a 21D observation space of robot and object states. It also uses a Panda arm, but with 7D end-effector pose control. We use the play data of 1,239 trajectories from Mees et al. [23] as our offline data. The agent receives a reward of 1 for every sub-task completed in the correct order.

4.2 Baselines and Ablated Methods

We compare our method to the state-of-the-art model-based RL [9, 10], skill-based RL [28], and combinations of them, as well as three ablations of our method, as summarized in Appendix, Table 1.


  • Dreamer [9] and TD-MPC [10] learn a flat (single-step) dynamics and train a policy using latent imagination to achieve a high sample efficiency.

  • SPiRL [28] learns skills and a skill prior, and guides a high-level policy using the learned prior.

  • SPiRL + Dreamer and SPiRL + TD-MPC pre-train the skills using SPiRL and learn a policy and model in the skill space (instead of the action space) using Dreamer and TD-MPC, respectively. In contrast to SkiMo, these baselines do not jointly train the model and skills.

  • SkiMo w/o joint training learns the latent skill space using only the VAE loss in Equation (2).

  • SkiMo + SAC uses model-free RL (SAC [8]) to train a policy in the skill space.

  • SkiMo w/o CEM selects skills directly from the policy, without planning with the learned model.

4.3 Results


Maze

Maze navigation poses a hard exploration problem due to reward sparsity: the agent only receives reward after taking 1,000+ steps to reach the goal. Figure 4(a) shows that only SkiMo consistently succeeds at long-horizon navigation, whereas the baselines struggle to learn a policy or an accurate model due to the challenges of sparse feedback and long-term planning.

To better understand the result, we qualitatively analyze the behavior of each agent in Appendix, Figure 8. Dreamer and TD-MPC have a small coverage of the maze, since it is challenging to coherently explore for 1,000+ steps to reach the goal from taking primitive actions. SPiRL is able to explore a large fraction of the maze, but it does not learn to consistently find the goal due to difficult policy optimization in long-horizon tasks. On the other hand, SPiRL + Dreamer and SPiRL + TD-MPC fail to learn an accurate model and often collide with walls.


Kitchen

Figure 4(b) demonstrates that SkiMo reaches the same performance (above 3 sub-tasks) with 5x fewer environment interactions than SPiRL. This improvement in sample efficiency is crucial for real-world robot learning. In contrast, Dreamer and TD-MPC rarely complete the first sub-task due to the difficulty of long-horizon learning with primitive actions. SPiRL + Dreamer and SPiRL + TD-MPC perform better than flat model-based RL by leveraging skills, yet the independently trained model and policy are not accurate enough to consistently achieve more than two sub-tasks.

Mis-aligned Kitchen

The mis-aligned target task makes downstream learning harder, because the skill prior, which reflects the offline data distribution, offers less meaningful regularization to the policy. Figure 4(c) shows that SkiMo still performs well despite the mis-alignment between the offline data and the downstream task. This demonstrates that the skill dynamics model can adapt to a new distribution of behaviors, which may deviate greatly from the distribution in the offline dataset.


CALVIN

One of the major challenges in CALVIN is that the offline data is far more task-agnostic. Any particular sub-task transition has a probability lower than 0.1% on average, resulting in a large number of plausible sub-tasks from any state. This setup mimics real-world large-scale robot learning, where the robot may not receive a carefully curated dataset. Figure 4(d) demonstrates that SkiMo learns faster than the model-free baseline, SPiRL, which supports the benefit of our skill dynamics model. Meanwhile, Dreamer performs better in CALVIN than in Kitchen because objects in CALVIN are more compactly located and easier to manipulate; thus, it becomes viable to accomplish initial sub-tasks through random exploration. Yet, it falls short in composing coherent action sequences for longer task sequences due to the lack of temporally-extended reasoning.

In summary, we show the synergistic benefit of temporal abstraction in both the policy and dynamics model. SkiMo achieves at least 5x higher sample efficiency than all baselines in robotic domains, and is the only method that consistently solves the long-horizon maze navigation. Our results also demonstrate the importance of algorithmic design choices (e.g. skill-level planning, joint training of a model and skills) as naive combinations (SPiRL + Dreamer, SPiRL + TD-MPC) fail to learn.


(a) Maze


(b) Kitchen


(c) Mis-aligned Kitchen


(d) CALVIN


Figure 4: Learning curves of our method and baselines. All averaged over 5 random seeds.

4.4 Ablation Studies

Model-based vs. Model-free

In Figure 5, SkiMo achieves better asymptotic performance and higher sample efficiency across all tasks than SkiMo + SAC, which directly uses the high-level task policy to select skills instead of using the skill dynamics model to plan. Since the only difference is in the use of the skill dynamics model for planning, this suggests that the task policy can make more informative decisions by leveraging accurate long-term predictions of the skill dynamics model.

Joint training of skills and skill dynamics model

Figure 5 shows that the joint training is crucial for Maze and CALVIN while the difference is marginal in the Kitchen tasks. This suggests that the joint training is essential especially in more challenging scenarios, where the agent needs to generate accurate long-term plans (for Maze) or the skills are very diverse (in CALVIN).

CEM planning

As shown in Figure 5, SkiMo learns significantly better and faster than SkiMo w/o CEM in Kitchen, Mis-aligned Kitchen, and CALVIN, indicating that CEM planning can effectively find a better plan. In Maze, on the other hand, SkiMo w/o CEM learns twice as fast. We find that the exploration noise in CEM causes the agent to drift away from the support of the skill prior and get stuck at walls and corners. We believe that with careful tuning of this noise, SkiMo could solve Maze much more efficiently. Meanwhile, the fast learning of SkiMo w/o CEM in Maze confirms the advantage of policy optimization using imaginary rollouts generated by our skill dynamics model.

For further ablations and discussion on skill horizon and planning horizon, see Appendix, Section A.


(a) Maze


(b) Kitchen


(c) Mis-aligned Kitchen


(d) CALVIN


Figure 5: Learning curves of our method and ablated models. All averaged over 5 random seeds.

4.5 Long-horizon Prediction with Skill Dynamics Model

To assess the long-term prediction accuracy of our proposed skill dynamics over flat dynamics, we visualize imagined trajectories in Appendix, Figure 9(a), where the ground truth initial state and a sequence of 500 actions (50 skills for SkiMo) are given. Dreamer struggles to make accurate long-horizon predictions due to error accumulation. In contrast, SkiMo reproduces the ground truth trajectory with little prediction error even when traversing hallways and doorways. This is mainly because SkiMo applies temporal abstraction in the dynamics model, enabling temporally-extended prediction and reducing accumulated step-by-step prediction error.

5 Conclusion

We propose SkiMo, an intuitive instantiation of saltatory model-based hierarchical RL [2] that combines skill-based and model-based RL approaches. Our experiments demonstrate that (1) a skill dynamics model reduces long-term prediction error, improving upon prior model-based RL and skill-based RL; (2) temporal abstraction in both the policy and dynamics model allows downstream RL to perform efficient, temporally-extended reasoning without step-by-step planning; and (3) joint training of the skill dynamics and skill representations further improves sample efficiency by learning skills whose consequences are easy to predict. We believe that the ability to learn and utilize a skill-level model holds a key to unlocking the sample efficiency and widespread use of RL agents, and our method takes a step in this direction.

Limitations and future work

While our method extracts fixed-length skills from offline data, the lengths of semantic skills may vary with context and goal. Future work can learn variable-length semantic skills to further improve long-term prediction and planning. Further, although we experimented only with state-based inputs, SkiMo is a general framework that can be extended to RGB, depth, and tactile observations. We would thus like to apply this approach to real robots, where sample efficiency is crucial.

Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No. 2019-0-00075, Artificial Intelligence Graduate School Program, KAIST) and the National Research Foundation of Korea (NRF) grant (NRF-2021H1D3A2A03103683), funded by the Korea government (MSIT). This work was also partly supported by the Annenberg Fellowship from USC. We would like to thank Ayush Jain and Grace Zhang for help with writing, Karl Pertsch for assistance in setting up SPiRL and CALVIN, and all members of the USC CLVR lab for constructive feedback.


  • [1] A. Argenson and G. Dulac-Arnold (2021) Model-based offline planning. In ICLR.
  • [2] M. Botvinick and A. Weinstein (2014) Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society B: Biological Sciences 369 (1655).
  • [3] M. Dalal, D. Pathak, and R. Salakhutdinov (2021) Accelerating robotic reinforcement learning via parameterized action primitives. In NeurIPS.
  • [4] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn (2019) RoboNet: large-scale multi-robot learning. In CoRL.
  • [5] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020) D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
  • [6] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman (2019) Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. In CoRL.
  • [7] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122.
  • [8] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, pp. 1856–1865.
  • [9] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. In ICLR.
  • [10] N. Hansen, X. Wang, and H. Su (2022) Temporal difference learning for model predictive control. In ICML.
  • [11] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. In NeurIPS.
  • [12] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR.
  • [13] T. Kipf, Y. Li, H. Dai, V. Zambaldi, A. Sanchez-Gonzalez, E. Grefenstette, P. Kohli, and P. Battaglia (2019) CompILE: compositional imitation learning and execution. In ICML.
  • [14] Y. Lee, E. S. Hu, and J. J. Lim (2021) IKEA furniture assembly environment for long-horizon complex manipulation tasks. In ICRA.
  • [15] Y. Lee, J. J. Lim, A. Anandkumar, and Y. Zhu (2021) Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization. In CoRL.
  • [16] Y. Lee, S. Sun, S. Somasundaram, E. S. Hu, and J. J. Lim (2019) Composing complex skills by learning transition policies. In ICLR.
  • [17] Y. Lee, J. Yang, and J. J. Lim (2020) Learning to coordinate manipulation skills via skill behavior diversification. In ICLR.
  • [18] S. Legg and M. Hutter (2007) Universal intelligence: a definition of machine intelligence. Minds and Machines 17 (4), pp. 391–444.
  • [19] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch (2019) Plan online, learn offline: efficient learning and exploration via model-based control. In ICLR.
  • [20] K. Lu, A. Grover, P. Abbeel, and I. Mordatch (2021) Reset-free lifelong learning with skill-space planning. In ICLR.
  • [21] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet (2020) Learning latent plans from play. In CoRL, pp. 1113–1132.
  • [22] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei (2018) RoboTurk: a crowdsourcing platform for robotic skill learning through imitation. In CoRL, pp. 879–893.
  • [23] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard (2022) CALVIN: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters.
  • [24] R. Mendonca, O. Rybkin, K. Daniilidis, D. Hafner, and D. Pathak (2021) Discovering and achieving goals via world models. In neurips, Cited by: §2, footnote 1.
  • [25] O. Nachum, H. Tang, X. Lu, S. Gu, H. Lee, and S. Levine (2019) Why does hierarchy (sometimes) work so well in reinforcement learning?. arXiv preprint arXiv:1909.10618. Cited by: §2.
  • [26] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal (2009) Learning and generalization of motor skills by learning from demonstration. In icra, pp. 763–768. Cited by: §2.
  • [27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017)

    Automatic differentiation in PyTorch

    In NIPS Autodiff Workshop, Cited by: §C.1.
  • [28] K. Pertsch, Y. Lee, and J. J. Lim (2020) Accelerating reinforcement learning with learned skill priors. In corl, Cited by: §C.2, §C.2, §C.2, §C.2, Table 1, §1, §1, §3.1, §3.1, §3.3, §3.3, §3.3, §3.4, §3.4, 2nd item, §4.2.
  • [29] K. Pertsch, Y. Lee, Y. Wu, and J. J. Lim (2021) Demonstration-guided reinforcement learning with learned skills. In corl, Cited by: §C.3, §C.3, §C.3, §1, §3.3, §3.4, §4.1, §4.1.
  • [30] D. A. Pomerleau (1989)

    Alvinn: an autonomous land vehicle in a neural network

    In nips, pp. 305–313. Cited by: §C.2.
  • [31] R. Y. Rubinstein (1997) Optimization of computer simulation models with rare events. European Journal of Operational Research 99 (1), pp. 89–112. Cited by: §2.
  • [32] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In iclr, Cited by: §C.2.
  • [33] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak (2020) Planning to explore via self-supervised world models. In icml, Cited by: §3.3.
  • [34] D. Shah, A. T. Toshev, S. Levine, and brian ichter (2022) Value function spaces: skill-centric state abstractions for long-horizon reasoning. In iclr, External Links: Link Cited by: §2.
  • [35] T. Shankar and A. Gupta (2020) Learning robot skills with temporal variational inference. In icml, Cited by: footnote 1.
  • [36] T. Shankar, S. Tulsiani, L. Pinto, and A. Gupta (2020) Discovering motor programs by recomposing demonstrations. In iclr, Cited by: footnote 1.
  • [37] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman (2020) Dynamics-aware unsupervised discovery of skills. In iclr, Cited by: §2.
  • [38] R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §1.
  • [39] R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §2.
  • [40] R. S. Sutton (1984) Temporal credit assignment in reinforcement learning. Ph.D. Thesis, University of Massachusetts Amherst. Cited by: §3.1.
  • [41] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller (2018) DeepMind control suite. arXiv preprint arXiv:1801.00690. Cited by: §C.2.
  • [42] G. Williams, A. Aldrich, and E. Theodorou (2015) Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149. Cited by: §2.
  • [43] B. Wu, S. Nair, L. Fei-Fei, and C. Finn (2021) Example-driven model-based reinforcement learning for solving long-horizon visuomotor tasks. In corl, Cited by: §2, §3.3.
  • [44] K. Xie, H. Bharadhwaj, D. Hafner, A. Garg, and F. Shkurti (2020) Latent skill planning for exploration and transfer. In iclr, Cited by: §2, §3.3.

Appendix A Further Ablations

We include additional ablations on the Maze and Kitchen tasks to further investigate the influence of the skill horizon and the planning horizon, which are important for skill learning and planning, respectively.

a.1 Skill Horizon


Figure 6: Ablation analysis on the skill horizon. (a) Maze. (b) Kitchen.

In both the Maze and Kitchen environments, we find that a skill horizon that is too short fails to provide sufficient temporal abstraction. A longer skill horizon has little influence in Kitchen, but it substantially degrades downstream performance in Maze. With overly long skills, skill dynamics prediction becomes more difficult and more stochastic, and composing multiple skills is less flexible than with short-horizon skills. The resulting inaccurate skill dynamics makes long-term planning harder, which is already a major challenge in maze navigation.

a.2 Planning Horizon


Figure 7: Ablation analysis on the planning horizon. (a) Maze. (b) Kitchen.

In Figure 6(b), we see that a short planning horizon slows learning in the beginning, because it does not effectively leverage the skill dynamics model to plan far ahead. Conversely, if the planning horizon is too long, performance degrades slightly due to the difficulty of modeling every step accurately. Indeed, a planning horizon of 20 corresponds to 200 low-level steps, while the episode length in Kitchen is 280; this would demand that the agent plan nearly the entire episode. Performance is not sensitive to intermediate planning horizons.

On the other hand, the effect of the planning horizon differs in Maze due to its distinct environment characteristics. Interestingly, we find that a very long planning horizon (e.g., 20) and a very short planning horizon (e.g., 1) perform similarly in Maze (Figure 6(a)). This may be because the former creates useful long-horizon plans, while the latter avoids error accumulation altogether. We leave further investigation of the planning horizon to future work.

Appendix B Qualitative Analysis on Maze

b.1 Exploration and Exploitation

To gauge the agent's ability to explore and exploit, we visualize the replay buffer for each method in Figure 8. In this visualization, we represent early trajectories in the replay buffer with light blue dots and recent trajectories with dark blue dots. In Figure 7(a), the replay buffer of SkiMo (ours) contains early explorations that span most corners of the maze. After it finds the goal, it exploits this knowledge and commits to paths between the start location and the goal (in dark blue). This explains why our method can quickly learn and consistently accomplish the task. Dreamer and TD-MPC explore only a small fraction of the maze, because they are prone to getting stuck at corners or walls without guided exploration from skills and skill priors. SPiRL + Dreamer, SPiRL + TD-MPC, and SkiMo w/o joint training explore better than Dreamer and TD-MPC, but all fail to find the goal. Without joint training of the model and policy, the skill space is optimized only for action reconstruction, not for planning, which makes long-horizon exploration and exploitation harder.

On the other hand, SkiMo + SAC and SPiRL are able to explore most of the maze, but their coverage is too broad to enable efficient learning. That is, even after the agent finds the goal through exploration, it continues to explore and does not exploit this experience to accomplish the task consistently (darker blue). This may be attributed to the difficult long-horizon credit assignment problem, which slows policy learning, and to the reliance on the skill prior, which encourages exploration. In contrast, our skill dynamics model effectively absorbs prior experience to generate goal-achieving imaginary rollouts for the actor and critic to learn from, which makes task learning more efficient. In essence, we find the skill dynamics model useful in guiding the agent to explore coherently and exploit efficiently.
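The coverage-versus-commitment contrast above can also be quantified with a simple cell-visitation metric. Below is a minimal NumPy sketch; the maze size, cell resolution, and the stand-in trajectories are our own illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def coverage(trajs, maze_size=40.0, cell=2.0):
    """Fraction of maze cells visited by a set of (x, y) trajectories."""
    n_cells = int(maze_size / cell)
    visited = np.zeros((n_cells, n_cells), dtype=bool)
    for traj in trajs:
        idx = np.clip((traj / cell).astype(int), 0, n_cells - 1)
        visited[idx[:, 0], idx[:, 1]] = True
    return float(visited.mean())

# Stand-in data: early rollouts wander everywhere; recent rollouts commit
# to a narrow start-to-goal corridor (the dark-blue paths in Figure 8a).
rng = np.random.default_rng(0)
early = [rng.uniform(0, 40, size=(200, 2)) for _ in range(20)]
t = np.linspace(0, 1, 200)[:, None]
recent = [np.hstack([2 + 30 * t, 2 + 30 * t]) + rng.normal(0, 0.3, (200, 2))
          for _ in range(20)]

early_cov = coverage(early)    # broad exploration -> high coverage
recent_cov = coverage(recent)  # narrow exploitation corridor -> low coverage
```

A method that explores well and then exploits should show high coverage early and a sharp drop in the coverage of recent trajectories, which is the pattern SkiMo exhibits in Figure 8(a).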


Figure 8: Exploration and exploitation behaviors of our method and baseline approaches: (a) SkiMo (Ours), (b) Dreamer, (c) TD-MPC, (d) SPiRL + Dreamer, (e) SPiRL + TD-MPC, (f) SkiMo w/o Joint Training, (g) SkiMo + SAC, (h) SPiRL. We visualize trajectories in the replay buffer at 1.5M training steps in blue (light blue for early trajectories and dark blue for recent trajectories). Our method shows wide coverage of the maze at the early stage of training and fast convergence to the solution.

b.2 Long-horizon Prediction

To compare the long-term prediction ability of the skill dynamics and flat dynamics, we visualize imagined trajectories: we sample trajectory clips of 500 timesteps from the agent's replay buffer (the maximum episode length in Maze is 2,000) and predict the latent state 500 steps ahead (decoded with the observation decoder), given the initial state and 500 ground-truth actions (50 skills for SkiMo). The similarity between the imagined trajectory and the ground-truth trajectory indicates whether the model can make accurate predictions far into the future, producing useful imaginary rollouts for policy learning and planning.

SkiMo is able to reproduce the ground truth trajectory with little prediction error even when traversing through hallways and doorways while Dreamer struggles to make accurate long-horizon predictions due to error accumulation. This is mainly because SkiMo allows temporal abstraction in the dynamics model, thereby enabling temporally-extended prediction and reducing step-by-step prediction error.
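The effect of fewer open-loop model calls can be illustrated with a toy one-dimensional example, assuming (purely for illustration) that each model call contributes the same small bias. The flat model is queried once per low-level step, the skill dynamics model once per skill, so with the same per-call error the skill model accumulates an order of magnitude less drift over 500 steps.

```python
import numpy as np

def open_loop_rollout(x0, n_calls, step_fn):
    """Apply a learned one-call predictor repeatedly (open loop)."""
    x = x0.copy()
    for _ in range(n_calls):
        x = step_fn(x)
    return x

# Toy ground truth: the state drifts by +1 per low-level step.
T = 500      # low-level horizon
H = 10       # skill horizon: one skill-dynamics call covers H steps
eps = 0.01   # per-call model bias (assumed identical for both models)

x0 = np.zeros(1)
truth = x0 + T

# Flat model: one call per low-level step -> T calls, T * eps accumulated bias.
flat = open_loop_rollout(x0, T, lambda x: x + 1.0 + eps)
# Skill model: one call per skill -> T/H calls, (T/H) * eps accumulated bias.
skill = open_loop_rollout(x0, T // H, lambda x: x + float(H) + eps)

flat_err = float(abs(flat - truth)[0])    # 500 calls * 0.01 bias
skill_err = float(abs(skill - truth)[0])  # 50 calls * 0.01 bias
```

This is of course a caricature; in practice the per-call error of a skill dynamics model need not equal that of a flat model, but the compounding argument is the same.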


Figure 9: Prediction results over 500 timesteps using (a) a flat single-step model (Dreamer) and (b) our skill dynamics model (SkiMo), starting from the ground-truth initial state and 500 actions (50 skills for SkiMo). The predicted states from the flat model deviate from the ground-truth trajectory quickly, while the prediction of our skill dynamics model shows little error.

Appendix C Implementation Details

c.1 Computing Resources

Our approach and all baselines are implemented in PyTorch [27]. All experiments are conducted on a workstation with an Intel Xeon E5-2640 v4 CPU and an NVIDIA Titan Xp GPU. Pre-training of the skill policy and skill dynamics model takes around 10 hours. Downstream RL for 2M timesteps takes around 18 hours. The policy and model update frequency is the same across all algorithms except Dreamer [9]. Since Dreamer is the only method that trains on primitive actions, it performs model and policy updates 10 times more frequently than the skill-based algorithms, which leads to slower training (about 52 hours).

c.2 Algorithm Implementation Details

For the baseline implementations, we use the official code for SPiRL and re-implement Dreamer and TD-MPC in PyTorch; our re-implementations are verified on DeepMind Control tasks [41]. Table 1 compares key components of SkiMo with the model-based and skill-based baselines and ablated methods.

Method | Skill-based | Model-based | Joint training
Dreamer [9] and TD-MPC [10] | - | ✓ | -
SPiRL [28] | ✓ | - | -
SPiRL + Dreamer and SPiRL + TD-MPC | ✓ | ✓ | -
SkiMo w/o joint training | ✓ | ✓ | -
SkiMo + SAC | ✓ | - | ✓
SkiMo (Ours) and SkiMo w/o CEM | ✓ | ✓ | ✓
Table 1: Comparison to prior work and ablated methods.

Dreamer [9]

We use the same hyperparameters as the official implementation.

Td-Mpc [10]

We use the same hyperparameters as the official implementation, except that we do not use prioritized experience replay [32]. The same implementation is used for the SPiRL + TD-MPC baseline and for our method, with only minor modifications.

SPiRL [28]

We use the official implementation from the original paper with its suggested hyperparameters.

SPiRL + Dreamer [28]

We use our implementation of Dreamer and simply replace the action space with the latent skill space of SPiRL. We use the same pre-trained SPiRL skill policy and skill prior networks as the SPiRL baseline. Initializing the high-level downstream task policy with the skill prior, which is critical for downstream learning performance [28], is not possible due to the policy network architecture mismatch between Dreamer and SPiRL. Thus, we only use the prior divergence to regularize the high-level policy. Directly pre-training the high-level policy did not improve performance, though it might have worked better with more tuning.
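The prior-divergence regularization mentioned above amounts to penalizing the KL divergence between the high-level policy and the pre-trained skill prior, which has a closed form for diagonal Gaussians. The sketch below is illustrative: the policy/prior statistics and the coefficient value are stand-ins, not values from our experiments.

```python
import numpy as np

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ) for diagonal Gaussians,
    summed over the skill dimensions."""
    return float(np.sum(np.log(std_p / std_q)
                        + (std_q**2 + (mu_q - mu_p)**2) / (2 * std_p**2) - 0.5))

# Hypothetical policy and skill-prior outputs for one state:
mu_pi, std_pi = np.array([0.2, -0.1]), np.array([0.3, 0.4])
mu_prior, std_prior = np.array([0.0, 0.0]), np.array([1.0, 1.0])

alpha = 1.0  # prior divergence coefficient (cf. Table 2); illustrative here
# Regularized actor objective: maximize Q(s, z) - alpha * KL(pi || prior).
penalty = alpha * gaussian_kl(mu_pi, std_pi, mu_prior, std_prior)
```

When the policy matches the prior exactly, the penalty vanishes, so the regularizer only pulls the policy back toward skills the prior considers likely.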

SPiRL + TD-MPC [10]

Similar to SPiRL + Dreamer, we use our implementation of TD-MPC and replace the action space with the latent skill space of SPiRL. Initializing the task policy is likewise not possible due to the different architecture used in TD-MPC.

SkiMo (Ours)

The skill-based RL part of our method is inspired by Pertsch et al. [28], and the model-based component is inspired by Hansen et al. [10] and Hafner et al. [9]. We detail skill and skill dynamics learning in Algorithm 1, the planning algorithm in Algorithm 2, and model-based RL in Algorithm 3. Table 2 lists all the hyperparameters we used.

0:  : offline task-agnostic data
1:  Randomly initialize
2:   initialize target network
3:  for each iteration do
4:     Sample mini-batch
5:      from Equation (5)
6:      update target network
7:  end for
8:  return
Algorithm 1 SkiMo RL (skill and skill dynamics learning)
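As a rough sketch of the loss computed in line 5 of Algorithm 1, the joint pre-training objective combines action reconstruction by the low-level policy, a KL term on the skill posterior, and a latent-consistency term for the skill dynamics. All tensors below are random stand-ins for the outputs of the state encoder, skill encoder, low-level policy, and skill dynamics model; the exact form of Equation (5) is given in the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
mse = lambda a, b: float(np.mean((a - b) ** 2))

# Stand-in mini-batch tensors (hypothetical shapes, cf. Table 2).
batch, H, a_dim, z_dim, h_dim = 32, 10, 9, 10, 128
actions = rng.normal(size=(batch, H, a_dim))        # a_{t:t+H-1} from data
recon_actions = rng.normal(size=(batch, H, a_dim))  # low-level policy output
h_next = rng.normal(size=(batch, h_dim))            # encoded state at t+H
pred_h_next = rng.normal(size=(batch, h_dim))       # skill dynamics prediction
z_mu = rng.normal(size=(batch, z_dim))              # skill posterior mean
z_logstd = rng.normal(size=(batch, z_dim))          # skill posterior log-std

# KL(q(z | a) || N(0, I)) for a diagonal Gaussian posterior.
kl = float(np.mean(0.5 * (np.exp(2 * z_logstd) + z_mu**2 - 1 - 2 * z_logstd)))

# Joint objective: behavioral cloning + KL regularization + latent
# consistency. Coefficients mirror Table 2 (KL 1e-4, consistency 2)
# but are illustrative here.
loss = mse(recon_actions, actions) + 1e-4 * kl + 2.0 * mse(pred_h_next, h_next)
```

Because the reconstruction and consistency terms share the skill embedding z, minimizing this loss shapes the skill space for both action decoding and prediction, which is the point of joint training.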
0:   learned parameters, : current state
1:   initialize sampling distribution
2:  for  do
3:     Sample trajectories of length from // skill sequences drawn from a normal distribution
4:     Sample trajectories of length using sampled skill sequences via imaginary rollouts
5:     Estimate -step returns of the trajectories using
6:     Compute with the top-k return trajectories // update sampling parameters for the next iteration
7:  end for
8:  Sample a skill
9:  return
Algorithm 2 SkiMo RL (CEM planning)
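A minimal NumPy sketch of the CEM loop in Algorithm 2 follows, assuming abstract callables for the skill dynamics, reward, and value models; the real implementation operates on learned latent states with batched GPU tensors and also mixes in policy-sampled trajectories.

```python
import numpy as np

def cem_plan(h0, dynamics, reward, value, horizon=10, z_dim=10,
             n_samples=512, n_elites=64, iters=6, momentum=0.1, gamma=0.99):
    """Plan a sequence of skills in latent space with CEM (cf. Algorithm 2).
    dynamics(h, z) -> next latent; reward(h, z) -> scalar; value(h) -> scalar."""
    rng = np.random.default_rng(0)
    mu = np.zeros((horizon, z_dim))
    std = 0.5 * np.ones((horizon, z_dim))
    for _ in range(iters):
        # Sample candidate skill sequences from the current distribution.
        zs = mu + std * rng.normal(size=(n_samples, horizon, z_dim))
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            h, disc = h0, 1.0
            for t in range(horizon):
                returns[i] += disc * reward(h, zs[i, t])
                h = dynamics(h, zs[i, t])   # one call jumps a whole skill
                disc *= gamma
            returns[i] += disc * value(h)   # bootstrap beyond the horizon
        elites = zs[np.argsort(returns)[-n_elites:]]
        # Refit the sampling distribution to the elites, with momentum.
        mu = momentum * mu + (1 - momentum) * elites.mean(axis=0)
        std = momentum * std + (1 - momentum) * elites.std(axis=0)
    return mu[0]  # execute the first skill, then replan (MPC)

# Toy check: the reward prefers skills near +1 in every dimension.
h0 = np.zeros(4)
dyn = lambda h, z: h + 0.1 * z.mean()
rew = lambda h, z: -np.sum((z - 1.0) ** 2)
val = lambda h: 0.0
z_star = cem_plan(h0, dyn, rew, val, horizon=3, z_dim=2,
                  n_samples=128, n_elites=16, iters=6)
```

Each planning step covers one skill (many low-level actions), so a planning horizon of 10 already looks 100 low-level steps ahead with only 10 model calls.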
0:   pre-trained parameters
1:   initialize replay buffer
2:  Randomly initialize
3:   initialize target network
4:   initialize task policy with skill prior
5:  for not converged do
6:      initialize episode
7:     for episode not done do
8:        Select a skill via CEM planning in Algorithm 2 (MPC)
9:        for  steps do
10:            Roll out the low-level skill policy in the environment
11:        end for
12:         collect -step environment interaction
13:        Sample mini-batch
14:         from Equation (6)
15:         from Equation (7), updating only the policy parameters
16:         update target network
17:         update target network
22:     end for
23:  end for
24:  return
Algorithm 3 SkiMo RL (downstream task learning)
Hyperparameter | Value (Maze / FrankaKitchen / CALVIN; a single entry is shared across all environments)
Model architecture
# Layers of 5
Activation function ELU
Hidden dimension 128 128 256
State embedding dimension 128 256 256
Skill encoder () 5-layer MLP LSTM LSTM
Skill encoder hidden dimension 128
Pre-training batch size 512
# Training mini-batches per update 5
Model-Actor joint learning rate () 0.001
Downstream RL
Model learning rate 0.001
Actor learning rate 0.001
Skill dimension 10
Skill horizon () 10
Planning horizon () 10 3 1
Batch size 128 256 256
# Training mini-batches per update 10
State normalization True False False
Prior divergence coefficient () 1 0.5 0.1
Alpha learning rate 0.0003 0 0
Target divergence 3 N/A N/A
# Warm up step 50,000 5,000 5,000
# Environment step per update 500
Replay buffer size 1,000,000
Target update frequency 2
Target update tau () 0.01
Discount factor () 0.99
Reward loss coefficient 0.5
Value loss coefficient 0.1
Consistency loss coefficient 2
Low-level actor loss coefficient 2
Planning discount () 0.5
Encoder KL regularization () 0.0001
CEM iteration () 6
# Sampled trajectories () 512
# Policy trajectories () 25
# Elites () 64
CEM momentum 0.1
CEM temperature 0.5
Maximum std 0.5
Minimum std 0.01
Std decay step 100,000 25,000 25,000
Horizon decay step 100,000 25,000 25,000
Table 2: SkiMo hyperparameters.

c.3 Environments and Data

Maze [5, 29]

Since our goal is to leverage offline data collected from diverse tasks in the same environment, we use a variant of the maze environment [5], suggested in Pertsch et al. [29]. The maze is of size ; an initial state is randomly sampled near a small pre-defined region (the green circle in Figure 2(a)), and the goal position is fixed, shown as the red circle in Figure 2(a). The observation consists of the agent's 2D position and velocity, and the agent moves around the maze by controlling its continuous velocity. The maximum episode length is 2,000, but an episode also terminates when the agent reaches a circle of radius 2 around the goal. A reward of 100 is given at task completion. We use offline data of 3,046 trajectories, collected from randomly sampled start and goal state pairs, from Pertsch et al. [29].
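The sparse-reward and termination rule described above can be sketched as follows; the goal coordinates here are illustrative placeholders, since only the goal's fixedness and the radius-2 success region are specified.

```python
import numpy as np

GOAL = np.array([10.0, 24.0])  # illustrative; the benchmark fixes the goal
GOAL_RADIUS = 2.0
MAX_STEPS = 2000

def maze_step_outcome(agent_pos, step_count):
    """Sparse reward and termination rule for the Maze task."""
    reached = bool(np.linalg.norm(agent_pos - GOAL) <= GOAL_RADIUS)
    reward = 100.0 if reached else 0.0
    done = reached or step_count >= MAX_STEPS
    return reward, done

r_goal, d_goal = maze_step_outcome(np.array([10.5, 23.5]), step_count=120)
r_far, d_far = maze_step_outcome(np.array([30.0, 5.0]), step_count=120)
```

The single sparse reward at the end of a 2,000-step episode is what makes credit assignment, and hence temporal abstraction, so important in this domain.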

Kitchen [6, 5]

The 7-DoF Franka Panda robot arm needs to perform four sequential tasks (open microwave, move kettle, turn on bottom burner, and flip light switch). The agent has a 30D observation space (11D robot proprioceptive state and 19D object states); we remove the constant 30D goal state present in the original environment. The action space is 9D (7D joint velocity and 2D gripper velocity). The agent receives a reward of 1 for every sub-task completion. We use 603 trajectories collected by teleoperation from Gupta et al. [6] as the offline task-agnostic data. The episode length is 280, and an episode also ends once all four sub-tasks are completed. The initial state is perturbed with small noise in every state dimension.

Mis-aligned Kitchen [29]

The environment and task-agnostic data are the same as in Kitchen, but we use a different downstream task (open microwave, flip light switch, slide cabinet door, and open hinge cabinet, as illustrated in Figure 2(c)). This task ordering is not aligned with the sub-task transition probabilities of the task-agnostic data, which makes exploration that follows the data prior challenging.

Calvin [23]

We adapt the CALVIN environment [23] for long-horizon learning with state observations. The CALVIN environment uses a Franka Emika Panda robot arm with 7D end-effector pose control (relative 3D position, 3D orientation, 1D gripper action). The 21D observation space consists of the 15D proprioceptive robot state and a 6D object state. We use the teleoperated play data (Task D→Task D) of 1,239 trajectories from Mees et al. [23] as our task-agnostic data. The agent receives a sparse reward of 1 for every sub-task completed in the correct order. The episode length is 360, and an episode also ends once all four sub-tasks are completed. In the data, there are 34 available target sub-tasks, and each sub-task can transition to any other sub-task, which makes any specific transition probability lower than 0.1% on average.

Appendix D Application to Real Robot Systems

Our algorithm is designed to be applicable to real robot systems by improving the sample efficiency of RL with a temporally-abstracted dynamics model. Through extensive experiments in simulated robotic manipulation environments, we show that our approach achieves superior sample efficiency over prior skill-based and model-based RL, which provides strong evidence for its applicability to real robot systems. In particular, in Kitchen and CALVIN, our approach improves the sample efficiency of learning long-horizon manipulation tasks with a 7-DoF Franka Emika Panda robot arm. Our approach consists of three phases: (1) task-agnostic data collection, (2) skill and skill dynamics model learning, and (3) downstream task learning. In each of these phases, our approach can be applied to physical robots:

Task-agnostic data collection

Our approach is designed to fully leverage task-agnostic data without any reward or task annotation. In addition to extracting skills and skill priors, we further learn a skill dynamics model from this task-agnostic data. Maximizing the utility of task-agnostic data is critical for real robot systems, as data collection with physical robots is itself very expensive. Our method does not require any manual labeling of data and simply extracts skills, skill priors, and a skill dynamics model from raw states and actions, which makes it scalable.

Pre-training of skills and skill dynamics model

Our approach trains the skill policy, skill dynamics model, and skill prior from the offline task-agnostic dataset, without requiring any additional real-world robot interactions.

Downstream task learning

The goal of our work is to leverage skills and a skill dynamics model for more efficient downstream learning, i.e., requiring fewer interactions between the agent and the environment to train the policy. This is especially important for real robot systems, where robot-environment interaction is slow, dangerous, and costly. Our approach directly addresses this concern by learning a policy from imaginary rollouts rather than actual environment interactions. Moreover, any collected data can further improve the skill dynamics model, which leads to more accurate imagination and policy learning.

In summary, we believe that our approach can be applied to real-world robot systems with only minor modifications.