1 Introduction
A key trait of human intelligence is the ability to plan abstractly for solving complex tasks [18]. For instance, we perform cooking by imagining outcomes of high-level skills like washing and cutting vegetables, instead of planning every muscle movement involved [2]. This ability to plan with temporally-extended skills helps to scale our internal model to long-horizon tasks by reducing the search space of behaviors. To apply this insight to artificial intelligence agents, we propose a novel skill-based and model-based reinforcement learning (RL) method, which learns a model and a policy in a high-level skill space, enabling accurate long-term prediction and efficient long-term planning.
Typically, model-based RL involves learning a flat single-step dynamics model, which predicts the next state from the current state and action. This model can then be used to simulate "imaginary" trajectories, which significantly improves sample efficiency over model-free alternatives [9, 10]. However, such model-based RL methods have shown only limited success in long-horizon tasks due to inaccurate long-term prediction [20] and computationally expensive search [19, 11, 1].
Skill-based RL enables agents to solve long-horizon tasks by acting with multi-action subroutines (skills) [38, 16, 17, 28, 15, 3] instead of primitive actions. This temporal abstraction of actions enables systematic long-range exploration and allows RL agents to plan farther into the future, while requiring a shorter horizon for policy optimization, which makes long-horizon downstream tasks more tractable. Yet, on complex long-horizon tasks, such as furniture assembly [14], skill-based RL still requires millions to billions of environment interactions to learn [15], which is impractical for real-world applications (e.g., robotics and healthcare).
To combine the best of both model-based RL and skill-based RL, we propose Skill-based Model-based RL (SkiMo), which enables effective planning in the skill space using a skill dynamics model. Given a state and a skill to execute, the skill dynamics model directly predicts the resultant state after skill execution, without needing to model every intermediate step and low-level action (Figure 1), whereas the flat dynamics model predicts only the immediate next state after one action execution. Thus, planning with skill dynamics requires fewer predictions than flat (single-step) dynamics, resulting in more reliable long-term future predictions and plans.
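To make the difference in prediction burden concrete, the following sketch (a minimal PyTorch illustration with hypothetical dimensions and untrained placeholder networks, not the paper's actual architecture) contrasts a flat single-step rollout with a skill-level rollout: imagining 500 low-level steps takes 500 flat-model calls, but only 50 skill-dynamics calls when the skill horizon is 10.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; placeholder networks only illustrate the interfaces.
STATE_DIM, ACTION_DIM, SKILL_DIM, SKILL_HORIZON = 4, 2, 10, 10

flat_dynamics = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ELU(),
                              nn.Linear(128, STATE_DIM))
skill_dynamics = nn.Sequential(nn.Linear(STATE_DIM + SKILL_DIM, 128), nn.ELU(),
                               nn.Linear(128, STATE_DIM))

def rollout_flat(s, actions):
    """One model call per primitive action: len(actions) calls."""
    for a in actions:
        s = flat_dynamics(torch.cat([s, a], dim=-1))
    return s

def rollout_skills(s, skills):
    """One model call per skill: len(actions) / SKILL_HORIZON calls."""
    for z in skills:
        s = skill_dynamics(torch.cat([s, z], dim=-1))
    return s

s0 = torch.zeros(1, STATE_DIM)
actions = [torch.zeros(1, ACTION_DIM) for _ in range(500)]  # 500 model calls
skills = [torch.zeros(1, SKILL_DIM) for _ in range(50)]     # 50 model calls
rollout_flat(s0, actions); rollout_skills(s0, skills)
```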
Concretely, we first jointly learn the skill dynamics model and a skill repertoire from large offline datasets collected across diverse tasks [21, 28, 29]. This joint training shapes the skill embedding space for easy skill dynamics prediction and skill execution. Then, to solve a complex downstream task, we train a hierarchical task policy that acts in the learned skill space. For more efficient policy learning and better planning, we leverage the skill dynamics model to simulate skill trajectories.
The main contribution of this work is to propose Skill-based Model-based RL (SkiMo), a novel sample-efficient model-based hierarchical RL algorithm that leverages task-agnostic data to extract not only a reusable skill set but also a skill dynamics model. The skill dynamics model enables efficient and accurate long-term planning for sample-efficient RL. Our experiments show that our method outperforms state-of-the-art skill-based and model-based RL algorithms on long-horizon navigation and robotic manipulation tasks with sparse rewards.
2 Related Work
Model-based RL leverages a learned dynamics model of the environment to plan a sequence of actions ahead of time that leads to the desired behavior. The dynamics model predicts the future state of the environment, and optionally the associated reward, after taking a specific action for planning [1, 10] or subsequent policy search [7, 9, 24, 10]. By simulating candidate behaviors in imagination instead of in the physical environment, model-based algorithms improve the sample efficiency of RL agents [9, 10]. Typically, model-based approaches leverage the model for planning, e.g., with CEM [31] and MPPI [42]. On the other hand, the model can also be used to generate imaginary rollouts for policy optimization [9, 10]. Yet, due to the accumulation of prediction error at each step [20] and the growing search space, finding an optimal long-horizon plan is inaccurate and computationally expensive [19, 11, 1].
To facilitate learning of long-horizon behaviors, many works in skill-based RL have proposed to let the agent act over temporally-extended skills (i.e., options [39] or motion primitives [26]), which can be represented as sub-policies or coordinated sequences of low-level actions. Temporal abstraction effectively reduces the task horizon for the agent and enables directed exploration [25], a major challenge in RL. However, skill-based RL is still impractical for real-world applications, requiring millions to billions of environment interactions [15]. In this paper, we use model-based RL to guide the planning of skills and improve the sample efficiency of skill-based approaches.
There have been attempts to plan over skills in model-based RL [37, 43, 20, 44, 34]. However, these approaches still utilize the conventional flat (single-step) dynamics model, which struggles at handling long-horizon planning due to error accumulation. To fully unleash the potential of temporally abstracted skills, we devise a skill-level dynamics model to provide accurate long-term prediction, which is essential for solving long-horizon tasks. To the best of our knowledge, SkiMo is the first work that jointly learns skills and a skill dynamics model from data for model-based RL.
3 Method
In this paper, we aim to improve the long-horizon learning capability of RL agents. To enable accurate long-term prediction and efficient long-horizon planning for RL, we introduce SkiMo, a novel skill-based and model-based RL method that combines the benefits of both frameworks. A key change from prior model-based approaches is the use of a skill dynamics model that directly predicts the outcome of a chosen skill, which enables efficient and accurate long-term planning. In this section, we give an overview of our approach, which consists of two phases: (1) learning the skill dynamics model and skills from an offline task-agnostic dataset (Section 3.3) and (2) downstream task learning with the skill dynamics model (Section 3.4), as illustrated in Figure 2.
3.1 Preliminaries
Reinforcement Learning
We formulate our problem as a Markov decision process (MDP) [40], which is defined by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \rho_0, \gamma)$ of the state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R(s, a)$, transition probability $P(s' \mid s, a)$, initial state distribution $\rho_0$, and discount factor $\gamma$. We define a policy $\pi(a \mid s)$ that maps a state $s \in \mathcal{S}$ to an action $a \in \mathcal{A}$. Our objective is to learn the optimal policy $\pi^*$ that maximizes the expected discounted return, $\mathbb{E}_{\pi}\big[\sum_{t=0}^{T-1} \gamma^{t} R(s_t, a_t)\big]$, where $T$ is the variable episode length.

Unlabeled Offline Data
We assume access to a reward-free, task-agnostic dataset [21, 28], which is a set of state-action trajectories, $\mathcal{D} = \{\tau_i\} = \{(s_0^i, a_0^i, s_1^i, a_1^i, \dots)\}$. Since it is task-agnostic, this data can be collected from training data for other tasks, unsupervised exploration, or human teleoperation. We do not assume this dataset contains solutions for the downstream task; therefore, tackling the downstream task requires recomposition of skills learned from diverse trajectories.
Skill-based RL
We define a skill as a sequence of actions with a fixed horizon $H$ (our method is compatible with variable-length skills [13, 36, 35] and goal-conditioned skills [24] with minimal change; for simplicity, we adopt fixed-length skills in this paper). We parameterize skills with a skill latent $z$ and a skill policy $\pi^L(a \mid s, z)$ that maps a skill latent and state to the corresponding action sequence. The skill latent and skill policy can be trained using a variational autoencoder (VAE [12]), where a skill encoder embeds a sequence of transitions into a skill latent $z$, and the skill policy decodes it back into the original action sequence. Following SPiRL [28], we also learn the skill distribution in the offline data, a skill prior $p(z \mid s)$, to guide the downstream task policy toward promising skills in the large skill space.
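As a concrete illustration of this encoder/decoder pair, below is a minimal PyTorch sketch of a skill encoder and low-level skill policy; the architectures, dimensions, and names are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SkillVAE(nn.Module):
    """Minimal sketch of a skill encoder / low-level skill policy pair.
    Architectures, dimensions, and names are illustrative, not the paper's."""
    def __init__(self, state_dim, action_dim, skill_dim=10, horizon=10):
        super().__init__()
        self.horizon = horizon
        # Skill encoder q(z | s_{0:H-1}, a_{0:H-1}): LSTM over the transition sequence
        self.encoder = nn.LSTM(state_dim + action_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, skill_dim)
        self.to_logstd = nn.Linear(128, skill_dim)
        # Low-level skill policy pi_L(a | s, z): decodes one action per step
        self.skill_policy = nn.Sequential(
            nn.Linear(state_dim + skill_dim, 128), nn.ELU(),
            nn.Linear(128, action_dim))

    def encode(self, states, actions):
        # states: (B, H, state_dim), actions: (B, H, action_dim)
        out, _ = self.encoder(torch.cat([states, actions], dim=-1))
        return self.to_mu(out[:, -1]), self.to_logstd(out[:, -1])

    def decode(self, states, z):
        # Reconstruct the action at every step of the skill from (s_t, z).
        z_rep = z.unsqueeze(1).expand(-1, states.shape[1], -1)
        return self.skill_policy(torch.cat([states, z_rep], dim=-1))
```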
3.2 SkiMo Model Components
SkiMo consists of three major model components: the skill policy ($\pi^L_\phi$), skill dynamics model ($D_\theta$), and task policy ($\pi_\psi$), along with auxiliary components for representation learning and value estimation. The following is a summary of the notations of our model components, where $s, a, z, h$ denote a state, action, latent skill embedding, and latent state representation, respectively:

$$
\begin{aligned}
&\text{State encoder: } h = E_\theta(s) &&\text{Observation decoder: } \hat{s} = O_\phi(h) \\
&\text{Skill encoder: } z = q_\phi(s_{0:H}, a_{0:H-1}) &&\text{Skill policy: } \hat{a} = \pi^L_\phi(h, z) \\
&\text{Skill prior: } p_\phi(z \mid h) &&\text{Skill dynamics: } \hat{h}' = D_\theta(h, z) \\
&\text{Reward: } \hat{r} = R_\psi(h, z) &&\text{Q-value: } \hat{q} = Q_\psi(h, z) \\
&\text{Task policy: } z \sim \pi_\psi(z \mid h)
\end{aligned}
\tag{1}
$$
For convenience, we group the trainable parameters $\theta$, $\phi$, $\psi$ of each component according to the phase in which they are trained:

- Learned from offline data and fine-tuned in downstream RL ($\theta$): The state representation module ($E_\theta$) and the skill dynamics model ($D_\theta$) are first trained on the offline task-agnostic data and then fine-tuned in downstream RL to account for unseen states and transitions.
- Learned only from offline data ($\phi$): The observation reconstruction module ($O_\phi$), skill encoder ($q_\phi$), skill prior ($p_\phi$), and skill policy ($\pi^L_\phi$) are learned from the offline task-agnostic data.
- Learned in downstream RL ($\psi$): We train a value function ($Q_\psi$), a reward function ($R_\psi$), and the high-level task policy ($\pi_\psi$) for the downstream task using environment interactions.
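The sketch below groups these components in code. It is a hypothetical PyTorch container with placeholder MLPs that only illustrates the interfaces and the three parameter groups, not the paper's exact architectures; the skill encoder, which consumes state-action sequences, is shown in the earlier SkillVAE sketch.

```python
import torch.nn as nn

def build_skimo_components(state_dim, action_dim, skill_dim=10, latent_dim=128):
    """Hypothetical container for SkiMo's components, grouped by training phase.
    Placeholder 2-layer MLPs stand in for the paper's networks."""
    mlp = lambda n_in, n_out: nn.Sequential(
        nn.Linear(n_in, 128), nn.ELU(), nn.Linear(128, n_out))
    return nn.ModuleDict({
        # theta: pre-trained on offline data, fine-tuned during downstream RL
        "state_encoder":  mlp(state_dim, latent_dim),                # h = E(s)
        "skill_dynamics": mlp(latent_dim + skill_dim, latent_dim),   # h' = D(h, z)
        # phi: pre-trained on offline data only
        "obs_decoder":    mlp(latent_dim, state_dim),                # s_hat = O(h)
        "skill_policy":   mlp(latent_dim + skill_dim, action_dim),   # a = pi_L(h, z)
        "skill_prior":    mlp(latent_dim, 2 * skill_dim),            # p(z | h), Gaussian params
        # psi: trained only during downstream RL
        "task_policy":    mlp(latent_dim, 2 * skill_dim),            # pi(z | h), Gaussian params
        "reward":         mlp(latent_dim + skill_dim, 1),            # R(h, z)
        "value":          mlp(latent_dim + skill_dim, 1),            # Q(h, z)
    })
```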
3.3 Pre-Training Skill Dynamics Model and Skills from Task-agnostic Data
Our method, SkiMo, consists of pre-training and downstream RL phases. In pre-training, SkiMo leverages offline data to extract (1) skills for temporal abstraction of actions, (2) a skill dynamics model for skill-level planning in a latent state space, and (3) a skill prior [28] to guide exploration. Specifically, we jointly learn the skill policy and skill dynamics model in a self-supervised manner, instead of learning them separately [43, 20, 44]. The key insight is that this joint training shapes the latent skill space and state embedding such that the skill dynamics model can easily predict the future.
In contrast to prior works that learn models completely online [9, 33, 10], we leverage existing offline task-agnostic datasets to pre-train a skill dynamics model and skill policy. This offers the benefit that the model and skills are agnostic to specific tasks, so they can be reused across multiple tasks. Afterwards, in the downstream RL phase, the agent continues to fine-tune the skill dynamics model to accommodate task-specific trajectories.
To learn a low-dimensional skill latent space that encodes action sequences, we train a conditional VAE [12] on the offline dataset that reconstructs the action sequence through a skill embedding $z$ given a state-action sequence, as in SPiRL [28, 29]. Specifically, given $H$ consecutive states $s_{0:H}$ and actions $a_{0:H-1}$, a skill encoder $q_\phi$ predicts a skill embedding $z$ and a skill decoder (i.e., the low-level skill policy) $\pi^L_\phi$ reconstructs the original action sequence from $z$:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{(s,a)_{0:H} \sim \mathcal{D},\, z \sim q_\phi} \Big[ \textstyle\sum_{t=0}^{H-1} \big( a_t - \pi^L_\phi(E_\theta(s_t), z) \big)^2 + \beta\, D_{\mathrm{KL}}\big( q_\phi(z \mid s_{0:H}, a_{0:H-1}) \,\|\, p(z) \big) \Big], \tag{2}$$

where $z$ is sampled from $q_\phi(z \mid s_{0:H}, a_{0:H-1})$ and $\beta$ is a weighting factor for regularizing the skill latent distribution to a prior $p(z)$ of a transformed unit Gaussian distribution.

To ensure the latent skill space is suited for long-term prediction, we jointly train a skill dynamics model with the VAE above. The skill dynamics model $D_\theta$ learns to predict the latent state $H$ steps ahead conditioned on a skill $z$, for $N$ sequential skill transitions, using the latent state consistency loss [10]. To prevent a trivial solution and encode rich information from observations, we additionally train an observation decoder $O_\phi$ using the observation reconstruction loss. Altogether, the skill dynamics $D_\theta$, state encoder $E_\theta$, and observation decoder $O_\phi$ are trained on the following objective:

$$\mathcal{L}_{\text{model}} = \mathbb{E}_{\mathcal{D}} \Big[ \textstyle\sum_{k=0}^{N-1} \big\| D_\theta(\hat{h}_{kH}, z_k) - \bar{E}(s_{(k+1)H}) \big\|_2^2 + \big\| O_\phi(\hat{h}_{kH}) - s_{kH} \big\|_2^2 \Big], \tag{3}$$

where $\hat{h}_0 = E_\theta(s_0)$ and $\hat{h}_{(k+1)H} = D_\theta(\hat{h}_{kH}, z_k)$ such that gradients are backpropagated through time. For stable training, we use a target network $\bar{E}$ whose parameters are slowly soft-copied from $E_\theta$.
Furthermore, to guide exploration in downstream RL, we also extract a skill prior [28] $p_\phi(z \mid s)$ from the offline data that predicts the skill distribution for any state. The skill prior is trained by minimizing the KL divergence between the output distributions of the skill encoder and the skill prior:

$$\mathcal{L}_{\text{prior}} = \mathbb{E}_{\mathcal{D}} \Big[ D_{\mathrm{KL}}\big( \mathrm{sg}\big[ q_\phi(z \mid s_{0:H}, a_{0:H-1}) \big] \,\|\, p_\phi(z \mid s_0) \big) \Big]. \tag{4}$$
Combining the objectives above, we jointly train the policy, model, and prior, which leads to a well-shaped skill latent space that is optimized for both skill reconstruction and long-term prediction:

$$\mathcal{L}_{\text{pre-train}} = \mathcal{L}_{\text{VAE}} + \mathcal{L}_{\text{model}} + \mathcal{L}_{\text{prior}}. \tag{5}$$
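Putting the pieces together, the following sketch illustrates one way the joint pre-training objective (Equations (2)-(5)) could be computed, reusing the hypothetical component container and SkillVAE from the earlier sketches. The KL-based prior loss is simplified to a Gaussian mean-matching term, loss weights are omitted or illustrative, and batching details differ from the paper.

```python
import torch
import torch.nn.functional as F

def pretrain_losses(comp, skill_vae, batch, beta=1e-4, n_skill_steps=2, horizon=10):
    """Sketch of the joint pre-training objective. `batch["states"]` has shape
    (B, n_skill_steps*horizon + 1, state_dim) and `batch["actions"]` has shape
    (B, n_skill_steps*horizon, action_dim); names and weights are illustrative."""
    states, actions = batch["states"], batch["actions"]
    total = 0.0
    h = comp["state_encoder"](states[:, 0])
    for k in range(n_skill_steps):
        s_seq = states[:, k * horizon:(k + 1) * horizon]
        a_seq = actions[:, k * horizon:(k + 1) * horizon]
        # Skill VAE: encode the segment and reconstruct its actions (cf. Eq. 2)
        mu, logstd = skill_vae.encode(s_seq, a_seq)
        z = mu + logstd.exp() * torch.randn_like(mu)
        a_rec = skill_vae.decode(s_seq, z)
        kl = (-logstd + 0.5 * (logstd.exp() ** 2 + mu ** 2) - 0.5).sum(-1).mean()
        total = total + F.mse_loss(a_rec, a_seq) + beta * kl
        # Skill prior: match the encoder's skill distribution (cf. Eq. 4, simplified)
        prior_mu, _ = comp["skill_prior"](h.detach()).chunk(2, dim=-1)
        total = total + F.mse_loss(prior_mu, mu.detach())
        # Latent consistency + observation reconstruction (cf. Eq. 3),
        # backpropagated through time across skill steps
        h = comp["skill_dynamics"](torch.cat([h, z], dim=-1))
        h_target = comp["state_encoder"](states[:, (k + 1) * horizon]).detach()
        total = total + F.mse_loss(h, h_target)
        total = total + F.mse_loss(comp["obs_decoder"](h), states[:, (k + 1) * horizon])
    return total
```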
3.4 Downstream Task Learning with Learned Skill Dynamics Model
To accelerate downstream RL with the learned skill repertoire, SkiMo learns a high-level task policy $\pi_\psi(z \mid h)$ that outputs a latent skill embedding $z$, which is then translated into a sequence of actions by the pre-trained skill policy $\pi^L_\phi$ to act in the environment [28, 29].
To further improve sample efficiency, we propose to use model-based RL in the skill space by leveraging the skill dynamics model. The skill dynamics model and task policy can generate imaginary rollouts in the skill space by repeating (1) sampling a skill, $z \sim \pi_\psi(z \mid \hat{h})$, and (2) predicting the latent state $H$ steps in the future after executing the skill, $\hat{h}' = D_\theta(\hat{h}, z)$. Our skill dynamics model thus requires only $1/H$ of the dynamics predictions and action selections of flat model-based RL approaches [9, 10], resulting in more efficient and accurate long-horizon imaginary rollouts (see Appendix, Figure 9).
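The sketch below shows how such an imaginary skill-space rollout might look with the hypothetical components above: skills are sampled from the task policy, the skill dynamics model jumps $H$ low-level steps per call, and predicted $H$-step rewards are accumulated with a bootstrapped value estimate. The discounting and sampling details are simplifications of the paper's procedure.

```python
import torch

@torch.no_grad()
def imagine_skill_rollout(comp, h0, num_skills=5, gamma=0.99, skill_horizon=10):
    """Sketch of an imaginary rollout in skill space starting from latent state h0
    of shape (B, latent_dim). Assumes the hypothetical `comp` container above."""
    h, ret, discount = h0, 0.0, 1.0
    for _ in range(num_skills):
        mu, logstd = comp["task_policy"](h).chunk(2, dim=-1)
        z = mu + logstd.exp() * torch.randn_like(mu)            # sample a skill
        ret = ret + discount * comp["reward"](torch.cat([h, z], dim=-1))
        h = comp["skill_dynamics"](torch.cat([h, z], dim=-1))   # jump H low-level steps
        discount = discount * gamma ** skill_horizon            # simplified discounting
    ret = ret + discount * comp["value"](torch.cat([h, mu], dim=-1))  # bootstrap
    return ret
```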
Following TD-MPC [10], we leverage these imaginary rollouts both for planning (Algorithm 2) and policy optimization (Equation (7)), significantly reducing the number of necessary environment interactions. During rollout, we perform Model Predictive Control (MPC), which replans every step using CEM and executes the first skill of the skill plan (see Appendix, Section C for more details).
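For reference, this is a bare-bones sketch of CEM over skill sequences as it might be used for MPC here. `evaluate_plan` is an assumed callback (e.g., rolling out the skill dynamics model and summing predicted rewards plus a terminal value), and refinements used by TD-MPC-style planners (mixing in policy-proposed trajectories, momentum, temperature-weighted elite averaging, std clipping) are omitted.

```python
import torch

@torch.no_grad()
def plan_skills_cem(evaluate_plan, skill_dim=10, horizon=5, iterations=6,
                    num_samples=512, num_elites=64):
    """Minimal CEM sketch: `evaluate_plan` maps candidate skill plans of shape
    (N, horizon, skill_dim) to estimated returns of shape (N,)."""
    mean = torch.zeros(horizon, skill_dim)
    std = 0.5 * torch.ones(horizon, skill_dim)
    for _ in range(iterations):
        plans = mean + std * torch.randn(num_samples, horizon, skill_dim)
        returns = evaluate_plan(plans)
        elites = plans[returns.topk(num_elites).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return mean[0]  # MPC: execute only the first skill, then replan

# Usage sketch with a dummy evaluator:
# first_skill = plan_skills_cem(lambda plans: torch.randn(plans.shape[0]))
```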
To evaluate imaginary rollouts, we train a reward function $R_\psi$ that predicts the sum of $H$ step rewards (for clarity, we use $r$ to denote this sum of $H$ step rewards) and a Q-value function $Q_\psi$. We also fine-tune the skill dynamics model and state representation model on the downstream task to improve model prediction:
$$\mathcal{L}_{\text{RL}} = \mathbb{E} \Big[ \textstyle\sum_{k} \big( R_\psi(\hat{h}_k, z_k) - r_k \big)^2 + \big( Q_\psi(\hat{h}_k, z_k) - y_k \big)^2 + \big\| D_\theta(\hat{h}_k, z_k) - \bar{E}(s_{k+1}) \big\|_2^2 \Big], \quad y_k = r_k + \gamma^{H}\, \bar{Q}\big(\hat{h}_{k+1}, \pi_\psi(\hat{h}_{k+1})\big), \tag{6}$$

where $\hat{h}_k$ denotes the latent state at the $k$-th skill step, $r_k$ the summed environment reward over that skill, and $\bar{E}$, $\bar{Q}$ target networks.
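A rough sketch of how these fine-tuning losses could be computed for a single skill-level transition is given below, again reusing the hypothetical component container; `comp_target` stands for slowly-updated target copies, and the exact targets, weights, and batching in the paper differ.

```python
import torch
import torch.nn.functional as F

def finetune_model_losses(comp, comp_target, h, z, reward, next_state,
                          gamma_h=0.99 ** 10):
    """Sketch of downstream reward/value/consistency losses for one skill-level
    transition: latent state h (B, latent_dim), skill z (B, skill_dim), summed
    H-step reward (B,), and next environment state (B, state_dim)."""
    hz = torch.cat([h, z], dim=-1)
    h_next = comp["state_encoder"](next_state)
    # Reward prediction and latent-consistency losses
    reward_loss = F.mse_loss(comp["reward"](hz).squeeze(-1), reward)
    consistency_loss = F.mse_loss(comp["skill_dynamics"](hz), h_next.detach())
    # TD target for the skill-level Q-function, using target networks
    with torch.no_grad():
        mu, _ = comp["task_policy"](h_next).chunk(2, dim=-1)
        target_q = reward + gamma_h * comp_target["value"](
            torch.cat([h_next, mu], dim=-1)).squeeze(-1)
    value_loss = F.mse_loss(comp["value"](hz).squeeze(-1), target_q)
    return reward_loss + value_loss + consistency_loss
```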
Finally, we train the high-level task policy $\pi_\psi$ to maximize the estimated value while regularizing it toward the pre-trained skill prior [28], which helps the policy output plausible skills:

$$\mathcal{L}_{\pi} = \mathbb{E} \Big[ -Q_\psi\big(\hat{h}, \pi_\psi(\hat{h})\big) + \alpha\, D_{\mathrm{KL}}\big( \pi_\psi(z \mid \hat{h}) \,\|\, p_\phi(z \mid \hat{h}) \big) \Big], \tag{7}$$

where $\alpha$ is the prior divergence coefficient.
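A corresponding sketch of the task-policy update follows; the prior-divergence term is simplified here to a penalty on the Gaussian means (the paper regularizes with a full KL divergence), and the names reuse the hypothetical container above.

```python
import torch
import torch.nn.functional as F

def task_policy_loss(comp, h, alpha=0.5):
    """Sketch of the task-policy objective: maximize the estimated Q-value of the
    sampled skill while penalizing divergence from the pre-trained skill prior."""
    mu, logstd = comp["task_policy"](h).chunk(2, dim=-1)
    z = mu + logstd.exp() * torch.randn_like(mu)               # reparameterized skill
    q = comp["value"](torch.cat([h, z], dim=-1))
    prior_mu, _ = comp["skill_prior"](h).detach().chunk(2, dim=-1)
    prior_penalty = F.mse_loss(mu, prior_mu)                   # simplified KL surrogate
    return (-q.mean()) + alpha * prior_penalty
```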
For both Equation (6) and Equation (7), consecutive skill-level transitions can be sampled together, so that the models can be trained using backpropagation through time, similar to Equation (3).

4 Experiments
In this paper, we propose a model-based RL approach that can efficiently and accurately plan long-horizon trajectories over the skill space, rather than the primitive action space, by leveraging the skills and skill dynamics model learned from an offline task-agnostic dataset. Hence, in our experiments, we aim to answer the following questions: (1) Can the use of the skill dynamics model improve the efficiency of RL training for long-horizon tasks? and (2) Is the joint training of skills and the skill dynamics model essential for efficient model-based learning?
4.1 Tasks
To evaluate whether our method can efficiently learn temporally-extended tasks with sparse rewards, we compare it to prior model-based RL and skill-based RL approaches on four tasks: 2D maze navigation, two robotic kitchen manipulation tasks, and a robotic tabletop manipulation task, as illustrated in Figure 3. More details about the environments, tasks, and offline data can be found in Appendix, Section C.
Maze
We use the maze navigation task from Pertsch et al. [29], where a point-mass agent is initialized randomly near the green region and needs to reach the fixed goal region in red (Figure 2(a)). The agent observes its 2D position and 2D velocity, and controls its velocity. The agent receives a sparse reward of 100 only when it reaches the goal. The task-agnostic offline dataset from Pertsch et al. [29] consists of 3,046 trajectories between randomly sampled initial and goal positions.
Kitchen
We use the FrankaKitchen environment and 603 teleoperation trajectories from D4RL [5]. The 7-DoF Franka Emika Panda arm needs to perform four sequential subtasks (Microwave → Kettle → Bottom Burner → Light). In Misaligned Kitchen, we also test another task sequence (Microwave → Light → Slide Cabinet → Hinge Cabinet), which has a low subtask transition probability in the offline data distribution [29]. The agent observes an 11D robot state and a 19D object state, and uses 9D joint velocity control. The agent receives a reward of 1 for every subtask completed in order.
CALVIN
We adapt the CALVIN benchmark [23] to have the target task Open Drawer → Turn on Lightbulb → Move Slider Left → Turn on LED and a 21D observation space of robot and object states. It also uses a Panda arm, but with 7D end-effector pose control. We use the play data of 1,239 trajectories from Mees et al. [23] as our offline data. The agent receives a reward of 1 for every subtask completed in the correct order.
4.2 Baselines and Ablated Methods
We compare our method to state-of-the-art model-based RL [9, 10], skill-based RL [28], and combinations of them, as well as three ablations of our method, as summarized in Appendix, Table 1.

- SPiRL [28] learns skills and a skill prior, and guides a high-level policy using the learned prior.
- SPiRL + Dreamer and SPiRL + TD-MPC pre-train the skills using SPiRL and learn a policy and model in the skill space (instead of the action space) using Dreamer and TD-MPC, respectively. In contrast to SkiMo, these baselines do not jointly train the model and skills.
- SkiMo w/o joint training learns the latent skill space using only the VAE loss in Equation (2).
- SkiMo + SAC uses model-free RL (SAC [8]) to train a policy in the skill space.
- SkiMo w/o CEM selects skills based on the policy, without planning using the learned model.
4.3 Results
Maze
Maze navigation poses a hard exploration problem due to the sparsity of the reward: the agent only receives a reward after taking 1,000+ steps to reach the goal. Figure 3(a) shows that only SkiMo is able to consistently succeed in long-horizon navigation, whereas the baselines struggle to learn a policy or an accurate model due to the challenges of sparse feedback and long-term planning.
To better understand the result, we qualitatively analyze the behavior of each agent in Appendix, Figure 8. Dreamer and TD-MPC cover only a small portion of the maze, since it is challenging to explore coherently for 1,000+ steps toward the goal using primitive actions. SPiRL is able to explore a large fraction of the maze, but it does not learn to consistently find the goal due to difficult policy optimization in long-horizon tasks. On the other hand, SPiRL + Dreamer and SPiRL + TD-MPC fail to learn an accurate model and often collide with walls.
Kitchen
Figure 3(b) demonstrates that SkiMo reaches the same performance (above 3 subtasks) with 5x fewer environment interactions than SPiRL. This improvement in sample efficiency is crucial for real-world robot learning. In contrast, Dreamer and TD-MPC rarely succeed at the first subtask due to the difficulty of long-horizon learning with primitive actions. SPiRL + Dreamer and SPiRL + TD-MPC perform better than flat model-based RL by leveraging skills, yet the independently trained model and policy are not accurate enough to consistently achieve more than two subtasks.
Misaligned Kitchen
The misaligned target task makes downstream learning harder, because the skill prior, which reflects the offline data distribution, offers less meaningful regularization to the policy. Figure 3(c) shows that SkiMo still performs well despite the misalignment between the offline data and the downstream task. This demonstrates that the skill dynamics model is able to adapt to the new distribution of behaviors, which may deviate greatly from the distribution in the offline dataset.
CALVIN
One of the major challenges in CALVIN is that the offline data is much more task-agnostic. Any particular subtask transition has a probability lower than 0.1% on average, resulting in a large number of plausible subtasks from any state. This setup mimics real-world large-scale robot learning, where the robot may not receive a carefully curated dataset. Figure 3(d) demonstrates that SkiMo can learn faster than the model-free baseline, SPiRL, which supports the benefit of using our skill dynamics model. Meanwhile, Dreamer performs better in CALVIN than in Kitchen because objects in CALVIN are more compactly located and easier to manipulate; thus, it becomes viable to accomplish initial subtasks through random exploration. Yet, it falls short in composing coherent action sequences to achieve a longer task sequence due to the lack of temporally-extended reasoning.
In summary, we show the synergistic benefit of temporal abstraction in both the policy and dynamics model. SkiMo achieves at least 5x higher sample efficiency than all baselines in robotic domains, and is the only method that consistently solves the long-horizon maze navigation. Our results also demonstrate the importance of algorithmic design choices (e.g., skill-level planning, joint training of a model and skills), as naive combinations (SPiRL + Dreamer, SPiRL + TD-MPC) fail to learn.
4.4 Ablation Studies
Model-based vs. Model-free
In Figure 5, SkiMo achieves better asymptotic performance and higher sample efficiency across all tasks than SkiMo + SAC, which directly uses the high-level task policy to select skills instead of using the skill dynamics model to plan. Since the only difference is in the use of the skill dynamics model for planning, this suggests that the task policy can make more informative decisions by leveraging accurate long-term predictions of the skill dynamics model.
Joint training of skills and skill dynamics model
Figure 5 shows that joint training is crucial for Maze and CALVIN, while the difference is marginal in the Kitchen tasks. This suggests that joint training is particularly important in more challenging scenarios, where the agent needs to generate accurate long-term plans (in Maze) or the skills are very diverse (in CALVIN).
CEM planning
As shown in Figure 5, SkiMo learns significantly better and faster in Kitchen, Misaligned Kitchen, and CALVIN than SkiMo w/o CEM, indicating that CEM planning can effectively find a better plan. On the other hand, in Maze, SkiMo w/o CEM learns twice as fast. We find that the action noise used for exploration in CEM leads the agent away from the support of the skill prior, causing it to get stuck at walls and corners. We believe that with careful tuning of the action noise, SkiMo could solve Maze much more efficiently. Meanwhile, the fast learning of SkiMo w/o CEM in Maze confirms the advantage of policy optimization using imaginary rollouts generated by our skill dynamics model.
For further ablations and discussion on skill horizon and planning horizon, see Appendix, Section A.
4.5 Long-horizon Prediction with Skill Dynamics Model
To assess the accuracy of long-term prediction of our proposed skill dynamics over flat dynamics, we visualize imagined trajectories in Appendix, Figure 8(a), where the ground-truth initial state and a sequence of 500 actions (50 skills for SkiMo) are given. Dreamer struggles to make accurate long-horizon predictions due to error accumulation. In contrast, SkiMo is able to reproduce the ground-truth trajectory with little prediction error even when traversing through hallways and doorways. This is mainly because SkiMo allows temporal abstraction in the dynamics model, thereby enabling temporally-extended prediction and reducing step-by-step prediction error.
5 Conclusion
We propose SkiMo, an intuitive instantiation of saltatory model-based hierarchical RL [2] that combines skill-based and model-based RL approaches. Our experiments demonstrate that (1) a skill dynamics model reduces long-term prediction error, improving the performance of prior model-based RL and skill-based RL; (2) it provides temporal abstraction in both the policy and dynamics model, so downstream RL can perform efficient, temporally-extended reasoning without step-by-step planning; and (3) joint training of the skill dynamics and skill representations further improves sample efficiency by learning skills whose consequences are easy to predict. We believe that the ability to learn and utilize a skill-level model holds the key to unlocking the sample efficiency and widespread use of RL agents, and our method takes a step in this direction.
Limitations and future work
While our method extracts fixed-length skills from offline data, the lengths of semantic skills may vary based on contexts and goals. Future work can learn variable-length semantic skills to improve long-term prediction and planning. Further, although we only experimented on state-based inputs, SkiMo is a general framework that can be extended to RGB, depth, and tactile observations. Thus, we would like to apply this sample-efficient approach to real robots, where sample efficiency is crucial.
Acknowledgments
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No.2019000075, Artificial Intelligence Graduate School Program, KAIST) and National Research Foundation of Korea (NRF) grant (NRF2021H1D3A2A03103683), funded by the Korea government (MSIT). This work was also partly supported by the Annenberg Fellowship from USC. We would like to thank Ayush Jain and Grace Zhang for help on writing, Karl Pertsch for assistance in setting up SPiRL and CALVIN, and all members of the USC CLVR lab for constructive feedback.
References
[1] (2021) Model-based offline planning. In ICLR.
[2] (2014) Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society B: Biological Sciences 369 (1655).
[3] (2021) Accelerating robotic reinforcement learning via parameterized action primitives. In NeurIPS.
[4] (2019) RoboNet: large-scale multi-robot learning. In CoRL.
[5] (2020) D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
[6] (2019) Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. In CoRL.
[7] (2018) World models. arXiv preprint arXiv:1803.10122.
[8] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, pp. 1856–1865.
[9] (2019) Dream to control: learning behaviors by latent imagination. In ICLR.
[10] (2022) Temporal difference learning for model predictive control. In ICML.
[11] (2019) When to trust your model: model-based policy optimization. In NeurIPS.
[12] (2014) Auto-encoding variational Bayes. In ICLR.
[13] (2019) CompILE: compositional imitation learning and execution. In ICML.
[14] (2021) IKEA furniture assembly environment for long-horizon complex manipulation tasks. In ICRA.
[15] (2021) Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization. In CoRL.
[16] (2019) Composing complex skills by learning transition policies. In ICLR.
[17] (2020) Learning to coordinate manipulation skills via skill behavior diversification. In ICLR.
[18] (2007) Universal intelligence: a definition of machine intelligence. Minds and Machines 17 (4), pp. 391–444.
[19] (2019) Plan online, learn offline: efficient learning and exploration via model-based control. In ICLR.
[20] (2021) Reset-free lifelong learning with skill-space planning. In ICLR.
[21] (2020) Learning latent plans from play. In CoRL, pp. 1113–1132.
[22] (2018) RoboTurk: a crowdsourcing platform for robotic skill learning through imitation. In CoRL, pp. 879–893.
[23] (2022) CALVIN: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters.
[24] (2021) Discovering and achieving goals via world models. In NeurIPS.
[25] (2019) Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618.
[26] (2009) Learning and generalization of motor skills by learning from demonstration. In ICRA, pp. 763–768.
[27] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
[28] (2020) Accelerating reinforcement learning with learned skill priors. In CoRL.
[29] (2021) Demonstration-guided reinforcement learning with learned skills. In CoRL.
[30] (1989) ALVINN: an autonomous land vehicle in a neural network. In NIPS, pp. 305–313.
[31] (1997) Optimization of computer simulation models with rare events. European Journal of Operational Research 99 (1), pp. 89–112.
[32] (2016) Prioritized experience replay. In ICLR.
[33] (2020) Planning to explore via self-supervised world models. In ICML.
[34] (2022) Value function spaces: skill-centric state abstractions for long-horizon reasoning. In ICLR.
[35] (2020) Learning robot skills with temporal variational inference. In ICML.
[36] (2020) Discovering motor programs by recomposing demonstrations. In ICLR.
[37] (2020) Dynamics-aware unsupervised discovery of skills. In ICLR.
[38] (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
[39] (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
[40] (1984) Temporal credit assignment in reinforcement learning. Ph.D. Thesis, University of Massachusetts Amherst.
[41] (2018) DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
[42] (2015) Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149.
[43] (2021) Example-driven model-based reinforcement learning for solving long-horizon visuomotor tasks. In CoRL.
[44] (2020) Latent skill planning for exploration and transfer. In ICLR.
Appendix A Further Ablations
We include additional ablations on the Maze and Kitchen tasks to further investigate the influence of the skill horizon $H$ and the planning horizon, which are important for skill learning and planning.
A.1 Skill Horizon
In both the Maze and Kitchen environments, we find that a skill horizon that is too short is unable to yield sufficient temporal abstraction. A longer skill horizon has little influence in Kitchen, but it makes the downstream performance much worse in Maze. This is because, with overly long-horizon skills, skill dynamics prediction becomes more difficult and stochastic, and composing multiple skills is not as flexible as with short-horizon skills. The inaccurate skill dynamics makes long-term planning harder, which is already a major challenge in maze navigation.
A.2 Planning Horizon
In Figure 6(b), we see that a short planning horizon makes learning slower in the beginning, because it does not effectively leverage the skill dynamics model to plan further ahead. Conversely, if the planning horizon is too long, the performance gets slightly worse due to the difficulty of modeling every step accurately. Indeed, a planning horizon of 20 corresponds to 200 low-level steps, while the episode length in Kitchen is 280; this would require the agent to plan for nearly the entire episode. The performance is not sensitive to intermediate planning horizons.
On the other hand, the effect of the planning horizon differs in Maze due to its distinct environment characteristics. Interestingly, we find that a very long planning horizon (e.g., 20) and a very short planning horizon (e.g., 1) perform similarly in Maze (Figure 6(a)). This could be attributed to the former creating useful long-horizon plans, while the latter avoids error accumulation altogether. We leave further investigation of the planning horizon to future work.
Appendix B Qualitative Analysis on Maze
B.1 Exploration and Exploitation
To gauge each agent's ability to explore and exploit, we visualize the replay buffer of each method in Figure 8. In this visualization, we represent early trajectories in the replay buffer with light blue dots and recent trajectories with dark blue dots. In Figure 7(a), the replay buffer of SkiMo (ours) contains early exploration that spans most corners of the maze. After it finds the goal, it exploits this knowledge and commits to paths between the start location and the goal (in dark blue). This explains why our method can quickly learn and consistently accomplish the task. Dreamer and TD-MPC only explore a small fraction of the maze, because they are prone to getting stuck at corners or walls without the guided exploration provided by skills and skill priors. SPiRL + Dreamer, SPiRL + TD-MPC, and SkiMo w/o joint training explore better than Dreamer and TD-MPC, but all fail to find the goal. This is because, without joint training of the model and policy, the skill space is only optimized for action reconstruction, not for planning, which makes long-horizon exploration and exploitation harder.
On the other hand, SkiMo + SAC and SPiRL are able to explore most of the maze, but the coverage is too wide to enable efficient learning. That is, even after the agent finds the goal through exploration, it continues to explore and does not exploit this experience to accomplish the task consistently (darker blue). This could be attributed to the difficult long-horizon credit assignment problem, which makes policy learning slow, and to the reliance on the skill prior, which encourages exploration. In contrast, our skill dynamics model effectively absorbs prior experience to generate goal-achieving imaginary rollouts for the actor and critic to learn from, which makes task learning more efficient. In essence, we find the skill dynamics model useful in guiding the agent to explore coherently and exploit efficiently.
B.2 Long-horizon Prediction
To compare the long-term prediction ability of the skill dynamics and flat dynamics, we visualize imagined trajectories by sampling trajectory clips of 500 timesteps from the agent's replay buffer (the maximum episode length in Maze is 2,000), and predicting the latent state up to 500 steps ahead (which is then decoded using the observation decoder) given the initial state and 500 ground-truth actions (50 skills for SkiMo). The similarity between the imagined trajectory and the ground-truth trajectory indicates whether the model can make accurate predictions far into the future, producing useful imaginary rollouts for policy learning and planning.
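This evaluation could be implemented roughly as in the sketch below, which reuses the hypothetical components from Section 3.2: the ground-truth actions of a 500-step clip are first encoded into 50 skills, the skill dynamics model is rolled out open-loop from the initial latent state, and each predicted latent state is decoded for visualization. Here, `skill_encoder` is assumed to be, e.g., the `SkillVAE.encode` method sketched in Section 3.1.

```python
import torch

@torch.no_grad()
def open_loop_prediction(comp, skill_encoder, states, actions, skill_horizon=10):
    """Sketch of the open-loop prediction evaluation: given ground-truth states
    (B, T+1, state_dim) and actions (B, T, action_dim) of a trajectory clip,
    roll the skill dynamics forward from the initial state only."""
    h = comp["state_encoder"](states[:1, 0])          # ground-truth initial state only
    predicted = []
    num_skills = actions.shape[1] // skill_horizon    # e.g., 500 actions -> 50 skills
    for k in range(num_skills):
        s_seq = states[:1, k * skill_horizon:(k + 1) * skill_horizon]
        a_seq = actions[:1, k * skill_horizon:(k + 1) * skill_horizon]
        z, _ = skill_encoder(s_seq, a_seq)            # ground-truth actions define the skill
        h = comp["skill_dynamics"](torch.cat([h, z], dim=-1))
        predicted.append(comp["obs_decoder"](h))      # decode for plotting vs. ground truth
    return torch.cat(predicted, dim=0)
```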
SkiMo is able to reproduce the ground-truth trajectory with little prediction error even when traversing through hallways and doorways, while Dreamer struggles to make accurate long-horizon predictions due to error accumulation. This is mainly because SkiMo allows temporal abstraction in the dynamics model, thereby enabling temporally-extended prediction and reducing step-by-step prediction error.
Appendix C Implementation Details
C.1 Computing Resources
Our approach and all baselines are implemented in PyTorch [27]. All experiments are conducted on a workstation with an Intel Xeon E5-2640 v4 CPU and an NVIDIA Titan Xp GPU. Pre-training of the skill policy and skill dynamics model takes around 10 hours. Downstream RL for 2M timesteps takes around 18 hours. The policy and model update frequency is the same across all algorithms except Dreamer [9]. Since only Dreamer trains on primitive actions, it has 10 times more frequent model and policy updates than the skill-based algorithms, which leads to slower training (about 52 hours).
C.2 Algorithm Implementation Details
For the baseline implementations, we use the official code for SPiRL and reimplement Dreamer and TD-MPC in PyTorch, which we verify on DeepMind Control tasks [41]. Table 1 below compares key components of SkiMo with the model-based and skill-based baselines and ablated methods.
Method  Skill-based  Model-based  Joint training
Dreamer [9] and TD-MPC [10]  ✗  ✓  ✗
SPiRL [28]  ✓  ✗  ✗
SPiRL + Dreamer and SPiRL + TD-MPC  ✓  ✓  ✗
SkiMo w/o joint training  ✓  ✓  ✗
SkiMo + SAC  ✓  ✗  ✓
SkiMo (Ours) and SkiMo w/o CEM  ✓  ✓  ✓
Dreamer [9]
We use the same hyperparameters as the official implementation.
TD-MPC [10]
We use the same hyperparameters as the official implementation, except that we do not use prioritized experience replay [32]. The same implementation is used for the SPiRL + TD-MPC baseline and our method, with only minor modifications.
SPiRL [28]
We use the official implementation of the original paper along with the hyperparameters suggested therein.
SPiRL + Dreamer [28]
We use our implementation of Dreamer and simply replace the action space with the latent skill space of SPiRL. We use the same pre-trained SPiRL skill policy and skill prior networks as the SPiRL baseline. Initializing the high-level downstream task policy with the skill prior, which is critical for downstream learning performance [28], is not possible due to the policy network architecture mismatch between Dreamer and SPiRL. Thus, we only use the prior divergence to regularize the high-level policy. Directly pre-training the high-level policy did not lead to better performance, but it might have worked better with more tuning.
SPiRL + TD-MPC [10]
Similar to SPiRL + Dreamer, we use our implementation of TD-MPC and replace the action space with the latent skill space of SPiRL. The initialization of the task policy is likewise not possible due to the different architecture used in TD-MPC.
SkiMo (Ours)
The skill-based RL part of our method is inspired by Pertsch et al. [28], and the model-based component is inspired by Hansen et al. [10] and Hafner et al. [9]. We detail skill and skill dynamics learning in Algorithm 1, the planning algorithm in Algorithm 2, and model-based RL in Algorithm 3. Table 2 lists all the hyperparameters we used.
Hyperparameter  Maze  FrankaKitchen  CALVIN
(a single value indicates the same setting for all three environments)

Model architecture
# Layers  5
Activation function  ELU
Hidden dimension  128  128  256
State embedding dimension  128  256  256
Skill encoder ($q_\phi$)  5-layer MLP  LSTM  LSTM
Skill encoder hidden dimension  128

Pre-training
Pre-training batch size  512
# Training mini-batches per update  5
Model-actor joint learning rate  0.001

Downstream RL
Model learning rate  0.001
Actor learning rate  0.001
Skill dimension  10
Skill horizon ($H$)  10
Planning horizon  10  3  1
Batch size  128  256  256
# Training mini-batches per update  10
State normalization  True  False  False
Prior divergence coefficient ($\alpha$)  1  0.5  0.1
Alpha learning rate  0.0003  0  0
Target divergence  3  N/A  N/A
# Warm-up steps  50,000  5,000  5,000
# Environment steps per update  500
Replay buffer size  1,000,000
Target update frequency  2
Target update rate ($\tau$)  0.01
Discount factor ($\gamma$)  0.99
Reward loss coefficient  0.5
Value loss coefficient  0.1
Consistency loss coefficient  2
Low-level actor loss coefficient  2
Planning discount  0.5
Encoder KL regularization ($\beta$)  0.0001

CEM
CEM iterations  6
# Sampled trajectories  512
# Policy trajectories  25
# Elites  64
CEM momentum  0.1
CEM temperature  0.5
Maximum std  0.5
Minimum std  0.01
Std decay steps  100,000  25,000  25,000
Horizon decay steps  100,000  25,000  25,000
C.3 Environments and Data
Maze [5, 29]
Since our goal is to leverage offline data collected from diverse tasks in the same environment, we use a variant of the maze environment [5] suggested in Pertsch et al. [29]. An initial state is randomly sampled near a small predefined region (the green circle in Figure 2(a)), and the goal position is fixed, shown as the red circle in Figure 2(a). The observation consists of the agent's 2D position and velocity. The agent moves around the maze by controlling the continuous value of its velocity. The maximum episode length is 2,000, but an episode is also terminated when the agent reaches a circle of radius 2 around the goal. A reward of 100 is given at task completion. We use the offline data of 3,046 trajectories from Pertsch et al. [29], collected between randomly sampled start and goal state pairs.
Kitchen [6, 5]
The 7-DoF Franka Panda robot arm needs to perform four sequential tasks (open microwave, move kettle, turn on bottom burner, and flip light switch). The agent has a 30D observation space (11D robot proprioceptive state and 19D object states), which removes a constant 30D goal state present in the original environment, and a 9D action space (7D joint velocity and 2D gripper velocity). The agent receives a reward of 1 for every subtask completion. We use 603 trajectories collected by teleoperation from Gupta et al. [6] as the offline task-agnostic data. The episode length is 280, and an episode also ends once all four subtasks are completed. The initial state is perturbed with small noise in every state dimension.
Misaligned Kitchen [29]
The environment and task-agnostic data are the same as in Kitchen, but we use a different downstream task (open microwave, flip light switch, slide cabinet door, and open hinge cabinet, as illustrated in Figure 2(c)). This task ordering is not aligned with the subtask transition probabilities of the task-agnostic data, which makes exploration that follows the prior from the data challenging.
CALVIN [23]
We adapt the CALVIN environment [23] for long-horizon learning with state observations. The CALVIN environment uses a Franka Emika Panda robot arm with 7D end-effector pose control (relative 3D position, 3D orientation, 1D gripper action). The 21D observation space consists of the 15D proprioceptive robot state and 6D object state. We use the teleoperated play data (Task D → Task D) of 1,239 trajectories from Mees et al. [23] as our task-agnostic data. The agent receives a sparse reward of 1 for every subtask completed in the correct order. The episode length is 360, and an episode also ends once all four subtasks are completed. The data contains 34 available target subtasks, and each subtask can transition to any other subtask, which makes any particular transition probability lower than 0.1% on average.
Appendix D Application to Real Robot Systems
Our algorithm is designed to be applied to real robot systems by improving the sample efficiency of RL using a temporally-abstracted dynamics model. Through extensive experiments in simulated robotic manipulation environments, we show that our approach achieves superior sample efficiency over prior skill-based and model-based RL, which gives strong evidence for its applicability to real robot systems. In particular, in Kitchen and CALVIN, our approach improves the sample efficiency of learning long-horizon manipulation tasks with a 7-DoF Franka Emika Panda robot arm. Our approach consists of three phases: (1) task-agnostic data collection, (2) learning of skills and the skill dynamics model, and (3) downstream task learning. In each of these phases, our approach can be applied to physical robots:
Task-agnostic data collection
Our approach is designed to fully leverage task-agnostic data without any reward or task annotation. In addition to extracting skills and skill priors, we further learn a skill dynamics model from this task-agnostic data. Maximizing the utility of task-agnostic data is critical for real robot systems, as data collection with physical robots is very expensive. Our method does not require any manual labelling of data and simply extracts skills, skill priors, and the skill dynamics model from raw states and actions, which makes our method scalable.
Pre-training of skills and skill dynamics model
Our approach trains the skill policy, skill dynamics model, and skill prior from the offline task-agnostic dataset, without requiring any additional real-world robot interactions.
Downstream task learning
The goal of our work is to leverage the skills and skill dynamics model to allow for more efficient downstream learning, i.e., requiring fewer interactions of the agent with the environment to train the policy. This is especially important on real robot systems, where robot-environment interaction is slow, dangerous, and costly. Our approach directly addresses this concern by learning a policy from imaginary rollouts rather than actual environment interactions. Moreover, all collected data can help improve the skill dynamics model, which leads to more accurate imagination and policy learning.
In summary, we believe that our approach can be applied to real-world robot systems with only minor modifications.