One After Another: Learning Incremental Skills for a Changing World

Reward-free, unsupervised discovery of skills is an attractive alternative to the bottleneck of hand-designing rewards in environments where task supervision is scarce or expensive. However, current skill pre-training methods, like many RL techniques, make a fundamental assumption of stationary environments during training. Traditional methods learn all their skills simultaneously, which makes it difficult for them both to adapt quickly to changes in the environment and to avoid forgetting earlier skills after such adaptation. In an evolving or expanding environment, however, skill learning must adapt fast to new situations without forgetting previously learned skills, and these two requirements make it difficult for classic skill discovery to do well. In this work, we propose a new framework for skill discovery, where skills are learned one after another in an incremental fashion. This framework allows newly learned skills to adapt to new environment or agent dynamics, while the fixed old skills ensure that the agent does not forget earlier behavior. We demonstrate experimentally that in both evolving and static environments, incremental skills significantly outperform current state-of-the-art skill discovery methods on both skill quality and the ability to solve downstream tasks. Videos of the learned skills and code are publicly available at https://notmahi.github.io/disk



1 Introduction

Modern successes of Reinforcement Learning (RL) primarily rely on task-specific rewards to learn motor behavior (Levine et al., 2016; Schulman et al., 2017; Haarnoja et al., 2018; Andrychowicz et al., 2020). Such learning requires well-behaved reward signals for solving control problems. The challenge of reward design is further coupled with the inflexibility of the learned policies: policies trained on one task do not generalize to a different task. Moreover, the agents generalize poorly to any changes in the environment (Raileanu et al., 2020; Zhou et al., 2019). This is in stark contrast to how biological agents, including humans, learn (Smith and Gasser, 2005). We continuously adapt, explore, learn without explicit rewards, and, importantly, learn incrementally (Brandon, 2014; Corbetta and Thelen, 1996).

Creating RL algorithms that can similarly adapt and generalize to new downstream tasks has hence become an active area of research in the RL community (Bellemare et al., 2016; Eysenbach et al., 2018; Srinivas et al., 2020). One viable solution is unsupervised skill discovery (Eysenbach et al., 2018; Sharma et al., 2019b; Campos et al., 2020; Sharma et al., 2020). Here, an agent gets access to a static environment without any explicit information on the set of downstream tasks or reward functions. During this unsupervised phase, the agent, through various information theoretic objectives (Gregor et al., 2016), is asked to learn a set of policies that are repeatable while being different from each other. These policies are often referred to as behavioral primitives or skills (Kober and Peters, 2009; Peters et al., 2013; Schaal et al., 2005). Once learned, these skills can then be used to solve downstream tasks in the same static environment by learning a high-level controller that chooses from the set of these skills in a hierarchical control fashion (Stolle and Precup, 2002; Sutton et al., 1999; Kulkarni et al., 2016) or by using Behavior Transfer to aid in agent exploration (Campos et al., 2021). However, the quality of the skills learned and the subsequent ability to solve downstream tasks are dependent on the unsupervised objective.

Figure 1: We present Discovery of Incremental Skills (DISk). As illustrated in (a), DISk learns skills incrementally by requiring each subsequent skill to be diverse from the previous skills while being consistent with itself. DISk lets us learn skills in environments whose dynamics change during training, an example of which is shown in (b): an Ant environment where a different leg breaks every few episodes. In this environment, the trajectories of DISk's learned skills are shown in (c) under each broken leg; each color shows a different skill.

Unfortunately, current objectives for unsupervised skill discovery make a major assumption of environment stationarity. Since all skills are simultaneously trained from start to finish, if the environment changes, every skill is equally exposed to highly non-stationary experience during its training. Moreover, since all skills try to adapt to the changed environment, the agent may catastrophically forget previously learned skills, which is undesirable for an agent with expanding scope. This severely limits the application of current skill discovery methods in lifelong RL settings (Nareyek, 2003; Koulouriotis and Xanthopoulos, 2008; Da Silva et al., 2006; Cheung et al., 2020), a simple example of which is illustrated in Figure 1(b), or real-world dynamic settings (Gupta et al., 2018; Julian et al., 2020).

In this paper, we propose a new unsupervised model-free framework: Discovery of Incremental Skills (DISk). Instead of simultaneously learning all desired skills, DISk learns skills one after another to avoid the joint skill optimization problem. In contrast to prior work, we treat each individual skill as an independent neural network policy without parameter sharing, which further decouples the learned skills. Given a set of previously learned skills, new skills are optimized to have high state entropy with respect to previously learned skills, which promotes skill diversity. Simultaneously, each new skill is also required to have low state entropy within itself, which encourages the skill to be consistent. Together, these objectives ensure that every new skill increases the diversity of the skill pool while being controllable (Fig. 1(a)). Empirically, this property enables DISk to generate higher quality skills and subsequently improve downstream task learning in both dynamic and static environments compared to prior state-of-the-art skill discovery methods. Especially in evolving environments, DISk can quickly adapt new skills to changes in the environment while keeping old skills decoupled and fixed to prevent catastrophic forgetting. We demonstrate this property empirically in MuJoCo navigation environments undergoing varying amounts of change. While DISk was designed for dynamic, evolving environments, we observe that it also performs noticeably better than previous skill learning methods in static environments, such as standard MuJoCo navigation environments, on both skill quality and downstream learning metrics.

To summarize, this paper provides the following contributions: (a) We propose a new unsupervised skill discovery algorithm (DISk) that can effectively learn skills in a decoupled manner. (b) We demonstrate that the incremental nature of DISk lets agents discover diverse skills in evolving or expanding environments. (c) Even in static environments, we demonstrate that DISk learns a diverse set of controllable skills on continuous control tasks that outperform current state-of-the-art methods on skill quality as well as sample and computational efficiency. (d) Finally, we demonstrate that our learned skills, from both dynamic and static environments, can be used without any modification in a hierarchical setting to efficiently solve downstream long-horizon tasks.

2 Background and Preliminaries

To set up our incremental RL framework for unsupervised skill discovery, we begin by concisely introducing the relevant background and preliminaries. A detailed treatment of RL can be found in Sutton and Barto (2018), of incremental learning in Chen and Liu (2018), and of skill discovery in Eysenbach et al. (2018).

Reinforcement Learning: Given an MDP with actions $a \in \mathcal{A}$ and states $s \in \mathcal{S}$, standard model-free RL aims to maximize the sum of a task-specific reward $r$ over time by optimizing the parameters $\theta$ of a policy $\pi_\theta$. In contrast to task-specific RL, where the reward corresponds to making progress on that specific task, methods in unsupervised RL (Bellemare et al., 2016; Eysenbach et al., 2018; Srinivas et al., 2020) instead optimize an auxiliary reward objective $r^{\text{int}}$ that does not necessarily correspond to the task-specific reward $r$.

Incremental Learning: Incremental, lifelong, or continual learning covers a wide variety of problems, both within and outside of the field of Reinforcement Learning. Some works consider the problem of fast initialization in a new task given previous tasks (Tanaka and Yamamura, 1997; Fernando et al., 2017), while others focus on learning from a sequence of different domains such that at time $t$, the model does not forget what it has seen so far (Fei et al., 2016; Li and Hoiem, 2017). Finally, other works have considered a bidirectional flow of information, where performance in both old and new tasks improves simultaneously with each new task (Ruvolo and Eaton, 2013; Ammar et al., 2015). In all of these cases, there is a distribution shift in the data over time, which differentiates this problem setting from multi-task learning. Some of these methods (Tanaka and Yamamura, 1997; Ruvolo and Eaton, 2013; Fei et al., 2016) learn a set of independently parametrized models to address this distribution shift over time.

Skill Discovery: Often, skill discovery is posited as learning a $z$-dependent policy $\pi(a \mid s, z)$ (Eysenbach et al., 2018; Sharma et al., 2020), where $z \in \mathcal{Z}$ is a latent variable that represents an individual skill. The space $\mathcal{Z}$ hence represents the pool of skills available to an agent. To learn the parameters of the skill policy, several prior works (Hausman et al., 2018; Eysenbach et al., 2018; Campos et al., 2020; Sharma et al., 2020) propose objectives that diversify the outcomes of different skills while making the same skill produce consistent behavior.

One such algorithm, DIAYN (Eysenbach et al., 2018), does this by maximizing the following information theoretic objective:

$$\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S, Z] \qquad (1)$$

Here, $I(S; Z)$ represents the mutual information between states and skills, which intuitively encourages the skill to control the states being visited. $\mathcal{H}[A \mid S, Z]$ represents the Shannon entropy of actions conditioned on state and skill, which encourages maximum entropy (Haarnoja et al., 2017) skills. Operationally, this objective is maximized by model-free RL optimization of an intrinsic reward function that corresponds to maximizing Equation 1, i.e. $r^{\text{int}}_z(s, a) = \hat{\mathcal{F}}(s, a, z)$, where $\hat{\mathcal{F}}$ is an estimate of the objective. Note that the right part of the objective can be directly optimized using maximum-entropy RL (Haarnoja et al., 2018, 2017).
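As a concrete illustration of how such an intrinsic reward is computed in practice, the sketch below implements a DIAYN-style reward from a learned skill discriminator $q_\phi(z \mid s)$ with a uniform prior over discrete skills. The discriminator architecture and all names are illustrative assumptions rather than the reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative DIAYN-style intrinsic reward (not the reference implementation):
# r(s, z) = log q_phi(z | s) - log p(z), with p(z) uniform over n_skills.
class SkillDiscriminator(nn.Module):
    def __init__(self, state_dim, n_skills, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )

    def forward(self, state):
        return self.net(state)  # unnormalized logits over skills

def diayn_intrinsic_reward(disc, state, skill, n_skills):
    # state: (batch, state_dim), skill: (batch,) integer skill indices
    with torch.no_grad():
        log_q = F.log_softmax(disc(state), dim=-1)
        log_q_z = log_q.gather(-1, skill.unsqueeze(-1)).squeeze(-1)
    # log p(z) = -log(n_skills) for a uniform prior, so subtracting it adds log(n_skills)
    return log_q_z + torch.log(torch.tensor(float(n_skills)))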

Other prominent methods such as DADS and off-DADS (Sharma et al., 2019b, 2020) propose to instead maximize $I(S'; Z \mid S)$, the mutual information between the next state and the skill conditioned on the current state. Similar to DIAYN, this objective is maximized by RL optimization of the corresponding intrinsic reward. However, unlike DIAYN, the computation of the reward is model-based, as it requires learning a skill-conditioned dynamics model. In both cases, the $z$-conditioned skill policy requires end-to-end joint optimization of all the skills and shares parameters across skills through a single neural network policy.

3 Method

In DIAYN and DADS, all skills are trained simultaneously from start to end, and they are not designed to adapt to changes in the environment (non-stationary dynamics). This is particularly problematic in lifelong RL settings (Nareyek, 2003; Koulouriotis and Xanthopoulos, 2008; Da Silva et al., 2006; Cheung et al., 2020) or in real-world settings (Gupta et al., 2018; Julian et al., 2020) where the environment can change during training.

To address these challenges of reward-free learning and adaptability, we propose incremental policies: instead of optimizing all skills simultaneously, we learn them sequentially. This way, newer skills can adapt to the evolved environment without erasing what older skills have learned. In Section 3.1 we discuss our incremental objective, and we follow it with practical considerations and the full algorithm in Section 3.2.

3.1 Discovery of Incremental Skills (DISk)

To formalize our incremental discovery objective, we begin with Equation 1, under which categorical or discrete skills can be expanded as:

$$\mathcal{F}(\theta) = \mathbb{E}_{z \sim p(z)}\left[\mathcal{I}_z\right] + \mathcal{H}[A \mid S, Z] \qquad (2)$$

Here, $\mathcal{I}_z$ refers to the information gain quantity $\mathcal{H}[S] - \mathcal{H}[S \mid Z = z]$. Intuitively, this corresponds to the reduction in entropy of the state visitation distribution when a specific skill is run. With a uniform skill prior $p(z) = 1/N$ and $N$ total discrete skills, this expands to:

$$\mathcal{F}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \mathcal{H}[S] - \mathcal{H}[S \mid Z = z_i] \right) + \mathcal{H}[A \mid S, Z] \qquad (3)$$

In incremental skill discovery, since our goal is to learn a new skill $z_N$ given $N-1$ previously trained skills, we can fix the learned skills $z_1, \dots, z_{N-1}$ and reduce the objective to:

$$\max_{\theta_N} \; \mathcal{H}[S] - \tfrac{1}{N}\,\mathcal{H}[S \mid Z = z_N] + \tfrac{1}{N}\,\mathcal{H}[A \mid S, Z = z_N] + \text{const} \qquad (4)$$

which, after dropping constants and rescaling the terms, simplifies to

$$\max_{\theta_N} \; \mathcal{H}[S] - \mathcal{H}[S \mid Z = z_N] + \mathcal{H}[A \mid S, Z = z_N] \qquad (5)$$

The first term corresponds to maximizing the total entropy of states across all skills, which encourages the new skill to produce diverse behavior different from the previous ones. The second term corresponds to reducing the entropy of states given the skill $z_N$, which encourages the skill to produce consistent state visitations. Finally, the third term encourages a maximum entropy policy for the skill $z_N$. Note that this objective can be applied incrementally for $N = 1, 2, \dots$, where $N$ can be arbitrarily large. A step-by-step expansion of this objective is presented in Appendix B.

3.2 A Practical Algorithm

Since we treat each skill as an independent policy, our framework involves learning the skill $\pi_{\theta_N}$ given a set of previously learned skills $\{\pi_{\theta_1}, \dots, \pi_{\theta_{N-1}}\}$. Note that each $\pi_{\theta_i}$ is a stochastic policy representing the corresponding skill $z_i$ and has its own set of learned parameters $\theta_i$. Let $S^{(i)}$ denote the random variable associated with the states visited by rolling out policy $\pi_{\theta_i}$, and $S^{(\le N)}$ the states visited by rolling out all policies $\pi_{\theta_1}, \dots, \pi_{\theta_N}$. Then, our objective from Equation 5 becomes:

$$\max_{\theta_N} \; \mathcal{H}[S^{(\le N)}] - \mathcal{H}[S^{(N)}] + \mathcal{H}[A \mid S, z_N] \qquad (6)$$

The rightmost term can be optimized using max-entropy RL (Haarnoja et al., 2018), as done in prior skill discovery work such as DIAYN or Off-DADS. However, computing the first two terms is not tractable in large continuous state spaces. To address this, we employ three practical estimation methods. First, we use Monte-Carlo estimates of the random variables by rolling out the policies. Next, we use a point-wise entropy estimator function to estimate the entropy from the sampled rollouts. Finally, since distances in raw states are often not meaningful, we measure similarity in a projected space for entropy computation. Given this estimate of the objective, we set the intrinsic reward of the skill $z_N$ to the estimated value. The full algorithm is described in Algorithm 1, pseudocode is given in Appendix H, and the key features are described below.

Monte-Carlo estimation with replay: For each skill, we collect a reward buffer containing rollouts from each of the previous skills, which together make up our estimate of $S^{(\le N-1)}$. For $S^{(N)}$, we keep a small number of the states most recently visited by our policy in a circular buffer, so that our estimate cannot drift too far from the current policy's state distribution. Since point-wise entropy computation on a set of size $n$ has $O(n^2)$ complexity, we subsample states from our reward buffer to estimate the entropy.

Point-wise entropy computation: A direct Monte-Carlo entropy estimate would require computing $-\mathbb{E}_{s \sim p(s)}[\log p(s)]$, which is intractable without access to the density $p(s)$. Instead, we use a non-parametric, Nearest Neighbor (NN) based entropy estimator (Singh et al., 2003; Liu and Abbeel, 2021; Yarats et al., 2021):

$$\hat{\mathcal{H}}[X] \propto \frac{1}{n} \sum_{i=1}^{n} \log \left\lVert x_i - x_i^{k\text{-NN}} \right\rVert \qquad (7)$$

where $x_i^{k\text{-NN}}$ is the $k$-th nearest neighbor of $x_i$ within the dataset, so each point contributes an amount proportional to $\log \lVert x_i - x_i^{k\text{-NN}} \rVert$ to the overall entropy. We further approximate this estimator similarly to Yarats et al. (2021) and fix $k$ to a small value, since any small value suffices. Additional details are discussed in Appendix B.

Measuring similarity under projection: To measure entropy using Equation 7 on our problem, we have to consider nearest neighbors in a Euclidean space. A good choice of this space will result in more meaningful diversity in our skills. Hence, in general, we estimate the entropy not directly on the raw state $s$, but under a projection $\phi(s)$, where $\phi$ captures the meaningful variations in our agent. Learning such low-dimensional projections is in itself an active research area (Whitney et al., 2019; Du et al., 2019). Hence, following prior work like Off-DADS and DIAYN that uses velocity-based states, we use a pre-defined $\phi$ that gives us the velocity of the agent. We use instantaneous agent velocity since in locomotion tasks it is a global metric that is generally independent of how long the agent is run for. Like Off-DADS and DIAYN, we found that a good projection is crucial for the success of an agent maximizing information theoretic diversity.
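A minimal sketch of this particle-based estimate under a projection is given below, assuming numpy, a $\phi$ that simply slices velocity coordinates out of the raw state (which indices hold velocities is environment-specific and an assumption here), and an illustrative choice of $k = 3$.

import numpy as np

def phi(states, vel_idx):
    # Illustrative projection: keep only the velocity coordinates of each raw state.
    # Which indices correspond to velocities is environment-specific (assumed here).
    return states[:, vel_idx]

def knn_entropy_estimate(x, k=3, eps=1e-6):
    # Particle-based entropy estimate: each point contributes the log of the
    # distance to its k-th nearest neighbor (up to additive/multiplicative constants).
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # pairwise (n, n)
    np.fill_diagonal(dists, np.inf)                                 # exclude self-distance
    knn_dist = np.sort(dists, axis=-1)[:, k - 1]                    # k-th nearest neighbor
    return float(np.mean(np.log(knn_dist + eps)))

# Example usage on projected rollout states of shape (n, state_dim):
# h = knn_entropy_estimate(phi(rollout_states, vel_idx))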

Input: Projection function $\phi$, hyperparameter $k$, off-policy learning algorithm $\mathcal{A}$.
Initialize: Learnable parameters $\theta_N$ for $\pi_{\theta_N}$, empty circular buffer Buf with size $m$, and replay buffer $\mathcal{D}$.
Sample trajectories $\mathcal{D}_{prev}$ from previous policies $\pi_{\theta_1}, \dots, \pi_{\theta_{N-1}}$ on the current environment.
while Not converged do
     Collect transition $(s, a, s')$ by running $\pi_{\theta_N}$ on the environment.
     Store the transition in the replay buffer, $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, s')\}$.
     Add the new projected state to the circular buffer, Buf $\leftarrow$ Buf $\cup \{\phi(s')\}$.
     for $j$ = 1, 2, ... do
         Sample $(s, a, s')$ from $\mathcal{D}$.
         Sample batch $B \sim \mathcal{D}_{prev}$.
         Find $\nu_c$, the $k$-Nearest Neighbor of $\phi(s')$ within Buf.
         Find $\nu_d$, the $k$-Nearest Neighbor of $\phi(s')$ within $B$.
         Set consistency penalty $r^{con} = \lVert \phi(s') - \nu_c \rVert$.
         Set diversity reward $r^{div} = \lVert \phi(s') - \nu_d \rVert$.
         Set intrinsic reward $r = \alpha\, r^{div} - \beta\, r^{con}$.
         Update $\theta_N$ using $\mathcal{A}$ with $(s, a, r, s')$ as our transition.
Return: $\pi_{\theta_N}$
Algorithm 1 Discovery of Incremental Skills: Learning the $N$th Skill

Putting it all together: On any given environment, we can put the above three ideas together to learn an arbitrary number of incremental skills using Algorithm 1. To learn a new skill $\pi_{\theta_N}$ in a potentially evolved environment, we start by collecting an experience buffer $\mathcal{D}_{prev}$ with states collected from the previous skills. Then, we collect transitions $(s, a, s')$ by running $\pi_{\theta_N}$ on the environment. We store each transition in the replay buffer $\mathcal{D}$, and also store the projected state $\phi(s')$ in a circular buffer Buf. Then, to update our policy parameters $\theta_N$, on each update step we sample $(s, a, s')$ from our replay buffer and calculate the intrinsic reward of that sample. To calculate this, we first find the $k$-Nearest Neighbor of $\phi(s')$ within Buf, called $\nu_c$ henceforth. Then, we sample a batch $B \sim \mathcal{D}_{prev}$, and find the $k$-Nearest Neighbor of $\phi(s')$ within $B$, called $\nu_d$ henceforth. Given these nearest neighbors, we define our consistency penalty $r^{con} = \lVert \phi(s') - \nu_c \rVert$ and our diversity reward $r^{div} = \lVert \phi(s') - \nu_d \rVert$, which yields an intrinsic reward $r = \alpha\, r^{div} - \beta\, r^{con}$. The scales $\alpha$ and $\beta$ are estimated such that the expected values of $\alpha\, r^{div}$ and $\beta\, r^{con}$ are close in magnitude, which is done by using the mean values of $r^{div}$ and $r^{con}$ from the previous skill $\pi_{\theta_{N-1}}$.

Training the first skill with DISk uses a modified reward function, since there are no prior skills to compute diversity rewards against. In that case, we simply replace the diversity reward with a constant throughout the first skill's training, thus encouraging the agent to stay alive while maximizing consistency. Exact hyperparameter settings are provided in Appendix D.
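The per-transition reward of Algorithm 1 can be sketched as follows. The buffer handling, the scale factors alpha and beta, and the constant diversity reward for the first skill are simplified assumptions based on the description above, not the released implementation.

import numpy as np
from typing import Optional

def knn_distance(query, points, k=3):
    # Distance from one projected state to its k-th nearest neighbor in `points`.
    d = np.linalg.norm(points - query[None, :], axis=-1)
    return float(np.sort(d)[min(k - 1, len(d) - 1)])

def disk_intrinsic_reward(phi_next: np.ndarray,
                          own_buffer: np.ndarray,
                          prev_states: Optional[np.ndarray],
                          alpha: float, beta: float, k: int = 3) -> float:
    # own_buffer: circular buffer of this skill's recent projected states.
    # prev_states: subsampled projected states collected from previous skills.
    # Consistency penalty: distance to this skill's own recent states (small = consistent).
    r_con = knn_distance(phi_next, own_buffer, k)
    # Diversity reward: distance to previous skills' states (large = diverse).
    # For the very first skill there is nothing to be diverse from, so use a constant.
    if prev_states is None or len(prev_states) == 0:
        r_div = 1.0  # assumed constant stay-alive bonus for the first skill
    else:
        r_div = knn_distance(phi_next, prev_states, k)
    return alpha * r_div - beta * r_con

In practice, alpha and beta would be rescaled using the mean values of the diversity and consistency terms observed while training the previous skill, as described above.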

4 Experiments

We have presented DISk, an unsupervised skill discovery method that can learn decoupled skills incrementally without task-specific rewards. In this section, we answer the following questions: (a) Does DISk learn diverse and consistent skills in evolving environments with non-stationary dynamics? (b) How does DISk compare against traditional skill learning methods in stationary environments? (c) Can skills learned by DISk accelerate the learning of downstream tasks? (d) How do our specific design choices affect the performance of DISk? To answer these questions, we first present our experimental framework and baselines, and then follow with our experimental results.

4.1 Experimental setup

To study DISk, we train skills on an agent that only has access to the environment without information about the downstream tasks (Eysenbach et al., 2018; Sharma et al., 2019b, 2020). We use two types of environments: for stationary tasks we use standard MuJoCo environments from OpenAI Gym (Todorov et al., 2012; Brockman et al., 2016): HalfCheetah, Hopper, Ant, and Swimmer, visualized in Fig. 4 (left), while for non-stationary environments we use two modified variants of the Ant environment: one with disappearing blocks, and another with broken legs. Environment details are provided in Appendix C.

Baselines: In our experiments, we compare our algorithm against two previous state-of-the-art unsupervised skill discovery algorithms described in Section 2: DIAYN (Eysenbach et al., 2018) and Off-DADS (Sharma et al., 2020). To contextualize the performance of these algorithms, whenever appropriate, we also compare against a collection of random, untrained policies. To make the comparison fair, we give both the DIAYN and Off-DADS agents access to the same transformed state $\phi(s)$ that we use in DISk. Since Off-DADS is conditioned on continuous latents, for concrete evaluations we follow the same method as the original paper by uniformly sampling latents from the latent space and using them as the skills from Off-DADS in all our experiments. Exact hyperparameter settings are described in Appendix D, and our implementation of DISk is attached as supplementary material.

4.2 Can DISk adaptively discover skills in evolving environments?

One of the key hypotheses of this work is that DISk is able to discover useful skills even when the environment evolves over time. To gauge the extent of DISk's adaptability, we design two experiments, shown in Fig. 2 (left) and Fig. 1 (b), for judging the adaptability of DISk in environments that change in a continuous and a discrete manner, respectively.

Continuous Change: We create a MuJoCo environment similar to Ant with 40 blocks encircling the agent. The blocks initially prevent long trajectories in any direction. Then, as we train the agent, we slowly remove the blocks at three different speeds. If $T$ is the total training time of the agent, in "fast" we remove one block at a time, in "even" we remove two blocks at a time, and in "slow" we remove ten blocks at a time, with the removal intervals set as fixed fractions of $T$ so that each schedule removes all 40 blocks over the course of training. In each of the cases, the environment starts with a completely encircled agent and ends with an agent completely free to move in any direction; across the three environments only the frequency and the magnitude of the evolution change. This environment is meant to emulate an expanding environment where new possibilities open up during training, like a domestic robot being given access to new rooms in the house. The three different rates of change let us observe how robust a constant skill-addition schedule is to different rates of change in the world.
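One way to realize such an expanding environment is a thin wrapper that removes blocks on a fixed schedule. The sketch below assumes a hypothetical remove_block(i) hook on the underlying environment and is illustrative rather than the actual environment code.

class BlockRemovalSchedule:
    # Illustrative wrapper for an expanding environment: every `interval` steps,
    # remove `blocks_per_removal` of the remaining blocks. `env.remove_block(i)`
    # is a hypothetical hook, not an existing API.
    def __init__(self, env, n_blocks, blocks_per_removal, interval):
        self.env = env
        self.remaining = list(range(n_blocks))
        self.blocks_per_removal = blocks_per_removal
        self.interval = interval
        self.step_count = 0

    def step(self, action):
        self.step_count += 1
        if self.step_count % self.interval == 0:
            for _ in range(min(self.blocks_per_removal, len(self.remaining))):
                self.env.remove_block(self.remaining.pop(0))
        return self.env.step(action)

Under this sketch, "fast", "even", and "slow" would differ only in blocks_per_removal and interval while the total training time stays fixed.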

Figure 2: Left: Block environments that evolve during training; red highlights portion of blocks vanishing at once. Middle: Trajectories of skills discovered by different algorithms; each unique color is a different skill. Right: Mean Hausdorff distance of the skills discovered in these environments.

In Fig. 2 (middle), we can see how the skill learning algorithms react to the changing environment. The comparison between Off-DADS and DISk is especially noticeable. The Off-DADS agent latches onto the first opening of the block circle, optimizes its skill dynamics model for motion in that single direction, and almost completely ignores further evolution of the environment. On the other hand, since the DISk agent incrementally initializes an independent policy per skill, it can easily adapt to the new, changed environment during the training time of each skill. To quantitatively measure the diversity of the learned skills, we use the Hausdorff distance (Belogay et al., 1997), a topological way of measuring the distance between two sets of points (see Appendix B.3). We plot the mean Hausdorff distance between the endpoints of each skill and the endpoints of all other skills in Fig. 2 (right). On this metric as well, DISk does better than the other methods across environments.

Discrete Change: We modify the standard Ant environment into a dynamically changing environment by disabling one of the four Ant legs during test and train time. To make the environment dynamic, we cycle between broken legs every 1M steps during the 10M steps of training. We show the trajectories of the learned skills with each of the legs broken in Fig. 3.

Figure 3: Evaluation of skills learned on the broken Ant environment with different legs of the Ant disabled; the leg disabled during evaluation is indicated in the legend.

In this case, both the Off-DADS and the DISk agent adapt to the broken leg and learn some skills that travel a certain distance over time. However, a big difference becomes apparent if we note under which broken leg the Off-DADS agent performs best. The Off-DADS agent, as seen in Fig. 3 (left), performs well with the broken leg that it was trained on most recently. To optimize for this recently-broken leg, the agent forgets previously learned skills (App. Fig. 78). Conversely, each DISk skill learns how to navigate with a particular leg broken, and once a skill has learned to navigate with some leg broken, that skill always performs well under that particular dynamics.

4.3 What quality of skills are learned by DISk in a static environment?

Figure 4: Left: Environments we evaluate on. Middle: Visualization of trajectories generated by the skills learned by DISk and baselines, with each unique color denoting a specific skill. For environments that primarily move in the x-axis (HalfCheetah and Hopper), we plot the final state reached by the skill across five runs. For environments that primarily move in the x-y plane (Ant and Swimmer), we plot five trajectories for each skill. Right: Hausdorff distance (higher is better) between the terminal states visited by each skill and terminal states of all other skills, averaged across the set of discovered skills to give a quantitative estimate of the skill diversity.

While DISk was created to address skill learning in an evolving environment, intuitively it could work just as well in a static environment. To qualitatively study DISk-discovered skills in static environments, we plot the trajectories they generate in Fig. 4 (middle). On the three hardest environments (Hopper, Ant, and Swimmer), DISk produces skills that cover larger distances than DIAYN and Off-DADS. Moreover, on Ant and Swimmer, we can see that skills from DISk are quite consistent, with tight packing of trajectories. On the easier HalfCheetah environment, we notice that DIAYN produces higher quality skills. One reason for this is that both DIAYN and Off-DADS share parameters across skills, which accelerates skill learning in less complex environments. Also note that the performance of DIAYN is worse on complex environments like Swimmer and Ant. The DIAYN agent stops learning when the DIAYN discriminator achieves perfect discriminability; since we provide the agent and the discriminator with the extra information $\phi(s)$, in complex environments like Swimmer or Ant DIAYN achieves this perfect discriminability and then stops learning. While the performance of DIAYN on Ant may be surprising, it is in line with the observations of Sharma et al. (2019b, 2020).

As shown in the quantitative results in Fig. 4 (right), for the harder Hopper, Ant, and Swimmer tasks, DISk demonstrates substantially better performance. However, on the easier HalfCheetah task, DISk underperforms DIAYN and Off-DADS. This supports our hypothesis that incremental learning discovers higher quality skills in hard tasks, while prior works are better on easier environments.

4.4 Can DISk skills accelerate downstream learning?

A promise of unsupervised skill discovery methods is that in complicated environments the discovered skills can accelerate learning on downstream tasks. In our experiments, we find that skills learned by DISk can be leveraged hierarchically to learn downstream tasks faster. To examine the potential advantage of using the discovered skills, we set up an experiment similar to the goal-conditioned hierarchical learning experiment in Sharma et al. (2019b). In this experiment, we initialize a hierarchical Ant agent with the learned skills and task it with reaching an $(x, y)$ coordinate chosen uniformly at random. Note that this environment is the standard Ant environment, with no changing dynamics present during evaluations. To highlight the versatility of the skills learned under evolving environments, we also evaluate the agents that were trained under our evolving Block environments (see Section 4.2) within the same framework.
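A hierarchical setup of this kind can be sketched as below: a high-level policy selects one of the frozen skills every fixed number of low-level steps, and only the high-level policy is trained on the goal-reaching reward. The skill interface, the horizon, and the classic gym-style step() signature are assumptions for illustration.

def rollout_with_skills(env, skills, high_level_policy, skill_horizon=10, episode_len=1000):
    # `skills`: list of frozen callables mapping state -> action (the learned skills).
    # `high_level_policy`: callable mapping state -> skill index; the only trained component.
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while steps < episode_len and not done:
        skill_idx = high_level_policy(state)       # pick a frozen skill
        for _ in range(skill_horizon):             # execute it for a fixed horizon
            action = skills[skill_idx](state)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            steps += 1
            if done or steps >= episode_len:
                break
    return total_reward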

Figure 5: Downstream Hierarchical Learning. We plot the average normalized distance (lower is better) from our Ant agent to the target over 500K training steps. From left to right, we plot the performance of agents that were trained in an environment with no obstacles, and in our three evolving block environments. Shading shows variance over three runs, and we add a goal-based Soft Actor Critic model trained non-hierarchically as a baseline. Across all settings, DISk outperforms prior work.

We generally find that DISk agents outperform the other agents on this hierarchical learning benchmark, typically converging within 100-200k steps, whereas the DIAYN agent never converges and Off-DADS takes around 500k steps (Fig. 5). This holds regardless of the evolution dynamics of the environment in which the skills were learned, showing that DISk provides an adaptive skill discovery algorithm under a variety of learning circumstances. This property makes DISk particularly suitable for lifelong learning scenarios where the environment is continuously evolving.

4.5 Ablation analysis on design choices

We make several design choices that improve DISk. In this section, we briefly discuss the two most important ones. First, our implementation of DISk shares the data collected by the previous skills by initializing the replay buffer of a new skill with data from the previous ones. Once initialized with this replay, the new skill can relabel that experience with its own intrinsic rewards. This method has shown promise in multi-task RL and hindsight relabelling (Andrychowicz et al., 2017; Li et al., 2020; Eysenbach et al., 2020). Across our experiments, performing such relabelling improves performance significantly for the later skills (App. Fig. 10 (left)), as they can reuse the vast amounts of data seen by prior skills. Training details for this experiment are presented in Appendix G.
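Concretely, the relabelling step can be sketched as recomputing the new skill's intrinsic reward on transitions inherited from earlier skills; the transition format and the reward-function handle below are assumptions matching the sketch in Section 3.2.

def relabel_previous_experience(prev_transitions, intrinsic_reward_fn):
    # prev_transitions: iterable of (state, action, next_state) tuples collected
    # by earlier skills. intrinsic_reward_fn is assumed to compute the *new*
    # skill's reward (e.g. alpha * r_div - beta * r_con) from the next state.
    relabelled = []
    for state, action, next_state in prev_transitions:
        reward = intrinsic_reward_fn(next_state)
        relabelled.append((state, action, reward, next_state))
    return relabelled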

Compared to prior work, DISk not only learns skills incrementally, but also learns an independent neural network policy for each skill. To check whether the performance gains in static environments are primarily due to the incremental addition of skills or to the independent policies, we run a version of our algorithm on Ant where all the independent skills are trained in parallel (App. Fig. 10 (left)). Experimentally, we see that simply training independent skills in parallel does not discover useful skills, with the parallel variant achieving a far lower mean Hausdorff distance. Note that we cannot have incremental addition of skills without independent policies, at least in the current instantiation of DISk, without significant modification to the algorithm or the architecture: it is highly nontrivial to extend or modify a network's input or parameter space, and the simplest extension is just a new network, which is DISk itself.

5 Related Work

Our work on developing incremental skill discovery is related to several subfields of AI. In this section we briefly describe the most relevant ones.

Incremental Learning: Machine learning algorithms have traditionally focused on stationary, non-incremental learning, where a fixed dataset is presented to the algorithm (Giraud-Carrier, 2000; Sugiyama and Kawanabe, 2012; Ditzler et al., 2015; Lee et al., 2018, 2017; Smith and Gasser, 2005). Inspired by studies in child development (Mendelson et al., 1976; Smith and Gasser, 2005; Smith et al., 1998; Spelke, 1979; Wertheimer, 1961), sub-areas such as curriculum learning (Bengio et al., 2009; Kumar et al., 2010; Murali et al., 2018) and incremental SVMs (Cauwenberghs and Poggio, 2001; Syed et al., 1999; Ruping, 2001) have emerged, where learning occurs incrementally. We note that the general focus of these prior works is on problems where a labelled dataset is available. In this work, we instead focus on RL problems, where labelled datasets or demonstrations are not provided during incremental learning. In the context of RL, some prior works consider learning over a sequence of tasks (Tanaka and Yamamura, 1997; Wilson et al., 2007; Ammar et al., 2015), while other works (Peng and Williams, 1994; Ammar et al., 2015; Wang et al., 2019a, b) create algorithms that can adapt to changing environments through incremental policy or Q-learning. However, these methods operate in the task-specific RL setting, where the agent is trained to solve one specific reward function.

Unsupervised Learning in RL: Prominent works in this area focus on obtaining additional information about the environment in the expectation that it will help downstream task-based RL. Representation learning methods (Yarats et al., 2019; Lee et al., 2019; Srinivas et al., 2020; Schwarzer et al., 2020; Stooke et al., 2020; Yarats et al., 2021) use unsupervised perceptual losses to enable better visual RL; model-based methods (Hafner et al., 2018, 2019; Yan et al., 2020; Agrawal et al., 2016) learn approximate dynamics models that allow for accelerated planning with downstream rewards; exploration-based methods (Bellemare et al., 2016; Ostrovski et al., 2017; Pathak et al., 2017; Burda et al., 2018; Andrychowicz et al., 2017; Hazan et al., 2019; Mutti et al., 2020; Liu and Abbeel, 2021) focus on obtaining sufficient coverage in environments. Since our proposed work is based on skill discovery, it is orthogonal to these prior works and can potentially be combined with them for additional performance gains. Most related to our work is the sub-field of skill discovery (Eysenbach et al., 2018; Sharma et al., 2019b; Campos et al., 2020; Gregor et al., 2016; Achiam et al., 2018; Sharma et al., 2020; Laskin et al., 2022). Quite recently, the Unsupervised Reinforcement Learning Benchmark (Laskin et al., 2021) has combined a plethora of such algorithms into one benchmark. A more formal treatment of this is presented in Section 2. Since our method makes the skill discovery process incremental, it allows for improved performance compared to DIAYN (Eysenbach et al., 2018) and Off-DADS (Sharma et al., 2020), as discussed in Section 4. Finally, we note that skills can also be learned in a supervised setting with demonstrations or rewards (Kober and Peters, 2009; Peters et al., 2013; Schaal et al., 2005; Dragan et al., 2015; Konidaris et al., 2012; Konidaris and Barto, 2009; Zahavy et al., 2021; Shankar et al., 2020; Shankar and Gupta, 2020). Our work can potentially be combined with such prior work to improve skill discovery when demonstrations are present.

RL with evolving dynamics: Creating RL algorithms that can adapt and learn in environments with changing dynamics is a longstanding problem (Nareyek, 2003; Koulouriotis and Xanthopoulos, 2008; Da Silva et al., 2006; Cheung et al., 2020). Recently, a variant of this problem has been studied in the sim-to-real transfer community (Peng et al., 2018; Chebotar et al., 2019; Tan et al., 2018), where minor domain gaps between simulation and reality can be bridged through domain randomization. Another promising direction is online adaptation (Rakelly et al., 2019; Hansen et al., 2020; Yu et al., 2017) and Meta-RL (Finn et al., 2017; Nagabandi et al., 2018a, b), where explicit mechanisms to infer the environment are embedded in policy execution. We note that these methods often focus on minor variations in dynamics during training or assume that there is an underlying distribution of environment variability, while our framework does not.

6 Discussion, limitations, and scope for future work

We have presented DISk, an unsupervised skill discovery method that takes the first steps towards learning skills incrementally. Although already powerful, there remains significant scope for future work before such methods can be applied in real-world settings. For instance, we notice that on easy tasks such as HalfCheetah, DISk underperforms prior work, which is partly due to its inability to share parameters across previously learned skills. Sharing parameters efficiently could allow improved learning and a reduction of the total parameter count. Next, to measure similarity in states, we use a fixed projection function; being able to simultaneously learn this projection function is likely to further improve performance. Also, we use a fixed schedule based on skill convergence to add new skills to our repertoire; a more intelligent schedule for skill addition would make DISk even more efficient. Finally, to apply such skill discovery methods in the real world, we may have to bootstrap from supervised data or demonstrations. Such bootstrapping can provide an enormous boost in settings where humans can interact with the agent.

Acknowledgments

We thank Ben Evans, Jyo Pari, and Denis Yarats for their feedback on early versions of the paper. This work was supported by a grant from Honda and ONR award number N00014-21-1-2758.

References

  • J. Achiam, H. Edwards, D. Amodei, and P. Abbeel (2018) Variational option discovery algorithms. arXiv preprint arXiv:1807.10299. Cited by: Figure 12, §G.4, §G.4, §5.
  • P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine (2016) Learning to poke by poking: experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pp. 5074–5082. Cited by: §5.
  • H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor (2015) Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §2, §5.
  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. NIPS. Cited by: §4.5, §5.
  • O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §1.
  • M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868. Cited by: §1, §2, §5.
  • E. Belogay, C. Cabrelli, U. Molter, and R. Shonkwiler (1997) Calculating the hausdorff distance between curves. Information Processing Letters 64 (1). Cited by: §4.2.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In ICML, Cited by: §5.
  • R. N. Brandon (2014) Adaptation and environment. Vol. 1040, Princeton University Press. Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: Appendix C, §4.1.
  • Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2018) Exploration by random network distillation. arXiv preprint arXiv. Cited by: §5.
  • V. Campos, P. Sprechmann, S. S. Hansen, A. Barreto, S. Kapturowski, A. Vitvitskyi, A. P. Badia, and C. Blundell (2021) Beyond fine-tuning: transferring behavior in reinforcement learning. In ICML 2021 Workshop on Unsupervised Reinforcement Learning, Cited by: §1.
  • V. Campos, A. Trott, C. Xiong, R. Socher, X. Giro-i-Nieto, and J. Torres (2020) Explore, discover and learn: unsupervised discovery of state-covering skills. In International Conference on Machine Learning, pp. 1317–1327. Cited by: §1, §2, §5.
  • G. Cauwenberghs and T. Poggio (2001) Incremental and decremental support vector machine learning. Advances in neural information processing systems, pp. 409–415. Cited by: §5.
  • Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox (2019) Closing the sim-to-real loop: adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979. Cited by: §5.
  • Z. Chen and B. Liu (2018) Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12 (3), pp. 1–207. Cited by: §2.
  • W. C. Cheung, D. Simchi-Levi, and R. Zhu (2020) Reinforcement learning for non-stationary markov decision processes: the blessing of (more) optimism. In International Conference on Machine Learning, pp. 1843–1854. Cited by: §1, §3, §5.
  • D. Corbetta and E. Thelen (1996) The developmental origins of bimanual coordination: a dynamic perspective.. Journal of Experimental Psychology: Human Perception and Performance 22 (2), pp. 502. Cited by: §1.
  • B. C. Da Silva, E. W. Basso, A. L. Bazzan, and P. M. Engel (2006) Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on Machine learning, pp. 217–224. Cited by: §1, §3, §5.
  • G. Ditzler, M. Roveri, C. Alippi, and R. Polikar (2015) Learning in nonstationary environments: a survey. IEEE Computational Intelligence Magazine 10 (4), pp. 12–25. Cited by: §5.
  • A. D. Dragan, K. Muelling, J. A. Bagnell, and S. S. Srinivasa (2015) Movement primitives via optimization. In ICRA, Cited by: §5.
  • S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford (2019) Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674. Cited by: §3.2.
  • B. Eysenbach, X. Geng, S. Levine, and R. Salakhutdinov (2020) Rewriting history with inverse rl: hindsight inference for policy improvement. arXiv preprint arXiv:2002.11089. Cited by: §4.5.
  • B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2018) Diversity is all you need: learning skills without a reward function. CoRR abs/1802.06070. External Links: Link, 1802.06070 Cited by: §B.1, §D.5, §G.3, §1, §2, §2, §2, §2, §4.1, §4.1, §5.
  • G. Fei, S. Wang, and B. Liu (2016) Learning cumulatively to become more knowledgeable. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1565–1574. Cited by: §2.
  • C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra (2017) Pathnet: evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734. Cited by: §2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §5.
  • S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. Cited by: Appendix A.
  • C. Giraud-Carrier (2000) A note on the utility of incremental learning. Ai Communications 13 (4), pp. 215–223. Cited by: §5.
  • K. Gregor, D. J. Rezende, and D. Wierstra (2016) Variational intrinsic control. arXiv preprint arXiv:1611.07507. Cited by: §1, §5.
  • A. Gupta, A. Murali, D. P. Gandhi, and L. Pinto (2018) Robot learning in homes: improving generalization and reducing dataset bias. Advances in Neural Information Processing Systems 31, pp. 9094–9104. Cited by: §1, §3.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361. Cited by: §2.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: Appendix A, Appendix A, §1, §2, §3.2.
  • D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. Cited by: §5.
  • D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §5.
  • N. Hansen, R. Jangir, Y. Sun, G. Alenyà, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang (2020) Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309. Cited by: §5.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • E. Hazan, S. M. Kakade, K. Singh, and A. V. Soest (2019) Provably efficient maximum entropy exploration. External Links: 1812.02690 Cited by: §5.
  • R. Julian, B. Swanson, G. S. Sukhatme, S. Levine, C. Finn, and K. Hausman (2020) Never stop learning: the effectiveness of fine-tuning in robotic reinforcement learning. arXiv e-prints, pp. arXiv–2004. Cited by: §1, §3.
  • S. M. Kakade (2002) A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538. Cited by: Appendix A.
  • J. Kober and J. Peters (2009) Learning motor primitives for robotics. In ICRA, Cited by: §1, §5.
  • G. Konidaris and A. Barto (2009) Skill chaining: skill discovery in continuous domains. In the Multidisciplinary Symposium on Reinforcement Learning, Montreal, Canada, Cited by: §5.
  • G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto (2012) Robot learning from demonstration by constructing skill trees. IJRR. Cited by: §5.
  • D. E. Koulouriotis and A. Xanthopoulos (2008) Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems. Applied Mathematics and Computation 196 (2), pp. 913–922. Cited by: §1, §3, §5.
  • T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057. Cited by: §1.
  • M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197. Cited by: §5.
  • M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel (2022) CIC: contrastive intrinsic control for unsupervised skill discovery. arXiv preprint arXiv:2202.00161. Cited by: §5.
  • M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel (2021) URLB: unsupervised reinforcement learning benchmark. arXiv preprint arXiv:2110.15191. Cited by: §5.
  • A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv e-prints. Cited by: §5.
  • K. Lee, H. Lee, K. Lee, and J. Shin (2017) Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325. Cited by: §5.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. arXiv preprint arXiv:1807.03888. Cited by: §5.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. JMLR. Cited by: §1.
  • A. C. Li, L. Pinto, and P. Abbeel (2020) Generalized hindsight for reinforcement learning. arXiv preprint arXiv:2002.11708. Cited by: §4.5.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §2.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: Appendix A, Appendix A.
  • H. Liu and P. Abbeel (2021) Unsupervised active pre-training for reinforcement learning. openreview. External Links: Link Cited by: §3.2, §5.
  • M. J. Mendelson, M. M. Haith, and J. J. Gibson (1976) The relation between audition and vision in the human newborn. Monographs of the Society for Research in Child Development, pp. 1–72. Cited by: §5.
  • A. Murali, L. Pinto, D. Gandhi, and A. Gupta (2018) CASSL: curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6453–6460. Cited by: §5.
  • M. Mutti, L. Pratissoli, and M. Restelli (2020) A policy gradient method for task-agnostic exploration. External Links: 2007.04640 Cited by: §5.
  • A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2018a) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347. Cited by: §5.
  • A. Nagabandi, C. Finn, and S. Levine (2018b) Deep online learning via meta-learning: continual adaptation for model-based rl. arXiv preprint arXiv:1812.07671. Cited by: §5.
  • A. Nareyek (2003) Choosing search heuristics by non-stationary reinforcement learning. In Metaheuristics: Computer decision-making, pp. 523–544. Cited by: §1, §3, §5.
  • G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos (2017) Count-based exploration with neural density models. arXiv preprint arXiv. Cited by: §5.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. arXiv preprint arXiv. Cited by: §5.
  • J. Peng and R. J. Williams (1994) Incremental multi-step q-learning. In Machine Learning Proceedings 1994, pp. 226–232. Cited by: §5.
  • X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 3803–3810. Cited by: §5.
  • J. Peters, J. Kober, K. Mülling, O. Krämer, and G. Neumann (2013) Towards robot skill learning: from simple skills to table tennis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 627–631. Cited by: §1, §5.
  • A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann (2019) Stable baselines3. GitHub. Note: https://github.com/DLR-RM/stable-baselines3 Cited by: Appendix C.
  • R. Raileanu, M. Goldstein, A. Szlam, and R. Fergus (2020) Fast adaptation via policy-dynamics value functions. arXiv preprint arXiv:2007.02879. Cited by: §1.
  • K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331–5340. Cited by: §5.
  • S. Ruping (2001) Incremental learning with support vector machines. In Proceedings 2001 IEEE International Conference on Data Mining, pp. 641–642. Cited by: §5.
  • P. Ruvolo and E. Eaton (2013) ELLA: an efficient lifelong learning algorithm. In International Conference on Machine Learning, pp. 507–515. Cited by: §2.
  • S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert (2005) Learning movement primitives. In RR, Cited by: §1, §5.
  • J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz (2015) Trust region policy optimization.. In ICML, pp. 1889–1897. Cited by: Appendix A.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
  • M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman (2020) Data-efficient reinforcement learning with momentum predictive representations. arXiv preprint arXiv:2007.05929. Cited by: §5.
  • T. Shankar and A. Gupta (2020) Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pp. 8624–8633. Cited by: §5.
  • T. Shankar, S. Tulsiani, L. Pinto, and A. Gupta (2020) Discovering motor programs by recomposing demonstrations. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • A. Sharma, M. Ahn, S. Levine, V. Kumar, K. Hausman, and S. Gu (2020) Emergent real-world robotic skills via unsupervised off-policy reinforcement learning. arXiv preprint arXiv:2004.12974.
  • A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman (2019a) Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657.
  • A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman (2019b) Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657.
  • H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk (2003) Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences 23 (3-4), pp. 301–321.
  • L. B. Smith, A. L. Quittner, M. J. Osberger, and R. Miyamoto (1998) Audition and visual attention: the developmental trajectory in deaf and hearing populations. Developmental Psychology 34 (5), pp. 840.
  • L. Smith and M. Gasser (2005) The development of embodied cognition: six lessons from babies. Artificial Life 11 (1-2), pp. 13–29.
  • E. S. Spelke (1979) Perceiving bimodally specified events in infancy. Developmental Psychology 15 (6), pp. 626.
  • A. Srinivas, M. Laskin, and P. Abbeel (2020) CURL: contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136.
  • M. Stolle and D. Precup (2002) Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212–223.
  • A. Stooke, K. Lee, P. Abbeel, and M. Laskin (2020) Decoupling representation learning from reinforcement learning. arXiv preprint.
  • M. Sugiyama and M. Kawanabe (2012) Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement Learning: An Introduction. MIT Press.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
  • N. A. Syed, S. Huan, L. Kah, and K. Sung (1999) Incremental learning with support vector machines.
  • J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
  • F. Tanaka and M. Yamamura (1997) An approach to lifelong reinforcement learning through multiple environments. In 6th European Workshop on Learning Robots, pp. 93–99.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In IROS.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
  • Z. Wang, C. Chen, H. Li, D. Dong, and T. Tarn (2019a) Incremental reinforcement learning with prioritized sweeping for dynamic environments. IEEE/ASME Transactions on Mechatronics 24 (2), pp. 621–632.
  • Z. Wang, H. Li, and C. Chen (2019b) Incremental reinforcement learning in continuous spaces via policy relaxation and importance weighting. IEEE Transactions on Neural Networks and Learning Systems 31 (6), pp. 1870–1883.
  • M. Wertheimer (1961) Psychomotor coordination of auditory and visual space at birth. Science 134 (3491), pp. 1692–1692.
  • W. Whitney, R. Agarwal, K. Cho, and A. Gupta (2019) Dynamics-aware embeddings. arXiv preprint arXiv:1908.09357.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
  • A. Wilson, A. Fern, S. Ray, and P. Tadepalli (2007) Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 1015–1022.
  • W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto (2020) Learning predictive representations for deformable objects using contrastive estimation. arXiv preprint arXiv:2003.05436.
  • D. Yarats, R. Fergus, A. Lazaric, and L. Pinto (2021) Reinforcement learning with prototypical representations. arXiv preprint arXiv:2102.11271.
  • D. Yarats and I. Kostrikov (2020) Soft Actor-Critic (SAC) implementation in PyTorch. GitHub. https://github.com/denisyarats/pytorch_sac
  • D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus (2019) Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741.
  • W. Yu, C. K. Liu, and G. Turk (2017) Preparing for the unknown: learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453.
  • T. Zahavy, B. O’Donoghue, A. Barreto, V. Mnih, S. Flennerhag, and S. Singh (2021) Discovering diverse nearly optimal policies with successor features. arXiv preprint arXiv:2106.00669.
  • W. Zhou, L. Pinto, and A. Gupta (2019) Environment probing interaction policies. arXiv preprint arXiv:1907.11740.
  • B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning.

Appendix

Appendix A Reinforcement Learning

In our continuous-control RL setting, an agent receives a state observation $s_t$ from the environment and applies an action $a_t$ according to a policy $\pi$. In our setting, where the policy is stochastic, the policy returns a distribution $\pi(\cdot \mid s_t)$, and we sample a concrete action $a_t \sim \pi(\cdot \mid s_t)$. The environment returns a reward $r_t$ for every action $a_t$. The goal of the agent is to maximize the expected cumulative discounted reward $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r_t\big]$ for discount factor $\gamma$ and horizon length $T$.

On-policy RL (Schulman et al., 2015; Kakade, 2002; Williams, 1992) optimizes the policy by iterating between data collection and policy updates. It hence requires new on-policy data every iteration, which is expensive to obtain. On the other hand, off-policy reinforcement learning retains past experiences in a replay buffer and is able to re-use past samples. Thus, in practice, off-policy algorithms have been found to achieve better sample efficiency (Lillicrap et al., 2015; Haarnoja et al., 2018). For our experiments we use SAC (Haarnoja et al., 2018) as our base RL optimizer due to its implicit maximization of the action distribution entropy, its sample efficiency, and to keep comparisons fair with baselines that also build on top of SAC. However, we note that our framework is compatible with any standard off-policy RL algorithm that maximizes the entropy of the action distribution either implicitly or explicitly.

Soft Actor-Critic

The Soft Actor-Critic (SAC) (Haarnoja et al., 2018) is an off-policy model-free RL algorithm that instantiates an actor-critic framework by learning a state-action value function $Q_\theta(s, a)$, a stochastic policy $\pi_\theta(a \mid s)$, and a temperature $\alpha$ over a discounted infinite-horizon MDP by optimizing a $\gamma$-discounted maximum-entropy objective (Ziebart et al., 2008). With a slight abuse of notation, we denote both the actor and critic learnable parameters by $\theta$. SAC parametrizes the actor policy via a $\tanh$-Gaussian defined as $a = \tanh(\mu_\theta(s) + \sigma_\theta(s)\,\epsilon)$, where $\epsilon \sim \mathcal{N}(0, 1)$, and $\mu_\theta(s)$ and $\sigma_\theta(s)$ are the parametric mean and standard deviation. The SAC critic $Q_\theta(s, a)$ is parametrized as an MLP neural network.

The policy evaluation step learns the critic network by optimizing the one-step soft Bellman residual:

$$\mathcal{L}_Q(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\Big[\big(Q_\theta(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1})\big)^2\Big], \qquad \bar{V}(s_{t+1}) = \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s_{t+1})}\big[Q_{\bar\theta}(s_{t+1}, a') - \alpha \log \pi_\theta(a' \mid s_{t+1})\big],$$

where $\mathcal{D}$ is a replay buffer of transitions and $\bar\theta$ is an exponential moving average of $\theta$, as done in (Lillicrap et al., 2015). SAC uses clipped double-Q learning (Van Hasselt et al., 2016; Fujimoto et al., 2018), which we omit from our notation for simplicity but employ in practice.

The policy improvement step then fits the actor network by optimizing the following objective:

$$\mathcal{L}_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\alpha \log \pi_\theta(a_t \mid s_t) - Q_\theta(s_t, a_t)\big].$$
Finally, the temperature $\alpha$ is learned with the loss:

$$\mathcal{L}_\alpha = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\theta(\cdot \mid s_t)}\big[-\alpha \log \pi_\theta(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\big],$$

where $\bar{\mathcal{H}}$ is the target entropy hyper-parameter that the policy tries to match, which in practice is set to the negative of the action dimension, $\bar{\mathcal{H}} = -|\mathcal{A}|$. The overall optimization objective of SAC equals:

$$\mathcal{L}_{\text{SAC}}(\theta, \alpha) = \mathcal{L}_Q(\theta) + \mathcal{L}_\pi(\theta) + \mathcal{L}_\alpha.$$
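For concreteness, a minimal PyTorch-style sketch of these three losses is given below. The interfaces are illustrative assumptions (an actor that returns a torch.distributions object, and a critic/critic_target pair that each return two Q-estimates for clipped double-Q), not the exact implementation used in this work.

import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critic, critic_target, log_alpha, target_entropy, gamma=0.99):
    # Minimal sketch of the three SAC losses above; all names are placeholders.
    s, a, r, s_next, not_done = batch
    alpha = log_alpha.exp()

    # Critic loss: one-step soft Bellman residual with a target network.
    with torch.no_grad():
        dist = actor(s_next)
        a_next = dist.rsample()
        log_p_next = dist.log_prob(a_next).sum(-1, keepdim=True)
        q1_t, q2_t = critic_target(s_next, a_next)
        v_next = torch.min(q1_t, q2_t) - alpha * log_p_next
        q_target = r + not_done * gamma * v_next
    q1, q2 = critic(s, a)
    critic_loss = F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)

    # Actor loss: maximize the entropy-regularized Q value.
    dist = actor(s)
    a_pi = dist.rsample()
    log_p = dist.log_prob(a_pi).sum(-1, keepdim=True)
    q1_pi, q2_pi = critic(s, a_pi)
    actor_loss = (alpha.detach() * log_p - torch.min(q1_pi, q2_pi)).mean()

    # Temperature loss: match the target entropy (e.g. -|A|).
    alpha_loss = (alpha * (-log_p - target_entropy).detach()).mean()

    return critic_loss, actor_loss, alpha_loss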

Appendix B Further Mathematical Details

B.1 Expansion and Derivation of Objectives

The objective function $\mathcal{F}(\theta)$, as defined in Equation 1, is the source from which we derive our incremental objective function, reproduced here:

$$\mathcal{F}(\theta) = I(s; z) + \mathcal{H}[a \mid s, z].$$

We can expand the first term in Equation 1 as

$$I(s; z) = \mathcal{H}[s] - \mathcal{H}[s \mid z]$$

by definition of mutual information. Now, once we assume $z$ is a discrete variable, the second part of this equation becomes

$$\mathcal{H}[s \mid z] = \mathbb{E}_{z_i \sim p(z)}\big[\mathcal{H}[s \mid z = z_i]\big].$$

And thus we have

$$I(s; z) = \mathbb{E}_{z_i \sim p(z)}\big[\mathcal{H}[s] - \mathcal{H}[s \mid z = z_i]\big].$$

But the term inside the expectation is the definition of information gain (not to be confused with KL divergence), defined by

$$IG(z_i) = \mathcal{H}[s] - \mathcal{H}[s \mid z = z_i].$$

Thus, we arrive at

$$I(s; z) = \mathbb{E}_{z_i \sim p(z)}\big[IG(z_i)\big].$$

Similarly, by definition of conditional entropy, we can expand the second part of Equation 1:

$$\mathcal{H}[a \mid s, z] = \mathbb{E}_{z_i \sim p(z)}\big[\mathcal{H}[a \mid s, z = z_i]\big].$$

Thus, we can convert Equation 1 into

$$\mathcal{F}(\theta) = \mathbb{E}_{z_i \sim p(z)}\big[IG(z_i) + \mathcal{H}[a \mid s, z = z_i]\big].$$

If we assume a uniform prior over our skills, which is another assumption made by Eysenbach et al. (2018), and also assume we are trying to learn $N$ skills in total, we can further expand Equation 1 into:

$$\mathcal{F}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\big[IG(z_i) + \mathcal{H}[a \mid s, z = z_i]\big].$$

Ignoring the number-of-skills term $\frac{1}{N}$ (which is constant over a single skill's learning period) gives us exactly Equation 3, which was:

$$\mathcal{F}(\theta) = \sum_{i=1}^{N}\big[IG(z_i) + \mathcal{H}[a \mid s, z = z_i]\big].$$

Now, under our framework, we assume that skills $z_1, \dots, z_{M-1}$ have been learned and fixed, and we are formulating an objective for the $M$-th skill. As a result, we can ignore the associated information gain and action distribution entropy terms from skills $z_1, \dots, z_{M-1}$, and simplify our objective to be:

$$\mathcal{F}_M(\theta) = IG(z_M) + \mathcal{H}[a \mid s, z = z_M],$$

which is exactly the same objective we defined in Equation 5.

B.2 Point-based Nearest Neighbor Entropy Estimation

In our work, we use an alternate approach, first shown by Singh et al. (2003), to estimate the entropy of a set of points. This method gives us a non-parametric Nearest Neighbor (NN) based entropy estimator:

$$\hat{\mathcal{H}}_{\text{NN}}(X) = \frac{1}{N}\sum_{i=1}^{N}\ln\!\left(\frac{N\,\pi^{d/2}\,\lVert x_i - x_i^{\text{NN}}\rVert_2^{\,d}}{\Gamma\!\left(\frac{d}{2}+1\right)}\right) + b_N,$$

where $\Gamma$ is the gamma function, $b_N$ is the bias correction term, $d$ is the dimension of the points, and $\lVert x_i - x_i^{\text{NN}}\rVert_2$ is the Euclidean distance between $x_i$ and its nearest neighbor $x_i^{\text{NN}}$ from the dataset $X$, defined as $x_i^{\text{NN}} = \arg\min_{x_j \in X,\, j \neq i} \lVert x_i - x_j\rVert_2$.

The term inside the sum can be simplified as

$$\ln\!\left(\frac{N\,\pi^{d/2}\,\lVert x_i - x_i^{\text{NN}}\rVert_2^{\,d}}{\Gamma\!\left(\frac{d}{2}+1\right)}\right) = d\,\ln\lVert x_i - x_i^{\text{NN}}\rVert_2 + \ln\!\left(\frac{N\,\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)}\right).$$

Here, the second term is a constant independent of $x_i$. If we ignore this constant term, the bias-correction term, and the constant multiplier $d$, we get

$$\hat{\mathcal{H}}_{\text{NN}}(X) \propto \frac{1}{N}\sum_{i=1}^{N}\ln\lVert x_i - x_i^{\text{NN}}\rVert_2,$$

which is the formulation we use in this work. This estimator is shown to be asymptotically unbiased and consistent in Singh et al. (2003).
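As an illustration of this simplified estimator, the NumPy sketch below computes the mean log-distance to the k-th nearest neighbor of a point set; the function name and the small smoothing constant are ours, not part of the method.

import numpy as np

def knn_entropy_proxy(points: np.ndarray, k: int = 3) -> float:
    # Simplified NN entropy proxy: mean log-distance to the k-th nearest neighbor,
    # dropping the constant and bias-correction terms.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # pairwise distances (N x N)
    np.fill_diagonal(dists, np.inf)               # exclude self-distance
    knn_dists = np.sort(dists, axis=1)[:, k - 1]  # distance to the k-th nearest neighbor
    return float(np.mean(np.log(knn_dists + 1e-8)))

# A more spread-out point set yields a larger entropy proxy.
rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(256, 2))
spread = rng.normal(scale=1.0, size=(256, 2))
assert knn_entropy_proxy(tight) < knn_entropy_proxy(spread)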

B.3 Hausdorff Distance

In our work, to compare two algorithms learning skills on the same environment, we used a metric based on the Hausdorff distance. The Hausdorff distance, also known as the Hausdorff metric or the Pompeiu–Hausdorff distance, measures the distance between two subsets of a metric space. Informally, two sets are close in the Hausdorff distance if every point of either set is close to some point of the other set. The Hausdorff distance is the longest distance an adversary can force you to travel by picking a point in one of the two sets, from which you then travel to the other set; put simply, it is the greatest of all the distances from a point in one set to the nearest point in the other.

Mathematically, given two subsets $X$ and $Y$ of a metric space $(M, d)$, we define the Hausdorff distance as:

$$d_H(X, Y) = \max\left\{\, \sup_{x \in X} d(x, Y),\ \sup_{y \in Y} d(y, X) \,\right\},$$

where $\sup$ represents the supremum, $d(a, B) = \inf_{b \in B} d(a, b)$ is the distance between a point $a$ and a subset $B$, and $\inf$ represents the infimum.

Given a set of skills, we calculate the diversity of one skill over all other skills by calculating the Hausdorff distance between that skill's trajectory end locations and the terminal locations of all other trajectories. Intuitively, a skill has a high Hausdorff distance if the end states it generates are far away from other skills' endpoints. Similarly, a high average Hausdorff distance for the skills from an algorithm means that the algorithm's generated skills are on average far from each other, which is a desirable property for an algorithm that needs to generate diverse skills.
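As a reference point, the symmetric Hausdorff distance between two endpoint sets can be computed with SciPy's directed_hausdorff; the endpoint arrays below are made-up toy data for illustration only.

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(X: np.ndarray, Y: np.ndarray) -> float:
    # Symmetric Hausdorff distance between two point sets of shape (N, d) and (M, d).
    return max(directed_hausdorff(X, Y)[0], directed_hausdorff(Y, X)[0])

# Toy example: endpoints of one skill vs. the pooled endpoints of all other skills.
skill_endpoints = np.array([[2.0, 0.0], [2.1, 0.2], [1.9, -0.1]])
other_endpoints = np.array([[0.0, 1.0], [0.1, 1.1], [-1.0, 0.0], [0.0, -1.2]])
print(hausdorff(skill_endpoints, other_endpoints))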

Appendix C Environments

Gym Experiments

We derive all environments used in the experiments in this paper from OpenAI Gym (Brockman et al., 2016) MuJoCo tasks. Namely, we use the HalfCheetah-v3 and Hopper-v3 environments for 2d locomotion tasks and Swimmer-v3 and Ant-v3 environments for the 3d locomotion tasks (see Figure 2 for the agent morphologies).

Since we aim to train primitives, we want policies that perform well regardless of the global state of the agent (global position, etc.) and depend only on local states (joint angles, etc.). Thus, we train our agents and each of our baselines with a fixed maximum episode length (a different one for Swimmer only), while we test them with a separate maximum episode length for the static and block environments and another for the broken leg environments.

As our projection function $\phi$, we measured the $x$-velocity of the agent in 2D environments, and the $(x, y)$-velocity of the agent in 3D environments. We made $\phi$ available to the intrinsic reward calculation functions of both our method and the baselines.

Block Experiments

For our block experiment set, we implemented the blocks as immovable spheres of radius at a distance from origin. We dynamically added blocks at the environment creation, and deleted them with the MuJoCo interface available in Gym. The blocks were all added before the agent took the first step in the environment, and removed over the agents’ lifetime as described in Section 4.3. The blocks were always removed counter-clockwise, following the trajectory of over , where is the current timestep and is the total timestep for training.

Broken Leg Experiments

For our broken leg experiment set, we implemented a broken leg as a leg where no actions have any effect. We switch which leg is broken every 1M steps, and train all skills for a total of 10M steps in both Off-DADS and DISk.

Figure 6: Skills learned by DISk, evaluated with each of the legs broken. The legs are numbered such that the final leg is numbered #4.
Figure 7: Skills learned by Off-DADS at the end of 10M steps, evaluated with each of the legs broken. The legs are numbered such that the final leg is numbered #4.
Figure 8: Skills learned by Off-DADS at the end of 9.3M steps, or 0.3M steps after breaking the final leg, evaluated with each of the legs broken. Compared to this agent, the agent in Fig. 7 performs worse on all broken legs but the last one, which shows that Off-DADS suffers from an instance of catastrophic forgetting.
Hierarchical Experiments

For the hierarchical environments, we use the Ant-v3 environment in a goal-conditioned manner. The goals are sampled uniformly from a fixed region, and the hierarchical agent can take a fixed number of steps to get as close to the goal as possible. At each step of the hierarchical agent, it chooses a skill, which is then executed for 10 timesteps in the underlying environment; so, in total, the agent has ten times as many timesteps in the underlying environment to reach the goal. At every timestep, the agent is given a dense reward of $-\lVert x_t - g \rVert_2$, where $x_t$ is the current location of the agent and $g$ is the location of the goal. On each step, the hierarchical agent receives the sum of this goal-conditioned reward over the 10 timesteps in the underlying environment.

All the hierarchical agents were trained with the stable-baselines3 package (Raffin et al., 2019). We used their default PPO agent for every downstream set of skills, and trained each hierarchical agent for a fixed total number of environment steps.
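The sketch below illustrates this hierarchical setup as a Gym wrapper; it is an illustrative reconstruction rather than our actual training code, and it assumes the agent's planar position occupies the first two observation dimensions and that the classic Gym step API is used.

import numpy as np
import gym

class SkillHierarchyEnv(gym.Env):
    # Each high-level action selects a frozen skill policy, rolls it out for
    # `skill_len` low-level steps, and returns the summed dense reward -||x_t - g||_2.

    def __init__(self, env, skills, goal, skill_len=10):
        self.env, self.skills, self.goal, self.skill_len = env, skills, goal, skill_len
        self.action_space = gym.spaces.Discrete(len(skills))
        self.observation_space = env.observation_space

    def reset(self):
        self._obs = self.env.reset()
        return self._obs

    def step(self, skill_idx):
        total_reward, done = 0.0, False
        for _ in range(self.skill_len):
            action = self.skills[skill_idx](self._obs)   # frozen low-level skill policy
            self._obs, _, done, _ = self.env.step(action)
            xy = self._obs[:2]  # assumes planar position is in the first two dims
            total_reward += -float(np.linalg.norm(xy - self.goal))
            if done:
                break
        return self._obs, total_reward, done, {}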

Appendix D Implementation Details and Hyperparameters

We base our implementation on the PyTorch implementation of SAC by Yarats and Kostrikov (2020). On top of the implementation, we add a reward buffer class that keeps track of the intrinsic reward. We provide the pseudocode for our algorithm in Appendix H.

D.1 Architecture

For all of our environments, we used MLPs with two hidden layers of width 256 as our actor and critic networks, with ReLU units as our nonlinearities. For our stochastic policies, the actor network generates a mean and variance value for each coordinate of the action, and to sample an action we sample each dimension from a Gaussian distribution defined by those mean and variance values. For stability, we clipped the log standard deviation to a fixed range, similar to Yarats and Kostrikov (2020).
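A minimal PyTorch sketch of such an actor head is shown below; the class name and the log-std bounds are illustrative placeholders rather than our exact configuration.

import torch
import torch.nn as nn

class DiagGaussianActor(nn.Module):
    # 2x256 MLP with ReLU that outputs a per-dimension mean and log-std;
    # the log-std is clipped to fixed (placeholder) bounds for stability.

    def __init__(self, obs_dim, act_dim, hidden=256, log_std_bounds=(-5.0, 2.0)):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim),
        )
        self.log_std_bounds = log_std_bounds

    def forward(self, obs):
        mu, log_std = self.trunk(obs).chunk(2, dim=-1)
        log_std = torch.clamp(log_std, *self.log_std_bounds)
        return torch.distributions.Normal(mu, log_std.exp())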

D.2 Reward Normalization

In learning all skills except the first, we set $\alpha$ and $\beta$ such that both the diversity reward and the consistency penalty have a similar magnitude. We do so by keeping track of the running average of the previous skills' diversity reward and consistency penalty, and using the inverse of each as $\alpha$ and $\beta$ respectively, which is largely successful at keeping the two terms at similar orders of magnitude.

Another trick that helps DISk learn is using a tanh-shaped annealing curve to scale up $\beta$, which starts near zero and ends at its full value. This annealing is designed to keep only the diversity reward term relevant at the beginning of training, and then slowly introduce the consistency constraint into the objective over the training period. This annealing encourages the policy to explore more in the beginning. Then, as the consistency term kicks in, the skill becomes more and more consistent while still being diverse. As a note, while DISk is more efficient with this annealing, its success does not depend on it.
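As an illustration, one possible tanh-shaped annealing curve for $\beta$ is sketched below; the exact shape and constants we used may differ, so this only conveys the idea of starting near zero and ramping up to the full value.

import math

def annealed_beta(step: int, total_steps: int, beta_final: float) -> float:
    # Tanh-shaped ramp from ~0 to beta_final over training; the constants 6.0 and 0.5
    # (ramp sharpness and midpoint) are illustrative assumptions.
    progress = step / max(total_steps, 1)
    return beta_final * 0.5 * (1.0 + math.tanh(6.0 * (progress - 0.5)))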

D.3 Full List of Hyper-Parameters

Parameter Setting
Replay buffer capacity (static env)
Replay buffer capacity (changing env)
Seed steps
Per-skill collected steps
Minibatch size
Discount ()
Optimizer Adam
Learning rate
Critic target update frequency
Critic target EMA momentum ()
Actor update frequency
Actor log stddev bounds
Encoder target update frequency
Encoder target EMA momentum ()
SAC entropy temperature
Number of samples in
Size of circular buffer Buf 50
in NN
Table 1: DISk's list of hyper-parameters.
Environment (Average) steps per skill Number of skills Total number of steps
HalfCheetah-v3
Hopper-v3
Swimmer-v3
Ant-v3
Ant-v3 (blocks)
Ant-v3 (broken)
Table 2: DISk's number of steps per skill.

D.4 Learning Schedule for Ant-v3 on DISk

We noticed that on the static Ant environment, not all learned skills have the same complexity. For example, the first two skills learned how to sit in place and how to flip over. On the other hand, later skills, which just learn how to move in a different direction, could reuse much of the replay buffers of the earlier, more complex skills that learned to move, and thus they did not need as many training steps as those earlier skills. As a result, the skills converged at different speeds. We therefore used a schedule that most closely fits this uneven convergence time: the environment steps were divided unevenly between the skills, with earlier skills receiving more steps than later ones. On all other static environments, we use a regular learning schedule with an equal number of steps per skill, since those environments are simple enough that skills converge at roughly equal speeds.

D.5 Baselines

DIAYN

We created our own PyTorch implementation based on Yarats and Kostrikov (2020), following the official implementation of the Eysenbach et al. (2018) method to the best of our ability. We used the same architecture as DISk for the actor and critic networks. For the discriminator, we used an MLP with two hidden layers of width 256 that was optimized with a cross-entropy loss. To make the comparison fair with our method, the discriminator was given access to the projected values $\phi(s)$ in all cases. Each of the DIAYN models was trained with the same number of skills and steps as DISk, as shown in Table 2.

While the performance of DIAYN in complex environments like Ant may seem lackluster, we found it to be in line with the results reported by Sharma et al. (2019b) (see their Fig. 6, top right). We believe a big part of the apparent gap comes from presenting all methods' trajectories on the same grid; we believe this presentation is fair since we want to showcase the abilities of Off-DADS and DISk.

Off-DADS

We used the official implementation of Sharma et al. (2020), with their standard set of parameters and architectures. We provide the same training step budget as DISk to each environment except where otherwise specified, and to evaluate, we sample the same number of skills as DISk from the underlying skill latent distribution. For the hierarchical experiments, we sample the skills and fix them first, and then we create a hierarchical agent with those fixed skills.

D.6 Compute Details

All of our experiments were run either on a local machine with an AMD Threadripper 3990X CPU and two NVIDIA RTX 3080 GPUs running Ubuntu 20.04, or on a cluster with Intel Xeon CPUs and NVIDIA RTX 8000 GPUs running an Ubuntu 18.04 virtual image. Each job was restricted to a maximum of one GPU, with often up to three jobs sharing a GPU.

Appendix E Consistency of Skills

Figure 9: Mean normalized variance over skills of different agents on four different environments (lower is better). On average, the initial position has high normalized variance since the agent has little control over it, and the subsequent steps have lower normalized variance as the skills lead to consistent states.

To ensure that the diversity of the agent seen in Section 4.2 is not coming from entirely random skills, we run an experiment where we track the average consistency of the skills and compare them to our baselines. To measure consistency, we measure the variance of the agent location normalized by the agent's distance from the origin for DISk, each of our baselines, and a set of random skills, as shown in Figure 9.

We see from this experiment that our method consistently achieves much lower normalized variance than random skills, showing that our skills are consistent. Note that, on the Hopper environment, random skills always crash down onto the ground, and thus achieve a very high consistency. While skills from the baselines sometimes achieve higher average consistency compared to DISk, we believe this might be caused by the higher overall diversity of the skills from DISk.
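As an illustration, the sketch below computes one plausible version of this normalized-variance metric from a skill's terminal positions; the exact normalization used in our evaluation code may differ.

import numpy as np

def mean_normalized_variance(final_positions: np.ndarray) -> float:
    # Variance of a skill's terminal positions, normalized by their mean squared
    # distance from the origin (one plausible form of the metric described above).
    variance = ((final_positions - final_positions.mean(axis=0)) ** 2).sum(axis=1).mean()
    mean_sq_dist = (np.linalg.norm(final_positions, axis=1) ** 2).mean()
    return float(variance / (mean_sq_dist + 1e-8))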

Appendix F Hierarchical Benchmark

Note that in the fast block environment Off-DADS is able to achieve nontrivial performance; this is because, as the circle of blocks opens up quickly, the agent learns to move slightly to the left, avoiding the last block to be removed and taking advantage of the first opening. In the same environment, DISk suffers the most because the environment itself changes too much while a single skill is training, but it is still able to learn a decent set of skills (see Fig. 2). As a result, their plots look similar.

Appendix G Ablation studies

In this section, we provide the details of the ablation studies discussed in Section 4.5.

Figure 10: Ablation analysis of DISk on parallel skill discovery and replay skill relabelling.

G.1 Reusing the Replay Buffer

Since we use an off-policy algorithm to learn our reinforcement learning policies and compute the rewards for the sampled transition tuples on the fly, it is possible for us to reuse the replay buffer from one policy to the next. In this ablation, we examine the efficacy of reusing the replay buffer from previous skills.

Intuitively, if reusing the replay buffer is effective, its impact should be most noticeable when there is a large amount of experience stored in the replay buffer. Thus, for this ablation, we first train six skills on an Ant agent while reusing the replay buffer, and then train four further skills with a fixed step budget each. In the first case, we reuse the replay buffers from the first six skills. In the second case, we do not reuse the replay buffer from the first skills and instead clear the replay buffer every time we initialize a new skill. Once the skills are trained, we measure their diversity with the mean Hausdorff distance metric.

As we see in Figure 10, the four skills learned while reusing the buffer are much more diverse than the skills learned without reusing the replay buffer. This result shows that the ability to reuse the replay buffer gives DISk a large advantage, and that not reusing the replay buffer would make this unsupervised learning method much worse in terms of sample complexity.

G.2 Parallel Training with Independent Skills

We trained our DISk skills incrementally, but also with independent neural network policies. We run this ablation study to ensure that our performance does not come from using independent policies alone.

To run this experiment, we initialize a set of independent policies on the Ant environment and train them in parallel using our incremental objective. We initialize a separate circular buffer for each policy, and for each policy we use the union of the other policies' circular buffers in place of that policy's collection of previous skills' states. We train the policies for a fixed total number of environment steps, split equally among them. To train the policies in parallel, we train them in round-robin fashion, with one environment step taken per policy per step of the training loop.

As we have seen in Section 4.5, this training method does not work – the policies never learn anything. We hypothesize that this failure happens because the circular dependencies in the learning objective do not give the policies a stable enough objective to learn something useful.

G.3 Baseline Methods with Independent Skills

To understand the effect of training DISk with independent skills, ideally we would run an ablation where DISk uses a shared-parameter network instead. However, doing so is nontrivial, since DISk is necessarily incremental and we would need a single network with an incrementally growing input or output space. Still, to understand the effect of using independent skills, we run an ablation where we modify some of our baseline methods to use one network per skill.

Out of our baselines, Sharma et al. (2019a) uses a continuous skill latent space, and thus it is not suitable for an independent network per skill. Moreover, its performance degrades if we use a discrete latent space, as shown in Sharma et al. (2019a) Fig. 5. So, we take our other baseline, Eysenbach et al. (2018), which uses discrete skills, and convert it to use one policy per skill.

Figure 11: Training DIAYN with one policy per skill improves its performance. On the (left) we show the comparison of mean Hausdorff distance of DIAYN with independent policies alongside our method and other baselines, and on the (right) we show trajectories from running skills from the Disjoint-DIAYN agent.

We see a better performance with the Disjoint-DIAYN agent compared to the DIAYN agent, which implies that some of the performance gain of DISk may indeed come from using an independent policy per skill. The reason why Disjoint-DIAYN performs much better than simple DIAYN may also stem from the fact that it is much harder for the skills in DIAYN to “coordinate” and exploit the diversity-based reward in a distributed setting where the skills communicate only through the reward function.

G.4 VALOR-like Incremental Schedule for Baselines

In VALOR (Achiam et al., 2018), to stabilize training options based on a variational objective, the authors present a curriculum training method where the latents are sampled from a small initial distribution, and this distribution is extended every time the information-theoretic objective saturates. More concretely, if $z$ is a variable signifying the option latent and $\tau$ is a trajectory generated by it, then every time their discriminator's performance reached a set threshold $\delta$, they expanded the set of option latents. If there were $K$ latents when the objective saturated, $K$ was updated using the following formula:

$$K \leftarrow \min\big(\lfloor 1.5\,K + 1 \rfloor,\ K_{\max}\big), \tag{8}$$

where $K_{\max}$ is a hyperparameter that signifies the maximum number of options that can be learned.
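In code, this curriculum update is a one-liner; the sketch below mirrors Eq. 8 (the function name is ours).

def expand_curriculum(k: int, k_max: int) -> int:
    # VALOR-style curriculum update (Eq. 8): grow the number of active latents
    # once the discriminator saturates, capped at k_max.
    return min(int(1.5 * k + 1), k_max)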

Figure 12: Training DIAYN for 4M steps with curriculum scheduling similar to VALOR (Achiam et al., 2018) results in better performance than vanilla DIAYN but suboptimal performance compared to Fig. 11. On the (left) we show the trajectories from learned skills of Disjoint-DIAYN with added curriculum training on a static Ant environment. On the (right), we show the same on the dynamic Ant-block environment, where the agent learned two skills that weren't available at the environment initialization; however, all other skills are degenerate.

Adapting this schedule for Off-DADS is difficult since it has a nominally infinite number of skills throughout training. However, we implemented this scheduling for DIAYN by sampling the skill uniformly from the currently active latents, starting with a small number of latents and updating with Eq. 8 whenever the threshold $\delta$ was reached, where $\delta$ is a hyperparameter. We ran these experiments on top of the Disjoint-DIAYN (Sec. G.3) baseline and searched over possible $\delta$ values, since we found this necessary for DIAYN to converge to any meaningful skills. The best-performing value of $\delta$ lines up with the findings in the VALOR paper.

As we can see in Fig. 12, in complicated static environments like Ant, adding curriculum scheduling to DIAYN performs better than vanilla DIAYN with shared weights, but may actually result in more degenerate skills than Disjoint-DIAYN. We hypothesize that this may be caused by the arbitrary nature of the threshold $\delta$, which is also mentioned by Achiam et al. (2018).

On dynamic environments like Ant-block, adding skills successively does allow DIAYN to learn one skill that was not available at the beginning, unlike vanilla DIAYN. However, we believe that, because of the downsides of curriculum scheduling, it also results in a lot of “dead” skills, which slow down the addition of new skills and leave few useful skills overall.

Appendix H DISk Pseudo Code

# Env: environment
# pi_1, ..., pi_M: Stochastic skill policy networks; pi_M is the current skill being trained
# phi: Projection function
# D: State dimension
# D_phi: Dimension after projection
# M: Current skill number
# P: Number of collected states from each past skill
# max_steps: Number of steps to train the policy for
# replay_buffer: Standard replay buffer for off-policy algorithms
# buf: Circular replay buffer
# alpha, beta: Hyperparameters for normalizing the diversity reward and consistency penalty respectively
# Training loop
# At the beginning of training a skill, collect S_prev using the learned policies pi_1, ..., pi_{M-1}
# S_prev: P sample states (projected) visited by each of the M-1 previous skills, (M-1) x P x D_phi
S_prev = []
for m in {1, 2, ..., M-1}:
    total_collection_step = 0
    collected_projected_states = []
    while total_collection_step < P:
        x = Env.reset()
        done = False
        while not done:
            total_collection_step += 1
            a = mode(pi_m(x))  # act with the mode of the frozen skill policy
            x_next, reward, done = Env.step(a)
            # Add projected state to the collected states
            collected_projected_states.append(phi(x_next))
            x = x_next
    S_prev.append(collected_projected_states)
total_steps = 0
while total_steps < max_steps:
    x = Env.reset()
    done = False
    while not done:
        total_steps += 1
        a ~ pi_M(x)  # sample an action from the current skill policy
        x_next, reward, done = Env.step(a)
        replay_buffer.add(x, a, x_next)
        x = x_next
        # Push projected state to the circular buffer
        buf.push(phi(x_next))
        # Update phase
        for i in range(update_steps):
            # sample a minibatch of B transitions without reward from the replay buffer
            # (x_b, a_b, x_next_b): states (BxD), actions (Bx|A|), next states (BxD)
            (x_b, a_b, x_next_b) = sample(replay_buffer)
            # compute entropy-based reward using the projected next states phi(x_next_b)
            r_b = compute_rewards(phi(x_next_b))
            # train exploration RL agent on an augmented minibatch of B transitions (x_b, a_b, r_b, x_next_b)
            update_pi_M(x_b, a_b, r_b, x_next_b)  # standard state-based SAC
# Entropy-based task-agnostic reward computation
# s: projected states (B x D_phi)
# b: diversity reward batch size
def compute_rewards(s, b=256):
    C = uniform_sample(S_prev, b)  # Sample diversity reward candidates from previous skills' states
    # Find the k-nearest neighbor for each sample in s (B x D_phi) over the candidate set C (b x D_phi)
    # Find pairwise L2 distances (B x b) between s and C
    dists = norm(s[:, None, :] - C[None, :, :], dim=-1, p=2)
    topk_dists, _ = topk(dists, k=3, dim=1, largest=False)  # compute top-k smallest distances (B x 3)
    # Diversity rewards (B x 1) are defined as L2 distances to the k-nearest neighbor from C
    diversity_reward = topk_dists[:, -1:]
    # Find the k-nearest neighbor for each sample in s (B x D_phi) over the circular buffer buf (|buf| x D_phi)
    # Find pairwise L2 distances (B x |buf|) between s and buf
    dists = norm(s[:, None, :] - buf[None, :, :], dim=-1, p=2)
    topk_dists, _ = topk(dists, k=3, dim=1, largest=False)  # compute top-k smallest distances (B x 3)
    # Consistency penalties (B x 1) are defined as L2 distances to the k-nearest neighbor from buf
    consistency_penalty = topk_dists[:, -1:]
    reward = alpha * diversity_reward - beta * consistency_penalty
    return reward
Algorithm 2: Pseudocode for the DISk training routine for the M-th skill.

Appendix I DISk Skills Learned In the Static Ant environment In Sequential Order

Figure 13: All 10 skills learned by DISk in a static Ant environment, in the order learned. Even though it seems like skills 1 and 2 did not learn anything, skill 1 learns to stand perfectly still in the environment while skill 2 learns to flip over and terminate the episode.