Disentangled Skill Embeddings for Reinforcement Learning

by   Janith C. Petangoda, et al.
Imperial College London

We propose a novel framework for multi-task reinforcement learning (MTRL). Using a variational inference formulation, we learn policies that generalize across both changing dynamics and goals. The resulting policies are parametrized by shared parameters that allow for transfer between different dynamics and goal conditions, and by task-specific latent-space embeddings that allow for specialization to particular tasks. We show how the latent spaces enable generalization to unseen dynamics and goal conditions. Additionally, policies equipped with such embeddings serve as a space of skills (or options) for hierarchical reinforcement learning. Since we can change task dynamics and goals independently, we name our framework Disentangled Skill Embeddings (DSE).




1 Introduction

In recent years, Reinforcement Learning (RL) [sutton1998reinforcement, ] techniques have been successfully applied to solve complex sequential decision-making problems under uncertainty [mnih2015human, ; hessel2017rainbow, ; haarnoja2017reinforcement, ; haarnoja2018soft, ]. However, agents trained on a single task typically exhibit poor performance when faced with a modified task with different dynamics or reward functions [taylor2009transfer, ]. A key remaining challenge for advancing the field towards general-purpose applications is to train agents that can generalize over tasks [plappert2018multi, ]. Generalization over tasks is also important in hierarchical reinforcement learning (HRL) settings [dayan1993feudal, ; sutton1999between, ; dietterich1998maxq, ], where data efficiency depends on reusing skills across different unseen situations. However, a long-standing problem in HRL is how to obtain a general skill set and how to properly reuse that set in different situations.

In this paper, we focus on the problem of learning policies that generalize well under changes in both the dynamics and reward functions. We do so by formulating a novel Multi-Task RL (MTRL) problem from a variational inference (VI) perspective. Our formulation relies on two latent skill embeddings which hold information about the dynamics of the system and the goal. The skill embeddings are disentangled in that one can independently specify a change in the dynamics or the goal of the system and still obtain a well-performing policy. We call this method Disentangled Skill Embeddings (DSE). Having trained such policies using DSE, we can tackle HRL problems by allowing the agent to move in the previously learned space of skills.

We contribute: 1) a novel MTRL formulation for learning disentangled skill embeddings using a VI objective; 2) two MTRL algorithms, DSE-REINFORCE and DSE-SAC (Soft Actor-Critic), that can learn multi-dynamics and multi-goal policies and generalize to new tasks after fast retraining; and 3) a demonstration that one can learn higher-level policies over latent skills in an HRL scenario.

2 Related work

Most approaches in MTRL  [taylor2009transfer, ; oh2017zero, ; henderson2017benchmark, ; nachum2018data, ; steindor, ; deisenroth2014multi, ] consider changes in dynamics or reward functions to result in different tasks. That is, two tasks with the same dynamics but different reward functions are considered to be different tasks. Here, we focus on exploiting known changes either in dynamics or goals, or both, for better generalization.

An approach that decouples dynamics and rewards is [devin2017learning, ]. In this work, modular neural networks that capture robot dynamics are combined with modules that capture the task goals. This allows robots to solve novel tasks by recombining task and robot modules. In [zhang2018decoupling, ], decoupling reward and dynamics is done in a model-based framework based on successor features.

Closest to our work is [hausman2018learning, ], where policies trained in an MTRL scenario are equipped with a latent variable embedding that describes a particular task. However, the latent variable embedding contains entangled information about both the transition and the reward function. This can impede generalization, as the latent space might not be able to represent a policy for a task with an unseen combination of reward and dynamics. In contrast, our approach disentangles the latent spaces to overcome this issue. Another similar approach is [gupta2018meta, ], which is a meta-learning algorithm that optimizes the latent space for the policies to enable structured exploration across multiple time steps.

We derive our algorithm using a variational inference formulation for RL. Several previous works have described RL as an inference problem [kappen2005path, ; todorov2008general, ; levine2013variational, ], including its relation to entropy regularization [haarnoja2017reinforcement, ; haarnoja2018soft, ; grau2016planning, ; peters2010relative, ; ziebart2008maximum, ]. This formalism has recently attracted attention [haarnoja2017reinforcement, ; levine2018reinforcement, ; abdolmaleki2018maximum, ] because it provides a powerful and intuitive way to describe more complex agent architectures using the tools from graphical models.

The multi-task algorithm Distral [teh2017distral, ] utilizes not only entropy regularization but also a relative-entropy penalty that encourages the policies to stay close to a shared compressed policy for transfer between tasks. The policies trained with this approach are not parameterized by any variables that identify the task at hand. This is an important limitation, since policies trained with Distral cannot generalize beyond their training tasks.

3 Background and Notation

We consider a set $\mathcal{M}$ of Markov decision processes (MDPs) $M_{d,g}$, each defined as the tuple $(\mathcal{S}, \mathcal{A}, \gamma, P_d, r_g)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\gamma$ is the discount factor, $P_d$ denotes the $d$'th state transition function that fully specifies the dynamics of the system, and $r_g$ denotes the reward function that quantifies the agent's performance and fully specifies the goal of the system.

In the multi-task problem, the agent must provide an optimal policy for each of the possible tasks. Obtaining all solutions independently maximizes performance on each individual task, but does not transfer information between the tasks. Solving all tasks with a single policy maximizes transfer since all parameters of the policy are shared but, importantly, the final solution only maximizes the average reward across tasks. Thus, a mixture of shared and task specific parameters is ideal.

In the following we derive a variational multi-task reinforcement learning formulation where the policy has some parameters that are shared across all tasks, and two latent embeddings—dynamics-specific and reward/goal-specific—that serve as task-specific parameters.

4 Disentangled Skill Embeddings

We learn flexible skills that are reusable across different dynamics and goals by learning two latent spaces, and . We achieve this by gathering data from the set of MDPs , indexed by the dynamics-condition and goal-condition for all and and then learn the conditional distributions and for each and . The latent variables are inputs to the policy serving as behaviour modulators. Importantly, once the latent spaces are fully learnt, one can directly use the policy equipped with the skill embeddings without knowledge of the task indices.
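To make the role of the two latent inputs concrete, here is a minimal sketch of a shared policy modulated by one dynamics latent and one goal latent. Everything in it is an illustrative assumption rather than the authors' architecture: the linear policy, the sizes, the embedding values, and the names `sample_latent`, `policy`, `dyn_embeddings` and `goal_embeddings`.

```python
import math
import random

def sample_latent(mean, log_std, rng):
    """Draw z from a diagonal Gaussian with the given mean and log std."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(state, z_dyn, z_goal, weights):
    """Shared linear policy acting on the concatenation [state, z_dyn, z_goal]."""
    x = list(state) + list(z_dyn) + list(z_goal)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)

rng = random.Random(0)
STATE_DIM, LATENT_DIM, N_ACTIONS = 4, 2, 2  # hypothetical sizes

# One (mean, log_std) pair per dynamics context and per goal context (made up).
dyn_embeddings = {d: ([0.5 * d, 0.0], [-1.0, -1.0]) for d in range(3)}
goal_embeddings = {g: ([0.0, 0.5 * g], [-1.0, -1.0]) for g in range(3)}

# Shared parameters: a single weight matrix reused by every task.
weights = [[rng.uniform(-1.0, 1.0) for _ in range(STATE_DIM + 2 * LATENT_DIM)]
           for _ in range(N_ACTIONS)]

state = [0.1, -0.2, 0.05, 0.0]
z_dyn = sample_latent(*dyn_embeddings[1], rng)    # dynamics context 1
z_goal = sample_latent(*goal_embeddings[2], rng)  # goal context 2
action_probs = policy(state, z_dyn, z_goal, weights)
```

Because the two latents enter the policy as separate inputs, either one can be swapped for a different context without touching the other, which is the sense in which the embeddings are disentangled.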

We now derive a variational inference (VI) formulation for multi-task RL that allows learning both latent spaces and the policy. As in [haarnoja2017reinforcement, ; kappen2005path, ; todorov2008general, ; levine2013variational, ; levine2018reinforcement, ; abdolmaleki2018maximum, ], we start by introducing a random variable that denotes whether the trajectory is optimal () or not (). Note that this includes the dynamics index and the goal index as well as and for all time-steps . The likelihood of an optimal trajectory is defined as . We denote the posterior trajectory probability assuming optimality as . Treating as a latent variable with prior probability , we specify the log-evidence as .

We now introduce a variational distribution on trajectories which combined with Jensen’s inequality provides the Evidence Lower Bound (ELBO) . In practice, we maximize the ELBO and use as an approximate posterior. The generative model is and the variational distribution is . We stress that the only difference between these are the conditional factors involving the latent variables and . The MTRL problem can now be stated as a maximization of the ELBO w.r.t. and :


where we added the scalars and to weight each information term (see Appendix A.1 for a mathematical justification) and we set the problem to have infinite horizon () with discount factor . The first two information terms measure how far the variational distributions and are from the specified priors that we assume fixed and equal for all conditions. Similarly, the last information term measures how far the variational policy is from the prior policy. Note that by setting we can eliminate this restriction. Furthermore, by setting to be an improper uniform prior we recover a formulation with entropy regularization in the policy.
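As a reading aid, one form of the objective that is consistent with the description above is sketched here. The symbol names are assumptions rather than the authors' notation: $z_d$ and $z_g$ for the two latent variables, $q(z_d \mid d)$ and $q(z_g \mid g)$ for the variational distributions, $p(z_d)$ and $p(z_g)$ for the fixed priors, $\pi_0$ for the prior policy, and $\alpha_d$, $\alpha_g$, $\alpha_\pi$ for the three weights. The exact placement of the weights in the original may differ.

```latex
\max_{\pi,\, q_d,\, q_g}\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\left(
  r_g(s_t, a_t)
  - \alpha_d \log\frac{q(z_d \mid d)}{p(z_d)}
  - \alpha_g \log\frac{q(z_g \mid g)}{p(z_g)}
  - \alpha_\pi \log\frac{\pi(a_t \mid s_t, z_d, z_g)}{\pi_0(a_t \mid s_t)}
\right)\right]
```

In this form the first two log-ratios become KL divergences to the priors once the expectation is taken, and the last one reduces to an entropy bonus when $\pi_0$ is an improper uniform prior.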

This trajectory-based formulation of the problem is sufficient to derive a novel REINFORCE-type algorithm equipped with DSE. However, our work also provides a derivation of a multi-task SAC algorithm with DSE (DSE-SAC) that requires a full specification of the recursive properties of value functions and the optimal solutions for the policy and embeddings. We describe those properties in the next section.

4.1 Recursions, Optimal Policies and Optimal Embeddings

Crucial to the construction of DSE-SAC is a recursive property that we can exploit for value bootstrapping. A task-indexed value function can be defined by taking the expectation in Equation (1) over all random variables except and . The Q-function is then defined as . We provide a lemma for the value recursion.

Lemma 1 (Index- and state-dependent Value Function Recursion).

The index-dependent Value function satisfies the following recursive property.


The proof of the previous and subsequent lemmas can be found in Appendix A.2 and A.4.

DSE-SAC also requires analytic solutions for the policy and embeddings. The optimal policy can be obtained by computing the functional derivative of a Lagrangian (see Appendix A.3) of the variational problem w.r.t. the policy and equating the result to zero.

Lemma 2 (Optimal policy with DSE).

Let the variational distributions and be fixed. Then, the optimal policy is

where is the normalizing function and with and are the Bayesian posteriors over and .

Note that the Q-values are computed by using both Bayesian posterior distributions over the task indices, i.e., and . Intuitively, in the extreme case where a and can completely specify the task at hand with certainty (i.e., the Bayesian posterior is peaked), the optimal policy selects the correct Q-function for this task; whereas for non-extreme cases a mixture is computed.
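The peaked versus non-peaked cases can be illustrated numerically. The sketch below assumes Gaussian embeddings, a uniform prior over indices, and a product-of-posteriors mixture over per-task Q-values; these choices, the helper names, and all numbers are assumptions made for illustration, not the paper's definitions.

```python
import math

def gaussian_logpdf(z, mean, std):
    """Log density of a diagonal Gaussian evaluated at z."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(z, mean, std)
    )

def index_posterior(z, embeddings):
    """Posterior over context indices, proportional to q(z | i) under a uniform prior."""
    logs = [gaussian_logpdf(z, mean, std) for mean, std in embeddings]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def mixed_q(q_values, post_dyn, post_goal):
    """Posterior-weighted mixture of per-task Q-values q_values[d][g]."""
    return sum(
        p_d * p_g * q_values[d][g]
        for d, p_d in enumerate(post_dyn)
        for g, p_g in enumerate(post_goal)
    )

# Well-separated dynamics embeddings: the posterior is sharply peaked.
dyn = [([-2.0], [0.1]), ([0.0], [0.1]), ([2.0], [0.1])]
# Broad, overlapping goal embeddings: the posterior stays spread out.
goal = [([-2.0], [5.0]), ([0.0], [5.0]), ([2.0], [5.0])]

post_dyn = index_posterior([2.0], dyn)
post_goal = index_posterior([0.0], goal)
q_values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # made-up numbers
q_mix = mixed_q(q_values, post_dyn, post_goal)
```

Here the dynamics posterior effectively selects the third row of `q_values`, while the goal posterior averages across that row.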

Employing the same procedure as before, we write the optimal variational distributions as follows:

Lemma 3 (Optimal Embeddings).

Assuming fixed , the optimal variational distributions are


where and are conceptually similar to Value functions but depend on , and , , respectively.

5 DSE Algorithms

This section describes two practical algorithms using disentangled embeddings. DSE-REINFORCE is updated on-policy and requires full trajectories from the different tasks. Although it is easier to implement, REINFORCE-type algorithms are known to suffer from high variance in the gradient estimates, which slows down training. The second algorithm, DSE-SAC, is inspired by the SAC algorithm [haarnoja2018soft, ] and is a more data-efficient off-policy algorithm that directly uses the transitions of all the tasks sampled from a replay memory. It achieves this by estimating the Value functions and Q-functions.

Common to both algorithms is the parametrization of the policy with parameters representing those of a neural network. Additionally, we consider pure entropy regularization by setting the prior policy to an improper uniform distribution, which we can then ignore. Similarly, we consider the embeddings and to have parameters (abusing notation) and , respectively. With this notation in place we proceed to describe DSE-REINFORCE.

5.1 DSE-REINFORCE

DSE-REINFORCE first samples trajectories from each combination of dynamics and goal contexts , where . Then, these are used to update the shared parameters and the parameters and of the variational distributions. For the updates of the variational distributions we use the reparametrization trick [kingma2013auto, ] and make the assumption that the variational parameters and contain a set of specific parameters and for each dynamics and goal context. We further assume that the latent variables are multivariate Gaussians with diagonal covariance matrix (this assumption can easily be relaxed). Therefore, the latent variables are expressed as and where and are the noise terms; and are the mean vectors; and and are the diagonal vectors of the covariance matrix.
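The reparametrization trick for a diagonal Gaussian can be sketched as follows. The function name `reparameterize` and the log-variance parametrization are assumptions chosen for illustration; the paper does not specify how the covariance is stored.

```python
import math
import random

def reparameterize(mean, log_var, rng):
    """Return z = mean + std * eps with eps ~ N(0, I), plus the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mean, log_var, eps)]
    return z, eps

rng = random.Random(0)
mean, log_var = [0.5, -1.0], [0.0, -2.0]  # hypothetical parameters for one context
z, eps = reparameterize(mean, log_var, rng)
```

Since the randomness lives entirely in `eps`, the sample `z` is a deterministic function of the mean and variance parameters, so gradients can flow through it to those parameters.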

The maximization in (1) can be written using Monte Carlo estimators as with

where we have used the following definition of the regularized discounted future returns

Moreover, we have separated the KL terms of the variational distributions out of the summation over , as they are independent of and can be computed in closed form due to the Gaussian assumption. Consequently, we added the corresponding sum of discounts by computing the geometric sum . Note that we implicitly redefined the trajectories so that they contain the noise realizations instead of the latent variables. Algorithmic details can be found in Appendix B.
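The three ingredients named above (regularized discounted returns, a closed-form Gaussian KL term, and the geometric sum of discounts) can be sketched as follows. The standard-normal prior, the entropy weight `alpha`, and all function names are assumptions for illustration; the paper's exact estimator is not reproduced here.

```python
import math

def regularized_returns(rewards, log_probs, gamma, alpha):
    """Discounted sum of r_t - alpha * log pi(a_t | s_t, z) from each step onward."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] - alpha * log_probs[t] + gamma * running
        returns[t] = running
    return returns

def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))

def discount_sum(gamma, horizon=None):
    """Sum of gamma^t for t = 0..horizon-1, or the infinite sum if horizon is None."""
    if horizon is None:
        return 1.0 / (1.0 - gamma)
    return (1.0 - gamma ** horizon) / (1.0 - gamma)

returns = regularized_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, alpha=0.1)
# The KL term does not depend on t, so it is scaled once by the sum of discounts.
kl_penalty = discount_sum(0.5, horizon=3) * kl_to_standard_normal([1.0], [1.0])
```

Pulling the KL term out of the per-step sum avoids recomputing the same quantity at every time step.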

Adaptive normalization using Pop-Art:

In our preliminary experiments we observed that DSE-REINFORCE was selectively solving some tasks but not others. For this reason we use the adaptive rescaling method Pop-Art [van2016learning, ; hessel2018multi, ] to normalize the discounted rewards to have zero mean and unit variance before each training iteration. Thus all tasks affect the gradient equally.
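The effect of the normalization can be illustrated with plain per-task standardization. This is a simplified stand-in and not the Pop-Art algorithm itself, which additionally tracks running statistics and preserves network outputs; the function name and the example numbers are assumptions.

```python
import math

def standardize_returns(returns, eps=1e-8):
    """Shift and scale a batch of returns to zero mean and (approximately) unit variance."""
    n = len(returns)
    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)
    return [(r - mean) / (std + eps) for r in returns]

# Two hypothetical tasks whose raw returns differ by a factor of ten.
task_a = standardize_returns([1.0, 2.0, 3.0])
task_b = standardize_returns([10.0, 20.0, 30.0])
```

After standardization both tasks produce returns on the same scale, so neither dominates the gradient purely because of its reward magnitude.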

5.2 DSE-SAC

DSE-SAC collects transitions from each dynamics and reward context and stores them in separate replay memories . Then, samples from the replay memories are used to estimate the Q-functions , the value functions , the variational distributions and and the policy . Subscripts denote the symbols for the parameters of the neural networks used as function approximators.

The Q-functions are learned by optimizing the loss , where the target is a one-sample estimate obtained with real experience and denotes the parameters of a target network which is updated at every training iteration as for some .
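A minimal sketch of the two pieces described here, assuming the usual SAC conventions: a one-sample bootstrapped target for the Q-function and an exponential-moving-average update of the target network. The exact expressions and the symbol `tau` are assumptions based on standard SAC, and the function names are hypothetical.

```python
def q_target(reward, gamma, target_value_next):
    """One-sample bootstrapped target: r + gamma * V_target(s')."""
    return reward + gamma * target_value_next

def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
y = q_target(reward=1.0, gamma=0.5, target_value_next=2.0)
```

A small `tau` keeps the target network slowly varying, which stabilizes the regression targets used for the Q-function.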

The value functions are learned by minimizing , where the target exploits the value recursion in Equation (4):

In order to reduce overestimation of Q-values, we follow  [haarnoja2018soft, ] where the minimum of two Q-function approximators is used i.e., , which have different sets of parameters and initialization but are trained using the same loss .
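The minimum over two Q-function approximators can be sketched as part of a soft value target. This sketch covers only the policy-entropy part of the target and omits any information terms on the embeddings that the value recursion may contain; `alpha` and the function name are assumptions.

```python
def soft_value_target(q1, q2, log_prob, alpha):
    """Pessimistic Q estimate (minimum of two critics) plus an entropy bonus."""
    return min(q1, q2) - alpha * log_prob

v_hat = soft_value_target(q1=1.0, q2=2.0, log_prob=-0.5, alpha=0.2)
```

Taking the minimum biases the estimate downwards, which counteracts the tendency of a single learned Q-function to overestimate.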

The parameters can be learned by minimizing the following expected KL-divergence between the parametric policy and the optimal policy from Lemma (5) (with an improper prior )

The policy loss is written as , where the Q-function is estimated with a single sample, i.e., given that and . The normalizing function can be safely ignored as in [haarnoja2018soft, ].

Following a similar rationale as before, the variational parameters for each context () can be learned by minimizing the following KL-divergences

which translates into


where, for clarity, we define with . All remaining algorithmic details are in the Appendix.

6 Experiments

Figure 1: Multi-task Cartpole. The colors correspond to different goal contexts. (a) shows the average evolution of the total rewards across 4 random seeds; shaded regions are the standard error, separated by the goals. For clarity, we averaged over dynamics (full plots in Appendix C.1). (b) and (c) show the latent spaces and respectively after the training. The distributions shown are standard deviation. (d) and (e) are results from a generalization experiment carried out on (see Sec 6.1 for description).

Here we empirically validate our algorithms and show the applicability of the trained policies equipped with DSE on both multi-task and hierarchical RL problems. DSE-REINFORCE is tested on a discrete action-space problem (Cartpole) and DSE-SAC on a continuous action-space problem (Reacher, from the Mujoco dynamics simulation software). On multi-task problems we show the benefit of disentanglement when compared to three baselines: a single-embedding algorithm similar to [hausman2018learning, ], Distral [teh2017distral, ] and independent learners. Hyperparameter values are shown in Appendix B.3.

6.1 DSE-REINFORCE on Cartpole

We extended the Cartpole environment provided in the OpenAI Gym library (https://gym.openai.com/envs/CartPole-v1/) by modifying the reward function to reflect the need to balance at different locations: left (), middle () and right (). Additionally, we allowed for three different dynamics conditions by changing the mass of the cart . Simulations were run for time steps.

Figure 1(a) shows that DSE-REINFORCE solves all nine tasks simultaneously at approximately the same rate, exceeding the performance of the baselines, Distral [teh2017distral, ] and independently trained (no multi-task; trained with REINFORCE) algorithms, and performing similarly to the single-embedding case. Importantly, we find that DSE-REINFORCE produces a policy that generalizes better than the baselines, as we show in the next section. In Figure 1(b) we observe the variational distributions learned for embeddings of the different dynamics contexts (in grey) and in (c) for the different reward contexts (in red, orange and blue). These have separated to represent the different tasks in the latent space. The variational distributions shown in dark red in (b) and green in (c) are the result of learning (with identical priors on and conditioned on the trained shared parameters) in a new unseen condition () successfully solving the task. This shows that the latent spaces are able to interpolate well. In Figure 1(d) we show the mean of the variational distributions for the goal contexts in color; and in grey, latent vectors that we used to test whether the learned policy is able to generalize to unseen goals. We show in panel (e) the x-location of the tip of the pole. As can be seen through the grey conditions, the policy is able to generalize to new locations in an ordered (along the x-axis) fashion.

Retraining and generalization of DSE-REINFORCE on Cartpole

In this section we test generalization when there are missing dynamics or goal conditions on a task matrix. We consider the case of training on off-diagonal tasks () and testing on the diagonal; and the case of training on only tasks () (See Figure 2). The testing phase is executed on each test-task by initializing the variational distributions with matching indices and retraining both the variational and shared parameters.
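The off-diagonal split of the task matrix can be sketched as follows, assuming a 3 x 3 grid of (dynamics, goal) index pairs. The function name is hypothetical, and the second configuration is not sketched because its exact task selection is not stated in the text.

```python
from itertools import product

def split_off_diagonal(n):
    """Train on tasks with dynamics index != goal index; hold out the diagonal."""
    grid = list(product(range(n), range(n)))
    train = [(d, g) for d, g in grid if d != g]
    test = [(d, g) for d, g in grid if d == g]
    return train, test

train_tasks, test_tasks = split_off_diagonal(3)
```

Every dynamics index and every goal index still appears in some training task, so each held-out task is a new combination of embeddings that were individually seen during training.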

Figure 2: Retraining experiments on 6-3 and 4-5 configurations for 2 algorithms. The configurations are shown in the bottom right panel; the setting marked by X were omitted during initial training ((a) and (e)). (a-h) show the average reward curves and their standard errors over 5 random seeds. The columns are organized according to: first column the multitask training and the rest of columns correspond to retraining of , and respectively. (a-d) were for the 6-3 configuration; the rest were for the 4-5 configuration. For brevity, the reward curves for 4-5 in (e-f) were also averaged across similar goals. The hyperparameters were the same as for the complete MTRL case in Section 6.1.

In Figure 2, the left-most panels show the multi-task training for both the (6-3) and (4-5) settings, and the remaining panels show the performance of the testing phase. We compare our DSE-REINFORCE (dse0) against the single-embedding algorithm from the previous section. We can clearly see the benefits of disentangling the dynamics from the reward; the DSE algorithm provides strong initializations for tasks never seen before that are not mere "interpolation" tasks as tested in the previous section. Note that independent single-task training would need about 10000 trajectories to train, whereas DSE sometimes solves the test-task instantaneously (without accounting for the multi-task training).

HRL on Cartpole
Figure 3: Evolution of the total reward for HRL problems. (a) is the 1-Asteroid AsteroidCartpole problem. (b) is the Hierarchical Reacher problem.

We test the validity of the trained policies equipped with DSE in an HRL scenario by training a high-level policy that acts on the reward latent space. For this, we developed a novel cartpole problem (AsteroidCartpole), where a balanced cartpole must avoid falling asteroids; this is detailed in the Appendix. We fixed the mass to ; the latent variable for was fixed to the mean of . The high-level policy acted on a discrete action space consisting of 5 selected points of the reward latent space; three were the means of the learned variational distributions for the goal-contexts and the remaining two were interpolations . Figure 3(a) shows the evolution of the episodic rewards while training the high-level policy (HRL) with standard REINFORCE equipped with a baseline and with Pop-Art. As a comparison, we also trained the same REINFORCE algorithm but acting directly on the low-level actions. As seen, the hierarchical policy outperforms the baseline and attains maximum reward.
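One way to build such a five-point discrete skill set is sketched below. It assumes the two interpolated points are midpoints between neighbouring goal means; the text does not state the interpolation weights, so this choice, the function name, and the example means are all hypothetical.

```python
def build_skill_set(goal_means):
    """Three learned goal means plus two midpoints between neighbouring means."""
    left, middle, right = goal_means

    def midpoint(a, b):
        return [(x + y) / 2.0 for x, y in zip(a, b)]

    return [left, midpoint(left, middle), middle, midpoint(middle, right), right]

# Made-up 2-D means for the left, middle and right goal contexts.
skills = build_skill_set([[-1.0, 0.0], [0.0, 0.5], [1.0, 0.0]])
```

The high-level policy then only has to choose an index into `skills`, and the selected latent vector is passed to the pre-trained low-level policy as its goal embedding.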

6.2 DSE-SAC on Reacher

Figure 4: Multi-task training on the Mujoco Reacher-v2 in the full configuration (). (a) shows the episodic rewards obtained by DSE-SAC, compared against a single-embedding algorithm and independent training. Shaded regions are standard deviations. For each dynamics case, the lengths of the 2 arm components (arm0, arm1) were set as follows: {(arm0, arm1): }. Note that we have averaged over the dynamics cases of each algorithm. Appendix C.4 details the original data.

The original Reacher environment consists of moving the tip of a robotic arm to a random location; its state space includes the position of the goal. We modified this environment by removing the goal position information and instead learning an embedding for it. This is a considerably more difficult task. Further, we modified it to vary the dynamics and reward functions by changing the arm lengths and goal position. We chose different goal locations and different arm lengths.

Figure 4 shows the results of our experiments with DSE-SAC on our multi-task Reacher problem. We compared these results with single-task independent learners and single-embedding SAC. As we see, both single-embedding and DSE-SAC have comparable performance and exceed the single-task learner. We also carried out experiments comparing DSE-SAC with DSE-REINFORCE (Appendix C.5) where DSE-SAC outperforms DSE-REINFORCE in Reacher by a large margin. Further, we carried out the “interpolation” experiments similar to the previous Cartpole experiments (Appendix C.5) showing generalization capabilities in Reacher.

Retraining and generalization of DSE-SAC on multi-task Reacher

Similar to the Cartpole scenario, we test generalization of DSE-SAC when training with missing tasks on the conditions () and (). Testing is performed on unseen test-tasks by initializing the variational distributions by matching index. Learning curves of the multi-task policy are shown in Figure 5. Table 3 of the Appendix shows that the initial performance on test-tasks is on average better for DSE-SAC than for the single-embedding SAC algorithm, and that DSE-SAC performs well in terms of the number of trajectories that a single-task learner needs to reach such performance. DSE-SAC obtained episodic reward, while the single embedding obtained episodic reward. It also takes the single-task policy number trajectories to reach the performance of DSE-SAC.

Figure 5: Multi-task training on Reacher-v2 under an incomplete grid of problems ( and ) for DSE-SAC compared against a single-embedding SAC condition with the same hyperparameters.
HRL on Reacher

We tested the policy trained with DSE-SAC in an HRL scenario. In this case, we continuously moved the goal location in a circle passing through locations that the multi-task policy has never seen. Using standard single-task Soft Actor-Critic, we trained a high-level policy (H-SAC) that acts on the latent space and uses the pre-trained multi-task policy as its low-level policy. This policy is compared against the baseline of SAC trained directly on low-level actions. Figure 3(b) shows the performances of H-SAC and SAC in purple and green, respectively. We see that H-SAC solves the task faster than standard SAC.

7 Conclusions

We have developed a multi-task framework from a variational inference perspective that is able to learn latent spaces that generalize to unseen tasks where the dynamics and reward can change independently. In particular, the disentangling allows for better generalization and faster retraining in new tasks. We have shown that the policies learned with our two algorithms, DSE-REINFORCE and DSE-SAC, can be used successfully in HRL scenarios.

A promising future direction for DSE-SAC could be to learn Q-functions and Value functions that do not depend on the task-index but directly depend on the latent variables. This would allow for the training of a single Q-function and Value function instead of one per goal and dynamics condition.


Appendix A Proofs

A.1 Information term weights justification

We can easily weigh each information term with by assuming


More concretely, this gives

where we have eliminated the constant terms since they do not affect the solution of the optimization problem. Therefore, to unclutter the notation, we override the definition of the variational distributions by and .

A.2 Proof of Lemma 1

Lemma 4 (Index- and state-dependent Value Function Recursion).

The index-dependent Value function satisfies the following recursive property.


We start by stating again the definition of the value function:

Then we take out the terms with inside the summation and write explicitly the expectation, i.e.,

We see now that the inner expectation term is in fact . Therefore, changing the sub-indices to , , , and we proved the lemma. ∎

A.3 Lagrangian for DSE-SAC Optimal Policy

Definition 1 (DSE Lagrangian).

Let be an arbitrary distribution over states. Then the Lagrangian is defined as

where , , are the Lagrange multipliers ensuring that the policy and variational distributions are properly normalized.

A.4 Proof of Lemma 2

Lemma 5 (Optimal policy with DSE).

Let the variational distributions and be fixed. Then, the optimal policy is

where is the normalizing function and with and are the Bayesian posteriors over the task indices.


We take the functional derivative of the Lagrangian with respect to where the star denotes a particular element resulting in

Next, equating the previous equation to zero and using the following equalities , and we obtain


Re-arranging the terms we have


Finally, using the fact that we can obtain the value of the Lagrange multiplier . Then, we obtain the desired policy


A.5 Proof of Lemma 3

Lemma 6 (Optimal Embeddings).

Assuming a fixed policy the optimal variational distributions are given by