Multitask Soft Option Learning

by Maximilian Igl, et al.

We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. This allows fine-tuning of options for new tasks without forgetting their learned policies, leading to faster training without reducing the expressiveness of the hierarchical policy. Additionally, MSOL avoids several instabilities during training in a multitask setting and provides a natural way to not only learn intra-option policies, but also their terminations. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines in challenging multi-task environments.


1 Introduction

A key challenge in RL is to scale current approaches to higher complexity tasks without requiring a prohibitive number of environmental interactions. However, for many tasks, it is possible to construct or learn efficient exploration priors that allow the agent to focus on more relevant parts of the state-action space, reducing the number of required interactions. These include, for example, reward shaping (Ng et al., 1999; Konidaris & Barto, 2006), curriculum learning (Bengio et al., 2009), some meta-learning algorithms (Wang et al., 2016; Duan et al., 2016; Gupta et al., 2018; Houthooft et al., 2018; Xu et al., 2018), and transfer learning (Caruana, 1997; Taylor & Stone, 2011; Bengio, 2012; Parisotto et al., 2015; Rusu et al., 2015; Teh et al., 2017).

One promising way to capture prior knowledge is to decompose policies into a hierarchy of sub-policies (or skills) that can be reused and combined in novel ways to solve new tasks (Dayan & Hinton, 1993; Thrun & Schwartz, 1995; Parr & Russell, 1998; Sutton et al., 1999; Barto & Mahadevan, 2003). The idea of hierarchical RL (HRL) is also supported by findings that humans appear to employ a hierarchical mental structure when solving tasks (Botvinick et al., 2009; Collins & Frank, 2016). In such a hierarchical RL policy, lower-level, temporally extended skills yield directed behavior over multiple time steps. This has two advantages: i) it allows efficient exploration, as the target states of skills can be reached without having to explore much of the state space in between, and ii) directed behavior also reduces the variance of the future reward, which accelerates convergence of estimates thereof.

On the other hand, while a hierarchical approach can therefore significantly speed up exploration and training, it can also severely limit the expressiveness of the final policy and lead to suboptimal performance when the temporally extended skills are not able to express the required policy for the task at hand (Mankowitz et al., 2014).

Many methods exist for constructing and/or learning skills for particular tasks (Dayan & Hinton, 1993; Sutton et al., 1999; McGovern & Barto, 2001; Menache et al., 2002; Şimşek & Barto, 2009; Gregor et al., 2016; Kulkarni et al., 2016; Bacon et al., 2017; Nachum et al., 2018a). Training on multiple tasks simultaneously is one promising approach to learn skills that are both relevant and generalise across tasks (Thrun & Schwartz, 1995; Pickett & Barto, 2002; Fox et al., 2016; Andreas et al., 2017; Frans et al., 2018). Ideally, the entire hierarchy can be trained end-to-end on the obtained return, obviating the need to specify proxy rewards for skills (Frans et al., 2018).

However, learning hierarchical policies end-to-end in a multitask setting poses two major challenges: i) because skills optimize environmental rewards directly, correctly updating them relies on already (nearly) converged master policies that use them similarly across all tasks, requiring complex training schedules (Frans et al., 2018), and ii) the end-to-end optimization is prone to local minima in which multiple skills have learned similar behavior. Figure 1(a) illustrates the latter: a number of tasks each aim to reach one of two targets, using two skills. An optimal solution would assign each target to one of the skills. However, assume that, due to random initialization or late discovery of the second target, both skills currently reach the first target. Changing one skill towards the second target, in order to solve some new tasks, decreases the performance on all tasks that currently use that skill to reach the first target, preventing the skill from changing. This configuration represents a local minimum in the overall loss, since the second target cannot be reached, but missing coordination between master policies prevents us from escaping it. To “free up” that skill and learn a new behavior, all higher-level policies need to refrain from using it to reach the first target and instead use the other, equally useful skill.

In this paper, we propose MSOL, a novel approach to learning hierarchical policies in a multi-task setting that extends Options (Sutton et al., 1999), a common definition for skills, and casts the concept into the Planning as Inference (PAI) framework (see, e.g., Levine, 2018, for a review). MSOL brings multiple advantages: i) it stabilizes end-to-end multitask training, removing the need for complex training schedules such as those in Frans et al. (2018); ii) it gives rise to coordination between master policies, avoiding local minima of the type described in Figure 1(a); iii) it allows fine-tuning of options, i.e., adapting them to new tasks at test time without the risk of unlearning previously acquired useful behavior, thereby avoiding suboptimal performance due to restricted expressiveness; and iv) the soft option framework gives rise to a natural solution to the challenging task of learning option-termination policies.

MSOL differentiates between a prior policy for each option, shared across all tasks, and a flexible task-specific posterior policy. The option prior can be fixed once it is fully trained, preventing unlearning of useful behavior even when the posteriors are updated. On new tasks, the option posteriors are initialized to the priors and regularized towards them, but are still adaptable to the specific task. This allows the same accelerated training as with ‘hard’ options, but can solve more tasks due to the adjustable posteriors. Furthermore, during option learning, we can train prior and posterior policies simultaneously (Teh et al., 2017), all without the need for complex training schedules (Frans et al., 2018): training is stabilized because only the priors, which are not used to generate rollouts, are shared across tasks.

Figure 1: Hierarchical learning of two concurrent tasks using two options to reach two relevant targets. (a) Local minimum when simply sharing options across tasks. (b) Escaping the local minimum by using prior and posterior policies (skill priors). (c) Learned options after training. Details are given in the main text.

Importantly, learning soft options also makes it possible to escape the aforementioned local minima of insufficiently diverse skills (Figure 1(a)). Figure 1(b) shows this: in MSOL, each of the two tasks has its own task-specific posterior for each soft option. The corresponding shared priors (one per option) are adjusted for all tasks concurrently, which effectively yields the average of the task-specific posteriors. Allowing the two task-specific posteriors of one option to deviate from the prior allows that option, in a first step, to solve both tasks, irrespective of the misaligned priors. Subsequently, as all tasks’ posteriors are regularized towards their prior, the higher-level policies are encouraged to choose the other option to reach the first target, as the deviating option induces larger regularization costs. After this implicit coordination, one option is always used to reach the first target, allowing the other option to specialize in the second target, leading to fully specialized options after training (Figure 1(c)).

Our experiments demonstrate that MSOL outperforms previous hierarchical and transfer learning algorithms during transfer tasks in a multitask setting. MSOL only modifies the regularized reward and loss function, but does not require any specialized architecture. In particular, it also does not require artificial restrictions on the expressiveness of either the higher-level or intra-option policies.

2 Preliminaries

An agent’s task is formalized as a Markov decision process (MDP), consisting of the state space, the action space, the initial state distribution, the transition probability of reaching a next state by executing an action in the current state, and the reward an agent receives for this transition.

2.1 Planning as inference

Planning as inference (PAI) (Todorov, 2008; Toussaint, 2009; Kappen et al., 2012) frames RL as a probabilistic-inference problem (Levine, 2018). The agent learns a distribution over actions given states, i.e., a parameterized policy, which induces a distribution over trajectories of fixed length:


This can be seen as a structured variational approximation of the optimal trajectory distribution. Note that the true initial state probability and transition probability are used in the variational posterior, as we can only control the policy, not the environment.
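As a reading aid, the factorization described above can be sketched as follows. The symbols are illustrative assumptions rather than the paper's exact notation: $q_\theta$ denotes the parameterized policy, $\rho$ the initial-state distribution, $P$ the transition probability, and $T$ the trajectory length.

```latex
q_\theta(\tau) \;=\; \rho(s_1) \prod_{t=1}^{T} q_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)
```

Only the policy factor carries learnable parameters; the environment terms $\rho$ and $P$ are the true dynamics, which matches the remark that the agent controls the policy but not the environment.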

An advantage of the Bayesian PAI formulation of RL is that we can incorporate information both from prior knowledge, in the form of a prior policy distribution, and the task at hand through a likelihood function that is defined in terms of the achieved reward. The prior policy can be specified by hand or, as in our case, learned (see Section 3). To incorporate the reward, we introduce a binary optimality variable (Levine, 2018), whose likelihood is highest along the optimal trajectory that maximizes return:


The constraint can be relaxed without changing the inference procedure (Levine, 2018). For brevity, we denote as .

For a given prior policy , the distribution of ‘optimal’ trajectories is


If the prior policy explores the state-action space sufficiently, then this is the distribution of desirable trajectories. PAI aims to find a policy such that the variational posterior in Equation 1 approximates this distribution by minimizing the KL divergence


2.2 Multi-task learning

In a multi-task setting, we have a set of different tasks, drawn from a task distribution. All tasks share the state space and action space, but each task has its own initial-state distribution, transition probability, and reward function. Our goal is to learn tasks concurrently, distilling common information that can be leveraged to learn faster on new tasks from the same task distribution.

In this setting, the prior policy can be learned jointly with the task-specific posterior policies (Teh et al., 2017). To do so, we simply extend Equation 4 to

where is the regularised reward, defined as

Minimizing the loss in Equation 5 is equivalent to maximizing the regularized reward. Moreover, minimizing the regularization term implicitly minimizes the expected KL-divergence between the posterior and prior policies. In practice (see Section 3.4) we will also make use of a discount factor. For details on how the discount factor arises in the PAI framework we refer to Levine (2018).
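To make the regularized reward concrete, here is a minimal sketch in Python. It assumes the common KL-regularized form (environment reward minus a weighted log-ratio between posterior and prior action probabilities, as in Teh et al., 2017); the function name, the weight `beta`, and the example numbers are hypothetical and not taken from the paper.

```python
import math


def regularized_reward(reward, log_q, log_p, beta):
    """Environment reward minus a penalty for deviating from the prior.

    Assumed form: r - beta * (log q(a|s) - log p(a|s)), where q is the
    task-specific posterior and p is the shared prior.
    """
    return reward - beta * (log_q - log_p)


# Posterior agrees with the prior on the sampled action: no penalty.
r_same = regularized_reward(1.0, math.log(0.5), math.log(0.5), beta=0.1)

# Posterior puts three times more mass on the action than the prior does:
# the reward is reduced by beta * log(3).
r_dev = regularized_reward(1.0, math.log(0.9), math.log(0.3), beta=0.1)
```

Averaged over actions sampled from the posterior, the log-ratio penalty becomes the KL-divergence between posterior and prior, which is why maximizing this reward also pulls the two distributions together.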

2.3 Options

Options (Sutton et al., 1999) are skills that generalize primitive actions and consist of three components: i) an intra-option policy selecting primitive actions according to the currently active option, ii) a probability of terminating the previously active option, and iii) an initiation set, which we simply assume to be the entire state space. Note that by construction, the higher-level (or master) policy can only select a new option if the previous option has terminated.

3 Method

Figure 2: Two hierarchical posterior policies (left and right, for Task 1 and Task 2) with common priors (middle). For each task, the policy conditions on the current state and the last selected option. It samples, in order, whether to terminate the last option, which option to execute next, and correspondingly what primitive action to execute in the environment.

We aim to learn a reusable set of options that allow for faster training on new tasks from a given distribution. Unlike prior work (Frans et al., 2018), we learn not just the intra-option policies, but also their termination policies, while preventing multiple options from learning the same behavior.

To differentiate ourselves from classical ‘hard’ options, which, once learned, do not change during new tasks, we call our novel approach soft-options. Each soft-option consists of an option prior, which is shared across all tasks, and a task-specific option posterior. The priors of both the intra-option policy and the termination policy capture how an option typically behaves and remain fixed once they are fully learned. At the beginning of training on a new task, they are used to initialize the task-specific posterior distribution. During training, the posterior is then regularized against the prior to prevent inadvertent unlearning. However, if maximizing the reward on certain tasks is not achievable with the prior policy, the posterior is free to deviate from it. We can thus speed up training using options, while remaining flexible enough to solve any task.

Additionally, this soft option framework also allows for learning good priors in a multitask setting while avoiding local minima in which several options learn the same behavior. See Figure 2 for an overview of the hierarchical prior-posterior architecture that we explain further below.

3.1 Hierarchical posterior policies

To express options in the PAI framework, we introduce two additional variables at each time step: option selections, representing the currently selected option, and binary decisions to terminate the option and allow the higher-level (master) policy to choose a new one. The agent’s behavior depends on the currently selected option, by drawing actions from the intra-option posterior policy. The selection itself is drawn from a master policy. Conditioned on the termination decision, which is drawn from the termination posterior policy, the master policy either continues with the previous option or draws a new option:


where is the Dirac-delta function and we set at the beginning of each episode. The joint posterior policy is


While the option variable can be continuous, we consider only a finite, discrete set of available options. The induced distribution over trajectories of each task is then
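The sampling order described above (terminate, then select an option, then act) can be sketched as a single step function. This is a simplified illustration, not the authors' implementation: the policies are passed in as plain callables returning probabilities, the master policy here conditions only on the state, and all names are hypothetical.

```python
import random


def sample_step(state, prev_option, termination, master, intra_option, rng):
    """Sample (terminate, option, action) for one time step."""
    # 1) Decide whether to terminate the previously active option.
    #    At the start of an episode there is no previous option,
    #    so a new option must be selected.
    terminate = prev_option is None or rng.random() < termination(state, prev_option)

    # 2) Keep the previous option unless it terminated;
    #    otherwise let the master policy pick a new one.
    if terminate:
        probs = master(state)
        option = rng.choices(range(len(probs)), weights=probs)[0]
    else:
        option = prev_option

    # 3) Draw a primitive action from the intra-option policy.
    action_probs = intra_option(state, option)
    action = rng.choices(range(len(action_probs)), weights=action_probs)[0]
    return terminate, option, action


# Toy deterministic policies, chosen only to make the behavior easy to follow.
def never_terminate(state, option):
    return 0.0


def always_option_one(state):
    return [0.0, 1.0]


def option_indexed_action(state, option):
    return [1.0, 0.0] if option == 0 else [0.0, 1.0]


rng = random.Random(0)
first = sample_step(0, None, never_terminate, always_option_one,
                    option_indexed_action, rng)
second = sample_step(0, first[1], never_terminate, always_option_one,
                     option_indexed_action, rng)
```

In the first step the missing previous option forces a selection; in the second step the option does not terminate, so it is carried over without consulting the master policy.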


3.2 Hierarchical prior policy

Our framework transfers knowledge between tasks by a shared prior over all joint policies (8):


By choosing the master, intra-option, and termination priors correctly, we can learn useful temporally extended options. The parameterized intra-option and termination priors are structurally equivalent to the corresponding posterior policies so that they can be used as initialization for the latter. Optimizing the regularized return (see next section) w.r.t. the prior parameters distills the common behavior into the prior policy and softly enforces similarity across posterior distributions of each option amongst all tasks.

The master prior selects the previous option if it has not terminated, and otherwise draws options uniformly to ensure exploration:


Because the posterior master policy is different on each task, there is no need to distill common behavior into a joint prior.

3.3 Objective

We extend the multitask objective in (5) by substituting and with those induced by our hierarchical prior and posterior policies (8) and (10). The resulting objective has the same form but with a new regularized reward that is maximized instead of Equation 6:

As we maximize the expected regularized reward, this corresponds to maximizing the reward minus KL-divergence penalties between the posterior and prior policies, in expectation along the on-policy trajectories drawn from the posterior. The first term of the regularization encourages exploration in the space of options. It can also be seen as a form of deliberation cost (Harb et al., 2017), as it is only nonzero whenever we terminate an option and the master policy needs to select another to execute.


The second term softly enforces similarity between option posteriors across tasks and updates the prior towards the ‘average’ posterior. It also encourages the master policy to pick only the most specialized option, i.e., the option whose posteriors across all tasks are most similar. This creates the necessary coordination between master policies that allows us to escape the local minimum described in Figure 1: if two options can both be used to reach a subgoal on one task, but the posterior of one of them also learns to reach a different subgoal on another task, the master policy on the first task starts to use only the other option (because it is more specialized), leaving the first option free to specialize on the different subgoal. Consequently, despite only softly regularizing the option posteriors towards their joint prior, this term leads to specialized options after training as long as the number of available options is sufficient.

Lastly, we can use the third term to also encourage temporal abstraction of options. To do so, during option learning, we fix the termination prior to a Bernoulli distribution. Choosing a large prior probability of continuing encourages prolonged execution of one option, but allows switching whenever necessary. This is also similar to deliberation costs (Harb et al., 2017) but with a more flexible cost model.
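The three regularization terms discussed above can be sketched together in a few lines. The decomposition (one log-ratio each for the master, intra-option, and termination policies) follows the description in the text, but the exact functional form, the weight `beta`, and the parameter name `continue_prob` are assumptions made for illustration.

```python
import math


def soft_option_reward(reward, beta, master_log_ratio,
                       action_log_ratio, termination_log_ratio):
    """Reward minus weighted log-ratios log(q/p) for the three policy levels.

    master_log_ratio is zero whenever the option was not terminated,
    because posterior and prior then both keep the previous option.
    """
    penalty = master_log_ratio + action_log_ratio + termination_log_ratio
    return reward - beta * penalty


def bernoulli_termination_log_prior(terminated, continue_prob):
    """Log-probability of a termination decision under a fixed Bernoulli prior."""
    return math.log(1.0 - continue_prob) if terminated else math.log(continue_prob)


def termination_penalty(q_terminate, continue_prob, beta):
    """Cost of one sampled termination event under the fixed prior."""
    log_q = math.log(q_terminate)
    log_p = bernoulli_termination_log_prior(True, continue_prob)
    return beta * (log_q - log_p)


# A prior that favors continuing more strongly makes each termination costlier.
low_cost = termination_penalty(q_terminate=0.5, continue_prob=0.8, beta=0.1)
high_cost = termination_penalty(q_terminate=0.5, continue_prob=0.95, beta=0.1)

# When all posteriors match their priors, the reward is left unchanged.
unchanged = soft_option_reward(1.0, 0.1, 0.0, 0.0, 0.0)
```

Because the cost depends on the posterior's own termination probability rather than being a fixed constant, the agent can still switch options where the task clearly requires it.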

Additionally, we can still distill a termination prior which can be used on future tasks. Instead of learning by minimizing the KL against the posterior termination policies, we can get more decisive terminations by minimizing


That is, the learned termination prior distills the probability that the tasks’ master policies would change the active option if they had the opportunity.

3.4 Optimization

Even though the regularized reward depends on the posterior parameters, its gradient w.r.t. those parameters vanishes. Consequently, we can treat the regularized reward as a classical RL reward and use any RL algorithm to find the optimal hierarchical policy parameters. In the following, we explain how to adapt A2C (Mnih et al., 2016) to soft options. The extension to PPO (Schulman et al., 2017) is straightforward. However, for PAI frameworks like ours, unlike in the original PPO implementation, the advantage function must be updated after each epoch.

The joint posterior policy in (8) depends on the current state and the previously selected option . The expected sum of regularized future rewards of task , the value function , must therefore also condition on this pair:



Because the value function cannot be directly observed, we approximate it with a parametrized model. The multi-step advantage estimation at a given time step of a trajectory is given by


where the superscript ‘’ indicates treating the term as a constant. The approximate value function can be optimized towards its bootstrapped -step target by minimizing . As per A2C, depending on the state (Mnih et al., 2016). The corresponding policy gradient loss is
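A minimal sketch of a bootstrapped multi-step (n-step) advantage estimate is shown below. It assumes the standard A2C form applied to the regularized rewards, with value estimates supplied as plain numbers; the function name and arguments are hypothetical, and in the soft-option setting the value estimates would additionally be conditioned on the previously selected option.

```python
def n_step_advantage(rewards, value_start, value_end, gamma):
    """Discounted reward sum plus bootstrapped value, minus the baseline.

    rewards:     regularized rewards r_t, ..., r_{t+k-1}
    value_start: value estimate at the first state (the baseline)
    value_end:   value estimate at the state reached after k steps,
                 treated as a constant (no gradient flows through it)
    """
    k = len(rewards)
    discounted = sum((gamma ** j) * r for j, r in enumerate(rewards))
    return discounted + (gamma ** k) * value_end - value_start


# Two steps of reward 1.0 with discount 0.5:
# 1.0 + 0.5 * 1.0 + 0.25 * 4.0 - 2.0 = 0.5
advantage = n_step_advantage([1.0, 1.0], value_start=2.0, value_end=4.0, gamma=0.5)
```

The value model is then regressed towards the bootstrapped target, while the policy gradient weights the log-probabilities of the sampled decisions by this advantage.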

The gradient w.r.t. the prior parameters (ignoring a factor that is folded into a coefficient later) is


where and . To encourage exploration in all policies of the hierarchy, we also include an entropy maximization loss:

Note that the first term of the regularized reward (Section 3.3) already encourages maximizing the entropy of the master policy, since we chose a uniform prior. As both terms serve the same purpose, we are free to drop either one of them. In our experiments, we chose to drop the master-policy term in the regularized reward, which proved slightly more stable to optimize than the alternative.

We can optimize all parameters jointly with a combined loss over all tasks , based on sampled trajectories and corresponding sampled values of :


3.5 Training schedule

For faster training, it is important to prevent the master policies from converging too quickly to allow sufficient updating of all options. On the other hand, a lower exploration rate leads to more clearly defined options. We consequently anneal the exploration bonus with a linear schedule during training.

Similarly, a high value of the regularization weight leads to better options but can prevent finding the extrinsic reward early on in training. Consequently, we increase this weight over the course of training, also using a linear schedule.
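Both schedules described in this section are linear interpolations between a start and an end value. The sketch below shows one way to write such a schedule; the start and end values are placeholders, since the actual hyper-parameters are given in the paper's appendix and not reproduced here.

```python
def linear_schedule(start, end, step, total_steps):
    """Linearly interpolate from start to end over total_steps, then hold."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)


# Hypothetical values: the exploration bonus decays, the weight grows.
entropy_bonus_initial = linear_schedule(0.1, 0.0, step=0, total_steps=100)
entropy_bonus_final = linear_schedule(0.1, 0.0, step=100, total_steps=100)
weight_midway = linear_schedule(0.0, 1.0, step=50, total_steps=100)
weight_after_end = linear_schedule(0.0, 1.0, step=250, total_steps=100)
```

Clamping the fraction keeps the value fixed at its end point if training runs longer than the scheduled number of steps.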

3.6 Relationship to classical options

Assume we are faced with a new task and are given some prior knowledge in the form of a set of skills that we can use. Using those skills and their termination probabilities as prior policies in the soft option framework, we can see the regularization weight as a temperature parameter determining how closely we are restricted to following them. In the limit of an infinitely large weight, we recover the classical ‘hard’ option case and our posterior option policies are restricted to the prior (in this limiting case, however, optimization using the regularized reward is not possible). With a weight of zero, the priors only initialize the otherwise unconstrained policy, which may quickly unlearn behavior that would be useful down the line. Lastly, for intermediate values, we use the prior information to guide exploration but are only softly restricted to the given skills and can also explore and use policies ‘close’ to them.
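The role of the temperature can be illustrated in a one-step setting, where the KL-regularized optimum has the well-known closed form of the prior reweighted by exponentiated values. The sketch below is a simplified bandit-style illustration, not the paper's sequential algorithm; `beta`, the prior, and the action values are made-up numbers.

```python
import math


def regularized_posterior(prior, values, beta):
    """Maximizer of E_q[value] - beta * KL(q || prior) over distributions q.

    The solution is proportional to prior * exp(value / beta).
    """
    logits = [math.log(p) + v / beta for p, v in zip(prior, values)]
    offset = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - offset) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


prior = [0.8, 0.2]   # the given skill prefers action 0
values = [0.0, 1.0]  # the new task rewards action 1

# Large temperature: the posterior stays close to the prior ('hard' limit).
strong = regularized_posterior(prior, values, beta=1000.0)

# Small temperature: the prior has almost no influence on the solution.
weak = regularized_posterior(prior, values, beta=0.01)

# Intermediate temperature: the posterior shifts towards the rewarded
# action while remaining influenced by the prior.
soft = regularized_posterior(prior, values, beta=1.0)
```

With the intermediate temperature the posterior assigns roughly 0.40 to the rewarded action, compared with 0.20 under the prior, which is the 'softly restricted' regime described above.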

4 Related Work

Most hierarchical approaches rely on proxy rewards to train the lower level components. Some of them aim to reach pre-specified subgoals (Sutton et al., 1999; Kulkarni et al., 2016), which are either manually set or found by analyzing the structure of the MDP (McGovern & Barto, 2001; Menache et al., 2002; Şimşek et al., 2005; Şimşek & Barto, 2009) or previously learned policies (Goel & Huber, 2003). Those methods typically require knowledge, or a sufficient approximation, of the transition model, both of which are often infeasible.

Recently, several authors have proposed unsupervised training objectives for learning diverse skills based on their distinctiveness (Gregor et al., 2016; Florensa et al., 2017; Achiam et al., 2018; Eysenbach et al., 2019). However, those approaches don’t learn termination functions and cannot guarantee that the required behavior on the downstream task is included in the set of learned skills. Hausman et al. (2018) also incorporate reward information, but do not learn termination policies and are therefore restricted to learning multiple solutions to the provided task instead of learning a decomposition of the task solutions which can be re-composed to solve new tasks. Their off-policy training algorithm, based on Retrace (Munos et al., 2016) and SVG (Heess et al., 2015), extends straightforwardly to our setting.

A third usage of proxy rewards is by training lower level policies to move towards goals defined by the higher levels. When those goals are set in the original state space (Nachum et al., 2018a), this approach has difficulty scaling to high dimensional state spaces like images. Setting the goals in a learned embedding space (Dayan & Hinton, 1993; Vezhnevets et al., 2017; Nachum et al., 2018b) can be difficult to train, though. Furthermore, in both cases, the intended temporal extension of learned skills needs to be set manually.

In this work, we do not employ any proxy reward functions for the lower level policy. Instead, we are training the entire hierarchy on the same regularized reward, which guarantees that the learned skills are useful on the task distribution at hand. This is similar to the Option-Critic framework (Bacon et al., 2017; Smith et al., 2018), however, we are using a multitask setting which allows us to learn options that generalize better to a pre-specified range of tasks (Thrun & Schwartz, 1995; Frans et al., 2018).

HiREPS (Daniel et al., 2012) also takes an inference-motivated approach to learning options. In particular, Daniel et al. (2016) propose a similarly structured hierarchical policy, albeit in a single-task setting. However, they do not utilize learned prior and posterior distributions, but instead use expectation maximization to iteratively infer a hierarchical policy to explain the current reward-weighted trajectory distribution.

Several previous works try to overcome the restrictive nature of options that can lead to sub-optimal solutions by allowing the higher-level actions to modulate the behavior of the lower-level policies (Schaul et al., 2015; Heess et al., 2016; Haarnoja et al., 2018). However, this significantly increases the required complexity of the higher-level policy and therefore the learning time.

The multitask- and transfer-learning setup used in this work is inspired by Thrun & Schwartz (1995) and Pickett & Barto (2002) who suggest extracting options by using commonalities between solutions to multiple tasks. Andreas et al. (2017) extend this to deep policies but require additional human supervision in the form of policy sketches. Closest to our work is MLSH (Frans et al., 2018) which, however, shares the lower-level policies across all tasks without distinguishing between prior and posterior. As discussed, this leads to local minima and insufficient diversity in the learned options. Similarly to us, Fox et al. (2016) differentiate between prior and posterior policies on multiple tasks and utilize a KL-divergence between them for training. However, they do not consider termination probabilities and instead only choose one option per task.

Our approach is closely related to distral (Teh et al., 2017), with which we share the multitask learning of prior and posterior policies. However, distral has no hierarchical structure and applies the same prior distribution over primitive actions, independent of the task. As a necessary hierarchical heuristic, the authors propose to also condition on the last primitive action taken. This works well when the last action is indicative of future behavior; however, in Section 5 we show several failure cases where a learned hierarchy is needed.

5 Experiments

Figure 3: Performance during testing of the learned options and exploration priors on (a) Moving Bandits, (b) Taxi, (c) Directional Taxi, and (d) Swimmer. Each line is the median over 5 random seeds (2 for MLSH) and shaded areas indicate standard deviations.

We conduct a series of experiments to show: i) when learning hierarchies in a multitask setting, MSOL successfully overcomes the local minimum of insufficient option diversity, as described in Figure 1; ii) MSOL can learn useful termination policies, allowing it to learn options that can be chained together to solve a set of tasks; iii) MSOL is equally applicable to discrete as well as continuous domains; and iv) using soft options yields fast transfer learning while still reaching optimal performance.

All architectural details and hyper-parameters can be found in the appendix. For all experiments, we first train the exploration priors and options on tasks from the available task distribution (plots for the training phase are shown in the appendix). Subsequently, we test how quickly we can learn new tasks drawn from the same distribution.

We compare the following algorithms: MSOL is our proposed method that utilizes soft options both during option learning and transfer. MSOL(frozen) uses the soft options framework during learning to find more diverse skills but does not allow fine-tuning the posterior sub-policies during transfer. The difference between the two algorithms shows the advantage of the adaptability of soft options for transfer to new tasks. distral (Teh et al., 2017) is a strong non-hierarchical transfer learning algorithm that also utilizes prior and posterior distributions. distral(+action) utilizes the last action as option-heuristic which works well in some tasks but fails when the last action is not sufficiently informative. Lastly, MLSH (Frans et al., 2018) is a multitask option learning algorithm like MSOL, but utilizes ‘hard’ options for both learning and transfer, i.e., sub-policies that are shared exactly across tasks. It also relies on fixed option durations and requires a complex training schedule between master and intra-option policies to stabilize training. We use the MLSH implementation provided by the authors.

5.1 Moving Bandits

We start with the 2D Moving Bandits environment proposed and implemented by Frans et al. (2018), which is similar to the example in Figure 1. In each episode, the agent receives a reward of 1 for each time step it is sufficiently close to one of two randomly sampled, distinguishable, marked positions in the environment. The agent can take actions that move it in one of the four cardinal directions. Which position is rewarded is determined by the task and not signaled in the observation. Each episode lasts 50 time steps.

We allow MLSH and MSOL to learn two options. During transfer, optimal performance can only be achieved when both options successfully learned to reach different marked locations, i.e., when they are diverse. In Figure 3(a) we can see that MSOL is able to do so, but the options learned by MLSH are not sufficiently diverse. distral, even with the last action provided as additional input, is not able to quickly utilize the prior knowledge. Because the locations of the two marked positions are randomly sampled for each episode, the last action only conveys meaningful information when taking the goal locations into account: the distral agent needs to infer the intention based on the last action and the relative goal positions. While this is in principle possible, in practice the agent was not able to do so, even with a much larger network. However, distral is able to reach the same performance as MSOL when trained long enough, since its posterior is flexible; this is denoted by “distral(+action) limit”. Lastly, MSOL(frozen) also outperforms distral(+action) and MLSH, but performs worse than MSOL, except for extremely short training times. This highlights the utility of making options soft, i.e., adaptable.

5.2 Taxi

Figure 4: Options learned with MSOL on the taxi domain. The light gray area indicates walls. Intra-option policies before (top) and after (bottom) pickup: arrows and colors indicate the direction of the most likely action, and the size indicates its probability. A square indicates the pickup/drop-off action. Termination policies before and after pickup: intensity and size of the circles indicate termination probability. MSOL learns four different options to reach the four different locations and learns appropriate termination probabilities.

Next, we use a slightly modified version of the original Taxi domain (Dietterich, 1998) to show that MSOL can learn appropriate termination functions in addition to intra-option policies. To solve the task, the agent must pick up a passenger at one of four possible locations by moving to their location and executing a special ‘pickup/drop-off’ action. Then, the passenger must be dropped off at one of the other three locations, again using the same action executed at the corresponding location. The domain has a discrete state space with 30 locations arranged on a grid and a flag indicating whether the passenger was already picked up. The observation is a one-hot encoding of the discrete state. Walls (see Figure 4) limit the movement of the agent, and invalid actions, i.e., moving into a wall, result in no change to the state.

We investigate two versions of Taxi. In the original (Dietterich, 1998, just called Taxi), the action space consists of one no-op, one ‘pickup/drop-off’ action and four actions to move in all cardinal directions. In Directional Taxi, we extend this setup: the agent faces in one of the cardinal directions and the available movement actions are to move forward or rotate either clockwise or counter-clockwise.

In both environments, the set of tasks consists of the 12 different combinations of pickup/drop-off locations, which are not part of the observation. Episodes last at most 50 steps and there is a reward of 2 for delivering the passenger to their goal and a penalty of -0.1 for each time step. During training, the agent is initialized to any valid state, excluding the four special locations. During testing, the agent is always initialized without the passenger on board.
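
The task set and reward structure described above can be written down in a few lines. This is a sketch under stated assumptions: the constant and function names are hypothetical, the one-hot layout covers only the plain Taxi state (location and passenger flag, without the heading used in Directional Taxi), and the step penalty is assumed to apply on the final step as well, which the text does not specify.

```python
from itertools import permutations

NUM_LOCATIONS = 30      # grid cells
NUM_SPECIAL = 4         # pickup/drop-off locations
MAX_STEPS = 50
DELIVERY_REWARD = 2.0
STEP_PENALTY = -0.1

# A task is an ordered (pickup, dropoff) pair of distinct special locations.
TASKS = list(permutations(range(NUM_SPECIAL), 2))


def one_hot_observation(location, has_passenger):
    """One-hot encoding of the discrete state (location, passenger flag)."""
    index = location + NUM_LOCATIONS * int(has_passenger)
    obs = [0.0] * (2 * NUM_LOCATIONS)
    obs[index] = 1.0
    return obs


def episode_return(steps_taken, delivered):
    """Undiscounted return; assumes the step penalty also applies on the final step."""
    steps = min(steps_taken, MAX_STEPS)
    return STEP_PENALTY * steps + (DELIVERY_REWARD if delivered else 0.0)
```

Because the pickup and drop-off locations do not appear in the observation, two tasks produce identical observations in the same state; the agent can only tell them apart through the rewards it receives.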

We allow four learnable options in MLSH and MSOL. This requires the options to be diverse, i.e., one option to reach each of the four pickup/drop-off locations. Importantly, it also requires the options to learn to terminate when a passenger is picked up. As Figure 2(b) shows, MLSH struggles because its fixed option duration is not flexible enough for this environment. In the original Taxi environment, distral(+action) performs well, as seen in Figure 2(b), since here the last action is a good indicator of the agent’s intention. However, in the directional case shown in Figure 2(c), the actions are less informative, which makes it much harder for distral to use prior knowledge. By contrast, MSOL performs well in both Taxi environments. Comparing its performance with MSOL(frozen) shows the utility of adaptable soft options during transfer.

Figure 4, which visualizes the options learned by MSOL, shows that it successfully learns useful movement primitives and termination functions. The same soft option represents different behavior depending on whether the passenger has already been picked up. This is expected, as such behavior does not require terminating the current option on three of the 12 tasks.

5.3 Swimmer

Lastly, we show that MSOL can also be applied to continuous multitask domains. In particular, we investigate the MuJoCo environment ‘Swimmer’ (Todorov et al., 2012; Brockman et al., 2016). Instead of rewarding forward movement as in the original implementation, now the rewarded movement direction depends on the task from . We also include a small amount of additive action noise (details in the Appendix). Due to limited computational resources, we only compare MSOL to the strongest baseline, distral, and show that MSOL performs competitively even in the absence of known failure cases of distral (see Figure 2(d)).

6 Discussion

Multitask Soft Option Learning (MSOL) proposes reformulating options using the Bayesian perspective of prior and posterior distributions. This offers several key advantages.

First, during transfer, it allows us to distinguish between fixed, and therefore knowledge-preserving option priors, and flexible option posteriors that can adjust to the reward structure of the task at hand. This effects a similar speed-up in learning as the original options framework, while avoiding sub-optimal performance when the available options are not perfectly aligned to the task. Second, utilizing this ‘soft’ version of options in a multitask learning setup increases optimization stability and removes the need for complex training schedules between master and lower level policies. Furthermore, this framework naturally allows master policies to coordinate across tasks and avoid local minima of insufficient option diversity. It also allows for autonomously learning option-termination policies, a very challenging task which is often avoided by fixing option durations manually.
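
The first advantage, a fixed prior combined with an adaptable posterior, can be illustrated with a small numerical sketch. It assumes a discrete action space and a KL-divergence penalty between the task-specific posterior and the shared prior, weighted by a hypothetical coefficient `beta`. The function names and all numbers are made up for illustration, and this one-step simplification is not the objective optimized by MSOL, which is the one defined in the main text.

```python
import math


def kl_divergence(posterior, prior):
    """KL(posterior || prior) for discrete distributions; prior must be > 0 where posterior is."""
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0.0)


def regularized_value(posterior, prior, action_rewards, beta):
    """Expected one-step reward minus a KL penalty for deviating from the prior."""
    expected = sum(q * r for q, r in zip(posterior, action_rewards))
    return expected - beta * kl_divergence(posterior, prior)


prior = [0.7, 0.1, 0.1, 0.1]          # shared option prior, kept fixed during transfer
task_rewards = [0.0, 1.0, 0.0, 0.0]   # the new task prefers a different action

frozen = prior                         # hard option: the posterior must equal the prior
adapted = [0.2, 0.6, 0.1, 0.1]         # soft option: the posterior shifts toward the reward

print(regularized_value(frozen, prior, task_rewards, beta=0.1))
print(regularized_value(adapted, prior, task_rewards, beta=0.1))
```

In this toy setting the adapted posterior attains a higher regularized value than the frozen one while the prior itself is left untouched, which mirrors the argument that soft options can adjust to a new task without overwriting previously learned behavior.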

Lastly, using this Bayesian formulation also allows inclusion of prior information in a principled manner without imposing too rigid a structure on the resulting hierarchy. We utilize this advantage to explicitly incorporate the bias that good options should be temporally extended. In future research, other types of information can be explored. As an example, one could investigate sets of tasks which would benefit from a learned master prior, like walking on different types of terrain.

7 Acknowledgements

M. Igl is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. N. Siddharth, Andrew Gambardella and Nantas Nardelli were funded by ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. S. Whiteson is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).


Appendix A Architecture

All policies and value functions share the same encoder network with two fully connected hidden layers of size 64 for the Moving Bandits environment and three hidden layers of sizes 512, 256, and 512 for the Taxi environments. Distral was tested with both model sizes on the Moving Bandits task to make sure that limited capacity is not the problem. Both models resulted in similar performance; the results shown in the paper are for the larger model. On Swimmer, the encoder model size is . Master policies, as well as all prior and posterior policies and value functions, consist of only one layer which takes the latent embedding produced by the encoder as input. Furthermore, the encoder is shared across tasks, allowing for much faster training since observations can be batched together.

Options are specified as an additional one-hot encoded input to the corresponding network that is passed through a single 128 dimensional fully connected layer and concatenated to the state embedding before the last hidden layer. We implement the single-column architecture of Distral as a hierarchical policy with just one option and with a modified loss function that does not include terms for the master and termination policies. Our implementation builds on the A2C/PPO implementation by Kostrikov (2018), and we use the implementation for MLSH that is provided by the authors (

Figure 5: Performance during training phase. Panels: (a) Moving Bandits, (b) Taxi, (c) Directional Taxi, (d) Swimmer. Note that MSOL and MSOL(frozen) share the same training as they only differ during testing. Further, note that the highest achievable performance for Taxi and Directional Taxi is higher during training as they can be initialized closer to the final goal (i.e. with the passenger on board).
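
The architecture described in this appendix can be sketched as a forward pass. This is one possible reading of the description, written with NumPy rather than the framework used in the experiments: the observation, option and action counts are illustrative, the tanh activation and the random initialisation are assumptions, and all names (`dense`, `forward`, `option_policy_logits`) are hypothetical. Only the two hidden layers of size 64, the 128-dimensional option layer, the concatenation and the single-layer head are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(in_dim, out_dim):
    """Randomly initialised weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def forward(layer, x, activate=True):
    """Apply one fully connected layer, optionally followed by tanh (an assumption)."""
    w, b = layer
    y = x @ w + b
    return np.tanh(y) if activate else y


OBS_DIM, NUM_OPTIONS, NUM_ACTIONS = 8, 2, 5   # illustrative sizes, not from the paper
HIDDEN, OPTION_EMBED = 64, 128                # sizes quoted in the text

encoder = [dense(OBS_DIM, HIDDEN), dense(HIDDEN, HIDDEN)]   # shared across tasks
option_layer = dense(NUM_OPTIONS, OPTION_EMBED)
policy_head = dense(HIDDEN + OPTION_EMBED, NUM_ACTIONS)     # single-layer head


def option_policy_logits(obs, option):
    """Action logits for a batch of observations and a batch of option indices."""
    h = obs
    for layer in encoder:
        h = forward(layer, h)
    option_one_hot = np.eye(NUM_OPTIONS)[option]
    z = forward(option_layer, option_one_hot)
    return forward(policy_head, np.concatenate([h, z], axis=-1), activate=False)
```

Since the encoder does not depend on the task, observations from several tasks can be stacked into one batch and encoded in a single call, which is the batching benefit mentioned above.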

Appendix B Hyper-parameters

We use in all experiments. Furthermore, we train on all tasks from the task distribution, regularly resetting individual tasks by resetting the corresponding master and re-initializing the posterior policies. Optimizing for MSOL and Distral was done over . We use for Moving Bandits and Taxi and for Swimmer.

B.1 Moving Bandits

For MLSH, we use the original hyper-parameters (Frans et al., 2018). The duration of each option is fixed to 10. The required warm-up duration is set to 9 and the training duration set to 1. We also use 30 parallel environments split between 10 tasks. This and the training duration are the main differences to the original paper. Originally, MLSH was trained on 120 parallel environments which we were unable to do due to hardware constraints. Training is done over 6 million frames per task.

For MSOL and Distral we use the same number of 10 tasks and 30 processes. Option durations are learned and we do not require a warm-up period. We set the learning rate to and , , . Training is done over 0.6 million frames per task. For Distral we use , and also 0.6 million frames per task.

B.2 Taxi

For MSOL we anneal from 0.02 to 0.1 and from 0.1 to 0.05. For Distral we use . We use 3 processes per task to collect experience for a batch size of 15 per task. Training is done over 1.4 million frames per task for Taxi and 4 million frames per task for Directional Taxi. MLSH was trained on only 0.6 million frames for Taxi because, given its long runtime of several days, using more frames was infeasible. Training had already converged at that point.
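
The numbers above imply some simple arithmetic, sketched below. The symbols of the two annealed quantities are missing from the text, so they appear here only as a first and a second coefficient. The shape of the annealing schedule for Taxi is not stated either; the sketch assumes a linear schedule, as is mentioned for Swimmer. The helper `linear_anneal` and all variable names are hypothetical.

```python
def linear_anneal(start, end, frame, total_frames):
    """Linearly interpolate from start to end over total_frames, then hold at end."""
    fraction = min(max(frame / total_frames, 0.0), 1.0)
    return start + fraction * (end - start)


TOTAL_FRAMES = 1_400_000        # frames per task for Taxi
PROCESSES_PER_TASK = 3
BATCH_SIZE_PER_TASK = 15

# Steps each process contributes to one batch (derived, not stated in the text).
steps_per_process = BATCH_SIZE_PER_TASK // PROCESSES_PER_TASK
# Number of batches per task, assuming one frame is one environment step.
updates_per_task = TOTAL_FRAMES // BATCH_SIZE_PER_TASK

halfway = TOTAL_FRAMES // 2
first_coefficient = linear_anneal(0.02, 0.1, halfway, TOTAL_FRAMES)
second_coefficient = linear_anneal(0.1, 0.05, halfway, TOTAL_FRAMES)
```

Under these assumptions each process collects 5 steps per batch, and halfway through training the two coefficients sit midway between their start and end values.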

B.3 Swimmer

For training Distral and MSOL we use PPO instead of A2C as it generally achieves better performance on continuous tasks. We have for both MSOL and Distral for primitive actions and for the master- and termination policies in MSOL. We use a learning rate of , GAE (Schulman et al., 2015) with . We collect steps in parallel on processes per task, resulting in a batch size of per task. Training is done over 6 million frames with a linearly scheduled increase of from to for MSOL and for Distral. We set .