Regularized Hierarchical Policies for Compositional Transfer in Robotics

06/26/2019 · by Markus Wulfmeier, et al. · Google

The successful application of flexible, general learning algorithms -- such as deep reinforcement learning -- to real-world robotics applications is often limited by their poor data-efficiency. Domains with more than a single dominant task of interest encourage algorithms that share partial solutions across tasks to limit the required experiment time. We develop and investigate simple hierarchical inductive biases -- in the form of structured policies -- as a mechanism for knowledge transfer across tasks in reinforcement learning (RL). To leverage the power of these structured policies we design an RL algorithm that enables stable and fast learning. We demonstrate the success of our method both in simulated robot environments (using locomotion and manipulation domains) as well as real robot experiments, demonstrating substantially better data-efficiency than competitive baselines.







1 Introduction

While recent successes in deep (reinforcement) learning for computer games (Atari (Mnih et al., 2013), StarCraft (Vinyals et al., 2019), Go (Silver et al., 2017)) and other high-throughput domains (e.g. OpenAI et al., 2018) have demonstrated the potential of these methods in the big data regime, high costs of data acquisition limit progress in many tasks of real-world relevance. Data efficiency in machine learning often relies on inductive biases to guide and accelerate the learning process, e.g. by including expert domain knowledge of varying granularity. While incorporating this knowledge often accelerates learning, inaccurate knowledge can inappropriately bias the space of solutions and lead to sub-optimal results.

Instead of designing inductive biases manually, we aim to infer them from related tasks. In particular, we turn towards research on multitask transfer learning, where, from the view of one task, training signals of related tasks can serve as an additional inductive bias (Caruana, 1997). In domains with consistent agent embodiment and more than a single task of interest – such as robotics – sharing and applying knowledge across tasks is imperative for efficiency and scalability. Successes in transfer learning have, for example, built on optimizing initial parameters (e.g. Finn et al., 2017), sharing models and parameters across tasks (e.g. Rusu et al., 2016; Teh et al., 2017; Galashov et al., 2018), sharing data across tasks (e.g. Riedmiller et al., 2018; Andrychowicz et al., 2017), or related, auxiliary objectives (Jaderberg et al., 2016; Wulfmeier et al., 2017). Transfer between tasks can, however, be either constructive or destructive for humans (Singley and Anderson, 1989) as well as machines (Pan and Yang, 2010; Torrey and Shavlik, 2010). That is, jointly learning to solve different tasks can provide benefits as well as disadvantages for individual tasks, depending on the similarity of the required solutions.

In reinforcement learning (RL), (partial) solutions can be shared e.g. in the form of policies or value functions. While training independent, per-task, policies prevents interference, it also prevents transfer. On the other hand, training a monolithic policy shared across tasks can instead cause interference of opposing modes of behaviour. With this work we continue to explore the benefits of hierarchy for multitask learning and the modelling of multi-modal distributions (Rosenstein et al., 2005; Bishop, 1994).

We focus on developing an actor-critic RL approach for training hierarchical, modular models (policies) for continuous control. We represent low-level sub-policies as Gaussian policies combined with a categorical distribution for the high-level controller. We parametrize the policies as simple Mixture Density Networks (Bishop, 1994) with the important modification of providing task information only to the high-level controller, so that the low-level policies acquire more robust, task-independent behaviours. To train hierarchical policies in a coordinated and robust manner, we extend the optimization scheme of the Maximum a-posteriori Policy Optimization algorithm (MPO) (Abdolmaleki et al., 2018a). We additionally demonstrate the general benefits of hierarchical policies for transfer learning via improved results for another, gradient-based, entropy regularized RL algorithm (Heess et al., 2016) (see Appendix A.8.4).

The resulting algorithm, Regularized Hierarchical Policy Optimization (RHPO), is appealing due to its conceptual simplicity and proves extremely effective in practice. We demonstrate this first in simulation for a variety of continuous control problems. In multitask transfer problems, we find that RHPO can improve learning speed compared to the state of the art. We then compare the approach against competitive continuous control baselines on a real robot for robotic manipulation tasks, where RHPO leads to a significant speed-up in training, enabling us to learn with increased efficiency in a challenging stacking domain on a single robot. To the best of our knowledge, RHPO demonstrates the first success for learning such a challenging task, defined via a sparse reward, from raw robot state and a randomly initialized policy in the real world – which we can learn within about a week of real robot time.¹

¹ Additional videos are provided under

2 Preliminaries

We consider a multitask reinforcement learning setting with an agent operating in a Markov Decision Process (MDP) consisting of the state space $\mathcal{S}$, the action space $\mathcal{A}$, and the transition probability $p(s_{t+1} \mid s_t, a_t)$ of reaching state $s_{t+1}$ from state $s_t$ when executing action $a_t$ at the previous time step $t$. The actions are drawn from a probability distribution over actions $\pi(a \mid s)$, referred to as the agent's policy. Jointly, the transition dynamics and policy induce the marginal state visitation distribution $p(s)$. Finally, the discount factor $\gamma$ together with the reward $r(s, a)$ gives rise to the expected reward, or value, of starting in state $s$ (and following $\pi$ thereafter), $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s\big]$.

Furthermore, we define multitask learning over a set of tasks $i \in \mathcal{I}$ with common agent embodiment as follows. We assume shared state and action spaces and shared transition dynamics across tasks; tasks only differ in their reward function $r_i(s, a)$. Furthermore, we consider task-conditional policies $\pi(a \mid s, i)$. The overall objective is defined as

$$J(\pi) = \mathbb{E}_{i \sim \mathcal{I}}\Big[\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_i(s_t, a_t)\Big]\Big],$$

where all actions are drawn according to the policy, that is, $a_t \sim \pi(\cdot \mid s_t, i)$, and we use the common definition of the state-action value function – here conditioned on the task – $Q^{\pi}(s, a; i) = r_i(s, a) + \gamma \mathbb{E}\big[V^{\pi}(s_{t+1}; i)\big]$.
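Concretely, the objective averages per-task discounted returns uniformly over tasks; a minimal sketch with hypothetical rollout data and helper names (not the paper's code):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t r_t for one reward trajectory of a single task."""
    return float(sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards)))

def multitask_objective(per_task_rewards, gamma=0.99):
    """J(pi): expected discounted return, averaged uniformly over tasks.

    per_task_rewards: dict mapping task id -> list of reward trajectories,
    each collected while conditioning the policy on that task.
    """
    task_values = [
        np.mean([discounted_return(traj, gamma) for traj in trajs])
        for trajs in per_task_rewards.values()
    ]
    return float(np.mean(task_values))

# Toy check: two tasks, one trajectory each.
J = multitask_objective({0: [[1.0, 1.0]], 1: [[0.0, 1.0]]}, gamma=0.5)
```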

3 Method

This section introduces Regularized Hierarchical Policy Optimization (RHPO), which focuses on efficient training of modular policies by sharing data across tasks, building on an instance of Scheduled Auxiliary Control (SAC-X) with randomized scheduling (SAC-U) (Riedmiller et al., 2018). We start by introducing the considered class of policies, followed by the required extension of the MPO algorithm (Abdolmaleki et al., 2018b, a) for training structured hierarchical policies in an off-policy setting.

3.1 Modular Hierarchical Policies

We start by defining the modular policy class which supports sharing sub-policies across tasks. Formally, we decompose the per-task policy as

$$\pi_{\theta}(a \mid s, i) = \sum_{o=1}^{M} \pi_{\theta}^{H}(o \mid s, i)\, \pi_{\theta}^{L}(a \mid s, o),$$

with $\pi_{\theta}^{H}$ and $\pi_{\theta}^{L}$ respectively representing a "high-level" switching controller (a categorical distribution) and a "low-level" sub-policy, and $o$ the index of the sub-policy. Here, $\theta$ denotes the parameters of both $\pi^{H}$ and $\pi^{L}$, which we will seek to optimize. While the number of components $M$ has to be decided externally, the method is robust with respect to this parameter (Appendix A.8.1).

Note that, in the above formulation, only the high-level controller is conditioned on the task information $i$; i.e. we employ a form of information asymmetry (Galashov et al., 2018; Tirumala et al., 2019; Heess et al., 2016) to enable the low-level policies to acquire general, task-independent behaviours. This choice strengthens the decomposition of tasks across domains and inhibits degenerate cases of bypassing the high-level controller. The sub-policies can be understood as reflex-like low-level control loops which perform domain-dependent, useful behaviours and can be modulated by higher cognitive functions with knowledge of the task at hand.
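A minimal numpy sketch of this policy class, with toy linear maps standing in for the networks (all names and dimensions are illustrative assumptions); note that only the high-level controller receives the task index:

```python
import numpy as np

rng = np.random.default_rng(0)

class HierarchicalPolicy:
    """pi(a|s,i) = sum_o pi_H(o|s,i) * pi_L(a|s,o) with M Gaussian components.

    Toy linear parameterizations stand in for the networks in the paper.
    """
    def __init__(self, state_dim, action_dim, num_tasks, num_components):
        self.M = num_components
        # High-level controller: logits computed from state AND task id.
        self.W_high = rng.normal(0, 0.1, (num_tasks, state_dim, num_components))
        # Low-level components: means computed from state only (task-independent).
        self.W_low = rng.normal(0, 0.1, (num_components, state_dim, action_dim))
        self.log_std = np.zeros((num_components, action_dim))

    def component_probs(self, s, task):
        logits = s @ self.W_high[task]
        z = np.exp(logits - logits.max())     # stable softmax
        return z / z.sum()

    def sample(self, s, task):
        probs = self.component_probs(s, task)
        o = rng.choice(self.M, p=probs)       # high level picks a component
        mean = s @ self.W_low[o]              # low level never sees the task
        return mean + np.exp(self.log_std[o]) * rng.normal(size=mean.shape)

pi = HierarchicalPolicy(state_dim=4, action_dim=2, num_tasks=3, num_components=3)
s = rng.normal(size=4)
a = pi.sample(s, task=1)
```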

3.2 Decoupled Hierarchical Multitask Optimization

In the following sections we present the equations underlying RHPO; for the complete pseudocode the reader is referred to Appendix A.2.1. To optimize the policy class described above we build on the MPO algorithm (Abdolmaleki et al., 2018b, a), which decouples the policy improvement step (optimizing independently of the policy structure) from the fitting of the hierarchical policy. Concretely, we first introduce an intermediate non-parametric policy $q(a \mid s, i)$ and consider optimizing $q$ while staying close, in expectation, to a reference policy $\pi_{\text{old}}$:

$$\max_{q} \ \mathbb{E}_{(s, i) \sim \mathcal{D}}\Big[\mathbb{E}_{q(a \mid s, i)}\big[\hat{Q}(s, a; i)\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{(s, i) \sim \mathcal{D}}\Big[\mathrm{KL}\big(q(\cdot \mid s, i)\,\|\,\pi_{\text{old}}(\cdot \mid s, i)\big)\Big] \leq \epsilon, \qquad (3)$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence, $\epsilon$ defines a bound on the KL, $\mathcal{D}$ denotes the data contained in a replay buffer (note that we do not differentiate between states from different tasks here; see Section 3.3 for further explanation), and we assume an approximation $\hat{Q}(s, a; i)$ of the ground-truth state-action value function is available (see Section 3.3 for details on how $\hat{Q}$ can be learned from off-policy data). Starting from an initial policy $\pi_{\theta_0}$, we can then iterate the following steps (interleaved with policy evaluation to estimate $\hat{Q}$) to improve the policy:

  • Policy Evaluation: Update $\hat{Q}$ such that $\hat{Q}(s, a; i) \approx Q^{\pi}(s, a; i)$, see Section 3.3.

  • Step 1: Obtain the non-parametric policy $q$ under the KL constraint of Equation (3).

  • Step 2: Obtain updated policy parameters $\theta_{\text{new}}$ under additional regularization (Equation (5)).

Step 1: Obtaining Non-parametric Policies

We first find the intermediate policy $q$ by maximizing Equation (3). Analogous to MPO (Abdolmaleki et al., 2018b, a), we obtain a closed-form solution with a non-parametric policy for each task,

$$q(a \mid s, i) \propto \pi_{\text{old}}(a \mid s, i) \exp\big(\hat{Q}(s, a; i)/\eta\big),$$

where $\eta$ is a temperature parameter (corresponding to a given bound $\epsilon$) which is optimized alongside the policy optimization. See Abdolmaleki et al. (2018b, a) for the general form, as well as Appendix A.1.1 for a detailed derivation for the multitask case. As mentioned above, this policy representation is independent of the form of the parametric policy $\pi_{\theta}$; i.e. $q$ only depends on $\pi_{\text{old}}$ through its density. This, crucially, makes it easy to employ complicated structured policies (such as the one introduced in Section 3.1). The only requirement for this and the following steps is that we must be able to sample from $\pi_{\theta}$ and calculate the gradient (w.r.t. $\theta$) of its log density.
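In a sample-based implementation, the closed form above reduces to self-normalized softmax weights over actions drawn from $\pi_{\text{old}}$; a sketch, with the temperature assumed given rather than optimized:

```python
import numpy as np

def nonparametric_weights(q_values, eta):
    """Weights of q(a|s,i) over actions sampled from pi_old(a|s,i).

    Since q(a|s,i) ∝ pi_old(a|s,i) exp(Q(s,a;i)/eta) and the actions are
    drawn from pi_old itself, the self-normalized weights are simply a
    softmax of Q/eta over the sampled actions.
    """
    z = q_values / eta
    z = z - z.max()              # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

w = nonparametric_weights(np.array([1.0, 2.0, 0.5]), eta=1.0)
```

Lowering `eta` sharpens the weighting towards the highest-value actions, mirroring a tighter KL bound in the closed-form solution.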

Step 2: Fitting Parametric Policies

In the second step we fit a parametric policy to the non-parametric distribution obtained from the previous calculation by minimizing the divergence $\mathrm{KL}\big(q\,\|\,\pi_{\theta}\big)$. Assuming that we can sample from $q$, this step corresponds to maximum likelihood estimation (MLE). Furthermore, we can effectively mitigate optimization instabilities during training by placing a prior on the change of the policy (a trust-region constraint). This idea is commonly used in on- as well as off-policy RL (Schulman et al., 2015; Abdolmaleki et al., 2018b, a); its application to hierarchical policy classes highlights the importance of this constraint, as investigated in Section 4.1. Formally, we aim to obtain the solution

$$\theta_{\text{new}} = \arg\max_{\theta} \ \mathbb{E}_{(s, i) \sim \mathcal{D}}\Big[\mathbb{E}_{q(a \mid s, i)}\Big[\log \sum_{o=1}^{M} \pi_{\theta}^{H}(o \mid s, i)\, \pi_{\theta}^{L}(a \mid s, o)\Big]\Big] \ \ \text{s.t.} \ \ \mathbb{E}_{(s, i) \sim \mathcal{D}}\big[\mathcal{T}(\pi_{\theta_{\text{old}}}, \pi_{\theta})\big] \leq \epsilon, \qquad (5)$$

where $\epsilon$ defines a bound on the change of the new policy and $\mathcal{T}$ is a distance function between old and new policy. Here we drop constant terms and the negative sign (turning the KL minimization into likelihood maximization), and explicitly insert the mixture definition of $\pi_{\theta}$ to highlight that we are marginalizing over the high-level choices in this fitting step (since $q$ is not tied to the policy structure). Hence, the update is independent of the specific policy component from which an action was sampled, enabling joint updates of all components. This reduces the variance of the update and also enables efficient off-policy learning. Different approaches can be used to control convergence, ensuring that both the "high-level" categorical choices and the action choices change slowly throughout learning. The average KL constraint in Equation (5) is similar in nature to an upper bound on the computationally intractable KL divergence between the two mixture distributions, and has experimentally been found to perform better in practice than simpler bounds. In practice, inspired by Abdolmaleki et al. (2018a), in order to control the change of the high-level and low-level policies independently, we decouple the constraints so that different bounds can be set for the means ($\epsilon_{\mu}$), covariances ($\epsilon_{\Sigma}$) and the categorical distribution ($\epsilon_{C}$) in the case of a mixture-of-Gaussians policy. Please see Appendix A.1.2 for more details on decoupled constraints. To solve Equation (5), we first employ Lagrangian relaxation to make it amenable to gradient-based optimization and then perform a fixed number of gradient descent steps (using Adam (Kingma and Ba, 2014)); details on this step, as well as an algorithm listing, can be found in Appendix A.1.2.
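The marginalization over high-level choices can be sketched as the weighted log-likelihood objective below (a simplified single-state illustration with hypothetical helper names; the KL regularization terms are omitted):

```python
import numpy as np

def gaussian_logpdf(a, mean, std):
    """Log density of a diagonal Gaussian, evaluated row-wise."""
    return -0.5 * np.sum(((a - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2), axis=-1)

def weighted_mle_loss(actions, weights, cat_probs, means, stds):
    """-sum_j w_j log pi(a_j), marginalizing over mixture components.

    actions: (N, d) samples from pi_old; weights: (N,) from Step 1;
    cat_probs: (M,) high-level probabilities; means/stds: (M, d).
    Every component contributes to log pi(a_j), so all components receive
    gradient from every sample, regardless of which one generated it.
    """
    # log pi(a) = logsumexp_o [ log cat_probs[o] + log N(a; mean_o, std_o) ]
    comp = np.stack([
        np.log(cat_probs[o]) + gaussian_logpdf(actions, means[o], stds[o])
        for o in range(len(cat_probs))
    ])                                   # shape (M, N)
    m = comp.max(axis=0)
    log_pi = m + np.log(np.exp(comp - m).sum(axis=0))
    return float(-np.sum(weights * log_pi))
```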

3.3 Data-efficient Multitask Learning

For data-efficient off-policy learning of $\hat{Q}$ we consider the same setting as Scheduled Auxiliary Control (SAC-X) (Riedmiller et al., 2018), which introduces two main ideas to obtain data-efficiency: i) experience sharing across tasks; and ii) switching between tasks within one episode for improved exploration. We consider uniform random switches between tasks every $N$ steps within an episode (the SAC-U setting from Riedmiller et al. (2018)).
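The SAC-U scheduling can be sketched as follows (`env_step`, the switch period, and the task count are illustrative placeholders):

```python
import random

def sac_u_episode(env_step, num_tasks, episode_len, switch_every, seed=0):
    """Run one episode, uniformly resampling the active task every
    `switch_every` steps. Rewards for ALL tasks are recorded at each step,
    so every transition can later be reused to learn every task's Q-function.
    """
    rng = random.Random(seed)
    trajectory = []
    task = rng.randrange(num_tasks)
    for t in range(episode_len):
        if t > 0 and t % switch_every == 0:
            task = rng.randrange(num_tasks)   # uniform random task switch
        s, a, rewards_all_tasks = env_step(task)
        trajectory.append((s, a, task, rewards_all_tasks))
    return trajectory

# Toy environment stub returning a reward vector covering all tasks.
traj = sac_u_episode(lambda i: (0.0, 0.0, [float(i == j) for j in range(3)]),
                     num_tasks=3, episode_len=6, switch_every=2)
```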

Formally, we assume access to a replay buffer containing data gathered from all tasks, which is filled asynchronously to the optimization (similar to e.g. Espeholt et al. (2018); Abdolmaleki et al. (2018a)), where for each trajectory snippet we record the rewards for all tasks, $\big[r_1(s, a), \ldots, r_{|\mathcal{I}|}(s, a)\big]$, as a vector in the buffer. Using this data we define the retrace objective for learning $\hat{Q}$, parameterized via $\phi$, following Riedmiller et al. (2018) as

$$\min_{\phi} \ \mathbb{E}_{\mathcal{D}}\Big[\sum_{i \in \mathcal{I}} \big(Q^{\text{ret}}(s, a; i) - \hat{Q}(s, a; i; \phi)\big)^{2}\Big],$$

where $Q^{\text{ret}}$ is the L-step retrace target (Munos et al., 2016); see Appendix A.2.2 for details.
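A sketch of a retrace-target computation for a single snippet and task (inputs are assumed given as arrays; this simplified version recomputes coefficients naively rather than using the recursive form):

```python
import numpy as np

def retrace_targets(r, q, v_next, rho, gamma=0.99, lam=1.0):
    """Retrace (Munos et al., 2016) targets for one snippet of length T.

    r[t]      : reward at step t (for the task being learned)
    q[t]      : current estimate Q(s_t, a_t)
    v_next[t] : E_{a~pi} Q(s_{t+1}, a), the bootstrap value
    rho[t]    : importance ratio pi(a_t|s_t) / mu(a_t|s_t)
    """
    T = len(r)
    c = lam * np.minimum(1.0, rho)       # truncated importance weights
    delta = r + gamma * v_next - q       # per-step TD errors
    targets = np.copy(q)
    for t in range(T):
        coef = 1.0                       # gamma^{j-t} * prod_{k=t+1..j} c_k
        for j in range(t, T):
            if j > t:
                coef *= gamma * c[j]
            targets[t] += coef * delta[j]
    return targets
```

For a single on-policy step this reduces to the one-step bootstrap target $r + \gamma V(s')$, which the test below checks.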

4 Experiments

In the following sections, we investigate how RHPO  can provide compelling benefits for multitask transfer learning in real and simulated robotic manipulation tasks. This includes experiments on physical hardware with robotic manipulation tasks for the Sawyer arm, emphasizing the importance of data-efficiency. In addition, we provide results for single-task domains from the DeepMind Control Suite (Tassa et al., 2018) in Appendix A.5.4, where we demonstrate how hierarchical policies can be employed to flexibly integrate human domain knowledge.

All policies are trained via the same reward structures, underlying constraints and task hyperparameters. More details on these parameters and the complete results for additional ablations and all tasks from the multitask domains are provided in the Appendix

A.4. Across all tasks, we build on a distributed actor-critic framework (similar to (Espeholt et al., 2018)) with flexible hardware assignment (Buchlovsky et al., 2019) to train all agents, performing critic and policy updates from a replay buffer, which is asynchronously filled by a set of actors.

4.1 Simulated Robot Experiments

We use three simulated multitask scenarios with the Kinova Jaco and Rethink Robotics Sawyer robot arms to test in a variety of conditions. Pile1: here, the seven tasks of interest range from simply reaching for a block, over tasks like grasping it, to the final task of stacking the block on top of another block. In addition to the experiments in simulation, which are executed with 5 actors in a distributed setting, the same Pile1 multitask domain (same rewards and setup) is investigated with a single, physical robot in Section 4.2. In all figures with error bars, we visualize mean and variance derived from 3 runs. We further extend the evaluation towards two more complex multitask domains in simulation. The first extension includes stacking both blocks on top of the respective other block, resulting in a setting with 10 tasks (Pile2). The second adds harder tasks such as opening a box and placing blocks into this box, for a total of 13 tasks (Cleanup2).

Across all experiments, we utilize the data-sharing scheme of Scheduled Auxiliary Control (SAC-X) (Riedmiller et al., 2018) (see Section 3.3) and execute a series of random tasks per episode, i.e. our setup is comparable to SAC-U. We compare RHPO for training hierarchical policies against a flat, monolithic policy shared across all tasks, which is provided with the additional task id as input (displayed as Monolithic in the plots), as well as policies with task-dependent heads (displayed as Independent in the plots), following Riedmiller et al. (2018) – both using MPO as the optimizer – and a re-implementation of SAC-U using SVG (Heess et al., 2015). The baselines provide the two opposite, naive perspectives on transfer: the monolithic policy shared across tasks enables positive as well as negative interference, while independent policies prevent policy-based transfer altogether. After experimentally confirming the robustness of RHPO with respect to the number of low-level sub-policies (see Appendix A.8.1), we set the number of components $M$ equal to the number of tasks in each domain.

Figure 1:

Results for the multitask robotic manipulation experiments in simulation. The dashed line corresponds to the performance of a re-implementation of SAC-U. From left to right: Pile1, Pile2, Cleanup2. We show averages over 3 runs each, with corresponding standard deviation. RHPO  outperforms both baselines across all tasks with the benefits increasing for more complex tasks.

Figure 1 demonstrates that the hierarchical policy (RHPO) outperforms the monolithic as well as the independent baselines. For simple tasks such as the stacking of one block on another (Pile1), the difference is comparably small, but the more tasks are trained and the more complex the domain becomes (cf. Pile2 and Cleanup2), the greater the advantage of sharing learned behaviours across tasks. Compared to a re-implementation of SAC-U (using SVG as the optimizer) – dashed line in the plots – we observe that the baselines based on MPO already yield an improvement, which becomes even larger with hierarchical policies.

4.2 Physical Robot Experiments

For real-world experiments, data-efficiency is crucial. We perform all experiments in this section relying on a single robot (a single actor) – demonstrating the benefits of RHPO  in the low data regime. The performed task is the real world version of the Pile1 task described in Section 4.1. The main task objective is to stack one cube onto a second one and move the gripper away from it. We introduce an additional third cube which serves purely as a distractor.

Figure 2: Left: Overview of the real robot setup with the Sawyer robot performing the Pile1 task. Screen pixelated for anonymization. Middle: Simulated Sawyer performing the same task. Right: Cleanup2 setup with the Jaco.

The setup for the experiments consists of a Sawyer robot arm mounted on a table, equipped with a Robotiq 2F-85 parallel gripper. A basket of size in front of the robot contains the three cubes. Three cameras on the basket track the cubes using fiducials (augmented reality tags). As in simulation, the agent is provided with proprioception information (joint positions, velocities and torques), a wrist sensor’s force and torque readings, as well as the cubes’ poses – estimated via the fiducials. The agent action is five dimensional and consists of the three Cartesian translational velocities, the angular velocity of the wrist around the vertical axis and the speed of the gripper’s fingers.

Figure 3 plots the learning progress on the real robot for two (out of 7) of the tasks: the simple reach task and the stack task – the main task of interest. Plots for the learning progress of all tasks are given in Appendix A.6. As can be observed, all methods manage to learn the reach task quickly (within a few thousand episodes), but only RHPO with a hierarchical policy is able to learn the stacking task (taking about 15 thousand episodes to obtain good stacking success), which corresponds to about 8 days of training on the real robot. From the current state of the baselines and previous experiments, we estimate that the baselines would take considerably longer (and we will continue running the 'Independent' baseline experiment after submission until all reach the same training duration).

Figure 3: Robot experiments (left: Reach, right: Stack). While simpler tasks such as reaching are learned with comparable efficiency, the later, more complex tasks are acquired significantly faster with a hierarchical policy.

4.2.1 Additional Ablations

In this section we build on the earlier introduced Pile1 domain to perform a set of ablations, providing additional insights into benefits of RHPO  and important factors for robust training.

Impact of Data Rate

Evaluating in a distributed off-policy setting enables us to investigate the effect of different rates of data generation by controlling the number of actors. Figure 4 demonstrates how all agents converge more slowly at lower data rates (changing from 5 actors to 1). These experiments are highly relevant for the application domain, as the number of available physical robots for real-world experiments is typically very limited. To limit computational cost, we focus on the simplest domain from Section 4.1, Pile1, in this comparison. We display the learning curve of the most complicated task among the 7 tasks (the final stack). The results for all tasks as a function of the number of actor processes generating data are available in Appendix A.8.3.

Figure 4: Learning convergence on Pile1 in relation to the number of actors. RHPO  results in significantly quicker learning in particular in the low data regime (1-5 actors) where both baselines fail to converge in the given time.
Importance of Regularization

Coordinating convergence in hierarchical models can be challenging but can be effectively moderated by the KL constraints. To investigate their importance, we perform an ablation study varying the strength of the KL constraint between the prior and the current high-level controller during training – demonstrating a range of possible degenerate behaviours.

As depicted in Figure 5, with a weak KL constraint, the high-level controller can converge too quickly, leading to only a single sub-policy receiving a gradient signal per step. In addition, the categorical distribution tends to change at a high rate, preventing successful convergence of the low-level policies. On the other hand, recall that the low-level policies lack the task information that would encourage decomposition, as described in Section 3.1. This fact, in combination with a strong KL constraint, can prevent specialization of the low-level policies as the categorical remains near-static, finally leading to no or very slow convergence. As long as a reasonable constraint is picked, convergence is fast and the final policies achieve high performance on all tasks. We note that no tuning of the constraints is required across domains and that the range of admissible constraints is quite broad.

Figure 5: Importance of the categorical KL constraint. A weak constraint causes the categorical to adapt too fast for the low-level policies; a too strong constraint prevents convergence of the categorical.

5 Related Work

Transfer learning, in particular in the multitask context, has long been part of machine learning (ML) for data-limited domains (Caruana, 1997; Torrey and Shavlik, 2010; Pan and Yang, 2010). Commonly, it is not straightforward to train a single model jointly across different tasks as the solutions to tasks might not only interfere positively but also negatively (Wang et al., 2018).

Preventing this type of forgetting or negative transfer presents a challenge for biological (Singley and Anderson, 1989) as well as artificial systems (French, 1999). In the context of ML, a common scheme is the reduction of representational overlap (French, 1999; Rusu et al., 2016; Wang et al., 2018). Bishop (1994) utilizes neural networks to parametrize mixture models for representing multi-modal distributions, thus mitigating shortcomings of non-hierarchical approaches.

Rosenstein et al. (2005) demonstrate that hierarchical classification models can be applied to limit the impact of negative transfer. Our constrained optimization scheme for mixture policies builds on (Abdolmaleki et al., 2018b). Some aspects of it are similar to Daniel et al. (2016) but we phrase the learning problem in a fashion suitable for large-scale off-policy learning with deep neural networks.

Hierarchical approaches have a long history in the reinforcement learning literature (e.g. Sutton et al., 1999; Dayan and Hinton, 1993). Similar to our work, the well known options framework (Sutton et al., 1999; Precup, 2000) supports a two level behavior hierarchy, where the higher level chooses from a discrete set of sub-policies or “options” which commonly are run until a termination criterion is satisfied. The framework focuses on the notion of temporal abstraction. A number of works have proposed practical and scalable algorithms for learning option policies with reinforcement learning (e.g. Bacon et al., 2017; Zhang and Whiteson, 2019; Smith et al., 2018) or criteria for option induction (e.g. Harb et al., 2018; Harutyunyan et al., 2019). Rather than temporal abstraction, RHPO  emphasizes the sharing of sub-policies across tasks to provide structure for efficient multitask transfer.

Conceptually different from the options framework and from our work are approaches such as (Vezhnevets et al., 2017; Nachum et al., 2018a, b; Xie et al., 2018) which employ different rewards for different levels of the hierarchy rather than optimizing a single objective for the entire model as we do. Other works have shown the additional benefits for the stability of training and data-efficiency when sequences of high-level actions are given as guidance during optimization in a hierarchical setting (Shiarlis et al., 2018; Andreas et al., 2017; Tirumala et al., 2019).

Probabilistic trajectory models have been used for the discovery of behaviour abstractions as part of an end-to-end reinforcement learning paradigm (e.g. Teh et al., 2017; Igl et al., 2019; Tirumala et al., 2019; Galashov et al., 2018), where the models act as learned inductive biases that induce the sharing of behaviour across tasks. Several works have previously emphasized the importance of information asymmetry or information hiding (Galashov et al., 2018; Heess et al., 2016). In a vein similar to the present work, Heess et al. (2016) and Tirumala et al. (2019) share a low-level controller across tasks but modulate the low-level behaviour via a continuous embedding rather than picking from a small number of mixture components. In related work, Hausman et al. (2018) and Haarnoja et al. (2018) learn hierarchical policies with continuous latent variables, optimizing an entropy-regularized objective.

6 Discussion

We introduce a novel framework, RHPO, to enable robust training of hierarchical policies for complex real-world tasks, and provide insights into methods for stabilizing the learning process. In simulation as well as on real robots, the proposed approach outperforms baseline methods which either handle tasks independently or utilize implicit sharing. Especially with increasingly complex tasks or limited data rates, as given in real-world applications, we demonstrate that hierarchical inductive biases provide a compelling foundation for transfer learning, reducing the number of environment interactions significantly and often leading to more robust learning as well as improved final performance. While this approach partially mitigates negative interference between tasks in a parallel multitask learning scenario, addressing catastrophic interference in sequential settings remains challenging and provides a valuable direction for future work.

Since, with mixture distributions, we are able to marginalize over components when optimizing the weighted likelihood over action samples in Equation 5, the extension towards multiple levels of hierarchy is straightforward and provides a valuable direction for practical future work.

We believe that especially in domains with consistent agent embodiment and high costs for data generation learning tasks jointly and sharing will be imperative. RHPO  combines several ideas that we believe will be important: multitask learning with hierarchical and modular policy representations, robust optimization, and efficient off-policy learning. Although we have found this particular combination of components to be very effective we believe it is just one instance of – and step towards – a spectrum of efficient learning architectures that will unlock further applications of RL both in simulation and, importantly, on real hardware.


The authors would like to thank Michael Bloesch, Jonas Degrave, Joseph Modayil and Doina Precup for helpful discussion and relevant feedback for shaping our submission. As robotics (and AI research) is a team sport we’d additionally like to acknowledge the support of Francesco Romano, Murilo Martins, Stefano Salicetti, Tom Rothörl and Francesco Nori on the hardware and infrastructure side as well as many others of the DeepMind team.


Appendix A Appendix

a.1 Additional Derivations

In this section we explain the detailed derivations for training hierarchical policies parameterized as a mixture of Gaussians.

a.1.1 Obtaining Non-parametric Policies

In each policy improvement step, to obtain the non-parametric policies for a given state and task distribution, we solve the following program:

$$\max_{q} \ \mathbb{E}_{\mu(s, i)}\Big[\mathbb{E}_{q(a \mid s, i)}\big[\hat{Q}(s, a; i)\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{\mu(s, i)}\Big[\mathrm{KL}\big(q(\cdot \mid s, i)\,\|\,\pi_{\text{old}}(\cdot \mid s, i)\big)\Big] \leq \epsilon, \quad \int q(a \mid s, i)\, da = 1.$$

To make the following derivations easier to follow, we open up the expectations, writing them as integrals explicitly. For this purpose, let us define the joint distribution over states together with randomly sampled tasks as $\mu(s, i) = p(s)\, U(i)$, where $U(i)$ denotes the uniform distribution over possible tasks. This allows us to re-write the expectations that include the corresponding distributions, i.e. $\mathbb{E}_{\mu(s, i)}[\cdot] = \int\!\!\int \mu(s, i)\,[\cdot]\, ds\, di$; but again, note that $i$ here is not necessarily the task under which $s$ was observed. We can then write the Lagrangian corresponding to the above program as

$$L(q, \eta, \lambda) = \int\!\!\int \mu(s, i) \int q(a \mid s, i)\, \hat{Q}(s, a; i)\, da\, ds\, di + \eta \Big(\epsilon - \int\!\!\int \mu(s, i) \int q(a \mid s, i) \log \frac{q(a \mid s, i)}{\pi_{\text{old}}(a \mid s, i)}\, da\, ds\, di\Big) + \lambda \Big(1 - \int q(a \mid s, i)\, da\Big).$$

Next we maximize the Lagrangian w.r.t. the primal variable $q$. The derivative w.r.t. $q$ reads

$$\frac{\partial L}{\partial q} = \hat{Q}(s, a; i) - \eta \log \frac{q(a \mid s, i)}{\pi_{\text{old}}(a \mid s, i)} - \eta - \lambda.$$

Setting it to zero and rearranging terms, we obtain

$$q(a \mid s, i) = \pi_{\text{old}}(a \mid s, i) \exp\Big(\frac{\hat{Q}(s, a; i)}{\eta}\Big) \exp\Big(-\frac{\eta + \lambda}{\eta}\Big).$$

However, the last exponential term is a normalization constant for $q$. Therefore we can write

$$q(a \mid s, i) = \frac{\pi_{\text{old}}(a \mid s, i) \exp\big(\hat{Q}(s, a; i)/\eta\big)}{\int \pi_{\text{old}}(a \mid s, i) \exp\big(\hat{Q}(s, a; i)/\eta\big)\, da}. \qquad (7)$$

Now, to obtain the dual function $g(\eta)$, we plug this solution into the KL constraint term (second term) of the Lagrangian. After expanding and rearranging, most of the terms cancel out; using that we have already calculated the normalizer in Equation (7), we obtain the dual function

$$g(\eta) = \eta \epsilon + \eta \int\!\!\int \mu(s, i) \log \Big(\int \pi_{\text{old}}(a \mid s, i) \exp\big(\hat{Q}(s, a; i)/\eta\big)\, da\Big)\, ds\, di,$$

which we can minimize with respect to $\eta$ based on samples from the replay buffer.
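On a discrete toy problem the dual can be minimized by simple grid search, which also verifies that the resulting $q$ attains the KL bound $\epsilon$ (a sketch; the actual implementation optimizes $\eta$ by gradient descent):

```python
import numpy as np

def solve_eta(pi_old, q_values, eps, etas=np.geomspace(1e-3, 1e3, 4000)):
    """Grid-minimize g(eta) = eta*eps + eta*log sum_a pi_old(a) exp(Q(a)/eta)
    for a single (state, task) pair with discrete actions; return eta* and
    the induced non-parametric policy q."""
    best_eta, best_g = None, np.inf
    for eta in etas:
        z = q_values / eta
        m = z.max()                              # stabilize the log-sum-exp
        log_norm = m + np.log(np.sum(pi_old * np.exp(z - m)))
        g = eta * eps + eta * log_norm
        if g < best_g:
            best_eta, best_g = eta, g
    z = q_values / best_eta
    w = pi_old * np.exp(z - z.max())
    return best_eta, w / w.sum()

pi_old = np.array([0.25, 0.25, 0.5])
eta, q = solve_eta(pi_old, np.array([1.0, 3.0, 0.0]), eps=0.1)
kl = float(np.sum(q * np.log(q / pi_old)))
```

At the minimizing $\eta$, the KL constraint is active whenever the unconstrained (greedy) solution would violate it, so `kl` lands at approximately `eps`.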

a.1.2 Extended Update Rules For Fitting a Mixture of Gaussians

After obtaining the non-parametric policies, we fit a parametric policy to samples from said non-parametric policies – effectively employing maximum likelihood estimation with additional regularization based on a distance function $\mathcal{T}$, i.e.

$$\theta_{\text{new}} = \arg\max_{\theta} \ \mathbb{E}_{\mu(s, i)}\Big[\mathbb{E}_{q(a \mid s, i)}\big[\log \pi_{\theta}(a \mid s, i)\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{\mu(s, i)}\big[\mathcal{T}(\pi_{\theta_{\text{old}}}, \pi_{\theta})\big] \leq \epsilon,$$

where $\mathcal{T}$ is an arbitrary distance function evaluating the change of the new policy with respect to a reference/old policy, and $\epsilon$ denotes the allowed change for the policy. To make the above objective amenable to gradient-based optimization we employ Lagrangian relaxation, yielding the following primal:

$$\max_{\theta} \min_{\alpha > 0} L(\theta, \alpha) = \mathbb{E}_{\mu(s, i)}\Big[\mathbb{E}_{q(a \mid s, i)}\big[\log \pi_{\theta}(a \mid s, i)\big]\Big] + \alpha \Big(\epsilon - \mathbb{E}_{\mu(s, i)}\big[\mathcal{T}(\pi_{\theta_{\text{old}}}, \pi_{\theta})\big]\Big).$$

We solve for $\theta$ by iterating the inner and outer optimization programs independently: we fix the parameters $\theta$ to their current value and optimize for the Lagrangian multiplier $\alpha$ (inner minimization), and then we fix $\alpha$ to its current value and optimize for $\theta$ (outer maximization). In practice we found it effective to simply perform one gradient step each on the inner and outer optimization for each sampled batch of data.
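This alternating scheme can be illustrated on a toy constrained problem (illustrative objective, constraint and learning rates): maximize $f(\theta) = -(\theta - 3)^2$ subject to $\theta^2 \leq 1$, whose solution is $\theta^* = 1$ with multiplier $\alpha^* = 2$:

```python
def alternating_lagrangian(theta=0.0, alpha=0.0, eps=1.0,
                           lr_theta=0.01, lr_alpha=0.05, steps=5000):
    """Toy version of the alternating update: maximize f(theta) = -(theta-3)^2
    subject to c(theta) = theta^2 <= eps, via single alternating gradient
    steps on theta (outer maximization) and the multiplier alpha (inner
    minimization, projected to alpha >= 0)."""
    for _ in range(steps):
        # Inner step: descend L w.r.t. alpha; alpha grows while c(theta) > eps.
        alpha = max(0.0, alpha - lr_alpha * (eps - theta ** 2))
        # Outer step: ascend L w.r.t. theta.
        grad_theta = -2.0 * (theta - 3.0) - alpha * 2.0 * theta
        theta = theta + lr_theta * grad_theta
    return theta, alpha

theta, alpha = alternating_lagrangian()
```

The iterates spiral into the saddle point of the Lagrangian, settling at the constrained optimum.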

The optimization given above is general, i.e. it works for any type of policy. As described in the main paper, we consider hierarchical policies of the form

$$\pi_{\theta}(a|s,i) = \sum_{j=1}^{M} \pi^{H}_{\theta}(j|s,i)\, \pi^{L}_{\theta}(a|s,j).$$

In particular, in all experiments we made use of a mixture of Gaussians parametrization, where the high-level policy is a categorical distribution over low-level Gaussian policies, i.e.,

$$\pi_{\theta}(a|s,i) = \sum_{j=1}^{M} \alpha_{j}(s,i)\, \mathcal{N}\big( a \,\big|\, \mu_{j}(s), \Sigma_{j}(s) \big),$$

where $j$ denotes the index of the mixture components, the high-level policy assigns probabilities $\alpha_{j}(s,i)$ to each mixture component for a state $s$ given the task $i$, and the low-level policies are all Gaussian. Here the $\alpha_{j}$s are the probabilities of a categorical distribution over the components.
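For concreteness, the density of such a mixture policy at a single state can be evaluated as in the following self-contained sketch (parameter names are ours; the task only enters through the categorical logits):

```python
import numpy as np

def mixture_log_prob(action, logits, mus, log_stds):
    """Log-density of a mixture-of-Gaussians policy at one state.

    logits: [M] categorical logits from the task-conditioned high level.
    mus, log_stds: [M, action_dim] parameters of the Gaussian components
    (diagonal covariance). Illustrative layout, not the paper's API.
    """
    # Log-softmax over components (stabilized).
    m0 = logits.max()
    log_pi = logits - (m0 + np.log(np.sum(np.exp(logits - m0))))

    # Diagonal Gaussian log-density per component.
    stds = np.exp(log_stds)
    comp = -0.5 * np.sum(((action - mus) / stds) ** 2
                         + 2.0 * log_stds + np.log(2.0 * np.pi), axis=1)

    # log sum_j alpha_j N_j(a) via log-sum-exp for stability.
    z = log_pi + comp
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))
```

With a single component this reduces exactly to the log-density of a flat Gaussian policy.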

We also define the following distance function between old and new mixture of Gaussians policies

$$\mathcal{D}(\pi_{\text{old}}, \pi_{\theta}) = \mathrm{KL}\big( \alpha_{\text{old}}(s,i) \,\|\, \alpha(s,i) \big) + \frac{1}{M} \sum_{j=1}^{M} \mathrm{KL}\big( \pi^{L}_{\text{old}}(a|s,j) \,\|\, \pi^{L}_{\theta}(a|s,j) \big),$$

where the first term evaluates the KL between the categorical distributions and the second term corresponds to the average KL across the Gaussian components, as also described in the main paper (cf. Equation 5 in the main paper).

In order to bound the change of the categorical distribution, the means and the covariances of the components independently – which makes it easy to control the convergence of the policy and can prevent premature convergence, as argued in Abdolmaleki et al. [2018a] – we separate out the following three intermediate policies

$$\pi_{\mu} = \sum_{j} \alpha_{\text{old},j}(s,i)\, \mathcal{N}\big( \mu_{j}(s), \Sigma_{\text{old},j}(s) \big), \quad \pi_{\Sigma} = \sum_{j} \alpha_{\text{old},j}(s,i)\, \mathcal{N}\big( \mu_{\text{old},j}(s), \Sigma_{j}(s) \big), \quad \pi_{\alpha} = \sum_{j} \alpha_{j}(s,i)\, \mathcal{N}\big( \mu_{\text{old},j}(s), \Sigma_{\text{old},j}(s) \big),$$

which yields the following final optimization program

$$\max_{\theta}\; \mathbb{E}_{i,s}\, \mathbb{E}_{a \sim q}\big[ \log \pi_{\theta}(a|s,i) \big] \quad \text{s.t.} \quad \mathcal{D}(\pi_{\text{old}}, \pi_{\mu}) \leq \epsilon_{\mu},\; \mathcal{D}(\pi_{\text{old}}, \pi_{\Sigma}) \leq \epsilon_{\Sigma},\; \mathcal{D}(\pi_{\text{old}}, \pi_{\alpha}) \leq \epsilon_{\alpha}.$$

This decoupling allows us to set different bounds $\epsilon_{\mu}$, $\epsilon_{\Sigma}$, $\epsilon_{\alpha}$ for the change of the means, covariances and categorical probabilities. Different $\epsilon$ lead to different learning rates. We always set a much smaller epsilon for the covariance and the categorical distribution than for the mean. The intuition is that while we would like the distribution to converge quickly in action space, we also want to maintain exploration both locally (via the covariance matrices) and globally (via the high-level categorical distribution) to avoid premature convergence.
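The three bounded terms can be evaluated separately; the sketch below shows the standard special case of diagonal Gaussian components (the exact decomposition in the paper follows Abdolmaleki et al. [2018a], so take this as illustrative):

```python
import numpy as np

def decoupled_kls(alpha_old, alpha_new, mu_old, mu_new, std_old, std_new):
    """Three KL terms bounded separately, for diagonal Gaussian components.

    alpha_*: [M] categorical probabilities; mu_*, std_*: [M, action_dim].
    Returns (kl_categorical, mean_kl, cov_kl), to be compared against
    eps_alpha, eps_mu, eps_sigma respectively.
    """
    kl_cat = np.sum(alpha_old * (np.log(alpha_old) - np.log(alpha_new)))

    # KL(old || new) between diagonal Gaussians, split into a mean term
    # (new mean, old covariance) and a covariance term (old mean, new cov).
    mean_term = 0.5 * np.sum((mu_new - mu_old) ** 2 / std_old ** 2, axis=1)
    cov_term = 0.5 * np.sum(std_old ** 2 / std_new ** 2 - 1.0
                            + 2.0 * (np.log(std_new) - np.log(std_old)),
                            axis=1)
    return kl_cat, mean_term.mean(), cov_term.mean()
```

Identical old and new parameters give zero for all three terms, and each term reacts only to the quantity it is meant to bound.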

A.2 Algorithmic Details

A.2.1 Pseudocode for the full procedure

We provide pseudo-code listings of the full optimization procedure – and the asynchronous data gathering – performed by RHPO in Algorithms 1 and 2. The implementation relies on Sonnet [Reynolds et al., 2017] and TensorFlow [Abadi et al., 2015].

  Input: number of update steps N_max, number of update steps between target updates N_target, number of action samples per state, KL regularization parameters ε, ε_μ, ε_Σ, ε_α, initial parameters θ for the policy and φ for the Q-function
  initialize N = 0
  while N < N_max do
     for k in [1, N_target] do
        sample a batch of trajectories from the replay buffer
        sample actions from the current policy to estimate the expectations below
        // compute mean gradients over batch for policy, Lagrange multipliers and Q-function
        δθ following Eq. 5
        δη following Eq. 8
        δφ with targets following Eq. 6
        // apply gradient updates
        θ ← optimizer_update(θ, δθ)
        η ← optimizer_update(η, δη)
        φ ← optimizer_update(φ, δφ)
     end for
     // update target networks
     θ' ← θ, φ' ← φ, N ← N + 1
  end while
Algorithm 1 Asynchronous Learner
  Input: number of total trajectories requested N_traj, steps per episode T, scheduling period ξ
  initialize N = 0
  while N < N_traj do
     fetch current policy parameters θ
     // collect new trajectory from the environment
     for t in [0, T-1] do
        if t mod ξ == 0 then
           // sample active task i from uniform distribution over tasks
        end if
        // execute action a_t ~ π_θ(·|s_t, i) and determine rewards for all tasks
     end for
     send batch of trajectories to replay buffer, N ← N + 1
  end while
Algorithm 2 Asynchronous Actor

A.2.2 Details on the policy improvement step

As described in the main paper, we consider the same setting as scheduled auxiliary control (SAC-X) [Riedmiller et al., 2018] to perform policy improvement (with uniform random switches between tasks every N steps within an episode, i.e. the SAC-U setting).

Given a replay buffer $\mathcal{B}$ containing data gathered from all tasks, where for each trajectory snippet we record the rewards for all tasks as a vector in the buffer, we define the retrace objective for learning $Q(s, a, i; \phi)$, parameterized via $\phi$, following Riedmiller et al. [2018] as

$$\min_{\phi}\; \mathbb{E}_{\tau \sim \mathcal{B}}\Big[ \big( Q^{\text{ret}}_{t} - Q(s_{t}, a_{t}, i; \phi) \big)^{2} \Big], \;\; \text{with} \;\; Q^{\text{ret}}_{t} = Q(s_{t}, a_{t}, i; \phi') + \sum_{j \geq t} \gamma^{j-t} \Big( \prod_{k=t+1}^{j} c_{k} \Big) \Big[ r_{j}(i) + \gamma\, \mathbb{E}_{a \sim \pi}\big[ Q(s_{j+1}, a, i; \phi') \big] - Q(s_{j}, a_{j}, i; \phi') \Big],$$

where the importance weights are defined as $c_{k} = \min\big( 1, \pi(a_{k}|s_{k}, i) / b(a_{k}|s_{k}) \big)$, with $b$ denoting an arbitrary behavior policy; in particular this will be the policy for the tasks executed during an episode, as in [Riedmiller et al., 2018]. Note that, in practice, we truncate the infinite sum after $N$ steps, bootstrapping with the value estimate at the final step. We further perform optimization of Equation (6) via gradient descent and make use of a target network [Mnih et al., 2015], denoted with parameters $\phi'$, which we copy from $\phi$ after a fixed number of gradient steps. We reiterate that, as the state-action value function remains independent of the policy's structure, we are able to utilize any other off-the-shelf Q-learning algorithm such as TD(0) [Sutton, 1988].
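The truncated Retrace targets can be computed by a simple backward recursion over a trajectory snippet; a numpy sketch (the expectation over π is assumed precomputed, and the function name is ours):

```python
import numpy as np

def retrace_targets(q, q_next_exp, rewards, c, gamma):
    """Truncated Retrace targets for one trajectory snippet (sketch).

    q: [T] Q(s_t, a_t) from the target network.
    q_next_exp: [T] E_{a~pi}[Q(s_{t+1}, a)] bootstrap values.
    rewards: [T] rewards for the task of interest.
    c: [T] truncated importance weights min(1, pi/b).
    Uses the backward recursion
    Q_ret[t] = q[t] + delta[t] + gamma * c[t+1] * (Q_ret[t+1] - q[t+1]).
    """
    T = len(q)
    delta = rewards + gamma * q_next_exp - q  # TD errors under the target net
    q_ret = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = delta[t] + (gamma * c[t + 1] * acc if t + 1 < T else 0.0)
        q_ret[t] = q[t] + acc
    return q_ret
```

With all importance weights zero the targets reduce to one-step TD targets, which is a convenient sanity check.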

Given that we utilize the same policy evaluation mechanism as SAC-U, it is worth pausing here to identify the differences between SAC-U and our approach. The main difference is in the policy parameterization: SAC-U used a monolithic policy for each task (although a neural network with shared components, potentially leading to some implicit task transfer, was used). Furthermore, we perform policy optimization based on MPO instead of using stochastic value gradients (SVG [Heess et al., 2015]). We can thus recover a variant of plain SAC-U using MPO if we drop the hierarchical policy parameterization, which we employ in the single task experiments in the main paper.

A.2.3 Network Architectures

To represent the Q-function in the multitask case we use the network architecture from SAC-X (see the right sub-figure in Figure 6). The proprioception of the robot, the features of the objects and the actions are fed together into a torso network. At the input we use a fully connected first layer of 200 units, followed by a layer normalization operator, an optional tanh activation and another fully connected layer of 200 units with an ELU activation function. The output of this torso network is shared by independent head networks for each of the tasks (or intentions, as they are called in the SAC-X paper). Each head has two fully connected layers and outputs a Q-value for its task, given the input of the network. Using the task identifier we can then compute the Q-value for a given sample by discrete selection of the corresponding head output.
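A compact sketch of this torso-plus-heads structure (plain numpy linear layers with a hypothetical parameter container; the actual implementation uses Sonnet):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def multitask_q(obs_action, params, task_id):
    """Shared torso, per-task heads. `params` is a hypothetical container:
    'torso1'/'torso2' are (weight, bias) pairs and 'heads' is a list of
    per-task (w1, b1, w2, b2) tuples."""
    h = obs_action @ params['torso1'][0] + params['torso1'][1]  # first layer
    h = np.tanh(layer_norm(h))                                  # layer norm + tanh
    h = elu(h @ params['torso2'][0] + params['torso2'][1])      # second layer, ELU
    w1, b1, w2, b2 = params['heads'][task_id]                   # select task head
    h = elu(h @ w1 + b1)
    return h @ w2 + b2                                          # scalar Q-value
```

Selecting the head by `task_id` mirrors the discrete selection of the corresponding head output described above.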

Figure 6: Schematics of the fully connected networks in SAC-X. While we use the Q-function (right sub-figure) architecture in all multitask experiments, we investigate variations of the policy architecture (left sub-figure) in this paper (see Figure 7).

While we use this network architecture for the Q-function in all multitask experiments, we investigate different architectures for the policy in this paper. The original SAC-X policy architecture is shown in Figure 6 (left sub-figure). The main structure follows the same basic principle as the Q-function architecture. The only difference is that the heads compute the required parameters of the policy distribution we want to use (see subsection A.2.4). This architecture is referred to as independent heads (or task-dependent heads).


Figure 7: Schematics of the alternative multitask policy architectures used in this paper. Left sub-figure: the monolithic architecture; Right sub-figure: the hierarchical architecture.

The alternatives we investigate in this paper are the monolithic policy architecture (see Figure 7, left sub-figure) and the hierarchical policy architecture (see Figure 7, right sub-figure). For the monolithic policy architecture we essentially reduce the original policy architecture to a single head and append the task id as a one-hot encoded vector to the input. For the hierarchical architecture, we build on the same torso and create a set of networks parameterizing the Gaussian components, which are shared across tasks, and a task-specific network to parameterize the categorical distribution for each task. The final mixture distribution is task-dependent through the high-level controller, while the low-level policies are task-independent.
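The head structure of the hierarchical architecture can be sketched as follows (hypothetical names, single linear layers for brevity; component heads are shared across tasks, categorical heads are per-task):

```python
import numpy as np

def hierarchical_policy_params(obs, torso_w, comp_heads, cat_heads, task_id):
    """Sketch of the hierarchical policy head structure.

    comp_heads: list of (w_mu, w_std) matrices parameterizing the Gaussian
    components, shared across tasks. cat_heads: one matrix of categorical
    logits per task. Illustrative structure, not the paper's code."""
    h = np.tanh(obs @ torso_w)                       # shared torso features
    components = [(h @ w_mu, h @ w_std) for (w_mu, w_std) in comp_heads]
    logits = h @ cat_heads[task_id]                  # task-dependent high level
    probs = np.exp(logits - logits.max())            # softmax over components
    return components, probs / probs.sum()
```

Only the categorical probabilities change with `task_id`; the component parameters are computed once and reused by every task.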

A.2.4 Algorithm Hyperparameters

In this section we outline the details of the hyperparameters used for RHPO and the baselines in both the single-task and multitask experiments. All experiments use feed-forward neural networks. We consider a flat policy represented by a Gaussian distribution and a hierarchical policy represented by a mixture of Gaussians distribution. The flat policy is given by a Gaussian distribution with a diagonal covariance matrix, i.e.,

$$\pi_{\theta}(a|s) = \mathcal{N}\big( \mu, \Sigma \big), \quad \Sigma = A A^{T}.$$

The neural network outputs the mean $\mu$ and the diagonal Cholesky factor $A$, such that $\Sigma = A A^{T}$. The diagonal factor $A$ has positive diagonal elements enforced by the softplus transform $A_{ii} \leftarrow \log\big( 1 + \exp(A_{ii}) \big)$ to ensure positive definiteness of the diagonal covariance matrix. The mixture of Gaussians policy has a number of Gaussian components as well as a categorical distribution for selecting among them. The neural network outputs the Gaussian components based on the same setup described above for a single Gaussian, and outputs the logits representing the categorical distribution. Table 1 shows the hyperparameters we used for the single-task experiments. We found layer normalization and a hyperbolic tangent (tanh) on the layer following the layer normalization to be important for the stability of the algorithms. For RHPO the most important hyperparameters are the KL constraints in Step 1 and Step 2 of the algorithm.
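The positivity constraint on the Cholesky diagonal can be illustrated with a tiny sketch (function names are ours):

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)) > 0 for all real x
    return np.log1p(np.exp(x))

def diagonal_gaussian_params(net_mean, net_pre_diag):
    """Map raw network outputs to a valid diagonal Gaussian: the Cholesky
    diagonal is made strictly positive by the softplus transform, so the
    covariance diag(softplus(.))^2 is positive definite."""
    chol_diag = softplus(net_pre_diag)
    return net_mean, chol_diag ** 2  # mean and covariance diagonal
```

Even strongly negative network outputs map to small but strictly positive variances, which keeps the resulting density well-defined.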

Hyperparameters Hierarchical Single Gaussian
Policy net 200-200-200 200-200-200
Number of actions sampled per state 10 10
Q function net 500-500-500 500-500-500
Number of components 3 NA
ε (KL constraint, Step 1) 0.1 0.1
ε_μ (mean constraint, Step 2) 0.0005 0.0005
ε_Σ (covariance constraint, Step 2) 0.00001 0.00001
ε_α (categorical constraint, Step 2) 0.0001 NA
Discount factor () 0.99 0.99
Adam learning rate 0.0002 0.0002
Replay buffer size 2000000 2000000
Target network update period 250 250
Batch size 256 256
Activation function elu elu
Layer norm on first layer Yes Yes
Tanh on output of layer norm Yes Yes
Tanh on input actions to Q-function Yes Yes
Retrace sequence length 10 10
Table 1: Hyperparameters - Single Task
Hyperparameters Hierarchical Independent Monolith
Policy torso (shared across tasks) 400-200
Policy task-dependent heads 100 (high-level controller) 100 NA
Policy shared heads 100 (per mixture component) NA 200
Number of action samples 20
Q function torso (shared across tasks) 400-400
Q function head (per task) 300
Number of components number of tasks NA NA
Discount factor () 0.99
Replay buffer size 1e6 * number of tasks
Target network update period 500
Batch size 256 (512 for Pile1)
Table 2: Hyperparameters - Multitask. Values are taken from the single task experiments with the above mentioned changes.

A.3 Additional Details on the SAC-U with SVG baseline

For the SAC-U baseline we used a re-implementation of the method from [Riedmiller et al., 2018] using SVG [Heess et al., 2015] for optimizing the policy. Concretely, we use the same basic network structure as for the "Monolithic" baseline with MPO and parameterize the policy as

$$\pi_{\theta}(a|s,i) = \mathcal{N}\big( \mu_{\theta}(s,i),\, \sigma_{\theta}(s,i)^{2} I \big),$$

where $I$ denotes the identity matrix and $\sigma_{\theta}$ is computed from the network output via a softplus activation function.

Together with entropy regularization, as described in [Riedmiller et al., 2018], the policy can be optimized via gradient ascent, following the reparameterized gradient for states $s$ sampled from the replay buffer:

$$\nabla_{\theta}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot|s,i)}\big[ Q(s,a,i) - \alpha \log \pi_{\theta}(a|s,i) \big],$$

which can be computed, using the reparameterization trick, as

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\Big[ \nabla_{\theta}\, \big( Q(s, g_{\theta}(s,\epsilon,i), i) - \alpha \log \pi_{\theta}( g_{\theta}(s,\epsilon,i) \,|\, s, i ) \big) \Big],$$

where $a = g_{\theta}(s,\epsilon,i) = \mu_{\theta}(s,i) + \sigma_{\theta}(s,i)\, \epsilon$ is now a deterministic function of a sample $\epsilon$ from the standard multivariate normal distribution. See e.g. Heess et al. [2015] (for SVG) as well as Kingma and Welling [2014] (for the reparameterization trick) for a detailed explanation.
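A small Monte-Carlo sketch of this pathwise estimator (illustrative only; the entropy term is omitted and `grad_q_fn` is a stand-in for the critic's action gradient):

```python
import numpy as np

def reparameterized_sample(mu, sigma, rng):
    """a = g(s, eps) = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def pathwise_mean_gradient(mu, sigma, grad_q_fn, rng, n=2000):
    """Monte-Carlo estimate of d/d(mu) E[Q(a)] via the pathwise gradient.

    grad_q_fn(a) returns dQ/da. Since a = mu + sigma*eps gives da/dmu = 1,
    the estimator is simply the mean of dQ/da over samples."""
    grads = [grad_q_fn(reparameterized_sample(mu, sigma, rng))
             for _ in range(n)]
    return np.mean(grads, axis=0)
```

For a quadratic critic $Q(a) = -a^{2}$ the exact gradient of $\mathbb{E}[Q]$ w.r.t. the mean is $-2\mu$, which the estimator recovers up to Monte-Carlo noise.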

A.4 Details on the Experimental Setup

A.4.1 Simulation (Single- and Multitask)

For the simulation of the robot arm experiments the physics simulator MuJoCo was used, with a model we identified from the real robot setup.

We run the simulation experiments for 2-7 days (depending on the task), with access to 2-5 recent CPUs with 32 cores each (depending on the number of actors) and 2 recent NVIDIA GPUs for the learner. Computation for data buffering is negligible.

A.4.2 Real Robot Multitask

Compared to simulation where the ground truth position of all objects is known, in the real robot setting, three cameras on the basket track the cube using fiducials (augmented reality tags).

For safety reasons, external forces are measured at the wrist and the episode is terminated if a threshold of 20N is exceeded on any of the three principal axes (this is handled as a terminal state with reward 0 for the agent), adding further to the difficulty of the task.

The real robot setup differs from the simulation in the reset behaviour between episodes, since objects need to be physically moved around when randomizing, which takes a considerable amount of time. To keep overhead small, object positions are randomized only every 25 episodes, using a hand-coded controller. Objects are also placed back in the basket if they were thrown out during the previous episode. Other than that, objects start in the same place as they were left in the previous episode. The robot’s starting pose is randomized each episode, as in simulation.

A.5 Task Descriptions

A.5.1 Pile1

Figure 8: Sawyer Set-Up.

For this task we have a real setup and a MuJoCo simulation that are well aligned. It consists of a Sawyer robot arm mounted on a table and equipped with a Robotiq 2F-85 parallel gripper. In front of the robot there is a basket of size 20x20 cm which contains three cubes with an edge length of 5 cm (see Figure 8).

The agent is provided with proprioception information for the arm (joint positions, velocities and torques), and the tool center point position computed via forward kinematics. For the gripper, it receives the motor position and velocity, as well as a binary grasp flag. It also receives a wrist sensor’s force and torque readings. Finally, it is provided with the cubes’ poses as estimated via the fiducials, and the relative distances between the arm’s tool center point and each object. At each time step, a history of two previous observations is provided to the agent, along with the last two joint control commands, in order to account for potential communication delays on the real robot. The observation space is detailed in Table 4.

The robot arm is controlled in Cartesian mode at 20Hz. The action space for the agent is 5-dimensional, as detailed in Table 3. The gripper movement is also restricted to a cubic volume above the basket using virtual walls.

Entry Dimensions Unit Range
Translational Velocity in x, y, z 3 m/s [-0.07, 0.07]
Wrist Rotation Velocity 1 rad/s [-1, 1]
Finger speed 1 tics/s [-255, 255]
Table 3: Action space for the Sawyer experiments.

Entry Dimensions Unit
Joint Position (Arm) 7 rad
Joint Velocity (Arm) 7 rad/s
Joint Torque (Arm) 7 Nm
Joint Position (Hand) 1 rad
Joint Velocity (Hand) 1 tics/s
Force-Torque (Wrist) 6 N, Nm
Binary Grasp Sensor 1 au
TCP Pose 7 m, au
Last Control Command (Joint Velocity) 8 rad/s
Green Cube Pose 7 m, au
Green Cube Relative Pose 7 m, au
Yellow Cube Pose 7 m, au
Yellow Cube Relative Pose 7 m, au
Blue Cube Pose 7 m, au
Blue Cube Relative Pose 7 m, au
Table 4: Observations used in the experiments with the Sawyer arm. An object's pose is represented as its world coordinate position and quaternion. In the table, m denotes meters, rad denotes radians, and au refers to quaternion components in arbitrary units.

For the Pile1 experiment we use 7 different tasks to learn, following the SAC-X principles. The first 6 tasks are seen as auxiliary tasks that help to learn the final task STACK_AND_LEAVE(G, Y) of stacking the green cube on top of the yellow cube. Overview of the tasks used:

  • REACH(G):
    Minimize the distance of the TCP to the green cube.

  • GRASP:
    Activate grasp sensor of gripper ("inward grasp signal" of Robotiq gripper)

  • LIFT(G):
    Increase z coordinate of an object more than 3cm relative to the table.

  • PLACE_WIDE(G, Y):
    Bring the green cube to a position 5cm above the yellow cube.

  • PLACE_NARROW(G, Y):
    Like PLACE_WIDE(G, Y) but more precise.

  • STACK(G, Y):
    Sparse binary reward for bringing the green cube on top of the yellow one (with 3cm tolerance horizontally and 1cm vertically) and disengaging the grasp sensor.

  • STACK_AND_LEAVE(G, Y):
    Like STACK(G, Y), but the arm additionally needs to move 10cm above the green cube.

Let $d(o_1, o_2)$ be the distance between the reference points of two objects (the reference of a cube is its center of mass; the TCP is the reference of the gripper), and let $d_{A}(o_1, o_2)$ be the distance restricted to the dimensions in the set of axes $A$. The reward functions above are defined in terms of these distances.
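Such distance helpers could look as follows (a sketch; `reach_reward`'s exact shaping is hypothetical and only illustrates how a distance and a tolerance combine):

```python
import numpy as np

def dist(p1, p2, axes=None):
    """d(o1, o2): Euclidean distance between reference points; restricting
    to a subset of axes gives d_A-style distances (e.g. horizontal only)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    if axes is not None:
        p1, p2 = p1[list(axes)], p2[list(axes)]
    return np.linalg.norm(p1 - p2)

def reach_reward(tcp, cube, tol=0.02):
    """Hypothetical shaped reach reward: full reward within the tolerance,
    decaying with distance otherwise (not the paper's exact definition)."""
    d = dist(tcp, cube)
    return 1.0 if d < tol else 1.0 - np.tanh(d)
```

The axis-restricted variant is what allows separate horizontal and vertical tolerances, as used for the stacking rewards.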
A.5.2 Pile2

Figure 9: The Pile2 set-up in simulation with two main tasks: The first is to stack the blue on the red cube, the second is to stack the red on the blue cube.

For the Pile2 task, taken from Riedmiller et al. [2018], we use a different robot arm, control mode and task setup to emphasize that RHPO's improvements are not restricted to Cartesian control or a specific robot, and that the approach also works with multiple external tasks.

Here, the agent controls a simulated Kinova Jaco robot arm, equipped with a Kinova KG-3 gripper. The robot faces a 40 x 40 cm basket that contains a red cube and a blue cube. Both cubes have an edge length of 5 cm (see Figure 9). The agent is provided with proprioceptive information for the arm and the fingers (joint positions and velocities) as well as the tool center point position (TCP) computed via forward kinematics. Further, the simulated gripper is equipped with a touch sensor for each of the three fingers, whose value is provided to the agent as well. Finally, the agent receives the cubes’ poses, their translational and rotational velocities and the relative distances between the arm’s tool center point and each object. Neither observation nor action history is used in the Pile2 experiments. The cubes are spawned at random on the table surface and the robot hand is initialized randomly above the table-top with a height offset of up to 20 cm above the table (minimum 10 cm). The observation space is detailed in Table 6.

Entry Dimensions Unit Range
Joint Velocity (Arm) 6 rad/sec [-0.8, 0.8]
Joint Velocity (Hand) 3 rad/sec [-0.8, 0.8]
Table 5: Action space used in the experiments with the Kinova Jaco Arm.

The robot arm is controlled in raw joint velocity mode at 20 Hz. The action space is 9-dimensional as detailed in Table 5. There are no virtual walls and the robot’s movement is solely restricted by the velocity limits and the objects in the scene.

Analogous to Pile1 and the SAC-X setup, we use 10 different tasks for Pile2. The first 8 tasks are seen as auxiliary tasks that the agent uses to learn the two main tasks, PILE_RED and PILE_BLUE, which represent stacking the red cube on the blue cube and stacking the blue cube on the red cube, respectively. The tasks used in the experiment are:

  • REACH(R):
    Minimize the distance of the TCP to the red cube.

  • REACH(B):
    Minimize the distance of the TCP to the blue cube.

  • MOVE(R):
    Move the red cube.

  • MOVE(B):
    Move the blue cube.

  • LIFT(R):
    Increase the z-coordinate of the red cube to more than 5cm relative to the table.

  • LIFT(B):
    Increase the z-coordinate of the blue cube to more than 5cm relative to the table.

  • Bring the red cube to a position above and close to the blue cube.

  • Bring the blue cube to a position above and close to the red cube.

  • PILE(R):
    Place the red cube on another object (touches the top). Only given when the cube doesn’t touch the robot or the table.

  • PILE(B):
    Place the blue cube on another object (touches the top). Only given when the cube doesn’t touch the robot or the table.

The sparse reward above(A, B) is given by comparing the bounding boxes of the two objects A and B. If the bounding box of object A is completely above the highest point of object B's bounding box, above(A, B) is 1, otherwise above(A, B) is 0.
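This predicate can be computed directly from the two axis-aligned bounding boxes; a minimal sketch (the tuple layout is ours for illustration):

```python
def above(bbox_a, bbox_b):
    """Sparse above(A, B) predicate from axis-aligned bounding boxes.

    Each bbox is (min_corner, max_corner), with corners as (x, y, z) tuples.
    Returns 1.0 if the whole box A lies above the highest point of box B,
    otherwise 0.0.
    """
    a_min, _ = bbox_a   # lowest corner of A
    _, b_max = bbox_b   # highest corner of B
    return 1.0 if a_min[2] > b_max[2] else 0.0
```

Note the comparison uses only the z components: a box partially overlapping B in height does not count as above.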

Entry Dimensions Unit
Joint Position (Arm) 6 rad
Joint Velocity (Arm) 6 rad/s
Joint Position (Hand) 3 rad
Joint Velocity (Hand) 3 rad/s
TCP Position 3 m
Touch Force (Fingers) 3 N
Red Cube Pose 7 m, au
Red Cube Velocity 6 m/s, dq/dt
Red Cube Relative Position 3 m
Blue Cube Pose 7 m, au
Blue Cube Velocity 6 m/s, dq/dt
Blue Cube Relative Position 3 m
Lid Position 1 rad
Lid Velocity 1 rad/s
Table 6: Observations used in the experiments with the Kinova Jaco arm. An object's pose is represented as its world coordinate position and quaternion. The lid position and velocity are only used in the Clean-Up task. In the table, m denotes meters, rad denotes radians, au refers to quaternion components in arbitrary units, and dq/dt denotes the rate of change of the quaternion.

A.5.3 Clean-Up

The Clean-Up task is also taken from Riedmiller et al. [2018] and builds on the setup described for the Pile2 task. Besides the two cubes, the workspace contains an additional box with a moveable lid that is always closed initially (see Figure 10). The agent's goal is to clean up the scene by placing the cubes inside the box. In addition to the observations used in the Pile2 task, the agent observes the lid's angle and its angular velocity.

Figure 10: The Clean-Up task set-up in simulation. The task is solved when both bricks are in the box.

Analogous to Pile2 and the SAC-X setup, we use 13 different tasks for Clean-Up. The first 12 tasks are seen as auxiliary tasks that the agent uses to learn the main task ALL_INSIDE_BOX. The tasks used in this experiment are:

  • REACH(R):
    Minimize the distance of the TCP to the red cube.

  • REACH(B):
    Minimize the distance of the TCP to the blue cube.

  • MOVE(R):
    Move the red cube.

  • MOVE(B):
    Move the blue cube.

  • NO_TOUCH:
    Sparse binary reward, given when neither of the touch sensors is active.

  • LIFT(R):
    Increase the z-coordinate of the red cube to more than 5cm relative to the table.

  • LIFT(B):
    Increase the z-coordinate of the blue cube to more than 5cm relative to the table.

  • OPEN_BOX:
    Open the lid up to 85 degrees.