1 Introduction
Model-free deep reinforcement learning (RL) has demonstrated potential in many challenging domains, ranging from games (Mnih et al., 2013; Silver et al., 2016) to manipulation (Levine et al., 2016) and locomotion tasks (Schulman et al., 2015). Part of the promise of incorporating deep representations into RL is the potential for the emergence of hierarchies, which can enable reasoning and decision making at different levels of abstraction. A hierarchical RL algorithm could, in principle, efficiently discover solutions to complex problems and reuse representations between related tasks. While hierarchical structures have been observed to emerge in deep networks applied to perception tasks, such as computer vision
(LeCun et al., 2015), it remains an open question how suitable hierarchical representations can be induced in a reinforcement learning setting. A central challenge with these methods is the automation of the hierarchy construction process: hand-specified hierarchies require considerable expertise and insight to design and limit the generality of the approach (Sutton et al., 1999; Kulkarni et al., 2016; Tessler et al., 2017), while automated methods must contend with severe challenges, such as the collapse of all primitives into just one useful skill (Bacon et al., 2017) or the need to hand-engineer primitive discovery objectives or intermediate goals (Heess et al., 2016).
When learning hierarchies automatically, we must answer a critical question: What objective can we use to ensure that lower layers in a hierarchy are useful to the higher layers? Prior work has proposed a number of heuristic approaches, such as hiding some parts of the observation from the lower layers
(Heess et al., 2016), hand-designing state features and training the lower-layer behaviors to maximize mutual information against these features (Florensa et al., 2017), or constructing diversity-seeking priors that cause lower-layer primitives to take on different roles (Daniel et al., 2012; Eysenbach et al., 2018). Oftentimes, these heuristics intentionally cripple the lower layers of the hierarchy, for example, by withholding information, so as to force a hierarchy to emerge, or else limit the higher levels of the hierarchy to selecting from among a discrete set of skills (Bacon et al., 2017). In both cases, the hierarchy is forced to emerge because neither higher nor lower layers can solve the problem alone. However, constraining the layers in this way can involve artificial and task-specific restrictions (Heess et al., 2016) or else diminish overall performance.
Instead of crippling or limiting the different levels of the hierarchy, we can imagine a hierarchical framework in which each layer directly attempts to solve the task and, if it is not fully successful, makes the job easier for the layer above it. In this paper, we explore a solution to the hierarchical reinforcement learning problem based on this principle. In our framework, each layer of the hierarchy corresponds to a policy with internal latent variables. These latent variables determine how the policy maps states into actions, and the latent variables of the lower-level policy act as the action space for the higher level. Crucially, each layer is unconstrained, both in its ability to sense and affect the environment: each layer receives the full state as the observation, and each layer is, by construction, fully invertible, so that higher layers in the hierarchy can undo any transformation of the action space imposed by the layers below.
In order to train policies with latent variables, we cast the problem of reinforcement learning into the framework of probabilistic graphical models. To that end, we build on maximum entropy reinforcement learning (Todorov, 2007; Ziebart et al., 2008), where the RL objective is modified to optimize for stochastic policies that maximize both reward and entropy. It can be shown that, in this framework, the RL problem becomes equivalent to an inference problem in a particular type of probabilistic graphical model (Toussaint, 2009). By augmenting this model with latent variables, we can derive a method that simultaneously produces a policy that attempts to solve the task, and a latent space that can be used by a higher-level controller to steer the policy’s behavior.
The particular latent variable model representation that we use is based on normalizing flows (Dinh et al., 2016) that transform samples from a spherical Gaussian prior latent variable distribution into a posterior distribution, which in the case of our policies corresponds to a distribution over actions. When this transformation is described by a general-purpose neural network, the model can represent any distribution over the observed variable when the network is large enough. By conditioning the entire generation process on the state, we obtain a policy that can represent any conditional distribution over actions. When combined with maximum entropy reinforcement learning algorithms, this leads to an RL method that is expressive, powerful, and surprisingly stable. In fact, our experimental evaluation shows that this approach can attain state-of-the-art results on a number of continuous control benchmark tasks by itself, independently of its applicability to hierarchical RL.
The contributions of our paper consist of a stable and scalable algorithm for training maximum entropy policies with latent variables as well as a framework for constructing hierarchies out of these latent variable policies. Hierarchies are constructed in layerwise fashion, by training one latent variable policy at a time, with each policy using the latent space of the policy below it as an action space, as illustrated in Figure 2. Each layer can be trained either on the true reward for the task, without any modification, or on a lower-level shaping reward. For example, for learning a complex navigation task, lower layers might receive a reward that promotes locomotion, regardless of direction, while higher layers aim to reach a particular location. When the shaping terms are not available, the same reward function can be used for each layer, and we still observe significant improvements from hierarchy. Our experimental evaluation illustrates that our method produces state-of-the-art results in terms of sample complexity on a variety of benchmark tasks, including a humanoid robot with 21 actuators, even when training a single layer of the hierarchy, and can further improve performance when additional layers are added. Furthermore, we illustrate that more challenging tasks with sparse reward signals can be solved effectively by providing shaping rewards to lower layers.
2 Related Work
A number of prior works have explored how reinforcement learning can be cast in the framework of probabilistic inference (Kappen, 2005; Todorov, 2007; Ziebart et al., 2008; Toussaint, 2009; Peters et al., 2010; Neumann, 2011). Our approach is based on a formulation of reinforcement learning as inference in a graphical model (Ziebart et al., 2008; Toussaint, 2009; Levine, 2014). Prior work has shown that this framework leads to an entropy-maximizing version of reinforcement learning, where the standard objective is augmented with a term that also causes the policy to maximize entropy (Ziebart et al., 2008; Haarnoja et al., 2017, 2018; Nachum et al., 2017; Schulman et al., 2017a). Intuitively, this encourages policies that maximize reward while also being as random as possible, which can be useful for robustness, exploration, and, in our case, increasing the diversity of behaviors for lower layers in a hierarchy. Building on this graphical model interpretation of reinforcement learning also makes it natural for us to augment the policy with latent variables. While several prior works have sought to combine maximum entropy policies with learning of latent spaces (Haarnoja et al., 2017; Hausman et al., 2018) and even with learning hierarchies in small state spaces (Saxe et al., 2017), to our knowledge, our method is the first to extend this mechanism to the setting of learning hierarchical policies with deep RL in continuous domains.
Prior frameworks for hierarchical learning are often based on either options or contextual policies. The options framework (Sutton et al., 1999) combines low-level option policies with a top-level policy that invokes individual options, whereas contextual policies (Kupcsik et al., 2013; Schaul et al., 2015; Heess et al., 2016) generalize options to continuous goals. One of the open questions in both options and contextual policy frameworks is how the base policies should be acquired. In some situations, a reasonable solution is to resort to domain knowledge and design a span of subgoals manually (Heess et al., 2016; Kulkarni et al., 2016; MacAlpine & Stone, 2018). Another option is to train the entire hierarchy end-to-end (Bacon et al., 2017; Vezhnevets et al., 2017; Daniel et al., 2012). While the end-to-end training scheme provides generality and flexibility, it is prone to learning degenerate policies that exclusively use a single option, losing much of the benefit of the hierarchical structure (Bacon et al., 2017). To counter this, the option-critic (Bacon et al., 2017) adopts a standard entropy regularization scheme ubiquitous in policy gradient methods (Mnih et al., 2016; Schulman et al., 2015), Florensa et al. (2017) propose maximizing the mutual information of the top-level actions and the state distribution, and Daniel et al. (2012) bound the mutual information of the actions and top-level actions. Our method also uses entropy maximization to obtain diverse base policies, but in contrast to prior methods, our subpolicies are invertible and parameterized by continuous latent variables. The higher levels can thus undo any lower-level transformation, and the lower layers can learn independently, allowing us to train the hierarchies in bottom-up layerwise fashion. Unlike prior methods, which use structurally distinct higher and lower layers, all layers in our hierarchies are structurally generic, and are trained with exactly the same procedure.
3 Preliminaries
In this section, we introduce notation and summarize standard and maximum entropy reinforcement learning.
3.1 Notation
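We address policy learning in continuous action spaces formalized as learning in a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, p, r)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, and $p(s_{t+1} \mid s_t, a_t)$ represents the state transition probabilities of the next state $s_{t+1} \in \mathcal{S}$ given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$. At each transition, the environment emits a bounded reward $r(s_t, a_t)$. We will also use $p(s_0)$ to denote the initial state distribution, $\tau = (s_0, a_0, s_1, a_1, \ldots)$ to denote a trajectory, and $\rho_\pi(\tau)$ to denote its distribution under a policy $\pi(a_t \mid s_t)$.
3.2 Maximum Entropy Reinforcement Learning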
The standard objective used in reinforcement learning is to maximize the expected sum of rewards $\mathbb{E}_{\tau \sim \rho_\pi(\tau)}\left[\sum_t r(s_t, a_t)\right]$. We will consider a more general maximum entropy objective (see, e.g., Todorov, 2007; Ziebart, 2010; Rawlik et al., 2012; Fox et al., 2016; Haarnoja et al., 2017, 2018), which augments the objective with the expected entropy of the policy over $\rho_\pi(\tau)$:
$$J(\pi) = \mathbb{E}_{\tau \sim \rho_\pi(\tau)}\left[\sum_t r(s_t, a_t) + \alpha \mathcal{H}\left(\pi(\,\cdot \mid s_t)\right)\right]. \tag{1}$$
The temperature parameter $\alpha$ determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy. The conventional objective can be recovered in the limit as $\alpha \to 0$. For the rest of this paper, we will omit writing the temperature explicitly, as it can always be subsumed into the reward by scaling it with $\alpha^{-1}$. In practice, we optimize a discounted, infinite horizon objective, which is more involved to write out explicitly, and we refer the interested readers to prior work for details (Haarnoja et al., 2017).
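As a rough illustration of the objective above, the following sketch computes a single-trajectory Monte Carlo estimate of the entropy-augmented return. The function name `soft_return` and the argument `alpha` (standing for the temperature) are assumptions introduced here for illustration; the per-step entropy is estimated by the negative log-probability of the sampled action, which is one common estimator rather than a prescribed one.

```python
import numpy as np

def soft_return(rewards, log_probs, alpha=1.0):
    """Single-sample estimate of sum_t [ r_t + alpha * H(pi(.|s_t)) ].

    rewards:   r(s_t, a_t) along one sampled trajectory
    log_probs: log pi(a_t | s_t) of the sampled actions
    alpha:     temperature (hypothetical name); alpha = 0 recovers the
               conventional sum of rewards
    """
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    # -log pi(a_t|s_t) is an unbiased one-sample estimate of the entropy.
    return float(np.sum(rewards - alpha * log_probs))
```

Setting `alpha` to zero reduces the estimate to the plain return, mirroring the limit discussed in the text.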
4 Control as Inference
In this section, we derive the maximum entropy objective by transforming the optimal control problem into an inference problem. Our proposed hierarchical framework will later build off of this probabilistic view of optimal control (Section 5).
4.1 Probabilistic Graphical Model for Control
Our derivation is based on the probabilistic graphical model in Figure 1. This model is composed of factors for the dynamics $p(s_{t+1} \mid s_t, a_t)$ and for an action prior $p(a_t)$, which is typically taken to be a uniform distribution but, as we will discuss later, will be convenient to set to a Gaussian distribution in the hierarchical case. Because we are interested in inferring the optimal trajectory distribution under a given reward function, we attach to each state and action a binary random variable $O_t$, or optimality variable, denoting whether the time step was “optimal.” To solve the optimal control problem, we can now infer the posterior action distribution $p(a_t \mid s_t, O_{t:T} = 1)$, which simply states that an optimal action is such that the optimality variable is active for the current state and for all of the future states. For the remainder of this paper, we will refrain, for conciseness, from explicitly writing $O_t = 1$, and instead write $O_t$ to denote that the state-action tuple for the corresponding time step was optimal.
A convenient way to incorporate the reward function into this framework is to choose $p(O_t \mid s_t, a_t) = \exp\left(r(s_t, a_t)\right)$, assuming, without loss of generality, that $r(s_t, a_t) \le 0$. We can write the distribution over optimal trajectories as
$$p(\tau \mid O_{0:T}) \propto p(s_0) \prod_{t=0}^{T} p(a_t)\, p(s_{t+1} \mid s_t, a_t) \exp\left(r(s_t, a_t)\right), \tag{2}$$
and use it to make queries, such as $p(a_t \mid s_t, O_{t:T})$. As we will discuss in the next section, using variational inference to determine $p(a_t \mid s_t, O_{t:T})$ reduces to the familiar maximum entropy reinforcement learning problem in Equation 1.
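To make the role of the optimality variable concrete, the sketch below evaluates the posterior over actions in a deliberately simplified setting: a single time step with a finite set of actions. Both simplifications are assumptions made only for illustration (the paper addresses continuous, multi-step problems), and the function name is hypothetical. In this setting the posterior is proportional to the action prior times the exponentiated reward.

```python
import numpy as np

def optimal_action_posterior(rewards, prior=None):
    """Posterior p(a | O) proportional to p(a) * exp(r(a)) for one step.

    rewards: array of r(a) for each discrete action (illustrative setting)
    prior:   action prior p(a); uniform if omitted
    """
    rewards = np.asarray(rewards, dtype=float)
    if prior is None:
        prior = np.full(rewards.shape, 1.0 / rewards.size)
    logits = np.log(prior) + rewards
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # normalize over actions
```

With a uniform prior this is a softmax over rewards, so higher-reward actions are more probable but no action is excluded outright.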
4.2 Reinforcement Learning via Variational Inference
The optimal action distribution inferred from Equation 2 cannot be directly used as a policy for two reasons. First, it would lead to an overly optimistic policy that assumes that the stochastic state transitions can also be modified to prefer optimal behavior, even though in practice, the agent has no control over the dynamics. Second, in continuous domains, the optimal policy is intractable and has to be approximated, for example by using a Gaussian distribution. We can correct both issues by using structured variational inference, where we approximate the posterior with a probabilistic model that constrains the dynamics and the policy. We constrain the dynamics in this distribution to be equal to the true dynamics, which we do not need to actually know in practice but can simply sample in model-free fashion, and constrain the policy to some parameterized distribution. This defines the variational distribution as
$$q(\tau) = p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t), \tag{3}$$
where $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ are the true initial state distribution and dynamics, and $\pi(a_t \mid s_t)$ is the parameterized policy that we wish to learn. We can fit this distribution by maximizing the evidence lower bound (ELBO):
$$\log p(O_{0:T}) \geq -D_{\mathrm{KL}}\left(q(\tau) \,\|\, p(O_{0:T}, \tau)\right). \tag{4}$$
Since the dynamics and initial state distributions in $q(\tau)$ and $p(O_{0:T}, \tau)$ match, it’s straightforward to check that the right-hand side of Equation 4 simplifies to
$$\mathbb{E}_{\tau \sim q(\tau)}\left[\sum_{t=0}^{T} r(s_t, a_t) - D_{\mathrm{KL}}\left(\pi(\,\cdot \mid s_t) \,\|\, p(\,\cdot\,)\right)\right], \tag{5}$$
which, if we choose a uniform action prior, is exactly the maximum entropy objective in Equation 1 up to a constant, and it can be optimized with any off-the-shelf entropy maximizing reinforcement learning algorithm (e.g., Nachum et al., 2017; Schulman et al., 2017a; Haarnoja et al., 2017, 2018). Although the variational inference framework is not the only way to motivate the maximum entropy RL objective, it provides a convenient starting point for our method, which will augment the graphical model in Figure 1 with latent variables that can then be used to produce a hierarchy of policies, as discussed in the following section.
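For readers who want the intermediate steps, the simplification of the lower bound can be written out as follows. The notation is an assumption chosen to match the surrounding definitions: $q(\tau)$ is the variational trajectory distribution, $p(O_{0:T}, \tau)$ is the joint over optimality variables and trajectories with $p(O_t \mid s_t, a_t) = \exp(r(s_t, a_t))$, and $p(a_t)$ is the action prior.

```latex
\begin{align*}
-D_{\mathrm{KL}}\big(q(\tau)\,\|\,p(O_{0:T},\tau)\big)
&= \mathbb{E}_{\tau\sim q(\tau)}\big[\log p(O_{0:T},\tau) - \log q(\tau)\big] \\
&= \mathbb{E}_{\tau\sim q(\tau)}\Big[\sum_{t=0}^{T} r(s_t,a_t)
   + \log p(a_t) - \log \pi(a_t\mid s_t)\Big] \\
&= \mathbb{E}_{\tau\sim q(\tau)}\Big[\sum_{t=0}^{T} r(s_t,a_t)
   - D_{\mathrm{KL}}\big(\pi(\,\cdot\mid s_t)\,\|\,p(\,\cdot\,)\big)\Big].
\end{align*}
```

The second line follows because the initial state and dynamics factors appear in both distributions and cancel inside the logarithm. When the prior is uniform, $\log p(a_t)$ is a constant, so the remaining $-\log \pi(a_t \mid s_t)$ term contributes the policy entropy.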
5 Learning Latent Space Policies
In this section, we discuss how the probabilistic view of RL can be employed in the construction of hierarchical policies, by augmenting the graphical model in Figure 1 with latent variables. We will also propose a particular way to parameterize the distribution over actions conditioned on these latent variables that is based on bijective transformations, which will provide us with a model amenable to stable and tractable training and the ability for higher levels in the hierarchy to fully invert the behavior of the lower layers, as we will discuss in Section 5.2. We will derive the method for two-layer hierarchies to simplify notation, but it can be easily generalized to arbitrarily deep hierarchies. In this work, we consider the bottom-up approach, where we first train a low-level policy, and then use it to provide a higher-level action space for a higher-level policy that ideally can now solve an easier problem.
5.1 Latent Variable Policies for Hierarchical RL
We start constructing a hierarchy by defining a stochastic base policy as a latent variable model. In other words, we require the base policy to consist of two factors: a conditional action distribution $\pi(a_t \mid s_t, h_t)$, where $h_t$ is a latent random variable, and a prior $p(h_t)$. Actions can be sampled from this policy by first sampling $h_t$ from the prior and then sampling an action $a_t$ conditioned on $h_t$. Adding the latent variables results in a new graphical model, which can now be conditioned on some new optimality variables that can represent either the same task, or a different higher-level task, as shown in Figure 1. In this new graphical model, the base policy is integrated into the transition structure of the MDP, which now exposes a new, higher-level set of actions $h_t$. Insofar as the base policy succeeds in solving the task, partially or completely, learning a related task with $h_t$ as the action should be substantially easier.
This “policy-augmented” graphical model has the same semantic interpretation as the original graphical model in Figure 1, where the combination of the transition model and the action conditional serves as a new (and likely easier) dynamical system. We can derive the dynamics model for the combined system by marginalizing out the actions: $p(s_{t+1} \mid s_t, h_t) = \int p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t, h_t)\, da_t$, where $h_t$ is a new, higher-level action and $p(h_t)$ is its prior, as illustrated in Figure 1. In other words, the base policy shapes the underlying dynamics of the system, ideally making it more easily controllable by a higher-level policy. We can now learn a higher-level policy for the latents by conditioning on new optimality variables. We can repeat this process multiple times by integrating each new policy level into the dynamics and learning a new higher-level policy on top of the previous policy’s latent space, thus constructing an arbitrarily deep hierarchical policy representation. We will refer to these policy layers as subpolicies in the following sections. Similar layerwise training has been studied in the context of generative modeling with deep belief networks and shown to improve optimization of deep architectures (Hinton & Salakhutdinov, 2006).
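The idea of folding a frozen base policy into the dynamics can be sketched as an environment wrapper whose actions are the latents. This is a minimal illustration under assumed interfaces, not the released implementation: the wrapped environment is assumed to expose `reset()` and `step(action)` returning `(state, reward, done)`, `base_policy(state, h)` is a hypothetical deterministic map from state and latent to a low-level action, and `PointEnv` is a toy environment included only so the example runs.

```python
class PolicyAugmentedEnv:
    """Exposes the latent h of a frozen base policy as the action space."""

    def __init__(self, env, base_policy):
        self.env = env                  # assumed interface: reset(), step(a)
        self.base_policy = base_policy  # hypothetical: (state, h) -> action
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, h):
        # The base policy turns the higher-level action h into a
        # low-level action, which the underlying system then executes.
        action = self.base_policy(self.state, h)
        self.state, reward, done = self.env.step(action)
        return self.state, reward, done


class PointEnv:
    """Toy 1-D system used only to exercise the wrapper."""

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += float(action)
        return self.x, -abs(self.x - 1.0), False
```

A higher-level learner interacts with `PolicyAugmentedEnv` exactly as it would with any other environment, which is why sampling access to the combined dynamics suffices.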
5.2 Practical Training of Latent Variable Policies
An essential choice in our method is the representation of subpolicies which, ideally, should be characterized by three properties. First, each subpolicy should be tractable. This is required, since we need to maximize the log-likelihood of good actions, which requires marginalizing out all latent variables. Second, the subpolicies should be expressive, so that they don’t suppress the information flow from the higher levels to the lower levels. For example, a mixture of Gaussians as a subpolicy can only provide a limited number of behaviors for the higher levels, corresponding to the mixture elements, potentially crippling the ability of the higher layers to solve the task. Third, the conditional factor of each subpolicy should be deterministic, since the higher levels view it as a part of the environment, and additional implicit noise in the system can degrade their performance. Our approach to model the conditionals is based on bijective transformations of the latent variables into actions, which provides all of the aforementioned properties: the subpolicies are tractable, since marginalization reduces to single-point evaluation; they are expressive if represented via neural networks; and they are deterministic due to the bijective transformation. Specifically, we borrow from the recent advances in unsupervised learning based on real-valued non-volume preserving (real NVP) neural network transformations (Dinh et al., 2016). Our network differs from the original real NVP architecture in that we also condition the transformations on the current state or observation. Note that, even though the transformation from the latent to the action is bijective, it can depend on the observation in arbitrarily complex non-invertible ways, as we discuss in Section 6.1, providing our subpolicies with the requisite expressive power.
We can learn the parameters of these networks by utilizing the change of variables formula as discussed by Dinh et al. (2016), and summarized here for completeness. Let $a = f(h)$ be a bijective transformation, and let $h$ be a random variable and $p(h)$ its prior density. It is possible to express the density of $a$ in terms of the prior and the Jacobian of the transformation by employing the change of variables formula as
$$p(a) = p(h) \left| \det\left( \frac{d f(h)}{d h} \right) \right|^{-1}. \tag{6}$$
Dinh et al. (2016) propose a particular kind of bijective transformation, which has a triangular Jacobian, simplifying the computation of the determinant to a product of its diagonal elements. The exact structure of these transformations is outside of the scope of this work, and we refer the reader to Dinh et al. (2016) for a more detailed description. We can easily chain these transformations to form multi-level policies, and we can train them end-to-end as a single policy or layerwise as a hierarchical policy. As a consequence, the policy representation is agnostic to whether or not it was trained as a single expressive latent-variable policy or as a hierarchical policy consisting of several subpolicies, allowing us to choose the training method that best suits the problem at hand without the need to redesign the policy topology each time from scratch. Next, we will discuss the different hierarchical training strategies in more detail.
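The change of variables computation can be illustrated with a single state-conditioned affine coupling layer, in the spirit of real NVP. This is a simplified sketch, not the architecture used in the experiments: random linear maps with a `tanh` stand in for the neural networks, and the class and function names are hypothetical. The first half of the latent passes through unchanged, so the Jacobian is triangular and its log-determinant is the sum of the log-scales.

```python
import numpy as np

class ConditionalAffineCoupling:
    """One affine coupling layer h -> a, conditioned on a state s."""

    def __init__(self, dim, state_dim, rng):
        self.d = dim // 2
        in_dim = self.d + state_dim
        # Stand-ins for the scale and shift networks (assumption).
        self.W_scale = 0.1 * rng.standard_normal((in_dim, dim - self.d))
        self.W_shift = 0.1 * rng.standard_normal((in_dim, dim - self.d))

    def _params(self, first_half, s):
        x = np.concatenate([first_half, s])
        return np.tanh(x @ self.W_scale), x @ self.W_shift

    def forward(self, h, s):
        h1, h2 = h[:self.d], h[self.d:]
        log_scale, shift = self._params(h1, s)
        a = np.concatenate([h1, h2 * np.exp(log_scale) + shift])
        return a, np.sum(log_scale)      # action and log|det da/dh|

    def inverse(self, a, s):
        a1, a2 = a[:self.d], a[self.d:]
        log_scale, shift = self._params(a1, s)
        return np.concatenate([a1, (a2 - shift) * np.exp(-log_scale)])


def log_prob_action(layer, a, s):
    """log p(a|s) = log N(h; 0, I) - log|det da/dh| with h = f^{-1}(a)."""
    h = layer.inverse(a, s)
    _, log_det = layer.forward(h, s)
    log_prior = -0.5 * np.sum(h ** 2) - 0.5 * h.size * np.log(2 * np.pi)
    return log_prior - log_det
```

Because the state enters only through the scale and shift, the map from latent to action stays invertible for any state, while its dependence on the state itself need not be invertible.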
5.3 Reward Functions for Policy Hierarchies
The simplest way to construct a hierarchy out of latent variable policies is to follow the procedure described in Section 5.1, where we train each layer in turn, then freeze its weights, and train a new layer that uses the lower layer’s latent variables as an action space. In this procedure, each layer is trained on the same maximum entropy objective, and each layer simplifies the task for the layer above it. As we will show in Section 6, this procedure can provide substantial benefit on challenging and high-dimensional benchmark tasks.
However, for tasks that are more challenging, we can also naturally incorporate weak prior information into the training process by using underdefined heuristic reward functions for training the lower layers. In this approach, lower layers of the hierarchy are trained on reward functions that include shaping terms that elicit more desirable behaviors. For example, if we wish to learn a complex navigation task for a walking robot, the lower-layer objective might provide a reward for moving in any direction, while the higher layer is trained only on the primary objective of the task. Because our method uses bijective transformations, the higher layer can always undo any behavior in the lower layer if it is detrimental to task success, while lower-layer objectives that correlate with task success can greatly simplify learning. Furthermore, since each layer still aims to maximize entropy, even a weak objective, such as a reward for motion in any direction (e.g., the norm of the velocity), will produce motion in many different directions that is controllable by the lower layer’s latent variables. In our experiments, we will demonstrate how this approach can be used to solve a goal navigation task for a simulated, quadrupedal robot.
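The split between a weak lower-layer shaping reward and the primary task reward might look as follows. Both functions are illustrative assumptions: the names, the `radius` parameter, and the exact form of the sparse goal reward are not specified in the text, which only states that the lower layer can be rewarded for the norm of the velocity.

```python
import numpy as np

def locomotion_reward(velocity):
    """Lower-layer shaping reward: speed, regardless of direction."""
    return float(np.linalg.norm(velocity))

def goal_reward(position, goal, radius=0.5):
    """Higher-layer task reward: 1 inside a goal region, else 0.

    The goal-region radius is a hypothetical parameter.
    """
    dist = np.linalg.norm(np.asarray(position, dtype=float)
                          - np.asarray(goal, dtype=float))
    return 1.0 if dist <= radius else 0.0
```

Since `locomotion_reward` is symmetric in direction, an entropy-maximizing lower layer has no reason to prefer one heading over another, which is what leaves the direction free for the latent variables to control.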
5.4 Algorithm Summary
We summarize the proposed algorithm in Algorithm 1. The algorithm begins by operating on the low-level actions in the environment according to the unknown system dynamics $p(s_{t+1} \mid s_t, h_t^{(0)})$, where we use $h_t^{(0)} = a_t$ to denote the lowest-layer actions. The algorithm is also provided with an ordered set of reward functions $r_1, \ldots, r_K$, which can all represent the same task or different tasks, depending on the skill we want to learn: skills that naturally divide into primitive skills can benefit from specifying a different low-level objective, but for simpler tasks, such as locomotion, which do not naturally divide into primitives, we can train each subpolicy to optimize the same objective. In both cases, the last reward function should correspond to the actual task we want to solve. The algorithm then chooses each $r_i$ sequentially and learns a maximum entropy policy $\pi_i$, represented by an invertible transformation $h_t^{(i-1)} = f_i(h_t^{(i)}; s_t)$ and a prior $p(h_t^{(i)})$, to optimize the corresponding variational inference objective in Equation 5. Our proposed implementation uses soft actor-critic (Haarnoja et al., 2018) to optimize the policy due to its robustness and good sample efficiency, although other entropy maximizing RL algorithms can also be used. After each iteration, we embed the newly learned transformation $f_i$ into the environment, which produces a new system dynamics $p(s_{t+1} \mid s_t, h_t^{(i)})$ that can be used by the next layer. As before, we do not need an analytic form of these dynamics, only the ability to sample from them, which allows our algorithm to operate in the fully model-free setting.
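The layerwise loop described above can be sketched in a few lines. This is a schematic under stated assumptions, not Algorithm 1 itself: `train_layer` is a hypothetical callable standing in for any maximum entropy RL algorithm (such as soft actor-critic), `dynamics` is assumed to be a sampling function mapping a state and an action to a next state, and each learned layer is represented as a function from state and latent to the action of the layer below.

```python
def embed(dynamics, f):
    """Fold a frozen layer f(state, h) into the dynamics."""
    return lambda state, h: dynamics(state, f(state, h))

def build_hierarchy(dynamics, reward_fns, train_layer):
    """Train one latent variable policy per reward function, bottom up.

    dynamics:    callable (state, action) -> next_state (sampling access)
    reward_fns:  ordered list; the last entry is the actual task reward
    train_layer: hypothetical callable (dynamics, reward_fn) -> f
    """
    layers = []
    for reward_fn in reward_fns:
        f = train_layer(dynamics, reward_fn)  # learn this layer
        layers.append(f)                      # freeze it
        dynamics = embed(dynamics, f)         # expose its latents as actions
    return layers, dynamics
```

The returned `dynamics` accepts the top-level latents directly, so composing the learned layers and acting in the original system are the same operation.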
6 Experiments
Our experiments were conducted on several continuous control benchmark tasks from the OpenAI Gym benchmark suite (Brockman et al., 2016). The aim of our experiments was to answer the following questions: (1) How well does our latent space policy learning method compare to prior reinforcement learning algorithms? (2) Can we attain improved performance from adding additional latent variable policy layers to a hierarchy, especially on challenging, high-dimensional tasks? (3) Can we solve more complex tasks by providing simple heuristic shaping to lower layers of the hierarchy, while the higher layers optimize the original task reward? Videos of our experiments are available online at https://sites.google.com/view/latentspacedeeprl.
6.1 Policy Architecture
In our experiments, we used both single-level policies and hierarchical policies composed of two subpolicies as shown on the right in Figure 2. Each subpolicy has an identical structure, as depicted on the left in Figure 2. A subpolicy is constructed from two coupling layers that are connected using the alternating pattern described by Dinh et al. (2016). Our implementation differs from Dinh et al. (2016) only in that we condition the coupling layers on the observations, via an embedding performed by a two-layer fully-connected network. In practice, we concatenate the embedding vector with the latent input to each coupling layer. Note that our method requires only the path from the input latent to the output to be invertible, and the output can depend on the observation in arbitrarily complex ways. We have released our code for reproducibility (https://github.com/haarnoja/sac).
Figure 3: Training curves for continuous control benchmarks. Thick lines correspond to mean performance, and shaded regions show standard deviations of five random seeds. Our method (SAC-LSP) attains state-of-the-art performance across all tasks.
6.2 Benchmark Tasks with Single-Level Policies
We compare our method (SAC-LSP) to two commonly used RL algorithms: proximal policy optimization (PPO) (Schulman et al., 2017b), a commonly used policy gradient algorithm, and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), which is a sample-efficient off-policy method. We also include two recent algorithms that learn maximum entropy policies: soft Q-learning (SQL) (Haarnoja et al., 2017), which also learns a sampling network as part of the model, represented by an implicit density model, and soft actor-critic (Haarnoja et al., 2018), which uses a Gaussian mixture model policy (SAC-GMM). Note that the benchmark tasks compare the total expected return, but the entropy maximizing algorithms optimize a slightly different objective, so this comparison slightly favors PPO and DDPG, which optimize the benchmark objective directly. Another difference between the two classes of algorithms is that the maximum entropy policies are stochastic at test time, while DDPG is deterministic and PPO typically converges to nearly deterministic policies. For SAC-GMM, we execute an approximate maximum a posteriori action by choosing the mean of the mixture component that has the highest Q-value at test time, but for SQL and our method, which both can represent an arbitrarily complex posterior distribution, we simply sample from the stochastic policy.
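The test-time rule described for the Gaussian mixture policy, choosing the mean of the component with the highest Q-value, can be sketched as below. The function name and the `q_function(state, action)` signature are assumptions for illustration and are not taken from the released code.

```python
import numpy as np

def approximate_map_action(component_means, q_function, state):
    """Return the mixture-component mean with the highest Q-value.

    component_means: array of shape (num_components, action_dim)
    q_function:      hypothetical callable (state, action) -> float
    """
    q_values = [q_function(state, mean) for mean in component_means]
    return component_means[int(np.argmax(q_values))]
```

This evaluates the critic only at the component means, so it approximates rather than computes the true maximum a posteriori action.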
Our results on the benchmark tasks show that our policy representation generally performs on par with or better than all of the tested prior methods, both in terms of efficiency and final return (Figure 3), especially on the more challenging and high-dimensional tasks, such as Humanoid. These results indicate that our policy representation can accelerate learning, particularly in challenging environments with high-dimensional actions and observations.
6.3 Multi-Level Policies
In this section, we evaluate the performance of our approach when we compose multiple latent variable policies into a hierarchy. In the first experiment, we train a single-level base policy on the most challenging standard benchmarks, Ant and Humanoid. We then freeze the weights of the base policy, and learn another policy level that uses the latent variables of the first policy as its action space. Intuitively, each layer in such a hierarchy attempts to solve the task to the best of its ability, providing an easier problem for the layer above. In Figure (a) and Figure (b), we show the training curves for Ant and Humanoid, where the blue curve corresponds to the base policy and orange curves show the performance after we freeze the base policy weights and optimize a second-level policy. The different orange curves correspond to the addition of the second layer after different numbers of training steps. In each case, the two-level policy can outperform a single-level policy by a large margin. The performance boost is more prominent if we train the base policy longer. Note that the base policy corresponds to a single-level policy in Figure 3, and already learns more efficiently than prior methods. We also compare to a single, more expressive policy (green) that consists of four invertible layers, which has a similar number of parameters to a stacked two-level policy, but is trained end-to-end as opposed to stagewise. This single four-layer policy performs comparably to (Ant) or worse than (Humanoid) a single, two-layer policy (blue), indicating that the benefit of the two-level hierarchy is not just in the increased expressivity of the policy, but that the stagewise training procedure plays an important role in improving performance.
In our second experiment, we study how our method can be used to build hierarchical policies for complex tasks that require compound skills. The particular task we consider requires Ant to navigate through a simple maze (Figure (a)) with a sparse, binary reward for reaching the goals (shown in green). To solve this task, we can construct a hierarchy, where the lower layer is trained to acquire a general locomotion skill simply by providing a reward for maximizing velocity, regardless of direction. This pretraining phase can be conducted in a simpler, open environment where the agent can move freely. This provides a small amount of domain knowledge, but this type of domain knowledge is much easier to specify than true intermediate goals or modulation (Heess et al., 2016; Kulkarni et al., 2016), and the entropy maximization term in the objective automatically causes the low-level policy to learn a range of locomotion skills for various directions. The higher-level policy is then provided the standard task reward. Because the lower-level policy is invertible, the higher-level policy can still solve the task however it needs to, potentially even by fully undoing the behavior of the lower-level policy. However, the results suggest that the lower-level policy does indeed substantially simplify the problem, resulting in both rapid learning and good final performance.
In Figure (b), we compare our approach (blue) to four single-level baselines that either learn a policy from scratch or fine-tune a pretrained policy. The pretraining phase (4 million steps) is not included in the learning curves, since the same base policies were reused across multiple tasks, corresponding to the three different goal locations shown in Figure (a). With task reward only, training a policy from scratch (red) failed to solve the task due to lack of structured exploration, whereas fine-tuning a pretrained policy (brown) that already knows how to move around and could occasionally find the way to the goal was able to slowly learn this task. We also tried improving exploration by augmenting the objective with a motion reward, provided as a shaping term for the entire policy. In this case, a policy trained from scratch (purple) learned slowly, as it first needed to acquire a locomotion skill, but was able to eventually solve the task, while fine-tuning a pretrained policy resulted in much faster learning (pink). However, in both cases, adding motion as a shaping term prevents the policy from converging on an optimal solution, since the shaping alters the task. This manifests as residual error at convergence. On the other hand, our method (blue) can make use of the pretrained locomotion skills while optimizing for the task reward directly, resulting in faster learning and better final performance. Note that our method converges to a solution where the final distance to the goal is more than four times smaller than for the next best method, which fine-tunes with a shaped reward. We also applied our method to soft Q-learning (yellow), which also trains latent space policies, but we found it to learn substantially more slowly than SAC-LSP.
7 Discussion and Future Work
We presented a method for training policies with latent variables. Our reinforcement learning algorithm not only compares favorably to state-of-the-art algorithms on standard benchmark tasks, but also provides an appealing avenue for constructing hierarchical policies: higher levels in the hierarchy can directly make use of the latent space of the lower levels as their action space, which allows us to train the entire hierarchy in a layer-wise fashion. This approach to hierarchical reinforcement learning has a number of conceptual and practical benefits. First, each layer in the hierarchy can be trained with exactly the same algorithm. Second, by using an invertible mapping from latent variables to actions, each layer becomes invertible, which means that the higher layer can always perfectly invert any behavior of the lower layer. This makes it possible to train lower layers on heuristic shaping rewards, while higher layers can still optimize task-specific rewards with good asymptotic performance. Our method has a natural interpretation as an iterative procedure for constructing graphical models that gradually simplify the task dynamics. Our approach can be extended to enable different layers to operate at different temporal scales, which would provide for extensions into domains with temporally delayed rewards and multi-stage tasks.
Acknowledgments
We thank Aurick Zhou for producing some of the baseline results. This work was supported by Siemens and Berkeley DeepDrive.
References

Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1726–1734, 2017.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Daniel et al. (2012) Daniel, C., Neumann, G., and Peters, J. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pp. 273–281, 2012.
 Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Eysenbach et al. (2018) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
 Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence, 2016.

Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), pp. 1352–1361, 2017.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Hausman et al. (2018) Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR), 2018.
 Heess et al. (2016) Heess, N., Wayne, G., Tassa, Y., Lillicrap, T., Riedmiller, M., and Silver, D. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Hinton & Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Kappen (2005) Kappen, H. J. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory And Experiment, 2005(11):P11011, 2005.
 Kulkarni et al. (2016) Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pp. 3675–3683, 2016.
 Kupcsik et al. (2013) Kupcsik, A. G., Deisenroth, M. P., Peters, J., and Neumann, G. Data-efficient generalization of robot skills with contextual policy search. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), pp. 1401–1407, 2013.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521:436–444, May 2015.
 Levine (2014) Levine, S. Motor skill learning with local trajectory methods. PhD thesis, Stanford University, 2014.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 MacAlpine & Stone (2018) MacAlpine, P. and Stone, P. Overlapping layered learning. Artificial Intelligence, 254:21–43, 2018.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
 Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2772–2782, 2017.
 Neumann (2011) Neumann, G. Variational inference for policy search in changing situations. In International Conference on Machine Learning (ICML), pp. 817–824, 2011.
 Peters et al. (2010) Peters, J., Mülling, K., and Altun, Y. Relative entropy policy search. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1607–1612, 2010.
 Rawlik et al. (2012) Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Robotics: Science and Systems (RSS), 2012.
 Saxe et al. (2017) Saxe, A. M., Earle, A. C., and Rosman, B. S. Hierarchy through composition with multitask LMDPs. Proceedings of Machine Learning Research, 2017.
 Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning (ICML), pp. 1312–1320, 2015.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.
 Schulman et al. (2017a) Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a.
 Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan 2016.
 Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
 Tessler et al. (2017) Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. A deep hierarchical approach to lifelong learning in Minecraft. In AAAI Conference on Artificial Intelligence (AAAI), volume 3, pp. 6, 2017.
 Todorov (2007) Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), pp. 1369–1376. MIT Press, 2007.
 Toussaint (2009) Toussaint, M. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), pp. 1049–1056. ACM, 2009.
 Vezhnevets et al. (2017) Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
 Ziebart (2010) Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, 2010.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1433–1438, 2008.
Appendix A Hyperparameters
A.1 Common Parameters
We use the following parameters for LSP policies throughout the experiments. The algorithm uses a replay pool of one million samples, and training is delayed until at least 1000 samples have been collected into the pool. Each training iteration consists of 1000 environment time steps, and all the networks (value functions, policy scale/translation, and observation embedding network) are trained at every time step. Every training batch has a size of 128. The value function networks and the embedding network are all neural networks with two hidden layers of 128 ReLU units each. The dimension of the observation embedding is equal to two times the number of action dimensions. The scale and translation neural networks used in the Real NVP bijector both have one hidden layer, with a number of ReLU units equal to the number of action dimensions. All the network parameters are updated using the Adam optimizer with the learning rate listed in Table 1.
Table 1 lists the common parameters used for the LSP policy, and Table 2 lists the parameters that varied across the environments.
Table 1: Common parameters.

| Parameter | Value |
|---|---|
| learning rate | |
| batch size | 128 |
| discount | 0.99 |
| target smoothing coefficient | |
| maximum path length | |
| replay pool size | 1,000,000 |
| hidden layers (Q, V, embedding) | 2 |
| hidden units per layer (Q, V, embedding) | 128 |
| policy coupling layers | 2 |
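Table 2: Environment-specific parameters.

| Parameter | Swimmer (rllab) | Hopper-v1 | Walker2d-v1 | HalfCheetah-v1 | Ant (rllab) | Humanoid (rllab) |
|---|---|---|---|---|---|---|
| action dimensions | 2 | 3 | 6 | 6 | 8 | 21 |
| reward scale | 100 | 1 | 3 | 1 | 3 | 3 |
| observation embedding dimension | 4 | 6 | 12 | 12 | 16 | 42 |
| scale/translation hidden units | 2 | 3 | 6 | 6 | 8 | 21 |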
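To make the layer sizes above concrete, the following is a minimal NumPy sketch of a two-layer affine coupling policy using the Ant dimensions (8 action dimensions, a 16-dimensional observation embedding, 8 hidden units in the scale and translation networks). It is an illustration under stated assumptions, not the authors' implementation: the class `AffineCoupling`, the helper names, the way the embedding is concatenated with the pass-through half, the tanh bound on the log-scale, and the random weights are all hypothetical. The point it shows is that the latent-to-action map can be inverted exactly, which is the property the hierarchy relies on.

```python
import numpy as np


def mlp(x, w1, b1, w2, b2):
    """One-hidden-layer ReLU network (hidden width = action dimensions)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


class AffineCoupling:
    """Affine coupling layer conditioned on an observation embedding (sketch)."""

    def __init__(self, action_dim, embed_dim, flip, rng):
        self.flip = flip                      # alternate which half is transformed
        self.d = action_dim // 2              # size of the pass-through part
        n_in, n_out = self.d + embed_dim, action_dim - self.d
        hidden = action_dim                   # per the appendix: one hidden layer

        def params():                         # random weights, for illustration only
            return (0.1 * rng.standard_normal((n_in, hidden)), np.zeros(hidden),
                    0.1 * rng.standard_normal((hidden, n_out)), np.zeros(n_out))

        self.scale, self.shift = params(), params()

    def _split(self, x):
        x = x[::-1] if self.flip else x
        return x[:self.d], x[self.d:]

    def _merge(self, keep, change):
        x = np.concatenate([keep, change])
        return x[::-1] if self.flip else x

    def _scale_shift(self, keep, embedding):
        cond = np.concatenate([keep, embedding])
        # tanh keeps the log-scale bounded (an assumption of this sketch)
        return np.tanh(mlp(cond, *self.scale)), mlp(cond, *self.shift)

    def forward(self, h, embedding):
        keep, change = self._split(h)
        s, t = self._scale_shift(keep, embedding)
        return self._merge(keep, change * np.exp(s) + t)

    def inverse(self, a, embedding):
        keep, change = self._split(a)
        s, t = self._scale_shift(keep, embedding)
        return self._merge(keep, (change - t) * np.exp(-s))


def latent_to_action(layers, h, embedding):
    for layer in layers:
        h = layer.forward(h, embedding)
    return h


def action_to_latent(layers, a, embedding):
    for layer in reversed(layers):
        a = layer.inverse(a, embedding)
    return a


rng = np.random.default_rng(0)
action_dim = 8                                # Ant (rllab), Table 2
embed_dim = 2 * action_dim                    # observation embedding dimension
layers = [AffineCoupling(action_dim, embed_dim, flip=bool(i % 2), rng=rng)
          for i in range(2)]                  # two coupling layers, Table 1
embedding = rng.standard_normal(embed_dim)    # stand-in for the embedding network
h = rng.standard_normal(action_dim)           # latent sample from a Gaussian prior
a = latent_to_action(layers, h, embedding)
h_recovered = action_to_latent(layers, a, embedding)
```

Because each coupling layer leaves one half of its input untouched and applies an elementwise affine map to the other half, the inverse needs only the same scale and translation outputs, so no numerical root finding is required.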
A.2 High-Level Policies
All the low-level policies in hierarchical cases (Figures 5(b), 4(a), 4(b)) are trained using the same parameters used for the corresponding benchmark environment. All the high-level policies use a Gaussian action prior. For the Ant maze task, the latent sample of the high-level policy is sampled once at the beginning of the rollout and kept fixed until the next one. The same high-level action is kept fixed over three environment steps. Otherwise, all the policy parameters for the high-level policies are equal to the benchmark parameters.
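The timing described above can be sketched as a rollout loop. This is one reading of the text rather than the authors' code: the high-level latent is drawn once per rollout, the high-level action (which serves as the low-level latent) is recomputed every three environment steps, and the low-level policy acts at every step. `ToyEnv`, `toy_high_policy`, `toy_low_policy`, and `run_episode` are hypothetical stand-ins used only to make the loop runnable.

```python
import random

HIGH_LEVEL_PERIOD = 3  # environment steps per high-level action (Appendix A.2)


class ToyEnv:
    """Hypothetical 1-D point environment; not the Ant maze."""

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += action
        return self.x, 0.0, False  # (state, reward, done)


def toy_high_policy(state, latent):
    # Stand-in for the high-level policy: maps (state, latent) to a high-level action.
    return latent[0] - 0.1 * state


def toy_low_policy(state, high_action):
    # Stand-in for the low-level policy: the high-level action is its latent input.
    return 0.5 * high_action


def run_episode(env, high_policy, low_policy, latent_dim, max_steps, rng):
    # The high-level latent is sampled once per rollout from a Gaussian prior.
    high_latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    state = env.reset()
    high_action = None
    log = []
    for t in range(max_steps):
        if t % HIGH_LEVEL_PERIOD == 0:
            # Refresh the high-level action only every third step.
            high_action = high_policy(state, high_latent)
        action = low_policy(state, high_action)
        state, reward, done = env.step(action)
        log.append((t, high_action, action))
        if done:
            break
    return log


log = run_episode(ToyEnv(), toy_high_policy, toy_low_policy,
                  latent_dim=1, max_steps=7, rng=random.Random(0))
```

Holding the high-level action for several steps gives the higher layer a coarser time scale than the lower layer without changing the training algorithm used at either level.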
The environments used for training the low-level policies are otherwise equal to the benchmark environments, except for their reward function, which is modified to yield a velocity-based reward in any direction on the xy-plane, in contrast to just the positive x-direction in the benchmark tasks. In the Ant maze environment, the agent receives a reward of 1000 upon reaching the goal and 0 otherwise. In particular, neither a velocity reward nor any control costs are applied to the agent. The environment terminates after the agent reaches the goal.
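The two reward signals described above can be written down in a few lines. This is a hedged sketch: the text does not give the exact form of the velocity term or how "reaching the goal" is detected, so the use of planar speed (the Euclidean norm of the xy-velocity) and the `goal_radius` threshold are assumptions, as are the function names. Any other terms that the benchmark reward may contain are left out.

```python
import math

GOAL_REWARD = 1000.0  # sparse reward for reaching the goal (Appendix A.2)


def pretraining_velocity_reward(vx, vy):
    """Velocity term for low-level pretraining: speed in any xy-direction.

    Assumption: the reward is the planar speed; the source only says
    "velocity-based reward in any direction on the xy-plane".
    """
    return math.hypot(vx, vy)


def maze_reward(position, goal, goal_radius):
    """Sparse Ant maze reward: 1000 at the goal, 0 otherwise.

    Returns (reward, done). The episode terminates once the goal is reached.
    `goal_radius` is a hypothetical threshold for "reaching" the goal.
    """
    reached = math.dist(position, goal) <= goal_radius
    return (GOAL_REWARD if reached else 0.0), reached


# Moving backwards or forwards at the same speed earns the same pretraining reward.
forward = pretraining_velocity_reward(3.0, 4.0)
backward = pretraining_velocity_reward(-3.0, -4.0)
far = maze_reward((0.0, 0.0), (1.0, 0.0), goal_radius=0.5)
near = maze_reward((0.9, 0.0), (1.0, 0.0), goal_radius=0.5)
```

Rewarding speed regardless of direction is what lets the entropy term spread the low-level policy over many headings, while the maze reward carries no shaping at all.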