1 Introduction
Reinforcement learning approaches, coupled with neural networks as function approximators, have solved an impressive range of tasks, from complex control tasks (Lillicrap et al., 2016; Heess et al., 2017; Riedmiller et al., 2018; Levine et al., 2016; OpenAI et al., 2018) to computer games (Mnih et al., 2015; OpenAI, 2018) and Go (Silver et al., 2016). Recent advances have greatly improved data efficiency, scalability, and stability of these algorithms in a variety of domains (Rennie et al., 2017; Zoph & Le, 2017; Espeholt et al., 2018; Ganin et al., 2018; Zhu et al., 2018).
Nevertheless, many tasks remain challenging to solve and require large numbers of interactions with the environment. While the reasons can be hard to pin down, they frequently have to do with the fact that solutions are unlikely to be found by chance when no prior knowledge is available, that the solution space is dominated by local optima, or that properties of the desired behaviour are not captured by the reward (e.g. natural movement of the body). Curricula, task distributions, and demonstrations, among other approaches, have been shown to be powerful tools to overcome some of these problems. All these ideas rely critically on the ability to transfer behaviours across tasks, and thus on mechanisms for extracting knowledge about the structure of existing solutions from, and for injecting prior knowledge into, a reinforcement learning problem.
Probabilistic models are widely used across the machine learning community. They provide a rich set of tools to specify inductive biases, to extract structure from data, and to impose structure on learned solutions, as well as rules for composing model components. The KL-regularized objective (Todorov, 2007; Kappen et al., 2012; Rawlik et al., 2012; Schulman et al., 2017a) creates a connection between RL and probabilistic models. It introduces a second component, a prior or default behaviour, and the policy is encouraged to remain close to it in terms of the Kullback-Leibler (KL) divergence, which can be used to influence the learned policy. Recently, within this framework, Teh et al. (2017), Czarnecki et al. (2018), Goyal et al. (2019) and Galashov et al. (2019) have proposed to learn a parametrized default policy in the role of a more informative prior.
These works suggest an elegant solution for enforcing complex biases that can also be learned or transferred from different tasks, and the objective provides much flexibility in terms of model and algorithm choice. However, its potential and limitations, and how to use it most effectively, have not yet been studied well. It has been argued, e.g. by Galashov et al. (2019), that information asymmetry between the default policy and agent policy can be an efficient mechanism for discovering meaningful priors. Restricting the prior's access to certain information (e.g. about the task) forces it to learn the expected behaviour that arises from marginalizing away this information, and the result can be understood as habitual behaviour of the agent.
In this work we extend this line of thought, considering the scenario where both the default policy and the agent are hierarchically structured and augmented with latent variables. This provides new mechanisms for restricting the information flow and introducing inductive biases. We present a general algorithmic framework in Appendix A, with the hierarchically structured variant elaborated in the main text and studied experimentally. In addition, we explore how the resulting modular policies can be used in transfer learning scenarios. We provide empirical results on several tasks with physically simulated bodies and continuous action spaces as well as on discrete action grid worlds, highlighting the role of the structured policies.
2 RL as probabilistic modelling
In this section, we provide an overview of how the KL-regularized objective can connect RL and probabilistic model learning, before developing our approach in the next section. We start by introducing some standard notation. We denote states and actions at time $t$ by $s_t$ and $a_t$ respectively. $r(s_t, a_t)$ is the instantaneous reward received in state $s_t$ when taking action $a_t$. We refer to the history up to time $t$ as $x_t = (s_1, a_1, \ldots, s_t)$ and to the whole trajectory as $\tau = (s_1, a_1, s_2, \ldots)$. The agent policy $\pi(a_t | x_t)$ denotes a distribution over next actions given history $x_t$, while $\pi_0(a_t | x_t)$ denotes a default or habitual policy. [Footnote 1: We generally work with history-dependent policies since we will consider restricting access to state information from policies (for information asymmetry), which may render fully observed MDPs effectively partially observed.] The KL-regularized RL objective (Todorov, 2007; Kappen et al., 2012; Rawlik et al., 2012; Schulman et al., 2017a) takes the form:
$$\mathcal{L}(\pi, \pi_0) = \mathbb{E}_{\tau}\Big[\sum_{t \ge 1} \gamma^t r(s_t, a_t) - \alpha \gamma^t \mathrm{KL}(a_t | x_t)\Big]$$ (1)
where we use a convenient notation for the KL divergence: $\mathrm{KL}(a_t | x_t) = \mathbb{E}_{\pi(a_t | x_t)}\big[\log \frac{\pi(a_t | x_t)}{\pi_0(a_t | x_t)}\big]$, $\gamma$ is the discount factor and $\alpha$ is a hyperparameter controlling the relative contributions of both terms. [Footnote 2: In the following, $\mathrm{KL}(Y | X)$ always denotes $\mathbb{E}_{\pi(Y | X)}\big[\log \frac{\pi(Y | X)}{\pi_0(Y | X)}\big]$ for arbitrary variables $X$ and $Y$.] $\mathbb{E}_{\tau}[\cdot]$ is taken with respect to the distribution over trajectories defined by the agent policy and system dynamics: $p(s_1) \prod_{t \ge 1} \pi(a_t | x_t)\, p(s_{t+1} | s_t, a_t)$.
When optimized with respect to $\pi$, the objective can be seen to trade off expected reward with closeness (in terms of KL) to $\pi_0$. This is also evident from the optimal $\pi^*$ in eq. (1):
$$\pi^*(a_t | x_t) = \pi_0(a_t | x_t) \exp\big((Q^*(x_t, a_t) - V^*(x_t)) / \alpha\big)$$ (2)
$$Q^*(x_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\big[V^*(x_{t+1})\big]$$ (3)
$$V^*(x_t) = \alpha \log \int \pi_0(a_t | x_t) \exp\big(Q^*(x_t, a_t) / \alpha\big)\, da_t$$ (4)
where $Q^*$ and $V^*$ are the optimal action value and value functions of eq. (1); see e.g. Rawlik et al. (2012), Fox et al. (2016), Schulman et al. (2017a) and Nachum et al. (2017) for derivations. We can thus think of $\pi^*$ as a specialization of $\pi_0$ that is obtained by tilting $\pi_0$ towards high-value actions (as measured by the action value $Q^*$).
Several recent works have considered optimizing (1) when $\pi_0$ is of a fixed and simple form. For instance, when $\pi_0$ is chosen to be the uniform distribution, the entropy-regularized objective is recovered (e.g. Ziebart, 2010; Fox et al., 2016; Haarnoja et al., 2017; Schulman et al., 2017a; Hausman et al., 2018). More interestingly, there are scenarios where an available $\pi_0$ can be used to inject detailed prior knowledge into the learning problem. In a transfer scenario $\pi_0$ can be a learned object, and the KL term effectively plays the role of a shaping reward.
$\pi$ and $\pi_0$ can also be co-optimized. In this case the relative parametric forms of $\pi_0$ and $\pi$ are of importance. The optimal $\pi_0^*$ in eq. (1) is
$$\pi_0^* = \arg\max_{\pi_0} \mathbb{E}_{\pi}\Big[\sum_{t \ge 1} \gamma^t \log \pi_0(a_t | x_t)\Big]$$ (5)
which maximizes only terms in eq. (1) depending on $\pi_0$. Thus learning $\pi_0$ can be seen as supervised learning where $\pi_0$ is trained to match the history-conditional action sequences produced by $\pi$. It should be clear that $\pi_0 = \pi$ is the optimal solution when $\pi_0$ has sufficient capacity, and in this scenario the regularizing effect of $\pi_0$ is lost. When the capacity of $\pi_0$ is limited then $\pi_0$ will be forced to generalize the behavior of $\pi$. For instance, Teh et al. (2017) and Galashov et al. (2019) consider a multitask scenario in which $\pi$ is given task-identifying information, while $\pi_0$ is not. As a result, $\pi_0$ is forced to learn a common default behaviour across tasks, and this behaviour is then shareable across tasks to regularize $\pi$.
More generally, appropriate choices for the model classes of $\pi$ and $\pi_0$ will allow us to influence both the learning dynamics and the final solutions to eq. (1), and thus provide us with a rich means of injecting prior knowledge into the learning problem. This closely mirrors the discussion in the probabilistic modeling literature, where the parametric form of a model will greatly influence its ability to generalize to unseen data, as well as the nature of the learned representation. Galashov et al. (2019) focus on the information that $\pi_0$ has access to. Their discussion suggested that restricting the information available to $\pi_0$ improves generalization but also reduces the specificity of the modelled default behaviour.
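The tilting of the default policy towards high-value actions can be illustrated numerically. The sketch below is not part of the method description: it assumes a single decision with three discrete actions, so that the optimal policy is proportional to the default policy reweighted by exp(Q / alpha) and the soft value is alpha times the log of the normalizer. The default policy `pi0`, the action values `q` and the helper names `tilt_policy` and `kl` are hypothetical.

```python
import math

def tilt_policy(pi0, q, alpha):
    """Return (pi_star, v) for discrete actions.

    pi_star[a] is proportional to pi0[a] * exp(q[a] / alpha), and
    v = alpha * log(sum_a pi0[a] * exp(q[a] / alpha)) is the soft value.
    """
    q_max = max(q)  # subtract the max for numerical stability
    weights = [p * math.exp((qa - q_max) / alpha) for p, qa in zip(pi0, q)]
    z = sum(weights)
    pi_star = [w / z for w in weights]
    v = q_max + alpha * math.log(z)
    return pi_star, v

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi0 = [0.5, 0.3, 0.2]  # hypothetical default policy
q = [1.0, 2.0, 0.0]    # hypothetical action values

# Large alpha keeps the policy close to pi0; small alpha approaches greedy.
for alpha in (10.0, 1.0, 0.1):
    pi_star, v = tilt_policy(pi0, q, alpha)
    print(alpha, [round(p, 3) for p in pi_star], round(kl(pi_star, pi0), 3))
```

In this toy setting the soft value equals the expected action value under the tilted policy minus alpha times its KL to the default policy, which is the trade-off expressed by the objective.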
3 Hierarchically structured policies
In this paper we focus on a complementary perspective. We explore how variations of the parametric forms of $\pi$ and $\pi_0$, via the introduction of latent variables, give rise to hierarchically structured models with different inductive biases and generalization properties. In this section we discuss a particular instantiation of this idea; the general framework is discussed in Appendix A.
We consider multi-level representations of behaviour. In our experiments we instantiate a two-level architecture in which the high-level decisions are concerned with task objectives but largely agnostic to details of actuation. The low-level control translates the high-level decisions into motor actions while being agnostic to task objectives. The resulting abstractions can exploit repetitive structures within or across tasks. As two use cases we consider (a) multi-task control where different tasks require similar motor skills; as well as (b) a scenario where we aim to solve similar tasks with different actuation systems.
Conceptually, policies are divided into high-level and low-level components which interact via auxiliary latent variables. Let $z_t$ be a (continuous) latent variable for each time step $t$ (we discuss alternative choices such as latent variables that are sampled infrequently in Appendices A and C). The agent policy is extended as $\pi(a_t, z_t | x_t) = \pi^H(z_t | x_t)\, \pi^L(a_t | z_t, x_t)$ and likewise for the default policy $\pi_0$. $z_t$ can be interpreted as a high-level or abstract action, taken according to the high-level (HL) controller $\pi^H$, and which is translated into a low-level or motor action $a_t$ by the low-level (LL) controller $\pi^L$. We extend the histories $x_t$ and trajectories $\tau$ to appropriately include the $z_t$'s. As will be discussed in Section 5, structuring a policy into HL and LL controllers has been studied previously (e.g. Heess et al., 2016; Hausman et al., 2018; Haarnoja et al., 2018a; Merel et al., 2019), but the concept of a default policy has not been widely explored in this context.
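In case the $z_t$'s can take on many values or are continuous, the objective (1) becomes intractable as the marginal distributions $\pi(a_t | x_t)$ and $\pi_0(a_t | x_t)$ in the KL divergence cannot be computed in closed form. As discussed in more detail in Appendix A this problem can be addressed in different ways. For simplicity and concreteness we here assume that the latent variables in $\pi$ and $\pi_0$ have the same dimension and semantics. We can then construct a lower bound for the objective by using the following upper bound for the KL:
$$\mathrm{KL}(a_t | x_t) \le \mathrm{KL}(z_t | x_t) + \mathbb{E}_{\pi(z_t | x_t)}\big[\mathrm{KL}(a_t | z_t, x_t)\big]$$ (6)
which is tractably approximated using Monte Carlo sampling. The derivation is in Appendix C.1. Note that:
$$\mathrm{KL}(z_t | x_t) = \mathrm{KL}\big(\pi^H(z_t | x_t)\, \|\, \pi_0^H(z_t | x_t)\big)$$ (7)
$$\mathrm{KL}(a_t | z_t, x_t) = \mathrm{KL}\big(\pi^L(a_t | z_t, x_t)\, \|\, \pi_0^L(a_t | z_t, x_t)\big)$$ (8)
The resulting lower bound for the objective is:
$$\mathcal{L}(\pi, \pi_0) \ge \mathbb{E}_{\tau}\Big[\sum_{t \ge 1} \gamma^t r(s_t, a_t) - \alpha \gamma^t \mathrm{KL}(z_t | x_t) - \alpha \gamma^t \mathrm{KL}(a_t | z_t, x_t)\Big]$$ (9)
where $\tau$ is a trajectory that appropriately includes the $z_t$'s. The full derivation including discount terms is in Appendix C.2. In this paper we consider eq. (9) as the main objective function.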
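The upper bound on the KL in eq. (6) can be checked numerically on a toy problem. The sketch below assumes a discrete latent with two values and three discrete actions, so that the marginal action distributions can be computed exactly; all probability tables and the helper names `kl` and `marginal` are hypothetical and serve only to illustrate that the KL between the marginals is bounded by the HL KL plus the expected LL KL.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def marginal(high, low):
    """Marginal over actions: sum_z high[z] * low[z][a]."""
    n_actions = len(low[0])
    return [sum(high[z] * low[z][a] for z in range(len(high)))
            for a in range(n_actions)]

# Hypothetical agent policy: HL over 2 latents, LL over 3 actions.
pi_h = [0.7, 0.3]
pi_l = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
# Hypothetical default policy with the same latent semantics.
pi0_h = [0.5, 0.5]
pi0_l = [[0.6, 0.2, 0.2], [0.3, 0.3, 0.4]]

# Left-hand side: KL between the marginal action distributions.
kl_marginal = kl(marginal(pi_h, pi_l), marginal(pi0_h, pi0_l))
# Right-hand side: HL KL plus the LL KL averaged over the agent's latents.
kl_high = kl(pi_h, pi0_h)
kl_low = sum(pi_h[z] * kl(pi_l[z], pi0_l[z]) for z in range(len(pi_h)))
upper_bound = kl_high + kl_low

print(kl_marginal, upper_bound)
```

If the default policy uses the same LL table as the agent (`pi0_l = pi_l`), the LL term vanishes and only the KL between the HL distributions remains.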
3.1 Sharing low-level controllers
An advantage of the hierarchical structure of the policies is that it enables several options for partial parameter sharing, which when used in the appropriate context can make learning more statistically efficient. As a special case we consider sharing low-level controllers in both the agent and the default policy, i.e. $\pi^L(a_t | z_t, x_t) = \pi_0^L(a_t | z_t, x_t)$. This results in a new lower bound:
$$\mathcal{L}(\pi, \pi_0) \ge \mathbb{E}_{\tau}\Big[\sum_{t \ge 1} \gamma^t r(s_t, a_t) - \alpha \gamma^t \mathrm{KL}(z_t | x_t)\Big]$$ (10)
Note that this objective function is similar in spirit to the current KL-regularized RL approaches discussed in Section 2, except that the KL divergence is between policies defined on abstract actions $z_t$ as opposed to concrete actions $a_t$. The effect of this KL divergence is that it regularizes both the HL policies and the space of behaviours parameterised by the abstract actions. This special case of our framework also reveals a connection to Goyal et al. (2019), who motivated eq. (10) as an approximation of the information bottleneck for learning a goal-conditioned policy. In Sections 6 and 7, we empirically demonstrate and discuss how sharing or separating low-level controllers can be useful in different learning scenarios.
3.2 Regularizing via information asymmetries
As discussed in Galashov et al. (2019), restricting the information available to different policies is a powerful tool to force regularization and generalization. In our case we let this information asymmetry be reflected also in the separation between HL and LL controllers (see Figure 1). Specifically we introduce a separation of concerns between $\pi^L$ and $\pi^H$ by providing full information only to $\pi^H$ while information provided to $\pi^L$ is limited. In our experiments we vary the information provided to $\pi^L$; it receives body-specific (proprioceptive) information as well as different amounts of environment-related (exteroceptive) information. The task is only known to $\pi^H$. Hiding task-specific information from the LL controller makes it easier to transfer across tasks. It forces $\pi^L$ to focus on learning task-agnostic behaviour, and to rely on the abstract actions selected by $\pi^H$ to solve the task. Similarly, we hide task-specific information from $\pi_0^L$, regardless of the parameter sharing strategy for the LL controllers. Since we also limit the information available to $\pi_0^H$ (see Section 3.3), this setup implements a similar default behaviour policy $\pi_0(a_t | x_t)$ as in Galashov et al. (2019), which can be derived by marginalizing the latents $z_t$.
In the experiments we further consider transferring the HL controller across bodies, in situations where the abstract task is the same but the body changes. Here we additionally hide body-specific information from $\pi^H$, so that the HL controller is forced to learn body-agnostic behaviour.
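The information asymmetry described above amounts to routing different parts of an observation to different controllers. The following is a minimal sketch under assumed conventions: observations are dictionaries, and the key names (`proprio`, `box_pos`, `target_pos`, `task_id`) and the helper `split_observation` are hypothetical, not taken from the experimental code.

```python
def split_observation(obs, ll_extero_keys=(), hl_hidden_keys=()):
    """Route one observation dict to the HL and LL controllers.

    The HL controller sees everything except `hl_hidden_keys` (used when
    body-specific inputs should be hidden from it). The LL controller sees
    proprioception plus the selected exteroceptive keys, never the task.
    """
    hl_input = {k: v for k, v in obs.items() if k not in hl_hidden_keys}
    ll_keys = ("proprio",) + tuple(ll_extero_keys)
    ll_input = {k: obs[k] for k in ll_keys}
    return hl_input, ll_input

obs = {
    "proprio": [0.1, -0.2],    # hypothetical body-specific readings
    "box_pos": [1.0, 2.0],     # exteroceptive
    "target_pos": [3.0, 0.5],  # exteroceptive
    "task_id": 1,              # task-identifying information
}

# Task setting: LL gets proprioception and box positions, HL gets everything.
hl_in, ll_in = split_observation(obs, ll_extero_keys=("box_pos",))

# Body-transfer setting: additionally hide proprioception from the HL controller.
hl_body_agnostic, _ = split_observation(obs, hl_hidden_keys=("proprio",))
```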
3.3 Parametrizing the default policy
In our experiments we consider different formulations for the default policy. For the LL default policy, we use identical parametric forms to implement $\pi^L$ and $\pi_0^L$, regardless of the parameter sharing strategy. The specific form of the LL controller depends on the experiments. The remaining freedom lies in the choice of the default HL controller $\pi_0^H$, which may induce different inductive biases based on its parametric form. Here, we consider the following choices:
Independent isotropic Gaussian We define the default HL policy as $\pi_0^H(z_t | x_t) = \mathcal{N}(z_t | 0, I)$.
AR(1) process $\pi_0^H(z_t | x_t) = \mathcal{N}(z_t | \alpha z_{t-1}, (1 - \alpha^2) I)$, i.e. the default HL policy is a first-order autoregressive process with a fixed parameter $0 \le \alpha < 1$ chosen to ensure a marginal distribution $\mathcal{N}(0, I)$. This allows for more structured temporal dependence among the abstract actions.
Learned AR prior Similar to the AR(1) process, this default HL policy allows $z_t$ to depend on $z_{t-1}$, but now the high-level default policy is a Gaussian distribution with mean and variance that are learned functions of $z_{t-1}$ with parameters $\phi$: $\pi_0^H(z_t | x_t) = \mathcal{N}(z_t | \mu_\phi(z_{t-1}), \sigma_\phi^2(z_{t-1}))$.
4 Algorithm
There are different instantiations of the proposed method based on several algorithmic choices. In case $\pi_0^H$ has learnable parameters (learned AR prior) or the LL controller is not shared, we jointly optimize the default policy and the agent's policy, while the agent's policy is regularized by a target default policy, which is periodically updated to the new default policy. The objective for the HL default policy is similar to distillation (Parisotto et al., 2016; Rusu et al., 2016) or supervised learning, where the HL controller defines the data distribution. Note that due to the particular way we lower bound the KL (Section 3), the supervised step remains unproblematic despite the presence of the latent variables in $\pi_0$. [Footnote 3: We discuss alternative schemes in Appendix A.]
To optimize the hierarchical policy, we follow a strategy similar to Heess et al. (2016) and reparameterize $z_t \sim \pi^H(z_t | x_t)$ as $z_t = f^H(x_t, \epsilon_t)$, where $\epsilon_t \sim \rho(\epsilon_t)$ is a fixed noise distribution and $f^H(\cdot)$ is a deterministic function. In practice this means that the hierarchical policy can be treated as a flat policy $\pi(a_t | \epsilon_t, x_t) = \pi^L(a_t | f^H(x_t, \epsilon_t), x_t)$. We can employ different algorithms to optimize this policy. As an example, Algorithm 1 provides the pseudocode for a simple actor-critic algorithm; in the experiments we use different formulations depending on the environment.
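The reparameterization above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the architecture used in the experiments: the input `x` is a scalar, the HL policy is Gaussian with a hand-written mean and a fixed standard deviation, and the names `f_high`, `low_level_mean` and `flat_policy_sample` are hypothetical.

```python
import random

def f_high(x, eps):
    """Hypothetical deterministic HL map z = f(x, eps) for a Gaussian HL policy."""
    mean, std = 0.5 * x, 0.2  # toy mean and standard deviation
    return mean + std * eps

def low_level_mean(z, x):
    """Hypothetical LL controller: mean action given latent z and input x."""
    return 2.0 * z - 0.1 * x

def flat_policy_sample(x, rng, action_std=0.05):
    """Sample from the flat policy: draw eps, map it to z, then sample a."""
    eps = rng.gauss(0.0, 1.0)  # fixed noise distribution
    z = f_high(x, eps)         # latent is deterministic given (x, eps)
    a = low_level_mean(z, x) + action_std * rng.gauss(0.0, 1.0)
    return a, z, eps

rng = random.Random(0)
a, z, eps = flat_policy_sample(1.0, rng)

# With eps held fixed, z varies smoothly with x (slope 0.5 in this toy map),
# which is what lets gradients pass through the sampled latent.
h = 1e-6
slope = (f_high(1.0 + h, eps) - f_high(1.0, eps)) / h
print(a, z, slope)
```

Because the randomness is isolated in `eps`, the composition of the HL map and the LL controller behaves like a single policy conditioned on the input and the noise.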
In continuous control experiments, we employ SVG(0) (Heess et al., 2015) augmented with experience replay to train the agents. We reparameterize the flat policy and optimize it by backpropagating the gradient from an action value function. The action value function is optimized to match a target action value estimated by Retrace (Munos et al., 2016), which provides a low-variance estimate of the action value from K-step windows of off-policy trajectories.
For discrete action spaces we adapt IMPALA (Espeholt et al., 2018), a distributed actor-critic algorithm with off-policy correction. We estimate the gradient of the flat policy as $(\hat{Q}_t - b_t)\, \nabla \log \pi$, where $\pi$ denotes the flat policy evaluated at the sampled action, $\hat{Q}_t$ is the action value estimate and $b_t$ is a baseline. We use a learned value function as $b_t$ and estimate $\hat{Q}_t$ based on V-trace (Espeholt et al., 2018), which provides a low-variance estimate of the action value from off-policy trajectories. The off-policy trajectories are buffered by a queue of size equal to one minibatch to ensure the trajectories are close to the current policy.
All algorithms are implemented in a distributed setup (Espeholt et al., 2018; Riedmiller et al., 2018) where multiple actors are used to collect trajectories and a single learner is used to optimize model parameters. Similarly to other KL-regularized RL approaches (e.g. Teh et al., 2017; Galashov et al., 2019), we additionally regularize the entropy of the agent's action distribution to encourage exploration. More details about the learning algorithms are in Appendix B.
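For the discrete-action experiments, the off-policy value targets come from V-trace. The sketch below follows the recursive form of the V-trace target given by Espeholt et al. (2018) for a single trajectory; it is an illustrative stand-in and not the implementation used here. The function name `vtrace_targets`, the list-based interface and the default clipping thresholds are assumptions.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   discount=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory.

    rewards[t], values[t] (the value estimate at step t) and ratios[t]
    (target policy probability over behaviour policy probability for the
    action taken at step t) are per-step lists; bootstrap_value is the
    value estimate for the state after the last step.
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    correction = 0.0  # target minus value at step t + 1, zero past the end
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # clipped weight on the TD error
        c = min(c_bar, ratios[t])      # clipped trace coefficient
        delta = rho * (rewards[t] + discount * next_values[t] - values[t])
        correction = delta + discount * c * correction
        targets[t] = values[t] + correction
    return targets

# Hypothetical three-step trajectory collected on-policy (all ratios equal 1).
targets = vtrace_targets(
    rewards=[1.0, 0.0, 2.0],
    values=[0.5, 0.4, 0.3],
    bootstrap_value=0.2,
    ratios=[1.0, 1.0, 1.0],
    discount=0.9,
)
print(targets)
```

With all ratios equal to one the targets reduce to discounted n-step returns bootstrapped from the final value estimate, which gives a simple sanity check; with ratios of zero the targets fall back to the current value estimates.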
5 Related Work
Entropy-regularized reinforcement learning (RL), also known as maximum entropy RL (Ziebart, 2010; Kappen et al., 2012; Toussaint, 2009), is a special case of KL-regularized RL. This framework connects probabilistic inference and sequential decision making problems. Recently, this idea has been adapted to deep reinforcement learning (Fox et al., 2016; Schulman et al., 2017a; Nachum et al., 2017; Haarnoja et al., 2017; Hausman et al., 2018; Haarnoja et al., 2018b). Another instance of KL-regularized RL is the family of trust-region-based methods (Schulman et al., 2015, 2017b; Wang et al., 2017; Abdolmaleki et al., 2018). They use the KL divergence between the new policy and the old policy as a trust region constraint for conservative policy updates.
Introducing a parameterized default policy provides a convenient way to transfer knowledge or regularize the policy. Schmitt et al. (2018) use a pretrained policy as the default policy; other works jointly learn the policy and default policy to capture reusable behaviour from experience (Teh et al., 2017; Czarnecki et al., 2018; Galashov et al., 2019; Grau-Moya et al., 2019). To retain the role of the default policy as a regularizer, previous work has restricted its input (Galashov et al., 2019; Grau-Moya et al., 2019) or its parametric form (Czarnecki et al., 2018), or has shared it across different contexts (Teh et al., 2017; Ghosh et al., 2018).
Another closely related form of regularization for RL is the information bottleneck (Tishby & Polani, 2011; Still & Precup, 2012; Rubin et al., 2012; Ortega & Braun, 2013; Tiomkin & Tishby, 2017). Galashov et al. (2019) discussed the relation between the information bottleneck and KL-regularized RL. Strouse et al. (2018) learn to hide or reveal information for future use in multi-agent cooperation or competition. Goyal et al. (2019) consider identifying bottleneck states based on an objective similar to eq. (10), which is a special case of our framework, and using it for transfer. However, their setting is differently motivated, and so is their objective. For example, they use the positive KL between pretrained HL controllers as an exploration bonus to learn a new related task. We, on the other hand, are interested in transferring the same behaviour policy to a different body. For this we rely on the negative KL as an intrinsic reward, which signals whether the LL controller executed the latent action set by the HL controller.
The hierarchical RL literature (Dayan & Hinton, 1993; Parr & Russell, 1998; Sutton et al., 1999) has studied hierarchy extensively as a means to introduce inductive bias. Among the various approaches (Sutton et al., 1999; Bacon et al., 2017; Vezhnevets et al., 2017; Nachum et al., 2018, 2019; Xie et al., 2018), ours resembles Heess et al. (2016), Hausman et al. (2018), Haarnoja et al. (2018a) and Merel et al. (2019), in that an HL controller modulates an LL controller through a continuous channel. For learning the LL controller, imitation learning (Fox et al., 2017; Krishnan et al., 2017; Merel et al., 2019), unsupervised learning (Gregor et al., 2017; Eysenbach et al., 2019) and meta learning (Frans et al., 2018) have been employed. Similar to our approach, Heess et al. (2016), Florensa et al. (2017) and Hausman et al. (2018) use a pretraining task to learn a reusable LL controller. However, the concept of a default policy has not been widely explored in this context.
Works that transfer knowledge across different bodies include Devin et al. (2017), Gupta et al. (2017), Sermanet et al. (2017) and Xie et al. (2018). Devin et al. (2017) mix and match modular task and body policies for zero-shot generalization to unseen combinations. Gupta et al. (2017) and Sermanet et al. (2017) learn a common representation space to align poses from different bodies. Xie et al. (2018) transfer the HL controller in a hierarchical agent, where the LL controller is learned with an intrinsic reward based on goals in state space. This approach, however, requires careful design of the goal space and the intrinsic reward.
6 Experiments
We evaluate our method in several environments with continuous action and state spaces. We consider a set of structured, sparse reward tasks that can be executed by multiple bodies with different degrees of freedom. The tasks and bodies are illustrated in Figure 2.
We consider task distributions that are designed such that their solutions exhibit significant overlap in trajectory space so that transfer can reasonably be expected. They are further designed to contain instances of variable difficulty and hence provide a natural curriculum. The tasks are:
Go to 1 of K targets: The agent receives a sparse reward on reaching a specific target among K locations. The egocentric locations of each of the targets and the goal target index are provided as observations.
Move K boxes to K targets: The goal is to move one of K boxes to one of K targets (the positions of which are randomized) as indicated by the environment.
Move heavier box: Variants of move K boxes to K targets with heavier boxes.
Gather boxes: The agent needs to move two boxes such that they are in contact with each other.
Move box and go to target: The agent is required to move the box to one target and then go to a different target in a single episode.
We use three different bodies: Ball, Ant, and Quadruped. Ball and Ant have been used in several previous works (Heess et al., 2017; Xie et al., 2018; Galashov et al., 2019), and we introduce the Quadruped as an alternative to the Ant. The Ball is a body with 2 actuators for moving forward or backward and turning left or right. The Ant is a body with 4 legs and 8 actuators, which moves its legs to walk and to interact with objects. The Quadruped is similar to the Ant, but with 12 actuators. Each body is characterized by a different set of proprioceptive (proprio) features. Further details of the tasks and bodies are in Appendix E.
6.1 Experimental setting
Throughout the experiments, we use 32 actors to collect trajectories and a single learner to optimize the model. We plot average episode return with respect to the number of steps processed by the learner. Note that the number of steps is different from the number of the agent's interactions with the environment, because the collected trajectories are processed multiple times by a centralized learner to update model parameters. Hyperparameters, including KL cost and action entropy regularization cost, are optimized on a per-task basis. Details are provided in Appendices F and G.
6.2 Learning from scratch experiments
We first study whether KL regularization with the proposed structure and parameterization benefits end-to-end learning. As baselines, we use a policy with entropy regularization (SVG0) and a KL-regularized policy with an unstructured default policy similar to Galashov et al. (2019) and Teh et al. (2017) (DISTRAL prior). As described in Section 3 we employ a hierarchical structure with shared LL components (Shared LL prior) and with separate LL components (Separate LL prior). Unless otherwise stated, we use Shared LL prior as our default hierarchical model. The HL controller receives full information while the LL controller (and hence the default policy) receives proprioceptive information plus the positions of the box(es) as indicated. The same information asymmetry is applied to the DISTRAL prior, i.e. the default policy receives proprioception plus box positions as inputs. We explore multiple HL default policies including the isotropic Gaussian, the AR(1) process, and the learned AR prior.
Figure 3 illustrates the results of the experiment. Our main finding is that the KL regularized objective significantly speeds up learning of complex tasks, and that the hierarchical approach provides an advantage over the flat, DISTRAL formulation. The gap increases for more challenging tasks (e.g. move 2 boxes to 2 targets). For analysis, we compare different combinations of hierarchical structure and parameter sharing strategies. As baselines using parameter sharing without hierarchy, we introduce DISTRAL prior 2 cols and DISTRAL shared prior, which are variants of DISTRAL prior with different parameter sharing strategies. Details of these baselines are explained in Appendix F.2. The result in Figure 4 shows that both hierarchical structure and partial parameter sharing, introduced with the proposed framework, are important to speed up learning.
7 Transfer Learning
The hierarchical structure introduces a modularity of the policy and default policy, which can be utilized for transfer learning. We consider two transfer scenarios (see Figure 5): 1) task transfer where we reuse the learned default policy to solve novel tasks with different goals, and 2) body transfer, where reusing the body agnostic HL controller and default policy transfers the goal directed behaviour to another body.
7.1 Task transfer
We consider transfer between task distributions whose solutions exhibit significant shared structure, e.g. because solution trajectories can be produced by a common set of skills or repetitive behaviour. If the default policy can capture and transfer this reusable structure it will facilitate learning similar tasks. Transfer then involves specializing the default behavior to the needs of the target task (e.g. by directing locomotion towards a goal).
For task transfer, we reuse pretrained goal-agnostic components, including the HL default policy $\pi_0^H$ and the LL default policy $\pi_0^L$, and learn a new HL controller $\pi^H$ and optionally a new LL controller $\pi^L$ for a target task (see Figure 5a). In general, we set the LL controller identical to the LL default policy (Shared LL), but for some tasks we allow $\pi^L$ to diverge from $\pi_0^L$ (Separate LL). In the case of Shared LL, similarly to e.g. Heess et al. (2016) and Hausman et al. (2018), the new HL controller learns to manipulate the LL controller by modulating and interpolating the latent space, while being regularized by $\pi_0^H$. Compared to Galashov et al. (2019), which also attempts to transfer default behaviour learned on a different task, in this work we exploit the hierarchical structure of the policy and the default policy. In particular, when the LL policy and the LL default policy are parameterized identically, we can either share parameters (Shared LL) or initialize the LL policy with the pretrained parameters of the LL default policy (Separate LL with weight initialization) to exploit pretrained goal-agnostic behaviour even in the initial phase of learning.
In the experiments we introduce two baselines. The first baseline, the identical model learned from scratch (Hierarchical Agent), allows us to assess the benefit of transfer. The second baseline is a DISTRAL-style prior, i.e. we transfer a pretrained unstructured default policy to regularize the policy for the target task. This second baseline provides an indication of whether the hierarchical policy structure is beneficial for transfer. Additionally, we compare different types of HL default policies. Specifics of the experiments, including the information provided to HL and LL, are provided in Appendix E.
Figure 6 illustrates the result of task transfer. Overall, transferring the pretrained default policy brings significant benefits when learning related tasks. Furthermore, the hierarchical architecture, which facilitates parameter reuse, performs better than the DISTRAL prior regardless of the type of HL default policy. While sharing the LL is effective for transfer between tasks with significant overlap, allowing the LL policy to diverge from the LL default policy as in eq. (9) is useful in some cases. Figure 7 illustrates the result on task transfer scenarios requiring adaptation of skills. Here the LL policy is only initialized and soft-constrained to the behavior of the LL default policy (via the KL term in eq. (9)), which allows adapting the LL skills as required for the target task.
7.2 Body transfer
Our formulation can also be used for transfer between different bodies which share common behaviour structures for the same task distribution. To transfer the HL structure of a goal-specific behaviour, we reuse the pretrained body-agnostic components, the HL controller $\pi^H$ and the default policy $\pi_0^H$. We learn a new body-specific LL controller $\pi^L$, which is assumed to be shared with the LL default controller $\pi_0^L$ (see Figure 5b). The transferred HL components provide goal-specific behaviour actuated on the latent space, which can then be instantiated by learning a new LL controller.
The main challenge of transferring task knowledge across bodies is how to interpret the latent actions. The HL controller provides abstract instructions for solving a task but it is unknown how they should be instantiated on the new body. Furthermore, rewarding the LL controller in order to learn the semantics of the latent code is problematic, as these semantics are not available to us. To address this challenge, we optimize eq. (10) during transfer; it provides a dense reward signal based on the negative KL divergence. Using the KL of the HL components during transfer looks similar to Goyal et al. (2019). However, its interpretation and behaviour are quite different. Goyal et al. (2019) rely on the positive KL between HL controllers as an exploration bonus, trying to get the agent to explore the new task (in the transfer phase). In contrast, we rely on the negative KL, which encourages the HL controller to be close to the default policy. In our scenario the negative KL is an intrinsic reward that indicates whether the LL controller executed the latent action properly, so that the HL controller could behave in a habitual way.
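A minimal sketch of this intrinsic reward, assuming a scalar latent, a Gaussian HL policy and the AR(1) default HL policy of Section 3.3, for which the KL has a closed form. The function names, the AR coefficient `coeff` and the weight `kl_cost` are hypothetical; the experiments may use different parametric forms.

```python
import math

def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)) for scalar Gaussians."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)

def ar1_default(z_prev, coeff):
    """Mean and std of a unit-variance AR(1) default given the previous latent."""
    return coeff * z_prev, math.sqrt(1.0 - coeff ** 2)

def kl_reward(hl_mean, hl_std, z_prev, coeff, kl_cost):
    """Negative, scaled KL between the HL policy and the AR(1) default."""
    d_mean, d_std = ar1_default(z_prev, coeff)
    return -kl_cost * gaussian_kl(hl_mean, hl_std, d_mean, d_std)

coeff, z_prev = 0.9, 0.5

# HL policy that matches the default exactly: no penalty.
habitual = kl_reward(coeff * z_prev, math.sqrt(1.0 - coeff ** 2),
                     z_prev, coeff, kl_cost=1.0)

# HL policy that has to deviate from the default: negative reward.
deviating = kl_reward(2.0, 0.1, z_prev, coeff, kl_cost=1.0)
print(habitual, deviating)
```

The reward is largest (zero) when the HL controller can act exactly as the default policy would, and becomes more negative the further it has to deviate.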
We explore this body transfer setup in both discrete and continuous environments. We compare performance to learning the hierarchical policy from scratch and analyze the effects of the KL regularization. The experimental setup in the continuous case is the same as before, and Figure 8 provides results for different types of bodies and tasks. Generally, transferring the HL component and relying on both the task reward and the KL term as a dense shaping reward signal for the LL controller works best in these settings. As illustrated in Figure 9, this trend is also apparent when using egocentric vision as observation input.
In the discrete case, we construct a discrete go-to-target task in a 2D grid world. An agent and goal are randomly placed in a grid and the agent is rewarded for reaching the goal. The body-agnostic task observation is the global x, y coordinates of the agent and goal. The different bodies in this case must take different numbers of actions to achieve an actual step in the grid. For instance, the 4-step body needs to take 4 consecutive actions in the same direction to move forward by one step in that direction. We assume that the latent is sampled every $n$ steps, where $n$ is the number of actions required to take a step. Details for models in which latent variables are sampled periodically are provided in Appendix B. The environment is described in Appendix E.
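The multi-step bodies can be illustrated with a toy environment. This is a sketch under assumptions, not the environment of Appendix E: the class name `MultiStepGridBody`, the action encoding, the reset of the action counter when the direction changes, the clipping at the grid boundary and the reward of 1 at the goal are all hypothetical choices.

```python
class MultiStepGridBody:
    """Toy grid world in which a move needs `n` identical consecutive actions."""

    # Hypothetical action encoding: up, down, right, left.
    MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}

    def __init__(self, size, n, start, goal):
        self.size, self.n = size, n
        self.pos, self.goal = start, goal
        self.last_action, self.count = None, 0

    def step(self, action):
        """Apply one low-level action; return (position, reward, done)."""
        if action == self.last_action:
            self.count += 1
        else:
            # Assumption: changing direction restarts the count.
            self.last_action, self.count = action, 1
        if self.count == self.n:
            dx, dy = self.MOVES[action]
            x = min(max(self.pos[0] + dx, 0), self.size - 1)
            y = min(max(self.pos[1] + dy, 0), self.size - 1)
            self.pos = (x, y)
            self.last_action, self.count = None, 0
        done = self.pos == self.goal
        return self.pos, float(done), done

# A 4-step body needs four "right" actions to move one cell to the right.
env = MultiStepGridBody(size=5, n=4, start=(0, 0), goal=(1, 0))
trace = [env.step(2) for _ in range(4)]
print(trace)
```

With `n=1` the same class behaves like an ordinary grid world, which corresponds to the 1-step body used as the source of transfer.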
Figure 10 illustrates the result for transferring behavior from the 1-step to the 8-step body with the learned AR prior. (We were only able to solve the challenging 8-step version through body transfer with a KL reward.) In Figure 10, we visualize the negative KL divergence (KL reward) along the agent's movement in every location of the grid. The size and the colour of the arrows denote the expected KL reward. This illustrates that the KL reward forms a vector field that guides the agent toward the goal, which provides a dense reward signal when transferring to a new body. This observation explains the gain from KL regularization, which can lead to faster learning and improve asymptotic performance.
8 Discussion
In this work we explore how hierarchical structure can be introduced in KL-regularized RL using latent variables, how inductive biases can be introduced in these structures via information asymmetry and parameter sharing, and how this can lead to specialization of policy components and induce semantically meaningful latent representations. We show that the resulting structures can be exploited efficiently at transfer time, where either the HL component or the LL component can be completely transferred to new tasks. Furthermore, in the case of body transfer (where the LL component needs to be relearned), the KL term between the HL components of the policy and the default policy acts as a dense reward signal guiding learning for the LL component.
Overall, we believe that the framework of KL-regularized RL as probabilistic modelling offers rich mechanisms for introducing prior knowledge, and is fertile ground for advancing RL algorithms. Appendix A describes the framework in further generality than explored here.
References
 Abdolmaleki et al. (2018) Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018.
 Agakov & Barber (2004) Agakov, F. V. and Barber, D. An auxiliary variational method. In Neural Information Processing, 11th International Conference, ICONIP 2004, Calcutta, India, November 22-25, 2004, Proceedings, pp. 561–566, 2004.

 Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Czarnecki et al. (2018) Czarnecki, W., Jayakumar, S., Jaderberg, M., Hasenclever, L., Teh, Y. W., Heess, N., Osindero, S., and Pascanu, R. Mix & match agent curricula for reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Dayan & Hinton (1993) Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278, 1993.
 Devin et al. (2017) Devin, C., Gupta, A., Darrell, T., Abbeel, P., and Levine, S. Learning modular neural network policies for multitask and multirobot transfer. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176. IEEE, 2017.
 Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Eysenbach et al. (2019) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
 Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations, 2017.
 Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. AUAI Press, 2016.
 Fox et al. (2017) Fox, R., Krishnan, S., Stoica, I., and Goldberg, K. Multi-level discovery of deep options. CoRR, abs/1703.08294, 2017. URL http://arxiv.org/abs/1703.08294.
 Frans et al. (2018) Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. In International Conference on Learning Representations, 2018.
 Galashov et al. (2019) Galashov, A., Jayakumar, S., Hasenclever, L., Tirumala, D., Schwarz, J., Desjardins, G., Czarnecki, W. M., Teh, Y. W., Pascanu, R., and Heess, N. Information asymmetry in KL-regularized RL. In International Conference on Learning Representations, 2019.
 Ganin et al. (2018) Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S. M. A., and Vinyals, O. Synthesizing programs for images using reinforced adversarial learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1666–1675, 2018.
 Ghosh et al. (2018) Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. In International Conference on Learning Representations, 2018.
 Goyal et al. (2019) Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Larochelle, H., Botvinick, M., Levine, S., and Bengio, Y. Transfer and exploration via the information bottleneck. In International Conference on Learning Representations, 2019.
 Grau-Moya et al. (2019) Grau-Moya, J., Leibfried, F., and Vrancx, P. Soft Q-learning with mutual-information regularization. In International Conference on Learning Representations, 2019.
 Gregor et al. (2017) Gregor, K., Rezende, D. J., and Wierstra, D. Variational intrinsic control. In International Conference on Learning Representations, 2017.
 Gupta et al. (2017) Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. In International Conference on Learning Representations, 2017.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.
 Haarnoja et al. (2018a) Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1851–1860, 2018a.
 Haarnoja et al. (2018b) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pp. 1861–1870, 2018b.
 Hausman et al. (2018) Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
 Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
 Heess et al. (2016) Heess, N., Wayne, G., Tassa, Y., Lillicrap, T., Riedmiller, M., and Silver, D. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Heess et al. (2017) Heess, N., Tirumala, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Johnson et al. (2016) Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems 29, pp. 2946–2954. 2016.
 Kappen et al. (2012) Kappen, H. J., Gómez, V., and Opper, M. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
 Krishnan et al. (2017) Krishnan, S., Fox, R., Stoica, I., and Goldberg, K. DDCO: discovery of deep continuous options for robot learning from demonstrations. CoRR, abs/1710.05421, 2017. URL http://arxiv.org/abs/1710.05421.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Merel et al. (2019) Merel, J., Hasenclever, L., Galashov, A., Ahuja, A., Pham, V., Wayne, G., Teh, Y. W., and Heess, N. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
 Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785, 2017.
 Nachum et al. (2018) Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 3307–3317. 2018.
 Nachum et al. (2019) Nachum, O., Gu, S., Lee, H., and Levine, S. Near-optimal representation learning for hierarchical reinforcement learning. In International Conference on Learning Representations, 2019.
 OpenAI (2018) OpenAI. OpenAI Five. https://blog.openai.com/openaifive/, 2018.
 OpenAI et al. (2018) OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 Ortega & Braun (2013) Ortega, P. A. and Braun, D. A. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683, 2013.
 Parisotto et al. (2016) Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. In International Conference on Learning Representations, 2016.
 Parr & Russell (1998) Parr, R. and Russell, S. J. Reinforcement learning with hierarchies of machines. In Advances in neural information processing systems, pp. 1043–1049, 1998.
 Rawlik et al. (2012) Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: science and systems, volume 13, pp. 3052–3056, 2012.

 Rennie et al. (2017) Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024, 2017.
 Riedmiller et al. (2018) Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. In Proceedings of the 35th International Conference on Machine Learning, pp. 4344–4353, 2018.
 Rubin et al. (2012) Rubin, J., Shamir, O., and Tishby, N. Trading value and information in MDPs. Decision Making with Imperfect Decision Makers, pp. 57–74, 2012.
 Rusu et al. (2016) Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. In International Conference on Learning Representations, 2016.
 Salimans et al. (2014) Salimans, T., Kingma, D. P., and Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. ArXiv e-prints, October 2014.
 Schmitt et al. (2018) Schmitt, S., Hudson, J. J., Zidek, A., Osindero, S., Doersch, C., Czarnecki, W. M., Leibo, J. Z., Kuttler, H., Zisserman, A., Simonyan, K., et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.
 Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 1889–1897. JMLR.org, 2015.
 Schulman et al. (2017a) Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a.
 Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
 Sermanet et al. (2017) Sermanet, P., Lynch, C., Hsu, J., and Levine, S. Time-contrastive networks: Self-supervised learning from multi-view observation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 14–15, 2017.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 Still & Precup (2012) Still, S. and Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
 Strouse et al. (2018) Strouse, D., Kleiman-Weiner, M., Tenenbaum, J., Botvinick, M., and Schwab, D. J. Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems, pp. 10270–10281, 2018.
 Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 Teh et al. (2017) Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
 Tiomkin & Tishby (2017) Tiomkin, S. and Tishby, N. A unified Bellman equation for causal information and value in Markov decision processes. arXiv preprint arXiv:1703.01585, 2017.
 Tishby & Polani (2011) Tishby, N. and Polani, D. Information theory of decisions and actions. Perception-Action Cycle, pp. 601–636, 2011.
 Todorov (2007) Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, 2007.
 Toussaint (2009) Toussaint, M. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. ACM, 2009.
 Vezhnevets et al. (2017) Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 3540–3549, 2017.
 Wang et al. (2017) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. In International Conference on Learning Representations, 2017.
 Xie et al. (2018) Xie, S., Galashov, A., Liu, S., Hou, S., Pascanu, R., Heess, N., and Teh, Y. W. Transferring task goals via hierarchical reinforcement learning, 2018. URL https://openreview.net/forum?id=S1Y6TtJvG.
 Zhu et al. (2018) Zhu, H., Gupta, A., Rajeswaran, A., Levine, S., and Kumar, V. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. arXiv preprint arXiv:1810.06045, 2018.
 Ziebart (2010) Ziebart, B. D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.
 Zoph & Le (2017) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
Appendix A A general framework for RL as probabilistic modelling
In Sections 2 and 3 of the main text we have introduced the KL-regularized objective and explored a particular formulation that uses latent variables in the default policy and policy (Section 3 and experiments). The particular choice in Section 3 arises as a special case of a more general framework, which we outline briefly here.
For both the default policy and the agent policy we can consider general directed latent variable models of the following form
(11)  
(12) 
where the latent variables of both models can be time varying and can be causally dependent on the trajectory prefix. The latent variables can further be continuous or discrete, and can exhibit further structure (and thus include e.g. binary variables that model option termination). The general form of the objective presented in the main text remains valid regardless of the particular form of the default policy and the agent policy. This form can be convenient when both are tractable (e.g. when their latent variables have a small number of discrete states or decompose conveniently over time, e.g. as in Fox et al. 2017; Krishnan et al. 2017).
In general, however, latent variables in the default policy and the agent policy may introduce the need for additional approximations. In this case different models and algorithms can be instantiated based on a) the particular approximation chosen, and b) choices for sharing components between the two. A possible starting point when the default policy contains latent variables is the following lower bound to the objective:
(13)  
(14)  
(15) 
If the latent variables are discrete and take on a small number of values we can compute the posterior over them exactly (e.g. using the forward-backward algorithm as in Fox et al. 2017; Krishnan et al. 2017); in other cases we can learn a parameterized approximation to the true posterior or can conceivably apply mixed inference schemes (e.g. Johnson et al., 2016).
Latent variables in the policy can require an alternative approximation discussed e.g. in Hausman et al. (2018):
(16) 
where the inference distribution is a learned approximation to the true posterior over the policy's latent variables. This formulation bears interesting similarities with diversity-inducing regularization schemes based on mutual information (e.g. Gregor et al., 2017; Florensa et al., 2017) but arises here as an approximation to trajectory entropy. This formulation also has interesting connections to auxiliary variable formulations in the approximate inference literature (Salimans et al., 2014; Agakov & Barber, 2004).
When both the default policy and the agent policy contain latent variables, eqs. (15, 16) can be combined. The model described in Section 3 in the main text then arises when the latent variable is “shared” between the two and we effectively use the policy itself as the inference network for the default policy. In this case the objective simplifies to
(17) 
When we further set we recover the model discussed in the main text of the paper.
As a proof of concept for a model without a shared latent space, with latent variables in the default policy but not in the agent policy, we consider a simple humanoid with 28 degrees of freedom and 21 actuators and two different tasks. The first is a dense-reward walking task, in which the agent has to move forward, backward, left, or right at a fixed speed; the direction is randomly sampled at the beginning of an episode and changed to a different direction halfway through the episode. The second is a sparse-reward go-to-target task, in which the agent has to move to a target whose location is supplied to the agent as a feature vector, similar to those considered in Galashov et al. (2019).
Figure 11 shows some exploratory results. In a first experiment we compare different prior architectures on the directional walking task. We let the prior marginalize over task condition. We include a feedforward network, an LSTM, and a latent variable model with one latent variable per time step in the comparison. For the latent variable model we chose an inference network so that eq. (15) decomposes over time. All priors studied in this comparison gave a modest speedup in learning. While the latent variable prior works well, it does not work as well as the LSTM and MLP priors in this setup. In a first set of transfer experiments, we used the learned priors to learn the walking task again. Again, the learned priors led to a modest speedup relative to learning from scratch.
We also experimented with parameter sharing for transfer as in the main text. We can freeze the learned conditional distribution over actions given the latent and learn a new policy over the latent, effectively using the learned latent space as an action space. In a second set of experiments, we study how well a prior learned on the walking task can transfer to the sparse go-to-target task. Here all learned priors led to a significant speedup relative to learning from scratch. Small return differences aside, all three priors considered here solved the task with clear goal-directed movements. The baseline, on the other hand, only learned to go to very close-by targets. Reusing the latent space did not work well on this task. We speculate that the required turns are not easy to represent in the latent space resulting from the walking task.
Appendix B Algorithm
This section provides more details about the learning algorithm we use to optimize eq. (9) in the main text. We use different learning algorithms depending on the environment. Specifically, we employ SVG(0) (Heess et al., 2015) with experience replay in continuous control environments, and IMPALA (Espeholt et al., 2018) in discrete action space environments. Overall, we adapt the base algorithms to support learning a hierarchical policy and prior. Unless otherwise mentioned, we follow the notation of the main paper.
B.1 Reparameterized latent for hierarchical policy
To optimize the hierarchical policy, we follow a strategy similar to Heess et al. (2016) and reparameterize the latent as the output of a deterministic function applied to a noise variable drawn from a fixed distribution. The deterministic function outputs distribution parameters. In practice this means that the hierarchical policy can be treated as a flat policy conditioned on the noise variable. We exploit the reparameterized flat policy to employ existing distributed learning algorithms with minimal modification.
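The reparameterization above can be sketched as follows. The sketch assumes a Gaussian latent and uses toy scalar stand-ins for the HL and LL networks; `hl_params`, `sample_latent`, and `flat_policy_action` are hypothetical names, not the paper's implementation.

```python
import math
import random


def hl_params(x):
    """Hypothetical deterministic HL network: maps an observation to (mean, log_std)."""
    return 0.5 * x, -1.0


def sample_latent(x, eps):
    """Reparameterized latent: deterministic in x once the fixed-distribution noise eps is given."""
    mean, log_std = hl_params(x)
    return mean + math.exp(log_std) * eps


def flat_policy_action(x, eps):
    """Toy LL controller applied to the reparameterized latent.

    Given eps, the two-level policy behaves like a single flat policy of (x, eps).
    """
    z = sample_latent(x, eps)
    return math.tanh(z + 0.1 * x)


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # noise drawn from a fixed standard normal (an assumption)
action = flat_policy_action(2.0, eps)
```

Because all randomness enters through `eps`, gradients with respect to the HL parameters could flow through `z` in an autodiff framework; the plain-Python version here only shows the data flow.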
B.2 Continuous control
In continuous control experiments, we employ a distributed version of SVG(0) (Heess et al., 2015) augmented with experience replay and an off-policy correction algorithm called Retrace (Munos et al., 2016). In the distributed setup, behaviour policies in multiple actors are used to collect off-policy trajectories and a single learner is used to optimize model parameters. SVG(0) reparameterizes a policy and optimizes it by backpropagating the gradient from a learned action value function through a sampled action.
To employ this algorithm, we reparameterize the action from the flat policy as the output of a deterministic function applied to noise from a fixed distribution, so that the function outputs a sample from the policy distribution. We also introduce the action value function. Unlike policies without hierarchy, we estimate the action value depending on the sampled latent as well, so that it can capture the future returns that depend on the latent. Given the flat policy and the action value function, SVG(0) (Heess et al., 2015) suggests using the following gradient estimate
(18) 
which facilitates using backpropagation. Note that policy parameter could be learned through as well, but we decide not to because it tends to make learning unstable.
To learn the action value function and the policy, we use off-policy trajectories from experience replay. We use Retrace (Munos et al., 2016) to estimate the action values from off-policy trajectories. The main idea behind Retrace is to use importance weighting to correct for the difference between the behavior policy and the online policy, while cutting the importance weights to reduce variance. Specifically, we estimate the corrected action value with
(19) 
Here the estimate uses an estimated bootstrap value, a discount, and truncated importance weights called traces.
There are, however, a few notable details that we adapt for our method. Firstly, we do not use the latent sampled from behaviour policies in actors. This is possible because the latent does not affect the environment directly. Instead, we consider the behavior policy as a distribution over actions that does not depend on latents. This approach is useful since we do not need to consider the importance weight with respect to the HL policy, which might introduce additional variance in the estimator. Another detail is that the KL term at a given step is not included in the action value estimate for that step, because the KL at that step is not the result of the action taken at that step. Instead, we introduce the closed-form KL at that step as a loss to compensate for this. The pseudocode for the resulting algorithm is given in Algorithm 2.
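For orientation, the sketch below computes Retrace-style corrected action values in the standard form of Munos et al. (2016), using a backward recursion. It is not the paper's Algorithm 2: the function name, argument layout, and default coefficients are assumptions, and the latent-specific details discussed above are only reflected in the fact that the behaviour probabilities are taken over actions alone.

```python
def retrace_targets(q, v_next, rewards, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Corrected action values for one trajectory segment.

    q[t]        : current estimate Q(x_t, a_t)
    v_next[t]   : bootstrap value estimate at x_{t+1}
    pi_probs[t] : online policy probability (or density) of a_t
    mu_probs[t] : behaviour policy probability (or density) of a_t
    """
    T = len(rewards)
    # Truncated importance weights ("traces").
    traces = [lam * min(1.0, p / m) for p, m in zip(pi_probs, mu_probs)]
    # One-step temporal-difference errors.
    deltas = [rewards[t] + gamma * v_next[t] - q[t] for t in range(T)]
    targets = [0.0] * T
    acc = 0.0
    for t in reversed(range(T)):
        # acc_t = delta_t + gamma * c_{t+1} * acc_{t+1}
        next_trace = traces[t + 1] if t + 1 < T else 0.0
        acc = deltas[t] + gamma * next_trace * acc
        targets[t] = q[t] + acc
    return targets
```

When the online and behaviour policies coincide, all traces equal `lam` and the target reduces to a multi-step return; when the online policy assigns zero probability to the logged actions, the traces vanish and only the one-step correction remains.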
B.3 Discrete action space
For discrete control problems, we use distributed learning with V-trace (Espeholt et al., 2018) off-policy correction. Similarly to the distributed learning setup in continuous control, behaviour policies in multiple actors are used to collect trajectories and a single learner is used to optimize model parameters. The learning algorithm is almost identical to that of Espeholt et al. (2018), but there are details that need to be considered, mainly because of the hierarchy with a stochastic latent variable and temporal abstraction. Using the negative KL as a reward introduces another complication as well.
We consider optimizing the objective with an infrequently sampled latent
(20) 
where the indicator function selects the time steps at which the latent is resampled, which happens with a fixed period. This lower bound will be discussed later in Appendix C.2. This infrequent latent case is used for the discrete action space experiments, with the period set equal to the effective step size of the body.
We learn a latent-conditional value function and a reparameterized flat policy. The V-trace target is computed as follows
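The role of the indicator can be illustrated with a small sketch that accumulates a KL-regularized return in which the KL penalty is only counted at latent resampling steps. The exact form of eq. (20) is not reproduced here; the additive structure, the coefficient `alpha`, and the convention that resampling happens when the step index is a multiple of the period are assumptions.

```python
def regularized_return(rewards, kls, period, gamma=1.0, alpha=1.0):
    """Discounted sum of rewards minus KL terms, counting the KL only at resampling steps."""
    total = 0.0
    for t, (r, kl) in enumerate(zip(rewards, kls)):
        # Indicator: 1 when the latent is resampled at step t, 0 otherwise.
        resample = 1.0 if t % period == 0 else 0.0
        total += (gamma ** t) * (r - alpha * resample * kl)
    return total
```

With `period=1` the KL is charged at every step, which corresponds to the usual per-step regularization; larger periods charge it only once per effective body step.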
(21) 
where the target uses a bootstrapped value, and the importance weights are truncated ratios between the online and behaviour policies, with truncation coefficients identical to those of the original V-trace paper (Espeholt et al., 2018). Note that here we ignore the latent sampled by the behaviour policy and consider only states and actions from the trajectory. As discussed in Appendix B.2, we sample the latent on-policy, which helps avoid the additional variance introduced by importance weights for the HL policy.
The computed V-trace target is used for training both the policy and the value function with an actor-critic algorithm. For training the policy, we use the policy gradient defined as
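The following sketch computes V-trace targets in the standard recursive form of Espeholt et al. (2018). It is meant only as orientation: the function name and argument layout are assumptions, and in the hierarchical setting the `values` would be the latent-conditional value estimates evaluated with latents sampled on-policy.

```python
def vtrace_targets(values, bootstrap, rewards, pi_probs, mu_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one trajectory segment.

    values[t]   : value estimate at step t
    bootstrap   : value estimate for the state following the last step
    pi_probs[t] : online policy probability of the logged action a_t
    mu_probs[t] : behaviour policy probability of a_t
    """
    T = len(rewards)
    ratios = [p / m for p, m in zip(pi_probs, mu_probs)]
    rhos = [min(rho_bar, r) for r in ratios]  # truncated weights for the TD errors
    cs = [min(c_bar, r) for r in ratios]      # truncated weights for the traces
    next_values = list(values[1:]) + [bootstrap]
    targets = [0.0] * T
    acc = 0.0  # holds v_{t+1} - V(x_{t+1}) while iterating backwards
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets
```

Only action probabilities enter the importance ratios, so no importance weight for the HL policy is needed, which mirrors the remark above about ignoring the latent sampled by the behaviour policy.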
(22) 
We optimize the negative KL term by adding an analytic loss function for the HL policy and the default policy
(23) 
For training the value function, we perform gradient descent on the loss
(24) 
Additionally, we include an action entropy bonus to encourage exploration (Mnih et al., 2016)
(25) 
where the entropy is computed in closed form. We optimize all four objectives jointly. Unlike in continuous control, we do not maintain target parameters separately for the discrete action space experiments.
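Two of the simpler ingredients above, the closed-form entropy of a discrete action distribution and a squared-error value loss against fixed targets, can be sketched as follows. The 0.5 factor, the summation over time steps, and the function names are assumptions rather than the exact form of eqs. (24) and (25).

```python
import math


def categorical_entropy(probs):
    """Closed-form entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def value_loss(values, targets):
    """Squared error between value estimates and fixed (e.g. V-trace) targets."""
    return 0.5 * sum((target - v) ** 2 for v, target in zip(values, targets))
```

In a joint update, the entropy would enter as a bonus (subtracted from the loss) and the value loss would be added, each with its own weighting coefficient.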
Appendix C Derivations
This section includes derivations not described in the main paper.
C.1 Upper bound of KL divergence
The upper bound on the KL divergence in eq. (6) of the main paper is derived as