1 Introduction
Reinforcement learning (RL) has shown impressive results on artificial tasks such as game playing (Mnih et al., 2015, Silver et al., 2016), where collecting experience is cheap. However, for robotics tasks, such as locomotion and manipulation (Kober et al., 2013, Gu et al., 2016), current algorithms often require manually designed smooth reward functions, limiting their applicability in real-world scenarios. In this paper, we approach learning from sparse rewards using hierarchical reinforcement learning (HRL), where multiple levels of temporally abstract controllers modulate one another to produce an action. We propose a novel hierarchical agent that is simple to train and learns to push objects and stack blocks end-to-end from sparse rewards, as shown in Figure 1. To achieve this, we consider three common challenges of HRL.
Stability.
Simultaneously updating the levels of a hierarchical agent introduces non-stationarity since the levels affect one another, resulting in unstable learning dynamics. Prior HRL algorithms thus often introduce multiple training phases to stabilize learning (Heess et al., 2016, Frans et al., 2018). This requires more implementation effort and introduces additional hyperparameters, which may in part prevent HRL from becoming standard for RL problems. We alleviate this problem by jointly training the levels as separate PPO agents (Schulman et al., 2017), encouraging smooth changes in all levels of the hierarchy. We hope that the simplicity of this solution helps to make HRL more generally applicable.
Modulation.
A critical design choice of hierarchical agents is the form of communication between levels. Typically, each level receives a modulation signal from the more abstract level above it (Dayan and Hinton, 1993). Such a signal could be a categorical variable called an option (Sutton et al., 1999, Frans et al., 2018, Florensa et al., 2017) or a continuous-valued activation vector (Heess et al., 2016, Vezhnevets et al., 2017, Haarnoja et al., 2018). While a categorical signal selects exactly one skill at a time, a continuous signal allows smooth modulation of lower levels. Inspired by this trade-off, we propose communication via bit vectors, which allows mixing multiple skills. Empirically, this outperforms categorical modulation signals.
Exploration.
While hierarchical controllers with different timescales have a built-in prior for temporally extended behavior, this does not necessarily help the exploration of skills (Eysenbach et al., 2019). For this reason, HRL methods often report transfer performance after pretraining the agent or part of it on manually defined subtasks (Heess et al., 2016, Tessler et al., 2017, Kulkarni et al., 2016). Intrinsic motivation or curiosity (Schmidhuber, 2010, Pathak et al., 2017) is a common approach to exploration, but is not commonly used for hierarchical agents. We achieve temporally extended exploration by employing intrinsic motivation at each level of the hierarchy, resulting in an agent that learns from sparse rewards without pretraining on simpler tasks.
We summarize the main contributions of this paper as follows:

We introduce modulated policy hierarchies (MPH), a hierarchical agent that is trained jointly without requiring pretraining tasks or multiple training phases.

We model modulation signals as bit vectors instead of the typical one-hot modulation, allowing the agent to interpolate between its skills.

We employ intrinsic motivation based on prediction error of dynamics models on all levels of the hierarchy, resulting in temporally extended exploration.

We evaluate our method together with recent HRL algorithms on pushing and sparse block stacking and provide ablation studies of the design choices.
2 Related work
Manipulation.
Learning algorithms have been applied to a variety of robotics tasks, such as grasping (Pinto and Gupta, 2016, Lampe and Riedmiller, 2013), opening bottles using supervised learning (Levine et al., 2015), learning to grasp with reinforcement learning (Levine et al., 2016), opening doors (Gu et al., 2016), and stacking Lego blocks (Popov et al., 2017). Here, we examine a pushing task, FetchPush-v1, and a stacking task similar to the one described by Popov et al. (2017). We focus on pushing and stacking tasks as they conceptually require subskills, such as reaching and grasping. Previously, solving block stacking required manually designing a smooth reward function (Popov et al., 2017) (Duan et al., 2017) with an existing low-level controller that could already grasp and place blocks.
Stability.
HRL inherits common instability issues of RL (Henderson et al., 2018). Moreover, training multiple controllers simultaneously can lead to degenerate solutions (Bacon et al., 2017). Pretraining the low levels (Heess et al., 2016) or alternating training between levels (Frans et al., 2018) has been proposed to improve learning stability. Other methods add regularization terms based on entropy (Bacon et al., 2017, Hausman et al., 2018, Haarnoja et al., 2018) or mutual information (Daniel et al., 2016, Florensa et al., 2017). We find that the slow change of the action distribution that PPO encourages can be a simple and practical way to mitigate instability. In addition to Frans et al. (2018), we find that we can train all levels simultaneously without degeneracies.
Modulation.
A core design choice of HRL agents is how higher levels communicate with the levels underneath them. The options framework (Sutton et al., 1999) uses a categorical signal that switches between low-level policies that can be implemented as separate networks (Tessler et al., 2017, Frans et al., 2018). This approach requires a large number of parameters and does not allow skills to share information. Another line of work is based on feudal learning (Dayan and Hinton, 1993), where a typically continuous-valued signal modulates the lower level. In this context, the modulation signal is often referred to as a goal. It can be specified directly in the observation space (Nachum et al., 2018) or in a learned embedding space (Vezhnevets et al., 2017, Kulkarni et al., 2016, Heess et al., 2016). However, such methods usually require prior knowledge in the form of pretraining (Hausman et al., 2018) or reward shaping (Haarnoja et al., 2018), which our method with binary signals does not need.
Exploration.
One of the major goals of HRL is to address hard exploration problems with long horizons and sparse rewards. Structured exploration provides a mechanism to guide exploration more effectively and is usually referred to as intrinsic motivation. Intrinsic motivation methods vary from curiosity-based bonuses (Houthooft et al., 2016, Pathak et al., 2017) to state visitation counters (Bellemare et al., 2016, Ostrovski et al., 2017). However, such approaches have only been explored for single-level policies, whereas one can imagine taking advantage of intrinsic motivation at both layers of the hierarchy.
3 Modulated Policy Hierarchy
Preliminaries.
We follow the typical formulation of reinforcement learning as a Markov decision process. At each discrete time step $t$, the agent receives state $s_t$ from the environment, selects an action $a_t$, receives a scalar reward $r_t$, and transitions into a new state $s_{t+1}$. We aim to maximize the expected return using a policy $\pi(a_t \mid s_t)$. The policy is parameterized with parameters $\theta$ and denoted as $\pi_\theta$, thus at each time step the chosen action is given by $a_t \sim \pi_\theta(a_t \mid s_t)$.
Hierarchical controller.
MPH learns a hierarchy of policies; in our experiments we consider two-level hierarchies. Each MPH policy has its own state and action space. Each policy is represented with a single network and modulated by bit vectors from the policies above (Figure 1(c)). In contrast, the options framework switches between independent skill networks by a categorical modulation signal (Figure 1(a)). The categorical signal can also be used with the skill policies merged into a single network (Figure 1(b)).
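To make the difference between the two modulation schemes concrete, the sketch below enumerates the signals a master could emit for a signal width of 3, the width used in the experiments. The function names are hypothetical and only illustrate the counting argument; they are not taken from the authors' implementation.

```python
import itertools

import numpy as np


def one_hot_signals(width):
    """All categorical (one-hot) signals: exactly one skill is active."""
    return np.eye(width, dtype=int)


def bit_vector_signals(width):
    """All bit-vector signals: any subset of bits may be active."""
    return np.array(list(itertools.product([0, 1], repeat=width)))


categorical = one_hot_signals(3)   # 3 distinct signals
binary = bit_vector_signals(3)     # 2**3 = 8 distinct signals
print(len(categorical), len(binary))
```

Every one-hot signal is also a valid bit vector, so with the same width the bit-vector master can express each single skill plus all of their combinations.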
Modulation signal.
The highest-level (master) policy receives the state from the environment and outputs a bit vector of fixed width as its action. Each intermediate-level policy receives the environment state concatenated with the modulation signals from the layers above. The policies are implemented as fully connected neural networks that predict the probabilities of the bits. Given the probabilities, the modulation signal is generated by sampling each bit from an independent Bernoulli distribution. Once sampled, the signal is passed to all lower policies. Finally, the lowest-level policy (worker) observes all the modulation signals and the state. In the two-level structure, the worker receives the environment state and the master modulation. The worker policy outputs the final action, which is applied to the environment.
Time scales.
To encourage each level in the hierarchy to attend to different timescales, we activate higher-level policies less frequently, i.e., the policy at each level is active only at time steps determined by the timescale of that level. When a policy is inactive in a given time step, it outputs the same modulation signal as was generated in the previous time step, which promotes consistency in higher-level decisions and facilitates longer-term planning. The policies at each level only receive inputs at time steps for which they are active.
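The interaction between Bernoulli sampling and timescales can be sketched as follows for the two-level case. This is a minimal illustration under stated assumptions: the master is taken to be active whenever the step index is a multiple of its timescale, `master_probs_fn` stands in for the master network, and all function names are hypothetical.

```python
import numpy as np


def sample_bits(probs, rng):
    """Sample a bit vector from independent Bernoulli distributions."""
    return (rng.random(probs.shape) < probs).astype(np.float32)


def rollout_signals(master_probs_fn, states, timescale, rng):
    """Return the modulation signal seen by the worker at every time step.

    Assumption: the master is queried only when t is a multiple of
    `timescale`; otherwise the previous signal is repeated.
    """
    signals = []
    signal = None
    for t, state in enumerate(states):
        if t % timescale == 0:
            signal = sample_bits(master_probs_fn(state), rng)
        signals.append(signal)
    return np.stack(signals)


def worker_input(state, signal):
    """The worker observes the environment state concatenated with the signal."""
    return np.concatenate([state, signal])


rng = np.random.default_rng(0)
states = rng.normal(size=(8, 5))        # 8 time steps of a 5-dimensional toy state
master = lambda s: np.full(3, 0.5)      # placeholder for a master network
signals = rollout_signals(master, states, timescale=4, rng=rng)
```

With a timescale of 4, the first four rows of `signals` are identical, as are the last four, which is the consistency in higher-level decisions described above.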
Optimization.
We train MPH policies using PPO (Schulman et al., 2017), which is a state-of-the-art on-policy method. PPO guarantees that after each update the new policy does not deviate too far from the old one in terms of KL divergence. We use PPO for each layer independently, which also means that the MDPs seen by high-level policies change during training due to updates of the low-level policies. However, given the PPO guarantees, we can ensure that after a training step, the MDP on each layer of the hierarchy remains close to the old MDP in terms of the change in transition probabilities. As a result, the optimization problem solved by PPO for higher layers changes smoothly during the updates. This fact makes MPH more stable to train than most HRL approaches. Please refer to Appendix A for exact bounds and the full derivation.
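For a policy that outputs independent bit probabilities, the KL divergence between the old and the updated policy has a simple closed form. The sketch below computes it and compares it to a KL budget; the function name and the example probabilities are hypothetical, and how the budget enters the PPO update (penalty, clipping, or early stopping) depends on the PPO implementation and is not modeled here.

```python
import numpy as np


def bernoulli_kl(p_old, p_new, eps=1e-8):
    """KL(old || new) between factorized Bernoulli distributions.

    The bits are independent, so the KL divergence is the sum of per-bit terms.
    """
    p_old = np.clip(p_old, eps, 1.0 - eps)
    p_new = np.clip(p_new, eps, 1.0 - eps)
    per_bit = (p_old * np.log(p_old / p_new)
               + (1.0 - p_old) * np.log((1.0 - p_old) / (1.0 - p_new)))
    return per_bit.sum(axis=-1)


p_old = np.array([0.5, 0.2, 0.9])       # bit probabilities before the update
p_new = np.array([0.52, 0.21, 0.88])    # bit probabilities after the update
kl = bernoulli_kl(p_old, p_new)
# 0.001 is the master KL step size reported in Section 4.2.
within_budget = kl <= 0.001
```

Even shifts of one or two percentage points per bit can exceed a budget of 0.001, which gives a sense of how slowly the modulation distribution is allowed to change per update.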
Hierarchical exploration.
Since MPH is designed for environments with sparse reward signals, we employ intrinsic motivation to accelerate learning. As suggested by Pathak et al. (2017), we add intrinsic motivation to our agent in the form of a curiosity-driven exploration bonus. We apply this independently for each level of the hierarchy and on the corresponding timescale. In practice, this means that higher-level policies, which operate on longer timescales, are more curious about longer-term effects than lower-level policies. The reward for a policy at a given level is defined as
(1) 
where the state is mapped through a learned embedding and a forward model predicts the embedding of the next state from the current embedding and the action. The standard method for learning the embedding requires an inverse model for action prediction, but we find that training a reverse model instead works better. Specifically, the reverse model predicts the previous state given the next state and the action. To learn the embedding, we jointly train forward and reverse models by minimizing the loss
(2) 
where we add a regularization term to prevent trivial embeddings from being learned; one scalar weights the reverse model loss against the forward model loss, and another scales the regularization term.
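One form of the intrinsic reward and the model loss that is consistent with the description above, written in the style of Pathak et al. (2017), is sketched below. All symbols are assumptions introduced for illustration: $\phi_k$ is the embedding at level $k$, $f_k$ and $g_k$ are the forward and reverse models, $T_k$ is the timescale of the level, $a^k_t$ is the action of the level, $\beta$ weights the reverse loss, and $\lambda$ scales a regularizer $\Omega$ whose exact form is left unspecified here. Any extrinsic task reward would be added to the bonus.

```latex
% Intrinsic reward at level k: forward-model prediction error in embedding space.
r^{k}_{t} = \big\lVert f_k\big(\phi_k(s_t), a^{k}_{t}\big) - \phi_k(s_{t+T_k}) \big\rVert_2^2

% Joint loss: forward error + weighted reverse error + scaled regularizer.
\mathcal{L}_k =
    \big\lVert f_k\big(\phi_k(s_t), a^{k}_{t}\big) - \phi_k(s_{t+T_k}) \big\rVert_2^2
  + \beta \, \big\lVert g_k\big(\phi_k(s_{t+T_k}), a^{k}_{t}\big) - \phi_k(s_t) \big\rVert_2^2
  + \lambda \, \Omega(\phi_k)
```

Without a regularizer, a constant embedding would drive both prediction errors to zero, which is the trivial solution the regularization term is meant to rule out.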
4 Experiments
We compare our approach to baselines and state-of-the-art approaches described in Section 4.1. We evaluate our approach on two tasks with sparse rewards: block pushing and block stacking (see Figure 1). First, we show that MPH outperforms the baselines on block stacking in Section 4.2 and analyze the modulation signals produced by the master policy. Second, we compare MPH to baselines on the pushing task in Section 4.3. Third, we show the benefits of temporally extended intrinsic motivation in Section 4.4.
4.1 Baselines
We compare to the following baselines:

PPO: A flat policy trained using PPO.

options: An options hierarchy with separate skill networks corresponding to Figure 1(a).

one-hot: A two-level hierarchy with a one-hot modulation signal as in Figure 1(b).

MLSH: Meta Learning Shared Hierarchy by Frans et al. (2018).
All the hierarchies employ PPO as the core optimization method and use temporal abstraction for the master policy. We share the common hyperparameters between MPH and baselines for each task. We discuss the hyperparameters in more detail in Section 4.2 and Section 4.3. The last approach, MLSH (Frans et al., 2018), is a recent, state-of-the-art approach that learns a set of skill policies, switched by a master policy. MLSH is trained stagewise: a warmup period where only the master is updated is alternated with a period where skills and master are trained jointly. We implemented the first three approaches and rely on the code of MLSH released by its authors.
4.2 Stacking
Figure 2: Evaluation curves for MPH and baselines. The solid lines correspond to the mean success rate, and the shaded regions show the standard deviation. For stacking, both values are computed over 50 episodes and averaged over the top 5 of 16 random seeds. For pushing, we average over 32 episodes and use 5 random seeds.
Task description.
The block stacking task is a PyBullet (Coumans, 2009) based simulation environment (see Figure 1, right). We use a model of the 7-DOF Kuka LBR arm with a 1-DOF pinch gripper. The scene contains two blocks and the goal is to stack one on top of the other. All episodes start in a randomly initialized state in terms of robot configuration and object placement. The state perceived by the agent consists of the angles and the angular velocities of the arm and the gripper, the tactile sensor for each finger of the gripper, the location and orientation of each object in the scene, and the relative distances of the two blocks to the pinch position of the gripper. The agent acts at a frequency of 40 Hz and outputs the desired joint position change, which is then applied using position control. The time horizon is 200 time steps. The agent receives the following sparse rewards: a) for touching a block with the gripper, b) for successfully lifting a block, c) for holding a block above another one, and d) for placing a block on top of another block. We also reward the agent with a larger reward when the objects are stacked and the gripper pinch position is far enough from the tower.
Hyperparameters.
We use identical network architectures and common hyperparameters for all the approaches including MPH. For the policies, value functions, and the models, we use fully connected neural networks with 2 hidden layers of 64 units each and tanh activations. We use the implementation of PPO from Hafner et al. (2017) and collect a batch of 50 rollouts using parallel environments. The learning rate is set to 0.0001, 0.01, and 0.005 for the policy, value function, and model networks, respectively. We use Adam as the optimizer. The maximum KL divergence step size of PPO is set to 0.001 and 0.002 for the master and the worker policies, respectively. We update both the policies and the value networks using 40 training epochs. We set the discount factor to 0.985 and the model loss coefficient to 0.2. We use 3 as the width of all modulation signals and as the number of skills for the baselines. For the master policy timescale, we choose the best value among 4, 8, and 16. The options framework uses a timescale of 8 for the master; the one-hot baseline and MPH use a timescale of 4. In the case of MLSH, we adapt some of the parameters according to the suggestion of the authors: we use learning rates of 0.01 and 0.0003 for the master and the skill policies, respectively, and use 10 groups of 12 cores to train MLSH, a warmup time of 20 (the best among 10, 20, 30), a training time of 70 (the best among 30, 50, 70), and a master policy timescale of 25 (the best among 10, 25, 50).
Performance.
Figure 2(a) shows that MPH outperforms the baselines on the stacking problem. We compare the success rates of the approaches averaged over 50 episodes. Stacking is considered successful if, at the end of the episode, the blocks are in a stacked configuration without any block being in contact with the robot. A single-policy PPO agent does not solve the task and on average has a success rate of . The approach with the one-hot modulation signal achieves a success rate of . The options framework stacks the blocks in of episodes on average but takes more time to train. MLSH learns faster than the two previously discussed methods. However, it plateaus and reaches the same success rate as the options framework. MPH outperforms all the baselines, both in terms of final average score and the speed of learning. MPH achieves a success rate of on average (5 seeds) and the best random seed stacks the blocks in of episodes. In contrast to the options framework, MPH uses the whole batch to train all the networks and, in contrast to MLSH, trains jointly in a single phase, always updating all the networks.
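The aggregation used for the stacking curves (success rate over 50 evaluation episodes, summarized over the top 5 of 16 seeds) can be sketched as follows. This is an illustration under assumptions: seeds are ranked by their success rate at the evaluation point, the function name is hypothetical, and the data below is synthetic rather than taken from the experiments.

```python
import numpy as np


def top_k_seed_summary(success, k):
    """Mean and standard deviation of the success rate over the k best seeds.

    success: array of shape (num_seeds, num_episodes) with 0/1 entries.
    """
    per_seed = success.mean(axis=1)     # success rate of each seed
    best = np.sort(per_seed)[-k:]       # keep the k highest rates
    return best.mean(), best.std()


rng = np.random.default_rng(0)
toy = (rng.random((16, 50)) < 0.6).astype(float)   # synthetic data, not results
mean_rate, std_rate = top_k_seed_summary(toy, k=5)
```

Selecting the best seeds raises the reported mean relative to averaging over all seeds, so the two tasks are not directly comparable when one uses top-5-of-16 selection and the other uses all 5 seeds.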
Modulation.
To obtain a better understanding of the role of the modulation signal, we plot histograms of the master policies' decisions for both the options baseline and MPH. Figure 4 shows the histogram for a single random seed robustly solving the task. First, we notice that the options master (acting on a timescale of 8) makes consistent decisions and prefers certain skills over others at each time step (Figure 4, top). We highlight the fact that the master policy network is memoryless and does not observe the current time step value. In the beginning, the master chooses the 3rd skill for about 24 time steps, then it chooses the 2nd skill for roughly 16 time steps, and finally the first one for the rest of the episode. Thus we conclude that the skills correspond to reaching, lifting, and placing primitives, which is confirmed by observing the policy acting. Once trained, the options framework solves the problem in the first third of an episode and spends the rest of the time avoiding contact with any block (which requires no specific skill). We observe a similar pattern for the MPH modulation signal. The master policy of MPH acts on a timescale of 4 and changes the modulation signal in roughly the same time intervals as the options master. However, MPH typically employs more than a single bit and benefits from a higher modulation capacity than categorical methods like options and MLSH.
4.3 Pushing
Task description.
The block pushing task is FetchPush-v1 from OpenAI Gym where, following Andrychowicz et al. (2017), we discard initial state-goal pairs in which the goal is already satisfied. In FetchPush-v1, the end-effector is controlled in XYZ space and the goal is to push a randomly placed box to the target. The agent receives a reward of 1 when the block is in an epsilon ball of the episode target and 0 anywhere else. Each episode is randomly initialized in terms of the robot and block configurations and the target. The length of the episodes is set to 50.
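The sparse reward described above reduces to a distance check. The sketch below is a minimal illustration; the function name is hypothetical and the default radius of 0.05 is a placeholder assumption, since the radius of the epsilon ball is not stated here.

```python
import numpy as np


def sparse_push_reward(block_pos, target_pos, epsilon=0.05):
    """Return 1.0 if the block lies within an epsilon ball of the target, else 0.0.

    The default epsilon is a placeholder, not a value taken from the text.
    """
    distance = np.linalg.norm(np.asarray(block_pos) - np.asarray(target_pos))
    return 1.0 if distance <= epsilon else 0.0


reward_near = sparse_push_reward([1.00, 0.50, 0.40], [1.02, 0.50, 0.40])
reward_far = sparse_push_reward([1.00, 0.50, 0.40], [1.30, 0.50, 0.40])
```

Because the reward carries no information about how far the block is from the target, a randomly exploring agent rarely observes a non-zero signal, which motivates the intrinsic rewards used in this work.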
Hyperparameters.
We use the same set of hyperparameters as described for the stacking task, with several exceptions. We adapt the batch size (set to 32 rollouts), the number of training epochs (set to 32), the policy learning rate (set to 0.0001), the value function learning rate (set to 0.0003), and the discount factor (set to 0.98). For MLSH, we change the warmup time to 10, the training time to 50, and the master policy timescale to 10.
Performance.
We compare MPH with four baselines on FetchPush-v1. We use episode success as the performance metric (averaged over 32 episodes). Pushing is considered successful when the block is close to the episode target. As shown in Figure 2(b), MPH outperforms all the other approaches. A single-policy PPO agent plateaus after achieving a success rate of . The one-hot hierarchy and the options framework on average solve the task with a success rate of and , respectively. MLSH performs better than the first three methods and achieves a success rate of . The options framework and MLSH take more time to train because each option is trained on a subset of the batch of data. MPH is on average more successful than the best of the baselines and achieves a success rate of on average and the best random seed is able to successfully push the block in of episodes.
4.4 Hierarchical intrinsic motivation
We evaluate the effect of intrinsic motivation (IM) applied on both levels of the hierarchy. The corresponding figure shows the results for both tasks with the four possible settings of intrinsic rewards: intrinsic reward for both policies, intrinsic reward only for the worker policy, intrinsic reward only for the master policy, and no intrinsic reward. We notice that MPH without the intrinsic reward often struggles to find the solution and performs worse. Given the intrinsic bonus for one of the layers, MPH performance improves. Intrinsic motivation on the worker side results in faster initial exploration; however, MPH with an intrinsically rewarded master network has a higher final score, potentially due to better long-term planning. The best score is achieved with intrinsic motivation for both policies. Applying intrinsic motivation to both policies results in an improvement of and over the version without intrinsic motivation for the stacking and the pushing tasks, respectively.
5 Conclusion
We introduced Modulated Policy Hierarchies (MPHs) to address environments with sparse rewards that can be decomposed into subtasks. By combining rich modulation signals, temporal abstraction, and intrinsic motivation, MPH benefits from better exploration and increased stability of training. Moreover, in contrast to many state-of-the-art approaches, MPH does not require pretraining, multiple training phases, or manual reward shaping. We evaluated MPH on two simulated robot manipulation tasks: pushing and block stacking. In both cases, MPH outperformed baselines and the recently proposed MLSH algorithm, suggesting that our approach may be a fertile direction for further investigation.
Acknowledgements.
This work was supported in part by ERC advanced grant Allegro.
References
 Andrychowicz et al. (2017) M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. NIPS, 2017.
 Bacon et al. (2017) P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. AAAI, 2017.
 Bellemare et al. (2016) M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. NIPS, 2016.
 Coumans (2009) E. Coumans. Bullet physics engine, 2009. URL www.bulletphysics.org.
 Daniel et al. (2016) C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. The Journal of Machine Learning Research, 17(1):3190–3239, 2016.
 Dayan and Hinton (1993) P. Dayan and G. Hinton. Feudal Reinforcement Learning. NIPS, 1993.
 Duan et al. (2017) Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-Shot Imitation Learning. NIPS, 2017.
 Eysenbach et al. (2019) B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. ICLR, 2019.
 Florensa et al. (2017) C. Florensa, Y. Duan, and P. Abbeel. Stochastic Neural Networks for Hierarchical Reinforcement Learning. ICLR, 2017.
 Frans et al. (2018) K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. ICLR, 2018.
 Gu et al. (2016) S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep Reinforcement Learning for Robotic Manipulation. ICML, 2016.
 Haarnoja et al. (2018) T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine. Latent space policies for hierarchical reinforcement learning. ICML, 2018.
 Hafner et al. (2017) D. Hafner, J. Davidson, and V. Vanhoucke. Tensorflow agents: Efficient batched reinforcement learning in tensorflow. arXiv, 1709.02878, 2017.
 Hausman et al. (2018) K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. ICLR, 2018.
 Heess et al. (2016) N. Heess, G. Wayne, Y. Tassa, T. Lillicrap, M. Riedmiller, and D. Silver. Learning and Transfer of Modulated Locomotor Controllers. arXiv, 1610.05182, 2016.
 Henderson et al. (2018) P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. AAAI, 2018.
 Houthooft et al. (2016) R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Vime: Variational information maximizing exploration. NIPS, 2016.
 Kober et al. (2013) J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. IJRR, 2013.
 Kulkarni et al. (2016) T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. NIPS, 2016.
 Lampe and Riedmiller (2013) T. Lampe and M. Riedmiller. Acquiring visual servoing reaching and grasping skills using neural reinforcement learning. IJCNN, 2013.
 Levine et al. (2015) S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-End Training of Deep Visuomotor Policies. The Journal of Machine Learning Research, 2015.
 Levine et al. (2016) S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. ISER, 2016.
 Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
 Nachum et al. (2018) O. Nachum, S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. NIPS, 2018.
 Ostrovski et al. (2017) G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. arXiv, 1703.01310, 2017.
 Pathak et al. (2017) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. ICML, 2017.
 Pinto and Gupta (2016) L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours. ICRA, 2016.
 Popov et al. (2017) I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv, 1704.03073, 2017.
 Schmidhuber (2010) J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv, 1707.06347, 2017.
 Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
 Sutton et al. (1999) R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 Tessler et al. (2017) C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in minecraft. AAAI, 2017.
 Vezhnevets et al. (2017) A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. ICML, 2017.
Appendix A Markovian formulation of MPH
Since different policies act on different timescales, we use the following notation for the action and the state of the policy acting on the timescale in an level hierarchy:
(3) 
We refer to control signals as latent variables of policies on higher levels where the policy on level is a conditional action distribution assuming that the latent variables are the part of the state . Therefore, the action can be sampled once are sampled. The corresponding problem of finding the optimal policy for each hierarchy layer can be solved using the RL machinery once it is reformulated in the MDP formalism as we do below.
For each layer of the hierarchy , we can rewrite the transition probabilities of the MDP on the layer marginalizing over the actions of the policy and using the transition probabilities of the layer of the hierarchy MDP:
(4) 
Since the higher level policies act on longer timescales, we also derive the timescaled transition probabilities for each MDP with the timescale :
(5) 
Given Equation 4 and Equation 5, one can obtain the transition probabilities for the MDP of any layer using only the environment transition probabilities and the policies. While the former are stationary, the policies are updated in each training epoch, which might destabilize training. Trust region optimization methods, such as TRPO (or its approximation, PPO), provide a convenient way to bound the changes of the high-level MDPs. Such a bound can guarantee that the optimization problem solved by TRPO (or PPO) for higher layers changes smoothly during the updates. Thus, the global solution converges to the optimal solution of the original problem. Below we derive upper bounds on the change in transition probabilities for the discrete case, which can be extended to continuous state and action spaces. We rewrite Equation 4 for the discrete case ( is shifted by 1):
(6) 
We start by deriving the equation for the first two levels of the hierarchy. We denote the transition probability after the training epoch as and the updated policy as . Since where is the transition probability of the environment, we get the following inequalities:
(7)  
(8)  
(9) 
where we used Hölder’s inequality in Equation 7, Pinsker’s inequality and the fact that is ergodic in Equation 8 and TRPO (or PPO) guarantee of in Equation 9.
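The chain of inequalities for the first two levels can be sketched in the following form. The notation is an assumption introduced for illustration: $z$ is the modulation signal of the higher level, $\pi$ and $\pi'$ are the lower-level policy before and after an update, $P_{\mathrm{env}}$ is the environment transition probability, and $\delta$ is the KL budget of the update. The sketch uses only Hölder's and Pinsker's inequalities and may differ in its intermediate steps from Equations 7 to 9.

```latex
% Higher-level transition: marginalize the lower-level policy over its actions.
P(s' \mid s, z) = \sum_{a} \pi(a \mid s, z) \, P_{\mathrm{env}}(s' \mid s, a)

% Change after one update of the lower-level policy (\pi \to \pi'):
\big| P'(s' \mid s, z) - P(s' \mid s, z) \big|
  = \Big| \sum_{a} \big( \pi'(a \mid s, z) - \pi(a \mid s, z) \big) P_{\mathrm{env}}(s' \mid s, a) \Big|
  \le \lVert \pi'(\cdot \mid s, z) - \pi(\cdot \mid s, z) \rVert_1 \,
      \max_{a} P_{\mathrm{env}}(s' \mid s, a)            % Hoelder
  \le \lVert \pi'(\cdot \mid s, z) - \pi(\cdot \mid s, z) \rVert_1
  \le \sqrt{2 \, D_{\mathrm{KL}}\big( \pi(\cdot \mid s, z) \,\Vert\, \pi'(\cdot \mid s, z) \big)}   % Pinsker
  \le \sqrt{2 \delta}                                      % KL budget of the update
```

The second inequality uses the fact that transition probabilities are at most 1, so the bound depends only on how far the lower-level policy moved.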
Next we derive the bound for level of the hierarchy:
(10)  
(11)  
(12)  
(13)  
(14)  
(15)  
(16) 
where we use Hölder’s and Pinsker’s inequalities and the result of Equation 9.
We showed that for any layer of MPH, the change in the transition probabilities of its MDP is upper-bounded under the TRPO (or PPO) update. In addition, this bound scales linearly with . Thus, we have direct control over how much the MDPs on higher layers change after each policy update. Given that the change is small, the optimization problem solved by TRPO (or PPO) for higher layers will also change smoothly during the updates. Therefore, we can apply the standard RL machinery for the hierarchical timescaled MDPs with the given transition probabilities independently for each layer. Moreover, such guarantees also mean more stable training of the hierarchy.
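The two-level bound can be checked numerically on a small random discrete problem. The sketch below is an illustration under assumptions: the tabular sizes, the random environment, and the perturbation standing in for a policy update are all hypothetical, and the check covers only the Hölder and Pinsker steps rather than the full multi-level derivation.

```python
import numpy as np


def normalize(x):
    """Normalize the last axis so that it sums to one."""
    return x / x.sum(axis=-1, keepdims=True)


def upper_transition(policy, p_env):
    """P(s' | s, z) = sum_a policy[s, z, a] * p_env[s, a, s']."""
    return np.einsum("sza,sap->szp", policy, p_env)


rng = np.random.default_rng(0)
num_states, num_signals, num_actions = 4, 3, 5
p_env = normalize(rng.random((num_states, num_actions, num_states)))
old_policy = normalize(rng.random((num_states, num_signals, num_actions)))
# A small perturbation stands in for one policy update.
new_policy = normalize(old_policy + 0.01 * rng.random(old_policy.shape))

# Largest change of the higher-level transition probabilities.
change = np.abs(upper_transition(new_policy, p_env)
                - upper_transition(old_policy, p_env)).max()
# Largest L1 distance and KL divergence between old and new lower-level policy.
l1 = np.abs(new_policy - old_policy).sum(axis=-1).max()
kl = (old_policy * np.log(old_policy / new_policy)).sum(axis=-1).max()
bound_holds = bool(change <= l1 <= np.sqrt(2.0 * kl))
```

Here `bound_holds` confirms that the change of the higher-level MDP is controlled by the L1 distance between policies, which in turn is controlled by the KL divergence that the update constrains.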