1 Introduction
In reinforcement learning (RL), options (Sutton et al., 1999; Precup, 2000) provide a general framework for defining temporally abstract courses of action for learning and planning. Extensive research has focused on discovering these temporal abstractions autonomously (McGovern and Barto, 2001; Stolle and Precup, 2002; Menache et al., 2002; Şimşek and Barto, 2009; Silver and Ciosek, 2012), while approaches that can be used in continuous state and/or action spaces have only recently become feasible (Konidaris et al., 2011; Niekum, 2013; Mann et al., 2015; Mankowitz et al., 2016; Kulkarni et al., 2016; Vezhnevets et al., 2016; Daniel et al., 2016). Most existing work has focused on finding subgoals (i.e. useful states for the agent) and then learning policies to achieve them. However, these approaches do not scale well because of their combinatorial nature. Recent work on option-critic learning blurs the line between option discovery and option learning by providing policy gradient theorems for optimizing a two-level hierarchy of options and primitive actions (Bacon et al., 2017). These approaches have achieved success when applied to Q-learning on Atari games, but also in continuous action spaces (Klissarov et al., 2017) and with asynchronous parallelization (Harb et al., 2017). In this paper, we extend option-critic to a novel hierarchical option-critic framework, presenting generalized policy gradient theorems that can be applied to an arbitrarily deep hierarchy of options.
Work on learning with temporal abstraction is motivated by two key potential benefits over learning with primitive actions: long-term credit assignment and exploration. Learning only at the primitive action level, or even with low levels of abstraction, slows down learning because agents must learn longer sequences of actions to achieve the desired behavior. This frustrates the process of learning in environments with sparse rewards. In contrast, agents that learn a high level decomposition of subtasks are able to explore the environment more effectively by exploring in the abstract action space rather than the primitive action space. While the recently proposed deliberation cost (Harb et al., 2017) can be used as a margin that effectively controls how temporally extended options are, the standard two-level version of the option-critic framework is still ill-equipped to learn complex tasks that require subtask decomposition at multiple, quite different temporal resolutions of abstraction. In Figure 1 we depict how we overcome this obstacle to learn a deep hierarchy of options. The standard two-level option hierarchy constructs a Semi-Markov Decision Process (SMDP), where new options are chosen when temporally extended sequences of primitive actions are terminated. In our framework, we consider not just options and primitive actions, but also an arbitrarily deep hierarchy of lower level and higher level options. Higher level options represent a further temporally extended SMDP than the low level options below, as they only have an opportunity to terminate when all lower level options terminate.
We will start by reviewing related research and by describing the seminal work we build upon in this paper that first derived policy gradient theorems for learning with primitive actions (Sutton et al., 2000) and options (Bacon et al., 2017). We will then describe the core ideas of our approach, presenting hierarchical intra-option policy and termination gradient theorems. We leverage this new type of policy gradient learning to construct a hierarchical option-critic architecture, which generalizes the option-critic architecture to an arbitrarily deep hierarchy of options. Finally, we demonstrate the empirical benefit of this architecture over standard option-critic when applied to RL benchmarks. To the best of our knowledge, this is the first general purpose end-to-end approach for learning a deep hierarchy of options beyond two levels in RL settings, scaling to very large domains at comparable efficiency.
2 Related Work
Our work is related to recent literature on learning to compose skills in RL. As an example, Sahni et al. (2017) leverage a logic for combining pre-learned skills by learning an embedding to represent the combination of a skill and state. Unfortunately, their system relies on a pre-specified subtask decomposition into skills. In (Shu et al., 2017), the authors propose to ground all goals in a natural language description space. Created descriptions can then be high level and express a sequence of goals. While these are interesting directions for further exploration, we will focus on a more general setting without provided natural language goal descriptions or subtask decomposition information.
Our work is also related to methods that learn to decompose the problem over long time horizons. A prominent paradigm for this is Feudal Reinforcement Learning (Dayan and Hinton, 1993), which learns using manager and worker models. Theoretically, this can be extended to a deep hierarchy of managers and their managers, as done in the original work for a hand-designed decomposition of the state space. Much more recently, Vezhnevets et al. (2017) showed the ability to successfully train a Feudal model end to end with deep neural networks for the Atari games. However, this has only been achieved for a two-level hierarchy (i.e. one manager and one worker). We can think of Feudal approaches as learning to decompose the problem with respect to the state space, while the options framework learns a temporal decomposition of the problem. Recent work (Levy et al., 2017) also breaks down the problem over a temporal hierarchy, but like (Vezhnevets et al., 2017) is based on learning a latent goal representation that modulates the policy behavior as opposed to options. Conceptually, options stress choosing among skill abstractions and Feudal approaches stress the achievement of certain kinds of states. Humans tend to use both of these kinds of reasoning when appropriate and we conjecture that a hybrid approach will likely win out in the end. Unfortunately, in the space available we feel that we cannot come to definitive conclusions about the precise nature of the differences and potential synergies of these approaches.

The concept of learning a hierarchy of options is not new. It is an obviously desirable extension of options envisioned in the original papers. However, actually learning a deep hierarchy of options end to end has received surprisingly little attention to date. Compositional planning where options select other options was first considered in (Silver and Ciosek, 2012). The authors provided a generalization of value iteration to option models for multiple subgoals, leveraging explicit subgoals for options. Recently, Fox et al. (2017) successfully trained a hierarchy of options end to end for imitation learning. Their approach leverages an EM-based algorithm for recursive discovery of additional levels of the option hierarchy. Unfortunately, their approach is only applicable to imitation learning and not general purpose RL. We are the first to propose theorems along with a practical algorithm and architecture to train arbitrarily deep hierarchies of options end to end using policy gradients, maximizing the expected return.
3 Problem Setting and Notation
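A Markov Decision Process (MDP) is defined with a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $\mathcal{P}: \mathcal{S} \times \mathcal{A} \to (\mathcal{S} \to [0,1])$ and a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. We follow (Bacon et al., 2017) and develop our ideas assuming discrete state and action sets, while our results extend to continuous spaces using the usual measure-theoretic assumptions, as demonstrated in our experiments. A policy is defined as a probability distribution over actions conditioned on states, $\pi: \mathcal{S} \to \mathcal{A} \to [0,1]$. The value function of a policy $\pi$ is the expected return $V_\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\big]$ with an action-value function of $Q_\pi(s,a) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\big]$ where $\gamma \in [0,1)$ is the discount factor.

Policy gradient methods (Sutton et al., 2000; Konda and Tsitsiklis, 2000) consider the problem of improving a policy by performing stochastic gradient descent to optimize a performance objective over a family of parametrized stochastic policies, $\pi_\theta$. The policy gradient theorem (Sutton et al., 2000) provides the gradient of the discounted reward objective with respect to $\theta$ in a straightforward expression. The objective is defined with respect to a designated starting state $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\big]$. The policy gradient theorem shows that: $\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_{s} \mu_{\pi_\theta}(s \mid s_0) \sum_{a} \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s,a)$, where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$ is the discounted weighting of the states along the trajectories starting from initial state $s_0$.

The options framework (Sutton et al., 1999; Precup, 2000) provides formalism for the idea of temporally extended actions. A Markovian option $o \in \Omega$ is a triple $(I_o, \pi_o, \beta_o)$ where $I_o \subseteq \mathcal{S}$ represents an initiation set, $\pi_o$ represents an intra-option policy, and $\beta_o: \mathcal{S} \to [0,1]$ represents a termination function. Like most option discovery algorithms, we assume that all options are available everywhere. MDPs with options become SMDPs (Puterman, 1994) with an associated optimal value function over options $V^*_\Omega(s)$ and option-value function $Q^*_\Omega(s,o)$ (Sutton et al., 1999; Precup, 2000).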
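The option triple described above can be written down as a small data structure. The following is only an illustrative sketch: the class name `Option`, its field names, and the convention that an initiation set of `None` means "available everywhere" are assumptions made for this example and are not part of the original formulation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Optional, Set

State = Hashable
Action = Hashable


@dataclass
class Option:
    """A Markovian option: intra-option policy, termination function, initiation set."""

    # pi_o(a | s): maps a state to a distribution over primitive actions.
    policy: Callable[[State], Dict[Action, float]]
    # beta_o(s): probability in [0, 1] that the option terminates in state s.
    termination: Callable[[State], float]
    # I_o: states where the option may start; None (an assumption of this
    # sketch) encodes "available everywhere", as assumed in the text.
    initiation_set: Optional[Set[State]] = None

    def available(self, state: State) -> bool:
        return self.initiation_set is None or state in self.initiation_set


# Usage: one option available everywhere, one restricted to states {1, 2}.
everywhere = Option(policy=lambda s: {"left": 0.5, "right": 0.5},
                    termination=lambda s: 0.1)
restricted = Option(policy=lambda s: {"left": 1.0},
                    termination=lambda s: 1.0,
                    initiation_set={1, 2})
```

Keeping the initiation set optional mirrors the assumption that all options are available everywhere, while still allowing the general definition.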
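The option-critic architecture (Bacon et al., 2017) utilizes a call-and-return option execution model. An agent picks option $o$ according to its policy over options $\pi_\Omega(o \mid s)$, then follows the intra-option policy $\pi(a \mid s, o)$ until termination (as determined by $\beta(s, o)$), which triggers a repetition of this procedure. Let $\pi_\theta(a \mid s, o)$ denote the intra-option policy of option $o$ parametrized by $\theta$ and $\beta_\phi(s, o)$ the termination function of $o$ parameterized by $\phi$. Like policy gradient methods, the option-critic architecture optimizes directly for the discounted return expected over trajectories starting at a designated state $s_0$ and option $o_0$: $\rho(\Omega, \theta, \phi, s_0, o_0) = \mathbb{E}_{\Omega, \pi_\theta, \beta_\phi}\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0, o_0\big]$. The option-value function is then:
$$Q_\Omega(s, o) = \sum_{a} \pi_\theta(a \mid s, o)\, Q_U(s, o, a) \tag{1}$$
where $Q_U: \mathcal{S} \times \Omega \times \mathcal{A} \to \mathbb{R}$ is the value of selecting an action given the context of a state-option pair:
$$Q_U(s, o, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s', o) \tag{2}$$
The $(s, o)$ pairs define an augmented state space (Levy and Shimkin, 2011). The option-critic architecture instead leverages the function $U: \Omega \times \mathcal{S} \to \mathbb{R}$ which is called the option-value function upon arrival (Sutton et al., 1999). The value of selecting option $o$ upon entering state $s'$ is:
$$U(s', o) = \big(1 - \beta_\phi(s', o)\big)\, Q_\Omega(s', o) + \beta_\phi(s', o)\, V_\Omega(s') \tag{3}$$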
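To make the relationship between the three option-critic value functions concrete, here is a minimal tabular sketch following the definitions in Bacon et al. (2017): the option value averages action values under the intra-option policy, the action value bootstraps from the value upon arrival, and the value upon arrival mixes "continue" and "terminate" by the termination probability. The dictionary-based tables and all function and argument names are assumptions of this example.

```python
def option_value(pi, q_u, s, o):
    """Q_Omega(s, o) = sum_a pi(a | s, o) * Q_U(s, o, a)   (equation 1)."""
    return sum(p * q_u[(s, o, a)] for a, p in pi[(s, o)].items())


def action_value(r, gamma, trans, u, s, o, a):
    """Q_U(s, o, a) = r(s, a) + gamma * sum_s' P(s' | s, a) * U(s', o)   (equation 2)."""
    expected_u = sum(prob * u[(s2, o)] for s2, prob in trans[(s, a)].items())
    return r[(s, a)] + gamma * expected_u


def value_upon_arrival(beta, q_omega, v_omega, s_next, o):
    """U(s', o) = (1 - beta(s', o)) * Q_Omega(s', o) + beta(s', o) * V_Omega(s')   (equation 3)."""
    b = beta[(s_next, o)]
    return (1.0 - b) * q_omega[(s_next, o)] + b * v_omega[s_next]


# Usage on a two-state toy problem with a single option "o".
u_t = value_upon_arrival({("t", "o"): 0.5}, {("t", "o"): 2.0}, {"t": 4.0}, "t", "o")
q_sa = action_value({("s", "a"): 1.0}, 0.5, {("s", "a"): {"t": 1.0}},
                    {("t", "o"): u_t}, "s", "o", "a")
q_so = option_value({("s", "o"): {"a": 0.25, "b": 0.75}},
                    {("s", "o", "a"): 4.0, ("s", "o", "b"): 0.0}, "s", "o")
```

In practice these quantities are learned estimates rather than exact tables; the sketch only shows how they depend on one another.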
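For notational clarity, we omit $\theta$ and $\phi$, which $Q_U$ and $U$ both depend on. The intra-option policy gradient theorem results from taking the derivative of the expected discounted return with respect to the intra-option policy parameters $\theta$ and defines the update rule for the intra-option policy:
$$\frac{\partial Q_\Omega(s_0, o_0)}{\partial \theta} = \sum_{s, o} \mu_\Omega(s, o \mid s_0, o_0) \sum_{a} \frac{\partial \pi_\theta(a \mid s, o)}{\partial \theta}\, Q_U(s, o, a) \tag{4}$$
where $\mu_\Omega(s, o \mid s_0, o_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, o_t = o \mid s_0, o_0)$ is the discounted weighting of $(s, o)$ along trajectories originating from $(s_0, o_0)$. The termination gradient theorem results from taking the derivative of the expected discounted return with respect to the termination policy parameters $\phi$ and defines the update rule for the termination policy for the initial condition $(s_1, o_0)$:
$$\frac{\partial Q_\Omega(s, o)}{\partial \phi} = \sum_{a} \pi_\theta(a \mid s, o) \sum_{s'} \gamma P(s' \mid s, a)\, \frac{\partial U(s', o)}{\partial \phi} \tag{5}$$
$$\frac{\partial U(s_1, o_0)}{\partial \phi} = -\sum_{s', o} \mu_\Omega(s', o \mid s_1, o_0)\, \frac{\partial \beta_\phi(s', o)}{\partial \phi}\, A_\Omega(s', o) \tag{6}$$
where $\mu_\Omega(s, o \mid s_1, o_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s, o_t = o \mid s_1, o_0)$ is now the discounted weighting of $(s, o)$ from $(s_1, o_0)$. $A_\Omega$ is the advantage function over options: $A_\Omega(s', o) = Q_\Omega(s', o) - V_\Omega(s')$.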
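The sign of the termination gradient is easiest to see in a single-sample, single-parameter sketch. The example below assumes a sigmoid termination function with one scalar parameter per state-option pair (the tabular parametrization used later in the experiments) and a plain gradient-ascent step on the return; the function names and the learning rate `lr` are hypothetical. It illustrates the update direction only and is not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def termination_update(phi, q_omega_so, v_omega_s, lr):
    """One sample-based ascent step for a scalar sigmoid termination parameter.

    The termination gradient theorem gives a gradient of the return of
    -(d beta / d phi) * A(s', o), with A(s', o) = Q_Omega(s', o) - V_Omega(s').
    """
    beta = sigmoid(phi)
    advantage = q_omega_so - v_omega_s
    grad_beta = beta * (1.0 - beta)  # derivative of the sigmoid
    return phi - lr * grad_beta * advantage


# Usage: a below-average option (negative advantage) becomes more likely to
# terminate; an above-average option becomes less likely to terminate.
worse = termination_update(phi=0.0, q_omega_so=1.0, v_omega_s=2.0, lr=1.0)
better = termination_update(phi=0.0, q_omega_so=2.0, v_omega_s=1.0, lr=1.0)
```

A negative advantage raises the parameter and therefore the termination probability, which matches the intuitive reading of the theorem.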
4 Learning Options with Arbitrary Levels of Abstraction
Notation: As it makes our equations much clearer and more condensed, we adopt the notation $x^{i:i+j} = x^{i}, \ldots, x^{i+j}$. This implies that $x^{i:i+j}$ denotes a list of variables in the range of $i$ through $i+j$.
The hierarchical options framework that we introduce in this work considers an agent that learns using an $N$ level hierarchy of policies, termination functions, and value functions. Our goal is to extend the ideas of the option-critic architecture in such a way that our framework simplifies to policy gradient based learning when $N=1$ and option-critic learning when $N=2$. At each hierarchical level above the lowest primitive action level policy, we consider an available set of options $\Omega^{1:N-1}$ that is a subset of the total set of available options $\Omega$. This way we keep our view of the possible available options at each level very broad. On one extreme, each hierarchical level may get its own unique set of options, and on the other extreme each hierarchical level may share the same set of options. We present a diagram of our proposed architecture in Figure 2.
We denote $\pi^1_{\theta^1}(o^1 \mid s)$ as the policy over the most abstract options in the hierarchy $o^1 \in \Omega^1$ given the state $s$. For example, $\pi^1 = \pi_\Omega$ from our discussion of the option-critic architecture. Once $o^1$ is chosen with policy $\pi^1$, we go to policy $\pi^2_{\theta^2}(o^2 \mid s, o^1)$, which is the next highest level policy, to select $o^2 \in \Omega^2$, conditioning it on both the current state $s$ and the selected highest level option $o^1$. This process continues in the same fashion, stepping down to policies at lower levels of abstraction conditioned on the augmented state space considering all selected higher level options, until we reach policy $\pi^N_{\theta^N}(a \mid s, o^{1:N-1})$. $\pi^N$ is the lowest level policy and it finally selects over the primitive action space conditioned on all of the selected options.
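The top-down selection procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each level's policy is represented as a plain function from the state and the tuple of already selected higher level options to a dictionary of probabilities, and the names `select_top_down`, `policies`, and `active` are invented for this example.

```python
import random


def select_top_down(policies, state, rng, active=()):
    """Sample one choice per level, from the most abstract policy down to the
    primitive action policy.

    policies[k](state, higher_options) returns {choice: probability} for
    level k (index 0 is the most abstract level in this sketch).
    `active` holds higher level options that are still running and must not
    be re-sampled.
    """
    chosen = list(active)
    for policy in policies[len(chosen):]:
        dist = policy(state, tuple(chosen))
        choices = list(dist)
        weights = [dist[c] for c in choices]
        chosen.append(rng.choices(choices, weights=weights, k=1)[0])
    return tuple(chosen)


# Usage with deterministic toy policies for a three level hierarchy.
toy_policies = [
    lambda s, hi: {"A": 1.0},              # most abstract options
    lambda s, hi: {hi[0] + "1": 1.0},      # conditioned on the option above
    lambda s, hi: {"left": 1.0},           # primitive actions
]
fresh = select_top_down(toy_policies, state=0, rng=random.Random(0))
resumed = select_top_down(toy_policies, state=0, rng=random.Random(0), active=("B",))
```

Passing the still-active options through `active` means only the levels below the highest terminated option are re-selected.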
Each level of the option hierarchy has a complementary termination function $\beta^1_{\phi^1}(s, o^1), \ldots, \beta^{N-1}_{\phi^{N-1}}(s, o^{1:N-1})$ that governs the termination pattern of the selected option at that level. We adopt a bottom-up termination strategy where high level options only have an opportunity to terminate when all of the lower level options have terminated first. For example, $o^{N-2}$ cannot terminate until $o^{N-1}$ terminates, at which point we can assess $\beta^{N-2}_{\phi^{N-2}}(s, o^{1:N-2})$ to see whether $o^{N-2}$ terminates. If it did terminate, this would allow $o^{N-3}$ the opportunity to assess if it should terminate, and so on. This condition ensures that higher level options will be more temporally extended than their lower level option building blocks, which is a key motivation of this work.
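The bottom-up termination rule can be illustrated with a short sketch. It samples terminations starting from the lowest level option and stops at the first option that continues, and it also computes the probability that a given level's termination function is assessed at all, which is the product of the termination probabilities of the levels below it. The zero-based level indexing (index 0 is the most abstract option) and the function names are assumptions of this example.

```python
import random


def highest_terminated_level(betas, state, options, rng):
    """Sample bottom-up terminations.

    betas[k](state, options[:k + 1]) is the termination probability of the
    option at level k (index 0 is the most abstract option in this sketch).
    Returns the index of the most abstract option that terminated, or None
    if the lowest level option continues.
    """
    highest = None
    for level in range(len(betas) - 1, -1, -1):  # lowest level first
        if rng.random() < betas[level](state, options[:level + 1]):
            highest = level
        else:
            break  # higher levels get no opportunity to terminate
    return highest


def prob_assessed(betas, state, options, level):
    """Probability that the termination function at `level` is assessed:
    the product of the termination probabilities of all lower levels."""
    p = 1.0
    for lower in range(level + 1, len(betas)):
        p *= betas[lower](state, options[:lower + 1])
    return p


# Usage: the two lower options always terminate, the top one never does.
opts = ("A", "B", "C")
sure = [lambda s, o: 0.0, lambda s, o: 1.0, lambda s, o: 1.0]
half = [lambda s, o: 0.5, lambda s, o: 0.5, lambda s, o: 0.5]
level = highest_terminated_level(sure, 0, opts, random.Random(0))
```

The early `break` is what guarantees that higher level options are at least as temporally extended as the options below them.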
The final key component of our system is the value function over the augmented state space. To enable comprehensive reasoning about the policies at each level of the option hierarchy, we need to maintain value functions that consider the state and every possible combination of active options and actions $V_\Omega(s), Q_\Omega(s, o^1), \ldots, Q_\Omega(s, o^{1:N-1}, a)$. These value functions collectively serve as the critic in our analogy to the actor-critic and option-critic training paradigms.
4.1 Generalizing the Option Value Function to N Hierarchical Levels
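Like policy gradient methods and option-critic, the hierarchical options framework optimizes directly for the discounted return expected over all trajectories starting at a state $s_0$ with active options $o_0^{1:N-1}$:
$$\rho(\Omega^{1:N-1}, \theta^{1:N}, \phi^{1:N-1}, s_0, o_0^{1:N-1}) = \mathbb{E}_{\Omega^{1:N-1},\, \pi^{1:N}_{\theta},\, \beta^{1:N-1}_{\phi}}\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0, o_0^{1:N-1}\Big] \tag{7}$$
This return depends on the policies and termination functions at each level of abstraction. We now consider the option value function for reasoning about an option $o^\ell$ at level $1 \le \ell \le N$ based on the augmented state space $(s, o^{1:\ell-1})$:
$$Q_\Omega(s, o^{1:\ell-1}) = \sum_{o^\ell} \pi^\ell_{\theta^\ell}(o^\ell \mid s, o^{1:\ell-1})\, Q_U(s, o^{1:\ell}) \tag{8}$$
Note that in order to simplify our notation we write $o^\ell$ as referring to both abstract and primitive actions. As a result, $o^N$ is equivalent to $a$, leveraging the primitive action space $\Omega^N = \mathcal{A}$. Extending the meaning of $Q_U$ from (Bacon et al., 2017), we define the corresponding value of executing an option in the presence of the currently active higher level options by integrating out the lower level options:
$$Q_U(s, o^{1:\ell}) = \sum_{o^N} \cdots \sum_{o^{\ell+1}} \prod_{j=\ell+1}^{N} \pi^j_{\theta^j}(o^j \mid s, o^{1:j-1}) \Big[ r(s, o^N) + \gamma \sum_{s'} P(s' \mid s, o^N)\, U(s', o^{1:\ell-1}) \Big] \tag{9}$$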
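The hierarchical option value function upon arrival $U$ with augmented state $(s', o^{1:\ell-1})$ is defined as:
$$\begin{aligned} U(s', o^{1:\ell-1}) ={}& \big(1 - \beta^{N-1}_{\phi^{N-1}}(s', o^{1:N-1})\big)\, Q_\Omega(s', o^{1:\ell-1}) && \text{(none terminate)} \\ &+ V_\Omega(s') \prod_{j=1}^{N-1} \beta^{j}_{\phi^{j}}(s', o^{1:j}) && \text{(all options terminate)} \\ &+ Q_\Omega(s', o^{1:\ell-1}) \sum_{q=\ell}^{N-1} \big(1 - \beta^{q-1}_{\phi^{q-1}}(s', o^{1:q-1})\big) \prod_{z=q}^{N-1} \beta^{z}_{\phi^{z}}(s', o^{1:z}) && \text{(only lower level options terminate)} \\ &+ \sum_{i=1}^{\ell-2} \big(1 - \beta^{i}_{\phi^{i}}(s', o^{1:i})\big)\, Q_\Omega(s', o^{1:i}) \prod_{k=i+1}^{N-1} \beta^{k}_{\phi^{k}}(s', o^{1:k}) && \text{(some relevant higher level options terminate)} \end{aligned} \tag{10}$$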
We explain the derivation of this equation¹ in Appendix A.1. Finally, before we can extend the policy gradient theorem, we must establish the Markov chain along which we can measure performance for options with $N$ levels of abstraction. This is derived in Appendix A.2.
¹ Note that when no options terminate, as in the first term in equation (10), the lowest level option does not terminate and thus no higher level options have the opportunity to terminate.
4.2 Generalizing the Intra-option Policy Gradient Theorem
We can think of actor-critic architectures, generalizing to the option-critic architecture as well, as pairing a critic with each actor network so that the critic has additional information about the value of the actor's actions that can be used to improve the actor's learning. However, this is derived by taking gradients with respect to the parameters of the policy while optimizing for the expected discounted return. The discounted return is approximated by a critic (i.e. value function) with the same augmented state space as the policy being optimized for. As examples, an actor-critic policy $\pi(a \mid s)$ is optimized by taking the derivative of its parameters with respect to $V_\pi(s)$ (Sutton et al., 2000) and an option-critic policy $\pi(a \mid s, o)$ is optimized by taking the derivative of its parameters with respect to $Q_\Omega(s, o)$ (Bacon et al., 2017). The intra-option policy gradient theorem (Bacon et al., 2017) is an important contribution, outlining how to optimize for a policy that is also associated with a termination function. As the policy over options in that work never terminates, it does not need a special training methodology, and the option-critic architecture allows the practitioner to pick their own method of learning the policy over options while using Q-learning as an example in their experiments. We do the same for our highest level policy $\pi^1$, which also never terminates. For all other policies $\pi^{2:N}$ we perform a generalization of actor-critic learning by providing a critic at each level and guiding gradients using the appropriate critic.
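We now seek to generalize the intra-option policy gradient theorem, deriving the update rule for a policy at an arbitrary level of abstraction $\pi^\ell$ by taking the gradient with respect to $\theta^\ell$ using the value function with the same augmented state space, $Q_\Omega(s, o^{1:\ell-1})$. Substituting from equation (8) we find:
$$\frac{\partial Q_\Omega(s, o^{1:\ell-1})}{\partial \theta^\ell} = \frac{\partial}{\partial \theta^\ell} \sum_{o^\ell} \pi^\ell_{\theta^\ell}(o^\ell \mid s, o^{1:\ell-1})\, Q_U(s, o^{1:\ell}) \tag{11}$$
Theorem 1 (Hierarchical Intra-option Policy Gradient Theorem). Given an $N$ level hierarchical set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta^\ell$ governing each policy $\pi^\ell$, the gradient of the expected discounted return with respect to $\theta^\ell$ and initial conditions $(s_0, o_0^{1:N-1})$ is:
$$\sum_{s,\, o^{1:\ell-1}} \mu_\Omega(s, o^{1:\ell-1} \mid s_0, o_0^{1:N-1}) \sum_{o^\ell} \frac{\partial \pi^\ell_{\theta^\ell}(o^\ell \mid s, o^{1:\ell-1})}{\partial \theta^\ell}\, Q_U(s, o^{1:\ell})$$
where $\mu_\Omega$ is a discounted weighting of augmented state tuples along trajectories starting from $(s_0, o_0^{1:N-1})$. A proof is in Appendix A.3.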
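As a rough illustration of how a critic at the matching level guides the policy update, the sketch below applies a single sample-based score-function step to a tabular softmax policy at one level, for one fixed augmented state. It assumes on-policy sampling, so that the gradient of the log-probability of the sampled choice, scaled by the critic's estimate of $Q_U$, stands in for the policy-gradient sum. The function names, the logit representation, and the learning rate `lr` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def intra_option_policy_update(logits, chosen, q_u, lr):
    """One sample-based ascent step for a softmax policy at a single level.

    logits: parameters of pi^l(. | s, o^{1:l-1}) for one augmented state.
    chosen: index of the sampled option (or primitive action) o^l.
    q_u:    critic estimate of Q_U(s, o^{1:l}) for that choice.
    Uses d log pi(chosen) / d logit_i = 1[i == chosen] - pi_i.
    """
    probs = softmax(logits)
    return [
        theta + lr * q_u * ((1.0 if i == chosen else 0.0) - p)
        for i, (theta, p) in enumerate(zip(logits, probs))
    ]


# Usage: a positive critic value raises the logit of the sampled choice.
updated = intra_option_policy_update([0.0, 0.0], chosen=0, q_u=2.0, lr=1.0)
```

The same step applies at every level below the top, with only the critic and the conditioning options changing.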
4.3 Generalizing the Termination Gradient Theorem
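We now turn our attention to computing gradients for the termination functions $\beta^\ell$ at each level, assumed to be stochastic and differentiable with respect to the associated parameters $\phi^\ell$.
$$\frac{\partial Q_\Omega(s, o^{1:\ell})}{\partial \phi^\ell} = \sum_{o^N} \cdots \sum_{o^{\ell+1}} \prod_{j=\ell+1}^{N} \pi^j_{\theta^j}(o^j \mid s, o^{1:j-1})\, \gamma \sum_{s'} P(s' \mid s, o^N)\, \frac{\partial U(s', o^{1:\ell})}{\partial \phi^\ell} \tag{12}$$
Hence, the key quantity is the gradient of $U$. This is a natural consequence of call-and-return execution, where termination function quality can only be evaluated upon entering the next state.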
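Theorem 2 (Hierarchical Termination Gradient Theorem). Given an $N$ level hierarchical set of Markov options with stochastic termination functions differentiable in their parameters $\phi^\ell$ governing each function $\beta^\ell$, the gradient of the expected discounted return with respect to $\phi^\ell$ and initial conditions $(s_1, o_0^{1:N-1})$ is:
$$-\sum_{s,\, o^{1:\ell}} \mu_\Omega(s, o^{1:\ell} \mid s_1, o_0^{1:N-1}) \prod_{i=\ell+1}^{N-1} \beta^{i}_{\phi^{i}}(s, o^{1:i})\, \frac{\partial \beta^{\ell}_{\phi^{\ell}}(s, o^{1:\ell})}{\partial \phi^\ell}\, A_\Omega(s, o^{1:\ell})$$
where $\mu_\Omega$ is a discounted weighting of augmented state tuples along trajectories starting from $(s_1, o_0^{1:N-1})$. $A_\Omega$ is the generalized advantage function over a hierarchical set of options: $A_\Omega(s', o^{1:\ell}) = Q_\Omega(s', o^{1:\ell}) - V_\Omega(s') \prod_{j=1}^{\ell-1} \beta^{j}_{\phi^{j}}(s', o^{1:j}) - \sum_{i=1}^{\ell-1} \big(1 - \beta^{i}_{\phi^{i}}(s', o^{1:i})\big)\, Q_\Omega(s', o^{1:i}) \prod_{k=i+1}^{\ell-1} \beta^{k}_{\phi^{k}}(s', o^{1:k})$. $A_\Omega$ compares the advantage of not terminating the current option with a probability weighted expectation based on the likelihood that higher level options also terminate. In (Bacon et al., 2017) this expression was simple as there was not a hierarchy of higher level termination functions to consider. A proof is in Appendix A.4.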
It is interesting to see the emergence of an advantage function as a natural consequence of the derivation. As in (Bacon et al., 2017), where this kind of relationship also appears, the advantage function gives the theorem an intuitive interpretation. When the option choice is suboptimal at level $\ell$ with respect to the expected value of terminating option $\ell$, the advantage function is negative and increases the odds of terminating that option. A new concept, not paralleled in the option-critic derivation, is the inclusion of a $\prod_{i=\ell+1}^{N-1} \beta^{i}_{\phi^{i}}(s, o^{1:i})$ multiplicative factor. This can be interpreted as discounting gradients by the likelihood of this termination function being assessed, as $\beta^\ell$ is only used if all lower level options terminate. This is a natural consequence of multi-level call-and-return execution.
5 Experiments
We would now like to empirically validate the efficacy of our proposed hierarchical option-critic (HOC) model. We achieve this by exploring benchmarks in the tabular and non-linear function approximation settings. In each case we implement an agent that is restricted to primitive actions (i.e. $N=1$), an agent that leverages the option-critic (OC) architecture (i.e. $N=2$), and an agent with the HOC architecture at level of abstraction $N=3$. We will demonstrate that complex RL problems may be more easily learned using more than two levels of abstraction and that the HOC architecture can successfully facilitate this level of learning using data from scratch.
For our tabular architectures, we followed protocol from (Bacon et al., 2017) and chose to parametrize the intra-option policies with softmax distributions and the terminations with sigmoid functions. The policy over options was learned using intra-option Q-learning. We also implemented primitive actor-critic (AC) using a softmax policy. For the non-linear function approximation setting, we trained our agents using A3C (Mnih et al., 2016). Our primitive action agents conduct A3C training using a convolutional network when there is image input, followed by an LSTM to contextualize the state. This way we ensure that benefits seen from options are orthogonal to those seen from these common neural network building blocks. We follow (Harb et al., 2017) to extend A3C to the Asynchronous Advantage Option-Critic (A2OC) and Asynchronous Advantage Hierarchical Option-Critic (A2HOC) architectures. We include detailed algorithm descriptions for all of our experiments in Appendix B. We also conducted hyperparameter optimization that is summarized, along with detail on experimental protocol, in Appendix B. In all of our experiments, we made sure that the two-level OC architecture had access to more total options than the three-level alternative and that the three-level architecture did not include any additional hyperparameters. This ensures that empirical gains are the result of increasingly abstract options.
5.1 Tabular Learning Challenge Problems
Exploring four rooms: We first consider a navigation task in the four-rooms domain (Sutton et al., 1999). Our goal is to evaluate the ability of a set of options learned fully autonomously to learn an efficient exploration policy within the environment. The initial state and the goal state are drawn uniformly from all open non-wall cells every episode. This setting is highly non-stationary, since the goal changes every episode. Primitive movements can fail with probability $\frac{1}{3}$, in which case the agent transitions randomly to one of the empty adjacent cells. The reward is +1 at the goal and 0 otherwise. In Figure 3 we report the average number of steps taken in the last 100 episodes every 100 episodes, reporting the average of 50 runs with different random seeds for each algorithm. We can clearly see that reasoning with higher levels of abstraction is critical to achieving a good exploration policy and that reasoning with three levels of abstraction results in more sample efficient learning than reasoning with two levels of abstraction. For this experiment we explore four levels of abstraction as well, but unfortunately there seem to be diminishing returns, at least for this tabular setting.
Discrete stochastic decision process: Next, we consider a hierarchical RL challenge problem as explored in (Kulkarni et al., 2016) with a stochastic decision process where the reward depends on the history of visited states in addition to the current state. There are 6 possible states and the agent always starts at $s_2$. The agent moves left deterministically when it chooses the left action, but the right action only succeeds half of the time, resulting in a left move otherwise. The terminal state is $s_1$ and the agent receives a reward of 1 when it first visits $s_6$ and then $s_1$. The reward for going to $s_1$ without visiting $s_6$ is 0.01. In Figure 4 we report the average reward over the last 100 episodes every 100 episodes, considering 10 runs with different random seeds for each algorithm. Reasoning with higher levels of abstraction is again critical to performing well at this task with reasonable sample efficiency. Both OC learning and HOC learning converge to a high quality solution surpassing performance obtained in (Kulkarni et al., 2016). However, it seems that learning converges faster with three levels of abstraction than it does with just two.
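For readers who want to experiment with this task, the following is a minimal sketch of the environment as described in Kulkarni et al. (2016): six states in a chain, a start at the second state, a terminal first state, and a history-dependent reward. The class name, the string action encoding, and the choice to keep the agent at the sixth state when a successful right move would leave the chain are assumptions of this example, not details taken from the experiments.

```python
import random


class StochasticChain:
    """Six-state chain with a history-dependent terminal reward."""

    N_STATES = 6

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.reset()

    def reset(self):
        self.state = 2                # the agent always starts at the second state
        self.visited_far_end = False  # has the sixth state been visited yet?
        return self.state

    def step(self, action):
        """action is "left" or "right"; returns (state, reward, done)."""
        # "right" succeeds half of the time; otherwise the agent moves left.
        move_right = action == "right" and self.rng.random() < 0.5
        if move_right:
            # Assumption: a right move at the far end leaves the state unchanged.
            self.state = min(self.state + 1, self.N_STATES)
        else:
            self.state -= 1
        if self.state == self.N_STATES:
            self.visited_far_end = True
        done = self.state == 1
        reward = 0.0
        if done:
            reward = 1.0 if self.visited_far_end else 0.01
        return self.state, reward, done


# Usage: going straight left ends the episode with the small reward.
env = StochasticChain(random.Random(0))
first_step = env.step("left")
```

Because the large reward requires reaching the far end before returning, the task rewards temporally extended behavior rather than greedy primitive moves.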
5.2 Deep Function Approximation Problems
Multistory building navigation: For an intuitive look at higher level reasoning, we consider the four rooms problem in a partially observed setting with an 11x17 grid at each level of a seven level building. The agent has a receptive field size of 3 in both directions, so observations for the agent are 9-dimensional feature vectors with 0 in empty spots, 1 where there is a wall, 0.25 if there are stairs, or 0.5 if there is a goal location. The stairwells in the north east corner of the floor lead upstairs to the south west corner of the next floor up. Stairs in the south west corner of the floor lead down to the north east corner of the floor below. Agents start in a random location in the basement (which has no south west stairwell) and must navigate to the roof (which has no north east stairwell) to find the goal in a random location. The reward is +10 for finding the goal and -0.1 for hitting a wall. This task could seemingly benefit from abstraction such as a composition of sub-policies to get to the stairs at each intermediate level. We report the rolling mean and standard deviation of the reward. In Figure 5 we see a qualitative difference between the policies learned with three levels of abstraction, which have high variance but fairly often find the goal location, and those learned with less abstraction. A2OC and A3C are hovering around zero reward, which is equivalent to just learning a policy that does not run into walls.
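The observation encoding described above can be sketched as follows. The feature values (0 for empty, 1 for wall, 0.25 for stairs, 0.5 for goal) and the 3x3 receptive field come from the text; the character-based grid representation, the row-major flattening order, and the treatment of cells outside the grid as walls are assumptions made for this example.

```python
EMPTY, WALL, STAIRS, GOAL = 0.0, 1.0, 0.25, 0.5
# Hypothetical character encoding of a floor plan, used only in this sketch.
CODES = {".": EMPTY, "#": WALL, "S": STAIRS, "G": GOAL}


def observe(grid, row, col):
    """Return the 9-dimensional observation for an agent at (row, col).

    grid is a list of equal-length strings. The 3x3 window centered on the
    agent is flattened row by row; cells outside the grid count as walls.
    """
    obs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
            obs.append(CODES[grid[r][c]] if inside else WALL)
    return obs


# Usage on a tiny 3x3 floor fragment.
floor = ["###",
         "#.S",
         "#G."]
center_view = observe(floor, 1, 1)
corner_view = observe(floor, 2, 2)
```

With such a small receptive field the agent cannot see which floor it is on, which is why the task is only partially observed.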
Architecture  Clipped Reward
A3C           8.43 ± 2.29
A2OC          10.56 ± 0.49
A2HOC         13.12 ± 1.46
Learning many Atari games with one model: We finally consider application of the HOC to the Atari games (Bellemare et al., 2013). Evaluation protocols for the Atari games are famously inconsistent (Machado et al., 2017), so to ensure fair comparisons we implement apples-to-apples versions of our baseline architectures deployed with the same code base and environment settings. We put our models to the test and consider a very challenging setting (Sharma et al., 2017) where a single agent attempts to learn many Atari games at the same time. Our agents attempt to learn 21 Atari games simultaneously, matching the largest previous multi-task setting on Atari (Sharma et al., 2017). Our tasks are hand picked to fall into three categories of related games, each with 7 games represented. The first category is games that include maze style navigation (e.g. MsPacman), the second category is mostly fully observable shooter games (e.g. SpaceInvaders), and the final category is partially observable shooter games (e.g. BattleZone). We train each agent by always sampling the game with the fewest training frames after each episode, ensuring the games are sampled very evenly throughout training. We also clip rewards to allow for equal learning rates across tasks (Mnih et al., 2015). We train each game for 10 million frames (210 million total) and report statistics on the clipped reward achieved by each agent when evaluating the policy without learning for another 3 million frames on each game across 5 separate training runs. As our main metric, we report the summary of how each multi-task agent maximizes its reward in Table 1. While all agents struggle in this difficult setting, HOC is better able to exploit commonalities across games using fewer parameters and policies.
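The game scheduling rule and the reward clipping described above are simple enough to sketch directly. The example below assumes a dictionary that tracks training frames per game, breaks ties alphabetically (the text does not say how ties are resolved), and clips rewards to the range [-1, 1] as in Mnih et al. (2015); the function names are hypothetical.

```python
def next_game(frames_seen):
    """Pick the game with the fewest training frames so far.

    frames_seen maps game name -> frames trained. Ties are broken
    alphabetically, which is an assumption of this sketch.
    """
    return min(sorted(frames_seen), key=lambda game: frames_seen[game])


def clip_reward(reward):
    """Clip a raw game reward to the range [-1, 1]."""
    return max(-1.0, min(1.0, reward))


# Usage: after each episode, add its length and pick the next game.
frames = {"MsPacman": 300, "SpaceInvaders": 100, "BattleZone": 100}
first = next_game(frames)
frames[first] += 500           # pretend the episode lasted 500 frames
second = next_game(frames)
```

Scheduling by frame count rather than by episode count keeps training balanced even though episode lengths differ widely between games.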
Analysis of Learned Options: An advantage of the multi-task setting is that it allows for a degree of quantitative interpretability regarding when and how options are used. We report characteristics of the agents with median performance during the evaluation period. A2OC with 16 options uses 5 options the bulk of the time, with the rest of the time largely split among another 6 options (Figure 6). The average number of time steps between switching options has a fairly narrow range across games, falling between 3.4 (Solaris) and 5.5 (MsPacman). In contrast, A2HOC with three options at each branch of the hierarchy learns to switch options at a rich range of temporal resolutions depending on the game. The high level options vary between an average of 3.2 (BeamRider) and 9.7 (Tutankham) steps before switching. Meanwhile, the low level options vary between an average of 1.5 (Alien) and 7.8 (Tutankham) steps before switching. In Appendix B.4 we provide additional details about the average duration before switching options for each game. In Figure 7 we can see that the most used options for HOC are distributed fairly evenly across a number of games, while OC tends to specialize its options on a smaller number of games. In fact, the average share of usage dominated by a single game for the top 7 most used options is 40.9% for OC and only 14.7% for HOC. Additionally, we can see that a hierarchy of options imposes structure in the space of options. For example, when $o^1 = 1$ or $o^1 = 2$ the low level options tend to focus on different situations within the same games.
6 Conclusion
In this work we propose the first policy gradient theorems to optimize an arbitrarily deep hierarchy of options to maximize the expected discounted return. Moreover, we have proposed a particular hierarchical option-critic architecture that is the first general purpose reinforcement learning architecture to successfully learn options from data with more than two abstraction levels. We have conducted extensive empirical evaluation in the tabular and deep non-linear function approximation settings. In all cases we found that, for significantly complex problems, reasoning with more than two levels of abstraction can be beneficial for learning. While the performance of the hierarchical option-critic architecture is impressive, we envision our proposed policy gradient theorems eventually transcending it in overall impact. Although the architectures we explore in this paper have a fixed structure and fixed depth of abstraction for simplicity, the underlying theorems can also guide learning for much more dynamic architectures that we hope to explore in future work.
Acknowledgements
The authors thank Murray Campbell, Xiaoxiao Guo, Ignacio Cases, and Tim Klinger for fruitful discussions that helped shape this work.
References
 Bacon et al. [2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. 2017.

 Bellemare et al. [2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013.
 Caruana [1997] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. doi: 10.1023/A:1007379606734. URL http://dx.doi.org/10.1023/A:1007379606734.
 Daniel et al. [2016] Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016.
 Dayan and Hinton [1993] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
 Fernando et al. [2017] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 Fox et al. [2017] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multilevel discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
 Graves et al. [2018] Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. ICML, 2018.
 Harb et al. [2017] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017.
 Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.
 Klissarov et al. [2017] Martin Klissarov, Pierre-Luc Bacon, Jean Harb, and Doina Precup. Learnings options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004, 2017.
 Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
 Konidaris et al. [2011] George Konidaris, Scott Kuindersma, Roderic A Grupen, and Andrew G Barto. Autonomous skill acquisition on a mobile manipulator. In AAAI, 2011.
 Kulkarni et al. [2016] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
 Levy et al. [2017] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.
 Levy and Shimkin [2011] Kfir Y Levy and Nahum Shimkin. Unified inter and intra options learning using policy gradient methods. In European Workshop on Reinforcement Learning, pages 153–164. Springer, 2011.

 Li and Hoiem [2016] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.
 Machado et al. [2017] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
 Mankowitz et al. [2016] Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. Adaptive skills adaptive partitions (asap). In Advances in Neural Information Processing Systems, pages 1588–1596, 2016.
 Mann et al. [2015] Timothy A Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research, 53:375–438, 2015.
 McGovern and Barto [2001] Amy McGovern and Andrew G Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 361–368. Morgan Kaufmann Publishers Inc., 2001.
 Menache et al. [2002] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut - dynamic discovery of subgoals in reinforcement learning. In European Conference on Machine Learning, pages 295–306. Springer, 2002.

 Misra et al. [2016] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
 Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 Niekum [2013] Scott D Niekum. Semantically grounded learning from unstructured demonstrations. University of Massachusetts Amherst, 2013.
 Precup [2000] Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.
 Puterman [1994] Martin L Puterman. Markov decision processes: Discrete dynamic stochastic programming, 92–93, 1994.

 Riemer et al. [2015] Matthew Riemer, Sophia Krasikov, and Harini Srinivasan. A deep learning and knowledge transfer based architecture for social media user characteristic determination. In Proceedings of the third International Workshop on Natural Language Processing for Social Media, pages 39–47, 2015.
 Riemer et al. [2016] Matthew Riemer, Elham Khabiri, and Richard Goodwin. Representation stability as a regularizer for improved text analytics transfer learning. arXiv preprint arXiv:1704.03617, 2016.
 Riemer et al. [2017] Matthew Riemer, Michele Franceschini, Djallel Bouneffouf, and Tim Klinger. Generative knowledge distillation for general purpose function compression. NIPS 2017 Workshop on Teaching Machines, Robots, and Humans, 5:30, 2017.
 Rosenbaum et al. [2018] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. ICLR, 2018.
 Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 Rusu et al. [2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 Sahni et al. [2017] Himanshu Sahni, Saurabh Kumar, Farhan Tejani, and Charles Isbell. Learning to compose skills. arXiv preprint arXiv:1711.11289, 2017.
 Sharma et al. [2017] S. Sharma, A. Jha, P. Hegde, and B. Ravindran. Learning to multi-task by active sampling. arXiv preprint arXiv:1702.06053, 2017.
 Shu et al. [2017] Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294, 2017.
 Silver and Ciosek [2012] David Silver and Kamil Ciosek. Compositional planning using optimal option models. arXiv preprint arXiv:1206.6473, 2012.
 Şimşek and Barto [2009] Özgür Şimşek and Andrew G Barto. Skill characterization based on betweenness. In Advances in neural information processing systems, pages 1497–1504, 2009.
 Stolle and Precup [2002] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212–223. Springer, 2002.
 Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Vezhnevets et al. [2016] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. In Advances in neural information processing systems, pages 3486–3494, 2016.
 Vezhnevets et al. [2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
Appendix A Derivation of Generalized Policy Gradient and Termination Gradient Theorems
A.1 The Derivation of U
To help explain the meaning and derivation of equation (10), we separate the expression into four primary terms. The first term applies to a hierarchy of any depth and represents the expected return from cases where no options terminate. The second term applies once there are at least two levels and represents the expected return from cases where every option terminates. The third and fourth terms apply once there are at least three levels and represent the expected return from cases where some, but not all, options terminate.
We will first discuss how to estimate the return when no options terminate. In this case we simply use our estimate of the value of the current state following the current options, if there are any. Because we are computing an expectation, we also multiply this term by its likelihood, which equals the probability that the lowest level option policy does not terminate. With a single level, where there are no options, we can consider the termination probability of the current policy to be zero and the current option context to be empty. As such, we estimate the value function upon arrival as the value of the arrival state, as we do for actor-critic policy gradients.
Next we turn our attention to estimating the return when all options terminate. This can be approximated using our estimate of the return given only the arrival state. The likelihood of this happening equals the conditional likelihood of options terminating at every level of abstraction we are modeling. With two levels, equation (10) simplifies to equation (3). This expression is precisely the option value function upon arrival of the option-critic framework derived in [Bacon et al., 2017].
The final quantity we will estimate covers cases where only some options terminate. This situation has not been explored by other work on option learning, as it only arises with at least three hierarchical levels of planning. The case where some (but not all) options terminate arises when a series of low level options terminate while a high level option does not. For a given level of abstraction, we can analyze the likelihood that the lower level options terminate while the current one does not. In such a case, we multiply this likelihood by the value one level more abstract than the current option hierarchy level. For convenience in our derivation, we split our notation for this quantity into two separate terms, accounting explicitly for the case when only lower level options terminate.
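The termination cases discussed in this subsection can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not equation (10) itself: it assumes options terminate bottom-up (a higher level option can only terminate once every option below it has), it takes the view of the full option context rather than a particular level of the hierarchy, and the names `cascade_probabilities`, `value_upon_arrival`, `betas`, and `q_values` are hypothetical.

```python
from math import isclose, prod


def cascade_probabilities(betas):
    """Probability of each bottom-up termination outcome.

    betas[i] is the (assumed given) termination probability of the option at
    depth i + 1, ordered from the highest level option to the lowest.
    Entry d of the result is the probability that exactly the first d options
    stay active: every deeper option terminates and option d does not.
    """
    n = len(betas)
    probs = []
    for depth in range(n + 1):
        p = prod(betas[depth:])  # every option deeper than `depth` terminates
        if depth > 0:
            p *= 1.0 - betas[depth - 1]  # the deepest survivor keeps running
        probs.append(p)
    return probs


def value_upon_arrival(betas, q_values):
    """Expected value on arrival, weighting q_values[d] by outcome d.

    q_values[d] is a hypothetical value estimate of the arrival state given
    that the first d options remain active; q_values[0] is the state value.
    """
    if len(q_values) != len(betas) + 1:
        raise ValueError("need one value estimate per termination outcome")
    return sum(p * q for p, q in zip(cascade_probabilities(betas), q_values))


# Two options: both terminate, only the lower one does, or neither does.
example_probs = cascade_probabilities([0.2, 0.5])
assert isclose(sum(example_probs), 1.0)  # the outcomes are exhaustive
```

With an empty `betas` list the function returns the state value alone, as in actor-critic. With a single option it reduces to the familiar (1 - β) Q + β V form of the option-critic value upon arrival. The partial-termination outcomes only appear once `betas` has two or more entries.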
A.2 Generalized Markov Chain and Augmented Process
We must establish the Markov chain along which we can measure performance for options with levels of abstraction. The natural approach is to consider the chain defined in the augmented state space because state and active option based tuples now play the role of regular states in a usual Markov chain. If options have been initiated or are executing at time in state , then the probability of transitioning to in one step is:
(13) 
where primitive actions are . Like the Markov chain derived for the option-critic architecture [Bacon et al., 2017], the process given by equation (13) is homogeneous. Additionally, when options are available at every state, the process is ergodic, with a unique stationary distribution over the augmented state space tuples.
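As a concrete illustration of an augmented chain and its stationary distribution, the following Python sketch builds the familiar two-level special case from the option-critic architecture [Bacon et al., 2017], in which each node of the chain is a (state, option) pair. It is not an implementation of equation (13) for deeper hierarchies. The toy environment, the numbers, and the names `augmented_chain` and `stationary_distribution` are all assumptions made for illustration.

```python
import itertools


def augmented_chain(P, pi, beta, pi_omega):
    """One-step transition probabilities over (state, option) pairs.

    P[s][a][s2]     environment dynamics
    pi[s][o][a]     intra-option policy
    beta[s][o]      probability that option o terminates on arriving in s
    pi_omega[s][o]  policy over options
    """
    n_s, n_a, n_o = len(P), len(P[0]), len(pi_omega[0])
    nodes = list(itertools.product(range(n_s), range(n_o)))
    chain = {}
    for s, o in nodes:
        row = {}
        for s2, o2 in nodes:
            # Probability of reaching s2 under the active option's policy.
            move = sum(pi[s][o][a] * P[s][a][s2] for a in range(n_a))
            # Either the option continues, or it terminates and o2 is picked.
            keep = (1.0 - beta[s2][o]) if o2 == o else 0.0
            switch = beta[s2][o] * pi_omega[s2][o2]
            row[(s2, o2)] = move * (keep + switch)
        chain[(s, o)] = row
    return chain


def stationary_distribution(chain, iters=500):
    """Approximate the stationary distribution by power iteration."""
    nodes = list(chain)
    dist = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        dist = {n2: sum(dist[n] * chain[n][n2] for n in nodes) for n2 in nodes}
    return dist


# Toy problem (assumed values): 2 states, 2 primitive actions, 2 options.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
pi = [[[0.8, 0.2], [0.3, 0.7]], [[0.6, 0.4], [0.1, 0.9]]]
beta = [[0.1, 0.5], [0.4, 0.2]]
pi_omega = [[0.5, 0.5], [0.3, 0.7]]

chain = augmented_chain(P, pi, beta, pi_omega)
dist = stationary_distribution(chain)
```

Because every probability in the toy problem is strictly positive, every pair can reach every other pair in one step, so the chain is ergodic and power iteration converges to its unique stationary distribution. The transition rule does not depend on the time step, which is what makes the chain homogeneous.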
We continue by presenting an extension of results about augmented processes used for derivation of learning algorithms in [Bacon et al., 2017] to an option hierarchy with levels of abstraction. If options have been initiated or are executing at time , then the discounted probability of transitioning to where is:
(14) 
As such, when we condition the process from , the discounted probability of transitioning to is:
(15) 
This definition will be very useful later for our derivation of the hierarchical intra-option policy gradient. However, for the derivation of the hierarchical termination gradient theorem we should reformulate the discounted probability of transitioning to from the view of the termination policy at abstraction level explicitly separating out terms that depend on :
(16) 
The step discounted probabilities can more generally be expressed recursively:
(17) 
Or rather conditioning on