Learning Abstract Options

October 27, 2018 · Matthew Riemer et al. (IBM)

Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently (Bacon et al., 2017) was a policy gradient theorem derived for online learning of general purpose options in an end-to-end fashion. In this work, we extend previous work on this topic, which focused only on learning a two-level hierarchy of options and primitive actions, to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options in which high level temporally extended options are composed of lower level options with finer resolutions in time. We extend results from (Bacon et al., 2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficacy of our framework.


1 Introduction

In reinforcement learning (RL), options (Sutton et al., 1999; Precup, 2000) provide a general framework for defining temporally abstract courses of action for learning and planning. Extensive research has focused on discovering these temporal abstractions autonomously (McGovern and Barto, 2001; Stolle and Precup, 2002; Menache et al., 2002; Şimşek and Barto, 2009; Silver and Ciosek, 2012), while approaches usable in continuous state and/or action spaces have only recently become feasible (Konidaris et al., 2011; Niekum, 2013; Mann et al., 2015; Mankowitz et al., 2016; Kulkarni et al., 2016; Vezhnevets et al., 2016; Daniel et al., 2016). Most existing work has focused on finding subgoals (i.e. useful states for the agent to reach) and then learning policies to achieve them. However, these approaches do not scale well because of their combinatorial nature. Recent work on option-critic learning blurs the line between option discovery and option learning by providing policy gradient theorems for optimizing a two-level hierarchy of options and primitive actions (Bacon et al., 2017). This approach has achieved success when applied with Q-learning on Atari games, and has since been extended to continuous action spaces (Klissarov et al., 2017) and to asynchronous parallelization (Harb et al., 2017). In this paper, we extend option-critic to a novel hierarchical option-critic framework, presenting generalized policy gradient theorems that can be applied to an arbitrarily deep hierarchy of options.

Figure 1: State trajectories over a three-level hierarchy of options. Open circles represent SMDP decision points while filled circles are primitive steps within an option. The low level options are temporally extended over primitive actions, and high level options are even further extended.

Work on learning with temporal abstraction is motivated by two key potential benefits over learning with primitive actions: long term credit assignment and exploration. Learning only at the primitive action level, or even with low levels of abstraction, slows down learning because agents must learn longer sequences of actions to achieve the desired behavior. This frustrates the process of learning in environments with sparse rewards. In contrast, agents that learn a high level decomposition of sub-tasks are able to explore the environment more effectively by exploring in the abstract action space rather than the primitive action space. While the recently proposed deliberation cost (Harb et al., 2017) can be used as a margin that effectively controls how temporally extended options are, the standard two-level version of the option-critic framework is still ill-equipped to learn complex tasks that require sub-task decomposition at multiple, quite different temporal resolutions of abstraction. In Figure 1 we depict how we overcome this obstacle to learn a deep hierarchy of options. The standard two-level option hierarchy constructs a Semi-Markov Decision Process (SMDP), in which new options are chosen when temporally extended sequences of primitive actions are terminated. In our framework, we consider not just options and primitive actions, but an arbitrarily deep hierarchy of lower level and higher level options. Higher level options represent a further temporally extended SMDP than the lower level options below them, as they only have an opportunity to terminate when all lower level options terminate.

We will start by reviewing related research and the seminal work we build upon, which first derived policy gradient theorems for learning with primitive actions (Sutton et al., 2000) and with options (Bacon et al., 2017). We then describe the core ideas of our approach, presenting hierarchical intra-option policy and termination gradient theorems. We leverage this new type of policy gradient learning to construct a hierarchical option-critic architecture, which generalizes the option-critic architecture to an arbitrarily deep hierarchy of options. Finally, we demonstrate the empirical benefit of this architecture over standard option-critic when applied to RL benchmarks. To the best of our knowledge, this is the first general purpose end-to-end approach for learning a deep hierarchy of options beyond two levels in RL settings, scaling to very large domains at comparable efficiency.

2 Related Work

Our work is related to recent literature on learning to compose skills in RL. As an example, Sahni et al. (2017) leverage a logic for combining pre-learned skills by learning an embedding to represent the combination of a skill and a state. Unfortunately, their system relies on a pre-specified sub-task decomposition into skills. In (Shu et al., 2017), the authors propose to ground all goals in a natural language description space. The created descriptions can then be high level and express a sequence of goals. While these are interesting directions for further exploration, we focus on a more general setting without provided natural language goal descriptions or sub-task decomposition information.

Our work is also related to methods that learn to decompose the problem over long time horizons. A prominent paradigm for this is Feudal Reinforcement Learning (Dayan and Hinton, 1993), which learns using manager and worker models. Theoretically, this can be extended to a deep hierarchy of managers and their managers, as done in the original work for a hand designed decomposition of the state space. Much more recently, Vezhnevets et al. (2017) showed the ability to successfully train a Feudal model end to end with deep neural networks on Atari games. However, this has only been achieved for a two-level hierarchy (i.e. one manager and one worker). We can think of Feudal approaches as learning to decompose the problem with respect to the state space, while the options framework learns a temporal decomposition of the problem. Recent work (Levy et al., 2017) also breaks down the problem over a temporal hierarchy, but like (Vezhnevets et al., 2017) is based on learning a latent goal representation that modulates the policy behavior, as opposed to options. Conceptually, options stress choosing among skill abstractions, while Feudal approaches stress the achievement of certain kinds of states. Humans tend to use both of these kinds of reasoning when appropriate, and we conjecture that a hybrid approach will likely win out in the end. Unfortunately, in the space available we cannot come to definitive conclusions about the precise nature of the differences and potential synergies of these approaches.

The concept of learning a hierarchy of options is not new. It is an obviously desirable extension of options envisioned in the original papers. However, actually learning a deep hierarchy of options end to end has received surprisingly little attention to date. Compositional planning, where options select other options, was first considered in (Silver and Ciosek, 2012). The authors provided a generalization of value iteration to option models for multiple subgoals, leveraging explicit subgoals for options. Recently, Fox et al. (2017) successfully trained a hierarchy of options end to end for imitation learning. Their approach leverages an EM based algorithm for recursive discovery of additional levels of the option hierarchy. Unfortunately, their approach is only applicable to imitation learning and not general purpose RL. We are the first to propose theorems, along with a practical algorithm and architecture, to train arbitrarily deep hierarchies of options end to end using policy gradients, maximizing the expected return.

3 Problem Setting and Notation

A Markov Decision Process (MDP) is defined by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $P : \mathcal{S} \times \mathcal{A} \rightarrow (\mathcal{S} \rightarrow [0,1])$, and a reward function $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. We follow (Bacon et al., 2017) and develop our ideas assuming discrete state and action sets, while our results extend to continuous spaces using the usual measure-theoretic assumptions, as demonstrated in our experiments. A policy is defined as a probability distribution over actions conditioned on states, $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$. The value function of a policy $\pi$ is the expected return $V_\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s]$, with an action-value function $Q_\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a]$, where $\gamma \in [0, 1)$ is the discount factor.

Policy gradient methods (Sutton et al., 2000; Konda and Tsitsiklis, 2000) consider the problem of improving a policy by performing stochastic gradient descent to optimize a performance objective over a family of parametrized stochastic policies $\pi_\theta$. The policy gradient theorem (Sutton et al., 2000) provides the gradient of the discounted reward objective with respect to $\theta$ in a straightforward expression. The objective is defined with respect to a designated starting state $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0]$. The policy gradient theorem shows that $\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s, a)$, where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$ is the discounted weighting of the states along the trajectories starting from the initial state $s_0$.
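To make the update implied by the policy gradient theorem concrete, the sketch below performs a REINFORCE-style update for a tabular softmax policy, using the sampled return as an unbiased stand-in for $Q_{\pi_\theta}(s, a)$; the learning rate, table layout, and episode format are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def policy_gradient_update(theta, episode, gamma=0.99, lr=0.1):
    """One REINFORCE-style update; theta[s, a] holds softmax logits.

    episode is a list of (state, action, reward) tuples sampled with the
    current policy; the sampled return G_t stands in for Q(s_t, a_t).
    """
    G = 0.0
    for t in reversed(range(len(episode))):   # accumulate returns backwards
        s, a, r = episode[t]
        G = r + gamma * G
        probs = softmax(theta[s])
        grad_log = -probs                     # d log pi(a|s) / d theta[s, :]
        grad_log[a] += 1.0
        theta[s] += lr * (gamma ** t) * grad_log * G
    return theta
```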

The options framework (Sutton et al., 1999; Precup, 2000) provides a formalism for the idea of temporally extended actions. A Markovian option $\omega \in \Omega$ is a triple $(I_\omega, \pi_\omega, \beta_\omega)$, where $I_\omega \subseteq \mathcal{S}$ represents an initiation set, $\pi_\omega$ represents an intra-option policy, and $\beta_\omega : \mathcal{S} \rightarrow [0, 1]$ represents a termination function. Like most option discovery algorithms, we assume that all options are available everywhere. MDPs with options become SMDPs (Puterman, 1994), with an associated optimal value function over options $V^*_\Omega(s)$ and option-value function $Q^*_\Omega(s, \omega)$ (Sutton et al., 1999; Precup, 2000).
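For concreteness, a Markovian option can be represented directly as the triple just described; the sketch below is a minimal container whose field names are our own, not notation from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """A Markovian option (I, pi, beta)."""
    initiation_set: Set[int]                   # states where the option may be started
    intra_option_policy: Callable[[int], int]  # maps a state to a primitive action
    termination: Callable[[int], float]        # maps a state to P(terminate)

    def can_initiate(self, state: int) -> bool:
        return state in self.initiation_set
```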

The option-critic architecture (Bacon et al., 2017) utilizes a call-and-return option execution model. An agent picks an option $\omega$ according to its policy over options $\pi_\Omega(\omega \mid s)$, then follows the intra-option policy $\pi_\omega$ until termination (as determined by $\beta_\omega$), which triggers a repetition of this procedure. Let $\pi_{\omega,\theta}$ denote the intra-option policy of option $\omega$ parametrized by $\theta$ and $\beta_{\omega,\vartheta}$ the termination function of $\omega$ parametrized by $\vartheta$. Like policy gradient methods, the option-critic architecture optimizes directly for the discounted return expected over trajectories starting at a designated state $s_0$ and option $\omega_0$: $\rho(\Omega, \theta, \vartheta, s_0, \omega_0) = \mathbb{E}_{\Omega,\theta,\vartheta}[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0, \omega_0]$. The option-value function is then:

$Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s)\, Q_U(s, \omega, a)$   (1)

where $Q_U(s, \omega, a)$ is the value of selecting an action $a$ given the context of a state-option pair:

$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s')$   (2)

The pairs $(s, \omega)$ define an augmented state space (Levy and Shimkin, 2011). The option-critic architecture instead leverages the function $U : \Omega \times \mathcal{S} \rightarrow \mathbb{R}$, which is called the option-value function upon arrival (Sutton et al., 1999). The value of selecting option $\omega$ upon entering state $s'$ is:

$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s')$   (3)

For notational clarity, we omit the parameters $\theta$ and $\vartheta$, on which $Q_U$ and $U$ both depend. The intra-option policy gradient theorem results from taking the derivative of the expected discounted return with respect to the intra-option policy parameters $\theta$ and defines the update rule for the intra-option policy:

$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s, \omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a)$   (4)

where $\mu_\Omega(s, \omega \mid s_0, \omega_0)$ is the discounted weighting of $(s, \omega)$ along trajectories originating from $(s_0, \omega_0)$. The termination gradient theorem results from taking the derivative of the expected discounted return with respect to the termination parameters $\vartheta$ and defines the update rule for the termination function, for the initial condition $(s_1, \omega_0)$:

$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s', \omega} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega)$   (5)

$A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$   (6)

where $\mu_\Omega(s', \omega \mid s_1, \omega_0)$ is now the discounted weighting of $(s', \omega)$ from $(s_1, \omega_0)$, and $A_\Omega$ is the advantage function over options.
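A minimal sketch of how equations (4) and (5) translate into stochastic updates for a tabular agent with softmax intra-option policies and sigmoid terminations; the learning rates, table shapes, and the use of stored critic tables as targets are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def option_critic_step(theta, vartheta, Q_U, Q_Omega, V_Omega,
                       s, w, a, s_next, lr_pi=0.1, lr_beta=0.1):
    """One stochastic update in the spirit of equations (4) and (5).

    theta[w, s, :] : logits of the intra-option policy of option w at state s
    vartheta[w, s] : logit of the termination probability of option w at state s
    Q_U, Q_Omega, V_Omega : critic tables used as targets
    """
    # Intra-option policy gradient (eq. 4): score function weighted by Q_U.
    probs = softmax(theta[w, s])
    grad_log = -probs
    grad_log[a] += 1.0
    theta[w, s] += lr_pi * grad_log * Q_U[s, w, a]

    # Termination gradient (eq. 5): lower beta when the option still has a
    # positive advantage in the next state, raise it otherwise.
    beta = sigmoid(vartheta[w, s_next])
    advantage = Q_Omega[s_next, w] - V_Omega[s_next]
    vartheta[w, s_next] -= lr_beta * beta * (1.0 - beta) * advantage
    return theta, vartheta
```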

4 Learning Options with Arbitrary Levels of Abstraction

Notation: As it makes our equations much clearer and more condensed, we adopt the shorthand $x^{1:\ell} \triangleq (x^1, x^2, \ldots, x^\ell)$; that is, $x^{1:\ell}$ denotes a list of variables indexed from $1$ through $\ell$.

The hierarchical options framework that we introduce in this work considers an agent that learns using an $N$ level hierarchy of policies, termination functions, and value functions. Our goal is to extend the ideas of the option-critic architecture in such a way that our framework simplifies to policy gradient based learning when $N = 1$ and to option-critic learning when $N = 2$. At each hierarchical level $\ell$ above the lowest, primitive action level policy, we consider an available set of options $\Omega^\ell$ that is a subset of the total set of available options $\Omega$. This keeps our view of the possible available options at each level very broad: on one extreme, each hierarchical level may get its own unique set of options, and on the other, every hierarchical level may share the same set of options. We present a diagram of our proposed architecture in Figure 2.

Figure 2: A diagram describing our proposed hierarchical option-critic architecture. Dotted lines represent processes within the agent while solid lines represent processes within the environment. Option selection is top down through the hierarchy and option termination is bottom up (represented with red dotted lines).
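The architecture in Figure 2 amounts to $N$ policies, $N-1$ termination functions, and a critic over each augmented state space. A sketch of these containers, under our own naming conventions, is below; with $N = 1$ it degenerates to a single actor-critic policy and with $N = 2$ to option-critic.

```python
from dataclasses import dataclass
from typing import Callable, List

# Type aliases purely for readability of the sketch.
Policy = Callable[..., int]         # pi^l(omega^l | s, omega^{1:l-1})
Termination = Callable[..., float]  # beta^l(s, omega^{1:l}) in [0, 1]
Critic = Callable[..., float]       # value estimate for (s, omega^{1:l})

@dataclass
class HierarchicalOptionCritic:
    """Parameter containers for an N-level hierarchical option-critic agent."""
    policies: List[Policy]           # length N; the last one picks primitive actions
    terminations: List[Termination]  # length N - 1; the root policy never terminates
    critics: List[Critic]            # length N; one critic per augmented state space

    @property
    def num_levels(self) -> int:
        return len(self.policies)
```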

We denote by $\pi^1_{\theta^1}(\omega^1 \mid s)$ the policy over the most abstract options in the hierarchy given the state $s$; for example, $\pi^1_{\theta^1}$ plays the role of $\pi_\Omega$ from our discussion of the option-critic architecture. Once $\omega^1$ is chosen with policy $\pi^1_{\theta^1}$, we go to the next highest level policy $\pi^2_{\theta^2}(\omega^2 \mid s, \omega^1)$ to select $\omega^2$, conditioning it on both the current state $s$ and the selected highest level option $\omega^1$. This process continues in the same fashion, stepping down to policies at lower levels of abstraction conditioned on the augmented state space containing all selected higher level options, until we reach the policy $\pi^N_{\theta^N}(a \mid s, \omega^{1:N-1})$. This lowest level policy finally selects over the primitive action space, conditioned on all of the selected options.

Each level of the option hierarchy has a complementary termination function $\beta^\ell_{\vartheta^\ell}(s, \omega^{1:\ell})$ that governs the termination pattern of the option selected at that level. We adopt a bottom-up termination strategy in which high level options only have an opportunity to terminate when all of the lower level options have terminated first. For example, $\omega^{N-2}$ cannot terminate until $\omega^{N-1}$ terminates, at which point we can assess $\beta^{N-2}_{\vartheta^{N-2}}$ to see whether $\omega^{N-2}$ terminates; if it does, $\omega^{N-3}$ in turn gets the opportunity to assess its termination, and so on. This condition ensures that higher level options are more temporally extended than their lower level option building blocks, which is a key motivation of this work.
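The execution model just described, top-down selection and bottom-up termination, can be sketched as a single environment step; the function signatures and the random number generator are illustrative stand-ins rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def hierarchical_step(state, active, policies, terminations):
    """One call-and-return step for an N-level hierarchy.

    active          : currently active options [omega^1, ..., omega^{N-1}]
    policies[l]     : callable (state, higher_options) -> option or primitive action
    terminations[l] : callable (state, options_down_to_l) -> termination probability
    """
    n_option_levels = len(active)

    # Bottom up: level l may only terminate if every level below it terminated.
    highest_terminated = None
    for level in reversed(range(n_option_levels)):
        if rng.random() < terminations[level](state, active[:level + 1]):
            highest_terminated = level
        else:
            break  # this level persists, so no more abstract level can terminate

    # Top down: re-select every terminated level, conditioning on the levels above.
    if highest_terminated is not None:
        for level in range(highest_terminated, n_option_levels):
            active[level] = policies[level](state, active[:level])

    # The lowest level policy emits a primitive action given all active options.
    action = policies[n_option_levels](state, active)
    return action, active
```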

The final key component of our system is the set of value functions over the augmented state space. To enable comprehensive reasoning about the policies at each level of the option hierarchy, we need to maintain value functions that consider the state together with every possible combination of active options and actions: $V(s), Q(s, \omega^1), \ldots, Q(s, \omega^{1:N-1}, a)$. These value functions collectively serve as the critic in our analogy to the actor-critic and option-critic training paradigms.

4.1 Generalizing the Option Value Function to N Hierarchical Levels

Like policy gradient methods and option-critic, the hierarchical options framework optimizes directly for the discounted return expected over all trajectories starting at a state $s_0$ with active options $\omega_0^{1:N-1}$:

$\rho(\theta^{1:N}, \vartheta^{1:N-1}, s_0, \omega_0^{1:N-1}) = \mathbb{E}\big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0, \omega_0^{1:N-1}\big]$   (7)

This return depends on the policies and termination functions at every level of abstraction. We now consider the option value function used to reason about an option $\omega^\ell$ at level $\ell$ based on the augmented state space $(s, \omega^{1:\ell-1})$:

(8)

Note that, in order to simplify our notation, we write $\omega^N$ to refer to both abstract and primitive actions. As a result, $\pi^N_{\theta^N}(\omega^N \mid s, \omega^{1:N-1})$ is equivalent to $\pi^N_{\theta^N}(a \mid s, \omega^{1:N-1})$, leveraging the primitive action space $\mathcal{A}$. Extending the meaning of $U$ from (Bacon et al., 2017), we define the corresponding value of executing an option $\omega^\ell$ in the presence of the currently active higher level options $\omega^{1:\ell-1}$ by integrating out the lower level options:

(9)

The hierarchical option value function upon arrival with augmented state $(\omega^{1:N-1}, s')$ is defined as:

(10)

We explain the derivation of this equation in Appendix A.1. (Note that when no options terminate, as in the first term of equation (10), the lowest level option does not terminate and thus no higher level option has the opportunity to terminate.) Finally, before we can extend the policy gradient theorem, we must establish the Markov chain along which we can measure performance for options with $N$ levels of abstraction. This chain is derived in Appendix A.2.

4.2 Generalizing the Intra-option Policy Gradient Theorem

We can think of actor-critic architectures, including their generalization in the option-critic architecture, as pairing each actor network with a critic, so that the critic supplies additional information about the value of the actor's actions that can be used to improve the actor's learning. Formally, the update is derived by taking gradients of the expected discounted return with respect to the parameters of the policy, where the discounted return is approximated by a critic (i.e. a value function) defined over the same augmented state space as the policy being optimized. For example, an actor-critic policy is optimized by differentiating $V(s)$ with respect to its parameters (Sutton et al., 2000), and an option-critic intra-option policy is optimized by differentiating $Q(s, \omega)$ with respect to its parameters (Bacon et al., 2017). The intra-option policy gradient theorem (Bacon et al., 2017) is an important contribution, outlining how to optimize a policy that is also associated with a termination function. Because the policy over options in that work never terminates, it does not need a special training methodology; the option-critic architecture lets the practitioner pick their own method of learning the policy over options, with Q-learning used as an example in their experiments. We do the same for our highest level policy, which also never terminates. For all other policies, we perform a generalization of actor-critic learning by providing a critic at each level and guiding gradients using the appropriate critic.

We now seek to generalize the intra-option policy gradient theorem, deriving the update rule for a policy at an arbitrary level of abstraction $\ell$ by taking the gradient with respect to its parameters $\theta^\ell$ of the value function over the same augmented state space, $Q(s, \omega^{1:\ell-1})$. Substituting from equation (8), we find:

(11)

Theorem 1 (Hierarchical Intra-option Policy Gradient Theorem). Given an $N$ level hierarchical set of Markov options with stochastic intra-option policies differentiable in the parameters $\theta^\ell$ governing each policy $\pi^\ell_{\theta^\ell}$, the gradient of the expected discounted return with respect to $\theta^\ell$ and initial conditions $(s_0, \omega_0^{1:\ell-1})$ is:

$\sum_{s, \omega^{1:\ell-1}} \mu(s, \omega^{1:\ell-1} \mid s_0, \omega_0^{1:\ell-1}) \sum_{\omega^\ell} \frac{\partial \pi^\ell_{\theta^\ell}(\omega^\ell \mid s, \omega^{1:\ell-1})}{\partial \theta^\ell}\, Q(s, \omega^{1:\ell}),$

where $\mu(s, \omega^{1:\ell-1} \mid s_0, \omega_0^{1:\ell-1})$ is a discounted weighting of augmented state tuples along trajectories starting from $(s_0, \omega_0^{1:\ell-1})$. A proof is in Appendix A.3.
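Assuming the reconstruction of Theorem 1 above, the corresponding stochastic update for a tabular softmax policy at level $\ell$ is sketched below; the dictionary-based table layout and learning rate are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_intra_option_update(theta, Q, level, state, options, lr=0.1):
    """One stochastic policy update at a given level, in the spirit of Theorem 1.

    theta[level] : dict mapping (state, omega^{1:l-1}) -> logits over level-l choices
    Q[level]     : dict mapping (state, omega^{1:l})   -> critic estimate
    options      : tuple (omega^1, ..., omega^l) of the active choices down to level l
    """
    context = (state, options[:-1])   # augmented state seen by this policy
    chosen = options[-1]              # the option (or action) it selected
    probs = softmax(theta[level][context])
    grad_log = -probs
    grad_log[chosen] += 1.0
    # Score function weighted by the critic over (s, omega^{1:l}).
    theta[level][context] += lr * grad_log * Q[level][(state, options)]
    return theta
```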

4.3 Generalizing the Termination Gradient Theorem

We now turn our attention to computing gradients for the termination functions $\beta^\ell_{\vartheta^\ell}$ at each level, which are assumed to be stochastic and differentiable with respect to their associated parameters $\vartheta^\ell$:

(12)

Hence, the key quantity is the gradient of $U$. This is a natural consequence of call-and-return execution, where the quality of a termination function can only be evaluated upon entering the next state.

Theorem 2 (Hierarchical Termination Gradient Theorem). Given an $N$ level hierarchical set of Markov options with stochastic termination functions differentiable in the parameters $\vartheta^\ell$ governing each function $\beta^\ell_{\vartheta^\ell}$, the gradient of the expected discounted return with respect to $\vartheta^\ell$ and initial conditions $(s_1, \omega_0^{1:\ell})$ is:

$-\sum_{s', \omega^{1:\ell}} \mu(s', \omega^{1:\ell} \mid s_1, \omega_0^{1:\ell}) \Big[\prod_{j=\ell+1}^{N-1} \beta^j_{\vartheta^j}(s', \omega^{1:j})\Big] \frac{\partial \beta^\ell_{\vartheta^\ell}(s', \omega^{1:\ell})}{\partial \vartheta^\ell}\, A(s', \omega^{1:\ell}),$

where $\mu(s', \omega^{1:\ell} \mid s_1, \omega_0^{1:\ell})$ is a discounted weighting of augmented state tuples along trajectories starting from $(s_1, \omega_0^{1:\ell})$, and $A(s', \omega^{1:\ell})$ is the generalized advantage function over a hierarchical set of options. It compares the advantage of not terminating the current option with a probability weighted expectation based on the likelihood that higher level options also terminate. In (Bacon et al., 2017) this expression was simple, as there was no hierarchy of higher level termination functions to consider. A proof is in Appendix A.4.

It is interesting to see the emergence of an advantage function as a natural consequence of the derivation. As in (Bacon et al., 2017), where this kind of relationship also appears, the advantage function gives the theorem an intuitive interpretation: when the option choice is sub-optimal at level $\ell$ with respect to the expected value of terminating option $\omega^\ell$, the advantage function is negative, which increases the odds of terminating that option. A new concept, not paralleled in the option-critic derivation, is the inclusion of the multiplicative factor $\prod_{j=\ell+1}^{N-1} \beta^j_{\vartheta^j}$. This can be interpreted as discounting the gradients by the likelihood that this termination function is assessed at all, since $\beta^\ell_{\vartheta^\ell}$ is only consulted when all lower level options terminate. This is a natural consequence of multi-level call-and-return execution.
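The two ingredients discussed above, the sign of the advantage and the multiplicative factor contributed by lower level terminations, can be sketched as a stochastic update for a sigmoid termination function; the generalized advantage is treated as a given input computed by the critics, and all names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_termination_update(vartheta, level, context, advantage,
                                    lower_term_probs, lr=0.1):
    """One stochastic termination update at a given level, in the spirit of Theorem 2.

    vartheta[level]  : dict mapping (s', omega^{1:l}) -> termination logit
    advantage        : generalized advantage of continuing option omega^l at s'
    lower_term_probs : termination probabilities of all lower level options; their
                       product gates the gradient, since beta^l is only consulted
                       when every lower level option terminates.
    """
    gate = float(np.prod(lower_term_probs))
    beta = sigmoid(vartheta[level][context])
    # A negative advantage raises the termination probability, and vice versa.
    vartheta[level][context] -= lr * gate * beta * (1.0 - beta) * advantage
    return vartheta
```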

5 Experiments

We now empirically validate the efficacy of our proposed hierarchical option-critic (HOC) model by exploring benchmarks in the tabular and non-linear function approximation settings. In each case we implement an agent that is restricted to primitive actions (i.e. $N = 1$), an agent that leverages the option-critic (OC) architecture (i.e. $N = 2$), and an agent with the HOC architecture using $N = 3$ levels of abstraction. We demonstrate that complex RL problems may be more easily learned using more than two levels of abstraction, and that the HOC architecture can successfully facilitate this kind of learning from data, from scratch.

Figure 3: Learning performance as a function of the abstraction level for a nonstationary four rooms domain where the goal location changes every episode.

For our tabular architectures, we followed the protocol from (Bacon et al., 2017) and chose to parametrize the intra-option policies with softmax distributions and the terminations with sigmoid functions. The policy over options was learned using intra-option Q-learning. We also implemented a primitive actor-critic (AC) agent using a softmax policy. For the non-linear function approximation setting, we trained our agents using A3C (Mnih et al., 2016). Our primitive action agents conduct A3C training using a convolutional network when there is image input, followed by an LSTM to contextualize the state. This ensures that benefits seen from options are orthogonal to those of these common neural network building blocks. We follow (Harb et al., 2017) to extend A3C to the Asynchronous Advantage Option-Critic (A2OC) and Asynchronous Advantage Hierarchical Option-Critic (A2HOC) architectures. We include detailed algorithm descriptions for all of our experiments in Appendix B, along with a summary of our hyperparameter optimization and experimental protocol. In all of our experiments, we made sure that the two-level OC architecture had access to more total options than the three level alternative, and that the three level architecture did not include any additional hyperparameters. This ensures that any empirical gains are the result of increasingly abstract options.
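A sketch of the tabular parametrization described above: softmax logits for the policies below the root and sigmoid logits for the terminations, one table per hierarchy level, plus a separate Q table for the root policy over options. The shapes and initialization are illustrative assumptions.

```python
import numpy as np

def make_tabular_hoc_params(n_states, n_options, n_actions, n_levels, seed=0):
    """Allocate tables for an n_levels hierarchical option-critic agent.

    The root policy over the most abstract options is learned separately
    (e.g. with intra-option Q-learning), as in the tabular experiments.
    """
    rng = np.random.default_rng(seed)
    policy_logits, termination_logits = [], []
    context = 1  # number of higher level options conditioning the current level
    for level in range(n_levels - 1):
        # Intermediate levels choose options; the lowest level chooses actions.
        out = n_options if level < n_levels - 2 else n_actions
        policy_logits.append(0.01 * rng.standard_normal(
            (n_states,) + (n_options,) * context + (out,)))
        termination_logits.append(np.zeros((n_states,) + (n_options,) * context))
        context += 1
    q_root = np.zeros((n_states, n_options))  # critic for the policy over options
    return policy_logits, termination_logits, q_root
```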

5.1 Tabular Learning Challenge Problems

Figure 4: The diagram, from (Kulkarni et al., 2016), details the stochastic decision process challenge problem. The chart compares learning performance across abstract reasoning levels.

Exploring four rooms: We first consider a navigation task in the four-rooms domain (Sutton et al., 1999). Our goal is to evaluate the ability of a set of options learned fully autonomously to produce an efficient exploration policy within the environment. The initial state and the goal state are drawn uniformly from all open non-wall cells at the start of every episode. This setting is highly non-stationary, since the goal changes every episode. Primitive movements can fail with some probability, in which case the agent transitions randomly to one of the empty adjacent cells. The reward is +1 at the goal and 0 otherwise. In Figure 3 we report the average number of steps taken over the last 100 episodes, measured every 100 episodes and averaged over 50 runs with different random seeds for each algorithm. We can clearly see that reasoning with higher levels of abstraction is critical to achieving a good exploration policy, and that reasoning with three levels of abstraction results in more sample efficient learning than reasoning with two. We also explored four levels of abstraction for this experiment, but there appear to be diminishing returns, at least in this tabular setting.

Discrete stochastic decision process: Next, we consider a hierarchical RL challenge problem explored in (Kulkarni et al., 2016): a stochastic decision process where the reward depends on the history of visited states in addition to the current state. There are 6 possible states and the agent always starts at $s_2$. The agent moves left deterministically when it chooses the left action, but the right action succeeds only half of the time, resulting in a left move otherwise. The terminal state is $s_1$, and the agent receives a reward of 1 when it visits $s_6$ before reaching $s_1$. The reward for reaching $s_1$ without first visiting $s_6$ is 0.01. In Figure 4 we report the average reward over the last 100 episodes, measured every 100 episodes over 10 runs with different random seeds for each algorithm. Reasoning with higher levels of abstraction is again critical to performing well at this task with reasonable sample efficiency. Both OC learning and HOC learning converge to a high quality solution, surpassing the performance reported in (Kulkarni et al., 2016). However, learning converges faster with three levels of abstraction than it does with just two.
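A minimal sketch of this decision process as described above; the state indexing and constants follow our reading of Kulkarni et al. (2016) and should be treated as assumptions.

```python
import numpy as np

class StochasticDecisionProcess:
    """Six-state chain: start at s2, terminal at s1, bonus for visiting s6 first."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = 2
        self.visited_s6 = False
        return self.state

    def step(self, action):
        # action 0 = left (deterministic); action 1 = right (succeeds half the time).
        if action == 1 and self.rng.random() < 0.5:
            self.state = min(self.state + 1, 6)
        else:
            self.state = max(self.state - 1, 1)
        self.visited_s6 = self.visited_s6 or self.state == 6
        done = self.state == 1
        reward = (1.0 if self.visited_s6 else 0.01) if done else 0.0
        return self.state, reward, done
```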

5.2 Deep Function Approximation Problems

Figure 5: Building navigation learning performance across abstract reasoning levels.

Multistory building navigation: For an intuitive look at higher level reasoning, we consider the four rooms problem in a partially observed setting, with an 11x17 grid at each level of a seven level building. The agent has a receptive field of size 3 in both directions, so observations are 9-dimensional feature vectors with 0 in empty spots, 1 where there is a wall, 0.25 where there are stairs, and 0.5 at the goal location. The stairwell in the north east corner of each floor leads up to the south west corner of the floor above, and the stairwell in the south west corner leads down to the north east corner of the floor below. Agents start in a random location in the basement (which has no south west stairwell) and must navigate to the roof (which has no north east stairwell) to find the goal in a random location. The reward is +10 for finding the goal and -0.1 for hitting a wall. This task could clearly benefit from abstraction, such as composing sub-policies that reach the stairs on each intermediate floor. We report the rolling mean and standard deviation of the reward. In Figure 5 we see a qualitative difference between the policies learned with three levels of abstraction, which have high variance but fairly often find the goal location, and those learned with less abstraction: A2OC and A3C hover around zero reward, which is equivalent to learning a policy that merely avoids running into walls.

Architecture    Clipped Reward
A3C             8.43 ± 2.29
A2OC            10.56 ± 0.49
A2HOC           13.12 ± 1.46

Table 1: Average clipped reward per episode over 5 runs on 21 Atari games.

Learning many Atari games with one model: We finally consider application of the HOC to the Atari games (Bellemare et al., 2013). Evaluation protocols for the Atari games are famously inconsistent (Machado et al., 2017), so to ensure fair comparisons we implement apples-to-apples versions of our baseline architectures deployed with the same code-base and environment settings. We put our models to the test in a very challenging setting (Sharma et al., 2017) where a single agent attempts to learn many Atari games at the same time. Our agents attempt to learn 21 Atari games simultaneously, matching the largest previous multi-task setting on Atari (Sharma et al., 2017). The tasks are hand-picked to fall into three categories of related games, with 7 games in each category. The first category is games that include maze style navigation (e.g. MsPacman), the second is mostly fully observable shooter games (e.g. SpaceInvaders), and the final category is partially observable shooter games (e.g. BattleZone). We train each agent by always sampling the game with the fewest training frames after each episode, ensuring the games are sampled very evenly throughout training. We also clip rewards to allow for equal learning rates across tasks (Mnih et al., 2015). We train each game for 10 million frames (210 million total) and report statistics on the clipped reward achieved by each agent when evaluating the policy without learning for another 3 million frames per game, across 5 separate training runs. As our main metric, we summarize how well each multi-task agent maximizes its reward in Table 1. While all agents struggle in this difficult setting, HOC is better able to exploit commonalities across games using fewer parameters and policies.
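A sketch of the even-sampling scheme described above: after every episode the next game is the one with the fewest training frames so far, and rewards are clipped before learning. The game names and frame counts are placeholders.

```python
def sample_next_game(frames_per_game):
    """Pick the game with the fewest training frames so far (ties break arbitrarily)."""
    return min(frames_per_game, key=frames_per_game.get)

def clip_reward(r, lo=-1.0, hi=1.0):
    """Clip rewards so that learning rates are comparable across games."""
    return max(lo, min(hi, r))

# Illustrative usage with placeholder game names.
frames_per_game = {"MsPacman": 0, "SpaceInvaders": 0, "BattleZone": 0}
game = sample_next_game(frames_per_game)  # the least-trained game so far
frames_per_game[game] += 1000             # pretend one episode lasted 1000 frames
```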

Analysis of Learned Options: An advantage of the multi-task setting is that it allows a degree of quantitative interpretability regarding when and how options are used. We report characteristics of the agents with median performance during the evaluation period. A2OC with 16 options uses 5 options the bulk of the time, with the rest of the time largely split among another 6 options (Figure 6). The average number of time steps between option switches falls within a fairly narrow range across games, between 3.4 (Solaris) and 5.5 (MsPacman). In contrast, A2HOC with three options at each branch of the hierarchy learns to switch options at a rich range of temporal resolutions depending on the game. The high level options vary between an average of 3.2 (BeamRider) and 9.7 (Tutankham) steps before switching, while the low level options vary between an average of 1.5 (Alien) and 7.8 (Tutankham) steps before switching. In Appendix B.4 we provide additional details about the average duration before switching options for each game. In Figure 7 we can see that the most used options for HOC are distributed fairly evenly across a number of games, while OC tends to specialize its options on a smaller number of games. In fact, the average share of usage dominated by a single game for the top 7 most used options is 40.9% for OC and only 14.7% for HOC. Additionally, we can see that a hierarchy of options imposes structure in the space of options: under a given high level option, the low level options tend to focus on different situations within the same games.

Figure 6: Option usage (left) and specialization across Atari games for the top 9 most used options (right) of a 16 option Option-Critic architecture trained in the many task learning setting.
Figure 7: Option usage (left) and specialization across Atari games (right) of a Hierarchical Option-Critic architecture with $N = 3$ and 3 options at each layer, trained in the many task learning setting.

6 Conclusion

In this work we propose the first policy gradient theorems to optimize an arbitrarily deep hierarchy of options to maximize the expected discounted return. Moreover, we have proposed a particular hierarchical option-critic architecture that is the first general purpose reinforcement learning architecture to successfully learn options from data with more than two abstraction levels. We have conducted extensive empirical evaluation in the tabular and deep non-linear function approximation settings. In all cases we found that, for significantly complex problems, reasoning with more than two levels of abstraction can be beneficial for learning. While the performance of the hierarchical option-critic architecture is impressive, we envision our proposed policy gradient theorems eventually transcending it in overall impact. Although the architectures we explore in this paper have a fixed structure and fixed depth of abstraction for simplicity, the underlying theorems can also guide learning for much more dynamic architectures that we hope to explore in future work.

Acknowledgements

The authors thank Murray Campbell, Xiaoxiao Guo, Ignacio Cases, and Tim Klinger for fruitful discussions that helped shape this work.

References

Appendix A Derivation of Generalized Policy Gradient and Termination Gradient Theorems

A.1 The Derivation of U

To help explain the meaning and derivation of equation (10), we separate the expression into four primary terms. The first term is applicable for all $N$ and represents the expected return from cases where no options terminate. The second term is applicable for $N \geq 2$ and represents the expected return from cases where every option terminates. The third and fourth terms are applicable for $N \geq 3$ and represent the expected return from cases where some, but not all, options terminate.

We will first discuss how to estimate the return when there are no terminated options. In this case we simply use our estimate of the value of the current state, following the current options if there are any. As we are computing an expectation, we also multiply this term by its likelihood of happening, which is equal to the probability that the lowest level option does not terminate. When $N = 1$, we can consider the termination probability of the current policy to be zero and the current option context to be empty; as such, we estimate the value function upon arrival as $V(s')$, as we do for actor-critic policy gradients.

Next we turn our attention to estimating the return when all options are terminated. This can be approximated using our estimate of the return given the state alone, $V(s')$. The likelihood of this happening is equal to the product of the conditional termination likelihoods at every level of abstraction we are modeling. When $N = 2$, equation (10) simplifies to equation (3); this expression is precisely the option value function upon arrival of the option-critic framework derived in [Bacon et al., 2017].

The final quantity we estimate bridges the gap to cases where only some options terminate. This situation has not been explored by other work on option learning, as it only arises in settings with at least three hierarchical levels of planning. The case where some (but not all) options terminate arises when a series of low level options terminate while a higher level option does not. For a given level of abstraction, we can analyze the likelihood that all lower level options terminate while the option at the current level does not. In such a case, we multiply this likelihood by the value function one level more abstract than the current option hierarchy level. For convenience in our derivation, we split our notation for this quantity into two separate terms, accounting explicitly for the case where only lower level options terminate.

A.2 Generalized Markov Chain and Augmented Process

We must establish the Markov chain along which we can measure performance for options with $N$ levels of abstraction. The natural approach is to consider the chain defined in the augmented state space, because tuples of states and active options now play the role of regular states in a usual Markov chain. If the options $\omega_t^{1:N-1}$ have been initiated or are executing at time $t$ in state $s_t$, then the probability of transitioning to $(s_{t+1}, \omega_{t+1}^{1:N-1})$ in one step is:

(13)

where we again write primitive actions as $\omega^N$. Like the Markov chain derived for the option-critic architecture [Bacon et al., 2017], the process given by equation (13) is homogeneous. Additionally, when options are available at every state, the process is ergodic, with a unique stationary distribution over the augmented state space tuples.

We continue by extending the results about augmented processes used to derive learning algorithms in [Bacon et al., 2017] to an option hierarchy with $N$ levels of abstraction. If the options $\omega_t^{1:N-1}$ have been initiated or are executing at time $t$, then the one-step discounted probability of transitioning to $(s_{t+1}, \omega_{t+1}^{1:N-1})$ is:

(14)

As such, when we condition the process on an initial tuple $(s_0, \omega_0^{1:N-1})$, the $k$-step discounted probability of transitioning to $(s_k, \omega_k^{1:N-1})$ is:

(15)

This definition will be very useful later for our derivation of the hierarchical intra-option policy gradient. For the derivation of the hierarchical termination gradient theorem, however, we reformulate the discounted probability of transitioning to an augmented state from the view of the termination function at abstraction level $\ell$, explicitly separating out the terms that depend on $\beta^\ell_{\vartheta^\ell}$:

(16)

The $k$-step discounted probabilities can more generally be expressed recursively: