1 Introduction
Autonomous discovery of meaningful behavioral abstractions in reinforcement learning has proven to be surprisingly elusive. One part of the difficulty perhaps is the fundamental question of why such abstractions are needed, or useful. Indeed, it is often the case that a primitive action policy is sufficient, and the overhead of discovery outweighs the potential speedup advantages. In this work, we adopt the view that abstractions are primarily useful due to their ability to compress information, which yields speedups in learning and planning. For example, it has been argued in the neuroscience literature that behavior hierarchies are optimal if they induce plans of minimum description length across a set of tasks [Solway et al., 2014].
Despite significant interest in the discovery problem, only in recent years have there been algorithms that tackle it with minimal supervision. An important example is the option-critic [Bacon et al., 2017], which focuses on options [Sutton et al., 1999] as the formal framework of temporal abstraction and learns both of the key components of an option, the policy and the termination, end-to-end. Unfortunately, in later stages of training, the options tend to collapse to single-action primitives. This is in part due to using the advantage function as a training objective of the termination condition: an option ought to terminate if another option has better value. This, however, occurs often throughout learning, and can be simply due to noise in value estimation. The follow-up work on option-critic with deliberation cost [Harb et al., 2018] addresses this option collapse by modifying the termination objective to additionally penalize option termination, but it is highly sensitive to the associated cost parameter.
In this work, we take the idea of modifying the termination objective to the extreme, and propose for it to be completely independent of the task reward. Taking the compression perspective, we suggest that this objective should be information-theoretic, and should capture the intuition that it would be useful to have "simple" option encodings that focus termination on a small set of states. We show that such an objective correlates with the planning performance of options for a set of goal-directed tasks.
Our key technical contribution is the manner in which the objective is optimized. We derive a result that relates the gradient of the option transition model to the gradient of the termination condition, allowing one to express objectives in terms of the option model, and optimize them directly via the termination condition. The model hence acts as a "critic", in the sense that it measures the quality of the termination condition in the context of the objective, analogous to how the value function measures the quality of the policy in the context of reward. Using this result, we obtain a novel policy-gradient-style algorithm which learns terminations to optimize our objective, and policies to optimize the reward as usual. The separation of concerns is appealing, since it bypasses the need for sensitive trade-off parameters. We show that the resulting options are non-trivial, and useful for learning and planning.
The paper is organized as follows. After introducing relevant background, we present the termination gradient theorem, which relates the change in the option model to the change in terminations. We then formalize the proposed objective, express it via the option model, and use the termination gradient result to obtain the online actor-critic termination-critic (ACTC) algorithm. Finally, we empirically study the learning dynamics of our algorithm and the relationship of the proposed loss to planning performance, and analyze the resulting options qualitatively and quantitatively.
2 Background and Notation
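We assume the standard reinforcement learning (RL) setting [Sutton and Barto, 2017] of a Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of discrete actions; $p$ is the transition model that specifies the environment dynamics, with $p(s'|s,a)$ denoting the probability of transitioning to state $s'$ upon taking action $a$ in $s$; $r$ the reward function, and $\gamma \in [0,1]$ the scalar discount factor. A policy $\pi$ is a probabilistic mapping from states to actions. For a policy $\pi$, let the matrix $p^\pi$ denote the dynamics of the induced Markov chain:
$$p^\pi(s'|s) = \sum_{a} \pi(a|s)\, p(s'|s,a),$$
and $r^\pi$ the reward expected for each state under $\pi$:
$$r^\pi(s) = \sum_{a} \pi(a|s)\, r(s,a).$$
The goal of an RL agent is to find a policy that produces the highest expected cumulative reward:
$$J(\pi) = \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\Big|\, s_0; \pi\Big], \tag{1}$$
where $R_t$ is the random variable corresponding to the reward received at time $t$, and $s_0$ is the initial state of the process. An optimal policy is one that maximizes $J$. A related instrumental quantity in RL is the action-value (or Q) function, which measures $J$ for a particular state-action pair:
$$Q^\pi(s,a) = \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\Big|\, s_0 = s, a_0 = a; \pi\Big]. \tag{2}$$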
The simplest way to optimize the objective (1) is directly by adjusting the policy parameters $\theta_\pi$ (assuming $\pi$ is differentiable) [Williams, 1992]. The policy gradient (PG) theorem [Sutton et al., 2000] states that:
$$\nabla_{\theta_\pi} J(\pi) = \sum_{s} d^\pi(s) \sum_{a} \nabla_{\theta_\pi} \pi(a|s)\, Q^\pi(s,a),$$
where $d^\pi$ is the stationary distribution induced by $\pi$. Hence, samples of the form $\nabla_{\theta_\pi} \log \pi(a|s)\, Q^\pi(s,a)$ produce an unbiased estimate of the gradient $\nabla_{\theta_\pi} J(\pi)$.
The final remaining question is how to estimate $Q^\pi$. A standard answer relies on the idea of temporal-difference (TD) learning, which itself leverages sampling of the following recursive form of Eq. (2), known as the Bellman Equation [Bellman, 1957]:
$$Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a').$$
Iterating this equation from an arbitrary initial function $Q$ produces the correct $Q^\pi$ in expectation (e.g. [Puterman, 1994]).
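As an illustrative sketch of this last statement (not part of the original experiments), the following tabular example iterates the Bellman equation for a fixed policy. The two-state MDP, the policy, and the helper name `evaluate_q` are hypothetical and chosen only so that the fixed point is easy to check by hand.

```python
def evaluate_q(p, r, pi, gamma, n_iters=500):
    """Iterate the Bellman equation for Q^pi on a tabular MDP.

    p[s][a][s2]: transition probability, r[s][a]: reward,
    pi[s][a]: policy probability, gamma: discount factor.
    """
    n_s, n_a = len(r), len(r[0])
    q = [[0.0] * n_a for _ in range(n_s)]  # arbitrary initial function
    for _ in range(n_iters):
        # State values under pi implied by the current Q estimate.
        v = [sum(pi[s][a] * q[s][a] for a in range(n_a)) for s in range(n_s)]
        q = [[r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return q


# Hypothetical MDP: action 0 stays, action 1 switches state;
# the only reward is 1 for staying in state 1.
p = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0], [1.0, 0.0]]
pi = [[0.0, 1.0], [1.0, 0.0]]  # switch in state 0, stay in state 1
q = evaluate_q(p, r, pi, gamma=0.9)
```

Under these assumptions the policy collects reward 1 forever once it is in state 1, so the iteration should settle at a value of 10 for staying in state 1 and 9 for switching out of state 0.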
2.1 The Options Framework
Options provide the standard formal framework for modeling temporal abstraction in RL [Sutton et al., 1999]. An option $o$ is a tuple $(\mathcal{I}^o, \beta^o, \pi^o)$. Here, $\mathcal{I}^o \subseteq \mathcal{S}$ is the initiation set, from which option $o$ may start (as in other recent work, we take $\mathcal{I}^o = \mathcal{S}$ for simplicity), $\beta^o: \mathcal{S} \to [0,1]$ is the termination condition, with $\beta^o(s)$ denoting the probability of option $o$ terminating in state $s$; and $\pi^o$ is the internal policy of option $o$. As is common, we assume that options eventually terminate:
Assumption 1.
For all options $o$ and states $s$, there exists $s_f$ that is reachable by $\pi^o$ from $s$, s.t. $\beta^o(s_f) > 0$.
Analogously to the one-step MDP reward and transition models $r$ and $p$, options induce semi-MDP [Puterman, 1994] reward and transition models:
$$R^o(s_s) = r^{\pi^o}(s_s) + \gamma \sum_{s} p^{\pi^o}(s|s_s)\,(1-\beta^o(s))\, R^o(s),$$
$$P^o(s_f|s_s) = \gamma \sum_{s} p^{\pi^o}(s|s_s)\,\big[\beta^o(s)\,\mathbb{I}_{s=s_f} + (1-\beta^o(s))\, P^o(s_f|s)\big].$$
That is: the option transition model $P^o$ outputs a sub-probability distribution over states, which, for each $s_f$, captures the discounted probability of reaching $s_f$ from $s_s$ (in any number of steps) and terminating there. In this work, unless otherwise stated, we will use a slightly different formulation, which is undiscounted and shifted backwards by one step:
$$P^o(s_f|s_s) = \beta^o(s_s)\,\mathbb{I}_{s_f=s_s} + (1-\beta^o(s_s)) \sum_{s} p^{\pi^o}(s|s_s)\, P^o(s_f|s), \tag{3}$$
where $\mathbb{I}$ denotes the indicator function. The lack of discounting is convenient because it leads to $P^o$ being a probability (rather than sub-probability) distribution over $s_f$ so long as Assumption 1 holds. (In particular, if we sum over $s_f$ we obtain the Poisson binomial, or sequential independent Bernoulli, probability of one success out of infinite trials, which is equal to 1.) The backwards time shifting replaces the first term $\beta^o(s)$ with $\beta^o(s_s)$, which conveniently considers $\beta^o$ of the same state, and will be important for our derivations.
We will use $\mu$ to denote the policy over options, and define the option-level Q-function as in [Sutton et al., 1999].
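To make the undiscounted, backwards-shifted option model concrete, here is a minimal tabular sketch. It assumes the recursion $P(s_f|s) = \beta(s)\,\mathbb{I}_{s_f=s} + (1-\beta(s))\sum_{s'} p^{\pi}(s'|s)\,P(s_f|s')$ and simply iterates it to a fixed point. The three-state chain, the termination probabilities, and the function name `option_model` are hypothetical.

```python
def option_model(p_pi, beta, n_iters=50):
    """Undiscounted option model P[s_start][s_final], obtained by iterating
    P(sf|s) = beta(s) * 1[sf == s] + (1 - beta(s)) * sum_s2 p_pi(s2|s) * P(sf|s2).
    """
    n = len(beta)
    P = [[0.0] * n for _ in range(n)]
    for _ in range(n_iters):
        P = [[beta[s] * (1.0 if sf == s else 0.0)
              + (1.0 - beta[s]) * sum(p_pi[s][s2] * P[s2][sf] for s2 in range(n))
              for sf in range(n)] for s in range(n)]
    return P


# Hypothetical chain under the option policy: 0 -> 1 -> 2, with state 2 absorbing.
p_pi = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
beta = [0.1, 0.5, 1.0]  # the option always terminates once it reaches state 2
P = option_model(p_pi, beta)
row_sums = [sum(row) for row in P]
```

Because the option is guaranteed to terminate in this example, every row of `P` should sum to one, which is the property that the lack of discounting is meant to provide.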
3 The Termination Gradient Theorem
The starting point for our work is the idea of optimizing the option’s termination condition independently and with respect to a different objective than the policy. How can one easily formulate such objectives? We propose to leverage the relationship of the termination condition $\beta^o$ to the option model $P^o$, which is a distribution over the final states of the option, a quantity that is relevant to many natural objectives. The classical idea of terminating in "bottleneck" states is an example of such an objective.
To this end, we first note that the definition (3) of $P^o$ has a formal similarity to a Bellman equation, in which $P^o$ is akin to a value function and $\beta^o$ influences both the immediate reward and the discount. Hence, we can express the gradient of $P^o$ through the gradient of $\beta^o$. This will allow us to express high-level objectives through $P^o$ and optimize them through $\beta^o$. The following general result formalizes this relationship, while the next section uses it for a particular objective. The theorem concerns a single option $o$, and we write $\theta_\beta$ to denote the parameters of $\beta^o$.
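Theorem 1.
Let $\beta^o$ be parameterized by $\theta_\beta$. The gradient of the option model w.r.t. $\theta_\beta$ is:
$$\nabla_{\theta_\beta} P^o(s_f|s_s) = \sum_{s} P^o(s|s_s)\, \nabla_{\theta_\beta} \log \beta^o(s)\; r^{s_f}_o(s),$$
where the "reward" term is given by:
$$r^{s_f}_o(s) = \frac{\mathbb{I}_{s_f=s} - P^o(s_f|s)}{1-\beta^o(s)}.$$
The proof is given in appendix. In the following, we will drop the superscript $o$ and simply write $P$ and $\beta$. Note that in the particular (common) case when $\beta$ is parameterized with a sigmoid function, the expression takes the following simple shape:
$$\nabla_{\theta_\beta} P(s_f|s_s) = \sum_{s} P(s|s_s)\, \nabla_{\theta_\beta} \ell_\beta(s)\, \big(\mathbb{I}_{s_f=s} - P(s_f|s)\big),$$
where $\ell_\beta(s)$ denotes the logit of $\beta(s)$. The pseudo-reward has an intuitive interpretation. In general, it is always positive at $s_f$ (and so $\beta(s_f)$ should always increase to maximize $P(s_f|s_s)$), and negative at all other states $s \neq s_f$ (and so $\beta(s)$ should always decrease to maximize $P(s_f|s_s)$). The amount of the change depends on the dynamics $P$.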
In terminating states $s_f$, if $P(s_f|s_f)$ (the likelihood of terminating in $s_f$ immediately or upon a later return) is low, then the change in $\beta(s_f)$ is high: an immediate termination needs to occur in order for $P(s_f|s_s)$ to be high. If $P(s_f|s_f)$ is high, then $P(s_f|s_s)$ may be high without needing $\beta(s_f)$ to be high. (E.g. if there is a single terminating state that is guaranteed to be reached, any $\beta(s_f) > 0$ will do.)

In non-terminating states $s \neq s_f$, the higher $P(s_f|s)$, or the likelier it is for the desired terminating state $s_f$ to be reached from $s$, the more the termination probability at $s$ is reduced, to ensure that this happens. If $s_f$ is not reachable from $s$, no change occurs at $s$ to maximize $P(s_f|s_s)$.
The theorem can be used as a general tool to optimize any differentiable objective of $P^o$. In the next section we propose and justify one such objective, and derive the complete algorithm.
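The sigmoid form of the theorem can be sanity-checked numerically. The sketch below assumes tabular logits (one parameter per state), so that the claimed gradient of $P(s_f|s_s)$ with respect to the logit at state $s$ reduces to $P(s|s_s)\,(\mathbb{I}_{s_f=s} - P(s_f|s))$, and compares it against central finite differences. The transition matrix, logits, and helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def option_model(p_pi, beta, n_iters=400):
    """Iterate P(sf|s) = beta(s) 1[sf == s] + (1 - beta(s)) sum_s2 p_pi(s2|s) P(sf|s2)."""
    n = len(beta)
    P = [[0.0] * n for _ in range(n)]
    for _ in range(n_iters):
        P = [[beta[s] * (1.0 if sf == s else 0.0)
              + (1.0 - beta[s]) * sum(p_pi[s][s2] * P[s2][sf] for s2 in range(n))
              for sf in range(n)] for s in range(n)]
    return P


def termination_gradient(P, s_start, s_final, s):
    """Claimed gradient of P(s_final|s_start) w.r.t. the termination logit at s."""
    indicator = 1.0 if s == s_final else 0.0
    return P[s_start][s] * (indicator - P[s][s_final])


# Hypothetical option-policy dynamics and termination logits.
p_pi = [[0.2, 0.8, 0.0],
        [0.0, 0.3, 0.7],
        [0.5, 0.0, 0.5]]
logits = [-1.0, 0.0, 1.0]


def model_from_logits(values):
    return option_model(p_pi, [sigmoid(x) for x in values])


P = model_from_logits(logits)
eps = 1e-5
max_err = 0.0
for s in range(3):
    up = list(logits)
    up[s] += eps
    down = list(logits)
    down[s] -= eps
    P_up, P_down = model_from_logits(up), model_from_logits(down)
    for s_start in range(3):
        for s_final in range(3):
            finite_diff = (P_up[s_start][s_final] - P_down[s_start][s_final]) / (2 * eps)
            analytic = termination_gradient(P, s_start, s_final, s)
            max_err = max(max_err, abs(finite_diff - analytic))
```

If the analytic expression is correct, `max_err` should be limited only by finite-difference and round-off error.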
4 The Termination Critic
We now propose the specific idea for the actor-critic termination critic (ACTC) algorithm. We first formulate our objective, then use Theorem 1 to derive the gradient of this objective as a function of the gradient of $\beta$. Finally, we formulate the online algorithm that estimates all of the necessary quantities from samples.
4.1 Predictability Objective
We postulate that desirable options are "targeted" and have small terminating regions. This can be expressed as having a low entropy distribution of final states. We hence propose to explicitly minimize this entropy as an objective:
$$J(P^o) = H(X_f \mid o), \tag{4}$$
where $H$ denotes entropy, and $X_f$ is the random variable denoting a terminating state. We call this objective predictability, as it measures how predictable an option’s outcome (final state) is. Note that the entropy is marginalized over all starting states. The marginalization allows for consistency: without it, the objective can be satisfied for each starting state individually without requiring the terminating states to be the same. We will later show that this objective indeed correlates with planning performance (Section 6.3).
In the exact, discrete setting the objective (4) is minimized when an option terminates at a single state $s_f$. Note that this is the case irrespective of the choice of $s_f$. The resulting $s_f$ will hence be determined by the learning dynamics of the particular algorithm. We will show later empirically that the gradient algorithm we derive is attracted to the states $s_f$ that are most visited, as measured by the marginal option model $P^o_\mu(s_f)$ (Section 6.1).
The objective is reminiscent of other recent information-theoretic option-discovery approaches (e.g. [Gregor et al., 2016, Florensa et al., 2017, Hausman et al., 2018]), but is focused on the shape of each option in isolation. Similar to those works, one may wish to explicitly optimize for diversity between options as well, so as to encourage specialization. This can be attained for example by maximizing the complete mutual information $I(X_f; O)$ as the objective (rather than only its second term). In this work we choose to keep the objective minimalistic, but will observe some diversity occur under a set of tasks, due to the sampling procedure, so long as the policy over options is non-trivial.
The manner in which we propose to optimize this objective is novel and entirely different from existing work. We leverage the analytical gradient relationship derived in Theorem 1, and so instead of estimating the entropy term directly, we will express it through the option model, and estimate this option model. The following proposition expresses criterion (4) via the option model.
Proposition 1.
Let $\mu$ denote the policy over options, and let $d_\mu(\cdot|o)$ be option $o$’s starting distribution induced by $\mu$. Let $P^o_\mu(s_f) = \sum_{s_s} d_\mu(s_s|o)\, P^o(s_f|s_s)$. The criterion (4) can be expressed via the option model as follows:
$$J(P^o) = -\,\mathbb{E}_{s_s \sim d_\mu(\cdot|o)}\Big[\sum_{s_f} P^o(s_f|s_s)\, \log P^o_\mu(s_f)\Big].$$
The proof is mainly notational and given in appendix. This proposition implies that for a particular start state $s_s$, our global entropy loss can be interpreted as a cross-entropy loss between the distributions of final states from a particular state and of final states from all start states.
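As a small illustration of this reading, the sketch below computes the loss in its cross-entropy form and checks that it coincides with the entropy of the marginal distribution over final states. It assumes the marginal is the start-state-weighted average of the option model rows; the option models, the start distribution, and all function names are hypothetical.

```python
import math


def marginal_final_states(P, d_start):
    """P_mu(sf) = sum_ss d(ss) * P(sf|ss)."""
    n = len(d_start)
    return [sum(d_start[ss] * P[ss][sf] for ss in range(n)) for sf in range(n)]


def predictability_loss(P, d_start):
    """Cross-entropy form: -E_{ss ~ d}[ sum_sf P(sf|ss) * log P_mu(sf) ]."""
    P_mu = marginal_final_states(P, d_start)
    loss = 0.0
    for ss, weight in enumerate(d_start):
        for sf, prob in enumerate(P[ss]):
            if weight > 0.0 and prob > 0.0:  # skip zero-probability terms
                loss -= weight * prob * math.log(P_mu[sf])
    return loss


def entropy(dist):
    return -sum(q * math.log(q) for q in dist if q > 0.0)


d_start = [0.5, 0.5, 0.0]
# Hypothetical option that spreads its terminations over several states ...
P_spread = [[0.1, 0.45, 0.45], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
# ... versus one that always terminates in state 2.
P_targeted = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

loss_spread = predictability_loss(P_spread, d_start)
loss_targeted = predictability_loss(P_targeted, d_start)
```

The targeted option attains the minimum loss of zero, while the spread-out option is penalized; this matches the intuition that "targeted" options have low-entropy final-state distributions.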
A note on Assumption 1.
Finally, we note that the entropy expression is only meaningful if $P^o$ is a probability distribution, otherwise it is for example minimized by an all-zero $P^o$. In turn, $P^o$ is a distribution if Assumption 1 holds, but if the option components are being learned, this may be tricky to uphold, and a trivial solution may be attractive. We will see empirically that the algorithm we derive does not collapse to this trivial solution. However, a general way of ensuring adherence to Assumption 1 remains an open problem.
4.2 The Predictability Gradient
We are now ready to express the gradient of the overall objective from Proposition 1 in terms of the termination parameters $\theta_\beta$. This result is analogous to the policy gradient theorem, and similarly, for the gradient to be easy to sample, we need a certain distribution to be independent of the termination condition.
Assumption 2.
The distribution $d_\mu(\cdot|o)$ over the starting states of an option $o$ under policy $\mu$ is independent of its termination condition $\beta^o$.
This is satisfied for example by any fixed policy $\mu$. Of course, in general $\mu$ often depends on the value function over options, which in turn depends on $\beta^o$, but one may for example apply two-timescale optimization to make $\mu$ appear quasi-fixed (e.g. [Borkar, 1997]). Our experiments show that even when not doing this explicitly, the online algorithm converges reliably. The following theorem derives the gradient of the objective (4) in terms of the gradient of $\beta^o$.
Theorem 2.
Let Assumption 2 hold. We have that:
The proof is a fairly straightforward differentiation, and is given in appendix. The loss consists of two terms. The termination advantage measures how likely option $o$ is to reach and terminate in state $s$, as compared to other alternatives $s_f$. The trajectory advantage measures the desirability of state $s$ in context of the starting state $s_s$: if $s$ is a likely termination state in general, but not for the particular $s_s$, this term will account for it. Appendix D studies the effects of these two terms on learning dynamics in isolation.
Now, we would like to derive a sample-based algorithm that produces unbiased estimates of the derived gradient. Substituting the expression for the gradient of the logits, we can rewrite the overall expression as follows:
where the two expectations are w.r.t. the distributions written out in Theorem 2. We will sample these expectations from trajectories of the form $\tau = (s_s, \ldots, s_f)$, and base our updates on the following corollary.
Corollary 1.
Consider a sample trajectory $\tau$, and let $\beta$ be parameterized by a sigmoid function. For a state $s \in \tau$, the following gradient is an unbiased estimate of $\nabla_{\theta_\beta} J(P^o)$:
(5)  
where denotes the logit of at state , and , are samples from the corresponding advantages for a particular .
The factor in Eq. (5) is akin to an importance correction necessary due to not actually having terminated at (and hence not having sampled ). Note that the state itself never gets updated directly, because the sampled advantages for it are zero (although the underlying state still may get updates when not sampled as final). The resulting magnitude of termination values at chosen final states hence depends on the initialization. To remove the dependence, we will deploy a baseline in the complete algorithm.
5 Algorithm
We now give our algorithm based on Corollary 1. In order to do so, we need to simultaneously estimate the transition model $P^o$. Because for a given final state, it is simply a value function, we can readily do this with temporal difference learning. Furthermore, we need to estimate $P^o_\mu$, which we do simply as an empirical average of $P^o(\cdot|s_s)$ over all experienced $s_s$. Finally, we deploy a per-option baseline that tracks the empirical average update and gets subtracted from the updates at all states. The complete end-to-end algorithm is summarized in Algorithm 1. Similarly to e.g. A3C [Mnih et al., 2016], our algorithm is online up to the trajectory length. One can potentially be fully online by, instead of relying on the sampled terminating state $s_f$, "bootstrapping" with the value of $P^o$, but this requires an accurate estimate of $P^o$.
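To illustrate the first of these estimation steps, the following is a minimal tabular sketch of learning an option model from sampled trajectories with a TD-style update. It is not Algorithm 1: it assumes the undiscounted recursion $P(s_f|s) = \beta(s)\,\mathbb{I}_{s_f=s} + (1-\beta(s))\sum_{s'} p^{\pi}(s'|s)\,P(s_f|s')$, a uniform start distribution, and a $1/n$ step size, and the environment and the name `td_option_model` are hypothetical.

```python
import random


def td_option_model(p_pi, beta, n_episodes=3000, seed=0):
    """Estimate P[s][s_final] from sampled option trajectories.

    At each visited state s the target is the one-hot vector of s if the
    option terminates there, and the current estimate at the next state
    otherwise (bootstrapping), mirroring the assumed recursion.
    """
    rng = random.Random(seed)
    n = len(beta)
    # Start from "terminate immediately", so that every row is a distribution.
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    visits = [0] * n
    for _ in range(n_episodes):
        s = rng.randrange(n)  # hypothetical uniform start distribution
        while True:
            terminated = rng.random() < beta[s]
            if terminated:
                target = [1.0 if k == s else 0.0 for k in range(n)]
            else:
                s_next = rng.choices(range(n), weights=p_pi[s])[0]
                target = list(P[s_next])  # bootstrap from the next state's row
            visits[s] += 1
            step = 1.0 / visits[s]
            for k in range(n):
                P[s][k] += step * (target[k] - P[s][k])
            if terminated:
                break
            s = s_next
    return P


# Hypothetical dynamics: state 0 stays or advances, state 1 advances to state 2,
# and the option always terminates in state 2.
p_pi = [[0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
beta = [0.1, 0.5, 1.0]
P_hat = td_option_model(p_pi, beta)
```

Since every target is itself a probability vector, each row of the estimate remains a distribution throughout learning. The marginal over final states can then be tracked as a running average of the rows of `P_hat` at the experienced start states.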
6 Experiments
We will now empirically support the following claims:

Our algorithm directs termination into a small number of frequently visited states;

The resulting options are intuitively appealing and improve learning performance; and

The predictability objective is related to planning performance and the resulting options improve both the objective and planning performance.
We will evaluate the latter two in the classical Four Rooms domain [Sutton et al., 1999] (see Figure 6 in appendix). But before we begin, let us consider the learning dynamics of the algorithm on a small example.
6.1 Learning Dynamics
The predictability criterion we have proposed is minimized by any termination scheme that deterministically terminates in a single state. What solution, then, do we expect our algorithm to find? We hypothesize that the learning dynamics favor states that are visited more frequently, that is, states that are more reachable as measured by $P^o_\mu$, up to the initial imbalance of $\beta$. In this section we validate this hypothesis on a small example.
Consider the MDP in Fig. 1. The green state is a potential attractor, with its attraction value being determined by a parameter . Let the initial . In this experiment, we investigate the resulting solution as a function of the initial and . Following the intuition above, we expect the algorithm to increase more when these values are higher. We compare the results with another way of achieving a similar outcome, namely by simply using the marginal value as the objective ("naive reachability"). As expected, we find that the full predictability objective is necessary for concentrating in a single most-visited state. See Fig. 2.
We further performed an ablation on the two advantage terms that comprise our loss (the reachability advantage and the trajectory advantage); these results are given in Figure 8. We find that neither term in isolation is sufficient, or as effective as the full objective.
6.2 Option Discovery
We now ask how well the termination critic and the predictability objective are suited as a method for end-to-end option discovery. As discussed, option discovery remains a challenging problem and end-to-end algorithms, such as the option-critic, often have difficulties learning non-trivial or intuitively appealing results.
We evaluate both ACTC and option-critic with deliberation cost (A2OC) on the Four Rooms domain using visual inputs (a top-down view of the maze passed through a convolutional neural network). The basic task is to navigate to a goal location, with zero per-step reward and a reward given only at the goal. The goal changes randomly between a fixed set of eight locations every 20 episodes, with the option components being oblivious to the location, but the option-level value function being able to observe the location of the goal.
6.2.1 Architecture
We build on the neural network architecture of A2OC with deliberation cost [Harb et al., 2018], which in turn is almost identical to the network for A3C [Mnih et al., 2016]. Specifically, the A3C network outputs a policy and action-value function , whereas our network outputs , and given the input state. We use an additional network that takes in two input states, and , and outputs , and , as well as some useful auxiliary predictions: a marginal and the update baseline . This network uses a convolutional network with identical structure as used in A3C, that feeds into a single shared fully-connected hidden layer, followed by a specialized fully-connected layer for each of the outputs.
6.2.2 Learning and Planning in Four Rooms
We first depict the options qualitatively with an example termination profile shown in Figure 3. We see that ACTC leads to tightly concentrated regions with high termination probability and low probability elsewhere, whereas A2OC even with deliberation cost tends to converge to trivial termination solutions. Although ACTC does not always converge to terminating in a single region, it leads to distinct options with characteristic behavior and termination profiles.
Next, in Figure 4 we compare the online learning performance between ACTC and A2OC with deliberation cost. The traces indicate separate hyperparameter settings and seeds for each algorithm and the bold line gives their average. ACTC enjoys better performance throughout learning.
6.3 Correlation with Planning Performance
Finally, we investigate the claim that more directed termination leads to improved planning performance. To this end, we generate various sets of goal-directed options in the Four Rooms domain by systematically varying the option-policy goal location and concentration of termination probability around the goal location. We evaluate these options, combined with primitive actions, by averaging the policy value during ten iterations of value iteration and all possible goal locations (see appendix for more details).
We compare this average policy value as a function of the predictability objective of each set of options in Figure 5. "Hallways" corresponds to one option for each hallway, "Centers" to the centers of each room, "Random Goals" to each option selecting a unique random goal location, and "Random Options" to both the policies and termination probabilities being uniformly random. We observe, as has previously been reported, that even random options improve planning over primitive actions alone (shown by the dashed line). Additionally, we confirm that generally as the predictability objective increases towards zero the average policy value increases, corresponding to faster convergence of value iteration.
Finally, we plot (with square markers) the performance of options learned from the previous section using ACTC and A2OC with deliberation cost. Due to A2OC’s option collapse, its advantage over primitive actions is small, while ACTC performs similarly to the better (more deterministic) random goal options.
7 Related Work
The idea of decomposing behavior into reusable components or abstractions has a long history in the reinforcement learning literature. One question that remains largely unanswered, however, is that of suitable criteria for identifying such abstractions. The option framework itself [Sutton et al., 1999, Bacon et al., 2017] provides a computational model that allows the implementation of temporal abstractions but does not in itself provide an objective for option induction. This is addressed partially in [Harb et al., 2018] where long-lasting options are explicitly encouraged.
A popular means of encouraging specialization is via different forms of information hiding, either by shielding part of the policy from the task goal (e.g. [Heess et al., 2016]) or from the task reward (e.g. [Vezhnevets et al., 2017]). [Frans et al., 2018] combine information hiding with meta-learning to learn options that are easy to reuse across tasks.
Information-theoretic regularization terms that encourage mutual information between the option and features of the resulting trajectory (such as the final state) have been used to induce diverse options in an unsupervised or mixed setting (e.g. [Gregor et al., 2016, Florensa et al., 2017, Eysenbach et al., 2019]), and they can be combined with information hiding (e.g. [Hausman et al., 2018]). Similar to our objective they can be seen to encourage predictable outcomes of options. [Co-Reyes et al., 2018] have recently proposed a model that directly encourages the predictability of trajectories associated with continuous option embeddings to facilitate mixed model-based and model-free control.
Unlike our work, approaches described above consider options of fixed, predefined length. The problem of learning option boundaries has received some attention in the context of learning behavioral representations from demonstrations (e.g. [Daniel et al., 2016, Lioutikov et al., 2015, Fox et al., 2017, Krishnan et al., 2017]). These approaches effectively learn a probabilistic model of trajectories and can thus be seen to perform trajectory compression.
Finally, another class of approaches that is related to our overall intuition is one that seeks to identify "bottleneck" states and construct goal-directed options to reach them. There are many definitions of bottlenecks, but they are generally understood to be states that connect different parts of an environment, and are hence visited more often by successful trajectories. Such states can be identified through heuristics related to the pattern of state visitation [McGovern and Barto, 2001, Stolle and Precup, 2003] or by looking at betweenness centrality measures on the state-transition graphs. Our objective is based on a similar motivation of finding a small number of states that give rise to a compressed high-level decision problem which is easy to solve. However, our algorithm is very different (and cheaper computationally).
8 Discussion
We have presented a novel option-discovery criterion that uses predictability as a means of regularizing the complexity of behavior learned by individual options. We have applied it to learning meaningful termination conditions for options, and have demonstrated its ability to induce options that are non-trivial and useful.
Optimization of the criterion is achieved by a novel policy gradient formulation that relies on learned option models. In our implementation we choose to decouple the reward optimization from the problem of learning where to terminate. This particular choice allowed us to study the effects of meaningful termination in isolation. We saw that even if the option policies optimize the same reward objective, nontrivial terminations prevent option collapse onto the same policy.
This work has focused entirely on goal-directed options, whose purpose is to reach a certain part of the state space. There is another class, often referred to as skills, which are in a sense complementary, aiming to abstract behaviors that apply anywhere in the state space. One exciting direction for future work is to study the relation between the two and to design option induction criteria that can interpolate between different regimes.
References

[Bacon et al., 2017] Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Proceedings of 31st AAAI Conference on Artificial Intelligence (AAAI-17), pages 1726–1734.
[Bellman, 1957] Bellman, R. (1957). Dynamic programming. Princeton University Press.
[Borkar, 1997] Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294.
[Co-Reyes et al., 2018] Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and Levine, S. (2018). Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings. In Proceedings of the 35th International Conference on Machine Learning (ICML-18).
[Daniel et al., 2016] Daniel, C., van Hoof, H., Peters, J., and Neumann, G. (2016). Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue, 104(2):337–357.
[Eysenbach et al., 2019] Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. (2019). Diversity is all you need: Learning skills without a reward function. In Proceedings of the 7th International Conference on Learning Representations (ICLR-19).
[Florensa et al., 2017] Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR-17).
[Fox et al., 2017] Fox, R., Krishnan, S., Stoica, I., and Goldberg, K. (2017). Multi-level discovery of deep options. CoRR, abs/1703.08294.
[Frans et al., 2018] Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. (2018). Meta learning shared hierarchies. In Proceedings of the 6th International Conference on Learning Representations (ICLR-18).
[Gregor et al., 2016] Gregor, K., Rezende, D. J., and Wierstra, D. (2016). Variational intrinsic control. CoRR, abs/1611.07507.
[Harb et al., 2018] Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2018). When waiting is not an option: Learning options with a deliberation cost. In Proceedings of 32nd AAAI Conference on Artificial Intelligence (AAAI-18).
[Hausman et al., 2018] Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. (2018). Learning an embedding space for transferable robot skills. In Proceedings of the 6th International Conference on Learning Representations (ICLR-18).
[Heess et al., 2016] Heess, N., Wayne, G., Tassa, Y., Lillicrap, T. P., Riedmiller, M. A., and Silver, D. (2016). Learning and transfer of modulated locomotor controllers. CoRR, abs/1610.05182.
[Krishnan et al., 2017] Krishnan, S., Fox, R., Stoica, I., and Goldberg, K. (2017). DDCO: Discovery of deep continuous options for robot learning from demonstrations. In Proceedings of the 1st Conference on Robot Learning (CoRL-17), pages 418–437.
[Lioutikov et al., 2015] Lioutikov, R., Neumann, G., Maeda, G., and Peters, J. (2015). Probabilistic segmentation applied to an assembly task. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 533–540.
[McGovern and Barto, 2001] McGovern, A. and Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning (ICML-01).
[Mnih et al., 2016] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), pages 1928–1937.
[Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, USA, 1st edition.
[Solway et al., 2014] Solway, A., Diuk, C., Córdova, N., Yee, D., Barto, A. G., Niv, Y., and Botvinick, M. M. (2014). Optimal behavioral hierarchy. PLoS Computational Biology, 10(8):e1003779.
[Stolle and Precup, 2003] Stolle, M. and Precup, D. (2003). Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pages 212–223.
[Sutton and Barto, 2017] Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition.
[Sutton et al., 2000] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063.
[Sutton et al., 1999] Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.
[Vezhnevets et al., 2017] Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML-17).
[Williams, 1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
Appendix A The Option Transition Process
It will be convenient to consider the option transition process:
We can then rewrite from (3) as:
(6) 
Appendix B Omitted Proofs
B.1 Proof of Theorem 1
Proof.
We have:
(7) 
And so what we have is a discounted value function, whose reward is , where
Now, from Eq. (3) and if , we have:
(8) 
Using this notation, and recalling the transition process from Eq. (6), we can rewrite (7) as:
Where the third equality follows from (6) and requires for to not be .
∎
B.2 Proof of Proposition 1
Proof.
Let denote the probability of a state being terminal for an option . By definition of entropy we have:
∎
B.3 Proof of Theorem 2
Proof.
Sampling the highlighted expectations, and noting that if are the logits of ,
we have our result.
∎
Appendix C Correlation with Planning Performance
The policies considered in these experiments consist of some set of four options combined with the set of primitive actions. Planning performance, for a single goal-directed task, is evaluated as the average policy value over all states at the end of each of ten iterations of value iteration. Consider Figure 7, which shows the value iteration performance curve for a single task, comparing policies of primitive actions, options, and their combination. The planning performance is the average of this curve for ten iterations, further averaged over all possible goal-directed tasks in Four Rooms. This measures how quickly value iteration, using this set of option policies and terminations, is able to plan.
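A rough sketch of this metric for a single task is given below. It is a simplification under stated assumptions: each primitive action or option is supplied as a discounted reward and transition model, and the value-iteration estimate itself is used as a stand-in for the policy value. The four-state corridor, the "run right to the goal" option, and the name `planning_performance` are hypothetical and unrelated to the Four Rooms setup.

```python
def planning_performance(models, n_states, n_iters=10):
    """Average over iterations of the state-averaged value during value iteration.

    models: list of (R, P) pairs, one per primitive action or option, where
    R[s] is the expected discounted reward and P[s][s2] the discounted
    transition model (the discount is folded into P).
    """
    v = [0.0] * n_states
    curve = []
    for _ in range(n_iters):
        v = [max(R[s] + sum(P[s][s2] * v[s2] for s2 in range(n_states))
                 for R, P in models)
             for s in range(n_states)]
        curve.append(sum(v) / n_states)
    return sum(curve) / n_iters


gamma = 0.9
n = 4  # corridor with states 0..3 and an absorbing goal at state 3


def scaled_one_hot(j, scale):
    return [scale if k == j else 0.0 for k in range(n)]


# Primitive actions; reward 1 is received on entering the goal from state 2.
P_left = [scaled_one_hot(max(s - 1, 0) if s < 3 else 3, gamma) for s in range(n)]
R_left = [0.0] * n
P_right = [scaled_one_hot(min(s + 1, 3), gamma) for s in range(n)]
R_right = [0.0, 0.0, 1.0, 0.0]
# Hypothetical option: keep moving right and terminate at the goal (3 - s steps).
R_opt = [gamma ** (2 - s) if s < 3 else 0.0 for s in range(n)]
P_opt = [scaled_one_hot(3, gamma ** max(3 - s, 1)) for s in range(n)]

primitives = [(R_left, P_left), (R_right, P_right)]
perf_primitives = planning_performance(primitives, n)
perf_with_option = planning_performance(primitives + [(R_opt, P_opt)], n)
```

In this toy corridor the goal-directed option lets value iteration reach its fixed point in a single sweep, so the averaged curve is higher than with primitive actions alone.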
Appendix D Learning Dynamics
Fig. 8 further studies the learning dynamics induced by the different components of the algorithm. We compare the previous two variants from Fig. 2 with only including the reachability advantage term (Row 3), and only including the trajectory advantage term (Row 4). The former does not focus on a single state, while the latter does not concentrate at all for many values of .