1 Introduction
Hierarchical reinforcement learning methods enable agents to tackle challenging problems by identifying reusable skills (temporally extended actions) that simplify the task. For example, a robot agent that tries to learn to play chess by reasoning solely at the level of how much current to give to its actuators every 20 ms will struggle to correlate obtained rewards with their true underlying cause. However, if this same agent first learns skills to move its arm, grasp a chess piece, and move a chess piece, then the task of learning to play chess (leveraging these skills) becomes tractable. Several mathematical frameworks for hierarchical reinforcement learning have been proposed, including hierarchies of machines [Parr and Russell1998], MAXQ [Dietterich2000], and the options framework [Sutton, Precup, and Singh1999]. However, none of these frameworks provides a practical mechanism for skill discovery: determining what skills will be useful for an agent to learn. Although skill discovery methods have been proposed, they tend to be heuristic: they find skills with a property that intuitively might make for good skills on some problems, but that does not follow directly from the primary objective of optimizing the expected discounted return [Machado, Bellemare, and Bowling2017; Simsek and Barto2008; Thrun and Schwartz1995; Konidaris and Barto2009].
The option-critic architecture [Bacon, Harb, and Precup2017] stands out from other attempts at developing a general framework for skill discovery in that it searches for the skills that directly optimize the expected discounted return. Specifically, the option critic uses the aforementioned options framework, wherein a skill is called an option, and it proposes parameterizing all aspects of the option and then performing stochastic gradient descent on the expected discounted return with respect to these parameters. The key insight that enables the option-critic architecture is a set of theorems that give expressions for the gradient of the expected discounted return with respect to the different parameters of an option.
One limitation of the option critic is that it uses ordinary (stochastic) gradient descent. In this paper we show how the option critic can be extended to use natural gradient descent [Amari1998], which exploits the underlying structure of the option-parameter space to produce a more informed update direction. The primary contributions of this work are theoretical: we define the natural gradients associated with the option critic, derive the Fisher information matrices associated with an option’s parameterized policy and termination function, and show how the natural gradients can be estimated with per-time-step time and space complexity linear in the total number of parameters. This is achieved by means of compatible function approximations. We also analyze the performance of the natural-gradient-based approach on various learning tasks.
2 Preliminaries and Notation
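A reinforcement learning (RL) agent interacts with an environment, modeled as a Markov decision process (MDP), over a sequence of time steps $t \in \{0, 1, 2, \dots\}$. A finite MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$. $\mathcal{S}$ is the finite set of possible states of the environment. $S_t$ is the state of the environment at time $t$. $\mathcal{A}$ is the finite set of possible actions the agent can take. $A_t$ is the action taken by the agent at time $t$. $P$ is the transition function: $P(s, a, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, for all $t$. Meaning, $P(s, a, s')$ is the probability of transitioning to state $s'$ given the agent takes action $a$ in state $s$. $R_t$ denotes the reward at time $t$. $R$ is the reward function, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $R(s, a) = \mathbb{E}[R_t \mid S_t = s, A_t = a]$, i.e., the expected reward the agent receives given it took action $a$ in state $s$. We say that a process has ended when the environment enters a terminal state, meaning that, for a terminal state $s$, $P(s, a, s) = 1$ and $R(s, a) = 0$ for all $a \in \mathcal{A}$. The process ends after $T$ steps and we call $T$ the horizon. We say the process is infinite horizon when there does not exist a finite $T$. $d_0$ is the initial state distribution, i.e., $d_0(s) = \Pr(S_0 = s)$. The parameter $\gamma \in [0, 1]$ scales how the rewards are discounted over time. When a terminal state is reached, time is reset to $t = 0$ and consequently a new initial state is sampled using $d_0$.
A policy, $\pi$, represents the agent’s decision making system: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$. Given a policy, $\pi$, and an MDP, $M$, an episode is a sequence of states of the environment, actions taken by the agent, and the rewards observed from the initial state, $S_0$, to the terminal state, $S_T$, i.e., $(S_0, A_0, R_0, S_1, A_1, R_1, \dots, S_T)$. We also define the path that an agent takes to be a sequence of states and actions, i.e., a history without rewards, $H = (S_0, A_0, S_1, A_1, \dots, S_T)$. Path $H$ is a random variable from the set of all possible paths, $\mathcal{H}$. The return of an episode is the discounted sum of all rewards, $\sum_{t=0}^{T} \gamma^t R_t$. We call $V^\pi$ the value function for the policy $\pi$, $V^\pi : \mathcal{S} \to \mathbb{R}$, where $V^\pi(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, \pi\right]$. We call $Q^\pi$ the action-value function associated with policy $\pi$, $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $Q^\pi(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t = s, A_t = a, \pi\right]$.
2.1 Policy Gradient Framework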
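The policy gradient framework [Sutton et al.1999; Konda and Tsitsiklis2000] assumes the policy $\pi$, parametrized by $\theta$, is differentiable. The objective function, $J$, is defined with respect to a start state $s_0$, $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s_0, \theta\right]$. The agent learns by updating the parameters in a direction approximately proportional to the gradient $\partial J(\theta) / \partial \theta$, i.e., $\theta \leftarrow \theta + \alpha \, \partial J(\theta) / \partial \theta$, where $\alpha$ is the learning rate (LR): a scalar hyperparameter.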
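The update rule above can be illustrated on a deliberately tiny problem. The following is a minimal sketch, assuming a single start state, a softmax policy with one logit per action, and known expected rewards; the reward values, step size, and function names are hypothetical and chosen only for illustration, and the gradient is computed exactly rather than estimated from samples.

```python
import math

def softmax(theta):
    """Softmax policy over actions, one logit per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def objective(theta, rewards):
    """J(theta) = sum_a pi(a) R(a) for a one-step problem from the start state."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def policy_gradient(theta, rewards):
    """Exact gradient of J: dJ/dtheta_k = pi(k) * (R(k) - J)."""
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]

def gradient_ascent(theta, rewards, alpha, steps):
    """Repeat theta <- theta + alpha * dJ/dtheta."""
    for _ in range(steps):
        grad = policy_gradient(theta, rewards)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

rewards = [1.0, 2.0, 0.5]      # hypothetical expected rewards R(s0, a)
theta0 = [0.0, 0.0, 0.0]
theta = gradient_ascent(theta0, rewards, alpha=0.5, steps=200)
```

In practice the gradient is not available in closed form and is replaced by a sample-based estimate, which is why the update is only approximately proportional to the true gradient.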
2.2 Option Critic Framework
The options framework [Sutton, Precup, and Singh1999] formalizes the notion of temporal abstractions by introducing options. An option, $\omega$, from a set of options, $\Omega$, is a generalization of primitive actions. The intra-option policy $\pi_\omega$ represents the agent’s decision making while executing an option $\omega$: $\pi_\omega(a \mid s) = \Pr(A_t = a \mid S_t = s, \omega_t = \omega)$. Like primitive actions, the agent executes an option at a state $S_t$ and the option terminates at another state $S_{t+k}$, where $k$ is the duration for which the agent is executing the option. While in the option $\omega$, from state $S_t$ to $S_{t+k}$, the agent follows the policy $\pi_\omega$. Option $\omega$ terminates stochastically in state $s$ according to a distribution $\beta_\omega(s)$. The framework puts restrictions on where an option can be initiated by defining an initiation state set, $\mathcal{I}_\omega \subseteq \mathcal{S}$, for option $\omega$. The option is initiated in state $s$ based on $\pi_\Omega$, which is a policy over options defined as $\pi_\Omega(\omega \mid s) = \Pr(\omega_t = \omega \mid S_t = s)$. An initiation state set $\mathcal{I}_\omega$, an intra-option policy $\pi_\omega$, and a termination function $\beta_\omega$ comprise an option $\omega$. It is commonly assumed that all options are available everywhere, and thereby we dispense with the notion of an initiation set.
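To make the structure of an option concrete, the sketch below represents an option as an intra-option policy plus a termination function (no initiation set, following the assumption that all options are available everywhere) and executes it until it terminates. The chain environment, the `Option` class, and the helper names are hypothetical and serve only as an illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Option:
    """An option: an intra-option policy and a termination function."""
    policy: Callable[[int], List[float]]   # state -> action probabilities
    termination: Callable[[int], float]    # state -> probability of terminating there

def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

def run_option(option, state, step, rng, max_steps=100):
    """Follow the intra-option policy until the option terminates stochastically."""
    duration = 0
    while duration < max_steps:
        action = sample(option.policy(state), rng)
        state = step(state, action)
        duration += 1
        if rng.random() < option.termination(state):
            break
    return state, duration

def chain_step(state, action):
    """Hypothetical 5-state chain: action 0 moves left, action 1 moves right."""
    return max(0, min(4, state + (1 if action == 1 else -1)))

go_right = Option(policy=lambda s: [0.0, 1.0],
                  termination=lambda s: 1.0 if s == 4 else 0.0)
final_state, k = run_option(go_right, 0, chain_step, random.Random(0))
```

Here the option always moves right and terminates only in the last state, so it runs for four steps from state 0; a policy over options would then select the next option in the state where this one terminated.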
The option-critic framework makes all the options available everywhere, and introduces policy-gradient theorems within the options framework. The option active at time step $t$ is $\omega_t$. The intra-option policies ($\pi_\omega$) and termination functions ($\beta_\omega$) are represented using differentiable functions parametrized by $\theta$ and $\vartheta$, respectively. The goal is to optimize the expected discounted return starting at state $s_0$ and option $\omega_0$. We redefine the objective function, $J$, for the option-critic setting: $J = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0, \omega_0\right]$.
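Equations similar to those in the policy gradient framework [Sutton et al.1999] are manipulated to derive gradients of the objective with respect to $\theta$ and $\vartheta$ in the option-critic framework. The analogous state value function is $V_\Omega : \mathcal{S} \to \mathbb{R}$, where $V_\Omega(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s\right]$. $V_\Omega(s)$ is the value of a state $s$, within the options framework, with the option set $\Omega$ and the policy over options $\pi_\Omega$. The option-value function is $Q_\Omega : \mathcal{S} \times \Omega \to \mathbb{R}$, where $Q_\Omega(s, \omega) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \omega_0 = \omega\right]$. Here, $Q_\Omega(s, \omega)$ is the value of state $s$ when option $\omega$ is active with the option set $\Omega$. The state-option-action value function is $Q_U : \mathcal{S} \times \Omega \times \mathcal{A} \to \mathbb{R}$, where $Q_U(s, \omega, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \omega_0 = \omega, A_0 = a\right]$. Here, $Q_U(s, \omega, a)$ is the value of executing action $a$ in the context of state-option pair $(s, \omega)$. The option-value function upon arrival is $U : \Omega \times \mathcal{S} \to \mathbb{R}$, where $U(\omega, s') = (1 - \beta_\omega(s')) Q_\Omega(s', \omega) + \beta_\omega(s') V_\Omega(s')$. Here, $U(\omega, s')$ is the value of option $\omega$ being active upon the agent entering state $s'$. Bacon, Harb, and Precup (2017) observe a consequence of the definitions: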
(1) 
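The main results presented by Bacon, Harb, and Precup (2017) are the intra-option policy gradient theorem and the termination gradient theorem. The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:
$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s, \omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_{a} \frac{\partial \pi_\omega(a \mid s)}{\partial \theta} Q_U(s, \omega, a), \tag{2}$$
where $\mu_\Omega(s, \omega \mid s_0, \omega_0)$ is the discounted weighting of state-option pair $(s, \omega)$ along trajectories starting from $(s_0, \omega_0)$, defined by $\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, \omega_t = \omega \mid s_0, \omega_0)$. The gradient of the expected discounted return with respect to $\vartheta$ and initial condition $(s_1, \omega_0)$ is:
$$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s', \omega} \mu_\Omega(s', \omega \mid s_1, \omega_0) \frac{\partial \beta_\omega(s')}{\partial \vartheta} A_\Omega(s', \omega), \tag{3}$$
where $A_\Omega$ is the advantage function over options such that $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$. Here, $\mu_\Omega(s', \omega \mid s_1, \omega_0)$ is the discounted weighting of state-option pair $(s', \omega)$ from $(s_1, \omega_0)$, i.e., according to a Markov chain shifted by one time step, defined by $\mu_\Omega(s', \omega \mid s_1, \omega_0) = \sum_{t=0}^{\infty} \gamma^t \Pr(S_{t+1} = s', \omega_t = \omega \mid s_1, \omega_0)$. The agent learns by updating parameters $\theta$ and $\vartheta$ in the direction approximately proportional to $\partial Q_\Omega(s_0, \omega_0) / \partial \theta$ and $\partial U(\omega_0, s_1) / \partial \vartheta$, respectively. Meaning, it learns by updating $\theta \leftarrow \theta + \alpha_\theta \, \partial Q_\Omega(s_0, \omega_0) / \partial \theta$ and $\vartheta \leftarrow \vartheta + \alpha_\vartheta \, \partial U(\omega_0, s_1) / \partial \vartheta$, where $\alpha_\theta$ and $\alpha_\vartheta$ are the learning rates for $\theta$ and $\vartheta$, respectively.
2.3 Natural Actor Critic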
Natural gradient descent [Amari1998] exploits the underlying structure of the parameter space when defining the direction of steepest descent. It does so by defining the inner product $\langle x, y \rangle_\theta$ in the parameter space as:
$$\langle x, y \rangle_\theta = x^\top G_\theta \, y, \tag{4}$$
where $G_\theta$ is called the metric tensor. Although the choice of $G_\theta$ remains open under certain conditions [Thomas et al.2016], we choose the Fisher information matrix, as is common practice. The Fisher information matrix of a distribution $p_\theta$ over random variable $X$, parametrized by policy parameters $\theta$ that lie on a Riemannian manifold [Rao1945; Amari1985], is:
$$(G_\theta)_{i,j} = \mathbb{E}\left[\frac{\partial \log p_\theta(X)}{\partial \theta_i} \frac{\partial \log p_\theta(X)}{\partial \theta_j}\right], \tag{5}$$
where the expectation is over the distribution $p_\theta$ and $(\cdot)_{i,j}$ represents a matrix with its $(i, j)$th element being the expression as defined on the right hand side; we use this notation to represent a matrix throughout the paper. Kakade (2001) makes the assumption that every policy, $\pi$, is ergodic and irreducible, therefore it has a well-defined stationary distribution for each state $s$. Under this assumption, Kakade (2001) introduces the use of the natural gradient for optimizing the expected reward over the parameters $\theta$ of policy $\pi$, as defined by $J(\theta)$. The natural gradient for the objective function, $J$, is defined as:
$$\widetilde{\nabla} J(\theta) = G_\theta^{-1} \nabla J(\theta). \tag{6}$$
The derivation of a closed form expression for the metric tensor $G_\theta$ for the parameter space of policy $\pi$, parametrized by $\theta$, is nontrivial, as demonstrated for the limiting matrix of the infinite horizon problem in reinforcement learning [Bagnell and Schneider2003]. For a weight vector $w$, let $f_w$ be an approximation of the state-action value function $Q^\pi$, which has the form:
$$f_w(s, a) = w^\top \frac{\partial \log \pi(a \mid s; \theta)}{\partial \theta}.$$
The mean squared error $\epsilon(w, \theta)$, for a weight vector $w$ and a given policy $\pi$ parametrized by $\theta$, is defined as:
$$\epsilon(w, \theta) = \sum_{s, a} d^\pi(s) \, \pi(a \mid s; \theta) \left(f_w(s, a) - Q^\pi(s, a)\right)^2,$$
where $d^\pi(s)$ is the discounted weighting of state $s$ in the infinite horizon problem. The weights normalize to the stationary distribution for state $s$ under policy $\pi$ in the undiscounted setting where the MDP terminates at every time step with probability $1 - \gamma$. Theorem 1 as introduced by Kakade (2001) states that the $w$ which minimizes the mean squared error, $\epsilon(w, \theta)$, is equal to the natural gradient as defined in (6).
Kakade (2001) also demonstrates how the natural policy gradient performs under the rescaling of parameters. In addition, Kakade (2001) demonstrates how the natural gradient weights the components of $\widetilde{\nabla} J$ uniformly, instead of using the state weighting $d^\pi(s)$. We also point out that the natural gradient is invariant to local reparametrization of the model [Pascanu and Bengio2013] and can be used in online learning [Degris, Pilarski, and Sutton2012]. Natural gradients for reinforcement learning [Peters and Schaal2008; Bhatnagar et al.2009; Degris, Pilarski, and Sutton2012], as well as more recent work in deep neural networks [Desjardins et al.2015; Pascanu and Bengio2013; Thomas, Dann, and Brunskill2018; Sun and Nielsen2017], have been shown to be effective in learning.
The option-critic architecture uses the vanilla gradient to learn temporal abstraction and internal policies, which can be less data efficient compared to the natural gradient [Amari1998]. The natural gradient also overcomes the difficulty posed by the plateau phenomena [Amari2016]. We derive the metric tensors for the parameters $\theta$ and $\vartheta$ in the option-critic architecture. Computing the complete Fisher information matrix, or its inverse, is expensive. We use a block-diagonal estimate of the Fisher information matrix, as has been applied in the past to reinforcement learning [Thomas2011] and to neural networks [Roux, Manzagol, and Bengio2008; Kurita1992; Martens2010; Pascanu and Bengio2013; Martens and Grosse2015]. Specifically, we estimate $G_\theta$ and $G_\vartheta$ separately, where $\theta$ and $\vartheta$ are the parameters of the intra-option policy and the option termination function. These are then combined into a $(|\theta| + |\vartheta|) \times (|\theta| + |\vartheta|)$ sized estimate of the complete Fisher information matrix of the parameter space, where $|\cdot|$ represents the size of vectors.
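The block-diagonal idea can be sketched directly: each parameter group gets its own Fisher estimate, and the natural gradient for that group is obtained by solving a linear system with only that block. The code below is a minimal illustration, assuming the Fisher blocks are estimated as average outer products of sampled score vectors; the function names and the small ridge term (added here only to keep the solve well conditioned) are assumptions, not part of the method described in the paper.

```python
def outer_mean(vectors):
    """Average outer product: a sample estimate of E[score score^T]."""
    n, d = len(vectors), len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) / n for j in range(d)]
            for i in range(d)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def block_natural_gradient(scores_policy, scores_term, grad_policy, grad_term,
                           ridge=1e-8):
    """Natural gradient per block: solve G_block x = grad_block for each block."""
    result = []
    for scores, grad in ((scores_policy, grad_policy), (scores_term, grad_term)):
        g = outer_mean(scores)
        for i in range(len(g)):
            g[i][i] += ridge          # assumed regularizer, for invertibility only
        result.append(solve(g, grad))
    return result

# Hypothetical score samples and vanilla gradients for the two parameter groups.
nat_policy, nat_term = block_natural_gradient(
    [[1.0, 0.0], [0.0, 2.0]], [[2.0]], [1.0, 2.0], [4.0])
```

Forming and inverting each block explicitly, as done here, still costs more than linear time in the block size; it is shown only to make the block structure visible.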
We also provide theoretical justification for the resulting algorithm, inspired by the incremental natural actor critic (INAC) algorithm [Bhatnagar et al.2007] and its extension to include eligibility traces [Morimura, Uchibe, and Kenji2005; Thomas2014].
3 Start State Fisher Information Matrix Over Intra-Option Path Manifold
We define a path in the options framework for the infinite horizon problem as the sequence of state-option-action tuples: . We use to denote the set of all paths. We introduce the function called the expected return over path, where is the expected return given the path . The goal in a reinforcement learning problem, in the context of the option-critic architecture, is to maximize the discounted return, . The goal can be rewritten as maximizing , where the summation is over all starting from and the intra-option policies are parametrized by . To optimize the objective , we define it over a Riemannian space , with . In the Riemannian space the inner product is defined as in (4). The direction of steepest ascent of in the Riemannian space, , is given by [Amari1998] (see equation (6)).
In this section we use to denote and use to indicate the expected value of with respect to distribution . We obtain an alternative form of the Fisher information matrix, which is a well-known result [DeGroot1970] (for details see the appendix):
(7) 
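The well-known result referred to here is that, under regularity conditions, the expected outer product of the score equals the negative expected second derivative of the log-likelihood. The sketch below checks this numerically for the simplest possible case, a Bernoulli distribution with a single logit parameter; the distribution, the finite-difference step, and all names are assumptions chosen for illustration and are unrelated to the path distributions used in this section.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_prob(x, theta):
    """log p(x; theta) for a Bernoulli variable with success probability sigmoid(theta)."""
    p = sigmoid(theta)
    return math.log(p if x == 1 else 1.0 - p)

def score(x, theta, eps=1e-4):
    """Central-difference estimate of d log p / d theta."""
    return (log_prob(x, theta + eps) - log_prob(x, theta - eps)) / (2.0 * eps)

def curvature(x, theta, eps=1e-4):
    """Central-difference estimate of d^2 log p / d theta^2."""
    return (log_prob(x, theta + eps) - 2.0 * log_prob(x, theta)
            + log_prob(x, theta - eps)) / (eps * eps)

def expectation(f, theta):
    """Exact expectation of f(x, theta) over x ~ p(.; theta)."""
    p = sigmoid(theta)
    return p * f(1, theta) + (1.0 - p) * f(0, theta)

theta = 0.3
fisher_outer = expectation(lambda x, t: score(x, t) ** 2, theta)   # E[(d log p)^2]
fisher_curv = -expectation(curvature, theta)                       # -E[d^2 log p]
closed_form = sigmoid(theta) * (1.0 - sigmoid(theta))
```

Both estimates agree with the closed-form Fisher information of the Bernoulli logit, which is the one-dimensional version of the identity used in the derivations that follow.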
3.1 Fisher Information Matrix Over Intra-Option Path Manifold
In Theorem 1 we show that the Fisher information matrix over the paths, , truncated to terminate at time step $T$ converges as $T \to \infty$ to the Fisher information matrix over the intra-option policies, . This gives an expression for the Fisher information matrix over the set of paths, , and simplifies computation of the natural gradient when maximizing the objective . We use to indicate the $T$-step finite horizon Fisher information matrix, meaning the Fisher information matrix if the problem were to be reduced to terminate at step $T$. We normalize the metric by the total length of the path [Bagnell and Schneider2003] to get a convergent metric.
Theorem 1 (Infinite Horizon Intra-Option Matrix).
Let be the $T$-step finite horizon Fisher information matrix and be the Fisher information matrix of intra-option policies under a stationary distribution of states, actions and options: . Then:
Proof.
See the appendix (supplementary materials). ∎
3.2 Compatible Function Approximation For Intra-Option Path Manifold
We subtract the state-option value function, , from the state-option-action value function, , and treat it as a baseline to reduce variance in the gradient estimate of the expected discounted return. The baseline can be a function of both state and action in special circumstances, but none of those apply here [Thomas and Brunskill2017]. So, we define the state-option-action advantage function , where is the advantage of the agent taking action in state in the context of option . Here, is approximated by some compatible function approximator . For vector and parameters we define:
(8) 
The that is a local minimum of the squared error :
is equal to the natural gradient of the objective, , with respect to (the complete derivation is in the appendix):
Thus, for a sensible [Kakade2001] function approximation, as in (8), in the option-critic framework the natural gradient of the expected discounted return is given by the weights of the linear function approximation.
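The claim that the least-squares weights of a compatible approximator coincide with the natural gradient can be checked numerically in a toy setting. The sketch below assumes a single state-option pair, a softmax intra-option policy with three actions and two free logits (the last logit is fixed at zero so that the Fisher matrix is invertible), and hypothetical action values; it fits the weights by gradient descent on the weighted squared error and compares them with the Fisher-preconditioned gradient computed directly. All names and numbers are illustrative assumptions.

```python
import math

def policy_and_features(theta):
    """Softmax over len(theta)+1 actions (last logit fixed at 0) and its score vectors."""
    logits = list(theta) + [0.0]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    pi = [e / z for e in exps]
    n = len(theta)
    # psi[a][k] = d log pi(a) / d theta_k = 1[a == k] - pi[k]
    psi = [[(1.0 if a == k else 0.0) - pi[k] for k in range(n)]
           for a in range(n + 1)]
    return pi, psi

def fit_compatible_weights(pi, psi, adv, lr=1.0, steps=2000):
    """Gradient descent on sum_a pi(a) * (psi(a)^T w - A(a))^2."""
    n = len(psi[0])
    w = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for p, f, a in zip(pi, psi, adv):
            err = sum(fi * wi for fi, wi in zip(f, w)) - a
            for k in range(n):
                grad[k] += 2.0 * p * err * f[k]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def natural_gradient_2d(pi, psi, q):
    """G^{-1} grad J for two parameters, with G = E[psi psi^T] and grad J = E[psi Q]."""
    g = [[sum(p * f[i] * f[j] for p, f in zip(pi, psi)) for j in range(2)]
         for i in range(2)]
    grad = [sum(p * f[i] * qa for p, f, qa in zip(pi, psi, q)) for i in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(g[1][1] * grad[0] - g[0][1] * grad[1]) / det,
            (g[0][0] * grad[1] - g[1][0] * grad[0]) / det]

theta = [0.2, -0.1]
q_values = [1.0, 3.0, 2.0]                 # hypothetical state-option-action values
pi, psi = policy_and_features(theta)
baseline = sum(p * q for p, q in zip(pi, q_values))   # the state-option value
advantages = [q - baseline for q in q_values]
w = fit_compatible_weights(pi, psi, advantages)
nat = natural_gradient_2d(pi, psi, q_values)
```

The fitted weights `w` and the directly computed `nat` agree to numerical precision, so in this toy case the Fisher matrix never has to be inverted in order to follow the natural gradient; only the weights need to be learned.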
4 Start State Fisher Information Matrix Over State-Option Transition Path Manifold
We derive the Fisher information matrix for the parameters over the state-option transitions path manifold. We define as a path for state-option transitions in the option-critic architecture. More specifically, we define to be path tuples of state-option pairs shifted by one time step. We define to be the set of all state-option transition paths. Similar to the previous section, we define the expected return over state-option transitions , where is the expected return given state-option transitions path . The goal can be rewritten to maximize , where the summation is over all starting from and terminations are parametrized by . To optimize we define it over a Riemannian space with and the inner product defined as in (4), similar to the previous section. The direction of steepest ascent in the Riemannian space, , is the natural gradient.
In this section, we use to denote and use to indicate the expected value of with respect to the distribution . Equation (7) implies that the Fisher information matrix can be written as:
4.1 Fisher Information Matrix Over State-Option Transition Path Manifold
In Theorem 2 we show that the Fisher information matrix over the paths, , truncated to terminate at time step $T$ converges as $T \to \infty$ to an expression in terms of the terminations and the policy over options over the stationary distribution of states and options. This gives an expression for the Fisher information matrix over the set of paths, , and simplifies computation of the natural gradient when maximizing the objective .
Theorem 2 (Infinite Horizon State-Option Transition Matrix).
Let be the $T$-step finite horizon Fisher information matrix and let be the stationary distribution of state-option pairs . Then:
Proof.
See appendix (supplementary materials). ∎
4.2 Compatible Function Approximation For State-Option Transition Path Manifold
We define the advantage function of continued option as: , where is the advantage of the option being active while exiting given that option is active when the agent enters . We consider termination improvement when is approximated by some compatible function approximator . For vector and parameters we define:
(9) 
We define the squared error associated with vector as:
where is the likelihood ratio of option being active while exiting given that option is active when the agent enters . It is defined as follows:
We assume, throughout the paper, that the denominator is not zero. The that is a local minimum of satisfies (the complete derivation is in the appendix):
Therefore, for an approximation of the continued state-option value function, as in (9), the natural gradient of the expected discounted return is given by the negative weights of the linear function approximation.
5 Incremental Natural Option Critic Algorithm
We introduce algorithms inspired by the incremental natural actor critic introduced by Degris, Pilarski, and Sutton (2012), who in turn built on the theoretical work of Bhatnagar et al. (2007). The algorithm learns the parameters for approximations of the state-option-action advantage function, , and the advantage function of continued option, , incrementally by taking steps in the direction of reducing the error and . It does stochastic gradient descent using the gradients and . Learning the parameters and leads to natural gradient based updates for and . We introduce hyperparameters and , which are the learning rate for , the learning rate for and the eligibility trace parameter of both and , respectively. The algorithm learns the policy over options, , using intra-option Q-learning [Sutton, Precup, and Singh1999] as in previous work [Bacon, Harb, and Precup2017].
The algorithm uses TD-error style updates to learn and . Analogous to the consistent estimates used by Bhatnagar et al. (2007), we state that a consistent estimate of the state-option value function, , satisfies . Similarly, a consistent estimate of the value function upon arrival, , satisfies . We define the TD-error for the intra-option policies at time step $t$ to be .
A consistent estimate of the state value function, , satisfies . We define the TD-error at time step $t$ for the terminations to be . We provide Lemmas 1 and 2 to show that and are consistent estimates of and .
Lemma 1.
Given intra-option policies, for all , policy over options, , and terminations, for all , then:
Lemma 2.
Under the precondition and given intra-option policies, for all , policy over options, , and terminations, for all , then:
The proofs are in the appendix (supplementary materials). Using these lemmas and theorems we introduce algorithm 20 (INOC). We provide details on how we arrive at the updates to parameters and in the appendix. The precondition might lead to fewer updates to the parameters of the terminations. The options evaluation part in the algorithm is the same as in previous work [Bacon, Harb, and Precup2017].
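The flavor of the incremental updates can be conveyed with a small sketch in the style of incremental natural actor-critic methods: the advantage weights take a stochastic gradient step that pulls the linear prediction toward the TD error, and the policy or termination parameters then move along (or, for terminations, against) those weights. This is not the authors' algorithm: it omits eligibility traces, the precondition on termination updates, and the critic, and every name and constant below is a hypothetical choice for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def advantage_weight_step(w, psi, td_error, lr):
    """One SGD step on 0.5 * (psi^T w - td_error)^2; cost is linear in len(w)."""
    err = td_error - dot(psi, w)
    return [wi + lr * err * p for wi, p in zip(w, psi)]

def natural_step(params, w, lr, sign=1.0):
    """Move parameters along the learned weights (sign=-1.0 for terminations)."""
    return [p + sign * lr * wi for p, wi in zip(params, w)]

# Hypothetical compatible features and TD error for one repeated transition.
psi = [1.0, 0.5]
td_error = 2.0
w = [0.0, 0.0]
for _ in range(200):
    w = advantage_weight_step(w, psi, td_error, lr=0.1)

policy_params = natural_step([0.0, 0.0], w, lr=0.05)              # intra-option policy
termination_params = natural_step([0.0, 0.0], w, lr=0.05, sign=-1.0)  # terminations
```

Each step touches every weight once, which is where the linear per-time-step cost comes from; no Fisher matrix is formed or inverted.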
6 Experiments
We look at the performance of natural option critic in three different types of domains: a simple two-state MDP, one with linear state representations, and one with neural networks for state representations, and compare it to option critic. In all the cases we use sigmoid terminations and linear-softmax intra-option policies, as in previous work [Bacon, Harb, and Precup2017].
MDP Setup: We design an MDP to demonstrate the uniform weighting of the components of the natural termination gradient, , as opposed to using . Note that the effectiveness of the natural policy gradient has been demonstrated sufficiently in past work [Kakade2001; Bagnell and Schneider2003; Degris, Pilarski, and Sutton2012]. We define a simple two-state MDP as in Figure 1. The initial state distribution is and . The transitions are deterministic. The reward for self loops into and are 1 and 2, respectively. The episode terminates after 30 steps. We use an $\epsilon$-greedy policy over options, .
We consider a scenario with two options, and , each of which has probability 0.9 for actions and , respectively, regardless of the state. This gives us options as abstractions over individual actions. We initialize the terminations, , and option value function, such that they are biased towards the greedy action, , in state via the selection of option . Specifically, we set and , so that the setup is biased towards a higher probability of . This presents the agent with the challenge of learning the better action of transitioning to state , despite the higher probability and the self-loop reward of . We set the learning rate for the intra-option policies, , to be negligible, as our goal is to demonstrate the efficacy of the natural termination gradient.
As can be seen from Figure 2, the natural option critic converges to the optimal value for average reward, by overcoming the plateau, much faster than the option critic. The option critic is initially stuck in the greedy self-loop action because of the weighting by , whereas the natural option critic begins learning early on and achieves the optimal average reward.
Four Rooms: The four rooms domain [Sutton, Precup, and Singh1999] is a particularly favorable case for demonstrating the use of options. We use the same number of options, 4, as in previous work [Bacon, Harb, and Precup2017]. The result (Figure 4) indicates that natural option critic converges faster.
Arcade Learning Environment:
We compare natural option-critic with the option-critic framework on the Arcade Learning Environment [Bellemare et al.2013]. To showcase the improvement over the option-critic architecture we use the same configuration for all the layers as in previous work [Bacon, Harb, and Precup2017], which in turn uses the same configuration for the first 3 convolutional layers of the network introduced by Mnih et al. (2013). The critic network was trained, similar to previous work [Bacon, Harb, and Precup2017], using experience replay [Mnih et al.2013] and RMSProp.
As in previous work [Bacon, Harb, and Precup2017], we apply the regularizer prescribed by Mnih et al. (2016) to penalize low entropy policies. We use an on-policy estimate of the policy over options, , which is used in the computation of the natural gradient with respect to the termination parameters.
We compare the two approaches, option critic and natural option critic, by evaluating them on the games Asterix, Seaquest, and Zaxxon [Bacon, Harb, and Precup2017]. For comparison we run training over the same number of frames per epoch as done by Bacon, Harb, and Precup (2017), running the same number of trials and using the same number of options: 8. We demonstrate the results in Figure 4. More importantly, we use the same hyperparameters, for learning rates and entropy regularization, as in previous work to ensure a fair comparison. We obtain improvements on the option-critic architecture (OC) for Asterix and Zaxxon. We also note that we were unable to reproduce the results for Seaquest for option critic, but having given the same set of hyperparameters we observe that option critic performs better. We explain the issue with termination updates, and its effect on the return, for Seaquest in the appendix.
For Zaxxon and Asterix we see that NOC breaks the plateau much earlier than option critic. Note that the value network, for approximating , is learned using the vanilla gradient.
7 Discussion
We have introduced a natural gradient based approach for learning intra-option policies and terminations, within the option-critic framework, which is linear in the number of parameters. More importantly, we have furnished instructive proofs on deriving the Fisher information matrix over path manifolds and the corresponding function-approximation-based approach that reduces mean squared errors. We have also introduced an algorithm that uses consistent estimates of the advantage functions and learns the natural gradient by learning coefficients of the corresponding linear function approximators. The results showcase performance improvements on previous work. The proofs for finite horizon metrics are very similar to the ones provided by Bagnell and Schneider (2003). We also demonstrate the effectiveness of natural option critic in three distinct domains.
As discussed by Thomas (2014), we can obtain a truly unbiased estimate for our updates, but it may not be practical. The limitations that apply to the option-critic framework, except the use of the vanilla gradient, also apply here. We use a block diagonal estimate of the Fisher information matrix. The complete Fisher information matrix for the option-critic framework over path manifolds is:
where and are the Fisher information matrices for the intra-option path manifold and the state-option transition manifold, respectively. The random variable is the path variable over state-option-action tuples. Computing the complete Fisher information matrix and its inverse is expensive, and a compatible function approximation based approach is needed to obtain a natural gradient estimate with space complexity linear in the number of parameters.
Although our approach has added benefits, it is limited by fewer updates of the termination policy. Work is required to develop better estimates of the advantage functions. More experimental work, e.g., applications to other domains, can further help understand the efficacy of natural gradients in the context of the option-critic framework.
References

 [Amari1967] Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Trans. Electronic Computers 16:299–307.
 [Amari1985] Amari, S. 1985. Differential-geometrical methods in statistics. In Lecture Notes in Statistics 28. Springer-Verlag.
 [Amari1998] Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Comput. 10(2):251–276.
 [Amari2016] Amari, S.-I. 2016. Information Geometry and Its Applications. Springer.
 [Bacon, Harb, and Precup2017] Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In AAAI.
 [Bagnell and Schneider2003] Bagnell, J. A., and Schneider, J. 2003. Covariant policy search. IJCAI.
 [Bellemare et al.2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. H. 2013. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 47:253–279.
 [Bhatnagar et al.2007] Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2007. Incremental natural actor-critic algorithms. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, 105–112. USA: Curran Associates Inc.
 [Bhatnagar et al.2009] Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11):2471–2482.
 [Degris, Pilarski, and Sutton2012] Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice.
 [DeGroot1970] DeGroot, M. 1970. Optimal Statistical Decisions. Wiley Classics Library. Wiley.
 [Desjardins et al.2015] Desjardins, G.; Simonyan, K.; Pascanu, R.; et al. 2015. Natural neural networks. In Advances in Neural Information Processing Systems, 2071–2079.
 [Dietterich2000] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Intell. Res.(JAIR) 13(1):227–303.
 [Kakade2001] Kakade, S. 2001. A natural policy gradient. In Dietterich, T. G.; Becker, S.; and Ghahramani, Z., eds., Advances in Neural Information Processing Systems 14 (NIPS 2001), 1531–1538. MIT Press.
 [Konda and Tsitsiklis2000] Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. NIPS’2000, 1008–1014.
 [Konidaris and Barto2009] Konidaris, G., and Barto, A. G. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Bengio, Y.; Schuurmans, D.; Lafferty, J. D.; Williams, C. K. I.; and Culotta, A., eds., Advances in Neural Information Processing Systems 22. Curran Associates, Inc. 1015–1023.
 [Kurita1992] Kurita, T. 1992. Iterative weighted least squares algorithms for neural networks classifiers. New Generation Computing 12:375–394.

 [Machado, Bellemare, and Bowling2017] Machado, M. C.; Bellemare, M. G.; and Bowling, M. 2017. A Laplacian framework for option discovery in reinforcement learning. In Precup, D., and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 2295–2304. International Convention Centre, Sydney, Australia: PMLR.
 [Martens and Grosse2015] Martens, J., and Grosse, R. B. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML.
 [Martens2010] Martens, J. 2010. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, 735–742. USA: Omnipress.
 [Mnih et al.2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing atari with deep reinforcement learning. CoRR abs/1312.5602.
 [Mnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.
 [Morimura, Uchibe, and Kenji2005] Morimura, T.; Uchibe, E.; and Kenji, D. 2005. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. 0–0.
 [Parr and Russell1998] Parr, R., and Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in neural information processing systems, 1043–1049.
 [Pascanu and Bengio2013] Pascanu, R., and Bengio, Y. 2013. Revisiting natural gradient for deep networks.
 [Peters and Schaal2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71:1180–1190.
 [Rao1945] Rao, C. R. 1945. Information and accuracy attainable in the estimation of statistical parameters. In Bulletin of the Calcutta Mathematical Society. 81–91.
 [Roux, Manzagol, and Bengio2008] Roux, N. L.; Manzagol, P.; and Bengio, Y. 2008. Topmoumoute online natural gradient algorithm. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in Neural Information Processing Systems 20. Curran Associates, Inc. 849–856.
 [Simsek and Barto2008] Simsek, Ö., and Barto, A. G. 2008. Skill characterization based on betweenness. In NIPS.
 [Stein and Shakarchi2009] Stein, E., and Shakarchi, R. 2009. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press.
 [Sun and Nielsen2017] Sun, K., and Nielsen, F. 2017. Relative fisher information and natural gradient for learning large modular models. In ICML.
 [Sutton et al.1999] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, 1057–1063.
 [Sutton, Precup, and Singh1999] Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 112:181–211.
 [Thomas and Brunskill2017] Thomas, P. S., and Brunskill, E. 2017. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. CoRR abs/1706.06643.
 [Thomas et al.2016] Thomas, P.; Silva, B. C.; Dann, C.; and Brunskill, E. 2016. Energetic natural gradient descent. In Balcan, M. F., and Weinberger, K. Q., eds., Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 2887–2895. New York, New York, USA: PMLR.
 [Thomas, Dann, and Brunskill2018] Thomas, P. S.; Dann, C.; and Brunskill, E. 2018. Decoupling learning rules from representations. In ICML.
 [Thomas2011] Thomas, P. S. 2011. Policy gradient coagent networks. In Shawe-Taylor, J.; Zemel, R. S.; Bartlett, P. L.; Pereira, F.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 24. Curran Associates, Inc. 1944–1952.
 [Thomas2014] Thomas, P. 2014. Bias in natural actor-critic algorithms. In ICML.
 [Thrun and Schwartz1995] Thrun, S., and Schwartz, A. 1995. Finding structure in reinforcement learning. In Tesauro, G.; Touretzky, D. S.; and Leen, T. K., eds., Advances in Neural Information Processing Systems 7. MIT Press. 385–392.
Appendix
Here, we provide proofs for the theorems and lemmas presented in the body of the paper, and we also provide derivations for the estimates of the natural gradient. Although these proofs appear in the appendix due to space constraints, they are our major contributions.
Alternate Form Of The Fisher Information Matrix
We derive the following result, the same as that of Bagnell and Schneider (2003) with the meanings of the symbols changed, for the Fisher information matrix under appropriate regularity conditions for :
(10)  
(11)  
(12)  
(13)  
(14)  
(15) 
The first equality follows from the definition of Fisher information matrix. The third equality follows from integration by parts. The last equality is a result of the sum of probabilities being constant, i.e., . The matrix is positive semidefinite [Amari1967] and the derivations resulting from this expression inherit this property.
Proof Of Infinite Horizon Intra-Option Matrix
Theorem (Infinite Horizon Intra-Option Matrix).
Let be the $T$-step finite horizon Fisher information matrix and be the Fisher information matrix of intra-option policies under a stationary distribution of states, actions and options: . Then:
Proof.
is the $T$-step finite horizon Fisher information matrix.
(16)  
(17) 
The process represented by the path is Markovian, meaning . This leads to the following result for the likelihood probability, similar to the simple form of the path probability metric presented by Bagnell and Schneider (2003):
(18)  
(19)  