1 Introduction
Markov decision processes (MDPs) are the standard model for optimal control in a fully observable environment (Bertsekas, 2010). Strong empirical results have been obtained in numerous challenging real-world optimal control problems using the MDP framework. This includes problems of nonlinear control (Stengel, 1993; Li and Todorov, 2004; Todorov and Tassa, 2009; Deisenroth and Rasmussen, 2011; Rawlik et al., 2012; Spall and Cristion, 1998), robotic applications (Kober and Peters, 2011; Kohl and Stone, 2004; Vlassis et al., 2009), biological movement systems (Li, 2006), traffic management (Richter et al., 2007; Srinivasan et al., 2006), helicopter flight control (Abbeel et al., 2007), elevator scheduling (Crites and Barto, 1995) and numerous games, including chess (Veness et al., 2009), go (Gelly and Silver, 2008), backgammon (Tesauro, 1994) and Atari video games (Mnih et al., 2015).
It is well-known that the global optimum of a MDP can be obtained through methods based on dynamic programming, such as value iteration (Bellman, 1957) and policy iteration (Howard, 1960). However, these techniques are known to suffer from the curse of dimensionality, which makes them infeasible for most real-world problems of interest. As a result, most research in the reinforcement learning and control theory literature has focused on obtaining approximate or locally optimal solutions. There exists a broad spectrum of such techniques, including approximate dynamic programming methods (Bertsekas, 2010), tree search methods (Russell and Norvig, 2009; Kocsis and Szepesvári, 2006; Browne et al., 2012), local trajectory-optimization techniques, such as differential dynamic programming (Jacobson and Mayne, 1970) and iLQG (Li and Todorov, 2006), and policy search methods (Williams, 1992; Baxter and Bartlett, 2001; Sutton et al., 2000; Marbach and Tsitsiklis, 2001; Kakade, 2002; Kober and Peters, 2011).
The focus of this paper is on policy search methods, which are a family of algorithms that have proven extremely popular in recent years, and which have numerous desirable properties that make them attractive in practice. Policy search algorithms are typically specialized applications of techniques from numerical optimization (Nocedal and Wright, 2006; Dempster et al., 1977). As such, the controller is defined in terms of a differentiable representation, and local information about the objective function, such as the gradient, is used to update the controller in a smooth, non-greedy manner. Such updates are performed incrementally until the algorithm converges to a local optimum of the objective function.
There are several benefits to such an approach: the smooth updates of the control parameters endow these algorithms with very general convergence guarantees; as performance is improved at each iteration (or at least on average in stochastic policy search methods) these algorithms have good anytime performance properties; it is not necessary to approximate the value function, which is typically a difficult function to approximate – instead it is only necessary to approximate a low-dimensional projection of the value function, an observation which has led to the emergence of so-called actor-critic methods (Konda and Tsitsiklis, 2003, 1999; Bhatnagar et al., 2008, 2009); and policy search methods are easily extendable to models for optimal control in a partially observable environment, such as finite state controllers (Meuleau et al., 1999; Toussaint et al., 2006).
In (stochastic) steepest gradient ascent (Williams, 1992; Baxter and Bartlett, 2001; Sutton et al., 2000) the control parameters are updated by moving in the direction of the gradient of the objective function. While steepest gradient ascent has enjoyed some success, it suffers from a serious issue that can hinder its performance. Specifically, the steepest ascent direction is not invariant to rescaling the components of the parameter space, and the gradient is often poorly-scaled, i.e., the variation of the objective function differs dramatically along the different components of the gradient, which leads to a poor rate of convergence. It also makes the construction of a good step size sequence a difficult problem, which is an important issue in stochastic methods.¹
¹ This is because line search techniques lose much of their desirability in stochastic numerical optimization algorithms, due to variance in the evaluations.
Poor scaling is a well-known problem with steepest gradient ascent, and alternative numerical optimization techniques have been considered in the policy search literature. Two approaches that have proven to be particularly popular are Expectation Maximization (Dempster et al., 1977) and natural gradient ascent (Amari, 1997, 1998; Amari et al., 1992), which have both been successfully applied to various challenging MDPs (see Dayan and Hinton (1997); Kober and Peters (2009); Toussaint et al. (2011) and Kakade (2002); Bagnell and Schneider (2003) respectively).
An avenue of research that has received less attention is the application of Newton’s method to Markov decision processes. Although Baxter and Bartlett (2001) provide such an extension of their GPOMDP algorithm, they give no empirical results in either Baxter and Bartlett (2001) or the accompanying paper of empirical comparisons (Baxter et al., 2001). There has since been only a limited amount of research into using the second-order information contained in the Hessian during the parameter update. To the best of our knowledge only two attempts have been made: in Schraudolph et al. (2006) an online estimate of a Hessian-vector product is used to adapt the step size sequence in an online manner; in Ngo et al. (2011), Bayesian policy gradient methods (Ghavamzadeh and Engel, 2007) are extended to the Newton method. There are several reasons for this lack of interest. Firstly, in many problems the construction and inversion of the Hessian is too computationally expensive to be feasible. Additionally, the objective function of a MDP is typically not concave, and so the Hessian is not guaranteed to be negative-definite. As a result, the search direction of the Newton method may not be an ascent direction, and hence a parameter update could actually lower the objective. Furthermore, the variance of sample-based estimators of the Hessian will be larger than that of estimators of the gradient. This is an important point because the variance of gradient estimates can be a problematic issue, and various methods, such as baselines (Weaver and Tao, 2001; Greensmith et al., 2004), exist to reduce the variance.
Many of these problems are not particular to Markov decision processes, but are general long-standing issues that plague the Newton method. Various methods have been developed in the optimization literature to alleviate these issues, whilst also maintaining desirable properties of the Newton method. For instance, quasi-Newton methods were designed to efficiently mimic the Newton method using only evaluations of the gradient obtained during previous iterations of the algorithm. These methods have low computational costs, a superlinear rate of convergence and have proven to be extremely effective in practice. See Nocedal and Wright (2006) for an introduction to quasi-Newton methods. Alternatively, the well-known Gauss-Newton method is a popular approach that aims to efficiently mimic the Newton method. The Gauss-Newton method is particular to nonlinear least squares objective functions, for which the Hessian has a particular structure.
Due to this structure there exist certain terms in the Hessian that can be used as a useful proxy for the Hessian itself, with the resulting algorithm having various desirable properties. For instance, the preconditioning matrix used in the Gauss-Newton method is guaranteed to be positive-definite, so that the nonlinear least squares objective is guaranteed to decrease for a sufficiently small step size.
While a straightforward application of quasi-Newton methods will not typically be possible for MDPs², in this paper we consider whether an analogue to the Gauss-Newton method exists, so that the benefits of such methods can be applied to MDPs. The specific contributions are as follows:
² In quasi-Newton methods, to ensure an increase in the objective function it is necessary to satisfy the secant condition (Nocedal and Wright, 2006). This condition is satisfied when the objective is concave/convex or the strong Wolfe conditions are met during a line search. For this reason, stochastic applications of quasi-Newton methods have been restricted to convex/concave objective functions (Schraudolph et al., 2007).

In Section 3, we present an analysis of the Hessian for MDPs. Our starting point is a policy Hessian theorem (Theorem 3) and we analyse the behaviour of individual terms of the Hessian to provide insight into constructing efficient approximate Newton methods for policy optimization. In particular, we show that certain terms are negligible near local optima.

Motivated by this analysis, in Section 4 we provide two Gauss-Newton type methods for policy optimization in MDPs, which retain certain terms of our Hessian decomposition in the preconditioner of a gradient-based policy search algorithm. The first method discards terms which are negligible near local optima and are difficult to approximate. The second method further discards an additional term which we cannot guarantee to be negative-definite. We provide an analysis of our Gauss-Newton methods and give several important performance guarantees for the second Gauss-Newton method:

We demonstrate that the preconditioning matrix is negative-definite when the controller is concave in the control parameters (detailing some widely used controllers for which this condition holds), guaranteeing that the search direction is an ascent direction.

We show that the method is invariant to affine transformations of the parameter space and thus does not suffer the significant drawback of steepest ascent.

We provide a convergence analysis, demonstrating linear convergence to local optima, in terms of the step size of the update. One key practical benefit of this analysis is that the step size for the incremental update can be chosen independently of unknown quantities, while retaining a guarantee of convergence.

The preconditioner has a particular form which enables the ascent direction to be computed particularly efficiently via a Hessian-free conjugate gradient method in large parameter spaces.


In Section 5 we present a unifying perspective for several policy search methods. In particular, we relate the search direction of our second Gauss-Newton algorithm to that of Expectation Maximization (which provides new insights into the latter algorithm when used for policy search), and we also discuss its relationship to the natural gradient algorithm.

In Section 6 we present experiments demonstrating state-of-the-art performance on challenging domains including Tetris and robotic arm applications.
2 Preliminaries and Background
In Section 2.1 we introduce Markov decision processes, along with some standard terminology relating to these models that will be required throughout the paper. In Section 2.2 we introduce policy search methods and detail several key algorithms from the literature.
2.1 Markov Decision Processes
In a Markov decision process an agent, or controller, interacts with an environment over the course of a planning horizon. At each point in the planning horizon the agent selects an action (based on the current state of the environment) and receives a scalar reward. The amount of reward received depends on the selected action and the state of the environment. Once an action has been performed the system transitions to the next point in the planning horizon, and the new state of the environment is determined (often in a stochastic manner) by the action the agent selected and the current state of the environment. The optimality of an agent’s behaviour is measured in terms of the total reward the agent can expect to receive over the course of the planning horizon, so that optimal control is obtained when this quantity is maximized.
Formally a MDP is described by the tuple , in which and are sets, known respectively as the state and action space, is the initial state distribution, which is a distribution over the state space, is the transition dynamics and is formed of the set of conditional distributions over the state space, , and is the (deterministic) reward function, which is assumed to be bounded and nonnegative. Given a planning horizon, , and a timepoint in the planning horizon, , we use the notation and
to denote the random variable of the state and action of the
timepoint, respectively. The state at the initial timepoint is determined by the initial state distribution, . At any given timepoint, , and given the state of the environment, the agent selects an action, , according to the policy . The state of the next point in the planning horizon is determined according to the transition dynamics, . This process of selecting actions and transitioning to a new state is iterated sequentially through all of the timepoints in the planning horizon. At each point in the planning horizon the agent receives a scalar reward, which is determined by the reward function. The objective of a MDP is to find the policy that maximizes a given function of the expected reward over the course of the planning horizon. In this paper we usually consider the infinite horizon discounted reward framework, so that the objective function takes the form
(1) 
where we use the semicolon to identify parameters of the distribution, rather than conditioning variables, and where the distribution of and , which we denote by , is given by the marginal at time
of the joint distribution over
, where , , denoted by . The discount factor, , in (1) ensures that the objective is bounded.
We use the notation to denote trajectories through the state-action space of length, . We use to denote trajectories that are of infinite length, and use to denote the space of all such trajectories. Given a trajectory, , we use the notation to denote the total discounted reward of the trajectory, so that
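For concreteness, a standard formulation of the total discounted reward of a trajectory is the following (the symbols here are chosen for illustration and may differ from the paper's own notation, including the indexing convention):

```latex
R(\tau) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t),
\qquad \tau = (s_0, a_0, s_1, a_1, \dots),
```

with the discount factor from (1) satisfying \(\gamma \in [0, 1)\), which together with the bounded reward assumption keeps the sum finite.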
Similarly, we use the notation
to denote the probability of generating the trajectory
under the policy . We now introduce several functions that are of central importance. The value function w.r.t. policy is defined as the total expected future reward given the current state,
(2) 
It can be seen that . The value function can also be written as the solution of the following fixedpoint equation,
(3) 
which is known as the Bellman equation (Bertsekas, 2010). The state-action value function w.r.t. policy is given by
(4) 
and gives the value of performing an action, in a given state, and then following the policy. Note that . Finally, the advantage function (Baird, 1993)
gives the relative advantage of an action in relation to the other actions available in that state, and it can be seen that , for each .
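In a standard notation (chosen here for illustration; the paper's own symbols may differ), these three functions are related as follows:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\Big],
\qquad
Q^{\pi}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^{\pi}(s')\big],
\\[4pt]
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),
\qquad
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi}(s, a)\big] = 0 .
```

The final identity holds because the value function is the policy-average of the state-action value function, so the advantage function has zero mean under the policy in every state.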
2.2 Policy Search Methods
In policy search methods the policy is given some differentiable parametric form, denoted with the policy parameter, and local information, such as the gradient of the objective function, is used to update the policy in a smooth, non-greedy manner. This process is iterated in an incremental manner until the algorithm converges to a local optimum of the objective function. Denoting the parameter space by , , we write the objective function directly in terms of the parameter vector, i.e.,
(5) 
while the trajectory distribution is written in the form
(6) 
Similarly, , and denote respectively the value function, state-action value function and the advantage function in terms of the parameter vector . We introduce the notation
(7) 
Note that the objective function can be written
(8) 
We shall consider two forms of policy search algorithm in this paper: gradient-based optimization methods and methods based on iteratively optimizing a lower-bound on the objective function. In gradient-based methods the update of the policy parameters takes the form
(9) 
where is the step size parameter and is some preconditioning matrix that possibly depends on . If is positive-definite and is sufficiently small, then such an update will increase the total expected reward. Provided that the preconditioning matrix is always negative-definite and the step size sequence is appropriately selected, by iteratively updating the policy parameters according to (9) the policy parameters will converge to a local optimum of (5). This generic gradient-based policy search algorithm is given in Algorithm 1. Gradient-based methods vary in the form of the preconditioning matrix used in the parameter update. The choice of the preconditioning matrix determines various aspects of the resulting algorithm, such as the computational complexity, the rate at which the algorithm converges to a local optimum and the invariance properties of the parameter update. Typically the gradient and the preconditioner will not be known exactly and must be approximated by collecting data from the system. In the context of reinforcement learning, the Expectation Maximization (EM) algorithm searches for the optimal policy by iteratively optimizing a lower-bound on the objective function. While the EM-algorithm does not have an update of the form given in (9), we shall see in Section 5.2 that the algorithm is closely related to such an update. We now review specific policy search methods.
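The generic scheme of Algorithm 1 can be sketched as follows. This is a minimal illustration (the function names and the toy objective are our own, not from the paper): the preconditioner is supplied as a callable, so steepest ascent, natural gradient and Gauss-Newton type variants differ only in that one argument.

```python
import numpy as np

def preconditioned_policy_search(grad_fn, precond_fn, theta0,
                                 step_size=0.1, n_iters=100):
    """Generic preconditioned gradient ascent (a sketch of Algorithm 1).

    grad_fn(theta)    -> estimate of the gradient of the objective at theta
    precond_fn(theta) -> preconditioning matrix for the update
    The update is: theta <- theta + step_size * M(theta) @ grad(theta);
    steepest ascent corresponds to the identity preconditioner.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta + step_size * precond_fn(theta) @ grad_fn(theta)
    return theta

# Toy usage: maximize the concave quadratic -||theta||^2 with the
# identity preconditioner (i.e., plain steepest ascent); the iterates
# contract towards the maximizer at the origin.
grad = lambda th: -2.0 * th
identity = lambda th: np.eye(len(th))
theta_star = preconditioned_policy_search(grad, identity, [1.0, -2.0])
```

In practice both `grad_fn` and `precond_fn` would be sample-based estimators built from trajectories, as discussed below.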
2.2.1 Steepest Gradient Ascent
Steepest gradient ascent corresponds to the choice , where denotes the identity matrix so that the parameter update takes the form:
Policy search update using steepest ascent
(10) 
The gradient can be written in a relatively simple form using the following theorem (Sutton et al., 2000):
Theorem 1 (Policy Gradient Theorem (Sutton et al., 2000)).
It is not possible to calculate the gradient exactly for most real-world MDPs of interest. For instance, in discrete domains the size of the state-action space may be too large for enumeration over these sets to be feasible. Alternatively, in continuous domains the presence of nonlinearities in the transition dynamics makes the calculation of the occupancy marginals an intractable problem. Various techniques have been proposed in the literature to estimate the gradient, including the method of finite-differences (Kiefer and Wolfowitz, 1952; Kohl and Stone, 2004; Tedrake and Zhang, 2005), simultaneous perturbation methods (Spall, 1992; Spall and Cristion, 1998; Srinivasan et al., 2006) and likelihood-ratio methods (Glynn, 1986, 1990; Williams, 1992; Baxter and Bartlett, 2001; Konda and Tsitsiklis, 2003, 1999; Sutton et al., 2000; Bhatnagar et al., 2009; Kober and Peters, 2011). Likelihood-ratio methods, which originated in the statistics literature and were later applied to MDPs, are now the prominent method for estimating the gradient. There are numerous such methods in the literature, including Monte-Carlo methods (Williams, 1992; Baxter and Bartlett, 2001) and actor-critic methods (Konda and Tsitsiklis, 2003, 1999; Sutton et al., 2000; Bhatnagar et al., 2009; Kober and Peters, 2011).
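As a minimal illustration of the likelihood-ratio idea (the problem and all names here are our own, not from the paper), the following sketch estimates the gradient of the expected reward of a one-step softmax policy by averaging reward-weighted score functions over sampled actions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_gradient(theta, reward_fn, n_samples=5000):
    """Likelihood-ratio (REINFORCE-style) gradient estimate for a
    one-step problem: the average of reward(a) * grad log pi(a; theta)."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(theta), p=probs)
        score = -probs.copy()        # grad log softmax(theta)[a] = e_a - probs
        score[a] += 1.0
        grad += reward_fn(a) * score
    return grad / n_samples

# Toy two-armed bandit: action 1 pays reward 1, action 0 pays 0, so the
# estimated gradient pushes probability mass towards action 1.
g = reinforce_gradient(np.zeros(2), lambda a: float(a == 1))
```

For full MDPs the same identity is applied trajectory-wise, with the score of the whole trajectory weighted by its discounted return, and baselines are typically subtracted to reduce the variance of the estimate.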
Steepest gradient ascent is known to perform poorly on objective functions that are poorly-scaled, that is, when changes to some parameters produce much larger variations in the function than changes to other parameters. In this case steepest gradient ascent zigzags along the ridges of the objective in the parameter space (see e.g., Nocedal and Wright, 2006). It can be extremely difficult to gauge an appropriate scale for the step sizes in poorly-scaled problems, and the robustness of optimization algorithms to poor scaling is of significant practical importance in reinforcement learning, since line search procedures to find a suitable step size are often impractical.
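The effect of poor scaling can be illustrated on a deterministic toy objective (our own example, not from the paper): on a poorly-scaled concave quadratic, the stability of steepest ascent is dictated by the steep direction, so progress along the flat direction crawls, whereas an appropriately preconditioned update is unaffected.

```python
import numpy as np

# A poorly-scaled concave objective: f(x) = -0.5 * (x1^2 + 100 * x2^2).
H = np.diag([-1.0, -100.0])      # Hessian of f (negative-definite)
grad = lambda x: H @ x           # gradient of f

def ascend(x, precond, step, n_iters):
    for _ in range(n_iters):
        x = x + step * precond @ grad(x)
    return x

x0 = np.array([1.0, 1.0])
# Steepest ascent: stability forces step < 2/100, so movement along the
# flat x1 direction is very slow (50 iterations barely reduce x1).
x_steep = ascend(x0, np.eye(2), 0.01, 50)
# Preconditioning with the inverse negated Hessian rescales both
# directions equally and reaches the optimum in a single unit step.
x_newton = ascend(x0, np.linalg.inv(-H), 1.0, 1)
```

This scale-invariance of Hessian-based preconditioning is precisely the property the natural gradient and Gauss-Newton methods discussed below aim to recover at acceptable cost.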
2.2.2 Natural Gradient Ascent
Natural gradient ascent techniques originated in the neural network and blind source separation literature
(Amari, 1997, 1998; Amari et al., 1996, 1992), and were introduced into the policy search literature in Kakade (2002). To address the issue of poor scaling, natural gradient methods take the perspective that the parameter space should be viewed with a manifold structure in which the distance between points on the manifold captures the discrepancy between the models induced by different parameter vectors. In natural gradient ascent in (9), with denoting the Fisher information matrix, the parameter update takes the form:
Policy search update using natural gradient ascent
(12) 
In the case of Markov decision processes the Fisher information matrix takes the form,
(13) 
which can then be viewed as imposing a local norm on the parameter space which is a second-order approximation to the KL-divergence between induced policy distributions. When the trajectory distribution satisfies the Fisher regularity conditions (Lehmann and Casella, 1998) there is an alternative, equivalent, form of the Fisher information matrix given by
(14) 
There are several desirable properties of the natural gradient approach: the Fisher information matrix is always positive-definite, regardless of the policy parametrization; the search direction is invariant to the parametrization of the policy (Bagnell and Schneider, 2003; Peters and Schaal, 2008). Additionally, when using a compatible function approximator (Sutton et al., 2000) within an actor-critic framework, the optimal critic parameters coincide with the natural gradient. Furthermore, natural gradient ascent has been shown to perform well in some difficult MDP environments, including Tetris (Kakade, 2002) and several challenging robotics problems (Peters and Schaal, 2008). Theoretically, however, the rate of convergence of natural gradient ascent is the same as that of steepest gradient ascent, i.e., linear, although it has been noted to be substantially faster in practice.
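A sample-based natural gradient step can be sketched as follows (a minimal illustration with our own names, not the paper's implementation): the Fisher matrix is estimated from score vectors as in (14), and the natural direction is obtained by solving a linear system rather than forming an explicit inverse.

```python
import numpy as np

def natural_gradient_direction(scores, grad, ridge=1e-6):
    """Natural gradient direction from sampled score vectors.

    scores : (N, d) array of grad-log-policy samples; the Fisher
             matrix is estimated as F ~ scores.T @ scores / N.
    grad   : (d,) vanilla gradient estimate.
    Returns the solution of F w = grad (no explicit inversion).
    The small ridge term keeps the estimated F invertible.
    """
    N = scores.shape[0]
    F = scores.T @ scores / N + ridge * np.eye(scores.shape[1])
    return np.linalg.solve(F, grad)

# Sanity check: for isotropic unit-variance scores the estimated Fisher
# matrix is close to the identity, so the natural gradient direction
# roughly coincides with the vanilla gradient.
rng = np.random.default_rng(1)
scores = rng.standard_normal((20000, 3))
g = np.array([1.0, -2.0, 0.5])
d = natural_gradient_direction(scores, g)
```

When the score distribution is anisotropic, the solve rescales the gradient so that directions in which the policy distribution changes rapidly receive smaller steps.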
2.2.3 Expectation Maximization
An alternative optimization procedure that has been the focus of much research in the planning and reinforcement learning communities is the EM-algorithm (Dayan and Hinton, 1997; Toussaint et al., 2006, 2011; Kober and Peters, 2009, 2011; Hoffman et al., 2009; Furmston and Barber, 2009, 2010). The EM-algorithm is a powerful optimization technique popular in the statistics and machine learning community (see e.g., Dempster et al., 1977; Little and Rubin, 2002; Neal and Hinton, 1999) that has been successfully applied to a large number of problems. See Barber (2011) for a general overview of some of the applications of the algorithm in the machine learning literature. Among the strengths of the algorithm are its guarantee of increasing the objective function at each iteration, its often simple update equations and its generalization to highly intractable models through variational Bayes approximations (Saul et al., 1996).
Given the advantages of the EM-algorithm it is natural to extend the algorithm to the MDP framework. Several derivations of the EM-algorithm for MDPs exist (Kober and Peters, 2011; Toussaint et al., 2011). For reference we state the lower-bound upon which the algorithm is based in the following theorem.
Theorem 2.
Suppose we are given a Markov Decision Process with objective (5) and Markovian trajectory distribution (6). Given any distribution, , over the space of trajectories, , then the following bound holds,
(15) 
in which denotes the entropy function (Barber, 2011).
Proof.
The proof is based on an application of Jensen’s inequality and can be found in Kober and Peters (2011). ∎
The distribution, , in Theorem 2 is often referred to as the variational distribution. An EM-algorithm is obtained through coordinate-wise optimization of (15) with respect to the variational distribution (the E-step) and the policy parameters (the M-step). In the E-step the lower-bound is optimized when , in which are the current policy parameters. In the M-step the lower-bound is optimized with respect to , which, given and the Markovian structure of , is equivalent to optimizing the function,
(16) 
with respect to the first parameter, . The E-step and M-step are iterated in this manner until the policy parameters converge to a local optimum of the objective function.
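For concreteness, in the standard derivation of this scheme (e.g., Kober and Peters, 2011; the symbols here are chosen for illustration and may differ from the paper's notation), with objective \(U(\theta) = \int R(\tau)\, p(\tau; \theta)\, d\tau\), the bound of Theorem 2 and the E-step/M-step quantities take the form:

```latex
\log U(\theta) \;\ge\; \mathbb{E}_{q(\tau)}\big[\log\big(R(\tau)\, p(\tau; \theta)\big)\big] + H(q),
\\[4pt]
q(\tau) \;\propto\; R(\tau)\, p(\tau; \theta_k)
\quad \text{(E-step)}, \qquad
\theta_{k+1} \in \arg\max_{\theta}\; \mathbb{E}_{q(\tau)}\big[\log p(\tau; \theta)\big]
\quad \text{(M-step)},
```

where only the policy terms of \(\log p(\tau; \theta)\) depend on \(\theta\), so the M-step reduces to a reward-weighted maximum-likelihood fit of the policy.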
3 The Hessian of Markov Decision Processes
As noted in Section 1, the Newton method suffers from issues that often make its application to MDPs unattractive in practice. As a result there has been comparatively little research into the Newton method in the policy search literature. However, the Newton method has significant attractive properties, such as invariance to affine transformations of the policy parametrization and a quadratic rate of convergence. It is of interest, therefore, to consider whether one can construct an efficient Gauss-Newton type method for MDPs, in which the positive aspects of the Newton method are maintained and the negative aspects are alleviated. To this end, in this section we provide an analysis of the Hessian of a MDP. This analysis will then be used in Section 4 to propose Gauss-Newton type methods for MDPs.
In Section 3.1 we provide a novel representation of the Hessian of a MDP, in Section 3.2 we detail the definiteness properties of certain terms in the Hessian and in Section 3.3 we analyse the behaviour of individual terms of the Hessian in the vicinity of a local optimum.
3.1 The Policy Hessian Theorem
There is a standard expansion of the Hessian of a MDP in the policy search literature (Baxter and Bartlett, 2001; Kakade, 2001, 2002) that, as with the gradient, takes a relatively simple form. This is summarized in the following result.
Theorem 3 (Policy Hessian Theorem).
We remark that and are relatively simple to estimate, in the same manner as estimating the policy gradient. The term is more difficult to estimate since it contains terms involving the unknown gradient and removing this dependence would result in a double sum over state-actions.
Below we will present a novel form for the Hessian of a MDP, with attention given to the term in (17), which will require the following notion of parametrization with constant curvature.
Definition 1.
A policy parametrization is said to have constant curvature with respect to the action space if, for each , the Hessian of the log-policy, , does not depend upon the action, i.e.,
When a policy parametrization satisfies this property the notation, , is used to denote , for each .
A common class of policy which satisfies the property of Definition 1 is, , in which is a vector of features that depends on the state-action pair, . Under this parametrization,
which does not depend on, . In the case when the action space is continuous, the policy parametrization , in which is a given feature map, satisfies the properties of Definition 1 with respect to the mean parameters, .
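The constant-curvature property of the softmax class can be checked numerically; the following sketch (feature values and code names are our own illustration, not from the paper) computes the Hessian of the log-policy by finite differences for two different actions and finds the same matrix, which is also negative-semidefinite, in line with Theorem 5.

```python
import numpy as np

def log_policy(theta, phi, a):
    """Log of a softmax policy pi(a) proportional to exp(theta . phi[a]),
    with phi an (n_actions x d) array of per-action feature vectors."""
    z = phi @ theta
    return z[a] - (z.max() + np.log(np.exp(z - z.max()).sum()))

def hessian_fd(f, theta, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at theta."""
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(theta + eps * (ei + ej)) - f(theta + eps * ei)
                       - f(theta + eps * ej) + f(theta)) / eps ** 2
    return H

phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 actions, 2 features
theta = np.array([0.3, -0.5])
H0 = hessian_fd(lambda t: log_policy(t, phi, 0), theta)
H1 = hessian_fd(lambda t: log_policy(t, phi, 1), theta)
# H0 and H1 agree because log-policies for different actions differ only
# by a linear function of theta: constant curvature (Definition 1).
```

Analytically, the common Hessian equals minus the covariance of the features under the policy, which is why it is negative-semidefinite for every parameter vector.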
We now present a novel decomposition of the Hessian for Markov decision processes.
Theorem 4.
Proof.
See Section A.2 in the Appendix. ∎
We now present an analysis of the terms of the policy Hessian, simplifying the expansion and demonstrating conditions under which certain terms disappear. The analysis will be used to motivate our Gauss-Newton methods in Section 4.
3.2 Analysis of the Policy Hessian – Definiteness
An interesting comparison can be made between the expansions (17) and (21, 22) in terms of the definiteness properties of the component matrices. As the state-action value function is nonnegative over the entire state-action space, it can be seen that is positive-definite for all . Similarly, it can be shown that under certain common policy parametrizations is negative-definite over the entire parameter space. This is summarized in the following theorem.
Theorem 5.
The matrix is negative-definite for all if: 1) the policy is concave with respect to the policy parameters; or 2) the policy parametrization has constant curvature with respect to the action space.
Proof.
See Section A.3 in the Appendix. ∎
It can be seen, therefore, that when the policy parametrization satisfies the properties of Theorem 5 the expansion (17) gives in terms of a positive-definite term, , a negative-definite term, , and a remainder term, , which we shall show, in Section 3.3, becomes negligible around a local optimum when given a sufficiently rich policy parametrization. In contrast to the state-action value function, the advantage function takes both positive and negative values over the state-action space. As a result, the matrices and in (21, 22) can be indefinite over parts of the parameter space.
3.3 Analysis in Vicinity of a Local Optimum
In this section we consider the term , which is both difficult to estimate and not guaranteed to be negative-definite. In particular, we shall consider the conditions under which this term vanishes at a local optimum. We start by noting that
(23) 
This means that if , for all , then . It is sufficient, therefore, to require that , for all , at a local optimum . We therefore consider the situations in which this occurs. We start by introducing the notion of a value consistent policy class. This property captures the idea that the policy class is rich enough that changing a parameter to maximally improve the value in one state does not worsen the value in another state; i.e., when a policy class is value consistent, there are no trade-offs between improving the value in different states.
Definition 2.
A policy parametrization is said to be value consistent w.r.t. a Markov decision process if whenever,
(24) 
for some , and , then it holds that either
(25) 
or
(26) 
Furthermore, for any state, , for which (26) holds it also holds that
The notation is used to denote the standard basis vector of in which the component is equal to one, and all other components are equal to zero.
Example.
To illustrate the concept of a value consistent policy parametrization we now consider two simple maze navigation MDPs, one with a value consistent policy parametrization, and one with a policy parametrization that is not value consistent. The two MDPs are displayed in Figure 1. Walls of the maze are solid lines, while the dotted lines indicate state boundaries and are passable. The agent starts, with equal probability, in one of the states marked with an ‘S’. The agent receives a positive reward for reaching the goal state, which is marked with a ‘G’, and is then reset to one of the start states. All other state-action pairs return a reward of zero. There are four possible actions (up, down, left, right) in each state, and the optimal policy is to move, with probability one, in the direction indicated by the arrow. We consider the policy parametrization, , where denotes the successor state of state-action pair and is a feature map. We consider the feature map which indicates the presence of a wall on each of the four state boundaries. Perceptual aliasing (Whitehead, 1992) occurs in both MDPs under this policy parametrization, with states , & aliased in the hallway problem, and states , & aliased in McCallum’s grid. In the hallway problem all of the aliased states have the same optimal action, and the value of these states all increase/decrease in unison. Hence, it can be seen that the policy parametrization is value consistent for the hallway problem. In McCallum’s grid, however, the optimal action for states & is to move upwards, while in state it is to move downwards. In this example increasing the probability of moving downwards in state will also increase the probability of moving downwards in states & . There is a point, therefore, at which increasing the probability of moving downwards in state will decrease the value of states & . Thus this policy parametrization is not value consistent for McCallum’s grid.
We now show that tabular policies – i.e., policies such that, for each state , the conditional distribution is parametrized by a separate parameter vector for some – are value consistent, regardless of the given Markov decision process.
Theorem 6.
Suppose that a given Markov decision process has a tabular policy parametrization, then the policy parametrization is value consistent.
Proof.
See Section A.4 in the Appendix. ∎
We now show that under a value consistent policy parametrization the terms and vanish near local optima.
Theorem 7.
Suppose that is a local optimum of the differentiable objective function,
. Suppose that the Markov chain induced by
is ergodic. Suppose that the policy parametrization is value consistent w.r.t. the given Markov decision process. Then is a stationary point of for all , and
Proof.
See Section A.5 in the Appendix. ∎
Furthermore, when we have the additional condition that the gradient of the value function is continuous in (at ) then as . This condition will be satisfied if, for example, the policy is continuously differentiable w.r.t. the policy parameters.
Example (continued).
Returning to the MDPs given in Figure 1, we now empirically observe the behaviour of the term as the policy approaches a local optimum of the objective function. Figure 2 gives the magnitude of , in terms of the spectral norm, in relation to the distance from the local optimum. In correspondence with the theory, as in the hallway problem, while this is not the case in McCallum’s grid. This simple example illustrates the fact that if the feature representation is well-chosen and sufficiently rich the term vanishes in the vicinity of a local optimum.
4 Gauss-Newton Methods for Markov Decision Processes
In this section we propose several Gauss-Newton-type methods for MDPs, motivated by the analysis of Section 3. The algorithms are outlined in Section 4.1, and a performance analysis is provided in Section 4.2.
4.1 The Gauss-Newton Methods
The first Gauss-Newton method we propose drops the Hessian terms which are difficult to estimate, but are expected to be negligible in the vicinity of local optima. Specifically, it was shown in Section 3.3 that if the policy parametrization is value consistent with a given MDP, then as converges towards a local optimum of the objective function. Similarly, if the policy parametrization is sufficiently rich, although not necessarily value consistent, then it is to be expected that will be negligible in the vicinity of a local optimum. In such cases , as defined in Theorem 4, will be a good approximation to the Hessian in the vicinity of a local optimum. For this reason, the first Gauss-Newton method that we propose for MDPs is to precondition the gradient with in (9), so that the update is of the form:
Policy search update using the first Gauss-Newton method
(27) 
When the policy parametrization has constant curvature with respect to the action space and it is sufficient to calculate just .
The second Gauss-Newton method we propose removes further terms from the Hessian which are not guaranteed to be negative-definite. As was seen in Section 3.1, when the policy parametrization satisfies the properties of Theorem 5 then is negative-definite over the entire parameter space. Recall that in (9) it is necessary that is positive-definite (in the Newton method this corresponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective function. That is negative-definite over the entire parameter space is therefore a highly desirable property of a preconditioning matrix, and for this reason the second Gauss-Newton method that we propose for MDPs is to precondition the gradient with in (9), so that the update is of the form:
Policy search update using the second Gauss-Newton method
(28) 
We shall see that the second Gauss-Newton method has important performance guarantees, including: a guaranteed ascent direction; linear convergence to a local optimum under a step size which does not depend upon unknown quantities; invariance to affine transformations of the parameter space; and efficient estimation procedures for the preconditioning matrix. We will also show, in Section 5, that the second Gauss-Newton method is closely related to both the EM and natural gradient algorithms.
We shall also consider a diagonal form of the approximation for both Gauss-Newton methods. Denoting the diagonal matrix formed from the diagonal elements of and by and , respectively, then we shall consider the methods that use and in (9). We call these methods the diagonal first and second Gauss-Newton methods, respectively. This diagonalization amounts to performing the approximate Newton methods on each parameter independently, but simultaneously.
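As a sketch of how these updates look in code, assuming estimates of the gradient and of a negative-definite preconditioning matrix are already available (the function names and the toy quadratic objective below are illustrative, not from the paper):

```python
import numpy as np

def gauss_newton_step(theta, grad, H, alpha):
    """Full update: theta <- theta - alpha * H^{-1} grad, where H is a
    negative-definite preconditioner, so -H^{-1} grad is an ascent direction."""
    return theta - alpha * np.linalg.solve(H, grad)

def diagonal_gauss_newton_step(theta, grad, H, alpha):
    """Diagonal variant: precondition each parameter independently."""
    return theta - alpha * grad / np.diag(H)

# Toy concave quadratic objective f(theta) = -0.5 theta' A theta + b' theta
# (invented for illustration); its Hessian is -A.
A = np.diag([2.0, 5.0])
b = np.array([1.0, 1.0])
theta = np.zeros(2)
grad = b - A @ theta
# With the exact Hessian as preconditioner and a unit step size, the
# optimum of the quadratic is reached in a single step.
theta = gauss_newton_step(theta, grad, -A, 1.0)
assert np.allclose(theta, np.linalg.solve(A, b))
```

The diagonal variant replaces the linear solve with an elementwise division, which is what makes the per-sample cost of the diagonal methods so much lower.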
4.1.1 Estimation of the Preconditioners and the Gauss-Newton Update Direction
It is possible to extend typical techniques used to estimate the policy gradient to estimate the preconditioner for the Gauss-Newton method, by including either the Hessian of the policy, the outer product of the derivative of the policy, or the respective diagonal terms. As an example, in Section B.1 of the Appendix we detail the extension of the recurrent state formulation of gradient evaluation in the average reward framework (Williams, 1992) to the second Gauss-Newton method. We use this extension in the Tetris experiment that we consider in Section 6. Given sampled state-action pairs, the complexity of this extension scales as for the second Gauss-Newton method, while it scales as for the diagonal version of the algorithm.
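A minimal sketch of such a sample-based estimator follows; it assumes the user supplies per-sample Hessians (or diagonal Hessians) of the log-policy together with state-action value estimates, and all names are illustrative rather than from the paper:

```python
import numpy as np

def estimate_preconditioner(samples, hess_log_policy, n_params):
    """Monte Carlo estimate of a Q-weighted log-policy Hessian preconditioner.

    samples: iterable of (state, action, q_value) triples, where q_value is
    an estimate of the state-action value.
    hess_log_policy: user-supplied function returning the Hessian of
    log pi(a|s; theta) w.r.t. theta as an (n_params, n_params) array.
    """
    H = np.zeros((n_params, n_params))
    n = 0
    for state, action, q in samples:
        H += q * hess_log_policy(state, action)
        n += 1
    return H / max(n, 1)

def estimate_diagonal_preconditioner(samples, diag_hess_log_policy, n_params):
    """Diagonal variant: only n_params entries are accumulated per sample,
    which is the source of the cheaper scaling noted in the text."""
    d = np.zeros(n_params)
    n = 0
    for state, action, q in samples:
        d += q * diag_hess_log_policy(state, action)
        n += 1
    return d / max(n, 1)
```

The full estimator touches a matrix per sample while the diagonal one touches a vector, matching the complexity gap stated above.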
We provide more details of situations in which the inversion of the preconditioning matrices can be performed more efficiently in Section B.2 of the Appendix. Finally, for the second Gauss-Newton method the ascent direction can be estimated particularly efficiently, even for large parameter spaces, using a Hessian-free conjugate-gradient approach, which is detailed in Section B.3 of the Appendix.
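One possible sketch of the Hessian-free approach, assuming only a user-supplied matrix-vector product with the preconditioning matrix is available (the function names are illustrative): since the second Gauss-Newton preconditioner is negative-definite, its negation is positive-definite and conjugate gradients applies directly.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cg_ascent_direction(hvp, grad, n_params):
    """Hessian-free ascent direction via conjugate gradients.

    hvp(v) returns the product of the (negative-definite) preconditioning
    matrix with a vector v; the matrix is never formed explicitly. We solve
    (-H) d = grad, where -H is positive-definite as CG requires, giving the
    ascent direction d = -H^{-1} grad.
    """
    op = LinearOperator((n_params, n_params), matvec=lambda v: -hvp(v))
    d, info = cg(op, grad)
    if info != 0:
        raise RuntimeError("conjugate gradient did not converge")
    return d  # update: theta <- theta + alpha * d
```

Only matrix-vector products are needed, so the memory cost stays linear in the number of parameters even when the preconditioner itself would be too large to store.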
4.2 Performance Guarantees and Analysis
4.2.1 Ascent Directions
In general the objective (5) is not concave, which means that the Hessian will not be negative-definite over the entire parameter space. In such cases the Newton method can actually lower the objective, which is an undesirable property. We now consider ascent directions for the Gauss-Newton methods, and in particular demonstrate that the proposed second Gauss-Newton method guarantees an ascent direction in typical settings.
Ascent directions for the first Gauss-Newton method:
As mentioned previously, the matrix will typically be indefinite, and so a straightforward application of the first Gauss-Newton method will not necessarily result in an increase in the objective function. There are, however, standard correction techniques that one could consider to ensure that an increase in the objective function is obtained, such as adding a ridge term to the preconditioning matrix. A survey of such correction techniques can be found in Boyd and Vandenberghe (2004).
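One such ridge correction can be sketched as follows (a minimal illustration, assuming access to the full preconditioning matrix): shift the spectrum until every eigenvalue is negative, so the corrected matrix again yields an ascent direction.

```python
import numpy as np

def ridge_correct(H, margin=1e-6):
    """Shift an indefinite preconditioner to make it negative-definite.

    Subtracts lambda * I, with lambda chosen from the largest eigenvalue,
    a standard ridge-style correction for maximization problems.
    """
    lam_max = np.max(np.linalg.eigvalsh(H))
    if lam_max >= -margin:
        H = H - (lam_max + margin) * np.eye(H.shape[0])
    return H
```

The eigenvalue computation here is for clarity; in practice a cheaper bound on the largest eigenvalue, or a trial-and-increase ridge, is typically used.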
Ascent directions for the second Gauss-Newton method:
It was seen in Theorem 5 that will be negative-definite over the entire parameter space if either the policy is concave with respect to the policy parameters, or the policy has constant curvature with respect to the action space. It follows that in such cases an increase of the objective function will be obtained when using the second Gauss-Newton method with a sufficiently small step size. Additionally, the diagonal terms of a negative-definite matrix are negative, so that is negative-definite whenever is negative-definite, and thus similar performance guarantees exist for the diagonal version of the second Gauss-Newton algorithm.
To motivate this result we now briefly consider some widely used policies that are either concave or blockwise concave. Firstly, consider the Gibbs policy, , in which is a feature vector. This policy is widely used in discrete systems and is concave in , which can be seen from the fact that is the sum of a linear term and a negative log-sum-exp term, both of which are concave (Boyd and Vandenberghe, 2004). In systems with a continuous state-action space a common choice of controller is , in which is a feature vector. This controller is not jointly concave in and , but it is blockwise concave in and . In terms of the policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of the policy consists of a linear term and a determinant term, both of which are concave.
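The concavity of the Gibbs policy in its parameters can also be checked numerically. In the sketch below (illustrative shapes and random features, assuming per-action feature vectors with a shared weight vector), the Hessian of the log-policy is the negative covariance of the features under the policy, and is therefore negative-semidefinite at every parameter value:

```python
import numpy as np

def log_policy_hessian(theta, feats):
    """Hessian of log pi(a|s; theta) for a Gibbs policy with per-action
    feature vectors feats, shape (n_actions, n_features).

    The Hessian equals minus the covariance of the features under the
    policy, so it is independent of which action a is taken.
    """
    logits = feats @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    mean = p @ feats
    cov = (feats * p[:, None]).T @ feats - np.outer(mean, mean)
    return -cov

rng = np.random.default_rng(1)
feats = rng.normal(size=(5, 3))   # 5 actions, 3 features (illustrative)
theta = rng.normal(size=3)
H = log_policy_hessian(theta, feats)
# Negative-semidefiniteness confirms concavity of log pi in theta.
assert np.all(np.linalg.eigvalsh(H) <= 1e-10)
```

Since a covariance matrix is always positive-semidefinite, the check succeeds for any choice of features and parameters, mirroring the log-sum-exp argument in the text.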
4.2.2 Affine Invariance
An undesirable aspect of steepest gradient ascent is that its performance is dependent on the choice of basis used to represent the parameter space. An important and desirable property of the Newton method is that it is invariant to nonsingular affine transformations of the parameter space (Boyd and Vandenberghe, 2004). This means that given a nonsingular affine mapping, , the Newton update of the objective is related to the Newton update of the original objective through the same affine mapping, i.e., , in which and and denote the respective Newton steps. A method is said to be scale invariant if it is invariant to nonsingular rescalings of the parameter space. In this case the mapping , is given by a nonsingular diagonal matrix. The proposed approximate Newton methods have various invariance properties, which are summarized in the following theorem.
Theorem 8.
The first and second Gauss-Newton methods are invariant to (nonsingular) affine transformations of the parameter space. The diagonal versions of these algorithms are invariant to (nonsingular) rescalings of the parameter space.
Proof.
See Section A.6 in the Appendix. ∎
4.2.3 Convergence Analysis
We now provide a local convergence analysis of the Gauss-Newton framework. We shall focus on the full Gauss-Newton methods, with the analysis of the diagonal Gauss-Newton methods following similarly. Additionally, we shall focus on the case in which a constant step size is considered throughout, which is denoted by . We say that an algorithm converges linearly to a limit at a rate if
If then the algorithm converges superlinearly. We denote the parameter update function of the first and second Gauss-Newton methods by and , respectively, so that and . Given a matrix, we denote the spectral radius of by , where
are the eigenvalues of
. Throughout this section we shall use to denote .
Theorem 9 (Convergence analysis for the first Gauss-Newton method).
Suppose that is such that and is invertible, then is Fréchet differentiable at and takes the form,
(29) 
If and are negative-definite, and the step size is in the range,
(30) 
then is a point of attraction of the first Gauss-Newton method, the convergence is at least linear and the rate is given by . When the policy parametrization is value consistent with respect to the given Markov decision process, then (29) simplifies to
(31) 
and whenever then is a point of attraction of the first Gauss-Newton method, and the convergence to is linear if with a rate given by , and convergence is superlinear when .
Proof.
See Section A.7 in the Appendix. ∎
Additionally we make the following remarks for the case when the policy parametrization is not value consistent with respect to the given Markov decision process. For simplicity, we shall consider the case in which . In this case takes the form,
From the analysis in Section 3.3 we expect that when the policy parametrization is rich, but not value consistent with respect to the given Markov decision process, will generally be small. In this case the first Gauss-Newton method will converge linearly, and the rate of convergence will be close to zero.
Theorem 10 (Convergence analysis for the second Gauss-Newton method).
Suppose that is such that and is invertible, then is Fréchet differentiable at and takes the form,
(32) 
If is negative-definite and the step size is in the range,
(33) 
then is a point of attraction of the second Gauss-Newton method, convergence to is at least linear and the rate is given by . Furthermore, implies condition (33). When the policy parametrization is value consistent with respect to the given Markov decision process, then (32) simplifies to
(34) 
Proof.
See Section A.7 in the Appendix. ∎
The conditions of Theorem 10 look analogous to those of Theorem 9, but they differ in important ways: in Theorem 10 it is not necessary to assume that the preconditioning matrix is negative-definite, and while the step-size range in (30) will not be known in practice, the condition in Theorem 10 is more practical, i.e., for the second Gauss-Newton method convergence is guaranteed for a constant step size which is easily selected and does not depend upon unknown quantities.
It will be seen in Section 5.2 that the second Gauss-Newton method has a close relationship to the EM-algorithm. For this reason we postpone additional discussion of the rate of convergence of the second Gauss-Newton method until then.
5 Relation to Existing Policy Search Methods
In this section we detail the relationship between the second Gauss-Newton method and existing policy search methods: in Section 5.1 we detail the relationship with natural gradient ascent, and in Section 5.2 we detail the relationship with the EM-algorithm.
5.1 Natural Gradient Ascent and the Second Gauss-Newton Method
Comparing the form of the Fisher information matrix given in (13) with (19), it can be seen that there is a close relationship between natural gradient ascent and the second Gauss-Newton method: in there is an additional weighting of the integrand from the state-action value function. Hence, incorporates information about the reward structure of the objective function that is not present in the Fisher information matrix.
We now consider how this additional weighting affects the search direction for natural gradient ascent and the Gauss-Newton approach. Given a norm on the parameter space, , the steepest ascent direction at with respect to that norm is given by,
Natural gradient ascent is obtained by considering the (local) norm given by
with as in (14). The natural gradient method allows less movement in the directions that have high norm which, as can be seen from the form of (14), are those directions that induce large changes to the policy over the parts of the state-action space that are likely to be visited under the current policy parameters. More movement is allowed in directions that either induce a small change in the policy, or induce large changes to the policy, but only in parts of the state-action space that are unlikely to be visited under the current policy parameters. In a similar manner the second Gauss-Newton method can be obtained by considering the (local) norm ,
so that each term in (13) is additionally weighted by the state-action value function, . Thus, the directions which have high norm are those in which the policy is rapidly changing in state-action pairs that are not only likely to be visited under the current policy, but also have high value. The second Gauss-Newton method therefore updates the parameters more carefully when the behaviour in high-value states is affected. Conversely, directions which induce a change only in state-action pairs of low value have low norm, and larger increments can be made in those directions.
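The weighting difference between the two preconditioners can be sketched from samples. In the illustrative code below (the function names are assumptions, not from the paper), a user-supplied score function (the gradient of the log-policy) and state-action value estimates are combined into the empirical Fisher matrix and its Q-weighted counterpart, which differ only in the per-sample weight:

```python
import numpy as np

def fisher_and_weighted(samples, score):
    """Empirical Fisher matrix and its Q-weighted counterpart.

    samples: iterable of (state, action, q_value) triples.
    score(s, a): gradient of log pi(a|s; theta) w.r.t. theta.
    """
    F = None  # plain outer products of the score
    W = None  # outer products weighted by the state-action value
    n = 0
    for s, a, q in samples:
        g = score(s, a)
        outer = np.outer(g, g)
        F = outer if F is None else F + outer
        W = q * outer if W is None else W + q * outer
        n += 1
    return F / n, W / n
```

Directions along which the policy changes rapidly in high-value state-action pairs dominate W but not F, which is exactly the distinction drawn in the norms above.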
5.2 Expectation Maximization and the Second Gauss-Newton Method
It has previously been noted (Kober and Peters, 2011) that the parameter update of steepest gradient ascent and the EM-algorithm can be related through the function defined in (16). In particular, the gradient (11) evaluated at can be written in terms of as follows,
while the parameter update of the EM-algorithm is given by,
In other words, steepest gradient ascent moves in the direction that most rapidly increases with respect to the first variable, while the EM-algorithm maximizes with respect to the first variable. While this relationship is true, it is also quite a negative result. It states that in situations in which it is not possible to explicitly maximize with respect to its first variable, the alternative, in terms of the EM-algorithm, is a generalized EM-algorithm, which is equivalent to steepest gradient ascent. Given that the EM-algorithm is typically used to overcome the negative aspects of steepest gradient ascent, this is an undesirable alternative. It is possible to find the optimum of (16) numerically, but this is also undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of its parameter update, when the maximization over in (16) can be performed explicitly.
We now demonstrate that the step direction of the EM-algorithm has an underlying relationship with the second of our proposed Gauss-Newton methods. In particular, we show that under suitable regularity conditions the direction of the EM update, i.e., , is the same, up to first order, as the direction of the second Gauss-Newton method that uses in place of .
Theorem 11.
Suppose we are given a Markov decision process with objective (5) and Markovian trajectory distribution (6). Consider the parameter update (M-step) of Expectation Maximization at the iteration of the algorithm, i.e.,
Provided that is twice continuously differentiable in the first parameter we have that
(35) 
Additionally, in the case where the policy is quadratic the relation to the second Gauss-Newton method is exact, i.e., the second term on the r.h.s. of (35) is zero.
Proof.
See Section A.8 in the Appendix. ∎
Given a sequence of parameter vectors, , generated through an application of the EM-algorithm, then . This means that the rate of convergence of the EM-algorithm will be the same as that of the second Gauss-Newton method when considering a constant step size of one. We formalize this intuition and provide the convergence properties of the EM-algorithm when applied to Markov decision processes in the following theorem. This is, to our knowledge, the first formal derivation of the convergence properties for this application of the EM-algorithm.
Theorem 12.
Suppose that the sequence, , is generated by an application of the EM-algorithm, where the sequence converges to . Denoting the update operation of the EM-algorithm by , so that , then