1 Introduction
Many recent successes in reinforcement learning are driven by a class of algorithms called policy gradient methods. These methods search over a parameterized class of policies by performing stochastic gradient descent on a cost function capturing the cumulative expected cost incurred. Specifically, they aim to optimize over a smooth, and often stochastic, class of parametrized policies $\{\pi_\theta\}_{\theta \in \mathbb{R}^d}$. For discounted or episodic problems, they treat the scalar cost function $\ell(\theta) = \int J_{\pi_\theta}(s)\, d\rho(s)$, which averages the total cost-to-go function $J_{\pi_\theta}$ over a random initial state distribution $\rho$. Policy gradient methods perform stochastic gradient descent on $\ell(\cdot)$, following the iteration
$$\theta_{k+1} = \theta_k - \alpha_k \left( \nabla \ell(\theta_k) + \text{noise} \right).$$
Unfortunately, even for simple control problems solvable by classical methods, the total cost $\ell(\theta)$ is a nonconvex function of $\theta$. Typical of results concerning the black-box optimization of nonconvex functions, policy gradient methods are widely understood to converge asymptotically to a stationary point or a local minimum. Important theory guarantees this under technical conditions [36, 57, 4] and it is widely repeated in textbooks and surveys [43, 21, 56].
The reinforcement learning literature seems to provide almost no guarantees on the quality of the points to which policy gradient methods converge. Although these methods can be applied to a very broad class of problems, it is not clear whether they adequately address even simple and classical dynamic programming problems. Inspired by this disconnect, important recent work of Fazel et al. [16] showed that policy gradient on the space of linear policies for deterministic linear quadratic control problems converges to the global optimum, despite the nonconvexity of the objective. The authors provided an intricate analysis in this case, leveraging a variety of closed-form expressions available for linear-quadratic problems. Separate from the RL literature, Kunnumkal and Topaloglu [32] propose a stochastic approximation method for setting base-stock levels in inventory control. Surprisingly, despite nonconvexity of the objective, an intricate analysis quite different from that of Fazel et al. [16] establishes convergence to the global optimum.
Our work aims to construct a simple and more general understanding of the global convergence properties of policy gradient methods. As a consequence of our general framework, we can show that for several classic dynamic programming problems, policy gradient methods performed with respect to natural structured policy classes face no suboptimal local minima. More precisely, despite its nonconvexity, any stationary point of the policy gradient cost function is a global optimum. (Any point $\theta$ with $\nabla \ell(\theta) = 0$ is a stationary point of the function $\ell(\cdot)$.) The examples we treat include:
Example 1.
Softmax policies applied in finite state and action MDPs: Here, with $n$ states and $k$ actions, $\theta \in \mathbb{R}^{nk}$. The policy $\pi_\theta$ associates each state $s$ with a probability distribution $(\pi_\theta(s,1), \ldots, \pi_\theta(s,k))$ over actions, with $\pi_\theta(s,i) = e^{\theta_{si}} / \sum_{j=1}^{k} e^{\theta_{sj}}$. This set of policies contains all possible stochastic policies and its closure contains all possible policies.
Example 2.
Linear policies applied in linear quadratic control: Here, actions $a_t \in \mathbb{R}^k$ and states $s_t \in \mathbb{R}^n$ are vectors, states evolve according to $s_{t+1} = A s_t + B a_t + w_t$, where $w_t$ is i.i.d. Gaussian noise, and the goal is to minimize the cumulative discounted cost $\mathbb{E} \sum_{t=0}^{\infty} \gamma^t \left( a_t^\top R a_t + s_t^\top K s_t \right)$ for positive definite matrices $R$ and $K$. (The work of Fazel et al. [16] considers LQ control with a random initial state but does not consider noisy dynamics. Their objective is the total undiscounted cost-to-go over an infinite horizon. With noisy dynamics, this objective is infinite under all policies. We introduce discounting to keep the total cost-to-go finite.) It is known that a linear policy of the form $\pi_\theta(s) = \theta s$ for $\theta \in \mathbb{R}^{k \times n}$ is optimal for this problem. We assume $(A, B)$ are controllable, in which case the set of stable linear policies is nonempty.
Example 3.
Threshold policies applied in an optimal stopping problem: One classic optimal stopping problem is an asset selling problem, where at every time $t$, an agent observes i.i.d. offers $y_t \in \mathbb{R}$ and chooses a stopping time $\tau$ with the goal of maximizing $\mathbb{E}[\gamma^{\tau} y_{\tau}]$. We consider a somewhat richer contextual variant of this problem. In each round, the agent passively observes contextual information, $x_t \in \mathcal{X}$, which evolves according to an uncontrolled Markov chain with finite state space. The context reflects variables like the weather or economic indicators, which are not influenced by the offers but inform the likelihood of receiving high offers. Conditioned on the context $x_t$, $y_t$ is drawn i.i.d. from some bounded distribution $q_{x_t}(\cdot)$. The agent's objective is to solve $\sup_{\tau} \mathbb{E}[\gamma^{\tau} y_{\tau}]$, where the supremum is taken over stopping times adapted to the observations $\{(x_t, y_t) : t = 0, 1, 2, \ldots\}$. There are standard ways to cast such a stopping problem as an MDP with a particular state space. (See [5] or Appendix E.2.) The optimal policy in this setting has a threshold for each context $x$, and accepts an offer in that context if and only if it exceeds the threshold. To accommodate cases where the set of possible offers is discrete, while still using smooth policies, we consider randomized policies that map a state to a probability of accepting the offer, $\pi_\theta(x, y) \in [0, 1]$. For a vector $\theta = (\theta_0^x, \theta_1^x)_{x \in \mathcal{X}}$ we set $\pi_\theta(x, y) = f(\theta_0^x + \theta_1^x y)$, where $f(z) = 1/(1 + e^{-z})$ is the logistic function. While this policy is similar to the one in Example 1, it leverages the structure of the problem and hence has only $2|\mathcal{X}|$ parameters even if the set of possible offers is infinite.
Example 4.
Base-stock policies applied in finite horizon inventory control: The example we treat is known as a multi-period newsvendor problem with backlogged demands. The state $s_t \in \mathbb{R}$ of a seller's inventory evolves according to $s_{t+1} = s_t + a_t - w_t$, where $a_t$ is the quantity of inventory ordered from a supplier and $w_t$ is the random demand at time $t$. Negative values of $s_t$ indicate backlogged demand that must be filled in later periods. We allow for continuous inventory and order levels. Here we consider a finite horizon objective of minimizing $\mathbb{E} \sum_{t=0}^{H} \left( c\, a_t + b \max\{s_t + a_t - w_t, 0\} + p \max\{-(s_t + a_t - w_t), 0\} \right)$, where $c$ is a per-unit ordering cost, $b$ is a per-unit holding cost, and $p$ is a per-unit cost for backlogged demand. Only nonnegative orders are feasible. For a finite horizon problem, we consider the class of time-inhomogeneous base-stock policies, which are known to contain the optimal policy. Here $\theta = (\theta_0, \ldots, \theta_H)$ is a vector, and at time $t$ such a policy orders inventory $a_t = \max\{0, \theta_t - s_t\}$. That is, it orders enough inventory to reach a target level $\theta_t$, whenever feasible.
For each of these examples, simple experiments show that gradient descent with backtracking line search performed on converges rapidly to the global minimum. Sample plots for three of the problems are shown in Figure 1. For linear quadratic control we refer readers to Figure 1 in [16]. We have shared code here for reproducibility and full experiment details are also given in Appendix F.
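The kind of experiment described above can be imitated on a small scale. The following is a minimal sketch and not the authors' shared code: it runs gradient descent with an Armijo backtracking line search on the exact loss of a tabular softmax policy in a randomly generated finite MDP, and compares the result with the optimal loss found by value iteration. The random MDP, the helper names (`softmax`, `loss_and_grad`, `backtracking_gd`, `optimal_loss`), the line-search constants, and the normalization of the discounted occupancy measure are all assumptions made for illustration.

```python
import numpy as np

def softmax(theta):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(theta, P, g, rho, gamma):
    """Exact loss rho^T J and its gradient for a tabular softmax policy.

    P[s, a, s'] are transition probabilities and g[s, a] expected costs.
    """
    n = P.shape[0]
    pi = softmax(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)            # state kernel under pi
    g_pi = np.sum(pi * g, axis=1)                    # expected one-step cost
    J = np.linalg.solve(np.eye(n) - gamma * P_pi, g_pi)
    Q = g + gamma * P @ J
    # Discounted state occupancy, normalized to sum to one (an assumption).
    eta = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho)
    grad = eta[:, None] * pi * (Q - J[:, None]) / (1 - gamma)
    return rho @ J, grad

def backtracking_gd(theta, f, iters=300, shrink=0.5, armijo=1e-4):
    """Gradient descent; each step is shrunk until the Armijo condition holds."""
    losses, step = [], 1.0
    for _ in range(iters):
        loss, grad = f(theta)
        losses.append(loss)
        step *= 2.0  # start from twice the last accepted step
        while f(theta - step * grad)[0] > loss - armijo * step * np.sum(grad**2):
            step *= shrink
            if step < 1e-12:
                return theta, losses  # no acceptable step: numerically stationary
        theta = theta - step * grad
    losses.append(f(theta)[0])
    return theta, losses

def optimal_loss(P, g, rho, gamma, iters=2000):
    # Value iteration for J* = min_a [g + gamma * P J*].
    J = np.zeros(P.shape[0])
    for _ in range(iters):
        J = np.min(g + gamma * P @ J, axis=1)
    return rho @ J

rng = np.random.default_rng(0)
n, k, gamma = 4, 3, 0.9
P = rng.random((n, k, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n, k))
rho = np.full(n, 1.0 / n)

theta, losses = backtracking_gd(
    np.zeros((n, k)), lambda th: loss_and_grad(th, P, g, rho, gamma)
)
best = optimal_loss(P, g, rho, gamma)
print(f"initial {losses[0]:.4f}, final {losses[-1]:.4f}, optimal {best:.4f}")
```

Because every accepted step satisfies the Armijo condition, the recorded losses are non-increasing; how close the final loss gets to the optimum in a fixed number of iterations depends on the instance.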
Our work aims to understand this phenomenon. Why does gradient descent on a nonconvex function reach the global minimum? These examples share important structural properties. Consider a linear quadratic control problem. Starting with a linear policy and performing a policy iteration step yields another linear policy. That is, the policy class is closed under policy improvement. In addition, although the cost-to-go function is a nasty nonconvex function of the policy, the policy iteration update involves just solving a quadratic minimization problem. In fact, for each of the first three examples, the policy class is closed under policy improvement and the policy-iteration objective (i.e. the $Q$-function) is smooth and convex in the chosen action. Similar ideas apply to the fourth example, but as shown in Theorem 2, weaker conditions are needed to ensure convergence for some finite-horizon problems. Given this insight, strikingly simple proofs show that any stationary point of the cost function is a global minimum.
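The closure property for linear quadratic control can be made concrete with a short numerical sketch. Under the assumptions that the policy is $a = \theta s$, that the quadratic part of the cost-to-go is $s^\top P_\theta s$, and that the stage cost is $a^\top R a + s^\top K s$, evaluating a stable linear policy amounts to solving a discounted Lyapunov equation, and minimizing the resulting quadratic $Q$-function over actions gives another linear policy. The function names and the example matrices below are hypothetical.

```python
import numpy as np

def evaluate_linear_policy(theta, A, B, K, R, gamma):
    """Return P with J_theta(s) = s^T P s + const for the policy a = theta s."""
    n = A.shape[0]
    A_cl = A + B @ theta                       # closed-loop dynamics
    M = K + theta.T @ R @ theta                # per-step quadratic cost
    # P = M + gamma * A_cl^T P A_cl, solved as a linear system in vec(P).
    lhs = np.eye(n * n) - gamma * np.kron(A_cl.T, A_cl.T)
    return np.linalg.solve(lhs, M.reshape(-1)).reshape(n, n)

def improve_linear_policy(P, A, B, R, gamma):
    """Minimize the quadratic Q-function over actions; the minimizer is linear in s."""
    return -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)

def riccati_residual(P, A, B, K, R, gamma):
    """Largest violation of the discounted Riccati equation at P."""
    G = R + gamma * B.T @ P @ B
    rhs = K + gamma * A.T @ P @ A - gamma**2 * A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)
    return np.max(np.abs(P - rhs))

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K, R, gamma = np.eye(2), np.eye(1), 0.9

theta = np.zeros((1, 2))   # stable here because sqrt(gamma) * spectral_radius(A) < 1
traces = []
for _ in range(30):
    P_theta = evaluate_linear_policy(theta, A, B, K, R, gamma)
    traces.append(np.trace(P_theta))
    theta = improve_linear_policy(P_theta, A, B, R, gamma)   # still a linear policy
residual = riccati_residual(P_theta, A, B, K, R, gamma)
print("trace(P) per iteration:", np.round(traces[:5], 4), "residual:", residual)
```

Each improvement step returns a matrix of the same shape as the original policy, which is exactly the sense in which the linear class is closed under policy improvement.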
In our view, these canonical control problems provide an important benchmark and sanity check for policy gradient methods. At the same time, one hopes that the insights developed from considering these problems extend to more complex scenarios. To spur progress in this direction, we take a first step in Section 5 where we relax the assumption that the policy class is closed under policy improvement. Our theory gives conditions under which any stationary point of $\ell(\cdot)$ is nearly optimal, where the error bound depends on a notion of the expressive capacity of the policy class.
Beyond RL, this work connects to a large body of work on first-order methods in nonconvex optimization. Under broad conditions, these methods are guaranteed to converge asymptotically to stationary points of the objective function under a variety of noise models [9, 10]. The ubiquity of nonconvex optimization problems in machine learning and especially deep learning has sparked a slew of recent work [34, 1, 24, 14] giving rates of convergence and ensuring convergence to approximate local minima rather than saddle points. A complementary line of research studies the optimization landscape of specific problems to essentially ensure that local minima are global [18, 55, 11, 19, 28]. Taken together, these results show interesting nonconvex optimization problems can be efficiently solved using gradient descent. Our work contributes to the second line of research, offering insight into the optimization landscape of $\ell(\cdot)$ for classic dynamic programming problems.
Challenges with policy gradient methods and the scope of this work.
There are many reasons why practitioners may find that simple policy gradient methods, like the classic REINFORCE algorithm reviewed in Appendix A, offer poor performance. In an effort to clarify the scope of our contribution, and its place in the literature, let us briefly review some of these challenges.


Nonconvexity of the loss function: Policy gradient methods apply (stochastic) gradient descent on a nonconvex loss function. Such methods are usually expected to converge toward a stationary point of the objective function. Unfortunately, a general nonconvex function could have many stationary points that are far from optimal.

Unnatural policy parameterization: It is possible for parameters that are far apart in Euclidean distance to describe nearly identical policies. Precisely, this happens when the Jacobian matrix of the policy vanishes or becomes ill-conditioned. Researchers have addressed this challenge through natural gradient algorithms [2, 26], which perform steepest descent in a different metric. The issue can also be alleviated with regularized policy gradient algorithms [50, 52].

Insufficient exploration: Although policy gradients are often applied with stochastic policies, convergence with this kind of naive random exploration can require a number of iterations that scales exponentially with the number of states in the MDP. Kakade and Langford [25] provide a striking example. Combining efficient exploration methods with policy gradient algorithms is challenging, but is an active area of research [see e.g. 40, 44].

Large variance of stochastic gradients: The variance of estimated policy gradients generally increases with the problem's effective time horizon, usually expressed in terms of a discount factor or the average length of an episode. Considerable research is aimed at alleviating this problem through the use of actor-critic methods [57, 31, 36] and appropriate baselines [51, 37].
We emphasize that this paper is focused on the first challenge and on understanding the risks posed by spurious local minima. Such an investigation is relevant to many strategies for searching locally over the policy space, including policy gradient methods, natural gradient methods [26], finite difference methods [47], random search [35], and evolutionary strategies [49]. For concreteness, one can mostly have in mind the idealized policy gradient iteration $\theta_{k+1} = \theta_k - \alpha_k \nabla \ell(\theta_k)$. As in the REINFORCE algorithm in Appendix A, we imagine applying policy gradient algorithms in simulation, where an appropriate restart distribution provides sufficient exploration.
A natural direction, which we leave for future work, would be to analyze the rate of convergence of specific algorithms that follow noisy gradient steps on $\ell(\cdot)$. Fazel et al. [16] give an impressive analysis of several exact gradient-based algorithms for deterministic linear quadratic control, along with an extension to zeroth-order optimization for approximate gradients.
2 Problem formulation
Consider a Markov decision process (MDP), which is a six-tuple $(\mathcal{S}, \mathcal{A}, g, P, \gamma, \rho)$. In some of our examples, the state space $\mathcal{S}$ is a convex subset of $\mathbb{R}^n$, but to ease notation we present some of the notation below assuming $\mathcal{S}$ is countable, trusting readers can substitute integrals for sums when needed. The initial state distribution $\rho$ is a probability distribution supported on the entire state space. For each $s \in \mathcal{S}$, the set of feasible actions is $\mathcal{A}(s) \subset \mathbb{R}^k$. When action $a$ is executed in state $s$, the agent incurs some immediate cost and transitions to a new state. The instantaneous cost function $g(s, a)$ specifies the expected cost incurred and the transition kernel $P(s' \mid s, a)$ specifies the probability of transitioning to state $s'$ in the next period. We assume expected costs are nonnegative for all feasible state-action pairs.
A policy $\pi$ is a mapping from states to feasible actions. For each $\pi$, the cost-to-go function $J_\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t g(s_t, \pi(s_t)) \mid s_0 = s \right]$ encodes the expected total discounted cost incurred when applying policy $\pi$ from initial state $s$. Here the expectation is taken over the sequence of states visited under $\pi$, since the function $g$ already integrates over any randomness in instantaneous costs. The state-action cost-to-go function $Q_\pi(s, a) = g(s, a) + \gamma \sum_{s'} P(s' \mid s, a) J_\pi(s')$ measures the cumulative expected cost of taking action $a$ in state $s$ and applying $\pi$ thereafter. We let $J^*(s) = \inf_\pi J_\pi(s)$ denote the optimal cost-to-go function. For every problem we consider, $J^*$ is the unique solution to the Bellman equation $J = TJ$, where the Bellman operator $T$ associates each function $J$ with another function $TJ$ defined as
$$TJ(s) = \min_{a \in \mathcal{A}(s)} \; g(s, a) + \gamma \sum_{s'} P(s' \mid s, a) J(s').$$
Similarly, define $Q^*(s, a) = g(s, a) + \gamma \sum_{s'} P(s' \mid s, a) J^*(s')$. We assume throughout that $\mathcal{A}(s)$ is convex. In some settings, like linear quadratic control problems, this is natural in all problem formulations. In others, like MDPs with a finite set of actions, the action set is convexified by randomization. In particular, when there are $k$ deterministic actions feasible in each state, we will take $\mathcal{A}(s)$ to be the $(k-1)$-dimensional probability simplex. Cost and transition functions are naturally extended to functions on the simplex defined by $g(s, a) = \sum_{i=1}^{k} g(s, e_i) a_i$ and $P(s' \mid s, a) = \sum_{i=1}^{k} P(s' \mid s, e_i) a_i$.
Policy gradient methods search over a parameterized class of policies $\Pi = \{\pi_\theta(\cdot) : \theta \in \Theta\}$. When considering softmax policies for finite state and action MDPs as in Example 1, we take $\Pi$ to consist of the closure of this set, in which case it contains all stationary policies. We assume throughout that $\pi_\theta(s)$ is differentiable as a function of $\theta$. We overload notation, writing $J_\theta = J_{\pi_\theta}$ and $Q_\theta = Q_{\pi_\theta}$ for each $\theta \in \Theta$.
Although classical dynamic programming methods seek a policy that minimizes the expected cost incurred from every initial state, for policy gradient methods it is more natural to study a scalar loss function $\ell(\theta) = \sum_{s \in \mathcal{S}} J_\theta(s) \rho(s)$ under which states are weighted by their initial probabilities under $\rho$. The discounted state-occupancy measure under $\rho$ and $\pi_\theta$ is defined as $\eta_\theta = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \rho P_{\pi_\theta}^t$, where the subscript indicates that transition probabilities are evaluated under the Markov chain that results from applying $\pi_\theta$. We often consider the weighted 1-norm, $\|J\|_{1, \rho} = \sum_{s} |J(s)| \rho(s)$.
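For a finite MDP, the quantities in this formulation reduce to a few linear solves. The sketch below is an illustration under stated assumptions rather than part of the paper: arrays `P[s, a, s']` and `g[s, a]` hold transition probabilities and expected costs, a stochastic policy is a matrix `pi[s, a]` of action probabilities, and the occupancy measure is normalized to sum to one. The function name `evaluate_policy` and the random instance are hypothetical.

```python
import numpy as np

def evaluate_policy(P, g, pi, rho, gamma):
    """Exact cost-to-go, Q-function, scalar loss, and discounted occupancy."""
    n = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)                  # state kernel under pi
    g_pi = np.sum(pi * g, axis=1)                          # expected one-step cost
    J = np.linalg.solve(np.eye(n) - gamma * P_pi, g_pi)    # J = g_pi + gamma P_pi J
    Q = g + gamma * P @ J                                  # Q[s, a]
    loss = float(rho @ J)                                  # states weighted by rho
    # eta = (1 - gamma) * sum_t gamma^t rho P_pi^t, written as a linear solve.
    eta = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho)
    return J, Q, loss, eta

rng = np.random.default_rng(0)
n, k, gamma = 4, 3, 0.9
P = rng.random((n, k, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n, k))
rho = np.full(n, 1.0 / n)
pi = np.full((n, k), 1.0 / k)          # uniformly random policy

J, Q, loss, eta = evaluate_policy(P, g, pi, rho, gamma)
print("loss:", loss, "occupancy:", eta)
```

Two identities are convenient sanity checks: the occupancy measure sums to one, and the loss equals the occupancy-weighted one-step cost divided by $1 - \gamma$.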
3 General results
The introduction described in words some special structural properties shared by our motivating examples. This section states formal assumptions capturing that intuition and culminates in a strikingly simple proof that such conditions ensure that $\ell(\cdot)$ has no suboptimal stationary points.
Assumption 1 (Closure under policy improvement).
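For any $\pi \in \Pi$, there is $\pi_+ \in \Pi$ such that for every $s \in \mathcal{S}$, $\pi_+(s) \in \arg\min_{a \in \mathcal{A}(s)} Q_\pi(s, a)$.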
Assumption 2 (Convexity of policy improvement steps).
For every $\pi \in \Pi$ and $s \in \mathcal{S}$, $Q_\pi(s, a)$ is a convex function of $a \in \mathcal{A}(s)$.
Next, we assume the policy class is convex, ensuring that a soft policy-iteration update from the policy $\pi$ to the policy $(1 - \alpha)\pi + \alpha \pi_+$ is feasible. In addition to this somewhat stringent assumption on the policy class, we need a mild regularity property of the parametrization. To make this assumption more transparent, let's look at Examples 1 and 2 with softmax and linear policies respectively. Consider any two policies $\pi_\theta$ and $\pi_{\theta'}$. The goal is to find a direction $u$ in the parameter space such that the directional derivative of $\pi_\theta$ along $u$ points in the direction of $\pi_{\theta'}$. Since the map $\theta \mapsto \pi_\theta$ is one-to-one and convexity ensures $\alpha \pi_{\theta'} + (1 - \alpha) \pi_\theta \in \Pi$, we can pick $\theta^\alpha$ such that $\pi_{\theta^\alpha} = \alpha \pi_{\theta'} + (1 - \alpha) \pi_\theta$. (Softmax policies are one-to-one with a common parameterization that fixes a single component of $\theta$ per state. Otherwise, we can follow the argument above, with an appropriate rule for selecting $\theta^\alpha$ when multiple exist.) Varying $\alpha \in [0, 1]$ traces out a line segment in policy space and a smooth curve in parameter space. Then the desired direction $u$ satisfies $u = \frac{d \theta^\alpha}{d \alpha}\big|_{\alpha = 0}$. In the case of linear quadratic control, the direction can be expressed simply as $u = \theta' - \theta$. For softmax policies, the existence of $u$ follows from an inverse function theorem, which ensures the differentiable map $\theta \mapsto \pi_\theta$ has a differentiable inverse. There, the direction $u$ is a concatenation of $n$ vectors $u_s$ of length $k$. Each solves the linear system $\frac{\partial \pi_\theta(s)}{\partial \theta_s} u_s = \pi_{\theta'}(s) - \pi_\theta(s)$, where $\frac{\partial \pi_\theta(s)}{\partial \theta_s}$ is the Jacobian matrix. This parallels the construction of natural gradient directions [26].
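For softmax policies, the linear system for the per-state direction can be solved in closed form, which the following sketch checks numerically for a single state. It assumes the unconstrained softmax parameterization, whose Jacobian $\mathrm{diag}(p) - p p^\top$ is singular, so the direction computed here is one solution among many; the function names and the example vectors are hypothetical.

```python
import numpy as np

def softmax_row(theta_s):
    e = np.exp(theta_s - theta_s.max())
    return e / e.sum()

def softmax_jacobian(p):
    """Jacobian of the softmax map at a point whose output is p."""
    return np.diag(p) - np.outer(p, p)

def direction_toward(p, p_target):
    """One solution u_s of  Jacobian @ u_s = p_target - p  (p must be positive)."""
    return (p_target - p) / p

theta_s = np.array([0.3, -0.5, 1.0, 0.1])
p = softmax_row(theta_s)
p_target = np.eye(4)[0]                  # e.g. a deterministic greedy action
u_s = direction_toward(p, p_target)

# Finite-difference directional derivative of the policy along u_s.
eps = 1e-6
fd = (softmax_row(theta_s + eps * u_s) - softmax_row(theta_s - eps * u_s)) / (2 * eps)
print("J u   :", softmax_jacobian(p) @ u_s)
print("target:", p_target - p)
```

The closed form works because the entries of `p_target - p` sum to zero, so the rank-one term of the Jacobian contributes nothing.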
Assumption 3 (Convexity of the policy class).
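Assume $\Pi$ is convex. Moreover, for any policy $\pi_\theta \in \Pi$ and any $\pi' \in \Pi$, there exists $u$ such that for every $s \in \mathcal{S}$, $\frac{d}{d\alpha} \pi_{\theta + \alpha u}(s)\big|_{\alpha = 0} = \pi'(s) - \pi_\theta(s)$.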
Finally, the policy gradient theorem [36, 57] requires that certain limits and integrals can be interchanged. In specific applications, this is often easy to justify. Here we state some general, though potentially stringent, regularity conditions that allow us to simply apply a general policy gradient theorem due to [53], stated for directional derivatives in Lemma 1. For specific applications like linear quadratic control, the interchange of limits and expectations can be easily verified and we don't need this assumption.
Assumption 4 (Regularity conditions by [53]).
is compact, and , , , and exists and are jointly continuous in and .
Lemma 1 (Policy gradients for directional derivatives).
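Under Assumption 4, for any $\theta$ and direction $u$,
$$\frac{d}{d\alpha} \ell(\theta + \alpha u)\Big|_{\alpha = 0} = \frac{1}{1 - \gamma} \sum_{s \in \mathcal{S}} \eta_\theta(s) \, \frac{d}{d\alpha} Q_\theta\big(s, \pi_{\theta + \alpha u}(s)\big)\Big|_{\alpha = 0}. \qquad (1)$$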
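A directional-derivative identity of this kind is easy to sanity-check numerically in the tabular softmax case. The sketch below is an illustration under assumptions, not the general statement: it uses the standard softmax form of the policy gradient, $\partial \ell / \partial \theta_{s,a} = \frac{1}{1-\gamma} \eta_\theta(s) \pi_\theta(s,a) (Q_\theta(s,a) - J_\theta(s))$ with the occupancy measure normalized to sum to one, and compares the implied directional derivative with a central finite difference. The random MDP and the helper names are hypothetical.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(theta, P, g, rho, gamma):
    """Exact loss and softmax policy gradient for a finite MDP."""
    n = P.shape[0]
    pi = softmax(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    J = np.linalg.solve(np.eye(n) - gamma * P_pi, np.sum(pi * g, axis=1))
    Q = g + gamma * P @ J
    eta = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho)
    grad = eta[:, None] * pi * (Q - J[:, None]) / (1 - gamma)
    return rho @ J, grad

rng = np.random.default_rng(1)
n, k, gamma = 5, 3, 0.9
P = rng.random((n, k, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n, k))
rho = rng.random(n)
rho /= rho.sum()

theta = rng.normal(size=(n, k))
u = rng.normal(size=(n, k))               # an arbitrary direction in parameter space

_, grad = loss_and_grad(theta, P, g, rho, gamma)
analytic = np.sum(grad * u)               # directional derivative from the formula

eps = 1e-5
numeric = (
    loss_and_grad(theta + eps * u, P, g, rho, gamma)[0]
    - loss_and_grad(theta - eps * u, P, g, rho, gamma)[0]
) / (2 * eps)
print("analytic:", analytic, "finite difference:", numeric)
```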
The following theorem shows that $\ell(\cdot)$ has no suboptimal stationary points by constructing a specific descent direction. The direction is chosen to point toward the policy iteration update, and we show the corresponding directional derivative scales with the average magnitude of the Bellman error weighted under the state occupancy distribution $\eta_\theta$.
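Theorem 1 (No spurious local minima).
Under Assumptions 1-4, for some policy $\pi_\theta$, let $\pi_+$ be a policy iteration update defined by $\pi_+(s) \in \arg\min_{a \in \mathcal{A}(s)} Q_\theta(s, a)$. Take $u$ to satisfy $\frac{d}{d\alpha} \pi_{\theta + \alpha u}(s)\big|_{\alpha = 0} = \pi_+(s) - \pi_\theta(s)$ for every $s$. Then
$$\frac{d}{d\alpha} \ell(\theta + \alpha u)\Big|_{\alpha = 0} \le -\frac{1}{1 - \gamma} \left\| J_\theta - T J_\theta \right\|_{1, \eta_\theta}. \qquad (2)$$
Proof.
After applying the policy gradient theorem stated in Lemma 1, our goal is to bound $\frac{d}{d\alpha} Q_\theta(s, \pi_{\theta + \alpha u}(s))\big|_{\alpha = 0}$. Let $\langle \cdot, \cdot \rangle$ denote the standard inner product. We have
$$\begin{aligned}
\frac{d}{d\alpha} Q_\theta\big(s, \pi_{\theta + \alpha u}(s)\big)\Big|_{\alpha = 0}
&= \Big\langle \nabla_a Q_\theta(s, a)\big|_{a = \pi_\theta(s)}, \; \frac{d}{d\alpha} \pi_{\theta + \alpha u}(s)\Big|_{\alpha = 0} \Big\rangle \\
&= \big\langle \nabla_a Q_\theta(s, a)\big|_{a = \pi_\theta(s)}, \; \pi_+(s) - \pi_\theta(s) \big\rangle \\
&\le Q_\theta(s, \pi_+(s)) - Q_\theta(s, \pi_\theta(s)) \\
&= T J_\theta(s) - J_\theta(s).
\end{aligned}$$
The inequality follows from the first-order condition for a convex differentiable function $f$, which implies $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$. We use the fact that $T J_\theta \preceq J_\theta$ in conjunction with Lemma 1 to get
$$\frac{d}{d\alpha} \ell(\theta + \alpha u)\Big|_{\alpha = 0} \le \frac{1}{1 - \gamma} \sum_{s \in \mathcal{S}} \eta_\theta(s) \big( T J_\theta(s) - J_\theta(s) \big) = -\frac{1}{1 - \gamma} \left\| J_\theta - T J_\theta \right\|_{1, \eta_\theta}.$$
∎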
An immediate corollary is that, under Assumptions 1-4, if $\nabla \ell(\theta) = 0$ then $J_{\pi_\theta}(s) = T J_{\pi_\theta}(s)$ almost surely over $s$ drawn from $\rho$. Textbooks on dynamic programming provide different technical conditions under which any policy whose cost-to-go function solves Bellman's equation must be optimal. This holds when $T$ is a contraction, but also in many settings where $T$ is not a contraction [7]. This applies immediately to Examples 1 and 3. For linear quadratic control as formulated in Example 2, any stable linear policy $\pi$ satisfies $J_\pi = T J_\pi$ if and only if it is the optimal policy.
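The descent direction toward the policy iteration update can be examined numerically in the simplest setting. The sketch below makes several assumptions for illustration: a finite MDP, a direct parameterization in which the policy itself is a matrix of action probabilities, and an occupancy measure normalized to sum to one. In that case the $Q$-function is linear in the action distribution, so the directional derivative of the loss toward the greedy policy equals the negative weighted Bellman error divided by $1 - \gamma$. The helper `evaluate` and the random instance are hypothetical.

```python
import numpy as np

def evaluate(pi, P, g, rho, gamma):
    """Cost-to-go, Q-function, and discounted occupancy of a stochastic policy."""
    n = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    J = np.linalg.solve(np.eye(n) - gamma * P_pi, np.sum(pi * g, axis=1))
    Q = g + gamma * P @ J
    eta = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho)
    return J, Q, eta

rng = np.random.default_rng(3)
n, k, gamma = 5, 3, 0.9
P = rng.random((n, k, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n, k))
rho = np.full(n, 1.0 / n)
pi = rng.random((n, k))
pi /= pi.sum(axis=1, keepdims=True)

J, Q, eta = evaluate(pi, P, g, rho, gamma)
pi_plus = np.eye(k)[np.argmin(Q, axis=1)]     # greedy policy iteration update
TJ = Q.min(axis=1)                            # Bellman operator applied to J
bellman_error = eta @ (J - TJ)                # weighted 1-norm, since J >= TJ
predicted = -bellman_error / (1 - gamma)

# Central finite difference of the loss along the segment toward pi_plus.
def loss(p):
    return rho @ evaluate(p, P, g, rho, gamma)[0]

eps = 1e-6
d = pi_plus - pi
numeric = (loss(pi + eps * d) - loss(pi - eps * d)) / (2 * eps)
print("predicted:", predicted, "finite difference:", numeric)
```

The derivative is strictly negative whenever the weighted Bellman error is positive, which is the mechanism that rules out suboptimal stationary points in this setting.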
Relaxing Assumption 1 for finite horizon problems.
For finite horizon problems, we can guarantee that there are no spurious local minima for policy gradient under a much weaker condition. Rather than require the policy class is closed under improvement – which would imply the policy class contains the optimal policy – it is sufficient that the policy class contain the optimal policy. For this reason, our theory will cover as special cases a broad variety of finite horizon dynamic programming problems for which structured policy classes are known to be optimal.
Unfortunately, we do not have space in this short paper to develop specialized notation for finite horizon problems. We do so and give a more detailed treatment in Appendix C. We can state our formal result without rewriting our problem formulation, by a well-known trick that treats finite-horizon time-inhomogeneous MDPs as a special case of infinite horizon MDPs (see e.g. [42]). Under the following assumption, the state space factorizes into components, thought of as stages or time periods of the decision problem. Under any policy, any state will transition to some state in the next stage until the final stage is reached and the interaction effectively ends. This assumption on the policy class allows us to change the policy in one stage without influencing the policy at other stages, essentially encoding time-inhomogeneous policies.
Assumption 5 (Finite horizon).
Suppose the state space factors as , where for a state with , for all . The final subset contains a single costless absorbing state, with and for any action . The policy parameter is the concatenation of subvectors, where for any fixed , depends only on .
4 Revisiting Examples 1-4
Softmax policies in finite state and action MDPs.
Consider again Example 1. Abusing notation, we could write a stochastic policy for a finite state and action MDP as a long vector $\pi \in \mathbb{R}^{nk}$, where $\pi(s, i)$ is the probability of choosing action $i$ in state $s$. We take $\Pi$ to be the closure of the set of softmax policies. This contains all possible policies, so the policy class is automatically closed under policy improvement (Assumption 1). In this case, Assumption 2 holds since the $Q$-function is linear: $Q_\pi(s, a) = \sum_{i=1}^{k} a_i Q_\pi(s, e_i)$ for a probability vector $a$. The policy class is clearly convex, since the probability simplex is convex. We gave a constructive definition of $u$ above Assumption 3.
Linear policies in linear quadratic control.
Since the work of Kleinman [30] and Hewer [22], it has been known that, starting from any stable linear policy, policy iteration solves a sequence of quadratic minimization problems with solutions converging to the optimal linear policy. The conditions needed to apply each step in the proof of Theorem 1 essentially follow immediately from this classic theory. However, like that work, we need to add an appropriate qualifier to rule out unstable linear policies, under which the cost-to-go is infinite from every state and many expressions are not even defined. We provide more details in Appendix E.1, and also discuss when gradient descent on $\ell(\cdot)$ will not leave the class of stable policies.
Lemma 2.
Consider Example 2. Choose to be . Then, for any , if then .
Threshold policies in optimal stopping.
In Example 3, we considered a parameterized class of soft or randomized threshold policies. We take $\Pi$ to be the closure of the set of such policies, which also contains any deterministic threshold policies. This policy class is closed under policy improvement (Assumption 1): for any $a \in [0, 1]$ denoting a probability of accepting the offer and any $\pi \in \Pi$, we have
For any state , if and only if the offer exceeds the continuation value . This means that, starting from a threshold policy, each step of policy iteration yields a new threshold policy, so the convergence of policy iteration implies threshold policies are optimal for this problem. Unfortunately, while we can essentially copy the proof of Theorem 1 line by line to establish Lemma 3 as shown below, it does not apply directly to this problem. The challenge is that the policy class is not convex, so moving on a line segment toward the policy iteration update is not a feasible descent direction. However, it is still simple to move the policy closer to , in the sense that for small , at every . The proof, given in Appendix E.2, essentially writes the formula for such a descent direction , and then repeats each line of Theorem 1 with this choice of .
Lemma 3.
For the optimal stopping problem formulated in Example 3, for any .
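The threshold structure of the stopping problem can be illustrated with a small dynamic programming sketch. It assumes a finite set of nonnegative offers, a reward-maximizing formulation in which stopping at offer $y$ yields $y$, and a continuation value that depends only on the current context; the fixed point is then found by iterating a contraction. The function name, the two-context chain, and the offer distributions are hypothetical.

```python
import numpy as np

def continuation_values(P_ctx, offers, q, gamma, iters=500):
    """Iterate C(x) = gamma * sum_x' P(x, x') * E_{y' ~ q[x']}[max(y', C(x'))]."""
    C = np.zeros(P_ctx.shape[0])
    for _ in range(iters):
        # Value of arriving in each context before the offer is revealed.
        V = np.sum(q * np.maximum(offers[None, :], C[:, None]), axis=1)
        C = gamma * P_ctx @ V
    return C

P_ctx = np.array([[0.8, 0.2],           # context transition matrix
                  [0.3, 0.7]])
offers = np.array([1.0, 2.0, 3.0])      # possible offer values, sorted
q = np.array([[0.6, 0.3, 0.1],          # offer distribution in context 0
              [0.1, 0.3, 0.6]])         # offer distribution in context 1
gamma = 0.9

C = continuation_values(P_ctx, offers, q, gamma)
accept = offers[None, :] > C[:, None]   # greedy rule: accept iff offer exceeds C(x)
print("thresholds per context:", C)
print("accept table:\n", accept)
```

The greedy rule is a threshold policy with one threshold per context, which is the form that policy iteration preserves in this problem.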
Basestock policies in finite horizon inventory control.
We consider again the multi-period newsvendor problem with backlogged demand described in Example 4. In this problem, it is known that base-stock policies are optimal [5, 45], but the policy gradient cost function is a nonconvex function of the vector of base-stock levels [32]. The following lemma, proved in Appendix E.3, shows that nevertheless any stationary point of the objective function is a global minimum. The result is stated formally here in terms of the notation in Example 4. We establish this claim essentially by modifying one line in the proof of Theorem 2. The modification addresses the fact that, because a policy only orders in some states, local changes in the base-stock levels only change the actions in those states. This property technically breaks the convexity of the policy class, but does not affect our construction of a descent direction on $\ell(\cdot)$. Global convergence of some online gradient methods for this problem was also established through more direct approaches in [32, 23].
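To make the objective concrete, the following sketch simulates the cost of a time-inhomogeneous base-stock policy along given demand paths. It assumes that ordering costs are charged on the order quantity and that holding and backlog costs are charged on the inventory level after demand is realized; the cost parameters, the exponential demand model, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def base_stock_cost(theta, demands, s0=0.0, c=1.0, b=0.5, p=2.0):
    """Total cost of ordering up to theta[t] in period t along one demand path."""
    total, s = 0.0, s0
    for target, w in zip(theta, demands):
        a = max(0.0, target - s)       # order up to the target when feasible
        s = s + a - w                  # inventory after demand; negative = backlog
        total += c * a + b * max(s, 0.0) + p * max(-s, 0.0)
    return total

def average_cost(theta, demand_paths, **costs):
    """Monte Carlo estimate of the expected cost over sampled demand paths."""
    return float(np.mean([base_stock_cost(theta, path, **costs) for path in demand_paths]))

# Two hand-checkable paths.
cost_a = base_stock_cost([5.0, 5.0], [3.0, 4.0])   # orders 5 then 3, holds 2 then 1
cost_b = base_stock_cost([0.0, 0.0], [1.0, 1.0])   # backlog of 1 in both periods

# A sampled estimate for a three-period problem.
rng = np.random.default_rng(0)
demand_paths = rng.exponential(scale=2.0, size=(1000, 3))
estimate = average_cost([3.0, 3.0, 3.0], demand_paths)
print(cost_a, cost_b, estimate)
```

Evaluating `average_cost` on a fixed set of demand paths gives a deterministic function of the base-stock levels, which is convenient for plotting the loss landscape or for finite-difference gradient checks.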
5 Approximation and expressive policy classes
So far, we have studied some classical dynamic programming problems that are ideally suited to policy iteration. The key property we used is that certain structured policy classes were closed under policy improvement, so that exact policy iteration can be performed when only considering that policy class. Although simple structured policy classes are common in some applications of stochastic approximation based policy search [e.g. 33, 27, 60], they are not widely used in the RL literature. Instead, flexible policy classes like those parameterized by a deep neural network, a kernel method [46], or using state aggregation [54, 17, 8] are preferred. In place of a concluding section, we leave the readers with some preliminary but interesting progress toward understanding why, for highly expressive policy classes, any local minimum of the policy gradient cost function might be near-optimal. We conjecture this theory can at least be clearly instantiated in the special case of state aggregation given in Appendix B.
Let $T_\pi$ denote the Bellman operator corresponding to a policy $\pi$, defined by $T_\pi J(s) = g(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) J(s')$. Recall the optimal Bellman operator is defined by $TJ(s) = \min_{a \in \mathcal{A}(s)} g(s, a) + \gamma \sum_{s'} P(s' \mid s, a) J(s')$. Given an expressive policy class $\Pi$,
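$$\inf_{\pi' \in \Pi} \left\| T_{\pi'} J_{\pi_\theta} - T J_{\pi_\theta} \right\|_{1, \eta_\theta} \qquad (3)$$
measures the approximation error of the best approximate policy iteration update to a policy $\pi_\theta$. If $\Pi$ satisfied Assumption 1, the approximation error would be zero since there would be $\pi_+ \in \Pi$ with $T_{\pi_+} J_{\pi_\theta}(s) = T J_{\pi_\theta}(s)$ for every $s$. Equation (3) measures the deviation from this ideal case, in a norm that weights states by the discounted state-occupancy distribution $\eta_\theta$ under the policy $\pi_\theta$. The first part of Theorem 3 shows that if $\theta$ is a stationary point of $\ell(\cdot)$, then the Bellman error measured in this same norm is upper bounded by the approximation error.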
But when does a small Bellman error imply the policy is near optimal? This is the second part of Theorem 3. We relate the Bellman error in the supremum norm to the average Bellman error over states sampled from the initial distribution . Define and
(4) 
The constant measures the extent to which errors at some state must be detectable by random sampling, which depends both on the initial state distribution $\rho$ and on properties of the set of cost-to-go functions . If the state space is finite, and naturally captures the ability of the distribution $\rho$ to uniformly explore the state space. This is similar to constants that depend on the worst-case likelihood ratio between state occupancy measures [25]. However, those constants can equal zero for continuous state problems. It seems (4) could still be meaningful for such problems since it also captures regularity properties of . (To give some intuition, consider a quadratic function on the unit sphere . For denoting the uniform density of , we have .) The second part of Theorem 3 is reminiscent of results in the study of approximate policy iteration methods, pioneered by [9, 38, 3, 39, 6], among others. The primary differences are that (1) we directly consider an approximate policy class whereas that line of work considers the error in parametric approximations to the $Q$-function and (2) we make a specific link with the stationary points of a policy gradient method. The abstract framework of Kakade and Langford [25] is also closely related, though they do not study the stationary points of $\ell(\cdot)$. We refer the readers to Appendix D for the proof.
Theorem 3.
References

Agarwal et al. [2017] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.
Amari [1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
Antos et al. [2008] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
Baxter and Bartlett [2001] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
Bertsekas [1995] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
Bertsekas [2011] Dimitri P Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.
Bertsekas [2013] Dimitri P Bertsekas. Abstract dynamic programming. Athena Scientific, Belmont, MA, 2013.
Bertsekas [2019] Dimitri P Bertsekas. Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6(1):1–31, 2019.
Bertsekas and Tsitsiklis [1996] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, MA, 1996.
Bertsekas and Tsitsiklis [2000] Dimitri P Bertsekas and John N Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
Bhojanapalli et al. [2016] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
Bradtke et al. [1994] Steven J Bradtke, B Erik Ydstie, and Andrew G Barto. Adaptive linear quadratic control using policy iteration. In Proceedings of 1994 American Control Conference-ACC’94, volume 3, pages 3475–3479. IEEE, 1994.
Carmon et al. [2018] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
Evans [2005] Lawrence C Evans. An introduction to mathematical optimal control theory. Lecture Notes, University of California, Department of Mathematics, Berkeley, 2005.
Fazel et al. [2018] Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for linearized control problems. In Proceedings of the 35th International Conference on Machine Learning, pages 1467–1476, 2018.
Ferns et al. [2004] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 162–169. AUAI Press, 2004.

Ge et al. [2015]
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan.
Escaping from saddle points—online stochastic gradient for tensor decomposition.
In Conference on Learning Theory, pages 797–842, 2015.  Ge et al. [2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
 Gordon [1995] Geoffrey J Gordon. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pages 261–268. Elsevier, 1995.
 Grondman et al. [2012] Ivo Grondman, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. A survey of actorcritic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6):1291–1307, 2012.
 Hewer [1971] G Hewer. An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Transactions on Automatic Control, 16(4):382–384, 1971.
 Huh and Rusmevichientong [2013] Woonghee Tim Huh and Paat Rusmevichientong. Online sequential optimization with biased gradients: theory and applications to censored demand. INFORMS Journal on Computing, 26(1):150–159, 2013.
 Jin et al. [2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR.org, 2017.
 Kakade and Langford [2002] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
 Kakade [2002] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
 Karaesmen and Van Ryzin [2004] Itir Karaesmen and Garrett Van Ryzin. Overbooking with substitutable inventory classes. Operations Research, 52(1):83–104, 2004.
 Kawaguchi [2016] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594, 2016.
 Kiefer [1953] Jack Kiefer. Sequential minimax search for a maximum. Proceedings of the American mathematical society, 4(3):502–506, 1953.
 Kleinman [1968] D. Kleinman. On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13:114–115, 1968.
 Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
 Kunnumkal and Topaloglu [2008] Sumit Kunnumkal and Huseyin Topaloglu. Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Operations Research, 56(3):646–664, 2008.
 L’Ecuyer and Glynn [1994] Pierre L’Ecuyer and Peter W Glynn. Stochastic optimization by simulation: Convergence proofs for the GI/G/1 queue in steady-state. Management Science, 40(11):1562–1578, 1994.
 Lee et al. [2016] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246–1257, 2016.
 Mania et al. [2018] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018.
 Marbach and Tsitsiklis [2001] Peter Marbach and John N Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 Munos [2003] Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567, 2003.
 Munos and Szepesvári [2008] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
 Nachum et al. [2017] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring underappreciated rewards. CoRR, abs/1611.09321, 2017.
 Ortner and Ryabko [2012] Ronald Ortner and Daniil Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems, pages 1763–1771, 2012.
 Osband et al. [2017] Ian Osband, Benjamin Van Roy, Daniel Russo, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
 Peters and Schaal [2006] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2219–2225. IEEE, 2006.
 Plappert et al. [2017] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
 Puterman [2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 Rajeswaran et al. [2017] Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pages 6550–6561, 2017.
 Riedmiller et al. [2007] Martin Riedmiller, Jan Peters, and Stefan Schaal. Evaluation of policy gradient methods and variants on the cartpole benchmark. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 254–261. IEEE, 2007.

 Rust [1997] John Rust. Using randomization to break the curse of dimensionality. Econometrica: Journal of the Econometric Society, pages 487–516, 1997.
 Salimans et al. [2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 Schulman et al. [2015a] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015a.
 Schulman et al. [2015b] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 Singh et al. [1995] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pages 361–368, 1995.
 Sun et al. [2017] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, Feb 2017. ISSN 0018-9448. doi: 10.1109/TIT.2016.2632162.
 Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Tsitsiklis and Van Roy [1996] John N Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.
 Van Roy [2006] Benjamin Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
 Van Ryzin and Vulcano [2008] Garrett Van Ryzin and Gustavo Vulcano. Simulation-based optimization of virtual nesting controls for network revenue management. Operations Research, 56(4):865–880, 2008.
 Whitt [1978] Ward Whitt. Approximations of dynamic programs, i. Mathematics of Operations Research, 3(3):231–243, 1978.
 Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Appendix A Background on policy gradient methods for discounted problems
To begin, we provide a brief review of the simplest policy gradient algorithm: the REINFORCE algorithm for episodic tasks, first proposed by [62]. What we present below is a special case of this algorithm, tailored to infinite horizon discounted objectives. The algorithm repeatedly interacts with an MDP with uncertain transition probabilities. Playing a policy until period results in a trajectory of data consisting of observed states , actions , and rewards . Policy gradient methods search over a family of policies . REINFORCE is restricted to stochastic policies, where is a smooth function of that determines the probability of selecting action in state . For any , we can define the cumulative expected cost-to-go from state by
where is the number of time steps the policy is executed. The second equality simply notes the well-known equivalence between optimizing an infinite horizon discounted objective and optimizing an undiscounted objective over a random geometric time horizon. REINFORCE with restart distribution can be thought of as performing stochastic gradient descent on the scalar loss . In particular, REINFORCE follows
As shown in the algorithm box, generating some noisy but unbiased estimate of
is often simple when employing stochastic policies. It is sometimes also feasible for deterministic policies. This is the case when employing actor-critic methods [53] or in some special cases where differential dynamic programming techniques can be employed (see the inventory control example in Subsection F).
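To make the estimator concrete, the following is a minimal sketch, not the paper's algorithm box. It assumes a tabular MDP with a softmax policy over logits `theta`, and the names `P`, `cost`, `rho`, `reinforce_gradient`, and `run_sgd` are all hypothetical. Each episode stops with probability 1 - gamma after every step, so the undiscounted sum of costs over this random geometric horizon has the discounted cost as its expectation, and the product of total cost and summed score is an unbiased gradient estimate.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_gradient(theta, P, cost, rho, gamma, rng):
    """One score-function gradient estimate of the discounted cost.

    theta: (S, A) softmax logits (assumed parameterization)
    P:     (S, A, S) transition probabilities
    cost:  (S, A) expected instantaneous costs
    rho:   (S,) restart (initial state) distribution
    """
    n_states, n_actions = theta.shape
    s = rng.choice(n_states, p=rho)
    total_cost = 0.0
    score = np.zeros_like(theta)
    while True:
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        total_cost += cost[s, a]
        # Gradient of log pi(a | s) with respect to the logits of state s.
        score[s] += np.eye(n_actions)[a] - probs
        # Stop with probability 1 - gamma: the geometric horizon replaces discounting.
        if rng.random() > gamma:
            break
        s = rng.choice(n_states, p=P[s, a])
    return total_cost * score

def run_sgd(theta, P, cost, rho, gamma, step_size, n_iters, seed=0):
    """Plain stochastic gradient descent on the scalar loss."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        theta = theta - step_size * reinforce_gradient(theta, P, cost, rho, gamma, rng)
    return theta
```

This sketch uses the total episode cost as the weight on every score term, which keeps the estimator simple at the price of higher variance; baselines or cost-to-go weighting are common refinements that it omits.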
Appendix B The example of state aggregation
State aggregation is the simplest form of value function approximation employed in reinforcement learning and comes with strong stability properties [20, 58, 59]. It is common across several academic communities [e.g., 61, 48]. Numerous theoretical papers carefully construct classes of MDPs with sufficiently smooth dynamics and upper bound the error from planning on a discretized state space [e.g., 41]. The following example describes state aggregation in policy space. It satisfies all of our assumptions other than closure under policy improvement, but we expect it can be shown to be approximately closed under policy improvement.
Example 5 (Softmax policies with state aggregation).
There are a finite number of deterministic actions , so we take to be the set of probability distributions over actions. is a bounded convex subset of Euclidean space and the dimension is thought to be small. Reward functions and state transition probabilities are smooth in . We therefore expect an effective action in some state will be effective in another state if is sufficiently small. We partition the state space into disjoint subsets. We consider a modified softmax policy where . If lies in the th subset of the state partition, plays action with probability .
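As an illustration only, the sketch below shows one way a state-aggregated softmax policy could be evaluated. It assumes a one-dimensional state space [low, high] split into equal-width cells and uses a plain softmax over one row of logits per cell; the exact form of the modified softmax in the example is not reproduced here, and the names `aggregate_index` and `action_probabilities` are hypothetical.

```python
import numpy as np

def aggregate_index(state, n_cells, low=0.0, high=1.0):
    """Index of the partition cell containing a scalar state (equal-width cells assumed)."""
    frac = (state - low) / (high - low)
    return int(np.clip(np.floor(frac * n_cells), 0, n_cells - 1))

def action_probabilities(theta, state, low=0.0, high=1.0):
    """Softmax over the logits of the cell that contains `state`.

    theta: (n_cells, n_actions) array, one row of logits per partition cell.
    """
    i = aggregate_index(state, theta.shape[0], low, high)
    logits = theta[i] - theta[i].max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()
```

All states in the same cell share one row of `theta`, so nearby states receive identical action distributions, which is the smoothness intuition behind the example.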
Appendix C Proof of Theorem 2 and formulation of finite horizon problems
In Section 3, we stated our result by treating finite-horizon time-inhomogeneous MDPs as a special case of infinite horizon MDPs. For a clearer understanding, we reformulate the finite horizon problem with specialized notation and restate all the assumptions we need. We then restate Theorem 2 in our new notation and give a proof.
First, let us briefly provide some details to clarify the equivalence. We assumed the state space factorizes as in Assumption 5. To simplify the notation, assume for the moment that each set is finite and . We could express any state as a unique pair such that is the ’th element of . We now rewrite the finite horizon problem in this way.
Consider a finite Markov decision process, represented as . Over periods, the state evolves according to a controlled stochastic dynamical system. In each period , the agent observes the state , chooses the action which incurs the instantaneous expected cost and transitions to a new state . The transition dynamics are encoded in where . We continue to assume that is convex, where for finite action spaces this convexity is enforced through randomization. A policy is a sequence of functions, each of which is a mapping from obeying the constraint for all . For any , the associated cost-to-go from period and state is defined as
We take the Q-function
to denote the expected cumulative cost of taking action in period , state , and continuing to play policy until the end of the horizon. We set for notational convenience. Let and denote these cost-to-go functions under the optimal policy. The distribution is a probability distribution supported over . (This is the exact analogue under Assumption 5, assuming for all , as we have throughout the paper.)
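The cost-to-go and Q-functions of a fixed policy can be computed by backward induction. The sketch below is a minimal illustration under stated assumptions: a tabular problem, zero-based period indices, a terminal cost-to-go of zero, and hypothetical array names `P`, `cost`, and `policy`.

```python
import numpy as np

def evaluate_policy(P, cost, policy):
    """Backward induction for a finite-horizon, time-inhomogeneous MDP.

    P:      (H, S, A, S) transition probabilities for each period
    cost:   (H, S, A) expected instantaneous costs
    policy: (H, S, A) action probabilities played in each period and state
    Returns J with shape (H + 1, S) and Q with shape (H, S, A), where J[H] = 0.
    """
    H, S, A = cost.shape
    J = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        # Cost now plus expected cost-to-go from the next period onward.
        Q[h] = cost[h] + P[h] @ J[h + 1]
        # Average the Q-values over the policy's action distribution.
        J[h] = (policy[h] * Q[h]).sum(axis=1)
    return J, Q
```

Replacing the policy average with a minimum over actions in the last line of the loop would give the optimal cost-to-go instead of that of a fixed policy.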
We consider a parameterized family of policies, where the parameter is the concatenation of vectors of length and where is the policy applied in period . We define the policy class as so . Policy gradient methods seek to minimize cumulative cost-to-go by minimizing a scalar cost function,
The basic idea for finite horizon problems remains the same: for any suboptimal policy, we want to construct a descent direction. We make the following assumptions on the policy class and the Q-function.
Assumption 6 (Correctness of the policy class).
There is some such that for every and ,
In words, we assume that the optimal policy is contained within the policy class. As discussed in the beginning of this section, our results do not require the policy class to be closed under policy improvement, an assumption we needed for infinite horizon problems.
Assumption 7 (Convexity of policy improvement steps).
For every and , is convex in .
Assumption 8 (Convexity of the policy class).
For each , is convex. Moreover, for any and there exists some such that
We use the policy gradient theorem from [53] as given below.
Lemma 5 (Policy gradient theorem for directional derivatives).
For any and
Using Lemma 5, we can show the following result. Let denote the optimal average cost.
Theorem (Restatement of Theorem 2).
Proof.
We assume and construct a specific descent direction such that
Since , there exists some pair with . Let be the last period in which this occurs, i.e. for , for all . For , we have defined so this is trivially satisfied. Now, we have that for every and , since by definition
Let denote an optimal policy in time period , defined by for all . Using Assumption 8, take such that
(7) 
Let where for all . Then, using Lemma 5, we write