1 Introduction
Reinforcement learning (RL) is a subfield of machine learning in which computational agents learn to maximize a numerical reward signal through interaction with their environment. Policy gradient methods encode an agent’s behavior as a parameterized stochastic policy and update the policy parameters according to an estimate of the gradient of the expected sum of rewards (the expected return) with respect to those parameters. In practice, estimating the effect of a particular action on the return that follows can be difficult, so almost all state-of-the-art implementations instead consider an exponentially discounted sum of rewards (the discounted return), which causes the agent to primarily consider the short-term effects of its actions. However, they do not do so in the manner described by the original policy gradient theorem (Sutton et al., 2000). It is an open question in RL as to whether these algorithms optimize a different, related objective (Thomas, 2014). In this paper, we answer this question by showing that most policy gradient methods, including state-of-the-art methods, do not follow the gradient of any objective, and are therefore better understood as semi-gradient methods with respect to the undiscounted objective.

The analysis in this paper applies to all state-of-the-art policy gradient methods known to the authors. This includes the published versions of methods such as A2C/A3C (Mnih et al., 2016), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), DDPG (Silver et al., 2014), ACKTR (Wu et al., 2017), ACER (Wang et al., 2016), and soft actor-critic (Haarnoja et al., 2018), as well as their publicly available implementations (Dhariwal et al., 2017). We note, however, that the concerns of this paper long predate “deep” RL. The observation that most policy gradient methods do not conform to the policy gradient theorem presented by Sutton et al. (2000) dates at least back to the work of Kakade (2001), while the explicit question addressed by this paper, as to whether these algorithms estimate the gradient of any objective, dates at least to the work of Thomas (2014). The purpose of this paper is not to criticize any particular algorithm, or the policy gradient approach in general, nor to propose specific new algorithms, but rather to improve our understanding of the mathematical foundations of policy gradient methods.
2 Notation
In RL, the agent learns through interactions with an environment, which is expressed mathematically as a Markov decision process (MDP). An MDP is a tuple, $(\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{A}$ is the set of actions available to the agent, $P$ is a transition function that determines the probability of transitioning between states given an action, $P(s, a, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, $R(s, a)$ is the expected reward from taking an action in a particular state, $d_0$ is the initial state distribution, and $\gamma \in [0, 1]$ is the discount factor, which decreases the utility of rewards received in the future.

In the episodic setting, interactions with the environment are broken into episodes. Each episode is further broken into individual timesteps. At each timestep, $t$, the agent observes a state, $S_t$, takes an action, $A_t$, transitions to a new state, $S_{t+1}$, and receives a reward, $R_t$. The first timestep is $t = 0$. For finite-horizon MDPs, the maximum timestep is given by $T$. If the agent reaches a special state, called the terminal absorbing state, prior to the final timestep, it stays in this state and receives a reward of 0 at each timestep until $t = T$, after which the process terminates.
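The episodic interaction loop just described can be sketched in code. This is an illustrative sketch only: the dictionary-based MDP components (`d0`, `p`, `r`, `policy`) and the `run_episode` helper are hypothetical names of ours, not part of the paper's formalism.

```python
# Sketch of one episode of agent-environment interaction in an episodic MDP.
import random

def run_episode(d0, p, r, policy, terminal, T):
    """Roll out one episode over timesteps t = 0, ..., T.

    On reaching the terminal absorbing state, the agent stays there and
    receives a reward of 0 at each remaining timestep until t = T.
    """
    # Sample the initial state S_0 from the initial state distribution d0.
    s = random.choices(list(d0), weights=list(d0.values()))[0]
    rewards = []
    for t in range(T + 1):
        if s == terminal:
            rewards.append(0.0)  # absorbing state: reward 0 forever after
            continue
        # Sample A_t from the policy, collect R_t, sample S_{t+1} from p.
        a = random.choices(list(policy[s]), weights=list(policy[s].values()))[0]
        rewards.append(r[(s, a)])
        s = random.choices(list(p[(s, a)]), weights=list(p[(s, a)].values()))[0]
    return rewards
```

For a deterministic one-state MDP that transitions immediately to the terminal state, the returned reward sequence is the first reward followed by zeros, matching the convention above.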
A policy, $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, determines the probability, $\pi(s, a)$, that an agent will choose an action in a particular state. A parameterized policy, $\pi_\theta$, is a policy that is determined by some parameter vector, $\theta \in \mathbb{R}^n$, which may be the weights in a neural network, values in a tabular representation, etc. The compatible features of a parameterized policy represent how $\theta$ may be changed in order to make a particular action, $a$, more likely in a particular state, $s$, and are defined as $\psi_\theta(s, a) = \frac{\partial \ln \pi_\theta(s, a)}{\partial \theta}$.

The value function, $v^{\pi_\theta}_\gamma$, represents the expected discounted sum of rewards when starting in a particular state under policy $\pi_\theta$; that is, $v^{\pi_\theta}_\gamma(s) = \mathbf{E}\left[\sum_{t=0}^{T} \gamma^t R_t \,\middle|\, S_0 = s, \pi_\theta\right]$, where conditioning on $\pi_\theta$ indicates that $A_t \sim \pi_\theta(S_t, \cdot)$. The action-value function, $q^{\pi_\theta}_\gamma$, is similar, but also considers the action taken; that is, $q^{\pi_\theta}_\gamma(s, a) = \mathbf{E}\left[\sum_{t=0}^{T} \gamma^t R_t \,\middle|\, S_0 = s, A_0 = a, \pi_\theta\right]$. The advantage function is the difference between the action-value function and the (state) value function: $A^{\pi_\theta}_\gamma(s, a) = q^{\pi_\theta}_\gamma(s, a) - v^{\pi_\theta}_\gamma(s)$.
The objective of an RL agent is to maximize some function of its parameters, $J(\theta)$. In the episodic setting, the two most commonly stated objectives are the discounted objective, $J_\gamma(\theta) = \mathbf{E}\left[\sum_{t=0}^{T} \gamma^t R_t \,\middle|\, \pi_\theta\right]$, and the undiscounted objective, $J(\theta) = \mathbf{E}\left[\sum_{t=0}^{T} R_t \,\middle|\, \pi_\theta\right]$. The discounted objective has some convenient mathematical properties, but it corresponds to few real-world tasks; Sutton and Barto (2018) have even argued for its deprecation. The undiscounted objective is a more intuitive match for most tasks. Both objectives and their respective gradients are discussed in this paper.
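The distinction between the two objectives can be made concrete for a single trajectory of rewards. The helper names and the reward sequence below are hypothetical; the per-trajectory returns computed here are unbiased samples of the corresponding objectives.

```python
# The two episodic objectives, evaluated on one trajectory of rewards.

def undiscounted_return(rewards):
    # Contribution to J: the plain sum of rewards, with no discounting.
    return sum(rewards)

def discounted_return(rewards, gamma):
    # Contribution to J_gamma: sum over t of gamma^t * R_t.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 1.0]
print(undiscounted_return(rewards))     # 2.0
print(discounted_return(rewards, 0.9))  # 0.9**2 + 0.9**3, approximately 1.539
```

Note how discounting shrinks the weight of the later rewards, which is exactly the "short-term preference" discussed in the introduction.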
3 The policy gradient theorem
The policy gradient theorem (Sutton et al., 2000) was originally presented for two objectives: the average reward objective for the continuing setting (Mahadevan, 1996) and the discounted objective ($J_\gamma$) for the episodic setting. The episodic setting is usually more popular, as it is better suited to the types of tasks that RL researchers typically use for evaluation (e.g., many classic control tasks, and the Arcade Learning Environment (Bellemare et al., 2013)). The policy gradient, $\nabla J_\gamma(\theta)$, tells us how to modify $\theta$ in order to increase $J_\gamma$, and is given by:

$$\nabla J_\gamma(\theta) = \mathbf{E}\left[\sum_{t=0}^{T} \gamma^t \frac{\partial \ln \pi_\theta(S_t, A_t)}{\partial \theta}\, q^{\pi_\theta}_\gamma(S_t, A_t)\right]. \tag{1}$$
Because (1) is the true gradient of the discounted objective, algorithms that follow unbiased estimates of (1) are given the standard guarantees of stochastic gradient descent (namely, that given an appropriate step-size schedule and smoothness assumptions, convergence to a local optimum is almost sure (Bertsekas and Tsitsiklis, 2000)). However, researchers typically are not interested in the discounted objective, but rather the undiscounted objective. As a result, most conventional “policy gradient” algorithms (PPO, A3C, etc.) instead directly or indirectly estimate the expression:

$$\nabla_* J(\theta) = \mathbf{E}\left[\sum_{t=0}^{T} \frac{\partial \ln \pi_\theta(S_t, A_t)}{\partial \theta}\, q^{\pi_\theta}_\gamma(S_t, A_t)\right]. \tag{2}$$
Note that the difference between (1) and (2) is that the latter has dropped the $\gamma^t$ term (Thomas, 2014). However, notice that this new expression is not the gradient of $J$ either, as it uses the discounted action-value function, $q^{\pi_\theta}_\gamma$. We label this update direction $\nabla_* J$ because it is no longer the gradient of $J$ or $J_\gamma$, but is rather the gradient of some hypothetical, related objective, $J_*$. The asterisk character, $*$, was chosen to represent the ambiguity of its existence.
Thomas (2014) attempted to construct $J_*$, but was only able to do so for an impractically restricted setting wherein the agent’s actions do not impact state transitions. Its value in the general case was left as an open question. We will later show that the hypothetical objective, $J_*$, does not exist when the impractical restrictions imposed by Thomas (2014) are removed. The result is that algorithms based on (2), which are commonly referred to as “policy gradient” algorithms, cannot be interpreted as gradient algorithms with respect to any objective. In Section 5, we show that these methods are better understood as semi-gradient methods relative to the undiscounted objective. In the next section, we prove an intermediate result relating the undiscounted objective to the discounted value function.
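The difference between the two update directions can be made concrete for a single sampled trajectory. The sketch below assumes the score vectors (`glp[t]`, the gradient of the log-probability of the chosen action at step `t`) and discounted action-value estimates (`q[t]`) are given; both names are ours, and the functions are an illustrative sketch rather than a reference implementation of any particular algorithm.

```python
# Per-trajectory samples of the two update directions discussed above.
# glp is a list of score vectors (one per timestep); q is a list of
# estimates of q_gamma(S_t, A_t). Note that gamma also enters (2)
# implicitly through the discounted q estimates.

def policy_gradient_term(glp, q, gamma):
    # Sample of (1): each timestep is weighted by gamma^t.
    return [sum(gamma ** t * g[i] * q[t] for t, g in enumerate(glp))
            for i in range(len(glp[0]))]

def conventional_term(glp, q):
    # Sample of (2): the gamma^t factor is dropped, as in most implementations.
    return [sum(g[i] * q[t] for t, g in enumerate(glp))
            for i in range(len(glp[0]))]
```

For any trajectory longer than one step with $\gamma < 1$, the two directions differ, since later timesteps are down-weighted in (1) but not in (2).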
4 Decomposing the undiscounted objective
We begin our analysis by presenting a lemma that provides a new view of the undiscounted objective by allowing us to express it in terms of the discounted value function, for any value of $\gamma$. Writing the objective in this new form is the first step towards uncovering its relationship with $\nabla_* J$. This is necessary, as $\nabla_* J$ is defined in terms of the discounted action-value function.
Lemma 1.
The undiscounted objective is related to $v^{\pi_\theta}_\gamma$ for all $\gamma \in [0, 1]$ by:

$$J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s)\, v^{\pi_\theta}_\gamma(s), \tag{3}$$

where $d^{\pi_\theta}$ is the unnormalized, weighted state distribution given by:

$$d^{\pi_\theta}(s) = \Pr(S_0 = s) + (1 - \gamma) \sum_{t=1}^{T} \Pr(S_t = s). \tag{4}$$
Proof.
See appendix. ∎
Taking the gradient of this expression gives us a corresponding new form of the policy gradient. The first step of the derivation is given here, as it deepens our intuition as to the relationship between the discounted value function and the undiscounted objective:

$$\nabla J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} + \sum_{s \in \mathcal{S}} v^{\pi_\theta}_\gamma(s) \frac{\partial d^{\pi_\theta}(s)}{\partial \theta}. \tag{5}$$

We see that there are two components to this new form of the gradient. The first component optimizes the discounted value function over the current distribution of states. The second term takes into account the change in the weighted distribution over states induced by changes in the policy. Note that in the $\gamma = 1$ case, $\frac{\partial d^{\pi_\theta}(s)}{\partial \theta}$ is zero for all $s$, and we are left with only the first term. Also note that under this formulation, $\gamma$ is strictly a hyperparameter, and does not impact the optimal policy. In the next section, we relate this expression to $\nabla_* J$.

5 Policy semi-gradient methods
In this section, we show that when discounting is used, most policy “gradient” methods are actually policy semi-gradient methods, and that they are not truly gradient methods with respect to any objective. First, we define the term semi-gradient. If a gradient of a function can be written with two terms, the semi-gradients are each of these terms; they represent the two “halves” of the gradient. The semi-gradients of a function are not unique, as they entirely depend on how the gradient is written. Nevertheless, they can provide useful intuition for the behavior of algorithms. The term was coined by Sutton (2015) in order to describe the relationship between temporal difference (TD) learning methods and the mean-squared Bellman error. Similarly, we use the term to describe the relationship between $\nabla_* J$ and the undiscounted objective. In the following lemma, we show that $\nabla_* J$ may be written as the value gradient term in (5), from which the proof that it is a semi-gradient is immediate:
Lemma 2.
The conventional policy update, $\nabla_* J(\theta)$, can be written as:

$$\nabla_* J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta}. \tag{6}$$
Proof.
See appendix. ∎
Corollary 1.
$\nabla_* J$ is a semi-gradient of $J$.
Proof.
By Lemma 2, $\nabla_* J(\theta)$ is exactly the first term of the gradient of the undiscounted objective as written in (5). ∎
In a sense, it is remarkable that this approach to policy gradients works at all. Value function estimates are typically learned using semi-gradient TD methods (Sutton and Barto, 2018). The policy is typically learned using an approximation of $\nabla_* J$, which we have shown is a semi-gradient. Both “actor” and “critic” follow semi-gradients, and yet implementations of actor-critic methods (Konda and Tsitsiklis, 2000) that follow these update directions are able to learn effective policies. An open question in RL is whether or not these algorithms are following the gradient of some related objective, and whether this is the cause of their effectiveness (Thomas, 2014). We show that this is not the case.
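The semi-gradient idea in the TD setting mentioned above can be illustrated with a minimal sketch, assuming a linear value estimate; the function name and arguments are ours. The full gradient of the squared TD error would also differentiate through the bootstrap target $r + \gamma\, \hat{v}(s')$, while the semi-gradient treats that target as a constant:

```python
# Semi-gradient TD(0) update for a linear value estimate v(s) = w . x(s).

def semi_gradient_td0_update(w, x_s, x_next, reward, gamma, alpha):
    v_s = sum(wi * xi for wi, xi in zip(w, x_s))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    td_error = reward + gamma * v_next - v_s
    # Only the gradient of v(s) with respect to w appears here; the
    # target's own dependence on w is ignored (the "semi" in semi-gradient).
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x_s)]
```

The analogy to the present setting is that $\nabla_* J$ keeps one of the two terms of (5) and ignores the other, just as TD keeps the gradient of the prediction and ignores the gradient of the target.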
Theorem 1.
There does not exist any continuous objective, $J_*$, such that $\nabla_* J$ is its gradient. That is, for all $f \in C$ there exists some $\theta \in \mathbb{R}^n$ such that:

$$\nabla f(\theta) \neq \nabla_* J(\theta),$$

where $C$ is the space of continuous functions.
Proof.
We present a proof by contradiction. Assume that there exists some $f$ whose gradient is $\nabla_* J$, and that $\theta_i$ and $\theta_j$ are parameters of the policy, $\pi_\theta$. If $\nabla f$ is a continuous function, then it is also true that its second derivative is symmetric:

$$\frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} = \frac{\partial^2 f}{\partial \theta_j\, \partial \theta_i}.$$

However, it is easy to show that this is not true algebraically using Lemma 2:

$$\frac{\partial^2 f}{\partial \theta_j\, \partial \theta_i} = \frac{\partial}{\partial \theta_j} \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_i} = \sum_{s \in \mathcal{S}} \left( d^{\pi_\theta}(s) \frac{\partial^2 v^{\pi_\theta}_\gamma(s)}{\partial \theta_j\, \partial \theta_i} + \frac{\partial d^{\pi_\theta}(s)}{\partial \theta_j} \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_i} \right). \tag{7}$$

The first term inside the sum is symmetric in $i$ and $j$, but the second term generally is not. Thus, we immediately arrive at an apparent contradiction. However, to formally complete the proof, we must provide a counterexample where the latter term is asymmetric. Such a counterexample is provided in Figure 1. ∎
6 Non-optimality of semi-gradient methods
One may wonder: why should we be concerned at all that policy gradient methods are following a semi-gradient rather than a gradient? After all, the same may be said of TD (Sutton, 2015), which remains in wide use. First, there is a large body of literature (Baird, 1995; Sutton et al., 2009; Liu et al., 2015; Mahmood et al., 2015; Liu et al., 2016) suggesting that many researchers are concerned about the use of semi-gradients in TD; even in the subfield of “deep” RL, the concept of a target network (Mnih et al., 2015) was introduced to address the instability caused by the semi-gradient update. Second, policy semi-gradient methods suffer from weaknesses that semi-gradient TD methods do not.
Consider that semi-gradient TD is a semi-gradient of the mean-squared Bellman error (Baird, 1995; Sutton, 2015), whereas the policy semi-gradient is a semi-gradient of the undiscounted objective. In the tabular setting, semi-gradient TD will converge to the minimum of the mean-squared Bellman error (Jaakkola et al., 1994), but policy semi-gradient methods will often fail to converge to the maximum of the undiscounted objective; for example, they may instead converge to the maximum of the discounted objective (see Figure 2). There is therefore an asymmetry between semi-gradient TD and policy semi-gradient methods, in that the former converges to an optimum of the objective of which it is a semi-gradient, while the latter does not. However, this situation is not particularly alarming, as it is well understood that the optimal policy under the discounted objective approximates the optimal policy under the undiscounted objective (Marbach and Tsitsiklis, 2001; Weaver and Tao, 2001; Kakade, 2001; Thomas, 2014). The choice of the discount factor for policy semi-gradient methods in the tabular setting may therefore be considered in terms of a bias-variance trade-off.
When a function approximator is used, a worse problem emerges that is analogous to divergence in semi-gradient TD methods. With a linear function approximator, semi-gradient TD methods only diverge in the off-policy setting,¹ for instance, in Baird’s famous counterexample (Baird, 1995). However, the following problem may emerge for policy semi-gradient methods even in the on-policy setting. It is not possible for policies to diverge in quite the same way as value estimates, as the probability of choosing an action is bounded between 0 and 1. The worst possible outcome is convergence to the pessimal policy within the given parameterization.² One may ask: pessimal with respect to which objective? In Figure 3, we give an example wherein policy semi-gradient methods will converge to the pessimal policy with respect to both the discounted and undiscounted objectives. While the behavior of policy semi-gradient methods described in Figure 2 may be justified as a myopic preference, the same justification cannot be applied to the case shown in Figure 3.

¹ Assuming the typical bootstrapping approach is used, thus completing the “deadly triad” of function approximation, bootstrapping, and off-policy training (Sutton and Barto, 2018).
² Convergence to a limit cycle is problematic as well, but preferable to the pessimal solution.
7 Conclusions
We have shown in Theorem 1 that most policy gradient methods, including state-of-the-art algorithms, do not follow the gradient of any objective function due to the way the discount factor is used. By Corollary 1, we argued that policy gradient methods are therefore better understood as semi-gradient methods. Finally, we showed that policy semi-gradient methods are not guaranteed to converge to a locally optimal policy (Figure 2), and in some cases even converge to the pessimal policy (Figure 3).
References
 Baird [1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
 Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bertsekas and Tsitsiklis [2000] D. P. Bertsekas and J. N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM J. Optim., 10:627–642, 2000.
 Dhariwal et al. [2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
 Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Jaakkola et al. [1994] Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710, 1994.

 Kakade [2001] Sham Kakade. Optimizing average reward using discounted rewards. In International Conference on Computational Learning Theory, pages 605–615. Springer, 2001.
 Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
 Liu et al. [2015] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 504–513. AUAI Press, 2015.
 Liu et al. [2016] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Proximal gradient temporal difference learning algorithms. In IJCAI, pages 4195–4199, 2016.
 Mahadevan [1996] Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1–3):159–195, 1996.
 Mahmood et al. [2015] A Rupam Mahmood, Huizhen Yu, Martha White, and Richard S Sutton. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.
 Marbach and Tsitsiklis [2001] Peter Marbach and John N Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 Sutton [2015] Richard S Sutton. Introduction to reinforcement learning with function approximation. In Tutorial at the Conference on Neural Information Processing Systems, 2015.
 Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Sutton et al. [2009] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
 Thomas [2014] Philip Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pages 441–448, 2014.
 Wang et al. [2016] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Weaver and Tao [2001] Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 538–545. Morgan Kaufmann Publishers Inc., 2001.
 Wu et al. [2017] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5279–5288, 2017.
Appendix
Lemma 1.
The undiscounted objective is related to $v^{\pi_\theta}_\gamma$ for all $\gamma \in [0, 1]$ by:

$$J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s)\, v^{\pi_\theta}_\gamma(s), \tag{8}$$

where $d^{\pi_\theta}$ is the unnormalized, weighted state distribution given by:

$$d^{\pi_\theta}(s) = \Pr(S_0 = s) + (1 - \gamma) \sum_{t=1}^{T} \Pr(S_t = s). \tag{9}$$
Proof.
Consider that for any $s$ and $t$, we can rearrange the Bellman equation: $\mathbf{E}[R_t \mid S_t = s] = v^{\pi_\theta}_\gamma(s) - \gamma\, \mathbf{E}\big[v^{\pi_\theta}_\gamma(S_{t+1}) \mid S_t = s\big]$. This allows us to rewrite the objective:

$$\begin{aligned} J(\theta) &= \mathbf{E}\left[\sum_{t=0}^{T} R_t\right] = \mathbf{E}\left[\sum_{t=0}^{T} \Big(v^{\pi_\theta}_\gamma(S_t) - \gamma\, v^{\pi_\theta}_\gamma(S_{t+1})\Big)\right] \\ &= \mathbf{E}\big[v^{\pi_\theta}_\gamma(S_0)\big] + (1 - \gamma)\, \mathbf{E}\left[\sum_{t=1}^{T} v^{\pi_\theta}_\gamma(S_t)\right] - \gamma\, \mathbf{E}\big[v^{\pi_\theta}_\gamma(S_{T+1})\big] \\ &= \sum_{s \in \mathcal{S}} \left( \Pr(S_0 = s) + (1 - \gamma) \sum_{t=1}^{T} \Pr(S_t = s) \right) v^{\pi_\theta}_\gamma(s) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s)\, v^{\pi_\theta}_\gamma(s), \end{aligned}$$

where the last line uses the fact that $S_{T+1}$ is the terminal absorbing state, so $v^{\pi_\theta}_\gamma(S_{T+1}) = 0$.
∎
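As a sanity check, Lemma 1 can be verified numerically on a minimal example. The two-state deterministic chain below ($s_0 \to s_1 \to$ terminal, with rewards $r_0$ then $r_1$) is a hypothetical construction of ours, not an example from the paper:

```python
# Numeric check of Lemma 1 on a deterministic two-state chain.

def check_lemma1(r0, r1, gamma):
    # Discounted values under the (only) policy on this chain.
    v = {"s1": r1, "s0": r0 + gamma * r1}
    # Weighted state distribution (9): Pr(S_0 = s) + (1 - gamma) * sum_{t>=1} Pr(S_t = s).
    d = {"s0": 1.0, "s1": (1.0 - gamma) * 1.0}
    lhs = r0 + r1                          # undiscounted objective J
    rhs = sum(d[s] * v[s] for s in d)      # right-hand side of (8)
    return lhs, rhs

print(check_lemma1(0.5, 2.0, 0.9))  # both sides equal 2.5 up to rounding
```

The discount appears on both sides in exactly compensating ways: a smaller $\gamma$ shrinks $v_\gamma(s_0)$ but increases the weight $(1 - \gamma)$ placed on $s_1$.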
Lemma 2.
The conventional policy update, $\nabla_* J(\theta)$, can be written as:

$$\nabla_* J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta}. \tag{10}$$
Proof.
We begin by hypothesizing that $\nabla_* J(\theta)$ takes a form similar to (10), given some time-dependent weights, $w(t)$, on each term in the state distribution. That is, we hypothesize that (10) holds for some

$$d^{\pi_\theta}_w(s) = \sum_{t=0}^{T} w(t) \Pr(S_t = s).$$

We then must show that Equation (10) is equal to $\nabla_* J(\theta)$ for some choice of $w$, and then derive the satisfying choice of $w$. Sutton et al. [2000] established a lemma stating:

$$\frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \gamma^k \Pr(S_k = x \mid S_0 = s) \sum_{a \in \mathcal{A}} \frac{\partial \pi_\theta(x, a)}{\partial \theta}\, q^{\pi_\theta}_\gamma(x, a). \tag{11}$$

Since the value of the terminal state is 0, the terms for $k > T$ vanish, and thus the sum over $k$ can stop at $T$ rather than $\infty$. Continuing, we can move the summation inside the expectation and reorder the summation by substitution of the variable $u = t + k$:

$$\sum_{s \in \mathcal{S}} d^{\pi_\theta}_w(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} = \sum_{u=0}^{T} \sum_{x \in \mathcal{S}} \left( \sum_{t=0}^{u} w(t)\, \gamma^{u-t} \right) \Pr(S_u = x) \sum_{a \in \mathcal{A}} \frac{\partial \pi_\theta(x, a)}{\partial \theta}\, q^{\pi_\theta}_\gamma(x, a).$$

In order to derive $\nabla_* J(\theta)$, we simply need to choose a $w(t)$ such that $\sum_{t=0}^{u} w(t)\, \gamma^{u-t} = 1$ for all $u$. This is satisfied by the choice: $w(t) = 1 - \gamma$ if $t \geq 1$, and $w(0) = 1$ otherwise. This trivially holds for $u = 0$, as $w(0)\, \gamma^0 = 1$. For $u \geq 1$:

$$\sum_{t=0}^{u} w(t)\, \gamma^{u-t} = \gamma^u + (1 - \gamma) \sum_{t=1}^{u} \gamma^{u-t} = \gamma^u + (1 - \gamma)\, \frac{1 - \gamma^u}{1 - \gamma} = 1.$$

Thus, for this choice of $w$:

$$\sum_{s \in \mathcal{S}} d^{\pi_\theta}_w(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} = \sum_{u=0}^{T} \sum_{x \in \mathcal{S}} \Pr(S_u = x) \sum_{a \in \mathcal{A}} \frac{\partial \pi_\theta(x, a)}{\partial \theta}\, q^{\pi_\theta}_\gamma(x, a) = \nabla_* J(\theta).$$

Finally, we see that this choice of $w$ also gives us the correct $d^{\pi_\theta}$:

$$d^{\pi_\theta}_w(s) = \Pr(S_0 = s) + (1 - \gamma) \sum_{t=1}^{T} \Pr(S_t = s) = d^{\pi_\theta}(s).$$
∎
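The weight identity at the heart of this proof, that choosing $w(0) = 1$ and $w(t) = 1 - \gamma$ for $t \geq 1$ makes the cumulative coefficient on every timestep equal to 1, can be checked numerically:

```python
# Check that sum_{t=0}^{u} w(t) * gamma^(u - t) = 1 for every u, with
# w(0) = 1 and w(t) = 1 - gamma for t >= 1.

def coefficient_sum(u, gamma):
    w = lambda t: 1.0 if t == 0 else 1.0 - gamma
    return sum(w(t) * gamma ** (u - t) for t in range(u + 1))

for gamma in (0.0, 0.5, 0.99):
    for u in range(6):
        assert abs(coefficient_sum(u, gamma) - 1.0) < 1e-12
```

Note that the identity holds for every $\gamma \in [0, 1)$, and also for $\gamma = 1$ (where every weight past $t = 0$ is simply zero).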
Theorem 1.
There does not exist any continuous objective, $J_*$, such that $\nabla_* J$ is its gradient. That is, for all $f \in C$ there exists some $\theta \in \mathbb{R}^n$ such that:

$$\nabla f(\theta) \neq \nabla_* J(\theta),$$

where $C$ is the space of continuous functions.
Proof.
We present a proof by contradiction. Assume that there exists some $f$ whose gradient is $\nabla_* J$, and that $\theta_1$ and $\theta_2$ are parameters of the policy, $\pi_\theta$. If $\nabla f$ is a continuous function, then it is also true that its second derivative is symmetric:

$$\frac{\partial^2 f}{\partial \theta_1\, \partial \theta_2} = \frac{\partial^2 f}{\partial \theta_2\, \partial \theta_1}.$$

However, it is easy to show that this is not true algebraically using Lemma 2:

$$\frac{\partial^2 f}{\partial \theta_2\, \partial \theta_1} = \frac{\partial}{\partial \theta_2} \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_1} = \sum_{s \in \mathcal{S}} \left( d^{\pi_\theta}(s) \frac{\partial^2 v^{\pi_\theta}_\gamma(s)}{\partial \theta_2\, \partial \theta_1} + \frac{\partial d^{\pi_\theta}(s)}{\partial \theta_2} \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_1} \right). \tag{12}$$
Thus, we immediately arrive at an apparent contradiction: the first term inside the sum is symmetric in the two parameters, but the second generally is not. However, to complete the proof of Theorem 1, we must show a case where the summation produces an asymmetry. Such a case is shown in Figure 1. We present an MDP with two states, $s_1$ and $s_2$, in addition to the terminal absorbing state. The agent begins in $s_1$. There are two actions, $a_1$ and $a_2$. If the agent selects action $a_1$ in state $s_1$, it is taken to the terminal absorbing state and receives a reward of 0. If it chooses action $a_2$ in state $s_1$, it is taken to state $s_2$, and again receives a reward of 0. In state $s_2$, the agent is always taken to the terminal absorbing state regardless of the action chosen. If the agent takes action $a_1$, it again receives a reward of 0. However, if the agent takes action $a_2$, it receives a reward of +1.
We consider the case where $\gamma = 0$, and the agent’s policy is tabular over states and actions; that is, we do not introduce any bias due to the use of function approximation. There are two parameters, $\theta_1$ and $\theta_2$, which determine the policy in states $s_1$ and $s_2$, respectively. The probability of choosing action $a_2$ in state $s_i$ is given by $\sigma(\theta_i)$, where $\sigma$ is the sigmoid function, $\sigma(x) = 1/(1 + e^{-x})$, and the probability of choosing $a_1$ is given by $1 - \sigma(\theta_i)$. Finally, we initialize the policy to $\theta_1 = \theta_2 = 0$.
Consider the partial derivatives of $d^{\pi_\theta}$ and $v^{\pi_\theta}_\gamma$. First, consider $s_1$. The agent cannot affect the likelihood of being in $s_1$ at time $t = 0$, so $\frac{\partial d^{\pi_\theta}(s_1)}{\partial \theta_1} = \frac{\partial d^{\pi_\theta}(s_1)}{\partial \theta_2} = 0$. Additionally, because there are no immediate rewards available in $s_1$ and $\gamma = 0$, the value function and its gradient are zero at $s_1$ regardless of the policy parameters: $\frac{\partial v^{\pi_\theta}_\gamma(s_1)}{\partial \theta_1} = \frac{\partial v^{\pi_\theta}_\gamma(s_1)}{\partial \theta_2} = 0$. Next, consider $\frac{\partial d^{\pi_\theta}(s_2)}{\partial \theta_1}$. The probability of entering $s_2$ is proportional to the probability of choosing $a_2$ in $s_1$. At $\theta_1 = 0$:

$$\frac{\partial d^{\pi_\theta}(s_2)}{\partial \theta_1} = \frac{\partial \sigma(\theta_1)}{\partial \theta_1} = \sigma(0)\big(1 - \sigma(0)\big) = \frac{1}{4}.$$

On the other hand, the value at $s_2$ does not depend on $\theta_1$, so $\frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_1} = 0$. Finally, consider $\frac{\partial d^{\pi_\theta}(s_2)}{\partial \theta_2}$ and $\frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_2}$. $\theta_2$ cannot influence $d^{\pi_\theta}$, as the action taken in $s_2$ has no effect on the state distribution, so $\frac{\partial d^{\pi_\theta}(s_2)}{\partial \theta_2} = 0$. However, $v^{\pi_\theta}_\gamma(s_2) = \sigma(\theta_2)$, so:

$$\frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_2} = \sigma(0)\big(1 - \sigma(0)\big) = \frac{1}{4}.$$

Putting it all together:

$$\sum_{s \in \mathcal{S}} \frac{\partial d^{\pi_\theta}(s)}{\partial \theta_1} \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_2} = \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}, \qquad \sum_{s \in \mathcal{S}} \frac{\partial d^{\pi_\theta}(s)}{\partial \theta_2} \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_1} = 0.$$

Since the symmetric second-derivative terms in (12) are identical under either ordering, the two mixed partial derivatives differ by $\frac{1}{16}$. We have shown that in this instance the second derivative is not symmetric. Therefore, we conclude that no general form of $J_*$ exists. ∎
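The asymmetry computed above can also be checked numerically. The sketch below hard-codes the closed forms for this counterexample under our parameterization (the probability of action $a_2$ in state $s_i$ is the sigmoid of $\theta_i$) and estimates the mixed partial derivatives by central finite differences:

```python
# Numeric check of the asymmetry in the two-state counterexample (gamma = 0).
# The update direction works out to g(theta) = (0, sigma(theta_1) * sigma'(theta_2)),
# and we verify that its Jacobian is not symmetric at theta = (0, 0).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def g(theta1, theta2):
    # nabla_* J for the counterexample: sum over states of d(s) * dv(s)/dtheta.
    d_s2 = sigmoid(theta1)  # weight on s2 in the state distribution
    dv_s2_dtheta2 = sigmoid(theta2) * (1.0 - sigmoid(theta2))
    return (0.0, d_s2 * dv_s2_dtheta2)

eps = 1e-6
dg2_dtheta1 = (g(eps, 0.0)[1] - g(-eps, 0.0)[1]) / (2 * eps)
dg1_dtheta2 = (g(0.0, eps)[0] - g(0.0, -eps)[0]) / (2 * eps)
print(dg2_dtheta1, dg1_dtheta2)  # approximately 0.0625 and 0.0
```

The two cross-derivatives are $\tfrac{1}{16} = 0.0625$ and $0$, so no scalar function can have $g$ as its gradient, consistent with the algebraic argument above.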