Is the Policy Gradient a Gradient?

06/17/2019 ∙ by Chris Nota, et al. ∙ University of Massachusetts Amherst

The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods do not use the discount factor in the manner originally prescribed, and therefore do not optimize the discounted objective. It has been an open question in RL as to which objective, if any, they optimize instead. We show that the direction followed by these methods is not the gradient of any objective, and reclassify them as semi-gradient methods with respect to the undiscounted objective. Further, we show that they are not guaranteed to converge to a locally optimal policy, and construct a counterexample in which they converge to the globally pessimal policy with respect to both the discounted and undiscounted objectives.


1 Introduction

Reinforcement learning (RL) is a subfield of machine learning in which computational agents learn to maximize a numerical reward signal through interaction with their environment. Policy gradient methods encode an agent's behavior as a parameterized stochastic policy and update the policy parameters according to an estimate of the gradient of the expected sum of rewards (the expected return) with respect to those parameters. In practice, estimating the effect of a particular action on the return that follows can be difficult, so almost all state-of-the-art implementations instead consider an exponentially discounted sum of rewards (the discounted return), which causes the agent to primarily consider the short-term effects of its actions. However, they do not do so in the manner described by the original policy gradient theorem (Sutton et al., 2000). It is an open question in RL as to whether these algorithms optimize a different, related objective (Thomas, 2014). In this paper, we answer this question by showing that most policy gradient methods, including state-of-the-art methods, do not follow the gradient of any objective, and are therefore better understood as semi-gradient methods with respect to the undiscounted objective.

The analysis in this paper applies to all state-of-the-art policy gradient methods known to the authors. This includes the published versions of methods such as A2C/A3C (Mnih et al., 2016), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), DDPG (Silver et al., 2014), ACKTR (Wu et al., 2017), ACER (Wang et al., 2016), and soft actor-critic (Haarnoja et al., 2018), as well as their publicly available implementations (Dhariwal et al., 2017). We note, however, that the concerns of this paper long predate “deep” RL. The observation that most policy gradient methods do not conform to the policy gradient theorem presented by Sutton et al. (2000) dates at least back to the work of Kakade (2001), while the explicit question addressed by this paper, as to whether these algorithms estimate the gradient of any objective, dates at least to the work of Thomas (2014). The purpose of this paper is not to criticize any particular algorithm or the policy gradient approach in general, nor to propose specific new algorithms, but rather to improve our understanding of the mathematical foundations of policy gradient methods.

2 Notation

In RL, the agent learns through interactions with an environment, which is expressed mathematically as a Markov decision process (MDP). An MDP is a tuple, $(\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{A}$ is the set of actions available to the agent, $P$ is a transition function that determines the probability of transitioning between states given an action, $R$ is the expected reward from taking an action in a particular state, $d_0$ is the initial state distribution, and $\gamma \in [0, 1]$ is the discount factor, which decreases the utility of rewards received in the future.

In the episodic setting, interactions with the environment are broken into episodes. Each episode is further broken into individual timesteps. At each timestep, $t$, the agent observes a state, $S_t$, takes an action, $A_t$, transitions to a new state, $S_{t+1}$, and receives a reward, $R_t$. The first timestep is $t = 0$. For finite-horizon MDPs, the maximum timestep is given by $T$. If the agent reaches a special state, called the terminal absorbing state, prior to the final timestep, it stays in this state and receives a reward of 0 at each timestep until $t = T$, after which the process terminates.
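To make this protocol concrete, the following is a minimal sketch (ours, not the paper's) of an episode rollout in which the terminal absorbing state pads the trajectory out to the horizon $T$ with zero rewards. The chain MDP, its state and action names, and the step/run_episode helpers are illustrative assumptions.

```python
import random

# Toy chain MDP used only for illustration: from "s1", action "a2" continues to
# "s2" and "a1" terminates; from "s2", either action terminates, but only "a2"
# is rewarded (+1). The terminal absorbing state yields reward 0 forever.

T = 5                      # maximum timestep of the finite horizon
TERMINAL = "s_inf"         # terminal absorbing state

def step(state, action):
    """Deterministic transition and reward function for the toy chain MDP."""
    if state == TERMINAL:
        return TERMINAL, 0.0                       # absorbing: stay put, reward 0
    if state == "s1":
        return ("s2", 0.0) if action == "a2" else (TERMINAL, 0.0)
    # state == "s2": either action terminates; only "a2" is rewarded
    return TERMINAL, (1.0 if action == "a2" else 0.0)

def run_episode(policy):
    """Roll out exactly T + 1 timesteps (t = 0, ..., T), as described above."""
    state, rewards = "s1", []
    for t in range(T + 1):
        action = policy(state)
        state, reward = step(state, action)
        rewards.append(reward)
    return rewards

random_policy = lambda s: random.choice(["a1", "a2"])
print(run_episode(random_policy))                  # e.g. [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```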

A policy, $\pi$, determines the probability that an agent will choose an action in a particular state. A parameterized policy, $\pi_\theta$, is a policy that is determined by some parameter vector, $\theta$, which may be the weights in a neural network, values in a tabular representation, etc. The compatible features of a parameterized policy represent how $\theta$ may be changed in order to make a particular action, $a$, more likely in a particular state, $s$, and are defined as $\frac{\partial \ln \pi_\theta(s, a)}{\partial \theta}$.

The value function, $v^{\pi_\theta}_\gamma$, represents the expected discounted sum of rewards when starting in a particular state under policy $\pi_\theta$; that is, $v^{\pi_\theta}_\gamma(s) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid S_0 = s, \pi_\theta]$, where conditioning on $\pi_\theta$ indicates that actions are sampled according to $\pi_\theta$. The action-value function, $q^{\pi_\theta}_\gamma$, is similar, but also considers the action taken; that is, $q^{\pi_\theta}_\gamma(s, a) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t \mid S_0 = s, A_0 = a, \pi_\theta]$. The advantage function is the difference between the action-value function and the (state) value function: $a^{\pi_\theta}_\gamma(s, a) = q^{\pi_\theta}_\gamma(s, a) - v^{\pi_\theta}_\gamma(s)$.

The objective of an RL agent is to maximize some function of its parameters, $J(\theta)$. In the episodic setting, the two most commonly stated objectives are the discounted objective, $J_\gamma(\theta) = \mathbf{E}[\sum_{t=0}^{T} \gamma^t R_t]$, and the undiscounted objective, $J_1(\theta) = \mathbf{E}[\sum_{t=0}^{T} R_t]$. The discounted objective has some convenient mathematical properties, but it corresponds to few real-world tasks; Sutton and Barto (2018) have even argued for its deprecation. The undiscounted objective is a more intuitive match for most tasks. Both objectives and their respective gradients are discussed in this paper.
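As a concrete illustration (our addition, not from the paper), the short sketch below computes the discounted and undiscounted returns of a single made-up reward sequence; $J_\gamma$ and $J_1$ are the expectations of these quantities over episodes.

```python
# A small, self-contained sketch contrasting the two objectives on one
# trajectory. The reward sequence is invented for illustration.

def discounted_return(rewards, gamma):
    """sum_t gamma^t * R_t -- the quantity whose expectation is J_gamma."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def undiscounted_return(rewards):
    """sum_t R_t -- the quantity whose expectation is J_1 (i.e., gamma = 1)."""
    return sum(rewards)

rewards = [0.0, 0.0, 0.0, 10.0]               # a single delayed reward
print(discounted_return(rewards, gamma=0.5))  # 1.25: the delayed reward is heavily down-weighted
print(undiscounted_return(rewards))           # 10.0
```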

3 The policy gradient theorem

The policy gradient theorem (Sutton et al., 2000) was originally presented for two objectives: the average reward objective for the continuing setting (Mahadevan, 1996) and the discounted objective ($J_\gamma$) for the episodic setting. The episodic setting is usually more popular, as it is better suited to the types of tasks that RL researchers typically use for evaluation (e.g., many classic control tasks, and the Arcade Learning Environment (Bellemare et al., 2013)). The policy gradient, $\nabla J_\gamma(\theta)$, tells us how to modify $\theta$ in order to increase $J_\gamma$, and is given by:

$$\nabla J_\gamma(\theta) = \mathbf{E}\left[\sum_{t=0}^{T} \gamma^t\, \frac{\partial \ln \pi_\theta(S_t, A_t)}{\partial \theta}\, q^{\pi_\theta}_\gamma(S_t, A_t)\right]. \qquad (1)$$

Because (1) is the true gradient of the discounted objective, algorithms that follow unbiased estimates of (1) are given the standard guarantees of stochastic gradient descent (namely, that given an appropriate step-size schedule and smoothness assumptions, convergence to a local optimum is almost sure (Bertsekas and Tsitsiklis, 2000)). However, researchers typically are not interested in the discounted objective, but rather the undiscounted objective. As a result, most conventional “policy gradient” algorithms (PPO, A3C, etc.) instead directly or indirectly estimate the expression:

$$\nabla_* J_*(\theta) := \mathbf{E}\left[\sum_{t=0}^{T} \frac{\partial \ln \pi_\theta(S_t, A_t)}{\partial \theta}\, q^{\pi_\theta}_\gamma(S_t, A_t)\right]. \qquad (2)$$

Note that the difference between (1) and (2) is that the latter has dropped the $\gamma^t$ term (Thomas, 2014). However, notice that this new expression is not the gradient of the undiscounted objective either, as it uses the discounted action-value function, $q^{\pi_\theta}_\gamma$. We label this update direction $\nabla_* J_*$ because it is no longer the gradient of $J_\gamma$ or $J_1$, but is rather the gradient of some hypothetical, related objective, $J_*$. The asterisk character, *, was chosen to represent the ambiguity of its existence.

Thomas (2014) attempted to construct $J_*$, but was only able to do so for an impractically restricted setting wherein the agent's actions do not impact state transitions. Its value in the general case was left as an open question. We will later show that the hypothetical objective, $J_*$, does not exist when the impractical restrictions imposed by Thomas (2014) are removed. The result is that algorithms based on (2), which are commonly referred to as “policy gradient” algorithms, cannot be interpreted as gradient algorithms with respect to any objective. In Section 5, we show that these methods are better understood as semi-gradient methods relative to the undiscounted objective. In the next section, we prove an intermediate result relating the undiscounted objective to the discounted value function.
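To make the distinction concrete, the following is a hedged sketch (our own, not the authors' implementation) of a REINFORCE-style estimator with a tabular softmax policy, in which the only difference between an estimate of the true discounted policy gradient (1) and the conventional update (2) is whether the $\gamma^t$ weight is kept. The trajectories format and helper names are assumptions for illustration.

```python
import numpy as np

# theta has shape (num_states, num_actions); `trajectories` is a list of
# (states, actions, rewards) tuples collected under the current policy.

def grad_log_pi(theta, s, a):
    """d/d(theta) of log pi(a | s) for a tabular softmax policy."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[s] = -probs                 # derivative of the log-normalizer
    grad[s, a] += 1.0                # derivative of theta[s, a]
    return grad

def policy_update(theta, trajectories, gamma, keep_gamma_t):
    """Monte Carlo estimate of (1) if keep_gamma_t is True, else of the conventional update (2)."""
    total = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        horizon = len(rewards)
        for t in range(horizon):
            # Monte Carlo estimate of the *discounted* action value q_gamma(S_t, A_t);
            # both updates use it -- only the gamma^t weight below differs.
            q_hat = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
            weight = gamma ** t if keep_gamma_t else 1.0
            total += weight * q_hat * grad_log_pi(theta, states[t], actions[t])
    return total / len(trajectories)
```

With keep_gamma_t set to True the estimator targets (1); with it set to False it targets (2), the direction followed by the methods discussed above.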

4 Decomposing the undiscounted objective

We begin our analysis by presenting a lemma that provides a new view of the undiscounted objective by allowing us to express it in terms of the discounted value function, for any value of $\gamma$. Writing the objective in this new form is the first step towards uncovering its relationship with $\nabla_* J_*$. This is necessary, as $\nabla_* J_*$ is defined in terms of the discounted action-value function.

Lemma 1.

The undiscounted objective, $J_1$, is related to $v^{\pi_\theta}_\gamma$ for all $\gamma \in [0, 1]$ by:

$$J_1(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}_\gamma(s)\, v^{\pi_\theta}_\gamma(s), \qquad (3)$$

where $d^{\pi_\theta}_\gamma$ is the unnormalized, weighted state distribution given by:

$$d^{\pi_\theta}_\gamma(s) = \Pr(S_0 = s) + (1 - \gamma) \sum_{t=1}^{T} \Pr(S_t = s). \qquad (4)$$
Proof.

See appendix. ∎

Taking the gradient of this expression gives us a corresponding new form of the policy gradient. The first step of the derivation is given here, as it deepens our intuition as to the relationship between the discounted value function and the undiscounted objective:

$$\nabla J_1(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}_\gamma(s)\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} + \sum_{s \in \mathcal{S}} \frac{\partial d^{\pi_\theta}_\gamma(s)}{\partial \theta}\, v^{\pi_\theta}_\gamma(s). \qquad (5)$$

We see that there are two components to this new form of the gradient. The first component optimizes the discounted value function over the current distribution of states. The second term takes into account the change in the weighted distribution over states induced by changes in the policy. Note that in the $\gamma = 1$ case, $\frac{\partial d^{\pi_\theta}_\gamma(s)}{\partial \theta}$ is zero for all $s$, and we are left with only the first term. Also note that under this formulation, $\gamma$ is strictly a hyperparameter, and does not impact the optimal policy. In the next section, we relate this expression to $\nabla_* J_*$.
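As a sanity check (ours, not from the paper), the sketch below verifies the decomposition in Lemma 1 numerically on a two-state chain MDP (the same construction used later as the counterexample in Figure 1), assuming the closed form of (4) stated above. The probabilities p1 and p2 of choosing the rewarded action in each state are illustrative parameters.

```python
import itertools

# Toy chain MDP: from s1, action a2 moves to s2 with reward 0 (a1 terminates);
# from s2, action a2 terminates with reward +1 (a1 terminates with reward 0).
# p1, p2 are the probabilities of choosing a2 in s1 and s2 respectively.

def check_lemma1(p1, p2, gamma):
    J1 = p1 * p2                                  # undiscounted objective: +1 is reached iff a2 is chosen twice
    v = {"s1": gamma * p1 * p2, "s2": p2}         # discounted values v_gamma
    d = {"s1": 1.0, "s2": (1.0 - gamma) * p1}     # weighted state distribution from (4)
    return abs(J1 - sum(d[s] * v[s] for s in v)) < 1e-12

for p1, p2, gamma in itertools.product([0.3, 0.9], [0.5, 1.0], [0.0, 0.5, 0.99]):
    assert check_lemma1(p1, p2, gamma)
print("Lemma 1 decomposition holds for all tested settings on this MDP.")
```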

5 Policy semi-gradient methods

In this section, we show that when discounting is used, most policy “gradient” methods are actually policy semi-gradient methods, and that they are not truly gradient methods with respect to any objective. First, we define the term semi-gradient. If a gradient of a function can be written with two terms, the semi-gradients are each of these terms; they represent the two “halves” of the gradient. The semi-gradients of a function are not unique, as they entirely depend on how the gradient is written. Nevertheless, they can provide useful intuition for the behavior of algorithms. The term was coined by Sutton (2015) in order to describe the relationship between temporal difference (TD) learning methods and the mean-squared Bellman error. Similarly, we use the term to describe the relationship between $\nabla_* J_*$ and the undiscounted objective. In the following lemma, we show that $\nabla_* J_*$ may be written as the value gradient term in (5), from which the proof that it is a semi-gradient is immediate:

Lemma 2.

The conventional policy update, $\nabla_* J_*(\theta)$, can be written as:

$$\nabla_* J_*(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}_\gamma(s)\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta}. \qquad (6)$$
Proof.

See appendix. ∎

Corollary 1.

$\nabla_* J_*$ is a semi-gradient of $J_1$.

Proof.

The right-hand side of (6) is the first term of the gradient, $\nabla J_1(\theta)$, in (5). The corollary follows from the definition of a semi-gradient. ∎

In a sense, it is remarkable that this approach to policy gradients works at all. Value function estimates are typically learned using semi-gradient TD methods (Sutton and Barto, 2018). The policy is typically learned using an approximation of $\nabla_* J_*$, which we have shown is a semi-gradient. Both “actor” and “critic” follow semi-gradients, and yet implementations of actor-critic methods (Konda and Tsitsiklis, 2000) that follow these update directions are able to learn effective policies. An open question in RL is whether these algorithms are following the gradient of some related objective, and whether this is the cause of their effectiveness (Thomas, 2014). We show that this is not the case.

Theorem 1.

There does not exist any continuous objective, $J_*$, such that $\nabla_* J_*$ is its gradient. That is, for all $f \in \mathcal{C}$:

$$\exists\, \theta \in \mathbb{R}^n \text{ such that } \nabla f(\theta) \neq \nabla_* J_*(\theta),$$

where $\mathcal{C}$ is the space of continuous functions.

Proof.

We present a proof by contradiction. Assume that there exists some $J_*$ whose gradient is $\nabla_* J_*$, and that $\theta_i$ and $\theta_j$ are parameters of the policy, $\pi_\theta$. Because $\nabla_* J_*$ is continuously differentiable in $\theta$ for the smooth policy parameterizations considered here, such a $J_*$ would be twice continuously differentiable, and its second derivative would therefore be symmetric:

$$\frac{\partial^2 J_*(\theta)}{\partial \theta_i\, \partial \theta_j} = \frac{\partial^2 J_*(\theta)}{\partial \theta_j\, \partial \theta_i}.$$

However, it is easy to show algebraically, using Lemma 2, that this symmetry need not hold. Differentiating (6) gives:

$$\frac{\partial}{\partial \theta_j}\big[\nabla_* J_*(\theta)\big]_i = \sum_{s \in \mathcal{S}} \left( d^{\pi_\theta}_\gamma(s)\, \frac{\partial^2 v^{\pi_\theta}_\gamma(s)}{\partial \theta_j\, \partial \theta_i} + \frac{\partial d^{\pi_\theta}_\gamma(s)}{\partial \theta_j}\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_i} \right). \qquad (7)$$

The first term on the right-hand side is symmetric in $i$ and $j$, but the second term need not be.

Thus, we immediately arrive at an apparent contradiction. However, to formally complete the proof, we must provide a counterexample where the latter term is asymmetric. Such a counterexample is provided in Figure 1. ∎

Figure 1: A counterexample wherein the derivative of $\nabla_* J_*$ is asymmetric for all $\theta$, as needed for the proof of Theorem 1. The agent begins in $s_1$, and may choose between actions $a_1$ and $a_2$, each of which produces deterministic transitions. The rewards are 0 everywhere except when the agent chooses $a_2$ in state $s_2$, which produces a reward of +1. The policy is assumed to be tabular over states and actions, with one parameter, $\theta_1$, determining the policy in $s_1$, and a second parameter, $\theta_2$, determining the policy in $s_2$ (e.g., the policy may be determined by the sigmoid function, $\pi_\theta(s_i, a_2) = \sigma(\theta_i)$). For intuition, consider $\gamma = 0$. (7) is asymmetric in this case because $\theta_1$ affects the state distribution, but not the value function, whereas $\theta_2$ affects the value function but not the state distribution. Therefore, the second term in (7) is non-zero when differentiating the $\theta_2$ component of $\nabla_* J_*$ with respect to $\theta_1$, and zero when differentiating the $\theta_1$ component with respect to $\theta_2$. We show the full computation in the appendix.
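The asymmetry can also be checked numerically. The sketch below (our addition) evaluates the conventional update on the Figure 1 MDP with $\gamma = 0$ and the sigmoid tabular policy, using the closed form implied by Lemma 2, and shows that its finite-difference Jacobian is asymmetric, so the update cannot be the gradient of any objective.

```python
import numpy as np

# With gamma = 0 on the Figure 1 MDP: v(s1) = 0, v(s2) = sigma(theta2),
# d(s1) = 1, and d(s2) = sigma(theta1), so the update of Lemma 2 reduces to
# (0, sigma(theta1) * sigma'(theta2)).

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conventional_update(theta):
    t1, t2 = theta
    dv2_dt2 = sigmoid(t2) * (1.0 - sigmoid(t2))    # derivative of v(s2) = sigma(theta2)
    return np.array([0.0, sigmoid(t1) * dv2_dt2])

def jacobian(f, theta, eps=1e-6):
    """Central-difference Jacobian J[i, j] = d f_i / d theta_j."""
    theta = np.asarray(theta, dtype=float)
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = eps
        J[:, j] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return J

J = jacobian(conventional_update, [0.0, 0.0])
print(J)    # J[1, 0] is about 0.0625 while J[0, 1] is 0: the Jacobian is not
            # symmetric, so no objective can have this update as its gradient.
```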

6 Non-optimality of semi-gradient methods

One may wonder: why should we be concerned at all that policy gradient methods are following a semi-gradient rather than a gradient? After all, the same may be said of TD (Sutton, 2015), which remains in wide use. First, there is a large body of literature (Baird, 1995; Sutton et al., 2009; Liu et al., 2015; Mahmood et al., 2015; Liu et al., 2016) suggesting that many researchers share this concern regarding TD; even in the subfield of “deep” RL, the concept of a target network (Mnih et al., 2015) was introduced to address the instability introduced by the use of a semi-gradient update. Second, policy semi-gradient methods suffer from weaknesses that semi-gradient TD methods do not.

Consider that semi-gradient TD is a semi-gradient of the mean-squared Bellman error (Baird, 1995; Sutton, 2015), whereas the policy semi-gradient is a semi-gradient of the undiscounted objective. In the tabular setting, semi-gradient TD will converge to the minimum of the mean-squared Bellman error (Jaakkola et al., 1994), but policy semi-gradient methods will often fail to converge to the maximum of the undiscounted objective—e.g., they may instead converge to the maximum of the discounted objective (see Figure 2). There is therefore an asymmetry between semi-gradient TD and policy semi-gradient methods, in that the former converges to the optimum of the objective of which it is a semi-gradient, while the latter does not. However, this situation is not particularly alarming, as it is well understood that the optimal policy under the discounted objective approximates the optimal policy under the undiscounted objective (Marbach and Tsitsiklis, 2001; Weaver and Tao, 2001; Kakade, 2001; Thomas, 2014). The choice of the discount factor for policy semi-gradient methods in the tabular setting may therefore be considered in terms of a bias-variance tradeoff.

When a function approximator is used, a worse problem emerges that is analogous to divergence in semi-gradient TD methods. With a linear function approximator, semi-gradient TD methods only diverge in the off-policy setting (assuming the typical bootstrapping approach is used, thus completing the “deadly triad” of function approximation, bootstrapping, and off-policy training (Sutton and Barto, 2018)), for instance, in Baird's famous counterexample (Baird, 1995). However, the following problem may emerge for policy semi-gradient methods even in the on-policy setting. It is not possible for policies to diverge in quite the same way as value estimates, as the probability of choosing an action is bounded between 0 and 1. The worst possible outcome is convergence to the pessimal policy within the given parameterization (convergence to a limit cycle is problematic as well, but preferable to the pessimal solution). One may ask: pessimal with respect to which objective? In Figure 3, we give an example wherein policy semi-gradient methods will converge to the pessimal policy with respect to both the discounted and undiscounted objectives. While the behavior of policy semi-gradient methods described in Figure 2 may be justified as a myopic preference, the same justification cannot be applied to the case shown in Figure 3.

Figure 2: An example where semi-gradient methods using the specified value of $\gamma$ will produce the optimal policy with respect to the discounted objective but the pessimal behavior with respect to the undiscounted objective. The agent starts in an initial state and, by choosing between two actions, can obtain either a smaller reward that arrives sooner or a larger reward that arrives later. In order to maximize the undiscounted return, the agent should choose the larger, delayed reward. However, for the specified value of $\gamma$, the discounted return is higher for the smaller, sooner reward, so semi-gradient methods will choose it. If the researcher is concerned with the undiscounted objective, as we argue is generally the case, this result is problematic. Choosing a larger value of $\gamma$ trivially fixes the problem in this particular example, but for any value of $\gamma < 1$, similar problems will arise given a sufficiently long horizon, and thus the problem is not truly eliminated.
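As an illustration of this failure mode, here is a hedged simulation in the spirit of Figure 2; the reward values, $\gamma$, and the sigmoid parameterization are our own assumptions rather than the figure's. One action in the start state yields an immediate reward of 1 and ends the episode, while the other yields a reward of 2 two timesteps later. Because only the start state involves a decision, the semi-gradient update coincides with the gradient of the discounted start-state value, and ascent on it drives the policy toward the smaller, immediate reward.

```python
import numpy as np

# Assumed toy problem: P(immediate action) = sigmoid(theta); the immediate action
# gives reward 1 now, the other gives reward 2 two timesteps later. With
# gamma = 0.5, the discounted return favors the immediate action
# (1 > 0.5**2 * 2 = 0.5), but the undiscounted return favors waiting (2 > 1).

gamma, r_now, r_later = 0.5, 1.0, 2.0
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def discounted_start_value(theta):
    p_now = sigmoid(theta)
    return p_now * r_now + (1.0 - p_now) * gamma ** 2 * r_later

theta, lr, eps = 0.0, 1.0, 1e-6
for _ in range(2000):
    # Semi-gradient ascent; with decisions only in the start state, this equals
    # the gradient of the discounted value of that state.
    grad = (discounted_start_value(theta + eps) - discounted_start_value(theta - eps)) / (2.0 * eps)
    theta += lr * grad

p_now = sigmoid(theta)
print(f"P(immediate action) ~= {p_now:.3f}")                                    # approaches 1
print(f"undiscounted return ~= {p_now * r_now + (1.0 - p_now) * r_later:.3f}")  # approaches 1, though 2 was attainable
```
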
Figure 3: An example where semi-gradient methods will produce the pessimal policy with respect to both the discounted and undiscounted objectives. In this formulation, there is a single policy parameter, $\theta$, so the agent must execute the same policy in every state. It is difficult to justify any solution other than always choosing the first action: if the agent is completely myopic, the first action gives the superior immediate reward in the starting state, and if the agent is somewhat farther-sighted, always choosing the first action will eventually result in the larger reward. Choosing the second action provides no benefit with respect to either objective. Consider three updates: (1) the policy gradient update given by Sutton et al. (2000), (2) the gradient of the undiscounted objective, and (3) the semi-gradient update that drops the $\gamma^t$ term. (1) and (2) will produce the policy that always chooses the first action. (3) will produce a policy that always chooses the second action, a solution which appears to be strictly inferior by any reasonable metric. Again, while we chose a small value of $\gamma$ for simplicity, similar examples can be produced in long-horizon problems with “reasonable” settings of $\gamma$ close to 1. One may also ask if sharing a policy between states is contrived, but such a situation occurs reliably under partial observability or when a function approximator is used.

7 Conclusions

We have shown in Theorem 1 that most policy gradient methods, including state-of-the-art algorithms, do not follow the gradient of any objective function due to the way the discount factor is used. By Corollary 1, we argued that policy gradient methods are therefore better understood as semi-gradient methods. Finally, we showed that policy semi-gradients are not guaranteed to converge to a locally optimal policy (Figure 2), and in some cases even converge to the pessimal policy (Figure 3).

References

  • Baird [1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
  • Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bertsekas and Tsitsiklis [2000] D. P. Bertsekas and J. N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM J. Optim., 10:627–642, 2000.
  • Dhariwal et al. [2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Jaakkola et al. [1994] Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710, 1994.
  • Kakade [2001] Sham Kakade. Optimizing average reward using discounted rewards. In International Conference on Computational Learning Theory, pages 605–615. Springer, 2001.
  • Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
  • Liu et al. [2015] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample analysis of proximal gradient td algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 504–513. AUAI Press, 2015.
  • Liu et al. [2016] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Proximal gradient temporal difference learning algorithms. In IJCAI, pages 4195–4199, 2016.
  • Mahadevan [1996] Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine learning, 22(1-3):159–195, 1996.
  • Mahmood et al. [2015] A Rupam Mahmood, Huizhen Yu, Martha White, and Richard S Sutton. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.
  • Marbach and Tsitsiklis [2001] Peter Marbach and John N Tsitsiklis. Simulation-based optimization of markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
  • Sutton [2015] Richard S Sutton. Introduction to reinforcement learning with function approximation. In Tutorial at the Conference on Neural Information Processing Systems, 2015.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Sutton et al. [2009] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
  • Thomas [2014] Philip Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pages 441–448, 2014.
  • Wang et al. [2016] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
  • Weaver and Tao [2001] Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 538–545. Morgan Kaufmann Publishers Inc., 2001.
  • Wu et al. [2017] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.

Appendix

Lemma 1.

The undiscounted objective, $J_1$, is related to $v^{\pi_\theta}_\gamma$ for all $\gamma \in [0, 1]$ by:

$$J_1(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}_\gamma(s)\, v^{\pi_\theta}_\gamma(s), \qquad (8)$$

where $d^{\pi_\theta}_\gamma$ is the unnormalized, weighted state distribution given by:

$$d^{\pi_\theta}_\gamma(s) = \Pr(S_0 = s) + (1 - \gamma) \sum_{t=1}^{T} \Pr(S_t = s). \qquad (9)$$
Proof.

Consider that for any $s$ and $t$, we can rearrange the Bellman equation: $\mathbf{E}[R_t \mid S_t = s] = v^{\pi_\theta}_\gamma(s) - \gamma\, \mathbf{E}[v^{\pi_\theta}_\gamma(S_{t+1}) \mid S_t = s]$. This allows us to rewrite the objective:

$$J_1(\theta) = \mathbf{E}\left[\sum_{t=0}^{T} R_t\right] = \sum_{t=0}^{T} \mathbf{E}\big[v^{\pi_\theta}_\gamma(S_t) - \gamma\, v^{\pi_\theta}_\gamma(S_{t+1})\big] = \mathbf{E}\big[v^{\pi_\theta}_\gamma(S_0)\big] + (1 - \gamma)\sum_{t=1}^{T} \mathbf{E}\big[v^{\pi_\theta}_\gamma(S_t)\big],$$

where the last equality uses the fact that $v^{\pi_\theta}_\gamma(S_{T+1}) = 0$, as the episode has terminated. Writing the expectations as sums over states gives (8), with $d^{\pi_\theta}_\gamma$ as defined in (9). ∎

Lemma 2.

The conventional policy update, $\nabla_* J_*(\theta)$, can be written as:

$$\nabla_* J_*(\theta) = \sum_{s \in \mathcal{S}} d^{\pi_\theta}_\gamma(s)\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta}. \qquad (10)$$
Proof.

We begin by hypothesizing that $\nabla_* J_*(\theta)$ takes a form similar to (10), given some time-dependent weights, $w(t)$, on each term in the state distribution. That is, we hypothesize that (10) holds for some

$$d^{\pi_\theta}_\gamma(s) = \sum_{t=0}^{T} w(t) \Pr(S_t = s).$$

We then must show that Equation (10) is equal to $\nabla_* J_*(\theta)$ for some choice of $w$, and then derive the satisfying choice of $w$. Sutton et al. [2000] established a lemma stating:

$$\frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} = \sum_{k=0}^{\infty} \gamma^k \sum_{s'} \Pr(S_k = s' \mid S_0 = s) \sum_{a} \frac{\partial \pi_\theta(s', a)}{\partial \theta}\, q^{\pi_\theta}_\gamma(s', a). \qquad (11)$$

Write $g(s) := \sum_{a} \frac{\partial \pi_\theta(s, a)}{\partial \theta}\, q^{\pi_\theta}_\gamma(s, a)$, so that expanding the expectation in (2) gives $\nabla_* J_*(\theta) = \sum_{u=0}^{T} \sum_{s'} \Pr(S_u = s')\, g(s')$. Applying (11) to (10), and using the Markov property together with the law of total probability, $\sum_{s} \Pr(S_t = s) \Pr(S_{t+k} = s' \mid S_t = s) = \Pr(S_{t+k} = s')$, gives us:

$$\sum_{s} d^{\pi_\theta}_\gamma(s)\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} = \sum_{t=0}^{T} w(t) \sum_{k=0}^{\infty} \gamma^k \sum_{s'} \Pr(S_{t+k} = s')\, g(s').$$

Since the value of the terminal state is 0, $g$ vanishes once the episode has terminated, and thus the sum over $k$ can stop at $T - t$ rather than $\infty$. Substituting the variable $u = t + k$ and reordering the summation:

$$\sum_{s} d^{\pi_\theta}_\gamma(s)\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta} = \sum_{u=0}^{T} \left( \sum_{t=0}^{u} w(t)\, \gamma^{u-t} \right) \sum_{s'} \Pr(S_u = s')\, g(s').$$

In order to derive $\nabla_* J_*$, we simply need to choose a $w(t)$ such that $\sum_{t=0}^{u} w(t)\, \gamma^{u-t} = 1$ for all $u$. This is satisfied by the choice: $w(t) = 1 - \gamma$ if $t \neq 0$, and $w(0) = 1$ otherwise. This trivially holds for $u = 0$, as $w(0)\, \gamma^0 = 1$. For $u \geq 1$:

$$\sum_{t=0}^{u} w(t)\, \gamma^{u-t} = \gamma^u + (1 - \gamma) \sum_{t=1}^{u} \gamma^{u-t} = \gamma^u + (1 - \gamma)\, \frac{1 - \gamma^u}{1 - \gamma} = 1.$$

Thus, for this choice of $w$, (10) is equal to $\nabla_* J_*(\theta)$. Finally, we see that this choice of $w$ also gives us the correct $d^{\pi_\theta}_\gamma$, as defined in (9). ∎

Theorem 1.

There does not exist any continuous objective, $J_*$, such that $\nabla_* J_*$ is its gradient. That is, for all $f \in \mathcal{C}$:

$$\exists\, \theta \in \mathbb{R}^n \text{ such that } \nabla f(\theta) \neq \nabla_* J_*(\theta),$$

where $\mathcal{C}$ is the space of continuous functions.

Proof.

We present a proof by contradiction. Assume that there exists some $J_*$ whose gradient is $\nabla_* J_*$, and that $\theta_i$ and $\theta_j$ are parameters of the policy, $\pi_\theta$. Because $\nabla_* J_*$ is continuously differentiable in $\theta$ for the smooth policy parameterizations considered here, such a $J_*$ would be twice continuously differentiable, and its second derivative would therefore be symmetric:

$$\frac{\partial^2 J_*(\theta)}{\partial \theta_i\, \partial \theta_j} = \frac{\partial^2 J_*(\theta)}{\partial \theta_j\, \partial \theta_i}.$$

However, it is easy to show algebraically, using Lemma 2, that this symmetry need not hold. Differentiating (10) gives:

$$\frac{\partial}{\partial \theta_j}\big[\nabla_* J_*(\theta)\big]_i = \sum_{s \in \mathcal{S}} \left( d^{\pi_\theta}_\gamma(s)\, \frac{\partial^2 v^{\pi_\theta}_\gamma(s)}{\partial \theta_j\, \partial \theta_i} + \frac{\partial d^{\pi_\theta}_\gamma(s)}{\partial \theta_j}\, \frac{\partial v^{\pi_\theta}_\gamma(s)}{\partial \theta_i} \right). \qquad (12)$$

The first term on the right-hand side is symmetric in $i$ and $j$, but the second term need not be.

Thus, we immediately arrive at an apparent contradiction. However, to complete the proof of Theorem 1, we must show a case where the summation produces an asymmetry. Such a case is shown in Figure 1. We present an MDP with two states, $s_1$ and $s_2$, in addition to the terminal absorbing state. The agent begins in $s_1$. There are two actions, $a_1$ and $a_2$. If the agent selects action $a_1$ in state $s_1$, it is taken to the terminal absorbing state and receives a reward of 0. If it chooses action $a_2$ in state $s_1$, it is taken to state $s_2$, and again receives a reward of 0. In state $s_2$, the agent is always taken to the terminal absorbing state regardless of the action chosen. If the agent takes action $a_1$, it again receives a reward of 0. However, if the agent takes action $a_2$, it receives a reward of +1.

We consider the case where $\gamma = 0$, and the agent's policy is tabular over states and actions; that is, we do not introduce any bias due to the use of function approximation. There are two parameters, $\theta_1$ and $\theta_2$, which determine the policy in states $s_1$ and $s_2$ respectively. The probability of choosing action $a_2$ in state $s_i$ is given by $\sigma(\theta_i)$, where $\sigma$ is the sigmoid function, $\sigma(x) = 1 / (1 + e^{-x})$, and the probability of choosing $a_1$ is given by $1 - \sigma(\theta_i)$. Finally, we initialize the policy to $\theta_1 = \theta_2 = 0$, so that each action is chosen with probability $1/2$.

Consider the partial derivatives of $d^{\pi_\theta}_\gamma$ and $v^{\pi_\theta}_\gamma$. First, consider $s_1$. The agent cannot affect the likelihood of being in $s_1$ at time $t = 0$, so $\frac{\partial d^{\pi_\theta}_\gamma(s_1)}{\partial \theta_1} = \frac{\partial d^{\pi_\theta}_\gamma(s_1)}{\partial \theta_2} = 0$. Additionally, because there are no immediate rewards available in $s_1$ and $\gamma = 0$, the value function and its gradient are zero at $s_1$ regardless of the policy parameters: $v^{\pi_\theta}_\gamma(s_1) = 0$ and $\frac{\partial v^{\pi_\theta}_\gamma(s_1)}{\partial \theta} = 0$. Next, consider $\frac{\partial d^{\pi_\theta}_\gamma(s_2)}{\partial \theta_1}$. The probability of entering $s_2$ is proportional to the probability of choosing $a_2$ in $s_1$, so $d^{\pi_\theta}_\gamma(s_2) = \sigma(\theta_1)$. At $\theta_1 = 0$:

$$\frac{\partial d^{\pi_\theta}_\gamma(s_2)}{\partial \theta_1} = \sigma(\theta_1)\big(1 - \sigma(\theta_1)\big) = \frac{1}{4}.$$

On the other hand, the value at $s_2$ does not depend on $\theta_1$, so $\frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_1} = 0$. Finally, consider $\theta_2$. It cannot influence $d^{\pi_\theta}_\gamma(s_2)$, as the action taken in $s_2$ has no effect on the state distribution. However, $v^{\pi_\theta}_\gamma(s_2) = \sigma(\theta_2)$, so at $\theta_2 = 0$:

$$\frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_2} = \sigma(\theta_2)\big(1 - \sigma(\theta_2)\big) = \frac{1}{4}.$$

Putting it all together, the second term of (12) evaluates to:

$$\frac{\partial d^{\pi_\theta}_\gamma(s_2)}{\partial \theta_1}\, \frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_2} = \frac{1}{16} \neq 0 = \frac{\partial d^{\pi_\theta}_\gamma(s_2)}{\partial \theta_2}\, \frac{\partial v^{\pi_\theta}_\gamma(s_2)}{\partial \theta_1},$$

while the first term of (12) is symmetric. We have shown that in this instance the second derivative is not symmetric. Therefore, we conclude that no general form of $J_*$ exists. ∎