Classical Policy Gradient: Preserving Bellman's Principle of Optimality

06/06/2019 · Philip S. Thomas et al.

We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.
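The paper's proposed objective is not reproduced in this abstract, but the standard baseline it departs from is the classical finite-horizon policy gradient. As background only (this is not the authors' new objective), a minimal REINFORCE-style Monte Carlo estimate of the gradient of expected return for a toy finite-horizon episodic MDP might look like the following; the toy dynamics, state/action counts, and all function names are illustrative assumptions.

```python
import numpy as np

# Background sketch, NOT the paper's proposed objective: the classical
# REINFORCE estimator  grad J = E[ sum_t grad log pi(a_t|s_t) * G_t ],
# where G_t is the (undiscounted) return from time t onward.
rng = np.random.default_rng(0)

n_states, n_actions, horizon = 3, 2, 5
theta = np.zeros((n_states, n_actions))   # softmax policy parameters

def policy(s, theta):
    # Softmax over actions, shifted for numerical stability.
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(s, a, theta):
    # Gradient of log pi(a|s) for a tabular softmax policy.
    g = np.zeros_like(theta)
    g[s] = -policy(s, theta)
    g[s, a] += 1.0
    return g

def sample_episode(theta):
    # Illustrative chain MDP: action 1 advances the state,
    # reward 1.0 whenever the final state is reached.
    s, traj = 0, []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(s, theta))
        s_next = min(s + a, n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        traj.append((s, a, r))
        s = s_next
    return traj

def reinforce_gradient(theta, n_episodes=100):
    # Monte Carlo estimate of the policy gradient of expected return.
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        traj = sample_episode(theta)
        rewards = [r for (_, _, r) in traj]
        for t, (s, a, _) in enumerate(traj):
            G_t = sum(rewards[t:])   # return from time t
            grad += grad_log_pi(s, a, theta) * G_t
    return grad / n_episodes

g = reinforce_gradient(theta)
print(g.shape)  # one gradient entry per (state, action) parameter
```

The paper's contribution, per the abstract, is an alternative to this kind of objective that better respects Bellman's principle of optimality; the sketch above is only the conventional baseline for comparison.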






