Classical Policy Gradient: Preserving Bellman's Principle of Optimality

06/06/2019 ∙ by Philip S. Thomas, et al.

We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.
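For context, a sketch of the classical formulation the abstract refers to: the standard finite-horizon policy-gradient objective maximizes the expected discounted return of a parameterized policy, and its gradient admits a REINFORCE-style expression. The equations below show this conventional baseline, not the objective proposed in the paper; the symbols (horizon T, discount gamma, states S_t, actions A_t, rewards R_t, policy pi_theta) follow common usage rather than the paper's notation.

    % Standard finite-horizon discounted objective (baseline, not the paper's proposal)
    J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t} R_t \,\middle|\, \theta\right],
    % REINFORCE-style gradient with reward-to-go
    \nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{k=t}^{T-1} \gamma^{k} R_k \,\middle|\, \theta\right].

The paper's proposed objective is stated to better capture Bellman's principle of optimality than this baseline, but its exact form is not reproduced on this page.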
