- Bellman  R. Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
- Sutton and Barto  R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.
- Watkins  C. Watkins. Learning From Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.
- Thomas and Okal  P. S. Thomas and B. Okal. A notation for Markov decision processes. arXiv preprint arXiv:1512.09075v2, 2016.
- Nota and Thomas  C. Nota and P. S. Thomas. Is the policy gradient a gradient? Unpublished, 2019.
- Williams  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- Thomas  P. S. Thomas. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In ICML, 2014.