Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach

08/15/2021
by Yanwei Jia et al.

We propose a unified framework to study policy evaluation (PE) and the associated temporal-difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and is therefore not a suitable objective for PE. We present two methods that use the martingale characterization to design PE algorithms. The first minimizes a "martingale loss function", whose minimizer is proved to be the best approximation of the true value function in the mean-square sense; this method interprets the classical gradient Monte Carlo algorithm. The second is based on a system of equations called the "martingale orthogonality conditions" with "test functions". Solving these equations in different ways recovers various classical TD algorithms, such as TD(λ), LSTD, and GTD, and the choice of test functions determines in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. We demonstrate the theoretical results and the corresponding algorithms with numerical experiments and applications.
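As a minimal sketch of the martingale characterization described above (the symbols below are our own labels, not fixed by the abstract: J^θ is a parametrized value-function approximation, r the running reward, h the terminal reward, and ξ a test process):

\[
M^{\theta}_t \;:=\; J^{\theta}(t, X_t) + \int_0^t r(s, X_s, a_s)\, ds ,
\qquad
J^{\theta} = J \;\Longleftrightarrow\; \{M^{\theta}_t\}_{0 \le t \le T} \text{ is a martingale.}
\]

\[
\mathrm{ML}(\theta) \;:=\; \tfrac{1}{2}\, \mathbb{E}\!\int_0^T \Big| \int_t^T r(s, X_s, a_s)\, ds + h(X_T) - J^{\theta}(t, X_t) \Big|^2 dt ,
\qquad
\mathbb{E}\!\int_0^T \xi_t \, dM^{\theta}_t = 0 \ \ \text{for all test functions } \xi .
\]

The first display states the martingale condition characterizing PE; minimizing the loss ML(θ) corresponds to the first method, while enforcing the orthogonality conditions with different choices of ξ corresponds to the second.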


Related research

11/22/2021 | Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms
We study policy gradient (PG) for reinforcement learning in continuous t...

04/17/2017 | O^2TD: (Near)-Optimal Off-Policy TD Learning
Temporal difference learning and Residual Gradient methods are the most ...

07/02/2022 | q-Learning in Continuous Time
We study the continuous-time counterpart of Q-learning for reinforcement...

01/31/2023 | Toward Efficient Gradient-Based Value Estimation
Gradient-based methods for value estimation in reinforcement learning ha...

06/28/2023 | Continuous-Time q-learning for McKean-Vlasov Control Problems
This paper studies the q-learning, recently coined as the continuous-tim...

02/16/2022 | On a Variance Reduction Correction of the Temporal Difference for Policy Evaluation in the Stochastic Continuous Setting
This paper deals with solving continuous time, state and action optimiza...

12/13/2015 | True Online Temporal-Difference Learning
The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of m...
