O^2TD: (Near)-Optimal Off-Policy TD Learning

04/17/2017
by   Bo Liu, et al.
0

Temporal difference learning and Residual Gradient methods are the most widely used temporal difference based learning algorithms; however, it has been shown that none of their objective functions is optimal w.r.t approximating the true value function V. Two novel algorithms are proposed to approximate the true value function V. This paper makes the following contributions: (1) A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function V. (2) A linear computational cost (per step) near-optimal algorithm that can learn from a collection of off-policy samples. (3) A new perspective of the emphatic temporal difference learning which bridges the gap between off-policy optimality and off-policy stability.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/13/2019

A Convergent Off-Policy Temporal Difference Algorithm

Learning the value function of a given policy (target policy) from the d...
research
04/06/2019

Randomised Bayesian Least-Squares Policy Iteration

We introduce Bayesian least-squares policy iteration (BLSPI), an off-pol...
research
08/15/2020

Reducing Sampling Error in Batch Temporal Difference Learning

Temporal difference (TD) learning is one of the main foundations of mode...
research
09/16/2018

LVIS: Learning from Value Function Intervals for Contact-Aware Robot Controllers

Guided policy search is a popular approach for training controllers for ...
research
06/30/2020

Delayed Q-update: A novel credit assignment technique for deriving an optimal operation policy for the Grid-Connected Microgrid

A microgrid is an innovative system that integrates distributed energy r...
research
08/15/2021

Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach

We propose a unified framework to study policy evaluation (PE) and the a...
research
03/13/2023

n-Step Temporal Difference Learning with Optimal n

We consider the problem of finding the optimal value of n in the n-step ...

Please sign up or login with your details

Forgot password? Click here to reset