Policy Gradient Methods for Off-policy Control

12/13/2015
by Lucas Lehnert, et al.

Off-policy learning refers to the problem of learning the value function of a way of behaving, or policy, while following a different policy. Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ, converge even when using function approximation and incremental updates. However, they have been developed for the case of a fixed behavior policy. In control problems, one would like to adapt the behavior policy over time to become more greedy with respect to the existing value function. In this paper, we present the first gradient-based learning algorithms for this problem, which rely on the framework of policy gradient in order to modify the behavior policy. We present derivations of the algorithms, a convergence theorem, and empirical evidence showing that they compare favorably to existing approaches.
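As a concrete reference point for the fixed-behavior-policy setting mentioned above, the following is a minimal sketch of one TDC update with linear function approximation. It is not the algorithm contributed by this paper (which additionally adapts the behavior policy via policy gradient). It assumes the commonly published form of TDC with a per-step importance-sampling ratio `rho`; the function name `tdc_update`, the step sizes, and the toy feature vectors are illustrative choices, not taken from the source.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def tdc_update(theta, w, phi, phi_next, reward, rho,
               gamma=0.9, alpha=0.1, beta=0.2):
    """One TDC step for a linear value estimate v(s) = theta . phi(s).

    Assumed (hypothetical) interface:
      theta    -- primary weights of the value estimate
      w        -- auxiliary weights used for the gradient correction
      phi      -- feature vector of the current state
      phi_next -- feature vector of the next state
      rho      -- importance ratio pi(a|s) / b(a|s) for the action taken
    Returns the updated (theta, w) as new lists.
    """
    # TD error under the current value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    w_phi = dot(w, phi)
    # Primary update: TD step plus the gradient-correction term.
    new_theta = [t + alpha * rho * (delta * x - gamma * xn * w_phi)
                 for t, x, xn in zip(theta, phi, phi_next)]
    # Auxiliary update: w tracks the expected TD error given the features.
    new_w = [wi + beta * rho * (delta - w_phi) * x
             for wi, x in zip(w, phi)]
    return new_theta, new_w


# Toy transition: one-hot features for two states, reward 1, rho = 1.
theta, w = tdc_update([0.0, 0.0], [0.0, 0.0],
                      phi=[1.0, 0.0], phi_next=[0.0, 1.0],
                      reward=1.0, rho=1.0)
```

With `rho = 0` (the target policy would never take the observed action) both weight vectors are left unchanged, which is how the ratio keeps the estimate tied to the target policy while data comes from a different behavior policy.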


