Approximation Benefits of Policy Gradient Methods with Aggregated States

07/22/2020
by Daniel Russo, et al.

Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state aggregation, where the state space is partitioned and either the policy or the value function approximation is held constant over partitions. This paper shows that a policy gradient method converges to a policy whose per-period regret is bounded by ϵ, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as ϵ/(1-γ), where γ is a discount factor. Theoretical results synthesize recent analysis of policy gradient methods with the insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.
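
To make the contrast concrete, the following is a minimal numerical sketch (not the paper's code) that compares the two approaches on a small randomly generated MDP under a shared state-aggregated representation. The MDP, the aggregation map `phi`, the step size, and the uniform within-partition weighting used for approximate policy iteration are illustrative assumptions; the policy gradient side uses an exact softmax policy gradient weighted by the discounted state-occupancy measure, in the spirit of the state-relevance weights discussed above.

```python
# Illustrative sketch: policy gradient vs. approximate policy iteration
# under the same state-aggregated representation. All quantities below
# (MDP, aggregation map, step size, weights) are assumptions for the demo.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 2, 0.9
phi = np.array([0, 0, 1, 1, 2, 2])      # state -> partition (assumed aggregation)
n_parts = phi.max() + 1

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # r(s, a)
rho = np.full(n_states, 1.0 / n_states)                           # initial distribution

def evaluate(pi):
    """Exact V and Q for a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# --- Policy gradient with a state-aggregated softmax policy ----------------
theta = np.zeros((n_parts, n_actions))
for _ in range(2000):
    pi = softmax(theta[phi])                 # pi(a|s) depends on s only via phi(s)
    V, Q = evaluate(pi)
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state-occupancy measure under pi, starting from rho.
    d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)
    grad_s = d[:, None] * pi * (Q - V[:, None])   # per-state softmax policy gradient
    grad = np.zeros_like(theta)
    np.add.at(grad, phi, grad_s)                  # sum gradients within each partition
    theta += 5.0 * grad
V_pg, _ = evaluate(softmax(theta[phi]))

# --- Approximate policy iteration with the same aggregation ----------------
pi = np.full((n_states, n_actions), 1.0 / n_actions)
for _ in range(50):
    _, Q = evaluate(pi)
    # Aggregate Q uniformly within each partition (one possible weighting).
    Q_agg = np.zeros((n_parts, n_actions))
    np.add.at(Q_agg, phi, Q)
    Q_agg /= np.bincount(phi, minlength=n_parts)[:, None]
    pi = np.eye(n_actions)[Q_agg[phi].argmax(axis=1)]  # greedy w.r.t. aggregated Q
V_api, _ = evaluate(pi)

print("policy gradient          J =", rho @ V_pg)
print("approx. policy iteration J =", rho @ V_api)
```

On MDPs where the aggregation is badly misspecified, the greedy updates of approximate policy iteration can oscillate or settle on a poor policy, while the policy gradient iterate directly optimizes the true decision objective within the restricted policy class; this sketch only illustrates the setup, not the paper's worst-case constructions.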

