Infinite-Horizon Policy-Gradient Estimation

06/03/2011
by P. L. Bartlett et al.

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura et al. (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of a bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multi-agent settings, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
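To make the storage claim and the role of β concrete, here is a minimal sketch of a GPOMDP-style estimator on a toy two-state chain. The estimator keeps only two parameter-sized vectors (an eligibility trace and a running gradient estimate); larger β reduces bias toward the true average-reward gradient but increases variance. The toy environment, the logistic policy parameterization, and all function names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def grad_log_policy(theta, obs, action):
    """Gradient of log mu(action | theta, obs) for a 2-action logistic policy (illustrative)."""
    p1 = 1.0 / (1.0 + np.exp(-obs @ theta))   # probability of choosing action 1
    return (action - p1) * obs                # d/dtheta log mu(action | theta, obs)

def sample_action(theta, obs, rng):
    p1 = 1.0 / (1.0 + np.exp(-obs @ theta))
    return int(rng.random() < p1)

def gpomdp(theta, T=100_000, beta=0.9, seed=0):
    """Run the controlled chain for T steps and return the gradient estimate."""
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)        # eligibility trace: storage = one parameter vector
    delta = np.zeros_like(theta)    # gradient estimate: storage = a second parameter vector
    state = 0
    for t in range(T):
        obs = np.array([1.0, float(state)])            # observation of the (possibly hidden) state
        a = sample_action(theta, obs, rng)
        # toy transition: action 1 tends to move toward state 1, which pays reward 1
        state = int(rng.random() < (0.9 if a == 1 else 0.1))
        r = float(state)
        z = beta * z + grad_log_policy(theta, obs, a)  # z_{t+1} = beta * z_t + grad log mu(u_t | theta, y_t)
        delta += (r * z - delta) / (t + 1)             # running average of r_{t+1} * z_{t+1}
    return delta

print(gpomdp(np.zeros(2)))
```

Note that the update uses only the observation, the chosen action, and the reward, so no knowledge of the underlying state is required, matching the property claimed in the abstract.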
