Experiments with Infinite-Horizon, Policy-Gradient Estimation

06/03/2011
by P. L. Bartlett, et al.

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter and Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter, beta, which has a natural interpretation in terms of a bias-variance trade-off; that it requires no knowledge of the underlying state; and that it can be applied to infinite state, control, and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with a conjugate-gradient algorithm that uses gradient information to bracket maxima in line searches. Experimental results illustrate the theoretical results of Baxter and Bartlett (this volume) on a toy problem, as well as practical aspects of the algorithms on a number of more realistic problems.
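To make the shape of a GPOMDP-style estimate concrete, here is a minimal sketch (not code from the paper) on an assumed toy two-armed bandit: an eligibility trace `z` is discounted by the free parameter `beta` and accumulates the score function of the policy, and the running average `delta` of reward times trace approximates the gradient of the average reward. The policy, problem, and all identifiers here are illustrative assumptions.

```python
import math
import random

def gpomdp_bandit(theta, beta=0.9, T=20000, seed=0):
    """GPOMDP-style gradient estimate on a toy two-armed bandit.

    Policy: sigmoid preference for arm 0 with a single parameter theta.
    Arm 0 pays reward 1, arm 1 pays 0, so the true average-reward
    gradient at theta is p0 * (1 - p0), e.g. 0.25 at theta = 0.
    """
    rng = random.Random(seed)
    z = 0.0        # eligibility trace, discounted by beta each step
    delta = 0.0    # running average of r_{t+1} * z_{t+1}
    for t in range(T):
        p0 = 1.0 / (1.0 + math.exp(-theta))   # P(choose arm 0)
        a = 0 if rng.random() < p0 else 1
        # gradient of log pi(a | theta) w.r.t. theta
        grad_log = (1.0 - p0) if a == 0 else -p0
        r = 1.0 if a == 0 else 0.0
        z = beta * z + grad_log               # z_{t+1} = beta * z_t + grad log mu
        delta += (r * z - delta) / (t + 1)    # incremental average of r * z
    return delta

est = gpomdp_bandit(theta=0.0)
```

In this memoryless setting the estimate is close to the true gradient for any `beta`; in a POMDP with temporally extended dynamics, larger `beta` reduces the bias of the estimate at the cost of higher variance, which is the trade-off the abstract refers to.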

