Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes

10/16/2020
by   Santiago Paternain, et al.

Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy-gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed, which prevents them from being implemented fully online, a desirable property for systems that need to adapt to new tasks and environments during deployment. The main requirement for a policy gradient algorithm to work is that the estimate of the gradient at any point in time is an ascent direction for the initial value function. In this work we establish that this is indeed the case, which allows us to show convergence of the online algorithm to the critical points of the initial value function. A numerical example demonstrates the ability of our online algorithm to learn to solve a navigation and surveillance problem in which an agent must loop between two goal locations. This example corroborates our theoretical findings about the ascent directions of subsequent stochastic gradients. It also shows how the agent running our online algorithm succeeds in learning to navigate, following a continuing cyclic trajectory that does not comply with the standard stationarity assumptions in the literature for non-episodic training.
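To make the setting concrete, here is a minimal sketch of a stochastic policy-gradient (REINFORCE-style) update with a Gaussian policy over a random Fourier feature map, which serves as a finite-dimensional stand-in for the RKHS parameterization. This is an illustrative assumption, not the paper's algorithm: the 1-D navigation dynamics, reward, feature map, and all hyperparameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

GAMMA, SIGMA, LR, DIM = 0.95, 0.5, 1e-2, 8
W = rng.normal(size=DIM)              # random feature frequencies (assumption)
B = rng.uniform(0, 2 * np.pi, DIM)    # random feature phases (assumption)

def phi(s):
    """Random Fourier features: a finite-dimensional proxy for an RKHS map."""
    return np.sqrt(2.0 / DIM) * np.cos(W * s + B)

def rollout(theta, T=50):
    """One trajectory in a toy 1-D navigation task (hypothetical dynamics)."""
    s, traj = 0.0, []
    for _ in range(T):
        mu = theta @ phi(s)                   # policy mean
        a = mu + SIGMA * rng.normal()         # Gaussian exploration noise
        r = -abs(s - 1.0)                     # reward: negative distance to goal
        traj.append((s, a, mu, r))
        s = float(np.clip(s + 0.1 * a, -2.0, 2.0))
    return traj

def reinforce_grad(theta):
    """Unbiased REINFORCE estimate of the gradient of the discounted return."""
    traj = rollout(theta)
    rewards = np.array([r for *_, r in traj])
    # discounted reward-to-go for each step
    q = np.array([sum(GAMMA**k * rewards[t + k] for k in range(len(rewards) - t))
                  for t in range(len(rewards))])
    g = np.zeros_like(theta)
    for t, (s, a, mu, _) in enumerate(traj):
        score = (a - mu) / SIGMA**2 * phi(s)  # grad of log pi(a | s) w.r.t. theta
        g += GAMMA**t * q[t] * score
    return g

theta = np.zeros(DIM)
for _ in range(200):
    theta += LR * reinforce_grad(theta)       # stochastic gradient ascent step
```

The key property the paper studies is that each such stochastic gradient, computed from data gathered online without episodic resets, remains (in expectation) an ascent direction for the initial value function; the sketch above only illustrates the basic estimator, not that continuing-task analysis.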

Related research

06/12/2022 - Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm
We consider the problem of constrained Markov decision process (CMDP) in...

02/10/2020 - Statistically Efficient Off-Policy Policy Gradients
Policy gradient methods in reinforcement learning update policy paramete...

08/01/2019 - Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes
Policy gradient methods are among the most effective methods in challeng...

06/14/2018 - Stochastic Variance-Reduced Policy Gradient
In this paper, we propose a novel reinforcement-learning algorithm cons...

04/09/2020 - Policy Gradient using Weak Derivatives for Reinforcement Learning
This paper considers policy search in continuous state-action reinforcem...

11/25/2020 - Enhanced Scene Specificity with Sparse Dynamic Value Estimation
Multi-scene reinforcement learning involves training the RL agent across...

04/19/2018 - Nonparametric Stochastic Compositional Gradient Descent for Q-Learning in Continuous Markov Decision Problems
We consider Markov Decision Problems defined over continuous state and a...
