Renewal Monte Carlo: Renewal theory based reinforcement learning

04/03/2018
by   Jayakumar Subramanian, et al.
0

In this paper, we present an online reinforcement learning algorithm, called Renewal Monte Carlo (RMC), for infinite horizon Markov decision processes with a designated start state. RMC is a Monte Carlo algorithm and retains the advantages of Monte Carlo methods including low bias, simplicity, and ease of implementation while, at the same time, circumvents their key drawbacks of high variance and delayed (end of episode) updates. The key ideas behind RMC are as follows. First, under any reasonable policy, the reward process is ergodic. So, by renewal theory, the performance of a policy is equal to the ratio of expected discounted reward to the expected discounted time over a regenerative cycle. Second, by carefully examining the expression for performance gradient, we propose a stochastic approximation algorithm that only requires estimates of the expected discounted reward and discounted time over a regenerative cycle and their gradients. We propose two unbiased estimators for evaluating performance gradients---a likelihood ratio based estimator and a simultaneous perturbation based estimator---and show that for both estimators, RMC converges to a locally optimal policy. We generalize the RMC algorithm to post-decision state models and also present a variant that converges faster to an approximately optimal policy. We conclude by presenting numerical experiments on a randomly generated MDP, event-triggered communication, and inventory management.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/30/2019

Detecting Spiky Corruption in Markov Decision Processes

Current reinforcement learning methods fail if the reward function is im...
research
11/28/2018

A Structure-aware Online Learning Algorithm for Markov Decision Processes

To overcome the curse of dimensionality and curse of modeling in Dynamic...
research
06/02/2022

Policy Gradient Algorithms with Monte-Carlo Tree Search for Non-Markov Decision Processes

Policy gradient (PG) is a reinforcement learning (RL) approach that opti...
research
12/08/2016

Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

We study the online estimation of the optimal policy of a Markov decisio...
research
02/10/2020

On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

A simple and natural algorithm for reinforcement learning is Monte Carlo...
research
03/11/2013

Monte-Carlo utility estimates for Bayesian reinforcement learning

This paper introduces a set of algorithms for Monte-Carlo Bayesian reinf...
research
05/31/2021

A unified view of likelihood ratio and reparameterization gradients

Reparameterization (RP) and likelihood ratio (LR) gradient estimators ar...

Please sign up or login with your details

Forgot password? Click here to reset