Batch Policy Learning in Average Reward Markov Decision Processes

07/23/2020
by   Peng Liao, et al.

We consider the batch (off-line) policy learning problem in the infinite-horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency given multiple trajectories collected under some behavior policy. Based on the proposed estimator, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. We measure the performance of the estimated policy by its regret, the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. To the best of our knowledge, this is the first regret bound for batch policy learning in the infinite time horizon setting. The performance of the method is illustrated by simulation studies.
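
For concreteness, the two quantities named in the abstract can be written out as follows. This is a sketch in notation assumed here, not taken from the abstract: $\eta(\pi)$ for the long-term average reward of a policy $\pi$, $R_t$ for the reward at time $t$, $\Pi$ for the parameterized stochastic policy class, and $\hat{\pi}$ for the estimated policy. The paper's own notation and regularity conditions (for example, those under which the limit exists and does not depend on the initial state) may differ.

```latex
% Long-term average reward of a policy \pi (standard definition; notation assumed)
\eta(\pi) = \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} R_t \right]

% Regret of the estimated policy \hat{\pi} relative to the best policy in the class \Pi
\mathrm{Regret}(\hat{\pi}) = \sup_{\pi \in \Pi} \eta(\pi) - \eta(\hat{\pi})
```

Under this reading, the regret is nonnegative whenever $\hat{\pi} \in \Pi$, and a finite-sample regret guarantee bounds it in terms of the amount of batch data available.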


research
11/09/2020

Robust Batch Policy Learning in Markov Decision Processes

We study the sequential decision making problem in Markov decision proce...
research
01/21/2017

Learning Policies for Markov Decision Processes from Data

We consider the problem of learning a policy for a Markov decision proce...
research
10/19/2021

Planning for Package Deliveries in Risky Environments Over Multiple Epochs

We study a risk-aware robot planning problem where a dispatcher must con...
research
10/14/2021

The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs

We consider the problem of finding the best memoryless stochastic policy...
research
09/12/2019

Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes

Off-policy evaluation (OPE) in reinforcement learning is notoriously dif...
research
10/20/2021

Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning

Recent advances in mobile health (mHealth) technology provide an effecti...
research
11/02/2020

Optimal Policies for the Homogeneous Selective Labels Problem

Selective labels are a common feature of consequential decision-making a...
