Robust Batch Policy Learning in Markov Decision Processes

11/09/2020
by Zhengling Qi, et al.

We study the sequential decision-making problem in a Markov decision process (MDP), where each policy is evaluated by a set containing average rewards over different horizon lengths and with different initial distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that maximizes the smallest value of this set. Leveraging semi-parametric efficiency theory from statistics, we develop a policy learning method for estimating the defined robust optimal policy that can efficiently break the curse of horizon under mild technical conditions. A regret bound that is rate-optimal up to a logarithmic factor is established in terms of the number of trajectories and the number of decision points.
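To make the robust criterion concrete, the following is a minimal sketch of the max-min objective only, not the paper's estimator. It assumes a small tabular MDP whose transition and reward tables are known, whereas the batch setting in the abstract has only logged trajectories. The tables `P` and `R`, the sets `init_dists` and `horizons`, and all function names are hypothetical and chosen for illustration.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; every number here is made up.
# P[s][a] lists next-state probabilities, R[s][a] is the expected reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]


def average_reward(policy, init_dist, horizon):
    """Expected average reward of a deterministic policy over `horizon` steps."""
    dist = list(init_dist)
    total = 0.0
    for _ in range(horizon):
        # Expected reward at this step under the current state distribution.
        total += sum(dist[s] * R[s][policy[s]] for s in range(len(dist)))
        # Push the state distribution one step forward.
        nxt = [0.0] * len(dist)
        for s, p_s in enumerate(dist):
            for s2, p in enumerate(P[s][policy[s]]):
                nxt[s2] += p_s * p
        dist = nxt
    return total / horizon


def robust_value(policy, init_dists, horizons):
    """Smallest average reward over all (initial distribution, horizon) pairs."""
    return min(average_reward(policy, d, T) for d in init_dists for T in horizons)


def robust_policy(policy_class, init_dists, horizons):
    """Policy in the class that maximizes the smallest value."""
    return max(policy_class, key=lambda pi: robust_value(pi, init_dists, horizons))


# Policy class: all deterministic maps from state to action.
policy_class = list(product(range(2), repeat=2))
init_dists = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
horizons = [1, 5, 50]

best = robust_policy(policy_class, init_dists, horizons)
print(best, robust_value(best, init_dists, horizons))
```

In this toy setup the values are computed exactly from the model. In the batch setting described above, each average reward would instead have to be estimated from the pre-collected trajectories.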


