Rate-Optimal Policy Optimization for Linear Markov Decision Processes

08/28/2023
by Uri Sherman, et al.

We study regret minimization in online episodic linear Markov Decision Processes and obtain rate-optimal O(√K) regret, where K denotes the number of episodes. Our work is the first to establish the optimal (w.r.t. K) rate of convergence in the stochastic setting with bandit feedback using a policy-optimization-based approach, and the first to establish the optimal (w.r.t. K) rate in the adversarial setup with full-information feedback, for which no algorithm with an optimal rate guarantee is currently known.

research
12/30/2015

A Notation for Markov Decision Processes

This paper specifies a notation for Markov decision processes....
research
05/11/2018

Stochastic Approximation for Risk-aware Markov Decision Processes

In this paper, we develop a stochastic approximation type algorithm to s...
research
09/22/2019

Faster saddle-point optimization for solving large-scale Markov decision processes

We consider the problem of computing optimal policies in average-reward ...
research
03/08/2021

Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games

Tree-form sequential decision making (TFSDM) extends classical one-shot ...
research
01/31/2021

Online Markov Decision Processes with Aggregate Bandit Feedback

We study a novel variant of online finite-horizon Markov Decision Proces...
research
05/26/2022

Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

We consider regret minimization for Adversarial Markov Decision Processe...
research
11/09/2020

Robust Batch Policy Learning in Markov Decision Processes

We study the sequential decision making problem in Markov decision proce...
