Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

02/14/2023
by Fang Kong, et al.

Learning Markov decision processes (MDPs) in an adversarial environment is a challenging problem. It becomes even more challenging with function approximation, since the underlying structure of the loss function and the transition kernel is especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDPs achieve a regret of Õ(K^{6/7}), where K denotes the number of episodes, which leaves considerable room for improvement. In this paper, we investigate the problem from a new view, which reduces linear MDPs to linear optimization by carefully choosing the feature maps of the bandit arms in the linear optimization problem. This new technique, under an exploratory assumption, yields an improved bound of Õ(K^{4/5}) for linear adversarial MDPs without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.
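To give a feel for the gap between the two rates quoted above, here is a minimal sketch that compares K^{6/7} with K^{4/5} for a few episode counts. It deliberately ignores the constants and logarithmic factors hidden by the Õ notation, so the numbers illustrate growth rates only and are not regret values from the paper. The helper name `rate_ratio` is hypothetical and introduced purely for illustration.

```python
def rate_ratio(K: int) -> float:
    """Return K^(6/7) / K^(4/5), which equals K^(2/35).

    Assumption: constants and log factors hidden by the O-tilde
    notation are ignored, so this compares growth rates only.
    """
    return K ** (6 / 7) / K ** (4 / 5)


if __name__ == "__main__":
    # Print both rates and their ratio for a few episode counts.
    for K in (10**3, 10**6, 10**9):
        old_rate = K ** (6 / 7)   # previous state-of-the-art exponent
        new_rate = K ** (4 / 5)   # improved exponent from the abstract
        print(f"K={K:>10}  K^(6/7)={old_rate:.3e}  "
              f"K^(4/5)={new_rate:.3e}  ratio={rate_ratio(K):.3f}")
```

Since 6/7 - 4/5 = 2/35, the ratio grows only polynomially slowly in K, but it is unbounded: the improvement becomes more pronounced as the number of episodes increases.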

research
02/17/2021

Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

We study the reinforcement learning for finite-horizon episodic Markov d...
research
07/18/2021

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Policy optimization is a widely-used method in reinforcement learning. D...
research
07/03/2020

Online learning in MDPs with linear function approximation and bandit feedback

We consider an online learning problem where the learner interacts with ...
research
01/21/2022

Meta Learning MDPs with Linear Transition Models

We study meta-learning in Markov Decision Processes (MDP) with linear tr...
research
08/21/2020

Refined Analysis of FPL for Adversarial Markov Decision Processes

We consider the adversarial Markov Decision Process (MDP) problem, where...
research
07/10/2020

Efficient MDP Analysis for Selfish-Mining in Blockchains

A proof of work (PoW) blockchain protocol distributes rewards to its par...
research
05/10/2019

Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

We consider a stochastic inventory control problem under censored demand...
