Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation

02/17/2021
by Jiafan He et al.

We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with a Bernstein-type bonus and show that it achieves Õ(dH√(T)) regret, where H is the length of an episode, T is the number of interactions with the MDP, and d is the dimension of the feature mapping. Furthermore, we prove a matching lower bound of Ω(dH√(T)), so the upper bound is tight up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.
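
The abstract gives no pseudocode, so the following is only a minimal sketch of the general recipe it describes: value-targeted ridge regression for the unknown transition parameter, an optimistic backward pass with an exploration bonus, and a mirror-descent (multiplicative-weights) policy update against full-information adversarial rewards. Everything here is an assumption for illustration: the toy linear mixture MDP, the constants lam, beta, and eta, and all variable names are hypothetical, and for brevity the paper's Bernstein bonus is replaced by the simpler Hoeffding-style bonus beta * ||phi_V||_{Sigma^{-1}} (the Bernstein version additionally reweights samples by estimated variances).

```python
import numpy as np

rng = np.random.default_rng(0)

# --- A tiny linear mixture MDP: P(.|s,a) = sum_j theta*_j base_j(.|s,a). ---
S, A, H, d, K = 3, 2, 4, 5, 50    # states, actions, horizon, dim, episodes
base = rng.random((d, S, A, S))
base /= base.sum(axis=-1, keepdims=True)   # d known base transition kernels
theta_star = rng.dirichlet(np.ones(d))     # unknown mixture weights
phi = np.moveaxis(base, 0, -1)             # feature phi[s, a, s'] in R^d
P = phi @ theta_star                       # true kernel, rows sum to 1

lam, beta, eta = 1.0, 0.5, 1.0    # ridge regularizer, bonus scale, MD step
Sigma = lam * np.eye(d)           # covariance of value-targeted regression
bvec = np.zeros(d)                # regression response vector
pi = np.full((H, S, A), 1.0 / A)  # uniform initial policy
V = np.zeros((H + 1, S))          # optimistic values from the last episode

for k in range(K):
    # 1) Roll out the current policy and record the trajectory.
    s, traj = 0, []
    for h in range(H):
        a = rng.choice(A, p=pi[h, s])
        s_next = rng.choice(S, p=P[s, a])
        traj.append((h, s, a, s_next))
        s = s_next

    # 2) Full-information adversarial reward for episode k, revealed
    #    only after the episode (drawn at random here for illustration).
    r = rng.random((H, S, A))

    # 3) Value-targeted ridge regression for theta_hat, using the
    #    optimistic value estimates from the previous episode as targets.
    for h, s, a, s_next in traj:
        x = phi[s, a].T @ V[h + 1]   # phi_V(s,a) = sum_s' phi(s,a,s') V(s')
        Sigma += np.outer(x, x)
        bvec += x * V[h + 1, s_next]
    theta_hat = np.linalg.solve(Sigma, bvec)
    Sigma_inv = np.linalg.inv(Sigma)

    # 4) Optimistic backward pass plus mirror-descent policy update.
    #    Hoeffding-style bonus here; the Bernstein bonus would also use
    #    estimated variances to reweight the regression and the bonus.
    for h in range(H - 1, -1, -1):
        X = np.einsum('saqd,q->sad', phi, V[h + 1])   # phi_V, shape (S, A, d)
        bonus = beta * np.sqrt(np.einsum('sad,de,sae->sa', X, Sigma_inv, X))
        Q = np.clip(r[h] + X @ theta_hat + bonus, 0.0, H - h)
        pi[h] *= np.exp(eta * Q)                      # multiplicative weights
        pi[h] /= pi[h].sum(axis=-1, keepdims=True)
        V[h] = (pi[h] * Q).sum(axis=-1)

print("final V_0 estimates per state:", np.round(V[0], 2))
```

The clipping of Q to [0, H - h] and the additive bonus implement optimism: the estimated action values dominate the truth with high probability, so the mirror-descent update explores under-sampled directions of the feature space while tracking the adversarial reward sequence.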


Related research

02/15/2021 · Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
12/15/2020 · Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
12/12/2022 · Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes
12/03/2019 · Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
02/14/2023 · Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization
06/11/2021 · Safe Reinforcement Learning with Linear Function Approximation
06/17/2020 · A maximum-entropy approach to off-policy evaluation in average-reward MDPs
