A Model Selection Approach for Corruption Robust Reinforcement Learning

10/07/2021
by Chen-Yu Wei, et al.

We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transitions and rewards. For finite-horizon tabular MDPs, without prior knowledge of the total amount of corruption, our algorithm achieves a regret bound of 𝒪(min{1/Δ, √(T)}+C), where T is the number of episodes, C is the total amount of corruption, and Δ is the reward gap between the best and the second-best policy. This is the first worst-case optimal bound achieved without knowledge of C, improving on previous results of Lykouris et al. (2021), Chen et al. (2021), and Wu et al. (2021). For finite-horizon linear MDPs, we develop a computationally efficient algorithm with a regret bound of 𝒪(√((1+C)T)), and a computationally inefficient one with a regret bound of 𝒪(√(T)+C), improving the result of Lykouris et al. (2021) and answering an open question posed by Zhang et al. (2021b). Finally, our model selection framework can be easily applied to other settings, including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.
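To get a feel for how the three regret rates quoted above compare, the following minimal sketch evaluates them numerically. It is an illustration only, not the paper's algorithm: the function names are hypothetical, and all constants, logarithmic factors, and dependence on other problem parameters (which the 𝒪 notation hides) are assumed away by setting them to 1.

```python
import math

# Hypothetical helpers: each returns the leading-order rate from the abstract,
# with every hidden constant and logarithmic factor assumed to be 1.

def tabular_rate(T: float, C: float, gap: float) -> float:
    """min{1/gap, sqrt(T)} + C, the tabular-MDP rate."""
    return min(1.0 / gap, math.sqrt(T)) + C

def linear_efficient_rate(T: float, C: float) -> float:
    """sqrt((1 + C) * T), the rate of the computationally efficient linear-MDP algorithm."""
    return math.sqrt((1.0 + C) * T)

def linear_inefficient_rate(T: float, C: float) -> float:
    """sqrt(T) + C, the rate of the computationally inefficient linear-MDP algorithm."""
    return math.sqrt(T) + C

if __name__ == "__main__":
    T, gap = 10_000, 0.05  # illustrative values, not taken from the paper
    for C in (0, 10, 100, 1_000):
        print(
            f"C={C:>5}  tabular={tabular_rate(T, C, gap):8.1f}  "
            f"sqrt((1+C)T)={linear_efficient_rate(T, C):8.1f}  "
            f"sqrt(T)+C={linear_inefficient_rate(T, C):8.1f}"
        )
```

Under these simplifications, the additive form √(T)+C is no larger than the multiplicative form √((1+C)T) whenever T ≥ 2√(T) + C, which is why the additive dependence on C is the stronger guarantee for moderate corruption levels.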


research
07/23/2020

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

We develop several new algorithms for learning Markov Decision Processes...
research
09/28/2020

Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Episodic reinforcement learning and contextual bandits are two widely st...
research
08/09/2016

Posterior Sampling for Reinforcement Learning Without Episodes

This is a brief technical note to clarify some of the issues with applyi...
research
06/15/2022

Corruption-Robust Contextual Search through Density Updates

We study the problem of contextual search in the adversarial noise model...
research
10/23/2017

Sequential Matrix Completion

We propose a novel algorithm for sequential matrix completion in a recom...
research
05/18/2022

Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs

Reinforcement learning (RL) generalizes bandit problems with additional ...
research
07/06/2022

Model Selection in Reinforcement Learning with General Function Approximations

We consider model selection for classic Reinforcement Learning (RL) envi...
