Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

01/30/2023
by Uri Sherman, et al.

We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full-information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback. The algorithm combines mirror descent with least-squares policy evaluation in an auxiliary MDP that is used to compute exploration bonuses. It obtains an O(K^{6/7}) regret bound, improving significantly over the previous state of the art of O(K^{14/15}) in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but with no other exploratory assumptions), and prove that it obtains state-of-the-art regret of O(K^{2/3}).
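
To give a feel for the gap between the three rates quoted above, here is a minimal sketch that compares their per-episode average, K^(exponent - 1). It assumes K is the number of episodes and drops every constant and logarithmic factor, so the printed numbers show only how the exponents scale and are not actual regret values from the paper. The names `RATES` and `average_regret` are hypothetical and chosen for illustration.

```python
# Illustrative only: compares the scaling of the three regret exponents
# quoted in the abstract. Constants and log factors are ignored (assumption),
# so these are not real regret values.

RATES = {
    "previous state of the art, K^(14/15)": 14 / 15,
    "this work, K^(6/7)": 6 / 7,
    "this work with a simulator, K^(2/3)": 2 / 3,
}


def average_regret(num_episodes: int, exponent: float) -> float:
    """Per-episode average of a K^exponent bound, i.e. K^(exponent - 1)."""
    return num_episodes ** (exponent - 1.0)


if __name__ == "__main__":
    for num_episodes in (10**3, 10**6, 10**9):
        print(f"K = {num_episodes:.0e}")
        for label, exponent in RATES.items():
            print(f"  {label}: {average_regret(num_episodes, exponent):.4f}")
```

A smaller exponent means the per-episode average shrinks faster as K grows, which is the sense in which K^{6/7} improves on K^{14/15}.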


research
07/18/2021

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Policy optimization is a widely-used method in reinforcement learning. D...
research
12/09/2019

Optimism in Reinforcement Learning with Generalized Linear Function Approximation

We design a new provably efficient algorithm for episodic reinforcement ...
research
07/03/2020

Online learning in MDPs with linear function approximation and bandit feedback

We consider an online learning problem where the learner interacts with ...
research
02/19/2020

Optimistic Policy Optimization with Bandit Feedback

Policy optimization methods are one of the most widely used classes of R...
research
11/15/2021

Delayed Feedback in Episodic Reinforcement Learning

There are many provably efficient algorithms for episodic reinforcement ...
research
10/06/2020

Online Linear Optimization with Many Hints

We study an online linear optimization (OLO) problem in which the learne...
research
07/22/2022

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

We present regret minimization algorithms for stochastic contextual MDPs...
