Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP

01/29/2021
by Zihan Zhang, et al.

We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). Our method yields the following new regret bounds:

* For linear bandits, we obtain an O(poly(d) √(1 + ∑_{i=1}^{K} σ_i^2)) regret bound, where d is the feature dimension, K is the number of rounds, and σ_i^2 is the (unknown) variance of the reward at the i-th round. This is the first regret bound that scales only with the variance and the dimension, with no explicit polynomial dependency on K.
* For linear mixture MDPs, we obtain an O(poly(d, log H) √K) regret bound, where d is the number of base models, K is the number of episodes, and H is the planning horizon. This is the first regret bound that scales only logarithmically with H in the reinforcement learning (RL) with linear function approximation setting, thus exponentially improving existing results.

Our methods utilize three novel ideas that may be of independent interest: 1) applying layering techniques to the norm of the input and the magnitude of the variance, 2) a recursion-based approach to estimating the variance, and 3) a convex potential lemma that in a sense generalizes the seminal elliptical potential lemma.
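For readers unfamiliar with the elliptical potential lemma mentioned in the abstract, the following is a minimal numerical sketch of the classical lemma only, not of the paper's convex potential lemma. It assumes the standard textbook statement: for vectors x_1, ..., x_n in R^d with Euclidean norm at most L and V_t = λI + ∑_{s≤t} x_s x_s^T, the sum ∑_t min(1, x_t^T V_{t-1}^{-1} x_t) is at most 2d log(1 + nL^2/(dλ)). The function names, the default λ = 1, and the random test vectors are illustrative choices and do not come from the paper.

```python
import math
import random


def elliptical_potential(xs, lam=1.0):
    """Return sum_t min(1, x_t^T V_{t-1}^{-1} x_t), with V_0 = lam * I."""
    d = len(xs[0])
    # Track V^{-1} directly; each round applies a Sherman-Morrison rank-one update.
    v_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    total = 0.0
    for x in xs:
        u = [sum(v_inv[i][j] * x[j] for j in range(d)) for i in range(d)]  # V^{-1} x
        q = sum(x[i] * u[i] for i in range(d))  # x^T V^{-1} x
        total += min(1.0, q)
        # (V + x x^T)^{-1} = V^{-1} - (V^{-1} x)(V^{-1} x)^T / (1 + x^T V^{-1} x)
        for i in range(d):
            for j in range(d):
                v_inv[i][j] -= u[i] * u[j] / (1.0 + q)
    return total


def lemma_bound(n, d, lam=1.0, norm_bound=1.0):
    """Classical upper bound 2 d log(1 + n L^2 / (d lam))."""
    return 2.0 * d * math.log(1.0 + n * norm_bound ** 2 / (d * lam))


def random_unit_vectors(n, d, seed=0):
    """Hypothetical test data: n random directions on the unit sphere in R^d."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        out.append([c / norm for c in v])
    return out


if __name__ == "__main__":
    n, d = 500, 4
    xs = random_unit_vectors(n, d)
    print("potential sum:", elliptical_potential(xs))
    print("lemma bound:  ", lemma_bound(n, d))
```

The potential sum grows only logarithmically in the number of rounds n, which is the property that the classical lemma supplies in standard linear bandit analyses.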


