Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs

11/05/2021
by   Yeoneung Kim, et al.

In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees, yet it is challenging because variances are often not known a priori. Recently, considerable progress was made by Zhang et al. (2021), who obtained a variance-adaptive regret bound for linear bandits without knowledge of the variances, and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that significantly improve their regret bounds. For linear bandits, we achieve Õ(d^1.5 √(∑_{k=1}^K σ_k^2) + d^2), where d is the dimension of the features, K is the time horizon, σ_k^2 is the noise variance at time step k, and Õ hides polylogarithmic factors; this is a factor-of-d^3 improvement. For linear mixture MDPs, we achieve a horizon-free regret bound of Õ(d^1.5 √K + d^3), where d is the number of base models and K is the number of episodes; this improves the leading term by a factor of d^3 and the lower-order term by a factor of d^6. Our analysis critically relies on a novel elliptical potential `count' lemma, which enables a peeling-based regret analysis and may be of independent interest.
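For readability, the two regret bounds stated in the abstract can be typeset as follows (this simply restates the bounds above; σ_k denotes the noise standard deviation at step k):

```latex
% Linear bandits: variance-adaptive regret bound
\mathrm{Regret}(K) = \widetilde{O}\!\Big(d^{1.5}\sqrt{\textstyle\sum_{k=1}^{K}\sigma_k^2} \;+\; d^2\Big)

% Linear mixture MDPs: horizon-free regret bound
\mathrm{Regret}(K) = \widetilde{O}\!\big(d^{1.5}\sqrt{K} \;+\; d^3\big)
```

Note that the first bound scales with the total realized variance rather than the horizon K, so it automatically tightens in low-noise regimes; the second bound has no dependence on the episode length, which is what "horizon-free" refers to.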


Related research

01/29/2021 — Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP
We show how to construct variance-aware confidence sets for linear bandi...

06/10/2021 — Thompson Sampling with a Mixture Prior
We study Thompson sampling (TS) in online decision-making problems where...

05/23/2022 — Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs
Recent studies have shown that episodic reinforcement learning (RL) is n...

05/26/2022 — Variance-Aware Sparse Linear Bandits
It is well-known that the worst-case minimax regret for sparse linear ba...

01/31/2023 — Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments
We study variance-dependent regret bounds for Markov decision processes ...

02/12/2022 — Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits
We consider the problem of combining and learning over a set of adversar...

01/31/2023 — Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits
Motivated by concerns about making online decisions that incur undue amo...
