Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). Our method yields the following new regret bounds:

* For linear bandits, we obtain an O(poly(d)√(1 + ∑_{i=1}^K σ_i^2)) regret bound, where d is the feature dimension, K is the number of rounds, and σ_i^2 is the (unknown) variance of the reward at the i-th round. This is the first regret bound that scales only with the variance and the dimension, with no explicit polynomial dependency on K.
* For linear mixture MDPs, we obtain an O(poly(d, log H)√K) regret bound, where d is the number of base models, K is the number of episodes, and H is the planning horizon. This is the first regret bound that scales only logarithmically with H in the reinforcement learning (RL) with linear function approximation setting, thus exponentially improving existing results.

Our methods rely on three novel ideas that may be of independent interest: 1) applications of the layering technique to the norm of the input and the magnitude of the variance, 2) a recursion-based approach to estimating the variance, and 3) a convex potential lemma that in a sense generalizes the seminal elliptical potential lemma.
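To make the notion of a variance-aware confidence set concrete, the sketch below builds a confidence ellipsoid from inverse-variance-weighted ridge regression, so that rounds with small σ_i^2 shrink the set more. This is only an illustrative approximation under assumed known variances and an assumed fixed radius `beta`; the paper's actual construction additionally layers by the input norm and the variance magnitude and estimates the variances recursively, none of which is reproduced here.

```python
import numpy as np

def weighted_confidence_set(X, y, sigma2, lam=1.0):
    """Hypothetical variance-weighted ridge estimator for a linear bandit.

    X: (K, d) feature vectors, y: (K,) observed rewards,
    sigma2: (K,) per-round reward variances (assumed known here).
    Returns (theta_hat, Lambda) defining the ellipsoid
    { theta : ||theta - theta_hat||_Lambda <= beta } for some radius beta.
    """
    w = 1.0 / np.maximum(sigma2, 1e-8)                  # inverse-variance weights
    Lambda = lam * np.eye(X.shape[1]) + (X * w[:, None]).T @ X
    theta_hat = np.linalg.solve(Lambda, (X * w[:, None]).T @ y)
    return theta_hat, Lambda

# Toy usage: heteroscedastic rewards around a fixed parameter vector.
rng = np.random.default_rng(0)
d, K = 4, 500
theta = rng.normal(size=d)
X = rng.normal(size=(K, d))
sigma2 = rng.uniform(0.01, 1.0, size=K)                 # per-round variances
y = X @ theta + rng.normal(size=K) * np.sqrt(sigma2)
theta_hat, Lambda = weighted_confidence_set(X, y, sigma2)
err = theta - theta_hat
print("||theta - theta_hat||_Lambda =", float(np.sqrt(err @ Lambda @ err)))
```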