Perturbed-History Exploration in Stochastic Linear Bandits

03/21/2019
by Branislav Kveton et al.

We propose a new online algorithm for minimizing the cumulative regret in stochastic linear bandits. The key idea is to build a perturbed history, which mixes the history of observed rewards with a pseudo-history of randomly generated i.i.d. pseudo-rewards. Our algorithm, perturbed-history exploration in a linear bandit (LinPHE), estimates a linear model from its perturbed history and pulls the arm with the highest value under that model. We prove an Õ(d√n) gap-free bound on the expected n-round regret of LinPHE, where d is the number of features. Our analysis relies on novel concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we extend LinPHE to a logistic reward model. We evaluate both algorithms empirically and show that they are practical.
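To make the procedure concrete, here is a minimal sketch of the idea described above: at each round, every past observation is augmented with a i.i.d. Bernoulli(1/2) pseudo-rewards, a regularized least-squares problem is solved on the resulting perturbed history, and the arm with the highest estimated value is pulled. The perturbation count a, the regularizer lam, the Bernoulli reward simulation, and all names below are illustrative assumptions, not the paper's exact specification.

```python
# A hedged sketch of perturbed-history exploration in a linear bandit.
# Assumes a Bernoulli reward model with means arms @ theta_star in [0, 1];
# `a` pseudo-rewards per observation and regularizer `lam` are assumptions.
import numpy as np

def lin_phe(arms, theta_star, n, a=1, lam=1.0, seed=None):
    """Run n rounds of LinPHE-style exploration; return cumulative regret.

    arms: (K, d) feature matrix, theta_star: (d,) true parameter.
    """
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    feats, rewards = [], []                 # observed history
    best = np.max(arms @ theta_star)        # best mean (regret bookkeeping only)
    regret = 0.0
    for t in range(n):
        if not feats:
            i = rng.integers(K)             # no history yet: pull an arbitrary arm
        else:
            X = np.asarray(feats)
            y = np.asarray(rewards)
            # Perturbed history: add `a` i.i.d. Bernoulli(1/2) pseudo-rewards per
            # past observation, each reusing that observation's feature vector.
            # Binomial(a, 1/2) is the sum of those a pseudo-rewards.
            U = rng.binomial(a, 0.5, size=len(y))
            # Regularized least squares on the (a + 1)-fold replicated design.
            G = (a + 1) * X.T @ X + lam * np.eye(d)
            theta = np.linalg.solve(G, X.T @ (y + U))
            i = int(np.argmax(arms @ theta))  # greedy w.r.t. the perturbed model
        r = rng.binomial(1, arms[i] @ theta_star)  # simulated Bernoulli reward
        feats.append(arms[i])
        rewards.append(r)
        regret += best - arms[i] @ theta_star
    return regret
```

The binary pseudo-rewards are what connects this sketch to the abstract's analysis: the perturbed estimate is a weighted sum of Bernoulli random variables, so its spread can be controlled with the concentration and anti-concentration bounds the paper proves, which is how the perturbation drives exploration without an explicit confidence set.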
