Experimental Design for Regret Minimization in Linear Bandits

11/01/2020
by Andrew Wagenmaker, et al.

In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While the existing literature tends to focus on optimism-based algorithms, which have been shown to be suboptimal in many cases, our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite-time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regimes. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with a slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.
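To make the experimental-design viewpoint concrete, the sketch below shows one classical instance of the idea: phased elimination for a finite-armed stochastic linear bandit, where a G-optimal design (computed with a few Frank-Wolfe steps) decides how often each surviving arm is pulled in every phase. This is a minimal illustration only, not the paper's algorithm; the problem instance, constants, and phase schedule are all illustrative assumptions.

```python
# Minimal sketch: phased elimination with a G-optimal experimental design
# for a finite-armed stochastic linear bandit.  Not the paper's algorithm;
# the instance and all constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 30, 20_000
X = rng.normal(size=(K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm arm features
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)            # unknown parameter
noise_sd = 0.5
true_means = X @ theta_star

def g_optimal_design(arms, iters=100):
    """Approximate G-optimal design over `arms` via Frank-Wolfe on log-det."""
    n = len(arms)
    lam = np.full(n, 1.0 / n)
    for k in range(iters):
        A = (arms * lam[:, None]).T @ arms           # sum_i lam_i x_i x_i^T
        A_inv = np.linalg.pinv(A)
        leverage = np.einsum('ij,jk,ik->i', arms, A_inv, arms)
        i = int(np.argmax(leverage))                 # arm with largest variance
        step = 1.0 / (k + 2)
        lam = (1 - step) * lam
        lam[i] += step
    return lam

active = np.arange(K)
regret, t, phase = 0.0, 0, 1
while t < T and len(active) > 1:
    arms = X[active]
    lam = g_optimal_design(arms)
    eps = 2.0 ** (-phase)                            # target gap this phase
    # Heuristic allocation: pull each arm roughly in proportion to the design.
    n_pulls = np.ceil(2 * d * lam / eps**2 * np.log(K * phase)).astype(int)
    A = np.zeros((d, d)); b = np.zeros(d)
    for x, n in zip(arms, n_pulls):
        n = int(min(n, T - t))
        if n == 0:
            continue
        rewards = x @ theta_star + noise_sd * rng.normal(size=n)
        A += n * np.outer(x, x); b += rewards.sum() * x
        regret += n * (true_means.max() - x @ theta_star)
        t += n
    theta_hat = np.linalg.pinv(A) @ b                # least-squares estimate
    est = arms @ theta_hat
    active = active[est >= est.max() - 2 * eps]      # eliminate bad arms
    phase += 1

print(f"rounds played: {t},  cumulative regret: {regret:.1f}")
```

The design step is what distinguishes this scheme from optimism: pulls are allocated to shrink uncertainty uniformly over the surviving arms rather than to the currently most promising arm, which is the tradeoff between information gain and reward that the abstract refers to.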
