Contextual Blocking Bandits

03/06/2020, by Soumya Basu, et al.

We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes a context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. This model extends the blocking bandits model (Basu et al., NeurIPS 2019) to a contextual setting, and captures important scenarios such as recommendation systems or ad placement with diverse users, and processing a diverse pool of jobs. This contextual setting, however, invalidates greedy solution techniques that are effective for its non-contextual counterpart. Assuming knowledge of the mean reward for each arm-context pair, we design a randomized LP-based algorithm which is α-optimal in (large enough) T time steps, where α = (1 − ε) · d_max/(2·d_max − 1) for any ε > 0, and d_max is the maximum delay of the arms. In the bandit setting, we show that a UCB-based variant of the above online policy guarantees O(log T) regret w.r.t. the α-optimal strategy in T time steps, which matches the Ω(log T) regret lower bound in this setting. Due to the time correlation caused by the blocking of arms, existing techniques for upper bounding regret fail. As a first step toward handling such temporal correlations, we combine ideas from the coupling of non-stationary Markov chains and opportunistic sub-sampling with suboptimality-charging techniques from combinatorial bandits to prove our regret upper bounds.
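The setting described above can be illustrated with a toy simulation. The sketch below is a hypothetical simplification, not the paper's algorithm: arms have per-context Bernoulli mean rewards and per-arm blocking delays, and a basic UCB index policy is restricted to the arms that are currently available. All names and parameters (`ContextualBlockingBandit`, `ucb_blocking`, the chosen means and delays) are illustrative assumptions.

```python
import math
import random


class ContextualBlockingBandit:
    """Toy simulator of the contextual blocking bandit setting.

    Each arm has a mean reward per context and a blocking delay d:
    after an arm is played at time t, it is unavailable (in every
    context) until time t + d, i.e. for the next d - 1 steps.
    Hypothetical illustration only, not the paper's exact model.
    """

    def __init__(self, means, delays, seed=0):
        self.means = means            # means[arm][context] in [0, 1]
        self.delays = delays          # delays[arm] >= 1 (1 = no blocking)
        self.free_at = [0] * len(delays)
        self.rng = random.Random(seed)

    def available(self, t):
        """Arms that are not blocked at time t."""
        return [a for a, f in enumerate(self.free_at) if f <= t]

    def play(self, arm, context, t):
        """Play an available arm; return a Bernoulli reward."""
        assert self.free_at[arm] <= t, "arm is blocked"
        self.free_at[arm] = t + self.delays[arm]
        return 1.0 if self.rng.random() < self.means[arm][context] else 0.0


def ucb_blocking(bandit, contexts, horizon):
    """UCB-style policy restricted to currently available arms.

    A simplification of the kind of policy the abstract describes:
    per arm-context empirical means with a standard UCB exploration
    bonus, choosing the best index among unblocked arms.
    """
    num_arms = len(bandit.delays)
    num_ctx = len(bandit.means[0])
    counts = [[0] * num_ctx for _ in range(num_arms)]
    sums = [[0.0] * num_ctx for _ in range(num_arms)]
    total = 0.0
    for t in range(horizon):
        c = contexts[t]
        avail = bandit.available(t)
        if not avail:
            continue  # every arm blocked: the round is lost

        def index(a):
            n = counts[a][c]
            if n == 0:
                return float("inf")  # force one exploratory pull
            return sums[a][c] / n + math.sqrt(2 * math.log(t + 1) / n)

        arm = max(avail, key=index)
        r = bandit.play(arm, c, t)
        counts[arm][c] += 1
        sums[arm][c] += r
        total += r
    return total
```

Note the key difference from a standard contextual bandit: the greedy choice for the current context may be unavailable because an earlier pull (possibly under a different context) blocked it, which is exactly the temporal coupling that complicates the regret analysis.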


Related research

- Blocking Bandits (07/27/2019)
- Multi-Task Learning for Contextual Bandits (05/24/2017)
- Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits (02/25/2021)
- Recurrent Submodular Welfare and Matroid Blocking Bandits (01/30/2021)
- Contextual Bandits with Latent Confounders: An NMF Approach (06/01/2016)
- Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret (05/29/2022)
- Fully Gap-Dependent Bounds for Multinomial Logit Bandit (11/19/2020)
