Compliance-Aware Bandits

02/09/2016
by Nicolás Della Penna, et al.

Motivated by clinical trials, we study bandits with observable non-compliance. At each step, the learner chooses an arm; afterwards, instead of observing only the reward, it also observes the action that actually took place. We show that such non-compliance can be helpful or hurtful to the learner in general. Unfortunately, naively incorporating compliance information into bandit algorithms loses guarantees on sublinear regret. We present hybrid algorithms that maintain regret bounds up to a multiplicative factor and can incorporate compliance information. Simulations based on real data from the International Stroke Trial show the practical potential of these algorithms.
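The setting is easy to simulate. Below is a minimal sketch, not the authors' algorithm: a standard UCB1 learner whose decision statistics are keyed on the arms it *chooses* (the quantity regret guarantees are about), while statistics keyed on the actions actually *taken* are recorded on the side as the compliance signal a hybrid method could additionally consult. The two-armed environment, the compliance probabilities COMPLY_PROB, and the reward means REWARD_MEAN are all illustrative assumptions, not values from the paper.

```python
import math
import random

# Hypothetical two-armed environment with non-compliance: the action actually
# taken may differ from the arm the learner chose (e.g., a patient refusing
# treatment). All constants here are illustrative assumptions.
COMPLY_PROB = [0.9, 0.6]   # P(taken == chosen) for each arm
REWARD_MEAN = [0.3, 0.7]   # expected reward of the action actually taken

def step(chosen: int) -> tuple[int, float]:
    """Return (action taken, reward); non-compliance flips the action."""
    taken = chosen if random.random() < COMPLY_PROB[chosen] else 1 - chosen
    reward = 1.0 if random.random() < REWARD_MEAN[taken] else 0.0
    return taken, reward

def ucb1(counts, sums, t):
    """Standard UCB1 index over whichever statistics we feed it."""
    return max(
        range(len(counts)),
        key=lambda a: float("inf") if counts[a] == 0
        else sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

# Decision statistics keyed by the chosen arm, plus side statistics keyed by
# the action actually taken (the observable compliance information).
chosen_counts, chosen_sums = [0, 0], [0.0, 0.0]
taken_counts, taken_sums = [0, 0], [0.0, 0.0]

for t in range(1, 10_001):
    arm = ucb1(chosen_counts, chosen_sums, t)
    taken, reward = step(arm)
    chosen_counts[arm] += 1
    chosen_sums[arm] += reward
    taken_counts[taken] += 1     # compliance-aware statistics; a hybrid
    taken_sums[taken] += reward  # algorithm could consult these as well

print("per-chosen-arm means:",
      [s / max(c, 1) for s, c in zip(chosen_sums, chosen_counts)])
print("per-taken-action means:",
      [s / max(c, 1) for s, c in zip(taken_sums, taken_counts)])
```

Keeping the learner's decisions driven by chosen-arm statistics preserves the standard intention-to-treat guarantee; the per-taken-action estimates make visible how much non-compliance distorts the naive picture, which is the information a hybrid algorithm would have to exploit carefully to avoid losing sublinear regret.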

Related research

- Simple Regret Minimization for Contextual Bandits (10/17/2018): There are two variants of the classical multi-armed bandit (MAB) problem...
- Bandits with Delayed Anonymous Feedback (09/20/2017): We study the bandits with delayed anonymous feedback problem, a variant ...
- Lexicographic Multiarmed Bandit (07/26/2019): We consider a multiobjective multiarmed bandit problem with lexicographi...
- Bandits for BMO Functions (07/17/2020): We study the bandit problem where the underlying expected reward is a Bo...
- Bandits with Partially Observable Offline Data (06/11/2020): We study linear contextual bandits with access to a large, partially obs...
- Regret Bounds for Restless Markov Bandits (09/12/2012): We consider the restless Markov bandit problem, in which the state of ea...
- Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits (09/30/2022): While standard bandit algorithms sometimes incur high regret, their perf...
