Corrupted Contextual Bandits with Action Order Constraints

by Alexander Galozy, et al.

We consider a variant of the novel contextual bandit problem with corrupted context, which we call the contextual bandit problem with corrupted context and action correlation, where actions exhibit a relationship structure that can be exploited to guide the exploration of viable next decisions. Our setting is primarily motivated by adaptive mobile health interventions and related applications, where users might transition through different stages that require more targeted action selection. In such settings, maintaining user engagement is paramount for the success of interventions, so it is vital to provide relevant recommendations in a timely manner. The context provided by users might not be informative at every decision point, and standard contextual approaches to action selection will then incur high regret. We propose a meta-algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit, similar to previous work, together with a simple correlation mechanism that captures action-to-action transition probabilities, allowing for more efficient exploration of time-correlated actions. We empirically evaluate the performance of this algorithm in a simulation where the sequence of best actions is determined by a hidden state that evolves in a Markovian manner. We show that the proposed meta-algorithm improves regret in situations where the performance of the two policies varies such that one is strictly superior to the other for a given time period. To demonstrate the practical applicability of our setting, we evaluate our method on several real-world data sets, showing better empirical performance compared to a set of simple algorithms.
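The two ingredients named in the abstract, a referee that arbitrates between two base policies and a mechanism that tracks action-to-action transition probabilities, can be illustrated with a minimal sketch. This is not the authors' algorithm: the class names `TransitionModel` and `Referee`, the Laplace-smoothed counts, and the epsilon-greedy referee rule are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import random


class TransitionModel:
    """Hypothetical helper: smoothed counts of which action followed which."""

    def __init__(self, n_actions, prior=1.0):
        # Assumption: a uniform pseudo-count so unseen transitions keep nonzero mass.
        self.counts = [[prior] * n_actions for _ in range(n_actions)]

    def update(self, prev_action, next_action):
        # Record one observed transition prev_action -> next_action.
        self.counts[prev_action][next_action] += 1.0

    def probs(self, prev_action):
        # Estimated distribution over the next action given the previous one.
        row = self.counts[prev_action]
        total = sum(row)
        return [c / total for c in row]


class Referee:
    """Hypothetical referee: epsilon-greedy over running mean reward per base policy."""

    def __init__(self, n_policies=2, epsilon=0.1, seed=0):
        self.means = [0.0] * n_policies
        self.pulls = [0] * n_policies
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self):
        # Explore a random base policy with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, policy, reward):
        # Incremental mean of the rewards obtained when following this policy.
        self.pulls[policy] += 1
        self.means[policy] += (reward - self.means[policy]) / self.pulls[policy]


# Usage: policy 0 stands for the contextual bandit, policy 1 for the multi-armed bandit.
transitions = TransitionModel(n_actions=3)
for prev, nxt in [(0, 1), (0, 1), (1, 2)]:
    transitions.update(prev, nxt)
next_action_probs = transitions.probs(0)

referee = Referee(epsilon=0.0)
referee.update(0, 1.0)
referee.update(1, 0.0)
chosen_policy = referee.choose()
```

In this sketch the transition estimates could be used to bias exploration toward actions that tend to follow the previous one, while the referee shifts between the two base policies as their recent rewards diverge. How the paper actually combines these signals is not stated in the abstract.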

