Semiparametric Contextual Bandits

by Akshay Krishnamurthy, et al.

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve Õ(d√T) regret over T rounds, when the linear function is d-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches, and our proofs require new concentration inequalities for self-normalized martingales.






We study a generalization of the linear stochastic bandit problem with action-dependent features and an action-independent confounder. The learning process proceeds for T rounds, and in round t, the learner receives a context x_t, which includes a feature vector z_{t,a} ∈ R^d for each action a in the action set A, which we assume to be large but finite. The learner then chooses an action a_t and receives reward

r_t(a_t) = ⟨θ, z_{t,a_t}⟩ + f_t(x_t) + ξ_t,

where θ ∈ R^d is an unknown parameter vector, f_t is a confounding term that depends on the context but, crucially, does not depend on the chosen action a_t, and ξ_t is a noise term that is centered and independent of a_t. For each round t, let a_t^⋆ = argmax_{a ∈ A} ⟨θ, z_{t,a}⟩ denote the optimal action for that round. The goal of our algorithm is to minimize the regret, defined as

Reg(T) = ∑_{t=1}^T r_t(a_t^⋆) − r_t(a_t) = ∑_{t=1}^T ⟨θ, z_{t,a_t^⋆} − z_{t,a_t}⟩.

Observe that the noise term ξ_t and, more importantly, the confounding term f_t(x_t) are absent from the final expression, since they are independent of the action choice. We consider the challenging setting where the context x_t and the confounding term f_t are chosen by an adaptive adversary, so they may depend on all information from previous rounds. This is formalized in the following assumption.

[Environment] We assume that x_t and f_t are generated at the beginning of round t, before a_t is chosen. We assume that x_t and f_t are chosen by an adaptive adversary, and that ξ_t satisfies E[ξ_t] = 0 and |ξ_t| ≤ 1.

We also impose mild regularity assumptions on the parameter, the feature vectors, and the confounding functions.

[Boundedness] Assume that ‖θ‖_2 ≤ 1 and that ‖z_{t,a}‖_2 ≤ 1 for all t, a. Further assume that |f_t(x_t)| ≤ 1 for all t.

For simplicity, we assume an upper bound of 1 in these conditions, but our algorithm and analysis can be adapted to more generic regularity conditions.
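To make the reward model and the regret definition concrete, the following minimal sketch simulates the semiparametric environment and measures the regret of a uniformly random policy. It is illustrative only: the dimensions, the sampling distributions, and the random policy are our assumptions, not the paper's algorithm; note how the confounder f_t and the noise ξ_t affect the observed reward but cancel out of the regret.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 1000  # dimension, actions, rounds (illustrative values)

# Unknown parameter with ||theta||_2 <= 1 (Boundedness assumption).
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

regret = 0.0
for t in range(T):
    # Feature vectors z_{t,a}, rescaled so ||z_{t,a}||_2 <= 1.
    Z = rng.standard_normal((K, d))
    Z /= np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)

    f_t = rng.uniform(-1.0, 1.0)  # action-independent confounder, |f_t| <= 1
    xi_t = rng.uniform(-1.0, 1.0)  # centered, bounded noise

    a_t = rng.integers(K)  # placeholder policy: uniform random action
    reward = Z[a_t] @ theta + f_t + xi_t  # semiparametric reward model

    # Regret increment: f_t and xi_t are absent, since they do not
    # depend on the chosen action.
    a_star = int(np.argmax(Z @ theta))
    regret += Z[a_star] @ theta - Z[a_t] @ theta

print(f"cumulative regret of the random policy over T={T} rounds: {regret:.1f}")
```

Because the optimal action a_t^⋆ maximizes ⟨θ, z_{t,a}⟩ in each round, every regret increment is non-negative, and a random policy accrues regret linear in T; the paper's algorithms instead achieve Õ(d√T).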