Semiparametric Contextual Bandits

03/12/2018 ∙ by Akshay Krishnamurthy, et al.

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve Õ(d√T) regret over T rounds when the linear function is d-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches, and our proofs require new concentration inequalities for self-normalized martingales.





We study a generalization of the linear stochastic bandit problem with action-dependent features and an action-independent confounder. The learning process proceeds for $T$ rounds, and in round $t$, the learner receives a context $x_t \triangleq \{z_{t,a}\}_{a \in \mathcal{A}}$, where $z_{t,a} \in \mathbb{R}^d$ and $\mathcal{A}$ is the action set, which we assume to be large but finite. The learner then chooses an action $a_t \in \mathcal{A}$ and receives reward $r_t(a_t) \triangleq \langle \theta, z_{t,a_t} \rangle + f_t(x_t) + \xi_t$, where $\theta \in \mathbb{R}^d$ is an unknown parameter vector, $f_t(x_t)$ is a confounding term that depends on the context but, crucially, does not depend on the chosen action $a_t$, and $\xi_t$ is a noise term that is centered and independent of $a_t$.

For each round $t$, let $a_t^\star \triangleq \arg\max_{a \in \mathcal{A}} \langle \theta, z_{t,a} \rangle$ denote the optimal action for that round. The goal of our algorithm is to minimize the regret, defined as $\mathrm{Reg}(T) \triangleq \sum_{t=1}^T \left( r_t(a_t^\star) - r_t(a_t) \right) = \sum_{t=1}^T \langle \theta, z_{t,a_t^\star} - z_{t,a_t} \rangle$. Observe that the noise term $\xi_t$ and, more importantly, the confounding term $f_t(x_t)$ are absent from the final expression, since they are independent of the action choice.

We consider the challenging setting where the context $x_t$ and the confounding term $f_t$ are chosen by an adaptive adversary, so they may depend on all information from previous rounds. This is formalized in the following assumption.

Assumption (Environment). We assume that $x_t = \{z_{t,a}\}_{a \in \mathcal{A}}$, $f_t$, and $\xi_t$ are generated at the beginning of round $t$, before $a_t$ is chosen. We assume that $x_t$ and $f_t$ are chosen by an adaptive adversary, and that $\xi_t$ satisfies $\mathbb{E}[\xi_t \mid x_t, f_t] = 0$ and $|\xi_t| \le 1$.

We also impose mild regularity assumptions on the parameter, the feature vectors, and the confounding functions.

Assumption (Boundedness). Assume that $\|\theta\|_2 \le 1$ and that $\|z_{t,a}\|_2 \le 1$ for all $a \in \mathcal{A}$, $t \in [T]$. Further assume that $f_t(\cdot) \in [-1, 1]$ for all $t \in [T]$.

For simplicity, we assume an upper bound of $1$ in these conditions, but our algorithm and analysis can be adapted to more generic regularity conditions.
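The interaction protocol and the regret identity above can be illustrated with a small simulation. The following is a minimal sketch, not the paper's algorithm: the function names (`inner`, `unit_vector`, `simulate`), the sinusoidal confounder, the uniform noise, and the uniformly random action choice are all assumptions made for illustration. It accumulates regret once from the full rewards and once from the linear terms alone, so the two totals should agree up to floating-point error because the confounder and the noise cancel within each round.

```python
import math
import random


def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def unit_vector(rng, d):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(inner(v, v))
    return [vi / norm for vi in v]


def simulate(T=200, d=3, num_actions=5, seed=0):
    """Run a placeholder learner and accumulate regret in two ways.

    Returns (regret_from_rewards, regret_from_features).
    """
    rng = random.Random(seed)
    theta = unit_vector(rng, d)  # unknown parameter with norm 1
    regret_rewards = 0.0
    regret_features = 0.0
    for t in range(1, T + 1):
        # Context: one feature vector per action, each with norm 1.
        z = [unit_vector(rng, d) for _ in range(num_actions)]
        # Hypothetical confounder in [-1, 1]: it depends on the round and
        # the context, but not on the action chosen afterwards.
        f_t = math.sin(0.1 * t + z[0][0])
        # Assumed noise model: centered, bounded by 1, and drawn before
        # the action is chosen.
        xi_t = rng.uniform(-1.0, 1.0)

        def reward(a):
            return inner(theta, z[a]) + f_t + xi_t

        # Optimal action maximizes the linear term only.
        a_star = max(range(num_actions), key=lambda a: inner(theta, z[a]))
        # Placeholder policy (uniformly random), not the paper's method.
        a_t = rng.randrange(num_actions)

        regret_rewards += reward(a_star) - reward(a_t)
        regret_features += inner(theta, z[a_star]) - inner(theta, z[a_t])
    return regret_rewards, regret_features
```

Calling `simulate()` returns the two regret totals. Replacing the random policy with a learning rule would change both totals equally, since neither depends on `f_t` or `xi_t`.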