## Preliminaries

We study a generalization of the linear stochastic bandit problem with
*action-dependent* features and an *action-independent*
confounder. The learning process proceeds for $T$ rounds, and in round
$t$, the learner receives a context
$x_t \triangleq \{z_{t,a}\}_{a \in \mathcal{A}}$, where
$z_{t,a} \in \mathbb{R}^d$ and $\mathcal{A}$ is the action set, which we assume to be large but
finite. The learner then chooses an action $a_t \in \mathcal{A}$ and
receives reward
$$r_t(a_t) \triangleq \langle \theta, z_{t,a_t} \rangle + f_t(x_t) + \xi_t,$$
where $\theta$ is an unknown parameter vector, $f_t(x_t)$ is
a confounding term that depends on the context but, crucially,
does not depend on the chosen action $a_t$, and $\xi_t$ is a noise
term that is centered and independent of $a_t$.
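
As a rough illustration of this reward model, the sketch below simulates a single round. The function name `simulate_round`, the toy feature vectors, and the uniform noise distribution are assumptions made for the example; they are not part of the setting described here.

```python
import random

def simulate_round(theta, features, confounder, rng):
    """Return the reward of every action in one round (illustrative only).

    theta      : list of d floats, the parameter vector
    features   : dict mapping each action a to its feature vector z_{t,a}
    confounder : float, the value f_t(x_t), shared by all actions
    rng        : random.Random instance used to draw the noise
    """
    # Assumed noise model: centered and drawn without looking at the action.
    noise = rng.uniform(-1.0, 1.0)
    return {
        a: sum(th * zi for th, zi in zip(theta, z)) + confounder + noise
        for a, z in features.items()
    }

rng = random.Random(0)
theta = [0.6, -0.8]
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, -0.8]}
rewards = simulate_round(theta, features, confounder=0.7, rng=rng)

# The gap between two actions depends only on the linear term,
# because the confounder and the noise are shared by all actions.
gap = rewards["c"] - rewards["a"]
best_action = max(rewards, key=rewards.get)
```

Changing `confounder` shifts every reward by the same amount, so `gap` and `best_action` stay the same.
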
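For each round $t$, let
$a_t^\star \triangleq \operatorname*{argmax}_{a \in \mathcal{A}} \langle \theta, z_{t,a} \rangle$
denote the optimal action for that round. The goal of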
our algorithm is to minimize the regret, defined as
$$\mathrm{Reg}(T) \triangleq \sum_{t=1}^{T} \big( r_t(a_t^\star) - r_t(a_t) \big)
= \sum_{t=1}^{T} \langle \theta, z_{t,a_t^\star} - z_{t,a_t} \rangle.$$
Observe that the noise term $\xi_t$ and, more importantly, the
confounding term $f_t(x_t)$ are absent in the final expression, since
they are independent of the action choice.
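
To spell out the cancellation, one can substitute the reward model into a single summand of the regret. The step below uses only the definitions already given, and relies on the same confounder and noise values entering the reward of every action in round $t$:

```latex
\begin{aligned}
r_t(a_t^\star) - r_t(a_t)
  &= \big( \langle \theta, z_{t,a_t^\star} \rangle + f_t(x_t) + \xi_t \big)
   - \big( \langle \theta, z_{t,a_t} \rangle + f_t(x_t) + \xi_t \big) \\
  &= \langle \theta, z_{t,a_t^\star} - z_{t,a_t} \rangle .
\end{aligned}
```

Summing over $t = 1, \dots, T$ recovers the final expression for the regret.
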
We consider the challenging setting where the context $x_t$ and the
confounding term $f_t$ are chosen by an adaptive adversary, so
they may depend on all information from previous rounds. This is
formalized in the following assumption.
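
**Assumption (Environment).**
We assume that $x_t = \{z_{t,a}\}_{a \in \mathcal{A}}$, $f_t$, and $\xi_t$ are
generated at the beginning of round $t$, before $a_t$ is chosen. We
assume that $x_t$ and $f_t$ are chosen by an adaptive adversary, and
that $\xi_t$ satisfies $\mathbb{E}[\xi_t \mid x_t, f_t] = 0$ and $|\xi_t| \le 1$.
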
We also impose mild regularity assumptions on the parameter, the
feature vectors, and the confounding functions.

**Assumption (Boundedness).**
Assume that $\|\theta\|_2 \le 1$ and that $\|z_{t,a}\|_2 \le 1$ for
all $a, t$. Further assume that $f_t(\cdot) \in [-1, 1]$
for all $t$.

For simplicity, we assume an upper bound of $1$ in these conditions,
but our algorithm and analysis can be adapted to more generic
regularity conditions.