We study a generalization of the linear stochastic bandit problem with action-dependent features and an action-independent confounder. The learning process proceeds for $T$ rounds, and in round $t$, the learner receives a context $x_t \in \mathcal{X}$ together with feature vectors $z_{t,a} \in \mathbb{R}^d$ for each $a \in \mathcal{A}$, where $\mathcal{A}$ is the action set, which we assume to be large but finite. The learner then chooses an action $a_t \in \mathcal{A}$ and receives reward
\[
r_t(a_t) = \langle \theta, z_{t,a_t} \rangle + f_t(x_t) + \xi_t,
\]
where $\theta \in \mathbb{R}^d$ is an unknown parameter vector, $f_t(x_t)$ is a confounding term that depends on the context but, crucially, does not depend on the chosen action $a_t$, and $\xi_t$ is a noise term that is centered and independent of $a_t$. For each round $t$, let $a_t^\star = \arg\max_{a \in \mathcal{A}} \langle \theta, z_{t,a} \rangle$ denote the optimal action for that round. The goal of our algorithm is to minimize the regret, defined as
\[
\mathrm{Reg}(T) = \sum_{t=1}^T \bigl( r_t(a_t^\star) - r_t(a_t) \bigr) = \sum_{t=1}^T \langle \theta, z_{t,a_t^\star} - z_{t,a_t} \rangle.
\]
Observe that the noise term $\xi_t$ and, more importantly, the confounding term $f_t(x_t)$ are absent from the final expression, since they are independent of the action choice.

We consider the challenging setting where the context and the confounding term are chosen by an adaptive adversary, so they may depend on all information from previous rounds. This is formalized in the following assumption.

[Environment] We assume that $x_t$ and $\{z_{t,a}\}_{a \in \mathcal{A}}$ are generated at the beginning of round $t$, before $a_t$ is chosen. We assume that $x_t$ and $f_t$ are chosen by an adaptive adversary, and that $\xi_t$ satisfies $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] = 0$ and $|\xi_t| \le 1$.

We also impose mild regularity assumptions on the parameter, the feature vectors, and the confounding functions.

[Boundedness] Assume that $\|\theta\|_2 \le 1$ and that $\|z_{t,a}\|_2 \le 1$ for all $t$ and $a \in \mathcal{A}$. Further assume that $|f_t(x)| \le 1$ for all $t$ and $x \in \mathcal{X}$.

For simplicity, we assume an upper bound of $1$ in these conditions, but our algorithm and analysis can be adapted to more generic regularity conditions.
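As a sanity check on the model, the following toy simulation (all instance values, the dimension, the action count, and the uniformly random policy are our own illustrative choices, not part of the setting above) generates rewards of the form $r_t(a) = \langle \theta, z_{t,a}\rangle + f_t(x_t) + \xi_t$ and accumulates the per-round regret $\langle \theta, z_{t,a_t^\star} - z_{t,a_t}\rangle$, in which the confounder and the noise have cancelled:

```python
import random

random.seed(0)
d, K, T = 3, 5, 200           # dimension, number of actions, horizon (toy values)
theta = [0.6, 0.8, 0.0]       # unknown parameter, ||theta||_2 = 1

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

total_regret = 0.0
for t in range(T):
    # Feature vectors z_{t,a}, scaled so that ||z_{t,a}||_2 <= 1.
    Z = [[random.uniform(-1, 1) / d ** 0.5 for _ in range(d)] for _ in range(K)]
    f_t = random.uniform(-1, 1)       # confounder f_t(x_t): action-independent
    xi_t = random.uniform(-0.1, 0.1)  # centered noise, independent of the action
    a = random.randrange(K)           # learner's action (a uniformly random policy)
    reward = dot(theta, Z[a]) + f_t + xi_t          # observed reward r_t(a_t)
    a_star = max(range(K), key=lambda i: dot(theta, Z[i]))  # optimal action a_t^*
    # f_t and xi_t cancel in r_t(a^*) - r_t(a), leaving the linear gap:
    total_regret += dot(theta, Z[a_star]) - dot(theta, Z[a])

# Each per-round gap is nonnegative by optimality of a_star.
print(total_regret >= 0)
```

Note that `total_regret` is computed without ever referencing `f_t` or `xi_t`, mirroring the cancellation in the regret expression above.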