Introduction
Contextual bandits seek to learn a personalized treatment assignment policy in the presence of treatment effects that vary with observed contextual features. In such settings, there is a need to balance the exploration of actions for which there is limited knowledge in order to improve performance in the future against the exploitation of existing knowledge in order to attain better performance in the present (see [Bubeck and CesaBianchi2012] for a survey). Since large amounts of data can be required to learn how the benefits of alternative treatments vary with individual characteristics, contextual bandits can play an important role in making experimentation and learning more efficient. Several successful contextual bandit designs have been proposed [Auer2003], [Li et al.2010], [Agrawal and Goyal2013], [Agarwal et al.2014], [Bastani and Bayati2015]. The existing literature has provided regret bounds (e.g., the general bounds of [Russo and Van Roy2014], the bounds of [Rigollet and Zeevi2010], [Perchet and Rigollet2013], [Slivkins2014] in the case of nonparametric function of arm rewards), has demonstrated successful applications (e.g., news article recommendations [Li et al.2010] or mobile health [Lei, Tewari, and Murphy2017]), and has proposed system designs to apply these algorithms in practice [Agarwal et al.2016].
In the contextual setting, one does not expect to see many future observations with the same context as the current observation, and so the value of learning from pulling an arm for this context accrues when that observation is used to estimate the outcome from this arm for a different context in the future. Therefore, the performance of contextual bandit algorithms can be sensitive to the estimation method of the outcome model or the exploration method used. In the initial phases of learning when samples are small, biases are likely to arise in estimating the outcome model using data from previous nonuniform assignments of contexts to arms. The bias issue is aggravated in the case of a mismatch between the generative model and the functional form used for estimation of the outcome model, or similarly, when the heterogeneity in treatment effects is too complex to estimate well with small datasets. In that case methods that proceed under the assumption that the functional form for the outcome model is correct may be overly optimistic about the extent of the learning so far, and emphasize exploitation over exploration. Another case where biases can arise occurs when training observations from certain regions of the context space are scarce (e.g., prejudice in training data if a nonrepresentative set of users arrives in initial batches of data). These problems are common in realworld settings, such as in survey experiments in the domain of social sciences or in applications to health, recommender systems, or education. For example, early adopters of an online course may have different characteristics than later adopters.
Reweighting or balancing methods address model misspecification by making the estimation “doubly robust”, i.e., robust against misspecification of the reward function (important here) and robust against misspecification of the propensity score (not as important here, because in the bandit setting we know the propensity score). The term “doubly robust” comes from the extensive literature on offline policy evaluation [Scharfstein, Rotnitzky, and Robins1999]; it means in our case that when comparing two policies using historical data, we get consistent estimates of the average difference in outcomes for segments of the context whether or not we have a well-specified model of rewards, as long as we have a good model of the arm assignment policy (i.e., accurate propensity scores). Because in a contextual bandit the learner controls the arm assignment policy conditional on the observed context, we have access to accurate propensity scores even in small samples. So, even when the reward model is severely misspecified, the learner can obtain unbiased estimates of the reward function for each range of values of the context.
We suggest the integration of balancing methods from the causal inference literature [Imbens and Rubin2015] in online contextual bandits. We focus on the domain of linear online contextual bandits with provable guarantees, such as LinUCB [Li et al.2010] and LinTS [Agrawal and Goyal2013], and we propose two new algorithms, balanced linear UCB (BLUCB) and balanced linear Thompson sampling (BLTS). BLTS and BLUCB build on LinTS and LinUCB respectively and extend them in a way that makes them less prone to problems of bias. The balancing will lead to lower estimated precision in the reward functions, and thus will emphasize exploration longer than the conventional linear TS and UCB algorithms, leading to more robust estimates.
The balancing technique is well known in machine learning, especially in domain adaptation and studies in learning-theoretic frameworks
[Huang et al.2007], [Zadrozny2004], [Cortes, Mansour, and Mohri2010]. There are a number of recent works that approach contextual bandits through the framework of causality [Bareinboim, Forney, and Pearl2015], [Bareinboim and Pearl2015], [Forney, Pearl, and Bareinboim2017], [Lattimore, Lattimore, and Reid2016]. There is also a significant body of research that leverages balancing for offline evaluation and learning of contextual bandit or reinforcement learning policies from logged data
[Strehl et al.2010], [Dudík, Langford, and Li2011], [Li et al.2012], [Dudík et al.2014], [Li et al.2014], [Swaminathan and Joachims2015], [Jiang and Li2016], [Thomas and Brunskill2016], [Athey and Wager2017], [Kallus2017], [Wang, Agarwal, and Dudík2017], [Deshpande et al.2017], [Kallus and Zhou2018], [Zhou, Athey, and Wager2018]. In the offline setting, the complexity of the historical assignment policy is taken as given, and thus the difficulty of the offline evaluation and learning of optimal policies is taken as given. Therefore, these results lie at the opposite end of the spectrum from our work, which focuses on the online setting. Methods for reducing the bias due to adaptive data collection have also been studied for non-contextual multi-armed bandits [Villar, Bowden, and Wason2015], [Nie et al.2018], but the nature of the estimation in contextual bandits is qualitatively different. Importance weighted regression in contextual bandits was first mentioned in [Agarwal et al.2014], but without a systematic motivation, analysis and evaluation. To our knowledge, our paper is the first work to integrate balancing in the online contextual bandit setting, to perform a large-scale evaluation of it against direct estimation method baselines with theoretical guarantees, and to provide a theoretical characterization of balanced contextual bandits that matches the regret bound of their direct method counterparts. The effect of importance weighted regression is also evaluated in [Bietti, Agarwal, and Langford2018], but this is a successor to the extended version of our paper [Dimakopoulou et al.2017].
We prove regret bounds for BLTS and BLUCB that depend on the number of features in the context $d$, the number of arms $K$ and the horizon $T$. Our regret bounds for BLTS and BLUCB match the existing state-of-the-art regret bounds for LinTS [Agrawal and Goyal2013] and LinUCB [Chu et al.2011] respectively.
We provide extensive empirical evidence for the effectiveness of BLTS and BLUCB (in comparison to LinTS and LinUCB) by considering the problem of multiclass classification with bandit feedback. Specifically, we transform a $K$-class classification task into a $K$-armed contextual bandit [Dudík, Langford, and Li2011] and we use 300 public benchmark datasets for our evaluation. It is important to point out that, even though BLTS and LinTS share the same theoretical guarantee, BLTS outperforms LinTS empirically. Similarly, BLUCB has a strong empirical advantage over LinUCB. In bandits, this phenomenon is not uncommon. For instance, it is well known that even though the existing UCB bounds are often tighter than those of Thompson sampling, Thompson sampling performs better in practice than UCB [Chapelle and Li2011]. We find that this is also the case for balanced linear contextual bandits, as in our evaluation BLTS has a strong empirical advantage over BLUCB. Overall, in this large-scale evaluation, BLTS outperforms LinUCB, BLUCB and LinTS. In our empirical evaluation, we also consider a synthetic example that simulates in a simple way two issues of bias that often arise in practice, training data with non-representative contexts and model misspecification, and find again that BLTS is the most effective in escaping these biases.
Problem Formulation & Algorithms
Contextual Bandit Setting
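In the stochastic contextual bandit setting, there is a finite set of arms, $a \in \mathcal{A}$, with cardinality $K$. At every time $t$, the environment produces $(x_t, r_t)$, where $x_t$ is a $d$-dimensional context vector and $r_t = (r_t(1), \dots, r_t(K))$ is the reward associated with each arm in $\mathcal{A}$. The contextual bandit chooses arm $a_t \in \mathcal{A}$ for context $x_t$ and observes the reward only for the chosen arm, $r_t(a_t)$. The optimal assignment for context $x_t$ is $a_t^* = \arg\max_{a \in \mathcal{A}} \mathbb{E}[r_t(a) \mid x_t]$. The expected cumulative regret over horizon $T$ is defined as $R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \left(r_t(a_t^*) - r_t(a_t)\right)\right]$. At each time $t = 1, \dots, T$, the contextual bandit assigns arm $a_t$ to context $x_t$ based on the history of observations up to that time, $(x_\tau, a_\tau, r_\tau(a_\tau))_{\tau=1}^{t-1}$. The goal is to find the assignment rule that minimizes $R(T)$.
Linear Contextual Bandits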
Linear contextual bandits rely on modeling and estimating the reward distribution corresponding to each arm $a \in \mathcal{A}$ given context $x_t = x$. Specifically, the expected reward is assumed to be a linear function of the context with some unknown coefficient vector $\theta_a$, $\mathbb{E}[r_t(a) \mid x_t = x] = x^\top \theta_a$, and the variance is typically assumed to be constant, $\mathbb{V}[r_t(a) \mid x_t = x] = \sigma_a^2$. In the setting we are studying, there are $K$ models to be estimated, as many as the arms in $\mathcal{A}$. At every time $t$, this estimation of $\theta_a$ is done separately for each arm $a$ on the history of observations corresponding to this arm, $(X_a, r_a)$. Thompson Sampling [Thompson1933], [Scott2010], [Agrawal and Goyal2012], [Russo et al.2018] and Upper Confidence Bounds (UCB) [Lai and Robbins1985], [Auer, CesaBianchi, and Fischer2002] are two different methods which are highly effective in dealing with the exploration vs. exploitation trade-off in multi-armed bandits. LinTS [Agrawal and Goyal2013] and LinUCB [Li et al.2010] are linear contextual bandit algorithms associated with Thompson sampling and UCB respectively.
At time $t$, LinTS and LinUCB apply ridge regression with regularization parameter $\lambda$ to the history of observations $(X_a, r_a)$ for each arm $a \in \mathcal{A}$, in order to obtain an estimate $\hat{\theta}_a$ and its variance $\hat{V}(\hat{\theta}_a)$. For the new context $x_t$, $\hat{\theta}_a$ and its variance are used by LinTS and LinUCB to obtain the conditional mean of the reward associated with each arm $a$, $\hat{\mu}_a(x_t) = x_t^\top \hat{\theta}_a$, and its variance $\hat{V}(\hat{\mu}_a(x_t)) = x_t^\top \hat{V}(\hat{\theta}_a) x_t$. LinTS assumes that the expected reward $\mu_a(x_t)$ associated with arm $a$ conditional on the context $x_t$ is Gaussian, $\mathcal{N}\left(\hat{\mu}_a(x_t), \alpha^2 \hat{V}(\hat{\mu}_a(x_t))\right)$, where $\alpha$ is an appropriately chosen constant. LinTS draws a sample $\tilde{\mu}_a(x_t)$ from the distribution of each arm $a$, and context $x_t$ is then assigned to the arm with the highest sample, $a_t = \arg\max_{a} \tilde{\mu}_a(x_t)$. LinUCB uses the estimate $\hat{\theta}_a$ and its variance to compute upper confidence bounds for the expected reward $\mu_a(x_t)$ of context $x_t$ associated with each arm $a$ and assigns the context to the arm with the highest upper confidence bound, $a_t = \arg\max_{a} \left\{\hat{\mu}_a(x_t) + \alpha \sqrt{\hat{V}(\hat{\mu}_a(x_t))}\right\}$, where $\alpha$ is an appropriately chosen constant.
Linear Contextual Bandits with Balancing
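As a point of reference for the balanced variants, the direct-method selection rules of LinTS and LinUCB described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`ridge_fit`, `lin_ts_choose`, `lin_ucb_choose`) are hypothetical, and the variance of the coefficient estimate is taken to be the inverse regularized Gram matrix, with the noise scale absorbed into the constant `alpha`.

```python
import numpy as np


def ridge_fit(X, r, lam=1.0):
    """Ridge regression for one arm.

    X: (n, d) contexts assigned to this arm, r: (n,) observed rewards.
    Returns the coefficient estimate and a variance proxy (assumption:
    inverse regularized Gram matrix, noise scale absorbed into alpha).
    """
    d = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    theta_hat = A_inv @ (X.T @ r)
    return theta_hat, A_inv


def mean_and_var(x, theta_hat, var_hat):
    """Conditional mean and variance of the reward estimate at context x."""
    return x @ theta_hat, x @ var_hat @ x


def lin_ts_choose(x, models, alpha, rng):
    """Thompson sampling: draw one sample per arm, pick the largest."""
    samples = []
    for theta_hat, var_hat in models:
        mu, v = mean_and_var(x, theta_hat, var_hat)
        samples.append(rng.normal(mu, alpha * np.sqrt(v)))
    return int(np.argmax(samples))


def lin_ucb_choose(x, models, alpha):
    """UCB: pick the arm with the highest upper confidence bound."""
    ucbs = []
    for theta_hat, var_hat in models:
        mu, v = mean_and_var(x, theta_hat, var_hat)
        ucbs.append(mu + alpha * np.sqrt(v))
    return int(np.argmax(ucbs))
```

Here `models` is a list with one `(theta_hat, var_hat)` pair per arm, refit whenever new observations for that arm arrive.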
In this section, we show how to integrate balancing methods from the causal inference literature in linear contextual bandits, in order to make estimation less prone to bias issues.
Balanced linear Thompson sampling (BLTS) and balanced linear UCB (BLUCB) are online contextual bandit algorithms that perform balanced estimation of the model of all arms in order to obtain a Gaussian distribution and an upper confidence bound respectively for the reward associated with each arm conditional on the context. We focus on the method of inverse propensity weighting [Imbens and Rubin2015]. The idea is that at every time $t$, the linear contextual bandit weighs each observation $(x_\tau, a_\tau, r_\tau(a_\tau))$, $\tau = 1, \dots, t$, in the history up to time $t$ by the inverse probability of context $x_\tau$ being assigned to arm $a_\tau$. This probability is called the propensity score and is denoted as $p_{a_\tau}(x_\tau)$. Then, for each arm $a \in \mathcal{A}$, the linear contextual bandit weighs each observation $(x_\tau, a, r_\tau(a))$ in the history of arm $a$ by $w_a = 1 / p_a(x_\tau)$ and uses weighted regression to obtain the estimate $\hat{\theta}_a$ with variance $\hat{V}(\hat{\theta}_a)$. In BLTS (Algorithm 1), the propensity scores are known because Thompson sampling performs probability matching, i.e., it assigns a context to an arm with the probability that this arm is optimal. Since computing the propensity scores involves high-order integration, they can be approximated via Monte Carlo simulation. Each iteration draws a sample from the posterior reward distribution of each arm $a$ conditional on $x$, where the posterior is the one that the algorithm considered at the end of a randomly selected prior time period. The propensity score $p_a(x)$ is the fraction of the Monte Carlo iterations in which arm $a$ has the highest sampled reward, where the arrival time of context $x$ is treated as random. For every arm $a$, the history $(X_a, r_a, p_a)$ is used to obtain a balanced estimate $\hat{\theta}_a$ of $\theta_a$ and its variance $\hat{V}(\hat{\theta}_a)$, which produce a normally distributed estimate $\tilde{\mu}_a \sim \mathcal{N}\left(x_t^\top \hat{\theta}_a, \alpha^2 x_t^\top \hat{V}(\hat{\theta}_a) x_t\right)$ of the reward of arm $a$ for context $x_t$, where $\alpha$ is a parameter of the algorithm.
In BLUCB (Algorithm 2), the observations are weighed by the inverse of estimated propensity scores. Note that UCB-based contextual bandits have deterministic assignment rules, and conditional on the context the propensity score is either zero or one. But with the standard assumption that the arrival of contexts is random, at every time period $t$ the estimated probability $\hat{p}_{a_t}(x_t)$ is obtained by the prediction of the trained multinomial logistic regression model on $(x_t, a_t)$. Subsequently, $(X_a, r_a, \hat{p}_a)$ is used to obtain a balanced estimate $\hat{\theta}_a$ of $\theta_a$ and its variance $\hat{V}(\hat{\theta}_a)$. These are used to construct the upper confidence bound, $x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top \hat{V}(\hat{\theta}_a) x_t}$, for the reward of arm $a$ for context $x_t$, where $\alpha$ is a constant. (For some results, e.g., [Auer2002], $\alpha$ needs to be slowly increasing in $t$.)
Note that, for both BLTS and BLUCB, $\hat{\theta}_a$ and $\hat{V}(\hat{\theta}_a)$ can be computed in closed form or via the bootstrap.
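The closed-form balanced estimate mentioned above can be illustrated with a short sketch of inverse-propensity-weighted ridge regression for a single arm. The function name `balanced_ridge_fit` is hypothetical, and the returned variance proxy (the inverse weighted Gram matrix) is a simplifying assumption rather than the exact variance estimator used in Algorithms 1 and 2.

```python
import numpy as np


def balanced_ridge_fit(X, r, p, lam=1.0):
    """Inverse-propensity-weighted ridge regression for one arm.

    X: (n, d) contexts assigned to this arm.
    r: (n,) observed rewards.
    p: (n,) propensity of this arm for each of those contexts.
    lam: ridge regularization parameter.
    """
    w = 1.0 / p                      # balancing weights
    d = X.shape[1]
    XtW = X.T * w                    # equals X.T @ diag(w)
    A_inv = np.linalg.inv(XtW @ X + lam * np.eye(d))
    theta_hat = A_inv @ (XtW @ r)
    # Assumption: A_inv is used as a variance proxy for theta_hat.
    return theta_hat, A_inv
```

With all propensities equal to one, this reduces to ordinary ridge regression; observations that were unlikely to be assigned to the arm receive proportionally more weight.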
Weighting the observations by the inverse propensity scores reduces bias, but even when the propensity scores are known it increases variance, particularly when they are small. Clipping the propensity scores [Crump et al.2009] with some threshold , e.g. helps control the variance increase. This threshold is an additional parameter to BLTS (Algorithm 1) and BLUCB (Algorithm 2) compared to LinTS and LinUCB. Finally, note that one could integrate in the contextual bandit estimation other covariate balancing methods, such as the method of approximate residual balancing [Athey, Imbens, and Wager2018] or the method of [Kallus2017]. For instance, with approximate residual balancing one would use as weights s.t. where is a tuning parameter, and and then use
to modify the parametric and nonparametric model estimation as outlined before.
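The Monte Carlo approximation of the Thompson sampling propensities and the clipping step can be sketched together as follows. This is a simplified illustration: it samples from a single posterior per arm rather than from the posterior of a randomly selected earlier period, and the function names and the threshold value used in the usage note are hypothetical.

```python
import numpy as np


def ts_propensities(x, models, alpha, n_draws, rng):
    """Approximate the probability that each arm wins a Thompson draw.

    models: list of (theta_hat, var_hat) pairs, one per arm.
    Returns an array of length K whose entries sum to one.
    """
    mus = np.array([x @ th for th, _ in models])
    sds = np.array([alpha * np.sqrt(x @ V @ x) for _, V in models])
    # One row per Monte Carlo iteration, one column per arm.
    draws = rng.normal(mus, sds, size=(n_draws, len(models)))
    winners = np.argmax(draws, axis=1)
    return np.bincount(winners, minlength=len(models)) / n_draws


def clip_propensities(p, gamma):
    """Bound propensities below by gamma so that weights 1/p stay <= 1/gamma."""
    return np.maximum(p, gamma)
```

For example, `clip_propensities(ts_propensities(x, models, 1.0, 1000, rng), 0.1)` caps every balancing weight at 10, which trades a small amount of bias for lower variance.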
Theoretical Guarantees for BLTS and BLUCB
In this section, we establish theoretical guarantees of BLTS and BLUCB that are comparable to LinTS and LinUCB. We start with a few technical assumptions that are standard in the contextual bandits literature.
Assumption 1.
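Linear Realizability: There exist parameters $\{\theta_a\}_{a \in \mathcal{A}}$ such that given any context $x$, $\mathbb{E}[r_t(a) \mid x] = x^\top \theta_a$, $\forall a \in \mathcal{A}$.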
We use the standard (frequentist) regret criterion and standard assumptions on the regularity of the distributions.
Definition 1.
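The instantaneous regret at iteration $t$ is $x_t^\top \theta_{a_t^*} - x_t^\top \theta_{a_t}$, where $a_t^*$ is the optimal arm at iteration $t$ and $a_t$ is the arm taken at iteration $t$. The cumulative regret $R(T)$ with horizon $T$ is defined as $R(T) = \sum_{t=1}^{T} \left(x_t^\top \theta_{a_t^*} - x_t^\top \theta_{a_t}\right)$.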
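Under linear realizability, the regret in Definition 1 can be computed directly in a simulation where the true parameters are known. The sketch below is illustrative only; the function name `cumulative_regret` and the array layout are assumptions.

```python
import numpy as np


def cumulative_regret(contexts, thetas, chosen_arms):
    """Cumulative regret after each iteration.

    contexts: (T, d) array of contexts.
    thetas: (K, d) array of true per-arm parameters.
    chosen_arms: (T,) integer array of arms taken.
    """
    expected = contexts @ thetas.T                      # (T, K) expected rewards
    best = expected.max(axis=1)                         # reward of the optimal arm
    taken = expected[np.arange(len(chosen_arms)), chosen_arms]
    return np.cumsum(best - taken)
```

The last entry of the returned array is the cumulative regret at the horizon.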
Definition 2.
We denote the canonical filtration of the underlying contextual bandits problem by $\{\mathcal{F}_t\}_{t=1}^{\infty}$, where $\mathcal{F}_t = \sigma\left(\{x_s, a_s, r_s(a_s)\}_{s=1}^{t}, x_{t+1}\right)$: the sigma algebra generated by all the random variables up to and including iteration $t$, plus $x_{t+1}$. (All the random variables are defined on some common underlying probability space, which we do not write out explicitly here.) In other words, $\mathcal{F}_t$ contains all the information that is available before making the decision for iteration $t+1$.
Assumption 2.
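For each $a \in \mathcal{A}$ and every $t \ge 1$:
1. Sub-Gaussian Noise: $r_t(a) - x_t^\top \theta_a$ is conditionally sub-Gaussian: there exists some $L_a > 0$, such that $\mathbb{E}\left[e^{s (r_t(a) - x_t^\top \theta_a)} \mid \mathcal{F}_{t-1}\right] \le \exp\left(s^2 L_a^2 / 2\right)$, $\forall s \in \mathbb{R}$.
2. Bounded Contexts and Parameters: The contexts and parameters are assumed to be bounded. Consequently, without loss of generality, we can rescale them such that $\|x_t\|_2 \le 1$ and $\|\theta_a\|_2 \le 1$.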
Remark 1.
Note that we make no assumption on the underlying $\{x_t\}_{t=1}^{\infty}$ process: the contexts need not be fixed beforehand or come from some stationary process. Further, they can even be adapted to $\{\mathcal{F}_t\}$, in which case they are called adversarial contexts in the literature, as the contexts can be chosen by an adversary who chooses a context after observing the arms played and the corresponding rewards. If $\{x_t\}$ is an IID process, then the problem is known as stochastic contextual bandits. From this viewpoint, adversarial contextual bandits are more general, but the regret bounds tend to be worse. Both are studied in the literature.
Theorem 1.
We refer the reader to Appendix A of the supplemental material of the extended version of this paper [Dimakopoulou et al.2017] for the regret bound proofs.
Remark 2.
The above bound essentially matches the existing stateofthe art regret bounds for linear Thompson sampling with direct model estimation (e.g. [Agrawal and Goyal2013]). Note that in [Agrawal and Goyal2013], an infinite number of arms is also allowed, but all arms share the same parameter . The final regret bound is . Note that even though no explicit dependence on is present in the regret bound (and hence our regret bound appears as a factor of worse), this is to be expected, as we have parameters to estimate, one for each arm. Note that here we do not assume any structure on the arms; they are just standalone parameters, each of which needs to be independently estimated. Similarly, for BLUCB, our regret bound is , which is a factor of worse than that of [Chu et al.2011], which establishes a regret bound. Again, this is because a single true is assumed in [Chu et al.2011], rather than armdependent parameters.
Of course, we also point out that our regret bounds are not tight, nor do they achieve stateoftheart regret bounds in contextual bandits algorithms in general. The lower bound is established in [Chu et al.2011] for linear contextual bandits (again in the context of a single parameter for all arms). In general, UCB based algorithms ([Auer2003, Chu et al.2011, Bubeck and CesaBianchi2012, AbbasiYadkori, Pál, and Szepesvári2011]) tend to have better (and sometimes nearoptimal) theoretical regret bounds. In particular, the stateoftheart bound of for linear contextual bandits is given in [Bubeck and CesaBianchi2012] (optimal up to a factor). However, as mentioned in the introduction, Thompson sampling based algorithms tend to perform much better in practice (even though their regret bounds tend not to match UCB based algorithms, as is also the case here). Hence, our objective here is not to provide stateoftheart regret guarantees. Rather, we are motivated to design algorithms that have better empirical performance (compared to both the existing UCB style algorithms and Thompson sampling style algorithms), which also enjoy the baseline theoretical guarantee.
Finally, we give some quick intuition for the proof. For BLTS, we first show that the estimated means concentrate around the true means (i.e., $x_t^\top \hat{\theta}_a$ concentrates around $x_t^\top \theta_a$). Then, we establish that the sampled means concentrate around the estimated means (i.e., $\tilde{\mu}_a$ concentrates around $x_t^\top \hat{\theta}_a$). These two steps together indicate that the sampled mean is close to the true mean. A further consequence is that we can then bound the instantaneous regret (regret at each time step $t$) in terms of the sum of two standard deviations: one corresponds to the optimal arm at time $t$, the other corresponds to the arm actually selected at time $t$. The rest of the proof then follows by giving tight characterizations of these two standard deviations. For BLUCB, the proof again utilizes the first concentration mentioned above: the estimated means concentrate around the true means (note that there are no sampled means in BLUCB). The rest of the proof adopts a similar structure as in [Chu et al.2011].
Computational Results
In this section, we present computational results that compare the performance of our balanced linear contextual bandits, BLTS and BLUCB, with the direct method linear contextual bandit algorithms that have theoretical guarantees, LinTS and LinUCB. Our evaluation focuses on contextual bandits with linear realizability assumption and strong theoretical guarantees. First, we present a simple synthetic example that simulates bias in the training data by underrepresentation or overrepresentation of certain regions of the context space and investigates the performance of the considered linear contextual bandits both when the outcome model of the arms matches the true reward generative process and when it does not match the true reward generative process. Second, we conduct an experiment by leveraging 300 public, supervised costsensitive classification datasets to obtain contextual bandit problems, treating the features as the context, the labels as the actions and revealing only the reward for the chosen label. We show that BLTS performs better than LinTS and that BLUCB performs better than LinUCB. The randomized assignment nature of Thompson sampling facilitates the estimation of the arms’ outcomes models compared to UCB, and as a result LinTS outperforms LinUCB and BLTS outperforms BLUCB. Overall, BLTS has the best performance. In the supplemental material, we include experiments against the policybased contextual bandit from [Agarwal et al.2014] which is statistically optimal but it is also outperformed by BLTS.
A Synthetic Example
This simulated example aims to reflect in a simple way two issues that often arise in practice. The first issue is the presence of bias in the training data by underrepresentation or overrepresentation of certain regions. A personalized policy that is trained based on such data and is applied to the entire context space will result in biased decisions for certain contexts. The second issue is the problem of mismatch between the true reward generative process and the functional form used for estimation of the outcome model of the arms, which is common in applications with complex generative models. Model misspecification aggravates the presence of bias in the learned policies.
We use this simple example to present in an intuitive manner why balancing and randomized assignment rule help with these issues, before moving on to a largescale evaluation of the algorithms in real datasets in the next section.
Consider a simulation design where there is a warmstart batch of training observations, but it consists of contexts focused on one region of the context space. There are three arms and the contexts are twodimensional with . The rewards corresponding to each arm are generated as follows; , , and , where , . The expected values of the three arms’ rewards are shown in Figure 1.
In the warmstart data, and are generated from a truncated normal distribution on the interval , while in subsequent data and are drawn from without the truncation. Each one of the 50 warmstart contexts is assigned to one of the three arms at random with equal probability. Note that the warmstart contexts belong to a region of the context space where the reward surfaces do not change much with the context. Therefore, when training the reward model for the first time, the estimated reward of arm (blue) is the highest, the one of arm (yellow) is the second highest and the one of arm (red) is the lowest across the context space.
We run our experiment with a learning horizon . The regularization parameter , which is present in all algorithms, is chosen via crossvalidation every time the model is updated. The constant , which is present in all algorithms, is optimized among values in the Thompson sampling bandits (the value corresponds to standard Thompson sampling, [Chapelle and Li2011] suggest that smaller values may lower regret) and among values in the UCB bandits [Chapelle and Li2011]. The propensity threshold for BLTS and BLUCB is optimized among the values .
Well-Specified Outcome Models
In this section, we compare the behavior of LinTS, LinUCB, BLTS and BLUCB when the outcome model of the contextual bandits is wellspecified, i.e., it includes both linear and quadratic terms. Note that this is still in the domain of linear contextual bandits, if we treat the quadratic terms as part of the context.
First, we compare LinTS and LinUCB. Figure 2a shows that the uncertainty and the stochastic nature of LinTS lead to a “dispersed” assignment of arms and and to the crucial assignment of a few contexts to arm . This allows LinTS to start decreasing the bias in the estimation of all three arms. Within the first few learning observations, LinTS estimates the outcome models of all three arms correctly and finds the optimal assignment. On the other hand, Figure 2b shows that the deterministic nature of LinUCB assigns entire regions of the context space to the same arm. As a result, not enough contexts are assigned to and LinUCB delays the correction of bias in the estimation of this arm. Another way to understand the problem is that the outcome model in the LinUCB bandit has biased coefficients combined with estimated uncertainty that is too low to incentivize the exploration of arm initially. LinUCB finds the correct assignment after 240 observations.
Second, we study the performance of BLTS and BLUCB. In Figure 2d, we observe that balancing has a significant impact on the performance of UCB, since BLUCB finds the optimal assignment after 110 observations, much faster than LinUCB. This is because the few observations of arm outside of the context region of the warm-start batch are weighted more heavily by BLUCB. As a result, BLUCB, despite its deterministic nature which complicates estimation, is able to reduce its bias more quickly via balancing. Figure 2c shows that BLTS is also able to find the optimal assignment a few observations earlier than LinTS.
The first column of Table 1 shows the percentage of simulations in which LinTS, LinUCB, BLTS and BLUCB find the optimal assignment within contexts for the wellspecified case. BLTS outperforms all other algorithms by a large margin.
Mis-Specified Outcome Models
We now study the behavior of LinTS, LinUCB, BLTS and BLUCB when the outcome models include only linear terms of the context and therefore are misspecified. In realworld domains, the true data generative process is complex and very difficult to capture by the simpler outcome models assumed by the learning algorithms. Hence, model mismatch is very likely.
We first compare LinTS and LinUCB. In Figures 3a and 3b, we see that during the first time periods, both bandits assign most contexts to arm and a few contexts to arm . LinTS finds the linearly approximated area in which arm is suboptimal faster than LinUCB. However, both LinTS and LinUCB have trouble identifying that the optimal arm is . Due to the low estimate of from the misrepresentative warm-start observations, LinUCB does not assign contexts to arm for a long time and is therefore slow to estimate the model of correctly. LinTS does assign a few contexts to arm , but they are not enough to quickly correct the estimation bias of arm either. On the other hand, BLTS is able to harness the advantages of the stochastic assignment rule of Thompson sampling. The few contexts assigned to arm are weighted more heavily by BLTS. Therefore, as shown in Figure 3c, BLTS corrects the estimation error of arm and finds the (constrained) optimal assignment already after 20 observations. In contrast, BLUCB does not handle the estimation problem created by the deterministic nature of the assignment in the mis-specified case better than LinUCB, as shown in Figure 3d. The second column of Table 1 shows the percentage of simulations in which LinTS, LinUCB, BLTS and BLUCB find the optimal assignment within contexts for the mis-specified case. Again, BLTS has a strong advantage.
This simple synthetic example allowed us to explain transparently where the benefits of balancing in linear bandits stem from. Balancing helps escape biases in the training data and can be more robust in the case of model misspecification. While, as we proved, balanced linear contextual bandits share the same strong theoretical guarantees as their direct-method counterparts, this example points to their better performance in practice compared to other contextual bandits with linear realizability assumption. We investigate this further in the next section with an extensive evaluation on real classification datasets.
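Table 1: Percentage of simulations in which each algorithm finds the optimal assignment.
Algorithm   Well-Specified   Mis-Specified
LinTS       84%              39%
LinUCB      51%              29%
BLTS        92%              58%
BLUCB       79%              30%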
Multiclass Classification with Bandit Feedback
Adapting a classification task to a bandit problem is a common method for comparing contextual bandit algorithms [Dudík, Langford, and Li2011], [Agarwal et al.2014], [Bietti, Agarwal, and Langford2018]. In a classification task, we assume data are drawn IID from a fixed distribution: $(x, c) \sim D$, where $x \in \mathcal{X}$ is the context and $c \in \{1, \dots, K\}$ is the class. The goal is to find a classifier $\pi: \mathcal{X} \to \{1, \dots, K\}$ that minimizes the classification error $\mathbb{E}_{(x, c) \sim D}\left[\mathbf{1}\{\pi(x) \ne c\}\right]$. The classifier can be seen as an arm-selection policy and the classification error is the policy’s expected regret. Further, if only the loss associated with the policy’s chosen arm is revealed, this becomes a contextual bandit setting. So, at time $t$, context $x_t$ is sampled from the dataset, the contextual bandit selects arm $a_t \in \{1, \dots, K\}$ and observes reward $r_t(a_t) = \mathbf{1}\{a_t = c_t\}$, where $c_t$ is the unknown, true class of $x_t$. The performance of a contextual bandit algorithm on a dataset with $n$ observations is measured with respect to the normalized cumulative regret, $\frac{1}{n} \sum_{t=1}^{n} \left(r_t(a_t^*) - r_t(a_t)\right)$.
We use 300 multiclass datasets from the OpenML repository. The datasets vary in number of observations, number of classes and number of features. Table 2 summarizes the characteristics of these benchmark datasets. Each dataset is randomly shuffled.
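The conversion of a labelled dataset into a bandit problem can be sketched as a simple replay loop. The interface below (`choose_arm` and `update` callables) is a hypothetical stand-in for any of the bandit algorithms; the reward is 1 when the chosen arm equals the true class and 0 otherwise, so the optimal arm always earns reward 1.

```python
import numpy as np


def normalized_regret(X, y, choose_arm, update):
    """Replay a classification dataset with bandit feedback.

    X: (n, d) features used as contexts.
    y: (n,) true classes in {0, ..., K-1}, never shown to the bandit.
    choose_arm(x) -> int: the bandit's arm for context x (hypothetical).
    update(x, arm, reward) -> None: feedback for the chosen arm only.
    Returns the normalized cumulative regret over the dataset.
    """
    total = 0.0
    for x, c in zip(X, y):
        a = choose_arm(x)
        reward = 1.0 if a == c else 0.0
        update(x, a, reward)          # only the chosen arm's reward is revealed
        total += 1.0 - reward         # optimal arm would have earned 1
    return total / len(y)
```

A bandit that always picks the true class has normalized regret 0, and one that never does has normalized regret 1.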
Observations  Datasets 

58  
and  152 
and  57 
33 
Classes  Count 

243  
48  
9 
Features  Count 

154  
106  
40 
We evaluate LinTS, BLTS, LinUCB and BLUCB on these 300 benchmark datasets. We run each contextual bandit on every dataset for different choices of input parameters. The regularization parameter , which is present in all algorithms, is chosen via crossvalidation every time the model is updated. The constant , which is present in all algorithms, is optimized among values in the Thompson sampling bandits [Chapelle and Li2011] and among values in the UCB bandits [Chapelle and Li2011]. The propensity threshold for BLTS and BLUCB is optimized among the values . Apart from baselines that belong in the family of contextual bandits with linear realizability assumption and have strong theoretical guarantees, we also evaluate the policybased ILOVETOCONBANDITS (ILTCB) from [Agarwal et al.2014] that does not estimate a model, but instead it assumes access to an oracle for solving fully supervised costsensitive classification problems and achieves the statistically optimal regret.
Figure 4 shows the pairwise comparison of LinTS, BLTS, LinUCB, BLUCB and ILTCB on the 300 classification datasets. Each point corresponds to a dataset. The $x$ coordinate is the normalized cumulative regret of the column bandit and the $y$ coordinate is the normalized cumulative regret of the row bandit. The point is blue when the row bandit has smaller normalized cumulative regret and wins over the column bandit. The point is red when the row bandit loses to the column bandit. The point’s size grows with the significance of the win or loss.
The first important observation is that the improved model estimation achieved via balancing leads to better practical performance across a large number of contextual bandit instances. Specifically, BLTS outperforms LinTS and BLUCB outperforms LinUCB. The second important observation is that deterministic assignment rule bandits are at a disadvantage compared to randomized assignment rule bandits. The improvement in estimation via balancing is not enough to outweigh the fact that estimation is more difficult when the assignment is deterministic and BLUCB is outperformed by LinTS. Overall, BLTS which has both balancing and a randomized assignment rule, outperforms all other linear contextual bandits with strong theoretical guarantees. BLTS also outperforms the modelagnostic ILTCB algorithm. We refer the reader to Appendix B of the supplemental material of the extended version of this paper [Dimakopoulou et al.2017] for details on the datasets.
Closing Remarks
Contextual bandits are poised to play an important role in a wide range of applications: content recommendation in web services, where the learner wants to personalize recommendations (arm) to the profile of a user (context) to maximize engagement (reward); online education platforms, where the learner wants to select a teaching method (arm) based on the characteristics of a student (context) in order to maximize the student’s scores (reward); and survey experiments, where the learner wants to learn what information or persuasion (arm) influences the responses (reward) of subjects as a function of their demographics, political beliefs, or other characteristics (context). In these settings, there are many potential sources of bias in the estimation of outcome models, not only due to the inherent adaptive data collection, but also due to mismatch between the true data generating process and the outcome model assumptions, and due to prejudice in the training data in the form of underrepresentation or overrepresentation of certain regions of the context space. To reduce bias, we have proposed new contextual bandit algorithms, BLTS and BLUCB, which build on the linear contextual bandits LinTS and LinUCB respectively and improve them with balancing methods from the causal inference literature.
We derived the first regret bound analysis for linear contextual bandits with balancing and showed that linear contextual bandits with balancing match the theoretical guarantees of linear contextual bandits with direct model estimation; namely, BLTS matches the regret bound of LinTS and BLUCB matches the regret bound of LinUCB. A synthetic example simulating covariate shift and model misspecification and a large-scale experiment with real multiclass classification datasets demonstrated the effectiveness of balancing in contextual bandits, particularly when coupled with Thompson sampling.
Acknowledgments
The authors would like to thank Emma Brunskill for valuable comments on the paper and John Langford, Miroslav Dudík, Akshay Krishnamurthy and Chicheng Zhang for useful discussions regarding the evaluation on classification datasets. This research is generously supported by ONR grant N000141712131, by the Sloan Foundation, by the “Arvanitidis in Memory of William K. Linvill” Stanford Graduate Fellowship and by the Onassis Foundation.
References
 [AbbasiYadkori, Pál, and Szepesvári2011] Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In NIPS.
 [Agarwal et al.2014] Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. ICML.
 [Agarwal et al.2016] Agarwal, A.; Bird, S.; Cozowicz, M.; Hoang, L.; Langford, J.; Lee, S.; Li, J.; Melamed, D.; Oshri, G.; Ribas, O.; Sen, S.; and Slivkins, A. 2016. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966.
 [Agrawal and Goyal2012] Agrawal, S., and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. Journal of Machine Learning Research Workshop and Conference Proceedings.
 [Agrawal and Goyal2013] Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. ICML.
 [Athey and Wager2017] Athey, S., and Wager, S. 2017. Efficient policy learning. arXiv preprint arXiv:1702.02896.
 [Athey, Imbens, and Wager2018] Athey, S.; Imbens, G. W.; and Wager, S. 2018. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society.
 [Auer, CesaBianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning.
 [Auer2002] Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3(Nov).
 [Auer2003] Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research.
 [Bareinboim and Pearl2015] Bareinboim, E., and Pearl, J. 2015. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences.
 [Bareinboim, Forney, and Pearl2015] Bareinboim, E.; Forney, A.; and Pearl, J. 2015. Bandits with unobserved confounders: A causal approach. ICML.
 [Bastani and Bayati2015] Bastani, H., and Bayati, M. 2015. Online decision-making with high-dimensional covariates.
 [Bietti, Agarwal, and Langford2018] Bietti, A.; Agarwal, A.; and Langford, J. 2018. A contextual bandit bake-off. arXiv preprint arXiv:1802.04064.
 [Bubeck and CesaBianchi2012] Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning.
 [Chapelle and Li2011] Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. NIPS.
 [Chu et al.2011] Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In AISTATS.
 [Cortes, Mansour, and Mohri2010] Cortes, C.; Mansour, Y.; and Mohri, M. 2010. Learning bounds for importance weighting. NIPS.
 [Crump et al.2009] Crump, R. K.; Hotz, V. J.; Imbens, G. W.; and Mitnik, O. A. 2009. Dealing with limited overlap in estimation of average treatment effects. Biometrika 96(1):187–199.
 [Deshpande et al.2017] Deshpande, Y.; Mackey, L.; Syrgkanis, V.; and Taddy, M. 2017. Accurate inference in adaptive linear models. arXiv preprint arXiv:1712.06695.
 [Dimakopoulou et al.2017] Dimakopoulou, M.; Zhou, Z.; Athey, S.; and Imbens, G. 2017. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077.
 [Dudík et al.2014] Dudík, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly robust policy evaluation and optimization. Statistical Science.
 [Dudík, Langford, and Li2011] Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. ICML.
 [Forney, Pearl, and Bareinboim2017] Forney, A.; Pearl, J.; and Bareinboim, E. 2017. Counterfactual data-fusion for online reinforcement learners. ICML.
 [Huang et al.2007] Huang, J.; Gretton, A.; Borgwardt, K. M.; Schölkopf, B.; and Smola, A. J. 2007. Correcting sample selection bias by unlabeled data. NIPS.
 [Imbens and Rubin2015] Imbens, G. W., and Rubin, D. B. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences.
 [Jiang and Li2016] Jiang, N., and Li, L. 2016. Doubly robust off-policy value evaluation for reinforcement learning. ICML.
 [Kallus and Zhou2018] Kallus, N., and Zhou, A. 2018. Policy evaluation and optimization with continuous treatments. AISTATS.
 [Kallus2017] Kallus, N. 2017. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384.
 [Lai and Robbins1985] Lai, T., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.
 [Lattimore, Lattimore, and Reid2016] Lattimore, F.; Lattimore, T.; and Reid, M. D. 2016. Causal bandits: Learning good interventions via causal inference. NIPS.
 [Lei, Tewari, and Murphy2017] Lei, H.; Tewari, A.; and Murphy, S. 2017. An actor-critic contextual bandit algorithm for personalized mobile health interventions. arXiv preprint arXiv:1706.09090.
 [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. 2010. A contextual-bandit approach to personalized news article recommendation. WWW.
 [Li et al.2012] Li, L.; Chu, W.; Langford, J.; Moon, T.; and Wang, X. 2012. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. Journal of Machine Learning Research Workshop and Conference Proceedings.
 [Li et al.2014] Li, L.; Chen, S.; Kleban, J.; and Gupta, A. 2014. Counterfactual estimation and optimization of click metrics for search engines. arXiv preprint arXiv:1403.1891.

 [Nie et al.2018] Nie, X.; Tian, X.; Taylor, J.; and Zou, J. 2018. Why adaptively collected data have negative bias and how to correct for it. International Conference on Artificial Intelligence and Statistics.
 [Perchet and Rigollet2013] Perchet, V., and Rigollet, P. 2013. The multi-armed bandit problem with covariates. The Annals of Statistics.
 [Rigollet and Zeevi2010] Rigollet, P., and Zeevi, A. 2010. Nonparametric bandits with covariates. COLT.
 [Russo and Van Roy2014] Russo, D., and Van Roy, B. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research.
 [Russo et al.2018] Russo, D. J.; Van Roy, B.; Kazerouni, A.; Osband, I.; and Wen, Z. 2018. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning.
 [Scharfstein, Rotnitzky, and Robins1999] Scharfstein, D. O.; Rotnitzky, A.; and Robins, J. M. 1999. Adjusting for nonignorable dropout using semiparametric nonresponse models. Journal of the American Statistical Association 94(448):1096–1120.
 [Scott2010] Scott, S. 2010. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry.
 [Slivkins2014] Slivkins, A. 2014. Contextual bandits with similarity information. Journal of Machine Learning Research.
 [Strehl et al.2010] Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from logged implicit exploration data. In NIPS.
 [Swaminathan and Joachims2015] Swaminathan, A., and Joachims, T. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research.
 [Thomas and Brunskill2016] Thomas, P., and Brunskill, E. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. ICML.
 [Thompson1933] Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.
 [Villar, Bowden, and Wason2015] Villar, S.; Bowden, J.; and Wason, J. 2015. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science.
 [Wang, Agarwal, and Dudík2017] Wang, Y. X.; Agarwal, A.; and Dudík, M. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. ICML.
 [Zadrozny2004] Zadrozny, B. 2004. Learning and evaluating classifiers under sample selection bias. ICML.
 [Zhou, Athey, and Wager2018] Zhou, Z.; Athey, S.; and Wager, S. 2018. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778.