1 Introduction
Many real-world applications, ranging from online news recommendation (Li et al., 2011), advertising (Bottou et al., 2013), and search engines (Li et al., 2015) to personalized healthcare (Zhou et al., 2017), are naturally modeled by the contextual bandit protocol (Langford and Zhang, 2008), where a learner repeatedly observes a context, takes an action, and accrues reward. In news recommendation, the context is any information about the user, such as the history of past visits, the action is the recommended article, and the reward could indicate, for instance, the user’s click on the article. The goal is to maximize the reward, but the learner can only observe the reward for the chosen actions, and not for other actions.
In this paper we study a fundamental problem in contextual bandits known as off-policy evaluation, where the goal is to use the data gathered by a past algorithm, known as the logging policy, to estimate the average reward of a new algorithm, known as the target policy. High-quality off-policy estimates help avoid costly A/B testing and can also be used as subroutines for optimizing a policy (Dudík et al., 2011; Athey and Wager, 2017).
The key challenge in off-policy evaluation is distribution mismatch: the decisions made by the target policy differ from those made by the logging policy that collected the data. Three standard approaches tackle this problem. The first approach, known as inverse propensity scoring (IPS) (Horvitz and Thompson, 1952), corrects for this mismatch by reweighting the data. The resulting estimate is unbiased, but may exhibit intolerably high variance when the importance weights (also known as inverse propensity scores) are large. The second approach, direct modeling (DM) or imputation, sidesteps the problem of large importance weights and directly fits a regression model to predict rewards. Non-parametric variants of direct modeling are asymptotically optimal (Imbens et al., 2007), but in finite samples, direct modeling often suffers from a large bias (Dudík et al., 2011). The third approach, called the doubly robust (DR) estimator (Robins and Rotnitzky, 1995; Bang and Robins, 2005; Dudík et al., 2011), combines IPS and DM: it first estimates the reward by DM, and then estimates a correction term by IPS. The approach is unbiased, its variance is smaller than that of IPS, and it is asymptotically optimal under weaker assumptions than DM (Rothe, 2016). However, since DR uses the same importance weights as IPS, its variance can still be quite high, unless the reward predictor is highly accurate. Therefore, several works have developed variants of DR that clip or remove large importance weights. Weight clipping incurs a small bias, but substantially decreases the variance, yielding a lower mean squared error than standard DR (Bembom and van der Laan, 2008; Bottou et al., 2013; Wang et al., 2017; Su et al., 2018).

In this paper, we continue this line of work by developing a systematic approach for designing estimators with favorable finite-sample performance. Our approach involves shrinking the importance weights to directly optimize a sharp bound on the mean squared error (MSE). We use two styles of upper bounds to obtain two classes of estimators. The first is based on an upper bound that is agnostic to the quality of the reward estimator and yields the previously studied weight clipping (Kang et al., 2007; Strehl et al., 2010; Su et al., 2018), which can be interpreted as pessimistic shrinkage. The second is based on an upper bound that incorporates the quality of the reward predictor, yielding a new estimator, which we call DR with optimistic shrinkage. Both classes of estimators involve a hyperparameter, and specific choices produce the unbiased doubly robust estimator and the low-variance direct-modeling estimator; the optimal hyperparameter improves on both of these.
To tune the hyperparameter and navigate between the two estimator classes, we design a simple model-selection procedure. Model selection is crucial to our approach, but is important even for classical estimators, as their performance is highly dataset-dependent. In contrast with supervised learning, where cross-validation is a simple and effective strategy, distribution mismatch makes model selection quite challenging in contextual bandits. Our model selection approach again involves optimizing a bound on the MSE. Combined with our shrinkage estimators, we prove that our final estimator is never worse than DR. Thus, our estimators retain the asymptotic optimality of DR, but with improved finite-sample performance.
We evaluate our approach on benchmark datasets and compare its performance with a number of existing estimators across a comprehensive range of conditions. While our focus is on tuning importance weights, we also vary how we train a reward predictor, including the recently proposed more robust doubly robust (MRDR) approach (Farajtabar et al., 2018), and apply our model selection to pick the best predictor. Our experiments show that our approach typically outperforms state-of-the-art methods in both off-policy evaluation and off-policy learning settings. We also find that the choice of the reward predictor changes, depending on whether weight shrinkage is used or not. Via extensive ablation studies, we identify a robust configuration of our shrinkage approach that we recommend as a practical choice.
Other related work. Off-policy estimation is studied in observational settings under the name average treatment effect (ATE) estimation, with many results on asymptotically optimal estimators (Hahn, 1998; Hirano et al., 2003; Imbens et al., 2007; Rothe, 2016), but only a few that optimize MSE in finite samples. Most notably, Kallus (2017, 2018) adjusts importance weights by optimizing MSE under smoothness (or parametric) assumptions on the reward function. This can be highly effective when the assumptions hold, but the assumptions are difficult to verify in practice. In contrast, we optimize importance weights without making any modeling assumptions other than boundedness of rewards.
2 Setup
We consider the contextual bandits protocol, where a decision maker interacts with the environment by repeatedly observing a context $x \in \mathcal{X}$, choosing an action $a \in \mathcal{A}$, and observing a reward $r \in [0,1]$. The context space $\mathcal{X}$ can be uncountably large, but we assume that the action space $\mathcal{A}$ is finite. In the news recommendation example, $x$ describes the history of past visits of a given user, $a$ is a recommended article, and $r$ equals one if the user clicks on the article and zero otherwise. We assume that contexts are sampled i.i.d. from some distribution $D(x)$ and rewards are sampled from some conditional distribution $D(r \mid x, a)$. We write $\eta(x,a) := \mathbb{E}[r \mid x, a]$ for the expected reward, conditioned on a given context and action, and refer to $\eta$ as the regression function.

The behavior of a decision maker is formalized as a conditional distribution $\pi(a \mid x)$ over actions given contexts, referred to as a policy. We also write $\pi(x,a,r) := D(x)\,\pi(a \mid x)\,D(r \mid x,a)$ for the joint distribution over context-action-reward triples when actions are selected by the policy $\pi$. The expected reward of a policy $\pi$, called the value of $\pi$, is denoted as $V(\pi) := \mathbb{E}_{\pi}[r]$.

In the off-policy evaluation problem, we are given a dataset $\{(x_i, a_i, r_i)\}_{i=1}^{n}$ consisting of context-action-reward triples collected by some logging policy $\mu$, and we would like to estimate the value of a target policy $\pi$. The quality of an estimator $\hat{V}$ is measured by the mean squared error

$$\mathrm{MSE}(\hat{V}) := \mathbb{E}\big[(\hat{V} - V(\pi))^2\big],$$

where the expectation is with respect to the data generation process. In analyzing the error of an estimator, we rely on the decomposition of MSE into the bias and variance terms:

$$\mathrm{MSE}(\hat{V}) = \mathrm{Bias}(\hat{V})^2 + \mathrm{Var}(\hat{V}), \qquad \mathrm{Bias}(\hat{V}) := \mathbb{E}[\hat{V}] - V(\pi), \qquad \mathrm{Var}(\hat{V}) := \mathbb{E}\big[(\hat{V} - \mathbb{E}[\hat{V}])^2\big].$$
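We consider three standard approaches for off-policy evaluation. The first two are direct modeling (DM) and inverse propensity scoring (IPS). In the former, we train a reward predictor $\hat{\eta}(x,a)$ and use it to impute rewards. In the latter, we simply reweight the data. With $(x_i, a_i, r_i)$, $i = 1, \dots, n$, denoting the logged triples, $\pi$ the target policy, and $\mu$ the logging policy, the two estimators are:

$$\hat{V}_{\mathrm{DM}}(\pi; \hat{\eta}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \mathcal{A}} \pi(a \mid x_i)\, \hat{\eta}(x_i, a), \qquad \hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i.$$

For a concise notation, let $w(x,a) := \pi(a \mid x) / \mu(a \mid x)$ denote the importance weight. We make a standard assumption that $\pi$ is absolutely continuous with respect to $\mu$, meaning that $\mu(a \mid x) > 0$ whenever $\pi(a \mid x) > 0$. This condition ensures that the importance weights are well defined and that $\hat{V}_{\mathrm{IPS}}$ is an unbiased estimator for the value $V(\pi)$ of the target policy. However, if there is substantial mismatch between $\pi$ and $\mu$, then the importance weights will be large and $\hat{V}_{\mathrm{IPS}}$ will have large variance. On the other hand, given any fixed reward predictor $\hat{\eta}$ (fit on a separate dataset), $\hat{V}_{\mathrm{DM}}$ has low variance, independent of the distribution mismatch, but it is typically biased due to approximation errors in fitting $\hat{\eta}$.

The third approach, called the doubly robust (DR) estimator (Dudík et al., 2014), combines DM and IPS:

$$\hat{V}_{\mathrm{DR}}(\pi; \hat{\eta}) = \hat{V}_{\mathrm{DM}}(\pi; \hat{\eta}) + \frac{1}{n} \sum_{i=1}^{n} w(x_i, a_i)\,\big(r_i - \hat{\eta}(x_i, a_i)\big). \tag{1}$$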
The DR estimator applies IPS to a shifted version of the rewards, using the reward predictor $\hat{\eta}$ as a control variate to decrease the variance of IPS. DR preserves the unbiasedness of IPS and achieves asymptotic optimality, as long as it is possible to derive sufficiently good reward predictors given enough data (Rothe, 2016).
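The three estimators described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the array layout (`pi_probs[i, a]` and `mu_probs[i, a]` for the target and logging action probabilities on the i-th logged context, `eta_hat[i, a]` for predicted rewards), the toy data, and all function names are hypothetical and not taken from the authors' code.

```python
import numpy as np


def importance_weights(pi_probs, mu_probs, actions):
    """Ratio pi(a_i | x_i) / mu(a_i | x_i) for the logged actions."""
    idx = np.arange(len(actions))
    return pi_probs[idx, actions] / mu_probs[idx, actions]


def dm_estimate(pi_probs, eta_hat):
    """Direct modeling: impute rewards with the predictor, average under pi."""
    return float(np.mean(np.sum(pi_probs * eta_hat, axis=1)))


def ips_estimate(pi_probs, mu_probs, actions, rewards):
    """Inverse propensity scoring: reweight the logged rewards."""
    w = importance_weights(pi_probs, mu_probs, actions)
    return float(np.mean(w * rewards))


def dr_estimate(pi_probs, mu_probs, eta_hat, actions, rewards):
    """Doubly robust: DM plus an importance-weighted residual correction."""
    idx = np.arange(len(actions))
    w = importance_weights(pi_probs, mu_probs, actions)
    residual = rewards - eta_hat[idx, actions]
    return dm_estimate(pi_probs, eta_hat) + float(np.mean(w * residual))


# Toy logged data (assumed setup): uniform logging policy over 3 actions.
rng = np.random.default_rng(0)
n, k = 1000, 3
mu_probs = np.full((n, k), 1.0 / k)
pi_probs = rng.dirichlet(np.ones(k), size=n)
actions = rng.integers(0, k, size=n)
rewards = rng.integers(0, 2, size=n).astype(float)
eta_zero = np.zeros((n, k))

ips_value = ips_estimate(pi_probs, mu_probs, actions, rewards)
dr_value = dr_estimate(pi_probs, mu_probs, eta_zero, actions, rewards)
print(ips_value, dr_value)  # identical: DR with a zero predictor is IPS
```

With an all-zero reward predictor the DM term vanishes and the residual is the raw reward, so the sketch reproduces the relationship between DR and IPS noted in the text.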
When $\hat{\eta} \equiv 0$, DR recovers IPS and is plagued by the same large variance. However, even when the reward predictor is perfect, any intrinsic stochasticity in the rewards may cause the terms $r_i - \hat{\eta}(x_i, a_i)$, appearing in the DR estimator, to be far from zero. Multiplied by large importance weights $w(x_i, a_i)$, these terms yield large variance for DR in comparison with DM. Several approaches seek a more favorable bias-variance tradeoff by clipping, removing, or rescaling the importance weights (Wang et al., 2017; Su et al., 2018; Kallus, 2017, 2018). Our work also seeks to systematically replace the weights $w(x_i, a_i)$ with new weights $\hat{w}(x_i, a_i)$ to bring the variance of DR closer to that of DM.
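In practice, the reward predictor $\hat{\eta}$ is biased due to approximation errors, so in this paper we make no assumptions about its quality. At the same time, we would like to make sure that our estimators can adapt to a high-quality $\hat{\eta}$ if it is available. To motivate our adaptive estimator, we assume that $\hat{\eta}$ is trained via weighted least squares regression on a dataset separate from the one used in the DR estimator. That is, for a dataset $\{(x_j, a_j, r_j)\}_{j=1}^{m}$, we consider a weighting function $z(x,a)$ and solve

$$\hat{\eta} := \operatorname*{argmin}_{f \in \mathcal{F}} \; \frac{1}{m} \sum_{j=1}^{m} z(x_j, a_j)\,\big(f(x_j, a_j) - r_j\big)^2, \tag{2}$$

where $\mathcal{F}$ is some function class of reward predictors. Natural choices of the weighting function $z$, explored in our experiments, include $z(x,a) = 1$, $z(x,a) = w(x,a)$, and $z(x,a) = w^2(x,a)$, where $w$ is the importance weight. We stress that the assumption on how we fit $\hat{\eta}$ only serves to guide our derivations, but we make no specific assumptions about its quality. In particular, we do not assume that $\mathcal{F}$ contains a good approximation of the regression function $\eta$.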
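As a rough illustration of the weighted least squares fit described above, the sketch below solves the problem in closed form for a linear function class. The linear featurization, the small ridge term (added only for numerical stability), and the function name are assumptions made for this example; the weighting values are passed in as a plain array so that any weighting function can be plugged in.

```python
import numpy as np


def fit_weighted_least_squares(features, rewards, z, ridge=1e-8):
    """Minimize sum_j z_j * (features_j @ theta - r_j)**2 + ridge * ||theta||**2.

    features: (m, d) array of context-action features (hypothetical layout)
    rewards:  (m,) array of observed rewards
    z:        (m,) array of nonnegative regression weights
    """
    weighted = features * z[:, None]
    gram = features.T @ weighted + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, weighted.T @ rewards)


# Toy check (assumed data): noiseless linear rewards are recovered
# regardless of the positive weights used in the regression.
rng = np.random.default_rng(1)
features = rng.normal(size=(50, 3))
theta_true = np.array([0.2, -0.1, 0.5])
rewards = features @ theta_true
z = rng.uniform(0.5, 2.0, size=50)

theta_hat = fit_weighted_least_squares(features, rewards, z)
print(theta_hat)
```

When the function class is misspecified, different weightings generally lead to different fitted predictors, which is why the choice of weighting matters in the derivations that follow.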
3 Our Approach: DR with Shrinkage
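Our approach replaces the importance weight mapping $w$ in the DR estimator (1) with a new weight mapping $\hat{w}$ found by directly optimizing sharp bounds on the MSE. The resulting estimator, which we call the doubly robust estimator with shrinkage (DRs), thus depends on both the reward predictor $\hat{\eta}$ and the weight mapping $\hat{w}$:

$$\hat{V}_{\mathrm{DRs}}(\pi; \hat{\eta}, \hat{w}) = \hat{V}_{\mathrm{DM}}(\pi; \hat{\eta}) + \frac{1}{n} \sum_{i=1}^{n} \hat{w}(x_i, a_i)\,\big(r_i - \hat{\eta}(x_i, a_i)\big). \tag{3}$$

We assume that $0 \le \hat{w}(x,a) \le w(x,a)$, justifying the terminology “shrinkage”. For a fixed choice of $\pi$ and $\hat{\eta}$, we will seek the mapping $\hat{w}$ that minimizes the MSE of $\hat{V}_{\mathrm{DRs}}(\pi; \hat{\eta}, \hat{w})$, which we simply denote as $\mathrm{MSE}(\hat{w})$. We similarly write $\mathrm{Bias}(\hat{w})$ and $\mathrm{Var}(\hat{w})$ for the bias and variance of this estimator.

We treat $\hat{w}$ as the optimization variable and consider two upper bounds on MSE: an optimistic one and a pessimistic one. In both cases, we separately bound $\mathrm{Bias}(\hat{w})$ and $\mathrm{Var}(\hat{w})$. To bound the bias, we use the following expression, derived from the fact that $\hat{V}_{\mathrm{DRs}}$ is unbiased when $\hat{w} = w$:

$$\mathrm{Bias}(\hat{w}) = \mathbb{E}_{\mu}\big[(\hat{w}(x,a) - w(x,a))\,(r - \hat{\eta}(x,a))\big], \tag{4}$$

where $\mathbb{E}_{\mu}$ denotes the expectation over triples $(x,a,r)$ with actions selected by the logging policy $\mu$. To bound the variance, we rely on the following proposition, which states that it suffices to focus on the second moment of the terms $\hat{w}(x_i, a_i)(r_i - \hat{\eta}(x_i, a_i))$:

Proposition 1. If $0 \le \hat{w}(x,a) \le w(x,a)$ and $\hat{\eta}(x,a) \in [0,1]$, then

$$\Big|\mathrm{Var}(\hat{w}) - \frac{1}{n}\,\mathbb{E}_{\mu}\big[\hat{w}^2(x,a)\,(r - \hat{\eta}(x,a))^2\big]\Big| \le \frac{1}{n}.$$

The proof of Proposition 1 and other mathematical statements from this paper are in the appendix.

We derive estimators for two different regimes depending on the quality of the reward predictor $\hat{\eta}$. Since we do not know the quality of $\hat{\eta}$ a priori, in the next section we obtain a model selection procedure to select between these two estimators.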
3.1 DR with Optimistic Shrinkage
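Our first family of estimators is based on an optimistic MSE bound, which adapts to the quality of $\hat{\eta}$, and which we expect to be tighter when $\hat{\eta}$ is more accurate. Recall that $\hat{\eta}$ is trained to minimize weighted square loss with respect to some weighting function $z$, which we denote as

$$L(\hat{\eta}) := \mathbb{E}_{\mu}\big[z(x,a)\,(r - \hat{\eta}(x,a))^2\big].$$

We next bound the bias and variance in terms of the weighted square loss $L(\hat{\eta})$.

To bound the bias we apply the Cauchy-Schwarz inequality to (4):

$$|\mathrm{Bias}(\hat{w})| \le \sqrt{\mathbb{E}_{\mu}\Big[\tfrac{1}{z(x,a)}\,(\hat{w}(x,a) - w(x,a))^2\Big]}\;\sqrt{L(\hat{\eta})}. \tag{5}$$

To bound the variance, we invoke Proposition 1 and focus on bounding the quantity $\mathbb{E}_{\mu}[\hat{w}^2(x,a)(r - \hat{\eta}(x,a))^2]$. Using the Cauchy-Schwarz inequality, the fact that $|r - \hat{\eta}(x,a)| \le 1$, and $\hat{w} \le w$, we obtain

$$\mathbb{E}_{\mu}\big[\hat{w}^2(x,a)\,(r - \hat{\eta}(x,a))^2\big] \le \sqrt{\mathbb{E}_{\mu}\Big[\tfrac{1}{z(x,a)}\,w^2(x,a)\,\hat{w}^2(x,a)\Big]}\;\sqrt{L(\hat{\eta})}. \tag{6}$$

Combining the bounds (5) and (6) with Proposition 1 yields the following bound on $\mathrm{MSE}(\hat{w})$:

$$\mathrm{MSE}(\hat{w}) \le \mathbb{E}_{\mu}\Big[\tfrac{1}{z}\,(\hat{w} - w)^2\Big]\,L(\hat{\eta}) + \frac{1}{n}\sqrt{\mathbb{E}_{\mu}\Big[\tfrac{1}{z}\,w^2\hat{w}^2\Big]}\;\sqrt{L(\hat{\eta})} + \frac{1}{n}.$$

A direct minimization of this bound appears to be a high-dimensional optimization problem. Instead of minimizing the bound directly, we note that it is a strictly increasing function of the two expectations that appear in it. Thus, its minimizer must be on the Pareto front with respect to the two expectations, meaning that for some choice of $\lambda \in [0, \infty]$, the minimizer can be obtained by solving

$$\min_{\hat{w}} \; \lambda\,\mathbb{E}_{\mu}\Big[\tfrac{1}{z}\,(\hat{w} - w)^2\Big] + \mathbb{E}_{\mu}\Big[\tfrac{1}{z}\,w^2\hat{w}^2\Big].$$

The objective decomposes across contexts and actions. Taking the derivative with respect to $\hat{w}(x,a)$ and setting to zero yields the solution

$$\hat{w}_{\mathrm{o},\lambda}(x,a) = \frac{\lambda}{w^2(x,a) + \lambda}\,w(x,a),$$

where “o” above is a mnemonic for optimistic shrinkage. We refer to the DRs estimator with $\hat{w} = \hat{w}_{\mathrm{o},\lambda}$ as the doubly robust estimator with optimistic shrinkage (DRos) and denote it by $\hat{V}_{\mathrm{DRos}}$. Note that this estimator does not depend on $z$, although it was included in the optimization objective.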
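To make the optimistic shrinkage concrete, here is a minimal sketch of the weight mapping. It assumes the closed form `lam * w / (w**2 + lam)`, with `w` the importance weight and `lam` the shrinkage hyperparameter; the function name and the example values are hypothetical.

```python
import numpy as np


def optimistic_shrinkage(w, lam):
    """Assumed closed form for optimistic shrinkage: lam * w / (w**2 + lam)."""
    w = np.asarray(w, dtype=float)
    if lam == 0:
        # Limit lam -> 0: all weights are shrunk to zero.
        return np.zeros_like(w)
    return lam * w / (w ** 2 + lam)


w = np.array([0.5, 1.0, 10.0, 100.0])
for lam in (0.0, 1.0, 100.0, 1e12):
    # Small lam shrinks aggressively; very large lam leaves w almost unchanged.
    print(lam, optimistic_shrinkage(w, lam))
```

Under this form the shrunk weight never exceeds the original weight, and, unlike clipping, the map is not monotone in `w`: for a fixed `lam` it peaks at `w = sqrt(lam)` and then decays, so extremely large importance weights are suppressed most strongly.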
3.2 DR with Pessimistic Shrinkage
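Our second estimator family makes no assumptions on the quality of $\hat{\eta}$ beyond the range bound $\hat{\eta}(x,a) \in [0,1]$, which implies $|r - \hat{\eta}(x,a)| \le 1$ and yields the bias and second-moment bounds

$$|\mathrm{Bias}(\hat{w})| \le \mathbb{E}_{\mu}\big[|\hat{w}(x,a) - w(x,a)|\big], \qquad \mathbb{E}_{\mu}\big[\hat{w}^2(x,a)\,(r - \hat{\eta}(x,a))^2\big] \le \mathbb{E}_{\mu}\big[\hat{w}^2(x,a)\big]. \tag{7}$$

As before, we do not optimize the resulting MSE bound directly and instead solve for the Pareto front points with respect to the two expectations in (7), parameterized by $\lambda \in [0, \infty]$. The objective again decomposes across context-action pairs, yielding the solution

$$\hat{w}_{\mathrm{p},\lambda}(x,a) = \min\{\lambda,\, w(x,a)\},$$

which recovers (and justifies) existing weight-clipping approaches (Kang et al., 2007; Strehl et al., 2010; Su et al., 2018). We refer to the resulting estimator as $\hat{V}_{\mathrm{DRps}}$, for doubly robust with pessimistic shrinkage, since we have used the worst-case bounds in the derivation. See Appendix A for detailed calculations.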
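Pessimistic shrinkage amounts to weight clipping, which can be sketched as follows. The sketch assumes the clipping threshold is the hyperparameter itself (written `lam` here); the function name and example values are hypothetical.

```python
import numpy as np


def pessimistic_shrinkage(w, lam):
    """Weight clipping: cap every importance weight at the threshold lam."""
    return np.minimum(np.asarray(w, dtype=float), lam)


w = np.array([0.5, 1.0, 10.0, 100.0])
clipped = pessimistic_shrinkage(w, 5.0)
print(clipped)  # weights below the threshold pass through unchanged
```

Weights at or below the threshold are left untouched, so only the large weights that drive the variance contribute bias.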
3.3 Basic Properties
The two shrinkage estimators, for a suitable choice of , are never worse than DR or DM. This is an immediate consequence of their form and serves as a basic sanity check. Let denote either or . Then for any there exists such that
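Both estimators actually interpolate between the DM estimator and the DR estimator. As the shrinkage parameter $\lambda$ varies, we obtain DM for $\lambda = 0$ and DR as $\lambda \to \infty$. The optimal choice of $\lambda$ is therefore always competitive with both. While we do not know the optimal $\lambda$, in the next section, we derive a model selection procedure that finds a good $\lambda$.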
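The interpolation property can be checked numerically with a small sketch of the shrinkage estimator. Everything here is an illustrative assumption rather than the authors' implementation: the array layout (`pi_probs[i, a]`, `mu_probs[i, a]`, `eta_hat[i, a]`), the toy data, the function names, and the two closed forms used for the shrunk weights (clipping at `lam`, and `lam * w / (w**2 + lam)`).

```python
import numpy as np


def shrink(w, lam, kind):
    """Shrink importance weights w with parameter lam (assumed closed forms)."""
    if kind == "pessimistic":
        return np.minimum(w, lam)        # clipping at lam
    if lam == 0:
        return np.zeros_like(w)          # limit of the optimistic form at lam = 0
    if np.isinf(lam):
        return w.copy()                  # limit as lam grows without bound
    return lam * w / (w ** 2 + lam)      # optimistic shrinkage


def drs_estimate(pi_probs, mu_probs, eta_hat, actions, rewards, lam, kind):
    """DM term plus a shrunk-importance-weighted residual correction."""
    idx = np.arange(len(actions))
    w = pi_probs[idx, actions] / mu_probs[idx, actions]
    dm = np.mean(np.sum(pi_probs * eta_hat, axis=1))
    residual = rewards - eta_hat[idx, actions]
    return float(dm + np.mean(shrink(w, lam, kind) * residual))


# Toy logged data (assumed setup).
rng = np.random.default_rng(2)
n, k = 500, 4
mu_probs = rng.dirichlet(np.ones(k), size=n)
pi_probs = rng.dirichlet(np.ones(k), size=n)
actions = np.array([rng.choice(k, p=p) for p in mu_probs])
rewards = rng.integers(0, 2, size=n).astype(float)
eta_hat = rng.uniform(0.0, 1.0, size=(n, k))

for kind in ("optimistic", "pessimistic"):
    at_zero = drs_estimate(pi_probs, mu_probs, eta_hat, actions, rewards, 0.0, kind)
    at_inf = drs_estimate(pi_probs, mu_probs, eta_hat, actions, rewards, np.inf, kind)
    print(kind, at_zero, at_inf)  # the DM value and the DR value, respectively
```

Intermediate values of `lam` trade the bias of the DM endpoint against the variance of the DR endpoint, which is the tradeoff the model selection procedure is meant to resolve.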
4 Model Selection
All of our estimators can be written as finite sums of the form $\hat{V}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \zeta_i(\theta)$, where $\theta$ are some fixed hyperparameters and $\zeta_i(\theta)$ are i.i.d. random variables. For example, $\theta = (\hat{\eta}, \mathrm{o}, \lambda)$ denotes that we are using a reward predictor $\hat{\eta}$ and the optimistic shrinkage with the parameter $\lambda$. To choose hyperparameters, we first estimate the variance of $\hat{V}(\theta)$ by the sample variance

$$\widehat{\mathrm{Var}}(\theta) = \frac{1}{n^2} \sum_{i=1}^{n} \big(\zeta_i(\theta) - \hat{V}(\theta)\big)^2.$$

We also form a data-dependent upper bound on the bias, which we call $\mathrm{BiasUB}(\theta)$. The only requirement is that for all $\theta$, $|\mathrm{Bias}(\theta)| \le \mathrm{BiasUB}(\theta)$ (with high probability), and that $\mathrm{BiasUB}(\theta) = 0$ whenever $\mathrm{Bias}(\theta) = 0$; this holds for both bias bounds from the previous section, as they become zero when $\hat{w} = w$. Now, we simply choose $\hat{\theta}$ from some class of hyperparameters $\Theta$ to optimize the estimate of the MSE:

$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Theta} \; \mathrm{BiasUB}(\theta)^2 + \widehat{\mathrm{Var}}(\theta).$$

This model selection procedure is related to MAGIC (Thomas and Brunskill, 2016) as well as the model selection procedure for the SWITCH estimator (Wang et al., 2017). In comparison with MAGIC, we pick a single parameter value rather than aggregating several, and we use different bias and variance estimates. SWITCH uses the pessimistic bias bound (7), whereas we will also use two additional bounding strategies.

Let $\Theta$ be a finite set of hyperparameter values and let $\Theta_0 \subseteq \Theta$ denote the subset corresponding to unbiased procedures. Then the MSE of $\hat{V}(\hat{\theta})$ is at most the smallest MSE of $\hat{V}(\theta)$ over $\theta \in \Theta_0$, plus an additive term of lower order.

The theorem shows that the MSE of our final estimator after model selection is no worse than the best unbiased estimator in consideration. The additive term is asymptotically negligible compared with the MSE itself. In particular, since the doubly robust estimator is unbiased and a special case of our estimators with $\hat{w} = w$, we retain asymptotic optimality as long as we include it in the model selection. In fact, since we may perform model selection over many different choices for the reward predictor $\hat{\eta}$, our estimator is competitive with the doubly robust approach using the best reward predictor.
There are many natural ways to construct data-dependent upper bounds on the bias with the required properties. The three we use in our experiments involve using samples to approximate expectations in: (i) the expression for the bias given in (4), (ii) the optimistic bias bound in (5), and (iii) the pessimistic bias bound in (7). In our theory, these estimates need to be adjusted to obtain high-probability confidence bounds. In our experiments we evaluate both the basic estimates and an adjusted variant where we add twice the standard error.
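The selection rule described in this section can be sketched as a simple search over candidate hyperparameters. The sketch assumes that each candidate comes with its per-example terms (whose average is the estimate) and a precomputed bias upper bound, and that the score is the squared bias bound plus the sample variance of the terms divided by the number of examples; the exact normalization of the sample variance (here `ddof=1`), the function name, and the toy inputs are assumptions.

```python
import numpy as np


def select_hyperparameter(terms_by_theta, bias_ub_by_theta):
    """Return the candidate minimizing BiasUB**2 + estimated variance of the mean.

    terms_by_theta:   dict mapping a candidate to its per-example terms
    bias_ub_by_theta: dict mapping a candidate to a bias upper bound
    """
    best_theta, best_score = None, np.inf
    for theta, terms in terms_by_theta.items():
        terms = np.asarray(terms, dtype=float)
        # Variance of the sample mean, estimated from the sample variance.
        var_of_mean = np.var(terms, ddof=1) / len(terms)
        score = bias_ub_by_theta[theta] ** 2 + var_of_mean
        if score < best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


# Toy candidates: an unbiased but noisy one, and a biased but stable one.
terms_by_theta = {
    "unbiased_noisy": [0.0, 2.0, 0.0, 2.0],
    "biased_stable": [1.0, 1.0, 1.0, 1.0],
}
bias_ub_by_theta = {"unbiased_noisy": 0.0, "biased_stable": 0.5}

chosen, score = select_hyperparameter(terms_by_theta, bias_ub_by_theta)
print(chosen, score)
```

In this toy example the noisy candidate scores 1/3 and the stable one scores 0.25, so the rule accepts a bounded bias in exchange for lower variance; with more data the variance term shrinks and unbiased candidates become more attractive.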
5 Experiments
We evaluate our new estimators on the tasks of off-policy evaluation and off-policy learning, and compare their performance with previous estimators. Our secondary goal is to identify the configuration of the shrinkage estimator that is most robust for use in practice.
Datasets. Following prior work (Dudík et al., 2014; Swaminathan and Joachims, 2015b; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2018), we simulate bandit feedback on 9 UCI multiclass classification datasets. This lets us evaluate estimators in a broad range of conditions and gives us ground-truth policy values (see the dataset statistics table in the appendix). Each multiclass dataset with $k$ classes corresponds to a contextual bandit problem with $k$ possible actions coinciding with the classes. We consider either deterministic rewards, where on a multiclass example with label $y$, the action $a$ yields the reward $r = \mathbb{1}\{a = y\}$, or stochastic rewards, where the reward agrees with the deterministic reward with some fixed probability and differs from it otherwise. For every dataset, we hold out a portion of the examples to measure ground truth. On the remaining portion of the dataset, we use the logging policy $\mu$ to simulate $n$ bandit examples by sampling a context $x$ from the dataset, sampling an action $a \sim \mu(\cdot \mid x)$, and then observing a deterministic or stochastic reward $r$. The value of $n$ varies across experimental conditions. (Footnote 1: If $n$ equals the size of the remaining portion, we use each example exactly once, but there is still variation in the ordering of examples, actions taken, and rewards.)
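Table: Target and logging policies used in the off-policy evaluation experiments, with their softening parameters $\alpha$ and $\beta$.

policy     α      β
target     0.9    0
logging    0.7    0.2
logging    0.5    0.2
logging    -      0
logging    0.3    0.2
logging    0.5    0.2
logging    0.95   0.1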
Policies. We use the held-out data to obtain logging and target policies as follows. We first obtain two deterministic policies $\pi_{1,\det}$ and $\pi_{2,\det}$ by training two logistic models on the same data, but using either the first or second half of the features. We obtain stochastic policies parameterized by $(\alpha, \beta)$, following the softening technique of Farajtabar et al. (2018). Specifically, $\pi_{(\alpha,\beta)}(a \mid x) = \alpha + \beta u$ if $a = \pi_{\det}(x)$ and $\pi_{(\alpha,\beta)}(a \mid x) = (1 - \alpha - \beta u)/(k - 1)$ otherwise, where $u \sim \mathrm{Unif}[-0.5, 0.5]$. In off-policy evaluation experiments, we consider a fixed target and several choices of logging policy (see the table of policies). In off-policy learning we use a single logging policy.
Reward predictors. We obtain reward predictors $\hat{\eta}$ by training linear models via weighted least squares with regularization, using 2-fold cross-validation to tune the regularization parameter. We experiment with weights $z \in \{1, w, w^2\}$ and we also consider the special weight design of Farajtabar et al. (2018), which we call MRDR (see the experimental details in the appendix). In both evaluation and learning experiments, we use only part of the bandit data to train $\hat{\eta}$. In addition to the four trained reward predictors, we also consider $\hat{\eta} \equiv 0$.
Baselines. We include a number of estimators in our evaluation: the direct modeling approach (DM), doubly robust (DR) and its self-normalized variant (snDR), our approach (DRs), and the doubly robust version of the switch estimator of Wang et al. (2017), which also performs a form of weight clipping. (Footnote 2: For simplicity we call this estimator switch, although Wang et al. call it switch-DR.) Note that DR with $\hat{\eta} \equiv 0$ is identical to inverse propensity scoring (IPS); we refer to its self-normalized variant as snIPS. Our estimator and switch have hyperparameters, which are selected by their respective model selection procedures (see the experimental details in the appendix for the hyperparameter grid).
5.1 Offpolicy Evaluation
We begin by evaluating different configurations of DRs via an ablation analysis. Then we compare DRs with baseline estimators. We have a total of 108 experimental conditions: for each of the 9 datasets we use 6 logging policies and consider stochastic or deterministic rewards. Except for the learning curves below, we always take $n$ to be all available bandit data.
We measure performance with a clipped MSE computed from the estimator $\hat{V}$ and the ground-truth value $V(\pi)$ (computed on the held-out portion of the data). We use 500 replicates of bandit-data generation to estimate the MSE; statistical comparisons are based on paired tests at a fixed significance level. In some of our ablation experiments, we pick the best hyperparameters against the test set on a per-replicate basis, which we call oracle tuning and always call out explicitly.
Ablation analysis.
The ablation table displays the results of two ablation studies, one evaluating different reward predictors and the other evaluating the optimistic and pessimistic shrinkage types.

On the left, for each fixed estimator type (e.g., DR) we compare different reward predictors by reporting the number of conditions where it is statistically indistinguishable from the best and the number of conditions where it statistically dominates all other estimators using that predictor. For DRs we use oracle tuning for the shrinkage type and coefficient. The table shows that weight shrinkage strongly influences the choice of regressor: the predictors that are top choices for DR are not the best single choice once shrinkage is included in DRs. (Footnote 3: The oracle can always select the shrinkage parameter in DRs to recover DM or DR, but, according to the table, the oracle choices for the predictors favored by DR lead to inferior performance compared with the best single choice for DRs.) In our comparison experiments below, we therefore restrict our attention to the best single choice for DRs and additionally consider $\hat{\eta} \equiv 0$, because it allows including IPS as a special case of DRs. Somewhat surprisingly, MRDR is in our experiments dominated by other reward predictors (with one exception), and this remains true even with a deterministic target policy (see the deterministic-target ablation table in the appendix).

On the right of the ablation table, we compare optimistic and pessimistic shrinkage when paired with a fixed reward predictor (using oracle tuning for the shrinkage coefficient). We report the number of times that one estimator statistically dominates the other. The results suggest that both shrinkage types are important for robust performance across conditions, so we consider both choices going forward.
Comparisons.
In the CDF figure, we compare our new estimator with the baselines. We visualize the results by plotting the cumulative distribution function (CDF) of the normalized MSE (normalized by the MSE of snIPS) across conditions for each method. Better performance corresponds to CDF curves toward the top-left corner, meaning the method achieves a lower MSE more frequently. The left plot summarizes the 54 conditions where the reward is deterministic, while the right plot considers the 54 stochastic reward conditions. For DRs we consider two model selection procedures outlined in Section 4 that differ in their choice of BiasUB. DRs-direct estimates the expectations in the expressions in Eqs. (4), (5), and (7) (corresponding to the bias and bias bounds) by empirical averages and takes their pointwise minimum. DRs-upper adds to these estimates twice their standard error before taking the minimum, more closely matching our theory. For DRs, we only use the zero reward predictor and the single trained predictor selected in the ablation analysis, and we always select between both shrinkage types. Since switch also comes with a model selection procedure, we use it to select between the same two reward predictors as DRs.

In the deterministic case (left plot) we see that DRs-upper has the best aggregate performance, by a large margin. DRs-direct also has better aggregate performance than the baselines on most of the conditions. In the stochastic case, DRs-direct has similarly strong performance, but DRs-upper degrades considerably, suggesting this model selection scheme is less robust to stochastic rewards. We illustrate this phenomenon in the learning-curve figure, plotting the MSE as a function of the number of samples for one choice of a logging policy and dataset, with deterministic rewards on the left and stochastic on the right. Because of its more robust performance, we advocate for DRs-direct as our final method.
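The CDF summary used in this comparison can be computed as in the following sketch, which assumes one MSE value per experimental condition for the method of interest and for the reference method used for normalization. The function name, the grid, and the numbers are hypothetical and purely illustrative, not results from the paper.

```python
import numpy as np


def normalized_mse_cdf(mse_method, mse_reference, grid):
    """Fraction of conditions whose normalized MSE is at most each grid value."""
    ratios = np.asarray(mse_method, dtype=float) / np.asarray(mse_reference, dtype=float)
    return np.array([np.mean(ratios <= t) for t in grid])


# Made-up per-condition MSE values, for illustration only.
mse_method = [1.0, 2.0, 3.0, 4.0]
mse_reference = [2.0, 2.0, 2.0, 2.0]
grid = [0.5, 1.0, 2.0]

cdf_values = normalized_mse_cdf(mse_method, mse_reference, grid)
print(cdf_values)
```

A curve that rises earlier (further toward the top-left) indicates that the method reaches a given normalized MSE on a larger share of conditions.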
5.2 Offpolicy Learning
Following prior work (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b; Su et al., 2018), we learn a stochastic linear policy defined through a featurization of context-action pairs. We solve regularized empirical risk minimization via gradient descent, where the objective combines a policy-value estimator with a regularization term controlled by a hyperparameter. For these experiments, we partition the data into four quarters: one full-information segment for training the logging policy and as a test set, and three bandit segments for (1) training reward predictors, (2) learning the policy, and (3) hyperparameter tuning and model selection. The logging policy is fixed and, since there is no fixed target policy, we consider three reward predictors: the zero predictor $\hat{\eta} \equiv 0$ and two trained predictors.
In the off-policy learning figure we display the performance of four methods (DM, DR, IPS, and DRs-direct) on four of the UCI datasets. For each method, we compute the average value of the learned policy on the test set (averaged over 10 replicates) and we report this value normalized by that for IPS. For DM and DR, we select the hyperparameter and reward predictor optimally in hindsight, while for DRs we use our model selection method. Note that we do not compare with switch here as it is not amenable to gradient-based optimization (Su et al., 2018). Except for the optdigits dataset, where all the methods are comparable, we find that off-policy learning using DRs-direct always outperforms the baselines.
Acknowledgments
We thank Alekh Agarwal for valuable input in the early discussions about this project. Part of this work was completed while Yi Su was visiting Microsoft Research.
References
 Athey and Wager (2017) Susan Athey and Stefan Wager. Efficient policy learning. arXiv:1702.02896, 2017.
 Bang and Robins (2005) Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 2005.
 Bembom and van der Laan (2008) Oliver Bembom and Mark J van der Laan. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators. Technical report, UC Berkeley, 2008.
 Bottou et al. (2013) Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 2013.
 De la Pena and Giné (2012) Victor De la Pena and Evarist Giné. Decoupling: from dependence to independence. Springer Science & Business Media, 2012.
 Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning, 2011.
 Dudík et al. (2014) Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
 Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, 2018.
 Hahn (1998) Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 1998.
 Hirano et al. (2003) Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 2003.
 Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 1952.
 Imbens et al. (2007) Guido Imbens, Whitney Newey, and Geert Ridder. Mean-squared-error calculations for average treatment effects. ssrn.954748, 2007.
 Kallus (2017) Nathan Kallus. A Framework for Optimal Matching for Causal Inference. In International Conference on Artificial Intelligence and Statistics, 2017.
 Kallus (2018) Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, 2018.
 Kang et al. (2007) Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 2007.
 Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.
 Li et al. (2011) Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In International Conference on Web Search and Data Mining, 2011.
 Li et al. (2015) Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In International Conference on World Wide Web, 2015.
 Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 1995.
 Rothe (2016) Christoph Rothe. The value of knowing the propensity score for estimating average treatment effects. IZA Discussion Paper Series, 2016.
 Strehl et al. (2010) Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, 2010.
 Su et al. (2018) Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. Cab: Continuous adaptive blending estimator for policy evaluation and learning. In International Conference on Machine Learning, 2018.
 Swaminathan and Joachims (2015a) Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, 2015a.
 Swaminathan and Joachims (2015b) Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, 2015b.
 Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
 Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, 2017.
 Zhou et al. (2017) Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 2017.
Appendix A Derivation of Shrinkage Estimators
In this section we provide detailed derivations for the two estimators.
We first derive the pessimistic version. Recall that the optimization problem decouples across , so we focus on a single pair, and we omit explicit dependence on these. Fixing , we must solve
The optimality conditions are that
The first case for cannot occur, since setting would make (according to the first equation), but we know that . If then we must have . And so, we get
which is the clipped estimator.
For the optimistic version, the optimization problem is
The optimality conditions are
This gives the optimistic estimator
Notice that this estimator does not depend on the weighting function $z$, so it does not depend on how we train the regression model.
Appendix B Proofs
Proof of prop:variance.
The law of total variance gives
For the first term, since does not depend on , it does not contribute to the conditional variance, and we get
For we have
Now, using our assumption that , we get
and hence for all ,
Here in the last step we are using boundedness of the regression function , and the estimated regression function .
Thus, for the second term is at most , and for , we have
using the above calculation. Therefore, the residual terms add up to at most . ∎
Proof of prop:sanity.
We analyze the optimistic version. If , then we simply take at which point the objective is clearly minimized by . Therefore we recover . On the other hand, if , then we set so that the minimizer is . This recovers the doubly robust estimator. ∎
Proof of thm:model_selection.
The main technical part of the proof is a deviation inequality for the sample variance. For this, let us fix , which we drop from notation, and focus on estimating the variance
We have the following lemma.
Lemma (Variance estimation). Let be iid random variables, and assume that almost surely. Then there exists a constant such that for any , with probability at least
Proof.
For this lemma only, define . By direct calculation
We work with the second term first. Let be an iid sample, independent of . Now, by Theorem 3.4.1 of De la Pena and Giné [2012], we have
for a universal constant . Thus, we have decoupled the U-statistic. Now let us condition on and write , which conditional on is nonrandom. We will apply Bernstein’s inequality on , which is a centered random variable, conditional on . This gives that with probability at least
This bound holds with high probability for any . In particular, since almost surely, we get that with probability
The factors of arise from working through the decoupling inequality.
Let us now address the first term. A simple application of Bernstein’s inequality gives that with probability at least
Combining the two inequalities, we obtain the result. ∎
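The two-term split used in this proof can be checked numerically. The sketch below assumes that the "direct calculation" refers to the standard identity expressing the unbiased sample variance as an average of squares minus a U-statistic over pairs i ≠ j. The function name `variance_via_u_statistic` and the simulated data are hypothetical.

```python
import random
import statistics


def variance_via_u_statistic(xs):
    """Unbiased sample variance written as (1/n) sum x_i^2 minus a pairwise U-statistic."""
    n = len(xs)
    sum_sq = sum(x * x for x in xs)
    total = sum(xs)
    # First term: average of squares (a plain iid average).
    first = sum_sq / n
    # Second term: average of x_i * x_j over ordered pairs i != j,
    # using sum_{i != j} x_i x_j = (sum x)^2 - sum x^2.
    second = (total * total - sum_sq) / (n * (n - 1))
    return first - second


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
print(variance_via_u_statistic(xs), statistics.variance(xs))
```

The first term is an iid average and can be handled by Bernstein's inequality directly, while the second term couples pairs of samples, which is why a decoupling argument is needed there.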
Since we are estimating the variance of the sample average estimator, we divide by another factor of . Thus the error terms in Lemma B are and respectively, which are both . We will simply write these error terms as from now on.
For the model selection result, first apply Lemma B for all , taking a union bound (this only requires that ). Further, take a union bound over the event that for all , if it is needed. Then, observe that for any we have
The first inequality uses Lemma B and the fact that . The second uses that optimizes this quantity, and the third uses the property that by assumption. ∎
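The selection step analyzed above can be sketched as follows, assuming that the quantity being optimized is the squared bias upper bound plus the estimated variance of each candidate. This reading, the function name `select_hyperparameter`, and all numbers below are assumptions made for illustration only.

```python
def select_hyperparameter(candidates, bias_ub, var_hat):
    """Return the candidate minimizing bias_ub[c]**2 + var_hat[c] (assumed criterion)."""
    return min(candidates, key=lambda c: bias_ub[c] ** 2 + var_hat[c])


# Hypothetical values for three shrinkage levels. The last candidate has a
# zero bias upper bound, mimicking an unshrunk estimator.
candidates = [1.0, 10.0, 100.0]
bias_ub = {1.0: 0.30, 10.0: 0.05, 100.0: 0.0}
var_hat = {1.0: 0.001, 10.0: 0.004, 100.0: 0.050}

best = select_hyperparameter(candidates, bias_ub, var_hat)
print(best)
```

In this toy example the intermediate candidate wins, because its small bias bound is more than offset by the lower variance estimate relative to the candidate with zero bias bound.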
B.1 Construction of bias upper bounds.
In this section we give a detailed construction of the bias upper bounds that we use in the model selection procedure. Recall that this is for the analysis only: empirically, we found that using the estimators alone, rather than the upper bounds, leads to better performance.
Throughout, we fix a set of hyperparameters , which we suppress from the notation.
Direct bias estimation.
The most straightforward bias estimator is to simply approximate the expectation with a sample average.
This estimator has finite-sum structure, and naively, the range of each term is . The variance is at most . Hence Bernstein’s inequality gives that with probability at least
Inflating the estimate by the right-hand side gives BiasUB, which is a high probability upper bound on Bias.
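A minimal sketch of this inflation step is given below. It uses a standard two-sided form of Bernstein's inequality, so the constants may differ from those in the bound above. The names `bernstein_width` and `bias_upper_bound`, and the use of the absolute value of the estimate, are assumptions for this example.

```python
import math


def bernstein_width(variance, value_range, n, delta):
    """Deviation of a sample mean of n iid terms, holding with probability >= 1 - delta.

    Standard Bernstein form: sqrt(2 * var * log(2/delta) / n)
    + 2 * range * log(2/delta) / (3 * n), where value_range bounds |X - E[X]|.
    """
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 2.0 * value_range * log_term / (3.0 * n)


def bias_upper_bound(bias_estimate, variance, value_range, n, delta):
    """Inflate the sample bias estimate by the Bernstein deviation term."""
    return abs(bias_estimate) + bernstein_width(variance, value_range, n, delta)


# Hypothetical inputs: estimated bias 0.02, per-term variance bound 4.0,
# per-term range 10.0, 10,000 logged samples, failure probability 0.05.
print(bias_upper_bound(0.02, 4.0, 10.0, 10_000, 0.05))
```

The variance-dependent term decays at rate 1/√n and the range-dependent term at rate 1/n, so a large range matters mainly at small sample sizes.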
Pessimistic estimation.
The bias bound used in the pessimistic estimator and its natural sample estimator are
Note that since we have already eliminated the dependence on the reward, we can analytically evaluate the expectation over actions, which will lead to lower variance in the estimate.
Again we perform a fairly naive analysis. Since and using the fact that the range of the random variable is simply . Therefore, Hoeffding’s inequality gives that with probability
and we use the right-hand side for our high probability upper bound.
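The following sketch illustrates the analytic expectation over actions together with a Hoeffding-style inflation. It assumes that the pessimistic bias bound is the expected absolute gap between the clipped and the original importance weights under the logging policy, that the weight is the ratio of target to logging probabilities, and that clipping has the form min(λ, w). These assumptions, the function names, and the toy data are all hypothetical.

```python
import math


def pessimistic_bias_term(mu_probs, pi_probs, lam):
    """Exact expectation over actions of (w - min(lam, w)) under the logging policy.

    Assumes w = pi / mu for each action of a single context.
    """
    total = 0.0
    for mu, pi in zip(mu_probs, pi_probs):
        if mu > 0.0:
            w = pi / mu
            total += mu * (w - min(lam, w))
    return total


def hoeffding_width(value_range, n, delta):
    """Two-sided Hoeffding deviation for a mean of n iid terms with the given range."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


# Hypothetical logged contexts: (logging probabilities, target probabilities).
contexts = [
    ([0.5, 0.5], [0.9, 0.1]),
    ([0.25, 0.75], [1.0, 0.0]),
]
lam = 1.5
terms = [pessimistic_bias_term(mu, pi, lam) for mu, pi in contexts]
estimate = sum(terms) / len(terms)
upper = estimate + hoeffding_width(1.0, len(terms), 0.05)
print(terms, estimate, upper)
```

Because each per-context term is computed by summing over all actions rather than using only the logged action, the only remaining randomness is in the contexts, which is what reduces the variance of the estimate.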
Optimistic estimation.
For the optimistic bound, we must estimate two terms, one involving the regressor and one involving the importance weights. We use