Doubly robust off-policy evaluation with shrinkage

We design a new family of estimators for off-policy evaluation in contextual bandits. Our estimators are based on the asymptotically optimal approach of doubly robust estimation, but they shrink importance weights to obtain a better bias-variance tradeoff in finite samples. Our approach adapts importance weights to the quality of a reward predictor, interpolating between doubly robust estimation and direct modeling. When the reward predictor is poor, we recover previously studied weight clipping, but when the reward predictor is good, we obtain a new form of shrinkage. To navigate between these regimes and tune the shrinkage coefficient, we design a model selection procedure, which we prove is never worse than the doubly robust estimator. Extensive experiments on bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.



1 Introduction

Many real-world applications, ranging from online news recommendation (Li et al., 2011), advertising (Bottou et al., 2013), and search engines (Li et al., 2015) to personalized healthcare (Zhou et al., 2017), are naturally modeled by the contextual bandit protocol (Langford and Zhang, 2008), where a learner repeatedly observes a context, takes an action, and accrues reward. In news recommendation, the context is any information about the user, such as history of past visits, the action is the recommended article, and the reward could indicate, for instance, the user’s click on the article. The goal is to maximize the reward, but the learner can only observe the reward for the chosen actions, and not for other actions.

In this paper we study a fundamental problem in contextual bandits known as off-policy evaluation, where the goal is to use the data gathered by a past algorithm, known as the logging policy, to estimate the average reward of a new algorithm, known as the target policy. High-quality off-policy estimates help avoid costly A/B testing and can also be used as subroutines for optimizing a policy (Dudík et al., 2011; Athey and Wager, 2017).

The key challenge in off-policy evaluation is distribution mismatch: the decisions made by the target policy differ from those made by the logging policy that collected the data. Three standard approaches tackle this problem. The first approach, known as inverse propensity scoring (IPS) (Horvitz and Thompson, 1952), corrects for this mismatch by reweighting the data. The resulting estimate is unbiased, but may exhibit intolerably high variance when the importance weights (also known as inverse propensity scores) are large. The second approach, direct modeling (DM) or imputation, sidesteps the problem of large importance weights and directly fits a regression model to predict rewards. Non-parametric variants of direct modeling are asymptotically optimal (Imbens et al., 2007), but in finite samples, direct modeling often suffers from a large bias (Dudík et al., 2011). The third approach, called the doubly robust (DR) estimator (Robins and Rotnitzky, 1995; Bang and Robins, 2005; Dudík et al., 2011), combines IPS and DM: it first estimates the reward by DM, and then estimates a correction term by IPS. The approach is unbiased, its variance is smaller than that of IPS, and it is asymptotically optimal under weaker assumptions than DM (Rothe, 2016). However, since DR uses the same importance weights as IPS, its variance can still be quite high, unless the reward predictor is highly accurate. Therefore, several works have developed variants of DR that clip or remove large importance weights. Weight clipping incurs a small bias, but substantially decreases the variance, yielding a lower mean squared error than standard DR (Bembom and van der Laan, 2008; Bottou et al., 2013; Wang et al., 2017; Su et al., 2018).

In this paper, we continue this line of work by developing a systematic approach for designing estimators with favorable finite-sample performance. Our approach involves shrinking the importance weights to directly optimize a sharp bound on the mean squared error (MSE). We use two styles of upper bounds to obtain two classes of estimators. The first is based on an upper bound that is agnostic to the quality of the reward estimator and yields the previously studied weight clipping (Kang et al., 2007; Strehl et al., 2010; Su et al., 2018), which can be interpreted as pessimistic shrinkage. The second is based on an upper bound that incorporates the quality of the reward predictor, yielding a new estimator, which we call DR with optimistic shrinkage. Both classes of estimators involve a hyperparameter, and specific choices produce the unbiased doubly robust estimator and the low-variance direct-modeling estimator; the optimal hyperparameter improves on both of these.

To tune the hyperparameter and navigate between the two estimator classes, we design a simple model-selection procedure. Model selection is crucial to our approach, but is important even for classical estimators, as their performance is highly dataset-dependent. In contrast with supervised learning, where cross-validation is a simple and effective strategy, distribution mismatch makes model selection quite challenging in contextual bandits. Our model selection approach again involves optimizing a bound on the MSE. Combined with our shrinkage estimators, we prove that our final estimator is never worse than DR. Thus, our estimators retain the asymptotic optimality of DR, but with improved finite-sample performance.

We evaluate our approach on benchmark datasets and compare its performance with a number of existing estimators across a comprehensive range of conditions. While our focus is on tuning importance weights, we also vary how we train a reward predictor, including the recently proposed more robust doubly robust (MRDR) approach (Farajtabar et al., 2018), and apply our model selection to pick the best predictor. Our experiments show that our approach typically outperforms state-of-the-art methods in both off-policy evaluation and off-policy learning settings. We also find that the choice of the reward predictor changes, depending on whether weight shrinkage is used or not. Via extensive ablation studies, we identify a robust configuration of our shrinkage approach that we recommend as a practical choice.

Other related work. Off-policy estimation is studied in observational settings under the name average treatment effect (ATE) estimation, with many results on asymptotically optimal estimators (Hahn, 1998; Hirano et al., 2003; Imbens et al., 2007; Rothe, 2016), but only few that optimize MSE in finite samples. Most notably, Kallus (2017, 2018) adjusts importance weights by optimizing MSE under smoothness (or parametric) assumptions on the reward function. This can be highly effective when the assumptions hold, but the assumptions are difficult to verify in practice. In contrast, we optimize importance weights without making any modeling assumptions other than boundedness of rewards.

2 Setup

We consider the contextual bandits protocol, where a decision maker interacts with the environment by repeatedly observing a context $x \in \mathcal{X}$, choosing an action $a \in \mathcal{A}$, and observing a reward $r \in [0,1]$. The context space $\mathcal{X}$ can be uncountably large, but we assume that the action space $\mathcal{A}$ is finite. In the news recommendation example, $x$ describes the history of past visits of a given user, $a$ is a recommended article, and $r$ equals one if the user clicks on the article and zero otherwise. We assume that contexts are sampled i.i.d. from some distribution $D(x)$ and rewards are sampled from some conditional distribution $D(r \mid x, a)$. We write $\eta(x,a) := \mathbb{E}[r \mid x, a]$ for the expected reward, conditioned on a given context and action, and refer to $\eta$ as the regression function.

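The behavior of a decision maker is formalized as a conditional distribution $\pi(a \mid x)$ over actions given contexts, referred to as a policy. We also write $\pi(x,a,r) := D(x)\,\pi(a \mid x)\,D(r \mid x,a)$ for the joint distribution over context-action-reward triples when actions are selected by the policy $\pi$, and $\mathbb{E}_\pi$ for the corresponding expectation. The expected reward of a policy $\pi$, called the value of $\pi$, is denoted as $V(\pi) := \mathbb{E}_\pi[r]$.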

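In the off-policy evaluation problem, we are given a dataset $\{(x_i, a_i, r_i)\}_{i=1}^n$ consisting of context-action-reward triples collected by some logging policy $\mu$, and we would like to estimate the value of a target policy $\pi$. The quality of an estimator $\hat V$ is measured by the mean squared error

$$\mathrm{MSE}(\hat V) := \mathbb{E}\big[(\hat V - V(\pi))^2\big],$$

where the expectation is with respect to the data generation process. In analyzing the error of an estimator, we rely on the decomposition of MSE into the bias and variance terms:

$$\mathrm{MSE}(\hat V) = \mathrm{Bias}(\hat V)^2 + \mathrm{Var}(\hat V), \qquad \mathrm{Bias}(\hat V) := \mathbb{E}[\hat V] - V(\pi), \qquad \mathrm{Var}(\hat V) := \mathbb{E}\big[(\hat V - \mathbb{E}[\hat V])^2\big].$$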

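We consider three standard approaches for off-policy evaluation. The first two are direct modeling (DM) and inverse propensity scoring (IPS). In the former, we train a reward predictor $\hat\eta$ and use it to impute rewards. In the latter, we simply reweight the data. The two estimators are:

$$\hat V_{\mathrm{DM}}(\pi; \hat\eta) := \frac{1}{n}\sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a \mid x_i)\,\hat\eta(x_i, a), \qquad \hat V_{\mathrm{IPS}}(\pi) := \frac{1}{n}\sum_{i=1}^n \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i.$$

For concise notation, let $w(x,a) := \pi(a \mid x)/\mu(a \mid x)$ denote the importance weight. We make a standard assumption that $\pi$ is absolutely continuous with respect to $\mu$, meaning that $\mu(a \mid x) > 0$ whenever $\pi(a \mid x) > 0$. This condition ensures that the importance weights are well defined and that $\hat V_{\mathrm{IPS}}$ is an unbiased estimator for $V(\pi)$. However, if there is substantial mismatch between $\pi$ and $\mu$, then the importance weights will be large and $\hat V_{\mathrm{IPS}}$ will have large variance. On the other hand, given any fixed reward predictor $\hat\eta$ (fit on a separate dataset), $\hat V_{\mathrm{DM}}$ has low variance, independent of the distribution mismatch, but it is typically biased due to approximation errors in fitting $\hat\eta$.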
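To make the two estimators concrete, here is a minimal Python sketch on a toy log. It assumes the usual forms: DM averages $\sum_a \pi(a \mid x_i)\,\hat\eta(x_i,a)$ over logged contexts, and IPS averages the logged rewards reweighted by the ratio of target to logging probabilities. The function names, the `(x, a, r)` tuple layout, and the toy policies are illustrative assumptions and are not taken from the experiments.

```python
# Sketch only: names and data layout are hypothetical, chosen for illustration.

def dm_estimate(data, actions, pi, eta_hat):
    """Direct modeling: impute rewards with eta_hat under the target policy pi."""
    total = 0.0
    for x, _, _ in data:
        total += sum(pi(a, x) * eta_hat(x, a) for a in actions)
    return total / len(data)


def ips_estimate(data, pi, mu):
    """Inverse propensity scoring: reweight logged rewards by w = pi / mu."""
    return sum(pi(a, x) / mu(a, x) * r for x, a, r in data) / len(data)


# Toy log: a single context (0), two actions, uniform logging policy.
actions = [0, 1]
mu = lambda a, x: 0.5                      # logging policy
pi = lambda a, x: 1.0 if a == 1 else 0.0   # target policy always plays action 1
eta_hat = lambda x, a: 0.5                 # a deliberately crude reward predictor
data = [(0, 0, 0.0), (0, 1, 1.0), (0, 1, 1.0), (0, 0, 0.0)]

v_dm = dm_estimate(data, actions, pi, eta_hat)   # 0.5
v_ips = ips_estimate(data, pi, mu)               # 1.0
```

In this toy log the crude predictor pulls DM to 0.5, while IPS uses only the two logged plays of action 1, each with weight 2.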

The third approach, called the doubly robust (DR) estimator (Dudík et al., 2014), combines DM and IPS:

$$\hat V_{\mathrm{DR}}(\pi; \hat\eta) := \hat V_{\mathrm{DM}}(\pi; \hat\eta) + \frac{1}{n}\sum_{i=1}^n w(x_i, a_i)\,\big(r_i - \hat\eta(x_i, a_i)\big).$$

The DR estimator applies IPS to a shifted version of the rewards, using $\hat\eta$ as a control variate to decrease the variance of IPS. DR preserves unbiasedness of IPS and achieves asymptotic optimality, as long as it is possible to derive sufficiently good reward predictors given enough data (Rothe, 2016).
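The following Python sketch illustrates the DR construction under the standard assumption that the estimator is the DM baseline plus an importance-weighted average of the residuals $r_i - \hat\eta(x_i,a_i)$. All names and the toy data are hypothetical.

```python
# Sketch only: a toy doubly robust estimator; names are illustrative assumptions.

def dr_estimate(data, actions, pi, mu, eta_hat):
    """DM baseline plus importance-weighted residual correction, averaged over the log."""
    total = 0.0
    for x, a, r in data:
        baseline = sum(pi(b, x) * eta_hat(x, b) for b in actions)
        w = pi(a, x) / mu(a, x)            # importance weight of the logged action
        total += baseline + w * (r - eta_hat(x, a))
    return total / len(data)


actions = [0, 1]
mu = lambda a, x: 0.5
pi = lambda a, x: 1.0 if a == 1 else 0.0
data = [(0, 0, 0.0), (0, 1, 1.0), (0, 1, 1.0), (0, 1, 0.0)]

v_dr = dr_estimate(data, actions, pi, mu, lambda x, a: 0.5)    # 0.75
v_ips = dr_estimate(data, actions, pi, mu, lambda x, a: 0.0)   # 1.0, zero predictor gives IPS
```

With the zero predictor the baseline vanishes and the correction term is exactly the IPS average, which is why the second call reproduces IPS.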

When $\hat\eta \equiv 0$, DR recovers IPS and is plagued by the same large variance. However, even when the reward predictor is perfect, any intrinsic stochasticity in the rewards may cause the terms $r_i - \hat\eta(x_i, a_i)$, appearing in the DR estimator, to be far from zero. Multiplied by large importance weights $w(x_i, a_i)$, these terms yield large variance for DR in comparison with DM. Several approaches seek a more favorable bias-variance trade-off by clipping, removing, or re-scaling the importance weights (Wang et al., 2017; Su et al., 2018; Kallus, 2017, 2018). Our work also seeks to systematically replace the weights $w(x_i, a_i)$ with new weights $\hat w(x_i, a_i)$ to bring the variance of DR closer to that of DM.

In practice, $\hat\eta$ is biased due to approximation errors, so in this paper we make no assumptions about its quality. At the same time, we would like to make sure that our estimators can adapt to a high-quality $\hat\eta$ if it is available. To motivate our adaptive estimator, we assume that $\hat\eta$ is trained via weighted least squares regression on a dataset separate from the one used in $\hat V_{\mathrm{DR}}$. That is, for a dataset $\{(x_j, a_j, r_j)\}_{j=1}^m$, we consider a weighting function $z(x,a)$ and solve

$$\hat\eta := \arg\min_{f \in \mathcal{F}} \frac{1}{m}\sum_{j=1}^m z(x_j, a_j)\,\big(f(x_j, a_j) - r_j\big)^2,$$

where $\mathcal{F}$ is some function class of reward predictors. Natural choices of the weighting function $z$, explored in our experiments, include $z(x,a) = 1$, $z(x,a) = w(x,a)$, and $z(x,a) = w^2(x,a)$. We stress that the assumption on how we fit $\hat\eta$ only serves to guide our derivations, but we make no specific assumptions about its quality. In particular, we do not assume that $\mathcal{F}$ contains a good approximation of $\eta$.
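As a small illustration of weighted least squares with a weighting function, the sketch below fits the simplest possible function class: one constant per action. For that class the weighted least squares solution is the weighted mean of the rewards observed for each action. The function class, the names, and the particular weighting function are assumptions made for illustration; the experiments use linear models instead.

```python
# Sketch only: weighted least squares over per-action constants (a hypothetical
# function class, used here because its solution has a closed form).

def fit_weighted_constants(train, actions, z):
    """Return eta_hat(x, a) = c_a, where c_a minimizes sum_j z_j * (c_a - r_j)^2."""
    num = {a: 0.0 for a in actions}
    den = {a: 0.0 for a in actions}
    for x, a, r in train:
        num[a] += z(x, a) * r
        den[a] += z(x, a)
    # Actions with zero total weight (never observed) default to a prediction of 0.
    consts = {a: (num[a] / den[a] if den[a] > 0 else 0.0) for a in actions}
    return lambda x, a: consts[a]


train = [(0, 0, 0.0), (1, 0, 1.0), (0, 1, 1.0)]
z = lambda x, a: 1.0 + 2.0 * x          # hypothetical weighting function
eta_hat = fit_weighted_constants(train, [0, 1, 2], z)
```

Here action 0 has rewards 0 and 1 with weights 1 and 3, so its fitted constant is 0.75 rather than the unweighted mean 0.5.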

3 Our Approach: DR with Shrinkage

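Our approach replaces the importance weight mapping $w$ in the DR estimator with a new weight mapping $\hat w$ found by directly optimizing sharp bounds on the MSE. The resulting estimator, which we call the doubly robust estimator with shrinkage (DRs), thus depends on both the reward predictor $\hat\eta$ and the weight mapping $\hat w$:

$$\hat V_{\mathrm{DRs}}(\pi; \hat\eta, \hat w) := \hat V_{\mathrm{DM}}(\pi; \hat\eta) + \frac{1}{n}\sum_{i=1}^n \hat w(x_i, a_i)\,\big(r_i - \hat\eta(x_i, a_i)\big).$$

We assume that $0 \le \hat w(x,a) \le w(x,a)$, justifying the terminology “shrinkage”. For a fixed choice of $\pi$ and $\hat\eta$, we will seek the mapping $\hat w$ that minimizes the MSE of $\hat V_{\mathrm{DRs}}(\pi; \hat\eta, \hat w)$, which we simply denote as $\mathrm{MSE}(\hat w)$. We similarly write $\mathrm{Bias}(\hat w)$ and $\mathrm{Var}(\hat w)$ for the bias and variance of this estimator.

We treat $\hat w$ as the optimization variable and consider two upper bounds on MSE: an optimistic one and a pessimistic one. In both cases, we separately bound $\mathrm{Bias}(\hat w)$ and $\mathrm{Var}(\hat w)$. To bound the bias, we use the following expression, derived from the fact that $\hat V_{\mathrm{DRs}}$ is unbiased when $\hat w = w$:

$$\mathrm{Bias}(\hat w) = \mathbb{E}_\mu\big[(\hat w(x,a) - w(x,a))\,(r - \hat\eta(x,a))\big]. \tag{4}$$

To bound the variance, we rely on the following proposition, which states that it suffices to focus on the second moment of the terms $\hat w(x_i, a_i)\,(r_i - \hat\eta(x_i, a_i))$.

Proposition 1. If $0 \le \hat w(x,a) \le w(x,a)$ and $\hat\eta(x,a) \in [0,1]$, then

$$\Big|\mathrm{Var}(\hat w) - \frac{1}{n}\,\mathbb{E}_\mu\big[\hat w^2(x,a)\,(r - \hat\eta(x,a))^2\big]\Big| \le \frac{1}{n}.$$

The proofs of Proposition 1 and other mathematical statements from this paper are in the appendix.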

We derive estimators for two different regimes depending on the quality of the reward predictor $\hat\eta$. Since we do not know the quality of $\hat\eta$ a priori, in the next section we obtain a model selection procedure to select between these two estimators.

3.1 DR with Optimistic Shrinkage

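Our first family of estimators is based on an optimistic MSE bound, which adapts to the quality of $\hat\eta$, and which we expect to be tighter when $\hat\eta$ is more accurate. Recall that $\hat\eta$ is trained to minimize weighted square loss with respect to some weighting function $z$, which we denote as

$$L(\hat\eta) := \mathbb{E}_\mu\big[z(x,a)\,(r - \hat\eta(x,a))^2\big].$$

We next bound the bias and variance in terms of the weighted square loss $L(\hat\eta)$.

To bound the bias we apply the Cauchy-Schwarz inequality to (4):

$$|\mathrm{Bias}(\hat w)| \le \sqrt{\mathbb{E}_\mu\Big[\tfrac{1}{z(x,a)}\,(\hat w(x,a) - w(x,a))^2\Big]}\;\sqrt{L(\hat\eta)}. \tag{5}$$

To bound the variance, we invoke Proposition 1 and focus on bounding the quantity $\mathbb{E}_\mu[\hat w^2(x,a)\,(r - \hat\eta(x,a))^2]$. Using the Cauchy-Schwarz inequality, the fact that $|r - \hat\eta(x,a)| \le 1$, and $\hat w(x,a) \le w(x,a)$, we obtain

$$\mathbb{E}_\mu\big[\hat w^2 (r - \hat\eta)^2\big] \le \sqrt{\mathbb{E}_\mu\Big[\tfrac{1}{z}\,\hat w^4 (r - \hat\eta)^2\Big]}\;\sqrt{L(\hat\eta)} \le \sqrt{\mathbb{E}_\mu\Big[\tfrac{1}{z}\,w^2 \hat w^2\Big]}\;\sqrt{L(\hat\eta)}. \tag{6}$$

Combining the bounds (5) and (6) with Proposition 1 yields the following bound on $\mathrm{MSE}(\hat w)$:

$$\mathrm{MSE}(\hat w) \le \mathbb{E}_\mu\Big[\tfrac{1}{z}\,(\hat w - w)^2\Big]\,L(\hat\eta) + \frac{1}{n}\sqrt{\mathbb{E}_\mu\Big[\tfrac{1}{z}\,w^2 \hat w^2\Big]}\;\sqrt{L(\hat\eta)} + \frac{1}{n}.$$

A direct minimization of this bound appears to be a high-dimensional optimization problem. Instead of minimizing the bound directly, we note that it is a strictly increasing function of the two expectations that appear in it. Thus, its minimizer must be on the Pareto front with respect to the two expectations, meaning that for some choice of $\lambda \in [0, \infty]$, the minimizer can be obtained by solving

$$\min_{\hat w}\; \lambda\,\mathbb{E}_\mu\Big[\tfrac{1}{z}\,(\hat w - w)^2\Big] + \mathbb{E}_\mu\Big[\tfrac{1}{z}\,w^2 \hat w^2\Big].$$

The objective decomposes across contexts and actions. Taking the derivative with respect to $\hat w(x,a)$ and setting it to zero yields the solution

$$\hat w_{\mathrm{o},\lambda}(x,a) = \frac{\lambda}{w^2(x,a) + \lambda}\; w(x,a),$$

where “o” above is a mnemonic for optimistic shrinkage. We refer to the DRs estimator with $\hat w = \hat w_{\mathrm{o},\lambda}$ as the doubly robust estimator with optimistic shrinkage (DRos) and denote it by $\hat V_{\mathrm{DRos}}(\pi; \hat\eta, \lambda)$. Note that this estimator does not depend on $z$, although it was included in the optimization objective.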
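The optimistic shrinkage rule can be sketched in a few lines of Python. The sketch assumes the closed form $\hat w = \lambda w / (w^2 + \lambda)$, which is the stationary point of a per-pair objective of the form $\lambda(\hat w - w)^2 + w^2 \hat w^2$; the function name and the example weights are hypothetical.

```python
import math

# Sketch only: assumes the shrunk weight lam * w / (w^2 + lam).

def optimistic_weight(w, lam):
    """Shrunk importance weight; lam = 0 gives 0 (DM), lam = inf gives w (DR)."""
    if math.isinf(lam):
        return w
    if lam == 0.0:
        return 0.0
    return lam * w / (w * w + lam)


weights = [0.5, 2.0, 10.0]
shrunk = [optimistic_weight(w, 4.0) for w in weights]
```

Under this form the mapping is not monotone in the original weight: for a fixed shrinkage coefficient, the shrunk weight peaks at $w = \sqrt{\lambda}$ and decays toward zero for very large weights, which is what distinguishes it from plain clipping.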

3.2 DR with Pessimistic Shrinkage

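Our second estimator family makes no assumptions on the quality of $\hat\eta$ beyond the range bound $\hat\eta(x,a) \in [0,1]$, which implies $|r - \hat\eta(x,a)| \le 1$ and yields the bias and second-moment bounds

$$|\mathrm{Bias}(\hat w)| \le \mathbb{E}_\mu\big[|\hat w(x,a) - w(x,a)|\big], \qquad \mathbb{E}_\mu\big[\hat w^2(x,a)\,(r - \hat\eta(x,a))^2\big] \le \mathbb{E}_\mu\big[\hat w^2(x,a)\big]. \tag{7}$$

As before, we do not optimize the resulting MSE bound directly and instead solve for the Pareto front points parameterized by $\lambda \in [0, \infty]$:

$$\min_{\hat w}\; \lambda\,\mathbb{E}_\mu\big[|\hat w(x,a) - w(x,a)|\big] + \tfrac{1}{2}\,\mathbb{E}_\mu\big[\hat w^2(x,a)\big].$$

The objective again decomposes across context-action pairs, yielding the solution

$$\hat w_{\mathrm{p},\lambda}(x,a) = \min\{\lambda,\, w(x,a)\},$$

which recovers (and justifies) existing weight-clipping approaches (Kang et al., 2007; Strehl et al., 2010; Su et al., 2018). We refer to the resulting estimator as $\hat V_{\mathrm{DRps}}(\pi; \hat\eta, \lambda)$, for doubly robust with pessimistic shrinkage, since we have used the worst-case bounds in the derivation. See the appendix for detailed calculations.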
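The link between the Pareto-front problem and weight clipping can be checked numerically. The Python sketch below assumes a per-pair objective of the form $\lambda\,|\hat w - w| + \tfrac12 \hat w^2$ over $0 \le \hat w \le w$ (the factor $\tfrac12$ is a normalization assumption) and compares a brute-force grid minimizer with the clipping rule $\min\{\lambda, w\}$. All names are hypothetical.

```python
# Sketch only: brute-force check that clipping minimizes the assumed per-pair objective.

def pessimistic_weight(w, lam):
    """Weight clipping at level lam."""
    return min(lam, w)


def pareto_objective(v, w, lam):
    """Assumed per-pair objective: lam * |v - w| + v^2 / 2."""
    return lam * abs(v - w) + 0.5 * v * v


def grid_minimizer(w, lam, steps=1000):
    """Minimize the objective over an evenly spaced grid on [0, w]."""
    grid = [w * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda v: pareto_objective(v, w, lam))


large = grid_minimizer(10.0, 4.0)   # weight above the clipping level
small = grid_minimizer(2.0, 4.0)    # weight below the clipping level
```

For the large weight the grid search lands on the clipping level, and for the small weight it leaves the weight unchanged, matching `pessimistic_weight` in both cases.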

3.3 Basic Properties

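The two shrinkage estimators, for a suitable choice of $\lambda$, are never worse than DR or DM. This is an immediate consequence of their form and serves as a basic sanity check. Let $\hat w_\lambda$ denote either $\hat w_{\mathrm{o},\lambda}$ or $\hat w_{\mathrm{p},\lambda}$. Then for any $\hat\eta$ there exists $\lambda^\star$ such that

$$\mathrm{MSE}(\hat w_{\lambda^\star}) \le \min\big\{\mathrm{MSE}(\hat V_{\mathrm{DM}}(\pi; \hat\eta)),\; \mathrm{MSE}(\hat V_{\mathrm{DR}}(\pi; \hat\eta))\big\}.$$

Both estimators actually interpolate between $\hat V_{\mathrm{DM}}$ and $\hat V_{\mathrm{DR}}$. As $\lambda$ varies, we obtain $\hat V_{\mathrm{DM}}$ for $\lambda = 0$ and $\hat V_{\mathrm{DR}}$ as $\lambda \to \infty$. The optimal choice of $\lambda$ is therefore always competitive with both. While we do not know $\lambda^\star$, in the next section we derive a model selection procedure that finds a good $\lambda$.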
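The interpolation property can be seen directly in code. The sketch below implements a DR-style estimator whose importance weights pass through an arbitrary shrinkage function, and evaluates it with weights shrunk to zero, with clipped weights, and with unmodified weights. The estimator form (DM baseline plus weighted residuals) and all names are assumptions made for illustration.

```python
# Sketch only: DR with a pluggable shrinkage map; names and data are hypothetical.

def drs_estimate(data, actions, pi, mu, eta_hat, shrink):
    """DM baseline plus residual correction using shrunk importance weights."""
    total = 0.0
    for x, a, r in data:
        baseline = sum(pi(b, x) * eta_hat(x, b) for b in actions)
        w_hat = shrink(pi(a, x) / mu(a, x))
        total += baseline + w_hat * (r - eta_hat(x, a))
    return total / len(data)


actions = [0, 1]
mu = lambda a, x: 0.5
pi = lambda a, x: 1.0 if a == 1 else 0.0
eta_hat = lambda x, a: 0.5
data = [(0, 0, 0.0), (0, 1, 1.0), (0, 1, 1.0), (0, 1, 0.0)]

v_dm = drs_estimate(data, actions, pi, mu, eta_hat, lambda w: 0.0)            # 0.5
v_clip = drs_estimate(data, actions, pi, mu, eta_hat, lambda w: min(1.0, w))  # 0.625
v_dr = drs_estimate(data, actions, pi, mu, eta_hat, lambda w: w)              # 0.75
```

Zeroing the weights removes the correction term and leaves DM, the identity map gives DR, and intermediate shrinkage lands between the two on this toy log.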

4 Model Selection

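All of our estimators can be written as finite sums of the form $\hat V(\theta) = \frac{1}{n}\sum_{i=1}^n Z_i(\theta)$, where $\theta$ are some fixed hyperparameters and $Z_i(\theta)$ are i.i.d. random variables. For example, $\theta = (\hat\eta, \mathrm{o}, \lambda)$ denotes that we are using a reward predictor $\hat\eta$ and the optimistic shrinkage with the parameter $\lambda$. To choose hyperparameters, we first estimate the variance of $\hat V(\theta)$ by the sample variance

$$\widehat{\mathrm{Var}}(\theta) := \frac{1}{n^2}\sum_{i=1}^n \big(Z_i(\theta) - \hat V(\theta)\big)^2.$$

We also form a data-dependent upper bound on the bias, which we call $\mathrm{BiasUB}(\theta)$. The only requirement is that for all $\theta$, $|\mathrm{Bias}(\hat V(\theta))| \le \mathrm{BiasUB}(\theta)$ (with high probability), and that $\mathrm{BiasUB}(\theta) = 0$ whenever $\hat V(\theta)$ is unbiased; this holds for both bias bounds from the previous section, as they become zero when $\hat w = w$. Now, we simply choose $\hat\theta$ from some class of hyperparameters $\Theta$ to optimize the estimate of the MSE:

$$\hat\theta := \arg\min_{\theta \in \Theta}\; \mathrm{BiasUB}(\theta)^2 + \widehat{\mathrm{Var}}(\theta).$$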

This model selection procedure is related to MAGIC (Thomas and Brunskill, 2016) as well as the model selection procedure for the SWITCH estimator (Wang et al., 2017). In comparison with MAGIC, we pick a single parameter value rather than aggregating several, and we use different bias and variance estimates. SWITCH uses the pessimistic bias bound (7), whereas we will also use two additional bounding strategies.

Theorem 1. Let $\Theta$ be a finite set of hyperparameter values and let $\Theta_0 \subseteq \Theta$ denote the subset corresponding to unbiased procedures. Then the MSE of $\hat V(\hat\theta)$ is at most $\min_{\theta \in \Theta_0} \mathrm{MSE}(\hat V(\theta))$ plus a lower-order additive term.

The theorem shows that the MSE of our final estimator after model selection is no worse than that of the best unbiased estimator under consideration. The additive term is asymptotically negligible, as the MSE itself is of order $1/n$. In particular, since the doubly robust estimator is unbiased and a special case of our estimators with $\lambda = \infty$, we retain asymptotic optimality as long as we include $\lambda = \infty$ in the model selection. In fact, since we may perform model selection over many different choices for $\hat\eta$, our estimator is competitive with the doubly robust approach using the best reward predictor.

There are many natural ways to construct data-dependent upper bounds on the bias with the required properties. The three we use in our experiments involve using samples to approximate expectations in: (i) the expression for the bias given in (4), (ii) the optimistic bias bound in (5), and (iii) the pessimistic bias bound in (7). In our theory, these estimates need to be adjusted to obtain high-probability confidence bounds. In our experiments we evaluate both the basic estimates and an adjusted variant where we add twice the standard error.
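The selection rule described in this section can be sketched as follows. The sketch assumes that each candidate hyperparameter setting comes with its per-example terms and a precomputed bias upper bound, and that the variance of the sample mean is estimated by the squared deviations divided by $n^2$. The candidate names, the toy numbers, and the function name are hypothetical.

```python
# Sketch only: pick the candidate minimizing (bias bound)^2 + estimated variance.

def select_hyperparameter(z_terms, bias_ub):
    """z_terms maps each candidate to its per-example terms; bias_ub to its bias bound.

    Returns (candidate, estimated MSE, value estimate) for the best candidate.
    """
    best = None
    for theta, z in z_terms.items():
        n = len(z)
        mean = sum(z) / n
        var_hat = sum((zi - mean) ** 2 for zi in z) / n ** 2   # variance of the mean
        score = bias_ub[theta] ** 2 + var_hat
        if best is None or score < best[1]:
            best = (theta, score, mean)
    return best


z_terms = {
    "dr": [0.0, 2.0, 2.0, 0.0],      # unbiased but high variance
    "clip1": [0.5, 1.0, 1.0, 0.5],   # some bias, lower variance
    "dm": [0.5, 0.5, 0.5, 0.5],      # zero variance, largest bias bound
}
bias_ub = {"dr": 0.0, "clip1": 0.25, "dm": 0.6}
chosen, est_mse, value = select_hyperparameter(z_terms, bias_ub)
```

In this toy example the clipped candidate wins: its squared bias bound plus variance estimate (0.078125) is below both the pure-variance score of the unbiased candidate (0.25) and the pure-bias score of the direct model (0.36).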

5 Experiments

We evaluate our new estimators on the tasks of off-policy evaluation and off-policy learning, and compare their performance with previous estimators. Our secondary goal is to identify the configuration of the shrinkage estimator that is most robust for use in practice.

Datasets. Following prior work (Dudík et al., 2014; Swaminathan and Joachims, 2015b; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2018), we simulate bandit feedback on 9 UCI multi-class classification datasets. This lets us evaluate estimators in a broad range of conditions and gives us ground-truth policy values (see the table of dataset statistics in the appendix). Each multi-class dataset with $k$ classes corresponds to a contextual bandit problem with $k$ possible actions coinciding with classes. We consider either deterministic rewards, where on multiclass example $(x, y^\star)$, the action $y$ yields the reward $r = \mathbb{1}\{y = y^\star\}$, or stochastic rewards, where $r = \mathbb{1}\{y = y^\star\}$ with a fixed probability and $r = 1 - \mathbb{1}\{y = y^\star\}$ otherwise. For every dataset, we hold out a portion of the examples to measure ground truth. On the remainder of the dataset, we use the logging policy $\mu$ to simulate $n$ bandit examples by sampling a context $x$ from the dataset, sampling an action $y \sim \mu(\cdot \mid x)$, and then observing a deterministic or stochastic reward $r$. The value of $n$ varies across experimental conditions. (Footnote 1: If $n$ is the size of the remainder, we use each example exactly once, but there is still variation in the ordering of examples, actions taken, and rewards.)
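The supervised-to-bandit conversion described above can be sketched in Python as follows. The sketch draws examples without replacement (so that each example is used at most once), samples an action from the logging policy, and reveals only the reward of that action. The optional reward-flipping noise, its parameter name `keep_prob`, and all other names are assumptions made for illustration.

```python
import random

# Sketch only: simulate bandit feedback from labeled multiclass examples.

def simulate_bandit_feedback(examples, logging_probs, n, rng, keep_prob=1.0):
    """Turn n examples (x, y_true) into logged (x, action, reward, propensity) tuples."""
    log = []
    for x, y_true in rng.sample(examples, n):      # each example used at most once
        probs = logging_probs(x)                   # distribution over the k classes
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        r = 1.0 if a == y_true else 0.0            # deterministic reward
        if rng.random() >= keep_prob:              # assumed noise model: flip the reward
            r = 1.0 - r
        log.append((x, a, r, probs[a]))
    return log


rng = random.Random(0)
examples = [(i, i % 3) for i in range(6)]          # context i has true label i mod 3
uniform = lambda x: [1 / 3, 1 / 3, 1 / 3]
log = simulate_bandit_feedback(examples, uniform, n=6, rng=rng)
```

Storing the logging propensity alongside each tuple is a convenience for computing importance weights later; with `keep_prob=1.0` rewards are deterministic.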

Table 1: Policies.

Role       $\alpha$   $\beta$
target     0.9        0
logging    0.7        0.2
           0.5        0.2
           0.3        0.2
           0.5        0.2
           0.95       0.1

Policies. We use the held-out data to obtain logging and target policies as follows. We first obtain two deterministic policies $\pi_{1,\det}$ and $\pi_{2,\det}$ by training two logistic models on the same data, but using either the first or second half of the features. We obtain stochastic policies parameterized by $(\alpha, \beta)$, following the softening technique of Farajtabar et al. (2018). Specifically, $\pi_{\alpha,\beta}(a \mid x) = \alpha + \beta u$ if $a = \pi_{\det}(x)$ and $\pi_{\alpha,\beta}(a \mid x) = \frac{1 - \alpha - \beta u}{k - 1}$ otherwise, where $u \sim \mathrm{Unif}[-0.5, 0.5]$. In off-policy evaluation experiments, we consider a fixed target and several choices of logging policy (see Table 1). In off-policy learning we use a single fixed logging policy.
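A softened policy of this kind can be sketched as follows. The sketch assumes the softening rule places probability $\alpha + \beta u$ on the action chosen by the deterministic policy and spreads the remainder evenly over the other $k - 1$ actions, with $u$ drawn uniformly from $[-0.5, 0.5]$; this form, the function name, and the example parameters are assumptions for illustration.

```python
import random

# Sketch only: soften a deterministic choice into a distribution over k actions.

def softened_probs(det_action, k, alpha, beta, u):
    """Probability alpha + beta * u on det_action; the rest split evenly."""
    p_det = alpha + beta * u
    p_other = (1.0 - p_det) / (k - 1)
    return [p_det if a == det_action else p_other for a in range(k)]


rng = random.Random(0)
u = rng.uniform(-0.5, 0.5)
target = softened_probs(det_action=2, k=4, alpha=0.9, beta=0.0, u=u)
logging = softened_probs(det_action=2, k=4, alpha=0.7, beta=0.2, u=u)
```

With a zero noise scale the softened policy puts exactly the base probability on the deterministic action; with a positive noise scale that probability varies with the draw of `u`.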

Reward predictors. We obtain reward predictors $\hat\eta$ by training linear models via weighted least squares with regularization, using 2-fold cross-validation to tune the regularization parameter. We experiment with weights $z(x,a) \in \{1, w(x,a), w^2(x,a)\}$ and we also consider the special weight design of Farajtabar et al. (2018), which we call MRDR (see the appendix for details). In both evaluation and learning experiments, we use part of the bandit data to train $\hat\eta$. In addition to the four trained reward predictors, we also consider $\hat\eta \equiv 0$.

Baselines. We include a number of estimators in our evaluation: the direct modeling approach (DM), doubly robust (DR) and its self-normalized variant (snDR), our approach (DRs), and the doubly robust version of the switch estimator of Wang et al. (2017), which also performs a form of weight clipping. (Footnote 2: For simplicity we call this estimator switch, although Wang et al. call it switch-DR.) Note that DR with $\hat\eta \equiv 0$ is identical to inverse propensity scoring (IPS); we refer to its self-normalized variant as snIPS. Our estimator and switch have hyperparameters, which are selected by their respective model selection procedures (see the appendix for details about the hyperparameter grid).

5.1 Off-policy Evaluation

We begin by evaluating different configurations of DRs via an ablation analysis. Then we compare DRs with baseline estimators. We have a total of 108 experimental conditions: for each of the 9 datasets we use 6 logging policies and consider stochastic or deterministic rewards. Except for the learning curves below, we always take $n$ to be all available bandit data.

We measure performance with clipped MSE, $\mathbb{E}\big[\min\{(\hat V - V(\pi))^2,\, 1\}\big]$, where $\hat V$ is the estimator and $V(\pi)$ is the ground-truth value (computed on the held-out data). We use 500 replicates of bandit-data generation to estimate the MSE; statistical comparisons are based on paired $t$-tests at a fixed significance level. In some of our ablation experiments, we pick the best hyperparameters against the test set on a per-replicate basis, which we call oracle tuning and always call out explicitly.

Table 2: Ablation analysis. Left: we compare reward predictors using a fixed estimator (with oracle tuning if applicable). We report the number of conditions where a regressor is statistically indistinguishable from the best and, in parentheses, the number of conditions where it statistically dominates all others. Right: we compare different shrinkage types using a fixed reward predictor (with oracle tuning), reporting the number of conditions where one statistically dominates the other.

Left:
        $\hat\eta \equiv 0$   $z \equiv 1$   $z = w$    $z = w^2$   MRDR
DM      0 (0)                 47 (23)        45 (22)    41 (31)     11 (5)
DR      27 (2)                86 (9)         90 (4)     85 (5)      65 (0)
snDR    63 (7)                80 (2)         85 (8)     69 (4)      54 (0)
DRs     23 (19)               44 (16)        35 (4)     62 (35)     18 (2)

Right:
                      DRps   DRos
$\hat\eta \equiv 0$   21     51
$z \equiv 1$          58     28
$z = w$               55     30
$z = w^2$             55     29
MRDR                  49     29
Ablation analysis. Table 2 displays the results of two ablation studies, one evaluating different reward predictors and the other evaluating the optimistic and pessimistic shrinkage types.

On the left, for each fixed estimator type (e.g., DR) we compare different reward predictors by reporting the number of conditions where each predictor is statistically indistinguishable from the best and the number of conditions where it statistically dominates all other predictors. For DRs we use oracle tuning for the shrinkage type and coefficient. The table shows that weight shrinkage strongly influences the choice of regressor. For example, $z \equiv 1$ and $z = w$ are top choices for DR, but with the inclusion of shrinkage in DRs, $z = w^2$ emerges as the best single choice. (Footnote 3: The oracle can always select the shrinkage parameter in DRs to recover DM or DR, but, according to the table, the oracle choices for $z \equiv 1$ and $z = w$ lead to inferior performance compared with $z = w^2$.) In our comparison experiments below, we therefore restrict our attention to $z = w^2$ and additionally consider $\hat\eta \equiv 0$, because it allows including IPS as a special case of DRs. Somewhat surprisingly, MRDR is dominated in our experiments by other reward predictors (except for $\hat\eta \equiv 0$), and this remains true even with a deterministic target policy (see the corresponding ablation table in the appendix).

On the right of Table 2, we compare optimistic and pessimistic shrinkage when paired with a fixed reward predictor (using oracle tuning for the shrinkage coefficient). We report the number of times that one estimator statistically dominates the other. The results suggest that both shrinkage types are important for robust performance across conditions, so we consider both choices going forward.

Figure 1: CDF plots of normalized MSE aggregated across all conditions with deterministic rewards (left) and stochastic rewards (right). See the corresponding table in the appendix for statistical significance.
Figure 2: MSE for a varying number of samples $n$, for the dataset yeast and a fixed logging policy, with deterministic rewards (left) and stochastic rewards (right).

In Figure 1, we compare our new estimator with the baselines. We visualize the results by plotting the cumulative distribution function (CDF) of the normalized MSE (normalized by the MSE of snIPS) across conditions for each method. Better performance corresponds to CDF curves closer to the top-left corner, meaning the method achieves a lower MSE more frequently. The left plot summarizes the 54 conditions where the reward is deterministic, while the right plot considers the 54 stochastic-reward conditions. For DRs we consider two model selection procedures outlined in Section 4 that differ in their choice of BiasUB. DRs-direct estimates the expectations in the expressions in Eqs. (4), (5), and (7) (corresponding to the bias and bias bounds) by empirical averages and takes their pointwise minimum. DRs-upper adds to these estimates twice their standard error before taking the minimum, more closely matching our theory. For DRs, we only use the zero reward predictor and the one trained with $z = w^2$, and we always select between both shrinkage types. Since switch also comes with a model selection procedure, we use it to select between the same two reward predictors as DRs.
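The CDF summary used in this comparison can be sketched as follows: for each condition, divide a method's MSE by the baseline's MSE, then report the fraction of conditions whose ratio falls at or below each threshold. The function name and the per-condition numbers are hypothetical and are not results from the experiments.

```python
# Sketch only: empirical CDF of MSE ratios across experimental conditions.

def normalized_mse_cdf(mse, baseline_mse, thresholds):
    """Fraction of conditions with mse / baseline_mse at most each threshold."""
    ratios = [m / b for m, b in zip(mse, baseline_mse)]
    return [sum(r <= t for r in ratios) / len(ratios) for t in thresholds]


# Made-up per-condition MSEs for one method and for a baseline.
method_mse = [0.01, 0.04, 0.02, 0.12]
baseline_mse = [0.02, 0.04, 0.08, 0.03]
cdf = normalized_mse_cdf(method_mse, baseline_mse, thresholds=[0.3, 0.6, 1.5, 5.0])
```

A curve that rises early (large fractions at small thresholds) indicates a method that beats the baseline by a wide margin on many conditions.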

In the deterministic case (left plot) we see that DRs-upper has the best aggregate performance, by a large margin. DRs-direct also has better aggregate performance than the baselines on most of the conditions. In the stochastic case, DRs-direct has similarly strong performance, but DRs-upper degrades considerably, suggesting this model selection scheme is less robust to stochastic rewards. We illustrate this phenomenon in Figure 2, plotting the MSE as a function of the number of samples $n$ for one choice of logging policy and dataset, with deterministic rewards on the left and stochastic rewards on the right. Because of its more robust performance, we advocate for DRs-direct as our final method.

5.2 Off-policy Learning

Figure 3: Learning experiments.

Following prior work (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b; Su et al., 2018), we learn a stochastic linear policy $\pi_\theta$ where $\pi_\theta(a \mid x) \propto \exp(\theta^\top f(x,a))$ and $f(x,a)$ is a featurization of context-action pairs. We solve a regularized empirical risk minimization problem via gradient descent, whose objective combines a policy-value estimator $\hat V$ with a regularization hyperparameter. For these experiments, we partition the data into four quarters: one full-information segment for training the logging policy and as a test set, and three bandit segments for (1) training reward predictors, (2) learning the policy, and (3) hyperparameter tuning and model selection. Since there is no fixed target policy, we consider three reward predictors: $\hat\eta \equiv 0$, and two predictors trained with different weighting functions.
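A stochastic linear policy of the kind learned here can be sketched as a softmax over linear scores. The sketch assumes action probabilities proportional to the exponential of an inner product between a parameter vector and a context-action feature vector; the featurization, the parameter values, and the function names are hypothetical.

```python
import math

# Sketch only: softmax policy over linear scores of context-action features.

def softmax_policy(theta, features, x, actions):
    """Return probabilities proportional to exp(theta . f(x, a)), computed stably."""
    scores = [sum(t * f for t, f in zip(theta, features(x, a))) for a in actions]
    m = max(scores)                                # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def features(x, a):
    """Hypothetical featurization: the scalar context placed in the slot of action a."""
    return [x if a == j else 0.0 for j in range(3)]


probs_zero = softmax_policy([0.0, 0.0, 0.0], features, x=1.0, actions=[0, 1, 2])
probs = softmax_policy([0.0, 1.0, 2.0], features, x=1.0, actions=[0, 1, 2])
```

A zero parameter vector yields the uniform policy, and larger scores shift probability toward the corresponding actions; because the probabilities are smooth in the parameters, any differentiable value estimator can be plugged into a gradient-based objective.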

In Figure 3 we display the performance of four methods (DM, DR, IPS, and DRs-direct) on four of the UCI datasets. For each method, we compute the average value of the learned policy on the test set (averaged over 10 replicates) and we report this value normalized by that for IPS. For DM and DR, we select the hyperparameter and reward predictor optimally in hindsight, while for DRs we use our model selection method. Note that we do not compare with switch here as it is not amenable to gradient-based optimization (Su et al., 2018). Except for the opt-digits dataset, where all the methods are comparable, we find that off-policy learning using DRs-direct always outperforms the baselines.


We thank Alekh Agarwal for valuable input in the early discussions about this project. Part of this work was completed while Yi Su was visiting Microsoft Research.


  • Athey and Wager (2017) Susan Athey and Stefan Wager. Efficient policy learning. arXiv:1702.02896, 2017.
  • Bang and Robins (2005) Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 2005.
  • Bembom and van der Laan (2008) Oliver Bembom and Mark J van der Laan. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators. Technical report, UC Berkeley, 2008.
  • Bottou et al. (2013) Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 2013.
  • De la Pena and Giné (2012) Victor De la Pena and Evarist Giné. Decoupling: from dependence to independence. Springer Science & Business Media, 2012.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL
  • Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning, 2011.
  • Dudík et al. (2014) Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
  • Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, 2018.
  • Hahn (1998) Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 1998.
  • Hirano et al. (2003) Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 2003.
  • Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 1952.
  • Imbens et al. (2007) Guido Imbens, Whitney Newey, and Geert Ridder. Mean-squared-error calculations for average treatment effects. ssrn.954748, 2007.
  • Kallus (2017) Nathan Kallus. A framework for optimal matching for causal inference. In International Conference on Artificial Intelligence and Statistics, 2017.
  • Kallus (2018) Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, 2018.
  • Kang et al. (2007) Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 2007.
  • Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.
  • Li et al. (2011) Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In International Conference on Web Search and Data Mining, 2011.
  • Li et al. (2015) Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In International Conference on World Wide Web, 2015.
  • Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 1995.
  • Rothe (2016) Christoph Rothe. The value of knowing the propensity score for estimating average treatment effects. IZA Discussion Paper Series, 2016.
  • Strehl et al. (2010) Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, 2010.
  • Su et al. (2018) Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. CAB: Continuous adaptive blending estimator for policy evaluation and learning. In International Conference on Machine Learning, 2018.
  • Swaminathan and Joachims (2015a) Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, 2015a.
  • Swaminathan and Joachims (2015b) Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, 2015b.
  • Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
  • Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, 2017.
  • Zhou et al. (2017) Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 2017.

Appendix A Derivation of Shrinkage Estimators

In this section we provide detailed derivations for the two estimators.

We first derive the pessimistic version. Recall that the optimization problem decouples across , so we focus on a single pair, and we omit explicit dependence on these. Fixing , we must solve

The optimality conditions are that

The first case for cannot occur, since setting would make (according to the first equation), but we know that . If then we must have . And so, we get

which is the clipped estimator.
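The clipping solution can be checked numerically. The sketch below assumes a per-pair objective of the form λ|w − ŵ| + ŵ²/2 over 0 ≤ ŵ ≤ w, where w is the importance weight and λ the shrinkage coefficient. This objective, its constants, and the function names are assumptions chosen for illustration; the exact scaling in the derivation above may differ. Under this assumption, a brute-force grid search agrees with min(λ, w).

```python
import numpy as np

def pessimistic_objective(w_hat, w, lam):
    # Assumed per-pair objective: a linear bias surrogate plus a quadratic
    # variance surrogate (constants are illustrative, not from the paper).
    return lam * np.abs(w - w_hat) + 0.5 * w_hat ** 2

def clipped_weight(w, lam):
    # Closed form suggested by the derivation: clip the importance weight at lam.
    return np.minimum(w, lam)

def grid_minimizer(w, lam, num=100001):
    # Brute-force minimizer of the assumed objective over a fine grid on [0, w].
    grid = np.linspace(0.0, w, num)
    return grid[np.argmin(pessimistic_objective(grid, w, lam))]

for w, lam in [(5.0, 2.0), (1.5, 4.0), (3.0, 3.0)]:
    print(w, lam, grid_minimizer(w, lam), clipped_weight(w, lam))
```

When λ is below w the stationary point ŵ = λ lies inside the interval. Otherwise the objective is decreasing on [0, w] and the minimizer sits at the boundary ŵ = w.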

For the optimistic version, the optimization problem is

The optimality conditions are

This gives the optimistic estimator

Notice that this estimator does not depend on the weighting function , so it does not depend on how we train the regression model.
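One way to see why the weighting function can drop out is sketched here under an assumed form of the per-pair objective, in which both quadratic terms share a common factor 1/z. The symbols ŵ (shrunk weight), w (importance weight), λ (shrinkage coefficient), and z > 0 (weighting function), as well as the objective itself, are assumptions for illustration and may not match the paper's exact formulation.

```latex
\min_{\hat w}\; \frac{\lambda\,(\hat w - w)^2 + w^2\,\hat w^2}{z}
\quad\Longrightarrow\quad
\frac{2\lambda\,(\hat w - w) + 2\,w^2\,\hat w}{z} = 0
\quad\Longrightarrow\quad
\hat w = \frac{\lambda}{w^2 + \lambda}\; w .
```

Because z multiplies the whole stationarity condition, it cancels, and the resulting weight depends only on w and λ.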

Appendix B Proofs

Proof of prop:variance.

The law of total variance gives

For the first term, since does not depend on , it does not contribute to the conditional variance, and we get

For we have

Now, using our assumption that , we get

and hence for all ,

Here in the last step we are using boundedness of the regression function , and the estimated regression function .

Thus, for the second term is at most , and for , we have

using the above calculation. Therefore, the residual terms add up to at most . ∎
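For reference, the law of total variance invoked at the start of the proof above reads as follows for a generic random variable Y and conditioning variable X (generic symbols, not the paper's notation):

```latex
\operatorname{Var}(Y) \;=\; \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big] \;+\; \operatorname{Var}\big(\mathbb{E}[Y \mid X]\big).
```

The proof bounds the two terms on the right-hand side separately.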

Proof of prop:sanity.

We analyze the optimistic version. If , then we simply take at which point the objective is clearly minimized by . Therefore we recover . On the other hand, if , then we set so that the minimizer is . This recovers the doubly robust estimator. ∎
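The two limiting cases can be illustrated numerically. The sketch below assumes the optimistic weights take the closed form λw/(w² + λ) and plugs them into the standard doubly robust form. The function names and the inputs `q_logged` (model prediction at the logged action) and `q_target` (model prediction averaged over the target policy) are hypothetical.

```python
import numpy as np

def optimistic_weight(w, lam):
    # Assumed closed form of the optimistic shrinkage: lam / (w^2 + lam) * w.
    return lam / (w ** 2 + lam) * w

def shrunk_dr_estimate(rewards, w_hat, q_logged, q_target):
    # Doubly robust form with (possibly shrunk) weights:
    # direct-model term plus a weighted residual correction.
    return np.mean(q_target + w_hat * (rewards - q_logged))

w = np.array([0.5, 2.0, 10.0])          # importance weights (must be positive here)
r = np.array([1.0, 0.0, 1.0])           # observed rewards
q_logged = np.array([0.8, 0.3, 0.4])    # reward model at logged actions
q_target = np.array([0.6, 0.5, 0.7])    # reward model averaged over target policy

direct = shrunk_dr_estimate(r, optimistic_weight(w, 0.0), q_logged, q_target)
near_dr = shrunk_dr_estimate(r, optimistic_weight(w, 1e12), q_logged, q_target)
full_dr = shrunk_dr_estimate(r, w, q_logged, q_target)
print(direct, near_dr, full_dr)
```

With λ = 0 the correction term vanishes and only the direct-model term remains. As λ grows the shrunk weights approach w and the estimate approaches the unshrunk doubly robust value.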

Proof of thm:model_selection.

The main technical part of the proof is a deviation inequality for the sample variance. For this, let us fix , which we drop from notation, and focus on estimating the variance

We have the following lemma.

Lemma (Variance estimation). Let be iid random variables, and assume that almost surely. Then there exists a constant such that for any , with probability at least


For this lemma only, define . By direct calculation

We work with the second term first. Let be an iid sample, independent of . Now, by Theorem 3.4.1 of De la Pena and Giné [2012], we have

for a universal constant . Thus, we have decoupled the U-statistic. Now let us condition on and write , which conditional on is non-random. We will apply Bernstein’s inequality on , which is a centered random variable, conditional on . This gives that with probability at least

This bound holds with high probability for any . In particular, since almost surely, we get that with probability

The factors of arise from working through the decoupling inequality.

Let us now address the first term. A simple application of Bernstein’s inequality gives that with probability at least

Combining the two inequalities, we obtain the result. ∎
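The split into a "first term" and a U-statistic "second term" can be illustrated with a standard algebraic identity for the unbiased sample variance. Whether this is exactly the decomposition used in the proof is an assumption; the function name is hypothetical.

```python
import numpy as np

def variance_decomposition(x):
    # Split the unbiased sample variance into a diagonal average and an
    # off-diagonal U-statistic:
    #   s^2 = (1/n) sum_i x_i^2  -  (1/(n(n-1))) sum_{i != j} x_i x_j.
    x = np.asarray(x, dtype=float)
    n = x.size
    diag_term = np.mean(x ** 2)
    all_pairs = np.sum(x) ** 2                      # sum over all (i, j) of x_i x_j
    off_diag = (all_pairs - np.sum(x ** 2)) / (n * (n - 1))
    return diag_term, off_diag

rng = np.random.default_rng(0)
sample = rng.uniform(-1.0, 1.0, size=200)
diag_term, off_diag = variance_decomposition(sample)
print(diag_term - off_diag, np.var(sample, ddof=1))
```

The diagonal term is an average of iid bounded variables, which is the setting for Bernstein's inequality. The off-diagonal term is a sum over pairs, which is where a decoupling argument becomes useful.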

Since we are estimating the variance of the sample average estimator, we divide by another factor of . Thus the error terms in Lemma B are and respectively, which are both . We will simply write these error terms as from now on.

For the model selection result, first apply Lemma B for all , taking a union bound (this only requires that ). Further take a union bound over the event that for all , if it is needed. Then, observe that for any we have

The first inequality uses Lemma B and the fact that . The second uses that optimizes this quantity, and the third uses the property that by assumption. ∎
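A minimal sketch of a selection rule of this kind, assuming the criterion is the squared bias upper bound plus the sample variance of the per-sample terms divided by n. The function name, the dictionary-based input format, and the candidate labels are hypothetical.

```python
import numpy as np

def select_shrinkage(per_sample_terms, bias_upper_bounds):
    """Pick the candidate minimizing (bias bound)^2 + (sample variance) / n.

    per_sample_terms: dict mapping a candidate label to a 1-D array whose mean
        is that candidate's value estimate (hypothetical input format).
    bias_upper_bounds: dict mapping the same labels to a nonnegative bias bound.
    """
    scores = {}
    for label, terms in per_sample_terms.items():
        terms = np.asarray(terms, dtype=float)
        var_of_mean = np.var(terms, ddof=1) / terms.size
        scores[label] = bias_upper_bounds[label] ** 2 + var_of_mean
    best = min(scores, key=scores.get)
    return best, scores

# Toy candidates: an unshrunk estimator with zero bias bound but noisy terms,
# and a shrunk estimator with a small bias bound and much less noisy terms.
terms = {
    "dr": np.tile([0.0, 10.0], 50),
    "shrunk": np.tile([4.9, 5.1], 50),
}
bounds = {"dr": 0.0, "shrunk": 0.1}
best, scores = select_shrinkage(terms, bounds)
print(best, scores)
```

If the candidate set contains an estimator whose bias bound is zero, the selected score can never exceed that estimator's variance estimate, which mirrors the structure of the argument above.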

B.1 Construction of bias upper bounds

In this section we give a detailed construction of the bias upper bounds that we use in the model selection procedure. Recall that these bounds are for the analysis only: empirically, we found that using the estimators alone, rather than the upper bounds, leads to better performance.

Throughout, we fix a set of hyperparameters , which we suppress from the notation.

Direct bias estimation.

The most straightforward bias estimator is to simply approximate the expectation with a sample average.

This estimator has finite-sum structure, and naively, the range of each term is . The variance is at most . Hence Bernstein’s inequality gives that with probability at least

Inflating the estimate by the right hand side gives BiasUB, which is a high probability upper bound on Bias.
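The inflation step can be sketched with the standard two-sided form of Bernstein's inequality. The function names and the generic arguments `variance_bound` and `range_bound` are hypothetical placeholders for the range and variance quantities discussed above, whose exact values are not reproduced here.

```python
import math

def bernstein_radius(variance_bound, range_bound, n, delta):
    # Two-sided Bernstein deviation for the mean of n iid terms whose centered
    # values are bounded in magnitude by range_bound and whose variance is at
    # most variance_bound; holds with probability at least 1 - delta.
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * variance_bound * log_term / n)
            + 2.0 * range_bound * log_term / (3.0 * n))

def bias_upper_bound(bias_estimate, variance_bound, range_bound, n, delta):
    # Inflate the absolute sample estimate by the deviation radius.
    return abs(bias_estimate) + bernstein_radius(variance_bound, range_bound, n, delta)

print(bias_upper_bound(-0.1, variance_bound=1.0, range_bound=1.0, n=100, delta=0.05))
```

The radius has a leading term of order 1/sqrt(n) driven by the variance and a lower-order term of order 1/n driven by the range, so a small variance tightens the bound even when the range is large.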

Pessimistic estimation.

The bias bound used in the pessimistic estimator and its natural sample estimator are

Note that since we have already eliminated the dependence on the reward, we can analytically evaluate the expectation over actions, which will lead to lower variance in the estimate.

Again we perform a fairly naive analysis. Since and using the fact that the range of the random variable is simply . Therefore, Hoeffding’s inequality gives that with probability

and we use the right hand side for our high probability upper bound.
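A sketch of this construction under stated assumptions: the bias surrogate is taken to be the expected absolute gap between clipped and unclipped weights under the logging policy, and the expectation over actions is computed in closed form per context. The function names, the array layout (n contexts by K actions), and the surrogate itself are assumptions for illustration.

```python
import math
import numpy as np

def pessimistic_bias_estimate(pi, mu, lam):
    # pi, mu: arrays of shape (n, K) with target and logging action probabilities.
    # Assumed surrogate: E_mu |w_hat - w| with w = pi / mu and w_hat = min(w, lam).
    # Summing over actions analytically gives sum_a max(pi - lam * mu, 0) per context.
    per_context = np.sum(np.maximum(pi - lam * mu, 0.0), axis=1)
    return float(np.mean(per_context))

def hoeffding_radius(value_range, n, delta):
    # One-sided Hoeffding deviation for the mean of n iid terms, each lying in an
    # interval of width value_range; holds with probability at least 1 - delta.
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

pi = np.array([[0.5, 0.5], [0.9, 0.1]])
mu = np.array([[0.5, 0.5], [0.5, 0.5]])
estimate = pessimistic_bias_estimate(pi, mu, lam=1.0)
upper = estimate + hoeffding_radius(value_range=1.0, n=pi.shape[0], delta=0.05)
print(estimate, upper)
```

Because the sum over actions is exact, the only remaining randomness is over contexts, which is why a range-based bound such as Hoeffding's is enough in this sketch.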

Optimistic estimation.

For the optimistic bound, we must estimate two terms, one involving the regressor and one involving the importance weights. We use