# Efficient Counterfactual Learning from Bandit Feedback

What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance among a wide class of estimators, achieving variance reduction relative to standard estimators. We also apply our estimators to improve online advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence than a state-of-the-art benchmark.


## 1 Introduction

Interactive bandit systems (e.g. personalized education and medicine, ad/news/recommendation/search platforms) produce log data valuable for evaluating and redesigning the systems. For example, the logs of a news recommendation system record which news article was presented and whether the user read it, giving the system designer a chance to make its recommendation more relevant. Exploiting log data is, however, more difficult than conventional supervised machine learning: the result of each log is only observed for the action chosen by the system (e.g. the presented news) but not for all the other actions the system could have taken. Moreover, the log entries are biased in that the logs over-represent actions favored by the system.

A potential solution to this problem is an A/B test that compares the performance of counterfactual systems. However, A/B testing counterfactual systems is often technically or managerially infeasible, since deploying a new policy is time- and money-consuming, and entails a risk of failure.

This leads us to the problem of counterfactual (off-policy) evaluation and learning, where one aims to use batch data collected by a logging policy to estimate the value of a counterfactual policy or algorithm without employing it [Li et al.2010, Strehl et al.2010, Li et al.2011, Li et al.2012, Bottou et al.2013, Swaminathan and Joachims2015a, Swaminathan and Joachims2015b, Wang, Agarwal, and Dudik2017, Swaminathan et al.2017]. Such evaluation allows us to compare the performance of counterfactual policies to decide which policy should be deployed in the field. This alternative approach thus solves the above problem with the naive A/B test approach.

Method. For off-policy evaluation with log data of bandit feedback, this paper develops and empirically implements a variance minimization technique. Variance reduction and statistical efficiency are important for minimizing the uncertainty we face in decision making. Indeed, an important open question raised by Li, Munos, and Szepesvári (2015) is how to achieve “statistically more efficient (even optimal) offline estimation” from batch bandit data. This question motivates a set of studies that bound and characterize the variances of particular estimators [Dudík et al.2014, Li, Munos, and Szepesvári2015, Thomas, Theocharous, and Ghavamzadeh2015, Munos et al.2016, Thomas and Brunskill2016, Agarwal et al.2017].

We study this statistical efficiency question in the context of offline policy value estimation with log data from a class of contextual bandit algorithms. This class includes most of the widely-used algorithms such as contextual $\epsilon$-Greedy and Thompson Sampling, as well as their non-contextual analogs and random A/B testing. We allow the logging policy to be unknown, degenerate (non-stochastic), and time-varying, all of which are salient in real-world bandit applications. We also allow the evaluation target policy to be degenerate, again a key feature of real-life situations.

We consider offline estimators for the expected reward from a counterfactual policy. Our estimators can also be used for estimating the average treatment effect. Our estimators are variations of well-known inverse probability weighting estimators (Horvitz and Thompson 1952; Rosenbaum and Rubin 1983; and modern studies cited above) except that we use an estimated propensity score (logging policy) even if we know the true propensity score. We show the following result, building upon Bickel et al. (1993), Hirano, Imbens, and Ridder (2003), and Ackerberg et al. (2014) among others:

Theoretical Result 1. Our estimators minimize the variance among all reasonable estimators. More precisely, our estimators minimize the asymptotic variance among all “asymptotically normal” estimators (in the standard statistical sense defined in Section 3).

We also provide estimators for the asymptotic variances of our estimators, thus allowing analysts to calculate the variance in practice. In contrast to Result 1, we also find:

Theoretical Result 2. Standard estimators using the true propensity score (logging policy) have larger asymptotic variances than our estimators.

Perhaps counterintuitively, therefore, the policy-maker should use an estimated propensity score even when she knows the true one.

Application. We empirically apply our estimators to evaluate and optimize the design of online advertisement formats. Our application is based on proprietary data provided by CyberAgent Inc., the second largest Japanese advertisement company with about 6 billion USD market capitalization (as of November 2018). This company uses a contextual bandit algorithm to determine the visual design of advertisements assigned to users. Their algorithm produces logged bandit data.

We use this data and our estimators to optimize the advertisement design for maximizing the click through rates (CTR). In particular, we estimate how much the CTR would be improved by a counterfactual policy of choosing the best action (advertisement) for each context (user characteristics). We first obtain the following result:

Empirical Result A. Consistent with Theoretical Results 1-2, our estimators produce narrower confidence intervals about the counterfactual policy’s CTR than a benchmark using the known propensity score [Swaminathan and Joachims2015b].

This result is reported in Figure 1, where the confidence intervals using “True Propensity Score (Benchmark)” are wider than the confidence intervals using propensity scores estimated by Gradient Boosting, Random Forest, or Ridge Logistic Regression.

Thanks to this variance reduction, we conclude that the logging policy’s CTR is below the confidence interval of the hypothetical policy of choosing the best advertisement for each context. This leads us to the following bottom line:

Empirical Result B. Unlike the benchmark estimator, our estimator predicts that the hypothetical policy statistically significantly improves the CTR by 10-15% (compared to the logging policy).

Empirical Results A and B therefore show that our estimator can substantially reduce uncertainty we face in real-world policy-making.

## 2 Setup

### 2.1 Data Generating Process

We consider a general multi-armed contextual bandit setting. There is a set of actions (equivalently, arms or treatments) $A = \{0, 1, \ldots, m\}$ that the decision maker can choose from. Let $Y(\cdot)$ denote a potential reward function that maps actions into rewards or outcomes, where $Y(a)$ is the reward when action $a$ is chosen (e.g., whether an advertisement as an action results in a click). Let $X$ denote context or covariates (e.g., the user’s demographic profile and browsing history) that the decision maker observes when picking an action. We denote the set of contexts by $\mathcal{X}$. We think of $(Y(\cdot), X)$ as a random vector with unknown distribution $G$.

We consider log data coming from the following data generating process (DGP), which is similar to those used in the literature on the offline evaluation of contextual bandit algorithms [Li et al.2010, Strehl et al.2010, Li et al.2011, Li et al.2012, Swaminathan and Joachims2015a, Swaminathan and Joachims2015b, Swaminathan et al.2017]. We observe data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ with $T$ observations, where $D_t \equiv (D_{t0}, \ldots, D_{tm})'$ and $D_{ta}$ is a binary variable indicating whether action $a$ is chosen in round $t$. $Y_t$ denotes the reward observed in round $t$, i.e., $Y_t = \sum_{a=0}^m D_{ta} Y_t(a)$. $X_t$ denotes the context observed in round $t$.

A key feature of our DGP is that the data are divided into batches, where different batches may use different choice probabilities (propensity scores). Let $b_t \in \{1, \ldots, B\}$ denote a random variable indicating the batch to which round $t$ belongs. We treat this batch number as one of the context variables and write $X_t = (b_t, \tilde{X}_t)$, where $\tilde{X}_t$ is the vector of context variables other than $b_t$.

Let $p_t \equiv (p_{t0}, \ldots, p_{tm})'$ denote the potentially unknown probability vector indicating the probability that each action is chosen in round $t$, with $p_{ta}$ being the probability that action $a$ is chosen. A contextual bandit algorithm is a sequence $\{F_b\}_{b=1}^B$ of distribution functions of choice probabilities conditional on $\tilde{X}_t$, where each $F_b(\cdot|\tilde{x})$ is a distribution over the set of probability vectors on $A$. $F_b$ takes context $\tilde{x}$ as input and returns a distribution of the probability vector $p_t$ in rounds of batch $b$. $F_b$ can vary across batches but does not change across rounds within batch $b$. We assume that the log data are generated by a contextual bandit algorithm as follows:

• In each round $t$, $(Y_t(\cdot), X_t)$ is i.i.d. drawn from distribution $G$. Re-order round numbers so that they are monotonically increasing in their batch numbers $b_t$.

• In each round $t$ within batch $b$ and given $\tilde{X}_t = \tilde{x}$, probability vector $p_t$ is drawn from $F_b(\cdot|\tilde{x})$. Action is randomly chosen based on probability vector $p_t$, creating the action choice $D_t$ and the associated reward $Y_t$.

Here, the contextual bandit algorithm $\{F_b\}$ and the realized probability vectors $\{p_t\}$ may or may not be known to the analyst. We also allow the realization of $p_t$ to be degenerate, i.e., a certain action may be chosen with probability 1 at a point in time.

Examples. This DGP allows for many popular bandit algorithms, as the following examples illustrate. In each of the examples below, the contextual bandit algorithm $F_b$ is degenerate and produces a particular probability vector with probability one.

###### Example 1 (Random A/B testing).

We always choose each action uniformly at random: $p_{ta} = \frac{1}{m+1}$ always holds for any $a \in A$ and any $t$.

In the remaining examples, at every batch $b$, the algorithm uses the history of observations from the previous batches to estimate the mean and the variance of the potential reward under each action conditional on each context: $\mu(a|x) \equiv E[Y(a)|X=x]$ and $\sigma^2(a|x) \equiv V[Y(a)|X=x]$. We denote the estimates using the history up to batch $b-1$ by $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$. See Li et al. (2012) and Dimakopoulou, Athey, and Imbens (2017) for possible estimators based on generalized linear models and generalized random forest, respectively. The initial estimates, $\hat{\mu}_0$ and $\hat{\sigma}^2_0$, are set to any values.

###### Example 2 (ϵ-Greedy).

In each round $t$ within batch $b$, we choose the best action based on $\hat{\mu}_{b-1}$ with probability $1-\epsilon_b$ and choose the other actions uniformly at random with probability $\epsilon_b$:

$$p_{ta} = \begin{cases} 1-\epsilon_b & \text{if } a = \operatorname*{argmax}_{a' \in A} \hat{\mu}_{b-1}(a'|\tilde{X}_t) \\ \dfrac{\epsilon_b}{m} & \text{otherwise.} \end{cases}$$
###### Example 3 (Thompson Sampling using Gaussian priors).

In each round $t$ within batch $b$, we sample the potential reward $y_t(a)$ from distribution $N(\hat{\mu}_{b-1}(a|\tilde{X}_t), \hat{\sigma}^2_{b-1}(a|\tilde{X}_t))$ for each action, and choose the action with the highest sampled potential reward, $\operatorname*{argmax}_{a' \in A} y_t(a')$. As a result, this algorithm chooses actions with the following probabilities:

$$p_{ta} = \Pr\left\{a = \operatorname*{argmax}_{a' \in A} y_t(a')\right\},$$

where $y_t \equiv (y_t(0), \ldots, y_t(m))'$, $y_t \sim N(\hat{\mu}_{b-1}(\tilde{X}_t), \hat{\Sigma}_{b-1}(\tilde{X}_t))$, and

$$\hat{\Sigma}_{b-1}(x) = \begin{pmatrix} \hat{\sigma}^2_{b-1}(0|x) & & 0 \\ & \ddots & \\ 0 & & \hat{\sigma}^2_{b-1}(m|x) \end{pmatrix}.$$
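The Thompson Sampling choice probabilities above rarely have a closed form, but they are straightforward to approximate by Monte Carlo. A minimal sketch (the function name and inputs are ours, chosen for illustration):

```python
import numpy as np

def thompson_choice_probs(mu, sigma2, n_draws=100_000, seed=0):
    """Monte Carlo approximation of p_a = Pr{a = argmax_{a'} y(a')},
    where y(a') ~ N(mu[a'], sigma2[a']) independently across actions."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    draws = rng.normal(mu, np.sqrt(sigma2), size=(n_draws, len(mu)))
    winners = draws.argmax(axis=1)  # index of the sampled best arm per draw
    return np.bincount(winners, minlength=len(mu)) / n_draws

# Three arms with estimated mean CTRs of 10%, 12%, 5% and equal variance:
p = thompson_choice_probs([0.10, 0.12, 0.05], [0.01, 0.01, 0.01])
```

The highest-mean arm is sampled most often, but every arm keeps a strictly positive choice probability, which is what makes the logging policy stochastic.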

In Examples 2 and 3, $F_b$ depends on the random realization of the estimates $\hat{\mu}_{b-1}$ and $\hat{\sigma}^2_{b-1}$, and so does the associated $p_t$. If the data are sufficiently large, the uncertainty in the estimates vanishes: $\hat{\mu}_{b-1}$ and $\hat{\sigma}^2_{b-1}$ converge to fixed limits $\bar{\mu}_{b-1}$ and $\bar{\sigma}^2_{b-1}$, respectively. In this case, $F_b$ becomes nonrandom since it depends only on the fixed limits $\bar{\mu}_{b-1}$ and $\bar{\sigma}^2_{b-1}$. In the following analysis, we consider this large-sample scenario and assume that $F_b$ is nonrandom.

To make the notation simpler, we put $\{F_b\}_{b=1}^B$ together into a single distribution $F$ obtained by $F(\cdot|x) = F_b(\cdot|\tilde{x})$ for each $x = (b, \tilde{x})$. We use this $F$ to rewrite our DGP as follows:

• In each round $t$, $(Y_t(\cdot), X_t)$ is i.i.d. drawn from distribution $G$. Given $X_t = x$, probability vector $p_t$ is drawn from $F(\cdot|x)$. Action is randomly chosen based on probability vector $p_t$, creating the action choice $D_t$ and the associated reward $Y_t$.

Define

$$p^0_a(x) \equiv \Pr_{D \sim p,\; p \sim F(\cdot|x)}(D_a = 1 \mid X = x)$$

for each $a \in A$ and $x \in \mathcal{X}$, and let $p^0(x) \equiv (p^0_0(x), \ldots, p^0_m(x))'$. This is the choice probability vector conditional on each context. We call $p^0$ the logging policy or the propensity score.

$p^0$ is common for all rounds regardless of the batch to which they belong. Thus $X_t$ and $D_t$ are i.i.d. across rounds. Because $(Y_t(\cdot), X_t)$ is i.i.d. and $Y_t = \sum_{a=0}^m D_{ta} Y_t(a)$, each observation $(Y_t, X_t, D_t)$ is i.i.d. Note also that $D_t$ is independent of $Y_t(\cdot)$ conditional on $X_t$.

### 2.2 Parameters of Interest

We are interested in using the log data to estimate the expected reward from any given counterfactual policy $\pi$, which chooses a distribution of actions given each context:

$$V^\pi \equiv E_{(Y(\cdot),X)\sim G}\left[\sum_{a=0}^m Y(a)\,\pi(a|X)\right] = E_{(Y(\cdot),X)\sim G,\; D\sim p^0(X)}\left[\sum_{a=0}^m \frac{Y(a)\,D_a\,\pi(a|X)}{p^0_a(X)}\right], \tag{1}$$

where the last equality uses the independence of $D$ and $Y(\cdot)$ conditional on $X$ and the definition of $p^0$. Here, $\pi(a|X)$ is the weight that policy $\pi$ puts on action $a$ given context $X$. We allow the counterfactual policy $\pi$ to be degenerate, i.e., $\pi$ may choose a particular action with probability 1.

Depending on the choice of $\pi$, $V^\pi$ represents a variety of parameters of interest. When we set $\pi(a|X) = 1$ for a particular action $a$ and $\pi(a'|X) = 0$ for all $a' \neq a$ for all $X$, $V^\pi$ equals $E[Y(a)]$, the expected reward from action $a$. When we set $\pi(a|X) = 1$, $\pi(0|X) = -1$, and $\pi(a'|X) = 0$ for all $a' \notin \{0, a\}$ for all $X$, $V^\pi$ equals $E[Y(a) - Y(0)]$, the average treatment effect of action $a$ over action $0$. Such treatment effects are of scientific and policy interest in medical and social sciences. Business and managerial interests also motivate treatment effect estimation. For example, when a company implements a bandit algorithm using a particular reward measure like an immediate purchase, the company is often interested in treatment effects on other outcomes like long-term user retention.
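To make the treatment-effect special case concrete, here is a toy sample analogue of expression (1) with the signed ATE weights; the data values below are made up purely for illustration:

```python
import numpy as np

# Six logged rounds: reward, chosen action (0 or 1), and propensities.
Y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
A = np.array([1, 0, 1, 0, 0, 1])
p = np.full((6, 2), 0.5)                 # uniform logging policy

D = np.eye(2)[A]                         # one-hot action indicators D_ta
pi_ate = np.tile([-1.0, 1.0], (6, 1))    # pi(1|x)=1, pi(0|x)=-1: ATE weights

# Sample analogue of (1): mean over t of sum_a Y_t D_ta pi(a|X_t) / p_a(X_t)
ate_hat = np.mean((Y[:, None] * D * pi_ate / p).sum(axis=1))
# contributions are (2 + 0 + 2 + 0 - 2 + 2) / 6 = 2/3
```

With uniform propensities this reduces to the difference between the inverse-probability-weighted means of the two arms.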

## 3 Efficient Value Estimation

We consider the efficient estimation of the expected reward from a counterfactual policy, $V^\pi$. We consider an estimator consisting of two steps. In the first step, we nonparametrically estimate the propensity score vector $p^0$ by a consistent estimator $\hat{p}$. Possible estimators include machine learning algorithms such as gradient boosting, as well as nonparametric sieve estimators and kernel regression estimators, as detailed in Section 3.2. In the second step, we plug the estimated propensity score into the sample analogue of expression (1) to estimate $V^\pi$ (in practice, some trimming or thresholding may be desirable for numerical stability):

$$\hat{V}^\pi = \frac{1}{T}\sum_{t=1}^{T}\sum_{a=0}^{m} \frac{Y_t\, D_{ta}\,\pi(a|X_t)}{\hat{p}_a(X_t)}.$$

Alternatively, one can use a “self-normalized” estimator inspired by Swaminathan and Joachims (2015b) when $\pi(a|x) \geq 0$ for all $a$ and $x$:

$$\hat{V}^\pi_{SN} = \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\,\pi(a|X_t)}{\hat{p}_a(X_t)}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\,\pi(a|X_t)}{\hat{p}_a(X_t)}}.$$

Swaminathan and Joachims (2015b) suggest that $\hat{V}^\pi_{SN}$ tends to be less biased than $\hat{V}^\pi$ in small samples. Unlike Swaminathan and Joachims (2015b), however, we use the estimated propensity score rather than the true one.
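Both estimators can be sketched in a few vectorized lines (array shapes and function names are ours, not the paper's):

```python
import numpy as np

def ipw_value(Y, D, pi, p_hat):
    """Plug-in estimate of V^pi.
    Y: (T,) rewards; D: (T, m+1) one-hot chosen actions;
    pi: (T, m+1) target-policy weights pi(a|X_t);
    p_hat: (T, m+1) estimated propensities for each round."""
    w = (D * pi / p_hat).sum(axis=1)     # importance weight of round t
    return float((Y * w).mean())

def snipw_value(Y, D, pi, p_hat):
    """Self-normalized variant: divide by the realized mean weight."""
    w = (D * pi / p_hat).sum(axis=1)
    return float((Y * w).sum() / w.sum())
```

In each round only the chosen action's term of the inner sum is nonzero, so these reduce exactly to the formulas above.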

The above estimators estimate a scalar parameter defined as a function of the distribution of $(Y(\cdot), X, D)$, on which we impose no parametric assumption. Our estimators therefore attempt to solve a semiparametric estimation problem, i.e., a partly-parametric and partly-nonparametric estimation problem. For this semiparametric estimation problem, we first derive the semiparametric efficiency bound on how efficient and precise the estimation of the parameter can be, which is a semiparametric analog of the Cramer-Rao bound [Bickel et al.1993]. The asymptotic variance of any asymptotically normal estimator is no smaller than the semiparametric efficiency bound. Following the standard statistics terminology, we say that estimator $\hat{\theta}_T$ for parameter $\theta$ is asymptotically normal if $\sqrt{T}(\hat{\theta}_T - \theta) \xrightarrow{d} N(0, V)$ as $T \to \infty$, where $\xrightarrow{d}$ denotes convergence in distribution and $N(\mu, V)$ denotes a normally distributed random variable with mean $\mu$ and variance $V$. We call $V$ the asymptotic variance of $\hat{\theta}_T$. The semiparametric efficiency bound for $V^\pi$ is a lower bound on the asymptotic variance of asymptotically normal estimators; Appendix A provides a formal definition of the semiparametric efficiency bound.

We show the above estimators achieve the semiparametric efficiency bound, i.e., they minimize the asymptotic variance among all asymptotically normal estimators. Our analysis uses a couple of regularity conditions. We first assume that the logging policy ex ante chooses every action with a positive probability for every context.

###### Assumption 1.

There exists some constant $c > 0$ such that $p^0_a(x) \geq c$ for any $x \in \mathcal{X}$ and for $a = 0, \ldots, m$.

Note that Assumption 1 is consistent with the possibility that the realization of $p_{ta}$ takes on the value 0 or 1 (as long as $p_{ta}$ takes on positive values with a positive probability).

We also assume the existence of finite second moments of potential rewards.

###### Assumption 2.

$E[Y(a)^2] < \infty$ for $a = 0, \ldots, m$.

The following lemma provides the semiparametric efficiency bound for $V^\pi$. All the proofs are in Appendix B.

###### Lemma 1 (Semiparametric Efficiency Bound).

Under Assumptions 1 and 2, the semiparametric efficiency bound for $V^\pi$, the expected reward from counterfactual policy $\pi$, is

$$E\left[\sum_{a=0}^m \frac{V[Y(a)|X]\,\pi(a|X)^2}{p^0_a(X)} + \left(\theta(X) - V^\pi\right)^2\right],$$

where $\theta(x) \equiv \sum_{a=0}^m E[Y(a)|X=x]\,\pi(a|x)$ is the expected reward from policy $\pi$ conditional on $X = x$.

Lemma 1 implies the semiparametric efficiency bounds for the expected reward from each action and for the average treatment effect, since they are special cases of $V^\pi$.

###### Corollary 1.

Suppose that Assumptions 1 and 2 hold. Then, the semiparametric efficiency bound for the expected reward from each action, $E[Y(a)]$, is

$$E\left[\frac{V[Y(a)|X]}{p^0_a(X)} + \left(E[Y(a)|X] - E[Y(a)]\right)^2\right].$$

The semiparametric efficiency bound for the average treatment effect, $E[Y(a) - Y(0)]$, is

$$E\left[\frac{V[Y(0)|X]}{p^0_0(X)} + \frac{V[Y(a)|X]}{p^0_a(X)} + \left(E[Y(a)-Y(0)|X] - E[Y(a)-Y(0)]\right)^2\right].$$

Our proposed estimators are two-step generalized-method-of-moments estimators and are asymptotically normal under some regularity conditions, one of which requires that the convergence rate of $\hat{p}$ be faster than $T^{1/4}$ [Newey1994, Chen2007]. Given the asymptotic normality of the estimators, we find that they achieve the semiparametric efficiency bound, building upon Ackerberg et al. (2014) among others.

###### Theorem 1 (Efficient Estimators).

Suppose that Assumptions 1 and 2 hold and that $\hat{p}$ is a consistent estimator for $p^0$. Then, the asymptotic variances of $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$ achieve the semiparametric efficiency bound for $V^\pi$ (provided in Lemma 1).

### 3.1 Inefficient Value Estimation

In some environments, we know the true $p^0$ or observe the realization of the probability vectors $p_t$. In this case, an alternative way to estimate $V^\pi$ is to use the sample analogue of expression (1) without estimating the propensity score. If we know $p^0$, a possible estimator is

$$\tilde{V}^\pi = \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\,\pi(a|X_t)}{p^0_a(X_t)}.$$

If we observe the realization of $p_t$, we may use

$$\ddot{V}^\pi = \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\,\pi(a|X_t)}{p_{ta}}.$$

When $\pi(a|x) \geq 0$ for all $a$ and $x$, it is again possible to use their self-normalized versions:

$$\tilde{V}^\pi_{SN} = \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\,\pi(a|X_t)}{p^0_a(X_t)}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\,\pi(a|X_t)}{p^0_a(X_t)}}, \qquad \ddot{V}^\pi_{SN} = \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\,\pi(a|X_t)}{p_{ta}}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\,\pi(a|X_t)}{p_{ta}}}.$$

These intuitive estimators turn out to be less efficient than the estimators with the estimated propensity score, as the following result shows.

###### Theorem 2 (Inefficient Estimators).

Suppose that the propensity score $p^0$ is known and we observe the realization of $p_t$. Suppose also that Assumptions 1 and 2 hold and that $\hat{p}$ is a consistent estimator for $p^0$. Then, the asymptotic variances of $\tilde{V}^\pi$ and $\ddot{V}^\pi$ are no smaller than those of $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$. Generically, $\tilde{V}^\pi$ and $\ddot{V}^\pi$ are strictly less efficient than $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$ in the following sense.

1. If the correction term $\alpha(X, D, p^0, \mu)$ (defined in Section 4) is nonzero with positive probability, then the asymptotic variances of $\tilde{V}^\pi$, $\tilde{V}^\pi_{SN}$, and $\ddot{V}^\pi$ are strictly larger than those of $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$.

2. If $p_{ta} \neq p^0_a(X_t)$ with a positive probability, then the asymptotic variances of $\ddot{V}^\pi$ and $\ddot{V}^\pi_{SN}$ are strictly larger than those of $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$.

The condition in Part 1 of Theorem 2 concerns the dominating term in the difference between $\tilde{V}^\pi$ and $\hat{V}^\pi$. The proofs of Theorems 1 and 2 show that the asymptotic variance of $\tilde{V}^\pi$ equals the asymptotic variance of $\hat{V}^\pi$ plus a nonnegative second term involving $\alpha(X, D, p^0, \mu)$. Part 1 of Theorem 2 requires that this second term be not always zero, so that the asymptotic variance of $\tilde{V}^\pi$ is different from that of $\hat{V}^\pi$. As long as the two variances are not the same, $\hat{V}^\pi$ achieves variance reduction.

Part 2 of Theorem 2 requires that $p_{ta} \neq p^0_a(X_t)$ with a positive probability. This means that $p_t$ is not always the same as the true propensity score $p^0(X_t)$, i.e., $F$ is not degenerate (recall that $p_t$ is drawn from $F$, whose expected value is $p^0$). Under this condition, $\ddot{V}^\pi$ has a strictly larger asymptotic variance than $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$.

Theorems 1 and 2 suggest that we should use an estimated propensity score regardless of whether the true score is known. To develop some intuition for this result, consider a simple situation where the context always takes some constant value $x$. Suppose that we are interested in estimating the expected reward from action $a$, $E[Y(a)]$. Since $X_t$ is constant across rounds, a natural nonparametric estimator for $p^0_a(x)$ is the proportion of rounds in which action $a$ was chosen: $\hat{p}_a(x) = \frac{1}{T}\sum_{t=1}^T D_{ta}$. The estimator using the estimated propensity score is

$$\hat{V}^\pi = \frac{1}{T}\sum_{t=1}^T \frac{Y_t\, D_{ta}}{\hat{p}_a(x)} = \frac{1}{\sum_{t=1}^T D_{ta}}\sum_{t=1}^T Y_t\, D_{ta}.$$

The estimator using the true propensity score is

$$\tilde{V}^\pi = \frac{1}{T}\sum_{t=1}^T \frac{Y_t\, D_{ta}}{p^0_a(x)} = \frac{1}{T\, p^0_a(x)}\sum_{t=1}^T Y_t\, D_{ta}.$$

When action $a$ happens to be chosen frequently in a sample so that $\sum_{t=1}^T D_{ta}$ is larger, the absolute value of $\sum_{t=1}^T Y_t D_{ta}$ tends to be larger in the sample. Because of this positive correlation between the denominator and the absolute value of the numerator, $\hat{V}^\pi$ has a smaller variance than $\tilde{V}^\pi$, which produces no such correlation between the numerator and the denominator. Similar intuition applies to the comparison between $\hat{V}^\pi$ and $\ddot{V}^\pi$.
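This intuition is easy to check by simulation in the constant-context case; the numbers below (true score 0.3, mean reward 0.5, 500 rounds, 2000 replications) are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p0, mu, T, reps = 0.3, 0.5, 500, 2000   # true score, E[Y(a)], rounds, replications
est_true, est_hat = [], []
for _ in range(reps):
    D = rng.random(T) < p0              # whether action a is chosen each round
    Y = (rng.random(T) < mu) & D        # Bernoulli reward, observed only if chosen
    est_true.append(Y.sum() / (T * p0))         # tilde-V: true propensity score
    est_hat.append(Y.sum() / max(D.sum(), 1))   # hat-V: estimated (empirical) score
var_true, var_hat = np.var(est_true), np.var(est_hat)
```

Across replications both estimators are centered near the truth, but the version that divides by the realized action count has visibly smaller variance, exactly as the correlation argument predicts.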

### 3.2 How to Estimate Propensity Scores?

There are several options for the first step estimation of the propensity score.

1. A sieve Least Squares (LS) estimator:

$$\hat{p}_a(\cdot) = \operatorname*{argmin}_{p_a(\cdot) \in H_{aT}} \frac{1}{T}\sum_{t=1}^T \left(D_{ta} - p_a(X_t)\right)^2,$$

where $H_{aT}$ is a sieve space of linear combinations of known basis functions defined on $\mathcal{X}$, whose dimension grows as $T \to \infty$.

2. A sieve Logit Maximum Likelihood estimator:

$$\hat{p}(\cdot) = \operatorname*{argmax}_{p(\cdot) \in H_T} \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m D_{ta}\log p_a(X_t),$$

where $H_T$ is a sieve space of multinomial logit transformations of some basis functions defined on $\mathcal{X}$.

3. Prediction of $D$ given $X$ by a modern machine learning algorithm like random forest, ridge logistic regression, or gradient boosting.

The above estimators are known to be consistent with a convergence rate faster than $T^{1/4}$ under regularity conditions [Newey1997, Cattaneo2010, Knight and Fu2000, Blanchard, Lugosi, and Vayatis2003, Bühlmann and Van De Geer2011, Wager and Athey2018].

How should one choose a propensity score estimator? We prefer an estimated score to the true one because it corrects the discrepancy between the realized action assignment in the data and the assignment predicted by the true score. To achieve this goal, a good propensity score estimator should fit the data better than the true score does, which means the estimator should overfit to some extent. As a concrete example, in our empirical analysis, random forest produces a larger (worse) variance than gradient boosting and ridge logistic regression (see Figure 1 and Table 1). This is because random forest fits the data worse, due to its bagging, which prevents it from overfitting. In general, however, we do not know in advance which propensity score estimator achieves the best degree of overfitting. We therefore suggest that the analyst try several estimators to determine which one is most efficient.
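As a concrete stand-in for option 2 (or for the library-based ML algorithms of option 3), the following fits a multinomial-logit propensity model by plain gradient ascent; the function name and defaults are ours, and in practice one would use an off-the-shelf implementation:

```python
import numpy as np

def fit_logit_propensity(X, A, n_actions, lr=0.5, n_iter=2000):
    """Fit p_a(x) = softmax(Phi(x) W)_a by maximizing the sample
    log-likelihood (1/T) sum_t sum_a D_ta log p_a(X_t)."""
    T = len(A)
    Phi = np.hstack([np.ones((T, 1)), X])       # intercept + raw features
    W = np.zeros((Phi.shape[1], n_actions))
    onehot = np.eye(n_actions)[A]               # D_ta indicators
    for _ in range(n_iter):
        Z = Phi @ W
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W += lr * Phi.T @ (onehot - P) / T      # log-likelihood gradient step
    Z = Phi @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)     # estimated p_a(X_t) per round
```

Replacing the intercept-plus-raw-features matrix with a growing set of basis functions gives the sieve version; swapping the whole fit for a boosted classifier gives option 3.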

## 4 Estimating Asymptotic Variance

We often need to estimate the asymptotic variance of the above estimators. For example, variance estimation is crucial for determining whether a counterfactual policy is statistically significantly better than the logging policy. We propose an estimator that uses the sample analogue of an expression of the asymptotic variance. As shown in the proof of Theorem 1, the asymptotic variance of $\hat{V}^\pi$ and $\hat{V}^\pi_{SN}$ is $E[(g(Y,X,D,V^\pi,p^0) + \alpha(X,D,p^0,\mu))^2]$, where $\mu(a|x) \equiv E[Y \mid D_a = 1, X = x]$ and, for each $(Y, X, D)$, $\theta$, and $p$,

$$g(Y,X,D,\theta,p) = \sum_{a=0}^m \frac{Y\, D_a\,\pi(a|X)}{p_a(X)} - \theta,$$

and

$$\alpha(X,D,p,\mu) = -\sum_{a=0}^m \frac{\mu(a|X)\,\pi(a|X)}{p_a(X)}\left(D_a - p_a(X)\right).$$

We estimate this asymptotic variance in two steps. In the first step, we obtain estimates $\hat{V}^\pi$ (or $\hat{V}^\pi_{SN}$) and $\hat{p}$ using the method in Section 3. In addition, we estimate $\mu(a|\cdot)$ by nonparametric regression of $Y_t$ on $X_t$ using the subsample with $D_{ta} = 1$ for each $a$. Denote the estimate by $\hat{\mu}$. For this regression, one may use a sieve Least Squares estimator or machine learning algorithms. In the empirical application below, we use ridge logistic regression.

In the second step, we plug the estimates of $V^\pi$, $p^0$, and $\mu$ into the sample analogue of the asymptotic variance expression. When we use $\hat{V}^\pi$:

$$\widehat{AVar}(\hat{V}^\pi) = \frac{1}{T}\sum_{t=1}^T \left(g(Y_t, X_t, D_t, \hat{V}^\pi, \hat{p}) + \alpha(X_t, D_t, \hat{p}, \hat{\mu})\right)^2.$$

When we use $\hat{V}^\pi_{SN}$, its asymptotic variance estimator is obtained by replacing $\hat{V}^\pi$ with $\hat{V}^\pi_{SN}$ in the above expression.

This asymptotic variance estimator is a two-step generalized-method-of-moments estimator, and is shown to be consistent under the condition that the first-step estimator of $p^0$ is consistent, plus some regularity conditions [Newey1994].
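The plug-in variance estimate and the resulting confidence interval can be sketched as follows (self-contained; the helper names are ours):

```python
import numpy as np

def avar_hat(Y, D, pi, p_hat, mu_hat, V_hat):
    """Sample analogue of E[(g + alpha)^2] for the efficient estimator.
    mu_hat: (T, m+1) estimates of mu(a|X_t) = E[Y | D_a = 1, X_t]."""
    g = (Y[:, None] * D * pi / p_hat).sum(axis=1) - V_hat
    alpha = -(mu_hat * pi / p_hat * (D - p_hat)).sum(axis=1)
    return float(np.mean((g + alpha) ** 2))

def conf_interval95(V_hat, avar, T):
    """95% interval: V_hat +/- 1.96 * sqrt(avar / T)."""
    half = 1.96 * np.sqrt(avar / T)
    return V_hat - half, V_hat + half
```

Intervals like these, computed round-by-round from the log data, are what Figure 1 compares across propensity score estimators.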

It is easier to estimate the asymptotic variance of $\tilde{V}^\pi$ and $\tilde{V}^\pi_{SN}$, which use the true propensity score. Their asymptotic variance is $E[g(Y,X,D,V^\pi,p^0)^2]$ by the standard central limit theorem. When we use $\tilde{V}^\pi$, we estimate this asymptotic variance by

$$\widehat{AVar}(\tilde{V}^\pi) = \frac{1}{T}\sum_{t=1}^T g(Y_t, X_t, D_t, \tilde{V}^\pi, p^0)^2.$$

When we use $\tilde{V}^\pi_{SN}$, its asymptotic variance estimator is obtained by replacing $\tilde{V}^\pi$ with $\tilde{V}^\pi_{SN}$ in the above expression.

## 5 Real-World Application

We apply our estimators described in Sections 3 and 4 to empirically evaluate and optimize the design of online advertisements. This application uses proprietary data provided by CyberAgent Inc., which we described in the introduction. This company uses a contextual bandit algorithm to determine the visual design of advertisements assigned to user impressions (there are four design choices). This algorithm produces logged bandit data. We use this logged bandit data and our estimators to improve the advertisement design for maximizing the click through rates (CTR). In the notation of our theoretical framework, the reward $Y$ is a click, the action $a$ is one of the four possible advertisement designs, and the context $X$ consists of the user and ad characteristics used by the company’s logging policy.

The logging policy (the company’s existing contextual bandit algorithm) works as follows. For each round, the logging policy first randomly samples each action’s predicted reward from a beta distribution. This beta distribution is parametrized by the predicted CTR for each context, where the CTR prediction is based on a Factorization Machine [Rendle2010]. The logging policy then chooses the action (advertisement) with the largest sampled reward prediction. The logging policy and the underlying CTR prediction stay the same for all rounds in each day. Each day therefore performs the role of a batch in the model in Section 2. This somewhat nonstandard logging policy and the resulting log data are an example of our DGP in Section 2.

This logging policy may have room for improvement for several reasons. First, the logging policy randomly samples advertisements and does not necessarily choose the advertisement with the best predicted CTR. Also, the logging policy uses a predictive Factorization Machine for its CTR prediction, which may be different from the causal CTR (the causal effect of each advertisement on the probability of a click).

To improve on the logging policy, we first estimate the propensity score by random forest, ridge logistic regression, or gradient boosting (implemented by XGBoost). These estimators are known to satisfy the regularity conditions (e.g. consistency) required for our theoretical results, as explained in Section 3.2.

With the estimated propensity score, we then use our estimator to estimate the expected reward from two possible policies: (1) the logging policy and (2) a counterfactual policy that chooses the best action (advertisement) predicted to maximize the CTR conditional on each context. To implement this counterfactual policy, we estimate $E[Y(a)|X=x]$ by ridge logistic regression for each action $a$ and each context $x$ used by the logging policy (we apply one-hot encoding to categorical variables in $X$). Given each context $x$, the counterfactual policy then chooses the action with the highest estimated value of $E[Y(a)|X=x]$.

Importantly, we use separate data sets for the two estimation tasks (one for the best actions and the other for the expected reward from the hypothetical policy). Specifically, we use data logged during April 20-26, 2018 for estimating the best actions and data during April 27-29 for estimating the expected reward. This data separation allows us to avoid overfitting and overestimation of the CTR gains from the counterfactual policy.
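The counterfactual policy construction above (fit per-action CTR predictions on the earlier data split, then pick the argmax for each context) can be sketched as follows; this is a hypothetical helper of our own, not the company's code:

```python
import numpy as np

def argmax_policy(q_hat):
    """Turn per-round, per-action reward predictions q_hat of shape
    (T, m+1) -- fit on a SEPARATE, earlier data split -- into a
    degenerate policy matrix: pi(a|X_t) = 1 for the arm with the best
    predicted CTR in round t, and 0 otherwise."""
    pi = np.zeros_like(q_hat, dtype=float)
    pi[np.arange(len(q_hat)), np.argmax(q_hat, axis=1)] = 1.0
    return pi

# Two evaluation rounds and two designs (0-indexed):
pi = argmax_policy(np.array([[0.01, 0.03], [0.04, 0.02]]))
```

The resulting matrix can be fed directly into the value and variance estimators of Sections 3 and 4; the split between fitting rounds and evaluation rounds is what prevents the argmax step from overfitting the same data used for evaluation.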

As a benchmark, we also estimate the same expected rewards based on the self-normalized estimator of Swaminathan and Joachims (2015b), $\tilde{V}^\pi_{SN}$, which uses the true propensity score. The resulting estimates show the following result:

Empirical Result A. Consistent with Theorems 1-2, our estimator with the estimated score is statistically more efficient than the benchmark with the true score.

This result is reported in Figure 1 and Table 1, where the confidence intervals about the predicted CTR using “True Propensity Score (Benchmark)” are less precise (wider) than those using estimated propensity scores (regardless of which of the three score estimators is used). The magnitude of this shrinkage in the confidence intervals and standard errors is 6-34%, depending on how the propensity score is estimated.

This variance reduction allows us to conclude that the logging policy’s CTR is below the lower bound of the confidence interval of the hypothetical policy, giving us confidence in the following implication:

Empirical Result B. Compared to the logging policy, the hypothetical policy (choosing the best advertisement given each context) improves the CTR by 10-15% statistically significantly at the 5% significance level.

## 6 Conclusion

We have investigated the most statistically efficient use of batch bandit data for estimating the expected reward from a counterfactual policy. Our estimators minimize the asymptotic variance among all asymptotically normal estimators (Theorem 1). By contrast, standard estimators have larger asymptotic variances (Theorem 2).

We have also applied our estimators to improve online advertisement design. Compared to the frontier benchmark $\tilde{V}^\pi_{SN}$, our reward estimator provides the company with more statistical confidence in how to improve on its existing bandit algorithm (Empirical Results A and B). The hypothetical policy of choosing the best advertisement given user characteristics would improve the click through rate by 10-15% at the 5% significance level. These empirical results thus highlight the practical value of Theorems 1-2.

Acknowledgments. We are grateful to seminar participants at ICML/IJCAI/AAMAS Workshop “Machine Learning for Causal Inference, Counterfactual Prediction, and Autonomous Action (CausalML)” and RIKEN Center for Advanced Intelligence Project, especially Junya Honda, Masaaki Imaizumi, Atsushi Iwasaki, Kohei Kawaguchi, and Junpei Komiyama.

## References

• [Ackerberg et al.2014] Ackerberg, D.; Chen, X.; Hahn, J.; and Liao, Z. 2014. Asymptotic Efficiency of Semiparametric Two-step GMM. Review of Economic Studies 81(3):919–943.
• [Agarwal et al.2017] Agarwal, A.; Basu, S.; Schnabel, T.; and Joachims, T. 2017. Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers. KDD 687–696.
• [Bickel et al.1993] Bickel, P. J.; Klaassen, C. A. J.; Ritov, Y.; and Wellner, J. A. 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
• [Blanchard, Lugosi, and Vayatis2003] Blanchard, G.; Lugosi, G.; and Vayatis, N. 2003. On the Rate of Convergence of Regularized Boosting Classifiers. Journal of Machine Learning Research 4(Oct):861–894.
• [Bottou et al.2013] Bottou, L.; Peters, J.; Quiñonero-Candela, J.; Charles, D. X.; Chickering, D. M.; Portugaly, E.; Ray, D.; Simard, P.; and Snelson, E. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14(1):3207–3260.
• [Bühlmann and Van De Geer2011] Bühlmann, P., and Van De Geer, S. 2011. Statistics for High-dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
• [Cattaneo2010] Cattaneo, M. D. 2010. Efficient Semiparametric Estimation of Multi-valued Treatment Effects under Ignorability. Journal of Econometrics 155(2):138–154.
• [Chen, Hong, and Tarozzi2008] Chen, X.; Hong, H.; and Tarozzi, A. 2008. Semiparametric Efficiency in GMM Models with Auxiliary Data. Annals of Statistics 36(2):808–843.
• [Chen2007] Chen, X. 2007. Large Sample Sieve Estimation of Semi-nonparametric Models. Handbook of Econometrics 6:5549–5632.
• [Dimakopoulou, Athey, and Imbens2017] Dimakopoulou, M.; Athey, S.; and Imbens, G. 2017. Estimation Considerations in Contextual Bandits. ArXiv.
• [Dudík et al.2014] Dudík, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly Robust Policy Evaluation and Optimization. Statistical Science 29:485–511.
• [Hahn1998] Hahn, J. 1998. On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects. Econometrica 66(2):315–331.
• [Hirano, Imbens, and Ridder2003] Hirano, K.; Imbens, G. W.; and Ridder, G. 2003. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. Econometrica 71(4):1161–1189.
• [Horvitz and Thompson1952] Horvitz, D. G., and Thompson, D. J. 1952. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47(260):663–685.
• [Knight and Fu2000] Knight, K., and Fu, W. 2000. Asymptotics for Lasso-type Estimators. Annals of Statistics 1356–1378.
• [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-bandit Approach to Personalized News Article Recommendation. WWW 661–670.
• [Li et al.2011] Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. WSDM 297–306.
• [Li et al.2012] Li, L.; Chu, W.; Langford, J.; Moon, T.; and Wang, X. 2012. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. Journal of Machine Learning Research: Workshop and Conference Proceedings 26:19–36.
• [Li, Munos, and Szepesvári2015] Li, L.; Munos, R.; and Szepesvári, C. 2015. Toward Minimax Off-policy Value Estimation. AISTATS 608–616.
• [Li2015] Li, L. 2015. Offline Evaluation and Optimization for Interactive Systems. WSDM.
• [Munos et al.2016] Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and Efficient Off-policy Reinforcement Learning. NIPS 1054–1062.
• [Newey1990] Newey, W. K. 1990. Semiparametric Efficiency Bounds. Journal of Applied Econometrics 5(2):99–135.
• [Newey1994] Newey, W. K. 1994. The Asymptotic Variance of Semiparametric Estimators. Econometrica 62(6):1349–1382.
• [Newey1997] Newey, W. K. 1997. Convergence Rates and Asymptotic Normality for Series Estimators. Journal of Econometrics 79(1):147–168.
• [Rendle2010] Rendle, S. 2010. Factorization Machines. ICDM 995–1000.
• [Rosenbaum and Rubin1983] Rosenbaum, P. R., and Rubin, D. B. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70(1):41–55.
• [Strehl et al.2010] Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from Logged Implicit Exploration Data. NIPS 2217–2225.
• [Swaminathan and Joachims2015a] Swaminathan, A., and Joachims, T. 2015a. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. Journal of Machine Learning Research 16:1731–1755.
• [Swaminathan and Joachims2015b] Swaminathan, A., and Joachims, T. 2015b. The Self-normalized Estimator for Counterfactual Learning. NIPS 3231–3239.
• [Swaminathan et al.2017] Swaminathan, A.; Krishnamurthy, A.; Agarwal, A.; Dudik, M.; Langford, J.; Jose, D.; and Zitouni, I. 2017. Off-policy Evaluation for Slate Recommendation. NIPS 3635–3645.
• [Thomas and Brunskill2016] Thomas, P., and Brunskill, E. 2016. Data-efficient Off-policy Policy Evaluation for Reinforcement Learning. ICML 2139–2148.
• [Thomas, Theocharous, and Ghavamzadeh2015] Thomas, P. S.; Theocharous, G.; and Ghavamzadeh, M. 2015. High-Confidence Off-Policy Evaluation. AAAI 3000–3006.
• [Wager and Athey2018] Wager, S., and Athey, S. 2018. Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests. Journal of the American Statistical Association 113(523):1228–1242.
• [Wang, Agarwal, and Dudik2017] Wang, Y.-X.; Agarwal, A.; and Dudik, M. 2017. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. ICML 3589–3597.

## Appendix A Defining Semiparametric Efficiency Bound

We present the definition of the semiparametric efficiency bound, following Bickel et al. (1993). Let $X_1, \dots, X_n$ be an i.i.d. sample from the probability distribution $P$ on $(\mathcal{X}, \mathcal{A})$, where $\mathcal{X}$ is some Euclidean sample space and $\mathcal{A}$ is its Borel $\sigma$-field. Let $\mu$ be a fixed $\sigma$-finite measure on $(\mathcal{X}, \mathcal{A})$, and let $\mathcal{M}$ be the collection of all probability measures dominated by $\mu$. Consider a subset $\mathcal{P}$ of $\mathcal{M}$ such that $P_0 \in \mathcal{P}$, and a parameter $v : \mathcal{P} \to \mathbb{R}$.

We first define a *regular parametric model*. Consider a subset $\mathcal{Q}$ of $\mathcal{M}$ that has a parametrization such that

$$\mathcal{Q} = \{P_\theta : \theta \in \Theta\},$$

where $\Theta$ is a subset of $\mathbb{R}^k$. Let $p(\cdot\,; \theta) = dP_\theta / d\mu$, a density of $P_\theta$, and $s(\cdot\,; \theta) = \sqrt{p(\cdot\,; \theta)}$. In the following, $L_2(\mu)$ is the Hilbert space of $\mu$-square-integrable functions, $|\cdot|$ is the Euclidean norm, and $\|\cdot\|$ is the Hilbert norm in $L_2(\mu)$: $\|f\| = (\int f^2 \, d\mu)^{1/2}$.

###### Definition 1 (Definition 2.1.1 in Bickel et al. (1993)).

$\theta_0$ is a regular point of the parametrization if $\theta_0$ is an interior point of $\Theta$, and

1. The map $\theta \mapsto s(\theta)$ from $\Theta$ to $L_2(\mu)$ is Fréchet differentiable at $\theta_0$: there exists a vector $\dot{s}(\theta_0) \in L_2(\mu)^k$ such that

   $$\|s(\theta_0 + h) - s(\theta_0) - \dot{s}(\theta_0)' h\| = o(|h|) \quad \text{as } h \to 0.$$

2. The matrix $I(\theta_0) = 4 \int \dot{s}(\theta_0) \dot{s}(\theta_0)' \, d\mu$ is nonsingular.

###### Definition 2 (Definition 2.1.2 in Bickel et al. (1993)).

A parametrization $\theta \mapsto P_\theta$ is regular if:

1. Every point of $\Theta$ is a regular point of the parametrization.

2. The map $\theta \mapsto \dot{s}(\theta)$ is continuous from $\Theta$ to $L_2(\mu)^k$.

We call $\mathcal{Q}$ a regular parametric model if it has a regular parametrization.

Now let $q(\theta) = v(P_\theta)$. Fix $\theta$ and suppose $q$ has a total differential vector $\dot{q}(\theta)$ at $\theta$. Define

$$I^{-1}(P_\theta \,|\, v, \mathcal{Q}) = \dot{q}(\theta) I^{-1}(\theta) \dot{q}(\theta)',$$

where

$$I(\theta) = 4 \int \dot{s}(\theta) \dot{s}(\theta)' \, d\mu.$$
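For intuition (a standard identity, not specific to this paper): since $p(\cdot\,;\theta) = s(\cdot\,;\theta)^2$, the chain rule gives $\dot{s}(\theta) = \tfrac{1}{2}\,\dot{p}(\theta)/\sqrt{p(\theta)}$ wherever $p > 0$, so $I(\theta)$ is the classical Fisher information matrix:

```latex
4\int \dot{s}(\theta)\dot{s}(\theta)'\,d\mu
  = \int \frac{\dot{p}(\theta)\,\dot{p}(\theta)'}{p(\theta)}\,d\mu
  = \int \big(\partial_\theta \log p\big)\big(\partial_\theta \log p\big)'\,p\,d\mu
  = \mathbb{E}_\theta\!\left[\dot{\ell}(\theta)\dot{\ell}(\theta)'\right],
```

where $\dot{\ell}(\theta)$ denotes the score of the log-likelihood.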

Suppose that there exists a regular parametric model $\mathcal{Q} \subset \mathcal{P}$ that contains $P_0$.

###### Definition 3.

The semiparametric efficiency bound for $v$ is defined by

$$I^{-1}(P_0 \,|\, v, \mathcal{P}) \equiv \sup\bigl\{ I^{-1}(P_0 \,|\, v, \mathcal{Q}) : \mathcal{Q} \subset \mathcal{P} \text{ and } \mathcal{Q} \text{ is a regular parametric model that contains } P_0 \bigr\}.$$
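In the simplest fully parametric case the supremum is attained by the model itself and Definition 3 reduces to the Cramér-Rao bound. For a one-parameter Bernoulli model with $v(P_\theta) = \theta$, the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so the bound is $\theta(1-\theta)$, attained by the sample mean. A quick numerical illustration (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli(theta) with v(P) = theta: the efficiency bound is
# I^{-1}(theta) = theta * (1 - theta), attained by the sample mean.
theta, n, n_reps = 0.3, 1_000, 5_000
bound = theta * (1 - theta)

samples = rng.binomial(1, theta, size=(n_reps, n))
est = samples.mean(axis=1)       # the efficient estimator
scaled_var = n * est.var()       # should approach the bound
print(f"n * Var(theta_hat) = {scaled_var:.4f}, bound = {bound:.4f}")
```

The scaled variance of the sample mean approaches the bound, illustrating that an efficient estimator attains it.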

## Appendix B Proofs

Proof of Lemma 1. The derivation of the semiparametric efficiency bound follows the approach of Hahn (1998), Hirano, Imbens, and Ridder (2003), Chen, Hong, and Tarozzi (2008), Cattaneo (2010), and Newey (1990). The proof proceeds in four steps: (i) characterize the tangent set for all regular parametric submodels, (ii) verify that the parameter of interest is pathwise differentiable, (iii) verify that the efficient influence function lies in the tangent set, and (iv) calculate the expected square of the influence function.

Consider a regular parametric submodel of the joint distribution of $(Y, D, X)$ with parameter $\beta$ and the likelihood given by

$$f(y, d, x; \beta) = \Bigl\{ \prod_{a=0}^{m} \bigl[ f_a(y | x; \beta) \, p_a(x; \beta) \bigr]^{d_a} \Bigr\} f_X(x; \beta),$$

where $f_a(y | x; \beta)$ is the conditional density of $Y_a$ given $X = x$, $p_a(x; \beta) = \Pr(D_a = 1 | X = x; \beta)$, and $f_X(x; \beta)$ is the density of $X$. The log-likelihood function is

$$\log f(y, d, x; \beta) = \sum_{a=0}^{m} d_a \bigl[ \log f_a(y | x; \beta) + \log p_a(x; \beta) \bigr] + \log f_X(x; \beta).$$