1 Introduction
Interactive reinforcement learning (RL) and bandit systems (e.g. personalized education and medicine, ad/news/recommendation/search platforms) produce log data valuable for evaluating and redesigning the systems. For example, the logs of a news recommendation system record which news article was presented and whether the user read it, giving the system designer a chance to make its recommendation more relevant.
Exploiting log data is, however, more difficult than conventional supervised machine learning: the result of each log is only observed for the action chosen by the system (e.g. the presented news) but not for all the other actions the system could have taken. Moreover, the log entries are biased in that the logs overrepresent actions favored by the system.
A potential solution to this problem is an A/B test that compares the performance of counterfactual systems. However, A/B testing counterfactual systems is often technically or managerially infeasible, since deploying a new policy is time- and money-consuming and entails a risk of failure.
This leads us to the problem of counterfactual (off-policy, offline) evaluation and learning, where one aims to use batch data collected by a logging policy to estimate the value of a counterfactual policy or algorithm without employing it. Such evaluation allows us to compare the performance of counterfactual policies to decide which policy should be deployed in the field. This approach thus avoids the above problem with naive A/B testing. Key prior studies include Li et al. (2010); Strehl et al. (2010); Li et al. (2011, 2012); Swaminathan and Joachims (2015a, b); Wang et al. (2017); Swaminathan et al. (2017); Dimakopoulou et al. (2018); Narita et al. (2019) for bandit algorithms, and Precup et al. (2000, 2001); Bottou et al. (2013); Thomas et al. (2015); Jiang and Li (2016); Munos et al. (2016); Thomas and Brunskill (2016); Gu et al. (2017); Liu et al. (2018); Farajtabar et al. (2018); Hanna et al. (2019); Irpan et al. (2019); Kallus and Uehara (2019) for RL algorithms.
Method. For off-policy evaluation with log data of RL feedback, this paper develops and empirically implements a novel technique with desirable theoretical properties. To do so, we consider a class of RL algorithms, including contextual bandit algorithms as important special cases. This class includes most of the widely used algorithms such as (deep) Q-learning, Actor Critic, contextual $\epsilon$-Greedy and Thompson Sampling, as well as their non-contextual analogs and random A/B testing. We allow the logging policy to be an unknown function of numerous potentially important state variables. This feature is salient in real-world applications. We also allow the evaluation target policy to be degenerate, again a key feature of real-life situations.
We consider an offline estimator for the expected reward from a counterfactual policy. Our estimator integrates a well-known doubly robust (DR) estimator (Rotnitzky and Robins (1995) and modern studies cited above) with “double/debiased machine learning” developed in econometrics and statistics (Chernozhukov et al., 2018a, b). Building upon these prior studies, we show the following result:
Theoretical Result. Our estimator is “consistent” in the sense that its prediction converges in probability to the true performance of a counterfactual policy at the rate $N^{-1/2}$ as the sample size $N$ increases. Our estimator is also “asymptotically normal.” We provide a consistent estimator of its asymptotic variance, thus allowing for measuring statistical uncertainty in our prediction.
For many special cases, including all contextual bandit algorithms, our estimator is shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. These theoretical properties suggest that our estimator is safe to use. In contrast, existing estimators lack or have not been shown to have these theoretical guarantees.
Application. We empirically apply our estimator to evaluate and optimize the design of online advertisement formats. Our application is based on proprietary data provided by CyberAgent Inc., the second largest Japanese advertisement company with about 5 billion USD market capitalization (as of February 2020). This company uses randomly chosen bandit algorithms to determine the visual design of advertisements assigned to users. This A/B test of randomly choosing an algorithm produces logged data and the ground truth for the performance of alternative algorithms.
We use this data to examine the performance of our proposed method. Specifically, we use the log data from an algorithm to predict the click-through rates (CTR) of another algorithm, and then assess the accuracy of our prediction by comparing it with the ground truth. This exercise shows the following:
Empirical Result. Our estimator produces smaller mean squared errors than widely used benchmark methods in the spirit of Jiang and Li (2016) and Thomas and Brunskill (2016).
This result is reported in Figure 1, where the mean squared errors using our estimator (Double Machine Learning; colored red) are lower than others using existing estimators (Inverse Probability Weighting or Doubly Robust). This improvement is statistically significant at the 5% level. Importantly, this result holds regardless of whether we know the data-generating logging policy or not; these two scenarios correspond to the figure’s two panels, respectively. This finding shows that our estimator can substantially reduce bias and uncertainty we face in real-world decision-making.
This improved performance motivates us to use our estimator to optimize the advertisement design for maximizing the CTR. We estimate how much the CTR would be improved by a counterfactual policy of choosing the best action (advertisement) for each context (user characteristics). This exercise produces the following bottom line: our estimator predicts that the hypothetical policy would improve the CTR by a statistically significant 30% (compared to the logging policy) in one of the three campaigns we analyze. Our approach thus generates valuable managerial conclusions.
2 Setup
2.1 Data Generating Process
We consider historical data of trajectories that follow a Markov Decision Process (MDP) as a mathematical description of RL and bandit algorithms. Specifically, we closely follow the setups of Jiang and Li (2016), Thomas and Brunskill (2016) and Farajtabar et al. (2018).

An MDP is given by $M = \langle \mathcal{S}, \mathcal{A}, P_{S_0}, P_S, P_R \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P_{S_0}$ is the initial state distribution, $P_S$ is the transition function with $P_S(s' \mid s, a)$ being the probability of seeing state $s'$ after taking action $a$ given state $s$, and $P_R$ is the conditional distribution of the immediate reward with $P_R(\cdot \mid s, a)$ being the immediate reward distribution conditional on the state and action being $(s, a)$. Given $P_R$, we define the mean reward function as $\mu(s, a) = \int r \, dP_R(r \mid s, a)$, and define the reward variance function as $\sigma_R^2(s, a) = \int (r - \mu(s, a))^2 \, dP_R(r \mid s, a)$. For simplicity, we assume that the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are finite.
We call a function $\pi$ a policy, which assigns each state $s \in \mathcal{S}$ a distribution over actions, where $\pi(a \mid s)$ is the probability of taking action $a$ when the state is $s$. Let $H = (S_0, A_0, R_0, \dots, S_T, A_T, R_T)$ be a trajectory, where $S_t$, $A_t$ and $R_t$ are the state, the action, and the reward in step $t$, respectively, and $T$ is the number of steps. We say that a trajectory $H$ is generated by a policy $\pi$, or $H \sim \pi$ in short, if $H$ is generated by the following process:

The initial state $S_0$ is drawn from the initial distribution $P_{S_0}$. Given $S_0$, the action $A_0$ is randomly chosen based on $\pi(\cdot \mid S_0)$. The reward $R_0$ is drawn from the conditional reward distribution $P_R(\cdot \mid S_0, A_0)$.

For $t = 1, \dots, T$, the state $S_t$ is determined based on the transition function $P_S(\cdot \mid S_{t-1}, A_{t-1})$. Given $S_t$, the action $A_t$ is randomly chosen based on $\pi(\cdot \mid S_t)$. The reward $R_t$ is drawn from the conditional reward distribution $P_R(\cdot \mid S_t, A_t)$.
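The two-step process above can be sketched in a few lines of Python. This is only an illustration for a finite MDP: the function names, the dictionary layout, and the toy numbers are assumptions made here, not part of the paper.

```python
import random


def draw(dist, rng):
    """Sample a key of `dist`, a dict mapping value -> probability."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_trajectory(p_s0, p_s, reward_fn, policy, horizon, rng):
    """Return [(S_0, A_0, R_0), ..., (S_T, A_T, R_T)] with T = horizon."""
    trajectory = []
    state = draw(p_s0, rng)                          # S_0 ~ initial distribution
    for t in range(horizon + 1):
        action = draw(policy[state], rng)            # A_t ~ pi(. | S_t)
        reward = reward_fn(state, action, rng)       # R_t ~ P_R(. | S_t, A_t)
        trajectory.append((state, action, reward))
        if t < horizon:
            state = draw(p_s[(state, action)], rng)  # S_{t+1} ~ P_S(. | S_t, A_t)
    return trajectory


# Toy two-state, two-action MDP; all numbers are made up for illustration.
states, actions = ("low", "high"), ("x", "y")
p_s0 = {"low": 0.5, "high": 0.5}
p_s = {(s, a): {"low": 0.5, "high": 0.5} for s in states for a in actions}
policy = {"low": {"x": 0.9, "y": 0.1}, "high": {"x": 0.2, "y": 0.8}}


def click(state, action, rng):
    """Bernoulli reward whose mean depends on the (state, action) pair."""
    mean = 0.7 if (state, action) == ("high", "y") else 0.3
    return float(rng.random() < mean)


traj = sample_trajectory(p_s0, p_s, click, policy, horizon=3, rng=random.Random(0))
```

Calling `sample_trajectory` repeatedly with the same policy yields iid trajectories, which is the form of historical data assumed below.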
Suppose that we observe historical data $\{H_i\}_{i=1}^N$ where trajectories are independently generated by a fixed behavior policy $\pi_b$, i.e., $H_i \sim \pi_b$ independently across $i$. The historical data is a collection of iid trajectories. Importantly, we allow the components of the data generating process $M$ and $\pi_b$ to vary with the sample size $N$. Specifically, let $M_N$ and $\pi_{bN}$ be the MDP and the behavior policy, respectively, when the sample size is $N$, and let $P_N$ denote the resulting probability distribution of $H_i$. $P_N$ is allowed to vary with $N$ in a way that the functions $P_S$, $P_R$ and $\pi_b$ are high dimensional relative to sample size $N$ even when $N$ is large. In some RL problems, for example, there are a large number of possible states. To capture the feature that the number of states is potentially large relative to sample size $N$, we may consider a sequence of $P_N$ such that $|\mathcal{S}|$ is increasing with $N$. For the sake of notational simplicity, we make implicit the dependence of $M$ and $\pi_b$ on $N$.

We assume that we know the state space $\mathcal{S}$ and the action space $\mathcal{A}$ but know none of the functions $P_S$, $P_R$ and $\pi_b$. In some environments, we know the function $\pi_b$ or observe the probability vector $(\pi_b(a \mid S_{it}))_{a \in \mathcal{A}}$ for every trajectory in the historical data. Our approach is usable regardless of the availability of such knowledge on the behavior policy.

2.2 Examples
This data generating process allows for many popular RL and bandit algorithms, as the following examples illustrate.
Example 1 (Deep Q Learning).
In each round $t$, given state $s_t$, a Q Learning algorithm picks the best action based on the estimated Q-value of each action, $Q(s_t, a)$, which estimates the expected cumulative reward from taking action $a$ (following the state and the policy). Choice probabilities can be determined with an $\epsilon$-Greedy or softmax rule, for instance. In the case where the softmax rule is employed, the probability of taking each action is as follows:
$$\pi(a \mid s_t) = \frac{\exp(Q(s_t, a))}{\sum_{a' \in \mathcal{A}} \exp(Q(s_t, a'))}.$$
Deep Q Learning algorithms estimate Q-value functions through deep learning methods.
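As a small illustration of the softmax rule in Example 1, the sketch below turns a dictionary of estimated Q-values into choice probabilities. The function name and the numbers are hypothetical, and no temperature parameter is assumed.

```python
import math


def softmax_policy(q_values):
    """Map estimated Q-values {action: q} to softmax choice probabilities."""
    q_max = max(q_values.values())  # subtracting the max avoids overflow in exp
    weights = {a: math.exp(q - q_max) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


probs = softmax_policy({"ad_1": 0.2, "ad_2": 0.5, "ad_3": 0.1})
```

Every action receives a strictly positive probability under this rule, which matters later because importance weights divide by the behavior policy's choice probabilities.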
Example 2 (Actor Critic).
An Actor Critic is a hybrid of a value-based approach such as Q-learning and a policy-based method such as REINFORCE. This algorithm has two components called Actor and Critic. Critic estimates the value function and Actor updates the policy using the value of Critic. In each round $t$, we pick the best action according to the value of Actor with some probability. As in Deep Q Learning algorithms, we can use $\epsilon$-Greedy and softmax for determining an action.
Contextual bandit algorithms are also important examples. When $T = 0$, a trajectory takes the form of $H = (S_0, A_0, R_0)$. Regarding $S_0$ as a context, it is possible to consider $\{H_i\}_{i=1}^N$ as batch data generated by a contextual bandit algorithm. In additional examples below, the algorithms use past data to estimate the mean reward function $\mu$ and the reward variance function $\sigma_R^2$. Let $\hat\mu$ and $\hat\sigma_R^2$ denote any given estimators of $\mu$ and $\sigma_R^2$, respectively.
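Example 3 ($\epsilon$-Greedy).
When the context is $s$, we choose the best action based on $\hat\mu(s, a)$ with probability $1 - \epsilon$ and choose an action uniformly at random with probability $\epsilon$:
$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a' \in \mathcal{A}} \hat\mu(s, a'), \\ \dfrac{\epsilon}{|\mathcal{A}|} & \text{otherwise.} \end{cases}$$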
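The choice probabilities implied by Example 3 can be computed directly. The sketch below assumes that the uniform draw ranges over all actions, including the currently best one; the function name and inputs are illustrative.

```python
def epsilon_greedy_policy(mu_hat, epsilon):
    """Choice probabilities for one context, given estimated mean rewards.

    mu_hat:  dict action -> estimated mean reward in the current context
    epsilon: probability of exploring uniformly over all actions
    Ties in mu_hat are broken in favor of the first action listed.
    """
    n_actions = len(mu_hat)
    best = max(mu_hat, key=mu_hat.get)
    return {
        a: (1.0 - epsilon if a == best else 0.0) + epsilon / n_actions
        for a in mu_hat
    }


probs = epsilon_greedy_policy({"ad_1": 0.02, "ad_2": 0.05, "ad_3": 0.01}, epsilon=0.1)
```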
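Example 4 (Thompson Sampling using Gaussian priors).
When the context is $s$, we sample the potential reward $\tilde r(a)$ from the normal distribution $N(\hat\mu(s, a), \hat\sigma_R^2(s, a))$ for each action, and choose the action with the highest sampled potential reward, $\arg\max_{a \in \mathcal{A}} \tilde r(a)$. As a result, this algorithm chooses actions with the following probabilities:
$$\pi(a \mid s) = \Pr\left(a = \arg\max_{a' \in \mathcal{A}} \tilde r(a')\right),$$
where $(\tilde r(a'))_{a' \in \mathcal{A}} \sim N(\hat\mu(s), \hat\Sigma(s))$, $\hat\mu(s) = (\hat\mu(s, a'))_{a' \in \mathcal{A}}$, and $\hat\Sigma(s)$ is the diagonal matrix whose diagonal entries are $(\hat\sigma_R^2(s, a'))_{a' \in \mathcal{A}}$.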
2.3 Prediction Target
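With the historical data $\{H_i\}_{i=1}^N$, we are interested in estimating the discounted value of the evaluation policy $\pi_e$, which might be different from $\pi_b$:
$$V^{\pi_e} = E_{H \sim \pi_e}\left[\sum_{t=0}^{T} \gamma^t R_t\right],$$
where $\gamma \in [0, 1]$ is the discount factor.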
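To make the prediction target concrete: if trajectories generated by the evaluation policy itself were available (which is exactly what off-policy evaluation lacks), the target could be approximated by the average discounted return. The helper names below are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over the steps of one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def monte_carlo_value(trajectories, gamma):
    """Average discounted return over trajectories of (state, action, reward)."""
    returns = [discounted_return([r for _, _, r in h], gamma) for h in trajectories]
    return sum(returns) / len(returns)


# Two on-policy toy trajectories with made-up rewards.
on_policy = [
    [("s", "a", 1.0), ("s", "a", 1.0), ("s", "a", 1.0)],
    [("s", "a", 0.0), ("s", "a", 0.0), ("s", "a", 0.0)],
]
value = monte_carlo_value(on_policy, gamma=0.5)
```

The estimators discussed next aim at the same quantity while using only trajectories generated by the behavior policy.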
3 Estimator and Its Properties
The estimation of $V^{\pi_e}$ involves estimation of the behavior policy $\pi_b$ (if unknown), the transition function $P_S$, and the mean reward function $\mu$. These functions may be high dimensional and complex. To handle the issue, we use the double/debiased machine learning (DML) method developed by Chernozhukov et al. (2018a).
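Before presenting our proposed estimator, we introduce some notation. $H_t^{s,a} = (S_0, A_0, \dots, S_t, A_t)$ is a trajectory of the state and action up to step $t$. $\rho_t^{\pi_e}$ is an importance weight function with
$$\rho_t^{\pi_e}(H_t^{s,a}) = \prod_{t'=0}^{t} \frac{\pi_e(A_{t'} \mid S_{t'})}{\pi_b(A_{t'} \mid S_{t'})}$$
being the probability of steps $0, \dots, t$ of $H$ under the evaluation policy $\pi_e$ divided by its probability under the behavior policy $\pi_b$. Viewing $\rho_t^{\pi_e}$ as a function of $\pi_b$, define $\rho_t^{\pi_e}(H_t^{s,a}; \tilde\pi_b)$ as the value of $\rho_t^{\pi_e}(H_t^{s,a})$ defined as above with the true behavior policy $\pi_b$ replaced with a candidate function $\tilde\pi_b$, i.e.,
$$\rho_t^{\pi_e}(H_t^{s,a}; \tilde\pi_b) = \prod_{t'=0}^{t} \frac{\pi_e(A_{t'} \mid S_{t'})}{\tilde\pi_b(A_{t'} \mid S_{t'})}.$$
We can think of $\rho_t^{\pi_e}(H_t^{s,a}; \hat\pi_b)$ as the estimated importance weight function when we use $\hat\pi_b$ as the estimate of $\pi_b$. By definition, $\rho_t^{\pi_e}(H_t^{s,a}; \pi_b) = \rho_t^{\pi_e}(H_t^{s,a})$, where the left-hand side is evaluated at the true $\pi_b$ and the right-hand side is the true importance weight function.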
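The cumulative importance weights described above are running products of per-step probability ratios. The sketch below assumes policies stored as nested dictionaries (state, then action, then probability); the candidate behavior policy can be the true one or an estimate.

```python
def importance_weights(trajectory, pi_e, pi_b):
    """Return [rho_0, ..., rho_T] for one trajectory of (state, action, reward).

    pi_e, pi_b: dict state -> dict action -> probability. `pi_b` may be the true
    behavior policy or a candidate/estimated one.
    """
    weights, rho = [], 1.0
    for state, action, _ in trajectory:
        rho *= pi_e[state][action] / pi_b[state][action]
        weights.append(rho)
    return weights


# Toy example: the evaluation policy always picks "a", the behavior policy is uniform.
pi_e = {"s": {"a": 1.0, "b": 0.0}}
pi_b = {"s": {"a": 0.5, "b": 0.5}}
rhos = importance_weights([("s", "a", 1.0), ("s", "a", 0.0)], pi_e, pi_b)
```

In this toy case the weights double at every step, which illustrates how quickly importance weights can grow with the horizon.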
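Finally, let $q_t^{\pi_e}$ be the action-value function under policy $\pi_e$ at step $t$, where $q_t^{\pi_e}(s, a) = E_{H \sim \pi_e}\left[\sum_{t'=t}^{T} \gamma^{t'-t} R_{t'} \mid S_t = s, A_t = a\right]$. Using the transition function $P_S$ and the mean reward function $\mu$, $q_t^{\pi_e}$ can be obtained recursively: let
$$q_T^{\pi_e}(s, a) = \mu(s, a), \tag{1}$$
and for $t = 0, \dots, T-1$,
$$q_t^{\pi_e}(s, a) = \mu(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_S(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi_e(a' \mid s') \, q_{t+1}^{\pi_e}(s', a'). \tag{2}$$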
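The backward recursion in (1) and (2) can be sketched for a finite MDP as follows. The dictionary layout and function name are assumptions of this sketch; in practice the transition function and mean reward function would be replaced by their estimates.

```python
def action_values(states, actions, mu, p_s, pi_e, gamma, horizon):
    """Return [q_0, ..., q_T], each a dict (state, action) -> value.

    mu:   dict (state, action) -> mean reward
    p_s:  dict (state, action) -> dict next_state -> probability
    pi_e: dict state -> dict action -> probability
    """
    q = [None] * (horizon + 1)
    # Terminal step: only the immediate mean reward remains.
    q[horizon] = {(s, a): mu[(s, a)] for s in states for a in actions}
    for t in range(horizon - 1, -1, -1):
        # State value at step t + 1 under the evaluation policy.
        v_next = {
            s: sum(pi_e[s][a] * q[t + 1][(s, a)] for a in actions) for s in states
        }
        q[t] = {
            (s, a): mu[(s, a)]
            + gamma * sum(p * v_next[s2] for s2, p in p_s[(s, a)].items())
            for s in states
            for a in actions
        }
    return q


# One state, one action, reward 1 per step, discount 0.5, T = 2.
q = action_values(
    ["s"], ["a"], {("s", "a"): 1.0}, {("s", "a"): {"s": 1.0}},
    {"s": {"a": 1.0}}, gamma=0.5, horizon=2,
)
```

In the toy case the values are the partial geometric sums 1, 1.5 and 1.75 for steps 2, 1 and 0.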
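Our estimator is based on the following expression of $V^{\pi_e}$ (Thomas and Brunskill, 2016):
$$V^{\pi_e} = E_{H \sim \pi_b}\left[\psi\left(H; \pi_b, \{q_t^{\pi_e}\}_{t=0}^{T}\right)\right], \tag{3}$$
where for any candidate tuple $\tilde\eta = (\tilde\pi_b, \{\tilde q_t\}_{t=0}^{T})$, we define
$$\psi(H; \tilde\eta) = \sum_{t=0}^{T} \gamma^t \left\{ \rho_t^{\pi_e}(H_t^{s,a}; \tilde\pi_b)\left(R_t - \tilde q_t(S_t, A_t)\right) + \rho_{t-1}^{\pi_e}(H_{t-1}^{s,a}; \tilde\pi_b) \sum_{a \in \mathcal{A}} \pi_e(a \mid S_t)\, \tilde q_t(S_t, a) \right\},$$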
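where we set $\rho_{-1}^{\pi_e} = 1$. We construct our estimator as follows.

1. Take a $K$-fold random partition $(I_k)_{k=1}^{K}$ of trajectory indices $\{1, \dots, N\}$ such that the size of each fold $I_k$ is $n = N/K$. Also, for each $k = 1, \dots, K$, define $I_k^c = \{1, \dots, N\} \setminus I_k$.

2. For each $k = 1, \dots, K$, construct estimators $\hat\pi_{b,k}$ (if $\pi_b$ is unknown), $\hat\mu_k$ and $\hat P_{S,k}$ of $\pi_b$, $\mu$ and $P_S$ using the subset of data $\{H_i\}_{i \in I_k^c}$, and construct the estimator $\{\hat q_{t,k}\}_{t=0}^{T}$ of the action-value functions by plugging $\hat\mu_k$ and $\hat P_{S,k}$ into the recursive formulation (1) and (2).

3. Given $\hat\eta_k = (\hat\pi_{b,k}, \{\hat q_{t,k}\}_{t=0}^{T})$, $k = 1, \dots, K$, the DML estimator $\hat V_{DML}$ is given by
$$\hat V_{DML} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n} \sum_{i \in I_k} \psi(H_i; \hat\eta_k).$$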
Possible estimation methods for $\pi_b$, $\mu$ and $P_S$ in Step 2 are (i) classical nonparametric methods such as kernel and series estimation, (ii) off-the-shelf machine learning methods such as random forests, lasso, neural nets, and boosted regression trees, and (iii) existing methods developed in the off-policy policy evaluation literature such as representation balancing MDPs (Liu et al., 2018). These methods, especially (ii) and (iii), are usable even when the analyst does not know the relevant state variables and there are a large number of potentially important state variables. This DML estimator differs from the DR estimator developed by Jiang and Li (2016) and Thomas and Brunskill (2016) in that we use the cross-fitting procedure, as explained next.

A. Cross-Fitting. The above method uses a sample-splitting procedure called cross-fitting, where we split the data into $K$ folds, take the sample analogue of (3) using one of the folds ($I_k$) with $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$ estimated from the remaining folds ($I_k^c$) plugged in, and average the estimates over the $K$ folds to produce a single estimator. Cross-fitting has two advantages. First, if we used instead the whole sample both for estimating $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$ and for computing the final estimate of $V^{\pi_e}$ (the “full-data” variant of the DR estimator of Thomas and Brunskill (2016)), substantial bias might arise due to overfitting (Chernozhukov et al., 2018a; Newey and Robins, 2018). Cross-fitting removes the potential bias by making the estimate of $V^{\pi_e}$ independent of the estimates of $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$. Thanks to this, the DML estimator has properties such as consistency and asymptotic normality under milder conditions than those necessary without sample splitting.

Second, a standard sample splitting procedure uses a half of the data to construct estimates of $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$ and the other half to compute the estimate of $V^{\pi_e}$ (the DR estimator of Jiang and Li (2016) and the “half-data” variant of the DR estimator of Thomas and Brunskill (2016)) [1]. In contrast, cross-fitting swaps the roles of the main fold ($I_k$) and the rest ($I_k^c$) so that all trajectories are used for the final estimate, which enables us to make efficient use of data.

[1] In their experiments, Jiang and Li (2016) also implement cross-fitting as one variant of the DR estimator. Their DR estimator with cross-fitting is the same as our DML estimator if we plug the true behavior policy into DML. However, we (i) allow the behavior policy to be unknown, (ii) allow for having a large number of potentially important state variables, and (iii) present statistical properties of the estimator while they do not.
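The per-trajectory quantity that is averaged within each fold can be sketched as below. This follows the standard doubly robust form for finite-horizon problems (an importance-weighted reward residual plus a model-based value term, with the weight before step 0 set to one); the data layout and names are assumptions of this sketch, and the nuisance inputs would come from the complementary folds.

```python
def dr_score(trajectory, pi_e, pi_b, q, gamma):
    """Doubly robust score of one trajectory of (state, action, reward).

    pi_e, pi_b: dict state -> dict action -> probability (pi_b may be estimated)
    q:          list over steps of dict (state, action) -> estimated action value
    """
    total, rho_prev = 0.0, 1.0  # rho_{-1} = 1
    for t, (s, a, r) in enumerate(trajectory):
        rho = rho_prev * pi_e[s][a] / pi_b[s][a]
        # Model-based state value under the evaluation policy at step t.
        v_hat = sum(p * q[t][(s, b)] for b, p in pi_e[s].items())
        total += gamma ** t * (rho * (r - q[t][(s, a)]) + rho_prev * v_hat)
        rho_prev = rho
    return total


pi_e = {"s": {"a": 1.0, "b": 0.0}}
pi_b = {"s": {"a": 0.5, "b": 0.5}}
traj = [("s", "a", 1.0), ("s", "a", 1.0)]
q_zero = [{("s", "a"): 0.0, ("s", "b"): 0.0}] * 2
score = dr_score(traj, pi_e, pi_b, q_zero, gamma=0.5)
```

With all action values set to zero the score collapses to the importance-weighted discounted return, which is the quantity the IPW estimator averages.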
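B. Neyman Orthogonality. There is another key ingredient important for DML to have desirable properties. The DML estimator is constructed by plugging in the estimates of $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$, which may be severely biased due to regularization if they are estimated with machine learning methods. However, the DML estimator is expected to be robust to the bias, since $\psi$, the function inside the expectation in (3), satisfies the Neyman orthogonality condition (Chernozhukov et al., 2018a). The condition requires that for any candidate tuple $\tilde\eta = (\tilde\pi_b, \{\tilde q_t\}_{t=0}^{T})$,
$$\frac{\partial}{\partial r} E_{H \sim \pi_b}\left[\psi\left(H; \eta + r(\tilde\eta - \eta)\right)\right]\Big|_{r=0} = 0,$$
where $\eta = (\pi_b, \{q_t^{\pi_e}\}_{t=0}^{T})$ is the tuple of the true functions (see the supplementary material for the proof that DML satisfies this). Intuitively, the Neyman orthogonality condition means that the right-hand side of (3) is locally insensitive to the value of $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$. As a result, plugging in noisy estimates of $\pi_b$ and $\{q_t^{\pi_e}\}_{t=0}^{T}$ does not strongly affect the estimate of $V^{\pi_e}$.
In contrast, the well-known Inverse Probability Weighting (IPW) estimator is based on the following expression of $V^{\pi_e}$: $V^{\pi_e} = E_{H \sim \pi_b}\left[\psi^{IPW}(H; \pi_b)\right]$, where for any candidate $\tilde\pi_b$, we define $\psi^{IPW}(H; \tilde\pi_b) = \sum_{t=0}^{T} \gamma^t \rho_t^{\pi_e}(H_t^{s,a}; \tilde\pi_b) R_t$. The Neyman orthogonality condition does not hold for IPW: for some $\tilde\pi_b \neq \pi_b$,
$$\frac{\partial}{\partial r} E_{H \sim \pi_b}\left[\psi^{IPW}\left(H; \pi_b + r(\tilde\pi_b - \pi_b)\right)\right]\Big|_{r=0} \neq 0.$$
Therefore, IPW is not robust to bias in the estimate of $\pi_b$.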
3.1 Consistency and Asymptotic Normality
Let $\sigma^2 = E_{H \sim \pi_b}\left[(\psi(H; \eta) - V^{\pi_e})^2\right]$ be the variance of $\psi(H; \eta)$. To derive the properties of the DML estimator, we make the following assumption. Recall that $P_N$, the probability distribution of $H_i$, is allowed to vary with the sample size $N$, and that any characteristics of the distribution may implicitly depend on $N$.
Assumption 1.

There exists a constant such that for all .

The estimator belongs to a set with probability approaching one, where contains the true and satisfies the following:

There exist constants and such that for all .



Assumption 1 (a) assumes that the variance of is nonzero and finite. Assumption 1 (b) states that the estimator belongs to the set , a shrinking neighborhood of the true , with probability approaching one. It requires that converge to at a sufficiently fast rate so that the three rate conditions in Assumption 1 (b) are satisfied.
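The following proposition establishes the consistency and asymptotic normality of $\hat V_{DML}$ and provides a consistent estimator for the asymptotic variance.
Proposition 1.
Suppose that Assumption 1 holds. Then,
$$\sqrt{N}\, \sigma^{-1} (\hat V_{DML} - V^{\pi_e}) \rightsquigarrow N(0, 1),$$
where the symbol $\rightsquigarrow$ denotes convergence in distribution. Moreover, let
$$\hat\sigma^2 = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n} \sum_{i \in I_k} \left(\psi(H_i; \hat\eta_k) - \hat V_{DML}\right)^2$$
be a sample analogue of $\sigma^2$, where $\psi(H_i; \hat\eta_k)$ is the summand of the DML estimator for trajectory $i$ in fold $I_k$. We then have
$$\hat\sigma^2 - \sigma^2 \to_p 0.$$
The above convergence results hold under any sequence of probability distributions $\{P_N\}$ as long as Assumption 1 holds. Therefore, our approach is usable, for example, in the case where there are a growing number of possible states, that is, $|\mathcal{S}|$ is increasing with $N$.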
3.2 A Special Case: Contextual Bandits
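Suppose that the algorithms are contextual bandits, as in Examples 3 and 4. When $T = 0$, the DML estimator becomes
$$\hat V_{DML} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n} \sum_{i \in I_k} \left\{ \frac{\pi_e(A_{i0} \mid S_{i0})}{\hat\pi_{b,k}(A_{i0} \mid S_{i0})} \left(R_{i0} - \hat\mu_k(S_{i0}, A_{i0})\right) + \sum_{a \in \mathcal{A}} \pi_e(a \mid S_{i0})\, \hat\mu_k(S_{i0}, a) \right\},$$
where $\hat\mu_k$ is the estimator of $\mu$ using the subset of data $\{H_i\}_{i \in I_k^c}$. This estimator is the same as the DR estimator of Dudík et al. (2014) except that we use the cross-fitting procedure. Proposition 1 has the following implication for the contextual bandit case.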
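For the contextual bandit case, the cross-fitted estimator and a plug-in standard error can be sketched as follows. The nuisance fit used here (per-action sample means and empirical action shares that ignore the context) is a deliberately simple stand-in chosen for this sketch; the paper's experiment uses LightGBM for the reward model. All function names are hypothetical.

```python
import random


def fit_action_means(train):
    """Stand-in nuisance fit: per-action mean reward and empirical action share."""
    counts, sums = {}, {}
    for _, a, r in train:
        counts[a] = counts.get(a, 0) + 1
        sums[a] = sums.get(a, 0.0) + r
    n = len(train)
    mu_hat = lambda s, a: sums[a] / counts[a] if a in counts else 0.0
    pi_b_hat = lambda s, a: counts[a] / n  # assumes the action appears in `train`
    return mu_hat, pi_b_hat


def dml_bandit_value(data, pi_e, fit_nuisances, n_folds, rng):
    """Cross-fitted doubly robust value estimate and its standard error.

    data: list of (context, action, reward); pi_e: context -> {action: prob}.
    """
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in indices if i not in held_out]
        mu_hat, pi_b_hat = fit_nuisances(train)  # fitted on the other folds only
        for i in fold:
            s, a, r = data[i]
            target = pi_e(s)
            direct = sum(p * mu_hat(s, b) for b, p in target.items())
            weight = target.get(a, 0.0) / pi_b_hat(s, a)
            scores.append(weight * (r - mu_hat(s, a)) + direct)
    # Pooling the scores equals the average of fold means when folds have equal size.
    n = len(scores)
    value = sum(scores) / n
    variance = sum((x - value) ** 2 for x in scores) / n
    return value, (variance / n) ** 0.5


# Toy log: actions alternate between 0 and 1, and the reward equals the action.
log = [(None, i % 2, float(i % 2)) for i in range(40)]
always_one = lambda s: {1: 1.0}
value, std_err = dml_bandit_value(log, always_one, fit_action_means, 2, random.Random(0))
```

In this toy log the reward model is exact, so the estimate matches the true value of always choosing action 1 and the standard error is zero; with real data a normal-approximation confidence interval could be formed from `value` and `std_err`.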
Corollary 1.
The variance expression coincides with the “semiparametric efficiency bound” obtained by Narita et al. (2019), where the semiparametric efficiency bound is the smallest possible asymptotic variance among all consistent and asymptotically normal estimators. Hence is the lowest variance estimator at a given .
4 Experiment
Table 1

Method | relative-RMSE (estimated propensity score) | relative-RMSE (true propensity score)
IPW | 0.72531 (0.03077) | 0.72907 (0.03061)
DR | 0.65470 (0.02900) | 0.65056 (0.02880)
DML | 0.56623 (0.02952) | 0.56196 (0.02922)

Sample size: # impressions assigned to the MAB algorithm = 40,101,050; # impressions assigned to the CB algorithm = 299,342.

Notes: This table shows the relative root mean squared error of the predicted CTRs of the evaluation policy compared to its actual CTR. The standard errors of these root mean squared errors are in parentheses.
We apply our estimator described in Section 3 to empirically evaluate the design of online advertisements. This application uses proprietary data provided by CyberAgent Inc., which we described in the introduction. This company uses bandit algorithms to determine the visual design of advertisements assigned to user impressions in a mobile game. Some examples of ad designs are shown in Figure 2.
Our data are logged data from a 7-day A/B test on several ad “campaigns,” where each campaign randomly uses either a multi-armed bandit (MAB) algorithm or a contextual bandit (CB) algorithm for each user impression. Both algorithms are updated every 6 hours. As the A/B test lasts for 7 days, there are 28 time periods in total. The MAB policy and CB policy stay constant across rounds within each time period.
In the notation of our theoretical framework, reward is a click while action is one of the possible individual advertisement designs. If the algorithm is a contextual bandit, context is user and ad characteristics used by the algorithm. Context is high dimensional and has tens of thousands of possible values.
We utilize this data to examine the performance of our proposed method. For each campaign and time period, we regard the MAB algorithm as the behavior policy $\pi_b$ and the CB algorithm as the evaluation policy $\pi_e$. We use the log data from user impressions assigned to the MAB algorithm to estimate the value of the CB policy by our method. We also compute the actual value of the CB policy using the log data from user impressions assigned to the CB algorithm. We then compare the predicted value with the actual one.
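We consider the relative root mean squared error (relative-RMSE) as a performance metric. Let $C$ and $T$ denote the number of campaigns and the number of time periods ($T = 28$), respectively. Let $N_{c,t}$ denote the number of user impressions assigned to the CB algorithm for campaign $c$ in period $t$. Let $\hat V_{c,t}$ and $\bar V_{c,t}$ be the estimated value and the actual value (the click rate) for campaign $c$ in period $t$. We define relative-RMSE as follows:
$$\text{relative-RMSE} = \left( \frac{1}{\sum_{c=1}^{C} \sum_{t=1}^{T} N_{c,t}} \sum_{c=1}^{C} \sum_{t=1}^{T} N_{c,t} \left( \frac{\hat V_{c,t} - \bar V_{c,t}}{\bar V_{c,t}} \right)^2 \right)^{1/2}.$$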
As the actual click rate varies across campaigns and time periods, we normalize the prediction error by dividing it by the actual value to equally weight every campaigntime combination.
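A minimal sketch of the metric follows. It assumes that the squared relative errors are averaged with weights proportional to the number of CB impressions in each campaign-period cell; that weighting is an assumption of this sketch, and passing equal weights gives the unweighted version.

```python
import math


def relative_rmse(predicted, actual, n_impressions):
    """Impression-weighted root mean squared relative error.

    predicted, actual, n_impressions: equal-length lists with one entry per
    campaign-period cell. `actual` entries must be nonzero.
    """
    total = sum(n_impressions)
    weighted = sum(
        n * ((p - a) / a) ** 2
        for p, a, n in zip(predicted, actual, n_impressions)
    )
    return math.sqrt(weighted / total)


# Two cells with made-up click rates: predictions are off by +10% and -10%.
metric = relative_rmse([0.11, 0.18], [0.10, 0.20], [1000, 3000])
```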
We calculate the standard error (statistical uncertainty) of relativeRMSE by a bootstraplike procedure. This procedure is based on normal approximation of the distributions of and : for each , and , where is the true value of policy , is the estimator for the asymptotic variance of given in Proposition 1, is the number of impressions used to estimate , and is the sample variance of the click indicator among the impressions assigned to the CB algorithm.
The standard error is computed as follows. First, we compute and for each . Second, we draw and independently from and for every , and calculate the relativeRMSE using the draws
. We then repeat the second step 100,000 times, and compute the standard deviation of the simulated relativeRMSEs.
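The simulation described above can be sketched as follows. The function takes, for each campaign-period cell, a point estimate and a standard error for both the predicted and the actual value, draws from the corresponding normal distributions, and reports the standard deviation of the simulated metric. The argument names, the impression weighting inside the metric, and the default number of simulations are assumptions of this sketch.

```python
import math
import random
import statistics


def relative_rmse(predicted, actual, weights):
    """Weighted root mean squared relative error (weighting is an assumption)."""
    total = sum(weights)
    return math.sqrt(
        sum(w * ((p - a) / a) ** 2 for p, a, w in zip(predicted, actual, weights))
        / total
    )


def simulated_standard_error(v_hat, se_hat, v_bar, se_bar, weights, n_sims, rng):
    """Standard deviation of relative-RMSE over normal draws of both inputs."""
    sims = []
    for _ in range(n_sims):
        predicted = [rng.gauss(m, s) for m, s in zip(v_hat, se_hat)]
        actual = [rng.gauss(m, s) for m, s in zip(v_bar, se_bar)]
        sims.append(relative_rmse(predicted, actual, weights))
    return statistics.pstdev(sims)


# Made-up inputs for two cells.
se = simulated_standard_error(
    v_hat=[0.11, 0.18], se_hat=[0.004, 0.006],
    v_bar=[0.10, 0.20], se_bar=[0.002, 0.003],
    weights=[1000, 3000], n_sims=2000, rng=random.Random(0),
)
```

The sketch uses far fewer repetitions than the 100,000 reported in the text to keep the run time short.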
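We examine the performance of three off-policy evaluators: Inverse Probability Weighting (IPW; Strehl et al., 2010), Doubly Robust (DR), and Double Machine Learning (DML; our proposal) with $K$-fold cross-fitting. IPW and DR are given by
$$\hat V_{IPW} = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_e(A_i \mid S_i)}{\hat\pi_b(A_i \mid S_i)} R_i,$$
$$\hat V_{DR} = \frac{1}{N} \sum_{i=1}^{N} \left\{ \frac{\pi_e(A_i \mid S_i)}{\hat\pi_b(A_i \mid S_i)} \left(R_i - \hat\mu(S_i, A_i)\right) + \sum_{a \in \mathcal{A}} \pi_e(a \mid S_i)\, \hat\mu(S_i, a) \right\},$$
where $\hat\mu$ is the estimator of $\mu$ using the whole sample and $\hat\pi_b$ is the true or estimated propensity score. $\hat V_{IPW}$ neither satisfies Neyman orthogonality nor uses cross-fitting. $\hat V_{DR}$ satisfies Neyman orthogonality but does not use cross-fitting. As we formalize in the theory part, the DML estimator is expected to perform better than the other two estimators.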
For both DR and DML, we use LightGBM (Ke et al., 2017) as a reward estimator $\hat\mu$. For DR, we split the whole data into training and validation sets, train the reward model with the training set, and tune hyperparameters with the validation set. We then use the tuned model and the whole sample to obtain the reward estimator $\hat\mu$. For DML, we hold out a fold ($I_k$) and apply the same procedure to the rest of the data ($I_k^c$) to obtain the reward estimator $\hat\mu_k$. We repeat this for each of the $K$ folds.

For the propensity score, we use the true one or an estimated one. We compute the true propensity score by Monte Carlo simulation of the beta distribution used in the Thompson Sampling that the MAB algorithm uses. Our estimated propensity score is the empirical share of a selected arm in the log.
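The two propensity scores can be sketched as below. The first function simulates Thompson Sampling with one beta distribution per arm and counts how often each arm has the highest draw; the second computes empirical arm shares from a log. The parameter values and names are made up for illustration, and the actual algorithm's beta parameters are not given in the text.

```python
import random


def thompson_propensity(beta_params, n_draws, rng):
    """Monte Carlo choice probabilities under beta Thompson Sampling.

    beta_params: dict arm -> (alpha, beta) of that arm's beta distribution
    """
    wins = {arm: 0 for arm in beta_params}
    for _ in range(n_draws):
        draws = {arm: rng.betavariate(a, b) for arm, (a, b) in beta_params.items()}
        wins[max(draws, key=draws.get)] += 1
    return {arm: w / n_draws for arm, w in wins.items()}


def empirical_share(logged_arms):
    """Estimated propensity: share of impressions in which each arm was shown."""
    counts = {}
    for arm in logged_arms:
        counts[arm] = counts.get(arm, 0) + 1
    return {arm: c / len(logged_arms) for arm, c in counts.items()}


true_ps = thompson_propensity(
    {"ad_1": (50, 50), "ad_2": (5, 95)}, n_draws=2000, rng=random.Random(0)
)
est_ps = empirical_share(["ad_1", "ad_1", "ad_2", "ad_3"])
```

Arms that are rarely selected get small propensities, and dividing by them inflates the importance weights, so the number of Monte Carlo draws should be large enough for those small probabilities to be estimated reliably.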
We present our key empirical result in Table 1 and Figure 1. Regardless of whether the true propensity score is available or not, DML outperforms IPW by more than 20% and outperforms DR by more than 10% in terms of relative-RMSE. These differences are statistically significant at the 5% level. This result supports the value of our proposed estimator.
Finally, we use our proposed method to measure the performance of a new counterfactual policy of choosing the ad design predicted to maximize the CTR. We obtain the counterfactual policy by training a click prediction model with LightGBM on the data from $t = 1$ to $t = T - 1$. In the training, we set the number of leaves to 20 and the learning rate to 0.01, and decide the number of boost rounds by cross-validation. We then use our estimator to predict its performance on the data where $t = T$. This final period contains three campaigns. The resulting off-policy evaluation results are reported in Figure 3. The results show that the counterfactual policy performs better than the existing algorithms in two of the three campaigns, with a statistically significant improvement in one campaign.
5 Conclusion
This paper proposes a new off-policy evaluation method by marrying the doubly robust estimator with double/debiased machine learning. Our estimator has two features. First, unlike the IPW estimator, it is robust to the bias in the estimates of the behavior policy and of the action-value function (Neyman orthogonality). Second, we use a sample-splitting procedure called cross-fitting. This removes overfitting bias that would arise without sample splitting but still makes full use of data, which makes our estimator better than DR estimators. Theoretically, we show that our estimator is consistent and asymptotically normal with a consistent variance estimator, thus allowing for correct statistical inference. Our experiment shows that our estimator outperforms the standard DR and IPW estimators in terms of the root mean squared error. This result not only demonstrates the capability of our estimator to reduce prediction errors, but also suggests the more general possibility that the two features of our estimator (Neyman orthogonality and cross-fitting) may improve many variants of the DR estimator such as MAGIC (Thomas and Brunskill, 2016), SWITCH (Wang et al., 2017) and MRDR (Farajtabar et al., 2018).
References
Bottou et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14 (1), pp. 3207–3260.
Chernozhukov et al. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), pp. C1–C68.
Chernozhukov et al. (2018b). Locally robust semiparametric estimation. arXiv.
Dimakopoulou et al. (2018). Estimation Considerations in Contextual Bandits. arXiv:1711.07077.
Dudík et al. (2014). Doubly Robust Policy Evaluation and Optimization. Statistical Science 29, pp. 485–511.
Farajtabar et al. (2018). More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, pp. 1447–1456.
Gu et al. (2017). Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3846–3855.
Hanna et al. (2019). Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning, pp. 2605–2613.
Irpan et al. (2019). Off-policy evaluation via off-policy classification. In Advances in Neural Information Processing Systems.
Jiang and Li (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652–661.
Kallus and Uehara (2019). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems.
Ke et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
Li et al. (2012). An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, Vol. 26, pp. 19–36.
Li et al. (2010). A Contextual-bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
Li et al. (2011). Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 297–306.
Liu et al. (2018). Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2644–2653.
Munos et al. (2016). Safe and Efficient Off-policy Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 1054–1062.
Narita et al. (2019). Efficient counterfactual learning from bandit feedback. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 4634–4641.
Newey and Robins (2018). Cross-fitting and fast remainder rates for semiparametric estimation. arXiv.
Precup et al. (2001). Off-Policy Temporal-Difference Learning with Function Approximation. In Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.
Precup et al. (2000). Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766.
Rotnitzky and Robins (1995). Semiparametric regression estimation in the presence of dependent censoring. Biometrika 82 (4), pp. 805–820.
Strehl et al. (2010). Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, pp. 2217–2225.
Swaminathan and Joachims (2015a). Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. Journal of Machine Learning Research 16, pp. 1731–1755.
Swaminathan and Joachims (2015b). The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, pp. 3231–3239.
Swaminathan et al. (2017). Off-policy Evaluation for Slate Recommendation. In Advances in Neural Information Processing Systems, pp. 3635–3645.
Thomas and Brunskill (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2139–2148.
Thomas et al. (2015). High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2380–2388.
Wang et al. (2017). Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, pp. 3589–3597.
Appendices
Appendix A Lemmas
Lemma 1.
For , .
Proof.
Let denote the probability of observing trajectory when for some policy . Under our data generating process,
and for ,
Hence, for any . We then have that