1 Introduction
In reinforcement learning, one of the most fundamental problems is policy evaluation: estimating the average reward obtained by running a given policy to select actions in an unknown system. A straightforward solution is to simply run the policy and measure the rewards it collects. In many applications, however, running a new policy in the actual system can be expensive or even impossible. For example, flying a helicopter with a new policy can be risky as it may lead to crashes; deploying a new ad display policy on a website may be catastrophic to user experience; and testing a new treatment on patients may simply be impossible for legal and ethical reasons.
These difficulties make it critical to do off-policy policy evaluation (Precup et al., 2000, Sutton et al., 2010), which is sometimes referred to as offline evaluation in the bandit literature (Li et al., 2011) or counterfactual reasoning (Bottou et al., 2013). Here, we still aim to estimate the average reward of a target policy, but instead of being able to run the policy online, we only have access to a sample of observations of the unknown system, which may have been collected in the past using a different policy. Off-policy evaluation has been found useful in a number of important applications (Langford et al., 2008, Li et al., 2011, Bottou et al., 2013) and can also be viewed as a key building block for policy optimization which, as in supervised learning, can often be reduced to evaluation, as long as the complexity of the policy class is well controlled (Ng and Jordan, 2000). For example, it has played an important role in many optimization algorithms for Markov decision processes (e.g., Heidrich-Meisner and Igel 2009) and bandit problems (Auer et al., 2002, Langford and Zhang, 2008, Strehl et al., 2011).
In the context of supervised learning, in the covariate shift literature, the problem of estimating losses under changing distributions is crucial for model selection (Sugiyama and Müller, 2005, Yu and Szepesvári, 2012) and also appears in active learning (Dasgupta, 2011). In the statistical literature, on the other hand, the problem appears in the context of randomized experiments. Here, the focus is on the two-action (binary) case where the goal is to estimate the difference between the expected rewards of the two actions (Hirano et al., 2003), which is slightly (but not essentially) different from our setting.
The topic of the present paper is off-policy evaluation in finite settings, under a mean squared error (MSE) criterion. As opposed to the statistics literature (Hirano et al., 2003), we are interested in results for finite sample sizes. In particular, we are interested in limits of performance (minimax MSE) given fixed policies, but unknown stochastic rewards with bounded mean reward, as well as the performance of estimation procedures compared to the minimax MSE. We argue that the finite setting is not a key limitation when focusing on the scaling behavior of the MSE of algorithms. Moreover, we are not aware of prior work that has studied the above problem (i.e., relating the MSE of algorithms to the best possible MSE).
Our main results are as follows: We start with a lower bound on the minimax MSE, to set a target for the estimation procedures. Next, we derive the exact MSE of the likelihood ratio (or importance-weighted) estimator (LR), which is shown to have an extra (uncontrollable) factor as compared to the minimax MSE lower bound. We then consider the estimator that estimates the mean rewards by sample means, which we call the regression estimator (REG). This estimator is motivated both by its simplicity and by the fact that a related estimator is known to be asymptotically efficient (Hirano et al., 2003). The main question is whether the asymptotic efficiency transfers into finite-time efficiency.
Our answer to this is mixed: We show that the MSE of REG is within a constant factor of the minimax MSE lower bound; however, the “constant” depends on the number of actions (
), or a lower bound on the variance. We also show that the dependence of the MSE of REG on the number of actions is unavoidable. In any case, for “small” action sets or high-noise settings, the REG estimator can be thought of as a minimax near-optimal estimator. We also show that for small sample sizes (up to
) all estimators must suffer a constant MSE. Numerical experiments illustrate the tightness of the analysis. Implications for more complicated settings, such as policy evaluation in contextual bandits and Markov decision processes (MDPs), are also discussed. The question of designing a nearly minimax estimator independently of any problem parameters remains open. All the proofs not given in the main text can be found in the supplementary material.
2 Multi-armed Bandit
Let be a finite set of actions. Data is generated by the following process: are independent copies of , where and for some unknown family of distributions and known policy . (Footnote 1: The data is actually a list, not a set. We keep the notation for historical reasons.) We are also given a known target policy and want to estimate its value, based on the knowledge of , and , where the quality of an estimate constructed based on (and ) is measured by its mean squared error, .
Define and , where stands for the variance. Further, let . For convenience, we will identify any function with the
dimensional vector whose
th component is . Thus, , , etc. will also be treated as vectors. Note that we do not assume that the rewards are bounded from either direction.
A few quantities are introduced to facilitate the discussion that follows:
Note that and are functions of and , but this dependence is suppressed. Also, and are independent in that there are no constants such that for any . Finally, let
be the probability of having
no sample of in .
2.1 A Minimax Lower Bound
We start by establishing a minimax lower bound that characterizes the inherent hardness of the off-policy evaluation problem. An estimator can be considered as a function that maps to an estimate of , denoted . Fix . We consider the minimax optimal risk subject to and for all :
where for vectors , holds if and only if for . For , we let denote the probability that none of the actions in the data falls into : . Note that this definition generalizes . We also let .
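To make the no-sample probability concrete, the following minimal Python sketch evaluates it for a hypothetical behavior policy. It assumes only what the text states, namely that actions are drawn i.i.d. from the behavior policy, so the chance of never seeing any action from a given set in n draws is one minus the set's total behavior probability, raised to the power n. The function name `prob_no_sample` and all numbers are illustrative and are not taken from the paper.

```python
def prob_no_sample(pi_d, subset, n):
    """Probability that none of n i.i.d. draws from pi_d lands in `subset`.

    pi_d   : list of behavior-policy probabilities, indexed by action
    subset : iterable of action indices (a single action is a one-element set)
    n      : sample size
    """
    mass = sum(pi_d[a] for a in subset)  # total behavior probability of the set
    return (1.0 - mass) ** n


# Hypothetical behavior policy over four actions.
pi_d = [0.7, 0.2, 0.05, 0.05]
p_single = prob_no_sample(pi_d, [3], n=10)    # action 3 is never observed
p_pair = prob_no_sample(pi_d, [2, 3], n=10)   # neither action 2 nor 3 is observed
```

Enlarging the set can only decrease this probability, which is why rarely played actions with large target-policy mass are the ones that matter in the first term of the lower bound.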
Theorem 1.
For any , , , and , one has
Furthermore,
(1) 
Proof.
To prove the first part of the lower bound, fix a subset of actions and choose an environment , where is the set of environments such that and . Introduce the notation to denote expectation when the data is generated by environment .
Let be the data generated based on and and let denote the estimate produced by some algorithm . Define to be the set of actions in the dataset that is seen by the algorithm. Clearly, for any such that they agree on the complement of (but may differ on actions in ),
(2) 
Now, and by adapting the argument that the MSE is lower bounded by the bias squared, . Hence, . We get an even smaller quantity if we further restrict the environments to environments that also satisfy on . Now, by (2), for all these environments, takes on a common value, denote it by . Hence, . Since , , where we use the shorthand . Plugging this into the previous inequality we get . Since was arbitrary, we get .
For the second part, consider a class of normal distributions with fixed reward variances
but different reward expectations: , where , for some to-be-specified vector that satisfies . The data-generating distribution is in , but is unknown otherwise.
It is easy to see that the policy values of any two distributions in differ by at least . Indeed, for any , . It follows that, in order to achieve a squared error less than , one needs to identify the underlying data-generating from , based on the observed sample . The problem now reduces to finding a minimax lower bound for hypothesis testing in the given finite set .
We resort to the information-theoretic machinery based on Fano’s inequality (see, e.g., Raginsky and Rakhlin (2011)). Define an oracle which, when queried, outputs with and . Let the distribution of when is used be denoted by . Let collect distributions such that is normal. Consider . Then,
The divergence measures how much information is carried in one sample from the oracle to tell from . To obtain the tightest lower bound, we should minimize the divergence. Subject to the constraint , the divergence is minimized by setting , and is . Now setting , and applying Lemma 1, Theorem 1 and the “Information Radius bound” from Raginsky and Rakhlin (2011), we have . Reorganizing terms and combining with the first term completes the proof of the first statement.
For the second statement, note that it suffices to consider asymptotically unbiased estimators (cf. the generalized Cramér-Rao lower bound, Theorem 7.3 of Ibragimov and Has’minskii 1981). For any such estimator, the Cramér-Rao lower bound gives the result with the parametric family chosen to be , where is the unknown parameter to be estimated, and is the density of the normal distribution with mean and variance and the quantity to be estimated is . For details, see Section A.1. ∎
The next corollary says that the minimax risk is constant when the number of samples is :
Corollary 1.
For , , .
Proof.
Choose to minimize subject to the constraint . Note that . Choosing such that gives the result. ∎
We conjecture that the result can be strengthened by increasing the upper limit on .
2.2 Likelihood Ratio Estimator
One of the most popular estimators is known as the propensity score estimator in the statistical literature (Rosenbaum and Rubin, 1983, 1985), or the importance weighting estimator (Bottou et al., 2013). We call it the likelihood ratio estimator, as it estimates the unknown value using likelihood ratios, or importance weights:
Its distinguishing feature is that it is unbiased: , so its MSE consists entirely of the variance of the estimator. The main result in this subsection shows that this estimator does not achieve the minimax lower bound up to any constant (by making ). The proof (given in the appendix) is based on a direct calculation using the law of total variance.
Proposition 1.
It holds that .
We see that as compared to the lower bound on the minimax MSE, an extra factor appears. In the next section, we will see that this factor is superfluous, showing that the MSE of LR can be “unreasonably large”.
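The likelihood ratio estimator described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it takes the standard form of the estimator (the average of rewards reweighted by the ratio of target to behavior probabilities), and the problem instance, the helper `sample_data`, and the variable names are all hypothetical. The loop at the end averages many independent estimates as an informal check of unbiasedness.

```python
import random


def lr_estimate(actions, rewards, pi, pi_d):
    """Likelihood ratio estimate: average of importance-weighted rewards."""
    n = len(actions)
    return sum(pi[a] / pi_d[a] * r for a, r in zip(actions, rewards)) / n


def sample_data(n, pi_d, means, sigma, rng):
    """Draw n (action, reward) pairs: A ~ pi_d, R ~ Normal(means[A], sigma^2)."""
    actions = rng.choices(range(len(pi_d)), weights=pi_d, k=n)
    rewards = [rng.gauss(means[a], sigma) for a in actions]
    return actions, rewards


# Hypothetical problem instance (not taken from the paper).
pi_d = [0.5, 0.3, 0.2]     # behavior policy
pi = [0.1, 0.1, 0.8]       # target policy
means = [1.0, 0.5, 0.2]    # mean reward of each action
true_value = sum(p * m for p, m in zip(pi, means))

rng = random.Random(0)
estimates = [lr_estimate(*sample_data(50, pi_d, means, 0.1, rng), pi, pi_d)
             for _ in range(4000)]
avg_estimate = sum(estimates) / len(estimates)  # should be close to true_value
```

Note that the estimator needs the behavior probabilities to be positive wherever the target policy puts mass; the sketch does not guard against division by zero.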
2.3 Regression Estimator
For convenience, define to be the number of samples for action in , and the total rewards of . The regression estimator (REG) is given by
For brevity, we will also write , where we take to be zero. The name of the estimator comes from the fact that it estimates the reward function, and the problem of estimating the reward function can be thought of as a regression problem.
Interestingly, as can be verified by direct calculation, the REG estimator can also be written as
(3) 
where is the empirical estimate of . Hence, the main difference between LR and REG is that the former uses to reweight the data, while the latter uses the empirical estimates . It may appear that LR is superior since it uses the “right” quantity. Surprisingly, REG turns out to be much more robust than LR, as will be shown shortly; further discussion is given in Section D.
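The two ways of writing REG discussed above can be compared directly in code. The sketch below is a minimal illustration with hypothetical names and toy data: `reg_estimate` plugs per-action sample means into the target policy (using zero for actions that were never observed), while `reg_estimate_reweighted` uses the importance-weighted form with empirical action frequencies in place of the true behavior probabilities. It assumes the empirical estimate referred to in the text is the fraction of samples in which each action appears.

```python
from collections import defaultdict


def reg_estimate(actions, rewards, pi):
    """Regression estimate: sum over actions of pi(a) times the sample-mean reward.

    Actions with no samples contribute zero (their mean is taken to be zero).
    """
    counts, totals = defaultdict(int), defaultdict(float)
    for a, r in zip(actions, rewards):
        counts[a] += 1
        totals[a] += r
    return sum(pi[a] * totals[a] / counts[a] for a in counts)


def reg_estimate_reweighted(actions, rewards, pi):
    """Same estimate in importance-weighted form, with empirical frequencies
    n(a)/n used in place of the true behavior probabilities."""
    n = len(actions)
    counts = defaultdict(int)
    for a in actions:
        counts[a] += 1
    return sum(pi[a] / (counts[a] / n) * r for a, r in zip(actions, rewards)) / n


# Hypothetical toy data set: action 2 is never observed.
actions = [0, 1, 0, 0, 1]
rewards = [1.0, 0.0, 2.0, 3.0, 1.0]
pi = [0.2, 0.3, 0.5]
direct = reg_estimate(actions, rewards, pi)
reweighted = reg_estimate_reweighted(actions, rewards, pi)
```

On this toy data both forms give the same number, and the unobserved action silently contributes nothing, which is the source of the estimator's bias.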
For the next statement, the counterpart of Proposition 1, the following quantities will be useful:
Proposition 2.
Fix . Assume that is nonnegative valued. Then it holds that . Further, for any such that the rewards have normal distributions, defining to be the bias of , .
Proof sketch.
For the upper bound, use that the MSE equals the sum of the squared bias and the variance. It can be verified that REG is slightly biased: . For the variance term, we use the law of total variance to obtain , where the first term is , and the second term is upper bounded (Lemma 2) by . The proof is then completed by adding the squared bias to the variance, and using the definitions of , , and . The lower bound follows from the (generalized) Cramér-Rao inequality. ∎
The main result of this section is the following theorem that characterizes the MSE of REG in terms of the minimax optimal MSE.
Theorem 2 (Minimax Optimality of the Regression Estimator).
The following hold:

For any , , such that , , and , it holds for any that
(4) where is an i.i.d. sample from .

A suboptimality factor of in the above result is unavoidable: For , there exists such that for any ,
Thus for , this ratio is at least .

The estimator is asymptotically minimax optimal:
We need the following lemma, which may be of interest on its own:
Lemma 1.
Let be
independent Bernoulli random variables with parameter
. Letting , , , we have for any and that . Further, when , we have
Proof of Theorem 2.
First, we bound in terms of . From Lemma 1, , while if , . Plugging these into the definition of , we have for all . Furthermore, when , thanks to monotonicity of the function for , we have
(5) 
Now, to bound , recall that one lower bound for is , where is the range for . Hence,
(6) 
Hence, using ,
(7) 
On the other hand, assuming that , we also have
where in the last inequality we used that and , which is true for any , and finally also the definition of . Similarly to the previous case, we get
For the second part of the result, choose , . For , . Hence, we have . Now, consider the LR estimator. Choosing , we have and so by Proposition 1,
Hence, .
2.4 Simulation Results
This subsection corroborates our analysis with simulation results that empirically demonstrate the impact of key quantities on the MSE of the two estimators. Two sets of experiments are done, corresponding to the left and right panels in Figure 1. In all experiments, we repeat the data-generation process (with ) 10,000 times, and compute the MSE of each estimator. All reward distributions are normal distributions with and different means. We then plot normalized MSE (MSE multiplied by sample size ), or nMSE, against .
The first experiment compares the finite-time as well as asymptotic accuracy of and . We choose , , . Three choices of are used: (a) , (b) , and (c) . These choices lead to increasing values of (with approximately fixed). Clearly, the nMSE of remains constant, equal to , as predicted in Proposition 1. In contrast, the nMSE of is large when is small, because of the high bias, and then quickly converges to the asymptotic minimax rate (Theorem 2, part iii). As can be arbitrarily larger than , it follows that is preferred over , at least for sufficiently large that is needed to drive the bias down. It should be noted that in practice, after is generated, it is easy to quantify the bias of simply by identifying the set of actions with .
The second experiment shows how affects the nMSE of . Here, we choose , , , and vary . As Figure 1 (right) shows, a larger gives a harder time, which is consistent with Theorem 2 (part i). Not only does the maximum nMSE grow approximately linearly with , the number of samples needed for nMSE to start decreasing also scales roughly as , as indicated by part ii of Theorem 2.
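A simulation of this kind can be sketched as follows. This is a minimal illustration rather than a reproduction of the experiments: the policies, reward means, noise level, sample sizes and number of repetitions are all hypothetical choices, and the repetition count is kept far below the 10,000 used in the text so that the sketch runs quickly. It computes the normalized MSE (MSE multiplied by the sample size) of both estimators on the same simulated data sets.

```python
import random
from collections import defaultdict


def lr_estimate(actions, rewards, pi, pi_d):
    """Likelihood ratio estimate: average of importance-weighted rewards."""
    n = len(actions)
    return sum(pi[a] / pi_d[a] * r for a, r in zip(actions, rewards)) / n


def reg_estimate(actions, rewards, pi):
    """Regression estimate: pi-weighted sum of per-action sample means."""
    counts, totals = defaultdict(int), defaultdict(float)
    for a, r in zip(actions, rewards):
        counts[a] += 1
        totals[a] += r
    return sum(pi[a] * totals[a] / counts[a] for a in counts)


def normalized_mse(n, reps, pi, pi_d, means, sigma, rng):
    """Return (nMSE of LR, nMSE of REG): n times the empirical MSE over reps runs."""
    true_value = sum(p * m for p, m in zip(pi, means))
    se_lr = se_reg = 0.0
    for _ in range(reps):
        actions = rng.choices(range(len(pi_d)), weights=pi_d, k=n)
        rewards = [rng.gauss(means[a], sigma) for a in actions]
        se_lr += (lr_estimate(actions, rewards, pi, pi_d) - true_value) ** 2
        se_reg += (reg_estimate(actions, rewards, pi) - true_value) ** 2
    return n * se_lr / reps, n * se_reg / reps


# Hypothetical instance; these are not the settings used in the paper.
pi_d = [0.9, 0.1]   # behavior policy rarely plays action 1
pi = [0.1, 0.9]     # target policy mostly plays action 1
means = [1.0, 1.0]
rng = random.Random(0)
results = {n: normalized_mse(n, 2000, pi, pi_d, means, 0.1, rng) for n in (10, 200)}
nmse_lr_large, nmse_reg_large = results[200]
```

The instance is chosen so that the mean rewards are large relative to the noise, which is the regime where the text predicts a gap between the two estimators at larger sample sizes; plotting `results` over a finer grid of sample sizes would give curves of the kind described for Figure 1.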
3 Extensions
In this section, we consider extensions of our previous results to contextual bandits and Markov decision processes, while implications for semi-supervised learning (Zhu and Goldberg, 2009) are discussed in the supplementary material.
3.1 Contextual Bandits
The problem setup is as follows: In addition to the finite action set , we are also given a context set . A policy now is a map such that for any ,
is a probability distribution over the action space
. For notational convenience, we will use instead of . The set of policies over and will be denoted by . The process generating the data is described by the following: are independent copies of , where , and for some unknown family of distributions and known policy and context distribution . For simplicity, we fix .
We are also given a known target policy and want to estimate its value, based on the knowledge of , , and , where the quality of an estimate constructed based on (and ) is measured by its mean squared error, , just like in the case of contextless bandits. Let for , . An estimator can be considered as a function that maps to an estimate of , denoted . Fix . The minimax optimal risk subject to for all is defined by
The main observation is that the estimation problem for the contextual case can actually be reduced to the contextless bandit case by treating the context-action pairs as “actions” belonging to the product space . For any policy , by slightly abusing notation, let
be the joint distribution of
when , . This way, we can map any contextual policy evaluation problem defined by , , , and a sample size into a contextless policy evaluation problem defined by , , with action set . Therefore, with and defined similarly, one can conclude the following results:
Theorem 3.
Pick any , , , and . Then, one has