# On Minimax Optimal Offline Policy Evaluation

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, while another can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also related to semi-supervised learning.

## 1 Introduction

In reinforcement learning, one of the most fundamental problems is policy evaluation: estimating the average reward obtained by running a given policy to select actions in an unknown system. A straightforward solution is to simply run the policy and measure the rewards it collects. In many applications, however, running a new policy in the actual system can be expensive or even impossible. For example, flying a helicopter with a new policy can be risky as it may lead to crashes; deploying a new ad display policy on a website may be catastrophic to user experience; and testing a new treatment on patients may simply be impossible for legal and ethical reasons.

These difficulties make it critical to do off-policy policy evaluation (Precup et al., 2000, Sutton et al., 2010), which is sometimes referred to as offline evaluation in the bandit literature (Li et al., 2011) or counterfactual reasoning (Bottou et al., 2013). Here, we still aim to estimate the average reward of a target policy, but instead of being able to run the policy online, we only have access to a sample of observations made about the unknown system, which may have been collected in the past using a different policy. Off-policy evaluation has been found useful in a number of important applications (Langford et al., 2008, Li et al., 2011, Bottou et al., 2013) and can also be viewed as a key building block for policy optimization, which, as in supervised learning, can often be reduced to evaluation as long as the complexity of the policy class is well-controlled (Ng and Jordan, 2000). For example, it has played an important role in many optimization algorithms for Markov decision processes (e.g., Heidrich-Meisner and Igel 2009) and bandit problems (Auer et al., 2002, Langford and Zhang, 2008, Strehl et al., 2011). In the context of supervised learning, in the covariate shift literature, the problem of estimating losses under changing distributions is crucial for model selection (Sugiyama and Müller, 2005, Yu and Szepesvári, 2012) and also appears in active learning (Dasgupta, 2011). In the statistical literature, on the other hand, the problem appears in the context of randomized experiments. Here, the focus is on the two-action (binary) case where the goal is to estimate the difference between the expected rewards of the two actions (Hirano et al., 2003), which is slightly (but not essentially) different from our setting.

The topic of the present paper is off-policy evaluation in finite settings, under a mean squared error (MSE) criterion. As opposed to the statistics literature (Hirano et al., 2003), we are interested in results for finite sample sizes. In particular, we are interested in limits of performance (minimax MSE) given fixed policies but unknown stochastic rewards with bounded mean reward, as well as the performance of estimation procedures compared to the minimax MSE. We argue that the finite setting is not a key limitation when focusing on the scaling behavior of the MSE of algorithms. Moreover, we are not aware of prior work that studied the above problem (i.e., relating the MSE of algorithms to the best possible MSE). Our main results are as follows: We start with a lower bound on the minimax MSE, to set a target for the estimation procedures. Next, we derive the exact MSE of the likelihood ratio (or importance-weighted) estimator (LR), which is shown to have an extra (uncontrollable) term as compared to the minimax MSE lower bound. Next, we consider the estimator that estimates the mean rewards by sample means, which we call the regression estimator (REG). The motivation for studying this estimator is both its simplicity and the fact that a related estimator is known to be asymptotically efficient (Hirano et al., 2003). The main question is whether the asymptotic efficiency transfers into finite-time efficiency. Our answer to this is mixed: We show that the MSE of REG is within a constant factor of the minimax MSE lower bound; however, the “constant” depends on the number of actions ($K$), or a lower bound on the variance. We also show that the dependence of the MSE of REG on the number of actions is unavoidable. In any case, for “small” action sets or high-noise settings, the REG estimator can be thought of as a minimax near-optimal estimator. We also show that for small sample sizes all estimators must suffer a constant MSE. Numerical experiments illustrate the tightness of the analysis. We also discuss implications for more complicated settings, such as policy evaluation in contextual bandits and Markov decision processes (MDPs). The question of designing a nearly minimax estimator independently of any problem parameters remains open. All the proofs not given in the main text can be found in the supplementary material.

## 2 Multi-armed Bandit

Let $\mathcal{A}$ be a finite set of $K$ actions. Data is generated by the following process: $D_n = \{(A_1, R_1), \ldots, (A_n, R_n)\}$ consists of independent copies of $(A, R)$, where $A \sim \pi_D$ and $R \sim \Phi(\cdot|A)$ for some unknown family of reward distributions $\Phi = (\Phi(\cdot|a))_{a \in \mathcal{A}}$ and known policy $\pi_D$. (The data is actually a list, not a set; we keep the set notation for historical reasons.) We are also given a known target policy $\pi$ and want to estimate its value $v^\pi_\Phi$, based on the knowledge of $D_n$, $\pi$ and $\pi_D$, where the quality of an estimate $\hat v$ constructed based on $D_n$ (and $\pi$, $\pi_D$) is measured by its mean squared error, $\mathrm{MSE}(\hat v) := \mathbb{E}[(\hat v - v^\pi_\Phi)^2]$.

Define $r_\Phi(a) := \mathbb{E}[R \,|\, A = a]$ and $\sigma^2_\Phi(a) := \mathbb{V}(R \,|\, A = a)$, where $\mathbb{V}$ stands for the variance. Further, let $v^\pi_\Phi := \sum_a \pi(a) r_\Phi(a)$. For convenience, we will identify any function $f: \mathcal{A} \to \mathbb{R}$ with the $K$-dimensional vector whose $a$th component is $f(a)$. Thus, $r_\Phi$, $\sigma^2_\Phi$, $\pi$, etc. will also be viewed as vectors. Note that we do not assume that the rewards are bounded from either direction.

A few quantities are introduced to facilitate discussions that follow:

$$V_1 := \mathbb{E}\left[\mathbb{V}\left(\frac{\pi(A)}{\pi_D(A)} R \,\middle|\, A\right)\right] = \sum_a \frac{\pi^2(a)}{\pi_D(a)} \sigma^2_\Phi(a),$$

$$V_2 := \mathbb{V}\left(\mathbb{E}\left[\frac{\pi(A)}{\pi_D(A)} R \,\middle|\, A\right]\right) = \mathbb{V}\left(\frac{\pi(A)}{\pi_D(A)} r_\Phi(A)\right) = \sum_a \frac{\pi^2(a)}{\pi_D(a)} r_\Phi(a)^2 - (v^\pi_\Phi)^2.$$

Note that $V_1$ and $V_2$ are functions of $\Phi$, $\pi$ and $\pi_D$, but this dependence is suppressed. Also, $V_1$ and $V_2$ are independent in the sense that neither can be bounded by a constant multiple of the other uniformly over all $\Phi$. Finally, let $p_{a,n} := (1 - \pi_D(a))^n$ be the probability of having no sample of action $a$ in $D_n$.
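To make these quantities concrete, the following sketch computes $V_1$, $V_2$ and $p_{a,n}$ for a small problem instance; the policies and reward parameters are illustrative stand-ins, not taken from the paper:

```python
import numpy as np

pi   = np.array([0.1, 0.2, 0.7])     # target policy (illustrative)
pi_D = np.array([0.5, 0.3, 0.2])     # behavior policy (illustrative)
r    = np.array([1.0, 0.5, 0.2])     # mean rewards r_Phi(a)
var  = np.array([0.01, 0.01, 0.01])  # reward variances sigma^2_Phi(a)
n    = 100

v_pi = pi @ r                               # policy value v^pi_Phi
V1 = np.sum(pi**2 / pi_D * var)             # E[ V(pi(A)/pi_D(A) R | A) ]
V2 = np.sum(pi**2 / pi_D * r**2) - v_pi**2  # V( E[pi(A)/pi_D(A) R | A] )
p_an = (1 - pi_D) ** n                      # p_{a,n}: P(no sample of a in D_n)
```

Both $V_1$ and $V_2$ scale with the mismatch $\pi^2(a)/\pi_D(a)$, so actions that the target policy favors but the behavior policy rarely takes dominate these quantities.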

### 2.1 A Minimax Lower Bound

We start with establishing a minimax lower bound that characterizes the inherent hardness of the off-policy evaluation problem. An estimator $A$ can be considered as a function that maps $(\pi, \pi_D, D_n)$ to an estimate of $v^\pi_\Phi$, denoted $\hat v_A(\pi, \pi_D, D_n)$. Fix $R_{\max} > 0$ and a vector $\sigma^2$ of nonnegative variances. We consider the minimax optimal risk subject to $\sigma^2_\Phi \le \sigma^2$ and $0 \le r_\Phi(a) \le R_{\max}$ for all $a$:

$$R^*_n(\pi, \pi_D, R_{\max}, \sigma^2) := \inf_A \sup_{\Phi:\, \sigma^2_\Phi \le \sigma^2,\; 0 \le r_\Phi \le R_{\max}} \mathbb{E}\left[\left(\hat v_A(\pi, \pi_D, D_n) - v^\pi_\Phi\right)^2\right],$$

where for vectors $x$ and $y$, $x \le y$ holds if and only if $x_a \le y_a$ for all $a$. For $B \subset \mathcal{A}$, we let $p_{B,n}$ denote the probability that none of the actions in the data falls into $B$: $p_{B,n} := (1 - \pi_D(B))^n$, where $\pi_D(B) := \sum_{a \in B} \pi_D(a)$. Note that this definition generalizes $p_{a,n}$. We also let $\pi(B) := \sum_{a \in B} \pi(a)$.
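For small $K$, the quantity $\max_{B \subset \mathcal{A}} \pi^2(B)\, p_{B,n}$, which appears in the lower bound, can be evaluated by brute force over all subsets; a sketch with illustrative (made-up) policies:

```python
from itertools import combinations

import numpy as np

pi   = np.array([0.1, 0.2, 0.7])  # target policy (illustrative)
pi_D = np.array([0.5, 0.3, 0.2])  # behavior policy (illustrative)
n, K = 20, 3

best = 0.0
for size in range(1, K + 1):
    for B in combinations(range(K), size):
        B = list(B)
        pi_B = pi[B].sum()                  # pi(B): target mass on B
        p_Bn = (1.0 - pi_D[B].sum()) ** n   # p_{B,n}: no action of B appears in D_n
        best = max(best, pi_B**2 * p_Bn)

# With Rmax = 1, the corresponding term of the lower bound is (1/4) * best.
lower_bound_term = 0.25 * best
```

On this instance the maximum is attained by the singleton containing the action the target policy relies on most but the behavior policy samples least, which matches the intuition behind the bound.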

###### Theorem 1.

For any $n$, $\pi$, $\pi_D$, $R_{\max} > 0$, and $\sigma^2 \ge 0$, one has

$$R^*_n(\pi, \pi_D, R_{\max}, \sigma^2) \ge \frac{1}{4} \max\left(R_{\max}^2 \max_{B \subset \mathcal{A}} \pi^2(B)\, p_{B,n},\; \frac{V_1}{n}\right).$$

Furthermore,

$$\liminf_{n \to \infty} \frac{R^*_n(\pi, \pi_D, R_{\max}, \sigma^2)}{V_1/n} \ge 1. \tag{1}$$
###### Proof.

To prove the first part of the lower bound, fix a subset $B$ of actions and choose an environment $\Phi$ from the set of environments satisfying $\sigma^2_\Phi \le \sigma^2$ and $0 \le r_\Phi \le R_{\max}$. Introduce the notation $\mathbb{E}_\Phi$ to denote expectation when the data is generated by environment $\Phi$.

Let $D_n$ be the data generated based on $\Phi$ and $\pi_D$, and let $\hat v_A(D_n)$ denote the estimate produced by some algorithm $A$. Define $S$ to be the set of actions in the dataset that is seen by the algorithm. Clearly, for any $\Phi'$ such that $\Phi$ and $\Phi'$ agree on the complement of $B$ (but may differ on actions in $B$),

$$\mathbb{E}_\Phi[\hat v_A(D_n) \,|\, S \cap B = \emptyset] = \mathbb{E}_{\Phi'}[\hat v_A(D_n) \,|\, S \cap B = \emptyset]. \tag{2}$$

Now, and by adapting the argument that the MSE is lower bounded by the bias squared, . Hence, . We get an even smaller quantity if we further restrict the environments to environments that also satisfy on . Now, by (2), for all these environments, takes on a common value, denote it by . Hence, . Since , , where we use the shorthand . Plugging this into the previous inequality we get . Since was arbitrary, we get .

For the second part, consider a class of normal distributions with fixed reward variances but different reward expectations, for some to-be-specified mean vector that satisfies the required constraints. The data-generating distribution is in this class, but is unknown otherwise.

It is easy to see that the policy values of any two distinct distributions in the class differ by at least a fixed gap. It follows that, in order to achieve a sufficiently small squared error, one needs to identify the underlying data-generating distribution from the class, based on the observed sample. The problem now reduces to finding a minimax lower bound for hypothesis testing in the given finite set.

We resort to the information-theoretic machinery based on Fano’s inequality (see, e.g., Raginsky and Rakhlin (2011)). Define an oracle which, when queried, outputs $Y = (A, R)$ with $A \sim \pi_D$ and $R \sim \Phi(\cdot|A)$. Let the distribution of $Y$ when $\Phi$ is used be denoted by $P_{Y|\Phi}$, and consider two environments $\Phi$ and $\Phi'$ in which the reward distributions are normal. Then,

$$D(P_{Y|\Phi} \,\|\, P_{Y|\Phi'}) = \sum_a \pi_D(a)\, D(\Phi(\cdot|a) \,\|\, \Phi'(\cdot|a)) = 2\varepsilon (i - j)^2 \sum_a \pi_D(a) \frac{\Delta(a)^2}{\sigma(a)^2}.$$

The divergence measures how much information is carried in one sample from the oracle to tell $\Phi$ from $\Phi'$. To obtain the tightest lower bound, we should minimize the divergence. Subject to the constraint, the divergence is minimized by a suitable choice of $\Delta$. Applying Lemma 1 and Theorem 1 together with the “Information Radius bound” from Raginsky and Rakhlin (2011), then reorganizing terms and combining with the first term, completes the proof of the first statement.

For the second part, note that it suffices to consider asymptotically unbiased estimators (cf. the generalized Cramér-Rao lower bound, Theorem 7.3 of Ibragimov and Has’minskii 1981). For any such estimator, the Cramér-Rao lower bound gives the result, with the parametric family chosen so that each action's reward density is that of a normal distribution with unknown mean and known variance, and with the quantity to be estimated being $v^\pi_\Phi$. For details, see Section A.1. ∎

The next corollary says that the minimax risk is constant when the number of samples is small:

For , , .

###### Proof.

Choose to minimize subject to the constraint . Note that . Choosing such that gives the result. ∎

We conjecture that the result can be strengthened by increasing the upper limit on $n$.

### 2.2 Likelihood Ratio Estimator

One of the most popular estimators is known as the propensity score estimator in the statistical literature (Rosenbaum and Rubin, 1983, 1985), or the importance weighting estimator (Bottou et al., 2013). We call it the likelihood ratio estimator, as it estimates the unknown value using likelihood ratios, or importance weights:

$$\hat v_{\mathrm{LR}}(\pi, \pi_D, D_n) := \frac{1}{n} \sum_{i=1}^n \frac{\pi(A_i)}{\pi_D(A_i)} R_i.$$

Its distinguishing feature is that it is unbiased: $\mathbb{E}[\hat v_{\mathrm{LR}}] = v^\pi_\Phi$, implying that the MSE is purely contributed by the variance of the estimator. The main result in this subsection shows that this estimator does not achieve the minimax lower bound up to any constant (by making $V_2$ arbitrarily large relative to the lower bound). The proof (given in the appendix) is based on a direct calculation using the law of total variance.

###### Proposition 1.

It holds that $\mathrm{MSE}(\hat v_{\mathrm{LR}}(\pi, \pi_D, D_n)) = (V_1 + V_2)/n$.

We see that, as compared to the lower bound on the minimax MSE, an extra $V_2/n$ term appears. In the next section, we will see that this term is superfluous, showing that the MSE of LR can be “unreasonably large”.
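The LR estimate is a one-line computation, and its unbiasedness can be checked by averaging over many simulated datasets. The problem instance below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
pi   = np.array([0.1, 0.2, 0.7])  # target policy (illustrative)
pi_D = np.array([0.5, 0.3, 0.2])  # behavior policy (illustrative)
r    = np.array([1.0, 0.5, 0.2])  # mean rewards

def v_lr(A, R):
    """Likelihood ratio (importance weighting) estimate of the policy value."""
    return np.mean(pi[A] / pi_D[A] * R)

n, trials = 50, 20000
estimates = np.empty(trials)
for t in range(trials):
    A = rng.choice(3, size=n, p=pi_D)        # actions from the behavior policy
    R = r[A] + 0.1 * rng.standard_normal(n)  # noisy rewards
    estimates[t] = v_lr(A, R)

v_pi = pi @ r  # true value
# The Monte Carlo mean of the estimates should be close to v_pi.
```

Averaging across the trials recovers the true value up to Monte Carlo noise, while any single estimate fluctuates with a variance driven by the importance weights.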

### 2.3 Regression Estimator

For convenience, define $n(a)$ to be the number of samples for action $a$ in $D_n$, and $R(a)$ the total reward observed for action $a$. The regression estimator (REG) is given by

$$\hat v_{\mathrm{Reg}}(\pi, D_n) := \sum_a \pi(a) \hat r(a), \quad \text{where} \quad \hat r(a) := \begin{cases} 0, & \text{if } n(a) = 0; \\ \dfrac{R(a)}{n(a)}, & \text{otherwise.} \end{cases}$$

For brevity, we will also write $\hat r(a) = R(a)/n(a)$, where we take $0/0$ to be zero. The name of the estimator comes from the fact that it estimates the reward function, and the problem of estimating the reward function can be thought of as a regression problem.

Interestingly, as can be verified by direct calculation, the REG estimator can also be written as

$$\hat v_{\mathrm{Reg}}(\pi, D_n) = \frac{1}{n} \sum_{i=1}^n \frac{\pi(A_i)}{\hat\pi_D(A_i)} R_i, \tag{3}$$

where $\hat\pi_D(a) := n(a)/n$ is the empirical estimate of $\pi_D(a)$. Hence, the main difference between LR and REG is that the former uses $\pi_D$ to reweight the data, while the latter uses the empirical estimate $\hat\pi_D$. It may appear that LR is superior since it uses the “right” quantity. Surprisingly, REG turns out to be much more robust than LR, as will be shown shortly; further discussion is given in Section D.
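Identity (3) is easy to check numerically: the sketch below (made-up inputs) computes REG both as a plug-in estimate of the reward vector and as importance weighting with the empirical behavior policy, and the two agree:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3
pi   = np.array([0.1, 0.2, 0.7])  # target policy (illustrative)
pi_D = np.array([0.5, 0.3, 0.2])  # behavior policy (illustrative)

A = rng.choice(K, size=40, p=pi_D)
R = rng.standard_normal(40)

# Form 1: plug-in estimate of the reward vector, r_hat(a) = R(a)/n(a), 0 if unseen.
n_a = np.bincount(A, minlength=K).astype(float)
R_a = np.bincount(A, weights=R, minlength=K)
r_hat = np.divide(R_a, n_a, out=np.zeros(K), where=n_a > 0)
v_reg_plugin = pi @ r_hat

# Form 2: importance weighting with the empirical behavior policy pi_hat_D = n(a)/n.
pi_hat_D = n_a / len(A)
v_reg_iw = np.mean(pi[A] / pi_hat_D[A] * R)
```

The equivalence holds on every realization, not just in expectation, which is why the two views of REG can be used interchangeably.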

For the next statement, the counterpart of Proposition 1, the following quantities will be useful:

$$V_{0,n} := \Big(\sum_a \pi(a) r_\Phi(a)\, p_{a,n}\Big)^2 + \sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n}(1 - p_{a,n})$$

and

$$V_{3,n} := \sum_a \mathbb{E}\left[\frac{\mathbb{I}\{n(a) > 0\}}{\hat\pi_D(a)} - \frac{1}{\pi_D(a)}\right] \pi(a)^2 \sigma^2(a).$$
###### Proposition 2.

Fix $n$. Assume that $r_\Phi$ is nonnegative valued. Then it holds that $\mathrm{MSE}(\hat v_{\mathrm{Reg}}(\pi, D_n)) \le V_{0,n} + (V_1 + V_{3,n})/n$. Further, for any $\Phi$ such that the rewards have normal distributions, a corresponding lower bound, stated in terms of the bias of $\hat v_{\mathrm{Reg}}$, also holds.

###### Proof sketch.

For the upper bound, use that the MSE equals the sum of the squared bias and the variance. It can be verified that REG is slightly biased: $\mathbb{E}[\hat v_{\mathrm{Reg}}] - v^\pi_\Phi = -\sum_a \pi(a) r_\Phi(a)\, p_{a,n}$. For the variance term, we use the law of total variance, conditioning on the action counts: the first term is bounded in terms of $V_1$ and $V_{3,n}$, and the second term is upper bounded (Lemma 2) by $\sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n}(1 - p_{a,n})$. The proof is then completed by adding the squared bias to the variance, and using the definitions of $V_{0,n}$, $V_1$, and $V_{3,n}$. The lower bound follows from the (generalized) Cramér-Rao inequality. ∎
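Because REG sets $\hat r(a) = 0$ for unseen actions, its bias is exactly $-\sum_a \pi(a) r_\Phi(a)\, p_{a,n}$, and this can be verified by simulation. The instance below, with a rarely explored action, is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
pi   = np.array([0.1, 0.2, 0.7])    # target policy (illustrative)
pi_D = np.array([0.6, 0.35, 0.05])  # behavior policy: action 2 rarely explored
r    = np.array([1.0, 0.5, 0.2])    # mean rewards
n, trials = 20, 50000

p_an = (1 - pi_D) ** n
closed_form_bias = -np.sum(pi * r * p_an)  # unseen actions contribute r_hat(a) = 0

est = np.empty(trials)
for t in range(trials):
    A = rng.choice(3, size=n, p=pi_D)
    R = r[A] + 0.1 * rng.standard_normal(n)
    n_a = np.bincount(A, minlength=3).astype(float)
    R_a = np.bincount(A, weights=R, minlength=3)
    est[t] = pi @ np.divide(R_a, n_a, out=np.zeros(3), where=n_a > 0)

empirical_bias = est.mean() - pi @ r
```

With $n = 20$ and $\pi_D(2) = 0.05$, action 2 is missing from roughly a third of the datasets, so the downward bias is substantial even though each observed reward is estimated without bias.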

The main result of this section is the following theorem that characterizes the MSE of REG in terms of the minimax optimal MSE.

###### Theorem 2 (Minimax Optimality of the Regression Estimator).

The following hold:

1. For any $n$, $\pi$, $\pi_D$, $R_{\max}$, and $\sigma^2$, it holds for any $\Phi$ with $0 \le r_\Phi \le R_{\max}$ and $\sigma^2_\Phi \le \sigma^2$ that

$$\mathrm{MSE}(\hat v_{\mathrm{Reg}}(\pi, D_n)) \le K \left\{\min\left(4K,\; \max_a \frac{r_\Phi^2(a)}{\sigma^2_\Phi(a)}\right) + 5\right\} R^*_n(\pi, \pi_D, R_{\max}, \sigma^2), \tag{4}$$

where $D_n$ is an i.i.d. sample generated by $\pi_D$ and $\Phi$.

2. A suboptimality factor of $K$ in the above result is unavoidable: For any $K$, there exist $\pi$ and $\pi_D$ such that for any $n$,

$$\frac{\mathrm{MSE}(\hat v_{\mathrm{Reg}}(\pi, D_n))}{R^*_n(\pi, \pi_D, R_{\max}, 0)} \ge n\, e^{-2n/(K-1)}.$$

Thus for $n = (K-1)/2$, this ratio is at least $(K-1)/(2e)$.

3. The estimator is asymptotically minimax optimal:

$$\limsup_{n \to \infty} \frac{\mathrm{MSE}(\hat v_{\mathrm{Reg}}(\pi, D_n))}{R^*_n(\pi, \pi_D, R_{\max}, \sigma^2)} \le 1.$$

We need the following lemma, which may be of interest on its own:

###### Lemma 1.

Let $X_1, \ldots, X_n$ be independent Bernoulli random variables with parameter $p$. Letting $S_n := X_1 + \cdots + X_n$, we have for any $n$ and $p$ that . Further, when , we have .

###### Proof of Theorem 2.

First, we bound $V_{3,n}$ in terms of $V_1$. From Lemma 1, the expectations appearing in the definition of $V_{3,n}$ can be bounded; plugging these bounds into the definition of $V_{3,n}$, we have $V_{3,n} \le 4 V_1$ for all $n$. Furthermore, when $n$ is large enough, thanks to the monotonicity of the relevant function, we have

$$V_{3,n} \le 2 V_1 \sqrt{\frac{2}{n \pi_D^*}} \left(\sqrt{\tfrac{3}{2} \ln\!\Big(\frac{n \pi_D^*}{2}\Big)} + 1\right). \tag{5}$$

Now, to bound $V_{0,n}$, recall that one lower bound on $R^*_n$ is $\frac{1}{4} R_{\max}^2 \max_{B} \pi^2(B)\, p_{B,n}$, where $R_{\max}$ is the range of the mean rewards. Hence,

$$\begin{aligned} V_{0,n} &= K^2 \Big(\frac{1}{K} \sum_a \pi(a) r_\Phi(a)\, p_{a,n}\Big)^2 + \sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n}(1 - p_{a,n}) \\ &\le K \sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n}^2 + \sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n}(1 - p_{a,n}) \\ &\le K \sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n} \le K^2 \max_a \pi^2(a) r_\Phi^2(a)\, p_{a,n}. \end{aligned} \tag{6}$$

Hence, using Proposition 2,

$$\mathrm{MSE}(\hat v_{\mathrm{Reg}}) \le V_{0,n} + \frac{V_1 + V_{3,n}}{n} \le 4 K^2 \max_a \pi^2(a) r_\Phi^2(a)\, p_{a,n} + \frac{5 V_1}{n} \le (4K^2 + 5)\, R^*_n. \tag{7}$$

On the other hand, assuming that $\sigma^2(a) > 0$ for all $a$, we also have

$$V_{0,n} \le K \sum_a \pi^2(a) r_\Phi^2(a)\, p_{a,n} \le K \max_{b \in \mathcal{A}} \left(\frac{r_\Phi^2(b)}{\sigma^2(b)}\right) \sum_a p_{a,n}\, \pi^2(a) \sigma^2(a) \le K \max_{b \in \mathcal{A}} \left(\frac{r_\Phi^2(b)}{\sigma^2(b)}\right) \frac{V_1}{n},$$

where in the last inequality we used that $p_{a,n} = (1 - \pi_D(a))^n \le e^{-n \pi_D(a)} \le \frac{1}{n \pi_D(a)}$, which is true for any $\pi_D(a) \in (0, 1]$, and finally also the definition of $V_1$. Similarly to the previous case, we get

$$\mathrm{MSE}(\hat v_{\mathrm{Reg}}) \le \left\{K \max_{b \in \mathcal{A}} \left(\frac{r_\Phi^2(b)}{\sigma^2(b)}\right) + 5\right\} \frac{V_1}{n} \le \left\{K \max_{b \in \mathcal{A}} \left(\frac{r_\Phi^2(b)}{\sigma^2(b)}\right) + 5\right\} R^*_n.$$

Combining this with (7) gives (4).

For the second part of the result, choose $\pi$ and $\pi_D$ appropriately; for this choice, the MSE of $\hat v_{\mathrm{Reg}}$ can be bounded from below. Now, consider the LR estimator. Choosing $\sigma^2_\Phi = 0$, we have $V_1 = 0$ and so by Proposition 1,

$$\sup_{\Phi:\, 0 \le r_\Phi \le 1,\, \sigma^2_\Phi = 0} \mathrm{MSE}(\hat v_{\mathrm{LR}}) = \sup_{\Phi:\, 0 \le r_\Phi \le 1,\, \sigma^2_\Phi = 0} V_2/n \le \frac{1}{n}.$$

Hence, $R^*_n(\pi, \pi_D, 1, 0) \le 1/n$.

Finally, for the last part, fix any $\pi$, $\pi_D$ and $\Phi$ such that $V_1 > 0$. Then, for $n$ large enough, $\mathrm{MSE}(\hat v_{\mathrm{Reg}}) \le (1 + C \sqrt{\ln(n)/n})\, V_1/n$, where $C$ is a problem-dependent constant, and the inequality used (5) and (6). Combining this with (1) of Theorem 1 gives the desired result. ∎

### 2.4 Simulation Results

This subsection corroborates our analysis with simulation results that empirically demonstrate the impact of key quantities on the MSE of the two estimators. Two sets of experiments are reported, corresponding to the left and right panels of Figure 1. In all experiments, we repeat the data-generation process 10,000 times and compute the MSE of each estimator. All reward distributions are normal with a common variance and different means. We then plot the normalized MSE (MSE multiplied by the sample size $n$), or nMSE, against $n$.

The first experiment compares the finite-time as well as asymptotic accuracy of LR and REG. We fix $\pi$ and the reward parameters, and use three choices of $\pi_D$, leading to increasing values of $V_2$ (with $V_1$ approximately fixed). Clearly, the nMSE of LR remains constant, equal to $V_1 + V_2$, as predicted by Proposition 1. In contrast, the nMSE of REG is large when $n$ is small, because of the high bias, and then quickly converges to the asymptotic minimax rate (Theorem 2, part iii). As $V_2$ can be arbitrarily larger than $V_1$, it follows that REG is preferred over LR, at least for sufficiently large $n$, which is needed to drive the bias down. It should be noted that, in practice, after $D_n$ is generated it is easy to quantify the bias of REG, simply by identifying the set of actions with $n(a) = 0$.

The second experiment shows how $K$ affects the nMSE of REG. Here, we fix the remaining problem parameters and vary $K$. As Figure 1 (right) shows, a larger $K$ makes the problem harder, which is consistent with Theorem 2 (part i). Not only does the maximum nMSE grow approximately linearly with $K$; the number of samples needed for the nMSE to start decreasing also scales roughly linearly with $K$, as indicated by part ii of Theorem 2.
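A stripped-down version of this kind of experiment can be run as follows; the policies, reward means, noise level, and trial count are illustrative stand-ins, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
pi   = np.array([0.1, 0.2, 0.7])  # target policy (illustrative)
pi_D = np.array([0.5, 0.3, 0.2])  # behavior policy (illustrative)
r    = np.array([1.0, 0.5, 0.2])  # mean rewards
sigma = 0.1
v_pi = pi @ r

def lr(A, R):
    """Likelihood ratio estimate."""
    return np.mean(pi[A] / pi_D[A] * R)

def reg(A, R):
    """Regression estimate (0 for unseen actions)."""
    K = len(pi)
    n_a = np.bincount(A, minlength=K).astype(float)
    R_a = np.bincount(A, weights=R, minlength=K)
    return pi @ np.divide(R_a, n_a, out=np.zeros(K), where=n_a > 0)

def nmse(estimator, n, trials=2000):
    """Normalized MSE: n times the mean squared error over repeated datasets."""
    err = np.empty(trials)
    for t in range(trials):
        A = rng.choice(3, size=n, p=pi_D)
        R = r[A] + sigma * rng.standard_normal(n)
        err[t] = estimator(A, R) - v_pi
    return n * np.mean(err**2)

for n in (10, 100, 1000):
    print(n, nmse(lr, n), nmse(reg, n))
```

On instances like this one, the nMSE of LR stays roughly flat in $n$, while the nMSE of REG drops below it once every action has been sampled a few times, mirroring the left panel of Figure 1.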

## 3 Extensions

In this section, we consider extensions of our previous results to contextual bandits and Markov decision processes; implications for semi-supervised learning (Zhu and Goldberg, 2009) are discussed in the supplementary material.

### 3.1 Contextual Bandits

The problem setup is as follows: In addition to the finite action set $\mathcal{A}$, we are also given a context set $\mathcal{X}$. A policy is now a map $\pi$ such that for any context $x \in \mathcal{X}$, $\pi(\cdot|x)$ is a probability distribution over the action space $\mathcal{A}$. For notational convenience, we will use $\pi(x, a)$ instead of $\pi(a|x)$. The set of policies over $\mathcal{X}$ and $\mathcal{A}$ will be denoted by $\Pi$. The process generating the data is described by the following: $D_n = \{(X_1, A_1, R_1), \ldots, (X_n, A_n, R_n)\}$ consists of independent copies of $(X, A, R)$, where $X \sim \mu$, $A \sim \pi_D(\cdot|X)$, and $R \sim \Phi(\cdot|X, A)$ for some unknown family of distributions $\Phi$, known policy $\pi_D$, and context distribution $\mu$. For simplicity, we fix the context distribution $\mu$.

We are also given a known target policy $\pi$ and want to estimate its value $v^\pi_\Phi$, based on the knowledge of $D_n$, $\pi$, $\pi_D$ and $\mu$, where the quality of an estimate constructed based on $D_n$ (and $\pi$, $\pi_D$, $\mu$) is measured by its mean squared error, $\mathbb{E}[(\hat v - v^\pi_\Phi)^2]$, just like in the case of contextless bandits. An estimator can be considered as a function that maps $(\pi, \pi_D, \mu, D_n)$ to an estimate of $v^\pi_\Phi$. Fix $R_{\max} > 0$ and $\sigma^2$. The minimax optimal risk, subject to $\sigma^2_\Phi \le \sigma^2$ and $0 \le r_\Phi(x, a) \le R_{\max}$ for all $(x, a)$, is defined analogously to the contextless case.

The main observation is that the estimation problem for the contextual case can actually be reduced to the contextless bandit case by treating the context-action pairs as “actions” belonging to the product space $\mathcal{X} \times \mathcal{A}$. For any policy $\pi$, by slightly abusing notation, let $\mu \otimes \pi$ be the joint distribution of $(X, A)$ when $X \sim \mu$ and $A \sim \pi(\cdot|X)$. This way, we can map any contextual policy evaluation problem defined by $\pi$, $\pi_D$, $\mu$, $\Phi$ and a sample size $n$ into a contextless policy evaluation problem defined by $\mu \otimes \pi$, $\mu \otimes \pi_D$, $\Phi$, with action set $\mathcal{X} \times \mathcal{A}$. Therefore, with