1 Introduction
In many real-world decision-making scenarios, evaluating a novel policy by directly executing it in the environment is costly and can even be downright risky. Examples include evaluating a recommendation policy (SwaminathanKADL17; zheng2018drn), a treatment policy (hirano2003efficient; murphy2001marginal), and a traffic light control policy (van2016coordinated). Off-policy policy evaluation (OPPE) methods utilize a set of previously collected trajectories (for example, website interaction logs, patient trajectories, or robot trajectories) to estimate the value of a novel decision-making policy without interacting with the environment (precup2001off; DudikLL11). For many reinforcement learning applications, the value of a decision policy is defined over a long or infinite horizon, which makes OPPE more challenging.
The state-of-the-art methods for infinite-horizon off-policy policy evaluation rely on learning (discounted) stationary state distribution corrections, or ratios. In particular, for each state in the environment, these methods estimate the ratio of the long-term probability that the state is visited in a trajectory generated by the target policy to the corresponding probability under the behavior policy. This approach can effectively avoid the exponentially high variance of the more classic importance sampling (IS) estimation methods (precup2000eligibility; DudikLL11; hirano2003efficient; wang2017optimal; murphy2001marginal), especially for infinite-horizon policy evaluation (liumethod; dualdice; hallak2017consistent). However, learning the stationary state distribution correction requires detailed knowledge of the behavior policy's action distributions, and we therefore call these policy-aware methods. As a consequence, policy-aware methods are difficult to apply when the off-policy data are pre-generated by multiple behavior policies or when the behavior policy's form is unknown. To address this issue, dualdice proposes a policy-agnostic method, DualDice, which learns the joint state-action stationary distribution correction; this correction is of much higher dimension and therefore needs more model parameters than the state stationary distribution correction. Moreover, there is no theoretical comparison between policy-aware and policy-agnostic methods.

In this paper, we propose a partially policy-agnostic method, EMP (Estimated Mixture Policy), for infinite-horizon off-policy policy evaluation with multiple known or unknown behavior policies. EMP is partially policy-agnostic in the sense that it does not necessarily require knowledge of the individual behavior policies. Instead, it involves a pre-estimation step that estimates a single mixed policy, defined formally later. Like the method in liumethod, EMP learns the state stationary distribution correction, so it remains computationally cheap and is scalable in the number of behavior policies. Inspired by hanna2019importance, we construct a theoretical bound for the mean square error (MSE) of the stationary distribution corrections learned by EMP. In particular, we show that in the single-behavior-policy setting, EMP yields smaller MSE than the policy-aware method. Compared to DualDice, EMP learns a state stationary distribution correction of smaller dimension; more importantly, the estimation of the mixture policy can be viewed as an inductive bias for the stationary distribution correction, and hence can achieve better performance when the pre-estimation is not expensive. In addition, we propose an ad-hoc improvement of EMP, whose theoretical analysis is left for future studies. EMP is compared with both policy-aware and policy-agnostic methods in a set of continuous and discrete control tasks and shows significant improvement.
2 Background and Related Work
2.1 Infinite-horizon Off-policy Policy Evaluation
We consider a Markov Decision Process (MDP), and our goal is to estimate the infinite-horizon average reward. The environment is specified by a tuple $(\mathcal{S}, \mathcal{A}, r, P)$, consisting of a state space, an action space, a reward function, and a transition probability function. A policy $\pi$ interacts with the environment iteratively, starting from an initial state $s_0$. At step $t$, the policy produces a distribution $\pi(\cdot \mid s_t)$ over the actions, from which an action $a_t$ is sampled and applied to the environment. The environment stochastically produces a scalar reward $r_t \sim r(s_t, a_t)$ and a next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The infinite-horizon average reward under policy $\pi$ is
$$R(\pi) := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r_t \,\Big|\, \pi\Big].$$
Without gathering new data, off-policy policy evaluation (OPPE) considers the problem of estimating the expected reward of a target policy $\pi$ from pre-collected state-action-next-state-reward tuples generated by policies different from $\pi$, which are called behavior policies. In this paper, we consider the general setting in which the data are generated by multiple behavior policies $\pi_1, \dots, \pi_m$. Most OPPE literature has focused on the single-behavior-policy case where $m = 1$; in that case, we denote the behavior policy by $\pi_0$. Roughly speaking, most OPPE methods can be grouped into two categories: importance-sampling (IS) based OPPE and stationary-distribution-correction based OPPE.
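To make the estimation target concrete, the following minimal sketch computes the on-policy Monte Carlo approximation of the average reward by truncating at a long finite horizon. The `env_reset`/`env_step` interface and the `policy(s)` probability vector are hypothetical conventions, not part of the paper; OPPE must estimate the same quantity without executing $\pi$ in the environment.

```python
import numpy as np

def average_reward_on_policy(env_reset, env_step, policy, horizon=10_000, seed=0):
    """Monte Carlo estimate of the infinite-horizon average reward of `policy`,
    truncated at a long finite horizon.  Assumed (hypothetical) interfaces:
    env_reset() -> s0, env_step(s, a) -> (s_next, r), and policy(s) -> 1-D
    array of action probabilities."""
    rng = np.random.default_rng(seed)
    s, total = env_reset(), 0.0
    for _ in range(horizon):
        probs = policy(s)
        a = rng.choice(len(probs), p=probs)   # sample a_t ~ pi(. | s_t)
        s, r = env_step(s, a)
        total += r
    return total / horizon                     # approximates R(pi)
```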
2.2 Importance Sampling Policy Evaluation Using Exact and Estimated Behavior Policy
For short-horizon off-policy policy evaluation, importance sampling (IS) methods (precup2001off; DudikLL11; SwaminathanKADL17; PrecupSS00; horvitz1952generalization) have shown promising empirical results. The main idea of importance-sampling-based OPPE is to use importance weighting to correct the mismatch between the target policy and the behavior policy that generated the trajectory.
li2015toward and hanna2019importance show that using an estimated behavior policy in the importance weights can yield importance sampling estimates with smaller mean square error (MSE). EMP also uses an estimated policy, but there are two key differences between EMP and these previous works: (1) EMP is not an IS-based method; it involves a min-max problem; (2) EMP focuses on the multiple-behavior-policy setting, while previous works have focused on the single-behavior-policy setting.
2.3 Policy Evaluation via Learning Stationary Distribution Correction
The state-of-the-art methods for long-horizon off-policy policy evaluation are based on stationary distribution corrections (liumethod; dualdice; hallak2017consistent). Let $d_{\pi_0}(s)$ and $d_{\pi}(s)$ be the stationary distributions of state $s$ under the behavior policy $\pi_0$ and the target policy $\pi$, respectively. The main idea is to apply importance weighting by $w(s) := d_{\pi}(s)/d_{\pi_0}(s)$ directly on the stationary state-visitation distributions, which avoids the exploding variance suffered by IS, and to estimate the average reward as
$$\hat{R}(\pi) = \mathbb{E}_{(s,a,r) \sim \mathcal{D}}\Big[ w(s)\, \frac{\pi(a \mid s)}{\pi_0(a \mid s)}\, r \Big].$$
For example, liumethod uses a min-max approach to estimate $w$ directly from the data. This class of methods requires exact knowledge of the behavior policy and is not straightforward to apply in the multiple-behavior-policy setting. Recently, dualdice proposed DualDice to overcome this limitation by learning the state-action stationary distribution correction $d_{\pi}(s,a)/d_{\pi_0}(s,a)$.
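Once a correction $\hat{w}$ has been learned, evaluating this estimator on logged tuples is straightforward. The sketch below assumes tabular callables for the policies and adds an optional self-normalization step, which is a common implementation choice rather than part of the estimator defined above.

```python
import numpy as np

def oppe_average_reward(states, actions, rewards, w, pi_target, pi_behavior,
                        self_normalize=True):
    """Average-reward estimate from off-policy (s, a, r) tuples given a learned
    state stationary-distribution correction.  Assumed interfaces: w(s) -> float,
    pi_target(a, s) and pi_behavior(a, s) -> action probabilities."""
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.array([w(s) * pi_target(a, s) / pi_behavior(a, s)
                       for s, a in zip(states, actions)])
    if self_normalize:
        # weighted (self-normalized) variant; often lower variance in practice
        return float(np.sum(ratios * rewards) / np.sum(ratios))
    return float(np.mean(ratios * rewards))
```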
3 Single Behavior Policy
We first consider the task of stationary distribution correction learning in the simple case where the data are generated by a single behavior policy, as in previous state stationary distribution correction methods. To explain the min-max formulation of the learning task, we first briefly review the method introduced by liumethod in Section 3.1, which we shall refer to as the BCH method in the rest of the paper. In Section 3.2, we show that it is beneficial to replace the exact values of the behavior policy in the min-max problem by their estimated values, for two reasons. First, this extends the method to application settings where the behavior policy is unknown. Second, even when the behavior policy is known exactly, we prove that the stationary distribution correction learned by the min-max problem with the estimated behavior policy has smaller MSE. We deal with multiple-behavior-policy cases in Section 4.
3.1 Learning Stationary Distribution Correction with Exact Behavior Policy
Assume the data, consisting of state-action-next-state tuples $\{(s_i, a_i, s'_i)\}_{i=1}^{n}$, are generated by a single behavior policy $\pi_0$, i.e. $a_i \sim \pi_0(\cdot \mid s_i)$. Recall that $d_{\pi_0}$ and $d_{\pi}$ are the stationary state distributions under the behavior and target policy, respectively, and $w(s) = d_{\pi}(s)/d_{\pi_0}(s)$ is the stationary distribution correction. In the rest of Section 3, with a slight abuse of notation, we write $\beta(a \mid s) := \pi(a \mid s)/\pi_0(a \mid s)$ for the policy ratio and $\mathcal{D}$ for the distribution of a data tuple $(s, a, s')$, namely $d_{\pi_0}(s)\,\pi_0(a \mid s)\,P(s' \mid s, a)$.
We briefly review the BCH method proposed by liumethod. Since $d_{\pi}$ is the stationary state distribution under policy $\pi$, it satisfies
$$d_{\pi}(s') = \sum_{s, a} d_{\pi}(s)\, \pi(a \mid s)\, P(s' \mid s, a) \quad \text{for all } s'. \quad (1)$$
Therefore, for any function $f$,
$$\sum_{s'} d_{\pi}(s')\, f(s') = \sum_{s, a, s'} d_{\pi}(s)\, \pi(a \mid s)\, P(s' \mid s, a)\, f(s').$$
Recall that $w(s) = d_{\pi}(s)/d_{\pi_0}(s)$, so $w$ and the data distribution $\mathcal{D}$ satisfy the following equation for all $f$:
$$L(w, f) := \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\big[\big(w(s)\, \beta(a \mid s) - w(s')\big)\, f(s')\big] = 0.$$
BCH solves the above equation via the following min-max problem:
$$\hat{w} = \arg\min_{w \in \mathcal{W}} \max_{f \in \mathcal{F}} \; L(w, f)^2, \quad (2)$$
and uses a kernel method to solve the inner maximization. The derivation of the kernel method is given in Appendix A.
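As an illustration, the kernelized objective produced by the inner maximization (see Appendix A) can be evaluated on a batch of logged tuples as follows. The tabular parameterization of $w$, the array layout of the policy tables, and the `kernel` callable are illustrative assumptions, not the BCH reference implementation.

```python
import numpy as np

def bch_kernel_loss(w_table, data, pi_target, pi_behavior, kernel):
    """Sample-based value of the kernelized min-max objective (2).
    Assumptions: discrete states/actions; `data` is a list of (s, a, s') index
    tuples; `w_table` is a vector of correction values indexed by state;
    `pi_target` and `pi_behavior` are (S, A) probability tables; `kernel(s1, s2)`
    is a positive-definite kernel on states."""
    s = np.array([t[0] for t in data])
    a = np.array([t[1] for t in data])
    s_next = np.array([t[2] for t in data])
    beta = np.array([pi_target[si, ai] / pi_behavior[si, ai]
                     for si, ai in zip(s, a)])
    # residual of the stationarity equation for each tuple
    delta = w_table[s] * beta - w_table[s_next]
    # Gram matrix over next states; the quadratic form is the closed-form max over f
    K = np.array([[kernel(i, j) for j in s_next] for i in s_next])
    return float(delta @ K @ delta) / len(data) ** 2
```

Minimizing this loss over $w$, subject to a normalization such as $\mathbb{E}_{\mathcal{D}}[w(s)] = 1$ to remove the scale degeneracy, recovers the BCH-style estimate.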
3.2 Learning Stationary Distribution Correction with Estimated Behavior Policy
The objective function in the min-max problem (2), when evaluated on the data sample, can be viewed as a one-step importance sampling estimate. As shown in hanna2019importance, importance sampling with an estimated behavior policy has smaller MSE. Motivated by this fact, and by the heuristic that a better evaluation of the objective function leads to a more accurate solution, we show that the BCH method can also be improved by using an estimated behavior policy, yielding smaller asymptotic MSE. We will use this result to build a theoretical guarantee for the performance of the EMP method in Section 4.

To formally state the theoretical result, we need to introduce more notation. Assume that we are given a class $\mathcal{W}$ of stationary distribution corrections, and that there exists $w^* \in \mathcal{W}$ equal to the true distribution correction $d_{\pi}/d_{\pi_0}$. Let $\hat{w}$ be the stationary distribution correction learned by the min-max problem (2) and $\tilde{w}$ be that learned by a min-max problem using the estimated policy $\hat{\pi}_0$:
$$\tilde{w} = \arg\min_{w \in \mathcal{W}} \max_{f \in \mathcal{F}} \; \Big( \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\Big(w(s)\, \frac{\pi(a \mid s)}{\hat{\pi}_0(a \mid s)} - w(s')\Big) f(s')\Big] \Big)^2. \quad (3)$$
The intuition is that the value of $\hat{\pi}_0(a \mid s)$ is estimated from the data sample and appears in the denominator; as a result, it can cancel out a certain amount of the random error in the data sample. We use a maximum likelihood method to estimate the behavior policies for discrete and continuous control tasks (a count-based sketch appears at the end of this section); the details are in Appendix E.2. Based on the proof techniques in henmi2007, we establish the following theoretical guarantee that using the estimated behavior policy yields better estimates of the stationary distribution correction.
Theorem 1.
Under some mild conditions, we have, asymptotically, that the MSE of $\tilde{w}$ (learned with the estimated behavior policy) is no larger than the MSE of $\hat{w}$ (learned with the exact behavior policy).
As a direct consequence, we derive a finite-sample error bound for $\tilde{w}$.
Corollary 1.
(informal) Let $n$ be the number of tuples in the data; then the MSE of $\tilde{w}$ is bounded by a quantity that vanishes as $n$ grows.
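For reference, the pre-estimation step used in this section can be as simple as a count-based maximum-likelihood estimate in the discrete case. The following sketch uses additive smoothing to keep the estimated probabilities strictly positive; it is an assumption-laden stand-in for the estimator described in Appendix E.2, not the paper's implementation.

```python
import numpy as np

def estimate_behavior_policy(states, actions, n_states, n_actions, smoothing=1.0):
    """Count-based maximum-likelihood estimate of a discrete behavior policy
    from logged (s, a) pairs, with additive smoothing so that every estimated
    probability is strictly positive (it appears in a denominator later)."""
    counts = np.full((n_states, n_actions), smoothing)
    for s, a in zip(states, actions):
        counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)   # rows are pi_hat(. | s)
```

Note that when the logged pairs come from several behavior policies, applying the same estimator to the pooled data directly yields an estimate of the mixed policy defined in Section 4, since the pooled pairs are drawn from $d_b(s)\,\pi_b(a \mid s)$.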
4 EMP for Multiple Behavior Policies
In this section, we propose our EMP method for off-policy policy evaluation with multiple known or unknown behavior policies and establish theoretical results on the variance reduction achieved by EMP.
Before that, we first give a detailed description of the data sample and its distribution. Assume the state-action-next-state tuples are generated by $m$ different unknown behavior policies $\pi_1, \dots, \pi_m$. Let $d_{\pi_k}$ be the stationary state distribution and $n_k$ be the number of state-action-next-state tuples generated by policy $\pi_k$, for $k = 1, \dots, m$. Let $n = \sum_{k=1}^{m} n_k$ and denote by $\alpha_k = n_k/n$ the proportion of data generated by policy $\pi_k$. We use $\mathcal{D} = \{(s_{k,i}, a_{k,i}, s'_{k,i})\}$ to denote the data set. Note that the policy label $k$ in the subscript is only for notational clarity and is not revealed in the data. A single tuple then follows the marginal distribution $\sum_{k=1}^{m} \alpha_k\, d_{\pi_k}(s)\, \pi_k(a \mid s)\, P(s' \mid s, a)$. With a slight abuse of notation, we write $(s, a, s') \sim \mathcal{D}$ for this distribution.
4.1 EMP Method
Now we derive the EMP method in the multiple-behavior-policy setting and explain explicitly what the mixed policy to be estimated in EMP is.
Let $d_b(s) := \sum_{k=1}^{m} \alpha_k\, d_{\pi_k}(s)$ be the mixture of the stationary distributions of the behavior policies. For each state-action pair $(s, a)$, define $\pi_b$ as the weighted average of the behavior policies:
$$\pi_b(a \mid s) := \frac{\sum_{k=1}^{m} \alpha_k\, d_{\pi_k}(s)\, \pi_k(a \mid s)}{d_b(s)}. \quad (4)$$
It is easy to check that for each $s$, $\pi_b(\cdot \mid s)$ is a distribution on the action space, and hence $\pi_b$ defines a policy by itself. We call $\pi_b$ the mixed policy. Let $w(s) := d_{\pi}(s)/d_b(s)$, which is a state distribution ratio. Then $w$, $\pi$ and $\pi_b$ satisfy the following relation with the average reward $R(\pi)$.
Proposition 1.
$$R(\pi) = \mathbb{E}_{(s,a,r) \sim \mathcal{D}}\Big[ w(s)\, \frac{\pi(a \mid s)}{\pi_b(a \mid s)}\, r \Big]. \quad (5)$$
Moreover, the state distribution ratio $w$ can be characterized by a stationarity equation.
Proposition 2.
The function $w$ equals $d_{\pi}/d_b$ (up to a constant) if and only if, for all functions $f$,
$$\mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\Big(w(s)\, \frac{\pi(a \mid s)}{\pi_b(a \mid s)} - w(s')\Big) f(s')\Big] = 0. \quad (6)$$
In the special case $m = 1$, i.e. when the data are generated by a single behavior policy, Proposition 2 reduces to Theorem 1 of liumethod. The above two propositions indicate that, to a certain extent, the tuples generated by multiple behavior policies can be pooled together and treated as if they were generated by a single behavior policy $\pi_b$.
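To make the definition concrete, the following sketch constructs $\pi_b$ from known components via (4). In EMP itself the mixed policy is estimated directly from the pooled data, so this block only illustrates the definition; all array shapes are assumptions.

```python
import numpy as np

def mixed_policy(alphas, d_list, pi_list):
    """Mixed policy of eq. (4) built from known components.
    Assumptions: `alphas[k]` is the data proportion of behavior policy k,
    `d_list[k]` its stationary state distribution (shape [S]), and
    `pi_list[k]` its policy table (shape [S, A])."""
    d_mix = sum(a * d for a, d in zip(alphas, d_list))                       # d_b(s)
    num = sum(a * d[:, None] * pi for a, d, pi in zip(alphas, d_list, pi_list))
    return num / np.maximum(d_mix[:, None], 1e-12)                          # pi_b(a|s)
```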
Note that the expression (4) involves not only the behavior policies but also their stationary state distributions. In the EMP method, we use a pre-estimation step to produce an estimate $\hat{\pi}_b$ of the mixed policy from the data. Based on Proposition 2, the state distribution ratio $w$ can then be estimated by the following min-max problem:
$$\hat{w} = \arg\min_{w \in \mathcal{W}} \max_{f \in \mathcal{F}} \; \Big( \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\Big(w(s)\, \frac{\pi(a \mid s)}{\hat{\pi}_b(a \mid s)} - w(s')\Big) f(s')\Big] \Big)^2. \quad (7)$$
Finally, EMP estimates the average reward according to (5), where the expectation is approximated by the data and $\pi_b$ is replaced by $\hat{\pi}_b$.
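Putting the pieces together, a minimal end-to-end sketch of EMP on a discrete task could look as follows, reusing the illustrative helpers introduced earlier (`estimate_behavior_policy`, `bch_kernel_loss`, `oppe_average_reward`). The log-parameterization of $w$, the normalization $\mathbb{E}_{\mathcal{D}}[w(s)] = 1$, and the L-BFGS solver are implementation assumptions, not the paper's reference code.

```python
import numpy as np
from scipy.optimize import minimize

def emp_estimate(data, rewards, pi_target, n_states, n_actions, kernel):
    """End-to-end EMP sketch on a discrete task.  `data` is a list of
    (s, a, s') index tuples pooled over all behavior policies, `rewards` the
    matching rewards, and `pi_target` an (S, A) probability table."""
    s = [t[0] for t in data]
    a = [t[1] for t in data]
    s_idx = np.array(s)
    # (i) pre-estimation: MLE of the mixed policy from the pooled tuples
    pi_b_hat = estimate_behavior_policy(s, a, n_states, n_actions)

    # (ii) min-max problem (7): minimize the kernelized residual over w > 0,
    #      normalizing so that the empirical mean of w(s) over the data is 1
    def loss(log_w):
        w = np.exp(log_w)
        w = w / w[s_idx].mean()
        return bch_kernel_loss(w, data, pi_target, pi_b_hat, kernel)

    res = minimize(loss, np.zeros(n_states), method="L-BFGS-B")
    w_hat = np.exp(res.x)
    w_hat /= w_hat[s_idx].mean()

    # (iii) plug the learned correction into the average-reward estimator (5)
    return oppe_average_reward(
        s, a, rewards,
        w=lambda si: w_hat[si],
        pi_target=lambda ai, si: pi_target[si, ai],
        pi_behavior=lambda ai, si: pi_b_hat[si, ai],
    )
```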
Applying Theorem 1, we show that using the estimated mixed policy $\hat{\pi}_b$ in EMP can actually reduce the MSE of the learned state distribution ratio $\hat{w}$.
4.2 Why Pooling is Beneficial for EMP
One important feature of EMP is that it pools the data from different behavior policies together and treats them as if they came from a single mixed policy. Of course, pooling makes EMP applicable to settings with minimal information on the behavior policies; for instance, EMP does not even require knowledge of the number of behavior policies. In this part, we show that the pooling feature of EMP is not just a compromise to the lack of behavior policy information: it also leads to variance reduction in an intrinsic manner.
If, instead, the data can be classified according to the behavior policies and treated separately, we can still use EMP, which then reduces to (3), or any other single-behavior-policy method, to obtain a stationary distribution correction $w_k(s) = d_{\pi}(s)/d_{\pi_k}(s)$ for each behavior policy $\pi_k$. Given the $w_k$, a common approach for variance reduction is to apply the multiple importance sampling (MIS) technique (Tirinzoni2019; Veach1995), whose average reward estimator is of the form
$$\hat{R}_{\mathrm{MIS}} = \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} h_k(s_{k,i}, a_{k,i})\, w_k(s_{k,i})\, \frac{\pi(a_{k,i} \mid s_{k,i})}{\pi_k(a_{k,i} \mid s_{k,i})}\, r_{k,i}, \quad (8)$$
where the functions $h_k$ are often referred to as heuristics and must form a partition of unity, i.e., $\sum_{k=1}^{m} h_k(s, a) = 1$ for all $(s, a)$. It has been proved by Veach1995 that MIS is unbiased and that, for given $w_k$, there is an optimal heuristic function that minimizes the variance of $\hat{R}_{\mathrm{MIS}}$.
Proposition 4.
For MIS with fixed values of $w_1, \dots, w_m$, among all possible heuristics $h_1, \dots, h_m$, the balanced heuristic
$$h_k(s, a) = \frac{n_k\, d_{\pi_k}(s)\, \pi_k(a \mid s)}{\sum_{j=1}^{m} n_j\, d_{\pi_j}(s)\, \pi_j(a \mid s)}$$
attains the minimal variance.
Plugging the optimal heuristic into the MIS estimator (8), we obtain that the optimal MIS estimator coincides with the EMP estimator (5), i.e.
$$\hat{R}_{\mathrm{MIS}}^{\mathrm{opt}} = \frac{1}{n} \sum_{k=1}^{m} \sum_{i=1}^{n_k} w(s_{k,i})\, \frac{\pi(a_{k,i} \mid s_{k,i})}{\pi_b(a_{k,i} \mid s_{k,i})}\, r_{k,i}. \quad (9)$$
In this light, by pooling the data together and directly learning $w = d_{\pi}/d_b$, EMP also learns the optimal MIS weights implicitly.
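The identity (9) can also be checked numerically. The following sketch implements the MIS estimator (8) with the balanced heuristic expressed through the per-policy corrections $w_k$, using $d_{\pi_k}(s) \propto d_{\pi}(s)/w_k(s)$ so that the unknown $d_{\pi}$ cancels inside the heuristic; the data layout is an assumption, and the code is illustrative rather than the paper's implementation.

```python
import numpy as np

def mis_balance_heuristic(datasets, pi_target, pi_list, w_list):
    """MIS estimate of the average reward with the balanced heuristic of
    Proposition 4.  Assumptions: `datasets[k]` holds (s, a, r) tuples from
    behavior policy k, `pi_list[k]` is its (S, A) probability table,
    `pi_target` the target policy's (S, A) table, and `w_list[k](s)` the
    per-policy correction d_pi(s) / d_{pi_k}(s)."""
    n = np.array([len(d) for d in datasets], dtype=float)
    estimate = 0.0
    for k, dataset in enumerate(datasets):
        for s, a, r in dataset:
            # balanced heuristic h_k(s,a); d_{pi_j}(s) is proportional to 1 / w_j(s)
            scores = np.array([n[j] * pi_list[j][s, a] / max(w_list[j](s), 1e-12)
                               for j in range(len(datasets))])
            h_k = scores[k] / scores.sum()
            # per-sample term of eq. (8)
            estimate += h_k * w_list[k](s) * pi_target[s, a] / pi_list[k][s, a] * r / n[k]
    return estimate
```

With the true $w_k$, the per-sample weight simplifies to $\frac{1}{n}\, w(s)\, \pi(a \mid s)/\pi_b(a \mid s)$, which is exactly the pooled EMP form in (9).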





Figure: Single-behavior-policy results of BCH, EMP and WIS across continuous and discrete environments with average reward. Each point indicates the mean value and the bars represent the standard error of the mean.
5 Experiment
In this section, we conduct experiments in three discrete-control tasks, Taxi, Singlepath and Gridworld, and one continuous-control task, Pendulum (see Appendix E.1 for details), with the following purposes: (i) to compare the performance of distribution correction learning using policy-aware, policy-agnostic and partially policy-agnostic methods (Sec 5.1); (ii) to compare the performance of the proposed EMP with existing OPPE methods (Sec 5.1 and 5.2); (iii) to explore potential improvements of the EMP method (Sec 5.2). We will release the code with the publication of this paper to support related studies.
5.1 Results for Single Behavior Policy
In this section, we compare the EMP method with the BCH method and step-wise importance sampling (IS) in the single-behavior-policy setting, i.e., the data are generated by a single behavior policy.
Experiment Setup. A single behavior policy is learned by a reinforcement learning algorithm (Q-learning in the discrete control tasks and Actor-Critic in the continuous control task) and used for evaluating BCH and IS. This behavior policy then generates a set of trajectories consisting of $(s, a, s', r)$ tuples. These tuples are used to estimate the behavior policy for the EMP method, as well as to estimate the stationary distribution corrections used to estimate the average step reward of the target policy.


Stationary Distribution Learning Performance. We choose the Taxi domain as an example to compare the stationary distribution corrections learned by BCH and by EMP against the true correction. Figure 1(a) shows scatter plots of the estimated versus true correction values, estimated from 200 trajectories of 200 steps each. It shows that the EMP estimate approximates the true correction better than the BCH estimate. Figure 1(b) compares the total variation (TV) distance from the BCH and EMP estimates to the true correction under different data sample sizes. The results indicate that both estimates converge, while the EMP estimate converges faster and is significantly closer to the true correction when the data size is small. These observations are consistent with Theorem 1.




Policy Evaluation Performance. Figure 2 reports the MSE of policy evaluation by the EMP, BCH and IS methods for the four environments. We observe that: (i) EMP consistently obtains smaller MSE than the other two methods across sample scales and environments; (ii) the performance of EMP and BCH improves as the number of trajectories and the length of the horizon increase, while the IS method suffers from growing variance. These results also indicate that our method correctly estimates the true density ratio over the state space.
Partially Policy-agnostic versus Policy-agnostic OPPE. Figure 3 reports the comparison among the policy-aware BCH, the partially policy-agnostic EMP and a policy-agnostic method, which we call state-action distribution learning (SADL) and whose formal formulation is given in Appendix D. The results show that all three methods improve as the number and length of trajectories increase. Roughly speaking, both EMP and SADL outperform BCH. The policy-agnostic SADL is better than EMP when the sample size is small, but as the sample size increases and the estimated behavior policy becomes more accurate, EMP gradually exceeds SADL.
Remark: In our implementation of SADL, we use the same min-max formulation and optimization solver as for EMP, so that the comparison sheds more light on the impact of behavior policy information on the performance of off-policy policy evaluation. We will report a comparison between EMP and DualDice once its code is released.
5.2 Results for Multiple Unknown Behavior Policies
For multiple behavior policies, we conduct experiments in both the policy-aware and the partially policy-agnostic settings. We report the results of the partially policy-agnostic setting in this section, because it consistently achieves better performance; the policy-aware setting is described in Appendix E.4.
Experiment Setup. We implement the following four methods: (1) the proposed EMP; (2) the multiple importance sampling (MIS) method as in Tirinzoni2019 using the balanced heuristic; (3) EMP (single), in which we apply EMP to each subgroup of samples generated by a single behavior policy to obtain one OPPE value and finally output their average; (4) KL-EMP, an ad-hoc improvement of EMP that uses more information on the behavior policies and whose implementation details are given in Appendix E.3.
Policy Evaluation Performance. Figure 4 reports the log MSE of the four methods in different environments with different sample scales. It shows that the proposed EMP outperforms both MIS and EMP (single). It is interesting to note that EMP (single) actually learns more information about the behavior policies than EMP does, but the learned stationary distribution corrections are mixed with naively equal weights. So the advantage of EMP over EMP (single) can probably be attributed to (1) the robustness that comes from requiring less information about the behavior policies, and (2) a near-optimal weighted average that is learned automatically by pooling together the samples from different behavior policies.
On the other hand, the performance of KL-EMP improves more with increasing sample size and eventually outperforms EMP in the large-sample regime. This is because KL-EMP replaces the fixed sample proportions $n_k/n$ (cf. Section 4.2) with KL-divergence-based proportions, which are better estimated with more data.
6 Conclusion
In this paper, we advocate the viewpoint of partial policy-awareness and the benefits of estimating a mixture policy for off-policy policy evaluation. The theoretical results on reduced variance, together with the experimental results, illustrate the power of this class of methods. One key question remains: if we are willing to estimate the individual behavior policies, can we further improve EMP by developing an efficient algorithm to compute the optimal weights? Another open direction is a direct comparison of DualDice and EMP once the code of DualDice is released; this will allow us to see the pros and cons of the inductive bias offered by the Bellman equation used by DualDice versus the direct estimation of the mixture policy used by EMP.
References
Appendix A Kernel Method
We use a reproducing kernel Hilbert space (RKHS) to solve the min-max problem of BCH (liumethod). The key property of an RKHS that we leverage is the reproducing property: for any function $f \in \mathcal{H}$, where $\mathcal{H}$ is an RKHS with kernel $k$, the evaluation of $f$ at a point $x$ equals its inner product with the kernel function centered at $x$: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$.
Given the objective function of BCH, $L(w, f) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\big[\big(w(s)\, \beta(a \mid s) - w(s')\big) f(s')\big]$, we use the reproducing property to obtain a closed-form representation of the inner maximization over the unit ball of $\mathcal{H}$:
$$\max_{\|f\|_{\mathcal{H}} \le 1} L(w, f)^2 = \mathbb{E}\big[\Delta(w; s, a, s')\, \Delta(w; \bar{s}, \bar{a}, \bar{s}')\, k(s', \bar{s}')\big],$$
where $\Delta(w; s, a, s') := w(s)\, \beta(a \mid s) - w(s')$ and $(s, a, s')$, $(\bar{s}, \bar{a}, \bar{s}')$ are independent samples from $\mathcal{D}$.
This equation has been proved in the BCH paper (liumethod).
Appendix B Proof of Theorem 1
B.1 Assumptions
In this appendix, we provide the mathematical details and proof of Theorem 1. We first introduce some notation and assumptions.
We assume the behavior policy belongs to a parametric class of policies $\{\pi_\theta : \theta \in \Theta\}$, where $\Theta$ is the parameter space; i.e., there exists $\theta^* \in \Theta$ such that $\pi_0 = \pi_{\theta^*}$. The estimated behavior policy is obtained via the maximum likelihood method, i.e.
$$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i} \log \pi_\theta(a_i \mid s_i).$$
We assume a central limit theorem holds for $\hat{\theta}$. Recall that we have assumed in Section 3.2 that the true stationary distribution correction $w^* \in \mathcal{W}$. Using the kernel method introduced in Appendix A, our estimate is obtained by minimizing the closed-form objective over $\mathcal{W}$.
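For the parametric family assumed here, the maximum-likelihood step can be sketched as fitting a softmax policy over state features. The `features` callable and the optimizer choice are hypothetical; the block only illustrates the definition of $\hat{\theta}$.

```python
import numpy as np
from scipy.optimize import minimize

def fit_softmax_policy(states, actions, features, n_actions):
    """Maximum-likelihood fit of a softmax policy pi_theta(a|s) over state
    features phi(s).  Assumed interface: `features(s)` -> 1-D float array.
    Returns the MLE parameter matrix of shape (d, n_actions)."""
    d = len(features(states[0]))
    X = np.stack([features(s) for s in states])          # (n, d)
    A = np.asarray(actions)

    def neg_log_lik(theta_flat):
        theta = theta_flat.reshape(d, n_actions)
        logits = X @ theta                                # (n, n_actions)
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(A)), A].sum()     # negative log-likelihood

    res = minimize(neg_log_lik, np.zeros(d * n_actions), method="L-BFGS-B")
    return res.x.reshape(d, n_actions)
```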
Assumption 1.
We assume the following regularity conditions on :

is second order differentiable.

is finite.

is finite and nonzero.

is finite.
Here we simply write as for the simplicity of notation.
B.2 Proof of Theorem 1
Proof.
Following the kernel method,
Then, , we have
Similarly, we have
and Define According to our estimation method,
Therefore, following the proof of Theorem 1 of (Henmi et al., 2007), it suffices to prove that
(10) 
One can check
The last equality holds because . Besides, we have
Then, we derive
Here, we define
Note that
Therefore