In many real-world decision-making scenarios, evaluating a novel policy by directly executing it in the environment is costly and can even be risky. Examples include evaluating a recommendation policy (SwaminathanKADL17; zheng2018drn), a treatment policy (hirano2003efficient; murphy2001marginal), and a traffic light control policy (van2016coordinated). Off-policy policy evaluation (OPPE) methods use a set of previously collected trajectories (for example, website interaction logs, patient trajectories, or robot trajectories) to estimate the value of a novel decision-making policy without interacting with the environment (precup2001off; DudikLL11). For many reinforcement learning applications, the value of a decision is defined over a long or infinite horizon, which makes OPPE more challenging.
The state-of-the-art methods for infinite-horizon off-policy policy evaluation rely on learning (discounted) state stationary distribution corrections, or ratios. In particular, for each state in the environment, these methods estimate the likelihood ratio between the long-term probability that the state is visited in a trajectory generated by the target policy and the corresponding probability under the behavior policy. This approach avoids the exponentially high variance of the more classic importance sampling (IS) estimators (precup2000eligibility; DudikLL11; hirano2003efficient; wang2017optimal; murphy2001marginal), especially for infinite-horizon policy evaluation (liumethod; dualdice; hallak2017consistent). However, learning the state stationary distribution correction requires detailed information on the action distributions of the behavior policy; we call such methods policy-aware. As a consequence, policy-aware methods are difficult to apply when the off-policy data are pre-generated by multiple behavior policies or when the form of the behavior policy is unknown. To address this issue, dualdice proposes a policy-agnostic method, DualDice, which learns the joint state-action stationary distribution correction. This correction is of much higher dimension, and therefore needs more model parameters, than the state stationary distribution correction. Moreover, there is no theoretical comparison between policy-aware and policy-agnostic methods.
In this paper, we propose a partially policy-agnostic method, EMP (estimated mixture policy), for infinite-horizon off-policy policy evaluation with multiple known or unknown behavior policies. EMP is partially policy-agnostic in the sense that it does not necessarily require knowledge of the individual behavior policies. Instead, it involves a pre-estimation step that estimates a single mixed policy, which will be defined formally later. Like the method in liumethod, EMP learns the state stationary distribution correction, so it remains computationally cheap and scales with the number of behavior policies. Inspired by hanna2019importance, we construct a theoretical bound for the mean square error (MSE) of the stationary distribution corrections learned by EMP. In particular, we show that in the single-behavior-policy setting, EMP yields smaller MSE than the policy-aware method. Compared to DualDice, EMP learns the state stationary distribution correction, which is of smaller dimension. More importantly, the estimation of the mixture policy can be regarded as an inductive bias as far as the stationary distribution correction is concerned, and hence EMP could achieve better performance when the pre-estimation is not expensive. In addition, we propose an ad-hoc improvement of EMP, whose theoretical analysis is left for future study. We compare EMP with both policy-aware and policy-agnostic methods on a set of continuous and discrete control tasks, where it shows significant improvement.
2 Background and Related Work
2.1 Infinite-horizon Off-policy Policy Evaluation
We consider a Markov Decision Process (MDP), and our goal is to estimate the infinite-horizon average reward. The environment is specified by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, r, T \rangle$, consisting of a state space, an action space, a reward function, and a transition probability function. A policy $\pi$ interacts with the environment iteratively, starting with an initial state $s_0$. At step $t = 0, 1, \ldots$, the policy produces a distribution $\pi(\cdot \mid s_t)$ over the actions $\mathcal{A}$, from which an action $a_t$ is sampled and applied to the environment. The environment stochastically produces a scalar reward $r(s_t, a_t)$ and a next state $s_{t+1} \sim T(\cdot \mid s_t, a_t)$. The infinite-horizon average reward under policy $\pi$ is
$$R_\pi = \lim_{N \to \infty} \frac{1}{N+1}\, \mathbb{E}\left[\sum_{t=0}^{N} r(s_t, a_t)\right].$$
Without gathering new data, off-policy policy evaluation (OPPE) considers the problem of estimating the expected reward of a target policy $\pi$ from pre-collected state-action-reward tuples generated by policies different from $\pi$, which are called behavior policies. In this paper, we consider the general setting in which the data are generated by multiple behavior policies $\pi_1, \ldots, \pi_m$. Most of the OPPE literature has focused on the single-behavior-policy case where $m = 1$; in this case, we denote the behavior policy by $\pi_0$. Roughly speaking, most OPPE methods fall into two categories: importance-sampling (IS) based OPPE and stationary-distribution-correction based OPPE.
2.2 Importance Sampling Policy Evaluation Using Exact and Estimated Behavior Policy
For short-horizon off-policy policy evaluation, importance sampling (IS) methods (precup2001off; DudikLL11; SwaminathanKADL17; PrecupSS00; horvitz1952generalization) have shown promising empirical results. The main idea of IS-based OPPE is to use importance weighting to correct the mismatch between the target policy and the behavior policy that generated the trajectory.
li2015toward and hanna2019importance show that using an estimated behavior policy in the importance weights can yield an importance sampling estimate with smaller mean square error (MSE). EMP also uses an estimated policy, but there are two key differences between EMP and these previous works: (1) EMP is not an IS-based method; it involves a min-max problem; (2) EMP focuses on the multiple-behavior-policy setting, while previous works have focused on the single-behavior-policy setting.
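To make the importance-weighting idea concrete, here is a minimal sketch of a trajectory-wise IS estimate of the average step reward. The function name `is_estimate`, the trajectory layout, and the two probability callables are assumptions made for illustration; they are not taken from the cited works. Passing an estimated behavior policy as `behavior_prob` instead of the exact one corresponds to the estimated-behavior-policy variant discussed above.

```python
import numpy as np

def is_estimate(trajectories, target_prob, behavior_prob):
    """Trajectory-wise importance-sampling estimate of the average step reward.

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob, behavior_prob: callables (action, state) -> probability.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0   # product of per-step likelihood ratios
        total = 0.0    # undiscounted sum of rewards
        for s, a, r in traj:
            weight *= target_prob(a, s) / behavior_prob(a, s)
            total += r
        estimates.append(weight * total / len(traj))
    return float(np.mean(estimates))

# Toy one-state problem (hypothetical data): the behavior policy is uniform
# over two actions and the target policy always picks action 1.
toy_trajs = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
toy_value = is_estimate(
    toy_trajs,
    target_prob=lambda a, s: 1.0 if a == 1 else 0.0,
    behavior_prob=lambda a, s: 0.5,
)
```

Because the weight is a product over all steps of a trajectory, its variance can grow quickly with the horizon, which is the issue that the stationary-distribution-correction methods of the next subsection are designed to avoid.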
2.3 Policy Evaluation via Learning Stationary Distribution Correction
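The state-of-the-art methods for long-horizon off-policy policy evaluation are stationary-distribution-correction based (liumethod; dualdice; hallak2017consistent). Let $d_{\pi_0}$ and $d_\pi$ be the stationary distributions of the state under the behavior policy $\pi_0$ and the target policy $\pi$, respectively. The main idea is to apply importance weighting by $\omega(s) = d_\pi(s) / d_{\pi_0}(s)$ directly on the stationary state-visitation distributions, which avoids the exploding variance suffered by IS, and to estimate the average reward as
$$R_\pi = \mathbb{E}_{s \sim d_{\pi_0},\, a \sim \pi_0(\cdot \mid s)}\left[\omega(s)\, \frac{\pi(a \mid s)}{\pi_0(a \mid s)}\, r(s, a)\right].$$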
For example, liumethod uses a min-max approach to estimate directly from the data. This class of methods requires exact knowledge of the behavior policy and is not straightforward to apply in the multiple-behavior-policy setting. Recently, dualdice proposed DualDice to overcome this limitation by learning the state-action stationary distribution correction .
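The reweighting identity behind this class of methods can be checked numerically in a small tabular setting. The sketch below is illustrative only: the 2-state, 2-action MDP, the two policies, and all variable names are hypothetical, and the stationary distributions are computed exactly from the transition model rather than learned from data as in the methods discussed above.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P (assumed ergodic)."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue 1
    return v / v.sum()

def state_transition(T, pi):
    """State-to-state matrix P[s, t] = sum_a pi[s, a] * T[s, a, t]."""
    return np.einsum("sa,sat->st", pi, T)

# Hypothetical 2-state, 2-action MDP.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # T[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])    # r[s, a]
pi0 = np.full((2, 2), 0.5)                # behavior policy
pi = np.array([[0.9, 0.1], [0.2, 0.8]])   # target policy

d0 = stationary_distribution(state_transition(T, pi0))
d = stationary_distribution(state_transition(T, pi))
omega = d / d0                            # state distribution correction

# Average reward computed directly under the target policy ...
R_direct = np.sum(d[:, None] * pi * r)
# ... and as a reweighted expectation under the behavior policy.
R_corrected = np.sum(d0[:, None] * pi0 * omega[:, None] * (pi / pi0) * r)
```

The two quantities agree, and the correction contributes a single factor per tuple instead of a product over the whole trajectory.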
3 Single Behavior Policy
We first consider the task of learning the stationary distribution correction in the simple case where the data are generated by a single behavior policy, as in previous state stationary distribution correction methods. To explain the min-max formulation of the learning task, we first briefly review the method introduced by liumethod in Section 3.1, which we shall refer to as the BCH method in the rest of the paper. In Section 3.2, we show that replacing the exact values of the behavior policy in the min-max problem by their estimated values is beneficial in two respects. First, it extends the method to settings where the behavior policy is unknown. Second, even when the behavior policy is known exactly, we prove that the stationary distribution correction learned by the min-max problem with the estimated behavior policy has smaller MSE. We deal with the multiple-behavior-policy case in Section 4.
3.1 Learning Stationary Distribution Correction with Exact Behavior Policy
Assume the data, consisting of state-action-next-state tuples, are generated by a single behavior policy , i.e. . Recall that and are the stationary state distributions under the behavior and target policy respectively, and is the stationary distribution correction. In the rest of Section 3, with a slight abuse of notation, we also denote , and .
We briefly review the BCH method proposed by liumethod. As is the stationary distribution of as under policy , it follows that:
Therefore, for any function ,
Recall that , so and the data sample satisfy the following equation
BCH solves the above equation via the following min-max problem:
and uses a kernel method to solve . The derivation of the kernel method is given in Appendix A.
3.2 Learning Stationary Distribution Correction with Estimated Behavior Policy
The objective function in the min-max problem (2), evaluated on the data sample, can be viewed as a one-step importance sampling estimate. As shown in hanna2019importance, importance sampling with an estimated behavior policy has smaller MSE. Motivated by this fact and by the heuristic that a better evaluation of the objective function leads to a more accurate solution, we show that the BCH method can also be improved by using an estimated behavior policy, which yields smaller asymptotic MSE. We will use this result to build a theoretical guarantee for the performance of the EMP method in Section 4.
To formally state the theoretical result, we need to introduce more notation. Assume that we are given a class of stationary distribution corrections , and there exists such that the true distribution correction . Let be the stationary distribution correction learned by the min-max problem (2) and be that learned by a min-max problem using estimated policy:
The intuition is that the value of is estimated from the data sample and appears in the denominator; as a result, it can cancel out a certain amount of the random error in the data sample. We use a maximum likelihood method to estimate the behavior policies for discrete and continuous control tasks; the details are in Appendix E.2. Based on the proof techniques in henmi2007, we establish the following theoretical guarantee that using an estimated behavior policy yields better estimates of the stationary distribution correction.
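For intuition on the pre-estimation step in the discrete case, the following is a minimal sketch that assumes an unrestricted tabular policy class, for which the maximum likelihood estimate is simply the empirical action frequency in each state. The function name `estimate_policy` and the uniform fallback for unvisited states are assumptions made for this illustration; the estimator actually used in the experiments is the one described in Appendix E.2 and may differ.

```python
import numpy as np

def estimate_policy(states, actions, n_states, n_actions):
    """Maximum-likelihood estimate of a tabular policy from (state, action) pairs."""
    counts = np.zeros((n_states, n_actions))
    # Accumulate visit counts; np.add.at handles repeated index pairs correctly.
    np.add.at(counts, (np.asarray(states), np.asarray(actions)), 1.0)
    totals = counts.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals > 0, totals, 1.0)
    # States never visited in the data fall back to a uniform distribution.
    return np.where(totals > 0, counts / safe_totals, 1.0 / n_actions)

# Hypothetical data: state 0 is visited three times, state 1 once, state 2 never.
pi_hat = estimate_policy([0, 0, 0, 1], [0, 0, 1, 1], n_states=3, n_actions=2)
```

Because the estimate is built from the same tuples that enter the min-max objective, the estimated probabilities in the denominator move together with the sampling noise in the data, which is the cancellation effect described above.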
Under some mild conditions, we have, asymptotically
As a direct consequence, we derive the finite-sample error bound for .
(informal) Let be the number of tuples in the data,
4 EMP for Multiple Behavior Policies
In this section, we propose our EMP method for off-policy policy evaluation with multiple known or unknown behavior policies and establish theoretical results on the variance reduction of EMP.
Before that, we first give a detailed description of the data sample and its distribution. Assume the state-action-next-state tuples are generated by different unknown behavior policies . Let be the stationary state distribution and be the number of state-action-next-state tuples generated by policy , for . Let and denote by the proportion of data generated by policy . We use to denote the data set and . Note that the policy label in the subscript is only for notational clarity and is not revealed in the data. Then, a single tuple simply follows the marginal distribution . With a slight abuse of notation, we write .
4.1 EMP Method
Now we derive the EMP method in the multiple-behavior-policy setting and explain explicitly what the mixed policy to be estimated in EMP is.
Let be the mixture of stationary distributions of the behavior policies. For each state-action pair , define as the weighted average of the behavior policies:
It is easy to check that for each , is a distribution on the action space and hence defines a policy by itself. We call the mixed policy. Let , which is a state distribution ratio. Then, , and satisfy the following relation with the average reward .
Besides, the state distribution ratio can be characterized by the stationary equation.
The function (up to a constant) if and only if,
In the special case when , i.e. the data are generated by a single behavior policy, Proposition 2 reduces to Theorem 1 of liumethod. The above two propositions indicate that, to a certain extent, the tuples generated by multiple behavior policies can be pooled together and treated as if they were generated by a single behavior policy .
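As an illustration of the pooling argument, the sketch below computes a mixed policy in a tabular setting as the conditional distribution of the action given the state when a tuple is drawn from the pooled data. This is one reading of the "weighted average of the behavior policies" described above, with weights proportional to the data proportion times the stationary state probability. The function name `mixed_policy`, the array layout, and the numbers are hypothetical, and the stationary distributions are taken as known here, whereas EMP estimates the mixed policy directly from data.

```python
import numpy as np

def mixed_policy(weights, d, pis):
    """Mixture state distribution and mixed policy.

    weights: (m,) data proportions; d: (m, S) stationary state distributions;
    pis: (m, S, A) behavior policies.
    """
    joint = np.einsum("j,js,jsa->sa", weights, d, pis)  # sum_j w_j d_j(s) pi_j(a|s)
    d0 = np.einsum("j,js->s", weights, d)               # sum_j w_j d_j(s)
    return d0, joint / d0[:, None]

# Two hypothetical behavior policies on 2 states and 2 actions.
weights = np.array([0.25, 0.75])
d = np.array([[0.5, 0.5], [0.2, 0.8]])
pis = np.array([[[1.0, 0.0], [0.5, 0.5]],
                [[0.0, 1.0], [0.5, 0.5]]])
d0, pi0 = mixed_policy(weights, d, pis)
```

Each row of `pi0` sums to one, so it is a valid policy, and in states where the behavior policies agree (state 1 here) the mixed policy coincides with them.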
Note that expression (4) involves not only the behavior policies but also the state stationary distributions. In the EMP method, we use a pre-estimation step to generate an estimate from the data. Based on Proposition 2, the state distribution ratio can be estimated by the following min-max problem
Finally, EMP estimates the average reward according to (5) where is approximated by data .
Applying Theorem 1, we show that using the estimated in EMP can actually reduce the MSE of the learned stationary distribution ratio .
4.2 Why Pooling is Beneficial for EMP
One important feature of EMP is that it pools the data from different behavior policies together and treats them as if they came from a single mixed policy. Pooling makes EMP applicable to settings with minimal information on the behavior policies; for instance, EMP does not even require knowledge of the number of behavior policies. In this part, we show that the pooling feature of EMP is not just a compromise forced by the lack of behavior policy information: it also leads to variance reduction in an intrinsic manner.
If instead, the data can be classified according to the behavior policies and treated separately, we can still use EMP, which reduces to (3), or any other single-behavior-policy method, to obtain the stationary distribution correction for each behavior policy. Given , a common approach for variance reduction is to apply multiple importance sampling (MIS) (Tirinzoni2019; Veach1995) technique and the average reward estimator is of the form
where the function is often referred to as heuristics and must be a partition of unity, i.e., for all . It has been proved by (Veach1995) that MIS is unbiased, and, for given , there is an optimal heuristic function to minimize the variance of .
For MIS with fixed values of , among all possible values of heuristics , the balanced heuristic
reaches the minimal variance.
In this light, by pooling the data together and directly learning , EMP also learns the optimal MIS weight implicitly.
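The link between the balance heuristic and pooling can be verified on a toy discrete example. The sketch below uses the standard form of the balance heuristic from the MIS literature, with weights proportional to the sample count times the sampling probability. It works with generic sampling distributions rather than stationary state distributions, and all names and numbers are hypothetical.

```python
import numpy as np

def balance_heuristic(x, n, probs):
    """h_j(x) = n_j p_j(x) / sum_k n_k p_k(x), for a (m, X) probability table."""
    num = n[:, None] * probs[:, x]            # shape (m, len(x))
    return num / num.sum(axis=0)

def mis_estimate(samples, n, probs, target, f):
    """MIS estimate of E_target[f] with per-source samples and balance heuristic."""
    total = 0.0
    for j, xs in enumerate(samples):
        h = balance_heuristic(xs, n, probs)[j]
        total += np.sum(h * target[xs] / probs[j, xs] * f[xs]) / n[j]
    return total

def pooled_estimate(samples, n, probs, target, f):
    """Importance sampling against the mixture, ignoring the source labels."""
    xs = np.concatenate(samples)
    mixture = (n / n.sum()) @ probs
    return np.mean(target[xs] / mixture[xs] * f[xs])

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])   # two sampling distributions
target = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, 2.0, 3.0])
samples = [np.array([0, 0, 1]), np.array([2, 1])]      # fixed toy samples
n = np.array([3.0, 2.0])

mis_value = mis_estimate(samples, n, probs, target, f)
pooled_value = pooled_estimate(samples, n, probs, target, f)
```

On any fixed sample the two estimates coincide, since the balance-heuristic weight cancels the per-source density and leaves the mixture density in the denominator. This is the sense in which a method that only sees pooled data applies the balance heuristic without needing the source labels.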
Single-behavior-policy results of BCH, EMP and WIS across continuous and discrete environments with average reward. Each point indicates the mean value, and the bars represent the standard error of the mean.
In this section, we conduct experiments on three discrete-control tasks (Taxi, Singlepath, and Gridworld) and one continuous-control task (Pendulum); see Appendix E.1 for details. The experiments have the following purposes: (i) to compare the performance of distribution correction learning using policy-aware, policy-agnostic, and partially policy-agnostic methods (Sec 5.1); (ii) to compare the performance of the proposed EMP with existing OPPE methods (Sec 5.1 and 5.2); (iii) to explore potential improvements of EMP (Sec 5.2). We will release the code with the publication of this paper.
5.1 Results for Single Behavior Policy
In this section, we compare the EMP method with the BCH method and step-wise importance sampling (IS) in the setting of single-behavior policy, i.e. the data is generated from a single behavior policy.
Experiment Set-up. A single behavior policy, learned by a reinforcement learning algorithm (Q-learning in the discrete control tasks and Actor-Critic in the continuous control tasks), is used for evaluating BCH and IS. This behavior policy then generates a set of trajectories consisting of (s, a, s', r) tuples. These tuples are used to estimate the behavior policy for EMP, as well as to estimate the stationary distribution corrections from which the average step reward of the target policy is computed.
Stationary Distribution Learning Performance. We choose the Taxi domain as an example to compare the stationary distributions and learned by BCH and EMP. Figure 1(a) shows the scatter pairs and estimated from 200 trajectories of 200 steps. It shows that approximates better than . Figure 1(b) and Figure 1(b) compare the TV distance from and to under different data sample sizes. The results indicate that both and converge, while converges faster and is significantly closer to when the data size is small. These observations are consistent with Theorem 1.
Policy Evaluation Performance. Figure 2 reports the MSE of policy evaluation by the EMP, BCH, and IS methods in the 4 environments. We observe that (i) EMP consistently obtains smaller MSE than the other two methods across sample scales and environments; (ii) the performance of EMP and BCH improves as the number of trajectories and the horizon length increase, while the IS method suffers from growing variance. Our method correctly estimates the true density ratio over the state space.
Partially Policy-agnostic versus Policy-agnostic OPPE. Figure 3 reports the comparison results for the policy-aware BCH, the partially policy-agnostic EMP, and a policy-agnostic method, which we call state-action distribution learning (SADL) and whose formal formulation is given in Appendix D. The results show that all three methods improve as the number and length of trajectories increase. Roughly speaking, both EMP and SADL outperform BCH. The policy-agnostic SADL is better than EMP for small sample sizes, but as the sample size increases and the estimated behavior policy becomes more accurate, EMP gradually overtakes SADL.
Remark: In our implementation of SADL, we use the same min-max formulation and optimization solver as in EMP, so that the comparison sheds more light on the impact of behavior policy information on the performance of off-policy policy evaluation. We will report the comparison between EMP and DualDice once the DualDice code is released.
5.2 Results for Multiple Unknown Behavior Policies
For multiple behavior policies, we conduct experiments in both the policy-aware and the partially policy-agnostic settings. We report the results of the partially policy-agnostic setting in this section, because this version consistently achieves better performance; the policy-aware setting is described in Appendix E.4.
Experiment Set-up. We implement the following 4 methods: (1) the proposed EMP; (2) the multiple importance sampling (MIS) method as in Tirinzoni2019 using balanced heuristics; (3) EMP (single), in which we apply EMP for each subgroup of samples generated by a single behavior policy to obtain one OPPE value and finally output their average; (4) KL-EMP, which is an ad-hoc improvement of EMP using more information on the behavior policies and whose implementation details are given in Appendix E.3.
Policy Evaluation Performance. Figure 4 reports the log MSE of the 4 methods in different environments with different sample scales. It shows that the proposed EMP outperforms both MIS and EMP (single). It is interesting to note that EMP (single) actually learns more information on the behavior policies than EMP does, but the learned stationary distribution corrections are mixed with naively equal weights. The advantage of EMP over EMP (single) can therefore probably be attributed to (1) the robustness due to less required information on the behavior policies; (2) a near-optimal weighted average that is automatically learned by pooling together the samples from different behavior policies.
On the other hand, the performance of KL-EMP improves more as the sample size increases, and KL-EMP eventually outperforms EMP for large sample sizes. This is because KL-EMP replaces the fixed sample proportion (i.e. as defined in Section 4.2) with a KL divergence-based proportion, which is better estimated with more data.
In this paper, we advocate the viewpoint of partial policy-awareness and the benefits of estimating a mixture policy for off-policy policy evaluation. The theoretical results on reduced variance, coupled with the experimental results, illustrate the power of this class of methods. One key question remains: if we are willing to estimate the individual behavior policies, can we further improve EMP by developing an efficient algorithm to compute the optimal weights? Another is a direct comparison of DualDice and EMP once the code of DualDice is released; this will allow us to see the pros and cons of the inductive bias offered by the Bellman equation used by DualDice and by the direct estimation of the mixture policy used by EMP.
Appendix A Kernel Method
We use a reproducing kernel Hilbert space (RKHS) to solve the min-max problem of BCH (liumethod). The key property of an RKHS that we leverage is the reproducing property, which states that, for any function ( is an RKHS), the evaluation of at point x equals its inner product with another function in the RKHS: .
Given the objective function of BCH , we use the reproducing property to obtain the closed-form representation of , which is shown as follows:
This equation is proved in liumethod.
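On a finite sample, a closed form of this kind can be evaluated as a quadratic form in the per-tuple residuals. The sketch below assumes that the residual of a tuple is the correction at the current state times the policy ratio, minus the correction at the next state, which is our reading of the objective in liumethod; the equation in this appendix remains the authoritative form. The function name `kernel_loss`, the tabular correction, and the indicator kernel are choices made purely for illustration.

```python
import numpy as np

def kernel_loss(omega, ratio, s, s_next, gram):
    """Empirical kernelized objective (1/n^2) * sum_ij Delta_i Delta_j k(s'_i, s'_j).

    omega: (S,) candidate correction; ratio: (n,) policy ratios of the sampled
    actions (target over behavior); s, s_next: (n,) integer state indices;
    gram: (n, n) kernel matrix evaluated on the next states.
    """
    delta = omega[s] * ratio - omega[s_next]   # assumed per-tuple residual
    return float(delta @ gram @ delta) / len(delta) ** 2

# Two hypothetical tuples on a 2-state space, with an indicator kernel.
s = np.array([0, 1])
s_next = np.array([1, 0])
gram = (s_next[:, None] == s_next[None, :]).astype(float)

# With identical target and behavior policies, a constant correction has zero loss.
loss_const = kernel_loss(np.ones(2), np.ones(2), s, s_next, gram)
# A non-constant correction violates the stationary equation on these tuples.
loss_other = kernel_loss(np.array([1.0, 2.0]), np.ones(2), s, s_next, gram)
```

With a positive semi-definite kernel the loss is non-negative, so minimizing it over the correction class replaces the inner maximization of the min-max problem.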
Appendix B Proof of Theorem 1
In this appendix, we provide the mathematical details and proof of Theorem 1. We first introduce some notation and assumptions.
We assume the behavior policy belongs to a class of policies , where is the parameter space, i.e. there exists such that . The estimated behavior policy is obtained via maximum likelihood method, i.e.
We assume that the central limit theorem holds for . Recall that we have assumed in Section 3.2 that the true stationary distribution correction . Using the kernel method introduced in Appendix A, our estimate is obtained via
We assume the following regularity conditions on :
is second order differentiable.
is finite and non-zero.
Here we simply write as for the simplicity of notation.
B.2 Proof of Theorem 1
Following the kernel method,
Then, , we have
Similarly, we have
and Define According to our estimation method,
Therefore, Following the proof of Theorem 1 of (Henmi et al. 2007), it suffices to prove that
One can check
The last equality holds because . Besides, we have
Then, we derive
Here, we define