One of the most fundamental problems in reinforcement learning (RL) is policy evaluation, where we seek to estimate the expected long-term payoff of a given target policy in a decision-making environment. An important variant of this problem, off-policy evaluation (OPE) (Precup00ET), is motivated by applications where deploying a policy in a live environment entails significant cost or risk (Murphy01MM; Thomas15HCPE). To circumvent these issues, OPE attempts to estimate the value of a target policy by referring only to a dataset of experience previously gathered by other policies in the environment. Often, such logging or behavior policies are not known explicitly (e.g., the experience may come from human actors), which necessitates the use of behavior-agnostic OPE methods (NacChoDaiLi19).
While behavior-agnostic OPE appears to be a daunting problem, a number of estimators have recently been developed for this scenario (NacChoDaiLi19; uehara2019minimax; zhang2020gendice; zhang2020gradientdice), demonstrating impressive empirical results. Such estimators, known collectively as the “DICE” family for DIstribution Correction Estimation, model the ratio between the propensity of the target policy to visit particular state-action pairs relative to their likelihood of appearing in the logged data. A distribution corrector of this form can then be directly used to estimate the value of the target policy.
Although there are many commonalities between the various DICE estimators, their derivations are distinct and seemingly incompatible. For example, DualDICE (NacChoDaiLi19) is derived by a particular change-of-variables technique, whereas GenDICE (zhang2020gendice) observes that the substitution strategy cannot work in the average reward setting, and proposes a distinct derivation based on distribution matching. GradientDICE (zhang2020gradientdice) notes that GenDICE exacerbates optimization difficulties, and proposes a variant designed for limited sampling capabilities. Despite these apparent differences, the algorithms all involve a minimax optimization of strikingly similar form, which suggests that a common connection underlies the alternative derivations.
We show that the previous DICE formulations are all in fact equivalent to regularized Lagrangians of the same linear program (LP). This LP shares an intimate relationship with the policy evaluation problem, and has a primal form we refer to as the $Q$-LP and a dual form we refer to as the $d$-LP. The primal form has been concurrently identified and studied in the context of policy optimization (algae), but we focus on the $d$-LP formulation for off-policy evaluation here, which we find to have a more succinct and revealing form for this purpose. Using the $d$-LP, we identify a number of key choices in translating it into a stable minimax optimization problem (i.e., whether to include redundant constraints, and whether to regularize the primal or dual variables), in addition to choices in how to translate an optimized solution into an asymptotically unbiased (“unbiased” for short) estimate of the policy value. We use this characterization to show that the known members of the DICE family are a small subset of specific choices made within a much larger, unexplored set of potential OPE methods.
To understand the consequences of the various choices, we provide a comprehensive study. First, we theoretically investigate which configurations lead to bias in the primal or dual solutions, and when this affects the final estimates. Our analysis shows that the dual solutions offer greater flexibility in stabilizing the optimization while preserving asymptotic unbiasedness, versus primal solutions. We also perform an extensive empirical evaluation of the various choices across different domains and function approximators, and identify novel configurations that improve the observed outcomes.
We consider an infinite-horizon Markov Decision Process (MDP) (puterman1994markov), specified by a tuple $\mathcal{M} = \langle S, A, R, T, \mu_0, \gamma \rangle$, which consists of a state space, action space, reward function, transition probability function, initial state distribution, and discount factor. [Footnote 1: For simplicity, we focus on the discounted case where $\gamma \in [0, 1)$ unless otherwise specified. The same conclusions generally hold for the undiscounted case with $\gamma = 1$; see appendix:undiscounted for more details.] A policy $\pi$ interacts with the environment starting at an initial state $s_0 \sim \mu_0$, producing a distribution $\pi(\cdot|s_t)$ over $A$ from which an action $a_t$ is sampled and applied to the environment at each step $t \ge 0$. The environment produces a scalar reward $r_t = R(s_t, a_t)$, [Footnote 2: We consider a deterministic reward function. All of our results apply to stochastic rewards as well.] and transitions to a new state $s_{t+1} \sim T(s_t, a_t)$.
2.1 Policy Evaluation
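The value of a policy $\pi$ is defined as the normalized expected per-step reward it obtains:

$$\rho(\pi) := (1-\gamma)\,\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 \sim \mu_0,\ \forall t,\ a_t \sim \pi(s_t),\ s_{t+1} \sim T(s_t, a_t)\right]. \tag{1}$$

In the policy evaluation setting, the policy being evaluated is referred to as the target policy. The value of a policy may be expressed in two equivalent ways:

$$\rho(\pi) = (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q^\pi(s_0, a_0)\right] = \mathbb{E}_{(s,a) \sim d^\pi}\left[R(s,a)\right], \tag{2}$$

where $Q^\pi$ and $d^\pi$ are the state-action values and visitations of $\pi$, respectively, which satisfy

$$Q^\pi(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q^\pi(s,a), \quad \text{where } \mathcal{P}^\pi Q(s,a) := \mathbb{E}_{s' \sim T(s,a),\, a' \sim \pi(s')}\left[Q(s',a')\right], \tag{3}$$

$$d^\pi(s,a) = (1-\gamma)\,\mu_0(s)\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d^\pi(s,a), \quad \text{where } \mathcal{P}^\pi_* d(s,a) := \pi(a|s) \sum_{\tilde{s}, \tilde{a}} T(s|\tilde{s}, \tilde{a})\, d(\tilde{s}, \tilde{a}). \tag{4}$$

Note that $\mathcal{P}^\pi$ and $\mathcal{P}^\pi_*$ are linear operators that are transposes (adjoints) of each other. We refer to $\mathcal{P}^\pi$ as the policy transition operator and $\mathcal{P}^\pi_*$ as the transpose policy transition operator. The function $Q^\pi$ corresponds to the $Q$-values of the policy $\pi$; it maps state-action pairs $(s,a)$ to the expected value of policy $\pi$ when run in the environment starting at $(s,a)$. The function $d^\pi$ corresponds to the on-policy distribution of $\pi$; it is the normalized distribution over state-action pairs $(s,a)$ measuring the likelihood that $\pi$ encounters the pair $(s,a)$, averaging over time via $\gamma$-discounting. We make the following standard assumption, which is common in previous policy evaluation work (zhang2020gendice; algae).

[MDP ergodicity] There is a unique fixed point solution to (4).

When $\gamma \in [0,1)$, (4) always has a unique solution, as $0$ cannot belong to the spectrum of $I - \gamma\mathcal{P}^\pi_*$. For $\gamma = 1$, the assumption reduces to ergodicity in the discrete case under a restriction of $d$ to a normalized distribution; the continuous case is treated by meyn2012markov.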
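As a quick sanity check on the two equivalent expressions for the policy value, the following minimal sketch builds a small tabular MDP and solves the two fixed-point equations directly. The MDP numbers and every variable name (`T`, `R`, `mu0`, `pi`, `P`) are hypothetical and chosen only for illustration; the matrix `P` plays the role of the policy transition operator and its transpose plays the role of the transpose operator.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative only.
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.2]])     # R[s, a]
mu0 = np.array([0.6, 0.4])                 # initial state distribution
pi = np.array([[0.7, 0.3], [0.2, 0.8]])    # target policy pi[s, a]

n = R.size                                 # number of (s, a) pairs
# Policy transition operator over flattened (s, a) pairs:
# P[(s, a), (s', a')] = T(s' | s, a) * pi(a' | s').
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
r = R.reshape(n)
mu0_pi = (mu0[:, None] * pi).reshape(n)    # mu0(s) * pi(a | s)

# Q-values: Q = r + gamma * P Q.
Q_pi = np.linalg.solve(np.eye(n) - gamma * P, r)
# Visitations: d = (1 - gamma) * mu0_pi + gamma * P^T d.
d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0_pi)

rho_from_Q = (1 - gamma) * mu0_pi @ Q_pi   # first expression for rho(pi)
rho_from_d = d_pi @ r                      # second expression for rho(pi)
print(rho_from_Q, rho_from_d)
```

In this tabular setting the adjoint relationship is simply matrix transposition, which is why the two quantities coincide up to floating-point error.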
2.2 Off-policy Evaluation via the DICE Family
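Off-policy evaluation (OPE) aims to estimate $\rho(\pi)$ using only a fixed dataset of experiences. Specifically, we assume access to a finite dataset $\mathcal{D} = \{(s_0^{(i)}, s^{(i)}, a^{(i)}, r^{(i)}, s'^{(i)})\}_{i=1}^{N}$, where $s_0^{(i)} \sim \mu_0$, $(s^{(i)}, a^{(i)}) \sim d^D$ are samples from some unknown distribution $d^D$, $r^{(i)} = R(s^{(i)}, a^{(i)})$, and $s'^{(i)} \sim T(s^{(i)}, a^{(i)})$. We at times abuse notation and use $(s, a, r, s') \sim d^D$ or $(s, a, r) \sim d^D$ as a shorthand for $(s,a) \sim d^D$, $r = R(s,a)$, $s' \sim T(s,a)$, which simulates sampling from the dataset $\mathcal{D}$ when using a finite number of samples.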
The recent DICE methods take advantage of the following expression for the policy value:

$$\rho(\pi) = \mathbb{E}_{(s,a,r) \sim d^D}\left[\zeta^*(s,a) \cdot r\right], \tag{5}$$

where $\zeta^*(s,a) := d^\pi(s,a) / d^D(s,a)$ is the distribution correction ratio. The existing DICE estimators seek to approximate this ratio without knowledge of $d^\pi$ or $d^D$, and then apply (5) to derive an estimate of $\rho(\pi)$. This general paradigm is supported by the following assumption.

[Boundedness] The stationary correction ratio is bounded, $\|\zeta^*\|_\infty \le C < \infty$.

When $\gamma < 1$, DualDICE (NacChoDaiLi19) chooses a convex objective whose optimal solution corresponds to this ratio, and employs a change of variables to transform the dependence on $d^\pi$ to $\mu_0$. GenDICE (zhang2020gendice), on the other hand, minimizes a divergence between successive on-policy state-action distributions, and introduces a normalization constraint to ensure the estimated ratios average to $1$ over the off-policy dataset. Both DualDICE and GenDICE apply Fenchel duality to reduce an intractable convex objective to a minimax objective, which enables sampling and optimization in a stochastic or continuous action space. GradientDICE (zhang2020gradientdice) extends GenDICE by using a linear parametrization so that the minimax optimization is convex-concave with convergence guarantees.
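The role of the correction ratio can be illustrated with a small tabular sketch. Here the ratio is computed exactly from a known visitation distribution, which is of course not available in the behavior-agnostic setting; the point is only to show that reweighting off-policy data by the ratio recovers the on-policy value. The MDP, the data distribution `d_D`, and the sample size are hypothetical choices.

```python
import numpy as np

# Hypothetical tabular MDP (illustrative numbers only).
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.2]])     # R[s, a]
mu0 = np.array([0.6, 0.4])
pi = np.array([[0.7, 0.3], [0.2, 0.8]])    # target policy pi[s, a]
n = R.size
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
r = R.reshape(n)
mu0_pi = (mu0[:, None] * pi).reshape(n)

# On-policy visitation and true policy value.
d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0_pi)
rho_true = d_pi @ r

# Assumed off-policy data distribution over (s, a) with full support.
d_D = np.array([0.4, 0.1, 0.2, 0.3])
zeta_star = d_pi / d_D                     # distribution correction ratio

# Exact reweighted expectation under d_D.
rho_exact = d_D @ (zeta_star * r)

# Monte Carlo version: sample (s, a) pairs from d_D and reweight rewards.
rng = np.random.default_rng(0)
idx = rng.choice(n, size=20000, p=d_D)
rho_sampled = np.mean(zeta_star[idx] * r[idx])
print(rho_true, rho_exact, rho_sampled)
```

The exact reweighted expectation matches the true value, and the ratio averages to one under the data distribution, which is the property the normalization constraint exploits.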
3 A Unified Framework of DICE Estimators
In this section, given a fixed target policy $\pi$, we present a linear programming (LP) representation of its state-action stationary distribution $d^\pi$, referred to as the $d$-LP. The dual of this LP has solution $Q^\pi$, thus revealing the duality between the $Q$-function and the $d$-distribution of any policy $\pi$. We then discuss the mechanisms by which one can improve optimization stability through the application of regularization and redundant constraints. Although in general this may introduce bias into the final value estimate, there are a number of valid configurations for which the resulting estimator for $\rho(\pi)$ remains unbiased. We show that existing DICE algorithms cover several choices of these configurations, while there is also a sizable subset which remains unexplored.
3.1 Linear Programming Representation for the $d^\pi$-distribution
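The following theorem presents a formulation of $\rho(\pi)$ in terms of a linear program with respect to the constraints in (4) and (3). Given a policy $\pi$, under asmp:mdp_reg, its value $\rho(\pi)$ defined in (1) can be expressed by the following $d$-LP:

$$\max_{d: S \times A \to \mathbb{R}} \ \mathbb{E}_{d}\left[R(s,a)\right] \quad \text{s.t.} \quad d(s,a) = (1-\gamma)\,\mu_0(s)\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d(s,a). \tag{6}$$

We refer to the $d$-LP above as the dual problem. Its corresponding primal LP is

$$\min_{Q: S \times A \to \mathbb{R}} \ (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0,a_0)\right] \quad \text{s.t.} \quad Q(s,a) = R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a). \tag{7}$$

Proof. Notice that under asmp:mdp_reg, the constraint in (6) determines a unique solution, which is the stationary distribution $d^\pi$. Therefore, the objective will be $\rho(\pi)$ by definition. On the other hand, due to the contraction of $\gamma\mathcal{P}^\pi$, the primal problem is feasible and the solution is $Q^\pi$, which shows the optimal objective value will also be $\rho(\pi)$, implying strong duality holds. ∎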
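thm:dual-succinct presents a succinct LP representation for the policy value and reveals the duality between the $Q$-function and the $d$-distribution, thus providing an answer to the question raised by uehara2019minimax. Although the $d$-LP provides a mechanism for policy evaluation, directly solving either the primal or dual LP is difficult due to the number of constraints, which becomes prohibitive when the state and action spaces are uncountable. These issues are exacerbated in the off-policy setting, where one only has access to samples from a stochastic process. To overcome these difficulties, one can instead approach these primal and dual LPs through the Lagrangian,

$$\max_{d} \min_{Q} \ L(d, Q) := (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0,a_0)\right] + \sum_{s,a} d(s,a)\left(R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a) - Q(s,a)\right).$$

In order to enable the use of an arbitrary off-policy distribution $d^D$, we make the change of variables $\zeta(s,a) := d(s,a) / d^D(s,a)$. This yields an equivalent Lagrangian in a more convenient form:

$$\max_{\zeta} \min_{Q} \ L_D(\zeta, Q) := (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0,a_0)\right] + \mathbb{E}_{(s,a,r,s') \sim d^D,\, a' \sim \pi(s')}\left[\zeta(s,a)\left(r + \gamma Q(s',a') - Q(s,a)\right)\right]. \tag{8}$$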
The Lagrangian has primal and dual solutions $Q^* = Q^\pi$ and $\zeta^* = d^\pi / d^D$. Approximate solutions $\hat{Q}$, $\hat{\zeta}$ to one or both of these can be used to estimate $\rho(\pi)$, by either using the standard DICE paradigm in (5), which corresponds to the dual objective in (6), or, alternatively, by using the primal objective in (7) or the Lagrangian objective in (8); we further discuss these choices later in this section. Although the Lagrangian in (8) should in principle be able to derive the solutions $Q^\pi$, $d^\pi$ and so yield accurate estimates of $\rho(\pi)$, in practice there are a number of optimization difficulties that are liable to be encountered. Specifically, even in the tabular case, due to lack of curvature, the Lagrangian is not strongly-convex-strongly-concave, and so one cannot guarantee the convergence of the final solution with stochastic gradient descent-ascent (SGDA). These optimization issues can become more severe when moving to the continuous case with neural network parametrization, which is the dominant application case in practice. In order to mitigate these issues, we present a number of ways to introduce more stability into the optimization and discuss how these mechanisms may trade off against the bias of the final estimate. We will show that the application of certain mechanisms recovers the existing members of the DICE family, while a larger set remains unexplored.
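The lack of curvature can be seen concretely in a tabular sketch of the off-policy Lagrangian. The code below assumes the Lagrangian has the form described above (an initial-state term plus a ratio-weighted Bellman residual under the data distribution); the MDP, the data distribution `d_D`, and the function name `lagrangian` are hypothetical. It checks that the Lagrangian equals the policy value whenever either argument is at its optimum, and that it is affine in the primal variable.

```python
import numpy as np

# Hypothetical tabular MDP (illustrative numbers only).
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.2]])     # R[s, a]
mu0 = np.array([0.6, 0.4])
pi = np.array([[0.7, 0.3], [0.2, 0.8]])
n = R.size
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
r = R.reshape(n)
mu0_pi = (mu0[:, None] * pi).reshape(n)
d_D = np.array([0.4, 0.1, 0.2, 0.3])       # assumed data distribution

Q_pi = np.linalg.solve(np.eye(n) - gamma * P, r)
d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0_pi)
zeta_star = d_pi / d_D
rho = d_pi @ r

def lagrangian(zeta, Q):
    """Tabular off-policy Lagrangian: initial-state term + weighted residual."""
    bellman_residual = r + gamma * (P @ Q) - Q
    return (1 - gamma) * mu0_pi @ Q + d_D @ (zeta * bellman_residual)

rng = np.random.default_rng(0)
Q_any = rng.normal(size=n)                 # arbitrary primal variable
Q_other = rng.normal(size=n)
zeta_any = rng.uniform(0.0, 3.0, size=n)   # arbitrary dual variable

# With the dual variable at its optimum, the value does not depend on Q;
# with the primal variable at its optimum, it does not depend on zeta.
print(lagrangian(zeta_star, Q_any), lagrangian(zeta_any, Q_pi), rho)
```

Because the objective is affine in each argument separately, neither player sees any curvature, which is one way to read the instability of plain SGDA on this problem.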
3.2 Regularizations and Redundant Constraints
The augmented Lagrangian method (ALM) (rockafellar1974augmented) was proposed precisely to circumvent this kind of optimization instability: strong convexity is introduced by adding extra regularization without changing the optimal solution. However, directly applying ALM, i.e., adding a penalty on the violation of the primal constraint or of the dual constraint (the latter measured by an $f$-divergence), introduces extra difficulty, both statistically and algorithmically, because the conditional expectation operators $\mathcal{P}^\pi$ and $\mathcal{P}^\pi_*$ then appear inside the non-linear penalty functions. This is known as the “double sample” problem in the RL literature (Baird95residualalgorithms). Therefore, vanilla stochastic gradient descent is no longer applicable (dai2016learning), due to the bias in the gradient estimator.
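The double-sample issue can be made concrete with a small exact calculation. The sketch below uses a hypothetical next-state value distribution for a single state-action pair and compares the quantity a squared-residual penalty needs (the square of the expected residual) with what a single next-state sample estimates (the expected squared residual). All numbers are illustrative.

```python
import numpy as np

# Hypothetical next-state values Q(s', a') and their probabilities
# for one fixed (s, a); numbers are illustrative only.
next_values = np.array([0.0, 1.0, 4.0])
probs = np.array([0.5, 0.3, 0.2])
gamma, reward, q_sa = 0.9, 1.0, 2.0

# Per-outcome temporal-difference residuals.
deltas = reward + gamma * next_values - q_sa

mean_delta = probs @ deltas
target = mean_delta ** 2                         # (E[delta])^2: what the penalty needs
single_sample = probs @ deltas ** 2              # E[delta^2]: one next-state sample
variance = probs @ (deltas - mean_delta) ** 2    # Var(delta)

# Two independent next-state samples give an unbiased product estimate.
pair_probs = np.outer(probs, probs)
double_sample = np.sum(pair_probs * np.outer(deltas, deltas))

print(target, single_sample, double_sample)
```

The single-sample estimate overshoots the target by exactly the variance of the residual, while the product of two independent samples is unbiased; the latter requires two next-state draws from the same state-action pair, which logged data generally does not provide.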
In this section, we follow the spirit of ALM but explore other choices of regularization to introduce strong convexity into the original Lagrangian (8). In addition to regularization, we also employ redundant constraints, which serve to add more structure to the optimization without affecting the optimal solutions. We will later analyze for which configurations these modifications of the original problem lead to biased estimates for $\rho(\pi)$.
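We first present the unified objective in full form equipped with all choices of regularizations and redundant constraints:

$$\max_{\zeta \ge 0} \min_{Q, \lambda} \ L_D(\zeta, Q, \lambda) := (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0,a_0)\right] + \lambda + \mathbb{E}_{(s,a,r,s') \sim d^D,\, a' \sim \pi(s')}\left[\zeta(s,a)\left(\alpha_R \cdot R(s,a) + \gamma Q(s',a') - Q(s,a) - \lambda\right)\right] + \alpha_Q \cdot \mathbb{E}_{(s,a) \sim d^D}\left[f_1(Q(s,a))\right] - \alpha_\zeta \cdot \mathbb{E}_{(s,a) \sim d^D}\left[f_2(\zeta(s,a))\right]. \tag{9}$$

Now, let us explain each term in $(\alpha_Q, \alpha_\zeta, \alpha_R, \zeta \ge 0, \lambda)$.
Primal and Dual regularization: To introduce better curvature into the Lagrangian, we introduce primal and dual regularization $\alpha_Q \cdot \mathbb{E}_{d^D}[f_1(Q)]$ or $\alpha_\zeta \cdot \mathbb{E}_{d^D}[f_2(\zeta)]$, respectively. Here $f_1, f_2$ are some convex and lower-semicontinuous functions.
Reward: Scaling the reward may be seen as an extension of the dual regularizer, as it is a component in the dual objective in (6). We consider $\alpha_R \in \{0, 1\}$.
Positivity: Recall that the solution to the original Lagrangian is $\zeta^*(s,a) = d^\pi(s,a) / d^D(s,a) \ge 0$. We thus consider adding a positivity constraint to the dual variable. This may be interpreted as modifying the original $d$-LP in (6) to add a condition $d \ge 0$ to its objective.
Normalization: Similarly, the optimal dual solution is normalized, $\mathbb{E}_{(s,a) \sim d^D}[\zeta^*(s,a)] = 1$. Adding this as a constraint with Lagrange multiplier $\lambda$ yields the terms $\lambda - \mathbb{E}_{(s,a) \sim d^D}[\lambda\,\zeta(s,a)]$ in the objective.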
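To make the pieces of the unified objective easier to see side by side, here is a minimal tabular sketch. It assumes the objective is the off-policy Lagrangian augmented with a reward scale `alpha_R`, a normalization multiplier `lam`, and regularizers weighted by `alpha_Q` and `alpha_zeta`; the quadratic regularizers, the MDP numbers, and the function name `unified_objective` are hypothetical choices, not taken from the source.

```python
import numpy as np

# Hypothetical tabular MDP (illustrative numbers only).
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.2]])     # R[s, a]
mu0 = np.array([0.6, 0.4])
pi = np.array([[0.7, 0.3], [0.2, 0.8]])
n = R.size
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
r = R.reshape(n)
mu0_pi = (mu0[:, None] * pi).reshape(n)
d_D = np.array([0.4, 0.1, 0.2, 0.3])       # assumed data distribution

def half_square(x):
    """Assumed convex regularizer f(x) = x^2 / 2."""
    return 0.5 * x ** 2

def unified_objective(zeta, Q, lam, alpha_Q=0.0, alpha_zeta=0.0, alpha_R=1.0,
                      f1=half_square, f2=half_square):
    """Regularized Lagrangian with optional reward and normalization terms."""
    init_term = (1 - gamma) * mu0_pi @ Q
    residual = alpha_R * r + gamma * (P @ Q) - Q - lam
    weighted_residual = d_D @ (zeta * residual)
    primal_reg = alpha_Q * d_D @ f1(Q)         # added: convex in Q
    dual_reg = alpha_zeta * d_D @ f2(zeta)     # subtracted: concave in zeta
    return init_term + lam + weighted_residual + primal_reg - dual_reg

d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0_pi)
zeta_star = d_pi / d_D
rho = d_pi @ r

rng = np.random.default_rng(0)
Q_any = rng.normal(size=n)
print(unified_objective(zeta_star, Q_any, lam=0.0),
      unified_objective(zeta_star, Q_any, lam=3.0), rho)
```

At the true ratio, the multiplier terms cancel because the ratio averages to one under the data distribution, so the normalization constraint is redundant there.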
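As we can see, the latter two options (positivity and normalization) come from properties of the optimal dual solution, which suggests that their inclusion would not affect the optimal dual solution. On the other hand, the first two options (primal/dual regularization and reward scaling) will in general affect the solutions to the optimization. Whether a bias in the solution affects the final estimate depends on the estimator being used. Given estimates $\hat{Q}$, $\hat{\lambda}$, $\hat{\zeta}$, there are three potential ways to estimate $\rho(\pi)$: the primal estimator $\hat{\rho}_Q$, based on the primal objective in (7); the dual estimator $\hat{\rho}_\zeta$, based on the DICE paradigm in (5); and the Lagrangian estimator $\hat{\rho}_{Q,\zeta}$, based on the Lagrangian objective.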
The following theorem outlines when a choice of regularizations, redundant constraints, and final estimator will provably result in an unbiased estimate of policy value. [Regularization profiling] Under asmp:mdp_reg and the boundedness assumption, we summarize the effects of ($\alpha_Q$, $\alpha_\zeta$, $\alpha_R$, $\zeta \ge 0$, $\lambda$), which correspond to primal and dual regularization, with or without reward, and positivity and normalization constraints, without considering function approximation.
|Regularization (with or without $\lambda$)|
Notice that the primal and dual solutions can both be unbiased under specific regularization configurations, but the dual solutions are unbiased in 6 out of 8 such configurations, whereas the primal solution is unbiased in only 1 configuration. The primal solution additionally requires the positivity constraint to be excluded (see details in appendix:reg_prof), further restricting its optimization choices.
The Lagrangian estimator is unbiased when at least one of $\hat{Q}$ or $\hat{\zeta}$ is unbiased. This property is referred to as doubly robust in the literature (jiang2015doubly). This seems to imply that the Lagrangian estimator is optimal for behavior-agnostic off-policy evaluation. However, this is not the case, as we will see in the empirical analysis. Instead, the approximate dual solutions are typically more accurate than approximate primal solutions. Since neither is exact, the Lagrangian estimator suffers from error in both, while the dual estimator exhibits more robust performance, as it relies solely on the approximate $\hat{\zeta}$.
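The doubly robust behavior can be checked numerically in the tabular case. The sketch below assumes the three estimators take the forms suggested by the primal objective, the reweighted-reward expression, and the Lagrangian objective, and it ignores the normalization multiplier for simplicity. The MDP, the data distribution, the perturbation sizes, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical tabular MDP (illustrative numbers only).
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.2]])     # R[s, a]
mu0 = np.array([0.6, 0.4])
pi = np.array([[0.7, 0.3], [0.2, 0.8]])
n = R.size
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
r = R.reshape(n)
mu0_pi = (mu0[:, None] * pi).reshape(n)
d_D = np.array([0.4, 0.1, 0.2, 0.3])       # assumed data distribution

Q_pi = np.linalg.solve(np.eye(n) - gamma * P, r)
d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0_pi)
zeta_star = d_pi / d_D
rho = d_pi @ r

def primal_estimate(Q_hat):
    """Assumed primal estimator: normalized initial-state value."""
    return (1 - gamma) * mu0_pi @ Q_hat

def dual_estimate(zeta_hat):
    """Assumed dual estimator: ratio-weighted average reward."""
    return d_D @ (zeta_hat * r)

def lagrangian_estimate(Q_hat, zeta_hat):
    """Assumed Lagrangian estimator: primal term + weighted Bellman residual."""
    residual = r + gamma * (P @ Q_hat) - Q_hat
    return primal_estimate(Q_hat) + d_D @ (zeta_hat * residual)

rng = np.random.default_rng(0)
Q_noisy = Q_pi + rng.normal(scale=0.5, size=n)        # perturbed primal solution
zeta_noisy = zeta_star + rng.normal(scale=0.5, size=n)  # perturbed dual solution

exact_Q_only = lagrangian_estimate(Q_pi, zeta_noisy)
exact_zeta_only = lagrangian_estimate(Q_noisy, zeta_star)
both_noisy = lagrangian_estimate(Q_noisy, zeta_noisy)
print(rho, exact_Q_only, exact_zeta_only, both_noisy)
```

When both inputs are perturbed, the remaining error is a product of the two perturbations, which is consistent with the observation that the Lagrangian estimator inherits error from both solutions in practice.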
3.3 Recovering Existing OPE Estimators
This organization provides a complete picture of the DICE family of estimators. Existing DICE estimators can simply be recovered by picking one of the valid regularization configurations:
DualDICE (NacChoDaiLi19): ($\alpha_Q = 0$, $\alpha_\zeta = 1$, $\alpha_R = 0$) without $\zeta \ge 0$ and without $\lambda$. DualDICE also derives an unconstrained primal form where optimization is exclusively over the primal variables (see appendix:alter_primal). This form results in a biased estimate but avoids difficulties in minimax optimization, which again is a tradeoff between optimization stability and solution unbiasedness.
GenDICE (zhang2020gendice) and GradientDICE (zhang2020gradientdice): ($\alpha_Q = 1$, $\alpha_\zeta = 0$, $\alpha_R = 0$) with $\lambda$. GenDICE differs from GradientDICE in that GenDICE enables $\zeta \ge 0$ whereas GradientDICE disables it.
MQL and MWL (uehara2019minimax): ($\alpha_Q = 0$, $\alpha_\zeta = 0$, $\alpha_R = 1$) and ($\alpha_Q = 0$, $\alpha_\zeta = 0$, $\alpha_R = 0$), both without $\zeta \ge 0$ and without $\lambda$.
LSTDQ (lagoudakis2003least): With linear parametrization for $Q$ and $\zeta$, for any unbiased estimator without $\zeta \ge 0$ and $\lambda$ in thm:reg_profile, we can recover LSTDQ. Please refer to appendix:recover_ope for details.
Algae $Q$-LP (algae): ($\alpha_Q = 0$, $\alpha_\zeta = 1$, $\alpha_R = 1$) without $\zeta \ge 0$ and without $\lambda$.
BestDICE: ($\alpha_Q = 0$, $\alpha_\zeta = 1$, $\alpha_R = 0/1$) with $\zeta \ge 0$ and with $\lambda$. More importantly, we discover a variant that achieves the best performance, which had not been identified before this unified framework.
In this section, we empirically verify the theoretical findings. We evaluate different choices of estimators, regularizers, and constraints on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher), under linear and neural network parametrizations, with offline data collected from behavior policies with different noise levels ($\pi_1$ and $\pi_2$). See appendix:exp for implementation details and additional results. Our empirical conclusions are as follows:
The dual estimator $\hat{\rho}_\zeta$ is unbiased under more configurations and yields the best performance out of all estimators, and furthermore exhibits strong robustness to scaling and shifting of MDP rewards.
Dual regularization ($\alpha_\zeta > 0$) yields better estimates than primal regularization; among the choices $\alpha_R \in \{0, 1\}$, $\alpha_R = 1$ exhibits a slight advantage.
The inclusion of redundant constraints ($\lambda$ and $\zeta \ge 0$) improves stability and estimation performance.
As expected, optimization using the unconstrained primal form is more stable but also more biased than optimization using the minimax regularized Lagrangian.
Based on these findings, we propose a particular set of choices that generally performs well and was overlooked by previously proposed DICE estimators: the dual estimator $\hat{\rho}_\zeta$ with regularized dual variable ($\alpha_\zeta > 0$, $\alpha_R = 1$) and redundant constraints ($\lambda$, $\zeta \ge 0$), optimized with the Lagrangian.
4.1 Choice of Estimator ($\hat{\rho}_Q$, $\hat{\rho}_\zeta$, or $\hat{\rho}_{Q,\zeta}$)
We first consider the choice of estimator. In each case, we perform Lagrangian optimization with regularization chosen according to thm:reg_profile so as not to bias the resulting estimator. We also use $\alpha_R = 1$ and include the redundant constraints $\lambda$ and $\zeta \ge 0$ in the dual estimator. Although not shown, we also evaluated combinations of regularizations which can bias the estimator (as well as no regularizations) and found that these generally performed worse; see sec:exp_reg for a subset of these experiments.
Our evaluation of different estimators is presented in fig:est_reg. We find that the dual estimator consistently produces the best estimates across different tasks and behavior policies. In comparison, the primal estimates are significantly worse. While the Lagrangian estimator can improve on the primal, it generally exhibits higher variance than the dual estimator. Presumably, the Lagrangian does not benefit from the doubly robust property, since both solutions are biased in this practical setting.
To more extensively evaluate the dual estimator, we investigate its performance when the reward function is scaled by a constant, shifted by a constant, or exponentiated. [Footnote 3: Note this is separate from $\alpha_R$, which only affects optimization. We use $\alpha_R = 1$ exclusively here.] To control for difficulties in optimization, we first parametrize the primal and dual variables as linear functions, and use stochastic gradient descent to solve the convex-concave minimax objective of Section 3.2 with $\alpha_Q = 0$, $\alpha_\zeta = 1$, and $\alpha_R = 1$. Since a linear parametrization changes the ground truth of evaluation, we compute the upper and lower estimation bounds by only parameterizing the primal or the dual variable as a linear function. fig:est_robust (top) shows the estimated per-step reward of the Grid task. When the original reward is used (col. 1), the primal, dual, and Lagrangian estimates eventually converge to roughly the same value (even though the primal estimates converge much more slowly). When the reward is scaled by 10 or 100 times or shifted by 5 or 10 units (the original reward is between 0 and 1), the resulting primal estimates are severely affected and do not converge given the same number of gradient updates. When performing this same evaluation with neural network parametrization (fig:est_robust, bottom), the primal estimates continue to exhibit sensitivity to reward transformations, whereas the dual estimates stay roughly the same after being transformed back to the original scale. We further implemented a target network for training stability of the primal variable, and the same conclusion holds (see Appendix). Note that while the dual solution is robust to the scale and range of rewards, the optimization objective used here still has $\alpha_R = 1$, which is different from $\alpha_R = 0$, where $\hat{\rho}_Q$ is no longer a valid estimator.
4.2 Choice of Regularization ($\alpha_\zeta$, $\alpha_R$, and $\alpha_Q$)
Next, we study the choice between regularizing the primal or dual variables. Given the results of sec:exp_est, we focus on ablations using the dual estimator $\hat{\rho}_\zeta$ to estimate $\rho(\pi)$. Results are presented in fig:reg. As expected, we see that regularizing the primal variables when $\alpha_R = 1$ leads to a biased estimate, especially in Grid ($\pi_1$), Reacher ($\pi_2$), and Cartpole. Regularizing the dual variable (blue lines), on the other hand, does not incur such a bias. Additionally, the value of $\alpha_R$ has little effect on the final estimates when the dual variable is regularized (dotted versus solid blue lines). While the invariance to $\alpha_R$ may not generalize to other tasks, an advantage of the dual estimates with regularized dual variable is the flexibility to set $\alpha_R = 0$ or $\alpha_R = 1$ depending on the reward function.
4.3 Choice of Redundant Constraints ($\lambda$ and $\zeta \ge 0$)
So far our experiments with the dual estimator used $\lambda$ and $\zeta \ge 0$ in the optimizations, corresponding to the normalization and positivity constraints in the $d$-LP. However, these are in principle not necessary when $\gamma < 1$, and so we evaluate the effect of removing them. Given the results of the previous sections, we focus our ablations on the use of the dual estimator $\hat{\rho}_\zeta$ with dual regularization $\alpha_\zeta > 0$.
Normalization. We consider the effect of removing the normalization constraint ($\lambda$). fig:opt (row 1) shows the effect of keeping (blue curve) or removing (red curve) this constraint during training. We see that training becomes less stable and approximation error increases, even when $\gamma < 1$.
Positivity. We next evaluate the effect of removing the positivity constraint $\zeta \ge 0$, which, in our previous experiments, was enforced by applying a square function to the output of the dual variable neural network. Results are presented in fig:opt (row 2), where we again see that the removal of this constraint is detrimental to optimization stability and estimator accuracy.
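Enforcing positivity by squaring the model output can be sketched in a few lines. The linear "network", feature matrix, and parameter names below are hypothetical stand-ins for whatever function approximator is used for the dual variable; only the squaring step reflects the mechanism described above.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))   # hypothetical (s, a) feature vectors
weights = rng.normal(size=3)         # hypothetical dual-model parameters

def dual_variable(features, weights):
    """Map unconstrained model outputs to non-negative ratio estimates."""
    raw_output = features @ weights  # unconstrained, may be negative
    return np.square(raw_output)     # squaring enforces zeta >= 0

zeta = dual_variable(features, weights)
print(zeta)
```

Since the squaring is part of the forward pass, the constraint holds for every parameter value and no projection step is needed during gradient updates.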
4.4 Choice of Optimization (Lagrangian or Unconstrained Primal Form)
So far, our experiments have used minimax optimization via the Lagrangian to learn primal and dual variables. We now consider solving the unconstrained primal form of the $d$-LP, which sec:reg_lagrangian suggests may lead to an easier, but biased, optimization. fig:opt (row 3) indeed shows that the unconstrained primal form reduces variance on Grid and produces better estimates on Cartpole. Both environments have discrete action spaces. Reacher, on the other hand, has a continuous action space, which creates difficulty when taking the expectation over next-step samples, causing bias in the unconstrained primal form. Given this mixed performance, we generally advocate for the Lagrangian, unless the task is discrete-action and the stochasticity of the dynamics is known to be low.
5 Related work
Off-policy evaluation has long been studied in the RL literature (farajtabar2018more; jiang2015doubly; kallus2019double; munos2016safe; Precup00ET; Thomas15HCPE). While some approaches are model-based (fonteneau13batch), or work by estimating the value function (duan2020minimaxoptimal), most rely on importance reweighting to transform the off-policy data distribution to the on-policy target distribution. They often require knowing or estimating the behavior policy, and suffer a variance exponential in the horizon, both of which limit their applications. Recently, a series of works were proposed to address these challenges (kallus2019efficiently; liu2018breaking; tang20doubly). Among them is the DICE family (NacChoDaiLi19; zhang2020gendice; zhang2020gradientdice), which performs some form of stationary distribution estimation. The present paper develops a convex duality framework that unifies many of these algorithms, and offers further important insights. Many OPE algorithms may be understood to correspond to the categories considered here. Naturally, the recent stationary distribution correction algorithms (NacChoDaiLi19; zhang2020gendice; zhang2020gradientdice) are the dual methods. The FQI-style estimator (duan2020minimaxoptimal) loosely corresponds to our primal estimator. Moreover, Lagrangian-type estimators are also considered (tang20doubly; uehara2019minimax), although some are not for the behavior-agnostic setting (tang20doubly).
Convex duality has been widely used in machine learning, and in RL in particular. In one line of literature, it was used to solve the Bellman equation, whose fixed point is the value function (dai18sbeed; du2017stochastic; LiuLiuGhaMahetal15). Here, duality facilitates derivation of an objective function that can be conveniently approximated by sample averages, so that solving for the fixed point is converted to finding a saddle point. Another line of work, more similar to the present paper, is to optimize the Lagrangian of the linear program that characterizes the value function (basserrano2019faster; chen2018scalable; wang2017randomized). In contrast to our work, these algorithms typically do not incorporate off-policy correction, but assume the availability of on-policy samples.
We have proposed a unified view of off-policy evaluation via the regularized Lagrangian of the $d$-LP. Under this unification, existing DICE algorithms are recovered by specific (suboptimal) choices of regularizers, (redundant) constraints, and ways to convert optimized solutions to policy values. By systematically studying the mathematical properties and empirical effects of these choices, we have found that the dual estimates (i.e., policy value in terms of the state-action distribution) offer greater flexibility in incorporating optimization stabilizers while preserving asymptotic unbiasedness, in comparison to the primal estimates (i.e., estimated $Q$-values). Our study also reveals alternative estimators not previously identified in the literature that exhibit improved performance. Overall, these findings suggest promising new directions of focus for OPE research in the offline setting.
We thank Hanjun Dai and other members of the Google Brain team for helpful discussions.
Appendix A Proof for thm:reg_profile
The full enumeration of $\alpha_Q$, $\alpha_\zeta$, $\alpha_R$, $\lambda$, and $\zeta \ge 0$ results in $2^5 = 32$ configurations. We note that it is enough to characterize the solutions $Q^*$ and $\zeta^*$ under these different configurations. Clearly, the primal estimator $\hat{\rho}_Q$ is unbiased when $Q^* = Q^\pi$, and the dual estimator $\hat{\rho}_\zeta$ is unbiased when $\zeta^* = d^\pi / d^D$. For the Lagrangian estimator $\hat{\rho}_{Q,\zeta}$, we may write it in two ways:
Now we continue by characterizing $Q^*$ and $\zeta^*$ under different configurations. First, when $\alpha_Q = 0$ and $\alpha_\zeta = 0$, it is clear that the solutions are always unbiased by virtue of thm:dual-succinct (see also (algae)). When $\alpha_Q > 0$ and $\alpha_\zeta > 0$, the solutions are in general biased. We summarize the remaining configurations (in the discounted case) of $\alpha_Q$ and $\alpha_\zeta$ in the table below. We provide proofs for the configurations of the shaded cells. Proofs for the remaining configurations can be found in (NacChoDaiLi19; algae).
|Regularizer (w./w.o. )||Case|
|> 0||free||iii||(NacChoDaiLi19; algae)|
|= 0||free||vii||(NacChoDaiLi19; algae)||(NacChoDaiLi19; algae)|
iii)-iv) In this configuration, the regularized Lagrangian (3.2) becomes
which is equivalent to
Apply the Fenchel duality w.r.t. , we have
If achieves the minimum at zero, it is obvious that
Therefore, we have
i)-ii) Following the derivation in case iii)-iv), we have the regularized Lagrangian as almost the same as (A) but has an extra term , i.e.
We first consider the case where $\zeta$ is free and the normalization constraint is not enforced.
After applying the Fenchel duality w.r.t. , we have
where the second equation comes from the fact and last equation comes from Fenchel duality with .
Then, we can characterize
If we have the positivity constraint, i.e., $\zeta \ge 0$, we denote
By first-order optimality condition, we have
For , we obtain from the Fenchel duality relationship,
As we can see, in both i) and ii), neither of the optimal dual solutions satisfies the normalization condition. Therefore, with the extra normalization constraint, the optimization will be biased.
v)-viii) These cases are also proved in algae; we provide a more succinct proof here. In these configurations, whether $\alpha_R$ is involved or not does not affect the proof. We will keep this component for generality. We ignore $\zeta \ge 0$ and $\lambda$ for simplicity; the conclusion is not affected, since the optimal solution automatically satisfies these constraints.