## 1 Introduction

A key challenge in reinforcement learning (RL, Sutton & Barto 2018) is off-policy evaluation (Precup et al., 2001; Maei, 2011; Jiang & Li, 2015; Sutton & Barto, 2018; Liu et al., 2018; Nachum et al., 2019; Zhang et al., 2020), where we want to estimate the performance of a target policy (average reward in the continuing setting or expected total discounted reward in the episodic setting (Puterman, 2014)), from data generated by one or more behavior policies. Compared with on-policy evaluation (Sutton, 1988), which requires data generated by the target policy, off-policy evaluation is more flexible. We can evaluate a new policy with existing data in a replay buffer (Lin, 1992) without interacting with the environment again. We can also evaluate multiple target policies simultaneously when following a single behavior policy (Sutton et al., 2011).

One major challenge in off-policy evaluation is dealing with distribution mismatch: the state distribution of the target policy is different from the sampling distribution. This mismatch leads to divergence of the off-policy linear temporal difference learning algorithm (Baird, 1995; Tsitsiklis & Van Roy, 1997). Precup et al. (2001) address distribution mismatch with products of importance sampling ratios, which, however, suffer from large variance. To correct for distribution mismatch without incurring large variance, Hallak & Mannor (2017); Liu et al. (2018) propose to directly learn the density ratio between the state distribution of the target policy and the sampling distribution using function approximation.^{1} Intuitively, this learned density ratio is a “marginalization” of the products of importance sampling ratios.

^{1} Such density ratios are referred to as *stationary values* in Zhang et al. (2020).
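The variance problem with products of importance sampling ratios can be made concrete with a small calculation. The sketch below assumes a hypothetical two-action setting (the policies and all names are illustrative, not taken from the paper) in which the per-step ratios are independent with mean 1, so the variance of their product over a horizon of T steps is (E[ratio^2])^T - 1 and grows geometrically with T.

```python
import numpy as np

# Hypothetical policies over two actions (illustrative values, not from the paper).
behavior = np.array([0.5, 0.5])   # sampling (behavior) policy
target = np.array([0.9, 0.1])     # target policy

ratios = target / behavior                        # per-step importance sampling ratios
mean_ratio = float(np.sum(behavior * ratios))     # equals 1 under the behavior policy
second_moment = float(np.sum(behavior * ratios ** 2))


def product_variance(horizon: int) -> float:
    """Variance of the product of `horizon` i.i.d. per-step ratios."""
    return second_moment ** horizon - mean_ratio ** (2 * horizon)


def monte_carlo_variance(horizon: int, num_samples: int = 100_000, seed: int = 0) -> float:
    """Empirical estimate of the same variance, sampling actions from the behavior policy."""
    rng = np.random.default_rng(seed)
    actions = rng.choice(2, size=(num_samples, horizon), p=behavior)
    return float(np.var(ratios[actions].prod(axis=1)))


for horizon in (1, 5, 10, 20):
    print(horizon, product_variance(horizon))
```

A learned density ratio avoids this product altogether, since it assigns one weight per state-action pair rather than one per trajectory.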

The density ratio learning algorithms from Hallak & Mannor (2017); Liu et al. (2018) require data generated by a *single known* behavior policy.
Nachum et al. (2019) relax this constraint in DualDICE,
which is compatible with *multiple unknown* behavior policies and *offline* training.
DualDICE, however, copes well with only the total discounted reward criterion and cannot be used under the average reward criterion.
Zhang et al. (2020) observe that DualDICE becomes unstable as the discount factor grows towards 1.
To address the limitation of DualDICE, Zhang et al. (2020) propose GenDICE,
which is compatible with multiple unknown behavior policies and offline training *under both criteria*.
Zhang et al. (2020) show empirically that GenDICE achieves a new state-of-the-art in the field of off-policy evaluation.

In this paper, we point out key problems with GenDICE.
In particular, the optimization problem in GenDICE is *not* a convex-concave saddle-point problem (CCSP) once nonlinearity in optimization variable parameterization is introduced to ensure positivity,
so primal-dual algorithms are not guaranteed to find the desired solution even with tabular representation.
However, such nonlinearity is essential to ensure the consistency of GenDICE.
This is a fundamental contradiction,
resulting from GenDICE’s original formulation of the optimization problem.

Furthermore, we propose GradientDICE, which overcomes these problems. GradientDICE optimizes a different objective from GenDICE by using the Perron-Frobenius theorem (Horn & Johnson, 2012) and eliminating GenDICE’s use of divergence. Consequently, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation. Finally, we provide empirical results demonstrating the advantages of GradientDICE over GenDICE and DualDICE.

## 2 Background

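We use vectors and functions interchangeably when this does not cause confusion. For example, let $f: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ be a function; we also use $f$ to denote the corresponding vector in $\mathbb{R}^{|\mathcal{S} \times \mathcal{A}|}$. All vectors are column vectors.

We consider an infinite-horizon MDP with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a discount factor $\gamma \in [0, 1)$, and an initial state distribution $\mu_0$. The initial state $S_0$ is sampled from $\mu_0$. At time step $t$, an agent at $S_t$ selects an action $A_t$ according to $\pi(\cdot | S_t)$, where $\pi: \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the policy being followed by the agent. The agent then proceeds to the next state $S_{t+1}$ according to $p(\cdot | S_t, A_t)$ and gets a reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$.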

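Similar to Zhang et al. (2020), we consider two performance measurements for the policy $\pi$: the total discounted reward

$$\rho_\gamma(\pi) \doteq (1 - \gamma)\, \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 \sim \mu_0, \pi, p\Big]$$

and the average reward

$$\rho(\pi) \doteq \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} R_t \,\Big|\, S_0 \sim \mu_0, \pi, p\Big].$$

When the Markov chain induced by $\pi$ is ergodic, $\rho(\pi)$ is always well defined (Puterman, 2014). Throughout this paper, we implicitly assume this ergodicity whenever we consider $\rho(\pi)$. When considering $\rho_\gamma(\pi)$, we are interested in the normalized discounted state-action occupation $d_\gamma(s, a)$. Let $d_t(s, a)$ be the probability of occupying the state-action pair $(s, a)$ at the time step $t$ following $\pi$. Then, we have

$$d_\gamma(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d_t(s, a). \tag{1}$$

When considering $\rho(\pi)$, we are interested in the stationary state-action distribution

$$d(s, a) \doteq \lim_{t \to \infty} d_t(s, a). \tag{2}$$

To simplify notation, we extend the definition of $\rho_\gamma(\pi)$ and $d_\gamma$ from $\gamma \in [0, 1)$ to $\gamma \in [0, 1]$ by defining $\rho_1(\pi) \doteq \rho(\pi)$ and $d_1 \doteq d$. It follows that for any $\gamma \in [0, 1]$, we have $\rho_\gamma(\pi) = \mathbb{E}_{(s, a) \sim d_\gamma}[r(s, a)]$.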
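The link between the occupancy measure and the performance of a policy can be checked numerically. The following is a minimal sketch on a small random MDP (the sizes, seed, and variable names are all hypothetical). It assumes the normalized discounted occupancy is $(1-\gamma)\sum_t \gamma^t d_t$ with $d_t = (P_\pi^\top)^t \mu_0$, which has the closed form $(1-\gamma)(I - \gamma P_\pi^\top)^{-1}\mu_0$, and compares the expectation of the reward under this occupancy with a truncated discounted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
n_sa = n_states * n_actions

# Random MDP and target policy (hypothetical, for illustration only).
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)             # p[s, a, s'] = transition probability
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)           # pi[s, a] = action probability
r = rng.random(n_sa)                          # expected reward per state-action pair
mu0_state = np.full(n_states, 1.0 / n_states)

# State-action transition matrix: P[(s, a), (s', a')] = p(s'|s, a) * pi(a'|s').
P = np.einsum("sat,tb->satb", p, pi).reshape(n_sa, n_sa)
mu0 = (mu0_state[:, None] * pi).reshape(n_sa)  # initial state-action distribution

# Closed form of the normalized discounted occupancy.
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_sa) - gamma * P.T, mu0)

# Performance as an expectation of the reward under d_gamma ...
rho_from_occupancy = d_gamma @ r

# ... versus the truncated, normalized discounted sum of expected rewards.
rho_from_rollout, d_t = 0.0, mu0.copy()
for t in range(2000):
    rho_from_rollout += (1 - gamma) * gamma ** t * (d_t @ r)
    d_t = P.T @ d_t

print(rho_from_occupancy, rho_from_rollout)
```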

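We are interested in estimating $\rho_\gamma(\pi)$ without executing the policy $\pi$. Similar to Zhang et al. (2020), we assume access to a fixed dataset $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$. Here the state-action pair $(s_i, a_i)$ is sampled from an unknown distribution $d_\mu$, which may result from multiple unknown behavior policies. The reward $r_i$ satisfies $\mathbb{E}[r_i] = r(s_i, a_i)$. The successor state $s'_i$ is sampled from $p(\cdot | s_i, a_i)$. It is obvious that $\rho_\gamma(\pi) = \mathbb{E}_{(s, a) \sim d_\mu}\big[\frac{d_\gamma(s, a)}{d_\mu(s, a)} r(s, a)\big]$. Hence one possible approach for estimating $\rho_\gamma(\pi)$ is to learn the density ratio $\tau_*(s, a) \doteq \frac{d_\gamma(s, a)}{d_\mu(s, a)}$ directly.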
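To illustrate how a density ratio turns data from the sampling distribution into an estimate for the target policy, the sketch below uses made-up distributions over six state-action pairs (every number and name here is hypothetical). The true ratio is used for the check; in practice it is unknown and has to be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sa = 6

# Hypothetical distributions and rewards over state-action pairs.
d_pi = np.array([0.30, 0.10, 0.20, 0.15, 0.05, 0.20])  # stands in for the target occupancy
d_mu = np.full(n_sa, 1.0 / n_sa)                       # sampling distribution of the dataset
r = np.array([1.0, 0.0, 0.5, 2.0, -1.0, 0.3])          # expected reward r(s, a)

tau = d_pi / d_mu               # true density ratio (unknown in practice)
rho_true = d_pi @ r             # performance under the target occupancy

# Exact identity: the expectation of tau * r under d_mu equals the expectation of r under d_pi.
rho_reweighted = np.sum(d_mu * tau * r)

# Sample-based estimate from a fixed dataset drawn from d_mu.
idx = rng.choice(n_sa, size=200_000, p=d_mu)
rho_estimate = np.mean(tau[idx] * r[idx])

print(rho_true, rho_reweighted, rho_estimate)
```

Each sample receives a single weight, so the estimate does not involve any product of per-step ratios along a trajectory.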

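We assume $d_\mu(s, a) > 0$ for every $(s, a)$ and use $D$ to denote a diagonal matrix whose diagonal entry is $d_\mu$. Let $\mu_0(s, a) \doteq \mu_0(s)\pi(a|s)$; Zhang et al. (2020) show that the density ratio $\tau_*$ satisfies

$$D\tau_* = \mathcal{T}\tau_*, \tag{3}$$

where the operator $\mathcal{T}$ is defined as

$$\mathcal{T}y \doteq (1 - \gamma)\mu_0 + \gamma P_\pi^\top D y, \tag{4}$$

and $P_\pi$ is the state-action pair transition matrix, i.e., $P_\pi((s, a), (s', a')) \doteq p(s'|s, a)\pi(a'|s')$. The operator $\mathcal{T}$ is similar to the Bellman operator but in the reverse direction. Similar ideas have been explored by Hallak & Mannor (2017); Liu et al. (2018); Gelada & Bellemare (2019). As $D\tau_*$ is a probability measure, Zhang et al. (2020) propose to compute $\tau_*$ by solving the following optimization problem: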
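The fixed-point property of the density ratio can be verified on a small example. The sketch below assumes the operator has the form $\mathcal{T}y = (1-\gamma)\mu_0 + \gamma P_\pi^\top D y$ and uses a random row-stochastic matrix in place of $P_\pi$; the sizes, seed, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sa, gamma = 6, 0.8

# Hypothetical state-action transition matrix and distributions.
P = rng.random((n_sa, n_sa))
P /= P.sum(axis=1, keepdims=True)        # row-stochastic stand-in for P_pi
mu0 = rng.dirichlet(np.ones(n_sa))       # initial state-action distribution
d_mu = np.full(n_sa, 1.0 / n_sa)         # sampling distribution, strictly positive
D = np.diag(d_mu)


def T(y):
    """Assumed operator: T y = (1 - gamma) * mu0 + gamma * P^T D y."""
    return (1 - gamma) * mu0 + gamma * P.T @ (D @ y)


# Discounted occupancy in closed form and the corresponding density ratio.
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_sa) - gamma * P.T, mu0)
tau_star = d_gamma / d_mu

# The true ratio satisfies D tau = T tau; a shifted ratio does not.
residual_star = np.max(np.abs(D @ tau_star - T(tau_star)))
tau_other = tau_star + 0.1
residual_other = np.max(np.abs(D @ tau_other - T(tau_other)))

print(residual_star, residual_other)
```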

(5) |

where is elementwise greater or equal, is an -divergence (Nowozin et al., 2016) associated with a convex, lower semi-continuous generator function with . Let be two probability measures; we have . The -divergence is used mainly for the ease of optimization but see Zhang et al. (2020) for discussion of other possible divergences. Due to the difficulty in solving the constrained problem (5) directly, Zhang et al. (2020) propose to solve the following problem instead:

(6) |

where is a constant. We have

To make the optimization tractable and address a double sampling issue, Zhang et al. (2020) rewrite as , where is the Fenchel conjugate of , and use the interchangeability principle for interchanging maximization and expectation (Shapiro et al., 2014; Dai et al., 2016), yielding the following problem, which is equivalent to (6):

(7) |

where

(8) | ||||

(9) | ||||

(10) | ||||

(11) |

Here are shorthand for , , respectively. Zhang et al. (2020) show is convex in and concave in , i.e., (7) is a convex-concave saddle-point problem. Zhang et al. (2020) therefore use a primal-dual algorithm (i.e, perform stochastic gradient ascent on

and stochastic gradient descent on

) to find the saddle-point, yielding GENeralized stationary DIstribution Correction Estimation (GenDICE).

## 3 Problems with GenDICE

In this section, we discuss several problems with GenDICE.

### 3.1 Use of Divergences

Zhang et al. (2020) propose to consider a family of divergences, the -divergences. However, -divergences are defined between probability measures. So in (6) implicitly requires its arguments to be valid probability measures. Consequently, (6) still has the implicit constraint that . However, the main motivation for Zhang et al. (2020) to transform (5) into (6) is to get rid of this equality constraint. By using divergences, they do not really get rid of it. When this implicit constraint is considered, the problem (6) is still hard to optimize, as discussed in Zhang et al. (2020).

We can, of course, just ignore this implicit constraint and interpret as a generic function instead of a divergence. Namely, we do not require its arguments to be valid probability measures. In this scenario, however, there is no guarantee that is always nonnegative, which plays a central role in proving Lemma 1’s claim that GenDICE is consistent. For example, consider the KL-divergence, where . If holds for all (which is impossible when and are probability measures), clearly we have . While Zhang et al. (2020) propose not to use KL divergence due to numerical instability, here we provide a more principled explanation that if KL divergence is used, Lemma 1 does not necessarily hold. Zhang et al. (2020) propose to use -divergence instead. Fortunately, -divergence has the property that implies , even if and are not probability measures. This property ensures Lemma 1 holds even if we just consider as a generic function instead of a divergence. But not all divergences have this property. Moreover, even if -divergence is considered, is still necessary for Lemma 1 to hold. This requirement () is also problematic, as discussed in the next section.

To summarize, we argue that *divergence is not a good choice to form the optimization objective for density ratio learning*.

### 3.2 Use of Primal-Dual Algorithms

We assume , are parameterized by respectively. As (7) requires , Zhang et al. (2020) propose to add extra nonlinearity, e.g., , or , in the parameterization of . Plugging the approximation in (7) yields

(12) |

Here and emphasize the fact that and are parameterized functions.

There is now a contradiction.
On the one hand, is *not* necessarily CCSP when nonlinearity is introduced in the parameterization.
In the definition of in (7), the sign of depends on and .
Unless is linear in , the convexity of w.r.t. is in general hard to analyze (Boyd & Vandenberghe, 2004),
even if we just add after a linear parameterization.
Although Zhang et al. (2020) demonstrate great empirical success from a primal-dual algorithm,
this optimization procedure is *not* theoretically justified as is not necessarily a convex-concave function.
On the other hand, if we do not apply any nonlinearity in ,
there is no guarantee that even with a tabular representation.
Then Lemma 1 does not necessarily hold and GenDICE is not necessarily consistent.

Besides applying nonlinearity, one may also consider projection to account for the constraint , i.e., after each stochastic gradient update, we project back to the region where . With nonlinear function approximation, it is not clear how to achieve this. With tabular or linear representation, such projection reduces to inequality-constrained quadratic programming, where the number of constraints grows linearly w.r.t. the number of states (not state features), indicating such projection does not scale well.

To summarize, *applying a primal-dual algorithm for the GenDICE objective is not theoretically justified, even with a tabular representation*.

### 3.3 A Hard Example for GenDICE

We now provide a concrete example to demonstrate the defects of GenDICE. We consider a single state MDP (Figure 1) with two actions, both of which lead to that single state. We set . Therefore, we have . Under this setting, it is easy to verify that . We now instantiate (7) with a -divergence and as recommended by Zhang et al. (2020), where . To solve (7), we need . As suggested by Zhang et al. (2020), we use nonlinearity. Namely, we define . Now our optimization variables are . It is easy to verify that at the point , we have , indicating GenDICE stops at this point if the true gradient is used. However, is obviously not the optimum. Details of the computation are provided in the appendix. This suboptimality indeed results from the fact that once nonlinearity is introduced in , it is not convex-concave even with a tabular representation.

## 4 GradientDICE

As discussed above, the problems with GenDICE come mainly from the formulation of (6), namely the constraint and the use of the divergence . We eliminate both by considering the following problem instead:

(13) |

where is a constant and stands for . Readers familiar with Gradient TD methods (Sutton et al., 2009a, b) or residual gradients (Baird, 1995) may find the first term of this objective similar to the Mean Squared Bellman Error (MSBE). However, while in MSBE the norm is induced by , we consider a norm induced by . This norm is carefully designed and provides expectations that we can sample from, which will be clear once is expanded (see Eq (14) below). Remarkably, we have:

###### Theorem 1.

$\tau$ is optimal for (13) iff $\tau = \tau_*$.

###### Proof.

Sufficiency: Obviously is optimal.

Necessity: (a) :
In this case is nonsingular.
The linear system has only one solution,
we must have .
(b) : If is optimal, we have and ,
i.e.,

is a left eigenvector of

associated with the Perron-Frobenius eigenvalue 1. Note

is also a left eigenvector of associated with the eigenvalue 1. According to the Perron-Frobenius theorem for nonnegative irreducible matrices (Horn & Johnson, 2012), the left eigenspace of the Perron-Frobenius eigenvalue is 1-dimensional. Consequently, there exists a scalar

such that . On the other hand, , implying , i.e., . ∎

###### Remark 1.

Unlike the problem formulation in Zhang et al. (2020) (see (5) and (6)), we do not use as a constraint and can still guarantee there is no degenerate solution. Eliminating this constraint is key to eliminating nonlinearity. Although the Perron-Frobenius theorem can also be used in the formulation of Zhang et al. (2020), their use of the divergence still requires .
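The role of the Perron-Frobenius argument can be illustrated numerically for the average-reward case. The sketch below assumes the objective in (13) has the form $\|\mathcal{T}\tau - D\tau\|^2_{D^{-1}} + \frac{\lambda}{2}(d_\mu^\top\tau - 1)^2$ with $\mathcal{T}\tau = P_\pi^\top D\tau$ when $\gamma = 1$, rewrites it as an ordinary least-squares problem, and checks that the unconstrained minimizer coincides with the true density ratio. The transition matrix, sampling distribution, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sa, lam = 5, 1.0

P = rng.random((n_sa, n_sa)) + 0.1        # strictly positive, hence irreducible
P /= P.sum(axis=1, keepdims=True)
d_mu = rng.dirichlet(np.ones(n_sa) * 5)   # hypothetical sampling distribution
D = np.diag(d_mu)

# Stationary distribution: left eigenvector of P associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d /= d.sum()
tau_star = d / d_mu

# Assumed objective for gamma = 1, written as ||A tau - b||^2:
#   ||P^T D tau - D tau||^2_{D^{-1}} + lam / 2 * (d_mu^T tau - 1)^2
A = np.vstack([np.diag(d_mu ** -0.5) @ (P.T @ D - D),
               np.sqrt(lam / 2) * d_mu[None, :]])
b = np.concatenate([np.zeros(n_sa), [np.sqrt(lam / 2)]])
tau_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# The left eigenspace of eigenvalue 1 is one-dimensional, so the
# normalization term alone pins down the scale of the solution.
null_dim = n_sa - np.linalg.matrix_rank(P.T - np.eye(n_sa))

print(null_dim, np.max(np.abs(tau_hat - tau_star)))
```

No positivity constraint is imposed anywhere in this computation, yet the minimizer is entrywise positive because it equals the true ratio.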

With , we have

(14) | ||||

(15) | ||||

(16) |

where the equality comes also from the Fenchel conjugate and the interchangeability principle as in Zhang et al. (2020). We, therefore, consider the following problem

(17) |

where

(18) | ||||

(19) | ||||

(20) | ||||

(21) | ||||

(22) | ||||

(23) |

Here the equality comes from the fact that

(24) |

This problem is an *unconstrained* optimization problem and is convex (linear) in and concave in .
Assuming is parameterized by respectively and
including ridge regularization for for reasons that will soon be clear,
we consider the following problem

(25) |

where is a constant. When a linear architecture is considered for and , the problem (25) is CCSP. Namely, it is convex in and concave in . We use to denote the feature matrix, each row of which is , where is the feature function. Assuming , we perform gradient descent on and gradient ascent on . As we use techniques similar to Gradient TD methods to prove the convergence of our new algorithm, we term it Gradient stationary DIstribution Correction Estimation (GradientDICE):

(26) | ||||

(27) | ||||

(28) | ||||

(29) |

Here , (c.f. in (7)), $\{\alpha_t\}$ is a sequence of deterministic nonnegative nonincreasing learning rates satisfying the Robbins-Monro condition (Robbins & Monro, 1951), i.e., $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
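As a rough illustration of what one primal-dual step with linear features may look like, the sketch below derives stochastic gradients from an assumed form of the regularized objective, namely $(1-\gamma)\mathbb{E}_{\mu_0}[f] + \gamma\mathbb{E}[\tau(s,a)f(s',a')] - \mathbb{E}[\tau(s,a)f(s,a)] - \frac{1}{2}\mathbb{E}[f(s,a)^2] + \lambda(\mathbb{E}[\eta\tau(s,a) - \eta] - \frac{\eta^2}{2}) + \frac{\xi}{2}\|\theta\|^2$. The function name, argument names, and the tabular check are hypothetical, and the code should be read as a sketch under this assumption rather than as the exact updates (26)-(29).

```python
import numpy as np


def gradient_dice_step(theta, kappa, eta, x0, x, x_next, alpha,
                       gamma=0.9, lam=1.0, xi=0.0):
    """One stochastic primal-dual step under the assumed objective.

    theta: weights of tau(s, a) = x(s, a)^T theta  (gradient descent)
    kappa: weights of f(s, a) = x(s, a)^T kappa    (gradient ascent)
    eta:   scalar dual variable                    (gradient ascent)
    x0, x, x_next: features of an initial pair, a sampled pair (s, a),
                   and its successor pair (s', a') with a' drawn from the target policy.
    """
    tau = x @ theta
    delta = (1 - gamma) * x0 + gamma * tau * x_next - tau * x
    grad_kappa = delta - (x @ kappa) * x
    grad_eta = lam * (tau - 1.0 - eta)
    grad_theta = (gamma * (x_next @ kappa) - (x @ kappa) + lam * eta) * x + xi * theta
    return theta - alpha * grad_theta, kappa + alpha * grad_kappa, eta + alpha * grad_eta


# Tabular check on a hypothetical chain: with xi = 0, the expected step
# vanishes at (theta, kappa, eta) = (true density ratio, 0, 0).
rng = np.random.default_rng(3)
n, gamma = 4, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
mu0 = rng.dirichlet(np.ones(n))
d_mu = np.full(n, 1.0 / n)
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P.T, mu0)
theta_star = d_gamma / d_mu
eye = np.eye(n)                      # one-hot (tabular) features

exp_theta, exp_kappa, exp_eta = np.zeros(n), np.zeros(n), 0.0
for i in range(n):                   # sampled pair
    for j in range(n):               # successor pair
        for k in range(n):           # initial pair
            w = d_mu[i] * P[i, j] * mu0[k]
            th, ka, et = gradient_dice_step(theta_star, np.zeros(n), 0.0,
                                            eye[k], eye[i], eye[j],
                                            alpha=1.0, gamma=gamma)
            exp_theta += w * (th - theta_star)
            exp_kappa += w * ka
            exp_eta += w * et

print(np.max(np.abs(exp_theta)), np.max(np.abs(exp_kappa)), abs(exp_eta))
```

With a positive ridge coefficient the stationary point shifts away from the true ratio, which is the bias discussed in the consistency analysis.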

### 4.1 Convergence Analysis

Let , we rewrite the GradientDICE updates as

(30) |

where

(31) | ||||

(32) |

Defining , the limiting behavior of GradientDICE is governed by

(33) | ||||

(34) |

###### Assumption 1.

has linearly independent columns.

###### Assumption 2.

A is nonsingular or .

###### Assumption 3.

The features

have uniformly bounded second moments.

###### Remark 2.

Assumption 1 ensures is strictly positive definite. When , it is common to assume is nonsingular (Maei, 2011), the ridge regularization (i.e., ) is then optional. When , can easily be singular (e.g., in a tabular setting). We, therefore, impose the extra ridge regularization. Assumption 3 is commonly used in Gradient TD methods (Maei, 2011).

We provide a detailed proof of Theorem 2 in the appendix, which is inspired by Sutton et al. (2009a). One key step in the proof is to show that the real parts of all eigenvalues of are strictly negative. The in Sutton et al. (2009a) satisfies this condition easily. However, for our to satisfy this condition when , we must have , which motivates the use of ridge regularization.

With simple block matrix inversion expanding , we have , where

(35) | ||||

(36) | ||||

(37) | ||||

(38) |

The maximization step in (25) is quadratic (with linear function approximation) and thus can be solved analytically. Simple algebraic manipulation together with Assumption 1 shows that this quadratic problem has a unique optimizer for all . Plugging the analytical solution for the maximization step in (25), the KKT conditions then state that the optimizer for the minimization step must satisfy , where

(39) | ||||

(40) |

Assumption 2 ensures is nonsingular. Using the Sherman-Morrison formula (Sherman & Morrison, 1950), it is easy to verify . For a quick sanity check, it is easy to verify that holds when , , and , using the fact .
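For readers who want to see the Sherman-Morrison formula in action, the generic check below compares the rank-one update of an inverse with a direct inversion. The matrix and vectors are random placeholders, not the quantities from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant, hence invertible
u = rng.random(n)
v = rng.random(n)

A_inv = np.linalg.inv(A)
denominator = 1.0 + v @ A_inv @ u        # must be nonzero for the formula to apply

# Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
rank_one_inverse = A_inv - np.outer(A_inv @ u, v @ A_inv) / denominator
direct_inverse = np.linalg.inv(A + np.outer(u, v))

print(np.max(np.abs(rank_one_inverse - direct_inverse)))
```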

### 4.2 Consistency Analysis

To ensure convergence, we require ridge regularization in (25) for the setting . The asymptotic solution is therefore biased. We now study the regularization path consistency for the setting , i.e., we study the behavior of when approaches 0.

Case 1: .

Here indicates the column space.
As has linearly independent columns (Assumption 1),
we use to denote the unique satisfying .
As , can be singular.
Hence both and can be ill-defined.
We now show under some regularization,
we still have the desired consistency.
As is always positive semidefinite,
we consider its eigendecomposition ,
where

is an orthogonal matrix,

, is the rank of , are eigenvalues. Let , we have

###### Proposition 1.

Assuming is positive definite, , then where denotes the vector consisting of the elements indexed by in the vector .

###### Proof.

According to the Perron-Frobenius theorem (c.f. the proof of Theorem 1), it suffices to show

(41) |

where

(42) | ||||

(43) |

as is the only satisfying . With eigendecomposition of , we can compute explicitly. Simple algebraic manipulation then yields

(44) | ||||

(45) |

where . The desired limits then follow from L’Hôpital’s rule. ∎

###### Remark 3.

The assumption is not restrictive as it is independent of learnable parameters and mainly controlled by features. Requiring to be positive definite is more restrictive but it holds at least for the tabular setting (i.e., ). The difficulty of the setting comes mainly from the fact that the objective of the minimization step in the problem (25) is no longer strictly convex when (i.e., can be singular). Thus there may be multiple optima for this minimization step, only one of which is . Extra domain knowledge (e.g., assumptions in the proposition statement) is necessary to ensure the regularization path converges to the desired optimum. We provide a sufficient condition here and leave the analysis of necessary conditions for future work.

Case 2:

In this scenario, it is not clear how to define .
The minimization step in (25) can have multiple optima and it is not clear which one is the best.
To analyze this scenario, we need to explicitly define projection in the optimization objective like Mean Squared Projected Bellman Error (Sutton et al., 2009a),
instead of using an MSBE-like objective.
We leave this for future work.

### 4.3 Finite Sample Analysis

We now provide a finite sample analysis for a variant of GradientDICE, *Projected GradientDICE* (Algorithm 1),
where we introduce projection and iterate averaging,
akin to Nemirovski et al. (2009); Liu et al. (2015).

Intuitively, Projected GradientDICE groups the in GradientDICE into . Precisely, we have ,

(46) | ||||

(47) | ||||

(48) |

, . and are projections onto and w.r.t. norm, is the number of iterations, and is a learning rate, detailed below. We consider the following problem

(49) |

It is easy to see is a convex-concave function and its saddle point is unique. We assume

###### Assumption 4.

and are bounded, closed, and convex, .

###### Proposition 2.

Both and the learning rates are detailed in the proof in the appendix. Note it is possible to conduct a finite sample analysis without introducing projection using arguments from Lakshminarayanan & Szepesvari (2018), which we leave for future work.

## 5 Experiments

In this section, we present experiments comparing GradientDICE to GenDICE and DualDICE. All curves are averaged over 30 independent runs and shaded regions indicate one standard deviation.

### 5.1 Density Ratio Learning

We consider two variants of Boyan’s Chain (Boyan, 1999) as shown in Figure 4. In particular, we use Episodic Boyan’s Chain when and Continuing Boyan’s Chain when . We consider a uniform sampling distribution, i.e., , and a target policy satisfying . We design a sequence of tasks by varying the discount factor in .

We train all compared algorithms for steps. We evaluate the Mean Squared Error (MSE) for the predicted every 500 steps, computed as , where the ground truth is computed analytically. We use learning rates in the form of for all algorithms, where is a constant initial learning rate. We tune and from and with a grid search for all algorithms (for a fair comparison, we also add this ridge regularization for GenDICE and DualDICE). For each discount factor, we select the best hyperparameters (initial learning rate and ridge regularization coefficient) minimizing the Area Under Curve (AUC), a proxy for learning speed. For the penalty coefficient, we always set as recommended by Zhang et al. (2020). They find has little influence on overall performance.

We first consider a tabular representation, i.e., the values of and are stored in lookup tables. In particular, as GenDICE requires , we use the nonlinearity for its -table. We do not use any nonlinearity for GradientDICE and DualDICE. The results in Figure 2 show that GradientDICE learns faster than GenDICE and DualDICE in 5 out of 6 tasks and has a lower variance than GenDICE in 4 out of 6 tasks. In terms of the quality of asymptotic solutions, GradientDICE outperforms GenDICE and DualDICE in 4 and 5 tasks respectively and matches their performance in the remaining tasks.

Next we consider a linear architecture. We use the same state features as Boyan (1999), detailed in the appendix. We use two independent sets of weights for the two actions and still use the nonlinearity for GenDICE. We do not use any nonlinearity for GradientDICE and DualDICE. The results in Figure 3 show that for , GradientDICE is more stable than GenDICE (for the remaining cases, their variances are similar). For , the asymptotic solutions found by GradientDICE have lower error than those of GenDICE (for , the errors are comparable). GradientDICE learns faster than DualDICE in 4 out of 6 tasks and matches DualDICE’s performance in the remaining 2 tasks.

### 5.2 Off-Policy Evaluation

We now benchmark DualDICE, GenDICE, and GradientDICE in an off-policy evaluation problem. We consider Reacher-v2 from OpenAI Gym (Brockman et al., 2016). Our target policy is a policy trained via TD3 (Fujimoto et al., 2018) for steps. Our behavior policies are composed by adding Gaussian noise to the target policy. We use two behavior policies with and respectively. We execute each behavior policy for steps to collect transitions, which form our dataset with transitions. This dataset is fixed across the experiments.

We use neural networks to parameterize and

, each of which is represented by a two-hidden-layer network with 64 hidden units and ReLU

(Nair & Hinton,