1 Introduction
The use of model-free reinforcement learning (RL) in conjunction with function approximation has proliferated in recent years, demonstrating successful applications in fields such as robotics (Andrychowicz et al., 2018; Nachum et al., 2019a), game playing (Mnih et al., 2013), and conversational systems (Gao et al., 2019). These successes often rely on on-policy access to the environment; i.e., during the learning process agents may collect new experience from the environment using policies they choose, and these interactions are effectively unlimited. By contrast, in many real-world applications of RL, interaction with the environment is costly, if not impossible, hence experience collection during learning is limited, necessitating the use of off-policy RL methods, i.e., algorithms which are able to learn from logged experience collected by potentially multiple and possibly unknown behavior policies.
The off-policy nature of many practical applications presents a significant challenge for RL algorithms. The traditional max-return objective is in the form of an on-policy expectation, and thus policy gradient methods (Sutton et al., 2000; Konda and Tsitsiklis, 2000) require samples from the on-policy distribution to estimate the gradient of this objective. The most straightforward way to reconcile policy gradient with off-policy settings is via importance weighting (Precup et al., 2000). However, this approach is prone to high variance and instability without appropriate damping (Munos et al., 2016; Wang et al., 2016; Gruslys et al., 2017; Schulman et al., 2017). The more common approach to the off-policy problem is to simply ignore it, which is exactly what has been proposed by many existing off-policy policy gradient methods (Degris et al., 2012; Silver et al., 2014). These algorithms simply compute the gradients of the max-return objective with respect to samples from the off-policy data, ignoring the distribution shift in the samples. The justification for this approach is that the maximum-return policy will be optimal regardless of the sampling distribution of states. However, such a justification is unsound in function approximation settings, where models have limited expressiveness, with potentially disastrous consequences for optimization and convergence (e.g., Lu et al., 2018).

Value-based methods provide an alternative that may be more promising for the off-policy setting. In these methods, a value function is learned either as a critic to a learned policy (as in actor-critic) or as the maximum-return value function itself (as in Q-learning). This approach is based on dynamic programming in tabular settings, which is inherently off-policy and independent of any underlying data distribution. Nevertheless, when using function approximation, the objective is traditionally expressed as an expectation over single-step Bellman errors, which re-raises the question, "With respect to what distribution should the expectation be taken?" Some theoretical work suggests that the ideal expectation is in fact the on-policy expectation (Sutton et al., 2000; Silver et al., 2014; Nachum et al., 2018). In practice, this problem is usually ignored, with the same justification as that made for off-policy policy gradient methods. It is telling that actor-critic or Q-learning algorithms advertised as off-policy still require large amounts of online interaction with the environment (Haarnoja et al., 2018; Hessel et al., 2018).
In this work, we present an ALgorithm for policy Gradient from Arbitrary Experience via DICE (AlgaeDICE)^{1} as an alternative to policy gradient and value-based methods. We start with the dual formulation of the max-return objective, which is expressed in terms of normalized state-action occupancies rather than a policy or value function. Traditionally, this objective is considered unattractive, since access to the occupancies either requires an on-policy expectation (similar to policy gradient methods) or learning a function approximator to satisfy single-step constraints (similar to value-based methods). We demonstrate how these problems can be remedied by adding a controllable regularizer and applying a carefully chosen change of variables, obtaining a joint objective over a policy and an auxiliary dual function (that can be interpreted as a critic). Crucially, this objective relies only on access to samples from an arbitrary off-policy data distribution, collected by potentially multiple and possibly unknown behavior policies (under some mild conditions). Unlike traditional actor-critic methods, which use a separate objective for actor and critic, this formulation trains the policy (actor) and dual function (critic) to optimize the same objective. Further illuminating the connection to policy gradient methods, we show that if the dual function is optimized, the gradient of the proposed objective with respect to the policy parameters is exactly the on-policy policy gradient. In this way, our approach naturally avoids issues of distribution mismatch without any explicit use of importance weights.

^{1}DICE is an abbreviation for distribution correction estimation and is taken from the DualDICE work (Nachum et al., 2019b) on off-policy policy evaluation. Although our current work notably focuses on policy optimization as opposed to evaluation, and only implicitly estimates the distribution corrections, our derivations are nevertheless partly inspired by this previous work.

We then provide an alternative derivation of the same results, based on a primal-dual form of the return-maximizing RL problem; notably, this perspective extends the previous results to both undiscounted settings and unregularized max-return objectives. Finally, we provide empirical evidence that AlgaeDICE can perform well on benchmark RL tasks.
2 Background
We consider the RL problem presented as a Markov Decision Process (MDP) (Puterman, 1994), which is specified by a tuple $\mathcal{M} = \langle S, A, r, T, \mu_0 \rangle$, consisting of a state space, an action space, a reward function, a transition probability function, and an initial state distribution. A policy $\pi$ interacts with the environment by starting at an initial state $s_0 \sim \mu_0$ and iteratively producing a sequence of distributions $\pi(s_t)$ over $A$ at steps $t = 0, 1, \dots$, from which actions $a_t$ are sampled and successively applied to the environment. At each step, the environment produces a scalar reward $r_t = r(s_t, a_t)$ and a next state $s_{t+1} \sim T(s_t, a_t)$. In RL, one wishes to learn a return-maximizing policy:

$$\max_\pi\; J(\pi) := (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q^\pi(s_0, a_0)\right], \qquad (1)$$

where $Q^\pi$ describes the future discounted rewards accumulated by $\pi$ from any state-action pair $(s, a)$,

$$Q^\pi(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\middle|\, s_0 = s,\; a_0 = a,\; a_t \sim \pi(s_t),\; s_{t+1} \sim T(s_t, a_t)\right], \qquad (2)$$

and $\gamma \in (0, 1)$ is a discount factor. This objective may be equivalently written in its dual form (Puterman, 1994; Wang et al., 2008) in terms of the policy's normalized state-action visitation distribution $d^\pi$ as

$$\max_\pi\; J(\pi) = \mathbb{E}_{(s,a) \sim d^\pi}\left[r(s, a)\right], \qquad (3)$$

where

$$d^\pi(s, a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\left(s_t = s,\, a_t = a \mid s_0 \sim \mu_0,\; a_t \sim \pi(s_t),\; s_{t+1} \sim T(s_t, a_t)\right). \qquad (4)$$
As we will discuss in Section 4 and the appendix, these objectives are the primal and dual of the same linear programming (LP) problem.
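The equality of the primal form (1) and the dual form (3) can be checked numerically in a small tabular MDP. The following sketch (with arbitrary randomly generated dynamics, purely for illustration and not part of the paper) solves both linear systems and confirms they yield the same return:

```python
import numpy as np

np.random.seed(0)
nS, nA, gamma = 3, 2, 0.9
T = np.random.dirichlet(np.ones(nS), size=(nS, nA))   # T[s, a, s'] transition probabilities
r = np.random.rand(nS, nA)                            # reward r(s, a)
mu0 = np.ones(nS) / nS                                # initial state distribution
pi = np.random.dirichlet(np.ones(nA), size=nS)        # policy pi[s, a]

# State-action transition matrix P[(s,a), (s',a')] = T(s'|s,a) * pi(a'|s')
P = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
mu0_pi = (mu0[:, None] * pi).ravel()                  # initial state-action distribution

# Primal: solve Q = r + gamma * P Q, then J = (1-gamma) E_{mu0, pi}[Q]
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, r.ravel())
J_primal = (1 - gamma) * mu0_pi @ Q

# Dual: solve d = (1-gamma) * mu0_pi + gamma * P^T d, then J = E_d[r]
d = np.linalg.solve(np.eye(nS * nA) - gamma * P.T, (1 - gamma) * mu0_pi)
J_dual = d @ r.ravel()
```

Here `J_primal` and `J_dual` agree, and `d` is a proper (normalized) distribution, mirroring the primal/dual equivalence stated above.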
In function approximation settings, optimizing $J(\pi_\theta)$ requires access to its gradient with respect to the policy parameters $\theta$. The policy gradient theorem (Sutton et al., 2000) provides the gradient of (1) as

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{(s,a) \sim d^{\pi}}\left[Q^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]. \qquad (5)$$
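As a sanity check of identity (5), one can compare the policy-gradient-theorem expression against a finite-difference gradient of the normalized return in a small random tabular MDP with a softmax policy. All quantities below are illustrative constructions, not from the paper:

```python
import numpy as np

np.random.seed(0)
nS, nA, gamma = 3, 2, 0.9
T = np.random.dirichlet(np.ones(nS), size=(nS, nA))   # T[s, a, s']
r = np.random.rand(nS, nA)
mu0 = np.ones(nS) / nS
theta = np.random.randn(nS, nA)                       # softmax policy logits

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta):
    """Normalized return (1-gamma) * E_{mu0, pi}[Q^pi]."""
    pi = policy(theta)
    P = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
    Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, r.ravel())
    return (1 - gamma) * (mu0[:, None] * pi).ravel() @ Q

# Policy gradient theorem: grad J = E_{(s,a)~d^pi}[Q^pi(s,a) grad log pi(a|s)]
pi = policy(theta)
P = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, r.ravel()).reshape(nS, nA)
d = np.linalg.solve(np.eye(nS * nA) - gamma * P.T,
                    (1 - gamma) * (mu0[:, None] * pi).ravel()).reshape(nS, nA)
d_s = d.sum(axis=1, keepdims=True)
V = (pi * Q).sum(axis=1, keepdims=True)
grad_pg = d_s * pi * (Q - V)      # softmax score function worked out analytically

# Finite-difference gradient of J for comparison
grad_fd = np.zeros_like(theta)
eps = 1e-5
for i in range(nS):
    for j in range(nA):
        e = np.zeros_like(theta); e[i, j] = eps
        grad_fd[i, j] = (J(theta + e) - J(theta - e)) / (2 * eps)
```

The two gradients `grad_pg` and `grad_fd` match to numerical precision, confirming that (5) is the exact gradient of the normalized objective.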
To properly estimate this gradient one requires access to on-policy samples from $d^\pi$ and to estimates of the value function $Q^\pi$. The first requirement means that every gradient estimate necessitates interaction with the environment, which limits the applicability of this method in settings where interaction with the environment is expensive or infeasible. The second requirement means that one must maintain estimates of the $Q$-function while learning $\pi$. This leads to the family of actor-critic algorithms that alternate between updates to $\pi$ (the actor) and updates to a $Q$-function approximator (the critic). The critic is learned by encouraging it to satisfy single-step Bellman consistencies,
$$Q^\pi(s, a) = \mathcal{B}^\pi Q^\pi(s, a) \quad \forall (s, a), \qquad (6)$$

where $\mathcal{B}^\pi$ is the expected Bellman operator with respect to $\pi$, $\mathcal{B}^\pi Q(s, a) := r(s, a) + \gamma\, \mathbb{E}_{s' \sim T(s, a),\, a' \sim \pi(s')}\left[Q(s', a')\right]$. Thus, the critic is learned according to some variation on

$$\min_Q\; \mathbb{E}_{(s,a) \sim \rho}\left[\left(\mathcal{B}^\pi Q(s, a) - Q(s, a)\right)^2\right] \qquad (7)$$

for some distribution $\rho$. Although the use of an arbitrary $\rho$ suggests the critic may be learned off-policy, to achieve satisfactory performance, actor-critic algorithms generally rely on augmenting a replay buffer with new on-policy experience. Theoretical work has suggested that if one desires compatible function approximation, then an appropriate $\rho$ is, in fact, the on-policy distribution $d^\pi$ (Sutton et al., 2000; Silver et al., 2014; Nachum et al., 2018).
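In the idealized tabular case, where the expected Bellman operator is available exactly, the minimizer of (7) is $Q^\pi$ for any fully supported $\rho$, which is why the choice of $\rho$ is often ignored; the subtleties arise under function approximation and sampling. A quick numerical check of the idealized case (random MDP, illustrative only):

```python
import numpy as np

np.random.seed(1)
nS, nA, gamma = 4, 2, 0.9
T = np.random.dirichlet(np.ones(nS), size=(nS, nA))
r = np.random.rand(nS, nA).ravel()
pi = np.random.dirichlet(np.ones(nA), size=nS)
P = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)

Q_true = np.linalg.solve(np.eye(nS * nA) - gamma * P, r)

# Minimize E_rho[(B^pi Q - Q)^2] for an arbitrary fully supported rho.
# With A := gamma*P - I, the residual is r + A Q, so the normal equations are
# A^T diag(rho) A Q = -A^T diag(rho) r.
rho = np.random.dirichlet(np.ones(nS * nA))      # arbitrary off-policy weighting
A = gamma * P - np.eye(nS * nA)
W = np.diag(rho)
Q_min = np.linalg.solve(A.T @ W @ A, -A.T @ W @ r)
```

Here `Q_min` coincides with `Q_true` regardless of the draw of `rho`, since the zero-residual solution is feasible; with limited function classes this no longer holds and $\rho$ matters.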
In this work, we focus on the off-policy setting directly. Specifically, we are given a dataset $\mathcal{D} = \{(s^{(i)}, a^{(i)}, r^{(i)}, s'^{(i)})\}_{i=1}^{N}$, where $r^{(i)} = r(s^{(i)}, a^{(i)})$; $s'^{(i)} \sim T(s^{(i)}, a^{(i)})$; and each $(s^{(i)}, a^{(i)})$ has been sampled according to an unknown process. We let $d^{\mathcal{D}}$ denote the unknown state-action distribution, and we additionally assume access to a sample of initial states, $\{s_0^{(i)}\}$, such that $s_0^{(i)} \sim \mu_0$.
3 AlgaeDICE via Density Regularization
We begin by presenting an informal derivation of our method, motivated as a regularization of the dual max-return objective in (3). In Section 4 we will present our results more formally, as a consequence of the Lagrangian of a linear programming formulation of the max-return objective.
3.1 A Regularized Off-Policy Max-Return Objective
The max-return objective (3) is written exclusively in terms of the on-policy distribution $d^\pi$. To introduce an off-policy distribution into the objective, we incorporate a regularizer:

$$\max_\pi\; \mathbb{E}_{(s,a) \sim d^\pi}\left[r(s, a)\right] - \alpha\, D_f\!\left(d^\pi \,\|\, d^{\mathcal{D}}\right), \qquad (8)$$

with $\alpha > 0$ and $D_f$ denoting the $f$-divergence induced by a convex function $f$:

$$D_f\!\left(d^\pi \,\|\, d^{\mathcal{D}}\right) := \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[f\!\left(w_{\pi/\mathcal{D}}(s, a)\right)\right], \qquad (9)$$

where we have used the shorthand $w_{\pi/\mathcal{D}}(s, a) := d^\pi(s, a) / d^{\mathcal{D}}(s, a)$. This form of regularization encourages conservative behavior, compelling the state-action occupancies of $\pi$ to remain close to the off-policy distribution, which can improve generalization. We emphasize that the introduction of this regularizer is to enable the subsequent derivations, not to impose a strong constraint on the optimal policy. Indeed, by appropriately choosing $f$ and $\alpha$, the strength of the regularization can be controlled. Later, we will show that many of our results also hold for exploratory regularization ($\alpha < 0$) and even for no regularization at all ($\alpha = 0$).
At first glance, the regularization in (8) seems to complicate things. Not only do we still require on-policy samples from $d^\pi$, but we also have to compute the ratio $w_{\pi/\mathcal{D}}$, which in general can be difficult. To make this objective more approachable, we transform the divergence to its variational form (Nguyen et al., 2010) by use of a dual function $x : S \times A \to \mathbb{R}$:

$$\max_\pi \min_{x : S \times A \to \mathbb{R}}\; \mathbb{E}_{(s,a) \sim d^\pi}\left[r(s, a) - \alpha\, x(s, a)\right] + \alpha\, \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[f_*(x(s, a))\right], \qquad (10)$$
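The variational form underlying (10) states that $D_f(p \| q) = \max_x \mathbb{E}_p[x] - \mathbb{E}_q[f_*(x)]$, with the maximum attained at $x = f'(p/q)$. This is easy to verify numerically for a quadratic $f$; the arrays below are arbitrary stand-ins for $d^\pi$ and $d^{\mathcal{D}}$, purely for illustration:

```python
import numpy as np

np.random.seed(0)
n = 6
p = np.random.dirichlet(np.ones(n))    # stands in for d^pi
q = np.random.dirichlet(np.ones(n))    # stands in for d^D
f = lambda t: 0.5 * t ** 2             # convex f, with conjugate f_*(y) = 0.5 * y^2
f_star = lambda y: 0.5 * y ** 2

D_f = np.sum(q * f(p / q))             # divergence computed directly

# Variational form: D_f = max_x E_p[x] - E_q[f_*(x)], maximized at x = f'(p/q)
x_opt = p / q                          # f'(t) = t for this choice of f
D_var = np.sum(p * x_opt) - np.sum(q * f_star(x_opt))

# Any other x only lower-bounds the divergence (Fenchel-Young inequality)
for _ in range(100):
    x = np.random.randn(n)
    assert np.sum(p * x) - np.sum(q * f_star(x)) <= D_f + 1e-12
```

At the optimal dual function the bound is tight (`D_var == D_f`), which is exactly the property the derivation exploits.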
where $f_*$ is the convex (or Fenchel) conjugate of $f$. With the objective in (10), we are finally ready to eliminate the expectation over on-policy samples from $d^\pi$. To do so, we make a change of variables, inspired by DualDICE (Nachum et al., 2019b). Define $\nu : S \times A \to \mathbb{R}$ as the fixed point of a variant of the Bellman equation,

$$\nu(s, a) := r(s, a) - \alpha\, x(s, a) + \gamma\, \mathbb{E}_{s' \sim T(s, a),\, a' \sim \pi(s')}\left[\nu(s', a')\right]. \qquad (11)$$

Equivalently, $x(s, a) = \frac{1}{\alpha}\left(\mathcal{B}^\pi \nu(s, a) - \nu(s, a)\right)$. Note that $\nu$ always exists and is bounded when $x$ and $r$ are bounded (Puterman, 1994). Applying this change of variables to (10) (after some telescoping, see Nachum et al. (2019b)) yields

$$\max_\pi \min_\nu\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[\nu(s_0, a_0)\right] + \alpha\, \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[f_*\!\left(\tfrac{1}{\alpha}\left(\mathcal{B}^\pi \nu - \nu\right)(s, a)\right)\right]. \qquad (12)$$

The resulting objective is now completely off-policy, relying only on access to samples from the initial state distribution $\mu_0$ and the off-policy dataset $\mathcal{D}$. Thus, we have our first theorem, providing an off-policy formulation of the max-return objective:

Theorem 1 (Primal AlgaeDICE). Under mild conditions on $f$, the regularized max-return objective (8) may be expressed as a max-min optimization:

$$\max_\pi \min_\nu\; \mathcal{L}(\pi, \nu) := (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[\nu(s_0, a_0)\right] + \alpha\, \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[f_*\!\left(\tfrac{1}{\alpha}\left(\mathcal{B}^\pi \nu - \nu\right)(s, a)\right)\right]. \qquad (13)$$
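For the quadratic choice $f(t) = \frac{1}{2}t^2$ (so $f_*(y) = \frac{1}{2}y^2$) and a fixed policy, the optimal $\nu$ and the value of the inner objective in (13) can be computed in closed form in a tabular MDP and checked against the regularized return. The following is an illustrative numerical verification with random dynamics, not the paper's implementation:

```python
import numpy as np

np.random.seed(0)
nS, nA, gamma, alpha = 4, 2, 0.9, 0.5
T = np.random.dirichlet(np.ones(nS), size=(nS, nA))
r = np.random.rand(nS, nA).ravel()
mu0 = np.ones(nS) / nS
pi = np.random.dirichlet(np.ones(nA), size=nS)
P = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
mu0_pi = (mu0[:, None] * pi).ravel()

# On-policy occupancy d^pi and an arbitrary fully supported off-policy d^D
d_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P.T, (1 - gamma) * mu0_pi)
d_D = np.random.dirichlet(np.ones(nS * nA))
w = d_pi / d_D                                   # density ratio

# Optimal nu: Bellman fixed point for augmented reward r - alpha * f'(w), f'(w) = w
nu = np.linalg.solve(np.eye(nS * nA) - gamma * P, r - alpha * w)
residual = r + gamma * P @ nu - nu               # Bellman residual B^pi nu - nu

# Objective (13) at the optimal nu vs. the regularized return J - alpha * D_f
L = (1 - gamma) * mu0_pi @ nu + alpha * np.sum(d_D * 0.5 * (residual / alpha) ** 2)
J_reg = d_pi @ r - alpha * np.sum(d_D * 0.5 * w ** 2)
```

Two facts hold here: the Bellman residual of the optimal $\nu$ equals $\alpha \cdot w_{\pi/\mathcal{D}}$, and `L` equals `J_reg`, i.e., the off-policy objective evaluates to the regularized on-policy return.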
Remark (extension to $\alpha < 0$): It is clear that the same derivations may be applied to an exploratory regularizer of the same form with $\alpha < 0$, which leads to the following optimization problem:

$$\max_\pi \max_\nu\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[\nu(s_0, a_0)\right] + \alpha\, \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[f_*\!\left(\tfrac{1}{\alpha}\left(\mathcal{B}^\pi \nu - \nu\right)(s, a)\right)\right]. \qquad (14)$$
Remark (Fenchel AlgaeDICE): The appearance of $\mathcal{B}^\pi \nu$ inside $f_*$ in the second term of (12) presents a challenge in practice, since $\mathcal{B}^\pi$ involves an expectation over the transition function $T$, whereas one typically only has access to a single empirical sample from $T(s, a)$ for a given state-action pair. This challenge, known as double sampling in the RL literature (Baird, 1995), can prevent the algorithm from finding the desired value function, even with infinite data. There are several alternatives to handle this issue (e.g., Antos et al., 2008; Farahmand et al., 2016; Feng et al., 2019). Here, we apply the dual embedding technique (Dai et al., 2016, 2018b). Specifically, the dual representation of $f_*$,

$$f_*(y) = \max_{\zeta}\; y \cdot \zeta - f(\zeta),$$

can be substituted into (12) to result in a max-min-max problem:

$$\max_\pi \min_\nu \max_{\zeta : S \times A \to \mathbb{R}}\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[\nu(s_0, a_0)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a)\left(\mathcal{B}^\pi \nu - \nu\right)(s, a) - \alpha\, f(\zeta(s, a))\right]. \qquad (15)$$

As we will see in Section 4, under mild conditions, strong duality holds in the inner min-max of (15), hence one can switch the min and max to reduce to a more convenient max-min form.
3.2 Consistent Policy Gradient using Off-Policy Data
The equivalence between the objective in (12) and the on-policy max-return objective can be highlighted by considering the gradient of this objective with respect to $\pi$. First, consider the optimal $x^*$ for (10). By taking the gradient of (10) with respect to $x$ and setting it to zero, one finds that $x^*$ satisfies

$$f'_*(x^*(s, a)) = w_{\pi/\mathcal{D}}(s, a), \quad \text{i.e.,} \quad x^*(s, a) = f'\!\left(w_{\pi/\mathcal{D}}(s, a)\right). \qquad (16)$$

Accordingly, for any $\pi$, the optimal $\nu^*$ for (12) satisfies

$$\tfrac{1}{\alpha}\left(\mathcal{B}^\pi \nu^*(s, a) - \nu^*(s, a)\right) = f'\!\left(w_{\pi/\mathcal{D}}(s, a)\right). \qquad (17)$$

Thus, we may express the gradient of (12) with respect to the policy parameters $\theta$ as an on-policy expectation, where we have used Danskin's theorem (Bertsekas, 1999) to ignore gradients of the objective through $\nu^*$. Hence, if the dual function is optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, with value function given by $\nu^*$.

To characterize this value function, note that from (11), $\nu^*$ is a value function with respect to the augmented reward $\tilde r(s, a) := r(s, a) - \alpha\, x^*(s, a)$. Recalling the expression for $x^*$ in (16) and the fact that the derivatives $f'$ and $f'_*$ are inverses of each other, we have $\tilde r(s, a) = r(s, a) - \alpha\, f'\!\left(w_{\pi/\mathcal{D}}(s, a)\right)$. This derivation leads to our second theorem:

Theorem 2. If the dual function $\nu$ is optimized, the gradient of the off-policy objective with respect to $\theta$ is the regularized on-policy policy gradient:

$$\nabla_\theta \mathcal{L}(\pi_\theta, \nu^*) = \mathbb{E}_{(s,a) \sim d^{\pi}}\left[\tilde{Q}^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right], \qquad (18)$$

where $\tilde{Q}^{\pi}$ is the value function of $\pi$ with respect to the rewards $\tilde r(s, a) = r(s, a) - \alpha\, f'\!\left(w_{\pi/\mathcal{D}}(s, a)\right)$.
3.3 Connection to Actor-Critic
The relationship between the proposed off-policy objective and the classic policy gradient becomes more profound when we consider the form of the objective under specific choices of the convex function $f$. If we take $f(x) = \frac{1}{2} x^2$, then $f_*(y) = \frac{1}{2} y^2$ and the proposed objective is reminiscent of actor-critic:

$$\max_\pi \min_\nu\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[\nu(s_0, a_0)\right] + \frac{1}{2\alpha}\, \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\left(\mathcal{B}^\pi \nu(s, a) - \nu(s, a)\right)^2\right].$$

The second term alone is an instantiation of the off-policy critic objective in actor-critic. However, in actor-critic, the use of an off-policy objective for the critic is difficult to theoretically motivate. Moreover, in practice, critic and actor learning can both suffer from the mismatch between the off-policy distribution $d^{\mathcal{D}}$ and the on-policy $d^\pi$. By contrast, our derivations show that the introduction of the first term to the objective transforms the off-policy actor-critic algorithm into an on-policy actor-critic, without any explicit use of importance weights. Moreover, while standard actor-critic has two separate objectives for value and policy, our proposed objective is a single, unified objective: both the policy and value functions are trained with respect to the same off-policy objective.
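A minimal empirical version of this unified objective, for a tabular $\nu$ and a stochastic policy, might look as follows. The function name, batch format, and sampling scheme are illustrative assumptions, not the paper's code:

```python
import numpy as np

def algae_loss(nu, pi, batch, init_states, alpha=1.0, gamma=0.99, seed=0):
    """One objective shared by actor (ascent in pi) and critic (descent in nu).

    nu: array [nS, nA] of dual values; pi: action probabilities [nS, nA];
    batch: (s, a, r, s2) arrays of off-policy transitions (illustrative format).
    Uses f(t) = t^2/2, so f_*(y) = y^2/2.
    """
    rng = np.random.default_rng(seed)
    s, a, r, s2 = batch
    a0 = np.array([rng.choice(pi.shape[1], p=pi[si]) for si in init_states])
    a2 = np.array([rng.choice(pi.shape[1], p=pi[si]) for si in s2])
    residual = r + gamma * nu[s2, a2] - nu[s, a]     # single-sample Bellman residual
    linear = (1 - gamma) * nu[init_states, a0].mean()
    return linear + alpha * (0.5 * (residual / alpha) ** 2).mean()

# Smoke test: with nu = 0 the loss reduces to mean(r^2) / (2 * alpha)
nu = np.zeros((3, 2))
pi = np.full((3, 2), 0.5)
batch = (np.array([0, 1]), np.array([0, 1]), np.array([1.0, 2.0]), np.array([1, 2]))
loss = algae_loss(nu, pi, batch, init_states=np.array([0]), alpha=1.0)
```

In practice one would descend this quantity in the parameters of $\nu$ and ascend it in the parameters of $\pi$; the point of the sketch is that both players share the single scalar objective.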
4 A Lagrangian View of AlgaeDICE
We now show how AlgaeDICE can be alternatively derived from the Lagrangian of a linear programming (LP) formulation of the $Q$-function. Please refer to the appendix for details. We begin by introducing some notation and assumptions, which have appeared in the literature (e.g., Puterman, 1994; Nachum et al., 2019b; Zhang et al., 2020).

Assumption 1 (Bounded rewards). The rewards of the MDP are bounded by some finite constant $R_{\max}$: $\|r\|_\infty \le R_{\max}$.

For the next assumption, we introduce the transpose Bellman operator:

$$\mathcal{B}_*^\pi d(s', a') := (1-\gamma)\, \mu_0(s')\, \pi(a' \mid s') + \gamma\, \pi(a' \mid s') \sum_{s, a} T(s' \mid s, a)\, d(s, a). \qquad (19)$$

Assumption 2 (MDP regularity). The transposed Bellman operator $\mathcal{B}_*^\pi$ has a unique fixed-point solution.^{2}

^{2}When $\gamma < 1$, $\mathcal{B}_*^\pi$ has a unique fixed point regardless of the underlying MDP. For $\gamma = 1$, in the discrete case, the assumption reduces to requiring that the Markov chain induced by $\pi$ be ergodic. The continuous case for $\gamma = 1$ is more involved; see Meyn and Tweedie (2012); Levin and Peres (2017) for a detailed discussion.

Assumption 3 (Bounded ratio). The target density ratio is bounded by some finite constant $C$: $\|w_{\pi/\mathcal{D}}\|_\infty \le C$.

Assumption 4 (Characterization of $f$). The function $f$ is convex with continuous derivative $f'$. The convex (Fenchel) conjugate $f_*$ of $f$ is closed and strictly convex, and its derivative $f'_*$ is continuous, with a range that is a superset of the range of $w_{\pi/\mathcal{D}}$. For convenience, we also define a corresponding bounded interval that will serve as the range for $\nu$.

Our derivation begins with a formalization of the LP characterization of the $Q$-function and its dual form. Given a policy $\pi$, the (normalized) return $J(\pi)$ may be expressed in the primal and dual forms as

$$J(\pi) = \min_{Q}\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0, a_0)\right] \quad \text{s.t.} \quad Q(s, a) \ge \mathcal{B}^\pi Q(s, a)\;\; \forall (s, a),$$

and

$$J(\pi) = \max_{d \ge 0}\; \mathbb{E}_{(s,a) \sim d}\left[r(s, a)\right] \quad \text{s.t.} \quad d = \mathcal{B}_*^\pi d,$$

respectively. Under the assumptions above, strong duality holds, i.e., the two optimal values coincide for optimal solutions $Q^*, d^*$. The optimal primal $Q^*$ satisfies $Q^*(s, a) = Q^\pi(s, a)$ for all $(s, a)$ reachable by $\pi$, and the optimal dual is $d^* = d^\pi$.
Consider the Lagrangian of the primal LP, which would typically be expressed with a sum (or integral) of the constraints weighted by dual variables $d(s, a) \ge 0$. We can reparametrize the dual variable as $d = \zeta \cdot d^{\mathcal{D}}$ to express the Lagrangian as

$$\min_Q \max_{\zeta \ge 0}\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0, a_0)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a)\left(\mathcal{B}^\pi Q - Q\right)(s, a)\right]. \qquad (22)$$

The optimal $\zeta^*$ of this Lagrangian is $\zeta^* = w_{\pi/\mathcal{D}}$, and this optimal solution is not affected by expanding the allowable range of $\zeta$ to all (possibly negative) functions. However, in practice, the linear structure in (22) can induce numerical instability. Therefore, inspired by the augmented Lagrangian method, we introduce regularization. By adding a special regularizer using the convex $f$, we obtain

$$\min_Q \max_{\zeta}\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0, a_0)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a)\left(\mathcal{B}^\pi Q - Q\right)(s, a) - \alpha\, f(\zeta(s, a))\right]. \qquad (23)$$

We characterize the optimizers $Q^*, \zeta^*$ and the optimum value of this objective in the following theorem. Interestingly, although the regularization can affect the optimal primal solution $Q^*$, the optimal dual solution $\zeta^*$ is unchanged.

Theorem 3. Under Assumptions 1-4, the solution to (23) is given by $\zeta^* = w_{\pi/\mathcal{D}}$ and $Q^*$ the fixed point of $Q = \mathcal{B}^\pi Q - \alpha\, f'(w_{\pi/\mathcal{D}})$. The optimal value is $J(\pi) - \alpha\, D_f\!\left(d^\pi \,\|\, d^{\mathcal{D}}\right)$.

Thus, we have recovered the Fenchel AlgaeDICE objective for $\alpha > 0$, given in Equation (15). Furthermore, one may reverse the Legendre transform, replacing the inner maximization over $\zeta$ with $f_*$, to recover the Primal AlgaeDICE objective in Equation (13).
The derivation of this same result from the LP perspective allows us to exploit strong duality. Specifically, under the assumption that the rewards and the ratio $w_{\pi/\mathcal{D}}$ are bounded, the saddle point of (23) does not change if we optimize $Q$ and $\zeta$ over suitably bounded spaces containing the optimal solutions. In this case, strong duality holds (Ekeland and Temam, 1999, Proposition 2.1), and we may exchange the min and max in (23). This implies that, for computational efficiency, we can optimize the policy via

$$\max_{\pi} \max_{\zeta} \min_{Q}\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[Q(s_0, a_0)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a)\left(\mathcal{B}^\pi Q - Q\right)(s, a) - \alpha\, f(\zeta(s, a))\right]. \qquad (24)$$
Remark (extensions to $\gamma = 1$ or $\alpha = 0$): Although AlgaeDICE is originally derived for $\alpha > 0$ and $\gamma < 1$ in Section 3, the Lagrangian view of the LP formulation of the $Q$-function can be used to generalize the algorithm to $\alpha = 0$ and $\gamma = 1$. In particular, for $\alpha = 0$, one can directly use the original Lagrangian (22) of the LP. For the case $\gamma = 1$, the problem reduces to the Lagrangian of the LP for an undiscounted value function; details are delegated to the appendix.

Remark (off-policy evaluation): The LP form of the $Q$-values leading to the Lagrangian (22) can be directly used for behavior-agnostic off-policy evaluation (OPE). In fact, existing estimators for OPE in the behavior-agnostic setting, which typically reduce the OPE problem to estimation of density-ratio quantities (e.g., DualDICE (Nachum et al., 2019b) and GenDICE (Zhang et al., 2020)), can be recast as special cases by introducing different regularizations to the Lagrangian. As we have shown, the solution to the Lagrangian provides both (regularized) $Q$-values and the desired state-action density corrections as primal and dual variables simultaneously.
5 Related Work
Algorithmically, our proposed method follows a Lagrangian primaldual view of the LP characterization of the function, which leads to a saddlepoint problem. Several recent works (e.g., Chen and Wang, 2016; Wang, 2017; Dai et al., 2018a, b; Chen et al., 2018; Lee and He, 2018) also considered saddlepoint formulations for policy improvement, derived from fundamentally different perspectives. In particular, Dai et al. (2018a) exploit a saddlepoint formulation for the multistep (path) conditions on the consistency between optimal value function and policy. Other works (Chen and Wang, 2016; Wang, 2017; Dai et al., 2018b; Chen et al., 2018) consider the (augmented) Lagrangian of the LP characterization of Bellman optimality for the optimal function, which is slightly different from the LP characterization with respect to the optimal function we consider. Although slight, the difference between the  and LPs is crucial to enable behavioragnostic policy optimization in AlgaeDICE. If one were to follow derivations similar to AlgaeDICE but for the function LP, some form of explicit importance weighting (and thus knowledge of the behavior policy) would be required, as in recent work on offpolicy estimation (Tang et al., 2019; Uehara and Jiang, 2019). We further note that the application of a regularizer on the dual variable to yield Primal AlgaeDICE is key to transforming the Lagrangian optimization over values and stateaction occupancies — typical in these previous works — to an optimization over values and policies, which is more common in practice and can help generalization (e.g., Swaminathan and Joachims, 2015).
The regularization we employ is inspired by previous uses of regularization in RL. Adding regularization to MDPs (Neu et al., 2017; Geist et al., 2019) has been investigated for many different purposes in the literature, including exploration (de Farias and Van Roy, 2000; Haarnoja et al., 2017, 2018), smoothing (Dai et al., 2018b), avoiding premature convergence (Nachum et al., 2017a), ensuring tractability (Todorov, 2006), and mitigating observation noise (Rubin et al., 2012; Fox et al., 2016). We note that the regularization employed by AlgaeDICE as a divergence over stateaction densities is markedly different from these previous works, which mostly regularize only the action distributions of a policy conditioned on state. An approach more similar to ours is given by Belousov and Peters (2017), which regularizes the maxreturn objective using an divergence over stateaction densities. Their derivations are similar in spirit to ours, using the method of Lagrange multipliers, but their result is distinct in a number of key characteristics. First, their objective (analogous to ours in (12)) includes not only policy and values but also a number of additional functions, complicating any practical implementation. Second, their results are restricted to conservative regularization (), whereas our findings extend to both exploratory regularization and unregularized objectives (). Third, the algorithm proposed by Belousov and Peters (2017) follows a bilevel optimization, in which the policy is learned using a separate and distinct objective. In contrast, our proposed AlgaeDICE uses a single, unified objective for both policy and value learning.
Lastly, there are a number of works which (like ours) perform policy gradient on offpolicy data via distribution correction. The key differentiator is in how the distribution corrections are computed. One common method is to reweight offpolicy samples by considering eligibility traces (Precup et al., 2000; Geist and Scherrer, 2014), i.e., compute weights by taking the product of peraction importance weights over a trajectory. Thus, these methods can suffer from high variance as the length of trajectory increases, known as the “curse of horizon” (Liu et al., 2018). A more recent work (Liu et al., 2019) attempts to weight updates by estimated stateaction distribution corrections. This is more in line with our proposed AlgaeDICE, which implicitly estimates these quantities. One key difference is that this previous work explicitly estimates these corrections, which results in a bilevel optimization, as opposed to our more appealing unified objective. It is also important to note that both eligibility trace methods and the technique outlined in Liu et al. (2019) require knowledge of the behavior policy. In contrast, AlgaeDICE is a behavioragnostic offpolicy policy gradient method, which may be more relevant in practice. Compared to existing behavioragnostic offpolicy estimators (Nachum et al., 2019b; Zhang et al., 2020), this work considers the substantially more challenging problem of policy optimization.
6 Experiments
We present empirical evaluations of AlgaeDICE, first in a tabular setting using the Four Rooms domain (Sutton et al., 1999) and then on a suite of continuous control benchmarks using MuJoCo (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016).
6.1 Four Rooms
We begin by considering the tabular setting given by the Four Rooms environment (Sutton et al., 1999), in which an agent must navigate to a target location within a gridworld. In this tabular setting, we evaluate Primal AlgaeDICE (Equations (12) and (13)) with $f(x) = \frac{1}{2} x^2$. This way, for any $\pi$, the dual value function $\nu$ may be solved exactly using standard matrix operations. Thus, we train $\pi$ by iteratively solving for $\nu$ via matrix operations and then taking a gradient step for $\pi$. We collect an off-policy dataset by running a uniformly random policy for 500 trajectories, where each trajectory is initialized at a random state and is of length 10. This dataset is kept fixed, placing us in the completely offline regime. Fixed values of $\gamma$ and $\alpha$ are used throughout; see the appendix for details.
Graphical depictions of learned policies and dual value functions are presented in Figure 1, where each plot shows $\pi$ and $\nu$ during the first, fourth, seventh, and tenth iterations of training. The opacity of each square is determined by the Bellman residual of $\nu$ at that state. Recall that the Bellman residuals of the optimal $\nu$ recover the density ratios $d^\pi / d^{\mathcal{D}}$ (scaled by $\alpha$). We see that this is reflected in the learned $\nu$. At the beginning of training, the residuals are high around the initial state. As training progresses, there is a clear path (or paths) of high-residual states going from the initial to the target state. Thus we see that $\nu$ learns to properly correct for distribution shifts in the off-policy experience distribution. The algorithm successfully learns to optimize a policy using these corrected gradients, as shown by the arrows denoting preferred actions of the learned policy.
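The claim that the Bellman residuals of the optimal $\nu$ recover the density ratios can be checked directly: for $f(t) = \frac{1}{2}t^2$, the inner minimization over $\nu$ is a quadratic solvable with one linear system. The sketch below uses a random tabular MDP for illustration, not the Four Rooms code itself:

```python
import numpy as np

np.random.seed(2)
nS, nA, gamma, alpha = 4, 2, 0.9, 1.0
T = np.random.dirichlet(np.ones(nS), size=(nS, nA))
r = np.random.rand(nS, nA).ravel()
mu0 = np.ones(nS) / nS
pi = np.random.dirichlet(np.ones(nA), size=nS)
P = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
mu0_pi = (mu0[:, None] * pi).ravel()
d_D = np.random.dirichlet(np.ones(nS * nA))      # off-policy data distribution

# Inner problem: min_nu (1-gamma) E_{mu0,pi}[nu] + (1/(2*alpha)) E_dD[(B nu - nu)^2],
# with B nu - nu = r + A nu and A := gamma*P - I. Setting the gradient to zero gives
# a linear system in nu.
A = gamma * P - np.eye(nS * nA)
D = np.diag(d_D)
nu = np.linalg.solve(A.T @ D @ A, -alpha * (1 - gamma) * mu0_pi - A.T @ D @ r)

# The residual (divided by alpha) matches the true density ratio d^pi / d^D
ratio = (r + A @ nu) / alpha
d_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P.T, (1 - gamma) * mu0_pi)
```

Here `ratio` equals `d_pi / d_D` elementwise, even though the solve never touches $d^\pi$ directly; this is the mechanism behind the residual plots described above.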
We further provide quantitative results in Figure 2. We plot the average per-step reward of AlgaeDICE compared to actor-critic in both online and offline settings. As a point of comparison, the behavior policy used to collect data for the offline setting achieves a much lower average per-step reward. Although all the variants are able to significantly improve upon this baseline, we see that AlgaeDICE's performance is only negligibly affected by the type of dataset, while the performance of actor-critic degrades in the offline regime. See the appendix for experimental details.
6.2 Continuous Control
We now present results of AlgaeDICE on a set of continuous control benchmarks using MuJoCo (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016). We evaluate the performance of Primal AlgaeDICE with $f(x) = \frac{1}{2} x^2$. Our empirical objective is thus given by

$$\max_\pi \min_\nu\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(s_0)}\left[\nu(s_0, a_0)\right] + \frac{1}{2\alpha}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D},\, a' \sim \pi(s')}\left[\hat\delta(\nu; s, a, r, s', a')^2\right],$$

where $\hat\delta$ is a single-sample estimate of the Bellman residual:

$$\hat\delta(\nu; s, a, r, s', a') := r + \gamma\, \nu(s', a') - \nu(s, a). \qquad (25)$$

We note that using a single-sample estimate for the Bellman residual in general leads to biased gradients, although previous works have found this not to have a significant practical effect in these domains (Kostrikov et al., 2019). We make the following additional practical modifications:

If $\nu$ is fully optimized, its Bellman residual will be proportional to the density ratio $d^\pi / d^{\mathcal{D}}$, and thus always nonnegative. However, during optimization, this may not always hold, which can affect policy learning. Thus, when calculating gradients of this objective with respect to $\pi$, we clip the value of the residual from below at 0.
For training, we parameterize $\pi$ and $\nu$ using neural networks and perform alternating stochastic gradient descent on their parameters.
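The bias from the single-sample residual noted above is exactly the variance of the sampled next-state value: the expected squared single-sample residual exceeds the squared expected residual by $\gamma^2\,\mathrm{Var}[\nu(s', a')]$. A short numerical illustration with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, r, nu_s = 0.99, 1.0, 0.0
nu_next = rng.normal(size=100_000)    # nu evaluated at sampled next state-actions

expected_residual = r + gamma * nu_next.mean() - nu_s
mean_sq_single = ((r + gamma * nu_next - nu_s) ** 2).mean()

# Identity: E[(c + gamma*X)^2] - (c + gamma*E[X])^2 = gamma^2 * Var(X)
bias = mean_sq_single - expected_residual ** 2
```

The gap `bias` equals `gamma**2 * nu_next.var()` exactly, which is why minimizing the squared single-sample residual implicitly penalizes the variance of $\nu$ at next states.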
We present our results in Figure 3. We see that AlgaeDICE can perform well in these settings, achieving performance that is roughly competitive with the state-of-the-art SAC (Haarnoja et al., 2018) and TD3 (Fujimoto et al., 2018) algorithms. Further improvements to these practical results are potentially possible by choosing $f$ (or $\alpha$) appropriately. In the appendix, we conduct a preliminary investigation into polynomial forms of $f$, showing that certain polynomials can at times provide better performance than $f(x) = \frac{1}{2} x^2$. A more detailed and systematic study of this and other design choices for implementing AlgaeDICE is an interesting avenue for future work.
Figure 3: Results of AlgaeDICE on continuous control benchmarks (HalfCheetah, Hopper, Walker2d, Ant, Humanoid), plotting the performance of 10 randomly seeded training runs, with the shaded region representing half a standard deviation and the x-axis given by environment steps. There are potentially better results achievable by using a choice of $f$ other than $f(x) = \frac{1}{2} x^2$ for AlgaeDICE; see the appendix for a preliminary investigation.

7 Conclusion
We have introduced an ALgorithm for policy Gradient from Arbitrary Experience via DICE, or AlgaeDICE, for behavior-agnostic, off-policy policy improvement in reinforcement learning. Based on a linear programming characterization of the $Q$-function, we derived the new approach from a Lagrangian saddle-point formulation. The resulting algorithm automatically compensates for the distribution shift in collected off-policy data and achieves an estimate of the on-policy policy gradient using this off-policy data.
Acknowledgments
We thank Marc Bellemare, Nicolas Le Roux, George Tucker, Rishabh Agarwal, Dibya Ghosh, and the rest of the Google Brain team for insightful thoughts and discussions.
References
 Andrychowicz et al. (2018) Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
 Belousov and Peters (2017) Boris Belousov and Jan Peters. f-divergence constrained policy improvement. arXiv preprint arXiv:1801.00056, 2017.
 Bertsekas (1999) D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
 Chen and Wang (2016) Yichen Chen and Mengdi Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
 Chen et al. (2018) Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.
 Dai et al. (2016) Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. CoRR, abs/1607.04579, 2016.
 Dai et al. (2018a) Bo Dai, Albert Shaw, Niao He, Lihong Li, and Le Song. Boosting the actor with dual critic. ICLR, 2018a. arXiv:1712.10282.
 Dai et al. (2018b) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the ThirtyFifth International Conference on Machine Learning (ICML), pages 1133–1142, 2018b.
 de Farias and Van Roy (2000) Daniela Pucci de Farias and Benjamin Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608, 2000.
 Degris et al. (2012) Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
 Ekeland and Temam (1999) Ivar Ekeland and Roger Temam. Convex analysis and variational problems, volume 28. Siam, 1999.
 Farahmand et al. (2016) Amirmassoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17(130):1–66, 2016.
 Feng et al. (2019) Yihao Feng, Lihong Li, and Qiang Liu. A kernel loss for solving the Bellman equation. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
 Fox et al. (2016) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In UAI, 2016.
 Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
 Gao et al. (2019) Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to Conversational AI. Foundations and Trends in Information Retrieval, 13(2–3):127–298, 2019.
 Geist and Scherrer (2014) M. Geist and B. Scherrer. Off-policy learning with eligibility traces: A survey. The Journal of Machine Learning Research, 15(1):289–333, 2014.
 Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.
 Gruslys et al. (2017) A. Gruslys, M. Azar, M. Bellemare, and R. Munos. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
 Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

 Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Konda and Tsitsiklis (2000) Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1008–1014, 2000.
 Kostrikov et al. (2019) Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. 2019.
 Lee and He (2018) Donghwan Lee and Niao He. Stochastic primal-dual Q-learning. arXiv preprint arXiv:1810.08298, 2018.
 Levin and Peres (2017) David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
 Liu et al. (2019) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
 Lu et al. (2018) Tyler Lu, Dale Schuurmans, and Craig Boutilier. Non-delusional Q-learning and value-iteration. In Advances in Neural Information Processing Systems, pages 9949–9959, 2018.
 Meyn and Tweedie (2012) Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
 Nachum et al. (2017a) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017a.
 Nachum et al. (2017b) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017b.
 Nachum et al. (2018) Ofir Nachum, Mohammad Norouzi, George Tucker, and Dale Schuurmans. Smoothed action value functions for learning Gaussian policies. arXiv preprint arXiv:1803.02348, 2018.
 Nachum et al. (2019a) Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multiagent manipulation via locomotion using hierarchical sim2real. arXiv preprint arXiv:1908.05224, 2019a.
 Nachum et al. (2019b) Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavioragnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019b.
 Neu et al. (2017) Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
 Nguyen et al. (2010) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Precup et al. (2000) D. Precup, R. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
 Puterman (1994) Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
 Rubin et al. (2012) Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in MDPs. Decision Making with Imperfect Decision Makers, pages 57–74, 2012.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 387–395, 2014.
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(1):1731–1755, 2015.
 Tang et al. (2019) Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186, 2019.
 Todorov (2006) Emanuel Todorov. Linearly-solvable Markov decision problems. In NIPS, pages 1369–1376, 2006.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
 Uehara and Jiang (2019) Masatoshi Uehara and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.
 Wang (2017) Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv e-prints, 2017.
 Wang et al. (2008) Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Dual representations for dynamic programming. 2008.
 Wang et al. (2016) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Zhang et al. (2020) Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values, 2020. Preprint.
Appendix A Proof Details
We follow the notation of the main text. Abusing notation slightly, we treat a function over state-action pairs (such as $Q$ or $d$) and its vector of values interchangeably.
Theorem (Q-LP).
Given a policy $\pi$,
the average return of $\pi$ may be expressed in the primal and dual forms,
$$\rho(\pi) = \min_{Q}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] \quad \text{s.t.} \quad Q(s,a) \ge R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a)\ \ \forall (s,a),$$
$$\rho(\pi) = \max_{d\ge 0}\ \sum_{s,a} d(s,a)\,R(s,a) \quad \text{s.t.} \quad d(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d(s,a)\ \ \forall (s,a),$$
respectively. Under Assumptions 4 and 19, strong duality holds, i.e., both optimal values equal $\rho(\pi)$ for optimal solutions $Q^*, d^*$. The optimal primal $Q^*$ satisfies $Q^*(s,a) = Q^\pi(s,a)$ for all $(s,a)$ reachable by $\pi$, and the optimal dual is $d^* = d^\pi$.
Proof. Recall that the Bellman operator $\mathcal{B}^\pi Q := R + \gamma\,\mathcal{P}^\pi Q$ is monotonic; that is, given two bounded functions $Q_1$ and $Q_2$, $Q_1 \ge Q_2$ implies $\mathcal{B}^\pi Q_1 \ge \mathcal{B}^\pi Q_2$. Therefore, for any feasible $Q$ (i.e., any $Q$ with $Q \ge \mathcal{B}^\pi Q$), we have $Q \ge \mathcal{B}^\pi Q \ge (\mathcal{B}^\pi)^2 Q \ge \cdots \ge \lim_{k\to\infty}(\mathcal{B}^\pi)^k Q = Q^\pi$, proving the first claim.
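The monotonicity argument above can be checked numerically: any $Q$ that is feasible for the primal (pointwise $Q \ge \mathcal{B}^\pi Q$) must upper-bound $Q^\pi$. The sketch below builds a small random MDP; all variable names are illustrative and not taken from the paper's code.

```python
# Numeric illustration: a primal-feasible Q (Q >= B^pi Q) upper-bounds Q^pi.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 2, 0.8

# Random MDP: transitions P[s, a, s'], rewards R[s, a], stochastic policy pi[s, a].
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Policy transition operator on (s, a) pairs: P_pi[(s,a), (s',a')] = P(s'|s,a) pi(a'|s').
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(nS * nA, nS * nA)
r = R.reshape(nS * nA)
B = lambda q: r + gamma * P_pi @ q            # Bellman operator B^pi

# Q^pi is the unique fixed point of B: (I - gamma P_pi) Q = r.
Q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r)

# Make an arbitrary Q feasible by shifting it up by a constant:
# B(Q + c) = B(Q) + gamma * c, so c >= max(B(Q) - Q) / (1 - gamma) suffices.
Q0 = rng.normal(size=nS * nA)
c = max(np.max(B(Q0) - Q0) / (1 - gamma), 0.0)
Q = Q0 + c

assert np.all(Q >= B(Q) - 1e-9)   # feasibility: Q >= B^pi Q
assert np.all(Q >= Q_pi - 1e-9)   # hence Q >= Q^pi, as the proof argues
```

The constant shift is just a convenient way to manufacture a feasible point; any function satisfying the Bellman inequality would do.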
The dual of the primal linear program can be obtained from its Lagrangian,
$$\min_{Q}\ \max_{d\ge 0}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] + \sum_{s,a} d(s,a)\left(R(s,a) + \gamma\,\mathcal{P}^\pi Q(s,a) - Q(s,a)\right),$$
by exchanging the order of $\min$ and $\max$ (justified by linear programming duality) and collecting the terms multiplying each $Q(s,a)$, yielding
$$\max_{d\ge 0}\ \sum_{s,a} d(s,a)\,R(s,a) \quad \text{s.t.} \quad d(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\mathcal{P}^\pi_* d(s,a)\ \ \forall (s,a),$$
which is exactly the dual form above. Notice that the equality constraints correspond to a system of linear equations of dimension $|S|\cdot|A|$: $(I - \gamma\,\mathcal{P}^\pi_*)\,d = (1-\gamma)\,\mu_0\pi$, where $I$
is the identity matrix. Since the matrix $I - \gamma\,\mathcal{P}^\pi_*$
is nonsingular, the system has a unique solution given by $d^* = (1-\gamma)\,(I - \gamma\,\mathcal{P}^\pi_*)^{-1}\mu_0\pi$. Finally, when $\gamma < 1$, we can rewrite this via the Neumann series as $d^* = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t(\mathcal{P}^\pi_*)^t\,\mu_0\pi$, so $d^* = d^\pi$, as desired.
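As a sanity check on the duality claim, one can verify on a small random MDP that the primal value $(1-\gamma)\,\mathbb{E}[Q^\pi]$, the dual value $\sum_{s,a} d^\pi(s,a)R(s,a)$, and the closed-form/series expressions for $d^\pi$ all agree. This is a minimal sketch with illustrative variable names, not the paper's implementation.

```python
# Numerical check of the Q-LP primal/dual identity on a small random MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random MDP: transitions P[s, a, s'], rewards R[s, a], initial distribution mu0,
# and a fixed stochastic policy pi[s, a].
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
mu0 = rng.random(nS); mu0 /= mu0.sum()
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Policy transition operator on (s, a) pairs and the initial (s, a) distribution.
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(nS * nA, nS * nA)
r = R.reshape(nS * nA)
nu0 = (mu0[:, None] * pi).reshape(nS * nA)

# Primal: Q^pi solves (I - gamma P_pi) Q = r; value is (1-gamma) E_{mu0,pi}[Q^pi].
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r)
rho_primal = (1 - gamma) * nu0 @ Q

# Dual: the unique feasible d solves (I - gamma P_pi^T) d = (1-gamma) nu0.
d = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi.T, (1 - gamma) * nu0)
rho_dual = d @ r
assert np.allclose(rho_primal, rho_dual)    # strong duality on this instance

# d also matches the truncated Neumann series (1-gamma) sum_t gamma^t (P_pi^T)^t nu0.
d_series = (1 - gamma) * sum(
    gamma**t * np.linalg.matrix_power(P_pi.T, t) @ nu0 for t in range(500)
)
assert np.allclose(d, d_series, atol=1e-6)
```

Note that $d$ comes out nonnegative and sums to one, consistent with its interpretation as the discounted occupancy distribution $d^\pi$.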