The use of model-free reinforcement learning (RL) in conjunction with function approximation has proliferated in recent years, demonstrating successful applications in fields such as robotics (Andrychowicz et al., 2018; Nachum et al., 2019a), game playing (Mnih et al., 2013), and conversational systems (Gao et al., 2019). These successes often rely on on-policy access to the environment; i.e., during the learning process agents may collect new experience from the environment using policies they choose, and these interactions are effectively unlimited. By contrast, in many real-world applications of RL, interaction with the environment is costly, if not impossible. Experience collection during learning is therefore limited, necessitating the use of off-policy RL methods, i.e., algorithms which are able to learn from logged experience collected by potentially multiple and possibly unknown behavior policies.
The off-policy nature of many practical applications presents a significant challenge for RL algorithms. The traditional max-return objective is in the form of an on-policy expectation, and thus, policy gradient methods (Sutton et al., 2000; Konda and Tsitsiklis, 2000) require samples from the on-policy distribution to estimate the gradient of this objective. The most straightforward way to reconcile policy gradient with off-policy settings is via importance weighting (Precup et al., 2000). However, this approach is prone to high variance and instability without appropriate damping (Munos et al., 2016; Wang et al., 2016; Gruslys et al., 2017; Schulman et al., 2017). The more common approach to the off-policy problem is to simply ignore it, which is exactly what has been proposed by many existing off-policy policy gradient methods (Degris et al., 2012; Silver et al., 2014). These algorithms simply compute the gradients of the max-return objective with respect to samples from the off-policy data, ignoring distribution shift in the samples. The justification for this approach is that the maximum return policy will be optimal regardless of the sampling distribution of states. However, such a justification is unsound in function approximation settings, where models have limited expressiveness, with potentially disastrous consequences on optimization and convergence (e.g., Lu et al., 2018).
Value-based methods provide an alternative that may be more promising for the off-policy setting. In these methods, a value function is learned either as a critic to a learned policy (as in actor-critic) or as the maximum return value function itself (as in $Q$-learning). This approach is based on dynamic programming in tabular settings, which is inherently off-policy and independent of any underlying data distribution. Nevertheless, when using function approximation, the objective is traditionally expressed as an expectation over single-step Bellman errors, which re-raises the question, “What should the expectation be?” Some theoretical work suggests that the ideal expectation is in fact the on-policy expectation (Sutton et al., 2000; Silver et al., 2014; Nachum et al., 2018). In practice, this problem is usually ignored, with the same justification as that made for off-policy policy gradient methods. It is telling that actor-critic or $Q$-learning algorithms advertised as off-policy still require large amounts of online interaction with the environment (Haarnoja et al., 2018; Hessel et al., 2018).
In this work, we present an ALgorithm for policy Gradient from Arbitrary Experience via DICE (AlgaeDICE)¹ as an alternative to policy gradient and value-based methods. We start with the dual formulation of the max-return objective, which is expressed in terms of normalized state-action occupancies rather than a policy or value function. Traditionally, this objective is considered unattractive, since access to the occupancies either requires an on-policy expectation (similar to policy gradient methods) or learning a function approximator to satisfy single-step constraints (similar to value-based methods). We demonstrate how these problems can be remedied by adding a controllable regularizer and applying a carefully chosen change of variables, obtaining a joint objective over a policy and an auxiliary dual function (that can be interpreted as a critic). Crucially, this objective relies only on access to samples from an arbitrary off-policy data distribution, collected by potentially multiple and possibly unknown behavior policies (under some mild conditions). Unlike traditional actor-critic methods, which use a separate objective for actor and critic, this formulation trains the policy (actor) and dual function (critic) to optimize the same objective. Further illuminating the connection to policy gradient methods, we show that if the dual function is optimized, the gradient of the proposed objective with respect to the policy parameters is exactly the on-policy policy gradient. This way, our approach naturally avoids issues of distribution mismatch without any explicit use of importance weights.

¹ DICE is an abbreviation for distribution correction estimation and is taken from the DualDICE work (Nachum et al., 2019b) on off-policy policy evaluation. Although our current work notably focuses on policy optimization as opposed to evaluation and only implicitly estimates the distribution corrections, our derivations are nevertheless partly inspired by this previous work.
We then provide an alternative derivation of the same results, based on a primal-dual form of the return-maximizing RL problem. Notably, this perspective extends the previous results to both undiscounted settings and unregularized max-return objectives. Finally, we provide empirical evidence that AlgaeDICE can perform well on benchmark RL tasks.
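We consider the RL problem presented as a Markov Decision Process (MDP) (Puterman, 1994), which is specified by a tuple $\mathcal{M} = \langle S, A, R, T, \mu_0 \rangle$ consisting of a state space, an action space, a reward function, a transition probability function, and an initial state distribution. A policy $\pi$ interacts with the environment by starting at an initial state $s_0 \sim \mu_0$, and iteratively producing a sequence of distributions $\pi(\cdot|s_t)$ over $A$, at steps $t = 0, 1, \ldots$, from which actions $a_t$ are sampled and successively applied to the environment. At each step, the environment produces a scalar reward $r_t = R(s_t, a_t)$ and a next state $s_{t+1} \sim T(s_t, a_t)$. In RL, one wishes to learn a return-maximizing policy:
$$\max_\pi J_{\mathrm{P}}(\pi) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q_\pi(s_0,a_0)\right], \tag{1}$$
where $Q_\pi$ describes the future rewards accumulated by $\pi$ from any state-action pair $(s,a)$,
$$Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t)\,\middle|\, s_0=s,\ a_0=a,\ s_{t+1}\sim T(s_t,a_t),\ a_{t+1}\sim\pi(s_{t+1})\right], \tag{2}$$
and $0 \le \gamma < 1$ is a discount factor. This objective may be equivalently written in its dual form, in terms of the policy's normalized state-action occupancy, as
$$\max_\pi J_{\mathrm{D}}(\pi) := \mathbb{E}_{(s,a)\sim d^\pi}\left[R(s,a)\right], \tag{3}$$
where
$$d^\pi(s,a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr\left[s_t=s,\ a_t=a\,\middle|\, s_0\sim\mu_0,\ a_t\sim\pi(s_t),\ s_{t+1}\sim T(s_t,a_t)\right]. \tag{4}$$
As we will discuss in Section 4 and the appendix, these objectives are the primal and dual of the same linear programming (LP) problem.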
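The agreement between the primal objective (in terms of $Q$-values at initial states) and the dual objective (in terms of the normalized occupancy) can be checked numerically. The following is a minimal sketch on a hypothetical 2-state, 2-action MDP; the rewards, transitions, policy, and discount are invented for illustration and do not come from the paper.

```python
# Toy 2-state, 2-action MDP. Every number below is invented for illustration.
GAMMA = 0.9
N_S, N_A = 2, 2
MU0 = [1.0, 0.0]                              # initial state distribution mu_0
R = [[0.0, 1.0], [2.0, 0.0]]                  # R[s][a]
T = [[[0.9, 0.1], [0.2, 0.8]],                # T[s][a][s_next]
     [[0.5, 0.5], [0.0, 1.0]]]
PI = [[0.3, 0.7], [0.6, 0.4]]                 # PI[s][a] = pi(a|s)


def q_values(n_iters=2000):
    """Fixed point of Q = R + gamma * E_{s'~T, a'~pi}[Q(s', a')]."""
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(n_iters):
        v = [sum(PI[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
        q = [[R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
    return q


def occupancy(n_iters=2000):
    """Normalized discounted state-action occupancy of PI via its flow recursion."""
    d = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(n_iters):
        inflow = [sum(T[s][a][s2] * d[s][a]
                      for s in range(N_S) for a in range(N_A))
                  for s2 in range(N_S)]
        d = [[PI[s][a] * ((1 - GAMMA) * MU0[s] + GAMMA * inflow[s])
              for a in range(N_A)] for s in range(N_S)]
    return d


q, d = q_values(), occupancy()
# Primal form: (1 - gamma) * E_{s0 ~ mu0, a0 ~ pi}[Q(s0, a0)].
primal = (1 - GAMMA) * sum(MU0[s] * PI[s][a] * q[s][a]
                           for s in range(N_S) for a in range(N_A))
# Dual form: E_{(s, a) ~ d^pi}[R(s, a)].
dual = sum(d[s][a] * R[s][a] for s in range(N_S) for a in range(N_A))
print(primal, dual)  # the two values agree up to numerical error
```

Both quantities are computed by simple fixed-point iteration, which is only practical in small tabular problems like this one.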
In function approximation settings, optimizing $\pi$ requires access to gradients. The policy gradient theorem (Sutton et al., 2000) provides the gradient of $J_{\mathrm{D}}(\pi)$ as
$$\frac{\partial}{\partial\pi} J_{\mathrm{D}}(\pi) = \mathbb{E}_{(s,a)\sim d^\pi}\left[Q_\pi(s,a)\,\nabla\log\pi(a|s)\right]. \tag{5}$$
To properly estimate this gradient one requires access to on-policy samples from $d^\pi$ and access to estimates of the $Q$-value function $Q_\pi(s,a)$. The first requirement means that every gradient estimate of $J_{\mathrm{D}}$ necessitates interaction with the environment, which limits applicability of this method in settings where interaction with the environment is expensive or infeasible. The second requirement means that one must maintain estimates of the $Q$-function to learn $\pi$. This leads to the family of actor-critic algorithms that alternate between updates to $\pi$ (the actor) and updates to a $Q$-approximator $Q_\theta$ (the critic). The critic is learned by encouraging it to satisfy single-step Bellman consistencies,
$$Q_\pi(s,a) = \mathcal{B}_\pi Q_\pi(s,a) := R(s,a) + \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\left[Q_\pi(s',a')\right], \tag{6}$$
where $\mathcal{B}_\pi$ is the expected Bellman operator with respect to $\pi$. Thus, the critic is learned according to some variation on
$$\min_{Q_\theta} J_{\mathrm{critic}}(Q_\theta) := \frac{1}{2}\,\mathbb{E}_{(s,a)\sim\beta}\left[(\mathcal{B}_\pi Q_\theta - Q_\theta)(s,a)^2\right], \tag{7}$$
for some distribution $\beta$. Although the use of an arbitrary $\beta$ suggests the critic may be learned off-policy, to achieve satisfactory performance, actor-critic algorithms generally rely on augmenting a replay buffer with new on-policy experience. Theoretical work has suggested that if one desires compatible function approximation, then an appropriate $\beta$ is, in fact, the on-policy distribution $d^\pi$ (Sutton et al., 2000; Silver et al., 2014; Nachum et al., 2018).
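In this work, we focus on the off-policy setting directly. Specifically, we are given a dataset $\mathcal{D} = \{(s_k, a_k, r_k, s'_k)\}_{k=1}^{N}$, where $r_k = R(s_k, a_k)$; $s'_k \sim T(s_k, a_k)$; and $a_k$ has been sampled according to an unknown process. We let $d^{\mathcal{D}}$ denote the unknown state-action distribution, and additionally assume access to a sample of initial states, $\mathcal{U} = \{s_{0,k}\}_{k}$, such that $s_{0,k} \sim \mu_0$.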
3 AlgaeDICE via Density Regularization
We begin by presenting an informal derivation of our method, motivated as a regularization of the dual max-return objective in (3). In Section 4 we will present our results more formally as a consequence of the Lagrangian of a linear programming formulation of the max-return objective.
3.1 A Regularized Off-Policy Max-Return Objective
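The max-return objective (3) is written exclusively in terms of the on-policy distribution $d^\pi$. To introduce an off-policy distribution into the objective, we incorporate a regularizer:
$$\max_\pi J_{\mathrm{D},f}(\pi) := \mathbb{E}_{(s,a)\sim d^\pi}\left[R(s,a)\right] - \alpha\, D_f(d^\pi \| d^{\mathcal{D}}), \tag{8}$$
with $\alpha > 0$ and $D_f$ denoting the $f$-divergence induced by a convex function $f$:
$$D_f(d^\pi \| d^{\mathcal{D}}) = \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f\left(w_{\pi/\mathcal{D}}(s,a)\right)\right], \tag{9}$$
where we have used the shorthand $w_{\pi/\mathcal{D}}(s,a) := d^\pi(s,a) / d^{\mathcal{D}}(s,a)$. This form of regularization encourages conservative behavior, compelling the state-action occupancies of $\pi$ to remain close to the off-policy distribution, which can improve generalization. We emphasize that the introduction of this regularizer is to enable the subsequent derivations and not to impose a strong constraint on the optimal policy. Indeed, by appropriately choosing $\alpha$ and $f$, the strength of the regularization can be controlled. Later, we will show that many of our results also hold for exploratory regularization ($\alpha < 0$) and even for no regularization at all ($\alpha = 0$).

At first glance, the regularization in (8) seems to complicate things. Not only do we still require on-policy samples from $d^\pi$, but we also have to compute $D_f(d^\pi \| d^{\mathcal{D}})$, which in general can be difficult. To make this objective more approachable, we transform the $f$-divergence to its variational form (Nguyen et al., 2010) by use of a dual function $x : S \times A \to \mathbb{R}$:
$$\max_\pi \min_{x} J_{\mathrm{D},f}(\pi, x) := \mathbb{E}_{(s,a)\sim d^\pi}\left[R(s,a) - \alpha\, x(s,a)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f_*(x(s,a))\right], \tag{10}$$
where $f_*$ is the convex (or Fenchel) conjugate of $f$. With the objective in (10), we are finally ready to eliminate the expectation over on-policy samples from $d^\pi$. To do so, we make a change of variables, inspired by DualDICE (Nachum et al., 2019b). Define $\nu : S \times A \to \mathbb{R}$ as the fixed point of a variant of the Bellman equation,
$$\nu(s,a) := -\alpha\, x(s,a) + \mathcal{B}_\pi \nu(s,a). \tag{11}$$
Equivalently, $x(s,a) = (\mathcal{B}_\pi\nu - \nu)(s,a)/\alpha$. Substituting into (10), the on-policy expectation of $R - \alpha x = \nu - \gamma\,\mathbb{E}_{s',a'}[\nu(s',a')]$ telescopes to an expectation over initial states, yielding
$$J_{\mathrm{D},f}(\pi, \nu) = (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f_*\left((\mathcal{B}_\pi\nu - \nu)(s,a)/\alpha\right)\right]. \tag{12}$$
The resulting objective is now completely off-policy, relying only on access to samples from the initial state distribution $\mu_0$ and the off-policy dataset $\mathcal{D}$. Thus, we have our first theorem, providing an off-policy formulation of the max-return objective:

Theorem 1 (Primal AlgaeDICE). Under mild conditions on $d^{\mathcal{D}}$, $\alpha$, and $f$, the regularized max-return objective may be expressed as a max-min optimization:
$$\max_\pi \min_{\nu : S\times A\to\mathbb{R}} J_{\mathrm{D},f}(\pi,\nu) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f_*\left((\mathcal{B}_\pi\nu - \nu)(s,a)/\alpha\right)\right]. \tag{13}$$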
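Remark (extension to $\alpha < 0$):
It is clear that the same derivations above may apply to an exploratory regularizer of the same form with $\alpha < 0$, which leads to the following optimization problem:
$$\max_\pi \max_{\nu : S\times A\to\mathbb{R}} J_{\mathrm{D},f}(\pi,\nu) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f_*\left((\mathcal{B}_\pi\nu - \nu)(s,a)/\alpha\right)\right]. \tag{14}$$
Remark (Fenchel AlgaeDICE):
The appearance of $\mathcal{B}_\pi$ inside $f_*$ in the second term of (12) presents a challenge in practice, since $\mathcal{B}_\pi$ involves an expectation over the transition function $T$, whereas one typically only has access to a single empirical sample from $T$ for a given state-action pair. This challenge, known as double sampling in the RL literature (Baird, 1995), can prevent the algorithm from finding the desired value function, even with infinite data. There are several alternatives to handle this issue (e.g., Antos et al., 2008; Farahmand et al., 2016; Feng et al., 2019). Here, we apply the dual embedding technique (Dai et al., 2016, 2018b). Specifically, the dual representation of $f_*$,
$$f_*(z) = \max_{\zeta}\ z\cdot\zeta - f(\zeta),$$
can be substituted into (12), to result in a max-min-max problem:
$$\max_\pi \min_{\nu} \max_{\zeta : S\times A\to\mathbb{R}} J_{\mathrm{D},f}(\pi,\nu,\zeta) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[(\mathcal{B}_\pi\nu - \nu)(s,a)\cdot\zeta(s,a) - \alpha f(\zeta(s,a))\right]. \tag{15}$$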
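The substitution above rests on the identity $\alpha f_*(z/\alpha) = \max_\zeta\, z\zeta - \alpha f(\zeta)$. As a small numerical sanity check, the sketch below assumes the quadratic choice $f(x) = \frac{1}{2}x^2$ (whose conjugate is also $\frac{1}{2}x^2$) and compares the closed form against a brute-force grid search over $\zeta$; the values of `z` and `alpha` and the grid bounds are arbitrary illustrative choices.

```python
def f(x):
    """Quadratic convex function assumed for this illustration."""
    return 0.5 * x * x


def f_conj(y):
    """Fenchel conjugate of f(x) = x^2 / 2, which is again y^2 / 2."""
    return 0.5 * y * y


def inner_objective(z, zeta, alpha):
    """The quantity maximized over zeta in the dual representation."""
    return z * zeta - alpha * f(zeta)


def grid_max(z, alpha, lo=-10.0, hi=10.0, n=200001):
    """Brute-force maximum of inner_objective over an evenly spaced grid."""
    step = (hi - lo) / (n - 1)
    return max(inner_objective(z, lo + i * step, alpha) for i in range(n))


z, alpha = 0.7, 0.5            # arbitrary Bellman residual and regularization weight
closed_form = alpha * f_conj(z / alpha)
numeric = grid_max(z, alpha)
print(closed_form, numeric)    # the two values agree up to the grid resolution
```

For this quadratic $f$ the maximizer is $\zeta = z/\alpha$, so the inner maximization has a closed form; for other choices of $f$ it generally has to be learned.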
3.2 Consistent Policy Gradient using Off-Policy Data
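The equivalence between the objective in (12) and the on-policy max-return objective can be highlighted by considering the gradient of this objective with respect to $\pi$. First, consider the optimal $x^*_\pi$ for (10). By taking the gradient of $J_{\mathrm{D},f}$ with respect to $x$ and setting this to 0, one finds that $x^*_\pi$ satisfies
$$f_*'\left(x^*_\pi(s,a)\right) = w_{\pi/\mathcal{D}}(s,a). \tag{16}$$
Accordingly, for any $\pi$, the optimal $\nu^*_\pi$ for (12) satisfies
$$f_*'\left((\mathcal{B}_\pi\nu^*_\pi - \nu^*_\pi)(s,a)/\alpha\right) = w_{\pi/\mathcal{D}}(s,a). \tag{17}$$
Thus, we may express the gradient of $J_{\mathrm{D},f}(\pi, \nu^*_\pi)$ with respect to $\pi$ as
$$\frac{\partial}{\partial\pi} J_{\mathrm{D},f}(\pi,\nu^*_\pi) = (1-\gamma)\,\frac{\partial}{\partial\pi}\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu^*_\pi(s_0,a_0)\right] + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[w_{\pi/\mathcal{D}}(s,a)\,\frac{\partial}{\partial\pi}\mathcal{B}_\pi\nu^*_\pi(s,a)\right] = \mathbb{E}_{(s,a)\sim d^\pi}\left[\nu^*_\pi(s,a)\,\nabla\log\pi(a|s)\right], \tag{18}$$
where we have used Danskin’s theorem (Bertsekas, 1999) to ignore gradients of $\pi$ through $\nu^*_\pi$. Hence, if the dual function is optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, with $Q$-value function given by $\nu^*_\pi$.

To characterize this $Q$-value function, note that from (11), $\nu^*_\pi$ is a $Q$-value function with respect to augmented reward $\tilde{r}(s,a) := R(s,a) - \alpha\, x^*_\pi(s,a)$. Recalling the expression for $x^*_\pi$ in (16) and the fact that the derivatives $f'$ and $f_*'$ are inverses of each other, we have $\tilde{r}(s,a) = R(s,a) - \alpha f'\left(w_{\pi/\mathcal{D}}(s,a)\right)$. This derivation leads to our second theorem:

Theorem 2. If the dual function is optimized, the gradient of the off-policy objective with respect to $\pi$ is the regularized on-policy policy gradient:
$$\frac{\partial}{\partial\pi}\min_\nu J_{\mathrm{D},f}(\pi,\nu) = \mathbb{E}_{(s,a)\sim d^\pi}\left[\tilde{Q}_\pi(s,a)\,\nabla\log\pi(a|s)\right], \tag{19}$$
where $\tilde{Q}_\pi$ is the $Q$-value function of $\pi$ with respect to rewards $\tilde{r}(s,a) := R(s,a) - \alpha f'\left(w_{\pi/\mathcal{D}}(s,a)\right)$.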
3.3 Connection to Actor-Critic
The relationship between the proposed off-policy objective and the classic policy gradient becomes more profound when we consider the form of the objective under specific choices of convex function $f$. If we take $f(x) = \frac{1}{2}x^2$, then $f_*(x) = \frac{1}{2}x^2$ and the proposed objective is reminiscent of actor-critic:
$$\max_\pi \min_\nu J_{\mathrm{D},f}(\pi,\nu) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \frac{1}{2\alpha}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[(\mathcal{B}_\pi\nu - \nu)(s,a)^2\right]. \tag{20}$$
The second term alone is an instantiation of the off-policy critic objective in actor-critic. However, in actor-critic, the use of an off-policy objective for the critic is difficult to theoretically motivate. Moreover, in practice, critic and actor learning can both suffer from the mismatch between the off-policy distribution $d^{\mathcal{D}}$ and the on-policy $d^\pi$. By contrast, our derivations show that the introduction of the first term to the objective transforms the off-policy actor-critic algorithm to an on-policy actor-critic, without any explicit use of importance weights. Moreover, while standard actor-critic has two separate objectives for value and policy, our proposed objective is a single, unified objective. Both the policy and value functions are trained with respect to the same off-policy objective.
4 A Lagrangian View of AlgaeDICE
We now show how AlgaeDICE can be alternatively derived from the Lagrangian of a linear programming (LP) formulation of the $Q_\pi$-function. Please refer to the appendix for details. We begin by introducing some notation and assumptions, which have appeared in the literature (e.g., Puterman, 1994; Nachum et al., 2019b; Zhang et al., 2020).

Assumption 1 (Bounded rewards). The rewards of the MDP are bounded by some finite constant $R_{\max}$: $\|R\|_\infty \le R_{\max}$.

For the next assumption, we introduce the transpose Bellman operator:
$$\mathcal{B}_\pi^\top \rho(s,a) := (1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\pi(a|s)\sum_{\tilde{s},\tilde{a}} T(s\,|\,\tilde{s},\tilde{a})\,\rho(\tilde{s},\tilde{a}).$$
Assumption 2 (MDP regularity). The transposed Bellman operator $\mathcal{B}_\pi^\top$ has a unique fixed point solution.²

Assumption 3 (Bounded ratio). The target density ratio $w_{\pi/\mathcal{D}}$ is bounded by some finite constant $C$: $\|w_{\pi/\mathcal{D}}\|_\infty \le C$.

Assumption 4 (Characterization of $f$). The function $f$ is convex with domain $\mathbb{R}$ and continuous derivative $f'$. The convex (Fenchel) conjugate of $f$ is $f_*$, and $f_*$ is closed and strictly convex. The derivative $f_*'$ is continuous and its range is a superset of $[0, C]$. For convenience, we define where . will serve as the range for .

² When $\gamma \in [0,1)$, $\mathcal{B}_\pi^\top$ has a unique fixed point regardless of the underlying MDP. For $\gamma = 1$, in the discrete case, the assumption reduces to requiring that the Markov chain induced by $\pi$ be ergodic. The continuous case for $\gamma = 1$ is more involved; see Meyn and Tweedie (2012); Levin and Peres (2017) for a detailed discussion.
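Our derivation begins with a formalization of the LP characterization of the $Q_\pi$-function and its dual form:

Theorem 3. Given a policy $\pi$, the average return of $\pi$ may be expressed in the primal and dual forms as
$$\min_{\nu}\ J_{\mathrm{P}}(\pi,\nu) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] \ \ \text{s.t.}\ \ \nu(s,a) \ge \mathcal{B}_\pi\nu(s,a)\ \ \forall (s,a), \qquad\text{and}\qquad \max_{\rho\ge 0}\ J_{\mathrm{D}}(\pi,\rho) := \mathbb{E}_{(s,a)\sim\rho}\left[R(s,a)\right] \ \ \text{s.t.}\ \ \rho(s,a) = \mathcal{B}_\pi^\top\rho(s,a)\ \ \forall (s,a), \tag{21}$$
where $\mathcal{B}_\pi^\top$ is the transpose Bellman operator.

Consider the Lagrangian of $J_{\mathrm{P}}(\pi,\nu)$, which would typically be expressed with a sum (or integral) of constraints weighted by $\rho(s,a)$. We can reparametrize the dual variable as $\rho(s,a) = d^{\mathcal{D}}(s,a)\,\zeta(s,a)$ to express the Lagrangian as,
$$\min_\nu \max_{\zeta\ge 0} J(\pi,\nu,\zeta) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[(\mathcal{B}_\pi\nu - \nu)(s,a)\cdot\zeta(s,a)\right]. \tag{22}$$
The optimal $\zeta^*$ of this Lagrangian is $w_{\pi/\mathcal{D}}$, and this optimal solution is not affected by expanding the allowable range of $\zeta$ to all of $\mathbb{R}$. However, in practice, the linear structure in (22) can induce numerical instability. Therefore, inspired by the augmented Lagrangian method, we introduce regularization. By adding a special regularizer $-\alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}[f(\zeta(s,a))]$ using convex $f$, we obtain
$$\min_\nu \max_{\zeta} J_{\mathrm{D},f}(\pi,\nu,\zeta) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[(\mathcal{B}_\pi\nu - \nu)(s,a)\cdot\zeta(s,a) - \alpha f(\zeta(s,a))\right]. \tag{23}$$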
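We characterize the optimizers $\nu^*_\pi$ and $\zeta^*_\pi$ and the optimum value of this objective in the following theorem. Interestingly, although the regularization can affect the optimal primal solution $\nu^*_\pi$, the optimal dual solution $\zeta^*_\pi$ is unchanged.

Theorem 4. Under the assumptions stated above, the solution to (23) is given by,
$$\nu^*_\pi(s,a) = -\alpha f'\left(w_{\pi/\mathcal{D}}(s,a)\right) + \mathcal{B}_\pi\nu^*_\pi(s,a), \qquad \zeta^*_\pi(s,a) = w_{\pi/\mathcal{D}}(s,a). \tag{24}$$
The optimal value is $\mathbb{E}_{(s,a)\sim d^\pi}[R(s,a)] - \alpha D_f(d^\pi\|d^{\mathcal{D}})$. Thus, we have recovered the Fenchel AlgaeDICE objective for $\pi$, given in Equation (15). Furthermore, one may reverse the Legendre transform, $\alpha f_*(z/\alpha) = \max_\zeta\, z\cdot\zeta - \alpha f(\zeta)$, to recover the Primal AlgaeDICE objective in Equation (13).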
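The role of the density ratio as the optimal dual variable can be illustrated numerically: when $\zeta$ is set to the ratio $d^\pi/d^{\mathcal{D}}$, the regularized Lagrangian no longer depends on $\nu$ and equals the regularized return. The sketch below checks this on a hypothetical 2-state, 2-action MDP with $f(x)=\frac{1}{2}x^2$ and a uniform data distribution; all numbers, and the helper names, are assumptions made for the illustration.

```python
import random

# Toy setup; every number here is invented for illustration.
GAMMA, ALPHA = 0.9, 0.5
N_S, N_A = 2, 2
PAIRS = [(s, a) for s in range(N_S) for a in range(N_A)]
MU0 = [1.0, 0.0]
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}
T = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}      # T[(s, a)][s_next]
PI = [[0.3, 0.7], [0.6, 0.4]]                     # PI[s][a] = pi(a|s)
D_DATA = {p: 0.25 for p in PAIRS}                 # assumed uniform off-policy distribution


def f(x):
    return 0.5 * x * x


def occupancy(n_iters=2000):
    """Normalized discounted occupancy of PI, by iterating its flow recursion."""
    d = {p: 0.0 for p in PAIRS}
    for _ in range(n_iters):
        inflow = [sum(T[p][s2] * d[p] for p in PAIRS) for s2 in range(N_S)]
        d = {(s, a): PI[s][a] * ((1 - GAMMA) * MU0[s] + GAMMA * inflow[s])
             for (s, a) in PAIRS}
    return d


def bellman(nu, s, a):
    """Expected Bellman backup of nu under PI at (s, a)."""
    return R[(s, a)] + GAMMA * sum(T[(s, a)][s2] * PI[s2][a2] * nu[(s2, a2)]
                                   for (s2, a2) in PAIRS)


def lagrangian(nu, zeta):
    """Regularized Lagrangian with exact expectations (tabular case)."""
    first = (1 - GAMMA) * sum(MU0[s] * PI[s][a] * nu[(s, a)] for (s, a) in PAIRS)
    second = sum(D_DATA[p] * ((bellman(nu, *p) - nu[p]) * zeta[p]
                              - ALPHA * f(zeta[p])) for p in PAIRS)
    return first + second


d_pi = occupancy()
w = {p: d_pi[p] / D_DATA[p] for p in PAIRS}       # density ratio d^pi / d^D
target = (sum(d_pi[p] * R[p] for p in PAIRS)
          - ALPHA * sum(D_DATA[p] * f(w[p]) for p in PAIRS))

rng = random.Random(0)
values = [lagrangian({p: rng.uniform(-5.0, 5.0) for p in PAIRS}, w)
          for _ in range(3)]
print(target, values)  # each entry of `values` matches `target`
```

The check uses exact expectations, so it says nothing about the sampling issues that arise when only a finite dataset is available.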
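The derivation of this same result from the LP perspective allows us to exploit strong duality. Specifically, under the assumption that $R$ and $f$ are bounded, $\nu^*_\pi$ does not change if we optimize $\nu$ over a bounded space $\mathcal{H}$, as long as $\nu^*_\pi \in \mathcal{H}$. In this case, strong duality holds (Ekeland and Temam, 1999, Proposition 2.1), and we obtain
$$\min_{\nu\in\mathcal{H}} \max_{\zeta} J_{\mathrm{D},f}(\pi,\nu,\zeta) = \max_{\zeta} \min_{\nu\in\mathcal{H}} J_{\mathrm{D},f}(\pi,\nu,\zeta). \tag{25}$$
This implies that, for computational efficiency, we can optimize the policy via
$$\max_{\pi,\,\zeta}\ \min_{\nu\in\mathcal{H}}\ J_{\mathrm{D},f}(\pi,\nu,\zeta). \tag{26}$$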
Remark (extensions to $\gamma = 1$ or $\alpha = 0$):
Although AlgaeDICE is originally derived for $\gamma \in [0,1)$ and $\alpha > 0$ in Section 3, the Lagrangian view of the LP formulation of $Q_\pi$ can be used to generalize the algorithm to $\gamma = 1$ and $\alpha = 0$. In particular, for $\alpha = 0$, one can directly use the original Lagrangian for the LP. For the case $\gamma = 1$, the problem reduces to the Lagrangian of the LP for an undiscounted $Q_\pi$-function; details are deferred to the appendix.
Remark (off-policy evaluation):
The LP form of the $Q_\pi$-values leading to the Lagrangian (22) can be directly used for behavior-agnostic off-policy evaluation (OPE). In fact, existing estimators for OPE in the behavior-agnostic setting, which typically reduce the OPE problem to estimation of the quantities $w_{\pi/\mathcal{D}}$ (e.g., DualDICE (Nachum et al., 2019b) and GenDICE (Zhang et al., 2020)), can be recast as special cases by introducing different regularizations to the Lagrangian. As we have shown, the solution to the Lagrangian provides both (regularized) $Q$-values and the desired state-action corrections as primal and dual variables simultaneously.
5 Related Work
Algorithmically, our proposed method follows a Lagrangian primal-dual view of the LP characterization of the $Q$-function, which leads to a saddle-point problem. Several recent works (e.g., Chen and Wang, 2016; Wang, 2017; Dai et al., 2018a, b; Chen et al., 2018; Lee and He, 2018) also considered saddle-point formulations for policy improvement, derived from fundamentally different perspectives. In particular, Dai et al. (2018a) exploit a saddle-point formulation for the multi-step (path) conditions on the consistency between optimal value function and policy. Other works (Chen and Wang, 2016; Wang, 2017; Dai et al., 2018b; Chen et al., 2018) consider the (augmented) Lagrangian of the LP characterization of Bellman optimality for the optimal $V$-function, which is slightly different from the LP characterization with respect to the $Q$-function we consider. Although slight, the difference between the $V$- and $Q$-LPs is crucial to enable behavior-agnostic policy optimization in AlgaeDICE. If one were to follow derivations similar to AlgaeDICE but for the $V$-function LP, some form of explicit importance weighting (and thus knowledge of the behavior policy) would be required, as in recent work on off-policy estimation (Tang et al., 2019; Uehara and Jiang, 2019). We further note that the application of a regularizer on the dual variable to yield Primal AlgaeDICE is key to transforming the Lagrangian optimization over values and state-action occupancies (typical in these previous works) to an optimization over values and policies, which is more common in practice and can help generalization (e.g., Swaminathan and Joachims, 2015).
The regularization we employ is inspired by previous uses of regularization in RL. Adding regularization to MDPs (Neu et al., 2017; Geist et al., 2019) has been investigated for many different purposes in the literature, including exploration (de Farias and Van Roy, 2000; Haarnoja et al., 2017, 2018), smoothing (Dai et al., 2018b), avoiding premature convergence (Nachum et al., 2017a), ensuring tractability (Todorov, 2006), and mitigating observation noise (Rubin et al., 2012; Fox et al., 2016). We note that the regularization employed by AlgaeDICE as a divergence over state-action densities is markedly different from these previous works, which mostly regularize only the action distributions of a policy conditioned on state. An approach more similar to ours is given by Belousov and Peters (2017), which regularizes the max-return objective using an $f$-divergence over state-action densities. Their derivations are similar in spirit to ours, using the method of Lagrange multipliers, but their result is distinct in a number of key characteristics. First, their objective (analogous to ours in (12)) includes not only policy and values but also a number of additional functions, complicating any practical implementation. Second, their results are restricted to conservative regularization ($\alpha > 0$), whereas our findings extend to both exploratory regularization and unregularized objectives ($\alpha \le 0$). Third, the algorithm proposed by Belousov and Peters (2017) follows a bi-level optimization, in which the policy is learned using a separate and distinct objective. In contrast, our proposed AlgaeDICE uses a single, unified objective for both policy and value learning.
Lastly, there are a number of works which (like ours) perform policy gradient on off-policy data via distribution correction. The key differentiator is in how the distribution corrections are computed. One common method is to re-weight off-policy samples by considering eligibility traces (Precup et al., 2000; Geist and Scherrer, 2014), i.e., compute weights by taking the product of per-action importance weights over a trajectory. Thus, these methods can suffer from high variance as the length of trajectory increases, known as the “curse of horizon” (Liu et al., 2018). A more recent work (Liu et al., 2019) attempts to weight updates by estimated state-action distribution corrections. This is more in line with our proposed AlgaeDICE, which implicitly estimates these quantities. One key difference is that this previous work explicitly estimates these corrections, which results in a bi-level optimization, as opposed to our more appealing unified objective. It is also important to note that both eligibility trace methods and the technique outlined in Liu et al. (2019) require knowledge of the behavior policy. In contrast, AlgaeDICE is a behavior-agnostic off-policy policy gradient method, which may be more relevant in practice. Compared to existing behavior-agnostic off-policy estimators (Nachum et al., 2019b; Zhang et al., 2020), this work considers the substantially more challenging problem of policy optimization.
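The variance growth behind the “curse of horizon” can be made concrete with a small calculation. The sketch below makes a strong simplifying assumption that is not in the text: the target and behavior policies use the same fixed action distributions at every step, and actions are sampled independently, so the trajectory weight is a product of i.i.d. per-step ratios. The probabilities used are arbitrary.

```python
def second_moment_per_step(target_probs, behavior_probs):
    """E_{a ~ behavior}[(target(a) / behavior(a))^2] for one decision step."""
    return sum(t * t / b for t, b in zip(target_probs, behavior_probs))


def trajectory_weight_variance(horizon, target_probs, behavior_probs):
    """Variance of the product of `horizon` i.i.d. per-step importance ratios.

    Each ratio has mean 1 under the behavior policy, so the product has mean 1
    and second moment equal to the per-step second moment raised to `horizon`.
    """
    return second_moment_per_step(target_probs, behavior_probs) ** horizon - 1.0


target = [0.9, 0.1]      # hypothetical target policy over two actions
behavior = [0.5, 0.5]    # hypothetical uniform behavior policy
for horizon in (1, 5, 10, 20):
    print(horizon, trajectory_weight_variance(horizon, target, behavior))
```

Under these assumptions the variance grows geometrically in the horizon, which is the behavior that state-action distribution corrections are meant to avoid.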
6 Experiments
We present empirical evaluations of AlgaeDICE, first in a tabular setting using the Four Rooms domain (Sutton et al., 1999) and then on a suite of continuous control benchmarks using MuJoCo (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016).
6.1 Four Rooms
We begin by considering the tabular setting given by the Four Rooms environment (Sutton et al., 1999), in which an agent must navigate to a target location within a gridworld. In this tabular setting, we evaluate Primal AlgaeDICE (Equations 12 and 13) with $f(x) = \frac{1}{2}x^2$. This way, for any $\pi$, the dual value function $\nu$ may be solved exactly using standard matrix operations. Thus, we train by iteratively solving for $\nu$ via matrix operations and then taking a gradient step for $\pi$. We collect an off-policy dataset by running a uniformly random policy for 500 trajectories, where each trajectory is initialized at a random state and is of length 10. This dataset is kept fixed, placing us in the completely offline regime. We use and .
Graphical depictions of learned policies $\pi$ and dual value functions $\nu$ are presented in Figure 1, where each plot shows $\pi$ and $\nu$ during the first, fourth, seventh, and tenth iterations of training. The opacity of each square is determined by the Bellman residuals of $\nu$ at that state. Recall that the Bellman residuals of the optimal $\nu^*_\pi$ are the density ratios $w_{\pi/\mathcal{D}}$. We see that this is reflected in the learned $\nu$. At the beginning of training, the residuals are high around the initial state. As training progresses, there is a clear path (or paths) of high-residual states going from initial to target state. Thus we see that $\nu$ learns to properly correct for distribution shifts in the off-policy experience distributions. The algorithm successfully learns to optimize a policy using these corrected gradients, as shown by the arrows denoting preferred actions of the learned policy.
We further provide quantitative results in Figure 2. We plot the average per-step reward of AlgaeDICE compared to actor-critic in both online and offline settings. As a point of comparison, the behavior policy used to collect data for the offline setting achieves average reward of . Although all the variants are able to significantly improve upon this baseline, we see that AlgaeDICE performance is only negligibly affected by the type of dataset, while performance of actor-critic degrades in the offline regime. See app:details for experimental details.
6.2 Continuous Control
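We now present results of AlgaeDICE on a set of continuous control benchmarks using MuJoCo (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016). We evaluate the performance of Primal AlgaeDICE with $f(x) = \frac{1}{2}x^2$. Our empirical objective is thus given by
$$\max_\pi \min_\nu J(\pi,\nu) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mathcal{U},\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \frac{1}{2\alpha}\,\mathbb{E}_{(s,a,r,s')\sim\mathcal{D},\,a'\sim\pi(s')}\left[\delta_{\nu,\pi}(s,a,r,s',a')^2\right], \tag{27}$$
where $\delta_{\nu,\pi}$ is a single-sample estimate of the Bellman residual:
$$\delta_{\nu,\pi}(s,a,r,s',a') := r + \gamma\,\nu(s',a') - \nu(s,a). \tag{28}$$
We note that using a single-sample estimate for the Bellman residual in general leads to biased gradients, although previous works have found this to not have a significant practical effect in these domains (Kostrikov et al., 2019). We make the following additional practical modifications:
If $\nu$ is fully optimized, the scaled residual $\delta_{\nu,\pi}/\alpha$ will be the density ratio $w_{\pi/\mathcal{D}}$, and thus always non-negative. However, during optimization, this may not always hold, which can affect policy learning. Thus, when calculating gradients of this objective with respect to $\pi$, we clip the value of $\delta_{\nu,\pi}$ from below at 0.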
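As a rough illustration of how such an objective might be evaluated on a batch, the sketch below assumes the quadratic choice $f(x)=\frac{1}{2}x^2$, so that the second term is a mean squared single-sample Bellman residual scaled by $1/(2\alpha)$. The function names, the tuple layout of the batch, and the numbers are hypothetical; the values of $\nu$ are passed in as plain floats rather than produced by a network, and no gradients are computed.

```python
def bellman_residuals(batch, gamma):
    """Single-sample residuals r + gamma * nu(s', a') - nu(s, a).

    `batch` holds tuples (nu_sa, reward, nu_next), where nu_next is nu(s', a')
    for an action a' sampled from the current policy at s'.
    """
    return [r + gamma * nu_next - nu_sa for nu_sa, r, nu_next in batch]


def algaedice_objective(init_nu, batch, alpha, gamma):
    """Batch estimate of the quadratic (f(x) = x^2 / 2) off-policy objective.

    `init_nu` holds nu(s0, a0) for sampled initial states and policy actions.
    """
    residuals = bellman_residuals(batch, gamma)
    initial_term = (1.0 - gamma) * sum(init_nu) / len(init_nu)
    residual_term = sum(d * d for d in residuals) / (2.0 * alpha * len(residuals))
    return initial_term + residual_term


def clipped_residuals(batch, gamma):
    """Residuals clipped from below at 0, as used for the policy update."""
    return [max(d, 0.0) for d in bellman_residuals(batch, gamma)]


# Hypothetical batch: two initial-state values and two transitions.
init_nu = [1.0, 3.0]
batch = [(1.0, 0.5, 2.0), (2.0, 0.0, 1.0)]
print(algaedice_objective(init_nu, batch, alpha=0.5, gamma=0.5))
print(clipped_residuals(batch, gamma=0.5))
```

In an actual implementation the critic would minimize this quantity and the policy would maximize it, with the clipped residuals used only on the policy side.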
For training we parameterize and
We present our results in Figure 3. We see that AlgaeDICE can perform well in these settings, achieving performance that is roughly competitive with the state-of-the-art SAC and TD3 algorithms. Further improvements to these practical results may be possible by choosing $f$ (or $f_*$) appropriately. In the appendix, we conduct a preliminary investigation into polynomial $f$, showing that certain polynomials can at times provide better performance than $f(x) = \frac{1}{2}x^2$. A more detailed and systematic study of this and other design choices for implementing AlgaeDICE is an interesting avenue for future work.
Figure 3: Results on the continuous control benchmarks, plotting the performance of 10 randomly seeded training runs, with the shaded region representing half a standard deviation and the x-axis given by environment steps. There are potentially better results achievable by using a choice of $f$ other than $f(x) = \frac{1}{2}x^2$ for AlgaeDICE; see the appendix for a preliminary investigation.
We have introduced an ALgorithm for policy Gradient from Arbitrary Experience via DICE, or AlgaeDICE, for behavior-agnostic, off-policy policy improvement in reinforcement learning. Based on a linear programming characterization of the $Q$-function, we derived the new approach from a Lagrangian saddle-point formulation. The resulting algorithm, AlgaeDICE, automatically compensates for the distribution shift in collected off-policy data, and achieves an estimate of the on-policy policy gradient using this off-policy data.
We thank Marc Bellemare, Nicolas Le Roux, George Tucker, Rishabh Agarwal, Dibya Ghosh, and the rest of the Google Brain team for insightful thoughts and discussions.
- Andrychowicz et al. (2018) Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
- Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
- Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
- Belousov and Peters (2017) Boris Belousov and Jan Peters. f-divergence constrained policy improvement. arXiv preprint arXiv:1801.00056, 2017.
- Bertsekas (1999) D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
- Chen and Wang (2016) Yichen Chen and Mengdi Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
- Chen et al. (2018) Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.
- Dai et al. (2016) Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. CoRR, abs/1607.04579, 2016.
- Dai et al. (2018a) Bo Dai, Albert Shaw, Niao He, Lihong Li, and Le Song. Boosting the actor with dual critic. ICLR, 2018a. arXiv:1712.10282.
- Dai et al. (2018b) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the Thirty-Fifth International Conference on Machine Learning (ICML), pages 1133–1142, 2018b.
- de Farias and Van Roy (2000) Daniela Pucci de Farias and Benjamin Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608, 2000.
- Degris et al. (2012) Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
- Ekeland and Temam (1999) Ivar Ekeland and Roger Temam. Convex analysis and variational problems, volume 28. Siam, 1999.
- Farahmand et al. (2016) Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17(130):1–66, 2016.
- Feng et al. (2019) Yihao Feng, Lihong Li, and Qiang Liu. A kernel loss for solving the Bellman equation. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
- Fox et al. (2016) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In UAI, 2016.
- Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
- Gao et al. (2019) Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to Conversational AI. Foundations and Trends in Information Retrieval, 13(2–3):127–298, 2019.
- Geist and Scherrer (2014) M. Geist and B. Scherrer. Off-policy learning with eligibility traces: A survey. The Journal of Machine Learning Research, 15(1):289–333, 2014.
- Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.
- Gruslys et al. (2017) A. Gruslys, M. Azar, M. Bellemare, and R. Munos. The reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
- Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
- Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
- Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Konda and Tsitsiklis (2000) Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1008–1014, 2000.
- Kostrikov et al. (2019) Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. 2019.
- Lee and He (2018) Donghwan Lee and Niao He. Stochastic primal-dual Q-learning. arXiv preprint arXiv:1810.08298, 2018.
- Levin and Peres (2017) David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
- Liu et al. (2019) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
- Lu et al. (2018) Tyler Lu, Dale Schuurmans, and Craig Boutilier. Non-delusional Q-learning and value-iteration. In Advances in Neural Information Processing Systems, pages 9949–9959, 2018.
- Meyn and Tweedie (2012) Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
- Nachum et al. (2017a) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017a.
- Nachum et al. (2017b) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017b.
- Nachum et al. (2018) Ofir Nachum, Mohammad Norouzi, George Tucker, and Dale Schuurmans. Smoothed action value functions for learning Gaussian policies. arXiv preprint arXiv:1803.02348, 2018.
- Nachum et al. (2019a) Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. arXiv preprint arXiv:1908.05224, 2019a.
- Nachum et al. (2019b) Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019b.
- Neu et al. (2017) Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
- Nguyen et al. (2010) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
- Precup et al. (2000) D. Precup, R. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
- Puterman (1994) Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
- Rubin et al. (2012) Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in MDPs. Decision Making with Imperfect Decision Makers, pages 57–74, 2012.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 387–395, 2014.
- Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
- Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(1):1731–1755, 2015.
- Tang et al. (2019) Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186, 2019.
- Todorov (2006) Emanuel Todorov. Linearly-solvable Markov decision problems. In NIPS, pages 1369–1376, 2006.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012.
- Uehara and Jiang (2019) Masatoshi Uehara and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.
- Wang (2017) Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv e-prints, 2017.
- Wang et al. (2008) Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Dual representations for dynamic programming. 2008.
- Wang et al. (2016) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
- Zhang et al. (2020) Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values, 2020. Preprint.
Appendix A Proof Details
We follow the notation of the main text. Abusing notation slightly, we will use and interchangeably.
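Given a policy $\pi$, the average return of $\pi$ may be expressed in the primal and dual forms as
$$J_{\mathrm{P}}(\pi) := \min_{Q}\; (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] \quad \text{s.t.} \quad Q(s,a) \ge \mathcal{B}_\pi Q(s,a) \;\; \forall (s,a),$$
and
$$J_{\mathrm{D}}(\pi) := \max_{d\ge 0}\; \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad d(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\pi(a|s)\sum_{s',a'} T(s|s',a')\, d(s',a') \;\; \forall (s,a),$$
respectively. Under Assumptions 4 and 19, strong duality holds, i.e., $J_{\mathrm{P}}(\pi) = J_{\mathrm{D}}(\pi)$ for optimal solutions $Q^*_\pi, d^*_\pi$. The optimal primal satisfies $Q^*_\pi(s,a) = Q_\pi(s,a)$ for all $(s,a)$ reachable by $\pi$, and the optimal dual is $d^*_\pi = d^\pi$.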
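Recall that $\mathcal{B}_\pi$ is monotonic; that is, given two bounded functions $Q_1$ and $Q_2$, $Q_1 \ge Q_2$ implies $\mathcal{B}_\pi Q_1 \ge \mathcal{B}_\pi Q_2$. Therefore, for any feasible $Q$, we have $Q \ge \mathcal{B}_\pi Q \ge \mathcal{B}_\pi^2 Q \ge \cdots \ge \mathcal{B}_\pi^\infty Q = Q_\pi$, proving the first claim.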
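The monotonicity argument can be checked numerically on a small tabular problem. The sketch below is illustrative only: the random test problem, the names `make_policy_chain`, `bellman`, and `monotone_iterates`, and the use of NumPy are assumptions introduced here, not part of the paper. It assumes the policy Bellman operator has the standard form (reward plus discounted expected next value), starts from a point that satisfies the feasibility constraint elementwise, and confirms that repeated application of the operator yields a non-increasing sequence converging to the fixed point.

```python
import numpy as np


def make_policy_chain(n_sa=6, seed=0):
    """Hypothetical test problem: rewards and a row-stochastic matrix over state-action pairs."""
    rng = np.random.default_rng(seed)
    P_pi = rng.random((n_sa, n_sa))
    P_pi /= P_pi.sum(axis=1, keepdims=True)  # each row is a distribution over next (s', a')
    r = rng.random(n_sa)
    return P_pi, r


def bellman(Q, P_pi, r, gamma):
    """Assumed policy Bellman operator: (B Q)(s, a) = r(s, a) + gamma * E[Q(s', a')]."""
    return r + gamma * P_pi @ Q


def monotone_iterates(gamma=0.9, n_iter=300, seed=0):
    P_pi, r = make_policy_chain(seed=seed)
    n = len(r)
    A = np.eye(n) - gamma * P_pi
    Q_pi = np.linalg.solve(A, r)  # fixed point of the operator

    # Feasible start: by construction Q - B Q equals the nonnegative slack vector.
    slack = np.random.default_rng(seed + 1).random(n)
    Q = Q_pi + np.linalg.solve(A, slack)
    assert np.all(Q - bellman(Q, P_pi, r, gamma) >= -1e-12)

    non_increasing = True
    for _ in range(n_iter):
        Q_next = bellman(Q, P_pi, r, gamma)
        # Monotonicity implies each iterate is dominated by the previous one.
        non_increasing = non_increasing and bool(np.all(Q_next <= Q + 1e-12))
        Q = Q_next
    return Q, Q_pi, non_increasing
```

Calling `monotone_iterates()` returns the final iterate, the fixed point, and a flag recording whether every step was elementwise non-increasing; under the stated assumptions the final iterate should agree with the fixed point up to numerical tolerance.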
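The dual of the linear program (4) can be obtained as
$$\begin{aligned}
&\min_{Q}\,\max_{d\ge 0}\; (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] + \sum_{s,a} d(s,a)\left(\mathcal{B}_\pi Q(s,a) - Q(s,a)\right) \\
&\quad= \max_{d\ge 0}\,\min_{Q}\; \sum_{s,a} d(s,a)\, r(s,a) + \sum_{s,a} Q(s,a)\Big((1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\pi(a|s)\sum_{s',a'} T(s|s',a')\, d(s',a') - d(s,a)\Big) \\
&\quad= \max_{d\ge 0}\; \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad d(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\pi(a|s)\sum_{s',a'} T(s|s',a')\, d(s',a') \;\; \forall (s,a),
\end{aligned}$$
which is exactly (4). Notice that the equality constraints correspond to a system of linear equations of dimension $|S||A|$: $(I - \gamma \mathcal{P}_\pi^\top)\, d = (1-\gamma)\,\mu_0\pi$, where $d$ and $\mu_0\pi$ are vectors in $\mathbb{R}^{|S||A|}$, $\mathcal{P}_\pi \in \mathbb{R}^{|S||A|\times|S||A|}$ has entries $\mathcal{P}_\pi\big((s,a),(s',a')\big) = T(s'|s,a)\,\pi(a'|s')$, and $I$ is the identity matrix. Since the matrix $I - \gamma \mathcal{P}_\pi^\top$ is nonsingular, the system has a unique solution given by
$$d = (1-\gamma)\,(I - \gamma \mathcal{P}_\pi^\top)^{-1}\,\mu_0\pi.$$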
Finally, when , we can rewrite , so , as desired.
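As a sanity check on the duality argument, the following sketch evaluates both forms on a small random tabular MDP. It is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `random_mdp` and `primal_dual_values`, the array layout `T[s, a, s']`, and the discount value are all hypothetical choices made here. It solves the linear system from the equality constraints for the dual variable, solves the Bellman fixed-point equation for the Q-values, and compares the two resulting estimates of the average return.

```python
import numpy as np


def random_mdp(n_s=4, n_a=3, seed=0):
    """Hypothetical tabular MDP: T[s, a, s'] transitions, policy pi[s, a], initial mu0[s], rewards r[s, a]."""
    rng = np.random.default_rng(seed)
    T = rng.random((n_s, n_a, n_s))
    T /= T.sum(axis=2, keepdims=True)
    pi = rng.random((n_s, n_a))
    pi /= pi.sum(axis=1, keepdims=True)
    mu0 = rng.random(n_s)
    mu0 /= mu0.sum()
    r = rng.random((n_s, n_a))
    return T, pi, mu0, r


def primal_dual_values(T, pi, mu0, r, gamma=0.9):
    n_s, n_a = pi.shape
    n = n_s * n_a
    # State-action transition matrix: P_pi[(s, a), (s', a')] = T(s'|s, a) * pi(a'|s').
    P_pi = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
    init = (mu0[:, None] * pi).reshape(n)  # initial state-action distribution
    rewards = r.reshape(n)
    I = np.eye(n)

    # Dual: unique solution of (I - gamma * P_pi^T) d = (1 - gamma) * init.
    d = np.linalg.solve(I - gamma * P_pi.T, (1.0 - gamma) * init)
    # Primal: Q-values as the fixed point of the policy Bellman operator.
    Q = np.linalg.solve(I - gamma * P_pi, rewards)

    primal_value = (1.0 - gamma) * init @ Q
    dual_value = d @ rewards
    return primal_value, dual_value, d
```

For any discount strictly below one, the two returned values should agree up to floating-point error, and the returned `d` should be a nonnegative vector summing to one.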