
Revisiting Peng's Q(λ) for Modern Reinforcement Learning

by   Tadashi Kozuno, et al.

Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q(λ), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q(λ) in complex continuous control tasks, confirming that Peng's Q(λ) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q(λ), which was thought to be unsafe, is a theoretically sound and practically effective algorithm.





1 Introduction

Q-learning is a canonical algorithm in reinforcement learning (RL) (Watkins, 1989). It is a single-step algorithm, in that it only uses individual transitions to update value estimates. Many multi-step generalisations of Q-learning have been proposed, which allow temporally-extended trajectories to be used in the updating of values (Bertsekas and Ioffe, 1996; Watkins, 1989; Peng and Williams, 1994, 1996; Precup et al., 2000; Harutyunyan et al., 2016; Munos et al., 2016; Rowland et al., 2020), potentially leading to more efficient credit assignment. Indeed, multi-step algorithms have often been observed to outperform single-step algorithms for control in a variety of RL tasks (Mousavi et al., 2017; Harb and Precup, 2017; Hessel et al., 2018; Barth-Maron et al., 2018; Kapturowski et al., 2018; Daley and Amato, 2019).

However, using multi-step algorithms for RL comes with both theoretical and practical difficulties. The discrepancy between the policy that generated the data to be learnt from (the behavior policy) and the policy being learnt about (the target policy) can lead to complex, non-convergent behavior in these algorithms, and so must be considered carefully. There are two main approaches to deal with this discrepancy (cf. Table 1). Conservative methods ensure convergence is guaranteed no matter what behavior policy is used, typically by truncating the trajectories used for learning. By contrast, non-conservative methods typically do not truncate trajectories, and as a result do not come with generic convergence guarantees. Nevertheless, non-conservative methods have consistently been found to outperform conservative methods in practical large-scale applications. Thus, there is a clear gap in our understanding of non-conservative methods: why do they work so well in practice, yet lack the guarantees of their conservative counterparts?

Algorithm | Conservative | Convergence | Convergence to $Q^*$
α-trace (Rowland et al., 2020) | No | ? | ?
C-trace (Rowland et al., 2020) | No | ? | ?
HQL (Harutyunyan et al., 2016) | No | ✓ (with small λ) | ✓ (with small λ)
Retrace (Munos et al., 2016) | Yes | ✓ | ✓
TBL (Precup et al., 2000) | Yes | ✓ | ✓
Uncorrected n-step Return | No | ? | ?
WQL (Watkins, 1989) | Yes | ✓ | ✓
PQL (Peng and Williams, 1994) | No | ✓ (biased) | ✓ (cf. caption)

Table 1: List of off-policy multi-step algorithms for control. Harutyunyan’s Q(λ), Tree-backup, Watkins’ Q(λ), and Peng’s Q(λ) are abbreviated as HQL, TBL, WQL, and PQL, respectively (cf. Section 3.2 for details of the algorithms). The Conservative column indicates whether an algorithm is conservative (cf. Section 4). The Convergence column indicates convergence to any fixed point, whereas the Convergence to $Q^*$ column indicates convergence to the optimal Q-function $Q^*$. The PQL entries are new results of the present paper. PQL converges to a biased fixed point when the behavior policy is fixed. It converges to $Q^*$ when the behavior policy is updated appropriately. (An exact condition is given in Section 5.)

In this paper, we address this question by studying a representative non-conservative algorithm, Peng’s Q(λ) (Peng and Williams, 1994, 1996, PQL), in more realistic learning settings. Our results show that while PQL does not learn optimal policies under arbitrary behavior policies, a convergence guarantee can be recovered if the behavior policy tracks the target policy, as is often the case in practice. This represents a closing of the gap between the strong empirical performance of non-conservative methods and their previous lack of theoretical guarantees.

More concretely, our primary theoretical contributions bring new understanding to PQL, and are summarized as follows:

  • A proof that PQL with a fixed behavior policy converges to a “biased” (i.e., different from $Q^*$) fixed point.

  • Analysis of the quality of the resulting policy.

  • Convergence of PQL to an optimal policy when using appropriate behavior policy updates.

  • Error propagation analysis when using approximations.

In addition to these theoretical insights, we validate the empirical performance of PQL through extensive experiments. Our focus is on continuous control tasks, where one encounters many technical challenges that do not exist in discrete control tasks (cf. Section 7.2). They are also accessible to a wider range of readers. We show that PQL can be easily extended to popular off-policy actor-critic algorithms such as DDPG, TD3 and SAC (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018). Over a large subset of tasks, PQL consistently outperforms other conservative and non-conservative baseline alternatives.

2 Notation and Definitions

For a finite set $S$ and an arbitrary set $T$, we let $\Delta(S)$ and $T^S$ be the probability simplex over $S$ and the set of all mappings from $S$ to $T$, respectively.

Markov Decision Processes (MDP).

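We consider an MDP defined by a tuple $(\mathcal{X}, \mathcal{A}, P, P_0, R, \gamma)$, where $\mathcal{X}$ is the finite state space, $\mathcal{A}$ the finite action space, $P$ the state transition probability kernel, $P_0$ the initial state distribution, $R$ the (conditional) reward distribution, and $\gamma \in [0, 1)$ the discount factor (Puterman, 1994). We let $r$ be a reward function defined by $r(x, a) := \mathbb{E}_{\rho \sim R(\cdot \mid x, a)}[\rho]$.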

On the Finiteness of the State and Action Spaces.

While we assume both the state space $\mathcal{X}$ and the action space $\mathcal{A}$ to be finite, most of the theoretical results in the paper hold in continuous state spaces with appropriate measure-theoretic considerations. The finiteness assumption on the action space is necessary to guarantee the existence of the optimal policy (Puterman, 1994). In Appendix B, we discuss assumptions necessary to extend our theoretical results to continuous action spaces.

Policy and Value Functions.

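Suppose a policy $\pi \colon \mathcal{X} \to \Delta(\mathcal{A})$. We consider the standard RL setup where an agent interacts with an environment, generating a sequence of state-action-reward tuples $(X_t, A_t, R_t)_{t \ge 0}$ with $A_t$ being an action sampled from some policy; throughout, we denote random variables by upper cases. Define $G := \sum_{t \ge 0} \gamma^t R_t$ as the cumulative return. The state-value and Q-functions are defined by $V^\pi(x) := \mathbb{E}[G \mid X_0 = x, \pi]$ and $Q^\pi(x, a) := \mathbb{E}[G \mid X_0 = x, A_0 = a, \pi]$, respectively, where the conditioning by $\pi$ means that subsequent actions are sampled as $A_t \sim \pi(\cdot \mid X_t)$.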

Evaluation and Control.

Two key tasks in RL are evaluation and control. The problem of evaluation is to learn the Q-function of a fixed policy. The aim in the control setting is to learn an optimal policy $\pi^*$, defined to satisfy $V^{\pi^*} \ge V^\pi$ for any policy $\pi$ (the inequality is point-wise, i.e., $V^{\pi^*}(x) \ge V^\pi(x)$ for all states $x$). Similarly to $\pi^*$, we let $Q^*$ denote the optimal Q-function $Q^{\pi^*}$. As a greedy policy with respect to $Q^*$ is optimal, it suffices to learn $Q^*$. In this paper, we are particularly interested in the off-policy control setting, where an agent collects data with a behavior policy $\mu$, which is not necessarily the agent’s current policy $\pi$. On-policy settings are a special case where $\mu = \pi$.

3 Multi-step RL Algorithms and Operators

Operators play a crucial role in RL since all value-based RL algorithms (exactly or approximately) update a Q-function based on the recursion $Q_{k+1} := \mathcal{O} Q_k$, where $\mathcal{O}$ is an operator that characterizes each algorithm. In this section, we review multi-step RL algorithms and their operators.

Basic Operators.

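Assume we have a fixed policy $\pi$. With an abuse of notation, we define operators $\pi$ and $P$ by

$$(\pi Q)(x) := \sum_{a} \pi(a \mid x)\, Q(x, a) \quad \text{and} \quad (P V)(x, a) := \sum_{y} P(y \mid x, a)\, V(y)$$

for any Q-function $Q$ and state-value function $V$, respectively (hereafter, we omit “for any…” in definitions of operators for brevity). We define their composite $P^\pi := P\pi$. As a result, the Bellman operator $T^\pi$ is defined by $T^\pi Q := r + \gamma P^\pi Q$. For a function $Q$, we let $\mathbf{G}(Q)$ be the set of all greedy policies with respect to $Q$ (note that there may be multiple greedy policies due to ties). The Bellman optimality operator $T$ is defined by $TQ := T^{\pi_Q} Q$ with $\pi_Q \in \mathbf{G}(Q)$ (note that this definition is independent of the choice of $\pi_Q$). Q-learning approximates the value iteration (VI) updates $Q_{k+1} := T Q_k$.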
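As a minimal sketch of the basic operators, the NumPy snippet below applies the Bellman operator of a fixed policy and the Bellman optimality operator to a tabular Q-function, then runs value iteration. The random MDP, the array layout (`P[s, a, s_next]`, `r[s, a]`) and all function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def make_random_mdp(n_states, n_actions, seed=0):
    """Hypothetical random MDP: transition kernel P[s, a, s'] and rewards r[s, a]."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)  # each P[s, a, :] is a distribution
    r = rng.random((n_states, n_actions))
    return P, r

def bellman_operator(Q, pi, P, r, gamma):
    """Bellman operator of a fixed policy pi[s, a]: r + gamma * P (pi Q)."""
    V = (pi * Q).sum(axis=1)   # (pi Q)(s) = sum_a pi(a|s) Q(s, a)
    return r + gamma * P @ V   # (P V)(s, a) = sum_s' P(s'|s, a) V(s')

def bellman_optimality_operator(Q, P, r, gamma):
    """Bellman optimality operator: bootstrap from max_a Q(s', a)."""
    return r + gamma * P @ Q.max(axis=1)

def greedy_policy(Q):
    """One deterministic greedy policy (ties broken by argmax order)."""
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

P, r = make_random_mdp(4, 3)
gamma = 0.9
Q = np.zeros((4, 3))
for _ in range(500):           # value iteration
    Q = bellman_optimality_operator(Q, P, r, gamma)
residual = np.abs(bellman_optimality_operator(Q, P, r, gamma) - Q).max()
```

Applying `bellman_operator` with `greedy_policy(Q)` gives the same result as `bellman_optimality_operator`, which is the sense in which the optimality operator does not depend on how ties are broken.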

3.1 On-policy Multi-step Operators for Control

We first introduce on-policy multi-step operators for control.

Modified Policy Iteration (MPI).

MPI uses the recursion $Q_{k+1} := (T^{\pi_k})^n Q_k$ for Q-function updates (Puterman and Shin, 1978), where $\pi_k \in \mathbf{G}(Q_k)$. The $n$-step return operator is defined by $T_n^\pi := (T^\pi)^n$.

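λ-Policy Iteration (λ-PI).

λ-PI uses the recursion $Q_{k+1} := T_\lambda^{\pi_k} Q_k$ for Q-function updates (Bertsekas and Ioffe, 1996), where $\pi_k \in \mathbf{G}(Q_k)$. The λ-return operator is defined as

$$T_\lambda^\pi Q := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (T^\pi)^n Q = Q + (I - \gamma\lambda P^\pi)^{-1} (T^\pi Q - Q),$$

where $\lambda \in [0, 1]$, and $I$ denotes the identity operator.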

3.2 Off-policy Multi-step Operators for Control

Next, we explain off-policy multi-step operators for control. We note that on-policy algorithms in the last subsection can be converted to off-policy versions by using importance sampling (Precup et al., 2000; Casella and Berger, 2002).

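Uncorrected n-step Return.

For a sequence of behavior policies $(\mu_k)_{k \ge 0}$, the uncorrected $n$-step return algorithm uses the recursion $Q_{k+1} := N_n^{\mu_k \pi_k} Q_k$ for Q-function updates (Hessel et al., 2018; Kapturowski et al., 2018), where $\pi_k \in \mathbf{G}(Q_k)$. Here, the uncorrected $n$-step return operator is defined for any policies $\mu$ and $\pi$ by

$$N_n^{\mu\pi} Q := (T^\mu)^{n-1} T^\pi Q.$$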

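Peng’s Q(λ) (PQL).

For a sequence of behavior policies $(\mu_k)_{k \ge 0}$, PQL uses the recursion $Q_{k+1} := N_\lambda^{\mu_k \pi_k} Q_k$ for Q-function updates (Peng and Williams, 1994, 1996), where $\pi_k \in \mathbf{G}(Q_k)$. Here, the PQL operator is defined for any policies $\mu$ and $\pi$ by

$$N_\lambda^{\mu\pi} Q := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} N_n^{\mu\pi} Q,$$

where $\lambda \in [0, 1]$ and $N_n^{\mu\pi}$ is the uncorrected $n$-step return operator. Note that PQL is a generalization of λ-PI because it reduces to λ-PI when $\mu_k = \pi_k$. In other words, PQL is λ-PI with one additional degree of freedom in $\mu_k$.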
General Retrace.

We next introduce a general version of the Retrace operator (Munos et al., 2016), from which other operators are obtained as special cases.

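For a behavior policy $\mu$ and a target policy $\pi$, we let $P^{c\mu}$ be an operator defined by

$$(P^{c\mu} Q)(x, a) := \sum_{y, b} P(y \mid x, a)\, \mu(b \mid y)\, c(y, b)\, Q(y, b),$$

where $c$ is an arbitrary non-negative function over state-action pairs whose choice depends on an algorithm. Note that for any $c$, $P^{c\mu} Q$ can be estimated off-policy with data collected under the behavior policy $\mu$.

A general Retrace operator is obtained by replacing $P^\pi$ of $(I - \gamma\lambda P^\pi)^{-1}$ in the λ-return operator with $P^{c\mu}$. Concretely,

$$R_\lambda^{c\mu\pi} Q := Q + (I - \gamma\lambda P^{c\mu})^{-1} (T^\pi Q - Q).$$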

The general Retrace algorithm updates its Q-function by $Q_{k+1} := R_\lambda^{c_k \mu_k \pi_k} Q_k$, where $(c_k)_{k \ge 0}$ is a sequence of arbitrary non-negative functions over state-action pairs, $(\mu_k)_{k \ge 0}$ is an arbitrary sequence of behavior policies, and $(\pi_k)_{k \ge 0}$ is a sequence of target policies that depends on an algorithm. Given the choices of $c_k$ and $\pi_k$ in Table 2, we recover a few known algorithms (Watkins, 1989; Peng and Williams, 1994, 1996; Precup et al., 2000; Harutyunyan et al., 2016; Munos et al., 2016; Rowland et al., 2020).

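The general Retrace algorithm is off-policy as $(R_\lambda^{c\mu\pi} Q)(x, a)$ can be estimated off-policy by the following estimator given a trajectory $(X_t, A_t, R_t)_{t \ge 0}$ collected under $\mu$ and started from $(X_0, A_0) = (x, a)$:

$$Q(x, a) + \sum_{t=0}^{\infty} \gamma^t \lambda^t \Big( \prod_{u=1}^{t} c(X_u, A_u) \Big) \delta_t, \qquad (2)$$

where the product over an empty index set equals $1$, and $\delta_t := R_t + \gamma (\pi Q)(X_{t+1}) - Q(X_t, A_t)$ is the TD error at time step $t$.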
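The trace-weighted sum of TD errors can be sketched for a finite trajectory as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the argument layout, and the truncation to a finite number of steps are all hypothetical, and the trace coefficients are passed in as precomputed numbers.

```python
import numpy as np

def general_retrace_estimate(q_sa, v_next, rewards, c, gamma, lam):
    """Finite-horizon sketch of a trace-weighted multi-step estimate.

    q_sa[t]    : Q(x_t, a_t) along the trajectory
    v_next[t]  : expected Q under the target policy at x_{t+1}
    rewards[t] : reward observed at step t
    c[t]       : trace coefficient at (x_t, a_t); c[0] is unused
    """
    q_sa = np.asarray(q_sa, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    c = np.asarray(c, dtype=float)
    delta = rewards + gamma * v_next - q_sa            # TD errors
    # Running product c_1 * ... * c_t, equal to 1 at t = 0 (empty product).
    trace = np.cumprod(np.concatenate(([1.0], c[1:])))
    weights = (gamma * lam) ** np.arange(len(delta)) * trace
    return float(q_sa[0] + np.sum(weights * delta))

# Toy numbers, chosen only for illustration.
q_sa = [1.0, 2.0, 3.0]
v_next = [2.0, 3.0, 0.5]
rewards = [1.0, 0.0, 2.0]
one_step = general_retrace_estimate(q_sa, v_next, rewards, [1, 1, 1], 0.5, 0.0)
cut = general_retrace_estimate(q_sa, v_next, rewards, [1, 0, 0], 0.5, 1.0)
full = general_retrace_estimate(q_sa, v_next, rewards, [1, 1, 1], 0.5, 1.0)
```

With `lam = 0`, or with every trace coefficient after the first step set to zero, only the first TD error survives and the estimate is the one-step target. With all coefficients equal to one and `lam = 1`, the TD errors telescope in this toy example because `v_next[t]` equals `q_sa[t + 1]`.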

Retrace Any
Table 2: Choices of $c$ and $\pi$ in off-policy multi-step operators for control. See Section 3.2 for details. The same abbreviations as those in Table 1 are used. For brevity, we defined . We denote by and by . α-trace and C-trace look the same in the table, but C-trace adaptively changes α so that the trace length matches a target trace length.

4 Conservative and Non-conservative Multi-step RL Algorithms

Munos et al. (2016) showed that the following conditions suffice for the convergence of the general Retrace to $Q^*$:

  1. $c_k(x, a) \in [0, \pi_k(a \mid x) / \mu_k(a \mid x)]$ for any state $x$ and action $a$.

  2. $\pi_k$ satisfies some greediness condition, such as $\varepsilon_k$-greediness with $\varepsilon_k$ decreasing as $k$ increases; cf. Munos et al. (2016) for further details.

We call algorithms that satisfy the first condition conservative algorithms for reasons to be explained below. Otherwise, we call the algorithms non-conservative. See Table 1 for the classification of algorithms. The uncorrected n-step return algorithm can also be viewed as a non-conservative algorithm with non-Markovian traces that depend also on the past.
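The conservativeness condition can be checked numerically for a given pair of policies. The sketch below assumes the convention in which the factor λ is kept outside the trace coefficient, so that Retrace uses a truncated importance ratio, Tree-backup uses the target probability, and a never-cutting trace (as in Harutyunyan's Q(λ)) is identically one. The function names and the example policies are hypothetical.

```python
import numpy as np

def trace_coefficients(name, pi, mu):
    """Trace coefficient c[s, a] for a few schemes (lambda handled separately).

    Assumes mu[s, a] > 0 everywhere so that the importance ratio is defined.
    """
    ratio = pi / mu
    if name == "retrace":        # truncated importance ratio
        return np.minimum(1.0, ratio)
    if name == "tree_backup":    # target-policy probability
        return pi.copy()
    if name == "no_cut":         # trace is never cut
        return np.ones_like(pi)
    raise ValueError(f"unknown scheme: {name}")

def is_conservative(c, pi, mu, tol=1e-12):
    """True if 0 <= c <= pi / mu holds for every state-action pair."""
    return bool(np.all((c >= 0.0) & (c <= pi / mu + tol)))

# Uniform behavior policy, deterministic target policy (2 states, 2 actions).
mu = np.full((2, 2), 0.5)
pi = np.array([[1.0, 0.0], [0.0, 1.0]])
results = {
    name: is_conservative(trace_coefficients(name, pi, mu), pi, mu)
    for name in ("retrace", "tree_backup", "no_cut")
}
```

In this example the never-cutting trace violates the bound at actions the target policy never takes, which is exactly where a conservative algorithm would cut the trace.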

Conservativeness, Theoretical Guarantees, and Empirical Performance of Algorithms.

Recall that in the general Retrace update estimator (2), the effect of the TD error $\delta_t$ is attenuated by $\prod_{u=1}^{t} c(X_u, A_u)$ in addition to $\gamma^t \lambda^t$. Hence, from the backward view (Sutton and Barto, 1998), the first condition intuitively requires that the trace must be cut if a sub-trajectory is unlikely under $\pi$ relative to $\mu$. As a result, conservative algorithms only carry out safe updates to Q-functions.

As shown by Munos et al. (2016), such conservative updates enable a convergence guarantee for general conservative algorithms. However, Rowland et al. (2020) observed that this often results in frequent trace cuts, so that conservative algorithms usually benefit less from multi-step updates.

In contrast, non-conservative algorithms accumulate TD errors without carefully cutting traces. As a result, non-conservative algorithms might perform poorly. As we show later (Proposition 5), this is the case at least for Harutyunyan’s Q(λ) (Harutyunyan et al., 2016, HQL), an instance of non-conservative algorithms, when the behavior policy is fixed. Nonetheless, non-conservative algorithms are known to perform well in practice (Hessel et al., 2018; Kapturowski et al., 2018; Daley and Amato, 2019). To understand why, it is important to characterize what kind of updates to the behavior policy entail the convergence of the overall algorithm. In the following sections, we take a step forward in this direction. We establish the convergence guarantee of PQL under two setups: (1) when the behavior policy is fixed; (2) when the behavior policy is updated in an appropriate way.

5 Theoretical Analysis of Peng’s Q(λ)

In this section, we analyze Peng’s Q(λ). We start with the exact case where there are no update errors in value functions. Later, we will consider the approximate case, which accounts for update errors. The following lemma is particularly useful in theoretical analyses as well as practical implementations.

Lemma 1 (Harutyunyan et al., 2016).

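The PQL operator can be rewritten in the following forms:

$$N_\lambda^{\mu\pi} Q = Q + (I - \gamma\lambda P^\mu)^{-1} \big( T^{\lambda\mu + (1-\lambda)\pi} Q - Q \big) = (I - \gamma\lambda P^\mu)^{-1} \big( r + \gamma (1 - \lambda) P^\pi Q \big).$$

Proof. This is proven in Harutyunyan et al. (2016), but we provide a proof in Appendix C for completeness. ∎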
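One way to sanity-check such rewritings is to compare, on a small random MDP, a truncated geometric mixture of n-step returns (behavior policy for the first n-1 steps, target policy for the final bootstrap) against two closed forms that follow from summing the series: one solving a linear system with the reward plus a (1-λ)-weighted bootstrap on the right-hand side, and one adding a resolvent-weighted TD error of the mixture policy λμ + (1-λ)π. The random MDP, the random policies, and all function names below are illustrative assumptions.

```python
import numpy as np

def random_mdp(n_states, n_actions, rng):
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)
    return P, rng.random((n_states, n_actions))

def random_policy(n_states, n_actions, rng):
    pi = rng.random((n_states, n_actions))
    return pi / pi.sum(axis=1, keepdims=True)

def policy_kernel(P, pi):
    """Matrix over state-action pairs: entry ((s,a),(s',a')) = P(s'|s,a) pi(a'|s')."""
    S, A = pi.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

def peng_series(Q, mu, pi, P, r, gamma, lam, n_terms=500):
    """Truncated mixture (1-lam) * sum_n lam^(n-1) * [n-step return]."""
    q, rf = Q.reshape(-1), r.reshape(-1)
    P_mu, P_pi = policy_kernel(P, mu), policy_kernel(P, pi)
    term = rf + gamma * P_pi @ q          # one-step return under the target
    out = np.zeros_like(q)
    for n in range(n_terms):
        out += (1 - lam) * lam**n * term
        term = rf + gamma * P_mu @ term   # prepend one behavior-policy step
    return out.reshape(Q.shape)

def peng_closed_form(Q, mu, pi, P, r, gamma, lam):
    q, rf = Q.reshape(-1), r.reshape(-1)
    rhs = rf + gamma * (1 - lam) * policy_kernel(P, pi) @ q
    lhs = np.eye(q.size) - gamma * lam * policy_kernel(P, mu)
    return np.linalg.solve(lhs, rhs).reshape(Q.shape)

def peng_td_form(Q, mu, pi, P, r, gamma, lam):
    q, rf = Q.reshape(-1), r.reshape(-1)
    mix = lam * mu + (1 - lam) * pi       # mixture policy
    td = rf + gamma * policy_kernel(P, mix) @ q - q
    lhs = np.eye(q.size) - gamma * lam * policy_kernel(P, mu)
    return (q + np.linalg.solve(lhs, td)).reshape(Q.shape)

rng = np.random.default_rng(0)
P, r = random_mdp(4, 3, rng)
mu, pi = random_policy(4, 3, rng), random_policy(4, 3, rng)
Q = rng.random((4, 3))
series = peng_series(Q, mu, pi, P, r, 0.9, 0.7)
closed = peng_closed_form(Q, mu, pi, P, r, 0.9, 0.7)
td_form = peng_td_form(Q, mu, pi, P, r, 0.9, 0.7)
```

The closed form is the convenient one in tabular experiments, since it replaces an infinite sum with a single linear solve.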

5.1 Exact Case with a Fixed Behavior Policy

We now analyze PQL with a fixed behavior policy $\mu$. While the behavior policy is not fixed in practical situations, the analysis shows a trade-off between bias and convergence rate. This trade-off is analogous to the bias-contraction-rate trade-off of off-policy multi-step algorithms for policy evaluation (Rowland et al., 2020) and sheds some light on important properties of PQL.

Concretely, we analyze the following algorithm:

$$Q_{k+1} := N_\lambda^{\mu \pi_k} Q_k, \quad \text{where } \pi_k \in \mathbf{G}(Q_k). \qquad (3)$$

Harutyunyan et al. (2016) have proven that a fixed point of the PQL operator coincides with the unique fixed point of $\lambda T^\mu + (1 - \lambda) T$, which is guaranteed to exist since $\lambda T^\mu + (1 - \lambda) T$ is a contraction with modulus $\gamma$ under the $L^\infty$-norm (see Appendix A for details about the contraction and other notions).

The existence of a fixed point does not imply the convergence of PQL, and we need to show that the distance between $Q_k$ and the fixed point is decreasing. With the following theorem, we show that PQL does converge.

Theorem 2.

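Let $\bar{\pi}$ be a policy such that $Q^{\lambda\mu + (1-\lambda)\bar{\pi}} \ge Q^{\lambda\mu + (1-\lambda)\pi}$ for any policy $\pi$, where the inequality is point-wise. Then, $Q^{\lambda\mu + (1-\lambda)\bar{\pi}}$ is the fixed point of $\lambda T^\mu + (1 - \lambda) T$, and $Q_k$ of PQL (3) uniformly converges to $Q^{\lambda\mu + (1-\lambda)\bar{\pi}}$ with the rate $O(\beta^k)$, where $\beta := \gamma (1 - \lambda) / (1 - \gamma\lambda)$.

Proof. See Appendix E. ∎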

We now build intuition about the bias-convergence-rate trade-off implied by Theorem 2. When $\lambda$ increases, the fixed point is the Q-function of the best policy of the form $\lambda\mu + (1 - \lambda)\pi$, whose bias against $Q^*$ arguably increases; at the same time, the contraction rate decreases, so that the contraction is faster.
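The trade-off can be illustrated numerically. The sketch below iterates a tabular PQL update with a fixed uniform behavior policy, using a closed form in which the update solves a linear system, and compares the limit with the Q-function of the mixture of the behavior policy and the final greedy policy. It also evaluates the contraction modulus γ(1-λ)/(1-γλ) that this closed form implies. The random MDP, the function names, and the use of this particular modulus are assumptions made for illustration.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)
    return P, rng.random((n_states, n_actions))

def policy_kernel(P, pi):
    S, A = pi.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

def greedy(Q):
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

def peng_step(Q, mu, P, r, gamma, lam):
    """One PQL update with a greedy target policy (closed form)."""
    rhs = r.reshape(-1) + gamma * (1 - lam) * policy_kernel(P, greedy(Q)) @ Q.reshape(-1)
    lhs = np.eye(Q.size) - gamma * lam * policy_kernel(P, mu)
    return np.linalg.solve(lhs, rhs).reshape(Q.shape)

def policy_q(pi, P, r, gamma):
    """Exact Q-function of a policy via a linear solve."""
    lhs = np.eye(r.size) - gamma * policy_kernel(P, pi)
    return np.linalg.solve(lhs, r.reshape(-1)).reshape(r.shape)

def contraction_rate(gamma, lam):
    return gamma * (1 - lam) / (1 - gamma * lam)

P, r = random_mdp(4, 3)
gamma, lam = 0.9, 0.5
mu = np.full((4, 3), 1 / 3)       # fixed uniform behavior policy
Q = np.zeros((4, 3))
for _ in range(300):
    Q = peng_step(Q, mu, P, r, gamma, lam)

mixture = lam * mu + (1 - lam) * greedy(Q)
gap = np.abs(Q - policy_q(mixture, P, r, gamma)).max()
beta = contraction_rate(gamma, lam)
```

Here `beta` is below `gamma`, so the iteration contracts faster than value iteration, while the limit is the value of a mixture policy rather than the optimal Q-function.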

Remark 1.

In Section 7.6 of Sutton and Barto (1998), it is conjectured that PQL with a fixed policy would converge to a hybrid of $Q^\mu$ and $Q^*$. Theorem 2 gives an answer to this conjecture and shows that it is not necessarily true. Rather, the theorem shows that PQL converges to the Q-function of the best policy among policies of the form $\lambda\mu + (1 - \lambda)\pi$.

5.2 Approximate Case with a Fixed Behavior Policy

In practice, value-update errors are inevitable due to, e.g., finite-sample estimation and function approximation errors. In this subsection, we provide the error propagation analysis of PQL with a fixed behavior policy. As we will see, the analysis depicts a trade-off between fixed point bias and error tolerance.

We analyze the following algorithm:

where denotes the value-update error at iteration . For simplicity, we use and in this subsection.

In Section 5.1, we showed when at every , and . Therefore, is an approximation to , and thus it is natural to define as the loss of using rather than . The following theorem provides an upper bound for the loss.

Theorem 3.

For any , the following holds:

where is the -norm defined for any real-valued function by .


Proof. See Appendix G. ∎

As we have already explained the bias-convergence-rate trade-off, for now we ignore the term and focus on the error term. For simplicity, we assume for every . Then,

In contrast, an analogous result of λ-PI is (Scherrer, 2013). When $\lambda = 0$, these results coincide, which is expected since both λ-PI and PQL degenerate to value iteration. When $\lambda = 1$, PQL’s error dependency is , which is significantly better than . However, in this case, PQL is completely biased and converges to $Q^\mu$. At intermediate values of $\lambda$, PQL achieves a trade-off between error tolerance and bias.

5.3 Approximate Case with Behavior Policy Updates

Previously, we have analyzed PQL with a fixed behavior policy. However, in practice, the behavior policy is updated along with the target policy. Besides, value-update errors are inevitable in complex tasks. As a result, PQL may behave quite differently in a practical scenario. This motivates our analysis of the following algorithm, which updates the behavior policy after each application of the PQL operator (in Appendix F, we analyze a case where the behavior policy is updated after multiple applications of the PQL operator):


where , and . Note that when , this algorithm reduces to λ-PI as a special case. Though this behavior policy update closely resembles that of conservative policy iteration (Kakade and Langford, 2002), here we require .
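The effect of letting the behavior policy track the greedy policy can be sketched in the exact tabular case. The snippet below assumes a mixing update in the style of conservative policy iteration, in which the new behavior policy is `(1 - alpha) * mu + alpha * greedy(Q)`. The step size name `alpha`, the zero initialization, the non-negative rewards, and the random MDP are assumptions made for this illustration; the paper's exact update rule and conditions may differ.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)
    return P, rng.random((n_states, n_actions))   # rewards in [0, 1)

def policy_kernel(P, pi):
    S, A = pi.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

def greedy(Q):
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

def peng_operator(Q, mu, pi, P, r, gamma, lam):
    """Closed-form PQL update for behavior policy mu and target policy pi."""
    rhs = r.reshape(-1) + gamma * (1 - lam) * policy_kernel(P, pi) @ Q.reshape(-1)
    lhs = np.eye(Q.size) - gamma * lam * policy_kernel(P, mu)
    return np.linalg.solve(lhs, rhs).reshape(Q.shape)

P, r = random_mdp(4, 3)
gamma, lam, alpha = 0.9, 0.5, 0.3     # alpha: hypothetical mixing rate
Q = np.zeros((4, 3))
mu = np.full((4, 3), 1 / 3)           # initial behavior policy: uniform
for _ in range(1000):
    Q = peng_operator(Q, mu, greedy(Q), P, r, gamma, lam)
    mu = (1 - alpha) * mu + alpha * greedy(Q)   # behavior tracks the greedy policy

# Reference: optimal Q-function from value iteration.
Q_star = np.zeros((4, 3))
for _ in range(1000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)
error = np.abs(Q - Q_star).max()
```

Setting `alpha = 0` recovers the fixed-behavior-policy iteration with its biased limit, whereas `alpha = 1` makes the behavior policy equal to the greedy policy after every update.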

This algorithm has the following performance guarantee.

Theorem 4.

For any , the following holds:

where . Hence, PQL with behavior policy updates converges to the optimal policy with the rate .


Proof. See Appendix H. ∎

The first term on the right hand side shows the convergence of PQL with behavior policy updates in an exact case, i.e., for any . It states that the fastest convergence rate is (achieved when ), which is the same as the convergence rate of VI (Munos, 2005), policy iteration (Munos, 2003), MPI (Scherrer et al., 2012, 2015), and λ-PI (Scherrer, 2013). When , the convergence rate coincides with that of conservative policy iteration (Scherrer, 2014). However, we are not aware of a similar result for conservative λ-PI, which would be an analogue of PQL considered here. Theorem 4 also provides the error dependency of PQL (the second term on the right hand side). It coincides with the previous result of the above algorithms when , as one would expect, since PQL with is precisely λ-PI. Nonetheless, PQL allows some degree of off-policyness when .

5.4 Oscillatory Behavior of HQL

In this section, we have proven the convergence of exact PQL (i.e., no value-update errors). However, the following proposition shows that exact HQL, an instance of non-conservative algorithms, does not converge in an MDP when the behavior policy is fixed. Nonetheless, in the same MDP, setting the behavior policy to a greedy policy guarantees the convergence.

Proposition 5.

There is an MDP such that when exact HQL is run with a fixed policy for all , , and , HQL’s Q-function oscillates between two functions, and its greedy policy oscillates between optimal and sub-optimal policies. In contrast, if , HQL converges to an optimal policy.


Proof. A proof of the first claim is given in Appendix D. The second claim immediately follows by noting that if , HQL is λ-PI, which is known to converge (Bertsekas and Ioffe, 1996). ∎

While this result is specialized to HQL, it sheds light on an important aspect of non-conservative algorithms in general:

While non-conservative algorithms may perform poorly when the behavior policy is fixed, they may converge to $Q^*$ when the behavior policy is updated.

The above captures a critical aspect of how algorithms behave in practice, where the behavior policy is continuously updated.

6 Deep RL Implementations

We next show that Peng’s Q(λ) can be conveniently implemented with established off-policy deep RL algorithms. Our experiments focus on continuous control problems, where the action space is continuous. A primary motivation for considering continuous control benchmarks (e.g., Brockman et al., 2016; Tassa et al., 2020) is that they are usually more accessible to a wider RL research community, compared to challenging discrete control benchmarks such as Atari games (Bellemare et al., 2013).

6.1 Off-policy Actor-critic Algorithms

Off-policy actor-critic algorithms maintain a policy with parameter and a Q-function critic with parameter . For the policy, a popular choice is the point mass distribution , where (Lillicrap et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018). The algorithm collects data with an exploratory behavior policy and saves tuples into a replay buffer . At each training iteration, the critic is updated by minimizing squared errors against a Q-function target . The policy is updated via the deterministic policy gradient (Silver et al., 2014). See further details in Appendix J.

6.2 Implementations of Multi-step Operators

While estimates of the one-step target are arguably the simplest to implement, they only look ahead myopically for one step. Usually, learning can be significantly sped up when the targets are constructed with multi-step operators (see, e.g., empirical examples in Hessel et al. (2018), Barth-Maron et al. (2018), and Kapturowski et al. (2018), and theoretical insights in Rowland et al. (2020)). For example, the uncorrected $n$-step operator is estimated as follows (Hessel et al., 2018): given an $n$-step trajectory $(x_t, a_t, r_t)_{t=0}^{n}$, the target at $(x_0, a_0)$ is computed as $\sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n Q(x_n, \pi(x_n))$. Similar estimates can be derived for all multi-step operators introduced in Section 3, in particular for Peng’s Q(λ). We present full details in Appendix J.
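A minimal sketch of how such targets might be computed from a stored trajectory segment is given below. The uncorrected n-step target sums discounted rewards and bootstraps once at the end. The Peng's Q(λ) targets use the standard backward recursion in which each step mixes a one-step bootstrap (weight 1-λ) with the target of the next step (weight λ). The function names, the truncation at the end of the segment, and the toy numbers are assumptions for illustration, and the bootstrap values are taken as given (for instance from a target critic evaluated at the policy's action).

```python
import numpy as np

def uncorrected_n_step_target(rewards, bootstrap_value, gamma):
    """Discounted reward sum over n steps plus a bootstrap at step n."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma**n * bootstrap_value)

def peng_q_lambda_targets(rewards, next_values, gamma, lam):
    """Targets for every step of a segment via a backward recursion.

    next_values[t] is the bootstrap value at the state reached after step t.
    The recursion is started from the last bootstrap value, so the final
    step of the segment receives a plain one-step target.
    """
    T = len(rewards)
    targets = np.empty(T)
    g = next_values[-1]
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        targets[t] = g
    return targets

# Toy segment, chosen only for illustration.
rewards = [1.0, 0.0, 2.0]
next_values = [2.0, 3.0, 0.5]
one_step = peng_q_lambda_targets(rewards, next_values, gamma=0.5, lam=0.0)
full = peng_q_lambda_targets(rewards, next_values, gamma=0.5, lam=1.0)
n_step = uncorrected_n_step_target(rewards, next_values[-1], gamma=0.5)
```

With `lam = 0` the recursion returns one-step targets, and with `lam = 1` the first target coincides with the uncorrected n-step target over the whole segment. No importance sampling ratios appear anywhere in the computation.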

Desirable empirical properties of Peng’s Q(λ).

The estimates of Peng’s Q(λ) do not require importance sampling (IS) ratios $\pi(a \mid x) / \mu(a \mid x)$. This is especially valuable for continuous control, where the policy could be deterministic, in which case algorithms such as Retrace (Munos et al., 2016) cut traces immediately. Even when policies are stochastic and traces based on IS ratios are not cut immediately, prior work suggests that the trace cuts are usually pessimistic, especially for high-dimensional action spaces (see, e.g., Wang et al. (2017) for implementation techniques to mitigate the issue).

7 Experiments

To build better intuitions about Peng’s Q(λ), we start with tabular examples in Section 7.1. We will see that the empirical properties of Peng’s Q(λ) echo the theoretical analysis in previous sections. In Section 7.2, we evaluate Peng’s Q(λ) in deep RL contexts. We combine Peng’s Q(λ) with baseline deep RL algorithms and compare its performance against alternative operators.

7.1 A tabular example

Figure 1: Performance on tree MDPs. Panel (a), final performance, shows how performance changes as a function of tree depth; panel (b), learning curves, shows the learning curves of different operators.

Tree MDP.

We consider toy examples with a tree MDP of depth . The MDPs are binary trees, with each node corresponding to a state. Starting from any non-leaf state, the two actions transition the agent to one of its child nodes with probability one. Each episode lasts for steps and the agent always starts at the root node. The rewards are zero everywhere except at the leftmost leaf node and at the rightmost leaf node. The behavior policy is for all states .

Note that there is a sub-optimal policy of collecting at the rightmost leaf. The behavior policy is by design biased towards taking right moves, such that it is easy for the agent to learn the sub-optimal policy. The optimal policy is to take left moves and collect . Throughout training, we optimize the target policy while fixing the behavior policy . This echoes the theoretical setup in Section 5.2. See Appendix J for further details on the setup.


In Figure 1(a), we show the converged performance of different algorithms as a function of the MDP’s tree depth . When , all algorithms achieve the optimal performance; when , as increases, the fixed point bias of Peng’s Q(λ) hurts the performance drastically. This is less severe for , whose performance decays less quickly. On the other hand, both Retrace and the one-step operator learn the optimal policy even for . However, when increases, it becomes difficult to sample the optimal trajectory, making it easy to get trapped with the sub-optimal policy. As such, the sparse rewards make it difficult to learn meaningful Q-functions, unless the return signals get propagated effectively (i.e., traces are not cut). This is shown in Figure 1(a), where Peng’s Q(λ) with is the only baseline that achieves the sub-optimal performance, while all other algorithms fail to learn anything.

Similar observations are made in Figure 1(b), where we compare Peng’s Q(λ) for various $\lambda$ under (solid lines) and (dotted lines). Small $\lambda$ corresponds to less bias in the Q-function fixed points, and should asymptotically converge to higher performance; on the other hand, large $\lambda$ suffers from sub-optimality when the tree depth is small, but gains a substantial advantage when the tree depth is large.

7.2 Deep RL experiments


We evaluate performance over environments with a number of different physics simulation backends, such as the MuJoCo-based (Todorov et al., 2012) DeepMind (DM) control suite (Tassa et al., 2020) and the open-source simulator Bullet physics (Coumans and Bai, 2016). Due to space limits, below we only show results for the DM control suite and provide a more complete set of evaluations in Appendix J.

Baseline comparison.

We use TD3 (Fujimoto et al., 2018) as the base algorithm. We compare with a few multi-step baselines: (1) one-step (also the base algorithm); (2) uncorrected $n$-step with a fixed $n$; (3) Peng’s Q(λ) with a fixed $\lambda$; (4) Retrace and C-trace. Among all baselines, the uncorrected $n$-step operator is the most commonly used non-conservative operator, while Retrace is a representative conservative operator. See Appendix J for more details. All algorithms are trained with a fixed number of steps and results are averaged across random seeds.

Standard benchmark results.

In the top row of Figure 2, we show evaluations on standard benchmarks. Across most tasks, Peng’s Q(λ) performs more stably than other baseline algorithms. We see that Peng’s Q(λ) generally learns as fast as other baselines, and in some cases significantly faster. Note that though Peng’s Q(λ) does not necessarily obtain the best learning performance on each task, it consistently ranks among the top two algorithms (with ties). This is in contrast to baseline algorithms whose performance rank might vary drastically across tasks. For example, the one-step TD3 performs well in CheetahRun but performs poorly in WalkerWalk. Also, both C-trace and Retrace generally perform significantly worse. We provide further analysis in Appendix J.

Figure 2: Evaluation of baseline algorithms over standard DM control domains. The first row shows results on standard benchmarks; the second row shows results on sparse reward variants of the benchmarks. Four task names are labeled at the bottom. In each plot, the x-axis shows the number of training steps and the y-axis shows the performance. In standard benchmarks, Peng’s Q(λ) generally performs more stably than other algorithms; in sparse reward benchmarks, Peng’s Q(λ) outperforms all other algorithms across all presented tasks.

Sparse rewards results.

In the bottom row of Figure 2, we show evaluations on sparse reward variants of the benchmark tasks. See details on these environments in Appendix J. Sparse rewards are challenging for deep RL algorithms, as it is more difficult to numerically propagate learning signals across time steps. Accordingly, sparse rewards are natural benchmarks for operator-based algorithms. Across all tasks, Peng’s Q(λ) consistently outperforms other baselines. In a few cases, uncorrected $n$-step also outperforms the baseline TD3; we speculate that this is because the former propagates the learning signal more efficiently, which is critical for sparse rewards. Compared to uncorrected $n$-step, Peng’s Q(λ) seems to achieve a better trade-off between efficient propagation of learning signals and fixed point biases, which leads to relatively stable and consistent performance gains across all selected benchmark tasks.

7.3 Additional deep RL experiments

Maximum-entropy RL.

In Appendix I, we show how Peng’s Q(λ) can be extended to maximum-entropy RL (Ziebart et al., 2008; Fox et al., 2016; Haarnoja et al., 2017, 2018). We combine multi-step operators with maximum-entropy deep RL algorithms such as SAC (Haarnoja et al., 2018) and show performance gains over benchmark tasks. See Appendix J for further details.

Ablation study on λ.

In Appendix J, we provide an ablation study on the effect of $\lambda$. We show that the performance of Peng’s Q(λ) depends on the choice of $\lambda$. Nevertheless, we find that a single $\lambda$ can usually lead to fairly uniform performance gains across a large number of benchmarks.

8 Conclusion

In this paper, we have studied the non-conservative off-policy algorithm Peng’s Q(λ), and shown that while in the worst case its convergence guarantees are weaker than those of conservative algorithms such as Retrace, convergence guarantees to the optimal policy are recovered when the behavior policy closely tracks the target policy. This condition often holds when agents are trained through replay buffers, and the result serves to close the gap between the strong empirical performance observed with non-conservative algorithms in deep RL and their previous lack of theory.

We expect this to have several important consequences for deep RL theory and practice. Firstly, these results make clear that the degree of off-policyness is an important quantity that has real impact on the success of deep RL algorithms, and incorporating quantities related to this into the analysis of off-policy algorithms will be important for developing theoretical understanding of deep RL. Secondly, these findings add weight to growing empirical work highlighting that quantities such as replay buffer size and replay ratio are crucial to the success of deep RL agents (Zhang and Sutton, 2017; Daley and Amato, 2019; Fedus et al., 2020), and deserve further attention.

We believe the analysis presented in this paper is an important step towards a deeper understanding of non-conservative methods, and there are several open questions suitable for future work. For example, the convergence guarantee in Theorem 4 requires . However, we conjecture that this assumption can be lifted. In addition, while we did not analyze the concentrability coefficients of PQL, Scherrer (2014) reports that conservative policy iteration, which is analogous to PQL, has better concentrability coefficients. Finally, careful error propagation analyses of gap-increasing algorithms (Azar et al., 2012; Kozuno et al., 2019) and policy-update-regularized algorithms (Vieillard et al., 2020) show that slow policy updates confer stability against errors. In PQL with behavior policy updates, we expect a similar result when takes an intermediate value.


Acknowledgments

TK was supported by JSPS KAKENHI Grant Number 16H06563. TK thanks Prof. Kenji Doya, Dongqi Han, and Ho Ching Chiu at Okinawa Institute of Science and Technology (OIST) for their valuable comments. TK is also grateful for the research support of OIST to the Neural Computation Unit, where TK partially conducted this research. In particular, TK is thankful for OIST’s Scientific Computation and Data Analysis section, which maintains a cluster we used for many of our experiments. YHT acknowledges the computational support from Google Cloud Platform.


References

  • J. Achiam (2018) Spinning Up in Deep Reinforcement Learning. Cited by: §J.2, §J.2, Appendix B.
  • K. Asadi and M. L. Littman (2017) An Alternative Softmax Operator for Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Cited by: Appendix I.
  • M. G. Azar, V. Gómez, and H. J. Kappen (2012) Dynamic policy programming. Journal of Machine Learning Research 13 (103), pp. 3207–3245. Cited by: §8.
  • G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §6.1, §6.2.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §6.
  • D. P. Bertsekas and S. Ioffe (1996) Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, Lab. for Info. and Decision Systems, MIT, Cambridge, Massachusetts. Cited by: §1, §3.1, §5.4.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. arXiv preprint arXiv:1606.01540. Cited by: §6.
  • G. Casella and R. L. Berger (2002) Statistical Inference. Vol. 2, Duxbury Pacific Grove, CA. Cited by: §3.2.
  • E. Coumans and Y. Bai (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning. Cited by: §7.2.
  • B. Daley and C. Amato (2019) Reconciling λ-returns with experience replay. In Advances in Neural Information Processing Systems, Cited by: §1, §4, §8.
  • W. Fedus, P. Ramachandran, R. Agarwal, Y. Bengio, H. Larochelle, M. Rowland, and W. Dabney (2020) Revisiting fundamentals of experience replay. In Proceedings of the International Conference on Machine Learning, Cited by: §8.
  • R. Fox, A. Pakman, and N. Tishby (2016) Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Cited by: Appendix I, Appendix I, §7.3.
  • S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Cited by: §J.1, §J.2, §1, §6.1, §7.2.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, Cited by: Appendix I, Appendix I, §7.3.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Cited by: §J.2, §J.3, §J.7, Appendix I, §1, §7.3.
  • J. Harb and D. Precup (2017) Investigating recurrence and eligibility traces in deep Q-networks. arXiv preprint arXiv:1704.05495. Cited by: §1.
  • A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos (2016) Q(λ) with off-policy corrections. In Proceedings of the International Conference on Algorithmic Learning Theory, Cited by: Appendix D, Table 1, §1, §3.2, §4, §5.1, §5, Lemma 1.
  • H. V. Hasselt (2010) Double Q-learning. In Advances in Neural Information Processing Systems, Cited by: §J.2.
  • M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §J.3, §1, §3.2, §4, §6.2.
  • S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In Proceedings of the International Conference on Machine Learning, Cited by: §5.3.
  • S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney (2018) Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §3.2, §4, §6.2.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §J.2.
  • T. Kozuno, E. Uchibe, and K. Doya (2019) Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §8.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, Cited by: §J.1, §J.2, §1, §6.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §J.1, §J.2.
  • S. S. Mousavi, M. Schukat, E. Howley, and P. Mannion (2017) Applying Q(λ)-learning in deep reinforcement learning to play Atari games. In AAMAS Workshop on Adaptive Learning Agents, Cited by: §1.
  • R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare (2016) Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §J.3, Revisiting Peng’s Q(λ) for Modern Reinforcement Learning, Table 1, §1, §3.2, §3.2, item 2, §4, §4, §6.2.
  • R. Munos (2003) Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, Cited by: §5.3.
  • R. Munos (2005) Error bounds for approximate value iteration. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §5.3.
  • J. Oh, Y. Guo, S. Singh, and H. Lee (2018) Self-imitation learning. In Proceedings of the International Conference on Machine Learning, Cited by: §J.6.
  • J. Peng and R. J. Williams (1994) Incremental multi-step Q-learning. In Proceedings of the International Conference on Machine Learning, Cited by: Table 1, §1, §1, §3.2, §3.2.
  • J. Peng and R. J. Williams (1996) Incremental multi-step Q-learning. Machine learning 22 (1), pp. 283–290. Cited by: §1, §1, §3.2, §3.2.
  • D. Precup, R. S. Sutton, and S. P. Singh (2000) Eligibility traces for off-policy policy evaluation. In Proceedings of the International Conference on Machine Learning, Cited by: §J.3, Table 1, §1, §3.2, §3.2.
  • M. L. Puterman and M. C. Shin (1978) Modified policy iteration algorithms for discounted Markov decision problems. Management Science 24 (11), pp. 1127–1137. Cited by: §3.1.
  • M. L. Puterman (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., USA. External Links: ISBN 0471619779 Cited by: Appendix B, Appendix B, Appendix E, §2, §2.
  • M. Rowland, W. Dabney, and R. Munos (2020) Adaptive trade-offs in off-policy learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §J.3, Appendix G, Table 1, §1, §3.2, §4, §5.1, §6.2.
  • B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist (2012) Approximate modified policy iteration. In Proceedings of the International Conference on Machine Learning, Cited by: §5.3.
  • B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist (2015) Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research 16, pp. 1629–1676. Cited by: §5.3.
  • B. Scherrer (2013) Performance bounds for policy iteration and application to the game of Tetris. Journal of Machine Learning Research 14 (1), pp. 1181–1227. Cited by: §5.2, §5.3.
  • B. Scherrer (2014) Approximate policy iteration schemes: a comparison. In Proceedings of the International Conference on Machine Learning, Cited by: Appendix B, §5.3, §8.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Cited by: §J.1, §6.1.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement Learning: An Introduction. 1st edition, MIT Press. Cited by: §4, Remark 1.
  • Y. Tassa, S. Tunyasuvunakool, A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. Lillicrap, and N. Heess (2020) dm_control: Software and Tasks for Continuous Control. External Links: 2006.12983 Cited by: §6, §7.2.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In Proceedings of the International Conference on Intelligent Robots and Systems, Cited by: §7.2.
  • N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, and M. Geist (2020) Leverage the average: an analysis of KL regularization in reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §8.
  • Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2017) Sample efficient actor-critic with experience replay. In Proceedings of the International Conference on Learning Representations, Cited by: §6.2.
  • C. J. C. H. Watkins (1989) Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK. Cited by: Table 1, §1, §3.2.
  • S. Zhang and R. S. Sutton (2017) A deeper look at experience replay. In NeurIPS Workshop on Deep Reinforcement Learning, Cited by: §8.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: Appendix I, §7.3.

Appendix A Preliminaries for Theoretical Analyses

In this appendix, we explain important notions we used in our theoretical analyses.

Contraction and Monotonicity of Operators.

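An operator $O$ from a normed space $(F, \|\cdot\|_F)$ to another normed space $(G, \|\cdot\|_G)$ is said to be a contraction if there is a constant $c \in [0, 1)$ such that $\|Of - Of'\|_G \le c \|f - f'\|_F$ for any $f, f' \in F$. This constant is sometimes called the modulus. For example, the Bellman operator $T^\pi$ is a contraction with modulus $\gamma$. In the main text, we usually meant a contraction under the supremum norm $\|\cdot\|_\infty$ and did not always mention which norm is considered.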

A related notion is a non-expansion. If an operator $O$ satisfies only $\|Of - Of'\|_G \le \|f - f'\|_F$, it is said to be a non-expansion. For example, $P^\pi$ is a non-expansion, as proven later.
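As a worked illustration of how the two notions interact, assume the standard notation $T^\pi Q := r + \gamma P^\pi Q$ for the Bellman operator, with discount factor $\gamma \in [0, 1)$, and assume that $P^\pi$ is a non-expansion under the supremum norm. The reward terms then cancel, and the contraction property follows in one line. The symbols $Q$, $Q'$, $r$, and $\gamma$ are assumed here for the sake of the sketch.

```latex
\begin{align*}
\|T^\pi Q - T^\pi Q'\|_\infty
  &= \|(r + \gamma P^\pi Q) - (r + \gamma P^\pi Q')\|_\infty \\
  &= \gamma \, \|P^\pi (Q - Q')\|_\infty \\
  &\le \gamma \, \|Q - Q'\|_\infty .
\end{align*}
```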

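Monotonicity is probably the most important property in our analyses. An operator $O$ is said to be monotone if $OQ \le OQ'$ for any $Q$ and $Q'$ satisfying $Q \le Q'$. For example, $P^\pi$ is monotone: if $Q \le Q'$ (pointwise, i.e., at every $(x, a)$), $P^\pi Q \le P^\pi Q'$ holds too, as one can easily confirm from

$$(P^\pi Q')(x, a) - (P^\pi Q)(x, a) = \sum_{y} P(y|x, a) \sum_{b} \pi(b|y) \left[ Q'(y, b) - Q(y, b) \right] \ge 0.$$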

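Let $e$ be a constant function taking the value $1$ everywhere. If a linear operator $L$ is monotone and satisfies $Le = ce$ with a scalar $c$, we have $\|LQ\|_\infty \le c \|Q\|_\infty$ for any $Q$. Indeed,

$$-\|Q\|_\infty e \le Q \le \|Q\|_\infty e$$

imply $-c \|Q\|_\infty e \le LQ \le c \|Q\|_\infty e$. Thus, $P^\pi$ is non-expansive as $P^\pi e = e$. Note that is also a non-expansive operator for any , as one can easily confirm.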
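The properties discussed in this appendix can be checked numerically on a small random finite MDP. The sketch below is only an illustration under assumed notation: `P[x, a, y]` is a transition kernel, `pi[y, b]` a policy, `apply_P_pi` the induced linear operator on Q-functions, and `apply_T_pi` the corresponding Bellman operator with discount `gamma`. None of these names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 5, 3, 0.9

# Random transition kernel P[x, a, y] and policy pi[y, b]; each row sums to one.
P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((num_states, num_actions))
pi /= pi.sum(axis=1, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=(num_states, num_actions))

def apply_P_pi(Q):
    """(P^pi Q)(x, a) = sum_y P(y|x,a) sum_b pi(b|y) Q(y, b)."""
    v = (pi * Q).sum(axis=1)  # expected value under pi at each next state
    return P @ v              # shape (num_states, num_actions)

def apply_T_pi(Q):
    """Bellman operator T^pi Q = r + gamma * P^pi Q."""
    return r + gamma * apply_P_pi(Q)

def sup_norm(Q):
    return np.abs(Q).max()

Q1 = rng.normal(size=(num_states, num_actions))
Q2 = Q1 + rng.random((num_states, num_actions))  # Q2 >= Q1 pointwise

tol = 1e-12  # slack for floating-point rounding
preserves_constants = np.allclose(apply_P_pi(np.ones_like(Q1)), 1.0)
monotone = bool(np.all(apply_P_pi(Q1) <= apply_P_pi(Q2) + tol))
non_expansive = sup_norm(apply_P_pi(Q1) - apply_P_pi(Q2)) <= sup_norm(Q1 - Q2) + tol
contraction = sup_norm(apply_T_pi(Q1) - apply_T_pi(Q2)) <= gamma * sup_norm(Q1 - Q2) + tol

print(preserves_constants, monotone, non_expansive, contraction)
```

A random check of this kind does not prove anything, but it is a cheap sanity test when implementing operators such as these.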

Appendix B On an Extension of Theoretical Results to Continuous Action Spaces

In this appendix, we explain how to extend our theoretical results to the case where both the state and action spaces are continuous. We mainly follow Appendix B of Puterman (1994), to which we refer interested readers.


Let and be Polish spaces. We denote by the set of all Borel-measurable functions from to a bounded closed interval , where ; throughout this appendix, the Borel σ-algebra is always considered. We denote by the set of all Borel probability measures on . We say that a real-valued function on is upper semicontinuous (usc) at a point if for any sequence of points converging to . We say that is usc if it is usc at every point. We denote by the set of all usc functions from to a bounded closed interval , where . We say that a stochastic kernel is continuous if for any bounded continuous function and any sequence of points converging to .
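For concreteness, the two definitions at the end of the paragraph above can be written in their standard textbook form. The symbols are assumptions made for this sketch: $f$ denotes the function in question, $x_n \to x$ a convergent sequence of points, and $P(\cdot \mid x)$ a stochastic kernel.

```latex
\begin{align*}
\text{(usc at } x\text{)} \quad
  & \limsup_{n \to \infty} f(x_n) \le f(x), \\
\text{(continuous kernel)} \quad
  & \lim_{n \to \infty} \int f(y)\, P(dy \mid x_n) = \int f(y)\, P(dy \mid x).
\end{align*}
```

For example, the indicator function of a closed set is usc but in general not continuous, which is why usc functions form a strictly larger class than continuous ones.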

Main Discussion.

We impose the following assumption on MDPs. It is necessary to guarantee that all functions in the analyses are usc, as we shall explain soon.

Assumption 6.

The state and action spaces are compact subsets of finite-dimensional Euclidean spaces equipped with Borel σ-algebras. The reward function is a usc function bounded by , and the state transition probability kernel is continuous.

We first explain that there exists an optimal policy that is a measurable function from the state space to the action space . Let . We denote by the max operator defined by for any . Theorem B.5 in Puterman (1994) guarantees that is usc. Furthermore, Proposition B.4 in Puterman (1994) guarantees that is usc. It is easy to confirm that both and are bounded by . Since a sum of usc functions is again usc (Puterman, 1994, Proposition B.1.a), belongs to . Suppose the recursion . Proposition B.1.e in Puterman (1994) guarantees that is usc. Proposition B.4 in Puterman (1994) guarantees that there exists a measurable function such that . Accordingly, there exists an optimal policy that is a measurable function from to .

From the above discussion, it is easy to confirm that all in the exact version of PQL (3) belong to given that the behavior policy is continuous. Therefore, the proof of Theorem 2 in Appendix E is valid under the assumption that is continuous. We note that it is a weak assumption because the behavior policy is often continuous in practice. Indeed, an action distribution

is frequently a normal distribution whose mean and diagonal covariance matrix are continuous functions of a state

expressed by, for example, neural networks. As a result, as long as all elements of the diagonal covariance matrix are bounded from below by some constant, the probability density function of

is bounded. Therefore, the dominated convergence theorem can be used to show that is continuous. When there is an element of the diagonal covariance matrix converging to , this argument does not hold. However, it is a pathological case that usual implementations, such as SpinningUp (Achiam, 2018), try to avoid by value clipping.
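The boundedness argument can be illustrated with a short sketch. It assumes a diagonal Gaussian policy whose log standard deviations are clipped to a fixed range; the range `[LOG_STD_MIN, LOG_STD_MAX]` and the helper names below are hypothetical choices for illustration only. Once every standard deviation is at least `exp(LOG_STD_MIN)`, the density is bounded by a constant that depends only on the action dimension.

```python
import numpy as np

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0  # assumed clipping range, for illustration

def clipped_std(raw_log_std):
    """Map unconstrained network outputs to bounded standard deviations."""
    return np.exp(np.clip(raw_log_std, LOG_STD_MIN, LOG_STD_MAX))

def gaussian_density(action, mean, std):
    """Density of a diagonal Gaussian N(mean, diag(std**2)) at `action`."""
    z = (action - mean) / std
    return float(np.prod(np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))))

def density_upper_bound(dim):
    """Largest possible density value when every std is >= exp(LOG_STD_MIN)."""
    sigma_min = np.exp(LOG_STD_MIN)
    return (1.0 / (sigma_min * np.sqrt(2.0 * np.pi))) ** dim

# Even an extremely negative raw output cannot push the std below the floor,
# so the density at the mean (its maximum) stays below the uniform bound.
mean = np.array([0.1, -0.4])
std = clipped_std(np.array([-100.0, 0.3]))
peak = gaussian_density(mean, mean, std)
print(peak <= density_upper_bound(dim=2))
```

Without the clipping step, the first standard deviation in this example would be `exp(-100)`, and the peak density would grow without any uniform bound as the raw output decreases, which corresponds to the pathological case mentioned above.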

For other theoretical results, we need two additional assumptions: (i) all behavior policies and are continuous, and (ii) all error functions belong to . As for the assumption (i), it is a weak assumption as noted above. (See also the following paragraph on the relaxation of ’s exact greediness.) As for the assumption (ii), it is also a weak assumption: because approximates , there is no strong reason to use a function approximator that does not belong to ; using a function approximator belonging to guarantees that belongs to . Similar arguments can be made even when the behavior policy is updated, and we can conclude that these assumptions are weak.

We finally mention how to relax the exact greedy assumption that . When the action space is continuous, it is not feasible to find an exact greedy policy even if is continuous. In addition, it is often the case that a policy is expressed by a neural network. However, it is relatively straightforward to extend our theoretical analyses to a case where this exact greedy assumption is relaxed to a -greedy assumption, that is, , where . A similar near-greedy condition is found in, for example, Scherrer (2014).
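A near-greedy condition of this kind can be checked directly in a finite-action setting. The sketch below assumes one common form of the condition, namely that the expected Q-value under the policy is within a tolerance `eps` of the maximal Q-value at every state; this specific form, the function name `is_near_greedy`, and the example numbers are assumptions for illustration and may differ from the exact condition used in the analysis.

```python
import numpy as np

def is_near_greedy(policy_probs, q_values, eps):
    """Check sum_a pi(a|x) Q(x, a) >= max_a Q(x, a) - eps at every state x.

    policy_probs: shape (num_states, num_actions), rows sum to one.
    q_values:     shape (num_states, num_actions).
    eps:          non-negative tolerance; eps = 0 means exactly greedy.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    expected_q = (policy_probs * q_values).sum(axis=1)
    greedy_q = q_values.max(axis=1)
    return bool(np.all(expected_q >= greedy_q - eps))

q = np.array([[1.0, 0.0], [0.0, 2.0]])
greedy = np.array([[1.0, 0.0], [0.0, 1.0]])
smoothed = np.array([[0.9, 0.1], [0.1, 0.9]])

print(is_near_greedy(greedy, q, eps=0.0))
print(is_near_greedy(smoothed, q, eps=0.25))
print(is_near_greedy(smoothed, q, eps=0.15))
```

The smoothed policy loses at most 0.2 in expected Q-value relative to the greedy one, so it passes the check with `eps=0.25` but not with `eps=0.15`.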

Appendix C A Proof of Lemma 1 (Different Forms of the PQL Operator)

In this appendix, we prove Lemma 1, which provides the following forms of the PQL operator:

We first recall the original PQL operator (1): . Note that each term in the sum can be rewritten as . Therefore,

Note that


The right hand side can be rewritten as follows: