1 Introduction
Q-learning is a canonical algorithm in reinforcement learning (RL) (Watkins, 1989). It is a single-step algorithm, in that it only uses individual transitions to update value estimates. Many multi-step generalisations of Q-learning have been proposed, which allow temporally-extended trajectories to be used in the updating of values (Bertsekas and Ioffe, 1996; Watkins, 1989; Peng and Williams, 1994, 1996; Precup et al., 2000; Harutyunyan et al., 2016; Munos et al., 2016; Rowland et al., 2020), potentially leading to more efficient credit assignment. Indeed, multi-step algorithms have often been observed to outperform single-step algorithms for control in a variety of RL tasks (Mousavi et al., 2017; Harb and Precup, 2017; Hessel et al., 2018; Barth-Maron et al., 2018; Kapturowski et al., 2018; Daley and Amato, 2019).
However, using multi-step algorithms for RL comes with both theoretical and practical difficulties. The discrepancy between the policy that generated the data to be learnt from (the behavior policy) and the policy being learnt about (the target policy) can lead to complex, non-convergent behavior in these algorithms, and so must be considered carefully. There are two main approaches to deal with this discrepancy (cf. Table 1). Conservative methods ensure convergence is guaranteed no matter what behavior policy is used, typically by truncating the trajectories used for learning. By contrast, non-conservative methods typically do not truncate trajectories, and as a result do not come with generic convergence guarantees. Nevertheless, non-conservative methods have consistently been found to outperform conservative methods in practical large-scale applications. Thus, there is a clear gap in our understanding of non-conservative methods: why do they work so well in practice, yet lack the guarantees of their conservative counterparts?
Algorithm | Conservative | Convergence | Convergence to $Q^*$
--- | --- | --- | ---
$\alpha$-trace (Rowland et al., 2020) | No | ? | ?
C-trace (Rowland et al., 2020) | No | ? | ?
HQL (Harutyunyan et al., 2016) | No | ✓ (with small $\lambda$) | ✓ (with small $\lambda$)
Retrace (Munos et al., 2016) | Yes | ✓ | ✓
TBL (Precup et al., 2000) | Yes | ✓ | ✓
Uncorrected $n$-step Return | No | ? | ?
WQL (Watkins, 1989) | Yes | ✓ | ✓
PQL (Peng and Williams, 1994) | No | ✓ (biased) | ✓ (cf. caption)
In this paper, we address this question by studying a representative non-conservative algorithm, Peng's Q($\lambda$) (Peng and Williams, 1994, 1996, PQL), in more realistic learning settings. Our results show that while PQL does not learn optimal policies under arbitrary behavior policies, a convergence guarantee can be recovered if the behavior policy tracks the target policy, as is often the case in practice. This narrows the gap between the strong empirical performance of non-conservative methods and their previous lack of theoretical guarantees.
More concretely, our primary theoretical contributions bring new understanding to PQL, and are summarized as follows:

• A proof that PQL with a fixed behavior policy converges to a "biased" (i.e., different from $Q^*$) fixed point.
• Analysis of the quality of the resulting policy.
• Convergence of PQL to an optimal policy when using appropriate behavior policy updates.
• Error propagation analysis when using approximations.

In addition to these theoretical insights, we validate the empirical performance of PQL through extensive experiments. Our focus is on continuous control tasks, where one encounters many technical challenges that do not exist in discrete control tasks (cf. Section 7.2). Such tasks are also accessible to a wider range of readers. We show that PQL can be easily extended to popular off-policy actor-critic algorithms such as DDPG, TD3 and SAC (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018). Over a large subset of tasks, PQL consistently outperforms other conservative and non-conservative baseline alternatives.
2 Notation and Definitions
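For a finite set $S$ and an arbitrary set $T$, we let $\Delta(S)$ and $T^S$ be the probability simplex over $S$ and the set of all mappings from $S$ to $T$, respectively.
Markov Decision Processes (MDP).
We consider an MDP defined by a tuple $(\mathcal{X}, \mathcal{A}, P, P_0, \mathcal{R}, \gamma)$, where $\mathcal{X}$ is the finite state space, $\mathcal{A}$ the finite action space, $P: \mathcal{X} \times \mathcal{A} \to \Delta(\mathcal{X})$ the state transition probability kernel, $P_0 \in \Delta(\mathcal{X})$ the initial state distribution, $\mathcal{R}$ the (conditional) reward distribution, and $\gamma \in [0, 1)$ the discount factor (Puterman, 1994). We let $r$ be a reward function defined by $r(x, a) := \mathbb{E}[R \mid x, a]$ with $R \sim \mathcal{R}(\cdot \mid x, a)$.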
On the Finiteness of the State and Action Spaces.
While we assume both the state space $\mathcal{X}$ and the action space $\mathcal{A}$ to be finite, most of the theoretical results in the paper hold in continuous state spaces with appropriate measure-theoretic considerations. The finiteness assumption on the action space is necessary to guarantee the existence of the optimal policy (Puterman, 1994). In Appendix B, we discuss assumptions necessary to extend our theoretical results to continuous action spaces.
Policy and Value Functions.
Suppose a policy $\pi: \mathcal{X} \to \Delta(\mathcal{A})$. We consider the standard RL setup where an agent interacts with an environment, generating a sequence of state-action-reward tuples $(X_t, A_t, R_t)_{t \ge 0}$ with $A_t$ being an action sampled from some policy; throughout, we denote random variables by upper-case letters. Define $G := \sum_{t \ge 0} \gamma^t R_t$ as the cumulative return. The state-value and Q-functions are defined by $V^\pi(x) := \mathbb{E}^\pi[G \mid X_0 = x]$ and $Q^\pi(x, a) := \mathbb{E}^\pi[G \mid X_0 = x, A_0 = a]$, respectively, where the conditioning by $\pi$ means $A_t \sim \pi(\cdot \mid X_t)$.
Evaluation and Control.
Two key tasks in RL are evaluation and control. The problem of evaluation is to learn the Q-function of a fixed policy. The aim in the control setting is to learn an optimal policy $\pi_*$, defined so as to satisfy $V^{\pi_*} \ge V^\pi$ for any policy $\pi$ (the inequality is pointwise, i.e., $V^{\pi_*}(x) \ge V^\pi(x)$ for all $x$). Similarly to $Q^\pi$, we let $Q^*$ denote the optimal Q-function $Q^{\pi_*}$. As a greedy policy with respect to $Q^*$ is optimal, it suffices to learn $Q^*$. In this paper, we are particularly interested in the off-policy control setting, where an agent collects data with a behavior policy $\mu$, which is not necessarily the agent's current policy $\pi$. On-policy settings are a special case where $\mu = \pi$.
3 Multi-step RL Algorithms and Operators
Operators play a crucial role in RL since all value-based RL algorithms (exactly or approximately) update a Q-function based on the recursion $Q_{k+1} := \mathcal{O} Q_k$, where $\mathcal{O}$ is an operator that characterizes each algorithm. In this section, we review multi-step RL algorithms and their operators.
Basic Operators.
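Assume we have a fixed policy $\pi$. With an abuse of notation, we define operators $P$ and $\pi$ by
$$(PV)(x, a) := \sum_{y} P(y \mid x, a) V(y), \qquad (\pi Q)(x) := \sum_{a} \pi(a \mid x) Q(x, a)$$
for any $V \in \mathbb{R}^{\mathcal{X}}$ and $Q \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, respectively (hereafter, we omit "for any…" in definitions of operators for brevity). We define their composite $P^\pi := P\pi$. As a result, the Bellman operator $T^\pi$ is defined by $T^\pi Q := r + \gamma P^\pi Q$. For a function $Q$, we let $\mathcal{G}(Q)$ be the set of all greedy policies with respect to $Q$ (there may be multiple greedy policies due to ties). The Bellman optimality operator $T$ is defined by $TQ := T^{\pi_Q} Q$ with $\pi_Q \in \mathcal{G}(Q)$ (this definition is independent of the choice of $\pi_Q$). Q-learning approximates the value iteration (VI) updates $Q_{k+1} := T Q_k$.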
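The operators above can be illustrated with a minimal tabular sketch. It assumes a transition tensor `P` of shape (states, actions, states), a reward table `r` and a policy table `pi` of shape (states, actions); the helper `random_mdp` and all function names are hypothetical and only serve the illustration.

```python
import numpy as np

def bellman_operator(Q, P, r, pi, gamma):
    """T^pi Q = r + gamma * P (pi Q) for tabular Q, r, pi of shape (X, A)."""
    V = (pi * Q).sum(axis=1)        # (pi Q)(x) = sum_a pi(a|x) Q(x, a)
    return r + gamma * P @ V        # (P V)(x, a) = sum_y P(y|x, a) V(y)

def bellman_optimality_operator(Q, P, r, gamma):
    """T Q = T^pi Q for any pi that is greedy with respect to Q."""
    return r + gamma * P @ Q.max(axis=1)

def value_iteration(P, r, gamma, num_iters=500):
    """Repeatedly apply Q <- T Q starting from Q = 0."""
    Q = np.zeros_like(r)
    for _ in range(num_iters):
        Q = bellman_optimality_operator(Q, P, r, gamma)
    return Q

def random_mdp(num_states, num_actions, seed=0):
    """Hypothetical random MDP used only to exercise the operators."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)   # normalize each P(.|x, a)
    r = rng.random((num_states, num_actions))
    return P, r
```

With `P, r = random_mdp(4, 3)`, the output of `value_iteration(P, r, 0.9)` should be numerically unchanged by one more application of `bellman_optimality_operator`, and also by `bellman_operator` with a one-hot greedy policy.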
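3.1 On-policy Multi-step Operators for Control
We first introduce on-policy multi-step operators for control.
Modified Policy Iteration (MPI).
MPI uses the recursion $Q_{k+1} := T_n^{\pi_k} Q_k$ for Q-function updates (Puterman and Shin, 1978), where $\pi_k \in \mathcal{G}(Q_k)$ is a greedy policy with respect to $Q_k$. The $n$-step return operator $T_n^\pi$ is defined by $T_n^\pi := (T^\pi)^n$, the $n$-fold composition of the Bellman operator $T^\pi$.
$\lambda$-Policy Iteration ($\lambda$-PI).
$\lambda$-PI uses the recursion $Q_{k+1} := T_\lambda^{\pi_k} Q_k$ for Q-function updates (Bertsekas and Ioffe, 1996), where $\pi_k \in \mathcal{G}(Q_k)$. The $\lambda$-return operator $T_\lambda^\pi$ is defined as
$$T_\lambda^\pi Q := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} T_n^\pi Q = Q + (I - \gamma\lambda P^\pi)^{-1} (T^\pi Q - Q),$$
where $\lambda \in [0, 1]$, and $(I - \gamma\lambda P^\pi)^{-1} := \sum_{t=0}^{\infty} (\gamma\lambda P^\pi)^t$.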
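As a hedged numerical illustration of the $\lambda$-return operator, the sketch below compares two standard ways of writing it for $\lambda < 1$: the geometric mixture of $n$-step returns, $(1-\lambda)\sum_{n \ge 1} \lambda^{n-1} (T^\pi)^n Q$, and the closed form $Q + (I - \gamma\lambda P^\pi)^{-1}(T^\pi Q - Q)$. The tabular shapes and function names are assumptions made for this example.

```python
import numpy as np

def policy_transition_matrix(P, pi):
    """P^pi as an (X*A, X*A) matrix: entry [(x,a),(y,b)] = P(y|x,a) * pi(b|y)."""
    X, A, _ = P.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(X * A, X * A)

def lambda_return_closed_form(Q, P, r, pi, gamma, lam):
    """Q + (I - gamma*lam*P^pi)^{-1} (T^pi Q - Q)."""
    X, A, _ = P.shape
    P_pi = policy_transition_matrix(P, pi)
    q, rew = Q.reshape(-1), r.reshape(-1)
    td = rew + gamma * P_pi @ q - q                     # T^pi Q - Q
    out = q + np.linalg.solve(np.eye(X * A) - gamma * lam * P_pi, td)
    return out.reshape(X, A)

def lambda_return_series(Q, P, r, pi, gamma, lam, num_terms=2000):
    """Truncated sum (1 - lam) * sum_n lam^(n-1) (T^pi)^n Q, for lam < 1."""
    X, A, _ = P.shape
    P_pi = policy_transition_matrix(P, pi)
    rew = r.reshape(-1)
    n_step = Q.reshape(-1).copy()
    total = np.zeros(X * A)
    for n in range(1, num_terms + 1):
        n_step = rew + gamma * P_pi @ n_step            # now equals (T^pi)^n Q
        total += (1 - lam) * lam ** (n - 1) * n_step
    return total.reshape(X, A)
```

On a small random MDP the two functions should agree up to truncation error, and setting `lam=0` in the closed form should recover the one-step Bellman backup.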
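3.2 Off-policy Multi-step Operators for Control
Next, we explain off-policy multi-step operators for control. We note that on-policy algorithms in the last subsection can be converted to off-policy versions by using importance sampling (Precup et al., 2000; Casella and Berger, 2002).
Uncorrected $n$-step Return.
Peng's Q($\lambda$) (PQL).
For a sequence of behavior policies $(\mu_k)_{k \ge 0}$, PQL uses the recursion $Q_{k+1} := T_\lambda^{\mu_k \pi_k} Q_k$ for Q-function updates (Peng and Williams, 1994, 1996), where $\pi_k \in \mathcal{G}(Q_k)$ is a greedy policy with respect to $Q_k$. Here, the PQL operator $T_\lambda^{\mu\pi}$ is defined for any policies $\mu$ and $\pi$ by
$$T_\lambda^{\mu\pi} Q := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (T^\mu)^{n-1} T^\pi Q, \tag{1}$$
where $\lambda \in [0, 1]$. Note that PQL is a generalization of $\lambda$-PI because it reduces to $\lambda$-PI when $\mu = \pi$. In other words, PQL is $\lambda$-PI with one additional degree of freedom in $\mu$.
General Retrace.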
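We next introduce a general version of the Retrace operator (Munos et al., 2016), from which other operators are obtained as special cases.
For a behavior policy $\mu$ and a target policy $\pi$, we let $P^{c\mu}$ be an operator defined by
$$(P^{c\mu} Q)(x, a) := \sum_{y} P(y \mid x, a) \sum_{b} c(y, b)\, \mu(b \mid y)\, Q(y, b),$$
where $c$ is an arbitrary non-negative function over $\mathcal{X} \times \mathcal{A}$ whose choice depends on the algorithm. Note that for any $c$, $P^{c\mu}$ can be estimated off-policy with data collected under the behavior policy $\mu$.
A general Retrace operator $R_\lambda^{c\mu\pi}$ is obtained by replacing $P^\pi$ of $(I - \gamma\lambda P^\pi)^{-1}$ in the $\lambda$-return operator with $P^{c\mu}$. Concretely,
$$R_\lambda^{c\mu\pi} Q := Q + (I - \gamma\lambda P^{c\mu})^{-1} (T^\pi Q - Q).$$
The general Retrace algorithm updates its Q-function by $Q_{k+1} := R_\lambda^{c_k \mu_k \pi_k} Q_k$, where $(c_k)_{k \ge 0}$ is a sequence of arbitrary non-negative functions over $\mathcal{X} \times \mathcal{A}$, $(\mu_k)_{k \ge 0}$ is an arbitrary sequence of behavior policies, and $(\pi_k)_{k \ge 0}$ is a sequence of target policies that depends on the algorithm. Given the choices of $c_k$ and $\pi_k$ in Table 2, we recover a few known algorithms (Watkins, 1989; Peng and Williams, 1994, 1996; Precup et al., 2000; Harutyunyan et al., 2016; Munos et al., 2016; Rowland et al., 2020).
The general Retrace algorithm is off-policy as $R_\lambda^{c\mu\pi} Q$ can be estimated off-policy by the following estimator given a trajectory $(x_t, a_t, r_t)_{t \ge 0}$ collected under $\mu$:
$$\hat{Q}(x_0, a_0) := Q(x_0, a_0) + \sum_{t=0}^{\infty} (\gamma\lambda)^t \Big( \prod_{u=1}^{t} c(x_u, a_u) \Big) \delta_t, \tag{2}$$
where $\prod_{u=1}^{0} c(x_u, a_u) := 1$, and $\delta_t := r_t + \gamma (\pi Q)(x_{t+1}) - Q(x_t, a_t)$ is the TD error at time step $t$.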
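A trace-weighted estimator of this kind is usually computed with a backward recursion over a finite trajectory. The sketch below assumes the estimator has the form $Q(x_0, a_0) + \sum_t (\gamma\lambda)^t (\prod_{u=1}^{t} c(x_u, a_u))\, \delta_t$ truncated at the end of the trajectory; the argument layout and the helper `truncated_importance_trace` (the familiar $\min\{1, \pi/\mu\}$ choice associated with Retrace) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def general_retrace_targets(rewards, q_taken, v_next, traces, gamma, lam):
    """Targets for every step of one trajectory via a backward recursion.

    rewards[t] : r_t
    q_taken[t] : Q(x_t, a_t)
    v_next[t]  : (pi Q)(x_{t+1}) = sum_a pi(a|x_{t+1}) Q(x_{t+1}, a)
    traces[t]  : c(x_t, a_t); traces[0] is never used
    """
    rewards = np.asarray(rewards, dtype=float)
    q_taken = np.asarray(q_taken, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    traces = np.asarray(traces, dtype=float)
    num_steps = len(rewards)
    td = rewards + gamma * v_next - q_taken          # TD errors delta_t
    targets = np.empty(num_steps)
    acc = 0.0                                        # trace-weighted sum of future TD errors
    for t in reversed(range(num_steps)):
        next_trace = traces[t + 1] if t + 1 < num_steps else 0.0
        acc = td[t] + gamma * lam * next_trace * acc
        targets[t] = q_taken[t] + acc
    return targets

def truncated_importance_trace(pi_probs, mu_probs):
    """Trace c = min(1, pi(a|x) / mu(a|x)), one common conservative choice."""
    return np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
```

Setting all traces to zero (or `lam=0`) reduces every target to the one-step backup `r_t + gamma * v_next[t]`, while traces equal to one never cut the trajectory.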
Algorithm | $c$ | $\pi$
--- | --- | ---
$\alpha$-trace | |
C-trace | |
HQL | |
Retrace | | Any
TBL | | Any
WQL | |
PQL | |
4 Conservative and Non-conservative Multi-step RL Algorithms
Munos et al. (2016) showed that the following conditions suffice for the convergence of the general Retrace to $Q^*$:

1. $c_k(x, a)\, \mu_k(a \mid x) \le \pi_k(a \mid x)$ for any $(x, a)$ and $k$.
2. $\pi_k$ satisfies some greediness condition, such as $\epsilon_k$-greediness with $\epsilon_k$ decreasing as $k$ increases; cf. Munos et al. (2016) for further details.

We call algorithms that satisfy the first condition conservative algorithms for reasons to be explained below. Otherwise, we call the algorithms non-conservative. See Table 1 for the classification of algorithms. The uncorrected $n$-step return algorithm can also be viewed as a non-conservative algorithm with non-Markovian traces that depend also on the past.
Conservativeness, Theoretical Guarantees, and Empirical Performance of Algorithms.
Recall that in the general Retrace update estimator (2), the effect of the TD error $\delta_t$ is attenuated by the product of traces $\prod_{u=1}^{t} c(x_u, a_u)$ in addition to $(\gamma\lambda)^t$. Hence, from the backward view (Sutton and Barto, 1998), the first condition intuitively requires that the trace must be cut if a sub-trajectory is unlikely under the target policy $\pi$ relative to the behavior policy $\mu$. As a result, conservative algorithms only carry out safe updates to Q-functions.
As shown by Munos et al. (2016), such conservative updates enable a convergence guarantee for general conservative algorithms. However, Rowland et al. (2020) observed that they often result in frequent trace cuts, so that conservative algorithms usually benefit less from multi-step updates.
In contrast, non-conservative algorithms accumulate TD errors without carefully cutting traces. As a result, non-conservative algorithms might perform poorly. As we show later (Proposition 5), this is the case at least for Harutyunyan's Q($\lambda$) (HQL; Harutyunyan et al., 2016), an instance of non-conservative algorithms, when the behavior policy is fixed. Nonetheless, non-conservative algorithms are known to perform well in practice (Hessel et al., 2018; Kapturowski et al., 2018; Daley and Amato, 2019). To understand why, it is important to characterize what kind of updates to the behavior policy entail the convergence of the overall algorithm. In the following sections, we take a step in this direction. We establish the convergence guarantee of PQL under two setups: (1) when the behavior policy is fixed; (2) when the behavior policy is updated in an appropriate way.
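5 Theoretical Analysis of Peng's Q($\lambda$)
In this section, we analyze Peng's Q($\lambda$). We start with the exact case where there are no update errors in value functions. Later, we will consider the approximate case, accounting for update errors. The following lemma is particularly useful in theoretical analyses as well as practical implementations.
Lemma 1 (Harutyunyan et al., 2016).
The PQL operator $T_\lambda^{\mu\pi}$ can be rewritten in the following forms:
$$T_\lambda^{\mu\pi} Q = Q + (I - \gamma\lambda P^\mu)^{-1} \big( T^{\lambda\mu + (1-\lambda)\pi} Q - Q \big) = (I - \gamma\lambda P^\mu)^{-1} \big( r + \gamma (1 - \lambda) P^\pi Q \big).$$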
Proof.
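One way to see why such a rewriting holds is sketched below. It assumes the PQL operator is the geometric mixture $T_\lambda^{\mu\pi} Q = (1-\lambda) \sum_{n \ge 1} \lambda^{n-1} (T^\mu)^{n-1} T^\pi Q$ with $\lambda \in [0, 1)$ and $T^\pi Q = r + \gamma P^\pi Q$; it is a reader's sketch under those assumptions and not necessarily the argument of Harutyunyan et al. (2016).

```latex
\begin{align*}
(T^\mu)^{n-1} T^\pi Q
  &= \sum_{t=0}^{n-1} (\gamma P^\mu)^t r + (\gamma P^\mu)^{n-1} \gamma P^\pi Q, \\
(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \sum_{t=0}^{n-1} (\gamma P^\mu)^t r
  &= \sum_{t=0}^{\infty} (\gamma\lambda P^\mu)^t r
   = (I - \gamma\lambda P^\mu)^{-1} r, \\
(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (\gamma P^\mu)^{n-1} \gamma P^\pi Q
  &= (1-\lambda) (I - \gamma\lambda P^\mu)^{-1} \gamma P^\pi Q, \\
\text{hence}\quad T_\lambda^{\mu\pi} Q
  &= (I - \gamma\lambda P^\mu)^{-1} \bigl( r + \gamma (1-\lambda) P^\pi Q \bigr) \\
  &= Q + (I - \gamma\lambda P^\mu)^{-1} \bigl( T^{\lambda\mu + (1-\lambda)\pi} Q - Q \bigr).
\end{align*}
```

The last equality uses $P^{\lambda\mu + (1-\lambda)\pi} = \lambda P^\mu + (1-\lambda) P^\pi$, so that $T^{\lambda\mu + (1-\lambda)\pi} Q - Q = r + \gamma(1-\lambda) P^\pi Q - (I - \gamma\lambda P^\mu) Q$.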
5.1 Exact Case with a Fixed Behavior Policy
We now analyze PQL with a fixed behavior policy $\mu$. While the behavior policy is not fixed in a practical situation, the analysis shows a trade-off between bias and convergence rate. This trade-off is analogous to the bias-contraction-rate trade-off of off-policy multi-step algorithms for policy evaluation (Rowland et al., 2020) and sheds some light on important properties of PQL.
Concretely, we analyze the following algorithm:
$$Q_{k+1} := T_\lambda^{\mu\pi_k} Q_k, \quad \text{where } \pi_k \in \mathcal{G}(Q_k) \text{ is greedy with respect to } Q_k. \tag{3}$$
Harutyunyan et al. (2016) have proven that a fixed point of the PQL operator $T_\lambda^{\mu\pi}$ coincides with the unique fixed point of $T^{\lambda\mu + (1-\lambda)\pi}$, which is guaranteed to exist since $T^{\lambda\mu + (1-\lambda)\pi}$ is a contraction with modulus $\gamma$ under the $L^\infty$-norm (see Appendix A for details about the contraction and other notions).
The existence of a fixed point does not imply the convergence of PQL, and we need to show that the distance between $Q_k$ and the fixed point is decreasing. With the following theorem, we show that PQL does converge.
Theorem 2.
Let $\pi_*$ be a policy such that $Q^{\lambda\mu + (1-\lambda)\pi_*} \ge Q^{\lambda\mu + (1-\lambda)\pi}$ for any policy $\pi$, where the inequality is pointwise. Then, $\pi_* \in \mathcal{G}(Q^{\lambda\mu + (1-\lambda)\pi_*})$, and $Q_k$ of PQL (3) uniformly converges to $Q^{\lambda\mu + (1-\lambda)\pi_*}$ with the rate $\beta^k$, where $\beta := \gamma(1-\lambda)/(1-\gamma\lambda)$.
Proof.
See Appendix E. ∎
We build intuition about the bias-convergence-rate trade-off implied by Theorem 2. When $\lambda$ increases, the fixed point is $Q^{\lambda\mu + (1-\lambda)\pi_*}$, whose bias against $Q^*$ arguably increases; at the same time, the contraction rate $\beta$ decreases, so that the contraction is faster.
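The trade-off can be probed numerically with a small tabular sketch. It assumes the closed form $(I - \gamma\lambda P^\mu)^{-1}(r + \gamma(1-\lambda) P \max_a Q)$ for the PQL operator with a greedy target policy and a contraction modulus $\beta = \gamma(1-\lambda)/(1-\gamma\lambda)$; the random MDP, the uniform behavior policy in the usage note, and all function names are hypothetical.

```python
import numpy as np

def pql_operator(Q, P, r, mu, gamma, lam):
    """(I - gamma*lam*P^mu)^{-1} (r + gamma*(1-lam) * P max_a Q), tabular case."""
    X, A, _ = P.shape
    # P^mu as an (X*A, X*A) matrix: entry [(x,a),(y,b)] = P(y|x,a) * mu(b|y)
    P_mu = (P[:, :, :, None] * mu[None, None, :, :]).reshape(X * A, X * A)
    rhs = (r + gamma * (1 - lam) * P @ Q.max(axis=1)).reshape(-1)
    return np.linalg.solve(np.eye(X * A) - gamma * lam * P_mu, rhs).reshape(X, A)

def run_pql(P, r, mu, gamma, lam, num_iters=200):
    """Iterate Q <- PQL operator applied to Q with a fixed behavior policy mu."""
    Q = np.zeros_like(r)
    gaps = []                                  # sup-norm change per iteration
    for _ in range(num_iters):
        Q_next = pql_operator(Q, P, r, mu, gamma, lam)
        gaps.append(np.abs(Q_next - Q).max())
        Q = Q_next
    return Q, np.array(gaps)

def q_of_policy(P, r, pi, gamma):
    """Q^pi obtained by solving (I - gamma*P^pi) Q = r."""
    X, A, _ = P.shape
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(X * A, X * A)
    return np.linalg.solve(np.eye(X * A) - gamma * P_pi, r.reshape(-1)).reshape(X, A)
```

Running `run_pql` with a uniform `mu` on a random MDP, successive sup-norm gaps should shrink by at least the factor $\beta$, and the limit should match `q_of_policy` evaluated at the mixture $\lambda\mu + (1-\lambda)\pi$ with $\pi$ greedy with respect to the limit.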
Remark 1.
In Section 7.6 of Sutton and Barto (1998), it is conjectured that PQL with a fixed policy would converge to a hybrid of $Q^\mu$ and $Q^*$. Theorem 2 gives an answer to this conjecture and shows that it is not necessarily true. Rather, the theorem shows that PQL converges to the Q-function of the best policy among policies of the form $\lambda\mu + (1-\lambda)\pi$.
5.2 Approximate Case with a Fixed Behavior Policy
In practice, value-update errors are inevitable due to, e.g., finite-sample estimation and function approximation errors. In this subsection, we provide the error propagation analysis of PQL with a fixed behavior policy. As we will see, the analysis reveals a trade-off between fixed-point bias and error tolerance.
We analyze the following algorithm:
where denotes the valueupdate error at iteration . For simplicity, we use and in this subsection.
In Section 5.1, we showed when at every , and . Therefore, is an approximation to , and thus it is natural to define as the loss of using rather than . The following theorem provides an upper bound for the loss.
Theorem 3.
For any , the following holds:
where is the norm defined for any realvalued function by .
Proof.
See Appendix G. ∎
As we have already explained the bias-convergence-rate trade-off, for now we ignore the term and focus on the error term. For simplicity, we assume for every . Then,
In contrast, an analogous result of $\lambda$-PI is (Scherrer, 2013). When $\lambda = 0$, these results coincide, which is expected since both $\lambda$-PI and PQL degenerate to value iteration. When $\lambda = 1$, PQL's error dependency is , which is significantly better than . However, in this case PQL is completely biased and converges to $Q^\mu$. At intermediate values of $\lambda$, PQL trades off error tolerance against bias.
5.3 Approximate Case with Behavior Policy Updates
Previously, we have analyzed PQL with a fixed behavior policy. However, in practice, the behavior policy is updated along with the target policy. Besides, value-update errors are inevitable in complex tasks. As a result, PQL may behave quite differently in a practical scenario. This motivates our analysis of the following algorithm, which updates the behavior policy after each application of the PQL operator (in Appendix F, we analyze a case where the behavior policy is updated after multiple applications of the PQL operator):
(4)  
where , and . Note that when , this algorithm reduces to $\lambda$-PI as a special case. Though this behavior policy update closely resembles that of conservative policy iteration (Kakade and Langford, 2002), here we require .
This algorithm has the following performance guarantee.
Theorem 4.
For any , the following holds:
where . Hence, PQL with behavior policy updates converges to the optimal policy with the rate .
Proof.
See Appendix H. ∎
The first term on the right-hand side shows the convergence of PQL with behavior policy updates in an exact case, i.e., for any . It states that the fastest convergence rate is (achieved when ), which is the same as the convergence rate of VI (Munos, 2005), policy iteration (Munos, 2003), MPI (Scherrer et al., 2012, 2015), and $\lambda$-PI (Scherrer, 2013). When , the convergence rate coincides with that of conservative policy iteration (Scherrer, 2014). However, we are not aware of a similar result for conservative $\lambda$-PI, which would be an analogue of PQL considered here. Theorem 4 also provides the error dependency of PQL (the second term on the right-hand side). It coincides with the previous result of the above algorithms when , as one would expect, since PQL with is precisely $\lambda$-PI. Nonetheless, PQL allows some degree of off-policyness when .
5.4 Oscillatory Behavior of HQL
In this section, we have proven the convergence of exact PQL (i.e., no value-update errors). However, the following proposition shows that exact HQL, an instance of non-conservative algorithms, does not converge in an MDP when the behavior policy is fixed. Nonetheless, in the same MDP, setting the behavior policy to a greedy policy guarantees the convergence.
Proposition 5.
There is an MDP such that when exact HQL is run with a fixed policy for all , , and , HQL's Q-function oscillates between two functions, and its greedy policy oscillates between optimal and suboptimal policies. In contrast, if , HQL converges to an optimal policy.
Proof.
While this result is specialized to HQL, it sheds light on an important aspect of non-conservative algorithms in general:
While non-conservative algorithms may perform poorly when the behavior policy is fixed, they may converge to $Q^*$ when the behavior policy is updated.
The above captures a critical aspect of how algorithms behave in practice, where the behavior policy is continuously updated.
6 Deep RL Implementations
We next show that Peng's Q($\lambda$) can be conveniently implemented with established off-policy deep RL algorithms. Our experiments focus on continuous control problems where the action space $\mathcal{A}$ is continuous. A primary motivation for considering continuous control benchmarks (e.g., Brockman et al., 2016; Tassa et al., 2020) is that they are usually more accessible to a wider RL research community, compared to challenging discrete control benchmarks such as Atari games (Bellemare et al., 2013).
6.1 Off-policy Actor-critic Algorithms
Off-policy actor-critic algorithms maintain a policy $\pi_\theta$ with parameter $\theta$ and a Q-function critic $Q_\phi$ with parameter $\phi$. For the policy, a popular choice is the point mass distribution $\pi_\theta(\cdot \mid x) = \delta_{\pi_\theta(x)}$, where $\pi_\theta(x) \in \mathcal{A}$ (Lillicrap et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018). The algorithm collects data with an exploratory behavior policy $\mu$ and saves tuples $(x, a, r, x')$ into a replay buffer $\mathcal{D}$. At each training iteration, the critic is updated by minimizing squared errors against a Q-function target. The policy is updated via the deterministic policy gradient (Silver et al., 2014). See further details in Appendix J.
6.2 Implementations of Multi-step Operators
While estimates of the one-step target $T^\pi Q$ are arguably the simplest to implement, they look ahead only one step. Usually, learning can be significantly sped up when the targets are constructed with multi-step operators (see, e.g., empirical examples in Hessel et al. (2018), Barth-Maron et al. (2018) and Kapturowski et al. (2018), and theoretical insights in Rowland et al. (2020)). For example, the uncorrected $n$-step operator is estimated as follows (Hessel et al., 2018): given an $n$-step trajectory $(x_t, a_t, r_t)_{t=0}^{n}$, the target at $(x_0, a_0)$ is computed as $\sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n Q(x_n, \pi(x_n))$. Similar estimates can be derived for all multi-step operators introduced in Section 3, in particular Peng's Q($\lambda$). We present full details in Appendix J.
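The two kinds of targets can be sketched in a few lines. The code below assumes a deterministic target policy, so that bootstrap values `next_values[t]` stand for $Q(x_{t+1}, \pi(x_{t+1}))$, and uses the common recursive form of the Peng's Q($\lambda$) return, $G_t = r_t + \gamma[(1-\lambda) Q(x_{t+1}, \pi(x_{t+1})) + \lambda G_{t+1}]$, truncated at the end of the trajectory. Function names and argument layout are illustrative assumptions; the authors' implementation is described in their Appendix J.

```python
import numpy as np

def uncorrected_n_step_target(rewards, bootstrap_value, gamma):
    """r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value."""
    target = bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target

def peng_q_lambda_targets(rewards, next_values, gamma, lam):
    """Targets G_t for every step of one trajectory, without importance ratios.

    rewards[t]     : r_t
    next_values[t] : Q(x_{t+1}, pi(x_{t+1})), the bootstrap value after step t
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    num_steps = len(rewards)
    targets = np.empty(num_steps)
    g_next = next_values[-1]          # beyond the last step, fall back to the bootstrap
    for t in reversed(range(num_steps)):
        g_next = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g_next)
        targets[t] = g_next
    return targets
```

With `lam=0` every target reduces to the one-step target `r_t + gamma * next_values[t]`, and with `lam=1` the first target coincides with the uncorrected $n$-step target over the whole trajectory.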
Desirable empirical properties of Peng's Q($\lambda$).
The estimates of Peng's Q($\lambda$) do not require importance sampling ratios $\pi(a \mid x) / \mu(a \mid x)$. This is especially valuable for continuous control, where the policy could be deterministic, in which case algorithms such as Retrace (Munos et al., 2016) cut traces immediately. Even when policies are stochastic and traces based on IS ratios are not cut immediately, prior work suggests that the trace cuts are usually pessimistic, especially for high-dimensional action spaces (see, e.g., Wang et al. (2017) for implementation techniques to mitigate the issue).
7 Experiments
To build better intuition about Peng's Q($\lambda$), we start with tabular examples in Section 7.1. We will see that the empirical properties of Peng's Q($\lambda$) echo the theoretical analysis in previous sections. In Section 7.2, we evaluate Peng's Q($\lambda$) in deep RL contexts. We combine Peng's Q($\lambda$) with baseline deep RL algorithms and compare its performance against alternative operators.
7.1 A tabular example
Tree MDP.
We consider toy examples with a tree MDP of depth . The MDPs are binary trees, with each node corresponding to a state. Starting from any nonleaf state, the two actions transition the agent to one of its child nodes with probability one. Each episode lasts for steps and the agent always starts at the root node. The rewards are zero everywhere except at the leftmost leaf node and at the rightmost leaf node. The behavior policy is for all states .
Note that there is a suboptimal policy of collecting at the rightmost leaf. The behavior policy is by design biased towards taking right moves, such that it is easy for the agent to learn the suboptimal policy. The optimal policy is to take left moves and collect . Throughout training, we optimize the target policy while fixing the behavior policy . This echoes the theoretical setup in Section 5.2. See Appendix J for further details on the setup.
Results.
In Figure 1(a), we show the converged performance of different algorithms as a function of the MDP's tree depth . When , all algorithms achieve the optimal performance; when , as increases, the fixed point bias of Peng's Q($\lambda$) hurts the performance drastically. This is less severe for , whose performance decays less quickly. On the other hand, both Retrace and the one-step operator learn the optimal policy even for . However, when increases, it becomes difficult to sample the optimal trajectory, making it easy to get trapped with the suboptimal policy. As such, the sparse rewards make it difficult to learn meaningful Q-functions, unless the return signals get propagated effectively (i.e., traces are not cut). This is shown in Figure 1(a), where Peng's Q($\lambda$) with is the only baseline that achieves the suboptimal performance, while all other algorithms fail to learn anything.
Similar observations are made in Figure 1(b), where we compare Peng's Q($\lambda$) for various under (solid lines) and (dotted lines). Small corresponds to less bias in the Q-function fixed points, and should asymptotically converge to higher performance; on the other hand, large suffers suboptimality when is small, but gains a substantial advantage when is large.
7.2 Deep RL experiments
Evaluations.
We evaluate performance over environments with a number of different physics simulation backends, such as the MuJoCo-based (Todorov et al., 2012) DeepMind (DM) control suite (Tassa et al., 2020) and the open-source simulator Bullet physics (Coumans and Bai, 2016). Due to space limits, below we only show results for the DM control suite and provide a more complete set of evaluations in Appendix J.
Baseline comparison.
We use TD3 (Fujimoto et al., 2018) as the base algorithm. We compare with a few multi-step baselines: (1) one-step (also the base algorithm); (2) uncorrected $n$-step with a fixed $n$; (3) Peng's Q($\lambda$) with a fixed $\lambda$; (4) Retrace and C-trace. Among all baselines, the uncorrected $n$-step operator is the most commonly used non-conservative operator, while Retrace is a representative conservative operator. See Appendix J for more details. All algorithms are trained with a fixed number of steps and results are averaged across random seeds.
Standard benchmark results.
In the top row of Figure 2, we show evaluations on standard benchmarks. Across most tasks, Peng's Q($\lambda$) performs more stably than other baseline algorithms. We see that Peng's Q($\lambda$) generally learns as fast as other baselines, and in some cases significantly faster. Note that though Peng's Q($\lambda$) does not necessarily obtain the best learning performance on each task, it consistently ranks among the top two algorithms (with ties). This is in contrast to baseline algorithms whose performance rank might vary drastically across tasks. For example, the one-step TD3 performs well in CheetahRun while performing poorly in WalkerWalk. Also, both C-trace and Retrace generally perform significantly worse. We provide further analysis in Appendix J.
Sparse rewards results.
In the bottom row of Figure 2, we show evaluations on sparse reward variants of the benchmark tasks. See details on these environments in Appendix J. Sparse rewards are challenging for deep RL algorithms, as it is more difficult to numerically propagate learning signals across time steps. Accordingly, sparse rewards are natural benchmarks for operator-based algorithms. Across all tasks, Peng's Q($\lambda$) consistently outperforms other baselines. In a few cases, uncorrected $n$-step also outperforms the baseline TD3; we speculate that this is because the former propagates the learning signal more efficiently, which is critical for sparse rewards. Compared to uncorrected $n$-step, Peng's Q($\lambda$) seems to achieve a better trade-off between efficient propagation of learning signals and fixed point biases, which leads to relatively stable and consistent performance gains across all selected benchmark tasks.
7.3 Additional deep RL experiments
Maximum-entropy RL.
In Appendix I, we show how Peng's Q($\lambda$) can be extended to maximum-entropy RL (Ziebart et al., 2008; Fox et al., 2016; Haarnoja et al., 2017, 2018). We combine multi-step operators with maximum-entropy deep RL algorithms such as SAC (Haarnoja et al., 2018) and show performance gains over benchmark tasks. See Appendix J for further details.
Ablation study on $\lambda$.
In Appendix J, we provide an ablation study on the effect of $\lambda$. We show that the performance of Peng's Q($\lambda$) depends on the choice of $\lambda$. Nevertheless, we find that a single $\lambda$ can usually lead to fairly uniform performance gains across a large number of benchmarks.
8 Conclusion
In this paper, we have studied the non-conservative off-policy algorithm Peng's Q($\lambda$), and shown that while in the worst case its convergence guarantees are weaker than those of conservative algorithms such as Retrace, convergence guarantees to the optimal policy are recovered when the behavior policy closely tracks the target policy. This condition often holds when agents are trained through replay buffers, and the result serves to close the gap between the strong empirical performance observed with non-conservative algorithms in deep RL and their previous lack of theory.
We expect this to have several important consequences for deep RL theory and practice. Firstly, these results make clear that the degree of off-policyness is an important quantity that has real impact on the success of deep RL algorithms, and incorporating quantities related to this into the analysis of off-policy algorithms will be important for developing theoretical understanding of deep RL. Secondly, these findings add weight to growing empirical work highlighting that quantities such as replay buffer size and replay ratio are crucial to the success of deep RL agents (Zhang and Sutton, 2017; Daley and Amato, 2019; Fedus et al., 2020), and deserve further attention.
We believe the analysis presented in this paper is an important step towards a deeper understanding of non-conservative methods, and there are several open questions suitable for future work. For example, the convergence guarantee in Theorem 4 requires . However, we conjecture that this assumption can be lifted. Besides, while we did not analyze the concentrability coefficients of PQL, Scherrer (2014) reports that conservative policy iteration, which is analogous to PQL, has better concentrability coefficients. Finally, careful error propagation analyses of gap-increasing algorithms (Azar et al., 2012; Kozuno et al., 2019) and policy-update-regularized algorithms (Vieillard et al., 2020) show that slow policy updates confer stability against errors. In PQL with behavior policy updates, we expect a similar result when takes an intermediate value.
Acknowledgement
TK was supported by JSPS KAKENHI Grant Number 16H06563. TK thanks Prof. Kenji Doya, Dongqi Han, and Ho Ching Chiu at Okinawa Institute of Science and Technology (OIST) for their valuable comments. TK is also grateful for the research support of OIST to the Neural Computation Unit, where TK partially conducted this research. In particular, TK is thankful to OIST's Scientific Computation and Data Analysis section, which maintains a cluster we used for many of our experiments. YHT acknowledges the computational support from Google Cloud Platform.
References
 Spinning Up in Deep Reinforcement Learning. Cited by: §J.2, §J.2, Appendix B.

An Alternative Softmax Operator for Reinforcement Learning.
In
Proceedings of the International Conference on Machine Learning
, Cited by: Appendix I.  Dynamic policy programming. Journal of Machine Learning Research 13 (103), pp. 3207–3245. Cited by: §8.
 Distributed distributional deterministic policy gradients. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §6.1, §6.2.

The Arcade Learning Environment: An Evaluation Platform for General Agents.
Journal of Artificial Intelligence Research
47, pp. 253–279. Cited by: §6.  Temporal differencesbased policy iteration and applications in neurodynamic programming. Technical report Technical Report LIDSP2349, Lab. for Info. and Decision Systems Report, MIT, Cambridge, Massachusetts. Cited by: §1, §3.1, §5.4.
 OpenAI gym. arXiv preprint arXiv:1606.01540. Cited by: §6.
 Statistical Inference. Vol. 2, Duxbury Pacific Grove, CA. Cited by: §3.2.
 PyBullet, a Python module for physics simulation for games, robotics and machine learning. Note: http://pybullet.org Cited by: §7.2.
 Reconciling returns with experience replay. In Advances in Neural Information Processing Systems, Cited by: §1, §4, §8.
 Revisiting fundamentals of experience replay. In Proceedings of the International Conference on Machine Learning, Cited by: §8.
 Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Cited by: Appendix I, Appendix I, §7.3.
 Addressing function approximation error in actorcritic methods. In Proceedings of the International Conference on Machine Learning, Cited by: §J.1, §J.2, §1, §6.1, §7.2.
 Reinforcement learning with deep energybased policies. In Proceedings of the International Conference on Machine Learning, Cited by: Appendix I, Appendix I, §7.3.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Cited by: §J.2, §J.3, §J.7, Appendix I, §1, §7.3.
 Investigating recurrence and eligibility traces in deep Qnetworks. arXiv preprint arXiv:1704.05495. Cited by: §1.
 Q() with offpolicy corrections. In Proceedings of the International Conference on Algorithmic Learning Theory, Cited by: Appendix D, Table 1, §1, §3.2, §4, §5.1, §5, Lemma 1.
 Double Qlearning. In Advances in Neural Information Processing Systems, Cited by: §J.2.
 Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §J.3, §1, §3.2, §4, §6.2.
 Approximately optimal approximate reinforcement learning. In Proceedings of the International Conference on Machine Learning, Cited by: §5.3.
 Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §3.2, §4, §6.2.
 Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §J.2.
 Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §8.
 Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, Cited by: §J.1, §J.2, §1, §6.1.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §J.1, §J.2.
 Applying Q(λ)-learning in deep reinforcement learning to play Atari games. In AAMAS Workshop on Adaptive Learning Agents, Cited by: §1.
 Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §J.3, Revisiting Peng’s Q(λ) for Modern Reinforcement Learning, Table 1, §1, §3.2, §3.2, item 2, §4, §4, §6.2.
 Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, Cited by: §5.3.
 Error bounds for approximate value iteration. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §5.3.

 Self-imitation learning. In Proceedings of the International Conference on Machine Learning, Cited by: §J.6.
 Incremental multi-step Q-learning. In Proceedings of the International Conference on Machine Learning, Cited by: Table 1, §1, §1, §3.2, §3.2.
 Incremental multi-step Q-learning. Machine learning 22 (1), pp. 283–290. Cited by: §1, §1, §3.2, §3.2.
 Eligibility traces for off-policy policy evaluation. In Proceedings of the International Conference on Machine Learning, Cited by: §J.3, Table 1, §1, §3.2, §3.2.
 Modified policy iteration algorithms for discounted Markov decision problems. Management Science 24 (11), pp. 1127–1137. Cited by: §3.1.
 Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., USA. External Links: ISBN 0471619779 Cited by: Appendix B, Appendix B, Appendix E, §2, §2.
 Adaptive trade-offs in off-policy learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §J.3, Appendix G, Table 1, §1, §3.2, §4, §5.1, §6.2.
 Approximate modified policy iteration. In Proceedings of the International Conference on Machine Learning, Cited by: §5.3.
 Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research 16, pp. 1629–1676. Cited by: §5.3.
 Performance bounds for policy iteration and application to the game of Tetris. Journal of Machine Learning Research 14 (1), pp. 1181–1227. Cited by: §5.2, §5.3.
 Approximate policy iteration schemes: a comparison. In Proceedings of the International Conference on Machine Learning, Cited by: Appendix B, §5.3, §8.
 Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Cited by: §J.1, §6.1.
 Reinforcement Learning: An Introduction. 1 edition, MIT Press. Cited by: §4, Remark 1.
 dm_control: Software and Tasks for Continuous Control. External Links: 2006.12983 Cited by: §6, §7.2.
 MuJoCo: a physics engine for model-based control. In Proceedings of the International Conference on Intelligent Robots and Systems, Cited by: §7.2.
 Leverage the average: an analysis of KL regularization in reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §8.
 Sample efficient actor-critic with experience replay. In Proceedings of the International Conference on Learning Representations, Cited by: §6.2.
 Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK. Cited by: Table 1, §1, §3.2.
 A deeper look at experience replay. In NeurIPS Workshop on Deep Reinforcement Learning, Cited by: §8.
 Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: Appendix I, §7.3.
Appendix A Preliminaries for Theoretical Analyses
In this appendix, we explain important notions used in our theoretical analyses.
Contraction and Monotonicity of Operators.
An operator $O$ from a normed space $(X, \|\cdot\|_X)$ to another normed space $(Y, \|\cdot\|_Y)$ is said to be a contraction if there is a constant $L \in [0, 1)$ such that $\|Of - Og\|_Y \le L \|f - g\|_X$ for any $f, g \in X$. This constant is sometimes called the modulus. For example, the Bellman operator $T^\pi$ is a contraction with modulus $\gamma$. In the main text, we usually meant a contraction under the supremum norm $\|\cdot\|_\infty$ and did not always mention which norm is considered.
A related notion is a non-expansion. If an operator $O$ satisfies only $\|Of - Og\|_Y \le \|f - g\|_X$, it is said to be a non-expansion. For example, $P^\pi$ is a non-expansion, as proven later.
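Monotonicity is probably the most important property in our analyses. An operator $O$ is said to be monotone if $Of \le Og$ for any $f$ and $g$ satisfying $f \le g$. For example, $P^\pi$ is monotone: if $f \le g$ (pointwise, i.e., at every $(x, a)$), $P^\pi f \le P^\pi g$ holds too, as one can easily confirm from
$$(P^\pi g)(x, a) - (P^\pi f)(x, a) = \sum_{y} P(y | x, a) \sum_{b} \pi(b | y) \left( g(y, b) - f(y, b) \right) \ge 0.$$
Let $\mathbf{1}$ be a constant function taking the value $1$ everywhere. If a linear operator $O$ is monotone and satisfies $O\mathbf{1} = c\mathbf{1}$ with a scalar $c$, we have $\|Of\|_\infty \le c \|f\|_\infty$. Indeed,
$$-\|f\|_\infty \mathbf{1} \le f \le \|f\|_\infty \mathbf{1}$$
and the monotonicity and linearity of $O$ imply $-c \|f\|_\infty \mathbf{1} \le Of \le c \|f\|_\infty \mathbf{1}$. Thus, $P^\pi$ is non-expansive as $P^\pi \mathbf{1} = \mathbf{1}$. Note that $\lambda P^\mu + (1 - \lambda) P^\pi$ is also a non-expansive operator for any $\lambda \in [0, 1]$, as one can easily confirm.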
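The three properties discussed in this appendix can be checked numerically on a small finite MDP. The following is a minimal sketch, assuming tabular functions of state-action pairs and the standard definitions $(P^\pi f)(x, a) = \sum_y P(y | x, a) \sum_b \pi(b | y) f(y, b)$ and $T^\pi f = r + \gamma P^\pi f$. The randomly generated MDP and all function names are hypothetical and serve only as an illustration.

```python
import random


def random_dist(rng, n):
    """Return a random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [v / total for v in w]


def make_mdp(n_states, n_actions, seed=0):
    """Random finite MDP (an illustrative assumption, not the paper's setup)."""
    rng = random.Random(seed)
    P = [[random_dist(rng, n_states) for _ in range(n_actions)]
         for _ in range(n_states)]
    r = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
         for _ in range(n_states)]
    pi = [random_dist(rng, n_actions) for _ in range(n_states)]
    return P, r, pi


def apply_P_pi(P, pi, f):
    """(P^pi f)(x, a) = sum_y P(y|x,a) * sum_b pi(b|y) * f(y, b)."""
    v = [sum(p * q for p, q in zip(pi[y], f[y])) for y in range(len(f))]
    return [[sum(p * vy for p, vy in zip(P[x][a], v))
             for a in range(len(P[x]))] for x in range(len(P))]


def apply_T_pi(P, r, pi, f, gamma):
    """(T^pi f)(x, a) = r(x, a) + gamma * (P^pi f)(x, a)."""
    Pf = apply_P_pi(P, pi, f)
    return [[r[x][a] + gamma * Pf[x][a] for a in range(len(r[x]))]
            for x in range(len(r))]


def sup_dist(f, g):
    """Supremum-norm distance between two tabular functions."""
    return max(abs(u - w) for fr, gr in zip(f, g) for u, w in zip(fr, gr))


def check_properties(n_states=5, n_actions=3, gamma=0.9, seed=0, tol=1e-12):
    """Test monotonicity, non-expansion, and contraction on random inputs."""
    P, r, pi = make_mdp(n_states, n_actions, seed)
    rng = random.Random(seed + 1)
    f = [[rng.uniform(-5, 5) for _ in range(n_actions)]
         for _ in range(n_states)]
    # g >= f pointwise by construction, so monotonicity can be tested.
    g = [[u + rng.uniform(0, 2) for u in row] for row in f]
    Pf, Pg = apply_P_pi(P, pi, f), apply_P_pi(P, pi, g)
    Tf, Tg = apply_T_pi(P, r, pi, f, gamma), apply_T_pi(P, r, pi, g, gamma)
    d = sup_dist(f, g)
    return {
        "monotone": all(u <= w + tol
                        for fr, gr in zip(Pf, Pg) for u, w in zip(fr, gr)),
        "non_expansion": sup_dist(Pf, Pg) <= d + tol,
        "contraction": sup_dist(Tf, Tg) <= gamma * d + tol,
    }
```

Calling `check_properties()` with different seeds returns a dictionary of three Boolean flags. A random test of this kind cannot prove the properties, but a single `False` would expose an error in an implementation of the operators.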
Appendix B On an Extension of Theoretical Results to Continuous Action Spaces
In this appendix, we explain how to extend our theoretical results to a case where both the state and action spaces are continuous. We mainly follow Appendix B of Puterman (1994), to which we refer interested readers.
Notation.
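Let $X$ and $Y$ be Polish spaces. We denote by $\mathcal{B}(X; [a, b])$ the set of all Borel-measurable functions from $X$ to a bounded closed interval $[a, b]$, where $-\infty < a \le b < \infty$; throughout this appendix, the Borel $\sigma$-algebra is always considered. We denote by $\mathcal{P}(X)$ the set of all Borel probability measures on $X$. We say that a real-valued function $f$ on $X$ is upper semicontinuous (usc) at a point $x \in X$ if $\limsup_{n \to \infty} f(x_n) \le f(x)$ for any sequence of points $(x_n)_{n \ge 1}$ converging to $x$. We say that $f$ is usc if it is usc at any point. We denote by $\mathcal{U}(X; [a, b])$ the set of all usc functions from $X$ to a bounded closed interval $[a, b]$, where $-\infty < a \le b < \infty$. We say that a stochastic kernel $P$ from $X$ to $Y$ is continuous if $\lim_{n \to \infty} \int f(y) P(dy | x_n) = \int f(y) P(dy | x)$ for any bounded continuous function $f$ on $Y$ and any sequence of points $(x_n)_{n \ge 1}$ converging to $x$.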
Main Discussion.
We impose the following assumption on MDPs. It is necessary to guarantee that all functions in the analyses are usc, as we shall explain soon.
Assumption 6.
The state and action spaces are compact subsets of finite-dimensional Euclidean spaces equipped with Borel $\sigma$-algebras. The reward function is a usc function bounded by , and the state transition probability kernel is continuous.
We first explain that there exists an optimal policy that is a measurable function from the state space to the action space . Let . We denote by the max operator defined by for any . Theorem B.5 in Puterman (1994) guarantees that is usc. Furthermore, Proposition B.4 in Puterman (1994) guarantees that is usc. It is easy to confirm that both and are bounded by . Since a sum of usc functions is again usc (Puterman, 1994, Proposition B.1.a), belongs to . Suppose the recursion . Proposition B.1.e in Puterman (1994) guarantees that is usc. Proposition B.4 in Puterman (1994) guarantees that there exists a measurable function such that . Accordingly, there exists an optimal policy that is a measurable function from to .
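On a finite MDP, where measurability holds automatically, the construction above can be illustrated with standard value iteration. The following is a minimal sketch, assuming that the recursion is the usual update $Q \leftarrow r + \gamma P (MQ)$ with the max operator $(MQ)(x) = \max_a Q(x, a)$. The two-state MDP and the function names are hypothetical.

```python
def max_operator(Q):
    """(M Q)(x) = max_a Q(x, a) for a tabular Q."""
    return [max(row) for row in Q]


def value_iteration(P, r, gamma, n_iters=500):
    """Iterate Q <- r + gamma * P (M Q), starting from Q = 0."""
    n_states, n_actions = len(r), len(r[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = max_operator(Q)
        Q = [[r[x][a] + gamma * sum(p * vy for p, vy in zip(P[x][a], v))
              for a in range(n_actions)] for x in range(n_states)]
    return Q


def greedy_policy(Q):
    """Deterministic policy picking an action that attains max_a Q(x, a)."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Toy two-state MDP (hypothetical): staying in state 1 pays reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
     [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 moves to state 0
r = [[0.0, 0.0], [1.0, 0.0]]
Q_star = value_iteration(P, r, gamma=0.9)
policy = greedy_policy(Q_star)
```

In this toy example the greedy policy moves from state 0 to state 1 and then stays there. In the continuous setting of this appendix, the existence of such a maximizing, measurable selection is exactly what the cited results from Puterman (1994) are used to guarantee.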
From the above discussion, it is easy to confirm that all value functions in the exact version of PQL (3) are usc, given that the behavior policy is continuous. Therefore, the proof of Theorem 2 in Appendix E is valid under the assumption that the behavior policy is continuous. We note that it is a weak assumption because the behavior policy is often continuous in practice. Indeed, an action distribution is frequently a normal distribution whose mean and diagonal covariance matrix are continuous functions of a state expressed by, for example, neural networks. As a result, as long as all elements of the diagonal covariance matrix are bounded from below by some positive constant, the probability density function of the action distribution is bounded. Therefore, the dominated convergence theorem can be used to show that the behavior policy is continuous. When there is an element of the diagonal covariance matrix converging to zero, this argument does not hold. However, it is a pathological case that usual implementations, such as SpinningUp (Achiam, 2018), try to avoid by value clipping.

For other theoretical results, we need two additional assumptions: (i) all behavior policies and target policies are continuous, and (ii) all error functions belong to . As for assumption (i), it is a weak assumption as noted above. (See also the following paragraph on the relaxation of the target policy's exact greediness.) As for assumption (ii), it is also a weak assumption: because approximates , there is no strong reason to use a function approximator that does not belong to ; using a function approximator belonging to guarantees that belongs to . Similar arguments can be made even when the behavior policy is updated, and we can conclude that these assumptions are weak.
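The boundedness argument for normal action distributions can be made concrete. The following is a minimal sketch, assuming a diagonal normal distribution whose standard deviations are floored at a hypothetical constant `std_min`. Whether an implementation clips the standard deviation or its logarithm is not specified here, and the function names are illustrative only.

```python
import math


def clip_std(std, std_min):
    """Floor every standard deviation at std_min (a hypothetical choice)."""
    return [max(s, std_min) for s in std]


def diag_gaussian_pdf(a, mean, std):
    """Density of a normal distribution with diagonal covariance diag(std**2)."""
    p = 1.0
    for ai, mi, si in zip(a, mean, std):
        z = (ai - mi) / si
        p *= math.exp(-0.5 * z * z) / (si * math.sqrt(2.0 * math.pi))
    return p


def density_upper_bound(dim, std_min):
    """Bound valid for every mean once all std entries are >= std_min."""
    return (1.0 / (std_min * math.sqrt(2.0 * math.pi))) ** dim


std_min = 0.1
mean = [0.3, -1.2]
std = clip_std([1e-6, 0.5], std_min)       # the tiny entry is raised to 0.1
peak = diag_gaussian_pdf(mean, mean, std)  # the density is largest at the mean
bound = density_upper_bound(len(mean), std_min)
```

Because the density peaks at the mean, `peak <= bound` holds for any state-dependent mean. Without the floor, the first standard deviation of `1e-6` would make the peak density grow without a uniform bound, which is the pathological case mentioned above.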
We finally mention how to relax the assumption that the target policy is exactly greedy. When the action space is continuous, it is not feasible to find an exact greedy policy even if the function to be maximized is continuous. In addition, it is often the case that a policy is expressed by a neural network. However, it is relatively straightforward to extend our theoretical analyses to a case where this exact greedy assumption is relaxed to a near-greedy assumption, under which the policy is required to be greedy only up to a small tolerance. A similar near-greedy condition is found in, for example, Scherrer (2014).
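One way to read such a near-greedy condition is sketched below. It assumes, purely for illustration, a finite set of actions, a tabular action-value function `Q`, and a tolerance `eps` that bounds the gap $\max_a Q(x, a) - \sum_a \pi(a | x) Q(x, a)$ in every state. This specific form of the condition and all names are assumptions, not the definition used in the analyses.

```python
import math


def softmax_policy(q_values, temperature):
    """Softmax over actions; approaches the greedy policy as temperature -> 0."""
    m = max(q_values)  # subtract the maximum for numerical stability
    w = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(w)
    return [v / total for v in w]


def greedy_gap(q_values, probs):
    """max_a Q(x, a) - sum_a pi(a|x) Q(x, a); zero when pi is exactly greedy."""
    return max(q_values) - sum(p * q for p, q in zip(probs, q_values))


def is_near_greedy(Q, pi, eps):
    """True if the greedy gap is at most eps in every state."""
    return all(greedy_gap(q, p) <= eps for q, p in zip(Q, pi))


Q = [[1.0, 0.0], [0.2, 0.5]]
sharp = [softmax_policy(q, temperature=0.05) for q in Q]
uniform = [[0.5, 0.5] for _ in Q]
```

Here the low-temperature softmax policy `sharp` satisfies the condition with `eps = 0.01`, whereas the uniform policy does not. A policy network trained to maximize the action-value function plays the role of `sharp` in the continuous-action case.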
Appendix C A Proof of Lemma 1 (Different Forms of the PQL Operator)
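In this appendix, we prove Lemma 1, which provides the following forms of the PQL operator:
$$\mathcal{N}_\lambda^{\mu\pi} Q = (I - \gamma\lambda P^\mu)^{-1} \left( r + \gamma (1 - \lambda) P^\pi Q \right) = Q + (I - \gamma\lambda P^\mu)^{-1} \left( T^{\lambda\mu + (1 - \lambda)\pi} Q - Q \right).$$
We first recall the original PQL operator (1): $\mathcal{N}_\lambda^{\mu\pi} Q := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (T^\mu)^{n-1} T^\pi Q$. Note that each term in the sum can be rewritten as $(T^\mu)^{n-1} T^\pi Q = \sum_{t=0}^{n-1} \gamma^t (P^\mu)^t r + \gamma^n (P^\mu)^{n-1} P^\pi Q$. Therefore,
$$\mathcal{N}_\lambda^{\mu\pi} Q = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \sum_{t=0}^{n-1} \gamma^t (P^\mu)^t r + \gamma (1 - \lambda) \sum_{n=1}^{\infty} (\gamma\lambda P^\mu)^{n-1} P^\pi Q.$$
Note that
$$(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \sum_{t=0}^{n-1} \gamma^t (P^\mu)^t r = (1 - \lambda) \sum_{t=0}^{\infty} \gamma^t (P^\mu)^t r \sum_{n=t+1}^{\infty} \lambda^{n-1} = \sum_{t=0}^{\infty} (\gamma\lambda P^\mu)^t r = (I - \gamma\lambda P^\mu)^{-1} r.$$
Consequently,
$$\mathcal{N}_\lambda^{\mu\pi} Q = (I - \gamma\lambda P^\mu)^{-1} \left( r + \gamma (1 - \lambda) P^\pi Q \right).$$
The right hand side can be rewritten as follows:
$$(I - \gamma\lambda P^\mu)^{-1} \left( r + \gamma (1 - \lambda) P^\pi Q + \gamma\lambda P^\mu Q - Q \right) + (I - \gamma\lambda P^\mu)^{-1} (I - \gamma\lambda P^\mu) Q = Q + (I - \gamma\lambda P^\mu)^{-1} \left( T^{\lambda\mu + (1 - \lambda)\pi} Q - Q \right).$$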