1 Introduction
Policy optimization is a major framework in model-free reinforcement learning (RL), with successful applications in challenging domains (Silver et al., 2016; Berner et al., 2019; Vinyals et al., 2019). Along with scaling up to powerful computational architectures (Mnih et al., 2016; Espeholt et al., 2018), significant algorithmic performance gains are driven by insights into the drawbacks of naïve policy gradient algorithms (Sutton et al., 2000). Among all algorithmic improvements, two of the most prominent are trust-region policy search (Schulman et al., 2015, 2017; Abdolmaleki et al., 2018; Song et al., 2020) and off-policy corrections (Munos et al., 2016; Wang et al., 2017; Gruslys et al., 2018; Espeholt et al., 2018).
At first glance, these two streams of ideas focus on orthogonal aspects of policy optimization. For trust-region policy search, the idea is to constrain the size of policy updates. This limits the deviations between consecutive policies and lower-bounds the performance of the new policy (Kakade and Langford, 2002; Schulman et al., 2015). On the other hand, off-policy corrections require that we account for the discrepancy between the target policy and the behavior policy. Espeholt et al. (2018) observed that such corrections are especially useful for distributed algorithms, where behavior policy and target policy typically differ. Both algorithmic ideas have contributed significantly to stabilizing policy optimization.
In this work, we partially unify both algorithmic ideas into a single framework. In particular, we notice that, as a ubiquitous approximation method, Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. To get high-level intuitions of such similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function on the real line $f : \mathbb{R} \to \mathbb{R}$, the $K$-th order Taylor expansion of $f(x)$ at $x_0$ is $f_K(x) = f(x_0) + \sum_{k=1}^{K} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k$, where $f^{(k)}(x_0)$ are the $k$-th order derivatives of $f$ at $x_0$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region. Indeed, in order for convergence to take place, a trust-region constraint $|x - x_0| < R(f, x_0)$ is required, where $R(f, x_0)$ is the convergence radius of the expansion, which in general depends on the function $f$ and the origin $x_0$. Second, when using the truncation $f_K(x)$ as an approximation to the original function $f(x)$, Taylor expansions satisfy the requirement of off-policy evaluations: evaluate the target policy with behavior data. Indeed, to evaluate the truncation $f_K(x)$ at any $x$ (target policy), we only require the behavior policy "data" at $x_0$ (i.e., derivatives $f^{(k)}(x_0)$).
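To make the 1D intuition concrete, here is a minimal numerical sketch (our own illustration, not from the paper) of how the truncation error of a Taylor expansion shrinks with the order $K$ inside the convergence radius, using $f(x) = 1/(1-x)$ expanded at $x_0 = 0$ (radius $R = 1$):

```python
import math

def taylor_approx(f_derivs, x0, x, K):
    """Evaluate the K-th order Taylor truncation of f at x, using derivatives at x0."""
    return sum(f_derivs(k, x0) / math.factorial(k) * (x - x0) ** k for k in range(K + 1))

# f(x) = 1/(1 - x); its k-th derivative at x0 is k! / (1 - x0)^(k + 1).
derivs = lambda k, x0: math.factorial(k) / (1.0 - x0) ** (k + 1)

x0, x = 0.0, 0.5          # |x - x0| = 0.5 < 1, inside the convergence radius
f_true = 1.0 / (1.0 - x)
errors = [abs(taylor_approx(derivs, x0, x, K) - f_true) for K in (1, 3, 7)]
print(errors)  # [0.5, 0.125, 0.0078125]: the error shrinks as the order K grows
```

Outside the radius (e.g., $x = 1.5$), the partial sums diverge instead, which is the 1D analogue of leaving the trust region.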
Our paper proceeds as follows. In Section 2, we start with a general result of applying Taylor expansions to Q-functions. When we apply the same technique to the RL objective, we reuse the general result and derive a higher-order policy optimization objective. This leads to Section 3, where we formally present Taylor Expansion Policy Optimization (TayPO) and generalize prior work (Schulman et al., 2015, 2017) as a first-order special case. In Section 4, we make a clear connection between Taylor expansions and $Q(\lambda)$ (Harutyunyan et al., 2016), a common return-based off-policy evaluation operator. Finally, in Section 5, we show the performance gains due to the higher-order objectives across a range of state-of-the-art distributed deep RL agents.
2 Taylor expansion for reinforcement learning
Consider a Markov Decision Process (MDP) with state space $\mathcal{X}$ and action space $\mathcal{A}$. Let policy $\pi(\cdot|x)$ be a distribution over actions given state $x$. At a discrete time $t \geq 0$, the agent in state $x_t$ takes action $a_t \sim \pi(\cdot|x_t)$, receives reward $r_t = r(x_t, a_t)$, and transitions to a next state $x_{t+1} \sim p(\cdot|x_t, a_t)$. We assume a discount factor $\gamma \in [0, 1)$. Let $Q^\pi(x, a)$ be the action value function (Q-function) from state $x$, taking action $a$ and following policy $\pi$. For convenience, we use $d^\pi_{x,a}$ to denote the discounted visitation distribution starting from the state-action pair $(x, a)$ and following $\pi$, such that $d^\pi_{x,a}(x', a') \propto \sum_{t \geq 0} \gamma^t \mathbb{P}(x_t = x', a_t = a' \mid x_0 = x, a_0 = a)$. We thus have $Q^\pi(x, a) = (1 - \gamma)^{-1} \mathbb{E}_{(x', a') \sim d^\pi_{x,a}}[r(x', a')]$. We focus on the RL objective of optimizing $J(\pi) = \mathbb{E}_{a_0 \sim \pi(\cdot|x_0)}[Q^\pi(x_0, a_0)]$ starting from a fixed initial state $x_0$.

We define some useful matrix notation. For ease of analysis, we assume that $\mathcal{X}$ and $\mathcal{A}$ are both finite. Let $r \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ denote the reward function and $P^\pi \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}| \times |\mathcal{X}||\mathcal{A}|}$ denote the transition matrix such that $P^\pi[(x,a),(x',a')] = p(x'|x,a)\,\pi(a'|x')$. We also define $Q^\pi \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ as the vector Q-function. This matrix notation facilitates compact derivations; for example, the Bellman equation writes as $Q^\pi = r + \gamma P^\pi Q^\pi$.

2.1 Taylor expansion of Q-functions
In this part, we state the Taylor expansion of Q-functions. Our motivation for the expansion is the following: assume we aim to estimate $Q^\pi$ for a target policy $\pi$, and we only have access to data collected under a behavior policy $\mu$. Since $Q^\mu$ can be readily estimated with the collected data, how do we approximate $Q^\pi$ with $Q^\mu$? Clearly, when $\pi = \mu$, we have $Q^\pi = Q^\mu$. Whenever $\pi \neq \mu$, $Q^\pi$ starts to deviate from $Q^\mu$. Therefore, we apply Taylor expansions to describe the deviation of $Q^\pi$ from $Q^\mu$ in the orders of $\pi - \mu$. We provide the following result.
Theorem 1.
(proved in Appendix B) For any policies $\pi$ and $\mu$ and any $K \geq 0$, we have
$$Q^\pi = \sum_{k=0}^{K} \left(\gamma (I - \gamma P^\mu)^{-1} (P^\pi - P^\mu)\right)^k Q^\mu + \left(\gamma (I - \gamma P^\mu)^{-1} (P^\pi - P^\mu)\right)^{K+1} Q^\pi.$$
In addition, if $\max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1 < (1-\gamma)/\gamma$, then the limit for $K \to \infty$ exists and we have
$$Q^\pi = \sum_{k=0}^{\infty} \left(\gamma (I - \gamma P^\mu)^{-1} (P^\pi - P^\mu)\right)^k Q^\mu. \qquad (1)$$
The constraint between $\pi$ and $\mu$ is a result of the convergence radius of the Taylor expansion. The derivation follows by recursively applying the following equality: $Q^\pi = Q^\mu + \gamma (I - \gamma P^\mu)^{-1} (P^\pi - P^\mu) Q^\pi$. Please refer to Appendix B for a proof. For ease of notation, denote the $k$-th term on the RHS of Eq. 1 as $U_k(\pi, \mu)$. This gives rise to $Q^\pi = \sum_{k \geq 0} U_k(\pi, \mu)$.
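The expansion in Eq. 1 can be checked numerically on a small random MDP. The following sketch (setup and names are our own; it assumes NumPy) builds $P^\pi$, $P^\mu$ and the vector Q-functions in the matrix notation above, then verifies that partial sums of Eq. 1 approach $Q^\pi$ when $\pi$ is kept close to $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
n = nS * nA

# Random MDP: transitions p(x'|x,a) and rewards r(x,a).
p = rng.dirichlet(np.ones(nS), size=(nS, nA))        # shape (nS, nA, nS)
r = rng.uniform(size=n)

def policy_matrices(pi):
    """P^pi[(x,a),(x',a')] = p(x'|x,a) pi(a'|x'); Q^pi = (I - gamma P^pi)^{-1} r."""
    P = np.einsum('xas,sb->xasb', p, pi).reshape(n, n)
    Q = np.linalg.solve(np.eye(n) - gamma * P, r)
    return P, Q

mu = rng.dirichlet(np.ones(nA), size=nS)
pi = 0.95 * mu + 0.05 * rng.dirichlet(np.ones(nA), size=nS)  # pi close to mu

P_pi, Q_pi = policy_matrices(pi)
P_mu, Q_mu = policy_matrices(mu)

# Partial sums of Eq. 1: Q^pi ~ sum_k (gamma (I - gamma P^mu)^{-1} (P^pi - P^mu))^k Q^mu.
M = gamma * np.linalg.solve(np.eye(n) - gamma * P_mu, P_pi - P_mu)
approx, term, errs = np.zeros(n), Q_mu.copy(), []
for K in range(6):
    approx += term
    errs.append(np.max(np.abs(approx - Q_pi)))
    term = M @ term
print(errs)  # the approximation error decreases with the expansion order K
```

With a 5% perturbation, the sufficient condition $\max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1 < (1-\gamma)/\gamma$ holds, so the partial sums contract toward $Q^\pi$.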
To represent $U_k(\pi, \mu)$ explicitly with the deviation between $\pi$ and $\mu$, consider a diagonal matrix $D$ where $D[(x,a),(x',a')] = \delta_{(x,a)=(x',a')}\left(\frac{\pi(a|x)}{\mu(a|x)} - 1\right)$ and where $\delta$ is the Dirac delta function; we restrict to the case where $\mu(a|x) > 0$ for all $(x, a)$. This diagonal matrix is a measure of the deviation between $\pi$ and $\mu$. Since $P^\pi - P^\mu = P^\mu D$, the above expression can be rewritten as
$$Q^\pi = \sum_{k=0}^{\infty} \left(\gamma (I - \gamma P^\mu)^{-1} P^\mu D\right)^k Q^\mu. \qquad (2)$$
We will see that the expansion in Eq. 2 is useful in Section 3, when we derive the Taylor expansion of the difference between the performances of two policies, $J(\pi) - J(\mu)$. In Section 4, we also provide the connection between Taylor expansions and off-policy evaluation.
2.2 Taylor expansion of reinforcement learning objective
When searching for a better policy, we are often interested in the difference $J(\pi) - J(\mu)$. With Eq. 2, we can derive a similar Taylor expansion result for $J(\pi) - J(\mu)$. Here, we formalize the order of the expansion as the number of times that ratios $\pi(a|x)/\mu(a|x)$ appear in an expression; e.g., the first-order expansion should only involve ratios $\pi(a|x)/\mu(a|x)$ up to the first order, without higher-order terms such as the cross product $\frac{\pi(a_1|x_1)}{\mu(a_1|x_1)} \cdot \frac{\pi(a_2|x_2)}{\mu(a_2|x_2)}$. We denote the $k$-th order term as $L_k(\pi, \mu)$ and by construction $J(\pi) - J(\mu) = \sum_{k \geq 1} L_k(\pi, \mu)$. Next, we derive practically useful expressions for $L_k(\pi, \mu)$.
We provide a derivation sketch below and give the details in Appendix F. Let $\rho_0^\pi$ denote the joint distribution over the initial state-action pair $(x_0, a_0)$ with $a_0 \sim \pi(\cdot|x_0)$. Note that the RL objective equivalently writes as $J(\pi) = \mathbb{E}_{a_0 \sim \pi(\cdot|x_0)}[Q^\pi(x_0, a_0)]$ and can be expressed as an inner product $J(\pi) = \langle \rho_0^\pi, Q^\pi \rangle$. Since $\rho_0^\pi = \rho_0^\mu (I + D)$, this allows us to import results from Eq. 2,
$$J(\pi) = \left\langle \rho_0^\mu (I + D),\ \sum_{k \geq 0} \left(\gamma (I - \gamma P^\mu)^{-1} P^\mu D\right)^k Q^\mu \right\rangle. \qquad (3)$$
By reading off different orders of the expansion (i.e., powers of $D$) from the RHS of Eq. 3, we derive
$$L_k(\pi, \mu) = \left\langle \rho_0^\mu D,\ \left(\gamma (I - \gamma P^\mu)^{-1} P^\mu D\right)^{k-1} Q^\mu \right\rangle + \left\langle \rho_0^\mu,\ \left(\gamma (I - \gamma P^\mu)^{-1} P^\mu D\right)^{k} Q^\mu \right\rangle. \qquad (4)$$
It is worth noting that the $k$-th order expansion of the RL objective is a mixture of the $k$-th and $(k-1)$-th order Q-function expansions. This is because $J(\pi)$ integrates $Q^\pi$ over the initial action $a_0 \sim \pi(\cdot|x_0)$, and the initial difference between $\pi$ and $\mu$ contributes one order of difference in $D$.
Below, we illustrate the results for $k = 1$ and $k = 2$. To make the results more intuitive, we convert the matrix notation of Eq. 3 into explicit expectations under $\mu$.
First-order expansion.
By converting $L_1(\pi, \mu)$ from Eq. 4 into expectations, we get
$$L_1(\pi, \mu) \propto \mathbb{E}_{(x,a) \sim d^\mu}\left[\left(\frac{\pi(a|x)}{\mu(a|x)} - 1\right) Q^\mu(x, a)\right]. \qquad (5)$$
To be precise, $L_1(\pi, \mu)$ equals $(1-\gamma)^{-1}$ times the RHS of Eq. 5, to account for the normalization of the distribution $d^\mu$. Note that $L_1(\pi, \mu)$ is exactly the same as the surrogate objective proposed in prior work on scalable policy optimization (Kakade and Langford, 2002; Schulman et al., 2015, 2017). Indeed, these works proposed to estimate and optimize such a surrogate objective at each iteration while enforcing a trust region. In the following, we generalize this objective with Taylor expansions.
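As a hedged sketch of how an expectation like Eq. 5 is estimated from behavior data (the helper name and inputs are our own), note that since advantages satisfy $\mathbb{E}_\mu[A^\mu] = 0$, weighting by $(\rho - 1)$ or by $\rho$ yields the same expectation once Q-values are replaced by advantages:

```python
def first_order_surrogate(ratios, advantages):
    """Sample estimate of the first-order surrogate (hypothetical helper).
    ratios[i] = pi(a_i|x_i) / mu(a_i|x_i); advantages[i] estimated under mu.
    Uses the (rho - 1) weighting of Eq. 5; with advantages this matches
    the usual rho * A surrogate in expectation."""
    assert len(ratios) == len(advantages)
    return sum((rho - 1.0) * adv for rho, adv in zip(ratios, advantages)) / len(ratios)

# When pi == mu, every ratio is 1 and the surrogate vanishes:
print(first_order_surrogate([1.0, 1.0], [0.3, -0.7]))  # 0.0
```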
Second-order expansion.
By converting $L_2(\pi, \mu)$ from Eq. 4 into expectations, we get
$$L_2(\pi, \mu) \propto \mathbb{E}_{(x,a) \sim d^\mu,\ (x',a') \sim d^\mu_{x,a}}\left[\left(\frac{\pi(a|x)}{\mu(a|x)} - 1\right)\left(\frac{\pi(a'|x')}{\mu(a'|x')} - 1\right) Q^\mu(x', a')\right]. \qquad (6)$$
Again, $L_2(\pi, \mu)$ equals the RHS of Eq. 6 up to a normalization constant. To calculate the above expectation, we first start from the initial state and sample a pair $(x, a)$ from the discounted distribution $d^\mu$. Then, we use $(x, a)$ as the starting point and sample another pair $(x', a')$ from $d^\mu_{x,a}$. This implies that the second-order expansion can be estimated only via samples under $\mu$, which will be essential for policy optimization in practice.
It is worth noting that the second state-action pair is sampled starting from the step after $(x, a)$, not from the step of $(x, a)$ itself. This is because $L_2(\pi, \mu)$ only contains terms sampled across strictly different time steps.
Higher-order expansions.
Similarly to the first-order and second-order expansions, higher-order expansions are also possible by including proper higher-order terms in $\pi - \mu$. For general $k \geq 1$, $L_k(\pi, \mu)$ can be expressed as (omitting the normalization constants)
$$L_k(\pi, \mu) \propto \mathbb{E}\left[\prod_{i=1}^{k} \left(\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} - 1\right) Q^\mu(x_k, a_k)\right]. \qquad (7)$$
Here, the pairs $(x_1, a_1), \ldots, (x_k, a_k)$ are sampled sequentially, each following a discounted visitation distribution conditional on the previous state-action pair. We show the detailed derivations in Appendix F. Furthermore, we discuss the trade-offs of different orders in Section 3.
Interpretation & intuition.
Evaluating $J(\pi)$ with data collected under $\mu$ requires importance sampling (IS). In general, since $\pi$ can differ from $\mu$ at all state-action pairs, computing $J(\pi)$ exactly with full IS requires corrections $\pi(a_t|x_t)/\mu(a_t|x_t)$ at all steps $t$ along generated trajectories. The first-order expansion (Eq. 5) corresponds to carrying out only one single correction at a sampled state-action pair along the trajectories: indeed, in computing Eq. 5, we sample one state-action pair $(x_t, a_t)$ along the trajectory and calculate a single IS correction $\pi(a_t|x_t)/\mu(a_t|x_t)$. Similarly, the second-order expansion (Eq. 6) goes one step further and considers IS corrections at two different steps $t$ and $t'$. As such, Taylor expansions of the RL objective can be interpreted as increasingly tight approximations of the full IS correction.
3 Taylor expansion for policy optimization
In high-dimensional policy optimization, where exact algorithms such as dynamic programming are not feasible, it is necessary to learn from sampled data. In general, the sampled data are collected under a behavior policy $\mu$ different from the target policy $\pi$. For example, in trust-region policy search (e.g., TRPO, Schulman et al., 2015; PPO, Schulman et al., 2017), $\pi$ is the new policy while $\mu$ is a previous policy; in asynchronous distributed algorithms (Mnih et al., 2016; Espeholt et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019), $\pi$ is the learner policy while $\mu$ is a delayed actor policy. In this section, we show the fundamental connection between trust-region policy search and Taylor expansions, and propose the general framework of Taylor expansion policy optimization (TayPO).
3.1 Generalized trust-region policy optimization
For policy optimization, it is necessary that the update function (e.g., policy gradients or surrogate objectives) can be estimated with sampled data under the behavior policy $\mu$. Taylor expansions are a natural paradigm to satisfy this requirement. Indeed, to optimize $J(\pi)$, consider optimizing the expansion (once again, the equality holds under certain conditions, detailed in Section 4)
$$J(\pi) = J(\mu) + \sum_{k=1}^{\infty} L_k(\pi, \mu). \qquad (8)$$
Though we have shown that $L_k(\pi, \mu)$ for all $k \geq 1$ are expectations under $\mu$, it is not feasible to unbiasedly estimate the RHS of Eq. 8 because it involves an infinite number of terms. In practice, we can truncate the objective up to the $K$-th order and drop $J(\mu)$ because it does not involve $\pi$. However, for any fixed $K$, optimizing the truncated objective $\sum_{k=1}^{K} L_k(\pi, \mu)$ in an unconstrained way is risky: as $\pi$ and $\mu$ become increasingly different, the approximation becomes more inaccurate and we stray away from optimizing the objective of interest. The approximation error comes from the residual $\sum_{k \geq K+1} L_k(\pi, \mu)$; to control the magnitude of the residual, it is natural to constrain $\max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1 \leq \epsilon$ for some $\epsilon > 0$. Indeed, it is then straightforward to show that the residual is $O(\epsilon^{K+1})$ (see Appendix A.1 for detailed derivations). We formalize the entire local optimization problem as generalized trust-region policy optimization (generalized TRPO),
$$\max_\pi \sum_{k=1}^{K} L_k(\pi, \mu) \quad \text{s.t.} \quad \max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1 \leq \epsilon. \qquad (9)$$
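A direct way to monitor the constraint in Eq. 9 is to compute the maximum per-state total deviation between $\pi$ and $\mu$. A minimal sketch (the function name is ours) for tabular policies:

```python
def trust_region_check(pi, mu, eps):
    """Check the generalized-TRPO constraint max_x ||pi(.|x) - mu(.|x)||_1 <= eps.
    pi, mu: lists of per-state action distributions (illustrative representation)."""
    gap = max(
        sum(abs(p - m) for p, m in zip(pi_x, mu_x))
        for pi_x, mu_x in zip(pi, mu)
    )
    return gap, gap <= eps

pi = [[0.6, 0.4], [0.5, 0.5]]
mu = [[0.5, 0.5], [0.5, 0.5]]
print(trust_region_check(pi, mu, eps=0.25))  # gap is about 0.2, within eps
```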
Monotonic improvement.
While maximizing the surrogate objective under trust-region constraints (Eq. 9), it is desirable to have a performance guarantee on the true objective $J(\pi)$. Below, Theorem 2 gives such a result.
Theorem 2.
Connections to prior work on trust-region policy search.
The generalized TRPO extends the formulation of prior work, e.g., TRPO/PPO of Schulman et al. (2015, 2017). Indeed, idealized forms of these algorithms are a special case for $K = 1$, though for practical purposes the max constraint is replaced by averaged KL constraints. (Instead of forming the constraints explicitly, PPO (Schulman et al., 2017) enforces the constraints implicitly by clipping IS ratios.)
3.2 TayPO: Optimizing with the K-th order expansion
Though there is a theoretical motivation to use trust-region constraints for policy optimization (Schulman et al., 2015; Abdolmaleki et al., 2018), such constraints are rarely explicitly enforced in practice in their most standard form (Eq. 9). Instead, trust regions are implicitly encouraged via, e.g., ratio clipping (Schulman et al., 2017) or parameter averaging (Wang et al., 2017). In large-scale distributed settings, algorithms already benefit from diverse sample collections for variance reduction of the parameter updates (Mnih et al., 2016; Espeholt et al., 2018), which brings the desired stability for learning and makes trust-region constraints (either explicit or implicit) less necessary. Therefore, we focus on the setting where no trust region is explicitly enforced. We introduce a new family of algorithms, TayPO, which applies the $K$-th order Taylor expansion for policy optimization.

Unbiased estimation with variance reduction.
In practice, $L_k(\pi, \mu)$, as expectations under $\mu$, can be estimated over a single trajectory. Take $L_2(\pi, \mu)$ as an example: given a trajectory $(x_t, a_t, r_t)_{t \geq 0}$ generated by $\mu$, assume we have access to some estimates of $Q^\mu(x_t, a_t)$, e.g., cumulative returns. To generate a sample from $d^\mu$, we can first sample a random time $t$ from a geometric distribution with success probability $1 - \gamma$, i.e., $t \sim \text{Geometric}(1 - \gamma)$. Second, we sample another random time $t'$ with a geometric distribution but conditional on $t' \geq t + 1$. (As explained in Section 2.2, since $L_2(\pi, \mu)$ contains IS ratios at strictly different time steps, it is required that $t' \geq t + 1$.) Then, a single sample estimate of Eq. 6 is given by
$$\left(\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} - 1\right)\left(\frac{\pi(a_{t'}|x_{t'})}{\mu(a_{t'}|x_{t'})} - 1\right) Q^\mu(x_{t'}, a_{t'}).$$
Further, the following result shows the effect of replacing Q-values $Q^\mu(x, a)$ by advantages $A^\mu(x, a) = Q^\mu(x, a) - V^\mu(x)$.
Theorem 3.
In practice, when computing the sample estimates, replacing $Q^\mu$ by $A^\mu$ still produces an unbiased estimate and potentially reduces variance. This naturally recovers the result in prior work for $K = 1$ (Schulman et al., 2016).
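The two-time-step sampling scheme above can be sketched as follows (our own illustration; it draws $t \sim \text{Geometric}(1-\gamma)$ and then $t' \geq t + 1$, matching the discounted weights):

```python
import random

def geometric(gamma, rng):
    """Number of failures before a success with probability (1 - gamma),
    so P(T = k) = (1 - gamma) * gamma**k, matching the discounted weights."""
    t = 0
    while rng.random() < gamma:
        t += 1
    return t

def sample_time_pair(gamma, rng):
    """Sample t, then t' = t + 1 + an independent geometric gap, so t' > t
    (strictly different time steps, as required by the second-order term)."""
    t = geometric(gamma, rng)
    t_prime = t + 1 + geometric(gamma, rng)
    return t, t_prime

rng = random.Random(0)
pairs = [sample_time_pair(0.5, rng) for _ in range(10000)]
mean_t = sum(t for t, _ in pairs) / len(pairs)
print(mean_t)  # close to E[T] = gamma / (1 - gamma) = 1 for gamma = 0.5
```

In practice, sampled times are capped at the length of the collected partial trajectory.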
Higher-order objectives and trade-offs.
When $K \geq 2$, we can construct objectives with higher-order terms. The motivation is that with high $K$, $\sum_{k=1}^{K} L_k(\pi, \mu)$ forms a closer approximation to the objective of interest, $J(\pi) - J(\mu)$. Why not then have $K$ as large as possible? This comes at a trade-off. For example, let us compare $K = 1$ and $K = 2$: though the second-order objective forms a closer approximation than the first-order objective in expectation, it could have higher variance during estimation when, e.g., the first- and second-order sample terms have a nonnegative correlation. Indeed, as $K \to \infty$, the truncated objective approximates the full IS correction, which is known to have high variance (Munos et al., 2016).
How many orders to take in practice?
Though the higher-order policy optimization formulation generalizes previous results (Schulman et al., 2015, 2017) as a first-order special case, does it suffice to only include first-order terms in practice?
To assess the effects of Taylor expansions, consider a policy evaluation problem on a random MDP (see Appendix H.1 for the detailed setup): given a target policy $\pi$ and a behavior policy $\mu$, we measure the approximation error of the $K$-th order expansion. In Figure 1, we show the relative errors as a function of $K$. Ground-truth quantities such as $Q^\mu$ are always computed analytically. Solid lines show results where all estimates are also computed analytically. Observe that the errors decrease drastically as the expansion order $K$ increases. To quantify how sample estimates impact the quality of approximations, we recompute the estimates but with $Q^\mu$ replaced by empirical estimates $\hat{Q}^\mu$. Results are shown in dashed curves. Comparing against the fully analytic counterparts, observe that both errors go up, and the errors of the different orders become more similar to each other.
3.3 TayPO-2 — Second-order policy optimization
From here onwards, we focus on TayPO-2, the second-order variant. At any iteration, the data are collected under a behavior policy $\mu$ in the form of partial trajectories. The learner maintains a parametric policy $\pi_\theta$ to be optimized. First, we carry out advantage estimation for state-action pairs on the partial trajectories. This could be naïvely estimated via empirical returns against value function baselines. One could also adopt more advanced estimation techniques such as generalized advantage estimation (GAE, Schulman et al., 2016). Then, we construct surrogate objectives for optimization: the first-order component $\hat{L}_1(\pi_\theta, \mu)$ as well as the second-order component $\hat{L}_2(\pi_\theta, \mu)$, based on Eq. 5 and Eq. 6 respectively. Note that we replace all Q-values $Q^\mu$ by advantage estimates $A^\mu$ for variance reduction.
Therefore, our final objective function becomes
$$\hat{L}(\pi_\theta, \mu) = \hat{L}_1(\pi_\theta, \mu) + \hat{L}_2(\pi_\theta, \mu). \qquad (11)$$
The parameter is updated via gradient ascent $\theta \leftarrow \theta + \alpha \nabla_\theta \hat{L}(\pi_\theta, \mu)$. Similar ideas can be applied to value-based algorithms, for which we provide details in Appendix G.
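A minimal sketch of assembling the first- plus second-order objective from precomputed per-sample quantities (the dictionary keys and helper name are our own, and normalization constants are omitted as in Eq. 7):

```python
def taypo2_objective(samples):
    """Combine first- and second-order sample terms (illustrative sketch).
    Each sample holds:
      'rho_t'  = pi(a_t|x_t) / mu(a_t|x_t),   'adv_t'  = advantage at (x_t, a_t),
      'rho_tp' = pi(a'|x') / mu(a'|x'),        'adv_tp' = advantage at (x', a'),
    where (x', a') is a strictly later pair on the same trajectory.
    Q-values are replaced by advantages for variance reduction."""
    n = len(samples)
    L1 = sum(s['rho_t'] * s['adv_t'] for s in samples) / n
    L2 = sum((s['rho_t'] - 1.0) * (s['rho_tp'] - 1.0) * s['adv_tp'] for s in samples) / n
    return L1 + L2

samples = [
    {'rho_t': 1.2, 'adv_t': 1.0, 'rho_tp': 1.0, 'adv_tp': 0.5},
    {'rho_t': 0.8, 'adv_t': -1.0, 'rho_tp': 1.5, 'adv_tp': 2.0},
]
print(taypo2_objective(samples))  # approximately 0.1
```

In a deep RL implementation the ratios would carry gradients with respect to the policy parameters, and this scalar would be maximized by gradient ascent.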
4 Unifying the concepts: Taylor expansion as return-based off-policy evaluation
So far, we have made the connection between Taylor expansions and TRPO. On the other hand, as introduced in Section 1, Taylor expansions can also be intimately related to off-policy evaluation. Below, we formalize their connections. With Taylor expansions, we provide a consistent and unified view of TRPO and off-policy evaluation.
4.1 Taylor expansion as off-policy evaluation
In the general setting of off-policy evaluation, the data is collected under a behavior policy $\mu$ while the objective is to evaluate $Q^\pi$. Return-based off-policy evaluation operators (Munos et al., 2016) are a family of operators $\mathcal{R}$, indexed by (per state-action) trace-cutting coefficients $c$, a behavior policy $\mu$ and a target policy $\pi$:
$$\mathcal{R} Q = Q + (I - \gamma P^{c\mu})^{-1} \left(r + \gamma P^\pi Q - Q\right),$$
where $P^{c\mu}$ is the (sub-)probability transition kernel for policy $\mu$ with traces cut by $c$. Starting from any Q-function $Q$, repeated applications of the operator will result in convergence to $Q^\pi$, i.e., $\mathcal{R}^K Q \to Q^\pi$ as $K \to \infty$, subject to certain conditions on $c$. To state the main results, recall that Eq. 2 rewrites $Q^\pi$ as $\sum_{k \geq 0} \left(\gamma (I - \gamma P^\mu)^{-1} P^\mu D\right)^k Q^\mu$. In practice, we take a finite $K$ and use the approximation $Q^\pi \approx \sum_{k=0}^{K} \left(\gamma (I - \gamma P^\mu)^{-1} P^\mu D\right)^k Q^\mu$.
Next, we state the following result establishing a connection between the $K$-th order Taylor expansion and the return-based off-policy operator applied $K$ times.
Theorem 4.
Theorem 4 shows that when we approximate $Q^\pi$ by the Taylor expansion up to the $K$-th order, it is equivalent to generating an approximation by applying the off-policy evaluation operator $K$ times on $Q^\mu$. We also note that the off-policy evaluation operator in Theorem 4 is the $Q(\lambda)$ operator (Harutyunyan et al., 2016) with $\lambda = 1$. (As a side note, we also show in Appendix F.1 that the advantage estimation method GAE (Schulman et al., 2016) is highly related to the $Q(\lambda)$ operator.)
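The equivalence in Theorem 4 can be verified numerically. The sketch below (setup and names are ours; it assumes NumPy) applies the operator with constant trace coefficient $c = 1$ three times starting from $Q^\mu$, and compares the result against the third-order truncation of Eq. 1:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
n = nS * nA
p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transitions p(x'|x,a)
r = rng.uniform(size=n)                         # rewards r(x,a)

def P_of(pi):
    """P^pi[(x,a),(x',a')] = p(x'|x,a) pi(a'|x')."""
    return np.einsum('xas,sb->xasb', p, pi).reshape(n, n)

mu = rng.dirichlet(np.ones(nA), size=nS)
pi = rng.dirichlet(np.ones(nA), size=nS)
P_mu, P_pi = P_of(mu), P_of(pi)
Q_mu = np.linalg.solve(np.eye(n) - gamma * P_mu, r)
inv = np.linalg.inv(np.eye(n) - gamma * P_mu)

def R(Q):
    """Return-based operator with c = 1 (Q(lambda), lambda = 1):
    R Q = Q + (I - gamma P^mu)^{-1} (r + gamma P^pi Q - Q)."""
    return Q + inv @ (r + gamma * P_pi @ Q - Q)

# K applications of R starting from Q^mu match the K-th order truncation of Eq. 1.
M = inv @ (gamma * (P_pi - P_mu))
Q_iter, truncation, term = Q_mu.copy(), Q_mu.copy(), Q_mu.copy()
for _ in range(3):
    Q_iter = R(Q_iter)
    term = M @ term
    truncation = truncation + term
print(np.max(np.abs(Q_iter - truncation)))  # zero up to numerical error
```

The identity holds exactly (it does not require the convergence condition); convergence of either side to $Q^\pi$ as $K \to \infty$ is what requires $\pi$ to stay close to $\mu$.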
Alternative proof of convergence for $Q(\lambda)$.
Since Taylor expansions converge within a convergence radius, which in this case corresponds to $\max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1 < (1-\gamma)/\gamma$, it implies that $Q(\lambda)$ with $\lambda = 1$ converges when this condition holds. In fact, this coincides with the condition deduced by Harutyunyan et al. (2016). (Note that this alternative proof only works for the case where the initial Q-function is $Q^\mu$.)
4.2 An operator view of trust-region policy optimization
With the connection between Taylor expansions and off-policy evaluation, along with the connection between Taylor expansions and TRPO (Section 3), we give a novel interpretation of TRPO: the $K$-th order generalized TRPO is approximately equivalent to iterating the off-policy evaluation operator $\mathcal{R}$ $K$ times.
To make our claim explicit, recall that the RL objective in matrix form is an inner product between the initial distribution and the vector Q-function $Q^\pi$. Now consider approximating $Q^\pi$ by applying the evaluation operator to $Q^\mu$, iterating $K$ times. This produces a surrogate objective approximately equivalent to that of the generalized TRPO (Eq. 9). (The $K$-th order Taylor expansion of $Q^\pi$ is slightly different from that of the RL objective by construction; see Appendix B for details.) As a result, the generalized TRPO (including TRPO; Schulman et al., 2015) can be interpreted as approximating the exact RL objective $J(\pi)$ by iterating the evaluation operator $K$ times on $Q^\mu$ to approximate $Q^\pi$. When does this evaluation operator converge? Recall that the expansion converges when $\max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1 < (1-\gamma)/\gamma$, i.e., there is a trust-region constraint on $\pi$. This is consistent with the motivation of generalized TRPO discussed in Section 3, where a trust region is required for monotonic improvements.
5 Experiments
We evaluate the potential benefits of applying second-order expansions in a diverse set of scenarios. In particular, we test if the second-order correction helps with (1) policy-based and (2) value-based algorithms.
In large-scale experiments, to take advantage of computational architectures, actors (running $\mu$) and learners (updating $\pi$) are not perfectly synchronized. For case (1), in Section 5.1, we show that even in cases where they almost synchronize ($\mu \approx \pi$), higher-order corrections are still helpful. Then, in Section 5.2, we study how the performance of a general distributed policy-based agent (e.g., IMPALA, Espeholt et al., 2018) is influenced by the discrepancy between actors and learners. For case (2), in Section 5.3, we show the benefits of second-order expansions with a state-of-the-art value-based agent, R2D2 (Kapturowski et al., 2019).
Evaluation.
Architecture for distributed agents.
Distributed agents generally consist of a central learner and multiple actors (Nair et al., 2015; Mnih et al., 2016; Babaeizadeh et al., 2017; Barth-Maron et al., 2018; Horgan et al., 2018). We focus on two main setups: Type I includes agents such as IMPALA (Espeholt et al., 2018; see blue arrows in Figure 5 in Appendix H.3), covered in Section 5.1 and Section 5.2; Type II includes agents such as R2D2 (Kapturowski et al., 2019; see orange arrows in Figure 5 in Appendix H.3), covered in Section 5.3. We provide details on hyperparameters of the experiment setups in the respective subsections of Appendix H.
Practical considerations.
5.1 Near on-policy policy optimization
The policy-based agent maintains a target policy network for the learner and a set of behavior policy networks for the actors. The actor parameters are delayed copies of the learner parameter $\theta$. To emulate a near on-policy situation ($\mu \approx \pi$), we minimize the delay of the parameter passage between the central learner and actors by hosting both learner and actors on the same machine.
We compare second-order expansions with two baselines: first-order and zero-order. For the first-order baseline, we also adopt the PPO technique of clipping the IS ratio $\pi(a|x)/\mu(a|x)$ in Eq. 5. Clipping the ratio enforces an implicit trust region with the goal of increased stability (Schulman et al., 2017). This technique has been shown to generally outperform a naïve explicit constraint, as done in the original TRPO (Schulman et al., 2015). In Appendix H.5, we detail how we implemented PPO on the asynchronous architecture. Each baseline trains on the entire Atari suite for the same number of frames, and we compare the mean/median human-normalized scores.
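For reference, the PPO clipping just mentioned can be sketched as a per-sample term (our own minimal version; $\epsilon = 0.2$ is a common default, not a value taken from this paper):

```python
def clipped_ratio_term(rho, adv, eps=0.2):
    """PPO-style clipped surrogate term: min(rho * A, clip(rho, 1-eps, 1+eps) * A).
    Taking the min yields a pessimistic bound, removing the incentive to push
    the ratio far outside [1 - eps, 1 + eps]."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * adv, clipped * adv)

print(clipped_ratio_term(1.5, 1.0))   # prints 1.2: positive advantage, ratio capped
print(clipped_ratio_term(0.5, -1.0))  # prints -0.8: the pessimistic, clipped value
```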
The comparison results are shown in Figure 2. Please see the median score curves in Figure 6 in Appendix H.5. We make several observations: (1) Off-policy corrections are critical. Going from zero-order (no correction) to first-order improves the performance most significantly, even when the delays between actors and the learner are minimized as much as possible; (2) Second-order correction significantly improves on the first-order baseline. This might be surprising because, near on-policy, one should expect the additional second-order correction to be less important. This implies that in a fully asynchronous architecture, it is challenging to obtain sufficiently on-policy data, and additional corrections can be helpful.
5.2 Distributed off-policy policy optimization
We adopt the same setup as in Section 5.1. To maximize the overall throughput of the agent, the central learner and actors are distributed on different host machines. As a result, both the parameter passage from the learner to actors and the data passage from actors to the learner could be severely delayed. This creates a natural off-policy scenario with $\mu \neq \pi$.
We compare second-order with two baselines: first-order and V-trace. V-trace is used in the original IMPALA agent (Espeholt et al., 2018) and we present its details in Appendix H.6. We are interested in how the agent's performance changes as the level of off-policyness increases. In practice, the level of off-policyness can be controlled and measured as the delay (in milliseconds) of the parameter passage from the learner to actors. Results are shown in Figure 3, where the x-axis shows the artificial delays and the y-axis shows the mean human-normalized scores at the end of training. Note that the total delay consists of both artificial delays and inherent delays in the distributed system.
We make several observations: (1) All baseline variants' performance degrades as the delays increase; all baseline off-policy corrections are subject to failure as the level of off-policyness increases. (2) While all baselines perform rather similarly when delays are small, as the level of off-policyness increases, the second-order correction degrades slightly more gracefully than the other baselines. This implies that second-order is a more robust off-policy correction method than the other current alternatives.
5.3 Distributed value-based learning
The value-based agent maintains a Q-function network $Q_\theta$ for the learner and a set of delayed Q-function networks for the actors. The actors generate partial trajectories by executing exploratory (e.g., $\epsilon$-greedy) variants of the greedy policies with respect to their Q-functions, and send the data to a replay buffer. The target policy $\pi$ is greedy with respect to the current Q-function $Q_\theta$. The learner samples partial trajectories from the replay buffer and updates parameters by minimizing Bellman errors computed along the sampled trajectories. Here we focus on R2D2, a special instance of a distributed value-based agent. Please refer to Kapturowski et al. (2019) for a complete review of all algorithmic details of value-based agents such as R2D2.
Across all baseline variants, the learner computes regression targets for the network to approximate $Q^\pi$. The targets are calculated based on partial trajectories collected under $\mu$, which requires off-policy corrections. We compare several correction variants: zero-order, first-order, Retrace (Munos et al., 2016; Rowland et al., 2020) and second-order. Please see the algorithmic details in Appendix G.
The comparison results are in Figure 4, where we show the mean scores. We make several observations: (1) Second-order correction leads to marginally better performance than first-order and Retrace, and significantly better than zero-order. (2) In general, unbiased (or slightly biased) off-policy corrections do not yet perform as well as radically biased off-policy variants, such as uncorrected n-step returns (Kapturowski et al., 2019; Rowland et al., 2020). (3) Zero-order performs the worst: though it is able to reach super-human performance on most games like the other variants, its performance quickly plateaus. See Appendix H.7 for more results.
6 Discussion and conclusion
The idea of IS is at the core of most off-policy evaluation techniques (Precup et al., 2000; Harutyunyan et al., 2016; Munos et al., 2016). We showed that Taylor expansions construct approximations to the full IS corrections and hence intimately relate to established off-policy evaluation techniques.
However, the connection between IS and policy optimization is less straightforward. Prior work focuses on applying off-policy corrections directly to policy gradient estimators (Jie and Abbeel, 2010; Espeholt et al., 2018) instead of the surrogate objectives which generate the gradients. Though standard policy optimization objectives (Schulman et al., 2015, 2017) involve IS weights, their link with IS is not made explicit. Closely related to our work is that of Tomczak et al. (2019), who identified such optimization objectives as biased approximations to the full IS objective (Metelli et al., 2018). We characterized such approximations as the first-order special case of Taylor expansions and derived their natural generalizations.
In summary, we showed that Taylor expansions naturally connect trust-region policy search with off-policy evaluation. This new formulation unifies previous results, opens doors to new algorithms, and brings significant gains to certain state-of-the-art deep RL agents.
Acknowledgements.
Many thanks to Mark Rowland for insightful discussions during the development of these ideas, as well as extremely useful feedback on earlier versions of this paper. The authors also thank Diana Borsa, Jean-Bastien Grill, Florent Altché, Tadashi Kozuno, Zhongwen Xu, Steven Kapturowski, and Simon Schmitt for helpful discussions.
References
Abdolmaleki et al. (2018). Maximum a posteriori policy optimisation. In International Conference on Learning Representations.
Babaeizadeh et al. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. In International Conference on Learning Representations.
Barth-Maron et al. (2018). Distributional policy gradients. In International Conference on Learning Representations.
Bellemare et al. (2013). The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, pp. 253-279.
Berner et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
Espeholt et al. (2018). IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning.
Gruslys et al. (2018). The Reactor: a fast and sample-efficient actor-critic agent for reinforcement learning. In International Conference on Learning Representations.
Harutyunyan et al. (2016). Q(λ) with off-policy corrections. In Algorithmic Learning Theory.
He et al. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition.
Horgan et al. (2018). Distributed prioritized experience replay. In International Conference on Learning Representations.
Jie and Abbeel (2010). On a connection between importance sampling and the likelihood ratio policy gradient. In Neural Information Processing Systems.
Kakade and Langford (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning.
Kapturowski et al. (2019). Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Metelli et al. (2018). Policy optimization via importance sampling. In Neural Information Processing Systems.
Mnih et al. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
Mnih et al. (2013). Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop.
Munos et al. (2016). Safe and efficient off-policy reinforcement learning. In Neural Information Processing Systems.
Nair et al. (2015). Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296.
Pohlen et al. (2018). Observe and look further: achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593.
Precup et al. (2000). Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning.
Rowland et al. (2020). Adaptive trade-offs in off-policy learning. In International Conference on Artificial Intelligence and Statistics.
Schulman et al. (2015). Trust region policy optimization. In International Conference on Machine Learning.
Schulman et al. (2016). High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations.
Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, pp. 484-503.
Song et al. (2020). V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations.
Sutton et al. (2000). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems.
Tieleman and Hinton (2012). Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), pp. 26-31.
Tomczak et al. (2019). Policy optimization through approximated importance sampling. arXiv preprint arXiv:1910.03857.
Vinyals et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), pp. 350-354.
Wang et al. (2017). Sample efficient actor-critic with experience replay. In International Conference on Learning Representations.
Wang et al. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning.
Appendix A Derivation of results for generalized trust-region policy optimization
A.1 Controlling the residuals of Taylor expansions
We summarize the bound on the magnitude of the Taylor expansion residuals of the Q-function as a proposition.
Proposition 1.
Recall the definition of the Taylor expansion residual of the Q-function from the main text, $R_K := Q^{\pi} - \sum_{k=0}^{K} \left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{k} Q^{\mu}$. Let $\|\cdot\|_{\infty}$ be the infinity norm $\|Q\|_{\infty} := \max_{x,a} |Q(x,a)|$. Let $R_{\max}$ be the maximum reward in the entire MDP, $R_{\max} := \max_{x,a} |r(x,a)|$. Finally, let $\varepsilon := \max_{x} \|\pi(\cdot|x) - \mu(\cdot|x)\|_{1}$. Then, whenever $\gamma\varepsilon/(1-\gamma) < 1$,
(13) $\|R_K\|_{\infty} \leq \left(\frac{\gamma\varepsilon}{1-\gamma}\right)^{K+1} \frac{R_{\max}}{(1-\gamma)\left(1 - \frac{\gamma\varepsilon}{1-\gamma}\right)}$.
Proof.
The proof follows by bounding the magnitude of each term of the expansion. Since $\|(P^{\pi} - P^{\mu}) Q\|_{\infty} \leq \varepsilon \|Q\|_{\infty}$ and $\|Q^{\mu}\|_{\infty} \leq R_{\max}/(1-\gamma)$, the $k$th order term satisfies $\left\|\left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{k} Q^{\mu}\right\|_{\infty} \leq \left(\frac{\gamma\varepsilon}{1-\gamma}\right)^{k} \frac{R_{\max}}{1-\gamma}$. Summing the geometric tail over $k \geq K+1$ yields Eq. 13.
The above derivation shows that $\|R_K\|_{\infty} \to 0$ as $K \to \infty$ once $\gamma\varepsilon/(1-\gamma) < 1$. In the above derivation, we have applied the bound $\|(I - \gamma P^{\mu})^{-1}\|_{\infty} \leq 1/(1-\gamma)$, which will also be helpful in later derivations. ∎
A.2 Deriving Taylor expansions of the RL objective
Recall that the RHS of Eq. 2 is the Taylor expansion of the Q-function $Q^{\pi}$. By construction, the expansion is centered at the behavior Q-function $Q^{\mu}$. Though Eq. 2 shows the expansion of the entire vector $Q^{\pi}$, for optimization purposes, we care about the RL objective $J(\pi)$ from a starting state $x_0$, where $J(\pi)$ follows the definition from the main paper.
Now we focus on calculating the expansion of the objective for general $K$. For simplicity, we adopt abbreviated notation and henceforth use the equivalent forms interchangeably. Now consider the RHS of Eq. 3. By definition of the $K$th order Taylor expansion of $Q^{\pi}$, we keep terms in which the difference $P^{\pi} - P^{\mu}$ appears at most $K$ times. Equivalently, in matrix form, we remove the higher-order terms of $P^{\pi} - P^{\mu}$ while only keeping terms of the form $\left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{k} Q^{\mu}$ with $k \leq K$. This allows us to conclude that
Furthermore, we can single out each order-$k$ term.
Appendix B Proof of Theorem 1
Proof.
We derive the Taylor expansion of the Q-function $Q^{\pi}$ in different orders of $P^{\pi} - P^{\mu}$. For that purpose, we recursively make use of the following matrix equality
$(I - \gamma P^{\pi})^{-1} = (I - \gamma P^{\mu})^{-1} + (I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu}) (I - \gamma P^{\pi})^{-1},$
which can be derived from the matrix inversion equality $A^{-1} = B^{-1} + B^{-1}(B - A)A^{-1}$ or directly verified. Since $Q^{\pi} = (I - \gamma P^{\pi})^{-1} r$, we can use the previous equality to get
$Q^{\pi} = Q^{\mu} + (I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu}) Q^{\pi}.$
Next, we recursively apply the equality $K$ times,
$Q^{\pi} = \sum_{k=0}^{K} \left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{k} Q^{\mu} + \left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{K+1} Q^{\pi}.$
Now if $\gamma\varepsilon/(1-\gamma) < 1$, with $\varepsilon = \max_{x} \|\pi(\cdot|x) - \mu(\cdot|x)\|_{1}$, then we can bound the sup-norm of the above residual term as
$\left\|\left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{K+1} Q^{\pi}\right\|_{\infty} \leq \left(\frac{\gamma\varepsilon}{1-\gamma}\right)^{K+1} \|Q^{\pi}\|_{\infty},$
thus the $K$th order residual term vanishes when $K \to \infty$. As a result, the limit is well defined and we deduce
$Q^{\pi} = \sum_{k=0}^{\infty} \left((I - \gamma P^{\mu})^{-1} \gamma (P^{\pi} - P^{\mu})\right)^{k} Q^{\mu}.$
∎
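As a numerical sanity check of this expansion, the following minimal sketch builds a small random tabular MDP (the construction and all names below are illustrative, not code from the paper), computes the state-action transition matrices induced by the target policy $\pi$ and behavior policy $\mu$, solves for $Q^{\pi}$ exactly, and verifies that the truncation error of the $K$th order expansion shrinks as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP: transition kernel P[s, a, s'] and reward r[s, a].
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

def induced_matrix(policy):
    """State-action transition matrix: T[(s,a),(s',a')] = P(s'|s,a) * policy(a'|s')."""
    return np.einsum('ijk,kl->ijkl', P, policy).reshape(S * A, S * A)

def q_values(policy):
    """Exact Q^policy = (I - gamma * P^policy)^{-1} r."""
    return np.linalg.solve(np.eye(S * A) - gamma * induced_matrix(policy), r.reshape(-1))

# Behavior policy mu and a nearby target policy pi (a small deviation keeps
# the expansion within its convergence radius, epsilon < (1 - gamma) / gamma).
mu = rng.random((S, A))
mu /= mu.sum(axis=-1, keepdims=True)
pi = 0.95 * mu + 0.05 / A

q_mu, q_pi = q_values(mu), q_values(pi)

# The operator applied repeatedly in the expansion:
# M = (I - gamma * P^mu)^{-1} * gamma * (P^pi - P^mu).
P_mu, P_pi = induced_matrix(mu), induced_matrix(pi)
M = np.linalg.solve(np.eye(S * A) - gamma * P_mu, gamma * (P_pi - P_mu))

# K-th order truncations sum_{k=0}^{K} M^k Q^mu, and their residuals.
approx, term, errors = np.zeros(S * A), q_mu.copy(), []
for K in range(6):
    approx = approx + term
    errors.append(np.max(np.abs(q_pi - approx)))
    term = M @ term

print(errors)  # residuals shrink as the order K grows
```

With a larger policy deviation the series can diverge, consistent with the convergence-radius condition above.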
Appendix C Proof of Theorem 2
Proof.
To derive the monotonic improvement theorem for generalized TRPO, it is critical to bound the residual $R_K$. We achieve this by simply bounding each term separately. Recall from Appendix A.1 the bound on $\|R_K\|_{\infty}$. Without loss of generality, we first assume a normalized reward scale for ease of derivation.
This leads to a bound over the residuals
Since we have the equality for we can deduce the following monotonic improvement,
(14) 
To write the above statement in a compact way, we define the gap
To derive the result for general , note that the gap has a linear dependency on . Hence the general gap is
which produces the monotonic improvement result (Eq. 10) stated in the main paper. ∎
Appendix D Proof of Theorem 3
Proof.
It is known that for $K = 1$, replacing $Q^{\mu}$ by the advantage $A^{\mu} = Q^{\mu} - V^{\mu}$ in the estimation can potentially reduce variance (Schulman et al., 2015, 2017) yet keeps the estimate unbiased. Below, we show that replacing $Q^{\mu}$ by $A^{\mu}$ keeps the estimate unbiased for general $K$.
As shown above and more clearly in Appendix F, the $K$th order objective can be written as
(15)
Note that for clarity, in the above expectation, we omit an explicit sequence of discounted visitation distributions (for detailed derivations of this sequence of visitation distributions, see Appendix F). Next, we leverage the conditional expectation with respect to $a_t \sim \mu(\cdot|x_t)$: since $\mathbb{E}_{a_t \sim \mu(\cdot|x_t)}\left[\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} - 1\right] = 0$, the state-dependent baseline $V^{\mu}(x_t)$ contributes nothing in expectation, which yields
(16)
The above derivation shows that indeed, replacing $Q^{\mu}$ by $A^{\mu}$ does not change the value of the expectation, while potentially reducing the variance of the overall estimation. ∎
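The baseline step can be checked at a single state. The following minimal sketch (illustrative, with a hypothetical 4-action state; not code from the paper) verifies that subtracting the state-dependent baseline $V^{\mu}$ from $Q^{\mu}$ leaves the importance-weighted expectation unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
A = 4  # number of actions at a fixed state x

mu = rng.random(A); mu /= mu.sum()   # behavior policy mu(.|x)
pi = rng.random(A); pi /= pi.sum()   # target policy pi(.|x)
q = 10.0 * rng.random(A)             # Q^mu(x, .)
v = mu @ q                           # V^mu(x) = E_{a ~ mu}[Q^mu(x, a)]

ratio = pi / mu - 1.0
# Exact expectations under a ~ mu: E[(pi/mu - 1)] = 0, so the
# state-dependent baseline v contributes nothing to the mean.
mean_q = mu @ (ratio * q)
mean_adv = mu @ (ratio * (q - v))
print(mean_q, mean_adv)  # the two means coincide
```

The same cancellation applies at each conditioning step of the higher-order terms, which is exactly the argument used in the proof above.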
Appendix E Proof of Theorem 4
Proof.
From the definition of the return-based off-policy evaluation operator, we have
Thus the operator is linear, and
Applying this step repeatedly, we deduce
Applying the above operator, we deduce that
which proves our claim. ∎
Appendix F Alternative derivation for Taylor expansions of RL objective
In this section, we provide an alternative derivation of the Taylor expansion of the RL objective, for the case where the target policy $\pi$ is close to the behavior policy $\mu$ (e.g., for the trust-region case). To evaluate $\pi$ using data generated under $\mu$, a natural technique is to employ importance sampling (IS),
To derive the Taylor expansion in an intuitive way, consider expanding the product of IS ratios $\prod_{t \geq 0} \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}$, assuming that this infinite product is finite. Write each ratio as $\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} = 1 + \epsilon_t$ with some small $\epsilon_t$. A second-order Taylor expansion is
(17) $\prod_{t \geq 0} (1 + \epsilon_t) \approx 1 + \sum_{t \geq 0} \epsilon_t + \sum_{0 \leq t_1 < t_2} \epsilon_{t_1} \epsilon_{t_2}$.
Now, consider the term associated with $\sum_{t \geq 0} \epsilon_t$,
(18) 
Note that in the last equality, the discount factor $\gamma^{t}$ is absorbed into the discounted visitation distribution $d^{\mu}_{\gamma}$. It is then clear that this term is exactly the first-order expansion shown in the main paper.
Similarly, we can derive the second-order expansion by studying the term associated with $\sum_{0 \leq t_1 < t_2} \epsilon_{t_1} \epsilon_{t_2}$.
(19) 
Note that, similar to the first-order expansion, the discount factors are absorbed into the two discounted visitation distributions respectively. Here, the second discounted visitation distribution differs from the first: by construction $t_2 > t_1$, so we need to sample the second state conditional on the time difference $t_2 - t_1$ being at least $1$. The above is exactly the second-order expansion.
By a similar argument, we can derive all higher-order expansions by considering the term associated with $\sum_{0 \leq t_1 < \cdots < t_k} \epsilon_{t_1} \cdots \epsilon_{t_k}$. This introduces $k$ discounted visitation distributions.
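The product expansion in Eq. 17 is easy to check numerically. This sketch (with hypothetical constant per-step deviations $\epsilon_t = 0.01$ over a finite horizon; purely illustrative) compares the exact product against its first- and second-order truncations:

```python
import numpy as np
from itertools import combinations

T = 6
eps = np.full(T, 0.01)  # per-step IS ratio deviations: pi/mu = 1 + eps_t

exact = np.prod(1.0 + eps)             # = 1.01 ** 6
first = 1.0 + eps.sum()                # first-order truncation
second = first + sum(eps[i] * eps[j]   # add pairwise cross terms
                     for i, j in combinations(range(T), 2))

# First-order error is O(eps^2) (about 1.5e-3 here); second-order error is
# O(eps^3) (about 2e-5 here), two orders of magnitude smaller.
print(abs(exact - first), abs(exact - second))
```

This mirrors the derivation: each additional order of the expansion retains one more level of cross terms $\epsilon_{t_1} \cdots \epsilon_{t_k}$, pushing the residual one order higher in $\epsilon$.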