# Taylor Expansion Policy Optimization

In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation. Finally, we show that this new formulation entails modifications which improve the performance of several state-of-the-art distributed algorithms.


## 1 Introduction

Policy optimization is a major framework in model-free reinforcement learning (RL), with successful applications in challenging domains (Silver et al., 2016; Berner et al., 2019; Vinyals et al., 2019). Along with scaling up to powerful computational architectures (Mnih et al., 2016; Espeholt et al., 2018), significant algorithmic performance gains are driven by insights into the drawbacks of naïve policy gradient algorithms (Sutton et al., 2000). Among all algorithmic improvements, two of the most prominent are: trust-region policy search (Schulman et al., 2015, 2017; Abdolmaleki et al., 2018; Song et al., 2020) and off-policy corrections (Munos et al., 2016; Wang et al., 2017; Gruslys et al., 2018; Espeholt et al., 2018).

At first glance, these two streams of ideas focus on orthogonal aspects of policy optimization. For trust-region policy search, the idea is to constrain the size of policy updates. This limits the deviations between consecutive policies and lower-bounds the performance of the new policy (Kakade and Langford, 2002; Schulman et al., 2015). On the other hand, off-policy corrections require that we account for the discrepancy between the target policy and the behavior policy. Espeholt et al. (2018) observed that the corrections are especially useful for distributed algorithms, where the behavior policy and the target policy typically differ. Both algorithmic ideas have contributed significantly to stabilizing policy optimization.

In this work, we partially unify both algorithmic ideas into a single framework. In particular, we note that, as a ubiquitous approximation method, Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. To build intuition for these similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function on the real line $f:\mathbb{R}\to\mathbb{R}$, the $k$-th order Taylor expansion of $f(x)$ at $x_0$ is $f_k(x) \triangleq f(x_0) + \sum_{i=1}^{k} \frac{f^{(i)}(x_0)}{i!}(x-x_0)^i$, where $f^{(i)}(x_0)$ are the $i$-th order derivatives at $x_0$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust-region constraint. Indeed, in order for convergence to take place, a trust-region constraint $|x-x_0| < R(f,x_0)$ is required, where $R(f,x_0)$ is the convergence radius of the expansion, which in general depends on the function $f$ and origin $x_0$. Second, when using the truncation as an approximation to the original function, $f_K(x) \approx f(x)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluate the target policy with behavior data. Indeed, to evaluate the truncation $f_K(x)$ at any $x$ (target policy), we only require the behavior policy “data” at $x_0$ (i.e., derivatives $f^{(i)}(x_0)$).
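The 1D analogy can be checked numerically. The following is a minimal sketch, not part of the original work: the test function $f(x) = 1/(1-x)$, the expansion point $x_0 = 0$, and the helper name `taylor_truncation` are all illustrative assumptions. For this function the convergence radius is 1, so truncations improve with order inside the radius and diverge outside it.

```python
import math

def taylor_truncation(derivs_at_x0, x0, x, order):
    """Truncation f(x0) + sum_{i=1}^{order} f^(i)(x0) / i! * (x - x0)^i.

    derivs_at_x0[i] holds the i-th derivative at x0 (index 0 is f(x0) itself),
    i.e. the only "behavior data" needed to evaluate the truncation at x.
    """
    return sum(derivs_at_x0[i] / math.factorial(i) * (x - x0) ** i
               for i in range(order + 1))

# Illustrative choice: f(x) = 1 / (1 - x), whose i-th derivative at 0 is i!.
derivs = [float(math.factorial(i)) for i in range(41)]
f = lambda x: 1.0 / (1.0 - x)

# Inside the convergence radius (|x - x0| < 1) the error shrinks with the order.
inside = [abs(taylor_truncation(derivs, 0.0, 0.5, k) - f(0.5)) for k in (1, 5, 40)]
# Outside the radius the truncation moves away from f as the order grows.
outside = [abs(taylor_truncation(derivs, 0.0, 1.5, k) - f(1.5)) for k in (1, 5, 40)]
```

The truncation only ever touches quantities measured at $x_0$, which mirrors evaluating a target policy from behavior data alone.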

Our paper proceeds as follows. In Section 2, we start with a general result of applying Taylor expansions to Q-functions. When we apply the same technique to the RL objective, we reuse the general result and derive a higher-order policy optimization objective. This leads to Section 3, where we formally present Taylor Expansion Policy Optimization (TayPO) and generalize prior work (Schulman et al., 2015, 2017) as a first-order special case. In Section 4, we make clear the connection between Taylor expansions and $Q(\lambda)$ (Harutyunyan et al., 2016), a common return-based off-policy evaluation operator. Finally, in Section 5, we show the performance gains due to the higher-order objectives across a range of state-of-the-art distributed deep RL agents.

## 2 Taylor expansion for reinforcement learning

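Consider a Markov Decision Process (MDP) with state space $\mathcal{X}$ and action space $\mathcal{A}$. Let policy $\pi(\cdot|x)$ be a distribution over actions given state $x$. At a discrete time $t \ge 0$, the agent in state $x_t$ takes action $a_t \sim \pi(\cdot|x_t)$, receives reward $r_t \triangleq r(x_t,a_t)$, and transitions to a next state $x_{t+1} \sim p(\cdot|x_t,a_t)$. We assume a discount factor $\gamma \in [0,1)$. Let $Q^\pi(x,a) \triangleq \mathbb{E}_\pi\left[\sum_{t\ge 0}\gamma^t r_t \mid x_0=x, a_0=a\right]$ be the action value function (Q-function) from state $x$, taking action $a$ and following policy $\pi$. For convenience, we use $d^\pi_\gamma(\cdot,\cdot|x_0,a_0,\tau)$ to denote the discounted visitation distribution starting from state-action pair $(x_0,a_0)$ and following $\pi$, such that $d^\pi_\gamma(x,a|x_0,a_0,\tau) = (1-\gamma)\gamma^{-\tau}\sum_{t\ge\tau}\gamma^t P(x_t=x,a_t=a \mid x_0,a_0,\pi)$. We thus have $Q^\pi(x,a) = (1-\gamma)^{-1}\mathbb{E}_{(x',a')\sim d^\pi_\gamma(\cdot,\cdot|x,a,0)}\left[r(x',a')\right]$. We focus on the RL objective of optimizing $J(\pi) \triangleq \mathbb{E}_\pi\left[\sum_{t\ge 0}\gamma^t r_t\right]$ starting from a fixed initial state $x_0$.

We define some useful matrix notation. For ease of analysis, we assume that $\mathcal{X}$ and $\mathcal{A}$ are both finite. Let $r \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ denote the reward function and $P^\pi \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|\times|\mathcal{X}||\mathcal{A}|}$ denote the transition matrix such that $P^\pi(x,a,y,b) \triangleq p(y|x,a)\pi(b|y)$. We also define $Q^\pi \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ as the vector Q-function. This matrix notation facilitates compact derivations; for example, the Bellman equation writes as $Q^\pi = r + \gamma P^\pi Q^\pi$.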

### 2.1 Taylor Expansion of Q-functions.

In this part, we state the Taylor expansion of Q-functions. Our motivation for the expansion is the following: assume we aim to estimate $Q^\pi$ for a target policy $\pi$, and we only have access to data collected under a behavior policy $\mu$. Since $Q^\mu$ can be readily estimated with the collected data, how do we approximate $Q^\pi$ with $Q^\mu$?

Clearly, when $\pi = \mu$, then $Q^\pi = Q^\mu$. Whenever $\pi \ne \mu$, $Q^\pi$ starts to deviate from $Q^\mu$. Therefore, we apply a Taylor expansion to describe the deviation $Q^\pi - Q^\mu$ in the orders of $P^\pi - P^\mu$. We provide the following result.

###### Theorem 1.

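(proved in Appendix B) For any policies $\pi$ and $\mu$ and any $K \ge 1$, we have

$$Q^\pi - Q^\mu = \sum_{k=1}^{K}\left(\gamma(I-\gamma P^\mu)^{-1}(P^\pi-P^\mu)\right)^k Q^\mu + \left(\gamma(I-\gamma P^\mu)^{-1}(P^\pi-P^\mu)\right)^{K+1} Q^\pi.$$

In addition, if $\|\pi-\mu\|_1 \triangleq \max_x \sum_a |\pi(a|x)-\mu(a|x)| < (1-\gamma)/\gamma$, then the limit for $K \to \infty$ exists and we have

$$Q^\pi - Q^\mu = \sum_{k=1}^{\infty}\left(\gamma(I-\gamma P^\mu)^{-1}(P^\pi-P^\mu)\right)^k Q^\mu. \tag{1}$$

The constraint between $\pi$ and $\mu$ is a result of the convergence radius of the Taylor expansion. The derivation follows by recursively applying the following equality: $Q^\pi = Q^\mu + \gamma(I-\gamma P^\mu)^{-1}(P^\pi-P^\mu)Q^\pi$. Please refer to Appendix B for a proof. For ease of notation, denote the $k$-th term on the RHS of Eq. 1 as $U_k$. This gives rise to $Q^\pi - Q^\mu = \sum_{k=1}^{\infty} U_k$.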

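To represent $Q^\pi - Q^\mu$ explicitly with the deviation between $\pi$ and $\mu$, consider a diagonal matrix $D_{\pi/\mu}$ where $D_{\pi/\mu}(x,a,y,b) \triangleq \frac{\pi(a|x)}{\mu(a|x)}\delta_{x=y,a=b}$ and where $\delta$ is the Dirac delta function; we restrict to the case where $\mu(a|x) > 0$ for all $x, a$. This diagonal matrix is a measure of the deviation between $\pi$ and $\mu$. Since $P^\pi = P^\mu D_{\pi/\mu}$, the above expression can be rewritten as

$$Q^\pi - Q^\mu = \sum_{k=1}^{\infty}\left(\gamma(I-\gamma P^\mu)^{-1}P^\mu(D_{\pi/\mu}-I)\right)^k Q^\mu. \tag{2}$$

We will see that the expansion in Eq. 2 is useful in Section 3 when we derive the Taylor expansion of the difference between the performances of two policies, $J(\pi) - J(\mu)$. In Section 4, we also provide the connection between Taylor expansion and off-policy evaluation.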
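The expansion of Q-functions can be checked on a small tabular problem. The sketch below is an illustration under stated assumptions, not the authors' code: the random MDP, its sizes, the discount $\gamma = 0.5$, and the helper names `random_policy` and `transition_matrix` are hypothetical. The target policy is built as a mixture so that $\|\pi-\mu\|_1 \le 0.6 < (1-\gamma)/\gamma = 1$, which places it inside the convergence radius.

```python
import numpy as np

def random_policy(rng, n_states, n_actions):
    """A strictly positive random policy, shape (n_states, n_actions)."""
    w = rng.random((n_states, n_actions)) + 0.1
    return w / w.sum(axis=1, keepdims=True)

def transition_matrix(p, policy):
    """P[(x,a),(y,b)] = p(y|x,a) * policy(b|y), flattened over (state, action)."""
    n_states, n_actions = policy.shape
    joint = p[:, :, :, None] * policy[None, None, :, :]
    return joint.reshape(n_states * n_actions, n_states * n_actions)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.5
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)          # p(y|x,a)
r = rng.random(n_states * n_actions)       # rewards in [0, 1)

mu = random_policy(rng, n_states, n_actions)
pi = 0.7 * mu + 0.3 * random_policy(rng, n_states, n_actions)

P_mu, P_pi = transition_matrix(p, mu), transition_matrix(p, pi)
I = np.eye(n_states * n_actions)
Q_mu = np.linalg.solve(I - gamma * P_mu, r)   # exact Q-function of mu
Q_pi = np.linalg.solve(I - gamma * P_pi, r)   # exact Q-function of pi

# Diagonal matrix of importance ratios; P_mu @ D equals P_pi.
D = np.diag((pi / mu).ravel())

# A = gamma (I - gamma P_mu)^{-1} P_mu (D - I); the k-th term is A^k Q_mu.
A = gamma * np.linalg.solve(I - gamma * P_mu, P_mu @ (D - I))
approx, term, errors = Q_mu.copy(), Q_mu.copy(), []
for k in range(1, 9):
    term = A @ term
    approx = approx + term
    errors.append(np.max(np.abs(Q_pi - approx)))   # max-norm error at order k
```

In this setup the list `errors` shrinks as the order grows, which is the behavior Eq. 2 predicts inside the convergence radius.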

### 2.2 Taylor expansion of reinforcement learning objective

When searching for a better policy, we are often interested in the difference $J(\pi) - J(\mu)$. With Eq. 2, we can derive a similar Taylor expansion result for $J(\pi) - J(\mu)$. Let $\pi_t$ (resp., $\mu_t$) be the shorthand notation for $\pi(a_t|x_t)$ (resp., $\mu(a_t|x_t)$). Here, we formalize the orders of the expansion as the number of times that ratios $\pi_t/\mu_t - 1$ appear in the expression. For example, the first-order expansion should only involve $\pi_t/\mu_t - 1$ up to the first order, without higher-order terms such as the cross product $(\pi_t/\mu_t - 1)(\pi_{t'}/\mu_{t'} - 1)$. We denote the $k$-th order as $L_k(\pi,\mu)$ and by construction $J(\pi) - J(\mu) = \sum_{k\ge 1} L_k(\pi,\mu)$. Next, we derive practically useful expressions for $L_k(\pi,\mu)$.

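We provide a derivation sketch below and give the details in Appendix F. Let $\pi_0, \mu_0 \in \mathbb{R}^{|\mathcal{X}||\mathcal{A}|}$ be the joint distributions of policies and state at time $t = 0$ such that $\pi_0(x,a) = \pi(a|x)\delta_{x=x_0}$, and similarly for $\mu_0$. Note that the RL objective equivalently writes as $J(\pi) = \sum_a \pi(a|x_0)Q^\pi(x_0,a)$ and can be expressed as an inner product $J(\pi) = \pi_0^T Q^\pi$. This allows us to import results from Eq. 2,

$$J(\pi) - J(\mu) = \pi_0^T Q^\pi - \mu_0^T Q^\mu = (\pi_0-\mu_0)^T\Big(Q^\mu + \sum_{k\ge 1}U_k\Big) + \mu_0^T\sum_{k\ge 1}U_k. \tag{3}$$

By reading off different orders of the expansion from the RHS of Eq. 3, we derive

$$\begin{aligned} L_1(\pi,\mu) &= (\pi_0-\mu_0)^T Q^\mu + \mu_0^T U_1, \\ L_k(\pi,\mu) &= (\pi_0-\mu_0)^T U_{k-1} + \mu_0^T U_k, \quad \forall k \ge 2. \end{aligned} \tag{4}$$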

It is worth noting that the $k$-th order expansion of the RL objective is a mixture of the $(k-1)$-th and $k$-th order Q-function expansions. This is because $J(\pi)$ integrates $Q^\pi$ over the initial $\pi_0$, and the initial difference $\pi_0 - \mu_0$ contributes one order of difference in $L_k(\pi,\mu)$.
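Eq. 4 can be evaluated directly in the tabular case. The following is a minimal sketch under assumed settings (a hypothetical random MDP with 3 states, 2 actions, $\gamma = 0.5$, initial state index 0, and illustrative helper names); it computes the first 20 orders from the matrix form and compares their sum with the exact difference in performance.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.5
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)           # p(y|x,a)
r = rng.random(nS * nA)

def normalize(w):
    return w / w.sum(axis=1, keepdims=True)

mu = normalize(rng.random((nS, nA)) + 0.1)
pi = 0.8 * mu + 0.2 * normalize(rng.random((nS, nA)) + 0.1)

def kernel(policy):
    """State-action transition matrix P[(x,a),(y,b)] = p(y|x,a) policy(b|y)."""
    return (p[:, :, :, None] * policy[None, None]).reshape(nS * nA, nS * nA)

def initial(policy, x0=0):
    """Joint distribution over (state, action) at t = 0 for a fixed start state."""
    v = np.zeros((nS, nA))
    v[x0] = policy[x0]
    return v.ravel()

I = np.eye(nS * nA)
Q_mu = np.linalg.solve(I - gamma * kernel(mu), r)
Q_pi = np.linalg.solve(I - gamma * kernel(pi), r)
pi0, mu0 = initial(pi), initial(mu)
A = gamma * np.linalg.solve(I - gamma * kernel(mu), kernel(pi) - kernel(mu))

U = [Q_mu]                                  # U[0] stands in for Q^mu in Eq. 4
for k in range(1, 21):
    U.append(A @ U[-1])                     # U[k] = A^k Q^mu

# L[k-1] = (pi0 - mu0)^T U_{k-1} + mu0^T U_k, for k = 1..20
L = [(pi0 - mu0) @ U[k - 1] + mu0 @ U[k] for k in range(1, 21)]
true_gap = pi0 @ Q_pi - mu0 @ Q_mu          # J(pi) - J(mu)
```

Because the mixture keeps $\pi$ close to $\mu$, `sum(L)` agrees with `true_gap` up to a residual that is negligible at order 20 in this setup.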

Below, we illustrate the results for the first-order, second-order, and higher-order cases. To make the results more intuitive, we convert the matrix notation of Eq. 3 into explicit expectations under $\mu$.

#### First-order expansion.

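By converting $L_1(\pi,\mu)$ from Eq. 4 into expectations, we get

$$\mathbb{E}_{a_0\sim\mu(\cdot|x_0),\ (x,a)\sim d^\mu_\gamma(\cdot,\cdot|x_0,a_0,0)}\left[\left(\frac{\pi(a|x)}{\mu(a|x)}-1\right)Q^\mu(x,a)\right]. \tag{5}$$

To be precise, $L_1(\pi,\mu) = (1-\gamma)^{-1}\,(\text{Eq. 5})$ to account for the normalization of the discounted visitation distribution $d^\mu_\gamma$. Note that $L_1(\pi,\mu)$ is exactly the same as the surrogate objective proposed in prior work on scalable policy optimization (Kakade and Langford, 2002; Schulman et al., 2015, 2017). Indeed, these works proposed to estimate and optimize such a surrogate objective at each iteration while enforcing a trust region. In the following, we generalize this objective with Taylor expansions.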

#### Second-order expansion.

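By converting $L_2(\pi,\mu)$ from Eq. 4 into expectations, we get

$$\mathbb{E}_{a_0\sim\mu(\cdot|x_0),\ (x,a)\sim d^\mu_\gamma(\cdot,\cdot|x_0,a_0,0),\ (x',a')\sim d^\mu_\gamma(\cdot,\cdot|x,a,1)}\left[\left(\frac{\pi(a|x)}{\mu(a|x)}-1\right)\left(\frac{\pi(a'|x')}{\mu(a'|x')}-1\right)Q^\mu(x',a')\right]. \tag{6}$$

Again, accounting for the normalization, $L_2(\pi,\mu) = \gamma(1-\gamma)^{-2}\,(\text{Eq. 6})$. To calculate the above expectation, we first start from $(x_0,a_0)$ and sample a pair $(x,a)$ from the discounted distribution $d^\mu_\gamma(\cdot,\cdot|x_0,a_0,0)$. Then, we use $(x,a)$ as the starting point and sample another pair $(x',a')$ from $d^\mu_\gamma(\cdot,\cdot|x,a,1)$. This implies that the second-order expansion can be estimated using only samples under $\mu$, which will be essential for policy optimization in practice.

It is worth noting that the second state-action pair $(x',a')$ is sampled with the argument $\tau = 1$ instead of $\tau = 0$. This is because $L_2(\pi,\mu)$ only contains terms sampled across strictly different time steps.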

#### Higher-order expansions.

Similarly to the first-order and second-order expansions, higher-order expansions are also possible by including proper higher-order terms in $U_k$. For general $K$, $L_K(\pi,\mu)$ can be expressed as (omitting the normalization constants)

$$\mathbb{E}_{(x^{(i)},a^{(i)})_{1\le i\le K}}\left[\prod_{i=1}^{K}\left(\frac{\pi(a^{(i)}|x^{(i)})}{\mu(a^{(i)}|x^{(i)})}-1\right)Q^\mu(x^{(K)},a^{(K)})\right]. \tag{7}$$

Here, $(x^{(i)},a^{(i)})$, $1 \le i \le K$, are sampled sequentially, each following a discounted visitation distribution conditional on the previous state-action pair. We show the detailed derivations in Appendix F. Furthermore, we discuss the trade-off between different orders in Section 3.

#### Interpretation & intuition.

Evaluating $J(\pi)$ with data under $\mu$ requires importance sampling (IS). In general, since $\pi$ can differ from $\mu$ at all state-action pairs, computing $J(\pi)$ exactly with full IS requires corrections at all steps along generated trajectories. The first-order expansion (Eq. 5) corresponds to carrying out a single correction at one sampled state-action pair along the trajectories: in computing Eq. 5, we sample a state-action pair $(x,a)$ along the trajectory and calculate one single IS correction $\pi(a|x)/\mu(a|x) - 1$. Similarly, the second-order expansion (Eq. 6) goes one step further and considers the IS corrections at two different steps $(x,a)$ and $(x',a')$. As such, Taylor expansions of the RL objective can be interpreted as increasingly tight approximations of the full IS correction.

## 3 Taylor expansion for policy optimization

In high-dimensional policy optimization, where exact algorithms such as dynamic programming are not feasible, it is necessary to learn from sampled data. In general, the sampled data are collected under a behavior policy $\mu$ different from the target policy $\pi$. For example, in trust-region policy search (e.g., TRPO, Schulman et al., 2015; PPO, Schulman et al., 2017), $\pi$ is the new policy while $\mu$ is a previous policy; in asynchronous distributed algorithms (Mnih et al., 2016; Espeholt et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019), $\pi$ is the learner policy while $\mu$ is a delayed actor policy. In this section, we show the fundamental connection between trust-region policy search and Taylor expansions, and propose the general framework of Taylor expansion policy optimization (TayPO).

### 3.1 Generalized trust-region policy optimization

For policy optimization, it is necessary that the update function (e.g., policy gradients or surrogate objectives) can be estimated with sampled data under the behavior policy $\mu$. Taylor expansions are a natural paradigm to satisfy this requirement. Indeed, to optimize $J(\pi)$, consider optimizing (once again, the equality holds under certain conditions, detailed in Section 4)

$$\max_\pi J(\pi) = \max_\pi J(\mu) + \sum_{k=1}^{\infty} L_k(\pi,\mu). \tag{8}$$

Though we have shown that $L_k(\pi,\mu)$ for all $k$ are expectations under $\mu$, it is not feasible to unbiasedly estimate the RHS of Eq. 8 because it involves an infinite number of terms. In practice, we can truncate the objective up to the $K$-th order, $\sum_{k=1}^{K} L_k(\pi,\mu)$, and drop $J(\mu)$ because it does not involve $\pi$.

However, for any fixed $K$, optimizing the truncated objective in an unconstrained way is risky: as $\pi$ and $\mu$ become increasingly different, the approximation $\sum_{k=1}^{K} L_k(\pi,\mu) \approx J(\pi) - J(\mu)$ becomes more inaccurate and we stray away from optimizing the objective of interest. The approximation error comes from the residual $\sum_{k>K} L_k(\pi,\mu)$; to control the magnitude of the residual, it is natural to constrain $\|\pi-\mu\|_1 \le \varepsilon$ with some $\varepsilon > 0$. Indeed, the magnitude of the residual can be bounded in terms of $\varepsilon$; please see Appendix A.1 for more detailed derivations. We formalize the entire local optimization problem as generalized trust-region policy optimization (generalized TRPO),

$$\max_\pi \sum_{k=1}^{K} L_k(\pi,\mu), \quad \|\pi-\mu\|_1 \le \varepsilon. \tag{9}$$

#### Monotonic improvement.

While maximizing the surrogate objective under trust-region constraints (Eq. 9), it is desirable to have a performance guarantee on the true objective $J(\pi)$. Below, Theorem 2 gives such a result.

###### Theorem 2.

(proved in Appendix C) When the policy $\pi$ is optimized based on the trust-region objective Eq. 9 and $\varepsilon < (1-\gamma)/\gamma$, the performance $J(\pi)$ is lower bounded as

$$J(\pi) \ge J(\mu) + \sum_{k=1}^{K} L_k(\pi,\mu) - G_K, \tag{10}$$

where $G_K \ge 0$ is a gap term whose expression is derived in Appendix C.

Note that if $\varepsilon < (1-\gamma)/\gamma$, then as $K \to \infty$, the gap $G_K \to 0$. Therefore, when optimizing $\pi$ based on Eq. 9, the performance $J(\pi)$ is always lower-bounded according to Eq. 10.

#### Connections to prior work on trust-region policy search.

The generalized TRPO extends the formulation of prior work, e.g., TRPO/PPO of Schulman et al. (2015, 2017). Indeed, idealized forms of these algorithms are a special case for $K = 1$, though for practical purposes the $\|\cdot\|_1$ constraint is replaced by averaged KL constraints. (Instead of forming the constraints explicitly, PPO (Schulman et al., 2017) enforces the constraints implicitly by clipping IS ratios.)

### 3.2 TayPO-k: Optimizing with k-th order expansion

Though there is a theoretical motivation to use trust-region constraints for policy optimization (Schulman et al., 2015; Abdolmaleki et al., 2018), such constraints are rarely explicitly enforced in practice in their most standard form (Eq. 9). Instead, trust regions are implicitly encouraged via, e.g., ratio clipping (Schulman et al., 2017) or parameter averaging (Wang et al., 2017). In large-scale distributed settings, algorithms already benefit from diverse sample collections for variance reduction of the parameter updates (Mnih et al., 2016; Espeholt et al., 2018), which brings the desired stability for learning and makes trust-region constraints (either explicit or implicit) less necessary. Therefore, we focus on the setting where no trust region is explicitly enforced. We introduce a new family of algorithms, TayPO-$k$, which applies the $k$-th order Taylor expansions for policy optimization.

#### Unbiased estimations with variance reduction.

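In practice, $L_K(\pi,\mu)$, as expectations under $\mu$, can be estimated as $\hat{L}_K(\pi,\mu)$ over a single trajectory. Take $K = 2$ as an example: given a trajectory $(x_t,a_t,r_t)_{t\ge 0}$ generated by $\mu$, assume we have access to some estimates $\hat{Q}^\mu(x,a)$ of $Q^\mu(x,a)$, e.g., cumulative returns. To generate a sample $(x,a) \sim d^\mu_\gamma(\cdot,\cdot|x_0,a_0,0)$, we can first sample a random time $t$ from a geometric distribution with success probability $1-\gamma$, i.e., $t \sim \text{Geometric}(1-\gamma)$. Second, we sample another random time $t'$ from the same geometric distribution but conditional on $t' \ge 1$ (as explained in Section 2.2, since $L_2(\pi,\mu)$ contains IS ratios at strictly different time steps, it is required that $t' \ge 1$). Then, a single sample estimate of Eq. 6 is given by

$$\left(\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}-1\right)\left(\frac{\pi(a_{t+t'}|x_{t+t'})}{\mu(a_{t+t'}|x_{t+t'})}-1\right)\hat{Q}^\mu(x_{t+t'},a_{t+t'}).$$

Further, the following shows the effect of replacing Q-values $Q^\mu(x,a)$ by advantages $A^\mu(x,a) \triangleq Q^\mu(x,a) - V^\mu(x)$.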

###### Theorem 3.

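(proved in Appendix D) The computation of $L_K(\pi,\mu)$ based on Eq. 7 is exact when replacing $Q^\mu$ by $A^\mu$, i.e., $L_K(\pi,\mu)$ can be expressed as

$$\mathbb{E}_{(x^{(i)},a^{(i)})_{1\le i\le K}}\left[\prod_{i=1}^{K}\left(\frac{\pi(a^{(i)}|x^{(i)})}{\mu(a^{(i)}|x^{(i)})}-1\right)A^\mu(x^{(K)},a^{(K)})\right].$$

In practice, when computing $\hat{L}_K(\pi,\mu)$, replacing $\hat{Q}^\mu$ by $\hat{A}^\mu$ still produces an unbiased estimate and potentially reduces variance. This naturally recovers the result in prior work for $K = 1$ (Schulman et al., 2016).

When $K \ge 2$, we can construct objectives with higher-order terms. The motivation is that with high $K$, $\sum_{k=1}^{K} L_k(\pi,\mu)$ forms a closer approximation to the objective of interest, $J(\pi) - J(\mu)$. Why not then have $K$ as large as possible? This comes at a trade-off. For example, let us compare $L_1(\pi,\mu)$ and $L_1(\pi,\mu) + L_2(\pi,\mu)$: though $L_1 + L_2$ forms a closer approximation to $J(\pi) - J(\mu)$ than $L_1$ in expectation, it could have higher variance during estimation when, e.g., $\hat{L}_1$ and $\hat{L}_2$ have a non-negative correlation. Indeed, as $K \to \infty$, $\sum_{k=1}^{K} L_k(\pi,\mu)$ approximates the full IS correction, which is known to have high variance (Munos et al., 2016).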

#### How many orders to take in practice?

Though the higher-order policy optimization formulation generalizes previous results (Schulman et al., 2015, 2017) as a first-order special case, does it suffice to only include first-order terms in practice?

To assess the effects of Taylor expansions, consider a policy evaluation problem on a random MDP (see Appendix H.1 for the detailed setup): given a target policy $\pi$ and a behavior policy $\mu$, the approximation error of the $K$-th order expansion is . In Figure 1, we show the relative errors as a function of . Ground-truth quantities such as $Q^\pi$ are always computed analytically. Solid lines show results where all estimates are also computed analytically, e.g., $Q^\mu$ is computed as $(I-\gamma P^\mu)^{-1}r$. Observe that the errors decrease drastically as the expansion order increases. To quantify how sample estimates impact the quality of approximations, we re-compute the estimates but with $Q^\mu$ replaced by empirical estimates $\hat{Q}^\mu$. Results are shown in dashed curves. Now comparing the first-order and second-order expansions, observe that both errors go up compared to their fully analytic counterparts, and both become more similar when is small.

This provides motivations for second-order expansions. While first-orders are a default choice for common deep RL algorithms (Schulman et al., 2015, 2017), from the simple MDP example we see that the second-order expansions could potentially improve upon the first-order, even with sample estimates.

### 3.3 TayPO-2 — Second-order policy optimization

From here onwards, we focus on TayPO-2. At any iteration, the data are collected under the behavior policy $\mu$ in the form of partial trajectories of length $T$. The learner maintains a parametric policy $\pi_\theta$ to be optimized. First, we carry out advantage estimation $\hat{A}^\mu(x,a)$ for state-action pairs on the partial trajectories. This could be naïvely estimated from sampled returns along the partial trajectory together with value function baselines. One could also adopt more advanced estimation techniques such as generalized advantage estimation (GAE, Schulman et al., 2016). Then, we construct surrogate objectives for optimization: the first-order component $\hat{L}_1(\pi_\theta,\mu)$ as well as the second-order component $\hat{L}_2(\pi_\theta,\mu)$, based on Eq. 5 and Eq. 6 respectively. Note that we replace all $\hat{Q}^\mu(x,a)$ by $\hat{A}^\mu(x,a)$ for variance reduction.

Therefore, our final objective function becomes

$$\hat{L}_\theta \triangleq \hat{L}_1(\pi_\theta,\mu) + \hat{L}_2(\pi_\theta,\mu). \tag{11}$$

The parameter $\theta$ is updated via gradient ascent, $\theta \leftarrow \theta + \alpha\nabla_\theta\hat{L}_\theta$ with learning rate $\alpha$. Similar ideas can be applied to value-based algorithms, for which we provide details in Appendix G.
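As a rough illustration of how the two components of Eq. 11 could be assembled from one partial trajectory, the sketch below sums the first-order and second-order corrections over all time steps instead of sampling random times. This is an assumption made for illustration and not necessarily how the agents in this paper compute their losses; the function name `taypo2_surrogate` and the toy inputs are hypothetical, and no gradient is taken here.

```python
import numpy as np

def taypo2_surrogate(ratios, advantages, gamma):
    """First- and second-order surrogate terms on one partial trajectory.

    ratios[t]     : pi(a_t | x_t) / mu(a_t | x_t)
    advantages[t] : advantage estimate at step t
    Returns (first, second) where
      first  = sum_t gamma^t (ratio_t - 1) * adv_t
      second = sum_{t < t'} gamma^{t'} (ratio_t - 1) (ratio_{t'} - 1) * adv_{t'}
    """
    ratios = np.asarray(ratios, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    discount = gamma ** np.arange(len(ratios))
    dev = ratios - 1.0
    first = np.sum(discount * dev * adv)
    # prefix[t'] accumulates (ratio_t - 1) over strictly earlier steps t < t'
    prefix = np.concatenate(([0.0], np.cumsum(dev)[:-1]))
    second = np.sum(discount * prefix * dev * adv)
    return first, second

# Toy inputs (made up for illustration).
first, second = taypo2_surrogate([2.0, 0.5], [1.0, 2.0], gamma=0.5)
objective = first + second

# When pi equals mu every ratio is 1, so both terms vanish.
on_policy = taypo2_surrogate([1.0, 1.0, 1.0], [0.3, -0.2, 1.0], gamma=0.9)
```

The strictly-earlier prefix sum reflects the requirement that the second-order term only pairs IS ratios from different time steps.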

## 4 Unifying the concepts: Taylor expansion as return-based off-policy evaluation

So far we have made the connection between Taylor expansions and TRPO. On the other hand, as introduced in Section 1, Taylor expansions can also be intimately related to off-policy evaluation. Below, we formalize their connections. With Taylor expansions, we provide a consistent and unified view of TRPO and off-policy evaluation.

### 4.1 Taylor expansion as off-policy evaluation

In the general setting of off-policy evaluation, the data are collected under a behavior policy $\mu$ while the objective is to evaluate $Q^\pi$. Return-based off-policy evaluation operators (Munos et al., 2016) are a family of operators $\mathcal{R}^{\pi,\mu}_c$, indexed by (per state-action) trace-cutting coefficients $c(x,a)$, a behavior policy $\mu$ and a target policy $\pi$:

$$\mathcal{R}^{\pi,\mu}_c Q \triangleq Q + (I-\gamma P^{c\mu})^{-1}(r + \gamma P^\pi Q - Q),$$

where $P^{c\mu}$ is the (sub)-probability transition kernel for policy $c(x,a)\mu(a|x)$. Starting from any Q-function $Q$, repeated applications of the operator will result in convergence to $Q^\pi$, i.e.,

$$(\mathcal{R}^{\pi,\mu}_c)^K Q \to Q^\pi,$$

as $K \to \infty$, subject to certain conditions on $c(x,a)$. To state the main results, recall that Eq. 2 rewrites as $Q^\pi = Q^\mu + \sum_{k=1}^{\infty} U_k$. In practice, we take a finite $K$ and use the approximation $Q^\pi \approx Q^\mu + \sum_{k=1}^{K} U_k$.

Next, we state the following result establishing a connection between the $K$-th order Taylor expansion and the return-based off-policy operator applied $K$ times.

###### Theorem 4.

(proved in Appendix E) For any $K \ge 1$, any policies $\pi$ and $\mu$,

$$Q^\mu + \sum_{k=1}^{K} U_k = (\mathcal{R}^{\pi,\mu}_1)^K Q^\mu, \tag{12}$$

where $\mathcal{R}^{\pi,\mu}_1$ is short for $\mathcal{R}^{\pi,\mu}_c$ with $c(x,a) \equiv 1$.

Theorem 4 shows that when we approximate $Q^\pi$ by the Taylor expansion up to the $K$-th order, $Q^\mu + \sum_{k=1}^{K} U_k$, it is equivalent to generating an approximation by applying the off-policy evaluation operator $\mathcal{R}^{\pi,\mu}_1$ to $Q^\mu$ $K$ times. We also note that the off-policy evaluation operator in Theorem 4 is the $Q(\lambda)$ operator (Harutyunyan et al., 2016) with $\lambda = 1$. (As a side note, we also show in Appendix F.1 that the advantage estimation method GAE (Schulman et al., 2016) is highly related to the $Q(\lambda)$ operator.)
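The identity in Eq. 12 is easy to confirm numerically, since with $c \equiv 1$ one application of the operator maps $Q$ to $Q^\mu + \gamma(I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)Q$. The sketch below uses a hypothetical random MDP and illustrative helper names; the sizes, seed, and $K = 4$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, K = 3, 2, 0.6, 4
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)           # p(y|x,a)
r = rng.random(nS * nA)

def normalize(w):
    return w / w.sum(axis=1, keepdims=True)

mu = normalize(rng.random((nS, nA)) + 0.1)
pi = normalize(rng.random((nS, nA)) + 0.1)

def kernel(policy):
    """State-action transition matrix P[(x,a),(y,b)] = p(y|x,a) policy(b|y)."""
    return (p[:, :, :, None] * policy[None, None]).reshape(nS * nA, nS * nA)

I = np.eye(nS * nA)
P_mu, P_pi = kernel(mu), kernel(pi)
Q_mu = np.linalg.solve(I - gamma * P_mu, r)

def evaluation_operator(Q):
    """Return-based operator with trace coefficients c = 1 (kernel P_mu)."""
    return Q + np.linalg.solve(I - gamma * P_mu, r + gamma * P_pi @ Q - Q)

# Left side of Eq. 12: Q^mu plus the first K expansion terms.
A = gamma * np.linalg.solve(I - gamma * P_mu, P_pi - P_mu)
expansion, term = Q_mu.copy(), Q_mu.copy()
for _ in range(K):
    term = A @ term
    expansion = expansion + term

# Right side of Eq. 12: K applications of the operator, starting from Q^mu.
iterated = Q_mu.copy()
for _ in range(K):
    iterated = evaluation_operator(iterated)
```

The two vectors agree for any finite `K`, whether or not the policies are close; closeness only matters for the limit $K \to \infty$.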

#### Alternative proof for Q(λ) convergence for λ=1.

Since Taylor expansions converge within a convergence radius, which in this case corresponds to $\|\pi-\mu\|_1 < (1-\gamma)/\gamma$, it follows that $Q(\lambda)$ with $\lambda = 1$ converges when this condition holds. In fact, this coincides with the condition deduced by Harutyunyan et al. (2016). (Note that this alternative proof only works for the case where the initial $Q = Q^\mu$.)

### 4.2 An operator view of trust-region policy optimization

With the connection between Taylor expansion and off-policy evaluation, along with the connection between Taylor expansion and TRPO (Section 3), we give a novel interpretation of TRPO: the $K$-th order generalized TRPO is approximately equivalent to iterating the off-policy evaluation operator $\mathcal{R}^{\pi,\mu}_1$ $K$ times.

To make our claim explicit, recall that the RL objective in matrix form is $J(\pi) = \pi_0^T Q^\pi$. Now consider approximating $Q^\pi$ by applying the evaluation operator $\mathcal{R}^{\pi,\mu}_1$ to $Q^\mu$, iterating $K$ times. This produces the surrogate objective $\pi_0^T(\mathcal{R}^{\pi,\mu}_1)^K Q^\mu$, approximately equivalent to that of the generalized TRPO (Eq. 9). (The $K$-th order Taylor expansion of $Q^\pi$ is slightly different from that of the RL objective $J(\pi)$ by construction; see Appendix B for details.) As a result, the generalized TRPO (including TRPO; Schulman et al., 2015) can be interpreted as approximating the exact RL objective $J(\pi) = \pi_0^T Q^\pi$ by iterating the evaluation operator $K$ times on $Q^\mu$ to approximate $Q^\pi$. When does this evaluation operator converge? Recall that $(\mathcal{R}^{\pi,\mu}_1)^K Q^\mu$ converges when $\|\pi-\mu\|_1 < (1-\gamma)/\gamma$, i.e., there is a trust-region constraint on $\pi$. This is consistent with the motivation of generalized TRPO discussed in Section 3, where a trust region is required for monotonic improvements.

## 5 Experiments

We evaluate the potential benefits of applying second-order expansions in a diverse set of scenarios. In particular, we test if the second-order correction helps with (1) policy-based and (2) value-based algorithms.

In large-scale experiments, to take advantage of computational architectures, actors ($\mu$) and learners ($\pi$) are not perfectly synchronized. For case (1), in Section 5.1, we show that even in cases where they almost synchronize ($\pi \approx \mu$), higher-order corrections are still helpful. Then, in Section 5.2, we study how the performance of a general distributed policy-based agent (e.g., IMPALA, Espeholt et al., 2018) is influenced by the discrepancy between actors and learners. For case (2), in Section 5.3, we show the benefits of second-order expansions with a state-of-the-art value-based agent, R2D2 (Kapturowski et al., 2019).

#### Evaluation.

All evaluations are done on the entire suite of Atari games (Bellemare et al., 2013). We report human-normalized scores for each level, calculated as the agent's score minus the score of a random policy, divided by the human score minus the score of a random policy on that level; details are in Appendix H.2.
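The human-normalized score described above can be sketched as follows. The function name and the per-level numbers are made up for illustration and are not results from this paper.

```python
import statistics

def human_normalized_score(agent, human, random_policy):
    """(agent - random) / (human - random): 0 matches random play, 1 matches human."""
    return (agent - random_policy) / (human - random_policy)

# Hypothetical raw scores on three levels: (agent, human, random policy).
levels = {
    "level_a": (300.0, 200.0, 100.0),
    "level_b": (50.0, 100.0, 0.0),
    "level_c": (10.0, 30.0, 10.0),
}
scores = {name: human_normalized_score(*vals) for name, vals in levels.items()}

# Aggregates across levels, as reported in mean/median comparisons.
mean_score = statistics.mean(scores.values())
median_score = statistics.median(scores.values())
```

Normalizing per level keeps games with very different raw score ranges comparable before averaging.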

#### Architecture for distributed agents.

Distributed agents generally consist of a central learner and multiple actors (Nair et al., 2015; Mnih et al., 2016; Babaeizadeh et al., 2017; Barth-Maron et al., 2018; Horgan et al., 2018). We focus on two main setups: Type I includes agents such as IMPALA (Espeholt et al., 2018) (see blue arrows in Figure 5 in Appendix H.3). See Section 5.1 and Section 5.2; Type II includes agents such as R2D2 (Kapturowski et al., 2019; see orange arrows in Figure 5 in Appendix H.3). See Section 5.3. We provide details on hyper-parameters of experiment setups in respective subsections in Appendix H.

#### Practical considerations.

We can extend the TayPO-2 objective (Eq. 11) to $\hat{L}_\theta = \hat{L}_1(\pi_\theta,\mu) + \eta\hat{L}_2(\pi_\theta,\mu)$ with $\eta > 0$. By choosing $\eta$, one achieves bias-variance trade-offs of the final objective and hence the update. We found that $\eta = 1$ (exact TayPO-2) works reasonably well. See Appendix H.4 for the ablation study on $\eta$ and further details.

### 5.1 Near on-policy policy optimization

The policy-based agent maintains a target policy network $\pi_\theta$ for the learner and a set of behavior policy networks $\mu$ for the actors. The actor parameters are delayed copies of the learner parameter $\theta$. To emulate a near on-policy situation $\pi \approx \mu$, we minimize the delay of the parameter passage between the central learner and actors by hosting the learner and actors on the same machine.

We compare second-order expansions with two baselines: first-order and zero-order. For the first-order baseline, we also adopt the PPO technique of clipping the IS ratio in Eq. 5. Clipping the ratio enforces an implicit trust region with the goal of increased stability (Schulman et al., 2017). This technique has been shown to generally outperform a naïve explicit constraint, as done in the original TRPO (Schulman et al., 2015). In Appendix H.5, we detail how we implemented PPO on the asynchronous architecture. Each baseline trains on the entire Atari suite for M frames and we compare the mean/median human-normalized scores.

The comparison results are shown in Figure 2. Please see the median score curves in Figure 6 in Appendix H.5. We make several observations: (1) Off-policy corrections are critical. Going from zero-order (no correction) to first-order improves the performance most significantly, even when the delays between actors and the learner are minimized as much as possible. (2) Second-order correction significantly improves on the first-order baseline. This might be surprising, because when near on-policy, one would expect the additional second-order correction to matter less. This implies that in a fully asynchronous architecture, it is challenging to obtain sufficiently on-policy data and additional corrections can be helpful.

### 5.2 Distributed off-policy policy optimization

We adopt the same setup as in Section 5.1. To maximize the overall throughput of the agent, the central learner and actors are distributed on different host machines. As a result, both parameter passage from the learner to actors and data passage from actors to the learner could be severely delayed. This creates a natural off-policy scenario with $\pi \ne \mu$.

We compare second-order with two baselines: first-order and V-trace. The V-trace is used in the original IMPALA agent (Espeholt et al., 2018) and we present its details in Appendix H.6. We are interested in how the agent’s performance changes as the level of off-policy increases. In practice, the level of off-policy can be controlled and measured as the delay (measured in milliseconds) of the parameter passage from the learner to actors. Results are shown in Figure 3, where x-axis shows the artificial delays (in scale) and y-axis shows the mean human-normalized scores after training for M frames. Note that the total delay consists of both artificial delays and inherent delays in the distributed system.

We make several observations: (1) The performance of all baseline variants degrades as the delays increase. All baseline off-policy corrections are subject to failures as the level of off-policyness increases. (2) While all baselines perform rather similarly when delays are small, as the level of off-policyness increases, second-order correction degrades slightly more gracefully than the other baselines. This implies that second-order is a more robust off-policy correction method than other current alternatives.

### 5.3 Distributed value-based learning

The value-based agent maintains a Q-function network $Q_\theta$ for the learner and a set of delayed Q-function networks for the actors. Let $\mathcal{E}$ be an operator such that $\mathcal{E}(Q,\epsilon)$ returns the $\epsilon$-greedy policy with respect to $Q$. The actors generate partial trajectories by executing an $\epsilon$-greedy policy with respect to their delayed Q-functions and send data to a replay buffer. The target policy is greedy with respect to the current Q-function $Q_\theta$. The learner samples partial trajectories from the replay buffer and updates parameters by minimizing Bellman errors computed along sampled trajectories. Here we focus on R2D2, a special instance of a distributed value-based agent. Please refer to Kapturowski et al. (2019) for a complete review of all algorithmic details of value-based agents such as R2D2.

Across all baseline variants, the learner computes regression targets for the network to approximate $Q^\pi$. The targets are calculated based on partial trajectories under $\mu$, which require off-policy corrections. We compare several correction variants: zero-order, first-order, Retrace (Munos et al., 2016; Rowland et al., 2020) and second-order. Please see algorithmic details in Appendix G.

The comparison results are in Figure 4, where we show the mean scores. We make several observations: (1) Second-order correction leads to marginally better performance than first-order and Retrace, and significantly better performance than zero-order. (2) In general, unbiased (or slightly biased) off-policy corrections do not yet perform as well as radically biased off-policy variants, such as uncorrected-nstep (Kapturowski et al., 2019; Rowland et al., 2020). (3) Zero-order performs the worst: although, like the other variants, it reaches superhuman performance on most games, its performance then quickly plateaus. See Appendix H.7 for more results.

## 6 Discussion and conclusion

The idea of IS is the core of most off-policy evaluation techniques (Precup et al., 2000; Harutyunyan et al., 2016; Munos et al., 2016). We showed that Taylor expansions construct approximations to the full IS corrections and hence intimately relate to established off-policy evaluation techniques.

However, the connection between IS and policy optimization is less straightforward. Prior work focuses on applying off-policy corrections directly to policy gradient estimators (Jie and Abbeel, 2010; Espeholt et al., 2018) instead of the surrogate objectives which generate the gradients. Though standard policy optimization objectives (Schulman et al., 2015, 2017) involve IS weights, their link with IS is not made explicit. Closely related to our work is that of Tomczak et al. (2019), where they identified such optimization objectives as biased approximations to the full IS objective (Metelli et al., 2018). We characterized such approximations as the first-order special case of Taylor expansions and derived their natural generalizations.

In summary, we showed that Taylor expansions naturally connect trust-region policy search with off-policy evaluation. This new formulation unifies previous results, opens doors to new algorithms and brings significant gains to certain state-of-the-art deep RL agents.

#### Acknowledgements.

Great thanks to Mark Rowland for insightful discussions during the development of ideas as well as extremely useful feedback on earlier versions of this paper. The authors also thank Diana Borsa, Jean-Bastien Grill, Florent Altché, Tadashi Kozuno, Zhongwen Xu, Steven Kapturowski, and Simon Schmitt for helpful discussions.

## References

• A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller (2018) Maximum a posteriori policy optimisation. In International Conference on Learning Representations, Cited by: §1, §3.2.
• M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz (2017) Reinforcement learning through asynchronous advantage actor-critic on a gpu. International Conference on Learning Representations. Cited by: §H.3, §5.
• G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributional policy gradients. In International Conference on Learning Representations, Cited by: §H.3, §5.
• M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §H.2, §5.
• C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
• L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, Cited by: Figure 5, §H.3, §H.5, §H.6, §H.6, §H.6, §H.6, §H.6, §H.7, §1, §1, §3.2, §3, §5, §5.2, §5, §6.
• A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. Bellemare, and R. Munos (2018) The reactor: a fast and sample-efficient actor-critic agent for reinforcement learning. In International Conference on Learning Representations, Cited by: §1.
• A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos (2016) Q(λ) with off-policy corrections. In Algorithmic Learning Theory, Cited by: §F.1, §1, §4.1, §4.1, §6.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. Cited by: §H.6.
• D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver (2018) Distributed prioritized experience replay. In International Conference on Learning Representations, Cited by: §H.3, §H.3, §H.7, §3, §5.
• T. Jie and P. Abbeel (2010) On a connection between importance sampling and the likelihood ratio policy gradient. In Neural Information Processing Systems, Cited by: §6.
• S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, Cited by: §1, §2.2.
• S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos (2019) Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, Cited by: Appendix G, Figure 5, §H.3, §H.7, §3, §5, §5.3, §5.3, §5.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §H.7.
• A. M. Metelli, M. Papini, F. Faccio, and M. Restelli (2018) Policy optimization via importance sampling. In Neural Information Processing Systems, Cited by: §6.
• V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, Cited by: §H.3, §H.5, §H.5, §H.6, §H.7, §1, §3.2, §3, §5.
• V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, Cited by: §H.6.
• R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare (2016) Safe and efficient off-policy reinforcement learning. In Neural Information Processing Systems, Cited by: §F.1, Appendix G, §H.7, §1, §3.2, §4.1, §5.3, §6.
• A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. (2015) Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296. Cited by: §H.3, §5.
• T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. Van Hasselt, J. Quan, M. Večerík, et al. (2018) Observe and look further: achieving consistent performance on atari. arXiv preprint arXiv:1805.11593. Cited by: §H.7.
• D. Precup, R. S. Sutton, and S. P. Singh (2000) Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, Cited by: §6.
• M. Rowland, W. Dabney, and R. Munos (2020) Adaptive trade-offs in off-policy learning. In International Conference on Artificial Intelligence and Statistics, Cited by: Appendix G, §5.3, §5.3.
• J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, Cited by: Appendix D, §1, §1, §1, §2.2, §3.1, §3.2, §3.2, §3.2, §3, §4.2, §5.1, §6.
• J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2016) High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, Cited by: §F.1, §H.6, §3.2, §3.3, footnote 6.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix D, §F.1, §H.5, §1, §1, §2.2, §3.1, §3.2, §3.2, §3.2, §3, §5.1, §6, footnote 4.
• D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search.
• H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, N. Heess, D. Belov, M. Riedmiller, and M. M. Botvinick (2020) V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, Cited by: §1.
• R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, Cited by: §1.
• T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §H.5, §H.6.
• M. B. Tomczak, D. Kim, P. Vrancx, and K. Kim (2019) Policy optimization through approximated importance sampling. arXiv preprint arXiv:1910.03857. Cited by: §6.
• O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
• Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2017) Sample efficient actor-critic with experience replay. International Conference on Learning Representations. Cited by: §1, §3.2.
• Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas (2016) Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, Cited by: §H.7.

## Appendix A Derivation of results for generalized trust-region policy optimization

### A.1 Controlling the residuals of Taylor expansions

We summarize the bound on the magnitude of the Taylor expansion residuals of the Q-function as a proposition.

###### Proposition 1.

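Recall the definition of the Taylor expansion residual of the Q-function from the main text, $Q^\pi - \big(Q^\mu + \sum_{k=1}^{K} U_k\big)$. Let $\|\cdot\|_\infty$ be the infinity norm, $\|Q\|_\infty := \max_{x,a} |Q(x,a)|$. Let $R_{\max}$ be the maximum reward in the entire MDP, $R_{\max} := \max_{x,a} |R(x,a)|$. Finally, let $\varepsilon := \|\pi - \mu\|_1$. Then, provided $\gamma\varepsilon/(1-\gamma) < 1$,

$$\Big\| Q^\pi - Q^\mu - \sum_{k=1}^{K} U_k \Big\|_\infty \le \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{K+1} \frac{R_{\max}}{1-\gamma} \Big(1 - \frac{\gamma\varepsilon}{1-\gamma}\Big)^{-1}. \tag{13}$$
###### Proof.

The proof follows by bounding the magnitude of each term $U_k$,

$$\Big\| Q^\pi - Q^\mu - \sum_{k=1}^{K} U_k \Big\|_\infty = \Big\| \sum_{k=K+1}^{\infty} U_k \Big\|_\infty \le \sum_{k=K+1}^{\infty} \|U_k\|_\infty \le \sum_{k=K+1}^{\infty} \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{k} \frac{R_{\max}}{1-\gamma} = \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{K+1} \frac{R_{\max}}{1-\gamma} \Big(1 - \frac{\gamma\varepsilon}{1-\gamma}\Big)^{-1}.$$

The above derivation shows that once $\varepsilon < \frac{1-\gamma}{\gamma}$, the residual norm tends to $0$ as $K \to \infty$. In the above derivation, we have applied the bound $\|U_k\|_\infty \le \big(\frac{\gamma\varepsilon}{1-\gamma}\big)^{k} \frac{R_{\max}}{1-\gamma}$, which will also be helpful in later derivations. ∎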
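The residual bound can be checked numerically on a small random tabular MDP. The sketch below is illustrative only: the MDP sizes, the seed, and the mixture used to keep $\pi$ close to $\mu$ are assumptions, and it takes $\|\pi-\mu\|_1$ to mean $\max_x \sum_a |\pi(a \mid x) - \mu(a \mid x)|$ and $U_k = \big(\gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)\big)^k Q^\mu$.

```python
import numpy as np

rng = np.random.default_rng(1)
X, A, gamma = 5, 3, 0.8                        # hypothetical MDP sizes and discount
p = rng.dirichlet(np.ones(X), size=(X, A))     # p[x, a, y] = p(y | x, a)
mu = rng.dirichlet(np.ones(A), size=X)         # behavior policy mu(a | x)
nu = rng.dirichlet(np.ones(A), size=X)
pi = 0.9 * mu + 0.1 * nu                       # target policy kept close to mu
r = rng.uniform(-1, 1, size=X * A)             # reward vector indexed by (x, a)

def P(policy):
    """State-action transition matrix P[(x,a),(y,b)] = p(y|x,a) * policy(b|y)."""
    return np.einsum("xay,yb->xayb", p, policy).reshape(X * A, X * A)

eye = np.eye(X * A)
inv_mu = np.linalg.inv(eye - gamma * P(mu))
q_mu = inv_mu @ r
q_pi = np.linalg.solve(eye - gamma * P(pi), r)
M = gamma * inv_mu @ (P(pi) - P(mu))           # U_k = M^k Q^mu

eps = np.abs(pi - mu).sum(axis=1).max()        # assumed definition of ||pi - mu||_1
ratio = gamma * eps / (1 - gamma)
r_max = np.abs(r).max()

residual_norms, bounds = [], []
partial, term = q_mu.copy(), q_mu.copy()
for K in range(6):
    # partial holds Q^mu + sum_{k=1}^K U_k at this point
    residual_norms.append(np.abs(q_pi - partial).max())
    bounds.append(ratio ** (K + 1) / (1 - ratio) * r_max / (1 - gamma))
    term = M @ term
    partial = partial + term
```

Each entry of `residual_norms` should sit below the matching entry of `bounds`, and both shrink geometrically as long as `ratio` is below one.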

### A.2 Deriving Taylor expansions of RL objective

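Recall that the RHS of Eq. 2 is the Taylor expansion of the Q-function $Q^\pi$. By construction, $Q^\pi = Q^\mu + \sum_{k \ge 1} U_k$ whenever the series converges. Though Eq. 2 shows the expansion of the entire vector $Q^\pi$, for optimization purposes, we care about the RL objective from a starting state $x_0$, $J(\pi) = \pi_0^T Q^\pi$, where $\pi_0$ follows the definition from the main paper: the vector whose entries at the state-action pairs $(x_0, a)$ equal $\pi(a \mid x_0)$ and are zero elsewhere.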

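Now we focus on calculating $L_K(\pi,\mu)$ for general $K$. For simplicity, we write $L_K(\pi,\mu)$ as $L_K$, and henceforth we might use these notations interchangeably. Now consider the RHS of Eq. 3. By definition of the $K$-th order Taylor expansion of $J(\pi)$, we maintain terms where $\pi - \mu$ appears at most $K$ times. Equivalently, in matrix form, we remove the higher-order terms of $\pi - \mu$ while only maintaining terms of order at most $K$. This allows us to conclude that

$$\sum_{k=1}^{K} L_k = (\pi_0 - \mu_0)^T \Big(Q^\mu + \sum_{k=1}^{K-1} U_k\Big) + \mu_0^T \sum_{k=1}^{K} U_k.$$

Furthermore, we can single out each term

$$L_1 = (\pi_0 - \mu_0)^T Q^\mu + \mu_0^T U_1, \qquad L_k = (\pi_0 - \mu_0)^T U_{k-1} + \mu_0^T U_k, \quad k \ge 2.$$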
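The decomposition of the objective into terms of increasing order can be sanity-checked numerically. The following is a minimal sketch under stated assumptions: a random tabular MDP with a single starting state `x0`, $J(\pi) = \pi_0^T Q^\pi$, and the $k$-th term computed as $(\pi_0-\mu_0)^T U_{k-1} + \mu_0^T U_k$ with the convention $U_0 := Q^\mu$. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X, A, gamma, x0 = 5, 3, 0.8, 0                 # hypothetical sizes; x0 is the start state
p = rng.dirichlet(np.ones(X), size=(X, A))     # p[x, a, y] = p(y | x, a)
mu = rng.dirichlet(np.ones(A), size=X)
pi = 0.95 * mu + 0.05 * rng.dirichlet(np.ones(A), size=X)
r = rng.uniform(-1, 1, size=X * A)

def P(policy):
    """State-action transition matrix P[(x,a),(y,b)] = p(y|x,a) * policy(b|y)."""
    return np.einsum("xay,yb->xayb", p, policy).reshape(X * A, X * A)

def start_vector(policy):
    """Vector with policy(a|x0) at entries (x0, a) and zeros elsewhere."""
    v = np.zeros((X, A))
    v[x0] = policy[x0]
    return v.reshape(-1)

eye = np.eye(X * A)
inv_mu = np.linalg.inv(eye - gamma * P(mu))
q_mu = inv_mu @ r
q_pi = np.linalg.solve(eye - gamma * P(pi), r)
M = gamma * inv_mu @ (P(pi) - P(mu))
pi0, mu0 = start_vector(pi), start_vector(mu)

true_gap = pi0 @ q_pi - mu0 @ q_mu             # J(pi) - J(mu)

terms = []
u_prev = q_mu                                  # convention U_0 := Q^mu
for k in range(1, 41):
    u_k = M @ u_prev                           # U_k = M U_{k-1}
    terms.append((pi0 - mu0) @ u_prev + mu0 @ u_k)
    u_prev = u_k
partial_sums = np.cumsum(terms)                # partial_sums[K-1] = sum_{k<=K} L_k
```

Here `partial_sums[0]` is the first-order (TRPO-style) surrogate gap, and later entries add higher-order corrections that approach `true_gap`.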

## Appendix B Proof of Theorem 1

###### Proof.

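We derive the Taylor expansion of the Q-function into different orders of $P^\pi - P^\mu$. For that purpose, we recursively make use of the following matrix equality

$$(I-\gamma P^\pi)^{-1} = (I-\gamma P^\mu)^{-1} + \gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)(I-\gamma P^\pi)^{-1},$$

which can be derived either from the matrix inversion equality or verified directly. Since $Q^\pi = (I-\gamma P^\pi)^{-1} R$, we can use the previous equality to get

$$\begin{aligned} Q^\pi &= (I-\gamma P^\mu)^{-1} R + \gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)(I-\gamma P^\pi)^{-1} R \\ &= Q^\mu + \gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu) Q^\pi. \end{aligned}$$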
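The matrix equality and the resulting recursion for $Q^\pi$ are easy to verify directly. The sketch below does so on a random tabular MDP; the sizes, seed, and helper name `transition_matrix` are illustrative assumptions rather than anything from the paper.

```python
import numpy as np

def transition_matrix(p, policy):
    """State-action transition matrix P[(x,a),(y,b)] = p(y|x,a) * policy(b|y)."""
    n_states, n_actions, _ = p.shape
    return np.einsum("xay,yb->xayb", p, policy).reshape(n_states * n_actions, -1)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(y | x, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
mu = rng.dirichlet(np.ones(n_actions), size=n_states)
r = rng.uniform(-1, 1, size=n_states * n_actions)

eye = np.eye(n_states * n_actions)
P_pi, P_mu = transition_matrix(p, pi), transition_matrix(p, mu)
inv_pi = np.linalg.inv(eye - gamma * P_pi)
inv_mu = np.linalg.inv(eye - gamma * P_mu)

# Left- and right-hand sides of the matrix equality.
lhs = inv_pi
rhs = inv_mu + gamma * inv_mu @ (P_pi - P_mu) @ inv_pi
identity_gap = np.abs(lhs - rhs).max()

# Recursion Q^pi = Q^mu + gamma (I - gamma P^mu)^{-1} (P^pi - P^mu) Q^pi.
q_pi, q_mu = inv_pi @ r, inv_mu @ r
recursion_gap = np.abs(q_pi - (q_mu + gamma * inv_mu @ (P_pi - P_mu) @ q_pi)).max()
```

Note that the equality holds for arbitrary $\pi$ and $\mu$; closeness of the two policies only matters later, for convergence of the series.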

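Next, we recursively apply the equality $K$ times,

$$Q^\pi = Q^\mu + \sum_{k=1}^{K} \big(\gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)\big)^{k} Q^\mu + \big(\gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)\big)^{K+1} Q^\pi.$$

Now if $\|\pi - \mu\|_1 < \frac{1-\gamma}{\gamma}$, then we can bound the sup-norm of the operator in the last term as

$$\big\|\gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)\big\|_\infty \le \frac{\gamma}{1-\gamma} \|\pi - \mu\|_1 < 1,$$

thus the $K$-th order residual term vanishes when $K \to \infty$. As a result, the limit $K \to \infty$ is well defined and we deduce

$$Q^\pi = \sum_{k=0}^{\infty} \big(\gamma (I-\gamma P^\mu)^{-1}(P^\pi - P^\mu)\big)^{k} Q^\mu.$$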

## Appendix C Proof of Theorem 2

###### Proof.

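To derive the monotonic improvement theorem for generalized TRPO, it is critical to bound $\sum_{k=K+1}^{\infty} L_k(\pi,\mu)$. We achieve this by simply bounding each term $L_k(\pi,\mu)$ separately. Recall that from Appendix A.1 we have $\|U_k\|_\infty \le \big(\frac{\gamma\varepsilon}{1-\gamma}\big)^{k} \frac{R_{\max}}{1-\gamma}$ with $\varepsilon = \|\pi-\mu\|_1$. Without loss of generality, we first assume $R_{\max} = 1$ for ease of derivations.

$$|L_k| \le \big|(\pi_0 - \mu_0)^T U_{k-1}\big| + \big|\mu_0^T U_k\big| \le \varepsilon \|U_{k-1}\|_\infty + \|U_k\|_\infty \le \frac{1}{\gamma(1-\gamma)} \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{k}.$$

This leads to a bound over the residuals

$$\Big|\sum_{k=K+1}^{\infty} L_k\Big| \le \frac{1}{\gamma(1-\gamma)} \Big(1 - \frac{\gamma\varepsilon}{1-\gamma}\Big)^{-1} \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{K+1}.$$

Since we have the equality $J(\pi) = J(\mu) + \sum_{k=1}^{\infty} L_k(\pi,\mu)$ for $\varepsilon < \frac{1-\gamma}{\gamma}$, we can deduce the following monotonic improvement,

$$J(\pi) \ge J(\mu) + \sum_{k=1}^{K} L_k(\pi,\mu) - \frac{1}{\gamma(1-\gamma)} \Big(1 - \frac{\gamma\varepsilon}{1-\gamma}\Big)^{-1} \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{K+1}. \tag{14}$$

To write the above statement in a compact way, we define the gap

$$G_K := \frac{1}{\gamma(1-\gamma)} \Big(1 - \frac{\gamma\varepsilon}{1-\gamma}\Big)^{-1} \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{K+1}.$$

To derive the result for general $R_{\max}$, note that the gap has a linear dependency on $R_{\max}$. Hence the general gap is

$$G_K = \frac{R_{\max}}{\gamma(1-\gamma)} \Big(1 - \frac{\gamma\varepsilon}{1-\gamma}\Big)^{-1} \Big(\frac{\gamma\varepsilon}{1-\gamma}\Big)^{K+1},$$

which produces the monotonic improvement result (Eq. 10) stated in the main paper. ∎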

## Appendix D Proof of Theorem 3

###### Proof.

It is known that for $K=1$, replacing $Q^\mu(x,a)$ by $A^\mu(x,a)$ in the estimation can potentially reduce variance (Schulman et al., 2015, 2017) yet keeps the estimate unbiased. Below, we show that replacing $Q^\mu$ by $A^\mu$ keeps the estimate of $L_K(\pi,\mu)$ unbiased for general $K$.

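As shown above and more clearly in Appendix F, $L_K(\pi,\mu)$ can be written as

$$L_K(\pi,\mu) = \mathbb{E}_{(x^{(i)},a^{(i)})_{1 \le i \le K}} \Big[ \prod_{i=1}^{K} \Big( \frac{\pi(a^{(i)} \mid x^{(i)})}{\mu(a^{(i)} \mid x^{(i)})} - 1 \Big) Q^\mu(x^{(K)}, a^{(K)}) \Big]. \tag{15}$$

Note that for clarity, in the above expectation, we omit an explicit sequence of discounted visitation distributions (for detailed derivations of this sequence of visitation distributions, see Appendix F). Next, we leverage the conditional expectation with respect to the last action $a^{(K)} \sim \mu(\cdot \mid x^{(K)})$, together with $\mathbb{E}_{a \sim \mu(\cdot \mid x)}\big[\frac{\pi(a \mid x)}{\mu(a \mid x)} - 1\big] = 0$, to yield

$$\begin{aligned} L_K(\pi,\mu) &= \mathbb{E} \Big[ \prod_{i=1}^{K-1} \Big( \frac{\pi(a^{(i)} \mid x^{(i)})}{\mu(a^{(i)} \mid x^{(i)})} - 1 \Big) \, \mathbb{E}_{a^{(K)}} \Big[ \Big( \frac{\pi(a^{(K)} \mid x^{(K)})}{\mu(a^{(K)} \mid x^{(K)})} - 1 \Big) Q^\mu(x^{(K)}, a^{(K)}) \Big] \Big] \\ &= \mathbb{E} \Big[ \prod_{i=1}^{K-1} \Big( \frac{\pi(a^{(i)} \mid x^{(i)})}{\mu(a^{(i)} \mid x^{(i)})} - 1 \Big) \, \mathbb{E}_{a^{(K)}} \Big[ \Big( \frac{\pi(a^{(K)} \mid x^{(K)})}{\mu(a^{(K)} \mid x^{(K)})} - 1 \Big) \big( Q^\mu(x^{(K)}, a^{(K)}) - V^\mu(x^{(K)}) \big) \Big] \Big] \\ &= \mathbb{E}_{(x^{(i)},a^{(i)})_{1 \le i \le K}} \Big[ \prod_{i=1}^{K} \Big( \frac{\pi(a^{(i)} \mid x^{(i)})}{\mu(a^{(i)} \mid x^{(i)})} - 1 \Big) A^\mu(x^{(K)}, a^{(K)}) \Big]. \end{aligned} \tag{16}$$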

The above derivation shows that indeed, replacing $Q^\mu$ by $A^\mu$ does not change the value of the expectation, while potentially reducing the variance of the overall estimation. ∎
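The key step, that a state-dependent baseline drops out under the weight $\pi/\mu - 1$, can be illustrated at a single state with exact expectations. This is a minimal sketch: the action count, the policies, and the values standing in for $Q^\mu(x,\cdot)$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions = 4
# Hypothetical policies at one state; mu is mixed with uniform to stay away from zero.
mu = 0.5 * np.full(n_actions, 1.0 / n_actions) + 0.5 * rng.dirichlet(np.ones(n_actions))
pi = rng.dirichlet(np.ones(n_actions))
q = rng.normal(size=n_actions) + 5.0     # stand-in for Q^mu(x, .) with a large offset
v = mu @ q                               # V^mu(x) = E_{a ~ mu}[Q^mu(x, a)]
adv = q - v                              # A^mu(x, .)

w = pi / mu - 1.0                        # first-order weight pi/mu - 1

# Exact expectations under a ~ mu(. | x).
mean_q = mu @ (w * q)
mean_adv = mu @ (w * adv)

# Exact per-sample variances of the two estimators (no ordering is asserted).
var_q = mu @ (w * q) ** 2 - mean_q ** 2
var_adv = mu @ (w * adv) ** 2 - mean_adv ** 2
```

The two means agree because $\sum_a \mu(a \mid x)\big(\frac{\pi(a \mid x)}{\mu(a \mid x)} - 1\big) = 0$; comparing `var_q` with `var_adv` shows how much the baseline changes the spread in this particular instance.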

## Appendix E Proof of Theorem 4

###### Proof.

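From the definition of the return off-policy evaluation operator $\mathcal{R}^{\pi,\mu}_1$, we have

$$\begin{aligned} \mathcal{R}^{\pi,\mu}_1 Q &= Q + (I-\gamma P^\mu)^{-1} \big( R + \gamma P^\pi Q - Q \big) \\ &= (I-\gamma P^\mu)^{-1} R + \gamma (I-\gamma P^\mu)^{-1} (P^\pi - P^\mu) Q. \end{aligned}$$

Thus $\mathcal{R}^{\pi,\mu}_1$ is a linear operator, and

$$\mathcal{R}^{\pi,\mu}_1 Q = Q^\mu + \gamma (I-\gamma P^\mu)^{-1} (P^\pi - P^\mu) Q.$$

Applying this step $K$ times, we deduce

$$\big(\mathcal{R}^{\pi,\mu}_1\big)^{K} Q = \sum_{k=0}^{K-1} \big(\gamma (I-\gamma P^\mu)^{-1} (P^\pi - P^\mu)\big)^{k} Q^\mu + \big(\gamma (I-\gamma P^\mu)^{-1} (P^\pi - P^\mu)\big)^{K} Q.$$

Applying the above operator to $Q^\mu$ we deduce that

$$\big(\mathcal{R}^{\pi,\mu}_1\big)^{K} Q^\mu = Q^\mu + \sum_{k=1}^{K} \big(\gamma (I-\gamma P^\mu)^{-1} (P^\pi - P^\mu)\big)^{k} Q^\mu = Q^\mu + \sum_{k=1}^{K} U_k,$$

which proves our claim. ∎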
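The claim can be checked numerically by iterating the operator. The sketch below assumes the operator takes the Q(λ) form of Harutyunyan et al. (2016) with λ = 1, namely $Q \mapsto Q + (I-\gamma P^\mu)^{-1}(R + \gamma P^\pi Q - Q)$; that form, the MDP sizes, and the function name `evaluation_operator` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X, A, gamma = 4, 2, 0.8                        # hypothetical MDP sizes and discount
p = rng.dirichlet(np.ones(X), size=(X, A))     # p[x, a, y] = p(y | x, a)
mu = rng.dirichlet(np.ones(A), size=X)
pi = 0.95 * mu + 0.05 * rng.dirichlet(np.ones(A), size=X)
r = rng.uniform(-1, 1, size=X * A)

def P(policy):
    """State-action transition matrix P[(x,a),(y,b)] = p(y|x,a) * policy(b|y)."""
    return np.einsum("xay,yb->xayb", p, policy).reshape(X * A, X * A)

eye = np.eye(X * A)
inv_mu = np.linalg.inv(eye - gamma * P(mu))
q_mu = inv_mu @ r
M = gamma * inv_mu @ (P(pi) - P(mu))

def evaluation_operator(q):
    """Assumed operator: Q + (I - gamma P^mu)^{-1} (R + gamma P^pi Q - Q)."""
    return q + inv_mu @ (r + gamma * P(pi) @ q - q)

K = 3
iterate = q_mu.copy()
for _ in range(K):
    iterate = evaluation_operator(iterate)     # K applications starting from Q^mu

# K-th order Taylor expansion: sum_{k=0}^{K} M^k Q^mu.
taylor = sum(np.linalg.matrix_power(M, k) @ q_mu for k in range(K + 1))
```

Starting the iteration from $Q^\mu$ is what makes each application add exactly one more order of the expansion; starting from another $Q$ leaves a leftover term that only vanishes in the limit.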

## Appendix F Alternative derivation for Taylor expansions of RL objective

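In this section, we provide an alternative derivation of the Taylor expansion of the RL objective. Let $\varepsilon_t := \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} - 1$. In cases where $\pi \approx \mu$ (e.g., for the trust-region case), $\varepsilon_t \approx 0$. To calculate $J(\pi)$ using data from $\mu$, a natural technique is to employ importance sampling (IS),

$$J(\pi) = \mathbb{E}_\mu \Big[ \Big( \prod_{t \ge 0} \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} \Big) \Big( \sum_{t \ge 0} \gamma^t r_t \Big) \Big] = \mathbb{E}_\mu \Big[ \Big( \prod_{t \ge 0} (1 + \varepsilon_t) \Big) \Big( \sum_{t \ge 0} \gamma^t r_t \Big) \Big].$$

To derive the Taylor expansion in an intuitive way, consider expanding the product $\prod_{t \ge 0} (1 + \varepsilon_t)$, assuming that this infinite product is finite. Assume all $|\varepsilon_t| \le \varepsilon$ with some small $\varepsilon > 0$. A second-order Taylor expansion is

$$\prod_{t \ge 0} (1 + \varepsilon_t) \approx 1 + \sum_{t \ge 0} \varepsilon_t + \sum_{0 \le t < t'} \varepsilon_t \varepsilon_{t'}. \tag{17}$$

Now, consider the term associated with $\sum_{t \ge 0} \varepsilon_t$,

$$\mathbb{E}_\mu \Big[ \Big( \sum_{t \ge 0} \varepsilon_t \Big) \Big( \sum_{t' \ge 0} \gamma^{t'} r_{t'} \Big) \Big] = \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \varepsilon_t \sum_{t' \ge t} \gamma^{t'} r_{t'} \Big] = \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^t \Big( \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} - 1 \Big) Q^\mu(x_t, a_t) \Big]. \tag{18}$$

The first equality holds because $\varepsilon_t$ has zero mean under $\mu$ given the history up to $x_t$, so rewards collected before time $t$ do not contribute. Note that in the last equality, the factor $\gamma^t$ can be absorbed into the discounted visitation distribution under $\mu$. It is then clear that this term is exactly the first-order expansion $L_1(\pi,\mu)$ shown in the main paper.

Similarly, we could derive the second-order expansion by studying the term associated with $\sum_{0 \le t < t'} \varepsilon_t \varepsilon_{t'}$.

$$\mathbb{E}_\mu \Big[ \Big( \sum_{0 \le t < t'} \varepsilon_t \varepsilon_{t'} \Big) \Big( \sum_{s \ge 0} \gamma^{s} r_{s} \Big) \Big] = \mathbb{E}_\mu \Big[ \sum_{t \ge 0} \gamma^{t} \varepsilon_t \sum_{t' > t} \gamma^{t' - t} \varepsilon_{t'} \, Q^\mu(x_{t'}, a_{t'}) \Big]. \tag{19}$$

Note that, similar to the first-order expansion, the discount factors $\gamma^{t}$ and $\gamma^{t'-t}$ are absorbed into two discounted visitation distributions. Here note that the second visitation distribution differs from the first: by construction $t' > t$, so we need to sample the second state-action pair conditional on the time difference being at least one step. The above is exactly the second-order expansion $L_2(\pi,\mu)$.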
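The truncation of the importance-weight product at second order can be illustrated on a finite trajectory. This is a sketch under stated assumptions: the horizon `T`, the range of the deviations `eps` (standing in for $\pi/\mu - 1$ at each step), and the remainder bound $e^{S} - 1 - S - S^2/2$ with $S = \sum_t |\varepsilon_t|$ are choices made for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
T, eps_max = 20, 0.02                          # hypothetical horizon and deviation size
eps = rng.uniform(-eps_max, eps_max, size=T)   # eps_t stands in for pi/mu - 1 at step t

exact = np.prod(1.0 + eps)                     # full importance-weight product

first = eps.sum()                              # sum_t eps_t
# sum_{t < t'} eps_t eps_t' via the identity ((sum eps)^2 - sum eps^2) / 2
second = (eps.sum() ** 2 - (eps ** 2).sum()) / 2.0

zero_order = 1.0
first_order = 1.0 + first
second_order = 1.0 + first + second

# All neglected terms have order >= 3; their total is at most sum_{k>=3} S^k / k!.
S = np.abs(eps).sum()
remainder_bound = np.exp(S) - 1.0 - S - S ** 2 / 2.0
```

For small deviations, `second_order` tracks `exact` to within `remainder_bound`, which shrinks cubically in the overall size of the deviations.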

By a similar argument, we can derive expansion for all higher-order expansion by considering the term associated with . This would introduce