ProMP: Proximal Meta-Policy Search

10/16/2018 · Jonas Rothfuss et al. · KIT, UC Berkeley

Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm enables efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance. Our code is available at https://github.com/jonasrothfuss/promp.


1 Introduction

A remarkable trait of human intelligence is the ability to adapt to new situations in the face of limited experience. In contrast, our most successful artificial agents struggle in such scenarios. While achieving impressive results, they suffer from high sample complexity in learning even a single task, fail to generalize to new situations, and require large amounts of additional data to successfully adapt to new environments. Meta-learning addresses these shortcomings by learning how to learn. Its objective is to learn an algorithm that allows the artificial agent to succeed in an unseen task when only limited experience is available, aiming to achieve the same fast adaptation that humans possess (Schmidhuber, 1987; Thrun & Pratt, 1998).

Despite recent progress, deep reinforcement learning (RL) still relies heavily on hand-crafted features and reward functions as well as engineered, problem-specific inductive bias. Meta-RL aims to forgo such reliance by acquiring inductive bias in a data-driven manner. Recent work demonstrates that this approach is promising, showing that Meta-RL allows agents to obtain a diverse set of skills, attain better exploration strategies, and learn faster through meta-learned dynamics models or synthetic returns (Duan et al., 2016; Xu et al., 2018; Gupta et al., 2018b; Saemundsson et al., 2018).

Meta-RL is a multi-stage process in which the agent, after a few sampled environment interactions, adapts its behavior to the given task. Despite its wide utilization, little work has been done to promote theoretical understanding of this process, leaving Meta-RL grounded on unstable foundations. Although the behavior prior to the adaptation step is instrumental for task identification, the interplay between pre-adaptation sampling and posterior performance of the policy remains poorly understood. In fact, prior work in gradient-based Meta-RL has either entirely neglected credit assignment to the pre-update distribution (Finn et al., 2017) or implemented such credit assignment in a naive way (Al-Shedivat et al., 2018; Stadie et al., 2018).

To our knowledge, we provide the first formal in-depth analysis of credit assignment w.r.t. the pre-adaptation sampling distribution in Meta-RL. Based on our findings, we develop a novel Meta-RL algorithm. First, we analyze two distinct methods for assigning credit to pre-adaptation behavior. We show that the recent formulation introduced by Al-Shedivat et al. (2018) and Stadie et al. (2018) leads to poor credit assignment, while the MAML formulation (Finn et al., 2017) potentially yields superior meta-policy updates. Second, based on insights from our formal analysis, we highlight both the importance and difficulty of proper meta-policy gradient estimates. In light of this, we propose the low variance curvature (LVC) surrogate objective which yields gradient estimates with a favorable bias-variance trade-off. Finally, building upon the LVC estimator we develop Proximal Meta-Policy Search (ProMP), an efficient and stable meta-learning algorithm for RL. In our experiments, we show that ProMP consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

2 Related Work

Meta-Learning concerns the question of “learning to learn”, aiming to acquire inductive bias in a data-driven manner so that the learning process in the face of unseen data or new problem settings is accelerated (Schmidhuber, 1987; Schmidhuber et al., 1997; Thrun & Pratt, 1998).

This can be achieved in various ways. One category of methods attempts to learn the “learning program” of a universal Turing machine in the form of a recurrent / memory-augmented model that ingests datasets and either outputs the parameters of the trained model (Hochreiter et al., 2001; Andrychowicz et al., 2016; Chen et al., 2017; Ravi & Larochelle, 2017) or directly outputs predictions for given test inputs (Duan et al., 2016; Santoro et al., 2016; Mishra et al., 2018). Though very flexible and capable of learning very efficient adaptations, such methods lack performance guarantees and are difficult to train on the long sequences that arise in Meta-RL.

Another set of methods embeds the structure of a classical learning algorithm in the meta-learning procedure and optimizes the parameters of the embedded learner during meta-training (Hüsken & Goerick, 2000; Finn et al., 2017; Nichol et al., 2018; Miconi et al., 2018). A particular instance of the latter that has proven especially successful in the context of RL is gradient-based meta-learning (Finn et al., 2017; Al-Shedivat et al., 2018; Stadie et al., 2018). Its objective is to learn an initialization such that after one or few steps of policy gradients the agent attains full performance on a new task. A desirable property of this approach is that even if fast adaptation fails, the agent simply falls back on vanilla policy gradients. However, as we show, previous gradient-based Meta-RL methods either neglect credit assignment w.r.t. the pre-update sampling distribution or perform it poorly.

A diverse set of methods building on Meta-RL has recently been introduced. This includes: learning exploration strategies (Gupta et al., 2018b), synthetic rewards (Sung et al., 2017; Xu et al., 2018), unsupervised policy acquisition (Gupta et al., 2018a), model-based RL (Clavera et al., 2018; Saemundsson et al., 2018), learning in competitive environments (Al-Shedivat et al., 2018), and meta-learning modular policies (Frans et al., 2018; Alet et al., 2018). Many of the mentioned approaches build on previous gradient-based meta-learning methods that insufficiently account for the pre-update distribution. ProMP overcomes these deficiencies, providing the necessary framework for novel applications of Meta-RL to unsolved problems.

3 Background

Reinforcement Learning.

A discrete-time finite Markov decision process (MDP), T, is defined by the tuple T = (S, A, p, p_0, r, H). Here, S is the set of states, A the action space, p(s_{t+1} | s_t, a_t) the transition distribution, p_0 represents the initial state distribution, r : S × A → R is a reward function, and H the time horizon. We omit the discount factor γ in the following elaborations for notational brevity. However, it is straightforward to include it by substituting the reward by r_γ(s_t, a_t) = γ^t r(s_t, a_t). We define the return R(τ) as the sum of rewards along a trajectory τ := (s_0, a_0, ..., s_{H−1}, a_{H−1}, s_H). The goal of reinforcement learning is to find a policy π(a | s) that maximizes the expected return E_{τ ∼ P_T(τ | π)} [ R(τ) ].
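As a small concrete illustration of these definitions, the following NumPy sketch (ours, not part of the paper; the reward values are made up) computes the return of a single trajectory and shows how the discount factor can be folded into the reward as described above.

import numpy as np

H = 5                                            # time horizon (assumed for the example)
rewards = np.array([1.0, 0.5, 0.0, 2.0, 1.0])    # r(s_t, a_t) along one sampled trajectory

# Return R(tau): sum of rewards along the trajectory
undiscounted_return = rewards.sum()

# Discounting can be folded into the reward, r_gamma(s_t, a_t) = gamma**t * r(s_t, a_t),
# which recovers the usual discounted return without changing the formalism above.
gamma = 0.99
discounted_return = (gamma ** np.arange(H) * rewards).sum()

print(undiscounted_return, discounted_return)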

Meta-Reinforcement Learning goes one step further, aiming to learn a learning algorithm which is able to quickly learn the optimal policy for a task T drawn from a distribution of tasks ρ(T). Each task T corresponds to a different MDP. Typically, it is assumed that the tasks share the action and state space, but may differ in their reward function or their dynamics.

Gradient-based meta-learning aims to solve this problem by learning the parameters θ of a policy π_θ such that performing a single or few steps of vanilla policy gradient (VPG) on the given task leads to the optimal policy for that task. This meta-learning formulation, also known as MAML, was first introduced by Finn et al. (2017). We refer to it as formulation I, which can be expressed as maximizing the objective

J^I(θ) = E_{T ∼ ρ(T)} [ E_{τ' ∼ P_T(τ' | θ')} [ R(τ') ] ]    with    θ' := U(θ, T) = θ + α ∇_θ E_{τ ∼ P_T(τ | θ)} [ R(τ) ]

In that, U denotes the update function which depends on the task T, and performs one VPG step towards maximizing the performance of the policy in T. For notational brevity and conciseness we assume a single policy gradient adaptation step. Nonetheless, all presented concepts can easily be extended to multiple adaptation steps.
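To make the update function U concrete, here is a hedged PyTorch sketch (our illustration, not the authors' implementation) of a single differentiable VPG adaptation step θ' = θ + α ∇_θ Ĵ_inner on a toy categorical policy; create_graph=True keeps the step differentiable so that meta-gradients can later flow through it, as formulation I requires. All shapes and values are placeholders.

import torch

torch.manual_seed(0)
alpha = 0.1                                    # inner-loop step size (assumed)
theta = torch.zeros(4, 3, requires_grad=True)  # linear policy: 4-dim states, 3 discrete actions

# Toy pre-update data standing in for trajectories sampled from task T with pi_theta
states  = torch.randn(10, 4)
actions = torch.randint(0, 3, (10,))
returns = torch.randn(10)                      # placeholder return estimates

# Vanilla policy gradient surrogate: E[ log pi_theta(a|s) * R ]
log_probs = torch.log_softmax(states @ theta, dim=-1)[torch.arange(10), actions]
inner_objective = (log_probs * returns).mean()

# One differentiable VPG step: theta' = U(theta, T) = theta + alpha * grad_theta J_inner
grad = torch.autograd.grad(inner_objective, theta, create_graph=True)[0]
theta_prime = theta + alpha * grad             # theta' stays a differentiable function of theta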

Later work proposes a slightly different notion of gradient-based Meta-RL, also known as E-MAML, that attempts to circumvent issues with the meta-gradient estimation in MAML (Al-Shedivat et al., 2018; Stadie et al., 2018):

J^II(θ) = E_{T ∼ ρ(T)} [ E_{τ^{1:N} ∼ P_T(τ | θ)} [ E_{τ' ∼ P_T(τ' | θ')} [ R(τ') ] ] ]    with    θ' := U(θ, τ^{1:N})

Formulation II views U(θ, τ^{1:N}) as a deterministic function of θ and N sampled trajectories τ^{1:N} from a specific task. In contrast to formulation I, the expectation over pre-update trajectories τ is applied outside of the update function. Throughout this paper we refer to π_θ as the pre-update policy, and π_{θ'} as the post-update policy.

4 Sampling Distribution Credit Assignment

This section analyzes the two gradient-based Meta-RL formulations introduced in Section 3. Figure 1 illustrates the stochastic computation graphs (Schulman et al., 2015b) of both formulations. The red arrows depict how credit assignment w.r.t. the pre-update sampling distribution is propagated. Formulation I (left) propagates the credit assignment through the update step, thereby exploiting the full problem structure. In contrast, formulation II (right) neglects the inherent structure, directly assigning credit from the post-update return R(τ') to the pre-update policy π_θ, which leads to noisier, less effective credit assignment.

Figure 1: Stochastic computation graphs of meta-learning formulation I (left) and formulation II (right). The red arrows illustrate the credit assignment from the post-update returns R(τ') to the pre-update policy π_θ through the update function U. (Deterministic nodes: square; stochastic nodes: circle.)

Both formulations optimize for the same objective, and are equivalent at the zeroth order. However, because of the difference in their formulations and stochastic computation graphs, their gradients and the resulting optimization steps differ. In the following, we shed light on how and where formulation II loses signal by analyzing the gradients of both formulations, which can be written as follows (see Appendix A for more details and derivations):

∇_θ J^I(θ)  and  ∇_θ J^II(θ)  =  E_{T ∼ ρ(T)} [ E_{τ ∼ P_T(τ | θ), τ' ∼ P_T(τ' | θ')} [ ∇_θ J_post(τ, τ') + ∇_θ J_pre(τ, τ') ] ]     (1)

The first term, ∇_θ J_post(τ, τ'), is equal in both formulations, but the second term, ∇_θ J_pre(τ, τ'), differs between them. In particular, they correspond to

∇_θ J_post(τ, τ') = ( I + α R(τ) ∇²_θ log π_θ(τ) ) ∇_{θ'} log π_{θ'}(τ') R(τ')     (2)
∇_θ J^II_pre(τ, τ') = ∇_θ log π_θ(τ) R(τ')     (3)
∇_θ J^I_pre(τ, τ') = α ∇_θ log π_θ(τ) ( ( ∇_θ log π_θ(τ) R(τ) )^⊤ ( ∇_{θ'} log π_{θ'}(τ') R(τ') ) )     (4)

∇_θ J_post(τ, τ') simply corresponds to a policy gradient step on the post-update policy π_{θ'} w.r.t. θ', followed by a linear transformation from post- to pre-update parameters. It corresponds to increasing the likelihood of the trajectories τ' that led to higher returns. However, this term does not optimize for the pre-update sampling distribution, i.e., which trajectories τ led to better adaptation steps.

The credit assignment w.r.t. the pre-update sampling distribution is carried out by the second term. In formulation II, ∇_θ J^II_pre can be viewed as standard reinforcement learning on π_θ with R(τ') as the reward signal, treating the update function U as part of the unknown dynamics of the system. This shifts the pre-update sampling distribution towards better adaptation steps.

Formulation I takes the causal dependence of P_T(τ' | θ') on P_T(τ | θ) into account. It does so by maximizing the inner product of pre-update and post-update policy gradients (see Eq. 4). This steers the pre-update policy towards 1) larger post-update returns, 2) larger adaptation steps, and 3) better alignment of pre- and post-update policy gradients (Li et al., 2017; Nichol et al., 2018). When combined, these effects directly optimize for adaptation. As a result, we expect the first meta-policy gradient formulation, ∇_θ J^I_pre, to yield superior learning properties.
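A toy NumPy calculation (ours, using the gradient forms sketched in Eqs. 2-4 above, which are themselves reconstructions) illustrates the qualitative difference: the formulation II pre-update term weights the pre-update score only by the post-update return, whereas the formulation I term weights it by the inner product of pre- and post-update policy gradients and therefore rewards alignment.

import numpy as np

alpha = 0.1
score_pre = np.array([1.0, 0.0])          # grad_theta log pi_theta(tau), toy 2-D parameter space
R_pre, R_post = 0.5, 2.0                  # returns of the pre- and post-update trajectories
grad_pre  = score_pre * R_pre             # pre-update policy gradient estimate
grad_post = np.array([0.8, 0.6]) * R_post # post-update policy gradient estimate (toy values)

# Formulation II: credit to the pre-update score is just the post-update return
j_pre_II = score_pre * R_post

# Formulation I: credit is weighted by the alignment (inner product) of both policy gradients
j_pre_I = alpha * score_pre * grad_pre.dot(grad_post)

print(j_pre_II, j_pre_I)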

5 Low Variance Curvature Estimator

In the previous section we show that the formulation introduced by Finn et al. (2017) results in superior meta-gradient updates, which should in principle lead to improved convergence properties. However, obtaining correct and low variance estimates of the respective meta-gradients proves challenging. As discussed by Foerster et al. (2018), and shown in Appendix B.3, the score function surrogate objective approach is ill suited for calculating higher order derivatives via automatic differentiation toolboxes. This important fact was overlooked in the original RL-MAML implementation (Finn et al., 2017), leading to incorrect meta-gradient estimates.¹ However, even when properly implemented, we show that these gradients exhibit high variance.

¹ Note that MAML is theoretically sound, but does not attend to correctly estimating the meta-policy gradients. As a consequence, the gradients in the corresponding implementation do not comply with the theory.

Specifically, the estimation of the hessian of the RL-objective, which is inherent in the meta-gradients, requires special consideration. In this section, we motivate and introduce the low variance curvature estimator (LVC): an improved estimator for the hessian of the RL-objective which promotes better meta-policy gradient updates. As we show in Appendix A.1, we can write the gradient of the meta-learning objective as

∇_θ J^I(θ) = E_{T ∼ ρ(T)} [ E_{τ' ∼ P_T(τ' | θ')} [ ∇_{θ'} log P_T(τ' | θ') R(τ') ∇_θ U(θ, T) ] ]     (5)

Since the update function U resembles a policy gradient step, its gradient ∇_θ U(θ, T) involves computing the hessian of the reinforcement learning objective, i.e., ∇²_θ J_inner(θ). Estimating this hessian has been discussed in Baxter & Bartlett (2001) and Furmston et al. (2016). In the infinite horizon MDP case, Baxter & Bartlett (2001) derived a decomposition of the hessian. We extend their finding to the finite horizon case, showing that the hessian can be decomposed into three matrix terms (see Appendix B.2 for the proof):

∇²_θ J_inner(θ) = ∇²_θ E_{τ ∼ P_T(τ | θ)} [ R(τ) ] = H_1 + H_2 + H_12 + H_12^⊤     (6)

whereby

H_1 = E_{τ ∼ P_T(τ | θ)} [ Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t | s_t) ∇_θ log π_θ(a_t | s_t)^⊤ ( Σ_{t'=t}^{H−1} r(s_{t'}, a_{t'}) ) ]
H_2 = E_{τ ∼ P_T(τ | θ)} [ Σ_{t=0}^{H−1} ∇²_θ log π_θ(a_t | s_t) ( Σ_{t'=t}^{H−1} r(s_{t'}, a_{t'}) ) ]
H_12 = E_{τ ∼ P_T(τ | θ)} [ Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t | s_t) ∇_θ Q_t^{π_θ}(s_t, a_t)^⊤ ]

Here Q_t^{π_θ}(s_t, a_t) denotes the expected state-action value function under policy π_θ at time t.

Computing the expectation of the RL-objective is in general intractable. Typically, its gradients are computed with a Monte Carlo estimate based on the policy gradient theorem (Eq. 82). In practical implementations, such an estimate is obtained by automatically differentiating a surrogate objective (Schulman et al., 2015b). However, this results in a highly biased hessian estimate which just computes H_2, entirely dropping the terms H_1 and H_12 + H_12^⊤. In the notation of the previous section, it leads to neglecting the ∇_θ J_pre term, ignoring the influence of the pre-update sampling distribution.
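The following PyTorch snippet (a minimal sketch of the problem under toy data, not the paper's code) differentiates the standard score-function surrogate twice: because the sampled log-probabilities enter the surrogate linearly, the resulting matrix only contains the reward-weighted ∇²_θ log π_θ term, i.e. an estimate of H_2 alone.

import torch

torch.manual_seed(0)
theta = torch.zeros(3, requires_grad=True)           # toy policy parameters

# Toy data standing in for sampled actions and rewards-to-go
feats   = torch.randn(5, 3)
actions = torch.randint(0, 3, (5,))
rtg     = torch.randn(5)                             # rewards-to-go (placeholder values)

# Standard surrogate: sum_t log pi_theta(a_t|s_t) * R_t
log_probs = torch.log_softmax(feats + theta, dim=-1)[torch.arange(5), actions]
surrogate = (log_probs * rtg).sum()

# Differentiating the surrogate twice only yields the reward-weighted Hessian of the
# log-probabilities (the H_2 term); the H_1 and H_12 terms are silently dropped.
grad = torch.autograd.grad(surrogate, theta, create_graph=True)[0]
hessian = torch.stack([torch.autograd.grad(grad[i], theta, retain_graph=True)[0]
                       for i in range(3)])
print(hessian)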

The issue can be overcome using the DiCE formulation, which allows computing unbiased higher-order Monte Carlo estimates of arbitrary stochastic computation graphs (Foerster et al., 2018). The DiCE-RL objective can be rewritten as follows:

J^DiCE(τ) = Σ_{t=0}^{H−1} ( ∏_{t'=0}^{t} π_θ(a_{t'} | s_{t'}) / ⊥(π_θ(a_{t'} | s_{t'})) ) r(s_t, a_t)     (7)

E_{τ ∼ P_T(τ | θ)} [ ∇²_θ J^DiCE(τ) ] = H_1 + H_2 + H_12 + H_12^⊤     (8)

In that, ⊥ denotes the “stop_gradient” operator, i.e., ⊥(f_θ(x)) → f_θ(x) in the forward pass but ∇_θ ⊥(f_θ(x)) → 0 in the backward pass. The sequential dependence of π_θ(a_t | s_t) within the trajectory, manifesting itself through the product of importance weights in (7), results in high variance estimates of the hessian ∇²_θ J_inner. As noted by Furmston et al. (2016), H_12 is particularly difficult to estimate, since it involves three nested sums along the trajectory. In Section 7.2 we empirically show that the high variance estimates of the DiCE objective lead to noisy meta-policy gradients and poor learning performance.
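A hedged PyTorch sketch of the DiCE-style surrogate (our illustration with toy tensors; .detach() plays the role of the stop-gradient operator ⊥): every reward is weighted by the product of per-step ratios π_θ / ⊥(π_θ) up to that time step, and it is exactly this trajectory-level product that couples all time steps and inflates the variance of the resulting hessian estimate.

import torch

torch.manual_seed(0)
theta = torch.zeros(3, requires_grad=True)

# Toy trajectory data (placeholders for one sampled trajectory)
feats   = torch.randn(6, 3)
actions = torch.randint(0, 3, (6,))
rewards = torch.randn(6)

log_probs = torch.log_softmax(feats * theta, dim=-1)[torch.arange(6), actions]

# pi_theta / stop_gradient(pi_theta): evaluates to 1, but gradients flow through the numerator
ratios = torch.exp(log_probs - log_probs.detach())

# DiCE-RL surrogate: sum_t ( prod_{t'<=t} ratio_{t'} ) * r_t
j_dice = (torch.cumprod(ratios, dim=0) * rewards).sum()

# The objective supports higher-order differentiation, e.g. for meta-gradients
grad = torch.autograd.grad(j_dice, theta, create_graph=True)[0]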

To facilitate sample-efficient meta-learning, we introduce the low variance curvature (LVC) estimator:

J^LVC(τ) = Σ_{t=0}^{H−1} ( π_θ(a_t | s_t) / ⊥(π_θ(a_t | s_t)) ) ( Σ_{t'=t}^{H−1} r(s_{t'}, a_{t'}) )     (9)

E_{τ ∼ P_T(τ | θ)} [ ∇²_θ J^LVC(τ) ] = H_1 + H_2     (10)

By removing the sequential dependence of π_θ(a_t | s_t) within trajectories, the hessian estimate neglects the term H_12 + H_12^⊤, which leads to a variance reduction but renders the estimate biased. The choice of this objective function is motivated by findings in Furmston et al. (2016): under certain conditions the term H_12 + H_12^⊤ vanishes around local optima θ*, i.e., E_τ[∇²_θ J^LVC(τ)] → ∇²_θ J_inner(θ) as θ → θ*. Hence, the bias of the LVC estimator becomes negligible close to local optima. The experiments in Section 7.2 underpin the theoretical findings, showing that the low variance hessian estimates obtained through J^LVC improve the sample-efficiency of meta-learning by a significant margin when compared to J^DiCE. We refer the interested reader to Appendix B for derivations and a more detailed discussion.
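For contrast, the same toy setup with the LVC surrogate (again our sketch): each time step carries only its own ratio π_θ(a_t | s_t) / ⊥(π_θ(a_t | s_t)), multiplied by the reward-to-go, so the trajectory-level product that drives the variance of the DiCE estimate disappears.

import torch

torch.manual_seed(0)
theta = torch.zeros(3, requires_grad=True)

feats   = torch.randn(6, 3)
actions = torch.randint(0, 3, (6,))
rewards = torch.randn(6)

log_probs = torch.log_softmax(feats * theta, dim=-1)[torch.arange(6), actions]
ratios = torch.exp(log_probs - log_probs.detach())   # pi_theta / stop_gradient(pi_theta)

# Rewards-to-go: sum_{t' >= t} r_{t'}
rewards_to_go = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

# LVC surrogate: sum_t ratio_t * rewards_to_go_t  (no product over the trajectory)
j_lvc = (ratios * rewards_to_go).sum()
grad = torch.autograd.grad(j_lvc, theta, create_graph=True)[0]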

6 ProMP: Proximal Meta-Policy Search

Building on the previous sections, we develop a novel meta-policy search method based on the low variance curvature objective which aims to solve the following optimization problem:

max_θ  E_{T ∼ ρ(T)} [ E_{τ ∼ P_T(τ | θ), τ' ∼ P_T(τ' | θ')} [ R(τ') ] ]    s.t.    θ' = θ + α ∇_θ J^LVC_T(θ)     (11)

Prior work has optimized this objective using either vanilla policy gradient (VPG) or TRPO (Schulman et al., 2015a). TRPO holds the promise of being more data-efficient and stable during the learning process when compared to VPG. However, it requires computing the Fisher information matrix (FIM). Estimating the FIM is particularly problematic in the meta-learning setup. The meta-policy gradients already involve second order derivatives; as a result, the time complexity of the FIM estimate is cubic in the number of policy parameters. Typically, the problem is circumvented using finite difference methods, which introduce further approximation errors.

The recently introduced PPO algorithm (Schulman et al., 2017) achieves comparable results to TRPO with the advantage of being a first order method. PPO uses a surrogate clipping objective which allows it to safely take multiple gradient steps without re-sampling trajectories.

In the case of Meta-RL, it does not suffice to just replace the post-update reward objective with the clipped surrogate J^CLIP_T. In order to safely perform multiple meta-gradient steps based on the same sampled data from a recent policy π_{θ_o}, we also need to 1) account for changes in the pre-update action distribution π_θ(a_t | s_t), and 2) bound changes in the pre-update state visitation distribution (Kakade & Langford, 2002).

We propose Proximal Meta-Policy Search (ProMP), which incorporates both the benefits of proximal policy optimization and the low variance curvature objective (see Alg. 1). In order to comply with requirement 1), ProMP replaces the “stop gradient” importance weight π_θ(a_t | s_t) / ⊥(π_θ(a_t | s_t)) by the likelihood ratio π_θ(a_t | s_t) / π_{θ_o}(a_t | s_t), which results in the following objective:

J^LR_T(θ) = E_{τ ∼ P_T(τ | θ_o)} [ Σ_{t=0}^{H−1} ( π_θ(a_t | s_t) / π_{θ_o}(a_t | s_t) ) A^{π_{θ_o}}(s_t, a_t) ]     (12)

An important feature of this objective is that its derivatives w.r.t. θ evaluated at θ_o are identical to those of the LVC objective, and it additionally accounts for changes in the pre-update action distribution. To satisfy condition 2), we extend the clipped meta-objective with a KL-penalty term between π_{θ_o} and π_θ. This KL-penalty term enforces a soft local “trust region” around π_{θ_o}, preventing the shift in state visitation distribution from becoming large during optimization. This enables us to take multiple meta-policy gradient steps without re-sampling. Altogether, ProMP optimizes

J^ProMP_T(θ) = J^CLIP_T(θ') − η D̄_KL(π_{θ_o}, π_θ)    s.t.    θ' = θ + α ∇_θ J^LR_T(θ),    T ∼ ρ(T)     (13)
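To illustrate the structure of this objective, the PyTorch fragment below (a sketch under our own assumptions about shapes and helpers, not the official implementation) combines a PPO-style clipped surrogate evaluated on post-update data with a KL penalty toward the pre-update rollout policy π_{θ_o}; the KL term is crudely approximated from sampled log-probabilities.

import torch

eps, eta = 0.2, 0.01    # clipping range and KL-penalty coefficient (assumed values)

def clipped_surrogate(logp_new, logp_old, advantages):
    # PPO-style clipped objective; logp_new must be computed through the differentiable
    # inner adaptation step so that meta-gradients can flow back to theta.
    ratio = torch.exp(logp_new - logp_old)
    return torch.min(ratio * advantages,
                     torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()

def promp_objective(logp_post_new, logp_post_old, adv_post, logp_pre_new, logp_pre_old):
    # Clipped surrogate on post-update trajectories minus a KL penalty between the
    # pre-update rollout policy pi_theta_o and the current pre-update policy pi_theta
    # (here a rough Monte Carlo estimate from log-probabilities of the sampled actions).
    j_clip = clipped_surrogate(logp_post_new, logp_post_old, adv_post)
    kl = (logp_pre_old - logp_pre_new).mean()
    return j_clip - eta * kl

# Toy usage with placeholder tensors
torch.manual_seed(0)
vals = [torch.randn(8) for _ in range(5)]
print(promp_objective(*vals))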

ProMP consolidates the insights developed throughout this paper, while at the same time making maximal use of recently developed policy gradient algorithms. First, its meta-learning formulation exploits the full structural knowledge of gradient-based meta-learning. Second, it incorporates a low variance estimate of the RL-objective hessian. Third, ProMP controls the statistical distance of both pre- and post-adaptation policies, promoting efficient and stable meta-learning. All in all, ProMP consistently outperforms previous gradient-based Meta-RL algorithms in sample complexity, wall-clock time, and asymptotic performance (see Section 7.1).

0:  Task distribution ρ, step sizes α, β, KL-penalty coefficient η, clipping range ε
1:  Randomly initialize θ
2:  while θ not converged do
3:     Sample batch of tasks T_i ∼ ρ(T)
4:     for step n = 0, ..., N−1 do
5:        if n = 0 then
6:           Set θ_o ← θ
7:           for all T_i ∼ ρ(T) do
8:              Sample pre-update trajectories D_i = {τ_i} from T_i using π_θ
9:              Compute adapted parameters θ'_{o,i} ← θ + α ∇_θ J^LR_{T_i}(θ) with D_i = {τ_i}
10:             Sample post-update trajectories D'_i = {τ'_i} from T_i using π_{θ'_{o,i}}
11:       Update θ ← θ + β Σ_{T_i} ∇_θ J^ProMP_{T_i}(θ) using each D'_i = {τ'_i}
Algorithm 1 Proximal Meta-Policy Search (ProMP)
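Algorithm 1 translates into roughly the following Python skeleton (a hedged sketch; sample_tasks, sample_trajectories, adapt, and meta_update are hypothetical stubs standing in for the actual routines, and only the control flow mirrors the algorithm).

def sample_tasks(n):                       # stub: draws n task descriptors from rho(T)
    return list(range(n))

def sample_trajectories(task, params):     # stub: would roll out the policy in the task's MDP
    return {"task": task, "params": params}

def adapt(params, pre_trajs, alpha):       # stub: one inner (likelihood-ratio / LVC) gradient step
    return params

def meta_update(params, data, theta_o, beta):   # stub: one ProMP meta-gradient step (Eq. 13)
    return params

def train_promp(theta, meta_iters=2, n_tasks=3, n_meta_steps=2, alpha=0.01, beta=1e-3):
    for _ in range(meta_iters):                      # line 2: until converged
        tasks = sample_tasks(n_tasks)                # line 3: sample batch of tasks
        for step in range(n_meta_steps):             # line 4
            if step == 0:                            # lines 5-6: freeze rollout policy pi_theta_o
                theta_o, data = theta, []
                for task in tasks:                   # lines 7-10: collect pre-/post-update data
                    pre = sample_trajectories(task, theta_o)
                    theta_adapted = adapt(theta_o, pre, alpha)
                    post = sample_trajectories(task, theta_adapted)
                    data.append((pre, post))
            theta = meta_update(theta, data, theta_o, beta)   # line 11: ProMP meta-step
    return theta

train_promp(theta=0.0)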

7 Experiments

In order to empirically validate the theoretical arguments outlined above, this section provides a detailed experimental analysis that aims to answer the following questions: (i) How does ProMP perform against previous Meta-RL algorithms? (ii) How do the lower variance but biased LVC gradient estimates compare to the high variance, unbiased DiCE estimates? (iii) Do the different formulations result in different pre-update exploration properties? (iv) How do formulation I and formulation II differ in their meta-gradient estimates and convergence properties?

To answer the posed questions, we evaluate our approach on six continuous control Meta-RL benchmark environments based on OpenAI Gym and the MuJoCo simulator (Brockman et al., 2016; Todorov et al., 2012). A description of the experimental setup is found in Appendix D. In all experiments, the reported curves are averaged over at least three random seeds. Returns are estimated based on sampled trajectories from the adapted post-update policies and averaged over sampled tasks. The source code and the experiment data are available on our supplementary website.²

² https://sites.google.com/view/pro-mp

7.1 Meta-Gradient Based Comparison

We compare our method, ProMP, in sample complexity and asymptotic performance to four other gradient-based approaches: TRPO-MAML (Finn et al., 2017), E-MAML-TRPO, E-MAML-VPG (Stadie et al., 2018), and LVC-VPG, an ablated version of our method that uses the LVC objective in the adaptation step and meta-optimizes with vanilla policy gradient. These algorithms are benchmarked on six different locomotion tasks that require adaptation: the half-cheetah and walker must switch between running forward and backward, the high-dimensional agents ant and humanoid must learn to adapt to run in different directions in the 2D plane, and the hopper and walker have to adapt to different configurations of their dynamics.

Figure 2: Meta-learning curves of ProMP and four other gradient-based meta-learning algorithms in six different MuJoCo environments. ProMP outperforms previous work in all the environments.

Figure 3: Upper: Relative standard deviation of meta-policy gradients. Lower: Return in the HalfCheetahFwdBack environment.

The results, shown in Figure 2, highlight the strength of ProMP in terms of sample efficiency and asymptotic performance. They also demonstrate the positive effect of the LVC objective: LVC-VPG, even though optimized with vanilla policy gradient, is often able to achieve comparable results to the prior methods that are optimized with TRPO. When compared to E-MAML-VPG, LVC proves strictly superior in performance, which underpins the soundness of the theory developed throughout this paper. Results for four additional environments are displayed in Appendix D along with hyperparameter settings, environment specifications and a wall-clock time comparison of the algorithms.

7.2 Estimator Variance and Its Effect on Meta-Learning

In Section 5 we discussed how the DiCE formulation yields unbiased but high variance estimates of the RL-objective hessian, which served as motivation for the low variance curvature (LVC) estimator. Here we investigate the meta-gradient variance of both estimators as well as its implications for learning performance. Specifically, we report the relative standard deviation of the meta-policy gradients as well as the average return throughout the learning process in the HalfCheetahFwdBack environment. The results, depicted in Figure 3, highlight the advantage of the low variance curvature estimate. The trajectory-level dependencies inherent in the DiCE estimator lead to a meta-gradient standard deviation that is on average two times higher than that of LVC. As the learning curves indicate, the noisy gradients impede sample-efficient meta-learning in the case of DiCE. Meta-policy search based on the LVC estimator leads to substantially better learning properties.

7.3 Comparison of Initial Sampling Distributions

Here we evaluate the effect of the different objectives on the learned pre-update sampling distribution. We compare the low variance curvature (LVC) estimator with TRPO (LVC-TRPO) against MAML (Finn et al., 2017) and E-MAML-TRPO (Stadie et al., 2018) in a 2D environment on which the exploration behavior can be visualized. Each task of this environment corresponds to reaching a different corner location; however, the 2D agent only experiences reward when it is sufficiently close to the corner (translucent regions of Figure 4). Thus, to successfully identify the task, the agent must explore the different regions. We perform three inner adaptation steps on each task, allowing the agent to fully change its behavior from exploration to exploitation.

Figure 4: Exploration patterns of the pre-update policy and exploitation post-update with different update functions. Through its superior credit assignment, the LVC objective learns a pre-update policy that is able to identify the current task and respectively adapt its policy, successfully reaching the goal (dark green circle).

The different exploration-exploitation strategies are displayed in Figure 4. Since the MAML implementation does not assign credit to the pre-update sampling trajectories, it is unable to learn a sound exploration strategy for task identification and thus fails to accomplish the task. On the other hand, E-MAML, which corresponds to formulation II, learns to explore in long but random paths: because it can only assign credit to batches of pre-update trajectories, there is no notion of which actions in particular facilitate good task adaptation. As a consequence, the adapted policy slightly misses the task-specific target. The LVC estimator, instead, learns a consistent pattern of exploration, visiting each of the four regions, which it harnesses to fully solve the task.

7.4 Gradient Update Directions of the Two Meta-RL Formulations

Figure 5: Meta-gradient updates of the policy parameters θ_0 and θ_1 in a 1D environment w.r.t. formulation I (red) and formulation II (green).

To shed more light on the differences between the gradients of formulation I and formulation II, we evaluate the meta-gradient updates and the corresponding convergence to the optimum of both formulations in a simple 1D environment. In this environment, the agent starts at a random position on the real line and has to reach a goal located at position 1 or -1. In order to visualize the convergence, we parameterize the policy with only two parameters θ_0 and θ_1. We employ formulation I by optimizing the DiCE objective with VPG, and formulation II by optimizing its (E-MAML) objective with VPG.

Figure 5 depicts meta-gradient updates of the parameters for both formulations. Formulation I (red) exploits the internal structure of the adaptation update yielding faster and steadier convergence to the optimum. Due to its inferior credit assignment, formulation II (green) produces noisier gradient estimates leading to worse convergence properties.

8 Conclusion

In this paper we propose a novel Meta-RL algorithm, proximal meta-policy search (ProMP), which fully optimizes for the pre-update sampling distribution leading to effective task identification. Our method is the result of a theoretical analysis of gradient-based Meta-RL formulations, based on which we develop the low variance curvature (LVC) surrogate objective that produces low variance meta-policy gradient estimates. Experimental results demonstrate that our approach surpasses previous meta-reinforcement learning approaches in a diverse set of continuous control tasks. Finally, we underpin our theoretical contributions with illustrative examples which further justify the soundness and effectiveness of our method.

Acknowledgments

Ignasi Clavera was supported by the La Caixa Fellowship. The research leading to these results has received funding from the German Research Foundation (DFG: Deutsche Forschungsgemeinschaft) under the Priority Program on Autonomous Learning (SPP 1527) and was supported by Berkeley Deep Drive, Amazon Web Services, and Huawei. We also thank Abhishek Gupta, Chelsea Finn, and Aviv Tamar for their valuable feedback.

References

Appendix A Two Meta-Policy Gradient Formulations

In this section we discuss two different gradient-based meta-learning formulations, derive their gradients and analyze the differences between them.

A.1 Meta-Policy Gradient Formulation I

The first meta-learning formulation, known as MAML (Finn et al., 2017), views the inner update rule U(θ, T) as a mapping from the pre-update parameter θ and the task T to an adapted policy parameter θ'. The update function U can be viewed as a stand-alone procedure that encapsulates sampling from the task-specific trajectory distribution P_T(τ | π_θ) and updating the policy parameters. Building on this concept, the meta-objective can be written as

J^I(θ) = E_{T ∼ ρ(T)} [ E_{τ' ∼ P_T(τ' | θ')} [ R(τ') ] ]    with    θ' := U(θ, T)     (14)

The task-specific gradients follow as

∇_θ J^I_T(θ) = ∇_θ E_{τ' ∼ P_T(τ' | θ')} [ R(τ') ]     (15)
             = E_{τ' ∼ P_T(τ' | θ')} [ ∇_θ log P_T(τ' | θ') R(τ') ]     (16)
             = E_{τ' ∼ P_T(τ' | θ')} [ ∇_{θ'} log P_T(τ' | θ') R(τ') ∇_θ θ' ]     (17)

In order to derive the gradients of the inner update ∇_θ θ' = ∇_θ U(θ, T), it is necessary to know the structure of U. The main part of this paper assumes the inner update rule to be a vanilla policy gradient step

U(θ, T) = θ + α ∇_θ E_{τ ∼ P_T(τ | θ)} [ R(τ) ]     (18)
∇_θ U(θ, T) = I + α ∇²_θ E_{τ ∼ P_T(τ | θ)} [ R(τ) ]     (19)

Thereby the second term in (19) is the local curvature (hessian) of the inner adaptation objective function. The correct hessian of the inner objective can be derived as follows:

(20)
(21)
(22)
(23)
(24)

A.2 Meta-Policy Gradient Formulation II

The second meta-reinforcement learning formulation views the inner update θ' = U(θ, τ^{1:N}) as a deterministic function of the pre-update policy parameters θ and the trajectories τ^{1:N} sampled from the pre-update trajectory distribution P_T(τ | θ). This formulation was introduced in Al-Shedivat et al. (2018) and further discussed with respect to its exploration properties in Stadie et al. (2018).

Viewing U as a function that adapts the policy parameters θ to a specific task T given policy rollouts in this task, the corresponding meta-learning objective can be written as

J^II(θ) = E_{T ∼ ρ(T)} [ E_{τ^{1:N} ∼ P_T(τ | θ)} [ E_{τ' ∼ P_T(τ' | θ')} [ R(τ') ] ] ]    with    θ' := U(θ, τ^{1:N})     (25)

Since the first part of the gradient derivation is agnostic to the inner update rule U(θ, τ^{1:N}), we only assume that the inner update function U is differentiable w.r.t. θ. First we rewrite the meta-objective J^II(θ) as an expectation of task-specific objectives J^II_T(θ) under the task distribution. This allows us to express the meta-policy gradient as an expectation of task-specific gradients:

∇_θ J^II(θ) = E_{T ∼ ρ(T)} [ ∇_θ J^II_T(θ) ]     (26)

The task specific gradients can be calculated as follows

As in A.1, the structure of U(θ, τ^{1:N}) must be known in order to derive the gradient ∇_θ U(θ, τ^{1:N}). Since we assume the inner update to be vanilla policy gradient, the respective gradient follows as

The respective gradient of U(θ, τ^{1:N}) follows as

(27)
(28)

A.3 Comparing the Gradients of the Two Formulations

In the following we analyze the differences between the gradients derived for the two formulations. To do so, we begin by inserting the gradient of the inner adaptation step (19) into (17):

(29)

We can substitute the hessian of the inner objective by its derived expression from (24) and then rearrange the terms. Also note that ∇_θ log P_T(τ | θ) = ∇_θ log π_θ(τ) = Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t | s_t), where H is the MDP horizon.

(30)
(31)
(32)
(33)

Next, we rearrange the gradient of formulation II into a similar form as that of formulation I. For that, we start by inserting (28) for ∇_θ U and replacing the expectation over pre-update trajectories τ^{1:N} by the expectation over a single trajectory τ.

(34)
(35)

While the first part of the gradients matches ((32) and (34)), the second part ((33) and (35)) differs. Since the second gradient term can be viewed as responsible for shifting the pre-update sampling distribution P_T(τ | θ) towards higher post-update returns, we refer to it as ∇_θ J_pre. To further analyze the difference between the two formulations, we slightly rearrange (33) and put both gradient terms next to each other:

(36)
(37)

In the following we interpret and compare the derived gradient terms ∇_θ J_post and ∇_θ J_pre, aiming to provide intuition for the differences between the formulations:

The first gradient term, ∇_θ J_post, which matches in both formulations, corresponds to a policy gradient step on the post-update policy π_{θ'}. Since θ' itself is a function of θ, this term can be seen as a linear transformation of the policy gradient update from the post-update parameter θ' into θ. Although ∇_θ J_post takes into account the functional relationship between θ' and θ, it does not take into account the pre-update sampling distribution P_T(τ | θ).

This is where ∇_θ J_pre comes into play: