
Sub-policy Adaptation for Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning is a promising approach to long-horizon decision-making problems with sparse rewards. Unfortunately, most methods still decouple the lower-level skill acquisition process and the training of a higher level that controls the skills in a new task. Treating the skills as fixed can lead to significant sub-optimality in the transfer setting. In this work, we propose a novel algorithm to discover a set of skills, and continuously adapt them along with the higher level even when training on a new task. Our main contributions are two-fold. First, we derive a new hierarchical policy gradient, as well as an unbiased latent-dependent baseline. We introduce Hierarchical Proximal Policy Optimization (HiPPO), an on-policy method to efficiently train all levels of the hierarchy simultaneously. Second, we propose a method of training time-abstractions that improves the robustness of the obtained skills to environment changes. Code and results are available at .



1 Introduction

Reinforcement learning (RL) has made great progress in a variety of domains, from playing games such as Pong and Go Mnih et al. [2015], Silver et al. [2017] to automating robotic locomotion Schulman et al. [2015], Heess et al. [2017], Florensa et al. [2018a], dexterous manipulation Florensa et al. [2017a], Andrychowicz et al. , and perception Nair et al. [2018], Florensa et al. [2018b]. Yet, most work in RL is still learning a new behavior from scratch when faced with a new problem. This is particularly inefficient when dealing with tasks that are hard to solve due to sparse rewards or long horizons, or when solving many related tasks.

A promising technique to overcome this limitation is Hierarchical Reinforcement Learning (HRL) [Sutton et al., 1999, Florensa et al., 2017b]. In this paradigm, policies have several modules of abstraction, so the reuse of a subset of the modules becomes easier. The most common case consists of temporal abstraction Precup [2000], Dayan and Hinton [1993], where a higher-level policy (manager) takes actions at a lower frequency, and its actions condition the behavior of some lower level skills or sub-policies. When transferring knowledge to a new task, most prior works fix the skills and train a new manager on top. Despite having a clear benefit in kick-starting the learning in the new task, having fixed skills can considerably cap the final performance on the new task Florensa et al. [2017b]. Little work has been done on adapting pre-trained sub-policies to be optimal for a new task.

In this paper, we develop a new framework for adapting all levels of temporal hierarchies simultaneously. First, we derive an efficient approximated hierarchical policy gradient. The key insight is that, despite the decisions of the manager being unobserved latent variables from the point of view of the Markovian environment, from the perspective of the sub-policies they can be considered as part of the observation. This provides a type of decoupling of the gradient with respect to the manager and the sub-policies parameters that greatly simplifies the policy gradient computation, in a principled way. It also justifies theoretically a technique used in other prior works Frans et al. [2018]. Second, we introduce a sub-policy specific baseline for our hierarchical policy gradient. We show that using this baseline is unbiased, and our experiments reveal faster convergence, suggesting efficient gradient variance reduction. Then we introduce a more stable way of using this gradient, Hierarchical Proximal Policy Optimization (HiPPO). This helps us take more conservative steps in our policy space Schulman et al. [2017], necessary in hierarchies because of the interdependence of each layer. Finally, we also evaluate the benefit of varying the time-commitment to the sub-policies, and show it helps both in terms of final performance and zero-shot adaptation to similar tasks.

2 Related Work

The key points in HRL are how the different levels of the hierarchy are defined, trained, and then re-used. In this work, we are interested in approaches that allow us to build temporal abstractions by having a higher level taking decisions at a slower frequency than a lower-level. There has been growing interest in HRL for the past few decades [Sutton et al., 1999, Precup, 2000], but only recently has it been applied to high-dimensional continuous domains as we do in this work Kulkarni et al. [2016], Daniel et al. [2016].

To obtain the lower level policies, or skills, most methods exploit some additional assumptions, like access to demonstrations Le et al. [2018], Merel et al. [2019], Ranchod et al. [2015], Sharma et al. [2018], policy sketches Andreas et al. [2017], or task decomposition into sub-tasks Ghavamzadeh and Mahadevan [2003], Sohn et al. [2018]. Other methods use a different reward for the lower level, often constraining it to be a “goal reacher” policy, where the signal from the higher level is the goal to reach [Nachum et al., 2018, Levy et al., 2019, Vezhnevets et al., 2017]. These methods are very promising for state-reaching tasks, but might require access to goal-reaching reward systems not defined in the original MDP, and are more limited when training on tasks beyond state-reaching. Our method does not require any additional supervision, and the obtained skills are not constrained to be goal-reaching.

When transferring skills to a new environment, most HRL methods keep them fixed and simply train a new higher-level on top [Hausman et al., 2018, Heess et al., 2016]. Other work allows for building on previous skills by constantly supplementing the set of skills with new ones Shu et al. [2018], but they require a hand-defined curriculum of tasks, and the previous skills are never fine-tuned. Our algorithm allows for seamless adaptation of the skills, showing no trade-off between leveraging the power of the hierarchy and the final performance in a new task. Other methods use invertible functions as skills [Haarnoja et al., 2018], and therefore a fixed skill can be fully over-written when a new layer of hierarchy is added on top. This kind of “fine-tuning” is promising, although they do not apply it to temporally extended skills as we are interested in here.

One of the most general frameworks to define temporally extended hierarchies is the options framework [Sutton et al., 1999], and it has recently been applied to continuous state spaces Bacon et al. [2017]. One of the most delicate parts of this formulation is the termination policy, and it requires several regularizers to avoid skill collapse Harb et al. [2017], Vezhnevets et al. [2016]. This modification of the objective may be difficult to tune and affects the final performance. Instead of adding such penalties, we propose having skills of a random length, not controlled by the agent during training of the skills. The benefit is two-fold: no termination policy to train, and more stable skills that transfer better. Furthermore, these works only used discrete action MDPs. We lift this assumption, and show good performance of our algorithm in complex locomotion tasks.

The closest work to ours in terms of final algorithm is the one proposed by Frans et al. [2018]. Their method can be included in our framework, and hence benefits from our new theoretical insights. We also introduce two modifications that are shown to be highly beneficial: the random time-commitment explained above, and the notion of an information bottleneck to obtain skills that generalize better.

3 Preliminaries

Figure 1: Temporal hierarchy studied in this paper. A latent code $z_t$ is sampled from the manager policy $\pi_{\theta_h}(z_t \mid s_t)$ every $p$ time-steps, using the current observation $s_{kp}$. The actions $a_t$ are sampled from the sub-policy $\pi_{\theta_l}(a_t \mid s_t, z_{kp})$ conditioned on the same latent code from timestep $t = kp$ to timestep $(k+1)p - 1$.

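We define a discrete-time finite-horizon discounted Markov decision process (MDP) by a tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \rho_0, \gamma, H)$, where $\mathcal{S}$ is a state set, $\mathcal{A}$ is an action set, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ is the transition probability distribution, $r$ is the reward function, $\rho_0$ is the initial state distribution, $\gamma \in [0, 1]$ is a discount factor, and $H$ the horizon. Our objective is to find a stochastic policy $\pi_\theta$ that maximizes the expected discounted reward within the MDP, $\eta(\pi_\theta) = \mathbb{E}_\tau\big[\sum_{t=0}^{H} \gamma^t r(s_t, a_t)\big]$. We denote by $\tau = (s_0, a_0, \dots, s_H, a_H)$ the entire state-action trajectory, where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi_\theta(a_t \mid s_t)$, and $s_{t+1} \sim \mathcal{P}(s_{t+1} \mid s_t, a_t)$.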
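As a small numeric illustration of the discounted objective, the sketch below computes the discounted return of a single trajectory. The reward sequence and discount value are made up for the example and are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (t starts at 0)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Sparse-reward toy episode: a single +1 arrives only at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 0.5**3 = 0.125
```

With sparse rewards and a long horizon, the discounted signal reaching early time-steps shrinks geometrically, which is one reason such tasks are hard for flat policies.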

In this work, we tackle the problem of learning a hierarchical policy and efficiently adapting all the levels in the hierarchy to perform a new task. Usually, hierarchical policies are composed of a higher level, or manager $\pi_{\theta_h}(z_t \mid s_t)$, and a lower level, or sub-policy $\pi_{\theta_l}(a_{t'} \mid z_t, s_{t'})$. The higher level does not take actions in the environment directly, but rather outputs a command, or latent variable $z_t$, that conditions the behavior of the lower level. In this line of work, the manager typically operates at a lower frequency than the sub-policies. Specifically, the manager only observes the environment every $p$ time-steps. When the manager receives a new observation, it decides which low-level policy to commit to for $p$ environment steps by means of a latent code $z$. Figure 1 depicts this set-up, where the high-level period $p$ is a random variable, which is a contribution of this paper as described in later sections.

4 Problem Statement

Hierarchical Reinforcement Learning (HRL) has the potential to leverage previously acquired knowledge to accelerate learning in related tasks. For instance, if an agent has learned to navigate in a certain environment, a good hierarchical policy could decompose the task into different sub-policies, each sub-policy advancing the agent in different directions. Then, in a new navigation problem, the manager of the hierarchical policy can make use of each of the sub-policies, effectively improving exploration when those are executed for an extended period of time. As a result, longer horizon tasks and tasks that require multiple skills can be solved in an efficient manner.

Prior works have focused on learning a manager that combines these sub-policies, but they do not further train the sub-policies when learning a new task. However, preventing the skills from learning results in sub-optimal behavior in new tasks. This effect is exacerbated when the skills were learned in a task-agnostic way or in a different environment. In this paper, we present an HRL method that learns all levels of abstraction in the hierarchical policy: the manager learns to make use of the low-level skills, while the skills are continuously adapted to attain maximum performance in the given task. We derive a policy gradient update for hierarchical policies that monotonically improves the performance. Furthermore, we demonstrate that our approach prevents the sub-policy collapse observed in previous approaches, where the manager ends up using just one skill.

5 Efficient Hierarchical Policy Gradients

When using a hierarchical policy, the intermediate decision taken by the higher level is not directly applied in the environment. This consideration makes it unclear how it should be incorporated into the Markovian framework of RL: should it be treated as an observed variable, like an action, or as a latent? The answer to this question impacts the methods applicable to HRL and how the gradient of the RL objective with respect to the parameters of a hierarchical policy is computed.

In this section, we first prove that one framework is an approximation of the other under mild assumptions. Then, we derive an unbiased baseline for the HRL setup that reduces its variance. Thirdly, we introduce the notion of information bottleneck and trajectory compression, which proves critical for learning reusable skills. Finally, with these findings, we present our method, Hierarchical Proximal Policy Optimization (HiPPO), an on-policy algorithm for hierarchical policies that monotonically improves the RL objective, allowing learning at all levels of the policy and preventing sub-policy collapse.

5.1 Approximate Hierarchical Policy Gradient

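Policy gradient algorithms are based on the likelihood ratio trick Williams [1992] to estimate the gradient of returns with respect to the policy parameters as

$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_\tau\big[\nabla_\theta \log P(\tau)\, R(\tau)\big] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log P(\tau_i)\, R(\tau_i) \tag{1}$$

$$= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau_i) \tag{2}$$
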
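In the context of HRL, a hierarchical policy with a manager $\pi_{\theta_h}(z_t \mid s_t)$ selects every $p$ time-steps one of $n$ sub-policies to execute. These sub-policies, indexed by $z \in \{z_1, \dots, z_n\}$, can be represented as a single conditional probability distribution over actions $\pi_{\theta_l}(a_t \mid s_t, z_t)$. This allows us to not only use a given set of sub-policies, but also leverage skills learned with Stochastic Neural Networks (SNNs) Florensa et al. [2017b]. Under this framework, the probability of a trajectory $\tau$ can be written as

$$P(\tau) = \Bigg( \prod_{k=0}^{H/p} \Big[ \sum_{j=1}^{n} \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) \Big] \Bigg) \Bigg[ P(s_0) \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, a_t) \Bigg] \tag{3}$$
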
The mixture action distribution, which presents itself as an additional summation over skills, prevents additive factorization when taking the logarithm, as in Eq. 1. This can yield considerable numerical instabilities due to the product of the sub-policy probabilities. For instance, when all the skills are distinguishable, all the sub-policy probabilities but one will have small values, so their product over a sub-trajectory is exponentially small. In the following Lemma, we derive an approximation of the policy gradient, whose error tends to zero as the skills become more diverse, and draw insights on the interplay of the manager actions.
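The numerical issue can be made concrete with a toy sub-trajectory. The sketch below is only an illustration under assumed values: 4 skills, a 10-step commitment, a uniform manager, per-step probability 0.9 under the sampled skill and 0.1 under every other skill. It evaluates the log of the mixture term with a log-sum-exp and compares it to keeping only the sampled skill's term.

```python
import math


def log_subtrajectory_mixture(log_manager, log_skill_steps):
    """log( sum_j pi_h(z_j|s) * prod_t pi_l(a_t|s_t,z_j) ), via log-sum-exp.

    log_manager[j]        : log pi_h(z_j | s_kp)
    log_skill_steps[j][t] : log pi_l(a_t | s_t, z_j) along the sub-trajectory
    """
    terms = [lm + sum(steps) for lm, steps in zip(log_manager, log_skill_steps)]
    m = max(terms)
    return m + math.log(sum(math.exp(x - m) for x in terms))


# Assumed toy values (not from the paper's experiments).
n, p, eps = 4, 10, 0.1
log_manager = [math.log(1.0 / n)] * n
# Skill 0 was the one actually sampled; the others explain the actions poorly.
log_skill_steps = [[math.log(0.9)] * p] + [[math.log(eps)] * p for _ in range(n - 1)]

full = log_subtrajectory_mixture(log_manager, log_skill_steps)
sampled_only = log_manager[0] + sum(log_skill_steps[0])
gap = full - sampled_only

print(eps ** p)  # product under a non-sampled skill: about 1e-10
print(gap)       # the sampled skill's term dominates the mixture
```

In this toy setting the log-probability gap is far below typical gradient noise, which gives some intuition for why replacing the mixture by the sampled skill's term is a reasonable approximation when skills are distinguishable.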

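Lemma 1.

If the skills are sufficiently differentiated, then the latent variable can be treated as part of the observation to compute the gradient of the trajectory probability. Let $\pi_{\theta_h}(z \mid s)$ and $\pi_{\theta_l}(a \mid s, z)$ be Lipschitz functions w.r.t. their parameters, and assume that $0 < \pi_{\theta_l}(a_t \mid s_t, z_j) < \epsilon$ for all $z_j \neq z_{kp}$, then

$$\nabla_\theta \log P(\tau) = \sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp}) + \sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp}) + O(nH\epsilon^{p-1}) \tag{4}$$

See Appendix. ∎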

Our assumption can be seen as having diverse skills. Namely, for each action there is just one sub-policy that gives it high probability. In this case, the latent variable can be treated as part of the observation to compute the gradient of the trajectory probability. Many algorithms to extract lower-level skills are based on promoting diversity among the skills Florensa et al. [2017b], Eysenbach et al. [2019], therefore usually satisfying our assumption. We further analyze how well this assumption holds in our experiments section.

5.2 Unbiased Sub-policy Baseline

The policy gradient estimate obtained when applying the log-likelihood ratio trick as derived above is known to have large variance. A very common approach to mitigate this issue without biasing the estimate is to subtract a baseline from the returns Peters and Schaal [2008]. The more precise the learned baseline can be, the more effective this technique is. It is well known that such baselines can be made state-dependent without incurring any bias. However, it is still unclear how to formulate a baseline for all the levels in a hierarchical policy, since an action dependent baseline does introduce bias in the gradient. Here, we show how, under the assumptions of Lemma 1, we can formulate an unbiased latent dependent baseline for the approximate gradient (Eq. 4).

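Lemma 2.

For any functions $b_h: \mathcal{S} \to \mathbb{R}$ and $b_l: \mathcal{S} \times \mathcal{Z} \to \mathbb{R}$ we have:

$$\mathbb{E}_\tau\Big[\sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, b_h(s_{kp})\Big] = 0 \quad \text{and} \quad \mathbb{E}_\tau\Big[\sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, b_l(s_t, z_{kp})\Big] = 0$$

See Appendix. ∎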
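The unbiasedness claim of Lemma 2 can be checked exactly in the simplest case. The sketch below assumes a categorical manager parameterized by softmax logits (an illustrative choice, not necessarily the parameterization used in the experiments) and computes the expectation of the score function times a state-dependent baseline by summing over all latents.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def grad_log_prob(logits, j):
    """Gradient of log softmax(logits)[j] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == j else 0.0) - probs[k] for k in range(len(logits))]


def expected_baseline_term(logits, baseline):
    """Exact E_z[ grad log pi_h(z|s) * b(s) ] for a categorical manager."""
    probs = softmax(logits)
    n = len(logits)
    out = [0.0] * n
    for j in range(n):
        g = grad_log_prob(logits, j)
        for k in range(n):
            out[k] += probs[j] * g[k] * baseline
    return out


# Hypothetical logits and baseline value for one state.
bias = expected_baseline_term([0.3, -1.2, 2.0], baseline=5.0)
print(bias)  # every component is zero up to floating-point rounding
```

The same argument applies to the skill-level term: once the latent is treated as part of the observation, a baseline that depends on the state and the latent, but not on the action, multiplies a score function whose expectation vanishes.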

Now we apply Lemma 1 and Lemma 2 to Eq. 1. By using the corresponding value functions as the function baseline, the return can be replaced by the Advantage function Schulman et al. [2015], and we obtain the following gradient expression:

$$\hat{g} = \mathbb{E}_\tau\Big[\sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, A(s_{kp}, z_{kp}) + \sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, A(s_t, a_t, z_{kp})\Big]$$

This hierarchical policy gradient estimate has lower variance than without baselines, but using it for policy optimization through stochastic gradient descent still yields an unstable algorithm. In the next section, we further improve the stability and sample efficiency of the policy optimization by incorporating techniques from Proximal Policy Optimization Schulman et al. [2017].

5.3 Hierarchical Proximal Policy Optimization

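Using an appropriate step size in policy space is critical for stable policy learning. Modifying the policy parameters in some directions may have a minimal impact on the distribution over actions, whereas small changes in other directions might change its behavior drastically and hurt training efficiency. Trust region policy optimization (TRPO) uses a constraint on the KL-divergence between the old policy and the new policy to prevent this issue [Schulman et al., 2015]. Unfortunately, hierarchical policies are generally represented by complex distributions without closed form expressions for the KL-divergence. Therefore, to improve the stability of our hierarchical policy gradient we turn towards Proximal Policy Optimization (PPO) [Schulman et al., 2017]. PPO is a more flexible and compute-efficient algorithm. In a nutshell, it replaces the KL-divergence constraint with a cost function that achieves the same trust region benefits, but only requires the computation of the likelihood. Letting $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, the PPO objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big\{ r_t(\theta) A_t,\ \text{clip}\big(r_t(\theta), 1 - \epsilon_{\text{clip}}, 1 + \epsilon_{\text{clip}}\big) A_t \big\}\Big]$$

Since the likelihood ratio comes from importance sampling, we can adapt our approximated hierarchical policy gradient with the same approach. Letting $r_{h,kp}(\theta) = \frac{\pi_{\theta_h}(z_{kp} \mid s_{kp})}{\pi_{\theta_{h,old}}(z_{kp} \mid s_{kp})}$ and $r_{l,t}(\theta) = \frac{\pi_{\theta_l}(a_t \mid s_t, z_{kp})}{\pi_{\theta_{l,old}}(a_t \mid s_t, z_{kp})}$, and using the super-index clip to denote the clipped objective version, we obtain the new surrogate objective:

$$L^{CLIP}_{HiPPO}(\theta) = \mathbb{E}_\tau\Big[\sum_{k=0}^{H/p} \min\big\{ r_{h,kp}(\theta) A(s_{kp}, z_{kp}),\ r^{clip}_{h,kp}(\theta) A(s_{kp}, z_{kp}) \big\} + \sum_{t=0}^{H} \min\big\{ r_{l,t}(\theta) A(s_t, a_t, z_{kp}),\ r^{clip}_{l,t}(\theta) A(s_t, a_t, z_{kp}) \big\}\Big]$$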

We call this algorithm Hierarchical Proximal Policy Optimization (HiPPO). Next, we introduce two critical additions: a switching of the time-commitment between skills, and an information bottleneck at the lower-level. Both are detailed in the following subsections.
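A minimal sketch of the clipped surrogate, assuming scalar likelihood ratios and advantages have already been computed. The function names and the clipping value 0.2 are illustrative assumptions, not values reported here. The hierarchical version simply adds a manager term per latent decision and a skill term per environment step.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def hippo_surrogate(manager_terms, skill_terms, clip_eps=0.2):
    """Sum of clipped terms over manager decisions and skill actions.

    manager_terms: list of (r_h, A(s_kp, z_kp)), one per latent decision
    skill_terms:   list of (r_l, A(s_t, a_t, z_kp)), one per time-step
    """
    total = sum(clipped_surrogate(r, a, clip_eps) for r, a in manager_terms)
    total += sum(clipped_surrogate(r, a, clip_eps) for r, a in skill_terms)
    return total


# Toy numbers: one manager decision followed by three skill steps.
value = hippo_surrogate(
    manager_terms=[(1.5, 1.0)],                       # clipped to 1.2 * 1.0
    skill_terms=[(1.0, 0.5), (0.5, -1.0), (1.1, 2.0)],
)
print(value)
```

Because both levels are clipped separately, a large change in the manager's latent distribution cannot be hidden behind small changes in the skills, and vice versa.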

5.4 Varying Time-commitment

Most hierarchical methods either consider a fixed time-commitment to the lower level skills Florensa et al. [2017b], Frans et al. [2018], or implement the complex options framework Precup [2000], Bacon et al. [2017]. In this work we propose an in-between, where the time-commitment to the skills is a random variable sampled from a fixed distribution just before the manager takes a decision. This modification does not hinder final performance, and we show it improves zero-shot adaptation to a new task. This approach to sampling rollouts is detailed in Algorithm 1.

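Algorithm 1 Collect Rollout
  Input: skills $\pi_{\theta_l}(a \mid s, z)$, manager $\pi_{\theta_h}(z \mid s)$, time-commitment bounds $P_{\min}$ and $P_{\max}$, horizon $H$, and bottleneck function $f$
  Reset environment: $s_0 \sim \rho_0$, $t = 0$.
  while $t < H$ do
     Sample time-commitment $p$ from a fixed distribution over $[P_{\min}, P_{\max}]$
     Sample skill $z_t \sim \pi_{\theta_h}(\cdot \mid s_t)$
     for $t' = t, \dots, t + p - 1$ do
        Sample action $a_{t'} \sim \pi_{\theta_l}(\cdot \mid f(s_{t'}), z_t)$
        Observe new state $s_{t'+1}$ and reward $r_{t'}$
     end for
     $t \leftarrow t + p$
  end while
  Output: trajectory $(s_0, z_0, a_0, s_1, a_1, \dots)$

Algorithm 2 HiPPO
  Input: skills $\pi_{\theta_l}(a \mid s, z)$, manager $\pi_{\theta_h}(z \mid s)$, horizon $H$, learning rate $\alpha$
  while not done do
     for actor = 1, 2, …, N do
        Obtain trajectory with Collect Rollout
        Estimate advantages $\hat{A}(a_{t'}, s_{t'}, z_t)$ and $\hat{A}(z_t, s_t)$
     end for
     Update $\theta$ with a gradient step of size $\alpha$ on the HiPPO surrogate objective
  end while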
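The rollout procedure can be sketched in a few lines of Python. This is an illustration only: the one-dimensional toy environment, the random manager, the deterministic skills, and the uniform distribution over the time-commitment are all assumptions made so that the example is self-contained.

```python
import random


def collect_rollout(env_reset, env_step, manager, skill, p_min, p_max, horizon, rng):
    """Collect one trajectory with a randomly sampled time-commitment."""
    s = env_reset()
    t = 0
    trajectory = []  # tuples (state, latent, action, reward)
    while t < horizon:
        p = rng.randint(p_min, p_max)        # assumed uniform, bounds inclusive
        z = manager(s, rng)                  # latent is held fixed for p steps
        for _ in range(min(p, horizon - t)):
            a = skill(s, z, rng)
            s_next, r = env_step(s, a)
            trajectory.append((s, z, a, r))
            s = s_next
        t += p
    return trajectory


# Hypothetical toy problem: walk on the integers, reward at position 3.
def env_reset():
    return 0


def env_step(s, a):
    s_next = s + a
    return s_next, (1.0 if s_next == 3 else 0.0)


def manager(s, rng):
    return rng.randrange(2)                  # pick one of two skills at random


def skill(s, z, rng):
    return 1 if z == 1 else -1               # skill 1 moves right, skill 0 left


traj = collect_rollout(env_reset, env_step, manager, skill,
                       p_min=2, p_max=5, horizon=20, rng=random.Random(0))
print(len(traj))  # 20 environment steps
```

Resampling the commitment length before every manager decision means each skill is exercised for varying durations, so no skill can rely on being interrupted at a fixed time.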

5.5 Information Bottleneck through Masking

If we apply the above HiPPO algorithm in the general case, there is little incentive to either learn or maintain a diverse set of skills. We claim this can be addressed via two simple additions:

  • Let the latent code $z$ only take a finite number of values

  • Provide a masked observation to the skills

The masking function restricts the information about the task, such that a single skill cannot perform the full task. Skill collapse is a common problem in hierarchical methods, requiring the use of regularizers Bacon et al. [2017], Harb et al. [2017], Vezhnevets et al. [2016]. There are some natural choices of masking function discussed in the robotics literature, like the agent-space and problem-space split Konidaris and Barto [2007], Florensa et al. [2017b], that hide all task-related information and only allow the sub-policies to see proprioceptive information. With this setup, all the missing information needed to perform the task must come from the sequence of latent codes passed to the skills. We can interpret this as a lossy compression, whereby the manager encodes the relevant problem information into $\log n$ bits sufficient for the next $p$ timesteps. The full algorithm is detailed in Algorithm 2.
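A sketch of the agent-space and problem-space split, under the assumption that the first entries of the observation vector are proprioceptive and the rest are task-related. The index layout, the function names, and the one-hot encoding of the latent are hypothetical choices for illustration.

```python
import math


def mask_observation(obs, agent_dims):
    """Keep only the first `agent_dims` entries (assumed proprioceptive)."""
    return obs[:agent_dims]


def skill_input(obs, z, n_skills, agent_dims):
    """Masked observation concatenated with a one-hot latent code."""
    one_hot = [1.0 if k == z else 0.0 for k in range(n_skills)]
    return mask_observation(obs, agent_dims) + one_hot


# Hypothetical observation: 3 joint readings followed by 2 task sensor readings.
obs = [0.1, 0.2, 0.3, 9.0, 8.0]
x = skill_input(obs, z=1, n_skills=3, agent_dims=3)
print(x)  # [0.1, 0.2, 0.3, 0.0, 1.0, 0.0]

# Information carried by one latent choice among n skills, in bits.
bits_per_code = math.log2(3)
print(bits_per_code)
```

The skill never sees the task sensor readings, so anything it needs to know about the task has to arrive through the discrete latent.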

6 Experiments

Figure 2: Snake and Ant are the two agents that we evaluate in the Gather environments.

We design the experiments to answer the following questions: 1) How does HiPPO compare against a flat policy when learning from scratch? 2) Does it lead to more robust policies? 3) How well does it adapt already learned skills? and 4) Does our skill diversity assumption hold in practice?

6.1 Tasks

To answer the posed questions, we evaluate our new algorithms on a variety of robotic navigation tasks. Each task is a different robot trying to solve the Gather environment Duan et al. [2016], depicted in Figure 2, in which the agent must collect apples (green balls, +1 reward) while avoiding bombs (red balls, -1 reward). This is a challenging hierarchical task with sparse rewards that requires agents to simultaneously learn perception, locomotion, and higher-level planning capabilities. We use 2 different types of robots within this environment. Snake is a 5-link robot with a 17-dimensional observation space and 4-dimensional action space; and Ant a quadrupedal robot with a 27-dimensional observation space and 8-dimensional action space. Both can move and rotate in all directions, and Ant faces the added challenge of avoiding falling over irrecoverably.

6.2 Learning from Scratch

In this section, we study the benefit of using the HiPPO algorithm instead of standard PPO on a flat policy Schulman et al. [2017]. The results, shown in Figure 3, demonstrate that training from scratch with HiPPO leads to faster learning and better performance than flat PPO. Furthermore, the benefit of HiPPO does not just come from having temporally correlated exploration, as PPO with action repeat converges at a performance level well below our method. Finally, Figure 4 shows the effectiveness of using the presented baseline.

Figure 3: Comparison of Flat PPO, HiPPO, and HiPPO with randomized period learning from scratch on different environments.
Figure 4: Effect of using a Skill baseline as defined in Section 5.2

6.3 Robustness to Dynamics Perturbations

We try several different modifications to the base Snake Gather and Ant Gather environments. One at a time, we change the body mass, dampening of the joints, body inertia, and friction characteristics of both robots. The results, presented in Table 1, show that HiPPO with randomized period not only learns faster initially on the original task, but is also better able to handle these dynamics changes. In terms of the percent change in policy performance between the training environment and test environment, it outperforms HiPPO with fixed period on 6 out of 8 related tasks without taking any gradient steps. Our hypothesis is that the randomized period teaches the policy to adapt to a wide variety of scenarios, while its information bottleneck keeps its representations for planning and locomotion separate, so changes in dynamics cannot affect both simultaneously.

| Gather | Algorithm | Initial | Mass | Dampening | Inertia | Friction |
|---|---|---|---|---|---|---|
| Snake | Flat PPO | 2.72 | 3.16 (+16%) | 2.75 (+1%) | 2.11 (-22%) | 2.75 (+1%) |
| Snake | HiPPO, fixed period | 4.38 | 3.28 (-25%) | 3.27 (-25%) | 3.03 (-31%) | 3.27 (-25%) |
| Snake | HiPPO, random period | 5.11 | 4.09 (-20%) | 4.03 (-21%) | 3.21 (-37%) | 4.03 (-21%) |
| Ant | Flat PPO | 2.25 | 2.53 (+12%) | 2.13 (-5%) | 2.36 (+5%) | 1.96 (-13%) |
| Ant | HiPPO, fixed period | 3.84 | 3.31 (-14%) | 3.37 (-12%) | 2.88 (-25%) | 3.07 (-20%) |
| Ant | HiPPO, random period | 3.22 | 3.37 (+5%) | 2.57 (-20%) | 3.36 (+4%) | 2.84 (-12%) |

Table 1: Zero-shot transfer performance of flat PPO, HiPPO, and HiPPO with randomized period. The performance in the initial environment is shown, as well as the average performance over 25 rollouts in each new modified environment. Percentages give the change relative to the initial environment.
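The percent changes in Table 1, and the count of perturbed tasks on which the randomized period degrades less than the fixed period, can be recomputed directly from the reported averages. The sketch below uses only the numbers in the table; the dictionary layout and function names are illustrative.

```python
# (initial, [mass, dampening, inertia, friction]) copied from Table 1
table = {
    "Snake": {"fixed": (4.38, [3.28, 3.27, 3.03, 3.27]),
              "random": (5.11, [4.09, 4.03, 3.21, 4.03])},
    "Ant":   {"fixed": (3.84, [3.31, 3.37, 2.88, 3.07]),
              "random": (3.22, [3.37, 2.57, 3.36, 2.84])},
}


def percent_change(initial, new):
    """Relative change from the training environment, in percent."""
    return 100.0 * (new - initial) / initial


def count_random_wins(table):
    """Number of perturbed tasks where the random period degrades less."""
    wins = 0
    for rows in table.values():
        init_f, vals_f = rows["fixed"]
        init_r, vals_r = rows["random"]
        for vf, vr in zip(vals_f, vals_r):
            if percent_change(init_r, vr) > percent_change(init_f, vf):
                wins += 1
    return wins


print(round(percent_change(2.72, 3.16)))  # 16, Snake flat PPO under mass change
print(count_random_wins(table))           # 6 of the 8 perturbed tasks
```

The two exceptions are Snake under the inertia change and Ant under the dampening change.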

6.4 Adaptation of Pre-Trained Skills

For this task, we take 6 pre-trained subpolicies encoded by a Stochastic Neural Network Tang and Salakhutdinov [2013] that were trained in a diversity-promoting environment [Florensa et al., 2017b]. We fine-tune them with HiPPO on the Gather environment, but with an extra penalty on the velocity of the Center of Mass. This can be understood as a preference for cautious behavior. This requires adjustment of the sub-policies, which were trained with a proxy reward encouraging them to move as far as possible (and hence quickly). Fig. 5 shows the difference between fixing the sub-policies and only training a manager with PPO vs using HiPPO to simultaneously train a manager and fine-tune the skills. The two initially learn at the same rate, but HiPPO’s ability to adjust to the new dynamics allows it to reach a higher final performance.

Figure 5: Benefit of adapting some given skills when the preferences of the environment are different from those of the environment where the skills were originally trained.

6.5 Skill Diversity Assumption

In Lemma 1, we assumed that the sub-policies are diverse. This allowed us to derive a more efficient and numerically stable gradient. In this section, we empirically test the validity of our assumption, as well as the quality of our approximation. For this we run the HiPPO algorithm on Snake Gather and Ant Gather, both from scratch and on some pretrained skills as described in the previous section. In Table 2, we report the average maximum probability under other sub-policies, corresponding to $\epsilon$ from the assumption. We observe that in all settings this is on the order of magnitude of 0.1. Therefore, under the time-commitment $p$ that we use in our experiments, the term we neglect has a factor $\epsilon^{p-1}$. It is not surprising then that the average cosine similarity between the full gradient and the approximated one is almost 1, as also reported in Table 2. We only ran two random seeds of these experiments, as the results were consistent across seeds and the experiments are computationally expensive to run.
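To give a sense of scale, the sketch below evaluates the factor $\epsilon^{p-1}$ for $\epsilon = 0.1$ and a time-commitment of 10 steps, and shows how a cosine similarity between two gradient vectors is computed. The value of $p$ and the example vectors are assumptions for illustration only, and the $\epsilon^{p-1}$ scaling is taken from the product-rule bound sketched in the Appendix.

```python
import math


def neglected_factor(eps, p):
    """Scale of the dropped terms, assuming they shrink like eps**(p - 1)."""
    return eps ** (p - 1)


def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(neglected_factor(0.1, 10))  # about 1e-9 for the assumed p = 10

# Hypothetical full and approximate gradients that differ by a tiny perturbation.
full_grad = [0.5, -1.0, 2.0]
approx_grad = [0.5 + 1e-9, -1.0, 2.0 - 1e-9]
similarity = cosine_similarity(full_grad, approx_grad)
print(similarity)
```

A perturbation of this size leaves the gradient direction essentially unchanged, which is consistent with a cosine similarity close to 1.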

Gather Algorithm Cosine Similarity
Snake Adapt given skills
Ant Adapt given skills
Table 2: Empirical evaluation of Lemma 1. On the right column we evaluate the quality of our assumption by computing what is the average largest probability of a certain action under other skills. On the left column we report cosine similarity between our approximate gradient and the gradient computed using Eq. 2 without approximation.

7 Conclusions and Future Work

In this paper, we examined how to effectively adapt hierarchical policies. We began by deriving a hierarchical policy gradient and approximation of it. We then proposed a new method, HiPPO, that can stably train multiple layers of a hierarchy. The adaptation experiments suggested that we can optimize pretrained skills for downstream environments, and learn emergent skills without any unsupervised pre-training. We also explored hierarchy from an information bottleneck point of view, demonstrating that HiPPO with randomized period can learn from scratch on sparse-reward and long time horizon tasks, while outperforming non-hierarchical methods on zero-shot transfer.

There are many enticing avenues of future work. For instance, replacing the manually designed bottleneck with a variational autoencoder with an information bottleneck could further improve HiPPO’s performance and extend the gains seen here to other tasks. Also, as HiPPO provides a policy architecture and gradient expression, we could explore using meta-learning on top of it in order to learn better skills that are more useful on a distribution of different tasks.


Appendix A Hyperparameters and Architectures

For all experiments, both PPO and HiPPO used learning rate , clipping parameter , 10 gradient updates per iteration, a batch size of 100,000, and discount . HiPPO used sub-policies. Ant Gather has a horizon of 5000, while Snake Gather has a horizon of 8000 due to its larger size. All runs used three random seeds. HiPPO uses a manager network with 2 hidden layers of 32 units, and a skill network with 2 hidden layers of 64 units. In order to have roughly the same number of parameters for each algorithm, flat PPO uses a network with 2 hidden layers with 256 and 64 units respectively. For HiPPO with randomized period, we resample every time the manager network outputs a latent, and provide the number of timesteps until the next latent selection as an input into both the manager and skill networks. The single baselines and skill-dependent baselines used a MLP with 2 hidden layers of 32 units to fit the value function. The skill-dependent baseline receives, in addition to the full observation, the active latent code and the time remaining until the next skill sampling.

Appendix B Proofs

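Lemma 1. If the skills are sufficiently differentiated, then the latent variable can be treated as part of the observation to compute the gradient of the trajectory probability. Concretely, if $\pi_{\theta_h}(z \mid s)$ and $\pi_{\theta_l}(a \mid s, z)$ are Lipschitz in their parameters, and $0 < \pi_{\theta_l}(a_t \mid s_t, z_j) < \epsilon$ for all $z_j \neq z_{kp}$, then

$$\nabla_\theta \log P(\tau) = \sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp}) + \sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp}) + O(nH\epsilon^{p-1})$$
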
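From the point of view of the MDP, a trajectory is a sequence $\tau = (s_0, a_0, s_1, a_1, \dots, s_H, a_H)$. Let's assume we use the hierarchical policy introduced above, with a higher-level policy modeled as a parameterized discrete distribution with $n$ possible outcomes, $\pi_{\theta_h}(z \mid s) = \text{Categorical}_{\theta_h}(n)$. We can expand $P(\tau)$ into the product of policy and environment dynamics terms, with $z_j$ denoting the $j$th possible value out of the $n$ choices,

$$P(\tau) = \Bigg( \prod_{k=0}^{H/p} \Big[ \sum_{j=1}^{n} \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) \Big] \Bigg) \Bigg[ P(s_0) \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, a_t) \Bigg]$$

Taking the gradient of $\log P(\tau)$ with respect to the policy parameters $\theta = [\theta_h, \theta_l]$, the dynamics terms disappear, leaving:

$$\nabla_\theta \log P(\tau) = \sum_{k=0}^{H/p} \nabla_\theta \log \Big( \sum_{j=1}^{n} \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) \Big)$$

$$= \sum_{k=0}^{H/p} \frac{1}{\sum_{j=1}^{n} \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j)} \sum_{j=1}^{n} \nabla_\theta \Big( \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) \Big)$$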

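The sum over possible values of $z$ prevents the logarithm from splitting the product over the $p$-step sub-trajectories. This term is problematic, as this product quickly approaches $0$ as $p$ increases, and suffers from considerable numerical instabilities. Instead, we want to approximate this sum of products by a single one of the terms, which can then be decomposed into a sum of logs. For this we study each of the terms in the sum: the gradient of a sub-trajectory probability under a specific latent $z_j$. Now we can use the assumption that the skills are easy to distinguish, $0 < \pi_{\theta_l}(a_t \mid s_t, z_j) < \epsilon$ for all $z_j \neq z_{kp}$. Therefore, the probability of the sub-trajectory under a latent different from the one that was originally sampled, $z_{kp}$, is upper bounded by $\epsilon^p$. Taking the gradient, applying the product rule, and using the Lipschitz continuity of the policies, we obtain that for all $z_j \neq z_{kp}$,

$$\nabla_\theta \Big( \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) \Big) = \nabla_\theta \pi_{\theta_h}(z_j \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) + \sum_{t=kp}^{(k+1)p-1} \pi_{\theta_h}(z_j \mid s_{kp}) \big( \nabla_\theta \pi_{\theta_l}(a_t \mid s_t, z_j) \big) \prod_{t' \neq t} \pi_{\theta_l}(a_{t'} \mid s_{t'}, z_j) = O(p\,\epsilon^{p-1})$$

Thus, we can across the board replace the summation over latents by the single term corresponding to the latent that was sampled at that time.

$$\nabla_\theta \log P(\tau) = \sum_{k=0}^{H/p} \nabla_\theta \log \Big( \pi_{\theta_h}(z_{kp} \mid s_{kp}) \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_{kp}) \Big) + O(nH\epsilon^{p-1})$$

$$= \sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp}) + \sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp}) + O(nH\epsilon^{p-1})$$

Interestingly, up to the error term, this is exactly $\nabla_\theta \log P(s_0, z_0, a_0, s_1, a_1, \dots)$. In other words, it's the gradient of the probability of that trajectory, where the trajectory now includes the variables $z$ as if they were observed.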

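Lemma 2. For any functions $b_h: \mathcal{S} \to \mathbb{R}$ and $b_l: \mathcal{S} \times \mathcal{Z} \to \mathbb{R}$ we have:

$$\mathbb{E}_\tau\Big[\sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, b_h(s_{kp})\Big] = 0 \quad \text{and} \quad \mathbb{E}_\tau\Big[\sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, b_l(s_t, z_{kp})\Big] = 0$$

We can use the law of iterated expectations as well as the fact that the interior expression only depends on $s_{kp}$ and $z_{kp}$:

$$\mathbb{E}_\tau\Big[\sum_{k=0}^{H/p} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, b_h(s_{kp})\Big] = \sum_{k=0}^{H/p} \mathbb{E}_{s_{kp}, z_{kp}}\big[\nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, b_h(s_{kp})\big]$$

Then, we can write out the definition of the expectation and undo the gradient-log trick to prove that the baseline is unbiased.

$$= \sum_{k=0}^{H/p} \int P(s_{kp})\, b_h(s_{kp}) \int \pi_{\theta_h}(z_{kp} \mid s_{kp})\, \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, dz_{kp}\, ds_{kp} = \sum_{k=0}^{H/p} \int P(s_{kp})\, b_h(s_{kp})\, \nabla_\theta \int \pi_{\theta_h}(z_{kp} \mid s_{kp})\, dz_{kp}\, ds_{kp} = \sum_{k=0}^{H/p} \int P(s_{kp})\, b_h(s_{kp})\, \nabla_\theta 1\, ds_{kp} = 0$$

Subtracting a state- and subpolicy-dependent baseline from the second term is also unbiased, i.e.

$$\mathbb{E}_\tau\Big[\sum_{t=0}^{H} \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, b_l(s_t, z_{kp})\Big] = 0$$

We'll follow the same strategy to prove the second equality: apply the same law of iterated expectations trick, express the expectation as an integral, and undo the gradient-log trick.

$$\sum_{t=0}^{H} \mathbb{E}_{s_t, z_{kp}}\Big[ b_l(s_t, z_{kp}) \int \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, \nabla_\theta \log \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, da_t \Big] = \sum_{t=0}^{H} \mathbb{E}_{s_t, z_{kp}}\Big[ b_l(s_t, z_{kp})\, \nabla_\theta \int \pi_{\theta_l}(a_t \mid s_t, z_{kp})\, da_t \Big] = 0$$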