Supervised Policy Update

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.




1 Introduction

The policy gradient problem in deep reinforcement learning (DRL) can be defined as seeking a parameterized policy with high expected reward. An issue with policy gradient methods is poor sample efficiency (Kakade, 2003; Schulman et al., 2015a; Wang et al., 2016b; Wu et al., 2017; Schulman et al., 2017). In algorithms such as REINFORCE (Williams, 1992), new samples are needed for every gradient step. When generating samples is expensive (such as robotic environments), sample efficiency is of central concern. The sample efficiency of an algorithm is defined to be the number of calls to the environment required to attain a specified performance level (Kakade, 2003).

Thus, given the current policy and a fixed number of trajectories (samples) generated, the goal of the sample efficiency problem is to construct a new policy with the highest performance improvement possible. To do so, it is desirable to limit the search to policies that are close to the original policy (Kakade, 2002; Schulman et al., 2015a; Wu et al., 2017; Achiam et al., 2017; Schulman et al., 2017; Tangkaratt et al., 2018). Intuitively, if the candidate new policy $\pi_\theta$ is far from the original policy $\pi_{\theta_k}$, it may not perform better than the original policy because too much emphasis is being placed on the relatively small batch of new data generated by $\pi_{\theta_k}$, and not enough emphasis is being placed on the relatively large amount of data and effort previously used to construct $\pi_{\theta_k}$.

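This guideline of limiting the search to nearby policies seems reasonable in principle, but it requires a distance $\eta(\pi_\theta, \pi_{\theta_k})$ between the current policy $\pi_{\theta_k}$ and the candidate new policy $\pi_\theta$. We then attempt to solve the constrained optimization problem:

$$\underset{\theta}{\text{maximize}} \quad \hat{J}(\pi_\theta \mid \pi_{\theta_k}, \text{new data}) \tag{1}$$

$$\text{subject to} \quad \eta(\pi_\theta, \pi_{\theta_k}) \le \delta \tag{2}$$

where $\hat{J}(\pi_\theta \mid \pi_{\theta_k}, \text{new data})$ is an estimate of $J(\pi_\theta)$, the performance of policy $\pi_\theta$, based on the previous policy $\pi_{\theta_k}$ and the batch of fresh data generated by $\pi_{\theta_k}$. The objective (1) attempts to maximize the performance of the updated policy, and the constraint (2) ensures that the updated policy is not too far from the policy that was used to generate the data. Several recent papers (Kakade, 2002; Schulman et al., 2015a, 2017; Tangkaratt et al., 2018) belong to the framework (1)-(2).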

Our work also strikes the right balance between performance and simplicity. The implementation is only slightly more involved than PPO (Schulman et al., 2017). Simplicity in RL algorithms has its own merits. This is especially useful when RL algorithms are used to solve problems outside of traditional RL testbeds, which is becoming a trend (Zoph & Le, 2016; Mingxing Tan, 2018).

We propose a new methodology, called Supervised Policy Update (SPU), for this sample efficiency problem. The methodology is general in that it applies to both discrete and continuous action spaces, and can address a wide variety of constraint types for (2). Starting with data generated by the current policy, SPU optimizes over a proximal policy space to find an optimal non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. We develop a general methodology for finding an optimal policy in the non-parameterized policy space, and then illustrate the methodology for three different definitions of proximity. We also show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. While SPU is substantially simpler than NPG/TRPO in terms of mathematics and implementation, our extensive experiments show that SPU is more sample efficient than TRPO in Mujoco simulated robotic tasks and PPO in Atari video game tasks.

Off-policy RL algorithms generally achieve better sample efficiency than on-policy algorithms (Haarnoja et al., 2018). However, the performance of an on-policy algorithm can usually be substantially improved by incorporating off-policy training (Mnih et al., 2015; Wang et al., 2016a). Our paper aims to stimulate interest in splitting the search for the optimal policy into a two-step process: finding the optimal non-parameterized policy, and then parameterizing this optimal policy. We also wanted to understand the on-policy case thoroughly before adding off-policy training. We therefore compare with algorithms operating under the same algorithmic constraints, one of which is being on-policy. We leave the extension to off-policy training to future work. We do not claim state-of-the-art results.

2 Preliminaries

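We consider a Markov Decision Process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, and reward function $r(s, a)$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Let $\pi = \{\pi(a|s) : s \in \mathcal{S}, a \in \mathcal{A}\}$ denote a policy, let $\Pi$ be the set of all policies, and let the expected discounted reward be:

$$J(\pi) \triangleq \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \tag{3}$$

where $\gamma \in (0, 1)$ is a discount factor and $\tau = (s_0, a_0, s_1, \dots)$ is a sample trajectory. Let $A^{\pi}(s, a)$ be the advantage function for policy $\pi$ (Levine, 2017). Deep reinforcement learning considers a set of parameterized policies $\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\} \subset \Pi$, where each policy is parameterized by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in $\Pi_\Theta$ as well as over the non-parameterized policies in $\Pi$. For concreteness, we assume that the state and action spaces are finite. However, our methodology also applies to continuous state and action spaces, as shown in the Appendix.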

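One popular approach to maximizing $J(\pi_\theta)$ over $\Pi_\Theta$ is to apply stochastic gradient ascent. The gradient of $J(\pi_\theta)$ evaluated at a specific $\theta = \theta_k$ can be shown to be (Williams, 1992):

$$\nabla_\theta J(\pi_{\theta_k}) = \mathbb{E}_{\tau \sim \pi_{\theta_k}}\left[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_{\theta_k}(a_t|s_t)\, A^{\pi_{\theta_k}}(s_t, a_t)\right] \tag{4}$$

We can approximate (4) by sampling $N$ trajectories of length $T$ from $\pi_{\theta_k}$:

$$\nabla_\theta J(\pi_{\theta_k}) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_{\theta_k}(a_{it}|s_{it})\, A^{\pi_{\theta_k}}(s_{it}, a_{it}) \triangleq g_k \tag{5}$$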
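As a minimal sketch of the sample estimate (5), the following assumes a tabular softmax policy whose logits are stored as `theta[state][action]`, and assumes the advantage estimates are already attached to each sample. The names `softmax` and `policy_gradient_estimate` are illustrative and do not come from the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(theta, trajectories):
    """Sample estimate g_k of the policy gradient, as in (5).

    theta[s][a] are softmax logits (an assumed tabular parameterization).
    trajectories is a list of N trajectories; each is a list of
    (state, action, advantage_estimate) tuples.
    """
    grad = [[0.0] * len(row) for row in theta]
    for trajectory in trajectories:
        for state, action, advantage in trajectory:
            probs = softmax(theta[state])
            for b, p in enumerate(probs):
                # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
                indicator = 1.0 if b == action else 0.0
                grad[state][b] += (indicator - p) * advantage
    n = len(trajectories)
    return [[g / n for g in row] for row in grad]

# One state, two actions, one trajectory with a single positive-advantage sample.
g_k = policy_gradient_estimate([[0.0, 0.0]], [[(0, 0, 1.0)]])
```

With a uniform initial policy and a positive advantage for action 0, the estimate pushes the logit of action 0 up and that of action 1 down by the same amount.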


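Additionally, define $d^{\pi}(s) \triangleq (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t P_{\pi}(s_t = s)$ for the future state probability distribution for policy $\pi$, and denote $\pi(\cdot|s)$ for the probability distribution over the action space when in state $s$ and using policy $\pi$. Further denote $D_{KL}(\pi \,\|\, \pi_{\theta_k})[s]$ for the KL divergence from $\pi(\cdot|s)$ to $\pi_{\theta_k}(\cdot|s)$, and denote the following as the "aggregated KL divergence":

$$\bar{D}_{KL}(\pi \,\|\, \pi_{\theta_k}) = \mathbb{E}_{s \sim d^{\pi_{\theta_k}}}\left[D_{KL}(\pi \,\|\, \pi_{\theta_k})[s]\right] \tag{6}$$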
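The aggregated KL divergence can be estimated from data by averaging the per-state KL divergence over states visited under the current policy. The sketch below assumes discrete action distributions stored as lists of probabilities indexed by state, and assumes the old policy assigns positive probability to every action; the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def aggregated_kl(pi_new, pi_old, visited_states):
    """Monte Carlo estimate of the aggregated KL divergence.

    visited_states are states sampled while following pi_old, so averaging the
    per-state KL over them approximates the expectation over the old policy's
    state distribution.
    """
    total = sum(kl_divergence(pi_new[s], pi_old[s]) for s in visited_states)
    return total / len(visited_states)

pi_old = {0: [0.5, 0.5], 1: [0.5, 0.5]}
pi_new = {0: [0.5, 0.5], 1: [1.0, 0.0]}
estimate = aggregated_kl(pi_new, pi_old, [0, 1])
```

In this toy case the per-state KL is 0 in state 0 and log 2 in state 1, so the estimate is half of log 2.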


2.1 Surrogate Objectives for the Sample Efficiency Problem

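For the sample efficiency problem, the objective $J(\pi_\theta)$ is typically approximated using samples generated from $\pi_{\theta_k}$ (Schulman et al., 2015a; Achiam et al., 2017; Schulman et al., 2017). Two different approaches are typically used to approximate $J(\pi_\theta) - J(\pi_{\theta_k})$. We can make a first-order approximation of $J(\pi_\theta)$ around $\theta_k$ (Kakade, 2002; Peters & Schaal, 2008a, b; Schulman et al., 2015a):

$$J(\pi_\theta) - J(\pi_{\theta_k}) \approx (\theta - \theta_k)^{T} \nabla_\theta J(\pi_{\theta_k}) \approx (\theta - \theta_k)^{T} g_k \tag{7}$$

where $g_k$ is the sample estimate (5). The second approach is to approximate the state distribution $d^{\pi}$ with $d^{\pi_{\theta_k}}$ (Achiam et al., 2017; Schulman et al., 2017; Achiam, 2017):

$$J(\pi) - J(\pi_{\theta_k}) \approx L_{\pi_{\theta_k}}(\pi) \triangleq \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\pi_{\theta_k}}}\, \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_{\theta_k}}(s, a)\right] \tag{8}$$

There is a well-known bound for the approximation (8) (Kakade & Langford, 2002; Achiam et al., 2017). Furthermore, the approximation $L_{\pi_{\theta_k}}(\pi_\theta)$ matches $J(\pi_\theta) - J(\pi_{\theta_k})$ to the first order with respect to the parameter $\theta$ (Achiam et al., 2017).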

3 Related Work

Natural gradient (Amari, 1998) was first introduced to policy gradient by Kakade (2002) and then developed further in (Peters & Schaal, 2008a, b; Achiam, 2017; Schulman et al., 2015a), referred to collectively here as NPG/TRPO. Algorithmically, NPG/TRPO finds the gradient update by solving the sample efficiency problem (1)-(2) with $\eta(\pi_\theta, \pi_{\theta_k}) = \bar{D}_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})$, i.e., it uses the aggregated KL divergence for the policy proximity constraint (2). NPG/TRPO addresses this problem in the parameter space $\theta \in \Theta$. First, it approximates $J(\pi_\theta)$ with the first-order approximation (7) and $\bar{D}_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})$ with a similar second-order approximation. Second, it uses samples from $\pi_{\theta_k}$ to form estimates of these two approximations. Third, using these estimates (which are functions of $\theta$), it solves for the optimal $\theta^*$. The optimal $\theta^*$ is a function of $g_k$ and of $h_k$, the sample average of the Hessian evaluated at $\theta_k$. TRPO also limits the magnitude of the update to ensure $\bar{D}_{KL}(\pi_\theta \,\|\, \pi_{\theta_k}) \le \delta$ (i.e., ensuring that the sampled estimate of the aggregated KL constraint is met without the second-order approximation).

SPU takes a very different approach by first (i) posing and solving the optimization problem in the non-parameterized policy space, and then (ii) solving a supervised regression problem to find a parameterized policy that is near the optimal non-parameterized policy. A recent paper, Guided Actor Critic (GAC), independently proposed a similar decomposition (Tangkaratt et al., 2018). However, GAC is much more restricted in that it considers only one specific constraint criterion (aggregated reverse-KL divergence) and applies only to continuous action spaces. Furthermore, GAC incurs significantly higher computational complexity: for example, at every update it minimizes the dual function to obtain the dual variables using SLSQP. MPO also independently proposes a similar decomposition (Abdolmaleki et al., 2018). MPO uses much more complex machinery, namely Expectation Maximization, to address the DRL problem. However, MPO has only demonstrated preliminary results on problems with discrete actions, whereas our approach naturally applies to problems with either discrete or continuous actions. In both GAC and MPO, working in the non-parameterized space is a by-product of applying the main ideas in those papers to DRL. Our paper demonstrates that the decomposition alone is a general and useful technique for solving constrained policy optimization.

Clipped-PPO (Schulman et al., 2017) takes a very different approach to TRPO. At each iteration, PPO makes many gradient steps while only using the data from $\pi_{\theta_k}$. Without the clipping, PPO is the approximation (8). The clipping is analogous to the constraint (2) in that it has the goal of keeping $\pi_\theta$ close to $\pi_{\theta_k}$. Indeed, the clipping keeps $\pi_\theta(a_t|s_t)$ from becoming either much larger than $(1 + \epsilon)\pi_{\theta_k}(a_t|s_t)$ or much smaller than $(1 - \epsilon)\pi_{\theta_k}(a_t|s_t)$. Thus, although the clipped PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit. We note that the PPO paper considers adding the KL penalty to the objective function, whose gradient is similar to ours. However, this form of gradient was demonstrated to be inferior to Clipped-PPO. To the best of our knowledge, our work is the first to demonstrate that this form of gradient can outperform Clipped-PPO.

Actor-Critic using Kronecker-Factored Trust Region (ACKTR) (Wu et al., 2017) proposed using Kronecker-factored approximate curvature (K-FAC) to update both the policy gradient and critic terms, giving a more computationally efficient method of calculating the natural gradients. ACER (Wang et al., 2016a) exploits past episodes, linearizes the KL divergence constraint, and maintains an average policy network to enforce the KL divergence constraint. In future work, it would be of interest to extend the SPU methodology to handle past episodes. In contrast to bounding the KL divergence on the action distribution as we have done in this work, Relative Entropy Policy Search considers bounding the joint distribution of state and action and was only demonstrated to work for small problems (Peters et al., 2010).

4 SPU Framework

The SPU methodology has two steps. In the first step, for a given constraint criterion $\eta(\pi, \pi_{\theta_k}) \le \delta$, we find the optimal solution to the non-parameterized problem:

$$\underset{\pi \in \Pi}{\text{maximize}} \quad L_{\pi_{\theta_k}}(\pi) \tag{9}$$

$$\text{subject to} \quad \eta(\pi, \pi_{\theta_k}) \le \delta \tag{10}$$

Note that $\pi$ is not restricted to the set of parameterized policies $\Pi_\Theta$. As commonly done, we approximate the objective function with (8). However, unlike PPO/TRPO, we are not approximating the constraint (2). We will show below that the optimal solution $\pi^*$ for the non-parameterized problem (9)-(10) can be determined nearly in closed form for many natural constraint criteria $\eta(\pi, \pi_{\theta_k}) \le \delta$.

In the second step, we attempt to find a policy $\pi_\theta$ in the parameterized space $\Pi_\Theta$ that is close to the target policy $\pi^*$. Concretely, to advance from $\theta_k$ to $\theta_{k+1}$, we perform the following steps:

  1. We first sample $N$ trajectories using policy $\pi_{\theta_k}$, giving sample data $(s_i, a_i, A_i)$, $i = 1, \dots, m$. Here $A_i$ is an estimate of the advantage value $A^{\pi_{\theta_k}}(s_i, a_i)$. (For simplicity, we index the samples with $i$ rather than with $(i, t)$ corresponding to the $t$-th sample in the $i$-th trajectory.)

  2. For each $s_i$, we define the target distribution $\pi^*(\cdot|s_i)$ to be the optimal solution to the constrained optimization problem (9)-(10) for a specific constraint $\eta$.

  3. We then fit the policy network $\pi_\theta$ to the target distributions $\pi^*(\cdot|s_i)$, $i = 1, \dots, m$. Specifically, to find $\theta_{k+1}$, we minimize the following supervised loss function:

    $$L(\theta) = \frac{1}{m}\sum_{i=1}^{m} D_{KL}(\pi_\theta \,\|\, \pi^*)[s_i] \tag{11}$$

    For this step, we initialize with the weights for $\pi_{\theta_k}$. We minimize the loss function $L(\theta)$ with stochastic gradient descent methods. The resulting $\theta$ becomes our $\theta_{k+1}$.
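The steps above can be sketched for a single state with a tabular softmax policy. This is a toy illustration under stated assumptions, not the paper's implementation: the data-collection step is replaced by directly supplied advantage values, and `exponential_tilt` is a hypothetical stand-in for the solver of (9)-(10) with a fixed multiplier. The fitting step uses the closed-form gradient of the KL divergence with respect to softmax logits.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def exponential_tilt(pi_k, advantages, lam):
    """Hypothetical step-2 solver: target proportional to pi_k * exp(A / lam)."""
    top = max(advantages)
    weights = [p * math.exp((a - top) / lam) for p, a in zip(pi_k, advantages)]
    total = sum(weights)
    return [w / total for w in weights]

def fit_to_target(logits, target, lr=0.5, steps=500):
    """Step 3: minimize D_KL(pi_theta || target) by gradient descent on the logits."""
    logits = list(logits)  # initialize from the current weights theta_k
    for _ in range(steps):
        pi = softmax(logits)
        d = kl(pi, target)
        # For softmax logits: dKL/dlogit_b = pi_b * (log(pi_b / target_b) - KL)
        grad = [p * (math.log(p / t) - d) for p, t in zip(pi, target)]
        logits = [x - lr * g for x, g in zip(logits, grad)]
    return logits

def spu_update(logits_k, advantages, solve_target):
    pi_k = softmax(logits_k)                 # step 1 stand-in: current policy and its advantages
    target = solve_target(pi_k, advantages)  # step 2: non-parameterized target policy
    return fit_to_target(logits_k, target)   # step 3: supervised regression onto the target

new_logits = spu_update([0.0, 0.0, 0.0], [1.0, 0.0, -1.0],
                        lambda p, a: exponential_tilt(p, a, lam=1.0))
pi_next = softmax(new_logits)
```

After the update, the fitted policy places the most probability on the action with the highest advantage and the least on the action with the lowest.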

5 SPU Applied to Specific Proximity Criteria

To illustrate the SPU methodology, for three different but natural types of proximity constraints, we solve the corresponding non-parameterized optimization problem and derive the resulting gradient for the SPU supervised learning problem. We also demonstrate that different constraints lead to very different but intuitive forms of the gradient update.

5.1 Forward Aggregate and Disaggregate KL Constraints

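We first consider constraint criteria of the form:

$$\underset{\pi \in \Pi}{\text{maximize}} \quad \sum_{s} d^{\pi_{\theta_k}}(s)\, \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_{\theta_k}}(s, a)\right] \tag{12}$$

$$\text{subject to} \quad \sum_{s} d^{\pi_{\theta_k}}(s)\, D_{KL}(\pi \,\|\, \pi_{\theta_k})[s] \le \delta \tag{13}$$

$$D_{KL}(\pi \,\|\, \pi_{\theta_k})[s] \le \epsilon \quad \text{for all } s \tag{14}$$

Note that this problem is equivalent to maximizing the surrogate objective $L_{\pi_{\theta_k}}(\pi)$ in (8) subject to the constraints (13) and (14). We refer to (13) as the "aggregated KL constraint" and to (14) as the "disaggregated KL constraint". These two constraints taken together restrict $\pi$ from deviating too much from $\pi_{\theta_k}$. We shall refer to (12)-(14) as the forward-KL non-parameterized optimization problem.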

Note that this problem without the disaggregated constraints is analogous to the TRPO problem. The TRPO paper actually prefers enforcing the disaggregated constraint to enforcing the aggregated constraint. However, for mathematical convenience, its authors worked with the aggregated constraint: "While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence" (Schulman et al., 2015a). The SPU framework allows us to solve the optimization problem with the disaggregated constraints exactly. Experimentally, we compared against TRPO in a controlled setting, for example using the same advantage estimation scheme. Since SPU clearly outperforms TRPO in that setting, we argue that SPU's two-step procedure has significant potential.

For each $\lambda > 0$, define: $\pi^{\lambda}(a|s) = \dfrac{\pi_{\theta_k}(a|s)}{Z_\lambda(s)}\, e^{A^{\pi_{\theta_k}}(s, a)/\lambda}$, where $Z_\lambda(s)$ is the normalization term. Note that $\pi^{\lambda}(a|s)$ is a function of $\lambda$. Further, for each $s$, let $\lambda_s$ be such that $D_{KL}(\pi^{\lambda_s} \,\|\, \pi_{\theta_k})[s] = \epsilon$. Also let $\Gamma^{\lambda} = \{s : D_{KL}(\pi^{\lambda} \,\|\, \pi_{\theta_k})[s] \le \epsilon\}$.

Theorem 1

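The optimal solution to the problem (12)-(14) is given by:

$$\tilde{\pi}^{\lambda}(a|s) = \begin{cases} \pi^{\lambda}(a|s) & s \in \Gamma^{\lambda} \\ \pi^{\lambda_s}(a|s) & s \notin \Gamma^{\lambda} \end{cases} \tag{15}$$

where $\lambda$ is chosen so that $\sum_{s} d^{\pi_{\theta_k}}(s)\, D_{KL}(\tilde{\pi}^{\lambda} \,\|\, \pi_{\theta_k})[s] = \delta$ (Proof in subsection A.1).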
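For a single state with a finite action set, the target of Theorem 1 can be computed numerically. The sketch below assumes the exponentially tilted form of the policy from the definitions above and relies on the fact that its KL divergence from the current policy shrinks as the multiplier grows, so the per-state multiplier can be found by bisection. The function names and the bisection scheme are illustrative choices, not taken from the paper's code.

```python
import math

def tilted_policy(pi_k, advantages, lam):
    """pi^lambda(.|s): proportional to pi_k(a|s) * exp(A(s, a) / lambda)."""
    top = max(advantages)  # subtract the max for numerical stability
    weights = [p * math.exp((a - top) / lam) for p, a in zip(pi_k, advantages)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def per_state_target(pi_k, advantages, lam, eps, iters=100):
    """Target distribution of Theorem 1 for one state s."""
    candidate = tilted_policy(pi_k, advantages, lam)
    if kl(candidate, pi_k) <= eps:
        return candidate  # s is in Gamma^lambda: the disaggregated constraint is slack
    # Otherwise search for lambda_s > lambda at which the per-state KL equals eps.
    lo, hi = lam, 2.0 * lam
    while kl(tilted_policy(pi_k, advantages, hi), pi_k) > eps:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilted_policy(pi_k, advantages, mid), pi_k) > eps:
            lo = mid
        else:
            hi = mid
    return tilted_policy(pi_k, advantages, hi)

pi_k = [0.25, 0.25, 0.25, 0.25]
adv = [1.0, 0.0, 0.0, -1.0]
clipped = per_state_target(pi_k, adv, lam=0.1, eps=0.05)   # constraint binds
free = per_state_target(pi_k, adv, lam=100.0, eps=0.05)    # constraint is slack
```

With a small multiplier the unconstrained tilt moves too far from the current policy, so the returned target sits on the boundary of the per-state KL ball; with a large multiplier the tilt is returned unchanged.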

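Equation (15) provides the structure of the optimal non-parameterized policy. As part of the SPU framework, we then seek a parameterized policy $\pi_\theta$ that is close to $\tilde{\pi}^{\lambda}$, that is, one that minimizes the loss function (11). For each sampled state $s_i$, a straightforward calculation shows (Appendix B):

$$\nabla_\theta D_{KL}(\pi_\theta \,\|\, \tilde{\pi}^{\lambda})[s_i] = \nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i] - \frac{1}{\tilde{\lambda}_{s_i}}\, \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot|s_i)}\left[\frac{\nabla_\theta \pi_\theta(a|s_i)}{\pi_{\theta_k}(a|s_i)}\, A^{\pi_{\theta_k}}(s_i, a)\right] \tag{16}$$

where $\tilde{\lambda}_{s_i} = \lambda$ for $s_i \in \Gamma^{\lambda}$ and $\tilde{\lambda}_{s_i} = \lambda_{s_i}$ for $s_i \notin \Gamma^{\lambda}$. We estimate the expectation in (16) with the sampled action $a_i$ and approximate $A^{\pi_{\theta_k}}(s_i, a_i)$ as $A_i$ (obtained from the critic network), giving:

$$\nabla_\theta D_{KL}(\pi_\theta \,\|\, \tilde{\pi}^{\lambda})[s_i] \approx \nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i] - \frac{1}{\tilde{\lambda}_{s_i}}\, \frac{\nabla_\theta \pi_\theta(a_i|s_i)}{\pi_{\theta_k}(a_i|s_i)}\, A_i \tag{17}$$

To simplify the algorithm, we slightly modify (17). We replace the hyper-parameter $\delta$ with the hyper-parameter $\lambda$ and tune $\lambda$ rather than $\delta$. Further, we set $\tilde{\lambda}_{s_i} = \lambda$ for all $s_i$ in (17) and introduce per-state acceptance to enforce the disaggregated constraints, giving the approximate gradient:

$$\nabla_\theta D_{KL}(\pi_\theta \,\|\, \tilde{\pi}^{\lambda})[s_i] \approx \left[\nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i] - \frac{1}{\lambda}\, \frac{\nabla_\theta \pi_\theta(a_i|s_i)}{\pi_{\theta_k}(a_i|s_i)}\, A_i\right] \mathbb{1}_{D_{KL}(\pi_\theta \| \pi_{\theta_k})[s_i] \le \epsilon} \tag{18}$$

We make the approximation that the disaggregated constraints are only enforced on the states in the sampled trajectories. We use (18) as our gradient for supervised training of the policy network. The equation (18) has an intuitive interpretation: the gradient represents a trade-off between the approximate performance of $\pi_\theta$ (as captured by $\frac{1}{\lambda}\frac{\nabla_\theta \pi_\theta(a_i|s_i)}{\pi_{\theta_k}(a_i|s_i)} A_i$) and how far $\pi_\theta$ diverges from $\pi_{\theta_k}$ (as captured by $\nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i]$). For the stopping criterion, we train until $\frac{1}{m}\sum_{i} D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i] \approx \delta$.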
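To make the structure of the gradient (18) concrete, the sketch below evaluates it for a tabular softmax policy, averaging over the sampled tuples as in the loss (11). The tabular parameterization, the closed-form softmax derivatives, and the name `spu_gradient` are assumptions made for illustration; the paper trains a neural policy network instead.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def spu_gradient(logits, pi_k, samples, lam, eps):
    """Average of the per-sample gradient (18) with respect to softmax logits.

    logits[s][a] parameterize pi_theta, pi_k[s][a] is the data-generating policy,
    and samples is a list of (state, action, advantage_estimate) tuples.
    """
    grad = [[0.0] * len(row) for row in logits]
    for s, a, adv in samples:
        pi = softmax(logits[s])
        d = kl(pi, pi_k[s])
        if d > eps:
            continue  # per-state acceptance: the indicator in (18) is zero
        for b, p in enumerate(pi):
            # Gradient of D_KL(pi_theta || pi_k)[s] with respect to logit b
            grad_kl = p * (math.log(p / pi_k[s][b]) - d)
            # Gradient of pi_theta(a|s) with respect to logit b
            grad_pi_a = pi[a] * ((1.0 if b == a else 0.0) - p)
            grad[s][b] += grad_kl - (grad_pi_a / pi_k[s][a]) * adv / lam
    m = len(samples)
    return [[g / m for g in row] for row in grad]

pi_k = [[0.5, 0.5]]
at_start = spu_gradient([[0.0, 0.0]], pi_k, [(0, 0, 1.0)], lam=1.0, eps=0.01)
rejected = spu_gradient([[3.0, 0.0]], pi_k, [(0, 0, 1.0)], lam=1.0, eps=0.01)
```

At the start of training the KL term vanishes, so a descent step along the negative of this gradient raises the probability of the positively rewarded action; once a state's KL exceeds the threshold, its samples stop contributing.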

5.2 Backward KL Constraint

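In a similar manner, we can derive the structure of the optimal policy when using the reverse KL-divergence as the constraint. For simplicity, we provide the result for when there are only disaggregated constraints. We seek to find the non-parameterized optimal policy by solving:

$$\underset{\pi \in \Pi}{\text{maximize}} \quad \sum_{s} d^{\pi_{\theta_k}}(s)\, \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_{\theta_k}}(s, a)\right] \tag{19}$$

$$\text{subject to} \quad D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \le \epsilon \quad \text{for all } s \tag{20}$$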

Theorem 2

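The optimal solution to the problem (19)-(20) is given by:

$$\pi^*(a|s) = \pi_{\theta_k}(a|s)\, \frac{\lambda(s)}{\lambda'(s) - A^{\pi_{\theta_k}}(s, a)} \tag{21}$$

where $\lambda(s) > 0$ and $\lambda'(s) > \max_a A^{\pi_{\theta_k}}(s, a)$ (Proof in subsection A.2).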
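The two per-state constants in Theorem 2 can be found numerically for a finite action set. In the sketch below, normalization fixes the numerator constant once the denominator constant is chosen, and the denominator constant is then located by bisection so that the reverse KL constraint is binding. The bisection over the gap above the largest advantage is an illustrative choice, and the function name is hypothetical; it assumes the advantages are not all equal.

```python
import math

def backward_kl_target(pi_k, advantages, eps, iters=200):
    """Numerically solve for the policy of Theorem 2 in a single state."""
    a_max = max(advantages)

    def candidate(lam_prime):
        weights = [p / (lam_prime - a) for p, a in zip(pi_k, advantages)]
        lam = 1.0 / sum(weights)  # normalization determines lambda(s)
        return [lam * w for w in weights]

    def reverse_kl(pi):
        # D_KL(pi_k || pi)[s]
        return sum(p * math.log(p / q) for p, q in zip(pi_k, pi) if p > 0.0)

    # Search over gap = lambda'(s) - max_a A(s, a) > 0. The reverse KL
    # decreases toward 0 as the gap grows.
    lo, hi = 1e-12, 1.0
    while reverse_kl(candidate(a_max + hi)) > eps:
        hi *= 2.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if reverse_kl(candidate(a_max + mid)) > eps:
            lo = mid
        else:
            hi = mid
    return candidate(a_max + hi)

pi_k = [0.25, 0.25, 0.25, 0.25]
adv = [1.0, 0.0, 0.0, -1.0]
target = backward_kl_target(pi_k, adv, eps=0.05)
```

The resulting target shifts probability toward high-advantage actions while keeping every action's probability strictly positive, which is characteristic of the reverse KL constraint.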

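Note that the structure of the optimal policy with the backward KL constraint is quite different from that with the forward KL constraint. A straightforward calculation shows (Appendix B):

$$\nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi^*)[s] = \nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s] - \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot|s)}\left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}\, \log\frac{\lambda(s)}{\lambda'(s) - A^{\pi_{\theta_k}}(s, a)}\right] \tag{22}$$

The equation (22) has an intuitive interpretation. It increases the probability of action $a$ if $A^{\pi_{\theta_k}}(s, a) > \lambda'(s) - \lambda(s)$ and decreases the probability of action $a$ if $A^{\pi_{\theta_k}}(s, a) < \lambda'(s) - \lambda(s)$. (22) also tries to keep $\pi_\theta$ close to $\pi_{\theta_k}$ by minimizing their KL divergence.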

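5.3 $L^\infty$ Constraint

In this section we show how a PPO-like objective can be formulated in the context of SPU. Recall from Section 3 that the clipping in PPO can be seen as an attempt at keeping $\pi_\theta(a_i|s_i)$ from becoming either much larger than $(1 + \epsilon)\pi_{\theta_k}(a_i|s_i)$ or much smaller than $(1 - \epsilon)\pi_{\theta_k}(a_i|s_i)$ for $i = 1, \dots, m$. In this subsection, we consider the constraint function

$$\eta(\pi, \pi_{\theta_k}) = \max_{i = 1, \dots, m} \frac{\left|\pi(a_i|s_i) - \pi_{\theta_k}(a_i|s_i)\right|}{\pi_{\theta_k}(a_i|s_i)} \tag{23}$$

which leads us to the following optimization problem:

$$\underset{\pi(a_1|s_1), \dots, \pi(a_m|s_m)}{\text{maximize}} \quad \sum_{i=1}^{m} A^{\pi_{\theta_k}}(s_i, a_i)\, \frac{\pi(a_i|s_i)}{\pi_{\theta_k}(a_i|s_i)} \tag{24}$$

$$\text{subject to} \quad \left|\frac{\pi(a_i|s_i) - \pi_{\theta_k}(a_i|s_i)}{\pi_{\theta_k}(a_i|s_i)}\right| \le \epsilon, \quad i = 1, \dots, m \tag{25}$$

$$\sum_{i=1}^{m} \left(\frac{\pi(a_i|s_i) - \pi_{\theta_k}(a_i|s_i)}{\pi_{\theta_k}(a_i|s_i)}\right)^2 \le \delta \tag{26}$$

Note that here we are using a variation of the SPU methodology described in Section 4, since we first create estimates of the expectations in the objective and constraints and then solve the optimization problem (rather than first solving the optimization problem and then taking samples, as done for Theorems 1 and 2). Note that we have also included an aggregated constraint (26) in addition to the PPO-like constraint (25), which further ensures that the updated policy is close to $\pi_{\theta_k}$.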

Theorem 3

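The optimal solution to the optimization problem (24)-(26) is given by:

$$\pi^*(a_i|s_i) = \begin{cases} \pi_{\theta_k}(a_i|s_i)\, \min\{1 + \lambda A_i,\; 1 + \epsilon\} & A_i \ge 0 \\ \pi_{\theta_k}(a_i|s_i)\, \max\{1 + \lambda A_i,\; 1 - \epsilon\} & A_i < 0 \end{cases} \tag{27}$$

for some $\lambda > 0$ where $A_i \triangleq A^{\pi_{\theta_k}}(s_i, a_i)$ (Proof in subsection A.3).

To simplify the algorithm, we treat $\lambda$ as a hyper-parameter rather than $\delta$. After solving for $\pi^*$, we seek a parameterized policy $\pi_\theta$ that is close to $\pi^*$ by minimizing their mean square error over sampled states and actions, i.e. by updating $\theta$ in the negative direction of $\nabla_\theta \sum_{i} \left(\pi_\theta(a_i|s_i) - \pi^*(a_i|s_i)\right)^2$. This loss is used for supervised training instead of the KL divergence because we take estimates before forming the optimization problem. Thus, the optimal values for the decision variables do not completely characterize a distribution. We refer to this approach as SPU with the $L^\infty$ constraint.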
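As a minimal sketch of this variant, the target probabilities of Theorem 3 and the squared-error regression loss can be computed per sample as follows. The inputs are assumed to be the old policy's probabilities of the sampled actions and the corresponding advantage estimates; the function names are illustrative, and the loss is averaged over samples, which only rescales the gradient direction described above.

```python
def linf_targets(old_probs, advantages, lam, eps):
    """Target probabilities for the sampled (s_i, a_i) pairs, following Theorem 3."""
    targets = []
    for p_old, adv in zip(old_probs, advantages):
        if adv >= 0.0:
            ratio = min(1.0 + lam * adv, 1.0 + eps)  # move up, capped at 1 + eps
        else:
            ratio = max(1.0 + lam * adv, 1.0 - eps)  # move down, floored at 1 - eps
        targets.append(p_old * ratio)
    return targets

def mse_loss(new_probs, targets):
    """Mean squared error between pi_theta(a_i|s_i) and the target probabilities."""
    return sum((p - t) ** 2 for p, t in zip(new_probs, targets)) / len(targets)

old_probs = [0.5, 0.2]
targets = linf_targets(old_probs, [10.0, -0.1], lam=0.5, eps=0.2)
loss_at_old_policy = mse_loss(old_probs, targets)
```

In this toy example the first sample has a large advantage, so its target ratio is capped at the upper clipping bound, while the second sample's small negative advantage leaves its ratio strictly inside the clipping range.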

Although we consider three classes of proximity constraint, there may be yet another class that leads to even better performance. The methodology allows researchers to explore other proximity constraints in the future.

6 Experimental Results

Extensive experimental results demonstrate SPU outperforms recent state-of-the-art methods for environments with continuous or discrete action spaces. We provide ablation studies to show the importance of the different algorithmic components, and a sensitivity analysis to show that SPU's performance is relatively insensitive to hyper-parameter choices. We use two definitions to conclude that A is more sample efficient than B: (i) A takes fewer environment interactions to achieve a pre-defined performance threshold (Kakade, 2003); (ii) the averaged final performance of A is higher than that of B given the same number of environment interactions (Schulman et al., 2017). Implementation details are provided in Appendix D.
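The two sample-efficiency criteria can be stated as simple computations on a learning curve. The sketch below assumes a curve is a list of (environment interactions, average return) pairs in increasing order of interactions; the function names and the toy curves are illustrative only and are not experimental data.

```python
def interactions_to_threshold(curve, threshold):
    """Criterion (i): interactions needed to first reach the performance threshold."""
    for interactions, avg_return in curve:
        if avg_return >= threshold:
            return interactions
    return None  # the threshold was never reached

def final_performance(curve, last_k=1):
    """Criterion (ii): average return over the last k points of the curve."""
    tail = [avg_return for _, avg_return in curve[-last_k:]]
    return sum(tail) / len(tail)

# Toy curves for two hypothetical algorithms A and B.
curve_a = [(1000, 10.0), (2000, 50.0), (3000, 90.0)]
curve_b = [(1000, 5.0), (2000, 20.0), (3000, 60.0)]
a_faster = interactions_to_threshold(curve_a, 50.0) < interactions_to_threshold(curve_b, 50.0)
a_higher = final_performance(curve_a) > final_performance(curve_b)
```

Under both criteria, algorithm A in this toy example would be judged more sample efficient than B.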

6.1 Results on Mujoco

The Mujoco (Todorov et al., 2012) simulated robotics environments provided by OpenAI gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. In terms of final performance averaged over all ten available Mujoco environments and ten different seeds in each, SPU with the $L^\infty$ constraint (Section 5.3) and SPU with forward KL constraints (Section 5.1) both outperform TRPO, the latter by 27% (Table 1). Since the forward-KL approach is our best performing approach, we focus subsequent analysis on it and hereafter refer to it as SPU. SPU also outperforms PPO. Figure 1 illustrates the performance of SPU versus TRPO and PPO.

Figure 1: SPU versus TRPO, PPO on 10 Mujoco environments in 1 million timesteps. The x-axis indicates timesteps. The y-axis indicates the average episode reward of the last 100 episodes.

To ensure that SPU is not only better than TRPO in terms of performance gain early during training, we further retrain both policies for 3 million timesteps. Here again, SPU outperforms TRPO. Figure 3 in the Appendix illustrates the performance for each environment. Code for the Mujoco experiments is at

6.2 Ablation Studies for Mujoco

The indicator variable in (18) enforces the disaggregated constraint. We refer to it as per-state acceptance. Removing this component is equivalent to removing the indicator variable. We refer to using the stopping criterion $\frac{1}{m}\sum_{i} D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i] \approx \delta$ to determine the number of training epochs as dynamic stopping. Without this component, the number of training epochs is a hyper-parameter. We also tried removing $\nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i]$ from the gradient update step in (18). Table 1 illustrates the contribution of the different components of SPU to the overall performance. The "No grad KL" row shows that the term $\nabla_\theta D_{KL}(\pi_\theta \,\|\, \pi_{\theta_k})[s_i]$ makes a crucially important contribution to SPU. Furthermore, per-state acceptance and dynamic stopping are both also important for obtaining high performance, with the former playing a more central role. When a component is removed, the hyper-parameters are retuned to ensure that the best possible performance is obtained with the alternative (simpler) algorithm.

| Approach | Percentage better than TRPO | Performance vs. original algorithm |
|---|---|---|
| Original algorithm | 27% | 0% |
| No grad KL | 4% | -85% |
| No dynamic stopping | 24% | -11% |
| No per-state acceptance | 9% | -67% |

Table 1: Ablation study for SPU

6.3 Sensitivity Analysis on Mujoco

To demonstrate the practicality of SPU, we show that its high performance is insensitive to hyper-parameter choice. One way to show this is as follows: for each SPU hyper-parameter, select a reasonably large interval, randomly sample the value of the hyper-parameter from this interval, and then compare SPU (using the randomly chosen hyper-parameter values) with TRPO. We sampled 100 SPU hyper-parameter vectors (each vector including $\delta$, $\epsilon$, and $\lambda$), and for each one determined the relative performance with respect to TRPO. We found that for all 100 random hyper-parameter value samples, SPU performed better than TRPO. The full CDF of the improvement over TRPO is given in Figure 4 in the Appendix. We can conclude that SPU's superior performance is largely insensitive to hyper-parameter values.

6.4 Results on Atari

Rajeswaran et al. (2017) and Mania et al. (2018) demonstrate that neural networks are not needed to obtain high performance in many Mujoco environments. To evaluate SPU more conclusively, we compare it against PPO on the Arcade Learning Environment (Bellemare et al., 2012) exposed through OpenAI gym (Brockman et al., 2016). Using the same network architecture and hyper-parameters, we learn to play 60 Atari games from raw pixels and rewards. This is highly challenging because of the diversity in the games and the high dimensionality of the observations.

Here, we compare SPU against PPO because PPO outperforms TRPO in Mujoco. Averaged over 60 Atari environments and 20 seeds, SPU is better than PPO in terms of averaged final performance. Figure 2 provides a high-level overview of the result. The dots in the shaded area represent environments where the two algorithms perform roughly the same. The dots to the right of the shaded area represent environments where SPU is more sample efficient than PPO. We can draw two conclusions: (i) in 36 environments, SPU and PPO perform roughly the same; SPU clearly outperforms PPO in 15 environments, while PPO clearly outperforms SPU in 9; (ii) in those 15+9 environments, the extent to which SPU outperforms PPO is much larger than the extent to which PPO outperforms SPU. Figure 5, Figure 6 and Figure 7 in the Appendix illustrate the performance of SPU vs PPO throughout training. SPU's high performance in both the Mujoco and Atari domains demonstrates its generality.

Figure 2: High-level overview of results on Atari

7 Acknowledgements

We would like to acknowledge the extremely helpful support of the NYU Shanghai High Performance Computing Administrator Zhiguo Qi. We are also grateful to OpenAI for open-sourcing their baselines code.


Appendix A Proofs for non-parameterized optimization problems

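A.1 Forward KL Aggregated and Disaggregated Constraints

We first show that (12)-(14) is a convex optimization problem. To this end, first note that the objective (12) is a linear function of the decision variables $\pi = \{\pi(a|s) : s \in \mathcal{S}, a \in \mathcal{A}\}$. The LHS of (14) can be rewritten as: $\sum_{a} \pi(a|s) \log \pi(a|s) - \sum_{a} \pi(a|s) \log \pi_{\theta_k}(a|s)$. The second term is a linear function of $\pi$. The first term is a convex function since the second derivative of each summand is always positive. The LHS of (14) is thus a convex function. By extension, the LHS of (13) is also a convex function since it is a nonnegative weighted sum of convex functions. The problem (12)-(14) is thus a convex optimization problem. According to Slater's constraint qualification, strong duality holds since $\pi_{\theta_k}$ is a feasible solution to (12)-(14) where the inequality holds strictly.

We can therefore solve (12)-(14) by solving the related Lagrangian problem. For a fixed $\lambda$ consider:

$$\underset{\pi \in \Pi}{\text{maximize}} \quad \sum_{s} d^{\pi_{\theta_k}}(s)\left\{\mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_{\theta_k}}(s, a)\right] - \lambda D_{KL}(\pi \,\|\, \pi_{\theta_k})[s]\right\} \tag{28}$$

$$\text{subject to} \quad D_{KL}(\pi \,\|\, \pi_{\theta_k})[s] \le \epsilon \quad \text{for all } s \tag{29}$$

The above problem decomposes into separate problems, one for each state $s$:

$$\underset{\pi(\cdot|s)}{\text{maximize}} \quad \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_{\theta_k}}(s, a)\right] - \lambda D_{KL}(\pi \,\|\, \pi_{\theta_k})[s] \tag{30}$$

$$\text{subject to} \quad D_{KL}(\pi \,\|\, \pi_{\theta_k})[s] \le \epsilon \tag{31}$$

Further consider the unconstrained problem (30) without the constraint (31):

$$\underset{\pi(\cdot|s)}{\text{maximize}} \quad \sum_{a} \pi(a|s)\left[A^{\pi_{\theta_k}}(s, a) - \lambda \log\frac{\pi(a|s)}{\pi_{\theta_k}(a|s)}\right] \tag{32}$$

$$\text{subject to} \quad \sum_{a} \pi(a|s) = 1 \tag{33}$$

$$\pi(a|s) \ge 0 \quad \text{for all } a \tag{34}$$

A simple Lagrange-multiplier argument shows that the optimal solution to (32)-(34) is given by:

$$\pi^{\lambda}(a|s) = \frac{\pi_{\theta_k}(a|s)}{Z_\lambda(s)}\, e^{A^{\pi_{\theta_k}}(s, a)/\lambda}$$

where $Z_\lambda(s)$ is defined so that $\pi^{\lambda}(\cdot|s)$ is a valid distribution. Now returning to the decomposed constrained problem (30)-(31), there are two cases to consider. The first case is when $D_{KL}(\pi^{\lambda} \,\|\, \pi_{\theta_k})[s] < \epsilon$. In this case, the optimal solution to (30)-(31) is $\pi^{\lambda}(a|s)$. The second case is when $D_{KL}(\pi^{\lambda} \,\|\, \pi_{\theta_k})[s] > \epsilon$. In this case the optimal solution is $\pi^{\lambda}(a|s)$ with $\lambda$ replaced with $\lambda_s$, where $\lambda_s$ is the solution to $D_{KL}(\pi^{\lambda_s} \,\|\, \pi_{\theta_k})[s] = \epsilon$. Thus, an optimal solution to (30)-(31) is given by:

$$\tilde{\pi}^{\lambda}(a|s) = \begin{cases} \pi^{\lambda}(a|s) & s \in \Gamma^{\lambda} \\ \pi^{\lambda_s}(a|s) & s \notin \Gamma^{\lambda} \end{cases}$$

where $\Gamma^{\lambda} = \{s : D_{KL}(\pi^{\lambda} \,\|\, \pi_{\theta_k})[s] \le \epsilon\}$.

To find the Lagrange multiplier $\lambda$, we can then do a line search to find the $\lambda$ that satisfies:

$$\sum_{s} d^{\pi_{\theta_k}}(s)\, D_{KL}(\tilde{\pi}^{\lambda} \,\|\, \pi_{\theta_k})[s] = \delta$$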
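The line search over the Lagrange multiplier can be sketched for a small finite problem. The sketch uses the observation that, by construction of the piecewise solution above, the per-state KL divergence of the optimal policy equals the smaller of the unconstrained tilt's KL divergence and the per-state threshold. State weights, the bisection scheme, and all function names are assumptions made for illustration; the search also assumes the aggregated constraint can actually bind, that is, the aggregated value at a very small multiplier exceeds the target.

```python
import math

def tilted_policy(pi_k, advantages, lam):
    top = max(advantages)
    weights = [p * math.exp((a - top) / lam) for p, a in zip(pi_k, advantages)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def aggregated_kl_of_target(pi_k, adv, state_weights, lam, eps):
    """Weighted sum over states of the per-state KL of the piecewise-optimal policy.

    For each state the KL is min(KL of the unconstrained tilt, eps), because the
    disaggregated constraint is either slack or exactly binding.
    """
    total = 0.0
    for w, p, a in zip(state_weights, pi_k, adv):
        total += w * min(kl(tilted_policy(p, a, lam), p), eps)
    return total

def search_lambda(pi_k, adv, state_weights, delta, eps, iters=200):
    """Bisection for the multiplier at which the aggregated KL equals delta."""
    lo, hi = 1e-6, 1.0
    while aggregated_kl_of_target(pi_k, adv, state_weights, hi, eps) > delta:
        hi *= 2.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if aggregated_kl_of_target(pi_k, adv, state_weights, mid, eps) > delta:
            lo = mid
        else:
            hi = mid
    return hi

uniform = [1.0 / 3.0] * 3
pi_k = [uniform, uniform]
adv = [[1.0, 0.0, -1.0], [0.5, 0.0, 0.0]]
weights = [0.5, 0.5]
lam_star = search_lambda(pi_k, adv, weights, delta=0.02, eps=0.5)
```

The bisection is valid because the aggregated KL of the piecewise-optimal policy is non-increasing in the multiplier.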


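A.2 Backward KL Constraint

The problem (19)-(20) decomposes into separate problems, one for each state $s$:

$$\underset{\pi(\cdot|s)}{\text{maximize}} \quad \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_{\theta_k}}(s, a)\right] \tag{37}$$

$$\text{subject to} \quad D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \le \epsilon \tag{38}$$

After some algebra, we see that the above optimization problem is equivalent to:

$$\underset{\pi(\cdot|s)}{\text{maximize}} \quad \sum_{a} A^{\pi_{\theta_k}}(s, a)\, \pi(a|s) \tag{39}$$

$$\text{subject to} \quad -\sum_{a} \pi_{\theta_k}(a|s) \log \pi(a|s) \le \epsilon' \tag{40}$$

$$\sum_{a} \pi(a|s) = 1 \tag{41}$$

$$\pi(a|s) \ge 0 \quad \text{for all } a \tag{42}$$

where $\epsilon' = \epsilon - \sum_{a} \pi_{\theta_k}(a|s) \log \pi_{\theta_k}(a|s)$, i.e. $\epsilon$ plus the entropy of $\pi_{\theta_k}(\cdot|s)$. (39)-(42) is a convex optimization problem with Slater's condition holding. Strong duality thus holds for the problem (39)-(42). Applying standard Lagrange multiplier arguments, it is easily seen that the solution to (39)-(42) is

$$\pi^*(a|s) = \pi_{\theta_k}(a|s)\, \frac{\lambda(s)}{\lambda'(s) - A^{\pi_{\theta_k}}(s, a)}$$

where $\lambda(s)$ and $\lambda'(s)$ are constants chosen such that the disaggregated KL constraint is binding and the sum of the probabilities equals 1. It is easily seen that $\lambda(s) > 0$ and $\lambda'(s) > \max_a A^{\pi_{\theta_k}}(s, a)$.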

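A.3 $L^\infty$ Constraint

The problem (24)-(26) is equivalent to:

$$\underset{p_1, \dots, p_m}{\text{maximize}} \quad \sum_{i=1}^{m} A_i\, \frac{p_i}{\pi_{\theta_k}(a_i|s_i)} \tag{43}$$

$$\text{subject to} \quad 1 - \epsilon \le \frac{p_i}{\pi_{\theta_k}(a_i|s_i)} \le 1 + \epsilon, \quad i = 1, \dots, m \tag{44}$$

$$\sum_{i=1}^{m} \left(\frac{p_i}{\pi_{\theta_k}(a_i|s_i)} - 1\right)^2 \le \delta \tag{45}$$

This problem is clearly convex. $p_i = \pi_{\theta_k}(a_i|s_i)$, $i = 1, \dots, m$, is a feasible solution where the inequality constraint holds strictly. Strong duality thus holds according to Slater's constraint qualification. To solve (43)-(45), we can therefore solve the related Lagrangian problem for fixed $\lambda$:

$$\underset{p_1, \dots, p_m}{\text{maximize}} \quad \sum_{i=1}^{m} \left[A_i\, \frac{p_i}{\pi_{\theta_k}(a_i|s_i)} - \lambda\left(\frac{p_i}{\pi_{\theta_k}(a_i|s_i)} - 1\right)^2\right] \tag{46}$$

$$\text{subject to} \quad 1 - \epsilon \le \frac{p_i}{\pi_{\theta_k}(a_i|s_i)} \le 1 + \epsilon, \quad i = 1, \dots, m \tag{47}$$

which is separable and decomposes into $m$ separate problems, one for each $i$:

$$\underset{p_i}{\text{maximize}} \quad A_i\, \frac{p_i}{\pi_{\theta_k}(a_i|s_i)} - \lambda\left(\frac{p_i}{\pi_{\theta_k}(a_i|s_i)} - 1\right)^2 \tag{48}$$

$$\text{subject to} \quad 1 - \epsilon \le \frac{p_i}{\pi_{\theta_k}(a_i|s_i)} \le 1 + \epsilon \tag{49}$$

The solution to the unconstrained problem (48) without the constraint (49) is:

$$p_i^* = \pi_{\theta_k}(a_i|s_i)\left(1 + \frac{A_i}{2\lambda}\right)$$

Now consider the constrained problem (48)-(49). If $A_i > 0$ and $\frac{A_i}{2\lambda} > \epsilon$, the optimal solution is $p_i^* = \pi_{\theta_k}(a_i|s_i)(1 + \epsilon)$. Similarly, if $A_i < 0$ and $\frac{A_i}{2\lambda} < -\epsilon$, the optimal solution is $p_i^* = \pi_{\theta_k}(a_i|s_i)(1 - \epsilon)$. Rearranging the terms, with $\frac{1}{2\lambda}$ playing the role of the multiplier in (27), gives Theorem 3. To obtain $\lambda$, we can perform a line search over $\lambda$ so that the constraint (45) is binding.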
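The final line search can be sketched as a bisection, using the clipped per-sample ratio that follows from the unconstrained maximizer of (48): each ratio deviates from 1 by the smaller of the absolute advantage divided by twice the multiplier and the clipping bound. The function names and the log-scale bisection are illustrative assumptions, and the search assumes the constraint can bind, i.e. the fully clipped deviation exceeds the target.

```python
import math

def squared_ratio_deviation(advantages, lam, eps):
    """LHS of (45) at the optimum of (48)-(49) for a given multiplier lam."""
    return sum(min(abs(a) / (2.0 * lam), eps) ** 2 for a in advantages)

def search_lambda_linf(advantages, delta, eps, iters=200):
    """Bisection for the multiplier that makes the constraint (45) binding."""
    lo, hi = 1e-9, 1.0
    while squared_ratio_deviation(advantages, hi, eps) > delta:
        hi *= 2.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if squared_ratio_deviation(advantages, mid, eps) > delta:
            lo = mid
        else:
            hi = mid
    return hi

advantages = [1.0, -2.0, 0.5]
lam_star = search_lambda_linf(advantages, delta=0.05, eps=0.2)
ratios = [1.0 + max(-0.2, min(0.2, a / (2.0 * lam_star))) for a in advantages]
```

Here `ratios` are the resulting target ratios of new to old probability for each sample; samples with large absolute advantage end up at the clipping bounds while the rest scale linearly with their advantage.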

Appendix B Derivation of the gradient of the loss function for SPU

Let CE stand for cross-entropy.

B.1 Forward-KL