1 Introduction
The policy gradient problem in deep reinforcement learning (DRL) can be defined as seeking a parameterized policy with high expected reward. An issue with policy gradient methods is poor sample efficiency (Kakade, 2003; Schulman et al., 2015a; Wang et al., 2016b; Wu et al., 2017; Schulman et al., 2017). In algorithms such as REINFORCE (Williams, 1992), new samples are needed for every gradient step. When generating samples is expensive (as in robotic environments), sample efficiency is of central concern. The sample efficiency of an algorithm is defined to be the number of calls to the environment required to attain a specified performance level (Kakade, 2003).
Thus, given the current policy $\pi_{\theta_k}$ and a fixed number of trajectories (samples) generated, the goal of the sample efficiency problem is to construct a new policy $\pi_\theta$ with the highest performance improvement possible. To do so, it is desirable to limit the search to policies that are close to the original policy $\pi_{\theta_k}$ (Kakade, 2002; Schulman et al., 2015a; Wu et al., 2017; Achiam et al., 2017; Schulman et al., 2017; Tangkaratt et al., 2018). Intuitively, if the candidate new policy $\pi_\theta$ is far from the original policy $\pi_{\theta_k}$, it may not perform better than the original policy because too much emphasis is being placed on the relatively small batch of new data generated by $\pi_{\theta_k}$, and not enough emphasis is being placed on the relatively large amount of data and effort previously used to construct $\pi_{\theta_k}$.
This guideline of limiting the search to nearby policies is reasonable in principle, but to make it concrete we need a distance $D(\pi_{\theta_k}, \pi_\theta)$ between the current policy $\pi_{\theta_k}$ and the candidate new policy $\pi_\theta$; we then attempt to solve the constrained optimization problem:
$\underset{\theta}{\text{maximize}} \;\; \hat{J}(\pi_{\theta_k}, \theta)$  (1)
subject to $D(\pi_{\theta_k}, \pi_\theta) \le \delta$  (2)
where $\hat{J}(\pi_{\theta_k}, \theta)$ is an estimate of $J(\pi_\theta)$, the performance of policy $\pi_\theta$, based on the previous policy $\pi_{\theta_k}$ and the batch of fresh data generated by $\pi_{\theta_k}$. The objective (1) attempts to maximize the performance of the updated policy, and the constraint (2) ensures that the updated policy is not too far from the policy $\pi_{\theta_k}$ that was used to generate the data. Several recent papers (Kakade, 2002; Schulman et al., 2015a, 2017; Tangkaratt et al., 2018) belong to the framework (1)-(2). Our work also strikes a balance between performance and simplicity: the implementation is only slightly more involved than that of PPO (Schulman et al., 2017). Simplicity in RL algorithms has its own merits. This is especially useful when RL algorithms are used to solve problems outside of traditional RL testbeds, which is becoming a trend (Zoph & Le, 2016; Tan et al., 2018).
We propose a new methodology, called Supervised Policy Update (SPU), for this sample efficiency problem. The methodology is general in that it applies to both discrete and continuous action spaces, and can address a wide variety of constraint types for (2). Starting with data generated by the current policy, SPU optimizes over a proximal policy space to find an optimal nonparameterized policy. It then solves a supervised regression problem to convert the nonparameterized policy to a parameterized policy, from which it draws new samples. We develop a general methodology for finding an optimal policy in the nonparameterized policy space, and then illustrate the methodology for three different definitions of proximity. We also show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. While SPU is substantially simpler than NPG/TRPO in terms of mathematics and implementation, our extensive experiments show that SPU is more sample efficient than TRPO in Mujoco simulated robotic tasks and PPO in Atari video game tasks.
Off-policy RL algorithms generally achieve better sample efficiency than on-policy algorithms (Haarnoja et al., 2018), and the performance of an on-policy algorithm can usually be substantially improved by incorporating off-policy training (Mnih et al., 2015; Wang et al., 2016a). Our paper aims to spark interest in separating the search for an optimal policy into a two-step process: finding the optimal non-parameterized policy, and then parameterizing this optimal policy. We also wanted to understand the on-policy case thoroughly before adding off-policy training. We thus compare with algorithms operating under the same algorithmic constraints, one of which is being on-policy. We leave the extension to off-policy training to future work. We do not claim state-of-the-art results.
2 Preliminaries
We consider a Markov Decision Process (MDP) with state space $S$, action space $A$, and reward function $r(s, a)$, $s \in S$, $a \in A$. Let $\pi$ denote a policy, let $\Pi$ be the set of all policies, and let the expected discounted reward be:
$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$  (3)
where $\gamma \in [0, 1)$ is a discount factor and $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a sample trajectory. Let $A^\pi(s, a)$ be the advantage function for policy $\pi$ (Levine, 2017). Deep reinforcement learning considers a set of parameterized policies $\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\} \subset \Pi$, where each policy is parameterized by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in $\Pi_\Theta$ as well as over the non-parameterized policies in $\Pi$. For concreteness, we assume that the state and action spaces are finite. However, our methodology also applies to continuous state and action spaces, as shown in the Appendix.
One popular approach to maximizing $J(\pi_\theta)$ over $\Pi_\Theta$ is to apply stochastic gradient ascent. The gradient of $J(\pi_\theta)$ evaluated at a specific $\theta = \theta_k$ can be shown to be (Williams, 1992):
$\nabla_\theta J(\pi_\theta)\big|_{\theta_k} = \mathbb{E}_{\tau \sim \pi_{\theta_k}}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta_k} A^{\pi_{\theta_k}}(s_t, a_t)\right]$  (4)
We can approximate (4) by sampling $N$ trajectories of length $T$ from $\pi_{\theta_k}$:
$g(\theta_k) \triangleq \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\big|_{\theta_k} \hat{A}_{i,t}$  (5)
where $\hat{A}_{i,t}$ is an estimate of the advantage $A^{\pi_{\theta_k}}(s_{i,t}, a_{i,t})$.
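As an illustration, the sample estimate (5) can be sketched for a tabular softmax policy. This is a minimal sketch; the function names and the tabular setting are ours, not from the paper:

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    """Gradient of log pi_theta(a|s) for a tabular softmax policy.

    theta: (n_states, n_actions) array of logits; returns same-shape array."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs       # -pi(a'|s) for every action a'
    g[s, a] += 1.0      # +1 for the taken action
    return g

def reinforce_gradient(theta, trajectories, advantages):
    """Estimate (5): average over N trajectories of
    sum_t grad log pi(a_t|s_t) * A_hat_t."""
    g_hat = np.zeros_like(theta)
    for traj, adv in zip(trajectories, advantages):
        for (s, a), a_hat in zip(traj, adv):
            g_hat += grad_log_softmax(theta, s, a) * a_hat
    return g_hat / len(trajectories)
```

Here `trajectories` is a list of (state, action) pair lists and `advantages` the matching advantage estimates; a single gradient ascent step would then be `theta += lr * reinforce_gradient(...)`.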
Additionally, define $d^\pi(s) \triangleq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$ for the future state probability distribution for policy $\pi$, and denote $\pi(\cdot \mid s)$ for the probability distribution over the action space when in state $s$ and using policy $\pi$. Further denote $D_{KL}(\pi_s \,\|\, \tilde{\pi}_s)$ for the KL divergence between $\pi(\cdot \mid s)$ and $\tilde{\pi}(\cdot \mid s)$, and denote the following as the "aggregated KL divergence":
$\bar{D}_{KL}(\tilde{\pi} \,\|\, \pi) \triangleq \sum_s d^\pi(s)\, D_{KL}(\tilde{\pi}_s \,\|\, \pi_s)$  (6)
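The per-state KL divergence and its state-weighted aggregation (6) can be sketched directly for finite action spaces (a minimal sketch with our own function names):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for two distributions over a finite action space."""
    return float(np.sum(p * np.log(p / q)))

def aggregated_kl(d, pi_new, pi_old):
    """Aggregated KL divergence (6): per-state KLs weighted by the
    state distribution d. pi_new, pi_old: (n_states, n_actions) rows."""
    return float(sum(ds * kl(pn, po)
                     for ds, pn, po in zip(d, pi_new, pi_old)))
```

Note the asymmetry: `kl(p, q) != kl(q, p)` in general, which is exactly why the forward and backward constraints in Section 5 lead to different optimal policies.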
2.1 Surrogate Objectives for the Sample Efficiency Problem
For the sample efficiency problem, the objective $J(\pi_\theta)$ is typically approximated using samples generated from $\pi_{\theta_k}$ (Schulman et al., 2015a; Achiam et al., 2017; Schulman et al., 2017). Two different approaches are typically used. The first is a first-order approximation of $J(\pi_\theta)$ around $\theta_k$ (Kakade, 2002; Peters & Schaal, 2008a, b; Schulman et al., 2015a):
$J(\theta) \approx J(\theta_k) + g(\theta_k)^T (\theta - \theta_k)$  (7)
where $g(\theta_k)$ is the sample estimate (5). The second approach is to approximate the state distribution $d^{\pi_\theta}$ with $d^{\pi_{\theta_k}}$ (Achiam et al., 2017; Schulman et al., 2017; Achiam, 2017):
$J(\pi_\theta) \approx L(\theta) \triangleq \mathbb{E}_{s \sim d^{\pi_{\theta_k}}} \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a)\right]$  (8)
There is a well-known bound on the error of the approximation (8) (Kakade & Langford, 2002; Achiam et al., 2017). Furthermore, the approximation matches $J(\pi_\theta)$ to first order with respect to the parameter $\theta$ (Achiam et al., 2017).
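A sample-based estimate of the surrogate (8) is an importance-weighted average over data drawn from $\pi_{\theta_k}$; a minimal sketch (names are ours):

```python
import numpy as np

def surrogate_objective(log_probs_new, log_probs_old, advantages):
    """Sample estimate of (8): mean of the importance ratio
    pi_theta(a|s) / pi_theta_k(a|s) times the advantage estimate,
    with states and actions drawn from pi_theta_k."""
    ratios = np.exp(log_probs_new - log_probs_old)
    return float(np.mean(ratios * advantages))
```

Working with log-probabilities and exponentiating the difference is the usual numerically stable way to form the ratio; at $\theta = \theta_k$ all ratios equal 1 and the estimate reduces to the mean advantage.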
3 Related Work
The natural gradient (Amari, 1998) was first introduced to policy gradient methods by Kakade (2002) and then further developed in (Peters & Schaal, 2008a, b; Achiam, 2017; Schulman et al., 2015a); we refer to this line of work collectively as NPG/TRPO. Algorithmically, NPG/TRPO finds the gradient update by solving the sample efficiency problem (1)-(2) with $D = \bar{D}_{KL}$, i.e., it uses the aggregated KL divergence for the policy proximity constraint (2). NPG/TRPO addresses this problem in the parameter space $\Theta$. First, it approximates $J(\theta)$ with the first-order approximation (7) and the constraint with a second-order approximation. Second, it uses samples from $\pi_{\theta_k}$ to form estimates of these two approximations. Third, using these estimates (which are functions of $\theta$), it solves for the optimal $\theta$. The optimal $\theta$ is a function of $g(\theta_k)$ and of $H$, the sample average of the Hessian of the constraint evaluated at $\theta_k$. TRPO also limits the magnitude of the update to ensure that the sampled estimate of the aggregated KL constraint is met without the second-order approximation.
SPU takes a very different approach by first (i) posing and solving the optimization problem in the non-parameterized policy space, and then (ii) solving a supervised regression problem to find a parameterized policy that is near the optimal non-parameterized policy. A recent paper, Guided Actor Critic (GAC), independently proposed a similar decomposition (Tangkaratt et al., 2018). However, GAC is much more restricted in that it considers only one specific constraint criterion (aggregated reverse-KL divergence) and applies only to continuous action spaces. Furthermore, GAC incurs significantly higher computational complexity; e.g., at every update, it minimizes the dual function to obtain the dual variables using SLSQP. MPO also independently proposed a similar decomposition (Abdolmaleki et al., 2018). MPO uses much more complex machinery, namely Expectation Maximization, to address the DRL problem. However, MPO has only demonstrated preliminary results on problems with discrete actions, whereas our approach naturally applies to problems with either discrete or continuous actions. In both GAC and MPO, working in the non-parameterized space is a byproduct of applying the main ideas in those papers to DRL. Our paper demonstrates that the decomposition alone is a general and useful technique for solving constrained policy optimization.
Clipped-PPO (Schulman et al., 2017) takes a very different approach from TRPO. At each iteration, PPO makes many gradient steps while only using the data from $\pi_{\theta_k}$. Without the clipping, PPO maximizes the approximation (8). The clipping is analogous to the constraint (2) in that it has the goal of keeping $\pi_\theta$ close to $\pi_{\theta_k}$: it keeps the ratio $\pi_\theta(a \mid s)/\pi_{\theta_k}(a \mid s)$ from becoming either much larger than $1 + \epsilon$ or much smaller than $1 - \epsilon$. Thus, although the clipped PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit. We note that the PPO paper also considers adding a KL penalty to the objective function, whose gradient is similar to ours; however, that form of gradient was demonstrated to be inferior to Clipped-PPO. To the best of our knowledge, ours is the first work to demonstrate that such a form of gradient can outperform Clipped-PPO.
Actor-Critic using Kronecker-Factored Trust Region (ACKTR) (Wu et al., 2017) proposed using Kronecker-factored approximate curvature (KFAC) to update both the policy gradient and critic terms, giving a more computationally efficient method of calculating the natural gradients. ACER (Wang et al., 2016a) exploits past episodes, linearizes the KL divergence constraint, and maintains an average policy network to enforce the KL divergence constraint. In future work, it would be of interest to extend the SPU methodology to handle past episodes. In contrast to bounding the KL divergence on the action distribution as we have done in this work, Relative Entropy Policy Search bounds the divergence of the joint distribution of state and action, and was only demonstrated to work for small problems (Peters et al., 2010).
4 SPU Framework
The SPU methodology has two steps. In the first step, for a given constraint criterion $D$, we find the optimal solution to the non-parameterized problem:
$\underset{\pi \in \Pi}{\text{maximize}} \;\; \hat{J}(\pi_{\theta_k}, \pi)$  (9)
subject to $D(\pi_{\theta_k}, \pi) \le \delta$  (10)
Note that the optimal policy $\pi^*$ is not restricted to the set of parameterized policies $\Pi_\Theta$. As commonly done, we approximate the objective function with (8). However, unlike PPO/TRPO, we do not approximate the constraint (2). We will show below that the optimal solution for the non-parameterized problem (9)-(10) can be determined nearly in closed form for many natural constraint criteria $D$.
In the second step, we attempt to find a policy in the parameterized space $\Pi_\Theta$ that is close to the target policy $\pi^*$. Concretely, to advance from $\theta_k$ to $\theta_{k+1}$, we perform the following steps:

1. We first sample trajectories using policy $\pi_{\theta_k}$, giving sample data $(s_i, a_i, \hat{A}_i)$, $i = 1, \ldots, m$. Here $\hat{A}_i$ is an estimate of the advantage value $A^{\pi_{\theta_k}}(s_i, a_i)$. (For simplicity, we index the samples with $i$ rather than with $(i, t)$ corresponding to the $t$-th sample in the $i$-th trajectory.)

2. We then fit the policy network $\pi_\theta$ to the target distributions $\pi^*(\cdot \mid s_i)$, $i = 1, \ldots, m$. Specifically, to find $\theta_{k+1}$, we minimize the following supervised loss function:
$L(\theta) = \sum_{i=1}^{m} D_{KL}(\pi_\theta(\cdot \mid s_i) \,\|\, \pi^*(\cdot \mid s_i))$  (11)
For this step, we initialize $\theta$ with the weights $\theta_k$. We minimize the loss function $L(\theta)$ with stochastic gradient descent methods. The resulting $\theta$ becomes our $\theta_{k+1}$.
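The supervised step above can be sketched for a tabular softmax policy, using the exact gradient of the per-state KL with respect to the logits. This is a minimal sketch of the fit in (11) under our own toy setup, not the paper's network implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fit_to_targets(logits, targets, lr=0.5, steps=200):
    """Minimize sum_i KL(pi_theta(.|s_i) || pi*(.|s_i)) by gradient
    descent on per-state logits, a tabular stand-in for fitting the
    policy network to the target distributions in (11)."""
    for _ in range(steps):
        for i, p_star in enumerate(targets):
            pi = softmax(logits[i])
            kl_terms = np.log(pi) - np.log(p_star)
            kl_val = float(np.sum(pi * kl_terms))
            # Exact gradient of KL(pi || p*) w.r.t. the logits:
            logits[i] -= lr * pi * (kl_terms - kl_val)
    return logits
```

Initializing the logits at the current policy's values mirrors the paper's initialization of $\theta$ with $\theta_k$.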
5 SPU Applied to Specific Proximity Criteria
To illustrate the SPU methodology, for three different but natural types of proximity constraints, we solve the corresponding nonparameterized optimization problem and derive the resulting gradient for the SPU supervised learning problem. We also demonstrate that different constraints lead to very different but intuitive forms of the gradient update.
5.1 Forward Aggregate and Disaggregate KL Constraints
We first consider constraint criteria of the form:
$\underset{\pi \in \Pi}{\text{maximize}} \;\; \sum_s d^{\pi_{\theta_k}}(s) \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a)$  (12)
subject to $\bar{D}_{KL}(\pi \,\|\, \pi_{\theta_k}) \le \delta$  (13)
$D_{KL}(\pi_s \,\|\, \pi_{\theta_k, s}) \le \epsilon \;\; \text{for all } s$  (14)
(The objective (12) is the approximation (8), here optimized over non-parameterized policies.) We refer to (13) as the "aggregated KL constraint" and to (14) as the "disaggregated KL constraint". These two constraints taken together restrict $\pi$ from deviating too much from $\pi_{\theta_k}$. We shall refer to (12)-(14) as the forward-KL non-parameterized optimization problem.
Note that this problem without the disaggregated constraints is analogous to the TRPO problem. The TRPO paper actually prefers enforcing the disaggregated constraints to enforcing the aggregated constraint, but works with the aggregated constraint for mathematical convenience: "While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence" (Schulman et al., 2015a). The SPU framework allows us to solve the optimization problem with the disaggregated constraints exactly. Experimentally, we compared against TRPO in a controlled experimental setting, e.g., using the same advantage estimation scheme. Since we clearly outperform TRPO, we argue that SPU's two-step procedure has significant potential.
For each $\lambda > 0$, define $\pi^\lambda(a \mid s) \triangleq \frac{\pi_{\theta_k}(a \mid s)}{Z_\lambda(s)} \exp\!\left(\frac{A^{\pi_{\theta_k}}(s, a)}{\lambda}\right)$, where $Z_\lambda(s)$ is the normalization term. Note that $\pi^\lambda$ is a function of $\lambda$. Further, for each $s$, let $\lambda_s$ be such that $D_{KL}(\pi^{\lambda_s}_s \,\|\, \pi_{\theta_k, s}) = \epsilon$. Also let $\tilde{\lambda}_s \triangleq \max\{\lambda, \lambda_s\}$.
Theorem 1
The optimal solution $\pi^*$ to the problem (12)-(14) is given by:
$\pi^*(a \mid s) = \frac{\pi_{\theta_k}(a \mid s)}{Z_{\tilde{\lambda}_s}(s)} \exp\!\left(\frac{A^{\pi_{\theta_k}}(s, a)}{\tilde{\lambda}_s}\right)$  (15)
where $\lambda$ is chosen so that the aggregated constraint (13) is satisfied (Proof in subsection A.1).
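For a finite action space, the target distribution in (15) is simply the current policy reweighted by exponentiated advantages and renormalized per state; a minimal sketch (function name is ours):

```python
import numpy as np

def forward_kl_target(pi_k, advantages, lam):
    """Non-parameterized target of (15): pi*(a|s) proportional to
    pi_k(a|s) * exp(A(s,a)/lambda), normalized per state.

    pi_k, advantages: (n_states, n_actions) arrays; lam > 0."""
    w = pi_k * np.exp(advantages / lam)
    return w / w.sum(axis=1, keepdims=True)
```

Small `lam` concentrates the target on high-advantage actions; as `lam` grows the target approaches `pi_k`, matching the role of the KL constraints.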
Equation (15) provides the structure of the optimal non-parameterized policy. As part of the SPU framework, we then seek a parameterized policy that is close to $\pi^*$, that is, one that minimizes the loss function (11). For each sampled state $s$, a straightforward calculation shows (Appendix B):
$\nabla_\theta D_{KL}(\pi_{\theta, s} \,\|\, \pi^*_s) = \nabla_\theta D_{KL}(\pi_{\theta, s} \,\|\, \pi_{\theta_k, s}) - \frac{1}{\tilde{\lambda}_s}\, \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\!\left[\nabla_\theta \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a)\right]$  (16)
where $\tilde{\lambda}_s = \lambda$ for states where the disaggregated constraint is not binding and $\tilde{\lambda}_s = \lambda_s$ for states where it is. We estimate the expectation in (16) with the sampled action $a_i$ and approximate $A^{\pi_{\theta_k}}(s_i, a_i)$ as $\hat{A}_i$ (obtained from the critic network), giving:
$\nabla_\theta D_{KL}(\pi_\theta(\cdot \mid s_i) \,\|\, \pi_{\theta_k}(\cdot \mid s_i)) - \frac{1}{\tilde{\lambda}_{s_i}} \nabla_\theta \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)}\, \hat{A}_i$  (17)
To simplify the algorithm, we slightly modify (17). We replace the hyperparameter $\delta$ with the hyperparameter $\lambda$ and tune $\lambda$ rather than $\delta$. Further, we set $\tilde{\lambda}_{s_i} = \lambda$ for all $i$ in (17) and introduce per-state acceptance to enforce the disaggregated constraints, giving the approximate gradient:
$\nabla_\theta \left[ D_{KL}(\pi_\theta(\cdot \mid s_i) \,\|\, \pi_{\theta_k}(\cdot \mid s_i)) - \frac{1}{\lambda} \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)}\, \hat{A}_i \right] \mathbb{1}\{D_{KL}(\pi_\theta(\cdot \mid s_i) \,\|\, \pi_{\theta_k}(\cdot \mid s_i)) \le \epsilon\}$  (18)
We thus make the approximation that the disaggregated constraints are only enforced on the states in the sampled trajectories. We use (18) as our gradient for supervised training of the policy network. Equation (18) has an intuitive interpretation: the gradient represents a trade-off between the approximate performance of $\pi_\theta$ (as captured by the advantage-weighted ratio term) and how far $\pi_\theta$ diverges from $\pi_{\theta_k}$ (as captured by the KL term). For the stopping criterion, we train until the sampled estimate of the aggregated KL divergence reaches $\delta$.
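The per-state acceptance and dynamic stopping described above can be sketched as follows. This is our own minimal sketch of the loss whose gradient corresponds to (18), with the acceptance indicator treated as a fixed (stop-gradient) mask:

```python
import numpy as np

def spu_forward_kl_losses(kl_per_state, ratios, advantages, lam, eps):
    """Per-sample loss behind (18): KL(pi_theta || pi_k) at s_i minus
    (1/lambda) * ratio_i * A_hat_i, with per-state acceptance zeroing
    out samples whose disaggregated KL already exceeds eps."""
    accept = (kl_per_state <= eps).astype(float)
    return accept * (kl_per_state - (1.0 / lam) * ratios * advantages)

def dynamic_stop(kl_per_state, delta):
    """Dynamic stopping: halt the epoch loop once the sampled estimate
    of the aggregated KL divergence reaches delta."""
    return float(np.mean(kl_per_state)) >= delta
```

In an autodiff framework the returned losses would be averaged and differentiated; samples masked out by `accept` contribute no gradient, enforcing the disaggregated constraint on sampled states only.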
5.2 Backward KL Constraint
In a similar manner, we can derive the structure of the optimal policy when using the reverse KL divergence as the constraint. For simplicity, we provide the result for the case with only disaggregated constraints. We seek the non-parameterized optimal policy by solving:
$\underset{\pi \in \Pi}{\text{maximize}} \;\; \sum_s d^{\pi_{\theta_k}}(s) \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a)$  (19)
subject to $D_{KL}(\pi_{\theta_k, s} \,\|\, \pi_s) \le \epsilon \;\; \text{for all } s$  (20)
Theorem 2
The optimal solution to the problem (19)-(20) is given by:
$\pi^*(a \mid s) = \pi_{\theta_k}(a \mid s)\, \frac{\lambda_s}{\lambda'_s - A^{\pi_{\theta_k}}(s, a)}$  (21)
where $\lambda_s > 0$ and $\lambda'_s > \max_a A^{\pi_{\theta_k}}(s, a)$ are constants chosen such that the constraint (20) is binding and $\pi^*(\cdot \mid s)$ is a valid distribution (Proof in subsection A.2).
Note that the structure of the optimal policy with the backward KL constraint is quite different from that with the forward KL constraint. A straightforward calculation shows (Appendix B):
$\nabla_\theta D_{KL}(\pi_{\theta, s} \,\|\, \pi^*_s) = \nabla_\theta D_{KL}(\pi_{\theta, s} \,\|\, \pi_{\theta_k, s}) + \sum_a \nabla_\theta \pi_\theta(a \mid s) \log \frac{\lambda'_s - A^{\pi_{\theta_k}}(s, a)}{\lambda_s}$  (22)
5.3 $L^\infty$ Constraint
In this section we show how a PPO-like objective can be formulated in the context of SPU. Recall from Section 3 that the clipping in PPO can be seen as an attempt at keeping $\pi_\theta(a_i \mid s_i)/\pi_{\theta_k}(a_i \mid s_i)$ from becoming either much larger than $1 + \epsilon$ or much smaller than $1 - \epsilon$ for $i = 1, \ldots, m$. In this subsection, we consider the constraint function
$D(\pi_{\theta_k}, \pi) = \max_i \left| \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)} - 1 \right|$  (23)
which leads us to the following optimization problem:
$\underset{\pi}{\text{maximize}} \;\; \sum_{i=1}^m \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)}\, \hat{A}_i$  (24)
subject to $\left| \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)} - 1 \right| \le \epsilon, \;\; i = 1, \ldots, m$  (25)
$\hat{\bar{D}}_{KL}(\pi \,\|\, \pi_{\theta_k}) \le \delta$  (26)
where $\hat{\bar{D}}_{KL}$ denotes a sample-based estimate of the aggregated KL divergence. Note that here we are using a variation of the SPU methodology described in Section 4: we first create estimates of the expectations in the objective and constraints and then solve the optimization problem (rather than first solving the optimization problem and then taking samples, as done for Theorems 1 and 2). Note that we have also included an aggregated constraint (26) in addition to the PPO-like constraint (25), which further ensures that the updated policy is close to $\pi_{\theta_k}$.
Theorem 3
The optimal solution to the optimization problem (24)-(26) is given by:
(27) 
for some $\lambda \ge 0$ chosen so that the aggregated constraint (26) is satisfied (Proof in subsection A.3).
To simplify the algorithm, we treat $\lambda$ as a hyperparameter rather than $\delta$. After solving for the optimal non-parameterized policy $\pi^*$, we seek a parameterized policy that is close to $\pi^*$ by minimizing the mean square error between $\pi_\theta(a_i \mid s_i)$ and $\pi^*(a_i \mid s_i)$ over the sampled states and actions, i.e., by updating $\theta$ in the negative direction of the gradient of this mean square error. This loss is used for supervised training instead of the KL divergence because here we take estimates before forming the optimization problem; the optimal values for the decision variables therefore do not completely characterize a distribution. We refer to this approach as SPU with the $L^\infty$ constraint.
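The per-sample projection into the PPO-like band (25) and the mean-square-error fit can be sketched as follows (a minimal sketch with our own function names; the exact solver for the target ratios is in Appendix A.3):

```python
import numpy as np

def clipped_target_ratios(ratios_opt, eps):
    """Project candidate ratios pi*(a_i|s_i) / pi_k(a_i|s_i) into the
    band [1 - eps, 1 + eps] imposed by constraint (25)."""
    return np.clip(ratios_opt, 1.0 - eps, 1.0 + eps)

def mse_to_targets(ratios_theta, target_ratios):
    """Supervised loss for the L-infinity variant: mean squared error
    between the parameterized policy's ratios and the optimal
    non-parameterized targets."""
    return float(np.mean((ratios_theta - target_ratios) ** 2))
```

In an autodiff framework `ratios_theta` would be computed from the policy network and `mse_to_targets` differentiated with respect to $\theta$, while the targets are held fixed.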
Although we consider three classes of proximity constraints, there may be yet another class that leads to even better performance. The SPU methodology allows researchers to explore other proximity constraints in the future.
6 Experimental Results
Extensive experimental results demonstrate that SPU outperforms recent state-of-the-art methods for environments with continuous or discrete action spaces. We provide ablation studies to show the importance of the different algorithmic components, and a sensitivity analysis to show that SPU's performance is relatively insensitive to hyperparameter choices. We use two definitions to conclude that algorithm A is more sample efficient than algorithm B: (i) A takes fewer environment interactions to achieve a predefined performance threshold (Kakade, 2003); (ii) the average final performance of A is higher than that of B given the same number of environment interactions (Schulman et al., 2017). Implementation details are provided in Appendix D.
6.1 Results on Mujoco
The Mujoco (Todorov et al., 2012) simulated robotics environments provided by OpenAI gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. In terms of final performance averaged over all ten available Mujoco environments, with ten different seeds each, both SPU with the $L^\infty$ constraint (Section 5.3) and SPU with forward KL constraints (Section 5.1) outperform TRPO. Since the forward-KL approach is our best-performing approach, we focus subsequent analysis on it and hereafter refer to it as SPU. SPU also outperforms PPO. Figure 1 illustrates the performance of SPU versus TRPO and PPO.
To ensure that SPU is not only better than TRPO in terms of performance gain early during training, we further retrain both policies for 3 million timesteps. Here again, SPU outperforms TRPO. Figure 3 in the Appendix illustrates the performance for each environment. Code for the Mujoco experiments is available at https://github.com/quanvuong/Supervised_Policy_Update.
6.2 Ablation Studies for Mujoco
The indicator variable in (18) enforces the disaggregated constraints; we refer to it as per-state acceptance, and removing this component is equivalent to removing the indicator variable. We refer to using the sampled aggregated KL divergence to determine the number of training epochs as dynamic stopping; without this component, the number of training epochs is a fixed hyperparameter. We also tried removing the gradient of the KL term from the gradient update step in (18). Table 1 illustrates the contribution of the different components of SPU to the overall performance. The third row shows that the KL term makes a crucially important contribution to SPU. Furthermore, per-state acceptance and dynamic stopping are both also important for obtaining high performance, with the former playing a more central role. When a component is removed, the hyperparameters are retuned to ensure that the best possible performance is obtained with the alternative (simpler) algorithm.

Approach | Percentage better than TRPO | Performance vs. original algorithm
Original algorithm | 27% | 0%
No grad KL | 4% | -85%
No dynamic stopping | 24% | -11%
No per-state acceptance | 9% | -67%
6.3 Sensitivity Analysis on Mujoco
To demonstrate the practicality of SPU, we show that its high performance is insensitive to hyperparameter choice. One way to show this is as follows: for each SPU hyperparameter, select a reasonably large interval, randomly sample the value of the hyper parameter from this interval, and then compare SPU (using the randomly chosen hyperparameter values) with TRPO. We sampled 100 SPU hyperparameter vectors (each vector including
), and for each one determined the relative performance with respect to TRPO. First, we found that for all 100 random hyperparameter value samples, SPU performed better than TRPO. and of the samples outperformed TRPO by at least and respectively. The full CDF is given in Figure 4 in the Appendix. We can conclude that SPU’s superior performance is largely insensitive to hyperparameter values.6.4 Results on Atari
Rajeswaran et al. (2017) and Mania et al. (2018) demonstrate that neural networks are not needed to obtain high performance in many Mujoco environments. To evaluate SPU more conclusively, we compare it against PPO on the Arcade Learning Environment (Bellemare et al., 2012) exposed through OpenAI gym (Brockman et al., 2016). Using the same network architecture and hyperparameters, we learn to play 60 Atari games from raw pixels and rewards. This is highly challenging because of the diversity across games and the high dimensionality of the observations.
Here, we compare SPU against PPO because PPO outperforms TRPO in Mujoco. Averaged over 60 Atari environments and 20 seeds, SPU outperforms PPO in terms of average final performance. Figure 2 provides a high-level overview of the result. The dots in the shaded area represent environments where the two algorithms perform roughly the same; the dots to the right of the shaded area represent environments where SPU is more sample efficient than PPO. We can draw two conclusions: (i) in 36 environments, SPU and PPO perform roughly the same; SPU clearly outperforms PPO in 15 environments, while PPO clearly outperforms SPU in 9; (ii) in those 15 + 9 environments, the extent to which SPU outperforms PPO is much larger than the extent to which PPO outperforms SPU. Figures 5, 6, and 7 in the Appendix illustrate the performance of SPU vs. PPO throughout training. SPU's strong results in both the Mujoco and Atari domains demonstrate its high performance and generality.
7 Acknowledgements
We would like to acknowledge the extremely helpful support of the NYU Shanghai High Performance Computing Administrator Zhiguo Qi. We are also grateful to OpenAI for open-sourcing their baselines code.
References
Abdolmaleki et al. (2018) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. 2018. URL https://arxiv.org/abs/1806.06920.
 Achiam (2017) Joshua Achiam. Advanced policy gradient methods. http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf, 2017.

Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31, 2017.
Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 Bellemare et al. (2012) Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012. URL http://arxiv.org/abs/1207.4708.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016. URL http://arxiv.org/abs/1604.06778.
 Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.
Peters et al. (2010) Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, 2010. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/viewFile/1851/2264.
 Kakade (2003) Sham Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London London, England, 2003.
 Kakade & Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.
 Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
 Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Levine (2017) Sergey Levine. UC Berkeley CS294 deep reinforcement learning lecture notes. http://rail.eecs.berkeley.edu/deeprlcoursefa17/index.html, 2017.
 Mania et al. (2018) Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
Tan et al. (2018) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. 2018. URL https://arxiv.org/abs/1807.11626.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518, 2015. URL http://dx.doi.org/10.1038/nature14236.
 Peters & Schaal (2008a) Jan Peters and Stefan Schaal. Natural actorcritic. Neurocomputing, 71(79):1180–1190, 2008a.
 Peters & Schaal (2008b) Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008b.
 Rajeswaran et al. (2017) Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control. CoRR, abs/1703.02660, 2017. URL http://arxiv.org/abs/1703.02660.
 Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015b. URL http://arxiv.org/abs/1506.02438.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Tangkaratt et al. (2018) Voot Tangkaratt, Abbas Abdolmaleki, and Masashi Sugiyama. Guide actorcritic for continuous control. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJk59JZ0b.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for modelbased control. 2012. URL https://ieeexplore.ieee.org/abstract/document/6386109/authors.
 Wang et al. (2016a) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. CoRR, abs/1611.01224, 2016a. URL http://arxiv.org/abs/1611.01224.
 Wang et al. (2016b) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224, 2016b.
 Williams (1992) Ronald J Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.
 Wu et al. (2017) Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation. In Advances in neural information processing systems, pp. 5285–5294, 2017.
 Zoph & Le (2016) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
Appendix A Proofs for nonparameterized optimization problems
A.1 Forward KL Aggregated and Disaggregated Constraints
We first show that (12)-(14) is a convex optimization problem. To this end, first note that the objective (12) is a linear function of the decision variables $\pi(a \mid s)$, $s \in S$, $a \in A$. The LHS of (14) can be rewritten as $\sum_a \pi(a \mid s) \log \pi(a \mid s) - \sum_a \pi(a \mid s) \log \pi_{\theta_k}(a \mid s)$. The second term is a linear function of $\pi(\cdot \mid s)$, and the first term is a convex function since the second derivative of each summand is always positive. The LHS of (14) is thus a convex function. By extension, the LHS of (13) is also a convex function since it is a non-negative weighted sum of convex functions. The problem (12)-(14) is thus a convex optimization problem. By Slater's constraint qualification, strong duality holds since $\pi_{\theta_k}$ is a feasible solution to (12)-(14) for which the inequalities hold strictly.
We can therefore solve (12)-(14) by solving the related Lagrangian problem. For a fixed $\lambda > 0$, consider:
$\underset{\pi \in \Pi}{\text{maximize}} \;\; \sum_s d^{\pi_{\theta_k}}(s) \left[ \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a) - \lambda\, D_{KL}(\pi_s \,\|\, \pi_{\theta_k, s}) \right]$  (28)
subject to $D_{KL}(\pi_s \,\|\, \pi_{\theta_k, s}) \le \epsilon \;\; \text{for all } s$  (29)
The above problem decomposes into separate problems, one for each state $s$:
$\underset{\pi_s}{\text{maximize}} \;\; \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a) - \lambda\, D_{KL}(\pi_s \,\|\, \pi_{\theta_k, s})$  (30)
subject to $D_{KL}(\pi_s \,\|\, \pi_{\theta_k, s}) \le \epsilon$  (31)
Further consider the unconstrained problem (30) without the constraint (31):
$\underset{\pi_s}{\text{maximize}} \;\; \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a) - \lambda\, D_{KL}(\pi_s \,\|\, \pi_{\theta_k, s})$  (32)
subject to $\sum_a \pi(a \mid s) = 1$  (33)
$\pi(a \mid s) \ge 0 \;\; \text{for all } a$  (34)
A simple Lagrange-multiplier argument shows that the optimal solution to (32)-(34) is given by $\pi^\lambda(a \mid s) = \frac{\pi_{\theta_k}(a \mid s)}{Z_\lambda(s)} \exp\!\left(\frac{A^{\pi_{\theta_k}}(s, a)}{\lambda}\right)$,
where $Z_\lambda(s)$ is defined so that $\pi^\lambda(\cdot \mid s)$ is a valid distribution. Now returning to the decomposed constrained problem (30)-(31), there are two cases to consider. The first case is when $D_{KL}(\pi^\lambda_s \,\|\, \pi_{\theta_k, s}) \le \epsilon$; in this case, the optimal solution to (30)-(31) is $\pi^\lambda_s$. The second case is when $D_{KL}(\pi^\lambda_s \,\|\, \pi_{\theta_k, s}) > \epsilon$; in this case the optimal solution is $\pi^\lambda_s$ with $\lambda$ replaced with $\lambda_s$, where $\lambda_s$ is the solution to $D_{KL}(\pi^{\lambda_s}_s \,\|\, \pi_{\theta_k, s}) = \epsilon$. Thus, an optimal solution to (30)-(31) is given by:
$\pi^*(a \mid s) = \frac{\pi_{\theta_k}(a \mid s)}{Z_{\tilde{\lambda}_s}(s)} \exp\!\left(\frac{A^{\pi_{\theta_k}}(s, a)}{\tilde{\lambda}_s}\right)$  (35)
where $\tilde{\lambda}_s = \max\{\lambda, \lambda_s\}$.
To find the Lagrange multiplier $\lambda$, we can then do a line search to find the $\lambda$ that satisfies:
$\sum_s d^{\pi_{\theta_k}}(s)\, D_{KL}(\pi^*_s \,\|\, \pi_{\theta_k, s}) = \delta$  (36)
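Since the aggregated KL of the $\lambda$-indexed policy decreases monotonically as $\lambda$ grows, the line search can be done by bisection. A minimal numpy sketch under our own toy setup (function names and the bracketing interval are ours):

```python
import numpy as np

def lambda_target(pi_k, advantages, lam):
    """Lambda-indexed policy from (35): pi_k reweighted by exp(A/lambda).
    Computed in log-space for numerical stability."""
    z = np.log(pi_k) + advantages / lam
    z -= z.max(axis=1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def aggregated_kl_of_lambda(lam, pi_k, advantages, d):
    """Aggregated KL divergence between the lambda-target and pi_k,
    weighted by the state distribution d, as on the LHS of (36)."""
    p = np.clip(lambda_target(pi_k, advantages, lam), 1e-12, 1.0)
    return float(np.dot(d, np.sum(p * np.log(p / pi_k), axis=1)))

def line_search_lambda(pi_k, advantages, d, delta, lo=0.05, hi=100.0, iters=60):
    """Bisect on lambda so that the aggregated KL equals delta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if aggregated_kl_of_lambda(mid, pi_k, advantages, d) > delta:
            lo = mid  # KL still too large: move toward larger lambda
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The bracketing interval `[lo, hi]` is assumed to contain the root; in practice it would be widened until the endpoint KLs straddle $\delta$.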
A.2 Backward KL Constraint
The problem (19)-(20) decomposes into separate problems, one for each state $s$:
$\underset{\pi_s}{\text{maximize}} \;\; \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a)$  (37)
subject to $D_{KL}(\pi_{\theta_k, s} \,\|\, \pi_s) \le \epsilon$  (38)
After some algebra, we see that the above optimization problem is equivalent to:
$\underset{\pi_s}{\text{maximize}} \;\; \sum_a \pi(a \mid s)\, A^{\pi_{\theta_k}}(s, a)$  (39)
subject to $-\sum_a \pi_{\theta_k}(a \mid s) \log \pi(a \mid s) \le \epsilon'$  (40)
$\sum_a \pi(a \mid s) = 1$  (41)
$\pi(a \mid s) \ge 0 \;\; \text{for all } a$  (42)
where $\epsilon' = \epsilon - \sum_a \pi_{\theta_k}(a \mid s) \log \pi_{\theta_k}(a \mid s)$. (39)-(42) is a convex optimization problem for which Slater's condition holds; strong duality thus holds for the problem (39)-(42). Applying standard Lagrange multiplier arguments, it is easily seen that the solution to (39)-(42) is
$\pi^*(a \mid s) = \pi_{\theta_k}(a \mid s)\, \frac{\lambda_s}{\lambda'_s - A^{\pi_{\theta_k}}(s, a)}$
where $\lambda_s$ and $\lambda'_s$ are constants chosen such that the disaggregated KL constraint is binding and the sum of the probabilities equals 1. It is easily seen that $\lambda_s > 0$ and $\lambda'_s > \max_a A^{\pi_{\theta_k}}(s, a)$.
A.3 $L^\infty$ Constraint
The problem (24)-(26) is equivalent to:
$\underset{\pi}{\text{maximize}} \;\; \sum_{i=1}^m \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)}\, \hat{A}_i$  (43)
subject to $1 - \epsilon \le \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)} \le 1 + \epsilon, \;\; i = 1, \ldots, m$  (44)
$\hat{\bar{D}}_{KL}(\pi \,\|\, \pi_{\theta_k}) \le \delta$  (45)
This problem is clearly convex, and $\pi_{\theta_k}$ is a feasible solution for which the inequality constraint holds strictly. Strong duality thus holds by Slater's constraint qualification. To solve (43)-(45), we can therefore solve the related Lagrangian problem for fixed $\lambda$:
$\underset{\pi}{\text{maximize}} \;\; \sum_{i=1}^m \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)}\, \hat{A}_i - \lambda\, \hat{\bar{D}}_{KL}(\pi \,\|\, \pi_{\theta_k})$  (46)
subject to $1 - \epsilon \le \frac{\pi(a_i \mid s_i)}{\pi_{\theta_k}(a_i \mid s_i)} \le 1 + \epsilon, \;\; i = 1, \ldots, m$  (47)
which is separable and decomposes into $m$ separate problems, one for each $i$:
(48)  
subject to  (49) 
The solution to the unconstrained problem (48) without the constraint (49) follows from a stationarity condition on each ratio $\pi(a_i \mid s_i)/\pi_{\theta_k}(a_i \mid s_i)$. Now consider the constrained problem (48)-(49): when the unconstrained solution exceeds the upper limit in (49), the optimal ratio is $1 + \epsilon$; similarly, when it falls below the lower limit, the optimal ratio is $1 - \epsilon$. Rearranging the terms gives Theorem 3. To obtain $\lambda$, we can perform a line search over $\lambda$ so that the constraint (45) is binding.
Appendix B Derivation of the gradient of the loss function for SPU
Let $CE$ stand for cross-entropy.