1 Introduction
Reinforcement learning with flexible function approximators such as neural networks, also referred to as “deep RL”, holds great promise for continuous control and robotics. Neural networks can express complex dependencies between high-dimensional and multimodal input and output spaces, and learning-based approaches can find solutions that would be difficult to craft by hand. Unfortunately, the generality and flexibility of learning-based approaches with neural networks can come at a price: deep reinforcement learning algorithms can require large amounts of training data; they can suffer from stability problems, especially in high-dimensional continuous action spaces [13; 42]; and they can be sensitive to hyperparameter settings. Even though attempts to control robots or simulated robots with neural networks go back a long time [35; 39; 44], only recently have algorithms emerged that are able to scale to challenging problems [6; 19; 33], including first successes in the data-restricted domain of physical robots [16; 22; 33; 38].
Model-free off-policy actor-critic algorithms have several appealing properties. In particular, they make minimal assumptions about the control problem, and can be data-efficient when used in combination with appropriate data reuse. They can also scale well when implemented appropriately (see e.g. Gu et al. [16]; Popov et al. [36]). Broadly speaking, many off-policy algorithms are implemented by alternating between two steps: i) a policy evaluation step in which an action-value function is learned for the current policy; and ii) a policy improvement step during which the policy is modified given the current action-value function.
In this paper we outline a general policy iteration framework and motivate it both from an intuitive perspective and from an “RL as inference” perspective. In the case when the MDP collapses to a bandit setting, our framework can be related to the black-box optimization literature. We propose an algorithm that works reliably across a wide range of tasks and requires minimal hyperparameter tuning to achieve state-of-the-art results on several benchmark suites. Similarly to the Maximum a Posteriori Policy Optimisation algorithm (MPO) [4], it estimates the action-value function for a policy and then uses this Q-function to update the policy. The policy improvement step builds on ideas from the black-box optimization and KL-regularized control literature. It first estimates a local, non-parametric policy that is obtained by reweighting the samples from the current/prior policy, and subsequently fits a new parametric policy via weighted maximum likelihood learning. Trust-region-like constraints ensure stability of the procedure. The algorithm simplifies the original formulation of MPO while improving its robustness via decoupled optimization of the policy’s mean and covariance. We show that our algorithm solves standard continuous control benchmark tasks from the DeepMind control suite (including control of a humanoid with 56 action dimensions), from the OpenAI Gym, and also the challenging “Parkour” tasks from Heess et al. [19], all with the same hyperparameter settings and a single actor for data collection.
2 Problem Statement
In this paper we focus on actor-critic algorithms for stable and data-efficient policy optimization. Actor-critic algorithms decompose the policy optimization problem into two distinct sub-problems, as also outlined in Algorithm 1: i) estimating the state-conditional action value (the Q-function, denoted by $Q^{\pi}$) given a policy $\pi$, and ii) improving $\pi$ given an estimate of $Q^{\pi}$.
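We consider the usual discounted reinforcement learning (RL) problem defined by a Markov decision process (MDP). The MDP consists of continuous states $s$, actions $a$, an initial state distribution $p(s_0)$, transition probabilities $p(s_{t+1} \mid s_t, a_t)$ which specify the probability of transitioning from state $s_t$ to $s_{t+1}$ under action $a_t$, a reward function $r(s, a) \in \mathbb{R}$ and the discount factor $\gamma \in [0, 1)$. The policy $\pi_\theta(a \mid s)$ with parameters $\theta$ is a distribution over actions $a$ given a state $s$. We optimize the objective
$$J(\pi) = \mathbb{E}_{\pi, p(s_0)}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 \sim p(\cdot),\ a_t \sim \pi(\cdot \mid s_t)\Big], \tag{1}$$
where the expectation is taken with respect to the trajectory distribution induced by $\pi$. We define the action-value function associated with $\pi$ as the expected cumulative discounted return when choosing action $a$ in state $s$ and acting subsequently according to policy $\pi$, i.e. $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s, a_0 = a\big]$. This function satisfies the recursive expression $Q^{\pi}(s_t, a_t) = \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)}\big[r(s_t, a_t) + \gamma V^{\pi}(s_{t+1})\big]$, where $V^{\pi}(s) = \mathbb{E}_{\pi}[Q^{\pi}(s, a)]$ is the value function of $\pi$.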
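As a small illustration of the discounted return and the recursion above, the following sketch evaluates a single, finite reward sequence. The truncation to a finite horizon, the deterministic trajectory, and the function name `discounted_return` are assumptions made only for this example; they are not part of the method.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]   # toy rewards r_0, r_1, r_2
gamma = 0.5

g0 = discounted_return(rewards, gamma)       # return from t = 0
g1 = discounted_return(rewards[1:], gamma)   # return from t = 1

# On a deterministic trajectory the recursion reads G_0 = r_0 + gamma * G_1.
bellman_gap = g0 - (rewards[0] + gamma * g1)
```

For a stochastic policy and stochastic dynamics the same recursion holds in expectation, which is what the expression for $Q^{\pi}$ states.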
The true Q-function of an MDP and policy $\pi^{(k)}$ in iteration $k$ provides the information needed to estimate a new policy $\pi^{(k+1)}$ that will have a higher expected discounted return than $\pi^{(k)}$, and will thus improve our objective. This is the core idea underlying policy iteration [45]: if for all states we change our policy to pick actions that have higher value with higher probability, then the overall objective is guaranteed to improve. For instance, we could attempt to choose $\pi^{(k+1)}(s) = \arg\max_{a} Q^{\pi^{(k)}}(s, a)$ as action selection rule, assuming an accurate $Q^{\pi^{(k)}}$.
In this paper we specifically focus on the problem of reliably optimizing $\pi^{(k+1)}$ given $Q^{\pi^{(k)}}$. In particular, in Section 4 we discuss update rules for stochastic policies that explicitly control the change in $\pi$ from one iteration to the next. We also show how to avoid premature convergence when Gaussian policies are used.
3 Policy Evaluation (Step 1)
Policy evaluation is concerned with learning an approximate Q-function. In principle, any off-policy method for learning Q-functions could be used here, as long as it provides sufficiently accurate value estimates. This includes making use of recent advances such as distributional RL [8; 7] or Retrace [30]. To separate the effects of policy improvement and better value estimation, we focus on simple 1-step temporal difference (TD) learning for most of the paper (showing advantages from better policy evaluation approaches in separate experiments). We fit a parametric Q-function $Q^{\pi^{(k)}}_{\phi}(s, a)$ with parameters $\phi$ by minimizing the squared (TD) error
$$\min_{\phi} \Big(r_t + \gamma\, Q_{\phi'}\big(s_{t+1}, a_{t+1} \sim \pi^{(k)}(a \mid s_{t+1})\big) - Q_{\phi}(s_t, a_t)\Big)^{2},$$
where $r_t = r(s_t, a_t)$, which we optimize via gradient descent. We let $\phi'$ be the parameters of a target network (the parameters of the last Q-function) that is held constant for a fixed number of steps (and then copied from the optimized parameters $\phi$). For brevity of notation we drop the subscript and dependence on parameters in the following section and write $Q^{\pi^{(k)}}(s, a)$.
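The 1-step TD update with a frozen target network can be sketched on a tabular Q-function as follows. This is a minimal illustration under stated assumptions: the dictionary-based Q-table, the single hand-written transition, the fixed next action, and the step size are all hypothetical stand-ins for the neural network, replay buffer, policy sample, and optimizer used in practice.

```python
def td_target(q_target, transition, next_action, gamma):
    """Bootstrapped target r + gamma * Q_target(s', a')."""
    _, _, reward, next_state = transition
    return reward + gamma * q_target[(next_state, next_action)]


def td_loss(q, q_target, transition, next_action, gamma):
    """Squared TD error for one transition."""
    state, action, _, _ = transition
    target = td_target(q_target, transition, next_action, gamma)
    return (target - q[(state, action)]) ** 2


def td_step(q, q_target, transition, next_action, gamma, lr):
    """One gradient-descent step on the squared TD error (target held fixed)."""
    state, action, _, _ = transition
    target = td_target(q_target, transition, next_action, gamma)
    # d/dq (target - q)^2 = -2 * (target - q); descend along the negative gradient.
    q[(state, action)] += lr * 2.0 * (target - q[(state, action)])


q = {(0, 0): 0.0, (1, 0): 1.0}   # online Q-values, keyed by (state, action)
q_target = dict(q)               # frozen copy playing the role of the target network
transition = (0, 0, 1.0, 1)      # (s, a, r, s')
next_action = 0                  # stands in for a' sampled from the policy

loss_before = td_loss(q, q_target, transition, next_action, gamma=0.9)
td_step(q, q_target, transition, next_action, gamma=0.9, lr=0.25)
loss_after = td_loss(q, q_target, transition, next_action, gamma=0.9)

# After a fixed number of such steps the target copy would be refreshed:
# q_target = dict(q)
```

Keeping `q_target` fixed during the step means the gradient flows only through the online estimate, which is the role the target network plays in the text.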
4 Policy Improvement (Steps 2–3)
The policy improvement step consists of optimizing $\bar{J}(s, \pi) = \mathbb{E}_{\pi}[Q^{\pi^{(k)}}(s, a)]$ for states $s$ drawn from the visitation distribution $\mu(s)$. In practice, we replace $\mu(s)$ with draws from a replay buffer. As argued intuitively in Section 2, if we improve this expectation in all states and for an accurate $Q$, this will improve our objective (Equation 1). Below we describe two approaches that perform this optimization. They do not fully optimize $\bar{J}$, to avoid being misled by errors in the approximated Q-function while keeping exploration. Instead, we find a solution that captures some information about the local value landscape in the shape of the distribution. Maintaining this information is important for exploration and future optimization steps. Both approaches employ a two-step procedure: they first construct a non-parametric estimate $q(a \mid s)$ s.t. $\mathbb{E}_{\mu(s)}\big[\mathbb{E}_{q(a \mid s)}[Q(s, a)]\big] \geq \mathbb{E}_{\mu(s)}\big[\mathbb{E}_{\pi^{(k)}(a \mid s)}[Q(s, a)]\big]$ (Step 2). They then project this non-parametric representation back onto the manifold of parameterized policies by finding
$$\pi^{(k+1)} = \arg\min_{\pi_\theta} \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a \mid s) \,\|\, \pi_\theta(a \mid s)\big)\Big], \tag{2}$$
which amounts to supervised learning, or maximum likelihood estimation (MLE) (Step 3). This split of the improvement step into sample-based estimation followed by supervised learning allows us to separate the neural network fitting from the RL procedure, enabling regularization in the latter.
4.1 Finding action weights (Step 2)
Given a learned approximate Q-function, in each policy optimization step we first sample $K$ states $\{s_j\}_{j=1,\dots,K}$ from the replay buffer. Secondly, we sample $N$ actions for each state from the last policy distribution, forming the sample-based estimate, i.e., $\{a_i\}_{i=1,\dots,N} \sim \pi^{(k)}(a \mid s_j)$, where $i$ denotes the action index and $j$ denotes the state index in the replay. We then evaluate each state-action pair using the Q-function ($Q^{\pi^{(k)}}(s_j, a_i)$). Now, given states, actions, and their corresponding Q-values, i.e. $\{s_j, \{a_i, Q^{\pi^{(k)}}(s_j, a_i)\}_{i=1,\dots,N}\}_{j=1,\dots,K}$, we first want to readjust the probabilities for the given actions in each state such that better actions have higher probability. These updated probabilities are expressed via the weights $q_{ij}$, forming the non-parametric, sample-based improved policy, i.e., $q(a_i \mid s_j) = q_{ij}$. To determine $q_{ij}$, one could assign probabilities manually to the actions based on the ranking of actions w.r.t. their Q-values. This approach has been used in the black-box and stochastic search communities and can be related to methods such as CMA-ES [18] and the cross-entropy method [40]. In general, we can calculate weights using any rank-preserving transformation of the Q-values, provided the weights additionally form a proper sample-based distribution, satisfying: i) positivity of weights, and ii) normalization, $\sum_{i} q_{ij} = 1$. We now discuss various valid transformations of the Q-values.
Using ranking to transform Q-values.
In particular, one such weighting would be to choose the weight of the $i$-th best action for the $j$-th sampled state to be proportional to $\ln\big(\frac{N + \eta}{i}\big)$, where $N$ is the number of action samples per state and $\eta$ is a temperature parameter (for a suitable choice of $\eta$ this would correspond to an update similar to CMA-ES). Intuitively, we set a fixed pseudo-probability for each action based on its rank, such that the expected Q-value under this new sample-based (state-dependent) distribution increases.
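A rank-based weighting of this kind can be sketched as below. The log-rank form $\ln((N + \eta)/i)$ used here is an assumption of the sketch (one common choice in the stochastic search literature), as are the function name `rank_weights` and the default $\eta = 1$; only the ordering of the Q-values enters the result.

```python
import math


def rank_weights(q_values, eta=1.0):
    """Normalized weights that depend only on the rank of each Q-value.

    The best action gets rank 1. With eta > 0 every raw weight
    log((N + eta) / rank) is strictly positive.
    """
    n = len(q_values)
    # Indices of the actions sorted from best to worst Q-value.
    order = sorted(range(n), key=lambda i: q_values[i], reverse=True)
    raw = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        raw[idx] = math.log((n + eta) / rank)
    total = sum(raw)
    return [w / total for w in raw]


q_values = [0.1, 2.0, -1.0]       # toy Q-values for N = 3 sampled actions
weights = rank_weights(q_values)  # largest weight goes to the second action
```

Because the weights are a function of rank alone, rescaling or shifting the Q-values leaves them unchanged, which is the main practical difference to the exponential transformation discussed next.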
Using an exponential transformation of the Q-values.
Alternatively, we can obtain the weights by optimizing for an optimal assignment of action probabilities directly. If we additionally want to constrain the change of the policy, this corresponds to solving the following KL-regularized objective:
$$q_{ij} = \arg\max_{q(a_i \mid s_j)} \sum_{j}^{K} \sum_{i}^{N} q(a_i \mid s_j)\, Q^{\pi^{(k)}}(s_j, a_i) \quad \text{s.t.} \quad \frac{1}{K} \sum_{j}^{K} \sum_{i}^{N} q(a_i \mid s_j) \log \frac{q(a_i \mid s_j)}{1/N} < \epsilon, \qquad \forall j: \ \sum_{i}^{N} q(a_i \mid s_j) = 1.$$
Here, the first constraint forces the weights to stay close to the last policy probabilities, i.e. it bounds the average relative entropy, or average KL, since the samples are drawn from $\pi^{(k)}$. The second constraint ensures that the weights are normalized. The solution will be new weights, given through the categorical probabilities $q(a_i \mid s_j)$, such that the expected Q-value increases while constraining the reduction in entropy (to prevent the weights from collapsing onto one action immediately). This objective has been used in the RL and bandit optimization literature before (see e.g. [34; 5]) and, when combined with Q-learning, has some optimality guarantees [4]. As it turns out, its solution can be obtained in closed form, and consists of a softmax over Q-values:
$$q_{ij} = q(a_i \mid s_j) = \frac{\exp\big(Q^{\pi^{(k)}}(s_j, a_i)/\eta\big)}{\sum_{i} \exp\big(Q^{\pi^{(k)}}(s_j, a_i)/\eta\big)},$$
where $\eta$ is a temperature. The temperature corresponding to the constraint $\epsilon$ can be found automatically by solving the following convex dual function alongside our policy optimization:
$$\eta = \arg\min_{\eta} \ \eta \epsilon + \eta \sum_{j}^{K} \frac{1}{K} \log\Big(\sum_{i}^{N} \frac{1}{N} \exp\Big(\frac{Q^{\pi^{(k)}}(s_j, a_i)}{\eta}\Big)\Big).$$
We found that, in practice, this optimization can be performed via a few steps of gradient descent on $\eta$ for each batch after the weight calculation. As $\eta$ should be positive, we use a projection operator to project $\eta$ back to the feasible positive space after each gradient step. We use Adam [23] to optimize $\eta$ together with all other parameters. We refer to Section B in the appendix for a derivation of this objective and its dual from the RL as inference perspective.
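The softmax weights and the temperature optimization can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: it runs plain projected gradient descent to convergence on a toy batch rather than a few Adam steps per batch, and the function names, step size, and lower bound `eta_min` are hypothetical. The sketch uses the fact that the derivative of the dual with respect to $\eta$ equals $\epsilon$ minus the average KL between the weights and the uniform distribution over samples.

```python
import math


def softmax_weights(q_values, eta):
    """Weights proportional to exp(Q_i / eta) for one state's action samples."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / eta) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def mean_kl_to_uniform(q_batch, eta):
    """Average over states of KL(weights || uniform distribution 1/N)."""
    total = 0.0
    for q_values in q_batch:
        w = softmax_weights(q_values, eta)
        n = len(w)
        total += sum(wi * math.log(wi * n) for wi in w if wi > 0.0)
    return total / len(q_batch)


def fit_temperature(q_batch, epsilon, eta=1.0, lr=0.5, steps=5000, eta_min=1e-3):
    """Projected gradient descent on the dual function of the temperature."""
    for _ in range(steps):
        # Gradient of the dual: epsilon minus the current average KL.
        grad = epsilon - mean_kl_to_uniform(q_batch, eta)
        eta = max(eta - lr * grad, eta_min)  # project back to positive values
    return eta


q_batch = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.0, 4.0]]  # toy Q-values, K = 2, N = 4
epsilon = 0.1
eta = fit_temperature(q_batch, epsilon)
weights = [softmax_weights(q_values, eta) for q_values in q_batch]
```

At the returned temperature the KL bound is (approximately) tight: a smaller $\epsilon$ yields a larger $\eta$ and therefore flatter weights, while a larger $\epsilon$ lets the weights concentrate on the best sampled action.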
Using an identity transformation.
Another interesting possibility is to use an identity transformation. While not respecting the desiderata from above, this would bring our method close to an expected policy gradient algorithm [11]. We discuss this choice in detail in Section A in the appendix.
4.2 Fitting an improved policy (Step 3)
So far, for each state, we have obtained an improved sample-based distribution over actions. Next, we want to generalize this sample-based solution over state and action space, which is required when we want to select better actions in unseen situations during control. For this, we solve a weighted supervised learning problem
$$\pi^{(k+1)} = \arg\max_{\pi_\theta} \sum_{j}^{K} \sum_{i}^{N} q_{ij} \log \pi_\theta(a_i \mid s_j), \tag{3}$$
where $\theta$ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy $\pi^{(k)}$. This objective corresponds to minimization of the KL divergence between the sample-based distribution $q$ from Step 2 and the parametric policy $\pi_\theta$, as given in Equation (2).
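For intuition, the weighted maximum likelihood problem has a closed-form solution in the simplest case of a single state and a one-dimensional Gaussian with free mean and variance. The sketch below covers only that special case, which is an assumption made for illustration; with a state-conditional neural network policy the same objective is optimized by gradient steps instead. Function names are hypothetical.

```python
import math


def weighted_log_likelihood(actions, weights, mean, var):
    """Sum_i w_i * log N(a_i; mean, var) for a 1-D Gaussian."""
    return sum(
        w * (-0.5 * math.log(2.0 * math.pi * var) - (a - mean) ** 2 / (2.0 * var))
        for a, w in zip(actions, weights)
    )


def weighted_gaussian_mle(actions, weights):
    """Closed-form maximizer of the weighted log-likelihood (weights sum to 1)."""
    mean = sum(w * a for a, w in zip(actions, weights))
    var = sum(w * (a - mean) ** 2 for a, w in zip(actions, weights))
    return mean, var


actions = [-1.0, 0.0, 1.0]   # sampled actions for one state
weights = [0.1, 0.2, 0.7]    # improved sample-based distribution from Step 2

mean, var = weighted_gaussian_mle(actions, weights)
best = weighted_log_likelihood(actions, weights, mean, var)
```

The fitted mean moves towards the highly weighted actions, and the fitted variance is the weighted spread around that new mean.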
Unfortunately, sample-based maximum likelihood estimation can suffer from overfitting to the samples from Step 2. Additionally, these sample weights themselves can be unreliable due to a poor approximation of $Q^{\pi^{(k)}}$, potentially resulting in a large change of the action distribution in the wrong direction when optimizing Equation (3). One effective regularization that addresses both concerns is to limit the overall change in the parametric policy. This additional regularization has a different effect than enforcing tighter constraints in Step 2, which would still only limit the change in the sample-based distribution. To directly limit the change in the parametric policy (even in regions of the action space we have not sampled from) we thus employ an additional KL constraint (other commonly used regularization techniques might also be worth investigating) and change the objective from Equation (3) to
$$\pi^{(k+1)} = \arg\max_{\pi_\theta} \sum_{j}^{K} \sum_{i}^{N} q_{ij} \log \pi_\theta(a_i \mid s_j) \quad \text{s.t.} \quad \sum_{j}^{K} \frac{1}{K}\, \mathrm{KL}\big(\pi^{(k)}(a \mid s_j) \,\|\, \pi_\theta(a \mid s_j)\big) < \epsilon_\pi, \tag{4}$$
where $\epsilon_\pi$ denotes the allowed expected change over the state distribution in KL divergence for the policy. To make this objective amenable to gradient-based optimization we employ Lagrangian relaxation, yielding the following primal optimization problem:
$$\max_{\theta} \min_{\alpha > 0} L(\theta, \alpha) = \sum_{j}^{K} \sum_{i}^{N} q_{ij} \log \pi_\theta(a_i \mid s_j) + \alpha \Big(\epsilon_\pi - \sum_{j}^{K} \frac{1}{K}\, \mathrm{KL}\big(\pi^{(k)}(a \mid s_j) \,\|\, \pi_\theta(a \mid s_j)\big)\Big).$$
We solve for $\theta$ by iterating the inner and outer optimization programs independently: we fix the parameters $\theta$ to their current value and optimize for the Lagrangian multipliers (inner minimization), and then we fix the Lagrangian multipliers to their current value and optimize for $\theta$ (outer maximization). In practice we found it effective to simply perform one gradient step each in the inner and outer optimization for each sampled batch of data. This led to good satisfaction of the constraints throughout coordinate gradient descent training.
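The alternating inner and outer gradient steps can be illustrated on a deliberately small problem: fitting only the mean of a one-dimensional Gaussian with fixed standard deviation, for which the KL divergence to the previous policy is $(\mu - \mu_{\text{old}})^2 / (2\sigma^2)$. Everything specific in this sketch (the toy samples, step sizes, initial multiplier, and function names) is an assumption for illustration rather than a setting from the experiments.

```python
def kl_same_variance(mu_old, mu_new, sigma):
    """KL(N(mu_old, sigma^2) || N(mu_new, sigma^2))."""
    return (mu_new - mu_old) ** 2 / (2.0 * sigma ** 2)


def constrained_mean_fit(actions, weights, mu_old, sigma, epsilon,
                         lr_mu=0.1, lr_alpha=10.0, steps=2000):
    """One gradient step on the multiplier, then one on the mean, repeated."""
    mu, alpha = mu_old, 1.0
    for _ in range(steps):
        # Inner step: gradient descent on alpha, projected onto alpha >= 0.
        kl = kl_same_variance(mu_old, mu, sigma)
        alpha = max(alpha - lr_alpha * (epsilon - kl), 0.0)
        # Outer step: gradient ascent on mu with alpha held fixed.
        grad_loglik = sum(w * (a - mu) for a, w in zip(actions, weights)) / sigma ** 2
        grad_kl = (mu - mu_old) / sigma ** 2
        mu += lr_mu * (grad_loglik - alpha * grad_kl)
    return mu, alpha


actions = [0.5, 1.0, 1.5]      # sampled actions; their weighted mean is 1.0
weights = [0.25, 0.5, 0.25]    # weights from Step 2 (sum to 1)
mu_old, sigma, epsilon = 0.0, 1.0, 0.02

mu, alpha = constrained_mean_fit(actions, weights, mu_old, sigma, epsilon)
final_kl = kl_same_variance(mu_old, mu, sigma)
```

Without the constraint the mean would jump to the weighted sample mean of 1.0. With the bound $\epsilon = 0.02$ it stops where the KL equals the bound, at $\mu = \sigma\sqrt{2\epsilon} = 0.2$, and the multiplier settles at the value that balances the two gradients.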
4.2.1 Fitting an improved Gaussian policy
The method described in the main part of Section 4.2 works for any distribution. However, in particular for continuous action spaces, it can still suffer from premature convergence, as shown in Figure 1 (left). The reason is that, in each policy improvement step, we are essentially optimising the expected reward for state $s$ given actions from the last policy. In such a setting, the optimal solution is to give a probability of 1 to the best action (or equal probabilities to equally good actions) based on its Q-value, and zero to other actions. This means that the policy will collapse on the best sampled action to optimise the expected reward, even though this action is not the true optimal action. We can postpone this effect by adding a KL constraint; however, in each iteration the policy will lose entropy to cover the best actions it has seen, albeit slowly, depending on the shape of $Q$ and the choice of $\epsilon$. It can therefore still converge prematurely.
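We found that when using Gaussian policies, a simple change can avoid premature convergence in Step 3: we can decouple the objective for the policy mean and covariance matrix which, as intuitively described below, will fix this issue. This technique is also employed in the CMA-ES and TR-CMA-ES algorithms [18; 3] for bandit problems, but we generalize it to non-linear parameterizations. Concretely, we jointly optimize the neural network weights $\theta$ to maximize two objectives: one for updating the mean with the covariance fixed to the one of the last policy (target network), and one for updating the covariance while fixing the mean to the one from the target network. This yields the following optimization objective for the updated mean and covariance:
$$\max_{\theta} \sum_{j}^{K} \sum_{i}^{N} q_{ij} \Big[\log \pi_\theta\big(a_i \mid s_j; \Sigma = \Sigma_{\theta_k}\big) + \log \pi_\theta\big(a_i \mid s_j; \mu = \mu_{\theta_k}\big)\Big]$$
$$\text{s.t.} \quad \frac{1}{K} \sum_{j}^{K} \mathrm{KL}\big(\pi^{(k)}(a \mid s_j) \,\|\, \pi_\theta(a \mid s_j; \Sigma = \Sigma_{\theta_k})\big) < \epsilon_\mu, \qquad \frac{1}{K} \sum_{j}^{K} \mathrm{KL}\big(\pi^{(k)}(a \mid s_j) \,\|\, \pi_\theta(a \mid s_j; \mu = \mu_{\theta_k})\big) < \epsilon_\Sigma.$$
Here, $\mu_{\theta_k}$ and $\Sigma_{\theta_k}$ respectively refer to the mean and covariance obtained from the previous policy $\pi^{(k)}$, and $\pi_\theta(a \mid s; \Sigma = \Sigma_{\theta_k}) = \mathcal{N}\big(a; \mu_\theta(s), \Sigma_{\theta_k}(s)\big)$, $\pi_\theta(a \mid s; \mu = \mu_{\theta_k}) = \mathcal{N}\big(a; \mu_{\theta_k}(s), \Sigma_\theta(s)\big)$. We solve this optimization by performing gradient descent on an objective derived via the same Lagrangian relaxation technique as in Section 4.2.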
This procedure has two advantages: 1) the gradient w.r.t. the parameters of the covariance is now independent of changes in the mean; hence the only way the policy can increase the likelihood of good samples far away from the mean is by stretching along the value landscape. This gives us the ability to grow and shrink the distribution supervised by samples without introducing any extra entropy term to the objective [1; 47] (see also Figures 1 and 2 for an experiment showing this effect). 2) We can set the KL bound for mean and covariance separately. The latter is especially useful in high-dimensional action spaces, where we want to avoid problems with ill-conditioning of the covariance matrix but want fast learning, enabled by large changes to the mean. The complete algorithm is listed in Algorithm 2.
Please note that the objective we optimise here is still the weighted maximum likelihood objective from Equation 4, with the difference that we optimise it in a coordinate ascent fashion, resulting in the decoupled updates with different KL bounds. In general, such a procedure can also be applied for optimising different policy classes. For other distributions, such as mixtures of Gaussians, we can still use the same procedure and optimise for means, covariances and categorical distribution independently, getting the same effect as for the Gaussian case. If the policy is deterministic (as in DDPG) then the exploration variance is fixed and we would simply optimize the mean of a Gaussian. For categorical distributions each component can be optimized independently. However, the application of the coordinate ascent updates will have to be derived on a per-distribution basis.
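The effect of decoupling can be seen in a one-dimensional, single-state example where both updates have closed forms. This sketch omits the KL bounds and the neural network parameterization, and its function names and toy samples are assumptions for illustration. In the coupled weighted MLE the variance is measured around the new mean, whereas in the decoupled update the variance objective keeps the mean fixed to the old one.

```python
def weighted_mean(actions, weights):
    return sum(w * a for a, w in zip(actions, weights))


def coupled_update(actions, weights):
    """Joint weighted MLE: variance is measured around the NEW mean."""
    mean = weighted_mean(actions, weights)
    var = sum(w * (a - mean) ** 2 for a, w in zip(actions, weights))
    return mean, var


def decoupled_update(actions, weights, old_mean):
    """Decoupled update without KL bounds.

    Mean objective: covariance fixed to the old one; the maximizer is the
    weighted sample mean regardless of that covariance.
    Covariance objective: mean fixed to the OLD mean.
    """
    mean = weighted_mean(actions, weights)
    var = sum(w * (a - old_mean) ** 2 for a, w in zip(actions, weights))
    return mean, var


old_mean, old_var = 0.0, 1.0
actions = [-1.0, 0.0, 1.0, 2.0]    # samples from the old policy
weights = [0.0, 0.05, 0.15, 0.8]   # good actions lie far from the old mean

coupled_mean, coupled_var = coupled_update(actions, weights)
decoupled_mean, decoupled_var = decoupled_update(actions, weights, old_mean)
```

Both updates move the mean to 1.75, but the coupled variance shrinks to 0.2875 while the decoupled variance grows to 3.35: when the highly weighted samples lie far from the old mean, the decoupled covariance update stretches the distribution towards them instead of collapsing onto them.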
5 Related Work
Our algorithm employs ideas used in the family of Evolutionary Strategies (ES) algorithms. The objectives for the Gaussian case in Section 4.2.1 can be seen as a generalization of the trust-region CMA-ES updates [18; 3] and similar algorithms [49; 40] to a stateful, sequential setting with an imperfectly estimated evaluation function. This is discussed further in Section C in the appendix. However, rather than optimizing mean and covariance directly, we assume that these are parameterized as a non-linear function of the state, and we combine gradient-based optimization of network parameters with gradient-free updates to the policy distribution in action space. Separately, previous work has used ES to directly optimize the weights of a neural network policy [41], which can be sample-inefficient in high-dimensional parameter spaces. In contrast, our approach operates in action space and exploits the sequentiality of the RL problem.
An alternative view of our algorithm is obtained from the perspective of RL as inference. This perspective has recently received much attention in the literature [26; 10; 31; 4] and a number of expectation-maximization-based RL algorithms have been proposed (see e.g. [32; 34; 12]), including the original MPO algorithm [4]. Concretely, the objectives for the policy improvement algorithms mentioned above can be obtained from the perspective of performing Expectation Maximization (EM) on the likelihood $\log p(O = 1)$, where $O$ denotes an optimality event whose density is proportional to the Q-value. More details on this connection are given in Section B in the appendix. In particular, we can recover MPO by choosing an exponential transformation in the weighting step and removing the decoupled updates on mean and covariance. MPO, in turn, is related to relative entropy policy search (REPS) [34], with differences due to the construction of the sample-based policy (REPS considers a sample-based approximation to the joint state-action distribution) and the additional regularization in the policy fitting step, which is not present in REPS.
Conservative policy search algorithms such as Trust Region Policy Optimization [42], Proximal Policy Optimization [43] and their many derivatives make use of a similar KL constraint as in our Step 3 to stabilize learning. The supervised nature of our policy fitting procedure resembles methods from Approximate Dynamic Programming such as Regularized [14] and Classification-based Policy Iteration [25] (which has been scaled to high-dimensional discrete problems [15]) and the classic Cross-Entropy Method (CEM) [40]. The idea of separating fitting and improvement of the policy is shared with works such as [26] and [10; 29].
6 Results
To illustrate core features of our algorithm we first present results on two standard optimization problems. These highlight the benefit of decoupled maximum likelihood for Gaussian policies. We then perform experiments on 24 tasks from the DeepMind control suite [48], three high-dimensional parkour tasks from [19] and four high-dimensional tasks from OpenAI Gym [9]. Depictions of the task sets are in the appendix (Figures 9 and 10).
6.1 Standard Functions
To isolate the evaluation of our policy improvement procedure from errors in the estimation of $Q$, we performed experiments using two fixed standard functions. We make both functions state- and action-dependent by first defining an auxiliary variable that varies linearly with the action (the functions can thus be seen as “ground truth” Q-functions). We consider: i) the sphere function, and ii) the well-known Rosenbrock function [28]. For both functions, the globally optimal action for a given state attains the optimal Q-value of zero. Instead of using a replay buffer, we sample 100 states from a uniform state distribution over a fixed interval for each batch and sample 10 actions from our current policy for calculating weights.
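For reference, the two test functions can be written as fixed “ground truth” Q-functions of the auxiliary variable. The sketch below uses the textbook sphere and Rosenbrock definitions, negated so that the maximum value is zero; the negation, the function names, and the omission of the state-dependent mapping from actions to the auxiliary variable `u` are assumptions of this sketch and may differ in detail from the functions used in the experiments.

```python
def sphere_q(u):
    """Negated sphere function: maximal (zero) at u = 0."""
    return -sum(x * x for x in u)


def rosenbrock_q(u):
    """Negated Rosenbrock function: maximal (zero) at u = (1, ..., 1)."""
    return -sum(
        100.0 * (u[i + 1] - u[i] ** 2) ** 2 + (1.0 - u[i]) ** 2
        for i in range(len(u) - 1)
    )


sphere_at_optimum = sphere_q([0.0, 0.0])
rosenbrock_at_optimum = rosenbrock_q([1.0, 1.0, 1.0])
sphere_elsewhere = sphere_q([1.0, 2.0])       # -(1 + 4)
rosenbrock_elsewhere = rosenbrock_q([0.0, 0.0])  # -(0 + 1)
```

The sphere function has a single well-conditioned optimum, whereas the Rosenbrock function has a narrow curved valley, which makes it a harder test of whether the policy's covariance adapts to the value landscape.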
The results for the sphere function are depicted in Figure 1. We plot the learning progress of the Gaussian policy for a fixed state (every 20 iterations) for both the weighted MLE, which is also used by MPO [4], and the decoupled optimization approach. The decoupled optimization starts by increasing the variance. Only when the optimum is found does the variance start shrinking, and the distribution successfully converges on the optimum. The MLE procedure always shrinks the variance, causing premature convergence, even though we purposefully started with a larger variance for the MLE objective.
Figure 2 shows the average return over states for each iteration as well as the policy standard deviation for 10-dimensional versions of the Rosenbrock and sphere functions. We observe that the decoupled optimization successfully solves both tasks, although the initial standard deviation is small. In contrast, the MLE approach converges prematurely.
6.2 Continuous control benchmark tasks
6.2.1 Experimental setup
Unless noted otherwise we use the decoupled updates proposed in Section 4.2.1 in combination with the exponential transformation. Experiments with ranking-based weights are given in the appendix in Section D. The policy is a state-conditional Gaussian parameterized by a feed-forward neural network. We use a single learner process on a GPU and a single actor process on a CPU to gather data from the environment, performing asynchronous learning from a replay buffer. Unlike e.g. Barth-Maron et al. [7], we do not perform distributed data collection. We use a single fixed set of hyperparameters across all tasks to show the reliability of the proposed algorithm. Details on all network architectures used and the hyperparameters are given in the appendix in Section G. For each task, we repeat the experiments five times and report the mean performance and standard deviation.
6.2.2 Control Suite Tasks
Full results are given in Figure 11 and Figure 12 in the appendix. Here we focus on five domains for detailed comparisons: the acrobot (with 2 action dimensions), the swimmer (15 action dimensions), the cheetah (6 action dimensions), the humanoid (22 action dimensions) and the CMU humanoid (56 action dimensions), as illustrated in the appendix in Figure 9.
Ablations
We consider four variants of our algorithm, comparing: i) the full algorithm, ii) no KL constraints, iii) varying the strength of the KL on the mean, and iv) varying the strength of the KL on the covariance. First, we compare the optimization with KL bounds on the mean and the covariance with a variant in which there is no KL bound, i.e. fitting is performed via MLE. As depicted in Figure 2(a), without a constraint learning becomes unstable and considerably lower asymptotic performance is achieved. We found that decoupling the policy improvement objective for the mean and covariance can alleviate premature convergence. In Figures 2(c) and 2(d) we compare, for two environments, different settings for the bound on the mean while keeping the bound on the covariance fixed to the best value obtained via a grid search, and vice versa. The results show that bounding both mean and covariance matrix is important to achieve stable performance. This is consistent with existing studies which have previously found that avoiding premature convergence (typically by tuning exploration noise) can be vitally important [13]. In general, we find that constraints are important for reliable learning across tasks.
Comparison to DDPG and SVG
Figure 11 shows comparisons to two optimized reference implementations of DDPG [27] and SVG(0) [21], using the same asynchronous actor-learner separation and Q-learning procedure as for our algorithm. For both baselines we place a tanh on the mean of the Gaussian behaviour policy. While DDPG uses a fixed variance, for SVG we also set a minimum variance by adding a constant to the diagonal covariance. We found this to be necessary for stabilizing the baselines. No restrictions are placed on either mean or covariance in our method. For SVG we used entropy regularization with a fixed coefficient (we refer to Section G.2 in the appendix for details). All algorithms perform similarly in the low-dimensional tasks, but differences emerge in higher-dimensional tasks. Overall, when using a single set of hyperparameters for all tasks, our algorithm is more stable than the reference algorithms. Especially in problems with a high-dimensional action space it achieves better asymptotic performance than the baselines. (We note that better performance could be obtained by tuning on a per-task basis for all algorithms.)
6.2.3 Parkour tasks
In this section, we consider three Parkour tasks [19]. These tasks require the policy to steer a robotic body through an obstacle course, performing jumps and avoidance maneuvers. A depiction of the environments can be found in the appendix, Figure 10. We use the same setup and hyperparameters as in the previous section. This includes still using only one GPU for learning and one actor for interacting with the environment, a significant reduction in compute compared to the 32–128 actors previously used for solving these tasks [4; 7; 19]. We compare two variants of our method with two variants of SVG and DDPG: a version with TD(0) to fit the Q-function, and a version with Retrace [30].
As shown in Figure 5, only our method is able to solve all tasks. In addition, our algorithm is capable of solving these challenging tasks while running on a single workstation. Analyzing the results in more detail, we observe that learning the Q-function with TD(0) leads to overall slower learning than when using Retrace, as well as to lower asymptotic performance. This gap increases with task complexity. Among the parkour tasks, humanoid3D gaps is the hardest, as it requires controlling a humanoid body to jump across gaps (see Figure 10 in the appendix). Parkour2D is the easiest and only requires the policy to control a walker in a 2D environment. We observed a similar trend for other tasks, although the difference is less dramatic on low-dimensional tasks such as the ones in the control suite.
6.2.4 OpenAI Gym
Finally, we consider OpenAI Gym tasks [9] to compare our method with the soft actor-critic algorithm (SAC) [17], an actor-critic algorithm very similar to SVG(0) that optimizes an entropy-regularized expected reward objective. We use four tasks from OpenAI Gym [9], i.e., Ant, Walker2d, Humanoid run and Humanoid standup, for evaluating our method against SAC. For policy evaluation we use Retrace [30]. As in [17], we report the evaluation performance every 1000 environment steps and compare to their final performance. To obtain a data generation rate similar to [17], we slowed down the actor such that it generated one trajectory every 5 seconds. We used the same hyperparameters for learning as for the Parkour suite and the DeepMind control suite. Our results in Figure 6 show that we achieve considerably better asymptotic performance than that reported for SAC [17] in these environments, with on-par sample efficiency. With the same hyperparameters, our method also solves Humanoid standup with a final return of 4000000, which differs from the final returns of the other environments by a factor of 100–1000.
7 Conclusion
We have presented a policy iteration algorithm for high-dimensional continuous control problems. The algorithm alternates between Q-value estimation, local policy improvement and parametric policy fitting; hard constraints control the rate of change of the policy, and a decoupled update for the mean and covariance of a Gaussian policy avoids premature convergence. Our analysis shows that when an approximate Q-function is used, slow updates to the policy can be critical to achieve reliable learning. Our comparison on 31 continuous control tasks with rather diverse properties, using a limited amount of compute and a single set of hyperparameters, demonstrates the robustness of our method while it achieves state-of-the-art results.
References
 Abdolmaleki et al. [2015] Abdolmaleki, A., Lioutikov, R., Peters, J. R., Lau, N., Reis, L. P., and Neumann, G. (2015). Model-based relative entropy stochastic search. In Advances in Neural Information Processing Systems, pages 3537–3545.
 Abdolmaleki et al. [2017a] Abdolmaleki, A., Price, B., Lau, N., Reis, L. P., and Neumann, G. (2017a). Contextual covariance matrix adaptation evolutionary strategies. International Joint Conferences on Artificial Intelligence Organization (IJCAI).
 Abdolmaleki et al. [2017b] Abdolmaleki, A., Price, B., Lau, N., Reis, L. P., Neumann, G., et al. (2017b). Deriving and improving CMA-ES with information geometric trust regions. Proceedings of the Genetic and Evolutionary Computation Conference.
 Abdolmaleki et al. [2018a] Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. A. (2018a). Maximum a posteriori policy optimisation. CoRR, abs/1806.06920.
 Abdolmaleki et al. [2018b] Abdolmaleki, A., Springenberg, T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018b). Maximum a posteriori policy optimization. under review, https://openreview.net/forum?id=S1ANxQW0b.
 Bansal et al. [2018] Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., and Mordatch, I. (2018). Emergent complexity via multi-agent competition. In International Conference on Learning Representations (ICLR).
 Barth-Maron et al. [2018] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. (2018). Distributional policy gradients. In International Conference on Learning Representations (ICLR).
 Bellemare et al. [2017] Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML).
 Brockman et al. [2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. CoRR, abs/1606.01540.
 Chebotar et al. [2016] Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., and Levine, S. (2016). Path integral guided policy search. CoRR, abs/1610.00529.
 Ciosek and Whiteson [2018] Ciosek, K. and Whiteson, S. (2018). Expected policy gradients. In Conference on Artificial Intelligence (AAAI).
 Deisenroth et al. [2013] Deisenroth, M. P., Neumann, G., and Peters, J. (2013). A survey on policy search for robotics. Found. Trends Robot.
 Duan et al. [2016] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778.
 Farahmand et al. [2009] Farahmand, A. M., Ghavamzadeh, M., Mannor, S., and Szepesvári, C. (2009). Regularized policy iteration. In Advances in Neural Information Processing Systems 21 (NIPS).
 Gabillon et al. [2013] Gabillon, V., Ghavamzadeh, M., and Scherrer, B. (2013). Approximate dynamic programming finally performs well in the game of tetris. In Advances in Neural Information Processing Systems 26 (NIPS).
 Gu et al. [2017] Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE.
 Haarnoja et al. [2018] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290.
 Hansen et al. [1997] Hansen, N. and Ostermeier, A. (1997). Convergence properties of evolution strategies with the derandomized covariance matrix adaptation: CMA-ES.
 Heess et al. [2017] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., and Silver, D. (2017). Emergence of locomotion behaviours in rich environments.
 Heess et al. [2015] Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS).
 Heess et al. [2016] Heess, N., Wayne, G., Tassa, Y., Lillicrap, T., Riedmiller, M., and Silver, D. (2016). Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182.
 Kalashnikov et al. [2018] Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. ArXiv e-prints.
 Kingma and Ba [2014] Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kingma and Welling [2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 Lazaric et al. [2016] Lazaric, A., Ghavamzadeh, M., and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research.
 Levine and Abbeel [2014] Levine, S. and Abbeel, P. (2014). Learning neural network policies with guided policy search under unknown dynamics. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27 (NIPS).
 Lillicrap et al. [2015] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 Molga and Smutnicki [2005] Molga, M. and Smutnicki, C. (2005). Test functions for optimization needs. Test functions for optimization needs, 101.
 Montgomery and Levine [2016] Montgomery, W. H. and Levine, S. (2016). Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems 29 (NIPS).
 Munos et al. [2016] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29 (NIPS).
 Nachum et al. [2017] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 30 (NIPS).
 Neumann [2011] Neumann, G. (2011). Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 817–824.
 OpenAI et al. [2018] OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W. (2018). Learning Dexterous In-Hand Manipulation. ArXiv e-prints.
 Peters et al. [2010] Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI).
 Pomerleau [1989] Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems 1 (NIPS).
 Popov et al. [2017] Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe, T., Tassa, Y., Erez, T., and Riedmiller, M. (2017). Data-efficient Deep Reinforcement Learning for Dexterous Manipulation. ArXiv e-prints.

 Rezende et al. [2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML).
 Riedmiller et al. [2018] Riedmiller, M. A., Hafner, R., Lampe, T., Neunert, M., Degrave, J., de Wiele, T. V., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing - solving sparse reward tasks from scratch. In International Conference on Machine Learning (ICML).
 Riedmiller et al. [2007] Riedmiller, M. A., Montemerlo, M., and Dahlkamp, H. (2007). Learning to drive a real car in 20 minutes. In FBIT. IEEE Computer Society.

 Rubinstein and Kroese [2004] Rubinstein, R. Y. and Kroese, D. P. (2004). The Cross Entropy Method: A Unified Approach To Combinatorial Optimization, Monte-Carlo Simulation (Information Science and Statistics). Springer-Verlag.
 Salimans et al. [2017] Salimans, T., Ho, J., Chen, X., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
 Schulman et al. [2015] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
 Schulman et al. [2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347.
 Stone et al. [2005] Stone, P., Sutton, R. S., and Kuhlmann, G. (2005). Reinforcement learning for RoboCupsoccer keepaway. Adaptive Behavior, 13(3):165–188.
 Sutton and Barto [1998] Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA.
 Sutton et al. [1999] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA. MIT Press.
 Tangkaratt et al. [2017] Tangkaratt, V., Abdolmaleki, A., and Sugiyama, M. (2017). Guide actor-critic for continuous control. arXiv preprint arXiv:1705.07606.
 Tassa et al. [2018] Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T. P., and Riedmiller, M. A. (2018). DeepMind control suite. CoRR, abs/1801.00690.
 Wierstra et al. [2008] Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Fitness expectation maximization. In International Conference on Parallel Problem Solving from Nature, pages 337–346. Springer.
Appendix A Relation to Policy Gradients
An interesting possibility is to use an identity transformation in Step 2 of our algorithm (instead of ranking or an exponential transformation). Clearly, the identity is rank preserving. It does not, however, satisfy the two additional requirements outlined in the main paper: i) positivity of the weights and ii) normalization of the weights (with the identity, weights can be negative and are not normalized). This hints at the fact that it would make our procedure susceptible to instabilities caused by the scaling of Q-values, and that it can result in “aggressive” changes of the policy distribution away from bad samples (for which we have negative weights). Considering the identity is, nonetheless, an interesting exercise, as it highlights a connection of our algorithm to a likelihood-ratio policy gradient [46] approach, since the gradient of the Step 3 objective would then be

$$\mathbb{E}_{\mu(s)} \Big[ \mathbb{E}_{a \sim \pi_{\mathrm{old}}(\cdot \mid s)} \big[ Q(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big] \Big],$$

which looks similar to the expected policy gradient (EPG) [11], where multiple actions are also used to estimate the expectation. On closer inspection, however, one can observe that the above expectation is w.r.t. samples from the old policy $\pi_{\mathrm{old}}$ and not w.r.t. $\pi_\theta$ (which would be required for EPG). Equivalence to a policy gradient can hence only be achieved for the first gradient step (for which $\pi_\theta = \pi_{\mathrm{old}}$).
Appendix B Policy Improvement as Inference
In the paper, we motivated the policy update rules from a more intuitive perspective. In this section we use inference to derive our policy improvement algorithm. The E- and M-step that we derive here directly correspond to Step 2 and Step 3 in the main paper. First, we assume that the Q-function is given and that we would like to improve our policy given this Q-function (see Algorithm 1 in the paper). In order to interpret this goal in a mathematical sense, we assume there is an observable binary improvement event $I$. When $I = 1$, our policy has improved and we have achieved our goal. Now we ask: if the policy had improved, i.e. $I = 1$, what would the parameters $\theta$ of that improved policy be? More concretely, we can optimize the maximum a posteriori $p(\theta \mid I = 1)$, or equivalently:

$$\max_\theta \; p_\theta(I = 1)\, p(\theta).$$

After marginalizing out action and state and considering the random variable dependencies, this is equivalent to optimizing

$$\max_\theta \; \log \int \mu(s) \int \pi(a \mid s, \theta)\, p(I = 1 \mid s, a)\, da\, ds + \log p(\theta).$$

Here $\mu(s)$ is the stationary state distribution, which is given in each policy improvement step. In our case, $\mu(s)$ is the distribution of the states in the replay buffer. $\pi(a \mid s, \theta)$ is the policy parametrized by $\theta$, and $p(\theta)$ is a prior distribution over the parameters $\theta$. This prior is fixed during the policy improvement step, and we set it such that we stay close to the old policy during each policy improvement step. $p(I = 1 \mid s, a)$ is the probability density of the improvement event if our policy chooses the action $a$ in the state $s$. In other words, $p(I = 1 \mid s, a)$ defines the probability that, in state $s$, taking action $a$ over other possible actions would improve the policy. As we prefer actions with higher Q-values, this probability density function can be defined by

$$p(I = 1 \mid s, a) \propto f\big(Q(s, a)\big),$$

where $f$ is a monotonically increasing and therefore rank-preserving function of the Q-function. This is a sensible choice, as choosing an action with a higher Q-value should have a higher probability of improving the policy in that state.

However, explicitly solving this equation for $\theta$ is hard. Yet, the expectation-maximisation algorithm does give an efficient method for maximizing $\log p_\theta(I = 1)$ in this setting. Therefore, our strategy is to repeatedly construct a lower bound on this probability density in the E-step, and then optimize that lower bound in the M-step. Following prior work in the field, we construct a lower bound on $\log p_\theta(I = 1)$ using the following decomposition,

$$\log p_\theta(I = 1) = \mathrm{KL}\big(q(s, a) \,\|\, p_\theta(s, a \mid I = 1)\big) + \int\!\!\int q(s, a) \log \frac{p_\theta(I = 1, s, a)}{q(s, a)}\, da\, ds,$$

with $q(s, a) = \mu(s)\, q(a \mid s)$ and $p_\theta(I = 1, s, a) = \mu(s)\, \pi(a \mid s, \theta)\, p(I = 1 \mid s, a)$, where $q$ is an arbitrary variational distribution. Please note that the second term is a lower bound, as the first term is always non-negative. In effect, $\pi(a \mid s, \theta)$ and $q(a \mid s)$ are unknown, even though $\mu(s)$ is given.
We can now focus on the underlying meaning of $q(a \mid s)$. If in each state we knew which action would lead to a policy improvement, we would fit a policy which outputs that action in each state. However, we do not have access to that knowledge. Instead, in the E-step we use the Q-function to infer a distribution over the actions of which we know that choosing those actions would improve the policy. In the M-step, we then fit a policy to this distribution such that those actions are selected by the newly fitted policy; hence the policy is improved.
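B.1 E-Step (Step 2 in main paper)
In the E-step (which corresponds to Step 2 in the main paper), we choose the variational distribution $q(a \mid s)$ (approximated via the sample-based distribution $q_{ij}$ in the main paper) such that the lower bound on $\log p_\theta(I = 1)$ is as tight as possible. We know this is the case when the KL term is zero given the old policy $\pi_{\mathrm{old}}$. Therefore we minimize the KL term given the old policy, i.e.,

$$q = \arg\min_q \; \mathrm{KL}\big(\mu(s)\, q(a \mid s) \,\|\, p_{\theta_{\mathrm{old}}}(s, a \mid I = 1)\big),$$

which is equivalent to minimizing

$$\int \mu(s)\, \mathrm{KL}\big(q(a \mid s) \,\|\, \pi_{\mathrm{old}}(a \mid s)\big)\, ds - \int \mu(s) \int q(a \mid s) \log p(I = 1 \mid s, a)\, da\, ds. \tag{5}$$

We can solve this optimization problem in closed form, which gives us

$$q(a \mid s) = \frac{\pi_{\mathrm{old}}(a \mid s)\, p(I = 1 \mid s, a)}{\int \pi_{\mathrm{old}}(a' \mid s)\, p(I = 1 \mid s, a')\, da'}.$$

Please note that here we only solve for $q(a \mid s)$, as the state distribution $\mu(s)$ is given and should remain unchanged.
This solution weighs the actions based on their relative improvement probability $p(I = 1 \mid s, a)$. At this point we can define $p(I = 1 \mid s, a)$ using any arbitrary positive function $f$. For example, we could rank the actions based on their Q-values and assign positive values to the actions based on their ranking. Alternatively we could define $p(I = 1 \mid s, a) \propto \exp\big(Q(s, a)/\eta\big)$. Note that the temperature term $\eta$ is used to keep the solutions diverse, as we would like to represent the policy with a distribution of solutions instead of only one single solution; yet we still express a preference over solutions by weighing them. However, tuning the temperature $\eta$ is difficult. In order to optimize $\eta$, we plug the exponential transformation into Equation (5) and, after rearranging terms, our optimization problem is

$$q = \arg\max_q \int \mu(s) \int q(a \mid s)\, Q(s, a)\, da\, ds - \eta \int \mu(s)\, \mathrm{KL}\big(q(a \mid s) \,\|\, \pi_{\mathrm{old}}(a \mid s)\big)\, ds,$$

or, instead of treating the KL bound as a penalty, we can enforce the bound as a constraint:

$$q = \arg\max_q \int \mu(s) \int q(a \mid s)\, Q(s, a)\, da\, ds \quad \text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(q(a \mid s) \,\|\, \pi_{\mathrm{old}}(a \mid s)\big)\, ds < \epsilon. \tag{6}$$

Note that when $q$ is parametric this is the policy optimization objective for MPO-parametric [4], TRPO [42], PPO [43] and SAC [17] (if the old policy is a uniform distribution). Note that in our case $q$ is a non-parametric and sample-based distribution, and we can solve this constrained optimization in closed form for each sample state $s_j$, $q(a \mid s_j) \propto \pi_{\mathrm{old}}(a \mid s_j) \exp\big(Q(s_j, a)/\eta\big)$, and easily optimize for the correct $\eta$ using the convex dual function

$$g(\eta) = \eta \epsilon + \eta \int \mu(s) \log \int \pi_{\mathrm{old}}(a \mid s) \exp\Big(\frac{Q(s, a)}{\eta}\Big)\, da\, ds.$$

Please see section B.2.1 for the details of the dual function derivation. Now, if we estimate the integrals using state samples from the replay buffer and action samples from our old policy, we recover the policy and dual function given in Step 2 of the main paper.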
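The sample-based E-step can be illustrated with a minimal sketch, given here under stated assumptions rather than as the implementation used in our experiments. It assumes a small table `q_values[j][i]` of Q-values for actions sampled from the old policy (so that the old policy enters as uniform weights over the samples), and it finds the temperature by bisection on the stationarity condition of the dual, i.e. the point where the average KL between the weights and the uniform sample distribution equals $\epsilon$. All function names and the bisection bounds are hypothetical choices made for this illustration.

```python
import math

def softmax_weights(q_row, eta):
    """Exponential transformation: weights proportional to exp(Q / eta), normalised per state."""
    m = max(q_row)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / eta) for q in q_row]
    z = sum(exps)
    return [e / z for e in exps]

def dual(eta, q_values, eps):
    """Sample-based dual: eta * eps + eta * mean_j log( mean_i exp(Q_ji / eta) )."""
    total = 0.0
    for row in q_values:
        m = max(row)
        total += m / eta + math.log(sum(math.exp((q - m) / eta) for q in row) / len(row))
    return eta * eps + eta * total / len(q_values)

def mean_kl_to_uniform(eta, q_values):
    """Average KL between the weights and the uniform distribution over sampled actions."""
    total = 0.0
    for row in q_values:
        w = softmax_weights(row, eta)
        total += sum(p * math.log(p * len(row)) for p in w if p > 0.0)
    return total / len(q_values)

def solve_temperature(q_values, eps, lo=1e-3, hi=1e3, iters=100):
    """Bisection in log space on dual'(eta) = eps - mean KL, which is increasing in eta."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_kl_to_uniform(mid, q_values) > eps:
            lo = mid  # weights too peaked: raise the temperature
        else:
            hi = mid  # constraint satisfied: try a lower temperature
    return math.sqrt(lo * hi)

# Hypothetical Q-values for 2 states with 4 sampled actions each.
q_values = [[1.0, 2.0, 0.5, 3.0], [0.0, -1.0, 0.2, 0.1]]
eta = solve_temperature(q_values, eps=0.1)
weights = [softmax_weights(row, eta) for row in q_values]
```

Because the dual is convex in the temperature, any one-dimensional convex solver (or a few gradient steps, as is common in practice) could replace the bisection used here.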
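B.2 M-step (Step 3 in main paper)
Since we obtained the variational distribution $q(a \mid s)$, we have found a tight lower bound to our density function $\log p_\theta(I = 1)$. Now we can optimize the parameters $\theta$ of the policy $\pi(a \mid s, \theta)$ in order to maximize this lower bound,

$$\theta_{\mathrm{new}} = \arg\max_\theta \int \mu(s) \int q(a \mid s) \log \pi(a \mid s, \theta)\, da\, ds + \log p(\theta).$$

This corresponds to the maximum likelihood estimation step (Step 3) in the main paper.
As a prior on the parameters $\theta$, we can say that the new policy $\pi(a \mid s, \theta)$ should be close to the old policy $\pi_{\mathrm{old}}$, or more formally, we can choose

$$\log p(\theta) \approx -\lambda \int \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a \mid s) \,\|\, \pi(a \mid s, \theta)\big)\, ds.$$

Using this approximation, we find a new optimization problem:

$$\theta_{\mathrm{new}} = \arg\max_\theta \int \mu(s) \int q(a \mid s) \log \pi(a \mid s, \theta)\, da\, ds - \lambda \int \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a \mid s) \,\|\, \pi(a \mid s, \theta)\big)\, ds.$$

Alternatively we can use a hard constraint to obtain:

$$\theta_{\mathrm{new}} = \arg\max_\theta \int \mu(s) \int q(a \mid s) \log \pi(a \mid s, \theta)\, da\, ds \quad \text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a \mid s) \,\|\, \pi(a \mid s, \theta)\big)\, ds < \epsilon.$$

Because of the prior, we do not greedily optimize the M-step objective. Therefore our approach belongs to the category of generalized expectation maximization algorithms. Now, if we approximate the integrals in the E-step and M-step using the state samples from the replay buffer and the action samples from the old policy, we obtain the exact update rules we proposed in the paper. Algorithm 2 illustrates the algorithmic steps.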
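B.2.1 Dual Function Derivation
The E-step with a non-parametric variational distribution solves the following program:

$$\max_q \int \mu(s) \int q(a \mid s)\, Q(s, a)\, da\, ds \quad \text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(q(a \mid s) \,\|\, \pi_{\mathrm{old}}(a \mid s)\big)\, ds < \epsilon, \quad \int\!\!\int \mu(s)\, q(a \mid s)\, da\, ds = 1.$$

First we write the Lagrangian, i.e.,

$$L(q, \eta, \gamma) = \int \mu(s) \int q(a \mid s)\, Q(s, a)\, da\, ds + \eta \Big(\epsilon - \int \mu(s) \int q(a \mid s) \log \frac{q(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)}\, da\, ds\Big) + \gamma \Big(1 - \int\!\!\int \mu(s)\, q(a \mid s)\, da\, ds\Big).$$

Next we maximise the Lagrangian w.r.t. the primal variable $q$. The derivative w.r.t. $q$ reads

$$\partial_q L(q, \eta, \gamma) = Q(s, a) - \eta \log q(a \mid s) + \eta \log \pi_{\mathrm{old}}(a \mid s) - (\eta + \gamma).$$

Setting it to zero and rearranging terms we get

$$q(a \mid s) = \pi_{\mathrm{old}}(a \mid s) \exp\Big(\frac{Q(s, a)}{\eta}\Big) \exp\Big(-\frac{\eta + \gamma}{\eta}\Big).$$

However, the last exponential term is a normalisation constant for $q$. Therefore we can write

$$\frac{\eta + \gamma}{\eta} = \log \int \pi_{\mathrm{old}}(a \mid s) \exp\Big(\frac{Q(s, a)}{\eta}\Big)\, da. \tag{7}$$

Now, to obtain the dual function $g(\eta)$, we plug the solution into the KL constraint term of the Lagrangian, which results in

$$L(q, \eta, \gamma) = \int \mu(s) \int q(a \mid s)\, Q(s, a)\, da\, ds + \eta \Big(\epsilon - \int \mu(s) \int q(a \mid s) \Big[\frac{Q(s, a)}{\eta} - \frac{\eta + \gamma}{\eta}\Big]\, da\, ds\Big) + \gamma \Big(1 - \int\!\!\int \mu(s)\, q(a \mid s)\, da\, ds\Big).$$

Most of the terms cancel out and, after rearranging the terms, we obtain

$$L(q, \eta, \gamma) = \eta \epsilon + \eta \int \mu(s)\, \frac{\eta + \gamma}{\eta}\, ds.$$

Note that we have already calculated the term inside the integral in Equation (7). By plugging in Equation (7) we obtain the dual function,

$$g(\eta) = \eta \epsilon + \eta \int \mu(s) \log \int \pi_{\mathrm{old}}(a \mid s) \exp\Big(\frac{Q(s, a)}{\eta}\Big)\, da\, ds.$$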
Appendix C Relation to Evolutionary Strategy algorithms
On a high level, the difference between our algorithm and evolutionary strategy (ES) algorithms is that the problem handled by our algorithm is stateful, whereas in ES one typically considers a bandit problem. Another difference is that in ES the value function (or, in the stateless case, the reward function) does not change, and the goal is to find the optimal solution given a fixed reward function. In our setting, however, the Q-function changes when the policy changes. Nonetheless, if we consider only a one-step policy improvement for one single and fixed state, given a Q-function and while staying close to the old policy, then we can recover the mean and covariance update rules of CMA-ES (assuming that the Gaussian policy is not parameterized by a non-linear function as in our main paper).
Concretely, considering a bandit problem, we can perform state-free optimization by directly sampling a set of actions from the old policy and evaluating them given the reward function. After that we can use any weighting method, such as ranking or an exponential transformation, to reweight the actions. Subsequently we can solve the decoupled objectives in section 4 of the main text in closed form when we use a soft constraint on the KL in Step 3 (and, as mentioned above, assuming mean and covariance are our only parameters), i.e.,

$$\mu_{\mathrm{new}} = \frac{\sum_i q_i a_i + \eta_\mu\, \mu_{\mathrm{old}}}{1 + \eta_\mu}, \qquad \Sigma_{\mathrm{new}} = \frac{\sum_i q_i (a_i - \mu_{\mathrm{old}})(a_i - \mu_{\mathrm{old}})^T + \eta_\Sigma\, \Sigma_{\mathrm{old}}}{1 + \eta_\Sigma}.$$

Here $\eta_\mu$ and $\eta_\Sigma$ define how much we move from the old distribution. This is the exact update rule for CMA-ES, with the difference that CMA-ES sets $\eta_\mu$ to zero, resulting in an unregularized update rule for the mean of the Gaussian policy. This choice makes sense when one is optimizing against the true reward function. In the reinforcement learning setting, however, the Q-function has to be estimated and typically has high variance. Therefore a constraint on the mean that limits exploitation of the Q-function is necessary, as we showed in our experiments. Please see [3, 2] for more details on the derivations.
The above is not only the case for CMA-ES. Depending on the weighting strategy, the interpolation factors $\eta_\mu$, $\eta_\Sigma$, and the use of $\mu_{\mathrm{old}}$ or $\mu_{\mathrm{new}}$, we recover the update rules not only for CMA-ES, but also for Episodic PI$^2$, Episodic PI$^2$-CMA, Episodic REPS, Episodic PoWER, Cross Entropy methods and EDAs [12]. We recover TR-CMA-ES [3] in the case where, instead of the soft constraints in the M-step, we use hard constraints on the KL. If we use our formulation for contextual RL with a linear function approximator and a state-independent covariance, we recover the update rules from contextual CMA-ES [2].
One interesting observation is that the per-state solution we obtain (assuming no generalization over states is performed via a neural network) is a convex interpolation between the last policy and the sample-based policy. The change of this distribution is upper bounded for each state, i.e., in the case of a Gaussian distribution the policy for each state is at most the sample Gaussian distribution, even when we set the constraint on the KL to infinity. As a matter of fact, the current policy is changing towards the sample policy in Step 3. This can be interpreted as following a natural gradient where the maximum change is upper bounded and the direction of the change is the optimal improved distribution.
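The bandit-case update discussed in this appendix can be illustrated with a short, hedged sketch for a one-dimensional Gaussian. It assumes exponential weights with a fixed temperature and the soft-constraint interpolation between weighted sample statistics and the old mean and variance; the reward function, the interpolation factors `eta_mu` and `eta_sigma`, and all function names are hypothetical choices for this illustration and not the settings of our experiments.

```python
import math
import random

def exp_weights(values, eta):
    """Normalised exponential weights over sampled actions."""
    m = max(values)
    e = [math.exp((v - m) / eta) for v in values]
    z = sum(e)
    return [x / z for x in e]

def gaussian_update(actions, weights, mu_old, var_old, eta_mu, eta_sigma):
    """Interpolate between weighted sample statistics and the old Gaussian (1-D case)."""
    sample_mean = sum(w * a for w, a in zip(weights, actions))
    # The weighted second moment is taken around the old mean.
    sample_var = sum(w * (a - mu_old) ** 2 for w, a in zip(weights, actions))
    mu_new = (sample_mean + eta_mu * mu_old) / (1.0 + eta_mu)
    var_new = (sample_var + eta_sigma * var_old) / (1.0 + eta_sigma)
    return mu_new, var_new

def reward(a):
    """Hypothetical bandit reward with its optimum at a = 2."""
    return -(a - 2.0) ** 2

rng = random.Random(0)
mu, var = 0.0, 1.0
for _ in range(200):
    actions = [rng.gauss(mu, math.sqrt(var)) for _ in range(20)]
    w = exp_weights([reward(a) for a in actions], eta=1.0)
    mu, var = gaussian_update(actions, w, mu, var, eta_mu=1.0, eta_sigma=10.0)
```

Setting `eta_mu=0.0` in this sketch makes the new mean equal to the weighted sample mean, which is the unregularized mean update described above; larger values keep the mean closer to the old one.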
Appendix D Ranking versus Exponential Transformation
Figure 7 compares two different strategies for weighting actions, exponential transformation and ranking. For the ranking results we weight the actions for each state using the following formula:
where the rank of each action is determined by its Q-value, N is the number of actions per state (20 in our case), and the temperature parameter is set to 10.
We did not observe a noticeable difference between the two transformations. However, we recommend the exponential transformation over ranking, mainly because it allows for efficient optimization of the temperature parameter.
Appendix E Additional Visualizations Regarding Premature Convergence
Figure 8 visualizes the evolution of the policy for state [0,0] when optimizing a stateful Q-function that is quadratic in action space. The results show that when we use MLE without decoupling the updates for mean and covariance, the policy suffers from premature convergence. When we decouple the updates, however, the variance naturally grows and shrinks.
Appendix F Additional Experiments on DeepMind control suite tasks
Figure 12 provides additional results on the control suite tasks, comparing against two baselines.
Appendix G Experiment details
In this section we outline the details of the hyperparameters used for our algorithm and the baselines, DDPG and SVG. All continuous control experiments use a feed-forward neural network, except for the Parkour tasks as described in section G.4. The policy is given by a Gaussian distribution with a diagonal covariance matrix, i.e., $\pi(a \mid s, \theta) = \mathcal{N}(\mu, \Sigma)$. The neural network outputs the mean $\mu = \mu(s)$ and the diagonal Cholesky factor $A = A(s)$, such that $\Sigma = A A^T$. The diagonal elements of $A$ are made positive via the softplus transform $A_{ii} \leftarrow \log(1 + \exp(A_{ii}))$, which enforces positive definiteness of the diagonal covariance matrix.
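The parameterization above can be sketched as follows. This is a minimal illustration under the assumption that the network emits one raw value per action dimension for the mean and one for the Cholesky diagonal; the function names and the example inputs are hypothetical.

```python
import math
import random

def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def gaussian_policy_head(raw_mean, raw_diag):
    """Map raw network outputs to (mean, Cholesky diagonal, diagonal covariance)."""
    chol_diag = [softplus(r) for r in raw_diag]  # positive diagonal of A
    cov_diag = [a * a for a in chol_diag]        # Sigma = A A^T is diagonal
    return list(raw_mean), chol_diag, cov_diag

def sample_action(mean, chol_diag, rng):
    """Draw a = mu + A z with z ~ N(0, I)."""
    return [m + a * rng.gauss(0.0, 1.0) for m, a in zip(mean, chol_diag)]

# Hypothetical raw outputs for a 2-dimensional action space.
mean, chol, cov = gaussian_policy_head([0.1, -0.4], [-3.0, 2.0])
action = sample_action(mean, chol, random.Random(0))
```

Note that this sketch places no lower bound on the variance beyond what the softplus provides.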
Tables 1, 2 and 3 show the hyperparameters we used for the three algorithms. We found that layer normalization and a tanh on the output of the layer normalization are important for the stability of all algorithms.
We also found that 1) a tanh operation on the mean of the distribution and 2) forcing a minimum variance are required for DDPG and SVG. We emphasize that our algorithm does not use any such tricks.
For our algorithm the most important hyperparameters are the constraints in Step 2 and Step 3.
Hyperparameters  SVG
Policy net  200-200-200
Q function net  500-500-500
Entropy Regularization Factor  0.001
Discount factor ($\gamma$)  0.99
Adam learning rate  0.0003
Replay buffer size  2000000
Target network update period  250
Batch size  3072
Activation function  elu
Tanh on output of layer norm  Yes
Layer norm on first layer  Yes
Tanh on Gaussian mean  Yes
Min variance  0.1
Max variance  unbounded
G.1 Settings for standard functions
We use a two-layer neural network with 50 neurons to map the current state of the network to the mean and diagonal covariance of the Gaussian policy. The parameters of this neural network are then optimized using the procedure described in the algorithm section. We set , and . Please note that the mean is effectively unregulated because of a loose KL bound. The reason is that here we have access to a perfect Q-function and can therefore exploit it as much as possible. This situation is different in the RL setting, where the Q-function is estimated and can be noisy.

Hyperparameters  Ours
Policy net  200-200-200
Number of actions sampled per state  20
Q function net  500-500-500
$\epsilon$  0.1
$\epsilon_\mu$  0.0005
$\epsilon_\Sigma$  0.00001
Discount factor ($\gamma$)  0.99
Adam learning rate  0.0003
Replay buffer size  2000000
Target network update period  250
Batch size  3072
Activation function  elu
Layer norm on first layer  Yes
Tanh on output of layer norm  Yes
Tanh on Gaussian mean  No
Min variance  Zero
Max variance  unbounded
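G.2 Additional Details on the SVG baseline
For the stochastic value gradients (SVG-0) baseline we use the same policy parameterization as for our algorithm, i.e. we have

$$\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s), \sigma_\theta^2(s)\, I\big),$$

where $I$ denotes the identity matrix and $\sigma_\theta(s)$ is computed from the network output via a softplus activation function.
To obtain a baseline that is, in spirit, similar to our algorithm, we used SVG in combination with entropy regularization. That is, we optimize the policy via gradient ascent, following the reparameterized gradient for a given state s sampled from the replay:

$$\nabla_\theta\, \mathbb{E}_{\pi_\theta(a \mid s)}\big[Q(a, s)\big] + \alpha\, \nabla_\theta H\big(\pi_\theta(a \mid s)\big), \tag{8}$$

which can be computed, using the reparameterization trick, as

$$\mathbb{E}_{\zeta \sim \mathcal{N}(0, I)}\big[\nabla_\theta g_\theta(s, \zeta)\, \nabla_g Q(g_\theta(s, \zeta), s)\big] + \alpha\, \nabla_\theta H\big(\pi_\theta(a \mid s)\big), \tag{9}$$

where $g_\theta(s, \zeta) = \mu_\theta(s) + \sigma_\theta(s)\, \zeta$ is now a deterministic function of a sample $\zeta$ from the standard multivariate normal distribution. See e.g.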
[20] (for SVG) as well as [37, 24] (for the reparameterization trick) for a detailed explanation.
G.3 Additional Details on the DDPG baseline
For the DDPG baseline we parameterize only the mean of the policy $\mu_\theta(s)$, using fixed univariate Gaussian exploration noise of $\sigma_{\mathrm{DDPG}} = 0.3$ in all dimensions; that is, the behaviour policy for collecting experience can be described as

$$\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s), \sigma_{\mathrm{DDPG}}^2\, I\big),$$

where $I$ denotes the identity matrix. To improve the mean of the policy we follow the deterministic policy gradient:

$$\nabla_\theta Q\big(\mu_\theta(s), s\big) = \nabla_\theta \mu_\theta(s)\, \nabla_a Q(a, s)\big|_{a = \mu_\theta(s)}. \tag{10}$$

Policy evaluation is performed with Q-learning as described in the main text. The hyperparameters used for DDPG are described in Table 3. We highlight that good performance for DDPG can be achieved when hyperparameters are tuned correctly; we found that a tanh activation function on the mean combined with layer normalization in the first layer of the policy and Q-function are crucial in this regard.
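As a toy illustration of the two ingredients described above (a behaviour policy with fixed Gaussian exploration noise, and the chain-rule form of the deterministic policy gradient), consider the following sketch. It assumes a scalar linear policy and a hand-written quadratic critic; both are hypothetical stand-ins for the neural networks used in our experiments, and the step size is an arbitrary choice.

```python
import random

def policy_mean(theta, s):
    """Hypothetical deterministic policy: mu_theta(s) = theta * s."""
    return theta * s

def q_value(s, a):
    """Hypothetical critic with its optimum at a = 2 s."""
    return -(a - 2.0 * s) ** 2

def behaviour_action(theta, s, sigma, rng):
    """Behaviour policy: mean action plus fixed Gaussian exploration noise."""
    return policy_mean(theta, s) + sigma * rng.gauss(0.0, 1.0)

def dpg(theta, s):
    """Deterministic policy gradient: dQ/da at a = mu_theta(s), times dmu/dtheta."""
    a = policy_mean(theta, s)
    dq_da = -2.0 * (a - 2.0 * s)
    dmu_dtheta = s
    return dq_da * dmu_dtheta

rng = random.Random(0)
theta = 0.0
for _ in range(500):
    s = rng.uniform(-1.0, 1.0)
    theta += 0.1 * dpg(theta, s)  # gradient ascent on Q(s, mu_theta(s))

# Exploration action for data collection, with the noise scale taken as 0.3 here.
a_explore = behaviour_action(theta, 0.5, sigma=0.3, rng=rng)
```

In this toy problem the critic is known exactly; in the actual baseline the gradient with respect to the action would come from the learned Q-function.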
Hyperparameters  DDPG
Policy net  200-200-200
Q function net  500-500-500
Discount factor ($\gamma$)  0.99
Adam learning rate  0.0001
Replay buffer size  2000000
Target network update period  250
Batch size  3072
Activation function  elu
Tanh on networks input  Yes
Tanh on output of layer norm  Yes
Tanh on Gaussian mean  Yes
Min variance  0.3
Max variance  0.3
G.4 Network Architecture for Parkour
For the parkour experiments we used the same hyperparameters but changed the architecture of the feed-forward network. This is due to the fact that for these problems both proprioceptive information about the robot’s state and information about the terrain height are available. We thus used the same network architecture as in [7, 4], which in turn was derived from the networks in [19]: each of the two input streams is passed through a two-layer feed-forward neural network with 200 units each for the critic (100 units each for the actor), before being passed through one layer of 100 units combining both modalities, on top of which a final layer computes the Q-value (or, in the case of the policy, produces the mean and diagonal covariance).
Appendix H Additional Experiments on the Control Suite
We provide a full evaluation on 27 tasks from the DeepMind control suite (see Figure 9) and the parkour suite (see Figure 10). Please see the main text for the results on the parkour tasks. Figures 11 and 12 in the appendix show the full results for the control suite tasks. The results suggest that while the baselines perform well on some tasks, only our algorithm performs well across all tasks, achieving better asymptotic performance on high-dimensional tasks.