1 Introduction
Model-free reinforcement learning algorithms can acquire sophisticated behaviours by interacting with the environment while receiving simple rewards. Recent experiments (Mnih et al., 2015; Jaderberg et al., 2016; Heess et al., 2017) successfully combined these algorithms with powerful deep neural-network approximators while benefiting from the increase in compute capacity.
Unfortunately, the generality and flexibility of these algorithms come at a price: they can require a large number of samples and – especially in continuous action spaces – suffer from high gradient variance. Taken together these issues can lead to unstable learning and/or slow convergence. Nonetheless, recent years have seen significant progress, with improvements to different aspects of learning algorithms including stability, data-efficiency and speed, enabling notable results on a variety of domains, including locomotion (Heess et al., 2017; Peng et al., 2016), multi-agent behaviour (Bansal et al., 2017) and classical control (Duan et al., 2016).
Two types of algorithms currently dominate scalable learning for continuous control problems: first, Trust-Region Policy Optimisation (TRPO; Schulman et al. 2015) and the derivative family of Proximal Policy Optimisation algorithms (PPO; Schulman et al. 2017b). These policy-gradient algorithms are on-policy by design, reducing gradient variance through large batches and limiting the allowed change in parameters. They are robust, applicable to high-dimensional problems, and require only moderate parameter tuning, making them a popular first choice (Ho & Ermon, 2016). However, as on-policy algorithms, they suffer from poor sample efficiency.
In contrast, off-policy value-gradient algorithms such as the Deep Deterministic Policy Gradient (DDPG, Silver et al. 2014; Lillicrap et al. 2016), Stochastic Value Gradient (SVG, Heess et al. 2015), and the related Normalized Advantage Function formulation (NAF, Gu et al. 2016b) rely on experience replay and learned (action-)value functions. These algorithms exhibit much better data efficiency, approaching the regime where experiments with real robots are possible (Gu et al., 2016a; Andrychowicz et al., 2017). While also popular, these algorithms can be difficult to tune, especially for high-dimensional domains like general robot manipulation tasks.
In this paper we propose a novel off-policy algorithm that benefits from the best properties of both classes. It exhibits the scalability, robustness and hyperparameter insensitivity of on-policy algorithms, while offering the data-efficiency of off-policy, value-based methods.
To derive our algorithm, we take advantage of the duality between control and estimation by using Expectation Maximisation (EM), a powerful tool from the probabilistic estimation toolbox, in order to solve control problems. This duality can be understood as replacing the question “what are the actions which maximise future rewards?” with the question “assuming future success in maximising rewards, what are the actions most likely to have been taken?”. By using this estimation objective we have more control over the policy change in both E and M steps, yielding robust learning. We show below that several algorithms, including TRPO, can be directly related to this perspective. We leverage the fast convergence properties of EM-style coordinate ascent by alternating a non-parametric data-based E-step, which re-weights state-action samples, with a supervised, parametric M-step using deep neural networks.
In contrast to typical off-policy value-gradient algorithms, the new algorithm does not require the gradient of the Q-function to update the policy. Instead it uses samples from the Q-function to compare different actions in a given state, and subsequently updates the policy such that better actions in that state have a higher probability of being chosen.
We evaluate our algorithm on a broad spectrum of continuous control problems including a 56 DoF humanoid body. All experiments used the same optimisation hyperparameters (with the exception of the number of samples collected between updates). Our algorithm shows remarkable data efficiency, often solving the tasks we consider an order of magnitude faster than the state-of-the-art. A video of some resulting behaviours can be found at youtu.be/he_BPw32PwU.
2 Background and Notation
2.1 Related Work
Casting Reinforcement Learning (RL) as an inference problem has a long history dating back at least two decades (Dayan & Hinton, 1997). The framework presented here is inspired by a variational inference perspective on RL that has previously been utilised in multiple studies; cf. Dayan & Hinton (1997); Neumann (2011); Deisenroth et al. (2013); Rawlik et al. (2012); Levine & Koltun (2013); Florensa et al. (2017).
Particular attention has been paid to obtaining maximum entropy policies as the solution to an inference problem. The penalisation of determinism can be seen as encouraging both robustness and simplicity. Among these are methods that perform trajectory optimisation using either linearised dynamics (Todorov, 2008; Toussaint, 2009; Levine & Koltun, 2013) or general dynamics as in path integral control (Kappen, 2005; Theodorou et al., 2010). In contrast to these algorithms, here we do not assume the availability of a transition model and avoid on-policy optimisation. A number of other authors have considered the same perspective but in a model-free RL setting (Neumann, 2011; Peters et al., 2010a; Florensa et al., 2017; Daniel et al., 2016) or inverse RL problems (Ziebart et al., 2008). These algorithms are more directly related to our work and can be cast in the same (EM-like) alternating optimisation scheme on which we base our algorithm. However, they typically lack the maximisation (M)-step – with the prominent exception of REPS, AC-REPS, PIGPS and MDGPS (Peters et al., 2010a; Wirth et al., 2016; Chebotar et al., 2016; Montgomery & Levine, 2016) to which our algorithm is closely related as outlined below. An interesting recent addition to these approaches is an EM-perspective on the PoWER algorithm (Roux, 2016) which uses the same iterative policy improvement employed here, but commits to parametric inference distributions and avoids an exponential reward transformation, resulting in a harder-to-optimise lower bound.
As an alternative to these policy gradient inspired algorithms, the class of recent algorithms for soft Q-learning (e.g. Rawlik et al. (2012); Haarnoja et al. (2017); Fox et al. (2016)) parameterise and estimate a so-called “soft” Q-function directly, implicitly inducing a maximum entropy policy. This perspective can also be extended to hierarchical policies (Florensa et al., 2017), and has recently been used to establish connections between Q-learning and policy gradient methods (O’Donoghue et al., 2016; Schulman et al., 2017a). In contrast, we here rely on a parametric policy; our bound and derivation are, however, closely related to the definition of the soft (entropy regularised) Q-function.
A line of work that is directly related to the “RL as inference” perspective has focused on using information theoretic regularisers such as the entropy of the policy or the Kullback-Leibler divergence (KL) between policies to stabilise standard RL objectives. In fact, most state-of-the-art policy gradient algorithms fall into this category. For example, see the entropy regularization terms used in Mnih et al. (2016) or the KL constraints employed by work on trust-region based methods (Schulman et al., 2015, 2017b; Gu et al., 2017; Wang et al., 2017). The latter methods introduce a trust region constraint, defined by the KL divergence between the new policy and the old policy, so that the expected KL divergence over state space is bounded. From the perspective of this paper these trust-region based methods can be seen as optimising a parametric E-step, as in our algorithm, but are “missing” an explicit M-step.
Finally, the connection between RL and inference has been invoked to motivate work on exploration. The most prominent examples for this are formed by work on Boltzmann exploration such as Kaelbling et al. (1996); Perkins & Precup (2002); Sutton (1990); O’Donoghue et al. (2017), which can be connected back to soft Q-learning (and thus to our approach) as shown in Haarnoja et al. (2017).
2.2 Markov Decision Processes
We consider the problem of finding an optimal policy $\pi$ for a discounted reinforcement learning (RL) problem, formally characterized by a Markov decision process (MDP). The MDP consists of: continuous states $s$, actions $a$, transition probabilities $p(s_{t+1}|s_t,a_t)$ – specifying the probability of transitioning from state $s_t$ to $s_{t+1}$ under action $a_t$ –, a reward function $r(s,a)\in\mathbb{R}$ as well as the discounting factor $\gamma\in[0,1)$. The policy $\pi(a|s,\theta)$ (with parameters $\theta$) is assumed to specify a probability distribution over action choices given any state $s$ and – together with the transition probabilities – gives rise to the stationary distribution $\mu_\pi(s)$.
Using these basic quantities we can now define the notion of a Markov sequence or trajectory $\tau_\pi=\{(s_0,a_0),\dots,(s_T,a_T)\}$ sampled by following the policy $\pi$; i.e. $\tau_\pi\sim p_\pi(\tau)$ with $p_\pi(\tau)=p(s_0)\prod_{t\ge 0}p(s_{t+1}|s_t,a_t)\,\pi(a_t|s_t)$; and the expected return $\mathbb{E}_{\tau_\pi}\big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\big]$. We will use the shorthand $r_t=r(s_t,a_t)$.
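The return of a single sampled trajectory is the quantity whose expectation the policy is meant to maximise. As a minimal sketch (the function name and the toy reward sequence are illustrative assumptions, not part of the paper), it can be accumulated backwards through the trajectory:

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one sampled trajectory."""
    total = 0.0
    # Walk backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical three-step trajectory, for illustration only.
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

Averaging this value over many trajectories sampled from the policy gives a Monte Carlo estimate of the expected return.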
3 Maximum a Posteriori Policy Optimisation
Our approach is motivated by the well-established connection between RL and probabilistic inference. This connection casts the reinforcement learning problem as that of inference in a particular probabilistic model. Conventional formulations of RL aim to find a trajectory that maximizes expected reward. In contrast, inference formulations start from a prior distribution over trajectories, condition on a desired outcome such as achieving a goal state, and then estimate the posterior distribution over trajectories consistent with this outcome.
A finite-horizon undiscounted reward formulation can be cast as an inference problem by constructing a suitable probabilistic model via a likelihood function $p(O=1|\tau)\propto\exp\big(\sum_t r_t/\alpha\big)$, where $\alpha$ is a temperature parameter. Intuitively, $O$ can be interpreted as the event of obtaining maximum reward by choosing an action; or the event of succeeding at the RL task (Toussaint, 2009; Neumann, 2011). With this definition we can define the following lower bound on the likelihood of optimality for the policy $\pi$:
$$\log p_\pi(O=1)=\log\int p_\pi(\tau)\,p(O=1|\tau)\,d\tau\;\ge\;\int q(\tau)\Big[\log p(O=1|\tau)+\log\frac{p_\pi(\tau)}{q(\tau)}\Big]d\tau \tag{1}$$
$$=\mathbb{E}_q\Big[\sum_t r_t/\alpha\Big]-\mathrm{KL}\big(q(\tau)\,\|\,p_\pi(\tau)\big)=\mathcal{J}(q,\pi), \tag{2}$$
where $p_\pi$ is the trajectory distribution induced by policy $\pi(a|s)$ as described in Section 2.2 and $q(\tau)$ is an auxiliary distribution over trajectories that will be discussed in more detail below. The lower bound $\mathcal{J}$ is the evidence lower bound (ELBO) which plays an important role in the probabilistic modeling literature. It is worth already noting here that optimizing (2) with respect to $q$ can be seen as a KL regularized RL problem.
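The lower bound can be checked numerically on a toy problem. The sketch below assumes a finite set of trajectories with known prior probabilities and returns, and a likelihood of the form exp((R - R_max) / alpha), which is an assumption made here so that the likelihood is a valid probability. All function names are hypothetical. It compares the log-evidence with the bound for an arbitrary auxiliary distribution and for the exact posterior, where the bound is tight.

```python
import math


def log_evidence(prior, returns, alpha):
    """log p(O=1) = log sum_tau p(tau) * exp((R(tau) - R_max) / alpha)."""
    r_max = max(returns)
    return math.log(sum(p * math.exp((r - r_max) / alpha)
                        for p, r in zip(prior, returns)))


def elbo(q, prior, returns, alpha):
    """E_q[log p(O=1 | tau)] - KL(q || prior) for discrete trajectories."""
    r_max = max(returns)
    expected_log_lik = sum(qi * (r - r_max) / alpha for qi, r in zip(q, returns))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0.0)
    return expected_log_lik - kl


def posterior(prior, returns, alpha):
    """Exact posterior q*(tau), proportional to p(tau) * exp(R(tau) / alpha)."""
    r_max = max(returns)
    w = [p * math.exp((r - r_max) / alpha) for p, r in zip(prior, returns)]
    z = sum(w)
    return [wi / z for wi in w]


if __name__ == "__main__":
    prior = [0.5, 0.3, 0.2]      # hypothetical trajectory probabilities under pi
    returns = [1.0, 2.0, 4.0]    # hypothetical trajectory returns
    alpha = 1.0
    print(log_evidence(prior, returns, alpha))
    print(elbo(prior, prior, returns, alpha))                       # bound with q = prior
    print(elbo(posterior(prior, returns, alpha), prior, returns, alpha))  # tight bound
```

Choosing the auxiliary distribution equal to the prior makes the KL term vanish but leaves a gap; reweighting towards high-return trajectories closes it, which is the intuition behind the E-step.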
An important motivation for transforming an RL problem into an inference problem is that this allows us to draw from the rich toolbox of inference methods: for instance, $\mathcal{J}$ can be optimized with the family of expectation maximization (EM) algorithms which alternate between improving $\mathcal{J}$ with respect to $q$ and $\pi$. In this paper we follow classical (Dayan & Hinton, 1997) and more recent works (e.g. Peters et al. 2010b; Levine & Koltun 2013; Daniel et al. 2016; Wirth et al. 2016) and cast policy search as a particular instance of this family. Our algorithm then combines properties of existing approaches in this family with properties of recent off-policy algorithms for neural networks.
The algorithm alternates between two phases which we refer to as E and M step in reference to an EM-algorithm. The E-step improves $\mathcal{J}$ with respect to $q$. Existing EM policy search approaches typically perform this step by reweighting trajectories with sample returns (Kober & Peters, 2009) or via local trajectory optimization (Levine & Koltun, 2013). We show how off-policy deep RL techniques and value-function approximation can be used to make this step both scalable and data efficient. The M-step then updates the parametric policy in a supervised learning step using the reweighted state-action samples from the E-step as targets.
These choices lead to the following desirable properties: (a) low-variance estimates of the expected return via function approximation; (b) low sample complexity of value function estimates via robust off-policy learning; (c) minimal parametric assumptions about the form of the trajectory distribution in the E-step; (d) policy updates via supervised learning in the M-step; (e) robust updates via hard trust-region constraints in both the E and the M step.
3.1 Policy Improvement
The derivation of our algorithm then starts from the infinite-horizon analogue of the KL-regularized expected reward objective from Equation (2). In particular, we consider variational distributions $q(\tau)$ that factor in the same way as $p_\pi$, i.e. $q(\tau)=p(s_0)\prod_{t\ge 0}p(s_{t+1}|s_t,a_t)\,q(a_t|s_t)$, which yields:
$$\mathcal{J}(q,\theta)=\mathbb{E}_q\Big[\sum_{t=0}^{\infty}\gamma^t\big[r_t-\alpha\,\mathrm{KL}\big(q(a|s_t)\,\|\,\pi(a|s_t,\theta)\big)\big]\Big]+\log p(\theta). \tag{3}$$
Note that due to the assumption about the structure of $q(\tau)$ the KL over trajectories decomposes into a KL over the individual state-conditional action distributions. This objective has also been considered e.g. by Haarnoja et al. (2017); Schulman et al. (2017a). The additional $\log p(\theta)$ term is a prior over policy parameters and can be motivated by a maximum a-posteriori estimation problem (see appendix for more details).
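We also define the regularized Q-value function associated with (3) as
$$Q_\theta^q(s,a)=r_0+\mathbb{E}_{q(\tau),\,s_0=s,\,a_0=a}\Big[\sum_{t=1}^{\infty}\gamma^t\big[r_t-\alpha\,\mathrm{KL}(q_t\,\|\,\pi_t)\big]\Big], \tag{4}$$
with $\mathrm{KL}(q_t\,\|\,\pi_t)=\mathrm{KL}\big(q(a|s_t)\,\|\,\pi(a|s_t,\theta)\big)$. Note that $\mathrm{KL}(q_0\,\|\,\pi_0)$ and $p(\theta)$ are not part of the Q-function as they are not a function of the action.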
We observe that optimizing $\mathcal{J}$ with respect to $q$ is equivalent to solving an expected reward RL problem with augmented reward $\tilde r_t=r_t-\alpha\log\frac{q(a_t|s_t)}{\pi(a_t|s_t,\theta)}$. In this view $\pi$ represents a default policy towards which $q$ is regularized – i.e. the current best policy. The MPO algorithm treats $\pi$ as the primary object of interest. In this case $q$ serves as an auxiliary distribution that allows optimizing $\mathcal{J}$ via alternate coordinate ascent in $q$ and $\pi_\theta$, analogous to the expectation-maximization algorithm in the probabilistic modelling literature. In our case, the E-step optimizes $\mathcal{J}$ with respect to $q$ while the M-step optimizes $\mathcal{J}$ with respect to $\pi$. Different optimizations in the E-step and M-step lead to different algorithms. In particular, we note that for the case where $p(\theta)$ is an uninformative prior a variant of our algorithm has a monotonic improvement guarantee as shown in Appendix A.
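3.2 E-Step
In the E-step of iteration $i$ we perform a partial maximization of $\mathcal{J}(q,\theta)$ with respect to $q$ given $\theta=\theta_i$. We start by setting $q=\pi_{\theta_i}$ and estimate the unregularized action-value function:
$$Q_{\theta_i}^{q}(s,a)=Q_{\theta_i}(s,a)=\mathbb{E}_{\tau_{\pi_i},\,s_0=s,\,a_0=a}\Big[\sum_{t=0}^{\infty}\gamma^t r_t\Big], \tag{5}$$
since $\mathrm{KL}(q\,\|\,\pi_i)=0$. In practice we estimate $Q_{\theta_i}$ from off-policy data (we refer to Section 4 for details about the policy evaluation step). This greatly increases the data efficiency of our algorithm. Given $Q_{\theta_i}$ we improve the lower bound $\mathcal{J}$ w.r.t. $q$ by first expanding $Q_{\theta_i}(s,a)$ via the regularized Bellman operator $T^{\pi,q}=\mathbb{E}_{q(a|s)}\big[r(s,a)-\alpha\,\mathrm{KL}(q\,\|\,\pi_i)+\gamma\,\mathbb{E}_{p(s'|s,a)}[V_{\theta_i}(s')]\big]$, and optimize the “one-step” KL regularised objective
$$\max_q \bar{\mathcal{J}}_s(q,\theta_i)=\max_q T^{\pi,q}Q_{\theta_i}(s,a)=\max_q \mathbb{E}_{\mu(s)}\Big[\mathbb{E}_{q(\cdot|s)}\big[Q_{\theta_i}(s,a)\big]-\alpha\,\mathrm{KL}(q\,\|\,\pi_i)\Big], \tag{6}$$
since $V_{\theta_i}(s)=\mathbb{E}_{q(a|s)}[Q_{\theta_i}(s,a)]$ and thus $Q_{\theta_i}(s,a)=r(s,a)+\gamma\,\mathbb{E}_{p(s'|s,a)}[V_{\theta_i}(s')]$.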
Maximizing Equation (6), thus obtaining $q_i=\arg\max_q \bar{\mathcal{J}}(q,\theta_i)$, does not fully optimize $\mathcal{J}$ since we treat $Q_{\theta_i}$ as constant with respect to $q$. An intuitive interpretation is that $q_i$ chooses the soft-optimal action for one step and then resorts to executing policy $\pi$. In the language of the EM algorithm this optimization implements a partial E-step. In practice we also choose $\mu_q$ to be the stationary distribution as given through samples from the replay buffer.
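Constrained E-step
The reward and the KL terms are on an arbitrary relative scale. This can make it difficult to choose $\alpha$. We therefore replace the soft KL regularization with a hard constraint with parameter $\epsilon$, i.e.,
$$\max_q \mathbb{E}_{\mu(s)}\Big[\mathbb{E}_{q(a|s)}\big[Q_{\theta_i}(s,a)\big]\Big]\quad\text{s.t.}\quad\mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\,\|\,\pi(a|s,\theta_i)\big)\Big]<\epsilon. \tag{7}$$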
If we choose to explicitly parameterize $q(a|s)$ – option 1 below – the resulting optimisation is similar to that performed by the recent TRPO algorithm for continuous control (Schulman et al., 2015); only in an off-policy setting. Analogously, the unconstrained objective (6) is similar to the objective used by PPO (Schulman et al., 2017b). We note, however, that the KL is reversed when compared to the KL used by TRPO and PPO.
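The direction of the KL matters because the divergence is not symmetric. As a minimal illustration (the two-action categorical distributions and the function name are assumptions chosen purely for this example), the two directions give different values for the same pair of distributions:

```python
import math


def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    pi_old = [0.5, 0.5]   # hypothetical current policy at one state
    q_new = [0.9, 0.1]    # hypothetical improved distribution at that state
    print("KL(q_new || pi_old) =", kl(q_new, pi_old))
    print("KL(pi_old || q_new) =", kl(pi_old, q_new))
```

A bound of the same size on one direction therefore permits a different set of distributions than a bound on the other.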
To implement (7) we need to choose a form for the variational policy $q(a|s)$. Two options arise:
1. We can use a parametric variational distribution $q(a|s,\theta^q)$, with parameters $\theta^q$, and optimise Equation (7) via the likelihood ratio or action-value gradients. This leads to an algorithm similar to TRPO/PPO, and an explicit M-step becomes unnecessary.
2. We can choose a non-parametric representation of $q(a|s)$ given by a sample-based distribution over actions for a state $s$. To achieve generalization in state space we then fit a parametric policy in the M-step. This is possible since in our framework the optimisation of Equation (7) is only the first step of an EM procedure and we thus do not have to commit to a parametric distribution that generalises across the state space at this point.
Fitting a parametric policy in the M-step is a supervised learning problem, allowing us to employ various regularization techniques at that point. It also makes it easier to enforce the hard KL constraint.
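Non-parametric variational distribution
In the non-parametric case we can obtain the optimal sample-based distribution $q$ over actions for each state – the solution to Equation (7) – in closed form (see the appendix for a full derivation), as
$$q_i(a|s)\propto\pi(a|s,\theta_i)\exp\Big(\frac{Q_{\theta_i}(s,a)}{\eta^*}\Big), \tag{8}$$
where we can obtain $\eta^*$ by minimising the following convex dual function,
$$g(\eta)=\eta\epsilon+\eta\int\mu(s)\log\int\pi(a|s,\theta_i)\exp\Big(\frac{Q_{\theta_i}(s,a)}{\eta}\Big)\,da\,ds, \tag{9}$$
after the optimisation of which we can evaluate $q_i(a|s)$ on given samples.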
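The non-parametric E-step can be sketched on samples as follows. The sketch assumes that, for each sampled state, a fixed number of actions has been drawn from the current policy and scored by the Q-function, so that the inner integral of the dual becomes a sample mean and the optimal distribution reduces to a per-state softmax over Q-values with temperature eta. The function names, the search bracket and the ternary search over log(eta) are illustrative choices, not the paper's implementation.

```python
import math


def log_mean_exp(values, eta):
    """Numerically stable log( mean_k exp(values[k] / eta) )."""
    m = max(values)
    return m / eta + math.log(sum(math.exp((v - m) / eta) for v in values) / len(values))


def dual(eta, q_values, epsilon):
    """Sample estimate of g(eta); q_values[j][k] = Q(s_j, a_k), a_k ~ pi(.|s_j)."""
    avg = sum(log_mean_exp(row, eta) for row in q_values) / len(q_values)
    return eta * epsilon + eta * avg


def minimise_dual(q_values, epsilon, lo=1e-3, hi=1e3, iters=200):
    """Ternary search over log(eta); valid because g is convex in eta."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if dual(math.exp(m1), q_values, epsilon) < dual(math.exp(m2), q_values, epsilon):
            b = m2
        else:
            a = m1
    return math.exp(0.5 * (a + b))


def e_step_weights(row, eta):
    """Weights of q(a|s) on the sampled actions: softmax of Q / eta."""
    m = max(row)
    w = [math.exp((v - m) / eta) for v in row]
    z = sum(w)
    return [wi / z for wi in w]


def kl_to_samples(weights):
    """KL between the reweighted samples and the uniform weights of pi's samples."""
    k = len(weights)
    return sum(w * math.log(w * k) for w in weights if w > 0.0)


if __name__ == "__main__":
    # Hypothetical Q-values: 2 states, 4 sampled actions each.
    q_values = [[1.0, 2.0, 3.0, 0.5], [0.0, 0.2, 1.5, 1.0]]
    eta = minimise_dual(q_values, epsilon=0.1)
    weights = [e_step_weights(row, eta) for row in q_values]
    print(eta, weights)
```

At the minimiser of the dual, the average KL between the reweighted samples and the original samples matches epsilon whenever the optimum lies inside the search bracket, which gives a convenient sanity check.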
This optimization problem is similar to the one solved by relative entropy policy search (REPS) (Peters et al., 2010a) with the difference that we optimise only for the conditional variational distribution $q(a|s)$ instead of a joint distribution $q(a,s)$ – effectively fixing $\mu_q(s)$ to the stationary distribution given by previously collected experience – and we use the Q-function of the old policy to evaluate the integral over $a$. While this might seem unimportant it is crucial as it allows us to estimate the integral over actions with multiple samples without additional environment interaction. This greatly reduces the variance of the estimate and allows for fully off-policy learning at the cost of performing only a partial optimization of $\mathcal{J}$ as described above.
3.3 M-step
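Given $q_i$ from the E-step we can optimize the lower bound $\mathcal{J}$ with respect to $\theta$ to obtain an updated policy $\theta_{i+1}=\arg\max_\theta \mathcal{J}(q_i,\theta)$. Dropping terms independent of $\theta$, this entails solving for the solution of
$$\max_\theta \mathcal{J}(q_i,\theta)=\max_\theta \mathbb{E}_{\mu_q(s)}\Big[\mathbb{E}_{q(a|s)}\big[\log\pi(a|s,\theta)\big]\Big]+\log p(\theta), \tag{10}$$
which corresponds to a weighted maximum a-posteriori estimation (MAP) problem where samples are weighted by the variational distribution from the E-step. Since this is essentially a supervised learning step we can choose any policy representation in combination with any prior for regularisation. In this paper we set $p(\theta)$ to a Gaussian prior around the current policy, i.e., $p(\theta)\approx\mathcal{N}\big(\mu=\theta_i,\Sigma=\frac{F_{\theta_i}}{\lambda}\big)$, where $\theta_i$ are the parameters of the current policy distribution, $F_{\theta_i}$ is the empirical Fisher information matrix and $\lambda$ is a positive scalar.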
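As shown in the appendix this suggests the following generalized M-step:
$$\max_\pi \mathbb{E}_{\mu_q(s)}\Big[\mathbb{E}_{q(a|s)}\big[\log\pi(a|s,\theta)\big]-\lambda\,\mathrm{KL}\big(\pi(a|s,\theta_i)\,\|\,\pi(a|s,\theta)\big)\Big], \tag{11}$$
which can be rewritten as the hard constrained version:
$$\max_\pi \mathbb{E}_{\mu_q(s)}\Big[\mathbb{E}_{q(a|s)}\big[\log\pi(a|s,\theta)\big]\Big]\quad\text{s.t.}\quad\mathbb{E}_{\mu_q(s)}\Big[\mathrm{KL}\big(\pi(a|s,\theta_i)\,\|\,\pi(a|s,\theta)\big)\Big]<\epsilon. \tag{12}$$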
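For intuition, the penalised form of the M-step has a closed-form solution in a very restricted setting. The sketch below assumes a single state, a one-dimensional Gaussian policy, E-step weights that sum to one, and a fixed penalty weight lam in place of the hard constraint; it is not the neural-network update used in the paper, and the function name is hypothetical. Setting the derivatives of the weighted log-likelihood minus lam times KL(old || new) to zero gives the update below.

```python
def m_step_gaussian(actions, weights, old_mean, old_std, lam):
    """Weighted maximum likelihood for a 1-D Gaussian with a KL(old || new) penalty.

    actions : sampled actions a_k at one state
    weights : E-step weights q(a_k | s), assumed to sum to one
    lam     : penalty weight (lam = 0 recovers plain weighted maximum likelihood)
    """
    weighted_mean = sum(w * a for w, a in zip(weights, actions))
    # Stationary point in the mean: a blend of the weighted sample mean and the old mean.
    new_mean = (weighted_mean + lam * old_mean) / (1.0 + lam)
    scatter = sum(w * (a - new_mean) ** 2 for w, a in zip(weights, actions))
    # Contribution of the KL penalty to the variance update.
    penalty = old_std ** 2 + (old_mean - new_mean) ** 2
    new_var = (scatter + lam * penalty) / (1.0 + lam)
    return new_mean, new_var ** 0.5


if __name__ == "__main__":
    # Hypothetical samples and weights, for illustration only.
    actions = [0.0, 1.0, 2.0]
    weights = [0.2, 0.3, 0.5]
    for lam in (0.0, 1.0, 100.0):
        print(lam, m_step_gaussian(actions, weights, old_mean=0.0, old_std=1.0, lam=lam))
```

With lam = 0 the variance is fitted to the weighted samples alone and can shrink quickly; larger values keep both the mean and the variance close to those of the previous policy.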
This additional constraint minimises the risk of overfitting the samples, i.e. it helps us to obtain a policy that generalises beyond the state-action samples used for the optimisation. In practice we have found the KL constraint in the M-step to greatly increase the stability of the algorithm. We also note that in the E-step we are using the reverse, mode-seeking, KL while in the M-step we are using the forward, moment-matching, KL which reduces the tendency of the entropy of the parametric policy to collapse. This is in contrast to other RL algorithms that use M-projection without a KL constraint to fit a parametric policy (Peters et al., 2010a; Wirth et al., 2016; Chebotar et al., 2016; Montgomery & Levine, 2016). Using a KL constraint in the M-step has also been shown to be effective for stochastic search algorithms (Abdolmaleki et al., 2017).
4 Policy Evaluation
Our method is directly applicable in an off-policy setting. For this, we have to rely on a stable policy evaluation operator to obtain a parametric representation of the Q-function $Q_\theta(s,a)$. We make use of the policy evaluation operator from the Retrace algorithm (Munos et al., 2016), which we found to yield stable policy evaluation in practice (we note that, despite this empirical finding, Retrace may not be guaranteed to be stable with function approximation; Touati et al., 2017). Concretely, we fit the Q-function $Q_{\theta_i}(s,a,\phi)$ as represented by a neural network, with parameters $\phi$, by minimising the squared loss:
$$\min_\phi L(\phi)=\min_\phi \mathbb{E}_{\mu_b(s),\,b(a|s)}\Big[\big(Q_{\theta_i}(s_t,a_t,\phi)-Q_t^{\mathrm{ret}}\big)^2\Big],\quad\text{with} \tag{13}$$
$$Q_t^{\mathrm{ret}}=Q_{\phi'}(s_t,a_t)+\sum_{j=t}^{\infty}\gamma^{j-t}\Big(\prod_{k=t+1}^{j}c_k\Big)\Big[r(s_j,a_j)+\gamma\,\mathbb{E}_{\pi(a|s_{j+1})}\big[Q_{\phi'}(s_{j+1},a)\big]-Q_{\phi'}(s_j,a_j)\Big],\qquad c_k=\min\Big(1,\frac{\pi(a_k|s_k)}{b(a_k|s_k)}\Big),$$
where $Q_{\phi'}(s,a)$ denotes the output of a target Q-network, with parameters $\phi'$, that we copy from the current parameters $\phi$ after each M-step. We truncate the infinite sum after $N$ steps by bootstrapping with $Q_{\phi'}$ (rather than considering a $\lambda$ return). Additionally, $b(a|s)$ denotes the probabilities of an arbitrary behaviour policy. In our case we use an experience replay buffer and hence $b$ is given by the action probabilities stored in the buffer, which correspond to the action probabilities at the time of action selection.
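A truncated Retrace target for the first step of a stored trajectory segment can be sketched as below. The sketch follows the standard Retrace definition of Munos et al. (2016) with truncated importance weights min(1, pi/b) and a discounted expected next-state value; the argument layout and the function name are assumptions made for this example, and the target-network outputs are passed in as plain lists rather than computed by a network.

```python
def retrace_target(rewards, q_taken, v_next, ratios, gamma):
    """Retrace target for step 0 of a length-N trajectory segment.

    rewards[j] : r(s_j, a_j)
    q_taken[j] : target-network value Q(s_j, a_j)
    v_next[j]  : E_{a ~ pi}[Q(s_{j+1}, a)] under the target network
    ratios[j]  : pi(a_j | s_j) / b(a_j | s_j), b being the behaviour policy
    """
    target = q_taken[0]
    trace = 1.0  # empty product of truncated importance weights for j = 0
    for j in range(len(rewards)):
        if j > 0:
            trace *= min(1.0, ratios[j])
        td_error = rewards[j] + gamma * v_next[j] - q_taken[j]
        target += (gamma ** j) * trace * td_error
    return target


if __name__ == "__main__":
    # Hypothetical two-step segment, for illustration only.
    print(retrace_target(rewards=[1.0, 1.0],
                         q_taken=[0.5, 1.0],
                         v_next=[2.0, 3.0],
                         ratios=[1.0, 0.5],
                         gamma=0.9))
```

Truncating the segment after N steps means the last temporal-difference term bootstraps from the target network's value at the final next state.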
5 Experiments
For our experiments we evaluate our MPO algorithm across a wide range of tasks. Specifically, we start by looking at the continuous control tasks of the DeepMind Control Suite (Tassa et al., 2018; see Figure 1), and then consider the challenging parkour environments recently published in Heess et al. (2017). In both cases we use a Gaussian distribution for the policy whose mean and covariance are parameterized by a neural network (see appendix for details). In addition, we present initial experiments for discrete control on ATARI environments using a categorical policy distribution (whose logits are again parameterized by a neural network) in the appendix.
5.1 Evaluation on control suite
The suite of continuous control tasks that we are evaluating against contains 18 tasks, comprising a wide range of domains including well-known tasks from the literature: for example, the classical cartpole and acrobot dynamical systems, 2D and Humanoid walking, as well as simple low-dimensional planar reaching and manipulation tasks. This suite of tasks was built in Python on top of MuJoCo and will also be open sourced to the public by the time of publication.
While we include plots depicting the performance of our algorithm on all tasks below, comparing it against state-of-the-art algorithms in terms of data-efficiency, we want to start by directing the attention of the reader to a more detailed evaluation on three of the harder tasks from the suite.
5.1.1 Detailed Analysis on Walker2D, Acrobot, Hopper
We start by looking at the results for the classical Acrobot task (two degrees of freedom, one continuous action dimension) as well as the 2D walker (which has 12 degrees of freedom and thus a 12-dimensional action space and a 21-dimensional state space) and the hopper standing task. The reward in the Acrobot task is the distance of the robot's end-effector to an upright position of the underactuated system. For the walker task it is given by the forward velocity, whereas in the hopper the requirement is to stand still.
Figure 2 shows the results for these tasks obtained by applying our algorithm MPO as well as several ablations – in which different parts were removed from the MPO optimization – and two baselines: our implementation of Proximal Policy Optimization (PPO) (Schulman et al., 2017b) and DDPG. The hyperparameters for MPO were kept fixed for all experiments in the paper (see the appendix for hyperparameter settings).
As a first observation, we can see that MPO gives stable learning on all tasks and, thanks to its fully off-policy implementation, is significantly more sample efficient than the on-policy PPO baseline. Furthermore, we can observe that changing from the non-parametric variational distribution to a parametric distribution (which, as described above, can be related to PPO; we note that we use a value function baseline in this setup, see appendix for details) results in only a minor asymptotic performance loss but slowed down optimisation and thus hampered sample efficiency, which can be attributed to the fact that the parametric distribution required a stricter KL constraint. Removing the automatically tuned KL constraint and replacing it with a manually set entropy regulariser then yields an off-policy actor-critic method with Retrace. This policy gradient method still uses the idea of estimating the integral over actions – and thus, for a gradient based optimiser, its likelihood ratio derivative – via multiple action samples (as judged by a Q-Retrace critic). This idea has previously been coined as using the expected policy gradient (EPG) (Ciosek & Whiteson, 2017) and we hence denote the corresponding algorithm with EPG + Retrace, which no longer follows the intuitions of the MPO perspective. EPG + Retrace performed well when the correct entropy regularisation scale is used. This, however, required task specific tuning (cf. Figure 4 where this hyperparameter was set to the one that performed best on average across tasks). Finally, using only a single sample to estimate the integral (and hence the likelihood ratio gradient) results in an actor-critic variant with Retrace that is the least performant off-policy algorithm in our comparison.
5.1.2 Complete results on the control suite
The results for MPO (non-parametric) – and a comparison to an implementation of state-of-the-art algorithms from the literature in our framework – on all the environments from the control suite that we tested on are shown in Figure 4. All tasks have rewards that are scaled to be between 0 and 1000. We note that in order to ensure a fair comparison all algorithms ran with exactly the same network configuration, used a single learner (no distributed computation), used the same optimizer and were tuned w.r.t. their hyperparameters for best performance across all tasks. We refer to the appendix for a complete description of the hyperparameters. Our comparison is made in terms of data-efficiency.
From the plot a few trends are readily apparent: i) We can clearly observe the advantage in terms of data-efficiency that methods relying on a Q-critic obtain over the PPO baseline. This difference is so extreme that in several instances the PPO baseline converges an order of magnitude slower than the off-policy algorithms, and we thus indicate the asymptotic performance of PPO and DDPG (which also improved significantly later during training in some instances) with a colored star in the plot; ii) the difference between the MPO results and the (expected) policy gradient (EPG) with entropy regularisation confirms our suspicion from Section 5.1.1: finding a good setting for the entropy regulariser that transfers across environments without additional constraints on the policy distribution is very difficult, leading to instabilities in the learning curves. In contrast to this the MPO results appear to be stable across all environments; iii) Finally, in terms of data-efficiency the methods utilising Retrace obtain a clear advantage over DDPG. The single learner vanilla DDPG implementation learns the lower dimensional environments quickly but suffers in terms of learning speed in environments with sparse rewards (finger, acrobot) and higher dimensional action spaces. Overall, MPO is able to solve all environments using surprisingly moderate amounts of data. On average less than 1000 trajectories (or samples) are needed to reach the best performance.
5.2 High-dimensional continuous control
Next we turn to evaluating our algorithm on two higher-dimensional continuous control problems: humanoid and walker. To make computation time bearable in these more complicated domains we utilize a parallel variant of our algorithm: in this implementation K learners are all independently collecting data from an instance of the environment. Updates are performed at the end of each collected trajectory using distributed synchronous gradient descent on a shared set of policy and Q-function parameters (we refer to the appendix for an algorithm description). The results of this experiment are depicted in Figure 3.
For the Humanoid running domain we can observe a similar trend to the experiments from the previous section: MPO quickly finds a stable running policy, outperforming all other algorithms in terms of sample efficiency also in this high-dimensional control problem.
The case for the Walker2D parkour domain (where we compare against a PPO baseline) is even more striking: where standard PPO requires approximately 1M trajectories to find a good policy, MPO finds a solution that is asymptotically no worse than the PPO solution in about 70k trajectories (or 60M samples), resulting in an order of magnitude improvement. In addition to the walker experiment we have also evaluated MPO on the Parkour domain using a humanoid body (with 22 degrees of freedom) which was learned successfully (not shown in the plot, please see the supplementary video).
5.3 Discrete control
As a proof of concept – showcasing the robustness of our algorithm and its hyperparameters – we performed an experiment on a subset of the games contained in the "Arcade Learning Environment" (ALE) where we used the same hyperparameter settings for the KL constraints as for the continuous control experiments. The results of this experiment can be found in the Appendix.
6 Conclusion
We have presented a new off-policy reinforcement learning algorithm called Maximum a-posteriori Policy Optimisation (MPO). The algorithm is motivated by the connection between RL and inference and it consists of an alternating optimisation scheme that has a direct relation to several existing algorithms from the literature. Overall, we arrive at a novel, off-policy algorithm that is highly data efficient, robust to hyperparameter choices and applicable to complex control problems. We demonstrated the effectiveness of MPO on a large set of continuous control problems.
Acknowledgements
The authors would like to thank David Budden, Jonas Buchli, Roland Hafner, Tom Erez, Jonas Degrave, Guillaume Desjardins, Brendan O’Donoghue and many others of the DeepMind team for their support and feedback during the preparation of this manuscript.
References

Abdolmaleki et al. (2017)
Abbas Abdolmaleki, Bob Price, Nuno Lau, Luis Paulo Reis, Gerhard Neumann,
et al.
Deriving and improving cmaes with information geometric trust
regions.
Proceedings of the Genetic and Evolutionary Computation Conference
, 2017.  Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay, 2017.
 Bansal et al. (2017) Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multiagent competition, 2017.

Bellemare et al. (2017)
Marc G. Bellemare, Will Dabney, and Rémi Munos.
A distributional perspective on reinforcement learning.
In
Proceedings of the 34th International Conference on Machine Learning, ICML
, 2017.  Chebotar et al. (2016) Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. CoRR, abs/1610.00529, 2016.
 Ciosek & Whiteson (2017) Kamil Ciosek and Shimon Whiteson. Expected policy gradients. CoRR, abs/1706.05374, 2017.
 Daniel et al. (2016) C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research (JMLR), 2016.
 Dayan & Hinton (1997) Peter Dayan and Geoffrey E Hinton. Using expectationmaximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
 Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(12):1–142, 2013.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pp. 1329–1338. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045531.
 Florensa et al. (2017) Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. CoRR, abs/1704.03012, 2017.

Fox et al. (2016)
Roy Fox, Ari Pakman, and Naftali Tishby.
Taming the noise in reinforcement learning via soft updates.
In
Proceedings of the ThirtySecond Conference on Uncertainty in Artificial Intelligence UAI
, 2016.  Gu et al. (2016a) Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1610.00633, 2016a.
 Gu et al. (2016b) Shixiang Gu, Tim Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning (ICML), 2016b.
 Gu et al. (2017) Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In 5th International Conference on Learning Representations (ICLR), 2017.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. CoRR, abs/1702.08165, 2017.
 Heess et al. (2015) Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), pp. 2926–2934, 2015.
 Heess et al. (2017) Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

 Ho & Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 4565–4573. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6391generativeadversarialimitationlearning.pdf.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks, 2016.
 Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L. Littman, and Andrew P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
 Kappen (2005) H J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005.
 Kober & Peters (2009) Jens Kober and Jan Peters. Policy search for motor primitives in robotics. In Advances in neural information processing systems, pp. 849–856, 2009.
 Levine & Koltun (2013) Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, pp. 207–215, 2013.
 Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
 Montgomery & Levine (2016) William Montgomery and Sergey Levine. Guided policy search as approximate mirror descent. CoRR, abs/1607.04614, 2016. URL http://arxiv.org/abs/1607.04614.
 Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS), 2016.
 Neumann (2011) Gerhard Neumann. Variational inference for policy search in changing situations. In Proceedings of the 28th international conference on machine learning (ICML11), pp. 817–824, 2011.
 O’Donoghue et al. (2016) Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. PGQ: combining policy gradient and Q-learning. CoRR, abs/1611.01626, 2016.
 O’Donoghue et al. (2017) Brendan O’Donoghue, Ian Osband, Rémi Munos, and Volodymyr Mnih. The uncertainty bellman equation and exploration. CoRR, abs/1709.05380, 2017. URL http://arxiv.org/abs/1709.05380.
 Peng et al. (2016) Xue Bin Peng, Glen Berseth, and Michiel van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2016), 2016.
 Perkins & Precup (2002) Theodore J. Perkins and Doina Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15 (NIPS). MIT Press, Cambridge, MA, 2002.
 Peters et al. (2010a) Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010a.
 Peters et al. (2010b) Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI. Atlanta, 2010b.
 Rawlik et al. (2012) Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (R:SS), 2012.
 Roux (2016) Nicolas Le Roux. Efficient iterative policy optimization. CoRR, abs/1612.08967, 2016. URL http://arxiv.org/abs/1612.08967.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
 Schulman et al. (2017a) John Schulman, Pieter Abbeel, and Xi Chen. Equivalence between policy gradients and soft Q-learning. CoRR, abs/1704.06440, 2017a.
 Schulman et al. (2017b) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017b. URL http://arxiv.org/abs/1707.06347.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.
 Sutton (1990) Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning (ICML), pp. 216–224, 1990.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite, 2018. URL http://arxiv.org/abs/1801.00690.
 Theodorou et al. (2010) Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research (JMLR), 2010.
 Todorov (2008) Emanuel Todorov. General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, CDC 2008, December 9–11, 2008, Cancún, México, pp. 4286–4292, 2008.
 Touati et al. (2017) Ahmed Touati, Pierre-Luc Bacon, Doina Precup, and Pascal Vincent. Convergent tree-backup and retrace with function approximation. CoRR, abs/1705.09322, 2017. URL http://arxiv.org/abs/1705.09322.
 Toussaint (2009) Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 1049–1056, 2009. ISBN 9781605585161.
 Wang et al. (2017) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. 5th International Conference on Learning Representations (ICLR), 2017.
 Wirth et al. (2016) Christian Wirth, Johannes Fürnkranz, and Gerhard Neumann. Model-free preference-based reinforcement learning. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 2222–2228, 2016.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433–1438, 2008.
Appendix A Proof of monotonic improvement for the KL-regularized policy optimization procedure
In this section we prove a monotonic improvement guarantee for KL-regularized policy optimization via alternating updates on $\pi$ and $q$ under the assumption that the prior on $\theta$ is uninformative.
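A.1 Regularized Reinforcement Learning
Let $\pi$ be an arbitrary policy. For any other policy $q$ such that, for all $x, a$, $\{q(a|x) > 0\} \Rightarrow \{\pi(a|x) > 0\}$ (so that the log-ratio below is finite under $q$), define the $\pi$-regularized reward for policy $q$:
$$r_\alpha^{\pi,q}(x,a) = r(x,a) - \alpha \log \frac{q(a|x)}{\pi(a|x)},$$
where $\alpha > 0$.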
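Bellman operators:
Define the $\pi$-regularized Bellman operator for policy $q$
$$T_\alpha^{\pi,q} V(x) = \mathbb{E}_{a \sim q(\cdot|x)}\Big[ r_\alpha^{\pi,q}(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}[V(y)] \Big],$$
and the non-regularized Bellman operator for policy $q$
$$T^{q} V(x) = \mathbb{E}_{a \sim q(\cdot|x)}\Big[ r(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}[V(y)] \Big].$$
Value function:
Define the $\pi$-regularized value function for policy $q$ as
$$V_\alpha^{\pi,q}(x) = \mathbb{E}_{q}\Big[ \sum_{t \ge 0} \gamma^t\, r_\alpha^{\pi,q}(x_t, a_t) \,\Big|\, x_0 = x \Big],$$
and the non-regularized value function
$$V^{q}(x) = \mathbb{E}_{q}\Big[ \sum_{t \ge 0} \gamma^t\, r(x_t, a_t) \,\Big|\, x_0 = x \Big].$$
Proposition 1.
For any $q$, $\pi$, $V$, we have $V_\alpha^{\pi,q} \le V^{q}$ and $T_\alpha^{\pi,q} V \le T^{q} V$. Indeed
$$\mathbb{E}_{a \sim q(\cdot|x)}\Big[ \log \frac{q(a|x)}{\pi(a|x)} \Big] = \mathrm{KL}\big(q(\cdot|x) \,\|\, \pi(\cdot|x)\big) \ge 0.$$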
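Proposition 1 can be checked numerically on a small tabular problem. The sketch below is only an illustration under assumptions of our own: a random finite MDP, and the regularized reward taken to be $r(x,a) - \alpha \log(q(a|x)/\pi(a|x))$. The names `P`, `r`, `policy_value` and the problem sizes are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 3, 0.9, 0.5

# Hypothetical random tabular MDP: P[x, a, y] are transition
# probabilities, r[x, a] are rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def random_policy():
    # Rows are strictly positive, so the log-ratio below is finite.
    return rng.dirichlet(np.ones(n_actions), size=n_states)

def policy_value(q, reward):
    # Exact policy evaluation: solve V = r_q + gamma * P_q V.
    r_q = np.sum(q * reward, axis=1)
    P_q = np.einsum("xa,xay->xy", q, P)
    return np.linalg.solve(np.eye(n_states) - gamma * P_q, r_q)

pi, q = random_policy(), random_policy()

r_reg = r - alpha * np.log(q / pi)       # assumed regularized reward
V_reg = policy_value(q, r_reg)           # regularized value of q
V_plain = policy_value(q, r)             # non-regularized value of q
kl = np.sum(q * np.log(q / pi), axis=1)  # KL(q || pi) in each state

# The gap is the discounted sum of per-state KL terms, hence >= 0.
gap = V_plain - V_reg
print("KL per state:", kl)
print("V^q - V_alpha^{pi,q}:", gap)
```

Because the regularizer only subtracts a non-negative KL term at each visited state, the gap is non-negative in every state, and it shrinks to zero as `q` approaches `pi`.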
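Optimal value function and policy
Define the optimal regularized value function: $V_\alpha^{\pi,*}(x) = \max_q V_\alpha^{\pi,q}(x)$, and the optimal (non-regularized) value function: $V^{*}(x) = \max_q V^{q}(x)$.
The optimal policy of the $\pi$-regularized problem is $q_\alpha^{\pi,*} = \arg\max_q V_\alpha^{\pi,q}$, and the optimal policy of the non-regularized problem is $q^{*} = \arg\max_q V^{q}$.
Proposition 2.
We have that $V_\alpha^{\pi,q}$ is the unique fixed point of $T_\alpha^{\pi,q}$, and $V^{q}$ is the unique fixed point of $T^{q}$. Thus we have the following Bellman equations: For all $x$,
$$V_\alpha^{\pi,q}(x) = \sum_a q(a|x) \Big[ r_\alpha^{\pi,q}(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}\big[V_\alpha^{\pi,q}(y)\big] \Big] \tag{14}$$
$$V^{q}(x) = \sum_a q(a|x) \Big[ r(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}\big[V^{q}(y)\big] \Big] \tag{15}$$
$$V_\alpha^{\pi,*}(x) = r_\alpha^{\pi,q_\alpha^{\pi,*}}(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}\big[V_\alpha^{\pi,*}(y)\big] \quad \text{for all } a \tag{16}$$
$$V^{*}(x) = \max_a \Big[ r(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}\big[V^{*}(y)\big] \Big] \tag{17}$$
Notice that (16) holds for all actions $a$, and not in expectation w.r.t. $a \sim q_\alpha^{\pi,*}(\cdot|x)$ only.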
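The remark that the optimal regularized Bellman equation holds for every action, and not only in expectation, can be illustrated on a tabular problem. The sketch below rests on a rearrangement that is ours and not stated in the text: if the regularized reward is $r(x,a) - \alpha \log(q(a|x)/\pi(a|x))$, then requiring the one-step return to equal $V(x)$ for every action gives $q(a|x) = \pi(a|x) \exp((Q(x,a) - V(x))/\alpha)$, and normalization gives a log-sum-exp backup for $V$. The MDP and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.3

# Hypothetical random tabular MDP and reference policy pi.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# Soft value iteration (assumed form):
# V(x) = alpha * log sum_a pi(a|x) exp(Q(x,a) / alpha).
V = np.zeros(n_states)
for _ in range(2000):
    Q = r + gamma * P @ V
    V = alpha * np.log(np.sum(pi * np.exp(Q / alpha), axis=1))

Q = r + gamma * P @ V
# Candidate optimal policy, built so that the per-action equation holds.
q_star = pi * np.exp((Q - V[:, None]) / alpha)

# Per-action right-hand side: regularized reward plus discounted value.
# By construction this equals V(x) for every action; the substantive
# check is that q_star is a proper distribution at the fixed point.
rhs = r - alpha * np.log(q_star / pi) + gamma * P @ V

print("row sums of q_star:", q_star.sum(axis=1))
print("max |rhs - V|:", np.abs(rhs - V[:, None]).max())
```

At the fixed point the rows of `q_star` sum to one, so a single valid policy makes the regularized return identical across actions, which is the content of the remark.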
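A.2 Regularized joint policy gradient
We now consider a parametrized policy $\pi_\theta$ and consider maximizing the regularized joint policy optimization problem for a given initial state $x_0$ (this could be a distribution over initial states). Thus we want to find a parameter $\theta$ that (locally) maximizes
$$J(\theta, q) = V_\alpha^{\pi_\theta, q}(x_0).$$
We start with an initial parameter $\theta_0$ and define a sequence of policies $\pi_i = \pi_{\theta_i}$ parametrized by $\theta_i$, in the following way:

Given $\theta_i$, define
$$q_i = \arg\max_q T_\alpha^{\pi_i, q} V^{\pi_i}.$$

Define $\theta_{i+1}$ as
$$\theta_{i+1} = \theta_i - \beta\, \nabla_\theta\, \mathbb{E}_{\pi_i}\Big[ \mathrm{KL}\big(q_i(\cdot|x) \,\|\, \pi_\theta(\cdot|x)\big) \Big] \Big|_{\theta = \theta_i}. \tag{18}$$
Proposition 3.
We have the following properties:

The policy $q_i$ satisfies:
$$q_i(a|x) = \frac{\pi_i(a|x)\, e^{Q^{\pi_i}(x,a)/\alpha}}{\mathbb{E}_{b \sim \pi_i(\cdot|x)}\big[ e^{Q^{\pi_i}(x,b)/\alpha} \big]}, \tag{19}$$
where $Q^{\pi}(x,a) = r(x,a) + \gamma\, \mathbb{E}_{y \sim p(\cdot|x,a)}[V^{\pi}(y)]$.

We have
$$V_\alpha^{\pi_i, q_i} \ge V^{\pi_i}. \tag{20}$$

For $\beta$ sufficiently small, we have
$$J(\theta_{i+1}, q_{i+1}) \ge J(\theta_i, q_i) + c\, g_i, \tag{21}$$
where $c$ is a numerical constant, and $g_i$ is the norm of the gradient (minimized by the algorithm):
$$g_i = \Big\| \nabla_\theta\, \mathbb{E}_{\pi_i}\Big[ \mathrm{KL}\big(q_i(\cdot|x) \,\|\, \pi_\theta(\cdot|x)\big) \Big] \Big|_{\theta = \theta_i} \Big\|.$$
Thus we build a sequence of policies $(\pi_{\theta_i}, q_i)$ whose values $J(\theta_i, q_i)$ are non-decreasing and thus converge to a local maximum. In addition, the improvement is lower-bounded by a constant times the norm of the gradient, so the algorithm keeps improving the performance until the gradient vanishes (when we reach the limit of the capacity of our representation).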
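The first two properties of Proposition 3 can be illustrated with a small tabular experiment. This is a sketch under assumptions of our own, not the authors' procedure: a random finite MDP, the closed form $q_i \propto \pi_i \exp(Q^{\pi_i}/\alpha)$ for the maximizer of the regularized backup, and exact policy evaluation by a linear solve. All names (`evaluate`, `backup`, the sizes) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma, alpha = 4, 3, 0.9, 0.5

# Hypothetical random tabular MDP and current policy pi_i.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def evaluate(q, reward):
    # Exact policy evaluation: solve V = r_q + gamma * P_q V.
    r_q = np.sum(q * reward, axis=1)
    P_q = np.einsum("xa,xay->xy", q, P)
    return np.linalg.solve(np.eye(n_states) - gamma * P_q, r_q)

V_pi = evaluate(pi, r)        # non-regularized value of pi_i
Q_pi = r + gamma * P @ V_pi   # Q^{pi_i}(x, a)

# Assumed closed form: q_i(a|x) proportional to pi_i(a|x) exp(Q/alpha).
w = pi * np.exp(Q_pi / alpha)
q_i = w / w.sum(axis=1, keepdims=True)

def backup(q):
    # One regularized backup of V^{pi_i} under policy q.
    return np.sum(q * (Q_pi - alpha * np.log(q / pi)), axis=1)

best = backup(q_i)
others = [backup(rng.dirichlet(np.ones(n_actions), size=n_states))
          for _ in range(200)]

# Regularized value of q_i, to compare against V^{pi_i}.
V_reg = evaluate(q_i, r - alpha * np.log(q_i / pi))
print("min improvement over V^pi:", np.min(V_reg - V_pi))
```

In this setting no randomly drawn policy beats `q_i` on the regularized backup, and the regularized value of `q_i` is at least the value of `pi` in every state, matching the second property.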
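Proof.
We have
$$q_i(\cdot|x) = \arg\max_q\, \mathbb{E}_{a \sim q(\cdot|x)}\Big[ Q^{\pi_i}(x,a) - \alpha \log \frac{q(a|x)}{\pi_i(a|x)} \Big],$$
from which we deduce (19). Now, from the definition of $q_i$, we have
$$T_\alpha^{\pi_i, q_i} V^{\pi_i} \ge T_\alpha^{\pi_i, \pi_i} V^{\pi_i} = T^{\pi_i} V^{\pi_i} = V^{\pi_i}.$$
Now, since $T_\alpha^{\pi_i, q_i}$ is a monotone operator (i.e. if $V_1 \ge V_2$ elementwise, then $T_\alpha^{\pi_i, q_i} V_1 \ge T_\alpha^{\pi_i, q_i} V_2$) and its fixed point is $V_\alpha^{\pi_i, q_i}$, we have
$$V_\alpha^{\pi_i, q_i} = \lim_{t \to \infty} \big(T_\alpha^{\pi_i, q_i}\big)^t V^{\pi_i} \ge V^{\pi_i},$$
which proves (20).
Now, in order to prove (21) we derive the following steps.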
Step 1:
From the definition of we have, for any ,
(22) 
Writing the functional that we minimize
the update rule is . Thus we have that for sufficiently small ,
(23) 
where .
Step 2:
We deduce
This rewrites:
(24) 
Step 3:
Now a bit of algebra. For two stochastic matrices and , we have
Applying this equality to the transition matrices and and since , we have:
Appendix B Additional Experiment: Discrete control
As a proof of concept, showcasing the robustness of our algorithm and its hyperparameters, we performed an experiment on a subset of the games contained in the "Arcade Learning Environment" (ALE). For this experiment we used the same hyperparameter settings for the KL constraints as for the continuous control experiments, as well as the same learning rate. We merely altered the network architecture to the standard network structure used by DQN (Mnih et al., 2015) and created a separate network with the same architecture, but predicting the parameters of the policy distribution. A comparison between our algorithm and well-established baselines from the literature, in terms of mean performance, is listed in Table 1. While we do not obtain state-of-the-art performance in this experiment, the fact that MPO is competitive out of the box in these domains suggests that combining the ideas presented in this paper with recent advances for RL with discrete actions (Bellemare et al., 2017) could be a fruitful avenue for future work.
| Game/Agent | Human | DQN | Prior. Dueling | C51 | MPO |
|---|---|---|---|---|---|
| Pong | 14.6 | 19.5 | 20.9 | 20.9 | 20.9 |
| Breakout | 30.5 | 385.5 | 366.0 | 748 | 360.5 |
| Q*bert | 13,455.0 | 13,117.3 | 18,760.3 | 23,784 | 10,317.0 |
| Tennis | 8.3 | 12.2 | 0.0 | 23.1 | 22.2 |
| Boxing | 12.1 | 88.0 | 98.9 | 97.8 | 82.0 |
Appendix C Experiment details
In this section we give the details of the hyperparameters used for each experiment. All the continuous control experiments use a feed-forward network, except for Parkour2d, where we used the same network architecture as in Heess et al. (2017). Other hyperparameters for MPO with a non-parametric variational distribution were set as follows:
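| Hyperparameter | control suite | humanoid |
|---|---|---|
| Policy net | 100-100 | 200-200 |
| Q function net | 200-200 | 300-300 |
| $\epsilon$ | 0.1 | " |
| $\epsilon_\mu$ | 0.1 | " |
| $\epsilon_\Sigma$ | 0.0001 | " |
| Discount factor ($\gamma$) | 0.99 | " |
| Adam learning rate | 0.0005 | " |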
Hyperparameters for MPO with a parametric variational distribution were as follows:
| Hyperparameter | control suite tasks | humanoid |
|---|---|---|
| Policy net | 100-100 | 200-200 |
| Q function net | 200-200 | 300-300 |
| $\epsilon_\mu$ | 0.1 | " |
| $\epsilon_\Sigma$ | 0.0001 | " |
| Discount factor ($\gamma$) | 0.99 | " |
| Adam learning rate | 0.0005 | " |
Appendix D Derivation of update rules for a Gaussian Policy
For continuous control we assume that the policy is given by a Gaussian distribution with a full covariance matrix, i.e., $\pi(a|s,\theta) = \mathcal{N}(\mu, \Sigma)$. Our neural network outputs the mean $\mu = \mu(s)$ and the Cholesky factor $A = A(s)$, such that $\Sigma = AA^T$. The lower triangular factor $A$ has positive diagonal elements enforced by the softplus transform $A_{ii} \leftarrow \log(1 + \exp(A_{ii}))$.
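The Cholesky parametrization described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' implementation: the unconstrained vector `raw` stands in for the network output, and the helper `cholesky_from_raw` and the action dimension are hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def cholesky_from_raw(raw, dim):
    # Fill a lower-triangular matrix from dim*(dim+1)/2 unconstrained
    # values, then pass the diagonal through softplus so it is positive.
    A = np.zeros((dim, dim))
    A[np.tril_indices(dim)] = raw
    diag = np.diag_indices(dim)
    A[diag] = softplus(A[diag])
    return A

rng = np.random.default_rng(0)
dim = 3
mu = rng.normal(size=dim)                     # stand-in for the mean head
raw = rng.normal(size=dim * (dim + 1) // 2)   # stand-in for the factor head

A = cholesky_from_raw(raw, dim)
Sigma = A @ A.T                               # full covariance matrix

# Sample actions a = mu + A z with z ~ N(0, I), so that a ~ N(mu, Sigma).
samples = mu + rng.standard_normal((5, dim)) @ A.T
print(Sigma)
```

Because the diagonal of `A` is strictly positive, `Sigma` is positive definite by construction and `A` is its unique Cholesky factor, so no projection or eigenvalue clipping is needed during optimization.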
D.1 Non-parametric variational distribution
In this section we provide the derivations and implementation details for the non-parametric variational distribution case for both the E-step and the M-step.
D.2 E-Step
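The E-step with a non-parametric variational distribution $q(a|s)$ solves the following program, where we have replaced expectations with integrals to simplify the following derivations:
$$\max_q \int \mu_q(s) \int q(a|s)\, Q_{\theta_i}(s,a)\, da\, ds$$
$$\text{s.t.} \quad \int \mu_q(s)\, \mathrm{KL}\big(q(a|s) \,\|\, \pi(a|s,\theta_i)\big)\, ds < \epsilon, \qquad \iint \mu_q(s)\, q(a|s)\, da\, ds = 1.$$
First we write the Lagrangian equation, i.e.,
$$L(q, \eta, \gamma) = \int \mu_q(s) \int q(a|s)\, Q_{\theta_i}(s,a)\, da\, ds + \eta \Big( \epsilon - \int \mu_q(s) \int q(a|s) \log \frac{q(a|s)}{\pi(a|s,\theta_i)}\, da\, ds \Big) + \gamma \Big( 1 - \iint \mu_q(s)\, q(a|s)\, da\, ds \Big).$$
Next we maximise the Lagrangian $L$ w.r.t. the primal variable $q$. The derivative w.r.t. $q$ reads,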