Variational Inference MPC for Bayesian Model-based Reinforcement Learning

In recent studies on model-based reinforcement learning (MBRL), incorporating uncertainty in forward dynamics is a state-of-the-art strategy to enhance learning performance, making MBRL competitive with cutting-edge model-free methods, especially in simulated robotics tasks. Probabilistic ensembles with trajectory sampling (PETS) is a leading type of MBRL, which applies Bayesian inference to dynamics modeling and uses model predictive control (MPC) with stochastic optimization via the cross entropy method (CEM). In this paper, we propose a novel extension to uncertainty-aware MBRL. Our main contributions are twofold. First, we introduce a variational inference MPC, which reformulates various stochastic methods, including CEM, in a Bayesian fashion. Second, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). As a result, our Bayesian MBRL can involve multimodal uncertainties both in dynamics and optimal trajectories. In comparison to PETS, our method consistently improves asymptotic performance on several challenging locomotion tasks.



1 Introduction

Model predictive control (MPC) is a powerful and accepted technology for advanced control systems such as manufacturing processes [36], HVAC systems [3], power electronics [37], autonomous vehicles [30], and humanoids [23]. MPC utilizes the specified models of system dynamics to predict future states and rewards (or costs) to plan future actions that maximize the total reward over the predicted trajectories. Especially for industrial applications, the clear explainability of such a decision-making process is advantageous. Furthermore, in some tasks (e.g., games) [33], planning-based policies of this nature could outperform reactive policies (e.g., full neural network policies).

Model-based reinforcement learning (MBRL) methods that employ expressive function approximators (e.g., deep neural networks: DNNs) [10, 40, 28] present appealing approaches for MPC. The main difficulty in introducing MPC to practical systems is specifying the forward dynamics models of target systems, and accurate system identification is challenging in many advanced applications. Take robotics for example: robots make contact with floors and walls and must manipulate objects, making the dynamics highly non-linear. The main objective of MBRL is to train approximators of such complex dynamics through experiences in real systems. The general procedure of MBRL can be summarized as follows: (1. training-step) train the approximate model with a given training dataset; then (2. test-step) execute the actions (or policies) optimized with the dynamics model in a real environment and augment the dataset with the observed results. These training and test steps are conducted iteratively to collect sufficient and diverse data so as to achieve the desired performance.

One feature of MBRL is its considerable sample efficiency compared to model-free reinforcement learning (MFRL), which directly trains policies through experiences. In other words, MBRL requires much less test time in real environments. In addition, MBRL benefits from the generalizability of the trained model, which can be easily applied to new tasks in the same system. However, the asymptotic performance of MBRL is generally inferior to that of model-free methods. This discrepancy is primarily due to the overfitting of dynamics models to the few data available during initial MBRL steps, which is called the model-bias problem [10]. Several studies have demonstrated that incorporating uncertainty in dynamics models can alleviate this issue. The uncertainty-aware modeling is realized by Bayesian inference employing a Gaussian Process [10], dropout as variational inference [11, 12, 20], or neural network ensembles [8, 24, 9].

Probabilistic ensembles with trajectory sampling (PETS) [8] is one type of uncertainty-aware MBRL. As an MPC-oriented MBRL method, PETS conducts trajectory optimization via the cross entropy method (CEM) [5] by using trajectories probabilistically sampled from the ensemble networks. Experiments have demonstrated that PETS can achieve performance competitive with state-of-the-art MFRL methods like Soft Actor Critic (SAC) [15], while yielding much higher sample efficiency. Since our primary interest is MPC and its application to practical systems, this paper mainly focuses on PETS and treats this method as a strong baseline.

Considering the success of probabilistic dynamics modeling, incorporating uncertainty in optimal trajectories appears very promising for MBRL. However, an optimization scheme that can utilize uncertainty has not yet been discussed. Although several stochastic approaches, including CEM, model predictive path integral (MPPI) [39, 40], covariance matrix adaptation evolution strategy (CMA-ES) [17], and proportional CEM (Prop-CEM) [13], have been proposed, they are not uncertainty-aware and tend to underestimate uncertainty. In addition, although their optimization procedures are very similar, they have been independently derived. Consequently, theoretical relations among these methods are unclear, preventing us from systematically understanding and reformulating them to be uncertainty-aware in a Bayesian fashion.

Motivated by these observations, in this paper, we propose a novel MPC concept for Bayesian MBRL. The organization and contributions of this paper are summarized as follows. (1) In Sec. 3, we introduce a novel MPC framework, variational inference MPC (VI-MPC), which generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. The key observations for deriving this framework are organized in Sec. 2, where we point out that general stochastic optimization methods can be regarded as the moment matching of the optimal trajectory posterior, which appears in a Bayesian MBRL formulation. (2) In Sec. 4, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). Toy task examples and the concept of our method are exhibited in Fig. 1. (3) In Sec. 5, we demonstrate that our method consistently outperforms PETS via experiments with challenging locomotion tasks in the MuJoCo physics simulator [35].

(a) Vanilla CEM used in PETS [8]: VIMPC(‘CEM’, ‘Gaussian’, False) (b) PaETS (Ours): VIMPC(‘CEM’, ‘GMM(M=5)’, True)
Figure 1: Toy task examples that illustrate the concept of our method. The objective of this task is to navigate a point mass on a plane by actuating it with a bounded maximum magnitude, while avoiding obstacles. This task is designed to have multiple (sub-)optimal trajectories. (a) A trajectory found by vanilla CEM. (b) Multiple trajectories found by PaETS, which approximates the trajectory posterior via variational inference with a Gaussian mixture model. The line width indicates the magnitude of the mixture coefficients. Exploiting diverse plans encourages active exploration in state-action spaces, improving the optimization performance and training dataset diversity. The notation VIMPC() is introduced in Sec. 3.

2 Model-based Reinforcement Learning as Bayesian Inference

In this section, we describe MBRL as a Bayesian inference problem using the control-as-inference framework [25]. Fig. 2 displays the graphical model for the formulation, with which an MBRL procedure can be rewritten in a Bayesian fashion: (1. training-step) infer $p(\theta \mid \mathcal{D})$; (2. test-step) infer $p(\tau \mid \mathcal{O})$, then sample actions from the posterior and execute the actions in a real environment. We denote a trajectory as $\tau$, a sequence of pairs $(s_t, a_t)$, where $s_t$ and $a_t$ respectively represent state and action. Given a state-action pair $(s_t, a_t)$ at time $t$, the next state $s_{t+1}$ can be predicted by a forward-dynamics model $p(s_{t+1} \mid s_t, a_t, \theta)$ parameterized with $\theta$. The posterior of $\theta$ is inferred from the training dataset $\mathcal{D}$, where $\mathcal{D}$ consists of states and actions observed during the test step. To formulate optimal control as inference, we additionally introduce a binary random variable $\mathcal{O}_t$ to represent the optimality of ($s_t$, $a_t$). Given $\mathcal{O}$, trajectory optimization can be expressed as an inference problem:

$$p(\tau \mid \mathcal{O}) \propto p(\mathcal{O} \mid \tau)\, p(\tau) = p(\mathcal{O} \mid \tau) \int \prod_{t} p(s_{t+1} \mid s_t, a_t, \theta)\, p(a_t)\; p(\theta \mid \mathcal{D})\, d\theta, \tag{1}$$

where an uninformative action prior (i.e., $p(a_t)$: uniform distribution) is supposed. For readability, $p(\mathcal{O} = 1 \mid \tau)$ is simply denoted as $p(\mathcal{O} \mid \tau)$. For the same reason, we omit the time subscripts of the sequences $s$, $a$. In the remainder of the paper, this simplified notation is employed. In Secs. 2.1 and 2.2, we review how these inference problems have been approximately handled in previous works.

2.1 Inference of Forward-dynamics Posterior

Figure 2: Graphical model for Bayesian MBRL.

Given a sufficiently expressive parameterized model, i.e., DNNs, one of the most practical and promising schemes for approximating the posterior $p(\theta \mid \mathcal{D})$ is to utilize neural network ensembles [8, 24, 9]. This process approximates the posterior as a set of particles $p(\theta \mid \mathcal{D}) \simeq \frac{1}{B} \sum_{i=1}^{B} \delta(\theta - \theta_i)$, where $\delta$ is the Dirac delta function and $B$ is the number of networks. Each particle $\theta_i$ is independently trained by stochastic gradient descent so as to (sub-)optimize its training objective. Although this approximation is not fully Bayesian, this scheme has several useful features. First, we can simply implement this process in standard deep learning frameworks. Furthermore, the ensemble model successfully captures multimodal uncertainty in the exact posterior.
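The particle view of an ensemble can be illustrated with a minimal sketch. It assumes linear least-squares models in place of DNNs and bootstrap resampling in place of independently initialized stochastic gradient descent; both substitutions, as well as the function names and the toy data, are hypothetical and chosen only to keep the example short.

```python
import numpy as np

def train_bootstrap_ensemble(X, y, n_models=5, seed=0):
    """Fit each member by least squares on its own bootstrap resample."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # rows drawn with replacement
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        members.append(w)
    return np.stack(members)  # shape: (n_models, n_features)

def predict_with_uncertainty(members, x):
    """Treat members as equally weighted particles; return mean and spread."""
    preds = members @ x
    return preds.mean(), preds.std()

# Toy data from the hypothetical system s' = 0.9 * s + 0.5 * a plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X @ np.array([0.9, 0.5]) + 0.05 * rng.standard_normal(200)
members = train_bootstrap_ensemble(X, y)
mean, spread = predict_with_uncertainty(members, np.array([1.0, 0.0]))
```

The spread across members is what a trajectory sampler would propagate forward; with DNN members the same interface applies, only the fitting routine changes.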

Another possible way to infer $p(\theta \mid \mathcal{D})$ is dropout as variational inference [11, 12, 20], which approximates the posterior as a Gaussian distribution $q(\theta)$. It has been proven that the variational inference problem $\min_{q} \mathrm{KL}(q(\theta) \,\|\, p(\theta \mid \mathcal{D}))$ is approximately equivalent to training networks with dropout, where $\mathrm{KL}$ denotes the Kullback-Leibler (KL) divergence. Although this scheme is also simple and theoretically supported, approximation by a single Gaussian distribution tends to underestimate uncertainty (or multimodality) in the posterior. To remedy this problem, $\alpha$-divergence dropout has been proposed [26], which replaces the KL divergence with the $\alpha$-divergence so as to prevent $q$ from overfitting to a single mode. However, as long as $q$ is Gaussian, the multimodality cannot be managed well.

In our preliminary experiments of MBRL, we tested the above two schemes and observed that the ensemble performs much better than ($\alpha$-)dropout. This result provides us with the insight that capturing multimodality in the posterior has crucial effects in MBRL. Therefore, in this paper, we also employ this ensemble scheme to approximate $p(\theta \mid \mathcal{D})$ in the same way as our baseline, PETS [8]. In Sec. 4, we also attempt to incorporate multimodality in the posterior $p(\tau \mid \mathcal{O})$.

2.2 Moment Matching of Trajectory Posterior

This section clarifies the connection between trajectory optimization and the posterior approximation problem. The key observation delineated here is that several MPC methods, including CEM used in PETS and MPPI, can be regarded as the moment matching of the posterior.

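Given an inferred model posterior $p(\theta \mid \mathcal{D})$, we can sample trajectories from (1) (see footnote 1). Let us approximate the action posterior with a Gaussian distribution $q(a) = \mathcal{N}(a; \mu, \Sigma)$. The mean $\mu$ of the posterior action sequence can be estimated by moment matching:

$$\mu = \mathbb{E}_{p(\tau \mid \mathcal{O})}[a] = \frac{\mathbb{E}_{p(a)}[W(a)\, a]}{\mathbb{E}_{p(a)}[W(a)]}, \qquad W(a) := \mathbb{E}_{p(s \mid a, \theta)\, p(\theta \mid \mathcal{D})}[p(\mathcal{O} \mid \tau)]. \tag{2}$$

Eq. (2) can be viewed as a weighted average where each sampled action is weighted by the likelihood of optimality $W(a)$. In the same way, we can also estimate the variance of the posterior:

$$\Sigma = \frac{\mathbb{E}_{p(a)}[W(a)\,(a - \mu)(a - \mu)^{\top}]}{\mathbb{E}_{p(a)}[W(a)]}. \tag{3}$$

In practice, sampling from the uniform distribution $p(a)$ is quite inefficient and requires an almost infinite number of samples. Hence, let us consider iteratively estimating the parameters by incorporating importance sampling. Let $\mu^{(j)}$, $\Sigma^{(j)}$ be the estimated parameters at iteration $j$, with $q^{(j)}(a) := \mathcal{N}(a; \mu^{(j)}, \Sigma^{(j)})$; we can rearrange (2) as

$$\mu^{(j+1)} = \frac{\mathbb{E}_{q^{(j)}(a)}[W(a)\, a]}{\mathbb{E}_{q^{(j)}(a)}[W(a)]}. \tag{4}$$

It is worth noting that a similar iterative law can also be derived by solving the optimization problem by mirror descent [27, 29]. To connect this inference problem to trajectory optimization, we define the optimality likelihood with the trajectory reward $r(\tau)$ and a monotonically increasing function $f$, as $p(\mathcal{O} \mid \tau) := f(r(\tau))$. If we define $f(\cdot) := \exp(\cdot)$, the same as [25, 31], an optimization algorithm similar to MPPI [39, 40, 29] is recovered. As summarized in Table 1, other similarities to well-known optimization algorithms, including CEM, can be observed with different optimality definitions (see footnote 2).

Footnote 1: Trajectory sampling methods with $p(\theta \mid \mathcal{D})$ have been discussed and evaluated in [8]. In this paper, we employ the TS1 method suggested in the reference (see the trajectory-sampling steps in Alg. 1).

Footnote 2: We implicitly assume the existence of a step-wise likelihood corresponding to each definition. Since another graphical model with a single unified optimality can be defined, the existence is not critical.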
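The iterative weighted-average estimate discussed above can be sketched on a toy problem. The sketch assumes a diagonal Gaussian over a two-dimensional action sequence and a hypothetical exponential weight on the negative squared distance to a target; the target, the temperature 0.5, and all function names are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def moment_matching_step(mu, sigma, weight_fn, rng, n_samples=500):
    """Weighted-average update of a diagonal Gaussian over action sequences."""
    actions = mu + sigma * rng.standard_normal((n_samples, mu.size))
    w = weight_fn(actions)                   # optimality weights, all >= 0
    w = w / w.sum()
    new_mu = w @ actions                     # weighted mean
    new_var = w @ (actions - new_mu) ** 2    # weighted per-dimension variance
    return new_mu, np.sqrt(new_var)

def optimize(weight_fn, dim, n_iters=20, seed=0):
    """Repeat the update, always sampling from the latest Gaussian."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        mu, sigma = moment_matching_step(mu, sigma, weight_fn, rng)
    return mu, sigma

# Hypothetical reward: negative squared distance to a target action sequence.
target = np.array([1.0, -1.0])

def weight_fn(a):
    return np.exp(-np.sum((a - target) ** 2, axis=1) / 0.5)

mu, sigma = optimize(weight_fn, dim=2)
```

Because each iteration multiplies the current Gaussian by the weight function before refitting, the estimate contracts toward the single best action sequence, which is the single-mode behavior the later sections aim to relax.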

MPPI [39] CEM [5] Prop-CEM [13] CMA-ES [17]
Table 1: Optimization algorithms derived by moment matching of the trajectory posterior with different optimality definitions. The definitions involve an indicator function and a rank-preserving transformation.
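As a rough illustration of how the choice of optimality definition changes the sample weights, the sketch below implements three commonly used weighting rules. These are generic textbook forms and are not claimed to reproduce the exact entries of Table 1; the function names, the temperature, and the elite fraction are assumptions, and CMA-ES is omitted.

```python
import numpy as np

def mppi_weights(returns, temperature=1.0):
    """Exponential weighting; subtracting the max only aids numerical stability."""
    return np.exp((returns - returns.max()) / temperature)

def cem_weights(returns, elite_frac=0.1):
    """Indicator weighting: 1 for returns at or above the elite threshold."""
    threshold = np.quantile(returns, 1.0 - elite_frac)
    return (returns >= threshold).astype(float)

def prop_cem_weights(returns):
    """Weights proportional to the return, rescaled to [0, 1]."""
    span = returns.max() - returns.min()
    if span == 0.0:
        return np.ones_like(returns)
    return (returns - returns.min()) / span

returns = np.array([0.0, 1.0, 2.0, 3.0])
w_cem = cem_weights(returns, elite_frac=0.5)
w_prop = prop_cem_weights(returns)
w_mppi = mppi_weights(returns)
```

All three rules are monotone in the return, so they agree on the ranking of samples and differ only in how sharply they concentrate weight on the best ones.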

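There is a discrepancy between (4) and the CEM implementation in [8], in which $f(\mathbb{E}[r(\tau)])$ is used instead of $\mathbb{E}[f(r(\tau))]$, where the expectation is taken over the sampled trajectories of each action. Since $f$ is a convex function, Jensen’s inequality holds in this case, thus $f(\mathbb{E}[r(\tau)]) \le \mathbb{E}[f(r(\tau))]$. The equality holds when $r(\tau)$ is constant, implying that $f(\mathbb{E}[r(\tau)]) \simeq \mathbb{E}[f(r(\tau))]$ for low-variance $r(\tau)$ and that $f(\mathbb{E}[r(\tau)])$ is strictly smaller for high-variance (or more uncertain) $r(\tau)$. Namely, $f(\mathbb{E}[r(\tau)])$ underestimates the optimality likelihood if $a$ generates uncertain trajectories. Since we have experimentally observed that this filtering effect of $f(\mathbb{E}[r(\tau)])$ yields higher optimization performance than $\mathbb{E}[f(r(\tau))]$ (see Sec. A), this paper heuristically employs $f(\mathbb{E}[r(\tau)])$.
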

In practice, the expectation operators are approximated by Monte Carlo integration over a finite number of sampled actions and, for each action, a finite number of sampled trajectories.
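The two ways of combining the transformation $f$ with the expectation over trajectories can be compared numerically with Monte Carlo samples. The sketch below assumes $f = \exp$ and two hypothetical actions whose returns share the same mean but differ in variance; the numbers are illustrative and are not results from the paper.

```python
import numpy as np

def weight_estimates(returns, f):
    """returns: (n_actions, n_trajectories) sampled returns per candidate action."""
    mean_of_f = f(returns).mean(axis=1)   # Monte Carlo estimate of E[f(r)]
    f_of_mean = f(returns.mean(axis=1))   # f(E[r]): average first, then transform
    return mean_of_f, f_of_mean

rng = np.random.default_rng(0)
# Two hypothetical actions with the same expected return but different spread.
low_var = 1.0 + 0.1 * rng.standard_normal(1000)
high_var = 1.0 + 1.0 * rng.standard_normal(1000)
mean_of_f, f_of_mean = weight_estimates(np.stack([low_var, high_var]), np.exp)
gap = mean_of_f - f_of_mean   # non-negative for convex f (Jensen's inequality)
```

The gap is much larger for the high-variance action, so averaging returns before applying $f$ assigns that action a comparatively smaller weight, which is the filtering effect described above.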

3 Variational Inference MPC: From Moment Matching to Inference

Given uncertainty in a dynamics model, it is natural to suppose that optimal trajectories are also uncertain. However, as exhibited in the previous section, PETS employs the moment matching of the trajectory posterior, ignoring most of the uncertainty in optimal trajectories. In this section, we introduce a variational inference MPC (VI-MPC) framework to formulate MBRL as fully Bayesian and involve uncertainty both in the dynamics and optimalities.

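Let us consider a variational inference problem: $\min_{q} \mathrm{KL}(q(\tau) \,\|\, p(\tau \mid \mathcal{O}))$. We assume the variational distribution is decomposed as $q(\tau) = q(a)\, p(s \mid a)$, with $p(s \mid a) := \int p(s \mid a, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$; hence, we introduce $q(a)$ as the action posterior, so that $q(\tau)$ takes a decomposable form similar to that of $p(\tau)$. This assumption prevents the optimality variables from altering the state transitions, which remain governed by the dynamics model [25]. As shown in Sec. B.1, this inference problem can be transformed to the maximization problem: $\max_{q} \mathbb{E}_{q(a)}[\log W(a)] + \mathcal{H}(q(a))$, where $W(a) := \mathbb{E}_{p(s \mid a, \theta)\, p(\theta \mid \mathcal{D})}[p(\mathcal{O} \mid \tau)]$ denotes the optimality likelihood of the action sequence $a$. A notable property is that this objective has an entropy regularization term $\mathcal{H}(q(a))$, which encourages $q(a)$ to have a broader shape to capture more uncertainty. For the sake of convenience, we introduce a tunable hyperparameter $\alpha$ to the optimality likelihood, $W(a)^{\alpha}$. Then the above objective can be transformed as $\max_{q} \mathbb{E}_{q(a)}[\log W(a)] + \frac{1}{\alpha} \mathcal{H}(q(a))$. By applying mirror descent [7] to this optimization problem, we can derive an update law for $q(a)$ (see Sec. C for the detailed derivation):

$$q^{(j+1)}(a) \leftarrow \frac{q^{(j)}(a)\, W(a)^{1/\lambda}\, q^{(j)}(a)^{-\kappa/\lambda}}{\mathbb{E}_{q^{(j)}(a)}\left[ W(a)^{1/\lambda}\, q^{(j)}(a)^{-\kappa/\lambda} \right]}, \tag{5}$$

where $\lambda$, $\kappa$ are hyperparameters and $\alpha$ is absorbed into them. $\lambda$ is the inverted step size that controls the optimization speed and $\kappa$ is the weight of the entropy regularization term $\mathcal{H}(q(a))$.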
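The effect of the entropy weight on sample weighting can be sketched numerically. The sketch assumes that the update law weights each sampled action in proportion to $W(a)^{1/\lambda}\, q(a)^{-\kappa/\lambda}$ and that $q$ is a diagonal Gaussian; the function names and the two-sample example are hypothetical.

```python
import numpy as np

def gaussian_log_density(actions, mu, sigma):
    """Log density of a diagonal Gaussian, evaluated row-wise."""
    z = (actions - mu) / sigma
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi * sigma ** 2), axis=1)

def vimpc_weights(log_w, log_q, lam=1.0, kappa=0.0):
    """Normalized weights proportional to W^(1/lam) * q^(-kappa/lam)."""
    logits = (log_w - kappa * log_q) / lam
    logits = logits - logits.max()   # stabilize the exponentiation
    weights = np.exp(logits)
    return weights / weights.sum()

# Two samples with equal optimality: one at the mode of q, one in its tail.
actions = np.array([[0.0], [2.0]])
log_q = gaussian_log_density(actions, mu=np.zeros(1), sigma=np.ones(1))
log_w = np.zeros(2)
w_plain = vimpc_weights(log_w, log_q, kappa=0.0)
w_entropy = vimpc_weights(log_w, log_q, kappa=0.5)
```

With the entropy weight switched off the two samples are weighted equally, whereas a positive entropy weight shifts mass toward the low-density sample, which keeps the refitted distribution from collapsing as quickly.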

Eq. (5) suggests a novel and general MPC framework, which we call variational inference MPC (VI-MPC). To realize a specific VI-MPC method, we specify the following parameters: (1) the optimality definition $f$ (or $W$; see Table 1), (2) the variational distribution model $q$, and (3) entropy regularization, i.e., $\kappa > 0$ or $\kappa = 0$. We did not include $\lambda$ in the specifications since it is highly dependent on the optimality definition (see Sec. G). In this paper, we describe the above specifications as VIMPC(<optimality_def>, <variational_dist>, <max_ent>). For example, we respectively express vanilla CEM and MPPI as VIMPC(‘CEM’, ‘Gaussian’, False) and VIMPC(‘MPPI’, ‘Gaussian’, False). In Sec. 4, we propose a new instance of VI-MPC to incorporate multimodal uncertainty in the posterior.

4 Probabilistic Action Ensembles with Trajectory Sampling

As reviewed in Sec. 2.1, previous methods have successfully involved multimodality in $p(\theta \mid \mathcal{D})$ with network ensembles. If this multimodality in $p(\theta \mid \mathcal{D})$ is given, other distributions depending on $\theta$, including $p(\tau \mid \mathcal{O})$, would also be multimodal. In other words, there are various possible optimal trajectories (or actions), as in Fig. 1. VIMPC(*, ‘Gaussian’, *) will easily fail to capture this multimodality because it overfits to a single mode. Inspired by the success of the ensemble approach for dynamics modeling, we propose a novel VI-MPC method that introduces action ensembles with a Gaussian mixture model (GMM), i.e., VIMPC(*, ‘GMM(M=*)’, *), which we call PaETS (Probabilistic Action Ensembles with Trajectory Sampling).

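PaETS defines the variational distribution as

$$q(a) := \sum_{m=1}^{M} \pi_m\, \mathcal{N}(a; \mu_m, \Sigma_m), \tag{6}$$

where $\sum_{m} \pi_m = 1$ and $M$ is the number of components of the mixture model. Now, we derive the iteration scheme to update the parameters of the GMM. At first, drawing $K$ samples from $q^{(j)}(a)$, we approximate $q^{(j)}(a)$ as a discretized distribution (or a set of particles):

$$q^{(j)}(a) \simeq \sum_{k=1}^{K} w^{(j)}(a_k)\, \delta(a - a_k), \tag{7}$$

where $a_k \sim q^{(j)}(a)$. Just after sampling, the weight of each particle is uniform: $w^{(j)}(a_k) = 1/K$. By substituting this approximated distribution into (5), the update law for the particle weights is derived as

$$w^{(j+1)}(a_k) \leftarrow \frac{W(a_k)^{1/\lambda}\, q^{(j)}(a_k)^{-\kappa/\lambda}}{\sum_{k'=1}^{K} W(a_{k'})^{1/\lambda}\, q^{(j)}(a_{k'})^{-\kappa/\lambda}}, \tag{8}$$

where $W(a_k)$ is the optimality likelihood of particle $a_k$ and $\lambda$, $\kappa$ are the hyperparameters of (5). Then we estimate the GMM parameters $\{\pi_m, \mu_m, \Sigma_m\}$ of $q^{(j+1)}$, which maximize the observation probability of the weighted particles:

$$\max_{\{\pi_m, \mu_m, \Sigma_m\}} \sum_{k=1}^{K} w^{(j+1)}(a_k) \log q^{(j+1)}(a_k). \tag{9}$$

By taking the derivative and borrowing the concept of the EM algorithm [4], we get the update laws of $\{\pi_m, \mu_m, \Sigma_m\}$, which take the weighted-average form like (4) (see Sec. D for the complete definition):

$$\mu_m \leftarrow \frac{\sum_{k} w^{(j+1)}(a_k)\, \eta_m(a_k)\, a_k}{\sum_{k} w^{(j+1)}(a_k)\, \eta_m(a_k)}, \qquad \eta_m(a_k) := \frac{\pi_m\, \mathcal{N}(a_k; \mu_m, \Sigma_m)}{\sum_{m'} \pi_{m'}\, \mathcal{N}(a_k; \mu_{m'}, \Sigma_{m'})}. \tag{10}$$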

Figure 3: Evaluated locomotion tasks simulated in MuJoCo.

Fig. 7 in Sec. E illustrates how this method works in a toy optimization task.
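Refitting a GMM to weighted particles can be sketched as a single EM iteration in which every responsibility is multiplied by the particle weight. The sketch assumes diagonal covariances, a small variance floor, and a one-dimensional toy data set; these choices and all function names are assumptions made for brevity, not the definition given in Sec. D.

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """Log density of diagonal Gaussians; x: (K, D), mu/var: (M, D) -> (K, M)."""
    diff = x[:, None, :] - mu[None, :, :]
    return -0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var), axis=2)

def weighted_em_step(x, w, pi, mu, var, var_floor=1e-6):
    """One EM step on weighted particles (x_k, w_k), with sum_k w_k = 1."""
    log_resp = np.log(pi)[None, :] + log_gauss_diag(x, mu, var)
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)     # responsibilities, (K, M)
    wr = w[:, None] * resp                      # particle weight x responsibility
    mass = wr.sum(axis=0) + 1e-12               # total weight per component
    new_mu = (wr.T @ x) / mass[:, None]
    diff2 = (x[:, None, :] - new_mu[None, :, :]) ** 2
    new_var = np.einsum('km,kmd->md', wr, diff2) / mass[:, None] + var_floor
    return mass / mass.sum(), new_mu, new_var

# Toy particles drawn around two hypothetical modes, with uniform weights.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.3, 100), rng.normal(2.0, 0.3, 100)])[:, None]
w = np.full(len(x), 1.0 / len(x))
pi, mu, var = np.array([0.5, 0.5]), np.array([[-1.0], [1.0]]), np.ones((2, 1))
for _ in range(10):
    pi, mu, var = weighted_em_step(x, w, pi, mu, var)
```

In a planner the uniform weights would be replaced by the optimality-dependent particle weights, so that each component is pulled toward a different high-reward region.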

In summary, PaETS and the MPC utilizing it are respectively described in Algs. 1 and 2, which are parameterized by the number of iterations for optimization and the length of the task episode. At the first time step in Alg. 2, the component means $\mu_m$ are initialized independently at random. At subsequent time steps, the mixture coefficients $\pi_m$ and covariances $\Sigma_m$ are reset to their initial values, encouraging exploration for the next time step and preventing the GMM from degenerating to a single mode. If we set $M = 1$ and $\kappa = 0$, these procedures are almost equivalent to those of PETS. The use of a GMM ($M > 1$) does not increase computational complexity significantly (see Sec. F).

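Algorithm 1 PaETS
Input: State, GMM parameters $\{\pi_m, \mu_m, \Sigma_m\}$, and the inferred dynamics posterior
Output: Optimized GMM parameters
for each optimization iteration do
        Sample actions from the current GMM
        Sample states with the ensemble dynamics model // TS1 method
        Evaluate the sampled trajectories
        Calculate the particle weights by (8)
        Update the GMM parameters by (10)

Algorithm 2 MPC with PaETS
Input: Initial state
Data: Training data $\mathcal{D}$, initial variance
Infer the dynamics posterior // train ensemble neural networks
Initialize the GMM means // random init.
for each time step of the episode do // control loop
        Execute Alg. 1
        Sample an action from the optimized GMM
        Send the action to the actuators and observe the next state
        Set up the GMM parameters for the next time step // warm startup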
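The receding-horizon control loop with a warm start can be sketched in a heavily simplified setting. The sketch assumes a single Gaussian ($M = 1$) updated by elite averaging, a known scalar linear model standing in for the learned ensemble, and bounded actions; the system, the reward, and every name and constant below are hypothetical.

```python
import numpy as np

def step_fn(s, a):
    """Hypothetical known model, also used as the 'real' system here."""
    return 0.9 * s + 0.5 * a

def reward_fn(s):
    return -s ** 2

def cem_plan(s0, mu, sigma_init, rng, n_iters=5, n_samples=100, elite_frac=0.1):
    """Optimize an action sequence from state s0, starting from the mean mu."""
    sigma = np.full_like(mu, sigma_init)   # variance is reset at every control step
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, mu.size))
        actions = np.clip(mu + sigma * noise, -1.0, 1.0)
        s = np.full(n_samples, s0)
        returns = np.zeros(n_samples)
        for t in range(mu.size):
            s = step_fn(s, actions[:, t])
            returns += reward_fn(s)
        elite = actions[np.argsort(returns)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

def run_mpc(s0=2.0, horizon=5, episode_len=30, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-1.0, 1.0, size=horizon)   # random initialization
    s, states = s0, [s0]
    for _ in range(episode_len):                # control loop
        mu = cem_plan(s, mu, 0.5, rng)
        s = step_fn(s, mu[0])                   # apply only the first action
        states.append(s)
        mu = np.append(mu[1:], 0.0)             # warm start: shift the plan one step
    return np.array(states)

states = run_mpc()
```

Only the first planned action is executed at each step, and the shifted plan seeds the next optimization, while the variance is reset so that the search does not stay collapsed.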

5 Experiments

5.1 Comparison to State-of-the-art Methods

The main objective of this experiment is to demonstrate that PaETS has advantages over the state-of-the-art MBRL baseline, PETS [8]. In this experiment, PaETS and PETS (or vanilla CEM) were implemented in the same codebase with different parameters, i.e., VIMPC(‘CEM’, ‘GMM(M=5)’, True) for PaETS and VIMPC(‘CEM’, ‘GMM(M=1)’, False) for PETS. We also evaluated another MBRL baseline with MPPI [40], realized as VIMPC(‘MPPI’, ‘GMM(M=1)’, False). These methods share the settings for inference (training of network ensembles). The state-of-the-art MFRL method SAC [15] was also evaluated to compare asymptotic performance (footnote 3: we used the open-source code). Fig. 3 illustrates the simulated locomotion tasks evaluated in this experiment, which are complex and challenging due to their high non-linearity. None of the tasks except HalfCheetah was evaluated in the original PETS paper [8]. Other details about our implementation and experimental settings are described in Sec. G and Sec. H. Fig. 4 presents the experimental results, in which PaETS consistently exhibits better asymptotic performance than the MBRL baselines. In addition, PaETS outperforms or is comparable to SAC while requiring significantly fewer samples (about 10× more sample-efficient).

Figure 4: Learning curves for different tasks and algorithms. These are averaged results of 8 (for MBRL) and 20 (for SAC) independent training trials with different random seeds. We stopped the training when convergence was observed or after reaching the specified number of test steps (different limits were specified for MBRL and for SAC). The asymptotic performances (averages of the last 10 test steps) are depicted as dashed lines.

5.2 Ablation Study

This experiment clarifies which component of PaETS (GMM and entropy regularization) contributed to the overall improvement. Fig. 5 shows the results of this ablation study and Welch’s t-test for some selected representative pairs. From this figure, one can observe that the use of a GMM ($M > 1$) significantly improves performance. The effect of the regularization ($\kappa > 0$) is relatively small, but not negligible. In certain tasks, setting $\kappa$ to particular values could improve the performance. In the case of $\kappa > 0$, the regularization sheds light on actions sampled from low $q(a)$, thus encouraging $q(a)$ to be multimodal. In some tasks that require rather delicate controls (e.g., Hopper, Walker2d), the effect of $\kappa$ seems less significant. Fig. 6 examines the sensitivity to the number of mixture components $M$ and identifies the value that achieved the highest performance. If an infinite or sufficiently large number of samples were given, it would be reasonable to set $M$ large enough to capture multimodality. However, in practice, the number of samples is finite and could be small due to computational constraints. In this case, a larger $M$ makes it difficult to approximate $q(a)$ as a set of particles, resulting in degradation of the optimization performance.
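For reference, the statistic behind Welch's t-test can be computed in a few lines. The sketch below returns the t statistic and the Welch-Satterthwaite degrees of freedom; obtaining a p-value additionally requires the CDF of Student's t distribution, which is not included here. The two input samples are made-up numbers, not data from the experiments.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

# Hypothetical asymptotic returns from two methods (not data from the paper).
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike Student's t-test, this variant does not assume equal variances in the two groups, which suits comparisons between methods whose run-to-run variability differs.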

Figure 5: Asymptotic performance comparison with varying $M$s and $\kappa$s. These are averaged results over 8 different MBRL trials and the last 10 test steps. The error bars denote 95% confidence intervals. Symbols ‘*’, ‘**’ and ‘n.s.’ respectively denote two increasingly strict significance levels and a non-significant difference in Welch’s t-test.
Figure 6: Asymptotic performance comparison with varying $M$ and fixed $\kappa$. Only the HalfCheetah task is evaluated in this test.

6 Related Work

Dynamics Posterior Inference Recent MBRL methods, MB-MPO (Model-Based Meta-Policy-Optimization) [9] and ME-TRPO (Model Ensemble Trust Region Optimization) [24], also employ network ensembles to model dynamics, but they utilize the ensembles differently than we do: to train policy networks, not MPC.

Trajectory Optimization Sequential Monte-Carlo based MPC, described as VIMPC(*, ‘Particles’, False), has been introduced in [21], but it requires a well-designed proposal distribution to sample particles for the next iteration. Another particle-based method has been derived [31] by utilizing the control-as-inference framework. However, this method relies on not only a dynamics model, but also policy and value functions to manage particles, so MFRL methods must be incorporated.

Recent studies have demonstrated that entropy regularization is a promising strategy in policy training [1, 2, 14, 15]. However, to the best of our knowledge, the introduction of entropy regularization to MPC, together with an explicit multimodal expression that realizes their synergistic effect, is novel.

Ref. [38] also systematically organizes the stochastic MPC methods from the perspective of online learning, but uncertainty-aware discussions from a Bayesian viewpoint are not conducted.

Bayesian Reformulation Ref. [19] proposes a novel approach to generative adversarial imitation learning (GAIL) [18], which reformulates general GAIL in a Bayesian fashion and utilizes ensembles to infer discriminator posteriors. Another Bayesian reformulation of GAIL integrates imitation and reinforcement learning by introducing another optimality (i.e., imitation optimality) [22].

7 Conclusion & Discussions

This paper introduces a novel VI-MPC framework that systematically generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. We also devise a novel instance of this framework, called PaETS, which can successfully incorporate multimodal uncertainty in optimal trajectories. By combining our method and the recent uncertainty-aware dynamics modeling with neural network ensembles, our Bayesian MBRL is able to involve multimodalities both in dynamics and optimalities. In addition, our method is quite a simple extension of general stochastic methods and requires no significant additional computational complexity. Our experiments demonstrate that PaETS can improve asymptotic performance compared to the leading MBRL baseline PETS, and thus substantially enhances the potential of MBRL to be competitive with state-of-the-art MFRL.

Considering the simplicity and generalizability of VI-MPC and PaETS, we expect that our concept is applicable to a variety of tasks, such as traditional MPC with deterministic dynamics and advanced MPC with latent dynamics learned from pixels by Deep Planning Network [16]. By introducing a categorical mixture model as a variational distribution, application to combinatorial optimization is also feasible. In fact, our ongoing work includes experiments on discrete MPC for a practical system.

A question that remains is how to determine VI-MPC specifications. As implied in Fig. 4, the best optimality definition could be task dependent (e.g., MPPI outperformed vanilla CEM in the Ant task but not in other tasks). The regularization weight $\kappa$ also has task dependency as shown in Fig. 5. It would be challenging but interesting future work to add the parameters to the graphical model in Fig. 2 as latent variables to infer promising parameters along with optimal trajectories, like the infinite GMM [32]. Another appealing direction for future work is to introduce the concept of parallel tempering [6] from Markov chain Monte Carlo. By adaptively varying the temperatures of the ensemble actions, we can expect the ensemble diversity to improve.

We thank Vishwajeet Singh, Hiroki Nakamura and Akira Kinose for their cooperation in this study during their student-internship periods. Most of the experiments were conducted in ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.


  • [1] A. Abdolmaleki, R. Lioutikov, J. R. Peters, N. Lau, L. P. Reis, and G. Neumann (2015) Model-based relative entropy stochastic search. In NeurIPS, Cited by: §6.
  • [2] A. Abdolmaleki, B. Price, N. Lau, L. P. Reis, and G. Neumann (2017) Deriving and improving CMA-ES with information geometric trust regions. In GECCO, Cited by: §6.
  • [3] A. Afram and F. Janabi-Sharifi (2014) Theory and applications of HVAC control systems–a review of model predictive control (MPC). Building and Environment 72, pp. 343–355. Cited by: §1.
  • [4] J. A. Bilmes et al. (1998)

    A gentle tutorial of the EM algorithm and its application to parameter estimation for gaussian mixture and hidden markov models

    International Computer Science Institute 4 (510), pp. 126. Cited by: §4.
  • [5] Z. I. Botev, D. P. Kroese, R. Y. Rubinstein, and P. L’Ecuyer (2013) The cross-entropy method for optimization. In Handbook of statistics, Vol. 31, pp. 35–59. Cited by: §1, Table 1.
  • [6] S. Brooks, A. Gelman, G. Jones, and X. Meng (2011) Handbook of Markov Chain Monte Carlo. Cited by: §7.
  • [7] S. Bubeck et al. (2015) Convex optimization: algorithms and complexity. Vol. 8. Cited by: Appendix C, §3.
  • [8] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, Cited by: Appendix F, Figure 1, §1, §1, §2.1, §2.1, §2.2, §5.1, footnote 1.
  • [9] I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel (2018) Model-based reinforcement learning via meta-policy optimization. In CoRL, Cited by: §1, §2.1, §6.
  • [10] M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In ICML, Cited by: §1, §1.
  • [11] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In ICML, Cited by: §1, §2.1.
  • [12] Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. In NeurIPS, Cited by: §1, §2.1.
  • [13] S. Goschin, A. Weinstein, and M. Littman (2013)

    The cross-entropy method optimizes for quantiles

    In ICML, Cited by: §1, Table 1.
  • [14] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In ICML, Cited by: §6.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, Cited by: §1, §5.1, §6.
  • [16] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §7.
  • [17] N. Hansen, S. D. Müller, and P. Koumoutsakos (2003) Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary computation 11 (1), pp. 1–18. Cited by: §1, Table 1.
  • [18] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In NeurIPS, Cited by: §6.
  • [19] W. Jeon, S. Seo, and K. Kim (2018) A bayesian approach to generative adversarial imitation learning. In NeurIPS, Cited by: §6.
  • [20] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine (2017) Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182. Cited by: §1, §2.1.
  • [21] N. Kantas, J. Maciejowski, and A. Lecchini-Visintini (2009) Sequential monte carlo for model predictive control. In Nonlinear model predictive control, pp. 263–273. Cited by: §6.
  • [22] A. Kinose and T. Tadahiro (2019) Integration of imitation learning using GAIL and reinforcement learning using task-achievement rewards via probabilistic generative model. arXiv preprint arXiv:1907.02140. Cited by: §6.
  • [23] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake (2016) Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot. Autonomous Robots 40 (3), pp. 429–455. Cited by: §1.
  • [24] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. In ICLR, Cited by: §1, §2.1, §6.
  • [25] S. Levine (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §2.2, §2, §3.
  • [26] Y. Li and Y. Gal (2017) Dropout inference in bayesian neural networks with alpha-divergences. In ICML, Cited by: §2.1.
  • [27] M. Miyashita, S. Yano, and T. Kondo (2018) Mirror descent search and its acceleration. Robotics and Autonomous Systems 106, pp. 107–116. Cited by: §2.2.
  • [28] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, Cited by: §1.
  • [29] M. Okada and T. Taniguchi (2018) Acceleration of gradient-based path integral method for efficient optimal and inverse optimal control. In ICRA, Cited by: §2.2.
  • [30] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh. 1 (1), pp. 33–55. Cited by: §1.
  • [31] A. Piche, V. Thomas, C. Ibrahim, Y. Bengio, and C. Pal (2018) Probabilistic planning with sequential monte carlo methods. In ICLR, Cited by: §2.2, §6.
  • [32] C. E. Rasmussen (2000) The infinite gaussian mixture model. In NeurIPS, Cited by: §7.
  • [33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
  • [34] E. Theodorou, J. Buchli, and S. Schaal (2010) A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research 11, pp. 3137–3181. Cited by: Appendix G.
  • [35] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In IROS, Cited by: §1.
  • [36] F. D. Vargas-Villamil and D. E. Rivera (2000) Multilayer optimization and scheduling using model predictive control: application to reentrant semiconductor manufacturing lines. Computers & Chemical Engineering 24 (8), pp. 2009–2021. Cited by: §1.
  • [37] S. Vazquez, J. Leon, L. Franquelo, J. Rodriguez, H. A. Young, A. Marquez, and P. Zanchetta (2014) Model predictive control: a review of its applications in power electronics. IEEE Ind. Electron. Mag. 8 (1), pp. 16–31. Cited by: §1.
  • [38] N. Wagener, C. Cheng, J. Sacks, and B. Boots (2019) An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967. Cited by: §6.
  • [39] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou (2016) Aggressive driving with model predictive path integral control. In ICRA, Cited by: Appendix F, §1, §2.2, Table 1.
  • [40] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou (2017) Information theoretic MPC for model-based reinforcement learning. In ICRA, Cited by: Appendix F, §1, §1, §2.2, §5.1.

Appendix A Comparison Between and

We evaluated the impact of and on the optimization performance of (vanilla) CEM and MPPI, the results of which are summarized in Table 2, where gained much higher rewards than .

Table 2: Episode reward of HalfCheetah task with and . A common dynamics model (sufficiently trained ensemble neural network by MBRL) was employed for this test. Ten different trials were conducted and the results were averaged.

Appendix B Derivations

B.1 Derivation of the Variational Inference Objective

By using the assumption of , the KL-divergence can be transformed as


Appendix C Derivation of (5)

In this section, we simply denote as and as for readability. Let us consider the optimization problem:


By applying mirror descent [7], the iterative update law of is given as


where is the inner-product operator, is a hyper-parameter related to the step-size, and is the Lagrange multiplier for the constraint . The arguments in the operator can be rearranged as


where we used the relations:


The integrand of (16) can be organized as


Integrating the above equation yields


By minimizing this equation, we get:


The Lagrange multiplier can be removed using the constraint :


Considering the discussion in Sec. 2.2 and Appendix A, we compute as


Substituting (25) into (23) yields:


Marginalizing , we finally obtain:


In (5), we replaced , .
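
As a rough illustration of the structure of this derivation, the following sketch applies mirror descent with a KL-divergence proximity term to a discrete distribution. It is a simplified analogue, not the update law (5) itself: the linear toy objective, the names `mirror_descent_step` and `costs`, and the step size are assumptions made for the example. The final normalization plays the role of eliminating the Lagrange multiplier through the sum-to-one constraint.

```python
import numpy as np

def mirror_descent_step(q, grad, step_size):
    """One mirror-descent step on the probability simplex with a KL proximity term.

    The closed-form solution is a multiplicative update followed by
    normalization (the normalization enforces the sum-to-one constraint).
    """
    logits = np.log(q) - step_size * grad
    logits -= logits.max()          # subtract the max for numerical stability
    q_new = np.exp(logits)
    return q_new / q_new.sum()

# Hypothetical toy problem: minimize the expected cost <q, costs> over q.
costs = np.array([1.0, 0.2, 0.7])
q = np.full(3, 1.0 / 3.0)           # start from the uniform distribution
for _ in range(50):
    q = mirror_descent_step(q, costs, step_size=1.0)

print(q)                            # mass concentrates on the lowest-cost entry
```

In this toy setting the gradient of the objective is simply `costs`, so each step reweights the current distribution by an exponential of the negative cost, which mirrors the exponential weighting that appears in the continuous derivation.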

Appendix D Complete Definition of PaETS


Appendix E Optimization of Toy Objective Function by PaETS

Fig. 7 illustrates how PaETS optimizes a toy multimodal objective function.

Figure 7: The optimization process of a 2D multimodal objective function by PaETS (VIMPC(‘MPPI’, ‘GMM(M=2)’, True)), in which two distribution components are successfully optimized to fit the two modes. depict particles that approximate .

Appendix F Computational Complexity

The main computational bottleneck of PaETS (and PETS) is the execution of lines 3–6 in Alg. 1, in which total trajectories must be sampled. In our experiment, and were respectively set as , as in [8]. Compared to PETS, PaETS requires additional procedures such as action sampling from the GMM (line 2) and the GMM parameter update (line 9). However, these additional procedures are easily parallelizable on GPUs, and their computation times are much shorter than that of the above-mentioned bottleneck. In the experiments with our early prototype in TensorFlow, it took about 57 ms for and 55 ms for (equivalent to PETS) to execute one iteration of the for-loop in Alg. 1 on a single NVIDIA RTX2080 GPU. This execution time does not meet real-time constraints (e.g., 30 Hz). However, considering the success of the real-time implementation of MPPI in [39, 40], we believe a real-time implementation of our method is feasible with an optimized implementation using a compiled language, low-level GPU APIs, and thorough tuning of hyperparameters (e.g., , and DNN complexity).
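
To make the gap to real time concrete, the following back-of-the-envelope calculation combines the reported 57 ms timing with the sampling settings listed in Table 4. It rests on two assumptions that are not stated explicitly here: that the total number of trajectories is the product of the number of sampled actions and the number of trajectories per action, and that one control step runs all optimization iterations of the for-loop sequentially.

```python
# Settings taken from Table 4; the variable names are hypothetical.
n_actions = 500            # sampled actions
n_traj_per_action = 20     # trajectories for each action
n_iterations = 5           # optimization iterations per control step
ms_per_iteration = 57.0    # reported time for one for-loop iteration

# Assumption: every sampled action is evaluated on every trajectory.
trajectories_per_iteration = n_actions * n_traj_per_action

# Assumption: iterations run back to back within a single control step.
ms_per_control_step = n_iterations * ms_per_iteration

budget_ms = 1000.0 / 30.0  # time available per step at 30 Hz
required_speedup = ms_per_control_step / budget_ms

print(trajectories_per_iteration)   # 10000
print(ms_per_control_step)          # 285.0
print(round(required_speedup, 2))   # 8.55
```

Under these assumptions, the prototype would need to be roughly an order of magnitude faster to reach 30 Hz, which is the kind of gain the suggested optimizations would have to deliver.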

Appendix G Implementation Notes

Cross Entropy Method

It is a common technique to adaptively determine in Table 1 so that only the top- samples satisfy the threshold condition. We employ this technique, and the eliteness ratio is set to . has no effect on CEM optimization since takes binary values.
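
The adaptive threshold described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the diagonal-Gaussian refit, and the elite ratio of 0.4 in the usage example are all assumptions made here.

```python
import numpy as np

def cem_elite_weights(rewards, elite_ratio):
    """Return binary weights that select the top-k samples by reward."""
    rewards = np.asarray(rewards, dtype=float)
    k = max(1, int(round(elite_ratio * rewards.size)))
    # Adaptive threshold: the k-th largest reward among the current samples.
    threshold = np.sort(rewards)[-k]
    # Ties at the threshold are all kept, so slightly more than k may pass.
    weights = (rewards >= threshold).astype(float)
    return weights, threshold

def cem_update(actions, weights):
    """Refit a diagonal Gaussian to the selected samples (rows of `actions`)."""
    w = weights / weights.sum()
    mean = w @ actions
    var = w @ (actions - mean) ** 2
    return mean, var

# Hypothetical usage with five 1-D action samples.
rewards = [1.0, 5.0, 3.0, 9.0, 7.0]
actions = np.array([[0.0], [0.0], [0.0], [1.0], [3.0]])
weights, threshold = cem_elite_weights(rewards, elite_ratio=0.4)
mean, var = cem_update(actions, weights)
```

Because the weights are binary, rescaling the rewards by any positive factor leaves the selected set unchanged, which is consistent with the remark that the scaling parameter has no effect on CEM.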


Reward normalization heuristics, as suggested in [34], were also introduced for our MPPI implementation as


where . was set to be as also suggested in [34].
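
A common way to write this kind of heuristic is a min-max scaling of the trajectory returns inside the exponential, so that the weights no longer depend on the absolute scale of the rewards. The sketch below follows that form; the function name, the default sharpness `h=10.0`, and the small `eps` guard are assumptions made for illustration rather than values confirmed by this appendix.

```python
import numpy as np

def normalized_mppi_weights(rewards, h=10.0, eps=1e-8):
    """Exponential weights computed on min-max normalized rewards."""
    rewards = np.asarray(rewards, dtype=float)
    r_max, r_min = rewards.max(), rewards.min()
    # Scaled regret in [0, 1]: 0 for the best sample, about 1 for the worst.
    scaled = (r_max - rewards) / (r_max - r_min + eps)
    w = np.exp(-h * scaled)
    return w / w.sum()

# Hypothetical usage: the weights are unchanged by an affine reward rescaling.
r = np.array([0.0, 5.0, 10.0])
w_a = normalized_mppi_weights(r)
w_b = normalized_mppi_weights(100.0 * r + 7.0)
```

With this normalization, a single sharpness value behaves similarly across tasks whose reward magnitudes differ widely.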

Entropy Regularization

The value of is very sensitive to task settings, especially the dimensionality of the action space. To make insensitive to these settings, we introduce the following normalization trick, inspired by the above heuristics. First, we rearrange (8) as


Then, we replace with a normalized one:


By applying these heuristics, the range of the entropy bonus is limited to , where the action with the lowest probability among the samples gains the highest entropy bonus of .
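
One normalization with the property described above (a bounded bonus, largest for the least probable sample) is a min-max scaling of the negative log-probabilities. The sketch below is only a guess at the general shape of such a trick: the function name, the scaling form, the `eps` guard, and the resulting range are assumptions, and the exact expression used in the paper may differ.

```python
import numpy as np

def normalized_entropy_bonus(log_q, kappa, eps=1e-8):
    """Bounded entropy bonus from min-max scaled negative log-probabilities.

    `log_q` holds the log-probability of each sampled action under the
    current action distribution; `kappa` is the regularizer weight.
    """
    surprisal = -np.asarray(log_q, dtype=float)
    span = surprisal.max() - surprisal.min() + eps
    scaled = (surprisal - surprisal.min()) / span   # roughly in [0, 1]
    return kappa * scaled

# Hypothetical usage with three samples of decreasing probability.
log_q = np.log([0.5, 0.3, 0.2])
bonus = normalized_entropy_bonus(log_q, kappa=0.5)
```

In this form the bonus no longer grows with the dimensionality of the action space, since only the relative ordering and spread of the log-probabilities among the samples matter.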

Appendix H Experimental Setup

We used MuJoCo tasks modified from standard OpenAI Gym tasks. Table 3 summarizes the task settings, where , and respectively denote the velocity, orientation angle, and height of the agents. Penalty functions , are newly introduced to encourage the agents to move forward in the proper form. In exchange, the done flags originally used for early task termination are removed. , are defined as


We modified the range of actions (i.e., torques) from to to exaggerate uncertainties in the optimal trajectory posteriors.

Task Reward Function Misc.
Table 3: MuJoCo task settings.

Table 4 summarizes the shared parameter settings for MBRL (PaETS, PETS, and MPPI). For SAC, we used the default parameters from the original codebase.

Parameter                           HalfCheetah   Ant    Hopper   Walker2d
: prediction horizon                30            30     60       45
: weight of entropy regularizer     0.5           0.25   0.5      0.5
: # sampled actions                 500 (all tasks)
: # trajectories for each action    20 (all tasks)
: # optimization-iterations         5 (all tasks)
: # episode length                  1000 (all tasks)
: # neural networks                 5 (all tasks)
hidden nodes                        (200, 200, 200, 200) (all tasks)
activation function                 Swish (all tasks)
optimizer                           Adam (all tasks)
learning rate
batch-size                          160 (all tasks)
Table 4: MBRL parameters.
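
For orientation, the network settings in Table 4 correspond to an ensemble of five multilayer perceptrons with four hidden layers of 200 units and Swish activations. The NumPy sketch below only illustrates that shape. The input and output dimensions, the weight initialization, and the deterministic output head are assumptions made here; a probabilistic ensemble would typically predict a distribution rather than a point estimate.

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def init_mlp(rng, in_dim, out_dim, hidden=(200, 200, 200, 200)):
    """Create (weight, bias) pairs for each layer; the init scale is hypothetical."""
    sizes = (in_dim, *hidden, out_dim)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply Swish after every hidden layer and keep the output layer linear."""
    for W, b in params[:-1]:
        x = swish(x @ W + b)
    W, b = params[-1]
    return x @ W + b

# Hypothetical dimensions: 23 inputs (state and action), 17 outputs (next state).
rng = np.random.default_rng(0)
ensemble = [init_mlp(rng, in_dim=23, out_dim=17) for _ in range(5)]
batch = rng.normal(size=(4, 23))
predictions = np.stack([forward(p, batch) for p in ensemble])
```

The stacked `predictions` array has one slice per ensemble member, which is the form needed when different trajectories are propagated through different members.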