1 Introduction
Model predictive control (MPC) is a powerful and widely accepted technology for advanced control systems such as manufacturing processes [36], HVAC systems [3], power electronics [37], autonomous vehicles [30], and humanoids [23]. MPC utilizes specified models of system dynamics to predict future states and rewards (or costs), and plans future actions that maximize the total reward over the predicted trajectories. Especially for industrial applications, the clear explainability of such a decision-making process is advantageous. Furthermore, in some tasks (e.g., games) [33], planning-based policies of this kind can outperform reactive policies (e.g., pure neural network policies).
Model-based reinforcement learning (MBRL) methods that employ expressive function approximators (e.g., deep neural networks: DNNs) [10, 40, 28] present appealing approaches for MPC. The main difficulty in introducing MPC to practical systems is specifying the forward dynamics models of target systems, since accurate system identification is challenging in many advanced applications. Take robotics for example: robots make contact with floors and walls and must manipulate objects, making the dynamics highly nonlinear. The main objective of MBRL is to train approximators of such complex dynamics through experiences in real systems. The general MBRL procedure is summarized as: (1. training step) train the approximate model with a given training dataset; (2. test step) execute the actions (or policies) optimized with the dynamics model in a real environment and augment the dataset with the observed results. These training and test steps are conducted iteratively to collect sufficient and diverse data to achieve the desired performance.
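The train/test loop described above can be sketched as follows; `train_model`, `plan_actions`, and the `env` interface are hypothetical placeholders for illustration, not the paper's actual API:

```python
# Minimal sketch of the generic MBRL train/test loop described above.
# `train_model`, `plan_actions`, and the `env` interface are hypothetical
# placeholders, not the paper's actual API.
def mbrl_loop(env, dataset, n_iters, train_model, plan_actions):
    for _ in range(n_iters):
        model = train_model(dataset)               # 1. training step
        state = env.reset()
        for _ in range(env.horizon):               # 2. test step (MPC rollout)
            action = plan_actions(model, state)    # optimize actions w.r.t. model
            next_state, reward = env.step(action)
            dataset.append((state, action, next_state, reward))
            state = next_state
    return dataset
```

Each outer iteration alternates model fitting with data collection, so the dataset grows by one episode per iteration.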
One feature of MBRL is its considerable sample efficiency compared to model-free reinforcement learning (MFRL), which directly trains policies through experiences. In other words, MBRL requires much less test time in real environments. In addition, MBRL benefits from the generalizability of the trained model, which can easily be applied to new tasks in the same system. However, the asymptotic performance of MBRL is generally inferior to that of model-free methods. This discrepancy is primarily due to the overfitting of dynamics models to the few data available during the initial MBRL steps, which is called the model-bias problem [10]. Several studies have demonstrated that incorporating uncertainty in dynamics models can alleviate this issue. Uncertainty-aware modeling is realized by Bayesian inference employing Gaussian processes [10], dropout as variational inference [11, 12, 20], or neural network ensembles [8, 24, 9].
Probabilistic ensembles with trajectory sampling (PETS) [8] is one type of uncertainty-aware MBRL. As an MPC-oriented MBRL method, PETS conducts trajectory optimization via the cross-entropy method (CEM) [5] using trajectories probabilistically sampled from the ensemble networks. Experiments have demonstrated that PETS can achieve performance competitive with state-of-the-art MFRL methods like Soft Actor-Critic (SAC) [15], while yielding much higher sample efficiency. Since our primary interest is MPC and its application to practical systems, this paper mainly focuses on PETS and treats this method as a strong baseline.
Considering the success of probabilistic dynamics modeling, incorporating uncertainty in optimal trajectories appears very promising for MBRL. However, an optimization scheme that can utilize this uncertainty has not yet been discussed. Although several stochastic approaches, including CEM, model predictive path integral (MPPI) [39, 40], the covariance matrix adaptation evolution strategy (CMA-ES) [17], and proportional CEM (Prop-CEM) [13], have been proposed, they are not uncertainty-aware and tend to underestimate uncertainty. In addition, although their optimization procedures are very similar, they have been derived independently. Consequently, the theoretical relations among these methods are unclear, preventing us from systematically understanding and reformulating them to be uncertainty-aware in a Bayesian fashion.
Motivated by these observations, in this paper we propose a novel MPC concept for Bayesian MBRL. The organization and contributions of this paper are summarized as follows. (1) In Sec. 3, we introduce a novel MPC framework, variational inference MPC (VI-MPC), which generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. The key observations for deriving this framework are organized in Sec. 2, where we point out that general stochastic optimization methods can be regarded as the moment matching of the optimal trajectory posterior, which appears in a Bayesian MBRL formulation. (2) In Sec. 4, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). Toy task examples and the concept of our method are exhibited in Fig. 1. (3) In Sec. 5, we demonstrate that our method consistently outperforms PETS via experiments with challenging locomotion tasks in the MuJoCo physics simulator [35].

2 Model-based Reinforcement Learning as Bayesian Inference
In this section, we describe MBRL as a Bayesian inference problem using the control as inference framework [25]. Fig. 2 displays the graphical model for the formulation, with which an MBRL procedure can be rewritten in a Bayesian fashion: (1. training step) infer the dynamics posterior $p(\theta|\mathcal{D})$; (2. test step) infer the optimal trajectory posterior, then sample actions from the posterior and execute them in a real environment. We denote a trajectory as $\tau := \{\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T\}$, where $\mathbf{s}$ and $\mathbf{a}$ respectively represent state and action. Given a state-action pair at time $t$, the next state can be predicted by a forward-dynamics model $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t; \theta)$ parameterized with $\theta$. The posterior of $\theta$ is inferred from a training dataset $\mathcal{D}$, which consists of states and actions observed during the test step. To formulate optimal control as inference, we auxiliarily introduce a binary random variable $\mathcal{O}_t \in \{0, 1\}$ to represent the optimality of $(\mathbf{s}_t, \mathbf{a}_t)$. Given $\mathcal{O}_{1:T} = 1$, trajectory optimization can be expressed as an inference problem:

$$p(\tau|\mathcal{O}_{1:T}=1, \mathcal{D}) \propto \int p(\mathcal{O}_{1:T}=1|\tau)\, p(\tau|\theta)\, p(\theta|\mathcal{D})\, d\theta, \tag{1}$$

where an uninformative action prior (i.e., $p(\mathbf{a}_t)$: uniform distribution) is supposed. For readability, $\mathcal{O}_{1:T}=1$ is simply denoted as $\mathcal{O}$. For the same reason, we omit the subscripts of sequences $\mathbf{s}_{1:T}$, $\mathbf{a}_{1:T}$. In the remainder of the paper, this simplified notation is employed. In Sec. 2.1–2.2, we review how these inference problems have been approximately handled in previous works.

2.1 Inference of Forward-dynamics Posterior
Given a sufficiently parameterized expressive model, i.e., DNNs, one of the most practical and promising schemes for approximating the posterior $p(\theta|\mathcal{D})$ is to utilize neural network ensembles [8, 24, 9]. This scheme approximates the posterior as a set of particles $p(\theta|\mathcal{D}) \simeq \frac{1}{E}\sum_{e=1}^{E}\delta(\theta - \theta_e)$, where $\delta(\cdot)$ is the Dirac delta function and $E$ is the number of networks. Each particle $\theta_e$ is independently trained by stochastic gradient descent so as to (sub)optimize the data likelihood. Although this approximation is incompletely Bayesian, it has several useful features. First, it can be implemented simply in standard deep learning frameworks. Furthermore, the ensemble model successfully captures multimodal uncertainty in the exact posterior.
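The ensemble approximation can be sketched in a toy regression setting; here tiny linear models fit to bootstrap resamples stand in for independently trained DNNs (the bootstrapping and model class are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

# Toy sketch of a dynamics-model ensemble: E independent regressors, each
# fit to its own bootstrap resample of the data, approximate the posterior
# p(theta | D) as a set of particles {theta_e}. Linear models stand in for
# DNNs here; the predictive spread across particles reflects epistemic
# uncertainty, which grows away from the training data.
rng = np.random.default_rng(0)

def fit_particle(X, y, rng):
    idx = rng.integers(0, len(X), len(X))      # bootstrap resample
    Xb, yb = X[idx], y[idx]
    w, *_ = np.linalg.lstsq(np.c_[Xb, np.ones(len(Xb))], yb, rcond=None)
    return w                                   # (slope, intercept)

def ensemble_predict(particles, x):
    preds = [w[0] * x + w[1] for w in particles]
    return np.mean(preds), np.std(preds)       # mean and epistemic spread

X = rng.uniform(-1, 1, 50)
y = 2.0 * X + 0.1 * rng.normal(size=50)        # ground truth: y = 2x + noise
particles = [fit_particle(X, y, rng) for _ in range(5)]   # E = 5 "networks"
mean_in, std_in = ensemble_predict(particles, 0.0)        # inside training range
mean_out, std_out = ensemble_predict(particles, 10.0)     # far extrapolation
```

The spread across particles is small near the data and large far from it, which is exactly the behavior that mitigates model bias.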
Another possible way to infer $p(\theta|\mathcal{D})$ is dropout as variational inference [11, 12, 20], which approximates the posterior with a variational distribution $q(\theta)$ realized by dropout. It has been proven that the variational inference problem $\min_q \mathrm{KL}(q(\theta)\,\|\,p(\theta|\mathcal{D}))$ is approximately equivalent to training networks with dropout, where KL denotes the Kullback–Leibler (KL) divergence. Although this scheme is also simple and theoretically supported, approximation by a single Gaussian distribution tends to underestimate uncertainty (or multimodality) in the posterior. To remedy this problem, α-divergence dropout has been proposed [26], which replaces the KL-divergence with the α-divergence so as to prevent overfitting to a single mode. However, as long as $q(\theta)$ is Gaussian, the multimodality cannot be managed well.

In our preliminary experiments on MBRL, we tested the above two schemes and observed that the ensemble performs much better than (α-)dropout. This result provides us with the insight that capturing multimodality in the posterior has crucial effects in the MBRL literature. Therefore, in this paper, we also employ the ensemble scheme to approximate $p(\theta|\mathcal{D})$ in the same way as our baseline, PETS [8]. In Sec. 4, we also attempt to incorporate multimodality in the optimal action posterior.
2.2 Moment Matching of Trajectory Posterior
This section clarifies the connection between trajectory optimization and the posterior approximation problem. The key observation delineated here is that several MPC methods, including CEM used in PETS and MPPI, can be regarded as the moment matching of the posterior.
Given an inferred model posterior $p(\theta|\mathcal{D})$, we can sample trajectories from (1).^1 Let us approximate the action posterior with a Gaussian distribution $q(\mathbf{a}) = \mathcal{N}(\mathbf{a}|\mu, \Sigma)$. The mean of the posterior action sequence can be estimated by moment matching:

$$\mu \leftarrow \mathbb{E}_{p(\tau|\mathcal{O},\mathcal{D})}[\mathbf{a}] = \frac{\mathbb{E}_{p(\tau|\mathcal{D})}\left[p(\mathcal{O}|\tau)\,\mathbf{a}\right]}{p(\mathcal{O}|\mathcal{D})}, \tag{2}$$

where

$$p(\mathcal{O}|\mathcal{D}) = \mathbb{E}_{p(\tau|\mathcal{D})}\left[p(\mathcal{O}|\tau)\right]. \tag{3}$$

Eq. (2) can be viewed as a weighted average in which each sampled action is weighted by the likelihood of optimality $p(\mathcal{O}|\tau)$. In the same way, we can also estimate the variance $\Sigma$ of the posterior. In practice, sampling actions from the uniform prior is quite inefficient and requires almost infinite samples. Hence, let us consider iteratively estimating the parameters by incorporating importance sampling. Let $\mu_i$, $\Sigma_i$ be the estimated parameters at iteration $i$; we can rearrange (2) as

$$\mu_{i+1} \leftarrow \frac{\mathbb{E}_{q_i(\tau)}\left[p(\mathcal{O}|\tau)\,\mathbf{a}\right]}{\mathbb{E}_{q_i(\tau)}\left[p(\mathcal{O}|\tau)\right]}, \tag{4}$$

where $q_i(\tau)$ denotes the trajectory distribution whose actions are sampled from $\mathcal{N}(\mathbf{a}|\mu_i, \Sigma_i)$.

^1 Trajectory sampling methods with the ensemble dynamics model have been discussed and experimented in [8]. In this paper, we employ the TS1 method suggested in that reference (see lines 3–6 in Alg. 1).
It is worth noting that a similar iterative law can also be derived by solving the optimization problem with mirror descent [27, 29]. To connect this inference problem to trajectory optimization, we define the optimality likelihood via the trajectory reward $r(\tau)$ and a monotonically increasing function $f$, as $p(\mathcal{O}|\tau) \propto f(r(\tau))$. If we define $f$ as the exponential function, as in [25, 31], an optimization algorithm similar to MPPI [39, 40, 29] is recovered. As summarized in Table 1, other similarities to well-known optimization algorithms, including CEM, can be observed with different optimality definitions.^2

^2 We implicitly assume the existence of a step-wise optimality likelihood corresponding to each definition. Since another graphical model with a single unified optimality variable can be defined, this existence is not critical.
Table 1: Optimality likelihood definitions $f(r)$ corresponding to well-known optimization algorithms: MPPI [39], CEM [5], Prop-CEM [13], and CMA-ES [17].
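As a concrete illustration, the two most common optimality definitions map sampled rewards to normalized weights roughly as follows (these are the standard MPPI and CEM forms from the literature; the temperature `lam` and `elite_ratio` are illustrative hyperparameters, not the paper's values):

```python
import numpy as np

# Hedged sketch of two optimality-likelihood choices f(r) from Table 1.
def mppi_weights(rewards, lam=1.0):
    # MPPI: f(r) = exp(r / lambda); shift by max(r) for numerical stability
    w = np.exp((rewards - rewards.max()) / lam)
    return w / w.sum()

def cem_weights(rewards, elite_ratio=0.5):
    # CEM: f(r) = 1[r >= threshold], with the threshold chosen adaptively
    # so that only the top elite_ratio fraction of samples passes
    k = max(1, int(len(rewards) * elite_ratio))
    threshold = np.sort(rewards)[-k]
    w = (rewards >= threshold).astype(float)
    return w / w.sum()

rewards = np.array([1.0, 2.0, 3.0, 10.0])
w_mppi = mppi_weights(rewards)
w_cem = cem_weights(rewards)
```

MPPI weights all samples smoothly by exponentiated reward, whereas CEM assigns uniform weight to the elites and zero to the rest.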
There is a discrepancy between (4) and the CEM implementation in [8], in which $f(\mathbb{E}[r(\tau)])$ is used instead of $\mathbb{E}[f(r(\tau))]$. Since $f$ is a convex function, Jensen's inequality holds in this case: $\mathbb{E}[f(r(\tau))] \ge f(\mathbb{E}[r(\tau)])$. The equality holds when $r(\tau)$ is constant, implying that $\mathbb{E}[f(r(\tau))] \simeq f(\mathbb{E}[r(\tau)])$ for low-variance $r(\tau)$ and $\mathbb{E}[f(r(\tau))] \gg f(\mathbb{E}[r(\tau)])$ for high-variance (or more uncertain) $r(\tau)$. Namely, $f(\mathbb{E}[r(\tau)])$ underestimates the optimality likelihood if an action sequence generates uncertain trajectories. Since we experimentally observed that this filtering effect of $f(\mathbb{E}[r(\tau)])$ yields higher optimization performance than $\mathbb{E}[f(r(\tau))]$ (see Sec. A), this paper heuristically employs $f(\mathbb{E}[r(\tau)])$.

In practice, the expectation operators should be implemented on digital computers via Monte Carlo integration, with a finite set of sampled action sequences and multiple trajectories sampled for each action sequence.
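Concretely, one iteration of (4) samples actions from the current Gaussian, evaluates the optimality likelihood on the resulting rewards, and refits the moments by a weighted average. A self-contained 1-D toy sketch (the quadratic reward and MPPI-style exponential weights are illustrative choices, not the paper's setup):

```python
import numpy as np

# Toy sketch of the iterative moment-matching update (4) for a single
# 1-D action: sample from the current Gaussian q_i, weight each sample by
# an optimality likelihood f(r), and refit (mu, sigma) as weighted moments.
rng = np.random.default_rng(1)

def reward(a):
    return -(a - 2.0) ** 2                        # toy reward, optimum at a = 2

mu, sigma = 0.0, 2.0                              # initial q_0
for _ in range(10):                               # optimization iterations
    actions = mu + sigma * rng.normal(size=256)   # sample from q_i
    w = np.exp(reward(actions) - reward(actions).max())  # f(r) = exp(r), stabilized
    w /= w.sum()
    mu = float(np.sum(w * actions))               # weighted mean, cf. (2)/(4)
    sigma = float(np.sqrt(np.sum(w * (actions - mu) ** 2))) + 1e-3
```

Over iterations the mean migrates toward the optimum while the variance contracts, which is the moment-matching behavior described above.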
3 Variational Inference MPC: From Moment Matching to Inference
Given uncertainty in a dynamics model, it is natural to suppose that optimal trajectories are also uncertain. However, as exhibited in the previous section, PETS employs the moment matching of the trajectory posterior, ignoring almost all uncertainty in optimal trajectories. In this section, we introduce a variational inference MPC (VI-MPC) framework to formulate MBRL as fully Bayesian, involving uncertainty both in the dynamics and in the optimality.
Let us consider a variational inference problem: $\min_q \mathrm{KL}(q(\tau)\,\|\,p(\tau|\mathcal{O},\mathcal{D}))$. We assume the variational distribution is decomposed as $q(\tau) = q(\mathbf{a})\,p(\mathbf{s}|\mathbf{a},\mathcal{D})$; hence, we introduce $q(\mathbf{a})$ as the variational action posterior, which takes a decomposable form similar to $p(\tau|\mathcal{O},\mathcal{D})$. This assumption forces optimal state transitions to be controlled only by actions [25]. As shown in Sec. B.1, this inference problem can be transformed into the maximization problem: $\max_q \mathbb{E}_{q(\tau)}[\log p(\mathcal{O}|\tau)] + \mathcal{H}[q(\mathbf{a})]$. A notable property is that this objective has an entropy regularization term $\mathcal{H}[q(\mathbf{a})]$, which encourages $q(\mathbf{a})$ to take a broader shape and thus capture more uncertainty. For the sake of convenience, we introduce a tunable hyperparameter $\lambda$ into the optimality likelihood as $p(\mathcal{O}|\tau) \propto f(r(\tau))^{1/\lambda}$. By applying mirror descent [7] to this optimization problem, we can derive an update law for $q(\mathbf{a})$ (see Sec. C for the detailed derivation):

$$q_{i+1}(\mathbf{a}) \propto \left(\mathbb{E}_{p(\mathbf{s}|\mathbf{a},\mathcal{D})}\left[p(\mathcal{O}|\tau)\right]\right)^{1/\gamma} q_i(\mathbf{a})^{1-\kappa/\gamma}, \tag{5}$$

where $\gamma$, $\kappa$ are hyperparameters into which $\lambda$ is absorbed: $\gamma$ is an inverted step-size that controls the optimization speed, and $\kappa$ is the weight of the entropy regularization term $\mathcal{H}[q(\mathbf{a})]$.
Eq. (5) suggests a novel and general MPC framework, which we call variational inference MPC (VI-MPC). To realize a specific VI-MPC method, we specify the following: (1) the optimality definition (i.e., $f$; see Table 1), (2) the variational distribution model $q(\mathbf{a})$, and (3) whether entropy regularization is enabled ($\kappa > 0$) or not ($\kappa = 0$). We did not include $\gamma$ in the specifications since it is highly dependent on the optimality definition (see Sec. G). In this paper, we describe the above specifications as VI-MPC(<optimality_def>, <variational_dist>, <max_ent>). For example, we respectively express vanilla CEM and MPPI as VI-MPC(‘CEM’, ‘Gaussian’, False) and VI-MPC(‘MPPI’, ‘Gaussian’, False). In Sec. 4, we propose a new instance of VI-MPC that incorporates multimodal uncertainty in the posterior.
4 Probabilistic Action Ensembles with Trajectory Sampling
As reviewed in Sec. 2.1, previous methods have successfully captured multimodality in $p(\theta|\mathcal{D})$ with network ensembles. Given this multimodality in the dynamics posterior, other distributions depending on it, including the optimal action posterior, would also be multimodal. In other words, there are various possible optimal trajectories (or actions), as in Fig. 1. It is obvious that VI-MPC(*, ‘Gaussian’, *) will still easily fail to capture multimodality because of overfitting to a single mode. Inspired by the success of the ensemble approach for dynamics modeling, we propose a novel VI-MPC method that introduces action ensembles with a Gaussian mixture model (GMM), i.e., VI-MPC(*, ‘GMM(M=*)’, *), which we call PaETS (Probabilistic Action Ensembles with Trajectory Sampling).
PaETS defines the variational distribution as

$$q(\mathbf{a}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{a}|\mu_m, \Sigma_m), \tag{6}$$

where $\sum_{m} \pi_m = 1$ and $M$ is the number of components of the mixture model. We now derive the iteration scheme to update the GMM parameters $\phi := \{\pi_m, \mu_m, \Sigma_m\}_{m=1}^{M}$. First, drawing $K$ samples from $q_i(\mathbf{a})$, we approximate $q_i$ as a discretized distribution (or a set of particles):

$$q_i(\mathbf{a}) \simeq \sum_{k=1}^{K} w^{(k)}\, \delta(\mathbf{a} - \mathbf{a}^{(k)}), \quad \mathbf{a}^{(k)} \sim q_i(\mathbf{a}), \tag{7}$$

where, just after sampling, the weight of each particle is uniform: $w^{(k)} = 1/K$. By substituting this approximated distribution into (5), the update law for the particle weights is derived as

$$w^{(k)} \leftarrow \frac{1}{Z} \left(\mathbb{E}_{p(\mathbf{s}|\mathbf{a}^{(k)},\mathcal{D})}\left[p(\mathcal{O}|\tau)\right]\right)^{1/\gamma} q_i(\mathbf{a}^{(k)})^{-\kappa/\gamma}, \tag{8}$$

where $Z$ normalizes the weights so that $\sum_k w^{(k)} = 1$. Then we estimate $\phi_{i+1}$, which maximizes the observation probability of the weighted particles:

$$\phi_{i+1} \leftarrow \arg\max_{\phi} \sum_{k=1}^{K} w^{(k)} \log q(\mathbf{a}^{(k)}; \phi). \tag{9}$$

By taking the derivative and borrowing the concept of the EM algorithm [4], we obtain update laws for $\phi$ that take the weighted-average form of (4), e.g., for the means,

$$\mu_m \leftarrow \frac{\sum_{k} \eta_m^{(k)} w^{(k)} \mathbf{a}^{(k)}}{\sum_{k} \eta_m^{(k)} w^{(k)}}, \tag{10}$$

where $\eta_m^{(k)}$ denotes the responsibility of component $m$ for particle $k$; analogous updates hold for $\pi_m$ and $\Sigma_m$ (see Sec. D for the complete definition).
In summary, PaETS and the MPC utilizing it are respectively described in Algs. 1 and 2, where $I$ is the number of optimization iterations and $T$ is the length of the task episode. At $t = 0$ in Alg. 2, the means $\mu_m$ are initialized independently at random. At each subsequent timestep, the mixing weights $\pi_m$ and covariances $\Sigma_m$ are reset to their initial values, encouraging exploration at the next timestep and preventing $q(\mathbf{a})$ from degenerating to a single mode. If we set $M = 1$, these procedures are almost equivalent to those of PETS. The use of the GMM ($M > 1$) does not significantly increase computational complexity (see Sec. F).
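One PaETS-style iteration (particle weighting followed by a weighted EM refit of the mixture) can be sketched on a 1-D toy problem as follows; the bimodal reward, the omission of the entropy term, and all hyperparameters are illustrative choices, not the paper's settings:

```python
import numpy as np

# Toy 1-D sketch of a PaETS-style iteration with a two-component GMM:
# sample particles from the mixture, weight them by an optimality likelihood
# (cf. eq. (8), entropy term omitted here), then refit (pi, mu, sigma) with
# one weighted EM step (cf. eq. (10)).
rng = np.random.default_rng(2)

def gaussian_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))

def paets_step(pi, mu, sig, reward_fn, n=512):
    comp = rng.choice(len(pi), size=n, p=pi)        # choose mixture components
    x = mu[comp] + sig[comp] * rng.normal(size=n)   # sample action particles
    w = np.exp(reward_fn(x) - reward_fn(x).max())   # optimality weights
    w /= w.sum()
    # E-step: responsibilities of each component for each particle
    dens = np.stack([p * gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sig)])
    resp = dens / dens.sum(axis=0)
    # M-step: weighted-average parameter updates
    wk = resp * w
    norm = wk.sum(axis=1)
    mu_new = (wk * x).sum(axis=1) / norm
    sig_new = np.sqrt((wk * (x - mu_new[:, None]) ** 2).sum(axis=1) / norm) + 1e-3
    return norm / norm.sum(), mu_new, sig_new

# Bimodal toy reward with optima near -2 and +2
reward = lambda a: np.maximum(-(a - 2.0) ** 2, -(a + 2.0) ** 2)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sig = np.array([1.5, 1.5])
for _ in range(15):
    pi, mu, sig = paets_step(pi, mu, sig, reward)
```

With a single Gaussian this objective forces a collapse onto one mode; the two mixture components instead settle on both optima, which is the multimodality PaETS aims to retain.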
5 Experiments
5.1 Comparison to State-of-the-art Methods
The main objective of this experiment is to demonstrate that PaETS has advantages over the state-of-the-art MBRL baseline PETS [8]. In this experiment, PaETS and PETS (or vanilla CEM) were implemented with the same codebase under different parameters, i.e., VI-MPC(‘CEM’, ‘GMM(M=5)’, True) for PaETS and VI-MPC(‘CEM’, ‘GMM(M=1)’, False) for PETS. We also evaluated another MBRL baseline with MPPI [40], realized as VI-MPC(‘MPPI’, ‘GMM(M=1)’, False). These methods share the settings for inference (training of network ensembles). The state-of-the-art MFRL method SAC [15] was also evaluated to compare asymptotic performance.^3 Fig. 3 illustrates the simulated locomotion tasks evaluated in this experiment, which are complex and challenging due to their high nonlinearity. All the tasks except HalfCheetah were not evaluated in the original PETS paper [8]. Other details about our implementation and experimental settings are described in Sec. G and Sec. H. Fig. 4 presents the experimental results, in which PaETS consistently exhibits better asymptotic performance than the MBRL baselines. In addition, PaETS outperforms or is comparable to SAC while requiring significantly fewer samples (roughly 10× more sample-efficient).

^3 We used the open-source code: https://github.com/pranz24/pytorch-soft-actor-critic
5.2 Ablation Study
This experiment clarifies which components of PaETS (the GMM and entropy regularization) contributed to the overall improvement. Fig. 5 presents the results of this ablation study and Welch's t-test for selected representative pairs. From this figure, one can observe that the use of the GMM ($M > 1$) significantly improves performance. The effect of the regularization ($\kappa > 0$) is relatively small, but not negligible. In certain tasks, setting $\kappa$ to particular values could improve the performance. In the case of $\kappa > 0$, the regularization sheds light on actions sampled from low-probability regions of $q(\mathbf{a})$, thus encouraging $q(\mathbf{a})$ to be multimodal. In some tasks that require rather delicate control (e.g., Hopper, Walker2d), the effect of $\kappa$ seems less significant. Fig. 6 examines sensitivity to the number of mixture components $M$, indicating which value achieved the highest performance. If infinite or sufficiently many samples were available, it would be reasonable to set $M$ large enough to capture multimodality. In practice, however, the number of samples is finite and could be rather small due to computational constraints. In this case, a larger $M$ makes it more difficult to approximate $q(\mathbf{a})$ as a set of particles, degrading the optimization performance.
6 Related Work
Dynamics Posterior Inference Recent MBRL methods, MB-MPO (Model-Based Meta-Policy-Optimization) [9] and ME-TRPO (Model-Ensemble Trust-Region Policy Optimization) [24], also employ network ensembles to model dynamics, but they utilize the ensembles differently than we do: to train policy networks, not for MPC.
Trajectory Optimization Sequential Monte Carlo-based MPC, described as VI-MPC(*, ‘Particles’, False), was introduced in [21], but it requires a well-designed proposal distribution to sample particles for the next iteration. Another particle-based method was derived in [31] by utilizing the control as inference framework. However, this method relies not only on a dynamics model, but also on policy and value functions to manage particles, so MFRL methods must be incorporated.
Recent studies have demonstrated that entropy regularization is a promising strategy in policy training [1, 2, 14, 15]. However, to the best of our knowledge, introducing entropy regularization to MPC together with an explicit multimodal expression, thereby realizing their synergistic effect, is novel.
Ref. [38] also systematically organizes the stochastic MPC methods from the perspective of online learning, but uncertainty-aware discussions from a Bayesian viewpoint are not conducted.
Bayesian Reformulation Ref. [19] proposes a novel approach to generative adversarial imitation learning (GAIL) [18], which reformulates general GAIL in a Bayesian fashion and utilizes ensembles to infer discriminator posteriors. Another Bayesian reformulation of GAIL integrates imitation and reinforcement learning by introducing an additional optimality variable (i.e., imitation optimality) [22].

7 Conclusion & Discussions
This paper introduced a novel VI-MPC framework that systematically generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. We also devised a novel instance of this framework, called PaETS, which can successfully incorporate multimodal uncertainty in optimal trajectories. By combining our method with recent uncertainty-aware dynamics modeling using neural network ensembles, our Bayesian MBRL can involve multimodality both in the dynamics and in the optimality. In addition, our method is a quite simple extension of general stochastic methods and requires no significant additional computational complexity. Our experiments demonstrate that PaETS can improve asymptotic performance compared to the leading MBRL baseline PETS, and thus substantially enhances the potential of MBRL to be more competitive with state-of-the-art MFRL.
Considering the simplicity and generalizability of VI-MPC and PaETS, we expect that our concept is applicable to a variety of tasks, such as traditional MPC with deterministic dynamics and advanced MPC with latent dynamics from pixels via the Deep Planning Network [16]. By introducing a categorical mixture model as the variational distribution, application to combinatorial optimization is also feasible. In fact, our ongoing work includes experiments with discrete MPC for a practical system.
A question that remains is how to determine the VI-MPC specifications. As implied in Fig. 4, the best optimality definition could be task-dependent (e.g., MPPI outperformed vanilla CEM in Ant but not in the other tasks). The regularization weight $\kappa$ also exhibits task dependency, as shown in Fig. 5. It would be challenging but interesting future work to add these parameters to the graphical model in Fig. 2 as latent variables, inferring promising parameters along with optimal trajectories, like the infinite GMM [32]. Another appealing direction for future work is to introduce the concept of parallel tempering [6] from Markov chain Monte Carlo. By adaptively varying temperatures ($\lambda$ in our case) across ensemble actions, we can expect the ensemble diversity to improve.

Acknowledgments We thank Vishwajeet Singh, Hiroki Nakamura and Akira Kinose for their cooperation in this study during their student-internship periods. Most of the experiments were conducted on ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.
References
 [1] (2015) Model-based relative entropy stochastic search. In NeurIPS, Cited by: §6.
 [2] (2017) Deriving and improving CMA-ES with information geometric trust regions. In GECCO, Cited by: §6.
 [3] (2014) Theory and applications of HVAC control systems–a review of model predictive control (MPC). Building and Environment 72, pp. 343–355. Cited by: §1.
 [4] (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute 4 (510), pp. 126. Cited by: §4.
 [5] (2013) The cross-entropy method for optimization. In Handbook of Statistics, Vol. 31, pp. 35–59. Cited by: §1, Table 1.
 [6] (2011) Handbook of Markov Chain Monte Carlo. Cited by: §7.
 [7] (2015) Convex optimization: algorithms and complexity. Vol. 8. Cited by: Appendix C, §3.
 [8] (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, Cited by: Appendix F, Figure 1, §1, §1, §2.1, §2.1, §2.2, §5.1, footnote 1.
 [9] (2018) Model-based reinforcement learning via meta-policy optimization. In CoRL, Cited by: §1, §2.1, §6.
 [10] (2011) PILCO: a model-based and data-efficient approach to policy search. In ICML, Cited by: §1, §1.
 [11] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML, Cited by: §1, §2.1.
 [12] (2017) Concrete dropout. In NeurIPS, Cited by: §1, §2.1.
 [13] (2013) The cross-entropy method optimizes for quantiles. In ICML, Cited by: §1, Table 1.
 [14] (2017) Reinforcement learning with deep energy-based policies. In ICML, Cited by: §6.
 [15] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, Cited by: §1, §5.1, §6.
 [16] (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §7.
 [17] (2003) Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation 11 (1), pp. 1–18. Cited by: §1, Table 1.
 [18] (2016) Generative adversarial imitation learning. In NeurIPS, Cited by: §6.
 [19] (2018) A bayesian approach to generative adversarial imitation learning. In NeurIPS, Cited by: §6.
 [20] (2017) Uncertaintyaware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182. Cited by: §1, §2.1.
 [21] (2009) Sequential Monte Carlo for model predictive control. In Nonlinear Model Predictive Control, pp. 263–273. Cited by: §6.
 [22] (2019) Integration of imitation learning using GAIL and reinforcement learning using taskachievement rewards via probabilistic generative model. arXiv preprint arXiv:1907.02140. Cited by: §6.
 [23] (2016) Optimizationbased locomotion planning, estimation, and control design for the Atlas humanoid robot. Autonomous Robots 40 (3), pp. 429–455. Cited by: §1.
 [24] (2018) Model-ensemble trust-region policy optimization. In ICLR, Cited by: §1, §2.1, §6.
 [25] (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §2.2, §2, §3.
 [26] (2017) Dropout inference in Bayesian neural networks with alpha-divergences. In ICML, Cited by: §2.1.
 [27] (2018) Mirror descent search and its acceleration. Robotics and Autonomous Systems 106, pp. 107–116. Cited by: §2.2.
 [28] (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, Cited by: §1.
 [29] (2018) Acceleration of gradient-based path integral method for efficient optimal and inverse optimal control. In ICRA, Cited by: §2.2.
 [30] (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh. 1 (1), pp. 33–55. Cited by: §1.
 [31] (2018) Probabilistic planning with sequential Monte Carlo methods. In ICLR, Cited by: §2.2, §6.
 [32] (2000) The infinite gaussian mixture model. In NeurIPS, Cited by: §7.
 [33] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
 [34] (2010) A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research 11, pp. 3137–3181. Cited by: Appendix G.
 [35] (2012) MuJoCo: a physics engine for model-based control. In IROS, Cited by: §1.
 [36] (2000) Multilayer optimization and scheduling using model predictive control: application to reentrant semiconductor manufacturing lines. Computers & Chemical Engineering 24 (8), pp. 2009–2021. Cited by: §1.
 [37] (2014) Model predictive control: a review of its applications in power electronics. IEEE Ind. Electron. Mag. 8 (1), pp. 16–31. Cited by: §1.
 [38] (2019) An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967. Cited by: §6.
 [39] (2016) Aggressive driving with model predictive path integral control. In ICRA, Cited by: Appendix F, §1, §2.2, Table 1.
 [40] (2017) Information theoretic MPC for modelbased reinforcement learning. In ICRA, Cited by: Appendix F, §1, §1, §2.2, §5.1.
Appendix A Comparison Between $\mathbb{E}[f(r(\tau))]$ and $f(\mathbb{E}[r(\tau)])$
We evaluated the impact of $\mathbb{E}[f(r(\tau))]$ and $f(\mathbb{E}[r(\tau)])$ on the optimization performance of (vanilla) CEM and MPPI, the results of which are summarized in Table 2, where $f(\mathbb{E}[r(\tau)])$ gained much higher rewards than $\mathbb{E}[f(r(\tau))]$.
Table 2: Achieved rewards with $\mathbb{E}[f(r)]$ vs. $f(\mathbb{E}[r])$ for CEM and MPPI.
Appendix B Derivations
B.1 Derivation of the Variational Inference Objective
By using the decomposition assumption on $q(\tau)$, the KL divergence can be transformed as
(11)  
(12)  
(13) 
Appendix C Derivation of (5)
In this section, we use simplified notation for readability. Let us consider the optimization problem:
(14) 
By applying mirror descent [7], the iterative update law of is given as
(15) 
where $\langle \cdot, \cdot \rangle$ denotes the inner product, $\gamma$ is a hyperparameter related to the step-size, and a Lagrange multiplier enforces the normalization constraint $\int q(\mathbf{a})\,d\mathbf{a} = 1$. The arguments of the operator can be rearranged as
(16) 
where we used the relations:
(17) 
(18) 
The integrand of (16) can be organized as
(19)  
(20)  
(21) 
Integrating the above equation yields:
(22) 
By minimizing this equation, we get:
(23) 
The Lagrange multiplier can be removed using the normalization constraint:
(24)  
(25) 
Considering the discussion in Sec. 2.2 and Sec. A, we compute the optimality likelihood as
(26) 
Substituting (25) into (23) results in:
(27) 
Marginalizing out the states, we finally obtain:
(28) 
In (5), the resulting coefficients were renamed $\gamma$ and $\kappa$.
Appendix D Complete Definition of PaETS
(29)  
(30)  
(31)  
(32)  
(33) 
Appendix E Optimization of Toy Objective Function by PaETS
Fig. 7 illustrates how PaETS optimizes a toy multimodal objective function.
Appendix F Computational Complexity
The main computational bottleneck of PaETS (and PETS) is the execution of lines 3–6 in Alg. 1, in which trajectories must be sampled for all sampled action sequences. In our experiment, the number of sampled actions and the number of trajectories for each action were respectively set to 500 and 20, as in [8]. Compared to PETS, PaETS requires additional procedures such as action sampling from the GMM (line 2) and the GMM parameter update (line 9). However, these additional procedures are easily parallelizable on GPUs, and their computation times are much shorter than the above-mentioned bottleneck. In experiments with our early prototype in TensorFlow, it took about 57 ms for $M=5$ and 55 ms for $M=1$ (equivalent to PETS) to execute one iteration of the for-loop in Alg. 1 on a single NVIDIA RTX 2080 GPU. This execution time does not meet real-time constraints (e.g., 30 Hz). However, considering the success of the real-time implementation of MPPI in [39, 40], we believe a real-time implementation of our method is feasible with optimized implementation using a compiled language, low-level GPU APIs, and thorough tuning of hyperparameters (e.g., the number of samples and DNN complexity).

Appendix G Implementation Notes
Cross Entropy Method
It is a general technique to adaptively determine the reward threshold in Table 1 so that only the top samples satisfy the threshold condition. We employ this technique with a fixed eliteness ratio. $\gamma$ has no effect on CEM optimization since $f$ takes binary values.
MPPI
Entropy Regularization
The value of $\kappa$ is very sensitive to task settings, especially the dimensionality of the action space. To make $\kappa$ insensitive to such settings, we introduce the following normalization trick, inspired by the above heuristics. First, we rearrange (8) as
(35) 
Then, we replace the entropy-bonus term with a normalized one:
(36) 
By applying these heuristics, the range of the entropy bonus is limited to $[0, \kappa]$, where the action with the lowest probability among the samples gains the highest entropy bonus of $\kappa$.
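A minimal sketch of this normalization, assuming a per-batch min–max rescaling of the sampled log-probabilities (the paper's exact form of (36) may differ in detail):

```python
import numpy as np

# Hedged sketch of the entropy-bonus normalization described above: the raw
# bonus -log q(a) is rescaled per sample batch so that it lies in [0, kappa],
# with the least-probable sampled action receiving the full bonus kappa.
def normalized_entropy_bonus(log_q, kappa):
    lo, hi = log_q.min(), log_q.max()
    if hi == lo:                      # all samples equally probable: no bonus
        return np.zeros_like(log_q)
    return kappa * (hi - log_q) / (hi - lo)

log_q = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
bonus = normalized_entropy_bonus(log_q, kappa=0.5)
```

This keeps the magnitude of the entropy bonus independent of the action dimensionality, which is the stated goal of the trick.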
Appendix H Experimental Setup
We used MuJoCo tasks modified from the standard OpenAI Gym tasks.^4 Table 3 summarizes the task settings, where the listed symbols respectively denote the velocity, orientation angle, and height of the agents. Penalty functions are newly introduced to encourage the agents to move forward in the proper form; instead, the done flags originally used for early task termination are removed. They are defined as

^4 https://github.com/openai/gym
(37) 
(38) 
We modified the range of actions (i.e., torques) to exaggerate uncertainties in the optimal trajectory posteriors.
Task  Reward Function  Misc.  

HalfCheetah  —  
Ant  
Hopper  
Walker2d 
Table 4 summarizes the shared parameter settings for MBRL (PaETS, PETS, and MPPI). For SAC, we used the default parameters from the original codebase.
HalfCheetah  Ant  Hopper  Walker2d  

: prediction horizon  30  30  60  45 
: weight of entropy regularizer  0.5  0.25  0.5  0.5 
: # sampled actions  500  
: # trajectories for each action  20  
: # optimizationiterations  5  
: # episode length  1000  
: # neural networks  5  
hidden nodes  (200, 200, 200, 200)  
activation function  Swish  
optimizer  Adam  
learning rate  
batch size  160