1 Introduction
Reinforcement learning can be broadly classified into two categories: model-free reinforcement learning (MFRL) and model-based reinforcement learning (MBRL). Interest in MBRL has surged recently because of its higher sample efficiency compared with MFRL (Kurutach et al., 2018; Heess et al., 2015; Asadi et al., 2018). Despite its success, MBRL still faces a challenging problem, the model bias, where an imperfect dynamics model degrades the performance of the algorithm (Kurutach et al., 2018). Unfortunately, this often happens when the environment is sufficiently complex. Several efforts mitigate the issue by combining model-based and model-free approaches. Heess et al. (2015) compute the value gradient along real system trajectories instead of planned ones to avoid the compounded error. Kalweit and Boedecker (2017) mix real data with imaginary data from the model and then train the Q-function. An ensemble of neural networks can be used to model the environment dynamics, which effectively reduces the model error (Kurutach et al., 2018; Clavera et al., 2018; Chua et al., 2018).
Indeed, how to exploit the real and imaginary data is a key question in model-based reinforcement learning. Recent model-based algorithms that apply Dyna-style updates have shown promising results (Sutton, 1990; Kurutach et al., 2018; Luo et al., 2018). They collect real data with the current policy to train the dynamics model, and then improve the policy using state-of-the-art model-free reinforcement learning algorithms on imagined data generated by the learned model. We ask: why not embed the model directly into the policy improvement? To this end, we derive a reinforcement learning algorithm called model-embedding model-based reinforcement learning (MEMB) in the framework of probabilistic reinforcement learning (Levine, 2018).
We provide a theoretical result on the error of the long-term return in MEMB, which is caused by the model bias and the policy distribution shift, under a Lipschitz continuity condition on the model and the policy. In addition, our analysis takes the length of the rollout into account, which helps us design the algorithm. In MEMB, the dynamics model and reward model are trained with the real data set collected from the environment. We then train the $V$ and $Q$ functions using the real data set with the update rule derived from the maximum entropy principle (several other ways to include the imaginary data can also be applied; see the discussion in Section 3). In the policy improvement step, the stochastic actor samples an action with the real state as the input, and the state then moves from $s$ to $s'$ according to the learned dynamics model.
We link the learned dynamics model, reward model, and policy to compute an analytic policy gradient by backpropagation. Compared with the likelihood-ratio estimator usually used in MFRL methods, such a value gradient method reduces the variance of the policy gradient (Heess et al., 2015). The other merit of MEMB is its computational efficiency. Several state-of-the-art MBRL algorithms generate hundreds of thousands of imaginary samples from the model together with a few real samples (Luo et al., 2018; Janner et al., 2019). The huge imaginary data set is then fed into MFRL algorithms, which may be sample-efficient in terms of real samples but is not computationally friendly. In contrast, our algorithm embeds the model in the policy update. Thus we can implement it efficiently by computing the policy gradient several times in each iteration (see Algorithm 1) and do not need to process a huge imaginary data set.
Notice that SVG (Heess et al., 2015) also embeds the model to compute the policy gradient. However, there are several key differences between MEMB and SVG.

First, to alleviate the compounded error, SVG proposes a conservative algorithm in which only real data is used to evaluate policy gradients, so the imaginary data is wasted. However, our theorem shows that imaginary data from a short rollout of the learned model can be trusted. In our work, the policy is trained with the model and the imaginary data set several times in each iteration of the algorithm. In the ablation study (Appendix B), we demonstrate that this difference leads to a large gap in performance.

Second, we provide a theoretical guarantee for the algorithm, which is not included in SVG.

Third, we derive our algorithm in the framework of probabilistic reinforcement learning. The entropy term encourages exploration and prevents early convergence to suboptimal policies, and this framework has shown state-of-the-art performance in MFRL (Haarnoja et al., 2018).
In addition, MEMB avoids importance sampling in the off-policy setting by sampling the action from the current policy and the transition from the learned model, which further reduces the variance of the gradient estimate.
Contributions: We derive an elegant, sample-efficient, and computationally friendly Dyna-style MBRL algorithm in the framework of probabilistic reinforcement learning in a principled way (one trial of the experiment finishes in around one to two hours on a laptop). Unlike traditional MBRL algorithms, we embed the model directly into the policy improvement, which can reduce the variance of the gradient estimate and avoids computation on a huge imaginary data set. In addition, since the algorithm is off-policy, it is sample-efficient. Finally, we provide theoretical results on the long-term return of our algorithm that account for the model bias and the policy distribution shift. We test our algorithm on several benchmark tasks in the MuJoCo simulation environment (Todorov et al., 2012) and demonstrate that it can achieve state-of-the-art performance. We provide our code anonymously for reproducibility (code is submitted at https://github.com/MEMBanonymous1/MEMB).
Related work: There is a plethora of work on MBRL, which can be classified into several categories depending on how the model is used, how the optimal policy is searched, or which function approximator represents the dynamics model. We defer a comprehensive discussion of related work to Appendix A.
2 Preliminaries
In this section, we first present some background on the Markov decision process. We then introduce probabilistic reinforcement learning with entropy regularization (Ziebart et al., 2008; Levine, 2018) and the stochastic value gradient (Heess et al., 2015), since parts of them are the building blocks of our algorithm.
2.1 MDP
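A Markov decision process (MDP) can be described by a 5-tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$: $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T$ is the transition probability, $r$ is the expected reward, and $\gamma \in [0, 1)$ is the discount factor. That is, for $s \in \mathcal{S}$ and $a \in \mathcal{A}$, $r(s, a)$ is the expected reward and $T(s' | s, a)$ is the probability of reaching the state $s'$. A policy is used to select actions in the MDP. In general, the policy is stochastic and denoted by $\pi$, where $\pi(a_t | s_t)$ is the conditional probability density at $a_t$ associated with the policy. The state value of a policy $\pi$ can be written as $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{H} \gamma^{t} r(s_t, a_t) \mid s_0 = s\right]$, the expected discounted sum of immediate rewards along the horizon $H$. When the entropy of the policy is incorporated, as in probabilistic reinforcement learning (Ziebart et al., 2008), the reward function can be redefined as $r(s_t, a_t) \leftarrow r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t))$. When the model of the environment is learned from data, we use $\hat{T}(s' | s, a)$ to denote the learned dynamics model and $\hat{r}(s, a)$ to denote the learned reward model.
We denote the true long-term return as $\eta(\pi) = \mathbb{E}_{\pi, T}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$, where the expectation is taken over the policy, the true transition, and the true reward. In model-based reinforcement learning, we denote the model long-term return as $\hat{\eta}(\pi) = \mathbb{E}_{\pi, \hat{T}}\left[\sum_{t=0}^{\infty} \gamma^{t} \hat{r}(s_t, a_t)\right]$, where the expectation is taken over the policy, the learned model $\hat{T}$, and the learned reward $\hat{r}$.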
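To make the discounted return and its entropy-augmented variant concrete, the following minimal Python sketch computes both for a short trajectory. It is only an illustration: the function names, the temperature `alpha`, the discrete two-action policy, and the toy numbers are assumptions and do not come from the paper, which works with continuous policies.

```python
import math


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_discounted_return(rewards, action_probs, gamma, alpha):
    """Discounted return where each reward is augmented by alpha * entropy.

    action_probs[t] is the (hypothetical) policy distribution at step t.
    """
    augmented = [
        r + alpha * categorical_entropy(p)
        for r, p in zip(rewards, action_probs)
    ]
    return discounted_return(augmented, gamma)


rewards = [1.0, 1.0, 1.0]
uniform = [[0.5, 0.5]] * 3   # maximally random two-action policy
greedy = [[1.0, 0.0]] * 3    # deterministic policy, zero entropy

plain = discounted_return(rewards, gamma=0.5)
soft_uniform = soft_discounted_return(rewards, uniform, gamma=0.5, alpha=0.1)
soft_greedy = soft_discounted_return(rewards, greedy, gamma=0.5, alpha=0.1)
```

With identical rewards, the random policy receives a larger entropy-augmented return than the deterministic one, which is the sense in which the entropy term rewards exploration.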
2.2 Probabilistic Reinforcement Learning
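Levine (2018) formulates reinforcement learning as a probabilistic inference problem. The trajectory up to time step $H$ is defined as
$\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H)$.
The probability of the trajectory with the optimal policy is defined as
$p(\tau) = \left[p(s_0) \prod_{t} T(s_{t+1} | s_t, a_t)\right] \exp\left(\sum_{t} r(s_t, a_t)\right)$,
where $T$ is the transition probability. The probability of the trajectory induced by the policy $\pi$ is
$\hat{p}(\tau) = p(s_0) \prod_{t} T(s_{t+1} | s_t, a_t)\, \pi(a_t | s_t)$.
The objective is to minimize the KL divergence $D_{KL}(\hat{p}(\tau) \,\|\, p(\tau))$, which leads to the entropy-regularized reinforcement learning objective $\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t))\right]$, where $\mathcal{H}(\pi(\cdot | s_t))$ is an entropy term scaled by $\alpha$ (Ziebart et al., 2008). The optimal policy can be obtained by the following soft-Q update (Fox et al., 2016):
$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim T}\left[V(s_{t+1})\right], \quad V(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left(\tfrac{1}{\alpha} Q(s_t, a)\right) da$.
The iterations above define the soft Bellman operator, which is a contraction. The optimal policy can be recovered by $\pi^{*}(a_t | s_t) = \exp\left(\tfrac{1}{\alpha}\left(Q^{*}(s_t, a_t) - V^{*}(s_t)\right)\right)$, where $Q^{*}$ is the fixed point of the soft-Q update. We refer readers to Ziebart et al. (2008) and Haarnoja et al. (2017) for more discussion. In soft actor-critic (Haarnoja et al., 2018), the optimal policy is approximated by a neural network $\pi_{\theta}$, which is obtained by solving the following optimization problem:
$\min_{\theta} \mathbb{E}_{s_t}\left[D_{KL}\left(\pi_{\theta}(\cdot | s_t) \,\Big\|\, \exp\left(\tfrac{1}{\alpha}\left(Q(s_t, \cdot) - V(s_t)\right)\right)\right)\right]$.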
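The soft-Q update and the recovery of the policy from its fixed point can be illustrated on a small tabular problem. The sketch below is a discrete-action analogue written in plain Python; the two-state MDP, the temperature `alpha`, and all function names are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def soft_value(q_row, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def soft_policy(q_row, alpha):
    """pi(a | s) = exp((Q(s, a) - V(s)) / alpha)."""
    v = soft_value(q_row, alpha)
    return [math.exp((q - v) / alpha) for q in q_row]


def soft_q_iteration(rewards, next_state, gamma, alpha, sweeps=500):
    """Repeated soft Bellman backups on a toy deterministic MDP.

    rewards[s][a] is r(s, a); next_state[s][a] is the successor state.
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        v = [soft_value(row, alpha) for row in q]
        q = [
            [rewards[s][a] + gamma * v[next_state[s][a]] for a in range(n_actions)]
            for s in range(n_states)
        ]
    return q


# Hypothetical two-state, two-action MDP.
rewards = [[1.0, 0.0], [0.0, 2.0]]
next_state = [[0, 1], [0, 1]]
q_star = soft_q_iteration(rewards, next_state, gamma=0.9, alpha=0.5)
pi_star = [soft_policy(row, alpha=0.5) for row in q_star]
```

Because the backup is a contraction with factor `gamma`, the iterates converge geometrically, and each row of `pi_star` is a proper probability distribution by construction of the log-sum-exp value.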
2.3 Stochastic Value Gradient
The stochastic value gradient (SVG) method is a model-based algorithm designed to avoid compounded model errors by using only real-world observations together with gradient information from the model (Heess et al., 2015). The algorithm directly substitutes the dynamics model and reward model into the Bellman equation and calculates the gradient. To backpropagate through the stochastic Bellman equation, the reparameterization trick is applied so that the gradient can be evaluated on real-world data. In SVG(1), the stochastic policy $\pi_{\theta}$ with parameter $\theta$ can be optimized by the policy gradient in the following way:
(1)
where the policy noise and the model noise are reparameterization variables that can be sampled directly from a prior distribution or inferred from a generative model, and the dynamics model and reward model are learned from data. In the off-policy update, SVG includes an importance weight, namely the ratio between the action density under the current policy parameters and the density under the policy parameters that generated the corresponding sample in the replay buffer.
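The variance advantage of a reparameterized (pathwise) gradient over a likelihood-ratio estimator can be checked numerically on a one-dimensional example. The sketch below is not SVG itself: it assumes a Gaussian "policy" with mean `mu`, a fixed test function f(a) = a^2 in place of the return, and hypothetical function names.

```python
import random


def gradient_samples(mu, sigma, n, seed=0):
    """Per-sample estimates of d/dmu E[f(a)] for a ~ N(mu, sigma^2), f(a) = a^2."""
    rng = random.Random(seed)
    pathwise, likelihood_ratio = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps                     # reparameterized action
        pathwise.append(2.0 * a)                 # f'(a) * da/dmu, with da/dmu = 1
        score = (a - mu) / sigma ** 2            # d/dmu log N(a; mu, sigma^2)
        likelihood_ratio.append(a * a * score)   # f(a) * score
    return pathwise, likelihood_ratio


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


pw, lr = gradient_samples(mu=1.0, sigma=1.0, n=20000)
pw_mean, lr_mean = mean(pw), mean(lr)   # both estimate d/dmu (mu^2 + sigma^2) = 2
pw_var, lr_var = variance(pw), variance(lr)
```

Both estimators are unbiased for the same gradient, but in this toy setting the pathwise estimator has a per-sample variance of 4 while the likelihood-ratio estimator has a variance of 30, which is why backpropagating through a differentiable model is attractive.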
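Notation: Given two metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, we say a function $f: M_1 \to M_2$ is $K$-Lipschitz if $d_2(f(x), f(y)) \le K\, d_1(x, y)$ for all $x, y \in M_1$. Given a metric space $(M, d)$ and the set $\mathbb{P}(M)$ of probability measures on $M$, the Wasserstein metric between two probability distributions $\mu_1$ and $\mu_2$ in $\mathbb{P}(M)$ is defined as $W(\mu_1, \mu_2) = \inf_{j \in \Lambda} \int\!\!\int j(s_1, s_2)\, d(s_1, s_2)\, ds_1\, ds_2$, where $\Lambda$ denotes the collection of all joint distributions $j$ with marginals $\mu_1$ and $\mu_2$.
3 MEMB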
In this section, we introduce our model-embedding model-based reinforcement learning algorithm (MEMB). In particular, we optimize the following model long-term return with entropy regularization:
(2) 
where we omit the regularization parameter of the entropy term in the following discussion to ease the exposition. Recall from Section 2.2 that optimizing the entropy-regularized reinforcement learning objective is equivalent to minimizing the KL divergence between the trajectory distribution induced by the policy and the trajectory distribution of the optimal policy. We now replace the true transition and reward by the learned dynamics model and the learned reward model in both trajectory distributions, and then minimize the resulting KL divergence with respect to the policy. Using the backward view as in Levine (2018), we obtain the optimal policy below. We defer the derivation to Appendix E.
(3) 
In the policy improvement step, the optimal policy can be approximated by a parametric policy. In MFRL, this policy is obtained by minimizing the KL divergence between the parametric policy and the optimal policy above (Levine, 2018; Haarnoja et al., 2018). A straightforward approach is to optimize the Q function, the value function, and the policy using imaginary data from the rollout, which reduces to the approach of Luo et al. (2018), Janner et al. (2019), and many others. However, this model-free way of using the data cannot leverage the model information. We defer our derivation and the discussion of the policy improvement to Section 3.3.
3.1 Model Learning
The transition dynamics and rewards can be modeled by nonlinear function approximators as two independent regression tasks that share the same input but have different outputs. In particular, we train two independent deep neural networks, each with its own parameters, to represent the dynamics model and the reward model, respectively. In our analysis, we assume that the learned transition model is not far from the true one, i.e., the distance between them is bounded, which can be estimated in practice by cross-validation.
To better represent the stochastic nature of the transitions and rewards, we apply the reparameterization trick to both the dynamics model and the reward model, with input noises sampled from a Gaussian distribution. The dynamics model and reward model thus become deterministic functions of the state, the action, and the noise. In practice, we use neural networks to generate a mean and a variance separately for the transition model and the reward model, and compute each output as the mean plus the standard deviation times the noise.
We optimize the two models by sampling data from the (real data) replay buffer and minimizing the mean squared error:
(4) 
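The model-learning step can be sketched in a few lines if the neural networks are replaced by a linear model. The following Python example is a simplified stand-in, not the paper's implementation: the linear form, the synthetic replay buffer, the fixed standard deviation, and all names are assumptions made for illustration.

```python
import random


def fit_linear_dynamics(buffer, lr=0.1, steps=500):
    """Fit s' ~ w_s * s + w_a * a by gradient descent on the mean squared error."""
    w_s, w_a = 0.0, 0.0
    n = len(buffer)
    for _ in range(steps):
        grad_s = grad_a = 0.0
        for s, a, s_next in buffer:
            err = (w_s * s + w_a * a) - s_next
            grad_s += 2.0 * err * s / n
            grad_a += 2.0 * err * a / n
        w_s -= lr * grad_s
        w_a -= lr * grad_a
    return w_s, w_a


def sample_next_state(mean, std, rng):
    """Reparameterized sample: mean + std * noise, with noise ~ N(0, 1)."""
    return mean + std * rng.gauss(0.0, 1.0)


rng = random.Random(0)
# Hypothetical "real" replay buffer generated by s' = 0.9 * s + 0.5 * a.
replay_buffer = []
for _ in range(200):
    s, a = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    replay_buffer.append((s, a, 0.9 * s + 0.5 * a))

w_s, w_a = fit_linear_dynamics(replay_buffer)
# Imagined next state for the query (s, a) = (0.3, -0.2) with a fixed std.
imagined = sample_next_state(w_s * 0.3 + w_a * (-0.2), std=0.05, rng=rng)
```

Writing the sample as mean plus scaled noise keeps the imagined next state differentiable with respect to the model output, which is what later allows gradients to flow through the model.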
3.2 Value function learning
Although the Q function in equation (3) is computed with respect to the learned model, doing so introduces model bias in practice. Moreover, in the policy update this bias leads to an additional error in the policy gradient, since, roughly speaking, the gradient of the policy is weighted by the Q function. To avoid this model error, we update the V and Q functions using equation (3) with real transitions from the real data replay buffer. In particular, we minimize the following errors with respect to the parameters of the V and Q functions.
(5) 
(6) 
A straightforward way to incorporate the imaginary data is value expansion (Feinberg et al., 2018). However, in our ablation study we find that training with real data gives the best result, so we introduce value expansion only briefly. With parameterized Q and V functions, both can be updated by minimizing a new objective that applies value expansion on the imaginary rollout, where only the initial tuple is sampled from the replay buffer of real-world data and later transitions are sampled from the imaginary rollout of the model. The objective is indexed by the number of value-expansion steps taken with imaginary data, and it is evaluated on each training tuple of the rollout starting from the initial training tuple. Note that when no imaginary step is taken, it reduces to the case where only real data is used.
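The idea of a value-expansion target can be sketched as follows. This is a simplified illustration under stated assumptions: it shows only the bootstrapped target, omits the entropy term, and uses hypothetical names and numbers rather than the exact objective of Feinberg et al. (2018) or of this paper.

```python
def value_expansion_target(real_reward, imagined_rewards, bootstrap_value, gamma):
    """Target built from one real reward, H imagined rewards, and a bootstrap value.

    target = r_0 + sum_{t=1..H} gamma^t * r_hat_t + gamma^(H+1) * V(s_{H+1})
    """
    target = real_reward
    for t, r_hat in enumerate(imagined_rewards, start=1):
        target += (gamma ** t) * r_hat
    horizon = len(imagined_rewards)
    return target + (gamma ** (horizon + 1)) * bootstrap_value


# No imagined steps: the usual one-step target from a real transition.
real_only = value_expansion_target(1.0, [], bootstrap_value=10.0, gamma=0.9)
# Two imagined steps generated by a (hypothetical) learned model.
expanded = value_expansion_target(1.0, [0.5, 0.5], bootstrap_value=10.0, gamma=0.9)
```

With an empty list of imagined rewards the function returns the ordinary one-step target, matching the remark that the expansion reduces to the real-data case.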
3.3 Policy Learning
We now consider the policy improvement step, i.e., calculating the optimal policy at each time step. One straightforward way is to minimize the KL divergence between the parametric policy and the optimal policy in equation (3), as in MFRL but with the imaginary data set. This reduces to the work of Luo et al. (2018), Chua et al. (2018), and Janner et al. (2019). However, this approach cannot leverage the learned dynamics model and reward model. To incorporate the model information, we rewrite the policy improvement step in the following equivalent form:
(7) 
In the following, we connect the dynamics model, reward model, and value function through the soft Bellman equation. Recall that we reparameterized the dynamics model in Section 3.1. We now also reparameterize the policy as a deterministic function of the state and a noise variable. The soft Bellman equation can then be written in the following way.
(8) 
To optimize (7) and leverage the gradient information of the model, we sample states from the real data replay buffer and take the gradient of the objective with respect to the policy parameters:
(9) 
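The mechanics of a one-step gradient through a learned model can be illustrated with a scalar example in which every derivative is available in closed form. The sketch below is an assumption-laden toy, not equation (9): it uses a linear policy `a = theta * s + sigma * eps`, a linear model, a quadratic reward and value function, and hypothetical names. With a fixed `sigma` the Gaussian entropy does not depend on `theta`, so the entropy term contributes nothing to this particular gradient.

```python
def one_step_objective(theta, s, eps, params, gamma, sigma=0.1):
    """r(s, a) + gamma * V(s') with a = theta * s + sigma * eps and s' = A*s + B*a."""
    a = theta * s + sigma * eps
    s_next = params["A"] * s + params["B"] * a
    reward = -(s * s + a * a)
    value_next = -params["c"] * s_next * s_next
    return reward + gamma * value_next


def one_step_policy_gradient(theta, s, eps, params, gamma, sigma=0.1):
    """Analytic d/dtheta of the objective via the chain rule (backpropagation)."""
    a = theta * s + sigma * eps
    s_next = params["A"] * s + params["B"] * a
    d_reward_da = -2.0 * a
    d_value_ds_next = -2.0 * params["c"] * s_next
    d_s_next_da = params["B"]
    d_a_dtheta = s
    return (d_reward_da + gamma * d_value_ds_next * d_s_next_da) * d_a_dtheta


# Hypothetical learned linear model (A, B) and value function V(s) = -c * s^2.
params = {"A": 0.9, "B": 0.5, "c": 2.0}
theta, s, eps, gamma = -0.3, 1.2, 0.7, 0.99

analytic = one_step_policy_gradient(theta, s, eps, params, gamma)
h = 1e-6
numeric = (
    one_step_objective(theta + h, s, eps, params, gamma)
    - one_step_objective(theta - h, s, eps, params, gamma)
) / (2.0 * h)
```

The analytic gradient agrees with the finite-difference estimate, showing how the reward derivative and the value derivative at the imagined next state are chained through the model and the reparameterized policy.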
In the Gaussian noise case, the expression can be further simplified by plugging in the density function of the normal distribution. Clearly, we can unroll the Bellman equation for multiple steps and obtain a similar result, but here we focus on the one-step case. Equation (9) reveals an interesting connection between our algorithm and SVG. Notice that the transition from the current state to the next state is sampled from the learned dynamics model, whereas SVG(1) uses only the real data. Thus our algorithm can update the policy several times in each iteration to fully utilize the model, rather than using each real transition only once. Compared with the policy gradient step taken by the SVG(1) algorithm (Heess et al., 2015), equation (9) includes one extra term that maximizes the entropy of the policy. We also drop the importance sampling weights by sampling from the current policy.
3.4 MEMB algorithm
We summarize MEMB in Algorithm 1. At the beginning of each step, we train the dynamics model and reward model by minimizing the loss in (4). The agent then interacts with the environment and stores the data in the real data replay buffer. The actor samples a state from the real data replay buffer and obtains the next state according to the learned dynamics model; this imaginary transition is stored in the imaginary data buffer. We then train the Q function, the value function, and the policy according to the update rules in Section 3. Similar to other value-based RL algorithms, our algorithm uses two Q functions to further reduce the overestimation error, training them simultaneously on the same data but selecting only the minimum target in value updates (Fujimoto et al., 2018). We use a target function, as in the deep Q-learning algorithm (Mnih et al., 2015), and update it with an exponential moving average. We train the policy using the gradient in (9). Note that our next state is sampled from the learned dynamics model, whereas SVG uses the true transition.
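Two of the ingredients mentioned above, taking the minimum of two value estimates and updating a target function by an exponential moving average, are easy to state in code. The sketch below uses plain Python lists in place of network parameters; the function names and the rate `tau` are assumptions for illustration.

```python
def clipped_double_q_target(reward, next_q1, next_q2, gamma):
    """Bootstrapped target that uses the smaller of two next-state estimates."""
    return reward + gamma * min(next_q1, next_q2)


def ema_update(target_params, online_params, tau):
    """Exponential moving average: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = clipped_double_q_target(1.0, next_q1=4.0, next_q2=5.0, gamma=0.9)
new_target_params = ema_update([0.0, 1.0], [1.0, 3.0], tau=0.1)
```

Using the minimum makes the target pessimistic when the two estimates disagree, and a small `tau` makes the target parameters trail the online parameters slowly, which stabilizes bootstrapping.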
4 Theoretical Analysis
In this section, we provide a theoretical analysis of our algorithm. Our algorithm samples a state from the real data replay buffer and then unrolls the trajectory for several steps. Using this imaginary data from the rollout and the value function trained on real data, we update the policy with the policy learning formulation of Section 3.3. In the following, we first give a general result on how accurate the model long-term return is, regardless of how many rollout steps are used in our algorithm. We then provide a finer analysis that takes the rollout procedure into account.
We first investigate the difference between the true long-term return and the model long-term return, which is induced by the model bias and by the distribution shift that arises when the updated policy encounters states not seen during model training. In particular, we define the model bias as the Wasserstein distance between the true model and the learned model. We define the distribution shift as the discrepancy between the data-collecting policy and the intermediate policy produced during the update of the algorithm. For instance, in Algorithm 1 the data-collecting policy corresponds to the replay buffer of true data, while the intermediate policy is the one used in the imaginary rollout. We assume that both the model bias and the distribution shift are bounded. Compared with the total variation used by Janner et al. (2019), the Wasserstein distance better captures how closely the learned model approximates the true one (Asadi et al., 2018). For instance, if two distributions have disjoint supports, the total variation is always 1 regardless of how far the supports are from each other. Such a case can easily arise in high-dimensional settings (Gulrajani et al., 2017).
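The contrast between total variation and the Wasserstein distance on disjoint supports can be verified with point masses on the real line. The following Python sketch is a minimal illustration; the discrete representation as `{point: mass}` dictionaries and the example values are assumptions, not part of the paper.

```python
def total_variation(p, q):
    """TV distance between two discrete distributions given as {point: mass}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def wasserstein_1d(p, q):
    """W1 on the real line: integral of |CDF_p - CDF_q| over the merged support."""
    points = sorted(set(p) | set(q))
    cdf_p = cdf_q = 0.0
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        distance += abs(cdf_p - cdf_q) * (right - left)
    return distance


true_next = {0.0: 1.0}     # true model: the next state is 0
near_model = {0.1: 1.0}    # learned model, slightly off
far_model = {5.0: 1.0}     # learned model, badly off

tv_near = total_variation(true_next, near_model)
tv_far = total_variation(true_next, far_model)
w_near = wasserstein_1d(true_next, near_model)
w_far = wasserstein_1d(true_next, far_model)
```

Total variation assigns the same value of 1 to both learned models, whereas the Wasserstein distance separates the nearly correct model from the badly wrong one.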
To bound the error of the long-term return, we follow the Lipschitz assumption on the model and policy of Asadi et al. (2018). In particular, a transition model belongs to the Lipschitz model class if its transition probabilities can be decomposed into a distribution over a set of deterministic functions. The model is called Lipschitz if each of these functions is Lipschitz. This is easy to understand in the context of our work, where we reparameterize the model. For instance, if the transition is deterministic, i.e., the next state is a fixed function of the current state and action, then the Lipschitz constant of the model is that of this function. Similarly, a policy in the Lipschitz class is represented as a distribution over deterministic functions, each of which is Lipschitz. If we use neural networks to approximate the model and the policy, Lipschitz continuity means that the gradient with respect to the input is bounded by a constant. Here, for simplicity, we assume that the noise distributions of the model and the policy are independent of the state and action. A similar bound that includes such dependence can be proved, but with more involved assumptions and notation.
Note that the analysis of Asadi et al. (2018) considers only the error caused by the model bias and neglects the effect of the policy distribution shift. They therefore assume that there is no distribution shift and do not need the Lipschitz assumption on the policy. In addition, we give a bound for the $k$-step rollout in Theorem 2, which is not covered by Asadi et al. (2018). In the following, we first give a general result that bounds the gap between the true long-term return and the model long-term return, where we assume the reward model is known and focus on the effect of the model bias and the distribution shift.
Theorem 1.
Let the true transition model and the learned transition model both be Lipschitz. We also assume policy and reward function are and Lipschitz respectively. Suppose , . Let and assume , then the difference between true return and model return is bounded as
The theorem above is a generic result for model-based reinforcement learning. Its analysis is based on running a full rollout of the learned model, which leads to compounded error. In our algorithm, however, we start a rollout from a state drawn from the state distribution induced by the previous policy and then run an imaginary rollout for $k$ steps using the current policy on the learned transition model. Following the terminology of Janner et al. (2019), we call this a $k$-step branched rollout. We analyze the long-term return of this branched rollout in more detail below.
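A $k$-step branched rollout can be sketched as a short loop: pick a starting state from real data, then alternate between the current policy and the learned model. The Python example below is a toy under stated assumptions; the scalar state, the linear policy and model, the noise scales, and all names are hypothetical.

```python
import random


def branched_rollout(start_state, policy, model, k, rng):
    """Roll the learned model forward k steps from a real starting state."""
    transitions = []
    s = start_state
    for _ in range(k):
        a = policy(s, rng)
        s_next, r = model(s, a, rng)
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions


def toy_policy(s, rng):
    # Hypothetical Gaussian policy: a = -0.5 * s + 0.1 * noise.
    return -0.5 * s + 0.1 * rng.gauss(0.0, 1.0)


def toy_model(s, a, rng):
    # Hypothetical learned dynamics and reward with reparameterized noise.
    s_next = 0.9 * s + 0.5 * a + 0.01 * rng.gauss(0.0, 1.0)
    return s_next, -(s * s + a * a)


rng = random.Random(0)
real_buffer_states = [1.0, -0.4, 0.7]   # states visited by earlier policies
start = rng.choice(real_buffer_states)
rollout = branched_rollout(start, toy_policy, toy_model, k=3, rng=rng)
```

Only the first state comes from real experience; every later state is imagined, so model error can compound for at most `k` steps before the rollout is cut off.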
Theorem 2.
Let the true transition model and learned transition model both be Lipschitz. We also assume policy and reward function are and Lipschitz respectively. Suppose , . If , then the difference between true return and branched return is bounded as
Clearly, on the right-hand side of the bound, some terms increase with the rollout length $k$ while the others decrease. Thus there exists a best $k$, which depends on the discount factor, the model bias, and the distribution shift. In general, imaginary data from a short rollout can still be trusted. Recall that we can apply the rollout of the Bellman equation in two different ways in our algorithm: (1) the policy update (equation (8)) and (2) the value function learning in Section 3.2. We therefore run an ablation study on these two options and find that imaginary data degrades the performance when used in value function learning but improves learning considerably when used in the policy update, which also explains why MEMB performs much better than SVG. Another interesting observation concerns the distribution shift. In Algorithm 1, we can update the policy several times in each iteration using imaginary data from the model. If the number of updates is too large, it causes a large distribution shift and degrades the performance. As such, we typically restrict the number of updates per iteration.
Figure 1: Performance of MEMB and other baselines on benchmark tasks. The x-axis is the training step (epoch or step). Each experiment is run for five trials with five different random seeds and parameter initializations. For the simple task, InvertedPendulum, we limit training to 40 epochs. For the other, more complex tasks, the total number of training steps is 200K or 300K. The solid line is the mean of the average return, and the shaded region represents the standard deviation. On HalfCheetah, Hopper, and Swimmer, MEMB outperforms the other baselines significantly. On Walker2d, SLBO is slightly better than MEMB, and both surpass the other algorithms. On Reacher, MEMB and SAC perform best.
5 Experimental results
In this section, we would like to answer two questions: (1) How does MEMB perform on benchmark reinforcement learning tasks compared with other state-of-the-art model-based and model-free reinforcement learning algorithms? (2) Should we use the imaginary data generated by the embedded model in the policy learning of Section 3.3, and how much imaginary data should we use in the value function update of Section 3.2? We defer the answer to the second question to the ablation study in Appendix B.
Environment: To answer these two questions, we experiment in the MuJoCo simulation environment (Todorov et al., 2012): InvertedPendulum-v2, HalfCheetah-v2, Reacher-v2, Hopper-v2, Swimmer-v2, and Walker2d-v2. Each experiment is run for five trials with five different random seeds and parameter initializations. The details of the tasks and the experiment implementations can be found in Appendix C.
Comparison to state-of-the-art: We compare our algorithm with state-of-the-art model-free and model-based reinforcement learning algorithms in terms of sample complexity and performance. DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2018) are two model-free reinforcement learning algorithms for continuous action tasks. SAC has shown reliable performance and robustness on several benchmark tasks. Our algorithm also builds on the maximum entropy reinforcement learning framework and benefits from incorporating the model in the policy update. The four model-based reinforcement learning baselines are SVG (Heess et al., 2015), SLBO (Luo et al., 2018), MBPO (Janner et al., 2019), and POPLIN (Wang and Ba, 2019). In SVG, the algorithm computes the gradient only along the real trajectory, while MEMB updates the policy several times using imaginary data generated from the model. At the same time, we avoid importance sampling by using data from the learned model. SLBO is a model-based algorithm with performance guarantees that applies TRPO (Schulman et al., 2015) to the data set generated from rollouts of the model. MBPO is similar in spirit but uses SAC as the learning algorithm on the imaginary data.
For fairness, we compare against the baselines without ensemble learning techniques (Chua et al., 2018), which are known to reduce the model bias. We also do not use distributed RL to accelerate training. We believe these techniques are orthogonal to our work and could be integrated in future work to further improve the performance. We therefore compare only this pure version of MEMB with the other baselines. We also note that some recent MBRL works modify the benchmarks to shorten the task horizons and simplify the modeling problem, while others assume that the true terminal condition is known to the algorithm (Wang et al., 2019). In contrast, we test our algorithm on the full-length tasks and make no assumptions about the terminal condition.
We present the experimental results in Figure 1. In the simple task, InvertedPendulum, MEMB reaches its asymptotic result in just 16 epochs. In HalfCheetah, MEMB's performance is around 8000 at 200k steps, while the performance of all the other baselines is below 5300. In Reacher, MEMB and SAC have similar performance, and both are better than the other algorithms. In Hopper, the final performance of MEMB is around 3300; the runner-up is POPLIN, whose final performance is around 2300. In Swimmer, MEMB performs best. In Walker2d, SLBO is slightly better than MEMB, and both achieve an average return of 2900 at 300k timesteps.
References
 Asadi et al. (2018) Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based reinforcement learning. ICML 2018, 2018.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
 Clavera et al. (2018) Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.

 Deisenroth and Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
 Feinberg et al. (2018) V Feinberg, A Wan, I Stoica, MI Jordan, JE Gonzalez, and S Levine. Model-based value expansion for efficient model-free reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
 Finn et al. (2016) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
 Fox et al. (2016) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. UAI, 2016.
 Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1352–1361. JMLR.org, 2017.
 Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Heess et al. (2015) Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
 Janner et al. (2019) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509, 2019.
 Kalweit and Boedecker (2017) Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206, 2017.
 Kurutach et al. (2018) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
 Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Levine and Abbeel (2014) Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
 Levine and Koltun (2013) Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Luo et al. (2018) Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Nagabandi et al. (2018) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 Richards (2005) Arthur George Richards. Robust constrained model predictive control. PhD thesis, Massachusetts Institute of Technology, 2005.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
 Sutton (1990) Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.
 Tassa et al. (2012) Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE, 2012.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 Villani (2010) Cédric Villani. Optimal transport: old and new. Bull. Amer. Math. Soc, 47:723–727, 2010.
 Wang and Ba (2019) Tingwu Wang and Jimmy Ba. Exploring model-based planning with policy networks. ICLR 2020, 2019.
 Wang et al. (2019) Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
 Ziebart et al. (2008) Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008.
Appendix A Related work
There is a plethora of work on MBRL. These works can be classified into several categories depending on how the model is used to search for the optimal policy, or on the function approximator used for the dynamics model. Iterative Linear Quadratic-Gaussian (iLQG) (Tassa et al., 2012) assumes that the true dynamics are known to the agent. It approximates the dynamics with linear functions and the reward function with quadratic functions, so the problem can be transformed into the classic LQR problem. In Guided Policy Search (Levine and Koltun, 2013; Levine and Abbeel, 2014; Finn et al., 2016), the system dynamics are modeled with a time-varying linear-Gaussian model. The policy is approximated by a neural network that minimizes the KL divergence between the iLQG solution and the policy. A regularization term is added to the reward function to avoid overconfidence in the policy optimization. Nonlinear function approximators can be leveraged to model more complicated dynamics. Deisenroth and Rasmussen (2011) use Gaussian processes to model the dynamics of the environment, so that the policy gradient can be computed analytically along the training trajectory. However, this approach may suffer from the curse of dimensionality, which hinders its applicability to real problems.
Recently, more and more works incorporate deep neural networks into MBRL. Heess et al. (2015) model the dynamics and reward with neural networks, and compute the gradient with the true data. Richards (2005); Nagabandi et al. (2018) optimize the action sequence to maximize the expected planning reward under the learned dynamics model, and then the policy is fine-tuned with TRPO. Luo et al. (2018); Chua et al. (2018); Kurutach et al. (2018); Janner et al. (2019) use the current policy to gather data from interaction with the environment and then learn the dynamics model. In the next step, the policy is improved (trained by a model-free reinforcement learning algorithm) with a large amount of imaginary data generated by the learned model. MEMB may reduce to their work if the policy is updated with a model-free algorithm. Janner et al. (2019) provide an error bound on the long-term return of the k-step rollout, given that the total variation of the model bias and of the policy distribution shift are bounded. However, as we discussed, total variation may not be a good measure of the difference between the learned model and the true model, especially when the supports of the distributions are disjoint. Ensemble learning can also be applied to further reduce the model error. Asadi et al. (2018) leverage a Lipschitz model to analyze the effect of the model bias on the long-term return. Our work considers both the model bias and the distribution shift of the policy. In addition, we analyze the k-step rollout, while Asadi et al. (2018) give the result only for the full rollout.
Appendix B Ablation Study
B.1 How to utilize the imaginary data
In this section, we conduct an ablation study to understand how much imaginary data we should include in the algorithm. Recall that in our algorithm the model is embedded in the soft Bellman equation in the policy update step, which means we fully trust the model when computing the policy gradient. Using the imaginary data in this way should improve the sample efficiency. To confirm this claim, we compare MEMB with SVG on the HalfCheetah task. Notice that when we use only the real trajectory in the policy update, MEMB reduces to SVG (footnote 4: some differences remain, such as the update rule of the Q and V functions). To remove the effect of the entropy regularization, we vary the entropy regularization parameter in MEMB. When this parameter is zero, MEMB reduces to the non-regularized formulation; in that case, we add noise to the policy for exploration. We report the result in panel (a) of Figure 2. Using imaginary data in the policy update clearly improves learning by a wide margin.
In Section 3.2, we train the Q and V functions with the real data set. In the experiments, we also try the value expansion introduced by Feinberg et al. (2018), testing the algorithm with value expansion at different horizons. Our conclusion is that including imaginary data when training the value function in our algorithm hurts performance, especially in complex tasks. We show the performance of MEMB with value expansion in panels (b) and (c) of Figure 2. We first test the algorithm on a simple task, Pendulum from OpenAI Gym (Brockman et al., 2016), and show the result in panel (b) of Figure 2. MEMB with the shortest horizon converges to the optimal policy within several epochs; as the horizon increases, performance decreases. We then evaluate value expansion on a complex task, HalfCheetah from the MuJoCo environment (Todorov et al., 2012), in panel (c) of Figure 2. In this task, value expansion with the tested horizons does not work at all. A likely reason is that the dynamics model of HalfCheetah introduces more significant model bias than that of the simple Pendulum task. Thus training both the policy and the value function on the imaginary data set may cause a large error in the policy gradient.
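To make the role of the horizon concrete, the general shape of an H-step value expansion target can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `policy`, `model`, `reward_fn`, and `value_fn` are hypothetical callables, states and actions are plain floats, and the toy dynamics at the bottom exist only to make the example runnable.

```python
from typing import Callable


def value_expansion_target(
    state: float,
    policy: Callable[[float], float],            # hypothetical: state -> action
    model: Callable[[float, float], float],      # hypothetical learned dynamics: (s, a) -> s'
    reward_fn: Callable[[float, float], float],  # hypothetical learned reward: (s, a) -> r
    value_fn: Callable[[float], float],          # hypothetical value estimate: s -> V(s)
    horizon: int,
    gamma: float = 0.99,
) -> float:
    """Sum of `horizon` imagined discounted rewards plus a bootstrapped value."""
    target, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        target += discount * reward_fn(state, action)
        state = model(state, action)  # imagined transition, so model bias compounds here
        discount *= gamma
    return target + discount * value_fn(state)


# Toy usage: linear dynamics, quadratic cost, zero value estimate.
toy_target = value_expansion_target(
    state=1.0,
    policy=lambda s: -0.5 * s,
    model=lambda s, a: s + a,
    reward_fn=lambda s, a: -s * s,
    value_fn=lambda s: 0.0,
    horizon=2,
    gamma=0.5,
)
print(toy_target)
```

With `horizon=0` the target is just the value estimate at the real state. Every extra step feeds a model prediction back into the model, which is one way to see why a longer horizon can amplify model bias on harder tasks.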
B.2 Model Error
We measure the difference between the true model and the learned model (using the Wasserstein distance), i.e., the model error. In particular, we record the model learned by MEMB every 10 epochs and then randomly sample state-action pairs. Feeding each pair to the learned model, we obtain the predicted next state, which is compared with the true next state. The error is averaged over the state-action pairs. We do the same for the reward model. The results are evaluated over five trials using five different random seeds and are reported in Figure 3.
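The evaluation loop described above can be sketched as follows. This is a hedged illustration that assumes both models are deterministic, in which case the Wasserstein distance between the two predicted next-state distributions (two point masses) equals the distance between the two predicted states. The callables `learned_model`, `true_model`, and `sample_pair` are hypothetical names, and the toy models are chosen only so that the error is known by construction.

```python
import math
import random
from typing import Callable, Tuple

Vector = Tuple[float, ...]


def mean_model_error(
    learned_model: Callable[[Vector, Vector], Vector],  # hypothetical learned dynamics
    true_model: Callable[[Vector, Vector], Vector],     # hypothetical simulator step
    sample_pair: Callable[[random.Random], Tuple[Vector, Vector]],
    num_pairs: int = 1000,
    seed: int = 0,
) -> float:
    """Average Euclidean distance between predicted and true next states."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        state, action = sample_pair(rng)
        predicted = learned_model(state, action)
        actual = true_model(state, action)
        total += math.dist(predicted, actual)  # distance between the two point masses
    return total / num_pairs


# Toy check: the "learned" model is the true model plus a constant offset of 0.05.
def toy_true(s: Vector, a: Vector) -> Vector:
    return (s[0] + 0.1 * a[0],)


def toy_learned(s: Vector, a: Vector) -> Vector:
    return (s[0] + 0.1 * a[0] + 0.05,)


def toy_sampler(rng: random.Random) -> Tuple[Vector, Vector]:
    return (rng.uniform(-1.0, 1.0),), (rng.uniform(-1.0, 1.0),)


toy_error = mean_model_error(toy_learned, toy_true, toy_sampler)
print(toy_error)  # close to 0.05 by construction
```

Repeating the call with several seeds and averaging would mirror the multi-seed protocol in the text. The same function applies to a reward model if rewards are wrapped as one-element tuples.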
B.3 Plug The True Model Into MEMB
It is interesting to see how MEMB performs if we plug the true model into the algorithm. Since the dynamics in MuJoCo are complicated, we test only the simple Pendulum task. Intuitively, MEMB with the true model should perform better than MEMB with the learned one. We verify this in Figure 4.
B.4 Asymptotic Performance
We evaluate the asymptotic behavior of model-free RL (specifically SAC) and MEMB agents over 2000 epochs of training (20M steps) on four simulation environments. The SAC agent achieves in HalfCheetah, in Walker2d, in Hopper, in Swimmer. The MEMB agent achieves in HalfCheetah, in Walker2d, in Hopper, in Swimmer. In general, there is a small gap between the asymptotic performance of MBRL and MFRL. This is expected, since the learned model is not perfectly accurate.
Appendix C Environment Overview and Hyperparameter Setting
In this section, we provide an overview of the simulation environments in Table 1. The hyperparameter setting for each environment is shown in Table 2.

Environment Name    Observation Space Dimension    Action Space Dimension    Horizon
Pendulum            3                              1                         200
InvertedPendulum    4                              1                         1000
HalfCheetah         17                             6                         1000
Hopper              11                             3                         1000
Walker2D            17                             6                         1000
Swimmer             8                              2                         1000
Reacher             11                             2                         50

Pendulum  InvertedPendulum  HalfCheetah  Hopper  Walker2D  Swimmer  Reacher  
Epoch  50  40  200  300  180  50  

0.0003  

0.0003  0.001  0.001  0.0003  0.0003  0.0003  

0.0003  0.0001  0.0001  0.0001  

0.2  0.1  0.4  0.2  0.2  

1000  50  

(256,256)  

(32,16)  (256,128)  (256,256)  (256,128)  

5  1  5  3  5  5 
Table 2: The hyperparameters used to train the MEMB algorithm in each simulation environment. The numbers in the policy, value, and model network architectures indicate the number of hidden units in each layer of the MLP. The ReLU activation function is used in all architectures.
Appendix D Proof
In this section, we give the proof of the theorem in the main paper. To start with, we give the definition of the Wasserstein distance and its dual form, since we will use it frequently in the following discussion.
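Definition: Given a metric space $(M, d)$ and the set $\mathbb{P}(M)$ of probability measures on $M$, the Wasserstein metric between two probability distributions $\mu_1$ and $\mu_2$ in $\mathbb{P}(M)$ is defined as
$$W(\mu_1, \mu_2) = \inf_{j \in \Lambda} \iint j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1, \qquad (10)$$
where $\Lambda$ denotes the collection of all joint distributions $j$ on $M \times M$ with marginals $\mu_1$ and $\mu_2$.
The dual representation is a special case of the Kantorovich-Rubinstein duality theorem (Villani, 2010):
$$W(\mu_1, \mu_2) = \sup_{f:\, \|f\|_{L} \le 1} \int \big(f(s)\mu_1(s) - f(s)\mu_2(s)\big)\, ds, \qquad (11)$$
where $\|f\|_{L} \le 1$ means the function $f$ is 1-Lipschitz.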
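A small numerical illustration may help explain why the analysis uses the Wasserstein distance rather than total variation. The sketch below is restricted to an assumed special case: one-dimensional empirical distributions with equally many samples, where the optimal coupling simply matches sorted samples. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """W1 distance between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples")
    # In one dimension the optimal coupling pairs the sorted samples.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def total_variation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Total variation distance between the two empirical distributions."""
    px, py = Counter(xs), Counter(ys)
    atoms = set(px) | set(py)
    return 0.5 * sum(abs(px[v] / len(xs) - py[v] / len(ys)) for v in atoms)


# "True" next states and predictions that are off by a tiny constant shift.
true_next = [0.0, 1.0, 2.0]
predicted_next = [0.01, 1.01, 2.01]

w_dist = wasserstein_1d(true_next, predicted_next)
tv_dist = total_variation(true_next, predicted_next)
print(w_dist, tv_dist)
```

Because the two supports are disjoint, the total variation is at its maximum value of 1 no matter how small the shift is, while the Wasserstein distance shrinks with the shift (about 0.01 here).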
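The first lemma is well known. It states that the Lipschitz constant of a composition of two functions is bounded by the product of their Lipschitz constants.
Lemma 1.
Define three metric spaces $(M_1, d_1)$, $(M_2, d_2)$, and $(M_3, d_3)$. Define Lipschitz functions $f: M_2 \to M_3$ and $g: M_1 \to M_2$ with constants $K_f$ and $K_g$. Then $h = f \circ g: M_1 \to M_3$ is Lipschitz with constant $K_h \le K_f K_g$.
Proof.
For any $s_1, s_2 \in M_1$,
$$d_3\big(f(g(s_1)), f(g(s_2))\big) \le K_f\, d_2\big(g(s_1), g(s_2)\big) \le K_f K_g\, d_1(s_1, s_2). \qquad (12)$$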
∎
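As a quick sanity check of Lemma 1, the bound can be probed numerically on scalar functions. The sketch below uses functions chosen purely for illustration, $g(x) = 3x$ with constant 3 and $f(x) = 2\sin(x)$ with constant 2. It estimates the Lipschitz constant of the composition from random pairs; such an estimate can only approach the true constant from below, so it illustrates the bound rather than proving it.

```python
import math
import random
from typing import Callable


def empirical_lipschitz(
    fn: Callable[[float], float],
    low: float,
    high: float,
    num_pairs: int = 10000,
    seed: int = 0,
) -> float:
    """Largest observed slope |fn(x) - fn(y)| / |x - y| over random pairs."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(num_pairs):
        x, y = rng.uniform(low, high), rng.uniform(low, high)
        if x != y:
            best = max(best, abs(fn(x) - fn(y)) / abs(x - y))
    return best


def g(x: float) -> float:
    return 3.0 * x  # Lipschitz constant 3


def f(x: float) -> float:
    return 2.0 * math.sin(x)  # Lipschitz constant 2


def h(x: float) -> float:
    return f(g(x))  # composition, bounded by 2 * 3 = 6 according to Lemma 1


estimate = empirical_lipschitz(h, -1.0, 1.0)
print(estimate)  # stays below the product bound of 6
```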
Lemma 2.
Suppose we have two joint distributions , . We further assume that and . Then we have .
Proof.
Using the triangle inequality, we have
Now we bound the first term and second term respectively. For the first term, according to the dual form of the Wasserstein distance, we have
(13) 
Notice it is easy to verify that if is a 1-Lipschitz function w.r.t. , then for a fixed , (we denote it as ) is also a 1-Lipschitz function w.r.t. . Thus , we have
(14) 
We then bound the second term in the following way.
(15) 
where (1) holds by the assumption that is in the Lipschitz class, and (2) uses the fact that is 1-Lipschitz, which holds by an argument similar to that of Lemma 1.
Combining the two pieces, we obtain the result. ∎
Lemma 3.
Define and . Suppose , , then we have
.
Proof.
We define a reference probability distribution . Using the triangle inequality, we have
Thus we just need to bound the two terms on the right hand side.
For the first term, according to the definition of the Wasserstein distance, we have