Model Embedding Model-Based Reinforcement Learning

by   Xiaoyu Tan, et al.

Model-based reinforcement learning (MBRL) has shown advantages in sample efficiency over model-free reinforcement learning (MFRL). Despite its impressive results, it still faces a trade-off between the ease of data generation and model bias. In this paper, we propose a simple and elegant model-embedding model-based reinforcement learning (MEMB) algorithm in the framework of probabilistic reinforcement learning. To balance sample efficiency and model bias, we exploit both real and imaginary data in training. In particular, we embed the model in the policy update and learn the Q and V functions from the real data set. We provide a theoretical analysis of MEMB under a Lipschitz continuity assumption on the model and policy. Finally, we evaluate MEMB on several benchmarks and demonstrate that our algorithm can achieve state-of-the-art performance.





1 Introduction

Reinforcement learning can be broadly classified into two categories: model-free reinforcement learning (MFRL) and model-based reinforcement learning (MBRL). There has been a recent surge of interest in MBRL due to its higher sample efficiency compared with MFRL (Kurutach et al., 2018; Heess et al., 2015; Asadi et al., 2018). Despite its success, MBRL still faces a challenging problem, namely model bias, where an imperfect dynamics model degrades the performance of the algorithm (Kurutach et al., 2018). Unfortunately, this often happens when the environment is sufficiently complex. There have been a few efforts to mitigate the issue by combining model-based and model-free approaches. Heess et al. (2015) compute the value gradient along real system trajectories instead of planned ones to avoid compounded errors. Kalweit and Boedecker (2017) mix real data and imaginary data from the model and then train the Q function. An ensemble of neural networks can be applied to model the environment dynamics, which effectively reduces the model error (Kurutach et al., 2018; Clavera et al., 2018; Chua et al., 2018).

Indeed, how to exploit real and imaginary data is a key question in model-based reinforcement learning. Recent model-based algorithms applying Dyna-style updates have demonstrated promising results (Sutton, 1990; Kurutach et al., 2018; Luo et al., 2018). They collect real data using the current policy to train the dynamics model. The policy is then improved using state-of-the-art model-free reinforcement learning algorithms with imagined data generated by the learned model. We ask instead: why not embed the model directly into the policy improvement? To this end, we derive a reinforcement learning algorithm called model-embedding model-based reinforcement learning (MEMB) in the framework of probabilistic reinforcement learning (Levine, 2018).

We provide a theoretical result on the error of the long-term return in MEMB, which is caused by the model bias and the policy distribution shift, under a Lipschitz continuity condition on the model and policy. In addition, our analysis takes the length of the rollout into account, which helps us design the algorithm. In MEMB, the dynamics model and reward model are trained with the real data set collected from the environment. We then simply train the Q and V functions on the real data set with the update rule derived from the maximum entropy principle (several other ways to include the imaginary data can also be applied; see the discussion in Section 3). In the policy improvement step, the stochastic actor samples an action with the real state as input, and the state then transitions from $s$ to $s'$ according to the learned dynamics model.

We link the learned dynamics model, reward model, and policy to compute an analytic policy gradient by back-propagation. Compared with the likelihood-ratio estimator usually used in MFRL methods, such a value gradient method reduces the variance of the policy gradient (Heess et al., 2015). The other merit of MEMB is its computational efficiency. Several state-of-the-art MBRL algorithms generate hundreds of thousands of imaginary samples from the model alongside a few real samples (Luo et al., 2018; Janner et al., 2019). This huge imaginary data set is then fed into MFRL algorithms, which may be sample-efficient in terms of real samples but is not computationally friendly. In contrast, our algorithm embeds the model in the policy update. We can therefore implement it efficiently by computing the policy gradient several times in each iteration (see Algorithm 1), without any computation on a huge imaginary data set.

Note that SVG (Heess et al., 2015) also embeds the model to compute the policy gradient. However, there are several key differences between MEMB and SVG.

  • To alleviate the issue of compounded error, SVG proposes a conservative algorithm in which only real data is used to evaluate policy gradients and the imaginary data is wasted. However, our theorem shows that imaginary data from a short rollout of the learned model can be trusted. In our work, the policy is trained with the model and the imaginary data set several times in each iteration of the algorithm. In the ablation study (Appendix B), we demonstrate that this difference leads to a large gap in performance.

  • We provide a theoretical guarantee of the algorithm, which is not included in SVG.

  • We derive our algorithm in the framework of probabilistic reinforcement learning. The entropy term encourages exploration and prevents early convergence to sub-optimal policies, and this framework has shown state-of-the-art performance in MFRL (Haarnoja et al., 2018).

In addition, MEMB avoids importance sampling in the off-policy setting by sampling the action from the current policy and the transition from the learned model, which further reduces the variance of the gradient estimate.

Contributions: We derive an elegant, sample-efficient, and computationally friendly Dyna-style MBRL algorithm in the framework of probabilistic reinforcement learning in a principled way (one trial of the experiment finishes in around one or two hours on a laptop). Unlike traditional MBRL algorithms, we directly embed the model into the policy improvement, which can reduce the variance of the gradient estimate and avoids computation on a huge imaginary data set. In addition, since the algorithm is off-policy, it is sample-efficient. Finally, we provide theoretical results on the long-term return of our algorithm that account for the model bias and the policy distribution shift. We test our algorithm on several benchmark tasks in the MuJoCo simulation environment (Todorov et al., 2012) and demonstrate that it can achieve state-of-the-art performance. We provide our code anonymously for reproducibility.

Related work: There is a plethora of work on MBRL. It can be classified into several categories depending on how the model is used, how the optimal policy is searched for, or the function approximator used for the dynamics model. We leave the comprehensive discussion of related work to Appendix A.

2 Preliminaries

In this section, we first present some background on the Markov decision process. We then introduce probabilistic reinforcement learning with entropy regularization (Ziebart et al., 2008; Levine, 2018) and the stochastic value gradient (Heess et al., 2015), since parts of them are the building blocks of our algorithm.

2.1 MDP

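A Markov decision process (MDP) can be described by a 5-tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$: $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p$ is the transition probability, $r$ is the expected reward, and $\gamma \in [0, 1)$ is the discount factor. That is, for $s \in \mathcal{S}$ and $a \in \mathcal{A}$, $r(s, a)$ is the expected reward and $p(s' \mid s, a)$ is the probability of reaching the state $s'$. A policy is used to select actions in the MDP. In general, the policy is stochastic and denoted by $\pi$, where $\pi(a_t \mid s_t)$ is the conditional probability density at $a_t$ associated with the policy. The state value of a policy $\pi$ can be represented by $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s\right]$, the expected sum of immediate rewards discounted by $\gamma$ along the horizon. When the entropy of the policy is incorporated, as in probabilistic reinforcement learning (Ziebart et al., 2008), we can redefine the reward function as $r(s_t, a_t) \leftarrow r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$. When the model of the environment is learned from data, we use $\tilde{p}(s' \mid s, a)$ to denote the learned dynamics model and $\tilde{r}(s, a)$ to denote the learned reward model.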

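We denote the true long-term return as $\eta(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$, where the expectation is taken with respect to the policy, the true transition, and the true reward. In model-based reinforcement learning, we denote the model long-term return as $\tilde{\eta}(\pi) = \tilde{\mathbb{E}}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \tilde{r}(s_t, a_t)\right]$, where $\tilde{\mathbb{E}}_{\pi}$ means the expectation over the policy, the learned model $\tilde{p}$, and the learned reward $\tilde{r}$.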
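The discounted return and the entropy-augmented reward described above can be sketched for a finite reward sequence as follows. This is only an illustration: the function names, the per-step entropy values, and the entropy weight `alpha` are assumptions made for the example and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def entropy_augmented_rewards(rewards, entropies, alpha):
    """Add alpha * H(pi(.|s_t)) to every reward (maximum-entropy convention)."""
    return [r + alpha * h for r, h in zip(rewards, entropies)]


# Hypothetical three-step trajectory with a constant policy entropy of 0.5.
rewards = [1.0, 1.0, 1.0]
entropies = [0.5, 0.5, 0.5]

plain = discounted_return(rewards, gamma=0.9)
soft = discounted_return(
    entropy_augmented_rewards(rewards, entropies, alpha=0.2), gamma=0.9
)
```

With a constant entropy, the augmented return is simply the plain return scaled by the per-step bonus, which makes the effect of `alpha` easy to inspect.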

2.2 Probabilistic Reinforcement Learning

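Levine (2018) formulates reinforcement learning as a probabilistic inference problem. The trajectory up to time step $T$ is defined as

$$\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T).$$

The probability of the trajectory under the optimal policy is defined as

$$p(\tau) \propto \left[p(s_0) \prod_{t=0}^{T-1} p(s_{t+1} \mid s_t, a_t)\right] \exp\left(\sum_{t=0}^{T} r(s_t, a_t)\right).$$

The probability of the trajectory induced by the policy $\pi$ is

$$\hat{p}(\tau) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1} \mid s_t, a_t) \prod_{t=0}^{T} \pi(a_t \mid s_t).$$

The objective is to minimize the KL divergence $D_{\mathrm{KL}}(\hat{p}(\tau) \,\|\, p(\tau))$, which leads to entropy-regularized reinforcement learning, $\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right]$, where $\mathcal{H}(\pi(\cdot \mid s_t))$ is an entropy term scaled by $\alpha$ (Ziebart et al., 2008). The optimal policy can be obtained by the following soft-Q update (Fox et al., 2016):

$$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V(s_{t+1})\right], \qquad V(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left(\frac{Q(s_t, a)}{\alpha}\right) \mathrm{d}a.$$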

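The above iterations define the soft Bellman operator, which is a contraction. The optimal policy can be recovered by $\pi^{*}(a_t \mid s_t) = \exp\left(\left(Q^{*}(s_t, a_t) - V^{*}(s_t)\right)/\alpha\right)$, where $Q^{*}$ is the fixed point of the soft-Q update. We refer readers to Ziebart et al. (2008) and Haarnoja et al. (2017) for more discussion. In soft actor-critic (Haarnoja et al., 2018), the optimal policy is approximated by a neural network $\pi_{\theta}$, which is obtained by solving the following optimization problem:

$$\min_{\theta}\; \mathbb{E}_{s_t}\left[ D_{\mathrm{KL}}\left( \pi_{\theta}(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left(Q(s_t, \cdot)/\alpha\right)}{Z(s_t)} \right) \right].$$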
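For intuition, the soft value and the policy recovered from it can be sketched for a single state with a finite action set. This is a simplification: the paper works with continuous actions, where the sum below becomes an integral, and the function names and the temperature `alpha` are assumptions made for the example.

```python
import math


def soft_value(q_values, alpha=1.0):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def soft_policy(q_values, alpha=1.0):
    """pi(a | s) = exp((Q(s, a) - V(s)) / alpha), a Boltzmann distribution."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


# Hypothetical Q-values for three actions in one state.
q = [1.0, 2.0, 3.0]
v = soft_value(q)
pi = soft_policy(q)
```

As `alpha` shrinks, the soft value approaches the ordinary maximum over actions and the policy concentrates on the greedy action; larger values keep the policy closer to uniform.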

2.3 Stochastic Value Gradient

The stochastic value gradient (SVG) method is a model-based algorithm designed to avoid compounded model errors by using only real-world observations and gradient information from the model (Heess et al., 2015). The algorithm directly substitutes the dynamics model and reward model into the Bellman equation and calculates the gradient. To back-propagate through the stochastic Bellman equation, the re-parameterization trick is applied so that the gradient can be evaluated on real-world data. In SVG(1), the stochastic policy $\pi_{\theta}$ with parameter $\theta$ can be optimized by the policy gradient in the following way


where the policy noise and the model noise are re-parameterization variables, which can be sampled directly from a prior distribution or inferred from a generative model, and the dynamics model and reward model are the learned ones. In the off-policy update, SVG includes an importance weight: the ratio between the action density under the parameters of the current policy and the density under the policy that generated the sample, for each data index in the replay buffer.
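The variance benefit of the re-parameterization trick over the likelihood-ratio estimator can be illustrated on a one-dimensional toy problem. The sketch below is not SVG itself: the objective `f`, the Gaussian sampling distribution, and the sample size are assumptions chosen so that the true gradient is known in closed form.

```python
import random
import statistics


def f(x):
    """Toy objective; d/dmu E[f(mu + sigma * eps)] equals 2 * mu."""
    return x * x


def pathwise_samples(mu, sigma, n, rng):
    """Re-parameterized estimator: x = mu + sigma * eps, sample f'(x) * dx/dmu."""
    return [2.0 * (mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]


def likelihood_ratio_samples(mu, sigma, n, rng):
    """Score-function estimator: f(x) * d/dmu log N(x; mu, sigma^2)."""
    out = []
    for _ in range(n):
        x = mu + sigma * rng.gauss(0.0, 1.0)
        out.append(f(x) * (x - mu) / sigma ** 2)
    return out


rng = random.Random(0)
pw = pathwise_samples(1.0, 1.0, 20000, rng)
lr = likelihood_ratio_samples(1.0, 1.0, 20000, rng)

pw_mean, lr_mean = statistics.fmean(pw), statistics.fmean(lr)
pw_var, lr_var = statistics.pvariance(pw), statistics.pvariance(lr)
```

Both estimators target the same gradient (here 2), but the per-sample variance of the likelihood-ratio version is several times larger on this toy problem, which is the effect the value gradient approach exploits.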

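Notation: Given two metric spaces $(X, d_X)$ and $(Y, d_Y)$, we say a function $f: X \to Y$ is $K$-Lipschitz if $d_Y(f(x_1), f(x_2)) \le K\, d_X(x_1, x_2)$ for all $x_1, x_2 \in X$. Given a metric space $(M, d)$ and the set $\mathbb{P}(M)$ of probability measures on $M$, the Wasserstein metric between two probability distributions $\mu_1$ and $\mu_2$ in $\mathbb{P}(M)$ is defined as $W(\mu_1, \mu_2) = \inf_{j \in \Lambda} \iint j(s_1, s_2)\, d(s_1, s_2)\, \mathrm{d}s_2\, \mathrm{d}s_1$, where $\Lambda$ denotes the collection of all joint distributions $j$ on $M \times M$ with marginals $\mu_1$ and $\mu_2$.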

3 MEMB

In this section, we introduce our model-embedding model-based reinforcement learning algorithm (MEMB). Particularly, we optimize the following model long term return with the entropy regularization.


where we omit the regularization parameter of the entropy term in the following discussion to ease the exposition. Recall that optimizing entropy-regularized reinforcement learning is equivalent to minimizing the KL divergence between the two trajectory distributions in Section 2.2, the one induced by the policy and the one under the optimal policy. We now replace the true transition and reward by the learned dynamics model and learned reward model in both trajectory distributions, and then minimize the resulting KL divergence with respect to the policy. Using the backward view as in Levine (2018), we obtain the optimal policy. We defer the derivation to Appendix E.


In the policy improvement step, the optimal policy can be approximated by a parametric function. In MFRL, this is obtained by solving the KL-projection problem of Section 2.2 (Levine, 2018; Haarnoja et al., 2018). A straightforward approach is to optimize the Q function, the V function, and the policy using imaginary data from the rollout, which reduces to the methods of Luo et al. (2018), Janner et al. (2019), and many others. However, this approach, borrowed from MFRL, cannot leverage the model information. We leave our derivation and the discussion of the policy improvement to Section 3.3.

3.1 Model Learning

The transition dynamics and rewards can be modeled by non-linear function approximators as two independent regression tasks that share the same input but have different outputs. In particular, we train two independent deep neural networks, each with its own parameters, to represent the dynamics model and the reward model, respectively. In our analysis, we assume the learned model is not far from the true model, i.e., the distance between them is bounded (Section 4 measures it with the Wasserstein distance), which can be estimated in practice by cross validation.

To better represent the stochastic nature of the dynamic transitions and rewards, we apply the re-parameterization trick to both the dynamics model and the reward model, with input noises sampled from a Gaussian distribution. The dynamics model and reward model thus become deterministic functions of the state, the action, and the noise. In practice, we use neural networks to generate a mean and a variance separately for the transition model and the reward model. We then compute each output as the mean plus the standard deviation scaled by the sampled noise, i.e., $s' = \mu_f(s, a) + \sigma_f(s, a)\,\xi_f$ and $r = \mu_r(s, a) + \sigma_r(s, a)\,\xi_r$, respectively.
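The re-parameterized sampling step described above can be sketched as follows. The function `toy_dynamics_head` is a hypothetical stand-in for the neural network that outputs a mean and a standard deviation; its form and the numbers used are assumptions made purely for illustration.

```python
import random


def sample_next_state(mean, std, rng):
    """s' = mu(s, a) + sigma(s, a) * xi with xi ~ N(0, I), applied elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def toy_dynamics_head(state, action):
    """Hypothetical stand-in for the network outputs (mean, std)."""
    mean = [x + 0.1 * action for x in state]
    std = [0.05 for _ in state]
    return mean, std


rng = random.Random(0)
mean, std = toy_dynamics_head([0.0, 1.0], action=2.0)

# Stochastic prediction: the noise enters only through xi.
next_state = sample_next_state(mean, std, rng)

# With zero standard deviation the model collapses to its mean prediction.
deterministic = sample_next_state(mean, [0.0, 0.0], rng)
```

Because the randomness is isolated in the noise variable, the output is a differentiable function of the mean and standard deviation, which is what allows gradients to flow through the model in the policy update.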

We optimize the above two models by sampling data from the real data replay buffer and minimizing the mean squared error:


3.2 Value function learning

Although the Q function in equation (3) is computed with respect to the learned model, doing so would introduce model bias in practice. Moreover, in the policy update, this bias would cause an additional error in the policy gradient (roughly speaking, the gradient of the policy is weighted by the Q function). To avoid this model error, we update the Q and V functions using equation (3) with real transitions from the real data replay buffer. In particular, we minimize the following errors with respect to the parameters of the Q and V functions.


A straightforward way to incorporate the imaginary data is value expansion (Feinberg et al., 2018). However, in our ablation study we find that training with real data gives the best result, so we only briefly introduce value expansion here. The Q function and V function, each with its own parameters, can be updated by minimizing a new objective function with value expansion on an imaginary rollout, where only the initial tuple is sampled from the replay buffer of real-world data and the later transitions are sampled from the imaginary rollout of the model. Here $H$ is the number of time steps of value expansion using imaginary data, and the objective ranges over the training tuples of the rollout, starting from the initial training tuple. Note that when the expansion uses no imaginary steps, the objective reduces to the case where only real data is used.

3.3 Policy Learning

We now consider the policy improvement step, i.e., computing the optimal policy at each time step. One straightforward way is to optimize the following problem

as in MFRL but with the imaginary data set. This reduces to prior work (Luo et al., 2018; Chua et al., 2018; Janner et al., 2019). However, this approach cannot leverage the learned dynamics model and reward model. To incorporate the model information, we rewrite the policy improvement step as


In the following, we connect the dynamics model, reward model, and value function through the soft Bellman equation. Recall that we re-parameterized the dynamics model in Section 3.1. We now re-parameterize the policy as a deterministic function of the state and a noise variable. The soft Bellman equation can then be written in the following way.


To optimize (7) and leverage the gradient information of the model, we sample states from the real data replay buffer and take the gradient of the objective with respect to the policy parameters:


For the Gaussian noise case, we can further simplify the expression by plugging in the density function of the normal distribution. Clearly, we can unroll the Bellman equation for several steps and obtain a similar result, but here we focus on the one-step case. Equation (9) reveals an interesting connection between our algorithm and SVG. Notice that the transition from $s$ to $s'$ is sampled from the learned dynamics model, while SVG(1) only utilizes the real data. Thus, in our algorithm we can update the policy several times in each iteration to fully utilize the model, rather than using the real transition just once. Compared with the policy gradient step taken by the SVG(1) algorithm (Heess et al., 2015), equation (9) includes one extra term to maximize the entropy of the policy. We also drop the importance sampling weights by sampling from the current policy.
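The idea of differentiating through the policy, the learned dynamics, the learned reward, and the value function can be sketched on a scalar toy problem. This is not the paper's equation (9): the linear policy, the quadratic reward and value, the dynamics, and every constant below are hypothetical choices made so that the chain rule can be written by hand and checked against finite differences.

```python
import math
import random

GAMMA, ALPHA, SIGMA = 0.99, 0.2, 0.3  # assumed constants for the toy example


def policy(theta, s, eps):
    """Re-parameterized policy: a = theta * s + SIGMA * eps."""
    return theta * s + SIGMA * eps


def log_prob(eps):
    """log pi(a | s) written in terms of the noise eps (constant in theta)."""
    return -0.5 * eps * eps - math.log(SIGMA) - 0.5 * math.log(2 * math.pi)


def model(s, a, xi):
    """Hypothetical learned dynamics, re-parameterized by the noise xi."""
    return s + a + 0.1 * xi


def reward(s, a):
    """Hypothetical learned reward."""
    return -s * s - 0.1 * a * a


def value(s):
    """Hypothetical critic."""
    return -s * s


def objective(theta, batch):
    """Mean of r(s, a) + GAMMA * V(s') - ALPHA * log pi(a | s) over the batch."""
    total = 0.0
    for s, eps, xi in batch:
        a = policy(theta, s, eps)
        total += reward(s, a) + GAMMA * value(model(s, a, xi)) - ALPHA * log_prob(eps)
    return total / len(batch)


def pathwise_gradient(theta, batch):
    """Chain rule through reward, model, and value; da/dtheta = s."""
    total = 0.0
    for s, eps, xi in batch:
        a = policy(theta, s, eps)
        s_next = model(s, a, xi)
        total += (-0.2 * a + GAMMA * (-2.0 * s_next)) * s
    return total / len(batch)


rng = random.Random(0)
# Real states paired with fixed noise draws (common random numbers).
batch = [(rng.uniform(-1, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
theta, h = 0.5, 1e-5
grad = pathwise_gradient(theta, batch)
fd = (objective(theta + h, batch) - objective(theta - h, batch)) / (2 * h)
```

In this toy setting the policy's standard deviation is fixed, so the entropy term contributes nothing to the gradient; with a learned standard deviation it would add the extra entropy-maximizing term mentioned in the text. Since the starting states stay fixed while actions and next states are regenerated from the current parameters, the same batch can be reused for several gradient steps.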

3.4 MEMB algorithm

We summarize MEMB in Algorithm 1. At the beginning of each iteration, we train the dynamics model and reward model by minimizing the loss in (4). Then the agent interacts with the environment and stores the data in the real data replay buffer. The actor samples a state from the real data replay buffer and obtains the next state according to the learned dynamics model. This imaginary transition is stored in the imaginary replay buffer. We then train the Q function, the V function, and the policy according to the update rules in Section 3. Similar to other value-based RL algorithms, our algorithm uses two Q functions to further reduce the overestimation error, training them simultaneously on the same data but selecting only the minimum target in value updates (Fujimoto et al., 2018). We use a target function for V, like that in the deep Q-learning algorithm (Mnih et al., 2015), and update it with an exponential moving average. We train the policy using the gradient in (9). Note that our next state is sampled from the learned dynamics model, while SVG uses the true transition.

  Inputs: Real replay buffer, imaginary replay buffer, policy, value function V, target value function, two Q functions, dynamics model, and reward model, each with its own parameters
  for each iteration do
      1. Train the dynamics model and reward model
     Calculate the gradients of the model losses using (4) with the real replay buffer, and update the dynamics and reward model parameters
      2. Interact with the environment
     Sample an action from the policy, get the reward, and observe the next state
     Append the tuple into the real replay buffer
      3. Update the actor and critics several times
     for each imaginary rollout step do
        Sample an action from the policy, get the reward from the reward model, and sample the next state from the dynamics model
        Append the tuple into the imaginary replay buffer
     end for
     Calculate the gradients of the Q-function and V-function losses using (5) and (6)
     Calculate the policy gradient using (9)
     Update the policy, the V function, and the Q functions; update the target value function with Polyak averaging
  end for
Algorithm 1 MEMB
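Two of the stabilizers used in Algorithm 1, taking the minimum of two Q estimates and updating the target value function by Polyak averaging, can be sketched as below. The function names, the flat parameter lists, and the step size `tau` are illustrative assumptions rather than the paper's settings.

```python
def min_q_target(q1, q2):
    """Use the smaller of the two critic estimates to curb overestimation."""
    return min(q1, q2)


def polyak_update(target_params, online_params, tau=0.005):
    """target <- (1 - tau) * target + tau * online, applied elementwise."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]


# Hypothetical parameters; a large tau is used only to make the effect visible.
target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(3):
    target = polyak_update(target, online, tau=0.5)
```

After each update the target moves a fraction `tau` of the remaining distance toward the online parameters, so a small `tau` gives a slowly changing regression target.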

4 Theoretical Analysis

In this section, we provide a theoretical analysis of our algorithm. Recall that our algorithm samples a state from the real data replay buffer and then unrolls the trajectory for several steps. Using the imaginary data from this rollout and the value function trained on real data, we update the policy with the policy learning formulation of Section 3.3. In the following, we first give a general result on how accurate the model long-term return is, regardless of how many rollout steps are used in our algorithm. We then provide a finer analysis that takes the rollout procedure into account.

We first investigate the difference between the true long-term return and the model long-term return, which is induced by the model bias and by the distribution shift that arises when the updated policy encounters states not seen during model training. In particular, we measure the model bias by the Wasserstein distance between the true model and the learned model, and the distribution shift by the distance between the data-collecting policy and the intermediate policy produced during the update of the algorithm. For instance, in Algorithm 1, the former corresponds to the replay buffer of true data while the latter is the policy used in the imaginary rollout. We assume both quantities are bounded. Compared with the total variation distance used in Janner et al. (2019), the Wasserstein distance better represents how closely the learned model approximates the true model (Asadi et al., 2018). For instance, if two distributions have disjoint supports, the total variation is always 1 regardless of how far the supports are from each other. Such cases arise frequently in high-dimensional settings (Gulrajani et al., 2017).
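The contrast between total variation and the Wasserstein distance on disjoint supports can be checked numerically. The sketch below uses the one-dimensional, equal-sample-size special case of the Wasserstein-1 distance, where the optimal coupling simply matches sorted samples; the helper names and the data are assumptions made for the example.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical 1-D samples: mean gap of sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs samples of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def total_variation(p, q):
    """TV between two discrete distributions given as {outcome: probability}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# A point mass at 0 compared with a nearby and a faraway point mass.
w_near = wasserstein_1d([0.0, 0.0], [0.1, 0.1])
w_far = wasserstein_1d([0.0, 0.0], [5.0, 5.0])
tv_near = total_variation({0.0: 1.0}, {0.1: 1.0})
tv_far = total_variation({0.0: 1.0}, {5.0: 1.0})
```

Total variation reports the maximal value 1 in both cases, while the Wasserstein distance distinguishes a model that is slightly off from one that is far off.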

To bound the error of the long-term return, we follow the Lipschitz assumption on the model and policy in Asadi et al. (2018). In particular, a transition model belongs to the Lipschitz model class if its transition probabilities can be decomposed into a distribution over a set of deterministic functions, each mapping a state to a next state. The model is called Lipschitz if each of these deterministic functions is Lipschitz. This is easy to understand in the context of our work, where we re-parameterize the model function. For instance, if the transition is deterministic, i.e., given by a single function, then the Lipschitz constant of the model is the Lipschitz constant of that function. Similarly, a policy in the Lipschitz class is a distribution over deterministic Lipschitz functions from states to actions. If we use neural networks to approximate the model and policy, Lipschitz continuity means that the gradient with respect to the input is bounded by a constant. For simplicity, we assume that the distributions over these deterministic functions are independent of the state and action. A similar bound that includes such dependence can be proved, but with more involved assumptions and notation.

Note that the analysis in Asadi et al. (2018) only considers the error caused by the model bias and neglects the effect of the distribution shift of the policy. They therefore assume no distribution shift and do not need the Lipschitz assumption on the policy. In addition, we give a bound for the $k$-step rollout in Theorem 2, which is not covered by Asadi et al. (2018). In the following, we first give a general result bounding the difference between the true long-term return and the model long-term return, where we assume the reward model is known and focus mainly on the effect of the model bias and the distribution shift.

Theorem 1.

Let the true transition model and the learned transition model both be Lipschitz. We also assume policy and reward function are and Lipschitz respectively. Suppose , . Let and assume , then the difference between true return and model return is bounded as

Note that the above theorem is a generic result for model-based reinforcement learning. The analysis is based on running a full rollout of the learned model, which results in compounded error. In our algorithm, however, we start a rollout from a state drawn from the distribution induced by the previous policy and then run an imaginary rollout for $k$ steps using the current policy on the learned transition model. Following the terminology of Janner et al. (2019), we call this a $k$-step branched rollout. We analyze the long-term return of this branched rollout in the following fine-grained result.
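How a small one-step model error compounds with the rollout length can be seen in a deterministic scalar toy example. This illustrates the phenomenon only and is not the bound of either theorem: the linear dynamics, the Lipschitz constant `K`, and the one-step error `EPS` are assumptions made for the sketch.

```python
K, EPS = 1.1, 0.01  # assumed Lipschitz constant and one-step model error


def true_step(s):
    """Toy true dynamics, K-Lipschitz in the state."""
    return K * s


def model_step(s):
    """Toy learned dynamics: the true dynamics plus a constant error."""
    return K * s + EPS


def rollout(step, s0, k):
    """Apply a one-step map k times starting from s0."""
    s = s0
    for _ in range(k):
        s = step(s)
    return s


def rollout_gap(s0, k):
    """Absolute gap between model and true state after k steps."""
    return abs(rollout(model_step, s0, k) - rollout(true_step, s0, k))


horizons = (1, 5, 20)
gaps = [rollout_gap(1.0, k) for k in horizons]
# For this toy system the gap is EPS * (1 + K + ... + K^(k-1)).
closed_form = [EPS * sum(K ** i for i in range(k)) for k in horizons]
```

The gap grows geometrically with the number of imaginary steps when `K` exceeds 1, which is one way to see why short branched rollouts started from real states are easier to trust than full rollouts.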

Theorem 2.

Let the true transition model and learned transition model both be Lipschitz. We also assume policy and reward function are and Lipschitz respectively. Suppose , . If , then the difference between true return and branched return is bounded as

Clearly, on the right-hand side of the bound, some terms increase with the rollout length $k$ while others decrease. Thus there exists a best $k$, which depends on the discount factor and the other constants in the bound. In general, imaginary data from a short rollout can still be trusted. Recall that we can apply the rollout of the Bellman equation in two different places in our algorithm: (1) the policy update (equation (8)) and (2) the value function learning in Section 3.2. We therefore conduct an ablation study on both and find that imaginary data degrades performance in value function learning but improves learning considerably in the policy update, which also explains why MEMB performs much better than SVG. Another interesting consequence concerns the number of policy updates per iteration. In Algorithm 1, we can update the policy several times in each iteration using imaginary data from the model. If this number is too large, it causes a large distribution shift and degrades performance, so we keep it moderate in practice.

Figure 1: Performance of MEMB and other baselines on benchmark tasks. Panels: (a) InvertedPendulum, (b) HalfCheetah, (c) Reacher, (d) Hopper, (e) Swimmer, (f) Walker2d. The x-axis is the training step (epoch or step). Each experiment is run for five trials using five different random seeds and parameter initializations. For the simple task, InvertedPendulum, we limit training to 40 epochs. For the other, more complex tasks, the total number of training steps is 200K or 300K. The solid line is the mean of the average return. The shaded region represents the standard deviation. On HalfCheetah, Hopper, and Swimmer, MEMB outperforms the other baselines significantly. On Walker2d, SLBO is slightly better than MEMB, and both surpass the other algorithms. On Reacher, MEMB and SAC perform best.

5 Experimental results

In this section, we would like to answer two questions: (1) How does MEMB perform on benchmark reinforcement learning tasks compared with other state-of-the-art model-based and model-free reinforcement learning algorithms? (2) Should we use the imaginary data generated by the model embedding in the policy learning of Section 3.3, and how much imaginary data should we use in the value function update of Section 3.2? We leave the answer to the second question to the ablation study in Appendix B.

Environment: To answer these two questions, we run experiments in the MuJoCo simulation environment (Todorov et al., 2012): InvertedPendulum-v2, HalfCheetah-v2, Reacher-v2, Hopper-v2, Swimmer-v2, and Walker2d-v2. Each experiment is run for five trials using five different random seeds and parameter initializations. The details of the tasks and the experiment implementations can be found in Appendix C.

Comparison to state-of-the-art: We compare our algorithm with state-of-the-art model-free and model-based reinforcement learning algorithms in terms of sample complexity and performance. DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2018) are two model-free reinforcement learning algorithms for continuous-action tasks. SAC has shown reliable performance and robustness on several benchmark tasks. Our algorithm also builds on the maximum entropy reinforcement learning framework and benefits from incorporating the model in the policy update. The four model-based reinforcement learning baselines are SVG (Heess et al., 2015), SLBO (Luo et al., 2018), MBPO (Janner et al., 2019), and POPLIN (Wang and Ba, 2019). Note that SVG only computes the gradient along the real trajectory, while MEMB updates the policy several times using imaginary data generated from the model. At the same time, we avoid importance sampling by using data from the learned model. SLBO is a model-based algorithm with performance guarantees that applies TRPO (Schulman et al., 2015) to the data set generated from rollouts of the model. MBPO is similar in spirit but uses SAC as the learning algorithm on the imaginary data.

For fairness, we compare against the baselines without ensemble learning techniques (Chua et al., 2018), which are known to reduce model bias. We also do not use distributed RL to accelerate training. We believe that these techniques are orthogonal to our work and could be integrated in future work to further improve performance. We therefore compare only this pure version of MEMB with the other baselines. We also note that some recent works in MBRL modify the benchmarks to shorten the task horizons and simplify the modeling problem, while others assume that the true terminal condition is known to the algorithm (Wang et al., 2019). In contrast, we test our algorithm on the full-length tasks and make no assumptions about the terminal condition.

We present the experimental results in Figure 1. On the simple task, InvertedPendulum, MEMB reaches the asymptotic result in just 16 epochs. On HalfCheetah, MEMB's performance is around 8000 at 200k steps, while the performance of all other baselines is below 5300. On Reacher, MEMB and SAC perform similarly, and both are better than the other algorithms. On Hopper, the final performance of MEMB is around 3300. The runner-up is POPLIN, whose final performance is around 2300. On Swimmer, MEMB performs best. On Walker2d, SLBO is slightly better than MEMB. Both achieve an average return of 2900 at 300k timesteps.


Appendix A Related work

There is a plethora of work on MBRL. It can be classified into several categories depending on how the model is used, how the optimal policy is searched for, or the function approximator used for the dynamics model. Iterative Linear Quadratic-Gaussian (iLQG) (Tassa et al., 2012) assumes that the true dynamics are known to the agent. It approximates the dynamics with linear functions and the reward function with quadratic functions, so the problem can be converted into the classic LQR problem. In Guided Policy Search (Levine and Koltun, 2013; Levine and Abbeel, 2014; Finn et al., 2016), the system dynamics are modeled with a time-varying linear-Gaussian model. The policy is approximated with a neural network by minimizing the KL divergence between the iLQG solution and the network policy. A regularization term is added to the reward function to avoid over-confidence in the policy optimization. Nonlinear function approximators can be leveraged to model more complicated dynamics. Deisenroth and Rasmussen (2011) use Gaussian processes to model the dynamics of the environment. The policy gradient can be computed analytically along the training trajectory. However, this approach may suffer from the curse of dimensionality, which hinders its applicability to real problems. Recently, more and more works incorporate deep neural networks into MBRL.

Heess et al. (2015) model the dynamics and reward with neural networks and compute the gradient with the true data. Richards (2005) and Nagabandi et al. (2018) optimize the action sequence to maximize the expected planning reward under the learned dynamics model, and the policy is then fine-tuned with TRPO. Luo et al. (2018), Chua et al. (2018), Kurutach et al. (2018), and Janner et al. (2019) use the current policy to gather data from interaction with the environment and then learn the dynamics model. In the next step, the policy is improved (trained by a model-free reinforcement learning algorithm) with a large amount of imaginary data generated by the learned model. MEMB reduces to their approach if the policy is updated with a model-free algorithm. Janner et al. (2019) provide an error bound on the long-term return of the k-step rollout given that the total variation of the model bias and of the policy distribution shift are bounded. However, as we discussed, total variation may not be a good measure of the difference between the learned model and the true model, especially when the supports of the distributions are disjoint. Ensemble learning can also be applied to further reduce the model error. Asadi et al. (2018) leverage the Lipschitz model to analyze the model bias in the long-term return. Our work considers both the model bias and the distribution shift of the policy. In addition, we analyze the k-step rollout, while Asadi et al. (2018) only give a result for the full rollout.

Appendix B Ablation Study

B.1 How to utilize the imaginary data

In this section, we conduct an ablation study to understand how much imaginary data we should include in the algorithm. Recall that in our algorithm the model is embedded in the soft Bellman equation in the policy update step, which means we fully trust the model when computing the policy gradient. Utilizing the imaginary data in this way improves sample efficiency. To confirm this claim, we compare MEMB with SVG on the HalfCheetah task. Note that when we only utilize the real trajectory in the policy update, MEMB reduces to SVG (some differences remain, such as the update rule of the Q and V functions). To remove the effect of the entropy regularization, we vary $\alpha$ (the regularization parameter of the entropy) in MEMB. When $\alpha = 0$, it reduces to the non-regularized formulation. In that case, we add noise to the policy for exploration. We report the result in panel (a) of Figure 2. It is clear that using imaginary data in the policy update improves learning by a wide margin.

In Section 3.2, we train $Q$ and $V$ with the true data set. In the experiment, we also try the value expansion introduced in Feinberg et al. (2018), testing the algorithm with different values of the rollout horizon $H$. Our conclusion is that including imaginary data to train the value function in our algorithm hurts the performance, especially in complex tasks. We demonstrate the performance of MEMB with value expansion in panels (b) and (c) of Figure 2. We first test the algorithm on a simple task, Pendulum from OpenAI Gym (Brockman et al., 2016), and show the result in panel (b) of Figure 2. With a short horizon, MEMB converges to the optimal policy within several epochs. When we increase the value of $H$, the performance decreases. Then we evaluate the performance of value expansion on a complex task, HalfCheetah from the MuJoCo environment (Todorov et al., 2012), in panel (c) of Figure 2. In this task, value expansion at the tested horizons does not work at all. A likely reason is that the dynamics model of HalfCheetah introduces more significant model bias than that of the simple Pendulum task. Thus, training both the policy and the value function on the imaginary data set may cause a large error in the policy gradient.
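To make the role of the horizon $H$ concrete, the following is a minimal sketch of an H-step value expansion target in the spirit of Feinberg et al. (2018). It is not the authors' implementation: the function name, the callables `model_fn`, `reward_fn`, `policy_fn`, `value_fn`, and the omission of the entropy term are all simplifying assumptions made for illustration.

```python
def value_expansion_target(reward, next_state, model_fn, reward_fn,
                           policy_fn, value_fn, horizon, gamma=0.99):
    """Target for one real transition, extended by `horizon` imagined steps.

    With horizon=0 this is the usual one-step target r + gamma * V(s'),
    which uses real data only. Each extra step adds one imagined reward
    from the learned model (all callables here are hypothetical).
    """
    target, discount, state = reward, gamma, next_state
    for _ in range(horizon):
        action = policy_fn(state)
        target += discount * reward_fn(state, action)  # imagined reward
        state = model_fn(state, action)                # imagined transition
        discount *= gamma
    return target + discount * value_fn(state)         # bootstrap at the end


# Toy usage: constant reward of 1, a state that never changes, zero value.
toy = dict(model_fn=lambda s, a: s, reward_fn=lambda s, a: 1.0,
           policy_fn=lambda s: 0.0, value_fn=lambda s: 0.0, gamma=0.5)
target_h0 = value_expansion_target(1.0, 0.0, horizon=0, **toy)
target_h2 = value_expansion_target(1.0, 0.0, horizon=2, **toy)
```

Every imagined step passes through `model_fn` and `reward_fn`, so any bias in the learned model enters the value target directly, which is consistent with the degradation reported for larger horizons.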

(a) HalfCheetah (policy)
(b) Pendulum (value)
(c) HalfCheetah (value)
Figure 2: Ablation study. In (a) we study the effect of the imaginary data on policy learning. In (b) and (c) we study value function learning with different rollout lengths H. The x-axis is the training step and the y-axis is the reward.

B.2 Model Error

We measure the difference between the true model and the learned model (using the Wasserstein distance), i.e., the model error. In particular, we record the model learned by MEMB every 10 epochs and then randomly sample state-action pairs. Feeding each pair to the learned model, we obtain the predicted next state, which is compared with the true next state. The error is averaged over the state-action pairs. We do the same for the reward model. Results are obtained from five trials using five different random seeds and are reported in Figure 3.
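The measurement procedure described above can be sketched as follows. This is an illustrative toy, not the authors' evaluation code: `true_step` and `learned_step` are hypothetical stand-ins for the simulator and the learned dynamics model, and the sketch assumes deterministic predictions, in which case the Wasserstein distance between two point masses equals the distance between the points.

```python
import numpy as np


def true_step(states, actions):
    """Toy 'true' dynamics (hypothetical): a damped linear system."""
    return 0.9 * states + 0.1 * actions


def learned_step(states, actions):
    """Toy 'learned' model (hypothetical): slightly biased coefficients."""
    return 0.88 * states + 0.12 * actions


def mean_model_error(true_fn, model_fn, states, actions):
    """Average Euclidean distance between predicted and true next states."""
    true_next = true_fn(states, actions)
    pred_next = model_fn(states, actions)
    return float(np.mean(np.linalg.norm(pred_next - true_next, axis=-1)))


# Randomly sample state-action pairs and average the prediction error.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(1000, 3))
actions = rng.uniform(-1.0, 1.0, size=(1000, 3))
error = mean_model_error(true_step, learned_step, states, actions)
```

The same function can be reused for the reward model by passing reward functions whose outputs have a trailing axis of length one.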

(a) Transition model error
(b) Reward model error
Figure 3: Model error. We calculate the model error on the environment of HalfCheetah.

B.3 Plug The True Model Into MEMB

It is interesting to see the performance of MEMB when we plug the true model into the algorithm. Since the dynamics in MuJoCo are complicated, we test only a simple task, Pendulum. Intuitively, MEMB with the true model should perform better than MEMB with the learned one. We verify this in Figure 4.

Figure 4: MEMB with the true model vs. MEMB with the learned model on Pendulum.

B.4 Asymptotic Performance

We evaluate the asymptotic behavior of model-free RL (particularly SAC) and MEMB agents through 2000 epochs of training (20M steps) on four simulation environments. The SAC agent achieves in HalfCheetah, in Walker2d, in Hopper, in Swimmer. The MEMB agent achieves in HalfCheetah, in Walker2d, in Hopper, in Swimmer. In general, there is a small gap between the asymptotic performance of MBRL and MFRL. This is expected, since the learned model is not accurate.

Appendix C Environment Overview and Hyperparameter Setting

In this section, we provide an overview of the simulation environments in Table 1. The hyperparameter settings for each environment are shown in Table 2.


| Environment Name | Observation Space Dimension | Action Space Dimension | Horizon |
|---|---|---|---|
| Pendulum | 3 | 1 | 200 |
| InvertedPendulum | 4 | 1 | 1000 |
| HalfCheetah | 17 | 6 | 1000 |
| Hopper | 11 | 3 | 1000 |
| Walker2D | 17 | 6 | 1000 |
| Swimmer | 8 | 2 | 1000 |
| Reacher | 11 | 2 | 50 |
Table 1: The observation space dimension, action space dimension, and horizon for each simulation environment implemented in the experiment and ablation study.
Environments: Pendulum, InvertedPendulum, HalfCheetah, Hopper, Walker2D, Swimmer, Reacher
Epoch: 50, 40, 200, 300, 180, 50
Policy Learning Rate / Value Learning Rate: 0.0003, 0.001, 0.001, 0.0003, 0.0003, 0.0003
Learning Rate: 0.0003, 0.0001, 0.0001, 0.0001
Alpha value (in entropy term): 0.2, 0.1, 0.4, 0.2, 0.2
Environment steps per epoch: 1000, 50
Value and Policy Network Architecture / Network Architecture: (32,16), (256,128), (256,256), (256,128)
Train Actor-critic Times (): 5, 1, 5, 3, 5, 5
Table 2: The hyperparameters used to train the MEMB algorithm in each simulation environment. The numbers in the policy, value, and model network architectures indicate the number of hidden units in each layer of the MLP. The ReLU activation function is used in all architectures.

Appendix D Proof

In this section, we give the proof of the theorem in the main paper. To start with, we give the definition of the Wasserstein distance and its dual form, since we will use it frequently in the following discussion.

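Definition: Given a metric space $(M, d)$ and the set $\mathcal{P}(M)$ of probability measures on $M$, the Wasserstein metric between two probability distributions $\mu_1$ and $\mu_2$ in $\mathcal{P}(M)$ is defined as

$$W(\mu_1, \mu_2) = \inf_{j \in \Lambda} \iint j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1,$$

where $\Lambda$ denotes the collection of all joint distributions $j$ on $M \times M$ with marginals $\mu_1$ and $\mu_2$.

The dual representation is a special case of the duality theorem of Kantorovich and Rubinstein (Villani, 2010):

$$W(\mu_1, \mu_2) = \sup_{f:\, \|f\|_L \le 1} \int \big(\mu_1(s) - \mu_2(s)\big) f(s)\, ds,$$

where $\|f\|_L \le 1$ means the function $f$ is 1-Lipschitz.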
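As a small numerical companion to the definition, the sketch below computes the Wasserstein distance between two one-dimensional empirical samples of equal size. It relies on the known fact that, on the real line, an optimal coupling pairs the sorted samples. The function name `wasserstein_1d` is an illustrative choice, and the equal-size restriction is a simplification of this sketch.

```python
import numpy as np


def wasserstein_1d(x, y):
    """W1 distance between two equal-size 1-D empirical samples.

    On the real line an optimal coupling matches order statistics, so the
    distance is the mean absolute difference of the sorted samples.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))


# Shifting every sample by 0.5 moves the distribution by exactly 0.5.
samples = np.linspace(0.0, 1.0, 5)
shift_distance = wasserstein_1d(samples, samples + 0.5)
```

For samples of unequal size or in higher dimensions, the infimum over couplings generally has to be computed by solving an optimal transport problem instead.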

The first lemma is well known. It says that the Lipschitz constant of a composition of two functions is bounded by the product of their Lipschitz constants.

Lemma 1.

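Define three metric spaces $(M_1, d_1)$, $(M_2, d_2)$, $(M_3, d_3)$. Define Lipschitz functions $f: M_2 \to M_3$ and $g: M_1 \to M_2$ with constants $K_f$, $K_g$. Then the composition $h = f \circ g: M_1 \to M_3$ is Lipschitz with constant $K_h \le K_f K_g$.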
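The standard argument for this lemma fits in one line. The notation is assumed here for illustration: $g$ maps $(M_1, d_1)$ to $(M_2, d_2)$ with Lipschitz constant $K_g$, and $f$ maps $(M_2, d_2)$ to $(M_3, d_3)$ with Lipschitz constant $K_f$. Applying the Lipschitz property of $f$ first and then that of $g$ gives

```latex
d_3\big(f(g(x)),\, f(g(y))\big)
\;\le\; K_f \, d_2\big(g(x),\, g(y)\big)
\;\le\; K_f K_g \, d_1(x, y)
\qquad \text{for all } x, y \in M_1 .
```

so the composition $f \circ g$ is Lipschitz with a constant no larger than $K_f K_g$.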


Lemma 2.

Suppose we have two joint distribution , . We further assume that and . Then we have .


Using the triangle inequality, we have

Now we bound the first term and second term respectively. For the first term, according to the dual form of the Wasserstein distance, we have


Notice it is easy to verify that if is a 1-Lipschitz function w.r.t. , then for a fixed , (we denote it as ) is also a 1-Lipschitz function w.r.t . Thus , we have


We then bound the second term in the following way.


where (1) holds using the assumption is in the Lipschitz class. (2) uses the fact that is 1 Lipschitz, which holds using the similar argument in Lemma 1.

Combining the two pieces, we obtain the result. ∎

Lemma 3.

Define and . Suppose , , then we have



We define a reference probability distribution . Using the triangle inequality, we have

Thus we just need to bound the two terms on the right hand side.

For the first term, according to the definition of the Wasserstein distance, we have