Bidirectional Model-based Policy Optimization

by   Hang Lai, et al.

Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods dedicated to combating the model error, the potential of the single forward model is still limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound of return discrepancy, which shows the superiority of BMPO against the one using merely the forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of sample efficiency and asymptotic performance.



1 Introduction

Reinforcement learning (RL) methods are commonly divided into two categories: (1) model-free RL (MFRL), which directly learns a policy or value function from observation data, and (2) model-based RL (MBRL), which builds a predictive model of the environment and generates samples from it to derive a policy or a controller. While MFRL algorithms have achieved remarkable success across a range of areas (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2017), they need a large number of samples, which limits their applications mostly to simulators. Model-based methods, in contrast, have shown great potential in reducing the sample complexity (Deisenroth et al., 2013). However, the asymptotic performance of MBRL methods often lags behind their model-free counterparts due to model error, which is especially severe for multi-step rollouts because of the compounding error (Asadi et al., 2018b).

Several previous works have been proposed to alleviate compounding model error in different ways. For example, Whitney and Fergus (2018) and Asadi et al. (2019) introduced multi-step models which directly predict the consequence of executing a sequence of actions; Mishra et al. (2017) tried to divide trajectories into temporal segments and make predictions over segments instead of one timestep; Talvitie (2017), Kaiser et al. (2019) and Freeman et al. (2019) proposed to train the dynamics model on its own outputs, hoping the model is capable of giving accurate predictions in the region of its outputs. Wu et al. (2019) attributed long horizon rollouts mismatching to the one-step supervised learning and proposed to learn the model via model imitation. Besides the above innovations in model learning, Nguyen et al. and Xiao et al. (2019) developed methods of using adaptive rollout horizon according to the estimated compounding error, and Janner et al. (2019) proposed to use truncated short rollouts branched from real states. However, even with the compounding model error mitigated, it is almost impossible to obtain perfect multi-step predictions in practice, and thus the potential of the single forward model is still limited.

In this paper, we investigate how to maximally take advantage of the underlying dynamics of the environment to reduce the reliance on the accuracy of the forward model. When making decisions, human beings can not only predict future consequences of current behaviors forward but also imagine possible traces leading to a target goal backward. Based on this consideration, inspired by Edwards et al. (2018) and Goyal et al. (2018), we propose to learn a backward model in addition to the forward dynamics model to generate simulated rollouts backwards. To be specific, given a state and a previous action, the backward model predicts the preceding state, which means it can generate trajectories that terminate at one certain state.

Now, intuitively, if we utilize both the forward model and the backward model to sample bidirectionally from some encountered states for policy optimization, the model error will compound separately in different directions. As Figure 1 shows, when we generate trajectories with the same length using a forward model and bidirectional models respectively, the compounding error of the latter will be less than that of the former. More theoretical analysis is given in Section 5.

Figure 1: Comparison of compounding error of trajectories generated using only a forward model (top) and bidirectional models (bottom). States in real trajectories are shown in red, and states predicted by the forward (backward) model are shown in blue (green). The orange lines between real states and simulated states roughly represent the error, and when the rollout length increases, the error will compound. For bidirectional models, model error compounds $k_1$ steps backward and $k_2$ steps forward, while for the forward model, model error compounds $k_1 + k_2$ steps.

With this insight, we combine bidirectional models with the recent MBPO method (Janner et al., 2019) and propose a practical MBRL algorithm called Bidirectional Model-based Policy Optimization (BMPO). Besides, we develop a novel state sampling strategy to sample states for model rollouts and incorporate model predictive control (MPC) (Camacho and Alba, 2013) into our algorithm for further performance improvement. Furthermore, we theoretically prove that BMPO has a tighter bound of return discrepancy than MBPO, which verifies the advantage of bidirectional models in theory. We evaluate BMPO and previous state-of-the-art algorithms (Haarnoja et al., 2018; Janner et al., 2019) on a range of continuous control benchmark tasks. Experiments demonstrate that BMPO achieves higher sample efficiency and better asymptotic performance compared with prior model-based methods which only use forward models.

2 Related Work

Model-based reinforcement learning methods are expected to reduce sample complexity by learning a model as a simulator of the environment (Sutton and Barto, 2018). However, model error tends to cripple the performance of model-based approaches, which is also known as model-bias (Deisenroth and Rasmussen, 2011). As is discovered in previous works (Venkatraman et al., 2015; Talvitie, 2017; Asadi et al., 2018b), even small model error can severely degrade multi-step rollouts since the model error will compound, and the predicted states will move out of the region where the model has high accuracy after a few steps (Asadi et al., 2018a).

To mitigate the compounding error problem, multi-step models (Whitney and Fergus, 2018; Asadi et al., 2019) were proposed to predict the outcome of executing a sequence of actions directly. Segment-based models have also been considered to make stable and accurate predictions over temporal segments (Mishra et al., 2017). Alternatively, a model may also be trained on its own outputs, hoping that the model can perform reasonable prediction in the region of its outputs (Talvitie, 2017; Kaiser et al., 2019; Freeman et al., 2019). Besides, model imitation tried to learn the model by matching the multi-step rollout distributions via WGAN (Wu et al., 2019). Furthermore, adaptive rollout horizon techniques were investigated to stop rolling out based on compounding model error estimates (Nguyen et al.; Xiao et al., 2019). The MBPO algorithm (Janner et al., 2019) avoided the compounding error by generating short branched rollouts from real states. This paper builds on the MBPO framework and allows for extended rollouts with less compounding error by generating rollouts bidirectionally.

There are many model architecture choices, such as linear models (Parr et al., 2008; Sutton et al., 2012; Levine and Koltun, 2013; Levine and Abbeel, 2014; Kumar et al., 2016), nonparametric Gaussian processes (Kuss and Rasmussen, 2004; Ko et al., 2007; Deisenroth and Rasmussen, 2011), and neural networks (Draeger et al., 1995; Gal et al., 2016; Nagabandi et al., 2018). Model ensembles have been shown to be effective in preventing a policy or a controller from exploiting the inaccuracies of any single model (Rajeswaran et al., 2016; Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019). In this paper, we adopt an ensemble of probabilistic networks for both forward and backward models.

Model predictive control (MPC) (Camacho and Alba, 2013) is considered an efficient and robust approach to model-based planning, which utilizes the dynamics model to look forward several steps and optimizes the action sequence over a finite horizon. Nagabandi et al. (2018) combined a deterministic neural network model with a simple random-shooting MPC and acquired stable and plausible locomotion gaits in high-dimensional tasks. Chua et al. (2018) improved this method by capturing two kinds of uncertainty in modeling and replacing random shooting with the cross-entropy method (CEM). Wang and Ba (2019) further expanded MPC algorithms by applying policy networks to generate action sequence proposals. In this paper, we also incorporate MPC into our algorithm to refine the actions sampled from the policy when interacting with the environment.

Theoretical analysis for tabular- and linear-setting model-based methods has been conducted in prior works (Szita and Szepesvári, 2010; Dean et al., 2017). As for non-linear settings, Sun et al. (2018) provided a convergence analysis for the model-based optimal control framework. Luo et al. (2018) built a lower bound of the expected reward while enforcing a trust region on the policy. Janner et al. (2019), instead, derived a bound of discrepancy between returns in the real environment and those under the branched rollout scheme of MBPO in terms of rollout length, model error, and policy shift divergence. In this paper, we also provide a theoretical analysis of leveraging bidirectional models in MBPO and obtain a tighter bound.

To facilitate research in MBRL, Langlois et al. (2019) gathered a wide collection of MBRL algorithms and benchmarked them on a series of environments specially designed for MBRL. Beyond comparing different methods, they also raised several crucial challenges for future MBRL research.

This work is closely related to Goyal et al. (2018), which built a backtracking model of the environment and used it to generate traces leading to high value states. These traces are then used for imitation learning to improve the model-free learner. Our approach differs from theirs in the following two aspects: (1) their method is closer to model-free RL, as they mainly train the policy on real observation data and use the model-generated traces to fine-tune the policy, while our method is purely model-based RL and optimizes the policy with model-generated data; (2) they only build a backward model of the environment, whereas we build a forward model and a backward model at the same time.

3 Preliminaries

3.1 Reinforcement Learning

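A discrete-time Markov decision process (MDP) with infinite horizon is defined by the tuple $(\mathcal{S}, \mathcal{A}, \gamma, p, r, \rho_0)$. Here, $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. $\gamma \in (0, 1)$ is the discount factor. $p(s' \mid s, a)$ denotes the transition distribution function, and $r(s, a)$ denotes the reward function. $\rho_0(s)$ represents the initial state distribution. Let $\eta[\pi]$ denote the expected return, i.e., the sum of the discounted rewards along a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$. The goal of reinforcement learning is to find the optimal policy $\pi^*$ that maximizes the expected return:

$$\pi^* = \arg\max_{\pi} \eta[\pi] = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \qquad (1)$$

where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$.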

3.2 Dyna-style Algorithm

The Dyna-style algorithm (Sutton, 1991; Langlois et al., 2019) is one kind of MBRL method that uses a model for policy optimization. To be specific, in Dyna, a forward dynamics model $p_\theta(s' \mid s, a)$ is learned from agent interactions with the environment by executing the policy $\pi(a \mid s)$. The policy is then optimized using model-free algorithms on real data and model-generated data. To perform the Dyna-style algorithm using bidirectional models, we further learn a backward model $q_{\theta'}(s \mid s', a)$ and a backward policy $\tilde{\pi}_\phi(a \mid s')$. As such, forward rollouts and backward rollouts can be generated simultaneously for policy optimization.

Figure 2: Overview of the BMPO algorithm. When interacting with the environment, the agent uses the model to predict the future states over a finite horizon and then takes the first action of the sequence with the highest estimated simulated return. The transitions are stored in $\mathcal{D}_{env}$, where the state value is represented by the shade of color (value increases from light red to dark red). High value states are then sampled from $\mathcal{D}_{env}$ to perform bidirectional model rollouts, which are stored in $\mathcal{D}_{model}$ for policy optimization afterward.

4 BMPO Framework


In this section, we describe in detail how BMPO leverages bidirectional models to generate more plentiful simulated data for policy optimization. Although bidirectional models can be incorporated into almost any Dyna-style model-based algorithm (Sutton, 1991), we choose the Model-based Policy Optimization (MBPO) (Janner et al., 2019) algorithm as the framework backbone since it is the state-of-the-art MBRL method and is sufficiently general. The overall algorithm is given in Algorithm 1, and an overview of the algorithm architecture is shown in Figure 2.

4.1 Bidirectional Models Learning in BMPO

4.1.1 Forward Model

For the forward model, we use an ensemble of bootstrapped probabilistic dynamics models, which were first introduced in PETS (Chua et al., 2018) to capture two kinds of uncertainty. To be specific, individual probabilistic models can capture the aleatoric uncertainty arising from the inherent stochasticity of a system, and the bootstrapped ensemble aims to capture the epistemic uncertainty due to the lack of sufficient training data. Prior works (Chua et al., 2018; Janner et al., 2019) have demonstrated that the ensemble of probabilistic models is quite effective in MBRL, even when the ground truth dynamics are deterministic.

In detail, each model $p_\theta^i$ in the forward model ensemble is parameterized by a multi-layer neural network, and we denote the parameters of the forward model ensemble as $\theta$. Given a state $s$ and an action $a$, each probabilistic neural network outputs a Gaussian distribution with diagonal covariance over the next state: $p_\theta^i(s' \mid s, a) = \mathcal{N}\big(\mu_\theta^i(s, a), \Sigma_\theta^i(s, a)\big)$. (The model network also outputs a similar Gaussian distribution over the reward, which is omitted here for simplicity.) We train the model ensemble with different initializations and bootstrapped samples of the real environment data via maximum likelihood, and the corresponding loss of the forward model is

$$L(\theta) = \sum_{t=1}^{N} \big[\mu_\theta(s_t, a_t) - s_{t+1}\big]^\top \Sigma_\theta^{-1}(s_t, a_t) \big[\mu_\theta(s_t, a_t) - s_{t+1}\big] + \log \det \Sigma_\theta(s_t, a_t), \qquad (2)$$

where $\mu_\theta$ and $\Sigma_\theta$ are the mean and covariance respectively, and $N$ denotes the total number of transitions.
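As an illustration of maximum-likelihood training with a diagonal Gaussian output, the following minimal sketch evaluates the negative log-likelihood (up to a constant and a factor of 1/2) of observed next states under predicted means and log-variances. The function name `gaussian_nll` and the log-variance parameterization are assumptions made for this example and are not taken from the paper's implementation.

```python
import math

def gaussian_nll(means, log_vars, targets):
    """Diagonal-Gaussian negative log-likelihood, summed over transitions.

    means, log_vars, targets: lists of equal-length vectors (one per transition).
    Per dimension the contribution is (mu - s')^2 / var + log(var);
    the constant term and the factor 1/2 are dropped.
    """
    total = 0.0
    for mu, lv, s_next in zip(means, log_vars, targets):
        for m, l, s in zip(mu, lv, s_next):
            total += (m - s) ** 2 * math.exp(-l) + l
    return total

# Toy check: unit variance, a single coordinate is off by 1.
means = [[0.0, 1.0], [2.0, 3.0]]
log_vars = [[0.0, 0.0], [0.0, 0.0]]
targets = [[0.0, 1.0], [2.0, 4.0]]
toy_loss = gaussian_nll(means, log_vars, targets)
```

Predicting a larger variance shrinks the squared-error term but pays a log-variance penalty, which is what lets each network express aleatoric uncertainty.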

4.1.2 Backward Model

Besides the traditional forward model learning, we additionally learn a backward model for extended simulated data generation. In this way, we can mitigate the error caused by excessively exploiting the forward model. In other words, the backward model is used to reduce the burden of the forward model on generating data.

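Given the strong performance of the probabilistic network ensemble, we adopt the same parameterization for the backward model as for the forward model. In detail, another multi-layer neural network with parameters $\theta'$ is used to output a Gaussian prediction of the preceding state, i.e., $q_{\theta'}(s \mid s', a) = \mathcal{N}\big(\mu_{\theta'}(s', a), \Sigma_{\theta'}(s', a)\big)$, and the loss function of the backward model is

$$L(\theta') = \sum_{t=1}^{N} \big[\mu_{\theta'}(s_{t+1}, a_t) - s_t\big]^\top \Sigma_{\theta'}^{-1}(s_{t+1}, a_t) \big[\mu_{\theta'}(s_{t+1}, a_t) - s_t\big] + \log \det \Sigma_{\theta'}(s_{t+1}, a_t). \qquad (3)$$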

We note that although we can directly train the backward model with data sampled from the environment, it remains a problem how to choose the actions when we use the backward model to generate trajectories backwards. Recall that in the forward model situation, the actions are usually taken by the current policy or an exploration policy. Thus we need an additional backward policy to generate actions in the backward rollouts.

4.2 Policy Optimization in BMPO

4.2.1 Backward policy

To sample trajectories backwards, we need to design a backward policy $\tilde{\pi}_\phi(a \mid s')$ to take an action given the next state. There are many alternatives for backward policy design. In our experiments, we use two heuristic methods for comparison.

The first way is to train the backward policy by maximum likelihood estimation according to the data sampled in the environment. The corresponding loss function is

$$L_{MLE}(\phi) = -\sum_{t=0}^{N} \log \tilde{\pi}_\phi(a_t \mid s_{t+1}). \qquad (4)$$

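The second way is to use a conditional generative adversarial network (CGAN) (Mirza and Osindero, 2014) to generate an action $\tilde{a}$ conditioned on the next state $s'$, and the adversarial loss can be written as

$$L_{GAN}(\phi) = \mathbb{E}\big[\log D(a_t, s_{t+1})\big] + \mathbb{E}_{\tilde{a}_t \sim \tilde{\pi}_\phi(\cdot \mid s_{t+1})}\big[\log\big(1 - D(\tilde{a}_t, s_{t+1})\big)\big], \qquad (5)$$

where $D$ is an additional discriminator and the generator here is the backward policy $\tilde{\pi}_\phi$.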

Our goal is to make the backward rollouts resemble the real trajectory sampled by the current forward policy. Thus when training the backward policy, we only use the recent trajectories sampled by the agent in the real environment.

4.2.2 MBPO with Bidirectional Models

The original MBPO (Janner et al., 2019) algorithm iterates between three stages: data collection, model learning, and policy optimization. In the data collection stage, data is collected by executing the latest policy in the real environment and added to the replay buffer $\mathcal{D}_{env}$. In the model learning stage, an ensemble of bootstrapped forward models is trained using all the data in $\mathcal{D}_{env}$ through maximum likelihood. Then, in the policy optimization stage, short-length rollouts starting from states randomly chosen from $\mathcal{D}_{env}$ are generated using the forward model. Model-generated data is then used to train a policy (see line 10 in Algorithm 1) through Soft Actor-Critic (SAC) (Haarnoja et al., 2018) by minimizing the expected KL-divergence $\mathbb{E}_{s_t \sim \mathcal{D}}\big[D_{KL}\big(\pi \,\|\, \exp\{Q^\pi - V^\pi\}\big)\big]$.

Bidirectional models can be incorporated into MBPO naturally. Specifically, we iteratively collect data, train bidirectional models and backward policy, and then use them to generate short-length rollouts bidirectionally starting from some real states to optimize our policy. Besides, in Section 4.3 we design other components specifically for BMPO, i.e., state sampling and MPC, which are crucial for the improvement of performance.
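The bidirectional branching step can be sketched as follows. This is a minimal illustration under simplifying assumptions: the models and policies are plain deterministic callables, rewards are omitted, and the names `bidirectional_rollout`, `k_backward`, and `k_forward` are ours rather than the paper's.

```python
def bidirectional_rollout(start_state, forward_model, backward_model,
                          policy, backward_policy, k_backward, k_forward):
    """Return (s, a, s_next) transitions branched from one real state."""
    transitions = []

    # Backward branch: repeatedly predict the state preceding the current one.
    s_next = start_state
    for _ in range(k_backward):
        a = backward_policy(s_next)      # action assumed to lead into s_next
        s = backward_model(s_next, a)    # predicted preceding state
        transitions.append((s, a, s_next))
        s_next = s

    # Forward branch: ordinary model rollout from the same real state.
    s = start_state
    for _ in range(k_forward):
        a = policy(s)
        s_next = forward_model(s, a)
        transitions.append((s, a, s_next))
        s = s_next

    return transitions

# Toy 1-D system with dynamics s' = s + a and a constant action of 1.0.
forward_model = lambda s, a: s + a
backward_model = lambda s_next, a: s_next - a
policy = lambda s: 1.0
backward_policy = lambda s_next: 1.0

rollout = bidirectional_rollout(0.0, forward_model, backward_model,
                                policy, backward_policy,
                                k_backward=2, k_forward=3)
```

In this sketch every generated tuple is a forward-in-time transition, so the backward branch yields data on how to reach the start state and the forward branch yields data on how to act after it. Both kinds of transitions could then be added to the model buffer used for policy optimization.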

4.3 Design Decisions

4.3.1 State Sampling Strategy

To better exploit the bidirectional models, inspired by Goyal et al. (2018), we begin simulated branched rollouts bidirectionally from high value states instead of randomly chosen states in the environment replay buffer. In such a way, the agent could learn to reach high value states through backward rollouts, and also learn to act better after these states through forward rollouts.

An intuitive idea is to simply choose only the highest-value states to begin rollouts; however, the agent would then not know how to behave in low value states, because no data from such states would be generated. For better generalization and stability, we choose starting states from the environment replay buffer according to a Boltzmann distribution based on the value $V(s)$ estimated by SAC. Let $p(s)$ be the probability of a state $s$ being chosen as a starting state, then we have

$$p(s) = \frac{e^{\beta V(s)}}{\sum_{s' \in \mathcal{D}_{env}} e^{\beta V(s')}}, \qquad (6)$$

where $\beta$ is the hyperparameter to control the ratio of high value states.
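A Boltzmann sampler over value estimates can be sketched in a few lines. The example assumes the selection probability is proportional to `exp(beta * V(s))`, with `beta` as our name for the hyperparameter that controls the share of high value states. The function names and the toy values are illustrative only.

```python
import math
import random

def boltzmann_probs(values, beta):
    """Softmax over beta * value, shifted by the max for numerical stability."""
    shift = max(beta * v for v in values)
    weights = [math.exp(beta * v - shift) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_start_states(states, values, beta, n, rng):
    """Draw n starting states (with replacement) from the replay buffer."""
    return rng.choices(states, weights=boltzmann_probs(values, beta), k=n)

values = [0.0, 1.0, 2.0]                       # hypothetical value estimates
uniform = boltzmann_probs(values, beta=0.0)    # beta = 0: uniform sampling
peaked = boltzmann_probs(values, beta=5.0)     # large beta: mostly the best state
starts = sample_start_states(["s0", "s1", "s2"], values, 5.0, 4, random.Random(0))
```

With `beta = 0` every state is equally likely, and as `beta` grows the mass concentrates on the highest-value states while low value states keep a small nonzero probability.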

4.3.2 Incorporating MPC

In BMPO, when taking actions in the real environment, we adopt model predictive control (MPC) to exploit the forward model further. MPC is a common model-based planning method that uses the learned model to look forward and optimize the action sequence. In detail, at each timestep, candidate action sequences $a_{t:t+H-1}$ are generated, where $H$ is the planning horizon, and the corresponding trajectories are simulated in the learned model. Then the first action of the sequence that yields the highest accumulated rewards is selected. Here, we use a variant of traditional MPC: (1) we generate action sequences from the current policy, like Wang and Ba (2019), instead of a uniform distribution; (2) we add the value estimate of the last state in the simulated trajectory to the planning objective,

$$a_{t:t+H-1} = \arg\max_{a_{t:t+H-1} \sim \pi} \sum_{t'=t}^{t+H-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^{H} V(s_{t+H}). \qquad (7)$$

It is worth noting that we only use MPC in the training phase but not in the evaluation phase. In the training phase, MPC helps select actions to take in real environments, and the obtained high value states will be used to generate simulated rollouts for policy optimization. In the evaluation phase, the trained policy is evaluated directly.
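The action selection described in this section can be sketched as below. It is a simplified illustration, not the authors' implementation: the model, reward, and value functions are plain callables, and the names `mpc_action`, `num_candidates`, and `horizon` are assumptions made for the example.

```python
import itertools

def mpc_action(state, policy, model, reward_fn, value_fn,
               num_candidates, horizon, gamma):
    """Pick the first action of the best policy-proposed action sequence.

    Each candidate is scored by its discounted simulated rewards plus the
    discounted value estimate of the last simulated state.
    """
    best_action, best_score = None, float("-inf")
    for _ in range(num_candidates):
        s, score, first_action = state, 0.0, None
        for t in range(horizon):
            a = policy(s)                       # proposal from the current policy
            if t == 0:
                first_action = a
            score += (gamma ** t) * reward_fn(s, a)
            s = model(s, a)                     # simulate one step in the model
        score += (gamma ** horizon) * value_fn(s)
        if score > best_score:
            best_action, best_score = first_action, score
    return best_action

# Toy 1-D task: move toward the origin. The "policy" cycles through fixed
# proposals so that the example is deterministic.
proposals = itertools.cycle([1.0, -1.0, -1.0])
chosen = mpc_action(
    state=3.0,
    policy=lambda s: next(proposals),
    model=lambda s, a: s + a,
    reward_fn=lambda s, a: -abs(s + a),
    value_fn=lambda s: -abs(s),
    num_candidates=3, horizon=2, gamma=0.99,
)
```

In the toy run the three candidate sequences are (1, -1), (-1, 1), and (-1, -1). The last one brings the state closest to the origin, so its first action, -1.0, is returned.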

5 Theoretical Analysis

In this section, we theoretically analyze the discrepancy between the expected returns in the real environment and those under the bidirectional branched rollout scheme of BMPO. The proofs can be found in the appendix provided in the supplementary materials. When we use the bidirectional models to generate simulated rollouts from some encountered state, we assume that the length of backward rollouts is $k_1$, and the length of forward rollouts is $k_2$. Under such a scheme, we first consider a more general return discrepancy of two arbitrary bidirectional branched rollouts.

Lemma 5.1.

(Bidirectional Branched Rollout Returns Bound). Let , be the expected returns of two bidirectional branched rollouts. Out of the branch, we assume that the expected total variation distance between these two dynamics at each timestep is bounded as , similarly, the forward branch dynamic bounded as , and the backward branch dynamic bounded as . Likewise, the total variation distance of policy is bounded by , and , respectively. Then the returns are bounded as:


See Appendix A, Lemma A.1. ∎

Now, we can bound the discrepancy between the returns in the environment and in the branched rollouts of BMPO. Let $\eta[\pi]$ be the expected return of executing the current policy in the true dynamics, and $\eta^{branch}[\pi]$ be the expected return of executing the current policy in the model-generated branch and executing the old policy out of the branch. Then we can derive the discrepancy between them as follows.

Theorem 5.1.

(BMPO Return Discrepancy Upper Bound) Assume that the expected total variation distance between the learned forward model and the true dynamics at each timestep is bounded as . Similarly, the error of backward model is bounded as and the variation between current policy and the behavioral policy is bounded as . Assume , then under a branched rollouts scheme with a backward branch length of and a forward branch length of , we have


See Appendix A, Theorem A.1. ∎

We notice that in MBPO (Janner et al., 2019), the authors derived a similar return discrepancy bound (refer to Theorem 4.3 therein) with only one forward dynamics model. Setting the forward rollout length as $k_1 + k_2$, the bound is


By comparing the two bounds, it is evident that BMPO obtains a tighter upper bound of the return discrepancy by employing bidirectional models. The main difference is that in BMPO the coefficient of the model error is $\max(k_1, k_2)$, while in MBPO the coefficient is $k_1 + k_2$. This is easy to understand: as we generate the branched rollouts bidirectionally, the model error will compound only $k_1$ steps in the backward model or $k_2$ steps in the forward model, and thus at most $\max(k_1, k_2)$ steps, instead of $k_1 + k_2$ steps when using the forward model only.

6 Experiments

Our experiments aim to answer the following three questions: 1) How does BMPO perform compared with model-free RL methods and previous state-of-the-art model-based RL methods using only one forward model? 2) Does using the bidirectional models reduce compounding error compared with using one forward model? 3) What are the critical components of our overall algorithm?

Figure 3: Learning curves of BMPO (ours) and four baselines on different continuous control environments. The solid lines indicate the mean and shaded areas indicate the standard error of six trials over different random seeds. Each trial is evaluated every 1000 environment steps (200 steps for Pendulum), where each evaluation reports the average return over ten episodes. The dashed reference lines are the asymptotic performance of SAC.

6.1 Comparison with State-of-the-Arts

In this section, we compare our method with previous state-of-the-art baselines. Specifically, for model-based methods, we compare against MBPO (Janner et al., 2019), as our method builds on top of it, and against SLBO (Luo et al., 2018) and PETS (Chua et al., 2018), both of which perform well in the model-based benchmarking test (Langlois et al., 2019). For model-free methods, we compare to Soft Actor-Critic (SAC) (Haarnoja et al., 2018), which has proven effective in continuous control tasks. For the MBPO baseline, we only generate $k_2$-step forward rollouts, where $k_2$ is set to the default value used in the original MBPO paper. We do not report the result of MBPO using rollouts of $k_1 + k_2$ steps since using only $k_2$ steps achieves better performance. Notice that we do not include the backtracking model method (Goyal et al., 2018) for comparison since it is not a purely model-based method, and its performance improvement is limited compared with SAC.

We evaluate all the algorithms in six environments in total using OpenAI Gym (Brockman et al., 2016). Among them, Pendulum is one traditional control task, and Hopper, Walker2D, Ant are three complex MuJoCo tasks (Todorov et al., 2012). We additionally add two variants of MuJoCo tasks without early termination states, denoted as Hopper-NT and Walker2d-NT, which have been released as benchmarking environments for MBRL (Langlois et al., 2019). More details about the environments can be found in Appendix C.

The comparison results are shown in Figure 3. In different locomotion tasks, our method BMPO learns faster and has better asymptotic performance than previous model-based algorithms using only the forward model, which empirically demonstrates the advantage of bidirectional models. This gap is even more significant in benchmarking environments Hopper-NT and Walker2d-NT. One possible reason is that in the environments without early termination, the space of encountered states is larger, and thus the introduced state sampling strategy is more effective.

6.2 Model Error

Figure 4: (a) Validation loss of forward/backward models. (b) Compounding error of the forward model and the bidirectional models in environment Pendulum (left) and Hopper-NT (right).

In this section, we first plot the validation loss of the models in Figure 4a, which can roughly represent the single-step prediction error. As the figure shows, the difficulty of learning forward/backward models is task-dependent. Then, we investigate the compounding error of traditional forward model and our bidirectional models when generating the same length simulated trajectories. We calculate the multi-step prediction error for evaluation. A similar validation error is also used in Nagabandi et al. (2018). More specifically, assume a real trajectory of length is denoted as . For the forward model, we sample from and generate forward rollouts where and for , . Then the corresponding compounding error is defined as

Similarly, for the bidirectional models, suppose we sample from and generate both forward and backward rollouts of steps, compounding error is defined as

where and for , and .

Figure 5: Design evaluation of our algorithm BMPO on task Hopper-NT. (a) Backward Policy: comparison of two heuristic design choices for the backward policy loss, MLE loss and GAN loss. (b) Ablations: ablation study of three crucial components, namely the forward model, the backward model, and MPC. (c) Sensitivity of our algorithm to the hyperparameter $\beta$ in Equation 6. (d) Backward Length: average return of the last 10 epochs over six trials with different backward rollout lengths $k_1$ and fixed forward length $k_2$.

We conduct experiments with different rollout lengths and plot the results in Figure 4b. We observe that employing bidirectional models can significantly reduce the compounding model error, which is consistent with our intuition and theoretical analysis.
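The effect can be reproduced on a toy problem. The sketch below is an illustration under strong assumptions: a 1-D system, models with a constant additive bias per step, and compounding error measured as the mean absolute deviation between predicted and real states. That error measure and all names in the code are chosen for this example and are not the paper's exact metric.

```python
def forward_only_error(real, model, actions):
    """Roll the forward model from the first real state over all steps."""
    s_hat, total = real[0], 0.0
    for i, a in enumerate(actions):
        s_hat = model(s_hat, a)
        total += abs(s_hat - real[i + 1])
    return total / len(actions)

def bidirectional_error(real, fwd_model, bwd_model, actions):
    """Roll forward and backward from the middle real state."""
    mid = len(actions) // 2
    total = 0.0
    s_hat = real[mid]
    for i in range(mid, len(actions)):        # forward half
        s_hat = fwd_model(s_hat, actions[i])
        total += abs(s_hat - real[i + 1])
    s_hat = real[mid]
    for i in range(mid - 1, -1, -1):          # backward half
        s_hat = bwd_model(s_hat, actions[i])
        total += abs(s_hat - real[i])
    return total / len(actions)

# Real trajectory of s' = s + a with a = 1; both models are off by `bias` per step.
actions = [1.0] * 8
real = [float(i) for i in range(9)]
bias = 0.5
fwd = lambda s, a: s + a + bias
bwd = lambda s_next, a: s_next - a - bias

err_fwd = forward_only_error(real, fwd, actions)        # 8 steps in one direction
err_bi = bidirectional_error(real, fwd, bwd, actions)   # 4 steps each way
```

Both schemes predict the same eight states, but the bidirectional rollout never compounds the per-step bias over more than four steps, so its average error is smaller in this toy setting.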

6.3 Design Evaluation

In this section, we evaluate the importance of each design decision for our overall algorithm. The results are demonstrated in Figure 5.

6.3.1 Backward Policy Design

In Figure 5(a), we compare two heuristic design choices for the backward policy loss: MLE loss and GAN loss, which are described in detail in Section 4.2.1. From the comparison, we notice a slight performance degradation in the late iterations of training when adopting the GAN loss. One possible reason may be the training instability and mode collapse of GANs (Goodfellow, 2016). More advanced methods to stabilize GAN training could be incorporated into backward policy training, which we leave as future work. For all other comparisons in this paper, we use the MLE loss for the backward policy by default.

6.3.2 Ablation Study

We further carry out an ablation study to characterize the importance of three main components of our algorithm: 1) no backward model to sample trajectories (Forward Only); 2) no forward model to sample trajectories (Backward Only); 3) no MPC when interacting with the environment (No MPC). The results are shown in Figure 5(b). We find that ablating MPC decreases the performance slightly, and using only the forward model or only the backward model causes a more severe performance drop. This further reveals the superiority of using bidirectional models. Notice that the performance without the backward model is even worse than vanilla MBPO, which means that sampling from high value states does not benefit the traditional forward model, since the agent may only focus on how to act at high value states without learning how to reach these states.

6.3.3 Hyperparameter Study

In this section, we investigate the sensitivity of BMPO to the hyperparameters. First, we test BMPO with different values of the hyperparameter $\beta$ used in Equation 6. We vary $\beta$ from $0$ to a relatively large number, where $\beta = 0$ means random sampling, and a larger $\beta$ means focusing more on high value states. The results are shown in Figure 5(c). We observe that, up to a certain level, increasing $\beta$ yields better performance, while too large a $\beta$ can degrade the performance. This may be because, with too large a $\beta$, low value states are almost never chosen, and the value estimation at the beginning of training is difficult. Nevertheless, our algorithm outperforms the baseline for all tested values of $\beta$, which indicates its robustness to $\beta$.

Although linearly increasing the rollout length has been found to achieve excellent performance (Janner et al., 2019), how to choose the backward rollout length $k_1$ relative to the forward length $k_2$ remains an open question. We fix $k_2$ to be the same as Janner et al. (2019) and vary $k_1$ from 0 to . As is shown in Figure 5(d), setting provides the best result, while backward lengths that are too short or too long are both detrimental. In practice, we use in most environments except Ant, where the backward model error is too large compared with the forward model error. All hyperparameter settings are provided in Appendix D.

7 Conclusion

In this work, we present a novel model-based reinforcement learning method, namely bidirectional model-based policy optimization (BMPO), using the newly introduced bidirectional models. We theoretically prove the advantage of bidirectional models by deriving a tighter return discrepancy upper bound of RL objective compared with only one forward model. Experimental results show that BMPO achieves better asymptotic performance and higher sample efficiency than previous state-of-the-art model-based methods on several benchmark continuous control tasks. For future work, we will investigate the usage of bidirectional models in other model-based RL frameworks and study how to leverage the bidirectional models better.


The corresponding author, Weinan Zhang, acknowledges the support of the "New Generation of AI 2030" Major Project (2018AAA0100900) and NSFC (61702327, 61772333, 61632017).


  • K. Asadi, E. Cater, D. Misra, and M. L. Littman (2018a) Towards a simple approach to multi-step model-based reinforcement learning. arXiv preprint arXiv:1811.00128. Cited by: §2.
  • K. Asadi, D. Misra, S. Kim, and M. L. Littman (2019) Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320. Cited by: §1, §2.
  • K. Asadi, D. Misra, and M. L. Littman (2018b) Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193. Cited by: §1, §2.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §6.1.
  • E. F. Camacho and C. B. Alba (2013) Model predictive control. Springer Science & Business Media. Cited by: §1, §2.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: §2, §2, §4.1.1, §6.1.
  • S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu (2017) On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688. Cited by: §2.
  • M. P. Deisenroth, G. Neumann, J. Peters, et al. (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §1.
  • M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472. Cited by: §2, §2.
  • A. Draeger, S. Engell, and H. Ranke (1995) Model predictive control using neural networks. IEEE Control Systems Magazine 15 (5), pp. 61–66. Cited by: §2.
  • A. D. Edwards, L. Downs, and J. C. Davidson (2018) Forward-backward reinforcement learning. arXiv preprint arXiv:1803.10227. Cited by: §1.
  • C. D. Freeman, L. Metz, and D. Ha (2019) Learning to predict without looking ahead: world models without forward prediction. ArXiv abs/1910.13038. Cited by: §1, §2.
  • Y. Gal, R. McAllister, and C. E. Rasmussen (2016) Improving pilco with bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, Vol. 4. Cited by: §2.
  • I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §6.3.1.
  • A. Goyal, P. Brakel, W. Fedus, S. Singhal, T. Lillicrap, S. Levine, H. Larochelle, and Y. Bengio (2018) Recall traces: backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379. Cited by: §1, §2, §4.3.1, §6.1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1, §4.2.2, §6.1.
  • M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253. Cited by: Lemma B.2, Lemma B.3, Table 3, §1, §1, §2, §2, §2, §4.1.1, §4.2.2, §4, §5, §6.1, §6.3.3.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. (2019) Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374. Cited by: §1, §2.
  • J. Ko, D. J. Klein, D. Fox, and D. Haehnel (2007) Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Proceedings 2007 ieee international conference on robotics and automation, pp. 742–747. Cited by: §2.
  • V. Kumar, E. Todorov, and S. Levine (2016) Optimal control with learned local models: application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 378–383. Cited by: §2.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592. Cited by: §2.
  • M. Kuss and C. E. Rasmussen (2004) Gaussian processes in reinforcement learning. In Advances in neural information processing systems, pp. 751–758. Cited by: §2.
  • E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: Appendix C, §2, §3.2, §6.1, §6.1.
  • S. Levine and P. Abbeel (2014) Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079. Cited by: §2.
  • S. Levine and V. Koltun (2013) Guided policy search. In International Conference on Machine Learning, pp. 1–9. Cited by: §2.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • Y. Luo, H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma (2018) Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858. Cited by: §2, §6.1.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §4.2.1.
  • N. Mishra, P. Abbeel, and I. Mordatch (2017) Prediction and control with temporal segment models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2459–2468. Cited by: §1, §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §2, §2, §6.2.
  • N. M. Nguyen, A. Singh, and K. Tran. Improving model-based rl with adaptive rollout using uncertainty estimation. Cited by: §1, §2.
  • R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L. Littman (2008) An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pp. 752–759. Cited by: §2.
  • A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine (2016) Epopt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
  • W. Sun, G. J. Gordon, B. Boots, and J. Bagnell (2018) Dual policy iteration. In Advances in Neural Information Processing Systems, pp. 7059–7069. Cited by: §2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.
  • R. S. Sutton, C. Szepesvári, A. Geramifard, and M. P. Bowling (2012) Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285. Cited by: §2.
  • R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin 2 (4), pp. 160–163. Cited by: §3.2, §4.
  • I. Szita and C. Szepesvári (2010) Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038. Cited by: §2.
  • E. Talvitie (2017) Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1, §2, §2.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.1.
  • A. Venkatraman, M. Hebert, and J. A. Bagnell (2015) Improving multi-step prediction of learned time series models. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §2.
  • T. Wang and J. Ba (2019) Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649. Cited by: §2, §4.3.2.
  • W. Whitney and R. Fergus (2018) Understanding the asymptotic performance of model-based rl methods. Cited by: §1, §2.
  • Y. Wu, T. Fan, P. J. Ramadge, and H. Su (2019) Model imitation for model-based reinforcement learning. arXiv preprint arXiv:1909.11821. Cited by: §1, §2.
  • C. Xiao, Y. Wu, C. Ma, D. Schuurmans, and M. Müller (2019) Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206. Cited by: §1, §2.

Appendix A BMPO Performance Guarantee

Figure 6: Bidirectional rollout.
Lemma A.1.

(Bidirectional Branched Rollout Returns Bound). Let , be the expected returns of two bidirectional branched rollouts. Out of the branch, we assume that the expected total variation distance between these two dynamics at each timestep is bounded as , similarly, the forward branch dynamic bounded as , and the backward branch dynamic bounded as . Likewise, the total variation distance of policy is bounded by , and , respectively (as Figure 6 shows). Then the returns are bounded as


Lemma B.1 and Lemma B.2 imply that the state marginal error at each timestep can be bounded by the divergence at the current timestep plus the state marginal error at the next (Lemma B.1) or previous (Lemma B.2) timestep. By employing Lemma B.3, we can convert the (s,a) joint distribution to marginal distributions. Thus, letting and denote the state-action marginals, we can write:
For :


Similarly, for :


And for :


We can now bound the difference in occupancy measures by averaging the state marginal error over time, weighted by the discount:

Multiplying this bound by to convert the occupancy measure difference into a returns bound completes the proof. ∎
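For reference, the conversion from an occupancy-measure difference to a returns bound follows a standard argument, sketched below. The symbols $\rho_i$, $p_i^{t}$, and $r_{\max}$ are introduced here for illustration and are assumptions rather than the lemma's own notation: $\rho_i$ is a normalized discounted occupancy measure, $p_i^{t}$ is the state-action marginal at timestep $t$, and rewards are assumed bounded in absolute value by $r_{\max}$. The time indexing of the backward branch is not modeled.

```latex
\begin{aligned}
\rho_i(s,a) &= (1-\gamma) \sum_{t} \gamma^{t}\, p_i^{t}(s,a),
\qquad
\eta_i = \frac{1}{1-\gamma} \sum_{s,a} \rho_i(s,a)\, r(s,a), \\
|\eta_1 - \eta_2|
  &\le \frac{r_{\max}}{1-\gamma} \sum_{s,a} \bigl| \rho_1(s,a) - \rho_2(s,a) \bigr|
   = \frac{2\, r_{\max}}{1-\gamma}\, D_{TV}(\rho_1 \,\|\, \rho_2) \\
  &\le 2\, r_{\max} \sum_{t} \gamma^{t}\, D_{TV}\bigl(p_1^{t}(s,a) \,\|\, p_2^{t}(s,a)\bigr).
\end{aligned}
```

The last step uses the triangle inequality on the discounted sum, so any per-timestep bound on the state-action marginal distance translates directly into a bound on the return difference.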

Theorem A.1.

(BMPO Return Discrepancy Upper Bound) Assume that the expected total variation distance between the learned forward model and the true dynamics at each timestep is bounded as . Similarly, the error of backward model is bounded as and the variation between current policy and the behavioral policy is bounded as . Assume and , then under a branched rollouts scheme with a backward branch length of and a forward branch length of , the returns are bounded as:


We apply Lemma A.1. Outside the branch, the only error comes from executing the old policy , so we set and . Inside the branched rollout, we execute the current policy, so the only error comes from simulating with the learned models; we set and . Plugging these into Lemma A.1, we get:


Appendix B Useful Lemmas

In this section, we give proofs of the lemmas used before.

Lemma B.1.

(Backward State Marginal Distance Bound). Suppose the expected total variation distance between two backward dynamics is bounded as and the backward policy divergences are bounded as . Then the state marginal distance at timestep can be bounded as:


Let the total variation distance of state at time be denoted as .

Lemma B.2.

(Forward State Marginal Distance Bound) ((Janner et al., 2019), Lemma B.2, B.3). Suppose the expected TVD between two forward dynamics is bounded as and the forward policy divergences are bounded as . Then the state marginal distance at timestep can be bounded as:

Lemma B.3.

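(TVD of Joint Distributions) ((Janner et al., 2019), Lemma B.1). Suppose we have two distributions $p_1(x, y) = p_1(x)\, p_1(y \mid x)$ and $p_2(x, y) = p_2(x)\, p_2(y \mid x)$. We can bound the total variation distance of the joint distributions as:

$$D_{TV}(p_1(x, y) \,\|\, p_2(x, y)) \le D_{TV}(p_1(x) \,\|\, p_2(x)) + \max_x D_{TV}(p_1(y \mid x) \,\|\, p_2(y \mid x))$$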
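As a sanity check, the inequality of Lemma B.1 in Janner et al. (2019), which states that the total variation distance between two joint distributions is at most the distance between their marginals plus the largest distance between their conditionals, can be verified numerically on random discrete distributions. The sketch below is illustrative only; the helper names `tvd`, `random_joint`, and `check_once` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def random_joint(nx, ny):
    """Random marginal p(x) and conditional p(y|x) with shape (nx, ny)."""
    px = rng.dirichlet(np.ones(nx))
    py_given_x = rng.dirichlet(np.ones(ny), size=nx)
    return px, py_given_x

def check_once(nx=4, ny=5):
    px1, c1 = random_joint(nx, ny)
    px2, c2 = random_joint(nx, ny)
    joint1 = px1[:, None] * c1  # p1(x, y) = p1(x) p1(y|x)
    joint2 = px2[:, None] * c2
    lhs = tvd(joint1, joint2)
    rhs = tvd(px1, px2) + max(tvd(c1[i], c2[i]) for i in range(nx))
    return lhs, rhs

results = [check_once() for _ in range(1000)]
# Small tolerance guards against floating-point rounding.
print(all(lhs <= rhs + 1e-12 for lhs, rhs in results))
```

A numerical check of this kind does not replace the proof, but it is a quick way to catch a mis-stated bound when adapting the lemma to new settings.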


Appendix C Environment Settings

In this section, we compare the environment settings used in our experiments. Among them, 'Hopper-NT' and 'Walker2d-NT' refer to the settings in Langlois et al. (2019), and the others are the standard versions.

| Environment Name | Observation Space Dimension | Action Space Dimension | Steps Per Epoch |
| --- | --- | --- | --- |
| Pendulum | 3 | 1 | 200 |
| Hopper | 11 | 3 | 1000 |
| Hopper-NT | 11 | 3 | 1000 |
| Walker2d | 17 | 6 | 1000 |
| Walker2d-NT | 17 | 6 | 1000 |
| Ant | 27 | 8 | 1000 |

Table 1: Observation and action dimension, and task horizon of the environments used in our experiments.
Environment Name Reward Function Termination States Condition
Pendulum None
Hopper or
Hopper-NT None
Walker2d or or
Walker2d-NT None
Ant or
Table 2: Reward function and termination states condition of the environments used in our experiments. denotes the joint angle, denotes the position in x direction, denotes the action control input, and denotes the height.

Appendix D Hyperparameters

Environment Name MPC Horizon Epochs
6 20
6 100
0.01 6 100
Walker2d 1 1
1 200
Walker2d-NT 1 1 0.01 0 200
Ant 1
0.003 0 300
Table 3: Hyperparameter settings for BMPO. over epochs means clipped linear function, i.e. for epoch e, . Other hyperparameters not listed here are the same as those in MBPO (Janner et al., 2019).
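A clipped linear schedule of the kind mentioned in the caption can be sketched as follows. This assumes the convention used in MBPO, where a value is interpolated linearly between two epochs and held constant outside that range; the function name `clipped_linear` and the example numbers are hypothetical and are not taken from Table 3.

```python
def clipped_linear(epoch, start_epoch, end_epoch, start_value, end_value):
    """Interpolate linearly from start_value to end_value between the two
    epochs, holding the endpoint values outside that range."""
    if end_epoch <= start_epoch:
        return float(end_value)
    frac = (epoch - start_epoch) / (end_epoch - start_epoch)
    frac = min(max(frac, 0.0), 1.0)  # clip progress to [0, 1]
    return start_value + frac * (end_value - start_value)

# Illustrative schedule: rollout length grows from 1 to 15 over epochs 20 to 100.
for e in (0, 20, 60, 100, 200):
    print(e, int(clipped_linear(e, 20, 100, 1, 15)))
```

Truncating the result to an integer, as in the loop above, is one way to obtain a usable rollout length; rounding would be an equally reasonable choice.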

Appendix E Computing Infrastructure

In this section, we describe the computing infrastructure used to run all the experiments (Table 4). We also compare the computation time of our algorithm with that of the MBPO baseline in Table 5.

| CPU | GPU | Memory |
| --- | --- | --- |
| AMD 2990WX | RTX 2080 Ti ×4 | 256 GB |

Table 4: Computing infrastructure.
| Method | Pendulum | Hopper | Hopper-NT | Walker2d | Walker2d-NT | Ant |
| --- | --- | --- | --- | --- | --- | --- |
| BMPO | 0.49 | 16.34 | 17.98 | 27.24 | 27.34 | 71.51 |
| MBPO | 0.41 | 10.33 | 11.12 | 22.26 | 21.32 | 57.42 |

Table 5: Computation time in hours for one experiment.
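The relative cost of BMPO can be read off Table 5 by dividing the two rows. The short script below does this arithmetic using the hours reported in the table; the variable names are hypothetical.

```python
# (BMPO hours, MBPO hours) per environment, copied from Table 5.
hours = {
    "Pendulum": (0.49, 0.41),
    "Hopper": (16.34, 10.33),
    "Hopper-NT": (17.98, 11.12),
    "Walker2d": (27.24, 22.26),
    "Walker2d-NT": (27.34, 21.32),
    "Ant": (71.51, 57.42),
}

# Ratio of BMPO to MBPO wall-clock time for one experiment.
overhead = {env: bmpo / mbpo for env, (bmpo, mbpo) in hours.items()}

for env, ratio in overhead.items():
    print(f"{env}: {ratio:.2f}x")
```

On these numbers BMPO takes roughly 1.2 to 1.6 times as long as MBPO per experiment, with the largest relative overhead on the two Hopper variants.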