1 Introduction
Reinforcement learning (RL) methods are commonly divided into two categories: (1) model-free RL (MFRL), which directly learns a policy or value function from observation data, and (2) model-based RL (MBRL), which builds a predictive model of the environment and generates samples from it to derive a policy or a controller. While MFRL algorithms have achieved remarkable success in a wide range of areas (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2017), they need a large number of samples, which limits their applications mostly to simulators. Model-based methods, in contrast, have shown great potential in reducing the sample complexity (Deisenroth et al., 2013). However, the asymptotic performance of MBRL methods often lags behind their model-free counterparts due to model error, which is especially severe for multi-step rollouts because of the compounding error (Asadi et al., 2018b).
Several previous works have been proposed to alleviate compounding model error in different ways. For example, Whitney and Fergus (2018) and Asadi et al. (2019) introduced multi-step models which directly predict the consequence of executing a sequence of actions; Mishra et al. (2017) divided trajectories into temporal segments and made predictions over segments instead of one timestep; Talvitie (2017), Kaiser et al. (2019) and Freeman et al. (2019) proposed to train the dynamics model on its own outputs, hoping the model is capable of giving accurate predictions in the region of its outputs. Wu et al. (2019) attributed the mismatch of long-horizon rollouts to one-step supervised learning and proposed to learn the model via model imitation. Besides the above innovations in model learning, Nguyen et al. and Xiao et al. (2019) developed methods that adapt the rollout horizon according to the estimated compounding error, and Janner et al. (2019) proposed to use truncated short rollouts branched from real states. However, even with the compounding model error mitigated, it is almost impossible to obtain perfect multi-step predictions in practice, and thus the potential of a single forward model is still limited.
In this paper, we investigate how to take maximal advantage of the underlying dynamics of the environment to reduce the reliance on the accuracy of the forward model. When making decisions, human beings can not only predict the future consequences of current behaviors forward but also imagine possible traces leading to a target goal backward. Based on this consideration, and inspired by Edwards et al. (2018) and Goyal et al. (2018), we propose to learn a backward model in addition to the forward dynamics model to generate simulated rollouts backwards. To be specific, given a state and a previous action, the backward model predicts the preceding state, which means it can generate trajectories that terminate at a given state.
Now, intuitively, if we utilize both the forward model and the backward model to sample bidirectionally from some encountered states for policy optimization, the model error will compound separately in the two directions. As Figure 1 shows, when we generate trajectories of the same length using a forward model and bidirectional models respectively, the compounding error of the latter will be less than that of the former. More theoretical analysis is given in Section 5.
With this insight, we combine bidirectional models with the recent MBPO method (Janner et al., 2019) and propose a practical MBRL algorithm called Bidirectional Model-based Policy Optimization (BMPO). Besides, we develop a novel state sampling strategy to sample states for model rollouts and incorporate model predictive control (MPC) (Camacho and Alba, 2013) into our algorithm for further performance improvement. Furthermore, we theoretically prove that BMPO has a tighter bound of return discrepancy than MBPO, which verifies the advantage of bidirectional models in theory. We evaluate BMPO and previous state-of-the-art algorithms (Haarnoja et al., 2018; Janner et al., 2019) on a range of continuous control benchmark tasks. Experiments demonstrate that BMPO achieves higher sample efficiency and better asymptotic performance compared with prior model-based methods which only use forward models.
2 Related Work
Model-based reinforcement learning methods are expected to reduce sample complexity by learning a model as a simulator of the environment (Sutton and Barto, 2018). However, model error tends to cripple the performance of model-based approaches, a problem also known as model bias (Deisenroth and Rasmussen, 2011). As discovered in previous works (Venkatraman et al., 2015; Talvitie, 2017; Asadi et al., 2018b), even a small model error can severely degrade multi-step rollouts, since the model error compounds and the predicted states move out of the region where the model has high accuracy after a few steps (Asadi et al., 2018a).
To mitigate the compounding error problem, multi-step models (Whitney and Fergus, 2018; Asadi et al., 2019) were proposed to predict the outcome of executing a sequence of actions directly. Segment-based models have also been considered to make stable and accurate predictions over temporal segments (Mishra et al., 2017). Alternatively, a model may also be trained on its own outputs, hoping that the model can make reasonable predictions in the region of its outputs (Talvitie, 2017; Kaiser et al., 2019; Freeman et al., 2019). Besides, model imitation learns the model by matching the multi-step rollout distributions via WGAN (Wu et al., 2019). Furthermore, adaptive rollout horizon techniques were investigated to stop rolling out based on estimates of the compounding model error (Nguyen et al.; Xiao et al., 2019). The MBPO algorithm (Janner et al., 2019) avoided the compounding error by generating short branched rollouts from real states. This paper is mainly based on the MBPO framework and allows for extended rollouts with less compounding error by generating rollouts bidirectionally.
There are many model architecture choices, such as linear models (Parr et al., 2008; Sutton et al., 2012; Levine and Koltun, 2013; Levine and Abbeel, 2014; Kumar et al., 2016), nonparametric Gaussian processes (Kuss and Rasmussen, 2004; Ko et al., 2007; Deisenroth and Rasmussen, 2011), and neural networks (Draeger et al., 1995; Gal et al., 2016; Nagabandi et al., 2018). Model ensembles have been shown to be effective in preventing a policy or a controller from exploiting the inaccuracies of any single model (Rajeswaran et al., 2016; Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019). In this paper, we adopt an ensemble of probabilistic networks for both the forward and backward models.
Model predictive control (MPC) (Camacho and Alba, 2013) is considered an efficient and robust approach to model-based planning, which utilizes the dynamics model to look forward several steps and optimizes the action sequence over a finite horizon. Nagabandi et al. (2018) combined a deterministic neural network model with a simple random-shooting MPC and acquired stable and plausible locomotion gaits in high-dimensional tasks. Chua et al. (2018) improved this method by capturing two kinds of uncertainty in modeling and replacing random shooting with the cross-entropy method (CEM). Wang and Ba (2019) further expanded MPC algorithms by applying policy networks to generate action sequence proposals. In this paper, we also incorporate MPC into our algorithm to refine the actions sampled from the policy when interacting with the environment.
Theoretical analysis for tabular and linear-setting model-based methods has been conducted in prior works (Szita and Szepesvári, 2010; Dean et al., 2017). As for nonlinear settings, Sun et al. (2018) provided a convergence analysis for the model-based optimal control framework. Luo et al. (2018) built a lower bound of the expected reward while enforcing a trust region on the policy. Janner et al. (2019), instead, derived a bound on the discrepancy between returns in the real environment and those under the branched rollout scheme of MBPO in terms of rollout length, model error, and policy shift divergence. In this paper, we also provide a theoretical analysis of leveraging bidirectional models in MBPO and obtain a tighter bound.
To facilitate research in MBRL, Langlois et al. (2019) gathered a wide collection of MBRL algorithms and benchmarked them on a series of environments specially designed for MBRL. Beyond comparing different methods, they also raised several crucial challenges for future MBRL research.
This work is closely related to Goyal et al. (2018), which built a backtracking model of the environment and used it to generate traces leading to high-value states. These traces are then used for imitation learning to improve the model-free learner. Our approach differs from theirs in two aspects: (1) their method is closer to model-free RL, as they mainly train the policy on real observation data and use the model-generated traces to fine-tune the policy, while our method is purely model-based RL and optimizes the policy with model-generated data; (2) they only build a backward model of the environment, whereas we build a forward model and a backward model at the same time.
3 Preliminaries
3.1 Reinforcement Learning
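A discrete-time Markov decision process (MDP) with infinite horizon is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0)$. Here, $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. $\gamma \in (0, 1)$ is the discount factor. $p(s' \mid s, a)$ denotes the transition distribution function, and $r(s, a)$ denotes the reward function. $\rho_0(s)$ represents the initial state distribution. Let $\eta[\pi]$ denote the expected return, i.e., the sum of the discounted rewards along a trajectory $\tau := (s_0, a_0, s_1, a_1, \dots)$. The goal of reinforcement learning is to find the optimal policy $\pi^*$ that maximizes the expected return:
$$\pi^* = \arg\max_{\pi} \eta[\pi] = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$
where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$.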
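As a concrete illustration of the return in Equation (1), the discounted reward sum for one finite sampled trajectory can be computed as below; averaging it over many sampled trajectories gives a Monte Carlo estimate of the expected return. This is a minimal sketch, and the function name `discounted_return` is illustrative rather than taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence.

    `rewards` is the list of rewards along one trajectory and `gamma`
    is the discount factor; both names are illustrative.
    """
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Truncating the infinite sum to a finite trajectory is an approximation; the neglected tail is bounded by the maximum reward times $\gamma^{T}/(1-\gamma)$ for a trajectory of $T$ steps.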
3.2 Dyna-style Algorithm
The Dyna-style algorithm (Sutton, 1991; Langlois et al., 2019) is one kind of MBRL method that uses a model for policy optimization. To be specific, in Dyna, a forward dynamics model $p_\theta(s' \mid s, a)$ is learned from the agent's interactions with the environment by executing the policy $\pi(a \mid s)$. The policy is then optimized using model-free algorithms on real data and model-generated data. To perform the Dyna-style algorithm using bidirectional models, we further learn a backward model $q_{\theta'}(s \mid s', a)$ and a backward policy $\tilde{\pi}_{\phi'}(a \mid s')$. As such, forward rollouts and backward rollouts can be generated simultaneously for policy optimization.
4 BMPO Framework
In this section, we describe in detail how BMPO leverages bidirectional models to generate more plentiful simulated data for policy optimization. Although bidirectional models can be incorporated into almost any Dyna-style model-based algorithm (Sutton, 1991), we choose the Model-based Policy Optimization (MBPO) (Janner et al., 2019) algorithm as the framework backbone since it is the state-of-the-art MBRL method and is sufficiently general. The overall algorithm is presented in Algorithm 1, and an overview of the algorithm architecture is shown in Figure 2.
4.1 Bidirectional Models Learning in BMPO
4.1.1 Forward Model
For the forward model, we use an ensemble of bootstrapped probabilistic dynamics models, which were first introduced in PETS (Chua et al., 2018) to capture two kinds of uncertainty. To be specific, individual probabilistic models can capture the aleatoric uncertainty arising from the inherent stochasticity of a system, and the bootstrapped ensemble aims to capture the epistemic uncertainty due to the lack of sufficient training data. Prior works (Chua et al., 2018; Janner et al., 2019) have demonstrated that an ensemble of probabilistic models is quite effective in MBRL, even when the ground truth dynamics are deterministic.
In detail, each member $p_\theta^i$ of the forward model ensemble is parameterized by a multi-layer neural network, and we denote the parameters of the forward model ensemble as $\theta$. Given a state $s$ and an action $a$, each probabilistic neural network outputs a Gaussian distribution with diagonal covariance over the next state: $p_\theta^i(s' \mid s, a) = \mathcal{N}(\mu_\theta^i(s, a), \Sigma_\theta^i(s, a))$. (The model network outputs a similar Gaussian distribution over the reward as well, which is omitted here for simplicity.) We train the ensemble members with different initializations and bootstrapped samples of the real environment data via maximum likelihood, and the corresponding loss of the forward model is
$$\mathcal{L}^{f}(\theta) = \sum_{t=1}^{N} \Big( [\mu_\theta(s_t, a_t) - s_{t+1}]^{\top} \Sigma_\theta^{-1}(s_t, a_t) [\mu_\theta(s_t, a_t) - s_{t+1}] + \log \det \Sigma_\theta(s_t, a_t) \Big), \tag{2}$$
where $\mu_\theta$ and $\Sigma_\theta$ are the mean and covariance respectively, and $N$ denotes the total number of transition data.
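The maximum-likelihood loss of the forward model can be sketched for a single ensemble member as follows. With a diagonal covariance, the quadratic term and the log-determinant both reduce to sums over state dimensions. The `predict` callable and the list-based state representation are assumptions made for illustration; the paper itself uses neural networks.

```python
import math


def diag_gaussian_nll(mean, var, target):
    """Negative log-likelihood (up to an additive constant and a factor
    of 1/2) of `target` under a Gaussian with diagonal covariance.

    `mean`, `var`, and `target` are equal-length lists of floats.
    """
    nll = 0.0
    for m, v, x in zip(mean, var, target):
        nll += (x - m) ** 2 / v + math.log(v)  # quadratic term + log-det term
    return nll


def forward_model_loss(predict, transitions):
    """Sum the per-transition loss over (s, a, s_next) tuples.

    `predict(s, a)` is a hypothetical model returning (mean, var) lists
    for the next state.
    """
    return sum(diag_gaussian_nll(*predict(s, a), s_next)
               for s, a, s_next in transitions)
```

Because the variance is predicted rather than fixed, the log-determinant term keeps the model from reducing the quadratic term by simply inflating its predicted variance.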
4.1.2 Backward Model
Besides the traditional forward model learning, we additionally learn a backward model for extended simulated data generation. In this way, we can mitigate the error caused by excessively exploiting the forward model. In other words, the backward model is used to reduce the burden of the forward model on generating data.
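Due to the powerful capabilities of the probabilistic network ensemble, we adopt the same parameterization for the backward model as for the forward model. In detail, another multi-layer neural network with parameters $\theta'$ is used to output a Gaussian prediction of the preceding state, i.e., $q_{\theta'}(s \mid s', a) = \mathcal{N}(\mu_{\theta'}(s', a), \Sigma_{\theta'}(s', a))$, and the loss function of the backward model is
$$\mathcal{L}^{b}(\theta') = \sum_{t=1}^{N} \Big( [\mu_{\theta'}(s_{t+1}, a_t) - s_t]^{\top} \Sigma_{\theta'}^{-1}(s_{t+1}, a_t) [\mu_{\theta'}(s_{t+1}, a_t) - s_t] + \log \det \Sigma_{\theta'}(s_{t+1}, a_t) \Big). \tag{3}$$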
We note that although we can directly train the backward model with data sampled from the environment, it remains a problem how to choose the actions when we use the backward model to generate trajectories backwards. Recall that in the forward model situation, the actions are usually taken by the current policy or an exploration policy. Thus we need an additional backward policy to generate actions in the backward rollouts.
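Training the backward model on real data only requires swapping the roles of the state and the next state in each stored transition. A minimal sketch, assuming transitions are stored as `(s, a, s_next)` tuples (a hypothetical layout, not one specified in the paper):

```python
def backward_training_pairs(transitions):
    """Turn real transitions (s, a, s_next) into supervised pairs for a
    backward model: the input is (s_next, a) and the target is s."""
    return [((s_next, a), s) for s, a, s_next in transitions]


if __name__ == "__main__":
    real = [(0.0, 1.0, 1.0), (1.0, 2.0, 3.0)]
    for model_input, target in backward_training_pairs(real):
        print(model_input, "->", target)
```

The same replay data therefore serves both models; only the backward policy needs extra design, since the action in a backward rollout must be chosen before the preceding state is known.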
4.2 Policy Optimization in BMPO
4.2.1 Backward policy
To sample trajectories backwards, we need to design a backward policy $\tilde{\pi}_{\phi'}(a \mid s')$ that takes an action given the next state. There are many alternatives for backward policy design. In our experiments, we use two heuristic methods for comparison.
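The first way is to train the backward policy $\tilde{\pi}_{\phi'}$ by maximum likelihood estimation on the data sampled in the environment. The corresponding loss function is
$$\mathcal{L}_{MLE}(\phi') = -\sum_{t=1}^{N} \log \tilde{\pi}_{\phi'}(a_t \mid s_{t+1}). \tag{4}$$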
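The second way is to use a conditional generative adversarial network (CGAN) (Mirza and Osindero, 2014) to generate an action $a$ conditioned on the next state $s'$, and the adversarial loss can be written as
$$\min_{\tilde{\pi}} \max_{D} V(D, \tilde{\pi}) = \mathbb{E}_{(a, s') \sim \pi}[\log D(a, s')] + \mathbb{E}_{s' \sim \pi}[\log(1 - D(\tilde{\pi}(\cdot \mid s'), s'))], \tag{5}$$
where $D$ is an additional discriminator and the generator here is the backward policy $\tilde{\pi}$.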
Our goal is to make the backward rollouts resemble the real trajectory sampled by the current forward policy. Thus when training the backward policy, we only use the recent trajectories sampled by the agent in the real environment.
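As one simple instance of the maximum-likelihood option, if the backward policy is assumed to be a Gaussian with fixed variance and a mean that is linear in the next state, maximizing the likelihood is equivalent to least-squares regression of the action on the next state. The scalar sketch below makes that assumption explicit; it is not the network-based policy used in the experiments, and the function name is hypothetical.

```python
def fit_linear_backward_policy(next_states, actions):
    """Least-squares fit of a ≈ slope * s_next + intercept for scalar data.

    Under a fixed-variance Gaussian policy this is the maximum-likelihood
    estimate of the mean function. Requires at least two distinct
    next-state values.
    """
    n = len(next_states)
    mean_s = sum(next_states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a)
              for s, a in zip(next_states, actions))
    var = sum((s - mean_s) ** 2 for s in next_states)
    slope = cov / var
    intercept = mean_a - slope * mean_s
    return slope, intercept
```

Restricting `next_states` and `actions` to recent trajectories, as described above, keeps the fitted backward policy close to the behavior of the current forward policy.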
4.2.2 MBPO with Bidirectional Models
The original MBPO (Janner et al., 2019) algorithm iterates between three stages: data collection, model learning, and policy optimization. In the data collection stage, data is collected by executing the latest policy in the real environment and added to the replay buffer $\mathcal{D}_{env}$. In the model learning stage, an ensemble of bootstrapped forward models is trained using all the data in $\mathcal{D}_{env}$ through maximum likelihood. Then, in the policy optimization stage, short rollouts starting from states randomly chosen from $\mathcal{D}_{env}$ are generated using the forward model. Model-generated data is then used to train a policy (see line 10 in Algorithm 1) through Soft Actor-Critic (SAC) (Haarnoja et al., 2018) by minimizing the expected KL-divergence $J_\pi(\phi, \mathcal{D}) = \mathbb{E}_{s_t \sim \mathcal{D}}[D_{KL}(\pi \,\|\, \exp\{Q^\pi - V^\pi\})]$.
Bidirectional models can be incorporated into MBPO naturally. Specifically, we iteratively collect data, train the bidirectional models and the backward policy, and then use them to generate short rollouts bidirectionally, starting from some real states, to optimize our policy. Besides, in Section 4.3 we design other components specifically for BMPO, i.e., state sampling and MPC, which are crucial for the improvement of performance.
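The bidirectional branched rollout described here can be sketched as below. All five callables are placeholders for learned components, and the argument names `k_backward` and `k_forward` are illustrative; the sketch only shows how the two directions are stitched into transitions around one real start state.

```python
def bidirectional_rollout(start, policy, forward_model,
                          backward_policy, backward_model,
                          k_backward, k_forward):
    """Return transitions (s, a, s_next) in chronological order that pass
    through `start`: k_backward model steps before it, k_forward after it."""
    before = []
    s_next = start
    for _ in range(k_backward):
        a = backward_policy(s_next)      # action assumed to have led into s_next
        s = backward_model(s_next, a)    # predicted preceding state
        before.append((s, a, s_next))
        s_next = s
    before.reverse()                     # oldest transition first

    after = []
    s = start
    for _ in range(k_forward):
        a = policy(s)
        s_next = forward_model(s, a)
        after.append((s, a, s_next))
        s = s_next
    return before + after
```

Each direction applies its model at most `k_backward` or `k_forward` times in a row, so no prediction in the returned trajectory is more than that many model steps away from the real start state.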
4.3 Design Decisions
4.3.1 State Sampling Strategy
To better exploit the bidirectional models, inspired by Goyal et al. (2018), we begin simulated branched rollouts bidirectionally from high value states instead of randomly chosen states in the environment replay buffer. In such a way, the agent could learn to reach high value states through backward rollouts, and also learn to act better after these states through forward rollouts.
An intuitive idea is to simply choose the highest-value states to begin rollouts. This, however, leaves the agent not knowing how to behave in low-value states, because no data is generated around them. For better generalization and stability, we choose starting states from the environment replay buffer according to a Boltzmann distribution based on the value $V(s)$ estimated by SAC. Let $p(s)$ be the probability of a state $s$ being chosen as a starting state; then we have
$$p(s) = \frac{e^{\beta V(s)}}{\sum_{\bar{s} \in \mathcal{D}_{env}} e^{\beta V(\bar{s})}}, \tag{6}$$
where $\beta$ is the hyperparameter that controls the ratio of high-value states.
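The Boltzmann state sampling can be sketched with the standard library as follows. The temperature-like hyperparameter is called `beta` here as an assumed name, and the value estimates are passed in as a plain list; in the paper they come from the SAC critic.

```python
import math
import random


def boltzmann_probs(values, beta):
    """Softmax over beta * value, computed stably by subtracting the max."""
    scaled = [beta * v for v in values]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_start_states(states, values, beta, n, rng=random):
    """Draw n start states (with replacement) from the Boltzmann distribution.

    `rng` may be the `random` module or a `random.Random` instance.
    """
    probs = boltzmann_probs(values, beta)
    return rng.choices(states, weights=probs, k=n)
```

With `beta = 0` every state is equally likely, which recovers uniform sampling from the replay buffer; increasing `beta` concentrates the samples on high-value states.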
4.3.2 Incorporating MPC
In BMPO, when taking actions in the real environment, we adopt model predictive control (MPC) to exploit the forward model further. MPC is a common model-based planning method that uses the learned model to look forward and optimize the action sequence. In detail, at each timestep, $N$ candidate action sequences of length $H$ are generated, where $H$ is the planning horizon, and the corresponding trajectories are simulated in the learned model. Then the first action of the sequence that yields the highest accumulated reward is selected. Here, we use a variant of traditional MPC: (1) we generate action sequences from the current policy, like Wang and Ba (2019), instead of from a uniform distribution; (2) we add the value estimate of the last state in the simulated trajectory to the planning objective, i.e.,
$$a_t = \arg\max_{a_t^{1:N}} \sum_{t'=t}^{t+H-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^{H} V(s_{t+H}). \tag{7}$$
It is worth noting that we only use MPC in the training phase but not in the evaluation phase. In the training phase, MPC helps select actions to take in real environments, and the obtained high value states will be used to generate simulated rollouts for policy optimization. In the evaluation phase, the trained policy is evaluated directly.
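The MPC variant described in this subsection can be sketched as below: candidate action sequences are proposed by the policy, simulated in the learned model, and scored by accumulated reward plus a terminal value estimate. Discounting inside the planning objective is an assumption of this sketch, and all callables and parameter names are placeholders for learned components.

```python
def mpc_action(state, policy, model, reward_fn, value_fn,
               n_candidates, horizon, gamma):
    """Return the first action of the best-scoring candidate sequence.

    Each candidate is produced by rolling the (stochastic) `policy`
    through `model` for `horizon` steps from `state`.
    """
    best_score, best_first = float("-inf"), None
    for _ in range(n_candidates):
        s, score, first = state, 0.0, None
        for h in range(horizon):
            a = policy(s)
            if h == 0:
                first = a                     # remember the action to execute
            score += (gamma ** h) * reward_fn(s, a)
            s = model(s, a)
        score += (gamma ** horizon) * value_fn(s)  # terminal value estimate
        if score > best_score:
            best_score, best_first = score, first
    return best_first
```

Only the first action is executed in the real environment; planning is repeated from the next real state, so model error inside the plan affects action selection but never accumulates across real steps.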
5 Theoretical Analysis
In this section, we theoretically analyze the discrepancy between the expected returns in the real environment and those under the bidirectional branched rollout scheme of BMPO. The proofs can be found in the appendix provided in the supplementary materials. When we use the bidirectional models to generate simulated rollouts from some encountered state, we assume that the length of backward rollouts is $k_1$, and the length of forward rollouts is $k_2$. Under such a scheme, we first consider a more general return discrepancy of two arbitrary bidirectional branched rollouts.
Lemma 5.1.
(Bidirectional Branched Rollout Returns Bound). Let , be the expected returns of two bidirectional branched rollouts. Out of the branch, we assume that the expected total variation distance between these two dynamics at each timestep is bounded as , similarly, the forward branch dynamic bounded as , and the backward branch dynamic bounded as . Likewise, the total variation distance of policy is bounded by , and , respectively. Then the returns are bounded as:
(8) 
Proof.
See Appendix A, Lemma A.1. ∎
Now, we can bound the discrepancy between the returns in the environment and in the branched rollouts of BMPO. Let be the expected return of executing current policy in the true dynamics, and be the expected return of executing current policy in the model generated branch and executing old policy out of the branch. Then we can derive the discrepancy between them as follows.
Theorem 5.1.
(BMPO Return Discrepancy Upper Bound) Assume that the expected total variation distance between the learned forward model and the true dynamics at each timestep is bounded as . Similarly, the error of backward model is bounded as and the variation between current policy and the behavioral policy is bounded as . Assume , then under a branched rollouts scheme with a backward branch length of and a forward branch length of , we have
(9)  
Proof.
See Appendix A, Theorem A.1. ∎
We notice that in MBPO (Janner et al., 2019), the authors derived a similar return discrepancy bound (refer to Theorem 4.3 therein) with only one forward dynamics model. Setting the forward rollout length as , the bound is
(10)  
By comparing the two bounds, it is evident that BMPO obtains a tighter upper bound of the return discrepancy by employing bidirectional models. Writing $k_1$ for the backward rollout length and $k_2$ for the forward rollout length, the main difference is that in BMPO the coefficient of the model error is $\max(k_1, k_2)$, while in MBPO the coefficient is $k_1 + k_2$. This is easy to understand: as we generate the branched rollouts bidirectionally, the model error compounds for only $k_1$ steps in the backward model or $k_2$ steps in the forward model, and thus for at most $\max(k_1, k_2)$ steps, instead of $k_1 + k_2$ steps when using the forward model only.
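To make the comparison concrete, consider an illustrative setting (not one reported in the paper) in which the backward and forward rollout lengths are both 5, so that a forward-only rollout of the same total length takes 10 model steps. Writing $k_1$ and $k_2$ for the backward and forward lengths, the two model-error coefficients discussed above evaluate to:

```latex
\max(k_1, k_2) = \max(5, 5) = 5
\qquad \text{versus} \qquad
k_1 + k_2 = 5 + 5 = 10 .
```

With equal lengths the coefficient is halved, which is the largest reduction possible for a fixed total length; with very unequal lengths, for example $k_1 = 1$ and $k_2 = 9$, the reduction is only from 10 to 9.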
6 Experiments
Our experiments aim to answer the following three questions: 1) How does BMPO perform compared with model-free RL methods and previous state-of-the-art model-based RL methods using only one forward model? 2) Does using the bidirectional models reduce compounding error compared with using one forward model? 3) What are the critical components of our overall algorithm?
6.1 Comparison with State-of-the-Art Methods
In this section, we compare our method with previous state-of-the-art baselines. Specifically, for model-based methods, we compare against MBPO (Janner et al., 2019), as our method builds on top of it, and against SLBO (Luo et al., 2018) and PETS (Chua et al., 2018), both of which perform well in the model-based benchmarking test (Langlois et al., 2019). For model-free methods, we compare to Soft Actor-Critic (SAC) (Haarnoja et al., 2018), which has proved effective in continuous control tasks. For the MBPO baseline, we only generate forward rollouts of $k_2$ steps, where $k_2$ (the forward rollout length of BMPO) is set to the default value used in the original MBPO paper. We do not report the result of MBPO using rollouts of $k_1 + k_2$ steps, with $k_1$ the backward rollout length of BMPO, since using only $k_2$ steps achieves better performance. Notice that we do not include the backtracking model method (Goyal et al., 2018) for comparison since it is not a purely model-based method, and its performance improvement over SAC is limited.
We evaluate all the algorithms in six environments in total using OpenAI Gym (Brockman et al., 2016). Among them, Pendulum is a traditional control task, and Hopper, Walker2d, and Ant are three complex MuJoCo tasks (Todorov et al., 2012). We additionally add two variants of MuJoCo tasks without early termination states, denoted as Hopper-NT and Walker2d-NT, which have been released as benchmarking environments for MBRL (Langlois et al., 2019). More details about the environments can be found in Appendix C.
The comparison results are shown in Figure 3. In different locomotion tasks, our method BMPO learns faster and has better asymptotic performance than previous model-based algorithms using only the forward model, which empirically demonstrates the advantage of bidirectional models. This gap is even more significant in the benchmarking environments Hopper-NT and Walker2d-NT. One possible reason is that in the environments without early termination, the space of encountered states is larger, and thus the introduced state sampling strategy is more effective.
6.2 Model Error
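In this section, we first plot the validation loss of the models in Figure 4a, which roughly represents the single-step prediction error. As the figure shows, the difficulty of learning forward/backward models is task-dependent. Then, we investigate the compounding error of the traditional forward model and our bidirectional models when generating simulated trajectories of the same length. We calculate the multi-step prediction error for evaluation. A similar validation error is also used in Nagabandi et al. (2018). More specifically, assume a real trajectory of length $2h+1$ is denoted as $(s_0, a_0, s_1, \dots, s_{2h})$. For the forward model, we sample from $s_0$ and generate forward rollouts $(\hat{s}_0, a_0, \hat{s}_1, \dots, \hat{s}_{2h})$, where $\hat{s}_0 = s_0$ and for $0 \le i < 2h$, $\hat{s}_{i+1} \sim p_\theta(\cdot \mid \hat{s}_i, a_i)$. Then the corresponding compounding error is defined as
$$\epsilon_{f} = \frac{1}{2h} \sum_{i=1}^{2h} \|\hat{s}_i - s_i\|_2^2.$$
Similarly, for the bidirectional models, suppose we sample from $s_h$ and generate both forward and backward rollouts of $h$ steps; the compounding error is defined as
$$\epsilon_{bi} = \frac{1}{2h} \sum_{i=0}^{2h} \|\hat{s}_i - s_i\|_2^2,$$
where $\hat{s}_h = s_h$ and for $0 \le i < h$, $\hat{s}_{h+i+1} \sim p_\theta(\cdot \mid \hat{s}_{h+i}, a_{h+i})$ and $\hat{s}_{h-i-1} \sim q_{\theta'}(\cdot \mid \hat{s}_{h-i}, a_{h-i-1})$.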
[Figure panel (d): Average return of the last 10 epochs over six trials with different backward rollout lengths and a fixed forward length.]
We conduct experiments with different rollout lengths and plot the results in Figure 4b. We observe that employing bidirectional models can significantly reduce the compounding model error, which is consistent with our intuition and theoretical analysis.
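The comparison of compounding errors can be reproduced in miniature with the sketch below, which rolls a forward model over a whole real trajectory and, alternatively, rolls a forward and a backward model outward from the middle real state. Scalar states, deterministic models, and the mean-squared-error normalization are simplifying assumptions of this sketch, and the function names are hypothetical.

```python
def forward_compounding_error(real_states, actions, forward_model):
    """Mean squared error of a forward rollout started at real_states[0].

    `real_states` has one more entry than `actions`.
    """
    pred = real_states[0]
    total = 0.0
    for i, a in enumerate(actions):
        pred = forward_model(pred, a)
        total += (pred - real_states[i + 1]) ** 2
    return total / len(actions)


def bidirectional_compounding_error(real_states, actions,
                                    forward_model, backward_model):
    """Same metric, rolling out both ways from the middle real state."""
    mid = len(actions) // 2
    total = 0.0
    pred = real_states[mid]
    for i in range(mid, len(actions)):       # forward half
        pred = forward_model(pred, actions[i])
        total += (pred - real_states[i + 1]) ** 2
    pred = real_states[mid]
    for i in range(mid - 1, -1, -1):         # backward half
        pred = backward_model(pred, actions[i])
        total += (pred - real_states[i]) ** 2
    return total / len(actions)
```

As a usage example, take true dynamics `s_next = s + a` with four unit actions from state 0, and models that each add a constant bias of 0.1 per step. The forward-only rollout accumulates errors of 0.1, 0.2, 0.3, and 0.4, while each half of the bidirectional rollout only reaches 0.2, so the bidirectional error is smaller.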
6.3 Design Evaluation
In this section, we evaluate the importance of each design decision for our overall algorithm. The results are demonstrated in Figure 5.
6.3.1 Backward Policy Design
In Figure 5(a), we compare two heuristic design choices for the backward policy loss: MLE loss and GAN loss, which are described in detail in Section 4.2.1. From the comparison, we notice a slight performance degradation in the late iterations of training when adopting the GAN loss. One possible reason is the instability and mode collapse of GANs (Goodfellow, 2016). More advanced methods for stabilizing GAN training could be incorporated into backward policy training, which is left as future work. For the other comparisons in this paper, we use the MLE loss for the backward policy by default.
6.3.2 Ablation Study
We further carry out an ablation study to characterize the importance of three main components of our algorithm: 1) no backward model to sample trajectories (Forward Only); 2) no forward model to sample trajectories (Backward Only); 3) no MPC when interacting with the environment (No MPC). The results are shown in Figure 5(b). We find that ablating MPC decreases the performance slightly, and using only the forward model or only the backward model causes a more severe performance drop. This further reveals the superiority of using bidirectional models. Notice that the performance without the backward model is even worse than that of vanilla MBPO, which means that sampling from high-value states does not benefit the traditional forward model, since the agent may only focus on how to act at high-value states without learning how to reach these states.
6.3.3 Hyperparameter Study
In this section, we investigate the sensitivity of BMPO to the hyperparameters. First, we test BMPO with different values of the hyperparameter $\beta$ used in Equation 6. We vary $\beta$ from $0$ to a relatively large number, where $\beta = 0$ means random sampling, and a larger $\beta$ means focusing more on high-value states. The results are shown in Figure 5(c). We observe that, up to a certain level, increasing $\beta$ yields better performance, while too large a $\beta$ can degrade the performance. This may be because, with too large a $\beta$, low-value states are almost never chosen, and the value estimation at the beginning of training is difficult. Nevertheless, our algorithm outperforms the baseline for all tested values of $\beta$, which indicates its robustness to $\beta$.
Though it has been discovered that linearly increasing rollout length achieves excellent performance (Janner et al., 2019), it remains a problem that how to choose the backward rollout length according to the forward length . We fix to be the same as Janner et al. (2019) and vary from 0 to . As is shown in Figure 5(d), setting provides the best result, while too short and too long backward length are both detrimental. In practice, we use in most environments except Ant, where the backward model error is too large compared with forward model error. All hyperparameters settings are provided in Appendix D.
7 Conclusion
In this work, we present a novel model-based reinforcement learning method, namely bidirectional model-based policy optimization (BMPO), using the newly introduced bidirectional models. We theoretically prove the advantage of bidirectional models by deriving a tighter upper bound on the return discrepancy of the RL objective compared with using only one forward model. Experimental results show that BMPO achieves better asymptotic performance and higher sample efficiency than previous state-of-the-art model-based methods on several benchmark continuous control tasks. For future work, we will investigate the usage of bidirectional models in other model-based RL frameworks and study how to leverage the bidirectional models better.
Acknowledgments
The corresponding author Weinan Zhang thanks the support of "New Generation of AI 2030" Major Project 2018AAA0100900 and NSFC (61702327, 61772333, 61632017).
References
 Towards a simple approach to multistep modelbased reinforcement learning. arXiv preprint arXiv:1811.00128. Cited by: §2.
 Combating the compoundingerror problem with a multistep model. arXiv preprint arXiv:1905.13320. Cited by: §1, §2.
 Lipschitz continuity in modelbased reinforcement learning. arXiv preprint arXiv:1804.07193. Cited by: §1, §2.
 Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §6.1.
 Model predictive control. Springer Science & Business Media. Cited by: §1, §2.
 Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: §2, §2, §4.1.1, §6.1.
 On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688. Cited by: §2.
 A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §1.
 PILCO: a modelbased and dataefficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML11), pp. 465–472. Cited by: §2, §2.
 Model predictive control using neural networks. IEEE Control Systems Magazine 15 (5), pp. 61–66. Cited by: §2.
 Forwardbackward reinforcement learning. arXiv preprint arXiv:1803.10227. Cited by: §1.
 Learning to predict without looking ahead: world models without forward prediction. ArXiv abs/1910.13038. Cited by: §1, §2.
 Improving pilco with bayesian neural network dynamics models. In DataEfficient Machine Learning workshop, ICML, Vol. 4. Cited by: §2.
 NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §6.3.1.
 Recall traces: backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379. Cited by: §1, §2, §4.3.1, §6.1.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1, §4.2.2, §6.1.
 When to trust your model: modelbased policy optimization. arXiv preprint arXiv:1906.08253. Cited by: Lemma B.2, Lemma B.3, Table 3, §1, §1, §2, §2, §2, §4.1.1, §4.2.2, §4, §5, §6.1, §6.3.3.
 Modelbased reinforcement learning for atari. arXiv preprint arXiv:1903.00374. Cited by: §1, §2.
 Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Proceedings 2007 ieee international conference on robotics and automation, pp. 742–747. Cited by: §2.
 Optimal control with learned local models: application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 378–383. Cited by: §2.
 Modelensemble trustregion policy optimization. arXiv preprint arXiv:1802.10592. Cited by: §2.
 Gaussian processes in reinforcement learning. In Advances in neural information processing systems, pp. 751–758. Cited by: §2.
 Benchmarking modelbased reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: Appendix C, §2, §3.2, §6.1, §6.1.
 Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079. Cited by: §2.
 Guided policy search. In International Conference on Machine Learning, pp. 1–9. Cited by: §2.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
 Algorithmic framework for modelbased deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858. Cited by: §2, §6.1.
 Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §4.2.1.
 Prediction and control with temporal segment models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2459–2468. Cited by: §1, §2.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
 Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §2, §2, §6.2.
 Improving model-based RL with adaptive rollout using uncertainty estimation. Cited by: §1, §2.
 An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752–759. Cited by: §2.
 EPOpt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
 Dual policy iteration. In Advances in Neural Information Processing Systems, pp. 7059–7069. Cited by: §2.
 Reinforcement learning: an introduction. MIT press. Cited by: §2.
 Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285. Cited by: §2.
 Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin 2 (4), pp. 160–163. Cited by: §3.2, §4.
 Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038. Cited by: §2.
 Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1, §2, §2.
 MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.1.
 Improving multi-step prediction of learned time series models. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §2.
 Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649. Cited by: §2, §4.3.2.
 Understanding the asymptotic performance of model-based RL methods. Cited by: §1, §2.
 Model imitation for model-based reinforcement learning. arXiv preprint arXiv:1909.11821. Cited by: §1, §2.
 Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206. Cited by: §1, §2.
Appendix A BMPO Performance Guarantee
Lemma A.1.
(Bidirectional Branched Rollout Returns Bound). Let , be the expected returns of two bidirectional branched rollouts. Out of the branch, we assume that the expected total variation distance between these two dynamics at each timestep is bounded as , similarly, the forward branch dynamic bounded as , and the backward branch dynamic bounded as . Likewise, the total variation distance of policy is bounded by , and , respectively (as Figure 6 shows). Then the returns are bounded as
(11)  
Proof.
Lemma B.1 and Lemma B.2 imply that the state marginal error at each timestep can be bounded by the divergence at the current timestep plus the state marginal error at the next (Lemma B.1) or previous (Lemma B.2) timestep. By employing Lemma B.3, we can convert the (s,a) joint distribution to marginal distributions. Thus, letting and denote the state-action marginals, we can write:
For :
(12)  
Similarly, for :
(13) 
And for :
(14) 
We can now bound the difference in occupancy measures by averaging the state marginal error over time, weighted by the discount:
Multiplying this bound by to convert the occupancy measure difference into a returns bound completes the proof. ∎
Theorem A.1.
(BMPO Return Discrepancy Upper Bound) Assume that the expected total variation distance between the learned forward model and the true dynamics at each timestep is bounded as . Similarly, the error of backward model is bounded as and the variation between current policy and the behavioral policy is bounded as . Assume and , then under a branched rollouts scheme with a backward branch length of and a forward branch length of , the returns are bounded as:
(15) 
Proof.
Using Lemma A.1, out of the branch we only suffer from the error of executing the old policy , so we set and . Within the branched rollout, we execute the current policy, so the only error comes from using the learned model to simulate; we therefore set and . Plugging these into Lemma A.1, we get:
(16)  
∎
Appendix B Useful Lemmas
In this section, we give proofs of the lemmas used before.
Lemma B.1.
(Backward State Marginal Distance Bound). Suppose the expected total variation distance between two backward dynamics is bounded as and the backward policy divergences are bounded as . Then the state marginal distance at timestep can be bounded as:
(17) 
Proof.
Let the total variation distance of state at time be denoted as .
∎
Lemma B.2.
(Forward State Marginal Distance Bound) ((Janner et al., 2019), Lemma B.2, B.3). Suppose the expected TVD between two forward dynamics is bounded as and the forward policy divergences are bounded as . Then the state marginal distance at timestep can be bounded as:
(18) 
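As a numerical illustration of the kind of bound Lemma B.2 describes, the following sketch checks on small random Markov chains that the state marginal distance after t steps stays below t times the worst-case one-step TVD between two dynamics, which is the form given by Janner et al. (2019). It is a sanity check under stated assumptions, not the paper's proof: the state space is finite, both chains share the same initial state distribution, the policy is folded into the transition matrices, and the helper names (`tvd`, `random_dist`, `step`, `forward_marginal_gaps`) are hypothetical.

```python
import random


def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def random_dist(n, rng):
    """Sample a random probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def step(dist, transition):
    """Push a state distribution through one step of a row-stochastic matrix."""
    n = len(transition)
    return [sum(dist[s] * transition[s][t] for s in range(n)) for t in range(n)]


def forward_marginal_gaps(n_states, horizon, seed=0):
    """Return (gaps, delta).

    gaps[t-1] is the TVD between the two state marginals after t steps;
    delta is the worst-case one-step TVD between the two dynamics.
    """
    rng = random.Random(seed)
    p1 = [random_dist(n_states, rng) for _ in range(n_states)]
    p2 = [random_dist(n_states, rng) for _ in range(n_states)]
    delta = max(tvd(p1[s], p2[s]) for s in range(n_states))
    # Assumption: both chains start from the same state distribution.
    d1 = d2 = random_dist(n_states, rng)
    gaps = []
    for _ in range(horizon):
        d1, d2 = step(d1, p1), step(d2, p2)
        gaps.append(tvd(d1, d2))
    return gaps, delta


if __name__ == "__main__":
    gaps, delta = forward_marginal_gaps(n_states=5, horizon=10)
    for t, gap in enumerate(gaps, start=1):
        print(f"t={t:2d}  marginal TVD={gap:.4f}  bound t*delta={t * delta:.4f}")
```

In practice the linear bound is loose for long horizons, since a TVD can never exceed 1; the check only confirms that the marginal gap never exceeds it.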
Lemma B.3.
(TVD Of Joint Distributions) ((Janner et al., 2019), Lemma B.1). Suppose we have two distributions and . We can bound the total variation distance of the joint distributions as:
(19) 
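The inequality of Lemma B.3 can be sanity-checked numerically. The sketch below assumes the form stated in Janner et al. (2019), Lemma B.1, namely that the TVD of the joint distributions is at most the TVD of the marginals plus the maximum TVD of the conditionals, and tests it on random discrete distributions. The function names (`tvd`, `random_dist`, `joint_tvd_check`) are illustrative and not taken from the paper.

```python
import random


def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def random_dist(n, rng):
    """Sample a random probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def joint_tvd_check(nx, ny, seed=0):
    """Return (lhs, rhs) where lhs is the TVD of the joints p1(x, y), p2(x, y)
    and rhs is TVD(p1(x), p2(x)) + max_x TVD(p1(y|x), p2(y|x))."""
    rng = random.Random(seed)
    p1_x, p2_x = random_dist(nx, rng), random_dist(nx, rng)
    p1_y = [random_dist(ny, rng) for _ in range(nx)]  # p1(y | x)
    p2_y = [random_dist(ny, rng) for _ in range(nx)]  # p2(y | x)
    joint1 = [p1_x[i] * p1_y[i][j] for i in range(nx) for j in range(ny)]
    joint2 = [p2_x[i] * p2_y[i][j] for i in range(nx) for j in range(ny)]
    lhs = tvd(joint1, joint2)
    rhs = tvd(p1_x, p2_x) + max(tvd(p1_y[i], p2_y[i]) for i in range(nx))
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = joint_tvd_check(nx=4, ny=3)
    print(f"TVD(joint) = {lhs:.4f} <= {rhs:.4f} = TVD(marginal) + max TVD(conditional)")
```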
Appendix C Environment Settings
In this section, we provide a comparison of the environment settings used in our experiments. Among them, 'HopperNT' and 'Walker2dNT' refer to the settings in Langlois et al. (2019); the others are the standard versions.

| Environment Name | Observation Space Dimension | Action Space Dimension | Steps Per Epoch |
|---|---|---|---|
| Pendulum | 3 | 1 | 200 |
| Hopper | 11 | 3 | 1000 |
| HopperNT | 11 | 3 | 1000 |
| Walker2d | 17 | 6 | 1000 |
| Walker2dNT | 17 | 6 | 1000 |
| Ant | 27 | 8 | 1000 |

| Environment Name | Reward Function | Termination States Condition |
|---|---|---|
| Pendulum |  | None |
| Hopper |  | or |
| HopperNT |  | None |
| Walker2d |  | or or |
| Walker2dNT |  | None |
| Ant |  | or |
Appendix D Hyperparameters
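| Environment Name |  |  |  | MPC Horizon | Epochs |
|---|---|---|---|---|---|
| Pendulum |  |  |  | 6 | 20 |
| Hopper |  |  |  | 6 | 100 |
| HopperNT |  |  | 0.01 | 6 | 100 |
| Walker2d | 1 | 1 |  | 1 | 200 |
| Walker2dNT | 1 | 1 | 0.01 | 0 | 200 |
| Ant | 1 |  | 0.003 | 0 | 300 |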
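For orientation, the "Steps Per Epoch" column of Appendix C and the "Epochs" column above can be combined into a per-environment step budget. The short sketch below does only that arithmetic; it assumes the budget is simply steps per epoch times epochs, and the dictionary and function names are hypothetical.

```python
# Values copied from the "Steps Per Epoch" (Appendix C) and "Epochs" (Appendix D) columns.
STEPS_PER_EPOCH = {
    "Pendulum": 200,
    "Hopper": 1000,
    "HopperNT": 1000,
    "Walker2d": 1000,
    "Walker2dNT": 1000,
    "Ant": 1000,
}
EPOCHS = {
    "Pendulum": 20,
    "Hopper": 100,
    "HopperNT": 100,
    "Walker2d": 200,
    "Walker2dNT": 200,
    "Ant": 300,
}


def total_env_steps(env_name):
    """Steps per epoch times number of epochs (assumed step budget)."""
    return STEPS_PER_EPOCH[env_name] * EPOCHS[env_name]


if __name__ == "__main__":
    for name in STEPS_PER_EPOCH:
        print(f"{name:12s} {total_env_steps(name):>7d} steps")
```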
Appendix E Computing Infrastructure
In this section, we provide a description of the computing infrastructure used to run all the experiments in Table 4. We also show the computation time comparison between our algorithm and the MBPO baseline in Table 5.
| CPU | GPU | Memory |
|---|---|---|
| AMD 2990WX | RTX 2080 Ti ×4 | 256 GB |

| Algorithm | Pendulum | Hopper | HopperNT | Walker2d | Walker2dNT | Ant |
|---|---|---|---|---|---|---|
| BMPO | 0.49 | 16.34 | 17.98 | 27.24 | 27.34 | 71.51 |
| MBPO | 0.41 | 10.33 | 11.12 | 22.26 | 21.32 | 57.42 |
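The relative cost of BMPO over MBPO can be read off the computation times above as a simple ratio. The sketch below performs that division per environment; the numbers are copied from the table, the time unit is whatever the table uses (it is not stated here), and the names `TIMES` and `overhead_ratio` are illustrative.

```python
# Computation times copied from the table above (unit as reported there).
TIMES = {
    "BMPO": {"Pendulum": 0.49, "Hopper": 16.34, "HopperNT": 17.98,
             "Walker2d": 27.24, "Walker2dNT": 27.34, "Ant": 71.51},
    "MBPO": {"Pendulum": 0.41, "Hopper": 10.33, "HopperNT": 11.12,
             "Walker2d": 22.26, "Walker2dNT": 21.32, "Ant": 57.42},
}


def overhead_ratio(env_name):
    """BMPO computation time divided by MBPO computation time."""
    return TIMES["BMPO"][env_name] / TIMES["MBPO"][env_name]


if __name__ == "__main__":
    for name in TIMES["BMPO"]:
        print(f"{name:12s} BMPO/MBPO = {overhead_ratio(name):.2f}")
```

On these numbers BMPO is slower than MBPO in every environment, with the ratio ranging from roughly 1.2 to 1.6.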