1 Introduction
Reinforcement learning (RL) has been successfully applied to game playing. However, its application in real-world scenarios is still limited. One of the main challenges is the high sample complexity of existing RL algorithms. This high sample complexity can be attributed to three factors: the sample complexity of the core learning algorithm, the ability of the system to learn from existing off-policy data, and the ability of a policy, as well as the world model in the case of model-based RL (MBRL), to be transferred to another task. We argue that for RL to be applicable in the real world, an RL system needs to possess three properties: low sample complexity of the learning algorithm, the capability to learn from off-policy data, and transferability to similar tasks. To the best of our knowledge, none of the existing algorithms has all of these properties. In this paper, we propose a novel RL framework that is sample efficient, able to bootstrap from off-policy data, and also able to bootstrap from an existing policy.
Popular RL algorithms are divided into two main paradigms: model-free and model-based. Model-free algorithms have been shown to achieve good asymptotic performance in many high-dimensional problems Mnih et al. (2015); Silver et al. (2017). However, they have a crucial limitation: they often require a massive number of samples for training, which prevents their application in control and robotics domains, where the cost of interacting with the real system is high. The main reason is that model-free RL learns the state or state-action values only from rewards and does not explicitly exploit the rich information in the transition dynamics data.
In contrast to model-free RL (MFRL), model-based algorithms try to learn the transition dynamics, which are in turn used for imagining or planning without having to interact frequently with the real system. Therefore, they are often considered more sample efficient than their model-free counterparts. In addition, the dynamics model is independent of rewards and thus can be transferred to other tasks in the same or similar environments. Nevertheless, the assumption that the learned dynamics model is accurate is usually not satisfied, especially in complex environments. Model error compounds during planning: a small bias in the model can lead to a highly erroneous value function estimate and a strongly biased suboptimal policy. This makes MBRL less competitive than MFRL in terms of asymptotic performance on many non-trivial tasks.
Many solutions have been proposed to mitigate the model bias issue. Early work such as Deisenroth and Rasmussen (2011) uses Gaussian Processes (GPs) to capture the model uncertainty. GP-based methods are, however, computationally intractable and do not scale to complex environments such as MuJoCo Todorov et al. (2012) or Atari games. In recent years, deep neural networks (DNNs) have gained a lot of attention due to their high representational capacity and great success in large-scale supervised training tasks. There have been several attempts to make DNNs uncertainty-aware. For example, Bayesian neural networks were used in Gal et al. (2016); Depeweg et al. (2016); Kamthe and Deisenroth (2017), and in Gal and Ghahramani (2016) dropout was proposed as a scalable approximation to GPs. More recently, bootstrapping, or ensembling in general, has become a favored technique for uncertainty modeling in DNNs, as in Kurutach et al. (2018); Clavera et al. (2018). We hypothesize that this is because bootstrapping is a well-studied technique in statistics and ensembles are relatively easy to train.
Another limitation of many existing MBRL methods is that they rely on the model predictive control (MPC) framework. In this framework, at each time step $t$, the agent uses the learned dynamics model to plan $H$ (a hyperparameter) steps ahead, predicts the optimal sequence of actions $a_t, \dots, a_{t+H-1}$, then interacts with the environment using the first action $a_t$. But MPC, while easy to understand, has two drawbacks. First, each MPC step requires solving a high-dimensional optimization problem and thus is computationally prohibitive for applications requiring real-time or low-latency reaction, such as autonomous driving. Second, the policy is only implicit, defined through the solution of that optimization problem. Not being able to represent the policy explicitly makes it hard to transfer the learned policy to other tasks or to initialize the agent with an existing better-than-random policy.
In contrast to MBRL, in the MFRL literature, policy gradient has been shown to be an effective technique for training an agent Lillicrap et al. (2015); Schulman et al. (2017); Haarnoja et al. (2018). In this paper, we propose a new Model-based Policy Optimization (MBPO) framework that combines MBRL and policy gradient techniques in a principled way. Our experiments demonstrate the superior sample efficiency of our MBPO agent compared to state-of-the-art methods, under the condition that the agent starts from scratch.
In addition to promising experimental results, we emphasize that our framework is designed to support learning and improving from existing knowledge and from an existing policy. Figure 1 illustrates our broad approach to real-world reinforcement learning.
2 Related work
Initial successes of MBRL in continuous control came from learning control policies on models of local dynamics using linear parametric approximators Abbeel et al. (2007); Levine and Koltun (2013). Alternative methods such as Deisenroth and Rasmussen (2011) incorporated non-parametric probabilistic GPs to capture model uncertainty during policy planning and evaluation. While these methods enhance data efficiency in low-dimensional tasks, their application in more challenging domains, such as environments involving non-contact dynamics and high-dimensional control, remains limited by the inflexibility of their temporally local structure and by intractable inference time. Our model, in contrast, achieves both targets: asymptotic performance comparable to that of MFRL methods and, at the same time, data efficiency in those complex domains.
Recently, there has been revived interest in using DNNs to learn predictive models of environments from data, drawing inspiration from ideas in the early MBRL literature. The large representational capacity of DNNs makes them suitable function approximators for complex environments. However, additional care usually has to be taken to avoid model bias, a situation where the DNNs overfit in the early stages of learning, resulting in inaccurate models. Nagabandi et al. (2017) combined a learned dynamics network with MPC to initialize the policy network and thereby accelerate learning in model-free deep RL. Chua et al. (2018) extended this idea by introducing a bootstrapped ensemble of probabilistic DNNs to model the predictive uncertainty of the learned networks, demonstrating that a pure model-based approach can attain the asymptotic performance of MFRL counterparts. However, using MPC to define a policy leads to slow run-time execution and makes the policy hard to transfer across tasks.
Subsequent research proposed algorithms that leverage the learned ensemble of dynamics models to train a policy network. Kurutach et al. (2018) learned a stochastic policy via trust-region policy optimization, and Clavera et al. (2018) cast the policy gradient as a meta-learning adaptation step with respect to each member of the ensemble. Buckman et al. (2018) proposed an algorithm that learns a weighted combination of rollouts of different horizon lengths, which dynamically interpolates between model-based and model-free learning based on the uncertainty in the model predictions. To our knowledge, this is the closest work to ours, and it also learns a reward function in addition to the dynamics function. Furthermore, none of the aforementioned works propagates the uncertainty all the way to the value function or uses the concept of a utility function to balance risk and return, as our model does.
An ensemble of DNNs provides a straightforward technique for obtaining reliable estimates of predictive uncertainty Lakshminarayanan et al. (2017) and has been integrated with the bootstrap to guide exploration in MFRL Osband et al. (2016). While many of the approaches mentioned in this section employ the bootstrap to train an ensemble of models, we note that their implementations reconstruct the bootstrap datasets at every training iteration, which effectively trains every model on every data sample and thus diminishes the uncertainty quantification advantage of the bootstrap. Our model is different in that, to maintain online bootstrap datasets across the ensemble, it adds each incoming data sample to each dataset a number of times drawn from a Poisson distribution Park et al. (2007); Qin et al. (2013), thereby guaranteeing asymptotically consistent bootstrap datasets.
3 Uncertainty-aware Model-based Policy Optimization
3.1 Policy Optimization Formulation
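Consider a discrete-time Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, f, r, T, \gamma)$, in which $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is a deterministic transition function, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $T$ is a task horizon, and $\gamma$ is a discount factor. We define the return as the sum of rewards along a trajectory induced by a policy $\pi$: $J(\pi) = \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$. The goal of reinforcement learning is to find a policy that maximizes the expected return, i.e.
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{s_0}\left[J(\pi)\right] \qquad (1)$$
where $a_t = \pi(s_t)$ and $s_{t+1} = f(s_t, a_t)$ for $t = 0, \dots, T-1$, and $s_0$ is randomly chosen from some initial distribution on $\mathcal{S}$.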
If the dynamics function $f$ and the reward function $r$ are given, solving (1) can be done using the calculus of variations Young (2000) or policy gradient Sutton et al. (2000), which is the equivalent of the calculus of variations when the control function is parameterized or finite-dimensional.
However, in reinforcement learning, $f$ and $r$ are often unknown and hence Equation (1) becomes a black-box optimization problem with an unknown objective function. Following the Bayesian approach commonly used in the black-box optimization literature Shahriari et al. (2015), we propose to solve this problem by iteratively learning a probabilistic estimate $\hat{J}$ of $J$ from data and optimizing the policy according to this approximated model.
It is worth noting that any unbiased method would model $J$ as a probabilistic estimate, i.e. $\hat{J}(\pi)$ would be a distribution (as opposed to a point estimate) for a given $\pi$. Optimizing a stochastic objective is, however, not well-defined. Our solution is to transform $\hat{J}$ into a deterministic utility function that reflects a subjective measure balancing risk and return. Following Markowitz (1952); Sato et al. (2001); García and Fernández (2015), we propose a risk-sensitive objective criterion using a linear combination of the mean and the standard deviation of $\hat{J}(\pi)$. Formally stated, our objective criterion now becomes
$$\pi^* = \arg\max_{\pi} \left[\mu(\pi) + c\,\sigma(\pi)\right] \qquad (2)$$
where $\mu(\pi)$ and $\sigma(\pi)$ are the mean and the standard deviation of $\hat{J}(\pi)$ respectively, and $c$ is a constant that represents the subjective risk preference of the learning agent. A positive risk preference means that the agent is adventurous, while a negative risk preference indicates that the agent has a safe exploration strategy.
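As a minimal sketch of the risk-sensitive criterion above, the following snippet scores a set of sampled returns by their mean plus a risk-preference constant times their standard deviation. The function name `risk_sensitive_utility`, the use of the population standard deviation, and the two sets of return samples are illustrative assumptions, not details taken from the paper.

```python
import statistics


def risk_sensitive_utility(return_samples, risk_preference):
    """Mean of the sampled returns plus risk_preference times their std."""
    mean = statistics.fmean(return_samples)
    # Population standard deviation; this choice is an assumption of the sketch.
    std = statistics.pstdev(return_samples)
    return mean + risk_preference * std


# Two hypothetical policies with the same mean return but different spread.
safe_returns = [9.0, 10.0, 11.0]
risky_returns = [0.0, 10.0, 20.0]

for c in (-1.0, 0.0, 1.0):
    u_safe = risk_sensitive_utility(safe_returns, c)
    u_risky = risk_sensitive_utility(risky_returns, c)
    print(f"c={c:+.1f}  safe={u_safe:.3f}  risky={u_risky:.3f}")
```

With a negative risk preference the low-variance returns score higher, with a positive one the high-variance returns score higher, and with zero the two are tied because their means coincide.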
3.2 Estimate of Value Function and Its Gradient
Section 3.1 provides a general framework for policy optimization under uncertainty, assuming the availability of an estimation model $\hat{J}$ of the true value function $J$. In this section, we present a model-based method to compute $\hat{J}$ as an approximation of $J$. The main idea is to approximate the functions $f$ and $r$ with probabilistic parametric models $\hat{f}$ and $\hat{r}$ and fully propagate the estimated uncertainty when planning under each policy $\pi$ from an initial state $s_0$. The value function estimate can be formulated as
$$\hat{J}(\pi) = \sum_{t=0}^{T-1} \gamma^t\, \hat{r}(\hat{s}_t, a_t) \qquad (3)$$
where $a_t = \pi(\hat{s}_t)$ and $\hat{s}_{t+1} = \hat{f}(\hat{s}_t, a_t)$ for $t = 0, \dots, T-1$, starting from $\hat{s}_0 = s_0$. Next, we describe how to accurately model $\hat{f}$ with well-calibrated uncertainty and a rollout technique that allows the uncertainty to be faithfully propagated into $\hat{J}$.
3.2.1 Model Uncertainty with Online Bootstrap
As discussed in Section 1, there have been several prior attempts to learn uncertainty-aware dynamics models, including GPs, Bayesian neural networks, dropout neural networks and ensembles of neural networks. In this work, we employ an ensemble of bootstrapped neural networks. Bootstrapping is a generic, principled and statistical approach to uncertainty quantification. Furthermore, as explained later in Section 3.2.3, this ensemble approach also gives rise to easy gradient computation. In particular, $\hat{f}$ is represented as an ensemble $\{\hat{f}_{\theta_1}, \dots, \hat{f}_{\theta_B}\}$ of $B$ models.
For simplicity of implementation, we model each bootstrap replica as deterministic and rely on the ensemble as the sole mechanism for quantifying and propagating uncertainty. Each bootstrapped model $\hat{f}_{\theta_i}$, which is parameterized by $\theta_i$, learns to minimize the one-step prediction loss over the respective bootstrapped dataset $\mathcal{D}_i$:
$$\min_{\theta_i} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}_i} \left\| s_{t+1} - \hat{f}_{\theta_i}(s_t, a_t) \right\|_2^2 \qquad (4)$$
The training dataset $\mathcal{D}$ stores the transitions that the agent has experienced. Since each model observes its own subset of the real data samples, the predictions across the ensemble remain sufficiently diverse in the early stages of learning and then converge to their true values as the error of the individual networks decreases.
Bootstrap learning is often studied in the context of batch learning. However, since our agent updates its world model $\hat{f}$ after each physical step for the best possible sample efficiency, we follow the online bootstrapping method based on sampling from a Poisson distribution presented in Oza (2005); Qin et al. (2013). This is a very effective online approximation to batch bootstrapping, leveraging the following argument: bootstrapping a dataset $\mathcal{D}$ with $N$ examples means sampling $N$ examples from $\mathcal{D}$ with replacement. Each example will appear $K$ times in the bootstrapped sample, where $K$ is a random variable whose distribution is $\mathrm{Binomial}(N, 1/N)$, because during resampling the $i$-th example has $N$ chances to be picked, each with probability $1/N$. This distribution converges to $\mathrm{Poisson}(1)$ when $N \to \infty$. Therefore, for each new data point, this method adds $K$ copies of that data point to the bootstrapped dataset $\mathcal{D}_i$, where $K$ is sampled from a $\mathrm{Poisson}(1)$ distribution.
Unlike many other model-based approaches, we also learn the reward function, following the design of classical MBRL algorithms Sutton (1991). However, in our current implementation, we use a deterministic model for the reward function to simplify the policy evaluation. Note that, unlike the error from the dynamics model, the error from the reward model does not get compounded when we estimate $\hat{J}$.
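The online Poisson bootstrap described in this section can be sketched with the standard library alone. The class name `OnlineBootstrapBuffers`, the helper `sample_poisson` (Knuth's multiplication method), and the use of integers as stand-ins for transitions are assumptions made for illustration; the paper does not specify these details.

```python
import math
import random


def sample_poisson(rng, lam=1.0):
    """Draw one Poisson(lam) sample with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


class OnlineBootstrapBuffers:
    """One replay buffer per ensemble member, filled by online bootstrapping."""

    def __init__(self, num_models, seed=0):
        self.rng = random.Random(seed)
        self.buffers = [[] for _ in range(num_models)]

    def add(self, transition):
        # Each member independently receives K ~ Poisson(1) copies of the new
        # transition, so K = 0 (the member never sees it) is a common outcome.
        for buffer in self.buffers:
            copies = sample_poisson(self.rng, 1.0)
            buffer.extend([transition] * copies)


buffers = OnlineBootstrapBuffers(num_models=3, seed=0)
for step in range(500):
    buffers.add(step)  # an integer stands in for a (state, action, next state) tuple
print([len(b) for b in buffers.buffers])
```

Because the expected number of copies is one, each buffer grows at roughly the same rate as the raw data stream while holding a different multiset of transitions.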
3.2.2 Bootstrap Rollout
In this section, we describe how to propagate the estimates with uncertainty from the dynamics model to evaluate a policy $\pi$. We represent our policy $\pi_\phi: \mathcal{S} \to \mathcal{A}$ as a neural network parameterized by $\phi$. Note that we choose to represent our policy as deterministic. We argue that while all estimation models, including those of the dynamics and of the value function, need to be stochastic (i.e. uncertainty-aware), the policy does not need to be. The policy is not an estimator, and a deterministic policy simply means that the agent is consistent when taking an action, no matter how uncertain it may be about the world.
Given a policy $\pi$ and an initial state $s_0$, we can estimate the distribution of $\hat{J}(\pi)$ by simulating $\pi$ through each bootstrapped dynamics model. Since each bootstrap model is an independent approximator of the dynamics function, by expanding the value function via these dynamics approximators, we eventually obtain $B$ independent estimates of that value function. Finally, these separate and independent trajectories collectively form an ensemble estimator of $J$.
In practice, we sample these trajectories with a finite horizon $H$. It is still a challenge to expand the value function estimation over a very long horizon, for a few reasons:

neural network training becomes harder when the depth increases,

despite our best effort to control the uncertainty, we still do not have a guarantee that our uncertainty modeling is perfectly calibrated, which in turn may be problematic if the planning horizon is too large, and

the policy learning time is proportional to the rollout horizon.
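The bootstrap rollout can be illustrated with a toy example: one policy is simulated through every member of an ensemble, and the resulting returns are summarized by their mean and standard deviation. The scalar linear dynamics, the quadratic reward, the linear policy and all function names below are hypothetical stand-ins for the learned neural networks.

```python
import statistics


def rollout_return(dynamics, reward, policy, s0, horizon, gamma):
    """Discounted return of one imagined trajectory under a single model."""
    state, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        total += discount * reward(state, action)
        state = dynamics(state, action)
        discount *= gamma
    return total


def ensemble_value(ensemble, reward, policy, s0, horizon, gamma=0.99):
    """Mean and std of the returns obtained from each ensemble member."""
    returns = [rollout_return(f, reward, policy, s0, horizon, gamma)
               for f in ensemble]
    return statistics.fmean(returns), statistics.pstdev(returns), returns


def make_linear_model(a):
    # Toy deterministic dynamics s' = a * s + u; `a` differs per member.
    return lambda s, u: a * s + u


ensemble = [make_linear_model(a) for a in (0.95, 1.0, 1.05)]
reward = lambda s, u: -(s * s) - 0.1 * (u * u)
policy = lambda s: -0.5 * s

mean, std, returns = ensemble_value(ensemble, reward, policy, s0=1.0, horizon=20)
print(mean, std)
```

In this sketch the spread across members is zero for a one-step rollout, since every member starts from the same state, and becomes positive once the members' predictions diverge, which is the mechanism by which model disagreement reaches the value estimate.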
3.2.3 SGD and Gradient Computation
We can rewrite our objective as
$$\phi^* = \arg\max_{\phi} \mathbb{E}_{s_0}\left[U_\phi(s_0)\right] \qquad (5)$$
where $U_\phi(s_0) = \mu_\phi(s_0) + c\,\sigma_\phi(s_0)$. Using the ensemble method and the rollout technique described above, we can naturally compute $\mu_\phi(s_0)$ and $\sigma_\phi(s_0)$ for a given policy $\pi_\phi$ and for a given state $s_0$. Therefore, the policy can be updated using SGD or one of its variants.
The aforementioned rollout method also allows us to easily express $U_\phi(s_0)$ in Equation (5) as a single computational graph of $\phi$. This makes it straightforward to compute the policy gradient $\nabla_\phi U_\phi(s_0)$ using automatic differentiation, a feature provided out of the box in most popular deep learning toolkits.
4 Algorithm Summary
We summarize our method in Algorithm 1. Furthermore, in this section, we also highlight some important details in our implementation.
Online off-policy learning.
Except for the initialization step (we may initialize the models with batch training from off-policy data), our model learning is an online learning process. At each time step, the learning cost stays constant and does not grow over time, which is required for lifelong learning. Despite being online, the learning is off-policy because we maintain a bootstrapped replay buffer for each model in the ensemble. For each model update, we sample a minibatch of training data from the respective replay buffer. In addition, as mentioned, the models can also be initialized from existing data even before the policy optimization starts.
Linearly weighted random sampling.
Since our replay buffers are accumulated online, a naive uniform sampling strategy would lead to early data being sampled more frequently than later data. We thus propose a linearly weighted random sampling scheme to mitigate this early-data bias. In this sampling scheme, the $i$-th example is randomly sampled with weight $i$, i.e. fresher examples receive higher weights in each online update step. As proved in Appendix A.1, with this scheme early data points are still cumulatively sampled slightly more frequently than later points, but the cumulative bias gap is greatly reduced, and asymptotically all data points are equally sampled.
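A minimal sketch of linearly weighted sampling from a replay buffer is given below, assuming that the example at 1-based position i in insertion order gets weight i. The function name `sample_minibatch` and the integer-valued buffer are hypothetical; only `random.Random.choices` from the standard library is used.

```python
import random


def sample_minibatch(buffer, batch_size, rng):
    """Sample with replacement, weighting the i-th oldest example by i."""
    weights = list(range(1, len(buffer) + 1))
    return rng.choices(buffer, weights=weights, k=batch_size)


rng = random.Random(0)
buffer = list(range(100))  # position in the list doubles as the example's age rank
batch = sample_minibatch(buffer, batch_size=10000, rng=rng)
print(sum(batch) / len(batch))  # noticeably above the uniform average of 49.5
```

Recomputing the weight list on every call keeps the sketch short; an implementation that samples at every environment step would more likely maintain cumulative weights incrementally.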
5 Experiment
We provide two experimental analyses in this section. The first analyzes the model error as the planning horizon increases. The second provides benchmark comparisons against state-of-the-art baselines.
5.1 Model learning and compounding errors
Figure 2 shows an error analysis of our model on the simple Pendulum environment. We used an oracle model to measure the errors. From the figure, we can see that even when the loss is small, the compounded error in the value function estimation can grow very quickly. Therefore, it is crucial to estimate the uncertainty (e.g. as variance or as error bound) in a principled way.
5.2 Comparison to baseline algorithms
We evaluate the performance of our proposed MBPO algorithm on three continuous control tasks in the MuJoCo simulator Todorov et al. (2012): Pendulum-v0, Swimmer-v2 and HalfCheetah-v2 from OpenAI Gym Brockman et al. (2016).
Experimentation Protocol.
We compare our algorithm against the following state-of-the-art baseline algorithms designed for continuous control:

PPO Schulman et al. (2017): a model-free policy gradient algorithm,

DDPG Lillicrap et al. (2015): an off-policy model-free actor-critic algorithm,

SAC Haarnoja et al. (2018): a model-free actor-critic algorithm, which reports better data efficiency than DDPG and PPO on most MuJoCo benchmarks,

STEVE Buckman et al. (2018): a recent deterministic model-based algorithm.
Some other algorithms such as ME-TRPO Kurutach et al. (2018), MB-MPO Clavera et al. (2018) and PETS Chua et al. (2018) assume a known reward function. We were also, with reasonable effort, unable to get their open-source implementations to run on standard OpenAI Gym environments. We therefore do not include them in this benchmark study.
For each algorithm, we evaluate the learned policy after every episode (200 time steps for Pendulum-v0 and 1000 time steps for Swimmer-v2 and HalfCheetah-v2). The evaluation is done by running the current policy on 20 random episodes and computing the average return over them.
Results.
Figures 3 and 4 show that MBPO has superior sample efficiency compared to the baseline algorithms across a wide range of environments. Furthermore, its asymptotic performance is competitive with, or even better than, that of the model-free counterparts.
However, Figure 3 also shows that the performance of MBPO is sensitive to the random seed. We hypothesize that this is due to our strategy of aggressive online learning and policy update after each step. It might also be due to our choice of learning from the start, without the random exploration in the first episode that many other methods use. We plan to do a deeper analysis and address this instability issue in future work.
6 Discussion and Conclusion
Our experiments suggest that our MBPO algorithm can not only achieve the asymptotic performance of model-free methods in challenging continuous control tasks, but also do so with far fewer samples. It is also more sample efficient than other existing MBRL algorithms. We further demonstrate that the model bias issue in model-based RL can be dealt with effectively through principled and careful uncertainty quantification.
We acknowledge that our current implementation still has several limitations to overcome, such as the high variance of the performance, which still depends on many hyperparameters (plan horizon, risk sensitivity, and all hyperparameters associated with neural network training) and even on the random seed. Note that these traits are not unique to our method. Nevertheless, the results indicate that, if implemented right, model-based methods can be both sample efficient and have better asymptotic performance than model-free methods on challenging tasks. In addition, by explicitly representing both the dynamics model and the policy, MBPO enables transfer learning, not just for the world (dynamics) model but also for the policy.
Finally, we identify sample efficiency, off-policy learning, and transferability as three necessary, albeit not sufficient, properties for real-world reinforcement learning. We claim that our method meets these criteria and hence is a step towards real-world reinforcement learning.
References
 Abbeel et al. [2007] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in neural information processing systems, pages 1–8, 2007.
 Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Buckman et al. [2018] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234, 2018.
 Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
 Clavera et al. [2018] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.

 Deisenroth and Rasmussen [2011] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
 Depeweg et al. [2016] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
 Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 Gal et al. [2016] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, 2016.
 García and Fernández [2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
 Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Kamthe and Deisenroth [2017] Sanket Kamthe and Marc Peter Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. arXiv preprint arXiv:1706.06491, 2017.
 Kurutach et al. [2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
 Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
 Levine and Koltun [2013] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
 Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Markowitz [1952] Harry Markowitz. Portfolio selection. The journal of finance, 7(1):77–91, 1952.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Nagabandi et al. [2017] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. CoRR, abs/1708.02596, 2017. URL http://arxiv.org/abs/1708.02596.
 Osband et al. [2016] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034, 2016.
 Oza [2005] Nikunj C Oza. Online bagging and boosting. In 2005 IEEE international conference on systems, man and cybernetics, volume 3, pages 2340–2345. Ieee, 2005.
 Park et al. [2007] Byung-Hoon Park, George Ostrouchov, and Nagiza F Samatova. Sampling streaming data with replacement. Computational Statistics & Data Analysis, 52(2):750–762, 2007.
 Qin et al. [2013] Zhen Qin, Vaclav Petricek, Nikos Karampatziakis, Lihong Li, and John Langford. Efficient online bootstrapping for large scale learning. arXiv preprint arXiv:1312.5021, 2013.

 Sato et al. [2001] Makoto Sato, Hajime Kimura, and Shigenobu Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Shahriari et al. [2015] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
 Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
 Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 Young [2000] Laurence Chisholm Young. Lecture on the calculus of variations and optimal control theory, volume 304. American Mathematical Soc., 2000.
Appendix A Appendix
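A.1 Why linearly weighted random sampling is a fairer sampling scheme
Consider the following online learning process: at each time step, we need to randomly sample an example from the accumulating dataset. Suppose that at time $t$, the $i$-th example is randomly sampled with weight $w_i$. Note that at each time $t$, we have a total of $t$ examples in the dataset. Then the probability of that example being sampled is
$$p_i(t) = \frac{w_i}{\sum_{j=1}^{t} w_j}.$$
If we use uniformly random sampling, i.e. $w_i = 1$, then the expected number of times the $i$-th example gets selected until time $T$ is
$$E_i(T) = \sum_{t=i}^{T} \frac{1}{t}.$$
Hence, for all $T \ge j$, for $i < j$, $E_i(T)$ is larger than $E_j(T)$ by $\sum_{t=i}^{j-1} \frac{1}{t}$. Now, if we use a linearly weighted random sampling scheme, in which $w_i = i$, then the expected number of times the $i$-th example gets selected until time $T$ is
$$E_i(T) = \sum_{t=i}^{T} \frac{2i}{t(t+1)} = 2i\left(\frac{1}{i} - \frac{1}{T+1}\right) = 2\left(1 - \frac{i}{T+1}\right).$$
We can see that at time $T$, $E_i(T)$ is still larger than $E_j(T)$ for $i < j$, but the gap is now $2(j-i)/(T+1)$, which vanishes as $T \to \infty$. By weighting recent examples more in each online update step, we therefore reduce the overall early-data bias.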
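The comparison in this appendix can be checked numerically. The sketch below accumulates, for every example, its selection probability at each time step under a given weight function, assuming one draw per step and that the example with index i arrives at time i. For weights equal to the index, the sum telescopes to 2(1 - i/(T+1)), which the code compares against. The helper name `expected_counts` and the choice T = 200 are illustrative.

```python
def expected_counts(weight_fn, horizon):
    """Expected number of selections of each example after `horizon` steps."""
    counts = [0.0] * (horizon + 1)  # index 0 is unused; examples are 1-based
    for t in range(1, horizon + 1):
        total_weight = sum(weight_fn(j) for j in range(1, t + 1))
        for i in range(1, t + 1):
            counts[i] += weight_fn(i) / total_weight
    return counts


T = 200
uniform = expected_counts(lambda i: 1.0, T)
linear = expected_counts(lambda i: float(i), T)

# Gap between the oldest and the newest example under each scheme.
print("uniform gap:", uniform[1] - uniform[T])
print("linear gap: ", linear[1] - linear[T])
# Closed form for linear weights: 2 * (1 - i / (T + 1)).
print(linear[50], 2 * (1 - 50 / (T + 1)))
```

Under uniform sampling the oldest example's expected count grows like a harmonic sum in T, whereas under linear weighting every example's expected count stays below two.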
A.2 Environments
We evaluate the performance of our proposed MBPO algorithm on five continuous control tasks in the MuJoCo simulator from OpenAI Gym, and we keep the default configurations provided by OpenAI Gym.
Environment  State dimension  Action dimension  Task horizon 

Reacher-v2  
Pusher-v2  
Pendulum-v0  
Swimmer-v2  
HalfCheetah-v2 