Reinforcement learning (RL) has been successfully applied to game playing. However, its application in real-world scenarios is still limited. One of the main challenges is the high sample complexity of existing RL algorithms. This high sample complexity can be attributed to three factors: the sample complexity of the core learning algorithms, the ability of systems to learn from existing off-policy data, and the ability of a policy, as well as the world model in the case of model-based RL (MBRL), to be transferred to another task. We argue that for RL to be applicable in the real world, an RL system needs to possess three properties: low sample complexity of the learning algorithm, the capability to learn from off-policy data, and transferability to similar tasks. To the best of our knowledge, none of the existing algorithms has all of those properties. In this paper, we propose a novel RL framework that is sample efficient, able to bootstrap from off-policy data, and also able to bootstrap from an existing policy.
Popular RL algorithms are divided into two main paradigms: model-free and model-based. Model-free algorithms have been shown to achieve good asymptotic performance in many high-dimensional problems Mnih et al. (2015); Silver et al. (2017). However, they have a crucial limitation: they often require a massive amount of training data, which prevents their application in control and robotics domains, where the cost of interacting with the real system is high. The main reason is that model-free RL learns the state or state-action values only from rewards and does not explicitly exploit the rich information contained in the transition dynamics data.
In contrast to model-free RL (MFRL), model-based algorithms try to learn the transition dynamics, which is in turn used for imagining/planning without having to frequently interact with the real systems. Therefore, they are often considered more sample efficient than their model-free counterparts. In addition, the dynamics model is independent from rewards and thus can be transferred to other tasks in the same or similar environments. Nevertheless, the assumption about the accuracy of the learned dynamics model is usually not satisfied, especially in complex environments. The model error and its compounding effect when planning, i.e. a small bias in the model can lead to a highly erroneous value function estimate and a strongly-biased suboptimal policy, make MBRL less competitive in terms of asymptotic performance than MFRL for many non-trivial tasks.
Many solutions have been proposed to mitigate the model bias issue. Early work such as Deisenroth and Rasmussen (2011) uses Gaussian Processes (GPs) to capture the model uncertainty. GP-based methods are, however, computationally intractable and do not scale to complex environments such as MuJoCo Todorov et al. (2012) or Atari games. In recent years, deep neural networks (DNNs) have been gaining a lot of attention due to their high representational capacity and great successes in large-scale supervised training tasks. There have been several attempts at making DNNs uncertainty-aware. For example, Bayesian neural networks were used in Gal et al. (2016); Depeweg et al. (2016); Kamthe and Deisenroth (2017), and in Gal and Ghahramani (2016) dropout was proposed as a scalable approximation to GPs. More recently, bootstrapping, or ensembling in general, has become a favored technique for uncertainty modeling in DNNs, such as in Kurutach et al. (2018); Clavera et al. (2018). We hypothesize that this is because bootstrapping is a well-studied technique in statistics and ensembles are relatively easy to train.
Another limitation of many existing MBRL methods is that they rely on the model predictive control (MPC) framework. In this framework, at each time step $t$, the agent uses the learned dynamics model to plan $H$ (a hyperparameter) steps ahead, predicts the optimal sequence of actions $a_{t:t+H}$, then interacts with the environment using only the first action $a_t$. But MPC, while easy to understand, has two drawbacks. First, each MPC step requires solving a high-dimensional optimization problem and is thus computationally prohibitive for applications requiring real-time or low-latency reaction, such as autonomous driving. Second, the policy is only implicit, defined through the solution of that optimization problem. Not being able to represent the policy explicitly makes it hard to transfer the learned policy to other tasks or to initialize the agent with an existing better-than-random policy.
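To make the two drawbacks concrete, the following is a minimal sketch of an MPC loop using random shooting, one common way to approximate the planning step. The toy dynamics, the reward, the action range, and all function names are assumptions made for illustration and do not come from any of the cited methods.

```python
import random

def mpc_action(state, dynamics, reward, horizon, n_candidates, rng):
    """Plan `horizon` steps ahead by random shooting; return only the first action."""
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        # Sample one candidate action sequence (assumed action range [-1, 1]).
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            total += reward(s, a)
            s = dynamics(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

# Hypothetical 1-D system standing in for a learned dynamics model.
def toy_dynamics(s, a):
    return s + 0.5 * a

def toy_reward(s, a):
    return -(s ** 2) - 0.01 * a ** 2

rng = random.Random(0)
state = 2.0
for t in range(10):
    # The whole optimization is repeated at every environment step.
    action = mpc_action(state, toy_dynamics, toy_reward,
                        horizon=5, n_candidates=200, rng=rng)
    state = toy_dynamics(state, action)
```

The policy here exists only as the result of calling `mpc_action`, which illustrates why it is costly at run time and has no parameters that could be transferred or initialized from an existing policy.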
In contrast to MBRL, in the MFRL literature, policy gradient has been shown to be an effective technique for training an agent Lillicrap et al. (2015); Schulman et al. (2017); Haarnoja et al. (2018). In this paper, we propose a new Model-based Policy Optimization (MBPO) framework that combines MBRL and policy gradient techniques in a principled way. Our experiments demonstrate the superior sample efficiency of our MBPO agent, compared to state-of-the-art methods, under the condition that the agent starts from scratch.
In addition to promising experimental results, we also emphasize that our framework is designed to support learning and improving from existing knowledge and from existing policy. Figure 1 illustrates our broad approach to real-world reinforcement learning.
2 Related work
Initial successes in MBRL in continuous control achieved promising results by learning control policies trained on models of local dynamics using linear parametric approximators Abbeel et al. (2007); Levine and Koltun (2013). Alternative methods such as Deisenroth and Rasmussen (2011) incorporated non-parametric probabilistic GPs to capture model uncertainty during policy planning and evaluation. While these methods enhance data efficiency in low-dimensional tasks, their applications in more challenging domains such as environments involving non-contact dynamics and high-dimensional control remain limited by the inflexibility of their temporally local structure and intractable inference time. Our model, in contrast, aims to achieve both goals: asymptotic performance comparable to MFRL methods and, at the same time, data efficiency in those complex domains.
Recently, there has been a revived interest in using DNNs to learn predictive models of environments from data, drawing inspiration from ideas in the early MBRL literature. The large representational capacity of DNNs makes them suitable function approximators for complex environments. However, additional care usually has to be taken to avoid model bias, a situation where the DNNs overfit in the early stages of learning, resulting in inaccurate models. Nagabandi et al. (2017) combined a learned dynamics network with MPC to initialize the policy network and thereby accelerate learning in model-free deep RL. Chua et al. (2018) extended this idea by introducing a bootstrapped ensemble of probabilistic DNNs to model the predictive uncertainty of the learned networks, demonstrating that a pure model-based approach can attain the asymptotic performance of MFRL counterparts. However, the use of MPC to define a policy leads to slow run-time execution and makes it hard to transfer the policy across tasks.
Subsequent research proposed algorithms that leverage the learned ensemble of dynamics models to train a policy network. Kurutach et al. (2018) learned a stochastic policy via trust-region policy optimization, and Clavera et al. (2018) cast the policy gradient as a meta-learning adaptation step with respect to each member of the ensemble. Buckman et al. (2018) proposed an algorithm that learns a weighted combination of roll-outs of different horizon lengths, which dynamically interpolates between model-based and model-free learning based on the uncertainty in the model predictions. To our knowledge, this is the closest work to ours: like our method, it learns a reward function in addition to the dynamics function. Furthermore, none of the aforementioned works propagates the uncertainty all the way to the value function or uses the concept of a utility function to balance risk and return, as our model does.
Ensembles of DNNs provide a straightforward technique to obtain reliable estimates of predictive uncertainty Lakshminarayanan et al. (2017) and have been integrated with bootstrap to guide exploration in MFRL Osband et al. (2016). While many of the approaches mentioned in this section employ bootstrap to train an ensemble of models, we note that their implementations reconstruct the bootstrap datasets at every training iteration, which effectively exposes every model to every data sample and thus diminishes the advantage in uncertainty quantification achieved through bootstrap. Our model is different in that, to maintain online bootstrap datasets across the ensemble, it adds each incoming data sample to each dataset according to a Poisson probability distribution Park et al. (2007); Qin et al. (2013), thereby guaranteeing asymptotically consistent bootstrap datasets.
3 Uncertainty-aware Model-based Policy Optimization
3.1 Policy Optimization Formulation
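Consider a discrete-time Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, f, r, T, \gamma)$, in which $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is a deterministic transition function, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $T$ is a task horizon, and $\gamma$ is a discount factor. We define the return as the discounted sum of rewards along a trajectory induced by a policy $\pi$: $V^{\pi}(s_0) = \sum_{t=0}^{T-1} \gamma^{t} r(s_t, a_t)$. The goal of reinforcement learning is to find a policy that maximizes the expected return, i.e.
$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{s_0}\left[V^{\pi}(s_0)\right] \tag{1}$$
where $s_{t+1} = f(s_t, a_t)$, $a_t = \pi(s_t)$ for $t = 0, \dots, T-1$ and $s_0$ is randomly chosen from some initial distribution on $\mathcal{S}$.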
If the dynamics function and the reward function are given, solving (1) can be done using the Calculus of Variations Young (2000) or Policy Gradient Sutton et al. (2000), which is the equivalent of Calculus of Variations when the control function is parameterized or is finite dimensional.
However, in reinforcement learning, $f$ and $r$ are often unknown and hence Equation (1) becomes a blackbox optimization problem with an unknown objective function. Following the Bayesian approach commonly used in the blackbox optimization literature Shahriari et al. (2015), we propose to solve this problem by iteratively learning a probabilistic estimate $\hat{V}^{\pi}$ of $V^{\pi}$ from data and optimizing the policy according to this approximated model.
It is worth noting that any unbiased method would model $V^{\pi}$ as a probabilistic estimate, i.e. $\hat{V}^{\pi}(s_0)$ would be a distribution (as opposed to a point estimate) for a given $s_0$. Optimizing a stochastic objective is, however, not well-defined. Our solution is to transform $\hat{V}^{\pi}$ into a deterministic utility function that reflects a subjective measure balancing the risk and return. Following Markowitz (1952); Sato et al. (2001); García and Fernández (2015), we propose a risk-sensitive objective criterion using a linear combination of the mean and the standard deviation of $\hat{V}^{\pi}$. Formally stated, our objective criterion now becomes
$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{s_0}\left[\mu\big(\hat{V}^{\pi}(s_0)\big) + c \, \sigma\big(\hat{V}^{\pi}(s_0)\big)\right]$$
where $\mu$ and $\sigma$ are the mean and the standard deviation of $\hat{V}^{\pi}(s_0)$ respectively, and $c$ is a constant that represents the subjective risk preference of the learning agent. A positive risk preference implies that the agent is adventurous, while a negative risk preference indicates that the agent follows a safe exploration strategy.
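As a small illustration of the mean plus weighted standard deviation criterion, the sketch below scores two sets of value estimates that share the same mean but differ in spread. The function name, the use of the population standard deviation, and the numbers are assumptions chosen for the example.

```python
import statistics

def risk_sensitive_utility(value_estimates, risk_preference):
    """Mean of the estimates plus risk_preference times their standard deviation."""
    mean = statistics.fmean(value_estimates)
    std = statistics.pstdev(value_estimates)  # population std; an assumption
    return mean + risk_preference * std

# Two hypothetical sets of value estimates with the same mean (10.0).
certain = [10.0, 10.0, 10.0, 10.0]
uncertain = [4.0, 8.0, 12.0, 16.0]

cautious = (risk_sensitive_utility(certain, -1.0),
            risk_sensitive_utility(uncertain, -1.0))
adventurous = (risk_sensitive_utility(certain, 1.0),
               risk_sensitive_utility(uncertain, 1.0))
```

With a negative risk preference the low-variance estimates score higher, and with a positive one the high-variance estimates do, which matches the safe versus adventurous reading of the constant.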
3.2 Estimate of Value Function and Its Gradient
Section 3.1 above provides a general framework for policy optimization under uncertainty, assuming the availability of an estimation model $\hat{V}^{\pi}$ of the true value function $V^{\pi}$. In this section, we present a model-based method to compute $\hat{V}^{\pi}$ as an approximation of $V^{\pi}$. The main idea is to approximate the functions $f$ and $r$ with probabilistic parametric models $\hat{f}$ and $\hat{r}$ and fully propagate the estimated uncertainty when planning under each policy $\pi$ from an initial state $s_0$. The value function estimate can be formulated as
$$\hat{V}^{\pi}(s_0) = \sum_{t=0}^{T-1} \gamma^{t} \, \hat{r}(\hat{s}_t, a_t)$$
where $\hat{s}_0 = s_0$, $a_t = \pi(\hat{s}_t)$ and $\hat{s}_{t+1} = \hat{f}(\hat{s}_t, a_t)$ for $t = 0, \dots, T-1$. Next, we describe how to accurately model $f$ with well-calibrated uncertainty and a rollout technique that allows the uncertainty to be faithfully propagated into $\hat{V}^{\pi}$.
3.2.1 Model Uncertainty with Online Bootstrap
As discussed in Section 1, there have been several prior attempts to learn uncertainty-aware dynamics models, including GPs, Bayesian neural networks, dropout neural networks and ensembles of neural networks. In this work, we employ an ensemble of bootstrapped neural networks. Bootstrapping is a generic, principled, statistical approach to uncertainty quantification. Furthermore, as explained later in Section 3.2.3, this ensemble approach also gives rise to easy gradient computation. In particular, $\hat{f}$ is represented as an ensemble $\{\hat{f}_{\theta_k}\}_{k=1}^{K}$ of $K$ models.
For simplicity of implementation, we model each bootstrap replica as deterministic and rely on the ensemble as the sole mechanism for quantifying and propagating uncertainty. Each bootstrapped model $\hat{f}_{\theta_k}$, which is parameterized by $\theta_k$, learns to minimize the one-step prediction loss over the respective bootstrapped dataset $\mathcal{D}_k$:
$$\min_{\theta_k} \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}_k} \left\| s_{t+1} - \hat{f}_{\theta_k}(s_t, a_t) \right\|_2^2$$
The training dataset $\mathcal{D}$ stores the transitions that the agent has experienced. Since each model observes its own subset of the real data samples, the predictions across the ensemble remain sufficiently diverse in the early stages of learning and then converge to their true values as the error of the individual networks decreases.
Bootstrap learning is often studied in the context of batch learning. However, since our agent updates its world model after each physical step for the best possible sample efficiency, we follow the online bootstrapping method based on sampling from a Poisson distribution presented in Oza (2005); Qin et al. (2013). This is a very effective online approximation to batch bootstrapping, leveraging the following argument: bootstrapping a dataset $\mathcal{D}$ with $N$ examples means sampling $N$ examples from $\mathcal{D}$ with replacement. Each example $i$ will appear $z_i$ times in the bootstrapped sample, where $z_i$ is a random variable whose distribution is $\mathrm{Binomial}(N, 1/N)$ because during resampling, the $i$-th example will have $N$ chances to be picked, each with probability $1/N$. This distribution converges to $\mathrm{Poisson}(1)$ when $N \to \infty$. Therefore, for each new data point, this method adds $z_k$ copies of that data point to the bootstrapped dataset $\mathcal{D}_k$, where $z_k$ is sampled from a $\mathrm{Poisson}(1)$ distribution.
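The online bootstrap described above can be sketched in a few lines. This is an illustrative sketch only: the class and function names are hypothetical, transitions are placeholder tuples, and the Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson generator.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

class OnlineBootstrapBuffers:
    """One replay buffer per ensemble member, filled by online bootstrap."""

    def __init__(self, n_models, seed=0):
        self.rng = random.Random(seed)
        self.buffers = [[] for _ in range(n_models)]

    def add(self, transition):
        # Each buffer independently receives z ~ Poisson(1) copies.
        for buffer in self.buffers:
            copies = sample_poisson(1.0, self.rng)
            buffer.extend([transition] * copies)

buffers = OnlineBootstrapBuffers(n_models=5, seed=0)
for step in range(2000):
    buffers.add((step, 0.0, step + 1))  # placeholder (state, action, next_state)

sizes = [len(b) for b in buffers.buffers]
coverage = len(set(buffers.buffers[0])) / 2000
```

Each buffer ends up with roughly as many entries as there are transitions, while containing only about $1 - e^{-1} \approx 63\%$ of the distinct transitions, as expected of a bootstrap sample.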
Unlike many other model-based approaches, we also learn the reward function, following the design of classical MBRL algorithms Sutton (1991). However, in our current implementation, we use a deterministic model for the reward function to simplify the policy evaluation. Note that unlike the error from the dynamics model, the error from the reward model does not get compounded when we estimate $\hat{V}^{\pi}$.
3.2.2 Bootstrap Rollout
In this section, we describe how to propagate the estimates with uncertainty from the dynamics model to evaluate a policy $\pi$. We represent our policy $\pi_{\phi}: \mathcal{S} \to \mathcal{A}$ as a neural network parameterized by $\phi$. Note that we choose to represent our policy as deterministic. We argue that while all estimation models, including those of the dynamics and of the value function, need to be stochastic (i.e. uncertainty-aware), the policy does not need to be. The policy is not an estimator, and a deterministic policy simply means that the agent is consistent when taking an action, no matter how uncertain it may be about the world.
Given a policy $\pi$ and an initial state $s_0$, we can estimate the distribution of $\hat{V}^{\pi}(s_0)$ by simulating $\pi$ through each bootstrapped dynamics model. Since each bootstrap model is an independent approximator of the dynamics function, expanding the value function via these dynamics approximators yields $K$ independent estimates of that value function. Finally, these separate and independent trajectories collectively form an ensemble estimator of $\hat{V}^{\pi}(s_0)$.
In practice, we sample these trajectories with a finite horizon $H$. Expanding the value function estimate over a very long horizon remains challenging for several reasons:
neural network training becomes harder when the depth increases,
despite our best effort to control the uncertainty, we still do not have a guarantee that our uncertainty modeling is perfectly calibrated, which in turn may be problematic if the planning horizon is too large, and
the policy learning time is proportional to the rollout horizon.
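The bootstrap rollout can be sketched as follows. The ensemble below is a hypothetical stand-in (five 1-D linear models that disagree on a gain parameter) rather than a set of trained networks, and the policy, reward model, and function names are likewise assumptions made for the example.

```python
import statistics

def rollout_value(s0, policy, dynamics_model, reward_model, horizon, gamma):
    """Discounted return of one imagined trajectory under one dynamics model."""
    state, value = s0, 0.0
    for t in range(horizon):
        action = policy(state)
        value += (gamma ** t) * reward_model(state, action)
        state = dynamics_model(state, action)
    return value

def ensemble_value(s0, policy, ensemble, reward_model, horizon, gamma):
    """Roll the same policy through every ensemble member; return mean, std, values."""
    values = [rollout_value(s0, policy, model, reward_model, horizon, gamma)
              for model in ensemble]
    return statistics.fmean(values), statistics.pstdev(values), values

def make_model(gain):
    # Hypothetical dynamics replica: s' = s + gain * a.
    def model(state, action):
        return state + gain * action
    return model

def policy(state):
    return -0.5 * state  # deterministic policy

def reward_model(state, action):
    return -(state ** 2)  # single deterministic reward model

ensemble = [make_model(g) for g in (0.8, 0.9, 1.0, 1.1, 1.2)]
mean_short, std_short, _ = ensemble_value(1.0, policy, ensemble, reward_model, 2, 0.99)
mean_long, std_long, _ = ensemble_value(1.0, policy, ensemble, reward_model, 10, 0.99)
```

In this toy setting the spread of the value estimates grows with the rollout horizon, which is the disagreement the utility function is meant to take into account.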
3.2.3 SGD and Gradient Computation
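We can rewrite our objective as
$$\max_{\phi} \; \mathbb{E}_{s_0}\left[U_{\phi}(s_0)\right]$$
where $U_{\phi}(s_0) = \mu\big(\hat{V}^{\pi_{\phi}}(s_0)\big) + c \, \sigma\big(\hat{V}^{\pi_{\phi}}(s_0)\big)$. Using the ensemble method and the rollout technique described above, we can naturally compute $\mu\big(\hat{V}^{\pi_{\phi}}(s_0)\big)$ and $\sigma\big(\hat{V}^{\pi_{\phi}}(s_0)\big)$ for a given policy $\pi_{\phi}$ and for a given state $s_0$. Therefore, the policy can be updated using SGD or one of its variants.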
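The sketch below illustrates gradient-based policy improvement on a risk-sensitive utility in a toy setting. It makes several simplifying assumptions that are not part of the method: the policy is a single scalar gain, the ensemble is five hand-written linear models, a single fixed initial state replaces the expectation over states, and central finite differences stand in for backpropagation through the rollouts.

```python
import statistics

GAINS = (0.8, 0.9, 1.0, 1.1, 1.2)  # hypothetical ensemble: s' = s + gain * a

def utility(theta, s0=1.0, horizon=5, gamma=0.99, risk=-1.0):
    """Mean plus risk times std of the ensemble value estimates for policy a = theta * s."""
    values = []
    for gain in GAINS:
        state, value = s0, 0.0
        for t in range(horizon):
            action = theta * state  # linear deterministic policy
            value += gamma ** t * (-(state ** 2) - 0.1 * action ** 2)
            state = state + gain * action
        values.append(value)
    return statistics.fmean(values) + risk * statistics.pstdev(values)

def finite_difference_grad(fn, theta, eps=1e-5):
    """Central finite difference; a stand-in for automatic differentiation."""
    return (fn(theta + eps) - fn(theta - eps)) / (2 * eps)

theta = 0.0
initial = utility(theta)
for _ in range(200):
    theta += 0.01 * finite_difference_grad(utility, theta)  # gradient ascent step
final = utility(theta)
```

With a neural network policy and neural network dynamics models, the same quantity would typically be differentiated with an automatic differentiation library, since every step of the rollout is a deterministic, differentiable function of the policy parameters.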
4 Algorithm Summary
We summarize our method in Algorithm 1. Furthermore, in this section, we also highlight some important details in our implementation.
Online off-policy learning.
Except for the initialization step (we may initialize the models with batch training from off-policy data), our model learning is an online learning process. For each time step, the learning cost stays constant and does not grow over time, which is required for lifelong learning. Despite being online, the learning is off-policy because we maintain a bootstrapped replay buffer for each model in the ensemble. For each model update, we sample a minibatch of training data from the respective replay buffer. In addition, as mentioned, the models can also be initialized from existing data even before the policy optimization starts.
Linearly weighted random sampling.
Since our replay buffers are accumulated online, a naive uniform sampling strategy would lead to early data being sampled more frequently than later data. We thus propose a linearly weighted random sampling scheme to mitigate this early-data bias. In this sampling scheme, the $i$-th example is randomly sampled with weight $i$, i.e. fresher examples receive higher weights in each online update step. As proved in Appendix A.1, with this scheme, early data points are still cumulatively sampled slightly more frequently than later points, but the cumulative bias gap is greatly reduced, and asymptotically all data points are equally sampled.
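A minimal sketch of the linearly weighted sampling scheme, assuming the replay buffer is a plain Python list ordered from oldest to newest; the function name and the buffer contents are hypothetical.

```python
import random

def sample_minibatch(buffer, batch_size, rng):
    """Sample with replacement, weighting the i-th oldest example by i."""
    weights = list(range(1, len(buffer) + 1))
    return rng.choices(buffer, weights=weights, k=batch_size)

rng = random.Random(0)
buffer = list(range(100))  # placeholder examples, index 0 is the oldest
batch = sample_minibatch(buffer, 20000, rng)

mean_index = sum(batch) / len(batch)
recent_fraction = sum(1 for x in batch if x >= 50) / len(batch)
```

Within a single update step, the newer half of the buffer is drawn roughly three times as often as the older half, which compensates for older examples having been available in more update steps.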
5 Experiments
We provide two experimental analyses in this section. The first analyzes the model error as the planning horizon increases. The second provides benchmark comparisons against state-of-the-art baselines.
5.1 Model learning and compounding errors
Figure 2 shows an error analysis of our model on the simple Pendulum environment. We used an oracle model to measure the errors. From the figure, we can see that even when the loss is small, the compounded error in the value function estimation can grow very quickly. Therefore, it is crucial to estimate the uncertainty (e.g. as variance or as error bound) in a principled way.
5.2 Comparison to baseline algorithms
We evaluate the performance of our proposed MBPO algorithm on three continuous control tasks in the MuJoCo simulator Todorov et al. (2012): Pendulum-v0, Swimmer-v2 and HalfCheetah-v2 from OpenAI Gym Brockman et al. (2016).
We compare our algorithm against the following state-of-the-art baseline algorithms designed for continuous control:
PPO Schulman et al. (2017): a model-free policy gradient algorithm,
DDPG Lillicrap et al. (2015): an off-policy model-free actor-critic algorithm,
SAC Haarnoja et al. (2018): a model-free actor-critic algorithm, which reports better data-efficiency than DDPG and PPO on most MuJoCo benchmarks,
STEVE Buckman et al. (2018): a recent deterministic model-based algorithm.
Some other algorithms, such as ME-TRPO Kurutach et al. (2018), MB-MPO Clavera et al. (2018), and PETS Chua et al. (2018), assume a known reward function. (Footnote 1: We were also, with reasonable effort, unable to get their open-source implementations to run on standard OpenAI Gym environments.) We therefore do not include them in this benchmark study.
For each algorithm, we evaluate the learned policy after every episode (200 time steps for Pendulum-v0 and 1000 time steps for Swimmer-v2 and HalfCheetah-v2). The evaluation is done by running the current policy on 20 random episodes and computing the average return over them.
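The evaluation protocol can be sketched as below. The environment is a hypothetical 1-D toy system rather than a Gym task, and the function names and the two example policies are assumptions made for illustration; only the episode length of 200 and the 20 evaluation episodes mirror the setup described above.

```python
import random
import statistics

def evaluate(policy, reset, step, reward, episode_length, n_episodes, rng):
    """Average undiscounted return of a policy over randomly initialized episodes."""
    returns = []
    for _ in range(n_episodes):
        state, total = reset(rng), 0.0
        for _ in range(episode_length):
            action = policy(state)
            total += reward(state, action)
            state = step(state, action)
        returns.append(total)
    return statistics.fmean(returns)

# Hypothetical toy environment.
def toy_reset(rng):
    return rng.uniform(-1.0, 1.0)

def toy_step(state, action):
    return state + 0.5 * action

def toy_reward(state, action):
    return -(state ** 2)

def zero_policy(state):
    return 0.0

def feedback_policy(state):
    return -state

# The same seed gives both policies the same 20 initial states.
score_zero = evaluate(zero_policy, toy_reset, toy_step, toy_reward,
                      200, 20, random.Random(0))
score_feedback = evaluate(feedback_policy, toy_reset, toy_step, toy_reward,
                          200, 20, random.Random(0))
```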
Figures 3 and 4 show that MBPO has superior sample efficiency compared to the baseline algorithms across the evaluated environments. Furthermore, its asymptotic performance is competitive with, or even better than, that of the model-free counterparts.
However, Figure 3 also shows that the performance of MBPO is sensitive to the random seed. We hypothesize that this is due to our strategy of aggressive online learning and policy update after each step. It might also be due to our choice of learning early without random exploration on the first episode as many other methods apply. We plan to do a deeper analysis and address this instability issue in our future work.
6 Discussion and Conclusion
Our experiments suggest that our MBPO algorithm not only can achieve the asymptotic performance of model-free methods in challenging continuous control tasks, but also does so with far fewer samples. It is also more sample efficient than other existing MBRL algorithms. We further demonstrate that the model bias issue in model-based RL can be dealt with effectively through principled and careful uncertainty quantification.
We acknowledge that our current implementation still has several limitations to overcome, such as the high variance of its performance, which depends on many hyper-parameters (plan horizon, risk sensitivity, and all hyper-parameters associated with neural network training) and even on the random seed. Note that these traits are not unique to our method. Nevertheless, the results indicate that, if implemented right, model-based methods can be both sample efficient and have better asymptotic performance than model-free methods on challenging tasks. In addition, by explicitly representing both the dynamics model and the policy, MBPO enables transfer learning, not just for the world (dynamics) model but also for the policy.
Finally, we identify that sample efficiency, off-policy learning, and transferability are the three necessary, albeit not sufficient, properties for real-world reinforcement learning. We claim that our method meets these criteria and hence is a step towards real-world reinforcement learning.
References
- Abbeel et al.  Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in neural information processing systems, pages 1–8, 2007.
- Brockman et al.  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Buckman et al.  Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234, 2018.
- Chua et al.  Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
- Clavera et al.  Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
- Deisenroth and Rasmussen  Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
- Depeweg et al.  Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and policy search in stochastic dynamical systems with bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
- Gal and Ghahramani  Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- Gal et al.  Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving pilco with bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, 2016.
- García and Fernández  Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Haarnoja et al.  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
- Kamthe and Deisenroth  Sanket Kamthe and Marc Peter Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. arXiv preprint arXiv:1706.06491, 2017.
- Kurutach et al.  Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
- Lakshminarayanan et al.  Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
- Levine and Koltun  Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
- Lillicrap et al.  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Markowitz  Harry Markowitz. Portfolio selection. The journal of finance, 7(1):77–91, 1952.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Nagabandi et al.  Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. CoRR, abs/1708.02596, 2017. URL http://arxiv.org/abs/1708.02596.
- Osband et al.  Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034, 2016.
- Oza  Nikunj C Oza. Online bagging and boosting. In 2005 IEEE international conference on systems, man and cybernetics, volume 3, pages 2340–2345. Ieee, 2005.
- Park et al.  Byung-Hoon Park, George Ostrouchov, and Nagiza F Samatova. Sampling streaming data with replacement. Computational Statistics & Data Analysis, 52(2):750–762, 2007.
- Qin et al.  Zhen Qin, Vaclav Petricek, Nikos Karampatziakis, Lihong Li, and John Langford. Efficient online bootstrapping for large scale learning. arXiv preprint arXiv:1312.5021, 2013.
- Sato et al.  Makoto Sato, Hajime Kimura, and Shibenobu Kobayashi. Td algorithm for the variance of return and mean-variance. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.
- Schulman et al.  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shahriari et al.  Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
- Silver et al.  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Sutton  Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
- Sutton et al.  Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
- Todorov et al.  Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- Young  Laurence Chisholm Young. Lecture on the calculus of variations and optimal control theory, volume 304. American Mathematical Soc., 2000.
Appendix A Appendix
A.1 Why linearly weighted random sampling is a fairer sampling scheme
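Consider the following online learning process: at each time step, we need to randomly sample an example from the accumulating dataset. Suppose that at time $t$, the $i$-th example is randomly sampled with weight $w_i$. Note that at each time $t$, we have a total of $t$ examples in the dataset. Then the probability of that example being sampled is
$$p_t(i) = \frac{w_i}{\sum_{j=1}^{t} w_j}, \qquad i \le t.$$
If we use uniformly random sampling, i.e. $w_i = 1$, then the expected number of times the $i$-th example gets selected until time $T$ is
$$N_T(i) = \sum_{t=i}^{T} \frac{1}{t}.$$
Hence, for all $T$, for $i < j \le T$, $N_T(i)$ is larger than $N_T(j)$ by $\sum_{t=i}^{j-1} \frac{1}{t}$. Now, if we use a linearly weighted random sampling scheme, in which $w_i = i$, then the expected number of times the $i$-th example gets selected until time $T$ is
$$N_T(i) = \sum_{t=i}^{T} \frac{i}{t(t+1)/2} = 2i \sum_{t=i}^{T} \left(\frac{1}{t} - \frac{1}{t+1}\right) = 2\left(1 - \frac{i}{T+1}\right).$$
We can see that at time $T$, $N_T(i)$ is still larger than $N_T(j)$ for $i < j$, but by weighting recent examples more in each online update step, we reduce the overall early-data bias: the gap is bounded by 2 for any pair of examples, and for any fixed $i$ the expected count tends to 2 as $T \to \infty$.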
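The expected selection counts can be checked numerically with exact rational arithmetic. The sketch below assumes one sample is drawn per time step, that the $i$-th example becomes available at time $i$, and that weights are $w_i = 1$ (uniform) or $w_i = i$ (linear); the function names are hypothetical.

```python
from fractions import Fraction

def expected_count_uniform(i, horizon):
    """Expected number of selections of example i up to `horizon`, uniform weights."""
    return sum(Fraction(1, t) for t in range(i, horizon + 1))

def expected_count_linear(i, horizon):
    """Same quantity with weight w_i = i; the weights sum to t(t+1)/2 at time t."""
    return sum(Fraction(i, t * (t + 1) // 2) for t in range(i, horizon + 1))

T = 1000
gap_uniform = expected_count_uniform(1, T) - expected_count_uniform(T, T)
gap_linear = expected_count_linear(1, T) - expected_count_linear(T, T)
```

For $T = 1000$, the gap between the oldest and the newest example is roughly 7.5 expected selections under uniform sampling and just under 2 under linear weighting.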
We evaluate the performance of our proposed MBPO algorithm on five continuous control tasks in the MuJoCo simulator from OpenAI Gym and we keep the default configurations provided by OpenAI Gym.
| Environment | State dimension | Action dimension | Task horizon |