Uncertainty-aware Model-based Policy Optimization

by Tung-Long Vuong, et al.

Model-based reinforcement learning has the potential to be more sample efficient than model-free approaches. However, existing model-based methods are vulnerable to model bias, which leads to poor generalization and asymptotic performance compared to model-free counterparts. In addition, they are typically based on the model predictive control (MPC) framework, which not only is computationally inefficient at decision time but also does not enable policy transfer due to the lack of an explicit policy representation. In this paper, we propose a novel uncertainty-aware model-based policy optimization framework which solves those issues. In this framework, the agent simultaneously learns an uncertainty-aware dynamics model and optimizes the policy according to these learned models. In the optimization step, the policy gradient is computed by automatic differentiation through the models. With respect to sample efficiency alone, our approach shows promising results on challenging continuous control benchmarks with competitive asymptotic performance and significantly lower sample complexity than state-of-the-art baselines.





1 Introduction

Reinforcement learning (RL) has been successfully applied to game playing. However, its application in real-world scenarios is still limited. One of the main challenges is the high sample complexity of existing RL algorithms. This high sample complexity can be attributed to three factors: the sample complexity of the core learning algorithms, the ability of systems to learn from existing off-policy data, and the ability of a policy, as well as of the world model in the case of model-based RL (MBRL), to be transferred to another task. We argue that for RL to be applicable in the real world, an RL system needs to possess three properties: low sample complexity of the learning algorithm, the capability to learn from off-policy data, and transferability to similar tasks. To the best of our knowledge, none of the existing algorithms has all of those properties. In this paper, we propose a novel RL framework that is sample efficient, able to bootstrap from off-policy data, and also able to bootstrap from an existing policy.

Popular RL algorithms are divided into two main paradigms: model-free and model-based. Model-free algorithms have been shown to achieve good asymptotic performance in many high-dimensional problems Mnih et al. (2015); Silver et al. (2017). However, they have a crucial limitation: they often require a massive number of data samples for training, which hinders their application in control and robotics domains, where the costs associated with real-system interactions are high. The main reason is that model-free RL learns the state or state-action values only from rewards and does not explicitly exploit the rich information underlying the transition dynamics data.

In contrast to model-free RL (MFRL), model-based algorithms try to learn the transition dynamics, which is in turn used for imagining/planning without having to frequently interact with the real systems. Therefore, they are often considered more sample efficient than their model-free counterparts. In addition, the dynamics model is independent from rewards and thus can be transferred to other tasks in the same or similar environments. Nevertheless, the assumption about the accuracy of the learned dynamics model is usually not satisfied, especially in complex environments. The model error and its compounding effect when planning, i.e. a small bias in the model can lead to a highly erroneous value function estimate and a strongly-biased suboptimal policy, make MBRL less competitive in terms of asymptotic performance than MFRL for many non-trivial tasks.

Many solutions have been proposed to mitigate the model bias issue. Early work such as Deisenroth and Rasmussen (2011) uses Gaussian Processes (GP) to capture the model uncertainty. GP-based methods are, however, computationally intractable and unscalable for complex environments such as MuJoCo Todorov et al. (2012) or Atari games. In recent years, deep neural networks (DNNs) have been gaining a lot of attention due to their high representational capacity and great successes in large-scale supervised training tasks. There have been several attempts at making DNNs uncertainty-aware. For example, Bayesian neural networks were used in Gal et al. (2016); Depeweg et al. (2016); Kamthe and Deisenroth (2017), and in Gal and Ghahramani (2016) dropout was proposed as a scalable approximation to GP. More recently, bootstrapping, or ensembling in general, has become a favored technique for uncertainty modeling in DNNs, as in Kurutach et al. (2018); Clavera et al. (2018). We hypothesize that this is because bootstrapping is a well-studied technique in statistics and ensembles are relatively easy to train.

Another limitation of many existing MBRL methods is that they rely on the MPC framework. In this framework, at each time step $t$, the agent uses the learned dynamics model to plan $H$ (a hyperparameter) steps ahead, predicts the optimal sequence of actions $a_t, \dots, a_{t+H-1}$, then interacts with the environment using $a_t$. But MPC, while easy to understand, has two drawbacks. First, each MPC step requires solving a high-dimensional optimization problem and thus is computationally prohibitive for applications requiring either real-time or low-latency reaction, such as autonomous driving. Second, the policy is only implicit, defined via the solution of that optimization problem. Without an explicit policy representation, it is hard to transfer the learned policy to other tasks or to initialize the agent with an existing better-than-random policy.

In contrast to MBRL, in the MFRL literature, policy gradient has been shown to be an effective technique to train an agent Lillicrap et al. (2015); Schulman et al. (2017); Haarnoja et al. (2018). In this paper, we propose a new Model-based Policy Optimization (MBPO) framework that combines MBRL and policy gradient techniques in a principled way. Our experiments demonstrate the superior sample efficiency of our MBPO agent, compared to state-of-the-art methods, under the condition that the agent starts from scratch.

In addition to promising experimental results, we also emphasize that our framework is designed to support learning and improving from existing knowledge and from existing policy. Figure 1 illustrates our broad approach to real-world reinforcement learning.


Figure 1: Model-based Policy Optimization Framework

2 Related work

Initial successes in MBRL in continuous control achieved promising results by learning control policies trained on models of local dynamics using linear parametric approximators Abbeel et al. (2007); Levine and Koltun (2013). Alternate methods such as Deisenroth and Rasmussen (2011) incorporated non-parametric probabilistic GPs to capture model uncertainty during policy planning and evaluation. While these methods enhance data efficiency in low-dimensional tasks, their applications in more challenging domains such as environments involving non-contact dynamics and high-dimensional control remain limited by the inflexibility of their temporally local structure and intractable inference time. Our model, on the contrary, achieves both targets of having asymptotically high performance compared to MFRL methods and, at the same time, retaining data efficiency in those complex domains.

Recently, there has been a revived interest in using DNNs to learn predictive models of environments from data, drawing inspiration from ideas in the early MBRL literature. The large representational capacity of DNNs makes them suitable function approximators for complex environments. However, additional care usually has to be taken to avoid model bias, a situation where the DNNs overfit in the early stages of learning, resulting in inaccurate models. Nagabandi et al. (2017) combined a learned dynamics network with MPC to initialize the policy network to accelerate learning in model-free deep RL. Chua et al. (2018) extended this idea by introducing a bootstrapped ensemble of probabilistic DNNs to model predictive uncertainty of the learned networks and demonstrating that a pure model-based approach can attain the asymptotic performance of MFRL counterparts. However, the use of MPC to define a policy leads to slow run-time execution and makes it hard to transfer the policy across tasks.

Subsequent research proposed algorithms that leverage the learned ensemble of dynamics models to train a policy network. Kurutach et al. (2018) learned a stochastic policy via trust-region policy optimization, and Clavera et al. (2018) cast the policy gradient as a meta-learning adaptation step with respect to each member of the ensemble. Buckman et al. (2018) proposed an algorithm to learn a weighted combination of roll-outs of different horizon lengths, which dynamically interpolates between model-based and model-free learning based on the uncertainty in the model predictions. To our knowledge, this is the work closest to ours, as it also learns a reward function in addition to the dynamics function. Furthermore, none of the aforementioned works propagates the uncertainty all the way to the value function or uses the concept of a utility function to balance risk and return, as our model does.

Ensembles of DNNs provide a straightforward technique to obtain reliable estimates of predictive uncertainty Lakshminarayanan et al. (2017) and have been integrated with bootstrap to guide exploration in MFRL Osband et al. (2016). While many of the approaches mentioned in this section employ bootstrap to train an ensemble of models, we note that their implementations consist of reconstructing bootstrap datasets at every training iteration, which effectively trains each model on every single data sample and thus diminishes the advantage in uncertainty quantification achieved through bootstrap. Our model is different in that, to maintain online bootstrap datasets across the ensemble, it adds each incoming data sample to a dataset according to a Poisson probability distribution Park et al. (2007); Qin et al. (2013), thereby guaranteeing asymptotically consistent bootstrap datasets.

3 Uncertainty-aware Model-based Policy Optimization

3.1 Policy Optimization Formulation

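Consider a discrete-time Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, f, r, T, \gamma)$, in which $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is a deterministic transition function, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $T$ is a task horizon, and $\gamma$ is a discount factor. We define the return as the sum of rewards along a trajectory induced by a policy $\pi$: $V^{\pi}(s_0) = \sum_{t=0}^{T-1} \gamma^{t} r(s_t, a_t)$. The goal of reinforcement learning is to find a policy that maximizes the expected return, i.e.

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{s_0}\left[V^{\pi}(s_0)\right] \tag{1}$$

where $a_t = \pi(s_t)$, $s_{t+1} = f(s_t, a_t)$ for $t = 0, \dots, T-1$, and $s_0$ is randomly chosen from some initial distribution on $\mathcal{S}$.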

If the dynamics function and the reward function are given, solving (1) can be done using the Calculus of Variations Young (2000) or Policy Gradient Sutton et al. (2000), which is the equivalent of Calculus of Variations when the control function is parameterized or is finite dimensional.

However, in reinforcement learning, $f$ and $r$ are often unknown and hence Equation (1) becomes a blackbox optimization problem with an unknown objective function. Following the Bayesian approach commonly used in the blackbox optimization literature Shahriari et al. (2015), we propose to solve this problem by iteratively learning a probabilistic estimate $\hat{V}^{\pi}$ of $V^{\pi}$ from data and optimizing the policy according to this approximated model.

It is worth noting that any unbiased method would model $V^{\pi}$ as a probabilistic estimate, i.e. $\hat{V}^{\pi}(s_0)$ would be a distribution (as opposed to a point estimate) for a given $s_0$. Optimizing a stochastic objective is, however, not well-defined. Our solution is to transform $\hat{V}^{\pi}$ into a deterministic utility function that reflects a subjective measure balancing the risk and return. Following Markowitz (1952); Sato et al. (2001); García and Fernández (2015), we propose a risk-sensitive objective criterion using a linear combination of the mean and the standard deviation of $\hat{V}^{\pi}(s_0)$. Formally stated, our objective criterion now becomes

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{s_0}\left[\mu^{\pi}(s_0) + c\,\sigma^{\pi}(s_0)\right] \tag{2}$$

where $\mu^{\pi}(s_0)$ and $\sigma^{\pi}(s_0)$ are the mean and the standard deviation of $\hat{V}^{\pi}(s_0)$ respectively, and $c$ is a constant that represents the subjective risk preference of the learning agent. A positive risk preference implies that the agent is adventurous, while a negative risk preference indicates that the agent has a safe exploration strategy.
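The mean-plus-standard-deviation criterion can be illustrated with a few lines of Python. This is only a sketch: the function name `utility`, the argument `risk_preference`, and the sample numbers are hypothetical, and the use of the population standard deviation (rather than the sample version) is an assumption.

```python
import statistics

def utility(value_estimates, risk_preference):
    """Mean plus risk_preference times the standard deviation of the estimates."""
    mean = statistics.fmean(value_estimates)
    # Population standard deviation; whether the paper uses this or the
    # sample (n - 1) version is an assumption made for this sketch.
    std = statistics.pstdev(value_estimates)
    return mean + risk_preference * std

# Hypothetical value estimates of one policy under three different models.
estimates = [8.0, 10.0, 12.0]
risk_seeking = utility(estimates, 0.5)    # rewards disagreement between models
risk_averse = utility(estimates, -0.5)    # penalizes disagreement between models
risk_neutral = utility(estimates, 0.0)    # reduces to the plain mean
```

With a positive coefficient the utility exceeds the mean estimate, and with a negative coefficient it falls below it, which is the sense in which the sign of the risk preference separates adventurous from safe behavior.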

3.2 Estimate of Value Function and Its Gradient

Section 3.1 above provides a general framework for policy optimization under uncertainty, assuming the availability of an estimation model $\hat{V}^{\pi}$ of the true value function $V^{\pi}$. In this section, we present a model-based method to compute $\hat{V}^{\pi}$ as an approximation of $V^{\pi}$. The main idea is to approximate the functions $f$ and $r$ with probabilistic parametric models $\hat{f}$ and $\hat{r}$ and fully propagate the estimated uncertainty when planning under each policy $\pi$ from an initial state $s_0$. The value function estimate can be formulated as

$$\hat{V}^{\pi}(s_0) = \sum_{t=0}^{T-1} \gamma^{t}\, \hat{r}(\hat{s}_t, a_t) \tag{3}$$

where $\hat{s}_0 = s_0$, $a_t = \pi(\hat{s}_t)$ and $\hat{s}_{t+1} = \hat{f}(\hat{s}_t, a_t)$ for $t = 0, \dots, T-1$. Next, we describe how to accurately model $f$ with well-calibrated uncertainty and a rollout technique that allows the uncertainty to be faithfully propagated into $\hat{V}^{\pi}$.

3.2.1 Model Uncertainty with Online Bootstrap

As discussed in Section 1, there are several prior attempts to learn uncertainty-aware dynamics models, including GPs, Bayesian neural networks, dropout neural networks and ensembles of neural networks. In this work, we employ an ensemble of bootstrapped neural networks. Bootstrapping is a generic, principled and statistical approach for uncertainty quantification. Furthermore, as explained later in Section 3.2.3, this ensemble approach also gives rise to easy gradient computation. In particular, $\hat{f}$ is represented as $\{\hat{f}_{\phi_k}\}_{k=1}^{K}$.

For simplicity of implementation, we model each bootstrap replica as deterministic and rely on the ensemble as the sole mechanism for quantifying and propagating uncertainty. Each bootstrapped model $\hat{f}_{\phi_k}$, which is parameterized by $\phi_k$, learns to minimize the one-step prediction loss over the respective bootstrapped dataset $\mathcal{D}_k$:

$$\min_{\phi_k} \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}_k} \left\| s_{t+1} - \hat{f}_{\phi_k}(s_t, a_t) \right\|_2^2 \tag{4}$$

The training dataset $\mathcal{D}$ stores the transitions that the agent has experienced. Since each model observes its own subset of the real data samples, the predictions across the ensemble remain sufficiently diverse in the early stages of learning and then converge to their true values as the error of the individual networks decreases.

Bootstrap learning is often studied in the context of batch learning. However, since our agent updates its world model $\hat{f}$ after each physical step for the best possible sample efficiency, we follow the online bootstrapping method based on sampling from a Poisson distribution presented in Oza (2005); Qin et al. (2013). This is a very effective online approximation to batch bootstrapping, leveraging the following argument: bootstrapping a dataset $\mathcal{D}$ with $n$ examples means sampling $n$ examples from $\mathcal{D}$ with replacement. Each example $i$ will appear $z_i$ times in the bootstrapped sample, where $z_i$ is a random variable whose distribution is $\mathrm{Binomial}(n, 1/n)$, because during resampling the $i$-th example will have $n$ chances to be picked, each with probability $1/n$. This distribution converges to $\mathrm{Poisson}(1)$ when $n \to \infty$. Therefore, for each new data point, this method adds $z_k$ copies of that data point to the bootstrapped dataset $\mathcal{D}_k$, where $z_k$ is sampled from a $\mathrm{Poisson}(1)$.
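A minimal sketch of this online bootstrap, using only the Python standard library, might look as follows. The helper names `sample_poisson_one` and `push_transition`, the ensemble size of 5, and the placeholder transitions are all hypothetical; the Poisson draw uses Knuth's multiplication method with rate 1, the limit of the Binomial(n, 1/n) resampling count.

```python
import math
import random

def sample_poisson_one(rng):
    """Draw from Poisson(1) with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def push_transition(buffers, transition, rng):
    """Add an independent Poisson(1) number of copies to each replica's buffer."""
    for buffer in buffers:
        buffer.extend([transition] * sample_poisson_one(rng))

rng = random.Random(0)
buffers = [[] for _ in range(5)]      # one buffer per ensemble member (5 is arbitrary)
for step in range(1000):
    # Placeholder transition tuple; a real agent would store (s, a, r, s').
    push_transition(buffers, ("s%d" % step, "a", "s_next"), rng)
```

Because the expected number of copies is one, each buffer grows at roughly the rate of the real data stream, while the buffers differ from one another in which transitions they contain and how often.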

Unlike many other model-based approaches, we also learn the reward function, following the design of classical MBRL algorithms Sutton (1991). However, in our current implementation, we use a deterministic model for the reward function to simplify the policy evaluation. Note that unlike the error from the dynamics model, the error from the reward model does not get compounded when we estimate $\hat{V}^{\pi}$.

3.2.2 Bootstrap Rollout

In this section, we describe how to propagate the estimates with uncertainty from the dynamics model to evaluate a policy $\pi$. We represent our policy $\pi_\theta: \mathcal{S} \to \mathcal{A}$ as a neural network parameterized by $\theta$. Note that we choose to represent our policy as deterministic. We argue that while all estimation models, including those of the dynamics and of the value function, need to be stochastic (i.e. uncertainty-aware), the policy does not need to be. The policy is not an estimator, and a deterministic policy simply means that the agent is consistent when taking an action, no matter how uncertain it is about the world.

Given a policy $\pi_\theta$ and an initial state $s_0$, we can estimate the distribution of $\hat{V}^{\pi_\theta}(s_0)$ by simulating through each bootstrapped dynamics model. Since each bootstrap model is an independent approximator of the dynamics function, by expanding the value function via these dynamics approximators, we eventually obtain $K$ independent estimates of that value function. Finally, these separate and independent trajectories collectively form an ensemble estimator of $\hat{V}^{\pi_\theta}(s_0)$.

In practice, we sample these trajectories with a finite horizon $H$. It is still a challenge to expand the value function estimation for a very long horizon, for a few reasons:

  • neural network training becomes harder when the depth increases,

  • despite our best effort to control the uncertainty, we still do not have a guarantee that our uncertainty modeling is perfectly calibrated, which in turn may be problematic if the planning horizon is too large, and

  • the policy learning time is proportional to the rollout horizon.
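The rollout procedure described above can be sketched in plain Python. Everything below is a toy stand-in: the scalar linear "models", the quadratic reward, the linear policy, and the function names are hypothetical and chosen only to make the example runnable. The discount argument is set to 1.0 here, which is an arbitrary choice for the illustration.

```python
import statistics

def rollout_value(dynamics, reward, policy, state, horizon, discount):
    """Return of `policy` simulated for `horizon` steps through one dynamics replica."""
    total = 0.0
    for t in range(horizon):
        action = policy(state)
        total += (discount ** t) * reward(state, action)
        state = dynamics(state, action)
    return total

def ensemble_values(ensemble, reward, policy, state, horizon, discount):
    """One value estimate per bootstrapped dynamics model."""
    return [rollout_value(f, reward, policy, state, horizon, discount)
            for f in ensemble]

# Three replicas that disagree slightly about the (scalar) dynamics.
ensemble = [lambda s, a, k=k: k * s + a for k in (0.9, 1.0, 1.1)]
reward = lambda s, a: -(s ** 2)
policy = lambda s: -0.5 * s

short_horizon = ensemble_values(ensemble, reward, policy, 1.0, 2, 1.0)
long_horizon = ensemble_values(ensemble, reward, policy, 1.0, 20, 1.0)
spread_short = statistics.pstdev(short_horizon)
spread_long = statistics.pstdev(long_horizon)
```

In this toy setting the spread of the estimates across replicas is larger for the longer horizon, which illustrates why model disagreement compounds with the planning horizon and why the rollout cost grows linearly with it.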

3.2.3 SGD and Gradient Computation

We can rewrite our objective as

$$\max_{\theta} \; \mathbb{E}_{s_0}\left[U_{\theta}(s_0)\right] \tag{5}$$

where $U_{\theta}(s_0) = \mu^{\pi_\theta}(s_0) + c\,\sigma^{\pi_\theta}(s_0)$. Using the ensemble method and the rollout technique described above, we can naturally compute $\mu^{\pi_\theta}(s_0)$ and $\sigma^{\pi_\theta}(s_0)$ for a given policy $\pi_\theta$ and for a given state $s_0$. Therefore, the policy can be updated using SGD or a variant of it.

The aforementioned rollout method also allows us to easily express $U_{\theta}(s_0)$ in Equation (5) as a single computational graph of $\theta$. This makes it straightforward to compute the policy gradient $\nabla_{\theta} U_{\theta}(s_0)$ using automatic differentiation, a feature provided out-of-the-box in most popular deep learning toolkits.
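As a worked step, the gradient of a mean-plus-standard-deviation utility can be written out in terms of per-replica value gradients. The sketch below assumes $K$ ensemble estimates $\hat{V}_k$ (one per bootstrapped model), a risk coefficient $c$, policy parameters $\theta$, a population ($1/K$) standard deviation, and $\sigma > 0$; the normalization choice is an assumption and is not stated in the text.

```latex
\begin{aligned}
\mu &= \frac{1}{K}\sum_{k=1}^{K}\hat{V}_k, \qquad
\sigma = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(\hat{V}_k-\mu\bigr)^2},\\
\nabla_\theta \mu &= \frac{1}{K}\sum_{k=1}^{K}\nabla_\theta \hat{V}_k,\\
\nabla_\theta \sigma &= \frac{1}{K\sigma}\sum_{k=1}^{K}\bigl(\hat{V}_k-\mu\bigr)\nabla_\theta \hat{V}_k
\quad\text{since}\quad \sum_{k=1}^{K}\bigl(\hat{V}_k-\mu\bigr)\nabla_\theta\mu = 0,\\
\nabla_\theta\bigl(\mu + c\,\sigma\bigr) &= \frac{1}{K}\sum_{k=1}^{K}\Bigl(1 + \frac{c\,(\hat{V}_k-\mu)}{\sigma}\Bigr)\nabla_\theta \hat{V}_k .
\end{aligned}
```

Each $\nabla_\theta \hat{V}_k$ is what automatic differentiation returns when backpropagating through the $k$-th rollout. The last line shows that replicas with above-average value estimates receive more weight when $c > 0$ and less weight when $c < 0$.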

4 Algorithm Summary

1:  Initialize bootstrapped datasets $\{\mathcal{D}_k\}_{k=1}^{K}$, bootstrapped models $\{\hat{f}_{\phi_k}\}_{k=1}^{K}$, reward model $\hat{r}$, and policy $\pi_\theta$.
2:  while not done do
3:     1. Step in the environment and collect a new data point $(s_t, a_t, r_t, s_{t+1})$.
4:     2. Push data to the bootstrapped replay buffers: for the $k$-th member of the ensemble, add $z_k$ copies of that data point to $\mathcal{D}_k$, where $z_k$ is sampled from $\mathrm{Poisson}(1)$.
5:     3. Update $\{\hat{f}_{\phi_k}\}_{k=1}^{K}$ and $\hat{r}$.
6:     4. Compute the value of the policy and the utility function by simulating through the learned models $\{\hat{f}_{\phi_k}\}_{k=1}^{K}$ and $\hat{r}$.
7:     5. Update the policy using SGD, with the policy gradient computed by backpropagating the gradient of $U_{\theta}$ through the models.
8:  end while
Algorithm 1 Uncertainty-aware Model-based Policy Optimization

We summarize our method in Algorithm 1. Furthermore, in this section, we also highlight some important details in our implementation.

Online off-policy learning.

Except for the initialization step (we may initialize the models with batch training from off-policy data), our model learning is an online learning process. For each time step, the learning cost stays constant and does not grow over time, which is required for lifelong learning. Despite being online, the learning is off-policy because we maintain a bootstrapped replay buffer for each model in the ensemble. For each model update, we sample a minibatch of training data from the respective replay buffer. In addition, as mentioned, the models can also be initialized from existing data even before the policy optimization starts.

Linearly weighted random sampling.

Since our replay buffers are accumulated online, a naive uniform sampling strategy would lead to early data being sampled more frequently than later data. We thus propose a linearly weighted random sampling scheme to mitigate this early-data bias. In this sampling scheme, the $i$-th example is randomly sampled with weight $i$, i.e. fresher examples get higher weights in each online update step. As proved in Appendix A.1, with this scheme, early data points are still cumulatively sampled slightly more frequently than later points, but the cumulative bias gap is greatly reduced, and asymptotically all data points are equally sampled.
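A linearly weighted minibatch draw can be sketched with the standard library's `random.choices`. The function name `sample_minibatch` and the toy buffer are hypothetical, and the sketch assumes that the $i$-th inserted example (1-indexed) receives weight $i$ and that sampling is done with replacement.

```python
import random

def sample_minibatch(buffer, batch_size, rng):
    """Sample with replacement, weighting the i-th oldest item by i (1-indexed)."""
    weights = list(range(1, len(buffer) + 1))
    return rng.choices(buffer, weights=weights, k=batch_size)

rng = random.Random(0)
buffer = list(range(1, 101))           # item value equals its insertion order
batch = sample_minibatch(buffer, 20000, rng)
mean_index = sum(batch) / len(batch)   # uniform sampling would give about 50.5
```

Under these weights the expected insertion index of a drawn item is $(2n+1)/3$ for a buffer of $n$ items, so with $n = 100$ the empirical mean should sit near 67 rather than the 50.5 of uniform sampling.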

5 Experiment

We provide two experimental analyses in this section. The first analyzes the model error as the planning horizon increases. The second provides benchmark comparisons against state-of-the-art baselines.

5.1 Model learning and compounding errors


Figure 2: Our model prediction errors and compounding errors, both computed as mean square error, of the value function estimate for different plan horizons $H$.

Figure 2 shows an error analysis of our model on the simple Pendulum environment. We used an oracle model to measure the errors. From the figure, we can see that even when the loss is small, the compounded error in the value function estimation can grow very quickly. Therefore, it is crucial to estimate the uncertainty (e.g. as variance or as error bound) in a principled way.

5.2 Comparison to baseline algorithms

We evaluate the performance of our proposed MBPO algorithm on three continuous control tasks in the MuJoCo simulator Todorov et al. (2012): Pendulum-v0, Swimmer-v2 and HalfCheetah-v2 from OpenAI Gym Brockman et al. (2016).

Experimentation Protocol.

We compare our algorithm against the following state-of-the-art baseline algorithms designed for continuous control:

  • PPO Schulman et al. (2017): a model-free policy gradient algorithm,

  • DDPG Lillicrap et al. (2015): an off-policy model-free actor-critic algorithm,

  • SAC Haarnoja et al. (2018): a model-free actor-critic algorithm, which reports better data-efficiency than DDPG and PPO on most MuJoCo benchmarks,

  • STEVE Buckman et al. (2018): a recent deterministic model-based algorithm.

Some other algorithms, such as ME-TRPO Kurutach et al. (2018), MB-MPO Clavera et al. (2018), and PETS Chua et al. (2018), assume a known reward function. We were also, with reasonable effort, unable to get their open-source implementations to run on standard OpenAI Gym environments. We therefore do not include them in this benchmark study.

For each algorithm, we evaluate the learned policy after every episode (200 time steps for Pendulum-v0 and 1000 time steps for Swimmer-v2 and HalfCheetah-v2). The evaluation is done by running the current policy on 20 random episodes and computing the average return over them.


Figures 3 and 4 show that MBPO has superior sample efficiency compared to the baseline algorithms across the tested environments. Furthermore, its asymptotic performance is competitive with, or even better than, that of the model-free counterparts.


Figure 3: Average return of our MBPO model over 3 randomly selected seeds. Solid lines indicate the mean and shaded areas indicate one standard deviation.


Figure 4: Best run of our MBPO model with different random seeds.

However, Figure 3 also shows that the performance of MBPO is sensitive to the random seed. We hypothesize that this is due to our strategy of aggressive online learning and policy updates after each step. It might also be due to our choice of learning from the start, without the random exploration on the first episode that many other methods apply. We plan to do a deeper analysis and address this instability issue in future work.

6 Discussion and Conclusion

Our experiments suggest that our MBPO algorithm not only can achieve the asymptotic performance of model-free methods in challenging continuous control tasks, but also does so with far fewer samples. It is also more sample efficient than other existing MBRL algorithms. We further demonstrate that the model bias issue in model-based RL can be dealt with effectively through principled and careful uncertainty quantification.

We acknowledge that our current implementation still has several limitations to overcome, such as the high variance of its performance, which depends on many hyper-parameters (plan horizon, risk sensitivity, and all hyper-parameters associated with neural network training) and even on the random seed. Note that these traits are not unique to our method. Nevertheless, the results indicate that, if implemented right, model-based methods can be both sample efficient and have better asymptotic performance than model-free methods on challenging tasks. In addition, by explicitly representing both the dynamics model and the policy, MBPO enables transfer learning, not just for the world (dynamics) model but also for the policy.

Finally, we identify that sample efficiency, off-policy learning, and transferability are the three necessary, albeit not sufficient, properties for real-world reinforcement learning. We claim that our method meets these criteria and hence is a step towards real-world reinforcement learning.


Appendix A

A.1 Why linearly weighted random sampling is a fairer sampling scheme

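Consider the following online learning process: for each time step, we need to randomly sample an example from the accumulating dataset. Suppose that at time $t$, the $i$-th example is randomly sampled with weight $w_i$. Note that at each time $t$, we have a total of $t$ examples in the dataset. Then the probability of that example being sampled is

$$p_t(i) = \frac{w_i}{\sum_{j=1}^{t} w_j}.$$

If we use uniformly random sampling, then the expected number of times the $i$-th example gets selected until time $N$ is

$$E_N(i) = \sum_{t=i}^{N} \frac{1}{t}.$$

Hence, for all $N \ge j$ and for $i < j$, $E_N(i)$ is larger than $E_N(j)$ by $\sum_{t=i}^{j-1} \frac{1}{t}$. Now, if we use a linearly weighted random sampling scheme, in which $w_i = i$, then the expected number of times the $i$-th example gets selected until time $N$ is

$$E_N(i) = \sum_{t=i}^{N} \frac{2i}{t(t+1)} = 2i\left(\frac{1}{i} - \frac{1}{N+1}\right) = 2\left(1 - \frac{i}{N+1}\right).$$

We can see that at time $N$, $E_N(i)$ is still larger than $E_N(j)$ for $i < j$, but the gap is now $\frac{2(j-i)}{N+1}$, which vanishes as $N \to \infty$. By weighting recent examples more in each online update step, we reduce the overall early-data bias.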
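The expected selection counts for the two schemes can be checked numerically with exact rational arithmetic. The helper `expected_counts` and the choice of 50 time steps are hypothetical. The sketch assumes one draw per time step from the examples seen so far, and linear weights equal to the insertion index, under which the expected count works out to $2(1 - i/(N+1))$ after $N$ steps.

```python
from fractions import Fraction

def expected_counts(num_steps, weight):
    """Expected number of draws of each example when one example is sampled
    per step from the examples inserted so far, with weight(i) for example i."""
    counts = [Fraction(0)] * num_steps
    for t in range(1, num_steps + 1):
        total = sum(weight(i) for i in range(1, t + 1))
        for i in range(1, t + 1):
            counts[i - 1] += Fraction(weight(i), total)
    return counts

uniform = expected_counts(50, lambda i: 1)
linear = expected_counts(50, lambda i: i)

# Gap between the oldest and the newest example under each scheme.
uniform_gap = uniform[0] - uniform[-1]
linear_gap = linear[0] - linear[-1]
```

The oldest example is still favored under linear weighting, but the gap between the oldest and newest examples is bounded by 2, whereas under uniform sampling it grows like a harmonic sum.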

A.2 Environments

We evaluate the performance of our proposed MBPO algorithm on five continuous control tasks in the MuJoCo simulator from OpenAI Gym, and we keep the default configurations provided by OpenAI Gym.

Environment State dimension Action dimension Task horizon
Table 1: Description of the environment used for testing