Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization

11/28/2019 ∙ by Qi Zhou, et al. ∙ USTC 18

Model-based reinforcement learning algorithms tend to achieve higher sample efficiency than model-free methods. However, due to the inevitable errors of learned models, model-based methods struggle to achieve the same asymptotic performance as model-free methods. In this paper, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU), a novel model-based approach that can effectively improve the asymptotic performance using the uncertainty in Q-values. We derive an upper bound of the uncertainty, based on which we can approximate the uncertainty accurately and efficiently for model-based methods. We further propose an uncertainty-aware policy optimization algorithm that optimizes the policy conservatively to encourage performance improvement with high probability. This can significantly alleviate the overfitting of the policy to inaccurate models. Experiments show that POMBU can outperform existing state-of-the-art policy optimization algorithms in terms of sample efficiency and asymptotic performance. Moreover, the experiments demonstrate the excellent robustness of POMBU compared to previous model-based approaches.




1 Introduction

Model-free reinforcement learning has achieved remarkable success in sequential decision tasks, such as playing Atari games [21, 11] and controlling robots in simulation environments [19, 10]. However, model-free approaches require large amounts of samples, especially when using powerful function approximators such as neural networks. This high sample complexity hinders the application of model-free methods to real-world tasks, where data gathering is often costly. In contrast, model-based reinforcement learning is more sample efficient, as it can learn from interactions with learned models and then find a near-optimal policy via those models [14, 8, 17, 22]. However, these methods suffer from errors of the learned models, which hurt the asymptotic performance [31, 1]. Thus, compared to model-free methods, model-based algorithms learn more quickly but tend to end up with suboptimal policies even after many trials.

Early model-based methods achieve impressive results using simple models, like linear models [2, 18] and Gaussian processes [16, 8]. However, these methods have difficulties in high-dimensional and non-linear environments due to the limited expressiveness of the models. Recent methods use neural network models for better performance, especially for complicated tasks [29, 22]. Some methods further characterize the uncertainty in models via neural network ensembles [30, 15] or Bayesian neural networks [9]. Although the uncertainty in models improves the performance of model-based methods, recent research shows that these methods still struggle to robustly achieve asymptotic performance comparable to state-of-the-art model-free methods [35].

Inspired by previous work that improves model-free algorithms via uncertainty-aware exploration [23], we propose a theoretically-motivated algorithm to estimate the uncertainty in Q-values and apply it to the exploration of model-based reinforcement learning. Moreover, we propose to optimize the policy conservatively by encouraging a large probability of performance improvement, which is also informed by the estimated uncertainty. Thus, we use the uncertainty in Q-values to enhance both exploration and policy optimization in our model-based algorithm.

Our contributions consist of three parts.

First, we derive an upper bound of the uncertainty in Q-values and present an algorithm to estimate it. Our bound is tighter than previous work [23], and our algorithm is feasible for deep model-based reinforcement learning, while many previous methods only focus on model-free cases [25, 26], or assume simple models [7].

Second, we propose to optimize the policy conservatively based on an estimated probability of performance improvement, which is estimated via the uncertainty in Q-values. We find that conservative policy optimization helps prevent the policy from overfitting to biased models.

Third, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU), which combines our uncertainty estimation algorithm with the conservative policy optimization algorithm. Experiments show that POMBU achieves excellent robustness and can outperform state-of-the-art policy optimization algorithms.

2 Background

A finite-horizon Markov decision process (MDP) $\mathcal{M}$ is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \rho, H)$. Here, $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is a third-order tensor that denotes the transition probabilities, $R$ is a matrix that denotes the rewards, $\rho$ denotes the distribution of initial states and $H$ is the horizon length. More specifically, when the agent is at state $s$ and selects action $a$, $P_{sas'}$ is the probability of transitioning to state $s'$, and $R_{sa}$ is the obtained reward. We represent a posterior of MDPs as $(\Omega, \mathcal{F}, Pr)$, where $\Omega$ is the sample space containing all possible MDPs, $\mathcal{F}$ is a $\sigma$-field consisting of subsets of $\Omega$, and $Pr: \mathcal{F} \to [0, 1]$ measures the posterior probability of MDPs. We assume that the MDPs in $\Omega$ differ from each other only in terms of $P$ and $R$. In this case, $P$ is a random tensor and $R$ is a random matrix. For any random variable, matrix or tensor $X$, $\mathbb{E}_{\mathcal{M}}[X]$ and $\mathbb{D}_{\mathcal{M}}[X]$ denote its expectation and variance respectively. When there is no ambiguity, we write $\mathbb{E}_{\mathcal{M}}[X]$ as $\bar{X}$ for short. For example, $\bar{P}$ denotes $\mathbb{E}_{\mathcal{M}}[P]$ and $\bar{P}_{sas'}$ denotes $\mathbb{E}_{\mathcal{M}}[P_{sas'}]$.

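Let $\pi$ denote a policy, where $\pi_{sa}$ denotes the probability of taking action $a$ at state $s$. Considering the posterior of MDPs, the expected return $J_\pi$ is a random variable, which is defined by

$$J_\pi = \mathbb{E}_{\tau \sim (\mathcal{M}, \pi)}\left[\sum_{h=1}^{H} R_{s^h a^h}\right].$$

Here $\tau = (s^1, a^1, \dots, s^H, a^H)$ is a trajectory, and $\tau \sim (\mathcal{M}, \pi)$ means that the trajectory is sampled from the MDP $\mathcal{M}$ under policy $\pi$. That is, $s^1$ is sampled from the initial state distribution $\rho$ of $\mathcal{M}$, $a^h$ is sampled with probability $\pi_{s^h a^h}$, and $s^{h+1}$ is sampled with probability $P_{s^h a^h s^{h+1}}$ in $\mathcal{M}$. Our goal is to find a policy maximizing $J_\pi$ in the real environment.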

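Given an MDP $\mathcal{M}$, we define the corresponding state-action value function $Q$, the state value function $V$ and the advantage function $A$ as follows:

$$Q^{\pi,h}_{sa} = \mathbb{E}_{\tau \sim (\mathcal{M}, \pi)}\left[\sum_{l=h}^{H} R_{s^l a^l} \,\middle|\, s^h = s, a^h = a\right], \qquad V^{\pi,h}_{s} = \sum_{a} \pi_{sa} Q^{\pi,h}_{sa}, \qquad A^{\pi,h}_{sa} = Q^{\pi,h}_{sa} - V^{\pi,h}_{s}.$$

When the policy $\pi$ is fixed, we write $Q^{\pi,h}$, $V^{\pi,h}$ and $A^{\pi,h}$ as $Q^h$, $V^h$ and $A^h$ respectively for short. In this case, for any time step $h$, $Q^h_{sa}$, $V^h_s$ and $A^h_{sa}$ are random variables mapping $\Omega$ to $\mathbb{R}$. Hence, $V^h$ is a random vector, and $Q^h$ and $A^h$ are random matrices.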
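The value definitions above can be illustrated for a single, known tabular MDP by backward induction over the horizon. The following is a minimal sketch, not the authors' code: the function name `evaluate_policy`, the nested-list layout and the toy numbers are assumptions made for illustration.

```python
def evaluate_policy(P, R, pi, H):
    """Backward induction for a fixed policy in a tabular finite-horizon MDP.

    P[s][a][t] is the probability of moving from s to t under action a,
    R[s][a] is the reward, pi[s][a] is the action probability, H the horizon.
    Returns Q, V, A indexed by time step h = 1..H (index 0 is unused).
    """
    n_s, n_a = len(R), len(R[0])
    V_next = [0.0] * n_s  # value after the last step is zero
    Q, V, A = [None] * (H + 1), [None] * (H + 1), [None] * (H + 1)
    for h in range(H, 0, -1):
        # Q^h_{sa} = R_{sa} + sum_t P_{sat} V^{h+1}_t
        Q[h] = [[R[s][a] + sum(P[s][a][t] * V_next[t] for t in range(n_s))
                 for a in range(n_a)] for s in range(n_s)]
        # V^h_s = sum_a pi_{sa} Q^h_{sa}
        V[h] = [sum(pi[s][a] * Q[h][s][a] for a in range(n_a)) for s in range(n_s)]
        # A^h_{sa} = Q^h_{sa} - V^h_s
        A[h] = [[Q[h][s][a] - V[h][s] for a in range(n_a)] for s in range(n_s)]
        V_next = V[h]
    return Q, V, A


# Toy 2-state, 2-action MDP with hypothetical numbers.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
pi = [[0.5, 0.5],
      [0.5, 0.5]]
Q, V, A = evaluate_policy(P, R, pi, H=2)
```

Under a posterior over MDPs, running this routine on each sampled MDP yields one realization of the random quantities $Q^h$, $V^h$ and $A^h$.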

3 Uncertainty Estimation

In this section, we consider a fixed policy $\pi$. Similarly to the uncertainty Bellman equation (UBE) [23], we regard the standard deviations of Q-values as the uncertainty. We derive an upper bound of $\mathbb{D}_{\mathcal{M}}[Q^h_{sa}]$ for each $s$, $a$ and $h$, and prove that our upper bound is tighter than that of UBE. Moreover, we propose an uncertainty estimation algorithm for deep model-based reinforcement learning and discuss its advantages. We provide related proofs in Appendix A.1-A.4.

Upper Bound of Uncertainty in Q-values

To analyze the uncertainty, we first make two assumptions.

Assumption 1

Each MDP in $\Omega$ is a directed acyclic graph.

This assumption is common [27, 23]. It means that the agent cannot visit a state more than once within the same episode. This assumption is weak because each finite-horizon MDP violating the assumption can be converted into a similar MDP that satisfies the assumption [23].

Assumption 2

The random vector $R_{s_1}$ and the random matrix $P_{s_1}$ are independent of $R_{s_2}$ and $P_{s_2}$ if $s_1 \neq s_2$.

This assumption is used in the derivation of UBE [23]. It is consistent with the trajectory sampling strategies used in recent model-based algorithms [5, 15], which sample a model from the ensemble of models independently per time step to predict the next state and reward.

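First, we derive an inequality from these assumptions.

Lemma 1

Under Assumptions 1 and 2, for any $(s, a)$ and $h$, we have

$$\mathbb{D}_{\mathcal{M}}\left[Q^h_{sa}\right] \le u^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} \mathbb{D}_{\mathcal{M}}\left[Q^{h+1}_{s'a'}\right], \qquad \text{where } u^h_{sa} = \mathbb{D}_{\mathcal{M}}\left[R_{sa} + \sum_{s'} P_{sas'} \bar{V}^{h+1}_{s'}\right].$$

We consider $u^h_{sa}$ as a local uncertainty, because we can compute it locally with $\bar{V}$.

Then, we can derive our main theorem from this lemma.

Theorem 1

Under Assumptions 1 and 2, for any policy $\pi$, there exists a unique solution $U$ satisfying the following equation:

$$U^h_{sa} = u^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} U^{h+1}_{s'a'} \qquad (1)$$

for any $(s, a)$ and $h = 1, \dots, H$, where $U^{H+1} = 0$, and furthermore $U \ge \mathbb{D}_{\mathcal{M}}[Q]$ pointwise.

Theorem 1 means that we can compute an upper bound of $\mathbb{D}_{\mathcal{M}}[Q^h_{sa}]$ by solving the Bellman-style equation (1).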
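To make the Bellman-style recursion concrete, the sketch below solves it by backward induction on a small tabular posterior. It is an illustration under stated assumptions, not the authors' implementation: the posterior is represented by `K` equally weighted sample MDPs (`Ps`, `Rs`), the local uncertainty at each step is taken to be the variance across samples of the one-step backup through the mean next-step value, and the mean value is computed from the mean MDP. All names and numbers are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def uncertainty_bound(Ps, Rs, pi, H):
    """Solve U^h = u^h + sum pi * P_bar * U^{h+1} with U^{H+1} = 0."""
    K, n_s, n_a = len(Ps), len(pi), len(pi[0])
    P_bar = [[[mean([Ps[k][s][a][t] for k in range(K)]) for t in range(n_s)]
              for a in range(n_a)] for s in range(n_s)]
    R_bar = [[mean([Rs[k][s][a] for k in range(K)]) for a in range(n_a)]
             for s in range(n_s)]
    V_bar = [0.0] * n_s                          # mean value at step h + 1
    U_next = [[0.0] * n_a for _ in range(n_s)]   # U^{H+1} = 0
    U = [None] * (H + 1)
    for h in range(H, 0, -1):
        U[h] = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                # Local uncertainty: variance over the posterior samples of
                # the one-step backup through the mean next-step value.
                backups = [Rs[k][s][a]
                           + sum(Ps[k][s][a][t] * V_bar[t] for t in range(n_s))
                           for k in range(K)]
                carried = sum(P_bar[s][a][t] * pi[t][b] * U_next[t][b]
                              for t in range(n_s) for b in range(n_a))
                U[h][s][a] = variance(backups) + carried
        # Mean value at step h, computed from the mean MDP.
        V_bar = [sum(pi[s][a] * (R_bar[s][a]
                                 + sum(P_bar[s][a][t] * V_bar[t] for t in range(n_s)))
                     for a in range(n_a)) for s in range(n_s)]
        U_next = U[h]
    return U


# Two hypothetical posterior samples of a 2-state, 2-action MDP.
Ps = [[[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 1.0]]],
      [[[0.8, 0.2], [0.0, 1.0]], [[0.5, 0.5], [0.2, 0.8]]]]
Rs = [[[0.0, 1.0], [2.0, 0.0]],
      [[0.0, 3.0], [2.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]
U = uncertainty_bound(Ps, Rs, pi, H=2)
```

In this toy posterior only the reward of action 1 in state 0 is uncertain at the last step, and the recursion carries that uncertainty back to every state-action pair that can reach it.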

Moreover, we provide the following theorem to show convergence when $U$ is computed iteratively.

Theorem 2

For arbitrary $(U)_1$, if

$$(U^h_{sa})_{i+1} = (u^h_{sa})_i + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} (U^{h+1}_{s'a'})_i$$

for any $(s, a)$, $h$ and $i$, where $(U^{H+1})_i = 0$ and $(u)_i$ converges to $u$ pointwise, then $(U)_i$ converges to $U$ pointwise.

Theorem 2 shows that we can solve equation (1) iteratively even if the estimated local uncertainty is inaccurate at each update, as long as it converges to the correct value. This is significant when we use an estimated $\bar{V}$ to compute the uncertainty.

As $U^h_{sa}$ is an upper bound of $\mathbb{D}_{\mathcal{M}}[Q^h_{sa}]$, $\sqrt{U^h_{sa}}$ is an upper bound of the uncertainty in $Q^h_{sa}$. We use this upper bound to approximate the uncertainty in our algorithm, similarly to UBE. We therefore need to analyze the accuracy of our estimates.

Here, we compare our upper bound with that of UBE under the same assumptions, and hence we need to make an extra assumption used in UBE.

Assumption 3

is independent of for any .

This assumption is not used to derive our upper bound of the uncertainty but is used in UBE. Under the assumption 2 and 3, we have is independent of .

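The upper bound $B$ derived in UBE satisfies

$$B^h_{sa} = \nu^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} B^{h+1}_{s'a'},$$

where $\nu^h_{sa} = (Q_{\max})^2 \sum_{s'} \mathbb{D}_{\mathcal{M}}[P_{sas'}] / \bar{P}_{sas'} + \mathbb{D}_{\mathcal{M}}[R_{sa}]$. Here, $Q_{\max}$ is an upper bound of all $|Q^h_{sa}|$ for any $h$, $s$, $a$ and MDP. For example, we can regard $H R_{\max}$ as $Q_{\max}$, where $R_{\max}$ bounds the absolute value of the rewards.

Theorem 3

Under Assumptions 1, 2 and 3, $U^h_{sa}$ is a tighter upper bound of $\mathbb{D}_{\mathcal{M}}[Q^h_{sa}]$ than $B^h_{sa}$.

This theorem means that our upper bound is a more accurate estimate of the uncertainty in Q-values than the upper bound derived in UBE.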

Uncertainty Estimation Algorithm

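First, we approximately characterize the posterior of MDPs using a deterministic model ensemble (please refer to Section 5 for the details of training models). A deterministic ensemble is denoted by $(f_{\phi_1}, \dots, f_{\phi_K})$. Here, for any $i = 1, \dots, K$, $f_{\phi_i}: \mathcal{S} \times \mathcal{A} \to \mathcal{S} \times \mathbb{R}$ is a single model that predicts the next state and the reward, and $\phi_i$ is its parameters. We define a posterior probability of MDPs by

$$Pr\{P_{sas'} = 1, R_{sa} = x\} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{eq}\big((s', x), f_{\phi_i}(s, a)\big),$$

where eq is defined by

$$\mathrm{eq}(x, y) = \begin{cases} 1, & x = y, \\ 0, & x \neq y. \end{cases}$$

Then, we can construct an MDP $\bar{\mathcal{M}}$ defined according to the posterior of MDPs, such that its transition tensor is equal to $\bar{P}$ and its reward matrix is equal to $\bar{R}$. Hence, the state value matrix of the MDP $\bar{\mathcal{M}}$ is equal to $\bar{V}$.

Moreover, we use a neural network $\tilde{V}$ to predict $\bar{V}^h_s$ for any state $s$ and time step $h$, which is equivalent to predicting the state values of $\bar{\mathcal{M}}$. We train $\tilde{V}$ by minimizing the $\ell_2$ loss function

$$L_v = \mathbb{E}_{\tau \sim (\bar{\mathcal{M}}, \pi)}\left[\frac{1}{H} \sum_{h=1}^{H} \Big(\tilde{V}(s^h, h) - \sum_{l=h}^{H} \bar{R}_{s^l a^l}\Big)^2\right]. \qquad (2)$$

Finally, given an imagined trajectory $\tau$ sampled from $\bar{\mathcal{M}}$ under $\pi$, we can estimate the uncertainty in Q-values via Algorithm 1. Note that for long-horizon tasks, we can introduce a discount factor similarly to previous work [23]. The modified uncertainty estimation method can be found in Appendix B.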

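Input : An approximate value function $\tilde{V}$; An ensemble model $(f_{\phi_1}, \dots, f_{\phi_K})$; A trajectory $\tau = (s^1, a^1, \dots, s^H, a^H)$;
Output : Estimates of $(\sqrt{U^1_{s^1 a^1}}, \dots, \sqrt{U^H_{s^H a^H}})$;
$D^{H+1} = 0$;
for $i = H, H-1, \dots, 1$ do
      for $j = 1, 2, \dots, K$ do
            $(s_j, r_j) = f_{\phi_j}(s^i, a^i)$; $q_j = r_j + \tilde{V}(s_j, i+1)$;
      end for
      $\bar{q} = \frac{1}{K} \sum_{j} q_j$; $d^i = \frac{1}{K} \sum_{j} (q_j - \bar{q})^2$; $D^i = d^i + D^{i+1}$;
end for
return $(\sqrt{D^1}, \dots, \sqrt{D^H})$;
Algorithm 1 Uncertainty Estimation for Q-values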
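The trajectory-based estimate can be sketched in a few lines. This is a hedged illustration of one reading of the procedure, not the authors' code: at each step every ensemble member produces a one-step backup through the approximate value function, the variance across members serves as the local uncertainty, and local uncertainties are accumulated backwards along the trajectory. The names `estimate_uncertainty`, `value_fn` and the toy ensemble are assumptions.

```python
import math


def estimate_uncertainty(value_fn, ensemble, trajectory):
    """Estimate the uncertainty of Q-values along one imagined trajectory.

    value_fn(state, h) approximates the mean state value at time step h
    (and should return 0 beyond the horizon); ensemble is a list of callables
    f(state, action) -> (next_state, reward); trajectory is a list of
    (state, action) pairs for steps 1..H.
    """
    H = len(trajectory)
    accumulated = 0.0            # accumulated variance after the horizon is 0
    estimates = [0.0] * H
    for i in range(H, 0, -1):
        state, action = trajectory[i - 1]
        backups = []
        for model in ensemble:
            next_state, reward = model(state, action)
            backups.append(reward + value_fn(next_state, i + 1))
        centre = sum(backups) / len(backups)
        local = sum((q - centre) ** 2 for q in backups) / len(backups)
        accumulated = local + accumulated
        estimates[i - 1] = math.sqrt(accumulated)
    return estimates


# Toy check: two hypothetical models that agree on dynamics but disagree on
# the reward by 2, so the local variance is 1 at every step.
toy_ensemble = [lambda s, a, b=b: (s + a, b) for b in (0.0, 2.0)]
toy_value = lambda s, h: 0.0
toy_trajectory = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
estimates = estimate_uncertainty(toy_value, toy_ensemble, toy_trajectory)
```

Because the accumulation runs backwards, earlier steps inherit the uncertainty of everything that follows them on the trajectory, so the estimate is largest at the first step.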


In this part, we discuss some advantages of our algorithm to estimate the uncertainty in Q-values.


Based on Theorem 3, our upper bound of the uncertainty is tighter than that of UBE, which means a more accurate estimation. Intuitively, our local uncertainty depends on $\bar{V}^{h+1}$ while that of UBE depends on $Q_{\max}$. Therefore, our local uncertainty has a weaker dependence on $H$ and can provide a relatively accurate estimation for long-horizon tasks (see an example in Appendix C). Moreover, considering an infinite set of states, our method ensures the boundedness of the local uncertainty because $\bar{V}^{h+1}$ and $R$ are bounded. Therefore, our method has the potential to apply to tasks with continuous action spaces.

Applicability for Model-Based Methods

Our method to estimate the uncertainty in Q-values is effective for model-based reinforcement learning. In model-based cases, estimated Q-values are highly dependent on the models. Our method considers the model when computing the local uncertainty, while most existing methods estimate the uncertainty directly from real-world samples regardless of the models. Ignoring the models may lead to bad estimates of uncertainty in model-based cases. For example, the uncertainty estimated by a count-based method [3, 28] tends to decrease as the number of samples increases, while the true uncertainty remains high even with a large number of samples when a complicated MDP is modeled with a simple model.

Computational Cost

Our method is computationally much cheaper than estimating the uncertainty via the empirical standard deviation of $Q^h_{sa}$. When an MDP is given, estimating $Q^h_{sa}$ requires plenty of virtual samples. Estimating the empirical standard deviation requires estimating $Q^h_{sa}$ for several MDPs. Previous work reduces the computational cost by learning an ensemble of Q functions [4]. However, training an ensemble of Q functions requires higher computational overhead than training a single neural network $\tilde{V}$.

Compatibility with Neural Networks

Previous methods that estimate uncertainty for model-based methods always assume simple models, like Gaussian processes [8, 7]. Estimating uncertainty using Theorem 1 only requires that the models can represent a posterior. This makes our method compatible with neural network ensembles and Bayesian neural networks. For instance, we propose Algorithm 1 with an ensemble of neural networks.

Propagation of Uncertainty

As discussed in previous work [24], Bellman equation implies the high dependency between Q-values. Ignoring this dependence will limit the accuracy of the estimates of uncertainty. Our method considers the dependency and propagates the uncertainty via a Bellman-style equation.

4 Conservative Policy Optimization

In this section, we first introduce the surrogate objective and then modify it via uncertainty. The modified objective leads to conservative policy optimization because it penalizes updates in high-uncertainty regions. $\pi_\theta$ denotes a parameterized policy, and $\theta$ is its parameters. $\pi_\theta(a|s)$ is the probability of taking action $a$ at state $s$.

Surrogate Objective

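Recent reinforcement learning algorithms, like Trust Region Policy Optimization (TRPO) [32] and Proximal Policy Optimization (PPO) [33], optimize the policy based on a surrogate objective. We rewrite the surrogate objective in TRPO and PPO as follows:

$$L_{sr}(\theta) = \mathbb{E}_{\tau \sim (\mathcal{M}, \pi_{\theta_{old}})}\left[\sum_{h=1}^{H} r_\theta(s^h, a^h) A^{old,h}_{s^h a^h}\right],$$

where $\theta_{old}$ are the old policy parameters before the update, $A^{old}$ is the advantage function of $\pi_{\theta_{old}}$ and

$$r_\theta(s, a) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}.$$

Previous work has proven that the surrogate objective is the first-order approximation to $J_{\pi_\theta} - J_{\pi_{\theta_{old}}}$ when $\theta$ is around $\theta_{old}$ [32, 12]. That is, for any $\theta_{old}$, we have the following theorem:

Theorem 4

$$L_{sr}(\theta)\big|_{\theta = \theta_{old}} = \left(J_{\pi_\theta} - J_{\pi_{\theta_{old}}}\right)\big|_{\theta = \theta_{old}}, \qquad \nabla_\theta L_{sr}(\theta)\big|_{\theta = \theta_{old}} = \nabla_\theta \left(J_{\pi_\theta} - J_{\pi_{\theta_{old}}}\right)\big|_{\theta = \theta_{old}}$$

(see proof in Appendix A.5). Therefore, maximizing $L_{sr}(\theta)$ can maximize $J_{\pi_\theta}$ approximately when $\theta$ is around $\theta_{old}$.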

Uncertainty-Aware Surrogate Objective

To prevent the overfitting of the policy to inaccurate models, we introduce the estimated uncertainty in Q-values into the surrogate objective.

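First, we need to estimate $Pr\{J_{\pi_\theta} > J_{\pi_{\theta_{old}}}\}$, which means the probability that the new policy outperforms the old one. Because of Theorem 4, $Pr\{L_{sr}(\theta) > 0\}$ can approximate $Pr\{J_{\pi_\theta} > J_{\pi_{\theta_{old}}}\}$. We assume that a Gaussian can approximate the distribution of $L_{sr}(\theta)$. Thus, $Pr\{L_{sr}(\theta) > 0\}$ is approximately equal to $\Phi\left(\mathbb{E}_{\mathcal{M}}[L_{sr}(\theta)] / \sqrt{\mathbb{D}_{\mathcal{M}}[L_{sr}(\theta)]}\right)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution.

Then, we need to construct an objective function for optimization. Here, we aim to find a new $\theta$ with a large $\Phi\left(\mathbb{E}_{\mathcal{M}}[L_{sr}(\theta)] / \sqrt{\mathbb{D}_{\mathcal{M}}[L_{sr}(\theta)]}\right)$. As $\Phi$ is monotonically increasing, we can maximize $\mathbb{E}_{\mathcal{M}}[L_{sr}(\theta)]$ while minimizing $\sqrt{\mathbb{D}_{\mathcal{M}}[L_{sr}(\theta)]}$. Therefore, we can maximize

$$\mathbb{E}_{\mathcal{M}}[L_{sr}(\theta)] - \alpha \sqrt{\mathbb{D}_{\mathcal{M}}[L_{sr}(\theta)]},$$

where $\alpha \ge 0$ is a hyperparameter.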
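The Gaussian argument above can be made concrete with a small numeric sketch. It is only an illustration: the function names are hypothetical, the penalty weight is called `alpha` here as an assumption, and the two candidate updates are invented numbers. The probability that a Gaussian with a given mean and standard deviation exceeds zero is the standard normal CDF evaluated at their ratio.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def improvement_probability(mean, std):
    """P(X > 0) for X ~ N(mean, std^2), requiring std > 0."""
    if std <= 0.0:
        raise ValueError("std must be positive")
    return normal_cdf(mean / std)


def conservative_objective(mean, std, alpha):
    """Mean minus a weighted standard deviation: a penalized objective."""
    return mean - alpha * std


# Two hypothetical candidate updates: (mean, std) of the surrogate objective.
risky = (1.0, 2.0)      # larger expected gain, much larger uncertainty
cautious = (0.8, 0.4)   # slightly smaller expected gain, small uncertainty

p_risky = improvement_probability(*risky)
p_cautious = improvement_probability(*cautious)
score_risky = conservative_objective(*risky, alpha=0.5)
score_cautious = conservative_objective(*cautious, alpha=0.5)
```

In this example the cautious candidate has both the higher estimated probability of improvement and the higher penalized score, even though its expected gain is smaller, which is the behaviour the conservative objective is meant to encourage.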

Moreover, we need to estimate the expectation and the variance of the surrogate objective. Because is equal to

we can approximate and as and respectively, where


Here is defined in Section 3 using a learned ensemble, can be approximated by , and is computed by Algorithm 1.

However, policy optimization without trust region may lead to unacceptable bad performance [32]. Thus, we clip similarly to PPO. That is,


Here, we define as

in which is a hyperparameter.

Finally, we obtain the modified surrogate objective

Note that, the main difference of our objective from PPO is the uncertainty penalty . This penalty limits the ratio changes in high-uncertainty regions. Therefore, this objective is uncertainty-aware and leads to a conservative update.
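A per-sample version of an uncertainty-penalized clipped objective can be sketched as follows. The exact penalty term is not recoverable from the text above, so this sketch assumes a penalty of the form `alpha * |ratio - 1| * uncertainty`, which matches the description that the penalty limits ratio changes in high-uncertainty regions. The function name, argument names and numbers are hypothetical, and this should not be read as the authors' exact formula.

```python
def penalized_clipped_surrogate(ratios, advantages, uncertainties, alpha, eps):
    """Average of PPO-style clipped terms minus an assumed uncertainty penalty.

    ratios[i]        : new-to-old action probability ratio for sample i
    advantages[i]    : estimated advantage for sample i
    uncertainties[i] : estimated Q-value uncertainty for sample i
    alpha            : penalty weight (alpha = 0 recovers the clipped objective)
    eps              : clipping range
    """
    total = 0.0
    for ratio, adv, unc in zip(ratios, advantages, uncertainties):
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        clipped_term = min(ratio * adv, clipped_ratio * adv)
        # Assumed penalty: grows with both the ratio change and the uncertainty.
        penalty = alpha * abs(ratio - 1.0) * unc
        total += clipped_term - penalty
    return total / len(ratios)


ratios = [1.1, 0.9]
advantages = [1.0, -1.0]
plain = penalized_clipped_surrogate(ratios, advantages, [0.0, 0.0], alpha=0.5, eps=0.2)
penalized = penalized_clipped_surrogate(ratios, advantages, [2.0, 2.0], alpha=0.5, eps=0.2)
unchanged = penalized_clipped_surrogate([1.0, 1.0], advantages, [2.0, 2.0], alpha=0.5, eps=0.2)
```

With zero uncertainty the value equals the usual clipped objective; with high uncertainty the same ratio changes are worth less, so gradient ascent on this quantity takes smaller steps where the models disagree.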

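Initialize an ensemble $(f_{\phi_1}, \dots, f_{\phi_K})$ and a policy $\pi_\theta$;
Initialize a value function $\tilde{V}$;
Initialize the dataset as an empty set;
Sample trajectories using $\pi_\theta$;
Add the sampled transitions to the dataset;
repeat
      Train the ensemble using the dataset;
      for each policy update do
            Sample virtual trajectories from the ensemble using $\pi_\theta$;
            Train $\tilde{V}$ by minimizing the value loss;
            Train $\pi_\theta$ by maximizing the modified surrogate objective;
      end for
      for each exploration policy do
            Sample virtual trajectories from the ensemble using $\pi_\theta$;
            Train an exploration policy based on $\pi_\theta$;
            Collect real-world trajectories using the exploration policy;
            Add the sampled transitions to the dataset;
      end for
until $\pi_\theta$ performs well in the real environment;
Algorithm 2 POMBU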

5 Algorithm

In this section, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU) in Algorithm 2. We detail each stage of our algorithm as follows.

Exploration Policy

We train a set of exploration policies by maximizing the modified surrogate objective. Different policies are trained with different virtual trajectories. To explore the unknown, we replace $\alpha$ with $-\beta$ in equation (6). Here, $\beta \ge 0$ controls the exploration toward high-uncertainty regions.

Model Ensemble

To predict the next state, a single neural network in the ensemble outputs the change in state and then adds the change to the current state [15, 22]. To predict the reward, we assume that the reward in the real environment is computed by a function $\mu$ such that $R_{sa} = \mu(s, a, s')$, which is commonly true in many simulated control tasks. Then, we can predict the reward via the predicted next state. We train the model by minimizing the $\ell_2$ loss similarly to previous work [15, 22] and optimize the parameters using Adam [13]. Different models are trained with different train-validation splits.
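The structure of one ensemble member can be sketched without any deep learning library. In this hedged illustration a trivial linear map stands in for the trained network, and the reward function, the helper names and the numbers are all hypothetical; only the structure follows the text: predict a state change, add it to the current state, compute the reward from the predicted next state, and give each member its own train-validation split.

```python
import random


def reward_fn(state, action, next_state):
    # Hypothetical known reward: forward progress minus a control cost.
    return (next_state[0] - state[0]) - 0.1 * sum(a * a for a in action)


def make_member(gain, bias):
    """Build one ensemble member; (gain, bias) stand in for network weights."""
    def predict(state, action):
        delta = [gain * a + bias for a in action]           # predicted change
        next_state = [s + d for s, d in zip(state, delta)]  # add to current state
        return next_state, reward_fn(state, action, next_state)
    return predict


def train_validation_split(n_samples, seed, validation_fraction=0.2):
    """Return (train_indices, validation_indices) for one ensemble member."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * validation_fraction)
    return indices[n_val:], indices[:n_val]


member_params = [(1.0, 0.0), (0.9, 0.05), (1.1, -0.05)]
ensemble = [make_member(gain, bias) for gain, bias in member_params]
splits = [train_validation_split(10, seed) for seed in range(len(ensemble))]
next_state, reward = ensemble[0]([0.0, 1.0], [0.5, -0.5])
```

Seeding each split differently is one simple way to make the members see different data, which is what lets their disagreement serve as an uncertainty signal.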

Policy Optimization

We use a Gaussian policy whose mean is computed by a feedforward neural network and whose standard deviation is represented by a vector of parameters. We optimize all parameters by maximizing the modified surrogate objective via Adam.

6 Experiments

In this section, we first evaluate our uncertainty estimation method. Second, we compare POMBU to state-of-the-art methods. Then, we show how the estimated uncertainty contributes to performance through an ablation study. Finally, we analyze the robustness of our method empirically. In the following experiments, we report the performance averaged over at least three random seeds. Please refer to Appendix D for the details of experiments. The source code and appendix of this work are available at

Effectiveness of Uncertainty Estimation

Figure 1: Frequency histograms of the ratios of errors to uncertainties after different numbers of epochs (training $\tilde{V}$). The red dotted line is the probability density function of the standard normal distribution.

We evaluate the effectiveness of our uncertainty estimation method in two environments: 2D-point and 3D-point. These environments have continuous state spaces and continuous action spaces. First, we train an ensemble model of the environment and sample state-action pairs from the model using a deterministic policy. Then, we estimate the Q-values of these pairs via the means of virtual returns (computed using the models), and estimate the uncertainty using Algorithm 1. Finally, we compute the real Q-values using the return in the real world, compute the ratios of errors to the estimated uncertainties, and count the frequencies of these ratios to draw Figure 1. This figure shows that the distribution of ratios is similar to a standard normal distribution after sufficient training of $\tilde{V}$, which demonstrates the accuracy of the estimated uncertainty.
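The calibration check described here amounts to dividing each Q-value error by its estimated uncertainty and comparing the resulting ratios with a standard normal distribution. The sketch below shows that computation on synthetic data only; the helper names and the data-generating process are assumptions, and no numbers from the paper are reproduced.

```python
import random


def calibration_ratios(predicted_q, true_q, uncertainties):
    """Ratio of the Q-value error to the estimated uncertainty, per sample."""
    return [(t - p) / u for p, t, u in zip(predicted_q, true_q, uncertainties)]


def fraction_within(ratios, bound):
    """Fraction of ratios whose absolute value is at most `bound`."""
    return sum(1 for r in ratios if abs(r) <= bound) / len(ratios)


# Synthetic, perfectly calibrated data: each true value equals the prediction
# plus Gaussian noise whose standard deviation is the reported uncertainty.
rng = random.Random(0)
predicted, true_values, reported = [], [], []
for _ in range(20000):
    u = rng.uniform(0.5, 2.0)
    p = rng.uniform(-1.0, 1.0)
    predicted.append(p)
    reported.append(u)
    true_values.append(p + rng.gauss(0.0, u))

ratios = calibration_ratios(predicted, true_values, reported)
within_one = fraction_within(ratios, 1.0)
mean_ratio = sum(ratios) / len(ratios)
```

For well-calibrated uncertainties roughly 68% of the ratios should fall within one unit of zero; a much smaller fraction would indicate that the uncertainty is underestimated, and a much larger one that it is overestimated.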

Comparison to State-of-the-Arts

We compare POMBU with state-of-the-art policy optimization algorithms in four continuous control tasks of Mujoco [34]: Swimmer, HalfCheetah, Ant, and Walker2d. Our method and our baselines optimize a stochastic policy to complete the tasks. Our baselines include soft actor critic (SAC) [10], proximal policy optimization (PPO), stochastic lower bounds optimization (SLBO) [20] and model-ensemble trust region policy optimization (METRPO) [15]. To show the benefits of using uncertainty in model-based reinforcement learning, we also compare POMBU to model-ensemble proximal policy optimization (MEPPO), which is equivalent to POMBU when $\alpha = 0$ and $\beta = 0$. We evaluate POMBU with the same values of $\alpha$ and $\beta$ for all tasks.

The result is shown in Figure 2. The solid curves correspond to the mean and the shaded region corresponds to the empirical standard deviation. It shows that POMBU achieves higher sample efficiency and better final performance than baselines, which highlights the great benefits of using uncertainty. Moreover, POMBU achieves comparable asymptotic performances with PPO and SAC in all tasks.

We also provide Table 1, which summarizes the performance, estimated wall-clock time and the number of used imagined samples and real-world samples in the HalfCheetah task (H=200). Compared to MEPPO, the extra time used in POMBU is small (time: ), while the improvement is significant (mean: ; standard deviation: ). Compared to SAC, POMBU achieves higher performance with about 5 times fewer real-world samples. Moreover, in our experiments, the total time to compute the uncertainty (not including the time to train $\tilde{V}$) is about 1.4 minutes, which is negligible compared with the overall time.

We further compare POMBU with state-of-the-art model-based algorithms in long-horizon tasks. The compared algorithms include model-based meta policy optimization (MBMPO) [6], probabilistic ensemble with trajectory sampling (PETS) [5] and stochastic ensemble value expansion (STEVE) [4] in addition. We directly use some of the results given by Tingwu Wang [35], and summarize all results in Table 2. The table shows that POMBU achieves comparable performance with STEVE and PETS, and outperforms other model-based algorithms. It demonstrates that POMBU is also effective in long-horizon tasks.

Figure 2: The training curves of our method and baselines. The horizons of all tasks are 200. The number of total steps is selected to ensure most model-based algorithms converge. We train the policy via PPO and SAC with at least 1 million samples and report the best averaged performance as "max".
Time (h) 12.05 10.17 6.35 3.91 0.87 0.04 4.18 0.19
Imagined 1.2e8 8e7 5e7 1e7 0 0 0 0
Real-world 2e5 2e5 2e5 2e5 2e5 2e5 9.89e5 9.78e5
Table 1: The performance, estimated wall-clock time and the number of used imagined samples and real-world samples in the HalfCheetah task (H=200). We conduct all experiments with one GPU Nvidia GTX 2080Ti.
Table 2: The performance of 200k time-step training. The horizons of all environments are 1000.

Ablation Study

Figure 3: The development of the average return during training with different $\alpha$ in the Cheetah task (H=200).

We provide an ablation study to show how the uncertainty benefits the performance. In our algorithm, we employ the uncertainty in policy optimization (controlled by $\alpha$) and exploration (controlled by $\beta$). Therefore, we compare the performance with different $\alpha$ and $\beta$.

The results are shown in Figure 3 and 4. Setting as or achieves the best final performance and the best robustness with 200K samples. Note that a large may result in poorer performance in the early stage, because the uncertainty is high in the early stage and a large tends to choose a small step size when uncertainty is high. Using can improve the performance (larger mean and smaller standard deviation), which demonstrate the effectiveness of uncertainty-aware exploration.

Robustness Analyses

We demonstrate the excellent robustness of POMBU in two ways. First, we evaluate algorithms in noisy environments. In these environments, we add Gaussian noise to the observation with the standard deviation . This noise will affect the accuracy of the learned models. Second, we evaluate algorithms in long-horizon tasks. In these tasks, models need to generate long trajectories, and the error is further exacerbated due to the difficulty of long-term predictions.

We report the results in Figure 5. Experiments show that our algorithm achieves similar performance with different random seeds, while the performance of METRPO varies greatly with the random seeds. Moreover, in Figure 5, the worst performance of POMBU beats the best of METRPO. This implies that our method has promising robustness, even in noisy environments and long-horizon environments.

Figure 4: (a): The development of average return with and . (b): The performance after 1e5 time-step training with different random seeds.
Figure 5: The training curves of POMBU and METRPO with different random seeds. (a) Comparison in a noisy Cheetah task (). (b) Comparison in a long-horizon Cheetah task ().

7 Conclusion

In this work, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU), which is a novel uncertainty-aware model-based algorithm. This method estimates uncertainty using a model ensemble and then optimizes the policy conservatively considering the uncertainty. Experiments demonstrate that POMBU can achieve asymptotic performance comparable with SAC and PPO while using far fewer samples. Compared with other model-based methods, POMBU is robust and can achieve better performance. We believe that our approach will bring new insights into model-based reinforcement learning. An enticing direction for further work is the combination of our uncertainty estimation method with other kinds of models like Bayesian neural networks. Another exciting direction is to modify other advanced model-based algorithms like STEVE and PETS using our uncertainty estimation method.


A Proof

In this section, we provide all the proofs mentioned in the body of our paper.

Proof of Lemma 1

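Lemma 1

Under Assumptions 1 and 2, for any $(s, a)$ and $h$, we have

$$\mathbb{D}_{\mathcal{M}}\left[Q^h_{sa}\right] \le u^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} \mathbb{D}_{\mathcal{M}}\left[Q^{h+1}_{s'a'}\right], \qquad \text{where } u^h_{sa} = \mathbb{D}_{\mathcal{M}}\left[R_{sa} + \sum_{s'} P_{sas'} \bar{V}^{h+1}_{s'}\right].$$

Proof. Let $Y = \sum_i p_i X_i$, where $p_i \ge 0$, $\sum_i p_i = 1$ and each $X_i$ is a random variable. By using Jensen's inequality $\left(\sum_i p_i x_i\right)^2 \le \sum_i p_i x_i^2$, we have

$$\mathbb{D}[Y] = \mathbb{E}\left[\Big(\sum_i p_i (X_i - \bar{X}_i)\Big)^2\right] \le \mathbb{E}\left[\sum_i p_i (X_i - \bar{X}_i)^2\right] = \sum_i p_i \mathbb{D}[X_i]. \qquad (7)$$

By applying inequality (7) to the Bellman equation $V^{h+1}_{s'} = \sum_{a'} \pi_{s'a'} Q^{h+1}_{s'a'}$, we have

$$\mathbb{D}_{\mathcal{M}}\left[V^{h+1}_{s'}\right] \le \sum_{a'} \pi_{s'a'} \mathbb{D}_{\mathcal{M}}\left[Q^{h+1}_{s'a'}\right]. \qquad (8)$$

By using the law of total variance, we have

$$\mathbb{D}_{\mathcal{M}}\left[Q^h_{sa}\right] = \mathbb{D}_{\mathcal{M}}\left[\mathbb{E}_{\mathcal{M}}\left[Q^h_{sa} \mid P_{sa}, R_{sa}\right]\right] + \mathbb{E}_{\mathcal{M}}\left[\mathbb{D}_{\mathcal{M}}\left[Q^h_{sa} \mid P_{sa}, R_{sa}\right]\right]. \qquad (9)$$

Because Assumptions 1 and 2 imply that $V^{h+1}_{s'}$ is independent of $P_{sa}$ and $R_{sa}$ when $P_{sas'} > 0$, we have

$$\mathbb{D}_{\mathcal{M}}\left[\mathbb{E}_{\mathcal{M}}\left[Q^h_{sa} \mid P_{sa}, R_{sa}\right]\right] = \mathbb{D}_{\mathcal{M}}\left[R_{sa} + \sum_{s'} P_{sas'} \bar{V}^{h+1}_{s'}\right] = u^h_{sa}. \qquad (10)$$

By using inequality (7), we have

$$\mathbb{E}_{\mathcal{M}}\left[\mathbb{D}_{\mathcal{M}}\left[Q^h_{sa} \mid P_{sa}, R_{sa}\right]\right] = \mathbb{E}_{\mathcal{M}}\left[\mathbb{D}_{\mathcal{M}}\left[\sum_{s'} P_{sas'} V^{h+1}_{s'} \,\middle|\, P_{sa}\right]\right] \le \mathbb{E}_{\mathcal{M}}\left[\sum_{s'} P_{sas'} \mathbb{D}_{\mathcal{M}}\left[V^{h+1}_{s'} \mid P_{sa}\right]\right] = \sum_{s'} \bar{P}_{sas'} \mathbb{D}_{\mathcal{M}}\left[V^{h+1}_{s'}\right], \qquad (11)$$

where the last step holds because $V^{h+1}_{s'}$ is independent of $P_{sa}$ when $P_{sas'} > 0$ according to Assumptions 1 and 2. Combining (8), (9), (10) and (11), we obtain Lemma 1.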
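Two facts carry this proof: the variance of a convex combination of random variables is at most the same convex combination of their variances, and the law of total variance. Both hold exactly for any empirical distribution, so they can be checked numerically. The sketch below does so on synthetic samples; the data and helper names are hypothetical and serve only as a sanity check of the two steps.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)

# Step 1: Var(sum_i p_i X_i) <= sum_i p_i Var(X_i) for weights on the simplex.
weights = [0.2, 0.5, 0.3]
rows = [(rng.gauss(0.0, 1.0), rng.gauss(1.0, 2.0), rng.uniform(-1.0, 1.0))
        for _ in range(1000)]
mixture_variance = variance([sum(w * x for w, x in zip(weights, row)) for row in rows])
weighted_variances = sum(w * variance([row[i] for row in rows])
                         for i, w in enumerate(weights))

# Step 2: Var(Y) = Var(E[Y | G]) + E[Var(Y | G)] for a grouping variable G.
groups = {0: [], 1: [], 2: []}
for _ in range(900):
    g = rng.randrange(3)
    groups[g].append(rng.gauss(float(g), 1.0 + g))
all_values = [y for ys in groups.values() for y in ys]
total_variance = variance(all_values)
overall_mean = mean(all_values)
shares = {g: len(ys) / len(all_values) for g, ys in groups.items()}
between = sum(shares[g] * (mean(ys) - overall_mean) ** 2 for g, ys in groups.items())
within = sum(shares[g] * variance(ys) for g, ys in groups.items())
```

The first check mirrors the Jensen step and the second mirrors the decomposition into a local term and a propagated term.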

Proof of Theorem 1

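Theorem 1

Under Assumptions 1 and 2, for any policy $\pi$, there exists a unique solution $U$ satisfying the following equation:

$$U^h_{sa} = u^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} U^{h+1}_{s'a'}$$

for any $(s, a)$ and $h = 1, \dots, H$, where $U^{H+1} = 0$, and furthermore $U \ge \mathbb{D}_{\mathcal{M}}[Q]$ pointwise.

Proof. First, the solution of $U^H$ exists and is unique because $U^{H+1} = 0$ and $U^H$ is a linear combination of $u^H$ and $U^{H+1}$. Moreover, we know that

$$U^H_{sa} = u^H_{sa} = \mathbb{D}_{\mathcal{M}}[R_{sa}] = \mathbb{D}_{\mathcal{M}}\left[Q^H_{sa}\right].$$

Then, there exists a unique solution of $U^h$ if there exists a unique solution of $U^{h+1}$, because $U^h$ is a linear combination of $u^h$ and $U^{h+1}$. Additionally, by using Lemma 1, we have

$$U^h_{sa} = u^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} U^{h+1}_{s'a'} \ge u^h_{sa} + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} \mathbb{D}_{\mathcal{M}}\left[Q^{h+1}_{s'a'}\right] \ge \mathbb{D}_{\mathcal{M}}\left[Q^h_{sa}\right]$$

if $U^{h+1} \ge \mathbb{D}_{\mathcal{M}}[Q^{h+1}]$ pointwise.

Finally, we obtain Theorem 1 by induction.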

Proof of Theorem 2

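Theorem 2

For arbitrary $(U)_1$, if

$$(U^h_{sa})_{i+1} = (u^h_{sa})_i + \sum_{s', a'} \pi_{s'a'} \bar{P}_{sas'} (U^{h+1}_{s'a'})_i$$

for any $(s, a)$, $h$ and $i$, where $(U^{H+1})_i = 0$ and $(u)_i$ converges to $u$ pointwise, then $(U)_i$ converges to $U$ pointwise.

Proof. $(U^H)_i$ converges to $U^H$ because $(u^H)_i$ converges to $u^H$ and $(U^{H+1})_i = 0$.

For any $h$, if $(U^{h+1})_i$ converges to $U^{h+1}$, then $(U^h)_i$ converges to $U^h$ under the assumption that $(u^h)_i$ converges to $u^h$, because $(U^h)_{i+1}$ is a linear combination of $(u^h)_i$ and $(U^{h+1})_i$.

Then, we obtain the conclusion by induction.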

Proof of Theorem 3

Theorem 3

Under Assumptions 1, 2 and 3, $U^h_{sa}$ is a tighter upper bound of $\mathbb{D}_{\mathcal{M}}[Q^h_{sa}]$ than the UBE upper bound $B^h_{sa}$.

Proof. Here, we only show that $U \le B$ pointwise (see UBE [23] for the proof that $B$ is an upper bound of the uncertainty).

Because , by using inequation (8), we have


Under the Assumption 1, 2 and 3, we have