1 Introduction
Model-free reinforcement learning has achieved remarkable success in sequential decision tasks, such as playing Atari games [21, 11] and controlling robots in simulated environments [19, 10]. However, model-free approaches require large amounts of samples, especially when using powerful function approximators like neural networks. This high sample complexity hinders the application of model-free methods to real-world tasks, in which data gathering is often costly. In contrast, model-based reinforcement learning is more sample-efficient, as it can learn from interactions with models and find a near-optimal policy via those models [14, 8, 17, 22]. However, these methods suffer from errors in the learned models, which hurt asymptotic performance [31, 1]. Thus, compared to model-free methods, model-based algorithms learn more quickly but tend to converge to suboptimal policies even after plenty of trials.

Early model-based methods achieve impressive results using simple models, like linear models [2, 18] and Gaussian processes [16, 8]. However, these methods have difficulties in high-dimensional and nonlinear environments due to the limited expressiveness of the models. Recent methods use neural network models for better performance, especially in complicated tasks [29, 22]. Some methods further characterize the uncertainty in models via neural network ensembles [30, 15] or Bayesian neural networks [9]. Although modeling this uncertainty improves the performance of model-based methods, recent research shows that such methods still struggle to robustly match the asymptotic performance of state-of-the-art model-free methods [35].
Inspired by previous work that improves model-free algorithms via uncertainty-aware exploration [23], we propose a theoretically motivated algorithm to estimate the uncertainty in Q-values and apply it to exploration in model-based reinforcement learning. Moreover, we propose to optimize the policy conservatively by encouraging a large probability of performance improvement, which is also informed by the estimated uncertainty. Thus, we use the uncertainty in Q-values to enhance both exploration and policy optimization in our model-based algorithm.
Our contributions consist of three parts.
First, we derive an upper bound of the uncertainty in Q-values and present an algorithm to estimate it. Our bound is tighter than previous work [23], and our algorithm is feasible for deep model-based reinforcement learning, while many previous methods only focus on model-free cases [25, 26] or assume simple models [7].
Second, we propose to optimize the policy conservatively based on an estimated probability of performance improvement, which is computed via the uncertainty in Q-values. We find that conservative policy optimization helps prevent overfitting to biased models.
Third, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU), which combines our uncertainty estimation algorithm with the conservative policy optimization algorithm. Experiments show that POMBU achieves excellent robustness and can outperform state-of-the-art policy optimization algorithms.
2 Background
A finite-horizon Markov decision process (MDP) $M$ is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho, H)$. Here, $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|}$ is a third-order tensor that denotes the transition probabilities, $r \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ is a matrix that denotes the rewards, $\rho$ denotes the distribution of initial states, and $H$ is the horizon length. More specifically, when staying at the state $s$ and selecting the action $a$, $P(s'|s,a)$ is the probability of transitioning to the state $s'$, and $r(s,a)$ is the obtained reward. We represent a posterior of MDPs as a probability space $(\Omega, \mathcal{F}, \Pr)$, where the sample space $\Omega$ contains all possible MDPs, $\mathcal{F}$ is a $\sigma$-field consisting of subsets of $\Omega$, and $\Pr$ measures the posterior probability of MDPs. We assume that each MDP in $\Omega$ differs from the others only in terms of $P$ and $r$. In this case, $P$ is a random tensor and $r$ is a random matrix. For any random variable, matrix or tensor $X$, $\mathbb{E}[X]$ and $\mathrm{Var}[X]$ denote its expectation and variance respectively. When there is no ambiguity, we write $\mathbb{E}[X](s,a)$ as shorthand; for example, $\mathbb{E}[P](s'|s,a)$ denotes $\mathbb{E}[P(s'|s,a)]$ and $\mathbb{E}[r](s,a)$ denotes $\mathbb{E}[r(s,a)]$.

Let $\pi$ denote a policy; $\pi(a|s)$ denotes the probability of taking the action $a$ at the state $s$. Considering the posterior of MDPs, the expected return is a random variable, which is defined by
$$\eta(\pi) = \mathbb{E}_{\tau \sim (M, \pi)} \left[ \sum_{t=0}^{H-1} r(s_t, a_t) \right].$$
Here $\tau = (s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1})$ is a trajectory. $\tau \sim (M, \pi)$ means that the trajectory is sampled from the MDP $M$ under the policy $\pi$. That is, $s_0$ is sampled from the initial state distribution of $M$, $a_t$ is sampled with the probability $\pi(a_t|s_t)$, and $s_{t+1}$ is sampled with the probability $P(s_{t+1}|s_t,a_t)$ in $M$. Our goal is to find a policy maximizing $\eta(\pi)$ in the real environment.

Given an MDP $M$, we define the corresponding state-action value function $Q^{M,\pi}_t$, the state value function $V^{M,\pi}_t$ and the advantage function $A^{M,\pi}_t$ as follows:
$$Q^{M,\pi}_t(s,a) = \mathbb{E}_{\tau \sim (M,\pi)} \left[ \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \,\Big|\, s_t = s,\, a_t = a \right],$$
$$V^{M,\pi}_t(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q^{M,\pi}_t(s,a) \right], \qquad A^{M,\pi}_t(s,a) = Q^{M,\pi}_t(s,a) - V^{M,\pi}_t(s).$$
When the policy is fixed, we write $Q^{M,\pi}_t$, $V^{M,\pi}_t$ and $A^{M,\pi}_t$ as $Q_t$, $V_t$ and $A_t$ respectively for short. In this case, for any timestep $t$, state $s$ and action $a$, $Q_t(s,a)$, $V_t(s)$ and $A_t(s,a)$ are random variables mapping $\Omega$ to $\mathbb{R}$. Hence, $V_t$ is a random vector, and $Q_t$ and $A_t$ are random matrices.

3 Uncertainty Estimation
In this section, we consider a fixed policy $\pi$. Similarly to the uncertainty Bellman equation (UBE) [23], we regard the standard deviation of Q-values as the uncertainty. We derive an upper bound of $\mathrm{Var}[Q_t(s,a)]$ for each $t$, $s$ and $a$, and prove that our upper bound is tighter than that of UBE. Moreover, we propose an uncertainty estimation algorithm for deep model-based reinforcement learning and discuss its advantages. We provide related proofs in Appendix A.1–A.4.

Upper Bound of Uncertainty in Q-values
To analyze the uncertainty, we first make two assumptions.
Assumption 1
Each MDP in $\Omega$ is a directed acyclic graph.
This assumption is common [27, 23]. It means that the agent cannot visit any state more than once within the same episode. The assumption is weak because each finite-horizon MDP violating it can be converted into a similar MDP that satisfies it [23].
Assumption 2
The random transition probabilities and rewards used at time step $t$ are independent of those used at time step $t'$ whenever $t \neq t'$.
This assumption is used in the derivation of UBE [23]. It is consistent with the trajectory sampling strategies used in recent model-based algorithms [5, 15], which sample a model from the ensemble independently per time step to predict the next state and reward.
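The per-step sampling strategy referenced here can be sketched as follows; the function names and interfaces are illustrative assumptions, not the authors' code.

```python
import random

def rollout(models, policy, s0, horizon):
    """Trajectory sampling consistent with Assumption 2 (sketch): an
    ensemble member is drawn independently at every time step, so the
    sampled transitions at different steps are independent given the
    state. `models` is a list of step functions (s, a) -> (s_next, r);
    all names here are illustrative.
    """
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s)
        m = random.choice(models)      # fresh ensemble member each step
        s_next, r = m(s, a)
        traj.append((s, a, r))
        s = s_next
    return traj
```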
First, we derive an inequality from these assumptions.
Lemma 1
Under Assumptions 1 and 2, for any $t$, $s$ and $a$, we have
$$\mathrm{Var}[Q_t(s,a)] \le w_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, \mathrm{Var}[Q_{t+1}(s',a')],$$
where
$$w_t(s,a) = \mathrm{Var}\left[ r(s,a) + \sum_{s'} P(s'|s,a)\, \mathbb{E}[V_{t+1}(s')] \right].$$
We consider $w_t(s,a)$ as a local uncertainty, because we can compute it locally with $\mathbb{E}[V_{t+1}]$.
Then, we can derive our main theorem from this lemma.
Theorem 1
Under Assumptions 1 and 2, for any policy $\pi$, there exists a unique solution $u$ satisfying the following equation:

(1) $u_t(s,a) = w_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, u_{t+1}(s',a')$

for any $t$, $s$ and $a$, where $u_H \equiv 0$; furthermore, $u_t(s,a) \ge \mathrm{Var}[Q_t(s,a)]$ pointwise.

Theorem 1 means that we can compute an upper bound of $\mathrm{Var}[Q_t(s,a)]$ by solving the Bellman-style equation (1). Moreover, we provide the following theorem to show the convergence when computing $u$ iteratively.
Theorem 2
For arbitrary $u^{(0)}$, if
$$u^{(i+1)}_t(s,a) = \hat w^{(i)}_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, u^{(i)}_{t+1}(s',a')$$
for any $i$, $t$, $s$ and $a$, where $u^{(i)}_H \equiv 0$ and $\hat w^{(i)}_t$ converges to $w_t$ pointwise, then $u^{(i)}_t$ converges to $u_t$ pointwise.

Theorem 2 shows that we can solve the equation (1) iteratively even if the estimated local uncertainty is inaccurate per update, as long as it converges to the correct value. This matters when we use an estimated $\mathbb{E}[V]$ to compute the local uncertainty.
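In the tabular case, equation (1) can be solved by a single backward pass. The sketch below assumes the notation reconstructed above (local uncertainties, mean transitions, and a policy matrix); it is an illustration, not the authors' implementation.

```python
import numpy as np

def solve_uncertainty(w, p_mean, pi):
    """Backward recursion for the Bellman-style upper-bound equation:
        u_t(s,a) = w_t(s,a)
                   + sum_{s'} E[P](s'|s,a) sum_{a'} pi(a'|s') u_{t+1}(s',a'),
    with u_H = 0.
    w:      (H, S, A) local uncertainties
    p_mean: (S, A, S) mean transition probabilities E[P]
    pi:     (S, A) action probabilities of the evaluated policy
    """
    H, S, A = w.shape
    u = np.zeros((H + 1, S, A))
    for t in range(H - 1, -1, -1):
        next_u = (pi * u[t + 1]).sum(axis=1)   # (S,): policy-averaged u_{t+1}
        u[t] = w[t] + p_mean @ next_u          # propagate through E[P]
    return u[:H]
```

With unit local uncertainty everywhere, the bound simply accumulates one unit per remaining step, matching the backward recursion.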
As $u_t(s,a)$ is an upper bound of $\mathrm{Var}[Q_t(s,a)]$, $\sqrt{u_t(s,a)}$ is an upper bound of the uncertainty in $Q_t(s,a)$. We use this upper bound to approximate the uncertainty in our algorithm, similarly to UBE, and hence we need to analyze the accuracy of our estimates. Here, we compare our upper bound with that of UBE under the same assumptions, which requires an extra assumption used in UBE.
Assumption 3
The posteriors of $P(\cdot|s,a)$ and $r(s,a)$ are independent of those of $P(\cdot|s',a')$ and $r(s',a')$ for any $(s,a) \neq (s',a')$.

This assumption is not used to derive our upper bound of the uncertainty, but it is used in UBE. Under Assumptions 2 and 3, $P(\cdot|s,a)$ is independent of $Q_{t+1}$.
The upper bound $\tilde u$ derived in UBE satisfies
$$\tilde u_t(s,a) = \nu_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, \tilde u_{t+1}(s',a'),$$
where the local uncertainty $\nu_t(s,a)$ scales with $Q_{\max}^2$. Here, $Q_{\max}$ is an upper bound of $|Q^{M,\pi}_t(s,a)|$ for any $t$, $s$, $a$ and MDP $M$; for example, we can regard $H r_{\max}$ as $Q_{\max}$, where $r_{\max}$ bounds the absolute rewards.
Theorem 3
Under Assumptions 1, 2 and 3, $u_t(s,a)$ is a tighter upper bound of $\mathrm{Var}[Q_t(s,a)]$ than $\tilde u_t(s,a)$; that is, $\mathrm{Var}[Q_t(s,a)] \le u_t(s,a) \le \tilde u_t(s,a)$ pointwise.

This theorem means that our upper bound is a more accurate estimate of the uncertainty in Q-values than the upper bound derived in UBE.
Uncertainty Estimation Algorithm
First, we characterize the posterior of MDPs approximately using a deterministic model ensemble (please refer to Section 5 for the details of training models). A deterministic ensemble is denoted by $\{f_{\phi_1}, \ldots, f_{\phi_K}\}$. Here, for any $i$, $f_{\phi_i}$ is a single model that predicts the next state and the reward, and $\phi_i$ denotes its parameters. Each model $f_{\phi_i}$ induces an MDP $M_i$, and we define a posterior probability of MDPs that places probability $1/K$ on each $M_i$.

Then, we can construct an MDP $\bar M$ according to the posterior of MDPs, such that its transition tensor is equal to $\mathbb{E}[P]$ and its reward matrix is equal to $\mathbb{E}[r]$. Hence, the state value matrix of $\bar M$ is equal to $\mathbb{E}[V]$.
Moreover, we use a neural network $U_\psi$ to predict $u_t(s,a)$ for any state-action pair and time step, which is equivalent to predicting the upper bound of $\mathrm{Var}[Q_t(s,a)]$. We train $U_\psi$ by minimizing the loss function

(2) $L(\psi) = \sum \left( U_\psi(s,a,t) - \hat w_t(s,a) - \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, U_\psi(s',a',t+1) \right)^2.$
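A minimal sketch of the regression behind equation (2); the exact target structure (estimated local uncertainty plus the next-step prediction, treated as a fixed target) is our reconstruction, not the authors' code.

```python
import numpy as np

def uncertainty_loss(u_pred, w_hat, u_next):
    """Squared Bellman-style regression loss for the uncertainty network:
    the prediction at step t is regressed toward the estimated local
    uncertainty plus the (fixed) prediction at step t+1; u_next is zero
    at the final step.
    u_pred, w_hat, u_next: arrays of shape (batch,).
    """
    target = w_hat + u_next          # target treated as a constant
    return np.mean((u_pred - target) ** 2)
```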
Discussion
In this part, we discuss some advantages of our uncertainty estimation algorithm.
Accuracy
Based on Theorem 3, our upper bound of the uncertainty is tighter than that of UBE, which means a more accurate estimation. Intuitively, our local uncertainty depends on $\mathbb{E}[V_{t+1}]$ while that of UBE depends on $Q_{\max}$. Therefore, our local uncertainty has a weaker dependence on the horizon and can provide a relatively accurate estimation for long-horizon tasks (see an example in Appendix C). Moreover, considering an infinite set of states, our method ensures the boundedness of the local uncertainty because $r$ and $\mathbb{E}[V_{t+1}]$ are bounded. Therefore, our method has the potential to be applied to tasks with continuous state spaces.
Applicability for ModelBased Methods
Our method to estimate the uncertainty in Q-values is effective for model-based reinforcement learning. In model-based cases, estimated Q-values are highly dependent on the models. Our method considers the model when computing the local uncertainty, while most existing methods estimate the uncertainty directly from real-world samples regardless of the models. Ignoring the models may lead to bad uncertainty estimates in model-based cases. For example, the uncertainty estimated by a count-based method [3, 28] tends to decrease as the number of samples increases, while the true uncertainty remains high even with a large amount of samples when modeling a complicated MDP using a simple model.
Computational Cost
Our method is much cheaper computationally than estimating the uncertainty via the empirical standard deviation of Q-values across sampled MDPs. When an MDP $M$ is given, estimating $Q^{M,\pi}$ requires plenty of virtual samples, and estimating the empirical standard deviation requires estimating $Q^{M,\pi}$ for several MDPs. Previous work reduces this computational cost by learning an ensemble of Q functions [4]. However, training an ensemble of Q functions still requires higher computational overhead than training the single neural network $U_\psi$.
Compatibility with Neural Networks
Previous methods that estimate uncertainty for model-based reinforcement learning typically assume simple models, like Gaussian processes [8, 7]. Estimating uncertainty using Theorem 1 only requires that the models can represent a posterior. This makes our method compatible with neural network ensembles and Bayesian neural networks. For instance, we propose Algorithm 1 with an ensemble of neural networks.
Propagation of Uncertainty
As discussed in previous work [24], the Bellman equation implies strong dependence between Q-values. Ignoring this dependence limits the accuracy of uncertainty estimates. Our method accounts for the dependence and propagates the uncertainty via a Bellman-style equation.
4 Conservative Policy Optimization
In this section, we first introduce the surrogate objective and then modify it via the estimated uncertainty. The modified objective leads to conservative policy optimization because it penalizes updates in high-uncertainty regions. $\pi_\theta$ denotes a parameterized policy, and $\theta$ denotes its parameters. $\pi_\theta(a|s)$ is the probability of taking the action $a$ at the state $s$.
Surrogate Objective
Recent reinforcement learning algorithms, like Trust Region Policy Optimization (TRPO) [32] and Proximal Policy Optimization (PPO) [33], optimize the policy based on a surrogate objective. We rewrite the surrogate objective used in TRPO and PPO as follows:
$$L(\theta) = \mathbb{E}_{\tau \sim (M, \pi_{\theta_{old}})} \left[ \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, A_t(s_t, a_t) \right],$$
where $\theta_{old}$ denotes the old policy parameters before the update and $A_t$ is the advantage function of $\pi_{\theta_{old}}$.

Previous work has proven that the surrogate objective is a first-order approximation to $\eta(\pi_\theta) - \eta(\pi_{\theta_{old}})$ when $\theta$ is around $\theta_{old}$ [32, 12]. That is, we have the following theorem:
Theorem 4
$$\nabla_\theta L(\theta)\big|_{\theta = \theta_{old}} = \nabla_\theta \left( \eta(\pi_\theta) - \eta(\pi_{\theta_{old}}) \right)\big|_{\theta = \theta_{old}}$$
(see proof in Appendix A.5). Therefore, maximizing $L(\theta)$ approximately maximizes $\eta(\pi_\theta)$ when $\theta$ is around $\theta_{old}$.
UncertaintyAware Surrogate Objective
To prevent the policy from overfitting to inaccurate models, we introduce the estimated uncertainty in Q-values into the surrogate objective.
First, we need to estimate $\Pr[\eta(\pi_\theta) > \eta(\pi_{\theta_{old}})]$, the probability that the new policy outperforms the old one. Because of Theorem 4, $L(\theta)$ can approximate $\eta(\pi_\theta) - \eta(\pi_{\theta_{old}})$. We assume that a Gaussian can approximate the distribution of $L(\theta)$ over the posterior of MDPs. Thus, the probability is approximately equal to $\Phi\!\left( \mathbb{E}[L(\theta)] \big/ \sqrt{\mathrm{Var}[L(\theta)]} \right)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution.
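Under this Gaussian assumption, the improvement probability is a single CDF evaluation; the sketch below takes estimates of the mean and variance of the surrogate objective as given.

```python
import math

def improvement_probability(mean_l, var_l):
    """Approximate Pr[eta(pi_new) > eta(pi_old)] ~ Phi(E[L] / sqrt(Var[L])),
    assuming the surrogate objective L(theta) is roughly Gaussian.
    """
    if var_l <= 0.0:
        return 1.0 if mean_l > 0.0 else 0.0
    # standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf(mean_l / math.sqrt(2.0 * var_l)))
```

A zero expected improvement yields probability 0.5, and the probability grows toward 1 as the mean grows relative to the uncertainty.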
Then, we need to construct an objective function for optimization. Here, we aim to find a new $\theta$ with a large probability of improvement. As $\Phi$ is monotonically increasing, we can maximize $\mathbb{E}[L(\theta)]$ while minimizing $\mathrm{Var}[L(\theta)]$. Therefore, we can maximize

(3) $\mathbb{E}[L(\theta)] - \alpha \sqrt{\mathrm{Var}[L(\theta)]},$

where $\alpha$ is a hyperparameter.
Moreover, we need to estimate the expectation and the variance of the surrogate objective. Because $L(\theta)$ is a weighted sum of the random advantages $A_t(s_t,a_t)$, we can approximate $\mathbb{E}[L(\theta)]$ and $\mathrm{Var}[L(\theta)]$ as $\hat E(\theta)$ and $\hat V(\theta)$ respectively, where

(4) $\hat E(\theta) = \mathbb{E}_{\tau \sim (\bar M, \pi_{\theta_{old}})} \left[ \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, \mathbb{E}[A_t](s_t, a_t) \right],$

(5) $\hat V(\theta) = \mathbb{E}_{\tau \sim (\bar M, \pi_{\theta_{old}})} \left[ \sum_{t=0}^{H-1} \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \right)^{2} \mathrm{Var}[Q_t](s_t, a_t) \right].$

Here $\bar M$ is defined in Section 3 using a learned ensemble, $\mathbb{E}[A_t]$ can be approximated with advantage estimates computed on $\bar M$, and $\mathrm{Var}[Q_t]$ is computed by Algorithm 1.
However, policy optimization without a trust region may lead to unacceptably bad performance [32]. Thus, we clip the probability ratio similarly to PPO. That is,

(6) $\hat E_{clip}(\theta) = \mathbb{E}_{\tau \sim (\bar M, \pi_{\theta_{old}})} \left[ \sum_{t=0}^{H-1} \min\!\left( \rho_t(\theta)\, \mathbb{E}[A_t](s_t,a_t),\ \mathrm{clip}\!\left( \rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \mathbb{E}[A_t](s_t,a_t) \right) \right].$

Here, we define the ratio $\rho_t(\theta)$ as $\pi_\theta(a_t|s_t) \big/ \pi_{\theta_{old}}(a_t|s_t)$, and $\epsilon$ is a hyperparameter.
Finally, we obtain the modified surrogate objective $\hat E_{clip}(\theta) - \alpha \sqrt{\hat V(\theta)}$. Note that the main difference of our objective from PPO is the uncertainty penalty $-\alpha \sqrt{\hat V(\theta)}$. This penalty limits the ratio changes in high-uncertainty regions. Therefore, this objective is uncertainty-aware and leads to a conservative update.
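Putting the pieces together, a per-batch sketch of the modified objective might look as follows; the estimators for the expectation and variance terms are our assumptions based on equations (3)–(6), not the released code.

```python
import numpy as np

def pombu_surrogate(ratio, adv_mean, q_var, alpha, eps=0.2):
    """Uncertainty-penalized clipped surrogate (sketch): PPO's clipped
    expectation term minus alpha times the square root of an estimated
    variance of the surrogate, where q_var approximates Var[Q].
    ratio, adv_mean, q_var: per-sample arrays; alpha, eps: scalars.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    e_hat = np.mean(np.minimum(ratio * adv_mean, clipped * adv_mean))
    v_hat = np.mean(ratio ** 2 * q_var)   # importance-weighted uncertainty
    return e_hat - alpha * np.sqrt(v_hat)
```

With alpha = 0 this reduces to PPO's clipped objective; a positive alpha shrinks the objective wherever the estimated Q-value variance is large.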
5 Algorithm
In this section, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU) in Algorithm 2. We detail each stage of our algorithm as follows.
Exploration Policy
We train a set of exploration policies by maximizing the modified surrogate objective. Different policies are trained with different virtual trajectories. To explore the unknown, we replace $\alpha$ with $-\beta$ in the objective, where the hyperparameter $\beta$ controls how strongly the exploration is driven toward high-uncertainty regions.
Model Ensemble
To predict the next state, a single neural network in the ensemble outputs the change in the state and then adds the change to the current state [15, 22]. To predict the reward, we assume the reward in the real environment is computed by a function of the state, the action and the next state, which holds in many simulation control tasks; we can then predict the reward via the predicted next state. We train each model by minimizing an $L_2$ loss similarly to previous work [15, 22] and optimize the parameters using Adam [13]. Different models are trained with different train-validation splits.
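The per-member data handling and the delta-prediction convention can be sketched as below; the split function and the wrapper are illustrative assumptions, not the authors' training code.

```python
import numpy as np

def split_for_member(data, seed, val_frac=0.2):
    """Give each ensemble member its own train/validation split: a
    different random permutation per member decorrelates the models.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_val = int(len(data) * val_frac)
    return data[idx[n_val:]], data[idx[:n_val]]

def predict_next_state(delta_model, s, a):
    """A member predicts the change in the state; the next state is the
    current state plus the predicted delta."""
    return s + delta_model(s, a)
```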
Policy Optimization
We use a Gaussian policy whose mean is computed by a feedforward neural network and whose standard deviation is represented by a vector of parameters. We optimize all parameters by maximizing the modified surrogate objective via Adam.
6 Experiments
In this section, we first evaluate our uncertainty estimation method. Second, we compare POMBU to state-of-the-art methods. Then, we show how the estimated uncertainty works via an ablation study. Finally, we analyze the robustness of our method empirically. In the following experiments, we report the performance averaged over at least three random seeds. Please refer to Appendix D for the details of the experiments. The source code and appendix of this work are available at https://github.com/MIRALabUSTC/RLPOMBU.
Effectiveness of Uncertainty Estimation
We evaluate the effectiveness of our uncertainty estimation method in two environments: 2D-point and 3D-point. These environments have continuous state spaces and continuous action spaces. First, we train an ensemble of models of the environment and sample state-action pairs from the models using a deterministic policy. Then, we estimate the Q-values of these pairs via the means of virtual returns (computed using the models), and estimate the uncertainty using Algorithm 1. Finally, we compute the real Q-values using returns in the real environment, compute the ratios of the errors to the estimated uncertainties, and count the frequencies of these ratios to draw Figure 1. The figure shows that the distribution of the ratios is similar to a standard normal distribution after sufficient training of the uncertainty network, which demonstrates the accuracy of the estimated uncertainty.
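The calibration check described above amounts to computing z-scores; a sketch, assuming the estimated uncertainty approximates the variance of the Q-estimates:

```python
import numpy as np

def error_uncertainty_ratios(q_est, q_real, u_est):
    """Ratios of Q-value errors to estimated uncertainties: for a well
    calibrated estimator these should resemble standard normal draws.
    q_est, q_real: estimated and real Q-values; u_est: estimated variances.
    """
    return (q_est - q_real) / np.sqrt(u_est)
```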
Comparison to StateoftheArts
We compare POMBU with state-of-the-art policy optimization algorithms in four continuous control tasks of MuJoCo [34]: Swimmer, HalfCheetah, Ant, and Walker2d. Our method and our baselines optimize a stochastic policy to complete the tasks. Our baselines include: soft actor-critic (SAC) [10]; proximal policy optimization (PPO) [33]; stochastic lower bounds optimization (SLBO) [20]; and model-ensemble trust region policy optimization (ME-TRPO) [15]. To show the benefits of using uncertainty in model-based reinforcement learning, we also compare POMBU to model-ensemble proximal policy optimization (ME-PPO), which is equivalent to POMBU when $\alpha = 0$ and $\beta = 0$. We evaluate POMBU with the same $\alpha$ and $\beta$ for all tasks.
The results are shown in Figure 2. The solid curves correspond to the mean and the shaded regions correspond to the empirical standard deviation. POMBU achieves higher sample efficiency and better final performance than the baselines, which highlights the benefits of using uncertainty. Moreover, POMBU achieves asymptotic performance comparable to PPO and SAC in all tasks.
We also provide Table 1, which summarizes the performance, estimated wall-clock time, and the numbers of imagined and real-world samples used in the HalfCheetah task (H = 200). Compared to ME-PPO, the extra time used by POMBU is small, while the improvement in the mean return and the reduction in its standard deviation are significant. Compared to SAC, POMBU achieves higher performance with about five times fewer real-world samples. Moreover, in our experiments, the total time to compute the uncertainty (not including the time to train the uncertainty network) is about 1.4 minutes, which is negligible compared with the overall time.
We further compare POMBU with state-of-the-art model-based algorithms in long-horizon tasks. The compared algorithms additionally include model-based meta policy optimization (MB-MPO) [6], probabilistic ensembles with trajectory sampling (PETS) [5] and stochastic ensemble value expansion (STEVE) [4]. We directly use some of the results given by Wang et al. [35], and summarize all results in Table 2. The table shows that POMBU achieves comparable performance with STEVE and PETS, and outperforms the other model-based algorithms. This demonstrates that POMBU is also effective in long-horizon tasks.
Table 1: Performance, estimated wall-clock time, and numbers of imagined and real-world samples in the HalfCheetah task (H = 200).

Metric | POMBU | ME-PPO | ME-TRPO | SLBO | SAC | PPO | SAC_max | PPO_max
Time (h) | 12.05 | 10.17 | 6.35 | 3.91 | 0.87 | 0.04 | 4.18 | 0.19
Imagined samples | 1.2e8 | 8e7 | 5e7 | 1e7 | 0 | 0 | 0 | 0
Real-world samples | 2e5 | 2e5 | 2e5 | 2e5 | 2e5 | 2e5 | 9.89e5 | 9.78e5
Return
[Table 2: returns of POMBU, STEVE, MB-MPO, SLBO, ME-TRPO and PETS on the Ant, HalfCheetah, Swimmer and Walker2d tasks.]
Ablation Study
We provide an ablation study to show how the uncertainty benefits the performance. In our algorithm, we employ the uncertainty in policy optimization (controlled by $\alpha$) and in exploration (controlled by $\beta$). Therefore, we compare the performance under different values of $\alpha$ and $\beta$.

The results are shown in Figures 3 and 4. Setting $\alpha$ appropriately achieves the best final performance and the best robustness with 200K samples. Note that a large $\alpha$ may result in poorer performance in the early stage, because the uncertainty is high early on and a large $\alpha$ tends to choose a small step size when the uncertainty is high. Using $\beta > 0$ can improve the performance (larger mean and smaller standard deviation), which demonstrates the effectiveness of uncertainty-aware exploration.
Robustness Analyses
We demonstrate the robustness of POMBU in two ways. First, we evaluate algorithms in noisy environments, in which we add Gaussian noise to the observations. This noise affects the accuracy of the learned models. Second, we evaluate algorithms in long-horizon tasks, in which the models need to generate long trajectories and the error is further exacerbated by the difficulty of long-term prediction.
We report the results in Figure 5. Experiments show that our algorithm achieves similar performance across different random seeds, while the performance of ME-TRPO varies greatly with the random seed. Moreover, in Figure 5, the worst performance of POMBU beats the best performance of ME-TRPO. This implies that our method has promising robustness, even in noisy and long-horizon environments.
7 Conclusion
In this work, we propose a Policy Optimization method with Model-Based Uncertainty (POMBU), a novel uncertainty-aware model-based algorithm. The method estimates the uncertainty in Q-values using a model ensemble and then optimizes the policy conservatively, taking the uncertainty into account. Experiments demonstrate that POMBU can achieve asymptotic performance comparable to SAC and PPO while using much fewer samples. Compared with other model-based methods, POMBU is robust and can achieve better performance. We believe that our approach will bring new insights into model-based reinforcement learning. An enticing direction for future work is the combination of our uncertainty estimation method with other kinds of models, like Bayesian neural networks. Another exciting direction is to modify other advanced model-based algorithms, like STEVE and PETS, using our uncertainty estimation method.
References

[1] P. Abbeel, M. Quigley, and A. Y. Ng. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 1–8. ACM, 2006.
[2] J. A. Bagnell and J. G. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation, volume 2, pages 1615–1620. IEEE, 2001.
[3] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
[4] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234, 2018.
[5] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
[6] I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.

[7] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 150–159. Morgan Kaufmann Publishers Inc., 1999.
[8] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
[9] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
[10] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[11] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [12] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
 [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [14] J. Kocijan, R. MurraySmith, C. E. Rasmussen, and A. Girard. Gaussian process model based predictive control. In Proceedings of the 2004 American Control Conference, volume 3, pages 2214–2219. IEEE, 2004.
[15] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
 [16] M. Kuss and C. E. Rasmussen. Gaussian processes in reinforcement learning. In Advances in neural information processing systems, pages 751–758, 2004.
 [17] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
 [18] S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 [19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[20] Y. Luo, H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
 [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[22] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 [23] B. O’Donoghue, I. Osband, R. Munos, and V. Mnih. The uncertainty bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.
 [24] I. Osband, J. Aslanides, and A. Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
[25] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
 [26] I. Osband, B. Van Roy, D. Russo, and Z. Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
 [27] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635.
[28] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2721–2730. JMLR.org, 2017.
 [29] A. Punjani and P. Abbeel. Deep learning helicopter dynamics models. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3223–3230. IEEE, 2015.
[30] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine. EPOpt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
 [31] J. G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In NIPS, pages 1047–1053. 1997.
 [32] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[34] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
[35] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba. Benchmarking model-based reinforcement learning, 2019.
A Proof
In this section, we provide all the proofs mentioned in the body of our paper.
Proof of Lemma 1
Lemma 1
Under Assumptions 1 and 2, for any $t$, $s$ and $a$, we have
$$\mathrm{Var}[Q_t(s,a)] \le w_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, \mathrm{Var}[Q_{t+1}(s',a')].$$
Proof. Let $X = \sum_i p_i X_i$, where $p_i \ge 0$, $\sum_i p_i = 1$ and each $X_i$ is a random variable. By using Jensen's inequality, we have

(7) $\mathrm{Var}[X] \le \sum_i p_i\, \mathrm{Var}[X_i].$
By applying the inequality (7) to the Bellman equation, we have
(8) 
By using the law of total variance, we have

(9) $\mathrm{Var}[X] = \mathbb{E}\big[ \mathrm{Var}[X \mid Y] \big] + \mathrm{Var}\big[ \mathbb{E}[X \mid Y] \big].$
Because Assumptions 1 and 2 imply independence of the involved quantities across time steps, we have
(10) 
By using the inequality (7), we have
(11) 
where the last step holds because of the independence implied by Assumptions 1 and 2. Combining (8), (9), (10) and (11), we obtain Lemma 1.
Proof of Theorem 1
Theorem 1
Under Assumptions 1 and 2, for any policy $\pi$, there exists a unique solution $u$ satisfying the following equation:

(12) $u_t(s,a) = w_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, u_{t+1}(s',a')$

for any $t$, $s$ and $a$, where $u_H \equiv 0$; furthermore, $u_t(s,a) \ge \mathrm{Var}[Q_t(s,a)]$ pointwise.
Proof. First, the solution at the final time step exists and is unique because $u_H \equiv 0$. Moreover, we know that $\mathrm{Var}[Q_H] = 0 \le u_H$. Then, there exists a unique solution of $u_t$ if there exists a unique solution of $u_{t+1}$, because $u_t$ is a linear combination of $w_t$ and $u_{t+1}$. Additionally, by using Lemma 1, $u_t \ge \mathrm{Var}[Q_t]$ holds pointwise if $u_{t+1} \ge \mathrm{Var}[Q_{t+1}]$ holds pointwise. Finally, we obtain Theorem 1 by induction.
Proof of Theorem 2
Theorem 2
For arbitrary $u^{(0)}$, if
$$u^{(i+1)}_t(s,a) = \hat w^{(i)}_t(s,a) + \sum_{s'} \mathbb{E}[P](s'|s,a) \sum_{a'} \pi(a'|s')\, u^{(i)}_{t+1}(s',a')$$
for any $i$, $t$, $s$ and $a$, where $u^{(i)}_H \equiv 0$ and $\hat w^{(i)}_t$ converges to $w_t$ pointwise, then $u^{(i)}_t$ converges to $u_t$ pointwise.
Proof. $u^{(i)}_{H-1}$ converges to $u_{H-1}$ because $\hat w^{(i)}_{H-1}$ converges to $w_{H-1}$ and $u^{(i)}_H \equiv 0$.

For any $t$, if $u^{(i)}_{t+1}$ converges to $u_{t+1}$, then $u^{(i)}_t$ converges to $u_t$ under the assumption that $\hat w^{(i)}_t$ converges to $w_t$, because $u^{(i)}_t$ is a linear combination of $\hat w^{(i)}_t$ and $u^{(i)}_{t+1}$.

Then, we obtain the conclusion by induction.
Proof of Theorem 3
Theorem 3
Under Assumptions 1, 2 and 3, $u_t(s,a)$ is a tighter upper bound of $\mathrm{Var}[Q_t(s,a)]$ than the UBE bound $\tilde u_t(s,a)$.
Proof. Here, we only show that $u_t \le \tilde u_t$ pointwise (see UBE [23] for the proof that $\tilde u_t$ is an upper bound of the uncertainty).
By using the inequality (8), we have
(13) 
Under Assumptions 1, 2 and 3, we have