MBDP: A Model-based Approach to Achieve both Robustness and Sample Efficiency via Double Dropout Planning

by   Wanpeng Zhang, et al.
Tsinghua University

Model-based reinforcement learning is a widely accepted solution for solving excessive sample demands. However, the predictions of the dynamics models are often not accurate enough, and the resulting bias may incur catastrophic decisions due to insufficient robustness. Therefore, it is highly desired to investigate how to improve the robustness of model-based RL algorithms while maintaining high sampling efficiency. In this paper, we propose Model-Based Double-dropout Planning (MBDP) to balance robustness and efficiency. MBDP consists of two kinds of dropout mechanisms, where the rollout-dropout aims to improve the robustness with a small cost of sample efficiency, while the model-dropout is designed to compensate for the lost efficiency at a slight expense of robustness. By combining them in a complementary way, MBDP provides a flexible control mechanism to meet different demands of robustness and efficiency by tuning two corresponding dropout ratios. The effectiveness of MBDP is demonstrated both theoretically and experimentally.




1 Introduction

Reinforcement learning (RL) algorithms are commonly divided into two categories: model-free RL and model-based RL. Model-free RL methods learn a policy directly from samples collected in the real environment, while model-based RL approaches build approximate predictive models of the environment to assist in the optimization of the policy Chen et al. (2015); Polydoros and Nalpantidis (2017). In recent years, RL has achieved remarkable results in a wide range of areas, including continuous control Schulman et al. (2015); Lillicrap et al. (2015); Levine et al. (2016) and surpassing human performance in Go and video games Mnih et al. (2015); Silver et al. (2016). However, most of these results are achieved by model-free RL algorithms, which rely on a large number of environment samples for training and are therefore limited in the scenarios where they can be deployed in practice. In contrast, model-based RL methods have shown promising potential to cope with the lack of samples by using predictive models for simulation and planning Deisenroth et al. (2013); Berkenkamp et al. (2017). To reduce sample complexity, PILCO Deisenroth and Rasmussen (2011) learns a probabilistic model through Gaussian process regression, which models prediction uncertainty to boost the agent's performance in complex environments. Based on PILCO, the DeepPILCO algorithm Gal et al. (2016) enables the modeling of more complex environments by introducing the Bayesian Neural Network (BNN), a universal function approximator with high capacity. To further enhance the interpretability of the predictive models and improve the robustness of the learned policies Chua et al. (2018); Malik et al. (2019), ensemble-based methods Rajeswaran et al. (2016); Kurutach et al. (2018) train an ensemble of models to comprehensively capture the uncertainty in the environment and have been empirically shown to obtain significant improvements in sample efficiency Levine et al. (2016); Chua et al. (2018); Janner et al. (2019).

Despite the high sample efficiency, model-based RL methods inherently suffer from inaccurate predictions, especially when faced with high-dimensional complex tasks and insufficient training samples Abbeel et al. (2006); Moerland et al. (2020). Model accuracy can greatly affect the policy quality, and policies learned in inaccurate models tend to suffer significant performance degradation due to cumulative model error Sutton (1996); Asadi et al. (2019). Therefore, how to eliminate the effects of model bias has become a hot topic in model-based RL. Another important factor that limits the application of model-based algorithms is safety. In a general RL setup, the agent needs to collect observations to infer the current state before making decisions, which poses a challenge to the robustness of the learned policy because acquiring observations through sensors may introduce random noise and the real environment is normally partially observable. Non-robust policies may generate disastrous decisions when faced with a noisy environment, and this safety issue is more prominent in model-based RL because the error in inferring the current state from observations may be further amplified by model bias during simulation and planning with the predictive models. Drawing on research in robust control Zhou and Doyle (1998), a branch of control theory, robust RL methods have attracted increasing attention as a way to improve the capability of the agent against perturbed states and model bias. The main objective of robust RL is to optimize the agent's performance in worst-case scenarios and to improve the generalization of learned policies to noisy environments Shapiro et al. (2014). Existing robust RL methods can be roughly classified into two types: one is based on adversarial ideas, such as RARL Pinto et al. (2017) and NR-MDP Tessler et al. (2019), and obtains robust policies by proposing corresponding minimax objective functions, while the other Tamar et al. (2015) introduces conditional value at risk (CVaR) objectives to ensure the robustness of the learned policies. However, the increased robustness of these methods can lead to a substantial loss of sample efficiency due to the pessimistic manner of data use. Therefore, it is nontrivial to enhance the robustness of a policy while avoiding sample inefficiency.

In this paper, we propose the Model-Based Reinforcement Learning with Double Dropout Planning (MBDP) algorithm for the purpose of learning policies that can reach a balance between robustness and sample efficiency. Inspired by CVaR, we design the rollout-dropout mechanism to enhance robustness by optimizing the policies with low-reward samples. On the other hand, in order to maintain high sample efficiency and reduce the impact of model bias, we learn an ensemble of models to compensate for the inaccuracy of a single model. Furthermore, when generating imaginary samples to assist in the optimization of policies, we design the model-dropout mechanism to avoid the perturbation of inaccurate models by only using models with small errors. To meet different demands of robustness and sample efficiency, a flexible control can be realized via the two dropout mechanisms. We demonstrate the effectiveness of MBDP both theoretically and empirically.

2 Notations and Preliminaries

2.1 Reinforcement Learning

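We consider a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, r, \gamma, \mathcal{P})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r(s, a)$ is the reward function, $\gamma$ is the discount factor, and $\mathcal{P}(s' \mid s, a)$ is the conditional probability distribution of the next state $s'$ given current state $s$ and action $a$. The form $s' = \mathcal{P}(s, a)$ denotes the state transition function when the environment is deterministic. Let $V^{\pi, \mathcal{P}}(s)$ denote the expected return or expectation of accumulated rewards starting from initial state $s$, i.e., the expected sum of discounted rewards following policy $\pi$ and state transition function $\mathcal{P}$:

$$V^{\pi, \mathcal{P}}(s) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$

For simplicity of notation, let $V^{\pi, \mathcal{P}}$ denote the expected return over random initial states:

$$V^{\pi, \mathcal{P}} = \mathbb{E}_{s_0}\left[V^{\pi, \mathcal{P}}(s_0)\right]$$

The goal of reinforcement learning is to maximize the expected return by finding the optimal decision policy, i.e., $\pi^{*} = \arg\max_{\pi} V^{\pi, \mathcal{P}}$.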

2.2 Model-based Methods

In model-based reinforcement learning, an approximated transition model is learned by interacting with the environment, the policy is then optimized with samples from the environment and data generated by the model. We use the parametric notation to specifically denote the model trained by a neural network, where is the parameter space of models.

More specifically, to improve the ability of models to represent complex environment, we need to learn multiple models and make an ensemble of them, i.e., . To generate a prediction from the model ensemble, we select a model from uniformly at random, and perform a model rollout using the selected model at each time step, i.e., . Then we fill these rollout samples into a batch. Finally we can perform policy optimization on these generated samples.
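The rollout procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `ensemble_rollout`, the toy drift models, and the linear policy are all assumptions made for the example, and each model is assumed to return a `(next_state, reward)` pair.

```python
import numpy as np

def ensemble_rollout(models, policy, start_state, horizon, rng):
    """Roll out `policy` for `horizon` steps, drawing one model uniformly at random per step."""
    state = np.asarray(start_state, dtype=float)
    batch = []
    for _ in range(horizon):
        action = policy(state)
        model = models[rng.integers(len(models))]  # uniform choice over the ensemble
        next_state, reward = model(state, action)
        batch.append((state, action, reward, next_state))
        state = next_state
    return batch

# Hypothetical toy ensemble: members differ only in a constant drift term.
def make_model(drift):
    def model(state, action):
        next_state = state + action + drift
        reward = -float(np.sum(next_state ** 2))  # toy reward: stay near the origin
        return next_state, reward
    return model

models = [make_model(d) for d in (-0.1, 0.0, 0.1)]
policy = lambda s: -0.5 * s  # hypothetical linear policy
rng = np.random.default_rng(0)
batch = ensemble_rollout(models, policy, np.ones(2), horizon=5, rng=rng)
```

The resulting `batch` is the list of generated samples on which policy optimization would then be performed.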

2.3 Conditional Value-at-Risk


denote a random variable with a cumulative distribution function (CDF)

. Given a confidence level , the Value-at-Risk of (at confidence level ) is denoted , and given by


The Conditional-Value-at-Risk of (at confidence level ) is denoted by and defined as the expected value of , conditioned on the -portion of the tail distribution:
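For reference, one common way to write the two quantities is given below. The symbols $Z$, $F$, and $c$, and the choice of the lower tail (natural when $Z$ is a return to be maximized), are assumptions made here and not necessarily the paper's notation; references that treat $Z$ as a cost condition on the upper tail instead.

```latex
\begin{align}
\mathrm{VaR}_c(Z)  &= \inf\{\, z : F(z) \ge c \,\}, \qquad F(z) = \Pr(Z \le z), \\
\mathrm{CVaR}_c(Z) &= \mathbb{E}\big[\, Z \mid Z \le \mathrm{VaR}_c(Z) \,\big].
\end{align}
```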


3 MBDP Framework

Figure 1: Overview of the MBDP algorithm. When interacting with the environment, we collect samples into the environment replay buffer, which is used for training the simulator model of the environment. Then we implement the model-dropout procedure and perform rollouts on the model ensemble. The sampled data from the model ensemble is filled into a temporary batch, from which we obtain a dropout buffer by implementing the rollout-dropout procedure. Finally, we use samples from the dropout buffer to optimize the policy.

In this section, we introduce how MBDP leverages Double Dropout Planning to find the balance between efficiency and robustness. The basic procedure of MBDP is to 1) sample data from the environment; 2) train an ensemble of models from the sampled data; 3) calculate model bias over observed environment samples, and choose a subset of the model ensemble based on the calculated model bias; 4) collect rollout trajectories from the model ensemble, and make gradient updates based on the subsets of sampled data. An overview of the algorithm architecture is shown in Figure 1, and the overall pseudo-code is given in Algorithm 1.

We will also theoretically analyze robustness and performance under the dropout planning of our MBDP algorithm. For simplicity of theoretical analysis, we only consider deterministic environment and models in this section, but the experimental part does not require this assumption. The detailed proofs can be found in the appendix as provided in supplementary materials.

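  Initialize hyperparameters, policy, environment replay buffer, model replay buffer
  for each environment iteration do
     Take an action in the environment using the policy
     Add samples to the environment replay buffer
     for each model iteration do
        Train the probabilistic models on the environment replay buffer
        Build a model subset according to the model bias (model-dropout)
        for each rollout step do
           Select a model from the model subset uniformly at random
           Perform rollouts on the selected model with the policy and get samples
           Fill these samples into a temporary batch
        end for
        Calculate the reward percentile of the temporary batch, grouped by state
        for each sample in the temporary batch do
           if the reward of the sample does not exceed the percentile for its state then
              fill the sample into the model replay buffer
           end if
        end for
     end for
     Optimize the policy on the model replay buffer
  end for
Algorithm 1 Model-Based Reinforcement Learning with Double Dropout Planning (MBDP)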

3.1 Rollout Dropout in MBDP

Optimizing the expected return in the usual model-based way allows us to learn a policy that performs best in expectation over the training model ensemble. However, the best expectation does not mean that the resulting policy performs well at all times. This instability typically leads to risky decisions when facing poorly-informed states at deployment.

Inspired by previous works Rajeswaran et al. (2016); Tamar et al. (2015); Chow et al. (2015) which optimize conditional value at risk (CVaR) to explicitly seek a robust policy, we add a dropout mechanism in the rollout procedure. Recall the model-based methods in Section 2.2, to generate a prediction from the model ensemble, we select a model from uniformly at random, and perform a model rollout using the selected model at each time step, i.e., . Then we fill these rollout samples into a batch and retain a percentile subset with more pessimistic rewards. We use to denote the percentile rollout batch:


where and is the percentile of reward values conditioned on state in batch . The expected return of dropout batch rollouts is denoted by :


Rollout-dropout can improve robustness at a small cost in sample efficiency; we analyze how it improves robustness in Section 3.3.
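The rollout-dropout filter described in this subsection can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the name `alpha` for the rollout-dropout ratio, the choice of the `1 - alpha` quantile as the per-state threshold, and grouping by exact state equality (which only makes sense for discrete or repeated states) are all assumptions of the sketch.

```python
from collections import defaultdict
import numpy as np

def rollout_dropout(batch, alpha):
    """Keep, for every state, the samples whose reward is at or below the
    (1 - alpha) quantile of the rewards observed for that state."""
    rewards_by_state = defaultdict(list)
    for state, action, reward, next_state in batch:
        rewards_by_state[tuple(state)].append(reward)
    # One pessimistic threshold per state.
    thresholds = {s: np.quantile(r, 1.0 - alpha) for s, r in rewards_by_state.items()}
    return [sample for sample in batch if sample[2] <= thresholds[tuple(sample[0])]]

# Hypothetical batch: ten samples from state (0.0,) with rewards 1..10,
# plus a single sample from state (1.0,).
batch = [((0.0,), 0, float(r), (0.0,)) for r in range(1, 11)]
batch.append(((1.0,), 0, 5.0, (1.0,)))

kept = rollout_dropout(batch, alpha=0.2)
```

With `alpha=0.2`, the two highest-reward samples of the first state are dropped and the policy would be optimized on the remaining, more pessimistic samples. Setting `alpha=0` keeps the whole batch.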

3.2 Model Dropout in MBDP

Rollout-dropout can improve robustness, but dropping a certain number of samples clearly affects the algorithm's sample efficiency. Model-based methods can alleviate this problem. However, since model bias can affect the performance of the algorithm, we also need to consider how to reduce it. Previous works use an ensemble of bootstrapped probabilistic transition models, as in the PETS method Chua et al. (2018), to properly incorporate two kinds of uncertainty (aleatoric and epistemic) into the transition model.

In order to mitigate the impact of discrepancies and flexibly control the accuracy of the model ensemble, we design a model-dropout mechanism. More specifically, we first learn an ensemble of transition models, each member of which is a probabilistic neural network whose outputs parametrize a Gaussian distribution. While training the models on samples from the environment, we calculate the bias of each model, averaged over the observed state-action pairs:


which formulates the distance of next states in model and in environment , where is a distance function on state space .

Then we select models from the model ensemble uniformly at random, sort them in ascending order by the calculated bias and retain a dropout subset with smaller model bias: , i.e., , where and is the max integer in the ascending order index after we dropout the -percentile subset with large bias.
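A minimal sketch of the model-dropout step is given below. It is not the authors' implementation: the name `beta` for the model-dropout ratio, the Euclidean norm as the distance function, and the rounding rule for the number of dropped models are assumptions, and the preliminary random selection of models is omitted for brevity. Each model is assumed to map `(state, action)` to a predicted next state.

```python
import numpy as np

def model_bias(model, transitions):
    """Mean Euclidean distance between predicted and observed next states."""
    errors = [np.linalg.norm(model(s, a) - s_next) for s, a, s_next in transitions]
    return float(np.mean(errors))

def model_dropout(models, transitions, beta):
    """Sort models by bias (ascending) and drop the `beta` fraction with the largest bias."""
    biases = [model_bias(m, transitions) for m in models]
    order = np.argsort(biases)
    n_drop = int(round(beta * len(models)))
    n_keep = max(1, len(models) - n_drop)  # always keep at least one model
    return [models[i] for i in order[:n_keep]]

# Hypothetical setup: true dynamics are s' = s + a; each model adds a constant offset.
def make_model(offset):
    def model(state, action):
        return state + action + offset
    model.offset = offset  # stored only so the example can report which models survive
    return model

rng = np.random.default_rng(0)
transitions = []
for _ in range(20):
    s, a = rng.normal(size=2), rng.normal(size=2)
    transitions.append((s, a, s + a))

models = [make_model(o) for o in (0.0, 0.5, 0.1, 1.0, 0.3)]
kept = model_dropout(models, transitions, beta=0.4)
kept_offsets = [m.offset for m in kept]
```

Here the two models with the largest offsets are discarded, so subsequent rollouts would only use the three most accurate members.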

3.3 Theoretical Analysis of MBDP

We now give theoretical guarantees for the robustness and sample efficiency of the MBDP algorithm. All the proofs of this section are detailed in Appendix A.

3.3.1 Guarantee of Robustness

We define the robustness as the expected performance in a perturbed environment. Consider a perturbed transition matrix , where

is a multiplicative probability perturbation and

is the Hadamard Product. Recall the definition of in equation (2.4), now we propose following theorem to provide guarantee of robustness for MBDP algorithm.

Theorem 3.1.

It holds


given the constraint set of perturbation


Since means optimizing the expected performance in a perturbed environment, which is exactly our definition of robustness, then Theorem 3.1 can be interpreted as an equivalence between optimizing robustness and the expected return under rollout-dropout, i.e., .

3.3.2 Guarantee of Efficiency

We first propose Lemma 3.2 to prove that the expected return with only rollout-dropout mechanism, compared to the expected return when it is deployed in the environment , has a discrepancy bound.

Lemma 3.2.

Suppose is the supremum of reward function , i.e., , the expected return of dropout batch rollouts with individual model has a discrepancy bound:


While Lemma 3.2 only provides a guarantee for the performance of rollout-dropout mechanism, we now propose Theorem 3.3 to prove that the expected return of policy derived by model dropout together with rollout-dropout, i.e., our MBDP algorithm, compared to the expected return when it is deployed in the environment , has a discrepancy bound.

Theorem 3.3.

Suppose is a constant. The expected return of MBDP algorithm, i.e., , compared to the expected return when it is deployed in the environment , i.e., , has a discrepancy bound:






Since MBDP algorithm is an extension of the Dyna-style algorithm Sutton (1991): a series of model-based reinforcement learning methods which jointly optimize the policy and transition model, it can be written in a general pattern as below:


where denotes the updated policy in -th iteration and denotes the updated dropout model ensemble in -th iteration. In this setting, we can show that, performance of the policy derived by our MBDP algorithm, is approximatively monotonically increasing when deploying in the real environment , with ability to robustly jump out of local optimum.

Proposition 3.4.

The expected return of policy derived by general algorithm pattern (3.10), is approximatively monotonically increasing when deploying in the real environment , i.e.


where is defined in (3.6) and is the update residual:


Intuitively, proposition 3.4 shows that under the control of reasonable parameters and , is often a large update value in the early learning stage, while as an error bound is a fixed small value. Thus is a value greater than most of the time in the early learning stage, which can guarantee . In the late stage near convergence, the update becomes slow and may be smaller than , which leads to the possibility that is smaller than . This makes the update process try some other convergence direction, providing an opportunity to jump out of the local optimum. We empirically verify this claim in Appendix C.

3.3.3 Flexible control of robustness and efficiency

According to Theorem 3.1, rollout-dropout improves robustness: the larger the rollout-dropout ratio is, the more robustness is improved, and conversely, the smaller it is, the worse the robustness will be. For model-dropout, a larger model-dropout ratio means that more models are dropped and the remaining ensemble is more likely to overfit the environment, so it is less robust. Conversely, when the model-dropout ratio is smaller, the model ensemble is better at simulating complex environments, and robustness is better.

Turning to efficiency, note that the bound in equation (3.8) grows with the rollout-dropout ratio and shrinks with the model-dropout ratio. This means that as the rollout-dropout ratio increases or the model-dropout ratio decreases, the bound expands, causing the accuracy of the algorithm to decrease and the algorithm to take longer to converge, thus making it less efficient. Conversely, when the rollout-dropout ratio decreases or the model-dropout ratio increases, the efficiency increases.

With the analysis above, it suggests that MBDP can provide a flexible control mechanism to meet different demands of robustness and efficiency by tuning two corresponding dropout ratios. This conclusion can be summarized as follows and we also empirically verify it in section 4.

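  • To get balanced efficiency and robustness: set both the rollout-dropout ratio and the model-dropout ratio to a moderate value.

  • To get better robustness: set the rollout-dropout ratio to a larger value and the model-dropout ratio to a smaller value.

  • To get better efficiency: set the rollout-dropout ratio to a smaller value and the model-dropout ratio to a larger value.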

4 Experiments

Our experiments aim to answer the following questions:

  • How does MBDP perform on benchmark reinforcement learning tasks compared to state-of-the-art model-based and model-free RL methods?

  • Can MBDP find a balance between robustness and benefits?

  • How do the robustness and efficiency of MBDP change when the two dropout ratios are tuned?

To answer the posed questions, we need to understand how well our method compares to state-of-the-art model-based and model-free methods and how our design choices affect performance. We evaluate our approach on four continuous control benchmark tasks in the MuJoCo simulator Todorov et al. (2012): Hopper, Walker, HalfCheetah, and Ant. We also perform an ablation study by removing the dropout modules from our algorithm. Finally, we separately analyze the hyperparameters (the two dropout ratios). A depiction of the environments and a detailed description of the experimental setup can be found in Appendix B.

4.1 Comparison with State-of-the-Arts

In this subsection, we compare our MBDP algorithm with state-of-the-art model-free and model-based reinforcement learning algorithms in terms of sample complexity and performance. Specifically, we compare against SAC Haarnoja et al. (2018), which is the state-of-the-art model-free method and establishes a widely accepted baseline. For model-based methods, we compare against MBPO Janner et al. (2019), which uses short-horizon model-based rollouts started from samples in the real environment; STEVE Buckman et al. (2018), which dynamically incorporates data from rollouts into value estimation rather than policy learning; and SLBO Luo et al. (2019), a model-based algorithm with performance guarantees. For our MBDP algorithm, we use a single fixed setting of the two dropout ratios.

Figure 2: Learning curves of our MBDP algorithm and four baselines on different continuous control environments. Solid curves indicate the mean of all trials with 5 different seeds. Shaded regions correspond to standard deviation among trials. Each trial is evaluated every 1000 steps. The dashed reference lines are the asymptotic performance of the SAC algorithm. These results show that our MBDP method learns faster and has better asymptotic performance and sample efficiency than existing model-based algorithms.

Figure 2 shows the learning curves for all methods, along with the asymptotic performance of the model-free SAC algorithm, which does not converge in the region shown. The results highlight the strength of MBDP in terms of performance and sample complexity. In all the MuJoCo simulator environments, our MBDP method learns faster and has better efficiency than existing model-based algorithms, which empirically demonstrates the advantage of Dropout Planning.

4.2 Analysis of Robustness

Figure 3: The robustness performance is depicted as heat maps for various environment settings. Each heat map represents a set of experiments, and each square in the heat map represents the average return value in one experiment. The closer the color is to red (hotter), the higher the value and the better the algorithm is trained in that environment, and vice versa. The four algorithms in the figure are no dropout, rollout-dropout only, model-dropout only, and both dropouts. Each experiment in the Hopper environment stops after 300,000 steps, and each experiment in the HalfCheetah environment stops after 600,000 steps.

To evaluate the robustness of our MBDP algorithm, we test policies on different environment settings (i.e., different combinations of physical parameters) without any adaptation. We define ranges of mass and friction coefficients and modify the environments by scaling the torso mass by the mass coefficient and the friction of every geom by the friction coefficient.

We compare the original MBDP algorithm with three variations: one that keeps only the rollout-dropout, one that keeps only the model-dropout, and a no-dropout variation that removes both. This experiment is conducted in the modified environments mentioned above. The results are presented in Figure 3 in the form of heat maps, where each square represents the average return that the algorithm achieves after training in the corresponding modified environment. The closer the color is to red (hotter), the higher the value and the better the algorithm is trained in that environment, and vice versa. If an algorithm achieves good training results only in the central region and inadequate results far from the center, it is more sensitive to perturbations of the environment and thus less robust.

Based on the results, we can see that using only the rollout-dropout improves the robustness of the algorithm, while using only the model-dropout slightly weakens the robustness, and the combination of both dropouts, i.e., the MBDP algorithm, achieves robustness close to that of the rollout-dropout alone.

4.3 Ablation Study

In this section, we investigate the sensitivity of the MBDP algorithm to its two dropout ratios. We conduct two sets of experiments in both the Hopper and HalfCheetah environments; in each set, one dropout ratio is held fixed while the other is varied.

The experimental results are shown in Figure 4. The first row corresponds to experiments in the Hopper environment and the second row to experiments in the HalfCheetah environment. Columns 1 and 2 correspond to the experiments conducted in the perturbed MuJoCo environments with modified environment settings. We construct a total of four different perturbed environments and calculate the average of the return values after training for a fixed number of steps (Hopper: 120k steps, HalfCheetah: 400k steps) in each of the four environments. A higher average means that the algorithm achieves better overall performance across multiple perturbed environments, implying better robustness. Therefore, this metric can be used to evaluate robustness for different dropout ratios. Columns 3 and 4 show the return values obtained after a fixed number of steps (Hopper: 120k steps, HalfCheetah: 400k steps) in the standard MuJoCo environment without any modification, which are used to evaluate the efficiency of the algorithm for different dropout ratios. Each box plot corresponds to 10 different random seeds.
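The robustness metric described above, an average return over a small grid of perturbed environments, can be sketched as follows. The scale values `(0.5, 1.5)` and the function `toy_evaluate` are hypothetical placeholders chosen for illustration; in practice `evaluate` would train and measure a policy in the correspondingly perturbed simulator.

```python
from itertools import product
from statistics import mean

def robustness_score(evaluate, mass_scales, friction_scales):
    """Average return over every (mass scale, friction scale) combination."""
    settings = list(product(mass_scales, friction_scales))
    returns = {setting: evaluate(*setting) for setting in settings}
    return mean(list(returns.values())), returns

# Stand-in for "train, then measure return": the return degrades as the
# environment moves away from the nominal setting (1.0, 1.0).
def toy_evaluate(mass_scale, friction_scale):
    return 1000.0 - 400.0 * abs(mass_scale - 1.0) - 200.0 * abs(friction_scale - 1.0)

# Two mass scales times two friction scales gives four perturbed environments.
score, per_env = robustness_score(toy_evaluate, (0.5, 1.5), (0.5, 1.5))
```

A higher `score` indicates better average performance across the perturbed settings, which is how robustness is compared between dropout-ratio settings.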

Observing the experimental results, we find that robustness shows a positive relationship with the rollout-dropout ratio and an inverse relationship with the model-dropout ratio, while efficiency shows an inverse relationship with the rollout-dropout ratio and a positive relationship with the model-dropout ratio. This result verifies our conclusion in Section 3.3.3. In addition, we use horizontal dashed lines in Figure 4 to indicate the baseline with rollout-dropout and model-dropout removed. It can be seen that, for suitable values of the two ratios, the robustness and efficiency of the algorithm can both exceed the baseline. Therefore, when the ratios are adjusted to a reasonable range of values, we can simultaneously improve the robustness and efficiency.

Figure 4: The horizontal axis represents the different values of the dropout ratio being varied. The vertical axis is the metric for evaluating the robustness or efficiency. The horizontal dashed line is the baseline case with both rollout-dropout and model-dropout removed. Each experiment is trained for 120k steps in the Hopper environment and for 400k steps in the HalfCheetah environment. Each box plot corresponds to 10 different random seeds.

5 Conclusions and Future Work

In this paper, we propose the MBDP algorithm to address the dilemma of robustness and sample efficiency. Specifically, MBDP drops some overvalued imaginary samples through the rollout-dropout mechanism to focus on the bad samples for the purpose of improving robustness, while the model-dropout mechanism is designed to enhance the sample efficiency by only using accurate models. Both theoretical analysis and experimental results verify our claims: 1) the MBDP algorithm can provide policies with competitive robustness while achieving state-of-the-art performance; 2) there is a seesaw phenomenon between robustness and efficiency, that is, the growth of one causes a slight decline of the other; 3) we can obtain policies with different trade-offs between performance and robustness by tuning the two dropout ratios, so that our algorithm is capable of performing well in a wide range of tasks.

Our future work will incorporate more domain knowledge of robust control to further enhance robustness. We also plan to turn Double Dropout Planning into a more general module that can be easily embedded in other model-based RL algorithms, and to validate its effectiveness in real-world scenarios. Besides, relevant research in the fields of meta learning and transfer learning may inspire us to further optimize the design and training procedure of the predictive models. Finally, we can use more powerful function approximators to model the environment.


  • [1] P. Abbeel, M. Quigley, and A. Y. Ng (2006) Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1–8. Cited by: §1.
  • [2] K. Asadi, D. Misra, S. Kim, and M. L. Littman (2019) Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320. Cited by: §1.
  • [3] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause (2017) Safe model-based reinforcement learning with stability guarantees. arXiv preprint arXiv:1705.08551. Cited by: §1.
  • [4] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234. Cited by: §4.1.
  • [5] C. Chen, T. Takahashi, S. Nakagawa, T. Inoue, and I. Kusumi (2015) Reinforcement learning in depression: a review of computational research. Neuroscience & Biobehavioral Reviews 55, pp. 247–267. Cited by: §1.
  • [6] Y. Chow, A. Tamar, S. Mannor, and M. Pavone (2015) Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, pp. 1522–1530. Cited by: §3.1.
  • [7] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. External Links: ISSN 10495258 Cited by: §1, §3.2.
  • [8] M. P. Deisenroth, G. Neumann, and J. Peters (2013) A survey on policy search for robotics. now publishers. Cited by: §1.
  • [9] M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472. Cited by: §1.
  • [10] Y. Gal, R. McAllister, and C. E. Rasmussen (2016) Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, International Conference on Machine Learning, Cited by: §1.
  • [11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §4.1.
  • [12] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12519–12530. Cited by: §1, §4.1.
  • [13] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592. Cited by: §1.
  • [14] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  • [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • [16] Y. Luo, H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma (2019) Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–27. Cited by: §4.1.
  • [17] A. Malik, V. Kuleshov, J. Song, D. Nemer, H. Seymour, and S. Ermon (2019) Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314–4323. Cited by: §1.
  • [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1.
  • [19] T. M. Moerland, J. Broekens, and C. M. Jonker (2020) Model-based reinforcement learning: a survey. arXiv preprint arXiv:2006.16712. Cited by: §1.
  • [20] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702. Cited by: §1.
  • [21] A. S. Polydoros and L. Nalpantidis (2017) Survey of model-based reinforcement learning: applications on robotics. Journal of Intelligent & Robotic Systems 86 (2), pp. 153–173. Cited by: §1.
  • [22] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine (2016) EPOpt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §1, §3.1.
  • [23] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §1.
  • [24] A. Shapiro, D. Dentcheva, and A. Ruszczyński (2014) Lectures on stochastic programming: modeling and theory. SIAM. Cited by: §1.
  • [25] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • [26] L. Kuvayev and R. S. Sutton (1996) Model-based reinforcement learning with an approximate, learned model. In Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, pp. 101–105. Cited by: §1.
  • [27] R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163. External Links: Document, ISSN 0163-5719 Cited by: §3.3.2.
  • [28] A. Tamar, Y. Glassner, and S. Mannor (2015) Optimizing the CVaR via sampling. Proceedings of the National Conference on Artificial Intelligence 4, pp. 2993–2999. External Links: ISBN 9781577357025 Cited by: §1, §3.1.
  • [29] C. Tessler, Y. Efroni, and S. Mannor (2019) Action robust reinforcement learning and applications in continuous control. arXiv preprint arXiv:1901.09184. Cited by: §1.
  • [30] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §B.1, §4.
  • [31] K. Zhou and J. C. Doyle (1998) Essentials of robust control. Vol. 104, Prentice Hall, Upper Saddle River, NJ. Cited by: §1.

Appendix A Proofs

In Appendix A, we provide proofs for Theorem 3.1, Lemma 3.2, Theorem 3.3, and Proposition 3.4. Note that the numbering and citations in the appendices refer to the main manuscript.

A.1 Proof of Theorem 3.1


Recalling definitions (2.4) and (3.2), we take the negative of the rewards to represent the loss in the sense of CVaR. Then we have that

Obviously, the condition of in the above equation exactly meets our definition of , that is, equation (3.1). Then we can prove the first part of Theorem 3.1


Considering , and recalling the definition of , we have that

Since is the random perturbation to the environment as defined, it follows intuitively that


Recalling the definition of in (3.5), we can prove the second part of Theorem 3.1


The last equation (A.3) is obtained by equation (A.2) and the Representation Theorem [22] for CVaR.
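The link between negated returns and CVaR used in this proof can be checked numerically. The following is a minimal sketch under stated assumptions: the returns, the confidence level `alpha`, and both function names are hypothetical, the samples are equally weighted, and (1 - alpha) times the sample count is an integer. It compares the tail-mean form of empirical CVaR with the Rockafellar-Uryasev variational formula, a standard characterization of CVaR that is not necessarily the form of the Representation Theorem invoked above.

```python
import numpy as np


def cvar_tail_mean(losses, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of losses."""
    sorted_losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # largest loss first
    k = int(round((1.0 - alpha) * sorted_losses.size))  # tail size, assumed >= 1
    return sorted_losses[:k].mean()


def cvar_variational(losses, alpha):
    """Rockafellar-Uryasev form: min over c of c + E[(loss - c)^+] / (1 - alpha)."""
    losses = np.asarray(losses, dtype=float)
    # The objective is convex and piecewise linear in c, so evaluating it
    # at the sample points is enough to find the minimum.
    excess = np.maximum(losses[None, :] - losses[:, None], 0.0)
    objective = losses + excess.mean(axis=1) / (1.0 - alpha)
    return objective.min()


returns = np.arange(1.0, 11.0)  # ten hypothetical rollout returns
losses = -returns               # loss is the negative return
alpha = 0.8                     # hypothetical confidence level

tail = cvar_tail_mean(losses, alpha)
variational = cvar_variational(losses, alpha)
worst_returns_mean = np.sort(returns)[:2].mean()  # mean of the two lowest returns
```

In this toy setting both forms agree, and the negated CVaR of the losses equals the mean of the lowest 20% of returns, which is the sense in which averaging over a worst-case percentile of rollouts corresponds to a CVaR objective.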

A.2 Proof of Lemma 3.2

To prove Lemma 3.2, we need to introduce two useful lemmas.

Lemma A.1.



For any policy and dynamical models , we have that


Lemma A.1 is cited directly from existing work (Lemma 4.3 in [31]), with some modifications to fit our subsequent conclusions. With the above lemma, we first propose Lemma A.2.

Lemma A.2.

Suppose the expected return for model-based methods is Lipschitz continuous on the state space , where is the Lipschitz constant and is the transition distribution of the environment. Then




In Lemma A.2, we assume that the expected return on the estimated model is Lipschitz continuous w.r.t. any norm , i.e.,


where is a Lipschitz constant. This assumption means that closer states should yield closer value estimates, which should hold in most scenarios.
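To make the Lipschitz assumption concrete, the sketch below checks it on a toy value function. Everything here is hypothetical and chosen only for illustration: the value function V(s) = tanh(w · s), the weight vector, and the sampled states are not taken from the paper. Because the derivative of tanh is bounded by 1, this V is Lipschitz with constant ||w||_2 with respect to the Euclidean norm, so the slack L · ||s1 - s2|| - |V(s1) - V(s2)| should never be negative.

```python
import numpy as np


def value_fn(state, weights):
    """Toy value function V(s) = tanh(w . s); hypothetical, for illustration only."""
    return np.tanh(state @ weights)


def lipschitz_slack(s1, s2, weights):
    """Return L * ||s1 - s2||_2 - |V(s1) - V(s2)| with L = ||w||_2."""
    lipschitz_const = np.linalg.norm(weights)
    value_gap = abs(value_fn(s1, weights) - value_fn(s2, weights))
    return lipschitz_const * np.linalg.norm(s1 - s2) - value_gap


rng = np.random.default_rng(0)
weights = rng.normal(size=4)  # hypothetical parameters of the toy value function
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(1000)]
slacks = np.array([lipschitz_slack(s1, s2, weights) for s1, s2 in pairs])
```

A nonnegative slack on every sampled pair is consistent with the assumption, although a finite sample cannot prove it for a general value function.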


By definition of in (A.4) and Assumption (A.8), i.e., is Lipschitz continuous, we have that


Then, we can show that

(By Lemma A.1)
(By Triangle Inequality)
(By equation (A.9))

Now we prove Lemma 3.2.


For two disjoint sets and , i.e., , the following property holds


By this property,

Recalling definitions (2.1), (2.2), and (3.2), we have that


where . Recalling definition (3.1) and , we have

(By definition of )


Based on the above two inequalities and equation (A.11), we have that


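The role of the disjoint-set property in this proof can be illustrated on sampled returns. This is a minimal sketch under assumptions that are not confirmed by the text: the property is taken to be the usual additivity of an average over a disjoint union, rollout-dropout is taken to discard the highest-return fraction `beta` of rollouts, and the function name and data are hypothetical.

```python
import numpy as np


def split_rollouts(returns, beta):
    """Split returns into the kept set and the dropped top-beta fraction."""
    ordered = np.sort(np.asarray(returns, dtype=float))
    n_drop = int(round(beta * ordered.size))
    n_keep = ordered.size - n_drop
    return ordered[:n_keep], ordered[n_keep:]


returns = np.arange(1.0, 11.0)  # ten hypothetical rollout returns
beta = 0.2                      # hypothetical dropout ratio
kept, dropped = split_rollouts(returns, beta)

# The kept and dropped sets are disjoint and cover all rollouts, so the
# full average is a weighted sum of the two partial averages.
full_mean = returns.mean()
recombined = (1.0 - beta) * kept.mean() + beta * dropped.mean()

# Consequently the gap between the full and the kept average scales with beta.
gap = full_mean - kept.mean()
```

Under these assumptions the gap equals `beta * (dropped.mean() - kept.mean())`, which shows in miniature why a bound on the effect of dropout can be expressed through the dropout ratio.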
A.3 Proof of Theorem 3.3


With Lemma A.2 and Lemma 3.2, we can show that

(By Triangle Inequality)
(By Lemma 3.2)

For the first part of (A.13), let

denote the general bias between any model and the environment transition . With Lemma A.2, we now get

(By Lemma A.2)

For the second part of (A.13), by Lemma 3.2, we can show that

(By Lemma 3.2)

Returning to equation (A.13), it follows that

(By equations (A.14) and (A.15))

A.4 Proof of Proposition 3.4


With Theorem 3.3, i.e., , we have