1 Introduction
Reinforcement learning (RL) algorithms are commonly divided into two categories: model-free RL and model-based RL. Model-free RL methods learn a policy directly from samples collected in the real environment, while model-based RL approaches build approximate predictive models of the environment to assist in policy optimization Chen et al. (2015); Polydoros and Nalpantidis (2017). In recent years, RL has achieved remarkable results in a wide range of areas, including continuous control Schulman et al. (2015); Lillicrap et al. (2015); Levine et al. (2016) and surpassing human performance in Go and other games Mnih et al. (2015); Silver et al. (2016). However, most of these results are achieved by model-free RL algorithms, which rely on a large number of environment samples for training and are therefore limited in the scenarios where they can be deployed in practice. In contrast, model-based RL methods have shown promising potential to cope with the lack of samples by using predictive models for simulation and planning Deisenroth et al. (2013); Berkenkamp et al. (2017). To reduce sample complexity, PILCO Deisenroth and Rasmussen (2011) learns a probabilistic model through Gaussian process regression, which models prediction uncertainty to boost the agent's performance in complex environments. Building on PILCO, the DeepPILCO algorithm Gal et al. (2016) enables the modeling of more complex environments by introducing the Bayesian neural network (BNN), a universal function approximator with high capacity. To further enhance the interpretability of the predictive models and improve the robustness of the learned policies Chua et al. (2018); Malik et al. (2019), ensemble-based methods Rajeswaran et al. (2016); Kurutach et al. (2018) train an ensemble of models to capture the uncertainty in the environment more comprehensively, and have been empirically shown to obtain significant improvements in sample efficiency Levine et al. (2016); Chua et al. (2018); Janner et al. (2019).
Despite their high sample efficiency, model-based RL methods inherently suffer from inaccurate predictions, especially when faced with high-dimensional complex tasks and insufficient training samples Abbeel et al. (2006); Moerland et al. (2020). Model accuracy can greatly affect policy quality, and policies learned in inaccurate models tend to suffer significant performance degradation due to cumulative model error Sutton (1996); Asadi et al. (2019). Therefore, how to eliminate the effects of model bias has become an active research topic in model-based RL. Another important factor that limits the application of model-based algorithms is safety. In a general RL setup, the agent needs to collect observations to infer the current state before making decisions, which poses a challenge to the robustness of the learned policy because acquiring observations through sensors may introduce random noise and the real environment is normally partially observable. Non-robust policies may make disastrous decisions when faced with a noisy environment, and this safety issue is more prominent in model-based RL because the error in inferring the current state from observations may be further amplified by model bias during simulation and planning with the predictive models. Drawing on research in robust control Zhou and Doyle (1998), a branch of control theory, robust RL methods have attracted increasing attention as a way to improve the agent's resilience to perturbed states and model bias.
The main objective of robust RL is to optimize the agent's performance in worst-case scenarios and to improve the generalization of learned policies to noisy environments Shapiro et al. (2014). Existing robust RL methods can be roughly classified into two types. The first is based on adversarial ideas, such as RARL Pinto et al. (2017) and NR-MDP Tessler et al. (2019), and obtains robust policies by optimizing minimax objective functions. The second Tamar et al. (2015) introduces conditional value-at-risk (CVaR) objectives to ensure the robustness of the learned policies. However, the increased robustness of these methods can lead to a substantial loss of sample efficiency because they use data pessimistically. It is therefore nontrivial to enhance the robustness of a policy while avoiding sample inefficiency.
In this paper, we propose the Model-Based Reinforcement Learning with Double Dropout Planning (MBDP) algorithm, which aims to learn policies that balance robustness and sample efficiency. Inspired by CVaR, we design the rollout-dropout mechanism to enhance robustness by optimizing policies on low-reward samples. To maintain high sample efficiency and reduce the impact of model bias, we learn an ensemble of models that compensates for the inaccuracy of any single model. Furthermore, when generating imaginary samples to assist policy optimization, we design the model-dropout mechanism to avoid the perturbation caused by inaccurate models by using only models with small errors. The two dropout mechanisms together provide flexible control to meet different demands for robustness and sample efficiency. We demonstrate the effectiveness of MBDP both theoretically and empirically.
2 Notations and Preliminaries
2.1 Reinforcement Learning
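We consider a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, r, \gamma, \mathcal{P})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r(s, a)$ is the reward function, $\gamma$ is the discount factor, and $\mathcal{P}(s' \mid s, a)$ is the conditional probability distribution of the next state $s'$ given current state $s$ and action $a$. The form $s' = \mathcal{P}(s, a)$ denotes the state transition function when the environment is deterministic. Let $V^{\pi, \mathcal{P}}(s)$ denote the expected return, or expectation of accumulated rewards, starting from initial state $s$, i.e., the expected sum of discounted rewards following policy $\pi$ and state transition function $\mathcal{P}$:
$$V^{\pi, \mathcal{P}}(s) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)} \Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s \Big] \tag{2.1}$$
For notational simplicity, let $V^{\pi, \mathcal{P}}$ denote the expected return over random initial states:
$$V^{\pi, \mathcal{P}} = \mathbb{E}_{s_0} \big[ V^{\pi, \mathcal{P}}(s_0) \big] \tag{2.2}$$
The goal of reinforcement learning is to maximize the expected return by finding the optimal decision policy, i.e., $\pi^{*} = \arg\max_{\pi} V^{\pi, \mathcal{P}}$.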
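As an illustration of the expected return defined in this subsection, the following minimal Python sketch estimates it by Monte Carlo sampling. It is not taken from the paper: the function names, the toy dynamics, and the truncation of the infinite discounted sum at a finite horizon are all assumptions made for the example.

```python
import random


def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def estimate_expected_return(policy, step, sample_initial_state,
                             gamma, horizon, episodes, rng):
    """Monte Carlo estimate of the expected return over random initial states.

    Assumption: the infinite discounted sum is truncated at `horizon` steps.
    """
    returns = []
    for _ in range(episodes):
        state = sample_initial_state(rng)
        rewards = []
        for _ in range(horizon):
            action = policy(state, rng)
            state, reward = step(state, action, rng)
            rewards.append(reward)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)


# Hypothetical deterministic toy problem: every step yields reward 1.0.
def toy_policy(state, rng):
    return 1


def toy_step(state, action, rng):
    return state + action, 1.0


def toy_initial_state(rng):
    return 0


estimate = estimate_expected_return(toy_policy, toy_step, toy_initial_state,
                                    gamma=0.5, horizon=3, episodes=4,
                                    rng=random.Random(0))
```

In this toy case every trajectory has the same return, so the estimate equals the discounted sum of three unit rewards.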
2.2 Model-based Methods
In model-based reinforcement learning, an approximate transition model is learned by interacting with the environment, and the policy is then optimized with samples from the environment together with data generated by the model. We use a parametric notation to denote a model trained as a neural network, with its parameters ranging over the parameter space of models.
More specifically, to improve the ability of the models to represent complex environments, we learn multiple models and form an ensemble of them. To generate a prediction from the model ensemble, we select a model from the ensemble uniformly at random and perform a model rollout using the selected model at each time step. We then fill these rollout samples into a batch and, finally, perform policy optimization on the generated samples.
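The rollout procedure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the models are plain Python callables rather than neural networks, the transition tuple layout `(state, action, reward, next_state)` is hypothetical, and `make_toy_model` is an invented 1-D example.

```python
import random


def ensemble_rollout(models, policy, start_state, horizon, rng):
    """Roll out `horizon` steps, drawing a model uniformly at random at each step.

    Each model maps (state, action) -> (next_state, reward).
    Returns a batch of (state, action, reward, next_state) tuples.
    """
    batch = []
    state = start_state
    for _ in range(horizon):
        action = policy(state)
        model = rng.choice(models)  # uniform over ensemble members
        next_state, reward = model(state, action)
        batch.append((state, action, reward, next_state))
        state = next_state
    return batch


def make_toy_model(drift):
    """Hypothetical 1-D model: the next state is state + action + drift."""
    def model(state, action):
        next_state = state + action + drift
        return next_state, -abs(next_state)
    return model


toy_ensemble = [make_toy_model(d) for d in (-0.5, 0.0, 0.5)]
toy_batch = ensemble_rollout(toy_ensemble, policy=lambda s: 1.0,
                             start_state=0.0, horizon=5, rng=random.Random(0))
```

Because the model is re-drawn at every step, a single trajectory mixes predictions from several ensemble members, which is how the ensemble's disagreement enters the generated batch.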
2.3 Conditional Value-at-Risk
Consider a random variable with a given cumulative distribution function (CDF). Given a confidence level, the Value-at-Risk (VaR) of the random variable at that confidence level is given by
(2.3)
The Conditional Value-at-Risk (CVaR) of the random variable at that confidence level is defined as its expected value conditioned on the corresponding portion of the tail distribution:
(2.4)
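To make the two risk measures concrete, here is a small Python sketch that computes empirical versions from a list of samples. Conventions for VaR and CVaR differ across the literature; this sketch assumes a loss-style, upper-tail convention in which VaR at a given level is the smallest sample whose empirical CDF reaches that level, and CVaR is the mean of all samples at or above the VaR. The function names are hypothetical and the paper's own convention may differ.

```python
import math


def empirical_var(samples, level):
    """Smallest sample z whose empirical CDF value is at least `level`."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Number of samples that must lie at or below the VaR (at least one).
    k = max(math.ceil(level * len(ordered)), 1)
    return ordered[k - 1]


def empirical_cvar(samples, level):
    """Mean of the samples at or above the empirical VaR (assumed upper-tail convention)."""
    var = empirical_var(samples, level)
    tail = [z for z in samples if z >= var]
    return sum(tail) / len(tail)


losses = [float(i) for i in range(1, 11)]  # 1.0, 2.0, ..., 10.0
var_half = empirical_var(losses, 0.5)
cvar_half = empirical_cvar(losses, 0.5)
```

When rewards rather than losses are of interest, one common approach is to negate the rewards before applying these functions, so that the tail corresponds to the lowest rewards.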
3 MBDP Framework
In this section, we describe how MBDP uses Double Dropout Planning to balance efficiency and robustness. The basic procedure of MBDP is to 1) sample data from the environment; 2) train an ensemble of models on the sampled data; 3) calculate the model bias over observed environment samples and choose a subset of the model ensemble based on the calculated bias; 4) collect rollout trajectories from the model ensemble and make gradient updates based on subsets of the sampled data. An overview of the algorithm architecture is shown in Figure 1, and the overall pseudocode is given in Algorithm 1.
We will also theoretically analyze robustness and performance under the dropout planning of our MBDP algorithm. For simplicity of theoretical analysis, we only consider deterministic environment and models in this section, but the experimental part does not require this assumption. The detailed proofs can be found in the appendix as provided in supplementary materials.
3.1 Rollout Dropout in MBDP
Optimizing the expected return in the usual model-based way allows us to learn a policy that performs best in expectation over the training model ensemble. However, the best expected performance does not mean that the resulting policy performs well at all times. This instability typically leads to risky decisions when facing poorly informed states at deployment.
Inspired by previous works Rajeswaran et al. (2016); Tamar et al. (2015); Chow et al. (2015) that optimize the conditional value-at-risk (CVaR) to explicitly seek a robust policy, we add a dropout mechanism to the rollout procedure. Recall from Section 2.2 that, to generate a prediction from the model ensemble, we select a model from the ensemble uniformly at random and perform a model rollout using the selected model at each time step. We then fill these rollout samples into a batch and retain the percentile subset with the more pessimistic rewards. The resulting percentile rollout batch is defined as:
(3.1) 
where and is the percentile of reward values conditioned on state in batch . The expected return of dropout batch rollouts is denoted by :
(3.2) 
Rollout-dropout can improve robustness at a small cost in sample efficiency; we analyze how it improves robustness in Section 3.3.
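A minimal Python sketch of the rollout-dropout idea is given below. It is a simplification rather than the paper's procedure: the paper defines the reward percentile conditioned on the state, whereas this sketch applies a single percentile over the whole batch. The tuple layout `(state, action, reward, next_state)` and the name `rollout_dropout` are assumptions.

```python
import math


def rollout_dropout(batch, dropout_ratio):
    """Drop the highest-reward fraction of a rollout batch.

    Keeps the (1 - dropout_ratio) fraction of transitions with the lowest
    rewards. Simplifying assumption: one percentile over the whole batch,
    not a percentile conditioned on the state.
    """
    if not 0.0 <= dropout_ratio < 1.0:
        raise ValueError("dropout_ratio must lie in [0, 1)")
    if not batch:
        return []
    keep = max(math.ceil((1.0 - dropout_ratio) * len(batch)), 1)
    ordered = sorted(batch, key=lambda transition: transition[2])
    return ordered[:keep]


toy_batch = [
    ("s0", "a", 3.0, "s1"),
    ("s1", "a", 1.0, "s2"),
    ("s2", "a", 4.0, "s3"),
    ("s3", "a", 2.0, "s4"),
]
pessimistic = rollout_dropout(toy_batch, dropout_ratio=0.5)
```

A policy update computed on `pessimistic` only sees the lower-reward half of the imagined transitions, which is the sense in which the mechanism focuses on pessimistic samples.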
3.2 Model Dropout in MBDP
Rollout-dropout can improve robustness, but dropping a certain number of samples clearly affects the algorithm's sample efficiency. Model-based methods can alleviate this problem. However, since model bias can degrade the performance of the algorithm, we also need to consider how to reduce it. Previous works, such as the PETS method Chua et al. (2018), use an ensemble of bootstrapped probabilistic transition models to properly incorporate two kinds of uncertainty (aleatoric and epistemic) into the transition model.
To mitigate the impact of model discrepancies and flexibly control the accuracy of the model ensemble, we design a model-dropout mechanism. More specifically, we first learn an ensemble of transition models, where each member of the ensemble is a probabilistic neural network whose outputs parametrize a Gaussian distribution. While training the models on samples from the environment, we calculate the bias of each model, averaged over the observed state-action pairs:
(3.3) 
which measures the distance between the next state predicted by the model and the next state observed in the environment, under a distance function on the state space.
We then select models from the model ensemble uniformly at random, sort them in ascending order of the calculated bias, and retain the dropout subset with the smaller model bias, that is, the models up to the largest index in the ascending order that remains after we drop the percentile subset with large bias.
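The model-dropout step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: models are deterministic callables instead of probabilistic neural networks, the whole ensemble is ranked (the random pre-selection of models is omitted), the number of retained models is rounded down, and all names are hypothetical.

```python
import math


def model_bias(model, transitions, distance):
    """Average distance between predicted and observed next states."""
    total = 0.0
    for state, action, next_state in transitions:
        total += distance(model(state, action), next_state)
    return total / len(transitions)


def model_dropout(models, transitions, distance, dropout_ratio):
    """Keep the models with the smallest bias, dropping the worst fraction."""
    biases = [model_bias(m, transitions, distance) for m in models]
    order = sorted(range(len(models)), key=lambda i: biases[i])
    # Assumption: round the number of retained models down, keep at least one.
    keep = max(math.floor((1.0 - dropout_ratio) * len(models)), 1)
    return [models[i] for i in order[:keep]]


def make_toy_model(offset):
    """Hypothetical 1-D model whose prediction is off by a constant offset."""
    def model(state, action):
        return state + action + offset
    return model


# Observed transitions from a toy environment with true dynamics s' = s + a.
observed = [(0.0, 1.0, 1.0), (1.0, 1.0, 2.0)]
ensemble = [make_toy_model(o) for o in (0.0, 0.5, 0.125, 2.0)]
kept = model_dropout(ensemble, observed,
                     distance=lambda x, y: abs(x - y), dropout_ratio=0.5)
```

With a dropout ratio of 0.5 and four models, the two models with the smallest average prediction error are retained and used for subsequent rollouts.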
3.3 Theoretical Analysis of MBDP
We now give theoretical guarantees for the robustness and sample efficiency of the MBDP algorithm. All the proofs of this section are detailed in Appendix A.
3.3.1 Guarantee of Robustness
We define robustness as the expected performance in a perturbed environment. Consider a perturbed transition matrix formed as the Hadamard product of the original transition matrix and a multiplicative probability perturbation. Recalling the definition of CVaR in equation (2.4), we propose the following theorem to provide a robustness guarantee for the MBDP algorithm.
Theorem 3.1.
It holds
(3.4) 
given the constraint set of perturbation
(3.5) 
Since optimizing the expected performance in a perturbed environment is exactly our definition of robustness, Theorem 3.1 can be interpreted as an equivalence between optimizing robustness and optimizing the expected return under rollout-dropout.
3.3.2 Guarantee of Efficiency
We first propose Lemma 3.2, which shows that the expected return with only the rollout-dropout mechanism has a bounded discrepancy from the expected return obtained when the policy is deployed in the environment.
Lemma 3.2.
Suppose is the supremum of reward function , i.e., , the expected return of dropout batch rollouts with individual model has a discrepancy bound:
(3.6) 
While Lemma 3.2 only provides a guarantee for the rollout-dropout mechanism, we now propose Theorem 3.3, which shows that the expected return of the policy derived by model-dropout together with rollout-dropout, i.e., our MBDP algorithm, has a bounded discrepancy from the expected return obtained when the policy is deployed in the environment.
Theorem 3.3.
Suppose is a constant. The expected return of MBDP algorithm, i.e., , compared to the expected return when it is deployed in the environment , i.e., , has a discrepancy bound:
(3.7) 
where
(3.8) 
and
(3.9) 
Since the MBDP algorithm is an extension of the Dyna-style algorithm Sutton (1991), a family of model-based reinforcement learning methods that jointly optimize the policy and the transition model, it can be written in the following general form:
(3.10) 
where the two iterates are the updated policy and the updated dropout model ensemble at each iteration. In this setting, we can show that the performance of the policy derived by our MBDP algorithm is approximately monotonically increasing when deployed in the real environment, with the ability to robustly escape local optima.
Proposition 3.4.
Intuitively, proposition 3.4 shows that under the control of reasonable parameters and , is often a large update value in the early learning stage, while as an error bound is a fixed small value. Thus is a value greater than most of the time in the early learning stage, which can guarantee . In the late stage near convergence, the update becomes slow and may be smaller than , which leads to the possibility that is smaller than . This makes the update process try some other convergence direction, providing an opportunity to jump out of the local optimum. We empirically verify this claim in Appendix C.
3.3.3 Flexible control of robustness and efficiency
According to Theorem 3.1, rollout-dropout improves robustness: the larger the rollout-dropout ratio is, the more robustness is improved, and the smaller it is, the worse the robustness will be. For model-dropout, a larger model-dropout ratio means that more models are dropped and that the remaining ensemble is more likely to overfit the environment, so robustness is lower. Conversely, when the model-dropout ratio is smaller, the model ensemble is better at simulating complex environments and robustness is better.
Turning to efficiency, note that the bound in equation (3.8) increases with the rollout-dropout ratio and decreases with the model-dropout ratio. As the rollout-dropout ratio increases or the model-dropout ratio decreases, this bound expands, which reduces the accuracy of the algorithm and makes it take longer to converge, and thus makes it less efficient. Conversely, when the rollout-dropout ratio decreases or the model-dropout ratio increases, efficiency increases.
The analysis above suggests that MBDP provides a flexible control mechanism to meet different demands for robustness and efficiency by tuning the two dropout ratios. This conclusion can be summarized as follows, and we verify it empirically in Section 4.

To get balanced efficiency and robustness: set the rollout-dropout ratio and the model-dropout ratio both to moderate values.

To get better robustness: set the rollout-dropout ratio to a larger value and the model-dropout ratio to a smaller value.

To get better efficiency: set the rollout-dropout ratio to a smaller value and the model-dropout ratio to a larger value.
4 Experiments
Our experiments aim to answer the following questions:

How does MBDP perform on benchmark reinforcement learning tasks compared to state-of-the-art model-based and model-free RL methods?

Can MBDP find a balance between robustness and benefits?

How do the robustness and efficiency of MBDP change when the two dropout ratios are tuned?
To answer these questions, we need to understand how our method compares to state-of-the-art model-based and model-free methods and how our design choices affect performance. We evaluate our approach on four continuous control benchmark tasks in the MuJoCo simulator Todorov et al. (2012): Hopper, Walker, HalfCheetah, and Ant. We also perform an ablation study by removing the dropout modules from our algorithm. Finally, we analyze the two dropout ratios separately as hyperparameters. A depiction of the environments and a detailed description of the experimental setup can be found in Appendix B.
4.1 Comparison with State-of-the-Art Methods
In this subsection, we compare our MBDP algorithm with state-of-the-art model-free and model-based reinforcement learning algorithms in terms of sample complexity and performance. Specifically, we compare against SAC Haarnoja et al. (2018), which is the state-of-the-art model-free method and establishes a widely accepted baseline. For model-based methods, we compare against MBPO Janner et al. (2019), which uses short-horizon model-based rollouts started from samples in the real environment; STEVE Buckman et al. (2018), which dynamically incorporates data from rollouts into value estimation rather than policy learning; and SLBO Luo et al. (2019), a model-based algorithm with performance guarantees. For our MBDP algorithm, we use a single fixed setting of the two dropout ratios as hyperparameters.
Figure 2 shows the learning curves for all methods, along with the asymptotic performance of the model-free SAC algorithm, which does not converge in the region shown. The results highlight the strength of MBDP in terms of performance and sample complexity. In all the MuJoCo environments, our MBDP method learns faster and is more efficient than existing model-based algorithms, which empirically demonstrates the advantage of Dropout Planning.
4.2 Analysis of Robustness
To evaluate the robustness of our MBDP algorithm, we test policies on different environment settings (i.e., different combinations of physical parameters) without any adaptation. We define ranges of mass and friction coefficients and modify the environments by scaling the torso mass with the mass coefficient and the friction of every geom with the friction coefficient.
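The construction of such a grid of perturbed environments can be sketched as below. The sketch does not touch the simulator: it only scales a plain dictionary of physical parameters. The parameter names, the nominal values, and the scale values are all hypothetical and are not the ranges used in the experiments.

```python
from itertools import product


def perturb_parameters(params, mass_scale, friction_scale):
    """Scale the torso mass and the friction of every geom."""
    return {
        "torso_mass": params["torso_mass"] * mass_scale,
        "geom_friction": [f * friction_scale for f in params["geom_friction"]],
    }


def perturbation_grid(params, mass_scales, friction_scales):
    """Map every (mass_scale, friction_scale) pair to perturbed parameters."""
    return {
        (m, f): perturb_parameters(params, m, f)
        for m, f in product(mass_scales, friction_scales)
    }


# Hypothetical nominal parameters and scale ranges, for illustration only.
nominal = {"torso_mass": 4.0, "geom_friction": [1.0, 0.5]}
grid = perturbation_grid(nominal,
                         mass_scales=(0.5, 1.0, 2.0),
                         friction_scales=(0.5, 1.0, 2.0))
```

Each key of `grid` corresponds to one cell of a robustness heat map; the cell with both scales equal to 1.0 is the unmodified environment.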
We compare the original MBDP algorithm with a variation that keeps only rollout-dropout, a variation that keeps only model-dropout, and a no-dropout variation that removes both dropouts. This experiment is conducted in the modified environments described above. The results are presented in Figure 3 as heat maps, where each square represents the average return that the algorithm achieves after training in the corresponding modified environment. The closer the color is to red (hotter), the higher the value and the better the algorithm is trained in that environment, and vice versa. If the algorithm achieves good training results only in the central region and inadequate results far from the center, it is more sensitive to perturbations in the environment and thus less robust.
Based on the results, we can see that using only rollout-dropout improves the robustness of the algorithm, while using only model-dropout slightly weakens it, and that the combination of both dropouts, i.e., the MBDP algorithm, achieves robustness close to that of the rollout-dropout-only variation.
4.3 Ablation Study
In this section, we investigate the sensitivity of the MBDP algorithm to its two dropout ratios. We conduct two sets of experiments in both the Hopper and HalfCheetah environments: (1) fix one dropout ratio and vary the other; (2) fix the second ratio and vary the first.
The experimental results are shown in Figure 4. The first row corresponds to experiments in the Hopper environment and the second row to experiments in the HalfCheetah environment. Columns 1 and 2 correspond to the experiments conducted in the perturbed MuJoCo environments with modified environment settings. We construct four different perturbed environments and calculate the average of the return values obtained after training for a fixed number of steps (Hopper: 120k steps, HalfCheetah: 400k steps) in each of the four environments. A higher average indicates that the algorithm achieves better overall performance across multiple perturbed environments, implying better robustness. This metric can therefore be used to compare the robustness of different dropout-ratio settings. Columns 3 and 4 show the return values obtained after a fixed number of steps (Hopper: 120k steps, HalfCheetah: 400k steps) in the standard MuJoCo environment without any modification, which are used to evaluate the efficiency of the algorithm for different dropout-ratio settings. Each box plot corresponds to 10 different random seeds.
From the experimental results, we find that robustness is positively related to the rollout-dropout ratio and inversely related to the model-dropout ratio, while efficiency is inversely related to the rollout-dropout ratio and positively related to the model-dropout ratio. This result verifies our conclusion in Section 3.3.3. In addition, we use horizontal dashed lines in Figure 4 to indicate the baseline with both rollout-dropout and model-dropout removed. It can be seen that, for suitable settings of the dropout ratios, the robustness and efficiency of the algorithm can both exceed the baseline. Therefore, when the dropout ratios are adjusted to a reasonable range of values, we can improve robustness and efficiency simultaneously.
5 Conclusions and Future Work
In this paper, we propose the MBDP algorithm to address the dilemma between robustness and sample efficiency. Specifically, MBDP drops some overvalued imaginary samples through the rollout-dropout mechanism in order to focus on the bad samples and improve robustness, while the model-dropout mechanism is designed to enhance sample efficiency by using only accurate models. Both theoretical analysis and experimental results support our claims: 1) the MBDP algorithm can provide policies with competitive robustness while achieving state-of-the-art performance; 2) there is empirically a seesaw phenomenon between robustness and efficiency, that is, the growth of one causes a slight decline of the other; 3) we can obtain policies with different trade-offs between performance and robustness by tuning the two dropout ratios, which allows our algorithm to perform well in a wide range of tasks.
Our future work will incorporate more domain knowledge from robust control to further enhance robustness. We also plan to turn Double Dropout Planning into a more general module that can be easily embedded in other model-based RL algorithms, and to validate its effectiveness in real-world scenarios. In addition, related research in meta-learning and transfer learning may inspire us to further optimize the design and training procedure of the predictive models. Finally, we can use more powerful function approximators to model the environment.
References

[1] Abbeel et al. (2006) Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1–8.
[2] Asadi et al. (2019) Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320.
[3] Berkenkamp et al. (2017) Safe model-based reinforcement learning with stability guarantees. arXiv preprint arXiv:1705.08551.
[4] Buckman et al. (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234.
[5] Chen et al. (2015) Reinforcement learning in depression: a review of computational research. Neuroscience & Biobehavioral Reviews 55, pp. 247–267.
[6] Chow et al. (2015) Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pp. 1522–1530.
[7] Chua et al. (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4754–4765. ISSN 1049-5258.
[8] Deisenroth et al. (2013) A survey on policy search for robotics. Now Publishers.
[9] Deisenroth and Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472.
[10] Gal et al. (2016) Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, International Conference on Machine Learning.
[11] Haarnoja et al. (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[12] Janner et al. (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12519–12530.
[13] Kurutach et al. (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592.
[14] Levine et al. (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
[15] Lillicrap et al. (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[16] Luo et al. (2019) Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In 7th International Conference on Learning Representations, ICLR 2019, pp. 1–27.
[17] Malik et al. (2019) Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314–4323.
[18] Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[19] Moerland et al. (2020) Model-based reinforcement learning: a survey. arXiv preprint arXiv:2006.16712.
[20] Pinto et al. (2017) Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702.
[21] Polydoros and Nalpantidis (2017) Survey of model-based reinforcement learning: applications on robotics. Journal of Intelligent & Robotic Systems 86 (2), pp. 153–173.
[22] Rajeswaran et al. (2016) EPOpt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283.
[23] Schulman et al. (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
[24] Shapiro et al. (2014) Lectures on stochastic programming: modeling and theory. SIAM.
[25] Silver et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
[26] Sutton (1996) Model-based reinforcement learning with an approximate, learned model. In Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, pp. 101–105.
[27] Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163. ISSN 0163-5719.
[28] Tamar et al. (2015) Optimizing the CVaR via sampling. Proceedings of the National Conference on Artificial Intelligence 4, pp. 2993–2999. ISBN 9781577357025.
[29] Tessler et al. (2019) Action robust reinforcement learning and applications in continuous control. arXiv preprint arXiv:1901.09184.
[30] Todorov et al. (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
[31] Zhou and Doyle (1998) Essentials of robust control. Vol. 104, Prentice Hall, Upper Saddle River, NJ.
Appendix A Proofs
In Appendix A, we provide proofs for Theorem 3.1, Lemma 3.2, Theorem 3.3, and Proposition 3.4. Note that the numbering and citations in the appendices follow the main manuscript.
A.1 Proof of Theorem 3.1
Proof.
Recall the definition of (2.4) and (3.2), we need to take the negative value of rewards to represent the loss in the sense of CVaR. Then we have that,
The condition in the above equation exactly matches the definition in equation (3.1), which proves the first part of Theorem 3.1:
(A.1) 
Considering , recall the definition of , we have that
Since is the random perturbation to the environment as we defined, it’s intuitive that
(A.2) 
(A.3) 
∎
A.2 Proof of Lemma 3.2
To prove Lemma 3.2, we need to introduce two useful lemmas.
Lemma A.1.
Define
(A.4) 
For any policy and dynamical models , we have that
(A.5) 
Lemma A.1 is cited directly from existing work (Lemma 4.3 in [31]), with some modifications to fit our subsequent conclusions. With the above lemma, we first propose Lemma A.2.
Lemma A.2.
Suppose the expected return for model-based methods is Lipschitz continuous on the state space with a given Lipschitz constant, and consider the transition distribution of the environment. Then
(A.6) 
where
(A.7) 
In Lemma A.2, we make the assumption that the expected return on the estimated model is Lipschitz continuous with respect to any norm, i.e.,
(A.8)
with the Lipschitz constant as above. This assumption means that closer states should yield closer value estimates, which should hold in most scenarios.
Proof.
(A.9) 
Then, we can show that
∎
Now we prove Lemma 3.2.
A.3 Proof of Theorem 3.3
Proof.
denotes the general bias between any model and environment transition , with Lemma A.2, we now get
∎