Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees

While model-based reinforcement learning has empirically been shown to significantly reduce the sample complexity that hinders model-free RL, the theoretical understanding of such methods has been rather limited. In this paper, we introduce a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees, and a practical algorithm Optimistic Lower Bounds Optimization (OLBO). In particular, we derive a theoretical guarantee of monotone improvement for model-based RL with our framework. We iteratively build a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and maximize it jointly over the policy and the model. Assuming the optimization in each iteration succeeds, the expected reward is guaranteed to improve. The framework also incorporates an optimism-driven perspective, and reveals the intrinsic measure for the model prediction error. Preliminary simulations demonstrate that our approach outperforms the standard baselines on continuous control benchmark tasks.





1 Introduction

In recent years deep reinforcement learning has achieved strong empirical success, including super-human performances on Atari games and Go (Mnih et al., 2015; Silver et al., 2017) and learning locomotion and manipulation skills in robotics (Levine et al., 2016; Schulman et al., 2015b; Lillicrap et al., 2015). Many of these results are achieved by model-free RL algorithms that often require a massive number of samples, and therefore their applications are mostly limited to simulated environments. Model-based deep reinforcement learning, in contrast, exploits the information from state observations explicitly — by planning with an estimated dynamical model — and is considered to be a promising approach to reduce the sample complexity. Indeed, empirical results (Deisenroth & Rasmussen, 2011b; Deisenroth et al., 2013; Levine et al., 2016; Nagabandi et al., 2017; Kurutach et al., 2018; Pong et al., 2018a) have shown strong improvements in sample efficiency.

Despite promising empirical findings, many theoretical properties of model-based deep reinforcement learning are not well understood. For example, how does the error of the estimated model affect the estimation of the value function and the planning? Can model-based RL algorithms be guaranteed to improve the policy monotonically and converge to a local maximum of the value function? How do we quantify the uncertainty in the dynamical models?

It is challenging to address these questions theoretically in the context of deep RL with continuous state and action spaces and non-linear dynamical models. Due to the high dimensionality, learning models from observations in one part of the state space and extrapolating to another part sometimes involves a leap of faith. The uncertainty quantification of non-linear parameterized dynamical models is difficult: even without the RL components, it is an active but wide-open research area. Prior work in model-based RL mostly quantifies uncertainty with either heuristics or simpler models (Moldovan et al., 2015; Xie et al., 2016; Deisenroth & Rasmussen, 2011a).

Previous theoretical work on model-based RL mostly focuses on either finite-state MDPs (Jaksch et al., 2010; Bartlett & Tewari, 2009; Fruit et al., 2018; Lakshmanan et al., 2015; Hinderer, 2005; Pirotta et al., 2015, 2013), or linear parametrizations of the dynamics, policy, or value function (Abbasi-Yadkori & Szepesvári, 2011; Simchowitz et al., 2018; Dean et al., 2017; Sutton et al., 2012; Tamar et al., 2012), but not much on non-linear models. Even with oracle prediction intervals or posterior estimation, to the best of our knowledge, there was no previous algorithm with convergence guarantees for model-based deep RL. (We note that confidence intervals of the parameters are likely meaningless for over-parameterized neural network models.)

Towards addressing these challenges, the main contribution of this paper is to propose a novel algorithmic framework for model-based deep RL with theoretical guarantees. Our meta-algorithm (Algorithm 1) extends the optimism-in-face-of-uncertainty principle to non-linear dynamical models in a way that requires no explicit uncertainty quantification of the dynamical models.

Let $V^{\pi}$ be the value function of a policy $\pi$ on the true environment, and let $\widehat{V}^{\pi}$ be the value function of the policy on the estimated model $\widehat{M}$. We design provable upper bounds, denoted by $D^{\pi,\widehat{M}}$, on how much the error can compound and divert the expected value $\widehat{V}^{\pi}$ of the imaginary rollouts from their real value $V^{\pi}$, in a neighborhood of some reference policy. Such upper bounds capture the intrinsic difference between the estimated and real dynamical model with respect to the particular reward function under consideration.

The discrepancy bounds naturally lead to a lower bound for the true value function:

$$V^{\pi} \ge \widehat{V}^{\pi} - D^{\pi,\widehat{M}} \tag{1.1}$$


Our algorithm iteratively collects batches of samples from interactions with the environment, builds the lower bound above, and then maximizes it over both the dynamical model $\widehat{M}$ and the policy $\pi$. We can use any RL algorithm to optimize the lower bound, because it will be designed to depend only on the sample trajectories from a fixed reference policy (as opposed to requiring new interactions with the policy iterate).

We show that the performance of the policy is guaranteed to monotonically increase, assuming the optimization within each iteration succeeds (see Theorem 3.1). To the best of our knowledge, this is the first theoretical guarantee of monotone improvement for model-based deep RL.

Readers may have realized that optimizing a robust lower bound is reminiscent of robust control and robust optimization. The distinction is that we optimistically and iteratively maximize the RHS of (1.1) jointly over the model and the policy. The iterative approach allows the algorithm to collect higher-quality trajectories adaptively, and the optimism in model optimization encourages exploration of the parts of the space that are not covered by the current discrepancy bounds.

To instantiate the meta-algorithm, we design a few valid discrepancy bounds in Section 4. In Section 4.1, we recover the $\ell_2$ norm-based model loss by imposing the additional assumption of a Lipschitz value function. The result suggests that a norm is preferable to the square of the norm. Indeed, in Section 6.2, we show experimentally that learning with the $\ell_2$ loss significantly outperforms the mean-squared error loss ($\ell_2^2$).

In Section 4.2, we design a discrepancy bound that is invariant to the representation of the state space. Here we measure the loss of the model by the difference between the value of the predicted next state and the value of the true next state. Such a loss function is shown to be invariant to one-to-one transformations of the state space. Thus we argue that the loss is an intrinsic measure for the model error without any information beyond observing the rewards. We also refine our bounds in Section A by utilizing some mathematical tools for measuring the difference between policies in $\chi^2$-divergence (instead of KL divergence or TV distance).

Our analysis also sheds light on the comparison between model-based RL and on-policy model-free RL algorithms such as policy gradient or TRPO (Schulman et al., 2015a). The RHS of equation (1.1) is likely to be a good approximator of $V^{\pi}$ in a larger neighborhood than the linear approximation of $V^{\pi}$ used in policy gradient is (see Remark 4.5).

Finally, inspired by our framework and analysis, we design a variant of model-based RL algorithms, Stochastic Lower Bounds Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance when only 1M samples are permitted on a range of continuous control benchmark tasks.

2 Notations and Preliminaries

We denote the state space by $\mathcal{S}$ and the action space by $\mathcal{A}$. A policy $\pi(\cdot|s)$ specifies the conditional distribution over the action space given a state $s$. A dynamical model $M(\cdot|s,a)$ specifies the conditional distribution of the next state given the current state $s$ and action $a$. We will use $M^\star$ globally to denote the unknown true dynamical model. Our target applications are problems with continuous state and action spaces, although the results apply to discrete state or action spaces as well. When the model is deterministic, $M(\cdot|s,a)$ is a Dirac measure. In this case, we use $M(s,a)$ to denote the unique value of the next state $s'$ and view $M$ as a function from $\mathcal{S}\times\mathcal{A}$ to $\mathcal{S}$. Let $\mathcal{M}$ denote a (parameterized) family of models that we are interested in, and $\Pi$ denote a (parameterized) family of policies.

Unless otherwise stated, for a random variable $X$, we will use $p_X$ to denote its density function.

Let $S_0$ be the random variable for the initial state. Let $S_t^{\pi,M}$ denote the random variable of the state at step $t$ when we execute the policy $\pi$ on the dynamical model $M$ starting with $S_0$. Note that $S_0^{\pi,M} = S_0$ unless otherwise stated. We will omit the superscripts when they are clear from the context. We use $A_t$ to denote the action at step $t$ similarly. We often use $\tau$ to denote the random variable for the trajectory $(S_0, A_0, \dots, S_t, A_t, \dots)$. Let $R(s,a)$ be the reward function at each step. We assume $R$ is known throughout the paper, although $R$ can also be considered as part of the model if unknown. Let $\gamma$ be the discount factor.

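Let $V^{\pi,M}$ be the value function on the model $M$ and policy $\pi$, defined as:

$$V^{\pi,M}(s) = \mathop{\mathbb{E}}_{\substack{\forall t\ge 0,\ A_t\sim\pi(\cdot|S_t)\\ S_{t+1}\sim M(\cdot|S_t,A_t)}}\left[\sum_{t=0}^{\infty}\gamma^t R(S_t,A_t)\ \middle|\ S_0 = s\right] \tag{2.1}$$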
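As a small illustration of this definition, the sketch below computes the truncated discounted return of a single rollout. Averaging it over many rollouts would give a Monte Carlo estimate of the value of a start state. The function names, the callable interfaces for the policy, model, and reward, and the finite truncation horizon are assumptions made for the example and do not come from the paper.

```python
from typing import Callable, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's rule).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def rollout_return(
    policy: Callable[[float], float],
    model: Callable[[float, float], float],
    reward: Callable[[float, float], float],
    state: float,
    gamma: float,
    horizon: int,
) -> float:
    """Truncated discounted return of one rollout of `policy` on `model`.

    `policy` and `model` may be stochastic callables; scalar states and
    actions are used only to keep the toy interface small.
    """
    rewards = []
    for _ in range(horizon):
        action = policy(state)
        rewards.append(reward(state, action))
        state = model(state, action)
    return discounted_return(rewards, gamma)
```

Passing the true environment as `model` corresponds to the real value, while passing a learned model corresponds to the imaginary value computed on the estimated dynamics.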


We define $V^{\pi,M} = \mathbb{E}\left[V^{\pi,M}(S_0)\right]$ as the expected reward-to-go at Step 0 (averaged over the random initial states). Our goal is to maximize the reward-to-go on the true dynamical model, that is, $V^{\pi,M^\star}$, over the policy $\pi$. For simplicity, throughout the paper, we set $\kappa = \gamma(1-\gamma)^{-1}$ since it occurs frequently in our equations. Every policy $\pi$ induces a distribution of states visited by policy $\pi$:

Definition 2.1.

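For a policy $\pi$, define $\rho^{\pi,M}$ as the discounted distribution of the states visited by $\pi$ on $M$. Let $\rho^{\pi}$ be a shorthand for $\rho^{\pi,M^\star}$, and we omit the superscript $M^\star$ throughout the paper. Concretely, we have

$$\rho^{\pi} = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\, p_{S_t^{\pi}}$$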
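To make the discounted state distribution concrete, here is a minimal sketch for a finite state space, where the distribution is the sum over time steps of the step-$t$ state distribution weighted by $(1-\gamma)\gamma^t$. The tabular setting, the function name, and the truncation horizon are assumptions for illustration; the paper targets continuous spaces.

```python
from typing import List


def discounted_state_distribution(
    initial: List[float],
    transition: List[List[float]],
    gamma: float,
    horizon: int = 1000,
) -> List[float]:
    """Approximate (1 - gamma) * sum_t gamma^t * p_{S_t} on a finite chain.

    transition[i][j] is the probability of moving from state i to state j
    under the policy (the policy is already folded into the matrix).
    The infinite sum is truncated after `horizon` steps.
    """
    n = len(initial)
    p_t = list(initial)
    rho = [0.0] * n
    weight = 1.0 - gamma
    for _ in range(horizon):
        for s in range(n):
            rho[s] += weight * p_t[s]
        # Push the state distribution one step forward.
        p_t = [sum(p_t[i] * transition[i][j] for i in range(n)) for j in range(n)]
        weight *= gamma
    return rho
```

For a chain that starts in state 0 and moves to an absorbing state 1, with a discount factor of 0.5, half of the discounted mass sits on each state.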

3 Algorithmic Framework

As mentioned in the introduction, towards optimizing $V^{\pi,M^\star}$ (note that in the introduction we used $V^{\pi}$ for simplicity, and in the rest of the paper we will make the dependency on $M^\star$ explicit), our plan is to build a lower bound for $V^{\pi,M^\star}$ of the following type and optimize it iteratively:

$$V^{\pi,M^\star} \ge V^{\pi,\widehat{M}} - D(\widehat{M},\pi) \tag{3.1}$$


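where $D(\widehat{M},\pi)\in\mathbb{R}_{\ge 0}$ bounds from above the discrepancy between $V^{\pi,\widehat{M}}$ and $V^{\pi,M^\star}$. Building such an optimizable discrepancy bound globally that holds for all $\widehat{M}$ and $\pi$ turns out to be rather difficult, if not impossible. Instead, we aim to establish such a bound over the neighborhood of a reference policy $\pi_{\mathrm{ref}}$:

$$V^{\pi,M^\star} \ge V^{\pi,\widehat{M}} - D_{\pi_{\mathrm{ref}},\delta}(\widehat{M},\pi), \quad \forall \pi \ \text{s.t.}\ d(\pi,\pi_{\mathrm{ref}})\le\delta \tag{R1}$$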


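Here $d(\cdot,\cdot)$ is a function that measures the closeness of two policies, which will be chosen later in alignment with the choice of $D$. We will mostly omit the subscript $\delta$ in $D$ for simplicity in the rest of the paper. We will require our discrepancy bound to vanish when $\widehat{M}$ is an accurate model:

$$\widehat{M} = M^\star \ \Longrightarrow\ D_{\pi_{\mathrm{ref}}}(\widehat{M},\pi) = 0, \quad \forall \pi, \pi_{\mathrm{ref}} \tag{R2}$$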


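The third requirement for the discrepancy bound $D$ is that it can be estimated and optimized in the sense that

$$D_{\pi_{\mathrm{ref}}}(\widehat{M},\pi) \ \text{is of the form}\ \mathop{\mathbb{E}}_{\tau\sim\pi_{\mathrm{ref}},M^\star}\left[f(\widehat{M},\pi,\tau)\right] \tag{R3}$$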


where $f$ is a known differentiable function. We can estimate such discrepancy bounds for every $\pi$ in the neighborhood of $\pi_{\mathrm{ref}}$ by sampling empirical trajectories $\tau^{(1)},\dots,\tau^{(n)}$ from executing policy $\pi_{\mathrm{ref}}$ on the real environment $M^\star$ and computing the average of the $f(\widehat{M},\pi,\tau^{(i)})$'s. We have to insist that the expectation cannot be over the randomness of trajectories from $\pi$ on $M^\star$, because then we would have to re-sample trajectories for every possible $\pi$ encountered.

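For example, assuming the dynamical models are all deterministic, one of the valid discrepancy bounds (under some strong assumptions) that we will prove in Section 4 is a multiple of the prediction error of $\widehat{M}$ on the trajectories from $\pi_{\mathrm{ref}}$, for a constant $L$:

$$D_{\pi_{\mathrm{ref}}}(\widehat{M},\pi) = L\cdot\mathop{\mathbb{E}}_{S_0,\dots,S_t,\dots\sim\pi_{\mathrm{ref}},M^\star}\left[\left\|\widehat{M}(S_t,A_t) - S_{t+1}\right\|\right] \tag{3.2}$$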
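Because the bound is an expectation over trajectories of the reference policy only, it can be estimated by a plain sample average over recorded transitions. The sketch below computes the average norm of the one-step prediction error of a deterministic model. The function names and the tuple layout of a transition are hypothetical, and the multiplicative constant and any discount weighting are deliberately left out.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# One recorded real-environment transition (s_t, a_t, s_{t+1}).
Transition = Tuple[Vector, Vector, Vector]


def l2_distance(x: Vector, y: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def empirical_prediction_error(
    model: Callable[[Vector, Vector], Vector],
    transitions: List[Transition],
) -> float:
    """Average of ||model(s, a) - s_next||_2 over the given transitions.

    The transitions are assumed to come from running the reference policy
    on the real environment, so no new interaction is needed when the
    model (or the policy being optimized) changes.
    """
    if not transitions:
        raise ValueError("need at least one transition")
    errors = [l2_distance(model(s, a), s_next) for s, a, s_next in transitions]
    return sum(errors) / len(errors)
```

Note that the average is taken over norms, not squared norms, which is the distinction from the mean-squared error that the text emphasizes.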


Suppose we can establish such a discrepancy bound $D$ (and the distance function $d$) with properties (R1), (R2), and (R3), which will be the main focus of Section 4. Then we can devise the following meta-algorithm (Algorithm 1). We iteratively optimize the lower bound over the policy $\pi_{k+1}$ and the model $M_{k+1}$, subject to the constraint that the policy is not very far from the reference policy $\pi_k$ obtained in the previous iteration. For simplicity, we only state the population version with the exact computation of $D_{\pi_{\mathrm{ref}}}(\widehat{M},\pi)$, though empirically it is estimated by sampling trajectories.

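Algorithm 1 Meta-Algorithm for Model-based RL

Inputs: Initial policy $\pi_0$. Discrepancy bound $D$ and distance function $d$ that satisfy equation (R1) and (R2).
For $k = 0$ to $T$:

$$\pi_{k+1}, M_{k+1} = \mathop{\arg\max}_{\pi\in\Pi,\ M\in\mathcal{M}}\ V^{\pi,M} - D_{\pi_k,\delta}(M,\pi) \tag{3.3}$$

$$\text{s.t.}\quad d(\pi,\pi_k)\le\delta \tag{3.4}$$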
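The following toy sketch runs the meta-algorithm on a deliberately simple problem to show the monotone improvement in action. Everything in it is an assumption made for illustration: the policy and the model are single scalars, the value function is a made-up quadratic, the discrepancy is computed from the true model parameter (the population version, which is not available in practice), and the joint maximization is done by grid search.

```python
from typing import List, Tuple

TRUE_M = 3.0  # hypothetical "true dynamics" parameter of the toy problem


def value(theta: float, m: float) -> float:
    """Toy stand-in for the value of policy theta on model m."""
    return -(theta - m) ** 2


def discrepancy(m: float) -> float:
    """Toy discrepancy bound.

    On [0, 5]^2 we have |value(theta, m) - value(theta, TRUE_M)|
    = |m - 3| * |2*theta - m - 3| <= 8 * |m - 3|, so the lower-bound
    property holds, and the bound vanishes at the true model.
    """
    return 8.0 * abs(m - TRUE_M)


def grid(lo: float, hi: float, step: float = 0.25) -> List[float]:
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]


def meta_step(theta_ref: float, delta: float) -> Tuple[float, float]:
    """Maximize value - discrepancy over (theta, m) with |theta - theta_ref| <= delta."""
    thetas = [t for t in grid(0.0, 5.0) if abs(t - theta_ref) <= delta + 1e-12]
    models = grid(0.0, 5.0)
    return max(
        ((t, m) for t in thetas for m in models),
        key=lambda tm: value(tm[0], tm[1]) - discrepancy(tm[1]),
    )


def run(theta0: float, delta: float, iterations: int) -> List[float]:
    """Return the sequence of policy parameters produced by the loop."""
    thetas = [theta0]
    for _ in range(iterations):
        theta_next, _model = meta_step(thetas[-1], delta)
        thetas.append(theta_next)
    return thetas
```

Calling `run(0.0, 0.5, 8)` moves the policy parameter towards the optimum in steps no larger than the trust-region size, and the true value `value(theta, TRUE_M)` never decreases along the way.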

We first remark that the discrepancy bound in the objective plays the role of learning the dynamical model by ensuring that the model fits the sampled trajectories. For example, using the discrepancy bound in the form of equation (3.2), we roughly recover the standard objective for model learning, with the caveat that we only have the norm instead of the square of the norm in MSE. Such a distinction turns out to be empirically important for better performance (see Section 6.2).

Second, our algorithm can be viewed as an extension of the optimism-in-face-of-uncertainty (OFU) principle to the non-linear parameterized setting: jointly optimizing $M$ and $\pi$ encourages the algorithm to choose the most optimistic model among those that can be used to accurately estimate the value function. (See (Jaksch et al., 2010; Bartlett & Tewari, 2009; Fruit et al., 2018; Lakshmanan et al., 2015; Pirotta et al., 2015, 2013) and references therein for the OFU principle in finite-state MDPs.) The main novelty here is to optimize the lower bound directly, without explicitly building any confidence intervals, which turns out to be challenging in deep learning. In other words, the uncertainty is measured straightforwardly by how the error would affect the estimation of the value function.

Third, the maximization of $V^{\pi,M}$ over $\pi$, when $M$ is fixed, can be solved by any model-free RL algorithm with $M$ as the environment, without querying any real samples. Optimizing $V^{\pi,M}$ jointly over $\pi, M$ can also be viewed as another RL problem with an extended action space using the known “extended MDP technique”. See (Jaksch et al., 2010, section 3.1) for details.

Our main theorem shows formally that the policy performance in the real environment is non-decreasing under the assumption that the real dynamics belongs to our parameterized family $\mathcal{M}$. (We note that such an assumption, though restrictive, may not be very far from reality: optimistically speaking, we only need to approximate the dynamical model accurately on the trajectories of the optimal policy. This might be much easier than approximating the dynamical model globally.)

Theorem 3.1.

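Suppose that $M^\star\in\mathcal{M}$, that $D$ and $d$ satisfy equation (R1) and (R2), and that the optimization problem in equation (3.3) is solvable at each iteration. Then, Algorithm 1 produces a sequence of policies $\pi_0,\dots,\pi_T$ with monotonically increasing values:

$$V^{\pi_0,M^\star} \le V^{\pi_1,M^\star} \le \cdots \le V^{\pi_T,M^\star} \tag{3.5}$$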


Moreover, as $k\to\infty$, the value $V^{\pi_k,M^\star}$ converges to some $V^{\bar{\pi},M^\star}$, where $\bar{\pi}$ is a local maximum of $V^{\pi,M^\star}$ in domain $\Pi$.

The theorem above can also be extended to a finite sample complexity result with standard concentration inequalities. We show in Theorem G.2 that we can obtain an approximate local maximum in iterations with sample complexity (in the number of trajectories) that is polynomial in dimension and accuracy and is logarithmic in certain smoothness parameters.

Proof of Theorem 3.1.

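Since $D$ and $d$ satisfy equation (R1), we have that

$$V^{\pi_{k+1},M^\star} \ge V^{\pi_{k+1},M_{k+1}} - D_{\pi_k}(M_{k+1},\pi_{k+1}).$$

By the definition that $\pi_{k+1}$ and $M_{k+1}$ are the optimizers of equation (3.3), and since the pair $(\pi_k, M^\star)$ is feasible for it, we have that

$$V^{\pi_{k+1},M_{k+1}} - D_{\pi_k}(M_{k+1},\pi_{k+1}) \ge V^{\pi_k,M^\star} - D_{\pi_k}(M^\star,\pi_k) = V^{\pi_k,M^\star} \quad \text{(by equation (R2))}.$$

Combining the two equations above, we complete the proof of equation (3.5).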

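For the second part of the theorem, by compactness, we have that a subsequence of $\pi_k$ converges to some $\bar{\pi}$. By the monotonicity we have $V^{\pi_k,M^\star}\le V^{\bar{\pi},M^\star}$ for every $k\ge 0$. For the sake of contradiction, we assume $\bar{\pi}$ is not a local maximum. Then in the neighborhood of $\bar{\pi}$ there exists $\pi'$ such that $V^{\pi',M^\star} > V^{\bar{\pi},M^\star}$ and $d(\bar{\pi},\pi') < \delta/2$. Let $t$ be such that $\pi_t$ is in the $\delta/2$-neighborhood of $\bar{\pi}$. Then we see that $(\pi',M^\star)$ is a better solution than $(\pi_{t+1},M_{t+1})$ for the optimization problem (3.3) in iteration $t$ because $V^{\pi',M^\star} > V^{\bar{\pi},M^\star} \ge V^{\pi_{t+1},M^\star} \ge V^{\pi_{t+1},M_{t+1}} - D_{\pi_t}(M_{t+1},\pi_{t+1})$. (Here the last inequality uses equation (R1) with $\pi_t$ as $\pi_{\mathrm{ref}}$.) The fact that $(\pi',M^\star)$ is a strictly better solution than $(\pi_{t+1},M_{t+1})$ contradicts the fact that $(\pi_{t+1},M_{t+1})$ is defined to be the optimal solution of (3.3). Therefore $\bar{\pi}$ is a local maximum and we complete the proof.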

4 Discrepancy Bounds Design

In this section, we design discrepancy bounds that can provably satisfy the requirements (R1),  (R2), and (R3). We design increasingly stronger discrepancy bounds from Section 4.1 to Section A.

4.1 Norm-based prediction error bounds

In this subsection, we assume the dynamical model $M^\star$ is deterministic and we also learn with a deterministic model $\widehat{M}$. Under assumptions defined below, we derive a discrepancy bound $D$ of the form $\|\widehat{M}(S,A) - M^\star(S,A)\|$ averaged over the observed state-action pairs $(S,A)$ on the dynamical model $M^\star$. This suggests that the norm is a better metric than the mean-squared error for learning the model, which is empirically shown in Section 6.2. Through the derivation, we will also introduce a telescoping lemma, which serves as the main building block towards other finer discrepancy bounds.

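We make the (strong) assumption that the value function $V^{\pi,\widehat{M}}$ on the estimated dynamical model is $L$-Lipschitz w.r.t. some norm $\|\cdot\|$ in the sense that

$$\forall s,s'\in\mathcal{S},\quad \left|V^{\pi,\widehat{M}}(s) - V^{\pi,\widehat{M}}(s')\right| \le L\cdot\|s-s'\| \tag{4.1}$$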


In other words, nearby starting points should give similar reward-to-go under the same policy $\pi$. We note that not every real environment has this property, let alone the estimated dynamical models. However, once the real dynamical model induces a Lipschitz value function, we may penalize the Lipschitz-ness of the value function of the estimated model during training.

We start off with a lemma showing that the expected prediction error is an upper bound of the discrepancy between the real and imaginary values.

Lemma 4.1.

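Suppose $V^{\pi,\widehat{M}}$ is $L$-Lipschitz (in the sense of (4.1)). Recall $\kappa = \gamma(1-\gamma)^{-1}$. Then

$$\left|V^{\pi,\widehat{M}} - V^{\pi,M^\star}\right| \le \kappa L \mathop{\mathbb{E}}_{S\sim\rho^{\pi},\ A\sim\pi(\cdot|S)}\left[\left\|\widehat{M}(S,A) - M^\star(S,A)\right\|\right] \tag{4.2}$$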


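However, the RHS of equation (4.2) cannot serve as a discrepancy bound because it does not satisfy requirement (R3): to optimize it over $\pi$, we would need to collect samples from $\rho^{\pi}$, the state distribution of the policy $\pi$ on the real model $M^\star$, for every iterate $\pi$. The main proposition of this subsection, stated next, shows that for every $\pi$ in the neighborhood of a reference policy $\pi_{\mathrm{ref}}$, we can replace the distribution $\rho^{\pi}$ by the fixed distribution $\rho^{\pi_{\mathrm{ref}}}$ while incurring only a higher-order approximation error. We use the expected KL divergence between $\pi$ and $\pi_{\mathrm{ref}}$ to define the neighborhood:

$$d^{KL}(\pi,\pi_{\mathrm{ref}}) = \mathop{\mathbb{E}}_{S\sim\rho^{\pi}}\left[KL\left(\pi(\cdot|S),\pi_{\mathrm{ref}}(\cdot|S)\right)^{1/2}\right] \tag{4.3}$$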

Proposition 4.2.

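In the same setting as Lemma 4.1, assume in addition that $\pi$ is close to a reference policy $\pi_{\mathrm{ref}}$ in the sense that $d^{KL}(\pi,\pi_{\mathrm{ref}})\le\delta$, and that the states in $\mathcal{S}$ are uniformly bounded in the sense that $\|s\|\le B$ for all $s\in\mathcal{S}$. Then,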


In a benign scenario, the second term in the RHS of equation (4.4) should be dominated by the first term when the neighborhood size $\delta$ is sufficiently small. Moreover, the term $B$ can also be replaced by $\max_{S,A}\|\widehat{M}(S,A) - M^\star(S,A)\|$ (see the proof, which is deferred to Section C). The dependency on $\kappa$ may not be tight for real-life instances, but we note that most analyses of a similar nature lose the additional $\kappa$ factor (Schulman et al., 2015a; Achiam et al., 2017), and it is inevitable in the worst case.

A telescoping lemma

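Towards proving Proposition 4.2 and deriving stronger discrepancy bounds, we define the following quantity that captures the discrepancy between $\widehat{M}$ and $M^\star$ on a single state-action pair $(s,a)$:

$$G^{\pi,\widehat{M}}(s,a) = \mathop{\mathbb{E}}_{\hat{s}'\sim\widehat{M}(\cdot|s,a)}\left[V^{\pi,\widehat{M}}(\hat{s}')\right] - \mathop{\mathbb{E}}_{s'\sim M^\star(\cdot|s,a)}\left[V^{\pi,\widehat{M}}(s')\right] \tag{4.5}$$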


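Note that if $M^\star$ and $\widehat{M}$ are deterministic, then $G^{\pi,\widehat{M}}(s,a) = V^{\pi,\widehat{M}}(\widehat{M}(s,a)) - V^{\pi,\widehat{M}}(M^\star(s,a))$. We give a telescoping lemma that decomposes the discrepancy between $V^{\pi,\widehat{M}}$ and $V^{\pi,M^\star}$ into the expected single-step discrepancy $G$.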

Lemma 4.3.

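[Telescoping Lemma] Recall that $\kappa = \gamma(1-\gamma)^{-1}$. For any policy $\pi$ and dynamical models $M^\star, \widehat{M}$, we have that

$$V^{\pi,\widehat{M}} - V^{\pi,M^\star} = \kappa \mathop{\mathbb{E}}_{S\sim\rho^{\pi},\ A\sim\pi(\cdot|S)}\left[G^{\pi,\widehat{M}}(S,A)\right] \tag{4.6}$$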


The proof is reminiscent of the telescoping expansion in Kakade & Langford (2002) (cf. Schulman et al. (2015a)) for characterizing the value difference of two policies, but we apply it to deal with the discrepancy between models. The details are deferred to Section B. With the telescoping Lemma 4.3, Lemma 4.1 follows straightforwardly from the Lipschitzness of the imaginary value function. Proposition 4.2 follows from the fact that $\rho^{\pi}$ and $\rho^{\pi_{\mathrm{ref}}}$ are close. We defer the proof to Appendix C.
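The telescoping identity can be checked numerically on a tiny finite MDP: the gap between the value on the learned model and the value on the true model should equal $\gamma/(1-\gamma)$ times the expected single-step discrepancy, with states drawn from the discounted visitation distribution of the policy on the true model. The sketch below does this with deterministic models; all tables are made-up toy data and the function names are hypothetical.

```python
from typing import List, Tuple

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2
POLICY = [[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]]  # pi(a | s)
REWARD = [[0.0, 1.0], [0.5, 0.0], [2.0, 0.0]]  # R(s, a)
M_STAR = [[1, 2], [0, 2], [2, 0]]              # true next state for (s, a)
M_HAT = [[1, 1], [0, 2], [0, 0]]               # learned next state, partly wrong
P0 = [1.0, 0.0, 0.0]                           # initial state distribution


def state_values(model: List[List[int]], iters: int = 2000) -> List[float]:
    """Iterative policy evaluation of POLICY on a deterministic model."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [
            sum(POLICY[s][a] * (REWARD[s][a] + GAMMA * v[model[s][a]])
                for a in range(N_ACTIONS))
            for s in range(N_STATES)
        ]
    return v


def discounted_visitation(model: List[List[int]], iters: int = 2000) -> List[float]:
    """(1 - gamma) * sum_t gamma^t * p_{S_t} for POLICY on the model."""
    p, rho, weight = list(P0), [0.0] * N_STATES, 1.0 - GAMMA
    for _ in range(iters):
        for s in range(N_STATES):
            rho[s] += weight * p[s]
        nxt = [0.0] * N_STATES
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                nxt[model[s][a]] += p[s] * POLICY[s][a]
        p, weight = nxt, weight * GAMMA
    return rho


def telescoping_sides() -> Tuple[float, float]:
    """Return (value gap, kappa * expected single-step discrepancy)."""
    v_hat, v_star = state_values(M_HAT), state_values(M_STAR)
    lhs = sum(P0[s] * (v_hat[s] - v_star[s]) for s in range(N_STATES))
    rho = discounted_visitation(M_STAR)
    kappa = GAMMA / (1.0 - GAMMA)
    rhs = kappa * sum(
        rho[s] * POLICY[s][a] * (v_hat[M_HAT[s][a]] - v_hat[M_STAR[s][a]])
        for s in range(N_STATES) for a in range(N_ACTIONS)
    )
    return lhs, rhs
```

The two returned numbers agree up to floating-point error, even though the learned model is wrong on two state-action pairs and the value gap is far from zero.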

4.2 Representation-invariant Discrepancy Bounds

The main limitation of the norm-based discrepancy bounds in the previous subsection is that they depend on the state representation. Let $\mathcal{T}$ be a one-to-one map from the state space $\mathcal{S}$ to some other space $\mathcal{S}'$, and for simplicity of this discussion let's assume a model $M$ is deterministic. If we represent every state $s$ by its transformed representation $\mathcal{T}s$, then the transformed model $M^{\mathcal{T}}$ defined as $M^{\mathcal{T}}(s,a) \triangleq \mathcal{T}M(\mathcal{T}^{-1}s,a)$, together with the transformed reward $R^{\mathcal{T}}(s,a) \triangleq R(\mathcal{T}^{-1}s,a)$ and transformed policy $\pi^{\mathcal{T}}(s) \triangleq \pi(\mathcal{T}^{-1}s)$, is equivalent to the original set of the model, reward, and policy in terms of the performance (Lemma C.1). Thus such a transformation $\mathcal{T}$ is not identifiable from only observing the reward. However, the norm in the state space is a notion that depends on the hidden choice of the transformation $\mathcal{T}$. (That said, in many cases the reward function itself is known, and the states have physical meanings, and therefore we may be able to use domain knowledge to figure out the best norm.)
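A short numerical sketch of this point: rescaling the state by a factor of 10 changes the norm of the prediction error by the same factor, while the difference in value between the predicted and the true next state is unchanged. The scalar dynamics, the scaling map, and the fixed function used as a stand-in for the imaginary value function are all assumptions for illustration, and the sketch takes for granted that the value function transforms by composition with the inverse map.

```python
from typing import Callable

Fn1 = Callable[[float], float]
Fn2 = Callable[[float, float], float]


def transform_model(model: Fn2, t: Fn1, t_inv: Fn1) -> Fn2:
    """Model in the new coordinates: T(M(T^{-1} s, a))."""
    return lambda s, a: t(model(t_inv(s), a))


def transform_value(v: Fn1, t_inv: Fn1) -> Fn1:
    """Value function in the new coordinates: V(T^{-1} s)."""
    return lambda s: v(t_inv(s))


# Toy ingredients (hypothetical).
true_model: Fn2 = lambda s, a: s + a
learned_model: Fn2 = lambda s, a: s + a + 0.1
value_fn: Fn1 = lambda s: -s * s
scale: Fn1 = lambda s: 10.0 * s
unscale: Fn1 = lambda s: s / 10.0


def norm_error(m_hat: Fn2, m_star: Fn2, s: float, a: float) -> float:
    return abs(m_hat(s, a) - m_star(s, a))


def value_gap(v: Fn1, m_hat: Fn2, m_star: Fn2, s: float, a: float) -> float:
    return v(m_hat(s, a)) - v(m_star(s, a))
```

Evaluating both quantities at a state and at its rescaled image shows the norm-based error growing tenfold while the value gap stays the same.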

Another limitation is that the loss for the model learning should also depend on the state itself instead of only on the difference $\widehat{M}(S,A) - M^\star(S,A)$. It is possible that when $S$ is at a critical position, the prediction needs to be highly accurate so that the model $\widehat{M}$ can be useful for planning. On the other hand, at other states, the dynamical model is allowed to make bigger mistakes because they are not essential to the reward.

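We propose the following discrepancy bound towards addressing the limitations above. Recall the definition of $G^{\pi,\widehat{M}}(s,a) = V^{\pi,\widehat{M}}(\widehat{M}(s,a)) - V^{\pi,\widehat{M}}(M^\star(s,a))$, which measures the difference between $\widehat{M}(s,a)$ and $M^\star(s,a)$ according to their imaginary rewards. We construct a discrepancy bound using the absolute value of $G$. Let's define $\varepsilon_1$ and $\varepsilon_{\max}$ as the average of $|G^{\pi,\widehat{M}}|$ and its maximum: $\varepsilon_1 = \mathbb{E}_{S\sim\rho^{\pi_{\mathrm{ref}}}}\left[|G^{\pi,\widehat{M}}(S)|\right]$ and $\varepsilon_{\max} = \max_S |G^{\pi,\widehat{M}}(S)|$, where $G^{\pi,\widehat{M}}(S) = \mathbb{E}_{A\sim\pi(\cdot|S)}\left[G^{\pi,\widehat{M}}(S,A)\right]$. We will show that the following discrepancy bound $D^G_{\pi_{\mathrm{ref}}}(\widehat{M},\pi)$ satisfies the properties (R1), (R2).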

Proposition 4.4.

Let $d^{KL}$ and $D^G$ be defined as in equation (4.3) and (4.7). Then the choice $d = d^{KL}$ and $D = D^G$ satisfies the basic requirements (equation (R1) and (R2)). Moreover, $G^{\pi,\widehat{M}}$ is invariant w.r.t. any one-to-one transformation of the state space (in the sense of equation C.1 in the proof).

The proof follows from the telescoping lemma (Lemma 4.3) and is deferred to Section C. We remark that the first term can in principle be estimated and optimized approximately: the expectation can be replaced by empirical samples from $\rho^{\pi_{\mathrm{ref}}}$, and $G^{\pi,\widehat{M}}$ is an analytical function of $\pi$ and $\widehat{M}$ when they are both deterministic, and therefore can be optimized by back-propagation through time (BPTT). (When $\pi$ and $\widehat{M}$ are stochastic with a re-parameterizable noise such as a Gaussian distribution (Kingma & Welling, 2013), we can also use back-propagation to estimate the gradient.) The second term in equation (4.7) is difficult to optimize because it involves the maximum. However, it can in theory be considered a second-order term because $\delta$ can be chosen to be a fairly small number. (In the refined bound in Section A, the dependency on $\delta$ is even milder.)

Remark 4.5.

Proposition 4.4 intuitively suggests a technical reason why the model-based approach can be more sample-efficient than policy-gradient-based algorithms such as TRPO or PPO (Schulman et al., 2015a, 2017). The approximation error of $V^{\pi,M^\star}$ in the model-based approach decreases as the model errors $\varepsilon_1, \varepsilon_{\max}$ decrease or the neighborhood size $\delta$ decreases, whereas the approximation error in policy gradient depends only linearly on the neighborhood size (Schulman et al., 2015a). In other words, model-based algorithms can trade model accuracy for a larger neighborhood size, and therefore the convergence can be faster (in terms of outer iterations). This is consistent with our empirical observation that the model can be accurate in a decent neighborhood of the current policy so that the constraint (3.4) can be empirically dropped. We also refine our bounds in Section A, where the discrepancy bound is proved to decay faster in $\delta$.

5 Additional Related work

Model-based reinforcement learning is expected to require fewer samples than model-free algorithms (Deisenroth et al., 2013) and has been successfully applied to robotics in both simulation and in the real world (Deisenroth & Rasmussen, 2011b; Morimoto & Atkeson, 2003; Deisenroth et al., 2011) using dynamical models ranging from Gaussian processes (Deisenroth & Rasmussen, 2011b; Ko & Fox, 2009), time-varying linear models (Levine & Koltun, 2013; Lioutikov et al., 2014; Levine & Abbeel, 2014; Yip & Camarillo, 2014), and mixtures of Gaussians (Khansari-Zadeh & Billard, 2011), to neural networks (Hunt et al., 1992; Nagabandi et al., 2017; Kurutach et al., 2018; Tangkaratt et al., 2014; Sanchez-Gonzalez et al., 2018; Pascanu et al., 2017). In particular, the work of Kurutach et al. (2018) uses an ensemble of neural networks to learn the dynamical model, and significantly reduces the sample complexity compared to model-free approaches. The work of Chua et al. (2018) makes further improvement by using a probabilistic model ensemble. Clavera et al. (2018) extended this method with meta-policy optimization and improved the robustness to model error. In contrast, we focus on theoretical understanding of model-based RL and the design of new algorithms, and our experiments use a single neural network to estimate the dynamical model.

Our discrepancy bound in Section 4 is closely related to the work of Farahmand et al. (2017) on the value-aware model loss. Our approach differs from it in three details: a) we use the absolute value of the value difference instead of the squared difference; b) we use the imaginary value function from the estimated dynamical model to define the loss, which makes the loss purely a function of the estimated model and the policy; c) we show that the iterative algorithm, using the loss function as a building block, can converge to a local maximum, partly because of the particular choices made in a) and b). Asadi et al. (2018) also study discrepancy bounds under a Lipschitz condition on the MDP.

Prior work explores a variety of ways of combining model-free and model-based ideas to achieve the best of the two methods (Sutton, 1991, 1990; Racanière et al., 2017; Mordatch et al., 2016; Sun et al., 2018). For example, estimated models (Levine & Koltun, 2013; Gu et al., 2016; Kalweit & Boedecker, 2017) are used to enrich the replay buffer in model-free off-policy RL. Pong et al. (2018b) propose goal-conditioned value functions trained by model-free algorithms and use them for model-based control. Feinberg et al. (2018) and Buckman et al. (2018) use dynamical models to improve the estimation of the value functions in model-free algorithms.

On the control theory side, Dean et al. (2018, 2017) provide strong finite sample complexity bounds for solving the linear quadratic regulator using a model-based approach. Boczar et al. (2018) provide finite-data guarantees for the “coarse-ID control” pipeline, which is composed of a system identification step followed by a robust controller synthesis procedure. Our method is inspired by the general idea of maximizing a lower bound of the reward in (Dean et al., 2017). By contrast, our work applies to non-linear dynamical systems. Our algorithms also estimate the models iteratively based on trajectory samples from the learned policies.

Strong model-based and model-free sample complexity bounds have been achieved in the tabular case (finite state space). We refer the readers to (Kakade et al., 2018; Dann et al., 2017; Szita & Szepesvári, 2010; Kearns & Singh, 2002; Jaksch et al., 2010; Agrawal & Jia, 2017) and the references therein. Our work focuses on continuous and high-dimensional state spaces (though the results also apply to the tabular case).

Another line of work in model-based reinforcement learning is to learn a dynamical model in a hidden representation space, which is especially necessary for pixel state spaces (Kakade et al., 2018; Dann et al., 2017; Szita & Szepesvári, 2010; Kearns & Singh, 2002; Jaksch et al., 2010). Srinivas et al. (2018) show the possibility of learning an abstract transition model to imitate an expert policy. Oh et al. (2017) learn the hidden state of a dynamical model to predict the value of future states and apply RL or planning on top of it. Serban et al. (2018) and Ha & Schmidhuber (2018) learn a bottleneck representation of the states. Our framework can potentially be combined with this line of research.

6 Practical Implementation and Experiments

6.1 Practical implementation

By simplifying our framework, we design a variant of model-based RL algorithms, Stochastic Lower Bound Optimization (SLBO). First, we remove the constraint (3.4). Second, we stop the gradient w.r.t. the model parameters $\phi$ (but not the policy parameters $\theta$) from the occurrence of $\widehat{M}_\phi$ in $V^{\pi_\theta,\widehat{M}_\phi}$ in equation (3.3) (and thus our practical implementation is not optimism-driven).

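Extending the discrepancy bound in Section 4.1, we use a multi-step prediction loss for learning the models with the $\ell_2$ norm. For a state $s_t$ and action sequence $a_{t:t+h}$, we define the $h$-step prediction $\hat{s}_{t+h}$ as $\hat{s}_t = s_t$ and, for $h\ge 0$, $\hat{s}_{t+h+1} = \widehat{M}_\phi(\hat{s}_{t+h}, a_{t+h})$. The $H$-step loss is then defined as

$$\mathcal{L}^{(H)}_\phi\left((s_{t:t+H}, a_{t:t+H});\phi\right) = \frac{1}{H}\sum_{i=1}^{H}\left\|(\hat{s}_{t+i} - \hat{s}_{t+i-1}) - (s_{t+i} - s_{t+i-1})\right\|_2 \tag{6.1}$$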


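A similar loss is also used in Nagabandi et al. (2017) for validation. We note that, motivated by the theory in Section 4.1, we use the $\ell_2$ norm instead of the square of the $\ell_2$ norm. The loss function we attempt to optimize at iteration $k$ is thus

$$\max_{\phi,\theta}\ V^{\pi_\theta,\,\mathrm{sg}(\widehat{M}_\phi)} - \lambda \mathop{\mathbb{E}}_{(s_{t:t+H},a_{t:t+H})\sim\pi_k,M^\star}\left[\mathcal{L}^{(H)}_\phi\left((s_{t:t+H},a_{t:t+H});\phi\right)\right] \tag{6.2}$$

(This is technically not a well-defined mathematical objective. The sg operation means identity when the function is evaluated, whereas when computing the update, $\mathrm{sg}(\widehat{M}_\phi)$ is considered fixed.)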


where $\lambda$ is a tunable parameter and sg denotes the stop-gradient operation.
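A minimal sketch of a multi-step loss of this kind is given below. It rolls the model forward on its own predictions from the first recorded state and averages, over the steps, the $\ell_2$ norm (not its square) of the gap between the predicted state change and the observed state change. Comparing state changes rather than raw states is our reading of the loss, and the function name and list-based interface are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def multi_step_l2_loss(
    model: Callable[[Vector, Vector], Vector],
    states: Sequence[Vector],
    actions: Sequence[Vector],
) -> float:
    """H-step prediction loss with an (unsquared) l2 norm.

    `states` holds H + 1 recorded states s_t, ..., s_{t+H} and `actions`
    holds at least H recorded actions a_t, ..., a_{t+H-1}.
    """
    horizon = len(states) - 1
    if horizon < 1 or len(actions) < horizon:
        raise ValueError("need H + 1 states and at least H actions with H >= 1")
    predicted = list(states[0])  # the rollout starts at the true state
    total = 0.0
    for i in range(horizon):
        next_predicted = model(predicted, actions[i])
        # Compare the predicted increment with the observed increment.
        squared = sum(
            ((p_next - p_cur) - (s_next - s_cur)) ** 2
            for p_next, p_cur, s_next, s_cur in zip(
                next_predicted, predicted, states[i + 1], states[i]
            )
        )
        total += math.sqrt(squared)
        predicted = next_predicted  # feed the prediction back in
    return total / horizon
```

With a model that overshoots every step by a constant 0.5, each increment is off by 0.5, so the loss is 0.5 regardless of how far the rolled-out state has drifted.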

We note that the term $V^{\pi_\theta,\mathrm{sg}(\widehat{M}_\phi)}$ depends on both the parameter $\theta$ and the parameter $\phi$, but there is no gradient passed through $\phi$, whereas $\mathcal{L}^{(H)}_\phi$ depends only on $\phi$. We optimize equation (6.2) by alternately maximizing $V^{\pi_\theta,\mathrm{sg}(\widehat{M}_\phi)}$ and minimizing $\mathcal{L}^{(H)}_\phi$: for the former, we use TRPO with samples from the estimated dynamical model $\widehat{M}_\phi$ (by treating $\widehat{M}_\phi$ as a fixed simulator), and for the latter we use standard stochastic gradient methods. Algorithm 2 gives pseudo-code for the algorithm. The $n_{\mathrm{model}}$ and $n_{\mathrm{policy}}$ iterations are used to balance the number of steps of TRPO and Adam updates within the loop indexed by $n_{\mathrm{inner}}$. (In principle, to balance the number of steps, it suffices to take one of $n_{\mathrm{model}}$ and $n_{\mathrm{policy}}$ to be 1. However, empirically we found that the optimal balance is achieved with larger $n_{\mathrm{model}}$ and $n_{\mathrm{policy}}$, possibly due to complicated interactions between the two optimization problems.)

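Algorithm 2 Stochastic Lower Bound Optimization (SLBO)
1: Initialize model network parameters $\phi$ and policy network parameters $\theta$
2: Initialize dataset $\mathcal{D} \leftarrow \emptyset$
3: for $n_{\mathrm{outer}}$ iterations do
4:     $\mathcal{D} \leftarrow \mathcal{D} \cup \{$collect $n_{\mathrm{collect}}$ samples from real environment using $\pi_\theta$ with noises$\}$
5:     for $n_{\mathrm{inner}}$ iterations do   (optimize (6.2) with stochastic alternating updates)
6:         for $n_{\mathrm{model}}$ iterations do
7:             optimize (6.1) over $\phi$ with sampled data from $\mathcal{D}$ by one step of Adam
8:         for $n_{\mathrm{policy}}$ iterations do
9:             $\mathcal{D}' \leftarrow \{$collect $n_{\mathrm{trpo}}$ samples using $\widehat{M}_\phi$ as dynamics$\}$
10:            optimize $\pi_\theta$ by running TRPO on $\mathcal{D}'$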
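The nesting of the loops in Algorithm 2 can be sketched as a plain schedule that calls user-supplied steps. The three callbacks stand in for real data collection, one Adam step on the model loss, and one TRPO update on imagined rollouts; their names and the argument names are placeholders, and nothing here implements the actual learning updates.

```python
from typing import Callable


def slbo_schedule(
    n_outer: int,
    n_inner: int,
    n_model: int,
    n_policy: int,
    collect_real: Callable[[], None],
    model_step: Callable[[], None],
    policy_step: Callable[[], None],
) -> None:
    """Run the alternating update schedule with caller-provided steps."""
    for _ in range(n_outer):
        collect_real()  # add fresh real-environment samples to the dataset
        for _ in range(n_inner):
            for _ in range(n_model):
                model_step()  # one stochastic gradient step on the model loss
            for _ in range(n_policy):
                policy_step()  # one policy update on model-generated rollouts
```

The point of the structure is that model and policy updates are interleaved many times per batch of real data, rather than fitting the model to convergence once and then optimizing the policy.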

Power of stochasticity and connection to standard MB RL: We identify that the main advantage of our algorithm over standard model-based RL algorithms is that we alternate the updates of the model and the policy within an outer iteration. By contrast, most of the existing model-based RL methods only optimize the models once (for a lot of steps) after collecting a batch of samples (see Algorithm 3 for an example). The stochasticity introduced by the alternation with stochastic samples seems to dramatically reduce the overfitting (of the policy to the estimated dynamical model), in a way similar to how SGD regularizes ordinary supervised training. (Similar stochasticity can potentially be obtained by an extreme hyperparameter choice of the standard MB RL algorithm: in each outer iteration of Algorithm 3, we only sample a very small number of trajectories and take a few model updates and policy updates. We argue our interpretation of stochastic optimization of the lower bound (6.2) is more natural in that it reveals the regularization from stochastic optimization.) Another way to view the algorithm is that the models obtained from line 7 of Algorithm 2 at different inner iterations serve as an ensemble of models. We do believe that a cleaner and easier instantiation of our framework (with optimism) exists, and the current version, though performing very well, is not necessarily the best implementation.

Entropy regularization: An additional component we apply to SLBO is the commonly adopted entropy regularization in policy gradient methods (Williams & Peng, 1991; Mnih et al., 2016), which was found to significantly boost the performance in our experiments (ablation study in Appendix F.5). Specifically, an additional entropy term is added to the objective function in TRPO. We hypothesize that the entropy bonus helps exploration, diversifies the collected data, and thus prevents overfitting.
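As an illustration of what adding an entropy term to a policy objective can look like, here is a minimal sketch. It assumes a diagonal Gaussian policy (common for continuous control, but not stated in this section) and a hypothetical coefficient `coef`; the function names are illustrative and are not taken from the authors' code.

```python
import math
from typing import Sequence


def gaussian_entropy(log_std: Sequence[float]) -> float:
    """Differential entropy of a diagonal Gaussian given per-dimension log std."""
    d = len(log_std)
    return sum(log_std) + 0.5 * d * (1.0 + math.log(2.0 * math.pi))


def regularized_objective(surrogate: float, log_std: Sequence[float], coef: float) -> float:
    """Surrogate policy objective plus an entropy bonus weighted by `coef` (assumed)."""
    return surrogate + coef * gaussian_entropy(log_std)


# A wider action distribution has higher entropy, so it earns a larger bonus.
narrow = regularized_objective(1.0, [-1.0, -1.0], coef=0.01)
wide = regularized_objective(1.0, [0.0, 0.0], coef=0.01)
```

For a state-dependent standard deviation, the entropy would be averaged over sampled states before being added to the objective.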

6.2 Experimental Results

We evaluate our algorithm SLBO (Algorithm 2) on five continuous control tasks from rllab (Duan et al., 2016): Swimmer, Half Cheetah, Humanoid, Ant, and Walker. All environments that we test have a maximum horizon of 500, which is longer than in most existing model-based RL work (Nagabandi et al., 2017; Kurutach et al., 2018). (Environments with longer horizons are commonly harder to train.) More details can be found in Appendix F.1.

Baselines. We compare our algorithm with three other algorithms: (1) Soft Actor-Critic (SAC) (Haarnoja et al., 2018), the state-of-the-art model-free off-policy algorithm in sample efficiency; (2) Trust-Region Policy Optimization (TRPO) (Schulman et al., 2015a), a policy-gradient-based algorithm; and (3) Model-Based TRPO, a standard model-based algorithm described in Algorithm 3. Details of these algorithms can be found in Appendix F.4. (We have not yet had the chance to implement the competitive random search algorithms of Mania et al. (2018), although our test performance with episode length 500 is higher than theirs with episode length 1000 on Half Cheetah (3950 for ours vs. 2345 for theirs) and Walker (3650 for ours vs. 894 for theirs).)

The results are shown in Figure 1. Our algorithm converges faster (in number of samples) than all the baseline algorithms while achieving better final performance with 1M samples. Specifically, we mark the performance of model-free TRPO after 8 million steps by the dotted line in Figure 1 and find that our algorithm achieves comparable or better final performance in one million steps. As an ablation study, we also report the performance of SLBO-MSE, which corresponds to running SLBO with the squared model loss instead of the loss in (6.1). SLBO-MSE performs significantly worse than SLBO on four environments, which is consistent with our derived model loss in Section 4.1. We also study the performance of SLBO and the baselines with 4 million training samples in Appendix F.5. An ablation study of multi-step model training can also be found in Appendix F.5. (Video demonstrations are available at … A link to the codebase is available at …)

Figure 1: Comparison between SLBO (ours), SLBO with squared model loss (SLBO-MSE), vanilla model-based TRPO (MB-TRPO), model-free TRPO (MF-TRPO), and Soft Actor-Critic (SAC) on (a) Swimmer, (b) Half Cheetah, (c) Ant, (d) Walker, and (e) Humanoid. We average the results over 10 different random seeds, where the solid lines indicate the mean and shaded areas indicate one standard deviation. The dotted reference lines are the total rewards of MF-TRPO after 8 million steps.
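The per-seed aggregation described in the caption can be sketched as below. This is an assumed reading, not the authors' plotting code: `aggregate_curves` is a hypothetical helper, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the caption does not say which is used.

```python
import statistics
from typing import List, Sequence, Tuple


def aggregate_curves(curves: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """curves[i][t] is the return of seed i at evaluation point t.

    Returns the per-point mean (solid line) and standard deviation (shaded band).
    """
    means: List[float] = []
    stds: List[float] = []
    for values in zip(*curves):  # iterate over evaluation points across seeds
        means.append(statistics.mean(values))
        stds.append(statistics.pstdev(values))  # population std: an assumption
    return means, stds


# Two hypothetical seeds evaluated at two points each.
toy_means, toy_stds = aggregate_curves([[1.0, 2.0], [3.0, 6.0]])
```

The shaded band would then span `mean - std` to `mean + std` at each evaluation point.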

7 Conclusions

We devise a novel algorithmic framework for designing and analyzing model-based RL algorithms with a guarantee of monotone convergence to a local maximum of the reward. Experimental results show that our proposed algorithm (SLBO) achieves new state-of-the-art performance on several MuJoCo benchmark tasks when one million or fewer samples are permitted.

A compelling (if obvious) empirical open question that arises is whether model-based RL can achieve near-optimal reward on other, more complicated tasks or real-world robotic tasks with fewer samples. We believe that understanding the trade-off between optimism and robustness is essential for designing more sample-efficient algorithms. Currently, we observe empirically that the optimism-driven part of our proposed meta-algorithm (optimizing over ) may lead to instability in the optimization, and therefore does not in general help the performance. It is left for future work to find a practical implementation of the optimism-driven approach.

In our theory, we assume that the parameterized model class contains the true dynamical model. Removing this assumption is another interesting open question. It would also be very interesting if the theoretical analysis could be applied to other settings involving model-based approaches (e.g., model-based imitation learning).


We thank the anonymous reviewers for detailed, thoughtful, and helpful reviews. We’d like to thank Emma Brunskill, Chelsea Finn, Shane Gu, Ben Recht, and Haoran Tang for many helpful comments and discussions.


Appendix A Refined bounds

The theoretical limitation of the discrepancy bound is that the second term involving is not rigorously optimizable by stochastic samples. In the worst case, there seem to exist situations where such infinity norm of is inevitable. In this section we tighten the discrepancy bounds with a different closeness measure , χ²-divergence, in the policy space, so that the dependency on the is smaller (though not entirely removed). We note that χ²-divergence has the same second-order approximation as KL-divergence in a local neighborhood of the reference policy, and thus, locally, the change of closeness measure should not affect the optimization much.

We start by defining a re-weighted version of the distribution where examples in later steps are slightly up-weighted. We can effectively sample from by importance sampling from

Definition A.1.

For a policy , define as the re-weighted version of discounted distribution of the states visited by on . Recall that is the distribution of the state at step , we define .

Then we are ready to state our discrepancy bound. Let


where and are defined in equation (4.2).

Proposition A.2.

The discrepancy bound and closeness measure satisfy requirements (R1) and (R2).

We defer the proof to Section C so that we can group relevant proofs with similar tools together. Some of these tools may be of independent interest and could be used for better analysis of model-free reinforcement learning algorithms such as TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), and CPO (Achiam et al., 2017).

Appendix B Proof of Lemma 4.3

Proof of Lemma 4.3.

Let be the cumulative reward when we use dynamical model for steps and then for the rest of the steps, that is,

By definition, we have that . Then, we decompose the target into a telescoping sum,


Now we re-write each of the summands . Comparing the trajectory distribution in the definition of and , we see that they only differ in the dynamical model applied in -th step. Concretely, and can be rewritten as and where denotes the reward from the first steps from policy and model . Canceling the shared term in the two equations above, we get

Combining the equation above with equation (B.1) concludes that
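The telescoping argument can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than a restatement of Lemma 4.3: it fixes a policy, so that each dynamical model reduces to a transition matrix over states, and it verifies the standard identity that the value difference between the estimated and the true model equals γ/(1−γ) times the expected one-step discrepancy under the discounted state distribution of the true model. The matrices, rewards, names, and the sign convention (model minus true) are all hypothetical choices.

```python
from typing import List

Matrix = List[List[float]]


def evaluate(P: Matrix, r: List[float], gamma: float, iters: int = 2000) -> List[float]:
    """Iterative policy evaluation: fixed point of V = r + gamma * P V."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def discounted_occupancy(P: Matrix, s0: int, gamma: float, iters: int = 2000) -> List[float]:
    """rho(s) = (1 - gamma) * sum_t gamma^t * Pr[S_t = s | S_0 = s0] under P."""
    n = len(P)
    p = [1.0 if s == s0 else 0.0 for s in range(n)]  # distribution of S_t
    rho = [0.0] * n
    weight = 1.0 - gamma
    for _ in range(iters):
        rho = [rho[s] + weight * p[s] for s in range(n)]
        p = [sum(p[s] * P[s][t] for s in range(n)) for t in range(n)]
        weight *= gamma
    return rho


gamma = 0.9
r = [1.0, 0.0, 2.0]
P_true = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]       # true dynamics
P_model = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.25, 0.25, 0.5]]    # estimated dynamics
s0 = 0

V_true = evaluate(P_true, r, gamma)
V_model = evaluate(P_model, r, gamma)
rho = discounted_occupancy(P_true, s0, gamma)

# One-step discrepancy: E_{s'~model}[V_model(s')] - E_{s'~true}[V_model(s')].
G = [sum((P_model[s][t] - P_true[s][t]) * V_model[t] for t in range(3)) for s in range(3)]

lhs = V_model[s0] - V_true[s0]
rhs = gamma / (1.0 - gamma) * sum(rho[s] * G[s] for s in range(3))
```

Note that the discrepancy is measured with the value function of the estimated model but on states visited under the true model, which is why it can be estimated from real trajectories.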

Appendix C Missing Proofs in Section 4

C.1 Proof of Proposition 4.4

Towards proving the second part of Proposition 4.4 regarding the invariance, we state the following lemma:

Lemma C.1.

Suppose for simplicity the model and the policy are both deterministic. For any one-to-one transformation from to , let , , and be a set of transformed model, reward and policy. Then we have that is equivalent to in the sense that

where the value function is defined with respect to .


Let be the sequence of states visited by policy on model starting from . We have that . We prove by induction that . Assume this is true for some value , then we prove that holds:

(by inductive hypothesis)
(by definition of , )

Thus we have . Therefore . ∎
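The invariance in Lemma C.1 can be illustrated on a toy deterministic system. The sketch assumes the natural definitions of the transformed objects, namely that the transformed model, reward, and policy are obtained by conjugating with the transformation `T` and its inverse; the dynamics, reward, policy, and `T` below are hypothetical toy choices, and the infinite-horizon value is approximated by a finite rollout.

```python
def rollout_return(step, reward, policy, s0, gamma, horizon):
    """Discounted return of a deterministic policy on a deterministic model."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward(s, a)
        s = step(s, a)
        discount *= gamma
    return total


# Original problem (toy choices).
def step(s, a):
    return 0.5 * s + a


def reward(s, a):
    return -(s * s) - 0.1 * a * a


def policy(s):
    return -0.3 * s


# A one-to-one transformation of the state space and its inverse.
def T(s):
    return 2.0 * s + 1.0


def T_inv(z):
    return (z - 1.0) / 2.0


# Transformed model, reward, and policy (assumed definitions).
def step_T(z, a):
    return T(step(T_inv(z), a))


def reward_T(z, a):
    return reward(T_inv(z), a)


def policy_T(z):
    return policy(T_inv(z))


v = rollout_return(step, reward, policy, 1.5, 0.99, 200)
v_T = rollout_return(step_T, reward_T, policy_T, T(1.5), 0.99, 200)
```

The two returns agree up to floating-point error because the transformed trajectory is the image under `T` of the original trajectory, step by step, which mirrors the induction in the proof.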

Proof of Proposition 4.4.

We first show the invariance of under deterministic models and policies. The same result applies to stochastic policies with slight modifications. Let . We consider the transformation applied to and and the resulting function

Note that by Lemma C.1, we have that . Similarly, . Therefore we obtain that


By Lemma 4.3 and the triangle inequality, we have that

(triangle inequality)