1 Introduction
In recent years deep reinforcement learning has achieved strong empirical success, including superhuman performance on Atari games and Go (Mnih et al., 2015; Silver et al., 2017) and learning locomotion and manipulation skills in robotics (Levine et al., 2016; Schulman et al., 2015b; Lillicrap et al., 2015). Many of these results are achieved by model-free RL algorithms that often require a massive number of samples, and therefore their applications are mostly limited to simulated environments. Model-based deep reinforcement learning, in contrast, exploits the information from state observations explicitly, by planning with an estimated dynamical model, and is considered a promising approach to reducing the sample complexity. Indeed, empirical results (Deisenroth & Rasmussen, 2011b; Deisenroth et al., 2013; Levine et al., 2016; Nagabandi et al., 2017; Kurutach et al., 2018; Pong et al., 2018a) have shown strong improvements in sample efficiency.
Despite promising empirical findings, many theoretical properties of model-based deep reinforcement learning are not well understood. For example, how does the error of the estimated model affect the estimation of the value function and the planning? Can model-based RL algorithms be guaranteed to improve the policy monotonically and converge to a local maximum of the value function? How do we quantify the uncertainty in the dynamical models?
It is challenging to address these questions theoretically in the context of deep RL with continuous state and action spaces and nonlinear dynamical models. Due to the high dimensionality, learning models from observations in one part of the state space and extrapolating to another part sometimes involves a leap of faith. The uncertainty quantification of nonlinear parameterized dynamical models is difficult: even without the RL components, it is an active but wide-open research area. Prior work in model-based RL mostly quantifies uncertainty with either heuristics or simpler models (Moldovan et al., 2015; Xie et al., 2016; Deisenroth & Rasmussen, 2011a).
Previous theoretical work on model-based RL mostly focuses on either finite-state MDPs (Jaksch et al., 2010; Bartlett & Tewari, 2009; Fruit et al., 2018; Lakshmanan et al., 2015; Hinderer, 2005; Pirotta et al., 2015, 2013), or linear parametrization of the dynamics, policy, or value function (Abbasi-Yadkori & Szepesvári, 2011; Simchowitz et al., 2018; Dean et al., 2017; Sutton et al., 2012; Tamar et al., 2012), but not much on nonlinear models. Even with oracle prediction intervals (Footnote 2: We note that confidence intervals of the parameters are likely meaningless for over-parameterized neural network models.) or posterior estimation, to the best of our knowledge, there was no previous algorithm with convergence guarantees for model-based deep RL.
Towards addressing these challenges, the main contribution of this paper is to propose a novel algorithmic framework for model-based deep RL with theoretical guarantees. Our meta-algorithm (Algorithm 1) extends the optimism-in-face-of-uncertainty principle to nonlinear dynamical models in a way that requires no explicit uncertainty quantification of the dynamical models.
Let $V^{\pi}$ be the value function of a policy $\pi$ on the true environment, and let $\widehat{V}^{\pi}$ be the value function of the policy on the estimated model $\widehat{M}$. We design provable upper bounds, denoted by $D^{\pi,\widehat{M}}$, on how much the error can compound and divert the expected value $\widehat{V}^{\pi}$ of the imaginary rollouts from their real value $V^{\pi}$, in a neighborhood of some reference policy. Such upper bounds capture the intrinsic difference between the estimated and real dynamical model with respect to the particular reward function under consideration.
The discrepancy bounds naturally lead to a lower bound for the true value function:
(1.1) $V^{\pi} \ge \widehat{V}^{\pi} - D^{\pi,\widehat{M}}$
Our algorithm iteratively collects batches of samples from interactions with the environment, builds the lower bound above, and then maximizes it over both the dynamical model $\widehat{M}$ and the policy $\pi$. We can use any RL algorithm to optimize the lower bounds, because they will be designed to depend only on the sample trajectories from a fixed reference policy (as opposed to requiring new interactions with the policy iterate).
We show that the performance of the policy is guaranteed to monotonically increase, assuming the optimization within each iteration succeeds (see Theorem 3.1). To the best of our knowledge, this is the first theoretical guarantee of monotone improvement for model-based deep RL.
Readers may have realized that optimizing a robust lower bound is reminiscent of robust control and robust optimization. The distinction is that we optimistically and iteratively maximize the RHS of (1.1) jointly over the model and the policy. The iterative approach allows the algorithm to collect higher-quality trajectories adaptively, and the optimism in model optimization encourages exploration of the parts of the space that are not covered by the current discrepancy bounds.
To instantiate the meta-algorithm, we design a few valid discrepancy bounds in Section 4. In Section 4.1, we recover the $\ell_2$ norm-based model loss by imposing the additional assumption of a Lipschitz value function. The result suggests that a norm is preferred over the square of the norm. Indeed, in Section 6.2 we show experimentally that learning with the $\ell_2$ loss significantly outperforms the mean-squared error loss ($\ell_2^2$).
In Section 4.2, we design a discrepancy bound that is invariant to the representation of the state space. Here we measure the loss of the model by the difference between the value of the predicted next state and the value of the true next state. Such a loss function is shown to be invariant to one-to-one transformations of the state space. Thus we argue that the loss is an intrinsic measure for the model error without any information beyond observing the rewards. We also refine our bounds in Section A by utilizing some mathematical tools for measuring the difference between policies in $\chi^2$-divergence (instead of KL divergence or TV distance).
Our analysis also sheds light on the comparison between model-based RL and on-policy model-free RL algorithms such as policy gradient or TRPO (Schulman et al., 2015a). The RHS of equation (1.1) is likely to be a good approximator of $V^{\pi}$ in a larger neighborhood than the linear approximation of $V^{\pi}$ used in policy gradient is (see Remark 4.5).
Finally, inspired by our framework and analysis, we design a variant of model-based RL algorithms, Stochastic Lower Bounds Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance when only 1M samples are permitted on a range of continuous control benchmark tasks.
2 Notations and Preliminaries
We denote the state space by $\mathcal{S}$ and the action space by $\mathcal{A}$. A policy $\pi(\cdot \mid s)$ specifies the conditional distribution over the action space given a state $s$. A dynamical model $M(\cdot \mid s, a)$ specifies the conditional distribution of the next state given the current state $s$ and action $a$. We will use $M^\star$ globally to denote the unknown true dynamical model. Our target applications are problems with continuous state and action spaces, although the results apply to discrete state or action spaces as well. When the model is deterministic, $M(\cdot \mid s, a)$ is a Dirac measure. In this case, we use $M(s, a)$ to denote the unique value of $s'$ and view $M$ as a function from $\mathcal{S} \times \mathcal{A}$ to $\mathcal{S}$. Let $\mathcal{M}$ denote a (parameterized) family of models that we are interested in, and $\Pi$ denote a (parameterized) family of policies.
Let $S_0$ be the random variable for the initial state. Let $S_t^{\pi,M}$ denote the random variable of the state at step $t$ when we execute the policy $\pi$ on the dynamical model $M$ starting with $S_0$. Note that $S_0^{\pi,M} = S_0$ unless otherwise stated. We will omit the superscript when it is clear from the context. We use $A_t$ to denote the action at step $t$ similarly. We often use $\tau$ to denote the random variable for the trajectory $(S_0, A_0, \dots, S_t, A_t, \dots)$. Let $R(s, a)$ be the reward function at each step. We assume $R$ is known throughout the paper, although $R$ can also be considered as part of the model if unknown. Let $\gamma$ be the discount factor.
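Let $V^{\pi,M}$ be the value function on the model $M$ and policy $\pi$ defined as:
(2.1) $V^{\pi,M}(s) = \mathbb{E}_{\forall t \ge 0,\ A_t \sim \pi(\cdot \mid S_t),\ S_{t+1} \sim M(\cdot \mid S_t, A_t)}\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \,\middle|\, S_0 = s\right]$
We define $V^{\pi,M} = \mathbb{E}\left[V^{\pi,M}(S_0)\right]$ as the expected reward-to-go at step 0 (averaged over the random initial states). Our goal is to maximize the reward-to-go on the true dynamical model, that is, $V^{\pi,M^\star}$, over the policy $\pi$. For simplicity, throughout the paper, we set $\kappa = \gamma(1-\gamma)^{-1}$ since it occurs frequently in our equations. Every policy $\pi$ induces a distribution of states visited by policy $\pi$: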
Definition 2.1.
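For a policy $\pi$, define $\rho^{\pi,M}$ as the discounted distribution of the states visited by $\pi$ on $M$. Let $\rho^{\pi}$ be a shorthand for $\rho^{\pi,M^\star}$; we omit the superscript $M^\star$ throughout the paper. Concretely, we have $\rho^{\pi} = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \cdot p_{S_t^{\pi}}$.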
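The quantities above can be made concrete on a tiny tabular example. The following is a minimal sketch, assuming a two-state Markov chain in which the policy has already been folded into a state-to-state transition matrix `P` and a per-state expected reward `r`; the names `discounted_quantities`, `mu`, `P`, and `r` are illustrative and not from the paper. It computes a truncated version of the expected value and of the discounted state distribution, and checks the standard identity that the value equals the average reward under the discounted distribution divided by $1-\gamma$.

```python
def step_distribution(dist, P):
    """Push a state distribution one step forward: dist' = dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def discounted_quantities(mu, P, r, gamma, horizon=2000):
    """Return (value, rho) using sums truncated at `horizon` steps.

    value ~ E[sum_t gamma^t r(S_t)] with S_0 ~ mu
    rho   ~ (1 - gamma) * sum_t gamma^t * (distribution of S_t)
    """
    n = len(mu)
    dist = list(mu)
    rho = [0.0] * n
    value = 0.0
    weight = 1.0  # equals gamma ** t at step t
    for _ in range(horizon):
        value += weight * sum(dist[i] * r[i] for i in range(n))
        for i in range(n):
            rho[i] += (1.0 - gamma) * weight * dist[i]
        dist = step_distribution(dist, P)
        weight *= gamma
    return value, rho


# Hypothetical two-state example (all numbers are illustrative).
mu = [1.0, 0.0]                      # initial state distribution
P = [[0.9, 0.1], [0.2, 0.8]]         # transitions under a fixed policy
r = [0.0, 1.0]                       # expected reward in each state
gamma = 0.9

value, rho = discounted_quantities(mu, P, r, gamma)
kappa = gamma / (1.0 - gamma)        # the constant kappa used in the text
avg_reward_under_rho = sum(rho[i] * r[i] for i in range(2))
```

With these definitions, `value` and `avg_reward_under_rho / (1 - gamma)` agree up to truncation error, which is why bounds stated as expectations over the discounted state distribution translate directly into bounds on the value.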
3 Algorithmic Framework
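As mentioned in the introduction, towards optimizing $V^{\pi,M^\star}$ (Footnote 3: Note that in the introduction we used $V^{\pi}$ for simplicity, and in the rest of the paper we will make the dependency on $M^\star$ explicit.), our plan is to build a lower bound for $V^{\pi,M^\star}$ of the following type and optimize it iteratively:
(3.1) $V^{\pi,M^\star} \ge V^{\pi,\widehat{M}} - D(\widehat{M}, \pi)$
where $D(\widehat{M}, \pi) \in \mathbb{R}_{\ge 0}$ bounds from above the discrepancy between $V^{\pi,\widehat{M}}$ and $V^{\pi,M^\star}$. Building such an optimizable discrepancy bound globally that holds for all $\widehat{M}$ and $\pi$ turns out to be rather difficult, if not impossible. Instead, we aim to establish such a bound over the neighborhood of a reference policy $\pi_{\mathrm{ref}}$.
(R1) $V^{\pi,M^\star} \ge V^{\pi,\widehat{M}} - D_{\pi_{\mathrm{ref}},\delta}(\widehat{M}, \pi), \quad \forall \pi \text{ s.t. } d(\pi, \pi_{\mathrm{ref}}) \le \delta$
Here $d(\cdot, \cdot)$ is a function that measures the closeness of two policies, which will be chosen later in alignment with the choice of $D$. We will mostly omit the subscript $\delta$ in $D$ for simplicity in the rest of the paper. We will require our discrepancy bound to vanish when $\widehat{M}$ is an accurate model:
(R2) $\widehat{M} = M^\star \implies D_{\pi_{\mathrm{ref}}}(\widehat{M}, \pi) = 0, \quad \forall \pi, \pi_{\mathrm{ref}}$
The third requirement for the discrepancy bound is that it can be estimated and optimized in the sense that
(R3) $D_{\pi_{\mathrm{ref}}}(\widehat{M}, \pi)$ is of the form $\mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}, M^\star}\left[f(\widehat{M}, \pi, \tau)\right]$
where $f$ is a known differentiable function. We can estimate such discrepancy bounds for every $\pi$ in the neighborhood of $\pi_{\mathrm{ref}}$ by sampling empirical trajectories $\tau^{(1)}, \dots, \tau^{(n)}$ from executing policy $\pi_{\mathrm{ref}}$ on the real environment $M^\star$ and computing the average of the $f(\widehat{M}, \pi, \tau^{(i)})$'s. We insist that the expectation cannot be over the randomness of trajectories from $\pi$ on $M^\star$, because then we would have to re-sample trajectories for every possible $\pi$ encountered.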
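For example, assuming the dynamical models are all deterministic, one of the valid discrepancy bounds (under some strong assumptions) that we will prove in Section 4 is a multiple of the prediction error of $\widehat{M}$ on the trajectories from $\pi_{\mathrm{ref}}$:
(3.2) $D_{\pi_{\mathrm{ref}}}(\widehat{M}, \pi) = L \cdot \mathbb{E}_{S_0, \dots, S_t, \dots \sim \pi_{\mathrm{ref}}, M^\star}\left[\|\widehat{M}(S_t, A_t) - S_{t+1}\|\right]$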
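The requirement (R3) says the bound must be estimable from trajectories of the reference policy alone. As a minimal sketch of what such an estimate could look like for a norm-based bound of the kind described above, the code below averages the $\ell_2$ prediction error of a candidate deterministic model over recorded transitions and scales it by a constant. The helper names (`estimate_discrepancy`, `collect_trajectory`), the one-dimensional toy dynamics, and the value of the constant are assumptions made purely for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def estimate_discrepancy(model, trajectories, lipschitz_const):
    """Monte Carlo estimate of L * E[ ||model(s_t, a_t) - s_{t+1}|| ].

    `trajectories` is a list of lists of (s, a, s_next) transitions that
    were collected by the reference policy on the real environment.
    """
    total, count = 0.0, 0
    for trajectory in trajectories:
        for s, a, s_next in trajectory:
            prediction = model(s, a)
            total += l2_norm([p - q for p, q in zip(prediction, s_next)])
            count += 1
    return lipschitz_const * total / count


def collect_trajectory(env_step, policy, s0, length):
    """Roll out `policy` on `env_step` and record the transitions."""
    transitions, s = [], s0
    for _ in range(length):
        a = policy(s)
        s_next = env_step(s, a)
        transitions.append((s, a, s_next))
        s = s_next
    return transitions


# Hypothetical 1-D system: the "real" dynamics and two candidate models.
def true_dynamics(s, a):
    return [0.9 * s[0] + a[0]]


def wrong_model(s, a):
    return [0.8 * s[0] + a[0]]


def reference_policy(s):
    return [-0.5 * s[0]]


data = [collect_trajectory(true_dynamics, reference_policy, [1.0], 3)]
bound_exact = estimate_discrepancy(true_dynamics, data, lipschitz_const=2.0)
bound_wrong = estimate_discrepancy(wrong_model, data, lipschitz_const=2.0)
```

Note that the estimate uses the norm itself rather than its square, and that evaluating a new candidate model only requires re-scoring the stored transitions, not new interaction with the environment.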
Suppose we can establish such a discrepancy bound $D$ (and the distance function $d$) with properties (R1), (R2), and (R3), which will be the main focus of Section 4. Then we can devise the following meta-algorithm (Algorithm 1). We iteratively optimize the lower bound over the policy $\pi_{k+1}$ and the model $M_{k+1}$, subject to the constraint that the policy is not very far from the reference policy $\pi_k$ obtained in the previous iteration. For simplicity, we only state the population version with the exact computation of $D_{\pi_{\mathrm{ref}}}(\widehat{M}, \pi)$, though empirically it is estimated by sampling trajectories.
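Algorithm 1 (meta-algorithm)
Inputs: Initial policy $\pi_0$. Discrepancy bound $D$ and distance function $d$ that satisfy equation (R1) and (R2).
For $k = 0$ to $T$:
(3.3) $\pi_{k+1}, M_{k+1} = \operatorname{argmax}_{\pi \in \Pi,\ M \in \mathcal{M}} \ V^{\pi,M} - D_{\pi_k,\delta}(M, \pi)$
(3.4) $\text{s.t. } d(\pi, \pi_k) \le \delta$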
We first remark that the discrepancy bound in the objective plays the role of learning the dynamical model, by ensuring that the model fits the sampled trajectories. For example, using the discrepancy bound in the form of equation (3.2), we roughly recover the standard objective for model learning, with the caveat that we only have the norm instead of the square of the norm as in MSE. Such a distinction turns out to be empirically important for better performance (see Section 6.2).
Second, our algorithm can be viewed as an extension of the optimism-in-face-of-uncertainty (OFU) principle to the nonlinear parameterized setting: jointly optimizing $M$ and $\pi$ encourages the algorithm to choose the most optimistic model among those that can be used to accurately estimate the value function. (See Jaksch et al., 2010; Bartlett & Tewari, 2009; Fruit et al., 2018; Lakshmanan et al., 2015; Pirotta et al., 2015, 2013 and references therein for the OFU principle in finite-state MDPs.) The main novelty here is to optimize the lower bound directly, without explicitly building any confidence intervals, which turns out to be challenging in deep learning. In other words, the uncertainty is measured straightforwardly by how the error would affect the estimation of the value function.
Third, the maximization of $V^{\pi,M}$, when $M$ is fixed, can be solved by any model-free RL algorithm with $M$ as the environment, without querying any real samples. Optimizing $V^{\pi,M}$ jointly over $\pi, M$ can also be viewed as another RL problem with an extended action space using the known “extended MDP technique”. See (Jaksch et al., 2010, section 3.1) for details.
Our main theorem shows formally that the policy performance in the real environment is non-decreasing under the assumption that the real dynamics belongs to our parameterized family $\mathcal{M}$. (Footnote 4: We note that such an assumption, though restrictive, may not be very far from reality: optimistically speaking, we only need to approximate the dynamical model accurately on the trajectories of the optimal policy. This might be much easier than approximating the dynamical model globally.)
Theorem 3.1.
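Suppose that $M^\star \in \mathcal{M}$, that $D$ and $d$ satisfy equation (R1) and (R2), and that the optimization problem in equation (3.3) is solvable at each iteration. Then, Algorithm 1 produces a sequence of policies $\pi_0, \dots, \pi_T$ with monotonically increasing values:
(3.5) $V^{\pi_0,M^\star} \le V^{\pi_1,M^\star} \le \cdots \le V^{\pi_T,M^\star}$
Moreover, as $k \to \infty$, the value $V^{\pi_k,M^\star}$ converges to some $V^{\bar{\pi},M^\star}$, where $\bar{\pi}$ is a local maximum of $V^{\pi,M^\star}$ in domain $\Pi$.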
The theorem above can also be extended to a finite sample complexity result with standard concentration inequalities. We show in Theorem G.2 that we can obtain an approximate local maximum in iterations with sample complexity (in the number of trajectories) that is polynomial in dimension and accuracy and is logarithmic in certain smoothness parameters.
Proof of Theorem 3.1.
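Since $D$ and $d$ satisfy equation (R1), we have that
$V^{\pi_{k+1},M^\star} \ge V^{\pi_{k+1},M_{k+1}} - D_{\pi_k}(M_{k+1}, \pi_{k+1})$
By the definition that $\pi_{k+1}$ and $M_{k+1}$ are the optimizers of equation (3.3), and since $M^\star \in \mathcal{M}$ makes $(\pi_k, M^\star)$ a feasible point, we have that
$V^{\pi_{k+1},M_{k+1}} - D_{\pi_k}(M_{k+1}, \pi_{k+1}) \ge V^{\pi_k,M^\star} - D_{\pi_k}(M^\star, \pi_k) = V^{\pi_k,M^\star}$ (by equation R2)
Combining the two equations above, we complete the proof of equation (3.5).
For the second part of the theorem, by compactness, a subsequence of $\pi_k$ converges to some $\bar{\pi}$. By monotonicity we have $V^{\pi_k,M^\star} \le V^{\bar{\pi},M^\star}$ for every $k \ge 0$. For the sake of contradiction, assume $\bar{\pi}$ is not a local maximum; then in the neighborhood of $\bar{\pi}$ there exists $\pi'$ such that $V^{\pi',M^\star} > V^{\bar{\pi},M^\star}$ and $d(\bar{\pi}, \pi') < \delta/2$. Let $t$ be such that $\pi_t$ is in the $\delta/2$-neighborhood of $\bar{\pi}$. Then we see that $(\pi', M^\star)$ is a better solution than $(\pi_{t+1}, M_{t+1})$ for the optimization problem (3.3) in iteration $t$ because $V^{\pi',M^\star} > V^{\bar{\pi},M^\star} \ge V^{\pi_{t+1},M^\star} \ge V^{\pi_{t+1},M_{t+1}} - D_{\pi_t}(M_{t+1}, \pi_{t+1})$. (Here the last inequality uses equation (R1) with $\pi_t$ as $\pi_{\mathrm{ref}}$.) The fact that $(\pi', M^\star)$ is a strictly better solution than $(\pi_{t+1}, M_{t+1})$ contradicts the fact that $(\pi_{t+1}, M_{t+1})$ is defined to be the optimal solution of (3.3). Therefore $\bar{\pi}$ is a local maximum and we complete the proof.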
∎
4 Discrepancy Bounds Design
In this section, we design discrepancy bounds that can provably satisfy the requirements (R1), (R2), and (R3). We design increasingly stronger discrepancy bounds from Section 4.1 to Section A.
4.1 Norm-based prediction error bounds
In this subsection, we assume the dynamical model $M^\star$ is deterministic and we also learn with a deterministic model $\widehat{M}$. Under assumptions defined below, we derive a discrepancy bound $D$ of the form $\|\widehat{M}(S, A) - M^\star(S, A)\|$ averaged over the observed state-action pairs $(S, A)$ on the dynamical model $M^\star$. This suggests that the norm is a better metric than the mean-squared error for learning the model, which is empirically shown in Section 6.2. Through the derivation, we will also introduce a telescoping lemma, which serves as the main building block towards other finer discrepancy bounds.
We make the (strong) assumption that the value function $V^{\pi,\widehat{M}}$ on the estimated dynamical model is $L$-Lipschitz w.r.t. some norm $\|\cdot\|$ in the sense that
(4.1) $\forall s, s' \in \mathcal{S}, \quad \left|V^{\pi,\widehat{M}}(s) - V^{\pi,\widehat{M}}(s')\right| \le L \cdot \|s - s'\|$
In other words, nearby starting points should give similar reward-to-go under the same policy $\pi$. We note that not every real environment has this property, let alone the estimated dynamical models. However, once the real dynamical model induces a Lipschitz value function, we may penalize the Lipschitzness of the value function of the estimated model during training.
We start off with a lemma showing that the expected prediction error is an upper bound of the discrepancy between the real and imaginary values.
Lemma 4.1.
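Suppose $V^{\pi,\widehat{M}}$ is $L$-Lipschitz (in the sense of (4.1)). Recall $\kappa = \gamma(1-\gamma)^{-1}$. Then
(4.2) $\left|V^{\pi,\widehat{M}} - V^{\pi,M^\star}\right| \le \kappa L \, \mathbb{E}_{S \sim \rho^{\pi},\ A \sim \pi(\cdot \mid S)}\left[\|\widehat{M}(S, A) - M^\star(S, A)\|\right]$
However, the RHS of equation (4.2) cannot serve as a discrepancy bound because it does not satisfy requirement (R3): to optimize it over $\pi$ we need to collect samples from $\rho^{\pi}$, the state distribution of the policy $\pi$ on the real model $M^\star$, for every iterate $\pi$. The main proposition of this subsection, stated next, shows that for every $\pi$ in the neighborhood of a reference policy $\pi_{\mathrm{ref}}$, we can replace the distribution $\rho^{\pi}$ by a fixed distribution $\rho^{\pi_{\mathrm{ref}}}$ while incurring only a higher-order approximation error. We use the expected KL divergence between two policies $\pi$ and $\pi_{\mathrm{ref}}$ to define the neighborhood:
(4.3) $d^{\mathrm{KL}}(\pi, \pi_{\mathrm{ref}}) = \mathbb{E}_{S \sim \rho^{\pi}}\left[KL\left(\pi(\cdot \mid S), \pi_{\mathrm{ref}}(\cdot \mid S)\right)^{1/2}\right]$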
Proposition 4.2.
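In the same setting as Lemma 4.1, assume in addition that $\pi$ is close to a reference policy $\pi_{\mathrm{ref}}$ in the sense that $d^{\mathrm{KL}}(\pi, \pi_{\mathrm{ref}}) \le \delta$, and that the states in $\mathcal{S}$ are uniformly bounded in the sense that $\|s\| \le B$ for all $s \in \mathcal{S}$. Then,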
(4.4) 
In a benign scenario, the second term in the RHS of equation (4.4) should be dominated by the first term when the neighborhood size $\delta$ is sufficiently small. Moreover, the term $B$ can also be replaced by $\max_{S,A} \|\widehat{M}(S, A) - M^\star(S, A)\|$ (see the proof, which is deferred to Section C). The dependency on $\kappa$ may not be tight for real-life instances, but we note that most analyses of a similar nature lose the additional $\kappa$ factor (Schulman et al., 2015a; Achiam et al., 2017), and it is inevitable in the worst case.
A telescoping lemma
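Towards proving Proposition 4.2 and deriving stronger discrepancy bounds, we define the following quantity that captures the discrepancy between $\widehat{M}$ and $M^\star$ on a single state-action pair $(s, a)$.
(4.5) $G^{\pi,\widehat{M}}(s, a) = \mathbb{E}_{\hat{s}' \sim \widehat{M}(\cdot \mid s, a)} V^{\pi,\widehat{M}}(\hat{s}') - \mathbb{E}_{s' \sim M^\star(\cdot \mid s, a)} V^{\pi,\widehat{M}}(s')$
Note that if $M^\star, \widehat{M}$ are deterministic, then $G^{\pi,\widehat{M}}(s, a) = V^{\pi,\widehat{M}}(\widehat{M}(s, a)) - V^{\pi,\widehat{M}}(M^\star(s, a))$. We give a telescoping lemma that decomposes the discrepancy between $V^{\pi,\widehat{M}}$ and $V^{\pi,M^\star}$ into the expected single-step discrepancy $G$.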
Lemma 4.3.
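[Telescoping Lemma] Recall that $\kappa = \gamma(1-\gamma)^{-1}$. For any policy $\pi$ and dynamical models $M^\star, \widehat{M}$, we have that
(4.6) $V^{\pi,\widehat{M}} - V^{\pi,M^\star} = \kappa \, \mathbb{E}_{S \sim \rho^{\pi},\ A \sim \pi(\cdot \mid S)}\left[G^{\pi,\widehat{M}}(S, A)\right]$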
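The telescoping identity can be checked numerically on a small tabular example. The sketch below is an illustration under stated assumptions, not the paper's proof: it uses a two-state chain where the policy is already folded into the transition matrices `P_star` (real model) and `P_hat` (estimated model), a per-state reward `r`, and the form of the identity in which the value gap equals $\kappa$ times the expectation, under the discounted state distribution on the real model, of the one-step difference in imagined next-state value. All names and numbers are hypothetical.

```python
def mat_vec(P, v):
    """Matrix-vector product P @ v for nested lists."""
    return [sum(P[i][j] * v[j] for j in range(len(v))) for i in range(len(P))]


def vec_mat(d, P):
    """Row-vector times matrix: d @ P."""
    return [sum(d[i] * P[i][j] for i in range(len(d))) for j in range(len(P[0]))]


def policy_value(P, r, gamma, iters=3000):
    """Fixed-point iteration for V = r + gamma * P V."""
    v = [0.0] * len(r)
    for _ in range(iters):
        pv = mat_vec(P, v)
        v = [r[i] + gamma * pv[i] for i in range(len(r))]
    return v


def discounted_state_dist(mu, P, gamma, iters=3000):
    """rho = (1 - gamma) * sum_t gamma^t * mu P^t (truncated)."""
    dist, rho, weight = list(mu), [0.0] * len(mu), 1.0
    for _ in range(iters):
        rho = [rho[i] + (1.0 - gamma) * weight * dist[i] for i in range(len(mu))]
        dist = vec_mat(dist, P)
        weight *= gamma
    return rho


gamma = 0.9
kappa = gamma / (1.0 - gamma)
mu = [0.5, 0.5]                         # initial state distribution
r = [1.0, 0.0]                          # per-state reward under the policy
P_star = [[0.7, 0.3], [0.4, 0.6]]       # real dynamics (policy folded in)
P_hat = [[0.6, 0.4], [0.5, 0.5]]        # estimated dynamics

v_hat = policy_value(P_hat, r, gamma)   # value on the estimated model
v_star = policy_value(P_star, r, gamma) # value on the real model
lhs = sum(mu[i] * (v_hat[i] - v_star[i]) for i in range(2))

# One-step discrepancy: imagined value of the predicted next state minus
# imagined value of the real next state, for each current state.
hat_next = mat_vec(P_hat, v_hat)
star_next = mat_vec(P_star, v_hat)
G = [hat_next[i] - star_next[i] for i in range(2)]

rho = discounted_state_dist(mu, P_star, gamma)  # visited on the REAL model
rhs = kappa * sum(rho[i] * G[i] for i in range(2))
```

The two sides `lhs` and `rhs` agree up to numerical error. The useful feature is that the right-hand side only involves the value function on the estimated model, evaluated at states visited on the real model.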
The proof is reminiscent of the telescoping expansion in Kakade & Langford (2002) (cf. Schulman et al. (2015a)) for characterizing the value difference of two policies, but we apply it to deal with the discrepancy between models. The details are deferred to Section B. With the telescoping lemma (Lemma 4.3), Lemma 4.1 follows straightforwardly from the Lipschitzness of the imaginary value function. Proposition 4.2 follows from the fact that $\rho^{\pi}$ and $\rho^{\pi_{\mathrm{ref}}}$ are close. We defer the proof to Appendix C.
4.2 Representation-invariant Discrepancy Bounds
The main limitation of the norm-based discrepancy bounds in the previous subsection is that they depend on the state representation. Let $\mathcal{T}$ be a one-to-one map from the state space $\mathcal{S}$ to some other space $\mathcal{S}'$, and for simplicity of this discussion let us assume the model $M$ is deterministic. If we represent every state $s$ by its transformed representation $\mathcal{T}s$, then the transformed model $M^{\mathcal{T}}$ defined as $M^{\mathcal{T}}(s, a) \triangleq \mathcal{T} M(\mathcal{T}^{-1}s, a)$, together with the transformed reward $R^{\mathcal{T}}(s, a) \triangleq R(\mathcal{T}^{-1}s, a)$ and transformed policy $\pi^{\mathcal{T}}(s) \triangleq \pi(\mathcal{T}^{-1}s)$, is equivalent to the original set of model, reward, and policy in terms of performance (Lemma C.1). Thus such a transformation $\mathcal{T}$ is not identifiable from only observing the reward. However, the norm in the state space is a notion that depends on the hidden choice of the transformation $\mathcal{T}$. (Footnote 5: That said, in many cases the reward function itself is known, and the states have physical meanings, and therefore we may be able to use domain knowledge to figure out the best norm.)
Another limitation is that the loss for model learning should also depend on the state itself instead of only on the difference $\widehat{M}(S, A) - M^\star(S, A)$. It is possible that when $S$ is at a critical position, the prediction needs to be highly accurate so that the model can be useful for planning. On the other hand, at other states, the dynamical model is allowed to make bigger mistakes because they are not essential to the reward.
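We propose the following discrepancy bound towards addressing the limitations above. Recall the definition of $G^{\pi,\widehat{M}}(s, a) = V^{\pi,\widehat{M}}(\widehat{M}(s, a)) - V^{\pi,\widehat{M}}(M^\star(s, a))$, which measures the difference between $\widehat{M}(s, a)$ and $M^\star(s, a)$ according to their imaginary rewards. We construct a discrepancy bound using the absolute value of $G$. Let us define $\epsilon_1$ and $\epsilon_{\max}$ as the average of $|G^{\pi,\widehat{M}}|$ and its maximum: $\epsilon_1 = \mathbb{E}_{S \sim \rho^{\pi_{\mathrm{ref}}}}\left[|G^{\pi,\widehat{M}}(S)|\right]$ and $\epsilon_{\max} = \max_{S} |G^{\pi,\widehat{M}}(S)|$, where $G^{\pi,\widehat{M}}(S) = \mathbb{E}_{A \sim \pi}\left[G^{\pi,\widehat{M}}(S, A)\right]$. We will show that the following discrepancy bound $D^{G}_{\pi_{\mathrm{ref}}}(\widehat{M}, \pi)$ satisfies the properties (R1) and (R2).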
(4.7) 
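The contrast between a norm-based error and a value-based error under a change of state representation can be illustrated numerically. The following is a minimal sketch under assumptions chosen for illustration only: a one-dimensional deterministic system, a rescaling of the state by a factor of 10 as the one-to-one map, and hypothetical function names (`rollout_value`, `transformed`, `value_gap`). It computes the prediction error and the value-based one-step discrepancy (imagined value of the predicted next state minus imagined value of the real next state) in both representations.

```python
def rollout_value(model, policy, reward, s0, gamma=0.9, horizon=200):
    """Discounted return of a deterministic policy on a deterministic model."""
    s, total, weight = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += weight * reward(s, a)
        s = model(s, a)
        weight *= gamma
    return total


SCALE = 10.0  # illustrative one-to-one map: x = SCALE * s


def to_new(s):
    return SCALE * s


def to_old(x):
    return x / SCALE


def m_star(s, a):        # hypothetical real dynamics
    return 0.9 * s + a


def m_hat(s, a):         # hypothetical estimated dynamics
    return 0.8 * s + a


def policy(s):
    return -0.5 * s


def reward(s, a):
    return -s * s


def transformed(model):
    """Model acting on the rescaled state representation."""
    return lambda x, a: to_new(model(to_old(x), a))


def policy_t(x):
    return policy(to_old(x))


def reward_t(x, a):
    return reward(to_old(x), a)


def value_gap(est, real, pol, rew, s, a):
    """Imagined value of est's next state minus that of real's next state."""
    return (rollout_value(est, pol, rew, est(s, a))
            - rollout_value(est, pol, rew, real(s, a)))


s = 1.0
a = policy(s)
norm_err = abs(m_hat(s, a) - m_star(s, a))
norm_err_t = abs(transformed(m_hat)(to_new(s), a)
                 - transformed(m_star)(to_new(s), a))
gap = value_gap(m_hat, m_star, policy, reward, s, a)
gap_t = value_gap(transformed(m_hat), transformed(m_star),
                  policy_t, reward_t, to_new(s), a)
```

In this toy setup the norm-based error grows by the rescaling factor, while the value-based gap is the same in both representations up to floating-point error, which is the sense in which a value-based loss does not depend on the hidden choice of representation.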
Proposition 4.4.
The proof follows from the telescoping lemma (Lemma 4.3) and is deferred to Section C. We remark that the first term can in principle be estimated and optimized approximately: the expectation can be replaced by empirical samples from $\rho^{\pi_{\mathrm{ref}}}$, and $G^{\pi,\widehat{M}}$ is an analytical function of $\pi$ and $\widehat{M}$ when they are both deterministic, and therefore can be optimized by back-propagation through time (BPTT). (When $\pi$ and $\widehat{M}$ are stochastic with a reparameterizable noise such as a Gaussian distribution (Kingma & Welling, 2013), we can also use back-propagation to estimate the gradient.) The second term in equation (4.7) is difficult to optimize because it involves the maximum. However, it can in theory be considered a second-order term because $\delta$ can be chosen to be a fairly small number. (In the refined bound in Section A, the dependency on $\delta$ is even milder.)
Remark 4.5.
Proposition 4.4 intuitively suggests a technical reason why the model-based approach can be more sample-efficient than policy-gradient-based algorithms such as TRPO or PPO (Schulman et al., 2015a, 2017). The approximation error of $V^{\pi,M^\star}$ in the model-based approach decreases as the model error decreases or the neighborhood size $\delta$ decreases, whereas the approximation error in policy gradient depends only linearly on the neighborhood size (Schulman et al., 2015a). In other words, model-based algorithms can trade model accuracy for a larger neighborhood size, and therefore the convergence can be faster (in terms of outer iterations). This is consistent with our empirical observation that the model can be accurate in a decent neighborhood of the current policy, so that the constraint (3.4) can be empirically dropped. We also refine our bounds in Section A, where the discrepancy bound is proved to decay faster in $\delta$.
5 Additional Related work
Model-based reinforcement learning is expected to require fewer samples than model-free algorithms (Deisenroth et al., 2013) and has been successfully applied to robotics both in simulation and in the real world (Deisenroth & Rasmussen, 2011b; Morimoto & Atkeson, 2003; Deisenroth et al., 2011) using dynamical models ranging from Gaussian processes (Deisenroth & Rasmussen, 2011b; Ko & Fox, 2009), time-varying linear models (Levine & Koltun, 2013; Lioutikov et al., 2014; Levine & Abbeel, 2014; Yip & Camarillo, 2014), and mixtures of Gaussians (Khansari-Zadeh & Billard, 2011), to neural networks (Hunt et al., 1992; Nagabandi et al., 2017; Kurutach et al., 2018; Tangkaratt et al., 2014; Sanchez-Gonzalez et al., 2018; Pascanu et al., 2017). In particular, the work of Kurutach et al. (2018) uses an ensemble of neural networks to learn the dynamical model, and significantly reduces the sample complexity compared to model-free approaches. The work of Chua et al. (2018) makes further improvement by using a probabilistic model ensemble. Clavera et al. (2018) extended this method with meta-policy optimization and improved the robustness to model error. In contrast, we focus on theoretical understanding of model-based RL and the design of new algorithms, and our experiments use a single neural network to estimate the dynamical model.
Our discrepancy bound in Section 4 is closely related to the work of Farahmand et al. (2017) on the value-aware model loss. Our approach differs from it in three details: a) we use the absolute value of the value difference instead of the squared difference; b) we use the imaginary value function from the estimated dynamical model to define the loss, which makes the loss purely a function of the estimated model and the policy; c) we show that the iterative algorithm, using the loss function as a building block, can converge to a local maximum, partly because of the particular choices made in a) and b). Asadi et al. (2018) also study discrepancy bounds under a Lipschitz condition on the MDP.
Prior work explores a variety of ways of combining model-free and model-based ideas to achieve the best of the two methods (Sutton, 1991, 1990; Racanière et al., 2017; Mordatch et al., 2016; Sun et al., 2018). For example, estimated models (Levine & Koltun, 2013; Gu et al., 2016; Kalweit & Boedecker, 2017) are used to enrich the replay buffer in model-free off-policy RL. Pong et al. (2018b) propose goal-conditioned value functions trained by model-free algorithms and use them for model-based control. Feinberg et al. (2018) and Buckman et al. (2018) use dynamical models to improve the estimation of the value functions in model-free algorithms.
On the control theory side, Dean et al. (2018, 2017) provide strong finite sample complexity bounds for solving the linear quadratic regulator using a model-based approach. Boczar et al. (2018) provide finite-data guarantees for the “coarse-ID control” pipeline, which is composed of a system identification step followed by a robust controller synthesis procedure. Our method is inspired by the general idea of maximizing a lower bound of the reward in (Dean et al., 2017). By contrast, our work applies to nonlinear dynamical systems. Our algorithms also estimate the models iteratively based on trajectory samples from the learned policies.
Strong model-based and model-free sample complexity bounds have been achieved in the tabular case (finite state space). We refer the readers to (Kakade et al., 2018; Dann et al., 2017; Szita & Szepesvári, 2010; Kearns & Singh, 2002; Jaksch et al., 2010; Agrawal & Jia, 2017) and the references therein. Our work focuses on continuous and high-dimensional state spaces (though the results also apply to the tabular case).
Another line of work in model-based reinforcement learning is to learn a dynamical model in a hidden representation space, which is especially necessary for pixel state spaces (Kakade et al., 2018; Dann et al., 2017; Szita & Szepesvári, 2010; Kearns & Singh, 2002; Jaksch et al., 2010). Srinivas et al. (2018) show the possibility of learning an abstract transition model to imitate an expert policy. Oh et al. (2017) learn the hidden state of a dynamical model to predict the value of future states and apply RL or planning on top of it. Serban et al. (2018) and Ha & Schmidhuber (2018) learn a bottleneck representation of the states. Our framework can potentially be combined with this line of research.
6 Practical Implementation and Experiments
6.1 Practical implementation
We design a variant of model-based RL algorithms, Stochastic Lower Bounds Optimization (SLBO), as a simplification of our framework. First, we remove the constraint (3.4). Second, we stop the gradient w.r.t. $M_\phi$ (but not $\pi_\theta$) from the occurrence of $M_\phi$ in $V^{\pi_\theta, M_\phi}$ in equation (3.3) (and thus our practical implementation is not optimism-driven).
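Extending the discrepancy bound in Section 4.1, we use a multi-step prediction loss with the $\ell_2$ norm for learning the models. For a state $s_t$ and action sequence $a_{t:t+h}$, we define the $h$-step prediction $\hat{s}_{t+h}$ recursively as $\hat{s}_t = s_t$ and, for $h \ge 0$, $\hat{s}_{t+h+1} = \widehat{M}_\phi(\hat{s}_{t+h}, a_{t+h})$. The $H$-step loss is then defined as
(6.1) $\mathcal{L}_\phi^{(H)}\left((s_{t:t+H}, a_{t:t+H}); \phi\right) = \frac{1}{H}\sum_{i=1}^{H} \left\|(\hat{s}_{t+i} - \hat{s}_{t+i-1}) - (s_{t+i} - s_{t+i-1})\right\|_2$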
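A similar loss is also used in Nagabandi et al. (2017) for validation. We note that, motivated by the theory in Section 4.1, we use the $\ell_2$ norm instead of the square of the $\ell_2$ norm. The loss function we attempt to optimize at iteration $k$ is thus (Footnote 6: This is technically not a well-defined mathematical objective. The sg operation means identity when the function is evaluated, whereas when computing the update, $\mathrm{sg}(M_\phi)$ is considered fixed.)
(6.2) $\max_{\phi, \theta} \ V^{\pi_\theta,\, \mathrm{sg}(\widehat{M}_\phi)} - \lambda \, \mathbb{E}_{(s_{t:t+H}, a_{t:t+H}) \sim \pi_k, M^\star}\left[\mathcal{L}_\phi^{(H)}\left((s_{t:t+H}, a_{t:t+H}); \phi\right)\right]$
where $\lambda$ is a tunable parameter and sg denotes the stop gradient operation.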
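To make the multi-step prediction loss concrete, here is a minimal sketch in plain Python. It assumes one particular form of the loss: the model is unrolled from the first observed state using the recorded actions, and at each step the $\ell_2$ norm of the difference between the predicted state increment and the observed state increment is accumulated and then averaged over the horizon. The function name `multi_step_loss` and the toy one-dimensional dynamics are hypothetical, and no gradient computation is shown.

```python
import math


def multi_step_loss(model, states, actions):
    """Average l2 error of predicted state increments over an H-step segment.

    `states` holds H + 1 observed states and `actions` holds H actions.
    The model is unrolled on its own predictions, starting from states[0].
    """
    horizon = len(actions)
    pred_prev = list(states[0])          # the 0-step prediction is s_t itself
    total = 0.0
    for i in range(horizon):
        pred_next = model(pred_prev, actions[i])
        # Compare the predicted increment with the observed increment.
        diff = [
            (pn - pp) - (sn - sp)
            for pn, pp, sn, sp in zip(pred_next, pred_prev,
                                      states[i + 1], states[i])
        ]
        total += math.sqrt(sum(d * d for d in diff))  # norm, not squared norm
        pred_prev = pred_next
    return total / horizon


# Hypothetical 1-D example: data generated by `true_dynamics`.
def true_dynamics(s, a):
    return [0.9 * s[0] + a[0]]


def wrong_model(s, a):
    return [0.8 * s[0] + a[0]]


actions = [[0.1], [0.1]]
states = [[1.0]]
for act in actions:
    states.append(true_dynamics(states[-1], act))

loss_true = multi_step_loss(true_dynamics, states, actions)
loss_wrong = multi_step_loss(wrong_model, states, actions)
```

Because the model is fed its own predictions, errors made early in the segment influence later terms, which is what distinguishes a multi-step loss from a sum of independent one-step losses.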
We note that the term $V^{\pi_\theta,\, \mathrm{sg}(\widehat{M}_\phi)}$ depends on both the parameter $\theta$ and the parameter $\phi$, but there is no gradient passed through $\phi$, whereas $\mathcal{L}_\phi^{(H)}$ depends only on $\phi$. We optimize equation (6.2) by alternately maximizing $V^{\pi_\theta,\, \mathrm{sg}(\widehat{M}_\phi)}$ and minimizing $\mathcal{L}_\phi^{(H)}$: for the former, we use TRPO with samples from the estimated dynamical model $\widehat{M}_\phi$ (by treating $\widehat{M}_\phi$ as a fixed simulator), and for the latter we use standard stochastic gradient methods. Algorithm 2 gives pseudocode for the algorithm. The $n_{\mathrm{model}}$ and $n_{\mathrm{policy}}$ iterations are used to balance the number of steps of TRPO and Adam updates within the loop indexed by $n_{\mathrm{inner}}$. (Footnote 7: In principle, to balance the number of steps, it suffices to take one of $n_{\mathrm{model}}$ and $n_{\mathrm{policy}}$ to be 1. However, empirically we found the optimal balance is achieved with larger $n_{\mathrm{model}}$ and $n_{\mathrm{policy}}$, possibly due to complicated interactions between the two optimization problems.)
Power of stochasticity and connection to standard MB RL: We identify the main advantage of our algorithm over standard model-based RL algorithms to be that we alternate the updates of the model and the policy within an outer iteration. By contrast, most of the existing model-based RL methods only optimize the models once (for a lot of steps) after collecting a batch of samples (see Algorithm 3 for an example). The stochasticity introduced by the alternation with stochastic samples seems to dramatically reduce the overfitting (of the policy to the estimated dynamical model), in a way similar to how SGD regularizes ordinary supervised training. (Footnote 8: Similar stochasticity can potentially be obtained by an extreme hyperparameter choice of the standard MB RL algorithm: in each outer iteration of Algorithm 3, we only sample a very small number of trajectories and take a few model updates and policy updates. We argue our interpretation of stochastic optimization of the lower bound (6.2) is more natural in that it reveals the regularization from stochastic optimization.) Another way to view the algorithm is that the models obtained from line 7 of Algorithm 2 at different inner iterations serve as an ensemble of models. We do believe that a cleaner and easier instantiation of our framework (with optimism) exists, and the current version, though performing very well, is not necessarily the best implementation.
Entropy regularization: An additional component we apply to SLBO is the commonly adopted entropy regularization in policy gradient methods (Williams & Peng, 1991; Mnih et al., 2016), which was found to significantly boost the performance in our experiments (ablation study in Appendix F.5). Specifically, an additional entropy term is added to the objective function in TRPO. We hypothesize that the entropy bonus helps exploration, diversifies the collected data, and thus prevents overfitting.
6.2 Experimental Results
We evaluate our algorithm SLBO (Algorithm 2) on five continuous control tasks from rllab (Duan et al., 2016), including Swimmer, Half Cheetah, Humanoid, Ant, Walker. All environments that we test have a maximum horizon of 500, which is longer than most of the existing modelbased RL work (Nagabandi et al., 2017; Kurutach et al., 2018). (Environments with longer horizons are commonly harder to train.) More details can be found in Appendix F.1.
Baselines. We compare our algorithm with 3 other algorithms: (1) Soft Actor-Critic (SAC) (Haarnoja et al., 2018), the state-of-the-art model-free off-policy algorithm in sample efficiency; (2) Trust-Region Policy Optimization (TRPO) (Schulman et al., 2015a), a policy-gradient-based algorithm; and (3) Model-Based TRPO, a standard model-based algorithm described in Algorithm 3. Details of these algorithms can be found in Appendix F.4. (Footnote 9: We did not have the chance to implement the competitive random search algorithms in (Mania et al., 2018) yet, although our test performance with episode length 500 is higher than theirs with episode length 1000 on Half Cheetah (3950 by ours vs 2345 by theirs) and Walker (3650 by ours vs 894 by theirs).)
The results are shown in Figure 1. Our algorithm converges faster (in number of samples) than all the baseline algorithms while achieving better final performance with 1M samples. Specifically, we mark the model-free TRPO performance after 8 million steps by the dotted line in Figure 1 and find that our algorithm can achieve comparable or better final performance in one million steps. As an ablation study, we also add the performance of SLBO-MSE, which corresponds to running SLBO with the squared $\ell_2$ model loss instead of the $\ell_2$ loss. SLBO-MSE performs significantly worse than SLBO on four environments, which is consistent with our derived model loss in Section 4.1. We also study the performance of SLBO and the baselines with 4 million training samples in Appendix F.5. An ablation study of multi-step model training can also be found in Appendix F.5. (Footnote 10: Video demonstrations are available at https://sites.google.com/view/algombrl/home. A link to the codebase is available at https://github.com/roosephu/slbo.)
Figure 1: Comparison of SLBO, SLBO with squared model loss (SLBO-MSE), vanilla model-based TRPO (MB-TRPO), model-free TRPO (MF-TRPO), and Soft Actor-Critic (SAC). We average the results over 10 different random seeds, where the solid lines indicate the mean and shaded areas indicate one standard deviation. The dotted reference lines are the total rewards of MF-TRPO after 8 million steps.
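The aggregation described in the caption (mean curve with a one-standard-deviation band across seeds) can be sketched as follows. This is a minimal illustration and not the authors' plotting code: the array `returns` holds synthetic placeholder values, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np

# Hypothetical data: returns[i, j] is the return of seed i at evaluation
# point j. The values are synthetic placeholders, not results from the paper.
rng = np.random.default_rng(0)
num_seeds, num_evals = 10, 5
returns = rng.normal(loc=1000.0, scale=50.0, size=(num_seeds, num_evals))

mean = returns.mean(axis=0)   # solid line: average over seeds
std = returns.std(axis=0)     # population std over seeds (assumed convention)
lower, upper = mean - std, mean + std  # edges of the shaded band
```

With a plotting library, `mean` would be drawn as the solid curve and the region between `lower` and `upper` filled in as the shaded area.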
7 Conclusions
We devise a novel algorithmic framework for designing and analyzing model-based RL algorithms with guaranteed monotone convergence to a local maximum of the reward. Experimental results show that our proposed algorithm (SLBO) achieves new state-of-the-art performance on several MuJoCo benchmark tasks when one million or fewer samples are permitted.
A compelling (but obvious) empirical open question that arises is whether model-based RL can achieve near-optimal reward on more complicated tasks or real-world robotic tasks with fewer samples. We believe that understanding the trade-off between optimism and robustness is essential for designing more sample-efficient algorithms. Currently, we observe empirically that the optimism-driven part of our proposed meta-algorithm (optimizing over the model) may lead to instability in the optimization, and therefore does not in general help the performance. Finding a practical implementation of the optimism-driven approach is left for future work.
In our theory, we assume that the parameterized model class contains the true dynamical model. Removing this assumption is another interesting open question. It would also be very interesting if the theoretical analysis could be applied to other settings involving model-based approaches (e.g., model-based imitation learning).
Acknowledgments:
We thank the anonymous reviewers for detailed, thoughtful, and helpful reviews. We’d like to thank Emma Brunskill, Chelsea Finn, Shane Gu, Ben Recht, and Haoran Tang for many helpful comments and discussions.
References
 Abbasi-Yadkori & Szepesvári (2011) Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26, 2011.
 Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.
 Agrawal & Jia (2017) Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184–1194, 2017.
 Asadi et al. (2018) Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.

 Bartlett & Tewari (2009) Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42. AUAI Press, 2009.
 Boczar et al. (2018) Ross Boczar, Nikolai Matni, and Benjamin Recht. Finite-data performance guarantees for the output-feedback control of an unknown system. arXiv preprint arXiv:1803.09186, 2018.
 Buckman et al. (2018) J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. ArXiv e-prints, July 2018.
 Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018.
 Clavera et al. (2018) Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
 Cover & Thomas (2012) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713–5723, 2017.
 Dean et al. (2017) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
 Dean et al. (2018) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. arXiv preprint arXiv:1805.09388, 2018.

 Deisenroth & Rasmussen (2011a) Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472, 2011a.
 Deisenroth & Rasmussen (2011b) Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472, 2011b.
 Deisenroth et al. (2011) Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. 2011.
 Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Farahmand et al. (2017) Amirmassoud Farahmand, Andre Barreto, and Daniel Nikovski. Value-aware loss function for model-based reinforcement learning. In Artificial Intelligence and Statistics, pp. 1486–1494, 2017.
 Feinberg et al. (2018) Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
 Fruit et al. (2018) Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018.
 Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
 Ha & Schmidhuber (2018) David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
 Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Hinderer (2005) Karl Hinderer. Lipschitz continuity of value functions in markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.
 Hunt et al. (1992) K J Hunt, D Sbarbaro, R Żbikowski, and Peter J Gawthrop. Neural networks for control systems—a survey. Automatica, 28(6):1083–1112, 1992.
 Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 Kakade et al. (2018) S. Kakade, M. Wang, and L. F. Yang. Variance Reduction Methods for Sublinear Reinforcement Learning. ArXiv e-prints, February 2018.
 Kakade & Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.
 Kalweit & Boedecker (2017) Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pp. 195–206, 2017.
 Kearns & Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.

 Khansari-Zadeh & Billard (2011) S Mohammad Khansari-Zadeh and Aude Billard. Learning stable nonlinear dynamical systems with gaussian mixture models. IEEE Transactions on Robotics, 27(5):943–957, 2011.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Ko & Fox (2009) Jonathan Ko and Dieter Fox. Gp-bayesfilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90, 2009.
 Kurutach et al. (2018) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
 Lakshmanan et al. (2015) Kailasam Lakshmanan, Ronald Ortner, and Daniil Ryabko. Improved regret bounds for undiscounted continuous reinforcement learning. In International Conference on Machine Learning, pp. 524–532, 2015.
 Levine & Abbeel (2014) Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079, 2014.
 Levine & Koltun (2013) Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Lioutikov et al. (2014) Rudolf Lioutikov, Alexandros Paraschos, Jan Peters, and Gerhard Neumann. Sample-based information-theoretic stochastic optimal control. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 3896–3902. IEEE, 2014.
 Mania et al. (2018) Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
 Moldovan et al. (2015) Teodor Mihai Moldovan, Sergey Levine, Michael I Jordan, and Pieter Abbeel. Optimism-driven exploration for nonlinear systems. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 3239–3246. IEEE, 2015.
 Mordatch et al. (2016) Igor Mordatch, Nikhil Mishra, Clemens Eppner, and Pieter Abbeel. Combining model-based policy search with online model learning for control of physical humanoids. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 242–248. IEEE, 2016.
 Morimoto & Atkeson (2003) Jun Morimoto and Christopher G Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in neural information processing systems, pp. 1563–1570, 2003.
 Nagabandi et al. (2017) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.
 Nielsen & Nock (2014) Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.
 Oh et al. (2017) Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.
 Pascanu et al. (2017) Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sebastien Racanière, David Reichert, Théophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint arXiv:1707.06170, 2017.
 Pathak et al. (2018) Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In International Conference on Learning Representations, 2018.
 Pirotta et al. (2013) Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.

 Pirotta et al. (2015) Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in lipschitz markov decision processes. Machine Learning, 100(2-3):255–283, 2015.
 Pong et al. (2018a) Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018a.
 Pong et al. (2018b) Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep rl for model-based control. International Conference on Learning Representations, 2018b.
 Racanière et al. (2017) Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5690–5701, 2017.
 Sanchez-Gonzalez et al. (2018) Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.
 Sason & Verdú (2016) Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
 Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Serban et al. (2018) Iulian Vlad Serban, Chinnadhurai Sankar, Michael Pieper, Joelle Pineau, and Yoshua Bengio. The bottleneck simulator: A model-based deep reinforcement learning approach. arXiv preprint arXiv:1807.04723, 2018.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Simchowitz et al. (2018) Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.
 Srinivas et al. (2018) Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.
 Sun et al. (2018) Wen Sun, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Dual policy iteration. arXiv preprint arXiv:1805.10755, 2018.
 Sutton (1990) Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224. Elsevier, 1990.
 Sutton (1991) Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
 Sutton et al. (2012) Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.
 Szita & Szepesvári (2010) István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.
 Tamar et al. (2012) Aviv Tamar, Dotan Di Castro, and Ron Meir. Integrating a partial model into model free reinforcement learning. Journal of Machine Learning Research, 13(Jun):1927–1966, 2012.
 Tangkaratt et al. (2014) Voot Tangkaratt, Syogo Mori, Tingting Zhao, Jun Morimoto, and Masashi Sugiyama. Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural networks, 57:128–140, 2014.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Williams & Peng (1991) Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
 Xie et al. (2016) Chris Xie, Sachin Patil, Teodor Moldovan, Sergey Levine, and Pieter Abbeel. Model-based reinforcement learning with parametrized physical models and optimism-driven exploration. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 504–511. IEEE, 2016.
 Yip & Camarillo (2014) Michael C Yip and David B Camarillo. Model-less feedback control of continuum manipulators in constrained environments. IEEE Transactions on Robotics, 30(4):880–889, 2014.
Appendix A Refined bounds
The theoretical limitation of the discrepancy bound is that the second term involving is not rigorously optimizable by stochastic samples. In the worst case, there seem to exist situations where such an infinity norm of is inevitable. In this section we tighten the discrepancy bounds with a different closeness measure , divergence, in the policy space, and the dependency on the is smaller (though not entirely removed). We note that divergence has the same second-order approximation as the KL-divergence in a local neighborhood of the reference policy, and thus locally does not affect the optimization much.
We start by defining a reweighted version of the distribution where examples in later steps are slightly weighted up. We can effectively sample from by importance sampling from
Definition A.1.
For a policy , define as the reweighted version of discounted distribution of the states visited by on . Recall that is the distribution of the state at step , we define .
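The reweighting idea can be illustrated with a small importance-sampling sketch. The concrete weights below are an assumption made purely for illustration (the sampling distribution puts mass proportional to gamma^t on step t, and the target puts mass proportional to (t + 1) gamma^t, so later steps are weighted up linearly); the function name `step_weights` and the statistic `f` are hypothetical.

```python
import numpy as np

def step_weights(gamma, horizon):
    """Self-normalized importance weights for steps 0..horizon-1.

    Assumption (for illustration only): the sampling distribution puts mass
    proportional to gamma**t on step t, and the target puts mass proportional
    to (t + 1) * gamma**t, so the density ratio grows like (t + 1).
    """
    t = np.arange(horizon)
    rho = gamma ** t
    rho = rho / rho.sum()        # truncated discounted step distribution
    w = (t + 1).astype(float)    # unnormalized ratio target / sampling
    w = w / np.sum(rho * w)      # normalize so that E_rho[w] = 1
    return rho, w

rho, w = step_weights(gamma=0.99, horizon=500)

# Reweighted estimate of a per-step statistic under the target distribution,
# computed only from quantities available under the sampling distribution.
f = np.sqrt(np.arange(500))      # hypothetical statistic, increasing in t
estimate = np.sum(rho * w * f)
```

Because the weights increase with the step index, any statistic that grows with the step receives a larger average under the reweighted distribution than under the original one.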
Then we are ready to state our discrepancy bound. Let
(A.1)  
(A.2) 
where and are defined in equation (4.2).
We defer the proof to Section C so that we can group relevant proofs with similar tools together. Some of these tools may be of independent interest and may be useful for better analysis of model-free reinforcement learning algorithms such as TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), and CPO (Achiam et al., 2017).
Appendix B Proof of Lemma 4.3
Proof of Lemma 4.3.
Let be the cumulative reward when we use dynamical model for steps and then for the rest of the steps, that is,
By definition, we have that . Then, we decompose the target into a telescoping sum,
(B.1) 
Now we rewrite each of the summands . Comparing the trajectory distribution in the definition of and , we see that they only differ in the dynamical model applied in th step. Concretely, and can be rewritten as and where denotes the reward from the first steps from policy and model . Canceling the shared term in the two equations above, we get
Combining the equation above with equation (B.1) concludes that
∎
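The telescoping argument can be checked numerically on a small tabular example. The sketch below is an illustration under stated assumptions, not the paper's code: the policy is folded into two hypothetical Markov chains `P_true` and `P_hat` (true and learned dynamics), `W(j)` follows the true chain for j steps and the learned chain afterwards, and the identity being verified is the standard telescoping form assumed here for Lemma 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.9

def random_chain(n):
    """Random row-stochastic transition matrix (hypothetical dynamics)."""
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

P_true = random_chain(n)   # true dynamics under a fixed policy
P_hat = random_chain(n)    # learned dynamics under the same policy
r = rng.random(n)          # per-state reward
I = np.eye(n)

V_hat = np.linalg.solve(I - gamma * P_hat, r)    # value under learned model
V_true = np.linalg.solve(I - gamma * P_true, r)  # value under true model

def W(j):
    """Value when the true chain is used for j steps, the learned one after."""
    v = V_hat.copy()
    for _ in range(j):
        v = r + gamma * P_true @ v
    return v

# One-step discrepancy: E_{s'~P_hat}[V_hat(s')] - E_{s'~P_true}[V_hat(s')].
G = (P_hat - P_true) @ V_hat

# Telescoping: V_hat - V_true = sum_j (W(j) - W(j+1))
#                             = gamma * sum_j gamma^j P_true^j G.
lhs = V_hat - V_true
rhs = gamma * np.linalg.solve(I - gamma * P_true, G)
```

Each summand `W(j) - W(j + 1)` differs only in the dynamics applied at one step, which is why it reduces to a discounted, propagated copy of the one-step discrepancy `G`.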
Appendix C Missing Proofs in Section 4
C.1 Proof of Proposition 4.4
Towards proving the second part of Proposition 4.4 regarding the invariance, we state the following lemma:
Lemma C.1.
Suppose for simplicity the model and the policy are both deterministic. For any one-to-one transformation from to , let , , and be a set of transformed model, reward and policy. Then we have that is equivalent to in the sense that
where the value function is defined with respect to .
Proof.
Let be the sequence of states visited by policy on model starting from . We have that . We prove by induction that . Assume this is true for some value , then we prove that holds:
(by inductive hypothesis)  
(by definition of , )  
Thus we have . Therefore . ∎
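The invariance in Lemma C.1 can be sanity-checked on a toy deterministic system. Everything in this sketch is hypothetical: the scalar model, reward, policy, and the affine map `T` are illustrative choices, and the transformed objects are built under the assumed construction that each one first maps its state argument back through the inverse of `T`.

```python
def rollout_value(model, reward, policy, s0, gamma=0.95, horizon=200):
    """Discounted return of a deterministic policy on a deterministic model."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        total += gamma ** t * reward(s, a)
        s = model(s, a)
    return total

# Hypothetical scalar system.
model = lambda s, a: 0.9 * s + a
reward = lambda s, a: -(s ** 2 + a ** 2)
policy = lambda s: -0.5 * s

# A one-to-one transformation of the state space and its inverse.
T = lambda s: 2.0 * s + 1.0
T_inv = lambda x: (x - 1.0) / 2.0

# Transformed model, reward and policy (assumed construction).
model_T = lambda x, a: T(model(T_inv(x), a))
reward_T = lambda x, a: reward(T_inv(x), a)
policy_T = lambda x: policy(T_inv(x))

s0 = 1.5
v = rollout_value(model, reward, policy, s0)
v_T = rollout_value(model_T, reward_T, policy_T, T(s0))
```

The transformed rollout visits exactly the images under `T` of the original states, so both rollouts collect the same rewards and `v` and `v_T` agree up to floating-point error.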
Proof of Proposition 4.4.
We first show the invariance of under deterministic models and policies. The same result applies to stochastic policies with slight modification. Let . We consider the transformation applied to and and the resulting function
Note that by Lemma C.1, we have that . Similarly, . Therefore we obtain that
(C.1) 
By Lemma 4.3 and the triangle inequality, we have that
(triangle inequality)  