1 Introduction
Model-based reinforcement learning (RL) offers the potential to be a general-purpose tool for learning complex policies while remaining sample efficient. When learning in real-world physical systems, data collection can be an arduous process. In contrast to model-free methods, model-based approaches are appealing due to their comparatively fast learning. By first learning the dynamics of the system in a supervised learning fashion, they can exploit off-policy data. Model-based methods then use the model to derive controllers, either parametric (Luo et al., 2019; Buckman et al., 2018; Janner et al., 2019) or non-parametric (Nagabandi et al., 2017; Chua et al., 2018).
Current model-based methods learn with an order of magnitude less data than their model-free counterparts while achieving the same asymptotic performance. Tools such as ensembles, probabilistic models, planning over shorter horizons, and meta-learning have been used to achieve such performance (Kurutach et al., 2018; Chua et al., 2018; Clavera et al., 2018). However, the model usage in all of these methods is the same: simple data augmentation. They treat the learned model as a black-box simulator and generate samples from it. In high-dimensional environments, or environments that require longer planning, substantial sampling is needed to provide a meaningful signal for the policy. Can we further exploit our learned models?
In this work, we propose to estimate the policy gradient by backpropagating through the model using the pathwise derivative estimator. Since the learned model is differentiable, one can chain together the model, reward function, and policy to obtain an analytic expression for the gradient of the returns with respect to the policy parameters. By computing the gradient in this manner, we obtain an expressive signal that allows rapid policy learning. We avoid the instabilities that often result from backpropagating through long horizons by using a terminal Q-function. This scheme fully exploits the learned model without the learning instabilities seen in previous approaches (Kurutach et al., 2018; Heess et al., 2015). The horizon at which we apply the terminal Q-function acts as a hyperparameter that interpolates between the model-free (fully relying on the Q-function) and model-based (using a longer horizon) regimes of our algorithm.
The main contribution of this work is a model-based method that significantly reduces the sample complexity compared to state-of-the-art model-based algorithms (Janner et al., 2019; Buckman et al., 2018). For instance, we achieve a 10k return in the half-cheetah environment in just 50 trajectories. We theoretically justify our optimization objective and derive the monotonic improvement of our learned policy in terms of the Q-function and the model error. Furthermore, we experimentally analyze the theoretical derivations. Finally, we pinpoint the importance of our objective by ablating all the components of our algorithm. The results are reported in four model-based benchmarking environments (Wang et al., 2019; Todorov et al., 2012). The low sample complexity and high performance of our method hold high promise for learning directly on real robots.
2 Related Work
Model-Based Reinforcement Learning. Learned dynamics models offer the possibility of reducing sample complexity while maintaining asymptotic performance. For instance, the model can act as a learned simulator on which a model-free policy is trained (Kurutach et al., 2018; Luo et al., 2019; Janner et al., 2019). The model can also be used to improve target value estimates (Feinberg et al., 2018) or to provide additional context to a policy (Du and Narasimhan, 2019). In contrast to these methods, our approach uses the model differently: we exploit the fact that the learned simulator is differentiable and optimize the policy with the analytic gradient. Long-term predictions suffer from compounding model error, resulting in unrealistic predictions; the policy then tends to overfit to the deficiencies of the model, which translates to poor performance in the real environment. This problem is known as model-bias (Deisenroth and Rasmussen, 2011). The model-bias problem has motivated work that uses meta-learning (Clavera et al., 2018), interpolation between predictions at different horizons (Buckman et al., 2018; Janner et al., 2019), and interpolation between model and real data (Kalweit and Boedecker, 2017). To prevent model-bias, we exploit the model over a short horizon and use a terminal value function to model the rest of the trajectory. Finally, since our approach yields a stochastic policy, a dynamics model, and a value function, it can use model-predictive control (MPC) for better performance at test time, similar to Lowrey et al. (2018) and Hong et al. (2019). MPC methods (Nagabandi et al., 2017) have been shown to be very effective when the uncertainty of the dynamics is modelled (Chua et al., 2018; Wang and Ba, 2019).
Differentiable Planning. Previous work has backpropagated through learned models to obtain optimal sequences of actions. For instance, Levine and Abbeel (2014)
learn linear local models and obtain optimal sequences of actions, which are then distilled into a neural network policy. Planning can also be incorporated into the neural network architecture (Okada et al., 2017; Tamar et al., 2016; Srinivas et al., 2018; Karkus et al., 2019) or formulated as a differentiable function (Pereira et al., 2018; Amos et al., 2018). Planning sequences of actions, even when doing model-predictive control (MPC), does not scale well to high-dimensional, complex domains (Janner et al., 2019). Our method instead learns a neural network policy in an actor-critic fashion aided by a learned model. In our study, we evaluate the benefit of carrying out MPC on top of our learned policy at test time (Section 5.4). The results suggest that the policy captures the optimal sequence of actions, and that replanning does not yield significant benefits.
Policy Gradient Estimation. The reinforcement learning objective involves computing the gradient of an expectation (Schulman et al., 2015a). With Gaussian processes (Deisenroth and Rasmussen, 2011), it is possible to compute the expectation analytically. However, when learning expressive parametric non-linear dynamics models and policies, such closed-form solutions do not exist, and the gradient is instead estimated using Monte Carlo methods (Mohamed et al., 2019). In the context of model-based RL, previous approaches have mostly used the score-function, or REINFORCE, estimator (Peters and Schaal, 2006; Kurutach et al., 2018). However, this estimator has high variance and requires extensive sampling, which hampers its applicability in high-dimensional environments. In this work, we use the pathwise derivative estimator (Mohamed et al., 2019). Similar to our approach, Heess et al. (2015) use this estimator in the context of model-based RL. However, they use real-world trajectories, which introduces a likelihood-ratio term for the model predictions and in turn increases the variance of the gradient estimate. Instead, we rely entirely on the predictions of the model, removing the need for likelihood-ratio terms.
Actor-Critic Methods. Actor-critic methods alternate between policy evaluation, computing the value function of the policy, and policy improvement using that value function (Sutton and Barto, 1998; Barto et al., 1983)
. Actor-critic methods can be classified as on-policy or off-policy. On-policy methods tend to be more stable, but at the cost of sample efficiency (Sutton, 1991; Mnih et al., 2016). Off-policy methods, on the other hand, offer better sample complexity (Lillicrap et al., 2015). Recent work has significantly stabilized and improved the performance of off-policy methods using maximum-entropy objectives (Haarnoja et al., 2018a) and multiple value functions (Fujimoto et al., 2018). Our method combines the benefits of both: by using the learned model, we obtain learning behaviour that resembles an on-policy method while remaining off-policy.
3 Background
In this section, we present the reinforcement learning problem, two different families of algorithms that tackle it, and a summary of Monte Carlo gradient estimators.
3.1 Reinforcement Learning
A discrete-time finite-horizon Markov decision process (MDP) is defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \rho_0, \gamma, H)$. Here, $\mathcal{S}$ is the set of states, $\mathcal{A}$ the action space, $p(s_{t+1}|s_t, a_t)$ the transition distribution, $r(s_t, a_t)$ a reward function, $\rho_0$ the initial state distribution, $\gamma$ the discount factor, and $H$ the horizon of the process. We define the return as the discounted sum of rewards along a trajectory $\tau := (s_0, a_0, ..., s_H, a_H)$. The goal of reinforcement learning is to find a policy $\pi_\theta$ that maximizes the expected return.
Actor-Critic. In actor-critic methods, we learn a function $Q$ (critic) that approximates the expected return conditioned on a state $s$ and action $a$. The learned Q-function is then used to optimize a policy $\pi$ (actor). Usually, the Q-function is learned by iteratively minimizing the Bellman residual:

$$J_Q = \mathbb{E}\left[\left(Q(s_t, a_t) - \left(r(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1})\right)\right)^2\right]$$

The above method is referred to as one-step Q-learning. While a naive implementation often results in unstable behaviour, recent methods have succeeded in stabilizing Q-function training (Fujimoto et al., 2018). The actor can then be trained to maximize the learned Q-function. The benefit of this form of actor-critic method is that it can be applied in an off-policy fashion, sampling random mini-batches of transitions from an experience replay buffer (Lin, 1992).
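As a concrete illustration, the Bellman residual minimization described above can be sketched on a toy two-state chain. The MDP, rewards, and learning rate below are hypothetical, chosen only so the fixed point is obvious; they are not one of the paper's benchmarks.

```python
import numpy as np

# Toy illustration of one-step Q-learning: a hypothetical two-state chain
# where state 1 is terminal with reward 1. Repeated residual-minimization
# steps drive Q to its Bellman fixed point.
gamma = 0.99
Q = np.zeros((2, 2))  # Q[state, action]

def bellman_target(r, s_next, done):
    # y = r + gamma * max_a' Q(s', a'), with no bootstrap past a terminal
    return r + (0.0 if done else gamma * Q[s_next].max())

transitions = [(0, 1, 0.0, 1, False),  # s=0 --a=1--> s=1, r=0
               (1, 0, 1.0, 1, True)]   # s=1 --a=0--> terminal, r=1
for _ in range(200):
    for s, a, r, s2, done in transitions:
        y = bellman_target(r, s2, done)
        Q[s, a] += 0.5 * (y - Q[s, a])  # step on the squared residual
# Q[1, 0] converges to 1.0 and Q[0, 1] to gamma * 1.0
```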
Model-Based RL. Model-based methods, in contrast to model-free RL, learn the transition distribution from experience. Typically, this is carried out by learning a parametric function approximator $\hat{f}_\phi$, known as a dynamics model. We define the state predicted by the dynamics model as $\hat{s}_{t+1}$, i.e., $\hat{s}_{t+1} \sim \hat{f}_\phi(s_t, a_t)$. The model is trained via maximum likelihood:

$$\max_\phi \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \log \hat{f}_\phi(s_{t+1} \mid s_t, a_t)$$
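The maximum-likelihood objective can be sketched as follows. With a fixed-variance Gaussian model, the negative log-likelihood reduces to mean-squared-error regression on next states; the linear ground truth and plain gradient descent below are illustrative stand-ins for the paper's neural dynamics models.

```python
import numpy as np

# Maximum-likelihood dynamics learning sketch: fixed-variance Gaussian
# model => NLL is (up to constants) the MSE on predicted next states.
rng = np.random.default_rng(0)
S = rng.normal(size=(256, 3))                  # states
A = rng.normal(size=(256, 1))                  # actions
W_true = rng.normal(size=(4, 3))
X = np.hstack([S, A])                          # model input (s, a)
S_next = X @ W_true + 0.01 * rng.normal(size=(256, 3))

W = np.zeros((4, 3))
for _ in range(500):                           # gradient descent on the MSE
    grad = X.T @ (X @ W - S_next) / len(X)
    W -= 0.1 * grad

mse = np.mean((X @ W - S_next) ** 2)           # approaches the noise floor
```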
3.2 Monte Carlo Gradient Estimators
In order to optimize the reinforcement learning objective, one needs to take the gradient of an expectation. In general, it is not possible to compute the exact expectation, so Monte Carlo gradient estimators are used instead. These are mainly categorized into three classes: the pathwise, score-function, and measure-valued gradient estimators (Mohamed et al., 2019). In this work, we use the pathwise gradient estimator, also known as the reparameterization trick (Kingma and Welling, 2013). This estimator is derived from the law of the unconscious statistician (LOTUS) (Grimmett and Stirzaker, 2001): we can compute the expectation of a random variable without knowing its distribution, if we know its corresponding sampling path and base distribution. A common case, and the one used in this manuscript, parameterizes a Gaussian distribution $x \sim \mathcal{N}(\mu, \sigma^2)$, which is equivalent to $x = \mu + \sigma\epsilon$ for $\epsilon \sim \mathcal{N}(0, 1)$.
4 Policy Gradient via Model-Augmented Pathwise Derivative
Exploiting the full capability of learned models has the potential to enable complex and high-dimensional real robotics tasks while maintaining low sample complexity. Our approach, model-augmented actor-critic (MAAC), exploits the learned model by computing the analytic gradient of the returns with respect to the policy. In contrast to sample-based methods, which one can think of as providing directional derivatives in trajectory space, MAAC computes the full gradient, providing a strong learning signal for policy learning and further decreasing sample complexity.
In the following, we present our policy optimization scheme and describe the full algorithm.
4.1 Model-Augmented Actor-Critic Objective
Among model-free methods, actor-critic methods have shown superior performance in terms of sample efficiency and asymptotic performance (Haarnoja et al., 2018a). However, their sample efficiency remains worse than that of model-based approaches, and fully off-policy methods still show instabilities compared to on-policy algorithms (Mnih et al., 2016). Here, we propose a modification of the Q-function parametrization that uses the model predictions for the first $H$ time steps after the action is taken. Specifically, we perform policy optimization by maximizing the following objective:

$$J_\pi(\theta) = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t r(\hat{s}_t, a_t) + \gamma^H Q(\hat{s}_H, a_H)\right]$$

where $\hat{s}_0 \sim \rho_0$, $a_t \sim \pi_\theta(\cdot|\hat{s}_t)$, and $\hat{s}_{t+1} \sim \hat{f}_\phi(\hat{s}_t, a_t)$. Note that under the true dynamics and Q-function, this objective is the same as the RL objective. Unlike previous reinforcement learning methods, we optimize this objective by backpropagation through time. Since the learned dynamics model and policy are parameterized as Gaussian distributions, we can make use of the pathwise derivative estimator to compute the gradient, resulting in an objective that captures uncertainty while presenting low variance. The computational graph of the proposed objective is shown in Figure 1.
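A minimal sketch of backpropagation through time on this kind of objective, using deterministic one-dimensional stand-ins for the learned model, reward, policy, and terminal Q-function (all hypothetical, chosen only so the gradient can be checked by hand):

```python
# BPTT sketch: model s_{t+1} = s_t + a_t, reward r = -s^2, linear policy
# a = k*s, terminal Q = -s^2. Forward unroll, then accumulate
# d(objective)/dk by the reverse-mode chain rule.
gamma, H, k, s0 = 0.99, 3, -0.5, 1.0

states = [s0]                                     # forward pass
for t in range(H):
    states.append(states[-1] + k * states[-1])    # s_{t+1} = s_t + a_t

g = 0.0                                           # d(objective)/dk
ds = gamma ** H * (-2.0 * states[H])              # grad of terminal Q term
for t in reversed(range(H)):
    s = states[t]
    g += ds * s                                   # direct path: ds_{t+1}/dk = s_t
    ds = ds * (1.0 + k)                           # chain: ds_{t+1}/ds_t = 1 + k
    ds += gamma ** t * (-2.0 * s)                 # reward gradient at step t
# g matches a finite-difference check of the same H-step objective
# (about -1.662 for these constants)
```

Since the gradient is negative, a gradient ascent step pushes the gain $k$ toward the stabilizing value $-1$, as expected for this toy system.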
While the proposed objective resembles n-step bootstrap (Sutton and Barto, 1998), our model usage fundamentally differs from previous approaches. First, we do not compromise between being off-policy and stability: n-step bootstrap is typically either on-policy, which harms sample complexity, or relies on likelihood-ratio gradient estimation, which presents large variance and results in unstable learning (Heess et al., 2015). Second, we obtain a strong learning signal by backpropagating the gradient of the policy across multiple steps using the pathwise derivative estimator, instead of the REINFORCE estimator (Mohamed et al., 2019; Peters and Schaal, 2006). Finally, we prevent the exploding and vanishing gradients inherent to backpropagation through time by means of the terminal Q-function (Kurutach et al., 2018).
The horizon $H$ in our proposed objective allows us to trade off between the accuracy of the learned model and the accuracy of the learned Q-function. Hence, it controls the degree to which our algorithm is model-based or model-free. If we do not trust our model at all ($H = 0$), we recover a model-free update; when fully relying on the model, the objective becomes a shooting objective. Note that we perform policy optimization by taking derivatives of the objective, hence we require accuracy in the derivatives of the objective and not in its value. The following lemma bounds the gradient error in terms of the error in the derivatives of the model, the Q-function, and the horizon $H$.
Lemma 4.1 (Gradient Error).
Let $\hat{f}$ and $\hat{Q}$ be the learned approximations of the dynamics $f$ and Q-function $Q$, respectively. Assume that $f$ and $\hat{f}$ have Lipschitz continuous gradients, and that $Q$ and $\hat{Q}$ have Lipschitz continuous gradients. Let $\epsilon_f$ be the error on the model derivatives and $\epsilon_Q$ the error on the Q-function derivatives. Then the error on the gradient between the learned objective and the true objective can be bounded by:

$$\|\nabla_\theta J - \nabla_\theta \hat{J}\| \leq c_1(H)\,\epsilon_f + c_2(H)\,\epsilon_Q$$
Proof.
See Appendix. ∎
The result in Lemma 4.1 expresses the error of the policy gradient in terms of the maximum error in the model derivatives and the error in the Q-function derivatives. The functions $c_1$ and $c_2$ are functions of the horizon $H$ and depend on the Lipschitz constants of the model and the Q-function. Note that we are only interested in the relation between the two sources of error, since the gradient magnitude will be rescaled by the learning rate, or by the optimizer, when applied to the weights.
4.2 Monotonic Improvement
In the previous section, we presented our objective and the error it incurs in the policy gradient with respect to the approximation error in the model and the Q-function. However, the error on the gradient is not directly indicative of the effect on the metric we care about: the average return. Here, we quantify the effect of the modelling error on the return. First, we bound the total variation distance between the policies resulting from taking a gradient step on the true objective and on the approximated one. Then, we bound the difference in performance in terms of that distance.
Lemma 4.2 (Total Variation Bound).
Under the assumptions of Lemma 4.1, let $\theta_1$ be the parameters resulting from taking a gradient step on the exact objective, and $\theta_2$ the parameters resulting from taking a gradient step on the approximated objective, with step size $\alpha$. Then the following bound on the total variation distance between $\pi_{\theta_1}$ and $\pi_{\theta_2}$ holds:
Proof.
See Appendix. ∎
The previous lemma bounds the distance between the policies that originate from taking a gradient step using the true dynamics and Q-function and using their learned counterparts. Now, we can derive a result similar to that of Kakade and Langford (2002) to bound the difference in average returns.
Theorem 4.1 (Monotonic Improvement).
Under the assumptions of Lemma 4.1, let $\theta_1$ and $\theta_2$ be as defined in Lemma 4.2, and assume that the reward is bounded by $r_{\max}$. Then the average return of $\pi_{\theta_2}$ satisfies:
Proof.
See Appendix. ∎
Hence, we can provide explicit lower bounds on improvement in terms of the model error and the Q-function error. Theorem 4.1 extends previous monotonic improvement results for model-free policies (Schulman et al., 2015b; Kakade and Langford, 2002) to the model-based and actor-critic setup by taking the error of the learned functions into account. From this bound one could, in principle, derive the optimal horizon $H$ that minimizes the gradient error. In practice, however, approximation errors are hard to determine, and we treat $H$ as an extra hyperparameter. In Section 5.2, we experimentally analyze the error on the gradient for different estimators and values of $H$.
4.3 Algorithm
Based on the previous sections, we develop a new algorithm that explicitly optimizes the model-augmented actor-critic (MAAC) objective. The overall algorithm is divided into three main steps: model learning, policy optimization, and Q-function learning.
Model learning. In order to prevent overfitting and overcome model-bias (Deisenroth and Rasmussen, 2011), we use a bootstrap ensemble of dynamics models $\{\hat{f}_{\phi_1}, ..., \hat{f}_{\phi_M}\}$. Each dynamics model parameterizes the mean and covariance of a Gaussian distribution with diagonal covariance. The bootstrap ensemble captures the epistemic uncertainty, i.e., uncertainty due to limited capacity or data, while the probabilistic models capture the aleatoric uncertainty (Chua et al., 2018), i.e., the inherent stochasticity of the environment. We denote by $\hat{f}_\phi$ the transition dynamics resulting from sampling a model index uniformly at random from $\{1, ..., M\}$. The dynamics models are trained via maximum likelihood with early stopping on a validation set.
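A sketch of how imagined roll-outs mix the ensemble members, assuming a uniformly sampled model index per step; the perturbed linear "models" below are placeholders for the learned networks.

```python
import numpy as np

# Ensemble roll-out sketch: each imagined step draws one of M models, so
# trajectories reflect the members' disagreement (epistemic uncertainty).
rng = np.random.default_rng(0)
M = 5
offsets = rng.normal(0.0, 0.01, size=M)        # per-member disagreement
ensemble = [lambda s, a, w=w: s + a + w for w in offsets]

def sample_transition(s, a):
    i = rng.integers(M)                        # uniform model index
    return ensemble[i](s, a)

s = 0.0
for _ in range(10):                            # imagined 10-step roll-out
    s = sample_transition(s, 0.1)
# s lands near 1.0, jittered by whichever members were sampled
```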
Policy Optimization. We extend the MAAC objective with an entropy bonus (Haarnoja et al., 2018b) and perform policy learning by maximizing the resulting maximum-entropy objective. We learn the policy by using the pathwise derivative of the model through $H$ steps and the terminal Q-function. We compute the expectation by sampling multiple actions and states from the policy and the learned dynamics, respectively.
Q-function Learning. In practice, we train two Q-functions (Fujimoto et al., 2018), since this has been experimentally shown to yield better results. We train both Q-functions by minimizing the Bellman error (Section 3.1). Similar to Janner et al. (2019), we minimize the Bellman residual on states previously visited and on imagined states obtained by unrolling the learned model. Finally, the value targets are obtained in the same fashion as in stochastic ensemble value expansion (STEVE) (Buckman et al., 2018), using $H$ as the horizon of the expansion. In doing so, we make maximal use of the model, employing it not only for the policy gradient step but also for training the Q-function.
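The twin-Q target can be sketched as follows; taking the minimum of the two critics is the clipped double-Q idea of Fujimoto et al. (2018), and the tabular values below are purely illustrative.

```python
# Twin-Q target sketch: the minimum over two critics counters value
# over-estimation when bootstrapping the Bellman target.
gamma = 0.99
Q1 = {("s1", "a"): 1.3}   # an over-estimating critic
Q2 = {("s1", "a"): 0.9}

def twin_q_target(r, s_next, a_next):
    return r + gamma * min(Q1[(s_next, a_next)], Q2[(s_next, a_next)])

y = twin_q_target(0.5, "s1", "a")   # bootstraps from the smaller estimate, 0.9
```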
Our method, MAAC, iterates between collecting samples from the environment, model training, policy optimization, and Q-function learning. A practical implementation is described in Algorithm 1. First, we obtain trajectories from the real environment using the latest available policy. Those samples are appended to a replay buffer, on which the dynamics models are trained until convergence. The third step is to collect imaginary data from the models: we collect short transition roll-outs by unrolling the latest policy from states randomly sampled from the replay buffer. The imaginary data constitute a model buffer which, together with the replay buffer, is used to learn the Q-function and train the policy.
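The loop just described can be sketched at a high level; every subroutine below is a hypothetical stub standing in for the components defined earlier, and the names are ours, not the paper's.

```python
# High-level sketch of the MAAC training loop (Algorithm 1); all
# subroutines are stubs for illustration only.
def collect_env_trajectories(policy):
    return [("s", "a", 0.0, "s2")]             # real transitions

def train_models(buffer):
    return "ensemble"                          # fit until convergence

def imagine_transitions(models, policy, buffer):
    return [("s", "a", 0.0, "s2")]             # short model roll-outs

def update_q_and_policy(real_buffer, model_buffer):
    pass                                       # MAAC objective + Bellman error

replay_buffer, policy = [], "pi_theta"
for _ in range(3):                             # outer training iterations
    replay_buffer += collect_env_trajectories(policy)
    models = train_models(replay_buffer)
    model_buffer = imagine_transitions(models, policy, replay_buffer)
    update_q_and_policy(replay_buffer, model_buffer)
```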
Our algorithm consolidates the insights built through the course of this paper, while at the same time making maximal use of recently developed actor-critic and model-based methods. All in all, it consistently outperforms previous model-based and actor-critic methods.
5 Results
Our experimental evaluation aims to answer the following questions: 1) How does MAAC compare against state-of-the-art model-based and model-free methods? 2) Does the gradient error correlate with the derived bound? 3) Which components are key to its performance? 4) Does it benefit from planning at test time?
In order to answer these questions, we evaluate our approach on model-based continuous control benchmark tasks in the MuJoCo simulator (Todorov et al., 2012; Wang et al., 2019).
5.1 Comparison Against State-of-the-Art
We compare our method in terms of sample complexity and asymptotic performance against state-of-the-art model-free (MF) and model-based (MB) baselines. Specifically, we compare against soft actor-critic (SAC) (Haarnoja et al., 2018a), an off-policy model-free algorithm that has proven to be sample efficient and performant, as well as two state-of-the-art model-based baselines: model-based policy optimization (MBPO) (Janner et al., 2019) and stochastic ensemble value expansion (STEVE) (Buckman et al., 2018). The original STEVE algorithm builds on top of the model-free algorithm DDPG (Lillicrap et al., 2015), which is outperformed by SAC; in order to remove confounding effects of the underlying model-free algorithm, we implemented STEVE on top of SAC. We also add SVG(1) (Heess et al., 2015) to the comparison, which, similar to our method, uses the derivative of dynamics models to learn the policy.
The results, shown in Fig. 2, highlight the strength of MAAC in terms of performance and sample complexity. MAAC scales to higher-dimensional tasks while maintaining its sample efficiency and asymptotic performance. In all four environments, our method learns faster than previous MB and MF methods. We are able to learn near-optimal policies in the half-cheetah environment in just over 50 rollouts, while previous model-based methods need at least double that amount of data. Furthermore, in complex environments such as ant, MAAC achieves near-optimal performance within 150 rollouts, while other methods take orders of magnitude more data.
5.2 Gradient Error
Here, we investigate how the derived bounds relate to empirical performance. In particular, we study the effect of the horizon of the model predictions on the gradient error. To do so, we construct a double-integrator environment; since the transitions are linear and the cost is quadratic, for a linear policy we can obtain the analytic gradient of the expected return.
Figure 3 depicts the error of the MAAC objective for different values of the horizon, as well as the error obtained using the true dynamics. As expected, using the true dynamics yields a lower gradient error, since the only error source is the learned Q-function, which is weighted down by the discount. The results using learned dynamics are consistent with our assumptions and the derived bounds: the error from the learned dynamics is lower than the error in the Q-function, but it scales poorly with the horizon. For short horizons, the error decreases as we increase the horizon; a large horizon, however, is detrimental, since it magnifies the error of the models.
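This qualitative behaviour can be reproduced in a few lines on a one-dimensional analogue: with linear dynamics, quadratic cost, and a linear policy, the return gradient can be computed to high accuracy, and the H-step objective's gradient error under a slightly biased model and a crude terminal value shows the same U-shape. All constants below are illustrative, not the paper's actual double-integrator setup.

```python
# 1-d analogue of the gradient-error study: dynamics s' = dyn*s + a,
# reward -s^2, linear policy a = k*s, terminal value q_slope * (-s^2).
gamma = 0.99

def grad_J(k, dyn, H, q_slope):
    # Gradient w.r.t. the policy gain k, via central finite differences.
    def J(kk):
        s, ret = 1.0, 0.0
        for t in range(H):
            ret += gamma ** t * (-s * s)
            s = dyn * s + kk * s
        return ret + gamma ** H * q_slope * (-s * s)
    h = 1e-6
    return (J(k + h) - J(k - h)) / (2 * h)

# Reference gradient: long roll-out through the true dynamics (dyn = 1).
true_g = grad_J(-0.5, dyn=1.0, H=20, q_slope=0.0)
# H-step objective with a slightly biased model and a crude terminal value:
errs = [abs(grad_J(-0.5, dyn=1.02, H=H, q_slope=1.0) - true_g)
        for H in (1, 3, 10)]
# The error first drops with H (the terminal-value error is discounted
# away), then grows again as the model bias compounds.
```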
5.3 Ablations
In order to investigate the importance of each component of our overall algorithm, we carry out an ablation study. Specifically, we test three variants: 1) not using the model to train the policy, i.e., setting $H = 0$; 2) not using the STEVE targets for training the critic; and 3) using a single-sample estimate of the pathwise derivative.
The ablation results are shown in Figure 4. They underpin the importance of backpropagating through the model: setting $H$ to 0 inflicts a severe drop in performance. On the other hand, using the STEVE targets results in slightly more stable training, but does not have a significant effect. Finally, while single-sample estimates can be used in simple environments, they are not accurate enough in higher-dimensional environments such as ant.
5.4 Model Predictive Control
One of the key benefits of combining model-based reinforcement learning with actor-critic methods is that the optimization procedure yields a stochastic policy, a dynamics model, and a Q-function. Hence, we have all the components needed to refine the action selection at test time by means of model predictive control (MPC). Here, we investigate the performance improvement from planning at test time. Specifically, we use the cross-entropy method with our stochastic policy as the initial distribution. The results, shown in Table 1, show benefits from online planning in complex domains; the gains are more modest in easier domains, indicating that the learned policy has already internalized the optimal behaviour.
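A sketch of the cross-entropy method used at test time, assuming a Gaussian over H-step action sequences; the quadratic scoring function is a stand-in for model roll-out returns, and in MAAC+MPC the initial mean would come from the learned stochastic policy.

```python
import numpy as np

# Cross-entropy method (CEM) sketch: iteratively refit a Gaussian over
# action sequences to the elite (top-scoring) samples.
rng = np.random.default_rng(0)
H, n_samples, n_elite = 5, 64, 8
mu, sigma = np.zeros(H), np.ones(H)

def score(actions):                       # placeholder return estimate
    return -np.sum((actions - 0.3) ** 2)  # optimum at 0.3 per step

for _ in range(10):                       # CEM refinement iterations
    samples = mu + sigma * rng.normal(size=(n_samples, H))
    order = np.argsort([score(x) for x in samples])
    elites = samples[order[-n_elite:]]
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
# mu approaches the optimal sequence (0.3, ..., 0.3)
```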
AntEnv  HalfCheetahEnv  HopperEnv  Walker2dEnv  

MAAC+MPC  
MAAC  

6 Conclusion
In this work, we presented model-augmented actor-critic (MAAC), a reinforcement learning algorithm that makes use of a learned model via the pathwise derivative across future time steps. We prevent the instabilities arising from backpropagation through time by means of a terminal value function. The objective is theoretically analyzed in terms of the model and value errors, and we derive a policy improvement expression with respect to those terms. Our algorithm achieves superior performance and sample efficiency compared to state-of-the-art model-based and model-free reinforcement learning algorithms. For future work, it would be enticing to deploy the presented algorithm on a real robotic agent.
Acknowledgments
This work was supported in part by Berkeley Deep Drive (BDD) and ONR PECASE N000141612723.
References
 Differentiable MPC for end-to-end planning and control. External Links: 1810.13400 Cited by: §2.
 Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5), pp. 834–846. External Links: Document, ISSN Cited by: §2.
 Sample-efficient reinforcement learning with stochastic ensemble value expansion. CoRR abs/1807.01675. External Links: Link, 1807.01675 Cited by: §1, §1, §2, §4.3, §5.1.
 Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114. Cited by: §1, §1, §2, §4.3.
 Model-based reinforcement learning via meta-policy optimization. CoRR abs/1809.05214. External Links: 1809.05214 Cited by: §1, §2.

 PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472. Cited by: §2, §2, §4.3.
 Task-agnostic dynamics priors for deep reinforcement learning. CoRR abs/1905.04819. External Links: Link, 1905.04819 Cited by: §2.
 Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101. Cited by: §2.
 Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §2, §3.1, §4.3.
 Probability and random processes. Vol. 80, Oxford university press. External Links: Link Cited by: §3.2.
 Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §2, §4.1, §5.1.
 Soft actor-critic algorithms and applications. CoRR abs/1812.05905. External Links: 1812.05905 Cited by: §4.3.
 Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952. Cited by: §1, §2, §4.1, §5.1.
 Model-based lookahead reinforcement learning. ArXiv abs/1908.06012. Cited by: §2.
 When to trust your model: model-based policy optimization. CoRR abs/1906.08253. External Links: 1906.08253 Cited by: §1, §1, §2, §2, §4.3, §5.1.
 Approximately optimal approximate reinforcement learning. In IN PROC. 19TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, pp. 267–274. Cited by: §4.2, §4.2.
 Uncertainty-driven imagination for continuous deep reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.), Proceedings of Machine Learning Research, Vol. 78, pp. 195–206. Cited by: §2.
 Differentiable algorithm networks for composable robot learning. CoRR abs/1905.11602. External Links: Link, 1905.11602 Cited by: §2.
 Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.2.
 Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592. Cited by: §1, §1, §2, §2, §4.1.
 Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079. Cited by: §2.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2, §5.1.
 Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3), pp. 293–321. External Links: ISSN 1573-0565, Document Cited by: §3.1.
 Plan online, learn offline: efficient learning and exploration via modelbased control. CoRR abs/1811.01848. External Links: 1811.01848 Cited by: §2.
 Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. ICLR. Cited by: §1, §2.
 Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §2, §4.1.
 Monte carlo gradient estimation in machine learning. External Links: 1906.10652 Cited by: §2, §3.2, §4.1.
 Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596. Cited by: §1, §2.
 Path integral networks: end-to-end differentiable optimal control. External Links: 1706.09597 Cited by: §2.
 MPC-inspired neural network policies for sequential decision making. CoRR abs/1802.05803. External Links: Link, 1802.05803 Cited by: §2.
 Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. , pp. 2219–2225. External Links: Document, ISSN Cited by: §2, §4.1.
 Gradient estimation using stochastic computation graphs. CoRR abs/1506.05254. External Links: Link, 1506.05254 Cited by: §2.
 Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 1889–1897. Cited by: §4.2.
 Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §2.
 Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §2, §4.1.
 Planning by incremental dynamic programming. In Machine Learning Proceedings 1991, pp. 353–357. Cited by: §2.
 Value iteration networks. CoRR abs/1602.02867. External Links: Link, 1602.02867 Cited by: §2.
 Mujoco: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. Cited by: §1, §5.
 Exploring model-based planning with policy networks. CoRR abs/1906.08649. External Links: Link, 1906.08649 Cited by: §2.
 Benchmarking model-based reinforcement learning. CoRR abs/1907.02057. External Links: 1907.02057 Cited by: §1, §5.
Appendix A Appendix
Here we prove the lemmas and theorems stated in the manuscript.
A.1 Proof of Lemma 4.1
Let and be the expected return of the policy under our objective and under the RL objective, respectively. Then, we can write the MSE of the gradient as
whereby, and .
We denote by $\nabla$ the gradient w.r.t. the inputs of the network. Notice that since the dynamics, the policy, and their gradients are Lipschitz continuous, the relevant derivative norms are bounded by a constant $K$, which depends on the Lipschitz constants of the model and the policy. Without loss of generality, we assume that $K$ is larger than 1. Now, we can bound the error on the Q term as
Now, we will bound the term :
Hence, applying this recursion we obtain that
where . Then, the error in the gradient in the previous term is bounded by
In order to bound the model term, we first need to bound the reward terms, since
Similar to the previous bounds, we can now bound each reward term by
With this result, we can bound the total error in the model terms:
Then, the gradient error has the form
A.2 Proof of Lemma 4.2
The total variation distance can be bounded by the KL-divergence using Pinsker's inequality:

$$D_{TV}(\pi_{\theta_1} \| \pi_{\theta_2}) \leq \sqrt{\tfrac{1}{2} D_{KL}(\pi_{\theta_1} \| \pi_{\theta_2})}$$
Then, if we assume third-order smoothness of our policy, by the Fisher information metric theorem we have
Given that $\theta_2 - \theta_1 = \alpha(\nabla_\theta J - \nabla_\theta \hat{J})$, for a small enough step the following inequality holds
Combining this bound with Pinsker's inequality concludes the proof.
A.3 Proof of Theorem 4.1
Given the bound on the total variation distance, we can now make use of the monotonic improvement theorem to establish an improvement bound in terms of the gradient error. Let $J(\theta_1)$ and $J(\theta_2)$ be the expected returns of the policies $\pi_{\theta_1}$ and $\pi_{\theta_2}$ under the true dynamics, and let $\rho_{\theta_1}$ and $\rho_{\theta_2}$ be the discounted state marginals of $\pi_{\theta_1}$ and $\pi_{\theta_2}$, respectively.
Then, combining the results from Lemma 4.2 we obtain the desired bound.
A.4 Ablations
In order to show the significance of each component of MAAC, we conducted further ablation studies. The results are shown in Figure 5. Here, we analyze the effect of training the Q-function with data coming only from the real environment, of not learning a maximum-entropy policy, and of increasing the batch size instead of increasing the number of samples used to estimate the value function.
A.5 Execution Time Comparison
Iteration (s)  Training Model (s)  Optimization (s)  MBPO Iteration (s)  

HalfCheetahEnv  
HopperEnv  
