Model-Augmented Actor-Critic: Backpropagating through Paths

05/16/2020 ∙ by Ignasi Clavera, et al. ∙ berkeley college

Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator to augment the data for policy optimization or value function learning. In this paper, we show how to make more effective use of the model by exploiting its differentiability. We construct a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps. Instabilities of learning across many timesteps are prevented by using a terminal value function, learning the policy in an actor-critic fashion. Furthermore, we present a derivation of the monotonic improvement of our objective in terms of the gradient error in the model and value function. We show that our approach (i) is consistently more sample efficient than existing state-of-the-art model-based algorithms, (ii) matches the asymptotic performance of model-free algorithms, and (iii) scales to long horizons, a regime where past model-based approaches have typically struggled.




1 Introduction

Model-based reinforcement learning (RL) offers the potential to be a general-purpose tool for learning complex policies while being sample efficient. When learning in real-world physical systems, data collection can be an arduous process. In contrast to model-free methods, model-based approaches are appealing due to their comparatively fast learning. Because they first learn the dynamics of the system in a supervised fashion, they can exploit off-policy data. Model-based methods then use the model to derive controllers, either parametric controllers (Luo et al., 2019; Buckman et al., 2018; Janner et al., 2019) or non-parametric controllers (Nagabandi et al., 2017; Chua et al., 2018).

Current model-based methods learn with an order of magnitude less data than their model-free counterparts while achieving the same asymptotic performance. Tools like ensembles, probabilistic models, planning over shorter horizons, and meta-learning have been used to achieve such performance (Kurutach et al., 2018; Chua et al., 2018; Clavera et al., 2018). However, the model usage in all of these methods is the same: simple data augmentation. They use the learned model as a black-box simulator and generate samples from it. In high-dimensional environments or environments that require longer planning, substantial sampling is needed to provide a meaningful signal for the policy. Can we further exploit our learned models?

In this work, we propose to estimate the policy gradient by backpropagating through the model using the pathwise derivative estimator. Since the learned model is differentiable, one can link together the model, reward function, and policy to obtain an analytic expression for the gradient of the returns with respect to the policy. By computing the gradient in this manner, we obtain an expressive signal that allows rapid policy learning. We avoid the instabilities that often result from backpropagating through long horizons by using a terminal Q-function. This scheme fully exploits the learned model without harming learning stability, a problem seen in previous approaches (Kurutach et al., 2018; Heess et al., 2015). The horizon at which we apply the terminal Q-function acts as a hyperparameter that interpolates between a model-free algorithm (when fully relying on the Q-function) and a model-based one (when using a longer horizon).

The main contribution of this work is a model-based method that significantly reduces the sample complexity compared to state-of-the-art model-based algorithms (Janner et al., 2019; Buckman et al., 2018). For instance, we achieve a 10k return in the half-cheetah environment in just 50 trajectories. We theoretically justify our optimization objective and derive the monotonic improvement of our learned policy in terms of the Q-function and the model error. Furthermore, we experimentally analyze the theoretical derivations. Finally, we pinpoint the importance of our objective by ablating all the components of our algorithm. The results are reported in four model-based benchmarking environments (Wang et al., 2019; Todorov et al., 2012). The low sample complexity and high performance of our method hold promise for learning directly on real robots.

2 Related Work

Model-Based Reinforcement Learning. Learned dynamics models offer the possibility to reduce sample complexity while maintaining the asymptotic performance. For instance, the models can act as a learned simulator on which a model-free policy is trained (Kurutach et al., 2018; Luo et al., 2019; Janner et al., 2019). The model can also be used to improve the target value estimates (Feinberg et al., 2018) or to provide additional context to a policy (Du and Narasimhan, 2019). Contrary to these methods, our approach uses the model in a different way: we exploit the fact that the learned simulator is differentiable and optimize the policy with the analytical gradient. Long-term predictions suffer from a compounding error effect in the model, resulting in unrealistic predictions. In such cases, the policy tends to overfit to the deficiencies of the model, which translates to poor performance in the real environment; this problem is known as model-bias (Deisenroth and Rasmussen, 2011). The model-bias problem has motivated work that uses meta-learning (Clavera et al., 2018), interpolation between different horizon predictions (Buckman et al., 2018; Janner et al., 2019), and interpolation between model and real data (Kalweit and Boedecker, 2017). To prevent model-bias, we exploit the model for a short horizon and use a terminal value function to model the rest of the trajectory. Finally, since our approach returns a stochastic policy, a dynamics model, and a value function, it could use model-predictive control (MPC) for better performance at test time, similar to Lowrey et al. (2018) and Hong et al. (2019). MPC methods (Nagabandi et al., 2017) have been shown to be very effective when the uncertainty of the dynamics is modelled (Chua et al., 2018; Wang and Ba, 2019).

Differentiable Planning. Previous work has backpropagated through learned models to obtain optimal sequences of actions. For instance, Levine and Abbeel (2014) learn linear local models and obtain the optimal sequence of actions, which is then distilled into a neural network policy. The planning can be incorporated into the neural network architecture (Okada et al., 2017; Tamar et al., 2016; Srinivas et al., 2018; Karkus et al., 2019) or formulated as a differentiable function (Pereira et al., 2018; Amos et al., 2018). Planning sequences of actions, even when doing model-predictive control (MPC), does not scale well to high-dimensional, complex domains (Janner et al., 2019). Our method instead learns a neural network policy in an actor-critic fashion, aided by a learned model. In our study, we evaluate the benefit of carrying out MPC on top of our learned policy at test time (Section 5.4). The results suggest that the policy captures the optimal sequence of actions, and re-planning does not result in significant benefits.

Policy Gradient Estimation. The reinforcement learning objective involves computing the gradient of an expectation (Schulman et al., 2015a). By using Gaussian processes (Deisenroth and Rasmussen, 2011), it is possible to compute the expectation analytically. However, when learning expressive parametric non-linear dynamical models and policies, such closed-form solutions do not exist. The gradient is then estimated using Monte-Carlo methods (Mohamed et al., 2019). In the context of model-based RL, previous approaches mostly made use of the score-function, or REINFORCE, estimator (Peters and Schaal, 2006; Kurutach et al., 2018). However, this estimator has high variance and requires extensive sampling, which hampers its applicability in high-dimensional environments. In this work, we make use of the pathwise derivative estimator (Mohamed et al., 2019). Similar to our approach, Heess et al. (2015) use this estimator in the context of model-based RL. However, they only use real-world trajectories, which introduces the need for a likelihood ratio term for the model predictions and in turn increases the variance of the gradient estimate. Instead, we rely entirely on the predictions of the model, removing the need for likelihood ratio terms.

Actor-Critic Methods. Actor-critic methods alternate between policy evaluation, which computes the value function for the policy, and policy improvement using that value function (Sutton and Barto, 1998; Barto et al., 1983). Actor-critic methods can be classified as on-policy or off-policy. On-policy methods tend to be more stable, but at the cost of sample efficiency (Sutton, 1991; Mnih et al., 2016). On the other hand, off-policy methods offer better sample complexity (Lillicrap et al., 2015). Recent work has significantly stabilized and improved the performance of off-policy methods using maximum-entropy objectives (Haarnoja et al., 2018a) and multiple value functions (Fujimoto et al., 2018). Our method combines the benefits of both. By using the learned model, we obtain learning that resembles an on-policy method while still being off-policy.

3 Background

In this section, we present the reinforcement learning problem, two different lines of algorithms that tackle it, and a summary of Monte-Carlo gradient estimators.

3.1 Reinforcement Learning

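A discrete-time finite Markov decision process (MDP) $\mathcal{M}$ is defined by the tuple $(\mathcal{S}, \mathcal{A}, f, r, p_0, \gamma, T)$. Here, $\mathcal{S}$ is the set of states, $\mathcal{A}$ the action space, $s_{t+1} \sim f(s_t, a_t)$ the transition distribution, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $p_0$ represents the initial state distribution, $\gamma$ the discount factor, and $T$ is the horizon of the process. We define the return as the sum of rewards $r(s_t, a_t)$ along a trajectory $\tau := (s_0, a_0, \dots, s_{T-1}, a_{T-1}, s_T)$. The goal of reinforcement learning is to find a policy $\pi_\theta$ that maximizes the expected return, i.e., $\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[\sum_t \gamma^t r(s_t, a_t)\right]$.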
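As a small illustration of the return that the policy is trained to maximize, the sketch below computes a discounted sum of rewards for one trajectory. The function name `discounted_return` and the reward values are made up for this example and are not part of the method.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t along one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Toy trajectory with three rewards (made-up numbers).
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The expected return is then the average of this quantity over trajectories sampled with the policy.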

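Actor-Critic. In actor-critic methods, we learn a function $\hat{Q}$ (critic) that approximates the expected return conditioned on a state $s$ and action $a$, $\mathbb{E}\left[\sum_t \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. Then, the learned Q-function is used to optimize a policy $\pi$ (actor). Usually, the Q-function is learned by iteratively minimizing the Bellman residual:

$$J_Q = \mathbb{E}\left[\left(\hat{Q}(s_t, a_t) - \left(r(s_t, a_t) + \gamma \hat{Q}(s_{t+1}, a_{t+1})\right)\right)^2\right]$$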

The above method is referred to as one-step Q-learning, and while a naive implementation often results in unstable behaviour, recent methods have succeeded in stabilizing the Q-function training (Fujimoto et al., 2018). The actor can then be trained to maximize the learned Q-function $\hat{Q}$. The benefit of this form of actor-critic method is that it can be applied in an off-policy fashion, sampling random mini-batches of transitions from an experience replay buffer (Lin, 1992).
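To make the one-step Bellman residual concrete, here is a minimal sketch that evaluates it on a mini-batch of transitions. The batch layout (a dictionary with keys `s`, `a`, `r`, `s_next`, `a_next`) and the toy critic are assumptions made for this example; a practical implementation would use a neural network critic and typically hold the target fixed during each update.

```python
import numpy as np

def bellman_residual(q_fn, batch, gamma):
    """Mean squared one-step Bellman residual over a batch of transitions.

    batch: dict of arrays with keys s, a, r, s_next, a_next (hypothetical layout).
    """
    target = batch["r"] + gamma * q_fn(batch["s_next"], batch["a_next"])
    td_error = q_fn(batch["s"], batch["a"]) - target
    return float(np.mean(td_error ** 2))

# Toy critic Q(s, a) = s + a on scalar states and actions.
toy_q = lambda s, a: s + a
toy_batch = {
    "s": np.array([0.0, 1.0]), "a": np.array([1.0, 0.0]),
    "r": np.array([1.0, 0.5]),
    "s_next": np.array([0.0, 0.0]), "a_next": np.array([0.0, 1.0]),
}
loss = bellman_residual(toy_q, toy_batch, gamma=0.9)
```

Because the loss only needs stored transitions, the batch can be drawn at random from a replay buffer, which is what makes the update off-policy.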

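Model-Based RL. Model-based methods, contrary to model-free RL, learn the transition distribution from experience. Typically, this is carried out by learning a parametric function approximator $\hat{f}_\phi$, known as a dynamics model. We define the state predicted by the dynamics model as $\hat{s}_{t+1}$, i.e., $\hat{s}_{t+1} \sim \hat{f}_\phi(s_t, a_t)$. The models are trained via maximum likelihood:

$$\max_\phi J_f(\phi) = \max_\phi \mathbb{E}\left[\log \hat{f}_\phi(s_{t+1} \mid s_t, a_t)\right]$$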
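For a Gaussian dynamics model, maximizing the likelihood amounts to minimizing the negative log-likelihood of the observed next states. The following is a minimal sketch, assuming a diagonal Gaussian whose mean and log standard deviation would be produced by the model; here they are hard-coded arrays, and all names are illustrative.

```python
import numpy as np

def gaussian_nll(mean, log_std, target):
    """Average negative log-likelihood of targets under a diagonal Gaussian."""
    var = np.exp(2.0 * log_std)
    per_dim = 0.5 * ((target - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return float(np.mean(np.sum(per_dim, axis=-1)))

# Hypothetical model outputs for two transitions with 2-D states.
pred_mean = np.array([[0.0, 1.0], [2.0, 3.0]])
pred_log_std = np.zeros((2, 2))          # unit standard deviation
next_states = np.array([[0.0, 1.0], [2.0, 4.0]])
nll = gaussian_nll(pred_mean, pred_log_std, next_states)
```

Minimizing this quantity with respect to the model parameters is equivalent to the maximum-likelihood training described above.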

3.2 Monte-Carlo Gradient Estimators

In order to optimize the reinforcement learning objective, we need to take the gradient of an expectation. In general, it is not possible to compute the exact expectation, so Monte-Carlo gradient estimators are used instead. These are mainly categorized into three classes: the pathwise, score function, and measure-valued gradient estimators (Mohamed et al., 2019). In this work, we use the pathwise gradient estimator, which is also known as the re-parameterization trick (Kingma and Welling, 2013). This estimator is derived from the law of the unconscious statistician (LOTUS) (Grimmett and Stirzaker, 2001):

$$\mathbb{E}_{p_\theta(x)}\left[h(x)\right] = \mathbb{E}_{p(\epsilon)}\left[h(g_\theta(\epsilon))\right]$$

Here, we have stated that we can compute the expectation of a random variable $x$ without knowing its distribution, if we know its corresponding sampling path $g_\theta$ and base distribution $p(\epsilon)$. In a common case, and the one used in this manuscript, $\theta$ parameterizes a Gaussian distribution: $x \sim p_\theta = \mathcal{N}(\mu_\theta, \sigma_\theta^2)$, which is equivalent to $x = \mu_\theta + \epsilon \sigma_\theta$ for $\epsilon \sim \mathcal{N}(0, 1)$.
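The Gaussian case can be checked numerically. The sketch below estimates the gradient of an expectation of a simple test function, x squared, with the pathwise estimator and compares it against the closed form. The test function and all names are chosen for illustration only.

```python
import numpy as np

def pathwise_gradient(mu, sigma, num_samples, rng):
    """Pathwise estimate of d/dmu and d/dsigma of E[x**2], x ~ N(mu, sigma**2)."""
    eps = rng.standard_normal(num_samples)   # base distribution N(0, 1)
    x = mu + sigma * eps                     # sampling path x = mu + sigma * eps
    dh_dx = 2.0 * x                          # derivative of h(x) = x**2
    grad_mu = np.mean(dh_dx * 1.0)           # dx/dmu = 1
    grad_sigma = np.mean(dh_dx * eps)        # dx/dsigma = eps
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
grad_mu, grad_sigma = pathwise_gradient(mu=1.0, sigma=0.5, num_samples=200_000, rng=rng)
# Closed form: E[x**2] = mu**2 + sigma**2, so the gradients are 2*mu and 2*sigma.
```

The key step is that the derivative is taken through the sample itself, which is possible only because the sample is written as a differentiable function of the parameters and of noise that does not depend on them.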

4 Policy Gradient via Model-Augmented Pathwise Derivative

Exploiting the full capability of learned models has the potential to enable complex and high-dimensional real robotics tasks while maintaining low sample complexity. Our approach, model-augmented actor-critic (MAAC), exploits the learned model by computing the analytic gradient of the returns with respect to the policy. In contrast to sample-based methods, which one can think of as providing directional derivatives in trajectory space, MAAC computes the full gradient, providing a strong learning signal for policy learning, which further decreases the sample complexity.

Figure 1: Stochastic computation graph of the proposed objective $J_\pi$. The stochastic nodes are represented by circles and the deterministic ones by squares.

In the following, we present our policy optimization scheme and describe the full algorithm.

4.1 Model-Augmented Actor-Critic Objective

Among model-free methods, actor-critic methods have shown superior performance in terms of sample efficiency and asymptotic performance (Haarnoja et al., 2018a). However, their sample efficiency remains worse than that of model-based approaches, and fully off-policy methods still show instabilities compared to on-policy algorithms (Mnih et al., 2016). Here, we propose a modification of the Q-function parametrization by using the model predictions on the first $H$ time-steps after the action is taken. Specifically, we do policy optimization by maximizing the following objective:

$$J_\pi(\theta) = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t r(\hat{s}_t, a_t) + \gamma^H \hat{Q}(\hat{s}_H, a_H)\right]$$

whereby $\hat{s}_{t+1} \sim \hat{f}(\hat{s}_t, a_t)$ and $a_t \sim \pi_\theta(\hat{s}_t)$. Note that under the true dynamics and Q-function, this objective is the same as the RL objective. Contrary to previous reinforcement learning methods, we optimize this objective by back-propagation through time. Since the learned dynamics model and policy are parameterized as Gaussian distributions, we can make use of the pathwise derivative estimator to compute the gradient, resulting in an objective that captures uncertainty while presenting low variance. The computational graph of the proposed objective is shown in Figure 1.
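To illustrate how a gradient can be propagated through a model for a few steps and then through a terminal Q-function, here is a deliberately simplified sketch. It assumes a deterministic scalar linear model, a linear policy with a single gain `k`, and quadratic reward and terminal Q-function; all constants are hypothetical. For brevity the chain rule is applied in forward mode (tracking the sensitivity of the state to `k`), which yields the same derivative as reverse-mode backpropagation.

```python
import numpy as np

A, B = 0.9, 0.5          # hypothetical learned model: s' = A*s + B*a
GAMMA = 0.99

def reward(s, a):
    return -(s ** 2 + 0.1 * a ** 2)

def q_terminal(s, a):
    # Hypothetical learned terminal Q-function (quadratic).
    return -(2.0 * s ** 2 + 0.1 * a ** 2)

def maac_objective_and_grad(k, s0, horizon):
    """Objective for the policy a = -k*s over `horizon` model steps, and d/dk."""
    s, ds = s0, 0.0          # state and its sensitivity ds/dk
    obj, grad = 0.0, 0.0
    for t in range(horizon):
        a, da = -k * s, -s - k * ds
        obj += GAMMA ** t * reward(s, a)
        grad += GAMMA ** t * -(2.0 * s * ds + 0.2 * a * da)
        s, ds = A * s + B * a, A * ds + B * da   # step the model and its derivative
    a, da = -k * s, -s - k * ds
    obj += GAMMA ** horizon * q_terminal(s, a)
    grad += GAMMA ** horizon * -(4.0 * s * ds + 0.2 * a * da)
    return obj, grad

obj, grad = maac_objective_and_grad(k=0.3, s0=1.0, horizon=5)
# Finite-difference check of the analytic gradient.
h = 1e-6
fd = (maac_objective_and_grad(0.3 + h, 1.0, 5)[0]
      - maac_objective_and_grad(0.3 - h, 1.0, 5)[0]) / (2 * h)
```

With `horizon=0` the loop is skipped and only the terminal Q-function contributes, which corresponds to a purely model-free update. The stochastic case would additionally sample reparameterized noise for the policy and the model at each step.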

While the proposed objective resembles n-step bootstrap (Sutton and Barto, 1998), our model usage fundamentally differs from previous approaches. First, we do not compromise between being off-policy and stability. Typically, n-step bootstrap is either on-policy, which harms the sample complexity, or its gradient estimation uses likelihood ratios, which exhibit large variance and result in unstable learning (Heess et al., 2015). Second, we obtain a strong learning signal by backpropagating the gradient of the policy across multiple steps using the pathwise derivative estimator, instead of the REINFORCE estimator (Mohamed et al., 2019; Peters and Schaal, 2006). Finally, we prevent the exploding and vanishing gradients effect inherent to back-propagation through time by means of the terminal Q-function (Kurutach et al., 2018).

The horizon $H$ in our proposed objective allows us to trade off between the accuracy of our learned model and the accuracy of our learned Q-function. Hence, it controls the degree to which our algorithm is model-based or model-free. If we were not to trust our model at all ($H = 0$), we would end up with a model-free update; for $H \to \infty$, the objective results in a shooting objective. Note that we perform policy optimization by taking derivatives of the objective; hence, we require accuracy on the derivatives of the objective and not on its value. The following lemma provides a bound on the gradient error in terms of the error on the derivatives of the model, the error on the derivatives of the Q-function, and the horizon $H$.

Lemma 4.1 (Gradient Error).

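Let $\hat{f}$ and $\hat{Q}$ be the learned approximations of the dynamics $f$ and the Q-function $Q$, respectively. Assume that $Q$ and $\hat{Q}$ have $L_q/2$-Lipschitz continuous gradient and $f$ and $\hat{f}$ have $L_f/2$-Lipschitz continuous gradient. Let $\epsilon_f$ be the error on the model derivatives and $\epsilon_Q$ the error on the Q-function derivative. Let $\hat{J}_\pi$ denote the objective computed with $\hat{f}$ and $\hat{Q}$, and $J_\pi$ the same objective computed with $f$ and $Q$. Then the error on the gradient between the learned objective and the true objective can be bounded by:

$$\left\| \nabla_\theta J_\pi - \nabla_\theta \hat{J}_\pi \right\| \leq c_1(H)\, \epsilon_f + c_2(H)\, \epsilon_Q$$

Proof. See Appendix. ∎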

The result in Lemma 4.1 bounds the error of the policy gradient in terms of the maximum error in the model derivatives and the error in the Q-function derivatives. The functions $c_1$ and $c_2$ are functions of the horizon and depend on the Lipschitz constants of the model and the Q-function. Note that we are only interested in the relation between both sources of error, since the gradient magnitude will be scaled by the learning rate, or by the optimizer, when applying it to the weights.

4.2 Monotonic Improvement

In the previous section, we presented our objective and the error it incurs in the policy gradient with respect to approximation error in the model and the Q-function. However, the error on the gradient is not indicative of the effect on the desired metric: the average return. Here, we quantify the effect of the modeling error on the return. First, we bound the total variation distance between the policies resulting from taking a gradient step with the true objective and with the approximated one. Then we bound the performance in terms of this distance.

Lemma 4.2 (Total Variation Bound).

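Under the assumptions of Lemma 4.1, let $\theta = \theta_o + \alpha \nabla_\theta J_\pi$ be the parameters resulting from taking a gradient step on the exact objective, and $\hat{\theta} = \theta_o + \alpha \nabla_\theta \hat{J}_\pi$ the parameters resulting from taking a gradient step on the approximated objective, where $\alpha \in \mathbb{R}^+$ is the step size and $\theta_o$ are the current parameters. Then the following bound on the total variation distance holds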


See Appendix. ∎

The previous lemma bounds the distance between the policy obtained by taking a gradient step using the true dynamics and Q-function and the policy obtained using their learned counterparts. Now, we can derive a result similar to that of Kakade and Langford (2002) to bound the difference in average returns.

Theorem 4.1 (Monotonic Improvement).

Under the assumptions of Lemma 4.1, let $\theta$ and $\hat{\theta}$ be as defined in Lemma 4.2, and assume that the reward is bounded by $r_{\max}$. Then the average return of $\pi_{\hat{\theta}}$ satisfies


See Appendix. ∎

Hence, we can provide explicit lower bounds of improvement in terms of model error and function error. Theorem 4.1 extends previous work on monotonic improvement for model-free policies (Schulman et al., 2015b; Kakade and Langford, 2002) to the model-based and actor-critic setup by taking the error on the learned functions into account. From this bound one could, in principle, derive the optimal horizon $H$ that minimizes the gradient error. However, in practice, approximation errors are hard to determine and we treat $H$ as an extra hyper-parameter. In Section 5.2, we experimentally analyze the error on the gradient for different estimators and values of $H$.

4.3 Algorithm

Based on the previous sections, we develop a new algorithm that explicitly optimizes the model-augmented actor-critic (MAAC) objective. The overall algorithm is divided into three main steps: model learning, policy optimization, and Q-function learning.

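Algorithm 1 MAAC
Initialize the policy $\pi_\theta$, model $\hat{f}_\phi$, Q-function $\hat{Q}_\psi$, environment dataset $\mathcal{D}_{\text{env}} \leftarrow \emptyset$, and the model dataset $\mathcal{D}_{\text{model}} \leftarrow \emptyset$
repeat
   Sample trajectories from the real environment with policy $\pi_\theta$. Add them to $\mathcal{D}_{\text{env}}$.
   for each model training step do
      Update $\phi$ using data from $\mathcal{D}_{\text{env}}$.
   end for
   Sample trajectories from $\hat{f}_\phi$. Add them to $\mathcal{D}_{\text{model}}$.
   for each policy and Q-function training step do
      Update $\theta$ using data from $\mathcal{D}_{\text{env}} \cup \mathcal{D}_{\text{model}}$
      Update $\psi$ using data from $\mathcal{D}_{\text{env}} \cup \mathcal{D}_{\text{model}}$
   end for
until the policy performs well in the real environment
return Optimal parameters $\theta^*$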

Model learning. In order to prevent overfitting and overcome model-bias (Deisenroth and Rasmussen, 2011), we use a bootstrap ensemble of dynamics models $\{\hat{f}_{\phi_1}, \dots, \hat{f}_{\phi_M}\}$. Each of the dynamics models parameterizes the mean and the covariance of a Gaussian distribution with diagonal covariance. The bootstrap ensemble captures the epistemic uncertainty (uncertainty due to limited capacity or data), while the probabilistic models are able to capture the aleatoric uncertainty (the inherent uncertainty of the environment) (Chua et al., 2018). We denote by $\hat{f}_\phi$ the transition dynamics resulting from $\hat{f}_{\phi_U}$, where $U$ is a uniform random variable on $\{1, \dots, M\}$. The dynamics models are trained via maximum likelihood with early stopping on a validation set.
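Sampling a transition from such an ensemble can be sketched as follows: pick one member uniformly at random, then draw from its Gaussian prediction. The toy members below are stand-ins for trained networks, and every name and constant is an assumption made for this example.

```python
import numpy as np

def sample_next_state(ensemble, s, a, rng):
    """Sample s' by picking one ensemble member uniformly, then sampling its Gaussian."""
    member = ensemble[rng.integers(len(ensemble))]
    mean, std = member(s, a)
    return mean + std * rng.standard_normal(mean.shape)   # reparameterized sample

def make_toy_member(gain):
    # Hypothetical stand-in for a trained network: linear mean, constant std.
    def member(s, a):
        return gain * s + a, 0.1 * np.ones_like(s)
    return member

ensemble = [make_toy_member(g) for g in (0.9, 1.0, 1.1)]
rng = np.random.default_rng(0)
s, a = np.array([1.0, -1.0]), np.array([0.0, 0.5])
samples = np.stack([sample_next_state(ensemble, s, a, rng) for _ in range(5000)])
```

In this sketch, disagreement between members plays the role of epistemic uncertainty, while the per-member standard deviation plays the role of aleatoric uncertainty.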

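Policy Optimization. We extend the MAAC objective with an entropy bonus (Haarnoja et al., 2018b), and perform policy learning by maximizing

$$J_\pi(\theta) = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t r(\hat{s}_t, a_t) + \gamma^H \hat{Q}_\psi(\hat{s}_H, a_H)\right] + \beta \mathcal{H}(\pi_\theta)$$

where $\hat{s}_{t+1} \sim \hat{f}_\phi(\hat{s}_t, a_t)$, $\hat{s}_0 \sim \mathcal{D}_{\text{env}}$ (the buffer of real-environment data), and $a_t \sim \pi_\theta(\hat{s}_t)$; $\beta$ weights the entropy $\mathcal{H}(\pi_\theta)$ of the policy. We learn a maximum-entropy policy by using the pathwise derivative of the model through $H$ steps and of the Q-function, sampling multiple trajectories from the same $\hat{s}_0$. We compute the expectation by sampling multiple actions and states from the policy and learned dynamics, respectively.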
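As a small illustration of the entropy bonus, the sketch below computes the entropy of a diagonal Gaussian policy and adds it to an estimate of the model-based return. The weight `beta` and the function names are assumptions made for this example, and the entropy is evaluated for a single state rather than averaged over states.

```python
import numpy as np

def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian with the given log std per dimension."""
    return float(np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e)))

def entropy_regularized_objective(model_return, log_std, beta):
    # model_return: Monte-Carlo estimate of the model rollout return plus terminal Q-value.
    return model_return + beta * gaussian_entropy(log_std)

bonus = gaussian_entropy(np.zeros(2))   # two action dimensions with unit std
```

A larger log standard deviation increases the bonus, which encourages the policy to keep exploring.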

Q-function Learning. In practice, we train two Q-functions (Fujimoto et al., 2018), since this has been shown experimentally to yield better results. We train both Q-functions by minimizing the Bellman error (Section 3.1):

$$J_Q(\psi) = \mathbb{E}\left[\left(\hat{Q}_\psi(s_t, a_t) - \left(r(s_t, a_t) + \gamma \hat{Q}_\psi(s_{t+1}, a_{t+1})\right)\right)^2\right]$$

Similar to Janner et al. (2019), we minimize the Bellman residual on states previously visited and on imagined states obtained from unrolling the learned model. Finally, the value targets are obtained in the same fashion as in Stochastic Ensemble Value Expansion (Buckman et al., 2018), using $H$ as the horizon for the expansion. In doing so, we make maximal use of the model by using it not only for the policy gradient step, but also for training the Q-function.
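A simplified value-expansion target can be sketched as below: discounted model rewards followed by a discounted terminal Q-value. Taking the minimum of the two critics follows Fujimoto et al. (2018) and is an assumption of this sketch, as are all names; the uncertainty-based combination of several horizons used by Stochastic Ensemble Value Expansion is omitted.

```python
import numpy as np

def value_expansion_target(rewards, q1_final, q2_final, gamma):
    """Target built from model rewards plus a discounted terminal Q-value.

    Using min(q1, q2) as the bootstrap value is an assumption of this sketch.
    """
    horizon = len(rewards)
    discounts = gamma ** np.arange(horizon)
    bootstrap = np.minimum(q1_final, q2_final)
    return float(np.sum(discounts * np.asarray(rewards)) + gamma ** horizon * bootstrap)

target = value_expansion_target(rewards=[1.0, 1.0], q1_final=10.0, q2_final=8.0, gamma=0.5)
```

Each Q-function would then be regressed towards such targets on both real and imagined states.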

Our method, MAAC, iterates between collecting samples from the environment, model training, policy optimization, and Q-function learning. A practical implementation of our method is described in Algorithm 1. First, we obtain trajectories from the real environment using the latest policy available. Those samples are appended to a replay buffer $\mathcal{D}_{\text{env}}$, on which the dynamics models are trained until convergence. The third step is to collect imaginary data from the models: we collect multi-step transitions by unrolling the latest policy from a randomly sampled state in the replay buffer. The imaginary data constitutes the model dataset $\mathcal{D}_{\text{model}}$, which, together with the replay buffer, is used to learn the Q-function and train the policy.
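The control flow of this iteration can be sketched as a plain loop. Every callable below is a user-supplied placeholder, the fixed iteration count stands in for the stopping criterion, and the stubs in the dry run only count how often each step is invoked; none of this is the authors' implementation.

```python
def maac_loop(collect_real, train_model, rollout_model, update_policy, update_q,
              num_iterations, model_steps, policy_steps):
    """Skeleton of the outer loop; every callable is a user-supplied placeholder."""
    d_env, d_model = [], []
    for _ in range(num_iterations):           # stands in for "repeat ... until"
        d_env.extend(collect_real())          # real-environment trajectories
        for _ in range(model_steps):
            train_model(d_env)                # fit the dynamics on real data
        d_model.extend(rollout_model(d_env))  # imagined transitions
        data = d_env + d_model
        for _ in range(policy_steps):
            update_policy(data)
            update_q(data)
    return d_env, d_model

# Dry run with stubs that only count calls.
calls = {"model": 0, "policy": 0, "q": 0}
def _count(name):
    def fn(_data):
        calls[name] += 1
    return fn

d_env, d_model = maac_loop(
    collect_real=lambda: ["real"],
    train_model=_count("model"),
    rollout_model=lambda d: ["imagined"] * len(d),
    update_policy=_count("policy"),
    update_q=_count("q"),
    num_iterations=2, model_steps=3, policy_steps=4,
)
```

Keeping the real and imagined data in separate containers makes it easy to train the model on real data only while the policy and Q-function see both.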

Our algorithm consolidates the insights built through the course of this paper, while at the same time making maximal use of recently developed actor-critic and model-based methods. All in all, it consistently outperforms previous model-based and actor-critic methods.

5 Results

Our experimental evaluation aims to examine the following questions: 1) How does MAAC compare against state-of-the-art model-based and model-free methods? 2) Does the gradient error correlate with the derived bound? 3) Which are the key components of its performance? 4) Does it benefit from planning at test time?

In order to answer the posed questions, we evaluate our approach on model-based continuous control benchmark tasks in the MuJoCo simulator (Todorov et al., 2012; Wang et al., 2019).

5.1 Comparison Against State-of-the-Art

We compare our method on sample complexity and asymptotic performance against state-of-the-art model-free (MF) and model-based (MB) baselines. Specifically, we compare against the model-free soft actor-critic (SAC) (Haarnoja et al., 2018a), which is an off-policy algorithm that has been shown to be sample efficient and performant, as well as two state-of-the-art model-based baselines: model-based policy-optimization (MBPO) (Janner et al., 2019) and stochastic ensemble value expansion (STEVE) (Buckman et al., 2018). The original STEVE algorithm builds on top of the model-free algorithm DDPG (Lillicrap et al., 2015); however, this algorithm is outperformed by SAC. In order to remove confounding effects of the underlying model-free algorithm, we have implemented the STEVE algorithm on top of SAC. We also add SVG(1) (Heess et al., 2015) to the comparison, which, similar to our method, uses the derivative of dynamics models to learn the policy.

The results, shown in Fig. 2, highlight the strength of MAAC in terms of performance and sample complexity. MAAC scales to higher-dimensional tasks while maintaining its sample efficiency and asymptotic performance. In all four environments, our method learns faster than previous MB and MF methods. We are able to learn near-optimal policies in the half-cheetah environment in just over 50 rollouts, while previous model-based methods need at least twice as much data. Furthermore, in complex environments, such as ant, MAAC achieves near-optimal performance within 150 rollouts, while others take orders of magnitude more data.

Figure 2: Comparison against state-of-the-art model-free and model-based baselines in four different MuJoCo environments. Our method, MAAC, attains better sample efficiency and asymptotic performance than previous approaches. The gap in performance between MAAC and previous work increases as the environments increase in complexity.
Figure 3: Error on the policy gradient when using the proposed objective for different values of the horizon $H$, as well as the error obtained when using the true dynamics. The results are consistent with the assumption that the error in the learned dynamics is lower than the error in the Q-function, and with the derived bounds.

5.2 Gradient Error

Here, we investigate how the bounds obtained relate to the empirical performance. In particular, we study the effect of the horizon of the model predictions on the gradient error. In order to do so, we construct a double integrator environment; since the transitions are linear and the cost is quadratic, for a linear policy we can obtain the analytic gradient of the expected return.
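A double integrator of this kind can be written in a few lines. The sketch below assumes an Euler discretization with step `DT`, a particular quadratic cost, and a linear policy with gain matrix `K`; these specifics are illustrative choices, not the exact setup used in the experiments.

```python
import numpy as np

DT = 0.1
A = np.array([[1.0, DT], [0.0, 1.0]])   # position integrates velocity
B = np.array([[0.0], [DT]])             # velocity integrates the force input

def rollout_return(K, s0, horizon, gamma=0.99):
    """Discounted return of the linear policy a = -K s under a quadratic cost."""
    s, total = np.asarray(s0, dtype=float), 0.0
    for t in range(horizon):
        a = -K @ s
        total += gamma ** t * -(s @ s + 0.1 * a @ a)
        s = A @ s + B @ a
    return total

K = np.array([[1.0, 1.5]])
ret = rollout_return(K, s0=[1.0, 0.0], horizon=50)
```

Because every operation is linear or quadratic in the gain, the return is a polynomial in `K`, which is why an exact reference gradient is available in this environment.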

Figure 3 depicts the error of the MAAC objective for different values of the horizon $H$, as well as what the error would be using the true dynamics. As expected, using the true dynamics yields lower gradient error, since the only source of error is the learned Q-function, which is weighted down by $\gamma^H$. The results using learned dynamics correlate with our assumptions and the derived bounds: the error from the learned dynamics is lower than the one in the Q-function, but it scales poorly with the horizon. For short horizons, the error decreases as we increase the horizon. However, a large horizon is detrimental since it magnifies the error on the models.

5.3 Ablations

In order to investigate the importance of each of the components of our overall algorithm, we carry out an ablation test. Specifically, we test three different components: 1) not using the model to train the policy, i.e., setting $H = 0$, 2) not using the STEVE targets for training the critic, and 3) using a single-sample estimate of the pathwise derivative.

The ablation test is shown in Figure 4. The test underpins the importance of backpropagating through the model: setting $H$ to 0 causes a severe drop in the algorithm's performance. On the other hand, using the STEVE targets results in slightly more stable training, but it does not have a significant effect. Finally, while single-sample estimates can be used in simple environments, they are not accurate enough in higher-dimensional environments such as ant.

Figure 4: Ablation test of our method. We test the importance of several components of our method: not using the model to train the policy ($H = 0$), not using the STEVE targets for training the Q-function (-STEVE), and using a single-sample estimate of the pathwise derivative. Using the model is the component that affects performance the most, highlighting the importance of our derived estimator.

5.4 Model Predictive Control

One of the key benefits of methods that combine model-based reinforcement learning and actor-critic methods is that the optimization procedure results in a stochastic policy, a dynamics model, and a Q-function. Hence, we have all the components needed to refine the action selection at test time by means of model predictive control (MPC). Here, we investigate the improvement in performance of planning at test time. Specifically, we use the cross-entropy method with our stochastic policy as the initial distribution. The results, shown in Table 1, show benefits of online planning in complex domains; however, the gains are more modest in easier domains, showing that the learned policy has already internalized the optimal behaviour.

AntEnv HalfCheetahEnv HopperEnv Walker2dEnv

Table 1: Performance at test time with (maac+mpc) and without (maac) planning of the converged policy using the MAAC objective.
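The test-time planning step can be sketched with a basic cross-entropy method over action sequences. In the sketch below the initial mean and standard deviation would come from the learned stochastic policy; here they are fixed arrays, and the toy model, reward, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np

def plan_return(s0, plan, model, reward):
    """Return of an open-loop action sequence under the given model."""
    s, total = s0, 0.0
    for a in plan:
        total += reward(s, a)
        s = model(s, a)
    return total

def cem_plan(s0, model, reward, init_mean, init_std, rng,
             iterations=10, population=200, num_elites=20):
    """Cross-entropy method over action sequences, started from a given Gaussian."""
    mean = np.array(init_mean, dtype=float)
    std = np.array(init_std, dtype=float)
    for _ in range(iterations):
        plans = mean + std * rng.standard_normal((population, mean.size))
        returns = np.array([plan_return(s0, p, model, reward) for p in plans])
        elites = plans[np.argsort(returns)[-num_elites:]]      # best plans
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy 1-D problem (hypothetical): drive the state to zero.
model = lambda s, a: s + a
reward = lambda s, a: -(s + a) ** 2
rng = np.random.default_rng(0)
plan = cem_plan(1.0, model, reward, init_mean=np.zeros(3), init_std=np.ones(3), rng=rng)
```

In an MPC loop only the first action of the plan would be executed before re-planning from the next observed state.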

6 Conclusion

In this work, we present model-augmented actor-critic (MAAC), a reinforcement learning algorithm that makes use of a learned model by using the pathwise derivative across future timesteps. We prevent the instabilities arising from backpropagation through time by means of a terminal value function. The objective is theoretically analyzed in terms of the model and value error, and we derive a policy improvement expression with respect to those terms. Our algorithm, which builds on the MAAC objective, achieves better performance and sample efficiency than state-of-the-art model-based and model-free reinforcement learning algorithms. For future work, it would be enticing to deploy the presented algorithm on a real robotic agent.


This work was supported in part by Berkeley Deep Drive (BDD) and ONR PECASE N000141612723.


  • B. Amos, I. D. J. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter (2018) Differentiable mpc for end-to-end planning and control. External Links: 1810.13400 Cited by: §2.
  • A. G. Barto, R. S. Sutton, and C. W. Anderson (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5), pp. 834–846. External Links: Document, ISSN Cited by: §2.
  • J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. CoRR abs/1807.01675. External Links: Link, 1807.01675 Cited by: §1, §1, §2, §4.3, §5.1.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114. Cited by: §1, §1, §2, §4.3.
  • I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel (2018) Model-based reinforcement learning via meta-policy optimization. CoRR abs/1809.05214. External Links: 1809.05214 Cited by: §1, §2.
  • M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In

    Proceedings of the 28th International Conference on machine learning (ICML-11)

    pp. 465–472. Cited by: §2, §2, §4.3.
  • Y. Du and K. Narasimhan (2019) Task-agnostic dynamics priors for deep reinforcement learning. CoRR abs/1905.04819. External Links: Link, 1905.04819 Cited by: §2.
  • V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine (2018) Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101. Cited by: §2.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §2, §3.1, §4.3.
  • G.R. Grimmett and D.R. Stirzaker (2001) Probability and random processes. Vol. 80, Oxford university press. External Links: Link Cited by: §3.2.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018a) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §2, §4.1, §5.1.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2018b) Soft actor-critic algorithms and applications. CoRR abs/1812.05905. External Links: 1812.05905 Cited by: §4.3.
  • N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952. Cited by: §1, §2, §4.1, §5.1.
  • Z. Hong, J. Pajarinen, and J. Peters (2019) Model-based lookahead reinforcement learning. ArXiv abs/1908.06012. Cited by: §2.
  • M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. CoRR abs/1906.08253. External Links: 1906.08253 Cited by: §1, §1, §2, §2, §4.3, §5.1.
  • S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, pp. 267–274. Cited by: §4.2, §4.2.
  • G. Kalweit and J. Boedecker (2017) Uncertainty-driven imagination for continuous deep reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.), Proceedings of Machine Learning Research, Vol. 78, pp. 195–206. Cited by: §2.
  • P. Karkus, X. Ma, D. Hsu, L. P. Kaelbling, W. S. Lee, and T. Lozano-Pérez (2019) Differentiable algorithm networks for composable robot learning. CoRR abs/1905.11602. External Links: 1905.11602 Cited by: §2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.2.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592. Cited by: §1, §1, §2, §2, §4.1.
  • S. Levine and P. Abbeel (2014) Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079. Cited by: §2.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2, §5.1.
  • L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3), pp. 293–321. External Links: ISSN 1573-0565 Cited by: §3.1.
  • K. Lowrey, A. Rajeswaran, S. M. Kakade, E. Todorov, and I. Mordatch (2018) Plan online, learn offline: efficient learning and exploration via model-based control. CoRR abs/1811.01848. External Links: 1811.01848 Cited by: §2.
  • Y. Luo, H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma (2019) Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. ICLR. Cited by: §1, §2.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §2, §4.1.
  • S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih (2019) Monte Carlo gradient estimation in machine learning. External Links: 1906.10652 Cited by: §2, §3.2, §4.1.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2017) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596. Cited by: §1, §2.
  • M. Okada, L. Rigazio, and T. Aoshima (2017) Path integral networks: end-to-end differentiable optimal control. External Links: 1706.09597 Cited by: §2.
  • M. Pereira, D. D. Fan, G. N. An, and E. A. Theodorou (2018) MPC-inspired neural network policies for sequential decision making. CoRR abs/1802.05803. External Links: 1802.05803 Cited by: §2.
  • J. Peters and S. Schaal (2006) Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2219–2225. Cited by: §2, §4.1.
  • J. Schulman, N. Heess, T. Weber, and P. Abbeel (2015a) Gradient estimation using stochastic computation graphs. CoRR abs/1506.05254. External Links: 1506.05254 Cited by: §2.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015b) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897. Cited by: §4.2.
  • A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn (2018) Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §2.
  • R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §2, §4.1.
  • R. S. Sutton (1991) Planning by incremental dynamic programming. In Machine Learning Proceedings 1991, pp. 353–357. Cited by: §2.
  • A. Tamar, S. Levine, and P. Abbeel (2016) Value iteration networks. CoRR abs/1602.02867. External Links: 1602.02867 Cited by: §2.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. Cited by: §1, §5.
  • T. Wang and J. Ba (2019) Exploring model-based planning with policy networks. CoRR abs/1906.08649. External Links: 1906.08649 Cited by: §2.
  • T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. CoRR abs/1907.02057. External Links: 1907.02057 Cited by: §1, §5.

Appendix A Appendix

Here we prove the lemmas and theorems stated in the manuscript.

A.1 Proof of Lemma 4.1

Let and be the expected return of the policy under our objective and under the RL objective, respectively. Then, we can write the MSE of the gradient as

whereby, and .
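As a toy illustration of the quantity this proof bounds, the sketch below compares the gradient of an H-step objective under the true dynamics with the gradient under a perturbed "learned" model. Everything in it is a hypothetical stand-in rather than the paper's setup: a 1-D linear system, a linear policy a = theta * s, quadratic reward and terminal Q-function, and an objective assumed to be the discounted H-step reward sum plus a discounted terminal Q-value. The gradient is estimated by central finite differences instead of backpropagation, purely to keep the example dependency-free.

```python
def rollout_objective(theta, dyn_gain, horizon=5, gamma=0.99, s0=1.0):
    """H-step discounted return plus a discounted terminal Q-value.

    Hypothetical components (assumptions, not from the paper):
    dynamics s' = dyn_gain * s + a, policy a = theta * s,
    reward r = -(s^2 + 0.1 * a^2), terminal Q = -(s^2 + a^2).
    """
    s, total = s0, 0.0
    for t in range(horizon):
        a = theta * s
        total += gamma ** t * -(s * s + 0.1 * a * a)
        s = dyn_gain * s + a
    a = theta * s
    return total + gamma ** horizon * -(s * s + a * a)


def grad_fd(theta, dyn_gain, eps=1e-5):
    """Central finite-difference estimate of dJ/dtheta."""
    up = rollout_objective(theta + eps, dyn_gain)
    down = rollout_objective(theta - eps, dyn_gain)
    return (up - down) / (2.0 * eps)


def gradient_sq_error(theta, true_gain, model_gain):
    """Squared error between the true-dynamics and learned-model gradients."""
    diff = grad_fd(theta, true_gain) - grad_fd(theta, model_gain)
    return diff * diff


if __name__ == "__main__":
    theta = -0.3
    for model_gain in (0.9, 0.95, 1.0):
        err = gradient_sq_error(theta, true_gain=0.9, model_gain=model_gain)
        print(f"model gain {model_gain}: squared gradient error {err:.6f}")
```

When the model gain equals the true gain the two rollouts coincide and the error is zero; any mismatch in the dynamics shows up as a nonzero gradient error, which is the effect the lemma quantifies.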

We will denote as the gradient w.r.t the inputs of network, and ; where . Notice that since and are Lipschitz and their gradient is Lipschitz as well, we have that , where K depends on the Lipschitz constants of the model and the policy. Without loss of generality, we assume that K is larger than 1. Now, we can bound the error on the Q as

Now, we will bound the term :

Hence, applying this recursion we obtain that

where . Then, the error in the gradient in the previous term is bounded by

In order to bound the model term we need first to bound the rewards since

Similar to the previous bounds, we can bound now each reward term by

With this result we can bound the total error in models

Then, the gradient error has the form

A.2 Proof of Lemma 4.2

The total variation distance can be bounded by the KL divergence using Pinsker's inequality
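For reference, the standard form of Pinsker's inequality is given below. The symbols p and q are generic probability distributions and are not necessarily the notation used in the paper; the total variation distance is taken as the supremum of |p(A) - q(A)| over events A, and the KL divergence is measured in nats.

```latex
D_{TV}(p, q) \;\le\; \sqrt{\tfrac{1}{2}\, D_{KL}(p \,\|\, q)}
```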

Then if we assume third order smoothness on our policy, by the Fisher information metric theorem then
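The standard second-order expansion that relates the KL divergence to the Fisher information is sketched below. The notation is an assumption made for illustration: \delta is a parameter perturbation, F(\theta) is the Fisher information matrix of the policy, and the dependence on the state is omitted for brevity. The cubic remainder is where the smoothness assumption enters.

```latex
D_{KL}\!\left(\pi_\theta \,\|\, \pi_{\theta+\delta}\right)
  = \tfrac{1}{2}\, \delta^\top F(\theta)\, \delta + O\!\left(\|\delta\|^3\right),
\qquad
F(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)^\top
\right]
```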

Given that , for a small enough step the following inequality holds

Combining this bound with the Pinsker inequality
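The two ingredients of this proof can be sanity-checked numerically. The sketch below assumes a hypothetical one-parameter Bernoulli "policy" with success probability sigmoid(theta), which is not the policy class used in the paper. It checks that the total variation distance stays below the Pinsker bound and that, for a small step delta, the exact KL divergence is close to the Fisher approximation 0.5 * delta^2 * F(theta).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def bernoulli_tv(p, q):
    """Total variation distance between Bern(p) and Bern(q)."""
    return abs(p - q)


def check_step(theta, delta):
    """Return (TV, Pinsker bound, exact KL, Fisher approximation of KL)."""
    p, q = sigmoid(theta), sigmoid(theta + delta)
    kl = bernoulli_kl(p, q)
    # Fisher information of Bern(sigmoid(theta)) with respect to theta.
    fisher = p * (1.0 - p)
    return bernoulli_tv(p, q), math.sqrt(0.5 * kl), kl, 0.5 * delta * delta * fisher


if __name__ == "__main__":
    for delta in (0.01, 0.1, 1.0):
        tv, bound, kl, approx = check_step(0.3, delta)
        print(f"delta={delta}: TV={tv:.6f} bound={bound:.6f} KL={kl:.3e} approx={approx:.3e}")
```

The Fisher approximation degrades as delta grows, which is why the argument is restricted to a small enough step; the Pinsker bound itself holds for any step size.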

A.3 Proof of Theorem 4.1

Given the bound on the total variation distance, we can now make use of the monotonic improvement theorem to establish an improvement bound in terms of the gradient error. Let and be the expected return of the policy and under the true dynamics. Let and be the discounted state marginal for the policy and , respectively

Then, combining the results from Lemma 4.2 we obtain the desired bound.

A.4 Ablations

To show the significance of each component of MAAC, we conducted further ablation studies. The results are shown in Figure 5. Here, we analyze the effect of training the Q-function with data coming only from the real environment, of not learning a maximum entropy policy, and of increasing the batch size instead of increasing the number of samples used to estimate the value function.

Figure 5: We further test the significance of several components of our method: training the policy and Q-functions only on real data sampled from the environment, without using the dynamics model to generate data (real_data); removing the entropy term from the optimization objective (no_entropy); and using a single-sample estimate of the pathwise derivative while increasing the batch size accordingly (5x batch size). Both the entropy term and the use of the dynamics model to augment the data set are very important.

A.5 Execution Time Comparison

Iteration (s) Training Model (s) Optimization (s) MBPO Iteration (s)

Table 2: Time required by the different parts of MAAC for one training iteration after 6000 time steps, averaged across 4 seeds. The time MBPO needs for one iteration is included for comparison.