How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

by Pierluca D'Oro, et al.

Deterministic-policy actor-critic algorithms for continuous control improve the actor by plugging its actions into the critic and ascending the action-value gradient, which is obtained by chaining the actor's Jacobian matrix with the gradient of the critic w.r.t. input actions. However, instead of gradients, the critic is, typically, only trained to accurately predict expected returns, which, on their own, are useless for policy optimization. In this paper, we propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients, which explicitly learns the action-value gradient. MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement. On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm with respect to model-free and model-based state-of-the-art baselines.




1 Introduction

Reinforcement learning (RL) Puterman1994MarkovDP; sutton2018reinforcement studies sequential decision-making problems, in which an agent aims at maximizing the cumulative reward it collects in an environment. One of the most popular classes of algorithms for RL is policy gradient methods sutton2000policy; deisenroth2013survey, which involve differentiable control policies improved by gradient ascent. They are well suited to environments with continuous state and action spaces, and are compatible with state-of-the-art deep learning methods schmidhuber2015deep. Policy gradient algorithms often employ the actor-critic scheme konda2000actor: an actor, which determines the control policy, is evaluated using a critic. Thus, the degree of the actor’s improvement is limited by the information provided by the critic, naturally raising the question of how the critic should be trained.

Typically, algorithms that use powerful function approximators lillicrap2015continuous; fujimoto2018addressing learn the critic by temporal difference sutton1988learning, optimizing for an accurate prediction of the expected return of the actor. For deterministic-policy continuous-control silver2014deterministic; lillicrap2015continuous, however, the value provided by the critic is neither used for improving the policy nor for acting in the environment sutton2000policy. Instead, only the action-gradient of the value function, i.e., the gradient of the critic w.r.t. the action performed by the actor, is employed during policy optimization. Specifically, the policy gradient is obtained through the computation of the action-value gradient, by chaining the actor’s Jacobian with the action-gradient of the critic.

Learning the critic by value rather than by action-gradient of the value relies on hazy smoothness assumptions on the real value function silver2014deterministic. This means that, in conventional temporal difference learning, the critic learns action-value gradients implicitly, which could harm the performance of a deterministic policy gradient algorithm.

In this paper, we propose Model-based Action-Gradient-Estimator Policy Optimization (MAGE), a continuous-control deterministic-policy actor-critic algorithm that explicitly trains the critic to provide accurate action-gradients for use in the policy improvement step. Motivated by both the theory on Deterministic Policy Gradients silver2014deterministic and practical considerations, MAGE utilizes temporal difference methods to minimize the error on the action-value gradient. To this aim, the algorithm leverages a trained dynamics model as a proxy for a differentiable environment and techniques reminiscent of double backpropagation drucker1992improving. On challenging continuous control benchmarks brockman2016gym; todorov2012mujoco, we show that MAGE is significantly more sample-efficient than state-of-the-art model-free and model-based baselines.

The rest of the paper is organized as follows. In Section 2, we provide the notation and background on deterministic policy gradients. Our algorithm, together with its theoretical motivation, is introduced in Section 3, followed by empirical results in Section 4. In Section 5, we present some of the related work and its relationship with our approach.

2 Background

2.1 Preliminaries

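Consider a discrete-time Markov Decision Process (MDP) Puterman1994MarkovDP, defined as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho)$, where $\mathcal{S}$ is the space of possible states, $\mathcal{A}$ is the space of possible actions, $p(s' \mid s, a)$ is the transition model, $r(s, a)$ is the known and differentiable reward function, $\gamma \in [0, 1)$ is the discount factor, and $\rho$ is the initial state distribution. The behavior of the agent is described by a deterministic policy $\pi_\theta : \mathcal{S} \to \mathcal{A}$, belonging to a parametric space of policies $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta\}$. Let $d_\gamma^{\pi}$ be the $\gamma$-discounted state distribution induced by policy $\pi$, defined as $d_\gamma^{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \, p(s_t = s \mid \pi, \rho)$. The total reward collected by an agent is quantified with the action-value function $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$, where actions after the first follow $\pi$, and the performance function $J(\pi) = \mathbb{E}_{s \sim \rho}\left[Q^{\pi}(s, \pi(s))\right]$.

Practical algorithms can employ an approximate action-value function $\hat{Q}$ and an approximate dynamics model $\hat{p}$, which, most commonly, are parametric function approximators specified by the spaces $\mathcal{Q}$ and $\mathcal{P}$.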

2.2 Deterministic Policy Gradients and TD-learning

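Policy gradient methods optimize a policy $\pi_\theta$ by ascending the direction of the gradient $\nabla_\theta J(\theta)$ of its performance function. The Deterministic Policy Gradient Theorem silver2014deterministic provides a practical way to calculate this gradient. It shows that, under some mild regularity conditions on the MDP, the gradient of the performance of a deterministic policy is given by:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma} \, \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \nabla_\theta \pi_\theta(s) \, \nabla_a Q^{\pi}(s, a) \big|_{a = \pi_\theta(s)} \right]. \tag{1}$$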

This result can be interpreted through the lens of the chain rule applied to the action-value gradient $\nabla_\theta Q^{\pi}(s, \pi_\theta(s))$: the policy gradient does not directly depend on the gradient of $Q^{\pi}$ with respect to $\theta$, and can be obtained by just chaining the actor’s Jacobian $\nabla_\theta \pi_\theta(s)$ with the action-gradient of the value function $\nabla_a Q^{\pi}(s, a)$.
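As a quick numerical illustration of this chaining, the following minimal sketch uses a hypothetical one-dimensional problem. The linear policy `pi`, the quadratic stand-in `q`, and all constants are assumptions made for illustration and do not come from the paper. It multiplies the actor's Jacobian by the action-gradient of the value function and compares the product with a finite-difference derivative of the composed function with respect to the policy parameter, holding `q` fixed.

```python
# Toy check of the chain rule behind the deterministic policy gradient.
# The policy, the value function and the constants are made-up illustrations.

def pi(theta, s):
    """Hypothetical linear deterministic policy: a = theta * s."""
    return theta * s

def q(s, a):
    """Hypothetical fixed action-value function, peaked at a = 2 * s."""
    return -(a - 2.0 * s) ** 2

def dpi_dtheta(theta, s):
    """Actor Jacobian: derivative of pi(theta, s) w.r.t. theta."""
    return s

def dq_da(s, a):
    """Action-gradient: derivative of q(s, a) w.r.t. the action."""
    return -2.0 * (a - 2.0 * s)

def chained_gradient(theta, s):
    """Actor Jacobian times action-gradient, evaluated at a = pi(theta, s)."""
    return dpi_dtheta(theta, s) * dq_da(s, pi(theta, s))

def finite_difference(theta, s, eps=1e-6):
    """Central difference of q(s, pi(theta, s)) w.r.t. theta."""
    return (q(s, pi(theta + eps, s)) - q(s, pi(theta - eps, s))) / (2 * eps)

theta, s = 0.5, 1.5
analytic = chained_gradient(theta, s)
numeric = finite_difference(theta, s)
```

The two quantities agree up to finite-difference error, which is the sense in which only the action-gradient of the critic, and not its value, enters the actor update.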

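The theorem motivates a family of policy gradient actor-critic algorithms, such as DDPG lillicrap2015continuous and TD3 fujimoto2018addressing. Similarly to classical policy iteration sutton2018reinforcement, the evaluation of a policy (called actor in this context) is interleaved with its improvement w.r.t. the approximate action-value function (called critic). Specifically, the typical desideratum consists in finding a critic $\hat{Q}$ which minimizes the policy evaluation error:

$$\min_{\hat{Q} \in \mathcal{Q}} \; \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \left| \delta^{\pi,\hat{Q}}(s, \pi(s)) \right| \right], \tag{2}$$

where $\delta^{\pi,\hat{Q}}(s, a) = Q^{\pi}(s, a) - \hat{Q}(s, a)$ is a deviation w.r.t. the true state-action value. Given the lack of knowledge about the transition model, $\delta^{\pi,\hat{Q}}$ needs to be approximated. A common approximation technique consists in employing the temporal-difference (TD) error sutton1988learning, defined as $\hat{\delta}^{\pi,\hat{Q}}(s, a, s') = r(s, a) + \gamma \hat{Q}(s', \pi(s')) - \hat{Q}(s, a)$, giving rise to a bootstrapped optimization criterion for $\hat{Q}$:

$$\min_{\hat{Q} \in \mathcal{Q}} \; \mathbb{E}_{s \sim d_\gamma^{\pi},\; s' \sim p(\cdot \mid s, \pi(s))} \left[ \left| \hat{\delta}^{\pi,\hat{Q}}(s, \pi(s), s') \right| \right]. \tag{3}$$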

Minimizing the TD-error, albeit under rather strong assumptions, enjoys convergence guarantees sutton2018reinforcement; tsitsiklis1997analysis. Once a critic is learned, the actor can be improved by maximizing the action-value function for actions produced by the policy:

$$\max_{\theta \in \Theta} \; \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \hat{Q}(s, \pi_\theta(s)) \right]. \tag{4}$$

The above can be seen as a generalization of the policy improvement step in classical policy iteration, which relies on maximization in a discrete action space that cannot be easily carried out in continuous spaces. In practice, to reduce computational burden, the problems in Equation 3 and Equation 4 are solved only partially (e.g., by using a single optimization step) at each iteration, similarly to generalized policy iteration sutton2018reinforcement.
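The partial solution of the critic problem can be sketched as a single semi-gradient step on one transition. This is a minimal illustration under stated assumptions: the linear critic, its `features`, the fixed linear `policy`, the learning rate, and the sample transition are all hypothetical, and a squared TD-error is used for the update because it gives a simple closed-form step.

```python
# One bootstrapped (semi-gradient) TD step for a hypothetical linear critic.

def features(s, a):
    """Hand-picked features for the illustrative critic."""
    return [1.0, s, a, s * a]

def q_hat(w, s, a):
    """Linear critic: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def policy(s):
    """Hypothetical fixed deterministic policy."""
    return -0.5 * s

def td_error(w, s, a, r, s_next, gamma=0.99):
    """TD-error: r + gamma * Q(s', pi(s')) - Q(s, a)."""
    return r + gamma * q_hat(w, s_next, policy(s_next)) - q_hat(w, s, a)

def semi_gradient_step(w, s, a, r, s_next, lr=0.1, gamma=0.99):
    """Single optimization step; the bootstrapped target is treated as fixed."""
    delta = td_error(w, s, a, r, s_next, gamma)
    return [wi + lr * delta * fi for wi, fi in zip(w, features(s, a))]

w0 = [0.0, 0.0, 0.0, 0.0]
transition = (1.0, -0.5, -1.25, 0.8)  # (s, a, r, s'), made-up numbers
before = td_error(w0, *transition)
w1 = semi_gradient_step(w0, *transition)
after = td_error(w1, *transition)
```

On this transition the magnitude of the TD-error shrinks after the step. Note that nothing in this update looks at how the TD-error changes with the action.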

3 Learning Action-Value Gradients

In this section, we discuss theoretically how to learn a useful critic in the context of deterministic policy gradients. Then, we make the theoretical insights concrete and, guided by practical considerations, present Model-based Action-Gradient-Estimator Policy Optimization (MAGE), a novel policy optimization algorithm.

3.1 How to learn a useful critic?

An actor can only be as good as allowed by its critic. Thus, obtaining an effective critic is one of the most crucial steps for any actor-critic algorithm. In the previous section, we outlined the most common method to train the critic, consisting in the minimization of the temporal difference error. However, when the learned action-value function is imperfect, as is common in policy optimization with function approximation, minimizing the TD-error does not guarantee that the critic will be effective at solving the control problem. Instead, the following result provides foundations for a better objective function for critic learning.

Proposition 3.1.

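Let $\Pi_\Theta$ be a parametric space of $L$-Lipschitz continuous differentiable deterministic policies, $\mathcal{Q}$ a space of approximate value functions and $\|\cdot\|$ any $p$-norm. Given $\pi_\theta \in \Pi_\Theta$ and $\hat{Q} \in \mathcal{Q}$, the norm of the difference between the true policy gradient $\nabla_\theta J(\theta)$ and its approximation $\nabla_\theta \hat{J}(\theta)$, which uses $\hat{Q}$, can be upper bounded as:

$$\left\| \nabla_\theta J(\theta) - \nabla_\theta \hat{J}(\theta) \right\| \le \frac{L}{1-\gamma} \, \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \left\| \nabla_a \delta^{\pi,\hat{Q}}(s, a) \big|_{a = \pi_\theta(s)} \right\| \right], \tag{5}$$

where $\delta^{\pi,\hat{Q}}(s, a) = Q^{\pi}(s, a) - \hat{Q}(s, a)$ is the policy evaluation error.

The proposition (see Appendix A for the proof) is a direct consequence of the Deterministic Policy Gradient Theorem. The Lipschitz assumption for $\pi_\theta$ is easily satisfied for many policy classes of practical use, e.g., neural networks.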


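Proposition 3.1 suggests that it is the norm of the action-gradient of the policy evaluation error, instead of its value, that should be minimized to reduce the bias introduced by the use of the approximate value function $\hat{Q}$. To minimize the bound, a proxy for the unknown $\delta^{\pi,\hat{Q}}$ is needed. To this aim, it is possible to follow the approach of traditional TD-learning, substituting the evaluation error with the TD-error $\hat{\delta}^{\pi,\hat{Q}}$. This leads to the following optimization problem:

$$\min_{\hat{Q} \in \mathcal{Q}} \; \mathbb{E}_{s \sim d_\gamma^{\pi},\; s' \sim p(\cdot \mid s, \pi(s))} \left[ \left\| \nabla_a \hat{\delta}^{\pi,\hat{Q}}(s, a, s') \big|_{a = \pi(s)} \right\| \right]. \tag{6}$$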

Notice that computing the gradient w.r.t. the action of the TD-error requires taking into account the effect of action $a$ on the transition to the subsequent state in the environment $s' \sim p(\cdot \mid s, a)$, i.e., backpropagating through the environment dynamics $p$. Since $p$ is not available in typical RL settings, especially in a differentiable form, it needs to be substituted with an approximate model $\hat{p}$, as commonly done in model-based RL deisenroth2013survey; janner2019trust; Chua2018DeepRL. An environment model gives rise to imaginary transitions $(s, a, \hat{s}')$, where $\hat{s}' \sim \hat{p}(\cdot \mid s, a)$. Given a differentiable model, policy, and action-value function, the action-gradient can be effectively computed by leveraging standard automatic differentiation tools baydin2018autodiff. The corresponding computational graph is depicted in Figure 1. This leads to a viable way to obtain $\nabla_a \hat{\delta}^{\pi,\hat{Q}}$:

$$\nabla_a \hat{\delta}^{\pi,\hat{Q}}(s, a, \hat{s}') = \nabla_a r(s, a) + \gamma \nabla_a \hat{Q}(\hat{s}', \pi(\hat{s}')) - \nabla_a \hat{Q}(s, a), \qquad \hat{s}' \sim \hat{p}(\cdot \mid s, a).$$

Even in the general case of a stochastic model, differentiating through the resulting computations is still possible for many commonly used model classes via the reparametrization trick heess2015learning. Using an approximate model implies a tradeoff, since additional bias is injected into the estimation of the critic. Nonetheless, the use of $\hat{p}$ is the most direct way to solve the optimization problem in Equation 6, and thus to obtain a $\hat{Q}$ that provides a more accurate policy gradient w.r.t. the one obtained by training the critic using the TD-error.
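To make the path through the dynamics concrete, here is a minimal sketch on a hypothetical scalar system. The deterministic linear model, the quadratic reward and critic, the linear policy, and every constant are assumptions chosen so that the derivatives can be written by hand; in practice they would come from automatic differentiation. The hand-derived action-gradient of the TD-error, including the term that flows through the model, is checked against a finite difference.

```python
# Action-gradient of the TD-error through a (hypothetical) differentiable model.

GAMMA = 0.9
A, B = 0.9, 0.5         # assumed learned linear model: s' = A * s + B * a
K = -0.4                # assumed linear policy: pi(s) = K * s
W = (-1.0, 0.3, -0.2)   # assumed critic weights

def model(s, a):
    return A * s + B * a

def reward(s, a):
    return -(s * s) - 0.1 * a * a

def policy(s):
    return K * s

def q_hat(s, a):
    return W[0] * s * s + W[1] * s * a + W[2] * a * a

def td_error(s, a):
    """TD-error on the imaginary transition (s, a, model(s, a))."""
    s_next = model(s, a)
    return reward(s, a) + GAMMA * q_hat(s_next, policy(s_next)) - q_hat(s, a)

def td_error_action_gradient(s, a):
    """Derivative of td_error w.r.t. a, following every path back to a."""
    s_next = model(s, a)
    a_next = policy(s_next)
    dr_da = -0.2 * a
    dq_ds_next = 2 * W[0] * s_next + W[1] * a_next
    dq_da_next = W[1] * s_next + 2 * W[2] * a_next
    # Path through the dynamics: ds'/da = B, and a' depends on s' via K.
    d_next_value_da = (dq_ds_next + dq_da_next * K) * B
    dq_da = W[1] * s + 2 * W[2] * a
    return dr_da + GAMMA * d_next_value_da - dq_da

def finite_difference(s, a, eps=1e-6):
    return (td_error(s, a + eps) - td_error(s, a - eps)) / (2 * eps)

s = 1.2
a = policy(s)
grad = td_error_action_gradient(s, a)
check = finite_difference(s, a)
```

Dropping `d_next_value_da` would give a different number, which is why a differentiable model is needed to evaluate this quantity at all.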

Figure 1: Graph describing the computation of $\hat{\delta}^{\pi,\hat{Q}}$, when using policy $\pi$, model $\hat{p}$, action-value function $\hat{Q}$. Nodes and edges represent functions and variables, respectively. To compute $\nabla_a \hat{\delta}^{\pi,\hat{Q}}$, all the paths from the output back to $a$ must be considered, including the one highlighted in cyan, which involves the environment dynamics. Therefore, an approximate differentiable model needs to be learned in order to make all the required paths accessible.

3.2 Model-based Action-Gradient-Estimator Policy Optimization

The outlined procedure for learning the value function requires an approximate model $\hat{p}$, thus naturally suggesting its integration into a model-based policy optimization framework. A model-based actor-critic method involves three steps during each iteration: learning the model $\hat{p}$, updating the action-value function $\hat{Q}$ and improving the policy $\pi$. In the following, we consider neural networks as function approximators to represent the three modules, although any class of differentiable models could be leveraged. Our approach is inspired by Dyna sutton1991dyna, and employs an approximate dynamics model for generating 1-step imaginary on-policy transitions starting from observed states stored in a replay buffer. Those transitions are then employed to learn $\hat{Q}$, and, in turn, leveraged for computing an improvement direction for the parameters of the policy $\pi$.

In preliminary experiments, we have found that directly solving the minimization problem in Equation 6 is hard in practice. During the optimization, the parameters are prone to being trapped in local minima, which leads to degenerate solutions. A demonstration of this effect is detailed in Appendix B.1. The root cause of this effect is unknown.

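A remedy consists in introducing a constraint into the optimization problem. We argue that, among the possible solutions, a natural one is constraining the optimization landscape by bounding the traditional TD-error (see Equation 3), and thus solving the following optimization problem:

$$\min_{\hat{Q} \in \mathcal{Q}} \; \mathbb{E} \left[ \left\| \nabla_a \hat{\delta}^{\pi,\hat{Q}}(s, a, \hat{s}') \big|_{a = \pi(s)} \right\| \right] \quad \text{s.t.} \quad \mathbb{E} \left[ \left| \hat{\delta}^{\pi,\hat{Q}}(s, \pi(s), \hat{s}') \right| \right] \le \epsilon.$$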

As the above expressions already require non-trivial gradient computations, we decided to avoid the use of complex and expensive methods for nonlinear programming. Instead, we resort to penalty function methods smith1995penalty by regularizing the original objective by using the TD-error. A similar approach has been used in the past in, e.g., Proximal Policy Optimization (PPO, schulman2017proximal) to approximately solve a different constrained optimization problem.

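Eventually, the parameters $\omega$ of $\hat{Q}_\omega$ are learned by descending the gradient

$$\nabla_\omega \left( \left\| \nabla_a \hat{\delta}^{\pi,\hat{Q}_\omega}(s, a, \hat{s}') \big|_{a = \pi(s)} \right\| + \lambda \left| \hat{\delta}^{\pi,\hat{Q}_\omega}(s, \pi(s), \hat{s}') \right| \right) \tag{8}$$

on an imaginary transition $(s, \pi(s), \hat{s}')$, where $\lambda$ is the weight of the TD-error penalty. This expression requires computing second-order gradients, which would be computationally expensive if computed w.r.t. the high-dimensional space of parameters of the $Q$-function. Here, however, the optimization is affordable since the gradients are computed w.r.t. actions, which are typically low-dimensional. Notice also that the computational overhead of the second term in Equation 8 is minimal since, when using automatic differentiation, evaluating the TD-error is required anyway to compute its gradient.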
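The penalized critic objective can be sketched on a toy scalar problem as follows. This is an illustration under explicit assumptions, not the authors' implementation: the linear model, quadratic critic, linear policy, the penalty weight `PENALTY_WEIGHT`, and the use of absolute values for both terms are hypothetical choices, and central finite differences stand in for the automatic (second-order) differentiation used in practice.

```python
# Toy version of a penalized critic objective:
#   |action-gradient of TD-error| + PENALTY_WEIGHT * |TD-error|
# Finite differences replace autodiff; all constants are made up.

GAMMA = 0.9
PENALTY_WEIGHT = 0.1        # arbitrary illustrative weight on the TD-error
A, B, K = 0.9, 0.5, -0.4    # assumed model s' = A*s + B*a and policy a = K*s

def q_hat(w, s, a):
    return w[0] * s * s + w[1] * s * a + w[2] * a * a

def td_error(w, s, a):
    s_next = A * s + B * a
    r = -(s * s) - 0.1 * a * a
    return r + GAMMA * q_hat(w, s_next, K * s_next) - q_hat(w, s, a)

def action_gradient(w, s, a, eps=1e-5):
    """First-order derivative of the TD-error w.r.t. the action."""
    return (td_error(w, s, a + eps) - td_error(w, s, a - eps)) / (2 * eps)

def critic_loss(w, s):
    a = K * s
    return abs(action_gradient(w, s, a)) + PENALTY_WEIGHT * abs(td_error(w, s, a))

def loss_gradient(w, s, eps=1e-5):
    """Derivative of the loss w.r.t. critic weights (a second-order quantity)."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((critic_loss(up, s) - critic_loss(down, s)) / (2 * eps))
    return grad

def critic_step(w, s, lr=0.05):
    return [wi - lr * gi for wi, gi in zip(w, loss_gradient(w, s))]

w = [-1.0, 0.3, -0.2]
s = 1.2
loss_before = critic_loss(w, s)
w_new = critic_step(w, s)
loss_after = critic_loss(w_new, s)
```

The nested differentiation (first w.r.t. the scalar action, then w.r.t. the weights) mirrors the structure described above: the inner derivative is cheap because the action is low-dimensional.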

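Input: Initial buffer $\mathcal{B}$, set of parameter vectors $\{\theta, \omega, \phi\}$ for policy $\pi_\theta$, critic $\hat{Q}_\omega$ and model $\hat{p}_\phi$
for each iteration do
     Collect transition $(s, a, r, s')$ acting according to exploratory version of $\pi_\theta$, and add it to $\mathcal{B}$
     for each model learning step do
          Update $\phi$ by descending the model loss on transitions sampled from $\mathcal{B}$
     end for
     for each policy optimization step do
          Extract state $s$ after sampling a transition from $\mathcal{B}$
          Generate imaginary transition $(s, \pi_\theta(s), \hat{s}')$ with $\hat{s}' \sim \hat{p}_\phi(\cdot \mid s, \pi_\theta(s))$
          Update $\omega$ by descending the gradient in Equation 8
          Update $\theta$ by ascending $\nabla_\theta \pi_\theta(s) \, \nabla_a \hat{Q}_\omega(s, a) \big|_{a = \pi_\theta(s)}$
     end for
end for
Algorithm 1 Model-based Action-Gradient-Estimator Policy Optimization (MAGE)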

We plug our critic training method into a model-based Dyna-like algorithm, giving rise to Model-based Action-Gradient-Estimator Policy Optimization (MAGE), which is presented in Algorithm 1. (For simplicity of presentation, an abstract version of MAGE is considered in Algorithm 1; any actor-critic algorithm can then be used to instantiate MAGE into a practical incarnation.) At each iteration, the dynamics model $\hat{p}$ is trained to maximize the likelihood of the transitions stored in the experience replay buffer $\mathcal{B}$, or, equivalently, to minimize an appropriate loss function.

Then, for one or more steps, the TD-error for the current policy and action-value function is computed, and used together with its action-gradient to update $\hat{Q}$, which in turn is leveraged to improve $\pi$.
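The model-learning step can be illustrated with the simplest possible case. The sketch below assumes a hypothetical scalar linear model with fixed-variance Gaussian noise, for which maximizing the likelihood of buffered transitions reduces to ordinary least squares. The ground-truth constants, buffer size, and function names are made up for the example; the paper's models are neural networks.

```python
import random

# Maximum-likelihood model fitting on a toy replay buffer.
# With fixed-variance Gaussian noise, the MLE of a linear model is least squares.

TRUE_A, TRUE_B, NOISE_STD = 0.9, 0.5, 0.01  # hypothetical ground-truth dynamics

def collect_transitions(n, rng):
    """Fill a buffer with (s, a, s') tuples from the toy environment."""
    buffer = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        s_next = TRUE_A * s + TRUE_B * a + rng.gauss(0.0, NOISE_STD)
        buffer.append((s, a, s_next))
    return buffer

def fit_linear_model(buffer):
    """Solve the 2x2 normal equations for s' = A * s + B * a."""
    s_ss = sum(s * s for s, a, y in buffer)
    s_aa = sum(a * a for s, a, y in buffer)
    s_sa = sum(s * a for s, a, y in buffer)
    s_sy = sum(s * y for s, a, y in buffer)
    s_ay = sum(a * y for s, a, y in buffer)
    det = s_ss * s_aa - s_sa * s_sa
    a_hat = (s_sy * s_aa - s_ay * s_sa) / det
    b_hat = (s_ay * s_ss - s_sy * s_sa) / det
    return a_hat, b_hat

rng = random.Random(0)
buffer = collect_transitions(200, rng)
a_hat, b_hat = fit_linear_model(buffer)
```

The fitted coefficients are what the critic update would differentiate through when forming the action-gradient of the TD-error.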

4 Experiments

4.1 Sample-Efficient Continuous Control with MAGE

Algorithm settings.

The general structure of MAGE is compatible with many actor-critic algorithms with deterministic policies. In this experiment, we employ TD3 fujimoto2018addressing, a popular, state-of-the-art extension to DDPG lillicrap2015continuous, as a base policy optimization method. This amounts to the addition of target policy smoothing, delayed policy updates, clipped double-Q learning and target functions. We call this version of our algorithm MAGE-TD3. After each step of environment interaction, we add the collected transition to the replay buffer $\mathcal{B}$, train the approximate model $\hat{p}$, and update critic and actor times. We employ a single value of $\lambda$ across all the experiments, since we found MAGE reasonably robust to the choice of this hyperparameter (see Appendix B). In order to reduce the impact of model bias, MAGE leverages an ensemble of probabilistic Gaussian-output models, trained by maximum likelihood estimation.
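A Gaussian-output ensemble trained by maximum likelihood can be sketched as below. This is a hedged illustration only: each member is a hypothetical scalar linear model with a constant log-variance, members are picked uniformly at random when sampling, and the reparametrized sample (mean plus scaled independent noise) is one common way to keep the sample differentiable w.r.t. the model outputs. How the paper combines ensemble members is not stated in this text, so that part is an assumption.

```python
import math
import random

def gaussian_nll(mean, log_var, target):
    """Negative log-likelihood of target under N(mean, exp(log_var))."""
    return 0.5 * (math.log(2.0 * math.pi) + log_var
                  + (target - mean) ** 2 / math.exp(log_var))

def member_prediction(member, s, a):
    """A member is (A, B, log_var); the mean next state is A * s + B * a."""
    A, B, log_var = member
    return A * s + B * a, log_var

def ensemble_nll(ensemble, data):
    """Average NLL over members and transitions (the training loss)."""
    total = 0.0
    for member in ensemble:
        for s, a, s_next in data:
            mean, log_var = member_prediction(member, s, a)
            total += gaussian_nll(mean, log_var, s_next)
    return total / (len(ensemble) * len(data))

def sample_next_state(ensemble, s, a, rng):
    """Pick a random member, then draw a reparametrized sample."""
    mean, log_var = member_prediction(rng.choice(ensemble), s, a)
    noise = rng.gauss(0.0, 1.0)  # independent of the model parameters
    return mean + math.exp(0.5 * log_var) * noise

ensemble = [(0.9, 0.5, -4.0), (0.88, 0.52, -4.2), (0.91, 0.49, -3.9)]
data = [(1.0, 0.0, 0.9), (0.0, 1.0, 0.5)]
rng = random.Random(0)
nll = ensemble_nll(ensemble, data)
s_next = sample_next_state(ensemble, 1.0, -0.5, rng)
```

Minimizing `ensemble_nll` over the member parameters is the maximum likelihood training referred to above; disagreement between members is what makes the ensemble useful against model bias.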

Figure 2: Performance in terms of average return of MAGE on continuous control benchmarks. MAGE compares favorably to the three baselines on all the environments (5 runs, c.i.).

Baselines and environments.

We consider one model-based and two model-free algorithms as baselines. The first one is Dyna-TD3, which uses a classical TD-error loss, otherwise being identical to MAGE-TD3. It resembles 1-step horizon Model-based Policy Optimization (MBPO janner2019trust), but uses a deterministic policy optimized by TD3. Apart from that, we compared MAGE against TD3 and its sample-efficient variant van2019use, which employs multiple updates for each environment step and trades off computational efficiency and, potentially, stability metelli2018policy for sample-efficiency. Specifically, for a fair comparison with MAGE-TD3, we execute critic and actor updates after each interaction with the environment. We employ environments from OpenAI Gym brockman2016gym and the MuJoCo physics simulator todorov2012mujoco as continuous control benchmarks, assuming, for all the environments, the availability of a differentiable reward function (we show in Appendix B.2 that MAGE behaves well also with a learned reward). Additional details concerning the experimental setting are reported in Appendix D.


Figure 2 shows the learning curves for the average return of all the approaches. Since our primary interest is MAGE’s sample efficiency, we show the first steps of environment interaction. The results show that MAGE is able to learn at least as fast as all the baselines on all the environments, supporting the intuitive advantage of directly optimizing for the accuracy of the estimated action-value gradient. Interestingly, Dyna-TD3 shows no clear superiority over the simple data-efficient version of TD3: this suggests that model-based reinforcement learning has no intrinsic advantage in terms of sample-efficiency, which is instead highly environment- and algorithm-dependent. On the other hand, increasing the number of offline updates for model-free algorithms can cause strong instabilities in some environments, as is the case, for instance, on the Pusher-v2 environment. Note that, in contrast with Dyna-TD3, which only leverages the model as a generator for additional transitions w.r.t. the ones that can be obtained in the environment, MAGE makes deeper use of the learned model of the dynamics in order to unlock a peculiar learning modality that would be impossible in a model-free setting.

4.2 Understanding MAGE

Action-Gradient Estimation.

MAGE was designed to obtain a critic that is maximally useful for policy improvement by yielding accurate action-value gradients. How much better does it predict them compared to traditional TD-learning? To investigate this question, we employ the Pendulum-v0 environment, using a differentiable oracle in place of the approximate dynamics model.

Figure 3: Error of critics in predicting the action-gradient $\nabla_a Q^{\pi}(s, a)$ for a random $\pi$ (4 runs, 95% c.i.). Notice the log scale on the Y axis.

We fix a randomly initialized actor, and train only its critic with both MAGE-TD3 and its Dyna counterpart. During the training, for each transition on a trajectory, we estimate the true action-gradient and compare it to the action-gradient provided by a learned critic. The results, shown in Figure 3, indicate that MAGE’s critic progressively learns an accurate estimate of the action-gradient; by contrast, the one trained using traditional temporal difference completely fails in predicting it. The results undermine the common assumption that minimizing the TD-error also minimizes the error on the gradients. The difference can explain the superior sample-efficiency of MAGE over classical TD-learning.

Reward Availability.

Throughout the presentation and evaluation of MAGE, we assumed complete knowledge of the reward function of the underlying Markov Decision Process. While this assumption is natural in many real-world settings deisenroth2013survey and thus commonly employed in other model-based reinforcement learning methods Chua2018DeepRL; d2019gradient; heess2015learning, its role is particularly crucial in our algorithm. In traditional temporal difference learning, given a transition $(s, a, r, s')$, the reward constitutes the only grounding element in the objective function. The reward function plays an even stronger role as a grounding element for bootstrapping in MAGE, since both its value and its action-gradient are needed: while the former can usually be observed in the environment, the latter can only be computed with complete knowledge of the underlying function. In our experiments on the sample-efficiency of MAGE, we employed the ground-truth reward function (with ground-truth gradients): a natural question is whether MAGE still performs reasonably well if an estimated reward function learned from data is used in place of the real $r$. To answer this question, we evaluate a version of MAGE in which an approximate reward function is learned by using a neural network approximator and minimizing the mean squared error on the rewards observed in the environment. The results, perhaps surprisingly, are reported in Figure 4 for the Pusher-v2 environment (see Appendix B.2 for the complete results). They show that, for the commonly employed continuous control benchmarks, the performance of our method is only minimally degraded by the use of an approximate reward function in place of the real one, thus suggesting inherent robustness to inaccurate evaluations of the reward function as well as its action-gradient.

Figure 4: Performance of MAGE on Pusher-v2 using an estimated reward function (5 runs, 95% c.i.).
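The idea of replacing the true reward with a learned one, and then differentiating the learned function w.r.t. the action, can be sketched on a toy problem. Everything below is an assumption for illustration: the quadratic ground-truth reward, the hand-picked features (used instead of a neural network), the step size, and the number of iterations.

```python
import random

# Fit a reward model by minimizing mean squared error on observed rewards,
# then use its action-gradient in place of the true one.

def true_reward(s, a):
    """Hypothetical ground-truth reward, unknown to the learner."""
    return -(s * s) - 0.1 * a * a

def features(s, a):
    return [s * s, a * a, 1.0]

def fit_reward(samples, steps=2000, lr=0.5):
    """Batch gradient descent on half the mean squared error."""
    w = [0.0, 0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for s, a, r in samples:
            phi = features(s, a)
            err = sum(wi * fi for wi, fi in zip(w, phi)) - r
            for i in range(3):
                grad[i] += err * phi[i] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def learned_action_gradient(w, s, a):
    """Derivative of the learned reward w.r.t. the action."""
    return 2.0 * w[1] * a

rng = random.Random(0)
samples = []
for _ in range(100):
    s, a = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    samples.append((s, a, true_reward(s, a)))

w = fit_reward(samples)
estimated = learned_action_gradient(w, 0.3, -0.7)
exact = -0.2 * (-0.7)  # true d r / d a at a = -0.7
```

In this noise-free, realizable setting the learned action-gradient matches the true one closely; with a neural network and noisy rewards the match would only be approximate.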

5 Related Work

Policy gradients are among the most popular methods in reinforcement learning. A variety of algorithms have been proposed for the estimation of the policy gradient, either involving only the policy williams1992simple; baxter2001infinite; metelli2018policy or also the value function schulman2015trust; mnih2016asynchronous; schulman2017proximal. The latter category of algorithms is referred to as actor-critic methods prokhorov1997adaptive; konda2000actor. Among them, the ones based on the Deterministic Policy Gradient silver2014deterministic; lillicrap2015continuous leverage the action-gradient of the critic. When using function approximation, the quality of the learned critic is of paramount importance barth-maron2018distributional: for instance, enforcing on the critic the compatibility conditions silver2014deterministic ensures an unbiased estimate of the policy gradient.

Developed around such conditions, GProp balduzzi2015compatible is, to the best of our knowledge, the only method that explicitly optimizes for the accuracy of the learned action-value gradient. It is significantly different w.r.t. MAGE, being model-free and based on gradient estimation via noisy perturbations together with an additional deviator network. Importantly, while GProp’s deviator network is a function approximator that outputs an estimate for the action-gradient, recent theoretical saremi2019approximating and practical saremi2019neural insights outside of RL suggest that learning the action-gradient by second-order differentiation, as we propose in MAGE, is not only simpler to implement, but also fundamentally more effective when using neural network approximators.

The technique we use for learning the action-gradient relies on the differentiation of the TD-error and, thus, of the Bellman equation. This is related to a broad class of methods called value gradients schmidhuber1990making; fairbank2014value; heess2015learning, in which the policy is improved by backpropagating through the unrolled Bellman equation. Those approaches, however, learn the value function by standard temporal difference heess2015learning. Another approach, named Dual Heuristic Programming werbosdhdp; prokhorov1997adaptive; Fairbank2008ReinforcementLB, learns the gradient of the state-value function in a model-based setting, leading to a TD-learning procedure that resembles our approach. However, the method has the main goal of improving generalization of the value function and exploration, and is fundamentally different from MAGE, which aims at learning an accurate action-gradient of the critic and is motivated by the Deterministic Policy Gradient Theorem.

More broadly, inside and outside of reinforcement learning, several algorithms incorporate gradient penalties into the loss function used for training a neural network. This technique, known as double backpropagation drucker1992improving, has been employed in a number of applications, for instance increasing generalization capabilities drucker1992improving; Rifaiicml, enforcing Lipschitz constants gulrajani2017improved; lunz2018adversarial; gelada2019deepmdp, or encouraging robustness to adversarial examples simongabriel2019. Particularly related to our approach is Sobolev training czarnecki2017sobolev, which leverages the availability of the derivatives of a target function to explicitly try to learn both value and gradient of it during supervised training; in our case, no ground-truth gradient is available and we use the action-gradient of the TD-target as a proxy.

Our method learns the action-gradient in the context of model-based policy optimization deisenroth2013survey; wang2003model; tangkaratt2014model. We mainly build upon the classical Dyna framework sutton1991dyna, in which a learned model is used for generating imaginary transitions, then employed for training a value function. Our algorithm, which learns a Q-function from model-generated data but only optimizes the policy by using real data, is related to the approaches that compute the policy gradient by using a model-based value function together with trajectories sampled in the environment abbeel2006using; d2019gradient; heess2015learning; janner2019trust. In practice, we leverage an ensemble of models, which has been shown to improve performance in a variety of contexts kurutach2018modelensemble; Chua2018DeepRL; janner2019trust.

Finally, our work is related in spirit to decision-aware model learning (DAML) joseph2013reinforcement; farahmand2017value; d2019gradient. In DAML approaches, the model of the dynamics of the environment is learned by explicitly considering how it will be used for improving the control policy: this is the same rationale behind the learning objective used in MAGE for the critic, focused on how it will be useful for policy optimization, and not merely on how it will be similar to the true value function.

6 Conclusion

In this paper, we presented MAGE, a model-based actor-critic algorithm with deterministic actor that leverages an approximate dynamics model to directly learn the action-value gradient via temporal-difference learning. MAGE employs second-order differentiation to obtain a critic tailored for policy improvement. The empirical evaluation of MAGE demonstrated its superiority over model-based and model-free baselines on challenging high-dimensional continuous control tasks.

A limitation of our method is of a computational nature: in addition to the cost of model learning, paid also by other model-based actor-critic algorithms, we incur the expense of computing the gradient w.r.t. the critic parameters of the action-gradient of the TD-error. As a result, the running time approximately doubles in comparison to the Dyna-based policy gradient approach. This can potentially be alleviated by the development of more efficient automatic differentiation tools, which is currently an active area of research baydin2018autodiff.

While it is often hard to determine under which circumstances the addition of an approximate learned model to a model-free algorithm is beneficial janner2019trust, we have shown that model-based techniques, such as MAGE’s gradient-learning procedure, can unlock novel learning modalities that are inaccessible to model-free algorithms. This may be the true power of model-based reinforcement learning. Therefore, apart from improving MAGE (e.g., by investigating the unconstrained critic learning problem) and applying it in other model-based approaches (e.g., value gradients with real trajectories heess2015learning), we hope that future work along this direction will reveal other innovative learning schemes that are infeasible in model-free settings.

Broader Impact

The method presented in this paper is a reinforcement learning algorithm that can be used to control a system executing real-valued actions in an environment. Therefore, a natural application of it is in robotics, with positive (e.g., elderly care, resource-efficiency in manufacturing) and negative (e.g., military) applications. Alongside many deep reinforcement learning algorithms, our method is computationally intensive and its training can thus require considerable resources (i.e., hardware and electricity).

The authors are grateful to Miroslav Štrupl for discussions about the relationship between action-gradients and TD-learning, David Alvarez for co-authoring the reinforcement learning framework used for the experiments, Christian Osendorfer, Miroslav Štrupl, Jan Koutník for their valuable feedback on an early draft of this manuscript, and to everyone at NNAISENSE for contributing to an inspiring research environment.


Appendix A Proof of Proposition 3.1

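The proof follows directly from the Deterministic Policy Gradient Theorem. Therefore, the proposition inherits all of its smoothness assumptions about the Markov Decision Process [silver2014deterministic]. Writing $\delta^{\pi,\hat{Q}}(s, a) = Q^{\pi}(s, a) - \hat{Q}(s, a)$ for the policy evaluation error, we have:

$$\begin{aligned}
\left\| \nabla_\theta J(\theta) - \nabla_\theta \hat{J}(\theta) \right\|
&= \frac{1}{1-\gamma} \left\| \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \nabla_\theta \pi_\theta(s) \, \nabla_a Q^{\pi}(s, a) \big|_{a = \pi_\theta(s)} \right] - \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \nabla_\theta \pi_\theta(s) \, \nabla_a \hat{Q}(s, a) \big|_{a = \pi_\theta(s)} \right] \right\| && (10) \\
&= \frac{1}{1-\gamma} \left\| \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \nabla_\theta \pi_\theta(s) \, \nabla_a \delta^{\pi,\hat{Q}}(s, a) \big|_{a = \pi_\theta(s)} \right] \right\| && (11) \\
&\le \frac{L}{1-\gamma} \, \mathbb{E}_{s \sim d_\gamma^{\pi}} \left[ \left\| \nabla_a \delta^{\pi,\hat{Q}}(s, a) \big|_{a = \pi_\theta(s)} \right\| \right]. && (12)
\end{aligned}$$

Equation 10 follows from the Deterministic Policy Gradient Theorem. To obtain Equation 11, we exploit the definition of $\delta^{\pi,\hat{Q}}$ and linearity of differentiation. Finally, in Equation 12, we use the Lipschitz policy assumption.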

Appendix B Additional Experiments

B.1 Unconstrained Action-Value Gradient Learning

Proposition 3.1 directly encourages training the critic by minimizing the bound on the error of the policy gradient, i.e., the norm of the action-gradient of the policy evaluation error.

Figure 5: Action-gradients

However, we found a direct optimization of this bound, by means of the TD-error, difficult in the context of Dyna-like algorithms. We analyze this behavior in the Pendulum-v0 environment [brockman2016gym], instantiating a version of MAGE based on DDPG [lillicrap2015continuous] (MAGE-DDPG). To understand the learning dynamics of the action-value gradients in a way that is not affected by the model bias, we employ the differentiable version of the real environment dynamics and test MAGE without the TD-error regularization (i.e., with $\lambda = 0$). Therefore, at each step, $\hat{Q}$ is improved by minimizing the norm of $\nabla_a \hat{\delta}^{\pi,\hat{Q}}$ computed on transitions whose next state is sampled from $p$. Unfortunately, no useful learning can be achieved in this setting: a degenerate solution is rapidly reached, as shown in Figure 5. We employ exactly the settings and hyperparameters that are successfully employed in the full version of MAGE.

We believe that understanding whether, or under which circumstances, the direct minimization of the bound in Proposition 3.1 is possible is an interesting open question.

B.2 MAGE with Trained Reward Function

Figure 6: Performance in terms of average return of MAGE-TD3 and Dyna-TD3 with and without the use of an estimated reward function (5 runs, 95% c.i.).

As discussed in Section 4, MAGE is able to achieve good performance even with an estimated reward function. We report in Figure 6 the full results of this experiment on all the considered environments. For reference, we test MAGE and Dyna-TD3 as well as their versions in which the ground-truth reward function is substituted with one trained on the experience replay data using the MSE loss.
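To make the idea of fitting a reward function on replay data with the MSE loss concrete, the following is a minimal sketch under strong simplifying assumptions: a linear model on scalar states and actions stands in for the neural network used in the experiments, and the names `predict_reward`, `fit_reward_model` and `mean_squared_error` are hypothetical.

```python
import random


def predict_reward(w, s, a, s_next):
    """Linear stand-in for a learned reward model r_hat(s, a, s')."""
    return w[0] * s + w[1] * a + w[2] * s_next + w[3]


def fit_reward_model(transitions, lr=0.05, epochs=200, seed=0):
    """Fit the weights by stochastic gradient descent on the squared error.

    `transitions` is a list of (s, a, s_next, r) tuples, playing the role
    of the experience replay data.
    """
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0, 0.0]
    data = list(transitions)
    for _ in range(epochs):
        rng.shuffle(data)
        for s, a, s_next, r in data:
            err = predict_reward(w, s, a, s_next) - r
            # Gradient of 0.5 * err**2 w.r.t. each weight is err * feature.
            for i, x in enumerate((s, a, s_next, 1.0)):
                w[i] -= lr * err * x
    return w


def mean_squared_error(w, transitions):
    """MSE of the reward model on a set of transitions."""
    total = sum((predict_reward(w, s, a, sn) - r) ** 2
                for s, a, sn, r in transitions)
    return total / len(transitions)
```

Once fitted, such a model would simply replace the ground-truth reward wherever the algorithm evaluates or differentiates it.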

The results indicate that learning the reward function when it is not directly accessible does not produce any catastrophic harm to the performance of the algorithm. Therefore, our approach remains competitive even when the assumption of a known differentiable reward function is not satisfied.

B.3 Importance of

Our practical solution to viably minimize the norm of the action-gradient of the TD-error involves a constrained optimization problem that limits the magnitude of the traditional TD-error.

Figure 7: Median return of MAGE for different (5 runs, 95% c.i.).

We approximately solve this problem by transforming it into an unconstrained one, introducing a new hyperparameter . can be seen as a weight that is given to the traditional TD-error, assigning more or less importance to it compared to the error on the action-gradient. In the main experiment shown in the paper, we used , which was chosen arbitrarily. How sensitive is MAGE to this parameter?

To study this, we carried out an experiment on the HalfCheetah-v2 environment, testing the TD3-based version of MAGE with four different values of . The results, shown in Figure 7, demonstrate that, regardless of the value of , MAGE is significantly better than the baseline Dyna-TD3. MAGE is therefore robust to the choice of this hyperparameter. Notice also that the we used is not optimal for HalfCheetah-v2: thus, the absolute returns obtained by MAGE could be improved for particular tasks if a careful parameter search is carried out, which we leave for future work. Nonetheless, we decided to report in Figure 2 results for a fixed value of across all the environments to show the robustness and ease of use of MAGE.
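The role of this weight can be illustrated with a small sketch of the unconstrained objective. It assumes that the critic loss adds the Euclidean norm of the action-gradient of the TD-error to the absolute TD-error scaled by the hyperparameter; the function name `mage_critic_loss`, the argument name `lam`, and the specific choice of norms are assumptions made for illustration.

```python
import math


def mage_critic_loss(td_errors, td_action_grads, lam):
    """Batch average of ||grad_a delta|| + lam * |delta|.

    td_errors:       list of scalar TD-errors, one per sampled transition.
    td_action_grads: list of action-gradients of those TD-errors
                     (one list of floats per transition).
    lam:             weight given to the traditional TD-error term.
    """
    total = 0.0
    for delta, grad in zip(td_errors, td_action_grads):
        grad_norm = math.sqrt(sum(g * g for g in grad))
        total += grad_norm + lam * abs(delta)
    return total / len(td_errors)
```

With `lam = 0` only the action-gradient term remains, which corresponds to the unregularized setting examined in the first additional experiment; larger values shift the objective toward standard TD-learning.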

Appendix C Action-Gradient of the TD-error

In this section, we present some additional information about the computation of the action-gradient of the TD-error, carried out during the critic learning step of MAGE. To implement MAGE, we employed PyTorch [paszke2019pytorch] and its automatic differentiation tools in order to compute the second-order gradient required by our method. In this way, we did not need to explicitly derive a closed form expression for a given model class or neural network architecture. Nonetheless, we report here the general expression for the action-gradient of the TD-error:


In MAGE, we employ a Gaussian stochastic model; therefore, its action-gradient can be obtained by reparameterizing this distribution using randomly drawn unit Gaussian noise together with the learned mean and standard deviation. In our experiments, we only deal with continuous state and action spaces; however, by leveraging appropriate approximations (e.g., concrete distributions [maddison2016concrete]), similar techniques can be employed also in the case of a discrete state space.
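As an illustration of how reparameterization exposes the action-gradient of the TD-error, the sketch below works through the chain rule by hand on a scalar toy problem and provides a finite-difference check. Everything in it is an assumption made for the example: the linear-Gaussian model, the quadratic critics, the linear policy, a reward that depends only on the current state and action, and all constants. In practice, automatic differentiation would replace the hand-written derivatives.

```python
import math

GAMMA = 0.99          # illustrative discount factor
POLICY_GAIN = -0.8    # toy linear policy: pi(s) = POLICY_GAIN * s


def next_state(s, a, eps, b=0.5, log_std0=-1.0, c=0.3):
    """Reparameterized sample s' = mean(s, a) + std(s, a) * eps."""
    mean = s + b * a
    std = math.exp(log_std0 + c * a)
    return mean + std * eps


def next_state_grad_a(s, a, eps, b=0.5, log_std0=-1.0, c=0.3):
    """ds'/da with the noise eps held fixed."""
    std = math.exp(log_std0 + c * a)
    return b + c * std * eps


def reward(s, a):
    return -(s * s + 0.1 * a * a)


def q_value(s, a, w):
    """Toy quadratic critic with parameters w = (w0, w1, w2)."""
    return -(w[0] * s * s + w[1] * a * a + w[2] * s * a)


def q_grad_s(s, a, w):
    return -(2.0 * w[0] * s + w[2] * a)


def q_grad_a(s, a, w):
    return -(2.0 * w[1] * a + w[2] * s)


def td_error(s, a, eps, w, w_target):
    s_next = next_state(s, a, eps)
    a_next = POLICY_GAIN * s_next
    return reward(s, a) + GAMMA * q_value(s_next, a_next, w_target) - q_value(s, a, w)


def td_error_action_grad(s, a, eps, w, w_target):
    """Chain rule: dr/da + gamma * ds'/da * d/ds' Q'(s', pi(s')) - dQ/da."""
    s_next = next_state(s, a, eps)
    a_next = POLICY_GAIN * s_next
    # Total derivative of the target value w.r.t. s', through the policy too.
    dq_ds_next = (q_grad_s(s_next, a_next, w_target)
                  + q_grad_a(s_next, a_next, w_target) * POLICY_GAIN)
    dr_da = -0.2 * a
    return dr_da + GAMMA * next_state_grad_a(s, a, eps) * dq_ds_next - q_grad_a(s, a, w)


def finite_difference_grad(s, a, eps, w, w_target, h=1e-5):
    """Central difference in the action, reusing the same noise sample."""
    up = td_error(s, a + h, eps, w, w_target)
    down = td_error(s, a - h, eps, w, w_target)
    return (up - down) / (2.0 * h)
```

Holding `eps` fixed while differentiating is what makes the sampled next state a deterministic, differentiable function of the action.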

Figure 8: Alternative view of the computational graph constructed during the computation of the TD-error , following the notation from [schulman2015gradient]. Round nodes represent stochastic variables, squares represent deterministic variables. Nodes with incoming dashed edges also depend on the state .

To further visualize the constructed computational graph, Figure 8 offers a view that differs from the one used in Figure 1 and is inspired by recent work on stochastic computational graphs [schulman2015gradient]. In our case, the only possibly stochastic entity is the approximate model.

Appendix D Experimental details

D.1 Instantiating MAGE

We presented in Algorithm 1 a generic version of MAGE, whose structure can be adapted to many model-free actor-critic algorithms. In most of our experiments, we use TD3 [fujimoto2018addressing] as a reference algorithm, due to its stability and performance, resulting in MAGE-TD3.

Input: Initial buffer , parameters , target parameters

for each iteration do
     Collect transition acting according to exploratory policy
     for each model learning step do
     end for
     for each policy optimization step do
          Extract state after sampling
          for  do
          end for
          if  then
          end if
     end for
end for
Algorithm 2 Model-based Action-Gradient-Estimator TD3 (MAGE-TD3)

In Algorithm 2, we report pseudocode for this version of our method. Unfortunately, while the use of the model is unchanged w.r.t. the abstract version, the addition of a second value function implies the computational overhead of using second-order differentiation twice.

D.2 Hyperparameters

We employ ( for the Pendulum and CartPole environments) warmup steps of interaction with the environment before starting to update the critic and the actor. We use an ensemble of neural networks as approximate dynamics models, which learn both the mean and the standard deviation of a Gaussian distribution, of hidden layers of neurons ( layers with units for the Pendulum and CartPole environments) with the swish [ramachandran2017searching] activation function. They are trained by maximum likelihood, minimizing the loss function, after every 25 steps of interaction with the environment, on batches of samples. We employ multi-layer perceptrons also for the actor ( layers, neurons each for the Pendulum and CartPole environments and for all the others) and the critic ( layers, neurons each). Model, actor and critic are trained with the RAdam optimizer [liu2019variance], with learning rates of and default parameters, and a weight decay of for the approximate dynamics model. We update the critic and the actor by extracting ( for the Pendulum and CartPole environments) states from the buffer of collected transitions, then sampling from the ensemble by first randomly selecting one of the members and then sampling an estimated difference between current and next state.
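The two-stage sampling procedure described above, together with the per-dimension Gaussian negative log-likelihood that maximum-likelihood training minimizes, can be sketched as follows. The class `ToyGaussianModel` is a hypothetical stand-in for a trained ensemble member, and all names and numbers are illustrative assumptions.

```python
import math
import random


class ToyGaussianModel:
    """Hypothetical ensemble member predicting a Gaussian over s' - s."""

    def __init__(self, gain, std):
        self.gain = gain
        self.std = std

    def predict(self, state, action):
        # Mean and standard deviation of the state difference, per dimension.
        mean = [self.gain * action for _ in state]
        std = [self.std for _ in state]
        return mean, std


def sample_next_state(ensemble, state, action, rng):
    """Pick one member uniformly, then sample a state difference from it."""
    member = rng.choice(ensemble)
    mean, std = member.predict(state, action)
    delta = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return [x + d for x, d in zip(state, delta)]


def gaussian_nll(target, mean, std):
    """Negative log-likelihood of a scalar target under N(mean, std**2)."""
    var = std * std
    return 0.5 * (math.log(2.0 * math.pi * var) + (target - mean) ** 2 / var)
```

Predicting the difference between consecutive states, instead of the next state itself, means the sampled difference is added back to the current state, as in the last line of `sample_next_state`.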

In MAGE-TD3, we employ the suggested hyperparameters of TD3: an action noise of , a target noise of , noise clipping to and a delay in the policy updates of . During training, the actions that the actor executes in the environment are perturbed by Gaussian noise . We obtain the target networks for both actor and critic by Polyak averaging with decay .
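For readers less familiar with TD3, the ingredients named above (clipped target noise, the minimum over two target critics, and Polyak averaging of target parameters) can be sketched in a few lines. Function names and argument values are hypothetical; in particular, `decay` is assumed here to be the fraction of the old target parameters that is retained at each update.

```python
import random


def smoothed_target_action(action, noise_std, noise_clip, max_action, rng):
    """Target policy smoothing: add clipped Gaussian noise, then clip the action."""
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    return max(-max_action, min(max_action, action + noise))


def td3_target(reward, q1_next, q2_next, gamma, done=False):
    """Clipped double-Q target: bootstrap from the smaller target critic."""
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)


def polyak_update(target_params, online_params, decay):
    """Move each target parameter a small step toward its online counterpart."""
    return [decay * t + (1.0 - decay) * p
            for t, p in zip(target_params, online_params)]
```

The delayed policy update is then a matter of updating the actor and the target networks only once every fixed number of critic updates.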

For the reward estimation experiments, we employ a neural network with hidden layers of units ( hidden layer with units for the Pendulum and CartPole environments) and swish activations. We employ a discount factor of .

For our experiment on the evaluation of gradients, we initially collect transitions, then simply run the algorithms with standard settings but without any update of the actor. Every steps, we collect trajectories in the environment and average the error over them. We compute the ground-truth , with being the empirical return, by automatic differentiation, leveraging the differentiable oracle model. We then average the resulting discounted error:


We average this value across the different trajectories.

Across all the experiments, despite the formulation we used throughout the paper, we employ a reward , which is thus also a function of the next state. For generating the performance plots, we evaluate, after every steps of environment interaction, the actor for episodes and average the result. To improve presentation, we then uniformly smooth the resulting curves with a window size of .
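The uniform smoothing applied to the performance curves can be sketched as a moving average. The handling of the curve boundaries is not specified in the text, so the truncated, centered window used below is an assumption, as is the function name `uniform_smooth`; the sketch also assumes an odd window size.

```python
def uniform_smooth(values, window):
    """Centered moving average with equal weights.

    Near the two ends of the curve the window is truncated to the
    available points (an assumption, not taken from the paper).
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

For instance, `uniform_smooth([0, 3, 6, 9], 3)` averages each point with its immediate neighbors and uses only two points at each end.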