Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

11/21/2017 ∙ by Daniel Lévy, et al. ∙ Stanford University

Policy optimization methods have shown great promise in solving complex reinforcement and imitation learning tasks. While model-free methods are broadly applicable, they often require many samples to optimize complex policies. Model-based methods greatly improve sample-efficiency but at the cost of poor generalization, requiring a carefully handcrafted model of the system dynamics for each task. Recently, hybrid methods have been successful in trading off applicability for improved sample-complexity. However, these have been limited to continuous action spaces. In this work, we present a new hybrid method based on an approximation of the dynamics as an expectation over the next state under the current policy. This relaxation allows us to derive a novel hybrid policy gradient estimator, combining score function and pathwise derivative estimators, that is applicable to discrete action spaces. We show significant gains in sample complexity, ranging between 1.7 and 25×, when learning parameterized policies on Cart Pole, Acrobot, Mountain Car and Hand Mass. Our method is applicable to both discrete and continuous action spaces, whereas competing pathwise methods are limited to the latter.





Reinforcement and imitation learning using deep neural networks have achieved impressive results on a wide range of tasks spanning manipulation [levine2016end, levine2014learning], locomotion [silver2014deterministic], games [mnih2015human, silver2016mastering], and autonomous driving [ho2016generative, li2017inferring]. Model-free methods search for optimal policies without explicitly modeling the system’s dynamics [williams1992simple, schulman2015trust]. Most model-free algorithms build an estimate of the policy gradient by sampling trajectories from the environment and perform gradient ascent. However, these methods either suffer from very high sample complexity, due to the generally large variance of the estimator, or are restricted to policies with few parameters.

On the other hand, model-based reinforcement learning methods learn an explicit model of the dynamics of the system. A policy can then be optimized under this model by back-propagating the reward signal through the learned dynamics. While these approaches can greatly improve sample efficiency, the dynamics model needs to be carefully hand-crafted for each task. Recently, hybrid approaches, at the interface of model-free and model-based, have attempted to balance sample-efficiency with generalizability, with notable success in robotics.


Existing methods, however, are limited to continuous action spaces [levine2014learning, heess2015learning]. In this work, we introduce a hybrid reinforcement learning algorithm for deterministic policy optimization that is applicable to continuous and discrete action spaces. Starting from a class of deterministic policies, we relax the corresponding policy optimization problem to one over a carefully chosen set of stochastic policies under approximate dynamics. This relaxation allows the derivation of a novel policy gradient estimator, which combines pathwise derivative and score function estimators. This enables incorporating model-based assumptions while remaining applicable to discrete action spaces. We then perform gradient-based optimization on this larger class of policies while slowly annealing stochasticity, converging to an optimal deterministic solution. We additionally justify and bound the dynamics approximation under certain assumptions. Finally, we complement this estimator with a scalable method to estimate the dynamics, first introduced in [levine2014learning], and with a general extension to non-differentiable rewards, rendering it applicable to a large class of tasks. Our contributions are as follows:

  • We introduce a novel deterministic policy optimization method that leverages a model of the dynamics and can be applied to any action space, whereas existing methods are limited to continuous action spaces. We also provide theoretical guarantees for our approximation.

  • We show how our estimator can be applied to a broad class of problems by extending it to discrete rewards, and utilizing a sample-efficient dynamics estimation method [levine2014learning].

  • We show that this method successfully optimizes complex neural network deterministic policies without additional variance reduction techniques. We present experiments on tasks with discrete action spaces where model-based or hybrid methods are not applicable. On these tasks, we show significant gains in terms of sample-complexity, ranging between 1.7 and 25×.

Related Work

Sample efficiency is a key metric for RL methods, especially when used on real-world physical systems. Model-based methods improve sample efficiency at the cost of defining and learning task-specific dynamics models, while model-free methods typically require significantly more samples but are more broadly applicable.

Model-based methods approximate dynamics using various classes of models, spanning from Gaussian processes [deisenroth2011pilco] to neural networks [fu2016one]. Assuming the reward function is known and differentiable, the policy gradient can be computed exactly by differentiating through the dynamics model, thus enabling gradient-based optimization. [heess2015learning] extend this idea to stochastic dynamics and policies by expressing randomness as an exogenous variable and making use of the “re-parameterization trick”. This requires differentiability of the dynamics function w.r.t. both state and action, limiting its applicability to continuous action spaces. On the other hand, model-free algorithms are broadly applicable but require significantly more samples as well as variance reduction techniques through control variates [mnih2016asynchronous, ho2016model] or trust regions [schulman2015trust] to converge in reasonable time.

Recently, hybrid approaches have attempted to balance sample efficiency with applicability. [levine2014learning] locally approximate dynamics to fit locally optimal policies, which then serve as supervision for global policies. A good dynamics model also enables generating artificial trajectories to limit strain on the simulator [gu2016continuous]. However, these works are once again limited to continuous action spaces. Our work can also be considered a hybrid algorithm, but one that handles discrete action spaces as well and improves sample efficiency on them.


In this section, we first present the canonical RL formalism. We then review the score function and pathwise derivative estimators, and their respective advantages. We illustrate these with, first, the REINFORCE estimator [williams1992simple], and then a standard method for model-based policy optimization that back-propagates through the dynamics equation.

Notations and definitions

Let and denote the state and action spaces, respectively. refers to a deterministic dynamics function i.e. . A deterministic policy is a function , and a stochastic policy is a conditional distribution over actions given state, denoted . For clarity, when considering parameterized policies, stochastic ones will be functions of , and deterministic ones functions of . Throughout this work, dynamics will be considered deterministic.

We consider the standard RL setting where an agent interacts with an environment and obtains rewards for its actions. The agent’s goal is to maximize its expected cumulative reward. Formally, there exists an initial distribution over and a collection of reward functions with . is sampled from , and at each step, the agent is presented with and chooses an action according to a policy . is then computed as . In the finite horizon setting, the episode ends when . The agent is then provided with the cumulative reward, . The goal of the agent is to find a policy that maximizes the expected cumulative reward .

Score Function Estimator and Pathwise Derivative

We now review two approaches for estimating the gradient of an expectation. Let $p(x; \theta)$ be a probability distribution over a measurable set $\mathcal{X}$ and $f: \mathcal{X} \to \mathbb{R}$. We are interested in obtaining an estimator of the following quantity: $\nabla_\theta \mathbb{E}_{x \sim p(x; \theta)}[f(x)]$.

Score Function Estimator

The score function estimator relies on the ‘log-trick’, i.e. the following identity (given appropriate regularity assumptions on $p$ and $f$):

$$\nabla_\theta \mathbb{E}_{x \sim p(x; \theta)}[f(x)] = \mathbb{E}_{x \sim p(x; \theta)}\left[f(x)\, \nabla_\theta \log p(x; \theta)\right]$$

This last quantity can then be estimated using Monte-Carlo sampling. This estimator is very general and can be applied even if $x$ is a discrete random variable, as long as $p(x; \theta)$ is differentiable w.r.t. $\theta$ for all $x$. However, it suffers from high variance [glasserman2013monte].
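As a rough illustration of the score function estimator on a discrete variable, the sketch below estimates the gradient of an expectation over a Bernoulli variable by Monte-Carlo sampling and compares it to the closed-form value. The logistic parameterization, the test function `f` and all function names are assumptions chosen for this example; they do not come from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score_function_gradient(f, theta, num_samples, rng):
    """Monte-Carlo estimate of d/dtheta E_{x ~ Bernoulli(sigmoid(theta))}[f(x)]."""
    p = sigmoid(theta)
    total = 0.0
    for _ in range(num_samples):
        x = 1 if rng.random() < p else 0
        # For a Bernoulli with logit theta, d/dtheta log p(x; theta) = x - p.
        total += f(x) * (x - p)
    return total / num_samples


def exact_gradient(f, theta):
    """Closed form: (f(1) - f(0)) * p * (1 - p)."""
    p = sigmoid(theta)
    return (f(1) - f(0)) * p * (1.0 - p)


def f(x):
    # Hypothetical payoff defined only on {0, 1}; no derivative w.r.t. x is needed.
    return 3.0 if x == 1 else -1.0


rng = random.Random(0)
estimate = score_function_gradient(f, 0.5, 200_000, rng)
exact = exact_gradient(f, 0.5)
print(f"score-function estimate: {estimate:.4f}, exact: {exact:.4f}")
```

Note that `f` is only ever evaluated, never differentiated, which is what makes this estimator usable with discrete variables.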

Pathwise Derivative

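The pathwise derivative estimator depends on $p(x; \theta)$ being re-parameterizable, i.e., there exists a function $g$ and a distribution $q$ (independent of $\theta$) such that sampling $x \sim p(x; \theta)$ is equivalent to sampling $\epsilon \sim q$ and computing $x = g(\theta, \epsilon)$. Given that observation, $\nabla_\theta \mathbb{E}_{x \sim p(x; \theta)}[f(x)] = \mathbb{E}_{\epsilon \sim q}\left[\nabla_\theta f(g(\theta, \epsilon))\right]$. This quantity can once again be estimated using Monte-Carlo sampling but, conversely, typically has lower variance [glasserman2013monte]. This requires $f(g(\theta, \epsilon))$ to be a differentiable function of $\theta$ for all $\epsilon$.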
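The variance gap between the two estimators can be seen on a small example. The sketch below, under assumed choices (a Gaussian with mean `theta`, the test function f(x) = x^2, and hypothetical function names), draws per-sample gradient estimates with both the pathwise and the score function approach; both target the same gradient, 2 * theta.

```python
import random
import statistics


def pathwise_samples(theta, sigma, n, rng):
    """Per-sample pathwise gradients for f(x) = x^2 with x = theta + sigma * eps."""
    # d/dtheta f(theta + sigma * eps) = 2 * (theta + sigma * eps)
    return [2.0 * (theta + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]


def score_samples(theta, sigma, n, rng):
    """Per-sample score function gradients for the same expectation."""
    out = []
    for _ in range(n):
        x = theta + sigma * rng.gauss(0.0, 1.0)
        # d/dtheta log N(x; theta, sigma^2) = (x - theta) / sigma^2
        out.append(x * x * (x - theta) / sigma ** 2)
    return out


rng = random.Random(0)
theta, sigma, n = 1.0, 1.0, 50_000
pw = pathwise_samples(theta, sigma, n, rng)
sf = score_samples(theta, sigma, n, rng)

pw_mean, pw_var = statistics.fmean(pw), statistics.pvariance(pw)
sf_mean, sf_var = statistics.fmean(sf), statistics.pvariance(sf)
print(f"pathwise: mean={pw_mean:.3f} var={pw_var:.2f}")
print(f"score:    mean={sf_mean:.3f} var={sf_var:.2f}")
```

In this toy setting the pathwise estimator has a markedly smaller per-sample variance, but it needs the derivative of f, which the score function estimator does not.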

We override (resp. ) as (resp. ), and aim at maximizing this objective function using gradient ascent.


Using the score function estimator, we can derive the REINFORCE rule [williams1992simple], which is applicable without assumptions on the dynamics or action space. With a stochastic policy, we want to maximize where . Then the REINFORCE rule is . We can estimate this quantity using Monte-Carlo sampling. This only requires differentiability of w.r.t. and does not assume knowledge of and . This estimator is however not applicable to deterministic policies.

A Method for Model-based Policy Optimization

Let be a deterministic policy, differentiable w.r.t. both and . Assuming we have knowledge of and , we can directly differentiate : . The first term, , is easily computed given knowledge of the reward functions. The second term, given knowledge of the dynamics, can be computed by recursively differentiating , i.e. and:


This method can be extended to stochastic dynamics and policies by using a pathwise derivative estimator i.e. by re-parameterizing the noise [heess2015learning].

This method is applicable for a deterministic and differentiable policy w.r.t. both and , differentiable dynamics w.r.t. both variables and differentiable reward function. This implies that must be continuous. This model-based method aims at utilizing the dynamics and differentiating the dynamics equation.
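The recursive differentiation through the dynamics can be sketched on a one-dimensional system. Everything concrete below is an assumption made for illustration: linear dynamics with hypothetical coefficients `a` and `b`, a linear policy `u = theta * s`, and a quadratic terminal reward. The analytic gradient obtained from the recursion is checked against a finite-difference estimate.

```python
def rollout_with_gradient(theta, s0, horizon, a=0.9, b=0.5):
    """Roll out s_{t+1} = a*s_t + b*u_t with u_t = theta*s_t, tracking ds_t/dtheta."""
    s, ds_dtheta = s0, 0.0
    for _ in range(horizon):
        u = theta * s
        # Chain rule: (df/ds + df/du * dpi/ds) * ds/dtheta + df/du * dpi/dtheta
        ds_dtheta = (a + b * theta) * ds_dtheta + b * s
        s = a * s + b * u
    return s, ds_dtheta


def terminal_reward(s, target=2.0):
    return -(s - target) ** 2


def objective(theta, s0=1.0, horizon=10):
    s_final, _ = rollout_with_gradient(theta, s0, horizon)
    return terminal_reward(s_final)


def analytic_gradient(theta, s0=1.0, horizon=10, target=2.0):
    s_final, ds_final = rollout_with_gradient(theta, s0, horizon)
    # dR/ds at the final state times ds_T/dtheta
    return -2.0 * (s_final - target) * ds_final


theta = 0.2
h = 1e-6
finite_difference = (objective(theta + h) - objective(theta - h)) / (2 * h)
print(analytic_gradient(theta), finite_difference)
```

The recursion only works because the dynamics and the policy are differentiable in the action, which is the restriction to continuous action spaces discussed above.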

Relaxing the Policy Optimization Problem

Deterministic policies are optimal in a non-game theoretic setting. Our objective, in this work, is to perform hybrid policy optimization for deterministic policies for both continuous and discrete action spaces. In order to accomplish this, we present a relaxation of the RL optimization problem. This relaxation allows us to derive, in the subsequent section, a policy gradient estimator that differentiates approximate dynamics and leverages model-based assumptions for any action space. Contrary to traditional hybrid methods, this does not assume differentiability of the dynamics function w.r.t. the action variable, thus elegantly handling discrete action spaces. As in the previous section, we place ourselves in the finite-horizon setting. We assume terminal rewards and a convex state space .

Relaxing the dynamics constraint

In this section, we describe our relaxation. Starting from a class of deterministic policies parameterized by , we can construct a class of stochastic policies parameterized by , that can be chosen as close as desired to by adjusting . On this extended class, we can derive a low-variance gradient estimator. We first explain the relaxation, then detail how to construct from , and finally, provide guarantees on the approximation.

Formally, the RL problem with deterministic policy and dynamics can be written as:

subject to

Given that is deterministic, the constraint can be equivalently rewritten as:


Having made this observation, we proceed to relaxing the optimization from over to , with the constraint now being over approximate and in particular differentiable dynamics. The relaxed optimization program is therefore:

subject to

This relaxation casts the optimization to be over stochastic policies, which allows us to derive a policy gradient in Theorem 3, but under different, approximated dynamics. We later describe how to project the solution in back to an element on .
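To make the relaxation concrete, the minimal sketch below replaces the true next state by its expectation over actions under the current stochastic policy, which is the approximation described in the abstract. The toy dynamics, the two-action space and the logistic policy are hypothetical choices for illustration only.

```python
import math

ACTIONS = (-1.0, 1.0)


def dynamics(s, a):
    # Hypothetical deterministic dynamics; not differentiable in a, since a is discrete.
    return 0.9 * s + 0.1 * a


def policy_probs(s, theta):
    """Probabilities of the actions in ACTIONS under a logistic policy."""
    p_right = 1.0 / (1.0 + math.exp(-theta * s))
    return (1.0 - p_right, p_right)


def relaxed_step(s, theta):
    """Approximate dynamics: expected next state under the current policy."""
    probs = policy_probs(s, theta)
    return sum(p * dynamics(s, a) for p, a in zip(probs, ACTIONS))


# With a uniform policy the relaxed state is the midpoint of the reachable states.
print(relaxed_step(1.0, 0.0))
# With a nearly deterministic policy it matches the true dynamics.
print(relaxed_step(1.0, 50.0), dynamics(1.0, 1.0))
```

The relaxed next state is a convex combination of reachable next states, which is why a convex state space is assumed. It is also a differentiable function of the policy parameter even though the action is discrete.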

Construction of from

Here we show how to construct stochastic policies from deterministic ones while providing a parameterized ‘knob’ to control randomness and closeness to a deterministic policy.

Discrete action spaces

The natural parameterization is as a softmax model. However, this requires careful parameterization, in order to ensure that all policies of are included. We use the deterministic policy as a prior of which we can control the strength. Formally, we choose a class of parameterized functions . For and , we can define the following stochastic policy s.t. . We have therefore defined where and . We easily verify that , as, for any , we can choose an arbitrary and define , we then have .
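One possible reading of this construction is sketched below. The exact formula was lost from this text, so the form used here is an assumption: a softmax over learned scores, plus a bonus of size `strength` on the action chosen by the deterministic policy. The names `scores`, `det_action` and `strength` are hypothetical.

```python
import math


def stochastic_policy(scores, det_action, strength):
    """Softmax over scores, with a prior of adjustable strength on det_action.

    Assumed form: pi(a | s) proportional to exp(score[a] + strength * 1[a == det_action]).
    """
    logits = [g + (strength if a == det_action else 0.0)
              for a, g in enumerate(scores)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


scores = [0.3, -1.2, 0.8]   # hypothetical per-action scores at some state
det_action = 1              # action picked by the deterministic policy

weak_prior = stochastic_policy(scores, det_action, 0.0)
strong_prior = stochastic_policy(scores, det_action, 50.0)
print(weak_prior)
print(strong_prior)
```

With zero strength this reduces to a plain softmax; as the strength grows, the distribution concentrates on the deterministic action, so every deterministic policy is recovered as a limit.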

Continuous action spaces

In the continuous setting, a very simple parameterization is by adding Gaussian noise, of which we can control variance. Formally, given , let and . The surjection can be derived by setting to . More complicated stochastic parameterizations can be easily derived as long as the density remains tractable.


We assume that there is a surjection from to , s.t. any stochastic policy can be made deterministic by setting to a certain value. For the above examples, the mapping consists of setting to .

Similar in spirit to simulated annealing for optimization [kirkpatrick1983optimization], we optimize over , while slowly annealing to converge to an optimal solution in .

Theoretical Guarantees

Having presented the relaxation, we now provide theoretical justifications, to show, first, that given conditions on the stochastic policy, a trajectory computed with approximate dynamics under a stochastic policy is close to the trajectory computed with the true dynamics under a deterministic policy. We additionally present connections in the case where our dynamics are a discretization of a continuous-time system.

Bounding the deviation from real dynamics

In this paragraph, we assume that the action space is continuous. Given the terminal reward setting, the amount of approximation can be defined as the divergence between , the trajectory from following a deterministic policy , and , the trajectory corresponding to the approximate dynamics with a policy . This will allow us to relate the optimal value of the relaxed program with the optimal value of the initial optimization problem.

Theorem 1 (Approximation Bound).

Let and be a deterministic and stochastic policy, respectively, s.t. . Let us suppose that is -lipschitz and is -lipschitz. We further assume that and that . We have the following guarantee:


Furthermore, if are distributions of fixed variance , the approximation converges towards when .


See Appendix. ∎

We know that solving the relaxed optimization problem will provide an upper-bound on the expected terminal reward from a policy in . Given a -lipschitz reward function, this bound shows that the optimal value of the true optimization program is within of the optimal value of the relaxed one.

Equivalence in continuous-time systems

The relaxation has strong theoretical guarantees when the dynamics are a discretization of a continuous-time system. With analogous notations as before, let us consider a continuous-time dynamical system: .

A -discretization of this system can be written as . We can thus write the dynamics of the relaxed problem: .

Intuitively, when the discretization step tends to , the policy converges to a deterministic one. In the limit, our approximation is the true dynamics for continuous time systems. Proposition 2 formalizes this idea.

Proposition 2.

Let be a trajectory obtained from the -discretized relaxed system, following a stochastic policy. Let be the continuous time trajectory. Then, with probability :


See [munos2006policy]. ∎

Relaxed Policy Gradient

The relaxation presented in the previous section allows us to differentiate through the dynamics for any action space. In this section, we derive a novel deterministic policy gradient estimator, Relaxed Policy Gradient (RPG), that combines pathwise derivative and score function estimators, and is applicable to all action spaces. We then show how to apply our estimator, using a sample-efficient and scalable method to estimate the dynamics, first presented in [levine2014learning]. We conclude by extending the estimator to discrete reward functions, effectively making our algorithm applicable to a wide variety of problems. For simplicity, we omit (the stochasticity) from our derivations and consider it fixed.


We place ourselves in the relaxation setting. Letting be a parameterized stochastic policy, we define . Our objective is to find . To that aim, we wish to perform gradient ascent on .

Theorem 3 (Relaxed Policy Gradient).

Given a trajectory , sampled by following a stochastic policy , we define the following quantity where can be computed with the recursion defined by and:


is an unbiased estimator of the policy gradient of the relaxed problem, defined in Equation 



See Appendix. ∎

The presented RPG estimator is effectively a combination of pathwise derivative and score function estimators. Intuitively, the pathwise derivative cannot be used for discrete action spaces as it requires differentiability w.r.t. the action. To this end, we differentiate pathwise through and handle the discrete portion with a score function estimator.
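The way the two estimators combine can be illustrated by differentiating the relaxed dynamics directly. The sketch below is not the paper's Theorem 3, whose equations are missing from this text. It is one recursion obtained by differentiating an expected next state, with a pathwise term through the state and a score function term for the discrete action. For a deterministic check the expectation over actions is enumerated exactly, whereas a sampled estimator would keep only the term for the sampled action. The dynamics, policy and reward are hypothetical.

```python
import math

ACTIONS = (0, 1)
OFFSETS = (-0.1, 0.1)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dynamics(s, a):
    return 0.9 * s + OFFSETS[a]   # hypothetical; discrete in a


def dynamics_ds(s, a):
    return 0.9                    # state-gradient of the dynamics


def policy(s, theta):
    p1 = sigmoid(theta * s)
    return (1.0 - p1, p1)


def grad_log_policy(s, a, theta):
    """Return (d/dtheta, d/ds) of log pi_theta(a | s) for the logistic policy."""
    p1 = sigmoid(theta * s)
    coeff = (1.0 - p1) if a == 1 else -p1
    return coeff * s, coeff * theta


def relaxed_rollout(theta, s0, horizon):
    """Roll out the relaxed dynamics and track ds_t/dtheta."""
    s, ds = s0, 0.0
    for _ in range(horizon):
        probs = policy(s, theta)
        next_s, next_ds = 0.0, 0.0
        for a in ACTIONS:
            dlog_dtheta, dlog_ds = grad_log_policy(s, a, theta)
            pathwise = dynamics_ds(s, a) * ds                      # through the state
            score = dynamics(s, a) * (dlog_dtheta + dlog_ds * ds)  # through the action
            next_s += probs[a] * dynamics(s, a)
            next_ds += probs[a] * (pathwise + score)
        s, ds = next_s, next_ds
    return s, ds


def reward(s):
    return -(s - 0.5) ** 2


def objective(theta, s0=1.0, horizon=5):
    return reward(relaxed_rollout(theta, s0, horizon)[0])


theta, h = 0.3, 1e-6
s_T, ds_T = relaxed_rollout(theta, 1.0, 5)
analytic = -2.0 * (s_T - 0.5) * ds_T
numeric = (objective(theta + h) - objective(theta - h)) / (2 * h)
print(analytic, numeric)
```

Only the state-gradient of the dynamics appears in the recursion; no derivative with respect to the action is ever taken.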

Benefits of pathwise derivative

Gradient estimates through pathwise derivatives are considered lower variance [glasserman2013monte]. In the context of RL, intuitively, REINFORCE suffers from high variance as it requires many samples to properly establish credit assignment, increasing or decreasing probabilities of actions seen during a trajectory based on the reward. Conversely, as RPG estimates the dependency of the state on , the gradient can directly adjust to steer the trajectory to high reward regions. This is possible because the estimator utilizes a model of the dynamics. While REINFORCE assigns credit indifferently to actions, RPG adjusts the entire trajectory.

However, when examining our expression for RPG, the computation requires the gradient of the dynamics w.r.t. the state, unlike REINFORCE. In the next section, we present a scalable and sample-efficient dynamics fitting method, further detailed in [levine2016end], to estimate this term.

Scalable estimation of the dynamics

The Relaxed Policy Gradient estimator can be computed exclusively from sampled trajectories if provided with the state-gradient of the dynamics . However, this information is not available to the agent and thus needs to be estimated from samples. We review a method, first presented in [levine2014learning], that provides an estimate of the state-gradient of the dynamics, incorporating information from multiple sampled trajectories. Although the dynamics are deterministic, we utilize a stochastic estimation to account for modeling errors.

Formally, we have sampled trajectories of the form from the dynamical system. Our goal is not only to accurately model the dynamics of the system, but also to estimate correctly the state-gradient. This prevents us from simply fitting a parametric function to the samples. Indeed, this approach would provide no guarantee that is a good approximation of . This is especially important as the variance of the estimator is dependent on the quality of our approximation of and not the approximation of .

In order to fit a dynamics model, we choose to locally model the dynamics of the discretized stochastic process as , parameterized by . We choose this approach as it does not model global dynamics but instead a good time-varying local model around the latest sampled trajectories, which corresponds to the term we want to estimate. Under that model, the term we are interested in estimating is then .
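A stripped-down version of local model fitting is sketched below: a linear model of the next state is fit by least squares to samples gathered around one point, and its slope in the state serves as the estimate of the state-gradient. This is a simplification under stated assumptions. It omits the GMM prior and the stochastic (Gaussian) component, and the dynamics, the sampling window and all names are hypothetical.

```python
import math
import random


def true_dynamics(s, a):
    # Hypothetical nonlinear dynamics, unknown to the fitting routine.
    return math.sin(s) + 0.1 * a


def fit_local_linear(samples):
    """Least-squares fit of s' ~ A*s + B*a + c from (s, a, s') triples."""
    n = len(samples)
    ms = sum(s for s, _, _ in samples) / n
    ma = sum(a for _, a, _ in samples) / n
    my = sum(y for _, _, y in samples) / n
    # Centered second moments for the 2x2 normal equations.
    sss = sum((s - ms) ** 2 for s, _, _ in samples)
    saa = sum((a - ma) ** 2 for _, a, _ in samples)
    ssa = sum((s - ms) * (a - ma) for s, a, _ in samples)
    ssy = sum((s - ms) * (y - my) for s, _, y in samples)
    say = sum((a - ma) * (y - my) for _, a, y in samples)
    det = sss * saa - ssa ** 2
    A = (ssy * saa - say * ssa) / det
    B = (say * sss - ssy * ssa) / det
    c = my - A * ms - B * ma
    return A, B, c


rng = random.Random(0)
center = 0.5
samples = []
for _ in range(200):
    s = center + rng.uniform(-0.05, 0.05)   # states near the current trajectory
    a = rng.choice((0, 1))                  # discrete actions
    samples.append((s, a, true_dynamics(s, a)))

A, B, c = fit_local_linear(samples)
print(f"estimated state-gradient A={A:.4f}, true value {math.cos(center):.4f}")
```

Because the fit is local, the slope `A` tracks the derivative of the dynamics at the sampled states rather than a global average.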

While this approach is simple, it requires a large number of samples to be accurate. To greatly increase sample-efficiency, we assume a prior over sampled trajectories. We use the GMM approach of [levine2014learning] which corresponds to softly piecewise linear dynamics. At each iteration, we update our prior using the current trajectories and fit our model. This allows us to utilize both past trajectories as well as nearby time steps that are included in the prior. Another key advantage of this stochastic estimation is that it elegantly handles discontinuous dynamics models. Indeed, by averaging the dynamics locally around the discontinuities, it effectively smooths out the discontinuities of the underlying deterministic dynamics. We refer to [levine2016end] for more detailed derivations.

Extension to discrete rewards

Prior to this work, estimators incorporating models of the dynamics such as [heess2015learning, levine2014learning] were constrained to continuously differentiable rewards. We present an extension of this type of estimator to a class of non-continuous rewards. To do so, we make assumptions on the form of the reward and approximate it by a smooth function.

We assume that and that the reward can be written as a sum of indicator functions, i.e.:


where are compact subsets of . This assumption covers a large collection of tasks where the goal is to end up in a compact subspace of . For each , we are going to approximate by an arbitrarily close smooth function.

Proposition 4 (Smooth approximation of an indicator function).

Let be a compact of . For any neighborhood of , there exists, , smooth, s.t. .


See Appendix. ∎

We can now approximate each by a smooth function . Given this surrogate reward , we can apply our estimator. In practice, however, discrete reward functions are often defined as , where is a function from to . We approximate the reward function by ( being the sigmoid function). If is , then is too and pointwise converges to . We present in the Appendix an example of the approximate reward functions for the Mountain Car task.
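The sigmoid surrogate can be sketched as follows. The exact expressions are missing from this text, so the forms below are assumptions: an indicator reward that fires when some function of the state is positive, and a surrogate that passes a scaled version of that function through a sigmoid. The name `sharpness` is hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def indicator_reward(g_value):
    """Assumed discrete reward: 1 if g(s) > 0, else 0."""
    return 1.0 if g_value > 0 else 0.0


def smooth_reward(g_value, sharpness):
    """Smooth surrogate: sigmoid(sharpness * g(s))."""
    return sigmoid(sharpness * g_value)


for sharpness in (1.0, 10.0, 100.0):
    inside = smooth_reward(0.2, sharpness)
    outside = smooth_reward(-0.2, sharpness)
    print(f"sharpness={sharpness:6.1f} inside={inside:.4f} outside={outside:.4f}")
```

Increasing `sharpness` brings the surrogate closer to the indicator away from the boundary, at the cost of steeper gradients near it.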

Given these arbitrarily close surrogate rewards, we can now prove a continuity result, i.e. the sequence of optimal policies under the surrogate rewards converges to the optimal policy.

Proposition 5 (Optimal policies under surrogate reward functions).

Let be a reward function defined as and the optimal policy under this reward. Let us define to be the value of state under the policy . Then, there exists a sequence of smooth reward functions s.t. if is optimal under the reward , .


See Appendix. ∎

Practical Algorithm

In Algorithm 1, we present a practical algorithm to perform deterministic policy optimization based on the previously defined RPG. In summary, given , we construct our extended class of stochastic policies . We then perform gradient-based optimization over , i.e. stochastic policies, while slowly annealing , converging to a policy in . We easily extend the estimator to the case of non-terminal rewards as:

  Inputs: Environment giving given , deterministic class of policies , stochastic class of policies , number of training episodes , learning rate schedule , annealing schedule , initial parameters .
  Initialize and .
  for  to  do
     Sample with policy
     Update GMM prior over dynamics
     Fit dynamics
     for  to  do
         for  to  do
         end for
     end for
  end for
Algorithm 1 Relaxed Policy Gradient


We empirically evaluate our algorithm to investigate the following questions: 1) How much does RPG improve sample complexity? 2) Can RPG learn an optimal policy for the true reward using a smooth surrogate reward? 3) How effective is our approximation of the dynamics?

In an attempt to evaluate the benefits of the relaxation compared to other estimators as fairly as possible, we do not use additional techniques, such as trust regions [schulman2015trust]. Both our method and the compared ones can incorporate these improvements, but we leave study of these more sophisticated estimators for future work.

Classical Control Tasks

We apply our Relaxed Policy Gradient (RPG) to a set of classical control tasks with discrete action spaces: Cart Pole [barto1983neuronlike], Acrobot [sutton1996generalization], Mountain Car [sutton1996generalization] and Hand Mass [munos2006policy]. For the first three tasks, we used the OpenAI Gym [brockman2016openai] implementation and followed [munos2006policy] for the implementation of Hand Mass. Diagrams of our tasks are presented in Figure 1.

Figure 1: Top row: Cart Pole, Acrobot. Bottom row: Mountain Car, Hand Mass.


We compare our method against two different baselines: a black-box, derivative-free optimization algorithm called the Cross-Entropy Method (CEM) [szita2006learning] and the Asynchronous Advantage Actor-Critic algorithm (A3C) [mnih2016asynchronous]. CEM is an evolutionary-type algorithm that samples a different set of parameters at each step, simulates using these parameters and keeps only the top , using those to re-fit the distribution over parameters. CEM is known to work well [duan2016benchmarking], but lacks sample efficiency as the algorithm does not exploit any structure. A3C is a variant of REINFORCE [williams1992simple] that learns a value function to reduce the variance of the estimator. For each task and algorithm, we evaluate for distinct random seeds.
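The CEM loop described above can be sketched in a few lines. This is a generic one-dimensional version under assumed settings (a Gaussian sampling distribution, a hypothetical population size and elite fraction, and a toy objective), not the configuration used in the experiments.

```python
import random
import statistics


def cross_entropy_method(objective, mean, std, iterations, population,
                         elite_frac, rng):
    """Maximize objective over a scalar parameter with a Gaussian search distribution."""
    for _ in range(iterations):
        # Sample candidate parameters and rank them by performance.
        samples = [rng.gauss(mean, std) for _ in range(population)]
        samples.sort(key=objective, reverse=True)
        # Keep the top fraction and re-fit the distribution to it.
        elites = samples[:max(2, int(elite_frac * population))]
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6  # small floor avoids a zero-width search
    return mean


def toy_objective(x):
    # Stand-in for an episode return; maximized at x = 3.
    return -(x - 3.0) ** 2


rng = random.Random(0)
best = cross_entropy_method(toy_objective, mean=0.0, std=2.0, iterations=20,
                            population=50, elite_frac=0.2, rng=rng)
print(f"CEM estimate of the maximizer: {best:.3f}")
```

Each iteration spends `population` evaluations of the objective and uses only their ranking, which is why the method needs no gradients but is comparatively sample-hungry.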

Results and Analysis

We present the learning curves for all tasks in Figure 2. Even when training is done with surrogate rewards, in both instances we report the actual reward. We do not show CEM learning curves as these are not comparable due to the nature of the algorithm: each CEM episode is equivalent to RPG or A3C episodes. We instead report final performance in Tables 1 and 2. For all tasks, the policy is parameterized by a -layer neural network with tanh non-linearities and hidden units, with a softmax layer outputting a distribution over actions. In practice, we optimize over stochastic policies with a fixed . In our experiments, it converged to a near deterministic policy since the optimization is well-conditioned enough in that case. The policies are trained using Adam [kingma2014adam].

Figure 2: Mean rewards over random seeds for classical control tasks. Performance of RPG is shown against A3C.

We evaluate differently depending on the task: since Cartpole and Acrobot have a fixed reward threshold at which they are considered solved, for these we present the number of training samples needed to reach that performance (in Table 1). In contrast, for Mountain Car and Hand Mass, we report (in Table 2) the reward achieved with a fixed number of samples. Both of these metrics are meant to evaluate sample-efficiency.

Sample Complexity

As shown in Tables 1 and 2, our algorithm clearly outperforms A3C across all tasks, requiring between and times fewer samples to solve the tasks and showing significantly better performance for the same number of samples. As shown in Table 2, RPG performs better than CEM in Hand Mass and within for Mountain Car, despite using times fewer samples. We also note that CEM is particularly well suited for Acrobot, as it is a derivative-free method that can explore the space of parameters very fast and find near-optimal parameters quickly when the optimal policies are fairly simple, explaining its impressive sample-complexity on this specific task. We additionally point out that full plots of performance against number of samples are reported for all tasks in Figure 2, and the numbers in Tables 1 and 2 can all be extracted from there.

Robustness, Variance and Training Stability

Overall, our method was very robust to the choice of hyper-parameters such as learning rate or architecture. Indeed, our policy was trained on all tasks with the same learning rate , whereas different learning rates were cross-validated for A3C. When examining the training curves, our estimator demonstrates significantly less variance than A3C and a more stable training process. In tasks where the challenges are exploratory (Acrobot or Mountain Car), RPG’s exploration process is guided by its model of the dynamics while A3C’s is completely undirected, often leading to total failure. The same phenomenon can be observed on control tasks (Hand Mass or Cart Pole), where A3C favors bad actions seen in high reward trajectories, leading to unstable control.

Approximate Reward

On all tasks except Hand Mass, our estimator was trained using approximate smoothed rewards. The performances reported in Figure 2 and Table 2 show that this did not impair training at all. Regarding the baselines, it is interesting to note that since CEM does not leverage the structure of the RL problem, it is natural to train with the real rewards. For A3C, we experimented with the smoothed rewards and obtained comparable numbers.

Method Samples until solve
Cart Pole (Threshold = )
Acrobot (Threshold = -105)
Table 1: Average numbers of samples until the task is solved for the Cart Pole and Acrobot tasks for RPG, A3C and CEM.
Method Samples Performance
Hand Mass ( episodes)
Mountain Car ( episodes)
Table 2: Average mean rewards for the Hand Mass and Mountain Car tasks for RPG, A3C and CEM.


In this section, we explore limitations of our method. To leverage the dynamics of the RL problem, we trade off some of the flexibility of model-free RL, in terms of the class of problems it can tackle, for better sample complexity. This can also be seen as an instance of the bias/variance trade-off.

Limitations on the reward function

While we presented a general way to extend this estimator for rewards in the discrete domain, such function approximations can be difficult to construct for high-dimensional state-spaces such as Atari games. Indeed, one would have to fit the indicator function of a very low dimensional manifold - corresponding to the set of images encoding the state of a given game score - living in a high-dimensional space (order of ).

Limitations on the type of tasks

While we show results on classical control tasks, our estimator is broadly applicable to all tasks where the dynamics can be estimated reasonably well. This has been shown to be possible on a number of locomotion and robotics tasks [levine2014learning, heess2015learning]. However, our work is not directly applicable to raw pixel input tasks such as the Atari domain.

Computational overhead

Compared to REINFORCE, our model presents some computational overhead as it requires evaluating as well as fitting dynamics matrices. In practice, this is minor compared to other training operations such as sampling trajectories or computing necessary gradients. In our experiments, computing the amounts to less than of overhead while dynamics estimation constitutes about .


In this work, we presented a method to find an optimal deterministic policy for any action space and in particular discrete actions. This method relies on a relaxation of the optimization problem to a carefully chosen, larger class of stochastic policies. On this class of policies, we can derive a novel, low-variance, policy gradient estimator, Relaxed Policy Gradient (RPG), by differentiating approximate dynamics. We then perform gradient-based optimization while slowly annealing the stochasticity of our policy, converging to a deterministic solution. We showed how this method can be successfully applied to a collection of RL tasks, presenting significant gains in sample-efficiency and training stability over existing methods. Furthermore, we introduced a way to apply this algorithm to non-continuous reward functions by learning a policy under a smooth surrogate reward function, for which we provided a construction method. It is also important to note that our method is easily amenable to problems with stochastic dynamics, assuming one can re-parameterize the noise.

This work also opens the way to promising future extensions of the estimator. For example, one could easily incorporate imaginary rollouts [gu2016continuous] with the estimated dynamics model or trust regions [schulman2015trust]. Finally, this work can also be extended more broadly to gradient estimation for any discrete random variables.


The authors thank Aditya Grover, Jiaming Song and Steve Mussmann for useful discussions, as well as the anonymous reviewers for insightful comments and suggestions. This research was funded by Intel Corporation, FLI, TRI and NSF grants , , .


Appendix A

Tasks specifications

Cart Pole

This is the classical Cart Pole setting, in which the agent must balance a pole mounted on a pivot point on a cart. Because the system is unstable, the agent has to move the cart continuously to preserve the equilibrium. The agent receives reward for each time step the pole spends upright. The task is considered solved if the agent reaches a reward of .


Acrobot

First introduced in [sutton1996generalization], the system is composed of two joints and two links. With the links initially hanging downwards, the objective is to actuate them so as to swing the end of the lower link up to a fixed height. The reward of a trajectory is the negative of the number of actions required to reach the objective. The task is considered solved if the agent reaches a reward of .

Mountain Car

We consider the usual setting presented in [sutton1996generalization]. The objective is to get a car to the top of a hill by learning a policy that uses momentum gained by alternating backward and forward motions. The state consists of , where is the horizontal position of the car. The reward can be expressed as . The task’s difficulty lies in both the limited maximum tangential force and the need for exploration. The agent receives a reward of . We use a surrogate smoothed reward for DCTPG. We evaluate the performance after a fixed number of episodes.

Hand Mass

We consider a physical system first presented in [munos2006policy]. The agent holds a spring linked to a mass and can move in four directions. The objective is to bring the mass to a target point while keeping the hand as close as possible to the origin point . The state can be described as and the dynamics are described in [munos2006policy]. At the final time step, the agent receives the reward .

Proof of Theorem 1

Lemma 6.

Let be a random variable of with bounded variance. Let .


We can simply use Cauchy-Schwarz inequality. Let’s define .

We prove this by induction.


(2) is obtained with the triangle inequality. (3) holds because is -Lipschitz. (4) follows from the triangle inequality and because . (5) follows from Lemma 6 and the uniform bound on variances.

Given this inequality and the fact that , we can compute the geometric sum i.e.

This concludes the proof.
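The geometric-sum step can be illustrated with a generic worked bound. The symbols below are placeholders introduced only for this sketch and are not the paper's notation: we assume the recursive inequality has the form "next error is at most a constant multiple of the current error plus a constant", and that the error at the initial step is zero.

```latex
% Placeholder symbols (assumptions, not the paper's notation):
%   \epsilon_t : error at step t, with \epsilon_0 = 0
%   a \ge 0    : per-step amplification factor
%   b \ge 0    : per-step additive term
\begin{align*}
\epsilon_{t+1} \le a\,\epsilon_t + b, \quad \epsilon_0 = 0
\;\Longrightarrow\;
\epsilon_t \;\le\; b \sum_{i=0}^{t-1} a^{i}
\;=\; b\,\frac{a^{t}-1}{a-1} \qquad (a \neq 1).
\end{align*}
% For a = 1 the same unrolling gives \epsilon_t \le b\,t.
```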

Proof of Theorem 3

As in the model-based method presented in Background, we want to directly differentiate . As before, we require the reward function to be differentiable. , but when computing , we need to differentiate through the relaxed equation of the dynamics:

The recursion corresponds to a Monte-Carlo estimate with trajectories obtained by running the stochastic policy in the environment, thus concluding the proof.

Proof of Proposition 4

Given a compact of and an open neighborhood , let us show that there exists s.t. . While according to Urysohn’s lemma [alexandroff1924theorie], such continuous functions exist in the general setting of normal spaces given any pair of disjoint closed sets and , we will show how to explicitly construct under our simpler assumptions.

It is sufficient to show that we can construct a function such that for open set containing the closed unit ball, is equal to 1 on the unit ball and is null outside of . Given this function, we can easily construct by scaling and translating.

Let us define . This function is smooth on both and .

Let us define . This function is strictly positive on and null outside of . For large enough, the function satisfies the desired property, thus proving that we can construct .
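As an illustration of this kind of construction, here is a minimal Python sketch of a smooth function that equals 1 on the closed unit ball and vanishes outside a larger ball. It uses the textbook exp(-1/t) building block and is not necessarily the exact construction used in the proof; the function names (`smooth_ramp`, `smooth_step`, `bump`, `bump_around`) and the choice of a ball as the open neighborhood are assumptions made for this example.

```python
import math

def smooth_ramp(t: float) -> float:
    """exp(-1/t) for t > 0 and 0 otherwise; infinitely differentiable on the real line."""
    return math.exp(-1.0 / t) if t > 0.0 else 0.0

def smooth_step(t: float) -> float:
    """Smooth transition: 0 for t <= 0, 1 for t >= 1, strictly between 0 and 1 otherwise."""
    a, b = smooth_ramp(t), smooth_ramp(1.0 - t)
    return a / (a + b)  # a and b are never both zero, so the division is safe

def bump(x, outer_radius: float = 2.0) -> float:
    """Equals 1 on the closed unit ball and 0 outside the ball of radius outer_radius (> 1)."""
    sq_norm = sum(xi * xi for xi in x)
    r2 = outer_radius ** 2
    return smooth_step((r2 - sq_norm) / (r2 - 1.0))

def bump_around(x, center, radius: float, outer_radius: float) -> float:
    """Scaled and translated bump: 1 within `radius` of `center`, 0 beyond `outer_radius`."""
    scaled = [(xi - ci) / radius for xi, ci in zip(x, center)]
    return bump(scaled, outer_radius / radius)
```

For instance, `bump_around([0.55], [0.5], 0.1, 0.3)` evaluates to 1 because the point lies within 0.1 of the center, while points farther than 0.3 from the center map to 0, with a smooth transition in between.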

Figure 3 shows an example of a smooth surrogate reward for the Mountain Car task.

Figure 3: Surrogate smooth reward for the Mountain Car task. Left: True reward. Right: Smoothed surrogate reward.

Proof of Proposition 5

Let us consider the deterministic case where and the reward is terminal. Let be the optimal policy under reward .

There exists a sequence of open neighborhoods s.t. and tends to . Let be the smooth reward function constructed for with neighborhood . Let be the deterministic optimal policy under reward .

For a start state , we can distinguish between two cases:

For the former, no policy can reach the compact and thus the surrogate reward makes no difference. For the latter, the policy takes to a final state where . However, which tends to . The is a Cauchy sequence in a complete space and thus converges to a limit . Since and by continuity of the distance to a compact, proving the result.