1 Introduction
Model-free reinforcement learning algorithms often alternate between two concurrent but interacting processes: (1) policy evaluation, where an action value function (i.e., a Q-value) is updated to obtain a better estimate of the return associated with taking a specific action, and (2) policy improvement, where the policy is updated to maximize the current value function. In the past, different notions of Q-value have led to distinct but important families of RL methods. For example, SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 1998; Van Seijen et al., 2009) uses the expected Q-value, defined as the expected return of following the current policy. Q-learning (Watkins, 1989) exploits a hard-max notion of Q-value, defined as the expected return of following an optimal policy. Soft Q-learning (Haarnoja et al., 2017) and PCL (Nachum et al., 2017) both use a soft-max form of Q-value, defined as the future return of following an optimal entropy-regularized policy. Clearly, the choice of Q-value function has a considerable effect on the resulting algorithm; for example, restricting the types of policies that can be expressed, and determining the type of exploration that can be naturally applied. In each case, the Q-value at a state s and action a answers the question, "What would my future value from s be if I were to take an initial action a?"
Such information about a hypothetical action is helpful when learning a policy; we want to nudge the policy distribution to favor actions with potentially higher Q-values.
In this work, we investigate the practicality and benefits of answering a more difficult, but more relevant, question:
"What would my future value from s be if I were to sample my initial action from a distribution centered at a?"
We focus our efforts on Gaussian policies, and thus the counterfactual posited by this Q-value inquires about the expected future return of following the policy when changing the mean of the initial Gaussian distribution. Thus, our new notion of Q-values maps a state-action pair (s, a) to the expected return of first taking an action sampled from a normal distribution centered at a, and following actions sampled from the current policy thereafter. In this way, the Q-values we introduce may be interpreted as a Gaussian-smoothed version of the expected Q-value; hence we term them smoothed Q-values. We show that smoothed Q-values possess a number of important properties that make them attractive for use in RL algorithms. It is clear from the definition of smoothed Q-values that, if known, their structure is highly beneficial for learning the mean of a Gaussian policy. We are able to show more than this: although the smoothed Q-values are not a direct function of the covariance, one can surprisingly use knowledge of the smoothed Q-values to derive updates to the covariance of a Gaussian policy. Specifically, the gradient of the standard expected return objective with respect to the mean and covariance of a Gaussian policy is equivalent to the gradient and Hessian of the smoothed Q-value function, respectively. Moreover, we show that the smoothed Q-values satisfy a single-step Bellman consistency, which allows bootstrapping to be used to train them via function approximation.
These results lead us to propose an algorithm, Smoothie, which, in the spirit of (Deep) Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2016), trains a policy using the derivatives of a trained (smoothed) Q-value function to learn a Gaussian policy. Crucially, unlike DDPG, which is restricted to deterministic policies and is well-known to have poor exploratory behavior (Haarnoja et al., 2017), the approach we develop is able to utilize a non-deterministic Gaussian policy parameterized by both a mean and a covariance, thus allowing the policy to be exploratory by default and alleviating the need for excessive hyperparameter tuning. On the other hand, compared to standard policy gradient algorithms (Williams & Peng, 1991; Konda & Tsitsiklis, 2000), Smoothie's utilization of the derivatives of a Q-value function to train a policy avoids the high variance and sample inefficiency of stochastic updates. Furthermore, we show that Smoothie can be easily adapted to incorporate proximal policy optimization techniques by augmenting the objective with a penalty on the KL-divergence from a previous version of the policy. The inclusion of a KL-penalty is not feasible in the standard DDPG algorithm, but we show that it is possible with our formulation, and it significantly improves stability and overall performance. On standard continuous control benchmarks, our results are competitive with or exceed state-of-the-art, especially for more difficult tasks in the low-data regime.
For most RL algorithms, including SARSA, expected SARSA, and Q-learning, the policy improvement step is straightforward: an improved policy is obtained by taking the greedy action that maximizes the action value function, or alternatively, the greedy action is taken with probability 1 − ε and a random action is taken with probability ε. Using such greedy schemes may improve exploration and facilitate formulating convergence guarantees. However, policy improvement by finding the locally optimal policy for an arbitrary action value function is not tractable, especially in continuous and high-dimensional action spaces. For example, if one considers the family of multivariate Gaussian policies in a continuous action space, then policy optimization within the feasible policies is not straightforward. Prior work on deterministic policy gradient (DPG) and its extensions substitutes the mean of a Gaussian policy into the action value function and makes use of the gradient of the action value function to update the mean so as to improve action value estimates. Such formulations are interesting because they can handle off-policy data, and they do not require Monte Carlo sampling to compute expected Q-values under the current policy, resulting in a learning algorithm with lower variance. However, DPG and its variants make a strong assumption about the policy: they assume that the policy is deterministic, i.e., the limit in which the variance of the Gaussian goes to zero. It was previously thought that one needs to assume policy determinism to achieve the benefits of the DPG formulation, but in this paper we show the policy determinism assumption is not necessary. In other words, one can keep the benefits of DPG in making use of action value gradients and off-policy data, but also allow for a general family of stochastic Gaussian policies.
Model-free reinforcement learning (RL) aims to optimize an agent's behavior policy through trial-and-error interaction with a black box environment. An agent alternates between observing a state provided by the environment (e.g., joint positions and velocities), applying an action to the environment (e.g., force or torque), and receiving a reward from the environment (e.g., velocity in a specific desired direction). The agent's objective is to maximize the long-term sum of rewards received from the environment.
Within continuous control, the agent's policy is traditionally parameterized by a unimodal Gaussian. The mean and covariance defining the Gaussian are then trained using policy gradient methods (Konda & Tsitsiklis, 2000; Williams & Peng, 1991) to maximize expected total reward. Policy gradient uses experience sampled stochastically from its policy as a cue to nudge the mean and covariance to put more probability mass on experience that yielded a higher long-term reward than expected, and vice versa for experience that yielded a lower long-term reward. The stochastic nature of this training paradigm makes policy gradient methods unstable and high-variance. To mitigate this problem, large batch sizes or trust region methods (Schulman et al., 2015; Nachum et al., 2018; Schulman et al., 2017) are utilized, although both of these remedies can require collecting a large amount of experience, making them infeasible for real-world applications.
As an attempt to circumvent the issue of high variance due to highly stochastic updates, recent years have seen the introduction of policy gradient algorithms which update policy parameters based only on the surface of a learned Q-value function. The most widely used such algorithm is (Deep) Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2016). In DDPG, the policy is deterministic, parameterized only by a mean. A Q-value function is trained to take in a state and action and return the future discounted sum of rewards of first taking the specified action and subsequently following the deterministic policy. Thus, the surface of the learned Q-value function dictates how the policy should be updated: along the gradient of the Q-value with respect to the input action.
While DDPG has successfully avoided the highly stochastic policy updates associated with traditional policy gradient algorithms, its deterministic policy naturally leads to poor exploration. A deterministic policy gives no indication regarding which directions in action space to explore. Thus, in practice, the trained policy differs from the behavior policy, which is augmented with Gaussian noise whose variance is treated as a hyperparameter to optimize. Even so, DDPG is well-known to have poor exploratory behavior (Haarnoja et al., 2017).
In this paper, we present a method which applies the same technique of updating a policy based only on a learned Q-value function to a stochastic policy parameterized by both a mean and covariance. We show that given access to the proposed Q-value function, or a sufficiently accurate approximation, it is possible to derive unbiased updates to both the policy mean and covariance. Unlike recent attempts at this (e.g., Ciosek & Whiteson (2018)), our updates require neither approximate integrals nor low-order assumptions on the form of the true Q-value. Crucially, providing a method to update the covariance allows the policy to be exploratory by default and alleviates the need for excessive hyperparameter tuning. Moreover, we show that our technique can be easily adapted to incorporate trust region methods by augmenting the objective with a penalty on the KL-divergence from a previous version of the policy. The inclusion of a trust region is not possible in standard DDPG, and we show that it improves stability and overall performance significantly when our algorithm is evaluated on standard continuous control benchmarks.
2 Formulation
We consider the standard model-free RL problem, represented as a Markov decision process (MDP) consisting of a state space S and an action space A. At iteration t the agent encounters a state s_t and emits an action a_t, after which the environment returns a scalar reward r_t and places the agent in a new state s_{t+1}. We focus on continuous control tasks, where the actions are real-valued, i.e., A = R^{d_a}. We parameterize the behavior of the agent using a stochastic policy π(a|s), which takes the form of a Gaussian density at each state s. The Gaussian policy is parameterized by a mean function μ(s) and a covariance function Σ(s), so that π(a|s) = N(a | μ(s), Σ(s)), where

(1)  N(a | μ, Σ) = |2πΣ|^{−1/2} exp( −(1/2) ‖a − μ‖²_Σ ),

here using the notation ‖v‖²_Σ = v^⊤ Σ^{−1} v.
Below we develop new RL training methods for this family of parametric policies, but some of the ideas presented may generalize to other families of policies as well. We begin the formulation by reviewing some prior work on learning Gaussian policies.
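As a point of reference for the density in (1), the following minimal NumPy sketch evaluates a multivariate Gaussian policy density; the function name and the sanity check against the standard normal are our own illustration, not code from the paper.

```python
import numpy as np

def gaussian_density(a, mu, sigma):
    """Evaluate N(a | mu, Sigma) = |2*pi*Sigma|^{-1/2} exp(-0.5 ||a - mu||^2_Sigma),
    with ||v||^2_Sigma = v^T Sigma^{-1} v, as in Eq. (1)."""
    diff = a - mu
    maha = diff @ np.linalg.solve(sigma, diff)          # ||a - mu||^2_Sigma
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * sigma))  # |2*pi*Sigma|^{1/2}
    return np.exp(-0.5 * maha) / norm

# Sanity check: in 1-D with mu = 0 and Sigma = 1 this is the standard normal
# density at 0, i.e., 1/sqrt(2*pi) ~ 0.3989.
val = gaussian_density(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```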
2.1 Policy Gradient for Generic Stochastic Policies
The optimization objective (expected discounted return), as a function of a generic stochastic policy π, is expressed in terms of the expected action value function Q^π by,

(2)  O_ER(π) = E_{s∼ρ^π(s)} E_{a∼π(a|s)} [ Q^π(s, a) ],

where ρ^π(s) is the state visitation distribution under π, and Q^π(s, a) is recursively defined using the Bellman equation,

(3)  Q^π(s, a) = E [ r(s, a) + γ E_{a′∼π(a′|s′)} [ Q^π(s′, a′) ] ],

where γ ∈ [0, 1) is the discount factor. For brevity, we suppress explicit notation for the distribution over immediate rewards and over state transitions.
The policy gradient theorem (Sutton et al., 2000) expresses the gradient of O_ER w.r.t. θ, the tunable parameters of a policy π_θ, as,

(4)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} E_{a∼π_θ(a|s)} [ ∇_θ log π_θ(a|s) Q^π(s, a) ].

In order to approximate the expectation on the RHS of (4), one often resorts to an empirical average over on-policy samples from π_θ. This sampling scheme results in a gradient estimate with high variance, especially when π_θ is not concentrated. Many policy gradient algorithms, including actor-critic variants, trade off variance and bias, e.g., by attempting to estimate Q^π accurately using function approximation and the Bellman equation.
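To make the variance issue concrete, here is a small Monte Carlo sketch of the score-function estimator in (4) for a one-dimensional Gaussian policy in a single-state (bandit) setting; the quadratic reward and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(mu, sigma, reward_fn, n_samples=10000):
    """Score-function (REINFORCE) estimate of d/d_mu E_{a ~ N(mu, sigma^2)}[R(a)],
    using grad_mu log pi(a) = (a - mu) / sigma^2 for a 1-D Gaussian."""
    a = rng.normal(mu, sigma, size=n_samples)
    score = (a - mu) / sigma ** 2
    return np.mean(reward_fn(a) * score)

# Assumed quadratic reward R(a) = -(a - 2)^2; the exact gradient at mu is
# -2 * (mu - 2), i.e., 4 at mu = 0, but even 10000 samples leave visible noise.
g = reinforce_grad(mu=0.0, sigma=1.0, reward_fn=lambda a: -(a - 2.0) ** 2)
```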
In the simplest scenario, an unbiased estimate of Q^π(s, a) is formed by accumulating discounted rewards from each state forward using a single Monte Carlo sample. The objective of the policy is to maximize expected future discounted reward at each state until reaching some terminal time T:

(5)  O(π, s) = E [ Σ_{t=0}^{T} γ^t r(s_t, a_t) | s_0 = s ].

The objective may also be expressed in terms of expected Q-values as

(6)  O(π, s) = E_{a∼π(a|s)} [ Q^π(s, a) ],

where Q^π(s, a) is defined recursively as

(7)  Q^π(s, a) = E [ r(s, a) + γ E_{a′∼π(a′|s′)} [ Q^π(s′, a′) ] ],

to represent the expected future discounted reward of taking action a at state s and subsequently following the policy π.

The state-agnostic objective is then

(8)  O(π) = E_{s∼ρ^π(s)} [ O(π, s) ],

where ρ^π(s) is the state distribution induced by the policy π.
2.2 Deterministic Policy Gradient
Silver et al. (2014) study the policy gradient for the specific class of Gaussian policies in the limit where the policy's covariance approaches zero. In this scenario, the policy becomes deterministic and samples from the policy approach the Gaussian mean. Under a deterministic policy π ≡ μ(s), one can estimate the expected future return from a state as,

(9)  O_ER(π) = E_{s∼ρ^π(s)} [ Q^π(s, μ(s)) ].

Accordingly, Silver et al. (2014) express the gradient of the expected discounted return objective for π_θ ≡ μ_θ as,

(10)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} [ ∇_θ μ_θ(s) ∇_a Q^π(s, a) |_{a=μ_θ(s)} ].

This characterization of the policy gradient theorem for deterministic policies is called the deterministic policy gradient (DPG). Since no Monte Carlo sampling is required for estimating the gradient, the variance of the estimate is reduced. On the other hand, the deterministic nature of the policy can lead to poor exploration and training instability in practice.
In the limit of Σ(s) → 0, one can also re-express the Bellman equation (3) as,

(11)  Q^π(s, μ(s)) = E [ r(s, μ(s)) + γ Q^π(s′, μ(s′)) ].

Therefore, a value function approximator Q_w can be optimized by minimizing the Bellman error,

(12)  E_{(s, a, r, s′)} [ ( Q_w(s, a) − r − γ Q_w(s′, μ_θ(s′)) )² ],

for transitions (s, a, r, s′) sampled from a dataset of interactions of the agent with the environment. The deep variant of DPG known as DDPG (Lillicrap et al., 2016) alternates between improving the action value estimate by gradient descent on (12) and improving the policy based on (10).
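The alternation between the Bellman-error fit (12) and the policy update (10) can be sketched in a one-state bandit with a quadratic critic; the reward function, least-squares critic, and constants below are illustrative assumptions (real DDPG uses neural networks, minibatch SGD, and target networks).

```python
import numpy as np

rng = np.random.default_rng(1)

def true_q(a):                 # environment reward; plays the role of Q^pi here
    return -(a - 1.5) ** 2

mu = -1.0                      # deterministic policy: a single scalar mean
for _ in range(200):
    a = mu + rng.normal(0.0, 0.5, size=64)             # noisy behavior actions
    X = np.stack([np.ones_like(a), a, a * a], axis=1)  # quadratic critic features
    w = np.linalg.lstsq(X, true_q(a), rcond=None)[0]   # critic fit, cf. Eq. (12)
    mu += 0.05 * (w[1] + 2.0 * w[2] * mu)              # ascend dQ_w/da at a = mu, Eq. (10)

# mu should approach a* = 1.5, the maximizer of the learned critic surface.
```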
To improve sample efficiency, Degris et al. (2012) and Silver et al. (2014) replace the state visitation distribution ρ^π(s) in (10) with an off-policy visitation distribution based on a replay buffer. This substitution introduces some bias in the gradient estimate (10), but previous work has found that it works well in practice and improves the sample efficiency of policy gradient algorithms. We also adopt a similar heuristic in our method to make use of off-policy data.
In practice, DDPG exhibits improved sample efficiency over standard policy gradient algorithms: using off-policy data to train Q-values while basing policy updates on their gradients significantly improves on the stochastic policy updates dictated by (4), which require a large number of samples to reduce noise. On the other hand, the deterministic nature of the policy learned by DDPG leads to poor exploration and instability in training. In this paper, we propose an algorithm which, like DDPG, utilizes derivative information of learned Q-values for better sample efficiency, but which, unlike DDPG, is able to learn a Gaussian policy and impose a KL-penalty for better exploration and stability.
The objective for Q_w becomes

(13)  J(w) = E_{(s, a, r, s′)} [ ( Q_w(s, a) − r − γ Q_w(s′, μ_θ(s′)) )² ].

A parameterized Q_w can thus be trained according to

(14)  w ← w + η E_{(s, a, r, s′)} [ ( r + γ Q_w(s′, μ_θ(s′)) − Q_w(s, a) ) ∇_w Q_w(s, a) ],

holding the bootstrapped target fixed.
2.3 Stochastic Value Gradients for Gaussian Policies
Inspired by deterministic policy gradients, Heess et al. (2015) propose to reparameterize the expectation w.r.t. a Gaussian policy in (4) with an expectation over a noise variable drawn from a standard normal distribution. Note that a stochastic action drawn from a Gaussian policy can be reparameterized as a = μ_θ(s) + Σ_θ(s)^{1/2} ε, for ε ∼ N(0, I). Accordingly, the policy gradients take the form of,

(15)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} E_{ε∼N(0, I)} [ ∇_a Q^π(s, a) |_{a=μ+Σ^{1/2}ε} ∇_θ ( μ_θ(s) + Σ_θ(s)^{1/2} ε ) ],

where for brevity, we drop the dependence of μ and Σ on s and θ inside the evaluation point. Similarly, one can re-express the Bellman equations using an expectation over the noise variable.
The key advantage of this formulation by Heess et al. (2015), called stochastic value gradients (SVG), over generic policy gradients (4) is that, similar to DPG (10), SVG makes direct use of the gradients of the mean and covariance functions with respect to the model parameters. The benefit over DPG is that SVG keeps the policy stochastic and enables learning the covariance function at every state, but the key disadvantage is that SVG requires sampling a noise variable to estimate the gradients, resulting in higher variance. In this paper, we show how one can combine the benefits of DPG and SVG to formulate policy gradients for Gaussian policies without requiring samples from a noise variable to estimate the gradients, hence achieving lower variance.
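A minimal sketch of the SVG-style reparameterized estimator in (15) for a one-dimensional bandit, using the assumed reward R(a) = -(a - 2)^2; the exact gradients at (mu, sigma) = (0, 1) are 4 and -2, and the Monte Carlo estimates scatter around them.

```python
import numpy as np

rng = np.random.default_rng(0)

def svg_grads(mu, sigma, dr, n=5000):
    """Reparameterized (SVG-style) estimates of the gradients of
    E_{a ~ N(mu, sigma^2)}[R(a)] w.r.t. mu and sigma, via a = mu + sigma * eps."""
    eps = rng.standard_normal(n)
    g = dr(mu + sigma * eps)           # dR/da at the sampled actions
    return g.mean(), (g * eps).mean()  # chain rule: da/dmu = 1, da/dsigma = eps

# Assumed reward R(a) = -(a - 2)^2, so dR/da = -2 * (a - 2).
g_mu, g_sigma = svg_grads(0.0, 1.0, lambda a: -2.0 * (a - 2.0))
```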
3 Idea
Before giving a full exposition, we use a simplified scenario to illustrate the key intuitions behind the proposed approach and how it differs fundamentally from previous methods.
Consider a one-shot decision-making problem over a one-dimensional action space with a single state. Here the expected reward is given by a function Q(a) over the real line, which also corresponds to the optimal Q-value function; Figure 1 gives a concrete example. We assume the policy is specified by a Gaussian distribution N(μ, σ²) parameterized by a scalar mean μ and standard deviation σ. The goal is to optimize the policy parameters to maximize expected reward. A naive policy gradient method updates the parameters by sampling a ∼ N(μ, σ²), observing reward r, then adjusting μ and σ in the directions r ∇_μ log N(a | μ, σ²) and r ∇_σ log N(a | μ, σ²). Note that such updates suffer from large variance, particularly when σ is small.

To reduce the variance of direct policy gradient, deterministic policy gradient methods leverage a value function approximator Q_w(a), parameterized by w, to approximate Q. For example, in this scenario, vanilla DPG would sample an action a = μ + ε with exploration noise ε ∼ N(0, σ²), then update w using ∇_w (Q_w(a) − r)² and μ using ∇_a Q_w(a) |_{a=μ}. Clearly, this update exhibits reduced variance, but requires Q_w to approximate Q (the green curve in Figure 1) to control bias. Unfortunately, DPG is not able to learn the exploration variance σ². Variants of DPG such as SVG (Heess et al., 2015) and EPG (Ciosek & Whiteson, 2018) have been proposed to work with stochastic policies. However, they either have restrictive assumptions on the form of the true Q-value, introduce noise into the policy updates, or require an approximate integral, thus losing the advantage of deterministic gradient updates. By contrast, SVG is able to work with a stochastic policy and learn the variance, but loses the advantage of deterministic gradient updates. In particular, vanilla SVG would sample ε ∼ N(0, 1), set a = μ + σε, then update w using ∇_w (Q_w(a) − r)², μ using ∇_a Q_w(a), and σ using ε ∇_a Q_w(a), all of which re-introduce significant variance in the updates.
Note, however, that the expected value at any given mean μ is actually given by a convolution of the Gaussian policy with the underlying expected reward function. Such a process inherently smooths the reward landscape, as shown by the magenta curve in Figure 1. Unfortunately, DPG completely ignores this smoothing effect by trying to approximate Q, while policy gradient methods only benefit from it indirectly through sampling. A key insight is that this smoothing effect can be captured directly in the value function approximator itself, bypassing any need for sampling or approximating Q. That is, instead of using an approximator to model Q, one can directly approximate the smoothed version given by Q̃(a) = E_{x∼N(a, σ²)}[Q(x)] (the magenta curve in Figure 1), which, crucially, satisfies E_{a∼N(μ, σ²)}[Q(a)] = Q̃(μ).
Based on this observation, we propose a novel actor-critic strategy below that uses a function approximator Q̃_w to model Q̃. Although approximating Q̃ instead of Q might appear to be a subtle change, it is a major alteration to existing actor-critic approaches. Not only is approximating the magenta curve in Figure 1 far easier than the green curve, modeling Q̃ allows the policy parameters to be updated deterministically for any given action. In particular, in the simple scenario above, if one sampled an action from the current policy, a ∼ N(μ, σ²), and observed reward r, then w could be updated using ∇_w (Q̃_w(μ) − r)², μ using ∇_a Q̃_w(a) |_{a=μ}, and σ² using (1/2) ∇²_a Q̃_w(a) |_{a=μ} (a key result we establish below).
Such a strategy combines the best aspects of DPG and policy gradient while conferring additional advantages: (1) the smoothed value function Q̃ cannot add, but can only remove, local minima relative to Q; (2) Q̃ is smoother than Q and hence easier to approximate; (3) approximating Q̃ allows deterministic gradient updates for the policy parameters; (4) approximating Q̃ allows gradients to be computed for both the mean and variance parameters. Among these advantages, DPG shares only (3) and policy gradient only (1). We will see below that the new strategy we propose significantly outperforms existing approaches, not only in the toy scenario depicted in Figure 1, but also in challenging benchmark problems.
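The smoothing effect can be checked numerically. Below, an assumed two-bump reward (a stand-in for the landscape in Figure 1) has a near-zero slope at its poor local optimum, while the Gaussian-smoothed landscape still slopes noticeably toward the global optimum; all constants are illustrative.

```python
import numpy as np

def q(a):
    """Assumed bimodal reward: a weak local optimum near a = -1 and a broader
    global optimum at a = 4."""
    return 0.6 * np.exp(-(a + 1.0) ** 2) + np.exp(-0.25 * (a - 4.0) ** 2)

def q_smoothed(a, s=2.0, n=200000, seed=0):
    """Monte Carlo Gaussian convolution: Q~(a) = E_{x ~ N(a, s^2)}[Q(x)].
    A fixed seed gives common random numbers across evaluation points."""
    rng = np.random.default_rng(seed)
    return q(a + s * rng.standard_normal(n)).mean()

# Central differences at the local optimum a = -1: the raw slope is ~0, so a
# DPG-style ascent on Q stalls there, while the smoothed landscape retains a
# clear slope toward the global optimum.
eps = 0.05
grad_q = (q(-1 + eps) - q(-1 - eps)) / (2 * eps)
grad_qs = (q_smoothed(-1 + eps) - q_smoothed(-1 - eps)) / (2 * eps)
```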
4 Smoothed Action Value Functions
Moving beyond a simple illustrative scenario, the key contribution of this paper is to introduce the general notion of a smoothed action value function, the gradients of which provide an effective signal for optimizing the parameters of a Gaussian policy. Smoothed Q-values, which we denote Q̃^π(s, a), differ from ordinary Q-values by not assuming the first action of the agent is fully specified; instead, they assume only that a Gaussian centered at the action is known. Thus, to compute Q̃^π(s, a), one has to take an expectation of Q^π(s, ã) for actions ã drawn in the vicinity of a. More formally, smoothed action values are defined as,

(16)  Q̃^π(s, a) = E_{ã∼N(a, Σ(s))} [ Q^π(s, ã) ].
With this definition of Q̃^π, one can re-express the gradient of the expected reward objective (Equation (4)) for a Gaussian policy as,

(17)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} [ ∇_a Q̃^π(s, a) |_{a=μ_θ(s)} ∇_θ μ_θ(s) + ∇_Σ Q̃^π(s, a) |_{a=μ_θ(s)} ∇_θ Σ_θ(s) ].

The insight that differentiates this approach from prior work (Heess et al., 2015; Ciosek & Whiteson, 2018) is that instead of learning a function approximator for Q^π and then drawing samples to approximate the expectation in (16) and its derivative, we directly learn a function approximator for Q̃^π.
One of the key observations that enables learning a function approximator for Q̃^π is that smoothed Q-values satisfy a notion of Bellman consistency. First, note that for Gaussian policies we have the following relation between the expected and smoothed Q-values:

(18)  E_{a∼π(a|s)} [ Q^π(s, a) ] = Q̃^π(s, μ(s)).

Then, combining (16) and (18), one can derive the following one-step Bellman equation for smoothed Q-values,

(19)  Q̃^π(s, a) = E_{ã∼N(a, Σ(s))} [ r(s, ã) + γ Q̃^π(s′, μ(s′)) ],

where ã and s′ are sampled from N(a, Σ(s)) and the environment transition given (s, ã), respectively. Below, we elaborate on how one can make use of the derivatives of Q̃^π to learn μ and Σ, and how the Bellman equation in (19) enables direct optimization of Q̃^π.
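Definition (16) can be sanity-checked by Monte Carlo in a bandit setting, where for a quadratic Q the Gaussian integral is analytic; the particular Q and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(a):                       # assumed quadratic Q^pi (single-state bandit case)
    return 3.0 - (a - 1.0) ** 2

def q_tilde_mc(a, sigma, n=200000):
    """Monte Carlo estimate of Eq. (16): Q~(a) = E_{x ~ N(a, sigma^2)}[Q(x)]."""
    return q(a + sigma * rng.standard_normal(n)).mean()

# For a quadratic Q, the Gaussian integral is analytic:
# Q~(a) = 3 - (a - 1)^2 - sigma^2, so Q~(0) with sigma = 0.5 is 3 - 1 - 0.25 = 1.75.
est = q_tilde_mc(0.0, 0.5)
```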
4.1 Policy Improvement
We assume a Gaussian policy parameterized by φ and ψ for the mean and the covariance, respectively, i.e., μ_φ(s) and Σ_ψ(s). The gradient of the objective w.r.t. the mean parameters follows from the policy gradient theorem in conjunction with (17) and is almost identical to (10),

(20)  ∇_φ O_ER(π) = E_{s∼ρ^π(s)} [ ∇_φ μ_φ(s) ∇_a Q̃^π(s, a) |_{a=μ_φ(s)} ].

Estimating the derivative of the objective w.r.t. the covariance parameters is not as straightforward, since Q̃^π is not a direct function of Σ. However, a key result is that the second derivative of Q̃^π w.r.t. actions is sufficient to exactly compute the derivative of Q̃^π w.r.t. Σ:

(21)  ∇_Σ Q̃^π(s, a) = (1/2) ∇²_a Q̃^π(s, a).

A proof of this identity is provided in the Appendix. The full derivative w.r.t. ψ can then be shown to take the form,

(22)  ∇_ψ O_ER(π) = E_{s∼ρ^π(s)} [ (1/2) ∇²_a Q̃^π(s, a) |_{a=μ_φ(s)} ∇_ψ Σ_ψ(s) ].
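The identity (21) can be verified numerically in one dimension, where it reduces to d/d(sigma²) Q̃ = (1/2) d²Q̃/da². We use Q(a) = sin(a), whose smoothed version is analytic; this is only an illustrative check, not part of the paper's algorithm.

```python
import numpy as np

# With Q(a) = sin(a) and smoothing variance `var`, the smoothed value is
# analytic: Q~(a) = sin(a) * exp(-var / 2).
def q_tilde(a, var):
    return np.sin(a) * np.exp(-var / 2.0)

mu, var, h = 0.7, 0.3, 1e-3
# Left side of (21): derivative w.r.t. the (co)variance, by central difference.
lhs = (q_tilde(mu, var + h) - q_tilde(mu, var - h)) / (2 * h)
# Right side of (21): half the second derivative w.r.t. the action.
rhs = 0.5 * (q_tilde(mu + h, var) - 2 * q_tilde(mu, var) + q_tilde(mu - h, var)) / h ** 2
```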
4.2 Policy Evaluation
There are two natural ways to optimize a function approximator Q̃_w̃ for Q̃^π. The first approach leverages (16) to update Q̃_w̃ based on an expectation of standard Q-values. In this case, one first trains a parameterized model Q_w to approximate the standard Q^π function using conventional methods (Rummery & Niranjan, 1994; Sutton & Barto, 1998; Van Seijen et al., 2009), then fits Q̃_w̃ to Q_w based on (16). In particular, given transitions (s, a, r, s′) sampled from interactions with the environment, one can train Q_w to minimize the Bellman error

(23)  E_{(s, a, r, s′)} [ ( Q_w(s, a) − r − γ Q_w(s′, a′) )² ],

where a′ ∼ π(a′|s′). Then, Q̃_w̃ can be optimized to minimize the squared error

(24)  E_{(s, a)} [ ( Q̃_w̃(s, a) − E_{ã∼N(a, Σ(s))} [ Q_w(s, ã) ] )² ],

where the inner expectation is approximated using several samples ã ∼ N(a, Σ(s)). When the target values in the residuals are treated as fixed (i.e., using a target network), these updates will reach a fixed point when Q̃_w̃ satisfies the recursion in the Bellman equation (16).
The second approach requires only a single function approximator Q̃_w, resulting in a simpler implementation; hence we use this approach in our experimental evaluation. Suppose one has access to a tuple (s, ã, r̃, s′) sampled from a replay buffer, along with the probability q(ã|s) (possibly unnormalized) of sampling the action ã, where q has full support. Then we draw a phantom action a ∼ N(ã, Σ(s)) and optimize Q̃_w by minimizing a weighted Bellman error

(25)  E [ c · ( Q̃_w(s, a) − r̃ − γ Q̃_w(s′, μ(s′)) )² ],

with weight c = N(ã | a, Σ(s)) / q(ã | s). In this way, for a specific pair of state and action (s, a) the expected objective value is,

(26)  E_{ã∼q(·|s)} [ c · ( Q̃_w(s, a) − r̃ − γ Q̃_w(s′, μ(s′)) )² ].

Note that the denominator of c counteracts the expectation over ã ∼ q(·|s) in (26), and that the numerator of c is the Gaussian density N(ã | a, Σ(s)) appearing in (19). Therefore, when the target value is treated as fixed (i.e., using a target network), this training procedure reaches an optimum when Q̃_w(s, a) takes on the target value provided in the Bellman equation (19).
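In the bandit case (no bootstrap term), the weighted objective above reduces to a weighted regression whose minimizer is exactly the smoothed value. The sketch below, with a near-uniform behavior distribution (so q is constant and cancels) and an assumed reward, illustrates this; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma, a = 0.5, 0.0                            # smoothing std and phantom action a
r = lambda x: 1.0 - x ** 2                     # assumed reward; smoothed value at
                                               # a = 0 is 1 - sigma^2 = 0.75
x = rng.uniform(-2.0, 2.0, size=200000)        # replay-buffer actions, q constant
wts = np.exp(-0.5 * ((x - a) / sigma) ** 2)    # Gaussian weights ~ N(x | a, sigma^2)
q_tilde_est = (wts * r(x)).sum() / wts.sum()   # minimizer of the weighted error
```

The weighted average recovers E_{x∼N(a, σ²)}[r(x)], i.e., the smoothed value at the phantom action, which is what the weighted Bellman error drives Q̃_w(s, a) toward.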
In practice, we find that it is unnecessary to keep track of the probabilities q(ã|s), and assume the replay buffer provides a near-uniform distribution of actions conditioned on states. Other recent work has also benefited from ignoring or heavily damping importance weights (Munos et al., 2016; Wang et al., 2017; Schulman et al., 2017). However, it is possible when interacting with the environment to save the probability of sampled actions along with their transitions, and thus have access to q(ã|s).

4.3 Proximal Policy Optimization
Policy gradient algorithms are notoriously unstable, particularly in continuous control problems. Such instability has motivated the development of trust region methods that constrain each gradient step to lie within a trust region (Schulman et al., 2015), or augment the expected reward objective with a penalty on the KL-divergence from a previous policy (Nachum et al., 2018; Schulman et al., 2017). These stabilizing techniques have thus far not been applicable to algorithms like DDPG, since the policy is deterministic. The formulation we propose above, however, is easily amenable to trust region optimization. Specifically, we may augment the objective (17) with a penalty

(27)  − λ E_{s∼ρ^π(s)} [ KL( π_θ(·|s) ‖ π_prev(·|s) ) ],

where π_prev is a previous version of the policy. The optimization remains straightforward, since the KL-divergence of two Gaussians can be expressed analytically.
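For diagonal-covariance Gaussian policies, the penalty can use the closed-form KL; the helper below is the standard formula, written here as an illustration rather than the paper's implementation.

```python
import numpy as np

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ):
    0.5 * sum( log(var1 / var0) + (var0 + (mu0 - mu1)^2) / var1 - 1 )."""
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

# KL between a policy and itself is 0; it grows as the policies diverge.
kl_same = kl_diag_gaussians(np.array([0.0]), np.array([1.0]),
                            np.array([0.0]), np.array([1.0]))
kl = kl_diag_gaussians(np.array([1.0]), np.array([2.0]),
                       np.array([0.0]), np.array([1.0]))
```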
This concludes the technical presentation of the proposed algorithm Smoothie: pseudocode for the full training procedure, including policy improvement, policy evaluation, and proximal policy improvement is given in Algorithm 1. The reader may also refer to the appendix for additional implementation details.
4.4 Compatible Function Approximation
The approximator Q̃_w for Q̃^π should be sufficiently accurate so that the updates for φ and ψ are not affected by substituting ∇_a Q̃_w and ∇²_a Q̃_w for ∇_a Q̃^π and ∇²_a Q̃^π, respectively. Define ε_k(s; w) as the difference between the true k-th derivative of Q̃^π and the k-th derivative of the approximated Q̃_w at a = μ(s):

(28)  ε_k(s; w) = ∇^k_a Q̃^π(s, a) |_{a=μ(s)} − ∇^k_a Q̃_w(s, a) |_{a=μ(s)}.

We claim that a Q̃_w is compatible with respect to μ_φ if

1. ∇_a Q̃_w(s, a) |_{a=μ(s)} is linear in the features ∇_φ μ_φ(s), and

2. w minimizes E_{s∼ρ^π(s)} [ ‖ε_1(s; w)‖² ] (i.e., minimizes the expected squared error of the gradients).

Additionally, Q̃_w is compatible with respect to Σ_ψ if

1. ∇²_a Q̃_w(s, a) |_{a=μ(s)} is linear in the features ∇_ψ Σ_ψ(s), and

2. w minimizes E_{s∼ρ^π(s)} [ ‖ε_2(s; w)‖² ] (i.e., minimizes the expected squared error of the Hessians).
We provide a proof of these claims in the Appendix. One possible parameterization of Q̃_w satisfying these conditions may be achieved by taking δ = a − μ_φ(s) and parameterizing

(29)  Q̃_w(s, a) = V(s) + δ^⊤ G(s) + (1/2) δ^⊤ H(s) δ,

where V(s) is a state-value estimate and G(s) and H(s) are linear in the compatible features above. Similar conditions and parameterizations exist for DDPG (Silver et al., 2014), in terms of Q^π. While it is reassuring to know that there exists a class of function approximators which are compatible, this fact has largely been ignored in practice. At first glance, it seems impossible to satisfy the second set of conditions without access to derivative information of the true Q̃^π (for DDPG, the true Q^π). Indeed, the methods for training Q-value approximators (equation (12) and Section 4.2) only train the approximator to minimize an error between raw scalar values. For DDPG, we are unaware of any method that allows one to train Q_w to minimize an error with respect to the derivatives of the true Q^π. However, the case is different for the smoothed Q-values Q̃^π. In fact, it is possible to train Q̃_w to minimize an error with respect to the derivatives of the true Q̃^π. We provide an elaboration in the Appendix. In brief, it is possible to use (19) to derive Bellman-like equations which relate a derivative of Q̃^π of any degree to an integral over the raw Q-values at the next time step. Thus, it is possible to devise a training scheme in the spirit of the one outlined in Section 4.2, which optimizes Q̃_w to minimize the squared error with the derivatives of the true Q̃^π. This theoretical property of the smoothed Q-values is unique and provides added benefits to their use over the standard Q-values.
5 Related Work
This paper follows a long line of work that uses Q-value functions to stably learn a policy, which in the past has been used to either approximate expected (Rummery & Niranjan, 1994; Van Seijen et al., 2009; Gu et al., 2017) or optimal (Watkins, 1989; Silver et al., 2014; Nachum et al., 2017; Haarnoja et al., 2017; Metz et al., 2017) future value.
Work most similar to what we present are methods that exploit gradient information from the Q-value function to train a policy. Deterministic policy gradient (Silver et al., 2014) is perhaps the best known of these. The method we propose can be interpreted as a generalization of the deterministic policy gradient. Indeed, if one takes the limit of the policy covariance as it goes to 0, the proposed Q-value function becomes the deterministic value function of DDPG, and the updates for training the Q-value approximator and the policy mean are identical.
Stochastic Value Gradients (SVG) (Heess et al., 2015) also trains stochastic policies using an update that is similar to DDPG (i.e., SVG(0) with replay). The key differences with our approach are that SVG does not provide an update for the covariance, and the mean update in SVG estimates the gradient with a noisy Monte Carlo sample, which we avoid by estimating the smoothed Q-value function. Although a covariance update could be derived using the same reparameterization trick as in the mean update, it would also require a noisy Monte Carlo estimate. Methods for updating the covariance along the gradient of expected reward are essential for applying the subsequent trust region and proximal policy techniques.
More recently, Ciosek & Whiteson (2018) introduced expected policy gradients (EPG), a generalization of DDPG that provides updates for the mean and covariance of a stochastic Gaussian policy using gradients of an estimated Q-value function. In that work, the expected Q-value used in standard policy gradient algorithms such as SARSA (Sutton & Barto, 1998; Rummery & Niranjan, 1994; Van Seijen et al., 2009) is estimated. The updates in EPG therefore require approximating an integral of the expected Q-value function, or assuming the Q-value has a simple form that allows for analytic computation. Our analogous process directly estimates the integral (via the smoothed Q-value function) and avoids integral approximations, thereby making the updates simpler and generally applicable. Moreover, while Ciosek & Whiteson (2018) rely on a quadratic Taylor expansion of the estimated Q-value function, we instead rely on the strength of neural network function approximators to directly estimate the smoothed Q-value function.
The novel training scheme we propose for learning the covariance of a Gaussian policy relies on properties of Gaussian integrals (Bonnet, 1964; Price, 1958). Similar identities have been used in the past to derive updates for variational autoencoders (Kingma & Welling, 2014) and Gaussian backpropagation (Rezende et al., 2014).
Finally, the perspective presented in this paper, where Q-values represent the averaged return of a distribution of actions rather than a single action, is distinct from recent advances in distributional RL (Bellemare et al., 2017). Those approaches focus on the distribution of returns of a single action, whereas we consider the single average return of a distribution of actions. Although we restrict our attention in this paper to Gaussian policies, an interesting topic for further investigation is to study the applicability of this new perspective to a wider class of policy distributions.
6 Experiments
We utilize the insights from Section 4 to introduce a new RL algorithm, Smoothie. Smoothie maintains a parameterized approximation of the smoothed Q-value function, trained via the procedure described in Section 4.2. It then uses the gradient and Hessian of this approximation to train a Gaussian policy using the updates stated in (20) and (22). See Algorithm 1 for simplified pseudocode of the algorithm.
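As a rough illustration of the update structure (this is not the paper's Algorithm 1: the smoothed Q-value approximator is replaced by an analytic toy quadratic over a 1-D action, and all names are ours):

```python
import numpy as np

def q_tilde(a):
    return -(a - 2.0) ** 2        # toy smoothed Q-value, maximized at a = 2

def q_tilde_grad(a):
    return -2.0 * (a - 2.0)

def q_tilde_hess(a):
    return -2.0

mu, log_var = 0.0, 0.0            # Gaussian policy mean and log-variance
lr = 0.05
for _ in range(200):
    # mean update: ascend the gradient of the smoothed Q at the current mean
    mu += lr * q_tilde_grad(mu)
    # covariance update: ascend half the Hessian of the smoothed Q
    var = np.exp(log_var)
    log_var += lr * (0.5 * q_tilde_hess(mu)) * var  # chain rule for log-variance
```

Here the mean converges to the optimum while the variance shrinks in the concave region, mirroring the qualitative behavior described below for the synthetic task.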
We perform a number of evaluations of Smoothie, comparing it to DDPG. We choose DDPG as a baseline because it (1) utilizes gradient information of a Q-value approximator, much like the proposed algorithm; and (2) is a standard algorithm well known to achieve good, sample-efficient performance on continuous control benchmarks.
6.1 Synthetic Task
Before investigating benchmark problems, we first briefly revisit the simple task introduced in Section 3 and reproduced in Figure 2 (Right). Here, the reward function is a mixture of two Gaussians, one better than the other, and we initialize the policy mean to be centered on the worse of the two. We plot the policy mean and standard deviation during training for Smoothie and DDPG in Figure 2 (Left). Smoothie learns both the mean and the variance; DDPG learns only the mean, and the variance plotted for it is the exploratory noise, whose scale is kept fixed during training.
[Figure: training curves on HalfCheetah, Swimmer, Hopper, Walker2d, Ant, and Humanoid]
As expected, we observe that DDPG cannot escape the local optimum. At the beginning of training it exhibits some movement away from the local optimum (likely due to the noise in the initial Q-value approximation), but it is unable to progress far from the initial mean. Note that this is not an issue of exploration: the exploration scale is high enough that the Q-value approximator is aware of the better Gaussian. The issue is in the update for the policy mean, which considers only the derivative of the Q-value at the current mean.
On the other hand, we find that Smoothie easily solves the task. This is because the smoothed reward function it approximates has a derivative that clearly points toward the better Gaussian. We also observe that Smoothie suitably adjusts the covariance during training. Initially, the standard deviation decreases due to the concavity of the smoothed reward function. As a region of convexity is entered, it begins to increase, before again decreasing to near zero as the mean approaches the global optimum. This example clearly shows the ability of Smoothie to exploit the smoothed action value landscape.
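A minimal sketch of this synthetic setup, with mode locations and scales of our own choosing (not the paper's values), shows why the smoothed landscape helps: at the worse mode the raw reward gradient vanishes, while the smoothed reward gradient points toward the better mode.

```python
import numpy as np

def reward(a):
    good = 1.0 * np.exp(-0.5 * ((a - 2.0) / 0.4) ** 2)  # better Gaussian bump
    bad = 0.6 * np.exp(-0.5 * ((a + 1.0) / 0.4) ** 2)   # worse Gaussian bump
    return good + bad

def smoothed_reward(mu, sigma, n=200_000, seed=0):
    # Monte Carlo estimate of E_{a ~ N(mu, sigma^2)}[reward(a)]
    eps = np.random.default_rng(seed).standard_normal(n)
    return reward(mu + sigma * eps).mean()

mu0, sigma = -1.0, 1.5  # mean initialized at the worse mode, wide smoothing
h = 1e-2
raw_grad = (reward(mu0 + h) - reward(mu0 - h)) / (2 * h)
smooth_grad = (smoothed_reward(mu0 + h, sigma)
               - smoothed_reward(mu0 - h, sigma)) / (2 * h)
# raw_grad ~ 0 (mu0 sits on a local optimum), while smooth_grad > 0,
# pointing toward the better mode.
```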
6.2 Continuous Control
Next, we consider standard continuous control benchmarks available on OpenAI Gym (Brockman et al., 2016) utilizing the MuJoCo environment (Todorov et al., 2012).
Our implementations utilize feed-forward neural networks for the policy and Q-values. We parameterize the covariance as a diagonal matrix. The exploration for DDPG is determined by an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930; Lillicrap et al., 2016). Additional implementation details are provided in the Appendix.
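For reference, a minimal sketch of the Ornstein-Uhlenbeck exploration noise used by the DDPG baseline (the theta, sigma, and dt values below are common defaults, not necessarily the ones tuned in our experiments):

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, dt=1e-2, x0=0.0, seed=0):
    # Euler discretization of dx = -theta * x dt + sigma dW:
    # temporally correlated exploration noise, unlike i.i.d. Gaussian noise.
    rng = np.random.default_rng(seed)
    out = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        out[t] = x
    return out

noise = ou_noise(5000)
```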
We compare the results of Smoothie and DDPG in Figure 4. For each task we performed a hyperparameter search over actor learning rate, critic learning rate, and reward scale, and plot the average of six runs for the best hyperparameters. For DDPG we extended the hyperparameter search to also consider the scale and damping of exploratory noise provided by the Ornstein-Uhlenbeck process. Smoothie, on the other hand, contains an additional hyperparameter to determine the weight on the KL-penalty.
Despite DDPG having the advantage of its exploration decided by a hyperparameter search, while Smoothie must learn its exploration without supervision, we find that Smoothie performs competitively or better across all tasks, exhibiting a slight advantage in Swimmer and Ant and more dramatic improvements in Hopper, Walker2d, and Humanoid.
Nevertheless, on the more difficult continuous control tasks (Hopper, Walker2d, Ant, Humanoid), both DDPG and Smoothie suffer from instability, showing high variance across runs and performance far behind that of trust region and proximal policy methods (Schulman et al., 2015; Nachum et al., 2018; Schulman et al., 2017). Thus, we also evaluated Smoothie with a KL-penalty and present these results in Figure 5.
Empirically, we found that introducing a KL-penalty improves the performance of Smoothie, especially on the harder tasks; Figure 5 compares Smoothie with and without the KL-penalty on the four harder tasks. A KL-penalty to encourage stability is not possible in DDPG, whose policy is deterministic. Thus, Smoothie provides a much-needed remedy for the inherent instability of DDPG training.
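The KL-penalty relies on the KL divergence between successive Gaussian policies, which for diagonal covariances has the standard closed form sketched below (illustrative code, not the paper's implementation):

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    # KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ), summed over dimensions
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1
                        - 1.0)
```

Penalizing this divergence between the current and previous policy discourages destructively large updates; a deterministic policy, as in DDPG, admits no such penalty.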
We observe that Smoothie augmented with a KL-penalty consistently improves performance. The improvement is especially dramatic for Hopper, where the average reward is doubled. We also highlight the results for Humanoid, which exhibits a significant improvement. As far as we know, these Humanoid results are the best published results for a method that trains on only the order of millions of environment steps. In contrast, TRPO, which to the best of our knowledge is the only other algorithm that can achieve better performance, requires on the order of tens of millions of environment steps. This gives added evidence to the benefits of using a learnable covariance rather than restricting the policy to be deterministic.
[Figure: Smoothie with and without KL-penalty on Hopper, Walker2d, Ant, and Humanoid]
7 Conclusion
We have presented a new notion of Q-values, the smoothed Q-value function, which is a Gaussian-smoothed version of the standard expected Q-value. The advantage of the smoothed Q-value is that its gradient and Hessian possess an intimate relationship with the gradient of expected reward with respect to the mean and covariance of a Gaussian policy. The resulting algorithm, Smoothie, successfully learns both the mean and covariance during training, leading to performance that surpasses that of DDPG, especially when incorporating a penalty on divergence from a previous policy.
Gu et al. (2017) showed that DDPG sits at one end of a spectrum of methods that interpolate between off-policy updates and on-policy policy gradient updates. Future work could determine whether Smoothie can also benefit from the improved stability of interpolated methods.
The success of the smoothed Q-value function is encouraging. Intuitively, it appears advantageous to learn smoothed Q-values rather than standard expected Q-values: by definition, smoothing makes the true reward surface smoother, and thus possibly easier to learn; moreover, smoothed Q-values have a more direct relationship with the expected discounted return objective. We encourage further investigation of these claims, and of techniques for applying the underlying motivations behind smoothed Q-values to other classes of policy distributions.
8 Acknowledgments
We thank Luke Metz, Sergey Levine, and the Google Brain team for insightful comments and discussions.
References
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In ICML, pp. 449–458, 2017.
 Bonnet (1964) Bonnet, G. Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220, 1964.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540, 2016.
 Ciosek & Whiteson (2018) Ciosek, K. and Whiteson, S. Expected policy gradients. AAAI, 2018.
 Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. Off-policy actor-critic. ICML, 2012.
 Gu et al. (2017) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging onpolicy and offpolicy gradient estimation for deep reinforcement learning. NIPS, 2017.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. ICML, 2017.
 Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR, 2014.
 Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In NIPS, 2000.
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ICLR, 2016.
 Metz et al. (2017) Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep RL. CoRR, abs/1705.05035, 2017. URL http://arxiv.org/abs/1705.05035.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient offpolicy reinforcement learning. In NIPS, 2016.
 Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. NIPS, 2017.
 Nachum et al. (2018) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. ICLR, 2018.
 Price (1958) Price, R. A useful theorem for nonlinear devices having gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.
 Rummery & Niranjan (1994) Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. 1994.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1998.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Uhlenbeck & Ornstein (1930) Uhlenbeck, G. E. and Ornstein, L. S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.
 Van Seijen et al. (2009) Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected Sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on, pp. 177–184. IEEE, 2009.
 Wang et al. (2017) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. ICLR, 2017.
 Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
 Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.
Appendix A Proof of Theorem 1
We want to show that, for any state-action pair (s, a),

\nabla_\Sigma \tilde{Q}^\pi(s, a) = \frac{1}{2} \nabla_a \nabla_a \tilde{Q}^\pi(s, a). \qquad (30)
We note that similar identities for Gaussian integrals exist in the literature (Price, 1958; Rezende et al., 2014) and point the reader to these works for further information.
Proof. The specific identity we state may be derived using standard matrix calculus. We make use of the fact that

\nabla_\mu \mathcal{N}(a \mid \mu, \Sigma) = \mathcal{N}(a \mid \mu, \Sigma)\, \Sigma^{-1} (a - \mu), \qquad (31)

and, for symmetric \Sigma,

\nabla_\Sigma \mathcal{N}(a \mid \mu, \Sigma) = \frac{1}{2}\, \mathcal{N}(a \mid \mu, \Sigma) \left( \Sigma^{-1} (a - \mu)(a - \mu)^\top \Sigma^{-1} - \Sigma^{-1} \right). \qquad (32)
We omit the state argument s from \tilde{Q}^\pi in the following equations for succinctness. The LHS of (30) is
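As a numerical sanity check of the scalar versions of the Gaussian density derivative identities (31) and (32) (illustration only, not part of the proof):

```python
import numpy as np

def normal_pdf(a, mu, var):
    # scalar Gaussian density N(a | mu, var)
    return np.exp(-0.5 * (a - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

a, mu, var = 0.7, 0.2, 0.9
h = 1e-5
p = normal_pdf(a, mu, var)

# (31), scalar case: d/dmu N(a | mu, var) = N * (a - mu) / var
dmu_fd = (normal_pdf(a, mu + h, var) - normal_pdf(a, mu - h, var)) / (2 * h)
dmu_id = p * (a - mu) / var

# (32), scalar case: d/dvar N(a | mu, var) = (N/2) * ((a - mu)^2 / var^2 - 1 / var)
dvar_fd = (normal_pdf(a, mu, var + h) - normal_pdf(a, mu, var - h)) / (2 * h)
dvar_id = 0.5 * p * ((a - mu) ** 2 / var ** 2 - 1.0 / var)
```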
Appendix B Compatible Function Approximation
A function approximator of the smoothed Q-value should be sufficiently accurate that updates to the policy parameters are unaffected by substituting the gradient and Hessian of the approximation for those of the true smoothed Q-value. We claim that such an approximator is compatible with respect to the policy if:

,
