Smoothed Action Value Functions for Learning Gaussian Policies

by Ofir Nachum, et al.

State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.



1 Introduction

Model-free reinforcement learning algorithms often alternate between two concurrent but interacting processes: (1) policy evaluation, where an action value function (i.e., a Q-value) is updated to obtain a better estimate of the return associated with taking a specific action, and (2) policy improvement, where the policy is updated aiming to maximize the current value function. In the past, different notions of Q-value have led to distinct but important families of RL methods. For example, SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 1998; Van Seijen et al., 2009) uses the expected Q-value, defined as the expected return of following the current policy. Q-learning (Watkins, 1989) exploits a hard-max notion of Q-value, defined as the expected return of following an optimal policy. Soft Q-learning (Haarnoja et al., 2017) and PCL (Nachum et al., 2017) both use a soft-max form of Q-value, defined as the future return of following an optimal entropy-regularized policy. Clearly, the choice of Q-value function has a considerable effect on the resulting algorithm; for example, restricting the types of policies that can be expressed, and determining the type of exploration that can be naturally applied. In each case, the Q-value at a state and action answers the question,

“What would my future value from $s$ be if I were to take an initial action $a$?”

Such information about a hypothetical action is helpful when learning a policy; we want to nudge the policy distribution to favor actions with potentially higher Q-values.

In this work, we investigate the practicality and benefits of answering a more difficult, but more relevant, question:

“What would my future value from $s$ be if I were to sample my initial action from a distribution centered at $a$?”

We focus our efforts on Gaussian policies and thus the counterfactual posited by the Q-value inquires about the expected future return of following the policy when changing the mean of the initial Gaussian distribution. Thus, our new notion of Q-values maps a state-action pair $(s, a)$ to the expected return of first taking an action sampled from a normal distribution $N(a, \Sigma(s))$ centered at $a$, and following actions sampled from the current policy thereafter. In this way, the Q-values we introduce may be interpreted as a Gaussian-smoothed version of the expected Q-value, hence we term them smoothed Q-values.

We show that smoothed Q-values possess a number of important properties that make them attractive for use in RL algorithms. It is clear from the definition of smoothed Q-values that, if known, their structure is highly beneficial for learning the mean of a Gaussian policy. We are able to show more than this: although the smoothed Q-values are not a direct function of the covariance, one can surprisingly use knowledge of the smoothed Q-values to derive updates to the covariance of a Gaussian policy. Specifically, the gradient of the standard expected return objective with respect to the mean and covariance of a Gaussian policy is equivalent to the gradient and Hessian of the smoothed Q-value function, respectively. Moreover, we show that the smoothed Q-values satisfy a single-step Bellman consistency, which allows bootstrapping to be used to train them via function approximation.

These results lead us to propose an algorithm, Smoothie, which, in the spirit of (Deep) Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2016), trains a policy using the derivatives of a trained (smoothed) Q-value function to learn a Gaussian policy. Crucially, unlike DDPG, which is restricted to deterministic policies and is well-known to have poor exploratory behavior (Haarnoja et al., 2017), the approach we develop is able to utilize a non-deterministic Gaussian policy parameterized by both a mean and a covariance, thus allowing the policy to be exploratory by default and alleviating the need for excessive hyperparameter tuning. On the other hand, compared to standard policy gradient algorithms (Williams & Peng, 1991; Konda & Tsitsiklis, 2000), Smoothie’s utilization of the derivatives of a Q-value function to train a policy avoids the high variance and sample inefficiency of stochastic updates.

Furthermore, we show that Smoothie can be easily adapted to incorporate proximal policy optimization techniques by augmenting the objective with a penalty on KL-divergence from a previous version of the policy. The inclusion of a KL-penalty is not feasible in the standard DDPG algorithm, but we show that it is possible with our formulation, and it significantly improves stability and overall performance. On standard continuous control benchmarks, our results are competitive with or exceed state-of-the-art, especially for more difficult tasks in the low-data regime.

For most RL algorithms including SARSA, expected SARSA, and Q-learning, the policy improvement step is straightforward: an improved policy is obtained by taking the greedy action that maximizes the action value function, or alternatively, the greedy action is taken with probability $1 - \epsilon$ and a random action is taken with probability $\epsilon$. Using such $\epsilon$-greedy schemes may improve exploration and facilitate formulating convergence guarantees. However, policy improvement by finding the locally optimal policy for an arbitrary action value function is not tractable, especially in continuous and high-dimensional action spaces. For example, if one considers the family of multivariate Gaussian policies in a continuous action space, then policy optimization within the feasible policies is not straightforward.

Prior work on deterministic policy gradient (DPG) and its extensions substitutes the mean of a Gaussian policy into the action value function and uses the gradient of the action value function to update the mean of the Gaussian policy so as to improve action value estimates. Such formulations are interesting because they can handle off-policy data, and they do not require Monte Carlo sampling to compute expected Q-values under the current policy, resulting in a learning algorithm with a lower variance. However, DPG and its variants make a strong assumption about the policy. They assume that the policy is deterministic in the limit that the variance of the Gaussian goes to zero. It was thought that one needs to assume policy determinism to achieve the benefits of the DPG formulation (???), but in this paper we show the policy determinism assumption is not necessary. In other words, one can keep the benefits of DPG in making use of action value gradients and off-policy data, but also allow for using a general family of stochastic Gaussian policies.

Model-free reinforcement learning (RL) aims to optimize an agent’s behavior policy through trial and error interaction with a black box environment. An agent alternates between observing a state provided by the environment (e.g., joint positions and velocities), applying an action to the environment (e.g., force or torque), and receiving a reward from the environment (e.g., velocity in a specific desired direction). The agent’s objective is to maximize the long term sum of rewards received from the environment.

Within continuous control, the agent’s policy is traditionally parameterized by a unimodal Gaussian. The mean and covariance defining the Gaussian are then trained using policy gradient methods (Konda & Tsitsiklis, 2000; Williams & Peng, 1991) to maximize expected total reward. Policy gradient uses experience sampled stochastically from its policy as a cue to nudge the mean and covariance to put more probability mass on experience that yielded a higher long term reward than expected and vice versa for experience that yielded a lower long term reward. The stochastic nature of this training paradigm makes policy gradient methods unstable and prone to high variance. To mitigate this problem, large batch sizes or trust region methods (Schulman et al., 2015; Nachum et al., 2018; Schulman et al., 2017) are utilized, although both of these remedies can require collecting a large amount of experience, making them infeasible for real-world applications.

As an attempt to circumvent the issue of high variance due to highly stochastic updates, recent years have introduced policy gradient algorithms which update policy parameters based only on the surface of a learned Q-value function. The most widely used such algorithm is (Deep) Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2016). In DDPG, the policy is deterministic, parameterized only by a mean. A Q-value function is trained to take in a state and action and return the future discounted sum of rewards of first taking the specified action and subsequently following the deterministic policy. Thus, the surface of the learned Q-value function dictates how the policy should be updated: along the gradient of the Q-value with respect to the input action.

While DDPG has successfully avoided the highly-stochastic policy updates associated with traditional policy gradient algorithms, its deterministic policy naturally leads to poor exploration. A deterministic policy gives no indication regarding which directions in action space to explore. Thus, in practice, the trained policy differs from the behavior policy, which is augmented with Gaussian noise whose variance is treated as a hyperparameter to optimize. Even so, DDPG is well-known to have poor exploratory behavior (Haarnoja et al., 2017).

In this paper, we present a method which applies the same technique of updating a policy based only on a learned Q-value function to a stochastic policy parameterized by both a mean and covariance. We show that given access to the proposed Q-value function or a sufficiently accurate approximation, it is possible to derive unbiased updates to both the policy mean and covariance. Unlike recent attempts at this (e.g., Ciosek & Whiteson (2018)), our updates require neither approximate integrals nor low-order assumptions on the form of the true Q-value. Crucially, providing a method to update the covariance allows the policy to be exploratory by default and alleviates the need for excessive hyperparameter tuning. Moreover, we show that our technique can be easily adapted to incorporate trust region methods by augmenting the objective with a penalty on KL-divergence from a previous version of the policy. The inclusion of a trust region is not possible in standard DDPG and we show that it improves stability and overall performance significantly when our algorithm is evaluated on standard continuous control benchmarks.

2 Formulation

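We consider the standard model-free RL problem represented as a Markov decision process (MDP), consisting of a state space $\mathcal{S}$ and an action space $\mathcal{A}$. At iteration $t$ the agent encounters a state $s_t \in \mathcal{S}$ and emits an action $a_t \in \mathcal{A}$, after which the environment returns a scalar reward $r_t \sim R(s_t, a_t)$ and places the agent in a new state $s_{t+1} \sim P(s_t, a_t)$.

We focus on continuous control tasks, where the actions are real-valued, i.e., $\mathcal{A} \equiv \mathbb{R}^{d_a}$. Our observations at a state $s$ are denoted $\Phi(s) \in \mathbb{R}^{d_s}$. We parameterize the behavior of the agent using a stochastic policy $\pi(a \mid s)$, which takes the form of a Gaussian density at each state $s$. The Gaussian policy is parameterized by a mean and a covariance function, $\mu(s): \mathbb{R}^{d_s} \to \mathbb{R}^{d_a}$ and $\Sigma(s): \mathbb{R}^{d_s} \to \mathbb{R}^{d_a \times d_a}$, so that $\pi(a \mid s) = N(a \mid \mu(s), \Sigma(s))$, where

$$N(a \mid \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2} \|a - \mu\|^2_{\Sigma^{-1}}\right\}, \tag{1}$$

here using the notation $\|v\|^2_A = v^\top A v$.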

Below we develop new RL training methods for this family of parametric policies, but some of the ideas presented may generalize to other families of policies as well. We begin the formulation by reviewing some prior work on learning Gaussian policies.

2.1 Policy Gradient for Generic Stochastic Policies

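The optimization objective (expected discounted return), as a function of a generic stochastic policy, is expressed in terms of the expected action value function $Q^\pi(s, a)$ by,

$$O_{ER}(\pi) = \int_{\mathcal{S}} \int_{\mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)\, da\, d\rho^\pi(s) = E_{s, a}\left[Q^\pi(s, a)\right], \tag{2}$$

where $\rho^\pi(s)$ is the state visitation distribution under $\pi$, and $Q^\pi(s, a)$ is recursively defined using the Bellman equation,

$$Q^\pi(s, a) = E_{r, s'}\left[r + \gamma \int_{\mathcal{A}} Q^\pi(s', a')\, \pi(a' \mid s')\, da'\right], \tag{3}$$

where $\gamma \in [0, 1)$ is the discount factor. For brevity, we suppress explicit denotation of the distribution $R$ over immediate rewards and $P$ over state transitions.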

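The policy gradient theorem (Sutton et al., 2000) expresses the gradient of $O_{ER}(\pi_\theta)$ w.r.t. $\theta$, the tunable parameters of a policy $\pi_\theta$, as,

$$\nabla_\theta O_{ER}(\pi_\theta) = \int_{\mathcal{S}} \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\, da\, d\rho^\pi(s) = E_{s}\left[E_{a \sim \pi_\theta(a \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\right]\right]. \tag{4}$$

In order to approximate the expectation on the RHS of (4), one often resorts to an empirical average over on-policy samples from $\pi_\theta$. This sampling scheme results in a gradient estimate with high variance, especially when $\pi_\theta(a \mid s)$ is not concentrated. Many policy gradient algorithms, including actor-critic variants, trade off variance and bias, e.g., by attempting to estimate $Q^\pi(s, a)$ accurately using function approximation and the Bellman equation.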

In the simplest scenario, an unbiased estimate of $Q^\pi(s, a)$ is formed by accumulating discounted rewards from each state forward using a single Monte Carlo sample.
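As a minimal sketch of this single-sample Monte Carlo estimate, the function below accumulates discounted rewards backwards through one episode. The function name and the list-of-rewards input format are illustrative assumptions, not part of the paper.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step episode.
episode_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(episode_returns)  # [1.5, 1.0, 2.0]
```

Each entry is one sample of the return from that step onward, so it is unbiased for the expected return but can have high variance.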

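The objective of the policy is to maximize expected future discounted reward at each state $s$ until reaching some terminal time $T$:

$$O_{ER}(s, \pi) = E\left[\sum_{t=0}^{T} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]. \tag{5}$$

The objective may also be expressed in terms of expected Q-values as

$$O_{ER}(s, \pi) = E_{a \sim \pi(\cdot \mid s)}\left[Q^\pi(s, a)\right], \tag{6}$$

where $Q^\pi$ is defined recursively as

$$Q^\pi(s, a) = E_{r, s'}\left[r + \gamma\, E_{a' \sim \pi(\cdot \mid s')}\left[Q^\pi(s', a')\right]\right], \tag{7}$$

to represent the expected future discounted reward of taking action $a$ at state $s$ and subsequently following the policy $\pi$.

The state-agnostic objective is then

$$O_{ER}(\pi) = E_{s \sim \rho^\pi}\left[O_{ER}(s, \pi)\right], \tag{8}$$

where $\rho^\pi$ is the state distribution induced by the policy $\pi$.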

2.2 Deterministic Policy Gradient

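Silver et al. (2014) study the policy gradient for the specific class of Gaussian policies in the limit where the policy’s covariance approaches zero. In this scenario, the policy becomes deterministic and samples from the policy approach the Gaussian mean. Under a deterministic policy $\pi \equiv \delta_\mu$, one can estimate the expected future return from a state $s$ as,

$$\int_{\mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)\, da = \lim_{\Sigma \to 0} \int_{\mathcal{A}} N(a \mid \mu(s), \Sigma(s))\, Q^\pi(s, a)\, da = Q^\pi(s, \mu(s)). \tag{9}$$

Accordingly, Silver et al. (2014) express the gradient of the expected discounted return objective for $\pi_\theta \equiv \delta_{\mu_\theta}$ as,

$$\nabla_\theta O_{ER}(\pi_\theta) = \int_{\mathcal{S}} \left.\frac{\partial Q^\pi(s, a)}{\partial a}\right|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s)\, d\rho^\pi(s). \tag{10}$$

This characterization of the policy gradient theorem for deterministic policies is called deterministic policy gradient (DPG). Since no Monte Carlo sampling is required for estimating the gradient, the variance of the estimate is reduced. On the other hand, the deterministic nature of the policy can lead to poor exploration and training instability in practice.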

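In the limit of $\Sigma \to 0$, one can also re-express the Bellman equation (3) as,

$$Q^\pi(s, a) = E_{r, s'}\left[r + \gamma\, Q^\pi(s', \mu(s'))\right]. \tag{11}$$

Therefore, a value function approximator $Q^\pi_w$ can be optimized by minimizing the Bellman error,

$$E(w) = \sum_{(s, a, r, s') \in \mathcal{D}} \left(Q^\pi_w(s, a) - r - \gamma\, Q^\pi_w(s', \mu(s'))\right)^2, \tag{12}$$

for transitions $(s, a, r, s')$ sampled from a dataset $\mathcal{D}$ of interactions of the agent with the environment. The deep variant of DPG known as DDPG (Lillicrap et al., 2016) alternates between improving the action value estimate by gradient descent on (12) and improving the policy based on (10).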

To improve sample efficiency, Degris et al. (2012) and Silver et al. (2014) replace the state visitation distribution $\rho^\pi(s)$ in (10) with an off-policy visitation distribution $\rho^\beta(s)$ based on a replay buffer. This substitution introduces some bias in the gradient estimate (10), but previous work has found that it works well in practice and improves the sample efficiency of the policy gradient algorithms. We also adopt a similar heuristic in our method to make use of off-policy data.

In practice, DDPG exhibits improved sample efficiency over standard policy gradient algorithms: using off-policy data to train Q-values while basing policy updates on their gradients significantly improves stochastic policy updates dictated by (4), which require a large number of samples to reduce noise. On the other hand, the deterministic nature of the policy learned by DDPG leads to poor exploration and instability in training. In this paper, we propose an algorithm which, like DDPG, utilizes derivative information of learned Q-values for better sample-efficiency, but which, unlike DDPG, is able to learn a Gaussian policy and imposes a KL-penalty for better exploration and stability.

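The objective for $\mu$ becomes

$$O_{ER}(\mu) = E_{s \sim \rho^\pi}\left[Q^\pi(s, \mu(s))\right]. \tag{13}$$

A parameterized $\mu_\theta$ can thus be trained according to

$$\Delta\theta \propto E_{s \sim \rho^\pi}\left[\left.\frac{\partial Q^\pi(s, a)}{\partial a}\right|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s)\right]. \tag{14}$$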


2.3 Stochastic Value Gradients for Gaussian Policies

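Inspired by deterministic policy gradients, Heess et al. (2015) propose to reparameterize the expectation w.r.t. a Gaussian policy in (4) with an expectation over a noise variable $\epsilon$ drawn from a standard normal distribution. Note that a stochastic action drawn from a Gaussian policy $a \sim N(\mu(s), \Sigma(s))$ can be reparameterized as $a = \mu(s) + L(s)\epsilon$, for $\epsilon \sim N(0, I_{d_a})$ and $L(s) L(s)^\top = \Sigma(s)$. Accordingly, the policy gradients with respect to the policy parameters $\psi$ (those of both $\mu$ and $L$) take the form of,

$$\nabla_\psi O_{ER}(\pi_\psi) = \int_{\mathcal{S}} E_{\epsilon}\left[\left.\frac{\partial Q^\pi(s, a)}{\partial a}\right|_{a = \mu + L\epsilon} \nabla_\psi \left(\mu + L\epsilon\right)\right] d\rho^\pi(s), \tag{15}$$

where for brevity, we dropped the dependence of $\mu$ and $L$ on $s$. Similarly, one can re-express Bellman equations using an expectation over the noise variable.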
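To make the reparameterization concrete, here is a minimal one-dimensional sketch, assuming a scalar action $a = \mu + \sigma\epsilon$ and a hypothetical critic $Q(a) = -(a - 2)^2$ whose derivative is known in closed form. The function name and the fixed noise list are illustrative; in practice the noise would be drawn from a standard normal, which is what makes the estimate noisy.

```python
def reparam_gradients(dq_da, mu, sigma, noise):
    """Average pathwise gradients of Q(mu + sigma * eps) over the given noise draws."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for eps in noise:
        slope = dq_da(mu + sigma * eps)  # dQ/da at the reparameterized action
        grad_mu += slope                 # chain rule: da/dmu = 1
        grad_sigma += slope * eps        # chain rule: da/dsigma = eps
    n = len(noise)
    return grad_mu / n, grad_sigma / n


# Hypothetical critic Q(a) = -(a - 2)^2, so dQ/da = -2 (a - 2).
dq_da = lambda a: -2.0 * (a - 2.0)

# Fixed noise values keep the example reproducible.
g_mu, g_sigma = reparam_gradients(dq_da, mu=0.0, sigma=1.0, noise=[-1.0, 0.0, 1.0])
```

With these three noise values the mean gradient is 4 and the standard deviation gradient is -4/3; different noise draws give different estimates, which is the variance the text refers to.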

The key advantage of this formulation by Heess et al. (2015), called stochastic value gradients (SVG), over generic policy gradients (4) is that similar to DPG (10), SVG makes direct use of the gradient of the Gaussian mean functions with respect to the model parameters. The benefit over DPG is that SVG keeps the policy stochastic and enables learning the covariance function at every state, but the key disadvantage over DPG is that SVG requires sampling from a noise variable to estimate the gradients, resulting in a higher variance.

In this paper, we show how one can combine the benefits of DPG and SVG to formulate policy gradients for Gaussian policies without sampling a noise variable to estimate the gradients, and hence with lower variance.

Figure 1: A simple expected reward function, shown in green, with a Gaussian-smoothed version, shown in magenta.

3 Idea

Before giving a full exposition, we use a simplified scenario to illustrate the key intuitions behind the proposed approach and how it differs fundamentally from previous methods.

Consider a one-shot decision making problem over a one dimensional action space with a single state. Here the expected reward is given by a function over the real line, which also corresponds to the optimal Q-value function; Figure 1 gives a concrete example. We assume the policy is specified by a Gaussian distribution parameterized by a scalar mean $\mu$ and standard deviation $\sigma$. The goal is to optimize the policy parameters to maximize expected reward.

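A naive policy gradient method updates the parameters by sampling $a_i \sim N(\mu, \sigma^2)$, observing reward $r_i$, then adjusting $\mu$ and $\sigma$ in directions $\Delta\mu = \frac{d \log N(a_i \mid \mu, \sigma^2)}{d\mu} r_i$ and $\Delta\sigma = \frac{d \log N(a_i \mid \mu, \sigma^2)}{d\sigma} r_i$. Note that such updates suffer from large variance, particularly when $\sigma$ is small.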
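The score-function update can be written out directly for a scalar Gaussian. The sketch below uses the standard derivatives of the Gaussian log-density; the function name is an illustrative assumption.

```python
def score_function_update(a, r, mu, sigma):
    """Naive policy gradient directions for one sampled action a with reward r."""
    # d/dmu log N(a | mu, sigma^2) = (a - mu) / sigma^2
    d_mu = (a - mu) / sigma ** 2
    # d/dsigma log N(a | mu, sigma^2) = ((a - mu)^2 - sigma^2) / sigma^3
    d_sigma = ((a - mu) ** 2 - sigma ** 2) / sigma ** 3
    return d_mu * r, d_sigma * r


# The same action and reward give much larger updates when sigma is small.
wide = score_function_update(a=1.0, r=1.0, mu=0.0, sigma=1.0)
narrow = score_function_update(a=1.0, r=1.0, mu=0.0, sigma=0.5)
```

Both directions carry inverse powers of $\sigma$, which is one way to see why the updates become erratic as the policy concentrates.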

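To reduce the variance of direct policy gradient, deterministic policy gradient methods leverage a value function approximator $Q^\pi_w$, parameterized by $w$, to approximate $Q^\pi$. For example, in this scenario, vanilla DPG would sample an action $a_i = \mu + \epsilon_i$ with exploration noise $\epsilon_i \sim N(0, \sigma^2)$, then update $\mu$ using $\Delta\mu = \left.\frac{\partial Q^\pi_w(a)}{\partial a}\right|_{a = \mu}$ and $Q^\pi_w$ using $\Delta w = (r_i - Q^\pi_w(a_i)) \nabla_w Q^\pi_w(a_i)$. Clearly, this update exhibits reduced variance, but requires $Q^\pi_w$ to approximate $Q^\pi$ (the green curve in Figure 1) to control bias. Unfortunately, DPG is not able to learn the exploration variance $\sigma^2$. Variants of DPG such as SVG (Heess et al., 2015) and EPG (Ciosek & Whiteson, 2018) have been proposed to work with stochastic policies. However, they either have restrictive assumptions on the form of the true Q-value, introduce noise into the policy updates, or require an approximate integral, thus losing the advantage of deterministic gradient updates. By contrast, SVG is able to work with a stochastic policy and learn the variance, but loses the advantage of deterministic gradient updates. In particular, vanilla SVG would sample $\epsilon_i \sim N(0, 1)$ then update $\mu$ using $\Delta\mu = \left.\frac{\partial Q^\pi_w(a)}{\partial a}\right|_{a = \mu + \sigma\epsilon_i}$, $\sigma$ using $\Delta\sigma = \epsilon_i \left.\frac{\partial Q^\pi_w(a)}{\partial a}\right|_{a = \mu + \sigma\epsilon_i}$, and $Q^\pi_w$ using $\Delta w = (r_i - Q^\pi_w(\mu + \sigma\epsilon_i)) \nabla_w Q^\pi_w(\mu + \sigma\epsilon_i)$, all of which reintroduce significant variance in the updates.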

Note, however, that the expected value at any given location is actually given by a convolution of the Gaussian policy with the underlying expected reward function. Such a process inherently smooths the landscape, as shown in the magenta curve in Figure 1. Unfortunately, DPG completely ignores this smoothing effect by trying to approximate $Q^\pi$, while policy gradient methods only benefit from it indirectly through sampling. A key insight is that this smoothing effect can be captured directly in the value function approximator itself, bypassing any need for sampling or approximating $Q^\pi$. That is, instead of using an approximator to model $Q^\pi$, one can directly approximate the smoothed version given by $\tilde{Q}^\pi(a) = \int_{\mathbb{R}} N(\tilde{a} \mid a, \sigma^2)\, Q^\pi(\tilde{a})\, d\tilde{a}$ (the magenta curve in Figure 1), which, crucially, satisfies $O_{ER}(\pi) = \tilde{Q}^\pi(\mu)$.
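The smoothing can be checked numerically in the one-dimensional setting. The sketch below approximates the Gaussian convolution with a trapezoid rule; the reward function $-(a - 2)^2$, the helper names, and the integration settings are illustrative assumptions. For this quadratic reward the smoothed value has the closed form $-(a - 2)^2 - \sigma^2$, which gives something to compare against.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def smoothed_value(q, a, std, half_width=8.0, n=4001):
    """Approximate the integral of N(x | a, std^2) q(x) dx with the trapezoid rule."""
    lo, hi = a - half_width * std, a + half_width * std
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * gaussian_pdf(x, a, std) * q(x)
    return total * step


# Hypothetical expected reward with its peak at a = 2.
q = lambda a: -(a - 2.0) ** 2

value = smoothed_value(q, a=0.5, std=0.7)
closed_form = -(0.5 - 2.0) ** 2 - 0.7 ** 2
```

Evaluating `smoothed_value` at the policy mean gives the expected reward of the whole Gaussian policy, which is the quantity the smoothed approximator is meant to model directly.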

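Based on this observation, we propose a novel actor-critic strategy below that uses a function approximator $\tilde{Q}^\pi_w$ to model $\tilde{Q}^\pi$. Although approximating $\tilde{Q}^\pi$ instead of $Q^\pi$ might appear to be a subtle change, it is a major alteration to existing actor-critic approaches. Not only is approximating the magenta curve in Figure 1 far easier than the green curve, modeling $\tilde{Q}^\pi$ allows the policy parameters to be updated deterministically for any given action. In particular, in the simple scenario above, if one sampled an action from the current policy, $a_i \sim N(\mu, \sigma^2)$, and observed $r_i$, then $\mu$ could be updated using $\Delta\mu = \left.\frac{\partial \tilde{Q}^\pi_w(a)}{\partial a}\right|_{a = \mu}$, $\sigma$ using $\Delta\sigma = \left.\frac{\partial^2 \tilde{Q}^\pi_w(a)}{\partial a^2}\right|_{a = \mu}$ (a key result we establish below), and $\tilde{Q}^\pi_w$ using $\Delta w = (r_i - \tilde{Q}^\pi_w(\mu)) \nabla_w \tilde{Q}^\pi_w(\mu)$.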

Such a strategy combines the best aspects of DPG and policy gradient while conferring additional advantages: (1) the smoothed value function $\tilde{Q}^\pi$ cannot add but can only remove local minima from $Q^\pi$; (2) $\tilde{Q}^\pi$ is smoother than $Q^\pi$ hence easier to approximate; (3) approximating $\tilde{Q}^\pi$ allows deterministic gradient updates for $\pi$; (4) approximating $\tilde{Q}^\pi$ allows gradients to be computed for both the mean and variance parameters. Among these advantages, DPG shares only 3 and policy gradient only 1. We will see below that the new strategy we propose significantly outperforms existing approaches, not only in the toy scenario depicted in Figure 1, but also in challenging benchmark problems.

4 Smoothed Action Value Functions

Moving beyond a simple illustrative scenario, the key contribution of this paper is to introduce the general notion of a smoothed action value function, the gradients of which provide an effective signal for optimizing the parameters of a Gaussian policy. Smoothed Q-values, which we denote $\tilde{Q}^\pi(s, a)$, differ from ordinary Q-values $Q^\pi(s, a)$ by not assuming the first action of the agent is fully specified; instead, they assume only that a Gaussian centered at the action is known. Thus, to compute $\tilde{Q}^\pi(s, a)$, one has to perform an expectation of $Q^\pi(s, \tilde{a})$ for actions $\tilde{a}$ drawn in the vicinity of $a$. More formally, smoothed action values are defined as,

$$\tilde{Q}^\pi(s, a) = \int_{\mathcal{A}} N(\tilde{a} \mid a, \Sigma(s))\, Q^\pi(s, \tilde{a})\, d\tilde{a}. \tag{16}$$

With this definition of $\tilde{Q}^\pi$, one can re-express the gradient of the expected reward objective (Equation (4)) for a Gaussian policy $\pi \equiv (\mu, \Sigma)$ as,

$$\nabla_{\mu, \Sigma}\, O_{ER}(\pi) = \nabla_{\mu, \Sigma} \int_{\mathcal{S}} \tilde{Q}^\pi(s, \mu(s))\, d\rho^\pi(s). \tag{17}$$

The insight that differentiates this approach from prior work (Heess et al., 2015; Ciosek & Whiteson, 2018) is that instead of learning a function approximator for $Q^\pi$ then drawing samples to approximate the expectation in (16) and its derivative, we directly learn a function approximator for $\tilde{Q}^\pi$.

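One of the key observations that enables learning a function approximator for $\tilde{Q}^\pi$ is that smoothed Q-values satisfy a notion of Bellman consistency. First, note that for Gaussian policies $\pi \equiv (\mu, \Sigma)$ we have the following relation between the expected and smoothed Q-values:

$$Q^\pi(s, a) = E_{r, s'}\left[r + \gamma\, \tilde{Q}^\pi(s', \mu(s'))\right]. \tag{18}$$

Then, combining (16) and (18), one can derive the following one-step Bellman equation for smoothed Q-values,

$$\tilde{Q}^\pi(s, a) = \int_{\mathcal{A}} N(\tilde{a} \mid a, \Sigma(s))\, E_{\tilde{r}, \tilde{s}'}\left[\tilde{r} + \gamma\, \tilde{Q}^\pi(\tilde{s}', \mu(\tilde{s}'))\right] d\tilde{a}, \tag{19}$$

where $\tilde{r}$ and $\tilde{s}'$ are sampled from $R(s, \tilde{a})$ and $P(s, \tilde{a})$. Below, we elaborate on how one can make use of the derivatives of $\tilde{Q}^\pi$ to learn $\mu$ and $\Sigma$, and how the Bellman equation in (19) enables direct optimization of $\tilde{Q}^\pi$.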

4.1 Policy Improvement

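We assume a Gaussian policy $\pi_{\theta, \phi} \equiv (\mu_\theta, \Sigma_\phi)$ parameterized by $\theta$ and $\phi$ for the mean and the covariance respectively. The gradient of the objective w.r.t. the mean parameters follows from the policy gradient theorem in conjunction with (17) and is almost identical to (10),

$$\frac{\partial O_{ER}(\pi_{\theta, \phi})}{\partial \theta} = \int_{\mathcal{S}} \left.\frac{\partial \tilde{Q}^\pi(s, a)}{\partial a}\right|_{a = \mu_\theta(s)} \frac{\partial \mu_\theta(s)}{\partial \theta}\, d\rho^\pi(s). \tag{20}$$

Estimating the derivative of the objective w.r.t. the covariance parameters is not as straightforward, since $\tilde{Q}^\pi$ is not a direct function of $\Sigma$. However, a key result is that the second derivative of $\tilde{Q}^\pi$ w.r.t. actions is sufficient to exactly compute the derivative of $\tilde{Q}^\pi$ w.r.t. $\Sigma$.

$$\frac{\partial \tilde{Q}^\pi(s, a)}{\partial \Sigma(s)} = \frac{1}{2} \cdot \frac{\partial^2 \tilde{Q}^\pi(s, a)}{\partial a^2} \quad \forall s, a. \tag{21}$$

A proof of this identity is provided in the Appendix. The full derivative w.r.t. $\phi$ can then be shown to take the form,

$$\frac{\partial O_{ER}(\pi_{\theta, \phi})}{\partial \phi} = \frac{1}{2} \int_{\mathcal{S}} \left.\frac{\partial^2 \tilde{Q}^\pi(s, a)}{\partial a^2}\right|_{a = \mu_\theta(s)} \frac{\partial \Sigma_\phi(s)}{\partial \phi}\, d\rho^\pi(s). \tag{22}$$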
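The relation between the covariance derivative and the action Hessian can be sanity-checked numerically in one dimension. The sketch below assumes a hypothetical single-state Q-function $Q(a) = a^4$, whose Gaussian-smoothed version with variance $v$ has the closed form $a^4 + 6a^2 v + 3v^2$, and compares finite-difference estimates of the two sides. The helper names and step size are illustrative choices.

```python
def smoothed_quartic(a, var):
    """Closed form of E[(a + eps)^4] for eps ~ N(0, var)."""
    return a ** 4 + 6.0 * a ** 2 * var + 3.0 * var ** 2


def d_dvar(f, a, var, h=1e-3):
    """Central finite difference with respect to the variance."""
    return (f(a, var + h) - f(a, var - h)) / (2.0 * h)


def d2_da2(f, a, var, h=1e-3):
    """Central second finite difference with respect to the action."""
    return (f(a + h, var) - 2.0 * f(a, var) + f(a - h, var)) / (h * h)


a, var = 1.5, 0.8
lhs = d_dvar(smoothed_quartic, a, var)        # derivative w.r.t. the variance
rhs = 0.5 * d2_da2(smoothed_quartic, a, var)  # half the action Hessian
```

Both quantities come out at approximately 18.3 here, matching the analytic value $6a^2 + 6v$.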


4.2 Policy Evaluation

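There are two natural ways to optimize $\tilde{Q}^\pi_w$. The first approach leverages (16) to update $\tilde{Q}^\pi_w$ based on the expectation of $Q^\pi_w$. In this case, one first trains a parameterized model $Q^\pi_w$ to approximate the standard $Q^\pi$ function using conventional methods (Rummery & Niranjan, 1994; Sutton & Barto, 1998; Van Seijen et al., 2009), then fits $\tilde{Q}^\pi_w$ to $Q^\pi_w$ based on (16). In particular, given transitions $(s, a, r, s')$ sampled from interactions with the environment, one can train $Q^\pi_w$ to minimize the Bellman error

$$\left(Q^\pi_w(s, a) - r - \gamma\, Q^\pi_w(s', a')\right)^2, \tag{23}$$

where $a' \sim N(\mu(s'), \Sigma(s'))$. Then, $\tilde{Q}^\pi_w$ can be optimized to minimize the squared error

$$\left(\tilde{Q}^\pi_w(s, a) - E_{\tilde{a}}\left[Q^\pi_w(s, \tilde{a})\right]\right)^2, \tag{24}$$

where $\tilde{a} \sim N(a, \Sigma(s))$, using several samples. When the target values in the residuals are treated as fixed (i.e., using a target network), these updates will reach a fixed point when $\tilde{Q}^\pi_w$ satisfies the recursion in the Bellman equation (16).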

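The second approach requires a single function approximator for $\tilde{Q}^\pi_w$, resulting in a simpler implementation; hence we use this approach in our experimental evaluation. Suppose one has access to a tuple $(s, \tilde{a}, \tilde{r}, \tilde{s}')$ sampled from a replay buffer with knowledge of the sampling probability $q(\tilde{a} \mid s)$ (possibly unnormalized) with full support. Then we draw a phantom action $a \sim N(\tilde{a}, \Sigma(s))$ and optimize $\tilde{Q}^\pi_w(s, a)$ by minimizing a weighted Bellman error

$$\frac{1}{q(\tilde{a} \mid s)} \left(\tilde{Q}^\pi_w(s, a) - \tilde{r} - \gamma\, \tilde{Q}^\pi_w(\tilde{s}', \mu(\tilde{s}'))\right)^2. \tag{25}$$

In this way, for a specific pair of state and action $(s, a)$ the expected objective value is,

$$E_{q(\tilde{a} \mid s),\, \tilde{r},\, \tilde{s}'}\left[\delta \left(\tilde{Q}^\pi_w(s, a) - \tilde{r} - \gamma\, \tilde{Q}^\pi_w(\tilde{s}', \mu(\tilde{s}'))\right)^2\right], \tag{26}$$

where $\delta = \frac{N(a \mid \tilde{a}, \Sigma(s))}{q(\tilde{a} \mid s)}$. Note that the denominator of $\delta$ counter-acts the expectation over $\tilde{a}$ in (26) and that the numerator of $\delta$ is $N(a \mid \tilde{a}, \Sigma(s)) = N(\tilde{a} \mid a, \Sigma(s))$. Therefore, when the target value $\tilde{r} + \gamma\, \tilde{Q}^\pi_w(\tilde{s}', \mu(\tilde{s}'))$ is treated as fixed (i.e., using a target network) this training procedure reaches an optimum when $\tilde{Q}^\pi_w(s, a)$ takes on the target value provided in the Bellman equation (19).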
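The phantom-action loss can be sketched as follows, assuming scalar states and actions, callables for the critic, its target copy, the target policy mean, and the policy standard deviation, and unit importance weights (that is, the sampling probability is ignored). All names here are hypothetical and chosen for illustration only.

```python
import random


def phantom_bellman_loss(batch, q_tilde, q_tilde_target, mu_target, sigma, gamma, rng):
    """Mean squared Bellman error of q_tilde at phantom actions (importance weights omitted)."""
    total = 0.0
    for s, a_taken, r, s_next in batch:
        # Phantom action drawn around the action stored in the replay buffer.
        phantom = rng.gauss(a_taken, sigma(s))
        # Target uses the lagged critic at the policy mean and is held fixed.
        target = r + gamma * q_tilde_target(s_next, mu_target(s_next))
        total += (q_tilde(s, phantom) - target) ** 2
    return total / len(batch)


# Toy check with constant functions, so the result does not depend on the noise.
batch = [(0.0, 0.3, 1.0, 0.1)]
loss = phantom_bellman_loss(
    batch,
    q_tilde=lambda s, a: 1.0,
    q_tilde_target=lambda s, a: 1.0,
    mu_target=lambda s: 0.0,
    sigma=lambda s: 0.5,
    gamma=0.5,
    rng=random.Random(0),
)
```

In the toy check the target is 1.0 + 0.5 * 1.0 = 1.5 and the prediction is 1.0, so the loss is 0.25. A real implementation would differentiate this loss with respect to the critic parameters.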

In practice, we find that it is unnecessary to keep track of the probabilities $q(\tilde{a} \mid s)$, and assume the replay buffer provides a near-uniform distribution of actions conditioned on states. Other recent work has also benefited from ignoring or heavily damping importance weights (Munos et al., 2016; Wang et al., 2017; Schulman et al., 2017). However, it is possible when interacting with the environment to save the probability of sampled actions along with their transitions, and thus have access to $q(\tilde{a} \mid s)$.

4.3 Proximal Policy Optimization

Policy gradient algorithms are notoriously unstable, particularly in continuous control problems. Such instability has motivated the development of trust region methods that constrain each gradient step to lie within a trust region (Schulman et al., 2015), or augment the expected reward objective with a penalty on KL-divergence from a previous policy (Nachum et al., 2018; Schulman et al., 2017). These stabilizing techniques have thus far not been applicable to algorithms like DDPG, since the policy is deterministic. The formulation we propose above, however, is easily amenable to trust region optimization. Specifically, we may augment the objective (17) with a penalty

$$O_{TR}(\pi) = O_{ER}(\pi) - \lambda \int_{\mathcal{S}} \mathrm{KL}\left(\pi \,\|\, \pi_{\text{old}}\right) d\rho^\pi(s), \tag{27}$$

where $\pi_{\text{old}} \equiv (\mu_{\text{old}}, \Sigma_{\text{old}})$ is a previous version of the policy. The optimization is straightforward, since the KL-divergence of two Gaussians can be expressed analytically.
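As an illustration of the analytic KL-divergence, the sketch below handles the special case of diagonal covariances, represented as lists of per-dimension means and variances. The diagonal restriction and the function name are assumptions made to keep the example short; a full covariance needs the matrix form of the same expression.

```python
import math


def kl_diag_gaussians(mu_new, var_new, mu_old, var_old):
    """KL( N(mu_new, diag(var_new)) || N(mu_old, diag(var_old)) )."""
    kl = 0.0
    for m1, v1, m0, v0 in zip(mu_new, var_new, mu_old, var_old):
        # Per-dimension closed form; the dimensions are independent, so they add.
        kl += 0.5 * (v1 / v0 + (m1 - m0) ** 2 / v0 - 1.0 + math.log(v0 / v1))
    return kl


same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

Identical policies give a penalty of zero, and shifting a unit-variance mean by one gives 0.5. Because the expression is differentiable in both the mean and the variance, its gradient can be subtracted from the policy update directly.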

This concludes the technical presentation of the proposed algorithm Smoothie: pseudocode for the full training procedure, including policy improvement, policy evaluation, and proximal policy improvement is given in Algorithm 1. The reader may also refer to the appendix for additional implementation details.

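  Input: Environment $ENV$, learning rates $\eta_\pi, \eta_Q$, discount factor $\gamma$, KL-penalty $\lambda$, batch size $B$, number of training steps $N$, target network lag $\tau$.
  Initialize $\theta, \phi, w$, set $\theta' = \theta$, $\phi' = \phi$, $w' = w$.
  for $i = 0$ to $N - 1$ do
     // Collect experience
     Sample action $a \sim N(\mu_\theta(s), \Sigma_\phi(s))$ and apply to $ENV$ to yield $r$ and $s'$.
     Insert transition $(s, a, r, s')$ to replay buffer.
     // Train $\mu_\theta, \Sigma_\phi$
     Sample batch $\{(s_k, a_k, r_k, s'_k)\}_{k=1}^{B}$ from replay buffer.
     Compute gradients $g_k = \left.\frac{\partial \tilde{Q}^\pi_w(s_k, a)}{\partial a}\right|_{a = \mu_\theta(s_k)}$.
     Compute Hessians $H_k = \left.\frac{\partial^2 \tilde{Q}^\pi_w(s_k, a)}{\partial a^2}\right|_{a = \mu_\theta(s_k)}$.
     Compute penalties $KL_k = \mathrm{KL}\left(N(\mu_\theta(s_k), \Sigma_\phi(s_k)) \,\|\, N(\mu_{\theta'}(s_k), \Sigma_{\phi'}(s_k))\right)$.
     Compute updates $\Delta\theta = \frac{1}{B} \sum_{k=1}^{B} \left[g_k \frac{\partial \mu_\theta(s_k)}{\partial \theta} - \lambda \frac{\partial KL_k}{\partial \theta}\right]$, $\Delta\phi = \frac{1}{B} \sum_{k=1}^{B} \left[\frac{1}{2} H_k \frac{\partial \Sigma_\phi(s_k)}{\partial \phi} - \lambda \frac{\partial KL_k}{\partial \phi}\right]$.
     Update $\theta \leftarrow \theta + \eta_\pi \Delta\theta$, $\phi \leftarrow \phi + \eta_\pi \Delta\phi$.
     // Train $\tilde{Q}^\pi_w$
     Sample batch $\{(s_k, a_k, r_k, s'_k)\}_{k=1}^{B}$ from replay buffer.
     Sample phantom actions $\tilde{a}_k \sim N(a_k, \Sigma_\phi(s_k))$.
     Compute loss $L(w) = \frac{1}{B} \sum_{k=1}^{B} \left(\tilde{Q}^\pi_w(s_k, \tilde{a}_k) - r_k - \gamma\, \tilde{Q}^\pi_{w'}(s'_k, \mu_{\theta'}(s'_k))\right)^2$.
     Update $w \leftarrow w - \eta_Q \nabla_w L(w)$.
     // Update target variables
     Update $\theta' \leftarrow (1 - \tau)\theta + \tau\theta'$; $\phi' \leftarrow (1 - \tau)\phi + \tau\phi'$; $w' \leftarrow (1 - \tau)w + \tau w'$.
  end for
Algorithm 1 Smoothie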
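The policy improvement part of the loop can be illustrated on a one-dimensional, single-state toy problem. The sketch below is deliberately reduced: it assumes the smoothed Q-value is known in closed form rather than learned, omits the replay buffer, the critic update, and the KL penalty, and uses a hypothetical variance floor to keep the policy valid. It only shows how the gradient drives the mean and half the Hessian drives the variance.

```python
def improve_policy(grad_q, hess_q, mu, var, lr=0.05, steps=200, min_var=0.01):
    """Gradient ascent on a scalar Gaussian policy using smoothed Q derivatives."""
    for _ in range(steps):
        g = grad_q(mu, var)  # first derivative of the smoothed Q at a = mu
        h = hess_q(mu, var)  # second derivative of the smoothed Q at a = mu
        mu = mu + lr * g
        # The variance follows half the Hessian; clamp so it stays positive.
        var = max(min_var, var + lr * 0.5 * h)
    return mu, var


# Hypothetical reward -(a - 2)^2, whose smoothed version is -(a - 2)^2 - var.
grad_q = lambda a, var: -2.0 * (a - 2.0)
hess_q = lambda a, var: -2.0

mu, var = improve_policy(grad_q, hess_q, mu=0.0, var=1.0)
```

On this concave reward the mean converges to 2 and the variance shrinks to the floor, since the Hessian is negative everywhere. Where the smoothed Q-value is locally convex the same rule would increase the variance instead.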

4.4 Compatible Function Approximation

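The approximator $\tilde{Q}^\pi_w$ for $\tilde{Q}^\pi$ should be sufficiently accurate so that updates for $\mu_\theta$ and $\Sigma_\phi$ are not affected by substituting $\frac{\partial \tilde{Q}^\pi_w(s, a)}{\partial a}$ and $\frac{\partial^2 \tilde{Q}^\pi_w(s, a)}{\partial a^2}$ for $\frac{\partial \tilde{Q}^\pi(s, a)}{\partial a}$ and $\frac{\partial^2 \tilde{Q}^\pi(s, a)}{\partial a^2}$ respectively. Define $\epsilon_k(s, \theta, w)$ as the difference between the true $k$-th derivative of $\tilde{Q}^\pi$ and the $k$-th derivative of the approximated $\tilde{Q}^\pi_w$ at $a = \mu_\theta(s)$:

$$\epsilon_k(s, \theta, w) = \left.\nabla^k_a \tilde{Q}^\pi_w(s, a)\right|_{a = \mu_\theta(s)} - \left.\nabla^k_a \tilde{Q}^\pi(s, a)\right|_{a = \mu_\theta(s)}.$$

We claim that a $\tilde{Q}^\pi_w$ is compatible with respect to $\mu_\theta$ if

  1. $\left.\nabla_a \tilde{Q}^\pi_w(s, a)\right|_{a = \mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^\top w$,

  2. $\nabla_w \int_{\mathcal{S}} \|\epsilon_1(s, \theta, w)\|^2\, d\rho^\pi(s) = 0$ (i.e., $w$ minimizes the expected squared error of the gradients).

Additionally, $\tilde{Q}^\pi_w$ is compatible with respect to $\Sigma_\phi$ if

  1. $\left.\nabla^2_a \tilde{Q}^\pi_w(s, a)\right|_{a = \mu_\theta(s)} = \nabla_\phi \Sigma_\phi(s)^\top w$,

  2. $\nabla_w \int_{\mathcal{S}} \|\epsilon_2(s, \theta, w)\|^2\, d\rho^\pi(s) = 0$ (i.e., $w$ minimizes the expected squared error of the Hessians).

We provide a proof of these claims in the Appendix. One possible parameterization of $\tilde{Q}^\pi_w$ may be achieved by taking $w = [w_0, w_1, w_2]$ and parameterizing

$$\tilde{Q}^\pi_w(s, a) = V_{w_0}(s) + (a - \mu_\theta(s))^\top \nabla_\theta \mu_\theta(s)^\top w_1 + \tfrac{1}{2} (a - \mu_\theta(s))^\top \left(\nabla_\phi \Sigma_\phi(s)^\top w_2\right) (a - \mu_\theta(s)).$$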


Similar conditions and parameterizations exist for DDPG (Silver et al., 2014), in terms of $Q^\pi$. While it is reassuring to know that there exists a class of function approximators which are compatible, this fact has largely been ignored in practice. At first glance, it seems impossible to satisfy the second set of conditions without access to derivative information of the true $\tilde{Q}^\pi$ (for DDPG, $Q^\pi$). Indeed, the methods for training Q-value approximators (equation (12) and Section 4.2) only train to minimize an error between raw scalar values. For DDPG, we are unaware of any method that allows one to train $Q^\pi_w$ to minimize an error with respect to the derivatives of the true $Q^\pi$. However, the case is different for the smoothed Q-values $\tilde{Q}^\pi$. In fact, it is possible to train $\tilde{Q}^\pi_w$ to minimize an error with respect to the derivatives of the true $\tilde{Q}^\pi$. We provide an elaboration in the Appendix. In brief, it is possible to use (19) to derive Bellman-like equations which relate a derivative $\frac{\partial^k \tilde{Q}^\pi(s, a)}{\partial a^k}$ of any degree $k$ to an integral over the raw Q-values at the next time step $\tilde{Q}^\pi(s', \mu(s'))$. Thus, it is possible to devise a training scheme in the spirit of the one outlined in Section 4.2, which optimizes $\tilde{Q}^\pi_w$ to minimize the squared error with the derivatives of the true $\tilde{Q}^\pi$. This theoretical property of the smoothed Q-values is unique and provides added benefits to its use over the standard Q-values.

5 Related Work

This paper follows a long line of work that uses Q-value functions to stably learn a policy, which in the past has been used to either approximate expected (Rummery & Niranjan, 1994; Van Seijen et al., 2009; Gu et al., 2017) or optimal (Watkins, 1989; Silver et al., 2014; Nachum et al., 2017; Haarnoja et al., 2017; Metz et al., 2017) future value.

Work that is most similar to what we present are methods that exploit gradient information from the Q-value function to train a policy. Deterministic policy gradient (Silver et al., 2014) is perhaps the best known of these. The method we propose can be interpreted as a generalization of the deterministic policy gradient. Indeed, if one takes the limit of the policy covariance as it goes to 0, the proposed Q-value function becomes the deterministic value function of DDPG, and the updates for training the Q-value approximator and the policy mean are identical.

Stochastic Value Gradient (SVG) (Heess et al., 2015) also trains stochastic policies using an update that is similar to DDPG (i.e., SVG(0) with replay). The key differences with our approach are that SVG does not provide an update for the covariance, and the mean update in SVG estimates the gradient with a noisy Monte Carlo sample, which we avoid by estimating the smoothed Q-value function. Although a covariance update could be derived using the same reparameterization trick as in the mean update, that would also require a noisy Monte Carlo estimate. Methods for updating the covariance along the gradient of expected reward are essential for applying the subsequent trust region and proximal policy techniques.

More recently, Ciosek & Whiteson (2018) introduced expected policy gradients (EPG), a generalization of DDPG that provides updates for the mean and covariance of a stochastic Gaussian policy using gradients of an estimated Q-value function. In that work, the expected Q-value used in standard policy gradient algorithms such as SARSA (Sutton & Barto, 1998; Rummery & Niranjan, 1994; Van Seijen et al., 2009) is estimated. The updates in EPG therefore require approximating an integral of the expected Q-value function, or assuming the Q-value has a simple form that allows for analytic computation. Our analogous process directly estimates an integral (via the smoothed Q-value function) and avoids approximate integrals, thereby making the updates simpler and generally applicable. Moreover, while Ciosek & Whiteson (2018) rely on a quadratic Taylor expansion of the estimated Q-value function, we instead rely on the strength of neural network function approximators to directly estimate the smoothed Q-value function.

The novel training scheme we propose for learning the covariance of a Gaussian policy relies on properties of Gaussian integrals (Bonnet, 1964; Price, 1958). Similar identities have been used in the past to derive updates for variational auto-encoders (Kingma & Welling, 2014) and Gaussian back-propagation (Rezende et al., 2014).

Finally, the perspective presented in this paper, where Q-values represent the averaged return of a distribution of actions rather than a single action, is distinct from recent advances in distributional RL (Bellemare et al., 2017). Those approaches focus on the distribution of returns of a single action, whereas we consider the single average return of a distribution of actions. Although we restrict our attention in this paper to Gaussian policies, an interesting topic for further investigation is to study the applicability of this new perspective to a wider class of policy distributions.

6 Experiments

Figure 2: Left: The learnable policy mean and standard deviation during training for Smoothie and DDPG on the simple synthetic task introduced in Section 3. The standard deviation for DDPG is the exploratory noise kept constant during training. Right: Copy of Figure 1 showing the reward function and its Gaussian-smoothed version. Smoothie successfully escapes the lower-reward local optimum, while increasing then decreasing its policy variance as the convexity/concavity of the smoothed reward function changes.

We utilize the insights from Section 4 to introduce a new RL algorithm, Smoothie. Smoothie maintains a parameterized $\tilde{Q}^\pi_w$ trained via the procedure described in Section 4.2. It then uses the gradient and Hessian of this approximation to train a Gaussian policy using the updates stated in (20) and (22). See Algorithm 1 for simplified pseudocode of the algorithm.
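As a rough illustration of how a gradient and a Hessian can drive both updates, the following sketch performs one Smoothie-style step in a one-dimensional, single-state setting. It assumes that the mean follows the gradient of the smoothed Q-value and that the variance follows half of its second derivative with respect to the action. The function names, learning rates, finite-difference derivatives, and the quadratic test reward are hypothetical choices made for this example and are not taken from Algorithm 1.

```python
def smoothie_step(q_tilde, mean, var, lr_mean=0.1, lr_var=0.1, eps=1e-2, min_var=1e-4):
    """One illustrative update of a 1-D Gaussian policy (mean, var).

    q_tilde(a, var) is assumed to return the smoothed Q-value at action a
    for a policy with variance var. Derivatives with respect to a are taken
    by central finite differences (an assumption of this sketch; a learned
    network would normally be differentiated automatically).
    """
    center = q_tilde(mean, var)
    plus = q_tilde(mean + eps, var)
    minus = q_tilde(mean - eps, var)
    grad = (plus - minus) / (2.0 * eps)              # first derivative in a
    hess = (plus - 2.0 * center + minus) / eps ** 2  # second derivative in a
    new_mean = mean + lr_mean * grad
    # Half the Hessian stands in for the derivative with respect to the variance.
    new_var = max(min_var, var + lr_var * 0.5 * hess)
    return new_mean, new_var


def quadratic_q_tilde(a, var):
    # Hypothetical reward r(a) = -(a - 2)^2; its average over N(a, var)
    # is -(a - 2)^2 - var.
    return -(a - 2.0) ** 2 - var


new_mean, new_var = smoothie_step(quadratic_q_tilde, mean=0.0, var=1.0)
```

For this concave reward the step moves the mean toward the maximizer at 2 and shrinks the variance, which matches the intuition that a concave smoothed landscape favors a narrower policy.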

We perform a number of evaluations of Smoothie to compare to DDPG. We choose DDPG as a baseline because it (1) utilizes gradient information of a Q-value approximator, much like the proposed algorithm; and (2) is a standard algorithm well known to achieve good, sample-efficient performance on continuous control benchmarks.

6.1 Synthetic Task

Before investigating benchmark problems, we first briefly revisit the simple task introduced in Section 3 and reproduced in Figure 2 (Right). Here, the reward function is a mixture of two Gaussians, one better than the other, and we initialize the policy mean to be centered on the worse of the two. We plot the learnable policy mean and standard deviation during training for Smoothie and DDPG in Figure 2 (Left). Smoothie learns both the mean and variance, while DDPG learns only the mean; the variance plotted for DDPG is the exploratory noise, whose scale is kept fixed during training.

[Figure 3 panels: HalfCheetah, Swimmer, Hopper, Walker2d, Ant, Humanoid]
Figure 3: Results of Smoothie and DDPG on continuous control benchmarks. The x-axis is in millions of environment steps. Each plot shows the average reward and standard deviation clipped at the min and max of six randomly seeded runs after choosing best hyperparameters. We see that Smoothie is competitive with DDPG even when DDPG uses a hyperparameter-tuned noise scale, and Smoothie learns the optimal noise scale (the covariance) during training. Moreover, we observe significant advantages in terms of final reward performance, especially in the more difficult tasks like Hopper, Walker2d, and Humanoid.

As expected, we observe that DDPG cannot escape the local optimum. At the beginning of training it exhibits some movement away from the local optimum (likely due to the initial noisy approximation given by $Q^\pi_w$). However, it is unable to progress very far from the initial mean. Note that this is not an issue of exploration. The exploration scale is high enough that $Q^\pi_w$ is aware of the better Gaussian. The issue is in the update for $\mu$, which depends only on the derivative of $Q^\pi_w$ at the current mean.

On the other hand, we find Smoothie is easily able to solve the task. This is because the smoothed reward function approximated by $\tilde{Q}^\pi_w$ has a derivative that clearly points toward the better Gaussian. We also observe that Smoothie is able to suitably adjust the covariance during training. Initially, $\Sigma$ decreases due to the concavity of the smoothed reward function. As a region of convexity is entered, it begins to increase, before again decreasing to near-zero as $\mu$ approaches the global optimum. This example clearly shows the ability of Smoothie to exploit the smoothed action value landscape.
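The contrast between the two updates can be reproduced in closed form. The sketch below uses a hypothetical two-component Gaussian mixture as the reward (the weights, means, and widths are illustrative and are not those of Figure 2). It relies on the fact that averaging a Gaussian bump of variance v over Gaussian action noise of variance σ² gives a Gaussian bump of variance v + σ², so the derivatives of the smoothed reward are available analytically.

```python
import math

# Hypothetical reward: weighted sum of two Gaussian bumps (weight, mean, variance).
# The bump at a = 1 is the better one; the policy mean starts at a = -1.
COMPONENTS = [(1.0, -1.0, 0.05), (1.5, 1.0, 0.05)]


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def smoothed_reward_grad(a, policy_var):
    """First derivative in a of the reward averaged over N(a, policy_var).

    policy_var = 0 gives the derivative of the raw reward.
    """
    total = 0.0
    for weight, mean, var in COMPONENTS:
        v = var + policy_var  # variances add under Gaussian smoothing
        total += weight * normal_pdf(a, mean, v) * (-(a - mean) / v)
    return total


def smoothed_reward_hess(a, policy_var):
    """Second derivative in a of the reward averaged over N(a, policy_var)."""
    total = 0.0
    for weight, mean, var in COMPONENTS:
        v = var + policy_var
        total += weight * normal_pdf(a, mean, v) * ((a - mean) ** 2 / v ** 2 - 1.0 / v)
    return total


start = -1.0  # policy mean initialized on the worse bump
raw_grad = smoothed_reward_grad(start, policy_var=0.0)
smooth_grad = smoothed_reward_grad(start, policy_var=1.0)
smooth_hess = smoothed_reward_hess(start, policy_var=1.0)
```

With these illustrative numbers the raw derivative at the starting mean is numerically negligible, while the smoothed derivative is clearly positive and points toward the better bump. The smoothed second derivative is negative there, so a Hessian-driven variance update would begin by shrinking the variance.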

6.2 Continuous Control

Next, we consider standard continuous control benchmarks available on OpenAI Gym (Brockman et al., 2016) utilizing the MuJoCo environment (Todorov et al., 2012).

Our implementations utilize feed forward neural networks for policy and Q-values. We parameterize the covariance $\Sigma$ as a diagonal matrix. The exploration for DDPG is determined by an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930; Lillicrap et al., 2016). Additional implementation details are provided in the Appendix.
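For reference, a discretized Ornstein-Uhlenbeck process of the kind commonly used for DDPG exploration can be sketched as follows. The class name, the Euler-Maruyama discretization, the zero long-run mean, and the parameter names `theta` (damping) and `sigma` (scale) are assumptions of this example rather than the settings used in the experiments.

```python
import random


class OUNoise:
    """Scalar Ornstein-Uhlenbeck noise with zero long-run mean (illustrative)."""

    def __init__(self, theta, sigma, dt=1.0, x0=0.0, seed=0):
        self.theta = theta  # damping: how quickly the noise reverts to zero
        self.sigma = sigma  # scale of the random perturbation
        self.dt = dt
        self.x = x0
        self.rng = random.Random(seed)

    def sample(self):
        # Euler-Maruyama step: mean-reverting drift plus Gaussian diffusion.
        drift = -self.theta * self.x * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


# With sigma = 0 the process decays deterministically toward zero.
decay = OUNoise(theta=0.5, sigma=0.0, x0=1.0)
first, second = decay.sample(), decay.sample()

# A noisy instance whose samples would be added to the deterministic action.
noise = OUNoise(theta=0.15, sigma=0.2)
perturbations = [noise.sample() for _ in range(5)]
```

Because successive samples are correlated, the perturbation drifts smoothly over time, and both `theta` and `sigma` have to be chosen by hand or by search.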

[Figure 4 panels: HalfCheetah, Swimmer, Hopper, Walker2d, Ant, Humanoid]
Figure 4: Results of Smoothie, DDPG, and TRPO on continuous control benchmarks. The x-axis is in millions of environment steps. Each plot shows the average reward and standard deviation clipped at the min and max of six randomly seeded runs after choosing best hyperparameters. We see that Smoothie is competitive with DDPG even when DDPG uses a hyperparameter-tuned noise scale, and Smoothie learns the optimal noise scale (the covariance) during training. Moreover, we observe significant advantages in terms of final reward performance, especially in the more difficult tasks like Hopper, Walker2d, and Humanoid. Across all tasks, TRPO is not sufficiently sample-efficient to provide a competitive baseline.

We compare the results of Smoothie and DDPG in Figure 4. For each task we performed a hyperparameter search over actor learning rate, critic learning rate and reward scale, and plot the average of six runs for the best hyperparameters. For DDPG we extended the hyperparameter search to also consider the scale and damping of exploratory noise provided by the Ornstein-Uhlenbeck process. Smoothie, on the other hand, contains an additional hyperparameter to determine the weight on KL-penalty.
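The plotted band can be computed per evaluation point roughly as below. This is a sketch of one plausible reading of "standard deviation clipped at the min and max": the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize_runs(rewards):
    """Return (mean, lower, upper) for the rewards of several seeded runs.

    The band is mean +/- one standard deviation, clipped so that it never
    extends beyond the smallest or largest observed run.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    lower = max(min(rewards), mean - std)
    upper = min(max(rewards), mean + std)
    return mean, lower, upper


# Six hypothetical runs at a single evaluation point.
mean, lower, upper = summarize_runs([0.0, 0.0, 0.0, 0.0, 0.0, 6.0])
```

Clipping keeps the shaded region inside the range of observed returns, which matters when one seed is an outlier, as in the example values above.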

Despite DDPG having the advantage of its exploration decided by a hyperparameter search while Smoothie must learn its exploration without supervision, we find that Smoothie performs competitively or better across all tasks, exhibiting a slight advantage in Swimmer and Ant, while showing more dramatic improvements in Hopper, Walker2d, and Humanoid. The improvement is especially dramatic for Hopper, where the average reward is doubled. We also highlight the results for Humanoid, which, as far as we know, are the best published results for a method that only trains on the order of millions of environment steps. In contrast, TRPO, which to the best of our knowledge is the only other algorithm that can achieve competitive performance, requires on the order of tens of millions of environment steps to achieve comparable reward. This gives added evidence to the benefits of using a learnable covariance and not restricting a policy to be deterministic.

These results make it clear that on more difficult continuous control tasks (Hopper, Walker2d, Ant, Humanoid), both DDPG and Smoothie suffer from instability, showing high variance across runs and a performance that is far behind trust region and proximal policy methods (Schulman et al., 2015; Nachum et al., 2018; Schulman et al., 2017). Thus, we evaluated Smoothie with a KL-penalty and present these results in Figure 5.

Empirically, we found the introduction of a KL-penalty to improve performance of Smoothie, especially on harder tasks. We present a comparison of results of Smoothie with and without the KL-penalty on the four harder tasks in Figure 5. A KL-penalty to encourage stability is not possible in DDPG. Thus, Smoothie provides a much needed solution to the inherent instability in DDPG training.
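To make the penalty concrete, the sketch below computes the closed-form KL divergence between two diagonal Gaussian policies and subtracts a weighted copy of it from an objective value. The direction of the divergence, the function names, and the weight `kl_weight` are assumptions made for illustration, not the exact formulation used in the experiments.

```python
import math


def diag_gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances, given as lists."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total


def penalized_objective(objective, new_policy, old_policy, kl_weight):
    """Objective value minus a weighted KL divergence from the old policy.

    Each policy is a (mean, variance) pair of equal-length lists.
    """
    new_mean, new_var = new_policy
    old_mean, old_var = old_policy
    return objective - kl_weight * diag_gaussian_kl(new_mean, new_var, old_mean, old_var)


same = diag_gaussian_kl([0.0], [1.0], [0.0], [1.0])
shifted = diag_gaussian_kl([1.0], [1.0], [0.0], [1.0])
value = penalized_objective(2.0, ([1.0], [1.0]), ([0.0], [1.0]), kl_weight=0.5)
```

The formula requires strictly positive variances, which is consistent with the remark that such a penalty is not available to a deterministic policy.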

We observe that Smoothie augmented with a KL-penalty consistently improves performance. The improvement is especially dramatic for Hopper, where the average reward is doubled. We also highlight the results for Humanoid, which also exhibits a significant improvement. As far as we know, these Humanoid results are the best published results for a method that only trains on the order of millions of environment steps. In contrast, TRPO, which to the best of our knowledge is the only other algorithm which can achieve better performance, requires on the order of tens of millions of environment steps. This gives added evidence to the benefits of using a learnable covariance and not restricting a policy to be deterministic.

[Figure 5 panels: Hopper, Walker2d, Ant, Humanoid]
Figure 5: Results of Smoothie with and without a KL-penalty. The x-axis is in millions of environment steps. We observe benefits of using a proximal policy optimization method, especially in Hopper and Humanoid, where the performance improvement is significant without sacrificing sample efficiency.

7 Conclusion

We have presented a new Q-value function concept, $\tilde{Q}^\pi$, that is a Gaussian-smoothed version of the standard expected Q-value, $Q^\pi$. The advantage of $\tilde{Q}^\pi$ over $Q^\pi$ is that its gradient and Hessian are directly related to the gradient of expected reward with respect to the mean and covariance of a Gaussian policy. The resulting algorithm, Smoothie, is able to successfully learn both mean and covariance during training, leading to performance that surpasses that of DDPG, especially when incorporating a penalty on divergence from a previous policy.

Gu et al. (2017) showed that DDPG sits on one end of a spectrum of methods that interpolate between off-policy updates and on-policy policy gradient updates. Future work could determine if Smoothie can also benefit from the improved stability of interpolated methods.

The success of $\tilde{Q}^\pi$ is encouraging. Intuitively it appears advantageous to learn $\tilde{Q}^\pi$ instead of $Q^\pi$. The smoothed Q-values by definition make the true reward surface smoother, and thus possibly easier to learn; moreover, the smoothed Q-values have a more direct relationship with the expected discounted return objective. We encourage further investigation of these claims and of techniques for applying the underlying motivations for $\tilde{Q}^\pi$ to other types of policies.

8 Acknowledgments

We thank Luke Metz, Sergey Levine, and the Google Brain team for insightful comments and discussions.


References

  • Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In ICML, pp. 449–458, 2017.
  • Bonnet (1964) Bonnet, G. Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220, 1964.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540, 2016.
  • Ciosek & Whiteson (2018) Ciosek, K. and Whiteson, S. Expected policy gradients. AAAI, 2018.
  • Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. Off-policy actor-critic. ICML, 2012.
  • Gu et al. (2017) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. NIPS, 2017.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. ICML, 2017.
  • Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR, 2014.
  • Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms, 2000.
  • Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ICLR, 2016.
  • Metz et al. (2017) Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep RL. CoRR, abs/1705.05035, 2017.
  • Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In NIPS, 2016.
  • Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. NIPS, 2017.
  • Nachum et al. (2018) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. ICLR, 2018.
  • Price (1958) Price, R. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
  • Rummery & Niranjan (1994) Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. 1994.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, 2015.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.
  • Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1998.
  • Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
  • Uhlenbeck & Ornstein (1930) Uhlenbeck, G. E. and Ornstein, L. S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.
  • Van Seijen et al. (2009) Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected SARSA. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL’09. IEEE Symposium on, pp. 177–184. IEEE, 2009.
  • Wang et al. (2017) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. ICLR, 2017.
  • Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
  • Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.

Appendix A Proof of Theorem 1

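We want to show that for any $s, a$,

$$\frac{\partial \tilde{Q}^\pi(s,a)}{\partial \Sigma(s)} = \frac{1}{2} \cdot \frac{\partial^2 \tilde{Q}^\pi(s,a)}{\partial a^2}. \qquad (30)$$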

We note that similar identities for Gaussian integrals exist in the literature (Price, 1958; Rezende et al., 2014) and point the reader to these works for further information.
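The identity in question, that the derivative of a Gaussian-smoothed function with respect to the covariance equals half of its Hessian with respect to the mean, can be sanity-checked numerically in one dimension. The sketch below is not part of the proof: the test function (cosine), the trapezoid quadrature, the step sizes, and all function names are assumptions chosen for illustration.

```python
import math


def smoothed_q(q, a, var, n=4001, width=8.0):
    """Trapezoid-rule estimate of E[q(x)] for x ~ N(a, var)."""
    std = math.sqrt(var)
    lo = a - width * std
    h = 2.0 * width * std / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        pdf = math.exp(-0.5 * (x - a) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        total += weight * q(x) * pdf
    return total * h


def check_identity(q, a, var, eps_var=1e-3, eps_a=1e-2):
    """Return (d/dvar, 0.5 * d^2/da^2) of the smoothed value by central differences."""
    d_var = (smoothed_q(q, a, var + eps_var) - smoothed_q(q, a, var - eps_var)) / (2.0 * eps_var)
    hess = (
        smoothed_q(q, a + eps_a, var)
        - 2.0 * smoothed_q(q, a, var)
        + smoothed_q(q, a - eps_a, var)
    ) / eps_a ** 2
    return d_var, 0.5 * hess


# For q = cos the smoothed value is cos(a) * exp(-var / 2), so both sides
# should be close to -0.5 * cos(a) * exp(-var / 2).
d_var, half_hess = check_identity(math.cos, a=0.3, var=0.5)
```

The two finite-difference estimates agree up to discretization error, which is what the identity predicts for any sufficiently smooth integrand.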

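Proof. The specific identity we state may be derived using standard matrix calculus. We make use of the fact that

$$\frac{\partial}{\partial A} |A|^{-1/2} = -\frac{1}{2} |A|^{-1/2} A^{-1},$$

and for symmetric $A$,

$$\frac{\partial}{\partial A} \left( v^\top A^{-1} v \right) = -A^{-1} v v^\top A^{-1}.$$

We omit $s$ from $\Sigma(s)$ in the following equations for succinctness. The LHS of (30) is

$$\int_{\mathcal{A}} Q^\pi(s,\tilde{a}) \, \frac{\partial}{\partial \Sigma} N(\tilde{a} \mid a, \Sigma) \, d\tilde{a} = \frac{1}{2} \int_{\mathcal{A}} Q^\pi(s,\tilde{a}) \, N(\tilde{a} \mid a, \Sigma) \left( \Sigma^{-1} (\tilde{a}-a)(\tilde{a}-a)^\top \Sigma^{-1} - \Sigma^{-1} \right) d\tilde{a}.$$

Meanwhile, towards tackling the RHS of (30) we note that

$$\frac{\partial \tilde{Q}^\pi(s,a)}{\partial a} = \int_{\mathcal{A}} Q^\pi(s,\tilde{a}) \, N(\tilde{a} \mid a, \Sigma) \, \Sigma^{-1} (\tilde{a}-a) \, d\tilde{a}.$$

Thus we have

$$\frac{1}{2} \cdot \frac{\partial^2 \tilde{Q}^\pi(s,a)}{\partial a^2} = \frac{1}{2} \int_{\mathcal{A}} Q^\pi(s,\tilde{a}) \, N(\tilde{a} \mid a, \Sigma) \left( \Sigma^{-1} (\tilde{a}-a)(\tilde{a}-a)^\top \Sigma^{-1} - \Sigma^{-1} \right) d\tilde{a} = \frac{\partial \tilde{Q}^\pi(s,a)}{\partial \Sigma}.$$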

Appendix B Compatible Function Approximation

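A function approximator $\tilde{Q}^\pi_w$ of $\tilde{Q}^\pi$ should be sufficiently accurate so that updates for the policy mean and covariance are not affected by substituting $\partial \tilde{Q}^\pi_w / \partial a$ and $\partial^2 \tilde{Q}^\pi_w / \partial a^2$ for $\partial \tilde{Q}^\pi / \partial a$ and $\partial^2 \tilde{Q}^\pi / \partial a^2$, respectively.

We claim that a $\tilde{Q}^\pi_w$ is compatible with respect to $\mu_\theta$ if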

  1. ,