Proximal Deterministic Policy Gradient

This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration. The target network plays the role of the variable of optimization and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate through bootstrapping, with a limited increase in computational cost. Further, we demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.


I Introduction

Actor Critic (AC) [Konda2000] algorithms have become the de facto standard in continuous control Reinforcement Learning tasks, allowing the employment of powerful function approximation methods such as Deep Neural Networks (DNNs) to directly learn the control policy. While very effective in simulated environments, their poor sample efficiency has limited their deployment on real systems, where querying the environment for new samples is expensive.

High variance of the gradient estimate [sutton2018reinforcement] is at the foundation of such inefficiency. Algorithms like TRPO [schulman2015trust] and PPO [schulman2017proximal] operate in a sample regime that fails to provide good approximations of the true gradient [ilyas2018deep], with considerable impact on performance. Moreover, their on-policy nature requires new data to be collected at each optimization step, wasting all past transitions. Deterministic Policy Gradient (DPG) algorithms [silver2014deterministic] improve upon these methods by employing deterministic policies and off-policy updates. The former limits the source of randomness to the environment alone, with a consequent reduction in the number of samples required for gradient estimation; the latter allows for data reuse by storing past transitions in a replay buffer and employing them during the entire training procedure.

Another source of error in policy updates is a poor action value function estimate. Here multiple factors come into play. On the one hand, Overestimation Bias [thrun1993issues] causes Q-learning algorithms to exhibit a consistent overestimation of the action value, with potentially divergent errors. To mitigate this problem, Double Q-learning [van2016deep] employs two independent Q-functions trained with a mixed update; however, [fujimoto2018addressing] showed that this approach is not suitable in an AC setting and proposed a similar solution where the Temporal Difference (TD) update is performed with the minimum of the two action value functions.
On the other hand, the bootstrapping nature of TD updates results in a regression problem that changes over time, making the optimization procedure difficult and even unstable (if combined with off-policy updates and function approximation, in the infamous deadly triad [sutton2018reinforcement]). Most AC algorithms employ target networks that are slowly updated during training to provide a stable regression target. This approach was initially introduced in Double Deep Q-Networks (DDQN) [van2016deep] and has since been adopted by Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous], Twin Delayed Deep Deterministic policy gradient (TD3) [fujimoto2018addressing] and Soft Actor Critic (SAC) [haarnoja2018soft]. Target networks play a fundamental role in the optimization procedure, and algorithms are typically very sensitive to the speed at which the networks are updated.

In a Deep RL setting, TD learning is performed by minimizing a surrogate loss function (typically the Mean Square Error) with Stochastic Gradient Descent (SGD) based algorithms; the most common choice is Adam [kingma2014adam], which has proven effective for training DNNs. In this work, we propose an alternative optimization procedure to tackle the sample efficiency problem that provides a principled interpretation of target networks and minimizes a single loss function combining both policy and value updates. Our procedure employs Time Damped Stochastic Proximal Gradient (SPG) [parikh2014proximal] [ryu2014stochastic] iteration, widespread in convex optimization, combined with bootstrapped action value estimates, and provides improved performance compared to state-of-the-art algorithms on continuous control tasks. In more detail, we endow TD3 with a proximal gradient optimization procedure, and we exploit the two Q-networks already used in the original algorithm to limit overestimation bias to also provide a more accurate action value estimate via bootstrapping, which allows better policy updates.

II Related work

The first successful application of DNNs in RL dates back to [mnih2013playing], where Deep Q-learning was introduced to play Atari games at human-level capabilities. Target networks were first introduced in [van2016deep], which proposed an improved version of Deep Q-learning for the same control task. Since then, this algorithm has been the reference for Q-function estimation with DNNs.

In continuous control tasks, Q-learning methods alone are not enough to learn a policy: finding the maximizing action would require the solution of a maximization problem every time the agent needs to act on the environment, with prohibitive computational costs. Here AC algorithms come into play, where a parametrized policy learns to maximize the total expected reward. These methods typically follow the policy iteration [sutton2018reinforcement] paradigm, where at each time step the Q-function is estimated and then the policy is made greedy w.r.t. it. Hence Deep Q-learning is still a fundamental part of methods such as DDPG [lillicrap2015continuous], A3C [mnih2016asynchronous], TD3 [fujimoto2018addressing] and SAC [haarnoja2018soft], which are all related to our method. Alternative approaches that follow the AC paradigm but employ sample estimates of the Q-function can be found in [schulman2015high], TRPO [schulman2015trust], PPO [schulman2017proximal] and P3O [fakoor2019p3o].

There have been attempts to improve AC algorithms from an optimization point of view by proposing alternatives to the TD error with more complex loss functions. SBEED [dai2017sbeed] provides a primal-dual interpretation of the Bellman equation that results in a min-max game to optimize the convex dual of the quadratic loss function. [feng2019kernel] proposes a kernel loss alternative to the MSE that enables improved training of the Q-function.
Proximal methods, while widespread in convex optimization, have been used to train DNNs only in [chaudhari2018deep], in a Supervised Learning setting.

III Background

RL deals with the problem of learning a maximally rewarding behavior for an agent interacting with its environment. Formally, this can be cast in the framework of optimal policy estimation in a Markov Decision Process (MDP) [bertsekas2019reinforcement]. An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ where $\mathcal{S}$ is the set of states that the environment can assume, $\mathcal{A}$ the set of actions that can be performed on the environment, $P(s' \mid s, a)$ the probability of transitioning to state $s'$ after taking action $a$ in state $s$, $r(s, a)$ the reward function, which can be either deterministic or stochastic, and $\gamma \in [0, 1)$ the discount factor used to weight future rewards and guarantee finite total rewards even for infinite time horizon problems. The goal of an RL algorithm is finding a policy $\pi$ that maximizes the total expected discounted reward:

$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right] \tag{1}$$

where the expectation is taken over all sources of randomness in the MDP.

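Policy Gradient methods tackle the problem by directly maximizing (1) with respect to the parameters $\theta$ of a NN parametrizing the policy function $\pi_\theta$, using gradient based optimization procedures. We know from the policy gradient theorem [sutton2000policy] that it is possible to express the gradient of (1) with respect to the policy parameters as an expectation over trajectories:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right] \tag{2}$$

where the Q-function (or critic) $Q^{\pi}(s, a)$ is the total expected reward obtained starting from state $s$, performing action $a$ and then following policy $\pi$. The Q-function can be estimated using TD learning, exploiting the Bellman equation:

$$Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')}\left[Q^{\pi}(s', a')\right] \tag{3}$$

Here $r(s, a)$ is the expected reward after taking action $a$ in state $s$. To simplify the optimization procedure and speed up policy updates, the expectation over trajectories in (1) is typically replaced with an expectation over transitions, hence maximizing the marginal expected reward. The resulting policy gradient is as follows:

$$\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{(s, a)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right] \tag{4}$$

In on-policy methods such as PPO the actions are sampled from the current policy, while off-policy methods employ a replay buffer where past transitions are stored.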

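The policy gradient theorem stated in (2) is valid for stochastic policies; there is an analogous result for the deterministic case [silver2014deterministic], where the policy gradient can simply be obtained by backpropagation through the Q-function; gradient ascent steps are then taken in order to make the policy greedy with respect to the current estimate of the action values. When NNs are used as function approximators for the Q-function, the Bellman update in (3) itself requires the minimization of a loss function, called the TD error, with respect to the network parameters $\phi$. The resulting algorithm solves the coupled optimization problem:

$$\max_\theta\; \mathbb{E}_{s}\left[Q_\phi(s, \pi_\theta(s))\right], \qquad \min_\phi\; \mathbb{E}_{(s, a, r, s')}\left[\left(Q_\phi(s, a) - y\right)^2\right] \tag{5}$$

With $y = r + \gamma\, Q_\phi(s', \pi_\theta(s'))$. Typically, to make SGD steps more stable, target networks are used both for the policy and the action value, with parameters $\theta'$ and $\phi'$ that are averaged exponentially over time: $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ and $\phi' \leftarrow \tau\phi + (1 - \tau)\phi'$, with $\tau \in (0, 1)$. The TD error is then computed as $\left(Q_\phi(s, a) - r - \gamma\, Q_{\phi'}(s', \pi_{\theta'}(s'))\right)^2$ with gradient propagation only on $\phi$.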
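
As a rough illustration of the two ingredients just described, the following sketch shows an exponentially averaged target update and a one-step bootstrapped TD target on plain Python lists. The function names `soft_update` and `td_target` and all numeric values are hypothetical and chosen only for the example; they are not taken from the authors' code.

```python
def soft_update(target, online, tau):
    """Exponentially average the online parameters into the target parameters."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def td_target(reward, next_q_target, gamma):
    """One-step bootstrapped regression target for the critic."""
    return reward + gamma * next_q_target


# Toy parameter vectors (assumed values, for illustration only).
target_params = [0.0, 1.0]
online_params = [1.0, 3.0]
target_params = soft_update(target_params, online_params, tau=0.005)

# The target critic's value at the next state is treated as a constant,
# so no gradient would flow through it in a real implementation.
y = td_target(reward=1.0, next_q_target=10.0, gamma=0.99)
```

With a small averaging constant the target parameters move only a fraction of the way towards the online ones at each step, which is what keeps the regression target slowly varying.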

IV Proximal gradient methods

SGD based optimization algorithms perform exceptionally well at training DNNs, especially in a Supervised setting where the data distribution does not change during training. In RL, however, this assumption does not hold, since data are collected with different policies at different time instants and the TD error targets evolve over time; all these factors make the optimization procedure difficult. Proximal methods have shown appealing convergence and stability properties in convex optimization and can be an alternative to standard gradient based algorithms. Given a function $f$ and a constant $\lambda > 0$, we define the proximal operator for a point $v$ as:

$$\operatorname{prox}_{\lambda f}(v) = \arg\min_{x} \left\{ f(x) + \frac{1}{2\lambda} \lVert x - v \rVert^2 \right\} \tag{6}$$

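For a convex function $f$ the following theorem holds:

Theorem 1. A vector $x^*$ is a critical point for the function $f$ iff $x^* = \operatorname{prox}_{\lambda f}(x^*)$.

For a detailed proof we refer the reader to [parikh2014proximal]. This implies that, starting from a random point $x_0$, by repeated applications of the proximal operator one can hope to reach the minimum of $f$; for this to be guaranteed, $\operatorname{prox}_{\lambda f}$ would have to be a contraction. While this is not true in general, the map is a firm non-expansion, i.e. for every $x, y$:

$$\lVert \operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y) \rVert^2 \le \left\langle x - y,\; \operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y) \right\rangle \tag{7}$$
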
With .
We define the damped proximal iteration with constant $\alpha \in (0, 1)$ as:

$$x_{k+1} = (1 - \alpha)\, x_k + \alpha\, \operatorname{prox}_{\lambda f}(x_k) \tag{8}$$

This iteration provably converges to a stationary point; hence, in the convex setting, proximal iteration is an effective optimization method with convergence guarantees.
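
To make the damped iteration concrete, here is a minimal sketch on a one-dimensional convex quadratic, for which the proximal operator has a closed form. The toy function f(x) = 0.5 * a * (x - b)^2, the names `lam` (proximal constant) and `alpha` (damping constant), and all numeric values are assumptions made for illustration only.

```python
def prox_quadratic(v, lam, a=2.0, b=3.0):
    """Closed-form proximal operator of f(x) = 0.5 * a * (x - b)**2.

    Setting the derivative of f(x) + (x - v)**2 / (2 * lam) to zero gives
    a * (x - b) + (x - v) / lam = 0, i.e. x = (v + lam * a * b) / (1 + lam * a).
    """
    return (v + lam * a * b) / (1.0 + lam * a)


def damped_proximal_iteration(x0, lam, alpha, steps):
    """Repeat x <- (1 - alpha) * x + alpha * prox(x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = (1.0 - alpha) * x + alpha * prox_quadratic(x, lam)
    return x


x_final = damped_proximal_iteration(x0=-5.0, lam=0.5, alpha=0.5, steps=200)
fixed_point = prox_quadratic(3.0, lam=0.5)  # the minimizer b = 3 maps to itself
```

In this toy case the minimizer of the quadratic is also the fixed point of the proximal operator, and the damped iterates approach it geometrically.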

In the stochastic non-convex setting, the statements above do not hold, but the algorithm still has some desirable properties that make it a valid alternative to SGD. In more detail, with $f_k$ being the loss function associated with the batch sampled at time $k$, the Stochastic Proximal Iteration (SPI), starting from a random point $x_0$, is:

$$x_{k+1} = \operatorname{prox}_{\lambda f_k}(x_k) = \arg\min_{x} \left\{ f_k(x) + \frac{1}{2\lambda} \lVert x - x_k \rVert^2 \right\} \tag{9}$$

SPI can be interpreted as SGD performed on a smoothed loss function derived from the viscosity solution of the Hamilton Jacobi equation [chaudhari2018deep] defined as:


In fact, if exists, it is true that:


Hence, performing SGD updates on results in a damped SPI on the function :


Here plays the role of in (6) and also serves as exponential averaging constant of the damped SPI. In particular, for we recover SPI with . As shown in [chaudhari2018deep] the function has smoother local minima and it’s easier to optimize. For this reason, SPI has better stability and convergence properties than SGD.
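
For reference, the link between gradient steps on a smoothed loss and a damped proximal iteration can be sketched with the standard Moreau envelope. The notation below is assumed for illustration and may differ from the paper's own symbols: $f$ is the loss, $t > 0$ the smoothing parameter and $\eta$ the gradient step size.

```latex
% Assumed notation: f is the loss, t > 0 the smoothing parameter,
% \eta the gradient step size. Valid wherever \nabla f_t exists.
\begin{align}
f_t(x) &= \min_{y} \left\{ f(y) + \tfrac{1}{2t}\lVert y - x \rVert^2 \right\}, \\
\nabla f_t(x) &= \tfrac{1}{t}\bigl(x - \operatorname{prox}_{t f}(x)\bigr), \\
x_{k+1} &= x_k - \eta\, \nabla f_t(x_k)
         = \bigl(1 - \tfrac{\eta}{t}\bigr)\, x_k + \tfrac{\eta}{t}\, \operatorname{prox}_{t f}(x_k).
\end{align}
```

Under these assumptions the ratio $\eta / t$ acts as the damping constant, and choosing $\eta = t$ gives the undamped step $x_{k+1} = \operatorname{prox}_{t f}(x_k)$.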

V Proposed Method

In this section we provide a detailed description of our algorithm and its optimization procedure. The proposed method is based on TD3, which has proven effective at solving continuous control tasks and provides state-of-the-art performance while being simple and easily reproducible. We provide here an interpretation of target networks, which play a fundamental role in the optimization procedure of AC algorithms and are a widespread trick to make training more stable. This interpretation can be easily derived from a few simple changes to (12); given two variables, we rewrite (12) as:


where we have introduced two hyper-parameters that control, respectively, the proximal term strength and the damping constant. This decouples the two terms, as opposed to (12) where a single parameter controls both, giving more freedom to tune the algorithm behavior. One of the two variables in (13) evolves in time analogously to the parameters of the "fast" moving function during training, while the other plays the role of the target parameters, which change slowly over time. The proximal term strength controls how close the two remain, which may help trade off the update speed against how off-policy the collected data are. As in TD3, we employ target functions for the policy and the two action value networks. The pair of Q-functions is used to reduce the overestimation bias and also to provide a more reliable estimate of the action value by bootstrapping; to train the two Q-networks we minimize the following TD error:


with .
We employ the smooth-$L_1$ loss (or Huber loss) instead of the MSE. This choice is justified by the nature of the Bellman equation (3): the expected value over the next state and action is estimated in the TD error with a single transition and thus presents a high variance; the Huber loss puts less weight on large errors than the MSE, placing less trust in the expectation estimate. Moreover, since the targets change during training, the smooth-$L_1$ loss may improve stability by reducing strong changes in the parameters during a single optimization step. The policy network is trained to maximize the average action value of the two target Q-functions; the corresponding loss for a single transition is:


There are two main differences in the loss functions compared to standard TD3:

  1. Target networks are used to compute the policy gradients instead of their "fast" counterparts.

  2. A bootstrapped estimate of the action value leverages both of the available Q-networks to reduce the approximation error.

The first difference implies that our method does not require delayed policy updates, because the target networks change slowly over time. Moreover, the improved quality of the action value gives better gradient estimates for the policy. The resulting method thus performs SPI on a single loss function:


where we introduced a hyper-parameter to control the relative scale of the two loss functions. We believe this expedient is important since the TD error has the same scale as the one-step reward, while the Q-function that is maximized by the policy has a magnitude similar to the cumulative reward. For most MuJoCo environments the difference is almost three orders of magnitude.
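
The loss ingredients described in this section can be sketched for a single transition as follows. This is only an illustration under stated assumptions: the Huber threshold `delta=1.0`, the function names, and the way the policy term is weighted (`policy_weight`, set here to the "Policy loss strength" value reported in Table II) are hypothetical and not confirmed by the source.

```python
def huber(error, delta=1.0):
    """Smooth-L1 (Huber) penalty: quadratic near zero, linear for large errors."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)


def policy_loss(q1_target, q2_target):
    """Negative average of the two target critics evaluated at the policy's action."""
    return -0.5 * (q1_target + q2_target)


def combined_loss(td_errors, q1_target, q2_target, policy_weight=0.01):
    """Single scalar loss: critic Huber terms plus a down-weighted policy term."""
    critic = sum(huber(e) for e in td_errors)
    actor = policy_loss(q1_target, q2_target)
    return critic + policy_weight * actor


small = huber(0.5)                 # quadratic regime
large = huber(4.0)                 # linear regime
large_squared = 0.5 * 4.0 ** 2     # what a squared error would charge instead
total = combined_loss([0.5, 4.0], q1_target=10.0, q2_target=12.0)
```

The comparison between `large` and `large_squared` shows why a single noisy transition moves the parameters less under the Huber loss than under the MSE.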

As in [fujimoto2018addressing] we add clipped noise to the actions in order to smooth the action value functions. In particular, for each action $a$ in the batch we sample noise $\epsilon$ from a normal distribution $\mathcal{N}(0, \sigma^2)$ and then set:

$$a \leftarrow a + \operatorname{clip}(\epsilon, -c, c) \tag{17}$$

Here $c$ is a hyper-parameter and the clipping is performed element-wise on the vector $\epsilon$.
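
A minimal sketch of this smoothing step is given below. The function name `clipped_noise_action` and the explicit random generator are hypothetical; the values 0.2 and 0.5 mirror the "Action noise" and "Noise clip" entries of Table II, which the paper states are relative to the maximum action value.

```python
import random


def clipped_noise_action(action, sigma, clip, rng):
    """Add element-wise Gaussian noise, clipped to [-clip, clip], to an action vector."""
    noisy = []
    for a in action:
        eps = rng.gauss(0.0, sigma)
        eps = max(-clip, min(clip, eps))  # element-wise clipping of the noise
        noisy.append(a + eps)
    return noisy


rng = random.Random(0)  # fixed seed so the example is reproducible
action = [0.3, -0.8, 0.0]
noisy = clipped_noise_action(action, sigma=0.2, clip=0.5, rng=rng)

# With a very large standard deviation almost every sample hits the clip bound.
extreme = clipped_noise_action([0.0] * 100, sigma=100.0, clip=0.5, rng=rng)
```

Whatever the noise scale, each perturbed component stays within the clip bound of the original action.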

The SPI procedure detailed in (13) requires, for each batch, the computation of the proximal operator. This can be done through full gradient descent on the loss function defined by the batch sampled from the replay buffer at the current time step. In more detail, we run several gradient descent steps to minimize the proximal loss defined as:


We use a single hyper-parameter to control the strength of the proximal term for both the policy and the Q-networks. In our implementation, we replace the norm with the MSE so that all the proximal terms are scaled with respect to the number of parameters.
The resulting algorithm, called Proximal Deterministic Policy Gradient (PDPG), is summarized in Algorithm 1.

  Input: , , , batch size , learning rate , exploration noise variance
  Initialize network parameters , , randomly
  , ,
     for  to  do
        Collect transition with exploratory action with
        Store transition in the replay buffer
        Sample minibatch of transitions from the replay buffer
        Add clipped noise to actions in minibatch as in (17)
        for  to  do
           , , =
        end for
     end for
  until convergence
Algorithm 1 PDPG
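
The inner loop of Algorithm 1 can be sketched on a scalar toy problem as follows. This is only an illustration under stated assumptions: the quadratic toy loss, the names `proximal_step`, `damped_update`, `strength` and `damping`, the learning rate, and the choice to carry the fast parameter over from one batch to the next are all hypothetical. The step count and damping value mirror the "Proximal steps" and "Damping coefficient" entries of Table II.

```python
def proximal_step(w_fast, w_target, grad_loss, strength, lr, steps):
    """Approximate the proximal operator with a few gradient descent steps.

    The minimized objective is loss(w) + 0.5 * strength * (w - w_target)**2,
    so its gradient is grad_loss(w) + strength * (w - w_target).
    """
    w = w_fast
    for _ in range(steps):
        g = grad_loss(w) + strength * (w - w_target)
        w = w - lr * g
    return w


def damped_update(w_target, w_fast, damping):
    """Move the slow (target) parameter a small step towards the fast one."""
    return (1.0 - damping) * w_target + damping * w_fast


def grad_toy_loss(w):
    """Gradient of the toy loss (w - 4)**2, which stands in for a batch loss."""
    return 2.0 * (w - 4.0)


w_target = 0.0
w_fast = w_target
for _ in range(3000):  # one iteration per sampled batch
    w_fast = proximal_step(w_fast, w_target, grad_toy_loss,
                           strength=1.0, lr=0.1, steps=5)
    w_target = damped_update(w_target, w_fast, damping=0.005)
```

In this toy setting the fast parameter is pulled towards the loss minimizer but held near the target by the proximal term, while the target drifts slowly in the same direction.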

VI Experimental results

            Ant                 HalfCheetah         Hopper              Humanoid            Walker2d
Thresholds  2000     4000       5000     10000      1000     2000       3000     6000       1800     3600
Ours        371500   743500     101000   490500     92500    133000     390000   1426000    134500   227000
SAC         326000   1135000    122000   624000     203000   299000     288000   2335000    314000   540000
TD3         432000   2002000    147500   1392500    147000   202000     402500   1750000    187500   365000
PPO         977306   822067     /        /          190874   359629     /        /          484557   231834
TABLE I: Average number of time-steps required by each algorithm to reach the reward thresholds, set approximately at one third and two thirds of the maximum reward achieved by the best algorithm.
Fig. 1: Training curves on OpenAI Gym continuous control benchmarks. Our method consistently outperforms competing off-policy approaches (SAC and TD3) and on-policy methods (PPO).
Fig. 2: Ablation study comparing the training curves of our method (blue) and its versions using the MSE in the TD error (yellow) and no bootstrapped value estimate (purple).

We provide in this section a comprehensive performance analysis of the proposed method to assess its capabilities in terms of sample efficiency and asymptotic performance. We focus in particular on the former, which is of fundamental interest in real applications where querying the environment for new samples is extremely expensive. In such a scenario, being able to trade off sample requirements against policy optimality is fundamental; in fact, it may be preferable to reach reasonable performance with few samples rather than to have high asymptotic capabilities.

We train our agent on the MuJoCo [todorov2012mujoco] OpenAI Gym [brockman2016openai] continuous control tasks, a challenging benchmark often used in the literature to test RL algorithms dealing with continuous action and state spaces. We compare our method against state-of-the-art on-policy (PPO) and off-policy (TD3, SAC) algorithms. PPO [schulman2017proximal] is an on-policy algorithm that exploits the policy gradient theorem and generalized advantage estimation [schulman2015high] to learn an optimal policy. During training, a trust region constraint is imposed on the policy, which is updated keeping it close to the one at the previous iteration; this results in more stable training and better performance. TD3 [fujimoto2018addressing] is the algorithm on which our method is based; it learns a deterministic policy exploiting the deterministic policy gradient theorem and the additional tricks described in Section V to improve upon DDPG [lillicrap2015continuous], which we do not include in our comparison since it is very similar to TD3 but with lower performance. SAC [haarnoja2018soft] is a maximum entropy RL algorithm that learns a stochastic policy to maximize the total expected reward plus an entropy term, in order to achieve high performance while being maximally exploratory. TD3 and SAC are the best-performing algorithms and show similar results.

Figure 1 shows a comparison of the training curves of the algorithms listed above. We run our algorithm with ten different seeds in order to assess its stability under different conditions; the reward is averaged over ten episodes, keeping the default maximum episode length defined by the Gym framework. In each plot, the solid lines represent the average value across seeds, while the shaded area indicates a single standard deviation from the average. The curves have been smoothed for visual clarity with an exponential average.
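
The post-processing described above can be sketched as follows. The smoothing weight, the use of the population standard deviation, and the function names are assumptions made for illustration; the paper does not state which values or variants were used.

```python
import statistics


def exponential_smooth(values, weight=0.9):
    """Exponential moving average of a curve, initialised at its first value."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed


def mean_and_std(curves):
    """Per-step mean and standard deviation across seeds (one curve per seed)."""
    per_step = list(zip(*curves))
    means = [statistics.fmean(s) for s in per_step]
    stds = [statistics.pstdev(s) for s in per_step]
    return means, stds


# Two toy seeds with three evaluation points each (assumed values).
curves = [[0.0, 10.0, 20.0], [0.0, 20.0, 40.0]]
means, stds = mean_and_std(curves)
smoothed = exponential_smooth(curves[0], weight=0.5)
```

The solid line of a plot would correspond to `means` and the shaded band to `means` plus or minus `stds`.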

For TD3 we report the curves taken from the authors' GitHub repository, except for the environments where the authors provided curves for only 1 million steps (Ant, HalfCheetah) or did not provide them at all (Humanoid); for such cases we ran the experiments with the code available in the same repository. For SAC we employed the curves available at the project website and used the OpenAI Baselines code to run experiments with PPO.
Our method outperforms both on-policy and off-policy methods on most continuous control tasks. The Hopper environment exhibits quite noisy behavior, but the average performance is comparable with that of the other algorithms. Moreover, our method needs far fewer samples to reach an average return equal to the maximum performance obtained by the other methods. See for example the HalfCheetah environment, where PDPG is able to match the performance of TD3 with half the number of samples.

To better characterize sample efficiency we report in Table I the average number of time-steps required by each algorithm to exceed a set of reward thresholds placed at approximately one third and two thirds of the maximum reward achieved by the best algorithm. The superior sample efficiency of our method is evident here: for most of the environments PDPG reaches the specified thresholds with far fewer samples than the others. In Humanoid and Ant it is less efficient than SAC for the lower threshold but more efficient for the higher one; moreover, it has consistently better asymptotic performance. We acknowledge that this sample efficiency comes at a cost: the computation of the proximal operator requires multiple gradient steps for each batch, slowing down the training. In our experiments we took five gradient steps for each batch, hence the amount of computation required scaled accordingly. We believe that this is a fair price to pay since it significantly reduces the number of queries to the environment needed to train the agent properly.
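
The threshold metric reported in Table I can be computed from a training curve along these lines. The function name `steps_to_threshold` and the toy curve are hypothetical; the two HalfCheetah numbers at the end are copied from Table I.

```python
def steps_to_threshold(steps, returns, threshold):
    """Return the first time-step at which the return exceeds the threshold.

    Returns None when the curve never exceeds it, which would correspond to
    a missing entry in a results table.
    """
    for step, ret in zip(steps, returns):
        if ret > threshold:
            return step
    return None


# Toy curve (assumed values, for illustration only).
steps = [0, 1000, 2000, 3000]
returns = [100.0, 900.0, 2100.0, 4200.0]

low = steps_to_threshold(steps, returns, 2000)
high = steps_to_threshold(steps, returns, 4000)
never = steps_to_threshold(steps, returns, 5000)

# Ratio of time-steps for the higher HalfCheetah threshold: TD3 over PDPG.
halfcheetah_ratio = 1392500 / 490500
```

Averaging this quantity over seeds gives one entry of the table per algorithm, environment and threshold.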

VII Ablation study

We provide in this section an ablation study to show the effect on the proposed method of the Huber loss and the bootstrapped value estimation. We run a version of the method that employs the MSE in the TD error and also a version that does not use the bootstrapped estimation but still keeps two Q-functions to avoid the overestimation bias.

In Figure 2 the training curves are reported for the two alternative versions compared to the standard approach. The employment of bootstrapped value estimates, while not drastically changing performance, seems to provide improved stability; this can be seen especially in the Ant and Walker environments, where the Single Q algorithm has much higher variance. The Huber loss has a drastic effect on the HalfCheetah environment, providing a considerable performance improvement. On the remaining environments there is no substantial difference between the two loss functions.

In general, the standard method shows better performance and stability; hence all the components employed are empirically justified by this study.

VIII Conclusions

In this paper we proposed Proximal Deterministic Policy Gradient, an off-policy RL method for model-free continuous control tasks that exploits proximal gradient methods and bootstrapping to better solve the TD error optimization problem. Proximal algorithms are appealing in an RL setting since they show improved convergence and stability properties compared to standard SGD. Moreover, we showed that proximal methods provide a natural interpretation of the target networks, a trick commonly employed in RL to stabilize training.
The resulting algorithm compares favourably with state-of-the-art off-policy and on-policy methods, showing improved sample efficiency and asymptotic performance. The significant increase in sample efficiency makes our algorithm appealing for deployment in real environments; this possibility will be explored in future work.


VIII-A Training details

We employed Feedforward Neural Networks with two hidden layers of 256 neurons each, with ReLU [glorot2011deep] activations, for both policy and critic.

As in [fujimoto2018addressing] we perform a burn-in at the beginning of training, during which we sample random actions from the environment. The hyper-parameters employed are listed in Tables II and III. The exploration noise, action noise and noise clip are relative to the maximum value of the action, which varies depending on the environment.

Name                   Value
Learning rate          3e-4
Damping coefficient    0.005
Exploration noise      0.1
Noise clip             0.5
Action noise           0.2
Batch size             256
Proximal steps         5
Policy loss strength   0.01
TABLE II: Hyper-parameter values used in all the environments.
Environment        Name                  Value
Hopper, Walker     Burn-in               1000
All others         Burn-in               10000
HalfCheetah, Ant   Policy Weight Decay   0.0
All others         Policy Weight Decay   1e-5
Humanoid           Proximal Strength     10.0
Hopper, Walker     Proximal Strength     1.0
HalfCheetah, Ant   Proximal Strength     0.1
TABLE III: Additional hyper-parameter values.
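
For convenience, the values of Tables II and III can be collected into a single configuration lookup, as in the sketch below. The dictionary keys and the helper `config_for` are hypothetical names introduced for this example; only the numeric values come from the tables.

```python
# Values shared by all environments (Table II).
SHARED = {
    "learning_rate": 3e-4,
    "damping_coefficient": 0.005,
    "exploration_noise": 0.1,
    "noise_clip": 0.5,
    "action_noise": 0.2,
    "batch_size": 256,
    "proximal_steps": 5,
    "policy_loss_strength": 0.01,
}

# Environment-specific values (Table III), with "All others" rows expanded.
PER_ENV = {
    "Hopper": {"burn_in": 1000, "policy_weight_decay": 1e-5, "proximal_strength": 1.0},
    "Walker": {"burn_in": 1000, "policy_weight_decay": 1e-5, "proximal_strength": 1.0},
    "HalfCheetah": {"burn_in": 10000, "policy_weight_decay": 0.0, "proximal_strength": 0.1},
    "Ant": {"burn_in": 10000, "policy_weight_decay": 0.0, "proximal_strength": 0.1},
    "Humanoid": {"burn_in": 10000, "policy_weight_decay": 1e-5, "proximal_strength": 10.0},
}


def config_for(env_name):
    """Merge the shared values with the overrides of one environment."""
    return {**SHARED, **PER_ENV[env_name]}


humanoid = config_for("Humanoid")
hopper = config_for("Hopper")
```

Keeping the shared and per-environment values separate makes it easy to see that only the burn-in length, the policy weight decay and the proximal strength were tuned per task.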