I Introduction
Actor-Critic (AC) [Konda2000] algorithms have become the de facto standard in continuous control Reinforcement Learning (RL) tasks, allowing the use of powerful function approximators such as Deep Neural Networks (DNNs) to directly learn the control policy. While very effective in simulated environments, their poor sample efficiency has limited deployment on real systems, where querying the environment for new samples is expensive.
High variance of the gradient estimate [sutton2018reinforcement] is at the root of this inefficiency. Algorithms like TRPO [schulman2015trust] and PPO [schulman2017proximal] operate in a sample regime that fails to provide good approximations of the true gradient [ilyas2018deep], with considerable impact on performance. Moreover, their on-policy nature requires new data to be collected at each optimization step, wasting all past transitions. Deterministic Policy Gradient (DPG) algorithms [silver2014deterministic] improve upon these methods by employing deterministic policies and off-policy updates: the former limit the source of randomness to the environment alone, with a consequent reduction in the number of samples required for gradient estimation, while the latter allow for data reuse by storing past transitions in a replay buffer and employing them throughout training.
Another source of error in policy updates is a poor action value function estimate. Here multiple factors come into play. On the one hand, overestimation bias [thrun1993issues] causes Q-learning algorithms to consistently overestimate the action value, with potentially divergent errors. To mitigate this problem, Double Q-learning [van2016deep] employs two independent Q-functions trained with a mixed update; however, [fujimoto2018addressing] showed that this approach is not suitable in an AC setting and proposed a similar solution where the Temporal Difference (TD) update is performed with the minimum of the two action value functions.
On the other hand, the bootstrapping nature of TD updates results in a regression problem whose targets change over time, making the optimization procedure difficult and even unstable when combined with off-policy updates and function approximation (the infamous deadly triad [sutton2018reinforcement]). Most AC algorithms employ target networks that are slowly updated during training to provide a stable regression target. This approach was initially introduced in Double Deep Q-Networks (DDQN) [van2016deep] and has since been adopted by Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous], Twin Delayed Deep Deterministic policy gradient (TD3) [fujimoto2018addressing] and Soft Actor-Critic (SAC) [haarnoja2018soft]. Target networks play a fundamental role in the optimization procedure, and algorithms are typically very sensitive to the speed at which these networks are updated.
In a Deep RL setting, TD learning is performed by minimizing a surrogate loss function (typically the Mean Squared Error, MSE) with Stochastic Gradient Descent (SGD) based algorithms; the most common choice is Adam [kingma2014adam], which has proven effective for training DNNs. In this work, we propose an alternative optimization procedure to tackle the sample efficiency problem that provides a principled interpretation of target networks and minimizes a single loss function combining both policy and value updates. Our procedure employs a Time Damped Stochastic Proximal Gradient (SPG) [parikh2014proximal] [ryu2014stochastic] iteration, widespread in convex optimization, combined with bootstrapped action value estimates, and improves on state-of-the-art algorithms on continuous control tasks. In more detail, we endow TD3 with a proximal gradient optimization procedure, and we exploit the two Q-networks that the original algorithm already uses to limit overestimation bias to also provide a more accurate action value estimate via bootstrapping, which allows better policy updates.
II Related work
The first successful application of DNNs in RL dates back to [mnih2013playing], where Deep Q-learning was introduced to play Atari games at human-level performance. Target networks were first introduced in [van2016deep], which proposed an improved version of Deep Q-learning for the same control task. Since then, this algorithm has been the reference for Q-function estimation with DNNs.
In continuous control tasks, Q-learning methods alone are not enough to learn a policy: finding the maximizing action would require solving a maximization problem every time the agent needs to act on the environment, at prohibitive computational cost. Here AC algorithms come into play, where a parametrized policy learns to maximize the total expected reward. These methods typically follow the policy iteration [sutton2018reinforcement] paradigm, where at each time step the Q-function is estimated and then the policy is made greedy with respect to it. Hence Deep Q-learning is still a fundamental part of methods such as DDPG [lillicrap2015continuous], A3C [mnih2016asynchronous], TD3 [fujimoto2018addressing] and SAC [haarnoja2018soft], which are all related to our method. Alternative approaches that follow the AC paradigm but employ sample estimates of the Q-function can be found in [schulman2015high], TRPO [schulman2015trust], PPO [schulman2017proximal] and P3O [fakoor2019p3o].
There have been attempts to improve AC algorithms from an optimization point of view by proposing alternatives to the TD error with more complex loss functions. SBEED [dai2017sbeed] provides a primal-dual interpretation of the Bellman equation that results in a min-max game to optimize the convex dual of the quadratic loss function. [feng2019kernel] proposes a kernel loss alternative to the MSE that enables improved training of the Q-function.
Proximal methods, while widespread in convex optimization, have been used to train DNNs only in [chaudhari2018deep], in a Supervised Learning setting.
III Background
RL deals with the problem of learning a maximally rewarding behavior for an agent interacting with its environment. Formally, this can be cast in the framework of optimal policy estimation in a Markov Decision Process (MDP) [bertsekas2019reinforcement]. An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the set of states that the environment can assume, $\mathcal{A}$ the set of actions that can be performed on the environment, $P(s' \mid s, a)$ the probability of transitioning to state $s'$ after taking action $a$ in state $s$, $r$ the reward function, which can be either deterministic or stochastic, and $\gamma \in [0, 1)$ the discount factor used to weight future rewards and guarantee finite total rewards even for infinite time horizon problems. The goal of an RL algorithm is to find a policy $\pi$ that maximizes the total expected discounted reward:
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \qquad (1)$$
where the expectation is taken over all sources of randomness in the MDP.
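As a concrete illustration of the objective in (1), the following minimal sketch computes the discounted return of a single finite reward sequence. The function name `discounted_return` and the truncation to a finite horizon are assumptions made for the example; the expectation in (1) would be approximated by averaging this quantity over many sampled trajectories.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite list of rewards."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
print(example_return)
```

With a constant unit reward the return stays below $1 / (1 - \gamma)$ for any horizon, which is the sense in which the discount factor guarantees finite total rewards.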
Policy Gradient methods tackle the problem by directly maximizing (1) with respect to the parameters of a NN parametrizing the policy function, using gradient based optimization procedures. We know from the policy gradient theorem [sutton2000policy] that it is possible to express the gradient of (1) with respect to the policy parameters as an expectation over trajectories:
(2) 
where the Q-function (or critic) $Q^{\pi}(s, a)$ is the total expected reward obtained starting from state $s$, performing action $a$ and then following policy $\pi$. The Q-function can be estimated using TD learning, exploiting the Bellman equation:
$$Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s', a'}\left[Q^{\pi}(s', a')\right] \qquad (3)$$
Here $r(s, a)$ is the expected reward after taking action $a$ in state $s$, $s'$ is the next state and $a'$ the action selected by $\pi$ in $s'$. To simplify the optimization procedure and speed up policy updates, the expectation over trajectories in (1) is typically replaced with an expectation over transitions, hence maximizing the marginal expected reward. The resulting policy gradient is as follows:
(4) 
In on-policy methods such as PPO the actions are sampled from the current policy, while off-policy methods employ a replay buffer where past transitions are stored.
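The replay buffer mentioned above can be sketched in a few lines. This is an illustrative data structure, not the implementation used in the experiments: the class name `ReplayBuffer`, the tuple layout and uniform sampling are all assumptions of the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A bounded deque silently discards the oldest transition once full.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

Because sampling is decoupled from the policy that generated the data, every stored transition can be reused for many updates, which is the source of the data efficiency of off-policy methods.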
The policy gradient theorem stated in (2) is valid for stochastic policies; there is an analogous result for the deterministic case [silver2014deterministic], where the policy gradient can simply be obtained by backpropagation through the Q-function; gradient ascent steps are then taken in order to make the policy greedy with respect to the current estimate of the action values. When NNs are used as function approximators for the Q-functions, the Bellman update in (3) itself requires the minimization of a loss function, called TD error, with respect to the network parameters. The resulting algorithm solves the coupled optimization problem:
(5)
With . Typically, to make SGD steps more stable, target networks are used both for the policy and the action value with parameters and that are averaged exponentially over time: , with . The TD error is then computed as with gradient propagation only on .
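The two ingredients described above, exponentially averaged target parameters and a TD error whose target is held fixed, can be sketched as follows. Parameters are represented as flat lists of floats, and the function names, the symbol `tau` for the averaging constant and the handling of terminal transitions through a `done` flag are assumptions of this example rather than details given in the text.

```python
def soft_update(target_params, online_params, tau):
    """Exponential averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_params, online_params)]


def td_target(reward, next_q_target, gamma, done):
    """One-step bootstrapped target built from the target networks.

    next_q_target is the target Q-value at the next state and the target
    policy's action; no bootstrapping is done past a terminal transition.
    """
    return reward + (0.0 if done else gamma * next_q_target)


def td_error(q_value, target):
    """Squared TD error; the target is treated as a constant (no gradient)."""
    return (q_value - target) ** 2
```

A small `tau` makes the target parameters lag behind the trained ones, which is what keeps the regression target in `td_error` nearly stationary between updates.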
IV Proximal gradient methods
SGD-based optimization algorithms perform exceptionally well at training DNNs, especially in a supervised setting where the data distribution does not change during training. In RL, however, this assumption does not hold, since data are collected with different policies at different time instants and the TD error targets evolve over time; all these factors make the optimization procedure difficult. Proximal methods have shown appealing convergence and stability properties in convex optimization and can be an alternative to standard gradient-based algorithms. Given a function $f$ and a constant $\lambda > 0$, we define the proximal operator for a point $v$ as:
$$\mathrm{prox}_{\lambda f}(v) = \arg\min_{x} \left( f(x) + \frac{1}{2\lambda} \lVert x - v \rVert_2^2 \right) \qquad (6)$$
For a convex function $f$ the following theorem holds:
Theorem 1. A point $x^{*}$ minimizes $f$ if and only if it is a fixed point of the proximal operator, i.e. $x^{*} = \mathrm{prox}_{\lambda f}(x^{*})$.
For a detailed proof we refer the reader to [parikh2014proximal]. This implies that, starting from a random point $x_0$, by repeated applications of the proximal operator one can hope to reach the minimum of $f$; for this to be guaranteed by a fixed-point argument, $\mathrm{prox}_{\lambda f}$ would have to be a contraction. While this is not true in general, the map is a firm nonexpansion, i.e. for every $x, y$:
(7) 
With .
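The fixed-point and nonexpansion properties can be checked numerically on a simple case. The sketch below uses the scalar function $f(x) = |x|$, whose proximal operator is the well-known soft-thresholding map, under the convention $\mathrm{prox}_{\lambda f}(v) = \arg\min_x f(x) + \frac{1}{2\lambda}(x - v)^2$. The choice of $f$, the function names and the use of the standard firm nonexpansiveness inequality $(p(x) - p(y))^2 \le (x - y)(p(x) - p(y))$ are assumptions of the example.

```python
def prox_abs(v, lam):
    """Proximal operator of f(x) = |x| with constant lam (soft-thresholding)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def is_firmly_nonexpansive(x, y, lam, tol=1e-12):
    """Check (p(x) - p(y))**2 <= (x - y) * (p(x) - p(y)) up to rounding."""
    diff = prox_abs(x, lam) - prox_abs(y, lam)
    return diff * diff <= (x - y) * diff + tol


# The minimizer x = 0 of |x| is a fixed point; other points are moved toward it.
print(prox_abs(0.0, 0.5), prox_abs(2.0, 0.5))
```

For two points that both lie above `lam` the operator shifts them by the same amount, so their distance is preserved exactly: the map is nonexpansive but not a contraction, which is why damping is introduced next.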
We define the damped proximal iteration with constant $\alpha$ as:
$$x_{k+1} = (1 - \alpha)\, x_k + \alpha\, \mathrm{prox}_{\lambda f}(x_k) \qquad (8)$$
This iteration provably converges to a stationary point; hence, in the convex setting, proximal iteration is an effective optimization method with convergence guarantees.
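A minimal sketch of the damped proximal iteration on a convex problem follows. It assumes the scalar quadratic $f(x) = \frac{a}{2}(x - b)^2$, for which the proximal operator (with the $\frac{1}{2\lambda}$ convention) has the closed form $(\lambda a b + v) / (\lambda a + 1)$; the function names and the test problem are hypothetical choices for illustration.

```python
def prox_quadratic(v, lam, a, b):
    """Closed-form prox of f(x) = 0.5 * a * (x - b)**2 with a > 0.

    It minimizes f(x) + (x - v)**2 / (2 * lam) over x.
    """
    return (lam * a * b + v) / (lam * a + 1.0)


def damped_proximal_iteration(x0, lam, alpha, a, b, steps):
    """Iterate x <- (1 - alpha) * x + alpha * prox(x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = (1.0 - alpha) * x + alpha * prox_quadratic(x, lam, a, b)
    return x


# Starting far from the minimizer b = 3, the iterates approach it.
x_final = damped_proximal_iteration(x0=10.0, lam=1.0, alpha=0.5, a=2.0, b=3.0, steps=100)
print(x_final)
```

On this problem each step multiplies the distance to the minimizer by $(1 - \alpha) + \alpha / (\lambda a + 1) < 1$, so smaller $\alpha$ gives slower but smoother progress.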
In the stochastic nonconvex setting, the statements above do not hold, but the algorithm still has some desirable properties that make it a valid alternative to SGD. In more detail, letting $f_k$ be the loss function associated with the batch sampled at time $k$, the Stochastic Proximal Iteration (SPI), starting from a random point $x_0$, is:
$$x_{k+1} = \mathrm{prox}_{\lambda f_k}(x_k) \qquad (9)$$
SPI can be interpreted as SGD performed on a smoothed loss function derived from the viscosity solution of the Hamilton Jacobi equation [chaudhari2018deep] defined as:
(10) 
In fact, if exists, it is true that:
(11) 
Hence, performing SGD updates on results in a damped SPI on the function :
(12) 
Here plays the role of in (6) and also serves as exponential averaging constant of the damped SPI. In particular, for we recover SPI with . As shown in [chaudhari2018deep] the function has smoother local minima and it’s easier to optimize. For this reason, SPI has better stability and convergence properties than SGD.
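The link between gradient steps on a smoothed loss and damped proximal steps can be verified numerically. The sketch below assumes that the smoothed function takes the standard Moreau envelope form $u(x) = \min_y f(y) + \frac{1}{2\lambda}(y - x)^2$, whose gradient is $(x - \mathrm{prox}_{\lambda f}(x)) / \lambda$, and it uses a scalar quadratic $f$ so that everything is available in closed form. The notation, function names and test problem are assumptions of the example and may differ from the symbols used in (10) to (12).

```python
def prox_quadratic(v, lam, a, b):
    """Prox of f(x) = 0.5 * a * (x - b)**2 with constant lam."""
    return (lam * a * b + v) / (lam * a + 1.0)


def envelope(x, lam, a, b):
    """Moreau envelope of f at x, evaluated at its minimizer y = prox(x)."""
    y = prox_quadratic(x, lam, a, b)
    return 0.5 * a * (y - b) ** 2 + (y - x) ** 2 / (2.0 * lam)


def envelope_gradient(x, lam, a, b):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_quadratic(x, lam, a, b)) / lam


def gradient_step_on_envelope(x, lr, lam, a, b):
    """One gradient descent step on the smoothed function."""
    return x - lr * envelope_gradient(x, lam, a, b)


def damped_prox_step(x, alpha, lam, a, b):
    """One damped proximal step on the original function."""
    return (1.0 - alpha) * x + alpha * prox_quadratic(x, lam, a, b)
```

With learning rate `lr`, the gradient step coincides with a damped proximal step with `alpha = lr / lam`; choosing `lr = lam` gives the undamped proximal iteration.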
V Proposed Method
In this section we provide a detailed description of our algorithm and its optimization procedure. The proposed method is based on TD3, which has proven effective at solving continuous control tasks and provides state-of-the-art performance while being simple and easily reproducible. We provide here an interpretation of target networks, which play a fundamental role in the optimization procedure of AC algorithms and are a widespread trick to make training more stable. This interpretation can be easily derived from a few simple changes to (12); introducing two sets of variables, we rewrite (12) as:
(13) 
where we have introduced two hyperparameters that control, respectively, the proximal term strength and the damping constant. This decouples the two terms, as opposed to (12) where a single parameter controls both, giving more freedom to tune the algorithm's behavior. The time evolution of one of the two variables in (13) is analogous to the evolution during training of the parameters of the "fast" moving function, while the other plays the role of the target parameters, which change slowly over time. The proximal strength controls how close the two remain, which may help trade off the update speed against how off-policy the collected data are. As in TD3, we employ target functions for the policy and the two action value networks. The pair of Q-functions is used to reduce the overestimation bias and also to provide a more reliable estimate of the action value by bootstrapping; to train the two Q-networks we minimize the following TD error:
(14) 
with .
We employ the smooth $\ell_1$ (or Huber) loss instead of the MSE. This choice is justified by the nature of the Bellman equation (3): the expected value over the next state and action is estimated in the TD error with a single transition and thus has high variance; the Huber loss puts less weight on large errors than the MSE, placing less trust in the expectation estimate. Moreover, since the targets change during training, the smooth loss may improve stability by reducing large changes in the parameters during a single optimization step.
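The difference between the two losses is easiest to see on a single TD error. The sketch below uses the standard definition of the Huber loss with threshold `delta`; the value `delta = 1.0` and the function names are assumptions of the example, since the text does not state the threshold used.

```python
def half_squared_error(error):
    """Quadratic loss 0.5 * error**2, whose gradient grows linearly with the error."""
    return 0.5 * error ** 2


def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)


def huber_gradient(error, delta=1.0):
    """Derivative of the Huber loss: the error clipped to [-delta, delta]."""
    return max(-delta, min(delta, error))


for e in (0.5, 3.0, 10.0):
    print(e, half_squared_error(e), huber_loss(e), huber_gradient(e))
```

Since the gradient magnitude never exceeds `delta`, a single transition with a very noisy target cannot produce an arbitrarily large parameter change, which matches the stability argument above.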
The policy network is trained to maximize the average action value of the two target Q-functions; the corresponding loss for a single transition is:
(15) 
There are two main differences in the loss functions compared to standard TD3:

Target networks are used to compute the policy gradients instead of their "fast" counterparts.

A bootstrapped estimate of the action value leverages both available Q-networks to reduce the approximation error.
The first difference implies that our method does not require delayed policy updates, because the target networks change slowly over time. Moreover, the improved quality of the action value gives better gradient estimates for the policy. The resulting method thus performs SPI on a single loss function:
(16) 
where we introduced a hyperparameter to control the relative scale of the two loss functions. We believe this expedient is important since the TD error has the same scale as the one-step reward, while the Q-function that is maximized by the policy has a magnitude similar to the cumulative reward. For most MuJoCo environments the difference is almost three orders of magnitude.
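A scalar sketch of how the two terms might be combined is given below. It assumes that the scale hyperparameter multiplies the policy term and that the policy loss is the negative average of the two target Q-values at the policy's action; both the placement of the weight and the function names are assumptions of the example, not a transcription of (15) and (16).

```python
def policy_loss(q1_target_value, q2_target_value):
    """Negative average of the two target Q-values (minimizing it maximizes Q)."""
    return -0.5 * (q1_target_value + q2_target_value)


def combined_loss(td_loss, q1_target_value, q2_target_value, policy_weight):
    """Single objective: TD loss plus a down-weighted policy loss."""
    return td_loss + policy_weight * policy_loss(q1_target_value, q2_target_value)


# Q-values of a few hundred next to a TD loss of order one: a small weight
# brings the two terms to a comparable scale.
print(combined_loss(td_loss=2.0, q1_target_value=100.0, q2_target_value=300.0, policy_weight=0.01))
```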
As in [fujimoto2018addressing] we add clipped noise to the actions in order to smooth the action value functions. In particular, for each action $a$ in the batch we sample noise $\epsilon$ from a normal distribution $\mathcal{N}(0, \sigma^2 I)$ and then set:
$$a \leftarrow a + \mathrm{clip}(\epsilon, -c, c) \qquad (17)$$
Here $c$ is a hyperparameter and the clipping is performed elementwise on the vector $\epsilon$.
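The clipped noise can be sketched as follows for an action stored as a list of floats. The names `sigma` (noise standard deviation) and `c` (clip bound) and the helper functions are assumptions of the example; the sketch also does not clip the resulting action to the environment's action bounds, since the text does not describe that step.

```python
import random


def clip(value, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, value))


def add_clipped_noise(action, sigma, c, rng=random):
    """Add elementwise Gaussian noise, clipped to [-c, c], to an action vector."""
    noisy = []
    for component in action:
        eps = clip(rng.gauss(0.0, sigma), -c, c)
        noisy.append(component + eps)
    return noisy


rng = random.Random(0)
print(add_clipped_noise([0.0, 0.5, -0.5], sigma=0.2, c=0.5, rng=rng))
```

Clipping bounds the perturbation, so the smoothed action always stays within `c` of the original one in every component.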
The SPI procedure detailed in (13) requires the computation of the proximal operator for each batch. This can be done through full gradient descent on the loss function defined by the batch sampled from the replay buffer at the current time step. In more detail, we run multiple gradient descent steps to minimize the proximal loss defined as:
(18) 
We use a single hyperparameter to control the strength of the proximal term for both the policy and the Q-networks. In our implementation, we replace the norm with the MSE so that all the proximal terms are scaled with respect to the number of parameters.
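The inner loop described above can be sketched on a toy problem. The code assumes a proximal loss of the form $f_{\text{batch}}(\theta) + \frac{\lambda}{2}\,\mathrm{mean}\big((\theta - \theta_{\text{anchor}})^2\big)$, with the slowly moving (target) parameters as anchor, plain gradient descent as inner optimizer and a damped update of the anchor afterwards. The factor $\frac{1}{2}$, the function names, the starting point of the inner loop and the quadratic batch loss are all assumptions made for illustration.

```python
def proximal_step(params, anchor, batch_grad, lam, lr, steps):
    """Approximate the proximal operator with a few gradient descent steps.

    Minimizes f_batch(theta) + (lam / 2) * mean((theta - anchor)**2), where
    batch_grad(theta) returns the gradient of f_batch as a list.
    """
    theta = list(params)
    n = len(theta)
    for _ in range(steps):
        grad_f = batch_grad(theta)
        # Gradient of the MSE proximal term is lam * (theta - anchor) / n.
        theta = [
            t - lr * (g + lam * (t - a) / n)
            for t, g, a in zip(theta, grad_f, anchor)
        ]
    return theta


def damped_update(anchor, theta, alpha):
    """Move the anchor a fraction alpha of the way toward the new parameters."""
    return [(1.0 - alpha) * a + alpha * t for a, t in zip(anchor, theta)]


def quadratic_batch_grad(theta, minimizer=(3.0, -3.0)):
    """Gradient of the toy batch loss 0.5 * sum((theta - minimizer)**2)."""
    return [t - m for t, m in zip(theta, minimizer)]


anchor = [0.0, 0.0]
fast = proximal_step(anchor, anchor, quadratic_batch_grad, lam=1.0, lr=0.1, steps=5)
anchor = damped_update(anchor, fast, alpha=0.005)
print(fast, anchor)
```

With only a few inner steps the result is an inexact proximal point that lies between the anchor and the batch minimizer; a larger `lam` keeps it closer to the anchor.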
The resulting algorithm, called Proximal Deterministic Policy Gradient (PDPG), is summarized in Algorithm 1.
VI Experimental results
Table I: Average number of timesteps required by each algorithm to exceed each reward threshold.
Environment  Ant  Ant  HalfCheetah  HalfCheetah  Hopper  Hopper  Humanoid  Humanoid  Walker2d  Walker2d
Threshold  2000  4000  5000  10000  1000  2000  3000  6000  1800  3600
Ours  371500  743500  101000  490500  92500  133000  390000  1426000  134500  227000
SAC  326000  1135000  122000  624000  203000  299000  288000  2335000  314000  540000
TD3  432000  2002000  147500  1392500  147000  202000  402500  1750000  187500  365000
PPO  977306  822067  /  /  190874  359629  /  /  484557  231834
In this section we provide a comprehensive performance analysis of the proposed method to assess its sample efficiency and asymptotic performance, with particular focus on the former, which is of fundamental interest in real applications where querying the environment for new samples is extremely expensive. In such a scenario, being able to trade off sample requirements against policy optimality is fundamental; in fact, it may be preferable to reach reasonable performance with few samples rather than high asymptotic performance.
We train our agent on MuJoCo [todorov2012mujoco] OpenAI Gym [brockman2016openai] continuous control tasks, a challenging benchmark often used in the literature to test RL algorithms dealing with continuous action and state spaces. We compare our method against state-of-the-art on-policy (PPO) and off-policy (TD3, SAC) algorithms. PPO [schulman2017proximal] is an on-policy algorithm that exploits the policy gradient theorem and generalized advantage estimation [schulman2015high] to learn an optimal policy. During training, a trust region constraint keeps the updated policy close to the one at the previous iteration; this results in more stable training and better performance. TD3 [fujimoto2018addressing] is the algorithm on which our method is based; it learns a deterministic policy by exploiting the deterministic policy gradient theorem and the additional tricks described in Section V to improve upon DDPG [lillicrap2015continuous], which we do not include in our comparison since it is very similar to TD3 but with lower performance. SAC [haarnoja2018soft] is a maximum entropy RL algorithm that learns a stochastic policy to maximize the total expected reward plus an entropy term, in order to achieve high performance while exploring as much as possible. TD3 and SAC are the best-performing of these algorithms and show similar results.
Figure 1 shows a comparison of the training curves of the algorithms listed above. We run our algorithm with ten different seeds in order to assess its stability under different conditions; the reward is averaged over ten episodes, keeping the default maximum episode length defined by the Gym framework. In each plot, the solid lines represent the average value across seeds, while the shaded area indicates one standard deviation from the average. The curves have been smoothed for visual clarity with an exponential average.
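The aggregation used for such plots can be sketched as follows. The code computes the per-step mean and standard deviation across seeds and applies exponential smoothing to a curve; the function names, the use of the population standard deviation and the smoothing weight are assumptions of the example, since the text does not specify them.

```python
from statistics import mean, pstdev


def mean_and_std(curves):
    """Per-step mean and standard deviation over seeds.

    curves is a list of equally long return sequences, one per seed.
    """
    per_step = list(zip(*curves))
    return [mean(s) for s in per_step], [pstdev(s) for s in per_step]


def exponential_smoothing(values, weight):
    """Exponential moving average; a weight closer to 1 smooths more."""
    smoothed = []
    last = values[0]
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


seed_curves = [[0.0, 10.0, 30.0], [0.0, 20.0, 50.0]]
avg, std = mean_and_std(seed_curves)
print(exponential_smoothing(avg, weight=0.5), std)
```

The solid line of a plot would correspond to the smoothed `avg` and the shaded band to `avg` plus or minus `std`.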
For TD3 we report the curves taken from the authors' GitHub repository (https://github.com/sfujim/TD3), except for the environments where the authors provided curves for only 1 million steps (Ant, HalfCheetah) or did not provide them at all (Humanoid); for these cases we ran the experiments with the code available in the same repository. For SAC we used the curves available at the project website (https://sites.google.com/view/softactorcritic), and we used the OpenAI Baselines code (https://github.com/openai/baselines) to run the experiments with PPO.
Our method consistently outperforms both on-policy and off-policy methods on most continuous control tasks. The Hopper environment exhibits quite noisy behavior, but the average performance is comparable with that of the other algorithms. Moreover, our method needs far fewer samples to reach an average return equal to the maximum performance obtained by the other methods; see for example the HalfCheetah environment, where PDPG matches the performance of TD3 with half the number of samples.
To better characterize sample efficiency, we report in Table I the average number of timesteps required by each algorithm to exceed a set of reward thresholds, placed at approximately one third and two thirds of the maximum reward achieved by the best algorithm. The superior sample efficiency of our method is evident here: for most of the environments PDPG reaches the specified thresholds with far fewer samples than the others. In Humanoid and Ant it is less efficient than SAC for the lower threshold but more efficient for the higher one; moreover, it has consistently better asymptotic performance. We acknowledge that this sample efficiency comes at a cost: the computation of the proximal operator requires multiple gradient steps for each batch, slowing down training; in our experiments we took five gradient steps for each batch, hence the amount of computation required scaled accordingly. We believe that this is a fair price to pay, since it significantly reduces the number of queries to the environment needed to train the agent properly.
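The threshold metric can be computed from a training curve as in the sketch below. It returns the first recorded timestep at which the return exceeds the threshold, or `None` if the curve never does; the function name and the choice to apply it to a single (for example seed-averaged) curve are assumptions of the example, as the text does not detail how the averaging was done.

```python
def timesteps_to_threshold(timesteps, returns, threshold):
    """First timestep at which the return exceeds the threshold, else None."""
    for step, value in zip(timesteps, returns):
        if value > threshold:
            return step
    return None


steps = [1000, 2000, 3000, 4000]
curve = [500.0, 2100.0, 3900.0, 4200.0]
print(timesteps_to_threshold(steps, curve, 2000))
print(timesteps_to_threshold(steps, curve, 4000))
print(timesteps_to_threshold(steps, curve, 5000))
```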
VII Ablation study
In this section we provide an ablation study to show the effect of the Huber loss and of the bootstrapped value estimation on the proposed method. We run a version of the method that employs the MSE in the TD error, and a version that does not use the bootstrapped estimation but still keeps two Q-functions to avoid the overestimation bias.
Figure 2 reports the training curves of the two alternative versions compared to the standard approach. The bootstrapped value estimates, while not drastically changing performance, seem to provide improved stability; this can be seen especially in the Ant and Walker environments, where the Single Q algorithm has much higher variance. The Huber loss has a drastic effect on the HalfCheetah environment, providing a considerable performance improvement. On the remaining environments there is no substantial difference between the two loss functions.
In general, the standard method shows better performance and stability; hence all the components employed are empirically justified by this study.
VIII Conclusions
In this paper we proposed Proximal Deterministic Policy Gradient, an off-policy RL method for model-free continuous control tasks that exploits proximal gradient methods and bootstrapping to better solve the TD error optimization problem. Proximal algorithms are appealing in an RL setting since they show improved convergence and stability properties compared to standard SGD. Moreover, we showed that proximal methods provide a natural interpretation of target networks, a trick commonly employed in RL to stabilize training.
The resulting algorithm compares favourably with state-of-the-art off-policy and on-policy methods, showing improved sample efficiency and asymptotic performance. The significant increase in sample efficiency makes our algorithm appealing for deployment in real environments; this possibility will be explored in future work.
Appendix
VIII-A Training details
We employed feedforward neural networks with two hidden layers of 256 neurons each and ReLU [glorot2011deep] activations for both policy and critic. As in [fujimoto2018addressing], we perform a burn-in at the beginning of training, during which we sample random actions from the environment. The hyperparameters employed are listed in Tables II and III. The exploration noise, action noise and noise clip are relative to the maximum value of the action, which varies depending on the environment.
Table II: Hyperparameters.
Name  Value
Learning rate  3e-4
Damping coefficient  0.005
Exploration noise  0.1
Noise clip  0.5
Action noise  0.2
Batch size  256
Proximal steps  5
Policy loss strength  0.01
Table III: Environment-specific hyperparameters.
Environment  Name  Value
Hopper, Walker  Burn-in  1000
All others  Burn-in  10000
HalfCheetah, Ant  Policy Weight Decay  0.0
All others  Policy Weight Decay  1e-5
Humanoid  Proximal Strength  10.0
Hopper, Walker  Proximal Strength  1.0
HalfCheetah, Ant  Proximal Strength  0.1
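For convenience, the values of the two tables can be collected in a single configuration, as in the sketch below. The dictionary keys, the helper `hyperparameters` and the short environment names are hypothetical; the learning rate is taken as 3e-4 and the non-zero policy weight decay as 1e-5.

```python
SHARED = {
    "learning_rate": 3e-4,
    "damping_coefficient": 0.005,
    "exploration_noise": 0.1,
    "noise_clip": 0.5,
    "action_noise": 0.2,
    "batch_size": 256,
    "proximal_steps": 5,
    "policy_loss_strength": 0.01,
}

# Environment-specific overrides, one entry per environment name.
PER_ENVIRONMENT = {
    "Hopper": {"burn_in": 1000, "policy_weight_decay": 1e-5, "proximal_strength": 1.0},
    "Walker": {"burn_in": 1000, "policy_weight_decay": 1e-5, "proximal_strength": 1.0},
    "HalfCheetah": {"burn_in": 10000, "policy_weight_decay": 0.0, "proximal_strength": 0.1},
    "Ant": {"burn_in": 10000, "policy_weight_decay": 0.0, "proximal_strength": 0.1},
    "Humanoid": {"burn_in": 10000, "policy_weight_decay": 1e-5, "proximal_strength": 10.0},
}


def hyperparameters(environment):
    """Merge the shared values with the overrides of one environment."""
    config = dict(SHARED)
    config.update(PER_ENVIRONMENT[environment])
    return config


print(hyperparameters("Humanoid"))
```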