1 Introduction
Reinforcement Learning (RL) algorithms have demonstrated good performance on large-scale simulated data. It has proven challenging to translate this progress to real-world robotics problems, for two reasons. First, the complexity and fragility of robots precludes extensive data collection. Second, a robot may face an environment during operation that is different from the simulated environment it was trained in; the myriad intricate design and hyperparameter choices made by current RL algorithms, especially off-policy methods, may not remain appropriate when the environment changes. We therefore lay out the following desiderata to make off-policy reinforcement learning-based methods viable for real-world robotics: (i) reduce the amount of data required for learning, and (ii) simplify complex state-of-the-art off-policy RL algorithms.
1.1 Problem Setup
Consider a discrete-time dynamical system given by
(1) $x_{t+1} = f(x_t, u_t) + \epsilon_t$,
where $x_t, x_{t+1} \in X$ are the states at times $t$ and $t+1$, respectively, $u_t \in U$ is the control input (also called action) applied at time $t$, and $\epsilon_t$ is noise that denotes unmodeled parts of the dynamics. The initial state $x_0$ is drawn from some known probability distribution $p_0$. We will work under the standard model-free formulation of RL wherein one assumes that the dynamics $f$ is unknown to the learner. Consider the discounted sum of rewards over an infinite time-horizon
(2) $v^\theta(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, u_t) \;\Big|\; x_0 = x,\ u_t = u_\theta(x_t)\Big]$.
The left-hand side $v^\theta(x)$ is known as the value function and the expectation is computed over trajectories of the dynamical system Eq. 1. Note that we always have $\gamma \in [0, 1)$. The reward $r(x_t, u_t)$, denoted by $r_t$ in short, models a user-chosen incentive for taking the control input $u_t$ at state $x_t$. The goal in RL is to maximize the objective $\mathbb{E}_{x \sim p_0}[v^\theta(x)]$.
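To make Eq. 2 concrete, the following sketch evaluates the discounted sum of rewards along a single finite trajectory, truncating the infinite horizon once $\gamma^t$ is negligible. The function name and horizon are illustrative choices, not from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t along one (finite) trajectory, as in Eq. 2."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# With a constant reward of 1, the discounted return approaches
# 1 / (1 - gamma) as the horizon grows (here, 100 for gamma = 0.99).
approx = discounted_return([1.0] * 10000, gamma=0.99)
```

Averaging this quantity over many trajectories started from $x \sim p_0$ yields a Monte-Carlo estimate of the objective $\mathbb{E}_{x \sim p_0}[v^\theta(x)]$.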
There are numerous approaches to solving the above problem. This paper focuses on off-policy learning, in which the algorithm reuses data from old policies to update the current policy [sutton2018reinforcement, bertsekas2019reinforcement]. The defining characteristic of these algorithms is that they use an experience replay buffer $D = \{(x_t, u_t, x_{t+1}, r_t)\}$ collected by a policy $u_{\theta'}$ to compute the value function corresponding to the controller $u_\theta$. In practice this replay buffer $D$ consists of multiple trajectories from different episodes. Off-policy techniques for continuous state and control spaces regress a parametrized action-value function $q_\varphi$ by minimizing the one-step squared-Bellman error (also called the temporal difference error)
(3) $\ell(\varphi) = \mathbb{E}_{(x, u, x', r) \sim D}\Big[\big(q_\varphi(x, u) - r - \gamma\, q_\varphi(x', u_\theta(x'))\big)^2\Big]$.
If this objective is zero, the action-value function satisfies
$q_\varphi(x, u) = r(x, u) + \gamma\, q_\varphi(x', u_\theta(x'))$,
which suggests that given $q_\varphi$ one may find the best controller by maximizing
(4) $\max_\theta\ \mathbb{E}_{x \sim D}\big[q_\varphi(x, u_\theta(x))\big]$.
The pair of equations Eqs. 4 and 3 form a coupled pair of optimization problems in the variables $(\theta, \varphi)$ that can be solved by, say, taking gradient steps on each objective alternately while keeping the other set of parameters fixed.
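The alternating scheme just described can be illustrated on a deliberately tiny toy problem. In the sketch below the "critic" $q(u) = -(u - \varphi)^2$ is regressed toward a stand-in target peak (playing the role of the Bellman target in Eq. 3) while the "actor" $\theta$ ascends the critic (Eq. 4), each holding the other fixed. All quantities here are hypothetical; this only shows the structure of the alternation, not the paper's algorithm.

```python
PHI_STAR = 2.0       # stand-in for the Bellman target (not from the paper)
phi, theta = 0.0, -1.0
lr = 0.1
for _ in range(500):
    # critic step: minimize (phi - PHI_STAR)^2 with theta held fixed
    phi -= lr * 2.0 * (phi - PHI_STAR)
    # actor step: ascend q(theta) = -(theta - phi)^2 with phi held fixed
    theta += lr * 2.0 * (phi - theta)
```

Both parameters converge to the target peak: the critic tracks its regression target, and the actor tracks the critic's maximizer.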
Although off-policy methods have shown promising performance on various tasks and are usually more sample-efficient than on-policy methods [fakoor2019p3o, fujimoto2018addressing, haarnoja2018soft, Lillicrap2016ContinuousCW], they are often very sensitive to hyperparameters, exploration methods, and other design choices [HendersonRlMatter]. This has led to a surge of interest in improving these methods.
1.2 State of current algorithms
The problems Eqs. 4 and 3 form the basis for a popular off-policy method known as Deterministic Policy Gradient (DPG [silver2014deterministic]), its deep learning variant Deep DPG (DDPG [lillicrap2015continuous]), and many others such as Twin-Delayed DDPG (TD3 [fujimoto2018addressing]) and Soft Actor-Critic (SAC [haarnoja2018soft]). As written, this pair of optimization problems does not lead to good performance in practice. Current off-policy algorithms therefore introduce a number of modifications, some better motivated than others. These modifications have become de facto standard in the current literature and we list them below.
The objective $\ell(\varphi)$ can be zero without $q_\varphi$ being a good approximation of the right-hand side of Eq. 2. Current algorithms use a "target" Q-function, e.g., they compute the target in Eq. 3 using a time-lagged version $\varphi'$ of the parameters $\varphi$ [mnih2013playing]. The controller $u_\theta$ in Eq. 3 is also replaced by its time-lagged version $u_{\theta'}$. These target parameters $(\varphi', \theta')$ are updated using exponential averages of $(\varphi, \theta)$, respectively.

The learnt estimate $q_\varphi$ typically overestimates the right-hand side of Eq. 2 [Thrun1993]. TD3 and SAC therefore train two copies $q_{\varphi_1}, q_{\varphi_2}$, and maintain a different time-lagged target for each, to replace the target in Eq. 3 by
$r + \gamma\, \min_{i=1,2} q_{\varphi'_i}(x', u_{\theta'}(x'))$.
This is called the "double-Q" trick [van2016deep].

Some algorithms like TD3 add "target noise" and use
$u_{\theta'}(x') + \epsilon$ with $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma^2), -c, c)$
in the target, while others such as SAC, which train a stochastic controller $\pi_\theta$, regularize with the entropy of the controller to get the target
$r + \gamma\big(q_{\varphi'}(x', u') - \alpha \log \pi_\theta(u' \mid x')\big)$ with $u' \sim \pi_\theta(\cdot \mid x')$;
here $\alpha$ is a hyperparameter.

Further, SAC uses the minimum of the two Q-functions for updating the controller in Eq. 4, along with the entropy term.

The TD3 algorithm delays the updates to the controller: it performs two gradient-based updates of the value function in Eq. 3 for each update of the controller via Eq. 4; this is called "delaying policy updates".
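The modifications above can be summarized in one place by showing how they combine when computing the one-step Bellman target. In the sketch below the critics and controller are stand-in callables, and the names `sigma` (target noise scale), `c` (noise clip) and `tau` (exponential-averaging rate) are illustrative hyperparameter names, not the paper's notation.

```python
import random

def bellman_target(r, x_next, q1_targ, q2_targ, u_targ,
                   gamma=0.99, sigma=0.2, c=0.5):
    """One-step target for Eq. 3 with target networks, clipped target
    noise (TD3-style) and the double-Q minimum."""
    noise = max(-c, min(c, random.gauss(0.0, sigma)))
    u = u_targ(x_next) + noise
    # "double-Q" trick: take the minimum of the two target critics
    return r + gamma * min(q1_targ(x_next, u), q2_targ(x_next, u))

def polyak(target_params, params, tau=0.005):
    """Target parameters track an exponential average of the live ones."""
    return [tp + tau * (p - tp) for tp, p in zip(target_params, params)]
```

Setting `sigma=0` recovers the noiseless target; as discussed below, DDPG++ observes that target noise is not necessary for good empirical performance.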
1.3 Contributions
Off-policy algorithms achieve good empirical performance on standard simulated benchmark problems using the above modifications. This performance, however, comes at the cost of additional hyperparameters and computational complexity for each of these modifications.
This paper presents a simplified off-policy algorithm named DDPG++ that eliminates problematic modifications of existing methods and introduces new ideas to make training more stable, while keeping the average returns unchanged. Our contributions are as follows:


We show that empirical performance is extremely sensitive to the policy delay and there is no clear way to pick this hyperparameter across benchmark environments. We therefore eliminate delayed updates in DDPG++.

To make policy updates consistent with value function updates and avoid using overestimated Q-values during policy updates, we propose to use the minimum of the two Q-functions in Eq. 4. This is the most critical step in making training stable.

We observe that the performance of the algorithm is highly dependent on the policy updates in Eq. 4 because the estimate $q_\varphi$ can be quite erroneous in spite of a small loss in Eq. 3. We exploit this in the following way: observe that some tuples $(x, u, x', r)$ in the data may have controls $u$ that are similar to the current controller's output $u_\theta(x)$. These transitions are the most important ones for updating the controller in Eq. 4, while the others in the dataset may lead to deterioration of the controller. We follow [fakoor2019meta, fakoor2019p3o] to estimate the propensity between the action distribution of the current policy and the action distribution of the past policies. This propensity is used to filter out transitions in the data that may lead to deterioration of the controller during the policy update.

We show that adding noise while computing the target is not necessary for good empirical performance.
Our paper presents a combination of a number of small, yet important, observations about the workings of current off-policy algorithms. The merit of these changes is that the overall performance on a large number of standard benchmark continuous-control problems is preserved, both in terms of the rate of convergence and the average reward, with a considerably simpler algorithm.
2 DDPG++
We first discuss the concept of covariate shift from the machine learning and statistics literature in Section 2.1 and provide a simple method to compute it given samples from the dataset and the current policy being optimized in Section 2.2. We then present the algorithm in Section 2.3.
2.1 Covariate shift correction
Consider the supervised learning problem where we observe independent and identically distributed data from a distribution $p$, say the training dataset. We would however like to minimize the loss on data from another distribution $q$, say the test data. This amounts to minimizing
(5) $\mathbb{E}_{(x, y) \sim q}\big[\ell(y, f(x))\big] = \mathbb{E}_{(x, y) \sim p}\big[\beta(x)\, \ell(y, f(x))\big]$.
Here $y$ are the labels associated to draws $x$ and $\ell(y, f(x))$ is the loss of the predictor $f$. The importance ratio is defined as
(6) $\beta(x) = \dfrac{\mathrm{d}q}{\mathrm{d}p}(x)$,
which is the Radon-Nikodym derivative of the two densities [resnick2013probability]; it rebalances the data to put more weight on samples that are unlikely under $p$ but likely under the test distribution $q$. If the two distributions are the same, the importance ratio is 1 and such a correction is unnecessary. When the two distributions are not the same, we have an instance of covariate shift and need to use the trick in Eq. 5.
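A minimal numerical sketch of the correction in Eq. 5: we estimate an expectation under a "test" Gaussian $q = \mathcal{N}(1, 1)$ using samples drawn from a "training" Gaussian $p = \mathcal{N}(0, 1)$, reweighting each sample by the (here, known) density ratio $\beta(x) = q(x)/p(x)$. The distributions and sample size are illustrative choices.

```python
import math
import random

def normal_pdf(x, mu):
    # standard-deviation-1 Gaussian density with mean mu
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # draws from p
betas = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in xs]
# importance-weighted estimate of E_q[x]; the true value is 1
est = sum(b * x for b, x in zip(betas, xs)) / len(xs)
```

The unweighted sample mean of `xs` would estimate $\mathbb{E}_p[x] = 0$; the weights shift the estimate toward $\mathbb{E}_q[x] = 1$.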
2.2 Logistic regression for estimating the covariate shift
When we do not know the densities $p$ and $q$, we need to estimate $\beta$ using some finite data $X_p = \{x_1, \ldots, x_m\}$ drawn from $p$ and $X_q = \{x'_1, \ldots, x'_n\}$ drawn from $q$. As [agarwal2011linear] show, this is easy to do using logistic regression. Set $z = -1$ to be the labels for the data in $X_p$ and $z = +1$ to be the labels for the data in $X_q$, and fit a logistic classifier on the combined $m + n$ samples by solving
(7) $\min_w\ \sum_{(x, z)} \log\big(1 + e^{-z\, w^\top x}\big)$.
This gives
(8) $\beta(x) = \dfrac{m}{n} \cdot \dfrac{\mathbb{P}(z = 1 \mid x)}{\mathbb{P}(z = -1 \mid x)} = \dfrac{m}{n}\, e^{w^\top x}$.
This method of computing the propensity score, or the importance ratio, is closely related to two-sample tests [fakoor2019p3o, agarwal2011linear, Reddi2015DoublyRC] in the statistics literature.
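The procedure of Eqs. 7 and 8 can be sketched in a few lines: fit a one-dimensional logistic classifier (slope and bias) by gradient descent to separate samples of $p$ from samples of $q$, then read off the odds ratio, scaled by $m/n$ for unequal sample sizes. The step size, iteration count, and test distributions are illustrative; the code uses $\{0, 1\}$ labels, which is equivalent to the $\{-1, +1\}$ convention in Eq. 7.

```python
import math
import random

def fit_logistic(xp, xq, lr=0.1, steps=1000):
    """Gradient descent on the average logistic loss; z=0 for p, z=1 for q."""
    w, b = 0.0, 0.0
    data = [(x, 0.0) for x in xp] + [(x, 1.0) for x in xq]
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for x, z in data:
            pz = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (pz - z) * x
            gb += (pz - z)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def beta(x, w, b, m_over_n=1.0):
    """Estimated density ratio q(x)/p(x) via the classifier's odds, Eq. 8."""
    pz = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return m_over_n * pz / (1.0 - pz)

random.seed(1)
xp = [random.gauss(0.0, 1.0) for _ in range(400)]   # draws from p = N(0, 1)
xq = [random.gauss(1.0, 1.0) for _ in range(400)]   # draws from q = N(1, 1)
w, b = fit_logistic(xp, xq)
```

For these two unit-variance Gaussians the true log-ratio is $\log q(x)/p(x) = x - 1/2$, so the fitted slope and bias should land near $1$ and $-1/2$.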
Remark 1 (Replay buffer has diverse data).
The dataset $D$ in off-policy RL algorithms, also called the replay buffer, is created incrementally using a series of feedback controllers obtained during interaction with the simulator or the environment. This is done by drawing data from the environment using the initialized controller $u_{\theta_0}$; the controller is then updated by iterating upon Eqs. 4 and 3. More data is drawn from the new controller and added to the dataset $D$. The dataset for off-policy algorithms therefore consists of data from a large number of diverse controllers/policies. It is critical to take this diversity into consideration when sampling from the replay buffer; propensity estimation allows doing so.
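A minimal replay buffer matching the description in Remark 1: transitions from many successive controllers accumulate in one dataset, old transitions are evicted once capacity is reached, and minibatches are sampled uniformly. The capacity, field order, and class name are illustrative choices, not from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        # deque with maxlen evicts the oldest transitions first
        self.data = deque(maxlen=capacity)

    def add(self, x, u, r, x_next):
        self.data.append((x, u, r, x_next))

    def sample(self, batch_size):
        # uniform minibatch without replacement
        idx = random.sample(range(len(self.data)), batch_size)
        return [self.data[i] for i in idx]
```

Because the buffer mixes data from many past controllers, a uniformly sampled minibatch is generally not distributed like the current policy's trajectories, which is exactly the covariate shift that propensity estimation corrects.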
2.3 Algorithm
This section elaborates upon the DDPG++ algorithm along with some implementation details that are important in practice.
Policy delay causes instability. Fig. 0(a) shows that different policy delays lead to large differences in performance for the HalfCheetah environment of OpenAI's Gym [brockman2016openai] in the MuJoCo [todorov2012mujoco] simulator. The second plot, Fig. 0(b), shows that different policy delays perform about the same for the Hopper environment; the performance for the other MuJoCo environments is similar to that of Hopper. Policy delay was introduced by the authors of [fujimoto2018addressing] to stabilize the performance of off-policy algorithms, but this experiment indicates that policy delay may not be the correct mechanism to do so.
Algorithm (DDPG++, initialization): Initialize neural networks $q_{\varphi_1}, q_{\varphi_2}$ for the value function, their targets $q_{\varphi'_1}, q_{\varphi'_2}$, the controller $u_\theta$ and its corresponding target $u_{\theta'}$.
Propensity estimation for controls. The controller is being optimized to be greedy with respect to the current estimate of the value function $q_\varphi$. If the estimate of the value function is erroneous (theoretically, $q_\varphi$ may have a large bias because its objective in Eq. 3 only uses the one-step Bellman error), updates to the controller will also incur a large error. One way to prevent such deterioration is to update the policy only on states $x$ where the control $u$ in the dataset and the current controller's output $u_\theta(x)$ are similar. The idea is that since the value function is fitted across multiple gradient updates of Eq. 3 using the dataset, the estimate of $q_\varphi$ should be consistent for these states and controls. The controller is therefore evaluated and updated only at states where $q_\varphi$ is consistent.
It is easy to instantiate the above idea using the propensity estimation of Section 2.2. At each iteration, before updating the policy using Eq. 4, we fit a logistic classifier on the two datasets of controls (those stored in $D$ and those produced by the current controller $u_\theta$ at the same states) to estimate the likelihood $\beta(u \mid x)$, the relative probability that a control was produced by the current controller rather than drawn from the dataset. The objective for the policy update in Eq. 4 is thus modified to simply be
(9) $\max_\theta\ \mathbb{E}_{(x, u) \sim D}\big[\beta(u \mid x)\, q_\varphi(x, u_\theta(x))\big]$.
Note that the higher the $\beta(u \mid x)$, the closer the control $u$ in the dataset is to the output of the current controller $u_\theta(x)$. Also observe that the ideal accuracy of the logistic classifier for propensity estimation is 0.5, i.e., the classifier should not be able to differentiate between controls in the dataset and those taken by the controller; in that case $\beta$ is roughly constant across samples. The importance ratio will be small if the current controller's output at the same state is very different; the modified objective discards such states while updating $u_\theta$.
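The weighted objective in Eq. 9, evaluated over a minibatch, can be sketched as follows. The critic, controller, and propensity here are stand-in callables; this is an illustration of the weighting, not the paper's actual implementation.

```python
def weighted_policy_objective(batch, q, controller, beta):
    """Minibatch estimate of Eq. 9: each transition's contribution is
    scaled by its propensity, so states whose logged control differs
    from the current controller's output are effectively discarded."""
    total = 0.0
    for x, u in batch:   # (state, logged control) pairs from the buffer
        total += beta(x, u) * q(x, controller(x))
    return total / len(batch)
```

In practice one would ascend the gradient of this quantity with respect to the controller's parameters, exactly as with the unweighted objective in Eq. 4.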
Remark 2 (Relation to Batch-Constrained Q-learning).
Our modification to the objective in Eq. 9 is close to the idea of the BCQ algorithm of [fujimoto2018off]. That algorithm uses a generative model to learn the action distribution at new states in Q-learning in order to update $q_\varphi$ in Eq. 3 selectively. BCQ is primarily an imitation-based algorithm and is designed for the so-called offline reinforcement learning setting, where the dataset is given and fixed and the agent does not have the ability to collect new data from the environment. Our use of propensity for stabilizing the controller's updates in Eq. 9 is designed for the off-policy setting and is a simpler instantiation of the same idea, namely that one must selectively perform the updates to $q_\varphi$ and $u_\theta$ at specific states.
We next demonstrate the performance of propensity-weighted controller updates on a challenging example in Fig. 2. Fig. 1(a) compares the performance of TD3, DDPG++ and DDPG++ with propensity weighting. The average return of propensity weighting (orange) is higher than the others; the unweighted version (green) performs well but suffers from a sudden degradation after about 6M training samples. We can further understand this by observing Fig. 1(b): the propensity is always quite small for this environment, which suggests that most controls in the dataset are very different from those the controller would take. Using propensity to weight the policy updates ignores such states and protects the controller from the degradation seen in Fig. 1(a).
3 Experimental Validation
Setup. We use the MuJoCo [todorov2012mujoco] simulation environment and the OpenAI Gym [brockman2016openai] environments to demonstrate the performance of DDPG++. These are simulated robotic systems, e.g., the HalfCheetah environment in Fig. 0(a) has 17 states and 6 controls and provides a reward at each timestep which encourages velocities of large magnitude and penalizes the magnitude of the control. The most challenging environment in this suite is the Humanoid task, which has a 376-dimensional state-space and 17 control inputs; the Humanoid is initialized upright and is given a reward for moving away from its initial condition, with a quadratic penalty on control inputs and impact forces. All the hyperparameters are given in the Appendix.
Baseline algorithms. We compare the performance of DDPG++ against two algorithms. The first is TD3 [fujimoto2018addressing], a recent off-policy algorithm that achieves good performance on these tasks. The second is DDPG [lillicrap2015continuous], which is, relatively speaking, a simple algorithm for off-policy learning. The DDPG++ algorithm discussed in this paper can be thought of as a simplification of TD3 that brings it closer to DDPG. The SAC algorithm [haarnoja2018soft] has about the same performance as TD3; we therefore do not use it as a baseline.
Performance on standard tasks. Fig. 4 shows the average returns of these three algorithms on the MuJoCo environments. Across the suite, the performance of DDPG++ (without propensity correction) is stable and at least as good as TD3. Performance gains are largest for HalfCheetah and Walker2d. This study shows that one can obtain good performance on the MuJoCo suite with a few simple techniques while eliminating a lot of the complexity of state-of-the-art algorithms. The average returns at the end of training are shown in Table 3. The performance of DDPG++ with propensity weighting was about the same as that of DDPG++ without it for the environments in Fig. 4.
Fig. 2 shows that propensity weighting is dramatically effective for the Humanoid environment. This is perhaps because of the significantly higher dimensionality of this environment compared to the others in the benchmark. Hyperparameters tuned for other environments do not perform well for Humanoid; in contrast, propensity-weighting the policy updates is much easier and more stable.
3.1 Dexterous hand manipulation
We next evaluate the DDPG++ algorithm on a challenging robotic task where a simulated anthropomorphic ADROIT hand [kumar2013fast] with 24 degrees of freedom is used to open a door (Fig. 2(b)). This task is difficult because of significant dry friction and a bias torque that forces the door to stay closed. This is a completely model-free task and we do not provide any information to the agent about the latch; the RL agent only gets a reward when the door touches the stopper at the other end. The authors of [rajeswaran2017learning] demonstrate that having access to expert trajectories is essential for performing this task. Reward shaping, where the agent's reward is artificially bolstered the closer it gets to the goal, can be helpful when there is no expert data. The authors of [rajeswaran2017learning] demonstrate that DDPG from demonstrations (DDPGfD [vecerik2017leveraging]) performs poorly on this task. Fig. 2(a) shows the average evaluation returns of running DDPG++ on this task without any expert data, and Table 1 shows the success rate as the algorithm trains. Hyperparameters for this task are kept the same as those used for the continuous-control tasks in the MuJoCo suite.

Table 1: Success rate on the ADROIT hand task.
Task          TD3   DDPG++
ADROIT hand   0     34
Table 3: Average returns at the end of training on the MuJoCo suite.
Task          DDPG    TD3     DDPG++
HalfCheetah   11262   12507   15342
Walker2d      2538    4753    5701
Hopper        1501    2752    2832
Humanoid      1898    5384    4331
Ant           758     5185    4404
Swimmer       141     107     134
Task       TD3    DDPG++   DDPG++ (with propensity)
Humanoid   5384   4331     6384
4 Discussion
We discussed an algorithm named DDPG++ which is a simplification of existing algorithms in off-policy reinforcement learning. This algorithm eliminates the noise added while computing the value function targets and the delayed policy updates of standard off-policy learning algorithms; doing so eliminates hyperparameters that are difficult to tune in practice. DDPG++ uses propensity-based weights for the data in the minibatch to update the policy only on states where the controls in the dataset are similar to those that the current policy would take. We showed that a vanilla off-policy method like DDPG works well as long as the overestimation bias in value estimation is reduced. We evaluated these ideas on a number of challenging, high-dimensional simulated control tasks.
Our paper is in the spirit of simplifying complex state-of-the-art RL algorithms and making them more amenable to experimentation and deployment in robotics. There are a number of recent papers with the same goal, e.g., [fu2019diagnosing] is a thorough study of Q-learning algorithms, [fujimoto2018addressing] identifies the key concept of overestimation bias in Q-learning, and [agarwal2019striving] shows that offline Q-learning, i.e., learning the optimal value function from a fixed dataset, can be performed easily using existing algorithms if given access to a large amount of exploratory data. While these are promising first results, there are a number of steps that remain before reinforcement learning-based methods can be incorporated into standard robotics pipelines. The most critical issue underlying all modern deep RL approaches is that the function class used to approximate the value function, namely the neural network, is poorly understood. The Bellman update may not be a contraction if the value function is approximated using a function approximator, and consequently the controls computed via the value function can be very suboptimal. This is a fundamental open problem in reinforcement learning [antos2008fitted, antos2008learning, farahmand2010error, munos2005error] and solutions to it are necessary to enable the deployment of deep RL techniques in robotics.
References
Appendix A: Hyperparameters
We next list all the hyperparameters used for the experiments in Section 3.
Parameters                 DDPG    TD3     DDPG++ (Ours)
Exploration noise          0.1     0.1     0.2
Policy noise               N/A     0.2     N/A
Policy update frequency    N/A     2       N/A
Adam learning rate         0.001   0.001   0.0003
Adam learning rate (HM)    1e-4    1e-4    1e-4
Hidden size                256     256     256
Burn-in                    1000    1000    1000
Burn-in (HC & AN)          10000   10000   10000
Batch size                 100     100     100
We use a network with two fully-connected layers (256 hidden neurons each) for all environments. The abbreviations HC, AN, and HM stand for HalfCheetah, Ant, and Humanoid, respectively.