1 Introduction
Policy search methods are amongst the few successful Reinforcement Learning (RL) (Sutton et al., 2000) methods applicable to high-dimensional or continuous control problems, such as the ones typically encountered in robotics (Peters & Schaal, 2008b; Deisenroth et al., 2013). One particular class of policy search methods directly estimates the gradient of the expected return with respect to the parameters of a differentiable policy. These Policy Gradient (PG) algorithms have achieved impressive results on highly complex tasks (Schulman et al., 2015, 2017). However, standard algorithms are vastly data-inefficient and rely on millions of data points to achieve the aforementioned results. Typical applications are therefore limited to simulated problems where policy rollouts can be cheaply obtained.
Algorithms based on stochastic policy gradients, like REINFORCE (Williams, 1992) and G(PO)MDP (Baxter & Bartlett, 2001), typically estimate the policy gradient based on a batch of trajectories, which are obtained by executing the current policy on the system (i.e. based on on-policy samples). In the next step, all previous experience is discarded and new trajectories are sampled using the updated policy. This scheme holds true also for more recent methods, like PPO (Schulman et al., 2017) or POIS (Metelli et al., 2018), where a surrogate objective is constructed which can be optimized till convergence. Typically, Importance Sampling (IS) techniques are employed to evaluate a target policy based on rollouts obtained from behavioural policies (i.e. from off-policy samples). Despite these off-policy evaluation schemes, no data is shared between iterations in these algorithms. Prominent examples of off-policy, offline algorithms typically employ actor-critic architectures (Silver et al., 2014), where the parametric critic model, typically a value function, is updated to summarize all knowledge gathered so far. In contrast, we propose the model-free Deep Deterministic Off-Policy Gradient method (DD-OPG; code available at https://github.com/boschresearch/DD_OPG), which incorporates previously gathered rollout data by sampling from a trajectory replay buffer. This effectively enables backtracking to promising solutions, whilst requiring only minimal assumptions to construct the surrogate model.
Next to the inefficient use of available data, stochasticity in both the policy and the environment causes highly variable gradient estimates and therefore slow convergence. When executing a probabilistic policy on the system, noise is injected into the policy gradient in each time step, leading to a variance that increases linearly with the length of the horizon (Munos, 2006). Additive Gaussian noise is typically employed as the source of exploration. Additionally, PG methods built around the likelihood ratio trick intrinsically require probabilistic policies; only then can policies be updated to increase the likelihood of actions which have been advantageous in previous rollouts. Instead of independent noise, temporally correlated noise (Osband et al., 2016) or exploration directly in parameter space can lead to a larger variety of behaviours (Plappert et al., 2017). Here, the behavioural policy is deterministic, thereby effectively reducing the gradient variance. Methods like DPG (Silver et al., 2014) and DDPG (Lillicrap et al., 2015) learn a parametric value function model to translate changes in policy, and therefore in actions, into changes in expected value. Similarly, our proposed model-free DD-OPG algorithm constructs a non-parametric critic based on importance sampling. This critic, called the surrogate model in the following, allows for updating a deterministic policy without the need for explicit parametric value models.
To summarize: We propose an importance sampling based surrogate model of the return distribution, which enables off-policy, offline policy optimization. This surrogate facilitates deterministic policy gradients to reduce gradient variance and enables the incorporation of all available data from a replay buffer. Exploration in the policy parameter space is achieved by a prioritized resampling of the surrogate's support data, thus favouring promising regions in policy space. Normalized IS, which we demonstrate to act similarly to a baseline in standard PG methods, additionally reduces the variance of the employed estimates. Although no additional parametric value function baseline (as utilized in TRPO/PPO for variance reduction) is required in our method, fast progress and therefore data-efficient learning is demonstrated on typical continuous control tasks.
The general problem formulation and the policy gradient framework are highlighted in Sec. 2, followed by a short presentation of the standard importance sampling estimators to incorporate off-policy data in Sec. 3. The surrogate model, necessary to efficiently incorporate deterministic policy data and the core of the proposed model-free DD-OPG method, is detailed in Sec. 4. In Sec. 5, the main policy optimization scheme is presented; it is experimentally evaluated in Sec. 6. This work closes with a discussion of connections to related work in Sec. 7 and concludes with an outlook on future work and open topics in Sec. 8.
2 Preliminaries
This section depicts the general episodic RL problem in a discrete-time Markovian environment and summarizes, as the core building block of the proposed DD-OPG method, the standard return-based policy gradient estimators (Williams, 1992). DD-OPG closely follows this algorithmic structure (cf. Alg. 1), however with extensions to incorporate deterministic, off-policy rollouts as detailed in the following sections. The RL problem is characterized by a discrete-time Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, p, r, \gamma, p_0)$. An agent is interacting with an environment whose state $s_t$ transitions, according to the agent's action $a_t$ and the environment's transition probabilities $p(s_{t+1} \mid s_t, a_t)$, into a successor state $s_{t+1}$. Starting from a state drawn from the initial state distribution $p_0(s_0)$, the agent tries to maximize its discounted reward, according to a reward function $r(s_t, a_t)$ and discount factor $\gamma$, accumulated over a horizon of length $H$. In policy search, the agent acts according to a (stochastic) policy $\pi_\theta(a_t \mid s_t)$, parameterized by $\theta$. The expected accumulated reward is given by

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \right], \tag{1}$$

where the trajectory $\tau$ is the sequence of state-action pairs $(s_0, a_0, \dots, s_H, a_H)$, the (discounted) trajectory return is given by $R(\tau) = \sum_{t=0}^{H} \gamma^t\, r(s_t, a_t)$, and, due to the Markov property, the trajectory distribution in (1) is given by

$$p_\theta(\tau) = p_0(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). \tag{2}$$
The dynamics of the system and the initial state distribution are generally unknown to the learning agent.
Model-free policy gradient methods typically directly estimate the expected cost gradient based on the log-derivative trick. The gradient is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]. \tag{3}$$

Given on-policy samples $\tau_i \sim p_\theta(\tau)$, $i = 1, \dots, N$, the following Monte Carlo (MC) estimators are obtained for the expected return

$$\hat{J}_{\mathrm{MC}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \tag{4}$$

and the policy gradient

$$\hat{\nabla}_\theta J_{\mathrm{MC}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)\, \nabla_\theta \log p_\theta(\tau_i). \tag{5}$$

Since the unknown initial state and dynamics distributions are independent of the policy parameters (cf. (2)), the trajectory likelihood gradient $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ can be computed analytically for a given, differentiable policy $\pi_\theta$.
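To make the estimators concrete, eqs. (4) and (5) are simple averages over sampled trajectories. The following NumPy sketch (function names are ours, for illustration; the per-trajectory score vectors would come from the differentiable policy) implements both:

```python
import numpy as np

def mc_return_estimate(returns):
    """Monte Carlo estimate of the expected return, cf. eq. (4)."""
    return float(np.mean(returns))

def mc_policy_gradient(returns, score_vectors):
    """REINFORCE-style gradient estimate, cf. eq. (5): the average of
    R(tau_i) * grad_theta log p_theta(tau_i) over sampled trajectories.
    score_vectors is an (N, D) array with one score vector per trajectory."""
    returns = np.asarray(returns, dtype=float)
    return np.mean(returns[:, None] * np.asarray(score_vectors, dtype=float), axis=0)

# toy example: two trajectories in a 2-D parameter space
grad = mc_policy_gradient([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])  # -> [0.5, 1.5]
```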
3 Off-Policy Evaluation
The MC estimators require a substantial number of on-policy rollouts to reduce the gradient estimator's variance, and typically many more rollouts than used in state-of-the-art implementations to closely approximate the true gradient (Ilyas et al., 2018).
For off-policy data, Importance Sampling (IS) can be utilized to incorporate trajectories from a behavioural policy in order to evaluate a new target policy (Zhao et al., 2013; Espeholt et al., 2018; Munos et al., 2016; Metelli et al., 2018). In general, a Monte Carlo estimate of an expectation (such as (1)) can be obtained by sampling from a tractable distribution $q(\tau)$ and reweighting the sampled function evaluations based on the likelihood ratio $p_\theta(\tau) / q(\tau)$. The expected return can be rewritten as

$$J(\theta) = \mathbb{E}_{\tau \sim q(\tau)}\left[ \frac{p_\theta(\tau)}{q(\tau)}\, R(\tau) \right], \tag{6}$$

such that the IS-weighted Monte Carlo estimator is given by

$$\hat{J}_{\mathrm{IS}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i\, R(\tau_i), \tag{7}$$

$$w_i = \frac{p_\theta(\tau_i)}{p_{\theta'}(\tau_i)}, \tag{8}$$

where trajectories $\tau_i$ are sampled from a policy $\pi_{\theta'}$ to infer the expected cost of policy $\pi_\theta$. Although system dynamics and initial state distribution in (2) are unknown, the likelihood ratio, i.e. the importance weights, can be computed since the unknown parts cancel out, such that

$$w_i = \prod_{t=0}^{H-1} \frac{\pi_\theta(a_t^i \mid s_t^i)}{\pi_{\theta'}(a_t^i \mid s_t^i)}. \tag{9}$$
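As a sketch (naming ours), the weight in (9) is conveniently computed in log space from the per-step policy log-likelihoods:

```python
import numpy as np

def trajectory_is_weight(logp_target_steps, logp_behaviour_steps):
    """Importance weight of a single trajectory, cf. eq. (9).
    The unknown dynamics and initial-state terms cancel, so only the
    per-step policy log-likelihoods enter; summing in log space before
    exponentiating avoids under-/overflow for long horizons."""
    return float(np.exp(np.sum(logp_target_steps) - np.sum(logp_behaviour_steps)))

# the target policy assigns twice the likelihood to the second action
w = trajectory_is_weight(np.log([0.5, 0.5]), np.log([0.5, 0.25]))  # -> 2.0
```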
During learning, trajectories are collected from multiple different policies $\pi_{\theta_1}, \dots, \pi_{\theta_M}$. To incorporate all data, the importance sampling distribution can be replaced by an empirical mixture distribution $q(\tau) = \frac{1}{M} \sum_{j=1}^{M} p_{\theta_j}(\tau)$, such that the available trajectories are i.i.d. draws from the empirical mixture distribution (Jie & Abbeel, 2010). The resulting importance weights are given by

$$w_i = \frac{p_\theta(\tau_i)}{\frac{1}{M} \sum_{j=1}^{M} p_{\theta_j}(\tau_i)}. \tag{10}$$
Computing the importance weights in (10), however, scales quadratically with the number of available trajectories, due to the summation over the likelihoods of all trajectories under all available policies. Scaling this estimator to today's deep neural network policies with a large number of required rollouts is thus a major challenge. Instead of computing the surrogate based on all data, as in (Jie & Abbeel, 2010), which is only feasible for several hundred rollouts, the proposed DD-OPG method employs a trajectory replay buffer and a probabilistic selection scheme to recompute a stochastic approximation of the full surrogate model. This idea is related to prioritized experience replay (Schaul et al., 2015), but for full trajectories. It enables scaling to much larger datasets and at the same time helps to avoid local minima by stochastically optimizing the objective.
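A minimal illustration (names ours) of the mixture weights in (10); the full log-likelihood matrix below is exactly the quadratically-sized object discussed above:

```python
import numpy as np

def mixture_is_weights(log_liks, target_idx):
    """Importance weights under the empirical mixture, cf. eq. (10).
    log_liks[i, j] = log-likelihood of trajectory i under policy j.
    Filling this matrix requires evaluating every trajectory under
    every policy, hence the quadratic cost mentioned in the text."""
    log_liks = np.asarray(log_liks, dtype=float)
    n_traj, n_pol = log_liks.shape
    # numerically stable log of the mixture likelihood (1/M) * sum_j p_j(tau_i)
    m = log_liks.max(axis=1, keepdims=True)
    log_mix = m.squeeze(1) + np.log(np.exp(log_liks - m).sum(axis=1)) - np.log(n_pol)
    return np.exp(log_liks[:, target_idx] - log_mix)

# two trajectories under two policies; the target policy is index 0
w = mixture_is_weights(np.log([[0.5, 0.25], [0.25, 0.5]]), target_idx=0)
```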
Another technique typically employed for IS is weight normalization (Metelli et al., 2018). The weighted importance sampling estimator obtains a lower-variance estimate at the cost of adding bias. It has been employed in (Peshkin & Shelton, 2002) and is both theoretically and empirically better behaved (Meuleau et al., 2000; Precup et al., 2000; Shelton, 2001) compared to the pure IS estimator. The weighted importance sampling estimator is given by

$$\hat{J}_{\mathrm{WIS}}(\theta) = \frac{1}{Z} \sum_{i=1}^{N} w_i\, R(\tau_i), \tag{11}$$

where the importance weights might be computed according to (9) or (10), and the normalizing constant $Z = \sum_{i=1}^{N} w_i$ replaces the standard normalization $Z = N$ previously used in (7).
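A one-line sketch of the self-normalized estimator (naming ours):

```python
import numpy as np

def weighted_is_estimate(weights, returns):
    """Self-normalized (weighted) IS estimator, cf. eq. (11):
    the normalizer is the sum of weights rather than the sample count,
    trading a small bias for reduced variance."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(returns, dtype=float)
    return float(np.sum(w * r) / np.sum(w))

# with uniform weights this reduces to the plain Monte Carlo average
```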
From the policy gradient perspective, by normalizing the importance weights we obtain a gradient estimator which includes a parameter-dependent baseline.
Proposition 1
The policy gradient estimator obtained from the self-normalized importance sampling expected cost estimator is given by

$$\hat{\nabla}_\theta J_{\mathrm{WIS}}(\theta) = \sum_{i=1}^{N} \bar{w}_i \left( R(\tau_i) - \hat{J}_{\mathrm{WIS}}(\theta) \right) \nabla_\theta \log p_\theta(\tau_i), \qquad \bar{w}_i = \frac{w_i}{\sum_{j=1}^{N} w_j}. \tag{12}$$
A proof of this proposition is shown in Appendix A. This estimator is closely related to standard PG estimators with an added baseline term for variance reduction.
In standard, REINFORCE-like PG methods, two of the most common variance reduction techniques (Greensmith et al., 2004) are: i) incorporation of the reward-to-go for each policy action update instead of the entire Monte Carlo path return; and ii) subtraction of a state-dependent baseline term, so as to obtain an estimate of the advantage of the previously taken action. The intuition behind method i) is to reward actions only for rewards obtained after the action took effect, but not for those obtained earlier on. However, to compute the importance weights not for the full trajectory distribution but for each state-action pair individually, a full matrix of per-time-step importance weights (one entry per trajectory and time step) would be required. Therefore, model-free, importance sampling based approaches are typically limited to path-return based estimators, whereas model-based methods (i.e. parametric models for the value function) are employed in the cost-to-go estimators. Variance reduction method ii) is automatically obtained by the normalized estimator, as shown in Proposition 1, however, in contrast to the bias-free value function control variates, at the cost of adding bias. Additional, optimal baselines to further decrease the variance of the gradient estimator have been derived in (Jie & Abbeel, 2010) and could be incorporated into DD-OPG.

4 Deterministic Policy Gradients
The policy gradient estimators in (5) and (12) rely on a policy distribution in order to obtain a gradient signal on how to update the policy parameters to increase the likelihood of successful actions. In this situation, the typically Gaussian, additive policy noise acts in two ways: it causes exploration and it serves as the basis for the estimation of the objective function. Exploration is driven directly through noise in the action space, i.e., the policy covariance. While driving exploration through noisy actions will converge in the limit, the resulting explorative behaviour exhibits no temporal correlations, which can make it inefficient. Estimation of the objective function is typically achieved by reweighting the action distribution according to the policy's likelihood. Standard policies are given as

$$\pi_\theta(a \mid s) = \mathcal{N}\left(a \mid \mu_\theta(s), \Sigma_\theta\right),$$

where $\mu_\theta(s)$ is represented by some function approximator parameterized by $\theta$, e.g. a neural network. The additive Gaussian noise covariance $\Sigma_\theta$ is typically a diagonal matrix, parameterized by $\theta$ as well.
The proposed deterministic policy gradient method strives to separate the exploration and estimation parts.
Parameter Space Exploration By utilizing deterministic rollout policies, the only noise introduced into the gradient estimate originates from the stochasticity of the environment, and we have to perform exploration in parameter space instead of action space. As stated above, parameter-based exploration may in many cases be more efficient than exploration in action space, since parameter-based exploration leads to temporally correlated actions, which can explore the state space faster. Typically, however, this effect is negated for neural network policies, since the parameter space that has to be explored is prohibitively large. Thus, to navigate large parameter spaces efficiently, some approximate evaluation of the cost function (1) is needed.
Trajectory-based objective estimate Whilst evaluation of the Monte Carlo based expected cost estimate is possible also for deterministic policies, the off-policy evaluation is no longer feasible, since the likelihood ratio (cf. (9)) becomes zero for two distinct Dirac policy action distributions whenever their actions differ. However, we can still compare trajectories under a stochastic evaluation distribution, similar to a kernel function, where the standard deviation of the evaluation function relates to a kernel lengthscale in action space.
Thus, we introduce the evaluation policy

$$\tilde{\pi}_\theta(a \mid s) = \mathcal{N}\left(a \mid \mu_\theta(s), \sigma^2 I\right), \tag{13}$$

where $\sigma^2 I$ is a diagonal covariance matrix, as typically employed in deep RL methods with Gaussian action noise. The deterministic policy is given by $\pi_\theta(a \mid s) = \delta(a - \mu_\theta(s))$, where $\delta$ is the Dirac delta. From the general IS expectation in (6) and our evaluation policy in (13), the surrogate model follows as

$$\hat{J}_{\mathrm{sur}}(\theta) = \frac{1}{Z} \sum_{i=1}^{N} \tilde{w}_i\, R(\tau_i), \tag{14}$$

with surrogate weights

$$\tilde{w}_i = \frac{\tilde{p}_\theta(\tau_i)}{\frac{1}{M} \sum_{j=1}^{M} \tilde{p}_{\theta_j}(\tau_i)}, \qquad \tilde{p}_\theta(\tau) = p_0(s_0) \prod_{t=0}^{H-1} \tilde{\pi}_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \tag{15}$$

where the unknown dynamics terms again cancel, and where, depending on the choice of normalization constant, we obtain the analogue to the standard IS estimator ($Z = N$) or the analogue to the weighted IS estimator ($Z = \sum_i \tilde{w}_i$). Reintroducing the fixed Gaussian noise as an implicit loss to obtain gradients for the evaluation of deterministic policies is clearly a model assumption in the proposed method, but it can be justified from several perspectives.
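Evaluating the surrogate for a candidate deterministic policy thus reduces to scoring the stored actions under the smoothed evaluation policy. A minimal sketch (names are ours; a single shared noise scale sigma is assumed for the diagonal covariance):

```python
import numpy as np

def gaussian_traj_log_lik(actions, means, sigma):
    """Sum of per-step log N(a_t | mu_theta(s_t), sigma^2 I) along one
    trajectory, i.e. the policy-dependent part of the evaluation policy."""
    d = np.asarray(actions, dtype=float) - np.asarray(means, dtype=float)
    return -0.5 * np.sum(d ** 2) / sigma ** 2 - d.size * np.log(sigma * np.sqrt(2.0 * np.pi))

def surrogate_return(returns, log_target, log_mixture, self_normalized=True):
    """Surrogate return model, cf. eqs. (14)-(15): an IS estimate whose
    weights compare the target evaluation policy against the empirical
    mixture over the behavioural evaluation policies."""
    w = np.exp(np.asarray(log_target, dtype=float) - np.asarray(log_mixture, dtype=float))
    z = np.sum(w) if self_normalized else len(w)
    return float(np.sum(w * np.asarray(returns, dtype=float)) / z)

# if target and mixture likelihoods agree, all weights are one -> plain mean
val = surrogate_return([1.0, 3.0], [0.0, 0.0], [0.0, 0.0])  # -> 2.0
```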
The hyperparameter $\sigma$ allows for control over the amount of information shared between neighbouring policies. Similar to the cap on the importance weights in PPO (Schulman et al., 2017), this parameter allows to control the bias and variance of the surrogate model. Analyzing the introduced bias and the relation to the PPO weight cap is, however, ongoing research. In the limit of $\sigma \to 0$, the proposed surrogate (14) approaches the MC estimator (4). Only in the case of two different policy parameterizations $\theta \neq \theta'$, but equivalent actions for the sampled states, would the surrogate model output an average, whereas the MC estimator would not mix up the obtained returns. For $\sigma$ equal to the behavioural noise level, the surrogate model recovers the true IS estimate, given that all trajectories are generated using the same additive Gaussian noise. Finally, for $\sigma \to \infty$, the estimate is simply the average over all available path returns.
Modelling the expected return distribution by choosing a lengthscale in action space can furthermore be motivated from a second perspective. Typical expected return distributions oftentimes comprise sharp transitions between stable and unstable regions, where policy parameters change only slightly but reward changes drastically. One global lengthscale is therefore typically not well suited to directly model the expected return. This is a standard problem in Bayesian Optimization for reinforcement learning, where typical smooth kernel functions (e.g. the squared exponential kernel) with globally fixed lengthscales are unable to model both stable and unstable regimes at the same time. In the proposed model, however, a lengthscale in action space is translated, via the sampled state distribution and the policy function, into implicit assumptions in the actual policy parameter space. Doing so, instead of operating on arbitrary Euclidean distances in policy parameter space, a more meaningful distance in trajectory and action space is available. For a given system, distances between trajectories and between actions are typically more tangible than distances between arbitrary deep neural network policy parameters.
The expected return estimator (14) falls back to zero for policy evaluations far away from the training data. To estimate the variance of the importance sampling estimator itself, typically the Effective Sample Size (ESS) is evaluated. Based on the variance of the importance weights, it analyses the effective number of available data points at a specific policy evaluation position. In (Metelli et al., 2018), a lower bound on the expected return has been proposed, such that with probability $1 - \delta$ it holds that

$$J(\theta) \ge \hat{J}_{\mathrm{WIS}}(\theta) - \|R\|_\infty \sqrt{\frac{(1 - \delta)\, d_2(\tilde{p}_\theta \,\|\, q)}{\delta N}}, \tag{16}$$

where $d_2$ is the exponentiated 2-Rényi divergence. Due to the identity $\mathrm{ESS} = N / d_2(\tilde{p}_\theta \,\|\, q)$, this lower bound can be estimated in a sample-based way by employing the ESS estimator

$$\widehat{\mathrm{ESS}}(\theta) = \frac{\left(\sum_{i=1}^{N} \tilde{w}_i\right)^2}{\sum_{i=1}^{N} \tilde{w}_i^2}, \tag{17}$$

so as to obtain the lower bound estimate

$$J(\theta) \ge \hat{J}_{\mathrm{WIS}}(\theta) - \|R\|_\infty \sqrt{\frac{1 - \delta}{\delta\, \widehat{\mathrm{ESS}}(\theta)}}. \tag{18, 19}$$

Refer to Theorem 4.1 in (Metelli et al., 2018) for details and a proof regarding the lower bound in (16). The confidence parameter $\delta$ determines, similar to the KL-divergence constraint in TRPO (Schulman et al., 2015), how far the policy optimization can step away from known regions. In DD-OPG, this uncertainty estimate is employed as a penalty

$$\hat{J}_{\mathrm{pen}}(\theta) = \hat{J}_{\mathrm{WIS}}(\theta) - \lambda \sqrt{\frac{1 - \delta}{\delta\, \widehat{\mathrm{ESS}}(\theta)}}, \tag{20}$$

with penalty factor $\lambda$ as a hyperparameter to control exploration (i.e. following the objective estimate) vs. risk awareness (i.e. staying within a trust region).
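Both the ESS estimate and the resulting penalty are straightforward to compute from the surrogate weights. The sketch below uses our own names and omits the sup-norm scaling of the exact bound:

```python
import numpy as np

def effective_sample_size(weights):
    """Sample-based ESS estimator: (sum w)^2 / sum w^2. Equals N for
    uniform weights and degrades towards 1 as the weights degenerate."""
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

def penalized_objective(surrogate_value, weights, delta=0.2, penalty=1.0):
    """Uncertainty-penalized surrogate: the penalty grows when the ESS
    at the evaluated policy shrinks, keeping the optimization within an
    implicit trust region around the available data."""
    ess = effective_sample_size(weights)
    return surrogate_value - penalty * np.sqrt((1.0 - delta) / (delta * ess))
```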
5 Model-Free Off-Policy Optimization
The surrogate model of the return distribution, as derived in Sec. 4, can now be directly incorporated for policy optimization. In related work, parametric search distributions (e.g. Gaussian) are employed as policy search distribution or hyperpolicy (Zhao et al., 2013; Plappert et al., 2017; Metelli et al., 2018). However, in high-dimensional spaces, as typically obtained with deep network policy representations, updating the full search distribution is challenging, and common approaches usually revert to heuristics to control a simplified, e.g. diagonal or block-wise, search distribution covariance matrix.
Instead, the proposed model-free DD-OPG method fully optimizes a stochastic version of the surrogate objective to foster exploration and overcome local minima. At the same time, the stochastic evaluation mitigates the unfavourable complexity of computing the full importance sampling estimate based on all available data: due to the empirical mixture distribution in (10), computing the likelihood of all observed trajectories under all policies is quadratic in the number of observed paths. The proposed method instead employs a selection criterion to construct a stochastic surrogate model based on a subset of rollouts in each policy optimization step. In particular, a predefined number of rollout indices is drawn from the softmax distribution over the discrete set of available trajectory indices $\{1, \dots, M\}$. The softmax is computed based on the normalized, empirical returns $\tilde{R}_i$ and a temperature factor $\beta$,

$$p(i) = \frac{\exp(\tilde{R}_i / \beta)}{\sum_{j=1}^{M} \exp(\tilde{R}_j / \beta)}. \tag{21}$$
The temperature is used to trade off exploration against exploitation in the selection of reference trajectories. This scheme is closely related to prioritized experience replay (Schaul et al., 2015). A study of the effect of temperature selection on the learning progress is shown in Sec. 6.3.
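The selection step can be sketched as follows (function name and the exact normalization details are ours, for illustration):

```python
import numpy as np

def sample_replay_indices(returns, n_select, temperature, rng=None):
    """Prioritized trajectory selection: draw rollout indices from a
    softmax over the normalized empirical returns.
    High temperature -> near-uniform sampling (exploration);
    low temperature -> greedy reuse of high-return paths."""
    rng = np.random.default_rng(0) if rng is None else rng
    r = np.asarray(returns, dtype=float)
    r = (r - r.mean()) / (r.std() + 1e-8)      # normalize empirical returns
    logits = r / temperature
    p = np.exp(logits - logits.max())          # numerically stable softmax
    p /= p.sum()
    return rng.choice(len(r), size=n_select, replace=True, p=p)

# near-greedy selection concentrates on the best trajectory
idx = sample_replay_indices([0.0, 0.0, 10.0], n_select=5, temperature=1e-3)
```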
The full DD-OPG algorithm is detailed in Alg. 1. The main objective is to incorporate all available deterministic policy rollouts, not only the ones from the current iteration, into the surrogate model by means of the softmax replay selection. The lower bound on the expected return can then be fully optimized using standard optimization techniques. In practice, Adam (Kingma & Ba, 2014) is employed, but other techniques, e.g. based on the natural policy gradient (Peters & Schaal, 2008a), could be incorporated as well.
6 Experimental Evaluation
The experimental evaluation of the proposed DD-OPG method is threefold. In Sec. 6.1, the resulting surrogate return model is visualized, highlighting different modeling options. A benchmark against state-of-the-art PG methods is shown in Sec. 6.2 to highlight fast and data-efficient learning. Finally, important parts of the proposed algorithm and their effects on the final learning performance are analysed in an ablation study in Sec. 6.3.
6.1 Surrogate Model
As discussed in Sec. 4, the proposed surrogate model can smoothly interpolate between the Monte Carlo estimate, the importance sampling estimate, and an average of all available returns. In Fig. 1, the surrogate model predictions are visualized for multiple settings of the model hyperparameter $\sigma$. In particular, the estimates of the expected return (solid orange line), the return variance (shaded orange, one standard deviation), and the lower bound of the expected return (dashed orange line) are visualized for policy evaluations along a random direction around the optimal policy for the cart-pole environment (experimental details can be found in Appendix B). Trajectory data available to the estimator is highlighted by grey dots. The ground-truth return distribution (mean ± one std. in blue) is computed using the standard MC estimator, based on independent policy rollouts which are not part of the surrogate model. Stepping from long lengthscales (cf. Fig. 1(a)) to shorter lengthscales (cf. Fig. 1(c)), the surrogate model predictions become more local. Most visibly in the lower-bound estimate, the ESS drops significantly when moving away from data points for small model lengthscales, resulting in much higher uncertainty.
6.2 Policy Gradient Benchmark
The proposed DD-OPG method is evaluated in terms of data-efficiency and learning progress in comparison to state-of-the-art policy gradient methods based on Monte Carlo return estimates. In contrast, methods such as DDPG (Lillicrap et al., 2015) employ TD learning for their value function model and are not part of this evaluation. The benchmark compares DD-OPG to the standard REINFORCE (Williams, 1992) baseline and to both TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017). All competitor algorithms employ, as is common practice, the reward-to-go formulation and a linear feature-based baseline for variance reduction. For all methods, hyperparameters are selected to achieve maximal accumulated average return, i.e. fast and stable policy optimization. Details about the individual methods' configurations and the employed environments can be found in Appendix B.
The resulting learning performances are visualized in Fig. 2 for the cart-pole, mountain-car, and swimmer environments (left to right) (Duan et al., 2016). For REINFORCE (blue), TRPO (yellow), PPO (green), and DD-OPG (red), the mean average return (solid line) and its confidence intervals (one standard deviation as shaded area) are depicted, as obtained from 10 independent runs with 10 random seeds for each environment and method. To compare the learning speed and data-efficiency between the batch-wise learning competitors and the rollout-based DD-OPG, the results are visualized as a function of collected environment interactions in Fig. 2. With DD-OPG, rapid learning progress is achieved early on, and the final performance of the competitive, state-of-the-art policy gradient methods is matched. In the hyperparameter tuning phase, experiments with TRPO and PPO were conducted with smaller batch sizes, but due to the lack of data-efficient incorporation of off-policy data, no faster yet stable learning progress could be achieved for these methods, compared to the results visualized in Fig. 2. Notice the large variance of the DD-OPG learning progress in the swimmer environment. Despite the superior learning performance of DD-OPG on the swimmer environment, some of the runs got stuck in local minima, resulting in the large variance estimate. The trade-off between exploration and exploitation is partially achieved by the stochastic memory selection: a mix of prioritized trajectory replay and current trajectories is mandatory to prevent greedy exploitation of previously seen local minima and to facilitate exploration. Our experiments show that incorporating previously seen rollout data, as is done in DD-OPG, is mandatory to enable rapid progress already in the early stages of training.
6.3 Ablation Study
In the final DD-OPG algorithm, multiple aspects come together: i) the deterministic surrogate model, ii) the memory selection strategy, and iii) the optimization scheme. In this ablation study, we separate the individual components to analyse their effect on the final learning performance. Experiments are conducted on the cart-pole environment and results are averaged over three random seeds.
In the first experiment, DD-OPG is reconstructed starting from the REINFORCE baseline. A visualization is shown in Fig. 3. In REINFORCE (red dotted line), only one policy gradient step is taken based on the current on-policy data. This is comparable to DD-OPG with almost no memory and only one gradient step per iteration (visualized as blue dotted line). Learning performance is already increased by adding more memory paths (green and yellow curves). More significantly, the full optimization of the surrogate model (solid lines) achieves much faster learning progress.
In Fig. 4, the effect of the surrogate model's lengthscale parameter $\sigma$ is evaluated. Four different lengthscales are compared (red: 1.0, green: 2.0, yellow: 3.0, blue: 4.0). In this experiment, longer lengthscales clearly improve learning speed, despite the introduced model bias.
The effects of the softmax temperature on the proposed prioritized trajectory replay and on the learning progress are depicted in Fig. 5. Explorative behaviour is favoured for higher temperatures (red), whereas for low temperatures (blue), previous trajectories are selected more greedily. In this example, an intermediate temperature achieves the best exploration-exploitation trade-off.
7 Connections to Related Work
Policy search methods (Peters & Schaal, 2008b; Deisenroth et al., 2013) and policy gradient methods (Williams, 1992; Baxter & Bartlett, 2001) are well studied in the RL community and many connections to DDOPG exist.
Importance sampling has been employed either to reweight full trajectory distributions (Shelton, 2001; Jie & Abbeel, 2010; Zhao et al., 2013; Metelli et al., 2018) or to reweight individual state-action pairs (Munos et al., 2016; Espeholt et al., 2018). Except for (Jie & Abbeel, 2010), no global IS estimator is derived; estimates are based only on the current iteration's data. In contrast, DD-OPG introduces a global surrogate model based on all available deterministic policy rollouts and computes local, stochastic approximations using prioritized replay. Instead of DD-OPG's action space lengthscale, alternative approaches consider truncation of the importance weights (Wawrzynski & Pacut, 2007; Schulman et al., 2017; Espeholt et al., 2018). So far, the connection between the two approaches has not been subject to closer analysis.
Concepts for policy updates range from standard gradient ascent (Williams, 1992), to trust region methods (Schulman et al., 2015), to lower bounds which can be fully optimized till convergence (Schulman et al., 2017; Metelli et al., 2018). The proposed DD-OPG optimizes a stochastic version of the lower bound derived in (Metelli et al., 2018).
Deterministic policies as a means of variance reduction have been previously discussed, for example, in (Sehnke et al., 2008; Plappert et al., 2017), where, instead of action noise, exploration is achieved by stochasticity in parameter space. The DD-OPG method relies on deterministic policies for variance reduction, but introduces exploration by means of stochastic gradients from the prioritized replay model.
8 Discussion
This work presents a new surrogate model of the RL return distribution, inspired by importance sampling. It can incorporate off-policy data and deterministic rollouts to reduce estimator variance. Despite the promising results and the data-efficient learning progress, several interesting topics remain for future work.
The proposed surrogate model is motivated by its close connections to the importance sampling estimator, the interpretability of the model assumption in action space, and its desirable behaviour in the model limits. A detailed analysis of the resulting model assumptions in policy space, implied by the model assumptions in action space, and an analysis of the resulting bias remain open questions.
The proposed optimization scheme empirically achieved good performance in our benchmark experiments, outperforming stateoftheart methods, although no additional parametric value function baseline (as in TRPO/PPO) is employed. However, extensions to other strategies for exploration vs. exploitation, for example acquisition functions like Expected Improvement or Probability of Improvement from Bayesian Optimization (Snoek et al., 2012), are to be explored and directly carry over to the proposed surrogate return model.
Finally, memory selection is required to scale the nonparametric model structure to typical deep RL applications. The proposed prioritized trajectory replay is only one possible option to address this challenge.
References

Baxter & Bartlett (2001) Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
 Deisenroth et al. (2013) Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.

Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), pp. 1329–1338, 2016.
 Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning (ICML), pp. 1406–1415, 2018.
 Greensmith et al. (2004) Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
 Ilyas et al. (2018) Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Are deep policy gradient algorithms truly policy gradient algorithms? arXiv preprint arXiv:1811.02553, 2018.
 Jie & Abbeel (2010) Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems (NIPS), pp. 1000–1008, 2010.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Metelli et al. (2018) Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in neural information processing systems (NIPS), pp. 5442–5454, 2018.
 Meuleau et al. (2000) Meuleau, N., Peshkin, L., Kaelbling, L. P., and Kim, K.E. Offpolicy policy search. MIT Articical Intelligence Laboratory, 2000.
 Munos (2006) Munos, R. Policy gradient in continuous time. Journal of Machine Learning Research, 7(May):771–791, 2006.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient offpolicy reinforcement learning. In Advances in neural information processing systems (NIPS), pp. 1054–1062, 2016.
 Osband et al. (2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems (NIPS), pp. 4026–4034, 2016.
 Peshkin & Shelton (2002) Peshkin, L. and Shelton, C. R. Learning from scarce experience. In International Conference on Machine Learning (ICML), pp. 498–505. Morgan Kaufmann Publishers Inc., 2002.
 Peters & Schaal (2008a) Peters, J. and Schaal, S. Natural actorcritic. Neurocomputing, 71(79):1180–1190, 2008a.
 Peters & Schaal (2008b) Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008b.
 Plappert et al. (2017) Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
 Precup et al. (2000) Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for offpolicy policy evaluation. In International Conference on Machine Learning (ICML), pp. 759–766. Citeseer, 2000.
 Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Sehnke et al. (2008) Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. Policy gradients with parameterbased exploration for control. In International Conference on Artificial Neural Networks, pp. 387–396. Springer, 2008.
 Shelton (2001) Shelton, C. R. Policy improvement for POMDPs using normalized importance sampling. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence (UAI), pp. 496–503. Morgan Kaufmann Publishers Inc., 2001.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (NIPS), pp. 1057–1063, 2000.
 Wawrzynski & Pacut (2007) Wawrzynski, P. and Pacut, A. Truncated importance sampling for reinforcement learning with experience replay. Proc. CSIT Int. Multiconf, pp. 305–315, 2007.
 Williams (1992) Williams, R. J. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.
 Zhao et al. (2013) Zhao, T., Hachiya, H., Tangkaratt, V., Morimoto, J., and Sugiyama, M. Efficient sample reuse in policy gradients with parameterbased exploration. Neural computation, 25(6):1512–1547, 2013.
Appendix A Proof of Proposition 1
The weighted importance sampling estimator of the expected return is given by

(22) \hat{J}_{\mathrm{WIS}}(\theta) = \sum_{i=1}^{n} \bar{w}_i(\theta) R(\tau_i), \qquad \bar{w}_i(\theta) = \frac{w_i(\theta)}{\sum_{j=1}^{n} w_j(\theta)}, \qquad w_i(\theta) = \frac{p_\theta(\tau_i)}{q(\tau_i)},

as derived in Sec. 3. Taking the derivative with respect to the policy parameters, applying the quotient rule, and using \nabla_\theta w_i(\theta) = w_i(\theta) \nabla_\theta \log p_\theta(\tau_i) (the behavioural distribution q does not depend on \theta), we obtain the policy gradient formulation from Proposition 1 as shown in (28):

(23) \nabla_\theta \hat{J}_{\mathrm{WIS}}(\theta) = \sum_{i=1}^{n} R(\tau_i) \, \nabla_\theta \frac{w_i(\theta)}{\sum_{j=1}^{n} w_j(\theta)}

(24) = \sum_{i=1}^{n} R(\tau_i) \, \frac{\nabla_\theta w_i(\theta) \sum_{j=1}^{n} w_j(\theta) - w_i(\theta) \sum_{j=1}^{n} \nabla_\theta w_j(\theta)}{\big( \sum_{j=1}^{n} w_j(\theta) \big)^2}

(25) = \sum_{i=1}^{n} R(\tau_i) \left[ \frac{\nabla_\theta w_i(\theta)}{\sum_{j=1}^{n} w_j(\theta)} - \bar{w}_i(\theta) \sum_{j=1}^{n} \frac{\nabla_\theta w_j(\theta)}{\sum_{k=1}^{n} w_k(\theta)} \right]

(26) = \sum_{i=1}^{n} R(\tau_i) \left[ \bar{w}_i(\theta) \nabla_\theta \log p_\theta(\tau_i) - \bar{w}_i(\theta) \sum_{j=1}^{n} \bar{w}_j(\theta) \nabla_\theta \log p_\theta(\tau_j) \right]

(27) = \sum_{i=1}^{n} \bar{w}_i(\theta) R(\tau_i) \nabla_\theta \log p_\theta(\tau_i) - \left( \sum_{i=1}^{n} \bar{w}_i(\theta) R(\tau_i) \right) \left( \sum_{j=1}^{n} \bar{w}_j(\theta) \nabla_\theta \log p_\theta(\tau_j) \right)

(28) = \sum_{i=1}^{n} \bar{w}_i(\theta) R(\tau_i) \left[ \nabla_\theta \log p_\theta(\tau_i) - \sum_{j=1}^{n} \bar{w}_j(\theta) \nabla_\theta \log p_\theta(\tau_j) \right].
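The self-normalized estimator and its gradient can be checked numerically. The sketch below implements the standard weighted importance sampling value and score-function gradient for a scalar parameter; the toy log-likelihood used in the usage example is an assumption for illustration only:

```python
import numpy as np

def wis_value_and_grad(log_p, grad_log_p, log_q, returns):
    """Self-normalized importance sampling estimate and its gradient.

    log_p      : log p_theta(tau_i) for each stored trajectory, shape (n,)
    grad_log_p : d/dtheta log p_theta(tau_i), scalar theta, shape (n,)
    log_q      : log-likelihood under the behavioural distribution q, shape (n,)
    returns    : trajectory returns R(tau_i), shape (n,)
    """
    log_w = log_p - log_q
    w = np.exp(log_w - np.max(log_w))   # shifted for numerical stability
    w_bar = w / w.sum()                 # normalized weights (shift cancels)
    value = np.dot(w_bar, returns)
    # Gradient: sum_i w_bar_i R_i (grad log p_i - sum_j w_bar_j grad log p_j)
    score_baseline = np.dot(w_bar, grad_log_p)
    grad = np.dot(w_bar * returns, grad_log_p - score_baseline)
    return value, grad
```

A finite-difference check against the estimator itself confirms that the closed-form gradient matches the derivative of the normalized objective.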
Appendix B Experimental Details
The following sections summarize the reference implementations of REINFORCE, TRPO, and PPO and their parameter settings for the benchmark experiments and the ablation study. Information about the benchmark environments is given in Sec. B.2.
B.1 Algorithm Configurations
The reference implementations of the benchmark algorithms REINFORCE, TRPO and PPO are from the Garage RL framework (Duan et al., 2016). A hyperparameter grid search has been conducted for each algorithm and each environment on separate random seeds. The parameter ranges and selected hyperparameters are indicated in Tab. 1. For the benchmark itself, ten runs have been conducted for each algorithm and each environment on the random seeds (404, 931, 159, 380, 858, 708, 16, 448, 136, 989).
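The grid-search protocol described above can be sketched as follows. The parameter grids and the train() callback are placeholders, not the actual Garage experiment code; only the benchmark seeds are taken from the text:

```python
import itertools

# Seeds used for the final benchmark runs (from the text); the grid search
# itself was conducted on separate random seeds.
BENCHMARK_SEEDS = (404, 931, 159, 380, 858, 708, 16, 448, 136, 989)

def grid_search(train, batch_sizes, step_sizes, search_seeds):
    """Return the (batch_size, step_size) pair with the best mean return.

    `train` is a placeholder callback: (batch_size, step_size, seed) -> return.
    """
    best_cfg, best_score = None, float("-inf")
    for batch_size, step_size in itertools.product(batch_sizes, step_sizes):
        # Average performance over the dedicated search seeds.
        scores = [train(batch_size, step_size, s) for s in search_seeds]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_cfg, best_score = (batch_size, step_size), score
    return best_cfg
```

Each selected configuration in Tab. 1 is then evaluated with ten runs on the benchmark seeds.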
The configuration of the DD-OPG method is summarized in Tab. 1.
Table 1: Hyperparameter ranges and selected values for the benchmark algorithms, and the DD-OPG configuration.

Algorithm  | Parameter   | Range          | Selected
-----------|-------------|----------------|---------
REINFORCE  | Batch size  | [400, 5000]    | 5000
           | Step size   | [0.0001, 0.1]  | 0.03
TRPO       | Batch size  | [400, 5000]    | 5000
           | Step size   | [0.0001, 0.1]  | 0.1
PPO        | Batch size  | [400, 5000]    | 2000
           | Step size   | [0.0001, 0.2]  | 0.2

Algorithm  | Parameter   | Symbol | Selected
-----------|-------------|--------|---------
DD-OPG     | Temperature |        | 0.1
           | Penalty     |        | 0.05
           | Lengthscale |        |
           | Path buffer |        | 50

Table 2: Benchmark environment dimensions and task horizons.

Environment  | Inputs | States | Horizon
-------------|--------|--------|--------
Cartpole     | 1      | 4      | 100
Mountaincar  | 1      | 2      | 500
Swimmer      | 2      | 13     | 1000
B.2 Benchmark Environments
The benchmark environments are cartpole, mountaincar and swimmer from the Garage RL framework. Details about the input and state dimensions, as well as the task horizons are listed in Tab. 2.