
Policy Optimization via Importance Sampling

Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective function estimate. In this paper, we propose a novel model-free policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation and then we define a surrogate objective function which is optimized offline using a batch of trajectories. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with the state-of-the-art policy optimization methods.






1 Introduction

In recent years, policy search methods deisenroth2013survey have proved to be valuable Reinforcement Learning (RL) sutton1998reinforcement approaches thanks to their successful achievements in continuous control tasks (e.g., lillicrap2015continuous; schulman2015trust; schulman2017proximal; schulman2015high), robotic locomotion (e.g., tedrake2004stochastic; kober2013reinforcement) and partially observable environments (e.g., ng2000pegasus). These algorithms can be roughly classified into two categories:

action-based methods sutton2000policy; peters2008reinforcement and parameter-based methods sehnke2008policy. The former, usually known as policy gradient (PG) methods, perform a search in a parametric policy space by following the gradient of the utility function estimated by means of a batch of trajectories collected from the environment sutton1998reinforcement. In contrast, in parameter-based methods, the search is carried out directly in the space of parameters by exploiting global optimizers (e.g., rubinstein1999cross; hansen2001completely; stanley2002evolving; szita2006learning) or following a proper gradient direction like in Policy Gradients with Parameter-based Exploration (PGPE) sehnke2008policy; wierstra2008natural; sehnke2010parameter. A major question in policy search methods is: how should we use a batch of trajectories in order to exploit its information in the most efficient way? On one hand, on-policy methods leverage on the batch to perform a single gradient step, after which new trajectories are collected with the updated policy. Online PG methods are likely the most widespread policy search approaches: starting from the traditional algorithms based on stochastic policy gradient sutton2000policy, like REINFORCE williams1992simple and G(PO)MDP baxter2001infinite, moving toward more modern methods, such as Trust Region Policy Optimization (TRPO) schulman2015trust. These methods, however, rarely exploit the available trajectories in an efficient way, since each batch is thrown away after just one gradient update. On the other hand, off-policy methods maintain a behavioral policy, used to explore the environment and to collect samples, and a target policy which is optimized. The concept of off-policy learning is rooted in value-based RL watkins1992q; peng1994incremental; munos2016safe and it was first adapted to PG in degris2012off, using an actor-critic architecture. 
The approach has been extended to Deterministic Policy Gradient (DPG) silver2014deterministic, which allows optimizing deterministic policies while keeping a stochastic policy for exploration. More recently, an efficient version of DPG coupled with a deep neural network to represent the policy has been proposed, named Deep Deterministic Policy Gradient (DDPG) lillicrap2015continuous. In the parameter-based framework, even though the original formulation sehnke2008policy introduces an online algorithm, an extension has been proposed to efficiently reuse the trajectories in an offline scenario zhao2013efficient. Furthermore, PGPE-like approaches allow overcoming several limitations of classical PG, like the need for a stochastic policy and the high variance of the gradient estimates. [Footnote 1: Other solutions to these problems have been proposed in the action-based literature, like the aforementioned DPG algorithm, the gradient baselines peters2008reinforcement and the actor-critic architectures konda2000actor.]

While on-policy algorithms are, by nature, online, as they need to be fed with fresh samples whenever the policy is updated, off-policy methods can take advantage of mixing online and offline optimization. This can be done by alternately sampling trajectories and performing optimization epochs with the collected data. A prime example of this alternating procedure is Proximal Policy Optimization (PPO) schulman2017proximal, which has displayed remarkable performance on continuous control tasks. Offline optimization, however, introduces further sources of approximation, as the gradient w.r.t. the target policy needs to be estimated (off-policy) with samples collected with a behavioral policy. A common choice is to adopt an importance sampling (IS) mcbook; hesterberg1988advances estimator in which each sample is reweighted proportionally to the likelihood of being generated by the target policy. However, directly optimizing this utility function is impractical since it displays a wide variance most of the time mcbook. Intuitively, the variance increases with the distance between the behavioral and the target policy; thus, the estimate is reliable as long as the two policies are close enough. Preventing uncontrolled updates in the space of policy parameters is at the core of the natural gradient approaches amari1998natural, applied effectively both to PG methods kakade2002approximately; peters2008natural; wierstra2008natural and to PGPE methods miyamae2010natural. More recently, this idea has been captured (albeit indirectly) by TRPO, which optimizes via (approximate) natural gradient a surrogate objective function, derived from safe RL kakade2002approximately; pirotta2013safe, subject to a constraint on the Kullback-Leibler divergence between the behavioral and target policy. [Footnote 2: Note that this regularization term appears in the performance improvement bound, which contains exact quantities only. Thus, it does not really account for the uncertainty derived from the importance sampling.] Similarly, PPO performs a truncation of the importance weights to discourage the optimization process from going too far. Although TRPO and PPO, together with DDPG, represent the state-of-the-art policy optimization methods in RL for continuous control, they do not explicitly encode in their objective function the uncertainty injected by the importance sampling procedure. A more theoretically grounded analysis has been provided for policy selection doroudi2017importance, model-free thomas2015high and model-based thomas2016data policy evaluation (also accounting for samples collected with multiple behavioral policies), and combined with options guo2017using. Subsequently, in thomas2015high2 these methods have been extended for policy improvement, deriving a suitable concentration inequality for the case of truncated importance weights. Unfortunately, these methods are hardly scalable to complex control tasks. A more detailed review of the state-of-the-art policy optimization algorithms is reported in Appendix A.

In this paper, we propose a novel, model-free, actor-only, policy optimization algorithm, named Policy Optimization via Importance Sampling (POIS), that mixes online and offline optimization to efficiently exploit the information contained in the collected trajectories. POIS explicitly accounts for the uncertainty introduced by the importance sampling by optimizing a surrogate objective function. The latter captures the trade-off between the estimated performance improvement and the variance injected by the importance sampling. The main contributions of this paper are theoretical, algorithmic and experimental. After reviewing some notions about importance sampling (Section 3), we propose a concentration inequality, of independent interest, for high-confidence “off-distribution” optimization of objective functions estimated via importance sampling (Section 4). Then we show how this bound can be customized into a surrogate objective function in order to either search in the space of policies (Action-based POIS) or to search in the space of parameters (Parameter-based POIS). The resulting algorithm (in both the action-based and the parameter-based flavor) collects, at each iteration, a set of trajectories. These are used to perform offline optimization of the surrogate objective via gradient ascent (Section 5), after which a new batch of trajectories is collected using the optimized policy. Finally, we provide an experimental evaluation with both linear policies and deep neural policies to illustrate the advantages and limitations of our approach compared to state-of-the-art algorithms (Section 6) on classical control tasks duan2016benchmarking; todorov2012mujoco. The proofs for all Theorems and Lemmas are reported in Appendix B. The implementation of POIS can be found at

2 Preliminaries

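A discrete-time Markov Decision Process (MDP) puterman2014markov is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma, D)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(\cdot|s,a)$ is a Markovian transition model that assigns for each state-action pair $(s,a)$ the probability of reaching the next state $s'$, $\gamma \in [0,1]$ is the discount factor, $R(s,a) \in [-R_{\max}, R_{\max}]$ assigns the expected reward for performing action $a$ in state $s$ and $D$ is the distribution of the initial state. The behavior of an agent is described by a policy $\pi(\cdot|s)$ that assigns for each state $s$ the probability of performing action $a$. A trajectory $\tau \in \mathcal{T}$ is a sequence of state-action pairs $\tau = (s_{\tau,0}, a_{\tau,0}, \dots, s_{\tau,H-1}, a_{\tau,H-1}, s_{\tau,H})$, where $H$ is the actual trajectory horizon. The performance of an agent is evaluated in terms of the expected return, i.e., the expected discounted sum of the rewards collected along the trajectory: $\mathbb{E}_{\tau}[R(\tau)]$, where $R(\tau) = \sum_{t=0}^{H-1} \gamma^t R(s_{\tau,t}, a_{\tau,t})$ is the trajectory return.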
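The trajectory return defined above is a discounted sum, and it is uniformly bounded whenever the rewards are. The following is a minimal sketch of both quantities; the function names `discounted_return` and `return_bound` are our own illustrative choices, not part of the POIS implementation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * rewards[t] for a single trajectory."""
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


def return_bound(r_max, gamma, horizon):
    """Uniform bound on |R(tau)| when every reward lies in [-r_max, r_max].

    Geometric series: r_max * (1 - gamma^H) / (1 - gamma), which reduces to
    H * r_max in the undiscounted case gamma = 1.
    """
    if gamma == 1.0:
        return r_max * horizon
    return r_max * (1.0 - gamma ** horizon) / (1.0 - gamma)
```

For instance, three unit rewards with a discount factor of 0.5 give a return of 1 + 0.5 + 0.25, which coincides with the bound for `r_max = 1` and horizon 3.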

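We focus our attention on the case in which the policy belongs to a parametric policy space $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$. In parameter-based approaches, the agent is equipped with a hyperpolicy $\nu$ used to sample the policy parameters at the beginning of each episode. The hyperpolicy belongs itself to a parametric hyperpolicy space $\mathcal{N}_{\mathcal{P}} = \{\nu_\rho : \rho \in \mathcal{P} \subseteq \mathbb{R}^r\}$. The expected return can be expressed, in the parameter-based case, as a double expectation: one over the policy parameter space $\Theta$ and one over the trajectory space $\mathcal{T}$:

$$J_D(\rho) = \int_{\Theta} \int_{\mathcal{T}} \nu_\rho(\theta)\, p(\tau|\theta)\, R(\tau)\, \mathrm{d}\tau\, \mathrm{d}\theta, \tag{1}$$

where $p(\tau|\theta) = D(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)$ is the trajectory density function. The goal of a parameter-based learning agent is to determine the hyperparameters $\rho^*$ so as to maximize $J_D(\rho)$. If $\nu_\rho$ is stochastic and differentiable, the hyperparameters can be learned according to the gradient ascent update: $\rho' = \rho + \alpha \nabla_\rho J_D(\rho)$, where $\alpha > 0$ is the step size and $\nabla_\rho J_D(\rho) = \int_{\Theta} \int_{\mathcal{T}} \nu_\rho(\theta)\, p(\tau|\theta)\, \nabla_\rho \log \nu_\rho(\theta)\, R(\tau)\, \mathrm{d}\tau\, \mathrm{d}\theta$. Since the stochasticity of the hyperpolicy is a sufficient source of exploration, deterministic action policies of the kind $\pi_\theta(a|s) = \delta(a - u_\theta(s))$ are typically considered, where $\delta$ is the Dirac delta function and $u_\theta$ is a deterministic mapping from $\mathcal{S}$ to $\mathcal{A}$. In the action-based case, on the contrary, the hyperpolicy $\nu_\rho$ is a deterministic distribution $\nu_\rho(\theta) = \delta(\theta - g(\rho))$, where $g$ is a deterministic mapping from $\mathcal{P}$ to $\Theta$. For this reason, the dependence on $\rho$ is typically not represented and the expected return expression simplifies into a single expectation over the trajectory space $\mathcal{T}$:

$$J_D(\theta) = \int_{\mathcal{T}} p(\tau|\theta)\, R(\tau)\, \mathrm{d}\tau. \tag{2}$$

An action-based learning agent aims to find the policy parameters $\theta^*$ that maximize $J_D(\theta)$. In this case, we need to enforce exploration by means of the stochasticity of $\pi_\theta$. For stochastic and differentiable policies, learning can be performed via gradient ascent: $\theta' = \theta + \alpha \nabla_\theta J_D(\theta)$, where $\nabla_\theta J_D(\theta) = \int_{\mathcal{T}} p(\tau|\theta)\, \nabla_\theta \log p(\tau|\theta)\, R(\tau)\, \mathrm{d}\tau$.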
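To make the parameter-based gradient concrete, the sketch below estimates it by Monte Carlo for a scalar Gaussian hyperpolicy with fixed standard deviation, using the likelihood-ratio identity (gradient of the log-density times the return). This is an illustration under our own assumptions: the function name `pgpe_mean_gradient` is hypothetical, and `episode_return` is a toy stand-in for running an episode, here a deterministic function of the sampled parameter.

```python
import random


def pgpe_mean_gradient(episode_return, mean, std, n_episodes, rng):
    """Monte Carlo estimate of d/d(mean) of E_{theta ~ N(mean, std^2)}[R(theta)].

    For a Gaussian hyperpolicy, the score w.r.t. the mean is
    (theta - mean) / std^2, so the estimate averages score * return.
    """
    total = 0.0
    for _ in range(n_episodes):
        theta = rng.gauss(mean, std)           # sample policy parameters
        score = (theta - mean) / (std * std)   # gradient of log-density w.r.t. mean
        total += score * episode_return(theta)
    return total / n_episodes


rng = random.Random(0)
# Toy return R(theta) = theta, whose expected value has gradient 1 w.r.t. the mean.
grad = pgpe_mean_gradient(lambda theta: theta, mean=0.0, std=1.0,
                          n_episodes=50000, rng=rng)
```

With the toy return used here, the estimate should be close to 1, up to Monte Carlo noise.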

3 Evaluation via Importance Sampling

In off-policy evaluation thomas2015high; thomas2016data, we aim to estimate the performance of a target policy $\pi_T$ (or hyperpolicy $\nu_T$) given samples collected with a behavioral policy $\pi_B$ (or hyperpolicy $\nu_B$). More generally, we face the problem of estimating the expected value of a deterministic bounded function $f$ ($\|f\|_\infty < +\infty$) of random variable $x$ taking values in $\mathcal{X}$ under a target distribution $P$, after having collected samples from a behavioral distribution $Q$. The importance sampling estimator (IS) cochran2007sampling; mcbook corrects the distribution with the importance weights (or Radon–Nikodym derivative or likelihood ratio) $w_{P/Q}(x) = p(x)/q(x)$:

$$\hat{\mu}_{P/Q} = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i) = \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i), \tag{3}$$

where $\mathbf{x} = (x_1, x_2, \dots, x_N)^T$ is sampled from $Q$ and we assume $q(x) > 0$ whenever $f(x) p(x) \neq 0$. This estimator is unbiased ($\mathbb{E}_{\mathbf{x} \sim Q}[\hat{\mu}_{P/Q}] = \mathbb{E}_{x \sim P}[f(x)]$) but it may exhibit an undesirable behavior due to the variability of the importance weights, showing, in some cases, infinite variance. Intuitively, the magnitude of the importance weights provides an indication of how much the probability measures $P$ and $Q$ are dissimilar. This notion can be formalized by the Rényi divergence renyi1961measures; van2014renyi, an information-theoretic dissimilarity index between probability measures.
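As a small illustration of the IS estimator, the sketch below reweights samples drawn from a behavioral Gaussian to estimate an expectation under a shifted target Gaussian. The distributions, the bounded function `f` and all names are assumptions chosen for the example only.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def is_estimate(samples, f, p_pdf, q_pdf):
    """Importance sampling estimate of E_{x~P}[f(x)] from samples drawn from Q."""
    weights = [p_pdf(x) / q_pdf(x) for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / len(samples)


rng = random.Random(0)
q = lambda x: gaussian_pdf(x, 0.0, 1.0)   # behavioral Q = N(0, 1)
p = lambda x: gaussian_pdf(x, 0.5, 1.0)   # target P = N(0.5, 1)
f = lambda x: max(-1.0, min(1.0, x))      # bounded function, sup-norm 1

samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
mu_hat = is_estimate(samples, f, p, q)

# Reference value: plain Monte Carlo with samples drawn directly from P.
direct = sum(f(rng.gauss(0.5, 1.0)) for _ in range(20000)) / 20000
```

Since the estimator is unbiased, `mu_hat` and `direct` should agree up to sampling noise; the agreement degrades as the target moves away from the behavioral distribution.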

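Rényi divergence

Let $P$ and $Q$ be two probability measures on a measurable space $(\mathcal{X}, \mathcal{F})$ such that $P \ll Q$ ($P$ is absolutely continuous w.r.t. $Q$) and $Q$ is $\sigma$-finite. Let $P$ and $Q$ admit $p$ and $q$ as Lebesgue probability density functions (p.d.f.), respectively. The $\alpha$-Rényi divergence is defined as:

$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \int_{\mathcal{X}} \left( \frac{\mathrm{d}P}{\mathrm{d}Q} \right)^{\alpha} \mathrm{d}Q = \frac{1}{\alpha - 1} \log \int_{\mathcal{X}} q(x) \left( \frac{p(x)}{q(x)} \right)^{\alpha} \mathrm{d}x, \tag{4}$$

where $\mathrm{d}P/\mathrm{d}Q$ is the Radon–Nikodym derivative of $P$ w.r.t. $Q$ and $\alpha \in [0, \infty]$. Some remarkable cases are: $\alpha = 1$, when $D_1(P \| Q) = D_{\mathrm{KL}}(P \| Q)$, and $\alpha = \infty$, yielding $D_\infty(P \| Q) = \log \operatorname{ess\,sup}_{\mathcal{X}} \mathrm{d}P/\mathrm{d}Q$. Importing the notation from cortes2010learning, we indicate the exponentiated $\alpha$-Rényi divergence as $d_\alpha(P \| Q) = \exp(D_\alpha(P \| Q))$. With little abuse of notation, we will replace $D_\alpha(P \| Q)$ with $D_\alpha(p \| q)$ whenever possible within the context.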

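The Rényi divergence provides a convenient expression for the moments of the importance weights: $\mathbb{E}_{x \sim Q}[w_{P/Q}(x)^\alpha] = d_\alpha(P \| Q)^{\alpha - 1}$. Moreover, $\mathrm{Var}_{x \sim Q}[w_{P/Q}(x)] = d_2(P \| Q) - 1$ and $\operatorname{ess\,sup}_{x \sim Q} w_{P/Q}(x) = d_\infty(P \| Q)$ cortes2010learning. To mitigate the variance problem of the IS estimator, we can resort to the self-normalized importance sampling estimator (SN) cochran2007sampling:

$$\tilde{\mu}_{P/Q} = \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} = \sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i) f(x_i), \tag{5}$$

where $\tilde{w}_{P/Q}(x) = w_{P/Q}(x) / \sum_{i=1}^{N} w_{P/Q}(x_i)$ is the self-normalized importance weight. Differently from $\hat{\mu}_{P/Q}$, $\tilde{\mu}_{P/Q}$ is biased but consistent mcbook and it typically displays a more desirable behavior because of its smaller variance. [Footnote 3: Note that $|\tilde{\mu}_{P/Q}| \le \|f\|_\infty$. Therefore, its variance is always finite.] Given the realization $x_1, x_2, \dots, x_N$ we can interpret the SN estimator as the expected value of $f$ under an approximation of the distribution $P$ made by $N$ deltas, i.e., $\tilde{p}(x) = \sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i)\, \delta(x - x_i)$. The problem of assessing the quality of the SN estimator has been extensively studied by the simulation community, producing several diagnostic indexes to indicate when the weights might display problematic behavior mcbook. The effective sample size ($\mathrm{ESS}$) was introduced in kong1992note as the number of samples drawn from $P$ so that the variance of the Monte Carlo estimator $\tilde{\mu}_{P/P}$ is approximately equal to the variance of the SN estimator $\tilde{\mu}_{P/Q}$ computed with $N$ samples. Here we report the original definition and its most common estimate:

$$\mathrm{ESS}(P \| Q) = \frac{N}{\mathrm{Var}_{x \sim Q}[w_{P/Q}(x)] + 1} = \frac{N}{d_2(P \| Q)}, \qquad \widehat{\mathrm{ESS}}(P \| Q) = \frac{1}{\sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i)^2}. \tag{6}$$

The $\mathrm{ESS}$ has an interesting interpretation: if $d_2(P \| Q) = 1$, i.e., $P = Q$ almost everywhere, then $\mathrm{ESS} = N$ since we are performing Monte Carlo estimation. Otherwise, the $\mathrm{ESS}$ decreases as the dissimilarity between the two distributions increases. In the literature, other $\mathrm{ESS}$-like diagnostics have been proposed that also account for the nature of $f$ martino2017effective.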
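The self-normalized estimator and the common ESS estimate (one over the sum of squared normalized weights) can be sketched as follows. The log-space normalization and every name below are our own illustrative choices, and the Gaussian pair is only an example.

```python
import math
import random


def log_gaussian_pdf(x, mean, std):
    """Log-density of N(mean, std^2) at x."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def normalized_weights(samples, log_p, log_q):
    """Self-normalized weights w_i / sum_j w_j, computed stably in log space."""
    log_w = [log_p(x) - log_q(x) for x in samples]
    shift = max(log_w)                       # avoids overflow in exp
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def sn_estimate(samples, f, log_p, log_q):
    """Self-normalized importance sampling estimate of E_{x~P}[f(x)]."""
    w_tilde = normalized_weights(samples, log_p, log_q)
    return sum(wt * f(x) for wt, x in zip(w_tilde, samples))


def ess_estimate(samples, log_p, log_q):
    """Effective sample size estimate: 1 / sum_i w_tilde_i^2."""
    w_tilde = normalized_weights(samples, log_p, log_q)
    return 1.0 / sum(wt * wt for wt in w_tilde)


rng = random.Random(0)
log_q = lambda x: log_gaussian_pdf(x, 0.0, 1.0)   # behavioral N(0, 1)
log_p = lambda x: log_gaussian_pdf(x, 1.0, 1.0)   # target N(1, 1)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]

ess_same = ess_estimate(samples, log_q, log_q)      # target equals behavioral
ess_shifted = ess_estimate(samples, log_p, log_q)   # shifted target
mu_sn = sn_estimate(samples, math.tanh, log_p, log_q)
```

When target and behavioral distributions coincide all normalized weights equal 1/N, so the ESS estimate equals N; for the shifted target it drops below N, and the SN estimate stays within the range of the bounded function.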

4 Optimization via Importance Sampling

The off-policy optimization problem thomas2015high2 can be formulated as finding the best target policy $\pi_T$ (or hyperpolicy $\nu_T$), i.e., the one maximizing the expected return, having access to a set of samples collected with a behavioral policy $\pi_B$ (or hyperpolicy $\nu_B$). In a more abstract sense, we aim to determine the target distribution $P$ that maximizes $\mathbb{E}_{x \sim P}[f(x)]$ having samples collected from the fixed behavioral distribution $Q$. In this section, we analyze the problem of defining a proper objective function for this purpose. Directly optimizing the estimator $\hat{\mu}_{P/Q}$ or $\tilde{\mu}_{P/Q}$ is, in most cases, unsuccessful. With enough freedom in choosing $P$, the optimal solution would assign as much probability mass as possible to the maximum value among $f(x_i)$. Clearly, in this scenario, the estimator is unreliable and displays a large variance. For this reason, we adopt a risk-averse approach and we decide to optimize a statistical lower bound of the expected value $\mathbb{E}_{x \sim P}[f(x)]$ that holds with high confidence. We start by analyzing the behavior of the IS estimator and we provide the following result that bounds the variance of $\hat{\mu}_{P/Q}$ in terms of the Rényi divergence.

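Lemma 4.1.

Let $P$ and $Q$ be two probability measures on the measurable space $(\mathcal{X}, \mathcal{F})$ such that $P \ll Q$. Let $\mathbf{x} = (x_1, x_2, \dots, x_N)^T$ be i.i.d. random variables sampled from $Q$ and $f : \mathcal{X} \to \mathbb{R}$ be a bounded function ($\|f\|_\infty < +\infty$). Then, for any $N > 0$, the variance of the IS estimator $\hat{\mu}_{P/Q}$ can be upper bounded as:

$$\mathrm{Var}_{\mathbf{x} \sim Q}\left[ \hat{\mu}_{P/Q} \right] \le \frac{1}{N} \|f\|_\infty^2\, d_2(P \| Q). \tag{7}$$

When $P = Q$ almost everywhere, we get $\mathrm{Var}_{\mathbf{x} \sim Q}[\hat{\mu}_{Q/Q}] \le \frac{1}{N} \|f\|_\infty^2$, a well-known bound on the variance of a Monte Carlo estimator. Recalling the definition of ESS (6) we can rewrite the previous bound as: $\mathrm{Var}_{\mathbf{x} \sim Q}[\hat{\mu}_{P/Q}] \le \frac{\|f\|_\infty^2}{\mathrm{ESS}(P \| Q)}$, i.e., the variance scales with ESS instead of $N$. While $\hat{\mu}_{P/Q}$ can have unbounded variance even if $f$ is bounded, the SN estimator is always bounded by $\|f\|_\infty$ and therefore it always has a finite variance. Since the normalization term makes all the samples interdependent, an exact analysis of its bias and variance is more challenging. Several works adopted approximate methods to provide an expression for the variance hesterberg1988advances. We propose an analysis of bias and variance of the SN estimator in Appendix D.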
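For univariate Gaussians the exponentiated 2-Rényi divergence has a closed form, which makes the variance bound of Lemma 4.1 easy to evaluate. The closed form below is the standard Gaussian result and is not quoted from this section; the sketch cross-checks it against a numerical integral of p(x)^2 / q(x). All function names are hypothetical.

```python
import math


def d2_gaussian(mean_p, std_p, mean_q, std_q):
    """Exponentiated 2-Renyi divergence d_2(P || Q) between univariate Gaussians.

    Returns math.inf when 2 * std_q^2 - std_p^2 <= 0, i.e. when the
    importance weights have infinite variance.
    """
    s2 = 2.0 * std_q ** 2 - std_p ** 2
    if s2 <= 0.0:
        return math.inf
    scale = std_q ** 2 / (std_p * math.sqrt(s2))
    return scale * math.exp((mean_p - mean_q) ** 2 / s2)


def d2_numeric(mean_p, std_p, mean_q, std_q, lo=-20.0, hi=20.0, step=1e-3):
    """Riemann-sum approximation of the integral of p(x)^2 / q(x)."""
    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        total += pdf(x, mean_p, std_p) ** 2 / pdf(x, mean_q, std_q)
    return total * step


def is_variance_bound(f_max, d2, n):
    """Upper bound f_max^2 * d_2(P || Q) / N on the variance of the IS estimator."""
    return f_max ** 2 * d2 / n
```

The infinite value returned for a target standard deviation of at least sqrt(2) times the behavioral one signals that the bound becomes vacuous there.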

4.1 Concentration Inequality

Finding a suitable concentration inequality for off-policy learning was studied in thomas2015high for offline policy evaluation and subsequently in thomas2015high2 for optimization. On one hand, fully empirical concentration inequalities, like Student's t, besides the asymptotic approximation, are not suitable in this case since the empirical variance needs to be estimated with importance sampling as well, injecting further uncertainty mcbook. On the other hand, several distribution-free inequalities like Hoeffding require knowing the maximum of the estimator, which might not exist ($d_\infty(P \| Q) = \infty$) for the IS estimator. Constraining $d_\infty(P \| Q)$ to be finite often introduces unacceptable limitations. For instance, in the case of univariate Gaussian distributions, it prevents a step that selects a target variance larger than the behavioral one from being performed (see Appendix C). [Footnote 4: Although the variance tends to be reduced in the learning process, there might be cases in which it needs to be increased (e.g., if we start with a behavioral policy with small variance, it might be beneficial to increase the variance to enforce exploration).] Even Bernstein inequalities bercu2015concentration are hardly applicable since, for instance, in the case of univariate Gaussian distributions, the importance weights display a fat-tail behavior (see Appendix C). We believe that a reasonable trade-off is to require the variance of the importance weights to be finite, which is equivalent to requiring $d_2(P \| Q) < \infty$, i.e., $\sigma_P < \sqrt{2}\, \sigma_Q$ for univariate Gaussians. For this reason, we resort to Chebyshev-like inequalities and we propose the following concentration bound derived from Cantelli’s inequality and customized for the IS estimator.

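Theorem 4.1.

Let $P$ and $Q$ be two probability measures on the measurable space $(\mathcal{X}, \mathcal{F})$ such that $P \ll Q$ and $d_2(P \| Q) < +\infty$. Let $x_1, x_2, \dots, x_N$ be i.i.d. random variables sampled from $Q$, and $f : \mathcal{X} \to \mathbb{R}$ be a bounded function ($\|f\|_\infty < +\infty$). Then, for any $0 < \delta \le 1$ and $N > 0$ with probability at least $1 - \delta$ it holds that:

$$\mathbb{E}_{x \sim P}[f(x)] \ge \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) - \|f\|_\infty \sqrt{\frac{(1 - \delta)\, d_2(P \| Q)}{\delta N}}. \tag{8}$$

The bound highlights the interesting trade-off between the estimated performance and the uncertainty introduced by changing the distribution. The latter enters in the bound as the 2-Rényi divergence between the target distribution $P$ and the behavioral distribution $Q$. Intuitively, we should trust the estimator $\hat{\mu}_{P/Q}$ as long as $P$ is not too far from $Q$. For the SN estimator, accounting for the bias, we are able to obtain a bound (reported in Appendix D), with a similar dependence on $P$ as in Theorem 4.1, albeit with different constants. Renaming all constants involved in the bound of Theorem 4.1 as $\lambda = \|f\|_\infty \sqrt{(1 - \delta)/\delta}$, we get a surrogate objective function: $\mathcal{L}_\lambda(P/Q) = \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) - \lambda \sqrt{d_2(P \| Q)/N}$. The optimization can be carried out in different ways. The following section shows why using the natural gradient could be a successful choice in case $P$ and $Q$ can be expressed as parametric differentiable distributions.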
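The lower bound of Theorem 4.1 and the surrogate obtained by grouping its constants can be evaluated directly once the importance weights, the function values and the 2-Rényi term are available. A minimal sketch follows, with hypothetical names `penalty_coefficient` and `surrogate_objective` and with the divergence `d2` passed in as a number computed elsewhere.

```python
import math


def penalty_coefficient(f_max, delta):
    """lambda = f_max * sqrt((1 - delta) / delta), for a confidence level 1 - delta."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return f_max * math.sqrt((1.0 - delta) / delta)


def surrogate_objective(weights, values, d2, lam):
    """IS estimate minus the penalty lam * sqrt(d2 / N).

    weights: importance weights w(x_i); values: f(x_i);
    d2: exponentiated 2-Renyi divergence between target and behavioral.
    """
    n = len(weights)
    is_term = sum(w * v for w, v in zip(weights, values)) / n
    return is_term - lam * math.sqrt(d2 / n)
```

A small `delta` (high confidence) gives a large coefficient and therefore a heavily penalized, conservative objective, while `delta = 1` removes the penalty altogether.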

4.2 Importance Sampling and Natural Gradient

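We can look at a parametric distribution $P_\omega$, having $p_\omega$ as a density function, as a point on a probability manifold with coordinates $\omega \in \Omega$. If $p_\omega$ is differentiable, the Fisher Information Matrix (FIM) rao1992information; amari2012differential is defined as: $\mathcal{F}(\omega) = \int_{\mathcal{X}} p_\omega(x)\, \nabla_\omega \log p_\omega(x)\, \nabla_\omega \log p_\omega(x)^T\, \mathrm{d}x$. This matrix is, up to a scale, an invariant metric amari1998natural on parameter space $\Omega$, i.e., $\kappa (\omega' - \omega)^T \mathcal{F}(\omega) (\omega' - \omega)$ is independent of the specific parameterization and provides a second order approximation of the distance between $p_\omega$ and $p_{\omega'}$ on the probability manifold up to a scale factor $\kappa \in \mathbb{R}$. Given a loss function $\mathcal{L}(\omega)$, we define the natural gradient amari1998natural; kakade2002natural as $\tilde{\nabla}_\omega \mathcal{L}(\omega) = \mathcal{F}^{-1}(\omega) \nabla_\omega \mathcal{L}(\omega)$, which represents the steepest ascent direction in the probability manifold. Thanks to the invariance property, there is a tight connection between the geometry induced by the Rényi divergence and the Fisher information metric.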

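Theorem 4.2.

Let $p_\omega$ be a p.d.f. differentiable w.r.t. $\omega \in \Omega$. Then, it holds that, for the Rényi divergence: $D_\alpha(p_{\omega'} \| p_\omega) = \frac{\alpha}{2} (\omega' - \omega)^T \mathcal{F}(\omega) (\omega' - \omega) + o(\|\omega' - \omega\|_2^2)$, and for the exponentiated Rényi divergence: $d_\alpha(p_{\omega'} \| p_\omega) = 1 + \frac{\alpha}{2} (\omega' - \omega)^T \mathcal{F}(\omega) (\omega' - \omega) + o(\|\omega' - \omega\|_2^2)$.

This result provides an approximate expression for the variance of the importance weights, as $\mathrm{Var}_{x \sim p_\omega}[w_{\omega'/\omega}(x)] = d_2(p_{\omega'} \| p_\omega) - 1 \simeq (\omega' - \omega)^T \mathcal{F}(\omega) (\omega' - \omega)$. It also justifies the use of natural gradients in off-distribution optimization, since a step in natural gradient direction has a controllable effect on the variance of the importance weights.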
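As a sanity check of the second-order expansion, consider a univariate Gaussian with fixed standard deviation $\sigma$, parameterized by its mean only. This worked case is our own illustration, not an example taken from the paper; it relies on the standard closed form of the Rényi divergence between Gaussians with equal variance.

```latex
\begin{align*}
p_\mu &= \mathcal{N}(\mu, \sigma^2), \qquad \mathcal{F}(\mu) = \frac{1}{\sigma^2}, \\
D_\alpha(p_{\mu'} \| p_\mu) &= \frac{\alpha}{2} \frac{(\mu' - \mu)^2}{\sigma^2}
  = \frac{\alpha}{2} (\mu' - \mu)\, \mathcal{F}(\mu)\, (\mu' - \mu), \\
d_2(p_{\mu'} \| p_\mu) &= \exp\!\left( \frac{(\mu' - \mu)^2}{\sigma^2} \right)
  = 1 + \frac{(\mu' - \mu)^2}{\sigma^2} + o\big( (\mu' - \mu)^2 \big), \\
\mathrm{Var}_{x \sim p_\mu}\!\left[ w_{\mu'/\mu}(x) \right]
  &= d_2(p_{\mu'} \| p_\mu) - 1 \simeq \frac{(\mu' - \mu)^2}{\sigma^2}.
\end{align*}
```

In this special case the quadratic form reproduces the Rényi divergence exactly, and the variance of the importance weights grows with the squared shift of the mean measured in units of $\sigma$, which is the quantity a natural gradient step keeps under control.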

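Algorithm 1 Action-based POIS
Initialize $\theta^0_0$ arbitrarily
for $j = 0, 1, 2, \dots$ until convergence do
     Collect $N$ trajectories with $\pi_{\theta^j_0}$
     for $k = 0, 1, 2, \dots$ until convergence do
          Compute $G(\theta^j_k)$, $\nabla_{\theta^j_k} \mathcal{L}(\theta^j_k / \theta^j_0)$ and $\alpha_k$
          $\theta^j_{k+1} = \theta^j_k + \alpha_k G(\theta^j_k)^{-1} \nabla_{\theta^j_k} \mathcal{L}(\theta^j_k / \theta^j_0)$
     end for
     $\theta^{j+1}_0 = \theta^j_k$
end for

Algorithm 2 Parameter-based POIS
Initialize $\rho^0_0$ arbitrarily
for $j = 0, 1, 2, \dots$ until convergence do
     Sample $N$ policy parameters $\theta^j_i$ from $\nu_{\rho^j_0}$
     Collect a trajectory with each $\pi_{\theta^j_i}$
     for $k = 0, 1, 2, \dots$ until convergence do
          Compute $G(\rho^j_k)$, $\nabla_{\rho^j_k} \mathcal{L}(\rho^j_k / \rho^j_0)$ and $\alpha_k$
          $\rho^j_{k+1} = \rho^j_k + \alpha_k G(\rho^j_k)^{-1} \nabla_{\rho^j_k} \mathcal{L}(\rho^j_k / \rho^j_0)$
     end for
     $\rho^{j+1}_0 = \rho^j_k$
end for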

5 Policy Optimization via Importance Sampling

In this section, we discuss how to customize the bound provided in Theorem 4.1 for policy optimization, developing a novel model-free actor-only policy search algorithm, named Policy Optimization via Importance Sampling (POIS). We propose two versions of POIS: Action-based POIS (A-POIS), which is based on a policy gradient approach, and Parameter-based POIS (P-POIS), which adopts the PGPE framework. A more detailed description of the implementation aspects is reported in Appendix E.

5.1 Action-based POIS

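In Action-based POIS (A-POIS) we search for a policy that maximizes the performance index $J_D(\theta)$ within a parametric space $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$ of stochastic differentiable policies. In this context, the behavioral (resp. target) distribution $Q$ (resp. $P$) becomes the distribution over trajectories $p(\cdot|\theta)$ (resp. $p(\cdot|\theta')$) induced by the behavioral policy $\pi_\theta$ (resp. target policy $\pi_{\theta'}$) and $f$ is the trajectory return $R(\tau)$, which is uniformly bounded as $|R(\tau)| \le R_{\max} \frac{1 - \gamma^H}{1 - \gamma}$. [Footnote 5: When $\gamma \to 1$ the bound becomes $H R_{\max}$.] The surrogate loss function cannot be directly optimized via gradient ascent since computing $d_2(p(\cdot|\theta') \| p(\cdot|\theta))$ requires the approximation of an integral over the trajectory space and, for stochastic environments, knowledge of the transition model $P$, which is unknown in a model-free setting. Simple bounds to this quantity, like $d_2(p(\cdot|\theta') \| p(\cdot|\theta)) \le \sup_{s \in \mathcal{S}} d_2(\pi_{\theta'}(\cdot|s) \| \pi_\theta(\cdot|s))^H$, besides being hard to compute due to the presence of the supremum, are extremely conservative since the Rényi divergence is raised to the horizon $H$. We suggest the replacement of the Rényi divergence with an estimate $\hat{d}_2(p(\cdot|\theta') \| p(\cdot|\theta)) = \frac{1}{N} \sum_{i=1}^{N} \prod_{t=0}^{H-1} d_2(\pi_{\theta'}(\cdot|s_{\tau_i,t}) \| \pi_\theta(\cdot|s_{\tau_i,t}))$ defined only in terms of the policy Rényi divergence (see Appendix E.2 for details). Thus, we obtain the following surrogate objective:

$$\mathcal{L}^{\text{A-POIS}}_\lambda(\theta'/\theta) = \frac{1}{N} \sum_{i=1}^{N} w_{\theta'/\theta}(\tau_i) R(\tau_i) - \lambda \sqrt{\frac{\hat{d}_2(p(\cdot|\theta') \| p(\cdot|\theta))}{N}},$$

where $w_{\theta'/\theta}(\tau_i) = \frac{p(\tau_i|\theta')}{p(\tau_i|\theta)} = \prod_{t=0}^{H-1} \frac{\pi_{\theta'}(a_{\tau_i,t}|s_{\tau_i,t})}{\pi_\theta(a_{\tau_i,t}|s_{\tau_i,t})}$. We consider the case in which $\pi_\theta(\cdot|s)$ is a Gaussian distribution over actions whose mean depends on the state and whose covariance is state-independent and diagonal: $\mathcal{N}(u_\mu(s), \mathrm{diag}(\sigma^2))$, where $\theta = (\mu, \sigma)$. The learning process mixes online and offline optimization. At each online iteration $j$, a dataset of $N$ trajectories is collected by executing in the environment the current policy $\pi_{\theta^j_0}$. These trajectories are used to optimize the surrogate loss function $\mathcal{L}^{\text{A-POIS}}_\lambda$. At each offline iteration $k$, the parameters are updated via gradient ascent: $\theta^j_{k+1} = \theta^j_k + \alpha_k G(\theta^j_k)^{-1} \nabla_{\theta^j_k} \mathcal{L}(\theta^j_k / \theta^j_0)$, where $\alpha_k > 0$ is the step size which is chosen via line search (see Appendix E.1) and $G(\theta^j_k)$ is a positive semi-definite matrix (e.g., $\mathcal{F}(\theta^j_k)$, the FIM, for natural gradient). [Footnote 6: The FIM needs to be estimated via importance sampling as well, as shown in Appendix E.3.] The pseudo-code of POIS is reported in Algorithm 1.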
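The ingredients of the action-based surrogate can be sketched from per-step quantities: the trajectory importance weight is a product of per-step policy likelihood ratios, and the divergence estimate averages, over trajectories, the product of per-step policy divergences. The data layout (lists of per-step log-probabilities and per-step divergences) and all function names are assumptions made for this illustration, not the layout used by the POIS implementation.

```python
import math


def trajectory_weight(target_logps, behav_logps):
    """Importance weight of one trajectory.

    Product over time steps of pi_target(a_t|s_t) / pi_behav(a_t|s_t),
    computed as the exponential of a difference of summed log-probabilities.
    """
    return math.exp(sum(target_logps) - sum(behav_logps))


def estimated_trajectory_d2(per_step_d2):
    """Average over trajectories of the product of per-step policy d_2 values."""
    return sum(math.prod(steps) for steps in per_step_d2) / len(per_step_d2)


def a_pois_surrogate(weights, returns, d2_hat, lam):
    """Weighted average return minus lam * sqrt(d2_hat / N)."""
    n = len(weights)
    is_term = sum(w * r for w, r in zip(weights, returns)) / n
    return is_term - lam * math.sqrt(d2_hat / n)


# Two toy trajectories of two steps each, with made-up numbers.
w = [trajectory_weight([-1.0, -2.0], [-1.0, -2.0]),   # identical policies: weight 1
     trajectory_weight([-0.5, -1.0], [-1.0, -1.5])]   # target more likely: weight e
d2_hat = estimated_trajectory_d2([[1.0, 1.0], [1.2, 1.5]])
value = a_pois_surrogate(w, [2.0, 4.0], d2_hat, lam=1.0)
```

Because the per-step divergences are multiplied along each trajectory, the penalty grows quickly with the horizon, which is the sensitivity the text attributes to trajectory-level importance sampling.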

5.2 Parameter-based POIS

In the Parameter-based POIS (P-POIS) we again consider a parametrized policy space $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, but $\pi_\theta$ need not be differentiable. The policy parameters $\theta$ are sampled at the beginning of each episode from a parametric hyperpolicy $\nu_\rho$ selected in a parametric space $\mathcal{N}_{\mathcal{P}} = \{\nu_\rho : \rho \in \mathcal{P} \subseteq \mathbb{R}^r\}$. The goal is to learn the hyperparameters $\rho$ so as to maximize $J_D(\rho)$. In this setting, the distributions $Q$ and $P$ of Section 4 correspond to the behavioral $\nu_\rho$ and target $\nu_{\rho'}$ hyperpolicies, while $f$ remains the trajectory return $R(\tau)$. The importance weights zhao2013efficient must take into account all sources of randomness, derived from sampling a policy parameter $\theta$ and a trajectory $\tau$: $w_{\rho'/\rho}(\theta) = \frac{\nu_{\rho'}(\theta)\, p(\tau|\theta)}{\nu_\rho(\theta)\, p(\tau|\theta)} = \frac{\nu_{\rho'}(\theta)}{\nu_\rho(\theta)}$. In practice, a Gaussian hyperpolicy $\nu_\rho$ with diagonal covariance matrix is often used, i.e., $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ with $\rho = (\mu, \sigma)$. The policy is assumed to be deterministic: $\pi_\theta(a|s) = \delta(a - u_\theta(s))$, where $u_\theta$ is a deterministic function of the state $s$ (e.g., sehnke2010parameter; gruttnermulti). A first advantage over the action-based setting is that the distribution of the importance weights is entirely known, as it is the ratio of two Gaussians and the Rényi divergence $d_2(\nu_{\rho'} \| \nu_\rho)$ can be computed exactly burbea1984convexity (see Appendix C). This leads to the following surrogate objective:

$$\mathcal{L}^{\text{P-POIS}}_\lambda(\rho'/\rho) = \frac{1}{N} \sum_{i=1}^{N} w_{\rho'/\rho}(\theta_i) R(\tau_i) - \lambda \sqrt{\frac{d_2(\nu_{\rho'} \| \nu_\rho)}{N}},$$

where each trajectory $\tau_i$ is obtained by running an episode with action policy $\pi_{\theta_i}$, and the corresponding policy parameters $\theta_i$ are sampled independently from hyperpolicy $\nu_\rho$, at the beginning of each episode. The hyperpolicy parameters are then updated offline as $\rho^j_{k+1} = \rho^j_k + \alpha_k G(\rho^j_k)^{-1} \nabla_{\rho^j_k} \mathcal{L}(\rho^j_k / \rho^j_0)$ (see Algorithm 2 for the complete pseudo-code). A further advantage w.r.t. the action-based case is that the FIM can be computed exactly, and it is diagonal in the case of a Gaussian hyperpolicy with diagonal covariance matrix, turning a problematic inversion into a trivial division (the FIM is block-diagonal in the more general case of a Gaussian hyperpolicy, as observed in miyamae2010natural). This makes natural gradient much more enticing for P-POIS.
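The alternation between an online sampling phase and an offline optimization phase can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are not in the paper: a scalar policy parameter, a Gaussian hyperpolicy whose standard deviation is kept fixed (only the mean is learned), a finite-difference gradient, and a fixed step size in place of the line search. `episode_return` is a hypothetical stand-in for running an episode, and all names are illustrative.

```python
import math
import random


def log_ratio(theta, mean_t, mean_b, std):
    """Log of nu_target(theta) / nu_behav(theta) for Gaussians sharing std."""
    return ((theta - mean_b) ** 2 - (theta - mean_t) ** 2) / (2.0 * std ** 2)


def surrogate(mean_t, mean_b, std, thetas, returns, lam):
    """IS estimate of the expected return minus lam * sqrt(d_2 / N)."""
    n = len(thetas)
    est = sum(math.exp(log_ratio(t, mean_t, mean_b, std)) * r
              for t, r in zip(thetas, returns)) / n
    d2 = math.exp((mean_t - mean_b) ** 2 / std ** 2)  # exact for equal std
    return est - lam * math.sqrt(d2 / n)


def p_pois_mean_only(episode_return, mean=0.0, std=1.0, lam=1.0, n=100,
                     iterations=20, offline_steps=10, step=0.5, seed=0):
    rng = random.Random(seed)
    eps = 1e-4
    for _ in range(iterations):
        # Online phase: sample policy parameters and run one episode for each.
        thetas = [rng.gauss(mean, std) for _ in range(n)]
        returns = [episode_return(t) for t in thetas]
        # Offline phase: ascend the surrogate while reusing the same batch.
        mean_t = mean
        for _ in range(offline_steps):
            grad = (surrogate(mean_t + eps, mean, std, thetas, returns, lam)
                    - surrogate(mean_t - eps, mean, std, thetas, returns, lam)
                    ) / (2.0 * eps)
            mean_t += step * std ** 2 * grad  # natural gradient: FIM = 1 / std^2
        mean = mean_t
    return mean


# Toy bounded return in [0, 1], maximized at theta = 2.
final_mean = p_pois_mean_only(lambda theta: math.exp(-(theta - 2.0) ** 2))
```

With returns bounded by 1, the choice `lam = 1.0` corresponds to a confidence parameter of 0.5 in the bound. The penalty keeps each offline phase from moving the mean too far from the distribution that generated the batch, so progress toward the optimum is spread over several online iterations.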

Figure 1: Average return as a function of the number of trajectories for A-POIS, P-POIS and TRPO with linear policy (20 runs, 95% c.i.). Panels: (a) Cartpole, (b) Inverted Double Pendulum, (c) Acrobot, (d) Mountain Car, (e) Inverted Pendulum. The table reports the best hyperparameters found ($\delta$ for POIS and the step size for TRPO and PPO).

(a) Cartpole: 0.4, 0.4, 0.1, 0.01
(b) Inverted Double Pendulum: 0.1, 0.1, 0.1, 1
(c) Acrobot: 0.7, 0.2, 1, 1
(d) Mountain Car: 0.9, 1, 0.01, 1
(e) Inverted Pendulum: 0.9, 0.8, 0.01, 0.01

6 Experimental Evaluation

In this section, we present the experimental evaluation of POIS in its two flavors (action-based and parameter-based). We first provide a set of empirical comparisons on classical continuous control tasks with linearly parametrized policies; we then show how POIS can be also adopted for learning deep neural policies. In all experiments, for the A-POIS we used the IS estimator, while for P-POIS we employed the SN estimator. All experimental details are provided in Appendix F.

6.1 Linear Policies

Linear parametrized Gaussian policies proved their ability to scale on complex control tasks rajeswaran2017towards. In this section, we compare the learning performance of A-POIS and P-POIS against TRPO schulman2015trust and PPO schulman2017proximal on classical continuous control benchmarks duan2016benchmarking. In Figure 1, we can see that both versions of POIS are able to significantly outperform both TRPO and PPO in the Cartpole environments, especially the P-POIS. In the Inverted Double Pendulum environment the learning curve of P-POIS is remarkable while A-POIS displays a behavior comparable to PPO. In the Acrobot task, P-POIS displays a better performance w.r.t. TRPO and PPO, but A-POIS does not keep up. In Mountain Car, we see yet another behavior: the learning curves of TRPO, PPO and P-POIS are almost one-shot (even if PPO shows a small instability), while A-POIS fails to display such a fast convergence. Finally, in the Inverted Pendulum environment, TRPO and PPO outperform both versions of POIS. This example highlights a limitation of our approach. Since POIS performs an importance sampling procedure at trajectory level, it cannot assign credit to good actions in bad trajectories. On the contrary, weighting each sample, TRPO and PPO are able also to exploit good trajectory segments. In principle, this problem can be mitigated in POIS by resorting to per-decision importance sampling precup2000eligibility, in which the weight is assigned to individual rewards instead of trajectory returns. Overall, POIS displays a performance comparable with TRPO and PPO across the tasks. In particular, P-POIS displays a better performance w.r.t. A-POIS. However, this ordering is not maintained when moving to more complex policy architectures, as shown in the next section.

In Figure 2 we show, for several metrics, the behavior of A-POIS when changing the $\delta$ parameter in the Cartpole environment. We can see that when $\delta$ is small (e.g., $0.2$), the Effective Sample Size ($\mathrm{ESS}$) remains large and, consequently, the variance of the importance weights ($\mathrm{Var}[w]$) is small. This means that the penalization term in the objective function discourages the optimization process from selecting policies which are far from the behavioral policy. As a consequence, the displayed behavior is very conservative, preventing the policy from reaching the optimum. On the contrary, when $\delta$ approaches 1, the ESS is smaller and the variance of the weights tends to increase significantly. Again, the performance remains suboptimal as the penalization term in the objective function is too light. The best behavior is obtained with an intermediate value of $\delta$, specifically $0.4$.

6.2 Deep Neural Policies

In this section, we adopt a deep neural network (3 layers: 100, 50, 25 neurons each) to represent the policy. The experiment setup is fully compatible with the classical benchmark duan2016benchmarking. While A-POIS can be directly applied to deep neural networks, P-POIS exhibits some critical issues. A highly dimensional hyperpolicy (like a Gaussian from which the weights of an MLP policy are sampled) can make $d_2(\nu_{\rho'} \| \nu_\rho)$ extremely sensitive to small parameter changes, leading to over-conservative updates. [Footnote 7: This curse of dimensionality, related to $\dim(\theta)$, has some similarities with the dependence of the Rényi divergence on the actual horizon $H$ in the action-based case.] A first practical variant comes from the insight that $d_2(\nu_{\rho'} \| \nu_\rho)/N$ is the inverse of the effective sample size, as reported in Equation 6. We can obtain a less conservative (although approximate) surrogate function by replacing it with $1/\widehat{\mathrm{ESS}}(\nu_{\rho'} \| \nu_\rho)$. Another trick is to model the hyperpolicy as a set of independent Gaussians, each defined over a disjoint subspace of $\Theta$ (implementation details are provided in Appendix E.5). In Table 1, we augmented the results provided in duan2016benchmarking with the performance of POIS for the considered tasks. We can see that A-POIS is able to reach an overall behavior comparable with the best of the action-based algorithms, approaching TRPO and beating DDPG. Similarly, P-POIS exhibits a performance similar to CEM szita2006learning, the best performing among the parameter-based methods. The complete results are reported in Appendix F.
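The first variant described above replaces the exact divergence term divided by N with the inverse of the estimated effective sample size, which depends only on the observed importance weights. A minimal sketch follows; the function name `ess_penalty` is hypothetical.

```python
import math


def ess_penalty(weights, lam):
    """Penalty lam * sqrt(1 / ESS_hat), with ESS_hat = 1 / sum_i w_tilde_i^2.

    weights are unnormalized importance weights; they are normalized here,
    so the penalty reduces to lam * sqrt(sum of squared normalized weights).
    """
    total = sum(weights)
    sum_sq = sum((w / total) ** 2 for w in weights)
    return lam * math.sqrt(sum_sq)
```

With uniform weights the penalty equals `lam / sqrt(N)`, matching the exact term when target and behavioral hyperpolicies coincide; when a single weight dominates, the estimated effective sample size approaches 1 and the penalty approaches `lam`.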

Figure 2: Average return, Effective Sample Size (ESS) and variance of the importance weights () as a function of the number of trajectories for A-POIS for different values of the parameter in the Cartpole environment (20 runs, 95% c.i.).
Algorithm | Cart-Pole Balancing | Mountain Car | Double Inverted Pendulum | Swimmer
A-POIS
CEM
P-POIS
Table 1: Performance of POIS compared with [duan2016benchmarking] on deep neural policies (5 runs, 95% c.i.). In bold, the performances that are not statistically significantly different from the best algorithm in each task.

7 Discussion and Conclusions

In this paper, we presented a new actor-only policy optimization algorithm, POIS, which alternates online and offline optimization in order to efficiently exploit the collected trajectories, and can be used in combination with action-based and parameter-based exploration. In contrast to the state-of-the-art algorithms, POIS has a strong theoretical grounding, since its surrogate objective function derives from a statistical bound on the estimated performance that captures the uncertainty induced by importance sampling. The experimental evaluation showed that POIS, in both its versions (action-based and parameter-based), achieves a performance comparable with TRPO, PPO and other classical algorithms on continuous control tasks. Natural extensions of POIS could focus on employing per-decision importance sampling, adaptive batch size, and trajectory reuse. Future work also includes scaling POIS to high-dimensional tasks and highly stochastic environments. We believe that this work represents a valuable starting point for a deeper understanding of modern policy optimization and for the development of effective and scalable policy search methods.


The study was partially funded by Lombardy Region (Announcement PORFESR 2014-2020).
F.F. was partially funded through ERC Advanced Grant (no: 742870).


Index of the Appendix

In the following, we briefly recap the contents of the Appendix.

  • Appendix A shows a more detailed comparison of POIS with the policy-search algorithms. Table 2 summarizes some features of the considered methods.

  • Appendix B reports all proofs and derivations.

  • Appendix C provides an analysis of the distribution of the importance weights in the case of univariate Gaussian behavioral and target distributions.

  • Appendix D shows some bounds on bias and variance for the self-normalized importance sampling estimator and provides a high confidence bound.

  • Appendix E illustrates some implementation details of POIS, in particular line search algorithms, estimation of the Rényi divergence, computation of the FIM and practical versions of P-POIS.

  • Appendix F provides the hyperparameters used in the experiments and further results.

Appendix A Related Works

Policy optimization algorithms can be classified according to different dimensions (Table 2). It is by now established, in the policy-based RL community, that effective algorithms, either on-policy or off-policy, should account for the variance of the gradient estimate. Early attempts, in the class of action-based algorithms, use a baseline to reduce the variance of the estimated gradient without introducing bias [baxter2001infinite, peters2008reinforcement]. A similar rationale is at the basis of actor-critic architectures [konda2000actor, sutton2000policy, peters2008natural], in which an estimate of the value function is used to reduce uncertainty. Baselines are typically constant (REINFORCE), time-dependent (G(PO)MDP) or state-dependent (actor-critic), but these approaches have recently been extended to account for action-dependent baselines [tucker2018mirage, wu2018variance]. Even though parameter-based algorithms are, by nature, affected by smaller variance than action-based ones, similar baselines can be derived [zhao2011analysis]. A first dichotomy in the class of policy-based algorithms arises when considering the minimal unit used to compute the gradient. Episode-based (or episodic) approaches [e.g., williams1992simple, baxter2001infinite] estimate the gradient by averaging the gradients of the individual episodes, each of which must have a finite horizon. On the contrary, step-based approaches [e.g., schulman2015trust, schulman2017proximal, lillicrap2015continuous], derived from the Policy Gradient Theorem [sutton2000policy], can estimate the gradient by averaging over timesteps. The latter require a function approximator (a critic) to estimate the Q-function, or directly the advantage function [schulman2015high]. When it comes to the on/off-policy dichotomy, the previous distinction has a relevant impact.
Indeed, episode-based approaches need to perform importance sampling on trajectories, thus the importance weights are the products of policy ratios for all executed actions within a trajectory, whereas step-based algorithms just need to weight each sample with the corresponding policy ratio. The latter case helps to keep the value of the importance weights close to one, but the need for a critic prevents a complete analysis of the uncertainty, since the bias/variance injected by the critic is hard to compute [konda2000actor]. Moreover, in the off-policy scenario, it is necessary to control some notion of dissimilarity between the behavioral and target policy, as the variance increases when moving too far. This is the case of TRPO [schulman2015trust], where the regularization constraint based on the Kullback-Leibler divergence helps control the importance weights but originates from an exact bound on the performance improvement. Intuitively, the same rationale applies to the truncation of the importance weights, employed by PPO, which avoids performing too large steps in the policy space. Nevertheless, the step size in TRPO and the truncation range in PPO are just hyperparameters and have a limited statistical meaning. On the contrary, other actor-critic architectures have been proposed, including experience replay methods like [wang2016sample], in which the importance weights are truncated but the method is able to account for the injected bias. The authors propose to keep a running mean of the best policies seen so far to avoid a hard constraint on the policy dissimilarity. Differently from these methods, POIS directly models the uncertainty due to the importance sampling procedure. The bound in Theorem 4.1 introduces a single hyperparameter δ, which has a precise statistical meaning as a confidence level. The optimal value of δ (like the step size in TRPO and the truncation range in PPO) is task-dependent and might vary during the learning procedure.
Furthermore, POIS is an episode-based approach in which the importance weights account for the whole trajectory at once; this might prevent it from assigning credit to valuable subtrajectories (as in the case of Inverted Pendulum, see Figure 1). A possible solution is to resort to per-decision importance sampling [precup2000eligibility].
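The difference between trajectory-level and step-level importance weights can be made concrete with a small sketch. It assumes that the log-probabilities of the executed actions under the target and behavioral policies are already available as arrays; the function names and the numerical values are hypothetical and serve only to illustrate how a small per-step ratio compounds over the horizon.

```python
import numpy as np

def step_weights(logp_target, logp_behav):
    """Step-based weighting: one policy ratio per executed action."""
    return np.exp(np.asarray(logp_target) - np.asarray(logp_behav))

def trajectory_weight(logp_target, logp_behav):
    """Episode-based weighting: product of all ratios, computed in log space."""
    log_ratio_sum = np.sum(np.asarray(logp_target) - np.asarray(logp_behav))
    return float(np.exp(log_ratio_sum))

horizon = 50
# Illustrative case: every action is 10% more likely under the target policy.
logp_behav = np.full(horizon, np.log(0.50))
logp_target = np.full(horizon, np.log(0.55))

per_step = step_weights(logp_target, logp_behav)       # each ratio stays close to one
whole = trajectory_weight(logp_target, logp_behav)     # ratios compound over the horizon
print(per_step[:3], whole)
```

Summing log-ratios before exponentiating avoids the numerical underflow or overflow that a direct product of many ratios can cause on long horizons.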

Algorithm | Action/Parameter based | On/Off policy | Optimization problem | Critic | Timestep/Trajectory based
REINFORCE/G(PO)MDP [williams1992simple, baxter2001infinite] | action-based | on-policy | | No | episode-based
TRPO [schulman2015trust] | action-based | on-policy | | Yes | step-based
PPO [schulman2017proximal] | action-based | on/off-policy | | Yes | step-based
DDPG [lillicrap2015continuous] | action-based | off-policy | | Yes | step-based
REPS [peters2010relative] (see footnote 8) | action-based | on-policy | | Yes | step-based
RWR [peters2007reinforcement] | action-based | on-policy | | No | step-based
A-POIS | action-based | on/off-policy | | No | episode-based
PGPE [sehnke2008policy] | parameter-based | on-policy | | No | episode-based
IW-PGPE [zhao2013efficient] | parameter-based | on/off-policy | | No | episode-based
P-POIS | parameter-based | on/off-policy | | No | episode-based
Footnote 8: We indicate with the state-action occupancy [sutton2000policy].
Table 2: Comparison of some policy optimization algorithms according to different dimensions. For brevity, we will indicate with . For episode-based algorithms we will indicate with the empirical average over trajectories collected with . For step-based algorithms is the empirical average collecting samples with . For parameter-based algorithms we indicate with the empirical expectation taken w.r.t. policy parameter sampled from the hyperpolicy and trajectory collected with . For the actor-critic architectures, and are the estimated Q-function and advantage function.

Appendix B Proofs and Derivations

See 4.1


From the fact that are i.i.d. we can write:
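The i.i.d. step can be sketched as the chain below. The notation is assumed from the main text, which is not reproduced in this section: $x_1, \dots, x_N$ are i.i.d. samples from the behavioral distribution $Q$, $w_{P/Q} = p/q$ is the importance weight, $\hat{\mu}_{P/Q} = \frac{1}{N}\sum_{i} w_{P/Q}(x_i) f(x_i)$ is the importance sampling estimator, and $d_2(P\|Q) = \mathbb{E}_{x \sim Q}[w_{P/Q}(x)^2]$ is the exponentiated 2-Rényi divergence.

```latex
\begin{align*}
\operatorname{Var}_{x_i \sim Q}\big[\hat{\mu}_{P/Q}\big]
  &= \frac{1}{N} \operatorname{Var}_{x_1 \sim Q}\big[ w_{P/Q}(x_1) f(x_1) \big] \\
  &\le \frac{1}{N} \mathbb{E}_{x_1 \sim Q}\big[ w_{P/Q}(x_1)^2 f(x_1)^2 \big] \\
  &\le \frac{1}{N} \|f\|_\infty^2 \, \mathbb{E}_{x_1 \sim Q}\big[ w_{P/Q}(x_1)^2 \big]
   = \frac{1}{N} \|f\|_\infty^2 \, d_2(P\|Q).
\end{align*}
```

The first equality uses independence, the first inequality bounds the variance by the second moment, and the last step bounds $f^2$ by its supremum.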

See 4.1


We start from Cantelli’s inequality applied on the random variable :


By calling and considering the complementary event, we get that with probability at least we have:


By replacing the variance with the bound in Theorem 4.1 we get the result. ∎
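The Cantelli argument can be written out as follows. This is a sketch under assumed notation: $\hat{\mu}_{P/Q}$ is the importance sampling estimator built from $N$ i.i.d. samples from $Q$, $\mu = \mathbb{E}_{x \sim P}[f(x)]$ is its expectation, $\lambda > 0$ is the deviation, and the variance bound is taken to be $\operatorname{Var}[\hat{\mu}_{P/Q}] \le \frac{1}{N}\|f\|_\infty^2 d_2(P\|Q)$.

```latex
\begin{align*}
&\Pr\big( \hat{\mu}_{P/Q} - \mu \ge \lambda \big)
  \le \frac{1}{1 + \lambda^2 / \operatorname{Var}[\hat{\mu}_{P/Q}]}
  && \text{(Cantelli)} \\
&\delta = \frac{1}{1 + \lambda^2 / \operatorname{Var}[\hat{\mu}_{P/Q}]}
  \;\Longrightarrow\;
  \lambda = \sqrt{\frac{1 - \delta}{\delta} \operatorname{Var}[\hat{\mu}_{P/Q}]} \\
&\text{with probability at least } 1 - \delta: \quad
  \mu \ge \hat{\mu}_{P/Q} - \sqrt{\frac{1 - \delta}{\delta} \operatorname{Var}[\hat{\mu}_{P/Q}]}
  \ge \hat{\mu}_{P/Q} - \|f\|_\infty \sqrt{\frac{(1 - \delta)\, d_2(P\|Q)}{\delta N}}
\end{align*}
```

The penalty coefficient $\sqrt{(1-\delta)/\delta}$ diverges as $\delta \to 0$ and vanishes as $\delta \to 1$, which matches the conservative and under-penalized regimes discussed in the sensitivity analysis.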

See 4.2


We need to compute the second-order Taylor expansion of the -Rényi divergence. We start considering the term:


The gradient is given by:

Thus, . We now compute the Hessian:

Evaluating the Hessian in we have:

Now, . Thus: