Policy Optimization via Importance Sampling (POIS)
Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective function estimate. In this paper, we propose a novel model-free policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline using a batch of trajectories. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.
In recent years, policy search methods [deisenroth2013survey] have proved to be valuable Reinforcement Learning (RL) [sutton1998reinforcement] approaches, thanks to their successful achievements in continuous control tasks (e.g., [lillicrap2015continuous, schulman2015trust, schulman2017proximal, schulman2015high]), robotic locomotion (e.g., [tedrake2004stochastic, kober2013reinforcement]) and partially observable environments (e.g., [ng2000pegasus]). These algorithms can be roughly classified into two categories:
action-based methods [sutton2000policy, peters2008reinforcement] and parameter-based methods [sehnke2008policy]. The former, usually known as policy gradient (PG) methods, perform a search in a parametric policy space by following the gradient of the utility function, estimated by means of a batch of trajectories collected from the environment [sutton1998reinforcement]. In contrast, in parameter-based methods, the search is carried out directly in the space of parameters, by exploiting global optimizers (e.g., [rubinstein1999cross, hansen2001completely, stanley2002evolving, szita2006learning]) or by following a proper gradient direction, as in Policy Gradients with Parameter-based Exploration (PGPE) [sehnke2008policy, wierstra2008natural, sehnke2010parameter]. A major question in policy search is: how should we use a batch of trajectories in order to exploit its information in the most efficient way? On one hand, on-policy methods leverage the batch to perform a single gradient step, after which new trajectories are collected with the updated policy. Online PG methods are likely the most widespread policy search approaches, ranging from traditional algorithms based on the stochastic policy gradient [sutton2000policy], like REINFORCE [williams1992simple] and G(PO)MDP [baxter2001infinite], to more modern methods, such as Trust Region Policy Optimization (TRPO) [schulman2015trust]. These methods, however, rarely exploit the available trajectories in an efficient way, since each batch is thrown away after just one gradient update. On the other hand, off-policy methods maintain a behavioral policy, used to explore the environment and collect samples, and a target policy, which is optimized. The concept of off-policy learning is rooted in value-based RL [watkins1992q, peng1994incremental, munos2016safe] and was first adapted to PG in [degris2012off], using an actor-critic architecture.
The approach has been extended to Deterministic Policy Gradient (DPG) [silver2014deterministic], which allows optimizing deterministic policies while keeping a stochastic policy for exploration. More recently, an efficient version of DPG coupled with a deep neural network to represent the policy has been proposed, named Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous]. In the parameter-based framework, even though the original formulation [sehnke2008policy] introduces an online algorithm, an extension has been proposed to efficiently reuse the trajectories in an offline scenario [zhao2013efficient]. Furthermore, PGPE-like approaches allow overcoming several limitations of classical PG, like the need for a stochastic policy and the high variance of the gradient estimates (other solutions to these problems have been proposed in the action-based literature, like the aforementioned DPG algorithm, gradient baselines [peters2008reinforcement] and actor-critic architectures [konda2000actor]). While on-policy algorithms are, by nature, online, as they need to be fed with fresh samples whenever the policy is updated, off-policy methods can take advantage of mixing online and offline
optimization. This can be done by alternately sampling trajectories and performing optimization epochs with the collected data. A prime example of this alternating procedure is Proximal Policy Optimization (PPO) [schulman2017proximal], which has displayed remarkable performance on continuous control tasks. Offline optimization, however, introduces further sources of approximation, as the gradient w.r.t. the target policy needs to be estimated (off-policy) with samples collected with a behavioral policy. A common choice is to adopt an importance sampling (IS) [mcbook, hesterberg1988advances] estimator, in which each sample is re-weighted proportionally to the likelihood of being generated by the target policy. However, directly optimizing this utility function is impractical, since it displays a wide variance most of the time [mcbook]. Intuitively, the variance increases with the distance between the behavioral and the target policy; thus, the estimate is reliable as long as the two policies are close enough. Preventing uncontrolled updates in the space of policy parameters is at the core of the natural gradient approaches [amari1998natural], applied effectively both to PG methods [kakade2002approximately, peters2008natural, wierstra2008natural] and to PGPE methods [miyamae2010natural]. More recently, this idea has been captured (albeit indirectly) by TRPO, which optimizes via (approximate) natural gradient a surrogate objective function, derived from safe RL [kakade2002approximately, pirotta2013safe], subject to a constraint on the Kullback-Leibler divergence between the behavioral and target policy.
(Note that this regularization term appears in the performance improvement bound, which contains exact quantities only; thus, it does not really account for the uncertainty derived from importance sampling.) Similarly, PPO performs a truncation of the importance weights to discourage the optimization process from going too far. Although TRPO and PPO, together with DDPG, represent the state-of-the-art policy optimization methods in RL for continuous control, they do not explicitly encode in their objective function the uncertainty injected by the importance sampling procedure. A more theoretically grounded analysis has been provided for policy selection [doroudi2017importance], for model-free [thomas2015high] and model-based [thomas2016data] policy evaluation (also accounting for samples collected with multiple behavioral policies), and combined with options [guo2017using]. Subsequently, in [thomas2015high2] these methods have been extended to policy improvement, deriving a suitable concentration inequality for the case of truncated importance weights. Unfortunately, these methods are hardly scalable to complex control tasks. A more detailed review of state-of-the-art policy optimization algorithms is reported in Appendix A. In this paper, we propose a novel, model-free, actor-only, policy optimization algorithm, named Policy Optimization via Importance Sampling (POIS), that mixes online and offline optimization to efficiently exploit the information contained in the collected trajectories. POIS explicitly accounts for the uncertainty introduced by importance sampling by optimizing a surrogate objective function. The latter captures the trade-off between the estimated performance improvement and the variance injected by the importance sampling. The main contributions of this paper are theoretical, algorithmic and experimental.
After reviewing some notions about importance sampling (Section 3), we propose a concentration inequality, of independent interest, for high-confidence "off-distribution" optimization of objective functions estimated via importance sampling (Section 4). Then we show how this bound can be customized into a surrogate objective function in order to search either in the space of policies (Action-based POIS) or in the space of parameters (Parameter-based POIS). The resulting algorithm (in both the action-based and the parameter-based flavor) collects, at each iteration, a set of trajectories. These are used to perform offline optimization of the surrogate objective via gradient ascent (Section 5), after which a new batch of trajectories is collected using the optimized policy. Finally, we provide an experimental evaluation with both linear policies and deep neural policies to illustrate the advantages and limitations of our approach compared to state-of-the-art algorithms (Section 6) on classical control tasks [duan2016benchmarking, todorov2012mujoco]. The proofs of all theorems and lemmas are reported in Appendix B. The implementation of POIS can be found at https://github.com/T3p/pois.
A discrete-time Markov Decision Process (MDP) [puterman2014markov] is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, \mathcal{R}, D_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(\cdot|s,a)$ is a Markovian transition model that assigns, for each state-action pair $(s,a)$, the probability of reaching the next state $s'$, $\gamma \in [0,1]$ is the discount factor, $\mathcal{R}(s,a)$ assigns the expected reward for performing action $a$ in state $s$, and $D_0$ is the distribution of the initial state. The behavior of an agent is described by a policy $\pi(\cdot|s)$ that assigns, for each state $s$, the probability of performing action $a$. A trajectory $\tau = (s_0, a_0, \dots, s_{H-1}, a_{H-1})$ is a sequence of state-action pairs, where $H$ is the actual trajectory horizon. The performance of an agent is evaluated in terms of the expected return, i.e., the expected discounted sum of the rewards collected along the trajectory: $J = \mathbb{E}_\tau\left[\mathcal{R}(\tau)\right]$, where $\mathcal{R}(\tau) = \sum_{t=0}^{H-1} \gamma^t \mathcal{R}(s_t, a_t)$ is the trajectory return. We focus our attention on the case in which the policy belongs to a parametric policy space $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$. In parameter-based approaches, the agent is equipped with a hyperpolicy $\nu$ used to sample the policy parameters at the beginning of each episode. The hyperpolicy itself belongs to a parametric hyperpolicy space $\mathrm{N}_\mathrm{P} = \{\nu_\rho : \rho \in \mathrm{P} \subseteq \mathbb{R}^r\}$. The expected return can be expressed, in the parameter-based case, as a double expectation: one over the policy parameter space $\Theta$ and one over the trajectory space $\mathcal{T}$:

$$J(\rho) = \int_\Theta \int_{\mathcal{T}} \nu_\rho(\theta)\, p(\tau|\theta)\, \mathcal{R}(\tau)\, \mathrm{d}\tau\, \mathrm{d}\theta, \qquad (1)$$
where $p(\tau|\theta)$ is the trajectory density function. The goal of a parameter-based learning agent is to determine the hyperparameters $\rho$ so as to maximize $J(\rho)$. If $\nu_\rho$ is stochastic and differentiable, the hyperparameters can be learned according to the gradient ascent update: $\rho' = \rho + \alpha \nabla_\rho J(\rho)$, where $\alpha > 0$ is the step size and $\nabla_\rho J(\rho) = \mathbb{E}_{\theta \sim \nu_\rho,\, \tau \sim p(\cdot|\theta)}\left[\nabla_\rho \log \nu_\rho(\theta)\, \mathcal{R}(\tau)\right]$. Since the stochasticity of the hyperpolicy is a sufficient source of exploration, deterministic action policies of the kind $\pi_\theta(a|s) = \delta\left(a - u_\theta(s)\right)$ are typically considered, where $\delta$ is the Dirac delta function and $u_\theta$ is a deterministic mapping from $\mathcal{S}$ to $\mathcal{A}$. In the action-based case, on the contrary, the hyperpolicy is a deterministic distribution $\nu(\theta') = \delta(\theta' - \theta)$, i.e., a Dirac delta centered on a single parameter vector $\theta$. For this reason, the dependence on $\rho$ is typically not represented and the expected return expression simplifies into a single expectation over the trajectory space $\mathcal{T}$:

$$J(\theta) = \int_{\mathcal{T}} p(\tau|\theta)\, \mathcal{R}(\tau)\, \mathrm{d}\tau. \qquad (2)$$
An action-based learning agent aims to find the policy parameters $\theta$ that maximize $J(\theta)$. In this case, exploration must be enforced by means of the stochasticity of $\pi_\theta$. For stochastic and differentiable policies, learning can be performed via gradient ascent: $\theta' = \theta + \alpha \nabla_\theta J(\theta)$, where $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\left[\nabla_\theta \log p(\tau|\theta)\, \mathcal{R}(\tau)\right]$.
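As an illustration of the likelihood-ratio gradient estimator above, the following sketch estimates the policy gradient on a toy one-step problem; the Gaussian "policy", the quadratic return and all names are illustrative, not part of POIS:

```python
import numpy as np

# Monte Carlo estimate of grad_theta J(theta) via the likelihood-ratio trick,
# on a one-step toy problem: a ~ N(theta, sigma^2), return R(a) = -(a - 2)^2.

def pg_estimate(theta, sigma, n, rng):
    actions = rng.normal(theta, sigma, size=n)
    returns = -(actions - 2.0) ** 2
    scores = (actions - theta) / sigma ** 2   # grad_theta log N(a; theta, sigma^2)
    return np.mean(scores * returns)

rng = np.random.default_rng(0)
grad = pg_estimate(0.0, 1.0, 200_000, rng)
# The true gradient at theta = 0 is d/dtheta E[-(a - 2)^2] = -2(theta - 2) = 4,
# so the estimate should be close to 4.
```

The same score-function construction underlies both REINFORCE (scores of the trajectory density) and PGPE (scores of the hyperpolicy).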
In off-policy evaluation [thomas2015high, thomas2016data], we aim to estimate the performance of a target policy (or hyperpolicy) given samples collected with a behavioral policy (or hyperpolicy). More generally, we face the problem of estimating the expected value of a deterministic bounded function $f : \mathcal{X} \to \mathbb{R}$ ($\|f\|_\infty < \infty$) of a random variable $x$ taking values in $\mathcal{X}$ under a target distribution $P$, after having collected $N$ samples from a behavioral distribution $Q$. The importance sampling (IS) estimator [cochran2007sampling, mcbook] corrects the distribution with the importance weights (or Radon-Nikodym derivative, or likelihood ratio) $w_{P/Q}(x) = p(x)/q(x)$:

$$\hat{\mu}_{P/Q} = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)}\, f(x_i) = \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i)\, f(x_i), \qquad (3)$$

where each $x_i$ is sampled from $Q$ and we assume $q(x) > 0$ whenever $f(x)\, p(x) \neq 0$. This estimator is unbiased ($\mathbb{E}_{x_i \sim Q}\left[\hat{\mu}_{P/Q}\right] = \mathbb{E}_{x \sim P}\left[f(x)\right]$), but it may exhibit an undesirable behavior due to the variability of the importance weights, showing, in some cases, infinite variance. Intuitively, the magnitude of the importance weights provides an indication of how dissimilar the probability measures $P$ and $Q$ are. This notion can be formalized by the Rényi divergence [renyi1961measures, van2014renyi], an information-theoretic dissimilarity index between probability measures.
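As a quick numerical illustration of the IS estimator (3) — the Gaussian choices of $P$ and $Q$ and all names here are ours:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), evaluated pointwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def is_estimate(f, p_pdf, q_pdf, samples):
    """IS estimate of E_{x~P}[f(x)] from samples x_i ~ Q, as in (3)."""
    w = p_pdf(samples) / q_pdf(samples)   # importance weights w_{P/Q}(x_i)
    return np.mean(w * f(samples))

# Estimate the mean of P = N(1, 1) using samples drawn from Q = N(0, 1).
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=100_000)
est = is_estimate(lambda x: x, lambda x: gauss_pdf(x, 1.0, 1.0),
                  lambda x: gauss_pdf(x, 0.0, 1.0), xs)
# est should be close to the true value E_{x~P}[x] = 1
```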
Let $P$ and $Q$ be two probability measures on a measurable space $(\mathcal{X}, \mathfrak{F})$ such that $P \ll Q$ ($P$ is absolutely continuous w.r.t. $Q$), and let $\alpha \in [0, \infty]$. Let $P$ and $Q$ admit $p$ and $q$ as Lebesgue probability density functions (p.d.f.), respectively. The $\alpha$-Rényi divergence is defined as:

$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \int_{\mathcal{X}} q(x) \left( \frac{p(x)}{q(x)} \right)^\alpha \mathrm{d}x, \qquad (4)$$

where $p/q$ is the Radon-Nikodym derivative of $P$ w.r.t. $Q$. Some remarkable cases are: $\alpha = 1$, for which, in the limit $\alpha \to 1$, the Kullback-Leibler divergence is recovered, and $\alpha = \infty$, yielding $D_\infty(P \| Q) = \log \operatorname{ess\,sup}_{\mathcal{X}} p/q$. Importing the notation from [cortes2010learning], we indicate the exponentiated Rényi divergence as $d_\alpha(P \| Q) = \exp\left(D_\alpha(P \| Q)\right)$. With a little abuse of notation, we will replace $P$ and $Q$ with the corresponding densities $p$ and $q$ whenever clear from the context.
The Rényi divergence provides a convenient expression for the moments of the importance weights: $\mathbb{E}_{x \sim Q}\left[w_{P/Q}(x)^\alpha\right] = d_\alpha(P \| Q)^{\alpha - 1}$. Moreover, $\mathbb{E}_{x \sim Q}\left[w_{P/Q}(x)\right] = 1$ and $\mathbb{V}\mathrm{ar}_{x \sim Q}\left[w_{P/Q}(x)\right] = d_2(P \| Q) - 1$ [cortes2010learning]. To mitigate the variance problem of the IS estimator, we can resort to the self-normalized importance sampling (SN) estimator [cochran2007sampling]:

$$\hat{\mu}^{SN}_{P/Q} = \frac{\sum_{i=1}^{N} w_{P/Q}(x_i)\, f(x_i)}{\sum_{j=1}^{N} w_{P/Q}(x_j)} = \sum_{i=1}^{N} \widetilde{w}_{P/Q}(x_i)\, f(x_i), \qquad (5)$$

where $\widetilde{w}_{P/Q}(x_i) = w_{P/Q}(x_i) / \sum_{j=1}^{N} w_{P/Q}(x_j)$ is the self-normalized importance weight. Differently from $\hat{\mu}_{P/Q}$, $\hat{\mu}^{SN}_{P/Q}$ is biased but consistent [mcbook], and it typically displays a more desirable behavior because of its smaller variance (note that $|\hat{\mu}^{SN}_{P/Q}| \le \|f\|_\infty$; therefore, its variance is always finite). Given the realization $x_1, \dots, x_N$, we can interpret the SN estimator as the expected value of $f$ under an approximation of the distribution $P$ made of $N$ deltas, i.e., $\widetilde{p}(x) = \sum_{i=1}^{N} \widetilde{w}_{P/Q}(x_i)\, \delta(x - x_i)$. The problem of assessing the quality of the SN estimator has been extensively studied by the simulation community, producing several diagnostic indexes that indicate when the weights might display problematic behavior [mcbook]. The effective sample size (ESS) was introduced in [kong1992note] as the number of samples drawn from $P$ such that the variance of the corresponding Monte Carlo estimator is approximately equal to the variance of the SN estimator computed with $N$ samples. Here we report the original definition and its most common estimate:

$$\mathrm{ESS}(P \| Q) = \frac{N}{d_2(P \| Q)}, \qquad \widehat{\mathrm{ESS}} = \frac{\left( \sum_{i=1}^{N} w_{P/Q}(x_i) \right)^2}{\sum_{i=1}^{N} w_{P/Q}(x_i)^2} = \frac{1}{\sum_{i=1}^{N} \widetilde{w}_{P/Q}(x_i)^2}. \qquad (6)$$
The ESS has an interesting interpretation: if $P = Q$ almost everywhere, then $\mathrm{ESS} = N$, since we are performing plain Monte Carlo estimation. Otherwise, the ESS decreases as the dissimilarity between the two distributions increases. In the literature, other ESS-like diagnostics have been proposed that also account for the nature of $f$ [martino2017effective].
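The SN weights and the ESS estimate in (5)-(6) are one-liners in practice; a sketch (function names ours):

```python
import numpy as np

def sn_estimate(f_vals, w):
    """Self-normalized IS estimate (5): weights are normalized to sum to one."""
    w_tilde = w / np.sum(w)                  # self-normalized weights
    return np.sum(w_tilde * f_vals)

def ess_estimate(w):
    """Empirical effective sample size (6): (sum w)^2 / (sum w^2)."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

# When P = Q all weights equal 1: the SN estimate is the sample mean, ESS = N.
f_vals = np.linspace(0.0, 1.0, 100)
w_equal = np.ones(100)
# When a few weights dominate, the ESS collapses toward 1.
w_skew = np.array([100.0, 1.0, 1.0, 1.0])
```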
The off-policy optimization problem [thomas2015high2] can be formulated as finding the best target policy (or hyperpolicy), i.e., the one maximizing the expected return, having access to a set of samples collected with a behavioral policy (or hyperpolicy). In a more abstract sense, we aim to determine the target distribution $P$ that maximizes $\mathbb{E}_{x \sim P}\left[f(x)\right]$, having samples collected from a fixed behavioral distribution $Q$. In this section, we analyze the problem of defining a proper objective function for this purpose. Directly optimizing the estimator $\hat{\mu}_{P/Q}$ or $\hat{\mu}^{SN}_{P/Q}$ is, in most cases, unsuccessful: with enough freedom in choosing $P$, the optimal solution would assign as much probability mass as possible to the maximum value among the $f(x_i)$. Clearly, in this scenario, the estimator is unreliable and displays a large variance. For this reason, we adopt a risk-averse approach and decide to optimize a statistical lower bound of the expected value that holds with high confidence. We start by analyzing the behavior of the IS estimator and provide the following result, which bounds the variance of $\hat{\mu}_{P/Q}$ in terms of the Rényi divergence.
Let $P$ and $Q$ be two probability measures on the measurable space $(\mathcal{X}, \mathfrak{F})$ such that $P \ll Q$. Let $x_1, \dots, x_N$ be i.i.d. random variables sampled from $Q$, and let $f : \mathcal{X} \to \mathbb{R}$ be a bounded function ($\|f\|_\infty < \infty$). Then, for any $N > 0$, the variance of the IS estimator can be upper bounded as:

$$\mathbb{V}\mathrm{ar}_{x_i \sim Q}\left[\hat{\mu}_{P/Q}\right] \le \frac{1}{N}\, \|f\|_\infty^2\, d_2(P \| Q). \qquad (7)$$
When $P = Q$ almost everywhere, we get $\mathbb{V}\mathrm{ar}_{x_i \sim Q}\left[\hat{\mu}_{P/Q}\right] \le \|f\|_\infty^2 / N$, a well-known bound on the variance of a Monte Carlo estimator. Recalling the definition of ESS (6), we can rewrite the previous bound as $\mathbb{V}\mathrm{ar}_{x_i \sim Q}\left[\hat{\mu}_{P/Q}\right] \le \|f\|_\infty^2 / \mathrm{ESS}(P \| Q)$, i.e., the variance scales with the ESS instead of $N$. While $\hat{\mu}_{P/Q}$ can have unbounded variance even if $f$ is bounded, the SN estimator is always bounded by $\|f\|_\infty$ and therefore always has a finite variance. Since the normalization term makes all the samples interdependent, an exact analysis of its bias and variance is more challenging. Several works adopted approximate methods to provide an expression for its variance [hesterberg1988advances]. We propose an analysis of the bias and variance of the SN estimator in Appendix D.
Finding a suitable concentration inequality for off-policy learning was studied in [thomas2015high] for offline policy evaluation and subsequently in [thomas2015high2] for optimization. On one hand, fully empirical concentration inequalities, like the Student-t bound, besides the asymptotic approximation, are not suitable in this case, since the empirical variance needs to be estimated with importance sampling as well, injecting further uncertainty [mcbook]. On the other hand, several distribution-free inequalities, like Hoeffding's, require knowing the maximum of the estimator, which might not exist ($d_\infty(P \| Q) = \infty$) for the IS estimator. Constraining $d_\infty(P \| Q)$ to be finite often introduces unacceptable limitations. For instance, in the case of univariate Gaussian distributions, it prevents any step that selects a target variance larger than the behavioral one (see Appendix C). (Although the variance tends to be reduced in the learning process, there might be cases in which it needs to be increased: e.g., if we start with a behavioral policy with small variance, it might be beneficial to increase the variance to enforce exploration.) Even Bernstein inequalities [bercu2015concentration] are hardly applicable, since, for instance, in the case of univariate Gaussian distributions, the importance weights display a fat-tail behavior (see Appendix C). We believe that a reasonable trade-off is to require the variance of the importance weights to be finite, which is equivalent to requiring $d_2(P \| Q) < \infty$, i.e., $\sigma_P^2 < 2\sigma_Q^2$ for univariate Gaussians. For this reason, we resort to Chebyshev-like inequalities and propose the following concentration bound, derived from Cantelli's inequality and customized for the IS estimator. Let $P$ and $Q$ be two probability measures on the measurable space $(\mathcal{X}, \mathfrak{F})$ such that $P \ll Q$ and $d_2(P \| Q) < \infty$. Let $x_1, \dots, x_N$ be i.i.d. random variables sampled from $Q$, and let $f : \mathcal{X} \to \mathbb{R}$ be a bounded function ($\|f\|_\infty < \infty$). Then, for any $0 < \delta \le 1$ and $N > 0$, with probability at least $1 - \delta$, it holds that:

$$\mathbb{E}_{x \sim P}\left[f(x)\right] \ge \hat{\mu}_{P/Q} - \|f\|_\infty \sqrt{\frac{(1 - \delta)\, d_2(P \| Q)}{\delta N}}. \qquad (8)$$
The bound highlights an interesting trade-off between the estimated performance and the uncertainty introduced by changing the distribution. The latter enters the bound as the 2-Rényi divergence between the target distribution $P$ and the behavioral distribution $Q$. Intuitively, we should trust the estimator as long as $P$ is not too far from $Q$. For the SN estimator, accounting for the bias, we are able to obtain a bound (reported in Appendix D) with a similar dependence on $d_2(P \| Q)$ as in Theorem 4.1, albeit with different constants. Renaming all the constants involved in the bound of Theorem 4.1 as $\lambda$, we get a surrogate objective function. The optimization can be carried out in different ways. The following section shows why using the natural gradient can be a successful choice when $P$ and $Q$ can be expressed as parametric differentiable distributions.
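The lower bound (8) is straightforward to compute once $d_2(P \| Q)$ (or an estimate of it) is available; a sketch, with illustrative argument names:

```python
import math

def is_lower_bound(mu_hat, f_inf, d2, n, delta):
    """High-confidence lower bound (8) on E_{x~P}[f(x)]: holds with probability
    at least 1 - delta, provided d2 = d_2(P||Q) is finite."""
    penalty = f_inf * math.sqrt((1.0 - delta) * d2 / (delta * n))
    return mu_hat - penalty

# The penalty grows with d_2 (distribution mismatch) and shrinks as N grows.
```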
We can look at a parametric distribution $p_\theta$, having $p_\theta(\cdot)$ as a density function, as a point on a probability manifold with coordinates $\theta \in \Theta$. If $p_\theta$ is differentiable, the Fisher Information Matrix (FIM) [rao1992information, amari2012differential] is defined as $\mathcal{F}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\mathsf{T}}\right]$. This matrix is, up to a scale, an invariant metric [amari1998natural] on the parameter space $\Theta$, i.e., it is independent of the specific parameterization and provides a second-order approximation of the distance between $p_\theta$ and $p_{\theta'}$ on the probability manifold, up to a scale factor. Given a loss function $\mathcal{L}(\theta)$, we define the natural gradient [amari1998natural, kakade2002natural] as $\widetilde{\nabla}_\theta \mathcal{L}(\theta) = \mathcal{F}(\theta)^{-1} \nabla_\theta \mathcal{L}(\theta)$, which represents the steepest ascent direction on the probability manifold. Thanks to the invariance property, there is a tight connection between the geometry induced by the Rényi divergence and the Fisher information metric. Let $p_\theta$ be a p.d.f. differentiable w.r.t. $\theta$. Then it holds that, for the Rényi divergence, $D_\alpha(p_{\theta'} \| p_\theta) = \frac{\alpha}{2} (\theta' - \theta)^{\mathsf{T}} \mathcal{F}(\theta)\, (\theta' - \theta) + o\left(\|\theta' - \theta\|_2^2\right)$, and, for the exponentiated Rényi divergence, $d_\alpha(p_{\theta'} \| p_\theta) = 1 + \frac{\alpha(\alpha - 1)}{2} (\theta' - \theta)^{\mathsf{T}} \mathcal{F}(\theta)\, (\theta' - \theta) + o\left(\|\theta' - \theta\|_2^2\right)$.
This result provides an approximate expression for the variance of the importance weights, as $\mathbb{V}\mathrm{ar}_{x \sim p_\theta}\left[w_{\theta'/\theta}(x)\right] = d_2(p_{\theta'} \| p_\theta) - 1 \simeq (\theta' - \theta)^{\mathsf{T}} \mathcal{F}(\theta)\, (\theta' - \theta)$. It also justifies the use of natural gradients in off-distribution optimization, since a step in the natural gradient direction has a controllable effect on the variance of the importance weights.
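This relation can be checked numerically for a univariate Gaussian with fixed standard deviation, whose FIM w.r.t. the mean is $1/\sigma^2$ (a toy check; function names and constants are ours):

```python
import numpy as np

# Check that Var[w] = d_2 - 1 is close to the quadratic form dtheta^T F dtheta
# for N(theta, sigma^2) with fixed sigma, where F(theta) = 1 / sigma^2.

def weight_variance_mc(theta_target, theta_behav, sigma, n, rng):
    x = rng.normal(theta_behav, sigma, size=n)   # samples from the behavioral density
    log_w = (-(x - theta_target) ** 2 + (x - theta_behav) ** 2) / (2 * sigma ** 2)
    return np.var(np.exp(log_w))                 # Monte Carlo variance of the weights

rng = np.random.default_rng(0)
sigma, dtheta = 1.0, 0.1
mc = weight_variance_mc(dtheta, 0.0, sigma, 1_000_000, rng)
exact = np.exp(dtheta ** 2 / sigma ** 2) - 1.0   # d_2 - 1 for equal variances
approx = dtheta ** 2 / sigma ** 2                # dtheta^T F dtheta
# mc, exact and approx all agree up to O(dtheta^4) and sampling noise
```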
In this section, we discuss how to customize the bound provided in Theorem 4.1 for policy optimization, developing a novel model-free actor-only policy search algorithm, named Policy Optimization via Importance Sampling (POIS). We propose two versions of POIS: Action-based POIS (A-POIS), which is based on a policy gradient approach, and Parameter-based POIS (P-POIS), which adopts the PGPE framework. A more detailed description of the implementation aspects is reported in Appendix E.
In Action-based POIS (A-POIS), we search for a policy that maximizes the performance index $J(\theta)$ within a parametric space $\Pi_\Theta$ of stochastic differentiable policies. In this context, the behavioral (resp. target) distribution becomes the distribution over trajectories $p(\cdot|\theta)$ (resp. $p(\cdot|\theta')$) induced by the behavioral policy $\pi_\theta$ (resp. target policy $\pi_{\theta'}$), and $f$ is the trajectory return $\mathcal{R}(\tau)$, which is uniformly bounded as $|\mathcal{R}(\tau)| \le R_{\max} \frac{1 - \gamma^H}{1 - \gamma}$ (when $\gamma = 1$, the bound becomes $H R_{\max}$). The surrogate loss function cannot be directly optimized via gradient ascent, since computing $d_2\left(p(\cdot|\theta') \| p(\cdot|\theta)\right)$ requires the approximation of an integral over the trajectory space and, for stochastic environments, knowledge of the transition model $\mathcal{P}$, which is unknown in a model-free setting. Simple bounds on this quantity, like $d_2\left(p(\cdot|\theta') \| p(\cdot|\theta)\right) \le \left(\sup_{s \in \mathcal{S}} d_2\left(\pi_{\theta'}(\cdot|s) \| \pi_\theta(\cdot|s)\right)\right)^H$, besides being hard to compute due to the presence of the supremum, are extremely conservative, since the Rényi divergence is raised to the horizon $H$. We suggest replacing the Rényi divergence with an estimate $\hat{d}_2$ defined only in terms of the policy Rényi divergence (see Appendix E.2 for details). Thus, we obtain the following surrogate objective:

$$\mathcal{L}(\theta'/\theta) = \frac{1}{N} \sum_{i=1}^{N} w_{\theta'/\theta}(\tau_i)\, \mathcal{R}(\tau_i) - \lambda \sqrt{\frac{\hat{d}_2\left(p(\cdot|\theta') \| p(\cdot|\theta)\right)}{N}}, \qquad (9)$$

where $\lambda = \|\mathcal{R}\|_\infty \sqrt{(1 - \delta)/\delta}$. We consider the case in which $\pi_\theta$ is a Gaussian distribution over actions whose mean depends on the state and whose covariance is state-independent and diagonal: $\pi_\theta(\cdot|s) = \mathcal{N}\left(u_\upsilon(s), \mathrm{diag}(\sigma^2)\right)$, with $\theta = (\upsilon, \sigma)$. The learning process mixes online and offline optimization. At each online iteration $j$, a dataset of $N$ trajectories is collected by executing the current policy $\pi_{\theta_j}$ in the environment. These trajectories are used to optimize the surrogate loss function $\mathcal{L}$. At each offline iteration $k$, the parameters are updated via gradient ascent: $\theta_{k+1} = \theta_k + \alpha_k \Lambda\, \nabla_\theta \mathcal{L}$, where $\alpha_k$ is the step size, chosen via line search (see Appendix E.1), and $\Lambda$ is a positive semi-definite matrix (e.g., the inverse FIM for the natural gradient; the FIM needs to be estimated via importance sampling as well, as shown in Appendix E.3). The pseudocode of POIS is reported in Algorithm 1.
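To make the alternating scheme concrete, here is a heavily simplified, self-contained sketch of one such iteration on a toy one-step problem. The $N/\widehat{\mathrm{ESS}}$ quantity stands in for the Rényi estimate, the penalty gradient is ignored, a fixed step size replaces the line search, and the task and all names are illustrative rather than part of the actual algorithm:

```python
import numpy as np

SIGMA, DELTA = 1.0, 0.4  # policy std and confidence level (illustrative values)

def collect(theta_b, n, rng):
    """Online phase: run the behavioral 'policy' N(theta_b, SIGMA^2) per episode."""
    actions = rng.normal(theta_b, SIGMA, size=n)
    returns = -(actions - 2.0) ** 2          # toy trajectory returns
    return actions, returns

def surrogate_grad(theta, theta_b, actions, returns):
    """Gradient of the IS term of the surrogate; ESS is tracked for the penalty."""
    log_w = ((actions - theta_b) ** 2 - (actions - theta) ** 2) / (2 * SIGMA ** 2)
    w = np.exp(log_w)                        # trajectory-level importance weights
    ess = w.sum() ** 2 / (w ** 2).sum()      # ESS estimate; d_2 is roughly N / ESS
    grad = np.mean(w * returns * (actions - theta) / SIGMA ** 2)
    return grad, ess

rng = np.random.default_rng(1)
theta_b = 0.0
acts, rets = collect(theta_b, 5000, rng)     # online phase
theta = theta_b
for _ in range(50):                          # offline phase: ascend the surrogate
    grad, ess = surrogate_grad(theta, theta_b, acts, rets)
    theta += 0.01 * grad                     # fixed step in place of line search
# theta moves from the behavioral parameter toward the optimum at 2,
# while the ESS stays well above degeneracy for this batch size
```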
In Parameter-based POIS (P-POIS), we again consider a parametrized policy space $\Pi_\Theta$, but $\pi_\theta$ need not be differentiable. The policy parameters $\theta$ are sampled at the beginning of each episode from a parametric hyperpolicy $\nu_\rho$ selected in a parametric space $\mathrm{N}_\mathrm{P}$. The goal is to learn the hyperparameters $\rho$ so as to maximize $J(\rho)$. In this setting, the distributions $Q$ and $P$ of Section 4 correspond to the behavioral and target hyperpolicies $\nu_\rho$ and $\nu_{\rho'}$, while $f$ remains the trajectory return $\mathcal{R}(\tau)$. The importance weights [zhao2013efficient] must take into account all sources of randomness, derived from sampling a policy parameter $\theta$ and a trajectory $\tau$: $w_{\rho'/\rho}(\theta, \tau) = \frac{\nu_{\rho'}(\theta)\, p(\tau|\theta)}{\nu_\rho(\theta)\, p(\tau|\theta)} = \frac{\nu_{\rho'}(\theta)}{\nu_\rho(\theta)}$. In practice, a Gaussian hyperpolicy with a diagonal covariance matrix is often used, i.e., $\nu_\rho = \mathcal{N}\left(\mu, \mathrm{diag}(\sigma^2)\right)$ with $\rho = (\mu, \sigma)$. The policy is assumed to be deterministic: $\pi_\theta(a|s) = \delta\left(a - u_\theta(s)\right)$, where $u_\theta$ is a deterministic function of the state (e.g., [sehnke2010parameter, gruttnermulti]). A first advantage over the action-based setting is that the distribution of the importance weights is entirely known, as it is the ratio of two Gaussians, and the Rényi divergence can be computed exactly [burbea1984convexity] (see Appendix C). This leads to the following surrogate objective:

$$\mathcal{L}(\rho'/\rho) = \frac{1}{N} \sum_{i=1}^{N} w_{\rho'/\rho}(\theta_i)\, \mathcal{R}(\tau_i) - \lambda \sqrt{\frac{d_2\left(\nu_{\rho'} \| \nu_\rho\right)}{N}}, \qquad (10)$$

where each trajectory $\tau_i$ is obtained by running an episode with action policy $\pi_{\theta_i}$, and the corresponding policy parameters $\theta_i$ are sampled independently from the hyperpolicy $\nu_\rho$ at the beginning of each episode. The hyperpolicy parameters are then updated offline as $\rho_{k+1} = \rho_k + \alpha_k \Lambda\, \nabla_\rho \mathcal{L}$ (see Algorithm 2 for the complete pseudocode). A further advantage w.r.t. the action-based case is that the FIM can be computed exactly, and it is diagonal in the case of a Gaussian hyperpolicy with a diagonal covariance matrix, turning a problematic inversion into a trivial division (the FIM is block-diagonal in the more general case of a Gaussian hyperpolicy, as observed in [miyamae2010natural]). This makes the natural gradient much more enticing for P-POIS.
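For instance, for univariate Gaussian hyperpolicies, the exponentiated 2-Rényi divergence appearing in (10) admits the closed form below; per-dimension factors multiply in the diagonal multivariate case (the function name is ours):

```python
import math

def d2_gauss(mu_p, var_p, mu_q, var_q):
    """Exponentiated 2-Renyi divergence d_2(P||Q) for univariate Gaussians
    P = N(mu_p, var_p), Q = N(mu_q, var_q). Finite only when var_p < 2 * var_q."""
    var_mix = 2.0 * var_q - var_p            # sigma_alpha^2 for alpha = 2
    if var_mix <= 0.0:
        return math.inf
    return (var_q / math.sqrt(var_p * var_mix)) * math.exp((mu_p - mu_q) ** 2 / var_mix)

# d_2 = 1 iff P = Q, and it diverges as var_p approaches 2 * var_q,
# mirroring the finiteness condition discussed in Section 4.
```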

In this section, we present the experimental evaluation of POIS in its two flavors (action-based and parameter-based). We first provide a set of empirical comparisons on classical continuous control tasks with linearly parametrized policies; we then show how POIS can also be adopted for learning deep neural policies. In all experiments, we used the IS estimator for A-POIS and the SN estimator for P-POIS. All experimental details are provided in Appendix F.
Linearly parametrized Gaussian policies have proved able to scale to complex control tasks [rajeswaran2017towards]. In this section, we compare the learning performance of A-POIS and P-POIS against TRPO [schulman2015trust] and PPO [schulman2017proximal] on classical continuous control benchmarks [duan2016benchmarking]. In Figure 1, we can see that both versions of POIS, and especially P-POIS, are able to significantly outperform both TRPO and PPO in the Cart-Pole environments. In the Inverted Double Pendulum environment, the learning curve of P-POIS is remarkable, while A-POIS displays a behavior comparable to PPO. In the Acrobot task, P-POIS displays a better performance w.r.t. TRPO and PPO, but A-POIS does not keep up. In Mountain Car, we see yet another behavior: the learning curves of TRPO, PPO and P-POIS are almost one-shot (even if PPO shows a small instability), while A-POIS fails to display such fast convergence. Finally, in the Inverted Pendulum environment, TRPO and PPO outperform both versions of POIS. This example highlights a limitation of our approach: since POIS performs the importance sampling procedure at the trajectory level, it cannot assign credit to good actions in bad trajectories. On the contrary, by weighting each sample, TRPO and PPO are also able to exploit good trajectory segments. In principle, this problem can be mitigated in POIS by resorting to per-decision importance sampling [precup2000eligibility], in which a weight is assigned to individual rewards instead of trajectory returns. Overall, POIS displays a performance comparable with TRPO and PPO across the tasks. In particular, P-POIS displays a better performance w.r.t. A-POIS. However, this ordering is not maintained when moving to more complex policy architectures, as shown in the next section.
In Figure 2, we show, for several metrics, the behavior of A-POIS when changing the confidence parameter $\delta$ in the Cart-Pole environment. We can see that when $\delta$ is small, the effective sample size remains large and, consequently, the variance of the importance weights is small. This means that the penalization term in the objective function discourages the optimization process from selecting policies that are far from the behavioral policy. As a consequence, the displayed behavior is very conservative, preventing the policy from reaching the optimum. On the contrary, when $\delta$ approaches 1, the ESS is smaller and the variance of the weights tends to increase significantly. Again, the performance remains suboptimal, as the penalization term in the objective function is too light. The best behavior is obtained with an intermediate value of $\delta$.
In this section, we adopt a deep neural network (3 layers: 100, 50 and 25 neurons) to represent the policy. The experimental setup is fully compatible with the classical benchmark [duan2016benchmarking]. While A-POIS can be directly applied to deep neural networks, P-POIS exhibits some critical issues. A high-dimensional hyperpolicy (like a Gaussian from which the weights of an MLP policy are sampled) can make $d_2(\nu_{\rho'} \| \nu_\rho)$ extremely sensitive to small parameter changes, leading to over-conservative updates. (This curse of dimensionality has some similarities with the dependence of the Rényi divergence on the actual horizon in the action-based case.) A first practical variant comes from the insight that $d_2(\nu_{\rho'} \| \nu_\rho)/N$ is the inverse of the effective sample size, as reported in Equation (6): we can obtain a less conservative (although approximate) surrogate function by replacing it with $1/\widehat{\mathrm{ESS}}$. Another trick is to model the hyperpolicy as a set of independent Gaussians, each defined over a disjoint subspace of the parameter space (implementation details are provided in Appendix E.5). In Table 1, we augmented the results provided in [duan2016benchmarking] with the performance of POIS on the considered tasks. We can see that A-POIS is able to reach an overall behavior comparable with the best of the action-based algorithms, approaching TRPO and beating DDPG. Similarly, P-POIS exhibits a performance similar to CEM [szita2006learning], the best performing among the parameter-based methods. The complete results are reported in Appendix F.

[Table 1: average return of the action-based methods (REINFORCE, TRPO, DDPG, A-POIS) and the parameter-based methods (CEM, P-POIS) on the Cart-Pole Balancing, Mountain Car, Double Inverted Pendulum and Swimmer tasks with deep neural policies.]
In this paper, we presented a new actor-only policy optimization algorithm, POIS, which alternates online and offline optimization in order to efficiently exploit the collected trajectories, and which can be used in combination with action-based and parameter-based exploration. In contrast to the state-of-the-art algorithms, POIS has a strong theoretical grounding, since its surrogate objective function derives from a statistical bound on the estimated performance that captures the uncertainty induced by importance sampling. The experimental evaluation showed that POIS, in both its versions (action-based and parameter-based), achieves a performance comparable with TRPO, PPO and other classical algorithms on continuous control tasks. Natural extensions of POIS could focus on employing per-decision importance sampling, adaptive batch sizes, and trajectory reuse. Future work also includes scaling POIS to high-dimensional tasks and highly stochastic environments. We believe that this work represents a valuable starting point for a deeper understanding of modern policy optimization and for the development of effective and scalable policy search methods.
The study was partially funded by Lombardy Region (Announcement POR FESR 2014-2020).
F.F. was partially funded through ERC Advanced Grant (no: 742870).
In the following, we briefly recap the contents of the Appendix.
Appendix B reports all proofs and derivations.
Appendix C provides an analysis of the distribution of the importance weights in the case of univariate Gaussian behavioral and target distributions.
Appendix D shows some bounds on the bias and variance of the self-normalized importance sampling estimator and provides a high-confidence bound.
Appendix E illustrates some implementation details of POIS, in particular the line search algorithms, the estimation of the Rényi divergence, the computation of the FIM, and practical versions of P-POIS.
Appendix F provides the hyperparameters used in the experiments and further results.
Policy optimization algorithms can be classified according to different dimensions (Table 2). It is by now established, in the policy-based RL community, that effective algorithms, either on-policy or off-policy, should account for the variance of the gradient estimate. Early attempts, in the class of action-based algorithms, are the usage of a baseline to reduce the estimated gradient variance without introducing bias [baxter2001infinite, peters2008reinforcement]. A similar rationale is at the basis of actor-critic architectures [konda2000actor, sutton2000policy, peters2008natural], in which an estimate of the value function is used to reduce uncertainty. Baselines are typically constant (REINFORCE), time-dependent (G(PO)MDP) or state-dependent (actor-critic), but these approaches have recently been extended to account for action-dependent baselines [tucker2018mirage, wu2018variance]. Even though parameter-based algorithms are, by nature, affected by smaller variance w.r.t. action-based ones, similar baselines can be derived [zhao2011analysis]. A first dichotomy in the class of policy-based algorithms arises when considering the minimal unit used to compute the gradient. Episode-based (or episodic) approaches [e.g., williams1992simple, baxter2001infinite] perform the gradient estimation by averaging the gradients of each episode, which needs to have a finite horizon. On the contrary, step-based approaches [e.g., schulman2015trust, schulman2017proximal, lillicrap2015continuous], derived from the Policy Gradient Theorem [sutton2000policy], can estimate the gradient by averaging over time steps. The latter require a function approximator (a critic) to estimate the Q-function, or directly the advantage function [schulman2015high]. When coming to the on-/off-policy dichotomy, the previous distinction has a relevant impact.
Indeed, episode-based approaches need to perform importance sampling on whole trajectories, so the importance weights are products of the policy ratios of all actions executed within a trajectory, whereas step-based algorithms just need to weight each sample with the corresponding policy ratio. The latter case helps keep the value of the importance weights close to one, but the need for a critic prevents a complete analysis of the uncertainty, since the bias/variance injected by the critic is hard to quantify [konda2000actor]. Moreover, in the off-policy scenario, it is necessary to control some notion of dissimilarity between the behavioral and target policies, as the variance increases when moving too far apart. This is the case of TRPO [schulman2015trust], where the regularization constraint based on the Kullback-Leibler divergence helps control the importance weights and, moreover, originates from an exact bound on the performance improvement. Intuitively, the same rationale applies to the truncation of the importance weights employed by PPO, which prevents excessively large steps in policy space. Nevertheless, the step size in TRPO and the truncation range in PPO are just hyperparameters with a limited statistical meaning. On the contrary, other actor-critic architectures including experience replay have been proposed, such as [wang2016sample], in which the importance weights are truncated, but the method is able to account for the injected bias. The authors propose to keep a running mean of the best policies seen so far to avoid a hard constraint on the policy dissimilarity. Differently from these methods, POIS directly models the uncertainty due to the importance sampling procedure. The bound in Theorem 4.1 introduces a unique hyperparameter, the confidence level δ, which has a precise statistical meaning. The optimal value of δ (like the step size in TRPO and the truncation range in PPO) is task-dependent and might vary during the learning procedure.
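The contrast between trajectory-level and per-step importance weights can be simulated directly (an illustrative sketch with hypothetical Gaussian behavioral and target policies): per-step ratios stay close to one, while their product over a horizon of H steps has far larger variance.

```python
import numpy as np

rng = np.random.default_rng(1)
H, n_traj = 50, 10_000

def log_ratio(a, mu_t, mu_b, sigma=1.0):
    # log[ pi_target(a) / pi_behavioral(a) ] for Gaussian policies
    return ((a - mu_b) ** 2 - (a - mu_t) ** 2) / (2 * sigma ** 2)

mu_b, mu_t = 0.0, 0.1                          # behavioral and target means
a = rng.normal(mu_b, 1.0, size=(n_traj, H))    # actions from the behavioral policy

# Step-based: one policy ratio per sample, close to 1.
step_w = np.exp(log_ratio(a, mu_t, mu_b))
# Episode-based: product of the H ratios of each trajectory.
traj_w = np.exp(log_ratio(a, mu_t, mu_b).sum(axis=1))

print(step_w.var())    # ~ exp(0.01) - 1, i.e. about 0.01
print(traj_w.var())    # ~ exp(0.01 * H) - 1, i.e. about 0.65
```

The exponential growth of the trajectory-level weight variance with the horizon is exactly why dissimilarity between behavioral and target policies must be kept under control in episode-based off-policy methods.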
Furthermore, POIS is an episode-based approach, in which the importance weights account for the whole trajectory at once; this might prevent assigning credit to valuable sub-trajectories (as in the case of Inverted Pendulum, see Figure 1). A possible solution is to resort to per-decision importance sampling [precup2000eligibility].
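A minimal sketch of per-decision importance sampling (hypothetical function and notation): the reward at step t is weighted only by the ratios of the actions executed up to t, so early rewards are spared the full trajectory-level weight.

```python
import numpy as np

def per_decision_is_return(ratios, rewards, gamma=0.99):
    """Per-decision IS estimate of a single trajectory's return.

    ratios[t]  = pi_target(a_t | s_t) / pi_behavioral(a_t | s_t)
    rewards[t] = reward observed at step t under the behavioral policy
    """
    w = np.cumprod(ratios)                     # w[t] = prod_{k <= t} ratios[k]
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(w * discounts * rewards)

# The weight applied to the last reward coincides with the full
# trajectory-level importance weight used by episode-based methods.
ratios = np.array([1.1, 0.9, 1.2])
assert np.isclose(np.cumprod(ratios)[-1], np.prod(ratios))
```

Rewards at intermediate steps thus receive smaller, less variable weights than under whole-trajectory importance sampling, while the estimator remains unbiased.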
Algorithm | Action/Parameter-based | On/Off-policy | Optimization problem | Critic | Timestep/Trajectory-based
--- | --- | --- | --- | --- | ---
REINFORCE / G(PO)MDP [williams1992simple, baxter2001infinite] | action-based | on-policy | – | No | episode-based
TRPO [schulman2015trust] | action-based | on-policy | constrained (KL) | Yes | step-based
PPO [schulman2017proximal] | action-based | on/off-policy | – | Yes | step-based
DDPG [lillicrap2015continuous] | action-based | off-policy | – | Yes | step-based
REPS [peters2010relative]⁸ | action-based | on-policy | constrained (KL) | Yes | step-based
RWR [peters2007reinforcement] | action-based | on-policy | – | No | step-based
A-POIS | action-based | on/off-policy | – | No | episode-based
PGPE [sehnke2008policy] | parameter-based | on-policy | – | No | episode-based
IW-PGPE [zhao2013efficient] | parameter-based | on/off-policy | – | No | episode-based
P-POIS | parameter-based | on/off-policy | – | No | episode-based

⁸We indicate with ρ the state–action occupancy [sutton2000policy].
Proof of Lemma 4.1.
From the fact that x_1, \dots, x_N are i.i.d. we can write:
\[
\mathrm{Var}_{x_i \sim Q}\left[\hat{\mu}_{P/Q}\right] = \frac{1}{N}\,\mathrm{Var}_{x \sim Q}\left[w_{P/Q}(x) f(x)\right] \le \frac{1}{N}\,\mathbb{E}_{x \sim Q}\left[w_{P/Q}(x)^2 f(x)^2\right] \le \frac{\|f\|_\infty^2}{N}\,\mathbb{E}_{x \sim Q}\left[w_{P/Q}(x)^2\right] = \frac{\|f\|_\infty^2}{N}\, d_2(P \| Q).
\]
∎
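The lemma can be checked numerically (an illustrative sketch, assuming Gaussian P and Q with equal variance, for which d_2(P‖Q) = exp((μ_P − μ_Q)²/σ²) in closed form, and a bounded test function f):

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps = 100, 5000
mu_p, mu_q, sigma = 0.3, 0.0, 1.0

f = np.cos                                        # bounded, ||f||_inf = 1
d2 = np.exp((mu_p - mu_q) ** 2 / sigma ** 2)      # exponentiated 2-Renyi divergence

x = rng.normal(mu_q, sigma, size=(reps, N))       # samples from the behavioral Q
log_w = ((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2 * sigma ** 2)
mu_hat = np.mean(np.exp(log_w) * f(x), axis=1)    # one IS estimate per replication

bound = d2 / N                                    # ||f||_inf^2 * d2 / N
print(mu_hat.var(), "<=", bound)
```

Across the replications, the empirical variance of the importance sampling estimator stays below the d_2/N bound, as the lemma prescribes.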
Proof of Theorem 4.1.
We start from Cantelli’s inequality applied to the random variable \hat{\mu}_{P/Q}:
\[
\Pr\left(\hat{\mu}_{P/Q} - \mu \ge \lambda\right) \le \frac{1}{1 + \frac{\lambda^2}{\mathrm{Var}\left[\hat{\mu}_{P/Q}\right]}}. \quad (11)
\]
By calling \delta = \left(1 + \frac{\lambda^2}{\mathrm{Var}[\hat{\mu}_{P/Q}]}\right)^{-1}, i.e., \lambda = \sqrt{\frac{1-\delta}{\delta}\,\mathrm{Var}[\hat{\mu}_{P/Q}]}, and considering the complementary event, we get that with probability at least 1 - \delta we have:
\[
\mu \ge \hat{\mu}_{P/Q} - \sqrt{\frac{1-\delta}{\delta}\,\mathrm{Var}\left[\hat{\mu}_{P/Q}\right]}. \quad (12)
\]
By replacing the variance with the bound in Lemma 4.1 we get the result. ∎
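The resulting high-confidence lower bound can be validated empirically (a sketch with hypothetical Gaussian P and Q, whose d_2 is available in closed form): the penalized estimate undershoots the true expectation in at least a 1 − δ fraction of independent runs.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps, delta = 100, 5000, 0.2
mu_p, mu_q, sigma = 0.3, 0.0, 1.0

f = np.cos
true_mu = np.cos(mu_p) * np.exp(-sigma ** 2 / 2)   # E_P[cos x] for x ~ N(mu_p, sigma^2)
d2 = np.exp((mu_p - mu_q) ** 2 / sigma ** 2)

x = rng.normal(mu_q, sigma, size=(reps, N))
w = np.exp(((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2 * sigma ** 2))
mu_hat = np.mean(w * f(x), axis=1)

# High-confidence lower bound, with Var replaced by ||f||_inf^2 d2 / N.
lower = mu_hat - np.sqrt((1 - delta) / delta * d2 / N)
coverage = np.mean(lower <= true_mu)
print(coverage)    # at least 1 - delta = 0.8; conservative in practice
```

Since Cantelli's inequality is distribution-free and the variance is itself upper-bounded, the observed coverage typically exceeds 1 − δ by a wide margin.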
Proof of Lemma 4.2.
We need to compute the second-order Taylor expansion of the Rényi divergence. We start considering the term:
\[
d_2(p_{\theta'} \| p_\theta) = \int_{\mathcal{X}} \frac{p_{\theta'}(x)^2}{p_\theta(x)}\, \mathrm{d}x. \quad (13)
\]
The gradient w.r.t. \theta' is given by:
\[
\nabla_{\theta'}\, d_2(p_{\theta'} \| p_\theta) = 2 \int_{\mathcal{X}} \frac{p_{\theta'}(x)}{p_\theta(x)}\, \nabla_{\theta'} p_{\theta'}(x)\, \mathrm{d}x.
\]
Thus, \nabla_{\theta'}\, d_2(p_{\theta'} \| p_\theta)\big|_{\theta' = \theta} = 2 \nabla_\theta \int_{\mathcal{X}} p_\theta(x)\, \mathrm{d}x = 0. We now compute the Hessian:
\[
H_{\theta'}\, d_2(p_{\theta'} \| p_\theta) = 2 \int_{\mathcal{X}} \left[\frac{\nabla_{\theta'} p_{\theta'}(x)\, \nabla_{\theta'} p_{\theta'}(x)^T}{p_\theta(x)} + \frac{p_{\theta'}(x)}{p_\theta(x)}\, H_{\theta'}\, p_{\theta'}(x)\right] \mathrm{d}x.
\]
Evaluating the Hessian in \theta' = \theta we have:
\[
H_{\theta'}\, d_2(p_{\theta'} \| p_\theta)\big|_{\theta'=\theta} = 2 \int_{\mathcal{X}} p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^T\, \mathrm{d}x + 2 H_\theta \int_{\mathcal{X}} p_\theta(x)\, \mathrm{d}x = 2\,\mathcal{F}(\theta),
\]
where the second term vanishes since \int_{\mathcal{X}} p_\theta(x)\, \mathrm{d}x = 1. Now, d_2(p_\theta \| p_\theta) = 1. Thus:
\[
d_2(p_{\theta'} \| p_\theta) \simeq 1 + (\theta' - \theta)^T \mathcal{F}(\theta)\, (\theta' - \theta).
\]
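For two Gaussians with equal variance σ², d_2(p_{θ'} ‖ p_θ) = exp((θ' − θ)²/σ²) and the Fisher information of the mean parameter is 1/σ², so the second-order expansion can be verified directly (an illustrative numerical check, not part of the paper):

```python
import math

sigma = 0.7
fisher = 1.0 / sigma ** 2              # FIM of a Gaussian w.r.t. its mean

for dtheta in (0.05, 0.01):
    exact = math.exp(dtheta ** 2 / sigma ** 2)   # closed-form d_2
    taylor = 1.0 + fisher * dtheta ** 2          # 1 + (dtheta)^T F (dtheta)
    print(dtheta, exact - taylor)                # residual shrinks as O(dtheta^4)
```

The residual decays with the fourth power of the parameter displacement, confirming that the quadratic form in the Fisher information captures the local behavior of d_2.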