Reinforcement Learning (RL, sutton2018reinforcement) has achieved astounding successes in games (mnih2015human; silver2018general; OpenAI_dota; alphastarblog), matching or surpassing human performance on several occasions. However, the much-anticipated applications of RL to real-world tasks such as robotics (kober2013reinforcement), autonomous driving (okuda2014survey) and finance (li2014online) still seem far away. The reasons for this delay have to do with the very nature of RL, which relies on the repeated interaction of the learning machine with the surrounding environment, e.g., a manufacturing plant, a trafficked road, a stock market. The trial-and-error process resulting from this interaction is what makes RL so powerful and general. However, it also poses significant challenges in terms of sample efficiency (recht2018tour) and safety (amodei2016concrete).
In Reinforcement Learning, the term safety can actually refer to a variety of problems (garcia2015comprehensive). Of course, the general concern is always the same: to avoid or limit damage. In financial applications, it is typically a loss of money. In robotics and autonomous driving, one should also consider direct damage to people and machines. In this work, we do not make assumptions on the nature of the damage, but we assume it is entirely encoded in the scalar reward signal that is presented to the agent in order to evaluate its actions. Other works (e.g., turchetta2016safe) employ a distinct safety signal, separate from rewards.
A further distinction concerns the scope of safety constraints with respect to the agent's life. One may simply require the final behavior, the one that is deployed at the end of the learning process, to be safe. This is typically the case when learning is performed in simulation, but the final controller has to be deployed in the real world. The main challenge there lies in transferring safety properties from simulation to reality (e.g., tan2018sim). In other cases, learning must be performed, or at least completed, on the actual system, because no reliable simulator is available (e.g., peters2008reinforcement). In such a scenario, safety must be enforced for the whole duration of the learning process. This poses a further challenge, as the agent must necessarily go through a sequence of sub-optimal behaviors before learning its final policy. The problem of learning while containing damage is also known as safe exploration (hans2008safe; pecka2014safe; amodei2016concrete), and will be the focus of this work.¹

¹ Note that in this paper we are not concerned with how exploration should be performed, but only with ensuring safety in a context where some form of exploration is necessary. In particular, we only refer to undirected exploration methods, which apply perturbations at the level of actions and are the most used for continuous problems. We do not consider approaches to exploration based on optimism (e.g., jaksch2010near) and/or baseline improvements (e.g., conservative bandits: Wu2016conservative; Kazerouni2017conservativelinear).
garcia2015comprehensive provide a comprehensive survey on safe RL, where the existing approaches are organized into two main families: methods that modify the exploration process directly in order to explicitly avoid dangerous actions (e.g., gehring2013smart), and methods that constrain exploration in a more indirect way by modifying the reward optimization process. The former typically require some sort of external knowledge, such as human demonstrations or advice (e.g., abbeel2010autonomous; clouse1992teaching), which we do not assume to have in this work, except in the form of a sufficiently informative reward signal. Optimization-based methods (i.e., belonging to the second class) are better suited for this scenario. A particular kind, identified by García and Fernández as constrained criteria (moldovan2012safe; dicastro2012policy; kadota2006discounted), enforces safety by introducing constraints in the optimization problem, i.e., reward maximization.²

² Notably, the approach proposed by (Chow2018lyapunov) lies between the two classes. It relies on the framework of constrained MDPs to guarantee the safety of a behavior policy during training via a set of local, linear constraints defined using an external cost signal. Similar techniques have been used in (Berkenkamp2017safembrl) to guarantee the ability to re-enter a "safe region" during exploration.
A typical constraint is that the agent's performance, i.e., the sum of rewards, must never fall below a user-specified threshold (geibel2005risk; thomas2015high). Under the assumption that the reward signal completely encodes danger, low performance can be associated with dangerous behavior, so that the performance threshold acts as a safety threshold. If we only cared about the safety of the final controller, the traditional RL objective, i.e., maximizing cumulative reward, would be enough. However, most RL algorithms are known to yield oscillating performance during the learning phase. Regardless of the final solution, intermediate ones may violate the threshold and hence yield unsafe behavior. This problem is known as policy oscillation (bertsekas2011approximate; wagner2011reinterpretation).
A similar constraint, which confronts the policy oscillation problem even more directly, is Monotonic Improvement (MI, kakade2002approximately; pirotta2013safe), and is the one adopted in this work. The requirement is that each new policy implemented by the agent during the learning process must not perform worse than the previous one. In this way, if the initial policy is safe, so are all the subsequent ones.
The way safety constraints such as MI can be imposed on the optimization process depends, of course, on what kind of policies are considered as candidates and on how the optimization itself is performed. These two aspects are often tied, and depend on the specific kind of RL algorithm that is employed. Policy Search (PS, deisenroth2013survey) is a family of RL algorithms where the class of candidate policies is fixed in advance and a direct search for the best policy within the class is performed. This makes PS algorithms radically different from value-based algorithms such as Deep Q-Networks (mnih2015human), where the optimal policy is a byproduct of a learned value function. Although value-based methods gained great popularity from their successes in games, PS algorithms are better suited for real-world tasks, especially those involving cyber-physical systems. The main reasons are the ability of PS methods to deal with high-dimensional, continuous state and action spaces, their convergence guarantees (sutton2000policy), their robustness to sensor noise, and the superior control they offer on the set of feasible policies. The latter allows domain knowledge to be introduced into the optimization process, possibly including safety constraints.
In this work, we focus on Policy Gradient methods (PG, sutton2000policy; peters2008reinforcement), where the set of candidate policies is a class of parametric functions and the optimization is performed via stochastic gradient ascent on the performance objective. In particular, we analyze the prototypical PG algorithm, REINFORCE (williams1992simple), and show how the MI constraint can be imposed by adaptively selecting its meta-parameters during the learning process. To achieve this, we study in more depth the stochastic gradient-based optimization process that is at the core of all PG methods (robbins1951stochastic). In particular, we identify a general family of parametric policies that makes the optimization objective Lipschitz smooth (nesterov2013introductory) and allows an easy computation of the related Lipschitz constant. This family, called smoothing policies, includes commonly used policy classes from the PG literature, namely Gaussian and Softmax policies. Using known properties of Lipschitz-smooth functions, we then provide lower bounds on the performance improvement produced by gradient-based updates, as a function of tunable meta-parameters. This, in turn, allows us to identify meta-parameter schedules that guarantee MI with high probability. In previous works, a similar result was only achieved for Gaussian policies (pirotta2013adaptive; papini2017adaptive).
The meta-parameters analyzed here are the step size of the policy updates, or learning rate, and the batch size of gradient estimations, i.e., the number of trials that are performed within a single policy update. These meta-parameters, already present in the original REINFORCE algorithm, are typically selected by hand and fixed for the whole learning process (duan2018benchmarking). Besides guaranteeing monotonic improvement, our proposed method removes the burden of selecting these meta-parameters. This safe, automatic selection within the REINFORCE algorithmic framework yields SPG, our Safe Policy Gradient algorithm.
The paper is organized as follows: in Section 2
we introduce the necessary background on Markov decision processes, policy optimization, and smooth functions. In Section 3, we introduce smoothing policies and show the useful properties they induce on the policy optimization problem, most importantly a lower bound on the performance improvement yielded by an arbitrary policy parameter update (Theorem 3.3). In Section 4, we exploit these properties to select the meta-parameters of REINFORCE in a way that guarantees MI with high probability. In Section 5, we show that Gaussian and Softmax policies are smoothing. In Section 6, we provide bounds on the variance of policy gradient estimators that are necessary for a rigorous applicability of the previous results. In Section 7, we present the SPG algorithm. In Section 8, we compare our contribution with the related, existing literature. Finally, we discuss future work in Section 9.
In this section, we revise continuous Markov Decision Processes (MDPs, puterman2014markov), actor-only Policy Gradient algorithms (PG, deisenroth2013survey), and some properties of smooth functions.
2.1 Markov Decision Processes
A Markov Decision Process (MDP, puterman2014markov) is a tuple ⟨S, A, p, r, γ, μ⟩, comprised of a state space S, an action space A, a Markovian transition kernel p : S × A → Δ(S), where Δ(X) denotes the set of probability density functions over X, a reward function r : S × A → ℝ, a discount factor γ ∈ [0, 1) and an initial-state distribution μ ∈ Δ(S). We only consider bounded-reward MDPs, and denote with R = sup_{s∈S, a∈A} |r(s, a)| the maximum absolute reward. The MDP is used to model the interaction of a rational agent with the environment. We model the agent's behavior with a policy π : S → Δ(A), a stochastic mapping from states to actions. The initial state is drawn as s₀ ∼ μ. For each time step t, the agent draws an action aₜ ∼ π(·|sₜ), conditioned on the current state sₜ. Then, the agent obtains a reward r(sₜ, aₜ) and the state of the environment transitions to sₜ₊₁ ∼ p(·|sₜ, aₜ). The goal of the agent is to maximize the expected sum of discounted rewards, or performance:

J(π) = E[ Σₜ₌₀^∞ γᵗ r(sₜ, aₜ) ],
where the expectation is over all the actions selected by the agent and all the state transitions. We focus on continuous MDPs, where the state and action spaces are continuous subsets of Euclidean spaces. However, all the results naturally extend to the discrete case by replacing integrals with summations.
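As a minimal illustration of the performance objective (the function name and the single-trajectory setting are ours, not the paper's), the discounted sum of rewards along one trajectory can be computed as follows; averaging it over many sampled trajectories estimates the expectation:

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards for one trajectory: sum_t gamma^t * r_t.
    Computed backwards (Horner's scheme) for numerical convenience."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```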
Given an MDP, the purpose of RL is to find an optimal policy without knowing the transition kernel and the reward function in advance, but only through interaction with the environment. To better characterize this optimization objective, it is convenient to introduce further quantities. We denote with the transition kernel of the Markov Process induced by policy , i.e., . The -step transition kernel under policy is defined recursively as follows:
where denotes the Dirac delta. Note that , from the sifting property of . The -step transition kernel allows us to define the following state-occupancy measures:
measuring the (discounted) probability of visiting a state starting from another state or from the start, respectively. Note that these measures are unnormalized:
The following property of will be useful:  Any function that can be recursively defined as:
where is any bounded function, is also equal to:
By unrolling the recursive definition:
where (8) is from Lemma 18 in (ciosek2018expected). The state-value function provides the discounted sum of rewards obtained, in expectation, by following policy from state , and is defined recursively by the Bellman equation:
Similarly, the action-value function:
denotes the discounted sum of rewards obtained, in expectation, by taking action in state and following afterwards. The two value functions are closely related:
For bounded-reward MDPs, the value functions are bounded for every policy :
where and . Using the definition of state-value function we can rewrite the performance as follows:
Policy search methods actively search for the policy maximizing the performance, typically within a specific class of policies.
2.2 Parametric policies
In this work, we only consider parametric policies. Given a parameter vector, a parametric policy is a stochastic mapping from states to actions parametrized by , denoted with . The search for the optimal policy is thus limited to the policy class . This corresponds to finding an optimal parameter, i.e., . For ease of notation, we often write in place of in function arguments and subscripts, e.g., , and in place of , and , respectively. We further restrict our attention to policies that are twice differentiable w.r.t. , i.e., for which the gradient and the Hessian are defined everywhere and finite. For ease of notation, we omit the subscript in when clear from the context. Given any twice-differentiable scalar function , we denote with the -th gradient component, i.e., , and with the Hessian element of coordinates , i.e., . We also write to denote when this does not introduce any ambiguity.
The Policy Gradient Theorem (sutton2000policy) allows us to characterize the gradient of the performance for a differentiable policy as an expectation over states and actions visited under :
The gradient of the log-likelihood is called score function, while the Hessian of the log-likelihood is called observed information.
2.3 Actor-only policy gradient
In practice, we always consider finite episodes of length . We call this the effective horizon of the MDP, chosen to be sufficiently large for the problem not to lose generality (when the reward is uniformly bounded by , by setting , the discounted truncated sum of rewards is -close to the infinite sum; e.g., kakade2003sample, Sec. 2.3.3). We denote with a trajectory, i.e., a sequence of states and actions of length such that , , for and some policy . In this context, the performance of a parametric policy can be defined as:
where denotes the probability density of the trajectory that can be generated by following policy , i.e., . Let be a batch of trajectories generated with , i.e., i.i.d. for . Let be an estimate of the policy gradient based on . Such an estimate can be used to perform stochastic gradient ascent on the performance objective :
where is a step size and is called batch size. This yields an Actor-only Policy Gradient method, provided in Algorithm 1. Under mild conditions, this algorithm is guaranteed to converge to a local optimum (the objective is typically non-convex). As for the gradient estimator, we can use REINFORCE (williams1992simple) (in the literature, the term REINFORCE is often used to denote actor-only policy gradient algorithms in general; in this paper, we always use it for the specific policy gradient method proposed by williams1992simple):
or its refinement, GPOMDP (baxter2001infinite), which typically suffers from less variance (peters2008reinforcement):
where the superscript on states and actions denotes the -th trajectory of the dataset and is a (possibly time-dependent and vector-valued) control variate, or baseline. Both estimators are unbiased for any action-independent baseline (valid action-dependent baselines have also been proposed; see tucker2018mirage for a recent discussion). peters2008reinforcement prove that Algorithm 1 with the GPOMDP estimator is equivalent to Monte-Carlo PGT (Policy Gradient Theorem, sutton2000policy), and provide variance-minimizing baselines for both REINFORCE and GPOMDP, called Peters' baselines henceforth.
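The following sketch illustrates the shape of the REINFORCE computation under simplifying assumptions of ours: a one-dimensional linear-Gaussian policy, trajectories stored as (states, actions, rewards) triples, and a constant scalar baseline. It is meant as an illustration, not the paper's exact implementation:

```python
import numpy as np

def reinforce_gradient(trajectories, score, gamma, baseline=0.0):
    """REINFORCE estimate: average over trajectories of
    (sum_t grad log pi(a_t | s_t)) * (discounted return - baseline)."""
    grads = []
    for states, actions, rewards in trajectories:
        ret = sum(gamma**t * r for t, r in enumerate(rewards))
        total_score = sum(score(s, a) for s, a in zip(states, actions))
        grads.append(total_score * (ret - baseline))
    return np.mean(grads, axis=0)

# Illustrative 1-D Gaussian policy a ~ N(theta * s, sigma^2), whose score is
# grad_theta log pi(a | s) = (a - theta * s) * s / sigma^2.
theta, sigma = 0.5, 1.0
score = lambda s, a: np.array([(a - theta * s) * s / sigma**2])
trajs = [([1.0], [1.2], [1.0]), ([1.0], [0.3], [0.5])]
g_hat = reinforce_gradient(trajs, score, gamma=0.9)
```

GPOMDP differs only in pairing each reward with the scores of the preceding steps, which typically reduces variance.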
Algorithm 1 is called actor-only to distinguish it from actor-critic policy gradient algorithms (sutton2000policy; peters2005natural; silver2014deterministic; schulman2015trust; wang2016sample; mnih2016asynchronous; wu2017scalable; ciosek2018expected; haarnoja2018soft; espeholt2018impala), where an approximate value function, or critic, is employed in the gradient computation. In this work, we will focus on actor-only algorithms, for which safety guarantees are more easily proven.⁶ A remarkable exception is the proof of convergence of actor-critic methods with compatible function approximation provided by Bhatnagar2009actorcritic.

⁶ The distinction is not so sharp, as a critic can be seen as a baseline and vice versa. We call critic an explicit value function estimate used in policy gradient estimation. Moreover, actor-only methods are necessarily episodic. Actor-critic algorithms tend to have the time step, not the episode, as their atomic update, although they typically resolve to equally large batch updates (duan2018benchmarking).
Besides improving the gradient estimation (baxter2001infinite; weaver2001optimal; gu2017q; peters2008reinforcement; grathwohl2017backpropagation; liu51action; wu2018variance; papini2018stochastic), generalizations of Algorithm 1 include: using a vector step size (yu2006fast; papini2017adaptive), i.e., , where is a positive vector and denotes the Hadamard (element-wise) product; making the step size adaptive, i.e., iteration- and/or data-dependent (pirotta2013adaptive; kingma2014adam); making the batch size also adaptive (papini2017adaptive); and applying a preconditioning matrix to the gradient, as in Natural Policy Gradient (kakade2002natural) and second-order methods (furmston2012unifying). Note that a vector step size can be seen as a special diagonal preconditioning matrix.
2.4 Smooth functions
In the following, we denote with ‖x‖_p the p-norm of a vector x. When the subscript is omitted, we always mean the 2-norm, also called Euclidean norm. For a matrix A, ‖A‖ denotes the induced norm, that is, the spectral norm for p = 2.
Let f : X ⊆ ℝᵈ → ℝᵐ be a (possibly non-convex) vector-valued function. We call f Lipschitz continuous if there exists L > 0 such that, for every x, x′ ∈ X:

‖f(x′) − f(x)‖ ≤ L‖x′ − x‖.
Let f : X ⊆ ℝᵈ → ℝ be a real-valued differentiable function. We call f Lipschitz smooth if its gradient ∇f is Lipschitz continuous, i.e., there exists L > 0 such that, for every x, x′ ∈ X:

‖∇f(x′) − ∇f(x)‖ ≤ L‖x′ − x‖.
Whenever we want to specify the Lipschitz constant⁸ of the gradient, we call f L-smooth. For a twice-differentiable function, the following holds: Let X be a convex subset of ℝᵈ and f : X → ℝ be a twice-differentiable function. If the Hessian is uniformly bounded in spectral norm by L, i.e., ‖∇²f(x)‖ ≤ L for every x ∈ X, then f is L-smooth. Lipschitz-smooth functions admit a quadratic bound on the deviation from a linear behavior: [Quadratic Bound] Let X be a convex subset of ℝᵈ and f : X → ℝ be an L-smooth function. Then, for every x, x′ ∈ X:

|f(x′) − f(x) − ⟨∇f(x), x′ − x⟩| ≤ (L/2)‖x′ − x‖².

⁸ The Lipschitz constant is usually defined as the smallest constant satisfying the Lipschitz condition. In this paper, instead, we call Lipschitz constant any constant for which the Lipschitz condition holds.
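A quick numerical sanity check of the quadratic bound (our own toy example: sin is 1-smooth since its second derivative is bounded by one in absolute value):

```python
import math

def quadratic_bound_gap(f, df, L, x, y):
    """Slack of the quadratic bound: (L/2)*(y - x)**2 minus the deviation
    of f from its linearization at x. Non-negative whenever f is L-smooth."""
    deviation = abs(f(y) - f(x) - df(x) * (y - x))
    return 0.5 * L * (y - x) ** 2 - deviation

# sin is 1-smooth, so the slack is non-negative at every pair of points.
gaps = [quadratic_bound_gap(math.sin, math.cos, 1.0, x, y)
        for x in (-1.0, 0.0, 2.0) for y in (-0.5, 1.5)]
assert all(g >= 0 for g in gaps)
```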
3 Smooth Policy Gradient
In this section, we provide lower bounds on performance improvement based on general assumptions on the policy class.
3.1 Smoothing policies
We introduce a family of parametric policies having properties that we deem desirable for policy-gradient learning. We call them smoothing, as they induce the smoothness of the performance:  Let be a class of twice-differentiable parametric policies. We call it smoothing if the parameter space is convex and there exist non-negative constants such that, for every state and in expectation over actions, the Euclidean norm of the score function:
the squared Euclidean norm of the score function:
and the spectral norm of the observed information:
are upper-bounded. Note that the definition requires the bounding constants to be independent of the policy parameters and of the state. For this reason, the existence of such constants depends on the policy parametrization. We call a policy class -smoothing when we want to specify the bounding constants. In Section 5, we show that some of the most commonly used policies, such as the Gaussian policy for continuous actions and the Softmax policy for discrete actions, are smoothing.
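To make the bounding constants tangible, the sketch below (entirely our own illustration: a scalar Gaussian policy with mean θ·s and fixed standard deviation σ, evaluated at a single bounded state) compares Monte Carlo estimates of the expected score norm and squared score norm against their closed-form values for this policy class; over a bounded state space, the maxima of such quantities play the role of the smoothing constants:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, s = 0.3, 0.5, 1.0          # illustrative parameter, std, state
a = rng.normal(theta * s, sigma, size=200_000)
score = (a - theta * s) * s / sigma**2   # d/d_theta log N(a; theta*s, sigma^2)

# Closed forms for this policy: E|score| = |s| * sqrt(2/pi) / sigma,
# E[score^2] = s^2 / sigma^2; both are state-dependent but bounded if |s| is.
print(np.mean(np.abs(score)), abs(s) * np.sqrt(2 / np.pi) / sigma)
print(np.mean(score**2), s**2 / sigma**2)
```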
3.2 Policy Hessian
We now show that the policy Hessian for a smoothing policy has bounded spectral norm. First, we write the policy Hessian for a general parametric policy as follows:  (kakade2002natural, equation 6) Given a twice-differentiable parametric policy , the policy Hessian is:
The first derivation was provided in (kakade2001optimizing); we restate it for the sake of clarity. We first compute the Hessian of the state-value function:
where (27) is from the log trick (∇_θ π_θ = π_θ ∇_θ log π_θ), (28) is from another application of the log trick, (29) is from (13), and (30) is from Lemma 2.1 with as the recursive term. Computing the Hessian of the performance is then trivial:
where the first equality is from (15). Combining (30), (31) and (4) gives the statement of the lemma. We can now bound the policy Hessian for a smoothing policy:  Given a -smoothing policy , the spectral norm of the policy Hessian can be upper-bounded as follows:
We start by stating the gradient of the state-value function (see the proof of Theorem 1 in sutton2000policy):
where (37) is from Jensen's inequality (all norms are convex) and the triangle inequality, (38) is from for any two vectors and , (39) is from (14) and (36), and the last inequality is from the smoothing assumption.
3.3 Smooth Performance
For a smoothing policy, the performance is Lipschitz smooth:  Given a -smoothing policy class , the performance is -smooth with the following Lipschitz constant:
From Lemma 3.2, is a bound on the spectral norm of the policy Hessian. From Lemma 8, this is a valid Lipschitz constant for the policy gradient, hence the performance is -smooth. The smoothness of the performance, in turn, yields the following property on the guaranteed performance improvement:  Let be a -smoothing policy class. For every :
where and . It suffices to apply Lemma 8 with the Lipschitz constant from Lemma 3.3. In the following, we will exploit this property of smoothing policies to enforce safety guarantees on the policy updates performed by Algorithm 1, i.e., stochastic gradient ascent updates. However, Theorem 3.3 applies to any policy update as long as .
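In generic smooth-optimization notation (symbols chosen by us; the paper's own constants take the place of L here), guarantees of this kind descend from the quadratic bound and have the form:

```latex
J(\theta') - J(\theta) \;\geq\; \langle \theta' - \theta,\, \nabla J(\theta) \rangle \;-\; \frac{L}{2}\,\lVert \theta' - \theta \rVert^{2},
```

so that any update whose first-order gain exceeds the quadratic penalty is guaranteed to improve the performance.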
4 Guaranteed Improvement Maximization
In this section, we provide meta-parameters for Algorithm 1 that maximize a lower bound on the performance improvement for smoothing policies. This yields safety in the sense of Monotonic Improvement (MI), i.e., non-negative performance improvements at each policy update.
In standard policy optimization, at each learning iteration , we aim to find the policy update that maximizes the new performance , or equivalently:
since is fixed. Unfortunately, the performance of the updated policy cannot be known in advance (it could be estimated with off-policy evaluation techniques, but this would introduce an additional, non-negligible source of variance; the idea of using off-policy evaluation to select meta-parameters has been recently explored by paul2019fast). For this reason, we replace the optimization objective in (42) with a lower bound, i.e., a guaranteed improvement. In particular, taking Algorithm 1 as our starting point, we maximize the guaranteed improvement of a policy gradient update (line 5) by selecting optimal meta-parameters. The solution of this meta-optimization problem provides a lower bound on the actual performance improvement. As long as this is always non-negative, MI is guaranteed.
4.1 Adaptive step size (exact framework)
To decouple the pure optimization aspects of this problem from gradient estimation issues, we first consider an exact policy gradient update, i.e., , where we assume to have a first-order oracle, i.e., to be able to compute the exact policy gradient . This assumption is clearly not realistic, and will be removed in Section 4.3. In this simplified framework, performance improvement can be guaranteed deterministically. Moreover, the only relevant meta-parameter is the step size of the update. We first need a lower bound on the performance improvement . For a smoothing policy, we can use the following:  Let be a -smoothing policy class. Let and , where . Provided , the performance improvement of w.r.t. can be lower bounded as follows:
where . This is just a special case of Theorem 3.3 with . This bound is in the typical form of performance improvement bounds (e.g., kakade2002approximately; pirotta2013adaptive; schulman2015trust; cohen2018diverse): a positive term accounting for the anticipated advantage of over , and a penalty term accounting for the mismatch between the two policies, which makes the anticipated advantage less reliable. In our case, the mismatch is measured by the curvature of the performance w.r.t. the policy parameters, via the Lipschitz constant of the policy gradient. This lower bound is quadratic in , hence we can easily find the optimal step size .  Let be the guaranteed performance improvement of an exact policy gradient update, as defined in Theorem 4.1. Under the same assumptions, is maximized by the constant step size , which guarantees the following non-negative performance improvement:
We just maximize , which is a quadratic function of . The global optimum is attained by . The improvement guarantee follows from Theorem 4.1.
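The maximization in the proof is elementary; the sketch below (generic coefficients a, b > 0 standing for the linear gain and the quadratic penalty of the bound) makes it explicit. In particular, if the exact-gradient bound takes the form α‖∇J‖² − (L/2)α²‖∇J‖², then a = ‖∇J‖² and b = (L/2)‖∇J‖², which gives α* = 1/L and a guaranteed improvement of ‖∇J‖²/(2L):

```python
def optimal_step(a, b):
    """Maximize the quadratic lower bound g(alpha) = a*alpha - b*alpha**2
    over alpha > 0: the maximizer is a / (2b), with value a**2 / (4b)."""
    alpha_star = a / (2.0 * b)
    return alpha_star, a * alpha_star - b * alpha_star**2

alpha, guaranteed = optimal_step(a=4.0, b=1.0)  # alpha* = 2.0, value = 4.0
```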
4.2 Adaptive step size (approximate framework)
In practice, we cannot compute the exact gradient , but only an estimate (see e.g., (18)) obtained from trajectories. To find the optimal step size, we just need to adapt the performance-improvement lower bound of Theorem 4.1 to stochastic-gradient updates. Since sample trajectories are involved, this new lower bound will only hold with high probability. First, we need the following assumption on the gradient estimation error:
For every , there exists a non-negative constant such that, with probability at least :
for every and .
One way to characterize the estimation error is to upper-bound the variance of the estimator:  Let be an unbiased estimator of such that:
then satisfies Assumption 1 with . We apply the vector version of Chebyshev’s inequality (ferentinos1982tcebycheff). In Section 6, we will provide variance upper bounds for the REINFORCE and the GPOMDP policy gradient estimators in the case of smoothing policies.
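The way the proposition is used downstream can be sketched as follows (helper names are ours; V denotes the assumed per-trajectory variance bound, so that the variance of the batch estimator is at most V/N):

```python
import math

def error_radius(V, N, delta):
    """Vector Chebyshev: if Var[ghat] <= V / N, then with probability
    at least 1 - delta, ||ghat - grad J|| <= sqrt(V / (N * delta))."""
    return math.sqrt(V / (N * delta))

def min_batch_size(V, delta, eps):
    """Smallest batch size guaranteeing error radius <= eps w.p. >= 1 - delta."""
    return math.ceil(V / (delta * eps**2))

eps = error_radius(V=100.0, N=400, delta=0.25)  # sqrt(100 / 100) = 1.0
```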
Under Assumption 1, we can adapt Theorem 4.1 to the stochastic gradient case as follows:  Let be a -smoothing policy class. Let and , where , and satisfies Assumption 1. Provided , the performance improvement of w.r.t. can be lower bounded, with probability at least , as follows:
where . From Assumption 1, with probability at least :
Then, from the law of cosines: