1 Introduction
The field of reinforcement learning (RL) has witnessed tremendous success in many high-dimensional control problems, including video games (Mnih et al., 2015), board games (Silver et al., 2016), robot locomotion (Lillicrap et al., 2016), manipulation (Levine et al., 2016; Kalashnikov et al., 2018), navigation (Faust et al., 2018), and obstacle avoidance (Chiang et al., 2019). In standard RL, the ultimate goal is to optimize the expected sum of rewards/costs, and the agent is free to explore any behavior as long as it leads to performance improvement. Although this freedom might be acceptable in many problems, including those involving simulated environments, and could expedite learning a good policy, it might be harmful in many other problems and could cause damage to the agent (robot) or to the environment (plant or the people working nearby). In such domains, it is crucial that while the agent (RL algorithm) optimizes its long-term performance, it also maintains safe policies both during training and at convergence.
A natural way to incorporate safety is via constraints. A standard model for RL with constraints is the constrained Markov decision process (CMDP) (Altman, 1999), where in addition to its standard objective, the agent must satisfy constraints on expectations of auxiliary costs. Although optimal policies for finite CMDPs with known models can be obtained by linear programming (Altman, 1999), there are not many results for solving CMDPs when the model is unknown or the state and/or action spaces are large or infinite. A common approach to solve CMDPs is to use the Lagrangian method (Altman, 1998; Geibel & Wysotzki, 2005) that augments the original objective function with a penalty on constraint violation and computes the saddle-point of the constrained policy optimization via primal-dual methods (Chow et al., 2017). Although safety is ensured when the policy converges asymptotically, a major drawback of this approach is that it makes no guarantee with regard to the safety of the policies generated during training.
A few algorithms have been recently proposed to solve CMDPs at scale while maintaining safety during training. One such algorithm is constrained policy optimization (CPO) (Achiam et al., 2017). CPO extends the trust-region policy optimization (TRPO) algorithm (Schulman et al., 2015a) to handle the constraints in a principled way and has shown promising empirical results in terms of scalability, performance, and constraint satisfaction, both during training and after convergence. Another class of algorithms of this sort is by Chow et al. (2018). These algorithms use the notion of Lyapunov functions, which have a long history in control theory for analyzing the stability of dynamical systems (Khalil, 1996). Lyapunov functions have been used in RL to guarantee closed-loop stability of the agent (Perkins & Barto, 2002; Faust et al., 2014). They have also been used to guarantee that a model-based RL agent can be brought back to a "region of attraction" during exploration (Berkenkamp et al., 2017). Chow et al. (2018) use the theoretical properties of the Lyapunov functions and propose safe approximate policy and value iteration algorithms. They prove theoretical guarantees for their algorithms when the CMDP is finite and known, and empirically evaluate them when it is large and/or unknown. However, since their algorithms are value-function-based, applying them to continuous action problems is not straightforward, and was left as future work.
In this paper, we build on the problem formulation and theoretical findings of the Lyapunov-based approach to solve CMDPs, and extend it to tackle continuous action problems that play an important role in control theory and robotics. We propose Lyapunov-based safe RL algorithms that can handle problems with large or infinite action spaces, and return safe policies both during training and at convergence. To do so, there are two major difficulties which we resolve: 1) the policy update becomes an optimization problem over the large or continuous action space (similar to standard MDPs with large actions), and 2) the policy update is a constrained optimization problem in which the (Lyapunov) constraints involve integration over the action space, and thus, it is often impossible to have them in closed-form. Since the number of Lyapunov constraints is equal to the number of states, the situation is even more challenging when the problem has a large or an infinite state space. To address the first difficulty, we switch from value-function-based to policy gradient (PG) and actor-critic algorithms. To address the second difficulty, we propose two approaches to solve our constrained policy optimization problem (a problem with infinitely many constraints, each involving an integral over the continuous action space) that can work with any standard on-policy (e.g., proximal policy optimization (PPO), Schulman et al. 2017) and off-policy (e.g., deep deterministic policy gradient (DDPG), Lillicrap et al. 2015) PG algorithm. Our first approach, which we call policy parameter projection or θ-projection, is a constrained optimization method that combines PG with a projection of the policy parameters onto the set of feasible solutions induced by the Lyapunov constraints. Our second approach, which we call action projection or a-projection, uses the concept of a safety layer introduced by Dalal et al. (2018) to handle simple single-step constraints, extends this concept to general trajectory-based constraints, solves the constrained policy optimization problem in closed-form using Lyapunov functions, and integrates this closed-form into the policy network via safety-layer augmentation. Since both approaches guarantee safety at every policy update, they manage to maintain safety throughout training (ignoring errors resulting from function approximation), ensuring that all intermediate policies are safe to be deployed. To prevent constraint violations due to function approximation and modeling errors, similar to CPO, we offer a safeguard policy update rule that decreases constraint cost and ensures near-constraint satisfaction.
Our proposed algorithms have two main advantages over CPO. First, since CPO is closely connected to TRPO, it can only be trivially combined with PG algorithms that are regularized with relative entropy, such as PPO. This restricts CPO to on-policy PG algorithms. In contrast, our algorithms can work with any on-policy (e.g., PPO) and off-policy (e.g., DDPG) PG algorithm. Having an off-policy implementation is beneficial, since off-policy algorithms are potentially more data-efficient, as they can use the data from the replay buffer. Second, while CPO is not a backpropagatable algorithm, due to the backtracking line-search procedure and the conjugate gradient iterations for computing the natural gradient in TRPO, our algorithms can be trained end-to-end, which is crucial for scalable and efficient implementation (Hafner et al., 2017). In fact, we show in Section 3.1 that CPO (minus the line search) can be viewed as a special case of the on-policy version (PPO version) of our θ-projection algorithm, corresponding to a specific approximation of the constraints.
We evaluate our algorithms and compare them with CPO and the Lagrangian method on several continuous control (MuJoCo) tasks and a real-world robot navigation problem, in which the robot must satisfy certain constraints while minimizing its expected cumulative cost. Results show that our algorithms outperform the baselines in terms of balancing the performance and constraint satisfaction (during training), and generalize better to new and more complex environments, including transfer to a real Fetch robot.
2 Preliminaries
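We consider the RL problem in which the agent's interaction with the environment is modeled as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{X}, \mathcal{A}, \gamma, c, P, x_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces; $\gamma \in [0,1)$ is a discounting factor; $c(x,a)$ is the immediate cost function; $P(\cdot \mid x,a)$ is the transition probability distribution; and $x_0 \in \mathcal{X}$ is the initial state. Although we consider deterministic initial state and cost function, our results can be easily generalized to random initial states and costs. We model the RL problems in which there are constraints on the cumulative cost using CMDPs. The CMDP model extends MDP by introducing additional costs and the associated constraints, and is defined by $(\mathcal{X}, \mathcal{A}, \gamma, c, P, x_0, d, d_0)$, where the first six components are the same as in the unconstrained MDP; $d(x)$ is the (state-dependent) immediate constraint cost; and $d_0$ is an upper-bound on the expected cumulative constraint cost.
To formalize the optimization problem associated with CMDPs, let $\Delta$ be the set of Markovian stationary policies, i.e., $\Delta = \{\pi : \mathcal{X} \times \mathcal{A} \to [0,1],\ \sum_{a} \pi(a \mid x) = 1\}$. At each state $x \in \mathcal{X}$, we define the generic Bellman operator w.r.t. a policy $\pi \in \Delta$ and a cost function $h$ as $T_{\pi,h}[V](x) = \sum_{a} \pi(a \mid x) \big[ h(x,a) + \gamma \sum_{x' \in \mathcal{X}} P(x' \mid x,a) V(x') \big]$. Given a policy $\pi \in \Delta$, we define the expected cumulative cost and the safety constraint function (expected cumulative constraint cost) as $\mathcal{C}_\pi(x_0) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t c(x_t,a_t) \mid \pi, x_0\big]$ and $\mathcal{D}_\pi(x_0) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t d(x_t) \mid \pi, x_0\big]$. The safety constraint is then defined as $\mathcal{D}_\pi(x_0) \le d_0$. The goal in CMDPs is to solve the constrained optimization problem
$$\pi^* \in \arg\min_{\pi \in \Delta} \big\{ \mathcal{C}_\pi(x_0) : \mathcal{D}_\pi(x_0) \le d_0 \big\}. \qquad (1)$$
It has been shown that if the feasibility set is non-empty, then there exists an optimal policy in the class of stationary Markovian policies (Altman, 1999, Theorem 3.1).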
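As an illustration of these definitions, the following minimal sketch evaluates the expected cumulative cost and the safety constraint function of a fixed policy on a tiny finite CMDP by repeatedly applying the Bellman operator. The two-state, two-action model and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; the function names are not from the paper.

```python
def bellman_backup(policy, cost, P, V, gamma):
    """Apply the Bellman operator T_{pi,h}[V] once on a finite MDP."""
    n_states = len(P)
    new_V = []
    for x in range(n_states):
        total = 0.0
        for a, pi_a in enumerate(policy[x]):
            expected_next = sum(P[x][a][y] * V[y] for y in range(n_states))
            total += pi_a * (cost[x][a] + gamma * expected_next)
        new_V.append(total)
    return new_V


def evaluate(policy, cost, P, gamma, iters=2000):
    """Approximate the fixed point of T_{pi,h} by repeated backups."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = bellman_backup(policy, cost, P, V, gamma)
    return V


# Hypothetical CMDP: action a deterministically leads to state a.
P = [  # P[x][a][y] = probability of moving from x to y under action a
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
c = [[1.0, 0.0], [1.0, 0.0]]  # task cost c(x, a): action 1 is cheaper
d = [[0.0, 0.0], [1.0, 1.0]]  # constraint cost d(x): state 1 is unsafe
gamma, x0, d0 = 0.9, 0, 5.0

uniform = [[0.5, 0.5], [0.5, 0.5]]
C_uniform = evaluate(uniform, c, P, gamma)[x0]  # expected cumulative cost
D_uniform = evaluate(uniform, d, P, gamma)[x0]  # expected constraint cost
uniform_is_feasible = D_uniform <= d0

greedy = [[0.0, 1.0], [0.0, 1.0]]  # always takes the cheaper action
C_greedy = evaluate(greedy, c, P, gamma)[x0]
D_greedy = evaluate(greedy, d, P, gamma)[x0]
greedy_is_feasible = D_greedy <= d0
```

In this toy model the greedy policy has the lower task cost but spends all of its time in the unsafe state, so it violates the constraint, while the uniform policy is feasible at a higher task cost. This is the trade-off that problem (1) formalizes.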
2.1 Policy Gradient Algorithms
Policy gradient (PG) algorithms optimize a policy by computing a sample estimate of the gradient of the expected cumulative cost induced by the policy, and then updating the policy parameter in the gradient direction. In general, stochastic policies that give a probability distribution over actions are parameterized by a $\kappa$-dimensional vector $\theta$, so the space of policies can be written as $\{\pi_\theta,\ \theta \in \Theta \subset \mathbb{R}^\kappa\}$. Since in this setting a policy is uniquely defined by its parameter $\theta$, policy-dependent functions can be written as a function of $\theta$ or $\pi_\theta$ interchangeably.
Deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) and proximal policy optimization (PPO) (Schulman et al., 2017) are two PG algorithms that have recently gained popularity in solving continuous control problems. DDPG is an off-policy Q-learning style algorithm that jointly trains a deterministic policy $\pi_\theta(x)$ and a Q-value approximator $Q(x,a;\phi)$. The Q-value approximator is trained to fit the true Q-value function and the deterministic policy is trained to optimize $Q(x,\pi_\theta(x);\phi)$ via chain-rule. The PPO algorithm we use is a penalty form of the trust region policy optimization (TRPO) algorithm (Schulman et al., 2015a) with an adaptive rule to tune the penalty weight $\beta$. PPO trains a policy $\pi_\theta$ by optimizing a loss function that consists of the standard policy gradient objective and a penalty on the KL-divergence between the current ($\theta$) and previous ($\theta'$) policies, i.e., $\bar{D}_{\mathrm{KL}}(\theta,\theta') = \mathbb{E}\big[\sum_{t} \gamma^t D_{\mathrm{KL}}\big(\pi_{\theta'}(\cdot \mid x_t) \,\|\, \pi_\theta(\cdot \mid x_t)\big) \mid \pi_{\theta'}, x_0\big]$.
2.2 Lagrangian Method
The Lagrangian method is a straightforward way to address the constraint in CMDPs. In this approach, we add the constraint costs to the task costs and transform the constrained optimization problem to a penalty form, i.e., $\min_{\theta} \max_{\lambda \ge 0}\ \mathcal{C}_{\pi_\theta}(x_0) + \lambda \big(\mathcal{D}_{\pi_\theta}(x_0) - d_0\big)$. We then jointly optimize $\theta$ and $\lambda$ to find a saddle-point of the penalized objective. The optimization of $\theta$ may be performed by any PG algorithm, such as DDPG or PPO, on the augmented cost $c(x,a) + \lambda d(x)$, while $\lambda$ is optimized by stochastic gradient descent. As described in Section 1, although the Lagrangian approach is easy to implement (see Appendix B for the details), in practice, it often violates the constraints during training. While at each step during training the objective encourages finding a safe solution, the current value of $\lambda$ may lead to an unsafe policy. This is why the Lagrangian method may not be suitable for solving problems in which safety is crucial during training.
2.3 Lyapunov Functions
Since in this paper we extend the Lyapunov-based approach to CMDPs to PG algorithms, we end this section by introducing some terms and notations from Chow et al. (2018) that are important in developing our safe PG algorithms. We refer the reader to Appendix A for more details.
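We define a set of Lyapunov functions w.r.t. initial state $x_0 \in \mathcal{X}$ and constraint threshold $d_0$ as $\mathcal{L}_{\pi_B}(x_0,d_0) = \{L : \mathcal{X} \to \mathbb{R}_{\ge 0} \mid T_{\pi_B,d}[L](x) \le L(x),\ \forall x \in \mathcal{X};\ L(x_0) \le d_0\}$, where $\pi_B$ is a feasible policy of (1), i.e., $\mathcal{D}_{\pi_B}(x_0) \le d_0$. We refer to the constraints in this feasibility set as the Lyapunov constraints. For any arbitrary Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0,d_0)$, we denote by $\mathcal{F}_L = \{\pi \in \Delta : T_{\pi,d}[L](x) \le L(x),\ \forall x \in \mathcal{X}\}$ the set of $L$-induced Markov stationary policies. The contraction property of $T_{\pi,d}$, together with $L(x_0) \le d_0$, implies that any $L$-induced policy in $\mathcal{F}_L$ is a feasible policy of (1). However, $\mathcal{F}_L$ does not always contain an optimal solution of (1), and thus, it is necessary to design a Lyapunov function that provides this guarantee. In other words, the main goal of the Lyapunov approach is to construct a Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0,d_0)$, such that $\mathcal{F}_L$ contains an optimal policy $\pi^*$, i.e., $L(x) \ge T_{\pi^*,d}[L](x)$ for all $x \in \mathcal{X}$. Chow et al. (2018) show in their Theorem 1 that without loss of optimality, the Lyapunov function that satisfies the above criterion can be expressed as $L_{\pi_B,\epsilon}(x) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \big(d(x_t) + \epsilon(x_t)\big) \mid \pi_B, x\big]$, in which $\epsilon(x) \ge 0$ is a specific immediate auxiliary constraint cost that keeps track of the maximum constraint budget available for policy improvement (from $\pi_B$ to $\pi^*$). They propose ways to construct such $\epsilon$, as well as an auxiliary constraint cost surrogate $\tilde{\epsilon}$, which is a tight upper-bound on $\epsilon$ and can be computed more efficiently. They use this construction to propose their safe (approximate) policy and value iteration algorithms, in which the goal is to solve the following LP problem (Chow et al., 2018, Eq. 6) at each policy improvement step: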
(2)  
where and are the value function and state-action value function (w.r.t. the cost function ), and is the Lyapunov function. Note that in an iterative policy optimization method, such as those we will present in this paper, the feasible policy can be set to the policy at the previous iteration.
In (2), there are as many constraints as the number of states and each constraint involves an integral over the entire action space . When the state space is large or continuous, even if the integral in the constraint has a closed-form (e.g., when the number of actions is finite), solving LP (2) becomes numerically intractable. Since Chow et al. (2018) assume that the number of actions is finite, they focus on value-function-based RL algorithms and address the large state issue by policy distillation. However, in this paper, we are interested in problems with large action spaces. In our case, solving (2) will be even more challenging. To address this issue, in the next section, we first switch from value-function-based algorithms to PG algorithms, then propose an optimization problem with Lyapunov constraints, analogous to (2), that is suitable for the PG setting, and finally present two methods to solve our proposed optimization problem efficiently.
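To make the role of the Lyapunov constraints concrete, the sketch below builds a Lyapunov function from a feasible baseline policy on a tiny finite CMDP and then checks, state by state, whether candidate policies satisfy the constraints. It assumes the Lyapunov condition takes the form $T_{\pi,d}[L](x) \le L(x)$ with $L(x_0) \le d_0$, as in Chow et al. (2018), and it uses a state-independent auxiliary cost equal to $(1-\gamma)$ times the remaining constraint budget of the baseline policy; that choice, the toy model, and all names are assumptions made for illustration only.

```python
def backup(policy, cost, P, V, gamma):
    """One application of T_{pi,h}[V] for a state-dependent cost h(x)."""
    n = len(P)
    out = []
    for x in range(n):
        nxt = 0.0
        for a, pi_a in enumerate(policy[x]):
            nxt += pi_a * sum(P[x][a][y] * V[y] for y in range(n))
        out.append(cost[x] + gamma * nxt)
    return out


def fixed_point(policy, cost, P, gamma, iters=2000):
    """Approximate the fixed point of T_{pi,h} by repeated backups."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = backup(policy, cost, P, V, gamma)
    return V


# Hypothetical CMDP: action a deterministically leads to state a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
d = [0.0, 1.0]  # constraint cost d(x): state 1 is unsafe
gamma, x0, d0 = 0.9, 0, 5.0
pi_B = [[0.5, 0.5], [0.5, 0.5]]  # feasible baseline policy

# Constraint value of the baseline and the (assumed) constant auxiliary cost.
D_B = fixed_point(pi_B, d, P, gamma)
eps = (1.0 - gamma) * (d0 - D_B[x0])

# Lyapunov function: value of the baseline under the cost d(x) + eps.
L = fixed_point(pi_B, [dx + eps for dx in d], P, gamma)


def satisfies_lyapunov(policy, tol=1e-9):
    """Check T_{pi,d}[L](x) <= L(x) at every state."""
    TL = backup(policy, d, P, L, gamma)
    return all(t <= l + tol for t, l in zip(TL, L))


baseline_ok = satisfies_lyapunov(pi_B)
mild_ok = satisfies_lyapunov([[0.45, 0.55], [0.45, 0.55]])
greedy_ok = satisfies_lyapunov([[0.0, 1.0], [0.0, 1.0]])
```

In this toy model, a policy that visits the unsafe state only slightly more often than the baseline still passes the per-state check, whereas the policy that always moves to the unsafe state fails it. Note that the check is local to each state, which is why the number of constraints grows with the size of the state space.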
3 Safe Lyapunov-based Policy Gradient
We now present our approach to solve CMDPs in a way that guarantees safety both at convergence and during training. Similar to Chow et al. (2018), our Lyapunov-based safe PG algorithms solve a constrained optimization problem analogous to (2). In particular, our algorithms consist of two components: a baseline PG algorithm, such as DDPG or PPO, and an effective method to solve the general Lyapunov-based policy optimization problem (the analog of (2))
(3)  
In the next two sections, we present two approaches to solve (3) efficiently. We call these approaches 1) θ-projection, a constrained optimization method that combines PG with projecting the policy parameter onto the set of feasible solutions induced by the Lyapunov constraints, and 2) a-projection, in which we embed the Lyapunov constraints into the policy network via a safety layer.
3.1 The θ-projection Approach
In this section, we show how a safe Lyapunov-based PG algorithm can be derived using the θ-projection approach. This machinery is based on the minorization-maximization technique in conservative PG (Kakade & Langford, 2002) and Taylor series expansion, and it can be applied to both on-policy and off-policy algorithms. Following Theorem 4.1 in Kakade & Langford (2002), we first have the following bound for the cumulative cost: , where is the visiting distribution of starting at the initial state , and is the weight for the entropy-based regularization. (Footnote 1: Theorem 1 in Schulman et al. (2015a) provides a recipe for computing such that the minorization-maximization inequality holds. But in practice, is treated as a tunable hyper-parameter for entropy-based regularization.) Using the above result, we denote by the surrogate cumulative cost. It has been shown in Eq. 10 of Schulman et al. (2015a) that replacing the objective function with its surrogate in solving (3) will still lead to policy improvement. In order to effectively compute the improved policy parameter , one further approximates the function with its Taylor series expansion (around ). In particular, the term is approximated up to its first order, and the term is approximated up to its second order. Altogether this allows us to replace the objective function in (3) with the following surrogate:
Similarly, regarding the constraints in (3), we can use the Taylor series expansion (around ) to approximate the LHS of the Lyapunov constraints as
Using the above approximations, at each iteration, our safe PG algorithm updates the policy by solving the following constrained optimization problem with semi-infinite dimensional Lyapunov constraints:
(4)  
It can be seen that if the errors resulting from the neural network parameterizations of and , and from the Taylor series expansions, are small, then an algorithm that updates the policy parameter by solving (4) can ensure safety during training. However, the presence of infinite-dimensional Lyapunov constraints makes solving (4) numerically intractable. A solution to this is to write the Lyapunov constraints in (4) (without loss of optimality) as
Since the above operator is non-differentiable, this may still lead to numerical instability in gradient descent algorithms. Similar to the surrogate constraint used in TRPO (to transform the constraint to an average constraint, see Eq. 12 in Schulman et al. 2015a), a more numerically stable way is to approximate the Lyapunov constraint using the following average constraint surrogate:
(5) 
where is the number of on-policy sample trajectories of . In practice, when the auxiliary constraint surrogate is chosen as (see Appendix A for the justification of this choice), the gradient term in (5) can be simplified as , where and are the constraint value function and constraint state-action value function, respectively. Combining with the fact that is state-independent, the above arguments further imply that the average constraint surrogate in (5) can be approximated by the inequality , which is equivalent to the constraint used in CPO (see Sec. 6.1 in Achiam et al. 2017). This shows a clear connection between CPO (minus the line search) and our Lyapunov-based PG with θ-projection. Algorithm 4 in Appendix E contains the pseudo-code of our safe Lyapunov-based PG algorithms with θ-projection. We refer to the DDPG and PPO versions of this algorithm as SDDPG and SPPO.
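The structure of such an update can be illustrated with a small sketch. Assuming the surrogate objective is a first-order term plus a quadratic regularizer in the parameter step, and the averaged Lyapunov constraint has been reduced to a single linear inequality in that step, the problem has a closed-form solution through its dual variable. The sketch further assumes a diagonal, positive-definite curvature matrix to keep the code dependency-free. The function name, argument names, and the diagonal simplification are hypothetical and are not the procedure of Algorithm 4.

```python
def projected_step(g, b, h_diag, beta, budget):
    """Solve  min_delta  g.delta + (beta / 2) * delta^T H delta
       s.t.   b.delta <= budget,
    where H = diag(h_diag) is positive definite and b is non-zero."""
    Hinv_g = [gi / hi for gi, hi in zip(g, h_diag)]
    Hinv_b = [bi / hi for bi, hi in zip(b, h_diag)]
    b_Hinv_g = sum(bi * v for bi, v in zip(b, Hinv_g))
    b_Hinv_b = sum(bi * v for bi, v in zip(b, Hinv_b))
    # Dual variable of the linear constraint; it is zero whenever the
    # unconstrained step -H^{-1} g / beta already satisfies the constraint.
    lam = max(0.0, -(b_Hinv_g + beta * budget) / b_Hinv_b)
    delta = [-(vg + lam * vb) / beta for vg, vb in zip(Hinv_g, Hinv_b)]
    return delta, lam


# Hypothetical gradients for a two-parameter policy.
g = [1.0, -2.0]      # gradient of the surrogate cost
b = [0.0, 1.0]       # gradient of the averaged constraint surrogate
h_diag = [1.0, 1.0]  # diagonal curvature of the KL penalty
delta, lam = projected_step(g, b, h_diag, beta=1.0, budget=0.5)
constraint_value = sum(bi * di for bi, di in zip(b, delta))
```

In this example the unconstrained step would move the second parameter by 2.0, which exceeds the budget of 0.5, so the dual variable becomes positive and the step is shortened exactly to the constraint boundary. When the budget is not binding, the function returns the plain regularized gradient step.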
3.2 The a-projection Approach
Note that the main characteristic of the Lyapunov approach is to break down a trajectory-based constraint into a sequence of single-step state-dependent constraints. However, when the state space is infinite, the feasibility set is characterized by infinite-dimensional constraints, and thus, it is actually counter-intuitive to directly enforce these Lyapunov constraints (as opposed to the original trajectory-based constraint) in the policy update optimization. To address this issue, we leverage the idea of a safety layer from Dalal et al. (2018), which was applied to simple single-step constraints, and propose a novel approach to embed the set of Lyapunov constraints into the policy network. This way, we reformulate the CMDP problem (1) as an unconstrained optimization problem and optimize its policy parameter (of the augmented network) using any standard unconstrained PG algorithm. At every given state, the unconstrained action is first computed and then passed through the safety layer, where a feasible action mapping is constructed by projecting the unconstrained actions onto the feasibility set w.r.t. the corresponding Lyapunov constraint. Therefore, safety during training w.r.t. the CMDP problem can be guaranteed by this constraint projection approach.
For simplicity, we only describe how the action mapping (to the set of Lyapunov constraints) works for deterministic policies. Using identical machinery, this procedure can be extended to guarantee safety for stochastic policies. Recall from the policy improvement problem in (3) that the Lyapunov constraint is imposed at every state . Given a baseline feasible policy , for any arbitrary policy parameter , we denote by , the projection of onto the feasibility set induced by the Lyapunov constraints. One way to construct a feasible policy from a parameter is to solve the following projection problem at each state :
(6)  
We refer to this operation as the Lyapunov safety layer. Intuitively, this projection perturbs the unconstrained action as little as possible in the Euclidean norm in order to satisfy the Lyapunov constraints. Since this projection guarantees safety (in the Lyapunov sense), if we have access to a closed form of the projection, we may insert it into the policy parameterization and simply solve an unconstrained policy optimization problem, i.e., , using any standard PG algorithm.
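The idea of perturbing an action as little as possible can be sketched for the simplest case. Assuming the Lyapunov constraint at a given state has been replaced by a single linear constraint in the action (a first-order approximation around the baseline action), the safety layer reduces to a Euclidean projection onto a half-space, which has a well-known closed form. The function and argument names below are hypothetical, and the sketch is not claimed to match the paper's exact expression.

```python
def project_action(a, a_base, g, eps):
    """Euclidean projection of `a` onto {a' : g.(a' - a_base) <= eps}.

    a      : unconstrained action proposed by the policy
    a_base : action of the feasible baseline policy at this state
    g      : gradient of the (linearized) constraint w.r.t. the action
    eps    : constraint budget available at this state
    """
    gg = sum(gi * gi for gi in g)
    if gg == 0.0:
        return list(a)  # the linearized constraint does not depend on the action
    violation = sum(gi * (ai - bi) for gi, ai, bi in zip(g, a, a_base)) - eps
    lam = max(0.0, violation / gg)  # zero when `a` is already feasible
    return [ai - lam * gi for ai, gi in zip(a, g)]


# Hypothetical two-dimensional action: only the first coordinate is constrained.
a_base = [0.0, 0.0]
g = [1.0, 0.0]
unsafe = project_action([1.0, 1.0], a_base, g, eps=0.25)
safe = project_action([0.1, 1.0], a_base, g, eps=0.25)
```

Here the first proposed action violates the linear constraint and is moved along the constraint gradient until it lies on the boundary, while the second is already feasible and passes through unchanged. Because the mapping is piecewise linear in the proposed action, it can be placed at the end of a policy network and differentiated through.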
To simplify the projection (6), we can approximate the LHS of the Lyapunov constraint with its first-order Taylor series (w.r.t. action ). Thus, at any given state , the safety layer solves the following projection problem:
(7)  
where is the action-gradient of the state-action Lyapunov function induced by the baseline action .
Similar to the analysis of Section 3.1, if the auxiliary cost is state-independent, one can readily find by computing the gradient of the constraint action-value function . Note that the objective function in (7) is positive-definite and quadratic, and the constraint approximation is linear. Therefore, the solution of this (convex) projection problem can be effectively computed by an in-graph QP solver, such as OptNet (Amos & Kolter, 2017). Combined with the above projection procedure, this further implies that the CMDP problem can be effectively solved using an end-to-end PG training pipeline (such as DDPG or PPO). Furthermore, when the CMDP has a single constraint (and thus a single Lyapunov constraint), the policy has the following analytical solution.
Proposition 1.
At any given state , the solution to the optimization problem (7) has the form , where
The closed-form solution is essentially a linear projection of the unconstrained action to the Lyapunov-safe hyperplane characterized with slope and intercept . Extending this closed-form solution to handle multiple constraints is possible, if there is at most one constraint active at a time (see Proposition 1 in Dalal et al. 2018 for a similar extension).
Without loss of generality, this projection step can also be extended to handle actions generated by stochastic policies with bounded first and second order moments (Yu et al., 2009). For example, when the policy is parameterized with a Gaussian distribution, one needs to project both the mean and the standard-deviation vector onto the Lyapunov-safe hyperplane, in order to obtain a feasible action probability. Algorithm 5 in Appendix E contains the pseudo-code of our safe Lyapunov-based PG algorithms with a-projection. We refer to the DDPG and PPO versions of this algorithm as SDDPG-modular and SPPO-modular, respectively.
4 Experiments on MuJoCo Benchmarks
[Figure caption (partial): SD confidence intervals (over random seeds). The dashed purple line represents the constraint limit.]
We empirically evaluate the Lyapunov-based PG algorithms to assess: (i) the performance in terms of cost and safety during training, and (ii) robustness with respect to constraint violations in the presence of function approximation errors. To that end, we design three interpretable experiments in simulated robot locomotion continuous control tasks using the MuJoCo simulator (Todorov et al., 2012). The tasks' notions of safety are motivated by physical constraints: (i) HalfCheetah-Safe: The HalfCheetah agent is rewarded for running, but its speed is limited for stability and safety; (ii) Point-Circle: The Point agent is rewarded for running in a wide circle, but is constrained to stay within a safe region defined by (Achiam et al., 2017); (iii) Point-Gather & Ant-Gather: The Point or Ant Gatherer agent is rewarded for collecting target objects in a terrain map, while being constrained to avoid bombs (Achiam et al., 2017). Visualizations of these tasks as well as more details of the network architecture used in training the algorithms are given in Appendix C.
We compare the presented methods with two state-of-the-art unconstrained reinforcement learning algorithms, DDPG (Lillicrap et al., 2015) and PPO (Schulman et al., 2017), and two constrained methods, the Lagrangian approach with optimized hyper-parameters for fairness (Appendix B) and the on-policy CPO algorithm (Achiam et al., 2017). The original CPO is based on TRPO (Schulman et al., 2015a). We use its PPO alternative (which coincides with the SPPO algorithm derived in Section 3.1) as the safe RL baseline. SPPO preserves the essence of CPO by adding the first-order constraint and the relative entropy regularization to the policy optimization problem. The main difference between CPO and SPPO is that the latter does not perform backtracking line-search in learning rate. The decision to compare with SPPO instead of CPO is 1) to avoid the additional computational complexity of line-search in TRPO, while maintaining the performance of PG using the popular PPO algorithm, 2) to have a backpropagatable version of CPO, and 3) to have a fair comparison with other backpropagatable safe RL algorithms, such as the DDPG and safety layer counterparts.
Comparisons with baselines: The Lyapunov-based PG algorithms are stable in learning and all methods converge to feasible policies with reasonable performance (Figures (a), (c), (e), (g)). In contrast, when examining the constraint violation (Figures (b), (d), (f), (h)), the Lyapunov-based PG algorithms quickly stabilize the constraint cost below the threshold, while the unconstrained DDPG and PPO agents violate the constraints in these environments, and the Lagrangian approach tends to oscillate around the constraint threshold. Furthermore, it is worth noting that the Lagrangian approach can be sensitive to the initialization of the Lagrange multiplier $\lambda_0$. If $\lambda_0$ is too large, it would make policy updates overly conservative, while if $\lambda_0$ is too small, then constraint violation will be more pronounced. Without further knowledge about the environment, here we treat $\lambda_0$ as a hyper-parameter and optimize it via grid-search. See Appendix C for more detail.
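The sensitivity to the initial multiplier can be illustrated with a short sketch of a standard projected gradient update, in which the multiplier grows while the estimated constraint cost exceeds the threshold and shrinks otherwise. This is a generic update rule and not necessarily the exact one used in Appendix B; the function name, the learning rate, and the sequence of constraint-cost estimates are hypothetical.

```python
def dual_updates(lam0, constraint_costs, d0, lr):
    """Projected gradient updates of the Lagrange multiplier:
    lam <- max(0, lam + lr * (D - d0)) for each constraint-cost estimate D."""
    lam, history = lam0, []
    for D in constraint_costs:
        lam = max(0.0, lam + lr * (D - d0))
        history.append(lam)
    return history


# Hypothetical constraint-cost estimates from four consecutive policy updates.
estimates = [8.0, 7.0, 6.0, 4.0]
small_start = dual_updates(0.0, estimates, d0=5.0, lr=0.1)
large_start = dual_updates(10.0, estimates, d0=5.0, lr=0.1)
```

With the same constraint signal, the two runs differ by the initial offset at every step: the small initial value leaves the penalty weak while the constraint is being violated, and the large initial value keeps the penalty heavy even after the constraint cost has dropped below the threshold.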
a-projection vs. θ-projection: In many cases the a-projection (DDPG and PPO a-projections) converges faster and has lower constraint violation than its θ-projection counterpart (SDDPG, SPPO). This corroborates the hypothesis that the a-projection approach is less conservative during policy updates than the θ-projection approach (which is what CPO is based on) and generates smoother gradient updates during end-to-end training, resulting in more effective learning than CPO (θ-projection).
DDPG vs. PPO: Finally, in most experiments (HalfCheetah, Point-Gather, and Ant-Gather) the DDPG algorithms tend to learn faster than their PPO counterparts, while the PPO algorithms have better control over constraint violations (they are able to satisfy lower constraint thresholds). The faster learning behavior is potentially due to the improved data-efficiency when using off-policy samples in PG updates. However, the covariate shift in off-policy data makes tight constraint control more challenging.
5 Safe Policy Gradient for Robot Navigation
We now evaluate the safe policy optimization on a real robot task, point-to-point (P2P) navigation (Chiang et al., 2019), where a noisy differential drive robot with limited sensors (Fig. (a)) is required to navigate to a goal outside of its visual field of view while avoiding collisions with obstacles. The agent's observations consist of the relative goal position, the relative goal velocity, and the Lidar measurements (Fig. (a)). The actions are the linear and angular velocity vector at the robot's center of mass. The transition probability captures the noisy differential drive robot dynamics, whose exact formulation is not known to the robot. The robot must navigate to arbitrary goal positions collision-free and without memory of the workspace topology.
Here the CMDP is non-discounting and has a fixed horizon. We reward the agent for reaching the goal, which translates to an immediate cost that measures the relative distance to goal. To measure the impact energy of obstacle collisions, we impose an immediate constraint cost to account for the speed during collision, with a constraint threshold that characterizes the agent's maximum tolerable collision impact energy to any objects. This type of constraint allows the robot to touch an obstacle (such as a wall) but prevents it from ramming into any objects. Under this CMDP framework (Fig. (b)), the main goal is to train a policy that drives the robot along the shortest path to the goal and limits the total impact energy of obstacle collisions. Furthermore, we note that due to limited data, in practice intermediate point-to-point policies are deployed on the real-world robot to collect more samples for further training. Therefore, guaranteeing safety during training is critical in this application. Descriptions of the robot navigation problem, including training and evaluation environments, are in Appendix D.
Experimental Results: We evaluate the learning algorithms in terms of average mission success percentage and constraint control. The task is successful if the robot reaches the goal before the constraint threshold (total energy of collision) is exhausted, and the success rate is averaged over evaluation episodes with random initialization. While all methods converge to policies with reasonable performance, Figures (a) and (b) show that the Lyapunov-based PG algorithms have higher success rates, due to their robust ability to control the total constraint as well as to minimize the distance to goal. Although the unconstrained method often yields a lower distance to goal, it violates the constraint more frequently and thus leads to a lower success rate. Furthermore, note that the Lagrangian approach is less robust to initialization of parameters, and therefore it generally has a lower success rate and higher variability than the Lyapunov-based methods. Unfortunately, due to function approximation error and stochasticity of the problem, all the algorithms converged prematurely with constraints above the threshold. One reason is that the constraint threshold () is overly conservative. In real-world problems, guaranteeing constraint satisfaction is more challenging than maximizing return, and it usually requires much more training data. Finally, Figures (a) and (b) illustrate the navigation routes of two policies. On similar goal configurations, the Lagrangian method tends to zigzag and has more collisions, while the Lyapunov-based algorithm (SDDPG) chooses a safer path to reach the goal.
Next, we evaluate how well the methods generalize to (i) longer trajectories, and (ii) new environments. P2P tasks are trained in a by meters environment (Fig. 32) with goals placed within to meters from the robot's initial state. Figure 31 depicts the evaluation results, averaged over trials, on P2P tasks in a much larger evaluation environment ( by meters) with goals placed up to meters away from the robot's initial state. The success rate of all methods degrades as the goals are further away (Fig. (a)), and the safety methods (a-projection: SL-DDPG, and θ-projection: SG-DDPG) outperform the unconstrained and Lagrangian methods (DDPG and LA-DDPG) as the task becomes more difficult. At the same time, our methods retain the lower constraints even when the task is difficult (Fig. (b)).
Finally, we deployed the SLDDPG policy onto the real Fetch robot (Melonee Wise & Dymesich, 2016) in an everyday office environment. Figure (c) shows the top-down view of the robot log. The robot travelled a total of meters to complete five repetitions of tasks, each averaging about meters to the goal. The experiments included narrow corridors and people walking through the office. The robot robustly avoided both static and dynamic (human) obstacles coming into its path. We observed additional “wobbling” effects that were not present in simulation. This is likely due to wheel slippage on the floor, which the policy was not trained for. On several occasions when the robot could not find a clear path, the policy instructed the robot to stay put instead of narrowly passing by the obstacle. This is precisely the safety behavior we want to achieve with the Lyapunov-based algorithms.
6 Conclusions
We formulated safe RL as a continuous-action CMDP and developed two classes, projection and projection, of policy optimization algorithms based on Lyapunov functions to learn safe policies with high expected cumulative return. We do so by combining both on- and off-policy optimization (DDPG or PPO) with a critic that evaluates the policy and computes its corresponding Lyapunov function. We evaluated our algorithms on four high-dimensional simulated robot locomotion tasks and compared them with several baselines. To demonstrate the effectiveness of the Lyapunov-based algorithms in solving real-world problems, we also applied these algorithms to indoor robot navigation, to ensure that the agent’s path is optimal and collision-free. Our results indicate that our Lyapunov-based algorithms 1) achieve safe learning, 2) have better data efficiency, 3) can be more naturally integrated within the standard end-to-end differentiable policy gradient training pipeline, and 4) are scalable to tackle real-world problems. Our work is a step forward in deploying RL to real-world problems in which safety guarantees are of paramount importance. Future work includes 1) further exploration of Lyapunov function properties to improve training stability and safety, 2) more efficient use of Lyapunov constraints in constrained policy optimization, and 3) extensions of the Lyapunov approach to the model-based setting to better utilize the agent’s dynamics.
References
 Achiam et al. (2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.
 Altman (1998) Altman, E. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical methods of operations research, 48(3):387–417, 1998.
 Altman (1999) Altman, E. Constrained Markov decision processes, volume 7. CRC Press, 1999.
 Amos & Kolter (2017) Amos, B. and Kolter, Z. Optnet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.
 Berkenkamp et al. (2017) Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–918, 2017.
 Bertsekas (1999) Bertsekas, D. Nonlinear programming. Athena Scientific, Belmont, MA, 1999.
 Bertsekas (2005) Bertsekas, D. Dynamic programming and optimal control, volume 12. Athena Scientific, Belmont, MA, 2005.
 Chiang et al. (2019) Chiang, H. L., Faust, A., Fiser, M., and Francis, A. Learning navigation behaviors end to end with AutoRL. IEEE Robotics and Automation Letters, to appear, 2019. URL http://arxiv.org/abs/1809.10124.
 Chow et al. (2017) Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
 Chow et al. (2018) Chow, Y., Nachum, O., Ghavamzadeh, M., and Duenez-Guzman, E. A Lyapunov-based approach to safe reinforcement learning. Accepted at NIPS, 2018.
 Dalal et al. (2018) Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
 Faust et al. (2014) Faust, A., Ruymgaart, P., Salman, M., Fierro, R., and Tapia, L. Continuous action reinforcement learning for control-affine systems with unknown dynamics. Acta Automatica Sinica Special Issue on Extensions of Reinforcement Learning and Adaptive Control, IEEE/CAA Journal of, 1(3):323–336, July 2014.
 Faust et al. (2018) Faust, A., Ramirez, O., Fiser, M., Oslund, K., Francis, A., Davidson, J., and Tapia, L. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 5113–5120, Brisbane, Australia, 2018. URL https://arxiv.org/abs/1710.03937.

 Geibel & Wysotzki (2005) Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.
 Hafner et al. (2017) Hafner, D., Davidson, J., and Vanhoucke, V. TensorFlow Agents: Efficient batched reinforcement learning in TensorFlow. arXiv preprint arXiv:1709.02878, 2017.
 Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.
 Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Sampedro, P. P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. 2018. URL https://arxiv.org/pdf/1806.10293.
 Khalil (1996) Khalil, H. Nonlinear systems. Prentice-Hall, New Jersey, 2(5):5–1, 1996.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17:1–40, 2016.
 Lillicrap et al. (2015) Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Lillicrap et al. (2016) Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Melonee Wise & Dymesich (2016) Wise, M., Ferguson, M., King, D., Diehr, E., and Dymesich, D. Fetch & Freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots held at the 2016 International Joint Conference on Artificial Intelligence, 2016.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Perkins & Barto (2002) Perkins, T. and Barto, A. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.
 Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 Schulman et al. (2015a) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton et al. (2000) Sutton, R., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Yu et al. (2009) Yu, Y., Li, Y., Schuurmans, D., and Szepesvári, C. A general projection property for distribution families. In Advances in Neural Information Processing Systems, pp. 2232–2240, 2009.
Appendix A The Lyapunov Approach to Solve CMDPs
In this section, we revisit the Lyapunov approach to solving CMDPs that was proposed by Chow et al. (2018) and report the mathematical results that are important in developing our safe policy optimization algorithms. To start, without loss of generality, we assume that we have access to a baseline feasible policy of Equation 1, ; i.e. satisfies . We define a set of Lyapunov functions w.r.t. initial state and constraint threshold as
and call the constraints in this feasibility set the Lyapunov constraints. For any arbitrary Lyapunov function , we denote by
the set of induced Markov stationary policies. Since is a contraction mapping (Bertsekas, 2005), any induced policy has the property , . Together with the property that , they imply that any induced policy is a feasible policy of Equation 1. However, in general, the set does not necessarily contain an optimal policy of Equation 1, and thus it is necessary to design a Lyapunov function (w.r.t. a baseline policy ) that provides this guarantee. In other words, the main goal is to construct a Lyapunov function such that
(8) 
Chow et al. (2018) show in their Theorem 1 that 1) without loss of optimality, the Lyapunov function can be expressed as
where is some auxiliary constraint cost uniformly upper-bounded by
and 2) if the baseline policy satisfies the condition
where is the maximum constraint cost, then the Lyapunov function candidate also satisfies the properties of Equation 8, and thus, its induced feasible policy set contains an optimal policy. Furthermore, suppose that the distance between the baseline and optimal policies can be estimated effectively. Using the set of induced feasible policies and noting that the safe Bellman operator is monotonic and contractive, one can show that , has a unique fixed point , such that is a solution of Equation 1, and an optimal policy can be constructed via greedification, i.e., . This shows that under the above assumption, Equation 1 can be solved using standard dynamic programming (DP) algorithms. While this result connects CMDP with Bellman’s principle of optimality, verifying whether satisfies this assumption is challenging when a good estimate of is not available. To address this issue, Chow et al. (2018) propose to approximate with an auxiliary constraint cost , which is the largest auxiliary cost satisfying the Lyapunov condition and the safety condition . The intuition here is that the larger , the larger the set of policies . Thus, by choosing the largest such auxiliary cost, we hope to have a better chance of including the optimal policy in the set of feasible policies. Specifically, is computed by solving the following linear programming (LP) problem:
(9) 
where represents a one-hot vector in which the nonzero element is located at . When is a feasible policy, this problem has a nonempty solution. Furthermore, according to the derivations in Chow et al. (2018), the maximizer of (9) has the following form:
where . They also show that by further restricting to be a constant function, the maximizer is given by
Using the construction of the Lyapunov function , Chow et al. (2018) propose the safe policy iteration (SPI) algorithm (see Algorithm 1) in which the Lyapunov function is updated via bootstrapping, i.e., at each iteration is recomputed using Equation 9 w.r.t. the current baseline policy. At each iteration , this algorithm has the following properties: 1) Consistent Feasibility, i.e., if the current policy is feasible, then is also feasible; 2) Monotonic Policy Improvement, i.e., for any ; and 3) Asymptotic Convergence. Despite all these nice properties, SPI is still a value-function-based algorithm, and thus it is not straightforward to use it in continuous action problems. The main reason is that the greedification step becomes an optimization problem over the continuous set of actions that is not necessarily easy to solve. In Section 3, we show how we use SPI and its nice properties to develop safe policy optimization algorithms that can handle continuous action problems. Our algorithms can be thought of as combinations of DDPG or PPO (or any other on-policy or off-policy policy optimization algorithm) with an SPI-inspired critic that evaluates the policy and computes its corresponding Lyapunov function. The computed Lyapunov function is then used to guarantee a safe policy update, i.e., the new policy is selected from a restricted set of safe policies defined by the Lyapunov function of the current policy.
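As a rough illustration of the constant-auxiliary-cost construction described above, the following sketch builds a Lyapunov function for a small finite CMDP and checks the Lyapunov and safety conditions numerically. It assumes a discounted setting with discount factor `gamma`, in which the constant auxiliary cost takes the form `(1 - gamma) * (d0 - D[x0])`. This discounted form, the function names, and the toy transition matrix are all assumptions made for this example rather than notation taken from the text.

```python
import numpy as np

def constraint_value(P, d, gamma):
    """Discounted constraint value D = (I - gamma * P)^{-1} d of the baseline policy."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, d)

def constant_aux_lyapunov(P, d, gamma, x0, d0):
    """Lyapunov function built from a constant auxiliary cost (assumed discounted form)."""
    D = constraint_value(P, d, gamma)
    eps = (1.0 - gamma) * (d0 - D[x0])  # constant auxiliary cost
    if eps < 0:
        raise ValueError("baseline policy is infeasible: D[x0] > d0")
    n = P.shape[0]
    # Policy evaluation with the augmented constraint cost d + eps.
    L = np.linalg.solve(np.eye(n) - gamma * P, d + eps)
    return L, eps

def satisfies_lyapunov_condition(P, d, gamma, L, tol=1e-9):
    """Check d(x) + gamma * sum_x' P(x'|x) L(x') <= L(x) for every state x."""
    return bool(np.all(d + gamma * (P @ L) <= L + tol))

# Hypothetical 3-state chain induced by a feasible baseline policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
d = np.array([0.0, 0.2, 1.0])   # per-state constraint cost
gamma, x0, d0 = 0.9, 0, 5.0     # discount, initial state, constraint threshold

L, eps = constant_aux_lyapunov(P, d, gamma, x0, d0)
lyapunov_ok = satisfies_lyapunov_condition(P, d, gamma, L)
safety_ok = L[x0] <= d0 + 1e-9
```

Under these assumptions the resulting function equals the baseline constraint value shifted by the slack `d0 - D[x0]`, so the safety condition holds with equality at the initial state and the whole slack is available to enlarge the set of induced policies.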
Appendix B Lagrangian Approach to Safe RL
There are a number of mild technical and notational assumptions which we will make throughout this section, so we state them here.
Assumption 1 (Differentiability).
For any state-action pair , is continuously differentiable in and is a Lipschitz function in for every and .
Assumption 2 (Strict Feasibility).
There exists a transient policy such that in the constrained problem.
Assumption 3 (Step Sizes).
The step size schedules , , and satisfy
(10)  
(11)  
(12) 
Assumption 1 imposes smoothness on the optimal policy. Assumption 2 guarantees the existence of a local saddle point in the Lagrangian analysis introduced in the next subsection. Assumption 3 refers to the step sizes corresponding to policy updates that will be introduced for the algorithms in this paper, and indicates that the update corresponding to is on the fastest timescale, the update corresponding to is on the intermediate timescale, and the update corresponding to is on the slowest timescale. As this assumption refers to user-defined parameters, it can always be satisfied by an appropriate choice of step sizes.
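One family of schedules that is commonly used to meet step-size conditions of this kind is polynomial decay with exponents in (0.5, 1]. The sketch below assumes that the conditions are the usual stochastic approximation requirements (the step sizes sum to infinity, their squares sum to a finite value, and each slower schedule is asymptotically negligible relative to the next faster one). The schedule names and exponents are hypothetical choices for illustration.

```python
def make_schedule(exponent, scale=1.0):
    """Return zeta(k) = scale * k**(-exponent) for k = 1, 2, ...

    For 0.5 < exponent <= 1, the sum of zeta(k) diverges while the sum of
    zeta(k)**2 converges.
    """
    if not 0.5 < exponent <= 1.0:
        raise ValueError("exponent must lie in (0.5, 1]")
    return lambda k: scale * k ** (-exponent)

# A smaller exponent means slower decay, i.e., a faster timescale.
zeta_critic = make_schedule(0.55)      # fastest timescale
zeta_policy = make_schedule(0.75)      # intermediate timescale
zeta_multiplier = make_schedule(1.0)   # slowest timescale

def separation_ratios(k):
    """Ratios that should vanish as k grows for the timescales to separate."""
    return (zeta_multiplier(k) / zeta_policy(k),
            zeta_policy(k) / zeta_critic(k))
```

Evaluating `separation_ratios` at increasing `k` gives a quick sanity check that the multiplier update becomes negligible relative to the policy update, and the policy update relative to the critic update.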
To solve the CMDP, we employ the Lagrangian relaxation procedure (Bertsekas, 1999) to convert it to the following unconstrained problem:
(13) 
where is the Lagrange multiplier. Notice that is a linear function in . Then there exists a local saddle point for the minimax optimization problem , such that for some , and , we have
(14) 
where is a hyper-dimensional ball centered at with radius .
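For reference, the relaxation and the saddle-point condition can be written out as follows. This is a sketch in assumed notation: $\mathcal{C}_{\theta}(x_0)$ stands for the expected cumulative cost of the policy parameterized by $\theta$, $\mathcal{D}_{\theta}(x_0)$ for its expected cumulative constraint cost, $d_0$ for the constraint threshold, and $\lambda_{\max}$ for an upper bound on the multiplier. These symbols are introduced here for illustration and may differ from those used elsewhere in the paper.

```latex
\begin{gather*}
  L(\theta,\lambda) \;=\; \mathcal{C}_{\theta}(x_0)
      + \lambda\,\bigl(\mathcal{D}_{\theta}(x_0) - d_0\bigr),
      \qquad \lambda \ge 0, \\[4pt]
  \min_{\theta}\;\max_{\lambda \ge 0}\; L(\theta,\lambda), \\[4pt]
  L(\theta,\lambda^{*}) \;\ge\; L(\theta^{*},\lambda^{*}) \;\ge\; L(\theta^{*},\lambda)
      \qquad \forall\,\theta \in B_{\theta^{*}}(r),\;\;
      \forall\,\lambda \in [0,\lambda_{\max}].
\end{gather*}
```

Written this way, the linearity in $\lambda$ is immediate: for a fixed $\theta$, the Lagrangian is an affine function of the multiplier whose slope is the constraint violation $\mathcal{D}_{\theta}(x_0) - d_0$.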
In the following, we present a policy gradient (PG) algorithm and an actor-critic (AC) algorithm. While the PG algorithm updates its parameters after observing several trajectories, the AC algorithms are incremental and update their parameters at each time step.
We now present a policy gradient algorithm to solve the optimization problem in Equation 13. The idea of the algorithm is to descend in and ascend in using the gradients of w.r.t. and , i.e.,
(15) 
The unit of observation in this algorithm is a system trajectory generated by following policy . At each iteration, the algorithm generates trajectories by following the current policy, uses them to estimate the gradients in Equation 15, and then uses these estimates to update the parameters .
Let be a trajectory generated by following the policy , where is the target state of the system and is the (random) stopping time. The cost, constraint cost, and probability of are defined as , , and , respectively. Based on the definition of , one obtains .
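To make these trajectory-level quantities concrete, the sketch below computes the cumulative cost, the cumulative constraint cost, and the gradient of the log-probability of a trajectory for a tabular softmax policy. The softmax parameterization, the optional discount factor, and all function names are assumptions made for this example. The only fact it relies on is the standard one that the transition dynamics do not depend on the policy parameters, so the log-probability gradient reduces to a sum of per-step policy score terms.

```python
import numpy as np

def softmax_policy(theta, x):
    """Action probabilities pi(.|x; theta) for a tabular softmax policy."""
    z = theta[x] - np.max(theta[x])  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def trajectory_statistics(theta, states, actions, costs, constraint_costs, gamma=1.0):
    """Return (C, D, score) for one trajectory.

    C and D are the (discounted) sums of costs and constraint costs, and
    score is the gradient of log P_theta(trajectory) w.r.t. theta.
    """
    discounts = gamma ** np.arange(len(costs))
    C = float(np.dot(discounts, costs))
    D = float(np.dot(discounts, constraint_costs))
    score = np.zeros_like(theta)
    for x, a in zip(states, actions):
        # Gradient of log softmax w.r.t. theta[x]: one_hot(a) - pi(.|x).
        score[x] -= softmax_policy(theta, x)
        score[x, a] += 1.0
    return C, D, score

# Hypothetical trajectory in a problem with 2 states and 3 actions.
theta = np.zeros((2, 3))
C, D, score = trajectory_statistics(
    theta,
    states=[0, 1, 0],
    actions=[2, 0, 1],
    costs=[1.0, 2.0, 3.0],
    constraint_costs=[0.0, 1.0, 0.0],
    gamma=0.5,
)
```

Multiplying `score` by a trajectory-level return such as `C + lam * D` gives a single-sample likelihood-ratio estimate of the corresponding gradient with respect to the policy parameters.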
Algorithm 2 contains the pseudocode of our proposed policy gradient algorithm. What appears inside the parentheses on the right-hand side of the update equations are the estimates of the gradients of w.r.t. (estimates of the expressions in Equation 15). Gradient estimates of the Lagrangian function are given by
where the likelihood gradient is
In the algorithm, is a projection operator to , i.e., , which ensures the convergence of the algorithm. Recall from Assumption 3 that the step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the policy parameter update is on the fast timescale , and the Lagrange multiplier update is on the slow timescale . This results in a two-timescale stochastic approximation algorithm, which has been shown to converge to a (local) saddle point of the objective function . The convergence proof relies on standard techniques from stochastic approximation theory: in the limit, when the step size is sufficiently small, analyzing the convergence of PG is equivalent to analyzing the stability of an ordinary differential equation (ODE) w.r.t. its equilibrium point.
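The structure of the update (descent in the policy parameter on a fast timescale, projected ascent in the Lagrange multiplier on a slow timescale) can be sketched on a toy problem. The example below replaces the trajectory-based gradient estimates with the exact gradients of a one-dimensional constrained problem, so it illustrates only the two-timescale primal-dual mechanics and is not the algorithm evaluated in the paper. The step-size exponents, the bound `lam_max`, and the toy objective are hypothetical choices.

```python
def project(lam, lam_max):
    """Projection of the multiplier onto the interval [0, lam_max]."""
    return min(max(lam, 0.0), lam_max)

def primal_dual(grad_theta, grad_lam, theta0=0.0, lam0=0.0,
                lam_max=10.0, iters=20000):
    """Two-timescale gradient descent in theta and projected ascent in lam."""
    theta, lam = theta0, lam0
    for k in range(1, iters + 1):
        step_theta = 0.5 * k ** -0.6   # fast timescale
        step_lam = 1.0 * k ** -0.9     # slow timescale: step_lam / step_theta -> 0
        g_theta = grad_theta(theta, lam)
        g_lam = grad_lam(theta, lam)
        theta = theta - step_theta * g_theta             # descend in theta
        lam = project(lam + step_lam * g_lam, lam_max)   # ascend in lam, then project
    return theta, lam

# Toy problem: minimize (theta - 2)^2 subject to theta - 1 <= 0.
# Lagrangian: L(theta, lam) = (theta - 2)^2 + lam * (theta - 1).
theta_hat, lam_hat = primal_dual(
    grad_theta=lambda theta, lam: 2.0 * (theta - 2.0) + lam,
    grad_lam=lambda theta, lam: theta - 1.0,
)
```

For this toy problem the saddle point is at `theta = 1`, `lam = 2`, where the constraint is active and the gradient in `theta` vanishes, and the iterates should settle close to it. In the stochastic setting the exact gradients would be replaced by sample-based estimates, which is where the step-size conditions of Assumption 3 matter.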
In policy gradient, the unit of observation is a system trajectory. This may result in high variance for the gradient estimates, especially when the trajectories are long. To address this issue, we propose two actor-critic algorithms for optimizing Equation 13 that use value function approximation in the gradient estimates and update the parameters incrementally (after each state-action transition). These algorithms are still based on the above gradient estimates. Algorithm 3 contains the pseudocode of these algorithms. The projection operator is necessary to ensure the convergence of the algorithms. Recall from Assumption 3 that the step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the critic update is on the fastest timescale , the policy update is on the intermediate timescale, and finally the Lagrange multiplier update is on the slowest timescale . This results in three-timescale stochastic approximation algorithms.
Using the policy gradient theorem from Sutton et al. (2000), one can show that