Lyapunov-based Safe Policy Optimization for Continuous Control

01/28/2019 ∙ by Yinlam Chow, et al.

We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing .


1 Introduction

The field of reinforcement learning (RL) has witnessed tremendous success in many high-dimensional control problems, including video games (Mnih et al., 2015), board games (Silver et al., 2016), robot locomotion (Lillicrap et al., 2016), manipulation (Levine et al., 2016; Kalashnikov et al., 2018), navigation (Faust et al., 2018), and obstacle avoidance (Chiang et al., 2019). In standard RL, the ultimate goal is to optimize the expected sum of rewards/costs, and the agent is free to explore any behavior as long as it leads to performance improvement. Although this freedom might be acceptable in many problems, including those involving simulated environments, and could expedite learning a good policy, it might be harmful in many other problems and could cause damage to the agent (robot) or to the environment (plant or the people working nearby). In such domains, it is absolutely crucial that while the agent (RL algorithm) optimizes its long-term performance, it also maintains safe policies both during training and at convergence.

A natural way to incorporate safety is via constraints. A standard model for RL with constraints is the constrained Markov decision process (CMDP) (Altman, 1999), where in addition to its standard objective, the agent must satisfy constraints on expectations of auxiliary costs. Although optimal policies for finite CMDPs with known models can be obtained by linear programming (Altman, 1999), there are not many results for solving CMDPs when the model is unknown or the state and/or action spaces are large or infinite. A common approach to solving CMDPs is the Lagrangian method (Altman, 1998; Geibel & Wysotzki, 2005), which augments the original objective function with a penalty on constraint violation and computes the saddle-point of the constrained policy optimization via primal-dual methods (Chow et al., 2017). Although safety is ensured when the policy converges asymptotically, a major drawback of this approach is that it makes no guarantee with regard to the safety of the policies generated during training.

A few algorithms have recently been proposed to solve CMDPs at scale while maintaining safety during training. One such algorithm is constrained policy optimization (CPO) (Achiam et al., 2017). CPO extends the trust-region policy optimization (TRPO) algorithm (Schulman et al., 2015a) to handle the constraints in a principled way and has shown promising empirical results in terms of scalability, performance, and constraint satisfaction, both during training and after convergence. Another class of algorithms of this sort is by Chow et al. (2018). These algorithms use the notion of Lyapunov functions, which have a long history in control theory for analyzing the stability of dynamical systems (Khalil, 1996). Lyapunov functions have been used in RL to guarantee closed-loop stability of the agent (Perkins & Barto, 2002; Faust et al., 2014). They have also been used to guarantee that a model-based RL agent can be brought back to a “region of attraction” during exploration (Berkenkamp et al., 2017). Chow et al. (2018) use the theoretical properties of Lyapunov functions to propose safe approximate policy and value iteration algorithms. They prove theoretical guarantees for their algorithms when the CMDP is finite and known, and evaluate them empirically when it is large and/or unknown. However, since their algorithms are value-function-based, applying them to continuous action problems is not straightforward and was left as future work.

In this paper, we build on the problem formulation and theoretical findings of the Lyapunov-based approach to solve CMDPs, and extend it to tackle continuous action problems that play an important role in control theory and robotics. We propose Lyapunov-based safe RL algorithms that can handle problems with large or infinite action spaces, and return safe policies both during training and at convergence. Doing so requires resolving two major difficulties: 1) the policy update becomes an optimization problem over the large or continuous action space (similar to standard MDPs with large action spaces), and 2) the policy update is a constrained optimization problem in which the (Lyapunov) constraints involve integration over the action space, and thus, it is often impossible to have them in closed-form. Since the number of Lyapunov constraints is equal to the number of states, the situation is even more challenging when the problem has a large or an infinite state space. To address the first difficulty, we switch from value-function-based to policy gradient (PG) and actor-critic algorithms. To address the second difficulty, we propose two approaches to solve our constrained policy optimization problem (a problem with infinitely many constraints, each involving an integral over the continuous action space) that can work with any standard on-policy (e.g., proximal policy optimization (PPO), Schulman et al. 2017) and off-policy (e.g., deep deterministic policy gradient (DDPG), Lillicrap et al. 2015) PG algorithm. Our first approach, which we call policy parameter projection or θ-projection, is a constrained optimization method that combines PG with a projection of the policy parameters onto the set of feasible solutions induced by the Lyapunov constraints. Our second approach, which we call action projection or a-projection, uses the concept of a safety layer introduced by Dalal et al. (2018) to handle simple single-step constraints, extends this concept to general trajectory-based constraints, solves the constrained policy optimization problem in closed-form using Lyapunov functions, and integrates this closed-form into the policy network via safety-layer augmentation. Since both approaches guarantee safety at every policy update, they manage to maintain safety throughout training (ignoring errors resulting from function approximation), ensuring that all intermediate policies are safe to be deployed. To prevent constraint violations due to function approximation and modeling errors, similar to CPO, we offer a safeguard policy update rule that decreases constraint cost and ensures near-constraint satisfaction.

Our proposed algorithms have two main advantages over CPO. First, since CPO is closely connected to TRPO, it can only be trivially combined with PG algorithms that are regularized with relative entropy, such as PPO. This restricts CPO to on-policy PG algorithms. In contrast, our algorithms can work with any on-policy (e.g., PPO) and off-policy (e.g., DDPG) PG algorithm. Having an off-policy implementation is beneficial, since off-policy algorithms are potentially more data-efficient, as they can use the data from the replay buffer. Second, while CPO is not a back-propagatable algorithm, due to the backtracking line-search procedure and the conjugate gradient iterations for computing the natural gradient in TRPO, our algorithms can be trained end-to-end, which is crucial for scalable and efficient implementation (Hafner et al., 2017). In fact, we show in Section 3.1 that CPO (minus the line search) can be viewed as a special case of the on-policy version (PPO version) of our θ-projection algorithm, corresponding to a specific approximation of the constraints.

We evaluate our algorithms and compare them with CPO and the Lagrangian method on several continuous control (MuJoCo) tasks and a real-world robot navigation problem, in which the robot must satisfy certain constraints, while minimizing its expected cumulative cost. Results show that our algorithms outperform the baselines in terms of balancing the performance and constraint satisfaction (during training), and generalize better to new and more complex environments, including transfer to a real Fetch robot.

2 Preliminaries

We consider the RL problem in which the agent’s interaction with the environment is modeled as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{X}, \mathcal{A}, \gamma, c, P, x_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces; $\gamma \in [0,1)$ is a discounting factor; $c(x,a)$ is the immediate cost function; $P(\cdot \mid x,a)$ is the transition probability distribution; and $x_0 \in \mathcal{X}$ is the initial state. Although we consider a deterministic initial state and cost function, our results can be easily generalized to random initial states and costs. We model the RL problems in which there are constraints on the cumulative cost using CMDPs. The CMDP model extends the MDP by introducing additional costs and the associated constraints, and is defined by $(\mathcal{X}, \mathcal{A}, \gamma, c, P, x_0, d, d_0)$, where the first six components are the same as in the unconstrained MDP; $d(x)$ is the (state-dependent) immediate constraint cost; and $d_0 \ge 0$ is an upper-bound on the expected cumulative constraint cost.

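To formalize the optimization problem associated with CMDPs, let $\Delta$ be the set of Markovian stationary policies, i.e., $\Delta = \{\pi : \mathcal{X} \times \mathcal{A} \to [0,1],\ \int_a \pi(a \mid x)\,da = 1\}$. At each state $x \in \mathcal{X}$, we define the generic Bellman operator w.r.t. a policy $\pi \in \Delta$ and a cost function $h$ as $T_{\pi,h}[V](x) = \int_a \pi(a \mid x)\,\big[h(x,a) + \gamma \sum_{x'} P(x' \mid x,a)\,V(x')\big]\,da$. Given a policy $\pi \in \Delta$, we define the expected cumulative cost and the safety constraint function (expected cumulative constraint cost) as $\mathcal{C}_\pi(x_0) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t c(x_t,a_t) \mid \pi, x_0\big]$ and $\mathcal{D}_\pi(x_0) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t d(x_t) \mid \pi, x_0\big]$. The safety constraint is then defined as $\mathcal{D}_\pi(x_0) \le d_0$. The goal in CMDPs is to solve the constrained optimization problem

$$\pi^* \in \arg\min_{\pi \in \Delta} \big\{ \mathcal{C}_\pi(x_0) \;:\; \mathcal{D}_\pi(x_0) \le d_0 \big\}. \qquad (1)$$

It has been shown that if the feasibility set is non-empty, then there exists an optimal policy in the class of stationary Markovian policies $\Delta$ (Altman, 1999, Theorem 3.1).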
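As a rough illustration of the two quantities that define the CMDP problem, the following sketch estimates the expected cumulative cost and the expected cumulative constraint cost from sampled trajectories and checks the safety constraint against a threshold. The function names, the toy trajectories, and the threshold value 0.8 are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def discounted_sum(values, gamma):
    """Return sum_t gamma^t * values[t] for a single trajectory."""
    discounts = gamma ** np.arange(len(values))
    return float(np.dot(discounts, values))

def estimate_cmdp_quantities(trajectories, gamma):
    """Monte Carlo estimates of the cumulative cost and constraint cost.

    Each trajectory is a pair (costs, constraint_costs) of equal-length arrays
    collected by running one fixed policy from the initial state.
    """
    c_hat = np.mean([discounted_sum(c, gamma) for c, _ in trajectories])
    d_hat = np.mean([discounted_sum(d, gamma) for _, d in trajectories])
    return float(c_hat), float(d_hat)

# Two hypothetical trajectories: (task costs, constraint costs).
trajs = [
    (np.array([1.0, 1.0, 1.0]), np.array([0.0, 1.0, 0.0])),
    (np.array([2.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])),
]
c_hat, d_hat = estimate_cmdp_quantities(trajs, gamma=0.5)
d0 = 0.8  # hypothetical constraint threshold
is_feasible = d_hat <= d0
```

Such sample estimates are noisy, so a policy that looks feasible on a small batch may still violate the constraint in expectation.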

2.1 Policy Gradient Algorithms

Policy gradient (PG) algorithms optimize a policy by computing a sample estimate of the gradient of the expected cumulative cost induced by the policy, and then updating the policy parameter in the gradient direction. In general, stochastic policies that give a probability distribution over actions are parameterized by a $\kappa$-dimensional vector $\theta$, so the space of policies can be written as $\{\pi_\theta,\ \theta \in \Theta \subset \mathbb{R}^\kappa\}$. Since in this setting a policy $\pi$ is uniquely defined by its parameter $\theta$, policy-dependent functions can be written as a function of $\theta$ or $\pi$ interchangeably.

Deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) and proximal policy optimization (PPO) (Schulman et al., 2017) are two PG algorithms that have recently gained popularity in solving continuous control problems. DDPG is an off-policy Q-learning style algorithm that jointly trains a deterministic policy $\pi_\theta(x)$ and a Q-value approximator $Q(x,a;\phi)$. The Q-value approximator is trained to fit the true Q-value function and the deterministic policy is trained to optimize $Q(x,\pi_\theta(x);\phi)$ via the chain rule. The PPO algorithm we use is a penalty form of the trust region policy optimization (TRPO) algorithm (Schulman et al., 2015a) with an adaptive rule to tune the penalty weight $\beta_k$. PPO trains a policy $\pi_\theta(x)$ by optimizing a loss function that consists of the standard policy gradient objective and a penalty on the KL-divergence between the current policy $\pi_\theta$ and the previous policy $\pi_{\theta'}$, i.e., $\bar{D}_{KL}(\theta,\theta') = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t D_{KL}\big(\pi_{\theta'}(\cdot \mid x_t)\,\|\,\pi_\theta(\cdot \mid x_t)\big) \mid \pi_{\theta'}, x_0\big]$.
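The penalty form of PPO described above can be sketched as follows for diagonal Gaussian policies. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the loss is written for cost minimization, and the adaptive rule (doubling or halving the weight when the measured KL leaves a band around a target) is the common heuristic from the PPO paper, assumed here because the text does not spell out the rule it uses.

```python
import numpy as np

def diag_gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(N(mu_old, std_old^2) || N(mu_new, std_new^2)), summed over action dims."""
    var_old, var_new = std_old ** 2, std_new ** 2
    kl = (np.log(std_new / std_old)
          + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
          - 0.5)
    return kl.sum(axis=-1)

def kl_penalized_loss(ratio, cost_advantage, kl, beta):
    """Surrogate cost (importance ratio times cost advantage) plus KL penalty."""
    return float(np.mean(ratio * cost_advantage) + beta * np.mean(kl))

def adapt_beta(beta, mean_kl, kl_target):
    """Assumed heuristic: grow beta when KL is too large, shrink it when too small."""
    if mean_kl > 1.5 * kl_target:
        return beta * 2.0
    if mean_kl < kl_target / 1.5:
        return beta / 2.0
    return beta

# Toy batch of two states with two-dimensional actions.
mu_old = np.zeros((2, 2))
std = np.ones((2, 2))
mu_new = np.array([[1.0, 0.0], [1.0, 0.0]])
kl = diag_gaussian_kl(mu_old, std, mu_new, std)
loss = kl_penalized_loss(np.ones(2), np.array([1.0, -1.0]), kl, beta=2.0)
new_beta = adapt_beta(1.0, mean_kl=float(kl.mean()), kl_target=0.01)
```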

2.2 Lagrangian Method

The Lagrangian method is a straightforward way to address the constraint in CMDPs. In this approach, we add the constraint costs to the task costs and transform the constrained optimization problem to a penalty form, i.e., $\min_{\theta \in \Theta} \max_{\lambda \ge 0}\ \mathcal{C}_{\pi_\theta}(x_0) + \lambda\,\big(\mathcal{D}_{\pi_\theta}(x_0) - d_0\big)$. We then jointly optimize $\theta$ and $\lambda$ to find a saddle-point of the penalized objective. The optimization of $\theta$ may be performed by any PG algorithm, such as DDPG or PPO, on the augmented cost $c(x,a) + \lambda\,d(x)$, while $\lambda$ is optimized by stochastic gradient descent. As described in Section 1, although the Lagrangian approach is easy to implement (see Appendix B for the details), in practice, it often violates the constraints during training. While at each step during training the objective encourages finding a safe solution, the current value of $\lambda$ may lead to an unsafe policy. This is why the Lagrangian method may not be suitable for solving problems in which safety is crucial during training.
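To make the training-time violation concrete, here is a minimal primal-dual sketch on a toy one-dimensional problem. The cost $(\theta - 2)^2$, the constraint $\theta \le d_0$ with $d_0 = 1$, the step sizes, and all names are assumptions chosen for illustration; they stand in for the policy cost and constraint cost and are not taken from the paper.

```python
def lagrangian_primal_dual(theta=0.0, lam=0.0, d0=1.0,
                           lr_theta=0.1, lr_lam=0.1, steps=2000):
    """Gradient descent on theta and projected gradient ascent on lambda.

    Toy objective: minimize (theta - 2)^2 subject to theta <= d0.
    Returns the final iterates and the largest constraint violation seen.
    """
    max_violation = 0.0
    for _ in range(steps):
        # Gradient of (theta - 2)^2 + lam * (theta - d0) w.r.t. theta.
        grad_theta = 2.0 * (theta - 2.0) + lam
        # Gradient of the same Lagrangian w.r.t. lam.
        grad_lam = theta - d0
        # Simultaneous update; lam is kept non-negative.
        theta, lam = (theta - lr_theta * grad_theta,
                      max(0.0, lam + lr_lam * grad_lam))
        max_violation = max(max_violation, theta - d0)
    return theta, lam, max_violation

theta, lam, max_violation = lagrangian_primal_dual()
```

In this toy run the iterates approach the constrained optimum ($\theta = 1$, $\lambda = 2$), but $\theta$ overshoots the threshold early on because the multiplier has not yet grown, which mirrors the behavior described above.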

2.3 Lyapunov Functions

Since in this paper, we extend the Lyapunov-based approach to CMDPs to PG algorithms, we end this section by introducing some terms and notations from Chow et al. (2018) that are important in developing our safe PG algorithms. We refer the reader to Appendix A for more details.

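We define a set of Lyapunov functions w.r.t. initial state $x_0 \in \mathcal{X}$ and constraint threshold $d_0$ as $\mathcal{L}_{\pi_B}(x_0,d_0) = \{L : \mathcal{X} \to \mathbb{R}_{\ge 0} \;:\; T_{\pi_B,d}[L](x) \le L(x),\ \forall x \in \mathcal{X};\ L(x_0) \le d_0\}$, where $\pi_B$ is a feasible policy of (1), i.e., $\mathcal{D}_{\pi_B}(x_0) \le d_0$. We refer to the constraints in this feasibility set as the Lyapunov constraints. For any arbitrary Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0,d_0)$, we denote by $\mathcal{F}_L = \{\pi \in \Delta : T_{\pi,d}[L](x) \le L(x),\ \forall x \in \mathcal{X}\}$ the set of $L$-induced Markov stationary policies. The contraction property of $T_{\pi,d}$, together with $L(x_0) \le d_0$, implies that any $L$-induced policy is a feasible policy of (1). However, $\mathcal{F}_L$ does not always contain an optimal solution of (1), and thus, it is necessary to design a Lyapunov function that provides this guarantee. In other words, the main goal of the Lyapunov approach is to construct a Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0,d_0)$ such that $\mathcal{F}_L$ contains an optimal policy $\pi^*$, i.e., $L(x) \ge T_{\pi^*,d}[L](x)$. Chow et al. (2018) show in their Theorem 1 that, without loss of optimality, the Lyapunov function that satisfies the above criterion can be expressed as $L_{\pi_B,\epsilon}(x) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \big(d(x_t) + \epsilon(x_t)\big) \mid \pi_B, x\big]$, in which $\epsilon(x) \ge 0$ is a specific immediate auxiliary constraint cost that keeps track of the maximum constraint budget available for policy improvement (from $\pi_B$ to $\pi^*$). They propose ways to construct such $\epsilon$, as well as an auxiliary constraint cost surrogate $\tilde{\epsilon}$, which is a tight upper-bound on $\epsilon$ and can be computed more efficiently. They use this construction to propose their safe (approximate) policy and value iteration algorithms, in which the goal is to solve the following LP problem (Chow et al., 2018, Eq. 6) at each policy improvement step:

$$\pi_+(\cdot \mid x) = \arg\min_{\pi \in \Delta} \int_a Q_{V_{\pi_B}}(x,a)\,\pi(a \mid x)\,da, \quad \text{s.t.}\ \int_a Q_{L_{\pi_B}}(x,a)\,\big(\pi(a \mid x) - \pi_B(a \mid x)\big)\,da \le \tilde{\epsilon}(x),\ \forall x \in \mathcal{X}, \qquad (2)$$

where $V_{\pi_B}(x) = T_{\pi_B,c}[V_{\pi_B}](x)$ and $Q_{V_{\pi_B}}(x,a) = c(x,a) + \gamma \sum_{x'} P(x' \mid x,a)\,V_{\pi_B}(x')$ are the value function and state-action value function (w.r.t. the cost function $c$), and $Q_{L_{\pi_B}}(x,a) = d(x) + \tilde{\epsilon}(x) + \gamma \sum_{x'} P(x' \mid x,a)\,L_{\pi_B,\tilde{\epsilon}}(x')$ is the Lyapunov function. Note that in an iterative policy optimization method, such as those we will present in this paper, the feasible policy $\pi_B$ can be set to the policy at the previous iteration.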
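For a small finite CMDP with a known model, the Lyapunov function of a fixed baseline policy can be computed by solving a linear system, and the Lyapunov conditions can be checked directly. The sketch below assumes a hypothetical three-state chain already marginalized over the baseline policy, and it uses a state-independent auxiliary cost equal to $(1-\gamma)$ times the unused constraint budget at the initial state, which is one simple choice; all names and numbers are illustrative.

```python
import numpy as np

def policy_evaluation(P_pi, h, gamma):
    """Solve V = h + gamma * P_pi @ V for a fixed policy."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, h)

def lyapunov_function(P_pi, d, gamma, x0, d0):
    """Build L(x) = E[sum_t gamma^t (d(x_t) + eps)] with a constant eps.

    eps = (1 - gamma) * (d0 - D(x0)) spends exactly the constraint budget
    left over by the baseline policy at the initial state x0.
    """
    D = policy_evaluation(P_pi, d, gamma)      # constraint value function
    eps = (1.0 - gamma) * (d0 - D[x0])         # non-negative iff baseline is feasible
    L = policy_evaluation(P_pi, d + eps, gamma)
    return L, eps, D

# Hypothetical 3-state chain under the baseline policy.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
d = np.array([0.0, 0.0, 1.0])   # constraint cost incurred only in state 2
gamma, x0, d0 = 0.9, 0, 5.0

L, eps, D = lyapunov_function(P_pi, d, gamma, x0, d0)
bellman_backup = d + gamma * P_pi @ L          # T_{pi_B, d}[L](x) for every x
satisfies_lyapunov = bool(np.all(bellman_backup <= L + 1e-9))
```

With this construction the backup equals $L - \epsilon$ at every state, so the Lyapunov condition holds whenever the baseline policy is feasible, and $L(x_0)$ equals the threshold $d_0$.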

In (2), there are as many constraints as the number of states and each constraint involves an integral over the entire action space $\mathcal{A}$. When the state space is large or continuous, even if the integral in the constraint has a closed-form (e.g., when the number of actions is finite), solving LP (2) becomes numerically intractable. Since Chow et al. (2018) assume that the number of actions is finite, they focus on value-function-based RL algorithms and address the large state issue by policy distillation. However, in this paper, we are interested in problems with large action spaces. In our case, solving (2) will be even more challenging. To address this issue, in the next section, we first switch from value-function-based algorithms to PG algorithms, then propose an optimization problem with Lyapunov constraints, analogous to (2), that is suitable for the PG setting, and finally present two methods to solve our proposed optimization problem efficiently.

3 Safe Lyapunov-based Policy Gradient

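We now present our approach to solve CMDPs in a way that guarantees safety both at convergence and during training. Similar to Chow et al. (2018), our Lyapunov-based safe PG algorithms solve a constrained optimization problem analogous to (2). In particular, our algorithms consist of two components: a baseline PG algorithm, such as DDPG or PPO, and an effective method to solve the general Lyapunov-based policy optimization problem (the analogue of (2))

$$\theta_+ = \arg\min_{\theta \in \Theta} \mathcal{C}_{\pi_\theta}(x_0), \quad \text{s.t.}\ \int_a \big(\pi_\theta(a \mid x) - \pi_B(a \mid x)\big)\,Q_{L_{\pi_B}}(x,a)\,da \le \tilde{\epsilon}(x),\ \forall x \in \mathcal{X}. \qquad (3)$$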

In the next two sections, we present two approaches to solve (3) efficiently. We call these approaches 1) θ-projection, a constrained optimization method that combines PG with projecting the policy parameter θ onto the set of feasible solutions induced by the Lyapunov constraints, and 2) a-projection, in which we embed the Lyapunov constraints into the policy network via a safety layer.

3.1 The θ-projection Approach

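In this section, we show how a safe Lyapunov-based PG algorithm can be derived using the θ-projection approach. This machinery is based on the minorization-maximization technique in conservative PG (Kakade & Langford, 2002) and Taylor series expansion, and it can be applied to both on-policy and off-policy algorithms. Throughout, $\theta_B$ denotes the parameter of the baseline policy $\pi_B$. Following Theorem 4.1 in Kakade & Langford (2002), we first have the following bound for the cumulative cost:

$$-\beta\,\bar{D}_{KL}(\theta,\theta_B) \;\le\; \mathcal{C}_{\pi_\theta}(x_0) - \mathcal{C}_{\pi_{\theta_B}}(x_0) - \mathbb{E}_{x \sim \mu_{\theta_B,x_0},\,a \sim \pi_\theta}\big[Q_{V_{\theta_B}}(x,a) - V_{\theta_B}(x)\big] \;\le\; \beta\,\bar{D}_{KL}(\theta,\theta_B),$$

where $\mu_{\theta_B,x_0}$ is the $\gamma$-visiting distribution of $\pi_{\theta_B}$ starting at the initial state $x_0$, and $\beta$ is the weight for the entropy-based regularization. (Footnote 1: Theorem 1 in Schulman et al. (2015a) provides a recipe for computing $\beta$ such that the minorization-maximization inequality holds. In practice, however, $\beta$ is treated as a tunable hyper-parameter for entropy-based regularization.) Using the above result, we denote by

$$\mathcal{C}'_{\pi_\theta}(x_0;\pi_{\theta_B}) = \mathcal{C}_{\pi_{\theta_B}}(x_0) + \beta\,\bar{D}_{KL}(\theta,\theta_B) + \mathbb{E}_{x \sim \mu_{\theta_B,x_0},\,a \sim \pi_\theta}\big[Q_{V_{\theta_B}}(x,a) - V_{\theta_B}(x)\big]$$

the surrogate cumulative cost. It has been shown in Eq. 10 of Schulman et al. (2015a) that replacing the objective function $\mathcal{C}_{\pi_\theta}(x_0)$ with its surrogate $\mathcal{C}'_{\pi_\theta}(x_0;\pi_{\theta_B})$ in solving (3) will still lead to policy improvement. In order to effectively compute the improved policy parameter $\theta_+$, one further approximates the function $\mathcal{C}'_{\pi_\theta}(x_0;\pi_{\theta_B})$ with its Taylor series expansion (around $\theta_B$). In particular, the term $\mathbb{E}_{x \sim \mu_{\theta_B,x_0},\,a \sim \pi_\theta}\big[Q_{V_{\theta_B}}(x,a) - V_{\theta_B}(x)\big]$ is approximated up to first order, and the term $\bar{D}_{KL}(\theta,\theta_B)$ is approximated up to second order. Altogether, this allows us to replace the objective function in (3) with the following surrogate:

$$\Big\langle \theta - \theta_B,\ \nabla_\theta\,\mathbb{E}_{x \sim \mu_{\theta_B,x_0},\,a \sim \pi_\theta}\big[Q_{V_{\theta_B}}(x,a)\big]\Big|_{\theta=\theta_B} \Big\rangle + \frac{\beta}{2}\,\Big\langle \theta - \theta_B,\ \nabla^2_\theta \bar{D}_{KL}(\theta,\theta_B)\Big|_{\theta=\theta_B}\,(\theta - \theta_B) \Big\rangle.$$

Similarly, regarding the constraints in (3), we can use the Taylor series expansion (around $\theta_B$) to approximate the LHS of the Lyapunov constraints as

$$\int_a \big(\pi_\theta(a \mid x) - \pi_B(a \mid x)\big)\,Q_{L_{\theta_B}}(x,a)\,da \;\approx\; \Big\langle \theta - \theta_B,\ \nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}\big[Q_{L_{\theta_B}}(x,a)\big]\Big|_{\theta=\theta_B} \Big\rangle.$$

Using the above approximations, at each iteration, our safe PG algorithm updates the policy by solving the following constrained optimization problem with semi-infinite dimensional Lyapunov constraints:

$$\theta_+ \in \arg\min_{\theta \in \Theta}\ \Big\langle \theta - \theta_B,\ \nabla_\theta\,\mathbb{E}_{x \sim \mu_{\theta_B,x_0},\,a \sim \pi_\theta}\big[Q_{V_{\theta_B}}(x,a)\big]\Big|_{\theta=\theta_B} \Big\rangle + \frac{\beta}{2}\,\Big\langle \theta - \theta_B,\ \nabla^2_\theta \bar{D}_{KL}(\theta,\theta_B)\Big|_{\theta=\theta_B}\,(\theta - \theta_B) \Big\rangle,$$
$$\text{s.t.}\ \Big\langle \theta - \theta_B,\ \nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}\big[Q_{L_{\theta_B}}(x,a)\big]\Big|_{\theta=\theta_B} \Big\rangle \le \tilde{\epsilon}(x),\ \forall x \in \mathcal{X}. \qquad (4)$$

It can be seen that if the errors resulting from the neural network parameterizations of $Q_{V_{\theta_B}}$ and $Q_{L_{\theta_B}}$, and from the Taylor series expansions, are small, then an algorithm that updates the policy parameter by solving (4) can ensure safety during training. However, the presence of infinite-dimensional Lyapunov constraints makes solving (4) numerically intractable. A solution to this is to write the Lyapunov constraints in (4) (without loss of optimality) as $\max_{x \in \mathcal{X}} \big\langle \theta - \theta_B,\ \nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}[Q_{L_{\theta_B}}(x,a)]\big|_{\theta=\theta_B} \big\rangle - \tilde{\epsilon}(x) \le 0$. Since the above $\max$-operator is non-differentiable, this may still lead to numerical instability in gradient descent algorithms. Similar to the surrogate constraint used in TRPO (to transform the $\max$ constraint to an average constraint, see Eq. 12 in Schulman et al. 2015a), a more numerically stable way is to approximate the Lyapunov constraint using the following average constraint surrogate:

$$\frac{1}{N} \sum_{i=1}^{N} \Big( \Big\langle \theta - \theta_B,\ \nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}\big[Q_{L_{\theta_B}}(x_i,a)\big]\Big|_{\theta=\theta_B} \Big\rangle - \tilde{\epsilon}(x_i) \Big) \le 0, \qquad (5)$$

where $N$ is the number of on-policy sample trajectories of $\pi_{\theta_B}$ (with $x_i$ the sampled states). In practice, when the auxiliary constraint surrogate is chosen as $\tilde{\epsilon} = (1-\gamma)\,\big(d_0 - \mathcal{D}_{\pi_{\theta_B}}(x_0)\big)$ (see Appendix A for the justification of this choice), the gradient term in (5) can be simplified as $\nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}\big[Q_{L_{\theta_B}}(x,a)\big] = \int_a \pi_\theta(a \mid x)\,\nabla_\theta \log \pi_\theta(a \mid x)\,Q_{W_{\theta_B}}(x,a)\,da$, where $W_{\theta_B}(x) = T_{\pi_B,d}[W_{\theta_B}](x)$ and $Q_{W_{\theta_B}}(x,a) = d(x) + \gamma \sum_{x'} P(x' \mid x,a)\,W_{\theta_B}(x')$ are the constraint value function and constraint state-action value function, respectively. Combined with the fact that $\tilde{\epsilon}$ is state independent, the above arguments further imply that the average constraint surrogate in (5) can be approximated by the inequality $\mathcal{D}_{\pi_{\theta_B}}(x_0) + \frac{1}{1-\gamma}\,\big\langle \theta - \theta_B,\ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}[Q_{W_{\theta_B}}(x_i,a)]\big|_{\theta=\theta_B} \big\rangle \le d_0$, which is equivalent to the constraint used in CPO (see Sec. 6.1 in Achiam et al. 2017). This shows a clear connection between CPO (minus the line search) and our Lyapunov-based PG with θ-projection. Algorithm 4 in Appendix E contains the pseudo-code of our safe Lyapunov-based PG algorithms with θ-projection. We refer to the DDPG and PPO versions of this algorithm as SDDPG and SPPO.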
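Once the Lyapunov constraints are replaced by a single averaged linear constraint, the θ-projection update is a quadratic program with one linear inequality, which has a closed-form solution through its KKT conditions. The following is a minimal sketch of that special case, not the paper's Algorithm 4: the function name and arguments are hypothetical, the gradients and the KL Hessian are assumed to be given, the Hessian is assumed positive definite, and the constraint gradient is assumed non-zero.

```python
import numpy as np

def theta_projection_step(theta_b, grad_cost, grad_constraint, hessian_kl, beta, eps):
    """Solve min_delta <g, delta> + (beta/2) delta^T H delta  s.t. <b, delta> <= eps.

    g: gradient of the surrogate cost, b: averaged gradient of the Lyapunov
    constraint, H: Hessian of the KL term, all evaluated at theta_b.
    """
    H_inv_g = np.linalg.solve(hessian_kl, grad_cost)
    H_inv_b = np.linalg.solve(hessian_kl, grad_constraint)

    # Unconstrained minimizer of the regularized linear objective.
    delta = -H_inv_g / beta
    if grad_constraint @ delta > eps:
        # Constraint is active: pick the multiplier nu that makes <b, delta> = eps.
        nu = (-(grad_constraint @ H_inv_g) - beta * eps) / (grad_constraint @ H_inv_b)
        delta = -(H_inv_g + nu * H_inv_b) / beta
    return theta_b + delta

theta_b = np.zeros(2)
g = np.array([-1.0, 0.0])   # cost decreases along the first coordinate
b = np.array([1.0, 0.0])    # constraint cost increases along the same coordinate
theta_plus = theta_projection_step(theta_b, g, b, np.eye(2), beta=1.0, eps=0.5)
```

In this toy case the unconstrained step would move one unit along the first coordinate, and the projection shortens it so that the linearized constraint is met with equality.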

3.2 The a-projection Approach

Note that the main characteristic of the Lyapunov approach is to break down a trajectory-based constraint into a sequence of single-step state dependent constraints. However, when the state space is infinite, the feasibility set is characterized by infinite dimensional constraints, and thus, it is actually counter-intuitive to directly enforce these Lyapunov constraints (as opposed to the original trajectory-based constraint) into the policy update optimization. To address this issue, we leverage the idea of a safety layer from Dalal et al. (2018), that was applied to simple single-step constraints, and propose a novel approach to embed the set of Lyapunov constraints into the policy network. This way, we reformulate the CMDP problem (1) as an unconstrained optimization problem and optimize its policy parameter (of the augmented network) using any standard unconstrained PG algorithm. At every given state, the unconstrained action is first computed and then passed through the safety layer, where a feasible action mapping is constructed by projecting the unconstrained actions onto the feasibility set w.r.t. the corresponding Lyapunov constraint. Therefore, safety during training w.r.t. the CMDP problem can be guaranteed by this constraint projection approach.

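For simplicity, we only describe how the action mapping (to the set of Lyapunov constraints) works for deterministic policies. Using identical machinery, this procedure can be extended to guarantee safety for stochastic policies. Recall from the policy improvement problem in (3) that the Lyapunov constraint is imposed at every state $x \in \mathcal{X}$. Given a baseline feasible policy $\pi_B = \pi_{\theta_B}$, for any arbitrary policy parameter $\theta \in \Theta$, we denote by $\Xi(\pi_B,\theta) = \{\theta' \in \Theta : Q_{L_{\pi_B}}(x,\pi_{\theta'}(x)) - Q_{L_{\pi_B}}(x,\pi_B(x)) \le \tilde{\epsilon}(x),\ \forall x \in \mathcal{X}\}$ the projection of $\theta$ onto the feasibility set induced by the Lyapunov constraints. One way to construct a feasible policy $\pi_{\Xi(\pi_B,\theta)}$ from a parameter $\theta$ is to solve the following $\ell_2$-projection problem at each state $x \in \mathcal{X}$:

$$\pi_{\Xi(\pi_B,\theta)}(x) \in \arg\min_{a}\ \frac{1}{2}\,\|a - \pi_\theta(x)\|^2, \quad \text{s.t.}\ Q_{L_{\pi_B}}(x,a) - Q_{L_{\pi_B}}(x,\pi_B(x)) \le \tilde{\epsilon}(x). \qquad (6)$$

We refer to this operation as the Lyapunov safety layer. Intuitively, this projection perturbs the unconstrained action as little as possible in the Euclidean norm in order to satisfy the Lyapunov constraints. Since this projection guarantees safety (in the Lyapunov sense), if we have access to a closed form of the projection, we may insert it into the policy parameterization and simply solve an unconstrained policy optimization problem, i.e., $\theta_+ \in \arg\min_{\theta \in \Theta} \mathcal{C}_{\pi_{\Xi(\pi_B,\theta)}}(x_0)$, using any standard PG algorithm.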

To simplify the projection (6), we can approximate the LHS of the Lyapunov constraint with its first-order Taylor series (w.r.t. action $a = \pi_B(x)$). Thus, at any given state $x \in \mathcal{X}$, the safety layer solves the following projection problem:

$$\pi_{\Xi(\pi_B,\theta)}(x) \in \arg\min_{a}\ \frac{1}{2}\,\|a - \pi_\theta(x)\|^2, \quad \text{s.t.}\ \big(a - \pi_B(x)\big)^\top g_{L_{\pi_B}}(x) \le \tilde{\epsilon}(x), \qquad (7)$$

where $g_{L_{\pi_B}}(x) := \nabla_a Q_{L_{\pi_B}}(x,a)\big|_{a=\pi_B(x)}$ is the action-gradient of the state-action Lyapunov function induced by the baseline action $a = \pi_B(x)$.

Similar to the analysis of Section 3.1, if the auxiliary cost $\tilde{\epsilon}$ is state independent, one can readily find $g_{L_{\pi_B}}(x)$ by computing the gradient of the constraint action-value function $\nabla_a Q_{W_{\theta_B}}(x,a)\big|_{a=\pi_B(x)}$. Note that the objective function in (7) is positive-definite and quadratic, and the constraint approximation is linear. Therefore, the solution of this (convex) projection problem can be effectively computed by an in-graph QP-solver, such as OPT-Net (Amos & Kolter, 2017). Combined with the above projection procedure, this further implies that the CMDP problem can be effectively solved using an end-to-end PG training pipeline (such as DDPG or PPO). Furthermore, when the CMDP has a single constraint (and thus a single Lyapunov constraint), the policy $\pi_{\Xi(\pi_B,\theta)}(x)$ has the following analytical solution.

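Proposition 1.

At any given state $x \in \mathcal{X}$, the solution to the optimization problem (7) has the form $\pi_{\Xi(\pi_B,\theta)}(x) = \pi_\theta(x) + \lambda^*(x)\,g_{L_{\pi_B}}(x)$, where

$$\lambda^*(x) = -\left( \frac{g_{L_{\pi_B}}(x)^\top \big(\pi_\theta(x) - \pi_B(x)\big) - \tilde{\epsilon}(x)}{g_{L_{\pi_B}}(x)^\top g_{L_{\pi_B}}(x)} \right)_+ .$$

The closed-form solution is essentially a linear projection of the unconstrained action $\pi_\theta(x)$ onto the Lyapunov-safe hyperplane characterized by slope $g_{L_{\pi_B}}(x)$ and intercept $\tilde{\epsilon}(x)$. Extending this closed-form solution to handle multiple constraints is possible if there is at most one constraint active at a time (see Proposition 1 in Dalal et al. 2018 for a similar extension).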
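The single-constraint projection can be written in a few lines. The sketch below is an illustrative NumPy version of the Euclidean projection of an unconstrained action onto a linearized constraint half-space; the function name and the toy inputs are assumptions, and the action-gradient of the Lyapunov critic is taken as given rather than computed from a learned network.

```python
import numpy as np

def lyapunov_safety_layer(a_unconstrained, a_baseline, g, eps):
    """Project an action onto {a : g^T (a - a_baseline) <= eps} in Euclidean norm.

    a_unconstrained: action proposed by the unconstrained policy.
    a_baseline: action of the baseline (feasible) policy at the same state.
    g: action-gradient of the state-action Lyapunov function at a_baseline.
    eps: auxiliary constraint budget at this state.
    """
    gg = float(g @ g)
    if gg == 0.0:
        # Zero gradient: the linearized constraint does not depend on the action.
        return a_unconstrained.copy()
    violation = float(g @ (a_unconstrained - a_baseline)) - eps
    step = max(0.0, violation / gg)   # zero when the action is already feasible
    return a_unconstrained - step * g

a_base = np.zeros(2)
g = np.array([1.0, 0.0])
a_safe = lyapunov_safety_layer(np.array([1.0, 1.0]), a_base, g, eps=0.25)
a_kept = lyapunov_safety_layer(np.array([0.1, 1.0]), a_base, g, eps=0.25)
```

Only the component of the action along the gradient direction is changed, and actions that already satisfy the linearized constraint pass through unchanged. Because the mapping is piecewise linear in the unconstrained action, it can be placed inside a policy network and differentiated through.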

Without loss of generality, this projection step can also be extended to handle actions generated by stochastic policies with bounded first and second order moments (Yu et al., 2009). For example, when the policy is parameterized with a Gaussian distribution, one needs to project both the mean and standard-deviation vector onto the Lyapunov-safe hyperplane in order to obtain a feasible action probability. Algorithm 5 in Appendix E contains the pseudo-code of our safe Lyapunov-based PG algorithms with a-projection. We refer to the DDPG and PPO versions of this algorithm as SDDPG-modular and SPPO-modular, respectively.

4 Experiments on MuJoCo Benchmarks

Figure 9: DDPG (red), DDPG-Lagrangian (cyan), SDDPG (blue), DDPG a-projection (green) on HalfCheetah-Safe and Point-Gather. Panels: (a) HalfCheetah-Safe, Return; (b) HalfCheetah-Safe, Constraint; (c) Point-Gather, Return; (d) Point-Gather, Constraint; (e) Ant-Gather, Return; (f) Ant-Gather, Constraint; (g) Point-Circle, Return; (h) Point-Circle, Constraint. Ours (SDDPG, SDDPG a-projection) perform stable and safe learning, although the dynamics and cost functions are not known, control actions are continuous, and deep function approximations are necessary. Unit of x-axis is in thousands of episodes. Shaded areas represent SD confidence intervals over random seeds. The dashed purple line represents the constraint limit.

We empirically evaluate the Lyapunov-based PG algorithms to assess: (i) the performance in terms of cost and safety during training, and (ii) robustness with respect to constraint violations in the presence of function approximation errors. To that end, we design three interpretable experiments in simulated robot locomotion continuous control tasks using the MuJoCo simulator (Todorov et al., 2012). The tasks' notions of safety are motivated by physical constraints: (i) HalfCheetah-Safe: the HalfCheetah agent is rewarded for running, but its speed is limited for stability and safety; (ii) Point-Circle: the Point agent is rewarded for running in a wide circle, but is constrained to stay within a safe region (Achiam et al., 2017); (iii) Point-Gather & Ant-Gather: the Point or Ant Gatherer agent is rewarded for collecting target objects in a terrain map, while being constrained to avoid bombs (Achiam et al., 2017). Visualizations of these tasks, as well as more details of the network architecture used in training the algorithms, are given in Appendix C.

We compare the presented methods with two state-of-the-art unconstrained reinforcement learning algorithms, DDPG (Lillicrap et al., 2015) and PPO (Schulman et al., 2017), and two constrained methods, the Lagrangian approach with optimized hyper-parameters for fairness (Appendix B) and the on-policy CPO algorithm (Achiam et al., 2017). The original CPO is based on TRPO (Schulman et al., 2015a). We use its PPO alternative (which coincides with the SPPO algorithm derived in Section 3.1) as the safe RL baseline. SPPO preserves the essence of CPO by adding the first-order constraint and the relative entropy regularization to the policy optimization problem. The main difference between CPO and SPPO is that the latter does not perform a backtracking line-search on the learning rate. We compare with SPPO instead of CPO 1) to avoid the additional computational complexity of line-search in TRPO, while maintaining the performance of PG using the popular PPO algorithm, 2) to have a back-propagatable version of CPO, and 3) to have a fair comparison with other back-propagatable safe RL algorithms, such as the DDPG and safety layer counterparts.

Figure 18: PPO (red), PPO-Lagrangian (cyan), SPPO (blue), SPPO a-projection (green) on HalfCheetah-Safe and Point-Gather. Panels: (a) HalfCheetah-Safe, Return; (b) HalfCheetah-Safe, Constraint; (c) Point-Gather, Return; (d) Point-Gather, Constraint; (e) Ant-Gather, Return; (f) Ant-Gather, Constraint; (g) Point-Circle, Return; (h) Point-Circle, Constraint. Ours (SPPO, SPPO a-projection) perform stable and safe learning, when the dynamics and cost functions are not known, control actions are continuous, and deep function approximations are necessary.

Comparisons with baselines: The Lyapunov-based PG algorithms are stable in learning and all methods converge to feasible policies with reasonable performance (Return panels (a), (c), (e), and (g) of Figures 9 and 18). In contrast, when examining the constraint violation (Constraint panels (b), (d), (f), and (h) of Figures 9 and 18), the Lyapunov-based PG algorithms quickly stabilize the constraint cost below the threshold, while the unconstrained DDPG and PPO agents violate the constraints in these environments, and the Lagrangian approach tends to oscillate around the constraint threshold. Furthermore, it is worth noting that the Lagrangian approach can be sensitive to the initialization of the Lagrange multiplier $\lambda_0$. If $\lambda_0$ is too large, it makes policy updates overly conservative, while if $\lambda_0$ is too small, constraint violation is more pronounced. Without further knowledge about the environment, here we treat $\lambda_0$ as a hyper-parameter and optimize it via grid-search. See Appendix C for more detail.

a-projection vs. θ-projection: In many cases the a-projection (DDPG and PPO a-projections) converges faster and has lower constraint violation than its θ-projection counterpart (SDDPG, SPPO). This corroborates the hypothesis that the a-projection approach is less conservative during policy updates than the θ-projection approach (which is what CPO is based on) and generates smoother gradient updates during end-to-end training, resulting in more effective learning than CPO (θ-projection).

DDPG vs. PPO: Finally, in most experiments (HalfCheetah, PointGather, and AntGather) the DDPG algorithms tend to learn faster than their PPO counterparts, while the PPO algorithms have better control over constraint violations (they are able to satisfy lower constraint thresholds). The faster learning is potentially due to the improved data-efficiency of using off-policy samples in PG updates. However, the covariate shift in off-policy data makes tight constraint control more challenging.

5 Safe Policy Gradient for Robot Navigation

Figure 21: Robot navigation task details. Panels: (a) Noisy Lidar observation in a corridor; (b) SDDPG for point to point task.

We now evaluate the safe policy optimization on a real robot task, point-to-point (P2P) navigation (Chiang et al., 2019), where a noisy differential drive robot with limited sensors (Fig. 21(a)) is required to navigate to a goal outside of its visual field of view while avoiding collisions with obstacles. The agent’s observations consist of the relative goal position, the relative goal velocity, and the Lidar measurements (Fig. 21(a)). The actions are the linear and angular velocity vector at the robot’s center of mass. The transition probability captures the noisy differential drive robot dynamics, whose exact formulation is not known to the robot. The robot must navigate to arbitrary goal positions collision-free and without memory of the workspace topology.

Here the CMDP is undiscounted and has a fixed horizon. We reward the agent for reaching the goal, which translates to an immediate cost that measures the relative distance to the goal. To measure the impact energy of obstacle collisions, we impose an immediate constraint cost that accounts for the speed during collision, with a constraint threshold that characterizes the agent’s maximum tolerable collision impact energy with any object. This type of constraint allows the robot to touch an obstacle (such as a wall) but prevents it from ramming into any object. Under this CMDP framework (Fig. 21(b)), the main goal is to train a policy that drives the robot along the shortest path to the goal while limiting the total impact energy of obstacle collisions. Furthermore, we note that due to limited data, in practice intermediate point-to-point policies are deployed on the real-world robot to collect more samples for further training. Therefore, guaranteeing safety during training is critical in this application. Descriptions of the robot navigation problem, including training and evaluation environments, are in Appendix D.

Figure 24: DDPG (red), DDPG-Lagrangian (cyan), SDDPG (blue), and SDDPG a-projection (green) on Robot Navigation: (a) mission success, (b) constraint. Ours (SDDPG, SDDPG a-projection) balance reward and constraint learning. The x-axis is in thousands of steps. The shaded areas represent standard-deviation (SD) confidence intervals over multiple runs. The dashed purple line represents the constraint limit.
Figure 28: Navigation routes of two policies on a similar setup: (a) Lagrangian policy and (b) SDDPG (a-proj.). (c) Log of on-robot experiments with SDDPG (a-proj.). A larger version is in Appendix D, and the video is available in the supplementary materials.

Experimental Results: We evaluate the learning algorithms in terms of average mission success percentage and constraint control. The task is successful if the robot reaches the goal before the constraint threshold (total energy of collision) is exhausted, and the success rate is averaged over evaluation episodes with random initialization. While all methods converge to policies with reasonable performance, Figures 24(a) and 24(b) show that the Lyapunov-based PG algorithms have higher success rates, owing to their robust control of the total constraint cost as well as their ability to minimize the distance to the goal. Although the unconstrained method often yields a lower distance to the goal, it violates the constraint more frequently and thus has a lower success rate. Furthermore, the Lagrangian approach is less robust to the initialization of parameters, and therefore it generally has a lower success rate and higher variability than the Lyapunov-based methods. Unfortunately, due to function approximation error and the stochasticity of the problem, all the algorithms converged prematurely with constraint costs above the threshold. One reason is that the constraint threshold is overly conservative. In real-world problems, guaranteeing constraint satisfaction is more challenging than maximizing return, and it usually requires much more training data. Finally, Figures 28(a) and 28(b) illustrate the navigation routes of two policies. On similar goal configurations, the Lagrangian method tends to zigzag and has more collisions, while the Lyapunov-based algorithm (SDDPG) chooses a safer path to reach the goal.
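The mission-success criterion described above can be sketched as follows. This is only an illustration: the episode encoding (the step index at which the goal is reached, plus a list of per-step constraint costs), the function names, and the choice to treat "exhausted" as the cumulative constraint cost strictly exceeding the threshold are assumptions, not details taken from the evaluation code.

```python
def mission_success(goal_step, constraint_costs, threshold):
    """Return True if the goal is reached before the constraint budget is exhausted.

    goal_step: index of the step at which the goal is reached, or None if never.
    constraint_costs: per-step constraint costs of the episode.
    threshold: total constraint budget (assumed exhausted once strictly exceeded).
    """
    spent = 0.0
    for t, cost in enumerate(constraint_costs):
        spent += cost
        if spent > threshold:
            return False  # budget exhausted before reaching the goal
        if goal_step is not None and t == goal_step:
            return True
    return False  # goal never reached within the horizon


def success_rate(episodes, threshold):
    """Fraction of successful episodes; each episode is (goal_step, constraint_costs)."""
    outcomes = [mission_success(g, costs, threshold) for g, costs in episodes]
    return sum(outcomes) / len(outcomes)


# Hypothetical evaluation batch.
episodes = [
    (2, [0.0, 0.1, 0.0]),        # reaches the goal within budget
    (3, [0.0, 0.4, 0.3, 0.0]),   # budget exhausted at step 2
    (None, [0.0, 0.0]),          # never reaches the goal
    (0, [0.0]),                  # starts at the goal
]
rate = success_rate(episodes, threshold=0.5)
```

Under this definition an unconstrained policy that reaches the goal quickly but collides hard is still counted as a failure, which matches the qualitative comparison reported in the text.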

Next, we evaluate how well the methods generalize to (i) longer trajectories and (ii) new environments. P2P tasks are trained in a small environment (Fig. 32) with goals placed within a limited range of the robot's initial state. Figure 31 depicts the evaluation results, averaged over multiple trials, on P2P tasks in a much larger evaluation environment with goals placed farther from the robot's initial state. The success rate of all methods degrades as the goals move further away (Fig. 31(a)), and the safe methods (a-projection, i.e., SL-DDPG, and θ-projection, i.e., SG-DDPG) outperform the unconstrained and Lagrangian baselines (DDPG and LA-DDPG) as the task becomes more difficult. At the same time, our methods retain lower constraint costs even when the task is difficult (Fig. 31(b)).

Figure 31: Robot navigation generalization on a different environment: (a) mission success rate and (b) constraint satisfaction.

Finally, we deployed the SL-DDPG policy on the real Fetch robot (Melonee Wise & Dymesich, 2016) in an everyday office environment. Figure 28(c) shows the top-down view of the robot log. The robot completed five repetitions of the tasks. The experiments included narrow corridors and people walking through the office. The robot robustly avoided both static and dynamic (human) obstacles coming into its path. We observed additional "wobbling" effects that were not present in simulation, likely due to wheel slippage on the floor, which the policy was not trained for. On several occasions when the robot could not find a clear path, the policy instructed the robot to stay put instead of narrowly passing by the obstacle. This is precisely the safe behavior we want to achieve with the Lyapunov-based algorithms.

6 Conclusions

We formulated safe RL as a continuous action CMDP and developed two classes, θ-projection and a-projection, of policy optimization algorithms based on Lyapunov functions to learn safe policies with high expected cumulative return. We do so by combining both on- and off-policy optimization (DDPG or PPO) with a critic that evaluates the policy and computes its corresponding Lyapunov function. We evaluated our algorithms on four high-dimensional simulated robot locomotion tasks and compared them with several baselines. To demonstrate the effectiveness of the Lyapunov-based algorithms in solving real-world problems, we also applied these algorithms to indoor robot navigation, to ensure that the agent's path is optimal and collision-free. Our results indicate that our Lyapunov-based algorithms 1) achieve safe learning, 2) have better data-efficiency, 3) can be more naturally integrated within the standard end-to-end differentiable policy gradient training pipeline, and 4) are scalable to tackle real-world problems. Our work is a step forward in deploying RL to real-world problems in which safety guarantees are of paramount importance. Future work includes 1) further exploration of Lyapunov function properties to improve training stability and safety, 2) more efficient use of Lyapunov constraints in constrained policy optimization, and 3) extensions of the Lyapunov approach to the model-based setting to better utilize the agent's dynamics.

References

Appendix A The Lyapunov Approach to Solve CMDPs

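In this section, we revisit the Lyapunov approach to solving CMDPs that was proposed by Chow et al. (2018) and report the mathematical results that are important in developing our safe policy optimization algorithms. To start, without loss of generality, we assume that we have access to a baseline feasible policy of Equation 1, $\pi_B$; i.e., $\pi_B$ satisfies $\mathcal{D}_{\pi_B}(x_0) \le d_0$. We define a set of Lyapunov functions w.r.t. initial state $x_0 \in \mathcal{X}$ and constraint threshold $d_0$ as

$$\mathcal{L}_{\pi_B}(x_0, d_0) = \big\{ L : \mathcal{X} \to \mathbb{R}_{\ge 0} \;:\; T_{\pi_B, d}[L](x) \le L(x),\ \forall x \in \mathcal{X};\ L(x_0) \le d_0 \big\},$$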

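and call the constraints in this feasibility set the Lyapunov constraints. For any arbitrary Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0, d_0)$, we denote by

$$\mathcal{F}_L(x) = \big\{ \pi(\cdot|x) \in \Delta \;:\; T_{\pi, d}[L](x) \le L(x) \big\}$$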

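the set of $L$-induced Markov stationary policies. Since $T_{\pi,d}$ is a contraction mapping (Bertsekas, 2005), any $L$-induced policy $\pi$ has the property $\mathcal{D}_\pi(x) = \lim_{k \to \infty} T_{\pi,d}^k[L](x) \le L(x)$, $\forall x \in \mathcal{X}$. Together with the property that $L(x_0) \le d_0$, they imply that any $L$-induced policy is a feasible policy of Equation 1. However, in general, the set $\mathcal{F}_L(x)$ does not necessarily contain an optimal policy of Equation 1, and thus it is necessary to design a Lyapunov function (w.r.t. a baseline policy $\pi_B$) that provides this guarantee. In other words, the main goal is to construct a Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0, d_0)$ such that

$$L(x) \ge T_{\pi^*, d}[L](x), \qquad L(x_0) \le d_0. \qquad (8)$$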

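Chow et al. (2018) show in their Theorem 1 that 1) without loss of optimality, the Lyapunov function can be expressed as

$$L_{\epsilon}(x) := \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( d(x_t) + \epsilon(x_t) \big) \,\Big|\, \pi_B, x \Big],$$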

where $\epsilon(x) \ge 0$ is some auxiliary constraint cost uniformly upper-bounded by

and 2) if the baseline policy $\pi_B$ satisfies the condition

where is the maximum constraint cost, then the Lyapunov function candidate $L_{\epsilon^*}$, built from this upper bound $\epsilon^*$, also satisfies the properties of Equation 8, and thus its induced feasible policy set contains an optimal policy. Furthermore, suppose that the distance between the baseline and optimal policies can be estimated effectively. Using the set of $L_{\epsilon^*}$-induced feasible policies $\mathcal{F}_{L_{\epsilon^*}}$ and noting that the safe Bellman operator $T[V](x) = \min_{\pi \in \mathcal{F}_{L_{\epsilon^*}}(x)} T_{\pi,c}[V](x)$ is monotonic and contractive, one can show that $T[V](x) = V(x)$, $\forall x \in \mathcal{X}$, has a unique fixed point $V^*$, such that $V^*(x_0)$ is a solution of Equation 1, and an optimal policy can be constructed via greedification, i.e., $\pi^*(\cdot|x) \in \arg\min_{\pi \in \mathcal{F}_{L_{\epsilon^*}}(x)} T_{\pi,c}[V^*](x)$. This shows that under the above assumption, Equation 1 can be solved using standard dynamic programming (DP) algorithms. While this result connects CMDPs with Bellman's principle of optimality, verifying whether $\pi_B$ satisfies this assumption is challenging when a good estimate of the distance between $\pi^*$ and $\pi_B$ is not available. To address this issue, Chow et al. (2018) propose to approximate $\epsilon^*$ with an auxiliary constraint cost $\tilde{\epsilon}$, which is the largest auxiliary cost satisfying the Lyapunov condition $L_{\tilde{\epsilon}}(x) \ge T_{\pi_B,d}[L_{\tilde{\epsilon}}](x)$, $\forall x \in \mathcal{X}$, and the safety condition $L_{\tilde{\epsilon}}(x_0) \le d_0$. The intuition here is that the larger $\tilde{\epsilon}$, the larger the set of policies $\mathcal{F}_{L_{\tilde{\epsilon}}}$. Thus, by choosing the largest such auxiliary cost, we hope to have a better chance of including the optimal policy in the set of feasible policies. Specifically, $\tilde{\epsilon}$ is computed by solving the following linear programming (LP) problem:

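$$\tilde{\epsilon} \in \arg\max_{\epsilon : \mathcal{X} \to \mathbb{R}_{\ge 0}} \Big\{ \sum_{x \in \mathcal{X}} \epsilon(x) \;:\; d_0 - \mathcal{D}_{\pi_B}(x_0) \ge \mathbf{1}(x_0)^\top \big( I - \gamma \{ P(x'|x, \pi_B) \}_{x, x' \in \mathcal{X}} \big)^{-1} \epsilon \Big\}, \qquad (9)$$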

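where $\mathbf{1}(x_0)$ represents a one-hot vector in which the non-zero element is located at $x = x_0$. When $\pi_B$ is a feasible policy, this problem has a non-empty solution. Furthermore, according to the derivations in Chow et al. (2018), the maximizer of (9) has the following form:

$$\tilde{\epsilon}(x) = \frac{\big( d_0 - \mathcal{D}_{\pi_B}(x_0) \big) \cdot \mathbf{1}\{x = \underline{x}\}}{\mathbb{E}\big[ \sum_{t=0}^{\infty} \gamma^t \mathbf{1}\{x_t = \underline{x}\} \mid x_0, \pi_B \big]} \ge 0,$$

where $\underline{x} \in \arg\min_{x \in \mathcal{X}} \mathbb{E}\big[ \sum_{t=0}^{\infty} \gamma^t \mathbf{1}\{x_t = x\} \mid x_0, \pi_B \big]$. They also show that by further restricting $\tilde{\epsilon}(x)$ to be a constant function, the maximizer is given by

$$\tilde{\epsilon}(x) = (1 - \gamma) \cdot \big( d_0 - \mathcal{D}_{\pi_B}(x_0) \big), \quad \forall x \in \mathcal{X}.$$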

Using the construction of the Lyapunov function $L_{\tilde{\epsilon}}$, Chow et al. (2018) propose the safe policy iteration (SPI) algorithm (see Algorithm 1), in which the Lyapunov function is updated via bootstrapping, i.e., at each iteration $L_{\tilde{\epsilon}}$ is recomputed using Equation 9 w.r.t. the current baseline policy. At each iteration $k$, this algorithm has the following properties: 1) Consistent Feasibility, i.e., if the current policy $\pi_k$ is feasible, then $\pi_{k+1}$ is also feasible; 2) Monotonic Policy Improvement, i.e., $\mathcal{C}_{\pi_{k+1}}(x) \le \mathcal{C}_{\pi_k}(x)$ for any $x \in \mathcal{X}$; and 3) Asymptotic Convergence. Despite all these nice properties, SPI is still a value-function-based algorithm, and thus it is not straightforward to use it in continuous action problems. The main reason is that the greedification step becomes an optimization problem over the continuous set of actions that is not necessarily easy to solve. In Section 3, we show how we use SPI and its nice properties to develop safe policy optimization algorithms that can handle continuous action problems. Our algorithms can be thought of as combinations of DDPG or PPO (or any other on-policy or off-policy policy optimization algorithm) with an SPI-inspired critic that evaluates the policy and computes its corresponding Lyapunov function. The computed Lyapunov function is then used to guarantee a safe policy update, i.e., the new policy is selected from a restricted set of safe policies defined by the Lyapunov function of the current policy.

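  Input: Initial feasible policy $\pi_0$;
  for $k = 0, 1, 2, \ldots$ do
     Step 0: With $\pi_B = \pi_k$, evaluate the Lyapunov function $L_{\epsilon_k}$, where $\epsilon_k$ is a solution of Equation 9
     Step 1: Evaluate the cost value function $V_{\pi_k}(x) = \mathcal{C}_{\pi_k}(x)$; then update the policy by solving the following problem: $\pi_{k+1}(\cdot|x) \in \arg\min_{\pi \in \mathcal{F}_{L_{\epsilon_k}}(x)} T_{\pi,c}[V_{\pi_k}](x)$, $\forall x \in \mathcal{X}$
  end for
  Return Final policy $\pi_{k^*}$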
Algorithm 1 Safe Policy Iteration (SPI)
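As a numerical illustration of the Lyapunov-function evaluation in Step 0, the sketch below uses the constant auxiliary cost mentioned above, $\tilde{\epsilon} = (1-\gamma)(d_0 - \mathcal{D}_{\pi_B}(x_0))$, on a small Markov chain. The three-state chain, its state-only constraint costs, the discount factor, and the threshold are hypothetical values chosen for illustration. The sketch checks only the Lyapunov and safety conditions and does not implement the policy update of Step 1.

```python
GAMMA = 0.9  # illustrative discount factor

# Hypothetical 3-state Markov chain induced by a baseline policy pi_B.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
]
d = [0.0, 1.0, 0.0]   # immediate constraint cost of each state
x0, d0 = 0, 3.0       # initial state and constraint threshold


def backup(values, cost):
    """One application of the operator: cost(x) + gamma * sum_y P(y|x) * values(y)."""
    n = len(values)
    return [cost[x] + GAMMA * sum(P[x][y] * values[y] for y in range(n))
            for x in range(n)]


def evaluate(cost, iters=2000):
    """Approximate the discounted value of `cost` under the chain by fixed-point iteration."""
    values = [0.0] * len(cost)
    for _ in range(iters):
        values = backup(values, cost)
    return values


# Constraint value function of the baseline policy, D_{pi_B}(x).
D = evaluate(d)

# Constant auxiliary cost: spreads the remaining budget d0 - D(x0) over the horizon.
eps = (1.0 - GAMMA) * (d0 - D[x0])

# Lyapunov function L(x) = E[sum_t gamma^t (d(x_t) + eps)].
L = evaluate([dx + eps for dx in d])

# Lyapunov condition T_{pi_B, d}[L](x) <= L(x) and safety condition L(x0) <= d0.
TL = backup(L, d)
lyapunov_ok = all(TL[x] <= L[x] + 1e-9 for x in range(len(L)))
safety_ok = L[x0] <= d0 + 1e-9
```

With a constant auxiliary cost, the gap $L(x) - T_{\pi_B,d}[L](x)$ equals $\tilde{\epsilon}$ in every state and the safety condition holds with equality at $x_0$, which is why this choice uses the whole remaining constraint budget.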

Appendix B Lagrangian Approach to Safe RL

There are a number of mild technical and notational assumptions which we will make throughout this section, so we state them here.

Assumption 1 (Differentiability).

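For any state-action pair $(x, a)$, $\pi_\theta(a|x)$ is continuously differentiable in $\theta$ and $\nabla_\theta \pi_\theta(a|x)$ is a Lipschitz function in $\theta$ for every $a \in \mathcal{A}$ and $x \in \mathcal{X}$.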

Assumption 2 (Strict Feasibility).

There exists a transient policy $\pi_\theta(\cdot|x)$ such that $\mathcal{D}_{\pi_\theta}(x_0) < d_0$ in the constrained problem.

Assumption 3 (Step Sizes).

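The step size schedules $\{\zeta_3(k)\}$, $\{\zeta_2(k)\}$, and $\{\zeta_1(k)\}$ satisfy

$$\sum_k \zeta_1(k) = \sum_k \zeta_2(k) = \sum_k \zeta_3(k) = \infty, \qquad (10)$$
$$\sum_k \zeta_1(k)^2, \ \sum_k \zeta_2(k)^2, \ \sum_k \zeta_3(k)^2 < \infty, \qquad (11)$$
$$\zeta_1(k) = o\big(\zeta_2(k)\big), \quad \zeta_2(k) = o\big(\zeta_3(k)\big). \qquad (12)$$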

Assumption 1 imposes smoothness on the optimal policy. Assumption 2 guarantees the existence of a local saddle point in the Lagrangian analysis introduced in the next subsection. Assumption 3 refers to the step sizes of the policy updates that will be introduced for the algorithms in this paper, and indicates that the update corresponding to $\{\zeta_3(k)\}$ is on the fastest time-scale, the update corresponding to $\{\zeta_2(k)\}$ is on the intermediate time-scale, and the update corresponding to $\{\zeta_1(k)\}$ is on the slowest time-scale. As this assumption refers to user-defined parameters, the schedules can always be chosen to satisfy it.
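One simple way to obtain schedules with this time-scale separation is to use polynomially decaying step sizes with different exponents. The sketch below is illustrative only: the exponents 1.0, 0.8, and 0.6 are assumptions rather than the values used in the experiments. Any exponents in (0.5, 1] give a divergent sum with a convergent sum of squares, and a larger exponent gives a slower time-scale.

```python
def zeta1(k):
    """Slowest time-scale (Lagrange multiplier); exponent 1.0 is an illustrative choice."""
    return 1.0 / k


def zeta2(k):
    """Intermediate time-scale (policy parameter); exponent 0.8 is an illustrative choice."""
    return 1.0 / k ** 0.8


def zeta3(k):
    """Fastest time-scale (critic); exponent 0.6 is an illustrative choice."""
    return 1.0 / k ** 0.6


def ratio_slow_to_mid(k):
    """zeta1(k) / zeta2(k), which should vanish as k grows."""
    return zeta1(k) / zeta2(k)


def ratio_mid_to_fast(k):
    """zeta2(k) / zeta3(k), which should vanish as k grows."""
    return zeta2(k) / zeta3(k)


# Both ratios decay like k ** -0.2, so each slower schedule is o(.) of the faster one.
early = (ratio_slow_to_mid(10), ratio_mid_to_fast(10))
late = (ratio_slow_to_mid(10 ** 6), ratio_mid_to_fast(10 ** 6))
```

In practice, constant or hand-tuned learning rates are often used instead, in which case the time-scale separation is only approximate.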

To solve the CMDP, we employ the Lagrangian relaxation procedure (Bertsekas, 1999) to convert it to the following unconstrained problem:

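$$\min_{\theta} \max_{\lambda \ge 0} \Big( L(\theta, \lambda) := \mathcal{C}_{\pi_\theta}(x_0) + \lambda \big( \mathcal{D}_{\pi_\theta}(x_0) - d_0 \big) \Big), \qquad (13)$$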

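where $\lambda$ is the Lagrange multiplier. Notice that $L(\theta, \lambda)$ is a linear function in $\lambda$. Then there exists a local saddle point $(\theta^*, \lambda^*)$ for the minimax optimization problem $\max_{\lambda \ge 0} \min_{\theta} L(\theta, \lambda)$, such that for some $r > 0$, $\forall \theta \in \Theta \cap B_{\theta^*}(r)$, and $\forall \lambda \in [0, \lambda_{\max}]$, we have

$$L(\theta, \lambda^*) \ge L(\theta^*, \lambda^*) \ge L(\theta^*, \lambda), \qquad (14)$$

where $B_{\theta^*}(r)$ is a hyper-dimensional ball centered at $\theta^*$ with radius $r > 0$.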

In the following, we present a policy gradient (PG) algorithm and an actor-critic (AC) algorithm. While the PG algorithm updates its parameters after observing several trajectories, the AC algorithms are incremental and update their parameters at each time-step.

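We now present a policy gradient algorithm to solve the optimization problem in Equation 13. The idea of the algorithm is to descend in $\theta$ and ascend in $\lambda$ using the gradients of $L(\theta, \lambda)$ w.r.t. $\theta$ and $\lambda$, i.e.,

$$\nabla_\theta L(\theta, \lambda) = \nabla_\theta \big( \mathcal{C}_{\pi_\theta}(x_0) + \lambda \mathcal{D}_{\pi_\theta}(x_0) \big), \qquad \nabla_\lambda L(\theta, \lambda) = \mathcal{D}_{\pi_\theta}(x_0) - d_0. \qquad (15)$$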

The unit of observation in this algorithm is a system trajectory generated by following policy $\pi_\theta$. At each iteration, the algorithm generates $N$ trajectories by following the current policy, uses them to estimate the gradients in Equation 15, and then uses these estimates to update the parameters $\theta, \lambda$.

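Let $\xi = \{x_0, a_0, x_1, a_1, \ldots, x_{T-1}, a_{T-1}, x_T\}$ be a trajectory generated by following the policy $\pi_\theta$, where $x_T$ is the target state of the system and $T$ is the (random) stopping time. The cost, constraint cost, and probability of $\xi$ are defined as $\mathcal{C}(\xi) = \sum_{t=0}^{T-1} \gamma^t c(x_t, a_t)$, $\mathcal{D}(\xi) = \sum_{t=0}^{T-1} \gamma^t d(x_t)$, and $P_\theta(\xi) = P_0(x_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|x_t) P(x_{t+1}|x_t, a_t)$, respectively. Based on the definition of $L(\theta, \lambda)$, one obtains $L(\theta, \lambda) = \sum_{\xi} P_\theta(\xi) \big( \mathcal{C}(\xi) + \lambda \mathcal{D}(\xi) \big) - \lambda d_0$.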
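Writing the Lagrangian as a sum over trajectories is what makes a likelihood-ratio (score-function) gradient available. The sketch below checks this identity on a deliberately tiny problem in which a trajectory is a single action, so the sum over trajectories can be enumerated exactly instead of sampled. The two-action problem, its costs, the sigmoid policy, and all names are hypothetical choices made for illustration.

```python
import math

# Hypothetical one-step problem: action 0 is cheap but unsafe, action 1 is costly but safe.
COST = [0.0, 1.0]          # C(xi) for the trajectories "take action 0" / "take action 1"
CONSTRAINT = [1.0, 0.0]    # D(xi) for the same trajectories
THRESHOLD = 0.3            # constraint threshold d_0


def policy(theta):
    """Sigmoid policy: probability of each action given a scalar parameter theta."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]


def grad_log_policy(theta, action):
    """d/dtheta of log pi_theta(action) for the sigmoid policy."""
    p1 = policy(theta)[1]
    return (1.0 - p1) if action == 1 else -p1


def lagrangian(theta, lam):
    """L(theta, lam) = sum_xi P(xi) * (C(xi) + lam * D(xi)) - lam * d_0."""
    probs = policy(theta)
    expected = sum(probs[a] * (COST[a] + lam * CONSTRAINT[a]) for a in (0, 1))
    return expected - lam * THRESHOLD


def score_function_gradients(theta, lam):
    """Exact likelihood-ratio gradients, enumerating both trajectories."""
    probs = policy(theta)
    g_theta = sum(probs[a] * grad_log_policy(theta, a) * (COST[a] + lam * CONSTRAINT[a])
                  for a in (0, 1))
    g_lam = -THRESHOLD + sum(probs[a] * CONSTRAINT[a] for a in (0, 1))
    return g_theta, g_lam


def finite_difference_gradients(theta, lam, h=1e-6):
    """Central finite differences of the Lagrangian, used only as a sanity check."""
    g_theta = (lagrangian(theta + h, lam) - lagrangian(theta - h, lam)) / (2 * h)
    g_lam = (lagrangian(theta, lam + h) - lagrangian(theta, lam - h)) / (2 * h)
    return g_theta, g_lam


exact = score_function_gradients(0.4, 1.5)
approx = finite_difference_gradients(0.4, 1.5)
```

In a trajectory-based algorithm the enumeration over trajectories would be replaced by an average over sampled trajectories, which is where the variance discussed later in this appendix comes from.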

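Algorithm 2 contains the pseudo-code of our proposed policy gradient algorithm. What appears inside the parentheses on the right-hand side of the update equations are the estimates of the gradients of $L(\theta, \lambda)$ w.r.t. $\theta, \lambda$ (estimates of the expressions in Equation 15). Gradient estimates of the Lagrangian function are given by

$$\nabla_\theta L(\theta, \lambda) = \sum_{\xi} P_\theta(\xi) \, \nabla_\theta \log P_\theta(\xi) \, \big( \mathcal{C}(\xi) + \lambda \mathcal{D}(\xi) \big), \qquad \nabla_\lambda L(\theta, \lambda) = -d_0 + \sum_{\xi} P_\theta(\xi) \, \mathcal{D}(\xi),$$

where the likelihood gradient is

$$\nabla_\theta \log P_\theta(\xi) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|x_t) = \sum_{t=0}^{T-1} \frac{\nabla_\theta \pi_\theta(a_t|x_t)}{\pi_\theta(a_t|x_t)}.$$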

In the algorithm, $\Gamma_\Lambda$ is a projection operator to $[0, \lambda_{\max}]$, i.e., $\Gamma_\Lambda(\lambda) = \arg\min_{\hat{\lambda} \in [0, \lambda_{\max}]} \|\lambda - \hat{\lambda}\|_2^2$, which ensures the convergence of the algorithm. Recall from Assumption 3 that the step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the policy parameter update is on the fast time-scale $\{\zeta_2(k)\}$, and the Lagrange multiplier update is on the slow time-scale $\{\zeta_1(k)\}$. This results in a two time-scale stochastic approximation algorithm, which has been shown to converge to a (local) saddle point of the objective function $L(\theta, \lambda)$. This convergence proof makes use of standard techniques from stochastic approximation theory: in the limit, when the step size is sufficiently small, analyzing the convergence of PG is equivalent to analyzing the stability of an ordinary differential equation (ODE) w.r.t. its equilibrium point.
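The descend-in-$\theta$, ascend-in-$\lambda$ scheme with a projected multiplier can be illustrated on a toy problem. The sketch below is not Algorithm 2: it replaces the trajectory-based gradient estimates with exact gradients of a hypothetical scalar problem (cost $(\theta - 2)^2$, constraint $\theta \le 1$), and it uses constant step sizes instead of the decaying schedules of Assumption 3, keeping only the idea that the multiplier moves more slowly than the policy parameter.

```python
LAMBDA_MAX = 10.0  # illustrative upper bound of the projection interval


def project(lam):
    """Projection of the multiplier onto [0, LAMBDA_MAX]."""
    return min(max(lam, 0.0), LAMBDA_MAX)


def grad_theta(theta, lam):
    """d/dtheta of L(theta, lam) = (theta - 2)^2 + lam * (theta - 1)."""
    return 2.0 * (theta - 2.0) + lam


def grad_lambda(theta):
    """d/dlam of the same Lagrangian: constraint value minus threshold."""
    return theta - 1.0


def primal_dual(steps=5000, step_theta=0.05, step_lambda=0.01):
    """Simultaneous gradient descent in theta and projected gradient ascent in lam."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        new_theta = theta - step_theta * grad_theta(theta, lam)   # faster update
        lam = project(lam + step_lambda * grad_lambda(theta))     # slower update
        theta = new_theta
    return theta, lam


theta_star, lambda_star = primal_dual()
```

For this toy problem the saddle point is $\theta^* = 1$, $\lambda^* = 2$: the unconstrained minimizer $\theta = 2$ violates the constraint, so the multiplier grows until the constraint becomes active. The iteration converges here because the cost is strongly convex in $\theta$; a Lagrangian that is nearly bilinear around the saddle point can make the same updates oscillate.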

In policy gradient, the unit of observation is a system trajectory. This may result in high variance for the gradient estimates, especially when the trajectories are long. To address this issue, we propose two actor-critic algorithms for optimizing Equation 13 that use value function approximation in the gradient estimates and update the parameters incrementally (after each state-action transition). These algorithms are still based on the above gradient estimates. Algorithm 3 contains the pseudo-code of these algorithms. The projection operator $\Gamma_\Lambda$ is necessary to ensure the convergence of the algorithms. Recall from Assumption 3 that the step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the critic update is on the fastest time-scale $\{\zeta_3(k)\}$, the policy $\theta$-update is on the intermediate time-scale $\{\zeta_2(k)\}$, and finally the Lagrange multiplier update is on the slowest time-scale $\{\zeta_1(k)\}$. This results in three time-scale stochastic approximation algorithms.

Using the policy gradient theorem from Sutton et al. (2000), one can show that