Safe Reinforcement Learning of Control-Affine Systems with Vertex Networks

03/20/2020 ∙ by Liyuan Zheng, et al. ∙ University of Washington

This paper focuses on finding reinforcement learning policies for control systems with hard state and action constraints. Despite its success in many domains, reinforcement learning is challenging to apply to problems with hard constraints, especially if both the state variables and actions are constrained. Previous works seeking to ensure constraint satisfaction, or safety, have focused on adding a projection step to a learned policy. Yet, this approach requires solving an optimization problem at every policy execution step, which can lead to significant computational costs. To tackle this problem, this paper proposes a new approach, termed Vertex Networks (VNs), with guarantees on safety during exploration and on learned control policies by incorporating the safety constraints into the policy network architecture. Leveraging the geometric property that all points within a convex set can be represented as the convex combination of its vertices, the proposed algorithm first learns the convex combination weights and then uses these weights along with the pre-calculated vertices to output an action. The output action is guaranteed to be safe by construction. Numerical examples illustrate that the proposed VN algorithm outperforms vanilla reinforcement learning in a variety of benchmark control tasks.




1 Introduction

Over the last couple of years, reinforcement learning (RL) algorithms have yielded impressive results on a variety of applications. These successes include playing video games with super-human performance [17], robot locomotion and manipulation [16, 15], autonomous vehicles [18], and many benchmark continuous control tasks [11].

In RL, an agent learns to make sequential decisions by interacting with the environment, gradually improving its performance at the task as learning progresses. Policy optimization algorithms [16, 19] for RL assume that agents are free to explore any behavior during learning, so long as it leads to performance improvement. However, in many real-world applications, there are often additional safety constraints, or specifications that lead to constraints, on the learning problem. For instance, a robot arm should avoid behaviors that could cause it to damage itself or the objects around it, and autonomous vehicles must avoid crashing into others while navigating [12].

In real-world applications such as the above, constraints are an integral part of the problem description, and maintaining constraint satisfaction during learning is critical (i.e., these are hard constraints). Therefore, in this work, our goal is to maintain constraint satisfaction at each step throughout the whole learning process. This problem is sometimes called the safe exploration problem [3, 22]. In particular, we define safety as remaining within some pre-specified polytope constraints on both states and actions. Correspondingly, the action we take at each step should result in a state that is in the safety set.

In the safe exploration literature, the projection technique is often leveraged to maintain safety during exploration [8, 10]. Specifically, at each step, an action is suggested by an unconstrained policy optimization algorithm and is then projected into the safety region. However, this projection step either involves solving a computationally expensive optimization problem online [8], or relies on strict assumptions such as allowing only one of the half-spaces to be violated [10]. Moreover, if real-time optimization is allowed by the application, then it is often more advantageous to solve a model predictive control problem than to ask for a policy learned by RL.

To alleviate this limitation, we propose Vertex Networks (VNs), in which we encode the safety constraints into the policy via the neural network architecture design. In VNs, we compute the vertices of the safety region at each time step and design the action to be a convex combination of those vertices, allowing policy optimization algorithms to explore only in the safe region.

The contributions of this work can be briefly summarized as follows: (1) To the best of our knowledge, this is the first attempt to encode safety constraints into policies by explicit network design. (2) In simulation, the proposed approach achieves good performance while maintaining constraint satisfaction.

2 Related work

In safe RL, safety in expectation is a widely used criterion [2, 23]. In recent literature, policy optimization algorithms have been proposed as a means to learn a policy for a continuous Markov decision process (MDP). Two state-of-the-art exemplars in terms of performance are the Lagrangian-based actor-critic algorithm [5, 9] and Constrained Policy Optimization (CPO) [1]. However, for these methods, constraint satisfaction can only be guaranteed in expectation. In a safety-critical environment, this is not sufficient, since even if safety is guaranteed in expectation, there is still a non-zero probability that unsafe trajectories will be generated by the controller.

Safe exploration, on the other hand, requires constraint satisfaction at each step. Recent approaches [20, 22] model safety as an unknown function and propose algorithms that trade off between exploring the safety function and the reward function. However, these approaches require solving MDPs or constrained optimization problems to obtain a policy in each exploration iteration, which is less efficient than the policy optimization algorithm leveraged in our approach.

Work that uses policy optimization algorithms for safe exploration is closely related to ours. Among such methods, a projection technique is often used to maintain safety. In [10], the safety set is defined in terms of half-space constraints on the action. A policy optimization algorithm, in particular deep deterministic policy gradient (DDPG) [16], is leveraged to generate an action, which is then projected into the safe set. Under the assumption that at most one half-space constraint can be violated, the projection optimization problem can be solved in closed form. In [8], this projection step is solved as a quadratic program, based on confidence intervals for the approximation of the system dynamics, modeled as Gaussian processes.
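To give a feel for why a single violated half-space admits a closed form, the sketch below implements the generic Euclidean projection onto one half-space {v : a·v ≤ b}. This is a textbook formula shown for illustration only, not the exact safety layer of [10]; the function name `project_halfspace` and the numbers are assumptions made here.

```python
import numpy as np

def project_halfspace(u, a, b):
    """Euclidean projection of u onto the half-space {v : a @ v <= b}."""
    u = np.asarray(u, dtype=float)
    a = np.asarray(a, dtype=float)
    violation = a @ u - b
    if violation <= 0.0:
        return u.copy()                       # already feasible, nothing to do
    # Move along the constraint normal just far enough to reach the boundary
    return u - (violation / (a @ a)) * a

# Illustrative numbers: the raw action violates u_1 <= 1
u_raw = np.array([2.0, 1.0])
u_safe = project_halfspace(u_raw, a=[1.0, 0.0], b=1.0)
```

With several simultaneously active constraints no such one-line formula is available in general, which is where the quadratic program of [8] comes in.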

It is possible to integrate a projection step into the policy network. Indeed, using the methodology provided in [4], one can leverage a policy optimization algorithm to train the policy network in an end-to-end fashion. However, the feed-forward computation of the policy network in this case is computationally expensive, as it involves solving an optimization problem before executing every single action. If an optimization problem has to be solved online anyway, one might prefer to solve a model predictive control problem instead. Rather than integrating the projection step, we propose VNs, which leverage the convex combination of vertices to enforce safety.

3 Model Setup

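Consider a discrete-time control-affine system that evolves according to

$$x_{t+1} = f(x_t) + g(x_t)\,u_t, \tag{1}$$

where $x_t \in \mathbb{R}^n$, $u_t \in \mathbb{R}^m$, and $f$ and $g$ are known functions of appropriate dimensions. Our goal is to minimize a cost $C$ over time horizon $T$, subject to safety constraints on $x_t$ and actuator constraints on $u_t$:

$$\begin{aligned}
\min_{u_t} \quad & \sum_{t=1}^{T} C(x_t, u_t) && \text{(2a)} \\
\text{s.t.} \quad & x_{t+1} = f(x_t) + g(x_t)\,u_t, && \text{(2b)} \\
& x_t \in \mathcal{X}, && \text{(2c)} \\
& u_t \in \mathcal{U}, && \text{(2d)}
\end{aligned}$$

where $\mathcal{X}$ and $\mathcal{U}$ are convex polytopes. A convex polytope can be defined as an intersection of linear inequalities (half-space representation) or, equivalently, as a convex combination of a finite number of points (convex-hull representation) [14]. Constraints of this type are widely used in theory and practice; for example, see [6] and the references within.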
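As a minimal sketch of this setup, the snippet below steps a control-affine system forward once and checks membership in a polytope given in half-space form (A z ≤ b). The double-integrator dynamics, the time step, and the names `step` and `in_polytope` are assumptions chosen for illustration; they do not come from the experiments in this paper.

```python
import numpy as np

def step(x, u, f, g):
    """One step of the control-affine dynamics x_next = f(x) + g(x) u."""
    return f(x) + g(x) @ u

def in_polytope(A, b, z, tol=1e-9):
    """True if z satisfies every inequality of the half-space form A z <= b."""
    return bool(np.all(A @ z <= b + tol))

# Hypothetical double integrator with time step dt (not from the paper)
dt = 0.1
f = lambda x: np.array([x[0] + dt * x[1], x[1]])
g = lambda x: np.array([[0.0], [dt]])

# State box |x_i| <= 1 written as A_x x <= b_x
A_x = np.vstack([np.eye(2), -np.eye(2)])
b_x = np.ones(4)

x0 = np.array([0.95, 0.9])
x1 = step(x0, np.array([0.5]), f, g)
```

Here `x0` is inside the box while `x1` leaves it, which is exactly the situation the state constraint is meant to rule out.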

The goal of safe RL is to find an optimal feedback controller that minimizes the overall system cost (2a) while satisfying the safety constraints (2c) and the actuator constraints (2d). Solving (2) is a difficult task, even for linear systems with only the actuator constraints, except for a class of systems where analytic solutions can be found [13]. Therefore, RL (and its different variants) has been proposed to search for a feedback controller.

Numerous learning approaches have been adopted to solve the problem when the constraints (2c) and (2d) are not present. However, there are considerably fewer successful applications of RL to problems with hard constraints. One such approach is the two-stage method used in [10]. The first step is to simply train a policy that solves the problem in (2) without the constraints on either the state or the action. To enforce the constraints, a projection step is then solved, in which the action determined by the unconstrained policy is projected into the constraint sets.

This two-step process is referred to as safe exploration in [10], since it leverages the fact that RL algorithms explore the action space while the projection satisfies the hard constraints. However, this approach has two drawbacks that we address in the current paper. Firstly, the projection step itself requires an optimization problem to be solved, which can be computationally expensive. More fundamentally, it raises the question of why one should not directly solve (2) as a model predictive control problem, since online optimization needs to be used either way. Secondly, decoupling the policy and projection steps may lead to solutions that significantly deviate from the original unconstrained policy. To overcome these challenges, we propose a novel vertex policy network that encodes the geometry of the constraints into the network architecture, and we train it in an end-to-end way. We discuss the proposed vertex policy network framework in detail in the next section.

4 Vertex Policy Network

The key idea of our proposed VN is to use a basic fact about the geometry of a convex polytope. Given a bounded convex polytope $\mathcal{P}$, it is always possible to find a finite number of vertices $P_1, \dots, P_N$ such that their convex hull is $\mathcal{P}$. In addition, there is no smaller set of points whose convex hull forms $\mathcal{P}$ [14]. Then, the next proposition follows directly.

Proposition 4.1.

Let $\mathcal{P}$ be a convex polytope with vertices $P_1, \dots, P_N$. For every point $y \in \mathcal{P}$, there exist $\lambda_1, \dots, \lambda_N$ such that

$$y = \sum_{i=1}^{N} \lambda_i P_i,$$

where $\lambda_i \ge 0$ for all $i$ and $\sum_{i=1}^{N} \lambda_i = 1$.

The preceding proposition implies that we can search for the set of weights $\lambda_i$ instead of directly finding a point inside the polytope.
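As a quick worked instance (the square and the point are chosen here purely for illustration), take the unit square with vertices (0,0), (1,0), (1,1), (0,1) and the interior point (0.25, 0.5). One valid set of convex weights is:

```latex
\begin{aligned}
\begin{pmatrix} 0.25 \\ 0.5 \end{pmatrix}
 &= 0.375 \begin{pmatrix} 0 \\ 0 \end{pmatrix}
  + 0.125 \begin{pmatrix} 1 \\ 0 \end{pmatrix}
  + 0.125 \begin{pmatrix} 1 \\ 1 \end{pmatrix}
  + 0.375 \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \\
 1 &= 0.375 + 0.125 + 0.125 + 0.375 .
\end{aligned}
```

The weights are not unique: (0.5, 0, 0.25, 0.25) reproduces the same point. This is harmless for a policy that outputs weights, since only the resulting point is executed.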

Proposition 4.1 can be applied to find a feedback control policy. Since both the constraint sets $\mathcal{X}$ and $\mathcal{U}$ are convex polytopes, the control action at each timestep must also live in a convex polytope. If its vertices are known, the output of a policy can be the weights $\lambda_i$. The benefit of having the weights as the output is threefold. First, it is much easier to normalize a set of real numbers to be all positive and to sum to unity (the probability simplex) than to project into an arbitrary polytope. For this paper, we use a softmax layer. Second, this approach allows us to fully explore the interior of the feasible space, whereas projections could be biased towards the boundary of the set. Third, we are able to use standard policy gradient training techniques.

In particular, we use DDPG as the policy evaluation and update algorithm, where the policy is a neural network parameterized by and updated by


where is the expected return using the current policy and is defined by


We approximate by , where are the number of sampled trajectories generated by running the current policy and is the trajectory length. The overall algorithm procedure of the proposed VN framework is provided in Fig 1.

Figure 1: Flowchart of the proposed VN framework.

Below, we discuss the two major components of VN in detail: 1) the safety region and vertex calculation, and 2) the neural network architecture design for the safe layer.

4.1 Evolution of the Action Constraint Set

We require that at each time step the state of the system stays in the set $\mathcal{X}$ and the control action is constrained to be in the set $\mathcal{U}$. As stated earlier, we assume $\mathcal{X}$ and $\mathcal{U}$ to be convex polytopes. The main algorithmic challenge comes from the need to repeatedly intersect translated versions of these polytopes. To be concrete, suppose we are given $x_t$. Then for the next step, we require that $x_{t+1} \in \mathcal{X}$. This translates into an affine constraint on $u_t$, since the control action must satisfy

$$f(x_t) + g(x_t)\,u_t \in \mathcal{X}.$$

Since $x_t$ is known, both $f(x_t)$ and $g(x_t)$ are constant in the above equation, and the constraint on $u_t$ is again polytopic. We denote this polytope as $\mathcal{S}_t$. The set $\mathcal{V}_t$ to which $u_t$ must belong is the intersection of $\mathcal{S}_t$ and the actuator constraints:

$$\mathcal{V}_t = \mathcal{S}_t \cap \mathcal{U}. \tag{5}$$

After identifying the vertices of $\mathcal{V}_t$, the algorithm in Fig. 1 can be used to find the optimal feedback policies.
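For a scalar input and box-shaped state constraints, the construction above reduces to interval arithmetic. The following is a minimal sketch under those assumptions; the function name `safe_action_interval` and all numbers are hypothetical, and a general polytope would need the machinery discussed in Section 4.2.

```python
import numpy as np

def safe_action_interval(fx, gx, x_lo, x_hi, u_lo, u_hi):
    """Interval of scalar u with x_lo <= fx + gx * u <= x_hi and u_lo <= u <= u_hi.

    fx, gx are the values of f(x_t) and g(x_t) (one entry per state component).
    Returns (lo, hi); the admissible set is empty when lo > hi.
    """
    lo, hi = u_lo, u_hi
    for f_i, g_i, l_i, h_i in zip(fx, gx, x_lo, x_hi):
        if abs(g_i) < 1e-12:
            # u does not influence this component, so it must already be safe
            if not (l_i <= f_i <= h_i):
                return (np.inf, -np.inf)
            continue
        a, b = (l_i - f_i) / g_i, (h_i - f_i) / g_i
        # Dividing by a negative g_i flips the inequality, hence min/max
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    return lo, hi

# Hypothetical values: f(x_t) = [0.5, 0.8], g(x_t) = [0.0, 0.1]
lo, hi = safe_action_interval(fx=[0.5, 0.8], gx=[0.0, 0.1],
                              x_lo=[-1.0, -1.0], x_hi=[1.0, 1.0],
                              u_lo=-5.0, u_hi=5.0)
```

In one dimension the two endpoints `lo` and `hi` are the vertices of the admissible action set; here the state constraint tightens the upper actuator limit from 5 to 2.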

In general, it is fairly straightforward to find either the convex hull or the half-space representation of the set of actions that keep the next state safe, since it just requires a linear transformation of the state constraint polytope. However, the intersection step in (5) and the process of finding the representation of its convex hull are non-trivial [6]. Below, we work through a simple example to illustrate the steps and then discuss how to overcome the computational challenges.

Example 1 (Intersection Step).

Consider the following two–dimensional linear system:

Suppose the action safety set is a convex polytope defined by: , and . The state safety set is a square defined by and the initial state is . By simple calculation, and is the box bounded by . Fig. 2 (left) visualizes the intersection operations.

Now suppose a feasible action is chosen and the system evolves. Then, . Performing the intersection of and , we get that is a rectangle defined by the vertices as depicted in Fig. 2 (right).

Figure 2: Evolution of action safety set for a two-dimensional linear system toy example. The left plot visualizes the safety set at time , and the right plot shows the safety set at time .

4.2 Intersection of Polytopes

It should be noted that finding the vertices of an intersection of polytopes is not easy [21]. If the polytopes are in half-space representation, then their intersection can be found by simply stacking the inequalities. However, finding the vertices of the resulting polytope can be computationally expensive. Similarly, directly intersecting two polytopes based on their convex hull representation is also intractable in general.
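To make the two operations concrete in the easy low-dimensional case, the sketch below stacks two half-space representations and then enumerates the vertices of the resulting 2-D polytope by brute force (intersect every pair of boundary lines and keep the feasible points). This is a simple illustration assumed here, not the algorithm of [7] or the hand-designed rules used in our experiments, and it does not scale to high dimensions.

```python
import itertools
import numpy as np

def intersect_hrep(A1, b1, A2, b2):
    """Intersection of {A1 z <= b1} and {A2 z <= b2}: stack the inequalities."""
    return np.vstack([A1, A2]), np.concatenate([b1, b2])

def vertices_2d(A, b, tol=1e-9):
    """Brute-force vertex enumeration for a bounded 2-D polytope A z <= b."""
    verts = []
    for i, j in itertools.combinations(range(len(b)), 2):
        M = A[[i, j]]
        if abs(np.linalg.det(M)) < tol:
            continue                          # parallel boundary lines
        z = np.linalg.solve(M, b[[i, j]])     # intersection of the two lines
        feasible = np.all(A @ z <= b + tol)
        if feasible and not any(np.allclose(z, v) for v in verts):
            verts.append(z)
    return np.array(verts)

# Box |z_i| <= 1 intersected with the half-plane z_1 + z_2 <= 1
A_box = np.vstack([np.eye(2), -np.eye(2)])
b_box = np.ones(4)
A, b = intersect_hrep(A_box, b_box, np.array([[1.0, 1.0]]), np.array([1.0]))
V = vertices_2d(A, b)
```

The stacking step is trivial, while the enumeration already costs one linear solve per pair of constraints plus a feasibility check, which hints at why the general problem becomes expensive.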

Luckily, in many applications, we are not intersecting two generic polytopes at each step. Rather, there are only two "basic" sets, $\mathcal{X}$ and $\mathcal{U}$, and we are intersecting a linear transformation of these. It turns out that for many systems (see Section 5), we can find the resulting vertices by hand-designed rules. In addition, there are heuristics that work well for low-dimensional systems [7]. Applying the proposed VN technique to high-dimensional systems is the main future direction for this work.

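In the case that the set of actions that keep the next state safe and the actuator constraint set $\mathcal{U}$ do not overlap, one can choose to stop the training process. However, in our rules for finding vertices, we instead pick the point in $\mathcal{U}$ that is closest to the set of safe actions to be the vertex. By design, the output of the VN is then an action within the set $\mathcal{U}$ that moves the system to the state closest to the safe state set $\mathcal{X}$.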
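For a scalar input this fallback rule is a few lines of code. The sketch below assumes interval-shaped sets and uses hypothetical names and numbers; it returns the two endpoints of the intersection when the sets overlap and a repeated single vertex otherwise.

```python
def action_vertices_1d(s_lo, s_hi, u_lo, u_hi):
    """Vertices of the admissible action set for a scalar input.

    [s_lo, s_hi] is the interval of actions that keeps the next state safe,
    [u_lo, u_hi] is the actuator range.
    """
    lo, hi = max(s_lo, u_lo), min(s_hi, u_hi)
    if lo <= hi:
        return [lo, hi]                 # the sets overlap: use the intersection
    # No overlap: fall back to the actuator limit closest to the safe interval
    closest = u_lo if s_hi < u_lo else u_hi
    return [closest, closest]

overlap = action_vertices_1d(-3.0, 0.5, -2.0, 2.0)
fallback = action_vertices_1d(2.5, 4.0, -2.0, 2.0)
```

Returning the fallback point twice keeps the number of vertices fixed, so the same network output dimension can be used in both cases.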

4.3 Safe layer

Once we obtain the safe action polytope, the next step is to encode its geometry into the policy network such that the generated action stays inside it. According to Proposition 4.1, it suffices for the policy network to generate the weights (or coefficients) of that convex combination.

Suppose that the polytope can have at most $N$ vertices, labeled $P_1, \dots, P_N$. In the policy network architecture design, we add an intermediate safe layer that first generates $N$ nodes $z_1, \dots, z_N$. The values of these nodes, however, are not necessarily positive, nor do they sum to $1$. Therefore, a softmax unit is included as the activation function in order to guarantee the non-negativity and the summation constraints. In particular, we define

$$\lambda_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \dots, N,$$

the weights of a convex combination. The final output layer (action $u_t$) is defined as the product of these normalized weights and the corresponding vertex values,

$$u_t = \sum_{i=1}^{N} \lambda_i P_i.$$

An illustration diagram is provided in Fig. 3.

Figure 3: Illustration of the proposed safe layer architecture. The output of the policy network is modified to predict the weights $z_1, \dots, z_N$. These weights are normalized to $\lambda_1, \dots, \lambda_N$, which satisfy $\lambda_i \ge 0$ and $\sum_{i=1}^{N} \lambda_i = 1$, via the softmax activation function. The action output is calculated as $u_t = \sum_{i=1}^{N} \lambda_i P_i$, where $P_1, \dots, P_N$ are the safety polytope vertices.
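The safe layer itself is small enough to sketch in a few lines of NumPy. The version below covers only the forward pass (no training), and the triangle-shaped action set, the random pre-activations, and the name `safe_layer` are assumptions made for illustration.

```python
import numpy as np

def safe_layer(logits, vertices):
    """Map unconstrained logits (N,) and vertices (N, m) to an action (m,)."""
    z = logits - np.max(logits)               # shift for numerical stability
    weights = np.exp(z) / np.sum(np.exp(z))   # softmax: nonnegative, sums to 1
    return weights @ vertices, weights        # convex combination of vertices

# Hypothetical safe action set: a triangle in R^2
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rng = np.random.default_rng(0)
logits = 10.0 * rng.standard_normal(3)        # arbitrary pre-activation values
action, weights = safe_layer(logits, vertices)
```

Whatever values the preceding layers produce, the returned action lies in the triangle. Because softmax weights are strictly positive for finite inputs, a vertex itself is only approached in the limit of large logits. If the polytope has fewer than the maximum number of vertices at some step, one option (an assumption here, not a detail stated above) is to pad the vertex list by repeating a vertex.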

5 Simulation

In this section, we present and analyze the performance of the proposed VN. We first describe the baseline algorithms and then demonstrate the performance comparisons in three benchmark control tasks: (i) inverted pendulum, (ii) mass-spring and (iii) hovercraft tracking.

5.1 Baseline Algorithm and Architecture Design

As mentioned earlier, the baseline algorithm for the policy update is DDPG [16], which is a state-of-the-art continuous control RL algorithm. To add safety constraints to the vanilla DDPG algorithm, a natural approach is to artificially shape the reward such that the agent will learn to avoid undesired areas. This can be done by setting a negative reward to the unsafe states and actions. In standard policy network (PN) baselines to which we compare, we include such a soft penalty in the rewards. We train such PN along with VN for comparison. The main difference between PN and VN is that PN only has the feed-forward network (white block in Fig. 3) and does not contain the final safe layer (blue block). The output of PN is truncated to ensure the actuator safety constraints.
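For reference, the PN baseline's two ingredients (a soft penalty in the reward and truncation of the action) can be sketched as follows. The penalty magnitude, the box-shaped safe set, and the function names are assumptions made for this illustration; the actual values used in the experiments are not restated here.

```python
import numpy as np

def shaped_reward(reward, x_next, x_lo, x_hi, penalty=100.0):
    """Task reward minus a fixed penalty when the next state leaves the safe box."""
    unsafe = np.any(x_next < x_lo) or np.any(x_next > x_hi)
    return reward - penalty if unsafe else reward

def truncate_action(u, u_lo, u_hi):
    """Clip the raw policy output to the actuator limits."""
    return np.clip(u, u_lo, u_hi)

r_safe = shaped_reward(-1.0, np.array([0.2, 0.1]), -1.0, 1.0)
r_unsafe = shaped_reward(-1.0, np.array([1.3, 0.1]), -1.0, 1.0)
u = truncate_action(np.array([3.0]), -2.0, 2.0)
```

Note that truncation enforces only the actuator limits; the state constraint is merely discouraged through the penalty, which is why the PN can still visit unsafe states during and after training.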

We use the following hyperparameters for all experiments. For PN, we use a three-layer feed-forward neural network, with nodes in each hidden layer. The VN has two feed-forward layers (with nodes in each hidden layer) and a final safe layer as described in Section 4.3.

5.2 Pendulum

For the inverted pendulum simulation, we use the OpenAI Gym environment (pendulum-v0), with the following pendulum specifications: mass , length . The system state is two-dimensional, consisting of the angle and angular velocity of the pendulum, and the control variable is the applied torque . We set the safe region to be (radians) and torque limits . The reward function is defined as , with the goal of learning an optimal feedback controller.

With a discretization step size of , the following are the discretized system dynamics:


To keep the next state in the safe region , we can compute the corresponding upper and lower bounds of to represent the set by (7). Therefore, the vertices of VN can be found by intersecting and . In the case where and have no overlap, we pick as the vertices if the upper bound of is less than . Otherwise, we pick as the vertices.

Figure 4: Comparison of accumulated reward and constraint violation (max angle) for the pendulum problem using PN and VN.

For comparison, the output of PN is constrained in using activation function in the final layer. The initial state of each episode is randomly sampled in the safe state region . In Fig. 4, we show a comparison of the accumulated reward and the max angle of each episode during the training of PN and VN. We observe that VN maintains safety throughout the training process, and as a result, it achieves higher reward in the early stage. It is also interesting to observe that the PN also becomes "safe" after training, since the reward function itself drives to be small. This suggests that if we can train the PN offline, it might be able to obey the safety constraint for some control systems. However, the next example shows that even a well-trained policy may violate constraints if these hard constraints are not explicitly taken into account.

5.3 Mass-Spring

Now we consider the task of damping an oscillating mass-spring system to the equilibrium point [6]. The system includes a mass and a spring, and the state is two-dimensional, consisting of the position and speed of the mass. The control variable is the force exerted on the mass . We set the safe region to be . We define the reward function to be . The initial state is randomly sampled from .

The system dynamics are defined as follows,

Figure 5: Comparison of accumulated reward and constraint violation (max speed) for Mass-Spring problem using PN and VN.

Fig. 5 compares the accumulated reward and the max speed of each episode during the training of PN and VN. VN maintains safety during training and receives higher reward in the early training stage. Note that even the trained PN can still violate the constraints.

5.4 Hovercraft

Figure 6: (Left) Hovercraft example. and denote starboard and port fan forces. are the tilt angle and the coordinate position. (Right) Illustration of computing vertices of intersection of polytopes and .

Consider the task of controlling a hovercraft illustrated in Fig. 6 (left), which has system dynamics defined as follows:

Let . Since the forces exerted by the two fans are coupled, the actuator constraint set is defined as . To keep the tilt angle of the hovercraft in a safety region, we set the safe state region to be . Define the reward function to learn a controller that tracks the target position . Fig. 6 (right) shows how to use at most five vertices to represent the intersection of the safe state region and the actuator constraint region.

Figure 7: Comparison of accumulated reward and constraint violation (max tilt angle) for hovercraft control using PN and VN with different constraint upper bounds.

Figure 8: Trajectories generated by trained PN and VN (with different tilt angle upper limits) policies. (Left) tilt angle and (Right) square of distance to the target position.

In our experiment, the initial state is set at position and the target position is . To better investigate the effect of the constraint, we train VNs for tilt angle upper bounds of and radians. Fig. 7 compares the accumulated reward and max tilt angle of each episode during the training of PN and VN. Fig. 8 visualizes the trajectories of the trained PN and VN policies. In the trajectories, we observe that the tilt angle of the hovercraft first turns positive to gain momentum toward the right, and then reverses to slow down. When , the hovercraft has a strict constraint on its tilt angle and is unable to reach the target position. For both choices of the tilt angle upper limit , the constraint is never violated along the whole trajectory when executing the learned VN. However, running the learned PN still reaches large tilt angles, even though the soft penalty is added to the reward.

6 Conclusions

Motivated by the problem of training an RL algorithm with hard state and action constraints, leveraging the geometric property that a convex polytope can be equivalently represented as the convex hull of a finite set of vertices, we design a novel policy network architecture called VN that guarantees the output satisfies the safety constraints by design. Empirically, we show that VN yields significantly better safety performance than a vanilla policy network architecture with a constraint violation penalty in several benchmark control systems. An important future direction is to extend the proposed method to high-dimensional control systems.


References

  • [1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 22–31.
  • [2] E. Altman (1999) Constrained Markov decision processes. Vol. 7, CRC Press.
  • [3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
  • [4] B. Amos and J. Z. Kolter (2017) OptNet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 136–145.
  • [5] S. Bhatnagar and K. Lakshmanan (2012) An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications 153 (3), pp. 688–708.
  • [6] F. Blanchini and S. Miani (2008) Set-theoretic methods in control. Springer.
  • [7] V. Broman and M. Shensa (1990) A compact algorithm for the intersection and approximation of n-dimensional polytopes. Mathematics and Computers in Simulation 32 (5-6), pp. 469–480.
  • [8] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395.
  • [9] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone (2017) Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18 (1), pp. 6070–6120.
  • [10] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa (2018) Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757.
  • [11] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338.
  • [12] J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
  • [13] C. Gokcek, P. T. Kabamba, and S. M. Meerkov (2001) An LQR/LQG theory for systems with saturating actuators. IEEE Transactions on Automatic Control 46 (10), pp. 1529–1542.
  • [14] B. Grünbaum (2013) Convex polytopes. Vol. 221, Springer Science & Business Media.
  • [15] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
  • [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
  • [18] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017 (19), pp. 70–76.
  • [19] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
  • [20] Y. Sui, A. Gotovos, J. W. Burdick, and A. Krause (2015) Safe exploration for optimization with Gaussian processes. Proceedings of Machine Learning Research 37, pp. 997–1005.
  • [21] H. R. Tiwary (2008) On the hardness of computing intersection, union and Minkowski sum of polytopes. Discrete & Computational Geometry 40 (3), pp. 469–479.
  • [22] A. Wachi, Y. Sui, Y. Yue, and M. Ono (2018) Safe exploration and optimization of constrained MDPs using Gaussian processes. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [23] M. Yu, Z. Yang, M. Kolar, and Z. Wang (2019) Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3121–3133.