1 Introduction
Over the last couple of years, reinforcement learning (RL) algorithms have yielded impressive results on a variety of applications. These successes include playing video games with superhuman performance [17], robot locomotion and manipulation [16, 15], autonomous vehicles [18], and many benchmark continuous control tasks [11].
In RL, an agent learns to make sequential decisions by interacting with the environment, gradually improving its performance at the task as learning progresses. Policy optimization algorithms [16, 19] for RL assume that agents are free to explore any behavior during learning, so long as it leads to performance improvement. However, in many real-world applications, there are often additional safety constraints, or specifications that lead to constraints, on the learning problem. For instance, a robot arm should avoid behaviors that could cause it to damage itself or the objects around it, and autonomous vehicles must avoid crashing into others while navigating [12].
In real-world applications such as the above, constraints are an integral part of the problem description, and maintaining constraint satisfaction during learning is critical (i.e., these are hard constraints). Therefore, in this work, our goal is to maintain constraint satisfaction at each step throughout the whole learning process. This problem is sometimes called the safe exploration problem [3, 22]. In particular, we define safety as remaining within some prespecified polytope constraints on both states and actions. Correspondingly, the action we take at each step should result in a state that is in the safety set.
In the safe exploration literature, the projection technique is often leveraged to maintain safety during exploration [8, 10]. Specifically, at each step, an action is suggested by an unconstrained policy optimization algorithm and is then projected into the safety region. However, this projection step either involves solving a computationally expensive optimization problem online [8], or relies on strict assumptions such as allowing only one of the halfspaces to be violated [10]. Moreover, if real-time optimization is allowed by the application, then it is often more advantageous to solve a model predictive control problem than to ask for a policy learned by RL.
To alleviate these limitations, we propose Vertex Networks (VNs), in which we encode the safety constraints into the policy via neural network architecture design. In VNs, we compute the vertices of the safety region at each time step and design the action to be a convex combination of those vertices, allowing policy optimization algorithms to explore only in the safe region.
The contributions of this work can be briefly summarized as follows: (1) To the best of our knowledge, this is the first attempt to encode safety constraints into policies by explicit network design. (2) In simulation, the proposed approach achieves good performance while maintaining constraint satisfaction.
2 Related work
In safe RL, safety in expectation is a widely used criterion [2, 23]. In recent literature, policy optimization algorithms have been proposed as a means to learn a policy for a continuous Markov decision process (MDP). Two state-of-the-art exemplars in terms of performance are the Lagrangian-based actor-critic algorithm [5, 9] and Constrained Policy Optimization (CPO) [1]. However, for these methods, constraint satisfaction can only be guaranteed in expectation. In a safety-critical environment, this is not sufficient since even if safety is guaranteed in expectation, there is still a nonzero probability that unsafe trajectories will be generated by the controller.
Safe exploration, on the other hand, requires constraint satisfaction at each step. Recent approaches [20, 22] model safety as an unknown function and propose algorithms that trade off between exploring the safety function and exploring the reward function. However, these approaches require solving MDPs or constrained optimization problems to obtain a policy in each exploration iteration, which is less efficient than the policy optimization algorithm leveraged in our approach.
The literature that uses policy optimization algorithms for safe exploration is closely related to our work. Among these works, a projection technique is often used to maintain safety. In [10], the safety set is defined in terms of halfspace constraints on the action. A policy optimization algorithm, in particular deep deterministic policy gradient (DDPG) [16], is leveraged to generate an action, which is then projected into the safe set. Under the assumption that only one halfspace constraint can be violated, the projection optimization problem can be solved in closed form. In [8], this projection step is solved as a quadratic program, based on confidence intervals for the approximation of the system dynamics, modeled as Gaussian processes.
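To make the closed-form case concrete, the following minimal sketch projects a proposed action onto a single halfspace. It uses the standard Euclidean projection formula and is not taken from [10]; the function name and the example numbers are illustrative assumptions.

```python
def project_onto_halfspace(u, a, b):
    """Euclidean projection of u onto the halfspace {v : a . v <= b}."""
    violation = sum(ai * ui for ai, ui in zip(a, u)) - b
    if violation <= 0.0:
        return list(u)  # already feasible, nothing to do
    norm_sq = sum(ai * ai for ai in a)
    step = violation / norm_sq  # distance to move along the normal a
    return [ui - step * ai for ai, ui in zip(a, u)]


# Hypothetical example: constraint u_1 + u_2 <= 1 and proposed action (1, 1).
projected = project_onto_halfspace([1.0, 1.0], [1.0, 1.0], 1.0)
print(projected)
```

With several simultaneously violated halfspaces this one-shot formula no longer gives the projection onto their intersection, which is why the general case requires an optimization solver.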
It is possible to integrate a projection step into the policy network. Indeed, using the methodology provided in [4], one can leverage a policy optimization algorithm to train the policy network in an end-to-end fashion. However, the feedforward computation of the policy network in this case is computationally expensive, as it involves solving an optimization problem before executing every single action. If an optimization problem must be solved online anyway, one might prefer to solve a model predictive control problem instead. Rather than integrating the projection step, we propose VNs, which leverage the convex combination of vertices to enforce safety.
3 Model Setup
Consider a discrete-time affine control system in which the system evolves according to
$$x_{t+1} = f(x_t) + g(x_t)\,u_t, \tag{1}$$
where $x_t \in \mathbb{R}^n$, $u_t \in \mathbb{R}^m$, and $f$ and $g$ are known functions of appropriate dimensions. Our goal is to minimize a cost over time horizon $T$, subject to safety constraints on $x_t$ and actuator constraints on $u_t$:
$$\min_{u_0,\dots,u_{T-1}} \; \sum_{t=0}^{T-1} C(x_t, u_t) \tag{2a}$$
$$\text{s.t.} \quad x_{t+1} = f(x_t) + g(x_t)\,u_t, \tag{2b}$$
$$x_t \in \mathcal{X}, \tag{2c}$$
$$u_t \in \mathcal{U}, \tag{2d}$$
where $\mathcal{X}$ and $\mathcal{U}$ are convex polytopes. A convex polytope can be defined as an intersection of linear inequalities (halfspace representation) or equivalently as a convex combination of a finite number of points (convex-hull representation) [14]. This type of constraint is widely used in theory and practice; for example, see [6] and the references within.
The goal of safe RL is to find an optimal feedback controller that minimizes the overall system cost (2a) while satisfying the safety constraints (2c) and the actuator constraints (2d). Solving (2) is a difficult task, even for linear systems with only the actuator constraints, except for a class of systems where analytic solutions can be found [13]. Therefore, RL (and its different variants) has been proposed to search for a feedback controller.
Numerous learning approaches have been adopted to solve the problem when the constraints (2c) and (2d) are not present. However, there are considerably fewer successful applications of RL to problems with hard constraints. One such approach is the two-stage method used in [10]. The first step is to simply train a policy that solves the problem in (2) without the constraints on the state or the action. To enforce the constraints, a projection step is then solved, in which the action determined by the unconstrained policy is projected into the constraint sets.
This two-step process is referred to as safe exploration in [10], since it leverages the fact that RL algorithms explore the action space while the projection satisfies the hard constraints. However, this approach has two drawbacks that we address in the current paper. Firstly, the projection step itself requires an optimization problem to be solved, which could be computationally expensive. More fundamentally, it raises the question of why one should not directly solve (2) as a model predictive control problem, since online optimization needs to be used either way. Secondly, decoupling the policy and projection steps may lead to solutions that significantly deviate from the original unconstrained policy. To overcome these challenges, we propose a novel vertex policy network that encodes the geometry of the constraints into the network architecture, and we train it in an end-to-end way. We discuss the proposed vertex policy network framework in detail in the next section.
4 Vertex Policy Network
The key idea of our proposed VN is to use a basic fact about the geometry of a convex polytope. Given a bounded convex polytope $\mathcal{P}$, it is always possible to find a finite number of vertices $P_1, \dots, P_N$ such that their convex hull is $\mathcal{P}$. In addition, there is no smaller set of points whose convex hull forms $\mathcal{P}$ [14]. The next proposition then follows directly.
Proposition 4.1.
Let $\mathcal{P}$ be a convex polytope with vertices $P_1, \dots, P_N$. For every point $y \in \mathcal{P}$, there exist $\lambda_1, \dots, \lambda_N$ such that
$$y = \sum_{i=1}^{N} \lambda_i P_i,$$
where $\lambda_i \ge 0$ for all $i$ and $\sum_{i=1}^{N} \lambda_i = 1$.
The preceding proposition implies that we can search for the set of weights $\lambda_i$ instead of directly finding a point inside the polytope.
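As a small worked illustration of the proposition (the unit square and the weights below are chosen here purely for illustration and do not come from the experiments), consider four vertices and one choice of nonnegative weights that sum to one:

```latex
\[
P_1 = (0,0),\quad P_2 = (1,0),\quad P_3 = (1,1),\quad P_4 = (0,1),
\qquad
\lambda = (0.1,\ 0.2,\ 0.3,\ 0.4),
\]
\[
y = \sum_{i=1}^{4} \lambda_i P_i
  = 0.1\,(0,0) + 0.2\,(1,0) + 0.3\,(1,1) + 0.4\,(0,1)
  = (0.5,\ 0.7).
\]
```

The resulting point lies inside the square, and the same holds for any other nonnegative weights summing to one, so a search over the weights never leaves the polytope.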
Proposition 4.1 can be applied to find a feedback control policy. Since both the state and actuator constraint sets are convex polytopes, the control action at each timestep must also live in a convex polytope. If its vertices are known, the output of a policy can be the weights of the convex combination. The benefit of having the weights as the output is threefold. Firstly, it is much easier to normalize a set of real numbers to be all positive and to sum to unity (the probability simplex) than to project into an arbitrary polytope. For this paper, we use a softmax layer. Secondly, this approach allows us to fully explore the interior of the feasible space, whereas projections could be biased towards the boundary of the set. Thirdly, we are able to use standard policy gradient training techniques.
In particular, we use DDPG as the policy evaluation and update algorithm, where the policy is a neural network parameterized by and updated by
(3) 
where is the expected return using the current policy and is defined by
(4) 
In practice, this expectation is approximated from a batch of sampled trajectories generated by running the current policy, each of a fixed trajectory length. The overall procedure of the proposed VN framework is provided in Fig. 1.
Below, we discuss the two major components of VN in detail: 1) the safety region and vertex calculation, and 2) the neural network architecture design for the safe layer.
4.1 Evolution of the Action Constraint Set
We require that at each time step the state $x_t$ of the system stays in the set $\mathcal{X}$, and that the control action $u_t$ is constrained to be in the set $\mathcal{U}$. As stated earlier, we assume $\mathcal{X}$ and $\mathcal{U}$ to be convex polytopes. The main algorithmic challenge comes from the need to repeatedly intersect translated versions of these polytopes. To be concrete, suppose we are given $x_t$. Then for the next step, we require that $x_{t+1} \in \mathcal{X}$. This translates into an affine constraint on $u_t$, since the control action must satisfy
$$f(x_t) + g(x_t)\,u_t \in \mathcal{X},$$
where $f$ and $g$ are the known functions of the dynamics (1).
Since $x_t$ is known, $f(x_t)$ is a constant in the above equation, and the constraint on $u_t$ is again polytopic. We denote this polytope as $\mathcal{S}_t$. The set $\mathcal{U}_t$ to which $u_t$ must belong is the intersection of $\mathcal{S}_t$ and the actuator constraints:
$$\mathcal{U}_t = \mathcal{S}_t \cap \mathcal{U}. \tag{5}$$
After identifying the vertices of $\mathcal{U}_t$, the algorithm in Fig. 1 can be used to find the optimal feedback policies.
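For a scalar action the feasible set in (5) is an interval, and its two endpoints are the vertices. The sketch below computes that interval under assumptions that are not part of the paper's setup: a hypothetical linear system with next state A x + B u, a box-shaped state safety set, and actuator limits [u_min, u_max]. The function name and all numbers are illustrative.

```python
def feasible_action_interval(A, B, x, x_low, x_high, u_min, u_max):
    """Interval of scalar actions u keeping A x + B u inside the box [x_low, x_high].

    Returns (low, high), the two vertices of the feasible set, or None if it is empty.
    """
    low, high = u_min, u_max  # start from the actuator constraints
    for row, b_i, lo_i, hi_i in zip(A, B, x_low, x_high):
        c_i = sum(a_ij * x_j for a_ij, x_j in zip(row, x))  # drift term, constant in u
        if b_i == 0.0:
            if not (lo_i <= c_i <= hi_i):
                return None  # this coordinate leaves the box whatever u is
            continue
        bounds = sorted(((lo_i - c_i) / b_i, (hi_i - c_i) / b_i))
        low, high = max(low, bounds[0]), min(high, bounds[1])
    return (low, high) if low <= high else None


# Hypothetical double integrator with unit time step, state box [-1, 1]^2, |u| <= 1.
A = [[1.0, 1.0], [0.0, 1.0]]
B = [0.0, 1.0]
interval = feasible_action_interval(A, B, [0.0, 0.5], [-1.0, -1.0], [1.0, 1.0], -1.0, 1.0)
print(interval)
```

Returning None signals an empty intersection, which the caller has to handle separately, for example by stopping the episode or falling back to a nearest-point rule.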
In general, it is fairly straightforward to find either the convex hull or the halfspace representation of the polytope induced by the state constraint, since it just requires a linear transformation of the state safety set. However, the intersection step in (5) and the process of finding the representation of its convex hull are nontrivial [6]. Below, we work through a simple example to illustrate the steps and then discuss how to overcome the computational challenges.
Example 1 (Intersection Step).
Consider the following two–dimensional linear system:
Suppose the action safety set is a convex polytope defined by: , and . The state safety set is a square defined by and the initial state is . By simple calculation, and is the box bounded by . Fig. 2 (left) visualizes the intersection operations.
Now suppose a feasible action is chosen and the system evolves. Then, . Performing the intersection of and , we get that is a rectangle defined by the vertices as depicted in Fig. 2 (right).
4.2 Intersection of Polytopes
It should be noted that finding the vertices of an intersection of polytopes is not easy [21]. If the polytopes are in halfspace representation, then their intersection can be found by simply stacking the inequalities. However, finding the vertices of the resulting polytope can be computationally expensive. Similarly, directly intersecting two polytopes based on their convex hull representation is also intractable in general.
Luckily, in many applications, we are not intersecting two generic polytopes at each step. Rather, there are only two "basic" sets, the state safety set and the actuator constraint set, and we are intersecting a linear transformation of these. It turns out that for many systems (see Section 5), we can find the resulting vertices by hand-designed rules. In addition, there are heuristics that work well for low-dimensional systems [7]. Applying the proposed VN technique to high-dimensional systems is the main future direction for this work.
In the case that the two sets being intersected in (5) do not overlap, one can choose to stop the training process. However, in our rules for finding vertices, we pick the point in the actuator constraint set that is closest to the state-induced constraint set to be the vertex. By design, the output of the VN is then an action within the actuator constraint set that transitions to the state closest to the safe state set.
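To illustrate why low-dimensional cases are manageable, the sketch below enumerates the vertices of a two-dimensional polytope given in halfspace form by brute force: it intersects every pair of boundary lines and keeps the points that satisfy all inequalities. This is a generic textbook procedure, not the hand-designed rules or the heuristic of [7]; the function name, tolerance, and example constraints are assumptions.

```python
from itertools import combinations


def polygon_vertices(halfspaces, tol=1e-9):
    """Vertices of the 2-D polytope {u : a1*u1 + a2*u2 <= b for each (a1, a2, b)}."""
    vertices = []
    for (a1, a2, b), (c1, c2, d) in combinations(halfspaces, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:
            continue  # parallel boundaries never meet in a single point
        # Cramer's rule for the intersection of the two boundary lines.
        point = ((b * c2 - a2 * d) / det, (a1 * d - b * c1) / det)
        inside = all(p1 * point[0] + p2 * point[1] <= q + tol
                     for p1, p2, q in halfspaces)
        duplicate = any(abs(point[0] - v[0]) < tol and abs(point[1] - v[1]) < tol
                        for v in vertices)
        if inside and not duplicate:
            vertices.append(point)
    return vertices


# Hypothetical example: the box |u1| <= 1, |u2| <= 1 cut by u1 + u2 <= 1.
box_and_cut = [(1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 1.0, 1.0),
               (0.0, -1.0, 1.0), (1.0, 1.0, 1.0)]
verts = polygon_vertices(box_and_cut)
print(len(verts))
```

The number of candidate pairs grows quadratically with the number of halfspaces and each candidate is checked against all of them, so this is only practical for small problems; an empty result indicates that the sets do not overlap.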
4.3 Safe layer
Once we obtain the feasible action set $\mathcal{U}_t$ in (5), the next step is to encode the geometry information into the policy network such that the generated action stays in $\mathcal{U}_t$. According to Proposition 4.1, it suffices for the policy network to generate the weights (or coefficients) of that convex combination.
Suppose that $\mathcal{U}_t$ can have at most $N$ vertices, labeled $P_1, \dots, P_N$. In the policy network architecture design, we add an intermediate safe layer that first generates $N$ nodes $z_1, \dots, z_N$. The values of these nodes, however, are not necessarily nonnegative, nor do they sum to $1$. Therefore, a softmax unit is included as the activation function in order to guarantee the nonnegativity and the summation constraints. In particular, we define $\lambda_i = e^{z_i} / \sum_{j=1}^{N} e^{z_j}$, the weights of a convex combination. The final output layer (action $u_t$) is defined as the multiplication of these normalized weights and the corresponding vertex values,
$$u_t = \sum_{i=1}^{N} \lambda_i P_i. \tag{6}$$
An illustration diagram is provided in Fig. 3.
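The forward pass of the safe layer described in this subsection can be sketched in a few lines. This is a framework-free illustration rather than the authors' implementation; the function name, the triangle of vertices, and the logits are assumptions, and in practice the logits would come from the preceding feedforward layers.

```python
import math


def safe_layer(logits, vertices):
    """Map unconstrained logits to an action inside the convex hull of `vertices`."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax: nonnegative and summing to one
    dim = len(vertices[0])
    action = [sum(w * v[k] for w, v in zip(weights, vertices)) for k in range(dim)]
    return weights, action


# Hypothetical feasible set: a triangle with three vertices.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights, action = safe_layer([0.0, 0.0, 0.0], triangle)
print(weights, action)
```

Because the action is a convex combination of the vertices for any logits, no projection is needed afterwards. If the feasible set has fewer vertices than the layer has nodes, one option (an assumption here, not something stated in the text) is to repeat a vertex so that the layer width stays fixed.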
5 Simulation
In this section, we present and analyze the performance of the proposed VN. We first describe the baseline algorithms and then demonstrate the performance comparisons in three benchmark control tasks: (i) inverted pendulum, (ii) mass-spring, and (iii) hovercraft tracking.
5.1 Baseline Algorithm and Architecture Design
As mentioned earlier, the baseline algorithm for the policy update is DDPG [16], which is a state-of-the-art continuous control RL algorithm. To add safety constraints to the vanilla DDPG algorithm, a natural approach is to artificially shape the reward such that the agent learns to avoid undesired areas. This can be done by assigning a negative reward to unsafe states and actions. In the standard policy network (PN) baselines to which we compare, we include such a soft penalty in the rewards. We train such a PN alongside the VN for comparison. The main difference between PN and VN is that PN only has the feedforward network (white block in Fig. 3) and does not contain the final safe layer (blue block). The output of PN is truncated to satisfy the actuator safety constraints.
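The two ingredients of the PN baseline, a soft penalty in the reward and truncation of the output, can be sketched as follows. The helper names, the box-shaped safe set, and the penalty value are hypothetical choices made for illustration; the text does not specify the penalty magnitude.

```python
def clip_action(action, low, high):
    """Truncate each action coordinate to the actuator limits [low, high]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


def penalized_reward(reward, state, state_low, state_high, penalty=10.0):
    """Subtract a fixed penalty whenever the state leaves the safe box."""
    unsafe = any(not (lo <= s <= hi)
                 for s, lo, hi in zip(state, state_low, state_high))
    return reward - penalty if unsafe else reward


# Hypothetical usage with a one-dimensional safe region [-1, 1].
print(clip_action([2.0, -0.5], [-1.0, -1.0], [1.0, 1.0]))
print(penalized_reward(-1.0, [1.5], [-1.0], [1.0]))
```

Truncation keeps the action within the actuator limits, but the penalty only discourages unsafe states after they have been visited, which is the behavior the VN is designed to avoid.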
We use the following hyperparameters for all experiments. For PN, we use a three-layer feedforward neural network, with nodes in each hidden layer. The VN has two feedforward layers (with nodes in each hidden layer) and a final safe layer as described in Section 4.3.
5.2 Pendulum
For the inverted pendulum simulation, we use the OpenAI Gym environment (Pendulum-v0), with the following pendulum specifications: mass , length . The system state is two-dimensional, comprising the angle and angular velocity of the pendulum, and the control variable is the applied torque . We set the safe region to be (radians) and torque limits . The reward function is defined as , with the goal of learning an optimal feedback controller.
With a discretization step size of , the following are the discretized system dynamics:
(7)  
To keep the next state in the safe region , we can compute the corresponding upper and lower bound of to represent set by (7). Therefore, the vertices of VN can be found by intersecting and . Under the case where and have no overlap, we pick as the vertices if the upper bound of is less than . Otherwise, we pick as the vertices.
For comparison, the output of PN is constrained to the torque limits by the activation function in the final layer. The initial state of each episode is randomly sampled in the safe state region. In Fig. 4, we show a comparison of the accumulated reward and the maximum angle of each episode during training of PN and VN. We observe that VN maintains safety throughout the training process and, as a result, achieves higher reward in the early stage. It is also interesting to observe that the PN also becomes "safe" after training, since the reward function itself drives the angle to be small. This suggests that if we can train the PN offline, it might be able to obey the safety constraint for some control systems. However, the next example shows that even a well-trained policy may violate constraints if these hard constraints are not explicitly taken into account.
5.3 Mass-Spring
Now we consider the task of damping an oscillating mass-spring system to its equilibrium point [6]. The system includes a mass and a spring, and the state is two-dimensional, with position and speed of the mass. The control variable is the force exerted on the mass . We set the safe region to be . We define the reward function to be . The initial states are randomly sampled from .
The system dynamics are defined as follows,
Fig. 5 compares the accumulated reward and the maximum speed of each episode during training of PN and VN. VN maintains safety during training and receives higher reward in the early training stage. Note that even the trained PN can still violate the constraints.
5.4 Hovercraft
Consider the task of controlling a hovercraft illustrated in Fig. 6 (left), which has system dynamics defined as follows:
Let . Because the forces exerted by the two fans are coupled, the actuator constraint set is defined as . To keep the tilt angle of the hovercraft in a safety region, we set the safe state region to be . Define the reward function to learn a controller that tracks the target position . Fig. 6 (right) shows how to use at most five vertices to represent the intersection of the safe state region and the actuator constraint region.
In our experiment, the initial state is set at position and the target position is . To better investigate the effect of the constraint, we train VNs for tilt angle upper bounds of and radians. Fig. 7 compares the accumulated reward and maximum tilt angle of each episode during training of PN and VN. Fig. 8 visualizes the trajectories of the trained PN and VN policies. In the trajectories, we observe that the angle of the hovercraft first turns positive to gain momentum toward the right, and then turns the other way to slow down. When , the hovercraft has a strict constraint on its tilt angle and is unable to reach the target position. For both choices of the tilt angle upper limit , the constraint is never violated over the whole trajectory when executing the learned VN. However, running the learned PN still reaches large tilt angles, even though the soft penalty is added to the reward.
6 Conclusions
Motivated by the problem of training an RL algorithm with hard state and action constraints, we leverage the geometric property that a convex polytope can be equivalently represented as the convex hull of a finite set of vertices, and we design a novel policy network architecture called VN that guarantees by design that the output satisfies the safety constraints. Empirically, we show that VN yields significantly better safety performance than a vanilla policy network architecture with a constraint violation penalty in several benchmark control systems. An important future direction is to extend the proposed method to high-dimensional control systems.
References

[1] (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 22–31.
[2] (1999) Constrained Markov decision processes. Vol. 7, CRC Press.
[3] (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
[4] (2017) OptNet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 136–145.
[5] (2012) An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications 153 (3), pp. 688–708.
[6] (2008) Set-theoretic methods in control. Springer.
[7] (1990) A compact algorithm for the intersection and approximation of n-dimensional polytopes. Mathematics and Computers in Simulation 32 (5–6), pp. 469–480.
[8] (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395.
[9] (2017) Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18 (1), pp. 6070–6120.
[10] (2018) Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757.
[11] (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338.
[12] (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
[13] (2001) An LQR/LQG theory for systems with saturating actuators. IEEE Transactions on Automatic Control 46 (10), pp. 1529–1542.
[14] (2013) Convex polytopes. Vol. 221, Springer Science & Business Media.
[15] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
[16] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[17] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[18] (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017 (19), pp. 70–76.
[19] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
[20] (2015) Safe exploration for optimization with Gaussian processes. Proceedings of Machine Learning Research 37, pp. 997–1005.
[21] (2008) On the hardness of computing intersection, union and Minkowski sum of polytopes. Discrete & Computational Geometry 40 (3), pp. 469–479.
[22] (2018) Safe exploration and optimization of constrained MDPs using Gaussian processes. In Thirty-Second AAAI Conference on Artificial Intelligence.
[23] (2019) Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3121–3133.