1 Introduction
The area of reinforcement learning (RL) has achieved tremendous success in various applications, including video games (Mnih et al., 2015; Lee et al., 2018). In these applications, the RL agent is free to explore the entire state-action space to improve its performance via trial and error. In safety-critical scenarios, however, the agent cannot explore certain regions. For example, a self-driving vehicle must stay on the road and avoid collisions with other vehicles and pedestrians. Similarly, an industrial robot must not endanger the safety of nearby workers, and a medical robot must not endanger the safety of a patient. As a result, in contrast to unconstrained exploration, the RL agent should satisfy certain safety constraints while exploring the environment.
The constrained exploration setting can be represented by the constrained Markov decision process (CMDP) (Altman, 1999). While a CMDP can be cast as a linear program in the tabular setting (Altman, 1999), this formulation is generally not applicable to large-scale, continuous domains. Instead, two classes of optimization techniques are applied to solve CMDPs. The first is the primal-dual method, which solves a minimax problem by alternating between primal policy variables and dual variables (Chow et al., 2017). This approach, however, is limited because solving a minimax problem is difficult due to the nonconvexity of nonlinear function approximators (e.g., deep neural networks). The other approach treats the CMDP directly as a nonconvex optimization problem via successive convexification of the objective and constraints (Achiam et al., 2017; Yu et al., 2019). The convexification can be linear or quadratic if a trust-region term is added. However, the convexification methods have several drawbacks: 1) it is unclear how the constraint is driven toward feasibility; 2) the convexified subproblem can often become infeasible, which requires a heuristic procedure to recover from infeasibility; and 3) each iteration requires solving a convex program with a linear/quadratic objective and quadratic constraints, which can be inefficient.
In this paper, we introduce a new framework to address the aforementioned limitations in solving CMDPs. Specifically, we propose to treat the constraints as Lyapunov functions, driving the constraint violation to decrease monotonically, and to impose new constraints on the updating dynamics of the policy parameters. These new constraints, which are linear inequalities and guaranteed to be feasible, ensure that the constraint violation converges to zero if the initialization is infeasible, and that the trajectory stays inside the feasible set if the agent starts there. The feasible set is therefore forward invariant. However, with the new constraints imposed on the updating dynamics of the policy parameters, it is nontrivial to design updating rules that optimize the objective while satisfying the constraints simultaneously. Methods like projected gradient descent are not applicable here because the constraints are no longer on the primal variables. Instead, we propose to learn a meta-optimizer parameterized by a long short-term memory (LSTM) network, where constraint satisfaction is guaranteed by projecting the meta-optimizer output onto the linear inequality constraints. While generic projection onto a polytope formed by multiple linear inequalities cannot be solved in closed form, we design a proper metric for the projection such that it can be solved analytically.
Contribution. Our contributions are as follows: 1) we propose to learn a meta-optimizer that solves safe RL formulated as a CMDP with guaranteed feasibility, without iteratively solving a constrained optimization problem; and 2) the resulting updating dynamics of the variables imply forward-invariance of the safety set.
2 Related Work
Cheng et al. (2019) proposed an end-to-end trainable safe RL method that compensates the control input from model-free RL via a model-based control barrier function (Ames et al., 2016). Since this approach requires a dynamical model and solves an optimization problem online, one may ask why the problem is not solved directly with approaches such as model predictive control. To avoid solving an optimization problem to guarantee safety, Zheng et al. (2020) presented the vertex network, which formulates a polytopic safety set as a convex combination of its vertices. However, finding the vertices of a polytope defined by linear constraints is nontrivial. For tabular settings, Chow et al. (2018) construct Lyapunov functions to guarantee global safety during training via a set of local linear constraints. More safe RL approaches are surveyed in García and Fernández (2015).
3 Preliminary
3.1 Markov decision process
The Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$, where $\mathcal{S}$ is the set of agent states in the environment, $\mathcal{A}$ is the set of agent actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition function, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes the reward function, $\gamma \in [0,1)$ is the discount factor, and $\mu$ is the initial state distribution. A policy $\pi_\theta$ is a mapping from the state space to a probability distribution over actions; $\pi_\theta(a \mid s)$ denotes the probability of taking action $a$ in state $s$ under the policy parameterized by $\theta$. The objective is to maximize the cumulative reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \qquad (1)$$
where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory. To optimize the policy so as to maximize Eqn. (1), the policy gradient with respect to $\theta$ can be computed as (Sutton et al., 2000): $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\big]$, with $R_t = \sum_{t'=t}^{\infty} \gamma^{t'} r(s_{t'}, a_{t'})$ (Sutton and Barto, 2018).
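As a concrete sketch of the score-function estimator above, the following minimal example (a hypothetical single-state, two-action toy problem, not the paper's setting) estimates the policy gradient of a softmax policy by Monte Carlo sampling and ascends it:

```python
import numpy as np

# REINFORCE-style sketch on a toy single-state MDP with two actions and
# rewards r = [1.0, 0.0]. The estimator averages grad log pi(a) * R over
# sampled actions; the policy should concentrate on the rewarding action.
rng = np.random.default_rng(0)
theta = np.zeros(2)                      # logits of a softmax policy
rewards = np.array([1.0, 0.0])

def pi(theta):
    e = np.exp(theta - theta.max())      # numerically stable softmax
    return e / e.sum()

for _ in range(500):
    p = pi(theta)
    a = rng.choice(2, p=p)               # sample an action from pi_theta
    grad_logp = -p
    grad_logp[a] += 1.0                  # grad of log softmax at the sampled action
    theta += 0.1 * grad_logp * rewards[a]  # ascend the estimated gradient

assert pi(theta)[0] > 0.9                # mass concentrates on the rewarding action
```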
3.2 Constrained Markov decision process
The constrained Markov decision process (CMDP, Altman (1999)) is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, c_{1,\ldots,m}, \gamma, \mu)$, where $c_i: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the $i$-th cost function and the remaining variables are identical to those in the MDP definition (see Section 3.1). While the discount factor for the cost can be different from that for the reward, we use the same one here for notational simplicity. The goal in CMDP is to maximize the cumulative reward while satisfying the constraints on the cumulative cost:
$$\max_\theta \; J(\theta) \quad \text{s.t.} \quad J_{c_i}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t c_i(s_t, a_t)\Big] \le d_i, \quad i = 1, \ldots, m, \qquad (2)$$
where $\{c_i\}_{i=1}^m$ is the set of constraint functions and $d_i$ is the maximum acceptable violation of $J_{c_i}(\theta)$. In later context, $J_{c_i}$ and $J$ are used as shorthand for $J_{c_i}(\theta)$ and $J(\theta)$, respectively, if necessary.
4 Approach
4.1 Set-invariant constraints on updating dynamics
The key to solving Eqn. (2) is how to deal with the constraints. Different from existing work in the literature, we aim to build a mechanism that drives the constraint violation to converge to zero asymptotically if the initialization is infeasible; otherwise, the trajectory stays inside the feasible set. To accomplish this goal, we build a Lyapunov-like condition on the updating dynamics $\dot{\theta} = f(\theta)$:
$$\nabla_\theta h_i(\theta)^\top f(\theta) \le -\alpha\big(h_i(\theta)\big), \quad h_i(\theta) := J_{c_i}(\theta) - d_i, \qquad (3)$$
where $f(\theta)$ is the updating dynamics of $\theta$ and $\alpha(\cdot)$ is an extended class-$\mathcal{K}$ function. A special case of the class-$\mathcal{K}$ function is a scalar linear function with positive slope. With discretization, the updating rule becomes
$$\theta_{k+1} = \theta_k + \eta f(\theta_k), \qquad (4)$$
where $\eta$ is the learning rate. Note that with sufficiently small $\eta$, the continuous dynamics can be approximated to a given accuracy. Lemma 1 characterizes how Eqn. (3) makes the safety set forward invariant. For notational simplicity, the statement is for one constraint, with $m = 1$ and $h := h_1$. This simplification does not lose any generality, since the joint forward-invariance of multiple sets naturally leads to the forward-invariance of their intersection.
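A minimal numerical sketch of the discretized condition, under the assumption $\alpha(h) = \beta h$ and a toy scalar constraint (not the paper's CMDP cost): choosing $f$ along $-\nabla h$ with a magnitude that satisfies Eqn. (3) with equality drives an infeasible start toward the boundary $h = 0$.

```python
import numpy as np

# Toy illustration of Eqns. (3)-(4) with alpha(h) = beta * h and the
# illustrative constraint h(theta) = theta^2 - 1 <= 0 (an assumption for
# this sketch). The chosen f satisfies grad_h . f = -beta * h exactly,
# so the violation decays geometrically under the Euler update.
beta, eta = 1.0, 0.01
theta = np.array([2.0])                  # infeasible start: h = 3 > 0

def h(theta):
    return theta[0] ** 2 - 1.0

def grad_h(theta):
    return np.array([2.0 * theta[0]])

hist = []
for _ in range(2000):
    g = grad_h(theta)
    f = -beta * h(theta) * g / (g @ g)   # enforces grad_h . f = -beta * h
    theta = theta + eta * f              # Euler step, Eqn. (4)
    hist.append(h(theta))

assert abs(hist[-1]) < 1e-3              # violation driven to (near) zero
assert all(hist[i + 1] <= hist[i] for i in range(len(hist) - 1))  # monotone decrease
```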
Lemma 1
For a continuously differentiable function $h$, the set $\mathcal{C} = \{\theta : h(\theta) \le 0\}$ is forward invariant under dynamics $f$ satisfying Eqn. (3) defined on $\mathcal{D}$, a superset of $\mathcal{C}$, i.e., $\mathcal{C} \subseteq \mathcal{D}$.
Proof: Define $\partial\mathcal{C}$ as the boundary of $\mathcal{C}$. As a result, for $\theta \in \partial\mathcal{C}$, $h(\theta) = 0$ and thus $\dot{h}(\theta) \le -\alpha(0) = 0$. Then, according to Nagumo's theorem (Blanchini and Miani, 2008; Blanchini, 1999), the set $\mathcal{C}$ is forward invariant.
Here we give some intuition behind Eqn. (3). Suppose the condition holds with equality for a linear class-$\mathcal{K}$ function, so that by the chain rule $\dot{h} = \nabla_\theta h(\theta)^\top f(\theta) = -\beta h$ with $\beta > 0$. The solution to this differential equation is $h(t) = h(0)e^{-\beta t}$. With $h(0) > 0$, the initialization is infeasible, and $h$ will converge to $0$ (i.e., the boundary of $\mathcal{C}$) asymptotically. The argument is similar for a feasible initialization (i.e., $h(0) \le 0$). Note that Eqn. (3) is stated for deterministic constraint functions; if the cumulative cost is stochastic, the above results hold in expectation.
It is worth noting that with $m < n$, i.e., when the number of constraints is smaller than the number of policy parameters, Eqn. (3) is guaranteed to be feasible. This saves the trouble of recovering from infeasibility in a heuristic manner, which is usually the case for existing methods (Achiam et al., 2017; Yu et al., 2019).
4.2 Learning a meta-optimizer
So far, we have converted the constraint on $\theta$ in Eqn. (2) into the constraint on $f(\theta)$ in Eqn. (3), which defines the new set
$$\mathcal{C}_f(\theta) = \big\{f : \nabla_\theta h_i(\theta)^\top f \le -\alpha\big(h_i(\theta)\big), \; i = 1, \ldots, m\big\}, \qquad (5)$$
with $f(\theta) \in \mathcal{C}_f(\theta)$. However, it is unclear how to design an optimization algorithm that maximizes the objective in Eqn. (2) while satisfying Eqn. (3). Note that typical constrained optimization algorithms, such as projected gradient descent (PGD), are no longer applicable, since the constraints are not on the primal variables anymore. Following PGD, we could update $\theta$ as
$$\theta_{k+1} = \theta_k + \eta \, \mathrm{Proj}_{\mathcal{C}_f}\big(\nabla_\theta J(\theta_k)\big), \qquad (6)$$
where $\mathrm{Proj}$ is the projection operator. However, this can be problematic, as it is unclear whether the projected gradient is still an appropriate ascent direction. Consequently, standard optimization algorithms such as SGD or Adam combined with Eqn. (6) can fail to optimize the objective while satisfying the constraints, and thus we propose to learn an optimizer by meta-learning.
Following the work of Andrychowicz et al. (2016), which learns a meta-optimizer for unconstrained optimization problems, we extend the approach to constrained optimization. The meta-optimizer is parameterized by a long short-term memory (LSTM) network $\mathrm{LSTM}_\phi$ with parameters $\phi$. Similar to Andrychowicz et al. (2016), the updating rule is as follows:
$$\big[g_k, \, l_{k+1}\big] = \mathrm{LSTM}_\phi\big(\nabla_\theta J(\theta_k), \, l_k\big), \quad f_k = \mathrm{Proj}_{\mathcal{C}_f}(g_k), \quad \theta_{k+1} = \theta_k + \eta f_k, \qquad (7)$$
where $l_k$ is the hidden state of the LSTM. The loss used to train the optimizer parameters $\phi$ is defined as
$$\mathcal{L}(\phi) = -\mathbb{E}\Big[\sum_{t=1}^{T} w_t J(\theta_t)\Big], \qquad (8)$$
where $T$ is the span of the LSTM sequence and $w_t$ is a weight coefficient. The main difference between Eqn. (7) and the rule in Andrychowicz et al. (2016) is the projection step in the second part of Eqn. (7). The end-to-end training serves to minimize the loss, while constraint satisfaction is guaranteed by the projection.
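The structure of one projected meta-optimizer step in Eqn. (7) can be sketched as follows. Here a fixed linear map with a tanh nonlinearity stands in for the trained LSTM, and the objective and constraint are illustrative assumptions; the point is that whatever direction $g$ the learned optimizer proposes, the projection enforces the set-invariance condition on the update.

```python
import numpy as np

# Sketch of Eqn. (7) with a stand-in "meta-optimizer" (a fixed random linear
# map + tanh; the weights W, the toy objective, and the single constraint
# h(theta) = theta[0] - 1 <= 0 are all illustrative assumptions). Projecting
# the raw proposal g onto the halfspace {f : grad_h . f <= -beta * h}
# guarantees feasibility regardless of what the optimizer outputs.
rng = np.random.default_rng(1)
n, beta, eta = 3, 1.0, 0.05
W = rng.normal(size=(n, n))              # stand-in for trained LSTM weights
theta = np.array([2.0, -1.0, 0.5])       # infeasible start: h = 1 > 0

def grad_J(theta):                       # toy objective J = -||theta||^2 / 2
    return -theta

def h(theta):
    return theta[0] - 1.0

grad_h = np.array([1.0, 0.0, 0.0])

for _ in range(200):
    g = np.tanh(W @ grad_J(theta))       # raw proposal from the "optimizer"
    slack = grad_h @ g + beta * h(theta)
    if slack > 0:                        # project onto the single halfspace
        g = g - slack * grad_h / (grad_h @ grad_h)
    theta = theta + eta * g              # projected update, Eqn. (7)

assert h(theta) <= 1e-3                  # constraint driven to feasibility
```

Note that the guarantee comes entirely from the projection: even an untrained (here, random) optimizer cannot violate the condition on the update dynamics.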
However, even though $\mathcal{C}_f$ is a polytope defined by linear inequalities, projection onto $\mathcal{C}_f$ is still nontrivial and generally requires an iterative solver, as in Achiam et al. (2017), except when there is only one inequality constraint (i.e., $m = 1$). Dalal et al. (2018) proposed two alternatives: one is to identify the single active constraint so as to reduce the problem to the single-constraint case, and the other is to treat the constraints as a penalty. The former is cumbersome and possibly inefficient, while the latter sacrifices strict satisfaction of the constraints.
Consequently, we propose to solve the projection onto the polytope formulated by multiple linear inequalities in closed form. Let us first take a look on the generic projection problem onto a polytope in the following
(9) 
where , is of full row rank and is positive definite. Then the dual problem of Eqn. (9) is
(10) 
The dual problem (Eqn. (10)) in general cannot be solved analytically as is positive definite but not diagonal. Though
is usually set as the identity matrix, it is not necessary other than that
should be positive definite. As a result, we design such that is diagonal by solving(11) 
with . As a result, we obtain . Then Eqn. (10) can be solved in closed form as
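The closed-form projection can be verified numerically. The particular construction of $\Gamma^{-1}$ below, $\Gamma^{-1} = A^{+} D (A^{+})^{\top} + (I - A^{+}A)$ with $A^{+} = A^\top (AA^\top)^{-1}$, is one concrete choice satisfying Eqn. (11) used here for illustration, not necessarily the paper's exact construction:

```python
import numpy as np

# Closed-form projection onto {x : A x <= b} under a metric Gamma chosen so
# that A Gamma^{-1} A^T = D is diagonal (Eqns. (9)-(11)). With D diagonal the
# dual (10) separates per constraint: lambda_i = max(0, (A y - b)_i / D_ii),
# and the primal solution is x = y - Gamma^{-1} A^T lambda.
rng = np.random.default_rng(0)
m, n = 2, 4
A = rng.normal(size=(m, n))                  # full row rank (almost surely)
b = rng.normal(size=m)
y = rng.normal(size=n) * 3.0                 # point to project
D = np.diag(np.array([1.0, 2.0]))            # chosen diagonal matrix

Apinv = A.T @ np.linalg.inv(A @ A.T)         # right pseudo-inverse of A
Gamma_inv = Apinv @ D @ Apinv.T + (np.eye(n) - Apinv @ A)

assert np.allclose(A @ Gamma_inv @ A.T, D)   # metric diagonalizes the dual

lam = np.maximum(0.0, (A @ y - b) / np.diag(D))  # closed-form dual solution
x = y - Gamma_inv @ A.T @ lam                    # recovered primal projection

assert np.all(A @ x <= b + 1e-8)                 # primal feasibility
assert np.allclose(lam * (A @ x - b), 0.0, atol=1e-8)  # complementary slackness
```

The second term of $\Gamma^{-1}$ acts as the identity on the null space of $A$, so $\Gamma^{-1}$ stays positive definite while the range-space part is shaped to satisfy Eqn. (11).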
An illustration of the meta-optimizer is shown in Figure 1.
5 Experiments
5.1 Quadratically constrained quadratic programming
We first apply the learned meta-optimizer to quadratically constrained quadratic programming (QCQP). Specifically, the objective and constraints in this domain take the form
$$\min_x \; \tfrac{1}{2}x^\top Q_0 x + p_0^\top x \quad \text{s.t.} \quad \tfrac{1}{2}x^\top Q_i x + p_i^\top x \le b_i, \quad i = 1, \ldots, m, \qquad (12)$$
where $x \in \mathbb{R}^n$ and the $Q_i$ are positive definite. In this deterministic setting, the constraint violation is driven to satisfaction asymptotically, as shown in Figure 3. Three unconstrained baselines, Adam, RMSProp, and SGD, are also included to show the scale of the objective. The constraint violation converges to zero asymptotically, as discussed above, and our objective is comparable even to that of the unconstrained solvers.
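The qualitative behavior described above can be reproduced on a tiny instance. The specific objective and constraint below are illustrative assumptions (not the paper's exact QCQP), and a plain projected gradient flow stands in for the learned optimizer: the violation decays to zero while the objective decreases.

```python
import numpy as np

# Toy QCQP in the spirit of Section 5.1 (illustrative instance):
#   min_x 0.5 * ||x - x_star||^2   s.t.   h(x) = 0.5 * x^T Q x - 1 <= 0.
# Each step projects the descent direction onto the halfspace
# {f : grad_h . f <= -beta * h(x)} before taking an Euler step, so the
# infeasible start is driven to the constraint boundary asymptotically.
Q = np.diag([2.0, 0.5])
x_star = np.array([3.0, 3.0])            # unconstrained optimum lies outside the set
x0 = np.array([4.0, 0.0])                # infeasible start: h(x0) = 15 > 0
x = x0.copy()
beta, eta = 1.0, 0.01

def hq(x):
    return 0.5 * x @ Q @ x - 1.0

for _ in range(5000):
    f = -(x - x_star)                    # steepest descent on the objective
    g = Q @ x                            # gradient of the quadratic constraint
    slack = g @ f + beta * hq(x)
    if slack > 0:
        f = f - slack * g / (g @ g)      # enforce the set-invariance condition
    x = x + eta * f

assert hq(x) <= 1e-5                     # constraint satisfied at convergence
assert np.linalg.norm(x - x_star) < np.linalg.norm(x0 - x_star)  # objective improved
```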
5.2 Reinforcement learning domain
We build a domain in which a point-mass agent navigates in 2D space to reach a goal position (Achiam et al., 2017) (see Figure 4). The reward encourages progress toward the goal, and the cost is 1 if the agent leaves the square region or enters the circular obstacle, and 0 otherwise. The average performance of the policy trained by the meta-optimizer is shown in Figure 2. Our algorithm drives the constraint function directly to its limit while maximizing the cumulative reward.
6 Conclusion
In this paper, we proposed to learn a meta-optimizer that solves safe RL formulated as a CMDP with guaranteed feasibility, without iteratively solving a constrained optimization problem. Moreover, the resulting updating dynamics of the variables imply forward-invariance of the safety set. Future work will focus on applying the proposed algorithm to more challenging RL domains and more general RL algorithms such as actor-critic, and on extending it to multi-agent RL domains with nonstationarity.
Acknowledgements
Dong-Ki Kim was supported by IBM (as part of the MIT-IBM Watson AI Lab initiative) and a Kwanjeong Educational Foundation Fellowship. We thank Amazon Web Services for the computational support.
References
Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 22–31.
Altman, E. (1999). Constrained Markov decision processes. Vol. 7, CRC Press.
Ames, A. D., Xu, X., Grizzle, J. W., and Tabuada, P. (2016). Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control 62(8), pp. 3861–3876.
Andrychowicz, M., Denil, M., Gómez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989.
Blanchini, F. (1999). Set invariance in control. Automatica 35(11), pp. 1747–1767.
Blanchini, F. and Miani, S. (2008). Set-theoretic methods in control. Springer.
Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. (2017). Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning, pp. 748–756.
Cheng, R., Orosz, G., Murray, R. M., and Burdick, J. W. (2019). End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395.
Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1), pp. 6070–6120.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. (2018). A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8092–8101.
Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. (2018). Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757.
García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), pp. 1437–1480.
Langley, P. (2000). Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, CA, pp. 1207–1216.
Lee, D., Tang, H., Zhang, J. O., Xu, H., Darrell, T., and Abbeel, P. (2018). Modular architecture for StarCraft II with deep reinforcement learning. In Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference.
Li, Z., Zhou, F., Chen, F., and Li, H. (2017). Meta-SGD: learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: an introduction. MIT Press.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
Yu, M., Yang, Z., Kolar, M., and Wang, Z. (2019). Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3121–3133.
Zheng, L., Shi, Y., Ratliff, L. J., and Zhang, B. (2020). Safe reinforcement learning of control-affine systems with vertex networks. arXiv preprint arXiv:2003.09488.