I Introduction
Reinforcement learning (RL) is promising for solving nonlinear optimal control problems and has received significant attention in the past decades; see [grondman2012survey, kim2021dynamic] and the references therein. In recent years, significant progress has been made on RL with the actor-critic structure for continuous control tasks [grondman2012survey, haarnoja2018soft, liu2013policy, zhang2009neural, jiang2012computational, lim2020prediction, wang2021intelligent]. In actor-critic RL, the value function and control policy are represented by the critic and actor networks, respectively, and learned via extensive policy exploration and exploitation. However, the resulting learning-based control system might not guarantee safety for systems with state and stability constraints. It is known that, besides optimality, safety constraint satisfaction is crucial in many real-world robot control applications [garcia2015comprehensive, chow2019lyapunov]. For instance, autonomous driving is viewed as a promising technology that will bring fundamental changes to everyday life, yet one of the crucial issues is how to learn to drive safely in dynamic and unknown environments with unexpected obstacles. For these practical reasons, many safe RL algorithms have recently been developed for safety-critical systems; see, e.g., [chow2018lyapunovbased, chow2019lyapunov, berkenkamp2018safe, garcia2015comprehensive, srinivasan2020learning, yu2019convergent, xu2021crpo, ma2021model, simao2021alwayssafe, amani2021safe, yang2021accelerating, huh2020safe, zheng2021safe, marvi2021safe, chen2021context, li2021safe, brunke2021safe, richards2018lyapunov] and the references therein. Note that there is also a rich body of work on adaptive control with constraints, but the techniques used there differ from those in RL; for related references in adaptive control, see [sun2016neural, kong2019adaptive].
In general, current safe RL solutions can be categorized into three main approaches. (i) The first family utilizes a dedicated mechanism in the learning procedure for safe policy optimization using, e.g., control barrier functions [yang2019safety, ma2021model, marvi2021safe], formal verification [fulton2018safe, turchetta2020safe], shielding [alshiekh2018safe, zanon2020safe, thananjeyan2021recovery], and external intervention [saunders2017trial, wagener2021safe]. These methods tend to bias the learning toward safety at a significant cost to performance, and some of them rely on extra human interference [saunders2017trial, wagener2021safe]. (ii) The second family proposes safe RL algorithms via primal-dual methods [chen2016stochastic, paternain2019safe, xu2021crpo, ding2021provably]. In the resulting optimization problem, the Lagrangian multiplier serves as an extra weight whose update is known to be sensitive to the control performance [xu2021crpo]. Moreover, some optimization problems, such as those with ellipsoidal constraints (covered in this work), may not satisfy the strong duality condition [zhang2021robust]. (iii) The third family comprises reward/cost shaping-based RL approaches [geibel2005risk, balakrishnan2019structured, hu2020learning, tessler2018reward], where the cost functions are augmented with various safety-related terms, e.g., barrier functions. As stated in [paternain2019safe], such a design only encodes the goal of guaranteeing safety by minimizing the reshaped cost function but fails to guide how to achieve it through an actor-critic structure design. Consequently, the weights of the actor and critic networks are prone to divergence during training due to the abrupt changes of the cost function caused by the safety-related terms as the state approaches the constraint boundary. These issues motivated our barrier function-based (simplified as barrier-based) actor-critic structure. Moreover, few works have addressed safe RL algorithm design under time-varying safety constraints.
This work proposes a model-based safe RL algorithm with theoretical guarantees for optimal control with time-varying state and control constraints. A new barrier-based control policy (BCP) structure is constructed in the proposed safe RL approach, generating repulsive control forces as states and controls move toward the constraint boundaries. Moreover, the time-varying constraints are addressed by a multi-step policy evaluation (MPE). The closed-loop theoretical properties of our approach in the nominal and perturbed cases and the convergence condition of the barrier-based actor-critic (BAC) learning algorithm are derived. The effectiveness of our approach is tested in both simulations and on real-world intelligent vehicles. Our contributions are summarized as follows.

We propose a safe RL method for optimal control under time-varying state and control constraints. Under certain conditions (see Sections LABEL:sec:32A and B), safety can be guaranteed in both online and offline learning scenarios. The performance and advantages of the proposed approach are achieved by two novel designs. The first is a barrier-based control policy that ensures safety within an actor-critic structure. The second is a multi-step evaluation mechanism that predicts the control policy's future influence on the value function under time-varying safety constraints and guides the policy to update safely.
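As a minimal illustrative sketch (not the paper's exact BCP parameterization), a barrier-based control policy can superimpose a repulsive term, derived from the gradient of a logarithmic barrier, on a nominal stabilizing feedback; the gains, bounds, and scalar dynamics below are hypothetical choices for illustration only:

```python
import numpy as np

def log_barrier_grad(x, lo, hi):
    """Gradient of -log(x - lo) - log(hi - x); grows unbounded near the bounds."""
    return -1.0 / (x - lo) + 1.0 / (hi - x)

def barrier_policy(x, k_nominal=1.0, k_barrier=0.05, lo=-1.0, hi=1.0):
    """Hypothetical barrier-based control policy (illustration only):
    nominal stabilizing feedback plus a repulsive barrier term that
    pushes the state away from the constraint boundaries."""
    u_nominal = -k_nominal * x
    u_repulse = -k_barrier * log_barrier_grad(x, lo, hi)
    return u_nominal + u_repulse
```

Near the center of the constraint set the barrier term is negligible, so the nominal feedback dominates; close to a boundary the repulsive term grows without bound, which mirrors the "repulsive control forces" described above.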

We prove that the proposed safe RL method guarantees stability in the nominal scenario and robustness under external disturbances. The convergence condition of the BAC learning algorithm is derived by the Lyapunov method.

The proposed approach is applied to solve an integrated path-following and collision-avoidance problem of intelligent vehicles, so that the control performance can be optimized with theoretical guarantees even under external disturbances. (i) Extensive simulation results illustrate that our approach outperforms other state-of-the-art safe RL methods in learning safety and performance. (ii) We verify our control policy's offline sim-to-real transfer capability and real-world online learning performance. The experimental results reveal that our approach outperforms a standard model predictive control (MPC) algorithm in terms of safety and optimality, and shows an impressive sim-to-real transfer capability and satisfactory online control performance.
The remainder of the paper is organized as follows. Section II introduces the considered control problem and preliminary solutions. Section III presents the proposed safe RL approach and the BAC learning algorithm, while Section IV presents the main theoretical results. Section V shows the real-world experimental results, while some conclusions are drawn in Section VI. Some additional simulation results for theoretical verification are given in the appendix.
Notation: We denote $\mathbb{N}$ and $\mathbb{Z}$ as the sets of natural numbers and integers, respectively. For a vector $x \in \mathbb{R}^n$, we denote $x^\top$ as its transpose and $\|x\|$ as the Euclidean norm. For a function $V(x)$ with an argument $x$, we denote $\nabla V(x)$ as the gradient with respect to $x$. For a function $V(x,u)$ with arguments $x$ and $u$, we denote $\nabla_x V(x,u)$ as the partial gradient with respect to $x$, likewise for $u$. Given a matrix $A$, we use $\lambda_{\min}(A)$ ($\lambda_{\max}(A)$) to denote the minimal (maximal) eigenvalue. We denote $\operatorname{int}(\mathbb{S})$ as the interior of a general set $\mathbb{S}$. For variables $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, we define $(x,u) := [x^\top\ u^\top]^\top$.

II Control problem formulation and definitions
II-A Control problem
The considered system under control is a class of discrete-time nonlinear systems described by

$$x_{k+1} = f(x_k, u_k), \tag{1}$$

where $x_k \in \mathbb{X}_k$ and $u_k \in \mathbb{U}_k$ are the state and input variables, $k \in \mathbb{N}$ is the discrete-time index, $\mathbb{X}_k \subseteq \mathbb{R}^n$ and $\mathbb{U}_k \subseteq \mathbb{R}^m$ are convex sets that represent time-varying constraints, and $\mathbb{U}_k$ is a bounded compact set; $f$ is the state transition function, assumed to be continuously differentiable, with $f(0,0) = 0$.
Starting from any initial condition $x_0 \in \mathbb{X}_0$, the control objective is to find an optimal control policy $u_k = \pi(x_k)$ that minimizes a quadratic regulation cost function of the type

$$J(x_0) = \sum_{k=0}^{\infty} \gamma^k \left( x_k^\top Q x_k + u_k^\top R u_k \right), \tag{2}$$

subject to model (1), $x_k \in \mathbb{X}_k$, and $u_k \in \mathbb{U}_k$, $k \in \mathbb{N}$; where $Q \succeq 0$ and $R \succ 0$ are weighting matrices, and $\gamma \in (0,1]$ is a discount factor.
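For concreteness, the discounted quadratic regulation cost in (2) can be evaluated numerically for a simple closed-loop system; the linear dynamics, feedback gain, and weights below are illustrative assumptions of mine, not taken from the paper:

```python
import numpy as np

def discounted_cost(x0, A, B, K, Q, R, gamma=0.95, horizon=200):
    """Approximate J = sum_k gamma^k (x_k' Q x_k + u_k' R u_k)
    under linear feedback u_k = -K x_k, truncated at a finite horizon."""
    x, J = np.array(x0, dtype=float), 0.0
    for k in range(horizon):
        u = -K @ x
        J += gamma**k * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return J

# Scalar example: x_{k+1} = 1.1 x_k + u_k with u_k = -0.6 x_k,
# so the closed loop is x_{k+1} = 0.5 x_k and the cost is a geometric series.
A, B = np.array([[1.1]]), np.array([[1.0]])
K, Q, R = np.array([[0.6]]), np.array([[1.0]]), np.array([[1.0]])
J = discounted_cost([1.0], A, B, K, Q, R)
```

For this example the cost reduces to $\sum_k (1 + 0.6^2)\,(0.95 \cdot 0.25)^k$, so the truncated sum can be checked against the closed-form geometric series.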
Definition 1 (Local stabilizability [zhang2021robust])
System (1) with $u_k = \pi(x_k)$ is stabilizable on $\mathbb{X}_0$ if, for any $x_0 \in \mathbb{X}_0$, there exists a state-feedback policy $\pi(x_k) \in \mathbb{U}_k$, $k \in \mathbb{N}$, such that $x_k \in \mathbb{X}_k$ and $x_k \to 0$ as $k \to \infty$.
Note that many waypoint-tracking problems in robot control can be naturally formulated as the prescribed regulation problem via a proper coordinate transformation of the reference waypoints. More generally, the time-varying state constraint set is allowed not to contain the origin at some time steps. Typical examples can be found in, for instance, path following of mobile robots with collision avoidance, where a potential obstacle to be avoided might occupy the reference waypoints, i.e., the origin after the coordinate transformation. It is still reasonable to introduce the following assumption to guarantee convergence.
Assumption 1 (State constraint)
There exists a finite number $\bar{k} \in \mathbb{N}$ such that $0 \in \operatorname{int}(\mathbb{X}_k)$ for all $k \ge \bar{k}$.
Assumption 2 (Lipschitz continuity)
Model (1) is Lipschitz continuous in $x$ for all $k \in \mathbb{N}$, i.e., there exists a Lipschitz constant $L_f > 0$ such that, for all $x, x' \in \mathbb{X}_k$ and control policies $\pi$ with $\pi(x) \in \mathbb{U}_k$,

$$\|f(x, \pi(x)) - f(x', \pi(x'))\| \le L_f \|x - x'\|. \tag{3}$$
Assumption 3 (Model)
in the domain , where .
Definition 2 (Multi-step safe control)
To simplify the notation, in the rest of the paper the superscript is omitted when it is clear from the context.
II-B Definitions of barrier functions
We also introduce the definition of barrier functions.
Definition 3 (Barrier function)
For a convex set $\mathbb{Z}$, a barrier function is defined as a convex function $B(z)$ satisfying

$$B(z) < \infty \ \ \forall z \in \operatorname{int}(\mathbb{Z}), \qquad B(z) \to \infty \ \text{as} \ z \to \partial \mathbb{Z}. \tag{4}$$
The recentered transformation of $B(z)$ centered at $z_0$ is defined as $\bar{B}(z) = B(z) - B(z_0) - \nabla B(z_0)^\top (z - z_0)$, where $z_0 = 0$ if $0 \in \operatorname{int}(\mathbb{Z})$, or $z_0$ is selected such that $z_0 \in \operatorname{int}(\mathbb{Z})$ otherwise.
To derive a satisfactory control performance, it is suggested to select $z_0$ far from the boundary of $\mathbb{Z}$, ideally as the central point of $\mathbb{Z}$ or in its neighborhood (if possible). It is observed that $\bar{B}(z)$ satisfies

$$\bar{B}(z) \ge 0 \ \ \forall z \in \operatorname{int}(\mathbb{Z}), \qquad \bar{B}(z_0) = 0, \qquad \nabla \bar{B}(z_0) = 0. \tag{5}$$
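Assuming a standard logarithmic barrier for an interval constraint (my illustrative choice, not necessarily the paper's exact barrier), the recentered transformation and the properties in (5) can be verified numerically:

```python
import numpy as np

def barrier(z, lo=-2.0, hi=2.0):
    """Log barrier for the interval (lo, hi); tends to +inf at the boundary."""
    return -np.log(z - lo) - np.log(hi - z)

def barrier_grad(z, lo=-2.0, hi=2.0):
    """Analytic gradient of the log barrier above."""
    return -1.0 / (z - lo) + 1.0 / (hi - z)

def recentered(z, z0=0.5, lo=-2.0, hi=2.0):
    """Recentered barrier: zero value and zero gradient at z0, nonnegative on the set."""
    return barrier(z, lo, hi) - barrier(z0, lo, hi) - barrier_grad(z0, lo, hi) * (z - z0)
```

Since the log barrier is convex, subtracting its first-order Taylor expansion at $z_0$ yields a nonnegative function that vanishes, together with its gradient, at $z_0$.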
Lemma 1 (Relaxed barrier function [wills2004barrier])
Define a relaxed barrier function of $\bar{B}(z)$ as

$$\hat{B}(z) = \begin{cases} \bar{B}(z), & z \in \mathbb{Z}_\delta, \\ \beta(z), & \text{otherwise}, \end{cases} \tag{6}$$

where $\delta > 0$ is a relaxing factor, $\mathbb{Z}_\delta \subseteq \mathbb{Z}$, and the function $\beta(z)$ is strictly monotone and differentiable outside $\mathbb{Z}_\delta$ with $\hat{B}(z) \ge 0$; then there exists a matrix $M \succ 0$ such that $\hat{B}(z) \ge (z - z_0)^\top M (z - z_0)$.
Proof: For details please see [wills2004barrier].
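A common concrete relaxation, in the spirit of [wills2004barrier] (the quadratic extension below is a standard textbook choice, not necessarily the lemma's exact function $\beta$), replaces the logarithm beyond a relaxing factor $\delta$ with a quadratic that matches value and slope at the switch, so the function stays finite even outside the constraint set:

```python
import numpy as np

def relaxed_log(eta, delta=0.1):
    """-log(eta) for eta > delta; C^1 quadratic extension for eta <= delta.
    Finite everywhere, matching value and slope of -log at eta = delta."""
    if eta > delta:
        return -np.log(eta)
    # Quadratic extension: value -log(delta), slope -1/delta at eta = delta.
    return -np.log(delta) - (eta - delta) / delta + (eta - delta)**2 / (2 * delta**2)

def relaxed_barrier(z, lo=-1.0, hi=1.0, delta=0.1):
    """Relaxed log barrier for the interval [lo, hi]: finite even for z outside it."""
    return relaxed_log(z - lo, delta) + relaxed_log(hi - z, delta)
```

Because the exact barrier diverges at the boundary, the relaxation keeps gradients finite during learning while still penalizing constraint violations heavily; the penalty grows quadratically outside the set.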