Reinforcement learning (RL) is promising for solving nonlinear optimal control problems and has received significant attention in the past decades; see [grondman2012survey, kim2021dynamic] and the references therein. Recently, significant progress has been made on RL with the actor-critic structure for continuous control tasks [grondman2012survey, haarnoja2018soft, liu2013policy, zhang2009neural, jiang2012computational, lim2020prediction, wang2021intelligent]. In actor-critic RL, the value function and control policy are represented by the critic and actor networks, respectively, and learned via extensive policy exploration and exploitation. However, the resulting learning-based control system might not guarantee safety for systems with state and stability constraints. Besides optimality, safety constraint satisfaction is known to be crucial in many real-world robot control applications [garcia2015comprehensive, chow2019lyapunov]. For instance, autonomous driving is viewed as a promising technology that will bring fundamental changes to everyday life, yet one of its crucial issues is how to learn to drive safely in dynamic and unknown environments with unexpected obstacles. For these practical reasons, many safe RL algorithms have recently been developed for safety-critical systems; see, e.g., [chow2018lyapunovbased, chow2019lyapunov, berkenkamp2018safe, garcia2015comprehensive, srinivasan2020learning, yu2019convergent, xu2021crpo, ma2021model, simao2021alwayssafe, amani2021safe, yang2021accelerating, huh2020safe, zheng2021safe, marvi2021safe, chen2021context, li2021safe, brunke2021safe, richards2018lyapunov] and the references therein. Note that there is a rich body of work on adaptive control with constraints, but the techniques used differ from those in RL; for related references in adaptive control, see [sun2016neural, kong2019adaptive].
In general, current safe RL solutions can be categorized into three main approaches. (i) The first family uses a dedicated mechanism in the learning procedure for safe policy optimization, e.g., control barrier functions [yang2019safety, ma2021model, marvi2021safe], formal verification [fulton2018safe, turchetta2020safe], shielding [alshiekh2018safe, zanon2020safe, thananjeyan2021recovery], and external intervention [saunders2017trial, wagener2021safe]. These methods are prone to safety-biased learning that greatly sacrifices performance, and some of them rely on extra human intervention [saunders2017trial, wagener2021safe]. (ii) The second family proposes safe RL algorithms via primal-dual methods [chen2016stochastic, paternain2019safe, xu2021crpo, ding2021provably]. In the resulting optimization problem, the Lagrangian multiplier serves as an extra weight, and the control performance is reported to be sensitive to its update [xu2021crpo]. Moreover, some optimization problems, such as those with ellipsoidal constraints (covered in this work), may not satisfy the strong duality condition [zhang2021robust]. (iii) The third family comprises reward/cost shaping-based RL approaches [geibel2005risk, balakrishnan2019structured, hu2020learning, tessler2018reward], where the cost functions are augmented with safety-related terms, e.g., barrier functions. As stated in [paternain2019safe], such a design only conveys the goal of guaranteeing safety by minimizing the reshaped cost function but does not guide how to achieve it through the actor-critic structure design. Consequently, the weights of the actor and critic networks are prone to divergence during training due to the abrupt changes in the cost function caused by the safety-related terms when approaching the constraint boundary. These issues motivate our barrier function-based (abbreviated as barrier-based) actor-critic structure.
Moreover, few works have addressed the safe RL algorithm design under time-varying safety constraints.
This work proposes a model-based safe RL algorithm with theoretical guarantees for optimal control with time-varying state and control constraints. A new barrier-based control policy (BCP) structure is constructed in the proposed safe RL approach, generating repulsive control forces as states and controls move toward the constraint boundaries. Moreover, the time-varying constraints are addressed by a multi-step policy evaluation (MPE). The closed-loop theoretical properties of our approach in the nominal and perturbed cases and the convergence condition of the barrier-based actor-critic (BAC) learning algorithm are derived. The effectiveness of our approach is tested both in simulation and on real-world intelligent vehicles. Our contributions are summarized as follows.
We propose a safe RL approach for optimal control under time-varying state and control constraints. Under certain conditions (see Sections LABEL:sec:32-A and -B), safety can be guaranteed in both online and offline learning scenarios. The performance and advantages of the proposed approach are achieved by two novel designs. The first is a barrier-based control policy that ensures safety within an actor-critic structure. The second is a multi-step evaluation mechanism that predicts the control policy’s future influence on the value function under time-varying safety constraints and guides the policy to update safely.
We prove that the proposed safe RL approach guarantees stability in the nominal scenario and robustness under external disturbances. The convergence condition of the BAC learning algorithm is derived by the Lyapunov method.
The proposed approach is applied to an integrated path following and collision avoidance problem of intelligent vehicles, so that the control performance can be optimized with theoretical guarantees even under external disturbances. (i) Extensive simulation results illustrate that our approach outperforms other state-of-the-art safe RL methods in learning safety and performance. (ii) We verify our control policy’s offline sim-to-real transfer capability and real-world online learning performance. The experimental results reveal that our approach outperforms a standard model predictive control (MPC) algorithm in terms of safety and optimality, and shows an impressive sim-to-real transfer capability and a satisfactory online control performance.
The remainder of the paper is organized as follows. Section II introduces the considered control problem and preliminary solutions. Section III presents the proposed safe RL approach and the BAC learning algorithm, while Section IV presents the main theoretical results. Section V shows the real-world experimental results, while some conclusions are drawn in Section VI. Some additional simulation results for theoretical verification are given in the appendix.
Notation: We denote and as the set of natural numbers and integers . For a vector, we denote as and as the Euclidean norm. For a function with an argument , we denote as the gradient to . For a function with arguments and , we denote as the partial gradient to , or . Given a matrix , we use ( ) to denote the minimal (maximal) eigenvalues. We denote as the interior of a general set . For variables , , we define , where .
II Control problem formulation and definitions
II-A Control problem
The considered system under control is a class of discrete-time nonlinear systems described by
where and are the state and input variables, is the discrete-time index, and are convex sets that represent time-varying constraints, , , is a bounded compact set; functions for , are assumed to be ; is the state transition function and .
Starting from any initial condition , the control objective is to find an optimal control policy that minimizes a quadratic regulation cost function of type
subject to model (1), , and , ; where
and , , , is a discount factor.
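As a rough illustration of the discounted quadratic regulation cost described above, the following Python sketch evaluates such a cost along a finite rollout. The stage cost x^T Q x + u^T R u, the names `Q`, `R`, and `gamma`, and the truncation to a finite horizon are assumptions made for illustration; they are not taken from the paper's (missing) formula.

```python
import numpy as np

def discounted_quadratic_cost(xs, us, Q, R, gamma):
    """Finite-horizon surrogate of a discounted quadratic regulation cost.

    Assumed form (hypothetical, for illustration only):
        J = sum_k gamma^k * (x_k^T Q x_k + u_k^T R u_k)
    xs, us: sequences of state and input vectors of equal length.
    """
    cost = 0.0
    for k, (x, u) in enumerate(zip(xs, us)):
        stage = x @ Q @ x + u @ R @ u   # quadratic stage cost
        cost += gamma**k * stage        # discount the k-th stage
    return float(cost)

# Usage on a two-step rollout with a 2-D state and a scalar input.
xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
us = [np.array([1.0]), np.array([0.0])]
J = discounted_quadratic_cost(xs, us, np.eye(2), 2.0 * np.eye(1), gamma=0.5)
```

A discount factor below one down-weights later stages, so the truncated sum approximates the infinite-horizon cost whenever the stage costs remain bounded.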
Definition 1 (Local stabilizability [zhang2021robust])
System (1) with is stabilizable on if, for any , there exists a state-feedback policy , , such that and as .
Without loss of generality, many waypoint tracking problems in the robot control field can be naturally formulated as the prescribed regulation problem, with a proper coordinate transformation of the reference waypoints. More generally, the time-varying state constraint is allowed not to contain the origin for some . Typical examples include path following of mobile robots with collision avoidance, where the obstacle to be avoided might occupy the reference waypoints, i.e., the origin after the coordinate transformation. It is still reasonable to introduce the following assumption to guarantee convergence.
Assumption 1 (State constraint)
There exists a finite number such that as .
Assumption 2 (Lipschitz continuity)
Model (1) is Lipschitz continuous in , for all , i.e., there exists a Lipschitz constant such that for all and control policies with ,
Assumption 3 (Model)
in the domain , where .
Definition 2 (multi-step safe control)
To simplify the notation, in the rest of the paper the superscript in is omitted, i.e., we use to denote .
II-B Definitions on barrier functions
We also introduce the definition of barrier functions.
Definition 3 (Barrier function)
For a convex set , a barrier function is defined as
The recentered transformation of centered at is defined as , where if or is selected such that otherwise.
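To make the barrier function and its recentered transformation more concrete, the following Python sketch uses a logarithmic barrier on a polytope {z : A z <= b} and a gradient-recentered form B(z) - B(z_c) - grad B(z_c)^T (z - z_c), which vanishes together with its gradient at the centre z_c. Both the polytopic set and this particular recentering formula are assumptions chosen for illustration; the paper's exact definitions may differ.

```python
import numpy as np

def log_barrier(z, A, b):
    """Logarithmic barrier of {z : A z <= b}; +inf outside the interior."""
    slack = b - A @ z
    if np.any(slack <= 0):
        return np.inf
    return float(-np.sum(np.log(slack)))

def log_barrier_grad(z, A, b):
    """Gradient of the logarithmic barrier at an interior point z."""
    slack = b - A @ z
    return A.T @ (1.0 / slack)

def recentered_barrier(z, z_c, A, b):
    """Gradient-recentered barrier (assumed form): zero value and zero
    gradient at the centre z_c, nonnegative elsewhere by convexity."""
    linear = float(log_barrier_grad(z_c, A, b) @ (z - z_c))
    return log_barrier(z, A, b) - log_barrier(z_c, A, b) - linear

# Usage: the interval -1 <= z <= 3 contains the origin, so z_c = 0 is chosen.
A = np.array([[1.0], [-1.0]])
b = np.array([3.0, 1.0])
z_c = np.array([0.0])
value_at_centre = recentered_barrier(z_c, z_c, A, b)
value_at_two = recentered_barrier(np.array([2.0]), z_c, A, b)
```

In this sketch the recentered barrier grows as the point moves from the centre toward either bound, which is the qualitative behaviour a barrier-based policy relies on to generate repulsive terms near the constraint boundary.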
To derive a satisfactory control performance, it is suggested to select far from the set boundary of and as the central point or its neighbor of (if possible). It is observed that satisfies
Lemma 1 (Relaxed barrier function [wills2004barrier])
Define a relaxed barrier function of as
where is a relaxing factor, , the function is strictly monotone and differentiable on , and , then there exists a matrix such that .
Proof: See [wills2004barrier] for details.
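The idea of relaxing a barrier function can be sketched in Python as follows. The example assumes a scalar logarithmic barrier -log(s) on the constraint slack s, replaced for s <= kappa by a quadratic extension that matches its value and slope at s = kappa, so the relaxed function stays finite and differentiable for all s. This quadratic extension is one commonly used choice and is an assumption here; the function in Lemma 1 may take a different form.

```python
import math

def relaxed_log_barrier(s, kappa):
    """Relaxed logarithmic barrier of the slack s (assumed form).

    For s > kappa this is the exact barrier -log(s). For s <= kappa it is
    a quadratic that is strictly decreasing there and matches the value
    and the first derivative of -log(s) at s = kappa.
    """
    if kappa <= 0.0:
        raise ValueError("kappa must be positive")
    if s > kappa:
        return -math.log(s)
    return 0.5 * (((s - 2.0 * kappa) / kappa) ** 2 - 1.0) - math.log(kappa)

# Usage: the relaxed barrier stays finite even for a violated constraint (s < 0).
inside = relaxed_log_barrier(1.0, kappa=0.1)
at_switch = relaxed_log_barrier(0.1, kappa=0.1)
violated = relaxed_log_barrier(-1.0, kappa=0.1)
```

A smaller relaxing factor keeps the relaxed function closer to the exact barrier but makes the quadratic branch steeper, which is the usual trade-off when tuning it.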