Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles

by Xinglong Zhang, et al.

Recently, barrier function-based safe reinforcement learning (RL) with the actor-critic structure for continuous control tasks has received increasing attention. It is still challenging to learn a near-optimal control policy with safety and convergence guarantees. Also, few works have addressed the safe RL algorithm design under time-varying safety constraints. This paper proposes a model-based safe RL algorithm for optimal control of nonlinear systems with time-varying state and control constraints. In the proposed approach, we construct a novel barrier-based control policy structure that can guarantee control safety. A multi-step policy evaluation mechanism is proposed to predict the policy's safety risk under time-varying safety constraints and guide the policy to update safely. Theoretical results on stability and robustness are proven. Also, the convergence of the actor-critic learning algorithm is analyzed. The proposed algorithm outperforms several state-of-the-art RL algorithms in the simulated Safety Gym environment. Furthermore, the approach is applied to the integrated path following and collision avoidance problem for two real-world intelligent vehicles. A differential-drive vehicle and an Ackermann-drive vehicle are used to verify the offline deployment performance and the online learning performance, respectively. Our approach shows an impressive sim-to-real transfer capability and a satisfactory online control performance in the experiment.




I Introduction

Reinforcement learning (RL) is promising for solving nonlinear optimal control problems and has received significant attention in the past decades; see [grondman2012survey, kim2021dynamic] and the references therein. Recently, significant progress has been made on RL with the actor-critic structure for continuous control tasks [grondman2012survey, haarnoja2018soft, liu2013policy, zhang2009neural, jiang2012computational, lim2020prediction, wang2021intelligent]. In actor-critic RL, the value function and control policy are represented by the critic and actor networks, respectively, and are learned via extensive policy exploration and exploitation. However, the resulting learning-based control system might not guarantee safety for systems with state and stability constraints. Besides optimality, safety constraint satisfaction is known to be crucial in many real-world robot control applications [garcia2015comprehensive, chow2019lyapunov]. For instance, autonomous driving has been viewed as a promising technology that will bring fundamental changes to everyday life. Still, one of the crucial issues is how to learn to drive safely in dynamic and unknown environments with unexpected obstacles. For these practical reasons, many safe RL algorithms have recently been developed for safety-critical systems; see, e.g., [chow2018lyapunovbased, chow2019lyapunov, berkenkamp2018safe, garcia2015comprehensive, srinivasan2020learning, yu2019convergent, xu2021crpo, ma2021model, simao2021alwayssafe, amani2021safe, yang2021accelerating, huh2020safe, zheng2021safe, marvi2021safe, chen2021context, li2021safe, brunke2021safe, richards2018lyapunov] and the references therein. Note that there is also a rich body of work on adaptive control with constraints, although the techniques used differ from those in RL; for related references in adaptive control, see [sun2016neural, kong2019adaptive].

In general, current safe RL solutions can be categorized into three main approaches. (i) The first family utilizes a dedicated mechanism in the learning procedure for safe policy optimization using, e.g., control barrier functions [yang2019safety, ma2021model, marvi2021safe], formal verification [fulton2018safe, turchetta2020safe], shielding [alshiekh2018safe, zanon2020safe, thananjeyan2021recovery], and external intervention [saunders2017trial, wagener2021safe]. These methods are prone to safety-biased learning that sacrifices considerable performance, and some of them rely on extra human intervention [saunders2017trial, wagener2021safe]. (ii) The second family proposes safe RL algorithms via primal-dual methods [chen2016stochastic, paternain2019safe, xu2021crpo, ding2021provably]. In the resulting optimization problem, the Lagrangian multiplier serves as an extra weight, and the control performance is reported to be sensitive to its update [xu2021crpo]. Moreover, some optimization problems, such as those with ellipsoidal constraints (covered in this work), might not satisfy the strong duality condition [zhang2021robust]. (iii) The third family consists of reward/cost shaping-based RL approaches [geibel2005risk, balakrishnan2019structured, hu2020learning, tessler2018reward], where the cost functions are augmented with various safety-related terms, e.g., barrier functions. As stated in [paternain2019safe], such a design only encodes the goal of guaranteeing safety by minimizing the reshaped cost function but fails to guide how to achieve it well through an actor-critic structure design. Consequently, the weights of the actor and critic networks are prone to divergence in the training process due to the abrupt changes of the cost function caused by the safety-related terms when approaching the constraint boundary. These issues motivated our barrier function-based (simplified as barrier-based) actor-critic structure. Moreover, few works have addressed the safe RL algorithm design under time-varying safety constraints.

This work proposes a model-based safe RL algorithm with theoretical guarantees for optimal control with time-varying state and control constraints. A new barrier-based control policy (BCP) structure is constructed in the proposed safe RL approach, generating repulsive control forces as states and controls move toward the constraint boundaries. Moreover, the time-varying constraints are addressed by a multi-step policy evaluation (MPE). The closed-loop theoretical properties of our approach under nominal and perturbed cases and the convergence condition of the barrier-based actor-critic (BAC) learning algorithm are derived. The effectiveness of our approach is tested in both simulations and real-world intelligent vehicle experiments. Our contributions are summarized as follows.

  1. We propose a safe RL approach for optimal control under time-varying state and control constraints. Under certain conditions (see Sections LABEL:sec:32-A and -B), safety can be guaranteed in both online and offline learning scenarios. The performance and advantages of the proposed approach are achieved by two novel designs. The first is a barrier-based control policy to ensure safety with an actor-critic structure. The second is a multi-step evaluation mechanism to predict the control policy’s future influence on the value function under time-varying safety constraints and guide the policy to update safely.

  2. We prove that the proposed safe RL approach guarantees stability in the nominal scenario and robustness under external disturbances. The convergence condition of the BAC learning algorithm is derived using the Lyapunov method.

  3. The proposed approach is applied to an integrated path following and collision avoidance problem of intelligent vehicles so that the control performance can be optimized with theoretical guarantees even under external disturbances. (i) Extensive simulation results illustrate that our approach outperforms other state-of-the-art safe RL methods in learning safety and performance. (ii) We verify our control policy’s offline sim-to-real transfer capability and real-world online learning performance. The experimental results reveal that our approach outperforms a standard model predictive control (MPC) algorithm in terms of safety and optimality and shows an impressive sim-to-real transfer capability and a satisfactory online control performance.

The remainder of the paper is organized as follows. Section II introduces the considered control problem and preliminary solutions. Section III presents the proposed safe RL approach and the BAC learning algorithm, while Section IV presents the main theoretical results. Section V shows the real-world experimental results, while some conclusions are drawn in Section VI. Some additional simulation results for theoretical verification are given in the appendix.

Notation: We denote and as the set of natural numbers and integers . For a vector , we denote as and as the Euclidean norm. For a function with an argument , we denote as the gradient to . For a function with arguments and , we denote as the partial gradient to , or . Given a matrix , we use ( ) to denote the minimal (maximal) eigenvalues. We denote as the interior of a general set . For variables , , we define , where .

II Control problem formulation and definitions

II-A Control problem

The considered system under control is a class of discrete-time nonlinear systems described by


where and are the state and input variables, is the discrete-time index, and are convex sets that represent time-varying constraints, , , is a bounded compact set; functions for , are assumed to be ; is the state transition function and .

Starting from any initial condition , the control objective is to find an optimal control policy that minimizes a quadratic regulation cost function of type


subject to model (1), , and , ; where

and , , , is a discount factor.
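As a rough illustration of a discounted quadratic regulation cost, the following sketch evaluates such a cost along a finite rollout. The stage cost form `x' Q x + u' R u`, the names `Q`, `R`, and `gamma`, and the truncation to a finite horizon are assumptions made for this example, since the exact expression is not recoverable from the text above.

```python
def quad_form(v, M):
    """Return v' M v for a vector v (list) and a square matrix M (list of rows)."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))


def discounted_quadratic_cost(xs, us, Q, R, gamma):
    """Sum gamma**t * (x_t' Q x_t + u_t' R u_t) over a finite rollout.

    Q, R (weights) and gamma (discount factor) are hypothetical names; the
    infinite-horizon cost is truncated to the length of the rollout.
    """
    total = 0.0
    for t, (x, u) in enumerate(zip(xs, us)):
        total += gamma ** t * (quad_form(x, Q) + quad_form(u, R))
    return total


# Two-step rollout of a system with two states and one input.
xs = [[1.0, 0.0], [0.5, 0.0]]
us = [[1.0], [0.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
R = [[1.0]]
cost = discounted_quadratic_cost(xs, us, Q, R, gamma=0.9)
```

In this example the first step contributes 2 and the second contributes 0.9 × 0.25, so later deviations are weighted less than earlier ones.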

Definition 1 (Local stabilizability [zhang2021robust])

System (1) with is stabilizable on if, for any , there exists a state-feedback policy , , such that and as .

Without loss of generality, many waypoint tracking problems in the robot control field can be naturally formulated as the prescribed regulation problem, with a proper coordinate transformation of the reference waypoints. More generally, it is allowed that the time-varying state constraint might not contain the origin for some . Typical examples can be found in, for instance, path following of mobile robots with collision avoidance, where the potential obstacle to be avoided might occupy the reference waypoints, i.e., the origin after the coordinate transformation. It is still reasonable to introduce the following assumption for convergence guarantee.

Assumption 1 (State constraint)

There exists a finite number such that as .

Assumption 2 (Lipschitz continuous)

Model (1) is Lipschitz continuous in , for all , i.e., there exists a Lipschitz constant such that for all and control policies with ,

Assumption 3 (Model)

in the domain , where .

Definition 2 (multi-step safe control)

For a given state at time instant , a control policy , is -step safe for (1) if the resulting future state evolutions of (1) under satisfy , , where is the resulting state constraint under .
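The idea behind multi-step safety can be sketched as a forward rollout of the model that checks each predicted state against the constraint set valid at that time step. This is only a minimal sketch: the function names, the scalar example system, and the shrinking interval constraint are hypothetical, and the tightening of the state constraint mentioned in the definition is not modeled.

```python
def is_n_step_safe(x0, k0, policy, dynamics, in_state_set, n_steps):
    """Check whether the predicted states stay inside the time-varying set.

    policy(x, k) returns a control, dynamics(x, u) returns the next state,
    and in_state_set(x, k) tests membership in the constraint set at time k.
    All three callables are placeholders supplied by the user.
    """
    x = x0
    for j in range(n_steps):
        u = policy(x, k0 + j)
        x = dynamics(x, u)
        if not in_state_set(x, k0 + j + 1):
            return False
    return True


# Hypothetical scalar example: x_{k+1} = x_k + u_k with |x_k| <= 2 - 0.1 k.
def dynamics(x, u):
    return x + u


def shrinking_interval(x, k):
    return abs(x) <= 2.0 - 0.1 * k


def stabilizing(x, k):
    return -0.5 * x


def destabilizing(x, k):
    return 0.5 * x


safe = is_n_step_safe(1.5, 0, stabilizing, dynamics, shrinking_interval, n_steps=5)
unsafe = is_n_step_safe(1.5, 0, destabilizing, dynamics, shrinking_interval, n_steps=5)
```

Here the stabilizing policy keeps every predicted state inside the shrinking interval, while the destabilizing one already violates the bound at the first predicted step.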

To simplify the notation, in the rest of the paper, the superscript in is omitted, i.e., we use to denote .

II-B Definitions on barrier functions

We also introduce the definition of barrier functions.

Definition 3 (Barrier function)

For a convex set , a barrier function is defined as


The recentered transformation of centered at is defined as , where if or is selected such that otherwise.
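To make the recentering idea concrete, the sketch below uses a logarithmic barrier on a scalar open interval and the common recentering that subtracts the barrier's value and first-order term at the center. Both the logarithmic form and this recentering formula are assumptions for illustration, as the expressions in Definition 3 are not recoverable here; the names `lo`, `hi`, and `z_c` are hypothetical.

```python
import math


def log_barrier(z, lo, hi):
    """Logarithmic barrier for the open interval (lo, hi); infinite outside."""
    if not (lo < z < hi):
        return math.inf
    return -math.log(z - lo) - math.log(hi - z)


def log_barrier_grad(z, lo, hi):
    """Derivative of the logarithmic barrier at an interior point z."""
    return -1.0 / (z - lo) + 1.0 / (hi - z)


def recentered_barrier(z, z_c, lo, hi):
    """Assumed recentering: B(z) - B(z_c) - B'(z_c) * (z - z_c).

    The result is zero with zero slope at the center z_c and, because the
    barrier is convex, nonnegative elsewhere in the interval.
    """
    if not (lo < z < hi):
        return math.inf
    return (log_barrier(z, lo, hi)
            - log_barrier(z_c, lo, hi)
            - log_barrier_grad(z_c, lo, hi) * (z - z_c))


# The interval (-1, 3) contains the origin, so the origin is used as center.
at_center = recentered_barrier(0.0, 0.0, -1.0, 3.0)
near_upper = recentered_barrier(2.9, 0.0, -1.0, 3.0)
outside = recentered_barrier(3.5, 0.0, -1.0, 3.0)
```

With this choice the penalty vanishes at the center and grows as the argument approaches either end of the interval.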

To derive a satisfactory control performance, it is suggested to select far from the set boundary of and as the central point or its neighbor of (if possible). It is observed that satisfies

Lemma 1 (Relaxed barrier function [wills2004barrier])

Define a relaxed barrier function of as


where is a relaxing factor, , the function is strictly monotone and differentiable on , and , then there exists a matrix such that .

Proof: For details, see [wills2004barrier].
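A relaxed barrier replaces the logarithm near and beyond the constraint boundary with a finite, smooth extension, so that the function and its gradient stay bounded. The sketch below uses one common choice, a quadratic extension that matches the value and the first two derivatives of the logarithm at the relaxing factor. This particular extension and the name `kappa` are assumptions for illustration and may differ from the function used in Lemma 1.

```python
import math


def relaxed_log_barrier(z, kappa):
    """Scalar relaxed barrier for the constraint z > 0.

    Returns -ln(z) when z > kappa and a quadratic extension otherwise.
    kappa > 0 is the relaxing factor (hypothetical name).
    """
    if z > kappa:
        return -math.log(z)
    return 0.5 * (((z - 2.0 * kappa) / kappa) ** 2 - 1.0) - math.log(kappa)


kappa = 0.1
inside = relaxed_log_barrier(1.0, kappa)          # plain logarithmic branch
at_switch = relaxed_log_barrier(kappa, kappa)     # quadratic branch at z = kappa
violated = relaxed_log_barrier(-1.0, kappa)       # finite even when z < 0
```

Unlike the exact barrier, the relaxed version returns a large but finite value for a violated constraint, which is consistent with the quadratic upper bound stated in the lemma.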