# Provably Safe Model-Based Meta Reinforcement Learning: An Abstraction-Based Approach

While conventional reinforcement learning focuses on designing agents that can perform one task, meta-learning aims, instead, to solve the problem of designing agents that can generalize to different tasks (e.g., environments, obstacles, and goals) that were not considered during the design or the training of these agents. In this spirit, in this paper, we consider the problem of training a provably safe Neural Network (NN) controller for uncertain nonlinear dynamical systems that can generalize to new tasks that were not present in the training data while preserving strong safety guarantees. Our approach is to learn a set of NN controllers during the training phase. When the task becomes available at runtime, our framework will carefully select a subset of these NN controllers and compose them to form the final NN controller. Critical to our approach is the ability to compute a finite-state abstraction of the nonlinear dynamical system. This abstract model captures the behavior of the closed-loop system under all possible NN weights, and is used to train the NNs and compose them when the task becomes available. We provide theoretical guarantees that govern the correctness of the resulting NN. We evaluated our approach on the problem of controlling a wheeled robot in cluttered environments that were not present in the training data.


## I Introduction

While the current successes of meta-RL are undeniable, significant drawbacks of meta-RL in its current form are (i) the lack of formal guarantees on its ability to generalize to unforeseen tasks and (ii) the lack of formal guarantees with regards to its safety.

In this paper, we confine our attention to reach-avoid tasks (i.e., a robot that needs to reach a goal without hitting obstacles) and propose a framework for meta-RL that can generalize to tasks (e.g., different environments, obstacles, and goals) that were not present in the training data. The proposed framework results in NN controllers that are provably safe with regard to any reach-avoid task, including tasks that were unseen during the design of these neural networks.

Recently, the authors proposed a framework for provably-correct training of neural networks [sun2021safeRL]. In that framework, given an error-free nonlinear dynamical system, a finite-state abstract model that captures the closed-loop behavior under all possible neural network controllers is computed. Using this finite-state abstract model, this framework identifies the subset of NN weights guaranteed to satisfy the safety requirements (i.e., avoiding obstacles). During training, the learning algorithm is augmented with a NN weight projection operator that enforces the resulting NN to be provably safe. To account for the liveness properties (i.e., reaching the goal), the proposed framework uses the finite-state abstract model to identify candidate NN weights that may satisfy the liveness properties. Using such candidate NN weights, the proposed framework biases the NN training to achieve the liveness specification.

While the previous results reported in [sun2021safeRL] focused on the case when the task (environment, obstacles, and goal) is known during the training of the NN controller, we extend these results in this paper to account for the case when the task is unknown during training. In particular, instead of training one neural network, we train a set of neural networks. To fulfill a set of infinitely many tasks using a finite set of neural network controllers, our approach is to restrict each neural network to some local behavior, yet the composition of these neural networks captures all possible behaviors. Moreover, and unlike the results reported in [sun2021safeRL], we consider in this paper the case when the nonlinear dynamical system is only partially known. We evaluated our approach on the problem of steering a wheeled robot and we show that our framework is capable of generalizing to tasks that were not present in the training of the NN controller while guaranteeing the safety of the robot.

## II Problem Formulation

### II-A Notation

Let $\|x\|$ be the Euclidean norm of the vector $x$, $\|A\|$ be the induced 2-norm of the matrix $A$, and $\|A\|_{\max}$ be the max norm of the matrix $A$. Given two vectors $x$ and $y$, we denote by $(x, y)$ the column vector $[x^\top, y^\top]^\top$. We use $\oplus$ to denote the Minkowski sum, and $\mathrm{int}(S)$ to denote the interior of the set $S$. Any Borel space $X$ is assumed to be endowed with a Borel $\sigma$-algebra, which is denoted by $\mathcal{B}(X)$. We use $\mathbf{1}_A$ to denote the indicator function of a set $A$.

### II-B Dynamical Model and Neural Network Controller

We consider discrete-time nonlinear dynamical systems of the form:

$$x(k+1) = f(x(k), u(k)) + g(x(k), u(k)), \tag{1}$$

where $x(k)$ is the state and $u(k)$ is the control input at time step $k$. The dynamical model consists of two parts: the a priori known nominal model $f$, and the unknown model-error $g$, which is deterministic and captures unmodeled dynamics. Although the model-error $g$ is unknown, we assume it is bounded by a compact set $\mathcal{D}$, i.e., $g(x, u) \in \mathcal{D}$ for all $x$ and $u$. We also assume that both functions $f$ and $g$ are locally Lipschitz continuous. We further assume that the model-error $g$ can be learned using Gaussian Process (GP) regression [GP], a well-studied technique for learning unknown functions from data. We denote the posterior mean and variance functions of the GP regression model by $\mu_g$ and $\sigma_g^2$, respectively. (In the case of a multiple-output function $g$, we model each output dimension with an independent GP; we keep the notation unchanged for simplicity.) Given a feedback control law $\Psi$, we use $\xi_{x_0, \Psi}$ to denote the closed-loop trajectory of (1) that starts from the state $x_0$ and evolves under the control law $\Psi$.

In this paper, our primary focus is on controlling the nonlinear system (1) with a state-feedback neural network controller $NN$. An $F$-layer Rectified Linear Unit (ReLU) NN is specified by composing $F$ layer functions (or just layers). The $l$-th layer, with its own input and output dimensions, is specified by a weight matrix $W^{(l)}$ and a bias vector $b^{(l)}$ as follows:

$$L_{\theta^{(l)}} : z \mapsto \max\{W^{(l)} z + b^{(l)}, 0\}, \tag{2}$$

where the $\max$ function is taken element-wise, and $\theta^{(l)} \triangleq (W^{(l)}, b^{(l)})$ for brevity. Thus, an $F$-layer ReLU NN is specified by $F$ layer functions whose input and output dimensions are composable: that is, the output dimension of each layer equals the input dimension of the next. Specifically:

$$NN_{\theta}(x) = (L_{\theta^{(F)}} \circ L_{\theta^{(F-1)}} \circ \cdots \circ L_{\theta^{(1)}})(x), \tag{3}$$

where we index a ReLU NN function by the list of parameters $\theta \triangleq (\theta^{(1)}, \ldots, \theta^{(F)})$. It is also common to allow the final layer function to omit the $\max$ function altogether, and we will be explicit about this when it is the case.
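As a concrete illustration of (2) and (3), the following minimal NumPy sketch evaluates a small ReLU network as a composition of layer functions. The function names, the `linear_output` flag, and the example weights are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def relu_layer(W, b, z):
    """One layer function as in (2): element-wise max of the affine map and 0."""
    return np.maximum(W @ z + b, 0.0)

def relu_nn(theta, x, linear_output=True):
    """Compose the layers in theta = [(W1, b1), ..., (WF, bF)] as in (3).

    linear_output=True omits the max in the final layer (an assumed option).
    """
    z = np.asarray(x, dtype=float)
    for l, (W, b) in enumerate(theta):
        is_last = (l == len(theta) - 1)
        z = W @ z + b if (is_last and linear_output) else relu_layer(W, b, z)
    return z

# Hypothetical 2-layer network mapping a 2-D state to a 1-D control input.
theta = [
    (np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])),
    (np.array([[1.0, 1.0]]), np.array([0.5])),
]
u = relu_nn(theta, [1.0, 2.0])
```

The loop only works when consecutive weight matrices have compatible shapes, which is the composability condition on the layer dimensions.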

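We use $\mathcal{W} = (X_{\text{goal}}, \{O_1, \ldots, O_o\})$ to denote a task, where $X_{\text{goal}}$ is the goal that the system would like to reach and $\{O_1, \ldots, O_o\}$ is the set of obstacles that the system would like to avoid. More formally, given a task $\mathcal{W}$, a safety specification $\phi_{\text{safety}}$ requires avoiding all the obstacles and a liveness specification $\phi_{\text{liveness}}$ requires reaching the goal within a bounded time horizon $H$. We write $\xi_{x_0,\Psi} \models \phi_{\text{safety}}$ and $\xi_{x_0,\Psi} \models \phi_{\text{liveness}}$ to denote that a trajectory satisfies the safety and liveness specifications, respectively, i.e.,

$$\xi_{x_0,\Psi} \models \phi_{\text{safety}} \iff \forall k \in \mathbb{N},\ \forall i \in \{1, \ldots, o\},\ \xi_{x_0,\Psi}(k) \notin O_i,$$

$$\xi_{x_0,\Psi} \models \phi_{\text{liveness}} \iff \exists k \in \{1, \ldots, H\},\ \xi_{x_0,\Psi}(k) \in X_{\text{goal}}.$$

Given a set of initial states $X_{\text{init}}$, a control law $\Psi$ satisfies a specification $\phi$ (denoted by $\Psi, X_{\text{init}} \models \phi$) if all trajectories starting from the set $X_{\text{init}}$ satisfy the specification, i.e., $\forall x_0 \in X_{\text{init}}$, $\xi_{x_0,\Psi} \models \phi$. Since the specifications and the satisfying set of initial states depend on the task, we explicitly add $\mathcal{W}$ as a superscript whenever we need to emphasize the dependency, such as $\phi^{\mathcal{W}}_{\text{safety}}$, $\phi^{\mathcal{W}}_{\text{liveness}}$, and $X^{\mathcal{W}}_{\text{init}}$.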
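To make the two specifications concrete, the sketch below checks them on a finite, sampled trajectory. It assumes, purely for illustration, that the goal and the obstacles are axis-aligned boxes given as `(low, high)` pairs and that `traj[k]` holds the state at step `k`. The safety specification quantifies over all time steps, so a check on a finite prefix can only refute it, not prove it.

```python
import numpy as np

def in_box(x, box):
    """True if x lies in the closed axis-aligned box (low, high)."""
    low, high = box
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= low) and np.all(x <= high))

def satisfies_safety(traj, obstacles):
    """No visited state of the finite prefix lies inside any obstacle."""
    return all(not in_box(x, O) for x in traj for O in obstacles)

def satisfies_liveness(traj, goal, H):
    """Some state at a step k in {1, ..., H} lies inside the goal."""
    return any(in_box(x, goal) for x in traj[1:H + 1])

# Hypothetical task: one obstacle and one goal region in the plane.
goal = ([2.5, -0.5], [3.5, 0.5])
obstacles = [([0.5, 0.5], [1.5, 1.5])]
traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

safe = satisfies_safety(traj, obstacles)
live_in_3 = satisfies_liveness(traj, goal, H=3)
live_in_2 = satisfies_liveness(traj, goal, H=2)
```

In this example the trajectory avoids the obstacle and reaches the goal at step 3, so it meets the liveness specification for horizon 3 but not for horizon 2.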

While conventional reinforcement learning focuses on training a neural network that works for one specific task, meta-RL focuses, instead, on training controllers that can work for a multitude of tasks. To formally capture this requirement, we consider the set of all tasks, i.e., all configurations of the goal and the obstacles defined over the state space. Although some tasks in this set, such as one in which the goal is enclosed by obstacles, may not be interesting, we use the full set in the statement of our problem for simplicity.

### II-D Main Problem

We consider the problem of designing provably correct NN controllers for unseen tasks. Specifically, the task is unknown during the training of the NN controller. The task will be known only at runtime. Therefore, our objective is to train a set (or a collection) of different ReLU NNs along with a selection algorithm that can select the correct NNs once the task becomes available at runtime. Before presenting the problem under consideration, we introduce the following notion of NN composition.

###### Definition II.1

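Given a set of neural networks $\mathfrak{NN} = \{NN_1, NN_2, \ldots\}$ along with an activation map $\Gamma$ that maps each state to the index of a network in $\mathfrak{NN}$, the composed neural network $NN_{[\mathfrak{NN},\Gamma]}$ is defined as:

$$NN_{[\mathfrak{NN},\Gamma]}(x) = NN_{\Gamma(x)}(x).$$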
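Definition II.1 amounts to a state-dependent switch between controllers. A minimal Python sketch, in which the two controllers and the activation map are hypothetical placeholders rather than trained networks, could look as follows.

```python
def compose(controllers, activation_map):
    """Return the composed controller x -> controllers[activation_map(x)](x)."""
    def composed(x):
        return controllers[activation_map(x)](x)
    return composed

# Hypothetical scalar example: two stand-in "networks" and a sign-based activation map.
controllers = [lambda x: -x, lambda x: 2.0 * x]
activation_map = lambda x: 0 if x < 0 else 1

nn_composed = compose(controllers, activation_map)
u_left = nn_composed(-1.0)   # controller 0 is active for negative states
u_right = nn_composed(3.0)   # controller 1 is active otherwise
```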

In other words, the activation map selects the index of the NN that needs to be activated at a particular state. Now, we can define the problem of interest as follows.

###### Problem II.2

Given the nonlinear dynamical system (1), design a NN controller that consists of two parts: a set of ReLU NNs $\mathfrak{NN}$ and a selection algorithm SEL, such that for any task $\mathcal{W}$, the selection algorithm returns a set of initial states $X^{\mathcal{W}}_{\text{init}}$ and an activation map $\Gamma^{\mathcal{W}}$ satisfying:

$$NN_{[\mathfrak{NN},\Gamma^{\mathcal{W}}]},\ X^{\mathcal{W}}_{\text{init}} \models \phi^{\mathcal{W}}_{\text{safety}} \wedge \phi^{\mathcal{W}}_{\text{liveness}}.$$

Ideally, the algorithm SEL would compute the largest possible $X^{\mathcal{W}}_{\text{init}}$ for the task $\mathcal{W}$. Since computing the largest possible set can be computationally demanding, our algorithm instead focuses on finding a sub-optimal $X^{\mathcal{W}}_{\text{init}}$. For space considerations, the quantification of this sub-optimality is omitted.

## III Framework

### III-A Overview

Before describing our approach to solving Problem II.2, we start by recalling that every ReLU NN represents a Continuous Piece-Wise Affine (CPWA) function [pascanu2013number]. We use $\Psi_{\text{CPWA}}$ to denote a CPWA function of the form:

$$\Psi_{\text{CPWA}}(x) = K_i x + b_i \quad \text{if } x \in R_i,\ i = 1, \ldots, L, \tag{4}$$

where the polytopic sets $\{R_1, \ldots, R_L\}$ form a partition of the state space. We call each polytopic set $R_i$ a linear region of $\Psi_{\text{CPWA}}$. In this paper, we confine our attention to CPWA controllers (and hence neural network controllers) whose parameters $K_i$ and $b_i$ are selected from bounded polytopic sets.

To fulfill a set of infinitely many tasks using a finite set of ReLU NNs, our approach is to restrict each NN in the set to some local behavior, while the set as a whole captures all possible behaviors of the system. We use the mathematical model of the physical system (1) to guide the training of the NNs, as well as the selection of NNs from the set at runtime.

During training, without knowing the tasks, we train a set of ReLU NNs using the following two steps:

• Capture the closed-loop behavior of the system under all CPWA controllers using a finite-state Markov decision process (MDP). To define the action space of this MDP, we partition the space of all CPWA controllers into a finite number of partitions. Each partition corresponds to a family of CPWA controllers. Hence, each transition in the MDP is labeled by a symbol that corresponds to a particular family of CPWA functions. The transition probabilities can then be computed using the knowledge of the model (1) and the Gaussian Process. We refer to this finite-state MDP as the abstract model of the system.

• Train one NN corresponding to each transition in the MDP. We refer to each of these NNs as a local NN. The training forces each local NN to represent a CPWA function that belongs to the family of CPWA controllers associated with this transition. This is achieved by using the NN weight projection operator introduced in [sun2021safeRL]. Together, these local NNs constitute the set of NN controllers $\mathfrak{NN}$.

Details of constructing the abstract model and training the local NN controllers in are given in Section IV.

At runtime, given an arbitrary task $\mathcal{W}$, the algorithm SEL selects NNs from the set $\mathfrak{NN}$ to satisfy $\phi^{\mathcal{W}}_{\text{safety}} \wedge \phi^{\mathcal{W}}_{\text{liveness}}$:

• To satisfy the safety specification $\phi^{\mathcal{W}}_{\text{safety}}$, the algorithm SEL identifies a subset of safe CPWA controllers at each abstract state in the MDP. The NNs selected from the set $\mathfrak{NN}$ must correspond to one of the CPWA families that are marked as safe.

• For the liveness specification $\phi^{\mathcal{W}}_{\text{liveness}}$, the algorithm SEL first searches for the optimal policy of the MDP using dynamic programming (DP), where the allowed transitions in the MDP are limited to those that have been identified as safe. Based on the optimal policy of the MDP, it decides which local NN in the set $\mathfrak{NN}$ should be used at each state.

We highlight that the proposed framework always guarantees that the resulting NN controller satisfies the safety specification $\phi^{\mathcal{W}}_{\text{safety}}$ for any task $\mathcal{W}$, regardless of the accuracy of the model-error learned by GP regression. For the liveness specification $\phi^{\mathcal{W}}_{\text{liveness}}$, because the learned model-error is probabilistic, we relax Problem II.2 to maximize the probability of satisfying the liveness specification. We also provide a quantified bound on the probability that the NN controller satisfies $\phi^{\mathcal{W}}_{\text{liveness}}$.

Figure 1 conceptualizes our framework. In Figure 1 (a), we partition the state space into a set of abstract states and the controller space into a set of controller partitions . Figure 1 (b) shows the resulting MDP, with transition probabilities labeled by the side of the transitions. Then, the set contains 9 local NNs corresponding to the 9 transitions in the MDP.

Consider two different tasks given at runtime. Task specifies that the goal is represented by the abstract state and the only obstacle is . At state , our selection algorithm decides to use the local network , which corresponds to the transition from state to under partition . In task , state is still the goal, but there is no obstacle. For this task, our selection algorithm decides to use at state and use at state . Notice that with this choice the probability of reaching the goal is , which is higher than the probability by using at state .

In the above procedure, the set $\mathfrak{NN}$ may contain a large number of local NNs (one for every possible transition in the MDP) and may need extensive training effort. To accelerate the training process, in Section VII, we employ ideas from transfer learning to enable the use of a partially complete set $\mathfrak{NN}$ to rapidly train new NN controllers at runtime, while satisfying the same guarantees as a complete set $\mathfrak{NN}$.

## IV Provably-Correct Training of the Set of Neural Networks NN

### IV-A Abstract Model

In this section, we extend the abstract model proposed in [sun2021safeRL] by taking into account the unknown model-error $g$. Unlike the results reported in [sun2021safeRL], where the system was assumed to be error-free and deterministic (and hence could be abstracted by a finite-state machine), in this paper the dynamical model (1) is stochastic due to the use of GP regression to capture the error in the model. This necessitates the use of a finite-state MDP to abstract the dynamics in (1).

State and Controller Space Partitioning: We partition the state space $X$ into a set of abstract states, denoted by $\mathbb{X} = \{q_1, \ldots, q_N\}$. Each $q_i$ is an infinity-norm ball in $\mathbb{R}^n$ centered around some state. The partitioning satisfies $X = \bigcup_{q \in \mathbb{X}} q$, and $\mathrm{int}(q_i) \cap \mathrm{int}(q_j) = \emptyset$ if $i \neq j$. With an abuse of notation, $q$ denotes both an abstract state, i.e., $q \in \mathbb{X}$, and a subset of states, i.e., $q \subseteq X$. Since we construct the abstract model before knowing the tasks, the state space does not contain any obstacle or goal.

Similarly, we partition the controller space into polytopic subsets. For simplicity of notation, we combine the polytopic sets of gains and biases into a single polytope of parameters. With some abuse of notation, we write $K(x)$ with a single parameter $K$ to denote the affine feedback defined by a gain and bias pair. The controller space is discretized into a collection of polytopic subsets, denoted by $\mathbb{P} = \{P_1, P_2, \ldots\}$. Each $P_i$ is an infinity-norm ball centered around some parameter, the subsets cover the controller space, and $\mathrm{int}(P_i) \cap \mathrm{int}(P_j) = \emptyset$ if $i \neq j$. We call each of the subsets $P_i$ a controller partition. Each controller partition $P$ represents a subset of CPWA functions, obtained by restricting the parameters of a CPWA function to take values from $P$.

MDP Transitions: Next, we compute the set of all allowable transitions in the MDP. To that end, we define the posterior of an abstract state $q$ under a controller partition $P$ to be the set of states that can be reached in one step from states $x \in q$ by using affine state feedback controllers with parameters $K \in P$ under the dynamical model (1), as follows:

$$\mathrm{Post}(q, P) \triangleq \{h(x, K(x)) \in \mathbb{R}^n \mid x \in q,\ K \in P\} \oplus \mathcal{D}, \tag{5}$$

where $\mathcal{D}$ is defined in Section II-B as the bound of the model-error $g$. Computing the exact posterior for a nonlinear system is computationally expensive, and hence we rely on an over-approximation $\widehat{\mathrm{Post}}(q, P)$ instead. Furthermore, let $\mathrm{Next}(q, P)$ be the set of abstract states that overlap with $\widehat{\mathrm{Post}}(q, P)$:

$$\mathrm{Next}(q, P) \triangleq \{q' \in \mathbb{X} \mid q' \cap \widehat{\mathrm{Post}}(q, P) \neq \emptyset\}. \tag{6}$$

The transitions in the MDP can now be constructed using the information in $\mathrm{Next}$. That is, a transition from state $q$ to state $q'$ with label $P$ is allowed in the MDP if and only if $q' \in \mathrm{Next}(q, P)$.
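The construction in (5) and (6) can be sketched for the simple case where every set is an axis-aligned box. The snippet below assumes that a box bound on the nominal image of an abstract state under a controller partition is already available (how to compute it is not covered here), inflates it by a box-shaped error bound through a Minkowski sum, and lists the grid cells that intersect the result. All names and numbers are hypothetical.

```python
import numpy as np

def minkowski_sum_box(box, d_box):
    """Minkowski sum of two axis-aligned boxes given as (low, high) arrays."""
    return (box[0] + d_box[0], box[1] + d_box[1])

def boxes_overlap(a, b):
    """Closed boxes intersect iff they overlap along every axis."""
    return bool(np.all(a[0] <= b[1]) and np.all(b[0] <= a[1]))

def next_states(post_hat, cells):
    """Indices of the cells q' that have a non-empty intersection with post_hat, as in (6)."""
    return [i for i, cell in enumerate(cells) if boxes_overlap(cell, post_hat)]

# Hypothetical 1-D example: four unit cells covering [0, 4].
cells = [(np.array([float(i)]), np.array([float(i + 1)])) for i in range(4)]
nominal_image = (np.array([1.2]), np.array([1.6]))  # assumed bound on the nominal image
error_bound = (np.array([-0.3]), np.array([0.3]))   # assumed box for the model-error bound
post_hat = minkowski_sum_box(nominal_image, error_bound)
successors = next_states(post_hat, cells)
```

Here the inflated box is roughly [0.9, 1.9], so the allowed transitions lead to the first two cells only.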

Transition Probability: The final step is to compute the transition probabilities associated with each of the transitions constructed in the previous step. We define transition probabilities based on representative points in abstract states and controller partitions. Specifically, we choose the representative points to be the centers (recall that both abstract states and controller partitions are infinity-norm balls and hence their centers are well defined). Let $\mathrm{ct}_X$ map an abstract state $q$ to its center and $\mathrm{ct}_P$ map a controller partition $P$ to the matrix $\kappa$, which is the center of $P$. Furthermore, we use $\mathrm{abs}_X$ to denote the map from a state $x$ to the abstract state that contains $x$, i.e., $x \in \mathrm{abs}_X(x)$, and similarly, the map $\mathrm{abs}_P$ satisfies $K \in \mathrm{abs}_P(K)$ for any parameter $K$.

Given the dynamical system (1) with the model-error $g$ learned by a GP regression model, let $T$ be the corresponding conditional stochastic kernel. Specifically, given the current state $x(k)$ and input $u(k)$, the distribution $T(\cdot \mid x(k), u(k))$ of the next state is Gaussian. For any set $A$ and any $k$, the probability of reaching the set $A$ in one step from state $x(k)$ with input $u(k)$ is given by:

$$\Pr(x(k+1) \in A \mid x(k), u(k)) = \int_A T(dx(k+1) \mid x(k), u(k)). \tag{7}$$

This integral can be easily computed since $T(\cdot \mid x(k), u(k))$ is a Gaussian distribution. (In the case of a multiple-output function $g$, each dimension can be integrated independently.)
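When the set in (7) is a box and the output dimensions are modeled by independent Gaussians, the integral factorizes into a product of one-dimensional Gaussian probabilities. The following sketch uses only the Python standard library; the function names and the example mean and standard deviation are assumptions for illustration.

```python
import math

def normal_cdf(z, mean, std):
    """CDF of a scalar Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((z - mean) / (std * math.sqrt(2.0))))

def box_probability(low, high, mean, std):
    """Probability that an independent Gaussian vector falls in the box [low, high]."""
    p = 1.0
    for l, h, m, s in zip(low, high, mean, std):
        p *= normal_cdf(h, m, s) - normal_cdf(l, m, s)
    return p

# Hypothetical 2-D cell centered at the predicted mean, with unit standard deviations.
p_cell = box_probability([-1.0, -1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

Each factor here is the familiar one-sigma probability of about 0.683, so the two-dimensional cell receives a probability of about 0.466.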

With above notations, we define our abstract model as follows:

###### Definition IV.1

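The abstract model of (1) is a finite MDP defined as a tuple $(\mathbb{X}, \mathbb{P}, \hat{T})$, where:

• The state space is the set of abstract states $\mathbb{X}$;

• The set of controls at each state is given by the set of controller partitions $\mathbb{P}$;

• The transition probability from state $q$ to $q'$ with label $P$ is given by:

$$\hat{T}(q' \mid q, P) = \begin{cases} \int_{q'} T(dx' \mid z, \kappa(z)) & \text{if } q' \in \mathrm{Next}(q, P) \\ 0 & \text{else} \end{cases}$$

where $z = \mathrm{ct}_X(q)$ and $\kappa = \mathrm{ct}_P(P)$.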

### IV-B Train Local NNs with Weight Projection

Once the abstract model is computed, the next step is to train the set of local neural networks $\mathfrak{NN}$ without knowledge of the tasks. In order to capture the closed-loop behavior of the system under all possible CPWA controllers, we train one local NN corresponding to each transition (with non-zero transition probability) in the MDP. Algorithm 1 outlines the training of all the local NNs. We use $NN_{(q,P,q')}$ to denote the local NN corresponding to the transition in the MDP from abstract state $q$ to $q'$ under controller partition $P$.

We train each local network using Proximal Policy Optimization (PPO) [ppo] (line 5 in Algorithm 1). While choosing the reward function in reinforcement learning is often challenging, our algorithm admits a straightforward yet effective reward function. Specifically, for a local network $NN_{(q,P,q')}$, let $w_1$ and $w_2$ be pre-specified weights. Our reward function encourages moving towards the state $q'$ with controllers chosen from the partition $P$:

$$r(x, u) = \begin{cases} -w_2 \|u - \kappa(x)\|, & \text{if } h(x, u) + \mu_g(x, u) \in q' \\ -w_1 \|h(x, u) + \mu_g(x, u) - \mathrm{ct}_X(q')\| - w_2 \|u - \kappa(x)\|, & \text{otherwise} \end{cases}$$

where $\mu_g$ is the posterior mean function from the GP regression. With this dynamical model, PPO can efficiently explore the workspace without running the real agent.
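The reward above can be written as a short function. In the sketch below the nominal model `h`, the GP posterior mean `mu_g`, and the center controller `kappa` are passed in as callables, and the target abstract state is an axis-aligned box whose midpoint plays the role of its center; the toy dynamics and the weight values are assumptions made only to obtain a runnable example.

```python
import numpy as np

def reward(x, u, h, mu_g, kappa, q_next, w1=1.0, w2=0.1):
    """Reward for a local NN: penalize deviation from kappa(x) and, if the
    predicted next state misses the target cell, its distance to the cell center."""
    low, high = q_next
    x_next = h(x, u) + mu_g(x, u)              # predicted next state
    r = -w2 * np.linalg.norm(u - kappa(x))
    inside = np.all(x_next >= low) and np.all(x_next <= high)
    if not inside:
        center = 0.5 * (low + high)
        r -= w1 * np.linalg.norm(x_next - center)
    return float(r)

# Hypothetical toy model: single-integrator dynamics with a constant learned error.
h = lambda x, u: x + u
mu_g = lambda x, u: np.array([0.1, 0.1])
kappa = lambda x: np.zeros(2)
q_next = (np.array([0.5, -0.5]), np.array([1.5, 0.5]))

x0 = np.zeros(2)
r_in = reward(x0, np.array([1.0, 0.0]), h, mu_g, kappa, q_next)   # lands in q'
r_out = reward(x0, np.array([3.0, 0.0]), h, mu_g, kappa, q_next)  # overshoots q'
```

Because the reward is evaluated on the model prediction, it can be computed during training without querying the physical system.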

The training of local networks is followed by applying the NN weight projection operator Project introduced in [sun2021safeRL]. Given a neural network $NN$ and a controller partition $P$, this projection operator ensures that:

$$\mathrm{Project}(NN, P) \in P.$$

In other words, this projection operator forces the projected NN to give rise only to CPWA functions that belong to the controller partition $P$. We refer readers to [sun2021safeRL] for more details on the NN weight projection. Algorithm 1 summarizes the discussion in this subsection.

## V The Selection Algorithm SEL(W,NN)

In this section, we present our selection algorithm SEL, which is used at runtime when an arbitrary task $\mathcal{W}$ is given. The algorithm assigns one local NN in the set $\mathfrak{NN}$ to each abstract state in order to satisfy the safety and liveness specifications. Our approach is to first exclude all transitions in the MDP that can lead to a violation of the safety specification, and then to select the optimal solution from the remaining transitions in the MDP. More details are given below.

### V-A Exclude Unsafe Transitions using Backtracking

Given a task $\mathcal{W}$ that specifies a set of obstacles $\{O_1, \ldots, O_o\}$ and a goal $X_{\text{goal}}$, we identify the subset of abstract states that intersect the obstacles, i.e., every $q$ with $q \cap O_i \neq \emptyset$ for some $i$, and the subset of abstract states contained in the goal, i.e., every $q$ with $q \subseteq X_{\text{goal}}$.

Algorithm 2 computes the set of safe states and safe controller partitions using an iterative backward procedure introduced in [sun2021safeRL]. With the set of unsafe states initialized to be the obstacles (line 1 in Algorithm 2), the algorithm backtracks unsafe states until a fixed point is reached, i.e., until it cannot find new unsafe states (lines 2-4 in Algorithm 2). The set of safe initial states $X^{\mathcal{W}}_{\text{init}}$ is the union of all the abstract states that are identified to be safe (line 6 in Algorithm 2). Furthermore, the algorithm computes the function $P^{\mathcal{W}}_{\text{safe}}$, which assigns a set of safe controller partitions $P^{\mathcal{W}}_{\text{safe}}(q)$ to each abstract state $q$. Again, we use the superscript $\mathcal{W}$ to emphasize the dependency of these sets on the task $\mathcal{W}$.
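Algorithm 2 itself is not reproduced in this section, so the following is only one plausible reading of the backward procedure, under two stated assumptions: an abstract state becomes unsafe when every controller partition has at least one successor in the current unsafe set, and a controller partition is safe at a safe state when none of its successors is unsafe. The data layout (`next_map` as a dictionary of successor sets) is hypothetical.

```python
def safe_sets(states, partitions, next_map, obstacle_states):
    """Fixed-point backtracking of unsafe abstract states (assumed criterion).

    next_map[(q, P)] is the set Next(q, P) of possible successor states.
    """
    unsafe = set(obstacle_states)
    changed = True
    while changed:                       # stop when no new unsafe state is found
        changed = False
        for q in states:
            if q in unsafe:
                continue
            # q is unsafe if no partition keeps all successors out of the unsafe set
            if all(next_map[(q, P)] & unsafe for P in partitions):
                unsafe.add(q)
                changed = True
    safe_states = set(states) - unsafe
    safe_parts = {
        q: [P for P in partitions if not (next_map[(q, P)] & unsafe)]
        for q in safe_states
    }
    return safe_states, safe_parts

# Hypothetical abstraction with four states, two partitions, and one obstacle state.
states, partitions = [0, 1, 2, 3], ["a", "b"]
next_map = {
    (0, "a"): {1}, (0, "b"): {0, 1},
    (1, "a"): {0, 1}, (1, "b"): {2, 3},
    (2, "a"): {3}, (2, "b"): {3},
    (3, "a"): {3}, (3, "b"): {3},
}
safe_states, safe_parts = safe_sets(states, partitions, next_map, obstacle_states={3})
```

In this toy abstraction, state 2 is marked unsafe in the first pass because both partitions lead to the obstacle, and partition "b" is then excluded at state 1 because it may lead to state 2.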

### V-B Assign Controller Partition by Solving MDP

Once the set of safe controller partitions is computed, the next step is to assign one safe controller partition to each abstract state. In particular, we consider the problem of solving for the optimal policy of the MDP with states and controls limited to the set of safe abstract states $X^{\mathcal{W}}_{\text{safe}}$ and the set of safe controller partitions $P^{\mathcal{W}}_{\text{safe}}(q)$ at each state $q$, respectively. Since we are interested in maximizing the probability of satisfying the liveness specification, let the optimal value function $\hat{V}^*_k$ map an abstract state $q$ to the maximum probability of reaching the goal in $H-k$ steps from $q$. Using this notation, $\hat{V}^*_0(q)$ is then the maximum probability of satisfying the liveness specification. The optimal value functions can be computed by the following Dynamic Programming (DP) recursion [abate2013hscc]:

$$\hat{V}_k(q, P) = \mathbf{1}_{X_{\text{goal}}}(q) + \mathbf{1}_{X_{\text{safe}} \setminus X_{\text{goal}}}(q) \sum_{q' \in X^{\mathcal{W}}_{\text{safe}}} \hat{V}^*_{k+1}(q')\, \hat{T}(q' \mid q, P) \tag{8}$$

$$\hat{V}^*_k(q) = \max_{P \in P^{\mathcal{W}}_{\text{safe}}(q)} \hat{V}_k(q, P) \tag{9}$$

with the initial condition , where .

Algorithm 3 solves for the optimal policy of the MDP using the Dynamic Programming (DP) recursion (8)-(9). At time step $k$, the optimal controller partition $P^*$ at state $q$ is given by the maximizer of $\hat{V}_k(q, P)$ (line 8 in Algorithm 3). The last step is to assign a corresponding neural network to be used at all the states $x \in q$ for each $q$. To that end, the activation map $\Gamma^{\mathcal{W}}_{k,\text{abs}}$ assigns the neural network indexed by $(q, P^*, q'^*)$ to the abstract state $q$, where $q'^*$ maximizes the transition probability $\hat{T}(q' \mid q, P^*)$ (lines 9-10 in Algorithm 3). While the activation map $\Gamma^{\mathcal{W}}_{k,\text{abs}}$ assigns a neural network index to the abstract state $q$, we can directly obtain the activation map for the actual state $x$ as:

$$\Gamma^{\mathcal{W}}_k(x) = \Gamma^{\mathcal{W}}_{k,\text{abs}}(\mathrm{abs}_X(x)).$$

In other words, given the state of the system $x$, we first compute the corresponding abstract state $\mathrm{abs}_X(x)$, and use the neural network assigned to this abstract state to control the system. Note that, unlike the definition of the activation map in Problem II.2, the activation map obtained here is time-varying, as captured by the subscript $k$. This reflects the nature of the optimal solution computed by the DP recursion (8)-(9).
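A minimal sketch of the backward recursion (8)-(9) on a finite MDP is given below. It is not Algorithm 3 itself: the terminal condition (value 1 on goal states and 0 elsewhere at step H), the dictionary layout of the transition probabilities, and the toy numbers are all assumptions made for illustration.

```python
def solve_reach_dp(safe_states, goal_states, safe_parts, T, H):
    """Backward DP for the maximal probability of reaching the goal within H steps.

    T[(q, P, q_next)] holds transition probabilities (missing keys mean 0).
    Returns the value function at step 0 and a time-varying policy.
    """
    V = {q: 1.0 if q in goal_states else 0.0 for q in safe_states}  # assumed V at step H
    policy = []
    for _ in range(H):                          # steps k = H-1, ..., 0
        V_new, pi_k = {}, {}
        for q in safe_states:
            if q in goal_states:
                V_new[q] = 1.0                  # indicator of the goal in (8)
                continue
            best_P, best_val = None, 0.0
            for P in safe_parts[q]:
                val = sum(T.get((q, P, qn), 0.0) * V[qn] for qn in safe_states)
                if best_P is None or val > best_val:
                    best_P, best_val = P, val
            V_new[q], pi_k[q] = best_val, best_P   # maximization in (9)
        V = V_new
        policy.insert(0, pi_k)
    return V, policy

# Hypothetical safe MDP with three abstract states; state 2 is the goal.
safe_states, goal_states = [0, 1, 2], {2}
safe_parts = {0: ["a", "b"], 1: ["a"], 2: ["a"]}
T = {
    (0, "a", 1): 1.0,
    (0, "b", 2): 0.6, (0, "b", 0): 0.4,
    (1, "a", 2): 0.9, (1, "a", 1): 0.1,
}
V0, policy = solve_reach_dp(safe_states, goal_states, safe_parts, T, H=2)
```

In this example the best partition at state 0 changes with the time step: with two steps left it is better to move to state 1 first, while with one step left the direct but less reliable transition is preferred. This is the time-varying behavior captured by the subscript of the activation map.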

The set of safe initial states computed by Algorithm 2, along with the activation map returned by Algorithm 3, constitutes the SEL algorithm.

## VI Theoretical Guarantees

In this section, we study the theoretical guarantees of the proposed solution. We analyze the guarantees of satisfying $\phi^{\mathcal{W}}_{\text{safety}}$ and $\phi^{\mathcal{W}}_{\text{liveness}}$ separately.

### VI-A Safety Guarantee

The following theorem summarizes the safety guarantees for our solution.

###### Theorem VI.1

Consider the dynamical model (1). Let the NN controller consist of two parts: the set of local neural networks $\mathfrak{NN}$ trained by Algorithm 1 and the selection algorithm SEL defined by Algorithm 2 and Algorithm 3. For any task $\mathcal{W}$, consider the set of initial conditions $X^{\mathcal{W}}_{\text{init}}$ and the activation map $\Gamma^{\mathcal{W}}$ computed by SEL. Then the following holds: $NN_{[\mathfrak{NN},\Gamma^{\mathcal{W}}]},\ X^{\mathcal{W}}_{\text{init}} \models \phi^{\mathcal{W}}_{\text{safety}}$.

The proof of Theorem VI.1 follows the same argument as the error-free case presented in [sun2021safeRL] and hence is omitted for brevity. In particular, Theorem 4.2 in [sun2021safeRL] shows that at safe abstract states $q$, any feedback CPWA controller with parameters chosen from the safe controller partitions at $q$ is guaranteed to be safe. Furthermore, Theorem 4.4 in [sun2021safeRL] shows that the NN weight projection operator Project ensures that the local NNs at $q$ only give rise to feedback CPWA controllers with parameters in some safe controller partition.

To take into account the model-error $g$, the posterior in (5) is inflated with the error bound $\mathcal{D}$. Hence, Algorithm 2 provides the same safety guarantee, regardless of the accuracy of the model-error learned by GP regression. With the NN weight projection in the training of local NNs (line 6 in Algorithm 1), the resulting NN controller is guaranteed to be safe for any task $\mathcal{W}$.

### VI-B Probabilistic Optimality Guarantee

Because the unknown model-error $g$ is learned by GP regression, the liveness specification may not always be satisfied. However, in this subsection, we provide a bound on the probability that the trained NN controller satisfies it. Intuitively, this bound tells how close the NN controller is to the optimal controller, which maximizes the probability of satisfying the liveness specification.

By replacing the model-error $g$ in (1) with the GP regression model, we obtain a stochastic system. Given an arbitrary task $\mathcal{W}$, we consider the embedded MDP corresponding to this stochastic system, with states and controls limited to the subspace that has been identified to be safe (see Algorithm 2). (Since the task is fixed when comparing the NN controller and the optimal controller, we drop the superscript $\mathcal{W}$ in this subsection.) Specifically, we define this continuous MDP as a tuple with the following components:

• The state space is the set of safe states ;

• The available controls at each state $x$, denoted by $U_{\text{safe}}(x)$, are given by the feedback CPWA controllers with parameters chosen from the safe controller partitions;

• The set of controls is ;

• The conditional stochastic kernel follows the same definition in Section IV-A.

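We first consider the optimal controller for the system in terms of maximizing the probability of satisfying the liveness specification. Similar to the finite-state MDP, let the optimal value function $V^*_k$ map a state $x$ to the maximum probability of reaching the goal in $H-k$ steps from $x$. Let $X'_{\text{safe}} = X_{\text{safe}} \setminus X_{\text{goal}}$. The optimal value functions can be computed through the DP recursion [abate2013hscc]:

$$V_k(x, u) = \mathbf{1}_{X_{\text{goal}}}(x) + \mathbf{1}_{X'_{\text{safe}}}(x) \int_{X_{\text{safe}}} V^*_{k+1}(x')\, T(dx' \mid x, u) \tag{10}$$

$$V^*_k(x) = \sup_{u \in U_{\text{safe}}(x)} V_k(x, u) \tag{11}$$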

with the initial condition , where . In the following, we use the DP recursion (10)-(11) to bound the optimality of NN controllers without explicitly solving them, which is intractable due to the continuous state and input space.

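The probability that the NN controller satisfies the liveness specification is given by the value function $V^{NN}_k$, which maps a state $x$ to the probability of reaching the goal in $H-k$ steps from the state $x$ under the controller $NN$:

$$V^{NN}_k(x) \triangleq \Pr\left(\exists k' \in \{k, \ldots, H\},\ \xi_{x,NN}(k') \in X_{\text{goal}}\right).$$

Similarly, $V^{NN}_k$ can be computed through the DP recursion:

$$V^{NN}_k(x) = \mathbf{1}_{X_{\text{goal}}}(x) + \mathbf{1}_{X'_{\text{safe}}}(x) \int_{X_{\text{safe}}} V^{NN}_{k+1}(x')\, T(dx' \mid x, NN(x)) \tag{12}$$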

with the initial condition , where .

With the above notation, the difference between the value functions $V^{NN}_k$ and $V^*_k$ measures the optimality of the NN controller by comparing it with the optimal controller. The following theorem provides an upper bound on this difference. When $k = 0$, it upper bounds the difference between the probability of satisfying the liveness specification using the NN controller and the maximum probability that can be achieved.

###### Theorem VI.2

Let $V^{NN}_k$ and $V^*_k$ be the functions defined above. For any state $x \in X_{\text{safe}}$ it holds that

$$|V^{NN}_k(x) - V^*_k(x)| \le (H - k)(\Delta^{NN} + \Delta^*) \tag{13}$$

where

 ΔNN Δ∗ =max1≤i≤N′(Λiδq+ΓiLPδq+2√m(n+1)LXΓiδP)

and the constants are defined as follows: the number of safe abstract states , grid size , and . Furthermore, , , and is the Lipschitz constant of an arbitrary local NN corresponding to a transition leaving :

$$\forall x, x' \in q_i,\quad \|NN_{(q_i,P,q')}(x) - NN_{(q_i,P,q')}(x')\| \le L_i \|x - x'\|$$

for any and . Finally, and , where and are the Lipschitz constants of the stochastic kernel at abstract state , i.e., :

 |t(y|x′,u)−t(y|x,u)|≤λi(y)<