I Introduction
Meta Reinforcement Learning (metaRL) refers to algorithms that leverage previous learning experience to learn how to adapt to new tasks quickly. In other words, while contemporary reinforcement learning focuses on designing agents that can perform one task, metaRL aims to design agents that can generalize to different tasks that were not considered during the design or the training of these agents. Given data often covering a distribution of related tasks (e.g., changes in the environments, goals, and dynamics), metaRL aims to combine all such experience and use it to design agents that can quickly adapt to unforeseen tasks. To achieve this aim, and without loss of generality, meta-training can be seen as a bilevel optimization problem where one optimization contains another optimization as a constraint [sinha2017review, franceschi2018bilevel]. The inner optimization corresponds to the classical training of a policy to achieve a particular task, while the outer optimization focuses instead on optimizing the meta-representation that generalizes to different tasks [hospedales2020meta, finn2017model, fernando2018meta, fallah2020provably, liu2019taming]. For a review of the current achievements in the field of metaRL, we refer the reader to the survey [hospedales2020meta].
While the current successes of metaRL are undeniable, significant drawbacks of metaRL in its current form are (i) the lack of formal guarantees on its ability to generalize to unforeseen tasks and (ii) the lack of formal guarantees with regard to its safety.
In this paper, we confine our attention to reach-avoid tasks (i.e., a robot that needs to reach a goal without hitting obstacles) and propose a framework for metaRL that can generalize to tasks (e.g., different environments, obstacles, and goals) that were not present in the training data. The proposed framework results in NN controllers that are provably safe with regard to any reach-avoid task, including tasks that were unseen during the design of these neural networks.
Recently, the authors proposed a framework for provably-correct training of neural networks [sun2021safeRL]. In that framework, given an error-free nonlinear dynamical system, a finite-state abstract model that captures the closed-loop behavior under all possible neural network controllers is computed. Using this finite-state abstract model, this framework identifies the subset of NN weights guaranteed to satisfy the safety requirements (i.e., avoiding obstacles). During training, the learning algorithm is augmented with a NN weight projection operator that enforces the resulting NN to be provably safe. To account for the liveness properties (i.e., reaching the goal), the proposed framework uses the finite-state abstract model to identify candidate NN weights that may satisfy the liveness properties. Using such candidate NN weights, the proposed framework biases the NN training to achieve the liveness specification.
While the previous results reported in [sun2021safeRL] focused on the case when the task (environment, obstacles, and goal) is known during the training of the NN controller, we extend these results in this paper to account for the case when the task is unknown during training. In particular, instead of training one neural network, we train a set of neural networks. To fulfill a set of infinitely many tasks using a finite set of neural network controllers, our approach is to restrict each neural network to some local behavior, yet the composition of these neural networks captures all possible behaviors. Moreover, and unlike the results reported in [sun2021safeRL], we consider in this paper the case when the nonlinear dynamical system is only partially known. We evaluate our approach on the problem of steering a wheeled robot and show that our framework is capable of generalizing to tasks that were not present in the training of the NN controller while guaranteeing the safety of the robot.
II Problem Formulation
II-A Notation
Let $\|x\|$ be the Euclidean norm of the vector $x$, $\|A\|$ be the induced 2-norm of the matrix $A$, and $\|A\|_{\max}$ be the max norm of the matrix $A$. Given two vectors $x$ and $y$, we denote by $(x, y)$ the column vector $[x^\top \; y^\top]^\top$. We use $\oplus$ to denote the Minkowski sum, and $\mathrm{int}(S)$ to denote the interior of the set $S$. Any Borel space $X$ is assumed to be endowed with a Borel $\sigma$-algebra, which is denoted by $\mathcal{B}(X)$. We use $\mathbf{1}_S$ to denote the indicator function of a set $S$.
II-B Dynamical Model and Neural Network Controller
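We consider discrete-time nonlinear dynamical systems of the form:
$$x^{(t+1)} = f(x^{(t)}, u^{(t)}) + g(x^{(t)}, u^{(t)}), \tag{1}$$
where $x^{(t)} \in \mathcal{X} \subset \mathbb{R}^n$ is the state and $u^{(t)} \in \mathcal{U} \subset \mathbb{R}^m$ is the control input at time step $t \in \mathbb{N}$. The dynamical model consists of two parts: the a priori known nominal model $f$, and the unknown model-error $g$, which is deterministic and captures unmodeled dynamics. Though the model-error $g$ is unknown, we assume it is bounded by a compact set $\mathcal{D} \subset \mathbb{R}^n$, i.e., $g(x, u) \in \mathcal{D}$ for all $x \in \mathcal{X}$ and $u \in \mathcal{U}$. We also assume both functions $f$ and $g$ are locally Lipschitz continuous. Following a well-studied technique for learning unknown functions from data, we assume the model-error $g$ can be learned using Gaussian Process (GP) regression [GP]. We use $\mathcal{GP}(\mu_g, \sigma_g^2)$ to denote a GP regression model with posterior mean and variance functions $\mu_g$ and $\sigma_g^2$, respectively. (In the case of a multiple-output function $g$, i.e., $n > 1$, we model each output dimension with an independent GP. We keep the notation unchanged for simplicity.) Given a feedback control law $\Psi: \mathcal{X} \to \mathcal{U}$, we use $\xi_{x_0, \Psi}: \mathbb{N} \to \mathcal{X}$ to denote the closed-loop trajectory of (1) that starts from the state $x_0 \in \mathcal{X}$ and evolves under the control law $\Psi$.
In this paper, our primary focus is on controlling the nonlinear system (1) with a state-feedback neural network controller $\mathcal{NN}: \mathcal{X} \to \mathcal{U}$. An $F$-layer Rectified Linear Unit (ReLU) NN is specified by composing $F$ layer functions (or just layers). A layer with $\mathfrak{i}$ inputs and $\mathfrak{o}$ outputs is specified by a weight matrix $W \in \mathbb{R}^{\mathfrak{o} \times \mathfrak{i}}$ and a bias vector $b \in \mathbb{R}^{\mathfrak{o}}$ as follows:
$$L_{\theta}: z \mapsto \max\{W z + b, 0\}, \tag{2}$$
where the $\max$ function is taken element-wise, and $\theta \triangleq (W, b)$ for brevity. Thus, an $F$-layer ReLU NN is specified by $F$ layer functions $\{L_{\theta^{(i)}} : i = 1, \dots, F\}$ whose input and output dimensions are composable: that is, they satisfy $\mathfrak{i}_{i} = \mathfrak{o}_{i-1}$ for $i = 2, \dots, F$. Specifically:
$$\mathcal{NN}_{\theta}(x) = (L_{\theta^{(F)}} \circ L_{\theta^{(F-1)}} \circ \dots \circ L_{\theta^{(1)}})(x), \tag{3}$$
where we index a ReLU NN function by a list of matrices $\theta \triangleq (\theta^{(1)}, \dots, \theta^{(F)})$. Also, it is common to allow the final layer function to omit the $\max$ function altogether, and we will be explicit about this when it is the case.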
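The layer composition described above can be illustrated with a minimal pure-Python sketch. The function names (`layer`, `relu_nn`), the list-of-lists matrix representation, and the example weights are assumptions made for illustration only and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]  # hypothetical encoding of theta = (W, b)


def layer(W: Matrix, b: Sequence[float], z: Sequence[float], relu: bool = True) -> List[float]:
    """Apply one layer z -> max(W z + b, 0); the max is skipped when relu=False."""
    if len(W) != len(b) or any(len(row) != len(z) for row in W):
        raise ValueError("layer dimensions are not composable")
    out = [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]
    return [max(v, 0.0) for v in out] if relu else out


def relu_nn(theta: Sequence[Layer], x: Sequence[float], linear_output: bool = True) -> List[float]:
    """Compose the layers in order; optionally omit the max in the final layer."""
    z = list(x)
    for i, (W, b) in enumerate(theta):
        is_last = i == len(theta) - 1
        z = layer(W, b, z, relu=not (is_last and linear_output))
    return z


# Illustrative 2-layer network: 2 inputs -> 2 hidden units -> 1 output.
theta = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [-0.5]),
]
u = relu_nn(theta, [1.0, 3.0])
```

The dimension check in `layer` mirrors the composability condition: each layer's input dimension must equal the previous layer's output dimension.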
II-C Task and Specification
We use to denote a task where is the goal that the system would like to reach and with is the set of obstacles that the system would like to avoid. More formally, given a task , a safety specification requires avoiding all the obstacles and a liveness specification requires reaching the goal in a bounded time horizon . We use and to denote that a trajectory satisfies the safety and liveness specifications, respectively, i.e.,
Given a set of initial states , a control law satisfies a specification (denoted by ) if all trajectories starting from the set satisfy the specification, i.e., , . Since the specifications and the satisfying set of initial states depend on the task, we explicitly add as a superscript whenever we need to emphasize the dependency, such as , , and .
While conventional reinforcement learning focuses on training a neural network that works for one specific task, metaRL focuses, instead, on training controllers that can work for a multitude of tasks. To formally capture this requirement, we use to denote the set of all the tasks (corresponding to configurations of the goal and obstacles) with the goals and the obstacles defined over the state space . Though an arbitrary task, such as one in which the goal is enclosed by obstacles, may not be interesting, we use the set in the statement of our problem for simplicity.
II-D Main Problem
We consider the problem of designing provably correct NN controllers for unseen tasks. Specifically, the task is unknown during the training of the NN controller. The task will be known only at runtime. Therefore, our objective is to train a set (or a collection) of different ReLU NNs along with a selection algorithm that can select the correct NNs once the task becomes available at runtime. Before presenting the problem under consideration, we introduce the following notion of NN composition.
Definition II.1
Given a set of Neural Networks along with an activation map , the composed neural network is defined as:
In other words, the activation map selects the index of the NN that needs to be activated at a particular state . Now, we can define the problem of interest as follows.
Problem II.2
Given the nonlinear dynamical system (1), design a NN controller consisting of two parts: a set of ReLU NNs and a selection algorithm SEL, such that for any task , the selection algorithm returns a set of initial states and an activation map satisfying:
Indeed, it is desirable that the algorithm SEL computes the largest possible for the task . Since computing the largest possible set can be computationally demanding, our algorithm will instead focus on finding a suboptimal . For space considerations, the quantification of the suboptimality in the computation of is omitted.
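The NN composition of Definition II.1 can be sketched as follows: a state-dependent activation map picks an index, and the network with that index produces the control input. The names `compose`, `networks`, and `activation_map`, as well as the two toy controllers on a one-dimensional state, are hypothetical and serve only to illustrate the idea.

```python
from typing import Callable, Dict, List, Sequence

Controller = Callable[[Sequence[float]], List[float]]


def compose(networks: Dict[int, Controller],
            activation_map: Callable[[Sequence[float]], int]) -> Controller:
    """Return the composed controller x -> networks[activation_map(x)](x)."""
    def composed(x: Sequence[float]) -> List[float]:
        return networks[activation_map(x)](x)
    return composed


# Two stand-in "local NNs" (plain affine maps here, an assumption for brevity).
networks: Dict[int, Controller] = {
    0: lambda x: [-1.0 * x[0]],
    1: lambda x: [0.5 * x[0] + 1.0],
}


def activation_map(x: Sequence[float]) -> int:
    """Hypothetical activation map: network 0 for negative states, 1 otherwise."""
    return 0 if x[0] < 0.0 else 1


controller = compose(networks, activation_map)
```

In the framework, the activation map is not fixed in advance as it is here; it is returned by the selection algorithm once the task is known.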
III Framework
III-A Overview
Before describing our approach to solve Problem II.2, we start by recalling that every ReLU NN represents a Continuous Piece-Wise Affine (CPWA) function [pascanu2013number]. We use to denote a CPWA function of the form:
(4) 
where the polytopic sets form a partition of the set . We call each polytopic set a linear region, and use to denote the set of linear regions associated with . In this paper, we confine our attention to CPWA controllers (and hence neural network controllers) that are selected from a bounded polytopic set , i.e., we assume that and .
To fulfill a set of infinitely many tasks using a finite set of ReLU NNs , our approach is to restrict each NN in the set to some local behavior, yet the set captures all possible behaviors of the system. We use the mathematical model of the physical system (1) to guide the training of the NNs, as well as the selection of NNs from the set at runtime.
During training, without knowing the tasks, we train a set of ReLU NNs using the following two steps:

Capture the closed-loop behavior of the system under all CPWA controllers using a finite-state Markov decision process (MDP). To define the action space of this MDP, we partition the space of all CPWA controllers into a finite number of partitions. Each partition corresponds to a family of CPWA controllers. Hence, each transition in the MDP is labeled by a symbol that corresponds to a particular family of CPWA functions. The transition probabilities can then be computed using the knowledge of the model (1) and the learned Gaussian Process model. We refer to this finite-state MDP as the abstract model of the system.

Train one NN corresponding to each transition in the MDP. We refer to each of these NNs as a local NN. Let be the set of all such local NNs. The training forces each local NN to represent a CPWA function that belongs to the family of CPWA controllers associated with this transition. This is achieved by using the NN weight projection operator introduced in [sun2021safeRL]. Using these local NNs, we can construct the set of NN controllers .
Details of constructing the abstract model and training the local NN controllers in are given in Section IV.
At runtime, given an arbitrary task , the algorithm selects NNs from the set to satisfy :

To satisfy the safety specification , the algorithm SEL identifies a subset of safe CPWA controllers at each abstract state in the MDP. The selected NNs from the set must correspond to one of those CPWA families that are marked as safe.

For the liveness specification , the algorithm SEL first searches for the optimal policy of the MDP using dynamic programming (DP), where the allowed transitions in the MDP are limited to those that have been identified to be safe. Based on the optimal policy of the MDP, it decides which local NN in the set should be used at each state.
We highlight that the proposed framework above always guarantees that the resulting NN controller satisfies the safety specification for any task , regardless of the accuracy of the model-error learned using GP regression. For the liveness specification , because the learned model-error is probabilistic, we relax Problem II.2 to maximize the probability of satisfying the liveness specification . We also provide a quantified bound on the probability for the NN controller to satisfy .
Figure 1 conceptualizes our framework. In Figure 1 (a), we partition the state space into a set of abstract states and the controller space into a set of controller partitions . Figure 1 (b) shows the resulting MDP, with transition probabilities labeled by the side of the transitions. Then, the set contains 9 local NNs corresponding to the 9 transitions in the MDP.
Consider two different tasks given at runtime. Task specifies that the goal is represented by the abstract state and the only obstacle is . At state , our selection algorithm decides to use the local network , which corresponds to the transition from state to under partition . In task , state is still the goal, but there is no obstacle. For this task, our selection algorithm decides to use at state and use at state . Notice that with this choice the probability of reaching the goal is , which is higher than the probability by using at state .
In the above procedure, the set may contain a large number of local NNs (one for every possible transition in the MDP) and may need extensive training effort. To accelerate the training process, in Section VII, we employ ideas from transfer learning to enable the use of a partially complete set to rapidly train new NN controllers at runtime, while satisfying the same guarantees as a complete set.
IV Provably-Correct Training of the Set of Neural Networks
IV-A Abstract Model
In this section, we extend the abstract model proposed in [sun2021safeRL] by taking into account the unknown model-error . Unlike the results reported in [sun2021safeRL], where the system was assumed to be error-free and deterministic (and hence could be abstracted by a finite-state machine), in this paper the dynamical model (1) is stochastic due to the use of GP regression to capture the error in the model. This necessitates the use of a finite-state MDP to abstract the dynamics in (1).
State and Controller Space Partitioning: We partition the state space into a set of abstract states, denoted by . Each is an infinity-norm ball in centered around some state . The partitioning satisfies , and if . With an abuse of notation, denotes both an abstract state, i.e., , and a subset of states, i.e., . Since we construct the abstract model before knowing the tasks, the state space does not contain any obstacle or goal.
Similarly, we partition the controller space into polytopic subsets. For simplicity of notation, we define the set of parameters to be a polytope that combines and . With some abuse of notation, we use with a single parameter to denote with the pair . The controller space is discretized into a collection of polytopic subsets in , denoted by . Each is an infinity-norm ball centered around some such that , and if . We call each of the subsets a controller partition. Each controller partition represents a subset of CPWA functions, by restricting parameters in a CPWA function to take values from .
MDP Transitions: Next, we compute the set of all allowable transitions in the MDP. To that end, we define the posterior of an abstract state under a controller partition to be the set of states that can be reached in one step from states by using affine state feedback controllers with parameters under the dynamical model (1) as follows:
(5) 
where is defined in Section II-B as the bound of the model-error . Since computing the exact posterior for a nonlinear system is computationally expensive, we rely on an over-approximation instead. Furthermore, let be the set of abstract states that overlap with .
(6) 
The transitions in the MDP can now be constructed using the information in . That is, a transition from state to state with label is allowed in the MDP if and only if .
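To make the construction of allowed transitions concrete, the sketch below over-approximates the posterior of a one-dimensional abstract state with interval arithmetic and then lists the abstract states it overlaps. Everything specific here is an assumption for illustration: the toy nominal model `x + dt * u`, the affine feedback `u = K x + b` with `K` and `b` ranging over intervals, and the helper names. The paper's systems are nonlinear and multi-dimensional, and its over-approximation method is not reproduced here.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def interval_add(a: Interval, b: Interval) -> Interval:
    return (a[0] + b[0], a[1] + b[1])


def interval_mul(a: Interval, b: Interval) -> Interval:
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def post_overapprox(cell: Interval, gain: Interval, offset: Interval,
                    err_bound: float, dt: float = 0.1) -> Interval:
    """Enclose {x + dt*(K x + b) + g : x in cell, K in gain, b in offset, |g| <= err_bound}."""
    u = interval_add(interval_mul(gain, cell), offset)        # all inputs K x + b
    nominal = interval_add(cell, interval_mul((dt, dt), u))   # toy nominal model
    return (nominal[0] - err_bound, nominal[1] + err_bound)   # inflate by error bound


def successors(cells: List[Interval], reach: Interval) -> List[int]:
    """Indices of abstract states that overlap the over-approximated posterior."""
    return [i for i, (lo, hi) in enumerate(cells) if lo <= reach[1] and reach[0] <= hi]


cells = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
reach = post_overapprox(cells[1], gain=(1.0, 2.0), offset=(0.0, 0.5), err_bound=0.05)
allowed = successors(cells, reach)  # transitions from cells[1] under this partition
```

Interval arithmetic treats repeated occurrences of `x` as independent, so the enclosure may be loose, but it never misses a reachable state, which is the property the safety argument relies on.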
Transition Probability: The final step is to compute the transition probabilities associated with each of the transitions constructed in the previous step. We define transition probabilities based on representative points in abstract states and controller partitions. Specifically, we choose the representative points to be the centers (recall that both and are infinity-norm balls and hence their centers are well defined). Let map an abstract state to its center and map a controller partition to the matrix , which is the center of . Furthermore, we use to denote the map from a state to the abstract state that contains , i.e., , and similarly, the map satisfies for any .
Given the dynamical system (1) with the model-error learned by a GP regression model , let be the corresponding conditional stochastic kernel. Specifically, given the current state and input , the distribution is given by the Gaussian distribution . For any set and any , the probability of reaching the set in one step from state with input is given by:
(7)
where we use the notation . This integral can be easily computed since is a Gaussian distribution. (In the case of a multiple-output function , i.e., , each dimension can be integrated independently.)
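Because each output dimension is modeled by an independent Gaussian, the one-step probability of landing in a box-shaped abstract state factors into one-dimensional CDF differences. The sketch below illustrates this under stated assumptions: the helper names are hypothetical, the target set is an axis-aligned box, and `mean` and `std` stand in for the per-dimension predictive mean (nominal model plus GP posterior mean) and standard deviation, which would come from the learned GP in practice.

```python
import math
from typing import Sequence, Tuple


def normal_cdf(z: float, mean: float, std: float) -> float:
    """CDF of a univariate Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((z - mean) / (std * math.sqrt(2.0))))


def box_probability(box: Sequence[Tuple[float, float]],
                    mean: Sequence[float],
                    std: Sequence[float]) -> float:
    """P(x' in box) when the coordinates of x' are independent Gaussians."""
    p = 1.0
    for (lo, hi), m, s in zip(box, mean, std):
        p *= normal_cdf(hi, m, s) - normal_cdf(lo, m, s)
    return p


# Illustrative numbers only: a 2-D unit-variance prediction centered in the box.
p = box_probability(box=[(-1.0, 1.0), (-1.0, 1.0)], mean=[0.0, 0.0], std=[1.0, 1.0])
```

In the abstract model, this quantity would be evaluated at the center of the abstract state and with the controller at the center of the controller partition.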
With the above notation, we define our abstract model as follows:
Definition IV.1
The abstract model of (1) is a finite MDP defined as a tuple , where:

The state space is the set of abstract states ;

The set of controls at each state is given by the set of controller partitions ;

The transition probability from state to with label is given by:
where , .
IV-B Train Local NNs with Weight Projection
Once the abstract model is computed, the next step is to train the set of local neural networks without the knowledge of the tasks. In order to capture the closedloop behavior of the system under all possible CPWA controllers, we train one local NN corresponding to each transition (with nonzero transition probability) in the MDP . Algorithm 1 outlines training of all the local NNs. We use to denote the local NN corresponding to the transition in the MDP from abstract state to under controller partition .
We train each local network using Proximal Policy Optimization (PPO) [ppo] (line 5 in Algorithm 1). While choosing the reward function in reinforcement learning is often challenging, our algorithm enjoys a straightforward yet efficient formulation of reward functions. Specifically, for a local network , let and be prespecified weights. Our reward function encourages moving towards the state with controllers chosen from the partition :
where is the posterior mean function from the GP regression. With this dynamical model, PPO can efficiently explore the workspace without running the real agent.
The training of local networks is followed by applying a NN weight projection operator Project introduced in [sun2021safeRL]. Given a neural network and a controller partition , this projection operator ensures that:
In other words, this projection operator ensures that can only give rise to one of the CPWA functions that belong to the controller partition . We refer readers to [sun2021safeRL] for more details on the NN weight projection. Algorithm 1 summarizes the discussion in this subsection.
V The Selection Algorithm
In this section, we present our selection algorithm which is used at runtime when an arbitrary task is given. The algorithm assigns one local NN in the set to each abstract state in order to satisfy the safety and liveness specification . Our approach is to first exclude all transitions in the MDP that can lead to violation of , followed by selecting the optimal solution from the remaining transitions in the MDP. More details are given below.
V-A Exclude Unsafe Transitions using Backtracking
Given a task that specifies a set of obstacles and a goal , we use to denote the subset of abstract states that intersect the obstacles, i.e., , and use to denote the subset of abstract states contained in the goal, i.e., .
Algorithm 2 computes the set of safe states and safe controller partitions using an iterative backward procedure introduced in [sun2021safeRL]. With the set of unsafe states initialized to be the obstacles (line 1 in Algorithm 2), the algorithm backtracks unsafe states until a fixed point is reached, i.e., it cannot find new unsafe states (lines 2-4 in Algorithm 2). The set of safe initial states is the union of all the abstract states that are identified to be safe (line 6 in Algorithm 2). Furthermore, it computes the function , which assigns a set of safe controller partitions at each abstract state . Again, we use the superscript to emphasize the dependency of , and on the task .
V-B Assign Controller Partition by Solving MDP
Once the set of safe controller partitions is computed, the next step is to assign one controller partition in to each abstract state . In particular, we consider the problem of solving for the optimal policy of the MDP with states and controls limited to the set of safe abstract states and the set of safe controller partitions at , respectively. Since we are interested in maximizing the probability of satisfying the liveness specification , let the optimal value function map an abstract state to the maximum probability of reaching the goal in steps from . Using this notation, is then the maximum probability of satisfying the liveness specification . The optimal value functions can be solved by the following Dynamic Programming (DP) recursion [abate2013hscc]:
(8)  
(9) 
with the initial condition , where .
Algorithm 3 solves for the optimal policy of the MDP using the Dynamic Programming (DP) recursion (8)-(9). At time step , the optimal controller partition at state is given by the maximizer of (line 8 in Algorithm 3). The last step is to assign a corresponding neural network to be used at all the states for each . To that end, the activation map assigns the neural network indexed by to the abstract state , where maximizes the transition probability (lines 9-10 in Algorithm 3). While the activation map assigns a neural network index to the abstract state , we can directly get the activation map to the actual state as:
In other words, given the state of the system , we first compute the corresponding abstract state , and use the corresponding neural network assigned to this abstract state to control the system. Note that, unlike the definition of the activation map in Problem II.2, the activation map obtained here is time-varying as captured by the subscript . This reflects the nature of the optimal solution computed by the DP recursion (8)-(9).
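The selection step can be illustrated with a finite-horizon backward recursion for the maximal probability of reaching the goal. The sketch assumes the standard form of such a recursion (value 1 on goal states, otherwise the best expected next-step value over the safe partitions); it is not a transcription of (8)-(9) or of Algorithm 3. The data layout, with `prob[(s, p)]` a dictionary of successor probabilities, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Prob = Dict[Tuple[int, int], Dict[int, float]]


def solve_reach_dp(states: List[int], goal: Set[int],
                   safe_ctrl: Dict[int, List[int]], prob: Prob, horizon: int):
    """Return the value function at time 0 and a time-varying activation map."""
    V = {s: (1.0 if s in goal else 0.0) for s in states}  # terminal condition
    policy: List[Dict[int, Tuple[int, int]]] = []
    for _ in range(horizon):
        new_V: Dict[int, float] = {}
        pi: Dict[int, Tuple[int, int]] = {}
        for s in states:
            if s in goal:
                new_V[s] = 1.0
                continue
            best_p, best_q = None, 0.0
            for p in safe_ctrl.get(s, []):
                q = sum(pr * V.get(s2, 0.0) for s2, pr in prob[(s, p)].items())
                if best_p is None or q > best_q:
                    best_p, best_q = p, q
            new_V[s] = best_q
            if best_p is not None:
                # Local NN index: (best partition, most likely successor state).
                target = max(prob[(s, best_p)], key=prob[(s, best_p)].get)
                pi[s] = (best_p, target)
        V = new_V
        policy.insert(0, pi)  # recursion runs backward in time
    return V, policy


states = [0, 1, 2]
prob = {
    (0, 0): {0: 0.5, 1: 0.5},
    (0, 1): {1: 0.8, 0: 0.2},
    (1, 0): {2: 0.9, 1: 0.1},
}
V0, policy = solve_reach_dp(states, goal={2}, safe_ctrl={0: [0, 1], 1: [0]},
                            prob=prob, horizon=2)
```

Here `policy[k][s]` plays the role of the time-varying activation map at abstract state `s` and step `k`; a concrete state would first be mapped to its abstract state and then looked up in this table.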
VI Theoretical Guarantees
In this section, we study the theoretical guarantees of the proposed solution. We analyze the guarantees of satisfying and separately.
VI-A Safety Guarantee
The following theorem summarizes the safety guarantees for our solution.
Theorem VI.1
Consider the dynamical model (1). Let the NN controller consist of two parts: the set of local neural networks trained by Algorithm 1 and the selection algorithm SEL defined by Algorithm 2 and Algorithm 3. For any task , consider the set of initial conditions and the activation map computed by . Then the following holds: .
The proof of Theorem VI.1 follows the same argument as the error-free case presented in [sun2021safeRL] and hence is omitted for brevity. In particular, Theorem 4.2 in [sun2021safeRL] shows that at safe abstract states , any feedback CPWA controller with chosen from is guaranteed to be safe. Furthermore, Theorem 4.4 in [sun2021safeRL] shows that the NN weight projection operator Project ensures that the local NNs at only give rise to the feedback CPWA controllers with for some .
To take into account the model-error , the posterior in (5) is inflated with the error bound . Hence, Algorithm 2 provides the same safety guarantee, regardless of the accuracy of the model-error learned by GP regression. With the NN weight projection in the training of local NNs (line 6 in Algorithm 1), the resulting NN controller is guaranteed to be safe for any task .
VI-B Probabilistic Optimality Guarantee
Due to the unknown model-error , which is learned by GP regression, the liveness specification may not always be satisfied. However, in this subsection, we provide a bound on the probability for the trained NN controller to satisfy . Intuitively, this bound tells how close the NN controller is to the optimal controller, which maximizes the probability of satisfying .
By replacing the model-error in (1) using the GP regression model , we consider the stochastic system , where . Given an arbitrary task , we use to denote the embedded MDP corresponding to this stochastic system, with states and controls limited to the subspace that has been identified to be safe (see Algorithm 2). Since the task is fixed when comparing the NN controller and the optimal controller, we drop the superscript in this subsection. Specifically, we define the continuous MDP as a tuple , where:

The state space is the set of safe states ;

The available controls at each state are given by the feedback CPWA controllers with chosen from the safe controller partitions, i.e., ;

The set of controls is ;

The conditional stochastic kernel follows the same definition in Section IV-A.
We first consider the optimal controller for the system in terms of maximizing the probability of satisfying the liveness specification . Similar to the finitestate MDP , let the optimal value function map a state to the maximum probability of reaching the goal in steps from . Let , the optimal value functions can be solved through DP recursion [abate2013hscc]:
(10)  
(11) 
with the initial condition , where . In the following, we use the DP recursion (10)-(11) to bound the optimality of NN controllers without explicitly solving them, which is intractable due to the continuous state and input space.
The probability for the NN controller to satisfy the liveness specification is given by the value function , which maps a state to the probability of reaching the goal in steps from the state under the controller :
Similarly, can be solved through the DP recursion:
(12) 
with the initial condition , where .
With the above notation, the difference between the value functions and measures the optimality of the NN controller by comparing it with the optimal controller. The following theorem provides an upper bound on this difference. When , it upper bounds the difference between the probability of satisfying the liveness specification using the NN controller and the maximum probability that can be achieved.
Theorem VI.2
Let and be the functions defined above. For any it holds that
(13) 
where
and the constants are defined as follows: the number of safe abstract states , grid size , and . Furthermore, , , and is the Lipschitz constant of an arbitrary local NN corresponding to a transition leaving :
for any and . Finally, and , where and are the Lipschitz constants of the stochastic kernel at abstract state , i.e., :