This paper presents a safety-aware learning framework that employs an adaptive model learning method together with barrier certificates for systems with possibly nonstationary agent dynamics. To extract the dynamic structure of the model, we use a sparse optimization technique, and the resulting model will be used in combination with control barrier certificates which constrain feedback controllers only when safety is about to be violated. Under some mild assumptions, solutions to the constrained feedback-controller optimization are guaranteed to be globally optimal, and the monotonic improvement of a feedback controller is thus ensured. In addition, we reformulate the (action-)value function approximation to make any kernel-based nonlinear function estimation method applicable. We then employ a state-of-the-art kernel adaptive filtering technique for the (action-)value function approximation. The resulting framework is verified experimentally on a brushbot, whose dynamics is unknown and highly complex.


## I Introduction

By exploring and interacting with an environment, reinforcement learning can successfully determine the optimal policy with respect to the long-term rewards given to an agent [1, 2]. Whereas the idea of determining the optimal policy in terms of a cost over some time horizon is standard in the controls literature [3], reinforcement learning is aimed at learning the long-term rewards by exploring the states and actions. As such, the agent dynamics is no longer explicitly taken into account, but rather is subsumed by the data. Moreover, even the rewards need not necessarily be known a priori, but can be obtained through exploration, as well.

If no information about the agent dynamics is available, however, an agent might end up in certain regions of the state space which must be avoided while exploring. Avoiding such regions of the state space is referred to as safety. Safety includes collision avoidance, boundary-transgression avoidance, connectivity maintenance in teams of mobile robots, and other mandatory constraints, and this tension between exploration and safety becomes particularly pronounced in robotics, where safety is crucial.

In this paper, we address this safety issue by employing model learning in combination with barrier certificates. In particular, we focus on learning for systems with discrete-time nonstationary agent dynamics. Nonstationarity comes, for example, from failures of actuators, battery degradation, or sudden environmental disturbances. The result is a method that adapts to nonstationary agent dynamics and simultaneously extracts the dynamic structure without having to know how the agent dynamics changes over time. The resulting model is then used for barrier certificates. Under certain conditions, safety is recovered in the sense of Lyapunov stability even after violations of safety due to nonstationarity occur. Moreover, we propose discrete-time barrier certificates with which a greedy policy update is ensured.

Over the last decade, the safety issue has been addressed under the name of safe learning, and plenty of solutions have been proposed [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. To ensure safety while exploring, initial knowledge of the agent dynamics, some safe maneuver or its long-term rewards, or a teacher advising the agent is necessary [14, 4]. To obtain a model of the agent dynamics, human operators may maneuver the agent and record its trajectories [15, 12], or, starting from an initial safe maneuver, the set of safe policies can be expanded by exploring the states [4, 5]. It is also possible that an agent continues exploring without entering the states with low long-term rewards associated with some safe maneuver (e.g., [16]). Due to the inherent uncertainty, the worst-case scenario (e.g., the lowest possible rewards) is typically taken into account when expanding the set of safe policies [13, 17]. To address this uncertainty in nonlinear-model estimation tasks, Gaussian process regression [18] is a strong tool, and many safe learning studies have taken advantage of its properties (e.g., [13, 4, 7, 6, 10]).

Nevertheless, when the agent dynamics is nonstationary, the assumptions often made in the safe learning literature no longer hold. In such cases, strictly guaranteeing safety is unrealistic. For nonstationary agent dynamics, stable tracking of the agent dynamics is desirable to mitigate the negative effect of an unexpected violation of safety. Moreover, the long-term rewards must also be learned in an adaptive manner. These are the core motivations of this paper.

To constrain the states within a desired safe region, we employ control barrier functions (cf. [19, 20, 21, 22, 23, 24]). When the exact model of the agent dynamics is available, control barrier certificates ensure that an agent remains in the set of safe states for all time by constraining the instantaneous control input at each time, and that an agent outside of the set of safe states is forced back to safety (Proposition III.1). A useful property of control barrier certificates is non-conservativeness, i.e., they modify policies only when violations of safety are truly imminent. On the other hand, the global optimality of solutions to the constrained policy optimization is necessary to ensure the greedy improvement of a policy. The first contribution of this paper is to propose a discrete-time control barrier certificate which ensures this global optimality under some mild conditions (see Section IV-C and Theorem IV.4 therein). This is an improvement over the previously proposed discrete-time control barrier certificate [24].

When the agent dynamics varies, the current estimate is no longer valid, possibly causing violations of safety. Therefore, we wish to adaptively learn the agent dynamics, and eventually bring the agent back to safety. To this end, we employ adaptive filtering techniques with a stable tracking (or monotone approximation) property: the current estimate is guaranteed to monotonically approach the target system in the Hilbertian-norm sense (see Section III-B). This guarantee is particularly important for safety-critical applications including robotics, and thus has high affinity with control theory. As the estimate becomes accurate, control barrier certificates will eventually force the agent back to the safe set, hopefully before it suffers unrecoverable damage. In this paper, we employ a kernel adaptive filter [25] for nonlinear agent dynamics, which is an adaptive extension of kernel ridge regression [26, 27] or Gaussian processes. The multikernel adaptive filter (cf. [28, 29, 30, 31, 32] and Appendix F) is a state-of-the-art kernel adaptive filter, which adaptively achieves a compact representation of a nonlinear function containing multi-component/partially-linear functions, and has a monotone approximation property for a possibly varying target function. The second contribution of this paper is to guarantee Lyapunov stability of the safe set after the dynamics changes (Theorem IV.1), while adaptively learning the dynamic structure of the model by regarding a model of the agent dynamics as a combination of multiple structural components and employing a sparse optimization (see Sections III-C and IV-B). The key idea is the use of an adaptive sparse optimization to extract the truly active structural components.

Lastly, the action-value function, which approximates the long-term rewards, needs to be adaptively estimated under nonstationarity. Therefore, we wish to fully exploit nonlinear adaptive filtering techniques. Indeed, many attempts have been made to apply online learning techniques to reinforcement learning (see [33, 34, 35, 36, 37, 38]). As a result, so-called off-policy approaches, which are convergent even when samples are not generated by the target policy (see [34]), have been proposed. However, what differentiates action-value function approximation from ordinary supervised learning, where input-output pairs are given, is that the output of the true action-value function is not explicitly observed. The final contribution of this paper is, by assuming deterministic agent dynamics, to appropriately reformulate the action-value function approximation problem so that any kernel-based learning, a widely studied nonparametric technique, becomes straightforwardly applicable (Theorem IV.3).

To validate our learning framework and clarify each contribution, we first conduct simulations of a quadrotor. We then conduct real-robot experiments on a brushbot, whose dynamics is unknown, highly complex, and most probably nonstationary, to test the efficacy of our framework (see Section V). This is challenging due to the many uncertainties and the lack of the simulators often used in applications of reinforcement learning to robotics (see [39] for reinforcement learning in robotics).

## Ii Preliminaries

In this section, we present some of the related work and the system model considered in this paper. Throughout, ℝ, Z≥0, and Z>0 denote the sets of real numbers, nonnegative integers, and positive integers, respectively. Let ∥·∥_H denote the norm induced by the inner product ⟨·,·⟩_H of an inner-product space H; in particular, ∥x∥ := √(xᵀx) for d-dimensional real vectors x ∈ ℝ^d, where (·)ᵀ stands for transposition. Let x_n ∈ X ⊂ ℝ^{n_x} and u_n ∈ U ⊂ ℝ^{n_u} denote the state and the control input at time instant n ∈ Z≥0, respectively.

### Ii-a Related Work

The primary focus of this paper is the safety issue while exploring. Typically, some initial knowledge, such as an initial safe policy or a model of the agent dynamics, is required to address the safety issue while exploring, and model learning is often employed together. We introduce some related work on model learning and kernel-based action-value function approximation.

#### Ii-A1 Model Learning for Safe Maneuver

The recent work in [13], [7], and [4] assumes an initial conservative set of safe policies, which is gradually expanded as more data become available. These approaches are designed for stationary agent dynamics, and Gaussian processes are employed to obtain the confidence interval of the model. To ensure safety, control barrier functions and control Lyapunov functions are employed in [13] and [4], respectively. On the other hand, the work in [10] uses a trajectory optimization based on receding horizon control together with model learning by Gaussian processes, which is computationally expensive when the model is highly nonlinear.

As analyzed in Section V-A later in this paper, Gaussian processes cannot adapt to an abrupt and unexpected change of the agent dynamics (although introducing forgetting factors would improve adaptivity, it is not straightforward to determine a suitable forgetting factor that guarantees any form of monotone approximation property under such a change). Hence, we employ an adaptive filter with a monotone approximation property, which shares similar ideas with stable online learning for adaptive control based on Lyapunov stability (cf. [40, 41, 42, 43], for example).

#### Ii-A2 Learning Dynamic Structures in Reproducing Kernel Hilbert Spaces

An approach that learns dynamics so that the resulting model satisfies the Euler-Lagrange equation in reproducing kernel Hilbert spaces (RKHSs) was proposed in [44], whereas our paper proposes adaptive learning of a control-affine structure in RKHSs.

#### Ii-A3 Reinforcement Learning in Reproducing Kernel Hilbert Spaces

We briefly introduce the ideas behind existing action-value function approximation techniques. Given a policy ϕ, the action-value function associated with the policy is defined as

$$Q^\phi(x,\phi(x)) = V^\phi(x) := \sum_{n=0}^{\infty} \gamma^n R(x_n,\phi(x_n)), \tag{II.1}$$

where γ ∈ (0, 1) is the discount factor, {x_n} is the trajectory of the agent starting from x_0 = x under the policy ϕ, and R is the immediate reward. It is known that the action-value function follows the Bellman equation (cf. [2, Equation (66)]):

$$Q^\phi(x_n,u_n) = \gamma Q^\phi(x_{n+1},\phi(x_{n+1})) + R(x_n,u_n). \tag{II.2}$$

If the states and controls live in a grid world, an optimal policy can be obtained by a greedy search.
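For intuition, the grid-world case can be made concrete with a small tabular example; the chain layout, reward, and discount factor below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Toy grid world: states 0..4 on a chain, actions move left/right, reward 1
# for stepping onto state 4, discount gamma = 0.9. Iterating the Bellman
# equation (II.2) to a fixed point and taking argmax_u Q(x, u) yields a
# greedy optimal policy.
n_states, gamma = 5, 0.9
actions = [-1, 1]  # left, right

def step(x, a):
    return min(max(x + a, 0), n_states - 1)

def reward(x, a):
    return 1.0 if step(x, a) == n_states - 1 else 0.0

Q = np.zeros((n_states, len(actions)))
for _ in range(500):
    Q_new = np.array([[reward(x, a) + gamma * Q[step(x, a)].max()
                       for a in actions] for x in range(n_states)])
    if np.abs(Q_new - Q).max() < 1e-10:
        Q = Q_new
        break
    Q = Q_new

greedy = Q.argmax(axis=1)  # "move right" everywhere on this chain
```

For continuous states and controls this table no longer exists, which motivates the function approximators discussed next.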

However, for robotics applications, where the states and controls are continuous, we need some form of function approximator for the action-value function (and/or policies). Nonparametric learning such as a kernel method is often desirable when a priori knowledge about a suitable set of basis functions is unavailable (we refer the readers to [45] for a summary of parametric value function approximation techniques). Kernel-based reinforcement learning has been studied in the literature, e.g., [36, 37, 46, 47, 48, 49, 35, 50, 51, 52, 53, 54, 55, 56]. Although the outputs of the action-value function are unobserved and supervised learning methods cannot be directly applied, so-called off-policy methods (e.g., residual learning [33], the least squares temporal difference algorithm [57], and gradient temporal difference learning [58, 34]) are proved to converge under certain conditions, as are most supervised learning methods, even when samples are not generated by the target policy. The least squares temporal difference algorithm has been extended to kernel-based methods [37], including Gaussian processes (e.g., Gaussian process temporal difference and Gaussian process SARSA [35]).

In contrast to the aforementioned approaches, rather than presenting an ad-hoc kernel-based algorithm for action-value function approximation, we explicitly define a reproducing kernel Hilbert space in which action-value function approximation can be conducted as supervised learning. Consequently, any kernel-based method can be straightforwardly applied. Gaussian process SARSA can also be reproduced by employing a Gaussian process in the explicitly defined RKHS, as will be discussed in Appendix G. We can also conduct action-value function approximation in the same RKHS even after the agent dynamics changes or the policy is updated, provided an adaptive filter is employed in the RKHS (see the remark below Theorem IV.3 and Section V-A2).

Specifically, in this paper, a possibly nonstationary agent dynamics is considered as described below.

### Ii-B System Model

In this paper, we consider the following discrete-time deterministic nonlinear model of the nonstationary agent dynamics,

$$x_{n+1} - x_n = p(x_n,u_n) + f(x_n) + g(x_n)u_n, \tag{II.3}$$

where p : X × U → ℝ^{n_x}, f : X → ℝ^{n_x}, and g : X → ℝ^{n_x×n_u} are continuous. Hereafter, we identify ℝ^{n_x×n_u} with ℝ^{n_x n_u} under the obvious one-to-one correspondence if there is no confusion.

We consider an agent with dynamics given in (II.3), and the goal is to find an optimal policy which drives the agent to a desirable state while remaining in the set of safe states defined as

$$\mathcal{C} := \{x \in X \mid B(x) \ge 0\}, \tag{II.4}$$

where B : X → ℝ. An optimal policy is a policy that attains an optimal value for every state x ∈ X. Note that the value associated with a policy varies when the dynamics varies, and that a quadruple (x_n, u_n, x_{n+1}, R(x_n, u_n)) is available at each time instant n.

With these preliminaries in place, we can present the overview of our safe learning framework under possibly nonstationary dynamics.

## Iii Overview of the Safe Learning Framework

When the agent dynamics varies abruptly and unexpectedly, safety can no longer be guaranteed. In this case, we at least wish to bring the agent back to safety. Here, we introduce the methods employed and extended in this paper, and present the motivations for using them.

### Iii-a Discrete-time Control Barrier Functions

In this paper, we employ control barrier functions to deal with safety issues. The idea of control barrier functions is similar to that of Lyapunov functions; they require no explicit computation of the forward reachable set while ensuring certain properties by constraining the instantaneous control input. In particular, control barrier functions guarantee that an agent starting from the safe set remains safe (i.e., forward invariance), and that an agent outside of the safe set is forced back to safety (i.e., Lyapunov stability of the safe set) if the agent dynamics is available. To make the use of barrier certificates compatible with model learning and reinforcement learning, we employ the discrete-time version of control barrier certificates.

###### Definition III.1 ([24, Definition 4]).

A map B : X → ℝ is a discrete-time exponential control barrier function if there exists a control input u_n ∈ U such that

$$B(x_{n+1}) - B(x_n) \ge -\eta B(x_n), \quad \forall n \in \mathbb{Z}_{\ge 0},\ 0 < \eta \le 1. \tag{III.1}$$

Note that we intentionally removed the condition originally presented in [24, Definition 4]. Then, the forward invariance and asymptotic stability of the set of safe states are ensured by the following proposition.

###### Proposition III.1.

The set C defined in (II.4) for some discrete-time exponential control barrier function B is forward invariant when x_0 ∈ C, and is asymptotically stable when x_0 ∉ C.

###### Proof.

See [24, Proposition 4] for the proof of forward invariance. The set C is asymptotically stable since

$$\lim_{n\to\infty} B(x_n) \ge \lim_{n\to\infty} (1-\eta)^n B(x_0) = 0,$$

where the inequality holds from [24, Proposition 1]. ∎

Proposition III.1 implies that an agent remains in the set of safe states defined in (II.4) for all time if x_0 ∈ C and (III.1) is satisfied, and that an agent outside of the set of safe states is brought back to safety.
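To make Proposition III.1 concrete, consider a scalar toy system with a known model; the integrator dynamics, barrier function B(x) = 1 − x, and η below are hypothetical choices, and the certificate (III.1) reduces to a simple clip of the nominal input.

```python
# Scalar integrator x_{n+1} = x_n + u_n with safe set C = {x : B(x) >= 0},
# B(x) = 1 - x (all choices hypothetical). The certificate (III.1) becomes
# u <= eta * (1 - x), so the constrained policy is a clip of the nominal input.
eta = 0.5

def B(x):
    return 1.0 - x

def safe_input(x, u_nominal):
    # minimally modify the nominal input so that
    # B(x + u) - B(x) >= -eta * B(x)  <=>  u <= eta * (1 - x)
    return min(u_nominal, eta * (1.0 - x))

# Forward invariance: starting inside C with an aggressive nominal input,
# the state approaches the boundary but never leaves the safe set.
x, traj = 0.0, [0.0]
for _ in range(20):
    x += safe_input(x, u_nominal=0.8)
    traj.append(x)

# Asymptotic stability: starting outside C, the agent is forced back,
# with B(y_n) >= (1 - eta)^n B(y_0) -> 0.
y = 2.0
for _ in range(50):
    y += safe_input(y, u_nominal=0.0)
```

Note how the nominal input is left untouched whenever it already satisfies the certificate, which is exactly the non-conservativeness discussed below.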

The main motivations of using control barrier functions are given below:

1. Non-conservativeness, i.e., control barrier functions modify policies only when violations of safety are imminent. Consequently, an inaccurate or rough estimate of the model has less negative effect on (model-free) reinforcement learning. This is not true for control Lyapunov functions, which enforce the decrease condition even inside the safe set. The differences between control barrier functions and control Lyapunov functions are well analyzed in [59], for example.

2. Asymptotic stability of the safe set, i.e., if the agent is outside of the safe set, it is brought back to the safe set. In addition to Proposition III.1, this robustness property is analyzed in [19]. This property, together with the adaptive model learning presented in the next subsection, is particularly important when the nonstationarity of the dynamics pushes the agent out of the safe set.

To enforce barrier certificates, we need a model of the agent dynamics, and for a possibly nonstationary agent dynamics, we need to adaptively learn the model.

### Iii-B Adaptive Model Learning with Stable Tracking Property

At each time instant n, an input-output pair (z_n, y_n), where z_n := [x_n; u_n] and y_n := x_{n+1} − x_n, is available for model learning. Under possibly nonstationary agent dynamics, it is vital for the model-parameter estimation to be stable even after the agent dynamics changes. In this paper, we employ adaptive learning with a monotone approximation property. Note that this approach shares its motivation with stable online learning under Lyapunov-like conditions.

Suppose that the estimate of the model parameter at time instant n is given by h_n. Given a cost function at time instant n with solution set Ω_n, we update the parameter so as to satisfy the strictly monotone approximation property if h_n ∉ Ω_n ≠ ∅, where ∅ is the empty set. Under nonstationarity, this monotone approximation property means that, no matter how the target vector (function) changes, we can at least guarantee that the current estimate gets closer to the current target.

Assume that Ω_n is nonempty. Then ∥h_{n+1} − h*∥ < ∥h_n − h*∥ holds for every h* ∈ Ω_n if h_n ∉ Ω_n. This is illustrated in Figure III.1. This property can also be viewed as Lyapunov stability of the model parameter.

By augmenting the state vector with the model parameter, and by employing a suitable candidate of Lyapunov functions, we can thus guarantee that the agent is brought back to safety even after abrupt and unexpected changes of the agent dynamics (see Section IV-A).

For general nonlinear dynamics, we use a kernel adaptive filter with the monotone approximation property. Thanks to the celebrated reproducing property of kernels, the framework of linear adaptive filtering applies directly to nonlinear function estimation tasks in a possibly infinite-dimensional functional space, namely a reproducing kernel Hilbert space.

###### Definition III.2 ([60, page 343]).

Given a nonempty set Z and a Hilbert space H of functions defined on Z, the function κ : Z × Z → ℝ is called a reproducing kernel of H if

1. for every w ∈ Z, κ(·, w) as a function on Z belongs to H, and

2. it has the reproducing property, i.e., the following holds for every w ∈ Z and every φ ∈ H:

$$\varphi(w) = \langle \varphi, \kappa(\cdot,w)\rangle_{\mathcal{H}}.$$

If H has a reproducing kernel, H is called a reproducing kernel Hilbert space (RKHS).

One celebrated example of a kernel is the Gaussian kernel κ(x, y) := exp(−∥x − y∥²/(2σ²)), σ > 0, x, y ∈ ℝ^d. It is well known that the Gaussian reproducing kernel Hilbert space is universal [61], i.e., any continuous function on any compact subset of ℝ^d can be approximated with arbitrary accuracy. Another widely used kernel is the polynomial kernel κ(x, y) := (xᵀy + c)^r, c ≥ 0, r ∈ Z>0.
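A quick numerical illustration of this universality, with arbitrary hyperparameters: kernel ridge regression in the Gaussian RKHS drives the sup error against a continuous target on a compact interval close to zero.

```python
import numpy as np

def gram(A, B, sigma=0.5):
    # Gaussian kernel Gram matrix between 1-D sample vectors A and B
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

def f(x):
    # an arbitrary continuous target on the compact interval [-2, 2]
    return np.tanh(3 * x) + 0.1 * x ** 2

X = np.linspace(-2, 2, 60)
K = gram(X, X)
alpha = np.linalg.solve(K + 1e-6 * np.eye(len(X)), f(X))  # ridge fit

Xt = np.linspace(-2, 2, 201)
sup_err = np.max(np.abs(gram(Xt, X) @ alpha - f(Xt)))     # small sup error
```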

We emphasize that the monotone approximation property, which stems from the convexity of the formulations, is the main motivation for using a kernel adaptive filter for adaptive model learning.
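As a concrete (and much simplified) instance of such a filter, the sketch below runs a kernel normalized-LMS update against a target function that changes abruptly mid-stream; the kernel width, step size, and sinusoidal targets are illustrative assumptions, not the multikernel algorithm of Appendix F.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, step = 0.5, 0.5
dict_x, coef = [], []          # growing kernel dictionary and coefficients

def gauss(a, b):
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def predict(x):
    return sum(c * gauss(x, d) for c, d in zip(coef, dict_x))

def update(x, y):
    # NLMS-style projection step: move the estimate toward the set of
    # functions consistent with the new sample (x, y).
    err = y - predict(x)
    dict_x.append(x)
    coef.append(step * err)    # gauss(x, x) = 1, so no extra normalization

target, errs = np.sin, []
for n in range(400):
    if n == 200:               # abrupt, unexpected change of the target
        target = np.cos
    x = rng.uniform(-3.0, 3.0)
    errs.append(abs(target(x) - predict(x)))  # a-priori tracking error
    update(x, target(x))
```

On a typical run the a-priori error shrinks, jumps when the target switches, and shrinks again as the filter re-adapts, which is the tracking behaviour exploited in Section IV-A.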

### Iii-C Learning Dynamic Structure via Sparse Optimizations

To efficiently constrain policies by using control barrier functions, a dynamic structure called the control-affine structure is preferable (see Section IV-C and Theorem IV.4 therein). Control-affine dynamics is given by (II.3) when p = 0, where 0 denotes the null function. In practice, it is unrealistic to have a completely control-affine model (i.e., p = 0) due to the effects of friction and other disturbances. However, as long as the term p is negligibly small, we can regard it as a system noise added to a control-affine system, as discussed in Section IV-C and Theorem IV.4 therein, and we take the following steps to extract the control-affine structure:

1. Define , where .

2. Assume for simplicity that . We suppose that , , and , where , , and are RKHSs, and .

3. Let be the RKHS associated with the reproducing kernel , i.e., a polynomial kernel with and , and be the set of constant functions. Then, the function can be estimated in the Hilbert space . The Hilbert space is an RKHS (see Section IV-B and Theorem IV.2 therein), and hence we can employ a kernel adaptive filter.

4. Define the cost so as to promote sparsity of the model parameter. Consequently, we obtain a control-affine model (i.e., the estimate of p becomes null) if the underlying true dynamics is control affine.

The resulting control-affine part of the estimated dynamics will be used in combination with the control barrier certificates in order to efficiently constrain policies so that the agent remains in the set of safe states, and is stabilized on that set even after an abrupt and unexpected change of the agent dynamics, during and after learning an optimal policy (see Theorem IV.1 and Theorem IV.4 for more details).
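The steps above can be sketched as a small sparsity-regularized regression: three feature blocks model f(x), g(x)u, and p(x, u), and ISTA-style soft thresholding on the non-affine block drives its coefficients to zero when the data come from a control-affine system. The feature choices, penalty weight, and toy dynamics are assumptions of this sketch, not the paper's multikernel construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-2.0, 2.0, n)
u = rng.uniform(-1.0, 1.0, n)
dx = np.sin(x) + 0.5 * u                     # true dynamics is control affine

def gk(a, c, s=0.7):
    # Gaussian feature map of samples a evaluated at centers c
    return np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * s ** 2))

cx = np.linspace(-2, 2, 10)
cu = np.linspace(-1, 1, 5)
Phi_f = gk(x, cx)                             # block modeling f(x)
Phi_g = gk(x, cx) * u[:, None]                # block modeling g(x) * u
Phi_p = (gk(x, cx)[:, :, None] * gk(u, cu)[:, None, :]).reshape(n, -1)  # p(x, u)
A = np.hstack([Phi_f, Phi_g, Phi_p])

w = np.zeros(A.shape[1])
t = 1.0 / np.linalg.norm(A, 2) ** 2           # ISTA step size
lam = 5.0                                     # l1 weight on the p-block only
for _ in range(5000):
    w -= t * (A.T @ (A @ w - dx))             # gradient step on the LS loss
    wp = w[20:]                               # soft-threshold the p-block
    w[20:] = np.sign(wp) * np.maximum(np.abs(wp) - t * lam, 0.0)

rms = np.sqrt(np.mean((A @ w - dx) ** 2))     # fit quality of the model
p_mass = np.abs(w[20:]).max()                 # ~0: extracted model is affine
```

The surviving f- and g-blocks then provide exactly the control-affine model the barrier certificates need.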

### Iii-D Adaptive Action-value Function Approximation in RKHSs

Lastly, we present barrier-certified action-value function approximation in RKHSs. One of the issues arising when applying a kernel method to action-value function approximation is that the output of the action-value function Q^ϕ associated with a policy ϕ, where Q^ϕ is assumed to belong to an RKHS, is unobservable. Nevertheless, we know that the action-value function follows the Bellman equation (II.2). Hence, by defining a function ψ_Q, where z = [x; u] and w = [y; v], as

$$\psi_Q([z;w]) := Q^\phi(x,u) - \gamma Q^\phi(y,v), \quad x,y \in X,\ u,v \in U,\ z=[x;u],\ w=[y;v],$$

the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs ([z_n; w_n], R(x_n, u_n)), where z_n := [x_n; u_n] and w_n := [x_{n+1}; ϕ(x_{n+1})].
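Concretely, the training pairs can be assembled from a rollout of a deterministic toy system under a fixed policy (dynamics, policy, and reward below are hypothetical):

```python
# Constructing the supervised pairs implied by (II.2): inputs [z_n; w_n]
# with z_n = [x_n; u_n], w_n = [x_{n+1}; phi(x_{n+1})], outputs R(x_n, u_n).
def phi(x):            # current policy
    return -0.5 * x

def step(x, u):        # deterministic dynamics
    return x + u

def R(x, u):           # immediate reward
    return -(x * x + u * u)

x, pairs = 2.0, []
for _ in range(10):
    u = phi(x)
    x_next = step(x, u)
    pairs.append((([x, u], [x_next, phi(x_next)]), R(x, u)))
    x = x_next
```

Each pair treats the reward as the observed "output" of ψ_Q, sidestepping the fact that Q itself is never observed.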

To theoretically guarantee the greedy improvement of policies, a globally optimal control input within the control constraints has to be taken at each state. Nevertheless, it is known that discrete-time barrier certificates yield non-convex constraints in general. Therefore, we will present certain conditions under which the control constraint becomes convex in the control (Section IV-C and Theorem IV.4 therein).

## Iv Analysis of Barrier-certified Adaptive Reinforcement Learning

In the previous section, we presented our barrier-certified adaptive reinforcement learning framework and the motivations of employing each method. In this section, we present theoretical analysis of our framework to further strengthen the arguments.

### Iv-a Safety Recovery: Adaptive Learning and Control Barrier Certificates

The monotone approximation property of the model parameter is closely related to Lyapunov stability. In fact, by augmenting the state vector with the model parameter, we can construct a Lyapunov function which guarantees stability of the safe set under certain conditions.

We first make the following assumptions.

###### Assumption IV.1.
1. The dimension of the model parameter h remains finite.

2. The input space is invariant.

3. All of the basis functions (or kernel functions) are bounded over the input space.

4. The control barrier function B is Lipschitz continuous over X with Lipschitz constant ν_B.

5. There exists a control input satisfying, for a sufficiently small ϱ₁ > 0,

$$B(\hat{x}_{n+1}) - B(x_n) \ge -\eta B(x_n) + \varrho_1, \quad \forall n \in \mathbb{Z}_{\ge 0},\ 0 < \eta \le 1, \tag{IV.1}$$

where x̂_{n+1} is the predicted output of the current estimate of the dynamics at [x_n; u_n].

6. If h_n ∈ Ω_n, where Ω_n is the solution set of the cost function at time instant n, then the prediction error at [x_n; u_n] is sufficiently small.

###### Remark IV.1 (On Assumption iv.1.1).

Assumption IV.1.1 is reasonable if polynomial kernels are employed for learning or if the input space is compact.

###### Remark IV.2 (On Assumption iv.1.5).

Assumption IV.1.5 implies that we can enforce barrier certificates for the current estimate of the dynamics with a sufficiently small margin ϱ₁. Although this assumption is somewhat restrictive, it is still reasonable if the initial estimate does not deviate largely from the true dynamics.

###### Remark IV.3 (On Assumption iv.1.6).

Assumption IV.1.6 implies that the set Ω_n, or equivalently the cost, is designed so that the predicted output for any parameter in Ω_n is sufficiently close to the true output, and this assumption is thus reasonable.

Let the augmented state be [x; h]. Then, the following theorem states that the safe set is (asymptotically) stable even after a violation of safety due to an abrupt and unexpected change of the agent dynamics.

###### Theorem IV.1.

Suppose that a triple (x_n, u_n, x_{n+1}) is available at time instant n + 1, and the model parameter is updated as h_{n+1} = T(h_n), where T is continuous and has the monotone approximation property: if h_n ∉ Ω, then dist(h_{n+1}, Ω) ≤ ϱ₂ dist(h_n, Ω) for all n ≥ n₀ and for some ϱ₂ ∈ [0, 1). Suppose also that the agent dynamics changes unexpectedly at time instant n₀, and the set Ω is nonempty. Then, under Assumption IV.1, the augmented safe set C × Ω is asymptotically stable if a control input satisfying (IV.1) is employed for all n ≥ n₀, and if h_{n+1} ∈ Ω for all n such that h_n ∈ Ω.

###### Proof.

From Assumptions IV.1.3 and IV.1.6, and the fact that the estimated output is linear in the model parameter for a fixed input, we obtain

$$\|x_{n+1} - \hat{x}_{n+1}\|_{\mathbb{R}^{n_x}} - \frac{\varrho_1}{\nu_B} \le \varrho_3\,\mathrm{dist}(h_n,\Omega_n),$$

for some bounded ϱ₃ > 0. Therefore, from Assumption IV.1.4, we obtain

$$|B(x_{n+1}) - B(\hat{x}_{n+1})| - \varrho_1 \le \nu_B\|x_{n+1} - \hat{x}_{n+1}\|_{\mathbb{R}^{n_x}} - \varrho_1 \le \nu_B\varrho_3\,\mathrm{dist}(h_n,\Omega_n) \le \frac{\nu_B\varrho_3}{1-\varrho_2}\left[\mathrm{dist}(h_n,\Omega) - \mathrm{dist}(h_{n+1},\Omega)\right],$$

if h_n ∉ Ω. In that case, we obtain

$$B(x_{n+1}) - B(\hat{x}_{n+1}) \ge -\varrho_1 - \frac{\nu_B\varrho_3}{1-\varrho_2}\left[\mathrm{dist}(h_n,\Omega) - \mathrm{dist}(h_{n+1},\Omega)\right]. \tag{IV.2}$$

This inequality also holds in the remaining cases. We show that there exists a Lyapunov function for the augmented state [x; h]. A candidate function is given by

$$V_{\mathcal{C}\times\Omega}([x;h]) = \begin{cases} 0 & \text{if } [x;h] \in \mathcal{C}\times\Omega, \\ -\min(B(x),0) + \dfrac{2\nu_B\varrho_3}{1-\varrho_2}\,\mathrm{dist}(h,\Omega) & \text{if } [x;h] \notin \mathcal{C}\times\Omega. \end{cases}$$

Since B(x) = 0 for x ∈ ∂C, where ∂C is the boundary of the set C, it follows from Assumption IV.1.4 that V_{C×Ω} is continuous. It also holds that V_{C×Ω}([x;h]) > 0 when [x;h] ∉ C × Ω, and that

$$\begin{aligned} V_{\mathcal{C}\times\Omega}([x_{n+1};h_{n+1}]) - V_{\mathcal{C}\times\Omega}([x_n;h_n]) ={}& -\min(B(x_{n+1}),0) + \frac{2\nu_B\varrho_3}{1-\varrho_2}\,\mathrm{dist}(h_{n+1},\Omega) \\ & + \min(B(x_n),0) - \frac{2\nu_B\varrho_3}{1-\varrho_2}\,\mathrm{dist}(h_n,\Omega) \\ \le{}& -\frac{\nu_B\varrho_3}{1-\varrho_2}\left[\mathrm{dist}(h_n,\Omega) - \mathrm{dist}(h_{n+1},\Omega)\right] \le 0, \end{aligned} \tag{IV.3}$$

for all n ≥ n₀, where the first inequality follows because (IV.1) and (IV.2) yield

$$-\min(B(x_{n+1}),0) \le \frac{\nu_B\varrho_3}{1-\varrho_2}\left[\mathrm{dist}(h_n,\Omega) - \mathrm{dist}(h_{n+1},\Omega)\right],$$

for x_n ∈ C, and

$$-\min(B(x_{n+1}),0) + \min(B(x_n),0) \le \frac{\nu_B\varrho_3}{1-\varrho_2}\left[\mathrm{dist}(h_n,\Omega) - \mathrm{dist}(h_{n+1},\Omega)\right],$$

for x_n ∉ C, both of which imply (IV.3). Moreover, if [x_n; h_n] ∈ C × Ω, then, by the monotone approximation property, h_{n+1} remains in Ω, and from (IV.2), the control barrier certificate (III.1) is thus ensured by a control input satisfying (IV.1), rendering the set C × Ω forward invariant. Therefore, from [62, Theorem 1], the set C × Ω is asymptotically stable if h_{n+1} ∈ Ω for all n such that h_n ∈ Ω. ∎

###### Remark IV.4 (On Theorem iv.1).

The monotone approximation property plays a key role. If, for example, Bayesian learning methods such as Gaussian processes are employed for model learning, then it is in general hard to guarantee any form of monotone approximation after an unexpected and abrupt change of the agent dynamics. This will be analyzed in Section V-A.

For an agent dynamics that keeps changing, the augmented state can be regarded as following a hybrid system, and hence stability should be analyzed under additional assumptions in this case. Such an analysis is beyond the scope of this paper, and we omit the details.

### Iv-B Structured Model Learning

We have seen that, by employing model learning with the monotone approximation property under Assumption IV.1, the agent is stabilized on the set of safe states even after an abrupt and unexpected change of the agent dynamics. To efficiently enforce the control barrier certificates (IV.1), control-affine models are desirable, as will be discussed in Section IV-C and Theorem IV.4 therein. Here, we propose a model learning technique that also learns the dynamic structure. We assume that for simplicity (we can employ approximators if ).

First, we show that the space (see Section III-C) is an RKHS.

###### Lemma IV.1.

The space is an RKHS associated with the reproducing kernel , with the inner product defined as , .

###### Proof.

See Appendix A. ∎

Then, the following lemma implies that can be approximated in the sum space of RKHSs denoted by .

###### Lemma IV.2 ([63, Theorem 13]).

Let and be two RKHSs associated with the reproducing kernels and

. Then the completion of the tensor product of

and , denoted by , is an RKHS associated with the reproducing kernel .

From Lemmas IV.1 and IV.2, we can now assume that and , where is an estimate of . As such, can be approximated in the RKHS .

Second, the following theorem ensures that can be uniquely decomposed into , , and in the RKHS .

###### Theorem IV.2.

Assume that X and U have nonempty interiors, and that the RKHS over the state is a Gaussian RKHS. Then, the sum space is the direct sum of the three component RKHSs, i.e., the intersection of any two of them is {0}.

###### Proof.

See Appendix B. ∎

###### Remark IV.5 (On Theorem iv.2).

Because only the control-affine part of the learned model will be used in combination with barrier certificates (see Assumption IV.2 and Theorem IV.4), and the term p is regarded as a system noise added to the control-affine dynamics (see Section III-C), the unique decomposition is crucial; if the unique decomposition did not hold, the term p would be able to capture the overall dynamics, including the control-affine terms.

Therefore, we can employ a multikernel adaptive filter working in the sum space . By applying sparse optimization to the coefficient vector , we aim to extract the structure of the model; the term should be dropped when the true agent dynamics is control affine.
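A minimal sketch of such a sparsity-promoting update, assuming an LMS-type gradient step followed by an ℓ1 proximal (soft-thresholding) step over a fixed kernel dictionary (the paper's actual multikernel algorithm may differ):

```python
import numpy as np

def soft_threshold(h, lam):
    # Proximal operator of the l1 norm: drives small coefficients to zero.
    return np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)

def sparse_mkaf_step(h, k_vec, y, step=0.1, lam=0.01):
    """One sparsity-promoting adaptive-filter update (hypothetical sketch).

    h     : coefficient vector over all dictionary kernels
    k_vec : kernel evaluations at the new input
    y     : observed output (e.g., a next-state component)
    """
    err = y - h @ k_vec            # instantaneous prediction error
    h = h + step * err * k_vec     # LMS-type gradient step
    return soft_threshold(h, lam)  # l1 proximal step prunes weak terms

# Toy run: one strongly relevant kernel, several weakly relevant ones.
h = np.zeros(5)
k_vec = np.array([1.0, 0.5, 0.1, 0.0, 0.2])
for _ in range(200):
    h = sparse_mkaf_step(h, k_vec, y=1.0)
```

After convergence, only the dominant dictionary entry retains a nonzero coefficient, which is the mechanism by which the residual term can be "dropped off" when the dynamics is control affine.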

In order to use the learned model in combination with control barrier functions, each entry of the vector is required. Assume, without loss of generality, that (this is always possible for by transforming the coordinates of the control inputs and reducing the dimension if necessary). Then, the th entry of the vector is given by . Finally, we can use the learned model to effectively constrain control inputs by control barrier functions for policy updates. Adaptive action-value function approximation with barrier-certified policy updates will be presented in the next subsection.

### Iv-C Adaptive Action-Value Function Approximation With Barrier-Certified Policy Updates

So far, we have shown that an agent can be brought back to safety by employing control barrier certificates together with adaptive model learning possessing the monotone approximation property, and that a control-affine structure can be extracted by a sparse adaptive filter working in a suitably constructed RKHS. In this subsection, we present an adaptive action-value function approximation with barrier-certified policy updates.

We showed in Section III-D that the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs . The following theorem states that the iterative learning can be conducted in an RKHS, and hence any kernel-based method can be employed for action-value function approximation.

###### Theorem IV.3.

Suppose that is an RKHS associated with the reproducing kernel . Define, for ,

$$\mathcal{H}_{\psi^Q} := \left\{ \varphi \,\middle|\, \varphi([z;w]) = \varphi^Q(z) - \gamma \varphi^Q(w),\ \exists \varphi^Q \in \mathcal{H}_Q,\ \forall z, w \in \mathcal{Z} \right\}.$$

Then, the operator defined by , is bijective. Moreover, is an RKHS with the inner product defined by

$$\langle \varphi_1, \varphi_2 \rangle_{\mathcal{H}_{\psi^Q}} := \langle \varphi_1^Q, \varphi_2^Q \rangle_{\mathcal{H}_Q}, \tag{IV.4}$$
$$\varphi_i([z;w]) := \varphi_i^Q(z) - \gamma \varphi_i^Q(w), \quad \forall z, w \in \mathcal{Z},\ i \in \{1, 2\}.$$

The reproducing kernel of the RKHS is given by

$$\kappa([z;w], [\tilde{z};\tilde{w}]) := \left( \kappa_Q(z, \tilde{z}) - \gamma \kappa_Q(z, \tilde{w}) \right) - \gamma \left( \kappa_Q(w, \tilde{z}) - \gamma \kappa_Q(w, \tilde{w}) \right), \quad z, w, \tilde{z}, \tilde{w} \in \mathcal{Z}. \tag{IV.5}$$
###### Proof.

See Appendix C. ∎

From Theorem IV.3, any kernel-based method can be applied by assuming that for a policy . The estimate of , denoted by , is then obtained as , where is the estimate of . For instance, suppose that the estimate of for an input at time instant is given by

$$\hat{\psi}^Q_n([z;w]) := {h^Q_n}^{\mathsf T} k([z;w]),$$

where is the model parameter, and for and for defined by (IV.5). Then, the estimate of for an input z at time instant is given by

$$\hat{Q}^{\phi}_n(z) := {h^Q_n}^{\mathsf T} k_Q(z), \tag{IV.6}$$

where .
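To make the shared-parameter construction concrete, the following sketch (kernel choice, dictionary, and coefficients are all hypothetical) verifies numerically that the estimate built from the difference kernel of (IV.5) and the action-value estimate of (IV.6) share the same coefficient vector and satisfy $\hat{\psi}^Q_n([z;w]) = \hat{Q}^{\phi}_n(z) - \gamma \hat{Q}^{\phi}_n(w)$:

```python
import numpy as np

gamma = 0.9  # discount factor (assumed value)

def kappaQ(z1, z2):
    # Assumed Gaussian base kernel kappa_Q on Z.
    return np.exp(-np.sum((z1 - z2) ** 2))

# Hypothetical dictionary of state-transition pairs [z_j; w_j]
# and a shared coefficient vector h_n^Q.
rng = np.random.default_rng(0)
dictionary = [(rng.standard_normal(2), rng.standard_normal(2)) for _ in range(4)]
h = rng.standard_normal(4)

def kQ_vec(z):
    # j-th entry: kappa_Q(z, z_j) - gamma * kappa_Q(z, w_j), cf. (IV.6).
    return np.array([kappaQ(z, zj) - gamma * kappaQ(z, wj)
                     for zj, wj in dictionary])

def Q_hat(z):
    # Action-value estimate: h^T k_Q(z).
    return h @ kQ_vec(z)

def psi_hat(z, w):
    # Estimate built from the difference kernel of (IV.5);
    # it reuses exactly the same coefficients h.
    return h @ (kQ_vec(z) - gamma * kQ_vec(w))
```

The identity holds exactly by construction, which is why the action-value estimate comes "for free" once the difference-kernel estimate is learned.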

###### Remark IV.6 (On Theorem IV.3).

Because the domain of is defined as instead of , the RKHS does not depend on the agent dynamics, and we can conduct adaptive learning in the same even after the dynamics changes or the policy is updated. This is especially important when analyzing the convergence and/or monotone approximation properties of action-value function approximation under possibly nonstationary agent dynamics (see Section V-A2, for example). As discussed in Appendix G, Gaussian process SARSA can also be reproduced by applying a Gaussian process in the space , although Gaussian process SARSA and other kernel-based action-value function approximations are ad hoc, being designed to learn the action-value function associated with a fixed policy and a stationary agent dynamics.

When the parameter for the estimator monotonically approaches an optimal point in the Euclidean-norm sense, so does the model parameter for the action-value function, because the same parameter is used to estimate and . Suppose we employ a method in which monotonically approaches an optimal function in the Hilbertian-norm sense. Then, the following corollary implies that the estimator of the action-value function also satisfies this monotonicity.

###### Corollary IV.1.

Let and , where . Then, if approaches in the Hilbertian-norm sense, i.e.,

$$\left\| \hat{\psi}^Q_{n+1} - \psi^Q_\ast \right\|_{\mathcal{H}_{\psi^Q}} \le \left\| \hat{\psi}^Q_n - \psi^Q_\ast \right\|_{\mathcal{H}_{\psi^Q}},$$

it holds that

$$\left\| \hat{Q}^{\phi}_{n+1} - Q^{\phi}_\ast \right\|_{\mathcal{H}_Q} \le \left\| \hat{Q}^{\phi}_n - Q^{\phi}_\ast \right\|_{\mathcal{H}_Q}.$$
###### Proof.

See Appendix D. ∎

Note that employing the action-value function enables us to use random control inputs instead of the target policy for exploration, and that we require no model of the agent dynamics for policy updates, as discussed below.

For a current policy , assume that the action-value function with respect to at time instant is available. Given a discrete-time exponential control barrier function and , the barrier-certified safe control space is defined as

$$\mathcal{S}(x_n) := \left\{ u_n \in \mathcal{U} \,\middle|\, B(x_{n+1}) - B(x_n) \ge -\eta B(x_n) \right\}.$$
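As a concrete illustration, membership in this safe control space might be checked as follows (the barrier function, the one-step model, and the decay rate are all hypothetical choices for the sketch):

```python
import numpy as np

eta = 0.5  # decay rate eta in (0, 1] (assumed)

def barrier(x):
    # Example barrier: the safe set {x : B(x) >= 0} is the unit ball.
    return 1.0 - float(x @ x)

def in_safe_control_set(x, u, step_model):
    # Membership test for S(x_n): B(x_{n+1}) - B(x_n) >= -eta * B(x_n).
    x_next = step_model(x, u)
    return barrier(x_next) - barrier(x) >= -eta * barrier(x)

# Hypothetical one-step model x_{n+1} = x_n + 0.1 u_n.
step = lambda x, u: x + 0.1 * u
```

Controls that push the state toward the boundary faster than the exponential decay rate allows are rejected, while inward-pointing controls pass the test.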

From Proposition III.1, the set defined in (II.4) is forward invariant and asymptotically stable if for all . Then, the updated policy given by

$$\phi^+(x) := \operatorname*{argmax}_{u \in \mathcal{S}(x)} \left[ Q^{\phi}(x, u) \right], \tag{IV.7}$$

is well known (e.g., [64, 65]) to satisfy , where is the action-value function with respect to at time instant . In practice, we use the estimate of because the exact function is unavailable. For example, the action-value function is estimated over iterations, and the policy is updated every iterations. To obtain analytical solutions of (IV.7), we follow the arguments in [35]. Suppose that is given by (IV.6). We define the reproducing kernel of as the tensor kernel given by

$$\kappa_Q([x;u], [y;v]) := \kappa_x(x, y)\, \kappa_u(u, v), \tag{IV.8}$$

where is, for example, defined by

$$\kappa_u(u, v) := 1 + \tfrac{1}{4}\, u^{\mathsf T} v.$$

Then, (IV.7) becomes

$$\phi^+(x) := \operatorname*{argmax}_{u \in \mathcal{S}(x)} \left[ {h^Q_n}^{\mathsf T} k_Q([x;u]) \right], \tag{IV.9}$$

where the objective being maximized is linear in u for each fixed x. Therefore, if the set is convex, any solution to (IV.9) is globally optimal, ensuring the greedy improvement of the policy.
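As a sanity check on this global-optimality argument, the sketch below (the Gaussian state kernel, the dictionary, the coefficients, and a box-shaped safe set are all assumptions for illustration) expands the objective of (IV.9) into an affine function of u and confirms that the vertex selected by the sign of the linear coefficient attains the maximum over the box:

```python
import itertools
import numpy as np

def kx(x, y):
    # Assumed Gaussian kernel on the state space.
    return np.exp(-np.sum((x - y) ** 2))

def linear_coeffs(h, x, dict_xu):
    # With the tensor kernel (IV.8) and kappa_u(u, v) = 1 + (1/4) u^T v,
    # Q(x, u) = sum_j h_j kx(x, x_j) (1 + (1/4) u^T u_j) = a(x) + b(x)^T u,
    # i.e., the objective of (IV.9) is affine in u for each fixed x.
    a = sum(hj * kx(x, xj) for hj, (xj, uj) in zip(h, dict_xu))
    b = 0.25 * sum(hj * kx(x, xj) * uj for hj, (xj, uj) in zip(h, dict_xu))
    return a, b

def greedy_policy(h, x, dict_xu, u_min, u_max):
    # Over a box-shaped (hence convex) safe set, a maximizer of the
    # affine objective sits at a vertex and is globally optimal.
    _, b = linear_coeffs(h, x, dict_xu)
    return np.where(b >= 0.0, u_max, u_min)

# Hypothetical dictionary of (state, input) pairs and coefficients.
rng = np.random.default_rng(1)
dict_xu = [(rng.standard_normal(2), rng.standard_normal(2)) for _ in range(5)]
h = rng.standard_normal(5)
x = np.array([0.1, -0.2])
u_min, u_max = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
u_star = greedy_policy(h, x, dict_xu, u_min, u_max)
```

With a general convex polytope instead of a box, the same affine objective would be maximized by a linear program, preserving the global-optimality guarantee.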

As pointed out in [24], is not a convex set in general. To ensure convexity, we consider the set under the following moderate assumptions:

###### Assumption IV.2.
1. The set is convex.

2. Existence of a Lipschitz continuous gradient of the barrier function: given

$$\mathcal{R} := \left\{ (1-t) x_n + t \left( \hat{f}_n(x_n) + \hat{g}_n(x_n) u \right) \,\middle|\, t \in [0, 1],\ u \in \mathcal{U} \right\},$$

there exists a constant such that the gradient of the discrete-time exponential control barrier function , denoted by , satisfies

 ∥∥∥∂B(a)∂x−