I Introduction
By exploring and interacting with an environment, reinforcement learning can successfully determine the optimal policy with respect to the long-term rewards given to an agent
[1, 2]. Whereas the idea of determining the optimal policy in terms of a cost over some time horizon is standard in the controls literature [3], reinforcement learning aims to learn the long-term rewards by exploring the states and actions. As such, the agent dynamics is no longer explicitly taken into account, but rather is subsumed by the data. Moreover, even the rewards need not necessarily be known a priori, but can be obtained through exploration as well. If no information about the agent dynamics is available, however, an agent might end up in certain regions of the state space which must be avoided while exploring. Avoiding such regions of the state space is referred to as safety. Safety includes collision avoidance, boundary-transgression avoidance, connectivity maintenance in teams of mobile robots, and other mandatory constraints, and this tension between exploration and safety becomes particularly pronounced in robotics, where safety is crucial.
In this paper, we address this safety issue by employing model learning in combination with barrier certificates. In particular, we focus on learning for systems with discrete-time nonstationary agent dynamics. Nonstationarity comes, for example, from failures of actuators, battery degradation, or sudden environmental disturbances. The result is a method that adapts to nonstationary agent dynamics and simultaneously extracts the dynamic structure without having to know how the agent dynamics changes over time. The resulting model will be used for barrier certificates. Under certain conditions, safety is recovered in the sense of Lyapunov stability even after violations of safety due to the nonstationarity occur. Moreover, we propose discrete-time barrier certificates with which a greedy policy update is ensured.
Over the last decade, the safety issue has been addressed under the name of safe learning, and plenty of solutions have been proposed [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. To ensure safety while exploring, initial knowledge of the agent dynamics, some safe maneuver or its long-term rewards, or a teacher advising the agent is necessary [14, 4]. To obtain a model of the agent dynamics, human operators may maneuver the agent and record its trajectories [15, 12], or, starting from an initial safe maneuver, the set of safe policies can be expanded by exploring the states [4, 5]. It is also possible that an agent continues exploring without entering the states with low long-term rewards associated with some safe maneuver (e.g., [16]). Due to the inherent uncertainty, the worst-case scenario (e.g., the lowest possible rewards) is typically taken into account when expanding the set of safe policies [13, 17]. To address this uncertainty in nonlinear-model estimation tasks, Gaussian process regression [18] is a powerful tool, and many safe learning studies have taken advantage of its properties (e.g., [13, 4, 7, 6, 10]).
Nevertheless, when the agent dynamics is nonstationary, the assumptions often made in the safe learning literature no longer hold. In such cases, strictly guaranteeing safety is unrealistic. For nonstationary agent dynamics, stable tracking of the agent dynamics, which mitigates the negative effect of an unexpected violation of safety, is desirable. Moreover, the long-term rewards must also be learned in an adaptive manner. These are the core motivations of this paper.
To constrain the states within a desired safe region, we employ control barrier functions (cf. [19, 20, 21, 22, 23, 24]). When the exact model of the agent dynamics is available, control barrier certificates ensure that an agent remains in the set of safe states for all time by constraining the instantaneous control input at each time, and that an agent outside of the set of safe states is forced back to safety (Proposition III.1). A useful property of control barrier certificates is nonconservativeness, i.e., they modify policies only when violations of safety are truly imminent. On the other hand, the global optimality of solutions to the constrained policy optimization is necessary to ensure the greedy improvement of a policy. The first contribution of this paper is to propose a discrete-time control barrier certificate which ensures global optimality under some mild conditions (see Section IV-C and Theorem IV.4 therein). This is an improvement over the previously proposed discrete-time control barrier certificate [24].
When the agent dynamics varies, the current estimate is no longer valid, possibly causing violations of safety. Therefore, we wish to adaptively learn the agent dynamics, and eventually bring the agent back to safety. To this end, we employ adaptive filtering techniques with a stable tracking (or monotone approximation) property: the current estimate is guaranteed to monotonically approach the target system in the Hilbertian norm sense (see Section III-B). This guarantee is particularly important for safety-critical applications including robotics, and aligns closely with control theory. As the estimate becomes accurate, control barrier certificates will eventually force the agent back to the safe set, hopefully before it suffers unrecoverable damage. In this paper, we employ a kernel adaptive filter [25] for nonlinear agent dynamics, which is an adaptive extension of kernel ridge regression [26, 27] or Gaussian processes. The multikernel adaptive filter (cf. [28, 29, 30, 31, 32] and Appendix F) is a state-of-the-art kernel adaptive filter, which adaptively achieves a compact representation of a nonlinear function containing multi-component/partially-linear functions, and has a monotone approximation property for a possibly varying target function. The second contribution of this paper is to guarantee Lyapunov stability of the safe set after the dynamics changes (Theorem IV.1), while adaptively learning the dynamic structure of the model by regarding a model of the agent dynamics as a combination of multiple structural components and employing a sparse optimization (see Sections III-C and IV-B). The key idea is the use of an adaptive sparse optimization to extract the truly active structural components.
Lastly, the action-value function, which approximates the long-term rewards, needs to be adaptively estimated under nonstationarity. Therefore, we wish to fully exploit nonlinear adaptive filtering techniques. Indeed, many attempts have been made to apply online learning techniques to reinforcement learning (see [33, 34, 35, 36, 37, 38]). As a result, so-called off-policy approaches, which are convergent even when samples are not generated by the target policy (see [34]), have been proposed. However, what differentiates action-value function approximation from ordinary supervised learning, where input-output pairs are given, is that the output of the true action-value function is not explicitly observed. The final contribution of this paper is, by assuming deterministic agent dynamics, to appropriately reformulate the action-value function approximation problem so that any kernel-based learning method, a widely studied class of nonparametric techniques, becomes straightforwardly applicable (Theorem IV.3).
To validate our learning framework and clarify each contribution, we first conduct simulations of a quadrotor. We then conduct real-robot experiments on a brushbot, whose dynamics is unknown, highly complex, and most probably nonstationary, to test the efficacy of our framework (see Section V). This is challenging due to many uncertainties and the lack of simulators often used in applications of reinforcement learning to robotics (see [39] for reinforcement learning in robotics).
II Preliminaries
In this section, we present some of the related work and the system model considered in this paper. Throughout, , , and are the sets of real numbers, nonnegative integers, and positive integers, respectively. Let be the norm induced by the inner product in an inner-product space . In particular, define for
dimensional real vectors
, and , where stands for transposition, and we let denote . Let and for denote the state and the control input at time instant , respectively.
II-A Related Work
The primary focus of this paper is the safety issue while exploring. Typically, some initial knowledge, such as an initial safe policy and a model of the agent dynamics, is required to address the safety issue while exploring, and model learning is often employed in tandem. We introduce some related work on model learning and kernel-based action-value function approximation.
II-A1 Model Learning for Safe Maneuver
The recent work in [13], [7], and [4]
assumes an initial conservative set of safe policies, which is gradually expanded as more data become available. These approaches are designed for stationary agent dynamics, and Gaussian processes are employed to obtain the confidence interval of the model. To ensure safety, control barrier functions and control Lyapunov functions are employed in
[13] and [4], respectively. On the other hand, the work in [10] uses a trajectory optimization based on receding horizon control and model learning by Gaussian processes, which is computationally expensive when the model is highly nonlinear. As analyzed in Section V-A later in this paper, Gaussian processes cannot adapt to an abrupt and unexpected change of the agent dynamics (although introducing forgetting factors will improve adaptivity, it is not straightforward to determine a suitable forgetting factor that guarantees any form of monotone approximation property under such a change). Hence, we employ an adaptive filter with a monotone approximation property, which shares similar ideas with stable online learning for adaptive control based on Lyapunov stability (cf. [40, 41, 42, 43], for example).
II-A2 Learning Dynamic Structures in Reproducing Kernel Hilbert Spaces
An approach that learns dynamics such that the resulting model satisfies the Euler-Lagrange equation in reproducing kernel Hilbert spaces (RKHSs) was proposed in [44], while our paper proposes adaptive learning of a control-affine structure in RKHSs.
II-A3 Reinforcement Learning in Reproducing Kernel Hilbert Spaces
We briefly introduce the ideas behind existing action-value function approximation techniques. Given a policy , the action-value function associated with the policy is defined as
(II.1) 
where is the discount factor, is a trajectory of the agent starting from , and is the immediate reward. It is known that the action-value function satisfies the Bellman equation (cf. [2, Equation (66)]):
(II.2) 
If the states and controls live in a grid world, an optimal policy can be obtained by a greedy search.
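As a small illustration of such a greedy search over a tabular action-value function, the sketch below uses purely illustrative numbers (four states, two actions) rather than anything from the paper:

```python
import numpy as np

# Hypothetical tabular setting: Q[s, a] holds the action-value estimate
# for 4 states and 2 actions (the values are illustrative only).
Q = np.array([[1.0, 0.5],
              [0.2, 0.9],
              [0.4, 0.4],
              [0.0, 1.5]])

# Greedy search: in each state, pick the action with the largest value.
greedy_policy = np.argmax(Q, axis=1)  # array([0, 1, 0, 1])
```

With continuous states and controls, as in the rest of the paper, this table is replaced by a function approximator and the argmax by a constrained optimization.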
However, for robotics applications, where the states and controls are continuous, we need some form of function approximator for the action-value function (and/or policies). Nonparametric learning such as a kernel method is often desirable when a priori knowledge about a suitable set of basis functions for learning is unavailable (we refer readers to [45] for a summary of parametric value function approximation techniques). Kernel-based reinforcement learning has been studied in the literature, e.g., [35, 36, 37, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56]. Although the outputs of the action-value function are unobserved, and supervised learning methods cannot be directly applied, so-called off-policy methods (e.g., residual learning [33], the least squares temporal difference algorithm [57], and gradient temporal difference learning [58, 34]) are proven to converge under certain conditions, as are most supervised learning methods, even when samples are not generated by the policy . The least squares temporal difference algorithm has been extended to kernel-based methods [37], including Gaussian processes (e.g., Gaussian process temporal difference and Gaussian process SARSA [35]).
In contrast to the aforementioned approaches, we explicitly define a so-called reproducing kernel Hilbert space so that action-value function approximation can be conducted as supervised learning in that space, rather than presenting an ad-hoc kernel-based algorithm for action-value function approximation. Consequently, any kernel-based method can be straightforwardly applied. The Gaussian process SARSA can also be reproduced by employing a Gaussian process in the explicitly defined RKHS, as will be discussed in Appendix G. We can also conduct action-value function approximation in the same RKHS even after the agent dynamics changes or the policy is updated if an adaptive filter is employed in the RKHS (see the remark below Theorem IV.3 and Section V-A2).
Specifically, in this paper, a possibly nonstationary agent dynamics is considered as described below.
II-B System Model
In this paper, we consider the following discrete-time deterministic nonlinear model of the nonstationary agent dynamics,
(II.3) 
where , , are continuous. Hereafter, we regard as the same as under the one-to-one correspondence between and , if there is no confusion.
We consider an agent with dynamics given in (II.3), and the goal is to find an optimal policy which drives the agent to a desirable state while remaining in the set of safe states defined as
(II.4) 
where . An optimal policy is a policy that attains an optimal value for every state . Note that the value associated with a policy varies when the dynamics varies, and that a quadruple is available at each time instant .
With these preliminaries in place, we can present the overview of our safe learning framework under possibly nonstationary dynamics.
III Overview of the Safe Learning Framework
When the agent dynamics varies abruptly and unexpectedly, safety can no longer be guaranteed. In this case, we at least wish to bring the agent back to safety. Here, we introduce the methods employed and extended in this paper, and present the motivations for using them.
III-A Discrete-time Control Barrier Functions
In this paper, we employ control barrier functions to deal with safety issues. The idea of control barrier functions is similar to that of Lyapunov functions; they require no explicit computation of the forward reachable set while ensuring certain properties by constraining the instantaneous control input. In particular, control barrier functions guarantee that an agent starting from the safe set remains safe (i.e., forward invariance), and that an agent outside of the safe set is forced back to safety (i.e., Lyapunov stability of the safe set) if the agent dynamics is available. To make the use of barrier certificates compatible with model learning and reinforcement learning, we employ the discrete-time version of control barrier certificates.
Definition III.1 ([24, Definition 4]).
A map is a discrete-time exponential control barrier function if there exists a control input such that
(III.1) 
Note that we intentionally removed the condition originally presented in [24, Definition 4]. Then, the forward invariance and asymptotic stability of the set of safe states are ensured by the following proposition.
Proposition III.1.
The set defined in (II.4) for some discrete-time exponential control barrier function is forward invariant when , and is asymptotically stable when .
Proof.
Proposition III.1 implies that an agent remains in the set of safe states defined in (II.4) for all time if and (III.1) are satisfied, and the agent outside of the set of safe states is brought back to safety.
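Since condition (III.1) is stated abstractly above, the following minimal sketch illustrates one common reading of a discrete-time exponential barrier condition, namely h(f(x,u)) − h(x) ≥ −η h(x) with 0 < η ≤ 1; the safe set (the unit disk), the single-integrator dynamics, and the value of η are all toy stand-ins rather than the paper's actual model:

```python
import numpy as np

# Illustrative safe set C = {x : h(x) >= 0}, here the unit disk.
def h(x):
    return 1.0 - float(np.dot(x, x))

# Toy single-integrator dynamics x_{t+1} = x_t + u_t; in the paper, the
# learned model of the agent dynamics plays this role.
def f(x, u):
    return x + u

def satisfies_certificate(x, u, eta=0.5):
    # One reading of a discrete-time exponential barrier condition:
    # h(f(x,u)) - h(x) >= -eta * h(x), with 0 < eta <= 1.
    return h(f(x, u)) - h(x) >= -eta * h(x)

x = np.array([0.9, 0.0])                 # near the boundary of C
candidates = [np.array([0.2, 0.0]),      # pushes further out: rejected
              np.array([-0.2, 0.0])]     # moves inward: certified
safe = [u for u in candidates if satisfies_certificate(x, u)]
```

Only the inward-pointing candidate passes the certificate check here, which is exactly the nonconservativeness discussed below: inputs are only filtered when the barrier value would decay too fast.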
The main motivations of using control barrier functions are given below:

Nonconservativeness, i.e., control barrier functions modify policies only when violations of safety are imminent. Consequently, an inaccurate or rough estimate of the model has less of a negative effect on (model-free) reinforcement learning. This is not true for control Lyapunov functions, which enforce the decrease condition even inside the safe set. The differences between control barrier functions and control Lyapunov functions are well analyzed in [59], for example.

Asymptotic stability of the safe set, i.e., if the agent is outside of the safe set, it is brought back to the safe set. In addition to Proposition III.1, this robustness property is analyzed in [19]. This property, together with the adaptive model learning presented in the next subsection, is particularly important when the nonstationarity of the dynamics pushes the agent out of the safe set.
To enforce barrier certificates, we need a model of the agent dynamics, and for a possibly nonstationary agent dynamics, we need to adaptively learn the model.
III-B Adaptive Model Learning with Stable Tracking Property
At each time instant, an input-output pair , where and , is available for model learning. Under possibly nonstationary agent dynamics, it is vital for the model parameter estimation to be stable even after the agent dynamics changes. In this paper, we employ adaptive learning with a monotone approximation property. Note that this approach shares the motivation of stable online learning with Lyapunov-like conditions.
Suppose that the estimate of the model parameter at time instant is given by . Given a cost function at time instant , we update the parameter so as to satisfy the strictly monotone approximation property if , where is the empty set. Under nonstationarity, this monotone approximation property guarantees that, no matter how the target vector (function) changes, the current estimator at least gets closer to the current target.
Assume that is nonempty. Then, , holds if . This is illustrated in Figure III.1. This property can also be viewed as Lyapunov stability of the model parameter.
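The monotone approximation property of projection-based updates can be illustrated in the simplest case: a metric projection of the current estimate onto the hyperplane consistent with each new datum (the NLMS update). The target parameter and data below are synthetic; the in-loop assertion checks exactly the Lyapunov-like property described above:

```python
import numpy as np

def project_onto_hyperplane(w, a, y):
    # Metric projection of w onto {v : <a, v> = y}; the NLMS update
    # applies exactly this projection for each new data pair (a, y).
    return w + (y - a @ w) / (a @ a) * a

rng = np.random.default_rng(0)
w_star = rng.standard_normal(5)          # unknown target parameter
w = np.zeros(5)                          # current estimate

for _ in range(50):
    a = rng.standard_normal(5)           # input (regressor)
    y = a @ w_star                       # noiseless output of the target
    w_new = project_onto_hyperplane(w, a, y)
    # Monotone approximation: the projection never moves the estimate
    # away from any point consistent with the data, in particular w_star.
    assert np.linalg.norm(w_new - w_star) <= np.linalg.norm(w - w_star) + 1e-12
    w = w_new
```

The same geometric argument carries over when the target moves: each update contracts the distance to whatever the *current* target is, which is the stable-tracking behavior exploited in this paper.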
By augmenting the state vector with the model parameter, and by employing a suitable candidate of Lyapunov functions, we can thus guarantee that the agent is brought back to safety even after abrupt and unexpected changes of the agent dynamics (see Section IVA).
For general nonlinear dynamics, we use a kernel adaptive filter with a monotone approximation property. Thanks to the celebrated properties of reproducing kernels, the framework of linear adaptive filtering directly applies to nonlinear function estimation tasks in a possibly infinite-dimensional functional space, namely a reproducing kernel Hilbert space.
Definition III.2 ([60, page 343]).
Given a nonempty set and a Hilbert space defined on , the function is called a reproducing kernel of if

for every , as a function of belongs to , and

it has the reproducing property, i.e., the following holds for every and every that
If has a reproducing kernel, is called a Reproducing Kernel Hilbert Space (RKHS).
One of the celebrated examples of kernels is the Gaussian kernel , , . It is well known that the Gaussian reproducing kernel Hilbert space is universal [61], i.e., any continuous function on every compact subset of can be approximated with arbitrary accuracy. Another widely used kernel is the polynomial kernel .
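As a minimal sketch of these two kernels, and of how a kernel adaptive filter represents its estimate as a finite kernel expansion over a dictionary of observed inputs, the dictionary points and coefficients below are purely illustrative:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d = x - y
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, degree=2):
    # k(x, y) = (<x, y> + c)^degree
    return (x @ y + c) ** degree

# A kernel adaptive filter keeps a dictionary {x_i} and coefficients
# {alpha_i}; its estimate lives in the RKHS as a finite kernel expansion.
def predict(x, dictionary, alphas, kernel=gaussian_kernel):
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, dictionary))

dictionary = [np.array([0.0]), np.array([1.0])]
alphas = [1.0, -1.0]
value = predict(np.array([0.0]), dictionary, alphas)
# value = 1*k(0,0) - 1*k(1,0) = 1 - exp(-0.5)
```

Online updates of the coefficients (and of the dictionary itself) are what distinguish a kernel adaptive filter from batch kernel ridge regression.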
We emphasize that the monotone approximation property, which stems from the convexity of the formulations, is the main motivation for using a kernel adaptive filter for adaptive model learning.
III-C Learning Dynamic Structure via Sparse Optimization
To efficiently constrain policies by using control barrier functions, a dynamic structure called the control-affine structure is preferable (see Section IV-C and Theorem IV.4 therein). Control-affine dynamics is given by (II.3) when , where denotes the null function. In practice, it is unrealistic to have a completely control-affine model (i.e., ) due to the effects of friction and other disturbances. However, as long as the term is negligibly small, we can regard it as a system noise added to a control-affine system, as discussed in Section IV-C and Theorem IV.4 therein, and we take the following steps to extract the control-affine structure:

Define , where .

Assume for simplicity that . We suppose that , , and , where , , and are RKHSs, and .

Let be the RKHS associated with the reproducing kernel , i.e., a polynomial kernel with and , and let be the set of constant functions. Then, the function can be estimated in the Hilbert space . The Hilbert space is an RKHS (see Section IV-B and Theorem IV.2 therein), and hence we can employ a kernel adaptive filter.

Define the cost so as to promote sparsity of the model parameter. Consequently, we obtain a control-affine model (i.e., the estimate of , denoted by , becomes null) if the underlying true dynamics is control affine.
The resulting control-affine part of the estimated dynamics will be used in combination with the control barrier certificates to efficiently constrain policies so that the agent remains in the set of safe states, and is stabilized on that set even after an abrupt and unexpected change of the agent dynamics, both while and after learning an optimal policy (see Theorem IV.1 and Theorem IV.4 for more details).
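The sparsity-promoting step above can be sketched with a group soft-thresholding (proximal) operator, which drives an entire coefficient block to exactly zero when the block is weak; the three coefficient blocks and the threshold below are hypothetical, standing in for the components f(x), the control-affine part, and the residual term:

```python
import numpy as np

def group_soft_threshold(w, thresh):
    # Proximal operator of the group-lasso penalty: shrinks the whole
    # block toward zero, returning exactly zero for weak blocks.
    n = np.linalg.norm(w)
    if n <= thresh:
        return np.zeros_like(w)
    return (1.0 - thresh / n) * w

# Hypothetical coefficient blocks for the three structural components:
# the drift term, the control-affine term, and the residual term.
w_drift = np.array([0.8, -0.3])
w_affine = np.array([0.5, 0.4])
w_residual = np.array([0.01, -0.02])   # nearly inactive residual

blocks = [group_soft_threshold(w, thresh=0.1)
          for w in (w_drift, w_affine, w_residual)]
# The residual block is driven exactly to zero, leaving a control-affine
# model; the active blocks are only mildly shrunk.
```

Applied iteratively inside the adaptive filter, this is how a truly control-affine system ends up represented without the spurious residual component.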
III-D Adaptive Action-value Function Approximation in RKHSs
Lastly, we present barrier-certified action-value function approximation in RKHSs. One of the issues arising when applying a kernel method to action-value function approximation is that the output of the action-value function associated with a policy , where is assumed to be an RKHS, is unobservable. Nevertheless, we know that the action-value function satisfies the Bellman equation (II.2). Hence, by defining a function , where , as
the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs .
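This reformulation can be sketched as a data-preparation step: each training input pairs the current state-action with the successor state-action under the policy, and the training output is the observed immediate reward. The trajectory quadruples and the policy below are toy stand-ins:

```python
# Hypothetical trajectory data: (x_t, u_t, r_t, x_{t+1}) quadruples.
trajectory = [((0.0,), (1.0,), 0.5, (1.0,)),
              ((1.0,), (0.5,), 1.0, (1.5,))]

def policy(x):
    # Stand-in for the target policy pi.
    return (0.0,)

# The Bellman equation Q(x, u) = r + gamma * Q(x', pi(x')) is
# rearranged so that the unobserved Q values appear on one side and the
# observed reward on the other: each training input is the pair
# ((x_t, u_t), (x_{t+1}, pi(x_{t+1}))), and the training output is r_t.
pairs = [(((x, u), (x_next, policy(x_next))), r)
         for (x, u, r, x_next) in trajectory]
```

Once the data take this form, the unknown function can be fit by any supervised kernel method in the RKHS constructed in Theorem IV.3.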
To theoretically guarantee the greedy improvement of policies, a globally optimal control input within the control constraints has to be taken at each state. Nevertheless, discrete-time barrier certificates are known to be nonconvex in general. Therefore, we will present certain conditions under which the control constraint becomes convex in the control (Section IV-C and Theorem IV.4 therein).
IV Analysis of Barrier-certified Adaptive Reinforcement Learning
In the previous section, we presented our barrier-certified adaptive reinforcement learning framework and the motivations for employing each method. In this section, we present a theoretical analysis of our framework to further strengthen these arguments.
IV-A Safety Recovery: Adaptive Learning and Control Barrier Certificates
The monotone approximation property of the model parameter is closely related to Lyapunov stability. In fact, by augmenting the state vector with the model parameter, we can construct a Lyapunov function which guarantees stability of the safe set under certain conditions.
We first make the following assumptions.
Assumption IV.1.

The dimension of the model parameter remains finite, and is .

The input space is invariant.

All of the basis functions (or kernel functions) are bounded over .

The control barrier function is Lipschitz continuous over with Lipschitz constant .

There exists a control input satisfying, for a sufficiently small , that
(IV.1) where is the predicted output of the current estimate at and .

If , where is the cost function at time instant , then .
Remark IV.1 (On Assumption IV.1.1).
Assumption IV.1.1 is reasonable if polynomial kernels are employed for learning or if the input space is compact.
Remark IV.2 (On Assumption IV.1.5).
Assumption IV.1.5 implies that we can enforce barrier certificates for the current estimate of the dynamics with a sufficiently small margin . Although this assumption is somewhat restrictive, it is still reasonable if the initial estimate does not deviate too much from the true dynamics.
Remark IV.3 (On Assumption IV.1.6).
Assumption IV.1.6 implies that the set , or equivalently the cost , is designed so that the predicted output for is sufficiently close to the true output , and this assumption is thus reasonable.
Let the augmented state be . Then, the following theorem states that the safe set is (asymptotically) stable even after a violation of safety due to the abrupt and unexpected change of the agent dynamics.
Theorem IV.1.
Suppose that a triple is available at time instant , and the model parameter is updated as , where is continuous and has monotone approximation property: if , then, , for all , and for some . Suppose also that the agent dynamics changes unexpectedly at time instant , and the set is nonempty. Then, under Assumption IV.1, the augmented safe set is asymptotically stable if a control input satisfying (IV.1) is employed for all , and if for all such that .
Proof.
From Assumptions IV.1.3 and IV.1.6, and the fact that the estimated output is linear in the model parameter for a fixed input, we obtain that
for some bounded . Therefore, from Assumption IV.1.4, we obtain
if . If , then we obtain
(IV.2) 
This inequality also holds for the case that and/or . We show that there exists a Lyapunov function for the augmented state . A candidate function is given by
Since , , where is the boundary of the set , from Assumption IV.1.4, it follows that is continuous. It also holds that when , and that
(IV.3) 
for all , where the first inequality follows because (IV.1) and (IV.2) yield
for , and
for , both of which imply (IV.3). Moreover, if , then, from monotonic approximation property, remains in , and from (IV.2), the control barrier certificate (III.1) is thus ensured by a control input satisfying (IV.1), rendering the set forward invariant. Therefore, from [62, Theorem 1], the set is asymptotically stable if for all such that . ∎
Remark IV.4 (On Theorem IV.1).
The monotone approximation property plays a key role. If, for example, Bayesian learning methods such as Gaussian processes are employed for model learning, then it is in general hard to guarantee any form of monotone approximation after an unexpected and abrupt change of the agent dynamics. This will be analyzed in Section V-A.
For agent dynamics which keeps changing, the augmented state can be regarded as following a hybrid system, and hence stability should be analyzed under additional assumptions in this case. Such an analysis is beyond the scope of this paper, and we omit the details.
IV-B Structured Model Learning
We have seen that, by employing model learning with a monotone approximation property under Assumption IV.1, the agent is stabilized on the set of safe states even after an abrupt and unexpected change of the agent dynamics. To efficiently enforce control barrier certificates (IV.1), control-affine models are desirable, as will be discussed in Section IV-C and Theorem IV.4 therein. Here, we propose a model learning technique that also learns the dynamic structure. We assume that for simplicity (we can employ approximators if ).
First, we show that the space (see Section III-C) is an RKHS.
Lemma IV.1.
The space is an RKHS associated with the reproducing kernel , with the inner product defined as , .
Proof.
See Appendix A. ∎
Then, the following lemma implies that can be approximated in the sum space of RKHSs denoted by .
Lemma IV.2 ([63, Theorem 13]).
Let and be two RKHSs associated with the reproducing kernels and
. Then the completion of the tensor product of
and , denoted by , is an RKHS associated with the reproducing kernel . From Lemmas IV.1 and IV.2, we can now assume that and , where is an estimate of . As such, can be approximated in the RKHS .
Second, the following theorem ensures that can be uniquely decomposed into , , and in the RKHS .
Theorem IV.2.
Assume that and have nonempty interiors. Assume also that is a Gaussian RKHS. Then, is the direct sum of , , and , i.e., the intersection of any two of the RKHSs , , and is .
Proof.
See Appendix B. ∎
Remark IV.5 (On Theorem IV.2).
Because only the control-affine part of the learned model will be used in combination with barrier certificates (see Assumption IV.2 and Theorem IV.4) and the term is assumed to be a system noise added to the control-affine dynamics (see Section III-C), the unique decomposition is crucial; if the unique decomposition did not hold, the term could estimate the overall dynamics, including the control-affine terms.
Therefore, we can employ a multikernel adaptive filter working in the sum space . By using a sparse optimization for the coefficient vector , we wish to extract the structure of the model; the term should drop out when the true agent dynamics is control affine.
In order to use the learned model in combination with control barrier functions, each entry of the vector is required. Assume, without loss of generality, that (this is always possible for by transforming the coordinates of the control inputs and reducing the dimension if necessary). Then, the th entry of the vector is given by . Finally, we can use the learned model effectively to constrain control inputs by control barrier functions for policy updates. Adaptive action-value function approximation with barrier-certified policy updates will be presented in the next subsection.
IV-C Adaptive Action-value Function Approximation with Barrier-certified Policy Updates
So far, we have shown that an agent can be brought back to safety by employing control barrier certificates and adaptive model learning with a monotone approximation property, and that a control-affine structure can be extracted by employing a sparse adaptive filter working in a certain RKHS. In this subsection, we present adaptive action-value function approximation with barrier-certified policy updates.
We showed in Section III-D that the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs . The following theorem states that this iterative learning can be conducted in an RKHS, and hence any kernel-based method can be employed for action-value function approximation.
Theorem IV.3.
Suppose that is an RKHS associated with the reproducing kernel . Define, for ,
Then, the operator defined by , is bijective. Moreover, is an RKHS with the inner product defined by
(IV.4)  
The reproducing kernel of the RKHS is given by
(IV.5) 
Proof.
See Appendix C. ∎
From Theorem IV.3, any kernel-based method can be applied by assuming that for a policy . The estimate of , denoted by , is then obtained as , where is the estimate of . For instance, suppose that the estimate of for an input at time instant is given by
where is the model parameter, and for and for defined by (IV.5). Then, the estimate of for an input z at time instant is given by
(IV.6) 
where
.
Remark IV.6 (On Theorem IV.3).
Because the domain of is defined as instead of , the RKHS does not depend on the agent dynamics, and we can conduct adaptive learning in the same even after the dynamics changes or the policy is updated. This is especially important when analyzing convergence and/or the monotone approximation property of action-value function approximation under possibly nonstationary agent dynamics (see Section V-A2, for example). As discussed in Appendix G, the Gaussian process SARSA can also be reproduced by applying a Gaussian process in the space , although the Gaussian process SARSA and other kernel-based action-value function approximations are ad hoc and are designed for learning the action-value function associated with a fixed policy and for stationary agent dynamics.
When the parameter for the estimator is monotonically approaching an optimal point in the Euclidean norm sense, so is the model parameter for the action-value function, because the same parameter is used to estimate and . Suppose we employ a method in which is monotonically approaching an optimal function in the Hilbertian norm sense. Then, the following corollary implies that an estimator of the action-value function also satisfies the monotonicity.
Corollary IV.1.
Let and , where . Then, if is approaching in the Hilbertian norm sense, i.e.,
it holds that
Proof.
See Appendix D. ∎
Note that employing the action-value function enables us to use random control inputs instead of the target policy for exploration, and we require no model of the agent dynamics for policy updates, as discussed below.
For a current policy , assume that the action-value function with respect to at time instant is available. Given a discrete-time exponential control barrier function and , the barrier-certified safe control space is defined as
From Proposition III.1, the set defined in (II.4) is forward invariant and asymptotically stable if for all . Then, the updated policy given by
(IV.7) 
is well known (e.g., [64, 65]) to satisfy , where is the action-value function with respect to at time instant . In practice, we use the estimate of because the exact function is unavailable. For example, the action-value function is estimated over iterations, and the policy is updated every iterations. To obtain analytical solutions for (IV.7), we follow the arguments in [35]. Suppose that is given by (IV.6). We define the reproducing kernel of as the tensor kernel given by
(IV.8) 
where is, for example, defined by
Then, (IV.7) becomes
(IV.9) 
where the target value being maximized is linear in u at x. Therefore, if the set is convex, an optimal solution to (IV.9) is guaranteed to be globally optimal, ensuring the greedy improvement of the policy.
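Because the objective is linear in u over a convex certified set, the barrier-certified greedy update can be sketched by a coarse search over a box of admissible controls intersected with a single (hypothetical) barrier half-space; with a control-affine model, the certified control set is indeed an intersection of such half-spaces. All numbers below are illustrative:

```python
import numpy as np

# Linear objective coefficients and a barrier half-space a @ u >= b,
# both toy stand-ins for the quantities appearing in (IV.9).
c = np.array([1.0, 0.3])
a, b = np.array([1.0, 1.0]), 0.0

# Coarse grid over a box of admissible controls; keep certified points.
grid = np.array([[u1, u2]
                 for u1 in np.linspace(-1, 1, 21)
                 for u2 in np.linspace(-1, 1, 21)])
certified = grid[grid @ a >= b]

# Greedy barrier-certified update: maximize the linear objective over
# the certified set; convexity makes this maximum globally optimal.
u_star = certified[np.argmax(certified @ c)]
```

In practice one would replace the grid search with a linear or quadratic program over the same convex set; the sketch only makes the "linear objective over a convex certified set" structure concrete.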
As pointed out in [24], is not a convex set in general. To ensure convexity, we consider the set under the following moderate assumptions:
Assumption IV.2.

The set is convex.

Existence of a Lipschitz continuous gradient of the barrier function: Given
there exists a constant such that the gradient of the discrete-time exponential control barrier function , denoted by , satisfies