# The interplay between system identification and machine learning

Learning from examples is one of the key problems in science and engineering. It deals with function reconstruction from a finite set of direct and noisy samples. Regularization in reproducing kernel Hilbert spaces (RKHSs) is widely used to solve this task and includes powerful estimators such as regularization networks. Recent achievements include the proof of the statistical consistency of these kernel- based approaches. Parallel to this, many different system identification techniques have been developed but the interaction with machine learning does not appear so strong yet. One reason is that the RKHSs usually employed in machine learning do not embed the information available on dynamic systems, e.g. BIBO stability. In addition, in system identification the independent data assumptions routinely adopted in machine learning are never satisfied in practice. This paper provides new results which strengthen the connection between system identification and machine learning. Our starting point is the introduction of RKHSs of dynamic systems. They contain functionals over spaces defined by system inputs and allow to interpret system identification as learning from examples. In both linear and nonlinear settings, it is shown that this perspective permits to derive in a relatively simple way conditions on RKHS stability (i.e. the property of containing only BIBO stable systems or predictors), also facilitating the design of new kernels for system identification. Furthermore, we prove the convergence of the regularized estimator to the optimal predictor under conditions typical of dynamic systems.

## Authors

• 27 publications
05/06/2020

### Mathematical foundations of stable RKHSs

Reproducing kernel Hilbert spaces (RKHSs) are key spaces for machine lea...
09/05/2019

### Kernel absolute summability is only sufficient for RKHS stability

Regularized approaches have been successfully applied to linear system i...
07/02/2015

### Identification of stable models via nonparametric prediction error methods

A new Bayesian approach to linear system identification has been propose...
09/22/2017

### Total stability of kernel methods

Regularized empirical risk minimization using kernels and their correspo...
04/15/2016

### A short note on extension theorems and their connection to universal consistency in machine learning

Statistical machine learning plays an important role in modern statistic...
03/12/2013

### Linear system identification using stable spline kernels and PLQ penalties

The classical approach to linear system identification is given by param...
09/02/2019

### Asymptotic linear expansion of regularized M-estimators

Parametric high-dimensional regression analysis requires the usage of re...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Learning from examples is key in science and engineering, considered at the core of intelligence’s understanding [PogShe99]. In mathematical terms, it can be described as follows. We are given a finite set of training data , where is the so called input location while is the corresponding output measurement. The goal is then the reconstruction of a function with good prediction capability on future data. This means that, for a new pair , the prediction should be close to .
To solve this task, nonparametric techniques have been extensively used in the last years. Within this paradigm, instead of assigning to the unknown function a specific parametric structure, is searched over a possibly infinite-dimensional functional space. The modern approach uses Tikhonov regularization theory [Tikhonov, Bertero1] in conjunction with Reproducing Kernel Hilbert Spaces (RKHSs) [Aronszajn50, Bergman50]. RKHSs possess many important properties, being in one to one correspondence with the class of positive definite kernels. Their connection with Gaussian processes is also described in [Kimeldorf71Bayes, Lukic, Bell2004, AravkinNN].

While applications of RKHSs in statistics, approximation theory and computer vision trace back to

[Bertero:1988, Wahba1990, Poggio90], these spaces were introduced to the machine learning community in [Girosi:1997]. RKHSs permit to treat in an unified way many different regularization methods. The so called kernel-based methods [PoggioMIT, Scholkopf01b] include smoothing splines [Wahba1990], regularization networks [Poggio90], Gaussian regression [Rasmussen][Drucker97, Vapnik98]. In particular, a regularization network (RN) has the structure

 ^g=argminf∈H N∑i=1(yi−f(xi))2N+γ∥f∥2HRN (1)

where denotes a RKHS with norm . Thus, the function estimate minimizes an objective sum of two contrasting terms. The first one is a quadratic loss which measures the adherence to experimental data. The second term is the regularizer (the RKHS squared norm) which restores the well-posedness and makes the solution depend continuously on the data. Finally, the positive scalar is the regularization parameter which has to suitably trade off these two components.
The use of (1) has significant advantages. The choice of an appropriate RKHS, often obtained just including function smoothness information [Scholkopf01b], and a careful tuning of , e.g. by the empirical Bayes approach [Maritz:1989, ARAVKIN2012, Aravkin2014]

, can well balance bias and variance. One can thus obtain favorable mean squared error properties. Furthermore, even if

is infinite-dimensional, the solution is always unique, belongs to a finite-dimensional subspace and is available in closed-form. This result comes from the representer theorem [Kimeldorf70, Scholkopf01, Argyriou:2009, ArgyriouD2014]. Building upon the work [Wahba77], many new results have been also recently obtained on the statistical consistency of (1). In particular, the property of to converge to the optimal predictor as the data set size grows to infinity is discussed e.g. in [Smale2007, Yuan2010, Wu2006, Muk2006, PoggioNature]. This point is also related to Vapnik’s concepts of generalization and consistency [Vapnik98], see [PoggioMIT]

for connections among regularization in RKHS, statistical learning theory and the concept of

dimension as a measure of function class complexity [Alon:1997, EvgVg]. The link between consistency and well-posedness is instead discussed in [Bousquet:2002, Muk2006, PoggioNature].

Parallel to this, many system identification techniques have been developed in the last decades. In linear contexts, the first regularized approaches trace back to [Shiller1973, Akaike1979, Kitagawa1996], see also [Goodwin1992, Ljung2014] where model error is described via a nonparametric structure. More recent approaches, also inspired by nuclear and atomic norms [Chand2012], can instead be found in [liu2009, GrossmanCDC09, Mohan2010, Rojas2014, Pillonetto2016]

. In the last years, many nonparametric techniques have been proposed also for nonlinear system identification. They exploit e.g. neural networks

[Tsungnan, Shun], Volterra theory [Franz06aunifying], kernel-type estimators [Leithead2003, PilTAC2011, Bai2015] which include also weights optimization to control the mean squared error [Roll2005, Bai2007, Bai2010]. Important connections between kernel-based regularization and nonlinear system identification have been also obtained by the least squares support vector machines [Suykens2002, Suykens2010] and using Gaussian regression for state space models [Frigola2013a, Frigola2013b]. Most of these approaches are inspired by machine learning, a fact not surprising since predictor estimation is at the core of the machine learning philosophy. Indeed, a black-box relationship can be obtained through (1) using past inputs and outputs to define the input locations (regressors). However, the kernels currently used for system identification are those conceived by the machine learning community for the reconstruction of static maps. RKHSs suited to linear system identification, e.g. induced by stable spline kernels which embed information on impulse response regularity and stability, have been proposed only recently [SS2010, SS2011, ChenOL12]. Furthermore, while stability of a RKHS (i.e. its property of containing only stable systems or predictors) is treated in [Carmeli, SurveyKBsysid, DinuzzoSIAM15], the nonlinear scenario still appears unexplored. Beyond stability, we also notice that the most used kernels for nonlinear regression, like the Gaussian and the Laplacian [Scholkopf01b], do not include other important information on dynamic systems like the fact that output energy is expected to increase if input energy augments.

Another aspect that weakens the interaction between system identification and machine learning stems also from the (apparently) different contexts these disciplines are applied to. In machine learning one typically assumes that data

are i.i.d. random vectors assuming values on a bounded subset of the Euclidean space. But in system identification, even when the system input is white noise, the input locations are not mutually independent. Already in the classical Gaussian noise setting, the outputs are not even bounded, i.e. there is no compact set containing them with probability one. Remarkably, this implies that none of the aforementioned consistency results developed for kernel-based methods can be applied. Some extensions to the case of correlated samples can be found in

[LR2, LR3, LR4] but still under conditions far from the system identification setting.

In this paper we provide some new insights on the interplay between system identification and machine learning in a RKHS setting. Our starting point is the introduction of what we call RKHSs of dynamic systems which contain functionals over input spaces induced by system inputs . More specifically, each input location contains a piece of the trajectory of so that any can be associated to a dynamic system. When is a stationary stochastic process, its distribution then defines the probability measure on from which the input locations are drawn. Again, we stress that this framework has been (at least implicitly) used in previous works on nonlinear system identification, see e.g. [Sjoberg:1995, PilTAC2011, Suykens, Tsungnan, Shun]. However, it has never been cast and studied in its full generality under a RKHS perspective.

At first sight, our approach could appear cumbersome. In fact, the space can turn out complex and unbounded just when the system input is Gaussian. Also, could be a function space itself (as e.g. happens in continuous-time). It will be instead shown that this perspective is key to obtain the following achievements:

• linear and nonlinear system identification can be treated in an unified way in both discrete- and continuous-time. Thus, the estimator (1) can be used in many different contexts, relevant for the control community, just changing the RKHS. This is important for the development of a general theory which links regularization in RKHS and system identification;

• system input’s role in determining the nature of the RKHS is made explicit. This will be also described in more detail in the linear system context, illustrating the distinction between the concept of RKHSs of dynamic systems and that of RKHSs of impulse responses;

• for linear systems we provide a new and simple derivation of the necessary and sufficient condition for RKHS stability [Carmeli, SurveyKBsysid, DinuzzoSIAM15] that relies just on basic RKHS theory;

• in the nonlinear scenario, we obtain a sufficient condition for RKHS stability which has wide applicability. We also derive a new stable kernel for nonlinear system identification;

• consistency of the RN (1) is proved under assumptions suited to system identification, revealing the link between consistency and RKHS stability.

The paper is organized as follows. In Section 2 we provide a brief overview on RKHSs. In Section 3, the concept of RKHSs of dynamic systems is defined by introducing input spaces induced by system inputs. The case of linear dynamic systems is then detailed via its relationship with linear kernels. The difference between the concepts of RKHSs of dynamic systems and RKHSs of impulse responses is also elucidated. Section 4 discusses the concept of stable RKHS. We provide a new simple characterization of RKHS stability in the linear setting. Then, a sufficient condition for RKHS stability is worked out in the nonlinear scenario. We also introduce a new kernel for nonlinear system identification, testing its effectiveness on a benchmark problem. In Section 5, we first review the connection between the machine learning concept of regression function and that of optimal predictor encountered in system identification. Then, the consistency of the RN (1) is proved in the general framework of RKHSs of dynamic systems. Conclusions end the paper while proofs of the consistency results are gathered in Appendix.

In what follows, the analysis is always restricted to causal systems and, to simplify the exposition, the input locations contain only past inputs so that output error models are considered. If an autoregressive part is included, the consistency analysis in Section 5 remains unchanged while the conditions developed in Section 4 guarantee predictor (in place of system) stability.

## 2 Brief overview on RKHSs

We use to indicate a function domain. This is a non-empty set often referred to as the input space in machine learning. Its generic element is the input location, denoted by or in the sequel. All the functions are assumed real valued, so that .
In function estimation problems, the goal is to estimate maps to make predictions over the whole . Thus, a basic requirement is to use an hypothesis space with functions well defined pointwise for any . In particular, assume that all the pointwise evaluators are linear and bounded over , i.e. there exists such that

 |g(x)|≤Cx∥g∥H,∀g∈H. (2)

###### Definition 1 (Rkhs).

A reproducing kernel Hilbert space (RKHS) over is a Hilbert space containing functions where (2) holds.

RKHSs are connected to the concept of positive definite kernel, a particular function defined over .

###### Definition 2 (Positive definite kernel and kernel section).

A symmetric function is called positive definite kernel if, for any integer , it holds

 p∑i=1p∑j=1cicjK(xi,xj)≥0,∀(xk,ck)∈(X,R),k=1,…,p.

The kernel section centered at is the function from to defined by

 Kx(a)=K(a,x)∀a∈X.

The following theorem provides the one-to-one correspondence between RKHSs and positive definite kernels.

###### Theorem 3 (Moore-Aronszajn and reproducing property).

To every RKHS there corresponds a unique positive definite kernel such that the so called reproducing property holds, i.e.

 ⟨Kx,g⟩=g(x)∀(x,g)∈(X,H) (3)

Conversely, given a positive definite kernel , there exists a unique RKHS of real-valued functions defined over where (3) holds.

Theorem 3 shows that a RKHS is completely defined by a kernel , also called the reproducing kernel of . More specifically, it can be proved that any RKHS is generated by the kernel sections in the following manner. Let denote the subspace spanned by and for any , say , define the norm

 ∥g∥2H=p∑i=1p∑j=1cicjK(xi,xj). (4)

Then, one has that is the union of and all the limits w.r.t. of the Cauchy sequences contained in . A consequence of this construction is that any inherits kernel properties, e.g. continuity of implies that all the are continuous [Cucker01][p. 35].
The kernel sections play a key role also in providing the closed-form solution of the RN (1), as illustrated in the famous representer theorem.

###### Theorem 4 (Representer theorem).

The solution of (1) is unique and given by

 ^g=N∑i=1 ^ciKxi, (5)

where the scalars are the components of the vector

 ^c=(K+γNIN)−1Y, (6)

is the column vector with -th element , is the identity matrix and the entry of is .

Another RKHS characterization useful in what follows is obtained when the kernel can be diagonalized as follows

 K(a,x)=∞∑i=1 ζiρi(a)ρi(x),  ζi>0 ∀i. (7)

The RKHS is then separable and the following result holds, e.g. see [PoggioMIT][p. 15] and [Cucker01][p. 36].

###### Theorem 5 (Spectral representation of a RKHS).

Let (7) hold and assume that the form a set of linearly independent functions on . Then, one has

 H={g | g(x)=∞∑i=1ciρi(x) s.t. ∞∑i=1c2iζi<∞}, (8)

and

 ⟨f,g⟩H=∞∑i=1biciζi,∥f∥2H=∞∑i=1b2iζi, (9)

where and .

The expansion (7) can e.g. be obtained by the Mercer theorem [Mercer1909, Hochstadt73]. In particular, let be a nondegenerate -finite measure on . Then, under somewhat general conditions [Sun05], the and in (7

) can be set to the eigenfunctions and eigenvalues of the integral operator induced by

, i.e.

 ∫XK(⋅,x)ρi(x)dμx(x)=ζiρi(⋅),0<ζ1≤ζ2≤ … (10)

In addition, the form a complete orthonormal basis in the classical Lebesgue space of functions square integrable under .111Thus, the representation (8) is not unique since spectral maps are not unique. Eigendecompositions depend on the measure but lead to the same RKHS.

## 3 RKHSs of dynamic systems and the linear system scenario

### 3.1 RKHSs of dynamic systems

The definition of RKHSs of dynamic systems given below relies on simple constructions of input spaces induced by system inputs .

#### Discrete-time

First, the discrete-time setting is considered. Assume we are given a system input . Then, we think of any input location in indexed by the time . Different cases arise depending on the postulated system model. For example, one can have

 xt=[ut  ut−1  …  ut−m+1]T, (11)

where is the system memory. This construction is connected to FIR or NFIR models and makes a subset of the classical Euclidean space .
Another scenario is

 xt=[ut  ut−1  ut−2  …]T, (12)

where any input location is a sequence (an infinite-dimensional column vector) and the input space becomes a subset of . The definition (12) is related to infinite memory systems, e.g. IIR models in linear settings.

#### Continuous-time

The continuous-time input is the map . In this case, the input location becomes the function defined by

 xt(τ)=u(t−τ),τ≥0, (13)

i.e. contains the input’s past up to the instant . In many circumstances, one can assume , where contains piecewise continuous functions on . When the input is causal, and is smooth for , the is indeed piecewise continuous.
Note that (13) is the continuous-time counterpart of (12) while that of (11) can be obtained just zeroing part of the input location, i.e.

 xt(τ)=u(t−τ)ξT(τ),τ≥0, (14)

where is the indicator function of the interval . In linear systems, (14) arises when the impulse response support is compact.

RKHSs of functions over domains , induced by system inputs as illustrated above, are hereby called RKHSs of dynamic systems. Thus, if , the scalar is the noiseless output at of the system fed with the input trajectory contained in . Note that in general is a functional: in the cases (12-14) the arguments entering are infinite-dimensional objects.

### 3.2 The linear system scenario

RKHSs of linear dynamic systems are now introduced also discussing the structure of the resulting RN.
Linear system identification was faced In [SS2010] and [SurveyKBsysid][Part III] by introducing RKHSs of impulse responses. These are spaces induced by kernels defined over subsets of . They thus contain causal functions, each of them representing an impulse response . The RN which returns the impulse response estimate was

 ^θ=argminθ∈I N∑i=1(yi−(θ⊗u)ti)2N+γ∥θ∥2I, (15)

where is the convolution between the impulse response and the input evaluated at .
The RKHSs of linear dynamic systems here introduced are instead associated to (output) linear kernels defined on through convolutions of with system inputs. In particular, if and are as in (11-14), one has222Translated in a stochastic setting, the (output) kernel can be seen as the covariance of a causal random process of covariance filtered by .

 K(xt,xτ)=(u⊗(K⊗u)τ)t

which e.g. in continuous-time becomes

 K(xt,xτ)=∫+∞0u(t−α)(∫+∞0u(τ−β)K(α,β)dβ)dα. (16)

These kernels lead to the RN (1) which corresponds to (15) after the “reparametrization” so that, in place of the impulse response , the optimization variable becomes the functional .
The kernels arising in discrete- and continuous-time are described below in (17,19) and (22). The distinction between the RKHSs induced by and will be further discussed in Section 3.3.

#### FIR models

We start assuming that the input location is defined by (11) so that any is an -dimensional (column) vector and . If is a symmetric and positive semidefinite matrix, a linear kernel is defined as follows

 K(a,x)=aTKx,(a,x)∈Rm×Rm. (17)

All the kernel sections are linear functions. Their span defines a finite-dimensional (closed) subspace that, in view of the discussion following Theorem 3, coincides with the whole . Hence, is a space of linear functions: for any , there exists such that

 g(x)=aTKx=Ka(x).

If is full rank, it holds that

 ||g||2H = ||Ka||2H=⟨Ka,Ka⟩H = K(a,a)=aTKa = θTK−1θ  with  θ:=Ka.

Let us use the associated to (17) as hypothesis space for the RN in (1). Let and with -th row equal to , where

 xi:=xtiyi:=yti,

and is the time instant where the -th output is measured. Then, after plugging the representation in (1), one obtains with

 ^θ =argminθ∈Rm ∥Y−Φθ∥2+γθTK−1θ (18a) =(ΦTΦ+γP−1)−1ΦTY. (18b)

The nature of the input locations (11) shows that is the impulse response estimate. Thus, (18) corresponds to regularized FIR estimation as e.g. discussed in [ChenOL12].

#### IIR models

Consider now the input locations defined by (12). The input space contains sequences and . Interpreting any input location as an infinite-dimensional column, we can use ordinary algebra’s notation to handle infinite-dimensional objects. For example, if then , where is the inner-product in the classical space of squared summable sequences.
Let be symmetric and positive semidefinite infinite-dimensional matrix (the nature of is discussed also in Section 3.3). Then, the function

 K(x,a)=xTKa,(x,a)∈R∞×R∞ (19)

defines a linear kernel on . Following arguments similar to those developed in the FIR case, one can see that the RKHS associated to such contains linear functions of the form with . Note that each is a functional defined by the sequence which represents an impulse response. In fact, one can deduce from (12) that is the discrete-time convolution, evaluated at , between and .
The RN with induced by (19) now implements regularized IIR estimation. Roughly speaking, (1) becomes the limit of (18) for . The exact solution can be obtained by the representer theorem (5) and turns out

 ^g(x)=N∑i=1 ^ciKxi(x)=^θTx, (20)

where the are the components of (6) while the infinite-dimensional column vector

 ^θ:=N∑i=1 ^ciKxi (21)

contains the impulse response coefficients estimates.

#### Continuous-time

The continuous-time scenario arises considering the input locations defined by (13) or (14). The input space now contains causal functions. Considering (13), given a positive-definite kernel , the linear kernel is

 K(x,a)=∫R+×R+ K(t,τ)x(t)a(τ)dtdτ (22)

which coincides with (16) when and . Each kernel section is a continuous-time linear system with impulse response . Thus, the corresponding RKHS contains linear functionals and (1) now implements regularized system identification in continuous-time. Using the representer theorem, the solution of (1) is

 ^g(x)=N∑i=1 ^ciKxi(x)=∫R+^θ(τ)x(τ)dτ (23)

where is still defined by (6) while is the impulse response estimate given by

 ^θ(τ):=N∑i=1 ^ci∫R+ K(τ,t)xi(t)dt. (24)

### 3.3 Relationship between RKHSs of impulse responses and RKHSs of dynamic systems

In (19), the infinite-dimensional matrix represents a kernel over . Then, let be the corresponding RKHS which contains infinite-dimensional column vectors . We will now see that is the RKHS of impulse responses associated to , i.e. each is the impulse response of a linear system . In particular, let admit the following expansion in terms of linearly independent infinite-dimensional (column) vectors :

 K=∞∑i=1ζiψiψTi.

According to Theorem 5, the span of the provides all the . Moreover, if , then The equality

 K(a,x) =aTKx=aT(∞∑i=1ζiψiψTi)x =∞∑i=1ζi(aTψi)ρi(a)(ψTix)ρi(x),

also provides the expansion of in terms of functionals defined by . Assuming that such functionals are linearly independent, it comes from Theorem 5 that each dynamic system has the representation . It is now obvious that such system is associated to the impulse response and the two spaces are isometrically isomorphic since

 ∥g∥2H=∞∑i=1c2iζi=∥θ∥2I.

This result holds also in continuous-time where is now the RKHS associated to the kernel . Letting be real-valued functions on , this comes from the same arguments adopted in discrete-time but now applied to the expansions

 K(t,τ)=∞∑i=1ζiψi(t)ψi(τ),

and

 K(x,a) =∫R+×R+ K(t,τ)x(t)a(τ)dtdτ =∞∑i=1ζi(∫R+ψi(t)x(t)dt)ρi(x)(∫R+ψi(τ)a(τ)dτ)ρi(a).
###### Remark 6.

The functionals could turn out linearly dependent even if the composing are linearly independent. This depends on the nature of the input space. For instance, let contain only the input locations induced by . Then, if is a rational transfer function with zeros , the functional associated to vanishes over . In these cases, there is no isometry between and : the same dynamic system could be defined by different impulse responses . In particular, results on RKHSs induced by sums of kernels reported in [Aronszajn50][Section 6 on p. 352] allow us to conclude that . Thus, among all the possible equivalent representations in of the dynamic system , the complexity of is quantified by that of minimum norm. This is illustrated through the following simple continuous-time example. Assume that

 K(t,τ)=ζ1ψ1(t)ψ1(τ)+ζ2ψ2(t)ψ2(τ),

where the Laplace transforms of and are given, respectively, by the rational transfer functions

 W1(s)=2ss+1+√2,W2(s)=s+1−√2s+1,

which satisfy . Let the input space contain only the input locations induced by . Then, the functionals associated, respectively, to coincide over the entire . Thus, the two impulse responses and induce the same system . Using Theorem 5, one has

 ∥ψ1∥2I=1ζ1,∥ψ2∥2I=1ζ2,

which implies

 ∥g∥2H=min(1ζ1,1ζ2).

## 4 Stable RKHSs

BIBO stability of a dynamic system is a familiar notion in control. In the RKHS context, we have the following definition.

###### Definition 7 (stable dynamic system).

Let be any bounded input, i.e. satisfying . Then, the dynamic system is said to be (BIBO) stable if there exits a constant such that for any and any input location induced by .

Note that, for to be stable, the above definition implicitly requires the input space of to contain any induced by any bounded input.

###### Definition 8 (stable RKHS).

Let be a RKHS of dynamic systems induced by the kernel . Then, and are said to be stable if each is stable.

To derive stability conditions on the kernel, let us first introduce some useful Banach spaces. The first two regard the discrete-time setting:

• the space of absolutely summable real sequences , i.e. such that equipped with the norm

 ∥a∥1=∞∑i=1|ai|;
• the space of bounded real sequences , i.e. such that , equipped with the norm

 ∥a∥∞=supi|ai|.

The other two are concerned with continuous-time:

• the Lebesgue space of functions absolutely integrable, i.e. such that , equipped with the norm

 ∥a∥1=∫R+|a(t)|dt;
• the Lebesgue space of functions essentially bounded, i.e. for any there exists such that

 |a(t)|≤Ma almost everywhere in R+,

equipped with the norm

 ∥a∥∞=inf{M s.t. |a(t)|≤M a.e.}.

### 4.1 The linear system scenario

We start studying the stability of RKHSs of linear dynamic systems. Obviously, all the FIR kernels (17) induce stable RKHSs. As for the IIR and continuous-time kernels in (19) and (22), first it is useful to recall the classical result linking BIBO stability and impulse response summability.

###### Proposition 9.

(BIBO stability and impulse response summability) Let be the impulse response of a linear system. Then, the system is BIBO stable iff in discrete-time or in continuous-time.

The next proposition provides the necessary and sufficient condition for RKHS stability in the linear scenario.

###### Proposition 10 (RKHS stability in the linear case).

Let be the RKHS of dynamic systems induced by the IIR kernel (19). Then, the following statements are equivalent

1. is stable;

2. The input space contains so that

 g(a)<∞  for any  (g,a)∈(H,ℓ∞);
3. for any

Let instead be the RKHS induced by the continuous-time kernel (22). The following statements are then equivalent

1. is stable;

2. The input space contains so that

 g(a)<∞  for any  (g,a)∈(H,L∞);
3. for any

Proof: The proof is developed in discrete-time. The continuous-time case follows exactly by the same arguments with minor modifications.
Recalling Definition 7 and subsequent discussion, this is a direct consequence of the BIBO stability assumption of any .
Given any , let its associated impulse response and define

 xt=[sign(θ1) sign(θ2) …]T. (27)

The assumption implies that and the implication follows by Proposition 9.
By assumption, the kernel is well defined over the entire . Hence, any kernel section centred on is a well defined element in and corresponds to a dynamic system with associated impulse response . With still defined by (27), one has which implies and proves (3).
By assumption, any impulse response associated to any kernel section centred on belongs to . This implies that the kernel associated to is well defined over the entire . Recalling Definition 1 and eq. (2), RKHS theory then ensures that any is well defined pointwise on and .

Point (3) contained in Proposition 10 was also cited in [SurveyKBsysid, DinuzzoSIAM15] as a particularization of a quite involved and abstract result reported in [Carmeli]. The stability proof reported below turns instead out surprisingly simple. The reason is that, with the notation adopted in (19) and (22), the outcomes in [Carmeli] were obtained starting from spaces of impulse responses induced by . Our starting point is instead the RKHSs of dynamic systems induced by (in turn defined by ). This different perspective permits to greatly simplify the analysis: kernel stability can be characterized just combining basic RKHS theory and Proposition 9.

Proposition 10 shows that RKHS stability is implied by the absolute integrability of , i.e. by

 ∞∑i=1∞∑j=1|K(i,j)|<∞  or  ∫R+×R+|K(t,τ)|dtdτ<∞ (28)

in discrete- and continuous-time, respectively. The condition (28) is also necessary for nonnegative-valued kernels [SurveyKBsysid][Section 13]. Then, considering e.g. the continuous-time setting, the popular Gaussian and Laplacian kernels, which belong to the class of radial basis kernels for , are all unstable. Stability instead holds for the stable spline kernel [PillACC2010] given by:

 K(t,s)=e−βmax(t,s)t,s≥0 (29)

where is related to the impulse response’s dominant pole.

### 4.2 The nonlinear system scenario

Let us now consider RKHSs of nonlinear dynamic systems with input locations (11-14). A very simple sufficient condition for RKHS stability is reported below.

###### Proposition 11 (RKHS stability in the nonlinear case).

Let be a RKHS of dynamic systems induced by the kernel . Let denote the closed ball of radius induced by , contained in or in discrete-time, or in in continuous-time. Assume that, for any , there exists such that

 K(x,x)

Then, the RKHS is stable.

Proof: Let the system input in discrete-time or in continuous-time. Then, we can find a closed ball containing, for any , all the input locations induced by as defined in (11-14). For any and , exploiting the reproducing property and the Cauchy-Schwartz inequality, one obtains

 |g(xt)| =|⟨g,Kxt⟩H|≤∥g∥H∥Kxt∥H =∥g∥H√K(xt,xt)≤∥g∥H√Cr,

hence proving the stability of .

The following result will be also useful later on. It derives from the fact that kernels products (sums) induce RKHSs of functions which are products (sums) of the functions induced by the single kernels [Aronszajn50][p. 353 and 361].

###### Proposition 12.

(RKHS stability with kernels sum and product) Let and be stable kernels. Then, the RKHSs induced by and are both stable.

The two propositions above allow to easily prove the stability of a very large class of kernels, as discussed in discrete-time in the remaining part of this section.

First, consider the input locations (11) contained in . As already mentioned in Section 4.1, radial basis kernels , with now to indicate the Euclidean norm, are widely adopted in machine learning. Important examples are the Gaussian kernel

 K(x,a)=exp(−|x−a|2η),η>0 (31)

and the Laplacian kernel

 K(x,a)=exp(−|x−a|η),η>0. (32)

From Proposition 11 one immediately sees that both these kernels are stable. More in general, all the radial basis kernels are stable333This statement should not be confused with the result discussed in Section 4.1 in the linear system scenario. There, we have seen that radial basis kernels lead to unstable linear kernels when used to define in the IIR (19) and continuous-time (22) case. Here, the Gaussian and Laplace kernels are instead used to define directly in the nonlinear system scenario. since they are constant along their diagonal ().
However, despite their stability, some drawbacks affect the use of (31,32) in system identification. First, the fact that is constant implies that these models do not include the information that output energy is likely to increase if input energy augments. Second, they measure the similarity among input locations without using the information that is expected to have less influence on the prediction of as the positive lag augments. Such limitation is also in some sense hidden by the finite-dimensional context. In fact, if the input locations are now defined by (12), i.e. the system memory is infinite, the Gaussian kernel becomes

 K(x,a)=exp(−∥x−a∥2η). (33)

This model is not reasonable: it is not continuous around the origin of and, out of the diagonal, is null for many input locations. This reveals the importance of finding an appropriate metric to measure the distance between different input trajectories.

New kernel for nonlinear system identification We now show how stable spline kernels, which embed exponential stability of linear systems [SS2010], can be useful also to define nonlinear models. Specifically, let be a stable spline kernel, e.g. diagonal with or given by for any integer and . Then, define the nonlinear stable spline (NSS) kernel as

 K(a,x)=aTKαx×exp(−(a−x)TKα(a−x)η)  NSS, (34)

which corresponds to the product between a linear kernel and a modified version of (31). Such kernel defines a new infinite-dimensional RKHS suited for identification of nonlinear output error models. Being no more constant along the diagonal, it embeds the information that output energy augments if input energy increases, preserving BIBO stability (as one can easily deduce from Propositions 11 and 12). Note also that, letting the dimensionality of the regression space go to infinity, the difficult selection of the discrete dimension of the regressors

has been eliminated. In fact, input locations similarity is regulated by the hyperparameter

that includes the information that the influence of on goes to zero as the time lag increases.

### 4.3 Numerical experiment

The following nonlinear system is taken from [Spinelli2005]:

 f(xt) = ut+0.6ut−1+0.35(ut.2+ut−4)−0.25u2t−3 + 0.2(ut−5+ut−6)+0.9ut−3+0.25utut−1+0.75u3t−2 − ut−1ut−2+0.5(u2t+utut−2+ut−1ut−3)

Then, consider the identification of the following two systems, called (S1) and (S2):

 yt=f(xt)+et    (S1),yt=∞∑k=1θkut−k+f(xt)+et    (S2),

where all the and are independent Gaussian noises of variance 4. Note that (S2) contains the sum of a linear time invariant system (details on the impulse response are given below) and the nonlinear FIR in (S1). Our aim is to identify the two systems from 1000 input-output pairs via (1). The performance will be measured by the percentage fit on a test set of 1000 noiseless system outputs contained in the vector , i.e.

 100%(1−|ytest−^ytest||ytest−¯ytest|), (35)

where is the mean of the components of while is the prediction from an estimated model. We will display MATLAB boxplots of the 100 fits achieved by (1) after a Monte Carlo of 100 runs, using different kernels. At any run, independent noises are drawn to form new identification and test data. The impulse response in (S2) also varies. It is a -th order rational transfer function with norm equal to 10 (this makes similar the contribution to the output variance of the linear and nonlinear system components) and poles inside the complex circle of radius 0.95, randomly generated as detailed in [Pillonetto2016][section 7.4].
First, we use the Gaussian kernel (31) over an -dimensional input space. Plugged in (1), it defines the hyperparameter vector . For tuning the Gaussian kernel hyperparameters, the regressor vector dimension is chosen by an oracle not implementable in practice. Specifically, at any run, for each the pair is determined via marginal likelihood optimization [DeNicolao1] using only the identification data. Multiple starting points have been adopted to mitigate the effect of local minima. The oracle has then access to the test set to select, among the 50 couples, that maximizing the prediction fit (35).

The two left boxplots in Fig. 1 report the prediction fits achieved by this procedure applied to identify the first (top) and the second (bottom) system. Even if the Gaussian kernel is equipped with the oracle, its performance is satisfactory only in the (S1) scenario while the capability of predicting outputs in the (S2) case is poor during many runs. In place of the marginal likelihood, a cross-validation strategy has been also used for tuning , obtaining results similar to those here displayed.
During the Monte Carlo, even when the number of impulse response coefficients different from zero is quite large, we have noticed that a relatively small value for is frequently chosen, i.e. the oracle tends to use few past input values to predict . This indicates that the Gaussian kernel structure induces the oracle to introduce a significant bias to guard the estimator’s variance.
Now, we show that in this example model complexity can be better controlled avoiding the difficult and computationally expensive choice of discrete orders. In particular, we set and use the new NSS kernel (34). For tuning the NSS hyperparameter vector , no oracle having access to the test set is employed but just a single continuous optimization of the marginal likelihood that uses only the identification data. The fits achieved by NSS after the two Monte Carlo studies are in the right boxplots of the two figures above: in both the cases the new estimator behaves very nicely.

## 5 Consistency of regularization networks for system identification

### 5.1 The regression function

In what follows, the system input is a stationary stochastic process over in discrete-time or in continuous-time. The distribution of induces on the (Borel non degenerate) probability measure , from which the input locations are drawn. In view of their dependence on , the are in general correlated each other and unbounded, e.g. for Gaussian no bounded set contains with probability one. Such peculiarities, inherited by the system identification setting, already violate the data generation assumptions routinely adopted in machine learning.
The identification data are assumed to be a stationary stochastic process. In particular, each couple has joint probability measure where is the probability measure of the output conditional on a particular input location .
Given a function (dynamic system) , the least squares error associated to is

 (36)

The following result is well known and characterizes the minimizer of (36) which goes under the name of regression function in machine learning.

###### Theorem 13 (The regression function).

We have

 fρ=argminf E(y−f(x))2,

where is the regression function defined for any by

 fρ(x)=∫Ry dμy|x(y|x). (37)

In our system identification context, is the dynamic system associated to the optimal predictor that minimizes the expected quadratic loss on a new output drawn from .

### 5.2 Consistency of regularization networks for system identification

Consider a scenario where (and possibly also ) is unknown and only samples from are available. We study the convergence of the RN in (1) to the optimal predictor as under the input-induced norm

 ∥f∥2x=∫X f2(x)dμx(x).

This is the norm in the classical Lebesgue space already introduced at the end of Section 2.
We assume that the reproducing kernel of admits the expansion

 K(x,a)=∞∑i=1 ζiρi(x)ρi(a),  ζi>0 ∀i,

with and defined via (10) and the probability measure .
Exploiting the regression function, the measurements process can be written as

 yi=fρ(xi)+ei, (38)

where the errors are zero-mean and identically distributed. They can be correlated each other and also with the input locations . Given any and the -th kernel eigenfunction

, we define the random variables

by combining , and the errors defined by (38) as follows

 vℓi=(f(xi)+ei)ρℓ(xi). (39)

Let be stable so that the form a stationary process. In particular, note that each is the product of the outputs from two stable systems: the first one, , is corrupted by noise while the second one, , is noiseless.
Now, by summing up over the cross covariances of lag using kernel eigenvalues as weights one obtains444As shown in Appendix, the in (40) are invariant w.r.t. the particular spectral decomposition used to obtain the pairs .

 ck:=∞∑ℓ=1 ζℓCov(vℓi,vℓ,i+k) (40)

where is the covariance operator. The next proposition shows that summability of the is key for consistency.

###### Proposition 14.

(RN consistency in system identification) Let be the RKHS with kernel

 K(x,a)=∞∑i=1 ζiρi(x)ρi(a),  ζi>0 ∀i, (41)

where are the eigenvalues/eigenfunctions pairs defined by (10) under the probability measure . Assume that, for any and s.t. , there exists a constant such that

 ∞∑k=