Learning from examples is key in science and engineering,
and is considered to be at the core of our understanding of intelligence [PogShe99].
In mathematical terms, it can be described as follows. We are given a finite set of training data $\{(x_i, y_i)\}_{i=1}^N$,
where $x_i$ is the so-called input location while $y_i$ is the corresponding output measurement.
The goal is then the reconstruction of a function $g$
with good prediction capability on future data. This means that, for a new
pair $(x, y)$, the prediction $g(x)$ should be close to $y$.
To solve this task, nonparametric techniques have been used extensively in recent years. Within this paradigm, instead of assigning a specific parametric structure to the unknown function, $g$ is searched for over a possibly infinite-dimensional functional space. The modern approach uses Tikhonov regularization theory [Tikhonov, Bertero1] in conjunction with Reproducing Kernel Hilbert Spaces (RKHSs) [Aronszajn50, Bergman50]. RKHSs possess many important properties, being in one-to-one correspondence with the class of positive definite kernels. Their connection with Gaussian processes is also described in [Kimeldorf71Bayes, Lukic, Bell2004, AravkinNN].
While applications of RKHSs in statistics, approximation theory and computer vision trace back to [Bertero:1988, Wahba1990, Poggio90], these spaces were introduced to the machine learning community in [Girosi:1997]. RKHSs permit treating many different regularization methods in a unified way. The so-called kernel-based methods [PoggioMIT, Scholkopf01b] include smoothing splines [Wahba1990], regularization networks [Poggio90], Gaussian regression [Rasmussen], and support vector machines [Drucker97, Vapnik98]. In particular, a regularization network (RN) has the structure

$$\hat{g} = \arg\min_{g \in \mathcal{H}} \ \sum_{i=1}^N \big( y_i - g(x_i) \big)^2 + \gamma \|g\|_{\mathcal{H}}^2, \qquad (1)$$

where $\mathcal{H}$ denotes a RKHS with norm $\|\cdot\|_{\mathcal{H}}$.
Thus, the function estimate minimizes an objective that is the sum of two contrasting terms.
The first is a quadratic loss which measures adherence to the experimental data.
The second term is the regularizer (the squared RKHS norm), which
restores well-posedness and makes the solution depend continuously on the data.
Finally, the positive scalar $\gamma$ is the regularization parameter, which has to
suitably trade off these two components.
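To make the estimator concrete, here is a minimal sketch (ours, not from the paper) of computing the RN estimate for a generic kernel via its well-known closed form; the kernel, data and hyperparameter values are purely illustrative.

```python
import numpy as np

def rn_estimate(K, X, y, gamma):
    """Regularization network (kernel ridge regression): given a positive
    definite kernel K, training input locations X and outputs y, return the
    predictor minimizing the regularized quadratic loss (1)."""
    n = len(y)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Coefficients of the finite-dimensional closed-form solution
    c = np.linalg.solve(G + gamma * np.eye(n), y)
    return lambda x: sum(ci * K(xi, x) for ci, xi in zip(c, X))

# Toy example: Gaussian kernel on scalar inputs, noisy samples of sin(3x)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 30)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(30)
K = lambda a, b: np.exp(-(a - b) ** 2 / 0.5)
g_hat = rn_estimate(K, X, y, gamma=0.1)
print(abs(g_hat(0.2) - np.sin(0.6)))  # small prediction error
```

The matrix inverse is the only expensive step; the estimate is a finite combination of kernel sections centered at the training inputs.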
The use of (1) has significant advantages. The choice of an appropriate RKHS, often obtained just by including information on function smoothness [Scholkopf01b], and a careful tuning of $\gamma$, e.g. by the empirical Bayes approach [Maritz:1989, ARAVKIN2012, Aravkin2014], can well balance bias and variance. One can thus obtain favorable mean squared error properties. Furthermore, even if $\mathcal{H}$ is infinite-dimensional, the solution is always unique, belongs to a finite-dimensional subspace and is available in closed form. This result comes from the representer theorem [Kimeldorf70, Scholkopf01, Argyriou:2009, ArgyriouD2014]. Building upon the work [Wahba77], many new results have also been obtained recently on the statistical consistency of (1). In particular, the property of $\hat{g}$ to converge to the optimal predictor as the data set size grows to infinity is discussed e.g. in [Smale2007, Yuan2010, Wu2006, Muk2006, PoggioNature]. This point is also related to Vapnik's concepts of generalization and consistency [Vapnik98]; see [PoggioMIT]
for connections among regularization in RKHS, statistical learning theory and the concept of dimension as a measure of function class complexity [Alon:1997, EvgVg]. The link between consistency and well-posedness is instead discussed in [Bousquet:2002, Muk2006, PoggioNature].
In parallel, many system identification techniques have been developed over the last decades. In linear contexts, the first regularized approaches trace back to [Shiller1973, Akaike1979, Kitagawa1996]; see also [Goodwin1992, Ljung2014], where model error is described via a nonparametric structure. More recent approaches, also inspired by nuclear and atomic norms [Chand2012], can instead be found in [liu2009, GrossmanCDC09, Mohan2010, Rojas2014, Pillonetto2016].
In recent years, many nonparametric techniques have also been proposed for nonlinear system identification. They exploit e.g. neural networks [Tsungnan, Shun], Volterra theory [Franz06aunifying], and kernel-type estimators [Leithead2003, PilTAC2011, Bai2015], which also include weights optimization to control the mean squared error [Roll2005, Bai2007, Bai2010]. Important connections between kernel-based regularization and nonlinear system identification have also been obtained by least squares support vector machines [Suykens2002, Suykens2010] and by using Gaussian regression for state space models [Frigola2013a, Frigola2013b]. Most of these approaches are inspired by machine learning, a fact that is not surprising since predictor estimation is at the core of the machine learning philosophy. Indeed, a black-box relationship can be obtained through (1) using past inputs and outputs to define the input locations (regressors). However, the kernels currently used for system identification are those conceived by the machine learning community for the reconstruction of static maps. RKHSs suited to linear system identification, e.g. induced by stable spline kernels which embed information on impulse response regularity and stability, have been proposed only recently [SS2010, SS2011, ChenOL12]. Furthermore, while stability of a RKHS (i.e. its property of containing only stable systems or predictors) is treated in [Carmeli, SurveyKBsysid, DinuzzoSIAM15], the nonlinear scenario still appears unexplored. Beyond stability, we also notice that the kernels most used for nonlinear regression, like the Gaussian and the Laplacian [Scholkopf01b], do not include other important information on dynamic systems, like the fact that output energy is expected to increase if input energy increases.
Another aspect that weakens the interaction between system identification and machine learning stems from the (apparently) different contexts these disciplines are applied to. In machine learning one typically assumes that the data
are i.i.d. random vectors taking values in a bounded subset of Euclidean space. But in system identification, even when the system input is white noise, the input locations are not mutually independent. Already in the classical Gaussian noise setting, the outputs are not even bounded, i.e. there is no compact set containing them with probability one. Remarkably, this implies that none of the aforementioned consistency results developed for kernel-based methods can be applied. Some extensions to the case of correlated samples can be found in [LR2, LR3, LR4], but still under conditions far from the system identification setting.
In this paper we provide some new insights on the interplay
between system identification and machine learning
in a RKHS setting. Our starting point is the introduction
of what we call RKHSs of dynamic systems, which contain functionals over input spaces $X$ induced by system inputs $u$.
Each input location contains a piece of the trajectory of $u$, so that
any $g \in \mathcal{H}$ can be associated with a dynamic system.
When $u$ is a stationary stochastic process, its distribution
then defines the probability measure on $X$ from which the input locations are drawn. Again, we stress that this framework has been
(at least implicitly) used in previous works on nonlinear system identification, see e.g.
[Sjoberg:1995, PilTAC2011, Suykens, Tsungnan, Shun].
However, it has never been cast and studied in
its full generality under a RKHS perspective.
At first sight, our approach could appear cumbersome. In fact, the input space $X$ can turn out to be complex and unbounded even when the system input is simply Gaussian. Also, $X$ could be a function space itself (as happens e.g. in continuous-time). It will instead be shown that this perspective is key to obtaining the following achievements:
linear and nonlinear system identification can be treated in a unified way in both discrete- and continuous-time. Thus, the estimator (1) can be used in many different contexts relevant to the control community just by changing the RKHS. This is important for the development of a general theory which links regularization in RKHS and system identification;
the system input's role in determining the nature of the RKHS is made explicit. This will also be described in more detail in the linear system context, illustrating the distinction between the concept of RKHSs of dynamic systems and that of RKHSs of impulse responses;
for linear systems we provide a new and simple derivation of the necessary and sufficient condition for RKHS stability [Carmeli, SurveyKBsysid, DinuzzoSIAM15] that relies just on basic RKHS theory;
in the nonlinear scenario, we obtain a sufficient condition for RKHS stability which has wide applicability. We also derive a new stable kernel for nonlinear system identification;
consistency of the RN (1) is proved under assumptions suited to system identification, revealing the link between consistency and RKHS stability.
The paper is organized as follows. In Section 2 we provide a brief
overview on RKHSs. In Section 3, the concept of RKHSs of dynamic systems is defined
by introducing input spaces induced by system inputs.
The case of linear dynamic systems is then detailed via its relationship
with linear kernels. The difference between
the concepts of RKHSs of dynamic systems and RKHSs of impulse responses
is also elucidated.
Section 4 discusses the concept of stable RKHS.
We provide a new simple characterization
of RKHS stability in the linear setting. Then, a sufficient condition
for RKHS stability is worked out in the nonlinear scenario.
We also introduce a new kernel for nonlinear system identification, testing its effectiveness on a benchmark problem.
In Section 5, we first review the connection between
the machine learning concept of regression function and
that of optimal predictor encountered in system identification.
Then, the consistency of the RN
(1) is proved in the general framework of RKHSs of dynamic systems.
Conclusions end the paper, while proofs of the consistency results are
gathered in the Appendix.
In what follows, the analysis is always restricted to causal systems and, to simplify the exposition, the input locations contain only past inputs so that output error models are considered. If an autoregressive part is included, the consistency analysis in Section 5 remains unchanged while the conditions developed in Section 4 guarantee predictor (in place of system) stability.
2 Brief overview on RKHSs
We use $X$ to indicate a function domain. This is a non-empty set
often referred to as the input space in machine learning. Its generic element
is the input location, denoted by $x$ or $a$ in the sequel. All the functions are assumed real valued,
so that $g : X \to \mathbb{R}$.
In function estimation problems, the goal is to estimate maps to make predictions over the whole $X$. Thus, a basic requirement is to use a hypothesis space $\mathcal{H}$ whose functions are well defined pointwise for any $x \in X$. In particular, assume that all the pointwise evaluators are linear and bounded over $\mathcal{H}$, i.e. for every $x \in X$ there exists $C_x < \infty$ such that

$$|g(x)| \le C_x \|g\|_{\mathcal{H}}, \quad \forall g \in \mathcal{H}. \qquad (2)$$
This property already leads to the spaces of interest.
Definition 1 (RKHS).
A reproducing kernel Hilbert space (RKHS) over $X$ is a Hilbert space containing functions $g : X \to \mathbb{R}$ for which (2) holds.
RKHSs are connected to the concept of positive definite kernel, a particular function defined over $X \times X$.
Definition 2 (Positive definite kernel and kernel section).
A symmetric function $K : X \times X \to \mathbb{R}$ is called a positive definite kernel if, for any integer $p$, scalars $c_1, \dots, c_p$ and input locations $x_1, \dots, x_p$, it holds that

$$\sum_{i=1}^{p} \sum_{j=1}^{p} c_i c_j K(x_i, x_j) \ge 0.$$

The kernel section $K_x$ centered at $x$ is the function from $X$ to $\mathbb{R}$ defined by $K_x(a) = K(x, a)$ for all $a \in X$.
The following theorem provides the one-to-one correspondence between RKHSs and positive definite kernels.
Theorem 3 (Moore-Aronszajn and reproducing property).
To every RKHS $\mathcal{H}$ there corresponds a unique positive definite kernel $K$ such that the so-called reproducing property holds, i.e.

$$\langle K_x, g \rangle_{\mathcal{H}} = g(x), \quad \forall (x, g) \in X \times \mathcal{H}. \qquad (3)$$

Conversely, given a positive definite kernel $K$, there exists a unique RKHS of real-valued functions defined over $X$ for which (3) holds.
Theorem 3 shows that a RKHS is completely defined by a kernel $K$, also called the reproducing kernel of $\mathcal{H}$. More specifically, it can be proved that any RKHS is generated by the kernel sections in the following manner. Let $S$ denote the subspace spanned by the kernel sections $\{K_x\}_{x \in X}$ and, for any $h \in S$, say $h = \sum_{i=1}^{p} c_i K_{x_i}$, define the norm

$$\|h\|_{\mathcal{H}}^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} c_i c_j K(x_i, x_j).$$

Then, $\mathcal{H}$ is the union of $S$
and all the limits w.r.t. $\|\cdot\|_{\mathcal{H}}$ of the Cauchy sequences contained in $S$.
A consequence of this construction is that any $g \in \mathcal{H}$ inherits the kernel's properties, e.g.
continuity of $K$ implies that all the $g \in \mathcal{H}$ are continuous [Cucker01][p. 35].
The kernel sections also play a key role in providing the closed-form solution of the RN (1), as illustrated in the famous representer theorem.
Theorem 4 (Representer theorem).
The minimizer of (1) is unique and given by

$$\hat{g} = \sum_{i=1}^{N} \hat{c}_i K_{x_i}, \qquad (5)$$

where the coefficients $\hat{c}_i$ are the components of the vector

$$\hat{c} = (\mathbf{K} + \gamma I_N)^{-1} Y, \qquad (6)$$

with $Y = [y_1 \ \dots \ y_N]^\top$ and $\mathbf{K} \in \mathbb{R}^{N \times N}$ the kernel matrix with $(i,j)$ entry $K(x_i, x_j)$.
Another RKHS characterization useful in what follows is obtained when the kernel can be diagonalized as

$$K(x, a) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(a), \qquad \zeta_i \ge 0. \qquad (7)$$

The RKHS is then separable and the following result holds, e.g. see [PoggioMIT][p. 15] and [Cucker01][p. 36].

Theorem 5 (Spectral representation of a RKHS).
Let (7) hold and assume that the $\rho_i$ form a set of linearly independent functions on $X$. Then, one has

$$\mathcal{H} = \Big\{ g : g = \sum_{i=1}^{\infty} c_i \rho_i \ \text{ with } \ \sum_{i=1}^{\infty} \frac{c_i^2}{\zeta_i} < \infty \Big\}, \qquad (8)$$

where $\|g\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} c_i^2 / \zeta_i$.

The expansion (7) can e.g. be obtained by the Mercer theorem [Mercer1909, Hochstadt73]. In particular, let $\mu$ be a nondegenerate $\sigma$-finite measure on $X$. Then, under fairly general conditions [Sun05], the $\zeta_i$ and $\rho_i$ in (7) can be taken as the eigenvalues and eigenfunctions of the integral operator induced by $K$ and $\mu$, i.e.

$$\int_X K(x, a) \, \rho_i(a) \, d\mu(a) = \zeta_i \, \rho_i(x). \qquad (10)$$

In addition, the $\rho_i$ form a complete orthonormal basis in the classical Lebesgue space $\mathcal{L}_2^{\mu}$ of functions square integrable under $\mu$.¹ (¹ Thus, the representation (8) is not unique since spectral maps are not unique. Eigendecompositions depend on the measure $\mu$ but lead to the same RKHS.)
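The Mercer eigenpairs can be approximated numerically: sampling the kernel on a grid and eigendecomposing the scaled Gram matrix yields approximations of the leading eigenvalues under the chosen measure. A sketch with illustrative choices (Gaussian kernel, uniform measure on [0, 1]):

```python
import numpy as np

# Approximate the leading Mercer eigenvalues of a Gaussian kernel under the
# uniform measure on [0, 1] via a grid discretization of the integral operator.
n = 400
x = np.linspace(0.0, 1.0, n)
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
# Dividing by n implements the uniform quadrature weights of the measure
eigvals = np.linalg.eigvalsh(G / n)[::-1]   # sorted in decreasing order
print(eigvals[:5])                          # fast decay, typical of smooth kernels
```

The rapid eigenvalue decay reflects the smoothness of the kernel; rougher kernels (e.g. Laplacian) exhibit a slower decay.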
3 RKHSs of dynamic systems and the linear system scenario
3.1 RKHSs of dynamic systems
The definition of RKHSs of dynamic systems given below relies on simple constructions of input spaces $X$ induced by system inputs $u$.
First, the discrete-time setting is considered. Assume we are given a system input $u$. Then, we think of any input location in $X$ as indexed by the time $t$. Different cases arise depending on the postulated system model. For example, one can have

$$x_t = [u(t-1), \, u(t-2), \, \dots, \, u(t-m)]^\top, \qquad (11)$$

where $m$ is the system memory. This construction is connected to
FIR or NFIR models and makes $X$ a subset of the
classical Euclidean space $\mathbb{R}^m$.
Another scenario is

$$x_t = [u(t-1), \, u(t-2), \, u(t-3), \, \dots], \qquad (12)$$

where any input location is a sequence (an infinite-dimensional column vector)
and the input space becomes a subset of $\mathbb{R}^{\infty}$.
The definition (12) is related to infinite-memory systems, e.g. IIR models in linear settings.
In continuous-time, the input is the map $u : \mathbb{R} \to \mathbb{R}$. In this case, the input location becomes the function $x_t$ defined by

$$x_t(a) = u(t - a), \quad a \ge 0, \qquad (13)$$

i.e. $x_t$ contains the input's past up to the instant $t$.
In many circumstances, one can assume $X \subset \mathcal{C}$,
where $\mathcal{C}$ contains the piecewise continuous functions on $[0, \infty)$.
When the input is causal and smooth for $t \ge 0$,
the $x_t$ are indeed piecewise continuous.
Note that (13) is the continuous-time counterpart of (12), while that of (11) can be obtained just by zeroing part of the input location, i.e.

$$x_t(a) = u(t - a) \, \chi_{[0, T]}(a), \qquad (14)$$

where $\chi_{[0, T]}$ is the indicator function of the interval $[0, T]$. In linear systems,
(14) arises when the impulse response support is compact.
RKHSs of functions over domains $X$ induced by system inputs as illustrated above are hereby called RKHSs of dynamic systems. Thus, if $g \in \mathcal{H}$, the scalar $g(x_t)$ is the noiseless output at instant $t$ of the system $g$ fed with the input trajectory contained in $x_t$. Note that in general $g$ is a functional: in the cases (12)-(14) the arguments entering $g$ are infinite-dimensional objects.
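As an illustration of the finite-memory construction (11), the following sketch (with hypothetical names and indexing conventions) builds the input locations from a sampled discrete-time input trajectory:

```python
import numpy as np

def input_locations(u, m):
    """Input locations x_t = [u(t-1), ..., u(t-m)] built from the samples of u
    (u[0] is interpreted as u(1); names and indexing are illustrative)."""
    return np.array([u[t - m:t][::-1] for t in range(m, len(u) + 1)])

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # u(1), ..., u(6)
X = input_locations(u, m=3)
print(X[0])     # first regressor: [3. 2. 1.]
print(X.shape)  # (4, 3): one row per time instant with a full past window
```

Each row of `X` is one input location; with $m = \infty$ as in (12), the rows would instead be (truncations of) infinite sequences.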
3.2 The linear system scenario
RKHSs of linear dynamic systems are now introduced, also discussing the structure of the resulting RN.
Linear system identification was faced in [SS2010] and [SurveyKBsysid][Part III] by introducing RKHSs of impulse responses. These are spaces induced by kernels $\bar{K}$ defined over (subsets of) the time domain. They thus contain causal functions, each of them representing an impulse response $f$. The RN which returns the impulse response estimate was

$$\hat{f} = \arg\min_{f} \ \sum_{i=1}^N \big( y_i - (f * u)(t_i) \big)^2 + \gamma \|f\|^2, \qquad (15)$$

where $(f * u)(t_i)$ is the convolution between the impulse response $f$
and the input $u$ evaluated at the time instant $t_i$.
The RKHSs of linear dynamic systems introduced here are instead associated to (output) linear kernels $K$ defined on $X \times X$ through convolutions of $\bar{K}$ with system inputs. In particular, if $X$ and the $x_t$ are as in (11)-(14), one has²

$$K(x_t, x_s) = \sum_{i} \sum_{j} \bar{K}(i, j) \, u(t-i) \, u(s-j), \qquad (16)$$

which e.g. in continuous-time becomes

$$K(x_t, x_s) = \int_0^{\infty} \int_0^{\infty} \bar{K}(a, b) \, u(t-a) \, u(s-b) \, da \, db.$$

(² Translated into a stochastic setting, the (output) kernel can be seen as the covariance of the output of a causal random process with covariance $\bar{K}$ filtered by $u$.)

These kernels lead to the RN (1),
which corresponds to (15) after a "reparametrization":
in place of the impulse response $f$,
the optimization variable becomes
the functional $g$.
The kernels arising in discrete- and continuous-time are described below in (17,19) and (22). The distinction between the RKHSs induced by and will be further discussed in Section 3.3.
We start by assuming that the input location is defined by (11), so that any $x \in X$ is an $m$-dimensional (column) vector and $X \subseteq \mathbb{R}^m$. If $P \in \mathbb{R}^{m \times m}$ is a symmetric and positive semidefinite matrix, a linear kernel is defined as follows:

$$K(x, a) = x^\top P a. \qquad (17)$$

All the kernel sections are linear functions. Their span defines a finite-dimensional (closed) subspace that, in view of the discussion following Theorem 3, coincides with the whole $\mathcal{H}$. Hence, $\mathcal{H}$ is a space of linear functions: for any $g \in \mathcal{H}$, there exists $\theta \in \mathbb{R}^m$ such that $g(x) = \theta^\top x$.
If $P$ is full rank, it holds that $\|g\|_{\mathcal{H}}^2 = \theta^\top P^{-1} \theta$.
Now, let $x_{t_i}$ denote the $i$-th input location, where $t_i$ is the time instant at which the $i$-th output is measured. Then, after plugging the representation $g(x) = \theta^\top x$ into (1), one obtains $\hat{g}(x) = \hat{\theta}^\top x$ with

$$\hat{\theta} = \arg\min_{\theta} \ \sum_{i=1}^N \big( y_i - \theta^\top x_{t_i} \big)^2 + \gamma \, \theta^\top P^{-1} \theta. \qquad (18)$$
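The equivalence between the output-kernel form of the RN and the regularized FIR estimate (18) can be checked numerically. The following sketch (our construction, with an assumed diagonal P and synthetic data) verifies that the representer-theorem solution and the primal solution coincide:

```python
import numpy as np

# Equivalence (sketch) between the output-kernel RN with K(x, a) = x' P a and
# regularized FIR estimation with penalty theta' P^{-1} theta.
rng = np.random.default_rng(1)
m, n, gamma = 5, 40, 1.0
P = np.diag(0.8 ** np.arange(1, m + 1))      # an assumed diagonal decaying prior
Phi = rng.standard_normal((n, m))            # rows: input locations x_{t_i}
theta_true = P @ rng.standard_normal(m)
y = Phi @ theta_true + 0.1 * rng.standard_normal(n)

# Kernel (dual) form: c = (Phi P Phi' + gamma I)^{-1} y, theta = P Phi' c
c = np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(n), y)
theta_kernel = P @ Phi.T @ c

# Primal form: theta = argmin ||y - Phi theta||^2 + gamma theta' P^{-1} theta
theta_primal = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ y)

print(np.allclose(theta_kernel, theta_primal))  # True
```

The dual form works with an N x N system and is the one that generalizes to the infinite-dimensional (IIR and continuous-time) cases discussed next.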
Consider now the input locations defined by
(12). The input space $X$ contains sequences.
Interpreting any input location as an infinite-dimensional
column vector, we can use ordinary algebra's notation
to handle infinite-dimensional objects. For example,
if $x, a \in \ell_2$ then $x^\top a = \langle x, a \rangle_2$,
where $\langle \cdot, \cdot \rangle_2$ is the inner product in the classical space $\ell_2$
of square summable sequences.
Let $P$ be a symmetric and positive semidefinite infinite-dimensional matrix (the nature of $P$ is discussed also in Section 3.3). Then, the function

$$K(x, a) = x^\top P a \qquad (19)$$

defines a linear kernel on $X \times X$.
Following arguments similar to those developed in the FIR case,
one can see that the RKHS associated to such $K$ contains linear functions of the form $g(x) = \theta^\top x$,
with $\theta$ an infinite-dimensional column vector. Note that each $g$
is a functional defined by the sequence $\theta$, which represents an impulse response.
In fact, one can deduce from (12) that $\theta^\top x_t$ is the discrete-time convolution, evaluated at $t$,
between $\theta$ and $u$.
The RN with the kernel (19) now implements regularized IIR estimation. Roughly speaking, (1) becomes the limit of (18) for $m \to \infty$. The exact solution can be obtained by the representer theorem (5) and turns out to be $\hat{g}(x) = \hat{\theta}^\top x$, where the infinite-dimensional column vector

$$\hat{\theta} = \sum_{i=1}^N \hat{c}_i \, P x_{t_i}$$

contains the impulse response coefficient estimates, with the $\hat{c}_i$ being the components of (6).
The continuous-time scenario arises considering the input locations defined by (13) or (14). The input space now contains causal functions. Considering (13), given a positive definite kernel $\bar{K}$, the linear kernel is

$$K(x_t, x_s) = \int_0^{\infty} \int_0^{\infty} \bar{K}(a, b) \, x_t(a) \, x_s(b) \, da \, db, \qquad (22)$$

which coincides with the continuous-time version of (16). Each kernel section is a continuous-time linear system: the section $K_{x_s}$ has impulse response $\int_0^{\infty} \bar{K}(\cdot, b) \, u(s-b) \, db$. Thus, the corresponding RKHS contains linear functionals, and (1) now implements regularized system identification in continuous-time. Using the representer theorem, the solution of (1) is $\hat{g}(x) = \int_0^{\infty} \hat{f}(a) \, x(a) \, da$, where $\hat{c}$ is still defined by (6) while $\hat{f}$ is the impulse response estimate given by

$$\hat{f}(a) = \sum_{i=1}^N \hat{c}_i \int_0^{\infty} \bar{K}(a, b) \, u(t_i - b) \, db.$$
3.3 Relationship between RKHSs of impulse responses and RKHSs of dynamic systems
In (19), the infinite-dimensional matrix represents a kernel over . Then, let be the corresponding RKHS which contains infinite-dimensional column vectors . We will now see that is the RKHS of impulse responses associated to , i.e. each is the impulse response of a linear system . In particular, let admit the following expansion in terms of linearly independent infinite-dimensional (column) vectors :
According to Theorem 5, the span of the provides all the . Moreover, if , then The equality
also provides the expansion of in terms of functionals defined by . Assuming that such functionals are linearly independent, it comes from Theorem 5 that each dynamic system has the representation . It is now obvious that such system is associated to the impulse response and the two spaces are isometrically isomorphic since
This result holds also in continuous-time where is now the RKHS associated to the kernel . Letting be real-valued functions on , this comes from the same arguments adopted in discrete-time but now applied to the expansions
The functionals could turn out to be linearly dependent even if the composing impulse responses are linearly independent. This depends on the nature of the input space. For instance, let the input space contain only the input locations induced by . Then, if is a rational transfer function with zeros , the functional associated to vanishes over . In these cases, there is no isometry between the two spaces: the same dynamic system could be defined by different impulse responses. In particular, results on RKHSs induced by sums of kernels reported in [Aronszajn50][Section 6 on p. 352] allow us to conclude that, among all the possible equivalent representations of the dynamic system, its complexity is quantified by the representation of minimum norm. This is illustrated through the following simple continuous-time example. Assume that
where the Laplace transforms of and are given, respectively, by the rational transfer functions
which satisfy . Let the input space contain only the input locations induced by . Then, the functionals associated, respectively, to coincide over the entire . Thus, the two impulse responses and induce the same system . Using Theorem 5, one has
4 Stable RKHSs
BIBO stability of a dynamic system is a familiar notion in control. In the RKHS context, we have the following definition.

Definition 7 (stable dynamic system).
Let $u$ be any bounded input, i.e. satisfying $\|u\|_{\infty} < \infty$. Then, the dynamic system $g$ is said to be (BIBO) stable if there exists a constant $C_u < \infty$ such that $|g(x_t)| \le C_u$ for any $t$ and any input location $x_t$ induced by $u$.

Note that, for $g$ to be stable, the above definition implicitly requires the input space of $g$ to contain any $x_t$ induced by any bounded input.

Definition 8 (stable RKHS).
Let $\mathcal{H}$ be a RKHS of dynamic systems induced by the kernel $K$. Then, $\mathcal{H}$ and $K$ are said to be stable if each $g \in \mathcal{H}$ is stable.
To derive stability conditions on the kernel, let us first introduce some useful Banach spaces. The first two regard the discrete-time setting:

the space $\ell_1$ of absolutely summable real sequences $v$, i.e. such that $\sum_t |v(t)| < \infty$, equipped with the norm $\|v\|_1 = \sum_t |v(t)|$;

the space $\ell_{\infty}$ of bounded real sequences $v$, i.e. such that $\sup_t |v(t)| < \infty$, equipped with the norm $\|v\|_{\infty} = \sup_t |v(t)|$.

The other two are concerned with continuous-time:

the Lebesgue space $\mathcal{L}_1$ of absolutely integrable functions $v$, i.e. such that $\int |v(t)| \, dt < \infty$, equipped with the norm $\|v\|_1 = \int |v(t)| \, dt$;

the Lebesgue space $\mathcal{L}_{\infty}$ of essentially bounded functions $v$, i.e. such that there exists $M < \infty$ with $|v(t)| \le M$ for almost every $t$, equipped with the norm $\|v\|_{\infty} = \operatorname{ess\,sup}_t |v(t)|$.
4.1 The linear system scenario
We start by studying the stability of RKHSs of linear dynamic systems. Obviously, all the FIR kernels (17) induce stable RKHSs. As for the IIR and continuous-time kernels in (19) and (22), it is first useful to recall the classical result linking BIBO stability and impulse response summability.
Proposition 9 (BIBO stability and impulse response summability). Let $f$ be the impulse response of a linear system. Then, the system is BIBO stable iff $f \in \ell_1$ in discrete-time, or $f \in \mathcal{L}_1$ in continuous-time.
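Proposition 9 can be illustrated numerically (a toy check of ours, not part of the paper): the partial sums of an exponentially decaying impulse response converge, while those of f(t) = 1/t diverge logarithmically with the truncation horizon:

```python
import numpy as np

# Absolute summability check: exponentially decaying impulse response (BIBO
# stable) versus the non-summable tail f(t) = 1/t.
t = np.arange(1, 100001)
f_stable = 0.9 ** t          # sum converges to 0.9 / 0.1 = 9
f_unstable = 1.0 / t         # partial sums grow like log(T)
print(f_stable.sum())
print(f_unstable.sum())
```

For the stable case, the worst-case output under a bounded input is bounded by the product of the input sup-norm and the impulse response 1-norm.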
The next proposition provides the necessary and sufficient condition
for RKHS stability in the linear scenario.
Proposition 10 (RKHS stability in the linear case).
Let be the RKHS of dynamic systems induced by the IIR kernel (19). Then, the following statements are equivalent
The input space contains so that
Let instead be the RKHS induced by the continuous-time kernel (22). The following statements are then equivalent
The input space contains so that
Proof: The proof is developed in discrete-time. The continuous-time case follows exactly by the same arguments with minor modifications.
Recalling Definition 7 and subsequent discussion, this is a direct consequence of the BIBO stability assumption of any .
Given any , let its associated impulse response and define
The assumption implies that
and the implication follows by Proposition 9.
By assumption, the kernel is well defined over the entire . Hence, any kernel section centred on is a well defined element in and corresponds to a dynamic system with associated impulse response . With still defined by (27), one has which implies and proves (3).
By assumption, any impulse response associated to any kernel section centred on belongs to . This implies that the kernel associated to is well defined over the entire . Recalling Definition 1 and eq. (2), RKHS theory then ensures that any is well defined pointwise on and .
Point (3) of Proposition 10 was also cited in [SurveyKBsysid, DinuzzoSIAM15] as a particularization of a rather involved and abstract result reported in [Carmeli]. The stability proof reported above turns out to be surprisingly simple instead. The reason is that, with the notation adopted in (19) and (22), the results in [Carmeli] were obtained starting from spaces of impulse responses induced by $\bar{K}$. Our starting point is instead the RKHSs of dynamic systems induced by $K$ (in turn defined by $\bar{K}$). This different perspective greatly simplifies the analysis: kernel stability can be characterized just by combining basic RKHS theory and Proposition 9.
Proposition 10 shows that RKHS stability is implied by the absolute summability/integrability of $\bar{K}$, i.e. by

$$\sum_{i} \sum_{j} |\bar{K}(i, j)| < \infty \quad \text{and} \quad \int_0^{\infty} \!\! \int_0^{\infty} |\bar{K}(a, b)| \, da \, db < \infty \qquad (28)$$

in discrete- and continuous-time, respectively. The condition (28) is also necessary for nonnegative-valued kernels [SurveyKBsysid][Section 13]. Then, considering e.g. the continuous-time setting, the popular Gaussian and Laplacian kernels, which belong to the class of radial basis kernels, are all unstable when used as $\bar{K}$. Stability instead holds for the stable spline kernel [PillACC2010] given by

$$\bar{K}(a, b) = e^{-\beta \max(a, b)}, \quad \beta > 0, \qquad (29)$$

where $\beta$ is related to the impulse response's dominant pole.
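To illustrate, the following sketch builds a discrete-time first-order stable-spline-type kernel, K̄(i, j) = α^max(i, j) (an assumed form consistent with the discussion above, with illustrative hyperparameters), and checks the summability condition (28) on a large truncation:

```python
import numpy as np

# K(i, j) = alpha**max(i, j): an assumed discrete first-order stable-spline form.
# Since the number of (i, j) pairs with max(i, j) = k is 2k - 1, the double sum
# equals sum_{k>=1} (2k - 1) alpha**k, which is finite for 0 < alpha < 1.
alpha, T = 0.8, 500
i = np.arange(1, T)
M = alpha ** np.maximum(i[:, None], i[None, :])
print(M.sum())   # close to 36 for alpha = 0.8
```

For α = 0.8 the closed-form value is 2·α/(1-α)² − α/(1-α) = 40 − 4 = 36, which the truncated sum matches to high accuracy.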
4.2 The nonlinear system scenario
Proposition 11 (RKHS stability in the nonlinear case).
Let $\mathcal{H}$ be a RKHS of dynamic systems induced by the kernel $K$. Let $B_r$ denote the closed ball of radius $r$ induced by $\|\cdot\|_{\infty}$, contained in $\mathbb{R}^m$ or $\ell_{\infty}$ in discrete-time, or in $\mathcal{L}_{\infty}$ in continuous-time. Assume that, for any $r < \infty$, there exists $C_r < \infty$ such that

$$\sup_{x \in B_r} K(x, x) \le C_r.$$

Then, the RKHS $\mathcal{H}$ is stable.
Proof: Let the system input satisfy $\|u\|_{\infty} \le r$ in discrete-time or in continuous-time. Then, we can find a closed ball $B_{\bar{r}}$ containing, for any $t$, all the input locations $x_t$ induced by $u$ as defined in (11)-(14). For any $g \in \mathcal{H}$ and $t$, exploiting the reproducing property and the Cauchy-Schwarz inequality, one obtains

$$|g(x_t)| = |\langle g, K_{x_t} \rangle_{\mathcal{H}}| \le \|g\|_{\mathcal{H}} \sqrt{K(x_t, x_t)} \le \|g\|_{\mathcal{H}} \sqrt{C_{\bar{r}}},$$

hence proving the stability of $\mathcal{H}$.
The following result will also be useful later on. It derives from the fact that products (sums) of kernels induce RKHSs of functions which are products (sums) of the functions induced by the single kernels [Aronszajn50][p. 353 and 361].

Proposition 12 (RKHS stability with kernel sums and products). Let $K_1$ and $K_2$ be stable kernels. Then, the RKHSs induced by $K_1 + K_2$ and $K_1 K_2$ are both stable.
The two propositions above make it easy to prove the stability of a very large class of kernels, as discussed in discrete-time in the remaining part of this section.
Radial basis kernels
First, consider the input locations (11) contained in $\mathbb{R}^m$. As already mentioned in Section 4.1, radial basis kernels are widely adopted in machine learning; below, $\|\cdot\|$ indicates the Euclidean norm. Important examples are the Gaussian kernel

$$K(x, a) = \exp\!\left( -\frac{\|x - a\|^2}{\rho} \right), \quad \rho > 0, \qquad (31)$$

and the Laplacian kernel

$$K(x, a) = \exp\!\left( -\frac{\|x - a\|}{\rho} \right), \quad \rho > 0. \qquad (32)$$

From Proposition 11 one immediately sees
that both these kernels are stable.
More generally, all the radial basis kernels are stable, since they are constant
along their diagonal ($K(x, x) \equiv \text{const}$).³ (³ This statement should not be confused with the result
discussed in Section 4.1 in the linear system scenario.
There, we have seen that radial basis kernels
lead to unstable linear kernels
when used to define $\bar{K}$ in the IIR (19) and
continuous-time (22) cases.
Here, the Gaussian and Laplacian kernels are instead used
to define $K$ directly in the nonlinear system scenario.)
However, despite their stability, some drawbacks affect the use of (31), (32) in system identification. First, the fact that $K(x, x)$ is constant implies that these models do not include the information that output energy is likely to increase if input energy increases. Second, they measure the similarity among input locations without using the information that $u(t - k)$ is expected to have less influence on the prediction of the output at $t$ as the positive lag $k$ increases. This limitation is also in some sense hidden by the finite-dimensional context. In fact, if the input locations are now defined by (12), i.e. the system memory is infinite, the Gaussian kernel becomes

$$K(x, a) = \exp\!\left( -\frac{\|x - a\|_2^2}{\rho} \right), \qquad (33)$$

where $\|\cdot\|_2$ now denotes the $\ell_2$ norm of a sequence. This model is not reasonable: it is not continuous around the origin of $\ell_2$ and, off the diagonal, is null for many input locations. This reveals the importance of finding an appropriate metric to measure the distance between different input trajectories.
New kernel for nonlinear system identification. We now show how stable spline kernels, which embed exponential stability of linear systems [SS2010], can be useful also to define nonlinear models. Specifically, let $\bar{K}$ be a (discrete-time) stable spline kernel, e.g. diagonal with entries $\alpha^i$, or given by $\alpha^{\max(i, j)}$ for any integers $i, j$ and $0 < \alpha < 1$. Then, define the nonlinear stable spline (NSS) kernel as

$$K(x, a) = \big( x^\top \bar{K} a \big) \, \exp\!\left( -\frac{(x - a)^\top \bar{K} (x - a)}{\rho} \right), \qquad (34)$$

which corresponds to the product between a linear kernel and a modified version of (31). This kernel defines a new infinite-dimensional RKHS suited to the identification of nonlinear output error models. Being no longer constant along the diagonal, it embeds the information that output energy increases if input energy increases, while preserving BIBO stability (as one can easily deduce from Propositions 11 and 12). Note also that, letting the dimensionality of the regression space go to infinity, the difficult selection of the discrete dimension of the regressors
has been eliminated. In fact, input location similarity is regulated by the hyperparameter $\alpha$, which includes the information that the influence of $u(t - k)$ on the output goes to zero as the time lag $k$ increases.
4.3 Numerical experiment
The following nonlinear system is taken from [Spinelli2005]:
Then, consider the identification of the following two systems, called (S1) and (S2):
where all the input and noise signals are independent Gaussian noises of variance 4. Note that (S2) contains the sum of a linear time-invariant system (details on the impulse response are given below) and the nonlinear FIR in (S1). Our aim is to identify the two systems from 1000 input-output pairs via (1). The performance will be measured by the percentage fit on a test set of 1000 noiseless system outputs contained in the vector $Y_{\text{test}}$, i.e.

$$\text{fit} = 100 \left( 1 - \frac{\| Y_{\text{test}} - \hat{Y} \|}{\| Y_{\text{test}} - \bar{Y}_{\text{test}} \|} \right), \qquad (35)$$

where $\bar{Y}_{\text{test}}$ is the mean of the components of $Y_{\text{test}}$ while $\hat{Y}$ is the prediction from an estimated model.
We will display MATLAB boxplots of the 100 fits achieved by (1) in a Monte Carlo study of 100 runs, using different kernels.
At each run, independent noises are drawn to form new identification and test data.
The impulse response in (S2) also varies: it is a rational transfer function with impulse response norm
equal to 10 (this makes the contributions of the linear and nonlinear system components to the output variance similar) and poles inside the complex circle of radius 0.95, randomly generated
as detailed in [Pillonetto2016][Section 7.4].
First, we use the Gaussian kernel (31) over an $m$-dimensional input space. Plugged into (1), it defines a hyperparameter vector including the regularization parameter and the kernel width. For tuning the Gaussian kernel hyperparameters, the regressor dimension $m$ is chosen by an oracle not implementable in practice. Specifically, at each run, for each $m = 1, \dots, 50$ the pair $(\gamma, \rho)$ is determined via marginal likelihood optimization [DeNicolao1] using only the identification data. Multiple starting points have been adopted to mitigate the effect of local minima. The oracle then has access to the test set to select, among the 50 candidates, the one maximizing the prediction fit (35).
The two left boxplots in Fig. 1 report the prediction fits achieved by this procedure
applied to identify the first (top) and the second (bottom) system. Even if the Gaussian kernel is equipped with the oracle, its performance is satisfactory only
in the (S1) scenario, while its capability of predicting outputs in the (S2) case is poor
in many runs. In place of the marginal likelihood, a cross-validation strategy has also been used for hyperparameter tuning,
obtaining results similar to those displayed here.
During the Monte Carlo runs, even when the number of nonzero impulse response coefficients is quite large, we noticed that a relatively small value of $m$ is frequently chosen, i.e. the oracle tends to use few past input values to predict the output. This indicates that the Gaussian kernel structure induces the oracle to introduce a significant bias to guard against a large estimator variance.
We now show that, in this example, model complexity can be better controlled by avoiding the difficult and computationally expensive choice of discrete orders. In particular, we use the full past of the input as regressor together with the new NSS kernel (34). For tuning the NSS hyperparameters, no oracle having access to the test set is employed, but just a single continuous optimization of the marginal likelihood that uses only the identification data. The fits achieved by NSS in the two Monte Carlo studies are reported in the right boxplots of Fig. 1: in both cases the new estimator performs very well.
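The marginal likelihood tuning used above can be sketched as follows (a simplified scalar-input example of ours, with a fixed Gaussian kernel and a grid over the noise variance only; all names and values are illustrative):

```python
import numpy as np

# Empirical Bayes sketch: under the Gaussian interpretation of (1),
# Y ~ N(0, G + sigma2*I) with G the kernel matrix; hyperparameters are chosen
# by minimizing the negative log marginal likelihood on the training data.
def neg_log_marglik(sigma2, G, y):
    S = G + sigma2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (logdet + y @ np.linalg.solve(S, y))

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)     # true noise variance 0.01
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)     # assumed fixed kernel width
grid = np.logspace(-4, 1, 60)
best = min(grid, key=lambda s2: neg_log_marglik(s2, G, y))
print(best)   # an estimate of the noise variance
```

In practice all hyperparameters (kernel scale, width, noise variance) are optimized jointly, typically with gradient-based routines and multiple starting points.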
5 Consistency of regularization networks for system identification
5.1 The regression function
In what follows, the system input $u$ is a stationary stochastic process, over $\mathbb{Z}$
in discrete-time or $\mathbb{R}$ in continuous-time.
The distribution of $u$ induces on $X$ the (Borel, nondegenerate) probability measure $\mu$
from which the input locations are drawn. In view of their dependence on $u$, the input locations are in general correlated with each other
and unbounded; e.g. for Gaussian $u$, no bounded set contains them with probability one.
Such peculiarities, inherited from the system identification
setting, already violate the data generation assumptions routinely adopted in machine learning.
The identification data are assumed to form a stationary stochastic process. In particular, each couple $(x, y)$ has joint probability measure $\mu(y|x)\mu(x)$, where $\mu(y|x)$ is the probability measure of the output conditional on a particular input location $x$.
Given a function (dynamic system) $g$, the least squares error associated to $g$ is

$$\mathcal{E}(g) = \int \big( y - g(x) \big)^2 \, d\mu(x, y). \qquad (36)$$

The following result is well known and characterizes the minimizer of (36), which goes under the name of regression function in machine learning.

Theorem 13 (The regression function).
It holds that

$$\mathcal{E}(g) = \int_X \big( g(x) - g_0(x) \big)^2 \, d\mu(x) + \mathcal{E}(g_0),$$

where $g_0$ is the regression function defined for any $x \in X$ by

$$g_0(x) = \int y \, d\mu(y \,|\, x). \qquad (37)$$

In our system identification context, $g_0$ is the dynamic system associated to the optimal predictor, i.e. the one that minimizes the expected quadratic loss on a new output drawn from $\mu(y|x)$.
5.2 Consistency of regularization networks for system identification
Consider a scenario where $g_0$ (and possibly also $\mu$) is unknown and only samples from $\mu$ are available. We study the convergence of the RN $\hat{g}$ in (1) to the optimal predictor $g_0$ as $N \to \infty$ under the input-induced norm

$$\|g\|_{\mu}^2 = \int_X g^2(x) \, d\mu(x).$$

This is the norm in the classical
Lebesgue space $\mathcal{L}_2^{\mu}$ already introduced at the end of Section 2.
We assume that the reproducing kernel of $\mathcal{H}$ admits the expansion

$$K(x, a) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(a),$$

with the $\zeta_i$ and $\rho_i$ defined via (10) and the probability measure $\mu$.
Exploiting the regression function, the measurement process can be written as

$$y_t = g_0(x_t) + e_t, \qquad (38)$$

where the errors $e_t$ are zero-mean and identically distributed. They can be correlated with each other and also with the input locations. Given any $t$ and the $i$-th kernel eigenfunction
$\rho_i$, we define the random variables obtained by combining $\rho_i$, the input locations and the errors defined by (38).
Let $\mathcal{H}$ be stable, so that these variables form a stationary process. In particular, note that each of them is the product of the outputs of
two stable systems: the first one is corrupted by noise while the second one
is noiseless.
Now, by summing over the cross covariances of lag $\tau$ using the kernel eigenvalues as weights, one obtains⁴ the quantities in (40), where the covariance operator appears. (⁴ As shown in the Appendix, the quantities in (40) are invariant w.r.t. the particular spectral decomposition used to obtain the eigenvalue/eigenfunction pairs.) The next proposition shows that summability of these weighted cross covariances is key for consistency.
Proposition 14 (RN consistency in system identification). Let $\mathcal{H}$ be the RKHS with kernel

$$K(x, a) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(a),$$

where $(\zeta_i, \rho_i)$ are the eigenvalue/eigenfunction pairs defined by (10) under the probability measure $\mu$. Assume that, for any lags and indices under consideration, there exists a constant such that