# Kernel Least Mean Square with Adaptive Kernel Size

Kernel adaptive filters (KAF) are a class of powerful nonlinear filters developed in Reproducing Kernel Hilbert Space (RKHS). The Gaussian kernel is usually the default kernel in KAF algorithms, but selecting the proper kernel size (bandwidth) is still an important open issue, especially for learning with small sample sizes. In previous research, the kernel size was set manually or estimated in advance by Silverman's rule based on the sample distribution. This study aims to develop an online technique for optimizing the kernel size of the kernel least mean square (KLMS) algorithm. A sequential optimization strategy is proposed, and a new algorithm is developed, in which the filter weights and the kernel size are both sequentially updated by stochastic gradient algorithms that minimize the mean square error (MSE). Theoretical results on convergence are also presented. The excellent performance of the new algorithm is confirmed by simulations on static function estimation and short term chaotic time series prediction.


## 1 Introduction

Kernel based methods are successfully used in machine learning and nonlinear signal processing due to their inherent advantages of convex optimization and universality in the space of functions. By mapping the input data into a feature space associated with a Mercer kernel, many efficient nonlinear algorithms can be developed, thanks to the kernel trick. Popular kernel methods include support vector machine (SVM) [1, 2], kernel regularization network [3], kernel principal component analysis (KPCA) [4], and kernel Fisher discriminant analysis (KFDA) [5]. These nonlinear algorithms show significant performance improvement over their linear counterparts.

Online kernel learning [6, 7, 8, 9] has also been extensively studied in the machine learning and statistical signal processing literature, and it provides efficient alternatives to approximate the desired nonlinearity incrementally. As the training data are sequentially presented to the learning system, online learning requires, in general, much less memory and computational cost. Recently, kernel based online algorithms for adaptive filtering have been developed and have become an emerging area of research [10]. Kernel adaptive filters (KAF) are derived in Reproducing Kernel Hilbert Spaces (RKHS) [11, 12], by using the linear structure and inner product of this space to implement the well-established linear adaptive filtering algorithms that correspond to nonlinear filters in the original input space. Typical KAF algorithms include the kernel least mean square (KLMS) [13, 14], kernel affine projection algorithms (KAPA) [15], kernel recursive least squares (KRLS) [16], and extended kernel recursive least squares (EX-KRLS) [17]. With a radially symmetric Gaussian kernel they create a growing radial-basis function (RBF) network to learn the network topology and adapt free parameters directly from the training data. Among these KAF algorithms, the KLMS is the simplest and fastest to implement, yet very effective.

There are two main open challenges in the KAF algorithms. The first is their growing structure with each sample, which results in increasing computational costs and memory requirements, especially in continuous adaptation scenarios. In order to curb the network growth and to obtain a compact representation, a variety of sparsification techniques have been applied, where only the important input data are accepted as new centers. The presently available sparsification criteria include the novelty criterion [18], the approximate linear dependency (ALD) criterion [16], the surprise criterion [19], and so on. In a recent work [20], we have proposed a novel method, the quantized kernel least mean square (QKLMS) algorithm, to compress the input space and hence constrain the network size, which is shown to be very effective in yielding a compact network with desirable accuracy.

Selecting a proper Mercer kernel is the second remaining problem that should be addressed when implementing kernel adaptive filtering algorithms, especially when the training data size is small. In this case, the kernel selection includes two parts: first, the kernel type is chosen, and second, its parameters are determined. Among various kernels, the Gaussian kernel is very popular and is usually a default choice in kernel adaptive filtering due to its universal approximating capability, desirable smoothness and numeric stability. The normalized Gaussian kernel is

$$\kappa(u, u') = \exp\left(-\frac{\|u - u'\|^2}{2\sigma^2}\right) \tag{1}$$

where the free parameter $\sigma$ ($\sigma > 0$) is called the kernel size (also known as the kernel bandwidth or smoothing parameter). In fact, the Gaussian kernel is strictly positive definite and as such produces an RKHS that is dense [12], so linear algorithms in this RKHS are universal approximators of smooth functions. In principle this means that in the large sample size regime the asymptotic properties of the mean square approximation are independent of the kernel size [21]. The kernel size in KAF therefore only affects the dynamics of learning: in the initial steps the sample size is always small, so both the accuracy for batch learning and the convergence properties for online learning depend on the kernel size. This should be contrasted with the effect of the kernel size in classification, where the kernel size controls both the accuracy and the generalization of the optimal solution [1, 2]. Many methods for selecting the kernel size of the Gaussian kernel have been borrowed from statistics, nonparametric regression and kernel density estimation. The most popular are: cross-validation (CV) [22, 23, 24, 25, 26], which can always be used since the kernel size is a free parameter, penalizing functions [23], plug-in methods [23, 27], Silverman's rule [28] and other rules of thumb [29]. The cross-validation, penalizing functions, and plug-in methods are computationally intensive and are not suitable for online kernel learning. Silverman's rule is widely accepted in kernel density estimation although it is derived under a Gaussian assumption and is usually not appropriate for multimodal distributions. Besides the fixed kernel size, some adaptive or varying kernel size algorithms can also be found in the literature [30, 31, 32, 33]. This topic is also closely related to the techniques of multi-kernel learning or learning the kernel in the machine learning literature [34, 35, 36, 37, 38, 39]. There the goal is typically to learn a combination of kernels based on some optimization methods, but in KAF this approach is normally avoided due to the computational complexity [10].

All the above mentioned methods, however, are not suitable for determining an optimal kernel size in online kernel adaptive filtering, since they either are batch mode methods or originate from a different problem, such as the kernel density estimation. Given that in online learning the number of samples is large and not specified a priori, the final solution will be practically independent of the kernel size. The real issue is therefore how to speed up convergence to the neighborhood of the optimal solution, which will also provide smaller network sizes. In the present work, by treating the kernel size as an extra parameter for the optimization, a novel sequential optimization framework is proposed for the KLMS algorithm. The new optimization paradigm allows for an online adaptation algorithm. At each iteration cycle, the filter weights and the kernel size are both sequentially updated to minimize the mean square error (MSE). As the kernel size is updated sequentially, the proposed algorithm is computationally very simple. The new algorithm can also be incorporated in the quantization method so as to yield a compact model.

The rest of the paper is organized as follows. In Section 2, we briefly revisit the KLMS algorithm. In Section 3, we propose a sequential optimization strategy for the kernel size in KLMS, and then derive a simple stochastic gradient algorithm to adapt the kernel size. In Section 4, we give theoretical results on convergence of KLMS with adaptive kernel size. Specifically, we derive the energy conservation relation in RKHS, and on this basis we derive a sufficient condition for the mean square convergence, and arrive at the theoretical value of the steady-state excess mean-square error (EMSE). In Section 5, we present simulation examples on static function estimation and short term chaotic time series prediction to confirm the satisfactory performance of the KLMS with adaptive kernel size. Finally, in Section 6, we present the conclusion.

## 2 KLMS

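Suppose the goal is to learn a continuous input-output mapping $f: U \to Y$ based on a sequence of input-output examples (training data) $\{u(i), y(i)\}_{i=1}^{N}$, where $U \subseteq \mathbb{R}^m$ is the input domain and $Y \subseteq \mathbb{R}$ is the desired output space. The hypothesis space for learning is assumed to be a Reproducing Kernel Hilbert Space (RKHS) $H_k$ associated with a Mercer kernel $\kappa(u, u')$, a continuous, symmetric, and positive-definite function [11]. To find such a function $f$, one may solve the regularized least squares regression in $H_k$:

$$\min_{f \in H_k} \sum_{i=1}^{N} \big(y(i) - f(u(i))\big)^2 + \gamma \|f\|_{H_k}^2 \tag{2}$$

where $\|\cdot\|_{H_k}$ denotes the norm in $H_k$, and $\gamma \ge 0$ is the regularization factor that controls the smoothness of the solution. As the inner product in RKHS satisfies the reproducing property, namely $\langle f \,|\, \kappa(u, \cdot) \rangle_{H_k} = f(u)$, (2) can be rewritten as

$$\min_{f \in H_k} \sum_{i=1}^{N} \big(y(i) - \langle f \,|\, \kappa(u(i), \cdot) \rangle_{H_k}\big)^2 + \gamma \|f\|_{H_k}^2 \tag{3}$$

By the representer theorem [12], the solution of (2) can be expressed as a linear combination of kernels:

$$f(u) = \sum_{i=1}^{N} \alpha_i \, \kappa(u(i), u) \tag{4}$$

The coefficient vector $\alpha = [\alpha_1, \ldots, \alpha_N]^T$ can be calculated as $\alpha = (K + \gamma I)^{-1} y$, where $K$ is the $N \times N$ Gram matrix with elements $K_{ij} = \kappa(u(i), u(j))$, and $y = [y(1), \ldots, y(N)]^T$.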

Solving the previous least squares problem usually requires significant memory and computation because a large Gram matrix, whose dimension equals the number of input patterns, must be calculated. The KAF algorithms, however, provide efficient alternatives that build the solution incrementally, without explicitly computing the Gram matrix. Denote by $f_i$ the estimated mapping (hypothesis) at iteration $i$. The KLMS algorithm can be expressed as [10]

$$\begin{cases} f_0 = 0 \\ f_i = f_{i-1} + \eta \, e(i) \, \kappa(u(i), \cdot) \end{cases} \tag{5}$$

where $\eta$ denotes the step size and $e(i) = y(i) - f_{i-1}(u(i))$ is the instantaneous prediction error at iteration $i$; that is, the instantaneous error only depends on the difference between the desired response at the current time and the evaluation of the current sample ($u(i)$) with the previous system model ($f_{i-1}$). The learned mapping of KLMS, at iteration $N$, will be

$$f_N(u) = \eta \sum_{i=1}^{N} e(i) \, \kappa(u(i), u) \tag{6}$$

This is a convenient result: the unknown nonlinear mapping is learned incrementally, one step at a time, by a growing RBF network whose centers are the samples and whose coefficients are determined automatically by the current prediction errors.
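The recursion (5) and the expansion (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class name `KLMS`, the helper `gaussian_kernel`, and the toy data (deterministic inputs in [0, 1) with a noiseless cosine target) are assumptions made only for this example.

```python
import math


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)) for equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


class KLMS:
    """KLMS with a fixed kernel size: f_i = f_{i-1} + eta * e(i) * kappa(u(i), .)."""

    def __init__(self, eta, sigma):
        self.eta = eta
        self.sigma = sigma
        self.centers = []  # stored inputs u(i)
        self.coeffs = []   # eta * e(i), one coefficient per stored input

    def predict(self, u):
        # Evaluate the kernel expansion (6) at the query point u.
        return sum(c * gaussian_kernel(ctr, u, self.sigma)
                   for ctr, c in zip(self.centers, self.coeffs))

    def update(self, u, y):
        e = y - self.predict(u)          # error computed with the previous model
        self.centers.append(tuple(u))    # the new sample becomes a center
        self.coeffs.append(self.eta * e)
        return e


# Toy usage (hypothetical data): learn y = cos(8u) from 200 samples.
model = KLMS(eta=0.5, sigma=0.2)
errors = []
for i in range(200):
    u = ((i * 0.6180339887) % 1.0,)      # deterministic, well-spread inputs
    errors.append(model.update(u, math.cos(8.0 * u[0])))
```

Each call to `update` adds one center, so both memory and the cost of `predict` grow linearly with the number of samples, which is the growth problem the sparsification methods mentioned in the introduction address.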

Taking advantage of the incremental nature of the KLMS updates, the KLMS adaptation is in essence the solution of the following incremental regularized least squares problem:

$$\min_{f_i \in H_k} \big(y(i) - f_i(u(i))\big)^2 + \frac{1-\eta}{\eta} \|f_i - f_{i-1}\|_{H_k}^2 \tag{7}$$

Letting $f_i = f_{i-1} + \Delta f_i$, (7) is equivalent to

$$\min_{\Delta f_i \in H_k} \big(e(i) - \Delta f_i(u(i))\big)^2 + \frac{1-\eta}{\eta} \|\Delta f_i\|_{H_k}^2 \tag{8}$$

From (8), one may observe: 1) KLMS learning at iteration $i$ is equivalent to solving a regularized least squares problem, in which the previous hypothesis $f_{i-1}$ is frozen, and only the adjustment term $\Delta f_i$ is optimized; 2) in this least squares problem, there is only one training example involved, i.e. $\{u(i), e(i)\}$; 3) the regularization factor is directly related to the step size via $\gamma = (1-\eta)/\eta$.
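As a quick check of the link between (8) and the KLMS update, the minimizer of (8) can be worked out in closed form. The sketch below assumes a kernel with $\kappa(u, u) = 1$ (true for the Gaussian kernel) and restricts $\Delta f_i$ to the form $c \, \kappa(u(i), \cdot)$, which is where the representer theorem places the minimizer; the scalar $c$ is notation introduced only for this derivation.

```latex
\begin{aligned}
\Delta f_i(u(i)) &= c\,\kappa(u(i), u(i)) = c, \qquad
\|\Delta f_i\|_{H_k}^2 = c^2\,\kappa(u(i), u(i)) = c^2, \\
J(c) &= (e(i) - c)^2 + \frac{1-\eta}{\eta}\, c^2, \\
\frac{dJ}{dc} &= -2\,(e(i) - c) + 2\,\frac{1-\eta}{\eta}\, c = 0
\;\Longrightarrow\; \frac{c}{\eta} = e(i)
\;\Longrightarrow\; c = \eta\, e(i).
\end{aligned}
```

Under these assumptions the optimal adjustment is $\Delta f_i = \eta \, e(i) \, \kappa(u(i), \cdot)$, which is exactly the correction term in (5).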

In the rest of the paper, the Mercer kernel is assumed to be the Gaussian kernel. In addition, to explicitly show the kernel size dependence, we denote the Gaussian kernel by $\kappa_\sigma$, and the induced RKHS by $H_\sigma$.

## 3 KLMS with Adaptive Kernel Size

The kernel is a crucial factor in all kernel methods in the sense that it defines the similarity between data points. For the Gaussian kernel, this similarity depends on the kernel size selected. If the kernel size is too large, then all the data will look similar in the RKHS (with inner products all close to 1), and the procedure reduces to linear regression. On the other hand, if the kernel size is too small, then all the data will look distinct (with inner products all close to 0), and the system fails to do inference on unseen data that fall between the training points.
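The two extremes described above are easy to see numerically. The snippet below is a small illustration with made-up points and kernel sizes (all values are assumptions chosen for the example): it evaluates the Gaussian kernel (1) between one reference point and two others for a very large, a moderate, and a very small kernel size.

```python
import math


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


points = [(0.0,), (0.5,), (1.0,)]        # hypothetical 1-D inputs
for sigma in (100.0, 0.5, 0.01):         # too large, moderate, too small
    row = [gaussian_kernel(points[0], p, sigma) for p in points]
    print(f"sigma={sigma}: {[round(k, 4) for k in row]}")
```

With the large kernel size every pair looks almost identical (values near 1), while with the small one only a point compared with itself gives a value that is not negligible.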

Up to now the KLMS has only been studied with a constant kernel size, so the full elegance of the solution has not been recognized. In fact, the sequential learning algorithm (5) builds the current estimate $f_i$ from two additive parts: the previous hypothesis and a correction term proportional to the prediction error on new data. In principle we can use one RKHS to compute the previous hypothesis and change the RKHS to compute the correction term in (5), which can be efficiently done by changing the kernel size. This is the motivating idea that we pursue in this paper, and it has two fundamental components: (1) we have to formalize this approach; (2) we have to find an easy way of implementing it from samples. In the following, we propose an approach to sequentially optimize the KLMS with a variable kernel size.

### 3.1 Sequential Optimization of the Kernel Size in KLMS

In order to determine an optimal kernel size for KLMS, one should define in advance a cost function for the optimality. To make this precise, we suppose the training data $\{u(i), y(i)\}$ are random, and there is an absolutely continuous probability measure $P$ (usually unknown) on the product space $U \times Y$ from which the data are drawn. The measure $P$ defines a regression function:

$$f^*(u) = \int_Y y \, dP(y \mid u) \tag{9}$$

where $P(y \mid u)$ is the conditional measure on $Y$. In this situation, the function $f^*$ can be said to be the desired mapping that needs to be estimated. Thus a measurement of the error in $f_i$ (which is updated by KLMS) is

$$J_1 = \int_U (f^* - f_i)^2 \, dP(u) \tag{10}$$

where $P(u)$ is the marginal measure on $U$. Then the optimization should find the kernel size that minimizes this error. Since in practice $f^*$ is usually unknown, one can use the mean square error as an alternative cost for optimization:

$$J_2 = \int_{U \times Y} [y - f_i(u)]^2 \, dP(u, y) \tag{11}$$

The cost $J_2$ can be easily estimated from sample data. This is especially important when the probability measure $P$ is unknown.

Of course, the kernel size in KLMS can be optimized in batch mode, that is, the optimization is performed only after presenting the whole training data. Then, combining (6) and (11) yields the optimization:

$$\sigma^* = \arg\min_{\sigma \in \mathbb{R}^+} \int_{U \times Y} \Big[ y - \eta \sum_{i=1}^{N} e(i) \, \kappa_\sigma(u(i), u) \Big]^2 dP(u, y) \tag{12}$$

where $\sigma^*$ stands for the optimal kernel size. As KLMS is an online learning algorithm, we are more interested in a sequential optimization framework, which allows the kernel size to be sequentially optimized across iterations. To this end, we propose the following sequential optimization:

$$\sigma_i^* = \arg\min_{\sigma_i \in \mathbb{R}^+} \int_{U \times Y} \big[ y - f_{i-1}(u) - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), u) \big]^2 dP(u, y) \tag{13}$$

where the previous hypothesis $f_{i-1}$ is frozen, and $\sigma_i$ denotes the kernel size at iteration $i$.

Remark 1: By (13), the kernel size is optimized sequentially. Thus at iteration $i$, the initial learning step will determine an optimal value of the kernel size (the old kernel sizes remain unchanged), followed by the addition of a new center using KLMS with this new kernel size. Learning with a varying kernel size means that at each iteration cycle the adaptation is performed in a different RKHS, since changing the kernel size modifies the inner product of the Hilbert space. For the KLMS, this learning paradigm is indeed feasible, because at each iteration cycle the old centers remain frozen and only a new center is added; in other words, the correction term is just a feature vector in the current RKHS.

Remark 2: A more reasonable approach would be to jointly optimize the kernel size and the step size (corresponding to the regularization factor). In this work, however, for simplicity the step size is assumed to be fixed and only the kernel size is optimized.

Suppose the training data are independent, identically distributed (i.i.d.). The hypothesis $f_i$, which depends on the previous training data, will be independent of the future training data. Then the mean square prediction error at iteration $i+1$, conditioned on $f_i$, equals

$$\begin{aligned} E[e^2(i+1) \mid f_i] &= \int_{U \times Y} [y(i+1) - f_i(u(i+1))]^2 \, dP(u(i+1), y(i+1) \mid f_i) \\ &= \int_{U \times Y} [y(i+1) - f_i(u(i+1))]^2 \, dP(u(i+1), y(i+1)) \\ &= \int_{U \times Y} [y - f_i(u)]^2 \, dP(u, y) \end{aligned} \tag{14}$$

where $P(u(i+1), y(i+1) \mid f_i)$ denotes the probability measure of $(u(i+1), y(i+1))$ conditioned on $f_i$. Since the mapping update at iteration $i$ directly affects the prediction error at iteration $i+1$, according to (14) the sequential optimization problem (13) can be equivalently defined as the search for a kernel size such that the conditional mean square error $E[e^2(i+1) \mid f_i]$ is minimized:

$$\sigma_i^* = \arg\min_{\sigma_i \in \mathbb{R}^+} E[e^2(i+1) \mid f_i] = \arg\min_{\sigma_i \in \mathbb{R}^+} E\big[e^2(i+1) \,\big|\, f_{i-1} + \eta \, e(i) \, \kappa_{\sigma_i}(u(i), \cdot)\big] \tag{15}$$

To understand the above optimization in more detail, we consider the nonlinear regression model in which the output data are related to the input vectors via

$$y(i) = f^*(u(i)) + v(i) \tag{16}$$

where $f^*$ denotes the unknown nonlinear mapping that needs to be estimated, and $v(i)$ stands for the disturbance noise. In this case, the prediction error can be expressed as

$$e(i) = y(i) - f_{i-1}(u(i)) = \tilde{f}_{i-1}(u(i)) + v(i) \tag{17}$$

where $\tilde{f}_{i-1} = f^* - f_{i-1}$ is the residual mapping at iteration $i-1$. The mean square error at iteration $i+1$, conditioned on $f_i$, is

$$\begin{aligned} E[e^2(i+1) \mid f_i] &= \int_{U \times V} \big(\tilde{f}_i(u(i+1)) + v(i+1)\big)^2 \, dP_{uv}(i+1) \\ &= \int_{U \times V} \big(\tilde{f}_{i-1}(u(i+1)) + v(i+1) - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), u(i+1))\big)^2 \, dP_{uv}(i+1) \end{aligned} \tag{18}$$

where $V$ denotes the noise space, and $P_{uv}$ denotes the probability measure of $(u, v)$.

Remark 3: One can see from (18) that the optimal kernel size at iteration $i$ depends upon the residual mapping $\tilde{f}_{i-1}$, the prediction error $e(i)$, the step size $\eta$, and the joint distribution $P_{uv}$, which is very different from the optimal kernel sizes in problems of density estimation. Theoretically, given the desired mapping $f^*$ and the joint distribution $P_{uv}$, the optimal kernel sizes can be solved sequentially. This is, however, a rather tedious and impractical procedure since we have to solve an involved nonlinear optimization at each iteration cycle. More importantly, in practice the desired mapping, the noise, and the input distribution are usually unknown. Next, we will develop a stochastic gradient algorithm to adapt the kernel size, without resorting to any prior knowledge.

### 3.2 KLMS with Adaptive Kernel Size

As discussed previously, at each iteration cycle, the kernel size in KLMS can be optimized by minimizing the mean square error at the next iteration (conditioned on the learned mapping at the current iteration). In this sense, one can optimize the previous kernel size using the current prediction error; that is, at iteration $i$, when the prediction error $e(i)$ is available, the kernel size $\sigma_{i-1}$ can be optimized.

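Actually, the kernel size $\sigma_{i-1}$ can be simply optimized by minimizing the instantaneous square error at iteration $i$, and a stochastic gradient algorithm can be readily derived as follows:

$$\begin{aligned} \sigma'_{i-1} &= \sigma_{i-1} - 2\mu \, e(i) \frac{\partial}{\partial \sigma_{i-1}} \big[\tilde{f}_{i-1}(u(i)) + v(i)\big] \\ &\overset{(a)}{=} \sigma_{i-1} + \rho \, e(i-1) \, e(i) \frac{\partial}{\partial \sigma_{i-1}} \kappa_{\sigma_{i-1}}(u(i-1), u(i)) \\ &= \sigma_{i-1} + \rho \, e(i-1) \, e(i) \, \|u(i-1) - u(i)\|^2 \, \frac{\kappa_{\sigma_{i-1}}(u(i-1), u(i))}{\sigma_{i-1}^3} \end{aligned} \tag{19}$$

where $\sigma'_{i-1}$ denotes the updated kernel size at iteration $i$, and $\rho = 2\mu\eta$ is the step size for the kernel size adaptation. Step (a) uses $\tilde{f}_{i-1} = \tilde{f}_{i-2} - \eta \, e(i-1) \, \kappa_{\sigma_{i-1}}(u(i-1), \cdot)$ together with the fact that $\tilde{f}_{i-2}$ does not depend on $\sigma_{i-1}$. At iteration $i$, however, the residual mapping $\tilde{f}_{i-1}$ has been frozen, and the kernel size $\sigma_{i-1}$ cannot actually be modified. In this case, we just set $\sigma_i = \sigma'_{i-1}$, and obtain the following sequential update algorithm:

$$\sigma_i = \sigma_{i-1} + \rho \, e(i-1) \, e(i) \, \|u(i-1) - u(i)\|^2 \, \frac{\kappa_{\sigma_{i-1}}(u(i-1), u(i))}{\sigma_{i-1}^3} \tag{20}$$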

Remark 4: The above algorithm is computationally very simple, since the kernel size is updated sequentially: only the kernel size of the new center is updated, and those of the old centers remain frozen. The initial value of the kernel size can be set manually or calculated roughly in advance using Silverman's rule based on the input distribution.

From (20), we have the following observations:

1) The direction of the gradient depends upon the signs of the prediction errors $e(i-1)$ and $e(i)$. Specifically, when the signs of $e(i-1)$ and $e(i)$ are the same, the kernel size will increase; when the signs of $e(i-1)$ and $e(i)$ are different, the kernel size will decrease. This is reasonable, since in general the signs of two successive errors contain information about the smoothness of the desired mapping. If there is little sign change, the desired mapping is likely a "smooth function" and a larger kernel size is desirable; if the sign changes frequently, the desired mapping is likely a "zig zag function", and in this case a smaller kernel size is usually better.

2) The magnitude of the gradient depends on the input data through $\|u(i-1) - u(i)\|^2 \, \kappa_{\sigma_{i-1}}(u(i-1), u(i))$. The value of this term will be nearly zero when the distance between $u(i-1)$ and $u(i)$ is very small or very large. This is also reasonable, since when $u(i-1)$ is very close to $u(i)$, the sign change between $e(i-1)$ and $e(i)$ only implies a "very local fluctuation"; when $u(i-1)$ is very far away from $u(i)$, the sign change between $e(i-1)$ and $e(i)$ contains little information about the smoothness of the desired mapping.

3) The magnitude of the gradient depends on $\sigma_{i-1}$ through $\kappa_{\sigma_{i-1}}(u(i-1), u(i)) / \sigma_{i-1}^3$. For the case $u(i-1) \ne u(i)$, this term will approach zero when $\sigma_{i-1}$ is very small or very large. Therefore, the kernel size will be properly adjusted within a reasonable range.
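The three observations can be checked with a direct transcription of the update (20). The function name `kernel_size_step` and the numbers used below are illustrative assumptions, not values from the paper.

```python
import math


def kernel_size_step(sigma_prev, e_prev, e_curr, u_prev, u_curr, rho):
    """One step of the kernel size update (20): returns sigma_i from sigma_{i-1}."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u_prev, u_curr))
    kappa = math.exp(-sq_dist / (2.0 * sigma_prev ** 2))
    return sigma_prev + rho * e_prev * e_curr * sq_dist * kappa / sigma_prev ** 3


# Same error signs: the kernel size grows.
grow = kernel_size_step(1.0, 0.5, 0.4, (0.0,), (1.0,), rho=0.1)
# Opposite error signs: the kernel size shrinks.
shrink = kernel_size_step(1.0, 0.5, -0.4, (0.0,), (1.0,), rho=0.1)
# Identical or very distant inputs: the step vanishes.
same_input = kernel_size_step(1.0, 0.5, 0.4, (0.0,), (0.0,), rho=0.1)
far_input = kernel_size_step(1.0, 0.5, 0.4, (0.0,), (100.0,), rho=0.1)
```

Note that nothing in (20) by itself prevents a large negative step from driving the kernel size to zero or below, so a practical implementation may want a lower bound; that safeguard is a suggestion here, not part of the source.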
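Combining (5) and (20), we obtain the KLMS with adaptive kernel size:

$$\begin{cases} f_0 = 0 \\ e(i) = y(i) - f_{i-1}(u(i)) \\ \sigma_i = \sigma_{i-1} + \rho \, e(i-1) \, e(i) \, \|u(i-1) - u(i)\|^2 \, \dfrac{\kappa_{\sigma_{i-1}}(u(i-1), u(i))}{\sigma_{i-1}^3} \\ f_i = f_{i-1} + \eta \, e(i) \, \kappa_{\sigma_i}(u(i), \cdot) \end{cases} \tag{21}$$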
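A compact sketch of the combined algorithm follows: the mapping update of (5) with a per-center kernel size that is adapted by (20) before each new center is added. The class name, the parameter values in the usage lines, and in particular the lower clamp `sigma_min` are assumptions introduced for this example; the source does not specify a safeguard against non-positive kernel sizes.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class AdaptiveKernelSizeKLMS:
    """Sketch of KLMS in which each new center gets its own kernel size."""

    def __init__(self, eta, rho, sigma0, sigma_min=1e-3):
        self.eta = eta              # step size for the mapping update
        self.rho = rho              # step size for the kernel size update
        self.sigma = sigma0         # kernel size of the most recent center
        self.sigma_min = sigma_min  # ASSUMPTION: lower clamp, not in the source
        self.centers, self.coeffs, self.sigmas = [], [], []
        self.prev_u, self.prev_e = None, None

    def predict(self, u):
        # Every stored center keeps the kernel size it was created with.
        return sum(a * math.exp(-sq_dist(c, u) / (2.0 * s * s))
                   for c, a, s in zip(self.centers, self.coeffs, self.sigmas))

    def update(self, u, y):
        u = tuple(u)
        e = y - self.predict(u)                  # prediction error e(i)
        if self.prev_u is not None:              # kernel size update, as in (20)
            d2 = sq_dist(self.prev_u, u)
            s = self.sigma
            kappa = math.exp(-d2 / (2.0 * s * s))
            s_new = s + self.rho * self.prev_e * e * d2 * kappa / s ** 3
            self.sigma = max(s_new, self.sigma_min)
        self.centers.append(u)                   # new center uses sigma_i
        self.coeffs.append(self.eta * e)
        self.sigmas.append(self.sigma)
        self.prev_u, self.prev_e = u, e
        return e


# Two hand-checkable steps with hypothetical data.
model = AdaptiveKernelSizeKLMS(eta=0.5, rho=0.1, sigma0=1.0)
e1 = model.update((0.0,), 1.0)   # no previous sample: sigma stays at sigma0
e2 = model.update((1.0,), 1.0)   # same-sign errors: sigma increases
```

Only the kernel size of the newest center changes at each step; the sizes stored in `sigmas` for older centers stay frozen, which is what keeps the extra cost per iteration small.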

Remark 5: The computational complexity of the algorithm (21) is of the same order of magnitude as that of the original KLMS algorithm, which equals $O(i)$ at iteration $i$. This is because both algorithms share the same most time-consuming part, that is, the calculation of the prediction error.

Similar to the original KLMS algorithm, the new algorithm also produces a growing RBF network, whose network size increases linearly with the number of training data. In order to obtain a compact model (a network with as few centers as possible) and reduce the computational and memory costs, some sparsification or quantization methods can still be applied. Here, we only discuss the quantization approach. In [20], we use the idea of quantization to compress the input space of KLMS and efficiently constrain the growth of the network size. The learning rule of the quantized KLMS (QKLMS) is

$$\begin{cases} f_0 = 0 \\ e(i) = y(i) - f_{i-1}(u(i)) \\ f_i = f_{i-1} + \eta \, e(i) \, \kappa_\sigma(Q[u(i)], \cdot) \end{cases} \tag{22}$$

where $Q[\cdot]$ denotes a quantization operator in the input space $U$. In QKLMS (22), the centers are limited to the quantization codebook C, and the network size can never be larger than the size of the codebook. At iteration $i$, we just add $\eta \, e(i)$ to the coefficient of the code-vector closest to the current input $u(i)$. A simple online vector quantization (VQ) method was also proposed in [20].

Now suppose the online VQ method is adopted and the kernel size of QKLMS varies as a function of the codebook index. Given a quantization size, define the distance between the current input $u(i)$ and the codebook C as the smallest distance between $u(i)$ and any code-vector $C_j$, where $C_j$ denotes the jth element of the codebook C. Then at iteration i, if this distance exceeds the quantization size, a new code-vector will be added into the codebook, i.e. $u(i)$ is appended to C. In this case, a new center will also be allocated, and its kernel size can be computed in a similar way as in (20):

$$\sigma_j = \sigma_{j-1} + \rho \, e(j-1) \, e(j) \, \|C_j - C_{j-1}\|^2 \, \frac{\kappa_{\sigma_{j-1}}(C_j, C_{j-1})}{\sigma_{j-1}^3} \tag{23}$$

where $\sigma_j$ denotes the kernel size corresponding to the jth code-vector $C_j$, and $e(j)$ denotes the prediction error at the iteration when the code-vector $C_j$ is added. If the distance does not exceed the quantization size, no new center is added and the kernel size is not updated.
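The quantized variant can be sketched along the same lines. This is an illustrative reading of (22) and (23), not the reference implementation from [20]: the class name, the threshold parameter `eps` (standing in for the quantization size), the use of Euclidean distance, and the clamp `sigma_min` are all assumptions made for the example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class QuantizedAdaptiveKLMS:
    """Sketch of QKLMS where each code-vector carries its own kernel size."""

    def __init__(self, eta, rho, sigma0, eps, sigma_min=1e-3):
        self.eta, self.rho, self.eps = eta, rho, eps
        self.sigma0 = sigma0
        self.sigma_min = sigma_min  # ASSUMPTION: lower clamp, not in the source
        self.codebook, self.coeffs, self.sigmas = [], [], []
        self.add_errors = []        # error at the iteration each code-vector was added

    def predict(self, u):
        return sum(a * math.exp(-sq_dist(c, u) / (2.0 * s * s))
                   for c, a, s in zip(self.codebook, self.coeffs, self.sigmas))

    def update(self, u, y):
        u = tuple(u)
        e = y - self.predict(u)
        if self.codebook:
            j = min(range(len(self.codebook)),
                    key=lambda k: sq_dist(self.codebook[k], u))
            if math.sqrt(sq_dist(self.codebook[j], u)) <= self.eps:
                self.coeffs[j] += self.eta * e   # merge into nearest code-vector
                return e
            # New code-vector: kernel size from the previous one, as in (23).
            d2 = sq_dist(self.codebook[-1], u)
            s = self.sigmas[-1]
            kappa = math.exp(-d2 / (2.0 * s * s))
            step = self.rho * self.add_errors[-1] * e * d2 * kappa / s ** 3
            sigma = max(s + step, self.sigma_min)
        else:
            sigma = self.sigma0                  # first code-vector
        self.codebook.append(u)
        self.coeffs.append(self.eta * e)
        self.sigmas.append(sigma)
        self.add_errors.append(e)
        return e


q = QuantizedAdaptiveKLMS(eta=0.5, rho=0.1, sigma0=1.0, eps=0.1)
q.update((0.0,), 1.0)    # first sample always becomes a code-vector
q.update((0.05,), 1.0)   # within eps of an existing code-vector: merged
q.update((1.0,), 1.0)    # farther than eps: new code-vector, new kernel size
```

After these three samples the codebook holds two code-vectors, so the network is smaller than the number of samples seen.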

## 4 Convergence Analysis

This section gives theoretical results on convergence of the algorithm (21). The unknown system is assumed to be the nonlinear regression model given in (16). First, let us define the a priori error $e_a(i)$ and a posteriori error $e_p(i)$ as follows:

$$e_a(i) = \tilde{f}_{i-1}(u(i)), \qquad e_p(i) = \tilde{f}_i(u(i)) \tag{24}$$

where $\tilde{f}_{i-1}$ and $\tilde{f}_i$ are the residual mappings at iteration $i-1$ and $i$, respectively.

### 4.1 Kernel Size Convergence

The exact analysis of the convergence of the kernel size is complex. In the following, we only show, under several assumptions, that the difference between two successive kernel sizes will converge in the mean to zero. These assumptions are:

A1: The noise $v(i)$ is zero-mean, independent, identically distributed, and independent of the input sequence $\{u(i)\}$;

A2: The step-sizes $\eta$ and $\rho$ are relatively small such that, as $i \to \infty$, the prediction errors $e(i-1)$ and $e(i)$ are independent of the inputs $u(i-1)$, $u(i)$ and of the kernel size $\sigma_{i-1}$;

A3: As $i \to \infty$, the a priori errors $e_a(i-1)$ and $e_a(i)$ are zero-mean and uncorrelated.

Under assumptions A1-A3, we have, as $i \to \infty$,

$$E[\sigma_i - \sigma_{i-1}] \to 0 \tag{25}$$

which does not in itself imply convergence, but implies that the difference between two successive kernel sizes will converge in mean to zero.

Remark 6: The assumption A1 is commonly used in the convergence analysis for adaptive filtering algorithms [40]. This assumption implies the independence between $v(i)$ and $e_a(i)$. The assumption A2 is reasonable, since when the step size $\eta$ is very small, the steady-state misadjustment will be much smaller than the noise variance. In this case we have, as $i \to \infty$, $e(i) \approx v(i)$ and $e(i-1) \approx v(i-1)$. Hence, by the assumption A1, $e(i)$ and $e(i-1)$ are approximately independent of the inputs. Further, if the step-size $\rho$ is also small, $e(i)$ and $e(i-1)$ will be approximately independent of the kernel size $\sigma_{i-1}$. The assumption A3 will be easily met if the input sequence is i.i.d.

### 4.2 Energy Conservation Relation

In adaptive filtering theory, the energy conservation relation provides a powerful tool for the mean square convergence analysis [40, 41, 42, 43, 44]. In our recent studies [20, 45, 21], this important relation has been extended into the RKHS. Before carrying out the mean square convergence analysis for the mapping update in (21), we derive the corresponding energy conservation relation.

The mapping update in (21) can be expressed as the residual-mapping update:

$$\tilde{f}_i = \tilde{f}_{i-1} - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), \cdot) \tag{26}$$

Due to the variable kernel size, at each iteration the correction term in (26) is computed in a different RKHS. In order to derive the energy conservation relation in a fixed RKHS, one should find a RKHS that contains all the correction terms. Here we give an important lemma.

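Lemma 1: Let $U \subset \mathbb{R}^m$ be any set with nonempty interior. Then the RKHS $H_{\sigma_*}$ induced by the Gaussian kernel $\kappa_{\sigma_*}$ on $U$ contains the function $\kappa_\sigma(u, \cdot)$ if and only if $\sigma > \sigma_* / \sqrt{2}$. For such $\sigma$, the function $\kappa_\sigma(u, \cdot)$ has norm given by:

$$\|\kappa_\sigma(u, \cdot)\|_{H_{\sigma_*}} = \left( \frac{\sigma^2}{\sigma_* \sqrt{2\sigma^2 - \sigma_*^2}} \right)^{m/2} \tag{27}$$

Remark 7: The above lemma is a direct consequence of Theorem 2 in [46]. For the case $\sigma \le \sigma_* / \sqrt{2}$, the Hilbert space $H_{\sigma_*}$ will not contain the function $\kappa_\sigma(u, \cdot)$. We point out that in this case, the function $\kappa_\sigma(u, \cdot)$ can still be arbitrarily "close" to a member of $H_{\sigma_*}$, because $H_{\sigma_*}$ is dense in the space of continuous functions on $U$ provided $U$ is compact.

Now we select a fixed kernel size $\sigma_*$, satisfying $\sigma_* < \sqrt{2} \, \sigma_{\min}$, where $\sigma_{\min} = \inf_i \sigma_i$. By Lemma 1, the RKHS $H_{\sigma_*}$ will contain all the correction terms $\kappa_{\sigma_i}(u(i), \cdot)$.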
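The norm formula (27) is simple to evaluate. The helper below is a sketch with a hypothetical name; it treats the requirement that the square root in (27) be real and positive as the membership condition, which is an assumption read off from the formula itself.

```python
import math


def kernel_norm_in_rkhs(sigma, sigma_star, m):
    """Norm (27) of a Gaussian of size sigma in the RKHS of size sigma_star.

    m is the input dimension. Raises ValueError when the expression under
    the square root in (27) is not positive.
    """
    radicand = 2.0 * sigma ** 2 - sigma_star ** 2
    if radicand <= 0.0:
        raise ValueError("2*sigma^2 must exceed sigma_star^2 for (27) to apply")
    return (sigma ** 2 / (sigma_star * math.sqrt(radicand))) ** (m / 2.0)


matched = kernel_norm_in_rkhs(1.0, 1.0, m=3)   # same kernel size
wider = kernel_norm_in_rkhs(2.0, 1.0, m=1)     # sigma larger than sigma_star
```

When the two kernel sizes coincide the norm is 1, as expected from $\kappa_{\sigma_*}(u, u) = 1$, and any mismatch makes it larger than 1, since $\sigma^4 \ge \sigma_*^2 (2\sigma^2 - \sigma_*^2)$ is equivalent to $(\sigma^2 - \sigma_*^2)^2 \ge 0$.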

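By the reproducing property of the RKHS $H_{\sigma_*}$, the prediction error $e(i)$, a priori error $e_a(i)$, and a posteriori error $e_p(i)$ can be expressed as

$$\begin{cases} e(i) = \langle \tilde{f}_{i-1} \,|\, \kappa_{\sigma_*}(u(i), \cdot) \rangle_{H_{\sigma_*}} + v(i) \\ e_a(i) = \langle \tilde{f}_{i-1} \,|\, \kappa_{\sigma_*}(u(i), \cdot) \rangle_{H_{\sigma_*}} \\ e_p(i) = \langle \tilde{f}_i \,|\, \kappa_{\sigma_*}(u(i), \cdot) \rangle_{H_{\sigma_*}} \end{cases} \tag{28}$$

Further, one can derive the relationship between $e_a(i)$ and $e_p(i)$:

$$e_p(i) = e_a(i) - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), u(i)) = e_a(i) - \eta \, e(i) \tag{29}$$

Hence

$$\tilde{f}_i = \tilde{f}_{i-1} + \big(e_p(i) - e_a(i)\big) \, \kappa_{\sigma_i}(u(i), \cdot) \tag{30}$$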

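Squaring both sides of (30) in RKHS $H_{\sigma_*}$, we derive

$$\begin{aligned} \|\tilde{f}_i\|_{H_{\sigma_*}}^2 &= \|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2 + 2\big(e_p(i) - e_a(i)\big) \langle \tilde{f}_{i-1} \,|\, \kappa_{\sigma_i}(u(i), \cdot) \rangle_{H_{\sigma_*}} + \big(e_p(i) - e_a(i)\big)^2 \|\kappa_{\sigma_i}(u(i), \cdot)\|_{H_{\sigma_*}}^2 \\ &= \|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2 + e_p^2(i) - e_a^2(i) + \varepsilon(i) \end{aligned}$$

where $\|\tilde{f}_i\|_{H_{\sigma_*}}^2$ is the residual mapping power (RMP) at iteration i, $\delta_i(\cdot) = \kappa_{\sigma_i}(u(i), \cdot) - \kappa_{\sigma_*}(u(i), \cdot)$, and

$$\varepsilon(i) = 2\big(e_p(i) - e_a(i)\big) \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}} + \big(e_p(i) - e_a(i)\big)^2 \langle \kappa_{\sigma_i}(u(i), \cdot) \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}$$

It follows that

$$\|\tilde{f}_i\|_{H_{\sigma_*}}^2 + e_a^2(i) = \|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2 + e_p^2(i) + \varepsilon(i) \tag{31}$$

Remark 8: Equation (31) is referred to as the energy conservation relation for KLMS with adaptive kernel size. If the kernel size is fixed, say $\sigma_i \equiv \sigma_*$, we have $\delta_i(\cdot) = 0$, and hence $\varepsilon(i) = 0$. In this case, (31) reduces to the energy conservation relation for the original KLMS:

$$\|\tilde{f}_i\|_{H_{\sigma_*}}^2 + e_a^2(i) = \|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2 + e_p^2(i) \tag{32}$$

which, in form, is identical to the energy conservation relation for the normalized LMS (NLMS) algorithm.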

### 4.3 Sufficient Condition for Mean Square Convergence

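Substituting $e_p(i) = e_a(i) - \eta \, e(i)$ into (31) and taking expectations of both sides yield

$$\begin{aligned} E\big[\|\tilde{f}_i\|_{H_{\sigma_*}}^2\big] &= E\big[\|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2\big] - 2\eta E[e(i) e_a(i)] + \eta^2 E[e^2(i)] + E[\varepsilon(i)] \\ &\overset{(b)}{=} E\big[\|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2\big] - 2\eta E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big] + \eta^2 E\big[e^2(i)\big(1 + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big)\big] \end{aligned} \tag{33}$$

where (b) follows from

$$\begin{aligned} \langle \kappa_{\sigma_i}(u(i), \cdot) \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}} &= \langle \delta_i(\cdot) + \kappa_{\sigma_*}(u(i), \cdot) \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}} \\ &= \langle \kappa_{\sigma_*}(u(i), \cdot) \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}} + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2 \\ &= \delta_i(u(i)) + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2 = \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2 \end{aligned} \tag{34}$$

It follows easily that

$$E\big[\|\tilde{f}_i\|_{H_{\sigma_*}}^2\big] \le E\big[\|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2\big] \iff \eta^2 E\big[e^2(i)\big(1 + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big)\big] \le 2\eta E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big] \tag{35}$$

Thus, if for all $i$ the step-size $\eta$ satisfies the inequality

$$0 < \eta \le \frac{2 E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big]}{E\big[e^2(i)\big(1 + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big)\big]} \tag{36}$$

the RMP in RKHS $H_{\sigma_*}$ will monotonically decrease (and hence converge). The inequality (36) implies

$$E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big] > 0$$

So a sufficient condition for the mean square convergence (monotonic decrease of the RMP) will be, for all $i$,

$$\begin{cases} E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big] > 0 \\[4pt] 0 < \eta \le \dfrac{2 E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big]}{E\big[e^2(i)\big(1 + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big)\big]} \end{cases} \tag{37}$$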

### 4.4 Steady-State Excess Mean Square Error

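Further, we take the limit of (33) as $i \to \infty$:

$$\lim_{i \to \infty} E\big[\|\tilde{f}_i\|_{H_{\sigma_*}}^2\big] = \lim_{i \to \infty} E\big[\|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2\big] - 2\eta \lim_{i \to \infty} E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big] + \eta^2 \lim_{i \to \infty} E\big[e^2(i)\big(1 + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big)\big] \tag{38}$$

If the RMP reaches steady-state, that is

$$\lim_{i \to \infty} E\big[\|\tilde{f}_i\|_{H_{\sigma_*}}^2\big] = \lim_{i \to \infty} E\big[\|\tilde{f}_{i-1}\|_{H_{\sigma_*}}^2\big] \tag{39}$$

then the following relation holds:

$$\lim_{i \to \infty} E\big[e(i)\big(e_a(i) + \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big)\big] = \frac{\eta}{2} \lim_{i \to \infty} E\big[e^2(i)\big(1 + \|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big)\big] \tag{40}$$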

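In order to derive the steady-state excess mean-square error (EMSE) $\lim_{i \to \infty} E[e_a^2(i)]$ (the a priori error power is also referred to as the excess mean-square error in the adaptive filtering literature), we use two assumptions: one is the assumption A1, and the other is as follows:

A4: The squared a priori error $e_a^2(i)$ and $\|\delta_i(\cdot)\|_{H_{\sigma_*}}^2$ are uncorrelated (this assumption is easily met if the input sequence is i.i.d.).

Under the assumptions A1 and A4, (40) becomes

$$\lim_{i \to \infty} E[e_a^2(i)] + \lim_{i \to \infty} E\big[e_a(i) \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big] = \frac{\eta}{2} \Big( \lim_{i \to \infty} E[e_a^2(i)] + \xi_v^2 \Big) \Big( 1 + \lim_{i \to \infty} E\big[\|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big] \Big) \tag{41}$$

where $\xi_v^2$ is the noise variance. Then we have

$$\lim_{i \to \infty} E[e_a^2(i)] = \frac{\eta \, \xi_v^2 (1 + \varsigma) - 2\tau}{2 - \eta (1 + \varsigma)} \tag{42}$$

where $\tau = \lim_{i \to \infty} E\big[e_a(i) \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big]$, and $\varsigma = \lim_{i \to \infty} E\big[\|\delta_i(\cdot)\|_{H_{\sigma_*}}^2\big]$. By Lemma 1, $\varsigma$ can also be expressed as

$$\varsigma = \lim_{i \to \infty} E\left[ \left( \frac{\sigma_i^2}{\sigma_* \sqrt{2\sigma_i^2 - \sigma_*^2}} \right)^{m} - 1 \right] \tag{43}$$

Remark 9: Although both $\tau$ and $\varsigma$ depend on the kernel size $\sigma_*$, we should note that the steady-state EMSE itself does not depend on $\sigma_*$. This can be easily understood by the fact that the residual mapping $\tilde{f}_i$ has no relation to $\sigma_*$.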

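To further investigate the steady-state EMSE, we consider the case in which, as $i \to \infty$, $\sigma_i$ and $\sigma_*$ are very close. In this case we have $\delta_i(\cdot) \to 0$ as $i \to \infty$. Then at the steady-state stage, we can set $\sigma_i \approx \sigma_*$. And hence

$$\varsigma = \lim_{i \to \infty} E\left[ \left( \frac{\sigma_i^2}{\sigma_* \sqrt{2\sigma_i^2 - \sigma_*^2}} \right)^{m} - 1 \right] \approx 0 \tag{44}$$

and

$$\tau = \lim_{i \to \infty} E\big[e_a(i) \langle \tilde{f}_{i-1} \,|\, \delta_i(\cdot) \rangle_{H_{\sigma_*}}\big] \approx 0 \tag{45}$$

It follows that

$$\lim_{i \to \infty} E[e_a^2(i)] = \frac{\eta \, \xi_v^2 (1 + \varsigma) - 2\tau}{2 - \eta (1 + \varsigma)} \approx \frac{\eta \, \xi_v^2}{2 - \eta} \tag{46}$$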

Remark 10: It has been shown that the steady-state EMSE of the original KLMS (with a fixed kernel size) equals $\eta \, \xi_v^2 / (2 - \eta)$, which is not related to the specific value of the kernel size [21]. From (46) one observes that, when the kernel size converges to a neighborhood of a certain constant, the adaptation of the kernel size also has little effect on the steady-state EMSE. This will be confirmed later by simulation results. We should point out here that, although the kernel size may have little effect on the KLMS steady-state accuracy (in terms of the EMSE), it has significant influence on the convergence speed. In most practical situations, the training data are finite and the algorithm can never reach the steady state. In these cases the kernel size also has significant influence on the final accuracy (not the steady-state accuracy).
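For a feel of the magnitudes involved, the steady-state expression (42) and its approximation (46) can be evaluated directly. The function below is a plain transcription of (42); its name and the sample values of the step size and of $\varsigma$ are assumptions chosen for illustration (the noise variance 0.0001 is the one used in the first simulation example).

```python
def steady_state_emse(eta, noise_var, varsigma=0.0, tau=0.0):
    """Steady-state EMSE from (42); with varsigma = tau = 0 it reduces to (46)."""
    denom = 2.0 - eta * (1.0 + varsigma)
    if denom <= 0.0:
        raise ValueError("eta * (1 + varsigma) must be smaller than 2")
    return (eta * noise_var * (1.0 + varsigma) - 2.0 * tau) / denom


# Hypothetical step size 0.5 with noise variance 1e-4.
emse_fixed = steady_state_emse(eta=0.5, noise_var=1e-4)
# A small positive varsigma (kernel size slightly off sigma_*) raises it a little.
emse_perturbed = steady_state_emse(eta=0.5, noise_var=1e-4, varsigma=0.01)
```

The first value is $0.5 \times 10^{-4} / 1.5$, roughly $3.3 \times 10^{-5}$, and the perturbed value differs from it only slightly, in line with the remark that the kernel size adaptation has little effect on the steady-state EMSE once the kernel size has settled.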

## 5 Simulation Results

In this section, we present simulation results that illustrate the performance of the proposed algorithm. The simulation examples presented include static function approximation and short-term chaotic time series prediction.

### 5.1 Static Function Approximation

Consider a simple static function estimation problem in which the desired output data are generated by

$$y(i) = \cos(8u(i)) + v(i) \tag{47}$$

where the input $u(i)$ is uniformly distributed over , and $v(i)$ is a white Gaussian noise with variance 0.0001.

For the KLMS with different kernel sizes, the average convergence curves (in terms of the EMSE) over 1000 Monte Carlo runs are shown in Fig. 1. In the simulation, the step-sizes for all the cases are set at . For the KLMS with adaptive kernel size, the step-size for the kernel size adaptation is set at , and the initial kernel size is set to 1.0. Fig. 1 shows clearly that the kernel size has a significant influence on the convergence speed. In this example, the kernel size and produce a rather slow convergence speed. The kernel sizes and work very well, and in particular, the kernel size achieves a fast convergence speed and the smallest final EMSE (at the 5000th iteration). The kernel size (selected by Silverman's rule) works, but its performance is noticeably worse. Although the initial kernel size is set to 1.0 (with which the algorithm is almost stalled), the KLMS with adaptive kernel size () can still converge to a very small EMSE at the final iteration. This is explained by Fig. 2, which plots the evolution curve of the adaptive kernel size: the adaptive kernel size converges to a desirable value between 0.1 and 0.2. Fig. 3 shows the learned mappings at the final iteration for different kernel sizes. The desired mapping is also plotted in Fig. 3 for comparison. One can see that for the cases , and , the learned mappings match the desired mapping very well, while when and 0.5, the learned mappings deviate severely from the desired function. For the kernel size , there is still some visible deviation between the learned mapping and the desired one. A more detailed comparison is presented in Table 1, which summarizes the EMSE at the final iteration.
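To make the role of the kernel size in this experiment concrete, the following is a minimal sketch of the standard KLMS with a fixed Gaussian kernel applied to data generated by (47). It is not the authors' code. The input range `[-pi, pi]`, the step size `eta = 0.5`, the kernel form `exp(-(u - u')^2 / (2 sigma^2))`, the shorter run of 1000 iterations, and all function names are assumptions made for illustration, since the corresponding values are not recoverable from the text above.

```python
import numpy as np

def gaussian_kernel(centers, u, sigma):
    """Gaussian kernel between each stored center and the scalar input u (assumed form)."""
    return np.exp(-(centers - u) ** 2 / (2.0 * sigma ** 2))

def klms_fixed_kernel(u, d, sigma, eta):
    """Run KLMS with a fixed kernel size; return the prediction errors e(i)."""
    n = len(u)
    coeffs = np.zeros(n)   # coefficient eta * e(j) attached to center u(j)
    errors = np.zeros(n)
    for i in range(n):
        # Output of the filter learned from samples 0..i-1, evaluated at u(i).
        # With no centers yet (i = 0) the dot product of empty arrays is 0.
        y_hat = coeffs[:i] @ gaussian_kernel(u[:i], u[i], sigma)
        errors[i] = d[i] - y_hat
        # KLMS update: add a new center u(i) with coefficient eta * e(i).
        coeffs[i] = eta * errors[i]
    return errors

rng = np.random.default_rng(0)
n = 1000
u = rng.uniform(-np.pi, np.pi, n)                        # assumed input range
d = np.cos(8.0 * u) + rng.normal(0.0, np.sqrt(1e-4), n)  # desired output, eq. (47)

for sigma in (0.1, 1.0):
    e = klms_fixed_kernel(u, d, sigma, eta=0.5)          # eta is an assumed value
    print(f"sigma={sigma}: mean squared error over last 100 samples = {np.mean(e[-100:] ** 2):.6f}")
```

Note that the errors returned here include the observation noise, so their mean square is the noise variance plus the EMSE rather than the EMSE itself.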

As illustrated in Fig. 1, the initial convergence speed of the KLMS with adaptive kernel size can still be very slow if the initial kernel size is inappropriately chosen. In order to improve the initial convergence speed, one can select a suitable initial kernel size using a method such as Silverman's rule. For the present example, if we set the initial kernel size to 0.35, the convergence speed of the new algorithm improves significantly. This can be clearly seen from Fig. 4, in which the learning curves for and (with ) are shown.

The kernel size influences the convergence speed (see Fig. 1) and the final accuracy with finite training data (see Table 1), but it has little effect on the steady-state EMSE with infinite training data. In order to confirm this theoretical prediction, we perform another set of simulations with the same settings, except that many more iterations are run. For different kernel sizes and iterations, the EMSEs (obtained as the averages over a window of 2000 samples) are listed in Table 2. One can see that for the kernel sizes , and , the algorithms almost reach the steady state before the iteration. For the kernel size , the algorithm attains its steady state at around the iteration. For the cases and , it is hard to obtain the steady-state EMSE via simulation since the convergence speed is too slow. In Table 2, the simulated steady-state EMSEs for different kernel sizes ( , , and ) are very close and approach 0.000033333, the theoretical value of the steady-state EMSE calculated using (46).
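As a quick check of the theoretical value quoted above, the sketch below evaluates the steady-state EMSE expression (42), which reduces to the approximation (46) when ς and τ are set to zero. The function name is hypothetical, and the step size `eta = 0.5` is an assumption: the step-size value did not survive in the text, and 0.5 is simply the value that, together with the stated noise variance 0.0001, reproduces the quoted figure.

```python
def steady_state_emse(eta, noise_var, varsigma=0.0, tau=0.0):
    """Evaluate (42); with varsigma = tau = 0 this is the approximation (46)."""
    numerator = eta * noise_var * (1.0 + varsigma) - 2.0 * tau
    denominator = 2.0 - eta * (1.0 + varsigma)
    return numerator / denominator

# Noise variance 0.0001 is from the text; eta = 0.5 is an assumed step size.
emse = steady_state_emse(eta=0.5, noise_var=1e-4)
print(f"{emse:.9f}")  # prints 0.000033333
```

Because (46) contains no kernel size, the same number results for every fixed kernel size, which is the property Table 2 is meant to confirm.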

### 5.2 Short Term Chaotic Time Series Prediction

The second example is about short term chaotic time series prediction. Consider the Lorenz oscillator whose state equations are

$$\begin{cases} \dfrac{dx}{dt} = -\beta x + yz \\[4pt] \dfrac{dy}{dt} = \delta(z - y) \\[4pt] \dfrac{dz}{dt} = -xy + \rho y - z \end{cases} \tag{48}$$

where the parameters are set as , , and . The second state is picked for the prediction task and a sample time series is shown in Fig. 5. Here the goal is to predict the value of the current sample using the previous five consecutive samples.
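The data preparation for this task can be sketched as follows: integrate (48), pick the second state, and build input vectors from the previous five consecutive samples. This is an illustrative sketch only. The parameter values (`beta = 8/3`, `delta = 10`, `rho = 28`, a commonly used chaotic setting), the integration step `dt`, the initial state, the fourth-order Runge-Kutta integrator, and all function names are assumptions, since the values used in the paper are not recoverable from the text above.

```python
import numpy as np

def lorenz_rhs(s, beta, delta, rho):
    """Right-hand side of the state equations (48)."""
    x, y, z = s
    return np.array([-beta * x + y * z,
                     delta * (z - y),
                     -x * y + rho * y - z])

def simulate_lorenz(n_steps, dt=0.01, s0=(1.0, 1.0, 1.0),
                    beta=8.0 / 3.0, delta=10.0, rho=28.0):
    """Integrate (48) with classical RK4; all default values are assumptions."""
    states = np.empty((n_steps, 3))
    s = np.asarray(s0, dtype=float)
    for k in range(n_steps):
        k1 = lorenz_rhs(s, beta, delta, rho)
        k2 = lorenz_rhs(s + 0.5 * dt * k1, beta, delta, rho)
        k3 = lorenz_rhs(s + 0.5 * dt * k2, beta, delta, rho)
        k4 = lorenz_rhs(s + dt * k3, beta, delta, rho)
        s = s + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        states[k] = s
    return states

def embed(series, order=5):
    """Row k of X holds `order` consecutive samples; target[k] is the sample that follows."""
    n = len(series) - order
    X = np.stack([series[j:j + n] for j in range(order)], axis=1)
    target = series[order:]
    return X, target

states = simulate_lorenz(1105)
series = states[:, 1]                 # second state, as in the text
X, target = embed(series, order=5)    # 1100 input/target pairs
X_train, d_train = X[:1000], target[:1000]
X_test, d_test = X[1000:], target[1000:]
```

The split into 1000 training and 100 test pairs mirrors the segment sizes described in the experiment; how the paper draws its 20 segments from the series is not specified here.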

We continue to compare the performance of the KLMS with different kernel sizes. In the simulations below, the step-sizes for the mapping update are set at , and the step-size for the kernel size adaptation is set at . The initial value of the adaptive kernel size is set to 1.0. For the kernel sizes , 5.5 (selected by Silverman's rule), 10, 15, 20, 30 and the adaptive kernel size , the convergence curves in terms of the testing MSE are shown in Fig. 6. For each kernel size, 20 independent simulations were run with different segments of the time series. In each segment, 1000 samples are used as the training data and another 100 as the test data. At each iteration, the testing MSE is computed on the test data using the filter learned up to that iteration. It can be seen from Fig. 6 that the kernel size achieves the best performance with the smallest testing MSE at the final iteration. Also, as expected, the adaptive kernel size yields a satisfactory performance very close to the best result. Fig. 7 shows the evolution of the adaptive kernel size (see the solid line). Interestingly, the adaptive kernel size converges quickly to a value very close to the desirable value 15. The mean deviation results of the testing MSE at the final iteration are given in Table 3.

Further, we evaluate the performance when the quantization method is applied to curb the growth of the network size. The experimental setting is the same as before, except that the quantization method is now used and the quantization size is set at . The convergence curves for different kernel sizes are shown in Fig. 8. Once again, the adaptive kernel size works very well and obtains a performance very close to the best one. As shown in Fig. 7, the adaptive kernel size still converges close to the value 15 (see the dotted line). Fig. 9 illustrates the evolution curve of the network size. One can see that with quantization the network size grows very slowly, and the final network size is only around 75. Table 4 shows the mean deviation results of the testing MSE at the final iteration.
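The effect of quantization on the network size can be illustrated with a minimal sketch of a quantized KLMS update with a fixed kernel size, in the spirit of [20]: a new center is added only if the input is farther than the quantization size from every stored center; otherwise the coefficient of the nearest center is updated. This is a simplified illustration rather than the authors' implementation. The toy data, the parameter values, and the names `qklms`, `eps`, `sigma` and `eta` are assumptions.

```python
import numpy as np

def qklms(U, d, sigma, eta, eps):
    """Quantized KLMS sketch: returns prediction errors and network size per iteration."""
    centers = np.empty((0, U.shape[1]))
    coeffs = np.empty(0)
    errors = np.zeros(len(d))
    sizes = np.zeros(len(d), dtype=int)
    for i in range(len(d)):
        # Distances from the current input to all stored centers.
        dist = np.linalg.norm(centers - U[i], axis=1)
        # Filter output with a Gaussian kernel (assumed form); 0 when no centers exist.
        errors[i] = d[i] - coeffs @ np.exp(-dist ** 2 / (2.0 * sigma ** 2))
        if dist.size > 0 and dist.min() <= eps:
            # Input is quantized to the nearest center: update its coefficient only.
            coeffs[np.argmin(dist)] += eta * errors[i]
        else:
            # Input is far from all centers: grow the network by one center.
            centers = np.vstack([centers, U[i]])
            coeffs = np.append(coeffs, eta * errors[i])
        sizes[i] = len(coeffs)
    return errors, sizes

# Toy data (hypothetical, not the Lorenz experiment of the paper).
rng = np.random.default_rng(1)
U = rng.uniform(-1.0, 1.0, size=(500, 2))
d = np.sin(3.0 * U[:, 0]) * np.cos(2.0 * U[:, 1])

_, sizes_q = qklms(U, d, sigma=0.5, eta=0.5, eps=0.2)
_, sizes_full = qklms(U, d, sigma=0.5, eta=0.5, eps=0.0)
print("final network size with quantization:", sizes_q[-1])
print("final network size without quantization:", sizes_full[-1])
```

With `eps = 0.0` every distinct input becomes a center, so the network size equals the number of samples; a positive `eps` bounds the number of centers that fit in the input region.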

## 6 Conclusion

The kernel function implicitly defines the feature space and plays a central role in all kernel methods. In kernel adaptive filtering (KAF) algorithms, the Gaussian kernel (a radial basis function kernel) is usually the default kernel. The kernel size (or bandwidth) of the Gaussian kernel controls the smoothness of the mapping and has a significant influence on the learning performance. Selecting a proper kernel size is therefore a crucial problem in KAF algorithms. Some existing techniques (e.g. Silverman's rule) for selecting a kernel size can be applied, but they are not appropriate for a KAF algorithm, since the problem is approximation in a joint space (the input and the desired), which is different from density estimation.

In this work, we propose an approach for sequentially optimizing the kernel size for the kernel least mean square (KLMS), a simple yet efficient KAF algorithm. At each iteration cycle, the kernel size is adjusted by a stochastic gradient based algorithm to minimize the mean square error. The proposed algorithm is computationally very simple and easy to implement. Theoretical results on convergence are also presented. Based on the energy conservation relation in RKHS, we derive a sufficient condition for the mean square convergence, and obtain the theoretical steady-state excess mean-square error (EMSE). Simulation results confirm the theoretical prediction, and show that the adaptive kernel size can automatically converge to a proper value, helping KLMS converge faster and achieve better accuracy.

In future work, it is of interest to extend this approach to the case where the kernel is of any form (not restricted to the Gaussian kernel). In particular, we will study how to sequentially optimize the kernel function using the idea of multi-kernel learning or learning the kernel. Another interesting line of study is how to jointly optimize the kernel size and the step size.

## References

• [1] Vapnik, V.: The nature of statistical learning theory. Springer (2000)
• [2] Schölkopf, B., Smola, A.: Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press (2002)
• [3] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural computation 7(2) (1995) 219–269
• [4] Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural computation 10(5) (1998) 1299–1319
• [5] Yang, M.H.: Kernel eigenfaces vs. kernel fisherfaces: Face recognition using kernel methods. In: Proceedings of the 5th IEEE ICAFGR (2002) 215–220
• [6] Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. Signal Processing, IEEE Transactions on 52(8) (2004) 2165–2176
• [7] Slavakis, K., Theodoridis, S., Yamada, I.: Online kernel-based classification using adaptive projection algorithms. Signal Processing, IEEE Transactions on 56(7) (2008) 2781–2796
• [8] Orabona, F., Keshet, J., Caputo, B.: Bounded kernel-based online learning. The Journal of Machine Learning Research 10 (2009) 2643–2666
• [9] Zhao, P., Hoi, S.C., Jin, R.: Double updating online learning. Journal of Machine Learning Research 12 (2011) 1587–1615
• [10] Príncipe, J.C., Liu, W., Haykin, S.: Kernel Adaptive Filtering: A Comprehensive Introduction. Volume 57. John Wiley & Sons (2011)
• [11] Aronszajn, N.: Theory of reproducing kernels. Transactions of the American mathematical society 68(3) (1950) 337–404
• [12] Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery 2(2) (1998) 121–167
• [13] Liu, W., Pokharel, P.P., Principe, J.C.: The kernel least-mean-square algorithm. Signal Processing, IEEE Transactions on 56(2) (2008) 543–554
• [14] Bouboulis, P., Theodoridis, S.: Extension of Wirtinger's calculus to reproducing kernel Hilbert spaces and the complex kernel LMS. Signal Processing, IEEE Transactions on 59(3) (2011) 964–978
• [15] Liu, W., Príncipe, J.: Kernel affine projection algorithms. EURASIP Journal on Advances in Signal Processing 2008(1) (2008) 784292
• [16] Engel, Y., Mannor, S., Meir, R.: The kernel recursive least-squares algorithm. Signal Processing, IEEE Transactions on 52(8) (2004) 2275–2285
• [17] Liu, W., Park, I., Wang, Y., Príncipe, J.C.: Extended kernel recursive least squares algorithm. Signal Processing, IEEE Transactions on 57(10) (2009) 3801–3814
• [18] Platt, J.: A resource-allocating network for function interpolation. Neural computation 3(2) (1991) 213–225
• [19] Liu, W., Park, I., Príncipe, J.C.: An information theoretic approach of designing sparse kernel adaptive filters. Neural Networks, IEEE Transactions on 20(12) (2009) 1950–1961
• [20] Chen, B., Zhao, S., Zhu, P., Principe, J.C.: Quantized kernel least mean square algorithm. Neural Networks and Learning Systems, IEEE Transactions on 23(1) (2012) 22–32
• [21] Chen, B., Zhao, S., Zhu, P., Príncipe, J.C.: Mean square convergence analysis for kernel least mean square algorithm. Signal Processing 92(11) (2012) 2624–2632
• [22] Wahba, G.: Spline models for observational data. Number 59. SIAM (1990)
• [23] Hardle, W.: Applied nonparametric regression. Volume 5. Cambridge Univ Press (1990)
• [24] Racine, J.: An efficient cross-validation algorithm for window width selection for nonparametric kernel regression. Communications in Statistics-Simulation and Computation 22(4) (1993) 1107–1114
• [25] Cawley, G.C., Talbot, N.L.: Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition 36(11) (2003) 2585–2592
• [26] An, S., Liu, W., Venkatesh, S.: Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression. Pattern Recognition 40(8) (2007) 2154–2162
• [27] Herrmann, E.: Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics 6(1) (1997) 35–54
• [28] Silverman, B.W.: Density estimation for statistics and data analysis. Volume 26. CRC press (1986)
• [29] Jones, M.C., Marron, J.S., Sheather, S.J.: A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91(433) (1996) 401–407
• [30] Brunsdon, C.: Estimating probability surfaces for geographical point data: an adaptive kernel algorithm. Computers & Geosciences 21(7) (1995) 877–894
• [31] Katkovnik, V., Shmulevich, I.: Kernel density estimation with adaptive varying window size. Pattern recognition letters 23(14) (2002) 1641–1648
• [32] Yuan, J., Bo, L., Wang, K., Yu, T.: Adaptive spherical gaussian kernel in sparse bayesian learning framework for nonlinear regression. Expert Systems with Applications 36(2) (2009) 3982–3989
• [33] Singh, A., Príncipe, J.C.: Information theoretic learning with adaptive kernels. Signal Processing 91(2) (2011) 203–213
• [34] Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. The Journal of Machine Learning Research 12 (2011) 2211–2268
• [35] Herbster, M.: Relative loss bounds and polynomial-time predictions for the k-lms-net algorithm. In: Algorithmic Learning Theory, Springer (2004) 309–323
• [36] Ong, C.S., Williamson, R.C., Smola, A.J.: Learning the kernel with hyperkernels. Journal of Machine Learning Research 6 (2005) 1043–1071
• [37] Argyriou, A., Micchelli, C.A., Pontil, M.: Learning convex combinations of continuously parameterized basic kernels. In: Learning Theory. Springer (2005) 338–352
• [38] Jin, R., Hoi, S.C., Yang, T.: Online multiple kernel learning: Algorithms and mistake bounds. In: Algorithmic Learning Theory, Springer (2010) 390–404
• [39] Orabona, F., Jie, L., Caputo, B.: Multi kernel learning with online-batch optimization. The Journal of Machine Learning Research 13 (2012) 227–253
• [40] Sayed, A.H.: Fundamentals of adaptive filtering. John Wiley & Sons (2003)
• [41] Al-Naffouri, T.Y., Sayed, A.H.: Adaptive filters with error nonlinearities: Mean-square analysis and optimum design. EURASIP Journal on Advances in Signal Processing 2001(4) (2001) 192–205
• [42] Yousef, N.R., Sayed, A.H.: A unified approach to the steady-state and tracking analyses of adaptive filters. Signal Processing, IEEE Transactions on 49(2) (2001) 314–324
• [43] Al-Naffouri, T.Y., Sayed, A.H.: Transient analysis of data-normalized adaptive filters. Signal Processing, IEEE Transactions on 51(3) (2003) 639–652
• [44] Al-Naffouri, T.Y., Sayed, A.H.: Transient analysis of adaptive filters with error nonlinearities. Signal Processing, IEEE Transactions on 51(3) (2003) 653–663
• [45] Zhao, S., Chen, B., Principe, J.C.: Kernel adaptive filtering with maximum correntropy criterion. In: Neural Networks (IJCNN), The 2011 International Joint Conference on, IEEE (2011) 2012–2017
• [46] Quang, M.H.: Further properties of Gaussian reproducing kernel Hilbert spaces. arXiv preprint arXiv:1210.6170 (2012)