# A stochastic behavior analysis of stochastic restricted-gradient descent algorithm in reproducing kernel Hilbert spaces

This paper presents a stochastic behavior analysis of a kernel-based stochastic restricted-gradient descent method. The restricted gradient gives a steepest ascent direction within the so-called dictionary subspace. The analysis provides the transient and steady state performance in the mean squared error criterion. It also includes stability conditions in the mean and mean-square sense. The present study is based on the analysis of the kernel normalized least mean square (KNLMS) algorithm initially proposed by Chen et al. Simulation results validate the analysis.


## 1 Introduction

Kernel adaptive filtering [1] is an attractive approach for nonlinear estimation problems based on the theory of reproducing kernel Hilbert spaces (RKHSs), and a number of kernel adaptive filtering algorithms have been proposed [2, 3, 4, 5, 6, 7, 8]. The existing kernel adaptive filtering algorithms are classified into two general categories according to the space in which optimization is performed [6]: (i) the RKHS approach (e.g., [2, 5, 7]) and (ii) the parameter-space approach (e.g., [4, 6, 9]). The kernel normalized least mean square (KNLMS) algorithm is a representative example of the parameter-space approach, and its stochastic behavior analyses have been presented in [10, 11, 12]. These analyses have clarified the transient and steady-state performance in the mean squared error (MSE). The stochastic restricted-gradient descent algorithm studied in the present work is an RKHS counterpart of the KNLMS algorithm. We call it the natural kernel least mean square (Natural KLMS) algorithm to distinguish it from the KLMS algorithm proposed in [13]. A basic question is whether the same analyses as in [10, 11, 12] can be given for the stochastic restricted-gradient descent algorithm. If so, they will provide a theoretical basis for comparing the performances of KNLMS and Natural KLMS, which will eventually give new insight into the relationship between the two classes of kernel adaptive filtering algorithms.

To clarify the position of the Natural KLMS algorithm in kernel adaptive filtering research, let us give a short note on the RKHS approach. Dictionary sparsification is a common issue in kernel adaptive filtering [14, 3, 4, 1]. The KLMS algorithm [13] updates the filter only when the current input datum is added into the dictionary, and this can cause severe performance degradation. A systematic scheme which eliminates this limitation has been proposed in [15] under the name of hyperplane projection along affine subspace (HYPASS). The HYPASS algorithm updates the filter using the projection onto the zero-instantaneous-error hyperplane along the so-called dictionary subspace $\mathcal{M}$, the subspace spanned by the dictionary elements. This is achieved by projecting the gradient direction onto $\mathcal{M}$. In a nutshell, HYPASS is the NLMS algorithm operated in the dictionary subspace $\mathcal{M}$. Natural KLMS is actually an LMS counterpart of HYPASS, and we consider this LMS-based algorithm to make the analysis feasible. In [16] and [7], the mean square convergence analysis and the theoretical steady-state MSE have been presented for the KLMS and Quantized KLMS algorithms, respectively. However, transient performance analyses have not yet been reported due to the difficulty in treating the growing number of dictionary elements.

In this paper, we present a stochastic behavior analysis of the Natural KLMS algorithm with a Gaussian kernel under i.i.d. random inputs, based on the framework presented in [12]. Natural KLMS is derived by using the restricted gradient, which gives a steepest ascent direction within the dictionary subspace $\mathcal{M}$. The analysis provides theoretical MSEs during the transient phase as well as at the steady state. We also derive stability conditions in the mean and mean-square sense. The key ingredients of the analysis are the restricted gradient and the isomorphism between the dictionary subspace and a Euclidean space; these were also the key when the first and second authors developed a sparse version of HYPASS in [17]. The validity of the analysis is illustrated by simulations.

## 2 Preliminaries

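We address the adaptive estimation of a nonlinear system from sequentially arriving input signals $u_n \in \mathbb{R}^L$ and its noisy output $d_n$, where $u_n$ is assumed to be an i.i.d. random vector and the additive noise $\nu_n$ is zero-mean and uncorrelated with any other signals. The nonlinear system is modeled as an element of the RKHS $\mathcal{H}$ associated with a Gaussian kernel $\kappa(x,y) := \exp\left(-\|x-y\|^2/(2\sigma^2)\right)$, where $\sigma > 0$ is the kernel parameter. We denote by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ the canonical inner product and the norm defined in $\mathbb{R}^L$, respectively, and by $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and $\|\cdot\|_{\mathcal{H}}$ those in $\mathcal{H}$. A kernel adaptive filter is given as a finite-order filter:

$$\varphi_n := \sum_{j\in\mathcal{J}} \alpha_j^{(n)} \kappa(\cdot,u_j), \quad n\in\mathbb{N}, \tag{1}$$

where $\alpha_j^{(n)} \in \mathbb{R}$ are the filter coefficients, $\mathcal{J} := \{j_1,\dots,j_r\}$ indicates the dictionary $\{\kappa(\cdot,u_j)\}_{j\in\mathcal{J}}$, and $n$ is the time index. Without loss of generality, we assume that the dictionary is a linearly independent set so that it spans an $r$-dimensional subspace

$$\mathcal{M} := \operatorname{span}\{\kappa(\cdot,u_j)\}_{j\in\mathcal{J}} \subset \mathcal{H}, \tag{2}$$

which is called the dictionary subspace. Although the dictionary is typically updated during the learning process, we assume that it is fixed to make the analysis tractable.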

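The instantaneous error at time instant $n$ is defined as $e_n(\alpha) := d_n - \alpha^T\kappa_n$, where $\kappa_n := [\kappa(u_n,u_{j_1}),\dots,\kappa(u_n,u_{j_r})]^T \in \mathbb{R}^r$ is the vector of the kernelized input and $\alpha := [\alpha_{j_1},\dots,\alpha_{j_r}]^T \in \mathbb{R}^r$ is the coefficient vector. The MSE cost function, with respect to the coefficient vector $\alpha$, is given by

$$J(\alpha) := E(e_n^2(\alpha)) = E(d_n^2) + \alpha^T R_\kappa \alpha - 2p^T\alpha, \tag{3}$$

where $R_\kappa := E(\kappa_n\kappa_n^T)$ is the autocorrelation matrix of the kernelized input $\kappa_n$ and $p := E(d_n\kappa_n)$ is the cross-correlation vector between $\kappa_n$ and $d_n$. With the optimization in the RKHS in mind, the MSE, with respect to $\varphi$, is given by

$$J(\varphi) := E(e_n^2(\varphi)) = E(d_n^2) + E\left(\langle\varphi,\kappa(\cdot,u_n)\rangle_{\mathcal{H}}^2\right) - 2E\left(d_n\langle\varphi,\kappa(\cdot,u_n)\rangle_{\mathcal{H}}\right). \tag{4}$$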

While the KNLMS algorithm optimizes $J(\alpha)$ in the Euclidean space $\mathbb{R}^r$, the Natural KLMS algorithm presented in the following section optimizes $J(\varphi)$ in the RKHS $\mathcal{H}$ under the restriction to the dictionary subspace $\mathcal{M}$; in short, it optimizes $J(\varphi)$ in $\mathcal{M}$. Referring to [2], the stochastic gradient descent method for $J(\varphi)$ in $\mathcal{H}$ updates the filter along the 'line' (one-dimensional subspace) spanned by the singleton $\{\kappa(\cdot,u_n)\}$. This implies that the filter is updated only when $\kappa(\cdot,u_n)$ is added into the dictionary, because otherwise $\varphi_n + c\,\kappa(\cdot,u_n) \notin \mathcal{M}$ for any $c \neq 0$. We thus present the restricted gradient, which was initially introduced in [17], and derive the Natural KLMS algorithm in the following section.

## 3 The Natural KLMS algorithm

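The ordinary gradient of $J(\alpha)$ in $\mathbb{R}^r$ is given by $\nabla J(\alpha) = 2(R_\kappa\alpha - p)$. Given any positive definite matrix $A$, $\langle x,y\rangle_A := x^TAy$ and $\|x\|_A := \sqrt{\langle x,x\rangle_A}$ define an inner product and its induced norm, respectively. The $G$-gradient of (3) with the inner product $\langle\cdot,\cdot\rangle_G$ is defined as [17]

$$\nabla_G J(\alpha) := G^{-1}\nabla J(\alpha), \tag{5}$$

where $G$, with $[G]_{\ell,m} := \kappa(u_{j_\ell},u_{j_m})$ for $1\le\ell,m\le r$, is the Gram matrix. (The Gram matrix is ensured to be positive definite by the assumption that the elements of the dictionary are linearly independent.) The definition of the $G$-gradient is validated by observing that $\langle x,\nabla_G J(\alpha)\rangle_G = x^T\nabla J(\alpha)$ for any $x \in \mathbb{R}^r$.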

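The functional Hilbert space $(\mathcal{M},\langle\cdot,\cdot\rangle_{\mathcal{H}})$ of dimension $r$ is isomorphic to the Hilbert space $(\mathbb{R}^r,\langle\cdot,\cdot\rangle_G)$ under the correspondence (see Fig. 1)

$$\mathcal{M} \ni \varphi := \sum_{j\in\mathcal{J}} \alpha_j\kappa(\cdot,u_j) \longleftrightarrow \alpha := [\alpha_{j_1},\cdots,\alpha_{j_r}]^T \in \mathbb{R}^r. \tag{6}$$

Note here that the isomorphism as Hilbert spaces includes, in addition to the one-to-one correspondence between the elements, the preservation of the inner product; i.e., $\langle\varphi,\phi\rangle_{\mathcal{H}} = \langle\alpha,\beta\rangle_G$ for any $\varphi \leftrightarrow \alpha$ and $\phi \leftrightarrow \beta$. Under the correspondence in (6), the restricted gradient $\nabla_{|\mathcal{M}}J(\varphi)$ is defined, through the $G$-gradient in $\mathbb{R}^r$, as follows [17]:

$$\nabla_{|\mathcal{M}}J(\varphi) \longleftrightarrow \nabla_G J(\alpha) = G^{-1}\nabla J(\alpha). \tag{7}$$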

The restricted gradient $\nabla_{|\mathcal{M}}J(\varphi)$ gives the steepest ascent direction, within the dictionary subspace $\mathcal{M}$, of the tangent plane of the functional (4) at the point $\varphi$. See [17] for the derivation of the restricted gradient. An instantaneous approximation of the restricted gradient is given by $\tilde{\nabla}_G J(\alpha_n) := -2e_nG^{-1}\kappa_n$, where $e_n := d_n - \alpha_n^T\kappa_n$. Hence, for an initial vector $\alpha_0 \in \mathbb{R}^r$, the stochastic restricted-gradient descent method, which we call the Natural KLMS algorithm, is given by

$$\alpha_{n+1} := \alpha_n - \frac{\eta}{2}\tilde{\nabla}_G J(\alpha_n) = \alpha_n + \eta e_n G^{-1}\kappa_n, \quad n\in\mathbb{N}, \tag{8}$$

where $\eta > 0$ is the step size. The Natural KLMS algorithm (8) requires $O(r^2)$ complexity for each time update, and this has a significant impact on the overall complexity of the algorithm. In [15, 18], a simple selective-updating idea for complexity reduction without serious performance degradation has been presented; it will be shown in Section 5 that the selective updating works well.
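The update (8) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the zero initialization, and the kernel convention $\exp(-\|x-y\|^2/(2\sigma^2))$ are assumptions, and the dictionary is kept fixed as in the analysis.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Return [kappa(x, c_1), ..., kappa(x, c_r)] for a Gaussian kernel.

    Assumed convention: kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    """
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def natural_klms(inputs, desired, centers, sigma, eta):
    """Run update (8) with a fixed dictionary (hypothetical helper).

    inputs:  (N, L) array of input vectors u_n
    desired: (N,) array of outputs d_n
    centers: (r, L) array of dictionary points u_j
    """
    r = centers.shape[0]
    # Gram matrix G[l, m] = kappa(c_l, c_m); inverted once because the
    # dictionary is fixed, so each update costs O(r^2).
    G = np.array([gaussian_kernel(c, centers, sigma) for c in centers])
    G_inv = np.linalg.inv(G)
    alpha = np.zeros(r)  # assumed zero initialization
    errors = np.empty(len(inputs))
    for n, (u, d) in enumerate(zip(inputs, desired)):
        kappa_n = gaussian_kernel(u, centers, sigma)
        e_n = d - alpha @ kappa_n                    # instantaneous error
        alpha = alpha + eta * e_n * (G_inv @ kappa_n)  # update (8)
        errors[n] = e_n
    return alpha, errors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.linspace(-2.0, 2.0, 5)[:, None]   # toy 1-D dictionary
    u = rng.standard_normal((2000, 1))
    d = np.tanh(u[:, 0]) + 0.01 * rng.standard_normal(2000)
    alpha, errors = natural_klms(u, d, centers, sigma=1.0, eta=0.2)
    print("MSE over last 500 samples:", np.mean(errors[-500:] ** 2))
```

For an ill-conditioned Gram matrix, a Cholesky factorization of $G$ would be a safer choice than the explicit inverse used here for brevity.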

## 4 Performance analysis

### 4.1 Key idea and assumption

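We derive a theoretical MSE and stability conditions for the Natural KLMS algorithm given by (8) with a Gaussian kernel, given the dictionary $\{\kappa(\cdot,u_j)\}_{j\in\mathcal{J}}$. Left-multiplying both sides of (8) by the square root $G^{\frac{1}{2}}$ of $G$ yields (for any positive semi-definite matrix $A$, there exists a unique positive semi-definite square root $A^{\frac{1}{2}}$ satisfying $A^{\frac{1}{2}}A^{\frac{1}{2}} = A$)

$$\tilde{\alpha}_{n+1} = \tilde{\alpha}_n + \eta e_n\tilde{\kappa}_n, \tag{9}$$

where $\tilde{\alpha}_n := G^{\frac{1}{2}}\alpha_n$ and $\tilde{\kappa}_n := G^{-\frac{1}{2}}\kappa_n$. The cost function in (3) can be rewritten as

$$(J(\alpha) =)\ \tilde{J}(\tilde{\alpha}) = E(d_n^2) + \tilde{\alpha}^T\tilde{R}_\kappa\tilde{\alpha} - 2\tilde{p}^T\tilde{\alpha}, \tag{10}$$

as a function of $\tilde{\alpha}$, and (9) can be regarded as a stochastic gradient descent method for this cost function $\tilde{J}$. Here

$$\tilde{R}_\kappa := E(\tilde{\kappa}_n\tilde{\kappa}_n^T) = G^{-\frac{1}{2}}R_\kappa G^{-\frac{1}{2}}, \tag{11}$$

and

$$\tilde{p} := E(d_n\tilde{\kappa}_n) = G^{-\frac{1}{2}}p, \tag{12}$$

are the autocorrelation matrix and the cross-correlation vector for the modified vector $\tilde{\kappa}_n$, respectively.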
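The change of variables behind (9) can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, and the matrix square root is computed through an eigendecomposition, which is one of several valid choices. It verifies that one step of (8) mapped through $G^{1/2}$ coincides with one step of (9).

```python
import numpy as np

def sym_sqrt(G):
    """Symmetric positive definite square root of an SPD matrix G."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(w)) @ V.T

def step_original(alpha, kappa, d, G, eta):
    """One step of (8): alpha + eta * e * G^{-1} kappa."""
    e = d - alpha @ kappa
    return alpha + eta * e * np.linalg.solve(G, kappa)

def step_transformed(alpha_t, kappa_t, d, eta):
    """One step of (9): plain LMS in the transformed coordinates."""
    e = d - alpha_t @ kappa_t
    return alpha_t + eta * e * kappa_t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.standard_normal((4, 2))                 # toy dictionary points
    sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-sq / 2.0)                             # Gaussian Gram matrix, sigma = 1
    G_half = sym_sqrt(G)
    alpha = rng.standard_normal(4)
    kappa = rng.random(4)
    d, eta = 0.3, 0.1
    lhs = G_half @ step_original(alpha, kappa, d, G, eta)
    rhs = step_transformed(G_half @ alpha, np.linalg.solve(G_half, kappa), d, eta)
    print("max deviation:", np.max(np.abs(lhs - rhs)))
```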

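As $R_\kappa$ is positive definite [10], so is $\tilde{R}_\kappa = G^{-\frac{1}{2}}R_\kappa G^{-\frac{1}{2}}$, and the optimum weight vector is given by

$$\tilde{\alpha}^* := \tilde{R}_\kappa^{-1}\tilde{p}, \tag{13}$$

and, with $\tilde{\alpha}^*$, we define the weight error vector

$$\tilde{v}_n := \tilde{\alpha}_n - \tilde{\alpha}^*. \tag{14}$$

In the present analysis, $\tilde{\kappa}_n\tilde{\kappa}_n^T$ needs to be independent of $\tilde{v}_n$, which is guaranteed by making the following conditioned modified independence assumption (CMIA) [12].

###### Assumption 1

$\tilde{\kappa}_n\tilde{\kappa}_n^T$ is independent of $\tilde{v}_n$.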

### 4.2 Mean weight error analysis

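The estimation error can be expressed by

$$e_n = d_n - \tilde{\kappa}_n^T\tilde{v}_n - \tilde{\kappa}_n^T\tilde{\alpha}^*. \tag{15}$$

Substituting (15) into (9), we obtain the recursive expression for $\tilde{v}_n$:

$$\tilde{v}_{n+1} = \tilde{v}_n + \eta d_n\tilde{\kappa}_n - \eta\tilde{\kappa}_n^T\tilde{v}_n\tilde{\kappa}_n - \eta\tilde{\kappa}_n^T\tilde{\alpha}^*\tilde{\kappa}_n. \tag{16}$$

Using CMIA, we obtain the mean weight error model

$$E(\tilde{v}_{n+1}) = (I_r - \eta\tilde{R}_\kappa)E(\tilde{v}_n), \tag{17}$$

where $I_m$ denotes the $m\times m$ identity matrix for any positive integer $m$. Let the input $u_n$ be a random vector following a Gaussian distribution with zero mean and covariance matrix $R_u$. Then, the $(\ell,m)$ component ($1\le\ell,m\le r$) of the autocorrelation matrix $R_\kappa$ of $\kappa_n$ is given by [12]:

$$[R_\kappa]_{\ell,m} = \left|I_L + \frac{2}{\sigma^2}R_u\right|^{-\frac{1}{2}} \exp\left[-\frac{1}{4\sigma^2}\left(2\|\bar{u}_{\ell m}\|_{(2)} - \|\bar{u}_{\ell m}\|^2_{\left(I_L + \frac{\sigma^2}{2}R_u^{-1}\right)^{-1}}\right)\right],$$

where $\bar{u}_{\ell m} := u_{j_\ell} + u_{j_m}$, $\|\bar{u}_{\ell m}\|_{(2)} := \|u_{j_\ell}\|^2 + \|u_{j_m}\|^2$, $\|x\|_A^2 := x^TAx$, and $|\cdot|$ stands for the determinant.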
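As a sanity check on the closed-form expression for $[R_\kappa]_{\ell,m}$, the sketch below compares it with a Monte Carlo estimate. It rests on assumptions about the symbols lost around the formula: the kernel is taken as $\exp(-\|x-y\|^2/(2\sigma^2))$, $\bar{u}_{\ell m}$ is read as the sum of the two dictionary points, and $\|\bar{u}_{\ell m}\|_{(2)}$ as the sum of their squared norms. The function names and test values are hypothetical.

```python
import numpy as np

def r_kappa_closed_form(centers, R_u, sigma):
    """Closed-form E[kappa_n kappa_n^T] for u_n ~ N(0, R_u) (assumed reading)."""
    r, L = centers.shape
    I = np.eye(L)
    det_term = np.linalg.det(I + 2.0 * R_u / sigma**2) ** -0.5
    # Weight matrix W = (I_L + (sigma^2 / 2) R_u^{-1})^{-1}
    W = np.linalg.inv(I + 0.5 * sigma**2 * np.linalg.inv(R_u))
    R = np.empty((r, r))
    for l in range(r):
        for m in range(r):
            u_bar = centers[l] + centers[m]
            sq_sum = centers[l] @ centers[l] + centers[m] @ centers[m]
            R[l, m] = det_term * np.exp(
                -(2.0 * sq_sum - u_bar @ W @ u_bar) / (4.0 * sigma**2)
            )
    return R

def r_kappa_monte_carlo(centers, R_u, sigma, n_samples, rng):
    """Sample average of kappa_n kappa_n^T over Gaussian inputs."""
    L = centers.shape[1]
    U = rng.multivariate_normal(np.zeros(L), R_u, size=n_samples)
    sq = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq / (2.0 * sigma**2))   # rows are kappa_n^T
    return K.T @ K / n_samples

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.array([[-1.0, 0.0], [0.0, 0.5], [1.0, 1.0]])
    R_u = np.array([[1.0, 0.3], [0.3, 0.5]])
    exact = r_kappa_closed_form(centers, R_u, sigma=1.0)
    approx = r_kappa_monte_carlo(centers, R_u, 1.0, 200_000, rng)
    print("max abs difference:", np.max(np.abs(exact - approx)))
```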

From the recursion in (17), we obtain the mean stability condition of the Natural KLMS algorithm as follows.

###### Theorem 1 (Stability in the mean)

Assume CMIA holds. Then, for any initial condition, given a dictionary $\{\kappa(\cdot,u_j)\}_{j\in\mathcal{J}}$, the Natural KLMS algorithm asymptotically converges in the mean if the step size is chosen to satisfy

$$0 < \eta < \frac{2}{\lambda_{\max}(\tilde{R}_\kappa)}, \tag{18}$$

where $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue of the matrix.

Proof: It is clear from the well-known mean stability results (see, e.g., [19]).
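The bound (18) is straightforward to evaluate once $R_\kappa$ and $G$ are available. The following sketch, with hypothetical helper names, forms $\tilde{R}_\kappa$ as in (11), returns $2/\lambda_{\max}(\tilde{R}_\kappa)$, and computes the spectral radius of $I_r - \eta\tilde{R}_\kappa$ that governs the mean recursion (17). In the demo, $R_\kappa$ is replaced by a sample average, which is an assumption made only to keep the example self-contained.

```python
import numpy as np

def transformed_autocorrelation(R_kappa, G):
    """R~ = G^{-1/2} R_kappa G^{-1/2}, as in (11)."""
    w, V = np.linalg.eigh(G)
    G_inv_half = (V / np.sqrt(w)) @ V.T
    R_t = G_inv_half @ R_kappa @ G_inv_half
    return 0.5 * (R_t + R_t.T)   # remove rounding asymmetry

def mean_stability_bound(R_kappa, G):
    """Upper limit on the step size from (18)."""
    lam_max = np.linalg.eigvalsh(transformed_autocorrelation(R_kappa, G))[-1]
    return 2.0 / lam_max

def mean_spectral_radius(R_kappa, G, eta):
    """Spectral radius of I - eta * R~ in the mean recursion (17)."""
    eig = np.linalg.eigvalsh(transformed_autocorrelation(R_kappa, G))
    return np.max(np.abs(1.0 - eta * eig))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.array([-1.0, 0.0, 1.0])      # toy 1-D dictionary
    sigma = 0.5
    u = rng.standard_normal(20_000)
    K = np.exp(-(u[:, None] - centers[None, :]) ** 2 / (2 * sigma**2))
    R_kappa = K.T @ K / len(u)                # sample estimate of R_kappa
    G = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * sigma**2))
    bound = mean_stability_bound(R_kappa, G)
    print("step-size bound:", bound)
    print("radius at half the bound:", mean_spectral_radius(R_kappa, G, 0.5 * bound))
```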

### 4.3 Mean-square error analysis

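Squaring (15) and taking its expectation under CMIA, the MSE (10) of Natural KLMS can be rewritten as

$$\tilde{J}(\tilde{\alpha}_n) = J_{\min} + \operatorname{tr}(\tilde{R}_\kappa\tilde{C}_n), \tag{19}$$

where $\tilde{C}_n := E(\tilde{v}_n\tilde{v}_n^T)$ is the correlation matrix of $\tilde{v}_n$ and $J_{\min} := E(d_n^2) - \tilde{p}^T\tilde{R}_\kappa^{-1}\tilde{p}$ is the minimum MSE. We assume $\tilde{\alpha}^*$ is sufficiently close to the optimal solution of the infinite order model so that the optimal estimation error $e_n^* := d_n - \tilde{\kappa}_n^T\tilde{\alpha}^*$ satisfies $E(e_n^*) \approx 0$, and $(e_n^*)^2$ and $\tilde{\kappa}_n\tilde{\kappa}_n^T$ are uncorrelated. Following the arguments in [10, Section III. D] with the original quantities replaced by their transformed (tilde) counterparts, we arrive, with simple manipulations, at the following recursion:

$$\tilde{C}_{n+1} \approx \tilde{C}_n + \eta^2(\tilde{T}_n + J_{\min}\tilde{R}_\kappa) - \eta(\tilde{R}_\kappa\tilde{C}_n + \tilde{C}_n\tilde{R}_\kappa), \tag{20}$$

where $\tilde{T}_n := E(\tilde{\kappa}_n\tilde{\kappa}_n^T\tilde{v}_n\tilde{v}_n^T\tilde{\kappa}_n\tilde{\kappa}_n^T)$ and its $(\ell,m)$ component can be approximated as

$$[\tilde{T}_n]_{\ell,m} \approx \operatorname{tr}(\tilde{S}_{\ell,m}\tilde{C}_n), \quad 1\le\ell,m\le r. \tag{21}$$

Here, the $(p,q)$ component ($1\le p,q\le r$) of $\tilde{S}_{\ell,m}$ is defined as

$$[\tilde{S}_{\ell,m}]_{p,q} := E(\tilde{\kappa}_{n,\ell}\tilde{\kappa}_{n,m}\tilde{\kappa}_{n,p}\tilde{\kappa}_{n,q}) = g_\ell^T H_{m,p}\, g_q, \tag{22}$$

where $\tilde{\kappa}_{n,\ell}$ is the $\ell$-th component of $\tilde{\kappa}_n$, $g_\ell$ ($1\le\ell\le r$) is the $\ell$-th column vector of $G^{-\frac{1}{2}}$, and $H_{m,p} := E(\tilde{\kappa}_{n,m}\tilde{\kappa}_{n,p}\kappa_n\kappa_n^T)$. The approximation in (21) can be developed by following the arguments in [12, Section 3.3], again with the original quantities replaced by their transformed counterparts. Finally, the $(i,j)$ component of $H_{m,p}$ can be written as

$$[H_{m,p}]_{i,j} = g_m^T S_{i,j}\, g_p, \quad 1\le i,j\le r, \tag{23}$$

where $S_{i,j} := E(\kappa_{n,i}\kappa_{n,j}\kappa_n\kappa_n^T)$, whose components $[S_{i,j}]_{p,q} = E(\kappa_{n,i}\kappa_{n,j}\kappa_{n,p}\kappa_{n,q})$ with $1\le p,q\le r$ can be computed by [12, Eq. (35)].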

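Let us now establish the mean-square stability condition and derive the steady-state MSE. Due to the presence of $\tilde{T}_n$ in (20), we exploit the lexicographic representation of $\tilde{C}_n$ and $\tilde{R}_\kappa$, i.e., the columns of each matrix are stacked on top of each other into a vector. The recursion (20) can be rewritten as

$$\tilde{c}_{n+1} = K\tilde{c}_n + \eta^2 J_{\min}\tilde{r}_\kappa, \tag{24}$$

where $\tilde{c}_n$ and $\tilde{r}_\kappa$ are the lexicographic forms of $\tilde{C}_n$ and $\tilde{R}_\kappa$, respectively, and

$$K := I_{r^2} - \eta(K_1 + K_2) + \eta^2 K_3, \tag{25}$$

where $K_1 := I_r\otimes\tilde{R}_\kappa$, $K_2 := \tilde{R}_\kappa\otimes I_r$, and $K_3$ is an $r^2\times r^2$ matrix whose entries are $[K_3]_{i+(j-1)r,\,\ell+(m-1)r} := [\tilde{S}_{i,j}]_{\ell,m}$ with $1\le i,j,\ell,m\le r$. Here, $\otimes$ denotes the Kronecker product. By (24) and (25), we obtain the following results.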

###### Theorem 2 (Mean-square stability)

Assume CMIA holds. For any initial conditions and any step size $\eta$ satisfying (18), given a dictionary $\{\kappa(\cdot,u_j)\}_{j\in\mathcal{J}}$, the Natural KLMS algorithm with a Gaussian kernel is mean-square stable if the matrix $K$ is stable (i.e., the spectral radius of $K$ is less than one).

Proof: The algorithm is said to be mean-square stable if, and only if, the state vector remains bounded and tends to a steady-state value, regardless of the initial condition [19]. To complete the proof, it is sufficient to show that remains bounded and tends to a steady-state value, where is a diagonal positive definite matrix. This is verified by the fact that is bounded and tends to a steady-state value if the matrix is stable.

###### Theorem 3 (MSE in the steady state)

Consider a sufficiently small step size $\eta$, which ensures mean and mean-square stability. The steady-state MSE is given by (19) with the lexicographic representation of $\tilde{C}_\infty$ given by

$$\tilde{c}_\infty = \eta^2 J_{\min}(I_{r^2} - K)^{-1}\tilde{r}_\kappa, \tag{26}$$

provided that $I_{r^2} - K$ is invertible.

Proof: Letting $n\to\infty$ in (24) and rearranging the equation, we obtain (26).

We remark on Theorem 3 that the invertibility of $I_{r^2} - K$ is actually ensured by the stability of $K$, since a spectral radius below one excludes the eigenvalue one.
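Theorems 2 and 3 can be exercised numerically as follows. This is a sketch under stated assumptions, not the paper's procedure: the second- and fourth-order moments of $\tilde{\kappa}_n$ are estimated by sample averages instead of the closed form of [12, Eq. (35)], $K_1$, $K_2$, $K_3$ are built by column stacking as read from (24) and (25), and all function names and numerical values are hypothetical.

```python
import numpy as np

def sample_moments(kappa_t):
    """Sample estimates of R~ (r x r) and the fourth moments S~ (r x r x r x r).

    kappa_t: (N, r) array whose rows are transformed vectors kappa~_n.
    """
    n = kappa_t.shape[0]
    R_t = kappa_t.T @ kappa_t / n
    S_t = np.einsum("ni,nj,nk,nl->ijkl", kappa_t, kappa_t, kappa_t, kappa_t) / n
    return R_t, S_t

def steady_state_mse(R_t, S_t, eta, J_min):
    """Build K as in (25) and evaluate (26) and (19) at the steady state."""
    r = R_t.shape[0]
    I_r = np.eye(r)
    K1 = np.kron(I_r, R_t)
    K2 = np.kron(R_t, I_r)
    # S~ is symmetric in all four indices, so a plain reshape matches
    # the column-stacking index convention.
    K3 = S_t.reshape(r * r, r * r)
    K = np.eye(r * r) - eta * (K1 + K2) + eta**2 * K3
    r_vec = R_t.flatten(order="F")            # lexicographic form of R~
    c_inf = eta**2 * J_min * np.linalg.solve(np.eye(r * r) - K, r_vec)
    mse = J_min + r_vec @ c_inf               # tr(R~ C_inf) for symmetric matrices
    return mse, c_inf, K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.array([-1.0, 0.0, 1.0])      # toy 1-D dictionary
    sigma = 0.5
    u = rng.standard_normal(20_000)
    kappa = np.exp(-(u[:, None] - centers[None, :]) ** 2 / (2 * sigma**2))
    G = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * sigma**2))
    w, V = np.linalg.eigh(G)
    kappa_t = kappa @ ((V / np.sqrt(w)) @ V.T)   # rows: G^{-1/2} kappa_n
    R_t, S_t = sample_moments(kappa_t)
    mse, c_inf, K = steady_state_mse(R_t, S_t, eta=0.1, J_min=0.01)
    print("spectral radius of K:", np.max(np.abs(np.linalg.eigvals(K))))
    print("steady-state MSE:", mse)
```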

## 5 Simulation results

We compare simulated learning curves and analytic models to validate the analysis. We conduct two experiments under the same settings as in [12]. In the first experiment, the input sequence is generated by

$$u_n := \rho u_{n-1} + \sigma_u\sqrt{1-\rho^2}\,\omega_n, \tag{27}$$

where $\omega_n$ is the noise following the i.i.d. standard normal distribution. The nonlinear system is defined as follows:

$$\begin{cases} x_n := 0.5u_n - 0.3u_{n-1} \\ d_n := x_n - 0.5x_n^2 + 0.1x_n^3 + \nu_n, \end{cases} \tag{28}$$

where $\nu_n$ is an additive zero-mean Gaussian noise with the standard deviation . The input vector is . The step size, the standard deviation of the input, the input correlation parameter, the kernel parameter and the dictionary size are set to , , , and , respectively. The dictionary is samples on a uniform grid defined on .
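The data model (27) and (28) of the first experiment can be generated as sketched below. The parameter values in the demo are placeholders chosen only for illustration, since the settings are not legible here, and stacking the current and previous samples into a two-dimensional input vector is an assumption. The function names are hypothetical.

```python
import numpy as np

def generate_ar1_input(n, rho, sigma_u, rng):
    """Generate u_0, ..., u_n by the first-order recursion (27)."""
    u = np.empty(n + 1)
    u[0] = sigma_u * rng.standard_normal()   # start in the stationary distribution
    scale = sigma_u * np.sqrt(1.0 - rho**2)
    for k in range(1, n + 1):
        u[k] = rho * u[k - 1] + scale * rng.standard_normal()
    return u

def first_experiment_system(u, sigma_nu, rng):
    """Apply the nonlinear system (28) to the scalar sequence u."""
    x = 0.5 * u[1:] - 0.3 * u[:-1]
    noise = sigma_nu * rng.standard_normal(x.shape)
    d = x - 0.5 * x**2 + 0.1 * x**3 + noise
    # Assumed input vector: [u_n, u_{n-1}]
    inputs = np.column_stack([u[1:], u[:-1]])
    return inputs, d

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder values, not the paper's settings
    u = generate_ar1_input(10_000, rho=0.5, sigma_u=0.5, rng=rng)
    inputs, d = first_experiment_system(u, sigma_nu=0.05, rng=rng)
    print(inputs.shape, d.shape, np.var(u))
```

The scaling by $\sqrt{1-\rho^2}$ in (27) keeps the stationary variance of $u_n$ equal to $\sigma_u^2$ regardless of the correlation parameter $\rho$.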

Fig. 2 depicts the results: the simulated learning curve, the theoretical transient MSE curve, and the theoretical steady-state MSE line are presented in blue, red, and green (dotted line), respectively. The simulated curve is obtained by averaging over 300 Monte-Carlo runs. The theoretical MSE is estimated by (19) with $\tilde{C}_n$ recursively evaluated by (20). The steady-state MSE is computed by Theorem 3. Although the input is correlated, the theoretical MSE presented in this paper represents the behavior of the Natural KLMS algorithm well.

In the second experiment, the fluid-flow control problem is considered [20]:

$$\begin{cases} x_n := 0.1044u_n + 0.0883u_{n-1} + 1.4138x_{n-1} - 0.6065x_{n-2} \\ d_n := 0.3163x_n/\sqrt{0.1 + 0.9x_n^2} + \nu_n, \end{cases} \tag{29}$$

where the input is generated again by (27) with and , and the standard deviation of the additive Gaussian noise is set to . The kernel parameter is set to . The input vector is . 31 dictionary elements are selected from the inputs based on the coherence criterion [4] in advance. The step size is set to . The simulated curves are obtained by averaging over 300 Monte-Carlo runs, and the same theoretical model as the first experiment is used. Fig 3 depicts the results. Again, the simulation results show the validity of the analysis. Table 1 summarizes the overall per-iteration complexity (the number of real multiplications) of the Natural KLMS algorithm with full update and selective update (see [15, 18]), and Fig. 4 illustrates the complexity as a function of the dictionary size for and ; is counted simply as . Here, means that only one coefficient is updated at each iteration and hence the complexity is reduced drastically. Fig 2 and 3 depict the MSE learning curves of the Natural KLMS algorithm with full update and selective update for . It can be seen that the Natural KLMS algorithm with the selective update exhibits a steady-state MSE comparable to the full-update case with drastically lower complexity.

## 6 Conclusion

This paper presented a stochastic behavior analysis of the Natural KLMS algorithm, which is a stochastic restricted-gradient descent method. The analysis provided the transient and steady-state MSEs of the algorithm. We also derived stability conditions in the mean and mean-square sense. Simulation results showed that the theoretical MSE curves given by the analysis agree well with the simulated MSE curves. The outcomes of this study will serve as a theoretical basis for comparing the performances of KNLMS and Natural KLMS.

## References

• [1] W. Liu, J. Príncipe, and S. Haykin, Kernel Adaptive Filtering.   New Jersey: Wiley, 2010.
• [2] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004.
• [3] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug. 2004.
• [4] C. Richard, J.-C. M. Bermudez, and P. Honeine, “Online prediction of time series data with kernels,” IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1058–1067, Mar. 2009.
• [5] K. Slavakis, S. Theodoridis, and I. Yamada, “Adaptive constrained learning in reproducing kernel Hilbert spaces: the robust beamforming case,” IEEE Trans. Signal Process., vol. 57, no. 12, pp. 4744–4764, Dec. 2009.
• [6] M. Yukawa, “Multikernel adaptive filtering,” IEEE Trans. Signal Processing, vol. 60, no. 9, pp. 4672–4682, Sep. 2012.
• [7] B. Chen, S. Zhao, P. Zhu, and J. C. Príncipe, “Quantized kernel least mean square algorithm,” IEEE Trans. Neural Networks and Learning Systems, vol. 23, no. 1, pp. 22–32, 2012.
• [8] S. Van Vaerenbergh, M. Lázaro-Gredilla, and I. Santamaría, “Kernel recursive least-squares tracker for time-varying regression,” IEEE Trans. Neural Networks and Learning Systems, vol. 23, no. 8, pp. 1313–1326, Aug. 2012.
• [9] W. Gao, J. Chen, C. Richard, and J. Huang, “Online dictionary learning for kernel LMS,” IEEE Trans. Signal Process., vol. 62, no. 11, pp. 2765–2777, June 2014.
• [10] W. D. Parreira, J.-C. M. Bermudez, C. Richard, and J. Y. Tourneret, “Stochastic behavior analysis of the Gaussian kernel least-mean-square algorithm,” IEEE Trans. Signal Processing, vol. 60, no. 5, pp. 2208–2222, May 2012.
• [11] C. Richard and J.-C. M. Bermudez, “Closed-form conditions for convergence of the Gaussian kernel-least-mean-square algorithm,” in Proc. Asilomar, Pacific Grove, CA, USA, Nov. 2012, pp. 1797–1801.
• [12] J. Chen, W. Gao, C. Richard, and J.-C. M. Bermudez, “Convergence analysis of kernel LMS algorithm with pre-tuned dictionary,” in Proc. IEEE ICASSP, 2014, pp. 7243–7247.
• [13] W. Liu, P. P. Pokharel, and J. C. Príncipe, “The kernel least-mean-square algorithm,” IEEE Trans. Signal Process., vol. 56, no. 2, pp. 543–554, Feb. 2008.
• [14] J. Platt, “A resource-allocating network for function interpolation,” Neural Computation, vol. 3, no. 2, pp. 213–225, 1991.
• [15] M. Yukawa and R. Ishii, “An efficient kernel adaptive filtering algorithm using hyperplane projection along affine subspace,” in Proc. EUSIPCO, 2012, pp. 2183–2187.
• [16] B. Chen, S. Zhao, P. Zhu, and J. C. Príncipe, “Mean square convergence analysis for kernel least mean square algorithm,” Signal Processing, vol. 92, pp. 2624–2632, 2012.
• [17] M. Takizawa and M. Yukawa, “An efficient sparse kernel adaptive filtering algorithm based on isomorphism between functional subspace and Euclidean space,” in Proc. IEEE ICASSP, 2014, pp. 4508–4512.
• [18] M. Takizawa and M. Yukawa, “Adaptive nonlinear estimation based on parallel projection along affine subspaces in reproducing kernel Hilbert space,” IEEE Trans. Signal Processing, submitted for publication.
• [19] A. H. Sayed, Adaptive Filters.   John Wiley & Sons, 2008.
• [20] H. Al-Duwaish, M. N. Karim, and V. Chandrasekar, “Use of multilayer feedforward neural networks in identification and control of Wiener model,” in Proc. IEEE Control Theory Appl., vol. 143, 1996, pp. 255–258.