Using reproducing kernels for nonlinear adaptive filtering tasks has been widely investigated [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. See, e.g., [12, 13, 14, 15, 16, 17, 18, 19, 20, 21] for the theory and applications of reproducing kernels. The author has proposed and studied multikernel adaptive filtering, using ‘multiple’ kernels [22, 23, 24]. Different approaches using multiple kernels have also been proposed subsequently. Pokharel et al. have proposed a mixture-kernel approach, and Gao et al. have proposed convex combinations of kernel adaptive filters. Tobar et al. have proposed a multikernel least mean square algorithm for vector-valued functions. Multikernel adaptive filtering is effective particularly in the following situations.
(b) An adequate kernel is unavailable because (i) the amount of prior information about the unknown system is limited, and/or (ii) the unknown system is time-varying and so is the adequate kernel for the system.
Situation (b) has mainly been assumed in [22, 23, 24]. The use of many, say fifty, kernels has been investigated, and kernel-dictionary joint-refinement techniques have been proposed based on double regularization with a pair of block norms [32, 33]. Our primary focus in the current study is on situation (a), in which the use of multiple kernels is expected to allow a compact representation of the unknown system.
Separately from the study of multikernel adaptive filtering, the author has proposed an efficient single-kernel adaptive filtering algorithm named hyperplane projection along affine subspace (HYPASS) [34, 35]. The HYPASS algorithm is a natural extension of the naive online $R_{\mathrm{reg}}$ minimization algorithm (NORMA) proposed by Kivinen et al. NORMA seeks to minimize a risk functional in terms of a nonlinear function by using the stochastic gradient descent method in a reproducing kernel Hilbert space (RKHS). This approach builds a dictionary (the set of basic nonlinear functions used to generate an estimate of the unknown system) from all the observed data, so the dictionary size grows with the number of data observed. As a remedy for this issue, a simple truncation rule has been introduced. It would be more realistic to build a dictionary selectively, based on some criterion that evaluates the novelty of a new datum; simple criteria include Platt’s criterion, the approximate linear dependency, and the coherence criterion. Introducing one of those criteria to NORMA raises another issue: if a new datum is not regarded as sufficiently novel and does not enter the dictionary, then the datum is simply discarded and makes no contribution to estimation, even though it can be informative enough to adjust the coefficients. Moreover, the coefficient of each dictionary element is updated only when that element enters the dictionary. The HYPASS algorithm systematically eliminates this limitation by forcing the update direction to lie in the dictionary subspace, i.e., the subspace spanned by the dictionary elements. It has been extended to a parallel-projection-based algorithm [37, 35]. HYPASS includes the method of Dodd et al. and the quantized kernel LMS (QKLMS) as particular cases. There is a similarity, and also a considerable dissimilarity, between HYPASS and the kernel normalized least mean square (KNLMS) algorithm proposed by Richard et al. Both algorithms share the philosophy of projecting the current estimate onto a hyperplane on which the instantaneous error is zero.
The difference is that HYPASS performs the projection in a functional space (i.e., in an RKHS) while KNLMS performs the projection in a Euclidean space of the coefficient vector (see [40, 35]). The multikernel adaptive filtering algorithms presented in [22, 23, 24] are basically extensions of the KNLMS algorithm. Our recent study, on the other hand, reveals significant advantages of HYPASS over KNLMS (cf. [34, 37, 35]). It is therefore of significant interest to see how the two different streams (multikernel adaptive filtering and HYPASS) meet.
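To make the dictionary-growth issue concrete, the NORMA-style recursion and the truncation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation: the squared loss, the scalar inputs, the unnormalized Gaussian kernel, and the names `norma_step`, `eta`, `lam`, and `tau` are all hypothetical choices made here.

```python
import math

def gauss(x, y, sigma=1.0):
    """Unnormalized Gaussian kernel on scalars (illustrative choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def norma_step(centers, coeffs, x, d, eta=0.5, lam=0.01, tau=50):
    """One stochastic-gradient step in the RKHS; every datum enters the dictionary."""
    # Evaluate the current estimate at the new input.
    y = sum(a * gauss(c, x) for c, a in zip(centers, coeffs))
    err = d - y
    # The regularization term shrinks all existing coefficients.
    coeffs = [(1.0 - eta * lam) * a for a in coeffs]
    # The new datum is appended with a coefficient proportional to the error.
    centers = centers + [x]
    coeffs = coeffs + [eta * err]
    # Truncation rule: keep only the tau most recent terms.
    return centers[-tau:], coeffs[-tau:], err

centers, coeffs, errors = [], [], []
for n in range(200):
    x = (n % 20) / 10.0 - 1.0   # inputs cycling over [-1.0, 0.9]
    d = math.sin(3.0 * x)       # toy unknown system (assumption)
    centers, coeffs, e = norma_step(centers, coeffs, x, d)
    errors.append(abs(e))
```

Without the final slicing step the two lists would grow by one entry per sample, which is the growth that selective dictionary construction is meant to avoid.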
In the present article, we propose an efficient multikernel adaptive filtering algorithm based on iterative orthogonal projections in a functional space, inheriting the spirit of HYPASS (see Fig. 1). A multikernel adaptive filter is characterized as a superposition of vectors lying in multiple RKHSs, namely as a vector in the sum space of multiple RKHSs. In general, a vector in the sum space can be decomposed, in infinitely many ways, into vectors in the multiple RKHSs, and this makes it difficult to compute the inner product in the sum space. To avoid the difficulty, we first consider the particular case that any pair of the multiple RKHSs intersects only trivially; i.e., any pair of the RKHSs shares only the zero vector. It covers the important case of using linear and Gaussian kernels simultaneously (see Corollary 2 in Section III-A). In this case, the decomposition is unique, which means that the sum space is the direct sum of the RKHSs, and the inner product can be computed easily in the sum space. This allows us to derive an efficient algorithm by reformulating the HYPASS algorithm in the sum space, which is known to be an RKHS (Theorem 1). Due to the uniqueness of decomposition, the sum space is isomorphic, as a Hilbert space, to the Cartesian product of the multiple RKHSs. This implies that the same derivation is possible through the Cartesian formulation instead of the sum-space formulation. This is the key to extending the algorithm to the general case.
Now, let us turn our attention to another important case of using multiple Gaussian kernels simultaneously. It is widely known that Gaussian RKHSs have a nested structure [41, 42, 43] (see also Theorem 5 in Section IV-B). This means that the multiple-Gaussian case is not covered by the first particular case. We therefore consider the general case in which some pair of the RKHSs may intersect non-trivially; i.e., some pair of the RKHSs may share common nonzero vectors. In this case, the inner product in the sum space has no closed-form expression, and hence it is generally intractable to derive an algorithm through the sum-space formulation. The inner product in the Cartesian product, on the other hand, is always expressed in a closed form. As a result, the algorithm formulated in the product space for the general case boils down to the same formula as obtained from the sum-space algorithm for the first case. The proposed algorithm is an iterative projection method in the Cartesian product and, only in the first particular case, it can be viewed as a sum-space projection method. The proposed algorithm is thus referred to as the Cartesian HYPASS (CHYPASS) algorithm. The computational complexity is low due to a selective updating technique, which is also employed in HYPASS. Numerical examples with toy models demonstrate that (i) CHYPASS with linear and Gaussian kernels is effective in the case that the unknown system contains linear and nonlinear components and (ii) CHYPASS with two Gaussian kernels is effective in the case that the unknown system contains high- and low-frequency components. We also apply CHYPASS to real-world data and show its efficacy over the KNLMS and HYPASS algorithms.
The rest of the paper is organized as follows. Section II presents the sum space model. In Section III, we derive the proposed algorithm through the sum-space formulation for the particular case mentioned above. We show that the use of linear and single-Gaussian kernels corresponds to the particular case, based on a theorem proved recently by Minh. In Section IV, we present the CHYPASS algorithm for the general case as well as its computational complexity for the two useful cases: the linear-Gaussian and two-Gaussian cases. Section V presents numerical examples, followed by concluding remarks in Section VI.
II Sum Space Model
II-A Basic Mathematics
We denote by $\mathbb{R}$ and $\mathbb{N}$ the sets of all real numbers and nonnegative integers, respectively. Vectors and matrices are denoted by lower-case and upper-case letters in bold-face, respectively. The identity matrix is denoted by $\boldsymbol{I}$ and the transposition of a vector/matrix is denoted by $(\cdot)^{\mathsf{T}}$. We denote the null (zero) function by $0$.
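Let $\mathcal{U} \subset \mathbb{R}^{L}$ and $\mathbb{R}$ be the input and output spaces, respectively. We consider a problem of estimating/tracking a nonlinear unknown function $\psi : \mathcal{U} \to \mathbb{R}$ by means of sequentially arriving input-output measurements. Our particular attention is focused on the case where $\psi$ contains several distinctive components; e.g., linear and nonlinear (but smooth) components, high- and low-frequency components, etc. To generate a minimal model to describe such a multicomponent function $\psi$, it would be natural to use multiple RKHSs $\mathcal{H}_1$, $\mathcal{H}_2$, $\cdots$, $\mathcal{H}_Q$ over $\mathcal{U}$; i.e., each of the $\mathcal{H}_q$s consists of functions mapping from $\mathcal{U}$ to $\mathbb{R}$. Here, $Q$ is the number of components of $\psi$ and each RKHS is associated with each component. The positive definite kernel associated with the $q$th RKHS $\mathcal{H}_q$, $q \in \mathcal{Q} := \{1, 2, \cdots, Q\}$, is denoted by $\kappa_q$, and the norm induced by the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_q}$ is denoted by $\|\cdot\|_{\mathcal{H}_q}$. The function $\psi$ is modeled as an element of the sum space
\[ \mathcal{H}^{+} := \mathcal{H}_1 + \mathcal{H}_2 + \cdots + \mathcal{H}_Q := \Big\{ \sum_{q \in \mathcal{Q}} f_q \;:\; f_q \in \mathcal{H}_q \Big\}. \]
Given an $f \in \mathcal{H}^{+}$, the decomposition $f = \sum_{q \in \mathcal{Q}} f_q$, $f_q \in \mathcal{H}_q$, is not necessarily unique in general. If such a decomposition is unique for any $f \in \mathcal{H}^{+}$, the sum space is specially called the direct sum of the $\mathcal{H}_q$s and is usually indicated as $\mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \cdots \oplus \mathcal{H}_Q$.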
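Theorem 1 (Reproducing kernel of sum space)
The sum space $\mathcal{H}^{+}$ equipped with the norm
\[ \|f\|_{\mathcal{H}^{+}}^{2} := \min \Big\{ \sum_{q \in \mathcal{Q}} \|f_q\|_{\mathcal{H}_q}^{2} \;:\; f = \sum_{q \in \mathcal{Q}} f_q,\ f_q \in \mathcal{H}_q \Big\}, \quad f \in \mathcal{H}^{+}, \tag{1} \]
is an RKHS with the reproducing kernel $\kappa := \sum_{q \in \mathcal{Q}} \kappa_q$.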
Proof: One can apply [12, Theorem in Part I Section 6] recursively to verify the claim.
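Theorem 2
Let $\kappa$ be the reproducing kernel of a real Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$. Then, given an arbitrary $w > 0$, $\kappa_w(\boldsymbol{x}, \boldsymbol{y}) := w \kappa(\boldsymbol{x}, \boldsymbol{y})$, $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{U}$, is the reproducing kernel of the RKHS $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}, w})$ with the inner product $\langle f, g \rangle_{\mathcal{H}, w} := w^{-1} \langle f, g \rangle_{\mathcal{H}}$, $f, g \in \mathcal{H}$.
Proof: It is clear that $\kappa_w(\cdot, \boldsymbol{x}) \in \mathcal{H}$ for any $\boldsymbol{x} \in \mathcal{U}$. Also, for any $f \in \mathcal{H}$ and $\boldsymbol{x} \in \mathcal{U}$, we have $\langle f, \kappa_w(\cdot, \boldsymbol{x}) \rangle_{\mathcal{H}, w} = w^{-1} \langle f, w \kappa(\cdot, \boldsymbol{x}) \rangle_{\mathcal{H}} = f(\boldsymbol{x})$.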
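Corollary 1 (Weighted norm and reproducing kernel)
Given any $w_q > 0$, $q \in \mathcal{Q}$, $\kappa_w := \sum_{q \in \mathcal{Q}} w_q \kappa_q$ is the reproducing kernel of the sum space $\mathcal{H}^{+}$ equipped with the weighted norm defined as $\|f\|_{\mathcal{H}^{+}, w}^{2} := \min \big\{ \sum_{q \in \mathcal{Q}} w_q^{-1} \|f_q\|_{\mathcal{H}_q}^{2} : f = \sum_{q \in \mathcal{Q}} f_q,\ f_q \in \mathcal{H}_q \big\}$, $f \in \mathcal{H}^{+}$.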
Without loss of generality, we let $w_q = 1$, $q \in \mathcal{Q}$, in the following. For some batch processing techniques such as kernel ridge regression, the sum space $\mathcal{H}^{+}$ is easy to handle; see Appendix A. For online/adaptive processing, on the other hand, it is hard to handle because the inner product in $\mathcal{H}^{+}$ has no closed-form expression in general. Fortunately, however, the inner product has a simple closed-form expression in the case of direct sum, allowing us to build an adaptive algorithm in $\mathcal{H}^{+}$ as shown in Section III.
II-B Multikernel Adaptive Filter
We denote by the dictionary constructed for the th kernel at time . The kernel-by-kernel dictionary subspaces are defined as , , , and their sum is the dictionary subspace of the sum space . The multikernel adaptive filter at time is given in the following form:
where . Thus, the dictionary contains the atoms (vectors) that form the next estimate . If some a priori information is available, we may accordingly define an initial dictionary and an initial filter . Otherwise, we simply let and . We assume that ‘active’ elements in remain in so that
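III Special Case: $\mathcal{H}_p \cap \mathcal{H}_q = \{0\}$ for any $p \neq q$
In this section, we focus on the particular case that $\mathcal{H}_p \cap \mathcal{H}_q = \{0\}$ for any $p \neq q$. This is the case of direct sum (in which any $f \in \mathcal{H}^{+}$ can be decomposed uniquely into $f = \sum_{q \in \mathcal{Q}} f_q$, $f_q \in \mathcal{H}_q$) and includes some useful examples as will be discussed precisely in Section III-A. Due to the unique decomposability, the norm in (1) is reduced to
\[ \|f\|_{\mathcal{H}^{+}}^{2} = \sum_{q \in \mathcal{Q}} \|f_q\|_{\mathcal{H}_q}^{2}, \]
and accordingly the inner product between $f = \sum_{q \in \mathcal{Q}} f_q$ and $g = \sum_{q \in \mathcal{Q}} g_q$ is given by
\[ \langle f, g \rangle_{\mathcal{H}^{+}} = \sum_{q \in \mathcal{Q}} \langle f_q, g_q \rangle_{\mathcal{H}_q}. \]
It is clear that, under the correspondence between $f = \sum_{q \in \mathcal{Q}} f_q$ and the $Q$-tuple $(f_q)_{q \in \mathcal{Q}}$, the sum space $\mathcal{H}^{+}$ is isomorphic to the Cartesian product
\[ \mathcal{H}^{\times} := \mathcal{H}_1 \times \mathcal{H}_2 \times \cdots \times \mathcal{H}_Q, \]
which is a real Hilbert space equipped with the inner product defined as
\[ \big\langle (f_q)_{q \in \mathcal{Q}}, (g_q)_{q \in \mathcal{Q}} \big\rangle_{\mathcal{H}^{\times}} := \sum_{q \in \mathcal{Q}} \langle f_q, g_q \rangle_{\mathcal{H}_q}. \]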
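We present three celebrated examples of positive definite kernels below.
Example 1 (Positive definite kernels)
Linear kernel: Given $c \geq 0$, $\kappa_{\mathrm{L}}(\boldsymbol{x}, \boldsymbol{y}) := \boldsymbol{x}^{\mathsf{T}} \boldsymbol{y} + c$, $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{U}$.
Polynomial kernel: Given $c \geq 0$ and a positive integer $m$, $\kappa_{\mathrm{P}}(\boldsymbol{x}, \boldsymbol{y}) := (\boldsymbol{x}^{\mathsf{T}} \boldsymbol{y} + c)^{m}$, $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{U}$.
Gaussian kernel (normalized): Given $\sigma > 0$, $\kappa_{\mathrm{G}}(\boldsymbol{x}, \boldsymbol{y}) := (2 \pi \sigma^{2})^{-L/2} \exp\big( -\|\boldsymbol{x} - \boldsymbol{y}\|^{2} / (2 \sigma^{2}) \big)$, $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{U}$.
For the linear kernel, $c := 1$ is a typical choice. If one knows that the linear component of $\psi$ is zero-passing, one can simply let $c := 0$. The following theorem was shown by Minh in 2010.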
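The three kernels of Example 1 can be written down directly. The sketch below is illustrative only: the function names, the use of plain Python lists for input vectors, and the density-style normalization factor of the Gaussian kernel are assumptions made here, since the extracted text does not preserve the exact formulas.

```python
import math

def linear_kernel(x, y, c=1.0):
    """Inner product plus an offset c >= 0 (c = 0 for a zero-passing linear part)."""
    return sum(a * b for a, b in zip(x, y)) + c

def polynomial_kernel(x, y, c=1.0, m=2):
    """Linear kernel raised to a positive integer degree m."""
    return linear_kernel(x, y, c) ** m

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel with an assumed (2*pi*sigma^2)^(-L/2) normalization."""
    L = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    norm = (2.0 * math.pi * sigma ** 2) ** (-L / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * sigma ** 2))

k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])        # 1*3 + 2*4 + 1
k_pol = polynomial_kernel([1.0, 2.0], [3.0, 4.0])    # (1*3 + 2*4 + 1)^2
k_gau = gaussian_kernel([0.0], [0.0], sigma=1.0)     # peak value for L = 1
```

With `m=1` the polynomial kernel coincides with the linear kernel, which is the relation used when the polynomial-Gaussian result is specialized to the linear-Gaussian case.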
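Theorem 3
Let $\mathcal{U} \subset \mathbb{R}^{L}$ be any set with nonempty interior and $\mathcal{H}_{\mathrm{G}}$ the RKHS associated with a Gaussian kernel $\kappa_{\mathrm{G}}$ for an arbitrary $\sigma > 0$ together with the input space $\mathcal{U}$. Then, $\mathcal{H}_{\mathrm{G}}$ does not contain any polynomial on $\mathcal{U}$, including the nonzero constant function.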
The following corollary is obtained as a direct consequence of Theorem 3.
Corollary 2 (Polynomial and Gaussian RKHSs)
Assume that the input space has nonempty interior. Given arbitrary , , and , denote by and the RKHSs associated respectively with the polynomial and Gaussian kernels and . Then,
In particular, (10) for implies that
We mention that a (manually-tuned) convex combination of linear and Gaussian kernels has been used in  within a single-kernel adaptive filtering framework for nonlinear acoustic echo cancellation. The case of linear plus Gaussian kernels is of particular interest when the unknown function contains linear and nonlinear (smooth) components [28, 29, 30]. (Our recent work in  is devoted to this important case.) We will present a dictionary design for this case in the following subsection.
III-B Dictionary Design: Linear Plus Gaussian Case
The dictionaries are designed on a kernel-by-kernel basis. With Corollary 2 in mind, we present a possible dictionary design for the case of with for and , assuming that the input space has nonempty interior. Due to the interior assumption on , it is seen that the dimension of is . It is clear that and , where is the unit vector having one at the th entry and zeros elsewhere. Based on this observation, one can see that
gives an orthonormal basis of the dimensional space . We thus let for all , which implies that and hence for any . Note that, in the case of , the dimension of is and one can remove from the dictionary .
On the other hand, the dictionary for the Gaussian kernel needs to be constructed in an online fashion. In general, one may consider growing and pruning strategies to construct an adequate dictionary. A growing strategy is given as follows: (i) start with , and (ii) add a new candidate into the dictionary at each time only when it is sufficiently novel. In this case, for some . As a possible novelty criterion for the present example, we use Platt’s criterion with a slight modification: is regarded as novel if for some and if for some . Here, given an RKHS with its associated kernel and a dictionary with an index set , the coherence is defined as . Pruning can be done based, e.g., on regularization; see, e.g., [24, 48, 49, 40].
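The two-part novelty test described above (small coherence together with a large error) can be sketched as follows. The thresholds `delta` and `eps`, the absolute squared-error form of the second test, and all function names are assumptions made for illustration; the exact inequalities are not preserved in the extracted text. The coherence is taken in its usual normalized form, the largest value of |κ(u, u_j)| / sqrt(κ(u, u) κ(u_j, u_j)) over the dictionary.

```python
import math

def gauss(x, y, sigma):
    """Gaussian kernel on real vectors (any constant scale cancels in the coherence)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def coherence(u, centers, sigma):
    """Largest normalized kernel value between k(., u) and the dictionary atoms."""
    if not centers:
        return 0.0
    kuu = gauss(u, u, sigma)
    return max(abs(gauss(u, c, sigma)) / math.sqrt(kuu * gauss(c, c, sigma))
               for c in centers)

def is_novel(u, d, estimate, centers, sigma, delta=0.8, eps=1e-2):
    """Accept u only if it is weakly coherent AND the current error is large."""
    small_coherence = coherence(u, centers, sigma) <= delta
    large_error = (d - estimate) ** 2 > eps
    return small_coherence and large_error

centers = [[0.0, 0.0]]
near = is_novel([0.1, 0.0], d=1.0, estimate=0.0, centers=centers, sigma=1.0)
far = is_novel([3.0, 0.0], d=1.0, estimate=0.0, centers=centers, sigma=1.0)
far_but_accurate = is_novel([3.0, 0.0], d=1.0, estimate=1.0, centers=centers, sigma=1.0)
```

In this toy call the nearby point is rejected because it is almost collinear with an existing atom, while the distant point is accepted only as long as the current estimate is still inaccurate there.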
III-C Adaptive Learning Algorithm in Sum Space
At every time instant , a new measurement and arrives, and is updated to based on the new measurement. A question is how to exploit the new measurement for obtaining a better estimator within the subspace . A simple strategy accepted widely in adaptive filtering is the way of the normalized least mean square (NLMS) algorithm [50, 51], projecting the current estimate onto a zero-instantaneous-error hyperplane in a relaxed sense. See [52, 53, 9] and the references therein for more about the projection-based adaptive methods. As we assume that the search space is restricted to , we consider the following hyperplane in :
Note here that can also be represented as
where is a hyperplane in the whole space . The update equation is given by
Theorem 4 (Orthogonal projection in sum space)
Let , , be a RKHS over with its reproducing kernel and define the sum space with its kernel . Let be a subspace of and define its sum . Also define for some and . Then, the following hold.
For any ,
Assume that for any . Then, for any with ,
Lemma 1
Let denote a RKHS associated with an input space and a positive definite kernel . Let for , , and . Then, given any ,
where the coefficient vector is characterized as a solution of the following normal equation:
where is the kernel (or Gram) matrix whose entry is and .
If for some , we obtain a trivial solution and for which yields .
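As a concrete instance of the normal equation in Lemma 1, the sketch below projects a single kernel function κ(·, u) onto the span of a small Gaussian dictionary. It assumes scalar inputs, an unnormalized Gaussian kernel, and a nonsingular kernel matrix; the names `solve` and `project_kernel_atom` are hypothetical. The right-hand side uses the reproducing property, so its entries are simply κ(u_j, u).

```python
import math

def gauss(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (A assumed nonsingular)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def project_kernel_atom(u, centers, sigma=1.0):
    """Coefficients and squared norm of the projection of k(., u) onto span{k(., u_j)}."""
    K = [[gauss(a, b, sigma) for b in centers] for a in centers]   # kernel (Gram) matrix
    rhs = [gauss(c, u, sigma) for c in centers]                    # <k(., u), k(., u_j)>
    alpha = solve(K, rhs)
    norm_sq = sum(a * r for a, r in zip(alpha, rhs))               # alpha^T K alpha
    return alpha, norm_sq

centers = [-1.0, 0.0, 1.0]
alpha_in, norm_in = project_kernel_atom(0.0, centers)    # u is already a dictionary point
alpha_out, norm_out = project_kernel_atom(0.5, centers)  # u lies between two atoms
```

When u coincides with a dictionary point, the solution reduces to a unit coefficient on that atom, which matches the trivial case noted above.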
III-D The Sum-space HYPASS Algorithm: Complexity Issue and Practical Remedy
Theorem 4 and Lemma 1 indicate that the computation of in (14) would involve the inversion of the kernel matrix (if invertible) for each kernel as well as the multiplication of the inverse matrix by a vector, where the size of the kernel matrix and the vector is determined by the dictionary size. Note here that this computation is unnecessary when the dictionary is orthonormal such as in the case of linear kernel (see Section III-A). In the case of Gaussian kernels, the inversion needs to be computed and a practical remedy to reduce the complexity is the selective update which is described below.
Let be a selected subset of the dictionary for the th kernel . For instance, in the case of and (the case of linear and Gaussian kernels), one can simply let and design by selecting a few s in that are most coherent to ; i.e., choose such that is the largest [34, 37, 35]. In other words, we choose s such that is the smallest (or the neighbors of are collected in short). Geometrically, the maximal coherence implies the least angle between and which gives the direction of update in the exact form of (14); see [37, 35]. This means that the selected approximates the exact direction best in the Gaussian dictionary . The coherence-based selection is therefore reasonable, as justified by numerical examples in Section V.
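The coherence-based selection described above amounts to ranking the dictionary atoms by their normalized kernel value with the new input and keeping the top few. The following is a minimal sketch under assumptions: an unnormalized Gaussian kernel, list-valued inputs, and the hypothetical name `select_most_coherent` with subset size `s`.

```python
import math

def gauss(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def select_most_coherent(u, centers, s, sigma=1.0):
    """Indices of the s dictionary atoms most coherent to k(., u)."""
    kuu = gauss(u, u, sigma)

    def score(j):
        c = centers[j]
        return abs(gauss(u, c, sigma)) / math.sqrt(kuu * gauss(c, c, sigma))

    ranked = sorted(range(len(centers)), key=score, reverse=True)
    return ranked[:s]

centers = [[0.0], [1.0], [2.0], [5.0]]
selected = select_most_coherent([1.2], centers, s=2)
```

For a Gaussian kernel this ranking is the same as a nearest-neighbor search in the input space, which is why the selected atoms are the neighbors of the current input.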
Now, we define the subspace spanned by each selected dictionary as
and its sum . To update only the coefficient(s) of the selected dictionary element(s) and keep the other coefficients fixed, the next estimate is restricted to , rather than to (cf. (13)). Accordingly, the update equation in (14) is modified into
In the trivial case that for all , (20) is reduced to (14). Indeed, the algorithm in (20) is a sum-space extension of the HYPASS algorithm proposed in . The following proposition can be used, together with Theorem 4 and Lemma 1, to compute .
Proposition 1
For any and a subspace of , let and for some and . Then, for any ,
The computational complexity of the proposed algorithm under the selective updating strategy stated above will be given in Section IV-E.
IV General Case
We consider the general case in which it may happen that for some . In this case, given an , decomposition , , is not necessarily unique, and thus Theorem 4.2 does not generally hold anymore, although Theorem 4.1 and Proposition 1 still hold. This implies that in (22) cannot be obtained simply in general. In the following, we show that this issue can be overcome by considering the Cartesian product rather than sticking to the sum space .
We show below, in a slightly general form, the known fact that the class of Gaussian kernels has a nested structure.
Theorem 5
Let be an arbitrary subset and and Gaussian kernels for and . Then, the associated RKHSs and satisfy the following.
for any .
Proof: Let , and define . Then, its Fourier transform is given by for . The function is clearly bounded and also satisfies because . Hence, Bochner’s theorem ensures that is a positive definite kernel on , and so on as well by the definition of positive definite kernels. Applying [12, Theorem I in Part I Section 7], we obtain and , which verifies the case of . This is generalized to any because one can verify in light of Theorem 2 that for any (). We remark here that the two RKHSs (associated with ) and (associated with ) share the common elements (this is what is meant by above) but are equipped with different inner products when .
There exist several articles that show some results related to Theorem 5. For instance, a special case of Theorem 5 for and can be found in . The proof in  is based on a characterization of a Gaussian RKHS in terms of Fourier transform. It is straightforward to generalize it to any subset with nonempty interior by exploiting [44, Theorem 1] which gives another characterization of a Gaussian RKHS. Note that Theorem 5 holds with no assumption on the existence of interior of . To verify Theorem 5, one can also follow the way in  which proves the case of and by using another theorem in place of Bochner’s theorem. The inclusion operator “id” appearing in  would imply Theorem 5.1, and a result related to a special case of Theorem 5.2 for can also be found in [42, Corollary 6].
IV-B Dictionary Design: Two Gaussian Case
We present our dictionary selection strategy for the case of two Gaussian kernels and for and . In analogy with Section III-A, we define the dictionary for each kernel as for , where . For the kernel , we simply adopt the coherence criterion: is regarded as novel if for some . The kernel is complementary in the sense that it only needs to be used in those regions (of the input space ) where the unknown system contains high-frequency components which make the ‘wider’ kernel underfit the system. To do so, a new element enters the dictionary only when all of the following three conditions are satisfied: (i) does not enter the dictionary (the no-simultaneous-entrance condition), (ii) for some (the small-coherence condition), and (iii) for some (the large-error condition).
IV-C The Cartesian HYPASS Algorithm
By virtue of the isomorphism between the sum space and the product space in the case of , , the arguments as in Section III can be translated into the product space . (See  for the direct derivation in the product space.) Fortunately, the translated arguments can be applied to the general case, including the case that for some . This is because, even when can be decomposed in two different ways like , the two functions and are distinguished in the product space as . Therefore, the product-space formulation delivers the following algorithm for the general case:
which is seemingly identical to (20) under Proposition 1. We emphasize here that (20) can be written in the form of (23) only in the case of , . Namely, in the case of , , (23) can be regarded as a hyperplane projection algorithm in the product space , but not in the sum space . We call the general algorithm in (23) the Cartesian HYPASS (CHYPASS) algorithm, since it is a product-space extension of the HYPASS algorithm. In the case of two Gaussian kernels, the coherence-based selective updating strategy discussed in Section III-D is applied to each Gaussian kernel.
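A simplified sketch of a product-space hyperplane projection step may help fix ideas. It is not the paper's exact update: it assumes scalar inputs, fixed non-empty dictionaries, Gaussian kernels only, and a single selected atom per kernel, so that each kernel-wise projection has the closed form κ_q(u_j, u) / κ_q(u_j, u_j) and no matrix inversion is needed. The names `make_gauss`, `evaluate`, `chypass_step`, and `step` are hypothetical.

```python
import math

def make_gauss(sigma):
    def k(x, y):
        return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))
    return k

def evaluate(kernels, centers, coeffs, u):
    """Filter output: sum over kernels and dictionary atoms."""
    return sum(h * k(c, u)
               for k, cs, hs in zip(kernels, centers, coeffs)
               for c, h in zip(cs, hs))

def chypass_step(kernels, centers, coeffs, u, d, step=1.0):
    """Relaxed projection onto the zero-error hyperplane, one atom per kernel."""
    err = d - evaluate(kernels, centers, coeffs, u)
    picks, alphas, denom = [], [], 0.0
    for k, cs in zip(kernels, centers):
        # Most coherent atom of this kernel's dictionary.
        j = max(range(len(cs)),
                key=lambda i: abs(k(cs[i], u)) / math.sqrt(k(cs[i], cs[i])))
        alpha = k(cs[j], u) / k(cs[j], cs[j])   # projection coefficient
        picks.append(j)
        alphas.append(alpha)
        denom += alpha * k(cs[j], u)            # squared norm of the projection
    if denom > 0.0:
        for hs, j, alpha in zip(coeffs, picks, alphas):
            hs[j] += step * err * alpha / denom
    return err

kernels = [make_gauss(1.0), make_gauss(0.2)]
centers = [[-1.0, 0.0, 1.0], [-0.5, 0.5]]
coeffs = [[0.0, 0.0, 0.0], [0.0, 0.0]]
err = chypass_step(kernels, centers, coeffs, u=0.3, d=2.0)
after = evaluate(kernels, centers, coeffs, 0.3)
```

With `step=1.0` the updated filter reproduces the desired output at the current input exactly, and only one coefficient per kernel changes, which is the effect of the selective update.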
IV-D Alternative Algorithm: Parameter-space Approach
We present a simple alternative to the CHYPASS algorithm. Let us parametrize by
where . Then, can be expressed as
by defining the vectors and appropriately that consist of s and s for , respectively, where . Concatenating vectors yields and with . Then, is simply expressed by
One can therefore build an algorithm that projects the current coefficient vector onto the following zero-instantaneous-error hyperplane in the Euclidean space:
This is the idea of the alternative algorithm. The next coefficient vector containing s for is computed as
where . At the next iteration, if (), is given by itself. Otherwise, is obtained with and for . We call the alternative algorithm the multikernel NLMS (MKNLMS) since it is essentially the same as the algorithm presented in [24, Section III.A] except that the dictionary is designed individually for each kernel. MKNLMS with two Gaussian kernels with individual dictionaries has been studied earlier in .
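The parameter-space alternative is a normalized LMS step on the concatenated coefficient vector. The sketch below shows that generic NLMS form; the step size `step`, the small regularizer `reg`, and the function name `mknlms_step` are assumptions made here and are not taken from the source.

```python
def mknlms_step(h, k, d, step=0.5, reg=1e-6):
    """Relaxed projection of h onto the hyperplane {h : k^T h = d} in Euclidean space.

    h: concatenated coefficients of all kernels,
    k: concatenated kernel values at the current input, in the same order.
    """
    err = d - sum(hi * ki for hi, ki in zip(h, k))
    scale = step * err / (sum(ki * ki for ki in k) + reg)
    return [hi + scale * ki for hi, ki in zip(h, k)], err

h_new, err = mknlms_step([0.0, 0.0], [1.0, 2.0], d=5.0, step=1.0, reg=0.0)
output_after = sum(hi * ki for hi, ki in zip(h_new, [1.0, 2.0]))
```

When a new atom enters one of the dictionaries, one way to keep the vectors aligned is to append a zero coefficient for it before the next step.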
IV-E Computational Complexity
The computational complexity is discussed in terms of the number of multiplications required for each update, including the dictionary update, for CHYPASS and MKNLMS in the linear-Gaussian and two-Gaussian cases, respectively. The complexity is summarized in Table I.
IV-E1 Linear-Gaussian case
The complexity of CHYPASS is , where is the size of the Gaussian dictionary and is the size of its selected subset . Here, denotes the cardinality of a set . The term