Kernels and kernel methods have become standard tools in machine learning. The essential idea of kernel methods is to transform the input data into a high-dimensional feature space while preserving certain convenient computational properties, which is the so-called kernel trick. The literature has witnessed a growth of kernel-based learning schemes and a flourishing of studies, e.g. ScSm01 ; SuVaDe02 ; StCh08 . Kernel methods are black boxes in that the feature maps are not interpretable and their explicit formulas need not be known, which actually brings computational convenience. However, they treat all features the same, which is usually not in line with practice.
The great empirical success of deep learning in the past decade is attributed to the feature learning/feature engineering brought by the inherent structure of multiple hidden layers Bengio09a ; BeCoVi13a . However, feature engineering is by no means exclusive to deep learning, and can also play a role in the context of kernel methods. In fact, it has been shown that deep learning models are closely related to kernel methods and that many of them can be interpreted by using kernel machines ChSa09a ; AnRoTaPo15a . Besides, several deep kernel methods have also been proposed by introducing deeper kernels ChSa09a ; AnRoTaPo15a .
Noting the need for feature representation in the context of kernel methods, the present study investigates learning problems with a special class of kernel functions, namely anisotropic kernels. Anisotropic kernels generalize the vanilla kernel class, the isotropic kernels. Unlike the isotropic ones, anisotropic kernels have shape parameters that entail feature-wise adjustments. Though this may involve more hyper-parameters, it may also improve the empirical performance. The validity of anisotropic kernels has been demonstrated in a wide range of applications, especially image processing problems, such as lung pattern classification ShSo04a and forecasting GaOrSaCaSaPo13a via SVMs with anisotropic Gaussian RBF kernels. The literature has focused on the feature selection capacity that anisotropic kernels bring. For example, Allen13a presents an algorithm that minimizes a feature-regularized loss function and thus achieves automatic feature selection based on feature-weighted kernels, LiYaXi05a proposes the so-called Feature Vector Machine, which can be easily extended to feature selection with non-linear models by introducing kernels defined on feature vectors, and MaWeBa11a introduces the kernel-penalized SVM that simultaneously selects relevant features during classifier construction by penalizing each feature's use in the dual formulation of SVMs. To cope with the high-dimensional nature of the input space, BrRoCrSe07a proposes a density estimator with anisotropic kernels which are especially appropriate when the density concentrates on a low-dimensional subspace. Finally, ShZh12a proposes a noise-robust edge detector which combines a small-scaled isotropic Gaussian kernel and large-scaled anisotropic Gaussian kernels to obtain edge maps of images.
Fully aware of the significance of the anisotropic kernels, we address the learning problem of non-parametric least squares regression with anisotropic Gaussian RBF kernel-based support vector machines. The learning goal is to find a function that is a good estimate of the unknown conditional mean , given i.i.d. observations of input/output pairs drawn from an unknown distribution on , where . To be specific, we ought to find a function such that, for the loss function , the risk
should be close to the Bayes risk with respect to and . A Bayes decision function is a function satisfying .
In this paper, we consider anisotropic kernel-based regularized empirical risk minimizers, namely support vector machines (SVMs) with anisotropic kernel , which solve the regularized problem
Here, is a fixed real number and is a reproducing kernel Hilbert space (RKHS) of over .
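For orientation, the objects described in words above can be written out in the standard notation of StCh08 . The symbols used below ($X$, $Y$, $D = ((x_1, y_1), \dots, (x_n, y_n))$, $L$, $\mathcal{R}_{L,P}$, $f_{D,\lambda,\gamma}$, $H_\gamma$) are assumptions made for this sketch, since the displayed formulas are not reproduced in this text; it shows the usual least squares SVM setting and is not necessarily the exact display of the paper.

```latex
% Assumed standard notation (as in StCh08), not copied from the paper's displays.
L(y, t) = (y - t)^2, \qquad
\mathcal{R}_{L,P}(f) = \int_{X \times Y} L\bigl(y, f(x)\bigr)\, dP(x, y), \qquad
\mathcal{R}_{L,P}^{*} = \inf_{f \text{ measurable}} \mathcal{R}_{L,P}(f),
\\[4pt]
f_{D,\lambda,\gamma} \in \operatorname*{arg\,min}_{f \in H_\gamma}
  \; \lambda \|f\|_{H_\gamma}^{2}
  + \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr).
```

In this notation, the learning goal is that the excess risk $\mathcal{R}_{L,P}(f_{D,\lambda,\gamma}) - \mathcal{R}_{L,P}^{*}$ becomes small with high probability as $n$ grows.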
The main contribution of this paper lies in the establishment of almost optimal learning rates for nonparametric regression with anisotropic Gaussian SVMs, provided that the target functions are contained in some anisotropic Besov spaces. Recall that the overall smoothness for the commonly used isotropic kernels depends on the worst smoothness over all dimensions. This leads to a dilemma: poor smoothness along a single dimension may lead to unsatisfactory convergence rates of the decision functions, even when the smoothness along the other dimensions is fairly good. Unlike the isotropic ones, the anisotropic kernels are more resistant to poor smoothness along certain dimensions. To be specific, the overall smoothness is embodied by the effective smoothness, or the exponent of global smoothness, whose reciprocal is the mean of the reciprocals of the smoothness of all dimensions. In this manner, poor smoothness along certain dimensions will not be able to jeopardize the overall good smoothness. Based on the effective smoothness, we derive almost optimal learning rates which not only match the theoretical optimal ones for anisotropic kernels up to some logarithmic factor, but are also in line with the published optimal learning rates derived by different algorithms. Moreover, when we embed our results in the isotropic case, where all shape parameters are taken to be the same, our learning rates still coincide with the theoretical optimal ones for isotropic kernels up to a logarithmic factor and are even better than the existing rates obtained via other methods.
Moreover, even though the literature mainly concentrates on isotropic classes, the assumption of isotropy might result in a loss of efficiency if the regression function actually belongs to an anisotropic class. In fact, this inefficiency becomes worse as the dimension gets higher. Therefore, the assumption of anisotropy might serve as a more appropriate substitute. Besides, the anisotropy assumption also shows its advantages for sparse regression functions, where the learning rates automatically depend on a small subset of the coordinates owing to the nature of the effective smoothness. This phenomenon is also supported by the theoretical analysis in this paper, with even faster learning rates established.
The paper is organized as follows: Section 2 summarizes notations and preliminaries. We present the main results of the almost optimal learning rates of the anisotropic kernels for regression in Section 3. The error analysis is clearly illustrated in Section 4. Detailed proofs of Sections 3 and 4 are placed in Section 5, for the sake of clarity. Finally, we conclude this paper in Section 6.
Throughout this paper, we assume that is a non-empty, open and bounded set such that its boundary has Lebesgue measure , for some and is a probability measure on such that on is absolutely continuous with respect to the Lebesgue measure on . Furthermore, we assume that the corresponding density of is bounded away from and . In what follows, we denote the closed unit ball of a Banach space by , the -dimensional Euclidean space , we write . For , is the greatest integer smaller than or equal to and is the smallest integer greater than or equal to . The tensor product of two functions is defined by , . The -fold tensor product can be defined analogously.
2.1 Anisotropic Gaussian kernels and their RKHSs
The anisotropic kernels can be defined as follows:
Definition 1 (Anisotropic kernel).
A function is called an anisotropic kernel on with the shape parameter if there exists a Hilbert space and a map such that for all we have
where and . is called a feature map and is called a feature space of .
A careful look at the definition of isotropic kernels, see e.g. Definition 4.1 in StCh08 , shows that they can be regarded as anisotropic kernels whose shape parameter is an all-one vector. One commonly used anisotropic kernel is the anisotropic Gaussian kernel. With the shape parameter , it takes the form:
where is called the multi-bandwidth of the anisotropic Gaussian kernels .
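A minimal Python sketch of such a kernel follows. It assumes the convention k(x, x') = exp(-sum_i (x_i - x'_i)^2 / gamma_i^2), which extends the Gaussian RBF kernel of StCh08 coordinate-wise; the exact scaling used in the paper's display is not reproduced in this text, and the function names are hypothetical.

```python
import math
from typing import Sequence


def anisotropic_gaussian_kernel(
    x: Sequence[float], x_prime: Sequence[float], gamma: Sequence[float]
) -> float:
    """Assumed form: exp(-sum_i ((x_i - x'_i) / gamma_i)^2).

    gamma is the multi-bandwidth: one positive bandwidth per coordinate.
    """
    if not (len(x) == len(x_prime) == len(gamma)):
        raise ValueError("x, x_prime and gamma must have the same length")
    if any(g <= 0 for g in gamma):
        raise ValueError("all bandwidths must be positive")
    return math.exp(-sum(((a - b) / g) ** 2 for a, b, g in zip(x, x_prime, gamma)))


def isotropic_gaussian_kernel(
    x: Sequence[float], x_prime: Sequence[float], gamma: float
) -> float:
    """Isotropic Gaussian RBF kernel exp(-||x - x'||^2 / gamma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-squared_distance / gamma ** 2)
```

With equal bandwidths the anisotropic kernel reduces to the isotropic one, while a large bandwidth in one coordinate makes the kernel nearly insensitive to differences in that coordinate.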
Next, we determine an explicit formula for the RKHSs of anisotropic Gaussian RBF kernels. To this end, let us fix where and . For a given function we define
where stands for the Lebesgue measure on . Furthermore, we write
Obviously, is a function space with Hilbert norm . The following theorem shows that is the RKHS of the anisotropic Gaussian RBF kernel .
Theorem 1 (RKHS of the anisotropic Gaussian RBF).
Let where and . Then is an RKHS and is its reproducing kernel. Furthermore, for , let be defined by
Then the system of functions defined by
is an orthonormal basis of .
The above result on the orthonormal basis (ONB) of can be proved in the same way as Theorem 4.38 in StCh08 . Therefore, we omit the proof here. Note that the reproducing kernel of an RKHS is determined by an arbitrary ONB of this RKHS. Therefore, is the reproducing kernel of the RKHS , which turns out to be the product function space of the RKHSs , that is, , where is the RKHS of the one-dimensional Gaussian kernel.
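The product structure mentioned above can be checked numerically at the level of kernels: an anisotropic Gaussian kernel factorizes into one-dimensional Gaussian kernels, one per coordinate. The sketch below assumes the convention exp(-((s - t) / bandwidth)^2) for the one-dimensional kernel; the function names are hypothetical.

```python
import math


def gaussian_1d(s: float, t: float, bandwidth: float) -> float:
    """One-dimensional Gaussian kernel under the assumed convention."""
    return math.exp(-((s - t) / bandwidth) ** 2)


def product_of_1d_kernels(x, x_prime, gamma) -> float:
    """Tensor-product kernel: product of one-dimensional kernels."""
    value = 1.0
    for s, t, g in zip(x, x_prime, gamma):
        value *= gaussian_1d(s, t, g)
    return value


def direct_anisotropic_kernel(x, x_prime, gamma) -> float:
    """Anisotropic Gaussian kernel evaluated with a single exponential."""
    return math.exp(-sum(((s - t) / g) ** 2 for s, t, g in zip(x, x_prime, gamma)))
```

Both functions agree up to floating-point rounding, because the exponential of a sum is the product of the exponentials.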
2.2 Anisotropic Besov spaces
Let us begin by introducing some function spaces we need. Sobolev spaces AdFo03 ; Tr83 are one type of subspaces of . Let be the -th weak derivative for a multi-index with . Then, for an integer , , and a measure , the Sobolev space of order with respect to is defined by
It is the space of all functions in whose weak derivatives up to order exist and are contained in . The Sobolev norm AdFo03 of the Sobolev space is given by
In addition, we write , and define for the Lebesgue measure on .
Another typical class of subspaces of with a fine scale of smoothness, commonly considered in approximation theory, is that of anisotropic Besov spaces. In order to describe these function spaces clearly, we need a device to measure the smoothness of a function, namely the modulus of smoothness.
Let be a subset with non-empty interior, be an arbitrary measure on with marginal measure on , . For a function with for some , , and , let , denote the univariate function
Now, we give the formal definition of the modulus of smoothness.
Definition 2 (Modulus of smoothness).
Let be a subset with non-empty interior, be an arbitrary measure on , and be a function with for some . For , the -th modulus of smoothness of in the direction of the variable is defined by
where the -th difference of in the direction of the variable , , denoted as , is defined by
for , and .
To elucidate the idea of the modulus of smoothness, let us consider the case where , . Then, we obtain
if the derivative of exists in . As a result, equals the secant’s slope and is bounded, if is differentiable at . Analogously, is bounded, if, e.g. second order derivatives exist.
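The directional differences behind the modulus of smoothness can be illustrated numerically. The sketch below assumes the usual forward-difference formula, sum over j = 0..r of binom(r, j) (-1)^(r-j) f(x + j h e_i), and approximates a sup-norm modulus on a finite grid. It ignores the restriction that all shifted points must stay inside the domain, and all names are hypothetical.

```python
import math


def directional_difference(f, x, h, r, i):
    """r-th forward difference of f at x with step h along coordinate i."""
    total = 0.0
    for j in range(r + 1):
        shifted = list(x)
        shifted[i] += j * h  # move only along the i-th coordinate
        total += math.comb(r, j) * (-1) ** (r - j) * f(shifted)
    return total


def sup_modulus_on_grid(f, grid, t, r, i, n_steps=20):
    """Crude approximation of a sup-norm modulus of smoothness.

    Takes the maximum of |r-th difference| over steps h = t*k/n_steps
    (k = 1..n_steps) and over the finitely many points in grid.
    """
    best = 0.0
    for k in range(1, n_steps + 1):
        h = t * k / n_steps
        for x in grid:
            best = max(best, abs(directional_difference(f, x, h, r, i)))
    return best
```

For f(x) = x_0^2 + 3 x_1, the second difference along coordinate 0 equals 2 h^2 at every point, and along coordinate 1 it vanishes, which reflects the different smoothness behaviour per direction.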
It follows immediately from JoSc76 that for all , all , and all , the modulus of smoothness with respect to in the direction of the variable satisfies
The modulus of smoothness (6) in the direction of the variable can be used to define the scale of Besov spaces in the direction of the variable . For , , , , and an arbitrary measure , the Besov space in the direction of the variable is defined by
where the seminorm is defined by
In both cases, the norm of can be defined by
In addition, for , we often write
and call the generalized Lipschitz space of order . Finally, if is the Lebesgue measure on , we write .
Definition 3 (Anisotropic Besov space).
For , , and , the anisotropic Besov space is defined by
In the case and , , we use the notation
3 Main results
In this section, we present our main results: optimal learning rates for LS-SVMs using anisotropic Gaussian kernels for the non-parametric regression problem based on the least squares loss defined by .
3.1 Convergence rates
It is well known that, for the least squares loss, the function defined by , , is the only function for which the Bayes risk is attained. Furthermore, some simple and well-known transformations show
Note that, for all and , the least squares loss can be clipped at in the sense of Definition 2.22 in StCh08 . To be precise, we denote the clipped value of some by , that is if , if , and if . It can be easily verified that the risks of the least squares loss satisfy , and therefore holds for all .
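The effect of clipping can be checked with a few lines of Python. The sketch assumes that the labels lie in [-M, M] and that clipping projects a prediction onto that interval; the names `clip` and `squared_loss` are hypothetical.

```python
def clip(t: float, M: float) -> float:
    """Project a prediction t onto the interval [-M, M]."""
    return max(-M, min(M, t))


def squared_loss(y: float, t: float) -> float:
    """Least squares loss (y - t)^2."""
    return (y - t) ** 2
```

For any label y in [-M, M], replacing a prediction t by its clipped value never increases the squared loss, which is why the risk of the clipped decision function is at most the risk of the unclipped one.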
Let for , and be a distribution on such that is a bounded domain with . Furthermore, assume that is a distribution on such that, for all , the marginal distribution has a Lebesgue density for some . Moreover, let be a Bayes decision function such that as well as for and with . Let
Then, for all and all , the SVM using the anisotropic RKHS and the least squares loss with
with probability not less than . Here, and , , are user-specified constants and is a constant independent of .
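To make the estimator in the theorem concrete, here is a small pure-Python sketch of a clipped least squares SVM with an anisotropic Gaussian kernel. It relies on the representer theorem, under which the minimizer of lambda * ||f||^2 + (1/n) * sum_i (y_i - f(x_i))^2 can be written as f = sum_j alpha_j k(x_j, .) with (K + n * lambda * I) alpha = y. The kernel convention, the toy data, and the values of gamma and lam are assumptions chosen for illustration and are not the rate-optimal choices of the theorem.

```python
import math


def kernel(x, x_prime, gamma):
    # Assumed convention: exp(-sum_i ((x_i - x'_i) / gamma_i)^2).
    return math.exp(-sum(((a - b) / g) ** 2 for a, b, g in zip(x, x_prime, gamma)))


def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_ls_svm(xs, ys, gamma, lam):
    """Solve (K + n * lam * I) alpha = y for the expansion coefficients."""
    n = len(xs)
    gram = [[kernel(a, b, gamma) for b in xs] for a in xs]
    for i in range(n):
        gram[i][i] += n * lam
    return solve_linear_system(gram, list(ys))


def predict_clipped(x, xs, alpha, gamma, M):
    """Evaluate the kernel expansion at x and clip the value to [-M, M]."""
    raw = sum(a * kernel(xj, x, gamma) for a, xj in zip(alpha, xs))
    return max(-M, min(M, raw))


# Toy data: the target depends only on the first coordinate.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [0.0, 1.0, 0.0, 1.0]
gamma = (0.5, 5.0)  # narrow along coordinate 0, wide along coordinate 1
alpha = fit_ls_svm(xs, ys, gamma, lam=1e-6)
predictions = [predict_clipped(x, xs, alpha, gamma, M=1.0) for x in xs]
```

The wide bandwidth along the second coordinate encodes the belief that the target varies little in that direction, which is the kind of feature-wise adjustment anisotropic kernels allow.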
Note that in the isotropic case, the overall smoothness depends on the worst smoothness over all dimensions. In other words, poor smoothness along a single dimension may lead to unsatisfactory convergence rates of the decision functions, even when the smoothness along the other dimensions is well-behaved. In contrast, the anisotropic case is more appropriate, because poor smoothness along one dimension will not jeopardize the overall good smoothness much when the smoothness is embodied by in (9). This is called the effective smoothness HoLe02a , or the exponent of global smoothness Birge86a . Moreover, we can still precisely characterize the anisotropy by considering the dimension-specific smoothness vector with , .
In the statistical literature, optimal rates of convergence in anisotropic Hölder, Sobolev and Besov spaces have been studied in IbHa81 ; Nussbaum85a ; HoLe02a . The theoretical optimal learning rate for a function with smoothness along the -th dimension is given by , see e.g. HoLe02a . Therefore, our established convergence rates in (10) match the theoretical optimal ones up to the logarithmic factor . Other published convergence rates for anisotropic cases include ones learned by Gaussian processes IbHa81 . With optimal rates learned by SVMs based on the anisotropic Gaussian kernel, the results we obtain are in line with these existing ones derived via different algorithms. Moreover, when considering our rates in the isotropic class where for all , our rates become , which is better than the learning rates via SVMs based on isotropic kernels obtained in eberts2013optimal . Furthermore, the well-known theoretical optimal rate for the isotropic case is , see Stone82 , and our learning rates coincide with it up to the logarithmic factor .
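A small numerical illustration of the effective smoothness may help. Following the description in the introduction, its reciprocal is taken to be the mean of the reciprocals of the coordinate-wise smoothness values, that is, a harmonic mean. The rate exponent 2s / (2s + d) used below is an assumption: it is the classical minimax form with logarithmic factors ignored and is not copied from display (10). Function names are hypothetical.

```python
def effective_smoothness(alphas):
    """Harmonic mean of the coordinate-wise smoothness values."""
    d = len(alphas)
    return d / sum(1.0 / a for a in alphas)


def rate_exponent(s, d):
    """Exponent e in a rate of order n^(-e), assumed form 2s / (2s + d)."""
    return 2.0 * s / (2.0 * s + d)


alphas = [1.0, 4.0]                # rough along coordinate 0, smooth along 1
s_eff = effective_smoothness(alphas)
anisotropic_exponent = rate_exponent(s_eff, len(alphas))
worst_case_exponent = rate_exponent(min(alphas), len(alphas))
```

For the smoothness vector (1, 4), the effective smoothness is 1.6, and the corresponding exponent is larger than the one obtained from the worst coordinate-wise smoothness 1, which matches the qualitative comparison between the anisotropic and isotropic settings made in the text.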
Though the literature often focuses on the isotropic class, the assumption of isotropy leads to a loss of efficiency if the regression function actually belongs to an anisotropic class. Moreover, this inefficiency becomes worse as the dimension becomes higher. Therefore, the assumption of anisotropy, which treats isotropy as a special case, may be a better choice. In addition, the assumption of anisotropy also shows its advantages for sparse regression functions, i.e., when the regression function depends only on a small subset of coordinates . In that case, the effective smoothness in (9) will depend less on the smoothness along certain dimensions, and thus become larger. In this manner, the learning rates in (10) can be significantly improved further.
Let the assumptions on the distribution and the Bayes decision function in Theorem 2 hold. Moreover, suppose belongs to for some subset of . Let
Then, for all and all , the SVM using the anisotropic RKHS and the least squares loss with
with probability not less than . Here, and , , are user-specified constants and is a constant independent of .
The proof of Theorem 3 will be omitted as it is similar to the previous theorem. We only mention that the exponents of the logarithmic terms depend on the capacity of the underlying RKHSs.
3.2 Rate Analysis
In this section, we compare our results with previously obtained learning rates for SVMs for regression. To this end, according to Theorem 9 in steinwart2009optimal , we need to verify two conditions, which are and , , where is the integral operator (see (5) in steinwart2009optimal ).
First of all, we prove that holds. With the help of Theorem 15 in steinwart2009optimal , is equivalent to
modulo a constant only depending on . If we denote the space of all bounded functions on , then the above inequality (13) will be satisfied if the following more classical, distribution-free entropy number assumption holds:
Now, we verify that the second condition , also holds. Since the proof of Theorem 4.1 in cucker2007learning shows that the image of
is continuously embedded into the real interpolation space, the second condition is satisfied if we can prove that is continuously embedded in .
In order to present a more concrete example, we need to introduce some notations. In the following, for , we denote and . Let us now consider the case where for some . If has a Lebesgue density that is bounded away from and , then we have
where . Consequently, by Corollary 6 in steinwart2009optimal , we can obtain the learning rates whenever . Conversely, according to the Imbedding Theorem for Anisotropic Besov Spaces, see Theorem 8 in the Appendix, can be continuously embedded into for all , and therefore, Theorem 9 in steinwart2009optimal shows that the learning rate is asymptotically optimal for such . However, if , we can still obtain the rates , but we no longer know whether they are optimal in the minimax sense.
It is noteworthy that, when the target functions reside in the anisotropic Besov space, as shown in Theorem 2, we obtain the optimal learning rates up to a certain logarithmic factor. There, , and the decision functions reside in the RKHSs induced by the anisotropic Gaussian RBF kernel. In contrast, when an anisotropic Sobolev space is used as the underlying RKHS, the learning rates obtained are . Since , our learning rates in Theorem 2 are faster than those obtained with the anisotropic Sobolev space.
4 Error Analysis
4.1 Bounding the Sample Error Term
In order to prove the new oracle inequality in Proposition 2, we need to control the capacity of the RKHS , for which we use the entropy numbers, see e.g. CaSt90 or Definition A.5.26 in StCh08 . The definition of the entropy numbers is as follows:
Let be a bounded, linear operator between the normed spaces and and be an integer. Then the -th (dyadic) entropy number of is defined by
where the convention is used.
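For reference, the standard definition of dyadic entropy numbers can be written as follows. The symbols $S$, $E$, $F$, $B_E$, $B_F$ and $i$ are assumed names for the operator, the two normed spaces, their closed unit balls and the index, since the display is not reproduced in this text; the formula follows the usual convention in this line of work and may differ in detail from the paper's display.

```latex
% Assumed notation: S : E -> F bounded and linear, B_E and B_F closed unit balls.
e_i(S) := \inf \Bigl\{ \varepsilon > 0 \;:\;
  \exists\, t_1, \dots, t_{2^{i-1}} \in S B_E
  \text{ such that }
  S B_E \subset \bigcup_{j=1}^{2^{i-1}} \bigl( t_j + \varepsilon B_F \bigr)
\Bigr\},
\qquad \inf \emptyset := \infty .
```

In words, $e_i(S)$ is the smallest radius at which the image of the unit ball can be covered by $2^{i-1}$ balls, so a fast decay in $i$ indicates a small capacity.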
The following proposition with regard to the capacity of can be derived by Theorem 7.34 and Corollary 7.31 in StCh08 .
Let be a closed Euclidean ball. Then there exists a constant , such that, for all , and , we have
Having developed the above proposition, we are now able to derive the oracle inequality for the least squares loss as follows:
Let , be a closed subset with and be a distribution on . Furthermore, let be the least squares loss, be the anisotropic Gaussian RBF kernel over with multi-bandwidth and be the associated RKHS. Fix an and a constant such that . Then, for all fixed and and , the SVM using and satisfies
with probability not less than , where is a constant only depending on and .
4.2 Bounding the Approximation Error Term
In this section, we consider bounding the approximation error of some function contained in the RKHS , which is defined by
where the infimum is actually attained by a unique element , see Lemma 5.1 and Theorem 5.2 in StCh08 . To this end, it suffices to find a function such that both the regularization term and the excess risk are small. To construct this function we define, for , , , and , the function by
with functions defined by
Assume that there exists a Bayes decision function , then we define by
In order to show that is indeed a suitable function to bound (16), we first need to ensure that is contained in . Moreover, we need to bound both the excess risk of and the -norm. Proposition 3 gives the bound on the excess risk with the help of the modulus of smoothness, and Proposition 4 focuses on the estimation of the regularization term.
Let us fix some . Furthermore, assume that is a distribution on such that, for all , the marginal distribution has a Lebesgue density for some . Let be such that