1 Introduction
A challenging problem in modern statistics is how to handle complex, highdimensional data. Sparsity has emerged as a major tool for making efficient inferences and predictions for multidimensional data. Generally speaking, sparsity refers to a situation where the data, despite their apparent high dimensionality, are highly redundant with a low intrinsic dimensionality. In our paper, we use the term “sparse structure” to refer to cases where the underlying distribution places most of its mass on a subset of of small Lebesgue measure. This scenario includes, but is not limited to, Riemannian submanifolds of , and highdensity clusters separated by lowdensity regions. In applications of interest, observable data often have (complex) sparse structure due to the nature of the underlying physical systems. For example, in astronomy, raw galaxy spectra are of dimension equal to the number of wavelength measurements , but inspection of a sample of such spectra will reveal clear, lowdimensional features and structure resulting from the shared physical system that generated these galaxies. While the real dimensionality of data is much smaller than , the challenge remains to exploit this when predicting, for example, the age, composition, and star formation history of a galaxy.
In its simplest form, lowdimensional structure is apparent in the original coordinate system. Indeed, in regression, much research on “large p, small n” problems concerns variable selection
and the problem of recovering a “sparse” coefficient vector (i.e., a vector with mostly zeros) with respect to the given variables. Such approaches include, for example, lassotype regularization
[51], the Dantzig selector [9], and RODEO [28]. There are also various extensions that incorporate lowerorder interactions and groupings of covariates [57, 61, 37] but, like lassotype estimators, they are not directly applicable to the more intricate structures observed in, e.g., natural images, spectra, and hurricane tracks.At the same time, there has been a growing interest in statistical methods that explicitly consider geometric structure in the data themselves. Most traditional dimensionreducing regression techniques, e.g., principal component regression (PCR; [26]) partial least squares (PLS; [55]) and sparse coding [35], are based on linear data transformations and enforce sparsity (with respect to the or norm) of the regression in a rotated space. More recently, several authors [7, 2, 10] have studied local polynomial regression methods on nonlinear manifolds. For example, Aswani et al. [2] propose a geometrybased regularization scheme that estimates the local covariance matrix at a point and then penalizes regression coefficients perpendicular to the estimated manifold direction. In the same spirit, Cheng and Wu [10]
suggest first reducing the dimensionality to the estimated intrinsic dimension of the manifold, and then applying local linear regression to a tangent plane estimate. Local regression and manifoldbased methods tend to perform well when there is a clear submanifold but these approaches are not practical in higher dimensions or when the local dimension varies from point to point in the sample space. Hence, existing nonparametric models still suffer when estimating unknown functions (e.g., density and regression functions) on complex objects
, where is large.Much statistical research has revolved around adapting classical methods, such as linear, kernelweighted, and additive models to high dimensions. On the other hand, statisticians have paid little attention to the potential of orthogonal series approaches. In low dimensions, orthogonal series is a powerful nonparametric technique for estimating densities and regression functions. Such methods are fast to implement with easily interpretable results, they have sharp optimality properties, and a wide variety of bases allows the data analyst to model multiscale structure and any challenging shape of the target function [16]. As a result, Fourier series approaches have dominated research in signal processing and mathematical physics. This success, however, has not translated to more powerful nonparametric tools in dimensions of the order of or
; in fact, extensions via tensor products (as well as more sophisticated adaptive grid or triangulation methods; see
[31] and references within) quickly become unpractical in dimensions .In this work, we will build on ideas from harmonic analysis and spectral methods to construct nonparametric methods for estimating unknown functions in highdimensional spaces with nonstandard data objects (such as images, spectra, and distributions) that possess sparse nonlinear structure. We derive a Fourierlike basis of that adapts to the intrinsic geometry of the underlying data distribution , and which is orthonormal with respect to rather than the Lebesgue measure of the ambient space. The empirical basis functions are then used to estimate functions on complex data ; such as, for example, the regression function
of a response variable
on an object . Because of the adaptiveness of the basis, there is no need for highdimensional tensor products. Moreover, we take advantage of the orthogonality property of the basis for fast computation and model selection. We refer to our approach as spectral series as it is based on spectral methods (in particular, diffusion maps [13, 11, 29] and spectral connectivity analysis [30]) and Fourier series. Sections 2.12.3 describe the main idea of the series method in a regression setting.Our work generalizes and ties together ideas in classical smoothing, kernel machine learning [44, 45, 14]
, support vector machines (SVMs;
[49]) and manifold regularization [6] without the many restrictive assumptions (fixed kernel, exact manifold, infinite unlabeled data and so on) seen in other works. There is a large literature on SVMs and kernel machine learning that use similar approximation spaces as us, but it is unclear whether and how those procedures adapt to the structure of the data distribution. Generally, there is a discrepancy between theoretical work on SVMs, which assume a fixed RKHS (e.g., a fixed kernel bandwidth), and applied SVM work, where the RKHS is chosen in a datadependent way (by, e.g., decreasing the kernel bandwidth for larger sample sizes ). Indeed, issues concerning the choice of tuning parameters, and their relation to the data distribution , are considered to be open problems in the mainstream RKHS literature. The manifold regularization work by Belkin et al. [6] addresses adaptivity to sparse structure but under restrictive assumptions, such as the existence of a welldefined submanifold and the presence of infinite unlabeled data.Another key difference between our work and kernel machine learning is that we explicitly compute the eigenfunctions of a kernelbased operator and then use an orthogonal series approach to nonparametric curve estimation. Neither SVMs nor manifold regularizers exploit orthogonality relative to . In our paper, we point out the advantages of an orthogonal series approach in terms of computational efficiency (such as fast crossvalidation and tuning of parameters), visualization, and interpretation. SVMs can sometimes have a “black box feel,” whereas the spectral series method allows the user to directly link the datadriven Fourierlike eigenfunctions to the function of interest and the sample space. Indeed, there is a dual interpretation of the computed eigenfunctions: (i) They define new coordinates of the data
which are useful for nonlinear dimensionality reduction, manifold learning, and data visualization. (ii) They form an orthogonal Hilbert basis for functions on the data and are a means to
nonparametric curve estimation via the classical orthogonal series method, even when there is no clearly defined manifold structure. There is a large body of work in the machine learning literature addressing the first perspective; see, e.g., Laplacian maps [3], Hessian maps [15], diffusion maps [13], Euclidean Commute Time maps [41], and spectral clustering
[46]. In this paper, we are mainly concerned with the second view, i.e., that of estimating unknown functions on complex data objects and understanding the statistical properties of such estimators.Fig. 1
, for example, shows a 2D visualization of the Isomap face data using the eigenvectors of a renormalized Gaussian kernel as coordinates (Eq.
4). Assume we want to estimate the pose of the faces. How does one solve a regression problem where the predictor is an entire image? Traditional methods do not cope well with this task while our spectral series approach (Eq. 1 with estimated eigenfunctions as a basis) can use complex aggregate objects (e.g., images, spectra, trajectories, and text data) as predictors, without an explicit dimension reduction step. Note that the eigenvectors capture the pose and other continuous variations of an image fairly well, and that the regression appears to vary smoothly in sample space. We will return to the face pose estimation problem in Sec. 6.1. We will also discuss the theoretical properties of a spectral series estimator of the regression function in Sec. 5, including the connection between smoothness and efficient estimators.Our paper has the following aims:

Unifying. To generalize and connect ideas in kernel machine learning, manifold learning, spectral methods and classical smoothing, without the many restrictive assumptions (fixed kernel, exact manifold, infinite unlabeled data, low dimension) seen in other works.

Theoretical. To present new theoretical results in the limit of the kernel bandwidth that shed light on why RKHS/SVM methods often are so successful for complex data with sparse structure (Theorem 14 and Corollary 16), and to link smoothness of the regression with respect to to the approximation error of spectral series (Theorem 10).

Experimental.
To systematically compare the statistical as well as the computational performance of spectral series and other methods using simulated and realworld data. Competing estimators include classical kernel smoothing, knearest neighbors (kNN) regression, regularization in RKHS, and recent stateoftheart manifold and local regression methods. We ask questions such as: Do the methods scale well with increasing dimension
and increasing sample size ? What is the estimated loss and what is the computational time?
The paper is organized as follows. In Sec. 2, we describe the construction of the spectral series method, including details on how to estimate relevant quantities from empirical data and how to tune model parameters. Sec. 3 discusses the connection to related work in machine learning and statistics. In Sections 4 and 5, we discuss the choice of kernel, and provide theoretical guarantees on the spectral series method. Finally, in Sec. 6, we compare the performance of spectral series and other nonparametric estimators for a wide variety of data sets.
2 Orthogonal Series Regression
2.1 General Formulation
In low dimensions, orthogonal series has proved to be a powerful technique for nonparametric curve estimation [16]. In higher dimensions, there is the question of whether one can find an appropriate basis and actually construct a series estimator that performs well. The general setup of an orthogonal series regression is otherwise simple: Let be an iid sample from a distribution with compact support . Suppose we have a realvalued response
where is an unknown function, and
denotes iid random noise with mean zero and variance
. Our goal is to estimate the regression function in situations where is large and the data have sparse (i.e., lowdimensional) structure.Let be an orthonormal basis of some appropriate Hilbert space with inner product and norm . We consider estimators of the form
(1) 
where is a smoothing parameter, and and , in the general case, are databased estimators of the basis functions and the expansion coefficients .
2.2 What Basis?
A challenging problem is how to choose a good basis. The standard approach in nonparametric curve estimation is to choose a fixed known basis for, say, , such as a Fourier or wavelet basis. There is then no need to estimate basis functions. In theory, such an approach can be extended to, eg., by a tensor product,^{1}^{1}1Traditional orthogonal series estimators require tensor products in . For instance, if , then it is common to choose a basis of the form
What basis should one then choose when the dimension is large, say, ? Ideally, the basis should be able to adapt to the underlying structure of the data distribution. This means: The basis should be orthogonal with respective to the distribution that generates the data, as opposed to the standard Lebesgue measure of the ambient space; the basis vectors should be concentrated around highdensity regions where most of the “action” takes place; and the performance of the final series estimator should depend on the intrinsic rather than the ambient dimension of the data. In what follows, we present a spectral series approach where the unknown function is expanded into the estimated eigenfunctions of a kernelbased integral operator. As we shall see, the proposed estimator has many of the properties listed above.
2.3 Construction of Adaptive Basis
Our starting point is a symmetric and positive semidefinite (psd) socalled Mercer kernel . These kernels include covariance functions and polynomial kernels, but we are in this work primarily interested in local, radially symmetric kernels ,^{2}^{2}2Depending on the application, one can replace the Euclidean distance with a dissimilarity measure that better reflects the distance between two data objects and . where is a parameter that defines the scale of the analysis, and the elements are positive and bounded for all . To simplify the theory, we renormalize the kernel according to
(2) 
where . This normalization is common in spectral clustering because it yields eigenvectors that act as indicator functions of connected components [53, Section 3.2]. The same normalization is also implicit in traditional NadarayaWatson kernel smoothers, which compute the local average at a point by weighting surrounding points by .
We will refer to in Eq. 2 as the diffusion kernel. The term “diffusion” stems from a random walks view over the sample space [32, 30]
: One imagines a Markov chain on
with transition kernel . Starting at x, this chain moves to points close to , giving preference to points with high density . The chain essentially encodes the “connectivity” of the sample space relative to , and it has a unique stationary distribution given bywhere as For finite , the stationary distribution is a smoothed version of .
In our regression setting, we seek solutions from a Hilbert space associated with the kernel . Following [30], we define a “diffusion operator” — which maps a function to a new function — according to
(3) 
The operator
has a discrete set of nonnegative eigenvalues
with associated eigenfunctions , which we for convenience normalize to have unit norm. These eigenfunctions have two very useful properties: First, they are orthogonal with respect to the densityweighted inner productthat is,
Second, they also form a set of oscillatory functions which are concentrated around highdensity regions. By construction, is a constant function, and the higherorder eigenfunctions are increasingly oscillatory. Generally speaking, is the smoothest function relative to , subject to being orthogonal to for .
Interpretation. The diffusion operator and its eigenfunctions contain information about the connectivity structure of the sample space. There are two ways one can view the eigenfunctions :

The eigenfunctions define new coordinates of the data. If the data represent highdimensional complex objects, there is often no simple way of ordering the data. However, by a socalled “eigenmap”
(4) one can transform the data into an embedded space where points that are highly connected are mapped close to each other [29]. The eigenmap can be used for data visualization as in Fig. 1 and Fig. 6. If we choose , then we are effectively reducing the dimensionality of the problem by mapping the data from to .

The eigenfunctions form a Hilbert basis for functions on the data. More specifically, the set is an orthogonal basis of . The value of this result is that we can express most physical quantities that vary as a function of the data as a series expansion of the form .
In this work, we study the second point of view and its implications on nonparametric estimation in high dimensions.
2.4 Estimating the Regression Function from Data
In practice, of course, we need to estimate the basis and the projections from data. In this section, we describe the details.
Given
, we compute a rowstochastic matrix
, where(5) 
for . The elements
can be interpreted as transition probabilities
for a Markov chain over the data points (i.e., this is the discrete analogue of Eq. 2 and a diffusion over ). Let . The Markov chain has a unique stationary measure given by , where the th element(6) 
is a kernelsmoothed density estimate at the th observation.
To estimate the eigenfunctions of the continuous diffusion operator in Eq. 3, we first calculate the eigenvalues and the associated (orthogonal) eigenvectors of the symmetrized kernel matrix , where
(7) 
We normalize the eigenvectors so that , and define the new vectors for and . By construction, it holds that the ’s and
’s are eigenvalues and right eigenvectors of the Markov matrix
:(8) 
where
(9) 
Note that the dimensional vector can be regarded as estimates of at the observed values . In other words, let
(10) 
for . We estimate the function at values of not corresponding to one of the ’s using the socalled Nyström method. The idea is to first rearrange the eigenfunctioneigenvalue equation as
and use the kernelsmoothed estimate
(11) 
for .
Our final regression estimator is defined by Eq. 1 with the estimated eigenvectors in Eq. 11 and expansion coefficients computed according to
(12) 
Remark 1 (SemiSupervised Learning, SSL).
The spectral series framework naturally extends to semisupervised learning (SSL) [60] where in addition to the “labeled” sample there are additional “unlabeled” data; i.e., data where the covariates but not the labels are known. Typically , as collecting data often is less costly than labeling them. By including unlabeled examples (drawn from the same distribution ) into the kernel matrix , we can improve our estimates of , and . The summation in Equations 9 and 11 will then be over all observations, while Eq. 12 remains the same as before. See e.g. [34, 58] for SSL with Laplacian eigenmaps in the limit of infinite unlabeled data, i.e., in the limit .
2.5 Loss Function and Tuning of Parameters
We measure the performance of an estimator via the loss function
(13) 
To choose tuning parameters (such as the kernel bandwidth and the number of basis functions ), we split the data into a training and a validation set. For each choice of and a sufficiently large constant , we use the training set and Eqs. 1112 to estimate the eigenvectors and the expansion coefficients . We then use the validation set to compute the estimated loss
(14) 
for different values of . We choose the (, )model with the lowest estimated loss on the validation set.
The computation for fixed and different is very fast. Due to orthogonality of the basis, the estimates and depend on but not on .
2.6 Scalability
The spectral series estimator is faster than most traditional approaches in high dimensions. Once the kernel matrix has been constructed, the eigendecomposition takes the same amount of time for all values of .
In terms of scalability for large data sets, one can dramatically reduce the computational cost by implementing fast approximate eigendecompositions. For example, the Randomized SVD by Halko et al. [22] cuts down the cost from to roughly with little impact on statistical performance (see Fig. 9). According to Halko et al., these randomized methods are especially wellsuited for parallel implementation, which is a topic we will explore in future work.
3 Connection to Other Work
3.1 Linear Regression with Transformed Data
One can view our series model as a (weighted) linear regression after a data transformation , where are the first eigenvectors of the diffusion operator . By increasing , the dimension of the feature space, we achieve more flexible, fully nonparametric representations. Decreasing adds more structure to the regression, as dictated by the eigenstructure of the data.
Eq. 12 is similar to a weighted least squares (WLS) solution to a linear regression in but with an efficient orthogonal series implementation and no issues with collinear variables. Define the matrix of predictors,
(15) 
and introduce the weight matrix
(16) 
where and are estimated from data (Equations 6 and 10). Suppose that , where , , and the random vector represents the errors. By minimizing the weighted residual sum of squares
(17) 
we arrive at the WLS estimator
(18) 
where the matrix puts more weight on observations in highdensity regions. This expression is equivalent to Eq. 12.
Note that thanks to the orthogonality property , model search and model selection are feasible even for complex models with very large . This is in clear contrast with standard multiple regression where one needs to recompute the estimates for each model with a different , invert the matrix , and potentially deal with inputs (columns of the design matrix ) that are linearly dependent.
Remark 2 (Heteroscedasticity).
More generally, let be a nonnegative function rather than a constant, and let
be iid realizations of a random variable
with zero mean and unit variance. Consider the regression model. We can handle heteroscedastic errors by applying the same framework as above to a
rescaled regression function .3.2 Kernel Machine Learning and Regularization in RKHS
Kernelbased regularization methods use similar approximation spaces as us. In kernel machine learning [45, 14], one often considers the variational problem
(19) 
where is a convex loss function, is a penalty parameter, and is the Reproducing Kernel Hilbert Space (RKHS) associated with a symmetric positive semidefinite kernel .^{3}^{3}3To every continuous, symmetric, and positive semidefinite kernel is associated a unique RKHS [1]. This RKHS is defined to be the closure of the linear span of the set of functions with the inner product satisfying the reproducing property for all . Penalizing the RKHS norm imposes smoothness conditions on possible solutions. Now suppose that
where the RKHS inner product is related to the inner product according to Eq. 19 is then equivalent to considering eigenexpansions
and seeking solutions to where the hypothesis space
(20) 
is a ball of the RKHS with radius , and the RKHS norm is given by .
Here are some key observations:
(i) The above setting is similar to ours. The regularization in Eq. 19 differentially shrinks contributions from higherorder terms with small values. In spectral series, we use a projection (i.e., a basis subset selection) method, but the empirical performance is usually similar.
(ii) There are some algorithmic differences, as well as differences in how the two regression estimators are analyzed and interpreted. In our theoretical work, we consider Gaussian kernels with flexible variances; that is, we choose the approximation spaces in a datadependent way (cf. multikernel regularization schemes for SVMs [56]) so that the estimator can adapt to sparse structure and the intrinsic dimension of the data. Most theoretical work in kernel machine learning assume a fixed RKHS.
(iii) There are also other differences. Support Vector Machines [49]
and other kernelbased regularization methods (such as splines, ridge regression and radial basis functions) never explicitly compute the eigenvectors of the kernel. Instead, these methods rely on the classical Representer Theorem
[54] which states that the solution to Eq. 19 is a finite expansion of the form . The original infinitedimensional variational problem is then reduced to a finitedimensional optimization of the coefficients . In a naive leastsquares implementation, however, one has to recompute these coefficients for each choice of the penalty parameter , which can make crossvalidation cumbersome. In our spectral series approach, we take advantage of the orthogonality of the basis for fast model selection and computation of the parameters. As in spectral clustering, we also use eigenvectors to organize and visualize data that can otherwise be hard to interpret.3.3 Manifold Regularization and SemiSupervised Learning
Our spectral series method is closely related to Laplacianbased regularization: In [6], Belkin et al. extend the kernelbased regularization framework to incorporate additional information about the geometric structure of the marginal . Their idea is to add a datadependent penalty term to Eq. 19 that controls the complexity as measured by the geometry of the distribution. Suppose that one is given labeled data as well as unlabeled data , where in general . (The limit corresponds to having full knowledge of .) Under the assumption that the support of is a compact submanifold of , the authors propose minimizing a graphLaplacian regularized least squares function
(21) 
where are the edge weights in the graph, and the last Laplacian penalty term favors functions for which is close to when and are connected with large weights.
Note that the eigenbasis of our rowstochastic matrix minimizes the distortion in Eq. 21 if you regard the entries of as the weights [3]. Indeed, the eigenvector minimizes the term subject to being orthogonal to for . Hence, including a Laplacian penalty term is comparable to truncating the eigenbasis expansion in spectral series. Moreover, the semisupervised regularization in Eq. 21 is similar to a semisupervised version of our spectral series approach, where we first use both labeled and unlabeled data and a kernel with bandwidth to compute the eigenbasis, and then extend the eigenfunctions according to Eq. 11 via a (potentially wider) kernel with bandwidth . The main downside of the Laplacianbased framework above is that it is hard to analyze theoretically. As with other kernelbased regularizers, the method also does not explicitly exploit eigenvectors and orthogonal bases.
4 Choice of Kernel
In the RKHS literature, there is a long list of commonly used kernels. These include, e.g., the Gaussian kernel , polynomial kernels [52], and the thinplate spline kernel [20]. In our numerical experiments (Sec. 6), we will consider both Gaussian and polynomial kernels, but throughout the rest of the paper, we will primarily work with the (rownormalized) Gaussian kernel. There are several reasons for this choice:

The eigenfunctions of the Gaussian kernel are simultaneously concentrated in time (i.e., space) and frequency, and are particularly wellsuited for estimating functions that are smooth with respect to a lowdimensional data distribution.
The following two examples illustrate some of the differences in the eigenbases of Gaussian and polynomial kernels:
Example 3.
Suppose that
is a uniform distribution
on the real line. Fig. 2, left, shows the eigenfunctions of a thirdorder polynomial kernel . These functions are smooth but have large values outside the support of . Contrast this eigenbasis with the eigenfunctions in Fig. 2, right, of a Gaussian kernel. The latter functions are concentrated on the support of and are orthogonal on as well as on .Example 4.
Consider data around a noisy spiral:
where is a uniform random variable, and and
are normally distributed random variables. The eigenfunctions of a polynomial kernel do not adapt well to the underlying distribution of the data. Fig.
3, left, for example, is a contour plot of the Nyström extension of the fourth empirical eigenvector of a thirdorder polynomial kernel. In contrast, the eigenfunctions of a Gaussian diffusion kernel vary smoothly along the spiral direction, forming a Fourierlike basis with orthogonal eigenfunctions that concentrate around highdensity regions; see Fig. 3, right.In high dimensions, Gaussian extensions can be seen as a generalization of prolate spheroidal wave functions [12]. Prolates were originally introduced by Slepian and Pollack as the solution to the problem of simultaneously and optimally concentrating a function and its Fourier content (see [48] for a fascinating recount of this development in Fourier analysis and modeling). The bandlimited functions that maximize their energy content within a space domain are extensions of the eigenfunctions of the integral operator of a Bessel kernel restricted to [12, Section 3.1]. In high dimensions, Bessel and Gaussian kernels are equivalent [43], suggesting that the eigenfunctions of the Gaussian kernel are nearly optimal.
However, although Gaussian kernels have many advantages, they may not always be the best choice in practice. Ultimately, this is determined by the application and by what the best measure of similarity between two data points would be. Our framework suggests a principled way of selecting the best kernel for regression: Among a set of reasonable candidate kernels, choose the estimator with the smallest empirical loss according to Eq. 14. We will, for example, use this approach in Sec. 6 to choose the optimal degree for a set of polynomial kernels of the form .
Normalization of Local Kernels. In the RKHS literature, it is standard to work with “unnormalized” kernels. In spectral clustering [53], on the other hand, researchers often use the “stochastic” and “symmetric” normalization schemes in Eq. 5 and Eq. 7, respectively. We have found (Sec. 6) that the exact normalization often has little effect on the performance in regression. Nevertheless, we choose to use the rowstochastic kernel for reasons of interpretation and analysis: First, the limit of the bandwidth is welldefined, and there is a series of works on the convergence of the graph Laplacian to the LaplaceBeltrami operator on Riemannian manifolds [11, 5, 24, 47, 19]. Fourier functions originate from solving a Laplace eigenvalue problem on a bounded domain; hence, the eigenfunctions of the diffusion operator can be seen as a generalization of Fourier series to manifolds.
Moreover, the rowstochastic kernel yields less variable empirical functions than the unnormalized or symmetric forms. As an illustration, consider the noisy spiral data in Example 4. Fig. 4 shows the estimated projections onto the spiral direction of the eigenfunctions of the symmetric and the stochastic forms; see the left and right plots, respectively. The eigenfunctions are clearly smoother in the latter case. By construction, the empirical eigenfunctions of the symmetric operator are orthogonal with respect to the empirical distribution , whereas the estimated eigenfunctions of the stochastic operator are orthogonal with respect to the smoothed data distribution . The kernel bandwidth defines the scale of the analysis.
5 Theory
In this section, we derive theoretical bounds on the loss (Eq. 13) of a series regression estimator with a radial kernel for a standard fixed RKHS setting (Theorem 13), as well as a setting where the kernel bandwidth varies with the sample size (Theorem 14). We also further elaborate on the connection between spectral series and Fourier analysis by generalizing the wellknown link between Sobolev differentiable signals and the approximation error in a Fourier basis.
Using the same notation as before, let
where and . We write
and refer to the two terms as “bias” and “variance”. Hence, we define the integrated bias and variance
and
and bound the two components separately. Our assumptions are:
(A1) has compact support and bounded density , .
(A2) The weights are positive and bounded; that is, ,
where and are constants that do not depend on .
(A3) The psd operator has nondegenerate eigenvalues; i.e.,
(A4) For all and , there exists some constant (not depending on ) such that
where and .
Without loss of generality, we assume that the eigenfunctions are estimated using an unlabeled sample that is drawn independently from the data used to estimate the coefficients . This is to simplify the proofs and can always be achieved by splitting the data in two sets.
5.1 Bias
A key point is that the approximation error of the regression depends on the smoothness of relative to . Here we present two different calculations of the bias based on two related notions of smoothness. The first notion is standard in the kernel literature and is based on RKHS norms. The second notion is based on our diffusion framework and can be seen as a generalization of Sobolev differentiability.
Method 1: Smoothness measured by RKHS norm.
Let where is a strictly positive number. Under previous assumptions, this kernel is symmetric and psd with a unique RKHS which we denote by . A standard way to measure smoothness of a function in a RKHS is through the RKHS norm (see, e.g., [33]). One can then define function classes
where is a positive number dependent on .
Proposition 5.
Assume , where . Then,
For fixed , contains “smoother” functions for smaller values of .
Method 2: Smoothness measured by diffusion operator.
Alternatively, let
(22) 
where is the identity. The operator has the same eigenvectors as the differential operator . Its eigenvalues are given by where are the eigenvalues of . Define the functional
(23) 
which maps a function into a nonnegative real number. For small , measures the variability of the function with respect to the distribution . The expression is a variation of the graph Laplacian regularizers popular in semisupervised learning [59]. In fact, a Taylor expansion yields where is the gradient operator and is the psd Laplace operator in . In kernel regression smoothing, the extra term is considered an undesirable extra bias, called design bias. In classical regression, it is removed by using local linear smoothing [17], which is asymptotically equivalent to replacing the original kernel by the biascorrected kernel [11].
The following result bounds the approximation error of an orthogonal series expansion of . The bound is consistent with Theorem 2 in [58], which applies to the more restrictive setting of SSL with infinite unlabeled data and . Our result holds for all and and does not assume unlimited data.
Proposition 6.
For ,
(24)  
where is the eigenvalue of .
Smoothness and Sparsity
In the limit , we have several interesting results, including a generalization of the classical connection between Sobolev differentiability and the error decay of Fourier approximations [31, Section 9.1.2] to a setting with adaptive bases and highdimensional data. We denote the quantities derived from the biascorrected kernel by , , and so forth.
Definition 7.
(Smoothness relative to P) A function is smooth relative to if
where is the stationary distribution of the random walk on the data as . The smaller the value of , the smoother the function.
Lemma 8.
For functions whose gradients vanish at the boundary,
This is similar to the convergence of the (unnormalized) graph Laplacian regularizer to the densitydependent smoothness functional [8].
Next we will see that smoothness relative to (Definition 7) and sparsity (with respect to the norm) in the eigenbasis of the diffusion operator (Definition 9 below) are really the same thing. As a result, we can link smoothness and sparsity to the rate of the error decay of the eigenbasis approximation.
Definition 9.
(Sparsity in ) A set of real numbers lies in a Sobolev ellipsoid if for some number . For a given basis , let
where . Functions in are sparse in . The larger the value of , the sparser the representation.
Theorem 10.
Assume that are the eigenvectors of with eigenvalues for some . Let . Then, the following two statements are equivalent:

(smoothness relative to P)

(sparsity in ).
Furthermore, sparsity in (or smoothness relative to P) implies
The rate of the error decay depends on the dimension of the data. We will address this issue in Sec. 5.3.
5.2 Variance
The matrix (defined in Eq. 5) can be viewed as a perturbation of the integral operator due to finite sampling. To estimate the variance, we bound the difference , where are the eigenvectors of , and are the Nyström extensions (Eq. 11) of the eigenvectors of . We adopt a strategy from Rosasco et al. [40], which is to introduce two new integral operators that are related to and but both act on an auxiliary^{4}^{4}4This auxiliary space only enters the intermediate derivations and plays no role in the error analysis of the algorithm itself. RKHS of smooth functions (see Appendix A.2 for details). As before, we write to indicate that we let the kernel bandwidth depend on the sample size .
Proposition 11.
Let and as . Under assumptions (A1)(A4) and ,
where and
Proposition 12.
Let
Comments
There are no comments yet.