 # A Spectral Series Approach to High-Dimensional Nonparametric Regression

A key question in modern statistics is how to make fast and reliable inferences for complex, high-dimensional data. While there has been much interest in sparse techniques, current methods do not generalize well to data with nonlinear structure. In this work, we present an orthogonal series estimator for predictors that are complex aggregate objects, such as natural images, galaxy spectra, trajectories, and movies. Our series approach ties together ideas from kernel machine learning and Fourier methods. We expand the unknown regression function on the data in terms of the eigenfunctions of a kernel-based operator, and we take advantage of the orthogonality of the basis with respect to the underlying data distribution $P$ to speed up computations and the tuning of parameters. If the kernel is appropriately chosen, then the eigenfunctions adapt to the intrinsic geometry and dimension of the data. We provide theoretical guarantees for a radial kernel with varying bandwidth, and we relate smoothness of the regression function with respect to $P$ to sparsity in the eigenbasis. Finally, using simulated and real-world data, we systematically compare the performance of the spectral series approach with classical kernel smoothing, k-nearest neighbors regression, kernel ridge regression, and state-of-the-art manifold and local regression methods.


## 1 Introduction

A challenging problem in modern statistics is how to handle complex, high-dimensional data. Sparsity has emerged as a major tool for making efficient inferences and predictions for multidimensional data. Generally speaking, sparsity refers to a situation where the data, despite their apparent high dimensionality, are highly redundant with a low intrinsic dimensionality. In our paper, we use the term “sparse structure” to refer to cases where the underlying distribution places most of its mass on a subset of $\mathbb{R}^D$ of small Lebesgue measure. This scenario includes, but is not limited to, Riemannian submanifolds of $\mathbb{R}^D$ and high-density clusters separated by low-density regions. In applications of interest, observable data often have (complex) sparse structure due to the nature of the underlying physical systems. For example, in astronomy, raw galaxy spectra are of dimension equal to the number of wavelength measurements, but inspection of a sample of such spectra will reveal clear, low-dimensional features and structure resulting from the shared physical system that generated these galaxies. While the real dimensionality of the data is much smaller than $D$, the challenge remains to exploit this structure when predicting, for example, the age, composition, and star formation history of a galaxy.

In its simplest form, low-dimensional structure is apparent in the original coordinate system. Indeed, in regression, much research on “large $p$, small $n$” problems concerns variable selection and the problem of recovering a “sparse” coefficient vector (i.e., a vector with mostly zeros) with respect to the given variables. Such approaches include, for example, lasso-type regularization, the Dantzig selector, and RODEO. There are also various extensions that incorporate lower-order interactions and groupings of covariates [57, 61, 37] but, like lasso-type estimators, they are not directly applicable to the more intricate structures observed in, e.g., natural images, spectra, and hurricane tracks.

At the same time, there has been a growing interest in statistical methods that explicitly consider geometric structure in the data themselves. Most traditional dimension-reducing regression techniques, e.g., principal component regression (PCR), partial least squares (PLS), and sparse coding, are based on linear data transformations and enforce sparsity (with respect to the $\ell_1$ or $\ell_2$ norm) of the regression in a rotated space. More recently, several authors [7, 2, 10] have studied local polynomial regression methods on non-linear manifolds. For example, Aswani et al. propose a geometry-based regularization scheme that estimates the local covariance matrix at a point and then penalizes regression coefficients perpendicular to the estimated manifold direction. In the same spirit, Cheng and Wu suggest first reducing the dimensionality to the estimated intrinsic dimension of the manifold, and then applying local linear regression to a tangent plane estimate. Local regression and manifold-based methods tend to perform well when there is a clear submanifold, but these approaches are not practical in higher dimensions or when the local dimension varies from point to point in the sample space. Hence, existing nonparametric models still suffer when estimating unknown functions (e.g., density and regression functions) on complex objects in $\mathbb{R}^D$, where $D$ is large.

Much statistical research has revolved around adapting classical methods, such as linear, kernel-weighted, and additive models, to high dimensions. On the other hand, statisticians have paid little attention to the potential of orthogonal series approaches. In low dimensions, orthogonal series is a powerful nonparametric technique for estimating densities and regression functions. Such methods are fast to implement with easily interpretable results, they have sharp optimality properties, and a wide variety of bases allows the data analyst to model multiscale structure and any challenging shape of the target function. As a result, Fourier series approaches have dominated research in signal processing and mathematical physics. This success, however, has not translated into more powerful nonparametric tools in high dimensions; in fact, extensions via tensor products (as well as more sophisticated adaptive grid or triangulation methods; see references within) quickly become impractical beyond a handful of dimensions.

In this work, we build on ideas from harmonic analysis and spectral methods to construct nonparametric methods for estimating unknown functions in high-dimensional spaces with non-standard data objects (such as images, spectra, and distributions) that possess sparse nonlinear structure. We derive a Fourier-like basis that adapts to the intrinsic geometry of the underlying data distribution $P$, and which is orthonormal with respect to $P$ rather than the Lebesgue measure of the ambient space. The empirical basis functions are then used to estimate functions on complex data, such as, for example, the regression function $\mathbb{E}(Y|x)$ on an object $x$. Because of the adaptiveness of the basis, there is no need for high-dimensional tensor products. Moreover, we take advantage of the orthogonality property of the basis for fast computation and model selection. We refer to our approach as spectral series as it is based on spectral methods (in particular, diffusion maps [13, 11, 29] and spectral connectivity analysis) and Fourier series. Sections 2.1–2.3 describe the main idea of the series method in a regression setting.

Our work generalizes and ties together ideas in classical smoothing, kernel machine learning [44, 45, 14], support vector machines (SVMs), and manifold regularization, without the many restrictive assumptions (fixed kernel, exact manifold, infinite unlabeled data, and so on) seen in other works. There is a large literature on SVMs and kernel machine learning that uses similar approximation spaces to ours, but it is unclear whether and how those procedures adapt to the structure of the data distribution. Generally, there is a discrepancy between theoretical work on SVMs, which assumes a fixed RKHS (e.g., a fixed kernel bandwidth), and applied SVM work, where the RKHS is chosen in a data-dependent way (by, e.g., decreasing the kernel bandwidth for larger sample sizes). Indeed, issues concerning the choice of tuning parameters, and their relation to the data distribution $P$, are considered to be open problems in the mainstream RKHS literature. The manifold regularization work by Belkin et al. addresses adaptivity to sparse structure but under restrictive assumptions, such as the existence of a well-defined submanifold and the presence of infinite unlabeled data.

Another key difference between our work and kernel machine learning is that we explicitly compute the eigenfunctions of a kernel-based operator and then use an orthogonal series approach to nonparametric curve estimation. Neither SVMs nor manifold regularizers exploit orthogonality relative to $P$. In our paper, we point out the advantages of an orthogonal series approach in terms of computational efficiency (such as fast cross-validation and tuning of parameters), visualization, and interpretation. SVMs can sometimes have a “black box feel,” whereas the spectral series method allows the user to directly link the data-driven Fourier-like eigenfunctions to the function of interest and the sample space. Indeed, there is a dual interpretation of the computed eigenfunctions: (i) They define new coordinates of the data, which are useful for nonlinear dimensionality reduction, manifold learning, and data visualization. (ii) They form an orthogonal Hilbert basis for functions on the data and are a means to nonparametric curve estimation via the classical orthogonal series method, even when there is no clearly defined manifold structure. There is a large body of work in the machine learning literature addressing the first perspective; see, e.g., Laplacian maps, Hessian maps, diffusion maps, Euclidean Commute Time maps, and spectral clustering. In this paper, we are mainly concerned with the second view, i.e., that of estimating unknown functions on complex data objects and understanding the statistical properties of such estimators.

Fig. 1, for example, shows a 2D visualization of the Isomap face data using the eigenvectors of a renormalized Gaussian kernel as coordinates (Eq. 4). Assume we want to estimate the pose of the faces. How does one solve a regression problem where the predictor is an entire image? Traditional methods do not cope well with this task, while our spectral series approach (Eq. 1 with estimated eigenfunctions as a basis) can use complex aggregate objects (e.g., images, spectra, trajectories, and text data) as predictors, without an explicit dimension reduction step. Note that the eigenvectors capture the pose and other continuous variations of an image fairly well, and that the regression appears to vary smoothly in sample space. We will return to the face pose estimation problem in Sec. 6.1. We will also discuss the theoretical properties of a spectral series estimator of the regression function in Sec. 5, including the connection between smoothness and efficient estimators.

Figure 1: Embedding or so-called “eigenmap” of the Isomap face data using the first two non-trivial eigenvectors of the Gaussian diffusion kernel. The eigenvectors capture the pose $Y$ and other continuous variations of an image $x$ fairly well, and the regression $\mathbb{E}(Y|x)$ appears to vary smoothly in sample space.

Our paper has the following aims:

1. Unifying. To generalize and connect ideas in kernel machine learning, manifold learning, spectral methods and classical smoothing, without the many restrictive assumptions (fixed kernel, exact manifold, infinite unlabeled data, low dimension) seen in other works.

2. Theoretical. To present new theoretical results in the limit $\varepsilon \to 0$ of the kernel bandwidth that shed light on why RKHS/SVM methods often are so successful for complex data with sparse structure (Theorem 14 and Corollary 16), and to link smoothness of the regression with respect to $P$ to the approximation error of spectral series (Theorem 10).

3. Experimental. To systematically compare the statistical as well as the computational performance of spectral series and other methods using simulated and real-world data. Competing estimators include classical kernel smoothing, k-nearest neighbors (kNN) regression, regularization in RKHS, and recent state-of-the-art manifold and local regression methods. We ask questions such as: Do the methods scale well with increasing dimension $D$ and increasing sample size $n$? What is the estimated loss, and what is the computational time?

The paper is organized as follows. In Sec. 2, we describe the construction of the spectral series method, including details on how to estimate relevant quantities from empirical data and how to tune model parameters. Sec. 3 discusses the connection to related work in machine learning and statistics. In Sections 4 and 5, we discuss the choice of kernel, and provide theoretical guarantees on the spectral series method. Finally, in Sec. 6, we compare the performance of spectral series and other nonparametric estimators for a wide variety of data sets.

## 2 Orthogonal Series Regression

### 2.1 General Formulation

In low dimensions, orthogonal series has proved to be a powerful technique for nonparametric curve estimation. In higher dimensions, there is the question of whether one can find an appropriate basis and actually construct a series estimator that performs well. The general set-up of an orthogonal series regression is otherwise simple: Let $X_1, \ldots, X_n$ be an iid sample from a distribution $P$ with compact support $\mathcal{X}$. Suppose we have a real-valued response

$$Y_i = f(X_i) + \epsilon_i,$$

where $f$ is an unknown function, and $\epsilon_i$ denotes iid random noise with mean zero and variance $\sigma^2$. Our goal is to estimate the regression function $f$ in situations where $D$ is large and the data have sparse (i.e., low-dimensional) structure.

Let $\{\psi_j\}_{j \geq 0}$ be an orthonormal basis of some appropriate Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. We consider estimators of the form

$$\hat{f}(x) = \sum_{j=0}^{J} \hat{\beta}_j \hat{\psi}_j(x), \tag{1}$$

where $J$ is a smoothing parameter, and $\hat{\psi}_j$ and $\hat{\beta}_j$, in the general case, are data-based estimators of the basis functions $\psi_j$ and the expansion coefficients $\beta_j$.

### 2.2 What Basis?

A challenging problem is how to choose a good basis. The standard approach in nonparametric curve estimation is to choose a fixed known basis, such as a Fourier or wavelet basis. There is then no need to estimate basis functions. In theory, such an approach can be extended to higher dimensions by a tensor product,¹ but tensor-product bases, as well as more sophisticated adaptive grid or triangulation methods (see references within), quickly become unusable for even a handful of dimensions.

¹Traditional orthogonal series estimators require tensor products in higher dimensions. For instance, if $D = 2$, then it is common to choose a basis of the form $\psi_{i,j}(x_1, x_2) = \phi_i(x_1)\phi_j(x_2)$, where $\{\phi_i\}$ and $\{\phi_j\}$ are bases for functions of one variable.

What basis should one then choose when the dimension $D$ is large? Ideally, the basis should be able to adapt to the underlying structure of the data distribution. This means: the basis should be orthogonal with respect to the distribution $P$ that generates the data, as opposed to the standard Lebesgue measure of the ambient space; the basis vectors should be concentrated around high-density regions where most of the “action” takes place; and the performance of the final series estimator should depend on the intrinsic rather than the ambient dimension of the data. In what follows, we present a spectral series approach where the unknown function is expanded into the estimated eigenfunctions of a kernel-based integral operator. As we shall see, the proposed estimator has many of the properties listed above.

### 2.3 Construction of Adaptive Basis

Our starting point is a symmetric and positive semi-definite (psd) so-called Mercer kernel $k(x,y)$. These kernels include covariance functions and polynomial kernels, but in this work we are primarily interested in local, radially symmetric kernels $k_\varepsilon(x,y)$,² where $\varepsilon$ is a parameter that defines the scale of the analysis, and the elements $k_\varepsilon(x,y)$ are positive and bounded for all $x, y \in \mathcal{X}$. To simplify the theory, we renormalize the kernel according to

$$a_\varepsilon(x,y) = \frac{k_\varepsilon(x,y)}{p_\varepsilon(x)}, \tag{2}$$

where $p_\varepsilon(x) = \int_{\mathcal{X}} k_\varepsilon(x,y)\,dP(y)$. This normalization is common in spectral clustering because it yields eigenvectors that act as indicator functions of connected components [53, Section 3.2]. The same normalization is also implicit in traditional Nadaraya–Watson kernel smoothers, which compute the local average at a point $x$ by weighting each surrounding point by $k_\varepsilon(x, X_i)/\sum_j k_\varepsilon(x, X_j)$.

²Depending on the application, one can replace the Euclidean distance with a dissimilarity measure that better reflects the distance between two data objects $x$ and $y$.

We will refer to $a_\varepsilon$ in Eq. 2 as the diffusion kernel. The term “diffusion” stems from a random walks view over the sample space [32, 30]: one imagines a Markov chain on $\mathcal{X}$ with transition kernel $a_\varepsilon(x,y)$. Starting at $x$, this chain moves to points close to $x$, giving preference to points with high density $p_\varepsilon$. The chain essentially encodes the “connectivity” of the sample space relative to $P$, and it has a unique stationary distribution $S_\varepsilon$ given by

$$S_\varepsilon(A) = \frac{\int_A p_\varepsilon(x)\,dP(x)}{\int_{\mathcal{X}} p_\varepsilon(x)\,dP(x)},$$

where $S_\varepsilon(A) \to P(A)$ as $\varepsilon \to 0$. For finite $\varepsilon$, the stationary distribution $S_\varepsilon$ is a smoothed version of $P$.

In our regression setting, we seek solutions from a Hilbert space associated with the kernel $a_\varepsilon$. We define a “diffusion operator” $A_\varepsilon$ — which maps a function $f$ to a new function $A_\varepsilon f$ — according to

$$A_\varepsilon f(x) = \int_{\mathcal{X}} a_\varepsilon(x,y) f(y)\,dP(y), \quad \text{for } x \in \mathcal{X}. \tag{3}$$

The operator $A_\varepsilon$ has a discrete set of non-negative eigenvalues $\lambda_{\varepsilon,0} \geq \lambda_{\varepsilon,1} \geq \cdots$ with associated eigenfunctions $\psi_{\varepsilon,0}, \psi_{\varepsilon,1}, \ldots$, which we for convenience normalize to have unit norm. These eigenfunctions have two very useful properties: First, they are orthogonal with respect to the density-weighted inner product

$$\langle f, g \rangle_\varepsilon = \int_{\mathcal{X}} f(x)\,g(x)\,dS_\varepsilon(x);$$

that is,

$$\langle \psi_{\varepsilon,i}, \psi_{\varepsilon,j} \rangle_\varepsilon = \delta_{i,j}.$$

Second, they also form a set of oscillatory functions which are concentrated around high-density regions. By construction, $\psi_{\varepsilon,0}$ is a constant function, and the higher-order eigenfunctions are increasingly oscillatory. Generally speaking, $\psi_{\varepsilon,j}$ is the smoothest function relative to $S_\varepsilon$, subject to being orthogonal to $\psi_{\varepsilon,i}$ for $i < j$.

Interpretation. The diffusion operator and its eigenfunctions contain information about the connectivity structure of the sample space. There are two ways one can view the eigenfunctions $\psi_{\varepsilon,j}$:

1. The eigenfunctions define new coordinates of the data. If the data represent high-dimensional complex objects, there is often no simple way of ordering the data. However, by a so-called “eigenmap”

$$x \mapsto (\psi_{\varepsilon,1}(x), \psi_{\varepsilon,2}(x), \ldots, \psi_{\varepsilon,J}(x)), \tag{4}$$

one can transform the data into an embedded space where points that are highly connected are mapped close to each other. The eigenmap can be used for data visualization as in Fig. 1 and Fig. 6. If we choose $J \ll D$, then we are effectively reducing the dimensionality of the problem by mapping the data from $\mathbb{R}^D$ to $\mathbb{R}^J$.

2. The eigenfunctions form a Hilbert basis for functions on the data. More specifically, the set $\{\psi_{\varepsilon,j}\}_{j \geq 0}$ is an orthogonal basis of $L^2(\mathcal{X}, S_\varepsilon)$. The value of this result is that we can express most physical quantities that vary as a function of the data as a series expansion of the form $f(x) = \sum_j \beta_{\varepsilon,j}\psi_{\varepsilon,j}(x)$.

In this work, we study the second point of view and its implications on nonparametric estimation in high dimensions.

### 2.4 Estimating the Regression Function from Data

In practice, of course, we need to estimate the basis and the projections from data. In this section, we describe the details.

Given $X_1, \ldots, X_n$, we compute a row-stochastic matrix $A_\varepsilon$, where

$$A_\varepsilon(i,j) = \frac{k_\varepsilon(X_i, X_j)}{\sum_{l=1}^{n} k_\varepsilon(X_i, X_l)} \tag{5}$$

for $i, j = 1, \ldots, n$. The elements $A_\varepsilon(i,j)$ can be interpreted as transition probabilities for a Markov chain over the data points (i.e., this is the discrete analogue of Eq. 2 and a diffusion over $\mathcal{X}$). Let $\hat{p}_\varepsilon(X_i) = \frac{1}{n}\sum_{j=1}^{n} k_\varepsilon(X_i, X_j)$. The Markov chain has a unique stationary measure, where the $i$th element

$$\hat{s}_\varepsilon(X_i) = \frac{\hat{p}_\varepsilon(X_i)}{\sum_{j=1}^{n} \hat{p}_\varepsilon(X_j)} \tag{6}$$

is a kernel-smoothed density estimate at the $i$th observation.
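As a minimal numerical sketch (not the authors' code), the discrete quantities in Eqs. 5–6 take only a few lines; the Gaussian kernel and its bandwidth convention below are illustrative assumptions:

```python
import numpy as np

def diffusion_kernel(X, eps):
    """Row-stochastic diffusion matrix (Eq. 5) and its stationary
    measure (Eq. 6), using an illustrative Gaussian kernel."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (4 * eps))           # symmetric kernel matrix k_eps
    A = K / K.sum(axis=1, keepdims=True)  # Eq. 5: normalize each row
    s = K.sum(axis=1) / K.sum()           # Eq. 6: stationary measure
    return A, s

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
A, s = diffusion_kernel(X, eps=0.5)
print(np.allclose(A.sum(axis=1), 1.0))  # rows of A are probabilities
print(np.allclose(s @ A, s))            # s is stationary for the chain
```

Because the kernel matrix is symmetric, the row sums directly give the stationary measure, which the last line verifies.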

To estimate the eigenfunctions of the continuous diffusion operator $A_\varepsilon$ in Eq. 3, we first calculate the eigenvalues and the associated (orthogonal) eigenvectors of the symmetrized kernel matrix $\tilde{A}_\varepsilon$, where

$$\tilde{A}_\varepsilon(i,j) = \frac{k_\varepsilon(X_i, X_j)}{\sqrt{\sum_l k_\varepsilon(X_i, X_l)}\,\sqrt{\sum_l k_\varepsilon(X_l, X_j)}}. \tag{7}$$

We normalize the eigenvectors of $\tilde{A}_\varepsilon$ and transform them into new vectors $\psi_{A_\varepsilon,j}$, rescaled so that Eq. 9 below holds. By construction, the $\lambda_{A_\varepsilon,j}$'s and $\psi_{A_\varepsilon,j}$'s are eigenvalues and right eigenvectors of the Markov matrix $A_\varepsilon$:

$$A_\varepsilon \psi_{A_\varepsilon,j} = \lambda_{A_\varepsilon,j}\,\psi_{A_\varepsilon,j}, \tag{8}$$

where

$$\frac{1}{n}\sum_{i=1}^{n} \psi_{A_\varepsilon,j}(i)\,\psi_{A_\varepsilon,k}(i)\,\hat{s}_\varepsilon(X_i) = \delta_{j,k}. \tag{9}$$

Note that the $n$-dimensional vector $\psi_{A_\varepsilon,j}$ can be regarded as estimates of $\psi_{\varepsilon,j}$ at the observed values $X_1, \ldots, X_n$. In other words, let

$$\hat{\lambda}_{\varepsilon,j} \equiv \lambda_{A_\varepsilon,j} \quad \text{and} \quad \hat{\psi}_{\varepsilon,j}(X_i) \equiv \psi_{A_\varepsilon,j}(i) \tag{10}$$

for $j = 0, 1, \ldots$. We estimate the function $\psi_{\varepsilon,j}$ at values of $x$ not corresponding to one of the $X_i$'s using the so-called Nyström method. The idea is to first rearrange the eigenfunction–eigenvalue equation as

$$\psi_{\varepsilon,j}(x) = \frac{A_\varepsilon \psi_{\varepsilon,j}(x)}{\lambda_{\varepsilon,j}} = \frac{1}{\lambda_{\varepsilon,j}} \int_{\mathcal{X}} \frac{k_\varepsilon(x,y)}{\int_{\mathcal{X}} k_\varepsilon(x,z)\,dP(z)}\,\psi_{\varepsilon,j}(y)\,dP(y),$$

and then use the kernel-smoothed estimate

$$\hat{\psi}_{\varepsilon,j}(x) = \frac{1}{\hat{\lambda}_{\varepsilon,j}} \sum_{i=1}^{n} \frac{k_\varepsilon(x, X_i)}{\sum_{l=1}^{n} k_\varepsilon(x, X_l)}\,\hat{\psi}_{\varepsilon,j}(X_i) \tag{11}$$

for $j = 0, 1, \ldots, J$.
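The construction in Eqs. 7–9 can be checked numerically. The sketch below (synthetic data, Gaussian kernel, and a standard back-transformation from the symmetric to the Markov eigenproblem — all illustrative assumptions, not the authors' code) verifies Eq. 8 and the weighted orthogonality in Eq. 9:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                      # Gaussian kernel matrix
d = K.sum(axis=1)                          # row sums ("degrees")

# Symmetrized matrix (Eq. 7) shares its eigenvalues with the Markov matrix
A_tilde = K / np.sqrt(np.outer(d, d))
lam, V = np.linalg.eigh(A_tilde)
order = np.argsort(lam)[::-1]              # sort eigenvalues decreasingly
lam, V = lam[order], V[:, order]

# Back-transform to right eigenvectors of A = D^{-1} K (Eq. 8),
# rescaled so that the weighted orthogonality relation (Eq. 9) holds
n = len(X)
Psi = np.sqrt(n * d.sum()) * V / np.sqrt(d)[:, None]
s_hat = d / d.sum()                        # stationary measure (Eq. 6)

A = K / d[:, None]
print(np.allclose(A @ Psi[:, 1], lam[1] * Psi[:, 1]))  # Eq. 8
G = (Psi.T * s_hat) @ Psi / n
print(np.allclose(G, np.eye(n)))                       # Eq. 9
```

The scaling factor in `Psi` is one choice that makes Eq. 9 hold exactly; the first column of `Psi` is then the constant eigenvector with eigenvalue one.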

Our final regression estimator is defined by Eq. 1 with the estimated eigenvectors in Eq. 11 and expansion coefficients computed according to

$$\hat{\beta}_{\varepsilon,j} = \frac{1}{n}\sum_{i=1}^{n} Y_i\,\hat{\psi}_{\varepsilon,j}(X_i)\,\hat{s}_\varepsilon(X_i). \tag{12}$$
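Putting Eqs. 7–12 together gives a compact end-to-end sketch of the estimator; the data, kernel, and bandwidth below are illustrative assumptions. A useful sanity check is that the Nyström extension (Eq. 11) evaluated at a training point reproduces the in-sample eigenvector exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(80, 2))
Y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.normal(size=80)

def kernel(A, B, eps=0.3):
    """Illustrative Gaussian kernel between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (4 * eps))

# In-sample eigenbasis, as in Eqs. 7-10
K = kernel(X, X)
d = K.sum(axis=1)
lam, V = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
n = len(X)
Psi = np.sqrt(n * d.sum()) * V / np.sqrt(d)[:, None]  # satisfies Eq. 9
s_hat = d / d.sum()                                   # Eq. 6

J = 5
beta = (Y * s_hat) @ Psi[:, :J] / n                   # Eq. 12

def psi_nystrom(x_new):
    """Nystrom extension of the first J eigenvectors (Eq. 11)."""
    Kx = kernel(x_new, X)
    W = Kx / Kx.sum(axis=1, keepdims=True)            # row-normalized weights
    return (W @ Psi[:, :J]) / lam[:J]

def f_hat(x_new):                                     # the series estimator, Eq. 1
    return psi_nystrom(x_new) @ beta

# At a training point, the extension reproduces the eigenvector exactly
print(np.allclose(psi_nystrom(X[:5]), Psi[:5, :J]))
```

Prediction at new points is then a single call to `f_hat`; no refitting is needed when $J$ changes, since `beta` is computed coordinate-wise.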
###### Remark 1 (Semi-Supervised Learning, SSL).

The spectral series framework naturally extends to semi-supervised learning (SSL), where in addition to the “labeled” sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ there are $m$ additional “unlabeled” data; i.e., data where the covariates but not the labels are known. Typically $m \gg n$, as collecting data often is less costly than labeling them. By including unlabeled examples (drawn from the same distribution $P$) into the kernel matrix, we can improve our estimates of the eigenfunctions, eigenvalues, and stationary measure. The summation in Equations 9 and 11 will then be over all $n + m$ observations, while Eq. 12 remains the same as before. See e.g. [34, 58] for SSL with Laplacian eigenmaps in the limit of infinite unlabeled data, i.e., in the limit $m \to \infty$.

### 2.5 Loss Function and Tuning of Parameters

We measure the performance of an estimator $\hat{f}$ via the loss function

$$L(f, \hat{f}) = \int_{\mathcal{X}} \left(f(x) - \hat{f}(x)\right)^2 dP(x). \tag{13}$$

To choose tuning parameters (such as the kernel bandwidth $\varepsilon$ and the number of basis functions $J$), we split the data into a training and a validation set. For each choice of $\varepsilon$ and a sufficiently large constant $J_{\max}$, we use the training set and Eqs. 11–12 to estimate the eigenvectors and the expansion coefficients. We then use the validation set $(X'_1, Y'_1), \ldots, (X'_n, Y'_n)$ to compute the estimated loss

$$\hat{L}(f, \hat{f}) = \frac{1}{n}\sum_{i=1}^{n}\left(Y'_i - \hat{f}(X'_i)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(Y'_i - \sum_{j=0}^{J} \hat{\beta}_{\varepsilon,j}\hat{\psi}_{\varepsilon,j}(X'_i)\right)^2 \tag{14}$$

for different values of $J \leq J_{\max}$. We choose the $(\varepsilon, J)$-model with the lowest estimated loss on the validation set.

The computation for fixed $\varepsilon$ and different $J$ is very fast. Due to orthogonality of the basis, the estimates $\hat{\psi}_{\varepsilon,j}$ and $\hat{\beta}_{\varepsilon,j}$ depend on $\varepsilon$ but not on $J$.
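Because the coefficients do not depend on $J$, the validation loss of Eq. 14 for every $J \leq J_{\max}$ follows from a single cumulative sum rather than $J_{\max}$ separate fits. The sketch below uses stand-in values for the estimated basis and coefficients (illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(3)
n_val, Jmax = 200, 30
Psi_val = rng.normal(size=(n_val, Jmax))              # stand-in for the estimated basis
beta = rng.normal(size=Jmax) / (1 + np.arange(Jmax))  # stand-in coefficients
Y_val = Psi_val[:, :5] @ beta[:5] + 0.1 * rng.normal(size=n_val)

# Cumulative predictions: column J-1 holds the fit using the first J terms
preds = np.cumsum(Psi_val * beta, axis=1)
losses = ((Y_val[:, None] - preds) ** 2).mean(axis=0)  # Eq. 14 for every J
J_best = int(np.argmin(losses)) + 1

# Check against the naive loop that refits for each J
naive = [((Y_val - Psi_val[:, :J] @ beta[:J]) ** 2).mean()
         for J in range(1, Jmax + 1)]
print(np.allclose(losses, naive))
```

The cumulative-sum version costs one pass over the coefficient matrix, versus one matrix product per candidate $J$ in the naive loop.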

### 2.6 Scalability

The spectral series estimator is faster than most traditional approaches in high dimensions. Once the kernel matrix has been constructed, the eigendecomposition takes the same amount of time for all values of $J$.

In terms of scalability for large data sets, one can dramatically reduce the computational cost by implementing fast approximate eigendecompositions. For example, the Randomized SVD by Halko et al. cuts down the cost of the eigendecomposition substantially with little impact on statistical performance (see Fig. 9). According to Halko et al., these randomized methods are especially well-suited for parallel implementation, which is a topic we will explore in future work.
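A bare-bones version of the randomized scheme of Halko et al. (randomized range finder plus a small Rayleigh–Ritz problem; oversampling and iteration counts below are illustrative choices) recovers the leading eigenpairs of a kernel matrix:

```python
import numpy as np

def randomized_eigh(A, J, n_oversample=10, n_iter=4, seed=0):
    """Approximate top-J eigenpairs of a symmetric psd matrix A via a
    randomized range finder (in the spirit of Halko et al.)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Q = rng.normal(size=(n, J + n_oversample))
    for _ in range(n_iter):                # subspace iterations sharpen the range
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A @ Q                        # small (J+p) x (J+p) problem
    lam, U = np.linalg.eigh(B)
    order = np.argsort(lam)[::-1][:J]
    return lam[order], Q @ U[:, order]

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2)                            # Gaussian kernel matrix
lam_r, _ = randomized_eigh(K, J=5)
lam_full = np.sort(np.linalg.eigvalsh(K))[::-1][:5]
print(np.allclose(lam_r, lam_full, rtol=1e-3))
```

Only matrix–matrix products with `A` and a small dense eigenproblem are needed, which is what makes the method parallelize well.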

## 3 Connection to Other Work

### 3.1 Linear Regression with Transformed Data

One can view our series model as a (weighted) linear regression after a data transformation $x \mapsto (\psi_1(x), \ldots, \psi_J(x))$, where $\psi_1, \ldots, \psi_J$ are the first eigenvectors of the diffusion operator. By increasing $J$, the dimension of the feature space, we achieve more flexible, fully nonparametric representations. Decreasing $J$ adds more structure to the regression, as dictated by the eigenstructure of the data.

Eq. 12 is similar to a weighted least squares (WLS) solution to a linear regression in the transformed variables, but with an efficient orthogonal series implementation and no issues with collinear variables. Define the matrix of predictors

$$Z = \begin{pmatrix} 1 & \psi_1(X_1) & \cdots & \psi_J(X_1) \\ 1 & \psi_1(X_2) & \cdots & \psi_J(X_2) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \psi_1(X_n) & \cdots & \psi_J(X_n) \end{pmatrix}, \tag{15}$$

and introduce the weight matrix

$$W = \begin{pmatrix} s(X_1) & 0 & \cdots & 0 \\ 0 & s(X_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s(X_n) \end{pmatrix}, \tag{16}$$

where the $\psi_j$'s and $s$ are estimated from data (Equations 6 and 10). Suppose that $Y = Z\beta + \epsilon$, where $Y = (Y_1, \ldots, Y_n)^T$, $\beta = (\beta_0, \ldots, \beta_J)^T$, and the random vector $\epsilon$ represents the errors. By minimizing the weighted residual sum of squares

$$(Y - Z\beta)^T W (Y - Z\beta), \tag{17}$$

we arrive at the WLS estimator

$$\hat{\beta} = (Z^T W Z)^{-1}(Z^T W Y) = \frac{1}{n} Z^T W Y, \tag{18}$$

where the matrix $W$ puts more weight on observations in high-density regions. This expression is equivalent to Eq. 12.

Note that, thanks to the orthogonality property $Z^T W Z = nI$, model search and model selection are feasible even for complex models with very large $J$. This is in clear contrast with standard multiple regression, where one needs to recompute the estimates for each model with a different $J$, invert the matrix $Z^T W Z$, and potentially deal with inputs (columns of the design matrix $Z$) that are linearly dependent.
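The equivalence between the general WLS solution (Eq. 18) and the simple projection (Eq. 12) can be checked numerically. The sketch below (illustrative data and kernel; the scaled eigenvectors play the role of the columns of $Z$, with the constant eigenvector standing in for the intercept column) verifies that no matrix inversion is needed:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(70, 2))
Y = X[:, 0] + 0.1 * rng.normal(size=70)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2)
d = K.sum(axis=1)
lam, V = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
V = V[:, np.argsort(lam)[::-1]]
n = len(X)
Psi = np.sqrt(n * d.sum()) * V / np.sqrt(d)[:, None]  # satisfies Eq. 9

J = 6
Z = Psi[:, :J + 1]                    # first column is the constant eigenvector
W = np.diag(d / d.sum())              # weights s(X_i), as in Eq. 16
beta_wls = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ Y)  # full WLS (Eq. 18, left)
beta_proj = (Z.T @ W @ Y) / n         # projection (Eq. 18, right / Eq. 12)
print(np.allclose(beta_wls, beta_proj))
```

Since $Z^T W Z = nI$ by construction, the normal equations collapse to a single matrix–vector product.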

###### Remark 2 (Heteroscedasticity).

More generally, let $\sigma(x)$ be a non-negative function rather than a constant, and let $\epsilon_1, \ldots, \epsilon_n$ be iid realizations of a random variable with zero mean and unit variance. Consider the regression model $Y_i = f(X_i) + \sigma(X_i)\epsilon_i$. We can handle heteroscedastic errors by applying the same framework as above to a rescaled regression function.

### 3.2 Kernel Machine Learning and Regularization in RKHS

Kernel-based regularization methods use similar approximation spaces to ours. In kernel machine learning [45, 14], one often considers the variational problem

$$\min_{f \in \mathcal{H}_k}\left[\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \gamma \|f\|^2_{\mathcal{H}_k}\right], \tag{19}$$

where $L$ is a convex loss function, $\gamma > 0$ is a penalty parameter, and $\mathcal{H}_k$ is the Reproducing Kernel Hilbert Space (RKHS) associated with a symmetric positive semi-definite kernel $k$.³

³To every continuous, symmetric, and positive semi-definite kernel $k$ is associated a unique RKHS. This RKHS is defined to be the closure of the linear span of the set of functions $\{k(x, \cdot) : x \in \mathcal{X}\}$, with the inner product satisfying the reproducing property $\langle k(x, \cdot), f \rangle_{\mathcal{H}_k} = f(x)$ for all $f \in \mathcal{H}_k$.

Penalizing the RKHS norm imposes smoothness conditions on possible solutions. Now suppose that

$$k(x,y) = \sum_{j=0}^{\infty} \lambda_j \phi_j(x)\phi_j(y),$$

where the RKHS inner product is related to the $L^2$-inner product according to $\langle \phi_i, \phi_j \rangle_{\mathcal{H}_k} = \delta_{i,j}/\lambda_j$. Eq. 19 is then equivalent to considering eigen-expansions

$$f(x) = \sum_{j=0}^{\infty} \beta_j \phi_j(x),$$

and seeking solutions to the constrained empirical risk minimization over the hypothesis space

$$B_r = \{f \in \mathcal{H}_k : \|f\|_{\mathcal{H}_k} \leq r\}, \tag{20}$$

a ball of the RKHS with radius $r$, where the RKHS norm is given by $\|f\|^2_{\mathcal{H}_k} = \sum_{j=0}^{\infty} \beta_j^2/\lambda_j$.

Here are some key observations:

(i) The above setting is similar to ours. The regularization in Eq. 19 differentially shrinks contributions from higher-order terms with small eigenvalues $\lambda_j$. In spectral series, we use a projection (i.e., a basis subset selection) method, but the empirical performance is usually similar.

(ii) There are some algorithmic differences, as well as differences in how the two regression estimators are analyzed and interpreted. In our theoretical work, we consider Gaussian kernels with flexible variances; that is, we choose the approximation spaces in a data-dependent way (cf. multi-kernel regularization schemes for SVMs) so that the estimator can adapt to sparse structure and the intrinsic dimension of the data. Most theoretical work in kernel machine learning assumes a fixed RKHS.

(iii) There are also other differences. Support vector machines and other kernel-based regularization methods (such as splines, ridge regression, and radial basis functions) never explicitly compute the eigenvectors of the kernel. Instead, these methods rely on the classical Representer Theorem, which states that the solution to Eq. 19 is a finite expansion of the form $f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$. The original infinite-dimensional variational problem is then reduced to a finite-dimensional optimization of the coefficients $\alpha_i$. In a naive least-squares implementation, however, one has to recompute these coefficients for each choice of the penalty parameter $\gamma$, which can make cross-validation cumbersome. In our spectral series approach, we take advantage of the orthogonality of the basis for fast model selection and computation of the parameters. As in spectral clustering, we also use eigenvectors to organize and visualize data that can otherwise be hard to interpret.
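The shrinkage-versus-projection contrast of points (i) and (iii) can be made concrete with kernel ridge regression (squared-error loss in Eq. 19). The sketch below (illustrative data and kernel) verifies the standard identity that the representer-theorem solution shrinks each eigen-coefficient by $\lambda_j/(\lambda_j + n\gamma)$, whereas a series estimator would keep the first $J$ coefficients and drop the rest:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = rng.uniform(-1, 1, size=(n, 1))
Y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)
K = np.exp(-(X - X.T) ** 2 / 0.5)          # Gaussian kernel matrix

gamma = 1e-3
# Representer-theorem solution: f = sum_i alpha_i k(., x_i)
alpha = np.linalg.solve(K + n * gamma * np.eye(n), Y)
f_krr = K @ alpha                          # ridge fit at the training points

# Same fit in the eigenbasis of K: coefficients are shrunk, not truncated
lam, U = np.linalg.eigh(K)
coef = (U.T @ Y) * lam / (lam + n * gamma)
f_eig = U @ coef
print(np.allclose(f_krr, f_eig))
```

Note that `alpha` must be recomputed for every `gamma`, which is exactly the cross-validation cost the orthogonal series approach avoids.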

### 3.3 Manifold Regularization and Semi-Supervised Learning

Our spectral series method is closely related to Laplacian-based regularization: Belkin et al. extend the kernel-based regularization framework to incorporate additional information about the geometric structure of the marginal distribution $P_X$. Their idea is to add a data-dependent penalty term to Eq. 19 that controls the complexity as measured by the geometry of the distribution. Suppose that one is given labeled data $(x_1, y_1), \ldots, (x_n, y_n)$ as well as unlabeled data $x_{n+1}, \ldots, x_{n+m}$, where in general $m \gg n$. (The limit $m \to \infty$ corresponds to having full knowledge of $P_X$.) Under the assumption that the support of $P_X$ is a compact submanifold of $\mathbb{R}^D$, the authors propose minimizing a graph-Laplacian regularized least squares function

$$\min_{f \in \mathcal{H}_k}\left[\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \gamma_A \|f\|^2_{\mathcal{H}_k} + \frac{\gamma_I}{n+m}\sum_{i,j=1}^{n+m}\left(f(x_i) - f(x_j)\right)^2 W_{i,j}\right], \tag{21}$$

where the $W_{i,j}$'s are the edge weights in the graph, and the last Laplacian penalty term favors functions for which $f(x_i)$ is close to $f(x_j)$ when $x_i$ and $x_j$ are connected with large weights.

Note that the eigenbasis of our row-stochastic matrix $A_\varepsilon$ minimizes the distortion in Eq. 21 if one regards the entries of $A_\varepsilon$ as the weights $W_{i,j}$. Indeed, the eigenvector $\psi_{A_\varepsilon,j}$ minimizes the penalty term subject to being orthogonal to $\psi_{A_\varepsilon,i}$ for $i < j$. Hence, including a Laplacian penalty term is comparable to truncating the eigenbasis expansion in spectral series. Moreover, the semi-supervised regularization in Eq. 21 is similar to a semi-supervised version of our spectral series approach, where we first use both labeled and unlabeled data to compute the eigenbasis, and then extend the eigenfunctions according to Eq. 11 via a (potentially wider) kernel. The main downside of the Laplacian-based framework above is that it is hard to analyze theoretically. As with other kernel-based regularizers, the method also does not explicitly exploit eigenvectors and orthogonal bases.
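The pairwise penalty in Eq. 21 is the classical graph Laplacian quadratic form, $\sum_{i,j}(f_i - f_j)^2 W_{i,j} = 2 f^T (D - W) f$ with $D = \mathrm{diag}(W\mathbf{1})$, which is why eigenvectors of the (normalized) Laplacian minimize it. A generic numerical check (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
W = rng.uniform(size=(n, n))
W = (W + W.T) / 2                                 # symmetric edge weights
f = rng.normal(size=n)
L = np.diag(W.sum(axis=1)) - W                    # unnormalized graph Laplacian
direct = ((f[:, None] - f[None, :]) ** 2 * W).sum()
print(np.allclose(direct, 2 * f @ L @ f))         # quadratic-form identity
```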

## 4 Choice of Kernel

In the RKHS literature, there is a long list of commonly used kernels. These include, e.g., the Gaussian kernel, polynomial kernels, and the thin-plate spline kernel. In our numerical experiments (Sec. 6), we consider both Gaussian and polynomial kernels, but throughout the rest of the paper, we primarily work with the (row-normalized) Gaussian kernel. There are several reasons for this choice:

1. The Gaussian kernel can be interpreted as the heat kernel in a manifold setting [3, 21]. We will take advantage of this connection in the theoretical analysis of the spectral series estimator (Sec. 5).

2. The eigenfunctions of the Gaussian kernel are simultaneously concentrated in time (i.e., space) and frequency, and are particularly well-suited for estimating functions that are smooth with respect to a low-dimensional data distribution.

The following two examples illustrate some of the differences in the eigenbases of Gaussian and polynomial kernels:

###### Example 3.

Suppose that $X \sim U(-1, 1)$ on the real line. Fig. 2, left, shows the eigenfunctions of a third-order polynomial kernel. These functions are smooth but have large values outside the support of $P$. Contrast this eigenbasis with the eigenfunctions in Fig. 2, right, of a Gaussian kernel. The latter functions are concentrated on the support of $P$ and are orthogonal on the support of $P$ as well as on the real line.

Figure 2: Uniform distribution U(-1,1). Eigenfunctions of a third-order polynomial kernel, left, and of an (un-normalized) Gaussian kernel, right. The latter eigenfunctions form an orthogonal Fourier-like basis concentrated on the support of the data distribution.
###### Example 4.

Consider data around a noisy spiral:

$$\begin{cases} x(u) = \sqrt{u}\cos(\sqrt{u}) + \epsilon_x \\ y(u) = \sqrt{u}\sin(\sqrt{u}) + \epsilon_y, \end{cases}$$

where $u$ is a uniform random variable, and $\epsilon_x$ and $\epsilon_y$ are normally distributed random variables. The eigenfunctions of a polynomial kernel do not adapt well to the underlying distribution of the data. Fig. 3, left, for example, is a contour plot of the Nyström extension of the fourth empirical eigenvector of a third-order polynomial kernel. In contrast, the eigenfunctions of a Gaussian diffusion kernel vary smoothly along the spiral direction, forming a Fourier-like basis with orthogonal eigenfunctions that concentrate around high-density regions; see Fig. 3, right.

Figure 3: Spiral data. Contour plots of the fourth eigenvector of a third-order polynomial kernel (left) and the fourth eigenvector of a Gaussian diffusion kernel (right). The latter eigenvector is localized and varies smoothly along the spiral direction.

In high dimensions, Gaussian extensions can be seen as a generalization of prolate spheroidal wave functions . Prolates were originally introduced by Slepian and Pollak as the solution to the problem of simultaneously and optimally concentrating a function and its Fourier content (see  for a fascinating account of this development in Fourier analysis and modeling). The band-limited functions that maximize their energy content within a space domain are extensions of the eigenfunctions of the integral operator of a Bessel kernel restricted to  [12, Section 3.1]. In high dimensions, Bessel and Gaussian kernels are equivalent , suggesting that the eigenfunctions of the Gaussian kernel are nearly optimal.

However, although Gaussian kernels have many advantages, they may not always be the best choice in practice. Ultimately, this is determined by the application and by what the best measure of similarity between two data points would be. Our framework suggests a principled way of selecting the best kernel for regression: Among a set of reasonable candidate kernels, choose the estimator with the smallest empirical loss according to Eq. 14. We will, for example, use this approach in Sec. 6 to choose the optimal degree for a set of polynomial kernels of the form .
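As an illustration of this selection strategy, the following sketch implements a bare-bones version of a series estimator (un-normalized kernel, top-J eigenvectors of the Gram matrix, Nyström extension to new points) and picks the polynomial degree with the smallest squared-error loss on a held-out set, used here as a stand-in for the empirical loss of Eq. 14. All sample sizes, the truncation level J, and the synthetic regression function are illustrative assumptions, not choices from the paper.

```python
import numpy as np

def series_fit_predict(Xtr, ytr, Xte, kernel, J):
    """Minimal series sketch: expand the regression in the top-J
    empirical eigenvectors of the Gram matrix, then Nystrom-extend."""
    n = len(Xtr)
    evals, evecs = np.linalg.eigh(kernel(Xtr, Xtr))
    idx = np.argsort(-evals)[:J]
    lam, V = evals[idx], evecs[:, idx]
    keep = lam > 1e-8 * lam[0]            # drop numerically null directions
    lam, V = lam[keep], V[:, keep]
    beta = ytr @ V / np.sqrt(n)           # empirical expansion coefficients
    Psi = np.sqrt(n) * kernel(Xte, Xtr) @ V / lam  # Nystrom extension
    return Psi @ beta

def poly_kernel(d):
    return lambda A, B: (1.0 + A @ B.T) ** d

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(np.pi * X[:, 0]) + rng.normal(scale=0.1, size=200)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]

# Choose the polynomial degree with smallest empirical validation loss
losses = {d: float(np.mean((series_fit_predict(Xtr, ytr, Xva,
                                               poly_kernel(d), J=8)
                            - yva) ** 2))
          for d in (1, 2, 3, 4, 5)}
best = min(losses, key=losses.get)
```

On such a sine-shaped target, low-degree polynomial kernels underfit, so the empirical loss favors at least a cubic kernel.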

Normalization of Local Kernels. In the RKHS literature, it is standard to work with “unnormalized” kernels. In spectral clustering, on the other hand, researchers often use the “stochastic” and “symmetric” normalization schemes in Eq. 5 and Eq. 7, respectively. We have found (Sec. 6) that the exact normalization often has little effect on regression performance. Nevertheless, we choose the row-stochastic kernel for reasons of interpretation and analysis: First, the limit of the bandwidth is well-defined, and there is a series of works on the convergence of the graph Laplacian to the Laplace-Beltrami operator on Riemannian manifolds [11, 5, 24, 47, 19]. Fourier functions originate from solving a Laplace eigenvalue problem on a bounded domain; hence, the eigenfunctions of the diffusion operator can be seen as a generalization of Fourier series to manifolds.

Moreover, the row-stochastic kernel yields less variable empirical eigenfunctions than the unnormalized or symmetric forms. As an illustration, consider the noisy spiral data in Example 4. Fig. 4 shows the estimated projections onto the spiral direction of the eigenfunctions of the symmetric and the stochastic forms; see the left and right plots, respectively. The eigenfunctions are clearly smoother in the latter case. By construction, the empirical eigenfunctions of the symmetric operator are orthogonal with respect to the empirical distribution , whereas the estimated eigenfunctions of the stochastic operator are orthogonal with respect to the smoothed data distribution . The kernel bandwidth defines the scale of the analysis.

Figure 4: Projection onto the spiral direction (t=√u) of the eigenvectors of the symmetric graph Laplacian, left, and of the Gaussian diffusion kernel, right. The latter estimates are less noisy.
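The relation between the two normalization schemes is straightforward to verify numerically: writing S = D^{-1/2} K D^{-1/2} for the symmetric form and A = D^{-1} K for the stochastic form (symbols here are illustrative, matching standard spectral-clustering notation rather than a specific equation in this paper), the map u_j ↦ D^{-1/2} u_j carries eigenvectors of S to eigenvectors of A with identical eigenvalues.

```python
import numpy as np

# Arbitrary toy data and bandwidth, for illustration only
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / 0.5)
d = K.sum(axis=1)

# Symmetric normalization S = D^{-1/2} K D^{-1/2}
S = K / np.sqrt(d[:, None] * d[None, :])
evals, U = np.linalg.eigh(S)
evals, U = evals[::-1], U[:, ::-1]

# Rescaling u_j -> D^{-1/2} u_j gives eigenvectors of the
# row-stochastic form A = D^{-1} K, with the same eigenvalues
Psi = U / np.sqrt(d)[:, None]
A = K / d[:, None]
residual = max(np.max(np.abs(A @ Psi[:, j] - evals[j] * Psi[:, j]))
               for j in range(3))
```

So the two forms share one spectrum, and the choice between them amounts to a choice of eigenvector scaling (and hence of the measure with respect to which the basis is orthogonal).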

## 5 Theory

In this section, we derive theoretical bounds on the loss (Eq. 13) of a series regression estimator with a radial kernel for a standard fixed RKHS setting (Theorem 13), as well as a setting where the kernel bandwidth varies with the sample size (Theorem 14). We also further elaborate on the connection between spectral series and Fourier analysis by generalizing the well-known link between Sobolev differentiable signals and the approximation error in a Fourier basis.

Using the same notation as before, let

 f(x) = ∑_{j=0}^{∞} βε,j ψε,j(x),  fε,J(x) = ∑_{j=0}^{J} βε,j ψε,j(x),  ˆfε,J(x) = ∑_{j=0}^{J} ˆβε,j ˆψε,j(x),

where and . We write

 |f(x) − ˆfε,J(x)|² ≤ 2|f(x) − fε,J(x)|² + 2|fε,J(x) − ˆfε,J(x)|²,

and refer to the two terms as “bias” and “variance”; the factor of 2 comes from the elementary inequality (a+b)² ≤ 2a² + 2b². Hence, we define the integrated bias and variance

 Lbias = ∫_X |f(x) − fε,J(x)|² dP(x),

and

 Lvar = ∫_X |fε,J(x) − ˆfε,J(x)|² dP(x),

and bound the two components separately. Our assumptions are:

(A1) The distribution has compact support and a bounded density.

(A2) The weights are positive and bounded; that is, they lie between two positive constants that do not depend on ε.

(A3) The psd operator has nondegenerate eigenvalues; i.e.,

 1 ≡ λε,0 > λε,1 > λε,2 > … > λε,J > 0.

(A4) For all and , there exists some constant (not depending on ) such that

 E[|ˆφε,j(X)−φε,j(X)|2]

where and .

Without loss of generality, we assume that the eigenfunctions are estimated using an unlabeled sample that is drawn independently of the data used to estimate the coefficients . This simplifies the proofs and can always be achieved by splitting the data into two sets.

### 5.1 Bias

A key point is that the approximation error of the regression depends on the smoothness of relative to . Here we present two different calculations of the bias based on two related notions of smoothness. The first notion is standard in the kernel literature and is based on RKHS norms. The second notion is based on our diffusion framework and can be seen as a generalization of Sobolev differentiability.

#### Method 1: Smoothness measured by RKHS norm.

Let where is a strictly positive number. Under previous assumptions, this kernel is symmetric and psd with a unique RKHS, which we denote by . A standard way to measure the smoothness of a function in an RKHS is through its RKHS norm (see, e.g., ). One can then define the function classes

 Hε,M = {f ∈ Hε : ∥f∥Hε ≤ M},

where is a positive number dependent on .

###### Proposition 5.

Assume , where . Then,

 Lbias = O(M λε,J).

For fixed , contains “smoother” functions for smaller values of .

#### Method 2: Smoothness measured by diffusion operator.

Alternatively, let

 Gε=Aε−Iε, (22)

where is the identity. The operator has the same eigenvectors as the differential operator . Its eigenvalues are given by where are the eigenvalues of . Define the functional

 Jε(f)=−⟨Gεf,f⟩ε (23)

which maps a function into a non-negative real number. For small , measures the variability of the function with respect to the distribution . The expression is a variation of the graph Laplacian regularizers popular in semi-supervised learning . In fact, a Taylor expansion yields where is the gradient operator and is the psd Laplace operator in . In kernel regression smoothing, the extra term is regarded as an undesirable additional bias, known as the design bias. In classical regression, it is removed by using local linear smoothing , which is asymptotically equivalent to replacing the original kernel by the bias-corrected kernel .
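An empirical version of the functional in Eq. 23 can be computed directly from the kernel matrix. The sketch below (sample, bandwidth, and test functions are illustrative assumptions) shows its qualitative behavior: a function that varies smoothly over the data is nearly invariant under local averaging and scores far lower than white noise, while a constant function scores exactly zero.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-1.0, 1.0, 300))

# Row-stochastic diffusion kernel; bandwidth is an illustrative choice
eps = 0.01
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (4 * eps))
A = K / K.sum(axis=1, keepdims=True)

def J_emp(f):
    # Empirical analogue of J_eps(f) = -<G_eps f, f> with G_eps = A - I:
    # small when f is nearly invariant under local averaging by A
    return float(f @ (f - A @ f) / len(f))

smooth = np.sin(np.pi * x)       # varies slowly over the support
rough = rng.normal(size=len(x))  # white noise, no spatial structure
```

Here `J_emp(smooth)` is orders of magnitude smaller than `J_emp(rough)`, which is the sense in which the functional measures variability with respect to the data distribution.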

The following result bounds the approximation error of an orthogonal series expansion of . The bound is consistent with Theorem 2 in , which applies to the more restrictive setting of SSL with infinite unlabeled data and . Our result holds for all and and does not assume unlimited data.

###### Proposition 6.

For ,

 ∫_X |f(x) − fε,J(x)|² dSε(x) ≤ Jε(f) ν²ε,J+1, (24)

and consequently

 Lbias = O(Jε(f) ν²ε,J+1),

where is the eigenvalue of .

#### Smoothness and Sparsity

In the limit , we have several interesting results, including a generalization of the classical connection between Sobolev differentiability and the error decay of Fourier approximations [31, Section 9.1.2] to a setting with adaptive bases and high-dimensional data. We denote the quantities derived from the bias-corrected kernel by , , and so forth.

###### Definition 7.

(Smoothness relative to P) A function is smooth relative to if

 ∫_X ∥∇f(x)∥² dS(x) ≤ c² < ∞,

where is the stationary distribution of the random walk on the data as . The smaller the value of , the smoother the function.

###### Lemma 8.

For functions whose gradients vanish at the boundary,

 lim_{ε→0} J∗ε(f) = ∫_X ∥∇f(x)∥² dS(x).

This is similar to the convergence of the (un-normalized) graph Laplacian regularizer to the density-dependent smoothness functional  .

Next we will see that smoothness relative to (Definition 7) and sparsity (with respect to the norm) in the eigenbasis of the diffusion operator (Definition 9 below) are essentially equivalent. As a result, we can link smoothness and sparsity to the rate of error decay of the eigenbasis approximation.

###### Definition 9.

(Sparsity in ) A set of real numbers lies in a Sobolev ellipsoid if for some number . For a given basis , let

 WB(s, c) = {f = ∑_j βj ψj : β1, β2, … ∈ Θ(s, c)}

where . Functions in are sparse in . The larger the value of , the sparser the representation.

###### Theorem 10.

Assume that are the eigenvectors of with eigenvalues for some . Let . Then, the following two statements are equivalent:

1. (smoothness relative to P)

2. (sparsity in ).

Furthermore, sparsity in (or smoothness relative to P) implies

 ∫_X |f(x) − fJ(x)|² dS(x) = o(1/J^{2s}).

The rate of the error decay depends on the dimension of the data. We will address this issue in Sec. 5.3.
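The decay claim in Theorem 10 can be checked numerically for a concrete coefficient sequence. In the sketch below, βj = j^{−(s+1)} is an assumed decay (chosen only so that ∑_j j^{2s} βj² converges); the scaled tail J^{2s} ∑_{j>J} βj², which tracks the approximation error of the J-term expansion, is then seen to shrink toward zero, consistent with an o(1/J^{2s}) rate.

```python
import numpy as np

# Illustrative coefficient sequence beta_j = j^{-(s+1)}, which lies in a
# Sobolev-type ellipsoid of order s (sum_j j^{2s} beta_j^2 < infinity)
s = 1.0
j = np.arange(1, 200001, dtype=float)
beta2 = j ** (-2.0 * (s + 1.0))          # beta_j^2

# Reverse cumulative sum, so rtail[J] = sum_{j > J} beta_j^2
# (summing from the smallest terms keeps the tails numerically accurate)
rtail = np.cumsum(beta2[::-1])[::-1]

# J^{2s} * tail(J) should decrease toward 0, i.e. tail = o(1 / J^{2s})
Js = np.array([10, 100, 1000, 10000])
scaled = Js ** (2 * s) * rtail[Js]
```

For s = 1 the scaled tail behaves like 1/(3J), so it drops by roughly a factor of ten between successive entries of `Js`.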

### 5.2 Variance

The matrix (defined in Eq. 5) can be viewed as a perturbation of the integral operator due to finite sampling. To estimate the variance, we bound the difference , where are the eigenvectors of , and are the Nyström extensions (Eq. 11) of the eigenvectors of . We adopt a strategy from Rosasco et al., which is to introduce two new integral operators that are related to and but both act on an auxiliary RKHS of smooth functions (see Appendix A.2 for details); this auxiliary space only enters the intermediate derivations and plays no role in the error analysis of the algorithm itself. As before, we write to indicate that we let the kernel bandwidth depend on the sample size .

###### Proposition 11.

Let and as . Under assumptions (A1)-(A4) and ,

 ∥ψε,j − ˆψε,j∥_{L²(X,P)} = O_P(γn δε,j),

where and

Let