1 Introduction
With rapid technological advancements in data collection and processing, massive large-scale and high-dimensional datasets are widely available nowadays in diverse research fields such as astronomy, business analytics, human genetics and microbiology. A common feature of these datasets is that their statistical and geometric properties can be well understood via a meaningful low-rank representation of reduced dimensionality. Learning low-dimensional structures from these high-dimensional noisy data is one of the central topics in statistics and data science. Moreover, nonlinear structures have been found predominant and intrinsic in these real-world datasets, which may not be easily captured or preserved by commonly used linear or quasi-linear methods such as principal component analysis (PCA), singular value decomposition (SVD)
[38] and multidimensional scaling (MDS) [12]. As a long-standing and well-recognized technique for analyzing datasets with possibly intrinsic nonlinear structures, kernel methods have been shown effective in various applications ranging from clustering and data visualization to classification and prediction [56, 34, 43]. On the other hand, spectral methods, as a fundamental tool for dimension reduction, are oftentimes applied in combination with kernel methods to better capture the underlying low-dimensional nonlinear structure in the data. These approaches are commonly referred to as nonlinear dimension reduction techniques; see Section 1.1 below for a brief overview.

Despite the effectiveness and success of the kernel-spectral methods in many applications, these methods are usually applied haphazardly, especially in terms of tuning parameter selection and interpretation of the results. For example, as a key step in these methods, the construction of the kernel (or affinity) matrices requires specifying a proper bandwidth parameter, which is commonly determined by unjustified heuristics or conventional empiricism. This is mainly due to a lack of theoretical understanding of the methods, including their intrinsic objective, their range of applicability, and their susceptibility to noise or to the dimensionality of the data. Likewise, the absence of a theoretical foundation may significantly constrain the users’ recognition or exploitation of the full potential of the methods. Given the indispensable status and the practical power of these methods in many fields of application, such as single-cell transcriptomics
[72, 49, 41] and medical informatics [1, 59], there is a pressing need for a theoretically justified kernel-spectral method that accounts for both the high-dimensionality and the noisiness of the data.

In this study, we focus on an integral operator approach to learning low-dimensional nonlinear structures from high-dimensional noisy data. We propose a kernel-spectral embedding algorithm with a data-adaptive bandwidth selection procedure (cf. Algorithm 1), justified by rigorous theoretical analysis, and provide solid interpretations of the low-dimensional embeddings by establishing their asymptotic behavior in the large system limit.
Specifically, we consider the following manifold “signal-plus-noise” data-generative model
(1.1) 
where are the observed samples, are the underlying noiseless samples drawn independently from a low-dimensional smooth manifold of interest embedded into , and are independent sub-Gaussian noise vectors (see Section 1.3 for the definition) with covariance . As will be seen in Section 3.1, (1.1) is closely related to the spiked covariance model [37] extensively studied in the past two decades. Throughout, we focus on the high-dimensional setting where the number of variables is comparable to the sample size , that is, for some constant , we have
(1.2) 
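To make the setup concrete, the following sketch draws samples from one instance of the model (1.1): noiseless points on a circle (a one-dimensional manifold) embedded into the first two coordinates of a high-dimensional ambient space, observed with additive Gaussian (hence sub-Gaussian) noise. The function name and all parameter values are hypothetical choices for illustration only.

```python
import numpy as np

def sample_noisy_manifold(n=500, p=400, noise_sd=0.3, seed=0):
    """Toy instance of the manifold 'signal-plus-noise' model (1.1):
    noiseless samples on a circle embedded in the ambient space,
    observed with independent Gaussian noise. Illustrative only."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Embed the circle into the first two coordinates of the ambient space.
    X = np.zeros((n, p))
    X[:, 0] = np.cos(theta)
    X[:, 1] = np.sin(theta)
    Z = noise_sd * rng.standard_normal((n, p))  # sub-Gaussian noise
    return X + Z, X, theta                       # observed, noiseless, latent
```

In the high-dimensional regime (1.2), `p` is taken comparable to `n`, as in the default values above.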
As a general and important problem in nonlinear dimension reduction and manifold learning, our goal is to find a well-justified low-dimensional embedding of that captures the nonlinear structure of the underlying manifold , as encapsulated in the noiseless random samples . To facilitate our discussion, we define the Gaussian kernel matrix , where
(1.3) 
and is some bandwidth.
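In code, the Gaussian kernel matrix of (1.3) can be formed directly from pairwise squared distances. The convention of dividing the squared distance by the bandwidth in the exponent is an assumption here, since normalizations vary across the literature.

```python
import numpy as np

def gaussian_kernel_matrix(Y, h):
    """Gaussian kernel matrix with entries exp(-||y_i - y_j||^2 / h),
    a generic reading of (1.3); the exact bandwidth convention is assumed."""
    sq = np.sum(Y**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T  # pairwise squared distances
    np.maximum(D2, 0.0, out=D2)                     # guard tiny negative values
    return np.exp(-D2 / h)
```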
1.1 Some Related Work
Unlike our general setup (1.1) under high-dimensionality (1.2), most of the existing works concern computational algorithms for learning a nonlinear manifold from random observations, based on heuristics developed under either the noiseless (i.e., ) or low-dimensional settings where is fixed. These methods can be roughly separated into two categories, depending on how the data is conceptually represented. On the one hand, there are spectral graph based methods, including Laplacian eigenmap [6], diffusion map (DM) [19]
, vector diffusion map (VDM)
[61], ISOMAP [66], maximal variance unfolding (MVU)
[73], locally linear embedding (LLE) [53], Hessian LLE [23], tSNE [67], UMAP [48] and local tangent space alignment (LTSA) [76], among many others. For more details of these methods, see [44, 68]. Generally speaking, these methods start by constructing a possibly sparse graph using kernels, in which the nodes represent the input patterns and the edges represent the neighborhood relations. Consequently, the resulting graph can be viewed as a discretized approximation of sampled by the inputs. From these graphs, one may apply results from spectral graph theory and construct matrices whose spectral decomposition reveals the low-dimensional structure of . Incidentally, these methods have been used to facilitate various downstream statistical applications, such as spectral clustering
[69, 17], among others. The main differences between these methods are the kernels used to construct the spectral graphs and the associated nonlinear eigenfunctions (or feature maps [34]) employed for embedding (or data representation). On the other hand, there are reproducing kernel Hilbert space (RKHS) based methods, which use eigenfunctions of some kernel map of the RKHS to infer the nonlinear structures of based on noiseless observations [56, 34, 43]. Specifically, these methods start from a certain similarity matrix based on some positive kernel function, and infer the geometric properties from the integral operator associated with the kernel. Similarly, various downstream tasks such as clustering, regression and classification [34, 54, 58, 70] have been treated in combination with the RKHS-based methods. In addition to the above two categories, kernel principal component analysis (kPCA) [55] has also been widely used.

From a mathematical viewpoint, in the low-dimensional and noiseless setting, theoretical properties of some of these methods have been explored. For the graph-based methods, existing theoretical results reveal that, under some regularity conditions, the discretized approximation obtained from various kernel-spectral methods converges to a continuum quantity, which is usually represented as a certain operator capturing the underlying structures of . For example, it has been shown that the discrete graph Laplacian (GL) matrices (cf. (2.6) and (2.7)) used by Laplacian eigenmap and DM converge to the Laplace-Beltrami operator of under various settings and assumptions [7, 15, 24, 31, 32, 60, 62, 74]. For LLE, as shown in [75], after being properly normalized, the similarity matrix converges to the LLE kernel, which is closely related to the Laplace-Beltrami operator. Convergence of tSNE, MVU, VDM and LTSA has been studied in [4, 45, 14, 3, 61, 65]
. For the RKHS-based methods, it has been shown that the empirical eigenvalues and eigenfunctions converge to those of the integral operator associated with the reproducing kernel
[13, 42, 51, 58, 63, 64, 70]. Finally, kPCA has been studied under the noiseless setting [10].

Another important difference between the graph-based methods and the RKHS-based methods lies in the treatment of the kernel matrices (cf. (1.3)). The graph-based methods usually rely on some GL-type operations, where the kernel (affinity) matrix is properly normalized by the graph degrees [20, 68]. In contrast, the RKHS-based methods commonly employ the kernel matrix directly, without further normalization. Consequently, in terms of interpretations, existing theory under the low-dimensional noiseless setting indicates that the embeddings from the graph-based methods are associated with the eigenfunctions of the Laplace-Beltrami operator of , whereas those from the RKHS-based methods are related to the eigenfunctions of some reproducing kernels of .

Despite these endeavors, much less is known when the data is high-dimensional and noisy as modeled by (1.1) and (1.2), and the behavior of the associated kernel random matrix is not well understood
. In the null case (i.e., ), it has been shown in [11, 18, 21, 22, 26, 29] that when the random kernel matrix can be well approximated by a low-rank perturbed random Gram matrix. Consequently, studying under pure noise is closely related to PCA with some low-rank perturbations. Moreover, since in this case the degree matrix is close to a scalar matrix [21], the graph-based methods and the RKHS-based methods are asymptotically equivalent. As for the non-null cases, spectral convergence of has been studied in some special cases in [25] when , and more recently under a general setting in [20]. These results show how the eigenvalues of the kernel matrices of relate to those of . Moreover, as can be seen in [20], in the non-null cases the degree matrix is usually nontrivial, so that the graph-based methods can be very different from the RKHS-based methods.

In what follows, we focus on an RKHS-based integral operator approach to nonlinear dimension reduction and manifold learning under the general setup (1.1) and (1.2). Even though in this setting has been studied to some extent in [20, 25], several pieces are still missing for rigorous and interpretable statistical applications. Firstly, it is unclear how to select the bandwidth adaptively for embedding, that is, free from prior knowledge of
. Secondly, the limiting behavior of the eigenvectors of
has not been analyzed. Such results are particularly relevant for embedding purposes. Finally, even though the convergence of the kernel matrices associated with (cf. (1.3)) and (cf. (3.5)) can be established, it is unclear how they relate to the underlying manifold . Such results are important for appropriate interpretations of the embeddings. We accomplish these goals in this paper.

1.2 An Overview of Main Results and Contributions
Briefly speaking, our proposed kernel-spectral embedding algorithm starts by constructing a Gaussian kernel matrix as defined in (1.3), where is carefully selected by a theory-guided data-adaptive procedure proposed in Section 2. Then we obtain the kernel-spectral embedding of the original data by conducting an eigendecomposition of the scaled kernel matrix , and define the embeddings as the leading eigenvectors of , weighted by their associated eigenvalues (Algorithm 1). As we will show later, as with respect to (1.2), the final embeddings thus constructed are essentially related to, and determined by, a population integral operator defined by the reproducing kernel of an RKHS associated with the underlying manifold . It is in this sense that our proposed method is an integral operator approach.
Rigorous theoretical understanding is obtained for the proposed algorithm, including theoretical justifications for the bandwidth selection procedure, the limiting behavior of the low-dimensional embeddings, and their interpretations in relation to the underlying manifolds. Compared to the existing works on kernel-spectral embeddings (see Section 1.1), the current study has the following methodological advantages and theoretical significance:

We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional noisy data. Unlike most existing methods, our proposal takes into account both the noisiness of the data, characterized by our model (1.1), and its high-dimensionality as in (1.2). To the best of our knowledge, this is the first kernel-spectral embedding method with theoretically guaranteed performance for high-dimensional and noisy data under the non-null settings.

A key component in any kernel-based method is the selection of a proper bandwidth parameter. One major innovation of our proposed method lies in a careful analysis of the asymptotic behavior of kernel random matrices constructed from high-dimensional noisy data, which in turn leads to a theory-informed bandwidth selection procedure guaranteeing the strong performance of the method in a data-adaptive manner. In particular, our bandwidth selection procedure does not rely on prior knowledge about the underlying manifold, and can be implemented efficiently.

We also provide an in-depth theoretical understanding of the proposed method. On the one hand, we study the convergence of the final low-dimensional embeddings to their oracle counterparts based on the noiseless samples in the large system limit, and characterize explicitly the effect of the underlying signal-to-noise ratio on the respective rates of convergence and phase transition. On the other hand, we establish the convergence of the kernel matrices to the population integral operator that captures the intrinsic geometry of the underlying manifold. The second result is essential for our understanding of the final low-dimensional embeddings, interpreted as a finite approximation of the samples projected onto the leading eigenfunctions of the population integral operator.
Our theoretical results explain the empirically observed data-adaptive nature of the proposed method (Section 4), which extracts the nonlinear structure from the data regardless of which kind of manifold is in play. Moreover, our identification of the limiting integral operator also suggests potential advantages of our method over alternative approaches targeting distinct geometric features of the manifold. For example, we may conclude from the theory that our method differs significantly from the graph-based methods, such as Laplacian eigenmap, DM, and LLE, as they all essentially aim at the Laplace-Beltrami operator rather than an integral operator (see Sections 2.2 and 5 for more discussions).
Our theoretical analysis, on the one hand, relies on a general model reduction scheme, detailed in Section 3.1, which simplifies the analysis by connecting the noisy manifold model (1.1) with a spiked covariance model. On the other hand, we also leverage advanced tools from operator theory, such as the theory of RKHS, and from random matrix theory to prove our main results.
The proposed method admits strong empirical performance. In Section 4.1, we present simulation studies that show the superiority and flexibility of the proposed bandwidth selection method over some existing alternatives. In Section 4.2, we analyze three real-world high-dimensional datasets with distinct nonlinear structures, including a path manifold, a circle manifold, and a multiclass mixture manifold, to demonstrate the usefulness of the method. In particular, for each of the examples, our proposed method shows significant improvements over existing state-of-the-art methods in inferring the respective underlying nonlinear structures.
1.3 Organization and Notation
The rest of the paper is organized as follows. In Section 2, we introduce our proposed kernel-spectral embedding algorithm in detail, and point out important differences between our proposal and some existing methods. In Section 3, we study the theoretical properties of the proposed method, including the convergence of the low-dimensional embeddings to their noiseless counterparts, and the spectral convergence of the kernel matrix to an integral operator. In Section 4, we include our simulation studies on bandwidth selection and analyze three real-world datasets to show the numerical performance of the proposed method in various applications. Finally, we discuss some extensions and related problems in Section 5. Technical proofs, additional discussions and numerical results are provided in Appendices A–D.
We finish this section by introducing some notation used in the paper. To streamline our statements, we use the notion of stochastic domination, which is commonly adopted in probability theory to simplify precise statements of the form “
is bounded with high probability by
up to small powers of .”

Definition 1.1 (Stochastic domination).
Let
be two families of nonnegative random variables, where
is a possibly dependent parameter set. We say that is stochastically dominated by , uniformly in the parameter , if for all small and large , there exists so that we have

for all . In addition, we say that an dependent event holds with high probability if for any large , there exists so that
for all .
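For reference, a common formulation of stochastic domination in the random matrix literature, which we believe the definition above specializes (the symbols here are our own notation, not the paper's), reads:

```latex
% X = (X^{(n)}(u)) is stochastically dominated by Y = (Y^{(n)}(u)),
% uniformly in the parameter u, written X \prec Y, if:
\forall\, \varepsilon > 0,\ \forall\, D > 0,\ \exists\, n_0(\varepsilon, D)
\ \text{such that}\quad
\sup_{u \in U^{(n)}}
\mathbb{P}\!\left[\, X^{(n)}(u) > n^{\varepsilon}\, Y^{(n)}(u) \,\right]
\le n^{-D}
\quad \text{for all } n \ge n_0(\varepsilon, D).
% An n-dependent event \Xi = \Xi^{(n)} holds with high probability if,
% for any large D > 0, there exists n_0(D) such that
% \mathbb{P}\big[(\Xi^{(n)})^{c}\big] \le n^{-D} for all n \ge n_0(D).
```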
We interchangeably use the notation , or if is stochastically dominated by , uniformly in , when there is no risk of confusion. For two sequences of deterministic positive values and , we write if for some positive constant . In addition, if both and , we write . Moreover, we write if for some positive sequence . For any probability measure over , we denote by the collection of integrable functions with respect to , that is, for any , we have . For a vector , we define its norm as . We denote by the diagonal matrix whose th diagonal entry is . For a matrix , we define its Frobenius norm as , and its operator norm as . For any integer , we denote the set . For a random vector , we say it is sub-Gaussian if
(1.4) 
for any deterministic vector . Throughout, are universal constants independent of , and can vary from line to line.
2 Kernel-Spectral Embedding for High-Dimensional Noisy Data
In this section, we introduce in detail our proposed bandwidth selection and kernel-spectral embedding algorithm. After that, we discuss its unique features compared with some popular existing methods.
2.1 DataAdaptive Embedding Algorithm
Our proposed embedding method is summarized as Algorithm 1 below.
(2.1) 
(2.2) 
(2.3) 
(2.4) 
In Step 1 of the algorithm, a data-adaptive bandwidth parameter is defined as the percentile of the empirical cumulative distribution function of the pairwise squared distances among the observed data. Such a strategy is motivated by our theoretical analysis of the spectrum of kernel random matrices and its dependence on the associated bandwidth parameter. It ensures that the bandwidth thus determined adapts well to the unknown manifold structure and the signal-to-noise ratio of the data, so that the associated kernel matrix captures the respective underlying low-dimensional structure via an integral operator. The quantile level is a tunable parameter. In Section 3, we show in theory that can be chosen as any constant between 0 and 1 for the final embeddings to achieve the same asymptotic behavior; in Section 4, we also demonstrate numerically that the final embeddings are insensitive to the choice of . In practice, to optimize the empirical performance and improve automation of the method, we recommend using a resampling approach, described in Appendix D, to determine .

In Step 2, the Gaussian kernel is adopted for the construction of kernel matrices. This is mainly due to its wide use in practice and the relative simplicity of its theoretical analysis, although other kernel functions may be considered in practice (see Section 5 for more discussions).
In Step 3, the final embeddings are defined as the indexed leading eigenvectors of the scaled kernel matrix, weighted by their eigenvalues. Similar forms of embeddings have been considered in [2] for spectral clustering and in [77] for network clustering. The coordinate set of the embedding space is determined by the user, depending on the specific aims or downstream applications of the low-dimensional embeddings. For example, in Section 4.2, we choose for visualizing a path manifold and for the downstream ranking task, choose for learning circle manifolds, and set for a variety of integers for downstream clustering purposes.
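The three steps above can be sketched as follows. The 1/n scaling of the kernel matrix and the eigenvalue-weighting convention are assumptions made for illustration; the precise conventions are given by (2.1)-(2.4) in the paper.

```python
import numpy as np

def kernel_spectral_embedding(Y, r, omega=0.5):
    """Sketch of the embedding pipeline (cf. Algorithm 1); scalings assumed.
    Step 1: bandwidth = omega-quantile of pairwise squared distances.
    Step 2: Gaussian kernel matrix with that bandwidth (1/n scaling assumed).
    Step 3: leading r eigenvectors, weighted by their eigenvalues."""
    n = Y.shape[0]
    sq = np.sum(Y**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    off = D2[~np.eye(n, dtype=bool)]      # off-diagonal squared distances
    h = np.quantile(off, omega)           # Step 1: data-adaptive bandwidth
    K = np.exp(-D2 / h) / n               # Step 2: scaled kernel matrix
    vals, vecs = np.linalg.eigh(K)        # eigenpairs in ascending order
    idx = np.argsort(vals)[::-1][:r]      # leading r eigenpairs
    return vecs[:, idx] * vals[idx]       # Step 3: eigenvalue-weighted
```

In practice, the coordinate set and the quantile level `omega` would be chosen by the user, as discussed above.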
2.2 Comparison with Existing Kernel-Spectral Embedding Methods
Regardless of the kernel functions being used, Algorithm 1 has important differences from the existing kernel-spectral embedding methods. We first focus on kPCA, Laplacian eigenmap, and DM, due to their close relations to our proposal.
Compared to Step 3 of Algorithm 1, these methods rely on the eigendecomposition of matrices distinct from . Specifically, in the standard kPCA implementation (Section 12.3 of [9]), the eigendecomposition is applied to the centred kernel matrix
(2.5) 
where is an all-one vector; in Laplacian eigenmap [6], the eigendecomposition is applied to the kernelized graph Laplacian matrix
(2.6) 
where ; in diffusion map [19], the eigendecomposition is applied to the normalized kernel matrix (i.e., the transition matrix)
(2.7) 
where is some tunable parameter and . More importantly, our numerical results (Section 4.2) indicate that such differences may lead to distinct low-dimensional embeddings, which, according to our theory (Section 3.3), may correspond to geometric features of the underlying manifold that are identifiable by the proposed method but not likely to be captured by these alternatives.
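For comparison, the three matrices in (2.5)-(2.7) can all be formed from the same kernel matrix. The sketch below writes them in their common textbook form, which we assume matches the paper's conventions up to the tunable normalization parameter of the diffusion map.

```python
import numpy as np

def compared_matrices(W, alpha=1.0):
    """Matrices eigendecomposed by kPCA, Laplacian eigenmap, and diffusion
    map, in generic textbook form (cf. (2.5)-(2.7)); conventions assumed."""
    n = W.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_centred = H @ W @ H                       # kPCA: doubly centred kernel
    d = W.sum(axis=1)                           # graph degrees
    L = np.diag(d) - W                          # Laplacian eigenmap: graph Laplacian
    W_alpha = W / np.outer(d, d) ** alpha       # DM: alpha-normalized kernel
    A = W_alpha / W_alpha.sum(axis=1)[:, None]  # DM: row-stochastic transition matrix
    return K_centred, L, A
```

Note that, in contrast to Step 3 of Algorithm 1, all three constructions normalize or centre the kernel matrix before the eigendecomposition.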
As for other existing methods, it is largely unclear to what extent their performance is guaranteed, or whether their interpretations continue to hold, for general high-dimensional and noisy datasets. Empirically, our real applications in Sections 4.2.2 and 4.2.3 show that our proposed method outperforms various existing methods, especially as the dimension increases (Figure 4).
Moreover, compared to the existing methods, Algorithm 1 has a bandwidth selection procedure that is provably coherent with the subsequent steps, in the sense that both the high-dimensionality and the noisiness of the data are intrinsically accounted for in the constructed kernel matrix and the final embeddings. In contrast, graph-based methods usually require the bandwidth to decrease to zero to ensure a meaningful convergence. However, this is often not admissible for high-dimensional and noisy datasets. In this respect, we are only aware of the recent work [20], in which a data-driven bandwidth selection procedure for GL was developed. Nonetheless, in Section 4, we compare various bandwidths and demonstrate the advantage of the proposed bandwidth; we also show that, even after incorporating the proposed bandwidth into these existing methods, Algorithm 1 still has superior performance in learning low-dimensional structures from the real-world datasets.
3 Theoretical Properties: Justifications and Interpretations
We develop a theoretical framework to rigorously justify the embedding algorithm proposed in Section 2.1. Consequently, our theory suggests an asymptotic geometric interpretation that helps to better understand the final low-dimensional embeddings.
3.1 General Model Reduction Scheme
To facilitate our analysis of the proposed embedding algorithm under model (1.1), we introduce a useful model reduction scheme that significantly simplifies the theoretical analysis without sacrificing the generality of our discussion.
Suppose the underlying manifold is dimensional. By Nash’s isometric embedding theorem [50], there exists some integer so that the embedded manifold is supported in an dimensional subspace of . Consequently, there exists a deterministic rotation matrix , depending only on , such that
Now suppose are drawn independently from a probability measure on . Let be the covariance matrix of , and denote its eigendecomposition as
where contains the eigenvalues of , arranged in nonincreasing order, and the columns of are the corresponding eigenvectors. Given , we could define an orthonormal matrix as
Then it is clear that for each , the covariance matrix of is diagonal, with
(3.1) 
Therefore, by rotating the original model (1.1) with a deterministic matrix , we obtain that
(3.2) 
where , and . In particular, after rotation, become independent sparse random vectors in the sense that
(3.3) 
where and
(3.4) 
whereas the random noises remain independent and centred. Now a key observation is that the kernel matrix from (2.3) depends only on the pairwise Euclidean distances of the data , which are invariant to any rotation. In other words, we have
Therefore, to theoretically analyze the kernel random matrix and the subsequent spectral embeddings, we can equivalently start from the structurally reduced model characterized by (3.2) to (3.4).
The main benefit of such a reduction scheme is twofold. On the one hand, it helps to identify a natural coordinate system for the noiseless samples so that the underlying low-dimensional manifold relies only on their first few nonzero coordinates. On the other hand, the reparametrization leads to a parsimonious model in which the sampling distribution from the underlying manifold has a simple diagonal covariance structure. Both aspects contribute to improving the theoretical accessibility of the problem without loss of generality.
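The rotation-invariance observation above is easy to verify numerically: applying a common orthogonal matrix to all samples leaves the Gaussian kernel matrix unchanged. The following sketch, with hypothetical names, checks this on random data.

```python
import numpy as np

def kernel_is_rotation_invariant(Y, h, seed=None):
    """Check that the Gaussian kernel matrix depends only on pairwise
    distances: a common orthogonal rotation of all samples leaves it
    unchanged (up to floating-point error)."""
    rng = np.random.default_rng(seed)
    p = Y.shape[1]
    # Random orthogonal matrix via QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))

    def kernel(M):
        sq = np.sum(M**2, axis=1)
        D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * M @ M.T, 0.0)
        return np.exp(-D2 / h)

    return np.allclose(kernel(Y), kernel(Y @ Q.T))
```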
3.2 Spectral Convergence to Noiseless Kernel Matrix
In this part, we show the convergence of the low-dimensional embeddings to their noiseless counterparts by establishing the spectral convergence of the scaled kernel random matrix to the noiseless random matrix , defined by
(3.5) 
where with being the noise level as in (3.6).
By the invariance , the noiseless kernel matrix can also be treated as a characterization of the noiseless samples embedded in . Before proceeding, we first introduce and discuss the theoretical assumptions made throughout our analysis.
Assumption 3.1.
A few remarks on Assumption 3.1 are in order. First, unlike most of the existing literature on manifold learning and random matrix theory, where is usually assumed to be bounded, our assumption (3.7) allows to diverge slowly with . This is more flexible for practical applications, and it guarantees that the bandwidth selected according to (2.2) has the same order as the signal (i.e., ) with high probability. Second, for ease of statement, we assume that (3.8) holds for the same . However, our results easily generalize to the setting for some nonnegative constants , with some minor adaptation of our argument and extra notational complications. Finally, in the current paper, we assume that , and therefore
are white noise as in (
3.6) for simplicity. Nevertheless, our discussion applies invariably to the colored noise setting where has the same eigenspace as
For example, in light of (3.4), we can allow a diagonal but nonscalar covariance matrix for .

Before stating the main result of this part, we introduce a few more notations. We denote the eigenvalues of as , where , and denote the corresponding eigenvectors as . Let be the probability measure on for the independent subvectors of the reduced noiseless samples . We define the associated integral operator such that, for any , we have
(3.9) 
By Mercer’s theorem (e.g., [39]), there exist a sequence of nonnegative eigenvalues in decreasing order and an orthonormal basis of , known as eigenfunctions, so that
(3.10) 
Finally, we also recall that the eigenvalues and eigenvectors of are denoted as and , respectively.
Theorem 3.2.
Suppose Assumption 3.1 holds and . Then the following holds.

(Eigenvalue convergence) We have
(3.11) and therefore

(Eigenvector convergence) For any , if the th population eigengap
(3.12) satisfies that for some fixed constant and some constant
(3.13) then we have
(3.14)
Theorem 3.2 establishes the asymptotic spectral equivalence between the noisy kernel matrix and the noiseless kernel matrix . Note that both matrices are random, with randomness from the underlying manifold and the noise. We provide several remarks here. First, unlike many existing results on the spectral convergence of kernel random matrices, Theorem 3.2 reveals the role of in the convergence rate, suggesting that when does not diverge fast, a larger will potentially lead to a faster convergence between the two matrices. Second, as shown in Corollary 3.2 of [20], for we have . Consequently, in practice, we only need to focus on recovering the first few (at most of order ) eigenvalues and eigenvectors of . Third, the results concerning can be made more precise given additional information. For example, if is Gaussian and , then from Appendix C, for any finite we have . In this case, whenever , we have consistency for the leading eigenvalues and eigenvectors
and
Remark 3.3.
We point out that our assumption on the signal-to-noise ratio is actually necessary to guarantee convergence. As proved in part (i) of Theorem 3.1 of [20], when , the noise dominates the signal and the convergence between and fails to hold. Indeed, if and , we have , so that asymptotically can be approximated by the Gram matrix of the noise vectors , plus a few non-informative rank-one spikes. In particular, these rank-one spikes, as pointed out in Remark 2.4 of [20], depend only on the kernel function and do not contain any information about ; meanwhile, the empirical spectral distribution of will be governed by that of the noise Gram matrix and converges to the Marchenko-Pastur law [47]. In essence, there is a phase transition at . See also Section 5 for more discussions.
Remark 3.4.
There are fundamental advantages to the proposed bandwidth. Under the high-dimensional setup (1.2), most of the existing literature regarding the kernel random matrix focuses on a prespecified fixed bandwidth [18, 21, 22, 26]. In particular, since they mainly deal with the null cases where , the bandwidth is chosen as for some constant , since the total noise is . In contrast, under the non-null regime considered in the current study, if we still choose , the bandwidth may be too small compared to the signal (), so that the decay of the kernel function forces each sample to be “blind” to the other nearby samples. Especially, as shown in Theorem 2.7 of [20], when the signals are strong enough so that , if we still choose , then we have for all with high probability. In this case, the kernel matrix converges to the identity and becomes useless. In comparison, under our non-null regime, our theory suggests one should choose . In this connection, although both and are generally unknown, our proposed bandwidth selection procedure achieves this automatically from the data. In other words, as will be seen in the proof of Theorem 3.2, our proposed method is guaranteed to select with high probability.
3.3 Spectral Convergence to Population Integral Operator
So far we have shown that our proposed low-dimensional spectral embeddings converge to those associated with the noiseless random samples . However, it is still unclear how these embeddings relate to the underlying manifold . Answering this question is important for a proper interpretation of these embeddings.
Our next result concerns the limiting behavior of the eigenvalues and eigenvectors of , as characterized by a deterministic integral operator defined over the low-dimensional manifold. Before stating our main result, we first introduce a few notations from operator theory.
Recall that is the probability measure on for in (3.3) and is the rotation matrix satisfying (3.1). Let
be the underlying manifold after rotation. We define as the probability measure on for , which is also embedded into , such that for , we have . Accordingly, we define the population integral operator such that for any , we have
(3.15) 
where we recall that
Clearly, the eigenvalues and eigenfunctions of , denoted as and , satisfy
(3.16) 
Moreover, let be the empirical distribution of the noisy samples . Then the corresponding empirical operator is defined by
(3.17) 
It is easy to see (for example, from the discussion in Section 2.2 of [58]) that, for , the eigenvalues of coincide with , and for any eigenvalue of , the corresponding eigenfunction satisfies
(3.18) 
and , where is the th component of , the th eigenvector of . In particular, we have
(3.19) 
so that the eigenvector can be interpreted as the discretized th empirical eigenfunction evaluated at the noisy samples . Finally, for some constant , we denote the integer as
(3.20) 
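For orientation, the eigenvector-to-eigenfunction correspondence above takes, in one common convention, the Nyström form; the normalization below is an assumption of this sketch and may differ from the paper's (3.18)-(3.19). Writing $K\in\mathbb{R}^{n\times n}$ for the kernel matrix with $Kv_j=\lambda_j v_j$, the empirical operator carrying the $1/n$ weight of the empirical measure has eigenvalue $\lambda_j/n$, and the eigenfunction extends off the sample points as

```latex
% Nystrom extension (assumed normalization, stated in the text above):
\hat{\phi}_j(x) \;=\; \frac{1}{\lambda_j}\sum_{i=1}^{n} k(x, y_i)\, v_j(i),
\qquad \text{so that} \qquad \hat{\phi}_j(y_m) \;=\; v_j(m).
```

Evaluating the extension at a sample point $y_m$ gives $\frac{1}{\lambda_j}(Kv_j)_m = v_j(m)$, which is the sense in which the eigenvector is a discretized empirical eigenfunction.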
In order to describe the convergence of eigenfunctions, we consider an RKHS for functions defined on , associated with the kernel function such that
(3.21) 
Specifically, we define as the completion of the linear span of the set of functions , with the inner product denoted as satisfying and the reproducing property for any (see [8, 46] for systematic treatments of RKHS).
Theorem 3.5.
Under the assumptions of Theorem 3.2, the following results hold.

(Eigenvalue convergence) We have
(3.22) 
(Eigenfunction convergence) For in (3.20) and , if we restrict to , then we have and
(3.23) where is the normalized eigenfunction of in satisfying , and is the norm induced by the inner product .
Theorem 3.5 establishes the spectral convergence of the kernel matrix to the population integral operator given by (3.15) in . On the one hand, (3.22) demonstrates that the deterministic limits of the eigenvalues are those of the integral operator . On the other hand, (3.23) indicates that the empirical eigenfunctions , after proper rescaling, can also be regarded as an approximation of the population eigenfunctions in an average manner. In particular, Theorem 3.5 implies that for any ,
(3.24) 
so that as long as , the final low-dimensional embedding is approximately the leading eigenfunctions evaluated at the rotated data, weighted by their corresponding eigenvalues. In other words, each coordinate of the final embedding is essentially a nonlinear transform of the original data by some function uniquely determined by the population integral operator. This explains why our proposed method can capture important geometric features of the underlying manifold, and which geometric features it aims to capture.
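The interpretation above can be sketched in code: the embedding coordinates are eigenvalue-weighted eigenvectors of the Gaussian kernel matrix, i.e., discretized eigenfunctions weighted by their eigenvalues. This is an illustrative sketch only; the exact normalization in Algorithm 1, and the mean shift discussed around (2.5), may differ:

```python
import numpy as np

def kernel_eigenfunction_embedding(Y, h, r=2):
    """Sketch of a kernel-eigenfunction embedding: column k is the
    k-th eigenvector of the Gaussian kernel matrix, scaled by its
    eigenvalue, approximating lambda_k * phi_k at the samples.
    Algorithm 1 in the paper may normalize or shift differently."""
    sq = np.sum(Y ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    K = np.exp(-D2 / h)                      # Gaussian kernel matrix
    vals, vecs = np.linalg.eigh(K)           # ascending eigenvalues
    vals, vecs = vals[::-1][:r], vecs[:, ::-1][:, :r]
    return vecs * vals                       # eigenvalue-weighted coords

rng = np.random.default_rng(1)
Y = rng.normal(size=(80, 50))
emb = kernel_eigenfunction_embedding(Y, h=2.0 * 50, r=2)  # h ~ noise scale
```

Each column of `emb` is then a nonlinear transform of the data determined, in the limit, by an eigenfunction of the population integral operator.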
By comparison, for existing methods such as the Laplacian eigenmap, DM, or LLE, the final embeddings correspond asymptotically to the Laplace–Beltrami differential operator evaluated at the low-dimensional noiseless data. The fundamental distinction between integral and differential operators indicates the potential advantage of the proposed method over these existing methods, especially for high-dimensional noisy datasets, as indicated by our real data analysis (Section 4.2). In this respect, Theorem 3.5 provides a systematic way to understand the general geometric interpretation of the proposed low-dimensional embedding, although its concrete meaning may vary from case to case. To illustrate this, in Appendix C we consider a specific probability measure, obtain the explicit forms of its corresponding eigenvalues and eigenfunctions, and demonstrate their geometric interpretations accordingly.
4 Numerical Studies
We provide numerical simulations and three real data examples to illustrate the usefulness of our proposed method. In particular, in Section 4.1 we show that our adaptive bandwidth selection scheme (2.2) is robust and outperforms existing alternatives. In Section 4.2, we compare our proposed embedding method (Algorithm 1) with various existing methods and demonstrate its superior performance.
4.1 Simulations
We conduct numerical experiments to evaluate the performance of the proposed bandwidth selection procedure (2.2) compared to some existing bandwidth methods in terms of the final embedding quality. To better demonstrate the wide range of applicability of the proposed method, we consider model (1.1) under three different settings of signal manifolds, and assess the quality of spectral embeddings as obtained from Steps 2 and 3 of Algorithm 1 under various bandwidths.
Specifically, we set and , and generate . For the signal samples , we set for some , and generate from one of the following settings:

Nested spheres manifold with : for some global scaling parameter , we set and generate samples from a mixture model of nested spheres , where are uniform distributions on nested spheres in of radii , respectively, and the cluster proportion vector .
“Smiley face” manifold with : for some global scaling parameter , we set and generate samples from a two-dimensional “smiley face” as shown in the left panel of Figure 1.

“Mammoth” manifold with : for some global scaling parameter , we set and generate samples from a three-dimensional “mammoth” as shown in the right panel of Figure 1.
The global scaling parameter characterizes the overall signal-to-noise ratio, playing a similar role to the signal-strength term in our theoretical results. In addition to the proposed adaptive bandwidth selection scheme, we also consider (i) the fixed bandwidth used in many existing works such as [18, 21, 22, 25, 26, 27], and (ii) the data-driven bandwidth selection method of [58], denoted “SBY.” We evaluate these bandwidths by comparing the resulting low-dimensional spectral embeddings of the high-dimensional noisy data with the actual low-dimensional noiseless samples. To this end, we calculate the pairwise distance matrices from the rows of each embedding and from the noiseless samples, and characterize their concordance by the correlation measure
(4.1) 
which is invariant to any scaling and rotation of the low-dimensional embeddings. A good bandwidth selection should lead to higher concordance between the final embeddings and the original low-dimensional manifolds.
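In code, the concordance (4.1) can be computed as a plain Pearson correlation between the vectorized off-diagonal pairwise distances of the two point sets (a sketch; the exact form of (4.1) may differ):

```python
import numpy as np

def pdist_vec(A):
    """Upper-triangular pairwise Euclidean distances of the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)
    iu = np.triu_indices(A.shape[0], k=1)
    return np.sqrt(D2[iu])

def concordance(emb, X):
    """Pearson correlation between pairwise-distance profiles;
    invariant to rotation and global rescaling of either point set."""
    return float(np.corrcoef(pdist_vec(emb), pdist_vec(X))[0, 1])

# invariance check: a rotated, rescaled copy has concordance 1
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
rho = concordance(3.0 * X @ Q, X)
```

Because only inter-point distances enter, the measure cannot reward or penalize an embedding for its arbitrary orientation or scale.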
In Figure 2, we show the concordance achieved by each bandwidth selection method over a range of signal-to-noise ratios, under each of the three manifold settings. In particular, to demonstrate the flexibility of the proposed method, we consider several values of its tuning parameter. From Figure 2, we observe that the proposed method outperforms both the fixed bandwidth and SBY in all three manifolds, and across almost all signal-to-noise ratios. The commonly used fixed bandwidth is competitive only in the “smiley face” example when the signal-to-noise ratio is relatively low; the SBY method performs poorly throughout, likely because it ignores the high dimensionality of the noise. Importantly, the advantage of the proposed method is not sensitive to the choice of its tuning parameter, and becomes more pronounced when the signal-to-noise ratio is larger. In Appendix D, we show that the parameter selected by a data-driven resampling approach achieves equally good performance in these examples. These numerical results indicate that accounting for the high dimensionality of the data and the impact of noise is important for our kernel spectral embedding method to perform well, and that our proposed bandwidth selection algorithm is reliable in that respect.
4.2 Applications to RealWorld Datasets
To illustrate the proposed method and its numerical advantage over various existing methods in diverse applications, we analyze three real datasets with distinct underlying manifold structures.
4.2.1 Pseudotemporal Ordering of Single Cells
Our first example concerns pseudotemporal ordering of single cells. As an important problem in single-cell biology, pseudotemporal cell ordering aims to determine the pattern of a dynamic process experienced by cells and then arrange cells according to their progression through the process, based on single-cell RNA-seq data collected at multiple time points. In the following, we show that our proposed embedding method is able to capture such an underlying progression path as a one-dimensional manifold, and therefore helps determine a more precise ordering of cells than the existing state-of-the-art methods. The dataset consists of single-cell RNA sequencing reads for 149 human primordial germ cells ranging from 4 weeks to 19 weeks old [33]. Specifically, there are 6 cells of 4 weeks old, 37 cells of 7 weeks old, 20 cells of 10 weeks old, 27 cells of 11 weeks old, and 57 cells of 19 weeks old; these ages are treated as the ground truth for cell ordering. The raw count data were preprocessed, normalized, and scaled following the standard procedure (R functions CreateSeuratObject, NormalizeData and ScaleData under default settings) in the R package Seurat (https://cran.r-project.org/web/packages/Seurat/index.html). We also applied the R function FindVariableFeatures in Seurat to identify the most variable genes for subsequent analysis. The final dataset consists of standardized expression levels of these genes for the 149 cells. We apply our proposed method with a variety of tuning parameters and consider a two-dimensional embedding, in which the one-dimensional manifold is already clearly captured; see the left panel of Figure 3 for an illustration.
Observing that the path manifold is aligned with the second eigenvector, we order the cells according to their values in the second eigenvector. Inferred cell orders are also obtained using state-of-the-art cell ordering algorithms such as TSCAN [36], SCORPIUS [16], SerialRank [30], and the classical PCA. Note that SerialRank essentially has the Laplacian eigenmap as its core. For TSCAN and SCORPIUS, we applied functions from their R packages (https://github.com/zji90/TSCAN and https://cran.rstudio.com/web/packages/SCORPIUS/index.html) under the default settings. To evaluate the quality of the inferred orders, we compare them with the ground truth via their respective Kendall's tau statistics.
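As a sketch of this evaluation (using a plain tau-a without tie corrections, which may differ from the variant used in the paper, and with hypothetical noisy scores standing in for an inferred ordering):

```python
import numpy as np

def kendall_tau(a, b):
    """O(n^2) Kendall tau-a between two score vectors (no tie
    correction; the paper's variant may adjust for ties)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, s = a.size, 0.0
    for i in range(n):
        s += np.sum(np.sign(a[i + 1:] - a[i]) * np.sign(b[i + 1:] - b[i]))
    return s / (n * (n - 1) / 2.0)

# ground-truth ages (weeks) for the 149 cells, as given in the text
truth = np.repeat([4, 7, 10, 11, 19], [6, 37, 20, 27, 57])
# hypothetical inferred scores: true ages plus noise (illustration only)
rng = np.random.default_rng(3)
tau = kendall_tau(truth, truth + rng.normal(scale=2.0, size=truth.size))
```

A value of tau near 1 means the inferred order nearly reproduces the true progression; near 0, no better than random.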
On the left of Figure 4, the temporal ordering based on the proposed method aligns better with the actual cell order than the other methods, with an almost threefold improvement over the second-best method TSCAN in Kendall's tau on average. The superiority of the proposed method is not sensitive to the dimension or the quantile parameter; see Appendix D for similar results, including with the quantile parameter set to 0.75. Finally, in Figure 3 we show the scatter plots of the two-dimensional embeddings based on PCA and the proposed method, denoted KEF for kernel eigenfunctions. The advantage of the proposed embedding over its linear counterpart, in informativeness and robustness to outliers, is visible and significant.
4.2.2 Cell Cycle Reconstruction
Our second example concerns reconstruction of the cell cycle from single-cell RNA-seq data. The cell cycle, or cell-division cycle, is the series of events that take place in a cell that cause it to divide into two daughter cells (https://en.wikipedia.org/wiki/Cell_cycle). Determining the cell cycle stages of individual cells analyzed during development is important for understanding the cycle's wide-ranging effects on cellular physiology and gene expression profiles. Our second dataset contains 288 mouse embryonic stem cells, whose cell cycle stages were determined using flow cytometry sorting. As a result, one-third (96) of the cells are in the G1 stage, one-third in the S stage, and the rest in the G2M stage. The raw count data were preprocessed and normalized using the same method as in Section 4.2.1, leading to a dataset of standardized expression levels of the most variable genes for the 288 cells. Again, we apply our proposed method with a variety of tuning parameters. Observing that the leading eigenvector is approximately a constant vector and therefore noninformative, we consider a two-dimensional embedding based on the next two eigenvectors, expected to capture the underlying circle manifold. As a comparison, we also obtain two-dimensional embeddings based on PCA, MDS, Laplacian eigenmap, LLE and DM, using functions implemented in the R package dimRed under their default settings. To recover the cell cycle stages, that is, the underlying positions of the cells on the circle manifold, we project each embedding onto the two-dimensional unit circle, and then identify the cell stages with their respective angles on the unit circle. We compare the reconstruction performance of different methods based on their Kendall's tau distance to the true stages, up to a possible circular shift of the reconstructed cycles.
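The projection step can be sketched as follows (an illustration; the paper's exact projection and shift alignment may differ):

```python
import numpy as np

def circle_angles(emb2d):
    """Center a 2-D embedding, project each point radially onto the
    unit circle, and return its angle in [0, 2*pi) -- a sketch of
    the cycle-reconstruction step described above."""
    Z = np.asarray(emb2d, float)
    Z = Z - Z.mean(axis=0)                        # center the embedding
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.mod(np.arctan2(Z[:, 1], Z[:, 0]), 2.0 * np.pi)

# sanity check: points on a circle of radius 3 recover their angles
# (up to wrap-around at 0 / 2*pi)
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
pts = 3.0 * np.column_stack([np.cos(theta), np.sin(theta)])
angles = circle_angles(pts)
```

Cells can then be ordered by their angles, and the reconstructed cycle compared with the true stages after searching over circular shifts.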
The right panel of Figure 4 shows that our proposed method has outstanding performance compared to the other methods. In addition, Figure 5 shows the unit-circle projections of the two-dimensional embeddings obtained from each method. The cells are most separated and well-ordered according to their true cycle stages along the unit circle reconstructed by the proposed method, demonstrating its superiority over the other methods.
4.2.3 Clustering of Hand-Written Digits
Our last example concerns the MNIST dataset (http://yann.lecun.com/exdb/mnist/) containing images of handwritten digits. Specifically, we focus on images of the handwritten digits “2,” “4,” “6” and “8,” with about 1000 images for each digit. As each image contains a fixed grid of pixels, the images can be treated as high-dimensional vectors. We apply six different low-dimensional embedding methods, namely DM, kPCA as defined in Section 2.2, Laplacian eigenmap, MDS, PCA, and the proposed method, for a variety of embedding dimensions n.dim. To ensure fairness in comparison, we have not included two-stage algorithms such as t-SNE and UMAP, which essentially take one of the above kernel spectral embeddings as an initialization and then refine the low-dimensional embedding to make the cluster pattern more salient via local-metric adjustments [48, 4, 45, 14]. We evaluate the embedding quality of a method by calculating the average Silhouette index [52] of the final low-dimensional embedding with respect to the underlying true cluster membership. In general, a higher average Silhouette index indicates that the underlying clusters are more separated in the final embedding. Similar results for other tuning parameter values are given in Appendix D. In Figure 6, we find that our proposed method, along with DM, has the best overall performance among the six methods. In particular, although the standard kPCA differs from our proposed method only by an additional mean shift (2.5), its embedding quality is clearly worse in all cases, demonstrating the important distinction between kPCA and the proposed method, and the potential advantage of the latter in applications.
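The evaluation criterion can be sketched in plain numpy; this is a standard implementation of the Silhouette index the section uses, not the paper's own code:

```python
import numpy as np

def average_silhouette(X, labels):
    """Mean Silhouette index over all points: (b - a) / max(a, b),
    where a is the mean intra-cluster distance and b is the mean
    distance to the nearest other cluster.  Assumes no singleton
    clusters (a sketch of the criterion used in Section 4.2.3)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    n, uniq = len(X), np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean()                          # cohesion
        b = min(D[i, labels == c].mean()               # separation
                for c in uniq if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())

# well-separated clusters score close to 1; overlapping ones near 0
emb = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score = average_silhouette(emb, [0, 0, 1, 1])
```

The index is computed on the embedded coordinates only, so it compares embeddings directly without retraining any downstream clustering method.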
5 Discussion
We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structure from high-dimensional noisy datasets under a manifold setup. A key component is an adaptive bandwidth selection procedure that is free from prior knowledge of the manifold. Our proposed method is theoretically justified, and our embeddings correspond to an integral operator on the RKHS associated with the underlying manifold. While our results pave the way towards statistical foundations for nonlinear dimension reduction and manifold learning, among other problems, various questions remain to be investigated further. We now highlight a few of them.
First, in this paper we focus on the prototypical Gaussian kernel for the construction of the low-dimensional embeddings. However, this is only an initial step towards interpretable manifold learning from high-dimensional noisy data, and our results serve as a vantage point for exploring and understanding similar kernel-spectral embedding algorithms based on other kernel functions, such as the cosine kernel, polynomial kernel, inner-product kernel, and self-tuning kernel.
Second, in Part 2 of Theorem 3.5, the convergence of eigenfunctions is obtained under the kernel-induced norm, characterizing closeness in a global manner. In some applications, a more precise local characterization of the final embedding is needed; for example, one needs to study the convergence property of the eigenfunctions under a stronger, pointwise norm. However, to the best of our knowledge, even in the noiseless setting, such convergence has only been studied to some extent for the normalized integral operator in [71, 70] and for the Laplace–Beltrami differential operator in [15, 24, 74]. Less is known for the unnormalized integral operator used in the current paper. Moreover, to understand this stronger convergence for noisy datasets, in contrast to (3.14), we also need to study the convergence of the eigenvectors of the kernel random matrices. The analysis requires a more sophisticated argument from random matrix theory, as in [5, 28]. We will pursue these directions in future work.
Additionally, the current work mainly concerns the supercritical non-null regime