
# Learning Low-Dimensional Nonlinear Structures from High-Dimensional Noisy Data: An Integral Operator Approach

We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional and noisy observations, where the datasets are assumed to be sampled from an intrinsically low-dimensional manifold and corrupted by high-dimensional noise. The algorithm employs an adaptive bandwidth selection procedure which does not rely on prior knowledge of the underlying manifold. The obtained low-dimensional embeddings can be further utilized for downstream purposes such as data visualization, clustering and prediction. Our method is theoretically justified and practically interpretable. Specifically, we establish the convergence of the final embeddings to their noiseless counterparts when the dimension and size of the samples are comparably large, and characterize the effect of the signal-to-noise ratio on the rate of convergence and phase transition. We also prove convergence of the embeddings to the eigenfunctions of an integral operator defined by the kernel map of some reproducing kernel Hilbert space capturing the underlying nonlinear structures. Numerical simulations and analysis of three real datasets show the superior empirical performance of the proposed method, compared to many existing methods, on learning various manifolds in diverse applications.


## 1 Introduction

With rapid technological advancements in data collection and processing, massive large-scale and high-dimensional datasets are widely available nowadays in diverse research fields such as astronomy, business analytics, human genetics and microbiology. A common feature of these datasets is that their statistical and geometric properties can be well understood via a meaningful low-rank representation of reduced dimensionality. Learning low-dimensional structures from these high-dimensional noisy data is one of the central topics in statistics and data science. Moreover, nonlinear structures have been found predominant and intrinsic in these real-world datasets, which may not be easily captured or preserved in commonly used linear or quasi-linear methods such as principal component analysis (PCA), singular value decomposition (SVD)

[38] and multidimensional scaling (MDS) [12]. As a longstanding and well-recognized technique for analyzing datasets with possibly intrinsic nonlinear structures, kernel methods have been shown effective in various applications ranging from clustering and data visualization to classification and prediction [56, 34, 43]. On the other hand, spectral methods, as a fundamental tool for dimension reduction, are oftentimes applied in combination with kernel methods to better capture the underlying low-dimensional nonlinear structure in the data. These approaches are commonly referred to as nonlinear dimension reduction techniques; see Section 1.1 below for a brief overview.

Despite the effectiveness and success of kernel-spectral methods in many applications, these methods are usually applied haphazardly, especially in terms of tuning parameter selection and interpretation of the results. For example, as a key step in these methods, the construction of the kernel (or affinity) matrices requires specifying a proper bandwidth parameter, which is commonly determined by unjustified heuristics or conventional empiricism. This is mainly due to a lack of theoretical understanding of the methods, including their intrinsic objectives, their range of applicability, and their susceptibility to noise or dimensionality of the data. Likewise, the absence of a theoretical foundation may significantly constrain users' recognition or exploitation of the full potential of these methods. Given the indispensable status and the practical power of these methods in many fields of application, such as single-cell transcriptomics [72, 49, 41] and medical informatics [1, 59], there is a pressing need for theoretically justified kernel-spectral methods that account for both the high-dimensionality and the noisiness of the data.

In this study, we focus on an integral operator approach to learning low-dimensional nonlinear structures from high-dimensional noisy data. We propose a kernel-spectral embedding algorithm with a data-adaptive bandwidth selection procedure (c.f. Algorithm 1), justified by rigorous theoretical analysis, and provide solid interpretations of the low-dimensional embeddings by establishing their asymptotic behavior in the large system limit.

Specifically, we consider the following manifold “signal-plus-noise” data-generative model

$$y_i = x_i + z_i \in \mathbb{R}^p, \qquad 1 \le i \le n, \tag{1.1}$$

where $y_i \in \mathbb{R}^p$ are observed samples, $x_i$ are the underlying noiseless samples drawn independently from a low-dimensional smooth manifold $\mathcal{M}$ of interest, embedded into $\mathbb{R}^p$, and $z_i$ are independent sub-Gaussian noise vectors (see Section 1.3 for the definition) with covariance $\sigma^2 I_p$. As will be seen in Section 3.1, (1.1) is closely related to the spiked covariance model [37] extensively studied in the past two decades. Throughout, we focus on the high-dimensional setting where the number of variables $p$ is comparable to the sample size $n$, that is, for some constant $\tau \in (0,1)$, we have

$$\tau \le \frac{n}{p} \le \tau^{-1}. \tag{1.2}$$
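As a concrete illustration (not from the paper), the following sketch instantiates model (1.1) in the regime (1.2): it samples noiseless points from a hypothetical circle manifold embedded in a high-dimensional space and adds isotropic Gaussian noise. All parameter values here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 400                      # comparable n and p, as in (1.2)

# Noiseless samples x_i on a unit circle (a 1-dimensional manifold) embedded in R^p.
t = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.zeros((n, p))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)

# Observations y_i = x_i + z_i with isotropic (here Gaussian) noise.
sigma = 0.05
Y = X + sigma * rng.standard_normal((n, p))
```

Although each $x_i$ lives in $\mathbb{R}^{400}$, the signal occupies only a two-dimensional subspace, while the noise spreads over all coordinates.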

As a general and important problem in nonlinear dimension reduction and manifold learning, our goal is to find a well-justified low-dimensional embedding of $\{y_i\}_{i=1}^n$ that captures the nonlinear structure of the underlying manifold $\mathcal{M}$, as encapsulated in the noiseless random samples $\{x_i\}_{i=1}^n$. To facilitate our discussion, we define the Gaussian kernel matrix $K_n$, where

$$K_n = (K(i,j))_{1\le i,j\le n}, \qquad K(i,j) = \exp\left(-\frac{\|y_i - y_j\|_2^2}{h_n}\right), \tag{1.3}$$

where $h_n > 0$ is some bandwidth.
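The kernel matrix (1.3) can be computed directly; a minimal sketch (using the standard pairwise-squared-distance identity) is:

```python
import numpy as np

def gaussian_kernel_matrix(Y, h):
    """Kernel matrix with entries K(i, j) = exp(-||y_i - y_j||_2^2 / h), as in (1.3)."""
    sq = np.sum(Y ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T   # pairwise squared distances
    d2 = np.maximum(d2, 0.0)                         # guard against tiny negative round-off
    return np.exp(-d2 / h)
```

By construction, the resulting matrix is symmetric with unit diagonal.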

### 1.1 Some Related Works

Unlike our general setup (1.1) under high-dimensionality (1.2), most of the existing work concerns computational algorithms for learning a nonlinear manifold from random observations, based on heuristics developed under either the noiseless setting (i.e., $z_i = 0$) or low-dimensional settings where $p$ is fixed. These methods can be roughly separated into two categories, depending on how the data is conceptually represented. On the one hand, there are spectral graph based methods, including Laplacian eigenmap [6], diffusion map (DM) [19], vector diffusion map (VDM) [61], ISOMAP [66], maximal variance unfolding (MVU) [73], locally linear embedding (LLE) [53], Hessian LLE [23], t-SNE [67], UMAP [48] and local tangent space alignment (LTSA) [76], among many others. For more details of these methods, see [44, 68]. Generally speaking, these methods start by constructing a possibly sparse graph using kernels, in which the nodes represent the input patterns and the edges represent the neighborhood relations. Consequently, the resulting graph can be viewed as a discretized approximation of $\mathcal{M}$ sampled by the inputs. From these graphs, one may apply results from spectral graph theory and construct matrices whose spectral decomposition reveals the low-dimensional structure of $\mathcal{M}$.

Incidentally, these methods have been used to facilitate various downstream statistical applications, such as spectral clustering [69, 17], among others. The main differences between these methods are the kernels used to construct the spectral graphs and the associated nonlinear eigenfunctions (or feature maps [34]) employed for embedding (or data representation). On the other hand, there are reproducing kernel Hilbert space (RKHS) based methods, which use eigenfunctions of some kernel map of the RKHS to infer the nonlinear structures of $\mathcal{M}$ based on noiseless observations [56, 34, 43]. Specifically, these methods start from a certain similarity matrix based on some positive kernel function, and infer the geometric properties from the integral operator associated with the kernel. Similarly, various downstream tasks such as clustering, regression and classification [34, 54, 58, 70] have been treated in combination with the RKHS-based methods. In addition to the above two categories, kernel principal component analysis (kPCA) [55] has also been widely used.

From a mathematical viewpoint, in the low-dimensional and noiseless setting, theoretical properties of some of these methods have been explored. For the graph-based methods, existing theoretical results reveal that, under some regularity conditions, the discretized approximation obtained from various kernel-spectral methods will converge to a continuum quantity, which is usually represented as a certain operator capturing the underlying structures of $\mathcal{M}$. For example, it has been shown that the discrete graph Laplacian (GL) matrices (c.f. (2.6) and (2.7)) used by Laplacian eigenmap and DM converge to the Laplace-Beltrami operator of $\mathcal{M}$ under various settings and assumptions [7, 15, 24, 31, 32, 60, 62, 74]. For the LLE, as shown in [75], after being properly normalized, the similarity matrix converges to the LLE kernel, which is closely related to the Laplace-Beltrami operator. Convergence of t-SNE, MVU, VDM and LTSA has been studied in [4, 45, 14, 3, 61, 65]. For the RKHS-based methods, it has been shown that the empirical eigenvalues and eigenfunctions converge to those of the integral operator associated with the reproducing kernel [13, 42, 51, 58, 63, 64, 70]. Finally, kPCA has been studied under the noiseless setting [10].

Another important difference between the graph-based methods and the RKHS-based methods lies in the treatment of the kernel matrices (c.f. (1.3)). The graph-based methods usually rely on some GL-type operations, where the kernel (affinity) matrix is properly normalized by the graph degrees [20, 68]. In contrast, the RKHS-based methods commonly employ the kernel matrix directly without further normalization. Consequently, in terms of interpretations, existing theory under the low-dimensional noiseless setting indicates that the embeddings from the graph-based methods are associated with the eigenfunctions of the Laplace-Beltrami operator of $\mathcal{M}$, whereas those from the RKHS-based methods are related to the eigenfunctions of some reproducing kernels on $\mathcal{M}$.

Despite these endeavors, much less is known when the data is high-dimensional and noisy as modeled by (1.1) and (1.2), and the behavior of the associated kernel random matrix is not well understood either. In the null case (i.e., $x_i = 0$), it has been shown in [11, 18, 21, 22, 26, 29] that the random kernel matrix can be well approximated by a low-rank perturbed random Gram matrix. Consequently, studying $K_n$ under pure noise is closely related to PCA with some low-rank perturbations. Moreover, since in this case the degree matrix is close to a scalar matrix [21], the graph-based methods and the RKHS-based methods are asymptotically equivalent. As for the non-null cases, spectral convergence of $K_n$ has been studied in some special cases in [25], and more recently under a general setting in [20]. These results show how the eigenvalues of the kernel matrices of the noisy samples relate to those of the noiseless samples. Moreover, as can be seen in [20], in the non-null cases, the degree matrix is usually non-trivial, so that the graph-based methods can be very different from the RKHS-based methods.

In what follows, we focus on an RKHS-based integral operator approach to nonlinear dimension reduction and manifold learning under the general setup (1.1) and (1.2). Even though $K_n$ in this setting has been studied to some extent in [20, 25], several pieces are still missing for rigorous and interpretable statistical applications. Firstly, it is unclear how to select the bandwidth adaptively for embedding, that is, free from prior knowledge of the underlying manifold. Secondly, the limiting behavior of the eigenvectors of $K_n$ has not been analyzed. Such results are particularly relevant for embedding purposes. Finally, even though the convergence of the kernel matrices associated with the noisy samples (c.f. (1.3)) and the noiseless samples (c.f. (3.5)) can be established, it is unclear how they are related to the underlying manifold $\mathcal{M}$. Such results are important for appropriate interpretations of the embeddings. We will accomplish these goals in this paper.

### 1.2 An Overview of Main Results and Contributions

Briefly speaking, our proposed kernel-spectral embedding algorithm starts by constructing a Gaussian kernel matrix $K_n$ as defined in (1.3), where the bandwidth $h_n$ is carefully selected by a theory-guided data-adaptive procedure proposed in Section 2. Then we obtain the kernel-spectral embedding of the original data by conducting an eigen-decomposition of the scaled kernel matrix $n^{-1}K_n$, and define the embeddings as the leading eigenvectors of $n^{-1}K_n$, weighted by their associated eigenvalues (Algorithm 1). As we will show later, as $n \to \infty$ with respect to (1.2), the thus constructed final embeddings are essentially related to and determined by a population integral operator defined by the reproducing kernel of an RKHS associated with the underlying manifold $\mathcal{M}$. It is in this sense that our proposed method is an integral operator approach.

Rigorous theoretical understanding is obtained for the proposed algorithm, including the theoretical justification for the bandwidth selection procedure, the limiting behavior of the low-dimensional embeddings, and their interpretation in relation to the underlying manifolds. Compared to the existing works on kernel-spectral embeddings (see Section 1.1), the current study has the following methodological advantages and theoretical significance:

• We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structure from high-dimensional noisy data. Unlike most existing methods, our proposal takes into account both the noisiness of the data, characterized by our model (1.1), and its high-dimensionality as in (1.2). To the best of our knowledge, this is the first kernel-spectral embedding method with theoretically guaranteed performance for high-dimensional and noisy data under the non-null settings.

• A key component in any kernel-based method is to select a proper bandwidth parameter. One major innovation of our proposed method lies in a careful analysis of the asymptotic behavior of kernel random matrices constructed from high-dimensional noisy data, which in turn leads to a theory-informed bandwidth selection procedure guaranteeing the strong performance of the method in a data-adaptive manner. In particular, our bandwidth selection procedure does not rely on prior knowledge about the underlying manifold, and can be implemented efficiently.

• We also provide in-depth theoretical understanding of the proposed method. On the one hand, we study the convergence of the final low-dimensional embeddings to their oracle counterparts based on the noiseless samples in the large system limit, and characterize explicitly the effect of the underlying signal-to-noise ratio on the respective rates of convergence and phase transition. On the other hand, we establish the convergence of the kernel matrices to the population integral operator that captures the intrinsic geometry of the underlying manifold. The second result is essential for our understanding of the final low-dimensional embeddings, interpreted as a finite approximation of the samples projected onto the leading eigenfunctions of the population integral operator.

Our theoretical results explain the empirically observed data-adaptive nature of the proposed method (Section 4), which extracts the nonlinear structure from the data regardless of which kind of manifold is in play. Moreover, our identification of the limiting integral operator also suggests potential advantages of our method over alternative approaches targeting distinct geometric features of the manifold. For example, we may conclude from the theory that our method differs significantly from the graph-based methods, such as Laplacian eigenmap, DM, and LLE, as they all essentially aim at the Laplace-Beltrami operator rather than an integral operator (see Sections 2.2 and 5 for more discussions).

Our theoretical analysis, on the one hand, relies on a general model reduction scheme, detailed in Section 3.1, which simplifies the analysis by connecting the noisy manifold model (1.1) with a spiked covariance model. On the other hand, we also leverage advanced tools from operator theory such as the theory of RKHS and random matrix theory, to prove our main results.

The proposed method admits strong empirical performance. In Section 4.1, we present simulation studies that show the superiority and flexibility of the proposed bandwidth selection method over some existing alternatives. In Section 4.2, we analyze three real-world high-dimensional datasets with distinct nonlinear structures, including a path manifold, a circle manifold, and a multiclass mixture manifold, to demonstrate the usefulness of the method. In particular, for each of the examples, our proposed method shows significant improvements over the existing state-of-the-art methods in inferring the respective underlying nonlinear structures.

### 1.3 Organization and Notations

The rest of the paper is organized as follows. In Section 2, we introduce our proposed kernel-spectral embedding algorithm in detail, and we point out important differences between our proposal and some existing methods. In Section 3, we study the theoretical properties of the proposed method, including the convergence of the low-dimensional embeddings to their noiseless counterparts, and the spectral convergence of the kernel matrix to an integral operator. In Section 4, we include our simulation studies on bandwidth selection and analyze three real-world datasets to show the numerical performance of the proposed method in various applications. Finally, we discuss some extensions and related problems in Section 5. Technical proofs and additional discussions and numerical results are provided in Appendices A–D.

We will finish this section by introducing some notations used in the paper. To streamline our statements, we use the notion of stochastic domination, which is commonly adopted in advanced probability theory to syntactically simplify precise statements of the form "$X$ is bounded with high probability by $Y$ up to small powers of $n$."

###### Definition 1.1 (Stochastic domination).

Let

$$X = \{X^{(n)}(u) : n \in \mathbb{N},\ u \in U^{(n)}\}, \qquad Y = \{Y^{(n)}(u) : n \in \mathbb{N},\ u \in U^{(n)}\},$$

be two families of nonnegative random variables, where $U^{(n)}$ is a possibly $n$-dependent parameter set. We say that $X$ is stochastically dominated by $Y$, uniformly in the parameter $u$, if for all small $\upsilon > 0$ and large $D > 0$, there exists $n_0(\upsilon, D)$ so that we have

$$\sup_{u \in U^{(n)}} \mathbb{P}\left(X^{(n)}(u) > n^{\upsilon} Y^{(n)}(u)\right) \le n^{-D},$$

for all $n \ge n_0(\upsilon, D)$. In addition, we say that an $n$-dependent event $\Omega$ holds with high probability if for any large $D > 0$, there exists $n_0(D)$ so that

$$\mathbb{P}(\Omega) \ge 1 - n^{-D},$$

for all $n \ge n_0(D)$.

We interchangeably use the notation $X \prec Y$, or $X = O_\prec(Y)$, if $X$ is stochastically dominated by $Y$, uniformly in $u$, when there is no risk of confusion. For two sequences of deterministic positive values $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if $a_n \le C b_n$ for some positive constant $C$. In addition, if both $a_n \lesssim b_n$ and $b_n \lesssim a_n$, we write $a_n \asymp b_n$. Moreover, we write $a_n \ll b_n$ if $a_n \le c_n b_n$ for some positive sequence $c_n \to 0$. For any probability measure $P$ over $\mathbb{R}^r$, we denote by $L^2(P)$ the collection of square-integrable functions with respect to $P$, that is, for any $f \in L^2(P)$, we have $\int |f|^2 \, dP < \infty$. For a vector $a = (a_1, \ldots, a_p)^\top \in \mathbb{R}^p$, we define its $\ell_2$ norm as $\|a\|_2 = (\sum_{i=1}^p a_i^2)^{1/2}$. We denote by $\mathrm{diag}(a_1, \ldots, a_p)$ the diagonal matrix whose $i$-th diagonal entry is $a_i$. For a matrix $A = (A_{ij})$, we define its Frobenius norm as $\|A\|_F = (\sum_{i,j} A_{ij}^2)^{1/2}$, and its operator norm as $\|A\| = \sup_{\|a\|_2 = 1} \|Aa\|_2$. For any integer $n$, we denote the set $[n] = \{1, \ldots, n\}$. For a random vector $g \in \mathbb{R}^p$, we say it is sub-Gaussian if

$$\mathbb{E}\exp(a^\top g) \le \exp\left(\|a\|_2^2/2\right), \tag{1.4}$$

for any deterministic vector $a \in \mathbb{R}^p$. Throughout, $C, C_1, C_2, \ldots$ are universal constants independent of $n$, and can vary from line to line.

## 2 Kernel-Spectral Embedding for High-Dimensional Noisy Data

In this section, we introduce in detail our proposed bandwidth selection and kernel-spectral embedding algorithm. After that we discuss its unique features compared with some popular existing methods.

Our proposed embedding method is summarized as Algorithm 1 below.

In Step 1 of the algorithm, a data-adaptive bandwidth parameter $h_n$ is defined as the $\omega$-percentile of the empirical cumulative distribution function of the pairwise squared distances among the observed data. Such a strategy is motivated by our theoretical analysis of the spectrum of kernel random matrices and its dependence on the associated bandwidth parameter. It ensures that the thus determined bandwidth adapts well to the unknown manifold structure and the signal-to-noise ratio of the data, so that the associated kernel matrix captures the respective underlying low-dimensional structure via an integral operator. The percentile level $\omega$ is a tunable parameter. In Section 3, we show in theory that $\omega$ can be chosen as any constant between 0 and 1 to have the final embeddings achieve the same asymptotic behavior; in Section 4, we also demonstrate numerically that the final embeddings are insensitive to the choice of $\omega$. In practice, to optimize the empirical performance and improve automation of the method, we recommend using a resampling approach, described in Appendix D, to determine $\omega$.
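A minimal sketch of this bandwidth rule, assuming the bandwidth is read off as a percentile (here denoted `omega`) of the empirical CDF of pairwise squared distances; the resampling choice of the percentile level from Appendix D is not reproduced here:

```python
import numpy as np

def select_bandwidth(Y, omega=0.5):
    """Bandwidth h_n: the omega-percentile of the empirical CDF of the
    pairwise squared distances among the observations (Step 1)."""
    sq = np.sum(Y ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    iu = np.triu_indices_from(d2, k=1)       # distinct pairs i < j
    return float(np.quantile(d2[iu], omega))
```

Because the rule is a percentile of the observed squared distances, the selected bandwidth automatically scales with the overall magnitude of signal plus noise.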

In Step 2, the Gaussian kernel is adopted for the construction of kernel matrices. This is mainly due to its wide use in practice and the relative simplicity of its theoretical analysis, although other kernel functions may be considered in practice (see Section 5 for more discussions).

In Step 3, the final embeddings are defined as the leading eigenvectors of the scaled kernel matrix indexed by a coordinate set, weighted by their eigenvalues. Similar forms of embeddings have been considered in [2] for spectral clustering and in [77] for network clustering. The coordinate set of the embedding space is determined by the users, depending on the specific aims or downstream applications of the low-dimensional embeddings. For example, in Section 4.2, we choose different coordinate sets for visualizing a path manifold and its downstream ranking task, for learning circle manifolds, and for a variety of downstream clustering purposes.
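The eigen-decomposition step can be sketched as follows. This is an illustrative implementation of the description above (scaled kernel matrix, leading eigenvectors weighted by eigenvalues); the exact scaling conventions of Algorithm 1 may differ in detail.

```python
import numpy as np

def spectral_embed(K, coords):
    """Step 3 sketch: eigen-decompose the scaled kernel matrix n^{-1} K_n and
    return the coordinate-set-indexed leading eigenvectors, each weighted by
    its eigenvalue."""
    n = K.shape[0]
    vals, vecs = np.linalg.eigh(K / n)       # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]           # reorder to decreasing
    vals, vecs = vals[order], vecs[:, order]
    idx = np.asarray(coords)                 # user-chosen coordinate set
    return vecs[:, idx] * vals[idx]
```

For instance, `spectral_embed(K, coords=[1, 2])` would give a two-dimensional embedding from the second and third leading eigenvectors.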

### 2.2 Comparison with Existing Kernel-Spectral Embedding Methods

Regardless of the kernel functions being used, Algorithm 1 has important differences from the existing kernel-spectral embedding methods. We first focus on kPCA, Laplacian eigenmap, and DM due to their close relations to our proposal.

Compared to Step 3 of Algorithm 1, these methods rely on eigendecompositions of matrices distinct from $n^{-1}K_n$. Specifically, in the standard kPCA implementation (Section 12.3 of [9]), the eigendecomposition is applied to the centred kernel matrix

$$\widetilde{K}_n = \left(I - n^{-1}\mathbf{1}\mathbf{1}^\top\right) K_n \left(I - n^{-1}\mathbf{1}\mathbf{1}^\top\right), \tag{2.5}$$

where $\mathbf{1} \in \mathbb{R}^n$ is an all-one vector; in Laplacian eigenmap [6], the eigendecomposition is applied to the kernelized graph Laplacian matrix

$$L_n = I - D_n^{-1} K_n, \tag{2.6}$$

where $D_n = \mathrm{diag}(K_n \mathbf{1})$ is the degree matrix; in diffusion map [19], the eigendecomposition is applied to the normalized kernel matrix (i.e., transition matrix)

$$M_n = D_n'^{-1} K_n', \qquad K_n' = D_n^{-\zeta} K_n D_n^{-\zeta}, \tag{2.7}$$

where $\zeta$ is some tunable parameter and $D_n' = \mathrm{diag}(K_n' \mathbf{1})$. More importantly, our numerical results (Section 4.2) indicate that such differences may lead to distinct low-dimensional embeddings, which, according to our theory (Section 3.3), may correspond to geometric features of the underlying manifold that are identifiable by the proposed method but not likely to be captured by these methods.
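For concreteness, the three alternative matrices can be formed as in the following sketch; the centred form for kPCA follows the standard convention and may differ in minor details from [9]:

```python
import numpy as np

def kpca_centred(K):
    """Doubly centred kernel matrix used by kPCA, as in (2.5)."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring projection I - 11^T / n
    return J @ K @ J

def graph_laplacian(K):
    """Kernelized graph Laplacian L_n = I - D_n^{-1} K_n, as in (2.6)."""
    d = K.sum(axis=1)                        # graph degrees
    return np.eye(K.shape[0]) - K / d[:, None]

def diffusion_map_matrix(K, zeta=1.0):
    """Normalized transition matrix of (2.7): K' = D^{-zeta} K D^{-zeta},
    then row-normalize K' by its own degrees."""
    d = K.sum(axis=1)
    Kp = K / np.outer(d ** zeta, d ** zeta)
    return Kp / Kp.sum(axis=1)[:, None]
```

Note the structural contrast: the centred matrix has rows summing to zero, the Laplacian annihilates the constant vector, and the diffusion matrix is row-stochastic, whereas Algorithm 1 uses the raw kernel matrix scaled by $n^{-1}$.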

As for other existing methods, it is largely unclear to what extent their performance is guaranteed, or whether their interpretations continue to hold, for general high-dimensional and noisy datasets. Empirically, our real applications in Sections 4.2.2 and 4.2.3 show that our proposed method outperforms various existing methods, especially when the dimension increases (Figure 4).

Moreover, compared to the existing methods, Algorithm 1 has a bandwidth selection procedure that is provably coherent with the subsequent steps, in the sense that both the high-dimensionality and noisiness of the data are intrinsically accounted for in the constructed kernel matrix and the final embeddings. In contrast, graph-based methods usually require the bandwidth to decrease to zero to ensure a meaningful convergence. However, this is often not admissible for high-dimensional and noisy datasets. In this respect, we are only aware of the recent work [20], in which a data-driven bandwidth selection procedure for GL was developed. Nonetheless, in Section 4, we compare various bandwidths and demonstrate the advantage of the proposed bandwidth; we also show that, even after incorporating the proposed bandwidth into these existing methods, Algorithm 1 still has superior performance in learning low-dimensional structures from real-world datasets.

## 3 Theoretical Properties: Justifications and Interpretations

We develop a theoretical framework to rigorously justify the embedding algorithm proposed in Section 2.1. Consequently, our theory suggests an asymptotic geometric interpretation that helps to better understand the final low-dimensional embeddings.

### 3.1 General Model Reduction Scheme

To facilitate our analysis of the proposed embedding algorithm under model (1.1), we introduce a useful model reduction scheme that significantly simplifies the theoretical analysis without sacrificing the generality of our discussions.

Suppose the underlying manifold $\mathcal{M}$ is $d$-dimensional. By Nash's isometric embedding theorem [50], there exists some integer $r \ge d$ so that the embedded manifold is supported in an $r$-dimensional subspace of $\mathbb{R}^p$. Consequently, there exists a deterministic rotation matrix $R \in \mathbb{R}^{p \times p}$, only depending on $\mathcal{M}$, such that

$$Rx_i = (x_{i1}, x_{i2}, \cdots, x_{ir}, 0, \cdots, 0)^\top \in \mathbb{R}^p.$$

Now suppose $x_1, \ldots, x_n$ are drawn independently from a probability measure on $\mathcal{M}$. Let $\Sigma \in \mathbb{R}^{r \times r}$ be the covariance matrix of the first $r$ coordinates of $Rx_i$, and denote its eigendecomposition as

$$\Sigma = V \Gamma V^\top,$$

where $\Gamma \in \mathbb{R}^{r \times r}$ contains the eigenvalues of $\Sigma$, arranged in nonincreasing order, and the columns of $V$ are the corresponding eigenvectors. Given $V$, we could define an orthonormal matrix $O \in \mathbb{R}^{p \times p}$ as

$$O = \begin{pmatrix} V^\top & 0 \\ 0 & I_{p-r} \end{pmatrix}.$$

Then it is clear that for each $i$, the covariance matrix of $ORx_i$ is diagonal, with

$$\mathrm{Cov}(ORx_i) = \begin{pmatrix} \Gamma & 0 \\ 0 & 0 \end{pmatrix}. \tag{3.1}$$

Therefore, by rotating the original model (1.1) with the deterministic matrix $OR$, we obtain that

$$y_i^0 = x_i^0 + z_i^0, \tag{3.2}$$

where $y_i^0 = ORy_i$, $x_i^0 = ORx_i$, and $z_i^0 = ORz_i$. In particular, after rotation, the noiseless samples become independent sparse random vectors in the sense that

$$x_i^0 = (\bar{x}_i^{0\top}, 0, \cdots, 0)^\top \in \mathbb{R}^p, \tag{3.3}$$

where $\bar{x}_i^0 \in \mathbb{R}^r$ and

$$\mathrm{Cov}(\bar{x}_i^0) = \Gamma = \mathrm{diag}(\theta_1, \cdots, \theta_r), \tag{3.4}$$

whereas the random noises $z_i^0$ remain independent and centred. Now a key observation is that the kernel matrix $K_n$ from (2.3) only depends on the pairwise Euclidean distances of the data, which are invariant to any rotations. In other words, we have

$$\exp\left(-\frac{\|y_i - y_j\|_2^2}{h}\right) = \exp\left(-\frac{\|y_i^0 - y_j^0\|_2^2}{h}\right), \qquad 1 \le i, j \le n.$$

Therefore, to theoretically analyze the kernel random matrix and the subsequent spectral embeddings, we can invariably start from the structurally reduced model characterized by (3.2) to (3.4).

The main benefit of such a reduction scheme is two-fold. On the one hand, it helps to identify a natural coordinate system for the noiseless samples so that the underlying low-dimensional manifold only relies on their first few nonzero coordinates. On the other hand, the reparametrization leads to a parsimonious model in which the sampling distribution from the underlying manifold has a simple diagonal covariance structure. Both aspects contribute to improving the theoretical accessibility of the problem without loss of generality.
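The rotation invariance underlying this reduction is easy to verify numerically. The sketch below uses a random orthonormal matrix as a stand-in for the (unknown) reduction map $OR$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 10
Y = rng.standard_normal((n, p))

# Random orthonormal matrix standing in for the deterministic rotation OR.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Y0 = Y @ Q.T                                 # rotated data y_i^0 = Q y_i

def pairwise_sq(Z):
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

# Pairwise squared distances, hence the kernel matrix (1.3), are unchanged.
assert np.allclose(pairwise_sq(Y), pairwise_sq(Y0))
```

Since the kernel matrix depends on the data only through these distances, the embedding is identical whether computed from the original or the rotated samples.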

### 3.2 Spectral Convergence to Noiseless Kernel Matrix

In this part, we show the convergence of the low-dimensional embeddings to their noiseless counterparts, by establishing the spectral convergence of the scaled kernel random matrix $n^{-1}K_n$ to the scaled noiseless random matrix $n^{-1}K_n^*$, defined by

$$K_n^* = (K^*(i,j))_{1\le i,j\le n}, \qquad K^*(i,j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{h}\right), \tag{3.5}$$

where the bandwidth $h$ is determined by $h_n$ and the noise level $\sigma^2$ given in (3.6).

By the rotational invariance of pairwise distances, the noiseless kernel matrix can also be treated as a characterization of the noiseless samples embedded in $\mathcal{M}$. To this end, we first introduce and discuss the theoretical assumptions made throughout our analysis.

###### Assumption 3.1.

We assume that both $\{x_i\}$ and $\{z_i\}$ in (1.1) are independent sub-Gaussian random vectors, and

$$\mathbb{E}z_i = 0, \qquad \mathrm{Cov}(z_i) = \sigma^2 I_p, \tag{3.6}$$

with $\sigma^2 \asymp n^{\beta}$ for some constant $\beta$. Moreover, in the reduced model (3.2) to (3.4), we assume that

$$r \equiv r(n) = O(\log^c n), \tag{3.7}$$

for some constant $c > 0$, and for all $i \in [r]$,

$$\theta_i \asymp n^{\alpha}, \tag{3.8}$$

for some constant $\alpha$.

A few remarks on Assumption 3.1 are in order. First, unlike most of the existing literature on manifold learning and random matrix theory where $r$ is usually assumed to be bounded, our assumption (3.7) allows $r$ to diverge slowly with $n$. This is more flexible for practical applications, and it guarantees that the bandwidth selected according to (2.2) has the same order as the signal (i.e., $n^{\alpha}$) with high probability. Second, for ease of statement, we assume that (3.8) holds with the same $\alpha$ for all $i \in [r]$. However, our results can easily generalize to the setting $\theta_i \asymp n^{\alpha_i}$ for some nonnegative constants $\alpha_i$, with some minor adaptation of our argument and extra notational complications. Finally, in the current paper, we assume that $\mathrm{Cov}(z_i) = \sigma^2 I_p$, and therefore the noises are white as in (3.6), for simplicity. Nevertheless, our discussion applies invariably to the colored noise setting where the noise covariance has the same eigenspace as that of the signal. For example, in light of (3.4), we can allow a diagonal but non-scalar covariance matrix for the rotated noise $z_i^0$.

Before stating the main result of this part, we introduce a few more notations. We denote the eigenvalues of $n^{-1}K_n^*$ as $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n$, and denote the corresponding eigenvectors as $\{v_i\}_{i=1}^n$. Let $P$ be the probability measure on $\mathbb{R}^r$ for the independent subvectors $\bar{x}_i^0$ of the reduced noiseless samples. We define the associated integral operator $\mathcal{K}$ such that, for any $f \in L^2(P)$, we have

$$\mathcal{K}f(x) = \int \exp\left(-\frac{\|x - y\|_2^2}{h}\right) f(y)\, P(dy), \qquad x, y \in \mathbb{R}^r. \tag{3.9}$$

By Mercer's theorem (e.g., [39]), there exist a sequence of nonnegative eigenvalues $\{\gamma_i\}_{i \ge 1}$ in decreasing order and an orthonormal basis $\{\phi_i\}_{i \ge 1}$ of $L^2(P)$, known as eigenfunctions, so that

$$\mathcal{K}\phi_i(x) = \gamma_i \phi_i(x), \qquad \text{for } i \ge 1. \tag{3.10}$$

Finally, we also recall that the eigenvalues and eigenvectors of $n^{-1}K_n$ are denoted as $\{\lambda_i\}_{i=1}^n$ and $\{u_i\}_{i=1}^n$, respectively.
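The connection between eigenvectors of a kernel matrix and eigenfunctions of an integral operator like (3.9) can be made concrete through the standard Nyström extension, a textbook device rather than part of the paper's algorithm: an eigenvector $u$ (with eigenvalue $\lambda$) of the scaled kernel matrix extends to a function of $x$ that interpolates the eigenvector entries at the sample points.

```python
import numpy as np

def nystrom_extension(x_new, X, u, lam, h):
    """Extend an eigenvector u (eigenvalue lam) of the scaled kernel matrix
    K/n to a function of x, via phi(x) = sum_j k(x, x_j) u_j / (n * lam).
    At a sample point x_l this reproduces u_l exactly, since (K/n) u = lam u."""
    k = np.exp(-np.sum((X - x_new) ** 2, axis=1) / h)
    return k @ u / (len(X) * lam)
```

As the sample size grows, this extension provides a finite-sample approximation to the corresponding eigenfunction of the limiting operator.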

###### Theorem 3.2.

Suppose Assumption 3.1 holds and $\alpha > \beta + 1$. Then the following holds.

1. (Eigenvalue convergence) We have

$$\left\|n^{-1}K_n - n^{-1}K_n^*\right\| = O_\prec\left(\frac{1}{\sqrt{n}} + \frac{1}{r\, n^{\alpha-\beta-1}}\right), \tag{3.11}$$

and therefore

$$\max_{i \in [n]} |\lambda_i - \mu_i| = O_\prec\left(\frac{1}{\sqrt{n}} + \frac{1}{r\, n^{\alpha-\beta-1}}\right).$$
2. (Eigenvector convergence) For any $i$, if the $i$-th population eigen-gap

$$\mathsf{r}_i := \min\{\gamma_{i-1} - \gamma_i,\ \gamma_i - \gamma_{i+1}\}, \tag{3.12}$$

satisfies, for some fixed constant $\epsilon > 0$ and some constant $C > 0$,

$$\mathsf{r}_i^2 \ge C n^{\epsilon}\left(\frac{1}{\sqrt{n}} + \frac{1}{r\, n^{\alpha-\beta-1}}\right), \tag{3.13}$$

then we have

$$\left|\langle u_i, v_i\rangle^2 - 1\right| = O_\prec\left(\left[\frac{1}{\sqrt{n}} + \frac{1}{r\, n^{\alpha-\beta-1}}\right]\frac{1}{\mathsf{r}_i^2}\right). \tag{3.14}$$

Theorem 3.2 establishes the asymptotic spectral equivalence between the noisy kernel matrix and the noiseless kernel matrix. Note that both matrices are random with respect to the underlying manifold and noise. We provide several remarks here. First, unlike many existing results on the spectral convergence of kernel random matrices, Theorem 3.2 reveals the role of the signal-to-noise ratio in the convergence rate, suggesting that a larger $\alpha$ will potentially lead to a faster convergence between the two matrices. Second, as shown in Corollary 3.2 of [20], only the first few eigenvalues of $n^{-1}K_n^*$ are non-negligible. Consequently, in practice, we only need to focus on recovering the first few eigenvalues and eigenvectors of $n^{-1}K_n^*$. Third, the results can be made more precise given additional information. For example, if the signal distribution is Gaussian, then, as discussed in Appendix C, for any finite $N$ the leading eigen-gaps are well behaved. In this case, whenever the second term in (3.11) is dominated by $n^{-1/2}$, we have $\sqrt{n}$-consistency for the leading eigenvalues and eigenvectors:

$$\max_{i\in[N]}|\lambda_i-\mu_i|=O_\prec(n^{-1/2}),$$

and

$$\max_{i\in[N]}\|u_i-v_i\|_2^2=2-2\min_{i\in[N]}\langle u_i,v_i\rangle=O_\prec(n^{-1/2}).$$
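As a quick numerical illustration of this eigenvalue convergence, the following sketch (in Python with NumPy; the manifold, noise level, and median-distance bandwidth rule are hypothetical choices for illustration, not the paper's simulation design) compares the spectra of the noisy and noiseless Gaussian kernel matrices for a circle embedded in high dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 150, 200, 4.0   # sample size, ambient dimension, signal strength (all hypothetical)

# Noiseless signals on a circle (a one-dimensional manifold) embedded in R^p.
t = rng.uniform(0, 2 * np.pi, n)
X = np.zeros((n, p))
X[:, 0], X[:, 1] = theta * np.cos(t), theta * np.sin(t)
Y = X + rng.normal(scale=0.1, size=(n, p))      # high-dimensional noise

def sq_dists(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    G = A @ A.T
    d = np.diag(G)
    return np.maximum(d[:, None] + d[None, :] - 2 * G, 0.0)

h = np.median(sq_dists(Y))                      # crude data-driven bandwidth (illustrative only)
lam = np.sort(np.linalg.eigvalsh(np.exp(-sq_dists(Y) / (2 * h)) / n))[::-1]  # noisy spectrum
mu = np.sort(np.linalg.eigvalsh(np.exp(-sq_dists(X) / (2 * h)) / n))[::-1]   # noiseless spectrum
print(np.max(np.abs(lam - mu)))                 # small when the signal dominates the noise
```

Increasing the noise scale until it dominates the signal separates the two spectra, consistent with the phase transition discussed in Remark 3.3 below.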
###### Remark 3.3.

We point out that our assumption on the signal-to-noise ratio is actually necessary to guarantee convergence. As proved in part (i) of Theorem 3.1 of [20], when the noise dominates the signal, the convergence between $K_n$ and $K_n^*$ fails to hold. Indeed, in that regime $K_n$ can asymptotically be approximated by the Gram matrix of the noise vectors, plus a few noninformative rank-one spikes. In particular, these rank-one spikes, as pointed out in Remark 2.4 of [20], depend only on the kernel function and do not contain any information about the underlying signals; meanwhile, the empirical spectral distribution of $K_n$ is governed by that of the noise Gram matrix and converges to the Marchenko–Pastur law [47]. In essence, there is a phase transition at the critical signal-to-noise ratio. See also Section 5 for more discussion.

###### Remark 3.4.

There are fundamental advantages to the proposed bandwidth. Under the high-dimensional setup (1.2), most of the existing literature on kernel random matrices focuses on a pre-specified fixed bandwidth [18, 21, 22, 26]. In particular, since these works mainly deal with the null cases, the bandwidth is chosen to be of the order of the total noise, namely of order $p$. In contrast, under the non-null regime considered in the current study, if we still choose a bandwidth of order $p$, it may be too small compared to the signal, so that the fast decay of the kernel function forces each sample to be "blind" to the other nearby samples. In particular, as shown in Theorem 2.7 of [20], when the signals are strong enough, choosing a bandwidth of order $p$ yields $K_n(i,j)\approx 0$ for all $i\neq j$ with high probability. In this case, the kernel matrix converges to the identity and becomes useless. In comparison, under our non-null regime our theory suggests choosing a bandwidth comparable to the signal strength. In this connection, although the signal and noise levels are generally unknown, our proposed bandwidth selection procedure achieves this automatically from the data. In other words, as will be seen in the proof of Theorem 3.2, our proposed method is guaranteed to select a bandwidth of the correct order with high probability.
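The contrast drawn in this remark can be seen in a small experiment. The sketch below (hypothetical data, with a simple median-of-squared-distances rule standing in for the paper's actual procedure (2.2)) shows that with a strong signal, the fixed bandwidth of order $p$ drives the off-diagonal kernel entries toward zero, while a bandwidth adapted to the observed distances keeps them informative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 500
theta = p ** 0.6                     # strong-signal regime (hypothetical scaling)

t = rng.uniform(0, 2 * np.pi, n)
Y = rng.normal(size=(n, p))          # high-dimensional noise
Y[:, 0] += theta * np.cos(t)         # low-dimensional signal on a circle
Y[:, 1] += theta * np.sin(t)

G = Y @ Y.T
d = np.diag(G)
sq = np.maximum(d[:, None] + d[None, :] - 2 * G, 0.0)  # pairwise squared distances

K_fixed = np.exp(-sq / (2 * p))      # fixed bandwidth of order p (null-case choice)
h_adapt = np.median(sq[np.triu_indices(n, 1)])         # distance-adapted bandwidth (illustration only)
K_adapt = np.exp(-sq / (2 * h_adapt))

off = ~np.eye(n, dtype=bool)
print(K_fixed[off].mean(), K_adapt[off].mean())  # fixed-bandwidth entries are much smaller
```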

### 3.3 Spectral Convergence to Population Integral Operator

So far we have shown that our proposed low-dimensional spectral embeddings converge to those associated with the noiseless random samples. However, it is still unclear how these embeddings relate to the underlying manifold $\mathcal{M}$. Answering this question is important for understanding the proper interpretation of these embeddings.

Our next result concerns the limiting behavior of the eigenvalues and eigenvectors of $K_n$, as characterized by a deterministic integral operator defined over the low-dimensional manifold. Before stating our main result, we first introduce some notation from operator theory.

Recall that $P$ is the probability measure on $\mathbb{R}^r$ defined in (3.3), and $O_R$ is the rotation matrix satisfying (3.1). Let

$$\mathcal{M}_0=\{O_R x: x\in\mathcal{M}\subseteq\mathbb{R}^p\}$$

be the underlying manifold after rotation. We define $P_0$ as the probability measure on $\mathcal{M}_0$, which is also embedded into $\mathbb{R}^p$ in the natural way. Accordingly, we define the population integral operator $\mathcal{K}_0$ such that for any $f\in L^2(P_0)$ we have

$$\mathcal{K}_0 f(x)=\int \exp\Big(-\frac{\|x-y\|^2}{2h}\Big)f(y)\,P_0(dy),\qquad x\in\mathbb{R}^p, \tag{3.15}$$

where we recall that $h$ is the same bandwidth as in (3.9).

Clearly, the eigenvalues and eigenfunctions of $\mathcal{K}_0$, denoted as $\{\tilde\gamma_i\}$ and $\{\tilde\phi_i\}$, satisfy

$$\tilde\gamma_i=\gamma_i,\qquad \tilde\phi_i(x)=\phi_i(x),\qquad \text{for } i\ge 1. \tag{3.16}$$

Moreover, let $P_n$ be the empirical distribution of the samples $\{y^0_j\}_{j=1}^n$. Then the corresponding empirical operator $\mathcal{K}_n$ is defined by

$$\mathcal{K}_n f(x)=\frac{1}{n}\sum_{j=1}^n \exp\Big(-\frac{\|x-y^0_j\|^2}{2h_n}\Big)f(y^0_j),\qquad x\in\mathbb{R}^p. \tag{3.17}$$

It is easy to see (for example, from the discussion in Section 2.2 of [58]) that, for $i\in[n]$, the eigenvalues of $\mathcal{K}_n$ coincide with $\{\lambda_i\}$, and for any eigenvalue $\lambda_i$ of $\mathcal{K}_n$ the corresponding eigenfunction satisfies

$$\hat\phi^{(n)}_i(x)=\frac{1}{\lambda_i\sqrt{n}}\sum_{j=1}^n \exp\Big(-\frac{\|x-y^0_j\|^2}{2h_n}\Big)u_{ij},\qquad x\in\mathbb{R}^p, \tag{3.18}$$

where $u_{ij}$ is the $j$-th component of $u_i$, the $i$-th eigenvector of $K_n$. In particular, we have

$$\hat\phi^{(n)}_i(y^0_j)=\sqrt{n}\,u_{ij},\qquad \text{for all } j\in[n], \tag{3.19}$$

so that the eigenvector $u_i$ can be interpreted as the discretized $i$-th empirical eigenfunction evaluated at the samples $\{y^0_j\}$. Finally, for some constant $\tau>0$ we denote the integer $K$ as

$$K\equiv K(\tau):=\max\{1\le i\le n: \gamma_i\ge\tau\}. \tag{3.20}$$
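The extension formula (3.18) and the interpolation identity (3.19) can be verified numerically. In the sketch below (arbitrary sample points and bandwidth, chosen purely for illustration), the empirical eigenfunction built from an eigenvector of $n^{-1}K_n$ reproduces $\sqrt{n}\,u_{ij}$ at the sample points:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, h = 100, 2, 1.0
Y0 = rng.normal(size=(n, r))     # stand-ins for the reduced samples y_j^0

G = Y0 @ Y0.T
d = np.diag(G)
K = np.exp(-np.maximum(d[:, None] + d[None, :] - 2 * G, 0.0) / (2 * h))

lam, U = np.linalg.eigh(K / n)   # eigenpairs of n^{-1} K_n
lam, U = lam[::-1], U[:, ::-1]   # sort in decreasing order

def phi_hat(i, x):
    """Empirical eigenfunction (3.18): a Nystrom-type extension of the i-th eigenvector."""
    w = np.exp(-((x - Y0) ** 2).sum(axis=1) / (2 * h))
    return w @ U[:, i] / (lam[i] * np.sqrt(n))

# Identity (3.19): evaluating the extension at the samples recovers sqrt(n) * u_{ij}.
vals = np.array([phi_hat(0, Y0[j]) for j in range(n)])
print(np.allclose(vals, np.sqrt(n) * U[:, 0]))   # True
```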

In order to describe the convergence of eigenfunctions, we consider an RKHS for functions defined on $\mathbb{R}^p$, associated with the kernel function $K(x,y)$ such that

$$K(x,y)=\exp\Big\{-\frac{\|x-y\|^2}{2h}\Big\}. \tag{3.21}$$

Specifically, we define the RKHS $\mathcal{H}$ as the completion of the linear span of the set of functions $\{K(x,\cdot)\}$, with the inner product $\langle\cdot,\cdot\rangle_K$ satisfying $\langle K(x,\cdot),K(y,\cdot)\rangle_K=K(x,y)$ and the reproducing property $f(x)=\langle f,K(x,\cdot)\rangle_K$ for any $f\in\mathcal{H}$ (see [8, 46] for systematic treatments of RKHS theory).

###### Theorem 3.5.

Under the assumptions of Theorem 3.2, the following results hold.

1. (Eigenvalue convergence) We have

$$\max_{i\in[n]}|\lambda_i-\gamma_i|=O_\prec\Big(\frac{1}{\sqrt{n}}+\frac{1}{r\,n^{\alpha-\beta-1}}\Big). \tag{3.22}$$
2. (Eigenfunction convergence) For $K(\tau)$ in (3.20) and any $i\in[K(\tau)]$, if we restrict the functions on $\mathcal{M}_0$, then we have $\hat\phi^{(n)}_i,\tilde\phi_i\in\mathcal{H}$ and

$$\Big\|\sqrt{\lambda_i}\,\hat\phi^{(n)}_i-\sqrt{\gamma_i}\,\tilde\phi_i\Big\|_K=O_\prec\Bigg(\frac{1}{\sqrt{n}}+\frac{1}{r\,n^{\alpha-\beta-1}}+\Big[\frac{1}{\sqrt{n}}+\frac{1}{r\,n^{\alpha-\beta-1}}\Big]^{1/2}\frac{1}{r_i}\Bigg), \tag{3.23}$$

where $\tilde\phi_i$ is the normalized eigenfunction of $\mathcal{K}_0$ in $\mathcal{H}$, and $\|\cdot\|_K$ is the norm induced by the inner product $\langle\cdot,\cdot\rangle_K$.

Theorem 3.5 establishes the spectral convergence of the kernel matrix $K_n$ to the population integral operator $\mathcal{K}_0$ given by (3.15). On the one hand, (3.22) demonstrates that the deterministic limits of the eigenvalues $\{\lambda_i\}$ are the eigenvalues of the integral operator $\mathcal{K}_0$. On the other hand, (3.23) indicates that the empirical eigenfunctions $\{\hat\phi^{(n)}_i\}$, after proper rescaling, can also be regarded as approximations of the population eigenfunctions $\{\tilde\phi_i\}$ in an averaged manner. In particular, Theorem 3.5 implies that for any $i\in[K(\tau)]$,

$$\lambda_i u_i=\frac{\lambda_i}{\sqrt{n}}\big(\hat\phi^{(n)}_i(y^0_1),\hat\phi^{(n)}_i(y^0_2),\dots,\hat\phi^{(n)}_i(y^0_n)\big)^\top\approx \frac{\gamma_i}{\sqrt{n}}\big(\tilde\phi_i(y^0_1),\tilde\phi_i(y^0_2),\dots,\tilde\phi_i(y^0_n)\big)^\top, \tag{3.24}$$

so that, as long as $i\in[K(\tau)]$, the final low-dimensional embedding is approximately given by the leading eigenfunctions $\{\tilde\phi_i\}$ evaluated at the rotated data $\{y^0_j\}$, weighted by their corresponding eigenvalues. In other words, each coordinate of the final embedding is essentially a nonlinear transform of the original data by some function uniquely determined by the population integral operator $\mathcal{K}_0$. This explains why our proposed method may capture important geometric features of the underlying manifold, and clarifies what kind of geometric features our method aims to capture.

As a comparison, existing methods such as Laplacian eigenmap, DM, or LLE produce final embeddings that correspond asymptotically to the Laplace–Beltrami differential operator evaluated at the low-dimensional noiseless data. The fundamental distinction between integral and differential operators indicates the potential advantage of the proposed method over these existing methods, especially for high-dimensional noisy datasets, as indicated by our real data analysis (Section 4.2). In this respect, Theorem 3.5 provides a systematic way to understand the general geometric interpretation of the proposed low-dimensional embeddings, although their concrete meanings may vary from case to case. To better illustrate this, in Appendix C we consider a specific probability measure, obtain the explicit forms of its corresponding eigenvalues and eigenfunctions, and demonstrate their geometric interpretations accordingly.

## 4 Numerical Studies

We provide numerical simulations and three real data examples to illustrate the usefulness of our proposed method. In particular, in Section 4.1, we show that our adaptive bandwidth selection scheme (2.2) is robust and outperforms existing methods. In Section 4.2, we compare our proposed embedding method (Algorithm 1) with various existing methods and demonstrate its superior performance.

### 4.1 Simulations

We conduct numerical experiments to evaluate the performance of the proposed bandwidth selection procedure (2.2), compared with some existing bandwidth choices, in terms of the final embedding quality. To better demonstrate the wide applicability of the proposed method, we consider model (1.1) under three different settings of signal manifolds, and assess the quality of the spectral embeddings obtained from Steps 2 and 3 of Algorithm 1 under various bandwidths.

Specifically, we fix the sample size $n$ and ambient dimension $p$, and generate the noise vectors as i.i.d. Gaussian. The signal samples are generated from one of the following settings:

• Nested spheres manifold: for some global scaling parameter $\theta>0$, we generate samples from a mixture model of nested spheres, whose mixture components are uniform distributions on nested spheres in $\mathbb{R}^r$ of increasing radii, with a fixed cluster proportion vector.

• “Smiley face” manifold with $r=2$: for some global scaling parameter $\theta>0$, we generate samples from a two-dimensional “smiley face” as shown in Figure 1 (left).

• “Mammoth” manifold with $r=3$: for some global scaling parameter $\theta>0$, we generate samples from a three-dimensional “mammoth” as shown in Figure 1 (right).
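A data-generating sketch for the first setting can be written as follows (the radii, proportions, and sizes here are hypothetical placeholders; the paper's exact simulation constants are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 600, 800, 3
theta = 5.0                              # global scaling (signal-to-noise) parameter
radii = np.array([1.0, 2.0, 3.0])        # hypothetical nested-sphere radii
props = np.array([1 / 3, 1 / 3, 1 / 3])  # hypothetical cluster proportions

labels = rng.choice(len(radii), size=n, p=props)
V = rng.normal(size=(n, r))
S = V / np.linalg.norm(V, axis=1, keepdims=True)  # uniform points on the unit sphere in R^r
X = theta * radii[labels, None] * S               # nested spheres, scaled by theta

Y = np.zeros((n, p))
Y[:, :r] = X                      # embed the r-dimensional signal into R^p
Y += rng.normal(size=(n, p))      # add isotropic high-dimensional Gaussian noise
print(Y.shape)
```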

The global scaling parameter $\theta$ characterizes the overall signal-to-noise ratio, playing a role similar to the signal strength in our theoretical results. In addition to the proposed adaptive bandwidth selection scheme, we also consider (i) the fixed bandwidth of order $p$, which has been used in many existing works such as [18, 21, 22, 25, 26, 27], and (ii) the data-driven bandwidth selection method, denoted as “SBY,” proposed by [58]. We evaluate the above bandwidths by comparing the low-dimensional spectral embeddings of the high-dimensional noisy data with the actual low-dimensional noiseless samples. To do so, we calculate the pairwise distance matrix $D_{\mathrm{embed}}$ from the rows of the embedding, and $D_x$ from the noiseless samples, and characterize their concordance by the correlation measure

$$\text{Concordance}=\frac{\langle D_{\mathrm{embed}},D_x\rangle}{\|D_{\mathrm{embed}}\|_F\|D_x\|_F}, \tag{4.1}$$

which is invariant to any scaling and rotation of the low-dimensional embeddings. Presumably, a good bandwidth selection should lead to a higher concordance between the final embeddings and the original low-dimensional manifolds.
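The concordance measure (4.1) is straightforward to compute; a minimal sketch, with a random embedding used only to check the claimed invariance:

```python
import numpy as np

def pairwise_dist(A):
    """Matrix of Euclidean distances between the rows of A."""
    G = A @ A.T
    d = np.diag(G)
    return np.sqrt(np.maximum(d[:, None] + d[None, :] - 2 * G, 0.0))

def concordance(embed, X):
    """Concordance (4.1): cosine similarity of the two pairwise-distance matrices."""
    De, Dx = pairwise_dist(embed), pairwise_dist(X)
    return np.sum(De * Dx) / (np.linalg.norm(De) * np.linalg.norm(Dx))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))     # a random 2x2 rotation/reflection
print(concordance(3.0 * X @ Q, X))               # scaling and rotation leave it at 1
```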

In Figure 2, we show the concordance evaluated for each bandwidth selection method over a variety of signal-to-noise ratios, as quantified by $\theta$, under each of the three manifold settings. In particular, to demonstrate the advantage and flexibility of the proposed method, we consider several choices of its tuning parameter. From Figure 2, we observe that the proposed method has superior performance compared to the fixed bandwidth and SBY on all three manifolds, and across almost all signal-to-noise ratios. The commonly used fixed bandwidth only shows comparable performance in the “smiley face” example when the signal-to-noise ratio is relatively low; the SBY method performs poorly throughout, probably because it ignores the high-dimensionality of the noise. Importantly, the advantage of the proposed method is not sensitive to the tuning parameter values, and becomes more significant when the signal-to-noise ratio is larger. In Appendix D, we show that a bandwidth selected by a data-driven resampling approach can achieve equally good performance in these examples. These numerical results indicate that accounting for the high-dimensionality of the data and the impact of noise is important for our kernel-spectral embedding method to perform well, and that our proposed bandwidth selection algorithm is reliable in that respect.

### 4.2 Applications to Real-World Datasets

To illustrate the proposed method and its numerical advantage over various existing methods in diverse applications, we analyze three real datasets with distinct underlying manifold structures.

#### 4.2.1 Pseudotemporal Ordering of Single Cells

Our first example concerns pseudotemporal ordering of single cells. As an important problem in single-cell biology, pseudotemporal cell ordering aims to determine the pattern of a dynamic process experienced by cells and then arrange cells according to their progression through the process, based on single-cell RNA-Seq data collected at multiple time points. In the following, we show that our proposed embedding method is able to capture such an underlying progression path as a one-dimensional manifold, and therefore helps determine a more precise ordering of cells compared to existing state-of-the-art methods. The dataset consists of single-cell RNA sequencing reads for 149 human primordial germ cells ranging from 4 weeks to 19 weeks old [33]. Specifically, there are 6 cells of 4 weeks old, 37 cells of 7 weeks old, 20 cells of 10 weeks old, 27 cells of 11 weeks old, and 57 cells of 19 weeks old; these are treated as the ground truth for cell ordering. The raw count data were preprocessed, normalized, and scaled following the standard procedure (R functions CreateSeuratObject, NormalizeData and ScaleData under default settings) as incorporated in the R package Seurat. We also applied the R function FindVariableFeatures in Seurat to identify the most variable genes for subsequent analysis. The final dataset consists of standardized expression levels of the most variable genes for the 149 cells. We apply our proposed method with a variety of tuning parameters and consider a two-dimensional embedding, in which the one-dimensional manifold is already clearly captured; see the left panel of Figure 3 for an illustration.

Observing that the path manifold is aligned with the second eigenvector, we order the cells according to their values in the second eigenvector. Inferred cell orders are also obtained using state-of-the-art cell ordering algorithms such as TSCAN [36], SCORPIUS [16], SerialRank [30], and the classical PCA. Note that SerialRank essentially has Laplacian eigenmap at its core. For TSCAN and SCORPIUS, we applied functions from their R packages under the default settings. To evaluate the quality of the inferred orders, we compare them with the ground truth via their respective Kendall's tau statistics.
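Kendall's tau can be computed directly from its definition; a minimal pure-Python sketch (ignoring tie corrections, which a production implementation would handle):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau: (concordant - discordant) pairs over all pairs (no tie correction)."""
    assert len(a) == len(b)
    conc = disc = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    total = len(a) * (len(a) - 1) / 2
    return (conc - disc) / total

true_order = [1, 2, 3, 4, 5]
inferred = [1, 3, 2, 4, 5]                 # one adjacent transposition
print(kendall_tau(true_order, inferred))   # 0.8
```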

On the left of Figure 4, the temporal ordering based on the proposed method aligns better with the actual cell order than the other methods, with almost a three-fold improvement over the second best method (TSCAN) in Kendall's tau on average. The superiority of the proposed method is not sensitive to the embedding dimension or the quantile parameter; see Appendix D for similar results with other choices, including 0.75.

Finally, in Figure 3 we show the scatter plots of the two-dimensional embeddings based on PCA and the proposed method, denoted as KEF (kernel eigenfunctions). The advantage of the proposed embedding over its linear counterpart, such as informativeness and robustness to outliers, is visible and significant.

#### 4.2.2 Cell Cycle Reconstruction

Our second example concerns reconstruction of the cell cycle from single-cell RNA-Seq data. The cell cycle, or cell-division cycle, is the series of events that take place in a cell that cause it to divide into two daughter cells. Determining the cell cycle stages of individual cells analyzed during development is important for understanding its wide-ranging effects on cellular physiology and gene expression profiles. Our second dataset contains 288 mouse embryonic stem cells, whose cell cycle stages were determined using flow cytometry sorting. As a result, one-third (96) of the cells are in the G1 stage, one-third in the S stage, and the rest in the G2M stage. The raw count data were preprocessed and normalized using the same method as in Section 4.2.1, leading to a dataset consisting of standardized expression levels of the most variable genes for the 288 cells. Again, we apply our proposed method with a variety of tuning parameters. Observing that the leading eigenvector is approximately a constant vector and therefore non-informative, we consider a two-dimensional embedding based on the next two eigenvectors, which is expected to capture the underlying circle manifold. As a comparison, we also obtain two-dimensional embeddings based on PCA, MDS, Laplacian eigenmap, LLE and DM, using functions implemented in the R package dimRed under their default settings. To recover the cell cycle stages, or the underlying position of the cells on the circle manifold, we project each embedding to the two-dimensional unit circle, and then identify the cell stages with their respective angles on the unit circle. We compare the reconstruction performance of different methods based on their Kendall's tau distance to the true stages, up to a possible circular shift of the reconstructed cycles.
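The projection step can be sketched as follows (synthetic embedding; the centering step is an assumed preprocessing choice, and a real comparison must additionally account for circular shifts and reflections, as noted above):

```python
import numpy as np

def circle_angles(embed):
    """Project a 2-D embedding to the unit circle and return each point's angle."""
    Z = embed - embed.mean(axis=0)                # center first (assumed preprocessing)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.mod(np.arctan2(Z[:, 1], Z[:, 0]), 2 * np.pi)

rng = np.random.default_rng(5)
t = np.sort(rng.uniform(0, 2 * np.pi, 120))       # ground-truth positions on the cycle
embed = np.c_[np.cos(t), np.sin(t)] * 2.5 + rng.normal(scale=0.05, size=(120, 2))
ang = circle_angles(embed)

err = np.abs(ang - t)
err = np.minimum(err, 2 * np.pi - err)            # circular distance to the truth
print(err.max())                                  # small: angles recover the cycle positions
```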

The right panel of Figure 4 shows that our proposed method has outstanding performance compared to the other methods. In addition, Figure 5 shows the unit-circle projections of the two-dimensional embeddings obtained from each method. It can be seen that the cells are most separated and well-ordered according to their true cycle stages along the unit circle reconstructed by the proposed method, demonstrating its superiority over the other methods.

#### 4.2.3 Clustering of Hand-Written Digits

Our last example concerns the MNIST dataset containing images of hand-written digits. Specifically, we focus on images of the hand-written digits “2,” “4,” “6” and “8,” among which there are about 1000 images for each digit. As each image contains $28\times 28$ pixels, it can be treated as a 784-dimensional vector. We apply six different low-dimensional embedding methods, namely DM, kPCA as defined in Section 2.2, Laplacian eigenmap, MDS, PCA and the proposed method, for a variety of embedding dimensions (n.dim). In particular, to ensure fairness of comparison, we have not included two-stage algorithms such as tSNE and UMAP, which essentially take one of the above kernel-spectral embeddings as an initialization and then refine the low-dimensional embedding to make the cluster pattern more salient based on some local-metric adjustment [48, 4, 45, 14].

We evaluate the embedding quality of a method by calculating the average Silhouette index [52] of the final low-dimensional embedding with respect to the underlying true cluster membership. In general, a higher average Silhouette index indicates that the underlying clusters are more separated in the final embedding. Again, for the proposed method we use the default tuning parameters; similar results for other choices are given in Appendix D. In Figure 6, we find that our proposed method, along with DM, has overall the best performance among the six methods. In particular, although the standard kPCA differs from our proposed method only by an additional mean shift (2.5), in all cases its embedding quality is clearly worse, demonstrating the important distinction between kPCA and the proposed method, and the potential advantage of the latter in applications.
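For reference, the average Silhouette index [52] can be computed as follows (a self-contained NumPy sketch rather than the exact implementation used in the paper):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette index: mean of (b - a) / max(a, b) over all points, where a is
    the mean intra-cluster distance and b the smallest mean distance to another cluster."""
    G = X @ X.T
    d = np.diag(G)
    D = np.sqrt(np.maximum(d[:, None] + d[None, :] - 2 * G, 0.0))
    labels = np.asarray(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue                              # skip singleton clusters
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(6)
X = np.r_[rng.normal(0, 0.2, (40, 2)), rng.normal(5, 0.2, (40, 2))]
labels = [0] * 40 + [1] * 40
print(mean_silhouette(X, labels))   # close to 1 for well-separated clusters
```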

## 5 Discussion

We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional noisy datasets under a manifold setup. A key component is an adaptive bandwidth selection procedure that is free from prior knowledge of the manifold. Our proposed method is theoretically justified, and our embeddings correspond to the eigenfunctions of an integral operator from an RKHS associated with the underlying manifold. While our results pave the way towards statistical foundations for nonlinear dimension reduction and manifold learning, various problems remain to be investigated further. We now highlight a few of them.

First, in this paper, we focus on the prototypical Gaussian kernel for the construction of the low-dimensional embeddings. However, this is only an initial step towards interpretable manifold learning from high-dimensional noisy data, and our results stand as a vantage point for exploring and understanding similar kernel-spectral embedding algorithms based on other kernel functions, such as the cosine kernel, polynomial kernel, inner-product kernel, and self-tuning kernel.

Second, in Part 2 of Theorem 3.5, the convergence of eigenfunctions is obtained under the kernel-induced norm, characterizing the closeness in a global manner. In some applications, a more precise local characterization of the final embedding is needed. For example, one needs to study the convergence property of the eigenfunctions under the uniform norm $\|f\|_\infty=\sup_x|f(x)|$. However, to the best of our knowledge, even in the noiseless setting, such convergence has only been studied to some extent for the normalized integral operator in [71, 70] and the Laplace–Beltrami differential operator in [15, 24, 74]. Less is known for the unnormalized integral operator used in the current paper. Moreover, to understand this convergence for noisy datasets, in contrast to (3.14), we also need to study the entrywise convergence of the eigenvectors of the kernel random matrices. The analysis requires more sophisticated arguments from random matrix theory as in [5, 28]. We will pursue these directions in future works.

Additionally, the current work mainly concerns the super-critical non-null regime