Given an i.i.d. sample from the data-generating measure in Euclidean space, the goal of most tasks in machine learning and statistics is to infer properties of that measure. A particularly interesting case is when the measure is supported on a lower-dimensional compact submanifold of the ambient space, e.g. due to strong dependencies between the individual features. In this case one can construct a neighborhood graph on the sample by connecting all vertices whose Euclidean distance is less than a certain length-scale, and in this way produce a discrete approximation of the unknown manifold. Laplacian Eigenmaps Belkin02laplacianeigenmaps and Diffusion Maps Coifman1 have been proposed as tools to extract the intrinsic structure of the manifold by considering the eigenvectors of the resulting unnormalized and normalized graph Laplacian, respectively; in particular, Laplacian eigenmaps are used in the first step of spectral clustering vonLux_tutorial, one of the most popular graph-based clustering methods. In general, it is well known that the spectrum of the graph Laplacian captures important structural properties of the graph Moh1991, and that the spectrum of the Laplace-Beltrami operator captures important geometric properties of the manifold Cha1984.
In this paper we examine the following question: under what conditions, and at what rate, does the spectrum of the graph Laplacian built from i.i.d. samples on a submanifold converge to the spectrum of the (weighted) Laplace–Beltrami operator of the submanifold as the sample size tends to infinity and the neighborhood radius tends to zero?
Graph-based approximations to the Laplace-Beltrami operator have been studied by several authors and in a variety of settings. The pointwise convergence of the graph Laplacian towards the Laplace-Beltrami operator has been proven in HeAuvL07; bel_niy_LB; GK; Hei2006; singer06; THJ. The spectral convergence of the graph Laplacian for fixed neighborhood size for Euclidean domains has been established in vLBeBo08; RosBelVit2010. The spectral convergence of the graph Laplacian towards the Laplace–Beltrami operator for the uniform distribution has been discussed in belkin2007convergence for the case of Gaussian weights and in SinWu13 for the connection Laplacian, without precise information on the allowable scaling of the neighborhood radius and without convergence rates. In GTSspectral the authors establish the conditions on graph connectivity for the spectral convergence on domains in . In particular they prove convergence when as and
However, no error estimates were established. The preprint Shi2015 establishes (in Theorem 1.1) the spectral convergence of graph Laplacians constructed from data sampled from a submanifold in with a convergence rate of , where is the intrinsic dimension of the submanifold.
In this paper we propose a general framework to analyze the rates of spectral convergence for a large family of graph Laplacians. This framework in particular allows us to improve the results in Shi2015 and establish a convergence rate of which is a significant improvement, in particular for small dimensions . These convergence rates hold for different reweighting schemes of the graph Laplacian found in the literature, including the unnormalized Laplacian, the normalized Laplacian, and the random walk Laplacian. When the intrinsic dimension of the submanifold is small, our results show, to some extent, why Laplacian eigenmaps can effectively extract geometric information from the data set, even though the number of features may be high. Moreover, similar to GTSspectral, we show that the conditions in (1.1) are sufficient for spectral convergence. This is essentially the same condition required to ensure that the constructed graph is almost surely connected PenroseBook and thus is close to optimal. It is interesting to note that for pointwise consistency of the graph Laplacian HeAuvL07; GK the required stronger condition is .
Our framework is completely different from that in belkin2007convergence; Shi2015 and builds on two main ideas. First, it builds on an extension of the recent result of Burago, Ivanov and Kurylev BIK, see also Fuj1995, which shows in a non-probabilistic setting how one can approximate eigenvalues and eigenfunctions of the Laplace-Beltrami operator using the eigenvalues/eigenvectors of the graph Laplacian associated to an ε-net of the submanifold. As in our setting the manifold is unknown, we generalize the result of BIK by using a graph construction which requires no knowledge about the submanifold but which achieves the same approximation guarantees for the eigenvalues. In addition, we introduce a new out-of-sample extension of the eigenvectors for the approximation of the eigenfunctions which requires no information about the submanifold, without significant loss in the convergence rate compared to the corresponding construction used in BIK. Our second main result generalizes the recent work of García Trillos and Slepčev GTS15a to the setting of empirical measures on submanifolds and establishes their rate of convergence in ∞-optimal transportation (OT) distance; the ∞-OT distance between the empirical measure associated to a point cloud and the volume form of the submanifold can be seen to be closely related to the notion of ε-net used in BIK. These estimates encompass all the probabilistic computations that we need to obtain our main results, and in particular, when combined with our deterministic computations, provide all the probabilistic estimates that quantify the rate of convergence of the spectrum of graph Laplacians constructed from randomly generated data towards the spectrum of a (weighted) Laplace-Beltrami operator on the submanifold. We believe that both the generalization of BIK and the generalization of GTS15a are of independent interest. The combination of these two ideas and a number of careful estimates lead to our main results.
In what follows we make precise the setting considered in the sequel, and define the different graph Laplacians and their continuous counterparts.
1.1 Graph construction
Let be a compact connected -dimensional Riemannian manifold without boundary, embedded in , with . We assume that the absolute value of sectional curvature is bounded by , the injectivity radius is and with reach . We write for the distance between and on the manifold and for the Euclidean distance in .
Let be a probability measure on that has a non-vanishing Lipschitz continuous density with respect to the Riemannian volume on with Lipschitz constant . Compactness of and continuity of guarantee the existence of a constant such that
We let be a sequence of i.i.d. samples from . In order to leverage the geometry of from the data, we build a graph with vertex set . In the simplest setting, for each we choose a neighborhood parameter and we put an edge from to and from to (and write ) provided that ; we let be the set of such edges. More generally, we consider weighted graphs, with weights that depend on the distance between the vertices connected by them. For that purpose, let us consider a decreasing function with support on the interval such that the restriction of to is Lipschitz continuous. Normalizing if needed allows us to assume from here on that
For convenience we assume that . We denote by
the surface tension of , where represents the first coordinate of the vector . To every given edge we assign the weight where
and we consider the weighted graph with as in (1.5) for every . In fact, note that if the points , are not connected by an edge in then .
The function can be chosen as as well as a smooth function like
(where is the appropriate constant ensuring normalization) or simply a truncated version of a Gaussian. Also, we note that for it follows from (BIK, (2.7)) that , where is the volume of the unit ball in . While the definition of the weights is, up to the constant and a slightly different rescaling in terms of , similar to BIK, the main difference is that we use the Euclidean metric of the ambient space in (1.5), whereas in BIK neighborhoods are defined throughout in terms of the geodesic distance. Here we are forced to use the metric from the ambient space, as the manifold is in general assumed to be unknown.
We have assumed that is decreasing and that , which would imply that . Nevertheless, we remark that none of the results presented in this paper change if we modify the value of . In particular we allow for if desired and we can simply assume that is decreasing and Lipschitz in (then the condition changes to ). This observation is relevant in order to allow for graphs where vertices have no edges with themselves.
The requirement that is compactly supported is purely a technical one. It is in principle possible to carry out the arguments of this work for noncompact kernels, like the Gaussian one. However that would require obtaining error bounds on extra terms and would make the already involved estimates even more complicated.
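The graph construction of this section can be illustrated numerically. The following is a minimal sketch and not the exact construction analyzed in this paper: the sample on the unit sphere, the linear-ramp kernel `kernel`, and the helper name `epsilon_graph_weights` are assumptions made for illustration, and the normalizing constants in (1.5) are deliberately omitted, so that only the kernel evaluated at rescaled Euclidean distances is computed.

```python
import numpy as np

def kernel(t):
    """Hypothetical kernel: decreasing and Lipschitz on [0, 1], zero beyond 1."""
    return np.where(t <= 1.0, 1.0 - t, 0.0)

def epsilon_graph_weights(points, eps):
    """Unscaled edge weights kernel(|x_i - x_j| / eps) from ambient Euclidean distances."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return kernel(dists / eps)

# Illustrative data: 200 i.i.d. uniform points on the unit sphere in R^3.
rng = np.random.default_rng(0)
gauss = rng.normal(size=(200, 3))
points = gauss / np.linalg.norm(gauss, axis=1, keepdims=True)

W = epsilon_graph_weights(points, eps=0.3)
num_edges = int(np.count_nonzero(np.triu(W, k=1)))
print("number of edges:", num_edges)
```

Only Euclidean distances in the ambient space enter the computation, so no knowledge of the manifold is required. In this sketch the diagonal of `W` is nonzero, which corresponds to vertices having edges with themselves.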
1.2 Dirichlet forms and Laplacians
In this section we introduce the Laplacians in both discrete and continuous settings.
We use the graph structure defined in the previous section to define a Dirichlet form in the discrete setting. First, the weights serve as a measure on the set and thus induce a scalar product of functions given by
Second, for functions on the vertices, we define the discrete differential
We can then define the discrete Dirichlet form between as
In the continuous setting, on the domain (the Sobolev space of functions in with distributional derivative in ) we define the Dirichlet form as
where stands for the Riemannian volume form of , and are the gradients of and and represents the Riemannian metric induced on . Since is bounded from above, this symmetric bilinear form is continuous, i.e. for a suitable constant and all . For the remainder we use and as shorthand for and , respectively.
Next, we choose measures on and on the manifold and define corresponding operators associated with the forms and on and , respectively. The idea is that by modifying the inner product in and in we obtain different realizations of Laplacian operators. The so-called unnormalized and random walk graph Laplacian (see definitions below), as well as their continuous counterparts, are instances of the general framework that we consider. Let be the empirical measure of the random sample, i.e.
On we consider the measure endowed with a density , denoted by . On the other hand, on , we consider the measure , where is a Lipschitz continuous density with Lipschitz constant with respect to satisfying
On the graph , we define the associated weighted graph Laplacian as , i.e. as the unique operator satisfying
for all .
At the continuum level, we define a weighted Laplacian associated with the form and the measure as follows. On the domain
we set . The operator is formally defined as
where stands for the divergence operator on .
One of the main results of this paper is that the spectrum of approximates well that of . Intuitively, one of the elements needed for this to be true is that the measure approximates as . We use
to quantify this approximation.
We now describe particular forms of the graph Laplacian which are frequently used in the machine learning literature.
1.2.1 Unnormalized graph Laplacian
To obtain the unnormalized graph Laplacian, we choose the density vector as . Then is explicitly given by
for all , which is, up to the factor , known as the unnormalized graph Laplacian. In this case , since is the limit of as . This results in a realization of the Laplacian on that satisfies
for all . In case , this operator coincides with
from Definition 8 of HeAuvL07 , where it was identified as the pointwise limit of the unnormalized graph Laplacian.
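As a sanity check on the discrete objects introduced so far, the following sketch builds the standard combinatorial Laplacian (degree matrix minus weight matrix), which is the matrix commonly called the unnormalized graph Laplacian. The scaling factor used in this paper is omitted, and the random symmetric weight matrix and the function name `unnormalized_laplacian` are assumptions made for illustration. The sketch compares the quadratic form of the Laplacian with a weighted sum of squared differences across edges, the discrete analogue of a Dirichlet energy.

```python
import numpy as np

def unnormalized_laplacian(W):
    """Return D - W, where D is the diagonal matrix of vertex degrees."""
    degrees = W.sum(axis=1)
    return np.diag(degrees) - W

# Hypothetical symmetric, nonnegative weight matrix without self-loops.
rng = np.random.default_rng(1)
A = rng.uniform(size=(6, 6))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)

L = unnormalized_laplacian(W)
u = rng.normal(size=6)

# u^T L u versus (1/2) * sum_{i,j} W_ij (u_i - u_j)^2.
quadratic_form = u @ L @ u
edge_sum = 0.5 * np.sum(W * (u[:, None] - u[None, :]) ** 2)
print(quadratic_form, edge_sum)
```

The two printed numbers agree up to rounding, which also shows that the matrix is positive semidefinite and that constant vectors lie in its kernel.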
1.2.2 Random walk graph Laplacian
In order to obtain the random walk graph Laplacian, we choose the density vector as the vertex degrees, i.e.
and for all . Then is given by
for all and satisfies
for all . In case that , is nothing but
from (HeAuvL07, Definition 8). In the remainder we use to denote
the random walk graph Laplacian and for its continuous counterpart.
Showing the closeness of and , (1.10), reduces to showing a kernel density estimate on a manifold. In Appendix A we show that provided satisfies Assumption 1.3, we have
where is a universal constant and is the ∞-OT distance between and (see (1.13) and Section 2). These estimates are proved with a simple and general approach based on the transportation maps introduced in Section 2, in contrast to the usual kernel density estimation approaches. The estimates are not optimal, but they are of the same order as the approximation error of the Dirichlet form by the discrete Dirichlet form that we present in Lemma 13 and Lemma 14; the bottom line is that the rates of convergence for the spectrum of the random walk graph Laplacian are unaffected by the non-optimal estimate (1.12). On the other hand, our proof of (1.12) has the advantage of reducing all probabilistic estimates in our problem to estimating the ∞-OT distance between and , which is done in Section 2.
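The role of the vertex degrees as a density can be seen in a small numerical sketch. It uses the standard form of the random walk Laplacian (identity minus the degree-normalized weight matrix) without the scaling constants of this paper; the random weight matrix and the function name `random_walk_laplacian` are assumptions made for illustration. The matrix itself is not symmetric, but it becomes symmetric after multiplication by the degree matrix, that is, it is self-adjoint with respect to the degree-weighted inner product.

```python
import numpy as np

def random_walk_laplacian(W):
    """Return I - D^{-1} W, where D is the diagonal matrix of vertex degrees."""
    degrees = W.sum(axis=1)
    return np.eye(W.shape[0]) - W / degrees[:, None]

# Hypothetical symmetric weight matrix with strictly positive entries.
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(6, 6))
W = (A + A.T) / 2.0
D = np.diag(W.sum(axis=1))

L_rw = random_walk_laplacian(W)

# Rows of L_rw sum to zero, so constant vectors are in the kernel.
row_sums = L_rw.sum(axis=1)
# D @ L_rw equals D - W, which is symmetric.
weighted = D @ L_rw
print(np.max(np.abs(row_sums)), np.max(np.abs(weighted - weighted.T)))
```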
1.2.3 Normalized graph Laplacian
So far we have described how one can obtain the unnormalized and random walk Laplacians as examples of the general framework introduced in this section. Let us recall another popular normalized Laplacian, usually referred to as the symmetric normalized graph Laplacian. For given , the symmetric normalized Laplacian of is given by
with defined by (1.11). We remark that cannot be obtained by appropriately choosing the measure as described in this section (in order to recover it we would have to modify the definition of the discrete differential in (1.6)). Nevertheless, we can indirectly analyze the rate of convergence of its spectrum towards that of a continuous counterpart by noting that and are similar matrices. Indeed, we recall that if and only if where . Thus, and share the same spectrum.
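The similarity argument can be checked numerically. The sketch below uses the standard unscaled forms of the symmetric normalized and random walk Laplacians; the random weight matrix and all variable names are assumptions made for illustration. It maps each eigenvector of the symmetric normalized Laplacian through the inverse square root of the degree matrix and checks that the result is an eigenvector of the random walk Laplacian with the same eigenvalue.

```python
import numpy as np

# Hypothetical symmetric weight matrix with strictly positive entries.
rng = np.random.default_rng(2)
n = 8
A = rng.uniform(0.1, 1.0, size=(n, n))
W = (A + A.T) / 2.0
degrees = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degrees)

# Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
# Random walk Laplacian: I - D^{-1} W.
L_rw = np.eye(n) - W / degrees[:, None]

eigvals_sym, eigvecs_sym = np.linalg.eigh(L_sym)

# If L_sym u = lam * u, then w = D^{-1/2} u should satisfy L_rw w = lam * w.
mapped = d_inv_sqrt[:, None] * eigvecs_sym
residual = L_rw @ mapped - mapped * eigvals_sym[None, :]

# Compare the two spectra directly as well.
eigvals_rw = np.sort(np.linalg.eigvals(L_rw).real)
print(np.max(np.abs(residual)), np.max(np.abs(eigvals_rw - eigvals_sym)))
```

Both printed quantities are at the level of rounding error, in line with the statement that the two matrices share the same spectrum.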
1.3 Main results
1.3.1 Convergence of eigenvalues and transportation estimates
Our first main result is the following.
Let be i.i.d. samples from a distribution supported on , with density satisfying (1.2). Consider and as in Section 1.2.1 or Section 1.2.2. For let be the -th eigenvalue of the graph Laplacian defined in Section 1.2 with
where if and if . Let be the -th eigenvalue of the Laplacian defined in Section 1.2. Then,
The actual choice of in the previous theorem is explained by the more general and detailed result stated in Theorem 1.4, together with the estimates for the -OT distance between and in Theorem 1.2. Indeed, we have taken to scale like where is the -OT distance between and . More precisely,
where means that for every Borel subset of . Such mappings are called transport maps from to . One of the key ingredients needed to establish Theorem 1.1 is the probabilistic estimate on -OT distance contained in our next theorem.
Let be a smooth, connected, compact manifold with dimension . Let be a probability density satisfying (1.9) and consider the measure . Let be an i.i.d. sample of . Then, for any and every there exists a transportation map and a constant such that
holds with probability at least , where depends only on , , , , and .
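To build some intuition for transport maps and their maximal displacement, consider a one-dimensional toy case that is not covered by the setting of this paper (the unit interval has a boundary). The sketch below is an assumption-laden illustration and not the construction used in the proofs: it takes the uniform measure on the unit interval, sends the i-th cell of length 1/n to the i-th smallest sample point, and computes the largest distance any point is moved. Since this monotone map pushes the uniform measure forward to the empirical measure, its maximal displacement is an upper bound on the ∞-OT distance in this toy case. The function name `monotone_transport_sup_displacement` is hypothetical.

```python
import numpy as np

def monotone_transport_sup_displacement(samples):
    """Sup displacement of the monotone map from Uniform[0, 1] to the empirical measure.

    The cell [i/n, (i+1)/n) is sent to the i-th smallest sample, so every
    sample receives mass 1/n. The displacement on a cell is largest at one
    of its two endpoints.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    left = np.arange(n) / n
    right = np.arange(1, n + 1) / n
    return float(np.max(np.maximum(np.abs(x - left), np.abs(x - right))))

rng = np.random.default_rng(0)
for n in (100, 10_000):
    value = monotone_transport_sup_displacement(rng.uniform(size=n))
    print(n, value)
```

For equally spaced midpoints the function returns exactly half the cell length, while for random samples the printed values shrink as the sample size grows.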
With the estimates in Theorem 1.2 at hand, Theorem 1.1 follows from the more general Theorem 1.4 below (more precisely from its corollaries). Indeed, convergence rates for the spectrum of graph Laplacians can be written in terms of and as long as . Throughout this paper we assume that and are sufficiently small. In particular we make the following assumptions.
where is the injectivity radius of the manifold , is a global upper bound on the absolute value of sectional curvatures of , is the dimension of , and is the reach of (seen as a submanifold embedded in ).
For let be the -th eigenvalue of the graph Laplacian defined in Section 1.2 using the weights , and let be the -th eigenvalue of the Laplacian defined in Section 1.2 using the weight function . Finally let be the -OT distance between and and assume that satisfies Assumptions 1.3. Then,
(Upper bound) If and are such that
for a positive constant that depends only on and , then,
where only depends on , and .
(Lower bound) If and are such that
for a positive constant that depends only on , and , then,
where only depends on , and .
Note that the lower bound does not depend on the reach . This is due to the one-sided inequality
In contrast, for the upper bound one must use a reverse inequality with an additional higher order correction term that depends on . See Proposition 2.
It is also worth pointing out that the presence of the term in the upper bound ultimately comes from the estimate on how far the map in (1.22) is from being an isometry when restricted to the first eigenspaces of ; the relevant length-scale for this estimate is the size of transport cells, i.e., . On the other hand, the term in the lower bound comes from the estimate on how far the map in (1.24) is from being an isometry when restricted to the first eigenspaces of ; the relevant length-scale for this estimate is , which is of the same order as the bandwidth for the kernel used to define the map . This can be seen from Lemmas 13 and 14, respectively.
The estimates on from Theorem 1.2 combined with Theorem 1.4 imply that converges towards with probability one whenever , , . We can specialize Theorem 1.4 to the examples from Section 1.2, where in particular we provide estimates on in terms of .
Corollary 1 (Convergence of eigenvalues unnormalized graph Laplacian)
In the context of Theorem 1.4 suppose that the weights are taken to be and . If is small enough for
to hold for a positive constant that depends only on , and , then
where only depends on , and .
The result follows directly from Theorem 1.4 after noticing that in this case and .
Corollary 2 (Convergence of eigenvalues random walk graph Laplacian)
Notice that the estimates in the previous results provide a lower bound on the mode at which the spectrum of the graph Laplacian stops being informative about the spectrum of the Laplace-Beltrami operator. Namely, notice that the right hand sides of (1.19) and (1.20) are small when is small. Using Weyl’s law for the growth of eigenvalues of the Laplace-Beltrami operator we know that
and thus, the relative error of approximating with is small when and . In particular, if is taken to scale like (as is the case in Theorem 1.1) then is approximated by if for and for .
We would like to remark that one of the main advantages of writing all our estimates in Theorem 1.4 in terms of the quantity (which is the only one where randomness is involved) is that we can transfer probabilistic estimates for into probabilistic estimates for the error of approximation of . In particular, when combined with Theorem 1.2, Corollary 1 and Corollary 2 can be read as follows: Suppose that . Let be such that . Let . Then, with probability at least ,
for all .
Moreover, writing all our estimates in Theorem 1.4 in terms of the quantity is also convenient because the conclusion of the theorem holds even when the points are not i.i.d. samples from the measure . That is, one only needs to ensure that the assumption (1.15) is satisfied. In other words, whenever one has an estimate on the ∞-OT distance between the point cloud and the measure and a kernel density estimate to ensure that (1.15) holds, the theorem provides an error estimate on the eigenvalues. We note that the kernel density estimate in terms of the ∞-OT distance we provide in Lemma 18 implies that one in fact only needs an estimate on the ∞-OT distance between the point cloud and the measure .
1.3.2 Convergence of eigenfunctions
We prove that eigenvectors of converge towards eigenfunctions of and provide quantitative error estimates. To make the statements precise, we need to make sense of how to compare functions defined on the graph/sample with functions defined on the manifold . In this paper we consider two different ways of doing this.
The first approach involves an interpolation step by composing with the optimal transportation map from (1.13), followed by a smoothing step. Both of these steps require knowledge of . The map induces a partition of where
We note that for all . We define the contractive discretization map by
and the extension map by
We note that can be written as . We then consider the interpolation operator
Let be the graph Laplacian defined in Section 1.2 using the weights , and let be the Laplacian defined in Section 1.2 using the weight function . Let be the -OT distance between and and assume that satisfies Assumptions 1.3. Finally, assume that and are small enough so that
for a constant that depends only on .
Then, for every normalized eigenfunction of corresponding to the eigenvalue , there exists a normalized eigenfunction of corresponding to the -th eigenvalue such that
where is a constant that only depends on and where is the difference between the smallest eigenvalue of that is strictly larger than and the largest eigenvalue of that is strictly smaller than (i.e. a spectral gap).
In particular, if we take
where for and for , then,
As in Remark 7, we would like to emphasize that the probabilistic estimates for translate directly into probabilistic estimates for the convergence of eigenfunctions in Theorem 1.5. Likewise, we would like to point out that Theorem 1.5 can be made concrete in the context of Sections 1.2.1 and 1.2.2 using the corresponding estimates for in terms of and .
The second approach to compare eigenvectors of with eigenfunctions of is to extrapolate the values of discrete eigenvectors to the Euclidean Voronoi cells induced by the points . That is, for an arbitrary function we assign to each point the value where is the nearest neighbor of in with respect to the Euclidean distance. More formally, for we consider the Voronoi cells
and define the function by
We notice that the Voronoi cells form a partition of , up to a set of ambiguity of -measure zero. Besides being a computationally simple interpolation, the Voronoi extension can be constructed exclusively from the data and no information on is needed. We show that the interpolation of a discrete eigenvector approximates the corresponding eigenfunction on with almost the same rate as in Theorem 1.5. In order to obtain convergence of the Voronoi extensions , we require to satisfy
This condition holds, for instance, when is chosen as , which minimizes the error in the following result.
Fix . Let be the graph Laplacian defined in Section 1.2 using the weights , and let be the Laplacian defined in Section 1.2 using the weight function . Let be the -OT distance between and and assume that satisfies Assumptions 1.3. Finally, assume that and are small enough so that in particular
for a constant that depends only on .
We remark that the first term in (1.28) is worse than the estimate in Theorem 1.5 by a logarithmic factor of . This is due to our uniform estimates on the size of Voronoi cells based on transportation (see Lemma 17). On the other hand, the extra factor in (1.28) is an estimate for the difference of the averages of over transport cells and Voronoi cells; here we use the regularity of an eigenfunction and in particular we use a bound for found in ShiXu10 .
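The Voronoi extension described in this section amounts to a nearest-neighbor lookup in the ambient Euclidean space. The following is a minimal sketch under that reading; the function name `voronoi_extension`, the brute-force distance computation, and the toy data are assumptions made for illustration.

```python
import numpy as np

def voronoi_extension(sample_points, values, query_points):
    """Assign to each query point the value at its nearest sample point.

    Distances are Euclidean distances in the ambient space, so only the
    data are used and no information about the manifold is needed.
    """
    diffs = query_points[:, None, :] - sample_points[None, :, :]
    nearest = np.argmin(np.linalg.norm(diffs, axis=-1), axis=1)
    return values[nearest]

# Toy example: two sample points in the plane carrying the values 10 and 20.
sample_points = np.array([[0.0, 0.0], [1.0, 0.0]])
values = np.array([10.0, 20.0])
query_points = np.array([[0.2, 0.0], [0.9, 0.1]])

extended = voronoi_extension(sample_points, values, query_points)
print(extended)
```

On the sample points themselves the extension reproduces the original values, and ties between equidistant sample points are broken by taking the first index, which only affects a set of measure zero. For large samples a spatial search structure would be a natural replacement for the brute-force distance matrix.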
1.4 Outline of the approach and discussion
To prove our main results we exploit well-known variational characterizations for the spectra of and . Our results are then deduced from a careful comparison between the objective functionals of the variational problems.
From the definition of in Section 1.2 it is clear that is positive-semidefinite with respect to the inner product of . We denote by
the eigenvalues of , repeated according to their multiplicities. By the minmax principle we have
where the minimum is over all -dimensional subspaces of . At the continuum level, and given that and are bounded from below, one can show that is a closed and densely defined symmetric operator with compact resolvent (AreEls12, Lemma 2.7). Therefore, its spectrum consists of nonnegative eigenvalues only, which we denote by
where eigenvalues are repeated according to their multiplicities. Moreover, by Courant’s minmax principle we have
where the minimum is over all -dimensional subspaces of , see (MugNit12, Lemma 2.9).
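The minmax characterization can be illustrated in the discrete setting with a short numerical experiment. The sketch below is a simplification: it uses the standard Euclidean inner product rather than the weighted inner products of Section 1.2, and the random weight matrix and the helper name `max_rayleigh_on_subspace` are assumptions made for illustration. It checks that the span of the first k eigenvectors attains the k-th eigenvalue as the largest Rayleigh quotient on that subspace, and that randomly drawn k-dimensional subspaces never do better.

```python
import numpy as np

def max_rayleigh_on_subspace(L, basis):
    """Largest Rayleigh quotient of the symmetric matrix L on span(basis)."""
    Q, _ = np.linalg.qr(basis)  # orthonormal basis of the subspace
    return float(np.linalg.eigvalsh(Q.T @ L @ Q)[-1])

# Hypothetical symmetric weights and the associated Laplacian D - W.
rng = np.random.default_rng(3)
A = rng.uniform(size=(10, 10))
W = (A + A.T) / 2.0
L = np.diag(W.sum(axis=1)) - W

eigvals, eigvecs = np.linalg.eigh(L)
k = 3

# The span of the first k eigenvectors attains the k-th eigenvalue.
attained = max_rayleigh_on_subspace(L, eigvecs[:, :k])

# Random k-dimensional subspaces give values that are at least as large.
random_values = [
    max_rayleigh_on_subspace(L, rng.normal(size=(10, k))) for _ in range(100)
]
print(eigvals[k - 1], attained, min(random_values))
```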
The proof of our results may be split into two main parts. The first part contains all the probabilistic estimates needed in the rest of the paper and is devoted to the proof of Theorem 1.2. The study of the estimates for goes back to ShorYukich ; LeightonShor ; TalagrandGenericChain where the problem was considered in a simpler setting: is the Lebesgue measure on the unit cube and the points are i.i.d. uniformly distributed on . In that context, with very high probability,
where is defined in (1.1). In GTS15a the estimates are extended to measures defined on more general domains (not just ) and with more general densities (not just uniform). In this paper we extend the results in GTS15a to the manifold case. In order to prove Theorem 1.2, we use a proof scheme similar to the one used in GTS15a. Indeed, we first establish Lemma 1 below, which is analogous to (GTS15a, Theorem 1.2) and is of interest on its own. The result includes explicit estimates on how the distance depends on the geometry of .
Let , be two probability densities defined on with
for some . Then it holds for the corresponding measures , , defined as and ,