1. Introduction
1.1. Aim and Background
Several techniques for the analysis of high-dimensional data exploit the fact that data-generating mechanisms can often be described by few degrees of freedom. The focus of this paper is on graph-based methods that employ similarity relationships between points to uncover the low intrinsic dimensionality and geometric structure of a data set. Graph-based learning provides a well-balanced compromise between accuracy and interpretability, and is popular in a variety of unsupervised and semi-supervised tasks [28, 23]. These methods have been analyzed in the idealized setting where the data is sampled from a low-dimensional manifold and similarities are computed using the ambient Euclidean distance or the geodesic distance, see e.g. [7, 20, 5, 10]. The manifold setting is faithful to the presupposition that data arising from structured systems may be described by few degrees of freedom, but it ignores the fact that data are typically noisy. The aim of this paper is to provide new mathematical theory under the more general model assumption that the data consists of random perturbations of low-dimensional features lying on a manifold.
By relaxing the manifold assumption we bring forward two fundamental questions at the heart of graph-based learning that have not been accounted for by previous theory. First, how should one define the interpoint similarities between noisy data points in order to recover Euclidean distances between unperturbed data points more faithfully? Second, is it possible to recover global geometric features of the manifold from suitably defined similarities between noisy data? We will show by rigorous mathematical reasoning that:

Denoising interpoint distances leads to an improved approximation of the hidden Euclidean distance between unperturbed points. We illustrate this general idea by analyzing a simple, easily computable similarity defined in terms of a local regularization of the noisy data set.

Graph-based objects defined via locally regularized similarities can be guaranteed to satisfy improved error bounds in the recovery of global geometric properties. We illustrate this general idea by showing the spectral approximation of an unnormalized graph Laplacian to a Laplace operator defined on the underlying manifold.
In addition to giving theoretical support for the regularization of noisy point clouds, we study the practical use of local regularization in classification problems. Our analytically tractable local regularization depends on a parameter that modulates the amount of localization, and our analysis suggests how to scale the localization parameter in terms of the level of noise in the data. In our numerical experiments we show that in semi-supervised classification problems this parameter may be chosen by cross-validation, ultimately producing classification rules with improved accuracy. Finally, we propose two alternative denoising methods with similar empirical performance that are sometimes easier to implement. In short, the improved geometric understanding facilitated by (local) regularization translates into improved graph-based data analysis, and the results seem to be robust to the choice of methodology.
1.2. Framework
We assume a data model
(1.1) $y_i = x_i + z_i,$
where the unobserved points $x_i$ are sampled from an unknown $m$-dimensional manifold $\mathcal{M}$, the vectors $z_i \in \mathbb{R}^d$ represent noise, and $\mathcal{Y}_n = \{y_1, \dots, y_n\} \subset \mathbb{R}^d$ is the observed data. Further geometric and probabilistic structure will be imposed to prove our main results (see Assumptions 1 and 2 below). Our analysis is motivated by the case, often found in applications, where the number of data points $n$ and the ambient space dimension $d$ are large, but the underlying intrinsic dimension $m$ is small or moderate. Thus, the data-generating mechanism is described (up to a noisy perturbation) by $m$ degrees of freedom. We aim to uncover geometric properties of the underlying manifold from the observed data by using similarity graphs. The set of vertices of these graphs will be identified with the set $[n] := \{1, \dots, n\}$, so that the $i$th node corresponds to the $i$th data point, and the weight $W(i,j)$ between the $i$th and $j$th data points will be defined in terms of a similarity function $\delta : [n] \times [n] \to [0, \infty)$.
The first question that we consider is how to choose the similarity function so that $\delta(i,j)$ approximates the hidden Euclidean distance $\delta_{\mathcal{X}_n}(i,j) := |x_i - x_j|$. Full knowledge of the Euclidean distance between the latent variables $x_i$ would allow one to recover, in the large $n$ limit, global geometric features of the underlying manifold. This motivates the idea of denoising the observed point cloud to approximate the hidden similarity function $\delta_{\mathcal{X}_n}$. Here we study a family of similarity functions based on the Euclidean distance between local averages of points in $\mathcal{Y}_n$. We define a denoised data set $\bar{\mathcal{Y}}_n = \{\bar{y}_1, \dots, \bar{y}_n\}$ by locally averaging the original data set, and we then define an associated similarity function $\delta_{\bar{\mathcal{Y}}_n}(i,j) := |\bar{y}_i - \bar{y}_j|$.
In its simplest form, $\bar{y}_i$ is defined by averaging all observed points that are inside the ball of radius $r > 0$ centered around $y_i$, that is,
(1.2) $\bar{y}_i := \frac{1}{N_i} \sum_{j \in \mathcal{A}_i} y_j,$
where $N_i$ is the cardinality of $\mathcal{A}_i := \{ j : |y_j - y_i| < r \}$. Note that the denoised data set $\bar{\mathcal{Y}}_n$ (and the associated similarity function $\delta_{\bar{\mathcal{Y}}_n}$) depends on $r$, but we do not include this dependence in our notation for simplicity. Other possible local and non-local averaging approaches may be considered. We will only analyze the choice made in (1.2) and we will explore other constructions numerically. Introducing the notation
$\delta_{\mathcal{Y}_n}(i,j) := |y_i - y_j|,$
the first question that we study may be formalized as understanding when, and to what extent, the similarity function $\delta_{\bar{\mathcal{Y}}_n}$ is a better approximation than $\delta_{\mathcal{Y}_n}$ (the standard choice) to the hidden similarity function $\delta_{\mathcal{X}_n}$. An answer is given in Theorem 1 below.
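The local averaging in (1.2) can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation used in the paper's experiments: the function names `local_average` and `pairwise_distances`, the unit circle as hidden manifold, and the values of `n`, `sigma` and `r` are all assumptions chosen for the example. The noise is drawn along the normal direction and is centered and bounded, in the spirit of the assumptions stated later.

```python
import numpy as np

def local_average(Y, r):
    """Replace each row y_i of Y by the mean of all rows within distance r of y_i."""
    # Pairwise Euclidean distances between observed points (n x n matrix).
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    # Row i of the mask marks the points inside the open ball of radius r around y_i.
    # Each point lies in its own ball, so every row has at least one True entry.
    mask = dist < r
    counts = mask.sum(axis=1, keepdims=True)
    return (mask.astype(float) @ Y) / counts

def pairwise_distances(P):
    """Matrix of Euclidean distances between the rows of P."""
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)

# Hypothetical example: hidden points on the unit circle in the plane.
rng = np.random.default_rng(0)
n, sigma, r = 400, 0.1, 0.3
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# On the unit circle the outward normal at x is x itself, so this noise is
# normal to the manifold, centered, and bounded in norm by sigma.
Z = rng.uniform(-sigma, sigma, size=(n, 1)) * X
Y = X + Z
Y_bar = local_average(Y, r)

# Compare both similarities with the hidden distances on nearby, distinct pairs.
d_X = pairwise_distances(X)
close = (d_X < r) & ~np.eye(n, dtype=bool)
err_raw = np.abs(pairwise_distances(Y) - d_X)[close].mean()
err_avg = np.abs(pairwise_distances(Y_bar) - d_X)[close].mean()
print("mean error, raw distances:     ", err_raw)
print("mean error, averaged distances:", err_avg)
```

The dense distance matrix keeps the sketch short but costs memory quadratic in the number of points; for large data sets a spatial index for the radius queries would be the natural replacement.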
The second question that we investigate is how an improvement in the approximation of the hidden similarity affects the approximation of global geometric quantities of the underlying manifold. Specifically, we study how the spectral convergence of graph-Laplacians constructed with noisy data may be improved by local regularization of the point cloud. For concreteness, our theoretical analysis is focused on graphs and unnormalized graph-Laplacians, but we expect our results to generalize to other graphs and graph-Laplacians; evidence to support this claim will be given through numerical experiments. We now summarize the necessary background to formalize this question. For a given similarity and a parameter , we define a weighted graph by setting the weight between the th and th node to be
(1.3) 
where is the volume of the dimensional Euclidean unit ball. Associated to the graph we define the unnormalized graph Laplacian matrix
(1.4) 
where is a diagonal matrix with diagonal entries
For the rest of the paper we shall denote and We use analogous notation for and . The second question that we consider may be formalized as understanding when, and to what extent, provides a better approximation (in the spectral sense) than to a Laplace operator on the manifold An answer is given in Theorem 2 below.
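A minimal sketch of the graph construction and of the unnormalized graph Laplacian as degree matrix minus weight matrix. It assumes an indicator kernel, so that two nodes are connected when their distance is below the connectivity parameter. The function name `epsilon_graph_laplacian` and the argument `scale`, which stands in for the normalization constant of (1.3), are assumptions made for this illustration.

```python
import numpy as np

def epsilon_graph_laplacian(P, eps, scale=1.0):
    """Unnormalized Laplacian D - W of the graph connecting rows of P closer than eps."""
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    # Indicator-kernel weights; `scale` is a placeholder for the normalization in (1.3).
    W = scale * (dist < eps).astype(float)
    # Self-loops cancel in D - W, so they are dropped for clarity.
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    return D - W

# Hypothetical point cloud; in the setting of the paper P would hold either the
# observed points or their local averages.
rng = np.random.default_rng(1)
P = rng.normal(size=(50, 3))
L = epsilon_graph_laplacian(P, eps=1.5)

# The matrix is symmetric, so a symmetric eigensolver applies.
eigvals = np.linalg.eigvalsh(L)
print(eigvals[:5])
```

Passing the locally averaged points instead of the raw observations to the same function yields the regularized Laplacian; only the similarity changes, not the graph construction.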
1.3. Main results
In this subsection we state our main theoretical results. We first impose some geometric conditions on the underlying manifold
Assumption 1.
is a smooth, oriented, compact manifold with no boundary and intrinsic dimension , embedded in Moreover, has injectivity radius , maximum of the absolute value of sectional curvature and reach
Loosely speaking, the injectivity radius determines the range of the exponential map (which will be an important tool in our analysis and will be reviewed in the next section) and the sectional curvature controls the metric distortion by the exponential map, and thereby its Jacobian. The reach can be thought of as an (inverse) conditioning number of the manifold and controls its second fundamental form; it can also be interpreted as a measure of extrinsic curvature. The significance of these geometric quantities and their role in our analysis will be further discussed in Section 2.
Next we impose further probabilistic structure into the data model (1.1). We assume that the pairs are i.i.d. samples of the random vector . Let and be, respectively, the marginal distribution of and the conditional distribution of given . We assume that is absolutely continuous with respect to the Riemannian volume form of with density , i.e.,
(1.5) 
Furthermore, we assume that is supported on (the orthogonal complement of the tangent space and that it is absolutely continuous with respect to the dimensional Hausdorff measure restricted to with density , i.e.,
To ease the notation we will write instead of . We make the following assumptions on these densities.
Assumption 2.
It holds that:

The density is of class and is bounded above and below by positive constants:

For all ,
Moreover, there is such that for all with
Note that the assumption on ensures that the noise is centered and bounded by a constant
In our first main theorem we study the approximation of the similarity function by . We consider points and that are close with respect to the geodesic distance on the manifold, and show that local regularization improves the approximation of the hidden similarity provided that is large and the noise level is small. The local regularization parameter needs to be suitably scaled with We make the following standing assumption linking both parameters; we refer to Remark 1 below for a discussion on the optimal scaling of with , and to our numerical experiments for practical guidelines.
Assumption 3.
The localization parameter and the noise level satisfy
(1.6) 
where is a universal constant, denotes the volume of the Euclidean unit ball in and and are as in Assumption 1.
In words, Assumption 3 requires both and to be sufficiently small, and to be larger than .
Now we are ready to state the first main result.
Theorem 1.
, with probability at least
, for all and with we have
(1.7) 
where and is a constant depending on , a uniform bound on the change in second fundamental form of , and on the regularity of the density
Remark 1.
Theorem 1 gives concrete evidence of the importance of the choice of similarity function. For the usual Euclidean distance between observed data, one can only guarantee that
which follows from
However, if we choose , then the error in (1.7) is of order , which is a considerably smaller quantity in the small noise limit.
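The bound for the standard similarity at the start of this remark can be spelled out as follows. The derivation assumes the additive form $y_i = x_i + z_i$ for the data model (1.1) and writes $\sigma$ (a symbol introduced here for illustration) for the bound on the noise norm $|z_i|$ from Assumption 2.

```latex
\begin{align*}
\bigl|\, |y_i - y_j| - |x_i - x_j| \,\bigr|
  &\le \bigl| (y_i - y_j) - (x_i - x_j) \bigr|
     && \text{(reverse triangle inequality)} \\
  &= |z_i - z_j| \\
  &\le |z_i| + |z_j| \;\le\; 2\sigma .
\end{align*}
```

Without regularization, the error in the distances is therefore only controlled at the order of the noise level itself.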
Remark 2.
It will become evident from our analysis that for points that are sufficiently close, the quantity is much smaller than the terms This is an important observation, since our purpose is not to estimate the location of the point using , but rather to estimate distances. In other words, our interest is in the intrinsic geometry of the point cloud and not in its actual representation.
Our second main result translates the local similarity bound from Theorem 1 into a global geometric result concerning the spectral convergence of the graph Laplacian to the Laplace operator formally defined by
(1.8) 
where div and denote the divergence and gradient operators on the manifold and is the sampling density of the hidden point cloud , as introduced in Equation (1.5). It is intuitively clear that the spectral approximation of the discrete graph-Laplacian to the continuum operator necessarily rests upon having a sufficient number of samples from (defined in (1.5)). In other words, the empirical measure needs to be close to the sampling density of the hidden data set. We characterize the closeness between and by the OT transport distance, defined as
where denotes the pushforward of by , that is, for any Borel subset of . In [10] it is shown that for every , with probability at least ,
where if and for . This is the high probability scaling of in terms of .
We introduce some notation before stating our second main result. Let be the th eigenvalue of the unnormalized graph-Laplacian defined in Equation (1.4), and let be the th eigenvalue of the continuum Laplace operator defined in Equation (1.8).
Theorem 2.
Suppose that Assumptions 1, 2, and 3 hold. Suppose further that is small enough (but not too small) so that
(1.9) 
where is a constant that only depends on and the regularity of the density , is a universal constant, and
is the bound in (1.7). Then, with probability at least ,
where only depends on and the regularity of , and
Remark 3.
As described in Remark 1, local regularization enables a smaller than if no regularization is performed. This in turn allows one to choose, for a given error tolerance, a smaller connectivity , leading to a sparser graph that is computationally more efficient. Note also that the bound in Theorem 2 does not depend on the ambient space dimension but only on the intrinsic dimension of the data.
Remark 4.
Theorem 2 concretely shows how an improvement in metric approximation translates into an improved estimation of global geometric quantities. We have restricted our attention to analyzing eigenvalues of a Laplacian operator, but we remark that the idea goes beyond this particular choice. For example, one can conduct an asymptotic analysis illustrating the effect of changing the similarity function in the approximation of other geometric quantities of interest like Cheeger cuts. Such analysis could be carried out using the variational convergence approach from [13].
We would also like to mention that it is possible to make statements about convergence of eigenvectors of graph Laplacians following the results in [10]. We have omitted the details for brevity.
1.4. Related and Future Work
Graph-based learning algorithms include spectral clustering, total variation clustering, graph-Laplacian regularization for semi-supervised learning, and graph-based Bayesian semi-supervised learning. A brief and incomplete summary of methodological and review papers is [19, 16, 2, 27, 21, 23, 28, 4]. These algorithms involve either a graph Laplacian, the graph total variation, or Sobolev norms involving the graph structure. The large sample behavior of some of the above methodologies has been analyzed without reference to the intrinsic dimension of the data [24] and in the case of points lying on a low-dimensional manifold, see e.g. [3, 12, 11] and references therein. Some papers that account for both the noisy and low intrinsic dimensional structure of data are [17, 15, 1, 25]. For example, [17] studies the recovery of the homology groups of submanifolds from noisy samples. We use the techniques for the analysis of spectral convergence of graph-Laplacians introduced in [5] and further developed in [10]. The results in the latter reference would allow us to extend our analysis to other graph Laplacians, but we do not pursue this here for conciseness.
We highlight that the denoising by local regularization occurs at the level of the data set. That is, rather than denoising each of the observed features individually, we analyze denoising by averaging different data points. In practice combining both forms of denoising may be advantageous. For instance, when each of the data points corresponds to an image, one can first denoise each image at the pixel level and then do regularization at the level of the data set as proposed here. In this regard, our regularization at the level of the data set is similar to applying a filter at the level of individual pixels [22]. The success of non-local filter image denoising algorithms suggests that non-local methods may also be of interest at the level of the data set, but we expect this to be application-dependent. Finally, while in this paper we only consider first-order regularization based on averages, a topic for further research is the analysis of local PCA regularization [15], incorporating covariance information.
It is worth noting the parallel between the local regularization that we study here and mean-shift and mode-seeking methods [6, 9]. Indeed, a side benefit of local averaging in classification and clustering applications is that the data points are pushed to regions of higher density. This parallelism with mean-shift techniques also suggests the idea of doing local averaging iteratively. Local regularization may also be interpreted as a form of dictionary learning, where each data point is represented in terms of its neighbors. For specific applications it may be of interest to restrict (or extend) the dictionary used to represent each data point [14].
1.5. Outline
The paper is organized as follows. In Section 2 we formalize the geometric setup and prove Theorem 1. Section 3 contains the proof of Theorem 2 and a lemma that may be of independent interest. Finally, Section 4 includes several numerical experiments. In the Appendix we prove a technical lemma that serves as a key ingredient in proving Theorem 1.
2. Distance Approximation
In this section we prove Theorem 1. We start with Subsection 2.1 by giving some intuition on the geometric conditions imposed in Assumption 1 and introducing the main geometric tools in our analysis. In Subsection 2.2 we decompose the approximation error between the similarity functions and into three terms, which are bounded in Subsections 2.3, 2.4, and 2.5.
2.1. Geometric Preliminaries
2.1.1. Basic Notation
For each we let be the tangent plane of at centered at the origin. In particular, is a dimensional subspace of , and we denote by its orthogonal complement. We will use to denote the Riemannian volume form of . We will denote by the Euclidean distance between arbitrary points in and denote by the geodesic distance between points in . We denote by balls in and by balls in the manifold (with respect to the geodesic distance). Also, unless otherwise specified , without subscripts will be used to denote balls in . We denote by the volume of the unit Euclidean ball in . Throughout the rest of the paper we use and to denote the reach, injectivity radius, and maximum absolute curvature of as in Assumption 1. We now describe at an intuitive level the role that these quantities play in our analysis.
2.1.2. The Reach
The reach of a manifold is defined as the largest value for which the projection map
is well defined, i.e., every point in the tubular neighborhood around of width has a unique closest point in . Our assumption that the noise level satisfies guarantees that is the (well-defined) projection of onto the manifold. The reach can be thought of as an inverse conditioning number for the manifold [17]. We will use that the inverse of the reach provides a uniform upper bound on the second fundamental form (see Lemma 4).
2.1.3. Exponential Map, Injectivity Radius and Sectional Curvature
We will make use of the exponential map , which for every is a map
where is the injectivity radius for the manifold . We recall that the exponential map takes a vector and maps it to the point that is at geodesic distance from along the unit speed geodesic that at time passes through with velocity . The injectivity radius is precisely the maximum radius of a ball in centered at the origin for which the exponential map is a well defined diffeomorphism for every . We denote by the Jacobian of the exponential map . Integrals with respect to can then be written in terms of integrals on weighted by the function . More precisely, for an arbitrary test function ,
For fixed one can obtain bounds on the metric distortion by the exponential map ([8, Chapter 10] and [5, Section 2.2]), and thereby guarantee the existence of a universal constant such that
(2.1) 
An immediate consequence of the previous inequalities is
(2.2) 
where we recall is the volume of the unit ball in . Equations (2.1) and (2.2) will be used in our geometric and probabilistic arguments and motivate our assumptions on the choice of local regularization parameter in terms of the injectivity radius and the sectional curvature.
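As a concrete instance of the exponential map, the following sketch uses the unit sphere in three dimensions, where the map has a closed form. The sphere is an example chosen here and not one analyzed in the paper, and the function names `sphere_exp` and `sphere_geodesic_distance` are hypothetical. On the unit sphere the injectivity radius equals pi and the sectional curvature is identically one.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map of the unit sphere at x applied to a tangent vector v (v orthogonal to x)."""
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return x.copy()
    # Follow the great circle through x with initial direction v / |v| for time |v|.
    return np.cos(norm_v) * x + np.sin(norm_v) * (v / norm_v)

def sphere_geodesic_distance(p, q):
    """Geodesic (great-circle) distance between unit vectors p and q."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))

x = np.array([0.0, 0.0, 1.0])      # base point (north pole)
v = np.array([0.3, 0.4, 0.0])      # tangent vector at x with norm 0.5
p = sphere_exp(x, v)

# The image point stays on the sphere and lies at geodesic distance |v| from x.
print(np.linalg.norm(p), sphere_geodesic_distance(x, p))
```

For tangent vectors of norm below pi the map is a diffeomorphism onto its image, which is the property the injectivity radius encodes in the general setting.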
2.2. Local Distributions
Next we study the local behavior of . To characterize its local distribution, it will be convenient to introduce the following family of probability measures.
Definition 3.
Let be a vector in whose distance to is less than . Let be the projection of onto . We say that the random variable has the distribution provided that
for all Borel sets , where in the above is distributed according to .
In the remainder we use as shorthand notation for . As for the original measure , we characterize in terms of a marginal and conditional distribution. We introduce the density given by
(2.3) 
and define
(2.4) 
where in the above and in the remainder we use and to denote conditional expectation and conditional probability given . It can be easily shown that these functions correspond to the marginal density of and the conditional density of given , where . The distribution is of relevance because by definition of one has
Now we are ready to introduce the main decomposition of the error between the similarity functions and . Using the triangle inequality we can write
(2.5)  
(2.6)  
(2.7) 
In the next subsections we bound each of the terms (2.6) (expected conditional noise), (2.5) (difference in geometric bias), and (2.7) (sampling error). As we will see in Subsection 2.5 we can control both terms in (2.7) with very high probability using standard concentration inequalities. The other three terms are deterministic quantities that can be written in terms of integrals with respect to the distributions and . To study these integrals it will be convenient to introduce two quantities (independent of ) satisfying:

For all with we have
In particular, the density is supported in .

For all with we have
In Appendix A we present the proof of the following lemma giving estimates for and .
Lemma 1 (Bounds for and ).
2.3. Bounding Expected Conditional Noise
Proof.
Using the definition of
The first integral is the zero vector because for we have and is assumed to be centered. Therefore,
where we have used (2.1) and the assumptions on to say (in particular) that , and also the fact that, for
Finally, notice that
where again we have used (2.1) to conclude (in particular) that . The result now follows by (2.8).
∎
2.4. Bounding Difference in Geometric Bias
In terms of and , the difference (and likewise ) can be written as:
where the second to last equality follows from (2.3). To further simplify the expression for let us define
It follows that
(2.9) 
where in the above
Lemma 2.
The following hold.

The terms satisfy

The terms satisfy:
where, up to universal multiplicative constants,

Suppose that . Then,
where, up to universal multiplicative constants,
and only depends on bounds on the first derivatives of the density .