Efficient Manifold and Subspace Approximations with Spherelets

06/26/2017 ∙ by Didong Li, et al. ∙ 0

Data lying in a high-dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional subspace or manifold. There is an immense literature focused on approximating the unknown subspace, and in exploiting such approximations in clustering, data compression, and building of predictive models. Most of the literature relies on approximating subspaces using a locally linear, and potentially multiscale, dictionary. In this article, we propose a simple and general alternative, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown subspace. Building on this idea, we develop a simple and computationally efficient algorithm for subspace learning and clustering. Results relative to state-of-the-art competitors show dramatic gains in ability to accurately approximate the subspace with orders of magnitude fewer components. This leads to substantial gains in data compressibility, few clusters and hence better interpretability, and much lower MSE based on small to moderate sample sizes. Basic theory on approximation accuracy is presented, and the methods are applied to multiple examples.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Given a data set , where

are independent and identically distributed (i.i.d.) from a probability measure

, it is common to assume that is supported on or near a lower dimensional manifold with dimension . “Manifold learning” refers to algorithms that map the -dimensional observed data to -dimensional coordinates exploiting the structure of . This is very useful for dimensionality reduction, providing a nonlinear generalization of Principal Components Analysis (PCA).

A rich variety of algorithms are available for manifold learning and nonlinear dimension reduction, including both global and local methods. Popular global approaches include Curvilinear-CA (Demartines and Hérault, 1997), Kernel PCA (Schölkopf et al., 1998), Isomap (Tenenbaum et al., 2000), Laplacian eigenmap (Belkin and Niyogi, 2002), Nonlinear PCA (Scholz et al., 2005), Probabilistic PCA (Lawrence, 2005) and Diffusion Map (Coifman and Lafon, 2006). Examples of local methods are local PCA (Kambhatla and Leen, 1997), local linear embedding (Roweis and Saul, 2000) and their variants like Donoho and Grimes (2003), Zhang and Wang (2007). See Lee and Verleysen (2007) for an overview. As for PCA, these methods provide a mapping from the high-dimensional space to the low-dimensional embedding, and can be used for data visualization for .

Ironically, almost all available algorithms only produce lower dimensional coordinates and do not in general produce an estimate of the manifold

. In addition, one cannot embed new data points without completely reimplementing the algorithm as data are added, eliminating the possibility of using cross validation (CV) for out-of-sample assessments and tuning parameter choice. Finally, it is not in general possible to obtain denoised estimates , for , of the original data exploiting the manifold structure.

There are some methods designed to approximate manifolds without the above disadvantages, relying on variations of locally projecting data in a neighborhood to the best approximating hyperplane (see

Walder and Schölkopf (2009), Chen and Maggioni (2011), Allard et al. (2012), Maggioni et al. (2016), Liao and Maggioni (2016), Liao et al. (2016) and Xianxi et al. (2017)). The motivation for such methods is that a manifold can be approximated by a collection of tangent spaces at different locations. However, such approaches can lack statistical efficiency and parsimony of the representation, particularly when the true manifold has large curvature. For example, if the manifold is a circle with small radius (large curvature), a huge number of tangent lines will be needed to fit the circle well and achieve a small approximation error. As a result, a locally nonlinear approximation method, considering the curvature of the manifold, is needed to replace locally linear approximations and local PCA.

We propose a simple and efficient alternative to PCA, which uses spheres instead of hyperplanes to locally approximate the unknown manifold. Our approach relies on a spherical generalization of PCA, deemed SPCA. SPCA has a simple analytic form, and can be used as a generalization of PCA to incorporate curvature. We refer to any algorithm that uses spheres for local approximation as spherelets. Spherelets provide an excellent basis for locally approximating non-linear manifolds having either positive, zero, or negative Gaussian curvature that may vary across regions. A major advantage is the ability to accurately approximate with dramatically fewer pieces, without significantly increasing the number of parameters per piece. This leads to very substantial practical gains in practice, as we will illustrate.

Relative to competitors for dimensionality reduction ranging from local linear embeddings (LLE) to diffusion maps, our spherelets approach has the major advantages of producing an estimate of the manifold, allowing new data points to be embedded quickly and efficiently without reimplementing the algorithm, and easily providing denoised estimates of the data adjusting for measurement errors. Relative to competitors for manifold approximation like local PCA, spherelets approximates the manifold with lower error when the manifold is not close to flat globally.

We formalize these gains through providing theory bounding the covering number, showing that dramatic improvement are possible for manifolds having sufficiently large curvature but not too many subregions having locations with large change in normal curvature. We also prove the consistency of empirical SPCA and provide an upper bound on estimation error of SPCA. The sign of the Gaussian curvature has no impact on approximation performance. In fact, spherelets works well for a broad class of Riemannian manifolds, regardless of whether the Gaussian curvature is positive or negative, or even varying in sign with location.

Our simulation results show significant reductions in the number of components needed to provide an approximation within a given error tolerance. Comparing to local PCA, our spherelets approach has much lower mean square error using dramatically fewer components. Substantial practical gains are shown in multiple examples.

Manifold approximation is not the only application of spherelets. We apply spherelets to denoise data on manifolds. There are several competitors designed for denoising including Gaussian Blurring Mean Shift (GBMS) (Zemel and Carreira-Perpiñán (2005), Carreira-Perpiñán (2006), Carreira-Perpinán (2008), Hein and Maier (2007b), Hein and Maier (2007a)), Manifold Blurring Mean Shift (MBMS) (Wang and Carreira-Perpinán (2010), Wang et al. (2011)), and its special case Local Tangent Projection (LTP). Spherelets outperforms these methods in terms of denoising performance in our experiments.

Visualization algorithms project data to a lower dimensional space, while preserving geometric structures hidden in the original data set. Popular algorithms include Isomap, Diffusion Map, and t-distributed Stochastic Neighborhood Embedding (tSNE) (Maaten and Hinton, 2008). We show that all these algorithms can be modified easily to produce spherical versions, potentially improving performance.

The paper is organized as follows. In section 2, we prove our main theorem on the covering numbers of hyperplanes and spherelets. In section 3, we develop an algorithm for estimating an optimal approximating sphere, and propose spherical principal component analysis (SPCA). We prove upper bounds on approximation errors along with some convergence properties of empirical SPCA. In section 4, we propose a simple approximation algorithm along with some numerical experiments. We also consider manifold denoising and data visualization, providing algorithms and numerical results. In section 5, we discuss some open problems and future work. Proofs related to the main theorem and SPCA solution are in the Appendix, while proofs of SPCA variance and bias are in the supplementary file.

2 Covering Number

We define the covering number as the minimum number of local bases needed to approximate the manifold within error. Our main theorem shows the covering number of spherelets is smaller than that of hyperplanes. In this paper, we always assume to be compact, otherwise the covering number is not well-defined.

Definition 1.

Let denote a -dimensional compact Riemannian manifold, and a dictionary of basis functions. Then the covering number is defined as

where is a partition of , is a local basis, is the corresponding local projection and is the global projection.

We focus on two choices of : all -dimensional hyperplanes in , denoted by , and the class of all dimensional spheres in , denoted by . Including spheres with infinite radius in , we have implying the following Proposition.

Proposition 1.

For any compact Riemannian manifold , and ,

Before proving a non-trivial upper bound, we define some important features of .

Definition 2.

Let be a -dimensional Riemannian manifold. The volume of is denoted by . Let denote the curvature of the geodesic on starting from point with initial direction :

where is the unit sphere bundle over and is the exponential map from the tangent plane at to . Then is the maximum curvature. Similarly,

is the maximum of the absolute rate of change of the curvature.

Definition 3.

Given any and letting denote the tangent plane to at ,

is called the set of -spherical points on , where is the unit ball in . Let be the geodesic ball centered at with radius , then is called the spherical submanifold of , and the volume is . is called an sphere if .

Example 1.

A space form, a complete, simply connected Riemmanian manifold of constant sectional curvature, is an sphere for any manifold dimension and .

Example 2.

A one dimensional manifold (a curve) is an sphere for any .

A nontrivial example, the Enneper’s surface (a minimal surface) is given in the supplementary material with detailed calculation.

Theorem 1 (Main Theorem).

For any and compact -dimensional Riemannian manifold ,


When , is a curve and we have the following Corollary.

Corollary 1.

For any and compact curve , .

Remark 1.

The curse of dimensionality comes in through the term

, but we can decrease its impact from to .

Proposition 2.

The upper bounds of covering number and are both tight.

We note that the covering number is largely determined by the geometry of the manifold, mainly by the curvature of geodesics on the manifold, not the sectional curvature or Ricci curvature. Although spheres have positive Gaussian/sectional curvature, they can be used as a dictionary to estimate manifolds with negative Gaussian/sectional curvature. For example when the manifold is a 2-dimensional surface, is determined by the difference of the two principal curvatures, not the Gaussian curvature or mean curvature, which are more well known. When , the absolute curvature determines the radius of the osculating circle and the sign of the curvature determines which side of the tangent line the center of the circle locates in. This provides intuition for why spherelets works well for both positive and negative curvature spaces. The performance of spherelets can also be seen in the numerical examples (Seals) where the curvature may vary from positive to negative depending on the location.

The bounds in our main theorem are tight, implying that spherelets often require many fewer pieces than locally linear dictionaries to approximate to any fixed accuracy level ; particularly large gains occur when a non-negligible subset of is covered by the closure of points having not too large change in curvature of geodesics along different directions. As each piece involves unknown parameters, these gains in covering numbers should lead to real practical gains in statistical performance; indeed this is what we have observed in applications. In the next section, we develop algorithms for fitting local data by a sphere, providing a spherical alternative to local PCA.

3 Spherical PCA

3.1 Estimation

Let , where is the center, is the radius, and determines an affine subspace the sphere lies in. Our goal in this section is to estimate to obtain the best approximating sphere through data consisting of samples in .

Lemma 1.

For any , the closest point is

where is the projection of onto the affine subspace .

In fact, is projected to an affine subspace first and then to the sphere (see the proof of Lemma 1). To find the optimal affine subspace, we use -dimensional PCA, obtaining


where is the

th eigenvector of

in decreasing order. Letting , we then find the optimal sphere through points . A sphere can be expressed as the set of zeros of a quadratic function , where is the center and . When this quadratic function has positive value, is outside the sphere, and

is inside the sphere if the function has negative value. Hence, we define the loss function

By simple calculation, we can show that

Hence, if we define

we can then minimize to obtain and let . The resulting is shown in the following Theorem.

Theorem 2.

The minimizer of is given by


We refer to the resulting estimates as Spherical PCA (SPCA), where

Remark 2.

Alternatively, we could have minimized corresponding to the sum of squared residuals. However, the resulting optimization problem is non convex, lacks an analytic solution, and iterative algorithms may be slow to converge, while only producing local minima.

Corollary 2.

If for all , SPCA will find the same minimizer as the loss function in Remark 2, corresponding to exactly .

Definition 4.

Supposing data , we let denote the values of

obtained by plugging in exact moments of the population distribution in place of sample values.

Next we focus on the consistency of empirical SPCA, characterized by and . The corresponding population version is characterized by and .

We denote the population variance-covariance matrix and sample variance-covariance matrix by and , respectively. Furthermore, we rely on the following two assumptions:

  1. Distributional Assumption: The data matrix where is a matrix whose elements

    ’s are independent and identically distributed (i.i.d.) non-degenerate random variables with

    , and .

  2. Spike Population Model: If

    are the ordered eigenvalues of

    then there exists such that .

An assumption similar to (A) is also considered in Lee et al. (2010) with bounded forth moment. The spike population model is defined in Johnstone (2001).

Theorem 3.

Under the assumptions A and B, we have converges to uniformly in probability as , that is, ,

3.2 Local SPCA

A single sphere will typically not be sufficient to approximate the entire manifold , but instead we partition into local neighborhoods and implement SPCA separately in each neighborhood. This follows similar practice to popular implementations of local PCA, but we apply SPCA locally instead of PCA. We divide into non-overlapping subsets . For the th subset, we let , denote the results of applying SPCA to data , denote the projection map from to obtained by Lemma 1, and . Then, we approximate by .

In general, will not be continuous or a manifold but instead is made up of a collection of pieces of spheres chosen to approximate the manifold . There are many ways in which one can choose the subsets , but in general the number of subsets will be chosen to be increasing with the sample size with a constraint so that the number of data points in each subset cannot be too small, as then cannot be reliably estimated. Below we provide theory on mean square error properties of the estimator under some conditions on how the subsets are chosen but without focusing on a particular algorithm for choosing the subsets.

There are a wide variety of algorithms for multiscale partitioning of the sample space, ranging from cover trees (Beygelzimer et al., 2006) to METIS (Karypis and Kumar, 1998) to iterated PCA (Szlam, 2009). As the scale becomes finer, the number of partition sets increases exponentially and the size of each set decreases exponentially. Assume is an arbitrary submanifold of , is the density of data conditionally on and

For example, if we bisect the unit cube in times, then the diameter of each piece will be . The approximation error depends on : as , each local neighborhood is getting smaller so tangent plane or spherical approximations perform better.

Definition 5.

Given fixed , assume is a submanifold of and is the conditional density. Let be the solution of SPCA (population version, see Definition 4), which depends on only, then the sphere is called the spherelet of .

Theorem 4.

Assume there exists such that for any submanifold , where is the spherelet of . Then there exists that depends only on such that for any submanifold with spherelet ,

Remark 3.

The assumption for all is weak and reasonable when is compact. Recall that (2) so the curvature at any point in any direction is bounded by . Since SPCA is aimed at finding the best sphere to approximate the manifold locally, the curvature of each spherelet is approximately upper bounded by . Since the curvature of a circle is the reciprocal of its radius, we have the relation . As a result, there exists such that for any .

Combining Theorem 3 and Theorem 4, we have the following Corollary:

Corollary 3.

Assume is the empirical spherical projection obtained from SPCA on with , then under assumptions A, B, there exists that depends only on such that for any ,

Remark 4.

Corollary 3 does not imply that applying SPCA in local neighborhoods leads to better approximation error than applying PCA. However, this is not surprising since we are not putting restrictions on the curvature. In the special case in which the curvature is zero, we expect SPCA and PCA to have similar performance. However, outside of such an extreme case, when curvature is not very close to zero, SPCA is expected to yield notably improved performance except for very small local regions. This remark is consistent with the empirical results in the following sections.

4 Applications, Algorithms and Examples

Spherelets can be applied to various problems including manifold approximation, manifold denoising and data visualization. In this section we demonstrate these three main applications with algorithms and some benchmark examples. The examples presented are a representative subset of a broader class of examples we have considered, but do not have space to present.

4.1 Manifold approximation

In manifold approximation we attempt to find an estimator of the unknown manifold , say . As local SPCA provides an estimator of a submanifold in a neighborhood, we split into pieces and apply local SPCA to estimate the manifold in each piece.

There are many existing partitioning algorithms for subdividing the sample space into local neighborhoods. Popular algorithms, such as cover trees, iterated PCA and METIS, have a multi-scale structure, repeatedly bisecting until a stopping condition is achieved. Given any such algorithm, let be the partitions of the ambient space, and be the sub-manifold of restricted to . We find an estimate of using local SPCA (see Section 3.2 for details), and set . The map which projects a data point to the estimated manifold is denoted by . Algorithm 1 explains the calculation of and given a partition of .

input : Data ; intrinsic dimension ; Partition
output : Estimated manifold and projection map for each ; Estimated manifold of and the projection map
1 for  do
2       Define ;
3       Calculate by 4;
4       Calculate ;
5       Calculate ;
7 end for
Calculate , and .
Algorithm 1 Spherelets

For simplicity, we consider a multi-scale partitioning scheme which iteratively splits the sample space based on the first principal component score until a predefined bound on mean square error (MSE) is met for each of partition sets or the sample size no longer exceeds a minimal value . The MSE within set is estimated as


where and are as defined in Algorithm 1. If and , we calculate , where and is the first eigenvector of the covariance matrix of . Next we split into two sub-partitions and based on the sign of , i.e, sample of is assigned to if and to otherwise.

Out-of-sample Projection

The projection operator can be estimated using a training dataset , and then applied to test dataset without retraining to obtain predictive values and a predictive MSE. This out of sample MSE provides a useful measure of performance not subject to over-fitting.

Below we validate the performance of local SPCA using a synthetic and a real data example. In each case, we compare our performance with that of local PCA. For PCA we use the same partitioning scheme. We compare PCA and SPCA with respect to the rate of decrease in with number of partitions. For the first example where , we depict the plots of estimated manifolds by representing the different partitions in different colors.

Example 1: Euler Spiral

We generate training and test samples from manifold:

The Euler spiral is a common example in the manifold learning literature with curvature linear with respect to the arc length parameter, that is, for all . Figure 1(c) shows the comparative performance of local PCA and SPCA with respect to the number of partitions vs MSE. Clearly SPCA has much better performance than PCA, as it requires only 14 partitions to achieve an of about , while PCA requires 120 partitions to achieve a similar error. Figure 1(a) and (b) show the projected test dataset with different partitions described in different colors. It is clear that there are fewer pieces of circles than tangent lines and the estimated is smoother in the second panel, reflecting better approximation by SPCA.

Figure 1: (a) local linear projection of Euler spiral data by local PCA; (b) locally spherical projection of Euler spiral data by local SPCA; (c) logplot of MSE vs. number of pieces for Euler spiral data; (d) logplot of MSE vs. number of pieces for Seals dataset.

Example 2: Seals dataset

Data are available in R-package . This dataset has attributes, and samples. We choose samples randomly as the training set and the remaining samples as the test set and set . Figure 1(d) shows the comparative performance of PCA and SPCA. From the figure it is clear that for the same number of partitions, the MSE of SPCA is much lower than PCA.

The above examples show that local SPCA has smaller MSE than local PCA given the same number of partitions. Equivalently, given a fixed error , the number of spheres needed to approximate the manifold is smaller than that of hyperplanes. This coincides with the statement of Theorem 1. When training sample size is small to moderate, there will be limited data available per piece and local PCA will have high error regardless of the number of pieces.

4.2 Manifold denoising

Another important application of SPCA is manifold denoising, where based on a set of noisy samples one tries to approximate the underlying manifold. Manifold Blurring Mean Shift (MBMS) is designed for denoising data which are assumed to lie in some lower dimensional manifold. MBMS first shifts the original data toward their local mean. Then the shifted data are projected to the tangent space, approximating the manifold by standard PCA. There are several variants of MBMS but the basic idea is very similar. Removing the linear projection step, we obtain Gaussian Blurring Mean Shift (GBMS) which only averages the data using Gaussian weights. If we remove the averaging step, that is, do local linear projection only, the method is called Local Tangent Projection (LTP). Hence, MBMS is a combination of GBMS and LTP so we can expect that MBMS performs the best among the three methods. We modify MBMS by applying spherelets to provide a locally spherical projection instead of linear by replacing local PCA by local SPCA. We call the new algorithm Spherical Manifold Blurring Mean Shift (SMBMS) or Local Spherical Projection (LSP). The SMBMS algorithm is described below.

input : Data ; intrinsic dimension ; number of neighbors ; Gaussian bandwidth
output : Denoised data
1 for  do
2       Find , set of k-nearest neighbors of ;
3       Shift toward the Gaussian mean: ;
4       For , find spherelet by 4;
5       Calculate spherical projection
6 end for
Algorithm 2 Manifold denoising (SMBMS)

As before we consider a synthetic and a real data example. We depict the original and denoised data in each case.

Example 3: Noisy Spiral

We generate samples from the spiral with equation:

Then we add white noise to get the noisy spiral. We run both MBMS and SMBMS with the same tuning parameters. We tuned the parameters manually to find the best combination. Here the neighborhood size

is and the Gaussian bandwidth is .

Figure 2: Noisy spiral. (a) clean data; (b) noisy data; (c) denoised data by MBMS; (d) denoised data by SMBMS.

Figure 2(a) is the data without noise, a clean spiral, while (b) is the noisy spiral. Panel (c) is the denoised spiral by MBMS, which is much cleaner than panel 2. However, the denoised spiral by SMBMS, shown in (d), is even closer to the true one.

Example 4: USPS Digit Dataset

This is one of the standard datasets for handwritten digit recognition. Each point in the data set is an image containing a handwritten digit in pixels. The data are noisy and contain many images that are hard to read. We run both MBMS and SMBMS on this dataset and present the result in Figure 3.

Figure 3: USPS Hand written digits images. The first row: original data; the second row: denoised data by MBMS; the third row: denoised data by SMBMS.

The first row in Figure 3 contains original noisy images. The second and third rows are the denoised images obtained by MBMS and SMBMS, respectively. Clearly SMBMS outperforms MBMS, as the noise level is much lower in the third row.

Another advantage of spherical MBMS is that the algorithm is more stable than MBMS when the Gaussian bandwidth varies, according to our observation on many numerical experiments. The reason might be the fact that spherelets can approximate the manifold locally with smaller error than PCA. Thus the spherical projection is more stable than linear projection when the data are noisy.

4.3 Data Visualization

Data visualization is an increasingly popular area due to the need for tools for understanding structure in multivariate data. When the dimension is greater than it’s hard to visualize the data.

The algorithm having the best performance in our experiments is tSNE (Maaten and Hinton, 2008)

. tSNE is focused on minimizing the Kullback-Leibler divergence between a discrete density determined by the original data through a Gaussian kernel and another density determined by the projected data through a t-distributed kernel. The first step is to obtain a pairwise distance matrix between data points relying on Euclidean distance. We assume that the manifold can be approximated locally by a sphere, so the spherical distance is better than the Euclidean distance as an estimator of the geodesic distance. In general, for a sphere centered at

with radius , the distance between two points has an analytic form:

As a result, we can replace the Euclidean distance in the pairwise distance matrix by this spherical distance to obtain a better estimation of the geodesic distance locally. One can then use this modified pairwise distance matrix in any algorithm. As an example, we compare the original tSNE with spherical tSNE (S-tSNE).

input : Data ; intrinsic dimension ; dimension of the projected space ; number of neighbors ; Gaussian bandwidth
output : Projected data
1 for  do
2       Find , set of -nearest neighbors of ;
3       Do SPCA for to find by 4;
4       for  do
5             ;
7       end for
9 end for
11 ;
12 , where , ;
Algorithm 3 S-tSNE

Example 5: Iris dataset

This is a well known data set created by Fisher Fisher (1936). This dataset has attributes, and

labeled samples classified into three groups. Figure 4 shows the 2 dimensional projection given by tSNE and S-tSNE with two different parameters

. The first advantage of S-tSNE is that there are fewer overlapping points across classes than tSNE, for both and . The second advantage is that S-tSNE is more stable as the tuning parameter varies. This relative stability can be seen not only in data visualization but also in manifold denoising, as discussed in the previous subsection.

Figure 4: Iris dataset. (a) tSNE, ; (b) S-tSNE, ; (c) tSNE, ; (d) S-tSNE, .

5 Discussion

There are several natural next directions building on the spherelets approach proposed in this article. The current version of spherelets is not constrained to be connected, so that the estimate of the manifold will in general be disconnected. We view this as an advantage in many applications, because it avoids restricting consideration to manifolds that have only one connected component and instead accommodates true disconnectedness. Nevertheless, in certain applications it is very useful to obtain a connected estimate; for example, when we have prior knowledge that the true manifold is connected and want to use this knowledge to improve statistical efficiency and produce a more visually appealing and realistic estimate. A typical case is in imaging when is 2 or 3 and is 1 or 2 and we are trying to estimate a known object from noisy data. In addition, by restricting the manifold to be connected and producing a connected estimation, we have the possibility of using to estimate geodesic distances between data points; a problem of substantial interest in the literature (Bernstein et al. (2000), Yang (2004), Meng et al. (2007), Meng et al. (2008)). Possibilities in terms of incorporating connectedness constraints include (a) producing an initial

using spherelets and then closing the gaps through linear interpolation; and (b) incorporating a continuity constraint directly into the objective function, to obtain essentially a type of higher dimensional analogue of splines.

Another important direction is to extend the approach to accommodate observed data that do not consist of

-dimensional Euclidean vectors but have a more complex form. For example, the original data may themselves be functions or surfaces or networks or may have a discrete form; e.g, consisting of high-dimensional binary vectors. There are two natural possibilities to account for more elaborate data. The first is to modify the spherelets algorithm to take as input a pairwise distance matrix in place of the original data, as already addressed in the data visualization case. The second is to develop a model-based version of spherelets in which we define a likelihood function for the data, which can be fitted from a frequentist or Bayesian perspective. We are currently considering the latter direction, through a mixture of spherelets model that includes von-Mises Fisher kernels on different spheres with Gaussian noise. Once we have such a generative probability model for the data, it becomes natural to include modifications to account for more elaborate data structures; e.g. exploiting the flexibility of the Bayesian hierarchical modeling framework. Such an approach also has the advantage of include uncertainty quantification in manifold learning and associated tasks.

An additional direction is improving the flexibility of the basis by further broadening the dictionary beyond simply pieces of spheres. Although one of the main advantages of spherelets is that we maintain much of the simplicity and computational tractability of locally linear bases, it is nonetheless intriguing to include additional flexibility in an attempt to obtain more concise representations of the data with fewer pieces. Possibilities we are starting to consider include the use of quadratic forms to obtain a higher order local approximation to the manifold and extending spheres to ellipses. In considering such extensions, there are major statistical and computational hurdles; the statistical challenge is maintaining parsimony, while the computational one is to obtain a simple and scalable algorithm. As a good compromise to clear both of these hurdles, one possibility is to start with spherelets and then perturb initial sphere estimates (e.g., to produce an ellipse) to better fit the data.

Finally, there is substantial interest in scaling up to very large

cases; the current algorithm will face problems in this regard similar to issues faced in applying usual PCA to high-dimensional data. To scale up spherelets, one can potentially leverage on scalable extensions of PCA, such as sparse PCA (

Zou et al. (2006), Johnstone and Lu (2009)). The availability of a very simple closed form solution to spherical PCA makes such extensions conceptually straightforward, but it remains to implement such approaches in practice and carefully consider appropriate asymptotic theory. In terms of theory, it is interesting to consider optimal rates of simultaneously estimating and the density of the data on (or close) to , including in cases in which is large and potentially increasing with sample size.


The authors acknowledge support for this research from the Office of Naval Research grant N000141712844 and the National Institutes of Health grants R01 CA 193655.


  • Allard et al. (2012) Allard, W. K., Chen, G., and Maggioni, M. (2012). Multi-scale geometric methods for data sets ii: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 32(3):435–462.
  • Andrews (1992) Andrews, D. W. (1992). Generic uniform convergence. Econometric theory, 8(2):241–257.
  • Belkin and Niyogi (2002) Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591.
  • Bernstein et al. (2000) Bernstein, M., De Silva, V., Langford, J. C., and Tenenbaum, J. B. (2000). Graph approximations to geodesics on embedded manifolds. Technical report, Technical report, Department of Psychology, Stanford University.
  • Beygelzimer et al. (2006) Beygelzimer, A., Kakade, S., and Langford, J. (2006). Cover trees for nearest neighbor. In

    Proceedings of the 23rd International Conference on Machine learning

    , pages 97–104. ACM.
  • Carreira-Perpiñán (2006) Carreira-Perpiñán, M. Á. (2006). Fast nonparametric clustering with gaussian blurring mean-shift. In Proceedings of the 23rd International Conference on Machine Learning, pages 153–160. ACM.
  • Carreira-Perpinán (2008) Carreira-Perpinán, M. A. (2008). Generalised blurring mean-shift algorithms for nonparametric clustering. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.
  • Chen and Maggioni (2011) Chen, G. and Maggioni, M. (2011). Multiscale geometric and spectral analysis of plane arrangements. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2825–2832. IEEE.
  • Coifman and Lafon (2006) Coifman, R. R. and Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30.
  • Demartines and Hérault (1997) Demartines, P. and Hérault, J. (1997).

    Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets.

    IEEE Transactions on Neural Networks, 8(1):148–154.
  • Donoho and Grimes (2003) Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596.
  • Fisher (1936) Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.
  • Hein and Maier (2007a) Hein, M. and Maier, M. (2007a). Manifold denoising. In Advances in Neural Information Processing Systems, pages 561–568.
  • Hein and Maier (2007b) Hein, M. and Maier, M. (2007b). Manifold denoising as preprocessing for finding natural representations of data. In AAAI, pages 1646–1649.
  • Hu et al. (2008) Hu, T.-C., Rosalsky, A., and Volodin, A. (2008). On convergence properties of sums of dependent random variables under second moment and covariance restrictions. Statistics and Probability Letters, 78(14):1999–2005.
  • Johnstone (2001) Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327.
  • Johnstone and Lu (2009) Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693.
  • Kambhatla and Leen (1997) Kambhatla, N. and Leen, T. K. (1997). Dimension reduction by local principal component analysis. Neural Computation, 9(7):1493–1516.
  • Karypis and Kumar (1998) Karypis, G. and Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392.
  • Lawrence (2005) Lawrence, N. (2005). Probabilistic non-linear principal component analysis with gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov):1783–1816.
  • Lee and Verleysen (2007) Lee, J. A. and Verleysen, M. (2007). Nonlinear dimensionality reduction. Springer Science & Business Media.
  • Lee et al. (2010) Lee, S., Zou, F., and Wright, F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Annals of Statistics, 38(6):3605.
  • Liao and Maggioni (2016) Liao, W. and Maggioni, M. (2016). Adaptive geometric multiscale approximations for intrinsically low-dimensional data. arXiv preprint arXiv:1611.01179.
  • Liao et al. (2016) Liao, W., Maggioni, M., and Vigogna, S. (2016). Learning adaptive multiscale approximations to data and functions near low-dimensional sets. In Information Theory Workshop (ITW), 2016 IEEE, pages 226–230. IEEE.
  • Maaten and Hinton (2008) Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
  • Maggioni et al. (2016) Maggioni, M., Minsker, S., and Strawn, N. (2016). Multiscale dictionary learning: non-asymptotic bounds and robustness. Journal of Machine Learning Research, 17.
  • Meng et al. (2008) Meng, D., Leung, Y., Xu, Z., Fung, T., and Zhang, Q. (2008). Improving geodesic distance estimation based on locally linear assumption. Pattern Recognition Letters, 29(7):862–870.
  • Meng et al. (2007) Meng, D., Xu, Z., Gu, N., and Dai, M. (2007). Estimating geodesic distances on locally linear patches. In Signal Processing and Information Technology, 2007 IEEE International Symposium on, pages 851–854. IEEE.
  • Newey (1991) Newey, W. K. (1991). Uniform convergence in probability and stochastic equicontinuity. Econometrica: Journal of the Econometric Society, pages 1161–1167.
  • Roweis and Saul (2000) Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.
  • Schölkopf et al. (1998) Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.
  • Scholz et al. (2005) Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., and Selbig, J. (2005). Non-linear pca: a missing data approach. Bioinformatics, 21(20):3887–3895.
  • Szlam (2009) Szlam, A. (2009). Asymptotic regularity of subdivisions of euclidean domains by iterated pca and iterated 2-means. Applied and Computational Harmonic Analysis, 27(3):342–350.
  • Tenenbaum et al. (2000) Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.
  • Walder and Schölkopf (2009) Walder, C. and Schölkopf, B. (2009). Diffeomorphic dimensionality reduction. In Advances in Neural Information Processing Systems, pages 1713–1720.
  • Wang and Carreira-Perpinán (2010) Wang, W. and Carreira-Perpinán, M. A. (2010). Manifold blurring mean shift algorithms for manifold denoising. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1759–1766. IEEE.
  • Wang et al. (2011) Wang, W., Carreira-Perpinán, M. A., and Lu, Z. (2011). A denoising view of matrix completion. In Advances in Neural Information Processing Systems, pages 334–342.
  • Xianxi et al. (2017) Xianxi, L., Li, S., Guoquan, L., Menghua, X., and Wei, W. (2017). Nonlinear system monitoring with piecewise performed principal component analysis. In Control Conference (CCC), 2017 36th Chinese, pages 9665–9669. IEEE.
  • Yang (2004) Yang, L. (2004). K-edge connected neighborhood graph for geodesic distance estimation and nonlinear data projection. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 1, pages 196–199. IEEE.
  • Zemel and Carreira-Perpiñán (2005) Zemel, R. S. and Carreira-Perpiñán, M. Á. (2005). Proximity graphs for clustering and manifold learning. In Advances in Neural Information Processing Systems, pages 225–232.
  • Zhang and Wang (2007) Zhang, Z. and Wang, J. (2007). Mlle: Modified locally linear embedding using multiple weights. In Advances in Neural Information Processing Systems, pages 1593–1600.
  • Zou et al. (2006) Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286.

Appendix A: Covering Number

Curves (d=1)

Throughout this section, is a curve with input , the arc length parameter. Let be fixed. Let be the curvature at point .

Proposition 3.

Let be the tangent line of at point , then is the unique line such that , so

where .


This is a standard result so we will not prove it in this paper. ∎

Corollary 4.

Let , then there exists tangent lines at points such that , s.t. .


Let , then by Proposition 3, , so if , . Then start from , repeat this process to find tangent lines, so . ∎

Proposition 4.

Let be the osculating circle of at point . Then if , is the unique circle such that , so

where .

Note 1.

Proposition 4 holds only when the osculating circle is non degenerate. If the curvature of at is , the osculating circle degenerates to tangent line . In this case, Proposition 3 applies.


The osculating circle has radius and center , where is the unit normal vector. Let be the Frenet frame, where . Without loss of generality, assume and . Under the Frenet frame, we can rewrite the osculating circle as

The Taylor expansion for can be written as

where , so. As a result,

As a result,

Now we prove the uniqueness. Observe the first entry of :

which means is the osculating circle. ∎

Corollary 5.

Let , then there exists osculating circles at points such that , s.t. .


Let , then by Proposition 4, , so if , . Then start from , repeat this process to find osculating circles, so . ∎

Surfaces (d=2)

Throughout this section, is a regular surface parametrized by where is a compact subset of . Let be the directional curvature, that is, is the curvature in direction , is the unit sphere bundle of . Without loss of generality, assume is fixed.

Proposition 5.

Letting be the tangent plane of at point , is the unique plane such that , so